Online Convex Optimization with Unbounded Memory

Raunak Kumar (Department of Computer Science, Cornell University, Ithaca, NY 14853; raunak@cs.cornell.edu), Sarah Dean (Department of Computer Science, Cornell University, Ithaca, NY 14853; sdean@cornell.edu), Robert Kleinberg (Department of Computer Science, Cornell University, Ithaca, NY 14853; rdk@cs.cornell.edu)

Abstract

Online convex optimization (OCO) is a widely used framework in online learning. In each round, the learner chooses a decision in a convex set and an adversary chooses a convex loss function, and then the learner suffers the loss associated with their current decision. However, in many applications the learner's loss depends not only on the current decision but on the entire history of decisions until that point. The OCO framework and its existing generalizations do not capture this, and they can only be applied to many settings of interest after a long series of approximation arguments. They also leave open the question of whether the dependence on memory is tight because there are no non-trivial lower bounds. In this work we introduce a generalization of the OCO framework, Online Convex Optimization with Unbounded Memory, that captures long-term dependence on past decisions. We introduce the notion of p-effective memory capacity, $H_p$, that quantifies the maximum influence of past decisions on present losses. We prove an $O(\sqrt{H_p T})$ upper bound on the policy regret and a matching (worst-case) lower bound. As a special case, we prove the first non-trivial lower bound for OCO with finite memory [Anava et al., 2015], which could be of independent interest, and also improve existing upper bounds. We demonstrate the broad applicability of our framework by using it to derive regret bounds, and to improve and simplify existing regret bound derivations, for a variety of online learning problems including online linear control and an online variant of performative prediction.

1 Introduction

Numerous applications are characterized by multiple rounds of sequential interactions with an environment, e.g., prediction from expert advice [Littlestone and Warmuth, 1989, 1994], portfolio selection [Cover, 1991], routing [Awerbuch and Kleinberg, 2008], etc. One of the most popular frameworks for modelling such sequential decision-making problems is online convex optimization (OCO) [Zinkevich, 2003]. The OCO framework is as follows. In each round, the learner chooses a decision in a convex set and an adversary chooses a convex loss function, and then the learner suffers the loss associated with their current decision. The performance of an algorithm is measured by regret: the difference between the algorithm's total loss and that of the best fixed decision. We refer the reader to Shalev-Shwartz [2012], Hazan [2019], Orabona [2019] for surveys on this topic.

However, in many applications the loss of the learner depends not only on the current decision but on the entire history of decisions until that point. For example, in online linear control [Agarwal et al., 2019b], in each round the learner chooses a control policy (i.e., decision), suffers a loss that is a function of the action taken by this policy and the current state of the system, and the system's state evolves according to linear dynamics.
The current state depends on the entire history of actions and, therefore, the current loss depends not only on the current decision but on the entire history of decisions. The OCO framework cannot capture such long-term dependence of the current loss on the past decisions, and neither can existing generalizations that allow the loss to depend on a constant number of past decisions [Anava et al., 2015]. Although a series of approximation arguments can be used to apply finite memory generalizations of OCO to the online linear control problem, there is no OCO framework that captures the complete long-term dependence of current losses on past decisions. Furthermore, there are no non-trivial lower bounds for OCO in the memory setting,¹ which leaves open the question of whether the dependence on memory is tight.

Contributions. In this paper we introduce a generalization of the OCO framework, Online Convex Optimization with Unbounded Memory (Section 2), that allows the loss in the current round to depend on the entire history of decisions until that point. We introduce the notion of p-effective memory capacity, $H_p$, that quantifies the maximum influence of past decisions on present losses. We prove an $O(\sqrt{H_p T})$ upper bound on the policy regret (Theorem 3.1) and a matching (worst-case) lower bound (Theorem 3.2). As a special case, we prove the first non-trivial lower bound for OCO with finite memory (Theorem 3.4), which could be of independent interest, and also improve existing upper bounds (Theorem 3.3). Our lower bound technique extends existing techniques developed for memoryless settings. We design novel adversarial loss functions that exploit the fact that an algorithm cannot overwrite its history. We illustrate the power of our framework by bringing together the regret analysis of two seemingly disparate problems under the same umbrella. First, we show how our framework improves and simplifies existing regret bounds for the online linear control problem [Agarwal et al., 2019b] in Theorem 4.1. Second, we show how our framework can be used to derive regret bounds for an online variant of performative prediction [Perdomo et al., 2020] in Theorem 4.2. This demonstrates the broad applicability of our framework for deriving regret bounds for a variety of online learning problems, particularly those that exhibit long-term dependence of current losses on past decisions.

Related work. The most closely related work to ours is the OCO with finite memory framework [Anava et al., 2015]. They consider a generalization of the OCO framework that allows the current loss to depend on a constant number of past decisions. There have been a number of follow-up works that extend the framework in a variety of other ways, such as non-stationarity [Zhao et al., 2022], incorporating switching costs [Shi et al., 2020], etc. However, none of these existing works go beyond a constant memory length, and none of them prove a non-trivial lower bound with a dependence on the memory length. In a different line of work, Bhatia and Sridharan [2020] consider a much more general online learning framework that goes beyond a constant memory length, but they only provide non-constructive upper bounds on regret. In contrast, our OCO with unbounded memory framework allows the current loss to depend on an unbounded number of past decisions, provides constructive upper bounds on regret, and lower bounds for a broad class of problems that includes OCO with finite memory with a general memory length m.
A different framework for sequential decision-making is multi-armed bandits [Bubeck and Cesa-Bianchi, 2012, Slivkins, 2019]. Qin et al. [2023] study a variant of contextual stochastic bandits where the current loss can depend on a sparse subset of all prior contexts. This setting differs from ours due to the feedback model, stochasticity, and decision space. Reinforcement learning [Sutton and Barto, 2018] is yet another popular framework for sequential decision-making that considers very general state-action models of feedback and dynamics. In reinforcement learning one typically measures regret with respect to the best state-action policy from some policy class, rather than the best fixed decision as in online learning and OCO. In the special case of linear control, policies can be reformulated as decisions while preserving convexity; we discuss this application in Section 4. Considering the general framework is an active area of research. We defer discussion of related work for specific applications to Section 4.

¹ The trivial lower bound refers to the $\Omega(\sqrt{T})$ lower bound for OCO in the memoryless setting.

2 Framework

We begin with some motivation for the formalism used in our framework (Section 2.1). Many real-world applications involve controlling a physical dynamical system, for example, variable-speed wind turbines in wind energy electric power production [Boukhezzar and Siguerdidjane, 2010]. The typical solution for these problems has been to model them as offline control problems with linear time-invariant dynamics and use classical methods such as LQR and LQG [Boukhezzar and Siguerdidjane, 2010]. Instead of optimizing over the space of control inputs, the typical feedback control approach optimizes over the space of controllers, i.e., policies that choose a control input as a function of the system state. The standard controllers considered in the literature are linear controllers. Even when the losses are convex in the state and input, they are nonconvex in the linear controller. In the special case of quadratic losses in terms of the state and input, there is a closed-form solution for the optimal controller using the algebraic Riccati equations [Lancaster and Rodman, 1995]. But this does not hold for general convex losses, resulting in convex reparameterizations such as Youla [Youla et al., 1976, Kučera, 1975] and SLS [Wang et al., 2019, Anderson et al., 2019]. The resulting parameterization represents an infinite-dimensional system response and is characterized by a sequence of matrices. Recent work has studied an online approach for some of these control theory problems, where a sequence of controllers is chosen adaptively rather than choosing one offline [Abbasi-Yadkori and Szepesvári, 2011, Dean et al., 2018, Simchowitz and Foster, 2020, Agarwal et al., 2019b].

The takeaway from the above is that there are online learning problems in which (i) the current loss depends on the entire history of decisions; and (ii) the decision space can be more complicated than just a subset of $\mathbb{R}^d$, e.g., it can be an unbounded sequence of matrices. This motivates us to model the decision space as a Hilbert space and the history space as a Banach space in the formal problem setup below, and this subsumes the special cases of OCO and OCO with finite memory. This formalism not only lets us consider a wide range of spaces, such as $\mathbb{R}^d$, unbounded sequences of matrices, etc., but also lets us define appropriate norms on these spaces.
This latter feature is crucial for deriving strong regret bounds for some applications such as online linear control. For this problem we derive improved regret bounds (Theorem 4.1) by defining weighted norms on the decision and history spaces, where the weights are chosen to leverage the problem structure.

Notation. We use $\|\cdot\|_{\mathcal{U}}$ to denote the norm associated with a space $\mathcal{U}$. The operator norm for a linear operator $L$ from space $\mathcal{U}$ to $\mathcal{V}$ is defined as $\|L\|_{\mathcal{U}\to\mathcal{V}} = \max_{u : \|u\|_{\mathcal{U}} \le 1} \|Lu\|_{\mathcal{V}}$. For convenience, sometimes we simply use $\|\cdot\|$ when the meaning is clear from the context. For a finite-dimensional matrix we use $\|\cdot\|_F$ and $\|\cdot\|_2$ to denote its Frobenius norm and operator norm respectively.

Let the decision space $\mathcal{X}$ be a closed and convex subset of a Hilbert space $\mathcal{W}$ with norm $\|\cdot\|_{\mathcal{X}}$, and let the history space $\mathcal{H}$ be a Banach space with norm $\|\cdot\|_{\mathcal{H}}$. Let $A : \mathcal{H} \to \mathcal{H}$ and $B : \mathcal{W} \to \mathcal{H}$ be linear operators. The game between the learner and an oblivious adversary proceeds as follows. Let $T$ denote the time horizon and $f_t : \mathcal{H} \to \mathbb{R}$ be loss functions chosen by the adversary. The initial history is $h_0 = 0$. In each round $t \in [T]$, the learner chooses $x_t \in \mathcal{X}$, the history is updated to $h_t = A h_{t-1} + B x_t$, and the learner suffers loss $f_t(h_t)$. An instance of an online convex optimization with unbounded memory problem is specified by the tuple $(\mathcal{X}, \mathcal{H}, A, B)$.

We use the notion of policy regret [Dekel et al., 2012] as the performance measure in our framework. The policy regret of a learner is the difference between its total loss and the total loss of a strategy that plays the best fixed decision in every round. The history after round $t$ for a strategy that chooses $x$ in every round is $h_t = \sum_{k=0}^{t-1} A^k B x$, which motivates the following definition.

Definition 2.1. Given $f_t : \mathcal{H} \to \mathbb{R}$, the function $\tilde{f}_t : \mathcal{X} \to \mathbb{R}$ is defined by $\tilde{f}_t(x) = f_t\big(\sum_{k=0}^{t-1} A^k B x\big)$.

Definition 2.2 (Policy Regret). The policy regret of an algorithm $\mathcal{A}$ is defined as $R_T(\mathcal{A}) = \sum_{t=1}^{T} f_t(h_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \tilde{f}_t(x)$.

In many motivating examples such as online linear control (Section 4.1), the history at the end of a round is a sequence of linear transformations of past decisions. The following definition captures this formally and we leverage this structure to prove stronger regret bounds (Theorem 3.1).

Definition 2.3 (Linear Sequence Dynamics). Consider an online convex optimization with unbounded memory problem specified by $(\mathcal{X}, \mathcal{H}, A, B)$. Let $(\xi_k)_{k=0}^{\infty}$ be a sequence of nonnegative real numbers satisfying $\xi_0 = 1$. We say that $(\mathcal{X}, \mathcal{H}, A, B)$ follows linear sequence dynamics with the $\xi$-weighted $p$-norm for $p \ge 1$ if

1. $\mathcal{H}$ is the $\xi$-weighted $\ell_p$-direct sum of a finite or countably infinite number of copies of $\mathcal{W}$: every element $y \in \mathcal{H}$ is a sequence $y = (y_i)_{i \in I}$, where $I = \mathbb{N}$ or $I = \{0, \ldots, n\}$ for some $n \in \mathbb{N}$, and $\|y\|_{\mathcal{H}} = \big(\sum_{i \in I} (\xi_i \|y_i\|)^p\big)^{1/p} < \infty$.

2. We have $A(y_0, y_1, \ldots) = (0, A_0 y_0, A_1 y_1, \ldots)$, where $A_i : \mathcal{W} \to \mathcal{W}$ are linear operators.

3. The operator $B$ satisfies $B(x) = (x, 0, \ldots)$.

Note that since the norm on $\mathcal{H}$ depends on the weights $\xi$, the operator norm $\|A^k\|$ also depends on $\xi$. If the weights are all equal to 1, then we simply say $p$-norm instead of $\xi$-weighted $p$-norm.
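To make the history dynamics concrete, the following is a minimal sketch (assuming NumPy) that instantiates $A$ and $B$ as explicit matrices for a small finite-dimensional example, namely the finite-memory special case discussed in Section 2.3, and checks the identity behind Definition 2.1: playing a fixed $x$ in every round produces $h_t = \sum_{k=0}^{t-1} A^k B x$. The matrix encoding is an illustrative choice, not part of the framework.

```python
import numpy as np

# A small finite-dimensional instance of the history dynamics h_t = A h_{t-1} + B x_t:
# decisions live in R^d and the history stacks the last m decisions (the finite-memory
# special case of Section 2.3), so A is a block shift and B inserts the newest decision.
d, m, t = 2, 4, 7

A = np.zeros((m * d, m * d))
for i in range(m - 1):  # (y_0, ..., y_{m-1}) -> (0, y_0, ..., y_{m-2})
    A[(i + 1) * d:(i + 2) * d, i * d:(i + 1) * d] = np.eye(d)
B = np.vstack([np.eye(d), np.zeros(((m - 1) * d, d))])  # B x = (x, 0, ..., 0)

rng = np.random.default_rng(0)
x = rng.normal(size=d)  # a fixed decision played in every round

# Roll the dynamics forward for t rounds playing x each time.
h = np.zeros(m * d)
for _ in range(t):
    h = A @ h + B @ x

# The identity used in Definition 2.1: h_t = sum_{k=0}^{t-1} A^k B x.
h_closed_form = sum(np.linalg.matrix_power(A, k) @ B @ x for k in range(t))
assert np.allclose(h, h_closed_form)
print(h.reshape(m, d))  # the last m decisions, newest first (all equal to x here)
```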
2.2 Assumptions

We make the following assumptions about the feedback model and the loss functions.

A1 The learner knows the operators $A$ and $B$, and observes $f_t$ at the end of each round $t$.

A2 The operator norm of $B$ is at most 1, i.e., $\|B\| \le 1$.

A3 The functions $f_t$ are convex.

A4 The functions $f_t$ are $L$-Lipschitz continuous: for all $h, \tilde{h} \in \mathcal{H}$ and $t \in [T]$, we have $|f_t(h) - f_t(\tilde{h})| \le L \|h - \tilde{h}\|_{\mathcal{H}}$.

Regarding Assumption A1, our results easily extend to the case where instead of observing $f_t$, the learner receives a gradient $\nabla \tilde{f}_t(x_t)$ from a gradient oracle, which can be implemented using knowledge of $f_t$ and the dynamics $A$ and $B$. Handling the cases when the operators $A$ and $B$ are unknown and/or the learner observes bandit feedback (i.e., only $f_t(h_t)$) are important problems and we leave them as future work. Note that our assumption that $A$ and $B$ are known is no more restrictive than in the existing literature on OCO with finite memory [Anava et al., 2015], where it is assumed that the learner knows the constant memory length. In fact, our assumption is more general because our framework not only captures constant memory length as a special case but allows for richer dynamics, as we illustrate in Section 4. Assumption A2 is made for convenience, and it amounts to a rescaling of the problem. Assumption A3 can be replaced by the weaker assumption that $\tilde{f}_t$ are convex (similar to the literature on OCO with finite memory [Anava et al., 2015]) and this is what we use in the rest of the paper. Assumptions A1 and A4 imply that $\tilde{f}_t$ are $\tilde{L}$-Lipschitz continuous for the following $\tilde{L}$.

Theorem 2.1. Consider an online convex optimization with unbounded memory problem specified by $(\mathcal{X}, \mathcal{H}, A, B)$. If $f_t$ is $L$-Lipschitz continuous, then $\tilde{f}_t$ is $\tilde{L}$-Lipschitz continuous for $\tilde{L} \le L \sum_{k=0}^{\infty} \|A^k\|$. If $(\mathcal{X}, \mathcal{H}, A, B)$ follows linear sequence dynamics with the $\xi$-weighted $p$-norm for $p \ge 1$, then $\tilde{L} \le L \big(\sum_{k=0}^{\infty} \|A^k\|^p\big)^{1/p}$.

The proof follows from the definitions of $\tilde{f}_t$ and $\|\cdot\|_{\mathcal{H}}$, and we defer it to Appendix A. The above bound is tighter than similar results in the literature on OCO with finite memory and online linear control. This theorem is a key ingredient, amongst others, in improving existing upper bounds on regret for OCO with finite memory (Theorem 3.3) and for online linear control (Theorem 4.1).

Before presenting our final assumption we introduce the notion of p-effective memory capacity, which quantifies the maximum influence of past decisions on present losses.

Definition 2.4 (p-Effective Memory Capacity). Consider an online convex optimization with unbounded memory problem specified by $(\mathcal{X}, \mathcal{H}, A, B)$. For $p \ge 1$, the p-effective memory capacity is defined as $H_p(\mathcal{X}, \mathcal{H}, A, B) = \big(\sum_{k=0}^{\infty} k^p \|A^k\|^p\big)^{1/p}$. When the meaning is clear from the context we simply use $H_p$ instead.

The p-effective memory capacity is an upper bound on the difference in histories for two sequences of decisions whose difference grows at most linearly with time. To see this, consider two sequences of decisions, $(x_k)$ and $(\tilde{x}_k)$, whose elements differ by no more than $k$ at time $k$: $\|x_k - \tilde{x}_k\|_{\mathcal{X}} \le k$. Then the difference between the histories generated by the two sequences is bounded as $\|h - \tilde{h}\|_{\mathcal{H}} = \big\|\sum_{k} A^k B (x_k - \tilde{x}_k)\big\|_{\mathcal{H}} \le \sum_{k} k \|A^k\| = H_1$, where the last inequality follows from Assumption A2. A similar bound holds with $H_p$ instead when $(\mathcal{X}, \mathcal{H}, A, B)$ follows linear sequence dynamics with the $\xi$-weighted $p$-norm.

A5 The 1-effective memory capacity is finite, i.e., $H_1 < \infty$.

Since $H_p$ is decreasing in $p$, $H_1 < \infty$ implies $H_p < \infty$ for all $p \ge 1$. For the case of linear sequence dynamics with the $\xi$-weighted $p$-norm it suffices to make the weaker assumption that $H_p < \infty$. However, for simplicity of exposition, we assume that $H_1 < \infty$.
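As a worked example of Definition 2.4, the p-effective memory capacity can be evaluated in closed form for the two special cases introduced in Section 2.3 below. The operator norms used here ($\|A_{\mathrm{finite},m}^k\| = 1$ for $k < m$ and $0$ otherwise, and $\|A_{\mathrm{infinite},\rho}^k\| = \rho^k$, both with unweighted norms) follow from the definitions of those operators; the asymptotic simplifications are ours.

```latex
% OCO with finite memory of length m (unweighted p-norm):
H_p = \Big(\sum_{k=0}^{m-1} k^p\Big)^{1/p} = \Theta\big(m^{(p+1)/p}\big),
\qquad \text{e.g. } H_2 = \Theta\big(m^{3/2}\big).

% OCO with rho-discounted infinite memory (unweighted p-norm):
H_p = \Big(\sum_{k=0}^{\infty} k^p \rho^{kp}\Big)^{1/p} = O\big((1-\rho)^{-(p+1)/p}\big),
\qquad \text{e.g. } H_2 = O\big((1-\rho)^{-3/2}\big).
```

For $p = 2$ these give the $m^{3/2}$ and, up to constants, $(1-\rho)^{-3/2}$ factors that enter the specialized regret bounds later in the paper.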
2.3 Special Cases

OCO with Finite Memory. Consider the OCO with finite memory problem with constant memory length $m$. It can be specified in our framework by $(\mathcal{X}, \mathcal{H}, A_{\mathrm{finite},m}, B_{\mathrm{finite},m})$, where $\mathcal{H}$ is the $\ell_2$-direct sum of $m$ copies of $\mathcal{X}$, $A_{\mathrm{finite},m}(x^{[m]}, \ldots, x^{[1]}) = (0, x^{[m]}, \ldots, x^{[2]})$, and $B_{\mathrm{finite},m}(x) = (x, 0, \ldots, 0)$. Note that $(\mathcal{X}, \mathcal{H}, A_{\mathrm{finite},m}, B_{\mathrm{finite},m})$ follows linear sequence dynamics with the 2-norm. Our framework can even model an extension where the problem follows linear sequence dynamics with the $p$-norm for $p \ge 1$ by simply defining $\mathcal{H}$ to be the $\ell_p$-direct sum of $m$ copies of $\mathcal{X}$.

OCO with ρ-discounted Infinite Memory. Our framework can also model OCO with infinite memory problems that are not modelled by existing OCO frameworks. Let $\rho \in (0, 1)$ be the discount factor and $p \ge 1$. An OCO with ρ-discounted infinite memory problem is specified by $(\mathcal{X}, \mathcal{H}, A_{\mathrm{infinite},\rho}, B_{\mathrm{infinite},\rho})$, where $\mathcal{H}$ is the $\ell_p$-direct sum of countably many copies of $\mathcal{X}$, $A_{\mathrm{infinite},\rho}((y_0, y_1, \ldots)) = (0, \rho y_0, \rho y_1, \ldots)$, and $B_{\mathrm{infinite},\rho}(x) = (x, 0, \ldots)$. Note that $(\mathcal{X}, \mathcal{H}, A_{\mathrm{infinite},\rho}, B_{\mathrm{infinite},\rho})$ follows linear sequence dynamics with the $p$-norm. Due to space constraints we defer proofs of regret bounds for this problem to the appendix.

3 Regret Analysis

We present two algorithms for choosing the decisions $x_t$. Algorithm 1 uses follow-the-regularized-leader (FTRL) [Shalev-Shwartz and Singer, 2006, Abernethy et al., 2008] on the loss functions $\tilde{f}_t$. Due to space constraints, we discuss how to implement it efficiently in Appendix G and present simple simulation experiments in Appendix I. Algorithm 2, which we only present in Appendix H, combines FTRL with a mini-batching approach [Dekel et al., 2012, Altschuler and Talwar, 2018, Chen et al., 2020] to additionally guarantee that the decisions switch at most $O(T \tilde{L} / (L H_1))$ times. We defer the proofs of the following upper and lower bounds to Appendices C and D respectively.

Algorithm 1: FTRL
Input: Time horizon $T$, step size $\eta$, $\alpha$-strongly-convex regularizer $R : \mathcal{X} \to \mathbb{R}$.
1 Initialize history $h_0 = 0$.
2 for $t = 1, 2, \ldots, T$ do
3   Learner chooses $x_t \in \arg\min_{x \in \mathcal{X}} \sum_{s=1}^{t-1} \tilde{f}_s(x) + \frac{R(x)}{\eta}$.
4   Set $h_t = A h_{t-1} + B x_t$.
5   Learner suffers loss $f_t(h_t)$ and observes $f_t$.

Theorem 3.1. Consider an online convex optimization with unbounded memory problem specified by $(\mathcal{X}, \mathcal{H}, A, B)$. Let the regularizer $R : \mathcal{X} \to \mathbb{R}$ be $\alpha$-strongly-convex and satisfy $|R(x) - R(\tilde{x})| \le D$ for all $x, \tilde{x} \in \mathcal{X}$. Algorithm 1 with step-size $\eta$ satisfies $R_T(\mathrm{FTRL}) \le \frac{D}{\eta} + \frac{\eta T \tilde{L} (L H_1 + \tilde{L})}{\alpha}$. If $\eta = \sqrt{\frac{\alpha D}{T \tilde{L} (L H_1 + \tilde{L})}}$, then $R_T(\mathrm{FTRL}) \le O\Big(\sqrt{\tfrac{D}{\alpha}\, T \tilde{L} (L H_1 + \tilde{L})}\Big)$. When $(\mathcal{X}, \mathcal{H}, A, B)$ follows linear sequence dynamics with the $\xi$-weighted $p$-norm, then all of the above hold with $H_p$ instead of $H_1$.

The proof of this theorem involves writing the regret as $\sum_t f_t(h_t) - \tilde{f}_t(x_t) + \sum_t \tilde{f}_t(x_t) - \tilde{f}_t(x^\star)$, and bounding the first term using the p-effective memory capacity and the second term using FTRL. We defer the full proof to Appendix C. The following lower bound shows that this is tight in the worst-case.

Theorem 3.2. There exists an instance of the online convex optimization with unbounded memory problem, $(\mathcal{X}, \mathcal{H}, A, B)$, that follows linear sequence dynamics with the $\xi$-weighted $p$-norm, and there exist $L$-Lipschitz continuous loss functions $\{f_t : \mathcal{H} \to \mathbb{R}\}_{t=1}^{T}$ such that the regret of any algorithm $\mathcal{A}$ satisfies $R_T(\mathcal{A}) \ge \Omega\Big(\sqrt{T \tilde{L} (L H_p + \tilde{L})}\Big)$.

The proof of this theorem follows from our lower bound for the special case of OCO with finite memory (Theorem 3.4), which we discuss in detail below. However, as we show in Appendix D, the lower bound holds for a much broader class of problems.

Specialization to OCO with finite memory. Now we show how our bounds specialize to the special case of OCO with finite memory (Section 2.3). Due to space constraints, we defer the specialization of the upper and lower bounds for OCO with ρ-discounted infinite memory (Section 2.3) to Appendix C and Appendix D respectively.
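As a concrete illustration of Algorithm 1, the following is a minimal sketch of FTRL on the surrogate losses $\tilde{f}_t$ for the finite-memory special case; the quadratic regularizer, the box-shaped decision set, the random linear losses, and the use of a generic numerical solver for the arg min step are illustrative assumptions rather than choices prescribed by the algorithm.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of Algorithm 1 (FTRL) for OCO with finite memory: decisions in [-1, 1]^d,
# history = last m decisions (newest first), losses f_t(h) = sum_k <G[t, k], h_k>
# chosen at random here in place of an adversary, and R(x) = ||x||^2 / 2 (1-strongly convex).
d, m, T, eta = 2, 3, 200, 0.05
rng = np.random.default_rng(1)
G = rng.normal(size=(T, m, d))

def f(t, hist):                      # loss of the actual history (a list of <= m decisions)
    return sum(G[t, k] @ hist[k] for k in range(len(hist)))

def f_tilde(t, x):                   # surrogate loss: history generated by playing x forever
    return sum(G[t, k] @ x for k in range(m))

def ftrl_step(t):                    # x_t in argmin_x sum_{s<t} f~_s(x) + R(x)/eta
    obj = lambda x: sum(f_tilde(s, x) for s in range(t)) + 0.5 * np.dot(x, x) / eta
    return minimize(obj, np.zeros(d), bounds=[(-1.0, 1.0)] * d).x

hist, total_loss = [], 0.0
for t in range(T):
    x_t = ftrl_step(t)
    hist = ([x_t] + hist)[:m]        # history update h_t = A h_{t-1} + B x_t (shift and insert)
    total_loss += f(t, hist)

best_fixed = minimize(lambda x: sum(f_tilde(t, x) for t in range(T)),
                      np.zeros(d), bounds=[(-1.0, 1.0)] * d).fun
print("policy regret:", total_loss - best_fixed)
```

Because the surrogate losses here are linear, the running sum inside the arg min could equivalently be maintained as a single cumulative gradient; efficiency considerations for the general case are deferred to Appendix G.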
Theorem 3.3. Consider an online convex optimization with finite memory problem with constant memory length $m$ specified by $(\mathcal{X}, \mathcal{H} = \mathcal{X}^m, A_{\mathrm{finite},m}, B_{\mathrm{finite},m})$. Let the regularizer $R : \mathcal{X} \to \mathbb{R}$ be $\alpha$-strongly-convex and satisfy $|R(x) - R(\tilde{x})| \le D$ for all $x, \tilde{x} \in \mathcal{X}$. Algorithm 1 with step-size $\eta = \sqrt{\frac{\alpha D}{T \tilde{L} (L m^{3/2} + \tilde{L})}}$ satisfies $R_T(\mathrm{FTRL}) \le O\Big(\sqrt{\tfrac{D}{\alpha}\, T \tilde{L} (L m^{3/2} + \tilde{L})}\Big)$.

This follows from using the definition of $A_{\mathrm{finite},m}$ to bound $\tilde{L}$ and $H_2$ in Theorem 3.1. This improves existing results [Anava et al., 2015] by a factor of $m^{1/4}$. Our bound depends on the Lipschitz continuity constants as $\sqrt{L \tilde{L}}$ whereas existing bounds depend on them as $\tilde{L}$, and $\tilde{L}$ can be as large as $\sqrt{m}\, L$ (Theorem 2.1). We defer the full proof to Appendix C and a detailed comparison with existing results to Appendix C.1. The following lower bound shows that this is tight in the worst-case.

Theorem 3.4. There exists an instance of the online convex optimization with finite memory problem with constant memory length $m$, $(\mathcal{X}, \mathcal{H} = \mathcal{X}^m, A_{\mathrm{finite},m}, B_{\mathrm{finite},m})$, and there exist $L$-Lipschitz continuous loss functions $\{f_t : \mathcal{H} \to \mathbb{R}\}_{t=1}^{T}$ such that the regret of any algorithm $\mathcal{A}$ satisfies $R_T(\mathcal{A}) \ge \Omega\Big(\sqrt{T \tilde{L} (L m^{3/2} + \tilde{L})}\Big)$.

To the best of our knowledge, this is the first non-trivial lower bound for OCO with finite memory with an explicit dependence on the memory length $m$. Our construction involves three main steps, the first two of which are loosely based on Altschuler and Talwar [2018]. First, divide time into $N = T/m$ blocks of size $m$. (For simplicity, assume $T$ is a multiple of $m$.) Second, sample a random sign $\epsilon_n$ for each block $n \in [N]$. Third, for $t > m$ choose
$$f_t(h_t) = \underbrace{\epsilon_{\lceil t/m \rceil}}_{(a)} \cdot \underbrace{\tfrac{L}{\sqrt{m}}}_{(b)} \cdot \underbrace{\big(x_{t-m+1} + \cdots + x_{m(\lceil t/m \rceil - 1) + 1}\big)}_{(c)},$$
where term (a) is the random sign $\epsilon_n$ sampled for the block $n = \lceil t/m \rceil$ that $t$ belongs to, term (b) is a scaling factor chosen while respecting the Lipschitz continuity constraint, and term (c) is a sum over a subset of past decisions. Two important features of this construction are: (i) a random sign is sampled for each block rather than each round; and (ii) the loss in round $t$ depends on the history of decisions up to and including the first round of the block that $t$ belongs to. These exploit the fact that an algorithm cannot overwrite its history and penalize it for its past decisions even after it observes the random sign $\epsilon_n$ for the current block. (See Fig. 1 for an illustration.)

Existing lower bound proofs for OCO sample a random sign in each round and choose $f_t(x_t) \propto \epsilon_t x_t$. A first attempt at extending this to the OCO with finite memory setting would be to choose $f_t(h_t) \propto \epsilon_t \sum_{k=0}^{m-1} x_{t-k}$. However, in contrast to our approach, this does not exploit the fact that an algorithm cannot overwrite its history and does not suffice for obtaining a matching lower bound.

Figure 1: An illustration of the loss functions $f_t$ for the OCO with finite memory lower bound with $T = 12$, $m = 3$, and $L = 1$. A new sign $\epsilon_n$ is sampled every $m$ rounds; for example, $f_4(h_4) = \epsilon_2\, 3^{-1/2} (x_2 + x_3 + x_4)$, $f_5(h_5) = \epsilon_2\, 3^{-1/2} (x_3 + x_4)$, and $f_6(h_6) = \epsilon_2\, 3^{-1/2}\, x_4$.
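The construction can be made concrete with a few lines of code. The sketch below generates the block-structured adversarial losses for scalar decisions in $[-1, 1]$ and compares an arbitrary (illustrative) learner against the best fixed decision; the particular learner and the endpoint search for the comparator are not part of the proof.

```python
import numpy as np

# The adversarial losses from the finite-memory lower bound: time is split into blocks of
# length m, one random sign per block, and f_t charges the decisions x_{t-m+1}, ..., up to
# the first round of the block containing t, scaled by L / sqrt(m).
T, m, L = 1200, 3, 1.0
rng = np.random.default_rng(0)
N = T // m
eps = rng.choice([-1.0, 1.0], size=N)          # one sign per block (0-indexed)

def charged_rounds(t):
    n = (t - 1) // m                           # 0-indexed block containing round t
    first = n * m + 1                          # first round of that block
    return n, range(t - m + 1, first + 1)      # rounds whose decisions f_t charges

x = np.zeros(T + 1)                            # x[t] = decision in round t (x[0] unused)
learner_loss = 0.0
for t in range(1, T + 1):
    # an illustrative learner: play the sign that was good for the previous block
    x[t] = -eps[max((t - 1) // m - 1, 0)]
    if t > m:                                  # losses are only defined for t > m
        n, rounds = charged_rounds(t)
        learner_loss += eps[n] * (L / np.sqrt(m)) * sum(x[r] for r in rounds)

def fixed_decision_loss(z):
    total = 0.0
    for t in range(m + 1, T + 1):
        n, rounds = charged_rounds(t)
        total += eps[n] * (L / np.sqrt(m)) * len(rounds) * z
    return total

best = min(fixed_decision_loss(-1.0), fixed_decision_loss(1.0))  # linear in z
print("regret of this learner:", learner_loss - best)
# In expectation over the signs, no learner can avoid regret of order L * m * sqrt(T) here,
# because every decision is committed before the sign that multiplies it is revealed.
```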
Comparison of upper bound with prior work. The algorithmic ideas and analysis for our regret upper bound are influenced by Anava et al. [2015]. However, an important innovation in our work is the use of weighted norms in the case of linear sequence dynamics. This is a simple but powerful way of encoding prior knowledge about a problem, and allows us to derive non-trivial regret bounds in the case of unbounded-length histories. The technical complications that arise are captured in bounding the relevant quantities of interest, e.g., the Lipschitz constant $\tilde{L}$, the operator norm $\|A^k\|$, etc. Furthermore, using weighted norms even leads to improved regret bounds for some applications. Indeed, consider the application to online linear control with adversarial disturbances (Section 4.1). Our framework and upper bound applied to this problem (Theorem 4.1) improve upon the existing upper bound, which used a finite memory approximation. See Lemmas E.2 and E.6 for an illustration of the technical details involved when using weighted norms.

4 Applications

In this section we apply our framework to online linear control (Section 4.1) and online performative prediction (Section 4.2). We defer expanded details and proofs to Appendices E and F respectively.

4.1 Online Linear Control

Background. Online linear control (OLC) is the problem of controlling a system with linear dynamics, adversarial disturbances, and adversarial and convex losses. It combines aspects from control theory and online learning. We refer the reader to Agarwal et al. [2019b] for more details. Here, we introduce the basic mathematical setup of the problem. Let $\mathcal{S} \subseteq \mathbb{R}^{d_s}$ and $\mathcal{U} \subseteq \mathbb{R}^{d_u}$ denote the state and control spaces. Let $s_t$ and $u_t$ denote the state and control at time $t$, with $s_0$ being the initial state. The system evolves according to the linear dynamics $s_{t+1} = F s_t + G u_t + w_t$, where $F \in \mathbb{R}^{d_s \times d_s}$ and $G \in \mathbb{R}^{d_s \times d_u}$ are matrices satisfying $\|F\|_2, \|G\|_2 \le \kappa$, and $w_t \in \mathbb{R}^{d_s}$ is an adversarially chosen disturbance with $\|w_t\|_2 \le W$. Without loss of generality, we assume that $\kappa, W \ge 1$ and $d_s = d_u = d$, and we also define $w_{-1} = s_0$. For $t = 0, \ldots, T-1$, let $c_t : \mathcal{S} \times \mathcal{U} \to [0, 1]$ be convex loss functions chosen by an oblivious adversary. The functions $c_t$ satisfy the following Lipschitz condition: if $\|s\|_2, \|u\|_2 \le D_{\mathcal{X}}$, then $\|\nabla_s c_t(s, u)\|, \|\nabla_u c_t(s, u)\| \le L_0 D_{\mathcal{X}}$. The goal in online linear control is to choose a sequence of policies that yield a sequence of controls $u_t$ to minimize the regret $R_T(\Pi) = \sum_{t=0}^{T-1} c_t(s_t, u_t) - \min_{\pi \in \Pi} \sum_{t=0}^{T-1} c_t(s_t^{\pi}, u_t^{\pi})$, where $s_t$ evolves according to the linear dynamics stated above, $\Pi$ denotes a controller class, and $s_t^{\pi}, u_t^{\pi}$ denote the state and control at time $t$ when the controls are chosen according to $\pi$.

A very simple controller class is constant input, i.e., $\Pi = \{\pi_u : \pi(s) = u \in \mathcal{U}\}$. In this case, the history $h_t$ can be represented by the finite-dimensional state $s_t$, and the operators can be set to $A = F$ and $B = G / \|G\|$. However, like previous work [Agarwal et al., 2019b] we focus on the class of $(\kappa, \rho)$-strongly stable linear controllers, $\mathcal{K}$, where $K \in \mathcal{K}$ satisfies $F - GK = H L H^{-1}$ with $\|K\|_2, \|H\|_2, \|H^{-1}\|_2 \le \kappa$ and $\|L\|_2 = \rho < 1$. Given such a controller, the inputs are chosen as linear functions of the current state, i.e., $u_t = -K s_t$. Unfortunately, parameterizing $u_t$ directly with a linear controller as $u_t = -K s_t$ leads to a non-convex problem because $s_t$ is a non-linear function of $K$, e.g., if the disturbances are 0, then $s_t = (F - GK)^t s_0$. An alternative parameterization is the disturbance-action controller (DAC).

Definition 4.1. Let $K \in \mathcal{K}$ be fixed. The class of disturbance-action controllers (DACs) $\mathcal{M}_K$ is $\{(K, M) : M = (M^{[s]})_{s=0}^{\infty}\}$, where $M^{[s]} \in \mathbb{R}^{d \times d}$ satisfies $\|M^{[s]}\|_2 \le \kappa^4 \rho^s$. The control in round $k$ is chosen as $u_k = -K s_k + \sum_{s=1}^{k+1} M^{[s]} w_{k-s}$.

The class of such DACs has two important properties. First, it acts on the entire history of past disturbances. Consequently, given an arbitrary $K \in \mathcal{K}$, every $K' \in \mathcal{K}$ can be expressed as a DAC $(K, M) \in \mathcal{M}_K$ with $M = (M^{[1]}, \ldots, M^{[T+1]}, 0, \ldots)$ [Agarwal et al., 2019a, Section 16.5]. That is, $\mathcal{K} \subseteq \mathcal{M}_K$ and it suffices to compute regret against $\mathcal{M}_K$ instead of $\mathcal{K}$. For the rest of this paper we fix $K \in \mathcal{K}$ and denote $\tilde{F} = F - GK$.
Second, suppose $M_t = (M_t^{[s]})_{s=0}^{\infty}$ is the parameter chosen in round $t$ and the control $u_t$ is chosen according to the DAC $(K, M_t)$. Then, $s_t$ and $u_t$ are linear functions of the parameters, which implies that $c_t$ is convex in the parameters. (See the next paragraph on Formulation as OCO with Unbounded Memory for a formula.) A similar parameterization was first considered for online linear control by Agarwal et al. [2019b] and is based on similar ideas in control theory, e.g., Youla [Youla et al., 1976, Kučera, 1975] and SLS [Wang et al., 2019, Anderson et al., 2019].

Formulation as OCO with Unbounded Memory. The first step is a change of variables with respect to the control inputs from linear controllers to DACs, and the second is a corresponding change of variables for the state. Define the decision space $\mathcal{X} = \{M = (M^{[s]}) : M^{[s]} \in \mathbb{R}^{d \times d}, \|M^{[s]}\|_2 \le \kappa^4 \rho^s\}$. Define the history space $\mathcal{H}$ to be the set consisting of sequences $h = (Y_k)$, where $Y_0 \in \mathcal{X}$ and $Y_k = \tilde{F}^{k-1} G X_k$ for $X_k \in \mathcal{X}$, $k \ge 1$. (Recall $\tilde{F} = F - GK$.) Define weighted norms $\|M\|_{\mathcal{X}}^2 = \sum_{s=1}^{\infty} \rho^{-s} \|M^{[s]}\|_F^2$ and $\|h\|_{\mathcal{H}}^2 = \sum_{k=0}^{\infty} \xi_k^2 \|Y_k\|_{\mathcal{X}}^2$, where the weights $(\xi_k)$ are defined as $\xi = (1, 1, 1, \rho^{-1/2}, \rho^{-1}, \rho^{-3/2}, \ldots)$. Define the linear operators $A : \mathcal{H} \to \mathcal{H}$ and $B : \mathcal{W} \to \mathcal{H}$ as $A((Y_0, Y_1, \ldots)) = (0, G Y_0, \tilde{F} Y_1, \tilde{F} Y_2, \ldots)$ and $B(M) = (M, 0, 0, \ldots)$. Note that the problem follows linear sequence dynamics with the $\xi$-weighted 2-norm (Definition 2.3). The weights in the norms on $\mathcal{X}$ and $\mathcal{H}$ increase exponentially. However, the norms $\|M^{[s]}\|_F^2$ and $\|\tilde{F}^{k-1} G\|_F^2$ decrease exponentially as well, by the definition of $\mathcal{X}$ and the assumption on $\tilde{F} = F - GK$ for $K \in \mathcal{K}$. Leveraging these exponential decays to define exponentially increasing weights is crucial for deriving our regret bounds that are stronger than existing results.

Construct the functions $f_t : \mathcal{H} \to \mathbb{R}$ that correspond to $c_t(s_t, u_t)$ as follows. Given a sequence of decisions $(M_0, \ldots, M_t)$, the history at the end of round $t$ is given by $h_t = (M_t, G M_{t-1}, \tilde{F} G M_{t-2}, \ldots, \tilde{F}^{t-1} G M_0, 0, \ldots)$. A simple inductive argument shows that the state and control in round $t$ can be written as functions of $h_t$ as
$$s_t = \tilde{F}^t s_0 + \sum_{k=0}^{t-1} \sum_{s=1}^{k+1} \tilde{F}^{t-k-1} G M_k^{[s]} w_{k-s} + \sum_{k=0}^{t-1} \tilde{F}^{t-k-1} w_k
\qquad\text{and}\qquad
u_t = -K s_t + \sum_{s=1}^{t+1} M_t^{[s]} w_{t-s}.$$
Define the functions $f_t : \mathcal{H} \to \mathbb{R}$ by $f_t(h) = c_t(s, u)$, where $s$ and $u$ are the state and control determined by the history as above. Note that $f_t$ is parameterized by the past disturbances. Since the state and control are linear functions of the history and $c_t$ is convex, this implies that $f_t$ is convex. Now, given the above formulation and the fact that the class of disturbance-action controllers is a superset of the class of $(\kappa, \rho)$-strongly-stable linear controllers, we have that the policy regret for the online linear control problem is at most the policy regret $\sum_{t=0}^{T-1} f_t(h_t) - \min_{M \in \mathcal{X}} \sum_{t=0}^{T-1} \tilde{f}_t(M)$.

The following is our main result for online linear control, and it improves existing results [Agarwal et al., 2019b] by a factor of $O(d \log(T)^{3.5} \kappa^5 (1-\rho)^{-1})$. See Appendix E.3 for a detailed comparison.

Theorem 4.1. Consider the online linear control problem as defined in Section 4.1. Suppose the decisions in round $t$ are chosen using Algorithm 1. Then, the upper bound on the policy regret is
$$O\big(\sqrt{T}\, d^{1/2}\, \kappa^{17}\, (1-\rho)^{-4.5}\big). \qquad (2)$$
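The change of variables above can be sanity-checked numerically. The following sketch rolls out the dynamics $s_{t+1} = F s_t + G u_t + w_t$ under a DAC and verifies that the resulting state matches the closed-form expression for $s_t$ given above; the random system, the choice $K = 0$, and the truncation of each $M_t$ to finitely many matrices are illustrative simplifications for the check only.

```python
import numpy as np

# Verify the closed-form expression for s_t in terms of the DAC parameters and disturbances
# against a direct rollout of s_{t+1} = F s_t + G u_t + w_t with
# u_t = -K s_t + sum_{s=1}^{t+1} M_t^[s] w_{t-s} and the convention w_{-1} = s_0.
d, T = 3, 8
rng = np.random.default_rng(0)
F = 0.3 * rng.normal(size=(d, d))
G = rng.normal(size=(d, d))
K = np.zeros((d, d))                      # K = 0 for simplicity; the identity is purely algebraic
Ft = F - G @ K                            # F~ = F - G K
s0 = rng.normal(size=d)
w = rng.normal(size=(T, d))

def dist(i):                              # w_i, with w_{-1} = s_0
    return s0 if i == -1 else w[i]

# M[t][s] stands for M_t^[s]; only s = 1, ..., t + 1 are ever used.
M = [[0.1 * rng.normal(size=(d, d)) for _ in range(T + 2)] for _ in range(T)]

def u(t, s_t):
    return -K @ s_t + sum(M[t][s] @ dist(t - s) for s in range(1, t + 2))

s = s0.copy()                             # direct rollout of the dynamics
for t in range(T):
    s = F @ s + G @ u(t, s) + w[t]

# Closed form: s_T = F~^T s_0 + sum_{k<T} F~^{T-k-1} (G sum_{s=1}^{k+1} M_k^[s] w_{k-s} + w_k).
s_closed = np.linalg.matrix_power(Ft, T) @ s0
for k in range(T):
    inner = sum(M[k][s_idx] @ dist(k - s_idx) for s_idx in range(1, k + 2))
    s_closed += np.linalg.matrix_power(Ft, T - k - 1) @ (G @ inner + w[k])

print("closed form matches rollout:", np.allclose(s, s_closed))
```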
Comparison with prior and concurrent work. Existing works solve OLC (and its extensions) by making multiple finite memory approximations. First, they formulate the problem as OCO with finite memory. This requires bounding numerous error terms because the problem is inherently an OCO with unbounded memory problem. We bypass these error analysis steps entirely because the problem fits into our framework naturally. Second, existing works use the parameterization from Agarwal et al. [2019b] that only acts on a fixed, constant number of past disturbances. In particular, existing works use a truncated DAC policy that is a sequence of $d \times d$ matrices of length $2 \kappa^4 (1-\rho)^{-1} \log T$. Our DAC policy acts on the entire history of disturbances and is a sequence of $d \times d$ matrices of unbounded length. Yet, we capture the dimension of this infinite-dimensional space in a way that still improves the overall bound, including completely eliminating the dependence on $\log T$, and improving the dependence on $d$, $\kappa$, and $(1-\rho)$. This improvement comes from our novel use of weighted norms on the history and decision spaces. These norms allow us to give tighter bounds on the relevant quantities in the regret upper bound, e.g., $\|A^k\|$ (Lemma E.2) and $\tilde{L}$ (Lemma E.6). In complementary concurrent work, Lin et al. [2022] focus on a more general online control problem. They improve regret bounds for this general version by a factor of $\log T$ compared to existing reductions to OCO with finite memory. They do so by using the fact that the impact of a past policy decays geometrically with time. On the other hand, the primary focus of our work is studying the complete dependence of present losses on the entire history in OCO. Applying our resulting OCO with unbounded memory framework to OLC, we improve upon existing results for OLC by removing all $\log T$ factors and improving the dependence on $d$, $\kappa$, and $(1-\rho)$.

4.2 Online Performative Prediction

Background. In many applications of machine learning the algorithm's decisions influence the data distribution, e.g., online labor markets [Anagnostopoulos et al., 2018, Horton, 2010], predictive policing [Lum and Isaac, 2016], on-street parking [Dowling et al., 2020, Pierce and Shoup, 2018], vehicle sharing markets [Banerjee et al., 2015], etc. Motivated by such applications, several works have studied the problem of performative prediction, which models the data distribution as a function of the decision-maker's decision [Perdomo et al., 2020, Mendler-Dünner et al., 2020, Miller et al., 2021, Brown et al., 2022, Ray et al., 2022, Jagadeesan et al., 2022]. Most of these works view the problem as a stochastic optimization problem; Jagadeesan et al. [2022] adopt a regret minimization perspective. We refer the reader to these citations for more details.

As a natural extension to existing works, we introduce an online learning variant of performative prediction with geometric decay [Ray et al., 2022] that differs from the original formulations in a few key ways. Let the decision set $\mathcal{X} \subseteq \mathbb{R}^d$ be closed and convex with $\|x\|_2 \le D_{\mathcal{X}}$. Let $p_1$ denote the initial data distribution over the instance space $\mathcal{Z}$. In each round $t \in [T]$, the learner chooses a decision $x_t \in \mathcal{X}$ and an oblivious adversary chooses a loss function $l_t : \mathcal{X} \times \mathcal{Z} \to [0, 1]$, and then the learner suffers the loss $L_t(x_t) = \mathbb{E}_{z \sim p_t}[l_t(x_t, z)]$, where $p_t = p_t(x_1, \ldots, x_t)$ is the data distribution in round $t$. The goal in our online learning setting is to minimize the difference between the algorithm's total loss and the total loss of the best fixed decision, $\sum_{t=1}^{T} \mathbb{E}_{z \sim p_t}[l_t(x_t, z)] - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \mathbb{E}_{z \sim p_t(x)}[l_t(x, z)]$, where $p_t(x) = p_t(x, \ldots, x)$ is the data distribution in round $t$ had $x$ been chosen in all rounds so far.
This measure is similar to performative regret [Jagadeesan et al., 2022] and is a natural generalization of performative optimality [Perdomo et al., 2020] to an online learning formulation. We make the following assumptions. First, the loss functions $l_t$ are convex and $L_0$-Lipschitz continuous. Second, the data distribution satisfies, for all $t \ge 1$, $p_{t+1} = \rho p_t + (1 - \rho) \mathcal{D}(x_t)$, where $\rho \in (0, 1)$ and $\mathcal{D}(x_t)$ is a distribution over $\mathcal{Z}$ that depends on the decision $x_t$ [Ray et al., 2022]. Third, $\mathcal{D}(x)$ is a location-scale distribution: $z \sim \mathcal{D}(x)$ iff $z \sim \xi + F x$, where $F \in \mathbb{R}^{d \times d}$ satisfies $\|F\|_2 < \infty$ and $\xi$ is a random variable with mean $\mu$ and covariance $\Sigma$ [Ray et al., 2022].

Our problem formulation differs from existing work in the following ways. First, we adopt an online learning perspective on performative prediction with geometric decay, whereas Ray et al. [2022] adopt a stochastic optimization one. So, we assume that the loss functions $l_t$ are adversarially chosen, whereas Ray et al. [2022] assume $l_t = l$ are fixed. Second, we assume that the dynamics ($\mathcal{D}$ and $\rho$) are known (Assumption A1), whereas Ray et al. [2022] assume they are unknown and use samples from the data distribution. We believe that an appropriate extension of our framework that can deal with unknown linear operators $A$ and $B$ can be applied to this more difficult setting, and we leave this as future work. Third, even though Jagadeesan et al. [2022] also study an online learning variant of performative prediction, they assume $l_t = l$ are fixed and the data distribution depends only on the current decision, whereas we assume the data distribution depends on the entire history of decisions.

Formulation as OCO with Unbounded Memory. Let $\rho \in (0, 1)$. Let the decision space $\mathcal{X} \subseteq \mathbb{R}^d$ be closed and convex with the 2-norm. Let the history space $\mathcal{H}$ be the $\ell_1$-direct sum of a countably infinite number of copies of $\mathcal{X}$. Define the linear operators $A : \mathcal{H} \to \mathcal{H}$ and $B : \mathcal{W} \to \mathcal{H}$ as $A((y_0, y_1, \ldots)) = (0, \rho y_0, \rho y_1, \ldots)$ and $B(x) = (x, 0, \ldots)$. The problem is an OCO with ρ-discounted infinite memory problem and follows linear sequence dynamics with the 1-norm (Definition 2.3). Given a sequence of decisions $(x_k)_{k=1}^{t}$, the history is $h_t = (x_t, \rho x_{t-1}, \ldots, \rho^{t-1} x_1, 0, \ldots)$ and the data distribution $p_t = p_t(h_t)$ satisfies: $z \sim p_t$ iff $z \sim \sum_{k=1}^{t-1} (1-\rho) \rho^{k-1} (\xi + F x_{t-k}) + \rho^{t-1} p_1$. This follows from the recursive definition of $p_t$ and the parametric assumption about $\mathcal{D}(x)$. Define the functions $f_t : \mathcal{H} \to [0, 1]$ by $f_t(h_t) = \mathbb{E}_{z \sim p_t}[l_t(x_t, z)]$. Now, the original goal of minimizing the difference between the algorithm's total loss and the total loss of the best fixed decision is equivalent to minimizing the policy regret. The following is our main result for online performative prediction.

Theorem 4.2. Consider the online performative prediction problem as defined in Section 4.2. Suppose the decisions in round $t$ are chosen using Algorithm 1. Then, the upper bound on the policy regret is $O\big(\sqrt{T\, \|F\|_2\, (1-\rho)^{-1}}\big)$.
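To illustrate the distribution dynamics behind Theorem 4.2, the sketch below propagates the mean of $p_t$ through the recursion $p_{t+1} = \rho p_t + (1-\rho)\mathcal{D}(x_t)$, checks it against the mixture representation used in the formulation above, and estimates one loss $L_t(x_t)$ by Monte Carlo. The Gaussian choices for $\xi$ and $p_1$, the loss $l_t(x, z) = \|x - z\|_2$, and the random decision sequence are illustrative assumptions, not part of the problem definition.

```python
import numpy as np

# Distribution dynamics p_{t+1} = rho * p_t + (1 - rho) * D(x_t), with D(x) the law of
# xi + F x. We track the mean of p_t via the recursion, check it against the mixture
# representation sum_{k=1}^{t-1} (1-rho) rho^{k-1} D(x_{t-k}) + rho^{t-1} p_1, and then
# estimate L_t(x_t) = E_{z ~ p_t}[ l_t(x_t, z) ] by sampling.
d, T, rho = 2, 6, 0.7
rng = np.random.default_rng(0)
F = rng.normal(size=(d, d)) / (2 * np.sqrt(d))
mu = rng.normal(size=d)                                        # mean of xi
mu_p1 = np.zeros(d)                                            # p_1 = N(0, I) here (illustrative)
xs = [None] + [rng.uniform(-1, 1, size=d) for _ in range(T)]   # decisions x_1, ..., x_T

mean, means = mu_p1, [mu_p1]                                   # means[i] = mean of p_{i+1}
for t in range(1, T):
    mean = rho * mean + (1 - rho) * (mu + F @ xs[t])
    means.append(mean)

t = T
mix_mean = rho ** (t - 1) * mu_p1 + sum(
    (1 - rho) * rho ** (k - 1) * (mu + F @ xs[t - k]) for k in range(1, t))
assert np.allclose(means[t - 1], mix_mean)                     # the two representations agree

def sample_pt(t, n):                                           # draw n samples from p_t
    weights = [rho ** (t - 1)] + [(1 - rho) * rho ** (k - 1) for k in range(1, t)]
    ks = rng.choice(np.arange(t), size=n, p=weights)           # which mixture component
    locs = np.array([mu_p1 if k == 0 else mu + F @ xs[t - k] for k in ks])
    return locs + rng.normal(size=(n, d))                      # xi ~ N(mu, I), p_1 = N(0, I)

z = sample_pt(t, 20000)
print("estimated L_t(x_t):", np.linalg.norm(xs[t] - z, axis=1).mean())
```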
5 Conclusion

In this paper we introduced a generalization of the OCO framework, Online Convex Optimization with Unbounded Memory, that allows the loss in the current round to depend on the entire history of decisions until that point. We proved matching upper and lower bounds on the policy regret in terms of the time horizon, the p-effective memory capacity (a quantitative measure of the influence of past decisions on present losses), and other problem parameters (Theorems 3.1 and 3.2). As a special case, we proved the first non-trivial lower bound for OCO with finite memory (Theorem 3.4), which could be of independent interest, and also improved existing upper bounds (Theorem 3.3). We illustrated the power of our framework by bringing together the regret analysis of two seemingly disparate problems under the same umbrella: online linear control (Theorem 4.1), where we improve and simplify existing regret bounds, and online performative prediction (Theorem 4.2).

There are a number of directions for future research. A natural follow-up is to consider unknown dynamics (i.e., when the learner does not know the operators $A$ and $B$) and/or the case of bandit feedback (i.e., when the learner only observes $f_t(h_t)$). The extension to bandit feedback has been considered in the OCO and OCO with finite memory literature [Hazan and Li, 2016, Bubeck et al., 2021, Zhao et al., 2021, Gradu et al., 2020, Cassel and Koren, 2020]. It is tempting to think about a version where the history is a nonlinear, but decaying, function of the past decisions. The obvious challenge is that the nonlinearity would lead to non-convex losses. It is unclear how to deal with such issues, e.g., restricted classes of nonlinearities for which the OCO with unbounded memory perspective is still relevant [Zhang et al., 2015], different problem formulations such as online non-convex learning [Gao et al., 2018, Suggala and Netrapalli, 2020], etc. There is a growing body of work on online linear control and its variants that relies on OCO with finite memory [Hazan et al., 2020, Agarwal et al., 2019c, Foster and Simchowitz, 2020, Cassel and Koren, 2020, Gradu et al., 2020, Li et al., 2021, Minasyan et al., 2021]. In this paper we showed how our framework can be used to improve and simplify regret bounds for the online linear control problem. Another direction for future work is to use our framework, perhaps with suitable extensions outlined above, to derive similar improvements for these other variants of online linear control.

Acknowledgments and Disclosure of Funding

We thank Sloan Nietert and Victor Sanches Portella for helpful discussions. We thank Wei-Yu Chan for pointing out an error in the proof of Lemma C.1. We also thank anonymous AISTATS and NeurIPS reviewers for their helpful comments on an older and the current version of the paper respectively. This research was partially supported by the NSERC Postgraduate Scholarships-Doctoral Fellowship 545847-2020, NSF awards CCF-2312774 and OAC-2311521, and a gift from Wayfair.

References

Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Sham M. Kakade and Ulrike von Luxburg, editors, COLT 2011 - The 24th Annual Conference on Learning Theory, June 9-11, 2011, Budapest, Hungary, 2011.

Jacob D. Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Rocco A. Servedio and Tong Zhang, editors, 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008, pages 263-274. Omnipress, 2008.

Alekh Agarwal, Nan Jiang, Sham M. Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms. 2019a.

Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, and Karan Singh. Online control with adversarial disturbances.
In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 111 119. PMLR, 2019b. Naman Agarwal, Elad Hazan, and Karan Singh. Logarithmic regret for online control. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 10175 10184, 2019c. Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42 60, 2018. Jason M. Altschuler and Kunal Talwar. Online learning over a finite action set with limited switching. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, 2018. Aris Anagnostopoulos, Carlos Castillo, Adriano Fazzone, Stefano Leonardi, and Evimaria Terzi. Algorithms for hiring and outsourcing in the online labor market. In Yike Guo and Faisal Farooq, editors, Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pages 1109 1118. ACM, 2018. Oren Anava, Elad Hazan, and Shie Mannor. Online learning for adversaries with memory: Price of past mistakes. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 784 792, 2015. James Anderson, John C Doyle, Steven H Low, and Nikolai Matni. System level synthesis. Annual Reviews in Control, 47:364 393, 2019. Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. J. Comput. Syst. Sci., 74(1):97 114, 2008. Siddhartha Banerjee, Carlos Riquelme, and Ramesh Johari. Pricing in ride-share platforms: A queueing-theoretic approach. Available at SSRN 2568258, 2015. Kush Bhatia and Karthik Sridharan. Online learning with dynamics: A minimax perspective. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. Boubekeur Boukhezzar and Houria Siguerdidjane. Comparison between linear and nonlinear control strategies for variable speed wind turbines. Control Engineering Practice, 18:1357 1368, 2010. Gavin Brown, Shlomi Hod, and Iden Kalemaj. Performative prediction in a stateful world. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, volume 151 of Proceedings of Machine Learning Research, pages 6045 6061. PMLR, 2022. Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multiarmed bandit problems. Foundations and Trends in Machine Learning, 5:1 122, 2012. Sébastien Bubeck, Ronen Eldan, and Yin Tat Lee. Kernel-based methods for bandit convex optimization. J. ACM, 68(4):25:1 25:35, 2021. Asaf B. Cassel and Tomer Koren. 
Bandit linear control. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. Lin Chen, Qian Yu, Hannah Lawrence, and Amin Karbasi. Minimax regret of switching-constrained online convex optimization: No phase transition. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. Thomas M Cover. Universal portfolios. Mathematical finance, 1(1):1 29, 1991. Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, December 3-8, 2018, Montréal, Canada, pages 4192 4201, 2018. Ofer Dekel, Ambuj Tewari, and Raman Arora. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012. icml.cc / Omnipress, 2012. Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1 5, 2016. Chase P. Dowling, Lillian J. Ratliff, and Baosen Zhang. Modeling curbside parking as a network of finite capacity queues. IEEE Trans. Intell. Transp. Syst., 21(3):1011 1022, 2020. Dylan J. Foster and Max Simchowitz. Logarithmic regret for adversarial online control. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, 2020. Xiand Gao, Xiaobo Li, and Shuzhong Zhang. Online learning with non-convex losses and nonstationary regret. In Amos J. Storkey and Fernando Pérez-Cruz, editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, volume 84 of Proceedings of Machine Learning Research, pages 235 243. PMLR, 2018. Gene H. Golub and Charles F. Van Loan. Matrix Computations, Third Edition. John Hopkins University Press, 1996. Paula Gradu, John Hallman, and Elad Hazan. Non-stochastic control with bandit feedback. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. Elad Hazan. Introduction to online convex optimization. Co RR, abs/1909.05207, 2019. Elad Hazan and Yuanzhi Li. An optimal algorithm for bandit convex optimization. Co RR, abs/1603.04350, 2016. URL http://arxiv.org/abs/1603.04350. Elad Hazan, Sham M. Kakade, and Karan Singh. The nonstochastic control problem. In Aryeh Kontorovich and Gergely Neu, editors, Algorithmic Learning Theory, ALT 2020, 8-11 February 2020, San Diego, CA, USA, volume 117 of Proceedings of Machine Learning Research, pages 408 421. PMLR, 2020. John J Horton. Online labor markets. 
In International workshop on internet and network economics, pages 515 522. Springer, 2010. Meena Jagadeesan, Tijana Zrnic, and Celestine Mendler-Dünner. Regret minimization with performative feedback. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022. Anna Karlin. Lecture notes for cse 522: Algorithms and uncertainty, 2017. URL https://courses. cs.washington.edu/courses/cse522/17sp/lectures/Lecture9.pdf. Vladimír Kuˇcera. Stability of discrete linear feedback systems. IFAC Proceedings Volumes, 8(1): 573 578, 1975. Peter Lancaster and Leiba Rodman. Algebraic riccati equations. Clarendon press, 1995. Yingying Li, Subhro Das, and Na Li. Online optimal control with affine constraints. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 8527 8537. AAAI Press, 2021. Yiheng Lin, James Preiss, Emile Anand, Yingying Li, Yisong Yue, and Adam Wierman. Online adaptive controller selection in time-varying systems: No-regret via contractive perturbations. ar Xiv preprint ar Xiv:2210.12320, 2022. Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. In 30th Annual Symposium on Foundations of Computer Science, Research Triangle Park, North Carolina, USA, 30 October - 1 November 1989, pages 256 261. IEEE Computer Society, 1989. Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Inf. Comput., 108(2): 212 261, 1994. Kristian Lum and William Isaac. To predict and serve? Significance, 13(5):14 19, 2016. Celestine Mendler-Dünner, Juan C. Perdomo, Tijana Zrnic, and Moritz Hardt. Stochastic optimization for performative prediction. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. John Miller, Juan C. Perdomo, and Tijana Zrnic. Outside the echo chamber: Optimizing the performative risk. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 7710 7720. PMLR, 2021. Edgar Minasyan, Paula Gradu, Max Simchowitz, and Elad Hazan. Online control of unknown time-varying dynamical systems. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, 2021. Francesco Orabona. A modern introduction to online learning. Co RR, abs/1912.13213, 2019. Juan C. Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 7599 7609. PMLR, 2020. Gregory Pierce and Donald Shoup. Sfpark: Pricing parking by demand. In Parking and the City, pages 344 353. Routledge, 2018. 
Yuzhen Qin, Yingcong Li, Fabio Pasqualetti, Maryam Fazel, and Samet Oymak. Stochastic contextual bandits with long horizon rewards. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Washington, DC, USA, February 7-14, 2023. AAAI Press, 2023. Mitas Ray, Lillian J. Ratliff, Dmitriy Drusvyatskiy, and Maryam Fazel. Decision-dependent risk minimization in geometrically decaying dynamic environments. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 8081 8088. AAAI Press, 2022. Shai Shalev-Shwartz. Online learning and online convex optimization. Found. Trends Mach. Learn., 4(2):107 194, 2012. Shai Shalev-Shwartz and Yoram Singer. Online learning meets optimization in the dual. In Gábor Lugosi and Hans Ulrich Simon, editors, Learning Theory, 19th Annual Conference on Learning Theory, COLT 2006, Pittsburgh, PA, USA, June 22-25, 2006, Proceedings, 2006. Guanya Shi, Yiheng Lin, Soon-Jo Chung, Yisong Yue, and Adam Wierman. Online optimization with memory and competitive control. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. Max Simchowitz and Dylan J. Foster. Naive exploration is optimal for online LQR. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, 2020. Aleksandrs Slivkins. Introduction to multi-armed bandits. Found. Trends Mach. Learn., 12:1 286, 2019. Arun Sai Suggala and Praneeth Netrapalli. Online non-convex learning: Following the perturbed leader is optimal. In Aryeh Kontorovich and Gergely Neu, editors, Algorithmic Learning Theory, ALT 2020, 8-11 February 2020, San Diego, CA, USA, volume 117 of Proceedings of Machine Learning Research, pages 845 861. PMLR, 2020. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. Yuh-Shyang Wang, Nikolai Matni, and John C. Doyle. A system-level approach to controller synthesis. IEEE Trans. Autom. Control., 64(10):4079 4093, 2019. Dante Youla, Hamid Jabr, and Jr Bongiorno. Modern wiener-hopf design of optimal controllers part ii: The multivariable case. IEEE Transactions on Automatic Control, 21(3):319 338, 1976. Lijun Zhang, Tianbao Yang, Rong Jin, and Zhi-Hua Zhou. Online bandit learning for a special class of non-convex losses. In Blai Bonet and Sven Koenig, editors, Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, pages 3158 3164. AAAI Press, 2015. Peng Zhao, Guanghui Wang, Lijun Zhang, and Zhi-Hua Zhou. Bandit convex optimization in non-stationary environments. J. Mach. Learn. Res., 22:125:1 125:45, 2021. Peng Zhao, Yu-Xiang Wang, and Zhi-Hua Zhou. Non-stationary online learning with memory and non-stochastic control. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, volume 151 of Proceedings of Machine Learning Research, pages 2101 2133. PMLR, 2022. Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. 
In Tom Fawcett and Nina Mishra, editors, Proceedings of the 20th International Conference on Machine Learning, ICML 2003, August 21-24 2003, Washington, DC, USA, 2003. Organization of the Appendix The appendix is organized as follows: Appendix A contains the proofs of the results from Section 2. Appendix B contains the proofs of existing results about FTRL for completeness. Appendix C contains the proofs of upper bounds on regret from Section 3. Appendix D contains the proofs of lower bounds on regret from Section 3. Appendix E contains the proofs of results about online linear control from Section 4.1. Appendix F contains the proofs of results about online performative prediction from Section 4.2. Appendix G discusses how to implement Algorithm 1 efficiently. Appendix H presents an algorithm for OCO with unbounded memory that provides the same upper bound on policy regret as Algorithm 1 while guaranteeting a small number of switches (Algorithm 2). Appendix I presents simulation experiments. A Framework In this section prove Theorem 2.1. But first we prove a lemma that we use for proofs involving linear sequence dynamics with the ξ-weighted p-norm (Definition 2.3). Recall that U denotes the norm associated with a space U and the operator norm L for a linear operator L : U V is defined as L = maxu: u U 1 Lu V. Lemma A.1. Consider an online convex optimization with unbounded memory problem specified by (X, H, A, B). If (X, H, A, B) follows linear sequence dynamics with the ξ-weighted p-norm for p 1, then for all k 1 ξk Ak 1 A0 Ak . Proof. Let x X with x X = 1. We have ξk Ak 1 A0x X = Ak(x, 0, . . . ) H Ak (x, 0, . . . ) H Ak , where the last inequality follows because (x, 0, . . . ) H = ξ0 x X and ξ0 = 1 by Definition 2.3. Therefore, Ak 1 A0 Ak . Theorem 2.1. Consider an online convex optimization with unbounded memory problem specified by (X, H, A, B). If ft is L-Lipschitz continuous, then ft is L-Lipschitz continuous for L L P k=0 Ak . If (X, H, A, B) follows linear sequence dynamics with the ξ-weighted p-norm for p 1, then L L P k=0 Ak p 1 Proof. Let x, x X. For the general case, we have ft(x) ft( x) = ! by Definition 2.1 k=0 Ak B(x x) H ft is L-Lipschitz continuous k=0 Ak B x x X k=0 Ak x x X by Assumption A2 k=0 Ak x x X . If (H, X, A, B) follows linear sequence dynamics with the ξ-weighted p-norm for p 1, then we have ft(x) ft( x) = ! by Definition 2.1 k=0 Ak B(x x) H ft is L-Lipschitz continuous = L (0, A0(x x), A1A0(x x), . . . ) by Definition 2.3 k=0 ξp k Ak 1 A0(x x) p ! 1 p by Definition 2.3 k=0 Ak p ! 1 p x x X by Lemma A.1 k=0 Ak p ! 1 B Standard Analysis of Follow-the-Regularized-Leader In this section we state and prove some existing results about the follow-the-regularized-leader (FTRL) algorithm [Shalev-Shwartz and Singer, 2006, Abernethy et al., 2008]. These results are well known in the literature, but we prove them here for completeness and use them in the remainder of the paper. We use the below results for functions ft with Lipschitz constants L. However, in this section we use a more general notation, denoting functions by gt and their Lipschitz constant by Lg. Consider the following setup for an online convex optimization (OCO) problem. Let T denote the time horizon. Let the decision space X be a closed, convex subset of a Hilbert space and gt : X R be loss functions chosen by an oblivious adversary. The functions gt are convex and Lg-Lipschitz continuous. The game between the learner and the adversary proceeds as follows. 
In each round t [T], the learner chooses xt X and the learner suffers loss gt(xt). The goal of the learner is to minimize (static) regret, Rstatic T = t=1 gt(xt) min x X t=1 gt(x). (3) Let R : X R be an α-strongly convex regularizer satisfying |R(x) R( x)| D for all x, x X. The FTRL algorithm chooses iterates xt as xt arg min x X s=1 gs(x) + R(x) where η is a tunable parameter referred to as the step-size. In what follows, let g0 = R η . The analysis in this section closely follows Karlin [2017]. Lemma B.1. For all x X, FTRL (Eq. (4)) satisfies t=0 gt(xt+1). Proof. We use proof by induction on T. The base case is T = 0. By definition, x1 arg minx X R(x). Therefore, R(x) R(x1) for all x X. Recalling the notation g0 = R η proves the base case. Now, assume that the lemma is true for T 1. That is, t=0 gt(xt+1). Let x X be arbitrary. Since x T +1 arg minx X PT t=0 gt(x), we have t=0 gt(x T +1) t=0 gt(x T +1) + g T (x T +1) t=0 gt(xt+1) + g T (x T +1) by inductive hypothesis t=0 gt(xt+1). This completes the proof. Lemma B.2. For all x X, FTRL (Eq. (4)) satisfies t=1 gt(x) D t=1 gt(xt) gt(xt+1). Proof. Note that t=1 gt(x) + g0(x) g0(x1) because x1 arg minx X g0(x). The proof of this lemma now follows by using the above inequality, Lemma B.1, the definition g0 = R η , and the definition of D. Theorem B.1. FTRL (Eq. (4)) satisfies xt+1 xt X η Lg α and Rstatic T D η + η TL2 g α . Choosing η = q αD T L2g yields Rstatic T O Proof. Let x arg minx X PT t=1 gt(x). Using Lemma B.2 we have t=1 gt(x ) D t=1 gt(xt) gt(xt+1). (5) We can bound the summands in the sum above as follows. Define Gt(x) = Pt 1 s=0 gs(x). Then, xt arg minx X Gt(x). and xt+1 arg minx X Gt+1(x). Since {gs}T s=1 are convex, R is α-strongly-convex, and g0 = R η , we have that Gt is α η -strongly-convex. So, Gt(xt+1) Gt(xt) + α 2η xt+1 xt 2 X , Gt+1(xt) Gt+1(xt+1) + α 2η xt+1 xt 2 X . Adding the above two inequalities yields gt(xt) gt(xt+1) α η xt+1 xt 2 X . (6) Since gt are convex and Lg-Lipschitz continuous, we also have gt(xt) gt(xt+1) Lg xt+1 xt X . (7) Combining Eqs. (6) and (7) we have xt+1 xt X η Lg This proves the first part of the theorem. Now, using this in Eq. (7) we have gt(xt) gt(xt+1) η L2 g α . (8) Finally, substituting this in Eq. (5) proves the second part of the theorem. C Regret Analysis: Upper Bounds First we prove a lemma that bounds the difference in the value of ft evaluated at the actual history ht and an idealized history that would have been obtained by playing xt in all prior rounds. Lemma C.1. Consider an online convex optimization with unbounded memory problem specified by (X, H, A, B). If the decisions (xt) are generated by Algorithm 1, then ft(ht) ft(xt) η L LH1 α for all rounds t. When (X, H, A, B) follows linear sequence dynamics with the ξ-weighted p-norm for p 1, then ft(ht) ft(xt) η L LHp α for all rounds t. Proof. We have ft(ht) ft(xt) = ! by Definition 2.1 by Assumption A4 k=0 Ak Bxt k by definition of ht k=0 Ak B(xt k xt) First consider the general case where (X, H, A, B) does not necessarily follow linear sequence dynamics. We can bound the term (a) as k=0 Ak B(xt k xt) Ak B xt xt k Ak B kη L α by Theorem B.1 Ak kη L α by Assumption A2 Plugging this into Eq. (9) completes the proof for the general case. Now consider the case when (X, H, A, B) follows linear sequence dynamcis with the ξ-weighted p-norm. We can bound the term (a) as k=0 Ak B(xt k xt) = (0, A0(xt xt 1), A1A0(xt xt 2), . . . ) by Definition 2.3 k=0 ξp k Ak 1 A0(xt xt k) p ! 1 p by Definition 2.3 k=0 ξp k Ak 1 A0 p xt xt k p ! 
1 Ak p xt xt k p ! 1 p by Lemma A.1 Ak p kp ! 1 p by Theorem B.1 Plugging this into Eq. (9) completes the proof. Now we restate and prove Theorem 3.1 Theorem 3.1. Consider an online convex optimization with unbounded memory problem specified by (X, H, A, B). Let the regularizer R : X R be α-strongly-convex and satisfy |R(x) R( x)| D for all x, x X. Algorithm 1 with step-size η satisfies RT (FTRL) D α + η T L LH1 αD T L(LH1+ L), then RT (FTRL) O When (X, H, A, B) follows linear sequence dynamics with the ξ-weighted p-norm, then all of the above hold with Hp instead of H1. Proof. First consider the general case where (X, H, A, B) does not necessarily follow linear sequence dynamics. Let x arg minx X PT t=1 ft(x). Note that we can write the regret as RT (FTRL) = t=1 ft(ht) min x X t=1 ft(ht) ft(xt) t=1 ft(xt) ft(x ) We can bound term (a) using Lemma C.1 and term (b) using Theorem B.1. Therefore, we have RT (FTRL) = t=1 ft(ht) ft(xt) t=1 ft(xt) ft(x ) Choosing η = q αD T L(LH1+ L) yields RT (FTRL) O where we used the definition of p-effective memory capacity (Definition 2.4) and the bound on L (Theorem 2.1) to simplify the above expression. This completes the proof for the general case. The proof for when (X, H, A, B) follows linear sequence dynamcis with the ξ-weighted p-norm is the same as above, except we bound the term (a) above using Lemma C.1 for linear sequence dynamics. Now we restate and prove Theorem 3.3. Theorem 3.3. Consider an online convex optimization with finite memory problem with constant memory length m specified by (X, H = X m, Afinite,m, Bfinite,m). Let the regularizer R : X R be α-strongly-convex and satisfy |R(x) R( x)| D for all x, x X. Algorithm 1 with step-size T L(Lm 3 2 + L) satisfies RT (FTRL) O α TL Lm 3 2 The OCO with finite memory problem, as defined in the literature, follows linear sequence dynamics with the 2-norm. In this subsection we consider a more general version of the OCO with finite memory problem that follows linear sequence dynamics with the p-norm. We provide an upper bound on the policy regret for this more general formulation and the proof of Theorem 3.3 follows as a special case when p = 2. Theorem C.1. Consider an online convex optimization with finite memory problem with constant memory length m, (X, H = X m, Afinite,m, Bfinite,m). Assume that the problem follows linear sequence dynamics with the p-norm for p 1. Let the regularizer R : X R be α-strongly-convex and satisfy |R(x) R( x)| D for all x, x X. Algorithm 1 with step-size η satisfies RT (FTRL) O Proof. Using Theorem 3.1 it suffices to bound L and Hp for this problem. Note that Ak finite = 1 if k m and 0 otherwise. Using this we have k Ak finite p ! 1 This proves the first inequality in the statement of the theorem. The second inequality follows from the above and Theorem 2.1, which states that k=0 Ak finite p ! 1 p = Lm 1 p . Finally, we provide an upper bound on the policy regret for the OCO with ρ-discounted infinite memory problem. For simplicity, we consider the case when the problem follows linear sequence dynamics with the 2-norm instead of a general p-norm. Theorem C.2. Consider an online convex optimization with ρ-discounted infinite memory problem (X, H, Ainfinite,ρ, Binfinite). Suppose that the problem follows linear sequence dynamics with the 2norm. Let the regularizer R : X R be α-strongly-convex and satisfy |R(x) R( x)| D for all x, x X. Algorithm 1 with step-size η satisfies RT (FTRL) O α TL L(1 ρ2) 3 α TL2(1 ρ2) 2 ! α TL2(1 ρ) 2 ! Proof. 
Using Theorem 3.1, it suffices to bound L and Hp for this problem. Recall that Ak infinite,ρ = ρk. Using this we have k Ak finite 2 ! 1 This proves the first inequality in the statement of the theorem. The second inequality follows from the above and Theorem 2.1, which states that k=0 Ak infinite,ρ 2 ! 1 2 = L(1 ρ2) 1 The last inequality follows because 1 ρ2 = (1 + ρ)(1 ρ), which implies that 1 ρ 1 ρ2 2(1 ρ) because ρ (0, 1). C.1 Existing Regret Bound for OCO with Finite Memory In this subsection we provide a detailed comparison of our upper bound on the policy regret for OCO with finite memory with that of Anava et al. [2015]. The material in this subsection comes from Appendix A.2 of their ar Xiv version or Appendix C.2 of their conference version. The existing upper bound on regret is where D = maxx, x X |R(x) R( x)|. Although the parameter λ is defined in terms of dual norms of the gradient of ft, it is essentially the Lipschitz-continuity constant for ft: for all x, x X, ft(x) ft( x) where α is the strong-convexity parameter of the regularizer R (or σ in the notation of Anava et al. [2015]). Therefore, the existing regret bound can be rewritten as Our upper bound on the policy regret for OCO with finite memory Theorem 3.3 is α L LTm 3 2 Since L m L by Theorem 2.1, this leads to an improvement by a factor of m 1 4 . D Regret Analysis: Lower Bounds We first restate Theorems 3.2 and 3.4. Theorem 3.2. There exists an instance of the online convex optimization with unbounded memory problem, (X, H, A, B), that follows linear sequence dynamics with the ξ-weighted p-norm and there exist L-Lipschitz continuous loss functions {ft : H R}T t=1 such that the regret of any algorithm A satisfies Theorem 3.4. There exists an instance of the online convex optimization with finite memory problem with constant memory length m, (X, H = X m, Afinite,m, Bfinite,m), and there exist L-Lipschitz continuous loss functions {ft : H R}T t=1 such that the regret of any algorithm A satisfies Time x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 ϵ1 ϵ2 ϵ3 ϵ4 f4(h4) = ϵ2 3 1 2 (x2 + x3 + x4) f5(h5) = ϵ2 3 1 2 (x3 + x4) f6(h6) = ϵ2 3 1 T = 12, m = 3, L = 1. Resample every m rounds. Figure 2: An illustration of the loss functions ft for the OCO with finite memory lower bound. Suppose T = 12, m = 3, L = 1, and p = 2. Time is divided into blocks of size m = 3. Consider round t = 5. The history is h5 = (x3, x4, x5). The loss function f5(h5) is a product of three terms: a random sign ϵ2 sampled for the block that round 5 belongs to, namely, block 2; a scaling factor of m 1 2 ; a sum over the decisions in the history excluding those that were chosen after observing ϵ2, i.e., a sum over x3 and x4, excluding x5. Theorem 3.2 follows from Theorem 3.4. However, the lower bound is true for a much broader class of problems as we show in this section. We first provide a lower bound for a more general formulation of the OCO with finite memory problem (Theorem D.1). The proof of Theorem 3.4 follows as a special case when p = 2. Then, we provide a lower bound for the OCO with ρ-discounted infinite memory problem (Theorem D.2). The OCO with finite memory problem, as defined in the literature, follows linear sequence dynamics with the 2-norm. In this section we consider a more general version of the OCO with finite memory problem that follows linear sequence dynamics with the p-norm. We provide a lower bound on the policy regret for this more general formulation and the proof of Theorem 3.4 follows as a special case when p = 2. Theorem D.1. 
For all p 1, there exists an instance of the online convex optimization with finite memory problem with constant memory length m, (X, H = X m, Afinite,m, Bfinite,m), that follows linear sequence dynamics with the p-norm, and there exist L-Lipschitz continuous loss functions {ft : H R}T t=1 such that the regret of any algorithm A satisfies Proof. Let X = [ 1, 1] and consider an OCO with finite memory problem with constant memory length m, (X, H = X m, Afinite,m, Bfinite,m), that follows linear sequence dynamics with the p-norm. For simplicity, assume that T is a multiple of m (otherwise, the same proof works but with slightly more tedious bookkeeping) and that L = 1 (otherwise, multiply the functions ft defined below by L). Divide the T rounds into N = T m blocks of m rounds each. Sample N independent Rademacher random variables {ϵ1, . . . , ϵN}, where each ϵi is equal to 1 with probability 1 2. Recall that ht = (xt, . . . , xt m+1). Define the loss functions {ft}T t=1 as follows. (See Fig. 2 for an illustration.) If t m, let ft = 0. Otherwise, let ft(ht) = ϵ t p xt m+1 + + xm t In words, the loss in the first m rounds is equal to 0. Thereafter, in round t the loss is equal to a random sign ϵ t m , which is fixed for that block, times a scaling factor, which is chosen according to the p-norm to ensure that the Lipschitz constant L is at most 1, times a sum of a subset of past decisions in the history ht = (xt, . . . , xt m+1). This subset consists of all past decisions until and including the first decision of the current block, which is the decision in round m t The functions ft are linear, so they are convex. In order to show that they satisfy Assumptions A3 and A4, it remains to show that they are 1-Lipschitz continuous. Let h = (x(1), . . . , x(m)) and h = ( x(1), . . . , x(m)) be arbitrary elements of H = X m. We have ft(h) ft( h) p (x(1) x(1)) + + (x(m) x(m)) p (x(1) x(1)) + + (x(m) x(m)) because ϵ t x(k) x(k) p ! 1 p by Hölder s inequality where the last equality follows because of our assumption that the problem that follows linear sequence dynamics with the p-norm. First we will show that the total expected loss of any algorithm is 0, where the expectation is with respect to the randomness in the choice of {ϵ1, . . . , ϵN}. The total loss in the first block is 0 because ft = 0 for t [m]. For each subsequent block n {2, . . . , N}, the total loss in block n depends on the algorithm s choices made before observing ϵn, namely, {x(n 2)m+2, . . . , x(n 1)m+1}. Since ϵn is equal to 1 with probability 1 2, the expected loss of any algorithm in a block is equal to 0 and the total expected loss is also equal to 0. Now we will show that the expected loss of the benchmark is at most where the expectation is with respect to the randomness in the choice of {ϵ1, . . . , ϵN}. We have t=(n 1)m+1 ϵnm p x (m (t (n 1)m 1)) The first equality follows from first summing over blocks and then summing over the rounds in that block. The second equality follows from the definitions of ft above and of ft (Definition 2.1). By the defintion of ft, the history ht consists of m copies of x for t m.. By the definition of ft, which sums over all past decisions until the first round of the current block, we have that within a block the sum first extends over m copies of x (in the first round of the block), then m 1 copies of x (in the second round of the block), and so on until the last round of the block. 
So, we have t=(n 1)m+1 ϵnm p x (m (t (n 1)m 1)) min x { 1,1} n=2 ϵn( 1 + 1) 1 n=2 ϵn( 1 1) where the second-last equality follows because the minima of a linear function over an interval is at one of the endpoints and the last equality follows because min{x, y} = 1 2 |x y|. Since ϵn are Rademacher random variables equal to 1 with probability 1 2, we can simplify the above as where the last inequality follows from Khintchine s inequality. Using the definition N = T T m 3 2 + 1 p p + m 1 2 + 1 p Tm 3 2 + 1 p Therefore, we have Eϵ1,...,ϵN [RT (FTRL)] = E This completes the proof. Now we provide a lower bound for the OCO with ρ-discounted infinite memory problem. For simplicity, we consider the case when the problem follows linear sequence dynamics with the 2-norm instead of a general p-norm. Theorem D.2. Let ρ [ 1 2, 1). There exists an instance of the online convex optimization with ρdiscounted infinite memory problem, (X, H, Ainfinite,ρ, Binfinite), that follows linear sequence dynamics with the 2-norm and there exist L-Lipschitz continuous loss functions {ft : H R}T t=1 such that the regret of any algorithm A satisfies TL2(1 ρ) 2 . The proof is very similar to that of Theorem D.1 with slight adjustments to account for a ρ-discounted infinite memory instead of a finite memory of constant size m. Proof. Let X = [ 1, 1] and consider an OCO with infinite memory problem with discount factor ρ, (X, H, Ainfinite,ρ, Binfinite), that follows linear sequence dynamics with the 2-norm. For simplicity, assume that T is a multiple of (1 ρ) 1 (otherwise, the same proof works but with slightly more tedious bookkeeping) and that L = 1 (otherwise, multiply the functions ft defined below by L). Define m = (1 ρ) 1. Divide the T rounds into N = T m blocks of m rounds each. Sample N independent Rademacher random variables {ϵ1, . . . , ϵN}, where each ϵi is equal to 1 with probability 1 2. Recall that ht = (xt, ρxt 1, . . . , ρt 1x1, 0, . . . ). Define the loss functions {ft}T t=1 as follows. If t m, let ft = 0. Otherwise, let ft(ht) = ϵ t k=0 ρk+t m t The functions ft are linear, so they are convex. In order to show that they satisfy Assumptions A3 and A4, it remains to show that they are 1-Lipschitz continuous. Let h = (x(1), ρx(2), . . . ) and h = ( x(1), ρ x(2), . . . ) be arbitrary elements of H. We have ft(h) ft( h) k=1 ρk 1 x(k) x(k) k=1 ρk 1 x(k) x(k) because ϵ t k=1 ρ2(k 1) x(k) x(k) 2 ! 1 2 by Hölder s inequality where the last equality follows because the follows linear sequence dynamics with the 2-norm. First we will show that the total expected loss of any algorithm is 0, where the expectation is with respect to the randomness in the choice of {ϵ1, . . . , ϵN}. The total loss in the first block is 0 because ft = 0 for t [m]. For each subsequent block n {2, . . . , N}, the total loss in block n depends on the algorithm s choices made before observing ϵn, namely, {x(n 2)m+2, . . . , x(n 1)m+1}. Since ϵn is equal to 1 with probability 1 2, the expected loss of any algorithm in a block is equal to 0 and the total expected loss is also equal to 0. Now we will show that the expected loss of the benchmark is at most where the expectation is with respect to the randomness in the choice of {ϵ1, . . . , ϵN}. We have t=(n 1)m+1 ϵnm 1 k=0 ρk+t (n 1)m 1x t=(n 1)m+1 ρt (n 1)m 1 m 1 X t=(n 1)m+1 ρt (n 1)m 1 The term (a) above can be bounded above by N as in the proof of Theorem D.1 using Khintchine s inequality. 
Therefore, using that N = T m and m = (1 ρ) 1 we have (1 ρ) 1 2 1 ρm T(1 ρ) 2(1 ρm)2 where the last inequality follows from the assumption that ρ [ 1 2, 1) and the following argument: ρm = (1 (1 ρ))m = (1 (1 ρ)) 1 1 ρ 1 (1 ρm)2 1 1 (1 ρm)2 1 1 This completes the proof. E Online Linear Control E.1 Formulation as OCO with Unbounded Memory Now we formulate the online linear control problem in our framework by defining the decision space X, the history space H, and the linear operators A : H H and B : W H. Then, we define the functions ft : H R in terms of ct and finally, prove an upper bound on the policy regret. For notational convenience, let (M [s]) and (Yk) denote the sequences (M [1], M [2], . . . ) and (Y0, Y1, . . . ) respectively. Recall that we fix K K to be an arbitrary (κ, ρ)-strongly stable linear controller and consider the disturbance-action controller policy class MK (Definition 4.1). For the rest of this paper let e F = F GK. The first step is a change of variables with respect to the control inputs from linear controllers to DACs and the second is a corresponding change of variables for the state. Define the decision space X as X = {M = (M [s]) : M [s] Rd d, M [s] 2 κ4ρs} (10) s=1 ρ s M [s] 2 F . (11) Define the history space H to be the set consisting of sequences h = (Yk), where Y0 X and Yk = e F k 1GXk for Xk X, k 1 with k=0 ξ2 k Yk 2 X , (12) where the weights (ξk) are nonnegative real numbers defined as ξ = (1, 1, 1, ρ 1 2 , ρ 1, ρ 3 2 , . . . ). (13) Define the linear operators A : H H and B : W H as A((Y0, Y1, . . . )) = (0, GY0, e FY1, e FY2, . . . ) and B(M) = (M, 0, 0, . . . ). Note that the problem follows linear sequence dynamics with the ξ-weighted 2-norm (Definition 2.3), where ξ is defined above in Eq. (13). The weights in the weighted norms on X and H increase exponentially. However, the norms M [s] 2 F and e F k 1G 2 F decrease exponentially as well: by definition of M [s] in Eq. (11) and the assumption on e F = F GK for K K. Leveraging this exponential decrease in M [s] 2 F and e F k 1G 2 F to define exponentially increasing weights turns out to be crucial for deriving our regret bounds that are stronger than existing results. Furthermore, the choice to have ξp = 1 for p {1, 2} in addition to p = 0 (as required by Definition 2.3) might seem like a small detail, but this also turns out to be crucial for avoiding unnecessary factors of ρ 1 in the regret bounds. Recall that the loss functions in the online linear control problem are ct(st, ut), where st and ut are the state and control at round t. Now we will show how to construct the functions ft : H R that correspond to ct(st, ut). By definition, given a sequence of decisions (M0, . . . , Mt), the history at the end of round t is given by ht = (Mt, GMt 1, e FGMt 2, . . . , e F t 1GM0, 0, . . . ). A simple inductive argument shows that the state and control in round t can be written as st = e F ts0 + s=1 e F t k 1GM [s] k wk s + wt 1, (14) s=1 M [s] t wt s. (15) Define the functions ft : H R by ft(h) = ct(s, u), where s and u are the state and control determined by the history as above. Note that ft is parameterized by the past disturbances. Since the state and control are linear functions of the history and ct is convex, this implies that ft is convex. 
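To make concrete how the state and control depend on the entire history of decisions, the following Python sketch rolls the linear system forward under a sequence of disturbance-action controllers. It is illustrative only: the function name dac_rollout, the placeholder system matrices, the truncation of each DAC to finitely many past disturbances, and the sign convention u_t = -K s_t + sum_s M_t^[s] w_{t-s} (as in Agarwal et al. [2019b]) are assumptions of this sketch, not the exact construction used in the analysis above.

```python
import numpy as np

def dac_rollout(F, G, K, Ms, ws, s0):
    """Roll out s_{t+1} = F s_t + G u_t + w_t under disturbance-action
    controllers u_t = -K s_t + sum_s M_t[s] w_{t-s} (truncated DAC).

    Ms[t] is a list of matrices [M_t^[1], ..., M_t^[m]]; ws[t] is w_t.
    Returns the state and control trajectories.
    """
    states, controls = [s0], []
    for t, (M_t, w_t) in enumerate(zip(Ms, ws)):
        s_t = states[-1]
        # The control uses the current decision M_t and past disturbances, so
        # the state (and hence the loss) depends on *all* past decisions.
        u_t = -K @ s_t
        for s, M in enumerate(M_t, start=1):
            if t - s >= 0:
                u_t = u_t + M @ ws[t - s]
        controls.append(u_t)
        states.append(F @ s_t + G @ u_t + w_t)
    return states, controls

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, m, T = 2, 3, 10
    F, G, K = 0.9 * np.eye(d), np.eye(d), 0.5 * np.eye(d)  # placeholder system
    Ms = [[0.1 * rng.standard_normal((d, d)) for _ in range(m)] for _ in range(T)]
    ws = [rng.standard_normal(d) for _ in range(T)]
    states, controls = dac_rollout(F, G, K, Ms, ws, np.zeros(d))
    print(len(states), len(controls))
```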
With the above formulation and the fact that the class of disturbance-action controllers is a superset of the class of (κ, ρ)-strongly-stable linear controllers, we have that the policy regret for the online linear control problem is at most t=0 ft(ht) min M X This completes the specification of the online convex optimization with unbounded memory problem, (X, H, A, B), corresponding to the online linear control problem. Using Algorithm 1 and Theorem 3.1 we can upper bound the above by where L is the Lipschitz constant of ft, L is the Lipschitz constant of ft, H2 is the 2-effective memory capacity, and D = maxx, x X |R(x) R( x)| for an α-strongly-convex regularizer R : X R. In the next subsection we bound these quantities in terms of the problem parameters of the online linear control problem. We use O( ) to hide absolute constants. E.2 Regret Analysis We use the following standard facts about matrix norms. Lemma E.1. Let M, N Rd d. Then, 2. MN F M 2 N F . Proof. Part 1 can be found in, for example, Golub and Loan [1996, Section 2.3.2]. Letting Nj denote the j-th column of N, part 2 follows from j=1 MNj 2 2 M 2 2 j=1 Nj 2 2 = M 2 2 N 2 F . This completes the proof. Lemma E.2. For s 2, the operator norm As is bounded above as As O κ4ρ s 2 . Proof. Recall the definition of H and H (Eq. (12)). Let (Y0, Y1, . . . ) = (Y0, GX1, e FGX2, e F 2GX3, . . . ) be an element of H with unit norm, i.e.,v u u t k=0 ξ2 k Yk 2 X = 1, where the weights (ξk) are defined in Eq. (13). Note that ξp = 1 for p = 0, 1 and ξ2 p = ρ p+2 for p = 2, 3, . . . . From the definition of the operator A and for s 2, we have As((Y0, Y1, . . . )) = (0, . . . , 0, e F s 1GY0, e F s GX1, e F s+1GX2, . . . ). Now we bound As as follows. By definition of As and H (Eq. (12)), and part 2 of Lemma E.1, we have As((Y0, Y1, . . . )) = v u u tρ s+2 e F s 1GY0 2 X + k=1 ρ s k+2 e F s+k 1GXk 2 X v u u tρ s+2 e F s 1G 2 2 Y0 2 X + k=1 ρ s k+2 e F s 1 2 2 e F 2 2 e F k 1GXk 2 X 2 e F s 1 2 v u u tρ2 G 2 2 Y0 2 X + k=1 ρ k+2 e F 2 2 e F k 1GXk 2 X 2 e F s 1 2 v u u tρ2 G 2 2 Y0 2 X + k=1 ρ k+2 e F 2 2 Yk 2 X . Using our assumptions that G 2 κ and e F 2 κ2ρ, we have As((Y0, Y1, . . . )) ρ s 2 e F s 1 2 v u u tρ2κ2 Y0 2 X + k=1 ρ k+2κ4ρ2 Yk 2 X 2 ρκ2 e F s 1 2 v u u t Y0 2 X + k=1 ρ k+2 Yk 2 X 2 ρκ2κ2ρs 1 v u u t Y0 2 X + k=1 ρ k+2 Yk 2 X v u u t Y0 2 X + k=1 ρ k+2 Yk 2 X . Using ρ 1+2 = ρ < 1 for k = 1 in the above sum, the definition of (ξk), and our assumption that (Y0, Y1 . . . ) has unit norm, we have As((Y0, Y1, . . . )) κ4ρ s 2 v u u tξ2 0 Y0 2 X + k=1 ξ2 k Yk 2 X = κ4ρ s 2 . This completes the proof. Lemma E.3. The 2-effective memory capacity is bounded above as H2 O κ4(1 ρ) 3 Proof. Using Lemma E.2 to bound Ak for k 2, we have k=0 k2 Ak 2 O O κ4(1 ρ) 3 Lemma E.4. Suppose R : X R is defined by R(M) = 1 2 M 2 X . Then, it is 1-strongly-convex and D = max M,f M X |R(M) R(f M)| dκ8(1 ρ) 1. Proof. Note that R is 1-strongly-convex by definition. Using part 1 of Lemma E.1 and the definition of X (Eq. (10)), we have for all M, f M X, D = max M,f M X |R(M) R(f M)| = max M,f M X 1 2 M 2 X 1 max M X M 2 X s=1 ρ s M [s] 2 F by Eq. (11) s=1 ρ sd M [s] 2 2 by Lemma E.1 s=1 ρ sdκ8ρ2s by Eq. (10) dκ8(1 ρ) 1. This completes the proof. Lemma E.5. We can bound the norm of the state and control at time t as max{ st 2, ut 2} DX = O Wκ8(1 ρ) 2 . Proof. We can bound the norm of st and ut using Eqs. (14) and (15) as s=1 e F t k 1GM [s] k wk s + wt 1 s=1 κ2ρt k 1κκ4ρs W κ2 + W + Wκ7(1 ρ) 2 O Wκ7(1 ρ) 2 . s=1 M [s] t wt s O Wκ8(1 ρ) 2 + O Wκ8(1 ρ) 2 . 
Above, we used the assumptions that κ, W 1. This completes the proof. Lemma E.6. The Lipschitz constant of ft can be bounded above as L O L0DX Wκ(1 ρ) 1 , where DX is defined in Lemma E.5. Proof. Let (M0, . . . , Mt) and (f M0, . . . , f Mt) be two sequences of decisions, where Mk and f Mk X. Let ht and ht be the corresponding histories, and (st, ut) and ( st, ut) be the corresponding statecontrol pairs at the end of round t. We have ft(ht) ft( ht) = |ct(st, ut) ct( st, ut)| L0DX max{ st st 2 , ut ut 2}, where the last inequality follows from our assumptions about the functions ct and Lemma E.5. It suffices to bound the two norms on the right-hand side in terms of ht ht H. For k = 0, . . . , t 1, define Z[s] k = e F t k 1G(M [s] k f M [s] k ). Using Eq. (14), we have s=1 Z[s] k wk s Z[s] k wk s 2 2 Z[s] k ρ s 2 wk s 2 ρ s 2 wk s 2 s=1 ξ1+t 1 k ρ s 2 Z[s] k 2 ξ 1 1+t 1 k ρ s 2 wk s 2 s=1 ξ2 t k ρ s s=1 ξ 2 t k ρ s 2 wk s 2 2 (16) 2 | {z } (a) k=0 ξ 2 t k ρ s 2 wk s 2 2 | {z } (b) where Eq. (16) follows from the Cauchy-Schwarz inequality. The specific choice of weighted norms on X and H allow us to bound the terms (a) and (b) in terms of ht ht H. We can bound the term (a) using the definition of Z[s] k , X , and H as s=1 ρ s e F t k 1G(M [s] k f M [s] k ) 2 s=1 ρ s e F t k 1G(M [s] k f M [s] k ) 2 ht ht H, (18) where Eq. (17) follows from part 1 of Lemma E.1 and Eq. (18) follows from the definitions of X and H. Using wt 2 W for all rounds t, we can bound the term (b) as k=0 ξ 2 t k ρ s 2 wk s 2 2 W k=0 ξ 2 t k k=0 ξ 2 t k ρ(1 ρk+1) W(1 ρ) 1, (19) where Eq. (19) follows from the definition of (ξk) (Eq. (13)). Substituting Eqs. (18) and (19) in Eq. (16), we have st st 2 W(1 ρ) 1 ht ht H. s=1 (M [s] t f M [s] t )wt s O Wκ(1 ρ) 1 ht ht H where the last inequality follows from our assumption that K 2 κ and the above inequality for st st 2. This completes the proof. Lemma E.7. The Lipschitz constant of ft can be bounded above as L O L0DX Wκ5(1 ρ) 3 where DX is defined in Lemma E.5. Proof. Using Lemma E.2 that bounds Ak , we have v u u t k=0 Ak 2 O κ4(1 ρ) 1 Using Theorem 2.1 that bounds L in terms of L and the above, we have L O Lκ4(1 ρ) 1 2 O L0DX Wκ5(1 ρ) 3 where the last inequality follows from Lemma E.6. Now we restate and prove Theorem 4.1. Theorem 4.1. Consider the online linear control problem as defined in Section 4.1. Suppose the decisions in round t are chosen using Algorithm 1. Then, the upper bound on the policy regret is Td 1 2 κ17(1 ρ) 4.5 . (2) Proof. Using Theorem 3.1 and the above lemmas, we can upper bound the policy regret of Algorithm 1 for the online linear control problem by dκ8(1 ρ) 1 T (L0W 2κ9(1 ρ) 3)2 κ4(1 ρ) 1 2 κ4(1 ρ) 3 Td 1 2 κ17(1 ρ) 4.5 . This completes the proof. E.3 Existing Regret Bound The upper bound on policy regret for the online linear control problem in existing work is given in Agarwal et al. [2019b, Theorem 5.1]. The theorem statement only shows the dependence on L, W, and T. The dependence on d, κ, and ρ can be found in the details of the proof. Below we give a detailed accounting of all of these terms in their regret bound. To simplify notation let γ = 1 ρ. Agarwal et al. [2019b] define γ log(T) and C = W(κ2 + HκBκ2a) γ(1 κ2(1 γ)H+1) + κBκ3W The value of a is not specified in Theorem 5.1. However, from Theorem 5.3 and the definition of M in Algorithm 1 their paper, we can infer that a = κBκ3. The final regret bound is obtained by summing Equations 5.1, 5.3, and 5.4. Given the definition of H above, we have that (1 γ)H+1 exp( κ2 log T) = T κ2. 
So, the dominant term in the regret bound is Equation 5.4, which is O L0WCd 3 2 κ2 Bκ6H2.5γ 1 Substituting the values of H and C from above and collecting terms, we have that the upper bound on policy regret in existing work [Agarwal et al., 2019b, Theorem 5.1] is T log(T)2.5κ2 Bκ11γ 3.5C = O L0Wd 3 2 T log(T)2.5κ2 Bκ11γ 3.5 W(κ2 + HκBκ2a) γ(1 κ2(1 γ)H+1) + κBκ3W = O L0Wd 3 2 T log(T)2.5κ2 Bκ11γ 3.5 Wκ2 γ(1 κ2(1 γ)H+1) + Wκ2 Bκ7 log(T) γ2(1 κ2(1 γ)H+1) + κBκ3W = O L0W 2d 3 2 T log(T)2.5κ2 Bκ13γ 4.5(1 κ2(1 γ)H+1) 1 + O L0W 2d 3 2 T log(T)3.5κ4 Bκ18γ 5.5(1 κ2(1 γ)H+1) 1 + O L0W 2d 3 2 T log(T)2.5κ3 Bκ14γ 4.5 = O L0W 2d 3 2 T log(T)3.5κ4 Bκ18γ 5.5 . Above we used that lim T (1 κ2(1 γ)H+1) 1 = 1 to simplify the expressions. Therefore, the upper bound on policy regret for the online linear control problem in existing work is O L0W 2d 3 2 T log(T)3.5κ4 Bκ18γ 5.5 . (20) F Online Performative Prediction Before formulating the online performative prediction problem in our OCO with unbounded memory framework, we state the definition of 1-Wasserstein distance that we use in our regret analysis. Informally, the 1-Wasserstein distance is a measure of the distance between two probability measures. Definition F.1 (1-Wasserstein Distance). Let (Z, d) be a metric space. Let P(Z) denote the set of Radon probability measures ν on Z with finite first moment. That is, there exists z Z such that Ez ν[d(z, z )] < . The 1-Wasserstein distance between two probability measures ν, ν P(Z) is defined as W1(ν, ν ) = sup{Ez ν[f(z)] Ez ν [f(z)]}, where the supremum is taken over all 1-Lipschitz continuous functions f : Z R. F.1 Formulation as OCO with Unbounded Memory Now we formulate the online performative prediction problem in our framework by defining the decision space X, the history space H, and the linear operators A : H H and B : W H. Then, we define the functions ft : H R in terms of lt and finally, prove an upper bound on the policy regret. For notational convenience, let (yk) denote the sequence (y0, y1, . . . ). Let ρ (0, 1). Let the decision space X Rd be closed and convex with X = 2. Let the history space H be the ℓ1-direct sum of countably infinte number of copies of X. Define the linear operators A : H H and B : X H as A((y0, y1, . . . )) = (0, ρy0, ρy1, . . . ) and B(x) = (x, 0, . . . ). Note that the problem is an OCO with ρ-discounted infinite memory problem and follows linear sequence dynamics with the 1-norm (Definition 2.3). Given a sequence of decisions (xk)t k=1, the history is ht = (xt, ρxt 1, . . . , ρt 1x1, 0, . . . ) and the data distribution pt = pt(ht) satisfies: k=1 (1 ρ)ρk 1(ξ + Fxt k) + ρtp1. (21) This follows from the recursive definition of pt and parametric assumption about D(x). Define the functions ft : H [0, 1] by ft(ht) = Ez pt[lt(xt, z)]. With the above formulation and definition of ft, the original goal of minimizing the difference between the algorithm s total loss and the total loss of the best fixed decision is equivalent to minimizing the policy regret, T X t=1 ft(ht) min x X F.2 Regret Analysis Lemma F.1. The operator norm As is bounded above as As O (ρs) . Proof. Recall the definition of H and H. Let (y0, y1, . . . ) = (x0, ρx1, ρ2x2, . . . ) be an element of H with unit norm, i.e., k=0 yk = 1. From the definition of the operator A, we have As((y0, y1, . . . )) = (0, . . . , 0, ρsx0, ρs+1x1, . . . ). Now we bound As as follows. By definition of As and H, we have As((y0, y1, . . . )) = k=0 ρs+k xk = ρs X k=0 ρk xk = ρs X k=0 yk = ρs. Lemma F.2. 
The 1-effective memory capacity is bounded above as H2 O (1 ρ) 2 . Proof. Using Lemma F.1 to bound Ak , we have k=0 kρk O (1 ρ) 2 . Lemma F.3. Suppose R : X R is defined by R(x) = 1 2 x 2 X . Then, it is 1-strongly-convex and D = maxx, x X |R(x) R( x)| D2 X . Proof. Note that R is 1-strongly-convex by definition. By the assumption that x X DX for all x X, we have that D D2 X . Lemma F.4. The Lipschitz constant of ft can be bounded above as Proof. Let (x1, . . . , xt) and ( x1, . . . , xt) be two sequences of decisions, where xk, xk X. Let ht and ht be the corresponding histories, and pt and pt be the corresponding distributions at the end of round t. We have ft(ht) ft( ht) = |Ez pt [lt(xt, z)] Ez pt [lt( xt, z)]| = |Ez pt [lt(xt, z)] Ez pt [lt( xt, z)] + Ez pt [lt( xt, z)] Ez pt [lt( xt, z)]| L0 xt xt 2 + L0W1(pt, pt), where the last inequality follows from the assumptions about the functions lt and the definition of the Wasserstein distance W1. By definition of pt (Eq. (21)), we have ρ ρk F 2 xt k xt k 2 ρ F 2 ht ht H, where the last inequality follows from the definition of H. Therefore, L L0 1 ρ Lemma F.5. The Lipschitz constant of ft can be bounded above as L O L0 1 ρ F 2 Proof. Using Lemma F.1 that bounds Ak , we have k=0 Ak = (1 ρ) 1. Using Theorem 2.1 that bounds L in terms of L and the above, we have L O L(1 ρ) 1 = O L0 1 ρ F 2 where the last equality follows from Lemma F.4. Now we restate and prove Theorem 4.2. Theorem 4.2. Consider the online performative prediction problem as defined in Section 4.2. Suppose the decisions in round t are chosen using Algorithm 1. Then, the upper bound on the policy regret is T F 2(1 ρ) 1 Proof. Using Theorem 3.1 and the above lemmas, we can upper bound the policy regret of Algorithm 1 for the online performative prediction problem by = O DX L0 F 2(1 ρ) 1 This completes the proof. We note that the upper bound can be improved by defining a weighted norm on H similar to the approach in Appendix E. However, here we present the looser anaysis for simplicity of exposition. G Implementation Details for Algorithm 1 In this section we discuss how to implement Algorithm 1 efficiently. Dimensionality of X. First, note that the decisions x X could be high-dimensional, e.g., an unbounded sequence of matrices as in the online linear control problem, but this is external to our framework and is application dependent. Our framework can be applied to X or to a lowerdimensional decision space X . However, the choice of X and analyzing the difference t=1 ft(x ) min x X is application dependent. For example, for the online linear control problem one could consider a restricted class of disturbance-action controllers that operate on a constant number of past disturbances as opposed to all the past disturbances, and then analyze the difference between these two policy classes. See, for example, Agarwal et al. [2019b, Lemma 5.2]. Computational cost of each iteration of Algorithm 1. Now we discuss how to implement each iteration of Algorithm 1 efficiently. We are interested in the computational cost of computing the decision xt+1 as a function of t. (Given the above discussion about the dimensionality of X, we ignore the fact that the dimensionality of the decisions themselves could depend on t.) Therefore, for the purposes of this section we (i) use O( ) notation to hide absolute constants and problem parameters excluding t and T; (ii) invoke the operators A and B by calling oracles OA( ) and OB( ); and (iii) evaluate the functions ft by calling oracles Of(t, ). 
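To make this oracle model concrete, here is a minimal sketch, assuming histories are represented as finite-dimensional numpy arrays (as in the control experiments of Appendix I), of how the FTRL objective F_t(x) = sum_{s=1}^t f_s(sum_{k=0}^{s-1} A^k B x) + R(x)/eta used by Algorithm 1 can be evaluated with O(t) oracle calls; the function name evaluate_objective is hypothetical, and the incremental construction it implements is described in detail below.

```python
import numpy as np

def evaluate_objective(t, x, O_A, O_B, O_f, R, eta):
    """Evaluate F_t(x) = sum_{s=1}^t f_s(Phi_s(x)) + R(x)/eta, where
    Phi_s(x) = sum_{k=0}^{s-1} A^k B x, using one call to O_B and O(t)
    calls each to O_A and O_f.  Assumes t >= 1."""
    gamma = O_B(x)        # Gamma_0(x) = B x
    phi = np.copy(gamma)  # Phi_1(x) = Gamma_0(x)
    total = O_f(1, phi)
    for s in range(2, t + 1):
        gamma = O_A(gamma)   # Gamma_{s-1}(x) = A(Gamma_{s-2}(x)) = A^{s-1} B x
        phi = phi + gamma    # Phi_s(x) = Phi_{s-1}(x) + Gamma_{s-1}(x)
        total += O_f(s, phi)
    return total + R(x) / eta
```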
Recall from Assumption A1 that we assume the learner knows the operators A and B, and observes ft at the end of each round t. So, the oracles OA, OB, and Of are readily available. Algorithm 1 chooses the decision xt+1 as xt+1 arg min x X s=1 fs(x) + R(x) η = arg min x X | {z } =Ft(x) Since Ft(x) is a sum of f1, . . . , ft, evaluating Ft(x) requires Θ(t) oracle calls to Of. However, this issue is present in FTRL for OCO and OCO with finite memory as well and is not specific to our framework. To deal with this issue, one could consider mini-batching algorithms [Dekel et al., 2012, Altschuler and Talwar, 2018, Chen et al., 2020] such as Algorithm 2. A naïve implementation to evaluate Ft(x) could require O(t3) oracle calls to OA: for each s [t], constructing the argument Ps 1 k=0 Ak Bx for fs could require k oracle calls to OA to compute Ak Bx, for a total of O(s2) oracle calls. However, Ft(x) can be evaluated with just O(t) oracle calls to OA by constructing the arguments incrementally. For t 0, define Γt : X H as Γ0(x) = Bx Γt(x) = A (Γt 1(x)) for t 1. Note that Γt(Bx) = At Bx. Also, for t 1, define Φt : X H as Φ1(x) = Γ0(x) Φt(x) = Φt 1(x) + Γt 1(x) for t 2. Note that Φs(x) = Ps 1 k=0 Ak Bx is the argument for fs. These can be constructed incrementally as follows. 1. Construct Γ0(x) using one oracle call to OB. 2. For s = 1, (a) Construct Φ1(x) = Γ0(x). (b) Construct Γ1(x) from Γ0(x) using one oracle call to OA. 3. For s 2, (a) Construct Φs(x) by adding Φs 1(x) and Γs 1(x). This can be done in O(1) time. Recall from our earlier discussion that O( ) hides absolute constants and problem parameters excluding t and T. (b) Construct Γs(x) from Γs 1(x) using one oracle call to OA. By incrementally constructing Φs(x) as above, we can evaluate Ft(x) in O(t) time with O(1) oracle calls to OB, O(t) oracle calls to OA, and O(t) oracle calls to Of. Memory usage of Algorithm 1. We end with a brief discussion of the memory usage of Algorithm 1. We are interested in the memory usage of computing the decision xt+1 as a function of t. (Given the discussion about the dimensionality of X at the start of this section, we ignore the fact that the dimensionality of the decisions themselves could depend on t.) For each t [T], the memory usage could be as low as O(1) (if, for example, X Rd, and A, B Rd d, which implies that Φt(x) is a d-dimensional vector) or as high as O(t) (if, for example, Φt(x) is a t-length sequence of d-dimensional vectors). However, the memory usage is already Ω(t) to store the functions f1, . . . , ft. Therefore, Algorithm 1 only incurs a constant factor overhead. H An Algorithm with A Low Number of Switches: Mini-Batch FTRL In this section we present an algorithm (Algorithm 2) for OCO with unbounded memory that provides the same upper bound on policy regret as Algorithm 1 while guaranteeting a small number of switches. Algorithm 2 combines FTRL on the functions ft with a mini-batching approach. First, it divides rounds into batches of size S, where S is a parameter. Second, at the start of batch b {1, . . . , T/S }, it performs FTRL on the functions {g1, . . . , gb}, where gi is the average of the functions ft in batch i. Then, it uses this decision for the entirety of the current batch. By design, Algorithm 2 switches decisions at most O(T/S) times. This algorithm is insipired by similar algorithms for online learning and OCO [Dekel et al., 2012, Altschuler and Talwar, 2018, Chen et al., 2020]. 
Algorithm 2: Mini-Batch FTRL Input : Time horizon T, step size η, α-strongly-convex regularizer R : X R, batch size S. 1 Initialize history h0 = 0. 2 for t = 1, 2, . . . , T do 3 if t mod S = 1 then 4 Let Nt = {1, . . . , t S } denote the number of batches so far. 5 For b Nt, let Tb = {(b 1)S + 1, . . . , b S} denote the rounds in batch b. 6 For b Nt, let gb = 1 S P s Tb fs. denote the average of the functions in batch b. 7 Learner chooses xt arg minx X P b Nt gb(x) + R(x) 10 Learner chooses xt = xt 1. 12 Set ht = Aht 1 + Bxt. 13 Learner suffers loss ft(ht) and observes ft. Theorem H.1. Consider an online convex optimization with unbounded memory problem specified by (X, H, A, B). Let the regularizer R : X R be α-strongly-convex and satisfy |R(x) R( x)| D for all x, x X. Algorithm 2 with batch size S and step-size η satisfies RT (Mini-Batch FTRL) SD α + η TL LH1 αSD T L( LH1 S + L), then RT (Mini-Batch FTRL) O α T L LH1 + S L2 ! Setting the batch size to be S = LH1/ L we obtain the same upper bound on policy regret as Algorithm 1 while guaranteeing that the decisions xt switch at most T L/LH1 times. Corollary H.1. Consider an online convex optimization with unbounded memory problem specified by (X, H, A, B). Let the regularizer R : X R be α-strongly-convex and satisfy |R(x) R( x)| D for all x, x X. Algorithm 2 with batch size S = LH1 L and step-size η = r αSD T L( LH1 S + L) satisfies RT (Mini-Batch FTRL) O Furthermore, the decisions xt switch at most T L LH1 times. Intuitively, in the OCO with unbounded memory framework each decision xt is penalized not just in round t but in future rounds as well. Therefore, instead of immediately changing the decision, it is prudent to stick to it for a while, collect more data, and then switch decisions. For the OCO with finite memory problem, the constant memory length m provides a natural measure of how long decisions penalized for and when one should switch decisions. In the general case, this is measured by the quantity LH1/ L. Note that this simplifies to m for OCO with finite memory for all p-norms. Proof of Theorem H.1. For simplicity, assume that T is a multiple of S. Otherwise, the same proof works after replacing T S . Let x arg minx X PT t=1 ft(x). Note that we can write the regret as RT (Mini-Batch FTRL) = t=1 ft(ht) min x X t=1 ft(ht) ft(xt) t=1 ft(xt) ft(x ) We can bound the term (b) using Theorem B.1 for mini-batches [Dekel et al., 2012, Altschuler and Talwar, 2018, Chen et al., 2020] by SD It remains to bound term (a). Let N = T/S denote the number of batches and Tn = {(n 1)S + 1, . . . , n S} denote the rounds in batch n [N]. We can write t=1 ft(ht) ft(xt) = k=0 Ak Bxt k by Definition 2.1 k=0 Ak Bxt k by Assumption A4 k=0 Ak Bxt k where the last inequality follows because of the following. Consider rounds t1 = b1S + r and t2 = b2S + r for b1 < b2 and r [S]. Then, ht1 Pt1 1 k=0 Ak Bxt1 ht2 Pt2 1 k=0 Ak Bxt2 because the latter sums over more terms in its history and decisions in consecutive batches have distance bounded above by η L/α (Theorem B.1). Therefore, it suffices to show that term (c) is upper bounded by η LH1/α. We have k=0 Ak Bxt k Ak Bxt k Ak Bxt k=0 Ak B xt k xt k=0 Ak xt k xt by Assumption A2. Since the same decision xn is chosen in all rounds of batch n, we can reindex and rewrite k=0 Ak Bxt k k=0 Ak xt k xt s=1 A(N n 1)S+s+o x N xn s=1 (N n) A(N n 1)S+s+o s=1 n A(n 1)S+s+o , where the last inequality follows from bounding the distance between decision in consecutive batches Theorem B.1 and the triangle inequality. 
Expanding the triple sum yields s=1 n A(n 1)S+s+o A + + AS + 2 AS+1 + + 2 A2S + 3 A2S+1 + + 3 A3S + . . . + A2 + + AS+1 + 2 AS+2 + + 2 A2S+1 + 3 A2S+2 + + 3 A3S+1 + . . . ... + AS + + A2S 1 + 2 A2S + + 2 A3S 1 + 3 A3S + + 3 A4S 1 + . . . , where each line above corresponds to a value of o {0, . . . , S 1}. Adding up these terms yields H1. This completes the proof. Note that Theorem H.1 only provides an upper bound on the policy regret for the general case. Unlike Algorithm 1, it is unclear how to obtain a stronger bound depending on Hp for the case of linear sequence dynamics with the ξ-weighted p-norm for p > 1. The above proof can be specialized for this special case, similar to the proofs of Theorem 2.1 and Lemma C.1, to obtain k=0 Ak Bxt k n A(n 1)S+s+o p ! 1 n A(n 1)S+s+o p ! 1 A p + + AS p + 2p AS+1 p + . . . 2p A2S p + 3p A2S+1 p + . . . 1 + A2 p + + AS+1 p + 2p AS+2 p + . . . 2p A2S+1 p + 3p A2S+2 p + . . . 1 ... AS p + + A2S 1 p + 2p A2S p + . . . 2p A3S 1 p + 3p A3S p + . . . 1 The above expression cannot be easily simplified to O(Hp). However, for the special case of OCO with finite memory, which follows linear sequence dynamics with the 2-norm, we can do so by leveraging the special structure of the linear operator Afinite,m. Theorem H.2. Consider an online convex optimization with finite memory problem with constant memory length m specified by (X, H = X m, Afinite,m, Bfinite,m). Let the regularizer R : X R be α-strongly-convex and satisfy |R(x) R( x)| D for all x, x X. Algorithm 2 with batch size m and step-size η = r T L Lm 1 2 + L satisfies RT (Mini-Batch FTRL) O α TL Lm 3 2 Furthermore, the decisions xt switch at most T Proof. Given the proof of Theorem H.1 and the above discussion, it suffices to show that n A(n 1)S+s+o 2 ! 1 2 H2 = m 3 2 . Recall that Ak finite = 1 if k m and 0 otherwise. Using this and S = m, we have that the above sum is at most m + m 1 + + 1 = O m 3 2 . This completes the proof. I Experiments In this section we present some simple simulation experiments.2 Problem Setup. We consider the problem of online linear control with a constant input controller class Π = {πu : π(s) = u U}. Let T denote the time horizon. Let S = Rd and U = {u Rd : u 2 1} denote the state and control spaces. Let st and ut denote the state and control at time t with s0 being the initial state. The system evolves according to linear dynamics st+1 = Fst + Gut + wt, where F, G Rd d are system matrices and wt Rd is a disturbance. The loss function in round t is simply ct(st, ut) = ct(st) = Pd j=1 st,j, where st,j denotes the j-th coordinate of st. The goal is to choose a sequence of control inputs u0, . . . , u T 1 U to minimize the regret t=0 ct(st, ut) min u U t=0 ct(su t , u), where su t denotes the state in round t upon choosing control input u in each round. Note that the state in round t can be written as k=1 F k Gut k + k=1 F kwt k. Therefore, we can formulate this problem as an OCO with unbounded memory problem by setting X = U, H = {y Rd : y = Pt k=0 F k Gu for some u U and t N}, A(h) = Fh, B(x) = Gx, and ft(ht) = ct(Pt k=1 F k Gut k + Pt k=1 F kwt k). Note that H, A, and B are all finitedimensional. Data. We set the time horizon T = 750 and dimension d = 2. We sample the disturbances {wt} from a standard normal distribution. We set the system matrix G to be the identity and the system matrix F to be a diagonal plus upper triangular matrix with the diagonal entries equal to ρ and the upper triangular entries equal to α. We run simulations with various values of ρ and α. 
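As a concrete rendering of this setup, the sketch below constructs the system matrices and disturbances described above and rolls out the dynamics for a given sequence of control inputs, returning the per-round losses c_t(s_t) = sum_j s_{t,j}. The function names, the random seed, and the zero initial state are illustrative assumptions. A fixed comparator u can be evaluated with the same rollout, which is how the benchmark term of the regret is computed.

```python
import numpy as np

def make_instance(T=750, d=2, rho=0.90, alpha=0.10, seed=0):
    """Construct the simulation instance described above: G is the identity,
    F has diagonal entries rho and upper-triangular entries alpha, and the
    disturbances w_t are drawn from a standard normal distribution."""
    rng = np.random.default_rng(seed)
    G = np.eye(d)
    F = rho * np.eye(d) + alpha * np.triu(np.ones((d, d)), k=1)
    W = rng.standard_normal((T, d))  # disturbances w_0, ..., w_{T-1}
    return F, G, W

def rollout(F, G, W, controls, s0=None):
    """Roll out s_{t+1} = F s_t + G u_t + w_t and return the per-round
    losses c_t(s_t) = sum_j s_{t,j}."""
    s = np.zeros(F.shape[0]) if s0 is None else s0
    losses = []
    for u_t, w_t in zip(controls, W):
        losses.append(s.sum())          # loss in round t depends on s_t
        s = F @ s + G @ u_t + w_t       # state update
    return np.array(losses)
```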
Implementation. We use the cvxpy library [Diamond and Boyd, 2016, Agrawal et al., 2018] to implement Algorithm 1. We set the step-sizes according to Theorems 3.1 and 3.3. We run the experiments on a standard laptop.

2 https://github.com/raunakkmr/oco-with-memory-code.

Results. We compare the regret with respect to the optimal control input of OCO with unbounded memory and OCO with finite memory for various memory lengths m in Fig. 3 (ρ = 0.90) and Fig. 4 (ρ = 0.95). There are a few important takeaways.

1. OCO with unbounded memory performs as well as or better than OCO with finite memory, and it does so at comparable computational cost (Appendix G). In fact, the regret curve for OCO with unbounded memory reaches an asymptote, whereas this is not the case for OCO with finite memory for a variety of memory lengths.

2. Knowledge of the spectral radius ρ of F is not sufficient to tune the memory length m for OCO with finite memory. This is illustrated by comparing Figs. 3a to 3d: even though small memory lengths perform well when the upper triangular value is small, they perform poorly when the upper triangular value is large. In contrast, OCO with unbounded memory performs well in all cases.

3. For a fixed memory length, OCO with unbounded memory eventually performs better than OCO with finite memory. This is also illustrated by comparing Figs. 3a to 3d.

4. As the memory length increases, the performance of OCO with finite memory eventually approaches that of OCO with unbounded memory. An advantage of OCO with unbounded memory, however, is that it does not require tuning the memory length. For example, when ρ = 0.90 and the upper triangular entry of F is 0.10, OCO with finite memory with m = 4 performs comparably to m = 8 and m = 16 (Fig. 3c), but when the upper triangular entry of F is 0.12 it performs much worse (Fig. 3d). OCO with unbounded memory performs well in all cases without the need to tune an additional hyperparameter in the form of the memory length.

Figure 3: Regret plot for ρ = 0.90. The label OCO-UM refers to formulating the problem as an OCO with unbounded memory problem. The label OCO-FM-m refers to formulating the problem as an OCO with finite memory problem with constant memory length m. The titles of the plots indicate the values of the dimension, the diagonal entries of F, and the upper triangular entries of F.

Figure 4: Regret plot for ρ = 0.95. The label OCO-UM refers to formulating the problem as an OCO with unbounded memory problem. The label OCO-FM-m refers to formulating the problem as an OCO with finite memory problem with constant memory length m. The titles of the plots indicate the values of the dimension, the diagonal entries of F, and the upper triangular entries of F.
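To complement the implementation details above, here is a minimal sketch of a single FTRL update for the constant-input controller experiment, assuming the cvxpy-based implementation described in the Implementation paragraph. The helper name ftrl_update and the exact indexing of the powers of F (taken from the state expansion in the Problem Setup) are assumptions of this sketch; because c_t is linear, the disturbance-dependent part of each history is a constant and can be dropped from the objective.

```python
import cvxpy as cp
import numpy as np

def ftrl_update(F, G, eta, t):
    """One FTRL step for the constant-input controller experiment:
    minimize the u-dependent part of sum_{s=1}^t f_s plus ||u||^2 / (2*eta)
    over the unit ball U = {u : ||u||_2 <= 1}."""
    d = F.shape[0]
    u = cp.Variable(d)
    ones = np.ones(d)
    # c accumulates sum_{s=1}^t 1^T M_s, where M_s = sum_{k=1}^s F^k G maps a
    # constant input u to the u-dependent part of the state at round s.
    c = np.zeros(d)
    Fk = np.eye(d)
    M = np.zeros((d, d))
    for s in range(1, t + 1):
        Fk = F @ Fk        # after this line, Fk = F^s
        M = M + Fk @ G     # M_s = sum_{k=1}^s F^k G
        c = c + ones @ M   # add 1^T M_s
    objective = c @ u + cp.sum_squares(u) / (2.0 * eta)
    problem = cp.Problem(cp.Minimize(objective), [cp.norm(u, 2) <= 1])
    problem.solve()
    return u.value
```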