# Dynamical Linear Bandits

Marco Mussi¹, Alberto Maria Metelli¹, Marcello Restelli¹

In many real-world sequential decision-making problems, an action does not immediately reflect on the feedback and spreads its effects over a long time frame. For instance, in online advertising, investing in a platform produces an instantaneous increase of awareness, but the actual reward, i.e., a conversion, might occur far in the future. Furthermore, whether a conversion takes place depends on: how fast the awareness grows, its vanishing effects, and the synergy or interference with other advertising platforms. Previous work has investigated the Multi-Armed Bandit framework with the possibility of delayed and aggregated feedback, without a particular structure on how an action propagates in the future, disregarding possible dynamical effects. In this paper, we introduce a novel setting, the Dynamical Linear Bandits (DLB), an extension of the linear bandits characterized by a hidden state. When an action is performed, the learner observes a noisy reward whose mean is a linear function of the hidden state and of the action. Then, the hidden state evolves according to linear dynamics, affected by the performed action too. We start by introducing the setting, discussing the notion of optimal policy, and deriving an expected regret lower bound. Then, we provide an optimistic regret minimization algorithm, Dynamical Linear Upper Confidence Bound (Dyn Lin-UCB), that suffers an expected regret of order $\widetilde{O}\big(d\sqrt{T}/(1-\rho)^{3/2}\big)$, where $\rho$ is a measure of the stability of the system and $d$ is the dimension of the action vector. Finally, we conduct a numerical validation on a synthetic environment and on real-world data to show the effectiveness of Dyn Lin-UCB in comparison with several baselines.

¹Politecnico di Milano, Milan, Italy. Correspondence to: Marco Mussi. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

1. Introduction

In a large variety of sequential decision-making problems, a learner must choose an action that, when executed, determines an evolution of the underlying system state that is hidden to the learner. In these partially observable problems, the learner observes a reward (i.e., feedback) representing the combined effect of multiple actions played in the past. For instance, in online advertising campaigns, the process that leads to a conversion, i.e., the marketing funnel (Court et al., 2009), is characterized by complex dynamics and comprises several phases. When heterogeneous campaigns/platforms are involved, a profitable budget investment policy has to account for the interplay between campaigns/platforms. In this scenario, a conversion (e.g., a user's purchase of a promoted product) should be attributed not only to the latest ad the user was exposed to, but also to previous ones (Berman, 2018). The joint consideration of each funnel phase is a fundamental step towards an optimal investment solution, whereas considering the advertising campaigns/platforms independently leads to sub-optimal solutions. Consider, for instance, a simplified version of the funnel with two types of campaigns: awareness (i.e., impression) ads and conversion ads. The first kind of ad aims at improving brand awareness, while the latter aims at creating the actual conversion.
If we evaluate the performances in terms of conversions only, we will discover that impression ads are not instantaneously effective in creating conversions, so we will be tempted to reduce the budget invested in such a campaign. However, this approach is sub-optimal because impression ads increase the chance to convert when a conversion ad is shown after the impression (e.g., Hoban & Bucklin, 2015). In addition, the effect of some ads, especially impression ads delivered via television, may be delayed. It has been demonstrated (Chapelle, 2014) that users remember advertising over time in a vanishing way, leading to consequences that non-dynamical models cannot capture. This kind of interplay comprises more general scenarios than the simple reward delay, including the case where the interaction is governed by a dynamics hidden to the observer. While this scenario can be indubitably modeled as a Partially Observable Markov Decision Process (POMDP, Astr om, 1965), the complexity of the framework and its general- Dynamical Linear Bandits ity are often not required to capture the main features of the problem. Indeed, for specific classes of problems, the Multi-Armed Bandit (MAB, Lattimore & Szepesv ari, 2020) literature has explored the possibility of experiencing delayed reward either assuming that the actual reward will be observed, individually, in the future (e.g., Joulani et al., 2013) or with the more realistic assumption that an aggregated feedback is available (e.g., Pike-Burke et al., 2018), with also specific applications to online advertising (Vernade et al., 2017). Although effective in dealing with delay effects and the possibility of a reward spread in the future (Cesa Bianchi et al., 2018), they do not account for the additional, more complex, dynamical effects, which can be regarded as the evolution of a hidden state. In this work, we take a different perspective. We propose to model the non-observable dynamical effects underlying the phenomena as a Linear Time-Invariant (LTI) system (Hespanha, 2018). In particular, the system is characterized by a hidden internal state xt (e.g., awareness) which evolves via linear dynamics fed by the action ut (e.g., amount invested) and affected by noise. At each round, the learner experiences a reward yt (e.g., conversions), which is a noisy observation that linearly combines the state xt and the action ut. Our goal consists in learning an optimal policy so as to maximize the expected cumulative reward. We call this setting Dynamical Linear Bandits (DLBs) that, as we shall see, reduces to linear bandits (Abbasi-Yadkori et al., 2011) when no dynamics are involved. Because of the dynamics, the effect of each action persists over time indefinitely but, under stability conditions, it vanishes asymptotically. This allows representing interference and synergy between platforms, thanks to the dynamic nature of the system. Contributions In Section 2, we introduce the Dynamical Linear Bandit (DLB) setting to represent sequential decision-making problems characterized by a hidden state that evolves linearly according to an unknown dynamics. We show that, under stability conditions, the optimal policy corresponds to playing the constant action that leads the system to the most profitable steady state. Then, we derive an expected regret lower bound of order Ω d ? 
T p1 ρq1{2 , being d the dimensionality of the action space and ρ ă 1 the spectral radius of the dynamical matrix of the system evolution law.1 In Section 3, we propose a novel optimistic regret minimization algorithm, Dynamical Linear Upper Confidence Bound (Dyn Lin-UCB), for the DLB setting. Dyn Lin-UCB takes inspiration from Lin-UCB but subdivides the optimization horizon T into increasing-length epochs. In each epoch, an action is selected optimistically and kept constant (i.e., persisted) so that the system approximately reaches the steady state. We provide a regret analysis for Dyn Lin-UCB showing that, under certain assumptions, 1The smaller ρ, the faster the system reaches its steady state. it enjoys r O d ? T p1 ρq3{2 expected regret. In Section 5, we provide a numerical validation, with both synthetic and realworld data, compared with bandit baselines. The proofs of all the results are reported in Appendix B. Notation Let a, b P N with a ď b, we introduce the symbols: Ja, b K : ta, . . . , bu, Jb K : J1, b K, and Ja, 8M ta, a 1, . . . u . Let x, y P Rn, we denote with xx, yy x Ty řn j 1 xiyi the inner product. For a positive semidefinite matrix A P Rnˆn, we denote with }x}2 A x TAx the weighted 2-norm. The spectral radius ρp Aq is the largest absolute value of the eigenvalues of A, the spectral norm }A}2 is the square root of the maximum eigenvalue of ATA. We introduce the maximum spectral norm to spectral radius ratio of the powers of A defined as Φp Aq supτě0 }Aτ}2{ρp Aqτ (Oymak & Ozay, 2019). We denote with In the identity matrix of order n and with 0n the vector of all zeros of dimension n. A random vector x P Rn is σ2-subgaussian, in the sense of Hsu et al. (2012), if for every vector ζ P Rn it holds that E rexp pxζ, xyqs ď expp}ζ}2 2σ2{2q. In this section, we introduce the Dynamical Linear Bandits (DLBs), the learner-environment interaction, assumptions, and regret (Section 2.1). Then, we derive a closed-form expression for the optimal policy for DLBs (Section 2.2). Finally, we derive a lower bound to the regret, highlighting the intrinsic complexities of the DLB setting (Section 2.3). 2.1. Problem Formulation In a Dynamical Linear Bandit (DLB), the environment is characterized by a hidden state, i.e., a n-dimensional real vector, initialized to x1 P X, where X Ď Rn is the state space. At each round t P N, the environment is in the hidden state xt P X, the learner chooses an action, i.e., a d-dimensional real vector ut P U, where U Ď Rd is the action space. Then, the learner receives a noisy reward yt xω, xty xθ, uty ηt P Y, where Y Ď R is the reward space, ω P Rn, θ P Rd are unknown parameters, and ηt is a zero-mean σ2 subgaussian random noise, conditioned to the past. Then, the environment evolves to the new state according to the unknown linear dynamics xt 1 Axt But ϵt, where A P Rnˆn is the dynamic matrix, B P Rnˆd is the action-state matrix, and ϵt is a zero-mean σ2 subgaussian random noise, conditioned to the past, independent of ηt.2 Remark 2.1. The setting proposed above is a particular case of a POMDP ( Astr om, 1965), in which the state xt is non-observable, while the learner accesses the noisy 2n is the order of the LTI system (Kalman, 1963). We make no assumption on the value of n and on its knowledge. Dynamical Linear Bandits observation yt that corresponds to the noisy reward too. Furthermore, the setting can be viewed as a MISO (Multiple Input Single Output) discrete-time LTI system (Kalman, 1963). 
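To make the interaction protocol of Section 2.1 concrete, here is a minimal Python sketch of a DLB simulator: the hidden state evolves as $x_{t+1} = A x_t + B u_t + \epsilon_t$, while the learner only observes the reward $y_t = \langle \omega, x_t\rangle + \langle \theta, u_t\rangle + \eta_t$. The class and its names are illustrative (not taken from the paper's code base), and Gaussian noise is used as one convenient subgaussian choice.

```python
import numpy as np

class DLBEnvironment:
    """Minimal Dynamical Linear Bandit simulator (illustrative sketch).

    Hidden state:    x_{t+1} = A x_t + B u_t + eps_t
    Observed reward: y_t = <omega, x_t> + <theta, u_t> + eta_t
    """

    def __init__(self, A, B, omega, theta, sigma=0.01, rng=None):
        self.A, self.B = np.asarray(A), np.asarray(B)
        self.omega, self.theta = np.asarray(omega), np.asarray(theta)
        self.sigma = sigma
        self.rng = rng if rng is not None else np.random.default_rng(0)
        self.x = np.zeros(self.A.shape[0])  # initial hidden state x_1

    def step(self, u):
        u = np.asarray(u)
        # noisy reward: linear in the (hidden) state and in the action
        y = self.omega @ self.x + self.theta @ u + self.sigma * self.rng.standard_normal()
        # hidden state transition; the learner never observes self.x
        self.x = self.A @ self.x + self.B @ u + self.sigma * self.rng.standard_normal(self.x.shape)
        return float(y)
```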
Finally, the DLB reduces to (non-contextual) linear bandit (Abbasi-Yadkori et al., 2011) when the hidden state does not affect the reward, i.e., when ω 0. Markov Parameters We revise a useful representation, that for every H P Jt K allows expressing yt in terms of the sequence of the most recent H 1 actions pusqs PJt H,t K, reward noise ηt, H state noises pϵsqs PJt H,t 1K, and starting state xt H (Ho & Kalman, 1966; Oymak & Ozay, 2019; Tsiamis & Pappas, 2019; Sarkar et al., 2021): s 0 xhtsu,ut sy loooooooomoooooooon action effect ωTAHxt H loooooomoooooon starting state s 1 ωTAs 1ϵt s looooooooooomooooooooooon where the sequence of vectors htsu P Rd for every s P N are called Markov parameters and are defined as: ht0u θ and htsu BTp As 1q Tω if s ě 1. Furthermore, we introduce the cumulative Markov parameters, defined for every s, s1 P N with s ď s1 as h Js,s1K řs1 l s htlu and the corresponding limit as s1 Ñ 8, i.e., h Js, 8M ř 8 l s htlu. Finally, we use the abbreviation h h J0, 8M θ BTp In Aq Tω. We will make use of the following standard assumption related to the stability of the dynamic matrix A, widely employed in discrete time LTI literature (Oymak & Ozay, 2019; Lale et al., 2020a;b). Assumption 2.1 (Stability). The spectral radius of A is strictly smaller than 1, i.e., ρp Aq ă 1, and the maximum spectral norm to spectral radius ratio of the powers of A is bounded, i.e., Φp Aq ă 8.3 Policies and Performance The learner s behavior is modeled via a deterministic policy π pπtqt PN defined, for every round t P N, as πt : Ht 1 Ñ U, mapping the history of observations Ht 1 pu1, y1, . . . , ut 1, yt 1q P Ht 1 to an action ut πtp Ht 1q P U, where Ht 1 p U ˆ Yqt 1 is the set of histories of length t 1. The performance of a policy π is evaluated in terms of the (infinite-horizon) expected average reward: Jpπq : lim inf HÑ 8 E xt 1 Axt But ϵt yt xω, xty xθ, uty ηt ut πtp Ht 1q , @t P N, where the expectation is taken w.r.t. the randomness of the state noise ϵt and reward noise ηt. If a policy π is constant, i.e., πtp Ht 1q u for every t P N, we abbreviate Jpuq 3The latter is a mild assumption: if A is diagonalizable as A QΛQ 1, then Φp Aq ď }Q}2}Q 1}2 and it is finite. In particular, if A is symmetric then Φp Aq 1. Jpπq. A policy π is an optimal policy if it maximizes the expected average reward, i.e., π P arg maxπ Jpπq, and its performance is denoted by J : Jpπ q. We further introduce the following assumption that requires the boundedness of the norms of the relevant quantities. Assumption 2.2 (Boundedness). There exist Θ, Ω, B, U ă 8 s.t.: }θ}2 ď Θ, }ω}2 ď Ω, }B}2 ď B, supu PU }u}2 ď U, and supx PX }x}2 ď X, supu PU |Jpuq| ď 1.4 Regret The regret suffered by playing a policy π, competing against the optimal infinite-horizon policy π over a learning horizon T P N is given by: Rpπ, Tq : TJ t 1 yt, (3) where yt is the sequence of rewards collected by playing π as in Equation (2). The goal of the learner consists in minimizing the expected regret E Rpπ, Tq, where the expectation is taken w.r.t. the randomness of the reward. 2.2. Optimal Policy In this section, we derive a closed-form expression for the optimal policy π for the infinite horizon objective function, as introduced in Equation (2). Theorem 2.1 (Optimal Policy). Under Assumptions 2.1 and 2.2, an optimal policy π maximizing the (infinitehorizon) expected average reward Jpπq (Equation 2), for every round t P N and history Ht 1 P Ht 1 is given by: π t p Ht 1q u where u Pargmax u PU Jpuq xh,uy. (4) Some remarks are in order. 
The optimal policy plays the constant action u P U which brings the system in the most profitable steady-state.5 Indeed, the expression xh, uy can be rewritten expanding the cumulative Markov parameter as pθT ωTp In Aq 1Bqu and x p In Aq 1Bu is the expression of the steady state x Ax Bu , when applying action u . It is worth noting the role of Assumption 2.1 which guarantees the existence of the inverse p In Aq 1. In this sense, our problem shares the constant nature of the optimal policy with the linear bandit setting (Abbasi-Yadkori et al., 2011), although ours is characterized by an evolving state, which introduces a new tradeoff in the action selection. From the LTI system perspective, this implies that we can restrict to open-loop stationary policies. The reason why DLBs do not benefit from closed-loop policies, differently from other classical problems, such as 4The assumption of the bounded state norm }x}2 ď X holds whenever the state noise ϵ is bounded. As shown by Agarwal et al. (2019), this assumption can be relaxed, for unbounded subgaussian noise, by conditioning to the event that none of the noise vectors are ever large at the cost of an additional log T factor in the regret. 5In Appendix C, we show that the optimal policy is non stationary for the finite horizon case. Dynamical Linear Bandits the LQG (Abbasi-Yadkori & Szepesv ari, 2011), lies in the linearity of the reward yt and in the additive noise ηt and ϵt, making their presence irrelevant (in expectation) for control purposes. Nonetheless, as we shall see, our problem poses additional challenges compared to linear bandits since, in order to assess the quality of an action u P U, instantaneous rewards are not reliable, and we need to let the system evolve to the steady state and, only then, observe the reward. 2.3. Regret Lower Bound In this section, we provide a lower bound to the expected regret that any learning algorithm suffers when addressing the learning problem in a DLB. Theorem 2.2 (Lower Bound). For any policy π (even stochastic), there exists a DLB fulfilling Assumptions 2.1 and 2.2, such that for sufficiently large T ě O d2 1 ρp Aq , policy π suffers an expected regret lower bounded by: ERpπ, Tq ě Ω p1 ρp Aqq 1 2 The lower bound highlights the main challenges of the DLB learning problem. First of all, we observe a dependence on 1{p1 ρp Aqq, being ρp Aq the spectral radius of the matrix A. This is in line with the intuition that, as ρp Aq approaches 1, the problem becomes more challenging. Furthermore, we note that when ρp Aq 0, i.e., the problem has no dynamical effects, the lower bound matches the one of linear bandits (Lattimore & Szepesv ari, 2020). It is worth noting that, for technical reasons, the result of Theorem 2.2 is derived under the assumption that, at every round t P JTK, the agent observes both the state xt and the reward yt (see Appendix B). Clearly, this represents a simpler setting w.r.t. DLBs (in which xt is hidden) and, consequently, Theorem 2.2 is a viable lower bound for DLBs too. 3. Algorithm In this section, we present an optimistic regret minimization algorithm for the DLB setting. 
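To make Theorem 2.1 operational: the cumulative Markov parameter is $h = \theta + B^\top (I_n - A)^{-\top}\omega$, the optimal constant action maximizes $\langle h, u\rangle$, and persisting it drives the system to the steady state $x^* = (I_n - A)^{-1} B u^*$. The sketch below (restricted, only for this illustration, to a finite set of candidate actions) computes these quantities directly.

```python
import numpy as np

def cumulative_markov_parameter(A, B, omega, theta):
    """h = theta + B^T (I - A)^{-T} omega; well defined when rho(A) < 1 (Assumption 2.1)."""
    n = A.shape[0]
    z = np.linalg.solve((np.eye(n) - A).T, omega)  # z = (I - A)^{-T} omega, no explicit inverse
    return theta + B.T @ z

def optimal_constant_action(A, B, omega, theta, candidate_actions):
    """Optimal policy of Theorem 2.1 over a finite candidate action set."""
    h = cumulative_markov_parameter(A, B, omega, theta)
    u_star = max((np.asarray(u, dtype=float) for u in candidate_actions),
                 key=lambda u: float(h @ u))
    x_star = np.linalg.solve(np.eye(A.shape[0]) - A, B @ u_star)  # steady state x* = (I - A)^{-1} B u*
    return u_star, x_star, float(h @ u_star)  # action, steady state, J(u*) = <h, u*>
```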
Dynamical Linear Upper Confidence Bound (Dyn Lin-UCB), whose pseudocode is reported in Algorithm 1, requires the knowledge of an upperbound ρ ă 1 on the spectral radius of the dynamic matrix A (i.e., ρp Aq ď ρ) and on the maximum spectral norm to spectral radius ratio Φ ă 8 (i.e., Φp Aq ď Φ), as well as the bounds on the relevant quantities of Assumption 2.2.6 6As an alternative, one can consider a more demanding requirement of the knowledge of a bound on the spectral norm }A}2 of A. Similar assumptions regarding the knowledge of analogous quantities are considered in the literature, e.g., decay of Markov operator norms (Simchowitz et al., 2020) and strong stability (Plevrakis & Hazan, 2020), spectral norm bound (Lale et al., 2020a). As a side note, the knowledge of ρ ě ρp Aq (or an equivalent quantity) is proved to be unavoidable by Theorem 2.2. Indeed, if no restriction Dyn Lin-UCB is based on the following simple observation. To assess the quality of action u P U, we need to persist in applying it so that the system approximately reaches the corresponding steady state and, then, observe the reward yt, representing a reliable estimate of Jpuq xh, uy. We shall show that, under Assumption 2.1, the number of rounds needed to approximately reach such a steady state is logarithmic in the learning horizon T and depends on the upper bound of the spectral norm ρ. After initializing the Gram matrix V0 λId and the vectors b0 and ph0 both to 0d (line 1), Dyn Lin-UCB subdivides the learning horizon T in M ď T epochs. Each epoch m P JMK is composed of Hm 1 rounds, where Hm tlog m{ logp1{ρqu is logarithmic in the epoch index m. At the beginning of each epoch, m P JMK, Dyn Lin-UCB computes the upper confidence bound (UCB) index (line 4) defined for every u P U as: UCBtpuq : xpht 1, uy βt 1 }u}V 1 t 1 , (5) where pht 1 V 1 t 1bt 1 is the Ridge regression estimator of the cumulative Markov parameter h, as in Equation (4) and βt 1 ě 0 is an exploration coefficient to be defined later. Similar to Lin-UCB (Abbasi-Yadkori et al., 2011), the index UCBtpuq is designed to be optimistic, i.e., Jpuq ď UCBtpuq in high-probability for all u P U. Then, the optimistic action ut P arg maxu PU UCBtpuq is executed (line 6) and persisted for the next Hm rounds (lines 811). The length of the epoch Hm is selected such that, under Assumption 2.1, the system has approximately reached the steady state after Hm 1 rounds. In this way, at the end of epoch m, the reward yt is an almost-unbiased sample of the steady-state performance Jputq. This sample is employed to update the Gram matrix estimate Vt and the vector bt (line 13), while the samples collected in the previous Hm rounds are discarded (line 9). It is worth noting that by setting Hm 0 for all m P JMK, Dyn Lin-UCB reduces to Lin-UCB. The following sections provide the concentration of the estimator pht 1 of h (Section 3.1) and the regret analysis of Dyn Lin-UCB (Section 3.2). 3.1. Self-Normalized Concentration Inequality for the Cumulative Markov Parameter In this section, we provide a self-normalized concentration result for the estimate pht of the cumulative Markov parameter h. For every epoch m P JMK, we denote with tm the last round of epoch m: t0 0 and tm tm 1 1 Hm. At the end of each epoch m, we solve the Ridge regression problem, defined for every round t P JTK as: pht argmin rh PRd l PJMK:tlďtm pytl xrh,utlyq2 λ rh 2 2 V 1 t bt. 
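The epoch-based loop just described can be summarized by the Python sketch below, which mirrors Algorithm 1: at the start of each epoch the optimistic action is selected, it is then persisted for $H_m = \lfloor \log m / \log(1/\rho)\rfloor$ further rounds whose samples are discarded, and only the final (approximately steady-state) reward updates the Ridge statistics. Here `rho` denotes the known upper bound on the spectral radius; the finite candidate action set and the constant exploration coefficient `beta` are simplifications for illustration, whereas the paper prescribes the round-dependent $\beta_{t-1}$ of Equation (6).

```python
import math
import numpy as np

def dyn_lin_ucb(env, actions, T, rho, lam=1.0, beta=1.0):
    """Illustrative sketch of Dyn Lin-UCB (Algorithm 1) over a finite action set."""
    d = len(actions[0])
    V, b = lam * np.eye(d), np.zeros(d)
    rewards, t, m = [], 0, 0
    while t < T:
        m += 1
        h_hat = np.linalg.solve(V, b)                  # Ridge estimate of h
        V_inv = np.linalg.inv(V)
        def ucb(u):                                    # optimistic index, Equation (5)
            return float(h_hat @ u + beta * math.sqrt(u @ V_inv @ u))
        u = max((np.asarray(a, dtype=float) for a in actions), key=ucb)
        H_m = int(math.log(m) / math.log(1.0 / rho)) if rho > 0 else 0
        for _ in range(H_m + 1):                       # play u and persist it for H_m more rounds
            if t >= T:
                break
            y = env.step(u)
            rewards.append(y)
            t += 1
        else:
            V += np.outer(u, u)                        # only the last reward of the epoch is used;
            b += y * u                                 # the previous H_m samples are discarded
    return rewards
```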
on ρp Aq is enforced (i.e., just ρp Aq ă 1), one can always consider the DLB in which ρp Aq 1 1{T ă 1 making the regret lower bound degenerate to linear. Dynamical Linear Bandits Algorithm 1: Dyn Lin-UCB. Input :Regularization parameter λ ą 0, exploration coefficients pβt 1qt PJT K, spectral radius upper bound 0 ď ρ ă 1 1 Initialize t Ð 1, V0 λId, b0 0d, ph0 0d, 2 Define M mint M 1 PN :řM1 m 1 1 t log m logp1{ρquąTu 1 3 for m P JMK do 4 Compute ut P arg maxu PU UCBtpuq 5 where UCBtpuq: xpht 1,uy βt 1 }u}V 1 t 1 6 Play arm ut and observe yt 7 Define Hm t log m logp1{ρqu 8 for j P JHm K do 9 Update Vt Vt 1, bt bt 1 10 t Ð t 1 11 Play arm ut ut 1 and observe yt 13 Update Vt Vt 1 utu T t, bt bt 1 utyt 14 Compute pht V 1 t bt 15 t Ð t 1 We now present the following self-normalized maximal concentration inequality and, then, we compare it with the existing results in the literature. Theorem 3.1 (Self-Normalized Concentration). Let pphtqt PN be the sequence of solutions of the Ridge regression problems of Algorithm 1. Then, under Assumption 2.1 and 2.2, for every λ ě 0 and δ P p0, 1q, with probability at least 1 δ, simultaneously for all rounds t P N, it holds that: pht h Vt ď c1 ? λ logpept 1qq c2 ? 2rσ2 ˆ log ˆ1 2 log ˆdet p Vtq where c1 UΩΦp Aq UB 1 ρp Aq X , c2 Θ ΩBΦp Aq and rσ2 σ2 1 Ω2Φp Aq2 First, we note that when Ω 0 (ω 0n), i.e., the state does not affect the reward, the bound perfectly reduces to the selfnormalized concentration used in linear bandits (Abbasi Yadkori et al., 2011, Theorem 1). In particular, we recognize the second term due to the regularization parameter λ ą 0 and the third one, which involves the subgaussianity parameter rσ2, related to the joint contribution of the state and reward noises. Furthermore, the first term is an additional bias that derives from the epochs of length Hm 1. The choice of the value Hm represents one of the main technical novelties that, on the one hand, leads to a bias that conveniently grows logarithmically with t and, on the other hand, can be computed without the knowledge of T. It is worth looking at our result from the perspective of learning the LTI system parameters. We can compare our Theorem 3.1 with the concentration presented in (Lale et al., 2020a, Appendix C), which represents, to the best of our knowledge, the only result for the closed-loop identification of LTI systems with non-observable states. First, note that, although we focus on a MISO system (yt is a scalar, being our reward), extending our estimator to multiple-outputs (MIMO) is straightforward. Second, the approach of (Lale et al., 2020a) employs the predictive form of the LTI system to cope with the correlation introduced by closed-loop control. This choice allows for convenient analysis of the estimated Markov parameters of the predictive form. However, recovering the parameters of the original system requires an application of the Ho-Kalman method (Ho & Kalman, 1966) which, unfortunately, does not preserve the concentration properties in general, but only for persistently exciting actions. Our method, instead, forces to play an open-loop policy within a single epoch (each with logarithmic duration), while the overall behavior is closed-loop, as the next action depends on the previous-epoch estimates. In this way, we are able to provide a concentration guarantee on the parameters of the original system without assuming additional properties on the action signal. 3.2. 
Regret Analysis In this section, we provide the analysis of the regret of Dyn Lin-UCB, when we select the exploration coefficient βt based on the knowledge of the upper bounds ρ ă 1, Φ ă 8, and those specified in Assumption 2.2, defined for every round t P JTK as: λ logpept 1qq c2 ? 2σ2 ˆ log ˆ1 2 log ˆ 1 t U 2 where c1 UΩΦ UB 1 ρ X , c2 Θ ΩBΦ 1 ρ , and σ2 1 ρ2 . The following result provides the bound on the expected regret of Dyn Lin-UCB. Theorem 3.2 (Upper Bound). Under Assumptions 2.1 and 2.2, selecting βt as in Equation (6) and δ 1{T, Dyn Lin-UCB suffers an expected regret bounded as (highlighting the dependencies on T, ρ, d, and σ only): E RpπDyn Lin-UCB, Tq ď O Tplog Tq 3 2 1 ρ d Tplog Tq2 p1 ρq 3 2 1 p1 ρp Aqq2 Proof Sketch. The analysis of Dyn Lin-UCB poses additional challenges compared to that of Lin-UCB (Abbasi Yadkori et al., 2011) because of the dynamic effects of the hidden state. The idea behind the proof is to first derive Dynamical Linear Bandits a bound on a different notion of regret, i.e., the offline regret: Roffpπ,Tq TJ řT t 1 Jputq, that compares J with the steady-state performance Jputq of the action ut πtp Ht 1q (Theorem B.2). This analysis of Roffpπ,Tq can be comfortably carried out, by adopting a proof strategy similar to that of Lin-UCB. However, when applying action ut, the DLB does not immediately reach the performance Jputq as the expected reward Eryts experiences a transitional phase before converging to the steady state. Under stability (Assumption 2.1), it is possible to show that the expected offline regret and the expected regret differ by a constant: |ERpπ,Tq ERoffpπ,Tq|ďOp1{p1 ρp Aqq2q (Lemma B.1). Some observations are in order. We first note a dependence on the term 1{p1 ρq, which, in turn, depends on the upper bound ρ of the spectral gap ρp Aq. If the system does not display a dynamics, i.e., we can set ρ 0, we obtain a regret bound that, apart from logarithmic terms, coincides with that of Lin-UCB, i.e., r Opdσ ? Tq. Instead, for slowconverging systems, i.e., ρ 1, the regret bound enlarges, as expected. Clearly, a value of ρ too large compared to the optimization horizon T (e.g., ρ 1 1{T 1{3) makes the regret bound degenerate to linear. This is a case in which the underlying system is so slow that the whole horizon T is insufficient to approximately reach the steady state. Third, the regret bound is the sum of three components: the first one depends on the subgaussian proxy σ and is due to the noisy estimation of the relevant quantities; the second one is a bias due to the epoch-based structure of Dyn Lin-UCB; finally, the third one is constant (does not depend on T) accounts for the time needed to reach the steady state. Remark 3.1 (Regret upper bound (Theorem 3.2) and lower bound (Theorem 2.2) Comparison). Apart from logarithmic terms, we notice a tight dependence on d and on T. Instead, concerning the spectral properties of A, in the upper bound, we experience a dependence on 1{p1 ρq raised to a higher power (either 1 for the term multiplied by d and 3{2 for the term multiplied by ? d) w.r.t. the exponent appearing in the lower bound (i.e., 1{2). It is currently an open question whether the lower bound is not tight (which is obtained for a simpler setting in which the state is observable xt) or whether more efficient algorithms for DLBs can be designed. Furthermore, Theorem 3.2 highlights the impact of the upper bound ρ compared with the true ρp Aq. 4. 
Related Works In this section, we survey and compare the literature with a particular focus on bandits with delayed, aggregated, and composite feedback (Joulani et al., 2013) and online control for Linear Time-Invariant (LTI) systems (Hespanha, 2018). Additional related works are reported in Appendix A. Bandits with Delayed/Aggregated/Composite Feedback The Multi-Armed Bandit setting has been widely employed as a principled approach to address sequential decision-making problems (Lattimore & Szepesv ari, 2020). The possibility of experiencing delayed rewards has been introduced by Joulani et al. (2013) and widely exploited in advertising applications (Chapelle, 2014; Vernade et al., 2017). A large number of approaches have extended this setting either considering stochastic delays (Vernade et al., 2020), unknown delays (Li et al., 2019; Lancewicki et al., 2021), arm-dependent delays (Manegueu et al., 2020), nonstochastic delays (Ito et al., 2020; Thune et al., 2019; Jin et al., 2022). Some methods relaxed the assumption that the individual reward is revealed after the delay expires, admitting the possibility of receiving anonymous feedback, which can be aggregated (Pike-Burke et al., 2018; Zhang et al., 2021) or composite (Cesa-Bianchi et al., 2018; Garg & Akash, 2019; Wang et al., 2021). Most of these approaches are able to achieve r Op ? Tq regret, plus additional terms depending on the extent of the delay. In our DLBs, the reward is generated over time as a combined effect of past and present actions through a hidden state, while these approaches generate the reward instantaneously and reveal it (individually or in aggregate) to the learner in the future and no underlying state dynamics is present. Online Control of Linear Time-Invariant Systems The particular structure imposed by linear dynamics makes our approach comparable to LTI online control for partially observable systems (e.g., Lale et al., 2020b; Simchowitz et al., 2020; Plevrakis & Hazan, 2020). While the dynamical model is similar, in online control of LTI systems, the perspective is quite different. Most of the works either consider the Linear Quadratic Regulator (Mania et al., 2019; Lale et al., 2020b) or (strongly) convex objective functions (Mania et al., 2019; Simchowitz et al., 2020; Lale et al., 2020a), achieving, in most of the cases r Op ? Tq regret for strongly convex functions and r Op T 2{3q for convex functions. Recently, r Op ? Tq regret rate has been obtained for convex function too, by means of geometric exploration methods (Plevrakis & Hazan, 2020). Compared to Dyn Lin-UCB, the algorithm of Plevrakis & Hazan (2020) considers general convex costs but assumes the observability of the state and limits to the class of disturbance response controllers (Li & Bosch, 1993) that do not include the constant policy. Moreover, the regret bound of Plevrakis & Hazan (2020) differs from Theorem 3.2, as it shows a cubic dependence on the system order7 and an implicit nontrivial dependence on the dynamic matrix A. Instead, our Theorem 3.2 is remarkably independent of the system order n. Furthermore, Lale et al. (2020a) reach Oplogp Tqq regret in the case of strongly convex cost functions compet- 7This holds for known cost functions. Instead, for unknown costs, the exponent becomes 24 (Plevrakis & Hazan, 2020). Dynamical Linear Bandits ing against the best persistently exciting controller (i.e., a controller implicitly maintaining a non-null exploration). 
Some approaches are designed to deal with adversarial noise (Simchowitz et al., 2020). All of these solutions, however, look for the best closed-loop controller within a specific class, e.g., disturbance response control (Li & Bosch, 1993). These controllers, however, do not allow us to easily incorporate constraints on the action space, which could be of crucial importance in practice, e.g., in advertising domains. Dyn Lin-UCB works with an arbitrary action space and, thanks to the linearity of the reward, does not require complex closed-loop controllers. 5. Numerical Simulations In this section, we provide numerical validations of Dyn Lin-UCB in both a synthetic scenario and a domain obtained from real-world data. The goal of these simulations is to highlight the behavior of Dyn Lin-UCB in comparison with bandit baselines, describing advantages and disadvantages. The first experiment is a synthetic setting in which we can evaluate the performances of all the solutions and the sensitivity of Dyn Lin-UCB w.r.t. the ρ parameter (Section 5.1). Then, we show a comparison in a DLB scenario retrieved from real-world data (Section 5.2). The code of the experiments can be found at https://github.com/marcomussi/DLB. Details and additional experiments can be found in Appendix E. Baselines We consider as main baseline Lin-UCB (Abbasi Yadkori et al., 2011), designed for linear bandits. We include Exp3 (Auer et al., 1995) usually employed in (nonadaptive) adversarial settings, and its extension to k-length memory (adaptive) adversaries Exp3-k by Dekel et al. (2012).8 Additionally, we perform a comparison with algorithms for regret minimization in non-stationary environments: D-Lin-UCB (Russac et al., 2019), an extension of Lin-UCB for non-stationary settings, and AR2 (Chen et al., 2021), a bandit algorithm for processes presenting temporal structure. Lastly, in the case of real-world data, we compare our solution with a human-expert policy (Expert). This policy is directly generalized from the original dataset by learning via regression the average budget allocation over all platforms from the available data. For the baselines which do not support vectorial actions, we perform a discretization of the action space U that surely contains optimal action. Concerning the hyperparameters of the baselines, whenever possible, they are selected as in the respective original papers. The experiments are presented with a regularization parameter λ P t1, log T} for the algorithms which require it (i.e., Dyn Lin-UCB, Lin-UCB, and 8k is proportional to tlog M{ logp1{ρqu. In Appendix A.3 we elaborate on the use of adversarial bandit algorithms for DLBs. D-Lin-UCB).9 Further information about the hyperparameters of the baselines and the adopted optimistic exploration bounds are presented in Appendix E.1. 5.1. Synthetic Data Setting We consider a DLB defined by the following matrices A diagpp0.2, 0, 0.1qq, B diagpp0.25, 0, 0.1qq, θ p0, 0.5, 0.1q T, ω p1, 0, 0.1q T and a Gaussian noise with σ 0.01 (diagonal covariance matrix for the state noise).10 This way, the spectral gap of the dynamical matrix is ρp Aq 0.2 and Φp Aq 1. Moreover, the cumulative Markov parameter is given by h p0.56, 0.5, 0.11q T. We consider the action space U tpu1, u2, u3q T P r0, 1s3 with u1 u2 u3 ď 1.5u that simulates a total budget of 1.5 to be allocated to the three platforms. 
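Since the objective $\langle h, u\rangle$ is linear and the action space is a box intersected with a budget constraint, the optimal allocation can be computed greedily in decreasing order of the components of $h$. The short check below (using the value $h = (0.56, 0.5, 0.11)^\top$ reported above; the helper function name is ours) recovers the optimal action and its value discussed next.

```python
import numpy as np

def best_budget_allocation(h, budget=1.5, box_upper=1.0):
    """Maximize <h, u> subject to 0 <= u_i <= box_upper and sum(u) <= budget.

    For a linear objective, filling components in decreasing order of h is optimal.
    """
    h = np.asarray(h, dtype=float)
    u = np.zeros_like(h)
    remaining = budget
    for i in np.argsort(-h):            # most valuable component first
        if h[i] <= 0 or remaining <= 0:
            break
        u[i] = min(box_upper, remaining)
        remaining -= u[i]
    return u, float(h @ u)

u_star, J_star = best_budget_allocation([0.56, 0.5, 0.11])
print(u_star, J_star)   # -> [1.  0.5 0. ] and ~0.81, matching the u* and J* discussed below
```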
Given this setup, a myopic agent would simply look at how the action immediately propagates to the reward through $\theta$ and would invest the budget in the second component of the action, which is weighted by 0.5. Instead, a far-sighted agent, aware of the system evolution, will look at the cumulative Markov parameter $h$, realizing that the most convenient action is investing in the first component, weighted by 0.56. Therefore, the optimal action is $u^* = (1, 0.5, 0)^\top$, leading to $J^* = 0.81$.

Comparison with the bandit baselines. Figure 1 shows the performance in terms of cumulative regret of Dyn Lin-UCB, Lin-UCB, D-Lin-UCB, AR2, Exp3, and Exp3-k. The experiments are conducted over a time horizon of 1 million rounds. For Dyn Lin-UCB, we employed, for the sake of this experiment, the true value of the spectral radius, i.e., $\rho = \rho(A) = 0.2$. First of all, we observe that both Exp3 and Exp3-k suffer a significantly large cumulative regret. Similar behavior is displayed by AR2. Moreover, all the versions of Lin-UCB and D-Lin-UCB suffer linear regret. The best performance of D-Lin-UCB is obtained when the discount factor $\gamma$ is close to 1 (the weights take the form $w_t = \gamma^{-t}$), and its behavior is comparable with that of Lin-UCB. Even for a quite fast system ($\rho(A) = 0.2$), ignoring the system dynamics and the presence of the hidden state has made both Lin-UCB and D-Lin-UCB commit (in their best version, with $\lambda = \log T$) to the sub-optimal (myopic) action $u = (0.5, 1, 0)^\top$ with performance $J = 0.78 < J^*$, with also a relevant variance. On the other hand, Dyn Lin-UCB is able to maintain a smaller and stable (variance is negligible) sublinear regret in both its versions, with a notable advantage when using $\lambda = \log T$.

9 For Dyn Lin-UCB, $\log T$ is a nearly optimal choice for $\lambda$, as can be seen by looking at the first two addenda of the exploration factor in Equation (6).

10 It is worth noting that the decision of using diagonal matrices is just for explanation purposes and w.l.o.g. (at least within the class of diagonalizable dynamic matrices). Indeed, we are just interested in the cumulative Markov parameter $h$, and we could have obtained the same results with an equivalent (non-diagonal) representation, by applying an invertible transformation $T$ as $A' = TAT^{-1}$, $\omega' = T^{-\top}\omega$, and $B' = TB$.

Figure 1. Cumulative regret as a function of the rounds comparing Dyn Lin-UCB and the other bandit baselines (50 runs, mean ± std).

Figure 2. Cumulative regret as a function of the rounds comparing Lin-UCB and Dyn Lin-UCB with $\lambda = \log T$, varying the upper bound on the spectral radius $\rho$ (curves for $\rho \in \{\rho(A), 0.4, 0.1, 0.05, 0\}$; 50 runs, mean ± std).

Figure 3. Cumulative regret for Dyn Lin-UCB, the other bandit baselines, and the Expert in the system generalized from real-world data (50 runs, mean ± std).

Sensitivity to the Choice of $\rho$. The upper bound $\rho$ of the spectral radius $\rho(A) = 0.2$ represents a crucial parameter of Dyn Lin-UCB. While an overestimation $\rho \gg \rho(A)$ does not compromise the regret rate but tends to slow down the convergence process, a severe underestimation $\rho \ll \rho(A)$ might prevent learning at all. In Figure 2, we test Dyn Lin-UCB against a misspecification of $\rho$, when $\lambda = \log T$; the short sketch below shows how $\rho$ controls the epoch lengths $H_m$.
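As anticipated above, the practical effect of the chosen upper bound $\rho$ is on the epoch lengths: with $H_m = \lfloor \log m / \log(1/\rho)\rfloor$, a larger $\rho$ means waiting longer before each update (slower, but the observed reward is closer to its steady-state value), while $\rho = 0$ collapses every epoch to a single round. A minimal sketch tabulating $H_m$ for the values of $\rho$ used in Figure 2 (the helper is ours, for illustration only):

```python
import math

def epoch_length(m, rho):
    """H_m = floor(log(m) / log(1 / rho)); taken as 0 when rho = 0 (no persistence)."""
    if rho <= 0.0:
        return 0
    return int(math.log(m) / math.log(1.0 / rho))

for rho in (0.0, 0.05, 0.1, 0.2, 0.4):   # 0.2 is the true spectral radius rho(A) here
    print(rho, [epoch_length(m, rho) for m in (1, 10, 100, 1000, 10000)])
```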
We can see that by considering ρ 2ρp Aq, Dyn Lin-UCB experiences a larger regret but still sublinear and smaller w.r.t. Lin-UCB with λ log T. Even by reducing ρ P t0.1, 0.05u, Dyn Lin-UCB is able to keep the regret sublinear, showing remarkable robustness to misspecification. Clearly, setting ρ 0 makes the regret almost degenerate to linear. 5.2. Real-world Data We present an experimental evaluation based on realworld data coming from three web advertising platforms (Facebook, Google, and Bing), related to several campaigns for an invested budget of 5 Million EUR over 2 years. Starting from such data, we learn the best DLB model by means of a specifically designed variant of the Ho-Kalman algorithm (Ho & Kalman, 1966).11 We used the learned model to build up a simulator. The resulting system has ρp Aq 0.67. We evaluate Dyn Lin-UCB against the baselines for T 106 steps over 50 runs. Results Figure 3 shows the results in terms of cumulative regret. It is worth noting that no algorithm, except for Dyn Lin-UCB, is able to converge to the optimal choice. Indeed, they immediately commit to a sub-optimal solution. 11See Appendix D. Dyn Lin-UCB, instead, shows a convergence trend towards the optimal policy over time for both λ 1 and λ log T, even if the best-performing version is the one which employs λ log T. The Expert, which has a preference towards maximizing the instantaneous effect of the actions only and does not take into account correlations between platforms, displays a sub-optimal performance. 6. Discussion and Conclusions In this paper, we have introduced the Dynamical Linear Bandits (DLBs), a novel model to represent sequential decisionmaking problems in which the system is characterized by a non-observable hidden state that evolves according to linear dynamics and by an observable noisy reward that linearly combines the hidden state and the action played. This model accounts for scenarios that cannot be easily represented by existing bandit models that consider delayed and aggregated feedback. We have derived a regret lower bound that highlights the main complexities of the DLB problem. Then, we have proposed a novel optimistic regret minimization approach, Dyn Lin-UCB, that, under stability assumption, is able to achieve sub-linear regret. The numerical simulation in both synthetic and real-world domains succeeded in showing that, in a setting where the baselines mostly suffer linear regret, our algorithm consistently enjoys sublinear regret. Furthermore, Dyn Lin-UCB proved to be robust to misspecification of its most relevant hyper-parameter ρ. To the best of our knowledge, this is the first work addressing this family of problems, characterized by hidden linear dynamics, with a simple, yet effective, bandit-like approach. Short-term future directions include efforts in closing the gap between the regret lower and upper bounds. Long-term future directions should focus on extending the present approach to non-linear system dynamics and embedding in the algorithm additional budget constraints enforced over the optimization horizon. Dynamical Linear Bandits Acknowledgements This paper is supported by PNRR-PE-AI FAIR project funded by the Next Generation EU program. Abbasi-Yadkori, Y. and Szepesv ari, C. Regret bounds for the adaptive control of linear quadratic systems. In The 24th Annual Conference on Learning Theory, pp. 1 26, 2011. Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. 
In Advances in Neural Information Processing Systems, pp. 2312 2320, 2011. Agarwal, N., Hazan, E., and Singh, K. Logarithmic regret for online control. In Advances in Neural Information Processing Systems, pp. 10175 10184, 2019. Astr om, K. J. Optimal control of markov processes with incomplete state information. Journal of mathematical analysis and applications, 10(1):174 205, 1965. Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th annual foundations of computer science, pp. 322 331. IEEE, 1995. Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48 77, 2002. Bacchiocchi, F., Genalti, G., Maran, D., Mussi, M., Restelli, M., Gatti, N., and Metelli, A. M. Autoregressive bandits. Co RR, abs/2212.06251, 2022. Berman, R. Beyond the last touch: Attribution in online advertising. Marketing Science, 37(5):771 792, 2018. Cesa-Bianchi, N., Gentile, C., and Mansour, Y. Nonstochastic bandits with composite anonymous feedback. In Conference On Learning Theory, pp. 750 773, 2018. Chapelle, O. Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1097 1105. Association for Computing Machinery, 2014. Chen, Q., Golrezaei, N., and Bouneffouf, D. Dynamic bandits with temporal structure. Available at SSRN 3887608, 2021. Court, D., Elzinga, D., Mulder, S., and Vetvik, O. J. The consumer decision journey. Mc Kinsey Quarterly, 3:96 107, 2009. Dekel, O., Tewari, A., and Arora, R. Online bandit learning against an adaptive adversary: from regret to policy regret. In International Conference on Machine Learning, 2012. Garg, S. and Akash, A. K. Stochastic bandits with delayed composite anonymous feedback. Co RR, abs/1910.01161, 2019. Gur, Y., Zeevi, A., and Besbes, O. Stochastic multi-armedbandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pp. 199 207, 2014. Hespanha, J. P. Linear Systems Theory: Second Edition. Princeton University Press, 2018. Ho, B. L. and Kalman, R. E. Effective construction of linear state-variable models from input/output functions. at-Automatisierungstechnik, 14(1-12):545 548, 1966. Hoban, P. R. and Bucklin, R. E. Effects of internet display advertising in the purchase funnel: Model-based insights from a randomized field experiment. Journal of Marketing Research, 52(3):375 393, 2015. Hsu, D., Kakade, S., and Zhang, T. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17:1 6, 2012. Isom, J. D., Meyn, S. P., and Braatz, R. D. Piecewise linear dynamic programming for constrained pomdps. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, pp. 291 296. AAAI Press, 2008. Ito, S., Hatano, D., Sumita, H., Takemura, K., Fukunaga, T., Kakimura, N., and Kawarabayashi, K. Delay and cooperation in nonstochastic linear bandits. In Advances in Neural Information Processing Systems, 2020. Jin, T., Lancewicki, T., Luo, H., Mansour, Y., and Rosenberg, A. Near-optimal regret for adversarial MDP with delayed bandit feedback. Co RR, abs/2201.13172, 2022. Joulani, P., Gy orgy, A., and Szepesv ari, C. Online learning under delayed feedback. In Proceedings of the 30th International Conference on Machine Learning, pp. 1453 1461, 2013. Kalman, R. E. 
Mathematical description of linear dynamical systems. Journal of the Society for Industrial and Applied Mathematics, Series A: Control, 1(2):152 192, 1963. Kim, D., Lee, J., Kim, K., and Poupart, P. Point-based value iteration for constrained pomdps. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, pp. 1968 1974, 2011. Dynamical Linear Bandits Lale, S., Azizzadenesheli, K., Hassibi, B., and Anandkumar, A. Logarithmic regret bound in partially observable linear dynamical systems. In Advances in Neural Information Processing Systems, 2020a. Lale, S., Azizzadenesheli, K., Hassibi, B., and Anandkumar, A. Regret minimization in partially observable linear quadratic control. Co RR, abs/2002.00082, 2020b. Lancewicki, T., Segal, S., Koren, T., and Mansour, Y. Stochastic multi-armed bandits with unrestricted delay distributions. In Proceedings of the 38th International Conference on Machine Learning, pp. 5969 5978, 2021. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Li, B., Chen, T., and Giannakis, G. B. Bandit online learning with unknown delays. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 993 1002, 2019. Li, H. X. and Bosch, P. P. J. V. D. A robust disturbancebased control and its application. International Journal of Control, 58(3):537 554, 1993. Manegueu, A. G., Vernade, C., Carpentier, A., and Valko, M. Stochastic bandits with arm-dependent delays. In Proceedings of the 37th International Conference on Machine Learning, pp. 3348 3356, 2020. Mania, H., Tu, S., and Recht, B. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, pp. 10154 10164, 2019. Nobari, S. DBA: dynamic multi-armed bandit algorithm. In The Thirty-Third AAAI Conference on Artificial Intelligence, pp. 9869 9870, 2019. Oymak, S. and Ozay, N. Non-asymptotic identification of LTI systems from a single trajectory. In 2019 American Control Conference, pp. 5655 5661, 2019. Pike-Burke, C., Agrawal, S., Szepesv ari, C., and Gr unew alder, S. Bandits with delayed, aggregated anonymous feedback. In Proceedings of the 35th International Conference on Machine Learning, pp. 4102 4110, 2018. Plevrakis, O. and Hazan, E. Geometric exploration for online control. In Advances in Neural Information Processing Systems, 2020. Russac, Y., Vernade, C., and Capp e, O. Weighted linear bandits for non-stationary environments. In Advances in Neural Information Processing Systems, pp. 12017 12026, 2019. Sarkar, T., Rakhlin, A., and Dahleh, M. A. Finite time LTI system identification. J. Mach. Learn. Res., 22:26:1 26:61, 2021. Simchowitz, M., Singh, K., and Hazan, E. Improper learning for non-stochastic control. In Conference on Learning Theory, volume 125, pp. 3320 3436. PMLR, 2020. Thune, T. S., Cesa-Bianchi, N., and Seldin, Y. Nonstochastic multiarmed bandits with unrestricted delays. In Advances in Neural Information Processing Systems, pp. 6538 6547, 2019. Tsiamis, A. and Pappas, G. J. Finite sample analysis of stochastic system identification. In 58th IEEE Conference on Decision and Control, pp. 3648 3654, 2019. Undurti, A. and How, J. P. An online algorithm for constrained pomdps. In IEEE International Conference on Robotics and Automation, pp. 3966 3973. IEEE, 2010. Vernade, C., Capp e, O., and Perchet, V. Stochastic bandit models for delayed conversions. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, 2017. 
Vernade, C., Carpentier, A., Lattimore, T., Zappella, G., Ermis, B., and Br uckner, M. Linear bandits with stochastic delayed feedback. In Proceedings of the 37th International Conference on Machine Learning, pp. 9712 9721, 2020. Wang, S., Wang, H., and Huang, L. Adaptive algorithms for multi-armed bandit with composite and anonymous feedback. In Thirty-Fifth AAAI Conference on Artificial Intelligence, pp. 10210 10217, 2021. Zhang, M., Tsuchida, R., and Ong, C. S. Gaussian process bandits with aggregated feedback. Co RR, abs/2112.13029, 2021. Astr om, K. Optimal control of markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174 205, 1965. Dynamical Linear Bandits A. Additional Related Works In this appendix, we report additional details about the related works. A.1. Delayed/Aggregated Feedback with DLBs In this appendix, we show how we can model delayed and composite feedback with DLBs. For the delayed feedback, we focus on the case in which either the delay is fixed to the value τ ě 1, i.e., the reward of the pull performed at round t is experienced at round t τ. For the composite feedback, we assume that the reward of the pull performed at round t is spread over the next τ ě 1 rounds with fixed weights pw1, . . . , wτq. Denoting with Rt the full reward (not observed) due to the pull performed at round t, the agent at round t observes the weighted sum of the rewards reported below:12 l 1 wl Rt l. (6) These two cases can be modeled as DLBs with a suitable encoding of the arms and choice of matrices. In particular, assuming to have K arms, we take the arm set U to be the canonical basis of RK, and we denote with µ the vector of expected rewards. We define θ 0 and: 0 0 . . . 0 0 1 0 . . . 0 0 0 1 . . . 0 0 ... ... ... ... ... 0 0 . . . 1 0 µT K 0T K 0T K... 0T K 0 0 0 ... 1 P Rτ, ωcomposite w1 w2 w3 ... wτ However, DLBs cannot model random or adversarial delays. Nevertheless, DLBs can capture scenarios of composite feedback in which the reward is spread over an infinite number of rounds. Keeping the K-armed case introduced above, we can consider the simplest example of a reward that spreads as an autoregressive process AR(1) with parameter γ P p0, 1q, that cannot be represented using the standard composite feedback. In such a case, we simply need a system with order n 1 with matrices (actually scalars): A γ, B u T , ω 1. Clearly, one can consider AR(m) processes (Bacchiocchi et al., 2022) by employing systems of order n m ą 1. A.2. Partially Observable Markov Decision Processes As already noted, looking at DLBs in their generality, we realize that our model is a particular subclass of the Partially Observable Markov Decision Processes (POMDP, Astr om, 1965). However, in the POMDP literature, no particular structure of the hidden state dynamics is assumed. The specific linear dynamics are rarely considered, as well as the possibility of a reward that is a linear combination of the hidden state and the action. Nevertheless, several works accounted for the presence of constraints (Isom et al., 2008; Undurti & How, 2010; Kim et al., 2011) without exploiting the linearity and without regret guarantees. A.3. Adversarial Bandits It is worth elaborating on the adaptation of adversarial MAB algorithms to this setting. First, since the reward distribution in DLBs depends at every round t on the sequence of actions played by the agent prior to t, we can reduce the DLB setting to an adversarial bandit with an adaptive (or non-oblivious) adversary. 
Second, such an adversary must have infinite memory in principle. Third, our regret definition of Section 2 is a policy regret (Dekel et al., 2012) that compares the algorithm performance against playing the optimal policy in hindsight from the beginning, as opposed to the external regret often 12It is worth noting that the fixed-delay case is a particular case of composite feedback, where w1 wτ 1 0 and wτ 1. Dynamical Linear Bandits employed for non-adaptive adversaries. It is well known that for infinite-memory adaptive adversaries, no algorithm can achieve sublinear policy regret. Nevertheless, for DLB setting, we know that the effect of the past is always vanishing (given Assumption 2.1 enforcing ρp Aq ă 1), so we can approximate our setting as a finite-memory setting, by considering memory length k9r log M log 1{ρs, where M is the one defined in Algorithm 1 (line 2), with an additional regret term only logarithmic in the optimization horizon T. Then, given this approximation, we can make use of an adversarial bandit algorithm (designed for non-adaptive adversaries) in the framework proposed by Dekel et al. (2012) to make it effective for the finite-memory adaptive adversary setting. In the case of an optimal algorithm, such as Exp3 (Auer et al., 2002), suffering an external regret of order r Op ? MTq, being M the number of arms, the version to address this finite-memory adaptive adversary setting suffers a regret bounded by r Oppk 1q M 1{3T 2{3q, as shown in Theorem 2 of Dekel et al. (2012). A.4. Other Approaches Non-stationary bandits (Gur et al., 2014) can be regarded as bandits with a hidden state that evolves through a (possibly non-linear) dynamics. The main difference compared with our DLBs is that the hidden state evolves in an uncontrollable way, i.e., it does not depend on the sequence of actions performed so far. Russac et al. (2019) extend the linear bandit setting by considering a non-stationary evolution of the parameter θ t . The notion of dynamic bandit is further studied by Chen et al. (2021), where an auto-regressive process is considered for the evolution of the reward through time and by Nobari (2019) that propose a practical approach to cope with this setting. B. Proofs and Derivations In this section, we provide the proofs we have omitted in the main paper. B.1. Proofs of Section 2 Before we proceed, we introduce a different notion of regret useful for analysis purposes, that we name offline regret. This notion of regret compares J with the steady-state performance of the action ut πtp Ht 1q played at each round t P JTK by the agent: Roffpπ, Tq : TJ t 1 Jputq. (7) We denote with ERoffpπ, Tq the expected offline regret, where the expectation is taken w.r.t. the randomness of the reward. Clearly, the two notions of regret coincide when the system has no dynamics. The following result relates the offline and the (online) expected regret. Lemma B.1. Under Assumptions 2.1 and 2.2, for any policy π, it holds that: ˇˇE Roffpπ, Tq E Rpπ, Tq ˇˇ ď ΩΦp Aq BU p1 ρp Aqq2 ΩΦp Aq X Proof. First of all, we observe that for any policy, the cumulative effect of the noise components is zero-mean. Thus, it suffices to consider the deterministic evolution of the system. For every t P JTK, let us denote with Eryts the expected reward at time t and with Jputq as the steady-state performance when executing action ut: s 0 xhtsu, ut sy ωTAt 1x1 θTut ωT t 1 ÿ s 1 As 1But s ωTAt 1x1, Jputq θTut ωT p Id Aq 1 ut θTut ωT 8 ÿ We now proceed by summing over t P JTK. 
First of all, we consider the following preliminary result involving yt, which is obtained by rearranging the summations: t 1 Eryts θT Tÿ t 1 ut ωT Tÿ s 1 As 1But s ωT Tÿ Dynamical Linear Bandits t 1 ut ωT T 1 ÿ t 1 At 1x1. Thus, we have: ˇˇˇˇˇ t 1 p Jputq Erytsq ď ΩΦp Aq BU s T t ρp Aqs ΩΦp Aq X t 1 ρp Aqt 1 (8) ď ΩΦp Aq BU t 1 ρp Aq T t ΩΦp Aq X 1 ρp Aq (9) ď ΩΦp Aq BU p1 ρp Aqq2 ΩΦp Aq X 1 ρp Aq , (10) where line (8) follows from Assumptions 2.1 and 2.2, lines (9) and (10) follow from bounding the summations with the series. The result follows by observing that: E Roffpπ, Tq E Rpπ, Tq t 1 p Jputq Erytsq . Theorem 2.2 (Lower Bound). For any policy π (even stochastic), there exists a DLB fulfilling Assumptions 2.1 and 2.2, such that for sufficiently large T ě O d2 1 ρp Aq , policy π suffers an expected regret lower bounded by: ERpπ, Tq ě Ω p1 ρp Aqq 1 2 Proof. To derive the lower bound, we take inspiration from the construction of (Lattimore & Szepesv ari, 2020) for linear bandits (Theorem 24.1). We consider a class of DLBs defined in terms of fixed 0 ď ρ ă 1 and 0 ď ϵ ď ρ with ω 1d, θ 2p1 ρq ϵ 2p1 pρ ϵqq1d, B p1 ρq Id and with a diagonal dynamical matrix A diagpaq, defined in terms of the vector a belonging to the set A tρ, ρ ϵud. The available actions are U t 1, 1ud. Let us note that |A| |U| 2d. Thus, in our set of DLBs, the vector a fully characterizes the problem. Moreover, we observe that, given the diagonal a diagp Aq, we can compute the cumulative Markov parameter ha signpaq ϵ 2p1 pρ ϵqq.13 As a consequence the optimal action can be defined as u a signpaq, whose performance is given by J a xha, u ay ϵd 2p1 pρ ϵqq. Let us consider the probability distribution over the canonical bandit model induced by executing a policy π in a DLB characterized by the diagonal of the dynamical matrix a P A and with Gaussian diagonal noise: t 1 Npxt 1|Axt But, σ2Idq Npyt|xθ, uty xω, xty, σ2qπtput|Ht 1q, where Ht 1 is the history of observations up to time t 1. We denote with Ea the expectation induced by the distribution Pa. For every i P Jd K, let us now consider an alternative DLB instance that differs on the dynamical matrix only. Specifically: aj if j i ρ if j i and aj ρ ϵ ρ ϵ if j i and aj ρ , @j P Jd K. 13For a vector v P Rd, we denote with signpvq P t 1, 1ud the vector of the signs of the components of v. It is irrelevant how we convene to define the sign of 0. Dynamical Linear Bandits By relative entropy identities (Lattimore & Szepesv ari, 2020), let A diagpaq and A1 diagpa1q, we have: DKL p Pa, Pa1q Ea t 1 DKL Np |Axt But, σ2Idq, Np |A1xt But, σ2Idq ff t 1 Ea A A1 xt 2 2 ı ϵ2Ea x2 t,i . We proceed at properly bounding the KL-divergence, letting ei be the i-th vector of the canonical basis of Rd and convening that x0 0d: Ea x2 t,i Ea s 1 e T i As But s s 1 e T i Asϵt s s 1 as iut s,i s 1 as iϵt s,i p1 ρq2 t 1 ÿ l 1 as l i ut s,iut l,i looooooooooooooooooomooooooooooooooooooon l 1 as l i ut s,iϵt l,i loooooooooooooooooomoooooooooooooooooon l 1 as l i ϵt s,iϵt l,i looooooooooooomooooooooooooon ffiffiffiffifl Let us start with (a): l 1 as l i ut s,iut l,i ď p1 ρq2 t 1 ÿ l 1 ρs l ď 1, having observed that |ut s,i|, |ut l,i| ď 1, that |ai| ď ρ, and bounding the summations with the series. 
Let us move to (b):

$$2(1-\rho)\sum_{s=1}^{t-1}\sum_{l=1}^{t-1} a_i^{s+l-2}\,\mathbb{E}_a\big[u_{t-s,i}\,\epsilon_{t-l,i}\big] = 2(1-\rho)\sum_{l=1}^{t-1}\sum_{s=1}^{l-1} a_i^{s+l-2}\,\mathbb{E}_a\big[u_{t-s,i}\,\epsilon_{t-l,i}\big] \le 2(1-\rho)\,\sigma\sqrt{\frac{2}{\pi}}\sum_{l=1}^{t-1}\sum_{s=1}^{l-1}\rho^{s+l-2} \le \frac{2\sigma}{1-\rho}\sqrt{\frac{2}{\pi}},$$

having observed that $u_{t-s,i}$ and $\epsilon_{t-l,i}$ are independent when $s \ge l$ and $\epsilon_{t-l,i}$ has zero mean, that $|u_{t-s,i}| \le 1$, that $a_i^{s+l-2} \le \rho^{s+l-2}$, and that the expectation of the absolute value of a normally distributed random variable is $\mathbb{E}[|\epsilon_{t-l,i}|] = \sigma\sqrt{2/\pi}$. Finally, let us consider (c):

$$\sum_{s=1}^{t-1}\sum_{l=1}^{t-1} a_i^{s+l-2}\,\mathbb{E}_a\big[\epsilon_{t-s,i}\,\epsilon_{t-l,i}\big] = \sum_{s=1}^{t-1} a_i^{2(s-1)}\,\mathbb{E}_a\big[\epsilon_{t-s,i}^2\big] \le \sigma^2\sum_{s=1}^{\infty}\rho^{2(s-1)} = \frac{\sigma^2}{1-\rho^2},$$

having observed that the noise components $\epsilon_{t-l,i}$ and $\epsilon_{t-s,i}$ are independent whenever $s \neq l$, that $\mathbb{E}_a[\epsilon_{t-s,i}^2] = \sigma^2$, and having bounded the sum with the series. Coming back to the original bound, and selecting $\sigma^2 = 1$, we have:

$$\mathbb{E}_a\big[x_{t,i}^2\big] \le 1 + \frac{2}{1-\rho}\sqrt{\frac{2}{\pi}} + \frac{1}{1-\rho^2} \le \frac{4}{1-\rho}, \qquad\text{hence}\qquad D_{\mathrm{KL}}(\mathbb{P}_a, \mathbb{P}_{a'}) \le \frac{2T\epsilon^2}{1-\rho}.$$

For $i \in [d]$ and $a \in \mathcal{A}$, we introduce the symbol:

$$p_{a,i} := \mathbb{P}_a\!\left(\sum_{t=1}^{T}\mathbf{1}\{\operatorname{sign}(u_{t,i}) \neq \operatorname{sign}(h_{a,i})\} \ge \frac{T}{2}\right).$$

Thus, for $a$ and $a'$ defined as above, by the Bretagnolle-Huber inequality (Lattimore & Szepesvári, 2020, Theorem 14.2), we have:

$$p_{a,i} + p_{a',i} \ge \frac{1}{2}\exp\big(-D_{\mathrm{KL}}(\mathbb{P}_a, \mathbb{P}_{a'})\big) \ge \frac{1}{2}\exp\!\left(-\frac{2T\epsilon^2}{1-\rho}\right).$$

We use the notation $\sum_{a_{-i}}$ to denote the multiple summation $\sum_{a_1,\dots,a_{i-1},a_{i+1},\dots,a_d \in \{\rho-\epsilon,\rho\}^{d-1}}$:

$$\sum_{a\in\mathcal{A}} 2^{-d}\sum_{i=1}^{d} p_{a,i} = 2^{-d}\sum_{i=1}^{d}\sum_{a_{-i}}\sum_{a_i\in\{\rho-\epsilon,\rho\}} p_{a,i} \ge \frac{d}{4}\exp\!\left(-\frac{2T\epsilon^2}{1-\rho}\right).$$

Therefore, with this averaging argument, we can conclude that there exists $a^* \in \mathcal{A}$ such that $\sum_{i=1}^{d} p_{a^*,i} \ge \frac{d}{4}\exp\!\big(-\frac{2T\epsilon^2}{1-\rho}\big)$. For this choice $a^*$, considering $u^*_{a^*} = \operatorname{sign}(h_{a^*}) \in \mathcal{U}$, we can proceed to lower bound the expected offline regret:

$$\mathbb{E}[R_{\mathrm{off}}(\pi,T)] = \sum_{t=1}^{T}\mathbb{E}_{a^*}\big[\langle h_{a^*}, u^*_{a^*} - u_t\rangle\big] = \frac{\epsilon}{1-(\rho-\epsilon)}\sum_{t=1}^{T}\sum_{i=1}^{d}\mathbb{P}_{a^*}\big(\operatorname{sign}(u_{t,i}) \neq \operatorname{sign}(h_{a^*,i})\big)$$
$$\ge \frac{T\epsilon}{2(1-(\rho-\epsilon))}\sum_{i=1}^{d} p_{a^*,i} \ge \frac{T d\,\epsilon}{8(1-(\rho-\epsilon))}\exp\!\left(-\frac{2T\epsilon^2}{1-\rho}\right).$$

We now maximize over $0 \le \epsilon < \rho$. To this end, we perform the substitution $\epsilon = \frac{(1-\rho)\tilde\epsilon}{1-\tilde\epsilon}$, with $0 \le \tilde\epsilon \le \rho$:

$$\frac{T d\,\epsilon}{8(1-(\rho-\epsilon))}\exp\!\left(-\frac{2T\epsilon^2}{1-\rho}\right) = \frac{T d\,\tilde\epsilon}{8}\exp\!\left(-\frac{2\tilde\epsilon^2 T(1-\rho)}{(1-\tilde\epsilon)^2}\right) \ge \frac{T d\,\tilde\epsilon}{8}\exp\big(-8\tilde\epsilon^2 T(1-\rho)\big),$$

where the last inequality holds for $\tilde\epsilon \le \frac{1}{2}$. We now take $\tilde\epsilon = \frac{1}{\sqrt{8T(1-\rho)}}$, which is smaller than $\frac{1}{2}$ if $T \ge \frac{1}{2(1-\rho)}$, to get:

$$\mathbb{E}[R_{\mathrm{off}}(\pi,T)] \ge \frac{d\sqrt{T}}{\sqrt{512\,e\,(1-\rho)}}.$$

Notice that, with this choice of $\tilde\epsilon$ (and, consequently, of $\epsilon$), for sufficiently large $T$ we fulfill Assumption 2.2. Indeed:

$$J^*_{a^*} = \frac{\epsilon d}{2(1-(\rho-\epsilon))} = \frac{\tilde\epsilon\, d}{2} = \frac{d}{\sqrt{32\,T(1-\rho)}},$$

so we require $T \ge O\!\big(\frac{d^2}{1-\rho}\big)$. Finally, to convert this result to the expected regret, we employ Lemma B.1:

$$\mathbb{E}[R(\pi,T)] \ge \mathbb{E}[R_{\mathrm{off}}(\pi,T)] - O\!\left(\frac{d}{1-\rho}\right).$$

Under the constraint $T \ge O\!\big(\frac{d^2}{1-\rho}\big)$, we observe that:

$$\mathbb{E}[R(\pi,T)] \ge \Omega\!\left(\frac{d\sqrt{T}}{(1-\rho(A))^{1/2}}\right). \qquad\square$$

Theorem 2.1 (Optimal Policy). Under Assumptions 2.1 and 2.2, an optimal policy $\pi^*$ maximizing the (infinite-horizon) expected average reward $J(\pi)$ (Equation 2) is given, for every round $t \in \mathbb{N}$ and history $H_{t-1} \in \mathcal{H}_{t-1}$, by:

$$\pi^*_t(H_{t-1}) = u^*, \qquad\text{where}\qquad u^* \in \operatorname*{arg\,max}_{u\in\mathcal{U}} J(u) = \operatorname*{arg\,max}_{u\in\mathcal{U}} \langle h, u\rangle. \qquad (4)$$

Proof. Referring to the notation of Appendix C, we first observe that, for every policy $\pi$, we have $J(\pi) = \liminf_{H\to\infty} J_H(\pi)$, where $J_H(\pi) = \frac{1}{H}\mathbb{E}\big[\sum_{t=1}^{H} y_t\big]$ is the $H$-horizon expected average reward. Let us start with Equation (18), a fixed finite $H \in \mathbb{N}$, and the sequence of actions $(u_1, u_2, \dots)$ generated by policy $\pi$:

$$J_H(\pi) = \frac{1}{H}\sum_{s=1}^{H}\langle h_{[0,H-s]}, \mathbb{E}[u_s]\rangle + \frac{1}{H}\sum_{t=1}^{H}\omega^\top A^{t-1}\mathbb{E}[x_1] = \frac{1}{H}\sum_{s=1}^{H}\langle h, \mathbb{E}[u_s]\rangle - \frac{1}{H}\sum_{s=1}^{H}\langle h_{[H-s+1,\infty)}, \mathbb{E}[u_s]\rangle + \frac{1}{H}\sum_{t=1}^{H}\omega^\top A^{t-1}\mathbb{E}[x_1].$$

Now, we consider two bounds on $J_H(\pi)$, obtained by an application of the Cauchy-Schwarz inequality on the second addendum:

$$J_H(\pi) \le \frac{1}{H}\sum_{s=1}^{H}\langle h, \mathbb{E}[u_s]\rangle + \frac{1}{H}\sum_{s=1}^{H}\big\|h_{[H-s+1,\infty)}\big\|_2\,\big\|\mathbb{E}[u_s]\big\|_2 + \left|\frac{1}{H}\sum_{t=1}^{H}\omega^\top A^{t-1}\mathbb{E}[x_1]\right| =: J^{\uparrow}_H(\pi),$$
$$J_H(\pi) \ge \frac{1}{H}\sum_{s=1}^{H}\langle h, \mathbb{E}[u_s]\rangle - \frac{1}{H}\sum_{s=1}^{H}\big\|h_{[H-s+1,\infty)}\big\|_2\,\big\|\mathbb{E}[u_s]\big\|_2 - \left|\frac{1}{H}\sum_{t=1}^{H}\omega^\top A^{t-1}\mathbb{E}[x_1]\right| =: J^{\downarrow}_H(\pi).$$

Concerning the term $\|\mathbb{E}[u_s]\|_2$, we have that $\|\mathbb{E}[u_s]\|_2 \le \mathbb{E}[\|u_s\|_2] \le U$, having used Jensen's inequality and Assumption 2.2.
Regarding the second term, using Assumptions 2.1 and 2.2, we obtain:

$$\big\| h_{[H-s+1,\infty)} \big\|_2 = \Big\| \sum_{l=H-s+1}^{\infty} B^\top (A^{l-1})^\top \omega \Big\|_2 \le B\,\Omega \sum_{l=H-s+1}^{\infty} \Phi(A)\,\rho(A)^{l-1} \le \frac{B\,\Omega\,\Phi(A)\,\rho(A)^{H-s}}{1-\rho(A)}. \qquad (11)$$

Plugging this result into the summation over $s$, we obtain:

$$\frac{1}{H}\sum_{s=1}^{H} \big\|h_{[H-s+1,\infty)}\big\|_2\,\big\|\mathbb{E}[u_s]\big\|_2 \le \frac{B\,U\,\Omega\,\Phi(A)}{H(1-\rho(A))}\sum_{s=1}^{H}\rho(A)^{H-s} \le \frac{B\,U\,\Omega\,\Phi(A)\big(1-\rho(A)^H\big)}{H(1-\rho(A))^2}.$$

It is simple to observe that the last term approaches zero as $H \to \infty$. Moreover, with an analogous argument, it can be proved that $\big|\frac{1}{H}\sum_{t=1}^{H}\omega^\top A^{t-1}\mathbb{E}[x_1]\big| \to 0$ as $H \to \infty$. Thus, we have that $\liminf_{H\to\infty} J^{\downarrow}_H(\pi) = \liminf_{H\to\infty} J^{\uparrow}_H(\pi)$. Consequently, by the squeeze theorem for limits, we have:

$$J(\pi) = \liminf_{H\to\infty} J^{\uparrow}_H(\pi) = \liminf_{H\to\infty} J^{\downarrow}_H(\pi) = \liminf_{H\to\infty} \frac{1}{H}\sum_{s=1}^{H}\langle h, \mathbb{E}[u_s]\rangle \le \max_{u\in\mathcal{U}}\langle h, u\rangle.$$

It follows that an optimal policy is a policy that plays the constant action $u^* \in \arg\max_{u\in\mathcal{U}} \langle h, u\rangle$. $\square$

B.2. Proofs of Section 3

Theorem 3.1 (Self-Normalized Concentration). Let $(\hat h_t)_{t\in\mathbb{N}}$ be the sequence of solutions of the Ridge regression problems of Algorithm 1. Then, under Assumptions 2.1 and 2.2, for every $\lambda > 0$ and $\delta \in (0,1)$, with probability at least $1-\delta$, simultaneously for all rounds $t \in \mathbb{N}$, it holds that:

$$\big\| \hat h_t - h \big\|_{V_t} \le \frac{c_1}{\sqrt{\lambda}}\log\big(e(t+1)\big) + c_2\sqrt{\lambda} + \sqrt{2\tilde\sigma^2\left(\log\frac{1}{\delta} + \frac{1}{2}\log\frac{\det(V_t)}{\lambda^d}\right)},$$

where $c_1 = U\,\Omega\,\Phi(A)\Big(\frac{U B}{1-\rho(A)} + X\Big)$, $c_2 = \Theta + \frac{\Omega B \Phi(A)}{1-\rho(A)}$, and $\tilde\sigma^2 = \sigma^2\Big(1 + \frac{\Omega^2\Phi(A)^2}{1-\rho(A)^2}\Big)$.

Proof. First of all, let us properly relate the round $t \in [T]$ and the index of the epoch $m \in [M]$. For every epoch $m \in [M]$, we denote with $t_m$ the last round of epoch $m$ (i.e., the one in which we update the relevant matrices $V_t$ and $b_t$):¹⁴

$$t_0 = 0, \qquad t_m = t_{m-1} + H_m + 1.$$

¹⁴It is worth noting that the variables $t_m$ are deterministic.

We now proceed to define suitable filtrations. Let $\mathcal{F} = (\mathcal{F}_t)_{t\in[T]}$ be such that, for every $t \ge 1$, the random variables $\{u_1, y_1, \dots, u_{t-1}, y_{t-1}, u_t\}$ are $\mathcal{F}_{t-1}$-measurable, i.e., $\mathcal{F}_{t-1} = \sigma(u_1, y_1, \dots, u_{t-1}, y_{t-1}, u_t)$. Let us also consider the filtration indexed by $m$, denoted with $\tilde{\mathcal{F}} = (\tilde{\mathcal{F}}_m)_{m\in[M]}$ and defined for all $m \in [M]$ as $\tilde{\mathcal{F}}_m = \mathcal{F}_{t_{m+1}-1}$. Thus, the random variables that are $\tilde{\mathcal{F}}_{m-1}$-measurable are those realized until the end of epoch $m$, except for $y_{t_m}$. Since the estimates $\hat h_t$ do not change within an epoch, we need to guarantee the statement for the rounds $\{t_m\}_{m\in[M]}$ only. For these rounds, we define the following quantities:

$$\tilde y_m := y_{t_m}, \qquad \tilde u_m := u_{t_m} \ \ (\text{or any } u_l \text{ with } l \in [t_{m-1}+1, t_m], \text{ since they are all equal}),$$
$$\tilde\xi_m := \eta_{t_m} + \sum_{s=1}^{H_m-1}\omega^\top A^{s-1}\epsilon_{t_m-s}, \qquad \tilde x_{m-1} := x_{t_{m-1}+1}, \qquad \tilde h_m := \hat h_{t_m}, \qquad \tilde V_m := V_{t_m}, \qquad \tilde b_m := b_{t_m}.$$

We prove that $(\tilde\xi_m)_{m\in[M]}$ is a martingale difference process adapted to the filtration $\tilde{\mathcal{F}}$. To this end, we recall that, by construction, $(\eta_t)_{t\in[T]}$ and $(\epsilon_t)_{t\in[T]}$ are martingale difference processes adapted to the filtration $\mathcal{F}$. It is clear that $\tilde\xi_m$ is $\tilde{\mathcal{F}}_m$-measurable and, being subgaussian, it is absolutely integrable. Furthermore, using the tower law of expectation:

$$\mathbb{E}\big[\tilde\xi_m \,\big|\, \tilde{\mathcal{F}}_{m-1}\big] = \mathbb{E}\big[\eta_{t_m}\,\big|\,\mathcal{F}_{t_m-1}\big] + \sum_{s=1}^{H_m-1}\mathbb{E}\Big[\omega^\top A^{s-1}\,\mathbb{E}\big[\epsilon_{t_m-s}\,\big|\,\mathcal{F}_{t_m-s-1}\big]\,\Big|\,\mathcal{F}_{t_m-1}\Big] = 0,$$

since the system is operating by persisting the action after having decided it at the beginning of the epoch. Thus, by exploiting the decomposition in Equation (1), we can write:

$$\tilde y_m = y_{t_m} = \langle h_{[0,H_m-1]}, \tilde u_m\rangle + \omega^\top A^{H_m-1} x_{t_{m-1}+1} + \eta_{t_m} + \sum_{s=1}^{H_m-1}\omega^\top A^{s-1}\epsilon_{t_m-s}$$
$$= \langle h_{[0,H_m-1]}, \tilde u_m\rangle + \omega^\top A^{H_m-1}\tilde x_{m-1} + \tilde\xi_m = \langle h, \tilde u_m\rangle - \langle h_{[H_m,\infty)}, \tilde u_m\rangle + \omega^\top A^{H_m-1}\tilde x_{m-1} + \tilde\xi_m, \qquad (12)$$

where we simply exploit the identity $h = h_{[0,H_m-1]} + h_{[H_m,\infty)}$. We now introduce the following vectors and matrices:

$$\tilde U_m := \begin{bmatrix}\tilde u_1^\top\\ \vdots\\ \tilde u_m^\top\end{bmatrix}\in\mathbb{R}^{m\times d},\quad \tilde y_m := \begin{bmatrix}\tilde y_1\\ \vdots\\ \tilde y_m\end{bmatrix},\quad \tilde\xi_m := \begin{bmatrix}\tilde\xi_1\\ \vdots\\ \tilde\xi_m\end{bmatrix},\quad \tilde\nu_m := \begin{bmatrix}\omega^\top A^{H_1-1}\tilde x_0\\ \vdots\\ \omega^\top A^{H_m-1}\tilde x_{m-1}\end{bmatrix},\quad \tilde g_m := \begin{bmatrix}\langle h_{[H_1,\infty)},\tilde u_1\rangle\\ \vdots\\ \langle h_{[H_m,\infty)},\tilde u_m\rangle\end{bmatrix}.$$

Using the vectors and matrices above, we observe that $\tilde V_m = \lambda I + \tilde U_m^\top\tilde U_m$ and $\tilde b_m = \tilde U_m^\top\tilde y_m$. Furthermore, by exploiting Equation (12), we can write:

$$\tilde y_m = \tilde U_m h - \tilde g_m + \tilde\nu_m + \tilde\xi_m.$$
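The estimate $\tilde h_m = \tilde V_m^{-1}\tilde b_m$ analyzed in the next step is a plain Ridge regression on the epoch-level data. A minimal sketch, with invented data and our own function name:

```python
import numpy as np

def ridge_markov_estimate(U_tilde, y_tilde, lam):
    # h_hat = (lam I + U^T U)^{-1} U^T y: epoch-level Ridge regression estimate
    # of the cumulative Markov parameter h from persisted actions and their rewards.
    d = U_tilde.shape[1]
    V = lam * np.eye(d) + U_tilde.T @ U_tilde   # design matrix V_m
    b = U_tilde.T @ y_tilde                     # b_m
    return np.linalg.solve(V, b), V

# Illustrative data: 20 epochs, d = 3, rewards generated from a known h plus noise.
rng = np.random.default_rng(0)
h_true = np.array([0.5, -0.2, 0.1])
U_tilde = rng.choice([-1.0, 1.0], size=(20, 3))
y_tilde = U_tilde @ h_true + 0.05 * rng.standard_normal(20)
h_hat, V = ridge_markov_estimate(U_tilde, y_tilde, lam=1.0)
print(h_hat)                                    # close to h_true for small noise/bias
```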
Let us consider the estimate at $m \in [M]$:

$$\tilde h_m = \tilde V_m^{-1}\tilde b_m = \big(\lambda I + \tilde U_m^\top\tilde U_m\big)^{-1}\tilde U_m^\top\tilde y_m = \big(\lambda I + \tilde U_m^\top\tilde U_m\big)^{-1}\tilde U_m^\top\big(\tilde U_m h - \tilde g_m + \tilde\nu_m + \tilde\xi_m\big)$$
$$= h - \big(\lambda I + \tilde U_m^\top\tilde U_m\big)^{-1}\big(\lambda h + \tilde U_m^\top\tilde g_m - \tilde U_m^\top\tilde\nu_m - \tilde U_m^\top\tilde\xi_m\big).$$

We now proceed at bounding the $\|\cdot\|_{\tilde V_m}$-norm, exploiting the triangle inequality:

$$\big\|\tilde h_m - h\big\|_{\tilde V_m} \le \underbrace{\lambda\,\|h\|_{\tilde V_m^{-1}}}_{\text{(a)}} + \underbrace{\big\|\tilde U_m^\top\tilde g_m\big\|_{\tilde V_m^{-1}}}_{\text{(b)}} + \underbrace{\big\|\tilde U_m^\top\tilde\nu_m\big\|_{\tilde V_m^{-1}}}_{\text{(c)}} + \underbrace{\big\|\tilde U_m^\top\tilde\xi_m\big\|_{\tilde V_m^{-1}}}_{\text{(d)}},$$

where we simply exploited the identity $\|V^{-1}x\|_V^2 = x^\top V^{-1} V V^{-1} x = x^\top V^{-1} x = \|x\|_{V^{-1}}^2$. We now bound one term at a time. Let us start with (a):

$$\text{(a)}^2 = \lambda^2\,\|h\|_{\tilde V_m^{-1}}^2 = \lambda^2\, h^\top\tilde V_m^{-1} h \le \lambda^2\,\big\|\tilde V_m^{-1}\big\|_2\,\|h\|_2^2 \le \lambda\left(\Theta + \frac{\Omega B\Phi(A)}{1-\rho(A)}\right)^{\!2},$$

where we observed that $\|\tilde V_m^{-1}\|_2 \le \lambda^{-1}$ and bounded the norm of $h$ as:

$$\|h\|_2 \le \|\theta\|_2 + \|\omega\|_2\,\|B\|_2\sum_{s=1}^{\infty}\big\|A^{s-1}\big\|_2 \le \Theta + \frac{\Omega B\Phi(A)}{1-\rho(A)},$$

where we have exploited Assumptions 2.1 and 2.2. We now move to term (b):

$$\text{(b)}^2 = \big\|\tilde U_m^\top\tilde g_m\big\|_{\tilde V_m^{-1}}^2 = \tilde g_m^\top\tilde U_m\tilde V_m^{-1}\tilde U_m^\top\tilde g_m \le \frac{1}{\lambda}\Big\|\sum_{l=1}^{m}\langle\tilde u_l, h_{[H_l,\infty)}\rangle\,\tilde u_l\Big\|_2^2 \le \frac{1}{\lambda}\left(\sum_{l=1}^{m}\|\tilde u_l\|_2^2\,\big\|h_{[H_l,\infty)}\big\|_2\right)^{\!2} \le \frac{U^4\,\Omega^2 B^2\Phi(A)^2}{\lambda\,(1-\rho(A))^2}\left(\sum_{l=1}^{m}\rho(A)^{H_l-1}\right)^{\!2},$$

where we have employed the following inequality:

$$\big\|h_{[H_l,\infty)}\big\|_2 = \Big\|\sum_{j=H_l}^{\infty} B^\top(A^{j-1})^\top\omega\Big\|_2 \le \|\omega\|_2\,\|B\|_2\sum_{j=H_l}^{\infty}\Phi(A)\rho(A)^{j-1} \le \frac{\Omega B\Phi(A)\,\rho(A)^{H_l-1}}{1-\rho(A)}.$$

Let us now consider term (c):

$$\text{(c)}^2 = \big\|\tilde U_m^\top\tilde\nu_m\big\|_{\tilde V_m^{-1}}^2 \le \frac{1}{\lambda}\Big\|\sum_{l=1}^{m}\omega^\top A^{H_l-1}\tilde x_{l-1}\,\tilde u_l\Big\|_2^2 \le \frac{1}{\lambda}\left(\sum_{l=1}^{m}\|\omega\|_2\,\big\|A^{H_l-1}\big\|_2\,\|\tilde x_{l-1}\|_2\,\|\tilde u_l\|_2\right)^{\!2} \le \frac{X^2\Omega^2 U^2\Phi(A)^2}{\lambda}\left(\sum_{l=1}^{m}\rho(A)^{H_l-1}\right)^{\!2}.$$

We now bound the summations, exploiting the inequality $\rho(A) \le \rho$, holding by assumption, and the fact that the epoch lengths $H_l$ of Algorithm 1 are chosen so that $\rho^{H_l-1} \le 1/l$:

$$\sum_{l=1}^{m}\rho(A)^{H_l-1} \le \sum_{l=1}^{m}\rho^{H_l-1} \le \sum_{l=1}^{m}\frac{1}{l} \le \log(m) + 1 \le \log(t) + 1 = \log(e\,t) \le \log\big(e(t+1)\big),$$

having exploited the fact that $m \le t$ and the integral bound on the harmonic sum. Finally, we consider term (d). In this case, we apply Theorem 1 of Abbasi-Yadkori et al. (2011), observing that its conditions are satisfied. To this end, we first need to determine the subgaussianity constant of the noise process $\tilde\xi_l$. For every $l \in [m]$ and $\zeta \in \mathbb{R}$, properly using the tower law of expectation:

$$\mathbb{E}\Big[\exp\big(\zeta\tilde\xi_l\big)\,\Big|\,\tilde{\mathcal{F}}_{l-1}\Big] = \mathbb{E}\big[\exp(\zeta\eta_{t_l})\,\big|\,\mathcal{F}_{t_l-1}\big]\prod_{s=1}^{H_l-1}\mathbb{E}\Big[\exp\big(\zeta\,\omega^\top A^{s-1}\epsilon_{t_l-s}\big)\,\Big|\,\mathcal{F}_{t_l-s-1}\Big] \le \exp\!\left(\frac{\zeta^2\sigma^2}{2}\right)\prod_{s=1}^{H_l-1}\exp\!\left(\frac{\zeta^2\|\omega^\top A^{s-1}\|_2^2\,\sigma^2}{2}\right)$$
$$\le \exp\!\left(\frac{\zeta^2\sigma^2}{2}\left(1 + \Omega^2\Phi(A)^2\sum_{s=1}^{\infty}\rho(A)^{2(s-1)}\right)\right) = \exp\!\left(\frac{\zeta^2\sigma^2}{2}\left(1 + \frac{\Omega^2\Phi(A)^2}{1-\rho(A)^2}\right)\right) = \exp\!\left(\frac{\zeta^2\tilde\sigma^2}{2}\right).$$

Thus, simultaneously for all $m \in [M]$, with probability at least $1-\delta$, it holds that:

$$\text{(d)}^2 = \big\|\tilde U_m^\top\tilde\xi_m\big\|_{\tilde V_m^{-1}}^2 \le 2\tilde\sigma^2\left(\log\frac{1}{\delta} + \frac{1}{2}\log\frac{\det(\tilde V_m)}{\lambda^d}\right).$$

Putting the bounds on (a), (b), (c), and (d) together yields the statement. $\square$

We now proceed at bounding the offline regret $R_{\mathrm{off}}$ and, then, relating the offline regret $R_{\mathrm{off}}$ with the online regret $R$, as defined in the main paper.

Theorem B.2 (Offline Regret Upper Bound). Under Assumptions 2.1 and 2.2, having selected $\beta_t$ as in Equation (6), for every $\delta \in (0,1)$, with probability at least $1-\delta$, Dyn Lin-UCB suffers an offline regret $R_{\mathrm{off}}$ bounded as:

$$R_{\mathrm{off}}(\pi_{\text{Dyn Lin-UCB}}, T) \le \sqrt{8\,d\,T\,(H_M+1)\,\beta_{T+1}^2\,\log\!\left(1 + \frac{TU^2}{\lambda d}\right)}.$$

Moreover, by setting $\delta = 1/T$ and highlighting the dependencies on $T$, $\rho$, $d$, and $\sigma$ only, the expected offline regret $\mathbb{E}[R_{\mathrm{off}}]$ is bounded as:

$$\mathbb{E}\big[R_{\mathrm{off}}(\pi_{\text{Dyn Lin-UCB}}, T)\big] \le O\!\left(\frac{\sigma\, d\,\sqrt{T}\,(\log T)^{3/2}}{1-\rho} + \frac{d\,\sqrt{T}\,(\log T)^{2}}{(1-\rho)^{3/2}}\right).$$
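The logarithmic factors above stem from the epoch schedule. Assuming, as suggested by the proof of Theorem 3.1, epoch lengths of the form $H_m = \lceil \log m / \log(1/\rho)\rceil + 1$ (an assumption on our part; the actual schedule is the one of Algorithm 1), the following sketch checks numerically that $\rho^{H_m-1} \le 1/m$ and that $\sum_m \rho^{H_m-1}$ grows only logarithmically.

```python
import math

def epoch_length(m, rho):
    # Assumed schedule: H_m = ceil(log m / log(1/rho)) + 1, which guarantees
    # rho^(H_m - 1) <= 1/m (the key property used in the proof of Theorem 3.1).
    return math.ceil(math.log(max(m, 1)) / math.log(1.0 / rho)) + 1

def epochs_within_horizon(T, rho):
    # Epoch lengths that fit within T rounds, with t_m = t_{m-1} + H_m + 1.
    t, lengths = 0, []
    while True:
        H = epoch_length(len(lengths) + 1, rho)
        if t + H + 1 > T:
            break
        t += H + 1
        lengths.append(H)
    return lengths

rho, T = 0.8, 10_000
lengths = epochs_within_horizon(T, rho)
tail_sum = sum(rho ** (H - 1) for H in lengths)
assert all(rho ** (H - 1) <= 1.0 / m for m, H in enumerate(lengths, start=1))
print(f"M = {len(lengths)} epochs fit in T = {T}; sum_m rho^(H_m - 1) = {tail_sum:.3f} "
      f"<= log(e (T + 1)) = {math.log(math.e * (T + 1)):.3f}")
```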
Proof. For every epoch $m \in [M]$, let us define $\tilde\beta_{m-1} := \beta_{t_{m-1}+1}$ and the confidence set $\mathcal{C}_{m-1} := \{\tilde h \in \mathbb{R}^d : \|\tilde h - \tilde h_{m-1}\|_{\tilde V_{m-1}} \le \tilde\beta_{m-1}\}$. Let us start by considering the instantaneous offline regret $\tilde r_m$ at epoch $m \in [M]$. Let $u^* \in \arg\max_{u\in\mathcal{U}}\langle h, u\rangle$ and let $\tilde h^{\uparrow}_{m-1} \in \mathcal{C}_{m-1}$ be such that $\mathrm{UCB}_{t_{m-1}+1}(\tilde u_m) = \langle\tilde h^{\uparrow}_{m-1}, \tilde u_m\rangle$. Thus, with probability at least $1-\delta$, we have:

$$\tilde r_m = J^* - J(\tilde u_m) = \langle h, u^*\rangle - \langle h, \tilde u_m\rangle \le \langle\tilde h^{\uparrow}_{m-1}, \tilde u_m\rangle - \langle h, \tilde u_m\rangle \qquad (13)$$
$$= \langle\tilde h^{\uparrow}_{m-1} - h, \tilde u_m\rangle \le \big\|\tilde h^{\uparrow}_{m-1} - h\big\|_{\tilde V_{m-1}}\,\|\tilde u_m\|_{\tilde V_{m-1}^{-1}} \le \Big(\big\|\tilde h^{\uparrow}_{m-1} - \tilde h_{m-1}\big\|_{\tilde V_{m-1}} + \big\|\tilde h_{m-1} - h\big\|_{\tilde V_{m-1}}\Big)\,\|\tilde u_m\|_{\tilde V_{m-1}^{-1}} \qquad (14)$$
$$\le 2\tilde\beta_{m-1}\,\|\tilde u_m\|_{\tilde V_{m-1}^{-1}}, \qquad (15)$$

where line (13) follows from optimism, line (14) derives from the triangle inequality, and line (15) is obtained by observing that $h \in \mathcal{C}_{m-1}$ with probability at least $1-\delta$, simultaneously for all $m \in [M]$, thanks to Theorem 3.1, having observed that $\tilde\beta_{m-1}$ is larger than the right-hand side of Theorem 3.1. We now move to the cumulative offline regret over the whole horizon $T$, decomposing w.r.t. the epochs and recalling that we pay the same instantaneous regret within each epoch:

$$R_{\mathrm{off}}(\pi_{\text{Dyn Lin-UCB}}, T) = \sum_{m=1}^{M}(H_m+1)\,\tilde r_m \le \sqrt{\sum_{m=1}^{M}(H_m+1)^2}\cdot\sqrt{\sum_{m=1}^{M}\tilde r_m^2},$$

by the Cauchy-Schwarz inequality. Concerning the first summation, we proceed as follows, recalling that $M \le T$ and $H_m \le H_M$ for all $m \in [M]$:

$$\sum_{m=1}^{M}(H_m+1)^2 \le (H_M+1)\sum_{m=1}^{M}(H_m+1) \le T\,(H_M+1).$$

For the second summation, we follow the usual derivation for linear bandits, recalling that $\tilde\beta_{M-1} \ge \max\{1, \tilde\beta_{m-1}\}$ for all $m \in [M]$ and that, under Assumption 2.2, $\tilde r_m \le 2$. In particular:

$$\tilde r_m \le \min\big\{2,\; 2\tilde\beta_{M-1}\,\|\tilde u_m\|_{\tilde V_{m-1}^{-1}}\big\} \le 2\tilde\beta_{M-1}\min\big\{1,\; \|\tilde u_m\|_{\tilde V_{m-1}^{-1}}\big\}.$$

Plugging this inequality into the second summation, we obtain:

$$\sum_{m=1}^{M}\tilde r_m^2 \le 4\tilde\beta_{M-1}^2\sum_{m=1}^{M}\min\big\{1,\; \|\tilde u_m\|_{\tilde V_{m-1}^{-1}}^2\big\} \le 8d\,\tilde\beta_{M-1}^2\log\!\left(1+\frac{MU^2}{\lambda d}\right) \le 8d\,\beta_{T+1}^2\log\!\left(1+\frac{TU^2}{\lambda d}\right),$$

where the last passage follows from the elliptic potential lemma (Lattimore & Szepesvári, 2020, Lemma 19.4). Putting all together, we obtain the inequality, holding with probability at least $1-\delta$:

$$R_{\mathrm{off}}(\pi_{\text{Dyn Lin-UCB}}, T) \le \sqrt{8\,d\,T\,(H_M+1)\,\beta_{T+1}^2\log\!\left(1+\frac{TU^2}{\lambda d}\right)},$$

having observed that $\tilde\beta_{M-1} \le \beta_{T+1}$. We can also arrive at a problem-dependent regret bound by setting $\Delta := \inf_{u\in\mathcal{U}:\,\langle h,u\rangle < \langle h,u^*\rangle}\langle h, u^* - u\rangle$ (provided it is $> 0$). Since the instantaneous regret is either $0$ or at least $\Delta$, we have:

$$R_{\mathrm{off}}(\pi_{\text{Dyn Lin-UCB}}, T) \le \frac{1}{\Delta}\sum_{m=1}^{M}(H_m+1)\,\tilde r_m^2 \le \frac{8d\,(H_M+1)}{\Delta}\,\tilde\beta_{M-1}^2\log\!\left(1+\frac{MU^2}{\lambda d}\right) \le \frac{8d\,(H_M+1)}{\Delta}\,\beta_{T+1}^2\log\!\left(1+\frac{TU^2}{\lambda d}\right).$$

By setting $\delta = 1/T$ and replacing the value of $\beta_{T+1}$, we obtain the offline regret in expectation, highlighting the dependence on $T$, $\rho$, $d$, and $\sigma$ only:

$$\mathbb{E}\big[R_{\mathrm{off}}(\pi_{\text{Dyn Lin-UCB}}, T)\big] \le O\!\left(\frac{\sigma\, d\,\sqrt{T}\,(\log T)^{3/2}}{1-\rho} + \frac{d\,\sqrt{T}\,(\log T)^{2}}{(1-\rho)^{3/2}}\right),$$

where we used the facts that $\frac{1}{\log(1/\rho)} \le \frac{1}{1-\rho}$ and $\rho(A) \le \rho$. $\square$

The following result relates the expected offline regret with the expected online regret.

Theorem 3.2 (Upper Bound). Under Assumptions 2.1 and 2.2, selecting $\beta_t$ as in Equation (6) and $\delta = 1/T$, Dyn Lin-UCB suffers an expected regret bounded as (highlighting the dependencies on $T$, $\rho$, $d$, and $\sigma$ only):

$$\mathbb{E}\big[R(\pi_{\text{Dyn Lin-UCB}}, T)\big] \le O\!\left(\frac{\sigma\, d\,\sqrt{T}\,(\log T)^{3/2}}{1-\rho} + \frac{d\,\sqrt{T}\,(\log T)^{2}}{(1-\rho)^{3/2}} + \frac{1}{(1-\rho(A))^2}\right).$$

Proof. The result is simply obtained by exploiting the offline regret bound of Theorem B.2 and by upper bounding the expected regret using Lemma B.1. $\square$

C. Finite-Horizon Setting

In this section, we compare the finite-horizon setting with the infinite-horizon one presented in the main paper. We shall show that, under Assumption 2.1, the two settings tend to coincide when the horizon is sufficiently large. Let us start by introducing the $H$-horizon expected average reward, with $H \in \mathbb{N}$ being the optimization horizon:

$$J_H(\pi) := \frac{1}{H}\,\mathbb{E}\!\left[\sum_{t=1}^{H} y_t\right] \quad\text{s.t.}\quad x_{t+1} = A x_t + B u_t + \epsilon_t,\;\; y_t = \langle\omega, x_t\rangle + \langle\theta, u_t\rangle + \eta_t,\;\; u_t = \pi_t(H_{t-1}),\;\; t \in [H], \qquad (16)$$

where the expectation is taken w.r.t. the randomness of the state noise $\epsilon_t$ and of the reward noise $\eta_t$. We now show that the optimal policy for the finite-horizon setting is a non-stationary open-loop policy.

Theorem C.1 (Optimal Policy for the $H$-Horizon Setting). Let $H \in \mathbb{N}$. An optimal policy $\pi^*_H = (\pi^*_{H,t})_{t\in[H]}$ maximizing the $H$-horizon expected average reward $J_H(\pi)$ as in Equation (16) is given by:

$$\forall t \in [H],\;\forall H_{t-1} \in \mathcal{H}_{t-1}:\quad \pi^*_{H,t}(H_{t-1}) = u^*_{H,t}, \qquad\text{where}\qquad u^*_{H,t} \in \operatorname*{arg\,max}_{u\in\mathcal{U}}\,\langle h_{[0,H-t]}, u\rangle.$$
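As an illustration of Theorem C.1 (and of how it degenerates into Theorem 2.1 as the residual horizon grows), the sketch below computes $u^*_{H,t} = \arg\max_{u\in\mathcal{U}}\langle h_{[0,H-t]}, u\rangle$ over $\mathcal{U} = \{-1,1\}^d$, for which the maximizer is simply the componentwise sign of the truncated cumulative Markov parameter. Matrices are made up and the helper is ours.

```python
import numpy as np

def truncated_markov_parameter(A, B, omega, theta, k):
    # h_[0,k] = theta + sum_{s=1}^{k} B^T (A^{s-1})^T omega: cumulative effect of a
    # unit action over the next k steps of the hidden-state dynamics.
    h = theta.copy()
    A_pow = np.eye(A.shape[0])
    for _ in range(k):
        h = h + B.T @ A_pow.T @ omega
        A_pow = A_pow @ A
    return h

# Illustrative instance (all values made up).
A = np.array([[0.7, 0.2], [0.0, 0.5]])
B = np.eye(2)
omega = np.array([1.0, -0.5])
theta = np.array([-0.4, 0.3])
H = 10
for t in range(1, H + 1):
    h_trunc = truncated_markov_parameter(A, B, omega, theta, H - t)
    u_star = np.sign(h_trunc)  # argmax over U = {-1, 1}^d
    print(f"t={t}: u*_{{H,t}} = {u_star.astype(int)}")
# As H - t grows, u*_{H,t} approaches the constant optimal action of Theorem 2.1,
# i.e., sign(theta + B^T (I - A)^{-T} omega).
```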
Proof. We start by expressing, for every $t \in [H]$, the reward $y_t$ as a function of the sequence of actions $u = (u_1, \dots, u_H)$ produced by a generic policy $\pi$. By exploiting Equation (4) instanced with $H = t-1$, we have:

$$y_t = \sum_{s=0}^{t-1}\langle h^{\{s\}}, u_{t-s}\rangle + \omega^\top A^{t-1} x_1 + \eta_t + \sum_{s=1}^{t-1}\omega^\top A^{s-1}\epsilon_{t-s}.$$

By computing the expectation, using linearity, and recalling that the noises are zero-mean, we obtain:

$$\mathbb{E}[y_t] = \sum_{s=0}^{t-1}\langle h^{\{s\}}, \mathbb{E}[u_{t-s}]\rangle + \omega^\top A^{t-1}\mathbb{E}[x_1].$$

By averaging over $t \in [H]$, we obtain the $H$-horizon expected average reward:

$$J_H(\pi) = \frac{1}{H}\sum_{t=1}^{H}\sum_{s=0}^{t-1}\langle h^{\{s\}}, \mathbb{E}[u_{t-s}]\rangle + \frac{1}{H}\sum_{t=1}^{H}\omega^\top A^{t-1}\mathbb{E}[x_1]$$
$$= \frac{1}{H}\sum_{s=1}^{H}\sum_{t=s}^{H}\langle h^{\{t-s\}}, \mathbb{E}[u_s]\rangle + \frac{1}{H}\sum_{t=1}^{H}\omega^\top A^{t-1}\mathbb{E}[x_1] \qquad (17)$$
$$= \frac{1}{H}\sum_{s=1}^{H}\langle h_{[0,H-s]}, \mathbb{E}[u_s]\rangle + \frac{1}{H}\sum_{t=1}^{H}\omega^\top A^{t-1}\mathbb{E}[x_1], \qquad (18)$$

where line (17) is obtained by renaming the indexes of the summations, and line (18) comes from the definition of the cumulative Markov parameter $h_{[0,H-s]}$. It is now simple to see, as no noise is present in the expression, that the performance $J_H(\pi)$ is maximized by taking at each round $s \in [H]$ an action $u^*_s = \pi^*_s(H_{s-1})$ whose expectation satisfies $\mathbb{E}[u^*_s] \in \arg\max_{\mathbb{E}[u_s]}\langle h_{[0,H-s]}, \mathbb{E}[u_s]\rangle$. Clearly, we can take the deterministic action such that $u^*_s = \mathbb{E}[u^*_s]$. $\square$

We now show that, for sufficiently large $H$, the $H$-horizon expected average reward $J_H$ tends to coincide with the infinite-horizon expected average reward.

Proposition C.2. Let $H \in \mathbb{N}$. Then, for every policy $\pi$, it holds that:

$$\big|J_H(\pi) - J(\pi)\big| \le \frac{B\,U\,\Omega\,\Phi(A)\big(1-\rho(A)^H\big)}{H\,(1-\rho(A))^2}.$$

Proof. Consider two horizons $H < H' \in \mathbb{N}$, and let $(u_1, u_2, \dots)$ be the sequence of actions played by policy $\pi$. Using Equation (18), we have:

$$J_H(\pi) - J_{H'}(\pi) = \frac{1}{H}\sum_{s=1}^{H}\langle h_{[0,H-s]}, \mathbb{E}[u_s]\rangle - \frac{1}{H'}\sum_{s=1}^{H'}\langle h_{[0,H'-s]}, \mathbb{E}[u_s]\rangle \qquad (19)$$
$$= \frac{1}{H}\sum_{s=1}^{H}\langle h_{[0,H-s]} - h, \mathbb{E}[u_s]\rangle - \frac{1}{H'}\sum_{s=1}^{H'}\langle h_{[0,H'-s]} - h, \mathbb{E}[u_s]\rangle \qquad (20)$$
$$= -\frac{1}{H}\sum_{s=1}^{H}\langle h_{[H-s+1,\infty)}, \mathbb{E}[u_s]\rangle + \frac{1}{H'}\sum_{s=1}^{H'}\langle h_{[H'-s+1,\infty)}, \mathbb{E}[u_s]\rangle. \qquad (21)$$

As shown in Appendix B.1, the second addendum vanishes as $H'$ approaches $\infty$:

$$\left|\frac{1}{H'}\sum_{s=1}^{H'}\langle h_{[H'-s+1,\infty)}, \mathbb{E}[u_s]\rangle\right| \to 0 \quad\text{as } H' \to \infty.$$

Concerning the first addendum, we have:

$$\left|\frac{1}{H}\sum_{s=1}^{H}\langle h_{[H-s+1,\infty)}, \mathbb{E}[u_s]\rangle\right| \le \frac{1}{H}\sum_{s=1}^{H}\big\|h_{[H-s+1,\infty)}\big\|_2\,\big\|\mathbb{E}[u_s]\big\|_2 \le \frac{B\,U\,\Omega\,\Phi(A)}{H(1-\rho(A))}\sum_{s=1}^{H}\rho(A)^{H-s} = \frac{B\,U\,\Omega\,\Phi(A)\big(1-\rho(A)^H\big)}{H\,(1-\rho(A))^2}.$$

The statement follows by letting $H' \to \infty$. $\square$

D. System Identification

This section presents a solution to identify the matrices $A$, $B$, $C$, and $D$ characterizing an LTI system starting from a single trajectory. We adopt a variant of the Ho-Kalman (Ho & Kalman, 1966) algorithm. We start from the identification method proposed by Lale et al. (2020a, Section 3), where the authors consider a system of the type (strictly proper):

$$x_{t+1} = A x_t + B u_t + \epsilon_t, \qquad \tilde y_t = C x_t + z_t, \qquad (22)$$

while our setting can be seen as (not strictly proper):

$$x_{t+1} = A x_t + B u_t + \epsilon_t, \qquad y_t = C x_t + D u_t + z_t, \qquad (23)$$

with $x_t, \epsilon_t \in \mathbb{R}^n$, $u_t \in \mathbb{R}^p$, and $y_t, z_t \in \mathbb{R}^m$. The noises over the state transition model, $\epsilon_t$, and over the output, $z_t$, are $\sigma^2$-subgaussian random variables. In this part, we adopt the standard control notation for LTI systems; the mapping to our problem is straightforward by considering $C = \omega^\top$ and $D = \theta^\top$. In predictive form, the system described in Equation (22) is:

$$\hat x_{t+1} = \bar A\hat x_t + B u_t + F\tilde y_t, \qquad \tilde y_t = C\hat x_t + e_t,$$

where:

$$\bar A = A - FC, \qquad F = A\Sigma C^\top\big(C\Sigma C^\top + \sigma^2 I\big)^{-1},$$

and $\Sigma$ is the solution to the following DARE (Discrete Algebraic Riccati Equation):

$$\Sigma = A\Sigma A^\top - A\Sigma C^\top\big(C\Sigma C^\top + \sigma^2 I\big)^{-1}C\Sigma A^\top + \sigma^2 I.$$

In order to identify this LTI system, we want to estimate the matrix $\tilde G_y$:

$$\tilde G_y = \big[\, CF \;\; C\bar AF \;\; \dots \;\; C\bar A^{H-1}F \;\; CB \;\; C\bar AB \;\; \dots \;\; C\bar A^{H-1}B \,\big]. \qquad (24)$$

To identify the matrix $\tilde G_y$ through the least squares method, we construct, for each $t$, a vector $\tilde\phi_t$:

$$\tilde\phi_t = \big[\, y_{t-1}^\top \;\; \dots \;\; y_{t-H}^\top \;\; u_{t-1}^\top \;\; \dots \;\; u_{t-H}^\top \,\big]^\top \in \mathbb{R}^{(m+p)H}. \qquad (25)$$

The system output $\tilde y_t$ can be rewritten as:

$$\tilde y_t = \tilde G_y\tilde\phi_t + e_t + C\bar A^H x_{t-H}.$$

The output of the system under analysis (Equation 23) is:

$$y_t = \tilde y_t + D u_t = \tilde G_y\tilde\phi_t + D u_t + e_t + C\bar A^H x_{t-H}.$$

We can incorporate the contribution of $D u_t$ in $\tilde G_y$, obtaining $G_y$:

$$G_y = \big[\, CF \;\; C\bar AF \;\; \dots \;\; C\bar A^{H-1}F \;\; D \;\; CB \;\; C\bar AB \;\; \dots \;\; C\bar A^{H-1}B \,\big].$$

The related vector $\phi_t$ is:

$$\phi_t = \big[\, y_{t-1}^\top \;\; \dots \;\; y_{t-H}^\top \;\; u_t^\top \;\; u_{t-1}^\top \;\; \dots \;\; u_{t-H}^\top \,\big]^\top \in \mathbb{R}^{(m+p)H+p}. \qquad (26)$$
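Under the assumption that a trajectory $\{(u_t, y_t)\}$ is stored as arrays, the following sketch builds the regressors $\phi_t$ of Equation (26) and pairs them with the targets $y_t$, then solves the regularized least-squares problem formalized in Equation (27) right below. It is a minimal illustration of the data layout, not the authors' code, and the function names are ours.

```python
import numpy as np

def build_regression_data(Y, U, H):
    # Y: (T, m) outputs, U: (T, p) inputs. For each t >= H, build
    # phi_t = [y_{t-1}, ..., y_{t-H}, u_t, u_{t-1}, ..., u_{t-H}]  (Equation 26)
    # and pair it with the target y_t.
    T = Y.shape[0]
    Phi, targets = [], []
    for t in range(H, T):
        past_y = Y[t - 1::-1][:H].reshape(-1)        # y_{t-1}, ..., y_{t-H}
        past_u = U[t::-1][:H + 1].reshape(-1)        # u_t, u_{t-1}, ..., u_{t-H}
        Phi.append(np.concatenate([past_y, past_u]))
        targets.append(Y[t])
    return np.array(Phi), np.array(targets)

def ridge_G_y(Phi, targets, lam):
    # G_y = argmin_X lam ||X||_F^2 + sum_t ||y_t - X phi_t||_2^2.
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ targets).T

# Tiny synthetic trajectory (made-up data), just to show the shapes involved.
rng = np.random.default_rng(0)
T, m, p, H = 200, 1, 2, 5
U = rng.uniform(-1, 1, size=(T, p))
Y = rng.standard_normal((T, m))          # placeholder outputs
Phi, targets = build_regression_data(Y, U, H)
G_hat = ridge_G_y(Phi, targets, lam=1.0)
print(Phi.shape, G_hat.shape)            # ((T-H, (m+p)H + p), (m, (m+p)H + p))
```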
The best value of $G_y$ can be found through regularized least squares, as in Lale et al. (2020a, Equation 10):

$$\hat G_y \in \operatorname*{arg\,min}_{X}\; \lambda\,\|X\|_F^2 + \sum_{\tau=H}^{t}\big\|y_\tau - X\phi_\tau\big\|_2^2, \qquad (27)$$

where $\|\cdot\|_F$ denotes the Frobenius norm. The matrix $D$ can be directly retrieved from $\hat G_y$. In order to get the matrices $A$, $B$, and $C$, we remove the entries related to $D$ from $\hat G_y$ and retrieve $\tilde G_y$. From this point on, we follow the algorithm proposed in Lale et al. (2020a, Appendix B).

E. Integration on Numerical Simulations

This section is divided into three parts. First, in Section E.1, we provide additional information about the baselines, their hyperparameters, and the optimistic bounds. Second, in Section E.2, we provide all the matrices and vectors generated to run the real-world experiment. Third, in Section E.3, we provide further results for the simulations presented in Section 5.1.

E.1. Additional Notes on the Baselines

As mentioned in Section 5, the chosen baselines are Lin-UCB (Abbasi-Yadkori et al., 2011), D-Lin-UCB (Russac et al., 2019), AR2 (Chen et al., 2021), Exp3 (Auer et al., 1995), Exp3-k (Dekel et al., 2012; Auer et al., 1995), and the Expert (the latter available only in the case of real-world data). All the hyperparameters, whenever possible, are set as prescribed in the original papers. The bounds used for the exploration are adjusted so as to fairly compete in this setting, and are defined as follows:

$$\beta^{\text{Lin-UCB}}_t := c_2\sqrt{\lambda} + \sqrt{2\sigma^2\left(\log\frac{1}{\delta} + \frac{d}{2}\log\!\left(1+\frac{tU^2}{\lambda d}\right)\right)}, \qquad \beta^{\text{D-Lin-UCB}}_t := c_2\sqrt{\lambda} + \sqrt{2\sigma^2\left(\log\frac{1}{\delta} + \frac{d}{2}\log\!\left(1+\frac{tU^2}{\lambda d}\right)\right)},$$

where $c_2$ and $\sigma^2$ are as prescribed in Section 3.2, and the hyperparameter $\gamma$ of D-Lin-UCB is tuned. For AR2, the hyperparameter $\alpha$, describing the correlation over time, is set equal to $\rho(A)$. In the case of Exp3, the rewards are rescaled so as to lie in $[0,1]$ with high probability, dividing by $4\xi$ after an appropriate shift, where $\xi = \Theta + \frac{\Omega B}{1-\rho(A)}$.

[Figure 4. Performance of Dyn Lin-UCB, Lin-UCB, D-Lin-UCB, AR2, Exp3, and Exp3-k at different values of $\sigma$ (50 runs, mean $\pm$ std). Panels: (a) $\sigma = 0.001$, (b) $\sigma = 0.01$, (c) $\sigma = 0.1$; x-axis: rounds (up to $10^6$), y-axis: cumulative regret. The legend includes the $(\log T)$ and $(1)$ variants of Dyn Lin-UCB, Lin-UCB, and D-Lin-UCB.]

Furthermore, in the case of Exp3-k, the batch dimension $k$ is set proportional to $\lceil \log M / \log(1/\rho)\rceil$, where $M$ is the quantity defined in Algorithm 1 (line 2). This batch size $k$ ensures that, at each time $t$, the contribution of the actions $u_s$ with $s \in [t-k-1]$ is negligible. The rewards collected in the same batch are averaged and transformed as in Exp3.

E.2. Further Information on the Real-world Setting

The real-world setting is generated from a dataset containing real data related to the budgets invested in each advertising platform (i.e., the $u_t$) and the overall generated conversions (i.e., the $y_t$), collected from three of the most important advertising platforms of the web (Facebook, Google, and Bing), covering a large number of campaigns for a value of more than 5 million USD over 2 years. Starting from such data, we estimated the best model by means of a specifically designed variant of the Ho-Kalman algorithm (Ho & Kalman, 1966), as described in Appendix D. We used the matrices estimated with Ho-Kalman to build a simulator.
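A rollout of a model of the form used by the simulator can be sketched as follows. The matrices below are placeholders, not the ones estimated from the advertising data, and the function name is ours.

```python
import numpy as np

def simulate_dlb(A, B, omega, theta, actions, sigma, rng):
    # Roll out the DLB dynamics: x_{t+1} = A x_t + B u_t + eps_t,
    # y_t = <omega, x_t> + <theta, u_t> + eta_t, starting from x_1 = 0.
    n = A.shape[0]
    x = np.zeros(n)
    rewards = []
    for u in actions:
        y = omega @ x + theta @ u + sigma * rng.standard_normal()
        rewards.append(y)
        x = A @ x + B @ u + sigma * rng.standard_normal(n)
    return np.array(rewards)

# Placeholder system (NOT the matrices estimated from the real-world dataset).
rng = np.random.default_rng(0)
A = np.array([[0.5, 0.1, 0.0], [0.0, 0.4, 0.1], [0.0, 0.0, 0.3]])
B = np.eye(3)
omega = np.array([1.0, 0.8, 0.5])
theta = np.array([0.1, 0.05, 0.0])
budgets = [np.array([1.0, 0.5, 0.2])] * 50      # constant budget allocation
print(simulate_dlb(A, B, omega, theta, budgets, sigma=0.01, rng=rng)[-5:])
```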
The resulting system has $\rho(A) = 0.67$ and is characterized by the following estimated matrices:

0.38 0.33 0.60 0.07 0.76 0.54
0.18 0.34 0.05 0.14 0.34 0.05
0.17 0.03 0.01 0.04 0.09 0.17
0.61 0.04 0.13 0.13 0.41 0.02

E.3. Additional Numerical Simulations

These additional results are obtained in the setting presented in Section 5.1. Here, however, we analyze the behavior of Dyn Lin-UCB and the other bandit baselines at different magnitudes of the noise affecting both the state transition model and the output. The noise in this simulation is zero-mean Gaussian with $\sigma \in \{0.001, 0.01, 0.1\}$.

Results  Figure 4 shows the results of the experiment for the different values of $\sigma$. It is clearly visible that Dyn Lin-UCB performs in almost the same way regardless of the noise the system is subject to, always attaining sub-linear regret. On the other hand, the cumulative regret of both Lin-UCB and D-Lin-UCB changes in every simulation we perform. Indeed, with a low level of noise (Figure 4a), they incur linear regret and do not converge, while for large values of noise they converge very quickly (Figure 4c). This is due to the nature of the confidence bound of linear bandits, which is not able to account for such a complex scenario and provides no guarantees in this setting. Exp3, Exp3-k, and AR2 are not able to reach the optimum in this scenario, independently of the noise magnitude $\sigma$, and incur large (linear) regret.

E.4. Computational Time

The code used for the results provided in this section has been run on an Intel(R) i5-8259U @ 2.30 GHz CPU with 8 GB of LPDDR3 system memory. The operating system was macOS 12.2.1, and the experiments have been run on Python 3.9.7. A single run of Dyn Lin-UCB takes 110 seconds. It is worth noting that the time complexity of Dyn Lin-UCB is upper-bounded by that of Lin-UCB.