# Rebounding Bandits for Modeling Satiation Effects

Liu Leqi, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213. leqi@cs.cmu.edu
Fatma Kılınç-Karzan, Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA 15213. fkilinc@andrew.cmu.edu
Zachary C. Lipton, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213. zlipton@cmu.edu
Alan L. Montgomery, Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA 15213. alanmontgomery@cmu.edu

Abstract

Psychological research shows that enjoyment of many goods is subject to satiation, with short-term satisfaction declining after repeated exposures to the same item. Nevertheless, proposed algorithms for powering recommender systems seldom model these dynamics, instead proceeding as though user preferences were fixed in time. In this work, we introduce rebounding bandits, a multi-armed bandit setup where satiation dynamics are modeled as time-invariant linear dynamical systems. Expected rewards for each arm decline monotonically with consecutive exposures and rebound towards the initial reward whenever that arm is not pulled. Unlike classical bandit algorithms, methods for tackling rebounding bandits must plan ahead, and model-based methods rely on estimating the parameters of the satiation dynamics. We characterize the planning problem, showing that the greedy policy is optimal when the arms exhibit identical deterministic dynamics. To address stochastic satiation dynamics with unknown parameters, we propose Explore-Estimate-Plan, an algorithm that pulls arms methodically, estimates the system dynamics, and then plans accordingly.

## 1 Introduction

Recommender systems suggest such diverse items as music, news, restaurants, and even job candidates. Practitioners hope that by leveraging historical interactions, they might provide services better aligned with their users' preferences. However, despite their ubiquity in application, the dominant learning framework suffers from several conceptual gaps that can result in misalignment between machine behavior and human preferences. For example, because human preferences are seldom directly observed, these systems are typically trained on the available observational data (e.g., purchases, ratings, or clicks) with the objective of predicting customer behavior [4, 27]. Problematically, such observations tend to be confounded (reflecting exposure bias due to the current recommender system) and subject to censoring (e.g., users with strong opinions are more likely to write reviews) [41, 16].

Even if we could directly observe the utility experienced by each user, we might expect it to depend, in part, on the history of past items consumed. For example, consider the task of automated (music) playlisting. As a user is made to listen to the same song over and over again, we might expect that the utility derived from each consecutive listen would decline [35]. However, after listening to other music for some time, we might expect the utility associated with that song to bounce back towards its baseline level. Similarly, a diner served pizza for lunch might feel diminished pleasure upon eating pizza again for dinner. The psychology literature on satiation formalizes the idea that enjoyment depends not only on one's intrinsic preference for a given product but also on the sequence of previous exposures and the time between them [3, 6].
Research on satiation dates to the 1960s (if not earlier), with early studies addressing brand loyalty [42, 28]. Interestingly, even after controlling for marketing variables like price, product design, and promotion, researchers still observe brand-switching behavior in consumers. Such behavior, referred to as variety seeking, has often been explained as a consequence of utility associated with the change itself [25, 17]. For a comprehensive review of hedonic decline caused by repeated exposure to a stimulus, we refer the reader to [11].

In this paper, we introduce rebounding bandits, a multi-armed bandits (MABs) [37] framework that models satiation via linear dynamical systems. While traditional MABs draw rewards from fixed but unknown distributions, rebounding bandits allow each arm's rewards to evolve as a function of both the per-arm characteristics (susceptibility to satiation and speed of rebounding) and the historical pulls (e.g., past recommendations). In rebounding bandits, even if the dynamics are known and deterministic, selecting the optimal sequence of $T$ arms to play requires planning in a Markov decision process (MDP) whose state space scales exponentially in the horizon $T$. When the satiation dynamics are known and stochastic, the states are only partially observable, since the satiation of each arm evolves with (unobserved) stochastic noise between pulls. And when the satiation dynamics are unknown, learning requires that we identify a stochastic dynamical system.

We propose Explore-Estimate-Plan (EEP), an algorithm that (i) collects data by pulling each arm repeatedly; (ii) estimates the dynamics using this dataset; and (iii) plans using the estimated parameters. We provide guarantees for our estimators in §6.2 and bound EEP's regret in §6.3. Our main contributions are: (i) the rebounding bandits problem (§3); (ii) analysis showing that when arms share rewards and (deterministic) dynamics, the optimal policy pulls arms cyclically, exhibiting variety-seeking behavior (§4.1); (iii) an estimator (for learning the satiation dynamics) along with a sample complexity bound for identifying an affine dynamical system using a single trajectory of data (§6.2); (iv) EEP, an algorithm for learning with unknown stochastic dynamics that achieves sublinear $w$-step lookahead regret [34] (§6); and (v) experiments demonstrating EEP's efficacy (§7).

## 2 Related Work

Satiation effects have been addressed by such diverse disciplines as psychology, marketing, operations research, and recommendation systems. In the psychology and marketing literatures, satiation has been proposed as an explanation for variety-seeking consumer behavior [11, 25, 26]. In operations research, addressing continuous consumption decisions, [3] propose a deterministic linear dynamical system to model satiation effects. In the recommendation systems community, researchers have used semi-Markov models to explicitly model two states: (i) sensitization, where the user is highly interested in the product; and (ii) boredom, where the user is not engaged [18].

The bandits literature has proposed a variety of extensions where rewards depend on past exposures, both to address satiation and other phenomena. [14, 21, 39] tackle settings where each arm's expected reward grows (or shrinks) monotonically in the number of pulls. By contrast, [19, 2, 7] propose models where rewards increase as a function of the time elapsed since the last pull.
[34] model the expected reward as a function of the time since the last pull, drawn from a Gaussian process with a known kernel. [43] propose a model where rewards are linear functions of the recent history of actions, and [29] model the reward as a function of a context that evolves according to known deterministic dynamics. In rested bandits [12], an arm's expected rewards change only when it is played, and in restless bandits [44], rewards evolve independently of the play of each arm.

**Key Differences.** This may be the first bandits paper to model evolving rewards through continuous-state linear stochastic dynamical systems with unknown parameters. Our framework captures several important aspects of satiation: rewards decline by diminishing amounts with consecutive pulls and rebound towards the baseline with disuse. Unlike models that depend only on fixed windows or the time since the last pull, our model expresses satiation more organically, as a quantity that evolves according to stochastic dynamics. To estimate the reward dynamics, we leverage recent advances in the identification of linear dynamical systems [40, 38] that rely on the theory of self-normalized processes [33, 1] and block martingale conditions [40].

Figure 1: The satiation level and reward of an arm from time 1 to 30. The two panels are generated with the same pull sequence, base reward $b_k = 3$, and realized noises with variance $\sigma_z = 0.1$. In Figure 1a, $\gamma_k = 0.5$ and $\lambda_k = 3$; in Figure 1b, $\gamma_k = 0.8$ and $\lambda_k = 1.5$. In both cases, the arm starts with 0 as its base satiation level. Black dashed line: the satiation level. Red solid line: the reward. Blue dots: time steps where the arm is pulled.

## 3 Rebounding Bandits Problem Setup

Consider the set of $K$ arms $[K] := \{1, \ldots, K\}$ with bounded base rewards $b_1, \ldots, b_K$. Given a horizon $T$, a policy $\pi_{1:T} := (\pi_1, \ldots, \pi_T)$ is a sequence of actions, where $\pi_t \in [K]$ depends on past actions and observed rewards. For any arm $k \in [K]$, we denote its pull history from time 0 to $T$ as the binary sequence $u_{k,0:T} := (u_{k,0}, \ldots, u_{k,T})$, where $u_{k,0} = 0$ and, for $t \in [T]$, $u_{k,t} = 1$ if $\pi_t = k$ and $u_{k,t} = 0$ otherwise. The subsequence of $u_{k,0:T}$ from $t_1$ to $t_2$ (including both endpoints) is denoted by $u_{k,t_1:t_2}$.

At time $t$, each arm $k$ has a satiation level $s_{k,t}$ that depends on a satiation retention factor $\gamma_k \in [0, 1)$, as follows:
$$s_{k,t} := \gamma_k (s_{k,t-1} + u_{k,t-1}) + z_{k,t-1}, \qquad t > t_0^k, \quad (1)$$
where $t_0^k := \min\{t : u_{k,t} = 1\}$ is the first time arm $k$ is pulled and $z_{k,t-1}$ is independent and identically distributed noise drawn from $\mathcal{N}(0, \sigma_z^2)$, accounting for incidental (uncorrelated) factors in the satiation dynamics. Because satiation requires exposure, arms only begin to have nonzero satiation levels after their first pull, i.e., $s_{k,0} = \cdots = s_{k,t_0^k} = 0$.

At time $t \in [T]$, if arm $k$ is played with a current satiation level $s_{k,t}$, the agent receives reward $\mu_{k,t} := b_k - \lambda_k s_{k,t}$, where $b_k$ is the base reward for arm $k$ and $\lambda_k \ge 0$ is a bounded exposure influence factor. We use satiation influence to denote the product of the exposure influence factor $\lambda_k$ and the satiation level $s_{k,t}$. In Figure 1, we show how rewards evolve in response to both pulls and the stochastic dynamics under two sets of parameters.
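To make the dynamics concrete, the following minimal NumPy sketch simulates the satiation recursion (1) and the reward model $\mu_{k,t} = b_k - \lambda_k s_{k,t}$ for a single arm under a given pull sequence, in the spirit of Figure 1. The function name `simulate_rewards`, the loop structure, and the example pull sequence are our own illustrative choices, not the paper's code.

```python
import numpy as np

def simulate_rewards(pulls, gamma, lam, b, sigma_z=0.1, rng=None):
    """Simulate the satiation recursion (1) and rewards mu_t = b - lam * s_t for one arm.

    pulls: binary sequence u_1, ..., u_T (u_t = 1 if the arm is pulled at time t).
    Satiation stays at 0 until the arm's first pull; afterwards it follows
    s_t = gamma * (s_{t-1} + u_{t-1}) + z_{t-1} with z ~ N(0, sigma_z^2).
    """
    rng = np.random.default_rng(rng)
    T = len(pulls)
    s = np.zeros(T + 1)          # s[t]: satiation level at time t
    mu = np.full(T + 1, np.nan)  # mu[t]: realized reward if the arm is pulled at time t
    first_pull = None
    for t in range(1, T + 1):
        if first_pull is not None and t > first_pull:
            s[t] = gamma * (s[t - 1] + pulls[t - 2]) + rng.normal(0.0, sigma_z)
        if pulls[t - 1] == 1:
            if first_pull is None:
                first_pull = t
            mu[t] = b - lam * s[t]
    return s[1:], mu[1:]

# Pull the arm for 10 consecutive steps, then rest for 10 steps.
u = [1] * 10 + [0] * 10
satiation, reward = simulate_rewards(u, gamma=0.5, lam=3.0, b=3.0, sigma_z=0.1, rng=0)
print(np.round(satiation, 2))  # rises under consecutive pulls, then decays back toward 0
```

Running this with the two parameter settings of Figure 1 reproduces the qualitative behavior described there: satiation builds up geometrically while the arm is played and rebounds toward 0 when it is rested.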
The expected reward of arm $k$ (where the expectation is taken over all noises associated with the arm) decreases monotonically, by diminishing amounts, with consecutive pulls, and increases with disuse, again by diminishing amounts.

Remark 1. We note that there exist choices of $b_k, \gamma_k, \lambda_k$ for which the expected reward of arm $k$ can be negative. In the traditional bandits setup, one must pull an arm at every time step. Thus, what matters are the relative rewards, and the problem is mathematically identical regardless of whether the expected rewards range from $-10$ to $0$ or from $0$ to $10$. In addition, one might construct settings where negative expected rewards are reasonable. For example, when one of the arms corresponds to no recommendation, with $0$ being its expected reward (e.g., $b_k = 0$, $\lambda_k = 0$), the interpretation of a negative expected reward is that pulling (recommending) the corresponding arm (item) is less preferred than not pulling (making no recommendation).

Given a horizon $T \ge 1$, we seek a pull sequence $\pi_{1:T}$, where $\pi_t$ depends on past rewards and actions $(\pi_1, \mu_{\pi_1,1}, \ldots, \pi_{t-1}, \mu_{\pi_{t-1},t-1})$, that maximizes the expected cumulative reward:
$$G_T(\pi_{1:T}) := \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{\pi_t, t}\Big]. \quad (2)$$

**Additional Notation.** Let $\gamma := \max_{k \in [K]} \gamma_k$ and $\lambda := \max_{k \in [K]} \lambda_k$. We write $a \lesssim b$ when $a \le Cb$ for some positive constant $C$.

## 4 Planning with Known Dynamics

Before we can hope to learn an optimal policy with unknown stochastic dynamics, we need to establish a procedure for planning when the satiation retention factors, exposure influence factors, and base rewards are known. We begin by presenting several planning strategies and analyzing them under deterministic dynamics, where the past pulls exactly determine each arm's satiation level, i.e., $s_{k,t} = \gamma_k(s_{k,t-1} + u_{k,t-1})$ for $t > t_0^k$.

With some abuse of notation, at time $t \ge 2$, given a pull sequence $u_{k,0:t-1}$, we can express the satiation and the expected reward of each arm as
$$s_{k,t}(u_{k,0:t-1}) = \gamma_k (s_{k,t-1} + u_{k,t-1}) = \gamma_k\big(\gamma_k (s_{k,t-2} + u_{k,t-2})\big) + \gamma_k u_{k,t-1} = \sum_{i=1}^{t-1}\gamma_k^{t-i} u_{k,i}, \qquad \mu_{k,t}(u_{k,0:t-1}) = b_k - \lambda_k \sum_{i=1}^{t-1}\gamma_k^{t-i} u_{k,i}. \quad (3)$$
(We say expected reward to emphasize that all results in this section also apply to settings where the satiation dynamics are deterministic but the rewards are stochastic, i.e., $\mu_{k,t} = b_k - \lambda_k s_{k,t} + e_{k,t}$ for independent mean-zero noises $e_{k,t}$.) At time $t = 1$, we have $s_{k,1}(u_{k,0:0}) = 0$ and $\mu_{k,1}(u_{k,0:0}) = b_k$ for all $k \in [K]$.

Since the arm parameters $\{\lambda_k, \gamma_k, b_k\}_{k=1}^K$ are known, our goal (2) simplifies to finding a pull sequence that solves the following bilinear integer program:
$$\max_{\{u_{k,t}\}} \ \sum_{t=1}^{T}\sum_{k=1}^{K} u_{k,t}\Big(b_k - \lambda_k \sum_{i=0}^{t-1}\gamma_k^{t-i} u_{k,i}\Big) \quad \text{s.t.} \quad \sum_{k=1}^{K} u_{k,t} = 1 \ \ \forall t \in [T], \qquad u_{k,t} \in \{0,1\},\ u_{k,0} = 0 \ \ \forall k \in [K],\ t \in [T], \quad (4)$$
where the objective maximizes the expected cumulative reward associated with the pull sequence and the constraints ensure that at each time period we pull exactly one arm. Note that (4) includes products of the decision variables $u_{k,t}$, leading to bilinear terms in the objective. In Appendix A, we provide an equivalent integer linear program.

### 4.1 The Greedy Policy

At each step, the greedy policy $\pi^g$ picks the arm with the highest instantaneous expected reward. Formally, at time $t$, given the pull history $\{u_{k,0:t-1}\}_{k=1}^K$, the greedy policy picks $\pi^g_t \in \arg\max_{k \in [K]} \mu_{k,t}(u_{k,0:t-1})$. To break ties, when all arms have the same expected reward, the greedy policy chooses the arm with the lowest index. Note that the greedy policy is not, in general, optimal. Sometimes, we are better off allowing the current best arm to rebound even further before pulling it again.

Example 1. Consider the case with two arms. Suppose that arm 1 has base reward $b_1$, satiation retention factor $\gamma_1 \in (0, 1)$, and exposure influence factor $\lambda_1 = 1$. For any fixed time horizon $T > 2$, suppose that arm 2 has $b_2 = b_1 + \frac{\gamma_2 - \gamma_2^{T}}{1 - \gamma_2}$, where $\gamma_2 \in (0, 1)$ and $\lambda_2 = 1$.
The greedy policy $\pi^g_{1:T}$ will keep pulling arm 2 until time $T - 1$ and then play arm 1 (or arm 2) at time $T$. This is true because, if we keep pulling arm 2 until $T - 1$, then at time $T$ we have $\mu_{2,T}(u_{2,0:T-1}) = b_1 = \mu_{1,T}(u_{1,0:T-1})$. However, the policy $\pi^n_{1:T}$, where $\pi^n_t = 2$ if $t \le T - 2$, $\pi^n_{T-1} = 1$, and $\pi^n_T = 2$, obtains a higher expected cumulative reward. In particular, the difference $G_T(\pi^n_{1:T}) - G_T(\pi^g_{1:T})$ will be $\gamma_2 - \gamma_2^{T-1}$.

### 4.2 When is Greedy Optimal?

When the satiation effect is always 0, e.g., when the satiation retention factors $\gamma_k = 0$ for all $k \in [K]$, the greedy policy (which always plays the arm with the highest instantaneous expected reward) is optimal. However, when satiation can be nonzero, it is less clear under what conditions the greedy policy performs optimally. This question is of special interest when we consider human decision-making, since we cannot expect people to solve large-scale bilinear integer programs every time they pick music to listen to. In this section, we show that when all arms share the same properties ($\gamma_k, \lambda_k, b_k$ are identical for all $k \in [K]$), the greedy policy is optimal. In this case, the greedy policy exhibits variety-seeking behavior, as it plays the arms cyclically. Interestingly, this condition aligns with the early research that motivated studies on satiation [42, 28]: when controlling for marketing variables (here, the arm parameters $\gamma_k, \lambda_k, b_k$), researchers still observe variety-seeking behavior among consumers (here, playing arms in a cyclic order).

Assumption 1. $\gamma_1 = \cdots = \gamma_K = \gamma$, $\lambda_1 = \cdots = \lambda_K = \lambda$, and $b_1 = \cdots = b_K = b$.

We start by characterizing the greedy policy when Assumption 1 holds.

Lemma 1 (Greedy Policy Characterization). Under Assumption 1 and the tie-breaking rule that when all arms have the same expected reward, the greedy policy chooses the one with the lowest arm index, the sequence of arms pulled by the greedy policy forms a periodic sequence: $\pi_1 = 1, \pi_2 = 2, \ldots, \pi_K = K$, and $\pi_{t+K} = \pi_t$ for all $t \in \mathbb{N}_+$. In this case, the greedy policy is equivalent to playing the arms in a cyclic order.

All proofs for the paper are deferred to the Appendices.

Theorem 1. Under Assumption 1, given any horizon $T$, the greedy policy $\pi^g_{1:T}$ is optimal.

Remark 2. Theorem 1 suggests that when the (deterministic) satiation dynamics and base rewards are identical across arms, planning does not require knowledge of those parameters, since playing the arms in a cyclic order is optimal. Lemma 1 and Theorem 1 lead us to the following conclusion: when recommending items that share the same properties, the best strategy is to show the users a variety of recommendations by following the greedy policy.
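As a quick illustration of Lemma 1, the following sketch runs the greedy policy under deterministic dynamics (the noise-free version of (1)). With identical arm parameters it produces the cyclic schedule $1, 2, \ldots, K, 1, 2, \ldots$; the helper name `greedy_schedule` and the parameter values are our own choices, not the paper's.

```python
import numpy as np

def greedy_schedule(T, gammas, lams, bs):
    """Greedy policy under deterministic satiation dynamics (noise-free (1)):
    play the arm with the highest instantaneous expected reward b_k - lam_k * s_k,
    breaking ties in favor of the lowest arm index."""
    K = len(bs)
    s = np.zeros(K)  # satiation levels; arms never pulled stay at 0 automatically
    schedule = []
    for _ in range(T):
        k = int(np.argmax(bs - lams * s))  # np.argmax breaks ties by lowest index
        schedule.append(k + 1)             # report arms as 1, ..., K
        u = np.zeros(K)
        u[k] = 1.0
        s = gammas * (s + u)               # deterministic version of (1)
    return schedule

# Identical arms: the greedy policy plays the arms cyclically, as in Lemma 1.
K = 3
print(greedy_schedule(9, np.full(K, 0.7), np.full(K, 2.0), np.full(K, 3.0)))
# -> [1, 2, 3, 1, 2, 3, 1, 2, 3]
```

With heterogeneous parameters the same routine can be used to reproduce the suboptimal behavior of Example 1, since the greedy choice ignores how much further an arm could rebound before its next pull.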
On a related note, Theorem 1 also gives an exact Max K-Cut of a complete graph $K_T$ on $T$ vertices, where the edge weight connecting vertices $i$ and $j$ is $e(i, j) = \lambda\gamma^{|j-i|}$ for $i \ne j$. The Max K-Cut problem partitions the vertices of a graph into $K$ subsets $P_1, \ldots, P_K$ such that the sum of the edge weights connecting the subsets is maximized [10]. Mapping the Max K-Cut problem back to our original setup, each vertex represents a time step: if vertex $i$ is assigned to subset $P_k$, then arm $k$ should be played at time $i$. The edge weights $e(i, j) = \lambda\gamma^{|j-i|}$ for $i \ne j$ can be seen as the reduction in satiation influence achieved by not playing the same arm at both time $i$ and time $j$, and the goal (4) is to maximize the total reduction in satiation influence.

Proposition 2 (Connection to Max K-Cut). Under Assumption 1, an optimal solution to (4) is given by a Max K-Cut on $K_T$, where $K_T$ is a complete graph on $T$ vertices with edge weights $e(i, j) = \lambda\gamma^{|j-i|}$ for all $i \ne j$. Using Lemma 1 and Theorem 1, we obtain an exact Max K-Cut of $K_T$: for $k \in [K]$, $P_k = \{t \in [T] : t \equiv k \pmod{K}\}$.

### 4.3 The w-lookahead Policy

To model settings where the arms correspond to items with different characteristics (e.g., we can enjoy tacos on consecutive days but require time to recover from a trip to the steakhouse), we must allow the satiation parameters to vary across arms. Here, the greedy policy may not be optimal. Thus, we consider more general lookahead policies (of which the greedy policy is a special case). Given a window of size $w$ and the current satiation levels, the $w$-lookahead policy picks actions to maximize the total reward over the next $w$ time steps.

Let $l$ denote $\lceil T/w \rceil$. Define $t_i = \min\{iw, T\}$ for $i \in [l]$ and $t_0 = 0$. More formally, the $w$-lookahead policy $\pi^w_{1:T}$ is defined as follows: for any $i \in [l]$, given the pull histories $\{u^w_{k,0:t_{i-1}}\}_{k=1}^K$ of the previously chosen arms, where $u^w_{k,0} = 0$ and $u^w_{k,t} = 1$ if (and only if) $\pi^w_t = k$, the next $w$ (or $T \bmod w$) actions $\pi^w_{t_{i-1}+1:t_i}$ are given by
$$\max_{\pi^w_{t_{i-1}+1:t_i}} \Big\{ \sum_{t=t_{i-1}+1}^{t_i} \mu_{\pi^w_t, t}(u_{\pi^w_t, 0:t-1}) \ : \ u_{k,0:t_{i-1}} = u^w_{k,0:t_{i-1}} \ \forall k \in [K], \ \ \sum_{k=1}^{K} u_{k,t} = 1 \ \forall t \in [t_i], \ \ u_{k,t} \in \{0,1\} \ \forall k \in [K],\ t \in [t_i] \Big\}. \quad (5)$$
In the case of a tie, one can pick any of the sequences that maximize (5). We recover the greedy policy when the window size is $w = 1$, and finding the $w$-lookahead policy for window size $w = T$ is equivalent to solving (4).

Remark 3. Another reasonable lookahead policy, which requires planning ahead at every time step, would be the following: at every time $t$, plan for the next $w$ actions and follow them for a single time step. To lighten the computational load, we adopt the current $w$-lookahead policy, which only requires planning every $w$ time steps.

For the rest of the paper, we use $\mathrm{Lookahead}(\{\lambda_k, \gamma_k, b_k\}_{k=1}^K, \{u^w_{k,0:t_{i-1}}\}_{k=1}^K, t_{i-1}, t_i)$ to refer to the solution of (5) when the arm parameters are $\{\lambda_k, \gamma_k, b_k\}_{k=1}^K$ and the historical pull sequences of all arms up to time $t_{i-1}$ are given by $\{u^w_{k,0:t_{i-1}}\}_{k=1}^K$. The solution corresponds to the actions that should be taken by the $w$-lookahead policy for the next $t_i - t_{i-1}$ time steps.

Theorem 2. Given any horizon $T$, let $\pi^*_{1:T}$ be a solution to (4). For a fixed window size $w \le T$, we have that
$$G_T(\pi^*_{1:T}) - G_T(\pi^w_{1:T}) \le \frac{\lambda\gamma(1 - \gamma^{T-w})}{(1 - \gamma)^2}\Big\lceil \frac{T}{w} \Big\rceil.$$

Remark 4. Note that when $w = T$, the $w$-lookahead policy is by definition the optimal policy, and in this case the upper bound on the optimality gap of $w$-lookahead established in Theorem 2 is also 0. In contrast to the optimal policy, the computational benefit of the $w$-lookahead policy becomes apparent when the horizon $T$ is large, since it requires solving a much smaller program (5). In general, the $w$-lookahead policy is expected to perform much better than the greedy policy (which corresponds to the case $w = 1$), at the expense of a higher computational cost.

Finally, we note that for a window size of $w = \sqrt{T}$, Theorem 2 yields $G_T(\pi^*_{1:T}) - G_T(\pi^w_{1:T}) \le O(\sqrt{T})$.
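Since the paper solves (4) and (5) as integer programs (e.g., with Gurobi in §7), the sketch below instead illustrates the $w$-lookahead subproblem by brute force: for small $K$ and $w$, it enumerates all $K^w$ action sequences under the deterministic expected dynamics, starting from given satiation levels. The function name `lookahead`, its arguments, and the example parameters are our own; this is a conceptual stand-in for the integer-programming solver, not the paper's implementation.

```python
import itertools
import numpy as np

def lookahead(w, s0, gammas, lams, bs):
    """Brute-force version of the w-step planning problem for small K and w:
    enumerate all K^w arm sequences, propagate the deterministic satiation
    recursion s <- gamma * (s + u), and return the sequence with the largest
    total expected reward sum_t (b_k - lam_k * s_k)."""
    K = len(bs)
    best_seq, best_value = None, -np.inf
    for seq in itertools.product(range(K), repeat=w):
        s, total = np.array(s0, dtype=float), 0.0
        for k in seq:
            total += bs[k] - lams[k] * s[k]   # expected reward of the chosen arm
            u = np.zeros(K)
            u[k] = 1.0
            s = gammas * (s + u)              # deterministic satiation update
        if total > best_value:
            best_seq, best_value = seq, total
    return [k + 1 for k in best_seq], best_value

# Two heterogeneous arms starting from zero satiation.
gammas, lams, bs = np.array([0.5, 0.9]), np.array([1.0, 3.0]), np.array([2.0, 4.0])
plan, value = lookahead(w=5, s0=[0.0, 0.0], gammas=gammas, lams=lams, bs=bs)
print(plan, round(float(value), 3))  # greedy (w = 1) would start with arm 2; the planner may differ
```

Exhaustive enumeration scales as $K^w$ and is only meant to convey what (5) optimizes; the integer-programming formulation is what makes larger windows tractable in practice.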
## 5 Learning with Unknown Dynamics: Preliminaries

When the satiation dynamics are unknown and stochastic ($\sigma_z > 0$), the learner faces a continuous-state partially observable MDP, because the satiation levels are not observable. To set the stage, we first introduce our state representation (§5.1) and a regret-based performance measure (§5.2). In the next section, we introduce EEP, our algorithm for rebounding bandits.

### 5.1 State Representation

Following [32], at any time $t \in [T]$, we define a state vector $x_t$ in the state space $\mathcal{X}$ to be $x_t = (x_{1,t}, n_{1,t}, x_{2,t}, n_{2,t}, \ldots, x_{K,t}, n_{K,t})$, where $n_{k,t} \in \mathbb{N}$ is the number of steps at time $t$ since arm $k$ was last selected and $x_{k,t}$ is the satiation influence (the product of $\lambda_k$ and the satiation level) as of the most recent pull of arm $k$. Since the most recent pull happened at $t - n_{k,t}$, we have $x_{k,t} = b_k - \mu_{k,t-n_{k,t}} = \lambda_k s_{k,t-n_{k,t}}$. Recall that $\mu_{k,t-n_{k,t}}$ is the reward collected by pulling arm $k$ at time $t - n_{k,t}$. Note that $b_k$ is directly observed when arm $k$ is pulled for the first time, because there is no satiation effect yet. The state at the first time step is $x_1 = (0, \ldots, 0)$.

At time $t$, if arm $k$ is chosen at state $x_t$ and reward $\mu_{k,t}$ is obtained, then the next state $x_{t+1}$ satisfies: (i) for the pulled arm $k$, $n_{k,t+1} = 1$ and $x_{k,t+1} = b_k - \mu_{k,t}$; (ii) for every other arm $k' \ne k$, $n_{k',t+1} = n_{k',t} + 1$ if $n_{k',t} \ne 0$, $n_{k',t+1} = 0$ if $n_{k',t} = 0$, and the satiation influence remains the same, $x_{k',t+1} = x_{k',t}$.

Given $\{\gamma_k, \lambda_k, b_k\}_{k=1}^K$, the reward function $r : \mathcal{X} \times [K] \to \mathbb{R}$ represents the expected reward of pulling arm $k$ under state $x_t$: if $n_{k,t} = 0$, then $r(x_t, k) = b_k$; if $n_{k,t} \ge 1$, then
$$r(x_t, k) = \mathbb{E}[\mu_{k,t} \mid x_t] = b_k - \gamma_k^{n_{k,t}} x_{k,t} - \lambda_k \gamma_k^{n_{k,t}},$$
where the expectation is taken over the noises between the current pull and the last pull of arm $k$. See Appendix C.1 for the full description of the MDP setup (including the transition kernel and the value function definition) of rebounding bandits.

### 5.2 Evaluation Criteria: w-step Lookahead Regret

In reinforcement learning (RL), the performance of a learner is often measured through a regret that compares the expected cumulative reward obtained by the learner against that of an optimal policy in a competitor class [20]. In most of the episodic (e.g., finite-horizon) RL literature [31, 15], regrets are defined in terms of episodes. In such cases, the initial state is reset (e.g., to a fixed state) after each episode ends, independent of previous actions taken by the learner. Unlike these episodic RL setups, in rebounding bandits we cannot restart from the initial state, because the satiation level cannot be reset and the user's memory depends on past received recommendations. Instead, [34] proposed a version of the $w$-step lookahead regret that divides the $T$ time steps into $\lceil T/w \rceil$ episodes, where each episode (besides the last) consists of $w$ time steps. At the beginning of each episode, the initial state is reset, but the reset state depends on how the learner has interacted with the user previously. In particular, at the beginning of episode $i+1$ (at time $t = iw + 1$), given that the learner has played $\pi_{1:iw}$ with corresponding pull sequences $u_{k,0:iw}$ for $k \in [K]$, we reset the initial state to be $\bar{x}^i = (\mu_{1,iw+1}(u_{1,0:iw}), n_{1,iw+1}, \ldots, \mu_{K,iw+1}(u_{K,0:iw}), n_{K,iw+1})$, where $\mu_{k,t}(\cdot)$ is defined in (3) and $n_{k,iw+1}$ is the number of steps since arm $k$ was last pulled by the learner as of time $iw + 1$.
Then, given the learner's policy $\pi_{1:T}$, where $\pi_t : \mathcal{X} \to [K]$, the $w$-step lookahead regret against a competitor class $\mathcal{C}_w$ (which we define below) is defined as follows:
$$\mathrm{Reg}_w(T) = \sum_{i=0}^{\lceil T/w \rceil - 1}\left( \max_{\tilde\pi_{1:w} \in \mathcal{C}_w} \mathbb{E}\Big[\sum_{j=1}^{\min\{w,\, T - iw\}} r\big(x_{iw+j}, \tilde\pi_j(x_{iw+j})\big) \,\Big|\, x_{iw+1} = \bar{x}^i\Big] - \mathbb{E}\Big[\sum_{j=1}^{\min\{w,\, T - iw\}} r\big(x_{iw+j}, \pi_{iw+j}(x_{iw+j})\big) \,\Big|\, x_{iw+1} = \bar{x}^i\Big] \right), \quad (6)$$
where the expectation is taken over $x_{iw+2}, \ldots, x_{\min\{(i+1)w, T\}}$. The competitor class $\mathcal{C}_w$ that we have chosen consists of policies that depend only on time steps, i.e.,
$$\mathcal{C}_w = \big\{ \tilde\pi_{1:w} : \tilde\pi_t = \tilde\pi_t(x_t) = \tilde\pi_t(x'_t),\ \tilde\pi_t \in [K],\ \forall t \in [w],\ \forall x_t, x'_t \in \mathcal{X} \big\}.$$
We note that $\mathcal{C}_w$ subsumes many traditional competitor classes in the bandits literature, including the class of fixed-action policies considered in adversarial bandits [20] and the class of periodic ranking policies [7]. In our paper, the $w$-lookahead policy (including the $T$-lookahead policy given by (4)) is a time-dependent policy that belongs to $\mathcal{C}_w$, since at time $t$ it plays a fixed action obtained by solving (5) using the true reward parameters $\{\lambda_k, \gamma_k, b_k\}_{k=1}^K$.

The time-dependent competitor class $\mathcal{C}_w$ differs from a state-dependent competitor class, which includes all measurable functions $\tilde\pi_t$ that map from $\mathcal{X}$ to $[K]$. The state-dependent competitor class contains the optimal policy $\pi^*$, where $\pi^*_t(x_t)$ depends not just on the time step but also on the exact state $x_t$. Finding the optimal state-dependent policy requires optimal planning for a continuous-state MDP, which relies on state space discretization [31] or function approximation (e.g., approximate dynamic programming algorithms [30, 9, 36]). In Appendix C, we provide discussion and analysis of an algorithm compared against the optimal state-dependent policy. We proceed through the rest of the main paper with $\mathcal{C}_w$ defined above.

When $w = 1$, the 1-step lookahead regret is also known as the instantaneous regret, which is commonly used in the restless bandits literature and in some nonstationary bandits papers, including [29]. Note that low instantaneous regret does not imply high expected cumulative reward in the long term, i.e., one may benefit more by waiting for certain arms to rebound. When $w = T$, we recover the full-horizon regret. As we have noted earlier, finding the optimal competitor policy in this case is computationally intractable, because the number of states, even when the satiation dynamics are deterministic, grows exponentially with the horizon $T$. Finally, we note that the $w$-step lookahead regret can be computed not just for policies designed to look $w$ steps ahead but for any given policy. For a more comprehensive discussion of these notions of regret, see [34, Section 4].

## 6 Explore-Estimate-Plan

We now present Explore-Estimate-Plan (EEP), an algorithm for learning in rebounding bandits with stochastic dynamics and unknown parameters, that (i) collects data by pulling each arm a fixed number of times; (ii) estimates the model's parameters based on the logged data; and then (iii) plans according to the estimated model. Finally, we analyze EEP's regret.

Because each arm's base reward is known from its first pull, whenever arm $k$ is pulled at time $t$ with $n_{k,t} \ne 0$, we measure the satiation influence $\lambda_k s_{k,t}$, which becomes the next state entry $x_{k,t+1}$:
$$x_{k,t+1} = \lambda_k s_{k,t} = \lambda_k \gamma_k^{n_{k,t}} s_{k,t-n_{k,t}} + \lambda_k \gamma_k^{n_{k,t}} + \lambda_k \sum_{i=0}^{n_{k,t}-1}\gamma_k^i z_{k,t-1-i} = \gamma_k^{n_{k,t}} x_{k,t+1-n_{k,t}} + \lambda_k \gamma_k^{n_{k,t}} + \lambda_k \sum_{i=0}^{n_{k,t}-1}\gamma_k^i z_{k,t-1-i}. \quad (7)$$
We note that the current state $x_{k,t}$ equals $x_{k,t+1-n_{k,t}}$, since $x_{k,t+1-n_{k,t}}$ is the last observed satiation influence for arm $k$ and $n_{k,t}$ is the number of steps since arm $k$ was last pulled.
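Before turning to the algorithm, here is a small sketch of the bookkeeping from §5.1: the per-arm state $(x_k, n_k)$, the transition after a pull, and the expected reward $r(x_t, k)$. The dictionary-based state, the function names, and the numerical example are our own illustrative choices.

```python
import numpy as np

def expected_reward(state, k, gammas, lams, bs):
    """r(x_t, k) from Section 5.1: b_k if arm k has never been pulled (n_k = 0),
    otherwise b_k - gamma_k^{n_k} * x_k - lam_k * gamma_k^{n_k}."""
    x_k, n_k = state[k]
    if n_k == 0:
        return bs[k]
    return bs[k] - gammas[k] ** n_k * x_k - lams[k] * gammas[k] ** n_k

def transition(state, k, reward, bs):
    """Update (x, n) after pulling arm k and observing `reward`:
    the pulled arm records its new satiation influence x_k = b_k - reward and
    sets n_k = 1; every other arm that has been pulled before increments n."""
    new_state = {}
    for j, (x_j, n_j) in state.items():
        if j == k:
            new_state[j] = (bs[k] - reward, 1)
        else:
            new_state[j] = (x_j, n_j + 1 if n_j != 0 else 0)
    return new_state

# Two arms, neither pulled yet: state maps arm index -> (satiation influence, steps since last pull).
bs, gammas, lams = np.array([3.0, 4.0]), np.array([0.5, 0.8]), np.array([2.0, 1.0])
state = {0: (0.0, 0), 1: (0.0, 0)}
print(expected_reward(state, 1, gammas, lams, bs))            # 4.0: the first pull reveals the base reward
state = transition(state, 1, reward=4.0, bs=bs)               # arm 1 now has x = 0, n = 1
print(round(expected_reward(state, 1, gammas, lams, bs), 3))  # 4 - 0.8 * 0 - 1 * 0.8 = 3.2
```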
### 6.1 The Exploration Phase: Repeated Pulls

We collect a dataset $\mathcal{P}^n_k$ by consecutively pulling each arm $n + 1$ times, in turn, where $n \approx T^{2/3}/K$ (Lines 4–7 of Algorithm 1). Specifically, for each arm $k \in [K]$, the dataset $\mathcal{P}^n_k$ contains a single trajectory of $n + 1$ observed satiation influences $\tilde{x}_{k,1}, \ldots, \tilde{x}_{k,n+1}$, where $\tilde{x}_{k,1} = 0$ and $\tilde{x}_{k,j}$ (for $j > 1$) is the difference between the first reward and the $j$-th reward obtained from arm $k$. Thus, for $\tilde{x}_{k,j}, \tilde{x}_{k,j+1} \in \mathcal{P}^n_k$, using (7) with $n_{k,t} = 1$ (because the pulls are consecutive), it follows that
$$\tilde{x}_{k,j+1} = \gamma_k \tilde{x}_{k,j} + d_k + \tilde{z}_{k,j}, \quad (8)$$
where $d_k = \lambda_k \gamma_k$ and the $\tilde{z}_{k,j}$ are independent samples from $\mathcal{N}(0, \sigma^2_{z,k})$ with $\sigma^2_{z,k} = \lambda_k^2 \sigma_z^2$. In Appendix E.2, we discuss other exploration strategies (e.g., playing the arms in a cyclic order) for EEP and their regret guarantees.

**Algorithm 1:** $w$-lookahead Explore-Estimate-Plan

**Input:** lookahead window size $w$, number of arms $K$, horizon $T$.

1. Initialize $t = 1$, $\pi_{1:T}$ to be an empty array of length $T$, and $\tilde{T} = T^{2/3} + w - (T^{2/3} \bmod w)$.
2. **for** $k = 1, \ldots, K$ **do**
3. Set $t' = t$ and initialize an empty array $\mathcal{P}^n_k$.
4. **for** $c = 0, \ldots, \lfloor \tilde{T}/K \rfloor$ **do**
5. Play arm $k$ to obtain reward $\mu_{k,t'+c}$ and add $\mu_{k,t'} - \mu_{k,t'+c}$ to $\mathcal{P}^n_k$.
6. Set $\pi_t = k$ and increase $t$ by 1.
7. **end for**
8. Obtain $\hat{\gamma}_k, \hat{d}_k$ using the estimator (9); set $\hat{\lambda}_k = |\hat{d}_k/\hat{\gamma}_k|$ and $\hat{b}_k = \mu_{k,t'}$.
9. **end for**
10. Let $t_0 = \tilde{T}$, set $\pi_{t:t_0} = (1, \ldots, \tilde{T} - t + 1)$, and play $\pi_{t:t_0}$.
11. **for** $i = 1, \ldots, \lceil (T - t_0)/w \rceil$ **do**
12. Set $t_i = \min\{t_{i-1} + w, T\}$.
13. Obtain $\pi_{t_{i-1}+1:t_i} = \mathrm{Lookahead}(\{\hat{\lambda}_k, \hat{\gamma}_k, \hat{b}_k\}_{k=1}^K, \{u_{k,0:t_{i-1}}\}_{k=1}^K, t_{i-1}, t_i)$, where $\{u_{k,0:t_{i-1}}\}_{k=1}^K$ are the arm pull histories corresponding to $\pi_{1:t_{i-1}}$.
14. Play $\pi_{t_{i-1}+1:t_i}$.
15. **end for**

### 6.2 Estimating the Reward Model and Satiation Dynamics

For all $k \in [K]$, given the dataset $\mathcal{P}^n_k$, we estimate $A_k = (\gamma_k, d_k)$ using the ordinary least squares estimator
$$\hat{A}_k \in \arg\min_{A \in \mathbb{R}^2} \|Y_k - X_k A\|_2^2, \quad (9)$$
where $Y_k \in \mathbb{R}^n$ is an $n$-dimensional vector whose $j$-th entry is $\tilde{x}_{k,j+1}$ and $X_k \in \mathbb{R}^{n \times 2}$ takes as its $j$-th row the vector $\bar{x}_{k,j} = (\tilde{x}_{k,j}, 1)^\top$, i.e., $\tilde{x}_{k,j+1}$ is treated as the response to the covariates $\bar{x}_{k,j}$. For $n \ge 2$, we have that
$$\hat{A}_k = (\hat{\gamma}_k, \hat{d}_k)^\top = (X_k^\top X_k)^{-1} X_k^\top Y_k, \quad (10)$$
and we take $\hat{\lambda}_k = |\hat{d}_k/\hat{\gamma}_k|$.

The difficulty in analyzing the ordinary least squares estimator (10) for identifying the affine dynamical system (8) using a single trajectory of data comes from the fact that the samples are not independent. Asymptotic guarantees for ordinary least squares estimators in this setting have been studied previously in the control theory and time series communities [13, 22]. Recent work on system identification for linear dynamical systems focuses on the sample complexity [40, 38]. Adapting the proof of [40, Theorem 2.4], we derive the following theorem for identifying our affine dynamical system (8).

Theorem 3. Fix $\delta \in (0, 1)$. For all $k \in [K]$, there exists a constant $n_0(\delta, k)$ such that if the dataset $\mathcal{P}^n_k$ satisfies $n \ge n_0(\delta, k)$, then with probability at least $1 - \delta$,
$$\|\hat{A}_k - A_k\|_2 \lesssim \frac{1}{\psi}\sqrt{\frac{\log(1/\delta)}{n}}, \qquad \text{where } \psi = \sqrt{\min\Big\{\frac{\sigma^2_{z,k}(1-\gamma_k)^2}{16 d_k^2 (1-\gamma_k^2) + (1-\gamma_k)^2 \sigma^2_{z,k}},\ \frac{\sigma^2_{z,k}}{4(1-\gamma_k^2)}\Big\}}.$$

As shown in Theorem 3, when $d_k = \lambda_k \gamma_k$ gets larger, the convergence rate for $\hat{A}_k$ gets slower. Given a single trajectory of sufficient length, we obtain $|\hat{\gamma}_k - \gamma_k| \le O(1/\sqrt{n})$ and $|\hat{d}_k - d_k| \le O(1/\sqrt{n})$. In Corollary 4, we show that the estimator of $\lambda_k$ also achieves an $O(1/\sqrt{n})$ estimation error.

Corollary 4. Fix $\delta \in (0, 1)$. Suppose that for all $k \in [K]$, we have $\mathbb{P}(\|\hat{A}_k - A_k\|_2 \ge 1/\sqrt{n}) \le \delta$ and $\hat{\gamma}_k > 0$. Then, with probability at least $1 - \delta$, we have that for all $k \in [K]$, $|\hat{\gamma}_k - \gamma_k| \le O(1/\sqrt{n})$ and $|\hat{\lambda}_k - \lambda_k| \le O(1/\sqrt{n})$.
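A minimal NumPy sketch of the estimation step: simulate one exploration trajectory following (8) and recover $(\gamma_k, d_k)$ with the least-squares estimator (9)–(10), then set $\hat{\lambda}_k = |\hat{d}_k/\hat{\gamma}_k|$. The function names and parameter values are ours; `np.linalg.lstsq` plays the role of the closed form (10).

```python
import numpy as np

def estimate_arm_dynamics(x):
    """Ordinary least squares for the affine recursion (8): regress x_{j+1} on
    (x_j, 1) to recover A_k = (gamma_k, d_k), then set lambda_hat = |d_hat / gamma_hat|."""
    X = np.column_stack([x[:-1], np.ones(len(x) - 1)])  # covariates (x_j, 1)
    Y = x[1:]                                            # responses x_{j+1}
    gamma_hat, d_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    return gamma_hat, d_hat, abs(d_hat / gamma_hat)

def explore_arm(n, gamma, lam, sigma_z=0.1, rng=None):
    """Satiation influences observed from n + 1 consecutive pulls of one arm,
    i.e., a single trajectory of (8) with x_1 = 0, d = lam * gamma, and noise
    standard deviation lam * sigma_z."""
    rng = np.random.default_rng(rng)
    x = np.zeros(n + 1)
    for j in range(n):
        x[j + 1] = gamma * x[j] + lam * gamma + lam * rng.normal(0.0, sigma_z)
    return x

gamma, lam = 0.7, 2.0
x = explore_arm(n=2000, gamma=gamma, lam=lam, rng=0)
gamma_hat, d_hat, lam_hat = estimate_arm_dynamics(x)
print(round(float(gamma_hat), 3), round(float(lam_hat), 3))  # should be close to 0.7 and 2.0
```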
Figure 2: Log-log plots of the absolute errors of $\hat{\gamma}_k$ (panel a) and $\hat{\lambda}_k$ (panel b) with respect to the number of samples $n$ in a single trajectory. The results are averaged over 30 random runs, and the shaded area represents one standard deviation.

Figure 3: Panel (a) shows the expected cumulative reward collected by the $T$-lookahead policy (red line) and the $w$-lookahead policies (blue dots) for $w = 1, \ldots, 15$ when $T = 30$. Panel (b) shows the log-log plot of the $w$-step lookahead regret of $w$-lookahead EEP under different horizons $T$, averaged over 20 random runs; the fitted slopes are 0.57 ($w = 2$), 0.61 ($w = 5$), 0.67 ($w = 8$), and 0.71 ($w = 10$).

### 6.3 Planning and Regret Bound

In the planning stage of Algorithm 1 (Lines 11–15), at time $t_{i-1} + 1$, the next $w$ arms to play are obtained through the Lookahead function defined in (5), based on the parameters obtained in the estimation stage (Line 8). Using the results in Corollary 4, we obtain the following sublinear regret bound for $w$-lookahead EEP.

Theorem 5. There exists a constant $T_0$ such that for all $T > T_0$ and $w \le T^{2/3}$, the $w$-step lookahead regret of $w$-lookahead Explore-Estimate-Plan satisfies $\mathrm{Reg}_w(T) \le O(K^{1/2} T^{2/3} \log T)$.

Remark 5. The fact that EEP incurs a regret of order $O(T^{2/3})$ is expected for two reasons. First, EEP can be viewed as an explore-then-commit (ETC) algorithm that first explores and then exploits. The regret of EEP resembles the $O(T^{2/3})$ regret of the ETC algorithm in the classical $K$-armed bandits setting [20]. In rebounding bandits, the fundamental obstacle to mixing the exploration and exploitation stages is the need to estimate the satiation dynamics: when the rewards of an arm are not observed at evenly spaced times, the observed satiation influences can no longer be viewed as samples from the same time-invariant affine dynamical system, since the parameters of the system depend on the duration between pulls. In practice, one may use a maximum likelihood estimator to obtain estimates of the reward parameters, but obtaining the sample complexity of such an estimator with dependent data is difficult. Second, it has been shown in [5] that when the rewards of the arms have temporal variation that depends on the horizon $T$, the worst-case instantaneous regret is lower bounded by $\Omega(T^{2/3})$. In the traditional $K$-armed bandits setup, by contrast, the regret (following the classical definition [20]) is lower bounded by $\Omega(T^{1/2})$, which can be attained by methods like the upper confidence bound algorithm [20]. Precisely characterizing the regret lower bound for rebounding bandits is of future interest.

## 7 Experiments

We now evaluate the performance of EEP experimentally, separately investigating the sample efficiency of our proposed estimators (10) for learning the satiation and reward models (Figure 2) and the computational performance of the $w$-lookahead policies (5) (Figure 3a). For the experimental setup, we have 5 arms with satiation retention factors $\gamma_1 = \gamma_2 = 0.5$, $\gamma_3 = 0.6$, $\gamma_4 = 0.7$, $\gamma_5 = 0.8$, exposure influence factors $\lambda_1 = 1$, $\lambda_2 = \lambda_3 = 3$, $\lambda_4 = \lambda_5 = 2$, base rewards $b_1 = 2$, $b_2 = 3$, $b_3 = 4$, $b_4 = 2$, $b_5 = 10$, and noise with variance $\sigma_z = 0.1$.

**Parameter Estimation.** We first evaluate our proposed estimator, which uses a single trajectory per arm to estimate the arm parameters $\gamma_k, \lambda_k$. In Figure 2, we show the absolute error (averaged over 30 random runs) between the estimated parameters and the true parameters for each arm. Aligning with our theoretical guarantees (Corollary 4), the log-log plots show that the convergence rate of the absolute error is on the order of $O(n^{-1/2})$.
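The following is a rough, self-contained sketch of the kind of single-trajectory estimation experiment behind Figure 2: for several trajectory lengths $n$, it fits $\hat{\gamma}$ by least squares as in (10) and averages the absolute error over repeated runs, so that the roughly $n^{-1/2}$ decay can be read off the printed values. All names and parameter values here are ours and illustrative, not the paper's actual experiment code.

```python
import numpy as np

def fit_gamma(x):
    """OLS slope of x_{j+1} on (x_j, 1), i.e., the gamma_hat of estimator (10)."""
    X = np.column_stack([x[:-1], np.ones(len(x) - 1)])
    return np.linalg.lstsq(X, x[1:], rcond=None)[0][0]

gamma, lam, sigma_z, runs = 0.7, 2.0, 0.1, 30
for n in [100, 400, 1600, 6400]:
    errs = []
    for seed in range(runs):
        rng = np.random.default_rng(seed)
        x = np.zeros(n + 1)
        for j in range(n):  # single trajectory of (8) from consecutive pulls
            x[j + 1] = gamma * x[j] + lam * gamma + lam * rng.normal(0.0, sigma_z)
        errs.append(abs(fit_gamma(x) - gamma))
    # if the error decays like n^{-1/2}, the mean roughly halves as n quadruples
    print(n, round(float(np.mean(errs)), 4))
```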
**w-lookahead Performance.** To evaluate $w$-lookahead policies, we solve (5) using the true reward parameters and report the expected cumulative rewards of the obtained $w$-lookahead policies (Figure 3a). Recall that the greedy policy is precisely the 1-lookahead policy. To solve the resulting integer programs, we use Gurobi 9.1 [23] and set the number of threads to 10. When $T = 30$, the $T$-lookahead policy (expected cumulative reward given by the red line in Figure 3a), solved through (4), is obtained in 1610s. On the other hand, all $w$-lookahead policies (expected cumulative rewards given by the blue dots in Figure 3a) for $w$ between 1 and 15 are solved within 2s. We provide the results for $T = 100$ in Appendix G. Despite using significantly less computation time, the $w$-lookahead policies achieve expected cumulative rewards similar to that of the $T$-lookahead policy.

**EEP Performance.** We evaluate the performance of EEP when $T$ ranges from 60 to 400. For each horizon $T$, we examine the $w$-step lookahead regret of $w$-lookahead EEP for $w = 2, 5, 8, 10$. All results are averaged over 20 random runs. As $T$ increases, the exploration stage of EEP becomes longer, which results in more data for estimating the reward parameters and lower variance of the parameter estimators. We fit a line to the regrets with the same lookahead size $w$ to examine the order of the regret with respect to the horizon $T$. The slopes of the lines (see Figure 3b's legend) are close to 2/3, which aligns with our theoretical guarantees (Theorem 5), i.e., the regrets are on the order of $O(T^{2/3})$. In Appendix G, we present the $T$-step lookahead regret of $w$-lookahead EEP under the same settings, along with additional experimental setups and results.

## 8 Conclusions

While our work has taken strides towards modeling the exposure-dependent evolution of preferences through dynamical systems, there are many avenues for future work. First, while our satiation dynamics are independent across arms, a natural extension might allow interactions among the arms. For example, a diner sick of pizza after too many trips to Di Fara's would likely also avoid Grimaldi's until the satiation effect wore off. On the system identification side, we might overcome our reliance on evenly spaced pulls, producing more adaptive algorithms (e.g., optimism-based algorithms) that can refine their estimates, improving the agent's policy even past the pure exploration period. Finally, our satiation model captures just one plausible dynamic according to which preferences might evolve in response to past recommendations. Characterizing other such dynamics (e.g., the formation of brand loyalty, where the rewards of an arm increase with more pulls) in bandits setups is of future interest.

**Acknowledgements.** LL is generously supported by an Open Philanthropy AI Fellowship. This research has been partly funded by the Center for Marketing Information and Technology at the Tepper School of Business. The authors would like to thank David Childers, Biswajit Paria, Eyan P. Noronha, Sai Sandeep, and Max Simchowitz for very helpful discussions, and Stephen Tu for his insightful suggestions on system identification of affine dynamical systems.

## References
[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
[2] Soumya Basu, Rajat Sen, Sujay Sanghavi, and Sanjay Shakkottai. Blocking bandits. In Advances in Neural Information Processing Systems, pages 4785–4794, 2019.
[3] Manel Baucells and Rakesh K Sarin. Satiation in discounted utility. Operations Research, 55(1):170–181, 2007.
[4] James Bennett, Stan Lanning, et al. The Netflix prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35. New York, 2007.
[5] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Optimal exploration–exploitation in a multi-armed bandit problem with non-stationary rewards. Stochastic Systems, 9(4):319–337, 2019.
[6] Felipe Caro and Victor Martínez-de Albéniz. Product and price competition with satiation effects. Management Science, 58(7):1357–1373, 2012.
[7] Leonardo Cella and Nicolò Cesa-Bianchi. Stochastic bandits with delay-dependent payoffs. In International Conference on Artificial Intelligence and Statistics, pages 1168–1177, 2020.
[8] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional Gaussians. arXiv preprint arXiv:1810.08693, 2018.
[9] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
[10] Alan Frieze and Mark Jerrum. Improved approximation algorithms for max k-cut and max bisection. Algorithmica, 18(1):67–81, 1997.
[11] Jeff Galak and Joseph P Redden. The properties and antecedents of hedonic decline. Annual Review of Psychology, 69:1–25, 2018.
[12] John C Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B (Methodological), 41(2):148–164, 1979.
[13] James Hamilton. Time Series Analysis. Princeton University Press, Princeton, NJ, 1994.
[14] Hoda Heidari, Michael J Kearns, and Aaron Roth. Tight policy regret bounds for improving and decaying bandits. In IJCAI, pages 1562–1570, 2016.
[15] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4), 2010.
[16] Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 781–789, 2017.
[17] Barbara E Kahn. Consumer variety-seeking among goods and services: An integrative review. Journal of Retailing and Consumer Services, 2(3):139–148, 1995.
[18] Komal Kapoor, Karthik Subbian, Jaideep Srivastava, and Paul Schrater. Just in time recommendations: Modeling the dynamics of boredom in activity streams. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 233–242, 2015.
[19] Robert Kleinberg and Nicole Immorlica. Recharging bandits. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 309–319. IEEE, 2018.
[20] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[21] Nir Levine, Koby Crammer, and Shie Mannor. Rotting bandits. In Advances in Neural Information Processing Systems, pages 3074–3083, 2017.
[22] Lennart Ljung. System identification. Wiley Encyclopedia of Electrical and Electronics Engineering, pages 1–19, 1999.
[23] Gurobi Optimization LLC. Gurobi optimizer reference manual, 2021.
[24] Nikolai Matni and Stephen Tu. A tutorial on concentration bounds for system identification. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3741–3749. IEEE, 2019.
[25] Leigh McAlister. A dynamic attribute satiation model of variety-seeking behavior. Journal of Consumer Research, 9(2):141–150, 1982.
[26] Leigh McAlister and Edgar Pessemier. Variety seeking behavior: An interdisciplinary review. Journal of Consumer Research, 9(3):311–322, 1982.
[27] Julian John McAuley and Jure Leskovec. From amateurs to connoisseurs: Modeling the evolution of user expertise through online reviews. In Proceedings of the 22nd International Conference on World Wide Web, pages 897–908. ACM, 2013.
[28] J Douglas McConnell. The development of brand loyalty: An experimental study. Journal of Marketing Research, 5(1):13–19, 1968.
[29] Yonatan Mintz, Anil Aswani, Philip Kaminsky, Elena Flowers, and Yoshimi Fukuoka. Nonstationary bandits with habituation and recovery dynamics. Operations Research, 68(5):1493–1516, 2020.
[30] Rémi Munos. Performance bounds in ℓp-norm for approximate value iteration. SIAM Journal on Control and Optimization, 46(2):541–561, 2007.
[31] Ronald Ortner and Daniil Ryabko. Online regret bounds for undiscounted continuous reinforcement learning. In Advances in Neural Information Processing Systems, pages 1763–1771, 2012.
[32] Ronald Ortner, Daniil Ryabko, Peter Auer, and Rémi Munos. Regret bounds for restless Markov bandits. In International Conference on Algorithmic Learning Theory, pages 214–228. Springer, 2012.
[33] Victor H Peña, Tze Leung Lai, and Qi-Man Shao. Self-Normalized Processes: Limit Theory and Statistical Applications. Springer, 2009.
[34] Ciara Pike-Burke and Steffen Grunewalder. Recovering bandits. In Advances in Neural Information Processing Systems, pages 14122–14131, 2019.
[35] Rebecca K Ratner, Barbara E Kahn, and Daniel Kahneman. Choosing less-preferred experiences for the sake of variety. Journal of Consumer Research, 26(1):1–15, 1999.
[36] Martin Riedmiller. Neural fitted Q iteration: First experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317–328. Springer, 2005.
[37] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
[38] Tuhin Sarkar and Alexander Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. In International Conference on Machine Learning, pages 5610–5618, 2019.
[39] Julien Seznec, Andrea Locatelli, Alexandra Carpentier, Alessandro Lazaric, and Michal Valko. Rotting bandits are no harder than stochastic ones. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2564–2572, 2019.
[40] Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In Conference on Learning Theory, pages 439–473, 2018.
[41] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.
[42] William T Tucker. The development of brand loyalty. Journal of Marketing Research, 1(3):32–35, 1964.
[43] Romain Warlop, Alessandro Lazaric, and Jérémie Mary. Fighting boredom in recommender systems with linear reinforcement learning. In Advances in Neural Information Processing Systems, pages 1757–1768, 2018.
[44] Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988.