# Uplifting Bandits

Yu-Guan Hsieh, University of Grenoble Alpes, yu-guan.hsieh@univ-grenoble-alpes.fr (work done during an internship at Amazon)
Shiva Prasad Kasiviswanathan, Amazon, kasivisw@gmail.com
Branislav Kveton, Amazon, bkveton@amazon.com

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract. We introduce a multi-armed bandit model where the reward is a sum of multiple random variables, and each action only alters the distributions of some of them. After each action, the agent observes the realizations of all the variables. This model is motivated by marketing campaigns and recommender systems, where the variables represent outcomes on individual customers, such as clicks. We propose UCB-style algorithms that estimate the uplifts of the actions over a baseline. We study multiple variants of the problem, including when the baseline and the affected variables are unknown, and prove sublinear regret bounds for all of them. We also provide lower bounds that justify the necessity of our modeling assumptions. Experiments on synthetic and real-world datasets show the benefit of methods that estimate the uplifts over policies that do not use this structure.

## 1 Introduction

Multi-armed bandit (MAB) is an important framework for sequential decision making under uncertainty [8, 19, 20]. In this problem, a learner repeatedly takes an action and receives its reward, while the outcomes of the other actions remain unobserved. The goal of the learner is to maximize their cumulative reward over time by balancing exploration (selecting actions with uncertain reward estimates) and exploitation (selecting actions with high reward estimates). MAB has applications in online advertisement, recommender systems, portfolio management, and dynamic channel selection in wireless networks [7].

One prominent question in the MAB literature is how the dependencies between the actions can be exploited to improve statistical efficiency. Popular examples include linear bandits [11] and combinatorial bandits [9]. In this work, we study a different structured bandit problem with the following three features: (i) the reward is the sum of the payoffs of a fixed set of variables; (ii) these payoffs are observed; and (iii) each action only affects a small subset of the variables. This structure arises in many applications, as discussed below.

(a) Online Marketing. An e-commerce platform can opt among several marketing strategies (actions) to encourage customers to make more purchases on its website. As different customers can be sensitive to different marketing strategies, regarding each of them as a variable, it is natural to expect that each action would likely affect only a subset of the variables. The payoff associated with a customer can be, for example, the revenue generated by that customer; the reward is then just the total revenue received by the platform.

| | UCB | UpUCB(b) | UpUCB | UpUCB-nAff | UpUCB-iLift |
|---|---|---|---|---|---|
| Affected variables known | No | Yes | Yes | No | No |
| Baseline payoffs known | No | Yes | No | No | No |
| Regret bound | $Km^2$ | $KL^2$ | $KL^2$ | $KL^2$ | $K\,\mathrm{clip}(\Delta/\Delta_{\mathrm{up}}, L, m)^2$ |

Table 1: Summary of our regret bounds for uplifting bandits. Constant and logarithmic factors are ignored throughout. For simplicity, we assume here that all actions affect exactly $L$ of the $m$ variables and that all suboptimal actions have the same suboptimality gap $\Delta$.
$K$ and $\Delta_{\mathrm{up}}$ are respectively the number of actions and a lower bound on the individual uplift ($\Delta_{\mathrm{up}}$ is formally introduced in Appendix E). The operator clip restricts the value of its first argument to the range defined by its second and third arguments, $\mathrm{clip}(x, \alpha, \beta) = \max(\alpha, \min(\beta, x))$.

(b) Product Discount. Consider a company that uses discount strategies to increase its sales. It is common to design discount strategies that only apply to a small subset of products. In this case, we can view the sales of each product as a variable and each discount strategy as an action, and assume that each action only has a significant impact on the sales of the products discounted by this action.

(c) Movie Recommendation. Consider a bandit model for movie recommendation where actions correspond to different recommendation algorithms and the variables are all the movies in the catalog. For each user, define a set of binary payoffs that indicate whether the user watches a movie in the catalog; the reward is then the number of movies from the catalog that this user watches. Since a recommendation (action) for this user contains (promotes) only a subset of movies, it is reasonable to assume that only their associated payoffs are affected by that action.

Our Contributions. To begin with, we formalize an uplifting bandit model that captures the aforementioned structure, with the term uplift borrowed from the field of uplift modeling [15, 22] to indicate that the actions' effects are incremental over a certain baseline. The model is stochastic, that is, the payoffs of the variables follow an unknown distribution that depends on the chosen action. We study this problem under various assumptions on the learner's prior knowledge about (1) the baseline payoff of each variable and (2) the set of affected variables of each action.

Our first result (Section 3) shows that when both (1) and (2) are known, a simple modification of the upper confidence bound (UCB) algorithm [3] that estimates the uplifts (Algorithm UpUCB(b)) has a much lower regret than its standard implementation. Roughly speaking, when $m$ is the number of variables and $L$ is the number of variables affected by each action, we get an $O(L^2)$ regret bound instead of $O(m^2)$. This is a major reduction when $L \ll m$, and is a distinguishing feature of all our results.

Going one step further, we design algorithms with similar regret guarantees without assuming that either (1) or (2) is known. When (1) the baseline payoffs are unknown and (2) the affected variables are known, we compute differences of UCBs to estimate the uplifts (Algorithm UpUCB). In contrast to standard UCB methods, these differences are not optimistic, in the sense that they are not high-probability upper bounds on the estimated quantity. This construction reflects the fact that the feedback for any single action also provides information about the baseline. When (2) the affected variables are also unknown, we identify them on the fly to maintain suitable estimates of the uplifts. We study two approaches, which differ in what they know about the affected variables: Algorithm UpUCB-nAff (Section 6) knows an upper bound on the number of affected variables, whereas Algorithm UpUCB-iLift (Appendix E) knows a lower bound on the individual uplift. Our regret bounds are summarized in Table 1. These results are further complemented with lower bounds that justify the need for our modeling assumptions (Section 4).

To demonstrate the generality of our setup and how our algorithmic ideas extend beyond vanilla multi-armed bandits, we discuss contextual extensions of our model in Section 7. The experiments in Section 8 confirm the benefit of our approach.
Related Work. The goals of both uplift modeling and MAB are to help select the optimal action. The former achieves this by modeling the incremental effect of an action on an individual's behavior. Despite this apparent connection between the two concepts, few papers explicitly link them together. We believe that this is because uplift maximization can be solved by classic bandit algorithms with a redefined reward; this approach was taken in [6, 23]. Instead, our paper focuses on bandit problems in which estimating the uplifts improves the statistical efficiency of the algorithms, which is made possible by the sparsity of the actions' effects.

Prior to our work, sparsity assumptions in bandits primarily concerned the sparsity of the parameter vectors in linear bandits [2, 5, 16]. A notable exception is Kwon et al. [18], who studied a variant of the MAB where sparsity is reflected by the fact that the number of arms with positive reward is small. Our work is orthogonal to all of these in that we look at a different form of sparsity. As we will see in Section 7, while sparsity in the parameters can be a cause of sparsity in the actions' effects, the improvement in regret is established through a different mechanism.

The additive structure of the reward and the observability of individual payoffs also suggest some similarity between our model and that of combinatorial semi-bandits [10, 17, 21]. There, the learner selects at each round a subset of ground items and the reward is generally defined as the sum of the payoffs of the selected items. However, this seeming similarity comes with an important conceptual difference: the set of items, or variables, for which we observe the outcomes is selected in semi-bandits, while for us it is fixed and inherent to the reward generation mechanism. As an example, in Application (c), combinatorial bandit models optimize the number of movies that are watched among the recommended ones, while we also consider movies that are not recommended. In fact, as argued by Wang et al. [24], a recommendation is only effective if the user actually watches the movie and would not watch it without the recommendation. We refer the reader to Appendix B for more technical details and also a comparison with the causal bandit model.

Organization. We introduce our uplifting bandit model along with its variations in Section 2. Over Sections 3, 5 and 6, we provide regret bounds for these variations, with a lower bound presented in Section 4. We discuss contextual extensions in Section 7 and experimental results in Section 8.

## 2 Problem Description

We start by formally introducing our uplifting bandit model. We illustrate it in Fig. 1 and summarize our notation in Appendix A. Contextual extensions of this model are discussed in Section 7.

A $(K, m)$-uplifting bandit is a stochastic bandit with $K$ actions and $m$ underlying variables. Each action $a \in \mathcal{A} := \{1, \dots, K\}$ is associated with a distribution $P_a$ on $\mathbb{R}^m$. At each round $t$, the learner chooses an action $a_t \in \mathcal{A}$ and receives the reward $r_t = \sum_{i \in \mathcal{V}} y_t(i)$, where $\mathcal{V} := \{1, \dots, m\}$ is the set of all variables and $y_t = (y_t(i))_{i \in \mathcal{V}} \sim P_{a_t}$ is the payoff vector (the terms reward and payoff distinguish $r_t$ from $(y_t(i))_{i \in \mathcal{V}}$). Our model is distinguished by two assumptions that we describe below.
(I) Limited Number of Affected Variables. Let $\mathcal{V}^a \subseteq \mathcal{V}$ be the subset of variables affected by action $a$ and $P_0$ be the baseline distribution that the variables' payoffs follow when no action is taken. By definition, $P_a$ and $P_0$ have the same marginal distribution on $\overline{\mathcal{V}^a} := \mathcal{V} \setminus \mathcal{V}^a$, the variables unaffected by action $a$. (If the actions in fact have a small impact on the variables in $\overline{\mathcal{V}^a}$, our model is misspecified and incurs additional linear regret whose size is proportional to the impact of the actions on $\overline{\mathcal{V}^a}$; an interesting question is how to design algorithms that self-adapt to the degree of misspecification.) While the above condition is always satisfied with $\mathcal{V}^a = \mathcal{V}$, we are interested in the case $L^a := \mathrm{card}(\mathcal{V}^a) \ll m$, meaning that only a few variables are affected by $a$. We define $L$ as a uniform bound on $L^a$, so that $\max_{a \in \mathcal{A}} L^a \le L$. For convenience of notation, we write $[T] := \{1, \dots, T\}$ and $\mathcal{A}_0 = \mathcal{A} \cup \{0\}$, where $0$ is used for all the quantities related to the baseline.

(II) Observability of Individual Payoffs. In addition to the reward $r_t$, we assume that the learner observes all the payoffs $(y_t(i))_{i \in \mathcal{V}}$ after an action is taken in round $t$.

Uplift and Noise. Let $y^a = (y^a(i))_{i \in \mathcal{V}}$ be a random variable with distribution $P_a$. We define $\mu^a(i) = \mathbb{E}[y^a(i)]$ and $\xi^a(i) = y^a(i) - \mu^a(i)$ respectively as the expected value and the noise component of $y^a(i)$. We use $\mu^a$ (resp. $\mu^0$) to denote the vector $(\mu^a(i))_{i \in \mathcal{V}}$ (resp. $(\mu^0(i))_{i \in \mathcal{V}}$), and refer to $\mu^0$ as the baseline payoffs. The individual uplift associated with a pair $(a, i) \in \mathcal{A} \times \mathcal{V}$ is defined as $\mu^a_{\mathrm{up}}(i) = \mu^a(i) - \mu^0(i)$. An individual uplift can be positive or negative. We obtain the (total) uplift of an action by summing its individual uplifts over all the variables affected by that action,

$$r^a_{\mathrm{up}} = \sum_{i \in \mathcal{V}^a} \mu^a_{\mathrm{up}}(i) = \sum_{i \in \mathcal{V}^a} \big(\mu^a(i) - \mu^0(i)\big). \tag{1}$$

Let $r^a = \sum_{i \in \mathcal{V}} \mu^a(i)$ be the expected reward of an action or of pure observation. We also have $r^a_{\mathrm{up}} = r^a - r^0$ since $\mu^a(i) = \mu^0(i)$ as long as $i \notin \mathcal{V}^a$.

A real-valued random variable $X$ is said to be $\sigma$-sub-Gaussian if for all $\gamma \in \mathbb{R}$ it holds that $\mathbb{E}[\exp(\gamma X)] \le \exp(\sigma^2 \gamma^2 / 2)$. Throughout the paper, we assume that $\xi^a(i)$ is 1-sub-Gaussian for all $a \in \mathcal{A}_0$ and $i \in \mathcal{V}$. Note that we do not assume that $(y^a(i))_{i \in \mathcal{V}}$ are independent, i.e., the elements of the noise vector $(\xi^a(i))_{i \in \mathcal{V}}$ may be correlated, for the following two reasons. First, the independence assumption is not always realistic; in our first example, it would exclude any potential correlation between two customers' purchases. Second, while a learner can exploit knowledge of the noise covariance matrix to reduce the regret, as shown for example by Degenne and Perchet [12], incorporating such knowledge complicates the algorithm design and analysis. We nonetheless believe this is an interesting future direction to pursue.

[Figure 1: Illustration of the uplifting bandit model (actions and variables). This example has $K = 3$ actions, $m = 5$ variables, and each action affects $L^a = 2$ variables. Dashed lines indicate that the variables' payoffs follow the baseline distribution $P_0$ by default.]

[Figure 2: Explanation of UpUCB and UpUCB-nAff(b) using the model from Fig. 1. The five vertical bars correspond to the five variables and the y-axis corresponds to the value of the payoff. (a) UpUCB: after a suboptimal action $a_t$ is taken, the confidence intervals of $(\mu^{a_t}(i))_{i \in \mathcal{V}^{a_t}}$ and $(\mu^0(i))_{i \in \mathcal{V}^{a^\star} \setminus \mathcal{V}^{a_t}}$ shrink (from purple to pink); hence the uplifting index of $a_t$ decreases while that of $a^\star$ increases (from dashed to solid). (b) UpUCB-nAff(b): to compute the uplifting index of an action, we identify a set of variables whose associated confidence intervals do not contain the baseline payoff and pad it with the variables with the largest UCBs on the uplifts.]
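To make the generative model above concrete, the following is a minimal simulation sketch of a $(K, m)$-uplifting bandit with Gaussian noise; the class, its parameters, and the example instance are illustrative assumptions and not code from the paper.

```python
import numpy as np

class UpliftingBandit:
    """Minimal (K, m)-uplifting bandit simulator with Gaussian noise.

    Action a shifts the payoff means only on its affected set V^a; every other
    variable keeps its baseline mean mu0(i), matching assumption (I).
    """

    def __init__(self, mu0, affected_sets, uplifts, noise_std=1.0, seed=0):
        self.mu0 = np.asarray(mu0, dtype=float)   # baseline payoffs, shape (m,)
        self.affected_sets = affected_sets        # one index array per action (the sets V^a)
        self.uplifts = uplifts                    # one uplift vector per action
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def step(self, action):
        """Play an action; return the full payoff vector y_t and the reward r_t."""
        mean = self.mu0.copy()
        mean[self.affected_sets[action]] += self.uplifts[action]
        y = mean + self.noise_std * self.rng.normal(size=mean.shape)
        return y, y.sum()

# Example mirroring Fig. 1: K = 3 actions, m = 5 variables, L^a = 2.
env = UpliftingBandit(
    mu0=np.zeros(5),
    affected_sets=[np.array([0, 1]), np.array([2, 3]), np.array([3, 4])],
    uplifts=[np.array([0.5, 0.2]), np.array([0.1, 0.3]), np.array([-0.2, 0.6])],
)
payoffs, reward = env.step(action=1)
```

The learner interacts only through step, which exposes the full payoff vector $(y_t(i))_{i \in \mathcal{V}}$ together with the reward $r_t$, matching assumption (II).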
Regret. The learner's performance is characterized by their regret. To define it, we denote by $a^\star \in \arg\max_{a \in \mathcal{A}} r^a$ an action with the largest expected reward and by $r^\star = r^{a^\star} = \max_{a \in \mathcal{A}} r^a$ the largest expected reward. The regret of the learner after $T$ rounds is then given by

$$\mathrm{Reg}_T = r^\star T - \sum_{t=1}^{T} r^{a_t} = \sum_{a \in \mathcal{A}} \sum_{t=1}^{T} \mathbb{1}\{a_t = a\}\, \Delta_a, \tag{2}$$

where $\Delta_a = r^\star - r^a$ is the suboptimality gap of action $a$. In words, we compare the learner's cumulative (expected) reward against the best we can achieve by taking an optimal action at each round. In the following, we also write $\Delta = \min_{a \in \mathcal{A}: \Delta_a > 0} \Delta_a$ for the minimum non-zero suboptimality gap and refer to it as the suboptimality gap of the problem.

UCB for Uplifting Bandits. The UCB algorithm [4], at each round, constructs an upper confidence bound (UCB) on the expected reward of each action and chooses the action with the highest UCB. When applied to our model, we get a regret of $O(Km^2 \log T / \Delta)$, as the noise in the reward is $m$-sub-Gaussian (recall that we do not assume independence of the payoffs). However, this approach completely ignores the structure of our problem. As we show in Section 3, we can achieve much smaller regret by focusing on the uplifts.

Problem Variations. In the following sections, we study several variants of the above basic problem that differ in the prior knowledge about $(P_a)_{a \in \mathcal{A}_0}$ that the learner possesses. (a) Knowledge of Baseline Payoffs: we consider two scenarios based on whether the learner knows the baseline payoffs $\mu^0 := (\mu^0(i))_{i \in \mathcal{V}}$ or not. (b) Knowledge of Affected Variables: we again consider two scenarios based on whether the learner knows the affected variables $(\mathcal{V}^a)_{a \in \mathcal{A}}$ associated with each action or not.

## 3 Case of Known Baseline and Known Affected Variables

We start with the simplest setting where both the affected variables and the baseline payoffs are known. To address this problem, we make the crucial observation that for any two actions, the difference in their rewards equals the difference in their uplifts. As an important consequence, uplift maximization has the same optimal action as total reward maximization, and we can replace rewards by uplifts in the definition of the regret (2). Formally, we write $r^\star_{\mathrm{up}} = r^{a^\star}_{\mathrm{up}} = \max_{a \in \mathcal{A}} r^a_{\mathrm{up}}$ for the largest uplift; then $\mathrm{Reg}_T = r^\star_{\mathrm{up}} T - \sum_{t=1}^{T} r^{a_t}_{\mathrm{up}}$. By making this transformation, we gain in statistical efficiency because $r^a_{\mathrm{up}} = \mathbb{E}\big[\sum_{i \in \mathcal{V}^a} (y^a(i) - \mu^0(i))\big]$ can now be estimated much more efficiently under our notion of sparsity.

Since both $\mu^0$ and $(\mathcal{V}^a)_{a \in \mathcal{A}}$ are known, we can directly construct a UCB on $r^a_{\mathrm{up}}$. For this, we define for all rounds $t \in [T]$, actions $a \in \mathcal{A}$, and variables $i \in \mathcal{V}$ the quantities

$$N^a_t = \sum_{s=1}^{t} \mathbb{1}\{a_s = a\}, \qquad c^a_t = \sqrt{\frac{2 \log(1/\delta')}{N^a_t}}, \qquad \hat{\mu}^a_t(i) = \frac{\sum_{s=1}^{t} y_s(i)\, \mathbb{1}\{a_s = a\}}{\max(1, N^a_t)}, \tag{3}$$

where $\delta' > 0$ is a tunable parameter. In words, $N^a_t$, $\hat{\mu}^a_t(i)$, and $c^a_t$ represent respectively the number of times that action $a$ is taken, the empirical estimate of $\mu^a(i)$, and the associated radius of the confidence interval calculated at the end of round $t$. The UCB for action $a \in \mathcal{A}$ and variable $i \in \mathcal{V}^a$ at round $t$ is $U^a_t(i) = \hat{\mu}^a_{t-1}(i) + c^a_{t-1}$. We further define $\tau^a_t = \sum_{i \in \mathcal{V}^a} (U^a_t(i) - \mu^0(i))$ as the uplifting index, and refer to UpUCB(b) as the algorithm that takes an action with the largest uplifting index $\tau^a_t$ at each round $t$ (Algorithm 3, Appendix A). Here, the suffix (b) indicates that the method operates with the knowledge of the baseline payoffs.
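As a complement to the definitions in (3), here is a compact sketch of the UpUCB(b) selection rule, reusing the hypothetical UpliftingBandit simulator from the earlier snippet; it illustrates the uplifting index $\tau^a_t$ and is not a reproduction of Algorithm 3.

```python
import numpy as np

def upucb_b(env, mu0, affected_sets, T, delta_prime):
    """Sketch of UpUCB(b): UCB on the uplift when mu0 and the sets V^a are known."""
    mu0 = np.asarray(mu0, dtype=float)
    K, m = len(affected_sets), len(mu0)
    counts = np.zeros(K)              # N^a_t, number of pulls of each action
    mean_est = np.zeros((K, m))       # empirical estimates of mu^a(i)
    actions = []
    for t in range(T):
        if t < K:                     # initialization: take each action once
            a = t
        else:
            radius = np.sqrt(2.0 * np.log(1.0 / delta_prime) / counts)   # c^a_{t-1}
            # uplifting index tau^a_t = sum_{i in V^a} (UCB on mu^a(i) - mu0(i))
            tau = [np.sum(mean_est[a, affected_sets[a]] + radius[a] - mu0[affected_sets[a]])
                   for a in range(K)]
            a = int(np.argmax(tau))
        y, _ = env.step(a)
        counts[a] += 1
        mean_est[a] += (y - mean_est[a]) / counts[a]   # running mean of the payoff vector under a
        actions.append(a)
    return actions
```

Here env, mu0, and affected_sets stand for the environment, the known baseline payoffs, and the known sets $\mathcal{V}^a$; choosing delta_prime as $\delta/(2KT)$ mirrors the tuning of Proposition 1 below.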
Since UpUCB(b) is nothing but a standard UCB with transformed rewards $r'_t = \sum_{i \in \mathcal{V}^{a_t}} (y_t(i) - \mu^0(i))$, and $\mathbb{E}[r'_t] = r^{a_t}_{\mathrm{up}}$ defines exactly the same regret as the original reward, a standard analysis for UCB yields the following result.

Proposition 1. Let $\delta' = \delta/(2KT)$. Then the regret of UpUCB(b) (Algorithm 3, Appendix A), with probability at least $1 - \delta$, satisfies

$$\mathrm{Reg}_T \le \sum_{a \in \mathcal{A}: \Delta_a > 0} \left( \frac{8 (L^a)^2 \log(2KT/\delta)}{\Delta_a} + \Delta_a \right). \tag{4}$$

As expected, the regret bound does not depend on $m$ and scales with $L^2$. This is because the transformed reward of action $a$ is only $L^a$-sub-Gaussian. The improvement is significant when $L \ll m$. However, this also comes at a price, as both the baseline payoffs and the affected variables need to be known. We address these shortcomings in Sections 5 and 6. On a side note, we remark that observing the aggregated payoff $\sum_{i \in \mathcal{V}^{a_t}} y_t(i)$ is sufficient for UpUCB(b), but this will not be the case for the other algorithms presented in our paper.

Estimating Baseline Payoffs from Observational Data. In practice, the baseline payoffs $\mu^0$ can be estimated from observational data, which gives confidence intervals that the $\mu^0(i)$ lie in with high probability. The uplifting indices can then be constructed by subtracting the lower confidence bounds on $\mu^0(i)$ from $U^a_t(i)$. If the number of samples is $n$, the widths of the confidence intervals are in $O(1/\sqrt{n})$. Identification of $a^\star$ is possible only when the sum of these widths over $L$ variables is at most $\Delta$, which requires $n = \Omega(L^2/\Delta^2)$. Otherwise, these errors persist at each round and $n$ must be on the order of $T$ to keep the regret sublinear.

Gap-Free Bounds. In Proposition 1, we state a high-probability instance-dependent regret bound, and we will continue to do so for all the regret upper bounds that we present for the non-contextual variants. This type of result can be directly transformed into a bound on $\mathbb{E}[\mathrm{Reg}_T]$ by taking $\delta = 1/T$. Following the routine of separating the $\Delta_a$ into two groups depending on their scale, most of our proofs can be modified to obtain a gap-free bound, which is usually in the order of $O(L\sqrt{KT \log T})$. We do not state these results to avoid unnecessary repetition.

## 4 Lower Bounds

In this section, we briefly discuss the necessity of our modeling assumptions for obtaining the improved regret bounds of Proposition 1. The complete discussion appears in Appendix D. Intuitively, the regret can be improved both because the noise in the effect of an action is small, and because the observation of this effect is not heavily deteriorated by noise. These two points correspond respectively to assumptions (I) and (II). Moreover, the knowledge of $(\mathcal{V}^a)_{a \in \mathcal{A}}$ allows the learner to distinguish between problems with different structures; without such distinction, there is no chance that the learner can leverage the underlying structure. Therefore, the aforementioned three points are crucial for obtaining (4). Below, we establish this formally for algorithms that are consistent [19] over the class of 1-sub-Gaussian uplifting bandits, which means that the regret induced by the algorithm on any uplifting bandit with 1-sub-Gaussian noise satisfies $\mathrm{Reg}_t = o(t^p)$ for all $p > 0$.

Proposition 2. Let $\pi$ be a consistent algorithm over the class of 1-sub-Gaussian uplifting bandits that at most uses the knowledge of $P_0$, $(\mathcal{V}^a)_{a \in \mathcal{A}}$, and the fact that the noise is 1-sub-Gaussian. Let $K, m > 0$ and the sequence $(L^a, \Delta_a)_{1 \le a \le K} \in ([m] \times \mathbb{R}_+)^K$ satisfy $\Delta_1 = 0$. Assume either of the following holds. (a) $L^a = m$ for all $a \in \mathcal{A}$, so that in the bandits considered below all actions affect all variables. (b) Only the reward is observed. (c) The algorithm $\pi$ does not make use of any prior knowledge about the arms' expected payoffs $(\mu^a)_{a \in \mathcal{A}}$ (in particular, the knowledge of $(\mathcal{V}^a)_{a \in \mathcal{A}}$ is not used by $\pi$).
Then there exists a 1-sub-Gaussian $(K, m)$-uplifting bandit whose suboptimality gaps and numbers of affected variables are respectively $(\Delta_a)_{a \in \mathcal{A}}$ and $(L^a)_{a \in \mathcal{A}}$, such that the regret induced by $\pi$ on it satisfies

$$\liminf_{T \to +\infty} \frac{\mathbb{E}[\mathrm{Reg}_T]}{\log T} \ge \sum_{a \in \mathcal{A}: \Delta_a > 0} \frac{2m^2}{\Delta_a}.$$

Proposition 2 states an instance-dependent lower bound for a learner that may be equipped with full knowledge of the baseline distribution. (Of course, the problem only becomes more challenging if the learner does not know the baseline distribution.) Its proof is presented in Appendix D. At this point, what remains unclear is whether a similar improvement of the regret is still possible when the baseline is unknown or when the learner only has access to more restricted knowledge than $(\mathcal{V}^a)_{a \in \mathcal{A}}$. We give affirmative answers to these two questions in the next two sections.

## 5 Case of Unknown Baseline

In this section, we consider the situation where the learner knows the sets of affected variables $(\mathcal{V}^a)_{a \in \mathcal{A}}$, but not the baseline payoffs. Since the actual uplift at each round is never observed, the uplifts of the actions can hardly be estimated directly in this case. Instead, we follow a two-model approach. Leveraging the fact that $P_a$ and $P_0$ have the same marginal distribution on $\overline{\mathcal{V}^a}$, we can estimate the baseline payoffs by aggregating information from the feedback of different actions. This leads to

$$N^0_t(i) = \sum_{s=1}^{t} \mathbb{1}\{i \notin \mathcal{V}^{a_s}\}, \qquad c^0_t(i) = \sqrt{\frac{2 \log(1/\delta')}{N^0_t(i)}}, \qquad \hat{\mu}^0_t(i) = \frac{\sum_{s=1}^{t} y_s(i)\, \mathbb{1}\{i \notin \mathcal{V}^{a_s}\}}{\max(1, N^0_t(i))}. \tag{5}$$

Compared to (3), we notice that both $N^0_t$ and $c^0_t$ are functions of $i$. This is because for each taken action $a$, we only increase the counters $N^0_t(i)$ for those $i \notin \mathcal{V}^a$, which causes a non-uniform increase of $(N^0_t(i))_{i \in \mathcal{V}}$. Then, by looking at all the rounds in which variable $i$ is not influenced by the chosen action, we manage to compute $\hat{\mu}^0_t(i)$, an estimate of $\mu^0(i)$. To proceed, we define the following UCB indices for all pairs $(a, i) \in \mathcal{A}_0 \times \mathcal{V}$:

$$U^a_t(i) = \begin{cases} 0 & \text{if } a = 0 \text{ and } i \in \bigcap_{a' \in \mathcal{A}} \mathcal{V}^{a'}, \\ \hat{\mu}^a_{t-1}(i) + c^a_{t-1} & \text{if } a \in \mathcal{A} \text{ and } i \in \mathcal{V}^a, \\ \hat{\mu}^0_{t-1}(i) + c^0_{t-1}(i) & \text{otherwise.} \end{cases} \tag{6}$$

The second and third lines of (6) contain the usual definition of UCBs using the empirical estimates and the radii of the confidence intervals defined in (3) and (5). In the special case where a variable is affected by all the actions, it is impossible to estimate $U^0(i)$, but it is enough to compare $U^a(i)$ directly against $U^{a'}(i)$ for any two actions $a, a' \in \mathcal{A}$, so we simply set $U^0_t(i)$ to 0 in this case.

We outline the proposed method, UpUCB, in Algorithm 1. The uplifting indices are given by $\tau^a_t = \sum_{i \in \mathcal{V}^a} (U^a_t(i) - U^0_t(i))$. It may seem counter-intuitive to compare differences between two UCBs. Indeed, $\tau^a_t$ is no longer an optimistic estimate of the uplifting effect $r^a_{\mathrm{up}}$, but it captures the essential trade-off between learning action $a$'s payoffs and learning the baseline $\mu^0$. To provide some intuition, we give an informal justification of UpUCB in Fig. 2a: if a suboptimal action $a$ is taken in round $t$, the estimates $U^a_t(i)$ all move closer to the actual mean from above. As a result, $\tau^a_t$ decreases, since all $U^a_t(i)$ for $i \in \mathcal{V}^a$ do, and action $a$ is less likely to be taken next. Moreover, $\tau^{a^\star}_t$ increases, since $U^0_t(i)$ decreases for any $i$ affected by $a^\star$ but not by $a$, and thus $a^\star$ is more likely to be taken next.

Algorithm 1 UpUCB
1: Input: error probability $\delta'$, the sets of variables each action affects $\{\mathcal{V}^a : a \in \mathcal{A}\}$
2: Initialization: take each action once
3: for $t = K+1, \dots, T$ do
4: Compute the UCB indices following (6)
5: For $a \in \mathcal{A}$, set $\tau^a_t \leftarrow \sum_{i \in \mathcal{V}^a} (U^a_t(i) - U^0_t(i))$
6: Select action $a_t \in \arg\max_{a \in \mathcal{A}} \tau^a_t$
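For illustration, the indices in (6) can be computed from running statistics as sketched below, using the per-variable baseline counters of (5); the function and its arguments are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def upucb_indices(counts_a, mean_a, counts0, mean0, affected_sets, delta_prime):
    """Uplifting indices tau^a_t = sum_{i in V^a} (U^a_t(i) - U^0_t(i)) as in (6).

    counts_a : shape (K,)   -- N^a_{t-1}, pulls of each action
    mean_a   : shape (K, m) -- empirical estimates of mu^a(i)
    counts0  : shape (m,)   -- per-variable baseline counters N^0_{t-1}(i)
    mean0    : shape (m,)   -- empirical baseline estimates of mu^0(i)
    """
    K, m = mean_a.shape
    c_a = np.sqrt(2.0 * np.log(1.0 / delta_prime) / np.maximum(counts_a, 1))
    c0 = np.sqrt(2.0 * np.log(1.0 / delta_prime) / np.maximum(counts0, 1))
    ucb0 = mean0 + c0
    # Variables affected by every action: U^0_t(i) is set to 0 (first case of (6)).
    always_affected = set.intersection(*[set(v) for v in affected_sets])
    idx0 = np.array(sorted(always_affected), dtype=int)
    ucb0[idx0] = 0.0
    tau = np.empty(K)
    for a in range(K):
        idx = np.asarray(affected_sets[a])
        tau[a] = np.sum(mean_a[a, idx] + c_a[a] - ucb0[idx])
    return tau
```

After playing $a_t$ and observing $y_t$, the caller would update counts_a[a_t] and mean_a[a_t] and, for every $i \notin \mathcal{V}^{a_t}$, fold $y_t(i)$ into counts0[i] and mean0[i], mirroring (5).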
The effectiveness of UpUCB is confirmed by the following theorem.

Theorem 1. Let $\delta' = \delta/(4KLT)$. Then the regret of UpUCB (Algorithm 1), with probability at least $1 - \delta$, satisfies

$$\mathrm{Reg}_T \le \sum_{a \in \mathcal{A}: \Delta_a > 0} \left( \frac{8 (L^a + L^{a^\star})^2 \log(4KLT/\delta)}{\Delta_a} + \Delta_a \right). \tag{7}$$

Idea of proof. The full proof of Theorem 1 is presented in Appendix C.2 and proceeds as follows.

1. By concentration of measure, with probability $1 - \delta$ it holds that $|\hat{\mu}^a_t(i) - \mu^a(i)| \le c^a_t$ and $|\hat{\mu}^0_t(i) - \mu^0(i)| \le c^0_t(i)$ for all relevant estimates. It is thus sufficient to show that (7) holds when these inequalities are satisfied.

2. A suboptimal action $a$ can only be taken if its uplifting index is larger than that of $a^\star$, i.e., if $\sum_{i \in \mathcal{V}^a} (U^a_t(i) - U^0_t(i)) \ge \sum_{i \in \mathcal{V}^{a^\star}} (U^{a^\star}_t(i) - U^0_t(i))$. Rearranging, we get

$$\sum_{i \in \mathcal{V}^a} U^a_t(i) + \sum_{i \in \mathcal{V}^{a^\star} \setminus \mathcal{V}^a} U^0_t(i) \;\ge\; \sum_{i \in \mathcal{V}^{a^\star}} U^{a^\star}_t(i) + \sum_{i \in \mathcal{V}^a \setminus \mathcal{V}^{a^\star}} U^0_t(i). \tag{8}$$

Using the inequalities mentioned in the previous point and $\Delta_a = r^{a^\star}_{\mathrm{up}} - r^a_{\mathrm{up}}$, and bounding the two sides of (8), we deduce $\Delta_a \le 2 L^a c^a_{t-1} + \sum_{i \in \mathcal{V}^{a^\star} \setminus \mathcal{V}^a} 2 c^0_{t-1}(i)$.

3. Note that for all $i \in \mathcal{V}^{a^\star} \setminus \mathcal{V}^a$, whenever action $a$ is taken, the count $N^0_s(i)$ also increases by 1. We thus have $N^0_{t-1}(i) \ge N^a_{t-1}$ and accordingly $c^0_{t-1}(i) \le c^a_{t-1}$. We then get $\Delta_a \le 2 (L^a + L^{a^\star}) c^a_{t-1}$. This allows us to bound the number of times that action $a$ is taken and conclude.
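For completeness, here is the routine counting step that turns point 3 above into (7), under the concentration event of point 1:

$$\Delta_a \le 2\,(L^a + L^{a^\star})\, c^a_{t-1} = 2\,(L^a + L^{a^\star}) \sqrt{\frac{2\log(1/\delta')}{N^a_{t-1}}} \quad\Longrightarrow\quad N^a_{t-1} \le \frac{8\,(L^a + L^{a^\star})^2 \log(1/\delta')}{\Delta_a^2}.$$

Hence each suboptimal action $a$ is taken at most $8(L^a + L^{a^\star})^2 \log(1/\delta')/\Delta_a^2 + 1$ times (the extra pull accounting for the initialization); multiplying by $\Delta_a$, summing over the suboptimal actions, and substituting $\delta' = \delta/(4KLT)$ gives (7).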
As in Proposition 1, the regret of Theorem 1 is in $O(KL^2 \log T / \Delta)$. In fact, all decisions in UpUCB are made on uncertain estimates of at most $L$ variables; thus the statistical efficiency scales with $L$ and not $m$. A detailed comparison with Proposition 1 reveals that the difference is in the order of $K (L^{a^\star})^2 \log T / \Delta$; this is only significant when the optimal actions affect many more variables than the suboptimal ones. Hence, the price of not knowing the baseline payoffs is generally quite small.

## 6 Case of Unknown Affected Variables

Now we study the more challenging setting where the affected variables $(\mathcal{V}^a)_{a \in \mathcal{A}}$ are unknown to the learner. Proposition 2 states that improvement is impossible if the learner does not have any prior knowledge to exploit the structure. To circumvent this negative result, we study two weak assumptions motivated by practice: the learner has access to (i) an upper bound on the number of affected variables or (ii) a lower bound on the individual uplift. Due to space constraints, we only present the first setting here and defer the discussion of the other to Appendix E.

For the rest of this section, we assume that we know an upper bound $L$ on the number of affected variables (i.e., $L \ge \max_{a \in \mathcal{A}} L^a$). We design algorithms with $O(KL^2 \log T / \Delta)$ regret bounds that take $L$ as input. We consider the cases of known and unknown baseline payoffs.

Known Baseline Payoffs. To illustrate our ideas, we start by assuming that the baseline payoffs are known. We propose an optimistic algorithm that maintains a UCB on the total uplift with an overestimate in the order of $L$. Let $N^a_t$, $c^a_t$, and $\hat{\mu}^a_t(i)$ be defined as in (3). The UCBs, individual uplifting indices, and confidence intervals for each (action, variable) pair $(a, i) \in \mathcal{A} \times \mathcal{V}$ are

$$U^a_t(i) = \hat{\mu}^a_{t-1}(i) + c^a_{t-1}, \qquad \rho^a_t(i) = U^a_t(i) - \mu^0(i), \qquad C^a_t(i) = [\hat{\mu}^a_{t-1}(i) - c^a_{t-1},\, \hat{\mu}^a_{t-1}(i) + c^a_{t-1}]. \tag{9}$$

In the rest of the paper, we refer to $\rho^a_t(i)$ as the individual uplifting index of the pair $(a, i)$. It is an overestimate of the individual uplift $\mu^a_{\mathrm{up}}(i)$. As for $C^a_t(i)$, it is the confidence interval that $\mu^a(i)$ lies in with high probability.

Our algorithm, UpUCB-nAff(b), with nAff for number of affected, leverages two important procedures to compute an optimistic estimate of the uplift $r^a_{\mathrm{up}}$: identification of affected variables, and padding with the variables with the highest individual uplifting indices. To begin, UpUCB-nAff(b) constructs the set of identified variables $\widehat{\mathcal{V}}^a_t = \{i \in \mathcal{V} : \mu^0(i) \notin C^a_t(i)\}$, which is contained in $\mathcal{V}^a$ with high probability. In fact, by concentration of measure, with high probability $\mu^a(i) \in C^a_t(i)$, in which case $\mu^0(i) \notin C^a_t(i)$ indicates $i \in \mathcal{V}^a$. However, $\widehat{\mathcal{V}}^a_t$ is not guaranteed to capture all the affected variables, so we also need to provide an upper bound on $\sum_{i \in \mathcal{V}^a \setminus \widehat{\mathcal{V}}^a_t} \mu^a_{\mathrm{up}}(i)$, the uplift contributed by the unidentified affected variables. Since the individual uplifting index $\rho^a_t(i)$ is in fact a UCB on the individual uplift $\mu^a_{\mathrm{up}}(i)$ here and $\mathrm{card}(\mathcal{V}^a) \le L$, we can simply choose the $L - \mathrm{card}(\widehat{\mathcal{V}}^a_t)$ variables in $\mathcal{V} \setminus \widehat{\mathcal{V}}^a_t$ with the largest $\rho^a_t(i)$. Let us refer to this set as $\mathcal{L}^a_t$. We then get a proper UCB on the uplift of action $a$ by computing $\tau^a_t = \sum_{i \in \widehat{\mathcal{V}}^a_t \cup \mathcal{L}^a_t} \rho^a_t(i)$. This process is summarized in Algorithm 4 in Appendix A and illustrated in Fig. 2b.

Unknown Baseline Payoffs. Now we focus on the most challenging setting, where also the baseline payoffs are unknown. In this case, neither the sets of identified variables nor the uplifting indices of UpUCB-nAff(b) can be defined. We also cannot estimate the baseline payoffs using (5), since the sets of affected variables are unknown. To overcome these challenges, we note that for any two actions $a, a' \in \mathcal{A}$, $\mu^a$ and $\mu^{a'}$ only differ on $\mathcal{V}^a \cup \mathcal{V}^{a'}$, and $\mathrm{card}(\mathcal{V}^a \cup \mathcal{V}^{a'}) \le 2L$. Therefore, $\mu^a$ and $\mu^{a'}$ differ in at most $2L$ variables, and we recover a similar problem structure by taking the payoffs of any action as the baseline. Combining this idea with the elements that we have introduced previously, we obtain UpUCB-nAff (Algorithm 2).

Algorithm 2 UpUCB-nAff
1: Input: error probability $\delta'$, upper bound $L$ on the number of variables that each action affects
2: Initialization: take each action once
3: for $t = K+1, \dots, T$ do
4: Choose $b_t \in \arg\max_{a \in \mathcal{A}} N^a_{t-1}$ and compute UCBs and confidence intervals using (9)
5: for $a \in \mathcal{A}$ do
6: Set $\widehat{\mathcal{V}}^a_t \leftarrow \{i \in \mathcal{V} : C^a_t(i) \cap C^{b_t}_t(i) = \emptyset\}$
7: For $i \in \mathcal{V}$, compute $\rho^a_t(i) \leftarrow U^a_t(i) - U^{b_t}_t(i)$
8: Set $L^a_t \leftarrow \max(0, 2L - \mathrm{card}(\widehat{\mathcal{V}}^a_t))$ and $\mathcal{L}^a_t \in \arg\max_{\mathcal{L} \subseteq \mathcal{V} \setminus \widehat{\mathcal{V}}^a_t,\ \mathrm{card}(\mathcal{L}) \le L^a_t} \sum_{i \in \mathcal{L}} \rho^a_t(i)$
9: Compute the uplifting index $\tau^a_t \leftarrow \sum_{i \in \widehat{\mathcal{V}}^a_t \cup \mathcal{L}^a_t} \rho^a_t(i)$
10: Select action $a_t \in \arg\max_{a \in \mathcal{A}} \tau^a_t$

In each round, UpUCB-nAff starts by picking a most frequently taken action $b_t$ (Line 4), whose payoffs are treated as the baseline in that round. Then, in Line 6, UpUCB-nAff chooses variables that are guaranteed to be either in $\mathcal{V}^a$ or in $\mathcal{V}^{b_t}$; this generalizes the identification step of UpUCB-nAff(b). The individual uplifting indices $\rho^a_t(i) = U^a_t(i) - U^{b_t}_t(i)$ are computed in Line 7. These differences of UCBs are inspired by a similar construction in UpUCB. Line 8 constitutes the padding step, during which variables with the highest uplifting indices are selected, and finally in Line 9 we combine the above to get the uplifting index of the action. To see why UpUCB-nAff is similar to UpUCB-nAff(b), suppose that one action has been taken frequently. Then the baseline payoffs are precisely estimated and do not change much between consecutive rounds.
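To make the identification and padding steps concrete, the following is an illustrative sketch of how the uplifting index of a single action could be computed in UpUCB-nAff(b); the function name and signature are assumptions, not the authors' code.

```python
import numpy as np

def upucb_naff_b_index(mean_a, count_a, mu0, L, delta_prime):
    """Uplifting index of one action for UpUCB-nAff(b) (known baseline, unknown V^a).

    mean_a  : shape (m,) empirical estimates of mu^a(i)
    count_a : scalar N^a_{t-1}
    Identification keeps the variables whose confidence interval excludes mu0(i);
    padding adds the L - |identified| largest individual uplifting indices.
    """
    c = np.sqrt(2.0 * np.log(1.0 / delta_prime) / count_a)
    rho = mean_a + c - mu0                      # individual uplifting indices rho^a_t(i)
    identified = np.abs(mean_a - mu0) > c       # mu0(i) outside [mean - c, mean + c]
    budget = max(0, L - int(identified.sum()))
    candidates = np.where(~identified)[0]
    pad = candidates[np.argsort(rho[candidates])[::-1][:budget]]
    return rho[identified].sum() + rho[pad].sum()
```

For UpUCB-nAff, the same routine would replace mu0 by the UCB vector of the reference action $b_t$, test $C^a_t(i) \cap C^{b_t}_t(i) = \emptyset$ for identification, and use a padding budget of $\max(0, 2L - \mathrm{card}(\widehat{\mathcal{V}}^a_t))$, as in Lines 6 to 9 of Algorithm 2.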
Regret. Both UpUCB-nAff(b) and UpUCB-nAff choose $O(L)$ variables for estimating the uplift of an action, and the decisions are based on these estimates. Therefore, the statistical efficiencies of these algorithms only scale with $L$ and not $m$. This in turn translates into an improvement of the regret, as demonstrated by the theorem below.

Theorem 2. Let $\delta' = \delta/(2KmT)$. Then the regret of UpUCB-nAff (Algorithm 2) (resp. UpUCB-nAff(b), Algorithm 4), with probability at least $1 - \delta$, satisfies

$$\mathrm{Reg}_T \le \sum_{a \in \mathcal{A}: \Delta_a > 0} \left( \frac{\alpha L^2 \log(2KmT/\delta)}{\Delta_a} + \Delta_a \right), \tag{10}$$

where $\alpha = 512$ (resp. $\alpha = 32$).

Idea of proof. The full proof of Theorem 2 is presented in Appendix C.3. We outline below the main steps to prove the result for UpUCB-nAff(b).

1. By concentration of measure, with probability $1 - \delta$ it holds that $\mu^a(i) \in C^a_t(i)$. Then, as explained in the text, $\widehat{\mathcal{V}}^a_t \subseteq \mathcal{V}^a$ and $\tau^a_t$ is an upper bound on $r^a_{\mathrm{up}}$; in particular $\tau^{a^\star}_t \ge r^{a^\star}_{\mathrm{up}}$.

2. For $i \notin \widehat{\mathcal{V}}^a_t$, by definition $\mu^0(i) \in C^a_t(i)$, from which we deduce $0 \le \rho^a_t(i) \le 2 c^a_{t-1}$. With $\widehat{\mathcal{V}}^a_t \subseteq \mathcal{V}^a$ and $\mathcal{L}^a_t \subseteq \mathcal{V} \setminus \widehat{\mathcal{V}}^a_t$ we can then write

$$\tau^a_t = \sum_{i \in \widehat{\mathcal{V}}^a_t \cup \mathcal{L}^a_t} \rho^a_t(i) \le \sum_{i \in \mathcal{V}^a} \rho^a_t(i) + \sum_{i \in \mathcal{L}^a_t} \rho^a_t(i) \le r^a_{\mathrm{up}} + 4 L c^a_{t-1}.$$

We have also used $\mathrm{card}(\mathcal{V}^a) \le L$, $\mathrm{card}(\mathcal{L}^a_t) \le L$, and $\rho^a_t(i) \le \mu^a_{\mathrm{up}}(i) + 2 c^a_{t-1}$.

3. Whenever a suboptimal action $a$ is taken, we have $\tau^a_t \ge \tau^{a^\star}_t$. Combined with the two previous points, we deduce $\Delta_a \le 4 L c^a_{t-1}$. Subsequently, we bound the number of times that each suboptimal action $a$ is taken and conclude.

The proof of Theorem 2 is notable for two reasons. First, tracking the identified variables guarantees that the uplifting index $\tau^a_t$ does not overestimate the uplift $r^a_{\mathrm{up}}$ by too much. Take UpUCB-nAff(b) as an example: an alternative for constructing a UCB on $r^a_{\mathrm{up}}$ is to choose the $L$ variables with the highest individual uplifting indices $\rho^a_t(i)$, but this would result in a severe overestimate when a negative individual uplift is present. Second, to prove (10) for UpUCB-nAff, we use the fact that the widths of the confidence intervals of the chosen $b_t$ are always smaller than those of the taken action. This is ensured by taking $b_t$ as a most frequently taken action (Line 4 in Algorithm 2).

## 7 Contextual Extensions

In this section, we briefly discuss potential contextual extensions of our model; a detailed case study is presented in Appendix F. As in contextual bandits, the context is side information that helps the learner make a more informed decision, which results in a higher reward. To incorporate context, one possibility is to associate each variable with a feature vector $x_t(i) \in \mathbb{R}^d$. The subscript $t$ indicates that the context can change from one round to another. We also associate each action with a function $f^a$ so that the expected payoff of action $a$ acting on a variable with feature $x_t(i)$ is $f^a(x_t(i))$. The expected reward of choosing $a$ at round $t$ is then $r^a(x_t) = \sum_{i \in \mathcal{V}} f^a(x_t(i))$. The optimal action in round $t$ is $a^\star_t = \arg\max_{a \in \mathcal{A}} r^a(x_t)$ and the regret of a learner that takes the actions $(a_t)_{t \in [T]}$ is given by $\mathrm{Reg}_T = \sum_{t=1}^{T} \sum_{i \in \mathcal{V}} \big(f^{a^\star_t}(x_t(i)) - f^{a_t}(x_t(i))\big)$.

The key structure in our model (Section 2) is that there exists a baseline payoff vector $\mu^0$ such that for any given action $a$, $\mu^a(i) = \mu^0(i)$ holds for most $i \in \mathcal{V}$. Given context, this translates into the existence of a baseline function $f^0$ such that for any given $a$ and $t$, $f^a(x_t(i)) = f^0(x_t(i))$ holds for most $i \in \mathcal{V}$. The uplift of action $a$ is defined as $r^a_{\mathrm{up}}(x_t) = \sum_{i \in \mathcal{V}} \big(f^a(x_t(i)) - f^0(x_t(i))\big)$.

For concreteness, let us consider a model with linear payoffs. Then each action is associated with an unknown parameter $\theta^a$ and the expected payoff is the scalar product of $\theta^a$ and the feature of the variable, i.e., $f^a(x_t(i)) = \langle \theta^a, x_t(i) \rangle$. We also assume the aforementioned equality to hold for the baseline function $f^0$ (with parameter $\theta^0$), and we use $\mathcal{V}^a_t = \{i \in \mathcal{V} : \langle \theta^a, x_t(i) \rangle \ne \langle \theta^0, x_t(i) \rangle\}$ for the variables affected by action $a$ at round $t$. One sufficient condition for $\mathcal{V}^a_t$ to be small is sparsity in both the parameter difference $\theta^a_{\mathrm{up}} = \theta^a - \theta^0$ and the context vectors $x_t(i)$. In fact, $\langle \theta^a, x_t(i) \rangle = \langle \theta^0, x_t(i) \rangle$ as long as the supports of $\theta^a_{\mathrm{up}}$ and $x_t(i)$ are disjoint. We assume below that $\mathrm{card}(\mathcal{V}^a_t)$ is uniformly bounded by $L$.

Clearly, our algorithms can be directly applied as long as we can construct a UCB on $\langle \theta^a, x_t(i) \rangle$. This can for example be done using the construction of linear UCB [1]. In this way, the decision of the learner is again based on the uncertain estimates of at most $O(L)$ variables, and we expect similar improvements as in our earlier theorems. As an example, when both $\theta^0$ and $\mathcal{V}^a_t$ are known, UpUCB(b) adapted to this situation constructs a UCB on $\sum_{i \in \mathcal{V}^a_t} \langle \theta^a_{\mathrm{up}}, x_t(i) \rangle$, and it is straightforward to show that the regret of such an algorithm can be as small as $O(Ld\sqrt{KT})$. In contrast, if the learner works directly with the total reward, the regret is in $O(md\sqrt{KT})$. (We present gap-free bounds here, and thus we get $L$ versus $m$ in place of $L^2$ versus $m^2$. As in the previous sections, these bounds apply to potentially dependent noise.)
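To illustrate how per-variable UCBs on $\langle \theta^a, x_t(i) \rangle$ could be maintained in the linear case, here is a ridge-regression sketch in the spirit of linear UCB [1]; the class, the fixed confidence scale beta, and all names are assumptions for exposition rather than the construction analyzed in Appendix F.

```python
import numpy as np

class LinearUpliftUCB:
    """Per-action ridge regression with elliptical confidence widths.

    For each action a, maintains A_a = lam * I + sum_x x x^T and b_a = sum_x x * y over
    the (feature, payoff) pairs observed when a was played, and returns UCBs on
    <theta^a, x> for a batch of variable features.
    """

    def __init__(self, K, d, lam=1.0, beta=1.0):
        self.A = np.stack([lam * np.eye(d) for _ in range(K)])  # (K, d, d)
        self.b = np.zeros((K, d))
        self.beta = beta                                         # confidence-width scale

    def update(self, a, X, y):
        """X: (n, d) features of the variables observed under action a, y: (n,) payoffs."""
        self.A[a] += X.T @ X
        self.b[a] += X.T @ y

    def ucb(self, a, X):
        """Upper confidence bounds on <theta^a, x_t(i)> for each row of X."""
        A_inv = np.linalg.inv(self.A[a])
        theta_hat = A_inv @ self.b[a]
        width = self.beta * np.sqrt(np.einsum("nd,de,ne->n", X, A_inv, X))
        return X @ theta_hat + width
```

With a known baseline parameter $\theta^0$ and known sets $\mathcal{V}^a_t$, an UpUCB(b)-style index would then be $\tau^a_t = \sum_{i \in \mathcal{V}^a_t} (\mathrm{ucb}(a, x_t(i)) - \langle \theta^0, x_t(i) \rangle)$, keeping the decision based on at most $O(L)$ uncertain estimates per round.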
## 8 Numerical Experiments

[Figure 3: Experimental results on a synthetic and a real-world dataset: (a) Gaussian uplifting bandit; (b) Bernoulli uplifting bandit. Each panel compares UCB, UpUCB(b), UpUCB, UpUCB-nAff(b), UpUCB-nAff, and Thompson sampling as a function of the number of iterations. All curves are averaged over 100 runs and the shaded areas represent the standard errors.]

In this section, we present numerical experiments to demonstrate the benefit of estimating uplifts in our model. We compare the methods introduced in Sections 3, 5 and 6 against UCB and Thompson sampling with a Gaussian prior and Gaussian noise, which only use the observed rewards $(r_t)_{t \in [T]}$. To ensure a fair comparison, we tune all the considered algorithms and report results for the parameters that yield the best average performance. The detailed procedure, along with additional experimental details and results, is provided in Appendices G and H. The experiments for the contextual extension of Section 7 are presented in Appendix F.3.

Gaussian Uplifting Bandit. We first study our algorithms in a Gaussian uplifting bandit with $K = 10$ actions, $m = 100$ variables, and $L^a = 10$ for all $a \in \mathcal{A}$, meaning that each action affects 10 variables (by Gaussian uplifting bandit we mean that the noise vectors $(\xi^a)_{a \in \mathcal{A}}$ are Gaussian). The expected payoffs are contained in $[0, 1]$, and the covariance matrix of the noise is taken to be the same for all the actions. The suboptimality gap of the created problem is around 0.2, and the variance of the total noise $\sum_{i \in \mathcal{V}} \xi^a(i)$ is around 80.

Bernoulli Uplifting Bandit with Criteo Uplift. We use the Criteo Uplift Prediction Dataset [13] with visit as the outcome variable to build a Bernoulli uplifting bandit, where the payoff of each variable has a Bernoulli distribution. This dataset is designed for uplift modeling and has outcomes for both treated and untreated individuals, which makes it suitable for our simulations. To build the model, we sample $10^5$ examples from the dataset and use K-means to partition these samples into 20 clusters of various sizes. The $10^5$ examples are taken as our variables.
We consider 20 actions that correspond to treating the individuals of each cluster, and construct independent Bernoulli payoffs using the visit rates of the treated/untreated individuals of the clusters, following a procedure detailed in Appendix G.2. Here, $L = 12654$ and $\Delta$ is around 30.

Results. Fig. 3 confirms that we can effectively achieve much smaller regret by restricting our attention to the uplift. Moreover, when the sets of affected variables are known, the loss from not knowing the baseline payoffs seems to be minimal. On the other hand, not knowing the affected variables has a more severe effect in the second experiment. In fact, the designs of UpUCB-nAff and UpUCB-nAff(b) rely heavily on the additive structure of the uplifting index, and can thus hardly benefit from the payoff independence which allows the other four algorithms to achieve smaller regret in this case.

This paper studies multi-armed bandit problems where the rewards are sums of observable variables. When each action only affects a limited number of these variables, much smaller regret can be achieved, and we developed algorithms with such guarantees under different forms of knowledge that the learner possesses. While we study UCB-style algorithms here, we believe that understanding how similar improvements can be achieved by other types of algorithms, such as Thompson sampling and information-directed sampling, is an important question. Moreover, further extending our work to cope with non-stationary or even adversarial bandits is another promising direction to pursue. As for the former, a direct combination with existing techniques [14] can readily make our algorithms bypass the stationarity assumption that we make throughout the paper.

## References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24:2312–2320, 2011.
[2] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1–9. PMLR, 2012.
[3] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
[4] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
[5] Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.
[6] Jeroen Berrevoets, Sam Verboven, and Wouter Verbeke. Treatment effect optimisation in dynamic environments. Journal of Causal Inference, 10(1):106–122, 2022.
[7] Djallel Bouneffouf, Irina Rish, and Charu Aggarwal. Survey on applications of multi-armed and contextual bandits. In 2020 IEEE Congress on Evolutionary Computation (CEC), pages 1–8. IEEE, 2020.
[8] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[9] Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
[10] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159. PMLR, 2013.
[11] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, pages 355–366, 2008.
[12] Rémy Degenne and Vianney Perchet. Combinatorial semi-bandit with known covariance. In Advances in Neural Information Processing Systems, pages 2972–2980, 2016.
[13] Eustache Diemert, Artem Betlei, Christophe Renaudin, and Massih-Reza Amini. A large scale benchmark for uplift modeling. In International Conference on Knowledge Discovery and Data Mining. ACM, 2018.
[14] Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, pages 174–188. Springer, 2011.
[15] Pierre Gutierrez and Jean-Yves Gérardy. Causal inference and uplift modelling: A review of the literature. In International Conference on Predictive Applications and APIs, pages 1–13. PMLR, 2017.
[16] Botao Hao, Tor Lattimore, and Wei Deng. Information directed sampling for sparse linear bandits. In Advances in Neural Information Processing Systems, 2021.
[17] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pages 535–543. PMLR, 2015.
[18] Joon Kwon, Vianney Perchet, and Claire Vernade. Sparse stochastic bandits. In Conference on Learning Theory, volume 65, 2017.
[19] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[20] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[21] Pierre Perrault, Etienne Boursier, Michal Valko, and Vianney Perchet. Statistical efficiency of Thompson sampling for combinatorial semi-bandits. Advances in Neural Information Processing Systems, 33:5429–5440, 2020.
[22] Nicholas Radcliffe. Using control groups to target on predicted lift: Building and assessing uplift model. Direct Marketing Analytics Journal, pages 14–21, 2007.
[23] Neela Sawant, Chitti Babu Namballa, Narayanan Sadagopan, and Houssam Nassif. Contextual multi-armed bandits for causal marketing. In International Conference on Machine Learning, 2018.
[24] Yixin Wang, Dawen Liang, Laurent Charlin, and David M. Blei. Causal inference for recommender systems. In Fourteenth ACM Conference on Recommender Systems, pages 426–431, 2020.