# Stochastic Multi-Armed Bandits with Control Variates

Arun Verma, Department of Computer Science, National University of Singapore, arun@comp.nus.edu.sg
Manjesh K. Hanawal, Department of IEOR, IIT Bombay, Mumbai, India, mhanawal@iitb.ac.in

Abstract

This paper studies a new variant of the stochastic multi-armed bandits problem where auxiliary information about the arm rewards is available in the form of control variates. In many applications, like queuing and wireless networks, the arm rewards are functions of some exogenous variables. The mean values of these variables are known a priori from historical data and can be used as control variates. Leveraging the theory of control variates, we obtain mean estimates with smaller variance and tighter confidence bounds. We develop an upper confidence bound based algorithm named UCB-CV and characterize its regret bounds in terms of the correlation between rewards and control variates when they follow a multivariate normal distribution. We also extend UCB-CV to other distributions using resampling methods like jackknifing and splitting. Experiments on synthetic problem instances validate the performance guarantees of the proposed algorithms.

1 Introduction

The Stochastic Multi-Armed Bandits (MAB) setting has been extensively used to study decision making under uncertainty (Thompson, 1933; Bubeck et al., 2012; Lattimore and Szepesvári, 2020). In the classical setting, it is assumed that the arm rewards are independent of each other. After playing an arm, the learner observes independent and identically distributed reward samples. The exploration versus exploitation trade-off is a fundamental problem in the bandits setting, and a learner can accumulate more rewards if it is balanced well. A better balance can be achieved if the learner can estimate the arms' mean rewards with tighter confidence bounds (Auer et al., 2002; Auer and Ortner, 2010; Garivier and Cappé, 2011) and smaller variance (Audibert et al., 2009b; Mukherjee et al., 2018). Any available side or auxiliary information can aid in building tighter confidence bounds. In this paper, we study the improvement in performance that can be achieved when side information is available in the form of control variates.

Any observable input that is correlated with the random variable of interest can be a Control Variate (CV). When the mean of the CV is known, it becomes useful for reducing the variance of the mean estimator of the random variable. The CV method is a popular variance reduction technique that improves the precision of estimates without altering the simulation runs or increasing the number of samples (Lavenberg et al., 1982). Many stochastic simulations involve inputs generated using known distributions. These inputs can be potential CVs, as the output could be correlated with them. This has motivated the development of a rich theory to analyze the quality of the estimators resulting from CVs (Lavenberg and Welch, 1981; Lavenberg et al., 1982; Nelson, 1990). We adopt these techniques to study the stochastic multi-armed bandits problem when CVs are available for the arms.

In many real-life applications, exogenous variables influence arm rewards and act as CVs. For example, consider a job aggregating platform that assigns jobs to one of the servers/workers in each slot. The jobs are of variable size, and the time to serve them depends on their size and other unknown extraneous factors.
The platform observes the size of each job and knows the mean job size from historical data. The random job sizes correlate with service times and can be used as a CV to estimate each server's mean service time. Another example is a wireless system where a transmitter can transmit packets on a channel in each slot. Successful transmission of the packets (quality) depends on random quantities like fading, interference, and channel noise on the selected channel. The transmitter can observe the fading value using pilot signals in each round and knows its mean value from past observations. These fading values on a channel can be used as a CV to estimate the channel's quality.

The performance of a MAB algorithm depends on how good the confidence intervals of its mean reward estimators are (Auer et al., 2002; Auer and Ortner, 2010; Garivier and Cappé, 2011). The tightness of these intervals depends on the variance of the estimators (Audibert et al., 2009b; Mukherjee et al., 2018). Naturally, estimators with smaller variance for the same number of samples result in better performance. Thus, CVs can be leveraged to improve the performance of bandit algorithms through smaller confidence intervals. The reduction depends on how strongly the reward samples and the associated CVs are correlated. Linear methods are widely used for variance reduction with CVs, where new samples are generated by taking a weighted sum of the reward samples and the centered CV samples. These new samples are then used to estimate the mean rewards. The choice of the weight that results in the maximum variance reduction depends on the variance of the CV and its correlation with the reward samples. In practice, both of these quantities are unknown and need to be estimated from the observed samples. However, using the estimated weight makes the new samples highly correlated and makes the analysis of confidence intervals challenging.

For multivariate normal distributions, tight confidence intervals can be constructed that hold with high probability for the mean reward estimators resulting from the linear control variates. Using these results, we develop a MAB algorithm that exploits CVs to improve the regret performance. For general distributions, we discuss resampling methods that result in unbiased estimators with better confidence intervals. Specifically, our contributions are as follows:

- For the case where the rewards and CVs are normally distributed, we develop an algorithm named Upper Confidence Bounds with Control Variates (UCB-CV) that uses estimators based on linear control variates.
- In Section 4, we show that the regret of UCB-CV is smaller by a factor $(1-\rho^2)$ compared to the existing algorithms when the rewards and CVs have a multivariate normal distribution, where $\rho$ is the correlation coefficient between the rewards and control variates.
- In Section 5, we discuss how to adapt UCB-CV to general distributions using estimators and associated confidence intervals based on jackknifing, splitting, and batching methods.
- We validate the performance of UCB-CV on synthetically generated data in Section 6.

1.1 Related Work

Our work uses side or auxiliary information available in the form of CVs to reduce the variance of estimates. In the following, we discuss works that incorporate variance estimates and/or side information to improve the performance of multi-armed bandits algorithms.
Incorporating variance estimates: Many stochastic multi-armed bandits algorithms, like UCB1 (Auer et al., 2002), DMED (Honda and Takemura, 2010), KL-UCB (Garivier and Cappé, 2011), and Thompson Sampling (Chapelle and Li, 2011; Agrawal and Goyal, 2012; Kaufmann et al., 2012; Agrawal and Goyal, 2013), assume that the rewards are bounded on some interval $[0, b]$, define the index of each arm based on its estimated mean reward, and ignore the variance estimates. The regret of these algorithms has an optimal logarithmic bound in the number of rounds ($T$) and grows quadratically in $b$, i.e., $O((b^2/\Delta)\log T)$, where $\Delta$ is the smallest sub-optimality gap. UCB1-NORMAL (Auer et al., 2002) uses the variance estimates in the index of the arms and, when rewards are Gaussian with variance $\sigma^2$, achieves regret of order $O((\sigma^2/\Delta)\log T)$. This bound is even better than that of UCB1 when the variance is smaller than $b^2$, even though it allows unbounded support. Further, the authors experimentally demonstrate that the UCB-Tuned algorithm (Auer et al., 2002), which uses variance estimates in the arm indices, significantly improves regret performance over UCB1 when rewards are bounded. Extending the idea of variance-aware indices, the UCBV algorithm (Audibert et al., 2009b) achieves regret of order $O((\sigma^2/\Delta + 2b)\log T)$, thus confirming that making algorithms variance aware is beneficial. EUCBV (Mukherjee et al., 2018) improves the performance of UCBV using an arm elimination strategy similar to that of UCB-Improved (Auer and Ortner, 2010).

Incorporating side information: In applications like advertising and web search, whether or not users like a displayed item could depend on their profile. Users' profiles are often available to the platform and can be used as side information. Contextual bandits (Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010; Li et al., 2010; Filippi et al., 2010; Abbasi-Yadkori et al., 2011; Chu et al., 2011; Li et al., 2017; Jun et al., 2017; Zhang et al., 2016) and MAB with covariates (Perchet and Rigollet, 2013) exploit such side information to learn the optimal arm. They assume a correlation structure (linearity, GLM, etc.) between the mean reward and the context, where the mean reward varies with the side information (context). In contrast to contextual bandits, the mean rewards do not vary in our unstructured setup with side information available in the form of CVs. We do not use CVs directly to decide which arm to select in each round but use them only to get better estimates of the mean rewards by exploiting their correlation with the rewards. Our motivating examples from job scheduling and communication networks capture such unstructured bandits, where CVs provide information about the rewards but do not alter their mean values.

Our confidence bounds for the mean reward estimators are based on the variance of the estimators rather than on the variance of the reward samples, as was the case in UCBV (Audibert et al., 2009a) and EUCBV (Mukherjee et al., 2018). The samples that we use to estimate the mean rewards are not independent and identically distributed. Hence, we cannot use the standard Bernstein inequality to get confidence bounds on the mean and variance estimates. CVs are used extensively for variance reduction (James, 1985; Botev and Ridder, 2014; Chen and Ghahramani, 2016; Kreutzer et al., 2017; Vlassis et al., 2019) in the Monte Carlo simulation of complex systems (Lavenberg and Welch, 1981; Lavenberg et al., 1982; Nelson, 1989, 1990).
To the best of our knowledge, our work is the first to analyze the performance of stochastic bandits algorithms with control variates.

2 Control Variate for Variance Reduction

Let $\mu$ be the unknown quantity that needs to be estimated, and let $X$ be an unbiased estimator of $\mu$, i.e., $\mathbb{E}[X] = \mu$. A random variable $W$ with known expectation $\omega$ is a CV if it is correlated with $X$. Linear control methods use errors in the estimates of known random variables to reduce the error in the estimation of an unknown random variable. For any choice of a coefficient $\beta$, define a new estimator as $\tilde{X} = X + \beta(\omega - W)$. It is straightforward to verify that its variance is given by
$$\mathrm{Var}(\tilde{X}) = \mathrm{Var}(X) + \beta^2\mathrm{Var}(W) - 2\beta\,\mathrm{Cov}(X, W),$$
and it is minimized by setting $\beta$ to $\beta^\star = \mathrm{Cov}(X, W)/\mathrm{Var}(W)$. The minimum value of the variance is given by $\mathrm{Var}(\tilde{X}) = (1 - \rho^2)\mathrm{Var}(X)$, where $\rho$ is the correlation coefficient of $X$ and $W$. The larger the correlation, the greater the variance reduction achieved by the CV. In practice, the values of $\mathrm{Cov}(X, W)$ and $\mathrm{Var}(W)$ are unknown and need to be estimated to compute the best approximation of $\beta^\star$.

3 Problem Setting

We consider a stochastic multi-armed bandits problem with $K$ arms. The set of arms is represented by $[K] \doteq \{1, 2, \ldots, K\}$. In round $t$, the environment generates a vector $(\{X_{t,i}\}_{i\in[K]}, \{W_{t,i}\}_{i\in[K]})$. The random variable $X_{t,i}$ denotes the reward of arm $i$ in round $t$; these rewards form an Independent and Identically Distributed (IID) sequence drawn from an unknown but fixed distribution with mean $\mu_i$ and variance $\sigma_i^2$. The random variable $W_{t,i}$ is the Control Variate (CV) associated with the reward of arm $i$ in round $t$. These random variables are drawn from an unknown but fixed distribution with mean $\omega_i$ and variance $\sigma_{w,i}^2$ and form an IID sequence. The learner knows the values $\{\omega_i\}_{i\in[K]}$ but not $\{\sigma_{w,i}^2\}_{i\in[K]}$. The correlation coefficient between the rewards and associated control variates of arm $i$ is denoted by $\rho_i$. In our setup, the learner observes the reward and the associated CV of the selected arm. We refer to this new variant of the multi-armed bandits problem as Multi-Armed Bandits with Control Variates (MAB-CV). The parameter vectors $\mu_\sigma = \{\mu_i, \sigma_i^2\}_{i\in[K]}$, $\omega_\sigma = \{\omega_i, \sigma_{w,i}^2\}_{i\in[K]}$, and $\rho = \{\rho_i\}_{i\in[K]}$ identify an instance of the MAB-CV problem, which we denote henceforth by $P = (\mu_\sigma, \omega_\sigma, \rho)$. The collection of all MAB-CV problems is denoted by $\mathcal{P}$. For a problem instance $P \in \mathcal{P}$ with mean reward vector $\mu$, we denote the optimal arm as $i^\star = \arg\max_{i\in[K]} \mu_i$.

MAB-CV problem with instance $(\mu_\sigma, \omega_\sigma, \rho)$. For round $t$:
1. The environment generates vectors $X_t = (X_{t,1}, \ldots, X_{t,K}) \in \mathbb{R}^K$ and $W_t = (W_{t,1}, \ldots, W_{t,K}) \in \mathbb{R}^K$, where $\mathbb{E}[X_{t,i}] = \mu_i$, $\mathrm{Var}(X_{t,i}) = \sigma_i^2$, $\mathbb{E}[W_{t,i}] = \omega_i$, $\mathrm{Var}(W_{t,i}) = \sigma_{w,i}^2$, and the sequences $(X_{t,i})_{t\ge 1}$ and $(W_{t,i})_{t\ge 1}$ are IID for all $i\in[K]$.
2. The learner selects an arm $I_t \in [K]$ based on past observations of the reward and CV samples from the arms up to round $t-1$.
3. Feedback and regret: The learner observes a random reward $X_{t,I_t}$ and the associated CV $W_{t,I_t}$, and then incurs penalty $(\mu_{i^\star} - \mu_{I_t})$.

Let $I_t$ denote the arm selected by the learner in round $t$ based on the observations of past reward and control variate samples. The interaction between the environment and a learner is given in the MAB-CV problem above. We aim to learn policies that accumulate the maximum reward, and we measure a policy's performance by comparing its cumulative reward with that of an Oracle that plays the optimal arm in each round. Specifically, we measure the performance in terms of the regret, defined as follows:
$$R_T = \mathbb{E}\left[\sum_{t=1}^{T}\left(\mu_{i^\star} - \mu_{I_t}\right)\right]. \tag{1}$$
Note that maximizing the mean cumulative reward of a policy is the same as minimizing the policy's regret. A good policy should have sub-linear expected regret, i.e., $R_T/T \to 0$ as $T \to \infty$. Our goal is to learn a policy that has sub-linear and small regret. To this end, we use CVs to estimate the mean rewards with sharper confidence bounds so that the learner can achieve a better exploration-exploitation trade-off and start playing the optimal arm more frequently earlier.
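To make this interaction protocol and the regret defined above concrete, the following is a minimal simulation sketch of the MAB-CV loop; the Gaussian instance, its parameter values, and the placeholder uniform-random policy are illustrative assumptions (UCB-CV, developed in Section 4, replaces the arm-selection step).

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 10, 1000
mu_v = 0.6 - 0.05 * np.arange(K)      # hidden component means (illustrative)
omega = np.full(K, 0.3)               # known CV means, as assumed in MAB-CV
var_v, var_w = 0.1, 0.1               # variances of the two components

mu = mu_v + omega                     # true mean rewards (unknown to the learner)
best = mu.max()
regret = 0.0

for t in range(T):
    # Environment draws rewards and CVs for all arms (only the chosen arm is revealed).
    V = rng.normal(mu_v, np.sqrt(var_v))
    W = rng.normal(omega, np.sqrt(var_w))
    X = V + W

    # Placeholder policy: pick an arm uniformly at random.
    arm = rng.integers(K)

    # Feedback: the learner observes (X[arm], W[arm]); regret accrues mu[i*] - mu[arm].
    x_obs, w_obs = X[arm], W[arm]
    regret += best - mu[arm]

print(f"pseudo-regret of the random policy after {T} rounds: {regret:.1f}")
```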
4 Arms with Normally Distributed Rewards and Control Variates

We first focus on the case where the rewards and the associated CVs of the arms have a multivariate normal distribution. We discuss general distributions in the next section. To bring out the main ideas of the algorithm, we first consider the case where each arm is associated with only one CV. Motivated by the linear CV technique discussed in the previous section, we consider a new sample for an arm $i$ in round $t$ as follows:
$$\tilde{X}_{t,i} = X_{t,i} + \beta_i^\star(\omega_i - W_{t,i}), \tag{2}$$
where $X_{t,i}$ is the $t$th reward, $W_{t,i}$ is the $t$th associated control variate of arm $i$, $\omega_i = \mathbb{E}[W_{t,i}]$, and $\beta_i^\star = \mathrm{Cov}(X_{t,i}, W_{t,i})/\mathrm{Var}(W_{t,i})$. Using $s$ such samples, the mean reward estimate for arm $i$ is defined as $\hat{\mu}^c_{s,i} = \frac{1}{s}\sum_{r=1}^{s}\tilde{X}_{r,i}$. Let $\hat{\mu}_{s,i} = \frac{1}{s}\sum_{r=1}^{s} X_{r,i}$ and $\hat{\omega}_{s,i} = \frac{1}{s}\sum_{r=1}^{s} W_{r,i}$ denote the sample means of the rewards and CVs of arm $i$ from $s$ samples. Then $\hat{\mu}^c_{s,i}$ can be written as follows:
$$\hat{\mu}^c_{s,i} = \hat{\mu}_{s,i} + \beta^\star_{s,i}(\omega_i - \hat{\omega}_{s,i}). \tag{3}$$
Since the values of $\mathrm{Cov}(X_i, W_i)$ and $\mathrm{Var}(W_i)$ are unknown, the optimal coefficient $\beta^\star_{s,i}$ needs to be estimated. After observing $s$ samples, it is estimated as below and used in Eq. (3):
$$\hat{\beta}^\star_{s,i} = \frac{\sum_{r=1}^{s}(X_{r,i} - \hat{\mu}_{s,i})(W_{r,i} - \omega_i)}{\sum_{r=1}^{s}(W_{r,i} - \omega_i)^2}. \tag{4}$$
When the variance of the CV is known, it can be used directly to estimate $\beta^\star_i$ by replacing the estimated variance with the actual variance of the CV in the denominator of Eq. (4), i.e., replacing $\sum_{r=1}^{s}(W_{r,i} - \omega_i)^2/(s-1)$ by $\sigma^2_{w,i}$.

After observing the reward and the associated control variate of arm $i$, the value of $\hat{\beta}^\star_i$ is re-estimated. The value of $\hat{\beta}^\star_i$ depends on all observed rewards and associated control variates of arm $i$. Since all the $\tilde{X}_{\cdot,i}$ use the same $\hat{\beta}^\star_i$, this leads to correlation among the $\tilde{X}_{\cdot,i}$ observations. Let $\mathrm{Var}(\tilde{X}_i) = \sigma^2_{c,i}$ denote the variance of the linear CV samples and $\nu_{s,i} = \mathrm{Var}(\hat{\mu}^c_{s,i})$ denote the variance of the estimator using these samples. Since the mean reward estimator $\hat{\mu}^c_{s,i}$ is computed using correlated samples, we cannot use the Bernstein inequality to get confidence bounds on the mean reward estimates. However, when the rewards and CVs follow a multivariate normal distribution, the following results tell us how to obtain an unbiased variance estimator and a confidence interval for $\hat{\mu}^c_{s,i}$.

Lemma 1. Let the reward and control variate of each arm have a multivariate normal distribution. After observing $s$ samples of the reward and control variate of arm $i$, define
$$\hat{\nu}_{s,i} = \frac{Z_{s,i}\,\hat{\sigma}^2_{c,i}(s)}{s}, \quad \text{where} \quad Z_{s,i} = 1 + \frac{\left(\sum_{r=1}^{s}(W_{r,i} - \omega_i)\right)^2}{s\sum_{r=1}^{s}(W_{r,i} - \omega_i)^2} \quad \text{and} \quad \hat{\sigma}^2_{c,i}(s) = \frac{1}{s-2}\sum_{r=1}^{s}\left(\tilde{X}_{r,i} - \hat{\mu}^c_{s,i}\right)^2.$$
Then $\hat{\nu}_{s,i}$ is an unbiased variance estimator of $\hat{\mu}^c_{s,i}$, i.e., $\mathbb{E}[\hat{\nu}_{s,i}] = \mathrm{Var}(\hat{\mu}^c_{s,i})$.

The confidence interval for the reward estimator is given by our next result.

Lemma 2. Let the conditions in Lemma 1 hold and let $s$ be the number of reward and associated control variate samples of arm $i$ in round $t$. Then
$$\mathbb{P}\left\{|\hat{\mu}^c_{s,i} - \mu_i| \ge V^{(\alpha)}_{t,s}\sqrt{\hat{\nu}_{s,i}}\right\} \le 2/t^{\alpha},$$
where $V^{(\alpha)}_{t,s}$ denotes the $100(1 - 1/t^{\alpha})$th percentile value of the t-distribution with $s-2$ degrees of freedom and $\hat{\nu}_{s,i}$ is an unbiased estimator of the variance of $\hat{\mu}^c_{s,i}$.

We prove Lemma 1 and Lemma 2 using regression theory and control variate theory results (Nelson, 1990). The detailed proofs are given in Appendix B.
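For concreteness, here is a minimal sketch of how the quantities in Eqs. (2)-(4) and Lemmas 1-2 could be computed from the $s$ (reward, CV) pairs of a single arm; NumPy and SciPy are assumed, and the code mirrors the formulas above (including the reconstructed form of $Z_{s,i}$) rather than any released implementation.

```python
import numpy as np
from scipy import stats

def cv_estimate(x, w, omega, t, alpha=2.0):
    """Linear control variate estimate of one arm's mean reward.

    x, w  : arrays of the arm's s reward / CV samples
    omega : known mean of the CV
    Returns (mu_hat_c, nu_hat, radius) following Eqs. (3)-(4) and Lemmas 1-2.
    """
    s = len(x)
    assert s >= 3, "need at least Q = 3 samples so that sigma_c^2(s) is defined"
    mu_hat, w_centered = x.mean(), w - omega

    # Eq. (4): estimated optimal coefficient beta*.
    beta_hat = np.sum((x - mu_hat) * w_centered) / np.sum(w_centered**2)

    # Eq. (3): control-variate adjusted mean estimate.
    mu_hat_c = mu_hat + beta_hat * (omega - w.mean())

    # Lemma 1: unbiased variance estimate of mu_hat_c.
    x_tilde = x + beta_hat * (omega - w)
    sigma2_c = np.sum((x_tilde - mu_hat_c) ** 2) / (s - 2)
    z = 1.0 + w_centered.sum() ** 2 / (s * np.sum(w_centered**2))
    nu_hat = z * sigma2_c / s

    # Lemma 2: confidence radius V^(alpha)_{t,s} * sqrt(nu_hat), where V is the
    # 100(1 - 1/t^alpha)-th percentile of the t-distribution with s - 2 dof.
    v = stats.t.ppf(1.0 - 1.0 / t**alpha, df=s - 2)
    return mu_hat_c, nu_hat, v * np.sqrt(nu_hat)
```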
The proof uses the fact that the arm's rewards and CVs have a multivariate normal distribution. Therefore, we can use the t-distribution to design confidence intervals for the mean reward estimators. Analogous to the Hoeffding inequality, this result shows that the probability of the estimate obtained from the samples $\{\tilde{X}_{t,i}\}$ deviating from the true mean of arm $i$ decays fast. Further, the deviation factor $V^{(\alpha)}_{t,s}\sqrt{\hat{\nu}_{s,i}}$ depends on the variance of the estimator $\hat{\mu}^c_{s,i}$, which guarantees sharper confidence intervals. As we will see later in Lemma 3, these confidence terms are smaller by a factor $(1-\rho_i^2)$ compared to the case when no CVs are used. Equipped with these results, we next develop an Upper Confidence Bound (UCB) based algorithm for the MAB-CV problem.

4.1 Algorithm: UCB-CV

Let $N_i(t)$ be the number of times arm $i$ has been selected by the learner until round $t$ and let $\hat{\nu}_{N_i(t),i}$ be the unbiased sample variance of the mean reward estimator $\hat{\mu}^c_{N_i(t),i}$ of arm $i$. Motivated by Lemma 2, we define the optimistic upper bound for the mean reward estimate of arm $i$ as follows:
$$\mathrm{UCB}_{t,i} = \hat{\mu}^c_{N_i(t),i} + V^{(\alpha)}_{t,N_i(t)}\sqrt{\hat{\nu}_{N_i(t),i}}. \tag{5}$$
Using the above values as the UCB indices of the arms, we develop an algorithm named UCB-CV for the MAB-CV problem. The algorithm works as follows. It takes the number of arms $K$, a constant $Q = 3$ (the number of CVs per arm plus 2), and $\alpha > 1$ (which trades off exploration and exploitation) as input. Each arm is played $Q$ times to ensure that the sample variance of the observations, $\hat{\sigma}^2_{c,i}(s)$, can be computed (see Lemma 1). In round $t$, UCB-CV computes the upper confidence bound of each arm's mean reward estimate using Eq. (5) and selects the arm with the highest upper confidence bound value. We denote the selected arm by $I_t$. After playing arm $I_t$, the reward $X_{t,I_t}$ and the associated control variate $W_{t,I_t}$ are observed. After that, the value of $N_{I_t}(t)$ is updated and $\hat{\beta}^\star_{N_{I_t}(t),I_t}$, $\hat{\mu}^c_{N_{I_t}(t),I_t}$, and $\hat{\nu}_{N_{I_t}(t),I_t}$ are re-estimated. The same process is repeated in subsequent rounds.

UCB-CV: UCB based algorithm for the MAB-CV problem
1: Input: $K$, $Q$, $\alpha > 1$
2: Play each arm $i \in [K]$ $Q$ times
3: for $t = QK + 1, QK + 2, \ldots$ do
4:   For all $i \in [K]$: compute $\mathrm{UCB}_{t-1,i}$ as given in Eq. (5)
5:   Select $I_t = \arg\max_{i\in[K]} \mathrm{UCB}_{t-1,i}$
6:   Play arm $I_t$ and observe $X_{t,I_t}$ and the associated control variate $W_{t,I_t}$. Increment the value of $N_{I_t}(t)$ by one and re-estimate $\hat{\beta}^\star_{N_{I_t}(t),I_t}$, $\hat{\mu}^c_{N_{I_t}(t),I_t}$, and $\hat{\nu}_{N_{I_t}(t),I_t}$
7: end for

4.2 Estimator with Multiple Control Variates

In some applications, each arm may be associated with multiple CVs. We denote the number of CVs per arm by $q$. Let $W_{t,i,j}$ be the $j$th control variate of arm $i$ observed in round $t$. Then the unbiased mean reward estimator for arm $i$ with the associated CVs is given by $\hat{\mu}^c_{s,i,q} = \hat{\mu}_{s,i} + \hat{\boldsymbol{\beta}}_i^{\star\top}(\boldsymbol{\omega}_i - \hat{\boldsymbol{\omega}}_{s,i})$, where $\hat{\boldsymbol{\beta}}^\star_i = (\hat{\beta}_{i,1}, \ldots, \hat{\beta}_{i,q})^\top$, $\boldsymbol{\omega}_i = (\omega_{i,1}, \ldots, \omega_{i,q})^\top$, and $\hat{\boldsymbol{\omega}}_{s,i} = (\hat{\omega}_{s,i,1}, \ldots, \hat{\omega}_{s,i,q})^\top$. Let $s$ be the number of reward and associated CV samples of arm $i$, let $\boldsymbol{W}_i$ be the $s \times q$ matrix whose $r$th row is $(W_{r,i,1}, W_{r,i,2}, \ldots, W_{r,i,q})$, and let $\boldsymbol{X}_i = (X_{1,i}, \ldots, X_{s,i})^\top$. By extending the arguments used in Eq. (4) to $q$ control variates, the estimated coefficient vector is given by
$$\hat{\boldsymbol{\beta}}^\star_i = \left(\boldsymbol{W}_i^\top\boldsymbol{W}_i - s\hat{\boldsymbol{\omega}}_{s,i}\hat{\boldsymbol{\omega}}_{s,i}^\top\right)^{-1}\left(\boldsymbol{W}_i^\top\boldsymbol{X}_i - s\hat{\boldsymbol{\omega}}_{s,i}\hat{\mu}_{s,i}\right).$$
We can generalize Lemma 1 and Lemma 2 to the MAB-CV problem with $q$ control variates and then use UCB-CV with $Q = q + 2$ and the appropriate optimistic upper bound for the multiple control variate case. More details can be found in Appendix C.
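The following is a minimal, self-contained sketch of the UCB-CV loop for the single-CV case ($q = 1$); the `pull` sampling interface, the use of SciPy for the t-distribution percentile, and the helper names are illustrative assumptions rather than the authors' reference implementation.

```python
import numpy as np
from scipy import stats

def ucb_index(x, w, omega, t, alpha=2.0):
    """UCB index of Eq. (5) for one arm from its observed (reward, CV) pairs."""
    s = len(x)
    wc = w - omega
    beta = np.sum((x - x.mean()) * wc) / np.sum(wc**2)          # Eq. (4)
    mu_c = x.mean() + beta * (omega - w.mean())                  # Eq. (3)
    x_tilde = x + beta * (omega - w)
    sigma2_c = np.sum((x_tilde - mu_c) ** 2) / (s - 2)           # Lemma 1
    z = 1.0 + wc.sum() ** 2 / (s * np.sum(wc**2))
    nu = z * sigma2_c / s
    v = stats.t.ppf(1.0 - 1.0 / t**alpha, df=s - 2)              # Lemma 2 percentile
    return mu_c + v * np.sqrt(nu)

def ucb_cv(pull, omega, K, T, Q=3, alpha=2.0):
    """pull(i) -> (reward, cv) sample of arm i; omega[i] is the known CV mean."""
    xs = [[] for _ in range(K)]
    ws = [[] for _ in range(K)]
    # Initialization: play each arm Q times so the sample variance in Lemma 1 exists.
    for i in range(K):
        for _ in range(Q):
            x, w = pull(i)
            xs[i].append(x); ws[i].append(w)
    choices = []
    for t in range(Q * K + 1, T + 1):
        idx = [ucb_index(np.array(xs[i]), np.array(ws[i]), omega[i], t - 1, alpha)
               for i in range(K)]
        arm = int(np.argmax(idx))           # select the arm with the largest index
        x, w = pull(arm)                    # observe reward and associated CV
        xs[arm].append(x); ws[arm].append(w)
        choices.append(arm)
    return choices
```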
4.3 Analysis of UCB-CV

In our next result, we describe the property of our estimator that uses control variates. This result is derived from standard results of control variates theory (Nelson, 1990).

Lemma 3. Let the rewards and control variates have a multivariate normal distribution and let $q$ be the number of control variates. Then, after observing $s$ samples of the rewards and associated control variates of arm $i$,
$$\mathbb{E}\left[\hat{\mu}^c_{s,i,q}\right] = \mu_i, \quad \text{and} \quad \mathrm{Var}\left(\hat{\mu}^c_{s,i,q}\right) = \frac{s-2}{s-q-2}\left(1-\rho_i^2\right)\mathrm{Var}(\hat{\mu}_{s,i}),$$
where $\rho_i^2 = \boldsymbol{\sigma}_{X_iW_i}\boldsymbol{\Sigma}^{-1}_{W_iW_i}\boldsymbol{\sigma}^\top_{X_iW_i}/\sigma_i^2$ is the square of the multiple correlation coefficient, $\sigma_i^2 = \mathrm{Var}(X_i)$, and $\boldsymbol{\sigma}_{X_iW_i} = (\mathrm{Cov}(X_i, W_{i,1}), \ldots, \mathrm{Cov}(X_i, W_{i,q}))$.

The following is our main result, which gives the regret upper bound of UCB-CV. The detailed analysis is given in Appendix D.

Theorem 1. Let the conditions in Lemma 3 hold, $\alpha = 2$, $\Delta_i = \mu_{i^\star} - \mu_i$ be the sub-optimality gap of arm $i \in [K]$, and $N_i(T)$ be the number of times sub-optimal arm $i$ is selected in $T$ rounds. Let $C_{T,i,q} \ge \frac{N_i(T)-2}{N_i(T)-q-2}\left(\frac{V^{(2)}_{T,N_i(T),q}}{V^{(2)}_{T,T,q}}\right)^2$ for all $i$. Then the regret of UCB-CV in $T$ rounds is upper bounded as
$$R_T \le \sum_{i \ne i^\star}\left(\frac{4\left(V^{(2)}_{T,T,q}\right)^2 C_{T,i,q}\left(1-\rho_i^2\right)\sigma_i^2}{\Delta_i} + \frac{\Delta_i\pi^2}{3}\right).$$

Note that $\alpha$ is a hyper-parameter that trades off exploration and exploitation. UCB-CV works as long as $\alpha > 1$, and the effect of $\alpha$ on the regret bound is through the term $\sum_{t=1}^{\infty} 1/t^{\alpha}$. This term has a nice closed-form value of $\pi^2/6$ for $\alpha = 2$. Hence, we state the results with $\alpha = 2$ and use it in the experiments as well.

Remark 1. Unfortunately, we cannot directly compare our bound with that of UCB based stochastic MAB algorithms like UCB1-NORMAL and UCBV, as $V^{(2)}_{T,T,q}$ does not have a closed-form expression. $V^{(2)}_{T,T,q}$ denotes the $100(1-1/T^2)$th percentile of the t-distribution with $T-q-1$ degrees of freedom.

Remark 2. As the degrees of freedom decrease with $q$, $V^{(2)}_{T,T,q}$ increases in $q$ (see Fig. 1a for small values of $T$ and Fig. 1b for large values of $T$). However, this does not imply that the overall regret increases in $q$, as $\rho_i^2$ also increases in $q$. Hence, the dependence of the regret on $q$ is delicate. It is also observed in Nelson (1989) that having more CVs does not necessarily mean better variance reduction. One can empirically verify that $(V^{(2)}_{T,T,q})^2$ is upper bounded by $3.726\log T$. This upper bound holds for all $q$ as long as $T \ge q + \max(32, 0.7q)$.

Figure 1: Properties of $(V^{(2)}_{T,T,q})^2$ and $V^{(2)}_{T,N_i(T),q}/V^{(2)}_{T,T,q}$. (a) $(V^{(2)}_{T,T,q})^2$ vs $3.726\log(T)$ for small $T$. (b) $(V^{(2)}_{T,T,q})^2$ vs $3.726\log(T)$ for large $T$. (c) Variation of $V^{(2)}_{T,N_i(T),q}/V^{(2)}_{T,T,q}$ with $N_i(T)$.

Remark 3. It is hard to bound $C_{T,i,q}$ as $V^{(2)}_{T,\cdot,q}$ does not have an explicit form. However, $C_{T,i,q}$ tends to 1 as $T \to \infty$. Indeed, from the lower bound argument, we know that $N_i(T) \to \infty$ as $T \to \infty$, and hence $\frac{N_i(T)-2}{N_i(T)-q-2} \to 1$ and $\frac{V^{(2)}_{T,N_i(T),q}}{V^{(2)}_{T,T,q}} \to 1$ (confidence intervals shrink with more samples), as shown in Fig. 1c. However, under some assumptions, we can obtain a bound empirically. Specifically, if $q \le 20$ and $N_i(T) > 50$, then $\frac{N_i(T)-2}{N_i(T)-q-2} \le 1.72$ and, from t-distribution tables, $\frac{V^{(2)}_{T,N_i(T),q}}{V^{(2)}_{T,T,q}} \le 1.5$ (see Fig. 1c); hence the bound $C_{T,i,q} \le 4$. We can guarantee $N_i(T) > 50$ by playing each arm 50 times at the beginning. Therefore, we can obtain one possible explicit bound in Lemma 6 by replacing $C_{T,i,q}$ with 4. However, for the general case, an explicit bound in Lemma 8 is hard.
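The empirical bound in Remark 2 is easy to check numerically: the short script below computes $V^{(2)}_{T,T,q}$ as the $100(1-1/T^2)$th percentile of the t-distribution with $T-q-1$ degrees of freedom (SciPy is assumed) and compares its square against $3.726\log T$ over a few illustrative values of $T$ and $q$.

```python
import numpy as np
from scipy import stats

def v_squared(T, q):
    """(V^(2)_{T,T,q})^2: squared 100(1 - 1/T^2)-th t-percentile with T - q - 1 dof."""
    return stats.t.ppf(1.0 - 1.0 / T**2, df=T - q - 1) ** 2

for q in (1, 5, 20):
    for T in (100, 1_000, 10_000):
        if T >= q + max(32, 0.7 * q):          # range where the bound is claimed
            lhs, rhs = v_squared(T, q), 3.726 * np.log(T)
            print(f"q={q:2d} T={T:6d}  (V^2)={lhs:6.2f}  3.726*log(T)={rhs:6.2f}  "
                  f"bound holds: {lhs <= rhs}")
```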
5 General Distribution

We now consider the general case with no distributional assumptions on the rewards and associated CVs of the arms. The samples $\tilde{X}_{t,i,q}$ resulting from the linear combination of rewards and CVs are dependent but need not be normally distributed. Hence, $\hat{\mu}^c_{s,i,q}$ need not remain an unbiased estimator. Therefore, we cannot use the properties of the t-distribution to obtain confidence intervals. However, we can use resampling methods such as jackknifing, splitting, and batching to reduce the bias of the estimators and develop confidence intervals that hold approximately or asymptotically. These confidence intervals can then be used to define indices for the arms. Below we briefly discuss these methods.

5.1 Jackknifing

In statistics, jackknifing is a well-known resampling technique for variance and bias estimation (Efron, 1982; Efron and Gong, 1983). We start with the classical presentation of jackknifing for CVs (Lavenberg et al., 1982). Let $\hat{\mu}^{c,-j}_{s,i,q}$ be the estimate of the mean reward computed without the sample $\tilde{X}_j$. The quantity $Y^J_{j,i,q} = s\hat{\mu}^c_{s,i,q} - (s-1)\hat{\mu}^{c,-j}_{s,i,q}$ is sometimes called the $j$th pseudo-value. Using the pseudo-values, the mean reward estimate is given by $\hat{\mu}^{c,J}_{s,i,q} = \frac{1}{s}\sum_{j=1}^{s} Y^J_{j,i,q}$ and its sample variance is $\hat{\nu}^J_{s,i,q} = (s(s-1))^{-1}\sum_{j=1}^{s}(Y^J_{j,i,q} - \hat{\mu}^{c,J}_{s,i,q})^2$. The estimator $\hat{\mu}^{c,J}_{s,i,q}$ is shown to be unbiased (Nelson, 1989, Thm. 7) when the reward variance is bounded. Further, $\hat{\mu}^{c,J}_{s,i,q} \pm t_{\alpha/2}(s-1)\sqrt{\hat{\nu}^J_{s,i,q}}$ is an asymptotically valid confidence interval (Avramidis et al., 1991) and holds approximately for a finite number of samples.

5.2 Splitting

Splitting is a technique that splits the correlated observations into two or more groups, computes an estimate of $\beta^\star_i$ from each group, and then exchanges the estimates among the groups. Nelson (1990) considers an extreme form of splitting, where the $s$ observations are split into $s$ groups. The $j$th observation is given by $Y^S_{j,i,q} = X_{j,i} + \hat{\boldsymbol{\beta}}_i^{-j\top}(\boldsymbol{\omega}_i - \boldsymbol{W}_{j,i}),\ j \in [s]$, where $\hat{\boldsymbol{\beta}}_i^{-j}$ is estimated without the $j$th observation and $\boldsymbol{W}_{j,i} = (W_{j,i,1}, \ldots, W_{j,i,q})^\top$ is the vector of the $q$ CVs associated with the reward $X_{j,i}$. The point estimator for the splitting method is $\hat{\mu}^{c,S}_{s,i,q} = \frac{1}{s}\sum_{j=1}^{s} Y^S_{j,i,q}$ and its sample variance is $\hat{\nu}^S_{s,i,q} = (s(s-1))^{-1}\sum_{j=1}^{s}(Y^S_{j,i,q} - \hat{\mu}^{c,S}_{s,i,q})^2$. Then $\hat{\mu}^{c,S}_{s,i,q} \pm t_{\alpha/2}(s-1)\sqrt{\hat{\nu}^S_{s,i,q}}$ gives an approximate confidence interval (Nelson, 1990).

Below we define UCB indices for these methods. Let $\hat{\nu}^G_{N_i(t),i,q}$ be the sample variance of the mean reward estimator $\hat{\mu}^{c,G}_{s,i,q}$ of arm $i$. We define the optimistic upper bound on the mean reward for $G \in \{J, S\}$ as follows:
$$\mathrm{UCB}^G_{t,i,q} = \hat{\mu}^{c,G}_{N_i(t),i,q} + V^{G,\alpha}_{t,N_i(t),q}\sqrt{\hat{\nu}^G_{N_i(t),i,q}}, \tag{6}$$
where $V^{G,\alpha}_{t,s,q}$ is the $100(1-1/t^{\alpha})$th percentile value of the t-distribution with $s-1$ degrees of freedom. Since the optimistic upper bounds defined in Eq. (6) are valid only asymptotically (Nelson, 1990; Avramidis et al., 1991), they cannot be used for any finite-time regret guarantee. However, the above UCB indices can be used in UCB-CV to obtain a heuristic algorithm for the general case. We experimentally validate its performance in the next section.
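A minimal sketch of the jackknife and splitting point estimators for the single-CV case ($q = 1$); NumPy is assumed, the leave-one-out estimates are recomputed naively for clarity rather than efficiency, and the helper mirrors Eqs. (3)-(4) rather than any released implementation.

```python
import numpy as np

def cv_point_estimate(x, w, omega):
    """Linear CV estimate of the mean (Eqs. (3)-(4)) for one arm, single CV."""
    wc = w - omega
    beta = np.sum((x - x.mean()) * wc) / np.sum(wc**2)
    return x.mean() + beta * (omega - w.mean())

def jackknife_estimate(x, w, omega):
    """Jackknife pseudo-value estimator and its sample variance (Section 5.1)."""
    s = len(x)
    full = cv_point_estimate(x, w, omega)
    loo = np.array([cv_point_estimate(np.delete(x, j), np.delete(w, j), omega)
                    for j in range(s)])
    pseudo = s * full - (s - 1) * loo                    # j-th pseudo-value Y^J_j
    mu_j = pseudo.mean()
    nu_j = np.sum((pseudo - mu_j) ** 2) / (s * (s - 1))
    return mu_j, nu_j

def splitting_estimate(x, w, omega):
    """Splitting estimator: the beta used for observation j is fit without it (Section 5.2)."""
    s = len(x)
    ys = []
    for j in range(s):
        xj, wj = np.delete(x, j), np.delete(w, j)
        wc = wj - omega
        beta_minus_j = np.sum((xj - xj.mean()) * wc) / np.sum(wc**2)
        ys.append(x[j] + beta_minus_j * (omega - w[j]))  # Y^S_j
    ys = np.array(ys)
    mu_s = ys.mean()
    nu_s = np.sum((ys - mu_s) ** 2) / (s * (s - 1))
    return mu_s, nu_s
```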
6 Experiments

We empirically evaluate the performance of UCB-CV by comparing it with UCB1 (Auer et al., 2002), UCB-V (Audibert et al., 2009b), Thompson Sampling (Agrawal and Goyal, 2013), and EUCBV (Mukherjee et al., 2018) on different synthetically generated problem instances. For all instances, we use $K = 10$, $q = 1$, and $\alpha = 2$. All experiments are repeated 100 times, and the cumulative regret with a 95% confidence interval is shown (the vertical line on each curve shows the confidence interval). Details of each instance are as follows:

Instance 1: The rewards and associated CVs of this instance have a multivariate normal distribution. The reward of each arm has two components, and we treat one of the components as the CV. In round $t$, the reward of arm $i$ is given by $X_{t,i} = V_{t,i} + W_{t,i}$, where $V_{t,i} \sim \mathcal{N}(\mu_{v,i}, \sigma^2_{v,i})$ and $W_{t,i} \sim \mathcal{N}(\mu_{w,i}, \sigma^2_{w,i})$. Therefore, $X_{t,i} \sim \mathcal{N}(\mu_{v,i} + \mu_{w,i}, \sigma^2_{v,i} + \sigma^2_{w,i})$. We treat $W_i$ as the CV of $X_i$. It can easily be shown that the correlation coefficient of $X_i$ and $W_i$ is $\rho_i = \sqrt{\sigma^2_{w,i}/(\sigma^2_{v,i} + \sigma^2_{w,i})}$. For each arm $i \in [K]$, we set $\mu_{v,i} = 0.6 - (i-1)\cdot 0.05$ and $\mu_{w,i} = 0.3$, with $\sigma^2_{v,i} = 0.1$ and $\sigma^2_{w,i} = 0.1$ for all arms. Note that this problem instance has the same CV for all arms, but the CV observations of each arm are maintained separately.

Instance 2: The same as Instance 1, except each arm has a different CV associated with its reward. The mean value of the CV associated with arm $i$ is set to $\mu_{w,i} = 0.8 - (i-1)\cdot 0.05$.

Instance 3: The same as Instance 2, except the samples of the rewards and associated control variates are generated from a Gamma distribution with the scale parameter set to 1.

Instance 4: The same as Instance 2, except the samples of the rewards and associated control variates are generated from a Log-normal distribution with $\sigma^2_{v,i} = 1$ and $\sigma^2_{w,i} = 1$ for all arms.

Comparing regret of UCB-CV with existing algorithms. The regret of the different algorithms for Instances 1 and 2 is shown in Fig. 2a and Fig. 2b, respectively. As seen, UCB-CV outperforms all the other algorithms. We observe that Thompson Sampling has a large regret for normally distributed rewards and CVs; hence, we have not included the regret of Thompson Sampling. Note that UCB-CV does not require a bounded support assumption, which is needed by all the other algorithms. The regret of the EUCBV algorithm is closest to that of UCB-CV, but EUCBV can get stuck on a sub-optimal arm as it uses an arm elimination based strategy, which has a small probability of eliminating the optimal arm.

Figure 2: Regret comparison of different multi-armed bandits algorithms with UCB-CV. (a) Instance 1. (b) Instance 2. (c) Varying correlation coefficient.

Regret of UCB-CV vs. correlation coefficient. Theorem 1 shows that UCB-CV has better regret bounds when the correlation between the arm rewards and CVs is higher. To validate this, we derived problem instances with different correlation coefficients as follows:

Instance 5: Similar to Instance 1, the rewards and associated CVs have a multivariate normal distribution, with the rewards expressed as the sum of two components. We set $\mu_{v,i} = 6.0 - (i-1)\cdot 0.5$ and $\mu_{w,i} = 4.0$ for arm $i \in [K]$, with $\sigma^2_{v,i} = 1.0$ and $\sigma^2_{w,i} = 1.0$ for all arms. The problem instance has a common CV for all arms, but the observations are maintained for each arm separately. As the correlation coefficient of $X_i$ and $W_i$ is $\rho_i = \sqrt{\sigma^2_{w,i}/(\sigma^2_{v,i} + \sigma^2_{w,i})}$, we varied $\sigma^2_{v,i}$ over the values $\{1.0, 1.5, 2.0, 2.5, 3.0\}$ to obtain problem instances with different correlation coefficients. Increasing $\sigma^2_{v,i}$ reduces the correlation coefficient of $X_i$ and $W_i$.

We plot the regret of UCB-CV for these different problem instances. As expected, we observe that the regret decreases as the correlation coefficient between the reward and its associated control variate increases, as shown in Fig. 2c.
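As an illustration, instances of this form can be generated with a small helper that also shows how varying $\sigma^2_{v,i}$ controls the correlation coefficient; the sampler interface and function names are assumptions for this sketch, while the parameter values are the ones stated for Instances 1 and 5.

```python
import numpy as np

def make_instance(K, mu_v0, step, mu_w, var_v, var_w, seed=0):
    """Gaussian two-component instance: X = V + W, with W used as the CV."""
    rng = np.random.default_rng(seed)
    mu_v = mu_v0 - step * np.arange(K)
    omega = np.full(K, mu_w)                       # known CV means
    rho = np.sqrt(var_w / (var_v + var_w))         # correlation of X_i and W_i

    def pull(i):
        v = rng.normal(mu_v[i], np.sqrt(var_v))
        w = rng.normal(omega[i], np.sqrt(var_w))
        return v + w, w                            # (reward, observed CV)

    return pull, omega, mu_v + omega, rho

# Instance 1: mu_v,i = 0.6 - 0.05(i-1), mu_w,i = 0.3, variances 0.1 each.
pull1, omega1, mu1, rho1 = make_instance(10, 0.6, 0.05, 0.3, 0.1, 0.1)

# Instance 5 variants: increasing sigma_v^2 from 1.0 to 3.0 lowers the correlation.
for var_v in (1.0, 1.5, 2.0, 2.5, 3.0):
    _, _, _, rho = make_instance(10, 6.0, 0.5, 4.0, var_v, 1.0)
    print(f"sigma_v^2 = {var_v:.1f}  ->  rho = {rho:.3f}")
```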
Performance of jackknifing, splitting, and batching methods. We compare the regret of UCB-CV with the jackknifing, splitting, and batching based methods (see Appendix E for more details). The regret of the batching method is worse on all instances, whereas jackknifing and splitting perform well for the heavy-tailed distribution (Instance 3), and UCB-CV performs better for the normal (Instance 2) and Log-normal (Instance 4) distributions, as shown in Fig. 3.

Figure 3: Comparing the regret of UCB-CV with the jackknifing, splitting, and batching based methods. (a) Instance 2. (b) Instance 3. (c) Instance 4.

Regret of UCB-CV vs. estimated mean of CV. Our work has demonstrated the applicability of CVs to bandit theory. To establish the gains analytically, we have to assume that the means of the CVs are known. However, this is not unreasonable, as CVs can often be constructed such that their mean value is known (see Eq. (7) and Eq. (8) of Kreutzer et al. (2017)). If the mean of a CV is unknown, we can estimate it from samples, but using the estimated/approximated means can deteriorate the performance of the proposed algorithms. The theory of control variates is well developed for the known-mean case, and only empirical studies are available for the case when the CV means are known approximately (Schmeiser et al., 2001; Pasupathy et al., 2012). To study the effect of an approximation error $\varepsilon$ in the mean estimation of the CVs, we ran an experiment with Instance 1, Instance 2, and Instance 5, assuming that the approximated mean of the CVs is given by $\omega_i + \varepsilon$. We observe that the regret of UCB-CV increases with an increase in the approximation error. UCB-CV can even start performing worse than the existing state-of-the-art algorithms for sufficiently large approximation errors, as shown in Fig. 4. Since the gap between the maximum and minimum rewards can be more than one for Instance 5, we multiply the confidence intervals of UCB1, UCB-V, and EUCBV by an appropriate factor (6 in the experiment) for a fair comparison. We also observe that Thompson Sampling has almost linear regret for Instance 5. We believe this is due to the high variance of the arm rewards, which leads to overlapping reward distributions.

Figure 4: Regret comparison of UCB-CV with varying approximation error in the mean estimation of control variates against different multi-armed bandits algorithms. (a) Instance 1. (b) Instance 2. (c) Instance 5.

Our experimental results demonstrate that as long as the error in the estimation of the means of the CVs is within a limit, there is an advantage to using control variates. The impact of such approximations in bandits needs to be analyzed more extensively and demands independent work.

7 Conclusion

In this work, we studied the stochastic multi-armed bandits problem with side information available in the form of Control Variates (CVs). Leveraging the variance reduction properties of linear control variates, we developed a variant of the UCB algorithm named UCB-CV. When the rewards and CVs have a multivariate normal distribution, we showed that the regret of UCB-CV is of the order $O(\sigma^2(1-\rho^2)\log T/\Delta)$, where $\sigma^2$ is the maximum variance of the arm rewards, $\rho$ is the minimum correlation between the arm rewards and CVs, and $\Delta$ is the smallest sub-optimality gap. As expected, our bounds show that when the correlation is strong, one can get a significant improvement in regret performance. For the case where the rewards and control variates follow a general distribution, we discussed jackknifing and splitting resampling methods that give estimators with smaller bias and sharper confidence bounds. The variance reduction techniques based on CVs assume that the exact means of the CVs are known. In practice, only rough estimates of the means of the CVs, rather than the exact values, may be available.
It would be interesting to study how this affects the performance of the bandit algorithms. Another interesting direction is to study how CVs can help when one has to rank the arms instead of just finding the top-ranked arm.

Acknowledgements

Manjesh K. Hanawal is supported by the INSPIRE faculty fellowship from DST and the Early Career Research Award (ECR/2018/002953) from SERB, Govt. of India. This work was done when Arun Verma was at IIT Bombay.

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1–39.26, 2012.

Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.

Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009a.

Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009b.

Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, pages 235–256, 2002.

Athanassios N Avramidis, Kenneth W Bauer Jr, and James R Wilson. Simulation of stochastic activity networks using path control variates. Naval Research Logistics (NRL), 38(2):183–201, 1991.

Zdravko Botev and Ad Ridder. Variance reduction. Wiley StatsRef: Statistics Reference Online, pages 1–6, 2014.

Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.

Yutian Chen and Zoubin Ghahramani. Scalable discrete sampling as a multi-armed bandit problem. In International Conference on Machine Learning, pages 2492–2501. PMLR, 2016.

Wei Chu, Lihong Li, Lev Reyzin, and Robert E Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.

Richard W Conway. Some tactical problems in digital simulation. Management Science, 10(1):47–61, 1963.

Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355–366, 2008.

Bradley Efron. The jackknife, the bootstrap and other resampling plans. SIAM, 1982.

Bradley Efron and Gail Gong. A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician, 37(1):36–48, 1983.

Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.

Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference On Learning Theory, pages 359–376, 2011.

F Hayashi. Econometrics. Princeton University Press, 2000.
Junya Honda and Akimichi Takemura. An asymptotically optimal bandit algorithm for bounded support models. In COLT, pages 67–79. Citeseer, 2010.

BAP James. Variance reduction techniques. Journal of the Operational Research Society, 36(6):525–530, 1985.

Kwang-Sung Jun, Aniruddha Bhargava, Robert Nowak, and Rebecca Willett. Scalable generalized linear bandits: Online computation and hashing. In Advances in Neural Information Processing Systems, pages 99–109, 2017.

Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213. Springer, 2012.

Julia Kreutzer, Artem Sokolov, and Stefan Riezler. Bandit structured prediction for neural sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1503–1513, 2017.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020. doi: 10.1017/9781108571401.

Stephen S Lavenberg and Peter D Welch. A perspective on the use of control variables to increase the efficiency of Monte Carlo simulations. Management Science, 27(3):322–335, 1981.

Stephen S Lavenberg, Thomas L Moeller, and Peter D Welch. Statistical results on control variables with application to queueing network simulation. Operations Research, 30(1):182–202, 1982.

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.

Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pages 2071–2080, 2017.

Subhojyoti Mukherjee, KP Naveen, Nandan Sudarsanam, and Balaraman Ravindran. Efficient-UCBV: An almost optimal algorithm using variance estimates. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Barry L Nelson. Batch size effects on the efficiency of control variates in simulation. European Journal of Operational Research, 43(2):184–196, 1989.

Barry L Nelson. Control variate remedies. Operations Research, 38(6):974–992, 1990.

Raghu Pasupathy, Bruce W Schmeiser, Michael R Taaffe, and Jin Wang. Control-variate estimation using estimated control means. IIE Transactions, 44(5):381–385, 2012.

Vianney Perchet and Philippe Rigollet. The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693–721, 2013.

Paat Rusmevichientong and John N Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

Bruce Schmeiser. Batch size effects in the analysis of simulation output. Operations Research, 30(3):556–568, 1982.

Bruce W Schmeiser, Michael R Taaffe, and Jin Wang. Biased control-variate estimation. IIE Transactions, 33(3):219–228, 2001.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Sara A Van De Geer. Least squares estimation. Encyclopedia of Statistics in Behavioral Science, 2005.

Nikos Vlassis, Aurelien Bibaut, Maria Dimakopoulou, and Tony Jebara. On the design of estimators for bandit off-policy evaluation. In International Conference on Machine Learning, pages 6468–6476. PMLR, 2019.
Lijun Zhang, Tianbao Yang, Rong Jin, Yichi Xiao, and Zhi-Hua Zhou. Online stochastic linear optimization under one-bit feedback. In International Conference on Machine Learning, pages 392–401, 2016.