# Continuous Mean-Covariance Bandits

Yihan Du (IIIS, Tsinghua University, Beijing, China; duyh18@mails.tsinghua.edu.cn), Siwei Wang (CST, Tsinghua University, Beijing, China; wangsw2020@mail.tsinghua.edu.cn), Zhixuan Fang (IIIS, Tsinghua University, Beijing, China; Shanghai Qi Zhi Institute, Shanghai, China; zfang@mail.tsinghua.edu.cn), Longbo Huang (IIIS, Tsinghua University, Beijing, China; longbohuang@mail.tsinghua.edu.cn).

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

## Abstract

Existing risk-aware multi-armed bandit models typically focus on risk measures of individual options, such as variance. As a result, they cannot be directly applied to important real-world online decision-making problems with correlated options. In this paper, we propose a novel Continuous Mean-Covariance Bandit (CMCB) model that explicitly takes option correlation into account. Specifically, in CMCB, a learner sequentially chooses weight vectors on given options and observes random feedback according to its decisions. The learner's objective is to achieve the best trade-off between reward and risk, measured with option covariance. To capture different reward observation scenarios in practice, we consider three feedback settings, i.e., full-information, semi-bandit and full-bandit feedback. We propose novel algorithms with optimal regrets (within logarithmic factors), and provide matching lower bounds to validate their optimality. The experimental results also demonstrate the superiority of our algorithms. To the best of our knowledge, this is the first work that considers option correlation in risk-aware bandits and explicitly quantifies how arbitrary covariance structures impact the learning performance. The novel analytical techniques we develop, for exploiting the estimated covariance to build concentration results and for bounding the risk of selected actions based on properties of the sampling strategy, can likely find applications in other bandit analyses and are of independent interest.

## 1 Introduction

The stochastic Multi-Armed Bandit (MAB) [3, 28, 2] problem is a classic online learning model, which characterizes the exploration-exploitation trade-off in decision making. Recently, due to the increasing requirements of risk guarantees in practical applications, Mean-Variance Bandits (MVB) [26, 30, 34], which aim at balancing rewards and performance variances, have received extensive attention. While MVB provides a successful risk-aware model, it only considers a discrete decision space and focuses on the variances of individual arms (assuming independence among arms). However, in many real-world scenarios, a decision often involves multiple options with a certain correlation structure, which can heavily influence risk management and cannot be ignored. For instance, in finance, investors can select portfolios over multiple correlated assets, and the investment risk is closely related to the correlation among the chosen assets. The well-known risk diversification strategy [4] embodies the importance of correlation to investment decisions. In clinical trials, a treatment often consists of different drugs with certain ratios, and the correlation among drugs plays an important role in the treatment risk. Failing to handle the correlation among multiple options, existing MVB results cannot be directly applied to these important real-world tasks.
Witnessing the above limitation of existing risk-aware results, in this paper we propose a novel Continuous Mean-Covariance Bandit (CMCB) model, which considers a set of options (base arms) with a continuous decision space and measures the risk of decisions with the option correlation. Specifically, in this model, a learner is given $d$ base arms, which are associated with an unknown joint reward distribution with a mean vector and a covariance matrix. At each timestep, the environment generates an underlying random reward for each base arm according to the joint distribution. Then, the learner selects a weight vector over the base arms and observes the rewards. The goal of the learner is to minimize the expected cumulative regret, i.e., the total difference of the reward-risk (mean-covariance) utilities between the chosen actions and the optimal action, where the optimal action is defined as the weight vector that achieves the best trade-off between the expected reward and the covariance-based risk. To capture important observation scenarios in practice, we consider three feedback settings in this model, i.e., full-information (CMCB-FI), semi-bandit (CMCB-SB) and full-bandit (CMCB-FB) feedback, which vary from seeing the rewards of all options, to receiving the rewards of only the selected options, to observing only a weighted sum of rewards.

The CMCB framework finds a wide range of real-world applications, including finance [23], company operation [24] and online advertising [27]. For example, in stock markets, investors choose portfolios based on the observed prices of all stocks (full-information feedback), with the goal of earning high returns while minimizing risk. In company operation, managers allocate investment budgets to several correlated businesses and only observe the returns of the invested businesses (semi-bandit feedback), with the objective of achieving high returns and low risk. In clinical trials, clinicians select a treatment comprised of different drugs and only observe an overall therapeutic effect (full-bandit feedback), where good therapeutic effects and high stability are both desirable.

For both CMCB-FI and CMCB-SB, we propose optimal algorithms (within logarithmic factors), establish matching lower bounds for the problems, and contribute novel techniques for analyzing the risk of chosen actions and exploiting the covariance information. For CMCB-FB, we develop a novel algorithm which adopts a carefully designed action set to estimate the expected rewards and covariance, with non-trivial regret guarantees. Our theoretical results offer an explicit quantification of the influence of arbitrary covariance structures on learning performance, and our empirical evaluations also demonstrate the superior performance of our algorithms.

Our work differs from previous works on bandits with covariance [32, 33, 11, 25] in the following aspects. (i) We consider the reward-risk objective under a continuous decision space and a stochastic environment, while existing works study either combinatorial bandits, where the decision space is discrete and risk is not considered in the objective, or adversarial online optimization. (ii) We do not assume prior knowledge of, or direct feedback on, the covariance matrix as in [32, 33, 11]. (iii) Our results for full-information and full-bandit feedback explicitly characterize the impacts of arbitrary covariance structures, whereas prior results, e.g., [11, 25], only focus on independent or positively-correlated cases.
These differences pose new challenges in algorithm design and analysis, and demand new analytical techniques. We summarize the main contributions as follows.

- We propose a novel risk-aware bandit model called the continuous mean-covariance bandit (CMCB), which considers correlated options with a continuous decision space and characterizes the trade-off between reward and covariance-based risk. Motivated by practical reward observation scenarios, three feedback settings are considered under CMCB, i.e., full-information (CMCB-FI), semi-bandit (CMCB-SB) and full-bandit (CMCB-FB).
- We design an algorithm MC-Empirical for CMCB-FI with an optimal $O(\sqrt{T})$ regret (within logarithmic factors), and develop a novel analytical technique that builds a relationship on risk between the chosen actions and the optimal one using properties of the sampling strategy. We also derive a matching lower bound, by analyzing the gap between hindsight knowledge and the available empirical information under a Bayesian environment.
- For CMCB-SB, we develop MC-UCB, an algorithm that exploits the estimated covariance information to construct confidence intervals and achieves the optimal $O(\sqrt{T})$ regret (up to logarithmic factors). A matching regret lower bound is also established, by investigating the regret that must be paid to differentiate two well-chosen distinct instances.
- We propose a novel algorithm MC-ETE for CMCB-FB, which employs a well-designed action set to carefully estimate the reward means and covariance, and achieves an $O(T^{2/3})$ regret guarantee under this severely limited feedback.

To our best knowledge, our work is the first to explicitly characterize the influence of arbitrary covariance structures on learning performance in risk-aware bandits. Our results shed light on optimal risk management in online decision making with correlated options. Due to space limitations, we defer all detailed proofs to the supplementary material.

## 2 Related Work

(Risk-aware Bandits) Sani et al. [26] initiate the classic mean-variance paradigm [23, 14] in bandits and formulate the mean-variance bandit problem, where the learner plays a single arm each time and the risk is measured by the variances of individual arms. Vakili and Zhao [29, 30] further study this problem under a different metric and complete the regret analysis. Zhu and Tan [34] provide a Thompson Sampling-based algorithm for mean-variance bandits. In addition to variance, several works consider other risk criteria. The VaR measure is studied in [10], and CVaR is also investigated to quantify the risk in [13, 16]. Cassel et al. [5] propose a general risk measure named the empirical distribution performance measure (EDPM) and present an algorithmic framework for EDPM. All existing studies on risk-aware bandits only consider discrete decision spaces and assume independence among arms, and thus they cannot be applied to our CMCB problem.

(Bandits with Covariance) In the stochastic MAB setting, while there have been several works [11, 25] on covariance, they focus on the combinatorial bandit problem without considering risk. Degenne and Perchet [11] study combinatorial semi-bandits with correlation, assume a known upper bound on the covariance, and design an algorithm with this prior knowledge of the covariance. Perrault et al. [25] further investigate this problem without the assumption on covariance under the sub-exponential distribution framework, and propose an algorithm with a tight asymptotic regret analysis.
In the adversarial setting, Warmuth and Kuzmin [32, 33] consider an online variance minimization problem, where at each timestep the learner chooses a weight vector and receives a covariance matrix, and propose exponentiated-gradient-based algorithms. Our work differs from the above works in the following aspects: compared to [11, 25], we consider a continuous decision space instead of a combinatorial space, study the reward-risk objective instead of only maximizing the expected reward, and investigate two more feedback settings beyond semi-bandit feedback. Compared to [32, 33], we consider the stochastic environment, and in our case the covariance cannot be directly observed and needs to be estimated.

## 3 Continuous Mean-Covariance Bandits (CMCB)

Here we present the formulation of the Continuous Mean-Covariance Bandit (CMCB) problem. Specifically, a learner is given $d$ base arms labeled $1, \dots, d$ and a decision (action) space $\mathcal{D} \subseteq \triangle_d$, where $\triangle_d = \{w \in \mathbb{R}^d : 0 \le w_i \le 1, \forall i \in [d], \sum_i w_i = 1\}$ denotes the probability simplex in $\mathbb{R}^d$. The base arms are associated with an unknown $d$-dimensional joint reward distribution with mean vector $\theta^*$ and positive semi-definite covariance matrix $\Sigma^*$, where $\Sigma^*_{ii} \le 1$ for any $i \in [d]$ without loss of generality. For any action $w \in \mathcal{D}$, which can be regarded as a weight vector placed on the base arms, the instantaneous reward-risk utility is given by the following mean-covariance function
$$f(w) = w^\top \theta^* - \rho\, w^\top \Sigma^* w, \qquad (1)$$
where $w^\top \theta^*$ denotes the expected reward, $w^\top \Sigma^* w$ represents the risk, i.e., the reward variance, and $\rho > 0$ is a risk-aversion parameter that controls the weight placed on the risk. We define the optimal action as $w^* = \operatorname{argmax}_{w \in \mathcal{D}} f(w)$. Compared to linear bandits [1, 17], the additional quadratic term in $f(w)$ raises significant challenges in estimating the covariance, bounding the risk of chosen actions and deriving covariance-dependent regret bounds.

At each timestep $t$, the environment generates an underlying (unknown to the learner) random reward vector $\theta_t = \theta^* + \eta_t$ according to the joint distribution, where $\eta_t$ is a zero-mean noise vector that is independent across timesteps $t$. Note that here we consider an additive vector noise on the parameter $\theta^*$, instead of the simpler scalar noise added to the observation (i.e., $y_t = w_t^\top \theta^* + \eta_t$) as in linear bandits [1, 17]. Our noise setting better models real-world scenarios where distinct actions incur different risk, and enables us to explicitly quantify the correlation effects. Following the standard assumption in the bandit literature [22, 11, 34], we assume the noise is sub-Gaussian, i.e., $\forall u \in \mathbb{R}^d$, $\mathbb{E}[\exp(u^\top \eta_t)] \le \exp(\tfrac{1}{2} u^\top \Sigma^* u)$, where $\Sigma^*$ is unknown. The learner selects an action $w_t \in \mathcal{D}$ and observes feedback according to a certain structure (specified later). For any time horizon $T > 0$, define the expected cumulative regret as
$$\mathbb{E}[\mathcal{R}(T)] = \sum_{t=1}^{T} \mathbb{E}\left[f(w^*) - f(w_t)\right].$$
The objective of the learner is to minimize $\mathbb{E}[\mathcal{R}(T)]$. Note that our mean-covariance function Eq. (1) extends the popular mean-variance measure [26, 30, 34] to the continuous decision space.
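To make the interaction protocol concrete, here is a minimal simulation sketch in Python (our own illustrative code, not from the paper). The class and method names such as `CMCBEnv` are hypothetical, and Gaussian noise is assumed only for illustration; the model above only requires sub-Gaussian noise.

```python
# Minimal sketch of the CMCB interaction protocol (illustrative, not from the paper).
import numpy as np

class CMCBEnv:
    """Environment with d correlated base arms: theta_t = theta* + eta_t."""

    def __init__(self, theta_star, sigma_star, rho, seed=0):
        self.theta_star = np.asarray(theta_star, dtype=float)   # unknown mean vector theta*
        self.sigma_star = np.asarray(sigma_star, dtype=float)   # unknown covariance Sigma*
        self.rho = rho                                           # risk-aversion parameter
        self.rng = np.random.default_rng(seed)

    def utility(self, w):
        """Mean-covariance objective f(w) = w^T theta* - rho * w^T Sigma* w."""
        w = np.asarray(w, dtype=float)
        return w @ self.theta_star - self.rho * w @ self.sigma_star @ w

    def sample_rewards(self):
        """Draw the underlying random reward vector theta_t for one round."""
        return self.rng.multivariate_normal(self.theta_star, self.sigma_star)

    def regret(self, w, w_opt):
        """Instantaneous regret f(w*) - f(w)."""
        return self.utility(w_opt) - self.utility(w)
```

Under full-information feedback the learner would see the whole vector returned by `sample_rewards()`; under semi-bandit feedback only the entries with positive weight; under full-bandit feedback only the scalar `w @ theta_t`.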
In the following, we consider three feedback settings motivated by reward observation scenarios in practice: (i) full-information (CMCB-FI), observing the random rewards of all base arms after a pull; (ii) semi-bandit (CMCB-SB), only observing the random rewards of the selected base arms; and (iii) full-bandit (CMCB-FB), only seeing a weighted sum of the random rewards of the base arms. We present the formal definitions of these three feedback settings in the following sections.

Notations. For action $w \in \mathcal{D}$, let $I_w$ be the diagonal matrix with $I_{w,ii} = \mathbb{I}\{w_i > 0\}$. For a matrix $A$, let $A_w = I_w A I_w$, and let $\Lambda_A$ be the diagonal matrix with the same diagonal as $A$.

## 4 CMCB with Full-Information Feedback (CMCB-FI)

We start with CMCB with full-information feedback (CMCB-FI). In this setting, at each timestep $t$, the learner selects $w_t \in \triangle_d$ and observes the random reward $\theta_{t,i}$ for all $i \in [d]$. CMCB-FI provides an online learning model for the celebrated Markowitz problem [23, 14] in finance, where investors select portfolios and can observe the prices of all stocks at the end of the trading day. Below, we propose an optimal Mean-Covariance Empirical algorithm (MC-Empirical) for CMCB-FI, and provide a novel regret analysis that fully characterizes how an arbitrary covariance structure affects the regret performance. We also present a matching lower bound for CMCB-FI to demonstrate the optimality of MC-Empirical.

### 4.1 Algorithm for CMCB-FI

Algorithm 1 shows the detailed steps of MC-Empirical.

Algorithm 1 MC-Empirical
1: Input: risk-aversion parameter $\rho > 0$.
2: Initialization: pull action $w_1 = (\tfrac{1}{d}, \dots, \tfrac{1}{d})^\top$ and observe $\theta_1 = (\theta_{1,1}, \dots, \theta_{1,d})^\top$. Set $\hat\theta_{1,i} \leftarrow \theta_{1,i}$ for all $i \in [d]$, and $\hat\Sigma_{1,ij} \leftarrow (\theta_{1,i} - \hat\theta_{1,i})(\theta_{1,j} - \hat\theta_{1,j})$ for all $i, j \in [d]$.
3: for $t = 2, 3, \dots$ do
4: $\quad w_t = \operatorname{argmax}_{w \in \triangle_d} \big(w^\top \hat\theta_{t-1} - \rho\, w^\top \hat\Sigma_{t-1} w\big)$
5: $\quad$ Pull $w_t$ and observe $\theta_t = (\theta_{t,1}, \dots, \theta_{t,d})^\top$
6: $\quad \hat\theta_{t,i} \leftarrow \tfrac{1}{t} \sum_{s=1}^{t} \theta_{s,i}$ for all $i \in [d]$
7: $\quad \hat\Sigma_{t,ij} \leftarrow \tfrac{1}{t} \sum_{s=1}^{t} (\theta_{s,i} - \hat\theta_{t,i})(\theta_{s,j} - \hat\theta_{t,j})$ for all $i, j \in [d]$
8: end for

Specifically, at each timestep $t$, we use the empirical mean $\hat\theta_t$ and empirical covariance $\hat\Sigma_t$ to estimate $\theta^*$ and $\Sigma^*$, respectively. Then, we form $\hat f_t(w) = w^\top \hat\theta_t - \rho\, w^\top \hat\Sigma_t w$, an empirical mean-covariance function of $w \in \triangle_d$, and always choose the action with the maximum empirical objective value.

Although MC-Empirical appears intuitive, its analysis is highly non-trivial due to the covariance-based risk in the objective. In this case, a naive universal bound cannot characterize the impact of the covariance, and prior gap-dependent analysis (e.g., [11, 25]) cannot be applied to our continuous decision space, where the gap approaches zero. Instead, we develop two novel techniques to handle the covariance: using the actual covariance to analyze the confidence region of the expected rewards, and exploiting the empirical information of the sampling strategy to bound the risk gap between the selected actions and the optimal one. Different from prior works [11, 25], which assume prior knowledge of the covariance or only focus on the independent and positively-correlated cases, our analysis does not require extra knowledge of the covariance and explicitly quantifies the effects of arbitrary covariance structures.
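For concreteness, here is a minimal sketch of the selection rule of Algorithm 1, reusing the hypothetical `CMCBEnv` from the earlier sketch. The simplex maximization uses SciPy's SLSQP solver, which is our choice for illustration; the paper does not prescribe a particular solver.

```python
# Sketch of the MC-Empirical selection rule (illustrative, assuming CMCBEnv above).
import numpy as np
from scipy.optimize import minimize

def argmax_mean_cov(theta_hat, sigma_hat, rho):
    """Maximize w^T theta_hat - rho * w^T sigma_hat w over the probability simplex."""
    d = len(theta_hat)
    obj = lambda w: -(w @ theta_hat - rho * w @ sigma_hat @ w)   # minimize the negative
    cons = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]    # weights sum to one
    res = minimize(obj, np.full(d, 1.0 / d), method="SLSQP",
                   bounds=[(0.0, 1.0)] * d, constraints=cons)
    return res.x

def mc_empirical(env, T):
    """Run T rounds of empirical-best selection under full-information feedback."""
    d = len(env.theta_star)
    obs, actions = [], []
    w = np.full(d, 1.0 / d)                 # uniform initial action (Line 2)
    for t in range(T):
        actions.append(w)
        obs.append(env.sample_rewards())    # full-information feedback
        data = np.asarray(obs)
        theta_hat = data.mean(axis=0)
        # Empirical covariance with 1/t normalization, as in Line 7 of Algorithm 1.
        sigma_hat = np.cov(data, rowvar=False, bias=True) if t > 0 else np.zeros((d, d))
        w = argmax_mean_cov(theta_hat, sigma_hat, rho=env.rho)
    return actions
```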
The regret performance of MC-Empirical is summarized in Theorem 1.

Theorem 1 (Upper Bound for CMCB-FI). Consider the continuous mean-covariance bandits with full-information feedback (CMCB-FI). For any $T > 0$, algorithm MC-Empirical (Algorithm 1) achieves an expected cumulative regret bounded by
$$O\bigg(\Big(\min\Big\{\sqrt{w^{*\top} \Sigma^* w^*} + \rho^{-1}\big(\theta^*_{\max} - \theta^*_{\min}\big),\ \sqrt{\Sigma^*_{\max}}\Big\} + \rho\Big)\sqrt{T}\,\ln T\bigg), \qquad (2)$$
where $\theta^*_{\max} = \max_{i \in [d]} \theta^*_i$, $\theta^*_{\min} = \min_{i \in [d]} \theta^*_i$ and $\Sigma^*_{\max} = \max_{i \in [d]} \Sigma^*_{ii}$.

Proof sketch. Let $D_t$ be the diagonal matrix that takes value $t$ at each diagonal entry. We first build confidence intervals for the expected rewards of actions and for the covariance as
$$|w^\top \theta^* - w^\top \hat\theta_{t-1}| \le \sqrt{c_1 \ln t \cdot w^\top D_{t-1}^{-1}\Big(\lambda \Lambda_{\Sigma^*} D_{t-1} + \sum_{s=1}^{t-1} \Sigma^*\Big) D_{t-1}^{-1} w}
\quad\text{and}\quad
|\Sigma^*_{ij} - \hat\Sigma_{ij,t-1}| \le q_t \triangleq c_2 \sqrt{\frac{\ln t}{t-1}},$$
where $\lambda = \frac{w^{*\top} \Sigma^* w^*}{\Sigma^*_{\max}}$ and $c_1, c_2$ are positive constants. Then, we obtain the confidence interval of $f(w)$ as $|\hat f_{t-1}(w) - f(w)| \le r_t(w) \triangleq p_t(w) + \rho\, w^\top Q_t w$, where $p_t(w)$ denotes the confidence radius of $w^\top \theta^*$ above and $Q_t$ is the matrix with all entries equal to $q_t$. Since algorithm MC-Empirical always plays the empirical best action, we have
$$f(w^*) - f(w_t) \le \hat f_{t-1}(w^*) + r_t(w^*) - f(w_t) \le \hat f_{t-1}(w_t) + r_t(w^*) - f(w_t) \le r_t(w^*) + r_t(w_t).$$
Plugging in the definitions of $f(w)$ and $r_t(w)$, we have
$$-\Delta_{\theta^*} + \rho\big(w_t^\top \Sigma^* w_t - w^{*\top} \Sigma^* w^*\big) \le f(w^*) - f(w_t) \overset{(a)}{\le} c_3 \sqrt{\frac{\ln t}{t-1}}\Big(\sqrt{w_t^\top \Sigma^* w_t} + \cdots\Big), \qquad (3)$$
where $\Delta_{\theta^*} = \theta^*_{\max} - \theta^*_{\min}$, $c_3$ is a positive constant, and the omitted terms in inequality (a) do not depend on $w_t$. Since our goal is to bound the regret $f(w^*) - f(w_t)$, and in inequality (a) only the $\sqrt{w_t^\top \Sigma^* w_t}$ term is a variable, the challenge falls on bounding $w_t^\top \Sigma^* w_t$. Note that the left-hand side of Eq. (3) is linear in $w_t^\top \Sigma^* w_t$, while the right-hand side only contains $\sqrt{w_t^\top \Sigma^* w_t}$. Then, using the property of the sampling strategy on $w_t$, i.e., Eq. (3), again, after some algebraic analysis we obtain
$$w_t^\top \Sigma^* w_t \le c_4\Big(w^{*\top} \Sigma^* w^* + \frac{1}{\rho}\sqrt{w^{*\top} \Sigma^* w^*} + \frac{\ln t}{t-1} + \frac{\ln t}{\rho^2 (t-1)}\Big)$$
for some constant $c_4$. Plugging this into inequality (a) and summing over $t$, we obtain the theorem.

Remark 1. As we will show in Section 4.2, this $O(\sqrt{T})$ regret matches the lower bound up to a logarithmic factor. Moreover, Theorem 1 fully characterizes how an arbitrary covariance structure impacts the regret bound. To see this, note that in Eq. (2), under the min operation, the first $w^{*\top} \Sigma^* w^*$-related term dominates under reasonable $\rho$, and it shrinks as the correlation moves from positive to negative, which implies that the more negatively (positively) correlated the base arms are, the lower (higher) the regret the learner suffers. The intuition is that negative (positive) correlation diversifies (intensifies) the risk of estimation error and narrows (enlarges) the confidence region for the expected reward of an action, which leads to a reduction (an increase) of regret. Also note that when $\rho = 0$, the CMCB-FI problem reduces to a $d$-armed bandit problem with full-information feedback, and Eq. (2) becomes $O(\sqrt{\Sigma^*_{\max} T})$. For this degenerate case, the optimal gap-dependent regret is $O(\frac{\Sigma^*_{\max}}{\Delta})$ for a constant gap $\Delta > 0$. By setting $\Delta = \sqrt{\Sigma^*_{\max}/T}$ in this gap-dependent result, one obtains the optimal gap-independent regret $O(\sqrt{\Sigma^*_{\max} T})$. Hence, when $\rho = 0$, Eq. (2) still offers a tight gap-independent regret bound.
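To make the correlation effect in Remark 1 concrete, here is a tiny numerical example of our own: for the same unit marginal variances, the $w^\top \Sigma w$ term shrinks under negative correlation and grows under positive correlation.

```python
# Illustration of Remark 1 (our own example): effect of correlation on w^T Sigma w.
import numpy as np

w = np.full(2, 0.5)                            # equal-weight action on two arms
for c in (-0.9, 0.0, 0.9):                     # correlation coefficient
    sigma = np.array([[1.0, c], [c, 1.0]])     # unit-variance arms
    print(f"corr={c:+.1f}  w^T Sigma w = {w @ sigma @ w:.3f}")
# corr=-0.9 -> 0.050,  corr=+0.0 -> 0.500,  corr=+0.9 -> 0.950
```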
### 4.2 Lower Bound for CMCB-FI

Now we provide a regret lower bound for CMCB-FI, which demonstrates that the $O(\sqrt{T})$ regret of MC-Empirical is in fact optimal (up to a logarithmic factor). Since CMCB-FI considers full-information feedback and a continuous decision space, where the reward gap (between the optimal action and the nearest sub-optimal action) approaches zero, existing lower bound analyses for linear [8, 9] or discrete [19, 11, 25] bandit problems cannot be applied to this problem. To tackle this challenge, we contribute a new analytical procedure that establishes the lower bound for continuous and full-information bandit problems from a Bayesian perspective. The main idea is to construct an instance distribution where $\theta^*$ is drawn from a well-chosen prior Gaussian distribution. After $t$ pulls, the posterior of $\theta^*$ is still Gaussian, with a mean vector $u_t$ related to the sample outcomes. Since the hindsight strategy simply selects the action that maximizes the mean-covariance function with respect to $\theta^*$, while a feasible strategy can only utilize the sample information ($u_t$), we show that any algorithm must suffer $\Omega(\sqrt{T})$ regret due to the gap between the random $\theta^*$ and its mean $u_t$. Theorem 2 below formally states this lower bound.

Theorem 2 (Lower Bound for CMCB-FI). There exists an instance distribution of the continuous mean-covariance bandits with full-information feedback (CMCB-FI) problem, for which any algorithm has an expected cumulative regret bounded by $\Omega(\sqrt{T})$.

Remark 2. This parameter-free lower bound demonstrates that the regret upper bound (Theorem 1) of MC-Empirical is optimal (within a logarithmic factor), since under the constructed instance distribution Theorem 1 also implies a matching $O(\sqrt{T})$ parameter-free result, i.e., when $\rho = 1/\sqrt{T}$, Eq. (2) becomes $O(\sqrt{T})$ up to logarithmic factors. Unlike discrete bandit problems [19, 11, 25], where the optimal regret is usually $O(\frac{\log T}{\Delta})$ for a constant gap $\Delta > 0$, CMCB-FI has a continuous decision space with gap $\Delta \to 0$, and a polylogarithmic regret is not achievable in general. In the continuous bandit literature [18, 8, 9], a parameter-dependent (i.e., $\theta^*$-, $\Sigma^*$- and $\rho$-dependent) lower bound remains an open problem.

## 5 CMCB with Semi-Bandit Feedback (CMCB-SB)

In many practical tasks, the learner may not be able to simultaneously select (place positive weights on) all options and observe full information. Instead, the weight of each option is usually lower bounded and cannot be arbitrarily small. As a result, the learner only selects a subset of options and obtains their feedback, e.g., company investments [12] in multiple businesses. Motivated by such tasks, in this section we consider the CMCB problem with semi-bandit feedback (CMCB-SB), where the decision space is a restricted probability simplex $\triangle_d^c = \{w \in \mathbb{R}^d : w_i = 0 \text{ or } c \le w_i \le 1, \forall i \in [d], \text{ and } \sum_i w_i = 1\}$ for some constant $0 < c \le \tfrac{1}{2}$. (When $c > \tfrac{1}{2}$, the learner can only place all weight on one option, and the problem trivially reduces to the mean-variance bandit setting [26, 34]; in this case, our Theorem 3 still provides a tight gap-independent bound.) In this scenario, at timestep $t$, the learner selects $w_t \in \triangle_d^c$ and only observes the rewards $\{\theta_{t,i} : w_{t,i} \ge c\}$ of the base arms that are placed positive weights on. Below, we propose the Mean-Covariance Upper Confidence Bound algorithm (MC-UCB) for CMCB-SB, and provide a regret lower bound, which shows that MC-UCB achieves the optimal performance with respect to $T$.

### 5.1 Algorithm for CMCB-SB

Algorithm MC-UCB for CMCB-SB is described in Algorithm 2. The main idea is to use optimistic covariance estimates to construct a confidence region for the expected reward of an action, calculate an upper confidence bound of the mean-covariance function, and then select the action with the maximum optimistic mean-covariance value.

Algorithm 2 MC-UCB
1: Input: $\rho > 0$, $c \in (0, \tfrac{1}{2}]$ and regularization parameter $\lambda > 0$.
2: Initialize: for each $i \in [d]$, pull $e_i$, which has $1$ at the $i$-th entry and $0$ elsewhere; for each $i, j \in [d]$, $i \ne j$, pull $e_{ij}$, which has $\tfrac{1}{2}$ at the $i$-th and $j$-th entries and $0$ elsewhere. Update $N_{ij}(d^2)$ for all $i, j \in [d]$, $\hat\theta_{d^2}$ and $\hat\Sigma_{d^2}$.
3: for $t = d^2 + 1, \dots$ do
4: $\quad \underline{\Sigma}_{t,ij} \leftarrow \hat\Sigma_{t-1,ij} - g_{ij}(t)$ for all $i, j \in [d]$
5: $\quad \overline{\Sigma}_{t,ij} \leftarrow \hat\Sigma_{t-1,ij} + g_{ij}(t)$ for all $i, j \in [d]$
6: $\quad w_t \leftarrow \operatorname{argmax}_{w \in \triangle_d^c} \big(w^\top \hat\theta_{t-1} + E_t(w) - \rho\, w^\top \underline{\Sigma}_t w\big)$
7: $\quad$ Pull $w_t$ and observe all $\theta_{t,i}$ such that $w_{t,i} > 0$
8: $\quad J_{t,ij} \leftarrow \mathbb{I}\{w_{t,i} > 0, w_{t,j} > 0\}$ for all $i, j \in [d]$
9: $\quad N_{ij}(t) \leftarrow N_{ij}(t-1) + J_{t,ij}$ for all $i, j \in [d]$
10: $\quad \hat\theta_{t,i} \leftarrow \frac{\sum_{s=1}^{t} J_{s,ii}\, \theta_{s,i}}{N_{ii}(t)}$ for all $i \in [d]$
11: $\quad \hat\Sigma_{t,ij} \leftarrow \frac{\sum_{s=1}^{t} J_{s,ij}\, (\theta_{s,i} - \hat\theta_{t,i})(\theta_{s,j} - \hat\theta_{t,j})}{N_{ij}(t)}$ for all $i, j \in [d]$
12: end for

In Algorithm 2, $N_{ij}(t)$ denotes the number of timesteps $s \in [t]$ at which $w_{s,i}, w_{s,j} > 0$. $J_{t,ij}$ is an indicator variable that takes value $1$ if $w_{t,i}, w_{t,j} > 0$ and $0$ otherwise. $D_t$ is the diagonal matrix with $D_{t,ii} = N_{ii}(t)$. In Line 2, we update the numbers of observations as $N_{ii}(d^2) \leftarrow 2d - 1$ for all $i \in [d]$ and $N_{ij}(d^2) \leftarrow 2$ for all $i, j \in [d]$, $i \ne j$ (due to the $d^2$ initial pulls), and calculate the empirical mean $\hat\theta_{d^2}$ and empirical covariance $\hat\Sigma_{d^2}$ using the equations in Lines 10 and 11.
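The argmax in Line 6 is taken over the restricted simplex $\triangle_d^c$. One simple way to compute such an argmax for small $d$ (our own illustration; the paper does not prescribe an optimization routine) is to enumerate the possible supports and solve the continuous problem on each support.

```python
# Maximizing a mean-covariance objective over the restricted simplex (illustrative).
import numpy as np
from itertools import chain, combinations
from scipy.optimize import minimize

def argmax_restricted_simplex(objective, d, c):
    """Maximize objective(w) over w with w_i = 0 or c <= w_i <= 1 and sum(w) = 1."""
    best_w, best_val = None, -np.inf
    supports = chain.from_iterable(combinations(range(d), k) for k in range(1, d + 1))
    for sup in supports:
        k = len(sup)
        if k * c > 1.0:                 # all weights >= c cannot sum to 1 on this support
            continue
        def neg(v, sup=sup):            # objective restricted to the support, negated
            w = np.zeros(d)
            w[list(sup)] = v
            return -objective(w)
        v0 = np.full(k, c)              # feasible start: c everywhere, remainder on one arm
        v0[0] += 1.0 - k * c
        res = minimize(neg, v0, method="SLSQP", bounds=[(c, 1.0)] * k,
                       constraints=[{"type": "eq", "fun": lambda v: v.sum() - 1.0}])
        if res.success and -res.fun > best_val:
            best_val = -res.fun
            best_w = np.zeros(d)
            best_w[list(sup)] = res.x
    return best_w

# Example usage with hypothetical estimates theta_hat, sigma_hat and parameter rho:
# w = argmax_restricted_simplex(lambda w: w @ theta_hat - rho * w @ sigma_hat @ w, d=5, c=0.2)
```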
For any $t > 1$ and $i, j \in [d]$, we define the confidence radius of the covariance $\Sigma^*_{ij}$ as
$$g_{ij}(t) = \frac{16}{3}\cdot\frac{\ln t}{N_{ij}(t-1)} + \sqrt{\frac{3 \ln t}{N_{ij}(t-1)}} + \sqrt{\frac{48 \ln^2 t}{N_{ij}(t-1)\, N_{ii}(t-1)}} + \sqrt{\frac{36 \ln^2 t}{N_{ij}(t-1)\, N_{jj}(t-1)}},$$
and the confidence region for the expected reward $w^\top \theta^*$ of action $w$ as
$$E_t(w) = \sqrt{2\beta(\delta_t)\, w^\top D_{t-1}^{-1}\Big(\lambda \Lambda_{\overline{\Sigma}_t} D_{t-1} + \sum_{s=1}^{t-1} \big(\overline{\Sigma}_t\big)_{w_s}\Big) D_{t-1}^{-1} w},$$
where $\lambda > 0$ is the regularization parameter, $\beta(\delta_t) = \ln(\tfrac{1}{\delta_t}) + d \ln\ln t + \tfrac{d}{\lambda}$ is the confidence term, and $\delta_t = \tfrac{1}{t \ln^2 t}$ is the confidence parameter. At each timestep $t$, algorithm MC-UCB calculates the upper confidence bound of $f(w)$ using $g_{ij}(t)$ and $E_t(w)$, and selects the action $w_t$ that maximizes this upper confidence bound. Then, the learner observes the rewards $\theta_{t,i}$ with $w_{t,i} > 0$ and updates the statistics according to the feedback.

In the regret analysis, unlike [11], which uses a universal upper bound to analyze the confidence intervals, we incorporate the estimated covariance into the confidence region for the expected reward of an action, which enables us to derive a tighter regret bound and explicitly quantify the impact of the covariance structure on algorithm performance. We also contribute a new technique for handling the challenge raised by having different numbers of observations among base arms, in order to obtain an optimal $O(\sqrt{T})$ regret (here, prior gap-dependent analysis [11, 25] still cannot be applied to solve this continuous problem). Theorem 3 gives the regret upper bound of algorithm MC-UCB.

Theorem 3 (Upper Bound for CMCB-SB). Consider the continuous mean-covariance bandits with semi-bandit feedback problem (CMCB-SB). Then, for any $T > 0$, algorithm MC-UCB (Algorithm 2) with regularization parameter $\lambda > 0$ has an expected cumulative regret bounded by
$$O\Big(\sqrt{L(\lambda)\big(\Sigma^*_{+} + d^2\big)\, d\, T \ln^2 T} + \rho\, d \ln T\Big),$$
where $L(\lambda) = (\lambda + 1)(\ln(1 + \lambda^{-1}) + 1)$ and $\Sigma^*_{+} = \sum_{i,j \in [d]} \max\{\Sigma^*_{ij}, 0\}$.

Remark 3. Theorem 3 captures the effects of covariance structures in CMCB-SB: positive correlation yields a larger $\Sigma^*_{+}$ factor than the negatively-correlated or independent case, since the covariance influences the rate of estimate concentration for the expected rewards of actions. The regret bound for CMCB-SB has a heavier dependence on $d$ than that for CMCB-FI. This matches the fact that semi-bandit feedback only reveals the rewards of the queried dimensions, and thus provides less information than full-information feedback in terms of observable dimensions.

### 5.2 Lower Bound for CMCB-SB

In this subsection, we establish a lower bound for CMCB-SB and show that algorithm MC-UCB achieves the optimal regret with respect to $T$ up to logarithmic factors. The insight of the lower bound analysis is to construct two instances with a gap in the expected reward vector $\theta^*$, where the optimal actions under these two instances place positive weights on different base arms. Then, when the gap is set to $\sqrt{\ln T / T}$, any algorithm must suffer $\Omega(\sqrt{T})$ regret to differentiate these two instances. Theorem 4 summarizes the lower bound for CMCB-SB.

Theorem 4 (Lower Bound for CMCB-SB). There exists an instance distribution of the continuous mean-covariance bandits with semi-bandit feedback (CMCB-SB) problem, for which any algorithm has an expected cumulative regret bounded by $\Omega(\sqrt{T})$.

Remark 4. Theorem 4 demonstrates that the regret upper bound (Theorem 3) of MC-UCB is optimal with respect to $T$ (up to logarithmic factors). Similar to CMCB-FI, CMCB-SB considers a continuous decision space with gap $\Delta \to 0$, and thus the lower bound differs from the gap-dependent $O(\frac{\log T}{\Delta})$ results in discrete bandit problems [19, 11, 25]. Our lower bound shows that for CMCB-SB, no improvement upon $O(\sqrt{T})$ regret is possible in general.
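Before moving to full-bandit feedback, here is a simplified sketch (our own illustration, not the exact MC-UCB) of optimistic selection under semi-bandit feedback. It keeps the structure of Algorithm 2, namely empirical estimates from jointly observed arms, an optimistic covariance, and a mean bonus that shrinks with the observation counts, but it uses simple surrogate radii in place of the exact $g_{ij}(t)$ and $E_t(w)$ above; the constants are illustrative.

```python
# Simplified semi-bandit selection sketch (surrogate radii, illustrative only).
import numpy as np

def semi_bandit_estimates(history, d):
    """history: list of (mask, reward) pairs; reward[i] is valid only where mask[i] is True."""
    N = np.zeros((d, d))                       # pairwise observation counts N_ij
    theta_hat = np.zeros(d)
    sigma_hat = np.zeros((d, d))
    for i in range(d):
        xi = [r[i] for m, r in history if m[i]]
        if xi:
            theta_hat[i] = np.mean(xi)
    for i in range(d):
        for j in range(d):
            pairs = [(r[i], r[j]) for m, r in history if m[i] and m[j]]
            N[i, j] = len(pairs)
            if pairs:
                a = np.asarray(pairs)
                sigma_hat[i, j] = np.mean((a[:, 0] - theta_hat[i]) * (a[:, 1] - theta_hat[j]))
    return theta_hat, sigma_hat, N

def optimistic_value(w, theta_hat, sigma_hat, N, t, rho):
    """Surrogate upper confidence value of f(w) = w^T theta - rho * w^T Sigma w."""
    log_t = np.log(max(t, 2))
    g = np.sqrt(2.0 * log_t / np.maximum(N, 1.0))          # surrogate covariance radius
    sigma_opt = sigma_hat - g                               # optimistic (lower) covariance
    counts = np.maximum(np.diag(N), 1.0)
    bonus = np.sqrt(2.0 * log_t * np.sum(w ** 2 / counts))  # surrogate mean bonus
    return w @ theta_hat + bonus - rho * w @ sigma_opt @ w

# Each round: rebuild the estimates from history, pick
#   w_t = argmax_restricted_simplex(lambda w: optimistic_value(w, ...), d, c)
# (using the sketch above), pull w_t, and append (w_t > 0, observed rewards) to history.
```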
## 6 CMCB with Full-Bandit Feedback (CMCB-FB)

In this section, we further study the CMCB problem with full-bandit feedback (CMCB-FB), where at timestep $t$ the learner selects $w_t \in \triangle_d$ and only observes the weighted sum of the random rewards, i.e., $y_t = w_t^\top \theta_t$. This setting models many real-world decision-making tasks where the learner can only obtain an aggregate feedback from the chosen options, such as clinical trials [31].

### 6.1 Algorithm for CMCB-FB

We propose the Mean-Covariance Exploration-Then-Exploitation algorithm (MC-ETE) for CMCB-FB in Algorithm 3.

Algorithm 3 MC-ETE
1: Input: $\rho > 0$, $\tilde d = \tfrac{d(d+1)}{2}$ and design action set $\pi = \{v_1, \dots, v_{\tilde d}\}$.
2: Initialize: $N_\pi(0) \leftarrow 0$, $t \leftarrow 1$.
3: Repeat lines 4-22:
4: $\quad$ if $N_\pi(t-1) > t^{2/3} / \tilde d$ then
5: $\qquad w_t = \operatorname{argmax}_{w \in \triangle_d} \big(w^\top \hat\theta_{t-1} - \rho\, w^\top \hat\Sigma_{t-1} w\big)$
6: $\qquad t \leftarrow t + 1$
7: $\quad$ else
8: $\qquad N_\pi(t) \leftarrow N_\pi(t-1) + 1$
9: $\qquad$ for $k = 1, \dots, \tilde d$ do
10: $\qquad\quad$ Pull $v_k$ and observe $y_{N_\pi(t),k}$
11: $\qquad\quad$ if $k = \tilde d$ then
12: $\qquad\qquad y_{N_\pi(t)} \leftarrow (y_{N_\pi(t),1}, \dots, y_{N_\pi(t),\tilde d})^\top$
13: $\qquad\qquad \hat y_t \leftarrow \tfrac{1}{N_\pi(t)} \sum_{s=1}^{N_\pi(t)} y_s$
14: $\qquad\qquad \hat z_{t,k} \leftarrow \tfrac{1}{N_\pi(t)} \sum_{s=1}^{N_\pi(t)} (y_{s,k} - \hat y_{t,k})^2$ for all $k \in [\tilde d]$
15: $\qquad\qquad \hat z_t \leftarrow (\hat z_{t,1}, \dots, \hat z_{t,\tilde d})^\top$
16: $\qquad\qquad \hat\theta_t \leftarrow B_\pi^{+} \hat y_t$
17: $\qquad\qquad \hat\sigma_t \leftarrow C_\pi^{+} \hat z_t$
18: $\qquad\qquad$ Reshape $\hat\sigma_t$ into the $d \times d$ matrix $\hat\Sigma_t$
19: $\qquad\quad$ end if
20: $\qquad\quad t \leftarrow t + 1$
21: $\qquad$ end for
22: $\quad$ end if

Specifically, we first choose a design action set $\pi = \{v_1, \dots, v_{\tilde d}\}$, which contains $\tilde d = d(d+1)/2$ actions and satisfies that
$$B_\pi = \big(v_1^\top; \dots; v_{\tilde d}^\top\big) \quad\text{and}\quad C_\pi = \big(v_{1,1}^2, \dots, v_{1,d}^2, 2v_{1,1}v_{1,2}, \dots, 2v_{1,d-1}v_{1,d};\ \dots;\ v_{\tilde d,1}^2, \dots, v_{\tilde d,d}^2, 2v_{\tilde d,1}v_{\tilde d,2}, \dots, 2v_{\tilde d,d-1}v_{\tilde d,d}\big)$$
are of full column rank. We denote their Moore-Penrose inverses by $B_\pi^{+}$ and $C_\pi^{+}$, for which $B_\pi^{+} B_\pi = I_{d \times d}$ and $C_\pi^{+} C_\pi = I_{\tilde d \times \tilde d}$. There exists more than one feasible $\pi$; for simplicity and good performance, we choose $v_1, \dots, v_d$ as the standard basis vectors in $\mathbb{R}^d$ and $\{v_{d+1}, \dots, v_{\tilde d}\}$ as the set of all $\binom{d}{2}$ vectors with two entries equal to $\tfrac{1}{2}$ and all other entries equal to $0$.

In an exploration round (Lines 8-21), we pull the designed actions in $\pi$ and maintain their empirical rewards and variances. Through linear transformations by $B_\pi^{+}$ and $C_\pi^{+}$, we obtain estimators of the expected rewards and covariance of the base arms (Lines 16-17). When the estimation confidence is high enough, we exploit the attained information to select the empirical best action (Line 5). Theorem 5 presents the regret guarantee of MC-ETE.

Theorem 5 (Upper Bound for CMCB-FB). Consider the continuous mean-covariance bandits with full-bandit feedback problem (CMCB-FB). Then, for any $T > 0$, algorithm MC-ETE (Algorithm 3) achieves an expected cumulative regret bounded by
$$O\Big(Z(\rho, \pi)\, \sqrt{\tilde d\,(\ln T + d^2)}\; T^{\frac{2}{3}} + \tilde d\, \Delta_{\max}\, T^{\frac{2}{3}}\Big),$$
where $Z(\rho, \pi) = \max_{w \in \triangle_d}\big(\sqrt{w^\top B_\pi^{+} \Sigma_\pi (B_\pi^{+})^\top w} + \rho\, \|C_\pi^{+}\|\big)$, $\Sigma_\pi = \mathrm{diag}(v_1^\top \Sigma^* v_1, \dots, v_{\tilde d}^\top \Sigma^* v_{\tilde d})$, $\|C_\pi^{+}\| = \max_{i \in [\tilde d]} \sum_{j \in [\tilde d]} |C^{+}_{\pi,ij}|$ and $\Delta_{\max} = f(w^*) - \min_{w \in \triangle_d} f(w)$.

Remark 5. The choice of $\pi$ affects the regret factor $\Sigma_\pi$ contained in $Z(\rho, \pi)$. Under our construction, $\Sigma_\pi$ can be regarded as a uniform representation of the covariance $\Sigma^*$, and thus our regret bound demonstrates how the learning performance is influenced by the covariance structure, i.e., negative (positive) correlation shrinks (enlarges) this factor and leads to a lower (higher) regret.
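The following sketch (our own) builds the design action set and the matrices $B_\pi$ and $C_\pi$ described above for a small $d$, and checks that $\theta$ and $\Sigma$ can be recovered from noiseless aggregate means and variances via the Moore-Penrose inverses. The diagonal-then-off-diagonal ordering of the covariance vector is our convention for the reshape step.

```python
# Design action set and recovery check for the full-bandit setting (illustrative).
import numpy as np
from itertools import combinations

def design_set(d):
    """d standard basis vectors plus all C(d,2) vectors with two entries 1/2."""
    actions = [np.eye(d)[i] for i in range(d)]
    for i, j in combinations(range(d), 2):
        v = np.zeros(d)
        v[i] = v[j] = 0.5
        actions.append(v)
    return np.array(actions)                    # shape (d_tilde, d), d_tilde = d(d+1)/2

def design_matrices(V):
    """B maps theta to E[y_k] = v_k^T theta; C maps (diag, off-diag of Sigma) to Var(y_k)."""
    d = V.shape[1]
    pairs = list(combinations(range(d), 2))
    B = V
    C = np.zeros((len(V), d + len(pairs)))
    for k, v in enumerate(V):                   # Var(y_k) = sum_i v_i^2 S_ii + sum_{i<j} 2 v_i v_j S_ij
        C[k, :d] = v ** 2
        for p, (i, j) in enumerate(pairs):
            C[k, d + p] = 2.0 * v[i] * v[j]
    return B, C

d = 3
V = design_set(d)
B, C = design_matrices(V)
assert np.linalg.matrix_rank(B) == d and np.linalg.matrix_rank(C) == C.shape[1]

# Sanity check: recover theta and the diagonal of Sigma from noiseless aggregates.
theta = np.array([0.2, 0.3, 0.1])
Sigma = np.array([[1.0, 0.1, -0.2], [0.1, 1.0, 0.05], [-0.2, 0.05, 1.0]])
y_mean = B @ theta                              # aggregate means per design action
z_var = np.array([v @ Sigma @ v for v in V])    # aggregate variances per design action
theta_rec = np.linalg.pinv(B) @ y_mean
sigma_rec = np.linalg.pinv(C) @ z_var
assert np.allclose(theta_rec, theta) and np.allclose(sigma_rec[:d], np.diag(Sigma))
```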
Discussion on the ETE strategy. In contrast to common ETE-type algorithms, MC-ETE requires novel analytical techniques for handling the concentration of the transformed estimates while preserving the covariance information in the regret bounds. In the analysis, we build a novel concentration result using the key matrices $B_\pi^{+}$ and $C_\pi^{+}$ to adapt to the actual covariance structure, and construct a super-martingale that takes the aggregate noise in an exploration round as its analytical basis to prove the concentration. These techniques allow us to capture the correlations in the results, and are new compared to both the former FI/SB settings and the covariance-related bandit literature [33, 11, 25]. In fact, under full-bandit feedback it is highly challenging to estimate the covariance without using a fixed exploration (i.e., ETE) strategy. Note that even for the simplified offline version of this task, where one uses given (non-fixed) full-bandit data to estimate the covariance, there is no available solution in the statistics literature to our best knowledge. Hence, for such online tasks with severely limited feedback, ETE is the most viable strategy currently available, as used in many partial observation works [21, 6, 7]. We remark that our contribution in this setting focuses on designing a practical solution and deriving regret guarantees that explicitly characterize the correlation impacts. The lower bound for CMCB-FB remains open, and we leave it for future work.

## 7 Experiments

In this section, we present experimental results for our algorithms on both synthetic and real-world [20] datasets. For the synthetic dataset, we set $\theta^* = [0.2, 0.3, 0.2, 0.2, 0.2]^\top$, and $\Sigma^*$ has all diagonal entries equal to $1$ and all off-diagonal entries equal to $0.05$. For the real-world dataset, we use the open dataset "US Funds from Yahoo Finance" on Kaggle [20], which provides financial data of 1680 ETF funds over 2010-2017. We select five funds and generate a stochastic distribution ($\theta^*$ and $\Sigma^*$) from the return data (since we study a stochastic bandit problem). For both datasets, we set $d = 5$ and $\rho \in \{0.1, 10\}$. The random reward $\theta_t$ is drawn i.i.d. from the Gaussian distribution $\mathcal{N}(\theta^*, \Sigma^*)$. We perform 50 independent runs for each algorithm and show the average regret and the 95% confidence interval across runs, with a logarithmic y-axis for clarity of magnitude comparison. (In some cases, since the algorithms perform similar procedures, e.g., in Figures 1(c) and 1(f) they are exploring the designed actions, and have low performance variance, the confidence intervals are narrow and indistinguishable.)

Figure 1: Experiments for CMCB-FI, CMCB-SB and CMCB-FB on the synthetic and real-world datasets. Panels: (a) FI, synthetic, $d = 5$, $\rho = 0.1$; (b) SB, synthetic, $d = 5$, $\rho = 0.1$; (c) FB, synthetic, $d = 5$, $\rho = 10$; (d) FI, real-world, $d = 5$, $\rho = 0.1$; (e) SB, real-world, $d = 5$, $\rho = 0.1$; (f) FB, real-world, $d = 5$, $\rho = 10$.

(CMCB-FI) We compare our algorithm MC-Empirical with two algorithms, OGD [15] and Linear FI. OGD (Online Gradient Descent) [15] is designed for general online convex optimization and also has an $O(\sqrt{T})$ regret guarantee, but its result cannot capture the covariance impacts as ours does. Linear FI is a linear adaptation of MC-Empirical that only aims to maximize the expected rewards. Figures 1(a) and 1(d) show that MC-Empirical enjoys a multiple-orders-of-magnitude reduction in regret compared to the benchmarks, since it efficiently exploits the empirical observations to select actions and handles the covariance-based risk well. In particular, the performance superiority of MC-Empirical over OGD demonstrates that our sampling strategy utilizes the observed information more thoroughly than a conventional gradient-descent-based policy.
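As a usage sketch (our own), the synthetic CMCB-FI setup above can be reproduced with the hypothetical `CMCBEnv`, `argmax_mean_cov` and `mc_empirical` helpers from the earlier sketches; the OGD and Linear FI baselines are not reimplemented here, and the horizon is kept small for illustration.

```python
# Reproducing the synthetic CMCB-FI setup with the earlier illustrative helpers.
import numpy as np

theta_star = np.array([0.2, 0.3, 0.2, 0.2, 0.2])
sigma_star = np.full((5, 5), 0.05) + 0.95 * np.eye(5)      # unit diagonal, 0.05 off-diagonal
env = CMCBEnv(theta_star, sigma_star, rho=0.1, seed=1)
w_opt = argmax_mean_cov(theta_star, sigma_star, rho=0.1)   # oracle action, used only for regret
actions = mc_empirical(env, T=500)
regret = np.cumsum([env.regret(w, w_opt) for w in actions])
print(f"cumulative regret after T=500 rounds: {regret[-1]:.3f}")
```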
(CMCB-SB) For CMCB-SB, we compare MC-UCB with two adaptations of OLS-UCB [11] (the state of the art for combinatorial bandits with covariance), named MC-UCB-Γ and OLS-UCB-C. MC-UCB-Γ uses a confidence region with a universal covariance upper bound Γ, instead of the adaptive one used in our MC-UCB. OLS-UCB-C directly adapts OLS-UCB [11] to the continuous decision space and only considers maximizing the expected rewards in its objective. As shown in Figures 1(b) and 1(e), MC-UCB achieves the lowest regret since it utilizes the covariance information to accelerate the concentration of estimates. Due to the lack of a covariance-adaptive confidence interval, MC-UCB-Γ shows inferior regret performance compared to MC-UCB, and OLS-UCB-C suffers the highest regret due to its ignorance of risk.

(CMCB-FB) We compare MC-ETE with two baselines: OGD-ETE, which adopts OGD [15] during the exploitation phase, and Linear FB, which only investigates expected reward maximization. From Figures 1(c) and 1(f), one can see that MC-ETE achieves the best regret performance due to its effective estimation of the covariance-based risk and its efficiency in exploitation. Due to the inefficiency of a gradient-descent-based policy in utilizing information, OGD-ETE has a higher regret than MC-ETE, whereas Linear FB shows the worst performance owing to its unawareness of the risk.

## 8 Conclusion and Future Work

In this paper, we propose a novel continuous mean-covariance bandit (CMCB) model, which investigates the reward-risk trade-off measured with option correlation. Under this model, we consider three feedback settings, i.e., full-information, semi-bandit and full-bandit feedback, to formulate different real-world reward observation scenarios. We propose novel algorithms for CMCB that achieve the optimal regrets (within logarithmic factors), and provide lower bounds for the problems to demonstrate their optimality. We also present empirical evaluations to show the superior performance of our algorithms. To our best knowledge, this is the first work to fully characterize the impacts of arbitrary covariance structures on learning performance for risk-aware bandits. There are several interesting directions for future work. For example, how to design an adaptive algorithm for CMCB-FB is a challenging open problem, and the lower bound for CMCB-FB is also worth further investigation.

## Acknowledgments and Disclosure of Funding

The work of Yihan Du and Longbo Huang is supported in part by the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grants 2020AAA0108400 and 2020AAA0108403.

## References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24:2312-2320, 2011.

[2] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1-39.26, 2012.

[3] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

[4] Kirt C Butler and Dale L Domian. Risk, diversification, and the investment horizon. Journal of Portfolio Management, 17(3):41, 1991.

[5] Asaf Cassel, Shie Mannor, and Assaf Zeevi. A general approach to multi-armed bandits under risk criteria. In Conference on Learning Theory, pages 1295-1306, 2018.

[6] Sougata Chaudhuri and Ambuj Tewari. Phased exploration with greedy exploitation in stochastic combinatorial partial monitoring games. In Advances in Neural Information Processing Systems, pages 2433-2441, 2016.
[7] Lixing Chen, Jie Xu, and Zhuo Lu. Contextual combinatorial multi-armed bandits with volatile arms and submodular reward. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[8] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.

[9] Varsha Dani, Sham M Kakade, and Thomas Hayes. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2008.

[10] Yahel David and Nahum Shimkin. Pure exploration for max-quantile bandits. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 556-571. Springer, 2016.

[11] Rémy Degenne and Vianney Perchet. Combinatorial semi-bandit with known covariance. In Advances in Neural Information Processing Systems, pages 2972-2980, 2016.

[12] A Diadin. Analysis of a trade company investment project. Business and Economics, page 104, 2019.

[13] Nicolas Galichet, Michele Sebag, and Olivier Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Asian Conference on Machine Learning, pages 245-260, 2013.

[14] Stephan M Gasser, Margarethe Rammerstorfer, and Karl Weinmayer. Markowitz revisited: Social portfolio engineering. European Journal of Operational Research, 258(3):1181-1190, 2017.

[15] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

[16] Anmol Kagrecha, Jayakrishnan Nair, and Krishna Jagannathan. Distribution oblivious, risk-aware algorithms for multi-armed bandits with unbounded rewards. Advances in Neural Information Processing Systems, 32:11272-11281, 2019.

[17] Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi-Yadkori, and Benjamin Van Roy. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 3913-3922, 2017.

[18] Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, volume 17. MIT Press, 2005.

[19] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

[20] Stefano Leone. Dataset: US funds dataset from Yahoo Finance. Kaggle, 2020. https://www.kaggle.com/stefanoleone992/mutual-funds-and-etfs?select=ETFs.csv.

[21] Tian Lin, Bruno Abrahao, Robert Kleinberg, John Lui, and Wei Chen. Combinatorial partial monitoring game with linear feedback and its applications. In International Conference on Machine Learning, pages 901-909, 2014.

[22] Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In International Conference on Machine Learning, pages 1690-1698. PMLR, 2016.

[23] Harry M Markowitz et al. Portfolio selection. Journal of Finance, 7(1):77-91, 1952.

[24] Joanne M McInnerney and Tim S Roberts. Online learning: Social interaction and the creation of a sense of community. Journal of Educational Technology & Society, 7(3):73-81, 2004.

[25] Pierre Perrault, Michal Valko, and Vianney Perchet. Covariance-adapting algorithm for semi-bandits with application to sparse outcomes. In Conference on Learning Theory, pages 3152-3184, 2020.

[26] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275-3283, 2012.
[27] Eric M Schwartz, Eric T Bradlow, and Peter S Fader. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500-522, 2017.

[28] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.

[29] Sattar Vakili and Qing Zhao. Mean-variance and value at risk in multi-armed bandit problems. In The 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1330-1335. IEEE, 2015.

[30] Sattar Vakili and Qing Zhao. Risk-averse multi-armed bandit problems under mean-variance measure. IEEE Journal of Selected Topics in Signal Processing, 10(6):1093-1111, 2016.

[31] Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 30(2):199, 2015.

[32] Manfred K Warmuth and Dima Kuzmin. Online variance minimization. In International Conference on Computational Learning Theory, pages 514-528. Springer, 2006.

[33] Manfred K Warmuth and Dima Kuzmin. Online variance minimization. Machine Learning, 87(1):1-32, 2012.

[34] Qiuyu Zhu and Vincent YF Tan. Thompson sampling algorithms for mean-variance bandits. In International Conference on Machine Learning, 2020.