# Near-Optimal MNL Bandits Under Risk Criteria

Guangyu Xi,¹ Chao Tao,² Yuan Zhou³
¹ University of Maryland, College Park; ² Indiana University Bloomington; ³ University of Illinois at Urbana-Champaign
gxi@umd.edu, taochao@iu.edu, yuanz@illinois.edu

## Abstract

We study MNL bandits, a variant of the traditional multi-armed bandit problem, under risk criteria. Unlike the ordinary expected revenue, risk criteria are more general objectives widely used in industry and business. We design algorithms for a broad class of risk criteria, including but not limited to the well-known conditional value-at-risk, Sharpe ratio, and entropy risk, and prove that they suffer near-optimal regret. As a complement, we also conduct experiments with both synthetic and real data to show the empirical performance of our proposed algorithms.

## Introduction

Dynamic assortment optimization is one of the fundamental problems in online learning. It has wide applications in industry, for example retailing and advertising. To motivate the problem, consider e-commerce companies such as Amazon and Wish that want to sell products to online users who visit their websites and search for some type of product, for example, headphones. Such companies usually have a variety of products of that type in a warehouse to sell. Due to the space constraints of a website, it is not possible to exhibit all of the available products. Hence, each time an online user visits the website, only a limited number of products can be displayed. When an online user buys a product, the company earns some profit. So one natural goal for the company is to display on the website an assortment consisting of several products such that the expected revenue is maximized.

However, in practice, a company may have more complex strategies than simply maximizing its revenue, and general risk criteria may be better suited to such goals. For example, in risk management, a very common risk criterion called expected shortfall or conditional value-at-risk (CVaR) is defined as the expected revenue below a certain percentile. If we only consider the expected revenue, we may be led to recommend products that produce high revenue but are purchased only by a small portion of users. If the company wishes to maintain a higher level of active and diversified users, then CVaR is more appropriate. Whether it is still possible for the sales manager of the company to design a near-optimal sales strategy when the goal is changed, for example to a risk criterion, is a very practical problem and, to the best of our knowledge, has not been studied before.

Suppose a company has $N$ products of a certain category to sell during a sales season. The products can be represented by $[N] \stackrel{\text{def}}{=} \{1, 2, \ldots, N\}$, with each product corresponding to an element of $[N]$. Let $T$ be the total number of times such products are searched during a sales season, and let $S_t$ be the assortment displayed by the website at the $t$th request. The aforementioned sales activity can be modeled by the following game, which runs over $T$ time steps: at time step $1 \le t \le T$, when an assortment $S_t \subseteq [N]$ is displayed by the website, the online user makes a choice, i.e., whether to buy a product in $S_t$ or purchase nothing.
Following the previous motivating example, we add a cardinality constraint, which means the number of products in $S_t$ cannot exceed a predefined number $K \le N$. Let $c_t$ denote the choice of the online user at time $t$. When $c_t = i$, the online user buys product $i$. For convenience, we use $c_t = 0$ to represent the situation in which the online user does not purchase anything. In general, $c_t$ can be viewed as a random variable, and the multinomial logit (MNL) model (Agrawal et al. 2019) has become the most popular model for the behavior of the online user, i.e., $c_t$, when $S_t$ is provided. Dynamic assortment optimization with the MNL choice model is also called MNL bandits. In this model, each product $i$ is assumed to be associated with an unknown preference parameter $v_i$, and the probability that a visiting online user chooses product $i$ given assortment $S_t$ is defined by

$$\Pr(c_t = i) \stackrel{\text{def}}{=} \frac{v_i}{1 + \sum_{j \in S_t} v_j}, \tag{1}$$

where we set the preference parameter of no-purchase to $v_0 = 1$. Note that this assumption does not restrict the model much, since one can easily rescale the $v_i$'s to satisfy this condition. Following the literature, we also assume no-purchase is the most frequent choice, i.e., $0 \le v_i \le 1$, which is often a reasonable assumption in sales activities.

During the last decade, MNL bandits have attracted much attention (Rusmevichientong, Shen, and Shmoys 2010; Sauré and Zeevi 2013; Agrawal et al. 2017, 2019; Dong et al. 2020). However, all of the previous works consider maximizing the expected revenue, which is not always appropriate for practical applications. In this paper, we are interested in designing algorithms for a general class of risk criteria.

## Problem Formulation

Suppose that for each product $i \in [N]$, selling it successfully earns the company a profit of $r_i$, which is known beforehand. Without loss of generality, we assume $r_i \in (0, 1]$; this can always be achieved by proper scaling. Moreover, the profit for no-purchase is $r_0 = 0$. Then at time step $t \ge 1$, when assortment $S_t \subseteq [N]$ and preference parameter vector $v = (v_1, \ldots, v_N)$ are given, the profit can be represented by a random variable $X(S_t, v)$ defined by

$$\Pr\big(X(S_t, v) = r_i\big) = \Pr(c_t = i) = \frac{v_i}{1 + \sum_{j \in S_t} v_j} \tag{2}$$

for $i = 0$ and all $i \in S_t$. In addition, we denote by $F(S_t, v)$ the cumulative distribution function of $X(S_t, v)$. Given time horizon $T$, one natural goal, as explained in the introduction, is to find a policy for the decision maker such that the expected revenue, i.e.,

$$\sum_{t=1}^{T} R(S_t, v) = \sum_{t=1}^{T} \mathbb{E}\big[X(S_t, v)\big] \tag{3}$$

is maximized, where $R(S_t, v)$ denotes the expected profit when $S_t$ is served. This objective has been investigated previously in (Agrawal et al. 2017, 2019). In this paper, instead of the expectation, we consider a general class of risk criteria. Some examples of such risk criteria can be found in (Cassel, Mannor, and Zeevi 2018). Suppose $\mathcal{D}$ is the convex set of cumulative distribution functions. In general, we consider a risk criterion $U$ that is a function from $\mathcal{D}$ to $\mathbb{R}$. In the case of the expectation, $U(F) = \int x \, dF(x)$. In particular, since we assumed $r_i \in (0, 1]$, we only need $F \in \mathcal{D}[0, 1]$, where $\mathcal{D}[0, 1]$ denotes the subspace of $\mathcal{D}$ consisting of the cumulative distribution functions of random variables taking values in $[0, 1]$. The goal of this paper is to find a policy such that the quantity

$$\sum_{t=1}^{T} U\big(F(S_t, v)\big) \tag{4}$$

is maximized.
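To make the choice model concrete, here is a minimal Python sketch (ours, not part of the paper; the helper names `mnl_choice` and `profit_distribution` are made up for illustration) that samples a customer choice according to (1) and builds the discrete profit distribution of $X(S, v)$ from (2).

```python
import numpy as np

def mnl_choice(S, v, rng):
    """Sample a customer choice from the MNL model (1).

    S   : list of product indices (the assortment S_t), products are 1-indexed
    v   : array of preference parameters v_1..v_N (v_0 = 1 is implicit)
    rng : numpy random generator
    Returns 0 for no-purchase, otherwise the index of the purchased product.
    """
    weights = np.array([1.0] + [v[i - 1] for i in S])   # v_0 = 1 comes first
    probs = weights / weights.sum()                     # Pr(c_t = i) as in (1)
    return ([0] + list(S))[rng.choice(len(probs), p=probs)]

def profit_distribution(S, v, r):
    """Return the support and probabilities of X(S, v), as in (2)."""
    denom = 1.0 + sum(v[i - 1] for i in S)
    outcomes = [0.0] + [r[i - 1] for i in S]            # r_0 = 0 for no-purchase
    probs = [1.0 / denom] + [v[i - 1] / denom for i in S]
    return np.array(outcomes), np.array(probs)

# toy usage
rng = np.random.default_rng(0)
v = np.array([0.3, 0.8, 0.5])   # preference parameters for products 1..3
r = np.array([1.0, 0.4, 0.7])   # profits for products 1..3
S = [1, 3]
print(mnl_choice(S, v, rng), profit_distribution(S, v, r))
```

Any risk criterion $U$ considered below can then be evaluated on the returned `(outcomes, probs)` pair.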
Let $S^*$ be the smallest assortment such that

$$U(F(S^*, v)) = \max_{S \subseteq [N],\, |S| \le K} U(F(S, v)).$$

The regret of the game after $T$ time steps, which measures the difference between the optimal policy and the policy $\pi$ used by the decision maker, is defined as

$$R^\pi_T([N], v, r) \stackrel{\text{def}}{=} T\, U(F(S^*, v)) - \mathbb{E}\left[\sum_{t=1}^{T} U\big(F(S_t, v)\big)\right], \tag{5}$$

where $r = (r_1, \ldots, r_N)$ and $v = (v_1, \ldots, v_N)$. When it is clear from the context, we usually omit the policy $\pi$ and the parameters $([N], v, r)$. Without much effort, we can see that maximizing (4) is equivalent to minimizing the regret (5).

## Related Work and Our Contribution

To the best of our knowledge, we are the first to study MNL bandits under general risk criteria. In the past decade, there have been many works on the MNL bandit problem that consider maximizing the expected revenue (3). In (Rusmevichientong, Shen, and Shmoys 2010; Sauré and Zeevi 2013), the authors assumed the gap between the best and the second-best assortments is known and proposed explore-then-commit algorithms. Later, in (Agrawal et al. 2019), the authors proposed the state-of-the-art UCB-type algorithm with a regret upper bound of $O(\sqrt{NT \ln T})$. The authors of (Agrawal et al. 2017) utilized the Bayesian method, i.e., Thompson Sampling, to design an algorithm that performs well in practice. For the expected revenue, it is shown in (Chen and Wang 2018) that the lower bound for the regret is $\Omega(\sqrt{NT})$. There are also many previous works studying different risk criteria in multi-armed bandits (Sani, Lazaric, and Munos 2012; Maillard 2013; Galichet, Sebag, and Teytaud 2014; Zimin, Ibsen-Jensen, and Chatterjee 2014; Vakili and Zhao 2016). In (Cassel, Mannor, and Zeevi 2018), the authors established a thorough theory for dealing with general risk criteria.

### Our Contribution

Note that directly applying the algorithm proposed in (Cassel, Mannor, and Zeevi 2018) leads to a regret of $\Omega\big(\sqrt{\binom{N}{K} T}\big)$, since each assortment corresponds to an arm, which is far from optimal. We also cannot simply treat each product as an arm, since the optimal assortment may consist of multiple products: there is an involved relationship between the risk criterion of an assortment and the underlying preference parameters, characterized by two complex structures, namely a general (and non-specific) risk function and the multinomial logit choice model. In this paper, we gain a clearer understanding of the challenge raised by these two structures. More specifically, we identify three mild conditions that are easy to verify and show that the class of risk criteria defined by these conditions encompasses most of the risk criteria of interest in the literature (see Table 1). We also design and analyze an algorithmic framework and prove that our algorithm achieves $\widetilde O(\sqrt{NT})$ regret for any risk criterion belonging to this class.

## Assumptions

In this section, we present the three assumptions that the risk criterion $U$ should satisfy.

**Assumption 1 (Quasiconvexity).** $U$ is quasiconvex on $\mathcal{D}[0, 1]$, i.e., for any $\lambda \in [0, 1]$ and $F_1, F_2 \in \mathcal{D}[0, 1]$,

$$U\big(\lambda F_1 + (1 - \lambda) F_2\big) \le \max\{U(F_1), U(F_2)\}. \tag{6}$$

In addition to quasiconvexity, we also make the following two assumptions on $U$.

**Assumption 2 (Boundedness).** For any $F \in \mathcal{D}[0, 1]$ it holds that $|U(F)| \le \gamma_1$.

**Assumption 3 (One-sided Lipschitz Condition).** For any $v' \ge v$, i.e., $v'_i \ge v_i$ for all $i \in [N]$, and any $S \subseteq [N]$, it holds that

$$U(F(S, v')) - U(F(S, v)) \le \frac{\gamma_2}{1 + \sum_{i \in S} v_i} \sum_{i \in S} (v'_i - v_i).$$

Note that here $\gamma_1$ and $\gamma_2$ are universal constants that depend only on the risk criterion $U$.
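The assumptions above are easy to spot-check numerically. The sketch below is our own illustration (not from the paper): it implements $\mathrm{CVaR}_\alpha$ as the mean of the lower $\alpha$-tail of the discrete profit distribution, draws random upward perturbations $v' \ge v$, and verifies the one-sided Lipschitz inequality of Assumption 3 with the constant $\gamma_2 = 3/\alpha$ taken from Table 1 and Proposition 9 below.

```python
import numpy as np

def dist(S, v, r):
    """Discrete profit distribution of X(S, v), cf. (2); index 0 is no-purchase."""
    denom = 1.0 + sum(v[i - 1] for i in S)
    return ([0.0] + [r[i - 1] for i in S],
            [1.0 / denom] + [v[i - 1] / denom for i in S])

def cvar(outcomes, probs, alpha):
    """CVaR_alpha of a discrete distribution: mean of its lower alpha-tail."""
    order = np.argsort(outcomes)
    x, p = np.asarray(outcomes)[order], np.asarray(probs)[order]
    mass, total = 0.0, 0.0
    for xi, pi in zip(x, p):
        take = min(pi, alpha - mass)
        if take <= 0:
            break
        total, mass = total + xi * take, mass + take
    return total / alpha

def check_assumption3(U, S, v, r, gamma2, trials=1000, seed=0):
    """Empirically test the one-sided Lipschitz condition on random v' >= v."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        vp = np.minimum(v + rng.uniform(0.0, 0.2, size=v.shape), 1.0)  # v' >= v
        lhs = U(*dist(S, vp, r)) - U(*dist(S, v, r))
        rhs = gamma2 / (1 + sum(v[i - 1] for i in S)) * sum(vp[i - 1] - v[i - 1] for i in S)
        if lhs > rhs + 1e-9:
            return False
    return True

# spot-check CVaR_0.5 with gamma2 = 3/alpha on one random instance
alpha = 0.5
rng = np.random.default_rng(1)
v, r = rng.uniform(0.0, 1.0, 5), rng.uniform(0.1, 1.0, 5)
print(check_assumption3(lambda o, p: cvar(o, p, alpha), [1, 3, 5], v, r, gamma2=3 / alpha))
```

Such a check is of course only a sanity test on finite samples; the formal verification for each criterion is given in the propositions later in the paper and in Appendix A.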
We note that quasiconvexity is a natural assumption, as most practical risk criteria considered in the literature are quasiconvex. Table 1 lists the risk criteria we consider, all of which are quasiconvex as shown in (Cassel, Mannor, and Zeevi 2018). As a complement, Table 1 also records whether each risk criterion satisfies Assumption 2 and Assumption 3, and gives concrete values of $\gamma_1$ and $\gamma_2$. It turns out that all the risk criteria listed in Table 1 satisfy all three assumptions except VaR, which does not meet Assumption 3 since it is discontinuous in $v$.

| Risk Criterion (Parameter) | Property | $\gamma_1$ | $\gamma_2$ |
| --- | --- | --- | --- |
| $\mathrm{VaR}_\alpha$ | Quasiconvex | $1$ | does not exist |
| $\mathrm{CVaR}_\alpha$ | Convex | $1$ | $3/\alpha$ |
| $n$th moment | Linear | $1$ | $1$ |
| Entropy risk ($\theta$) | Convex | $1$ | $2e^\theta/\theta$ |
| Below-target semi-variance ($r$) | Linear | $r^2$ | $2r^2$ |
| Negative variance | Convex | $1/4$ | $6$ |
| Mean-variance ($\rho$) | Convex | $1 + \rho/4$ | $2 + 6\rho$ |
| Sharpe ratio ($r, \epsilon$) | Quasiconvex | $\epsilon^{-1/2}$ | $2\epsilon^{-1/2} + 3\epsilon^{-3/2}$ |
| Sortino ratio ($r, \epsilon$) | Quasiconvex | $\epsilon^{-1/2}$ | $2\epsilon^{-1/2} + \epsilon^{-3/2}$ |

Table 1: Widely Used Risk Criteria

## Algorithms

Due to space constraints, we only present Risk Aware UCB, a variant of the UCB-type algorithm proposed in (Agrawal et al. 2019), and its guarantee. In a similar way, we also propose Risk Aware TS, a variant of the Thompson Sampling algorithm proposed in (Agrawal et al. 2017); please refer to Appendix B for its precise description and near-optimal guarantee.

The high-level idea of Risk Aware UCB is as follows. We divide the time steps $[T]$ into episodes. During each episode $\ell$, the same assortment $S_\ell$ is repeatedly offered to the online users until a no-purchase outcome is observed. Specifically, in episode $\ell$ we offer the assortment

$$\operatorname*{argmax}_{S \subseteq [N],\, |S| \le K} U\big(F(S, \tilde v^\ell)\big),$$

where $\tilde v^\ell$ is an optimistic estimate of the real, unknown preference parameters computed before the start of episode $\ell$. The details of Risk Aware UCB are given in Algorithm 1, where $t_{i,\ell}$ is the number of times the online users buy product $i$ in the $\ell$th episode and $\mathcal{T}_i(\ell)$ denotes the collection of episodes in which product $i$ is served before episode $\ell$ (exclusive); a runnable sketch of this episode loop is given after the proof of Theorem 4. We then have the following theoretical upper bound for Risk Aware UCB.

**Theorem 4.** Suppose the risk criterion $U$ satisfies Assumptions 1, 2, and 3. The regret (5) incurred by the decision maker using Risk Aware UCB after $T$ time steps is upper bounded by $\widetilde O(\sqrt{NT})$, where $\widetilde O$ hides poly-logarithmic factors in $N$ and $T$.

Before proceeding, we first prove the following key lemma, which says that the risk value of the optimal assortment computed under an optimistic estimate of the preference parameters is never worse than that of $S^*$.

**Lemma 5 (Monotone Maximum).** For any $v' \ge v$, it holds that

$$\max_{S \subseteq [N],\, |S| \le K} U(F(S, v')) \ge U(F(S^*, v)).$$

**Algorithm 1:** Risk Aware UCB($N, K, r, U$)

1. Initialize $t \leftarrow 1$, $\ell \leftarrow 1$, $\tilde v^\ell_i \leftarrow 1$ for $i \in [N]$, and $\mathcal{T}_i(\ell) \leftarrow \emptyset$ for $i \in [N]$.
2. **while** $t \le T$ **do**
3. &nbsp;&nbsp; $S_\ell \leftarrow \operatorname{argmax}_{S \subseteq [N],\, |S| \le K} U(F(S, \tilde v^\ell))$.
4. &nbsp;&nbsp; Initialize $t_{i,\ell} \leftarrow 0$ for $i \in [N]$.
5. &nbsp;&nbsp; **repeat**
6. &nbsp;&nbsp;&nbsp;&nbsp; Serve $S_\ell$ and observe the customer choice $c_t$.
7. &nbsp;&nbsp;&nbsp;&nbsp; **if** $c_t \ne 0$ **then** $t_{c_t,\ell} \leftarrow t_{c_t,\ell} + 1$.
8. &nbsp;&nbsp;&nbsp;&nbsp; $t \leftarrow t + 1$.
9. &nbsp;&nbsp; **until** $t > T$ or $c_{t-1} = 0$.
10. &nbsp;&nbsp; **for** $i \in [N]$ **do**
11. &nbsp;&nbsp;&nbsp;&nbsp; **if** $i \in S_\ell$ **then** $\mathcal{T}_i(\ell+1) \leftarrow \mathcal{T}_i(\ell) \cup \{\ell\}$
12. &nbsp;&nbsp;&nbsp;&nbsp; **else** $\mathcal{T}_i(\ell+1) \leftarrow \mathcal{T}_i(\ell)$.
13. &nbsp;&nbsp; $\ell \leftarrow \ell + 1$; $T_i(\ell) \leftarrow |\mathcal{T}_i(\ell)|$ for $i \in [N]$.
14. &nbsp;&nbsp; $\bar v^\ell_i \leftarrow \big(\sum_{\ell' \in \mathcal{T}_i(\ell)} t_{i,\ell'}\big) / T_i(\ell)$ for every $i \in [N]$ with $T_i(\ell) \ge 1$.
15. &nbsp;&nbsp; $\tilde v^\ell_i \leftarrow \min\Big\{\bar v^\ell_i + \sqrt{\bar v^\ell_i \, \tfrac{48 \ln(\sqrt N\, \ell + 1)}{T_i(\ell)}} + \tfrac{48 \ln(\sqrt N\, \ell + 1)}{T_i(\ell)},\ 1\Big\}$ for every $i \in [N]$ with $T_i(\ell) \ge 1$.

*Proof of Lemma 5.* Fix $S$. We first prove that $U(F(S, u))$ is a quasiconvex function with respect to the vector $u$. This can be verified by noticing that for any $\lambda \in [0, 1]$ and $u, u'$ we have

$$U\big(F(S, \lambda u + (1-\lambda) u')\big) = U\!\left(\frac{\lambda\big(1 + \sum_{i \in S} u_i\big) F(S, u) + (1-\lambda)\big(1 + \sum_{i \in S} u'_i\big) F(S, u')}{\lambda\big(1 + \sum_{i \in S} u_i\big) + (1-\lambda)\big(1 + \sum_{i \in S} u'_i\big)}\right) \le \max\{U(F(S, u)), U(F(S, u'))\},$$

where the last inequality is due to the quasiconvexity of $U$ on $\mathcal{D}[0, 1]$. Next, we show the following lemma.

**Lemma 6.** Let $V(u)$ be a quasiconvex function defined on $[0, 1]^n$, and suppose there is a point $\bar u = (\bar u_1, \ldots, \bar u_n) \in [0, 1]^n$ satisfying $V(\bar u) > V(u)$ for every point $u \ne \bar u$ such that $u_i = \bar u_i$ or $u_i = 0$ for each $i$ from $1$ to $n$. Then $V(u') \ge V(\bar u)$ for any $u' \ge \bar u$.
*Proof.* For $i = 1, 2, \ldots, n$, consider the points $u^{(i)} = (u^{(i)}_1, \ldots, u^{(i)}_n)$ defined by

$$u^{(i)}_j = \begin{cases} \bar u_j & j \ne i, \\ 0 & j = i. \end{cases}$$

By the hypothesis, $V(u^{(i)}) < V(\bar u)$. For any $u' \ge \bar u$ with $u' \ne \bar u$, define

$$\lambda_i = \frac{u'_i - \bar u_i}{\bar u_i}, \qquad i = 1, 2, \ldots, n,$$

so that $\lambda_i \ge 0$ for all $i$. Let $\tilde u = \sum_{i=1}^{n} \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j}\, u^{(i)}$; this convex combination of the $u^{(i)}$ lies on the same line as $\bar u$ and $u'$. By the quasiconvexity of $V$, we have $V(\tilde u) \le \max_{1 \le i \le n} V(u^{(i)}) < V(\bar u)$. If we define $\lambda = \frac{1}{1 + \sum_{i=1}^{n} \lambda_i}$, then $\bar u = (1 - \lambda)\tilde u + \lambda u'$, and hence

$$V(\bar u) = V\big((1 - \lambda)\tilde u + \lambda u'\big) \le \max\{V(\tilde u), V(u')\},$$

which means we must have $V(\bar u) \le V(u')$. This completes the proof of Lemma 6.

Returning to the proof of Lemma 5, let $S'$ be the smallest assortment such that

$$U(F(S', v')) = \max_{S \subseteq [N],\, |S| \le K} U(F(S, v')). \tag{7}$$

Together with Lemma 6 (applied to the function $u \mapsto U(F(S^*, u))$, whose hypothesis holds because $S^*$ is the smallest assortment attaining the maximum under $v$), we obtain

$$U(F(S', v')) \ge U(F(S^*, v')) \ge U(F(S^*, v)),$$

which concludes the proof of Lemma 5.

**Lemma 7.** Given any $\ell > 0$ and $C_1, C_2 > 0$, define the event

$$\mathcal{E}_\ell \stackrel{\text{def}}{=} \left\{ \forall i \in [N],\ v_i \le \tilde v^\ell_i \le v_i + C_1 \sqrt{\frac{v_i \ln(\sqrt N\, \ell + 1)}{T_i(\ell) \vee 1}} + C_2\, \frac{\ln(\sqrt N\, \ell + 1)}{T_i(\ell) \vee 1} \right\}.$$

There exist real numbers $C_1, C_2 > 0$ such that $\Pr(\mathcal{E}_\ell^c) \le 1/\ell$ for any $\ell$.

Lemma 7 can be derived easily from Lemma 4.1 of (Agrawal et al. 2019), so we omit its proof here.

*Proof of Theorem 4.* Before proceeding, we introduce some notation. Let $L$ be the total number of episodes when Risk Aware UCB stops after $T$ steps. Denote by $l_\ell$ the length of the $\ell$th episode. Moreover, set $n_i = T_i(L)$, the total number of episodes in which product $i$ is served before the $L$th episode. Using the law of total expectation, we rewrite the regret as

$$R_T = \mathbb{E}\left[\sum_{\ell=1}^{L} l_\ell \big(U(F(S^*, v)) - U(F(S_\ell, v))\big)\right] = \mathbb{E}\left[\sum_{\ell=1}^{L} \mathbb{E}\Big[l_\ell \big(U(F(S^*, v)) - U(F(S_\ell, v))\big) \,\Big|\, \mathcal{H}_\ell\Big]\right],$$

where $\mathcal{H}_\ell$ is the history before episode $\ell$. Since $S_\ell$ is determined by $\mathcal{H}_\ell$, we have

$$R_T = \mathbb{E}\left[\sum_{\ell=1}^{L} \mathbb{E}[l_\ell \mid \mathcal{H}_\ell]\,\big(U(F(S^*, v)) - U(F(S_\ell, v))\big)\right].$$

Given $S_\ell$, the length $l_\ell$ follows a geometric distribution with parameter $1/(1 + \sum_{i \in S_\ell} v_i)$. Hence $\mathbb{E}[l_\ell \mid \mathcal{H}_\ell] \le 1 + \sum_{i \in S_\ell} v_i$; we write an inequality here since the last episode may end due to the time limit. Using this inequality, we further derive

$$R_T \le \mathbb{E}\left[\sum_{\ell=1}^{L} \Big(1 + \sum_{i \in S_\ell} v_i\Big)\big(U(F(S^*, v)) - U(F(S_\ell, v))\big)\right] = \sum_{\ell} \mathbb{E}\,\delta_\ell, \tag{8}$$

where we have defined $\delta_\ell \stackrel{\text{def}}{=} \big(1 + \sum_{i \in S_\ell} v_i\big)\big(U(F(S^*, v)) - U(F(S_\ell, v))\big)$. We now focus on bounding $\mathbb{E}\,\delta_\ell$. By a simple calculation, we get

$$\mathbb{E}\,\delta_\ell = \mathbb{E}[\delta_\ell \mathbf{1}_{\mathcal{E}_\ell^c}] + \mathbb{E}[\delta_\ell \mathbf{1}_{\mathcal{E}_\ell}] \le 2\gamma_1 (N+1)\Pr(\mathcal{E}_\ell^c) + \mathbb{E}[\delta_\ell \mid \mathcal{E}_\ell]\Pr(\mathcal{E}_\ell) \le \frac{2\gamma_1 (N+1)}{\ell} + \mathbb{E}[\delta_\ell \mid \mathcal{E}_\ell]\Pr(\mathcal{E}_\ell), \tag{9}$$

where in the first inequality we upper bound $\delta_\ell$ using $v_i \le 1$ and Assumption 2, and the last inequality is due to Lemma 7. By Lemma 5 and Assumption 3, we get

$$\mathbb{E}[\delta_\ell \mid \mathcal{E}_\ell] \le \mathbb{E}\left[\Big(1 + \sum_{i \in S_\ell} v_i\Big)\big(U(F(S_\ell, \tilde v^\ell)) - U(F(S_\ell, v))\big) \,\Big|\, \mathcal{E}_\ell\right] \le \gamma_2\, \mathbb{E}\left[\sum_{i \in S_\ell}\left( C_1 \sqrt{\frac{v_i \ln(\sqrt N\, \ell + 1)}{T_i(\ell) \vee 1}} + C_2\, \frac{\ln(\sqrt N\, \ell + 1)}{T_i(\ell) \vee 1}\right) \,\Bigg|\, \mathcal{E}_\ell\right], \tag{10}$$

where in the last inequality we have used the definition of the event $\mathcal{E}_\ell$. By (9) and (10), we have

$$\mathbb{E}\,\delta_\ell \le \frac{2\gamma_1(N+1)}{\ell} + \gamma_2\, \mathbb{E}\left[\sum_{i \in S_\ell}\left( C_1 \sqrt{\frac{v_i \ln(\sqrt N\, \ell + 1)}{T_i(\ell) \vee 1}} + C_2\, \frac{\ln(\sqrt N\, \ell + 1)}{T_i(\ell) \vee 1}\right)\right]. \tag{11}$$

Putting (11) back into (8) and using $\ln(\sqrt N\, \ell + 1) \le \ln(\sqrt N\, T + 1)$, we derive

$$R_T \le 2\gamma_1(N+1) \underbrace{\sum_{\ell=1}^{T} \frac{1}{\ell}}_{(\star)} + \gamma_2 C_1 \sqrt{\ln(\sqrt N\, T + 1)}\ \underbrace{\mathbb{E}\left[\sum_{\ell}\sum_{i \in S_\ell} \sqrt{\frac{v_i}{T_i(\ell) \vee 1}}\right]}_{(\star\star)} + \gamma_2 C_2 \ln(\sqrt N\, T + 1)\ \underbrace{\mathbb{E}\left[\sum_{\ell}\sum_{i \in S_\ell} \frac{1}{T_i(\ell) \vee 1}\right]}_{(\star\star\star)}. \tag{12}$$

Note that $(\star) = \sum_{\ell=1}^{T} \ell^{-1} \le \ln T + \gamma$, where $\gamma$ is Euler's constant. Next we bound $(\star\star)$. Let $\mathcal{K}$ be the set $\{(i, \ell) : i \in S_\ell,\ T_i(\ell) = 0\}$; it is easy to see that $|\mathcal{K}| \le N$, so these pairs contribute at most $N$. Together with the observation $\sum_{i=1}^{j} \frac{1}{\sqrt i} \le 2\sqrt j$ and Jensen's inequality, we derive

$$(\star\star) \le N + \mathbb{E}\left[\sum_{i \in [N]} \sqrt{v_i}\sum_{j=1}^{n_i} \frac{1}{\sqrt j}\right] \le N + 2\,\mathbb{E}\left[\sum_{i \in [N]} \sqrt{v_i n_i}\right] \le N + 2\sqrt{N\, \mathbb{E}\Big[\sum_{i \in [N]} v_i n_i\Big]},$$

where $n_i$ is the total number of episodes in which product $i$ is served before the $L$th episode. Noting that $\mathbb{E}\big[\sum_{i \in [N]} v_i n_i\big] \le \mathbb{E}\big[\sum_{\ell=1}^{L} \mathbb{E}[l_\ell \mid \mathcal{H}_\ell]\big] = \mathbb{E}\big[\sum_{\ell=1}^{L} l_\ell\big] = T$, we obtain $(\star\star) \le N + 2\sqrt{NT}$. Finally, we bound $(\star\star\star)$ using

$$(\star\star\star) \le N + \mathbb{E}\left[\sum_{i \in [N]} (\ln n_i + \gamma)\right] \le N(1 + \gamma) + N \ln T.$$

Putting the bounds on $(\star)$, $(\star\star)$, and $(\star\star\star)$ back into (12), we obtain

$$R_T \le 2\gamma_1(N+1)(\ln T + \gamma) + \gamma_2 C_1 \sqrt{\ln(\sqrt N\, T + 1)}\,\big(N + 2\sqrt{NT}\big) + \gamma_2 C_2 \ln(\sqrt N\, T + 1)\,\big(N \ln T + N(1 + \gamma)\big),$$

which is $\widetilde O(\sqrt{NT})$, and the proof is complete.
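For concreteness, the following is a minimal Python simulation of the episode loop of Algorithm 1. This is our own sketch, not the authors' released implementation: the argmax over assortments is done by brute force (so it is only practical for small $N$), and the confidence radius follows the form of line 15 of Algorithm 1. The risk criterion is passed in as a function `U(outcomes, probs)` acting on the discrete profit distribution.

```python
import itertools
import numpy as np

def risk_aware_ucb(T, v_true, r, K, U, seed=0):
    """Minimal simulation of the episodic loop of Algorithm 1 (Risk Aware UCB)."""
    rng = np.random.default_rng(seed)
    v_true, r = np.asarray(v_true, float), np.asarray(r, float)
    N = len(v_true)
    n_epochs = np.zeros(N)   # |T_i(l)|: number of episodes in which product i was offered
    n_buys = np.zeros(N)     # total purchases of product i over those episodes
    v_ucb = np.ones(N)       # optimistic estimates, initialized to 1
    t, ell, rewards = 0, 1, []

    def dist(S, v):
        # discrete profit distribution of X(S, v), cf. (2); index 0 is no-purchase
        denom = 1.0 + v[S].sum()
        return np.concatenate(([0.0], r[S])), np.concatenate(([1.0 / denom], v[S] / denom))

    while t < T:
        # line 3: S_l <- argmax over assortments of size <= K of U(F(S, v_ucb))
        best, S_ell = -np.inf, None
        for k in range(1, K + 1):
            for S in itertools.combinations(range(N), k):
                S = np.array(S)
                val = U(*dist(S, v_ucb))
                if val > best:
                    best, S_ell = val, S
        # lines 5-9: offer S_l until a no-purchase happens or time runs out
        while t < T:
            probs = dist(S_ell, v_true)[1]
            choice = rng.choice(len(S_ell) + 1, p=probs)   # 0 = no purchase
            rewards.append(0.0 if choice == 0 else r[S_ell[choice - 1]])
            t += 1
            if choice == 0:
                break
            n_buys[S_ell[choice - 1]] += 1
        # lines 10-15: episode bookkeeping and optimistic estimate update
        n_epochs[S_ell] += 1
        ell += 1
        served = n_epochs > 0
        v_bar = np.divide(n_buys, n_epochs, out=np.ones(N), where=served)
        bonus = 48.0 * np.log(np.sqrt(N) * ell + 1) / np.maximum(n_epochs, 1)
        v_ucb = np.where(served, np.minimum(v_bar + np.sqrt(v_bar * bonus) + bonus, 1.0), 1.0)
    return np.array(rewards)
```

For example, `risk_aware_ucb(10000, v, r, K=2, U=lambda o, p: cvar(o, p, 0.5))`, with `cvar` as in the earlier sketch, runs the algorithm on a toy instance under the $\mathrm{CVaR}_{0.5}$ objective used in the synthetic experiment below.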
## Examples of Risk Criteria

In this section, we show that conditional value-at-risk, the Sharpe ratio, and entropy risk all satisfy Assumption 2 and Assumption 3. For the proofs for the other risk criteria listed in Table 1, we refer to Appendix A. The following lemma is useful for proving the one-sided Lipschitz condition; its proof is given in Appendix A.

**Lemma 8.** For any $v' \ge v$, i.e., $v'_i \ge v_i$ for all $i \in [N]$, and any $S \subseteq [N]$, it holds that

$$\sum_{i \in S}\left|\frac{v'_i}{1 + \sum_{j \in S} v'_j} - \frac{v_i}{1 + \sum_{j \in S} v_j}\right| \le \frac{2}{1 + \sum_{i \in S} v_i}\sum_{i \in S}(v'_i - v_i).$$

### Conditional Value-at-Risk

Given $\alpha \in (0, 1]$, the conditional value-at-risk at the $\alpha$ percentile for $F \in \mathcal{D}[0, 1]$ is defined as

$$\mathrm{CVaR}_\alpha(F) \stackrel{\text{def}}{=} \frac{1}{\alpha}\int_0^\alpha \mathrm{VaR}_\beta(F)\, d\beta.$$

An equivalent definition is

$$\mathrm{CVaR}_\alpha(F) = 1 - \frac{1}{\alpha}\int_0^1 \big(F(x) \wedge \alpha\big)\, dx.$$

**Proposition 9.** $\mathrm{CVaR}_\alpha$ satisfies Assumption 2 and Assumption 3 with $\gamma_1 = 1$ and $\gamma_2 = 3/\alpha$.

*Proof.* It is easy to see that $|\mathrm{CVaR}_\alpha(F(S, v))| \le 1$, which gives $\gamma_1 = 1$. We now derive the value of $\gamma_2$. Without loss of generality, we may assume that the profits of different products are distinct, since items with the same profit can be combined into a single product whose preference parameter is the sum of theirs. Given assortment $S$, denote the products in $S$ by $[|S|]$ in increasing order of profit, and recall $r_0 = 0$. Then for any $k + 1 \in [|S|]$ and $x \in [r_k, r_{k+1})$,

$$F(S, v'; x) - F(S, v; x) = \frac{1 + \sum_{i=1}^{k} v'_i}{1 + \sum_{i=1}^{|S|} v'_i} - \frac{1 + \sum_{i=1}^{k} v_i}{1 + \sum_{i=1}^{|S|} v_i}.$$

Note that for any $x \in [r_k, r_{k+1})$, by Lemma 8,

$$|F(S, v'; x) - F(S, v; x)| \le \left|\frac{1}{1 + \sum_{i=1}^{|S|} v'_i} - \frac{1}{1 + \sum_{i=1}^{|S|} v_i}\right| + \sum_{i=1}^{k}\left|\frac{v'_i}{1 + \sum_{j=1}^{|S|} v'_j} - \frac{v_i}{1 + \sum_{j=1}^{|S|} v_j}\right| \le \frac{\sum_{i=1}^{|S|}(v'_i - v_i)}{\big(1 + \sum_{i=1}^{|S|} v'_i\big)\big(1 + \sum_{i=1}^{|S|} v_i\big)} + \frac{2}{1 + \sum_{i=1}^{|S|} v_i}\sum_{i \in S}(v'_i - v_i) \le \frac{3}{1 + \sum_{i=1}^{|S|} v_i}\sum_{i \in S}(v'_i - v_i).$$

Clearly the difference between $F(S, v'; x) \wedge \alpha$ and $F(S, v; x) \wedge \alpha$ satisfies the same bound. Hence

$$|\mathrm{CVaR}_\alpha(F(S, v')) - \mathrm{CVaR}_\alpha(F(S, v))| \le \frac{1}{\alpha}\int_0^1 \big|F(S, v'; x) \wedge \alpha - F(S, v; x) \wedge \alpha\big|\, dx \le \frac{3/\alpha}{1 + \sum_{i=1}^{|S|} v_i}\sum_{i=1}^{|S|}(v'_i - v_i).$$

### Sharpe Ratio

Given a minimum average reward $r \in [0, 1]$ and a regularization factor $\epsilon > 0$, for $F \in \mathcal{D}[0, 1]$ we define

$$\mathrm{Sh}_{r,\epsilon}(F) = \frac{U_1(F) - r}{\sqrt{\epsilon + \sigma^2(F)}},$$

where $U_1(F)$ is the mean and $\sigma^2(F)$ is the variance.

**Proposition 10.** $\mathrm{Sh}_{r,\epsilon}$ satisfies Assumption 2 and Assumption 3 with $\gamma_1 = \epsilon^{-1/2}$ and $\gamma_2 = 2\epsilon^{-1/2} + 3\epsilon^{-3/2}$.

*Proof.* Since $\sigma^2(F) \ge 0$, $U_1(F) \in [0, 1]$, and $r \in [0, 1]$, it is easy to see that we may take $\gamma_1 = \epsilon^{-1/2}$. For the value of $\gamma_2$, by the Lipschitz property of the mean and the variance (see Appendix A), we have

$$|\mathrm{Sh}_{r,\epsilon}(F(S, v')) - \mathrm{Sh}_{r,\epsilon}(F(S, v))| \le \frac{\big|U_1(F(S, v')) - U_1(F(S, v))\big|}{\sqrt{\epsilon + \sigma^2(F(S, v'))}} + \big|U_1(F(S, v)) - r\big|\left|\frac{1}{\sqrt{\epsilon + \sigma^2(F(S, v'))}} - \frac{1}{\sqrt{\epsilon + \sigma^2(F(S, v))}}\right|$$

$$\le \frac{1}{\sqrt\epsilon}\big|U_1(F(S, v')) - U_1(F(S, v))\big| + \frac{\big|\sigma^2(F(S, v')) - \sigma^2(F(S, v))\big|}{\sqrt{\epsilon + \sigma^2(F(S, v'))}\,\sqrt{\epsilon + \sigma^2(F(S, v))}\,\big(\sqrt{\epsilon + \sigma^2(F(S, v'))} + \sqrt{\epsilon + \sigma^2(F(S, v))}\big)}$$

$$\le \frac{1}{\sqrt\epsilon}\big|U_1(F(S, v')) - U_1(F(S, v))\big| + \frac{1}{2\epsilon^{3/2}}\big|\sigma^2(F(S, v')) - \sigma^2(F(S, v))\big| \le \frac{2\epsilon^{-1/2} + 3\epsilon^{-3/2}}{1 + \sum_{i \in S} v_i}\sum_{i \in S}(v'_i - v_i).$$

### Entropy Risk

Given the risk-aversion parameter $\theta > 0$, the entropy risk measure for $F \in \mathcal{D}[0, 1]$ is defined as

$$U^{\mathrm{ent}}(F) = -\frac{1}{\theta}\ln\left(\int_0^1 e^{-\theta x}\, dF(x)\right).$$

**Proposition 11.** $U^{\mathrm{ent}}$ satisfies Assumption 2 and Assumption 3 with $\gamma_1 = 1$ and $\gamma_2 = 2e^\theta/\theta$.

*Proof.* By Jensen's inequality, we always have $|U^{\mathrm{ent}}(F)| \le 1$. Given a fixed $\theta > 0$, we know that

$$\int_0^1 e^{-\theta x}\, dF(S, v; x) = \sum_{i \in S \cup \{0\}} e^{-\theta r_i}\, \frac{v_i}{1 + \sum_{j \in S} v_j} \in [e^{-\theta}, 1].$$
Since the log function is $e^\theta$-Lipschitz on $[e^{-\theta}, 1]$, together with Lemma 8 we have

$$|U^{\mathrm{ent}}(F(S, v')) - U^{\mathrm{ent}}(F(S, v))| \le \frac{e^\theta}{\theta}\left|\sum_{i \in S \cup \{0\}} e^{-\theta r_i}\,\frac{v'_i}{1 + \sum_{j \in S} v'_j} - \sum_{i \in S \cup \{0\}} e^{-\theta r_i}\,\frac{v_i}{1 + \sum_{j \in S} v_j}\right| \le \frac{2e^\theta/\theta}{1 + \sum_{i \in S} v_i}\sum_{i \in S}(v'_i - v_i),$$

where the last inequality holds because the two choice-probability vectors both sum to one, so the $i = 0$ terms can be eliminated, and then $|e^{-\theta r_i} - 1| \le 1$ and Lemma 8 apply to the remaining $i \in S$ terms.

## Experimental Evaluation

We evaluate Risk Aware UCB and Risk Aware TS against UCB and TS, where the latter two algorithms are set to maximize the expected revenue, on both synthetic and real data. All of the algorithms are implemented in Python 3 based on the BanditPyLib library (Holtz, Tao, and Xi 2020); please refer to https://github.com/Alanthink/aaai2021 for the source code.

### Synthetic Data

In this experiment, we fix the number of products $N = 10$, the cardinality limit $K = 4$, and the horizon $T = 10^6$, and set the goal to be $U = \mathrm{CVaR}_{0.5}$. We generate 10 random input instances with $v_i$ drawn uniformly from $[0, 1]$ and $r_i$ drawn uniformly from $[0.1, 1]$. For each input instance, we run 20 repetitions and take their average as the regret. Figure 1 shows how the worst regret among all input instances changes with the square root of time.

*Figure 1: Synthetic Data. Regret, plotted against the square root of time, for UCB, TS, Risk Aware UCB, and Risk Aware TS.*

### Real Data

In this experiment, we consider the Car Evaluation Database from the UCI Machine Learning Repository (Dua and Graff 2017), which contains 6 categorical attributes for $N = 1728$ cars and a consumer rating for each car. We fix the cardinality limit $K = 100$ and the horizon $T = 10^6$, and set the goal to be $U = \mathrm{CVaR}_{0.05}$. By transforming each attribute into a one-hot vector, we obtain an attribute vector $m_i \in \{0, 1\}^{21}$ for each car. There are four different values for the customer rating, i.e., "acceptable", "good", "very good", and "unacceptable". We encode "unacceptable" as 0 and the others as 1 to represent whether the customer has the intention to buy the car. We use logistic regression to predict whether the customer is likely to buy the car, and the probability that the customer buys car $i$ is modeled by

$$\frac{1}{1 + \exp(-\theta^\top m_i)},$$

where $\theta \in \mathbb{R}^{21}$ is an unknown parameter. After the model is fit with L2 regularization, we set the preference parameter $v_i$ of car $i$ to be the probability predicted by the logistic regression. Since there is no profit data available for the cars in this dataset, we generate a uniformly distributed profit $r_i \in [0.1, 1]$ for each car. We run the experiment for 40 repetitions and compute the average $\mathrm{CVaR}_{0.05}$ over every consecutive 1000 revealed profits. To save time, when computing the assortment with the best $\mathrm{CVaR}_{0.05}$, we perform a local search, i.e., we try to replace a car, add a car, or delete a car, and stop when no such move yields a better assortment; a small code sketch of this local search is given after the discussion below. Figure 2 shows the results of the experiment.

*Figure 2: Real Data. Average $\mathrm{CVaR}_{0.05}$ over time (up to $t = 10^6$) for UCB, TS, Risk Aware UCB, and Risk Aware TS.*

### Discussion

From Figure 1, we can see that both Risk Aware UCB and Risk Aware TS suffer a $\sqrt t$-rate regret. Moreover, Risk Aware TS performs better than Risk Aware UCB, which aligns with the common observation in the literature that Thompson Sampling performs better in practice. From Figure 2, we can see that the $\mathrm{CVaR}_{0.05}$ obtained by UCB and TS is far from optimal, while Risk Aware UCB and Risk Aware TS perform roughly the same. In both experiments, the proposed algorithms Risk Aware UCB and Risk Aware TS perform much better than UCB and TS.
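For reference, the add/delete/swap local search used above to approximate $\operatorname{argmax}_S U(F(S, v))$ can be sketched as follows. This is our own illustration, assuming a first-improvement hill-climbing rule and an empty starting assortment; the released code may differ in these details.

```python
import numpy as np

def local_search_assortment(U, v, r, K):
    """Hill climbing for an assortment maximizing U(F(S, v)) subject to |S| <= K.

    Repeatedly tries to add, delete, or swap a single product and stops when no
    single move improves the objective; a heuristic, not an exact argmax.
    v and r are 0-indexed arrays of preference parameters and profits.
    """
    N = len(v)

    def dist(S):
        # discrete profit distribution of X(S, v), cf. (2); index 0 is no-purchase
        denom = 1.0 + sum(v[i] for i in S)
        return ([0.0] + [r[i] for i in S],
                [1.0 / denom] + [v[i] / denom for i in S])

    S, best = set(), U(*dist(set()))
    improved = True
    while improved:
        improved = False
        candidates = []
        for i in set(range(N)) - S:
            if len(S) < K:
                candidates.append(S | {i})                      # add i
            candidates.extend((S - {j}) | {i} for j in S)       # swap j -> i
        candidates.extend(S - {j} for j in S)                   # delete j
        for cand in candidates:
            val = U(*dist(cand))
            if val > best + 1e-12:
                S, best, improved = cand, val, True
                break
    return sorted(S), best
```

With `U = lambda o, p: cvar(o, p, 0.05)` (the `cvar` helper from the earlier sketch), this reproduces the kind of heuristic assortment optimization described in the real-data experiment.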
## Conclusion

In this work, we have presented near-optimal algorithms for a general class of risk criteria, which only need to satisfy three mild assumptions. Experiments with both synthetic and real data are conducted to validate our results and to show that the ordinary algorithms suffer worse performance when the goal is changed.

## Acknowledgments

This work was supported in part by NSF IIS-1633215, CCF-1844234, CCF-2006591, and CCF-2006526.

## References

Agrawal, S.; Avadhanula, V.; Goyal, V.; and Zeevi, A. 2017. Thompson Sampling for the MNL-Bandit. In Conference on Learning Theory (COLT).

Agrawal, S.; Avadhanula, V.; Goyal, V.; and Zeevi, A. 2019. MNL-bandit: A dynamic learning approach to assortment selection. Operations Research 67(5): 1453-1485.

Cassel, A.; Mannor, S.; and Zeevi, A. 2018. A general approach to multi-armed bandits under risk criteria. In Conference on Learning Theory (COLT).

Chen, X.; and Wang, Y. 2018. A note on a tight lower bound for capacitated MNL-bandit assortment selection models. Operations Research Letters 46(5): 534-537.

Dong, K.; Li, Y.; Zhang, Q.; and Zhou, Y. 2020. Multinomial Logit Bandit with Low Switching Cost. In International Conference on Machine Learning (ICML).

Dua, D.; and Graff, C. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml (accessed 2020-09-08).

Galichet, N.; Sebag, M.; and Teytaud, O. 2014. Exploration vs Exploitation vs Safety: Risk-averse Multi-Armed Bandits. In Asian Conference on Machine Learning (ACML).

Holtz, C.; Tao, C.; and Xi, G. 2020. BanditPyLib: a lightweight python library for bandit algorithms. Online at https://github.com/Alanthink/banditpylib; documentation at https://alanthink.github.io/banditpylib-doc.

Kamath, G. 2015. Bounds on the Expectation of the Maximum of Samples from a Gaussian. http://www.gautamkamath.com/writings/gaussian_max.pdf (accessed 2020-09-08).

Maillard, O. 2013. Robust Risk-Averse Stochastic Multi-armed Bandits. In Algorithmic Learning Theory (ALT).

Rusmevichientong, P.; Shen, Z. M.; and Shmoys, D. B. 2010. Dynamic Assortment Optimization with a Multinomial Logit Choice Model and Capacity Constraint. Operations Research 58(6): 1666-1680.

Sani, A.; Lazaric, A.; and Munos, R. 2012. Risk-Aversion in Multi-armed Bandits. In Advances in Neural Information Processing Systems (NIPS), 3284-3292.

Sauré, D.; and Zeevi, A. 2013. Optimal Dynamic Assortment Planning with Demand Learning. Manufacturing & Service Operations Management 15(3): 387-404.

Vakili, S.; and Zhao, Q. 2016. Risk-Averse Multi-Armed Bandit Problems Under Mean-Variance Measure. IEEE Journal of Selected Topics in Signal Processing 10(6): 1093-1111.

Zimin, A.; Ibsen-Jensen, R.; and Chatterjee, K. 2014. Generalized Risk-Aversion in Stochastic Multi-Armed Bandits. arXiv preprint arXiv:1405.0833.