# Thompson Sampling on Symmetric α-Stable Bandits

Abhimanyu Dubey and Alex Sandy Pentland
Massachusetts Institute of Technology
{dubeya, pentland}@mit.edu

**Abstract.** Thompson Sampling provides an efficient technique to introduce prior knowledge in the multi-armed bandit problem, along with providing remarkable empirical performance. In this paper, we revisit the Thompson Sampling algorithm under rewards drawn from symmetric α-stable distributions, which are a class of heavy-tailed probability distributions utilized in finance and economics, in problems such as modeling stock prices and human behavior. We present an efficient framework for posterior inference, which leads to two algorithms for Thompson Sampling in this setting. We prove finite-time regret bounds for both algorithms, and demonstrate through a series of experiments the stronger performance of Thompson Sampling in this setting. With our results, we provide an exposition of symmetric α-stable distributions in sequential decision-making, and enable sequential Bayesian inference in applications from diverse fields in finance and complex systems that operate on heavy-tailed features.

(Full version, with appendix, available at this link.)

## 1 Introduction

The multi-armed bandit (MAB) problem is a fundamental model in understanding the exploration-exploitation dilemma in sequential decision-making. The problem and several of its variants have been studied extensively over the years, and a number of algorithms have been proposed that optimally solve the bandit problem when the reward distributions are well-behaved, i.e. have a finite support, or are sub-exponential. The most prominently studied class of algorithms are the Upper Confidence Bound (UCB) algorithms, which employ an optimism-in-the-face-of-uncertainty heuristic [Auer et al., 2002] and have been shown to be optimal (in terms of regret) in several cases [Cappé et al., 2013; Bubeck et al., 2013]. Over the past few years, however, there has been a resurgence of interest in the Thompson Sampling (TS) algorithm [Thompson, 1933], which approaches the problem from a Bayesian perspective. Rigorous empirical evidence in favor of TS demonstrated by [Chapelle and Li, 2011] sparked new interest in the theoretical analysis of the algorithm, and the seminal work of [Agrawal and Goyal, 2012; Agrawal and Goyal, 2013; Russo and Van Roy, 2014] demonstrated the optimality of TS when rewards are bounded in [0, 1] or are Gaussian. These results were extended in the work of [Korda et al., 2013] to more general, exponential family reward distributions. The empirical studies, along with theoretical guarantees, have established TS as a powerful algorithm for the MAB problem.

However, when designing decision-making algorithms for complex systems, we see that interactions in such systems often lead to heavy-tailed and power-law distributions, such as modeling stock prices [Bradley and Taqqu, 2003], preferential attachment in social networks [Mahanti et al., 2013], and online behavior on websites [Kumar and Tomkins, 2010]. Specifically, we consider a family of extremely heavy-tailed reward distributions known as α-stable distributions. This family refers to a class of distributions parameterized by the exponent α, which includes the Gaussian (α = 2), Lévy (α = 1/2) and Cauchy (α = 1) distributions, all of which are used extensively in economics [Frain, 2009], finance [Carr and Wu, 2003] and signal processing [Shao and Nikias, 1993].
The primary hurdle in creating machine learning algorithms that account for α-stable distributions, however, is their intractable probability density, which cannot be expressed analytically. This prevents even a direct evaluation of the likelihood under this distribution. Their heavy-tailed nature, additionally, often leads to standard algorithms (such as Thompson Sampling assuming Gaussian rewards) concentrating on incorrect arms. In this paper, we create two algorithms for Thompson Sampling under symmetric α-stable rewards with finite means. Our contributions can be summarized as follows:

1. Using auxiliary variables, we construct a framework for posterior inference in symmetric α-stable bandits that leads to the first efficient algorithm for Thompson Sampling in this setting, which we call α-TS.
2. To the best of our knowledge, we provide the first finite-time polynomial bound on the Bayesian Regret of Thompson Sampling achieved by α-TS in this setting.
3. We improve on the regret by proposing a modified Thompson Sampling algorithm, called Robust α-TS, that utilizes a truncated mean estimator, and obtain the first $O(N^{\frac{1}{1+\epsilon}})$ Bayesian Regret in the α-stable setting. Our bound matches the optimal bound for α ∈ (1, 2) (within logarithmic factors).
4. Through a series of experiments, we demonstrate the proficiency of our two Thompson Sampling algorithms for α-stable rewards, which consistently outperform all existing benchmarks.

Our paper is organized as follows: we first give a technical overview of the MAB problem, the Thompson Sampling algorithm and α-stable distributions. Next, we provide the central algorithm α-TS and its analysis, followed by the same for the Robust α-TS algorithm. We then provide experimental results on multiple simulation benchmarks and, finally, discuss the related work in this area prior to closing remarks.

## 2 Preliminaries

### 2.1 Thompson Sampling

**The K-Armed Bandit Problem:** In any instance of the K-armed bandit problem, there exists an agent with access to a set of K actions (or arms). The learning proceeds in rounds, indexed by t ∈ [1, T]. The total number of rounds, known as the time horizon T, is known in advance. The problem is iterative, wherein for each round t ∈ [T]:

1. The agent picks arm $A_t \in [K]$.
2. The agent observes reward $r_{A_t}(t)$ from that arm.

For arm k ∈ [K], rewards come from a distribution $\mathcal{D}_k$ with mean $\mu_k = \mathbb{E}_{\mathcal{D}_k}[r]$ (α-stable distributions with α ≤ 1 do not admit a finite first moment; to continue with existing measures of regret, we only consider rewards with α > 1). The largest expected reward is denoted by $\mu^* = \max_{k \in [K]} \mu_k$, and the corresponding arm(s) is denoted as the optimal arm(s) $k^*$. In our analysis, we will focus exclusively on the i.i.d. setting, that is, for each arm, rewards are independently and identically drawn from $\mathcal{D}_k$ every time arm k is pulled. To measure the performance of any (possibly randomized) algorithm we utilize a measure known as Regret R(T), which, at any round T, is the difference between the expected reward of always playing an optimal arm and the cumulative mean reward of the algorithm:

$$R(T) = T\mu^* - \sum_{t=1}^{T} \mu_{A_t} \tag{1}$$

**Thompson Sampling (TS):** Thompson Sampling [Thompson, 1933] proceeds by maintaining a posterior distribution over the parameters of the bandit arms. If we assume that for each arm k, the reward distribution $\mathcal{D}_k$ is parameterized by a (possibly vector-valued) parameter $\theta_k$ that comes from a set Θ with a prior probability distribution $p(\theta_k)$ over the parameters, the Thompson Sampling algorithm proceeds by selecting arms based on the posterior probability of the reward under the arms.
For each round t ∈ [T], the agent:

1. Draws parameters $\hat{\theta}_k(t)$ for each arm k ∈ [K] from the posterior distribution of parameters, given the previous rewards $r_k(t-1) = \{r_k^{(1)}, r_k^{(2)}, \dots\}$ obtained up to round t − 1 (note that the posterior distribution for each arm only depends on the rewards obtained using that arm). When t = 1, this is just the prior distribution over the parameters.

$$\hat{\theta}_k(t) \sim p(\theta_k \mid r_k(t-1)) \propto p(r_k(t-1) \mid \theta_k)\, p(\theta_k) \tag{2}$$

2. Given the drawn parameters $\hat{\theta}_k(t)$ for each arm, chooses the arm with the largest mean reward over the posterior distribution:

$$A_t = \arg\max_{k \in [K]} \mu_k(\hat{\theta}_k(t)) \tag{3}$$

3. Obtains reward $r_t$ after taking action $A_t$ and updates the posterior distribution for arm $A_t$.

In the Bayesian case, the measure of performance we will utilize in this paper is the Bayes Regret (BR) [Russo and Van Roy, 2014], which is the expected regret over the priors. For any policy π, this is given by:

$$\text{BayesRegret}(T, \pi) = \mathbb{E}[R(T)]. \tag{4}$$

While the regret provides a stronger analysis, any bound on the Bayes Regret is essentially a bound on the expected regret, since if an algorithm admits a Bayes Regret of $O(g(T))$, then its Regret is also stochastically bounded by $g(\cdot)$ [Russo and Van Roy, 2014].

### 2.2 α-Stable Distributions

α-Stable distributions, introduced by Lévy [Lévy, 1925], are a class of probability distributions defined over ℝ whose members are closed under linear transformations.

**Definition 1** ([Borak et al., 2005]). Let $X_1$ and $X_2$ be two independent instances of the random variable X. X is stable if, for $a_1 > 0$ and $a_2 > 0$, $a_1 X_1 + a_2 X_2$ follows the same distribution as $cX + d$ for some $c > 0$ and $d \in \mathbb{R}$.

A random variable $X \sim S_\alpha(\beta, \mu, \sigma)$ follows an α-stable distribution described by the parameters α ∈ (0, 2] (characteristic exponent) and β ∈ [−1, 1] (skewness), which are responsible for the shape and concentration of the distribution, and parameters µ ∈ ℝ (shift) and σ ∈ ℝ⁺ (scale), which correspond to the location and scale respectively. While it is not possible to analytically express the density function for generic α-stable distributions, they are known to admit the characteristic function $\varphi(x; \alpha, \beta, \sigma, \mu)$:

$$\varphi(x; \alpha, \beta, \sigma, \mu) = \exp\left\{ i x \mu - |\sigma x|^\alpha \left(1 - i\beta\, \mathrm{sign}(x)\, \Phi_\alpha(x)\right) \right\},$$

where $\Phi_\alpha(x)$ is given by

$$\Phi_\alpha(x) = \begin{cases} \tan\left(\frac{\pi\alpha}{2}\right) & \text{when } \alpha \neq 1, \\ -\frac{2}{\pi}\log|x| & \text{when } \alpha = 1. \end{cases}$$

For fixed values of α, β, σ and µ we can recover the density function from $\varphi(\cdot)$ via the inverse Fourier transform:

$$f(z) = \frac{1}{2\pi}\int_{\mathbb{R}} \varphi(x; \alpha, \beta, \sigma, \mu)\, e^{-izx}\, dx$$

**Algorithm 1: Chambers–Mallows–Stuck Generation**
Input: $V \sim U(-\pi/2, \pi/2)$, $W \sim E(1)$
Output: $X \sim S_\alpha(\beta, \sigma, \mu)$
Set $B_{\alpha,\beta} = \arctan\left(\beta\tan\left(\tfrac{\pi\alpha}{2}\right)\right)/\alpha$
Set $S_{\alpha,\beta} = \left(1 + \beta^2\tan^2\left(\tfrac{\pi\alpha}{2}\right)\right)^{1/(2\alpha)}$
Set $Y = S_{\alpha,\beta}\, \dfrac{\sin(\alpha(V + B_{\alpha,\beta}))}{\cos(V)^{1/\alpha}} \left(\dfrac{\cos(V - \alpha(V + B_{\alpha,\beta}))}{W}\right)^{\frac{1-\alpha}{\alpha}}$
return $X = \sigma Y + \mu$.
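To make Algorithm 1 concrete, here is a minimal Python sketch of the Chambers–Mallows–Stuck transform (for α ≠ 1); the function name and the short demonstration at the end are our own additions, not from the paper:

```python
# A minimal sketch of Algorithm 1 for alpha != 1; names and the demonstration are ours.
import numpy as np

def cms_stable(alpha, beta, sigma=1.0, mu=0.0, size=1, rng=None):
    """Draw samples from S_alpha(beta, sigma, mu) via the Chambers-Mallows-Stuck transform."""
    rng = np.random.default_rng() if rng is None else rng
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)   # V ~ U(-pi/2, pi/2)
    W = rng.exponential(1.0, size)                 # W ~ E(1)
    B = np.arctan(beta * np.tan(np.pi * alpha / 2)) / alpha
    S = (1 + beta ** 2 * np.tan(np.pi * alpha / 2) ** 2) ** (1 / (2 * alpha))
    Y = (S * np.sin(alpha * (V + B)) / np.cos(V) ** (1 / alpha)
         * (np.cos(V - alpha * (V + B)) / W) ** ((1 - alpha) / alpha))
    return sigma * Y + mu                          # rescale S_alpha(beta, 1, 0) to S_alpha(beta, sigma, mu)

# Heavy tails for a symmetric draw with alpha = 1.3: the median |X| is moderate,
# while the largest |X| in the sample is typically several orders of magnitude larger.
samples = cms_stable(alpha=1.3, beta=0.0, size=100_000)
print(np.median(np.abs(samples)), np.max(np.abs(samples)))
```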
Most of the attention in the analysis of α-stable distributions has been focused on the stability parameter α, which is responsible for the tail "fatness". It can be shown that, asymptotically (as x → ±∞), the tail behavior of $X \sim S_\alpha(\beta, \mu, \sigma)$ follows [Borak et al., 2005]:

$$f(x) \sim \frac{1}{|x|^{1+\alpha}}\, \sigma^\alpha \left(1 + \mathrm{sgn}(x)\beta\right) \sin\left(\frac{\pi\alpha}{2}\right) \frac{\Gamma(\alpha+1)}{\pi},$$

where α < 2 and Γ(·) denotes the Gamma function. The power-law relationship admitted by the density is responsible for the heaviness of said tails.

**Lemma 1** ([Borak et al., 2005]). $X \sim S_\alpha(\beta, \mu, \sigma)$ with α < 2 admits a finite moment of order λ only if λ < α.

From Lemma 1 it follows that α-stable random variables only admit a finite mean for α > 1, and admit a finite variance only when α = 2, which corresponds to the family of Gaussian distributions. To continue with existing measures of regret, we hence restrict our analysis to α-stable distributions with α > 1. Note that for all our discussions, 1 < α < 2; hence, all distributions examined are heavy-tailed, with infinite variance. Additionally, we restrict ourselves to symmetric (β = 0) distributions: asymmetric distributions do not allow a scaled mixture representation (which is the basis of our framework, see Section 3.1).

**Sampling from α-Stable Densities:** For general values of α, β, σ, µ, it is not possible to analytically express the density of α-stable distributions, and hence we resort to using auxiliary variables for sampling. The Chambers–Mallows–Stuck algorithm [Chambers et al., 1976] is a straightforward method to generate samples from the density $S_\alpha(\beta, 1, 0)$ (for α ≠ 1) via a non-linear transformation of a uniform random variable V and an exponential random variable W, which can then be re-scaled to obtain samples from $S_\alpha(\beta, \sigma, \mu)$ (Algorithm 1).

**Products of α-Stable Densities:** A central property of α-stable densities that will be utilized in the following section is the behavior of products of independent α-stable variables.

**Lemma 2** ([Borak et al., 2005]). Let Z and Y > 0 be independent random variables such that $Z \sim S_\gamma(0, \sigma_1, \mu_1)$ and $Y \sim S_\delta(1, \sigma_2, \mu_2)$. Then $ZY^{1/\gamma}$ is stable with exponent γδ.

We now begin with the algorithm description and analysis.

## 3 α-Thompson Sampling

We consider the setting where, for an arm k, the corresponding reward distribution is given by $\mathcal{D}_k = S_\alpha(0, \sigma, \mu_k)$, where α ∈ (1, 2) and σ ∈ ℝ⁺ are known in advance, and $\mu_k$ is unknown. (Typically, more general settings have been explored for TS in the Gaussian case, where the variance is also unknown. Our algorithm can also be extended to an unknown scale σ using an inverse Gamma prior, similar to [Godsill and Kuruoglu, 1999].) We set a prior distribution over $\mu_k$. We now derive the form of the prior distribution, and outline an algorithm for Bayesian inference.

### 3.1 Scale Mixtures of Normals

On setting γ = 2 (Gaussian) and δ = α/2 < 1 in Lemma 2, the product distribution $X = ZY^{1/2}$ is stable with exponent α. This property is an instance of the general framework of scale mixtures of normals (SMiN) [Andrews and Mallows, 1974], which are described by the following:

$$p_X(x) = \int_0^\infty \mathcal{N}(x \mid 0, \lambda\sigma^2)\, p_\Lambda(\lambda)\, d\lambda \tag{5}$$

This framework contains a large class of heavy-tailed distributions, which include the exponential power law, Student's-t and symmetric α-stable distributions [Godsill and Kuruoglu, 1999]. The precise form of the variable X depends on the mixing distribution $p_\Lambda(\lambda)$. For instance, when $p_\Lambda$ is the inverted Gamma distribution (the conjugate prior for an unknown-variance Gaussian), the resulting $p_X$ follows a Student's-t distribution.

Bayesian inference directly from $S_\alpha(0, \sigma, \mu)$ is difficult: the non-analytic density prevents a direct evaluation of the likelihood function, and the non-Gaussianity introduces difficulty in its implementation. However, the SMiN representation enables us to draw samples directly from $S_\alpha(0, \sigma, \mu)$ using the auxiliary variable λ:

$$x \sim \mathcal{N}(\mu, \lambda\sigma^2), \quad \lambda \sim S_{\alpha/2}(1, 1, 0) \tag{6}$$

This sampling assists in inference since x is Gaussian conditioned on λ: given samples of λ, we can generate x from the induced Gaussian conditional distribution.
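As a quick illustration of the representation in Equation (6), the sketch below draws heavy-tailed rewards through the auxiliary variable λ; it repeats the CMS routine from the earlier sketch so it is self-contained, and all names are our own:

```python
# A sketch of the scale-mixture draw in Eq. (6): lambda ~ S_{alpha/2}(1, 1, 0),
# then r | lambda ~ N(mu, lambda * sigma^2). The CMS routine is repeated from the
# previous sketch so this snippet is self-contained; all names are our own.
import numpy as np

def cms_stable(alpha, beta, sigma=1.0, mu=0.0, size=1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    B = np.arctan(beta * np.tan(np.pi * alpha / 2)) / alpha
    S = (1 + beta ** 2 * np.tan(np.pi * alpha / 2) ** 2) ** (1 / (2 * alpha))
    Y = (S * np.sin(alpha * (V + B)) / np.cos(V) ** (1 / alpha)
         * (np.cos(V - alpha * (V + B)) / W) ** ((1 - alpha) / alpha))
    return sigma * Y + mu

def smin_draw(alpha, sigma, mu, size, rng=None):
    """Draw rewards via the auxiliary variable of Eq. (6)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = cms_stable(alpha / 2, beta=1.0, size=size, rng=rng)   # positive stable mixing variable
    return rng.normal(mu, np.sqrt(lam) * sigma)                 # conditionally Gaussian given lambda

r = smin_draw(alpha=1.5, sigma=1.0, mu=3.0, size=100_000)
print(np.median(r))                     # concentrates near mu = 3
print(np.mean(np.abs(r - 3.0) > 50.0))  # but a non-negligible fraction of extreme outliers
```

The key point is the final line of `smin_draw`: conditioned on λ, the draw is an ordinary Gaussian, which is what enables the conjugate posterior updates derived next.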
### 3.2 Posteriors for α-Stable Rewards

Let us examine approximate inference for a particular arm k ∈ [K]. At any time t ∈ [T], assume this arm has been pulled $n_k(t)$ times previously, and hence we have a vector of reward samples $r_k(t) = \{r_k^{(1)}, \dots, r_k^{(n_k(t))}\}$ observed until time t. Additionally, assume we have $n_k(t)$ samples of an auxiliary variable $\lambda_k(t) = \{\lambda_k^{(1)}, \dots, \lambda_k^{(n_k(t))}\}$, where $\lambda_k \sim S_{\alpha/2}(1, 1, 0)$. Recall that $r_k \sim S_\alpha(0, \sigma, \mu_k)$ for an unknown (but fixed) $\mu_k$. From the SMiN representation, we know that $r_k$ is conditionally Gaussian given the auxiliary variable $\lambda_k$, that is, $p(r_k \mid \lambda_k, \mu_k) \sim \mathcal{N}(\mu_k, \lambda_k\sigma^2)$. We can then obtain the conditional likelihood as the following:

$$p(r_k(t) \mid \lambda_k(t), \mu_k) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n_k(t)} \frac{\left(r_k^{(i)} - \mu_k\right)^2}{\lambda_k^{(i)}} \right) \tag{7}$$

We can now assume the conjugate prior over $\mu_k$, which is a normal distribution with mean $\mu_k^0$ and variance $\sigma^2$. We then obtain the posterior density for $\mu_k$ as (full derivation in the Appendix):

$$p(\mu_k \mid r_k(t), \lambda_k(t)) \sim \mathcal{N}\left(\hat{\mu}_k(t), \hat{\sigma}_k^2(t)\right), \quad \text{where } \hat{\sigma}_k^2(t) = \frac{\sigma^2}{\sum_{i=1}^{n_k(t)} \frac{1}{\lambda_k^{(i)}} + 1}, \quad \hat{\mu}_k(t) = \frac{\sum_{i=1}^{n_k(t)} \frac{r_k^{(i)}}{\lambda_k^{(i)}} + \mu_k^0}{\sum_{i=1}^{n_k(t)} \frac{1}{\lambda_k^{(i)}} + 1}. \tag{8}$$

We know that $\hat{\sigma}_k^2(t) > 0$ since $\lambda_k^{(i)} > 0$ for all i, as they are samples from a positive stable distribution (β = 1). Given $r_k(t)$ and $\mu_k$, we also know that the individual elements of $\lambda_k(t)$ are independent, which provides us with the following decomposition for the conditional density of $\lambda_k(t)$:

$$p(\lambda_k(t) \mid r_k(t), \mu_k) = \prod_{i=1}^{n_k(t)} p\left(\lambda_k^{(i)} \mid r_k^{(i)}, \mu_k\right), \quad \text{where } p\left(\lambda_k^{(i)} \mid r_k^{(i)}, \mu_k\right) \propto \mathcal{N}\left(r_k^{(i)} \mid \mu_k, \lambda_k^{(i)}\sigma^2\right) f_{\alpha/2,1}\left(\lambda_k^{(i)}\right). \tag{9}$$

Here, $f_{\alpha,\beta}(\cdot)$ is the density of a random variable following $S_\alpha(\beta, 1, 0)$. Our stepwise posterior sampling routine is hence as follows. At any time t, after arm k is pulled and we receive reward $r_k^{(n_k(t))}$, we set $r_k(t) = [r_k(t-1), r_k^{(n_k(t))}]$. Then, for a fixed number of iterations Q, we repeat:

1. For i ∈ [1, n_k(t)], draw $\lambda_k^{(i)} \sim p(\lambda_k^{(i)} \mid r_k^{(i)}, \mu_k(t))$.
2. Draw $\mu_k(t) \sim p(\mu_k \mid r_k(t), \lambda_k(t))$.

Sampling from the conditional posterior of $\mu_k$ is straightforward since it is Gaussian. To sample from the complicated posterior of $\lambda_k^{(i)}$, we utilize rejection sampling.

### 3.3 Rejection Sampling for $\lambda_k^{(i)}$

Sampling directly from the posterior is intractable since it is not analytical. Therefore, to sample $\lambda_k^{(i)}$ we follow the pipeline described in [Godsill and Kuruoglu, 1999]. We note that the likelihood of the mean-normalized reward $v_k^{(i)} = r_k^{(i)} - \mu_k(t)$ forms a valid rejection function since it is bounded:

$$p\left(v_k^{(i)} \mid 0, \lambda_k^{(i)}\sigma^2\right) \le \frac{1}{\left|v_k^{(i)}\right|\sqrt{2\pi}} \exp(-1/2), \tag{10}$$

since $v_k^{(i)} \sim \mathcal{N}(0, \lambda_k^{(i)}\sigma^2)$. Thus, we get the procedure:

1. Draw $\lambda_k^{(i)} \sim S_{\alpha/2}(1, 1, 0)$ (using Algorithm 1).
2. Draw $u \sim U\left(0, \left(\left|v_k^{(i)}\right|\sqrt{2\pi}\right)^{-1} \exp(-1/2)\right)$.
3. If $u > p\left(v_k^{(i)} \mid 0, \lambda_k^{(i)}\sigma^2\right)$, reject $\lambda_k^{(i)}$ and go to Step 1.

Combining all these steps, we can now outline our algorithm, α-Thompson Sampling (α-TS), as described in Algorithm 2.

**Algorithm 2: α-Thompson Sampling**
1: Input: Arms k ∈ [K], priors $\mathcal{N}(\mu_k^0, \sigma^2)$ for each arm.
2: Set $D_k = 1$, $N_k = 0$ for each arm k.
3: for each iteration t ∈ [1, T] do
4: Draw $\mu_k(t) \sim \mathcal{N}\left(\frac{\mu_k^0 + N_k}{D_k}, \frac{\sigma^2}{D_k}\right)$ for each arm k.
5: Choose arm $A_t = \arg\max_{k \in [K]} \mu_k(t)$, and get reward $r_t$.
6: for q ∈ [0, Q) do
7: Calculate $v_{A_t}^{(t)} = r_t - \mu_{A_t}$.
8: Draw $\lambda_{A_t}^{(t)}$ following Section 3.3.
9: Set $D_q = D_{A_t} + 1/\lambda_{A_t}^{(t)}$, $N_q = N_{A_t} + r_t/\lambda_{A_t}^{(t)}$.
10: Draw $\mu_{A_t} \sim \mathcal{N}\left(\frac{\mu_{A_t}^0 + N_q}{D_q}, \frac{\sigma^2}{D_q}\right)$.
11: end for
12: Set $D_{A_t} = D_{A_t} + 1/\lambda_{A_t}^{(t)}$, $N_{A_t} = N_{A_t} + r_t/\lambda_{A_t}^{(t)}$.
13: end for
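The following is a minimal Python sketch of the per-arm update in Algorithm 2 (lines 6–12), combining the rejection sampler of Section 3.3 with the posterior of Equation (8); the function names, the (D, N) packaging and the numerical guard are our own choices rather than the authors' reference implementation:

```python
# A sketch of the per-arm posterior update of alpha-TS (Algorithm 2, lines 6-12),
# assuming the posterior in Eq. (8); names and small numerical choices are ours.
import numpy as np

def sample_pos_stable(a, rng):
    """One draw from S_a(1, 1, 0), 0 < a < 1, via the CMS transform with beta = 1."""
    V = rng.uniform(-np.pi / 2, np.pi / 2)
    W = rng.exponential(1.0)
    B = np.arctan(np.tan(np.pi * a / 2)) / a
    S = (1 + np.tan(np.pi * a / 2) ** 2) ** (1 / (2 * a))
    return (S * np.sin(a * (V + B)) / np.cos(V) ** (1 / a)
            * (np.cos(V - a * (V + B)) / W) ** ((1 - a) / a))

def sample_lambda(v, alpha, sigma, rng):
    """Rejection-sample lambda given v = r - mu, following Section 3.3."""
    v = v if abs(v) > 1e-12 else 1e-12                         # numerical guard (ours)
    bound = np.exp(-0.5) / (abs(v) * np.sqrt(2 * np.pi))       # envelope from Eq. (10)
    while True:
        lam = sample_pos_stable(alpha / 2, rng)                # proposal: the prior S_{alpha/2}(1,1,0)
        u = rng.uniform(0.0, bound)
        dens = np.exp(-v ** 2 / (2 * lam * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi * lam))
        if u <= dens:                                          # accept in proportion to the likelihood
            return lam

def alpha_ts_update(D, N, mu0, r, alpha, sigma, Q, rng):
    """Update the sufficient statistics (D, N) of the pulled arm after observing reward r."""
    mu = rng.normal((mu0 + N) / D, sigma / np.sqrt(D))         # posterior draw of mu
    lam = None
    for _ in range(Q):                                         # inner loop, lines 6-11
        lam = sample_lambda(r - mu, alpha, sigma, rng)
        Dq, Nq = D + 1.0 / lam, N + r / lam
        mu = rng.normal((mu0 + Nq) / Dq, sigma / np.sqrt(Dq))
    return D + 1.0 / lam, N + r / lam                          # line 12: keep only the last lambda
```

At each round, `alpha_ts_update` would be called only for the pulled arm, after drawing every arm's posterior mean as in line 4 of Algorithm 2.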
It is critical to note that in Algorithm 2, we do not re-draw the full vector $\lambda_k(t)$ at every iteration, but only the auxiliary variable for the last obtained reward. This is done to accelerate the inference process, and while it leads to a slower convergence of the posterior, we observe that it performs well in practice. Alternatively, one can re-sample $\lambda_k(t)$ over a fixed window of the previous rewards, to prevent the runtime from increasing linearly while enjoying faster convergence.

### 3.4 Regret Analysis

In this section, we derive an upper bound on the finite-time Bayesian Regret (BR) incurred by the α-TS algorithm. We continue with the notation used in previous sections, and assume a K-armed bandit with a maximum of T trials. Each arm k follows an α-stable reward distribution $S_\alpha(0, \sigma, \mu_k)$, and without loss of generality, let $\mu^* = \max_{k \in [K]} \mu_k$ denote the maximum mean reward.

**Theorem 1** (Regret Bound). Let K > 1, α ∈ (1, 2), σ ∈ ℝ⁺, and $\mu_k \in [-M, M]$ for all k ∈ [K]. For a K-armed stochastic bandit with rewards for each arm k drawn from $S_\alpha(0, \sigma, \mu_k)$, we have, asymptotically, for ϵ chosen a priori such that ϵ ∈ (0, α − 1),

$$\text{BayesRegret}(T) = O\left(K^{\frac{1}{1+\epsilon}}\, T^{\frac{2}{1+\epsilon}}\right)$$

*Proof sketch.* We first utilize the characteristic function representation of the probability density to obtain the centered (1 + ϵ)-th moment of the reward distribution for each arm. We use this moment to derive a concentration bound on the deviation of the empirical mean of α-stable densities. Next, we proceed with the decomposition of the Bayesian Regret in terms of upper-confidence bounds, as done in [Russo and Van Roy, 2014]. The complete proof is included in detail in the Appendix.

We note the following. First, the only additional assumption we make on the reward distributions is that the true means are bounded, which is a standard assumption [Russo and Van Roy, 2014] and easily realizable in most application cases. Next, ϵ < α − 1 must be chosen carefully to control the finite-time regret: as ϵ → α − 1, the growth of $T^{\frac{2}{1+\epsilon}}$ decreases, but the constants in the finite-time expression grow, and are not finite at ϵ = α − 1. This behavior arises from the non-compactness of the set of finite moments for α-stable distributions (see the Appendix for a detailed analysis).

Compared to the problem-independent regret bound of $O(\sqrt{KT\log T})$ for Thompson Sampling on the multi-armed Gaussian bandit problem demonstrated by [Agrawal and Goyal, 2013], our bound differs in two aspects. First, we admit a $K^{\frac{1}{1+\epsilon}}$ dependence on the number of arms, which matches the Gaussian bandit's $\sqrt{K}$ as ϵ → 1. Second, we have a superlinear dependence of order $T^{\frac{2}{1+\epsilon}}$: one factor of $T^{\frac{1}{1+\epsilon}}$ approaches the Gaussian case ($\sqrt{T}$) as ϵ → 1, leaving us with an additional sub-linear factor of $T^{\frac{1}{1+\epsilon}}$ in place of the sub-logarithmic factor $\sqrt{\log T}$ from the Gaussian case.
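To make the comparison concrete, here is a small numeric illustration of our own (constants, which grow as ϵ → α − 1, are ignored) for a horizon of the size used later in our experiments:

```python
# A small numeric illustration (ours, constants ignored) of the exponents above:
# the Theorem 1 rate K^(1/(1+eps)) * T^(2/(1+eps)) versus the Gaussian-bandit
# rate sqrt(K * T * log T), for alpha = 1.8 and admissible choices eps < alpha - 1.
import math

K, T = 50, 5000
for eps in (0.2, 0.5, 0.79):
    rate = K ** (1 / (1 + eps)) * T ** (2 / (1 + eps))
    print(f"eps = {eps:.2f}: K^(1/(1+eps)) * T^(2/(1+eps)) ~= {rate:.3e}")
print(f"Gaussian TS:  sqrt(K*T*log T) ~= {math.sqrt(K * T * math.log(T)):.3e}")
```

Even with ϵ chosen close to α − 1, the α-stable rate remains polynomially larger than the Gaussian rate, which motivates the truncated estimator introduced next.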
In the next section, we address the issue of polynomial concentration by utilizing the more robust, truncated mean estimator instead of the empirical mean, and obtain a modified, robust version of α-TS.

### 3.5 Robust α-Thompson Sampling

Assume that for all arms k ∈ [K], $|\mu_k| \le M$. Note that this assumption is equivalent to the boundedness assumption in the analysis of α-TS, and is a reasonable assumption to make in any practical scenario with some domain knowledge of the problem. Let δ ∈ (0, 1) and ϵ ∈ (0, α − 1). Now, consider the estimator $\hat{r}_k(t)$ given by:

$$\hat{r}_k(t) = \frac{1}{n_k(t)} \sum_{i=1}^{n_k(t)} r_k^{(i)}\; \mathbb{1}\left\{ \left|r_k^{(i)}\right| \le \left( \frac{H(\epsilon, \alpha, \sigma)\, i}{2\log T} \right)^{\frac{1}{1+\epsilon}} \right\}, \tag{11}$$

where $H(\epsilon, \alpha, \sigma)$ is a constant that upper-bounds the (1 + ϵ)-th absolute moment of the rewards, expressed in terms of M, $\sigma^\alpha$ and Gamma functions of ϵ and α (its exact form is given in the Appendix). $\hat{r}_k(t)$ then describes a truncated mean estimator for an arm k (pulled $n_k(t)$ times), where a reward $r_k^{(i)}$ at any trial i of the arm is discarded if it is larger than the bound. We choose such a form of the truncation since it allows us to obtain an exponential concentration for α-stable densities, which will eventually provide us with a tighter regret bound. As can be seen, the corresponding Robust α-TS algorithm is identical to the basic α-TS algorithm, except for this step of rejecting a reward if it is greater than the bound. The algorithm is outlined in Algorithm 3.

**Algorithm 3: Robust α-Thompson Sampling**
1: Input: Arms k ∈ [K], priors $\mathcal{N}(\mu_k^0, \sigma^2)$ for each arm.
2: Set $D_k = 1$, $N_k = 0$ for each arm k.
3: for each iteration t ∈ [1, T] do
4: Draw $\mu_k(t) \sim \mathcal{N}\left(\frac{\mu_k^0 + N_k}{D_k}, \frac{\sigma^2}{D_k}\right)$ for each arm k.
5: Choose arm $A_t = \arg\max_k \mu_k(t)$, and get reward $r_t$.
6: If $|r_t| > \left(\frac{H(\epsilon, \alpha, \sigma)\, n_{A_t}(t)}{2\log T}\right)^{\frac{1}{1+\epsilon}}$, set $r_t = 0$.
7: Repeat steps 6–12 of Algorithm 2.
8: end for

**Theorem 2** (Regret Bound). Let K > 1, α ∈ (1, 2), σ ∈ ℝ⁺, and $\mu_k \in [-M, M]$ for all k ∈ [K]. For a K-armed bandit with rewards for each arm k drawn from $S_\alpha(0, \sigma, \mu_k)$, we have, for ϵ chosen a priori such that ϵ ∈ (0, α − 1) and the truncated estimator given in Equation (11),

$$\text{BayesRegret}(T) = O\left((KT)^{\frac{1}{1+\epsilon}}\right)$$

*Proof sketch.* The truncated mean estimator can be used to obtain a tighter concentration bound on the empirical mean. This can be understood intuitively: by carefully rejecting values that are distant from the mean, our empirical mean is robust to outliers, and requires fewer samples to concentrate around the true mean. Using the concentration result, we follow an analysis identical to that of Theorem 1. The full proof is deferred to the Appendix for brevity.

We see that this bound is tight (up to logarithmic factors): it matches the problem-independent lower bound of $\Omega\left((KT)^{\frac{1}{1+\epsilon}}\right)$ [Bubeck et al., 2013]. Algorithm 3's improvements increase as α decreases: the likelihood of obtaining confounding outliers increases as α → 1, and such outliers can perturb the posterior mean in the naive α-TS algorithm.
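The truncation itself is a one-line check; the sketch below assumes the threshold form reconstructed in Equation (11) and takes the moment bound H as a parameter (its closed form lives in the Appendix); the names are our own:

```python
# A sketch of the truncation step of Robust alpha-TS (Algorithm 3, line 6), assuming
# the Eq. (11) threshold form; H is supplied by the caller (closed form in the appendix).
import math

def truncate_reward(r, i, T, H, eps):
    """Zero out the i-th reward of the pulled arm if it exceeds the truncation bound."""
    threshold = (H * i / (2.0 * math.log(T))) ** (1.0 / (1.0 + eps))
    return r if abs(r) <= threshold else 0.0

# Usage: feed the (possibly zeroed) reward into the usual alpha-TS update, e.g.
#   D, N = alpha_ts_update(D, N, mu0, truncate_reward(r, i, T, H, eps), alpha, sigma, Q, rng)
```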
In the next section, we discuss some experimental results that cement the value of α-TS for α-stable bandits.

## 4 Experiments

We conduct simulations with the following benchmarks: (i) an ε-greedy agent with linearly decreasing ε; (ii) regular TS with Gaussian priors and a Gaussian assumption on the data (Gaussian-TS); (iii) Robust-UCB [Bubeck et al., 2013] for heavy-tailed distributions using a truncated mean estimator; and (iv) α-TS and (v) Robust α-TS, both with Q (the number of sampling iterations per round) set to 50.

**Setting Priors for TS:** In all the Thompson Sampling benchmarks, setting the priors is crucial to the algorithm's performance. In the competitive benchmarking, we randomly set the priors for each arm from the same range we use for setting the mean rewards.

**Performance against Competitive Benchmarks:** We run 100 MAB experiments each for all 5 benchmarks, for α = 1.8 and α = 1.3, with K = 50 arms; for each arm, the mean is drawn randomly from [0, 2000] for each experiment, and σ = 2500. Each experiment is run for T = 5000 iterations, and we report the regret averaged over time, i.e. R(t)/t at any time t. In Figures 1A and 1B, we see the results of the regret averaged over all 100 experiments. We observe that for α = 1.3 there are more substantial variations in the regret (low α implies heavy outliers), yet both algorithms comfortably outperform the other baselines. In the case of α = 1.8 the variations are not as substantial, but the performance follows the same trend.

It is important to note that regular Thompson Sampling (Gaussian-TS) performs competitively, although in our experiments we observed that when K is large, the algorithm often concentrates on the incorrect arm, and subsequently earns a larger regret. Intuitively, we can see that whenever arms have mean rewards close to each other (compared to the scale of the rewards), ε-greedy will often converge to the incorrect arm.

Figure 1: (A) Competitive benchmarking for α = 1.3, and (B) α = 1.8; (C) ablation studies for varying α, and (D) varying prior strength. Shaded areas denote variance scaled by 0.25 in (A) and (B), and scaled by 0.5 in (C) and (D). Figures are best viewed in color.

Robust-UCB, however, makes very weak assumptions on the data distributions, and hence has a much larger exploration phase, leading to larger regret. Compared to α-TS, Robust α-TS is less affected by outlying rewards (as is more evident for α = 1.3 than for α = 1.8) and hence converges faster.

**Ablation Studies:** We additionally run two types of ablation studies. First, we compare the performance of α-TS on the identical setup as before (same K and reward distributions), but with varying values of α ∈ (1, 2). We report the expected time-normalized regret averaged over 100 trials in Figure 1C, and observe that (i) as α increases, the asymptotic regret decreases faster, and (ii) as expected, for lower values of α there is a substantial amount of variation in the regret. Secondly, we compare the effect of the sharpness of the priors. In the previous experiments, the priors are drawn randomly from the same range as the means, without any additional information. However, by introducing more information about the means through the priors, we can expect better performance. In this experiment, for each mean $\mu_k$, we randomly draw the prior mean $\mu_k^0$ from $[\mu_k - \delta, \mu_k + \delta]$, and observe the regret after T = 15K trials for δ ranging from 50 to 1000. The results for this experiment are summarized in Figure 1D for K = 10 and σ = 25, averaged over 25 trials each. We see that with uninformative priors α-TS performs competitively, and it gets better as the priors get sharper.

## 5 Related Work

A version of the UCB algorithm [Auer et al., 2002] has been proposed in [Bubeck et al., 2013], coupled with several robust mean estimators, to obtain Robust-UCB algorithms with optimal problem-dependent (i.e. dependent on the individual $\mu_k$'s) regret when rewards are heavy-tailed. However, the optimal version of their algorithm has a few shortcomings that α-TS addresses: their algorithm requires the median-of-means estimator, which has an expensive space complexity of O(log T) and time complexity of O(log log T) per update. Secondly, there is no mechanism to incorporate prior information, which can be advantageous, as seen even with weak priors. Additionally, [Vakili et al., 2013] introduce a deterministic exploration-exploitation algorithm, which achieves the same order of regret as Robust-UCB for heavy-tailed rewards. There has been work in analyzing Thompson Sampling for specific Pareto and Weibull heavy-tailed distributions in [Korda et al., 2013]; however, the Weibull and Pareto distributions typically have lighter tails owing to the existence of more higher-order moments, and hence cannot typically be used to model very heavy-tailed signals.
In related problems, [Yu et al., 2018] provide a purely exploratory algorithm for best-arm identification under heavy-tailed rewards, using a finite (1 + ϵ)-th moment assumption. Similarly, [Shao et al., 2018; Medina and Yang, 2016] explore heavy-tailed rewards in the linear bandit setting.

## 6 Conclusion

In this paper, we first designed a framework for efficient posterior inference for the α-stable family, which has largely been ignored in the bandits literature owing to its intractable density function. We formulated the first polynomial problem-independent regret bounds for Thompson Sampling with α-stable densities, and subsequently improved the regret bound to achieve the optimal regret identical to the sub-Gaussian case, providing an efficient framework for decision-making for these distributions. Additionally, our intermediary concentration results provide a starting point for other machine learning problems that may be investigated in α-stable settings. There is ample evidence to support the existence of α-stability in various modeling problems across economics [Frain, 2009], finance [Bradley and Taqqu, 2003] and behavioral studies [Mahanti et al., 2013]. With tools such as ours, we hope to usher in scientific conclusions in problems that cannot make sub-Gaussian assumptions, and to enable more robust empirical findings. Future work may include viewing more involved decision-making processes, such as MDPs, in the same light, leading to more robust algorithms.

## Acknowledgements

We would like to thank Peter Krafft, Yan Leng, Dhaval Adjodah, Ziv Epstein, Michiel Bakker and Ishaan Grover for their comments on this manuscript, the anonymous reviewers for their suggestions, and MIT Quest for Intelligence for their generous support with computational resources.

## References

[Agrawal and Goyal, 2012] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1–39.26, 2012.

[Agrawal and Goyal, 2013] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.

[Andrews and Mallows, 1974] David F. Andrews and Colin L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society: Series B (Methodological), 36(1):99–102, 1974.

[Auer et al., 2002] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[Borak et al., 2005] Szymon Borak, Wolfgang Härdle, and Rafał Weron. Stable distributions. In Statistical Tools for Finance and Insurance, pages 21–44. Springer, 2005.

[Bradley and Taqqu, 2003] Brendan O. Bradley and Murad S. Taqqu. Financial risk and heavy tails. In Handbook of Heavy Tailed Distributions in Finance, pages 35–103. Elsevier, 2003.

[Bubeck et al., 2013] Sébastien Bubeck, Nicolò Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.

[Cappé et al., 2013] Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz, et al. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

[Carr and Wu, 2003] Peter Carr and Liuren Wu. The finite moment log stable process and option pricing. The Journal of Finance, 58(2):753–777, 2003.
[Chambers et al., 1976] John M. Chambers, Colin L. Mallows, and B. W. Stuck. A method for simulating stable random variables. Journal of the American Statistical Association, 71(354):340–344, 1976.

[Chapelle and Li, 2011] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.

[Frain, 2009] John C. Frain. Studies on the Application of the Alpha-stable Distribution in Economics. 2009.

[Godsill and Kuruoglu, 1999] Simon Godsill and Ercan E. Kuruoglu. Bayesian inference for time series with heavy-tailed symmetric α-stable noise processes. 1999.

[Korda et al., 2013] Nathaniel Korda, Emilie Kaufmann, and Rémi Munos. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems, pages 1448–1456, 2013.

[Kumar and Tomkins, 2010] Ravi Kumar and Andrew Tomkins. A characterization of online browsing behavior. In Proceedings of the 19th International Conference on World Wide Web, pages 561–570. ACM, 2010.

[Lévy, 1925] P. Lévy. Calcul des probabilités, vol. 9. Gauthier-Villars, Paris, 1925.

[Mahanti et al., 2013] Aniket Mahanti, Niklas Carlsson, Anirban Mahanti, Martin Arlitt, and Carey Williamson. A tale of the tails: Power-laws in internet measurements. IEEE Network, 27(1):59–64, 2013.

[Medina and Yang, 2016] Andres Munoz Medina and Scott Yang. No-regret algorithms for heavy-tailed linear bandits. In International Conference on Machine Learning, pages 1642–1650, 2016.

[Russo and Van Roy, 2014] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

[Shao and Nikias, 1993] Min Shao and Chrysostomos L. Nikias. Signal processing with fractional lower order moments: stable processes and their applications. Proceedings of the IEEE, 81(7):986–1010, 1993.

[Shao et al., 2018] Han Shao, Xiaotian Yu, Irwin King, and Michael R. Lyu. Almost optimal algorithms for linear stochastic bandits with heavy-tailed payoffs. In Advances in Neural Information Processing Systems, pages 8430–8439, 2018.

[Thompson, 1933] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[Vakili et al., 2013] Sattar Vakili, Keqin Liu, and Qing Zhao. Deterministic sequencing of exploration and exploitation for multi-armed bandit problems. IEEE Journal of Selected Topics in Signal Processing, 7(5):759–767, 2013.

[Yu et al., 2018] Xiaotian Yu, Han Shao, Michael R. Lyu, and Irwin King. Pure exploration of multi-armed bandits with heavy-tailed payoffs. In Conference on Uncertainty in Artificial Intelligence, pages 937–946, 2018.