# Stochastic Gradient Succeeds for Bandits

Jincheng Mei*¹, Zixin Zhong*², Bo Dai¹³, Alekh Agarwal⁴, Csaba Szepesvári², Dale Schuurmans¹²

*Equal contribution. ¹Google Research, Brain Team. ²University of Alberta. ³Georgia Tech. ⁴Google Research. Correspondence to: Jincheng Mei, Zixin Zhong. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

We show that the stochastic gradient bandit algorithm converges to a globally optimal policy at an O(1/t) rate, even with a constant step size. Remarkably, global convergence of the stochastic gradient bandit algorithm has not been previously established, even though it is an old algorithm known to be applicable to bandits. The new result is achieved by establishing two novel technical findings: first, the noise of the stochastic updates in the gradient bandit algorithm satisfies a strong growth condition, where the variance diminishes whenever progress becomes small, implying that additional noise control via diminishing step sizes is unnecessary; second, a form of weak exploration is automatically achieved through the stochastic gradient updates, since they prevent the action probabilities from decaying faster than O(1/t), thus ensuring that every action is sampled infinitely often with probability 1. These two findings can be used to show that the stochastic gradient update is already sufficient for bandits in the sense that exploration versus exploitation is automatically balanced in a manner that ensures almost sure convergence to a global optimum. These novel theoretical findings are further verified by experimental results.

1. Introduction

Algorithms for multi-armed bandits (MABs) need to balance exploration and exploitation to achieve desirable performance properties (Lattimore & Szepesvári, 2020). Well known bandit algorithms generally introduce auxiliary mechanisms to control the exploration-exploitation trade-off. For example, upper confidence bound strategies (UCB; Lai et al., 1985; Auer et al., 2002a) manage exploration explicitly by designing auxiliary bonuses that induce optimism under uncertainty, while Thompson sampling (Thompson, 1933; Agrawal & Goyal, 2012) maintains a posterior over rewards that has to be updated and sampled from tractably. The theoretical analysis of such algorithms typically focuses on bounding their regret, i.e., showing that the average reward obtained by the algorithm approaches that of the optimal action in hindsight, at a rate that is statistically optimal. However, managing exploration bonuses via uncertainty quantification is difficult in all but simple environments (Gawlikowski et al., 2021), while posterior updating and sampling can also be computationally difficult in practice, which makes UCB and Thompson sampling challenging to apply in real world scenarios.

Meanwhile, stochastic gradient-based techniques have witnessed widespread use across the breadth of machine learning. For bandit and reinforcement learning problems, stochastic gradient yields a particularly simple algorithm, the stochastic gradient bandit algorithm (Sutton & Barto, 2018, Section 2.8), that omits any explicit control mechanism over exploration. This algorithm is naturally compatible with deep neural networks, in stark contrast to UCB and Thompson sampling, and has seen significant practical success (Schulman et al., 2015; 2017; Ouyang et al., 2022).
Surprisingly, the theoretical understanding of this algorithm remains under-developed: the global convergence and regret properties of the stochastic gradient bandit algorithm are still open, which naturally raises the question: Is the stochastic gradient bandit algorithm able to balance exploration vs. exploitation to identify an optimal action? Such an understanding is paramount to further improving the underlying approach. In this paper, we take a step in this direction by studying the convergence properties of the canonical stochastic gradient bandit algorithm in the simplest setting of multi-armed bandits, and answer the question affirmatively: the distribution over the arms maintained by this algorithm almost surely concentrates asymptotically on a globally optimal action.

Of course, the broader literature on the use of stochastic gradient techniques in reinforcement learning has a long tradition, dating back to stochastic approximation (Robbins & Monro, 1951) with likelihood ratios (the "log trick") (Glynn, 1990) and REINFORCE policy gradient estimation (Williams, 1992). It is well known that REINFORCE (Williams, 1992) with on-line Monte Carlo sampling provides an unbiased gradient estimator of bounded variance, which is sufficient to guarantee convergence to a stationary point if the learning rates are decayed appropriately (Robbins & Monro, 1951; Zhang et al., 2020b). However, convergence to a stationary point is an extremely weak guarantee for bandits, since any deterministic policy, whether optimal or sub-optimal, has a zero gradient and is therefore a stationary point. Convergence to a stationary point, on its own, is insufficient to ensure convergence to a globally optimal policy or even to establish regret bounds. Recently, it has been shown that if true gradients are used, softmax policy gradient (PG) methods converge to a globally optimal policy asymptotically (Agarwal et al., 2021), with an O(1/t) rate (Mei et al., 2020b), albeit with initialization and problem dependent constants (Mei et al., 2020a; Li et al., 2021). However, these results do not apply to the gradient bandit algorithm as it uses stochastic gradients, and a key theoretical challenge is to account for the effects of stochasticity from on-policy sampling and reward noise.

| Reference | Convergence rate | Learning rate η | Remarks |
| --- | --- | --- | --- |
| Zhang et al. (2020a) | O(1/√t) | O(1/√t) | Log-barrier regularization |
| Ding et al. (2021) | O(1/√t) | O(1/t) | Entropy regularization |
| Zhang et al. (2021) | O(1/√t) | O(1/log t) | Gradient truncation + variance reduction |
| Yuan et al. (2022) | O(1/t^(1/3)) | O(1/t) | ABC assumptions |
| Mei et al. (2022) | O(C/t) | O(1/t) | Oracle baseline; C is initialization and problem dependent |
| This paper | O(C/t) | O(1) | C is initialization and problem dependent |

Table 1. Global convergence results for gradient based methods. To simplify comparison, regret or sample complexity bounds have been converted to convergence rates through the usual translations. (This comparison does not capture differences in other metrics like regret.) We also compare the learning rates used in the underlying algorithm, since the ability to use a constant learning rate is a key insight of the analysis presented and is generally preferred in practice. Note that Zhang et al. (2021) claimed a constant learning rate; however, their learning rate must follow a decaying truncation threshold, which means the actual learning rate in (Zhang et al., 2021) is decaying.
More recent results on the global convergence of PG methods with stochastic updates have been established, as summarized in Table 1. In particular, Zhang et al. (2020b) showed that REINFORCE (Williams, 1992) with O(1/ t) learning rates and log-barrier regularization has O(1/ t) average regret. Ding et al. (2021) proved that softmax PG with O(1/t) learning rates and entropy regularization has O(ϵ 2) sample complexity. Zhang et al. (2021) showed that with gradient truncation softmax PG gives O(ϵ 2) sample complexity. Under extra assumptions, Yuan et al. (2022) obtained O(ϵ 3) sample complexity with O(1/t) learning rates. Mei et al. (2022) analyzed on-policy natural PG with value baselines and O(1/t) learning rates and proved O(1/t) convergence rate. The results in these works are expressed in different metrics, such as average regret, sample complexity, or convergence rate, which can sometimes make comparisons difficult. However, for the bandit case, where only one example is used in each iteration, these metrics become comparable, such that O(ϵ n) sample complexity is equal to O(t 1/n) convergence rate, which is stronger than O(t 1/n) average regret, but not necessarily vice versa.1 The two key shortcomings in these existing results are, first, they introduce decaying learning rates (or regularization and/or variance reduction) to explicitly control noise, and second, such auxiliary techniques generally incur additional computation and decelerate convergence to an O(1/ t) or slower rate. The only exception to the latter is Mei et al. (2022), which considers an aggressive O(1/t) learning rate decay and still establishes O(1/t) convergence, but leverages an unrealistic baseline to achieve this. In this paper, we provide a new global convergence analysis of stochastic gradient bandit algorithms with constant learning rates by establishing novel properties and techniques. There are two main benefits to the results presented. By considering only constant learning rates, we show that auxiliary forms of noise control such as learning rate decay, regularization and variance reduction are unnecessary to achieve global convergence, which justifies the use of far simpler algorithms in practice. Unlike previous work, we show that a practical and general algorithm can achieve an optimal O(1/t) convergence rate and O(log T) regret asymptotically. The remaining paper is organized as follows. Sections 2 and 3 introduce the gradient bandit algorithms and standard stochastic gradient analysis respectively. Section 4 presents our novel technical findings, characterizing the automatic noise cancellation effect and global landscape properties, which is then leveraged in Section 5 to establish novel global convergence results. Section 6 discusses the effect of using 1It is worth noting that these works also contain results for general Markov decision processes (MDPs). We express their results for bandits here by treating this case as a single state MDP. Stochastic Gradient Succeeds for Bandits baselines. Section 7 presents a simulation study to verify the theoretical findings, and Section 8 provides further discussions. Section 9 briefly concludes this work. 2. Gradient Bandit Algorithms A multi-armed bandit (MAB) problem is specified by an action set [K] := {1, 2, , K} and random rewards with mean vector r RK. 
For each action a [K], the mean reward r(a) is the expectation of a bounded reward distribution, r(a) = Z Rmax Rmax x Pa(x)µ(dx), (1) where µ is a finite measure over [ Rmax, Rmax], Pa(x) 0 is a probability density function with respect to µ, and Rmax > 0 is the reward range. Since the sampled reward is bounded, we also have r [ Rmax, Rmax]K. We make the following assumption on r. Assumption 2.1 (True mean reward has no ties). For all i, j [K], if i = j, then r(i) = r(j). Remark 2.2. Assumption 2.1 is used in the proofs for Theorem 5.1. In particular, convergence toward strict one-hot policies above Theorem 5.1 is needed as a result of Assumption 2.1. We discuss intuition later for establishing the same result without Assumption 2.1. According to Sutton & Barto (2018, Section 2.8), the gradient bandit algorithm maintains a softmax distribution over actions Pr (at = a) = πθt(a) such that πθt = softmax(θt), where πθt(a) = exp{θt(a)} P a [K] exp{θt(a )}, for all a [K], (2) and θt RK is the parameter vector to be updated. Algorithm 1 Gradient bandit algorithm (without baselines) Input: initial parameters θ1 RK, learning rate η > 0. Output: policies πθt = softmax(θt). while t 1 do Sample one action at πθt( ). Observe one reward sample Rt(at) Pat. for all a [K] do if a = at then θt+1(a) θt(a) + η (1 πθt(a)) Rt(at). else θt+1(a) θt(a) η πθt(a) Rt(at). end if end for end while It is obvious that Algorithm 1 is an instance of stochastic gradient ascent with an unbiased gradient estimator (Nemirovski et al., 2009), as shown below for completeness. Proposition 2.3. Algorithm 1 is equivalent to the following stochastic gradient ascent update on π θ r, θt+1 θt + η dπ θtˆrt dθt (3) = θt + η diag(πθt) πθtπ θt ˆrt, (4) where Et h dπ θt ˆrt dθt i = dπ θtr dθt , and dπθ dθ = diag(πθ) πθπ θ is the Jacobian of θ 7 πθ := softmax(θ), ˆrt(a) := I{at=a} πθt(a) Rt(a) for all a [K] is the importance sampling (IS) estimator, and we set Rt(a) = 0 for all a = at. Based on Proposition 2.3, Sutton & Barto (2018, Section 2.8) assert that Algorithm 1 has robust convergence properties toward stationary points without rigorous justification. However, as mentioned in Section 1, convergence to stationary points is a very weak assertion in a MAB, since this does not guarantee sub-optimal local maxima are avoided. Hence this claim does not assure global convergence or sub-linear regret for the gradient bandit algorithm. 3. Preliminary Stochastic Gradient Analysis In this section, we start with local convergence of stochastic gradient bandit algorithm. The understanding of the behavior of the algorithm involves assessing whether optimization progress is able to overcome the effects of the sampling noise. This trade-off reveals inability of the vanilla analysis and inspires our refined analysis. To illustrate the basic ideas, we first recall some known results about the form of π θ r and the behavior of Algorithm 1 and make a preliminary attempt to establish convergence to a stationary point. First, π θ r is a 5/2-smooth function of θ RK (Mei et al., 2020b, Lemma 2), which implies that, π θtr π θt+1r Ddπ θtr dθt , θt+1 θt E + 5 4 θt+1 θt 2 2 = η Ddπ θtr dθt , dπ θtˆrt dθt 4 η2 dπ θtˆrt dθt where the last equation follows from Eq. (3). Second, as is well known, the on-policy stochastic gradient is unbiased, and its variance / scale is uniformly bounded over all πθ. Proposition 3.1 (Unbiased stochastic gradient with bounded variance / scale). 
Using Algorithm 1, we have, for all t 1, dπ θtˆrt dθt = dπ θtr dθt , and Et " dπ θtˆrt dθt where Et[ ] is on randomness from the on-policy sampling at πθt( ) and reward sampling Rt(at) Pat. Stochastic Gradient Succeeds for Bandits Therefore, according to Proposition 3.1, we have, π θtr Et[π θt+1r] η dπ θtr dθt 2 + 5 R2 max 2 η2. (5) Since the goal is to maximize π θ r, we want the first term on the r.h.s. of Eq. (5) ( progress ) to overcome the second term ( noise ) to ensure that Et[π θt+1r] > π θtr. Unfortunately, this is not achievable using a constant learning rate η Θ(1), since the progress contains a vanishing term of dπ θtr dθt 2 0 as t while the noise term will remain at constant level. Therefore, based on this bound, it seems necessary to use a diminishing learning rate η 0 to control the effect of noise for local convergence. In fact, with appropriate learning rate control (Robbins & Monro, 1951; Ghadimi & Lan, 2013; Zhang et al., 2020b), it can be shown that minimum gradient norm converge to zero. From Eq. (5), by algebra and telescoping, we have, min 1 t T E dπ θtr dθt E[π θT +1r] E[π θ1r] PT t=1 ηt + 5 R2 max 2 PT t=1 η2 t PT t=1 ηt . Choosing ηt Θ(1/ t), the r.h.s. of the above inequality is in O(1/ T), i.e., the minimum gradient norm approaches zero as T (Ghadimi & Lan, 2013; Zhang et al., 2020b). However, the decaying learning rate will slow down the convergence as is seen. Next, we will present our novel technical characterization of the noise in stochastic gradient bandit that can allow us to avoid this choice. 4. New Analysis: Noise Vanishes Automatically As discussed in Section 3, noise control is at the heart of standard stochastic gradient analysis, and different techniques (entropy or log-barrier regularization, learning rate schemes, momentum) are used to explicitly combat noise in stochastic updates. Here we take a different perspective by asking whether the sampling noise will automatically diminish in a way that there is no need to explicitly control it. In particular, we investigate whether the constant order of the second term in Eq. (5) ( noise ) is accurately characterizing the sampling noise, or whether this bound can be improved. Note that the noise constant 5 R2 max in Eq. (5) arises from two quantities: the standard smoothness constant of 5/2, and the variance upper bound of 2 R2 max in Proposition 3.1. It turns out that both of these quantities can be improved. 4.1. Non-uniform Smoothness: Landscape Properties The first key observation is a landscape property originally derived in (Mei et al., 2021b) for true policy gradient settings, which is also applicable for stochastic gradient update. Lemma 4.1 (Non-uniform smoothness (NS), Lemma 2 in (Mei et al., 2021b)). For all θ RK, and for all r RK, the spectral radius of the Hessian matrix d2{π θ r} dθ2 RK K is upper bounded by 3 dπ θ r dθ 2, i.e., for all y RK, y d2{π θ r} dθ2 y 3 dπ θ r dθ 2 y 2 2. (6) It is useful to understand the intuition behind this lemma. Note that when the PG norm dπ θ r dθ 2 is small, the policy πθ is close to a one-hot policy, and the objective π θ r has a flat local landscape; ultimately implying that the the Hessian magnitude is upper bounded by the gradient. However, directly using Lemma 4.1 remains challenging: Consider two iterates θt and θt+1. Then in a Taylor expansion the PG norm of an intermediate point θζ := θt +ζ (θt+1 θt), ζ [0, 1], will appear, which is undesirable since ζ is unknown. 
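Before turning to the additional lemma that handles this, the update of Algorithm 1 and the unbiasedness claim of Proposition 2.3 can be made concrete with a short numerical sketch. The snippet below is our own minimal illustration, not the authors' code: it assumes Bernoulli rewards with means r(a) in (0, 1), so that samples are bounded with R_max = 1 as in Eq. (1), and all variable names are choices of the sketch.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())   # subtract max for numerical stability
    return z / z.sum()

def gradient_bandit_step(theta, r_mean, eta, rng):
    """One iteration of Algorithm 1 (no baseline), with Bernoulli reward samples."""
    pi = softmax(theta)
    a = rng.choice(len(theta), p=pi)          # a_t ~ pi_theta_t(.)
    R = float(rng.random() < r_mean[a])       # R_t(a_t) in {0, 1} with mean r(a_t), so R_max = 1
    g = -pi * R                               # theta(a) <- theta(a) - eta * pi(a) * R_t(a_t), for a != a_t
    g[a] = (1.0 - pi[a]) * R                  # theta(a_t) <- theta(a_t) + eta * (1 - pi(a_t)) * R_t(a_t)
    return theta + eta * g, g

# Monte Carlo check of Proposition 2.3 / 3.1: the update direction is unbiased,
# E_t[d pi^T r_hat / d theta] = pi(a) * (r(a) - pi^T r), at a fixed theta.
rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2, 0.1])
r_mean = np.array([0.9, 0.2, 0.5])
pi = softmax(theta)
true_grad = pi * (r_mean - pi @ r_mean)
samples = np.array([gradient_bandit_step(theta, r_mean, 0.1, rng)[1] for _ in range(200_000)])
print(np.abs(samples.mean(axis=0) - true_grad).max() < 0.01)   # expected: True
```

The Monte Carlo average of the sampled update directions matches the true gradient π_θ(a)(r(a) − π_θ^⊤ r), consistent with Proposition 3.1.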
Therefore, we require an additional lemma to assert that for a sufficiently small learning rate, the PG norm of θζ will be controlled by that of θt. Lemma 4.2 (NS between iterates). Using Algorithm 1 with η 0, 2 9 Rmax , we have, for all t 1, D(θt+1, θt) β(θt) 2 θt+1 θt 2 2, (7) where D(θt+1, θt) is Bregman divergence defined as, D(θt+1, θt) := (πθt+1 πθt) r Ddπ θtr dθt , θt+1 θt E , and β(θt) = 6 2 9 Rmax η dπ θtr dθt With the learning rate requirement, Lemma 4.2 is no longer only a landscape property, but also depends on updates. Using Lemma 4.2 rather than standard smoothness, 5 R2 max in Eq. (5) can be replaced with β(θt), which implies that the noise is also vanishing since dπ θtr dθt 2 0 as t . However, with simple constant upper bound of the variance of noise in Proposition 3.1, the progress term in Eq. (5) still decays faster than the noise term since dπ θtr dθt o dπ θtr dθt . Unfortunately, this suggests that a decaying learning rate is still necessary to control the noise. Therefore, a further refined analysis of the noise variance is required. 4.2. Growth Conditions: Softmax Jacobian Behavior Since only Lemmas 4.1 and 4.2 are still insufficient to guarantee progress without learning rate control, we need to develop a more refined variance bound of the noise that was previously unknown for gradient bandit algorithms. We first consider an example to intuitively explain why a tighter bound on the noise variance might be possible. Stochastic Gradient Succeeds for Bandits Figure 1. Visualization and intuition for Lemma 4.3. (a) Stochastic gradient scale. (b) True gradient norm. (c) The ratio of true gradient norm over 1 minus largest action probability. Color bars contain minimum and maximum values of corresponding quantities. Illustration. Consider Figure 1(a), which depicts the probability simplex containing all policies πθ over K = 3 actions. Figure 1(a) shows the scale of the stochastic gradient Ea πθ( ) dπ θ ˆr dθ 2 2, illustrating that when πθ is close to any corner of the simplex, the stochastic gradient scale becomes close to 0. This suggests that the 2 R2 max in Proposition 3.1 is quite loose and improvable. Figure 1(b) presents a similar visualization for the true gradient norm dπ θ r dθ 2, demonstrating a similar behavior to the stochastic gradient scale. We formalize this observation by proving that the stochastic gradient scale is controlled by the true gradient norm, significantly improving Proposition 3.1. Lemma 4.3 (Strong growth condition; self-bounding noise property). Using Algorithm 1, we have, for all t 1, " dπ θtˆrt dθt 8 R3 max K3/2 2 dπ θtr dθt where := mini =j |r(i) r(j)|. The proof sketch of Lemma 4.3 is as follows. For any t 1, let kt denote the action with the largest probability, i.e., kt := arg max a [K] πθt(a). (9) Note that 1 πθt(kt) characterizes how close πθt is to any corner of the probability simplex. We first prove that, " dπ θtˆrt dθt 4 R2 max (1 πθt(kt)) , (10) which formalizes the observation in Figure 1(a). Addi- tionally, Figure 1(c) shows that dπ θtr dθt (1 πθt(kt)) is larger than about 0.2044, which suggests that the true gradient norm also characterizes a distance of πθ from any corner of probability simplex since it has a variance-like structure (Eq. (14)), which is formalized by proving that, 1 πθt(kt) 2 Rmax K3/2 2 dπ θtr dθt Combining Eqs. (10) and (11) allows one to establish Lemma 4.3, verifying the intuitive observation in Figure 1. 
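As a sanity check on Lemma 4.3, both sides of the inequality can be evaluated numerically. The sketch below is our own illustration (not from the paper): it assumes deterministic rewards R_t(a) = r(a) so that the left-hand side has a closed form, and confirms that the expected squared stochastic gradient norm is bounded by (8 R_max³ K^{3/2} / Δ²) times the true gradient norm at randomly drawn parameters.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def strong_growth_check(r, theta):
    """Check Lemma 4.3 at one theta, assuming deterministic rewards R_t(a) = r(a)."""
    K, R_max = len(r), np.abs(r).max()
    delta = min(abs(r[i] - r[j]) for i in range(K) for j in range(K) if i != j)
    pi = softmax(theta)
    true_grad = pi * (r - pi @ r)
    # E_{a_t ~ pi} || d pi^T r_hat_t / d theta ||^2, in closed form for deterministic rewards
    lhs = sum(pi[a] * r[a] ** 2 * ((1 - pi[a]) ** 2 + (pi ** 2).sum() - pi[a] ** 2)
              for a in range(K))
    rhs = 8 * R_max ** 3 * K ** 1.5 / delta ** 2 * np.linalg.norm(true_grad)
    return lhs <= rhs

rng = np.random.default_rng(1)
r = np.array([0.9, 0.5, 0.2])
print(all(strong_growth_check(r, rng.normal(scale=5.0, size=3)) for _ in range(1000)))  # expected: True
```

In particular, as θ approaches a one-hot policy both sides vanish together, which is the self-bounding behaviour the lemma formalizes.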
From this explanation, whenever the true PG norm dπ θ r dθ 2 is small and πθ is close to a one-hot policy, the stochastic PG norm dπ θ ˆr dθ 2 in Eq. (3) will also be small. A deeper explanation is that the softmax Jacobian diag(πθ) πθπ θ is involved in the stochastic PG in Eq. (4), which cancels and annihilates the unbounded noise in reward estimator ˆr arising from the use of importance sampling. Remark 4.4. The strong growth condition was first proposed by Schmidt & Roux (2013) and later found to be satisfied in supervised learning with over-parameterized neural networks (NNs) (Allen-Zhu et al., 2019). There, given a dataset D := {(x1, y1), (x2, y2), (x N, y N)}, the goal is to minimize a composite loss function, min w f(w) = min w i [N] fi(w). (12) Since over-parameterized NNs fit all the data points, i.e., f(w) = 0 for some w, each individual loss is also fi(w) = 0 since fi(w) 0 for all i [N] (e.g., squared loss and cross entropy). This guarantees that when the true gradient f(w) = 0, the stochastic gradient fi(w) = 0 for all i [N]. Hence the strong growth conditions are satisfied, and stochastic gradient descent (SGD) attains the same convergence speed as gradient descent (GD) with an over-parameterized NN (Allen-Zhu et al., 2019). Remark 4.5. In probability and statistics, the property of a variance bounded by the expectation is also called a selfbounding property, which can be used to get strong variance bounds (Mc Diarmid & Reed, 2006; Boucheron et al., 2009). Lemma 4.3 proves that such strong growth conditions are also satisfied in the stochastic gradient bandit algorithm, but Stochastic Gradient Succeeds for Bandits for a different reason. For over-parameterized NNs, since the model fits every data point, zero gradient implies that the stochastic gradient is also zero. Here, in Lemma 4.3, the strong growth condition alternatively arises because of the landscape; that is, the presence of the softmax Jacobian in the stochastic gradient update annihilates the sampling noise and leads to the strong growth condition being satisfied. 4.3. No Learning Rate Decay Finally, from Lemmas 4.2 and 4.3 we reach the result that expected progress can be guaranteed with a constant learning rate for the stochastic gradient bandit update. Lemma 4.6 (Constant learning rates). Using Algorithm 1 with η = 2 40 K3/2 R3max , we have, for all t 1, π θtr Et[π θt+1r] 2 80 K3/2 R3max dπ θtr dθt Lemma 4.2 indicates that {π θtr}t 1 is a sub-martingale. Since the reward is bounded r [ Rmax, Rmax], by Doob s super-martingale convergence theorem we have that, almost surely, π θtr c [ Rmax, Rmax] as t . Corollary 4.7. Using Algorithm 1, we have, the sequence {π θtr}t 1 converges w.p. 1. Therefore, from Lemma 4.6 and Corollary 4.7 it follows that dπ θtr dθt 2 0 as t 0 almost surely, which implies that convergence to a stationary point is achieved without a decaying learning rate (Robbins & Monro, 1951). From Lemma 4.6, using telescoping we immediately have, " dπ θtr dθt E[π θT +1r] E[π θ1r] where C := 2 80 K3/2 R3max . Comparing to Section 3, Eq. (13) is a stronger result of averaged gradient norm approaches zero, in terms of faster O(1/T) rate and constant learning rate. An interesting observation is that the average gradient convergence rate is independent with the initialization, which is different with the global convergence results in later sections. The key reason behind this outcome is that Lemmas 4.1 and 4.3 establish that the noise in Eq. 
(5) decays on the same order as the progress dπ θtr dθt 2, so that a constant learning rate is sufficient for the progress term to overcome the noise term (Lemma 4.6). 5. New Global Convergence Results Given the refined stochastic analysis from Section 4 we are now ready to establish new global convergence results for the stochastic gradient bandit in Algorithm 1. 5.1. Asymptotic Global Convergence First, note that the true gradient norm takes the following variance-like expression, a [K] πθ(a)2 r(a) π θ r 2. (14) According to dπ θtr dθt 2 0 as t 0, we have that πθt ap- proaches a one-hot policy, i.e., πθt(i) 1 for some i [K] as t . Asymptotic global convergence is then proved by constructing contradictions against the assumption that the algorithm converges to a sub-optimal one-hot policy. Theorem 5.1 (Asymptotic global convergence). Using Algorithm 1, we have, almost surely, πθt(a ) 1, as t , (15) which implies that inft 1 πθt(a ) > 0. It is highly challenging to prove almost surely global convergence because (i) the iteration {θt}t 1 is a stochastic process, which it different with the true gradient settings (Agarwal et al., 2021), and (ii) the iteration θt RK is unbounded, which makes Doob s super-martingale convergence results not applicable, unlike Corollary 4.7. The strategy and insights of Theorem 5.1 are as follows. According to Proposition 3.1, we have, for all a [K], Et[θt+1(a)] = θt(a) + η πθt(a) r(a) π θtr . (16) Now we suppose that using Algorithm 1, there exists a sub-optimal action i [K], r(i) < r(a ), such that, πθt(i) 1, as t , (17) which implies that, π θtr r(i), as t . (18) Since r(i) < r(a ), there exists a good action set, A+(i) := a+ [K] : r(a+) > r(i) . (19) By Eqs. (16), (18) and (19), for all large enough t 1, Et[θt+1(a+)] θt(a+), (20) which means a good action s parameter {θt(a+)}t 1 is a sub-martingale. The major part of the proofs are devoted to the following key results. We have, almost surely, inf t 1 θt(a+) c1 > , and (21) sup t 1 θt(i) c2 < . (22) Stochastic Gradient Succeeds for Bandits Eq. (21) is non-trivial since an unbounded sub-martingale is not necessarily lower bounded and could have positive probability of approaching negative infinity2, while Eq. (22) is non-trivial since the behavior of θt(i) depends on different cases of how many times good actions are sampled as t . With the above results, we have, πθt(i) < eθt(i) a+ A+(i) eθt(a+) eθt(a ) > 0 eθt(i) + ec1 (by Eq. (21)) ec2 + ec1 (by Eq. (22)) which is a contradiction with the assumption of Eq. (17). Therefore, the asymptotically convergent one-hot policy has to satisfy r(i) = r(a ), proving Theorem 5.1. The detailed proof is provided in Appendix A.1. Remark 5.2. As mentioned in Remark 2.2, the arguments above Theorem 5.1, i.e., πθt approaches a one-hot policy, is based on Assumption 2.1. With this result, Theorem 5.1 proves asymptotic global convergence by contradiction with the assumption of Eq. (17). In general, without Assumption 2.1, Eq. (14) approaches zero can only imply πθt approaches a generalized one-hot policy (rather than a strict one-hot policy). The definition of generalized one-hot policies can be found in Eq. (141). Remark 5.3. It is true that Eq. (14) approaches zero is not enough for showing πθt approaches a one-hot policy. However, Algorithm 1 is special that it is always making one action s probability dominate others when there are ties. Consider r RK with r(1) = r(2). 
If πθt(1) > πθt(2), then using the expected softmax PG update θt+1(a) θt(a) + η πθt(a) (r(a) π θtr), we have, πθt+1(1) πθt+1(2) = πθt(1) πθt(2) eη (πθt (1) πθt (2)) (r(1) π θt r) > πθt(1) which means that after one update πθt+1(1) will be even larger than πθt+1(2). Therefore, we have, θt+1(1) θt+1(2) > θt(1) θt(2) + C/t, (23) for some C > 0, which implies that, lim t πθt(1) πθt(2) = lim t eθt(1) θt(2) = , (24) i.e., πθt(1) 1 as t . The above arguments illustrate the self-reinforcing nature of Algorithm 1, such that whenever two (or more) actions have the same mean reward, the update will make only one one of their probabilities 2Consider a random walk, Yt+1 = Yt + Zt, where Zt = 1 with equal chance of 1/2. We have Et[Yt+1] Yt. However, we also have lim inft Yt = , and with positive probability, inft 1 Yt is not lower bounded. larger and larger, until one eventually dominates the others as t . Generalizing the arguments to stochastic updates will remove Assumption 2.1 in the proofs for Theorem 5.1. 5.2. Convergence Rate Given Theorem 5.1, a convergence rate result can then be proved using the following inequality (Mei et al., 2020b). Lemma 5.4 (Non-uniform Łojasiewicz (NŁ), Lemma 3 of Mei et al. (2020b)). Assume r has a unique maximizing action a . Let π = arg maxπ π r. Then, 2 πθ(a ) (π πθ) r . (25) Theorem 5.5 (Convergence rate and regret). Using Algorithm 1 with η = 2 40 K3/2 R3max , we have, for all t 1, E[(π πθt) r] C t=1 (π πθt) r min{ 2 Rmax C T, C log T + 1}, where C := 80 K3/2 R3 max 2 E[c2] , and c := inft 1 πθt(a ) > 0 is from Theorem 5.1. In Theorem 5.5, the dependence on t is optimal (Lai et al., 1985). However, the constant can be large, especially for a bad initialization. In short, the stochastic gradient algorithm inherits the initialization sensitivity and sub-optimal plateaus from the true policy gradient algorithm with softmax parameterization (Mei et al., 2020a). The detailed proof of Theorem 5.5 is elaborated in Appendix A.2. Theorem 5.6. There exists a problem, initialization θ1 RK, and a positive constant C > 0, such that, for all t t0 := C δ πθ1(a ), we have E[(π πθt) r] 0.9 , (26) where := r(a ) maxa =a r(a) is the reward gap of r. 6. The Effect of Baselines The original gradient bandit algorithm (Sutton & Barto, 2018) uses a baseline, which is a slightly modification of Algorithm 1. The difference is that Rt(at) in Algorithm 1 is replaced with Rt(at) Bt, where Bt R is an action independent baseline, as shown in Algorithm 2 in Appendix B. It is well known that action independent baselines do not introduce bias in the gradient estimate (Sutton & Barto, 2018). The utility of adding a baseline has typically been considered to be reducing the variance of the gradient estimates (Greensmith et al., 2004; Bhatnagar et al., 2007; Tucker Stochastic Gradient Succeeds for Bandits et al., 2018; Mao et al., 2018; Wu et al., 2018). Here we show that a similar effect manifests itself through improvements in the strong growth condition. Lemma 6.1 (Strong growth condition, Self-bounding noise property). Using Algorithm 2, we have, for all t 1, " dπ θt ˆrt ˆbt 8 R2 max Rmax K3/2 2 dπ θtr dθt where := mini =j |r(i) r(j)|, ˆbt(a) := I{at=a} πθt(a) Bt for all a [K], and Rt(at) Bt [ Rmax, Rmax]. Note that Rmax denotes the range of Rt(at) Bt after minus a baseline from sampled reward. Comparing Lemma 6.1 to Lemma 4.3 shows that the only difference is that a constant factor of R2 max is changed to R2 max. 
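Before interpreting this comparison, the baseline variant itself is easy to state in code: the only change relative to Algorithm 1 is that R_t(a_t) is replaced by R_t(a_t) − B_t. The sketch below is our own minimal rendering of this modification (Algorithm 2 is given in Appendix B); function and variable names are choices of the sketch.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def gradient_bandit_step_with_baseline(theta, a, R, B, eta):
    """One update of Algorithm 2: Algorithm 1 with R_t(a_t) replaced by R_t(a_t) - B_t."""
    pi = softmax(theta)
    g = -pi * (R - B)                 # theta(a) <- theta(a) - eta * pi(a) * (R_t(a_t) - B_t), for a != a_t
    g[a] = (1.0 - pi[a]) * (R - B)    # theta(a_t) <- theta(a_t) + eta * (1 - pi(a_t)) * (R_t(a_t) - B_t)
    return theta + eta * g
```

Any baseline B_t that does not depend on a_t leaves the update unbiased; in Lemma 6.1, R̃_max denotes the range of the shifted reward R_t(a_t) − B_t.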
The comparison indicates that a deeper reason for the variance reduction effect of adding a baseline is that it reduces the effective reward range. The same improved constant carries over to all of the related results, including larger constant learning rates, larger per-step progress, and better constants in the global convergence results. It is worth noting that the effect of a baseline differs between algorithms. Here we see that Algorithm 1 already achieves global convergence without any baseline, while adding a baseline provides constant-factor improvements. For the natural policy gradient (NPG) method (Kakade, 2002; Agarwal et al., 2021), by contrast, it is known that without a baseline, on-policy NPG can fail by converging to a sub-optimal deterministic policy (Chung et al., 2020; Mei et al., 2021a), while adding the value baseline π_{θt}^⊤ r restores a guarantee of global convergence by reducing the aggressiveness of the update.

7. Simulation Results

In this section, we conduct several simulations to empirically verify the theoretical findings of asymptotic global convergence and the convergence rate.

7.1. Asymptotic Global Convergence

We first design experiments to verify the asymptotic global convergence. We run Algorithm 1 on stochastic bandit problems with K = 10 actions. The mean reward vector r is randomly generated in (0, 1)^K. For each sampled action a_t ~ π_{θt}(·), the observed reward is generated as R_t(a_t) = r(a_t) + Z_t, where Z_t ~ N(0, 1) is Gaussian noise. For the baseline in Algorithm 2, we use the average reward as suggested by Sutton & Barto (2018), i.e., B_t := Σ_{s=1}^{t-1} R_s(a_s) / (t - 1) for all t > 1. The learning rate is η = 0.01. We use an adversarial initialization such that π_{θ1}(a*) < 1/K. As shown in Figure 2, π_{θt}(a*) → 1 eventually, even if its initial value π_{θ1}(a*) is very small, verifying the asymptotic global convergence in Theorem 5.1. On the other hand, the long plateaus observed in Figure 2 verify Theorem 5.6.

[Figure 2: curves of π_{θt}(a*) versus t, for "Without baselines" and "Average reward baseline".] Figure 2. Both subfigures show results for π_{θt}(a*). The left is for π_{θ1}(a*) = 0.03, and the right is for π_{θ1}(a*) = 0.02.

One unexpected observation in Figure 2 is that the average reward baseline performs worse, which differs from Sutton & Barto (2018). After checking numerical values, we found that since the initialization is bad, a sub-optimal action i with r(i) < r(a*) is pulled most of the time, which results in B_t ≈ r(i). This implies that when a* is pulled, θ_t(a*) is increased less than without a baseline, since the effective reward R_t(a*) − B_t ≈ r(a*) − r(i) is relatively small. Therefore, the average reward baseline is not universally beneficial, which raises the question of designing adaptive baselines; this is beyond the scope of this paper and we leave it as future work.

7.2. Convergence Rate

We further check the convergence rate empirically in the same problem setting. We use a uniform initialization, i.e., π_{θ1}(a) = 1/K for all a ∈ [K], and the results are shown in Figure 3. Each curve is an average over 10 independent runs, and the total iteration number is T = 2 × 10^6. As shown in Figure 3(b), the slope in log scale is close to -1, which implies that log (π* − π_{θt})^⊤ r ≈ −log t + C. Equivalently, we have (π* − π_{θt})^⊤ r ≈ C′/t, verifying the O(1/t) convergence rate in Theorem 5.5.

[Figure 3: optimal action probability and log sub-optimality gap versus log t, for "Without baselines" and "Average reward baseline"; a reference line with slope = -1 is shown.] Figure 3. Subfigure (a) shows the optimal action's probability and (b) shows the log sub-optimality gap, which justifies the global convergence rate in Theorem 5.5.
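For reference, the following self-contained script is our own rough reproduction of the setting of Section 7.1, not the authors' code: K = 10 actions with means drawn uniformly from (0, 1), N(0, 1) reward noise, η = 0.01, T = 2 × 10^6 iterations as in Section 7.2, and an adversarial initialization with π_{θ1}(a*) = 0.02. The seed and the particular initialization construction are assumptions of the sketch, and a full run takes a minute or two in pure Python.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def run(r_mean, T, eta=0.01, use_baseline=False, pi1_opt=0.02, seed=0):
    """Algorithm 1 (optionally with the average-reward baseline) on a K-armed Gaussian bandit."""
    rng = np.random.default_rng(seed)
    K, a_star = len(r_mean), int(np.argmax(r_mean))
    theta = np.zeros(K)
    # Adversarial initialization: set theta(a*) so that pi_theta_1(a*) = pi1_opt < 1/K.
    theta[a_star] = np.log(pi1_opt * (K - 1) / (1.0 - pi1_opt))
    reward_sum = 0.0
    for t in range(1, T + 1):
        pi = softmax(theta)
        a = rng.choice(K, p=pi)
        R = r_mean[a] + rng.normal()                                  # R_t(a_t) = r(a_t) + Z_t, Z_t ~ N(0, 1)
        B = reward_sum / (t - 1) if (use_baseline and t > 1) else 0.0  # average-reward baseline B_t
        g = -pi * (R - B)
        g[a] = (1.0 - pi[a]) * (R - B)
        theta += eta * g
        reward_sum += R
    return softmax(theta)[a_star]

rng = np.random.default_rng(42)
r_mean = rng.uniform(0.0, 1.0, size=10)                # K = 10, means generated in (0, 1)^K
for use_baseline in (False, True):
    # pi_theta_t(a*) is expected to rise toward 1 after a long plateau (cf. Figure 2);
    # the average-reward baseline can be slower under this bad initialization.
    print(use_baseline, run(r_mean, T=2_000_000, use_baseline=use_baseline))
```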
7.3. Average Gradient Norm Convergence

In this section, we empirically verify the finite-step convergence rate in terms of the average gradient norm. We follow exactly the same experimental settings as in Section 7.2, but evaluate the average gradient norm along the algorithm's iterations. We plot the results in log scale in Figure 4; they align well with the convergence rate for the average gradient norm in Eq. (13).

[Figure 4: average squared gradient norm versus log t, for "Without baselines" and "Average reward baseline"; a reference line with slope = -1 is shown.] Figure 4. Average squared gradient norm (1/t) Σ_{1 ≤ s ≤ t} E ||dπ_{θs}^⊤ r / dθ_s||₂² in log scale (l.h.s. of Eq. (13)).

8. Boltzmann Exploration

The softmax parameterization used in gradient bandit algorithms is also called the Boltzmann distribution (Sutton & Barto, 2018), on which the classic EXP3 algorithm is based; EXP3 uses an O(1/√t) learning rate and achieves an O(1/√t) rate (Auer et al., 2002b). The Boltzmann distribution has also been used in other policy gradient based algorithms. For example, Lan et al. (2022) show that mirror descent (MD) or NPG with strongly convex regularizers, increasing batch sizes, and O(1/t) learning rates achieves an O(log t / t) rate. However, the convergence in (Lan et al., 2022) relies heavily on batch observations for an accurate approximation of the full gradient, which is impossible in the stochastic bandit setting, and is thus not applicable here.

There are also existing results revealing weaknesses of the Boltzmann distribution. Cesa-Bianchi et al. (2017) show that without count based bonuses, Boltzmann exploration can be "done wrong". In particular, there exists a 2-armed stochastic bandit problem with rewards bounded in [0, 1] for which the rule Pr(a_t = a) ∝ exp{η_t μ̂_{t,a}}, where μ̂_{t,a} := Σ_{s=1}^t I{a_s = a} R_s(a) / Σ_{s=1}^t I{a_s = a} is the empirical mean estimator of r(a), with η_t > 2 log t for all t ≥ 1, incurs linear regret Ω(T). Instead of the aggressive parameter update of the softmax policy in (Cesa-Bianchi et al., 2017), the stochastic gradient bandit can be understood as a better way of updating the parameters (with weights diminishing when an action is not selected) that ensures global convergence. Cesa-Bianchi et al. (2017) also claim that Boltzmann exploration is equivalent to the rule selecting a_{t+1} = argmax_{a ∈ [K]} (μ̂_{t,a} + Z_{t,a}), which is the widely used Gumbel-Softmax trick (Jang et al., 2016), where Z_{t,a} is an independent Gumbel random variable for each a ∈ [K]. Cesa-Bianchi et al. (2017) show an O(log T) regret when replacing Z_{t,a} by β_{t,a} Z_{t,a}, where β_{t,a} is determined by the count information Σ_{s=1}^t I{a_s = a}. This suggests that other techniques might be incorporated into the gradient bandit algorithm to further improve it, especially with respect to the poor initialization and problem dependent constants in Theorem 5.1, as also observed in Figure 2.

9. Conclusions

This work provides the first global convergence result for the gradient bandit algorithm (Sutton & Barto, 2018) with constant learning rates. The main technical finding is that the noise in the stochastic gradient updates vanishes automatically, so that explicit noise control is unnecessary for global convergence. This work uncovers a new understanding that stochastic gradient itself manages to achieve weak exploration, in the sense that the distribution over the arms almost surely concentrates asymptotically on a globally optimal action.
One important future direction is to improve stochastic gradient to achieve strong exploration with finite-time optimal rates. Another direction of interest is to generalize the ideas and techniques to reinforcement learning. Acknowledgements The authors would like to thank anonymous reviewers for their valuable comments. Jincheng Mei would like to thank Ramki Gummadi for providing feedback on a draft of this manuscript. Csaba Szepesv ari, Dale Schuurmans and Zixin Zhong gratefully acknowledge funding from the Canada CIFAR AI Chairs Program, Amii and NSERC. Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. J. Mach. Learn. Res., 22(98):1 76, 2021. Agrawal, S. and Goyal, N. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pp. 39 1. JMLR Workshop and Conference Proceedings, 2012. Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Stochastic Gradient Succeeds for Bandits Conference on Machine Learning, pp. 242 252. PMLR, 2019. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235 256, 2002a. Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48 77, 2002b. Bhatnagar, S., Ghavamzadeh, M., Lee, M., and Sutton, R. S. Incremental natural actor-critic algorithms. Advances in neural information processing systems, 20, 2007. Boucheron, S., Lugosi, G., and Massart, P. On concentration of self-bounding functions. Electronic Journal of Probability, 14:1884 1899, 2009. Breiman, L. Probability. SIAM, 1992. Cesa-Bianchi, N., Gentile, C., Lugosi, G., and Neu, G. Boltzmann exploration done right. Advances in neural information processing systems, 30, 2017. Chung, W., Thomas, V., Machado, M. C., and Roux, N. L. Beyond variance reduction: Understanding the true impact of baselines on policy optimization. ar Xiv preprint ar Xiv:2008.13773, 2020. Ding, Y., Zhang, J., and Lavaei, J. Beyond exact gradients: Convergence of stochastic soft-max policy gradient methods with entropy regularization. ar Xiv preprint ar Xiv:2110.10117, 2021. Doob, J. L. Measure theory, volume 143. Springer Science & Business Media, 2012. Gawlikowski, J., Tassi, C. R. N., Ali, M., Lee, J., Humt, M., Feng, J., Kruspe, A., Triebel, R., Jung, P., Roscher, R., et al. A survey of uncertainty in deep neural networks. ar Xiv preprint ar Xiv:2107.03342, 2021. Ghadimi, S. and Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341 2368, 2013. Glynn, P. W. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33 (10):75 84, 1990. Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(9), 2004. Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. ar Xiv preprint ar Xiv:1611.01144, 2016. Kakade, S. M. A natural policy gradient. In Advances in neural information processing systems, pp. 1531 1538, 2002. Lai, T. L., Robbins, H., et al. 
Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6 (1):4 22, 1985. Lan, G., Li, Y., and Zhao, T. Block policy mirror descent. ar Xiv preprint ar Xiv:2201.05756, 2022. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. Softmax policy gradient methods can take exponential time to converge. In Conference on Learning Theory, pp. 3107 3110. PMLR, 2021. Mao, H., Venkatakrishnan, S. B., Schwarzkopf, M., and Alizadeh, M. Variance reduction for reinforcement learning in input-driven environments. ar Xiv preprint ar Xiv:1807.02264, 2018. Mc Diarmid, C. and Reed, B. Concentration for selfbounding functions and an inequality of talagrand. Random Structures & Algorithms, 29(4):549 557, 2006. Mei, J., Xiao, C., Dai, B., Li, L., Szepesv ari, C., and Schuurmans, D. Escaping the gravitational pull of softmax. Advances in Neural Information Processing Systems, 33: 21130 21140, 2020a. Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, pp. 6820 6829. PMLR, 2020b. Mei, J., Dai, B., Xiao, C., Szepesvari, C., and Schuurmans, D. Understanding the effect of stochasticity in policy optimization. Advances in Neural Information Processing Systems, 34:19339 19351, 2021a. Mei, J., Gao, Y., Dai, B., Szepesvari, C., and Schuurmans, D. Leveraging non-uniformity in first-order non-convex optimization. In International Conference on Machine Learning, pp. 7555 7564. PMLR, 2021b. Mei, J., Chung, W., Thomas, V., Dai, B., Szepesvari, C., and Schuurmans, D. The role of baselines in policy gradient optimization. Advances in Neural Information Processing Systems, 2022. Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574 1609, 2009. Stochastic Gradient Succeeds for Bandits Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. ar Xiv preprint ar Xiv:2203.02155, 2022. Robbins, H. and Monro, S. A stochastic approximation method. The annals of mathematical statistics, pp. 400 407, 1951. Schmidt, M. and Roux, N. L. Fast convergence of stochastic gradient descent under a strong growth condition. ar Xiv preprint ar Xiv:1308.6370, 2013. Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889 1897, 2015. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018. Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933. Tucker, G., Bhupatiraju, S., Gu, S., Turner, R., Ghahramani, Z., and Levine, S. The mirage of action-dependent baselines in reinforcement learning. In International conference on machine learning, pp. 5015 5024. PMLR, 2018. Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229 256, 1992. Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A. M., Kakade, S., Mordatch, I., and Abbeel, P. 
Variance reduction for policy gradient with action-dependent factorized baselines. ar Xiv preprint ar Xiv:1803.07246, 2018. Yuan, R., Gower, R. M., and Lazaric, A. A general sample complexity analysis of vanilla policy gradient. In International Conference on Artificial Intelligence and Statistics, pp. 3332 3380. PMLR, 2022. Zhang, J., Kim, J., O Donoghue, B., and Boyd, S. Sample efficient reinforcement learning with reinforce. ar Xiv preprint ar Xiv:2010.11364, 2020a. Zhang, J., Ni, C., Szepesvari, C., Wang, M., et al. On the convergence and sample efficiency of variance-reduced policy gradient method. Advances in Neural Information Processing Systems, 34:2228 2240, 2021. Zhang, K., Koppel, A., Zhu, H., and Basar, T. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 58(6):3586 3612, 2020b. Stochastic Gradient Succeeds for Bandits A. Proofs for Algorithm 1 Proposition 2.3. Algorithm 1 is equivalent to the following stochastic gradient ascent update, θt+1 θt + η dπ θtˆrt dθt (27) = θt + η diag(πθt) πθtπ θt ˆrt, (28) where Et h dπ θt ˆrt dθt i = dπ θtr dθt , and dπθ dθ = diag(πθ) πθπ θ is the Jacobian of θ 7 πθ := softmax(θ), and ˆrt(a) := πθt(a) Rt(a) for all a [K] is the importance sampling (IS) estimator, and we set Rt(a) = 0 for all a = at. Proof. Using the definition of softmax Jacobian and ˆrt, we have, for all a [K], θt+1(a) θt(a) + η πθt(a) ˆrt(a) π θtˆrt (29) = θt(a) + η πθt(a) (ˆrt(a) Rt(at)) (30) ( η (1 πθt(a)) Rt(a), if at = a , η πθt(a) Rt(at), otherwise . Proposition 3.1 (Unbiased stochastic gradient with bounded variance / scale). Using Algorithm 1, we have, for all t 1, dπ θtˆrt dθt = dπ θtr dθt , (31) " dπ θtˆrt dθt 2 R2 max, (32) where Et[ ] is on randomness from the on-policy sampling at πθt( ) and reward sampling Rt(at) Pat. Proof. First part, Eq. (31). For all action a [K], the true softmax PG is, dπ θtr dθt(a) = πθt(a) r(a) π θtr . (33) For all a [K], the stochastic softmax PG is, dπ θtˆrt dθt(a) = πθt(a) ˆrt(a) π θtˆrt (34) = πθt(a) (ˆrt(a) Rt(at)) (35) = (I {at = a} πθt(a)) Rt(at). (36) For the sampled action at, we have, E Rt(at) Pat dπ θtˆrt dθt(at) = E Rt(at) Pat h (1 πθt(at)) Rt(at) i (37) = (1 πθt(at)) E Rt(at) Pat h Rt(at) i (38) = (1 πθt(at)) r(at). (39) For any other not sampled action a = at, we have, E Rt(at) Pat dπ θtˆrt dθt(a) = E Rt(at) Pat h πθt(a) Rt(at) i (40) = πθt(a) E Rt(at) Pat h Rt(at) i (41) = πθt(a) r(at). (42) Stochastic Gradient Succeeds for Bandits Combing Eqs. (37) and (40), we have, for all a [K], E Rt(at) Pat dπ θtˆrt dθt(a) = (I {at = a} πθt(a)) r(at). (43) Taking expectation over at πθt( ), we have, dπ θtˆrt dθt(a) = Pr (at = a) E Rt(at) Pat dπ θtˆrt dθt(a) at = a + Pr (at = a) E Rt(at) Pat dπ θtˆrt dθt(a) at = a (44) = πθt(a) (1 πθt(a)) r(a) + X a =a πθt(a ) ( πθt(a)) r(a ) (45) a =a πθt(a ) (r(a) r(a )) (46) = πθt(a) r(a) π θtr . (47) Combining Eqs. (33) and (44), we have, for all a [K], dπ θtˆrt dθt(a) = dπ θtr dθt(a), (48) which implies Eq. (31) since a [K] is arbitrary. Second part, Eq. (32). The squared stochastic PG norm is, dπ θtˆrt dθt dπ θtˆrt dθt(a) a [K] (I {at = a} πθt(a))2 Rt(at)2 (by Eq. (34)) (50) a [K] (I {at = a} πθt(a))2 (by Eq. (1)) (51) = R2 max (1 πθt(at))2 + X a =at πθt(a)2 (52) R2 max (1 πθt(at))2 + X a =at πθt(a) 2 ( x 2 x 1) (53) = 2 R2 max (1 πθt(at))2 . (54) Therefore, we have, for all a [K], conditioning on at = a, " dπ θtˆrt dθt 2 R2 max (1 πθt(a))2 . 
(55) Taking expectation over at πθt( ), we have, " dπ θtˆrt dθt a [K] Pr (at = a) " dπ θtˆrt dθt a [K] πθt(a) 2 R2 max (1 πθt(a))2 (57) a [K] πθt(a) (πθt(a) (0, 1) for all a [K]) (58) = 2 R2 max. Stochastic Gradient Succeeds for Bandits Lemma 4.1 (Non-uniform smoothness (NS), Mei et al. (2021b, Lemma 2)). For all θ RK, the spectral radius of Hessian matrix d2{π θ r} dθ2 RK K is upper bounded by 3 dπ θ r dθ 2, i.e., for all y RK, y d2{π θ r} dθ2 y 3 dπ θ r dθ 2 y 2 2. (59) Proof. See the proof in Mei et al. (2021b, Lemma 2). We include a proof for completeness. Let S := S(r, θ) RK K be the second derivative of the map θ 7 π θ r. Denote H(πθ) := diag(πθ) πθπ θ as the softmax Jacobian. By definition we have, dθ {H(πθ)r} (61) dθ (diag(πθ) πθπ θ )r . (62) Continuing with our calculation fix i, j [K]. Then, S(i,j) = d{πθ(i) (r(i) π θ r)} dθ(j) (63) dθ(j) (r(i) π θ r) + πθ(i) d{r(i) π θ r} dθ(j) (64) = (δij πθ(j) πθ(i) πθ(j)) (r(i) π θ r) πθ(i) (πθ(j) r(j) πθ(j) π θ r) (65) = δij πθ(j) (r(i) π θ r) πθ(i) πθ(j) (r(i) π θ r) πθ(i) πθ(j) (r(j) π θ r), (66) where δij is the Kronecker s δ-function defined as, ( 1, if i = j, 0, otherwise. (67) To show the bound on the spectral radius of S, pick y RK. Then, j=1 S(i,j) y(i) y(j) (68) i πθ(i) (r(i) π θ r) y(i)2 2 X i πθ(i) (r(i) π θ r) y(i) X j πθ(j) y(j) (69) = (H(πθ)r) (y y) 2 (H(πθ)r) y π θ y (70) (H(πθ)r) (y y) + 2 (H(πθ)r) y π θ y (triangle inequality) (71) H(πθ)r y y 1 + 2 H(πθ)r 2 y 2 πθ 1 y (H older s inequality) (72) 3 H(πθ)r 2 y 2 2 , (73) where is Hadamard (component-wise) product, and the last inequality uses H(πθ)r H(πθ)r 2, y y 1 = y 2 2, πθ 1 = 1, and y y 2. Therefore, we have, y Sy 3 H(πθ)r 2 y 2 2 (by Eq. (68)) (74) = 3 diag(πθ) πθπ θ r 2 y 2 2 H(πθ) := diag(πθ) πθπ θ (75) = 3 dπ θ r dθ 2 y 2 2 . (by Eq. (3)) Stochastic Gradient Succeeds for Bandits Lemma 4.2 (NS between iterates). Using Algorithm 1 with η 0, 2 9 Rmax , we have, for all t 1, D(θt+1, θt) := (πθt+1 πθt) r Ddπ θtr dθt , θt+1 θt E β(θt) 2 θt+1 θt 2 2, (76) β(θt) = 6 2 9 Rmax η dπ θtr dθt Proof. Denote θζ := θt + ζ (θt+1 θt) with some ζ [0, 1]. According to Taylor s theorem, we have, (πθt+1 πθt) r Ddπ θtr dθt , θt+1 θt E = 1 (θt+1 θt) d2π θζr dθζ 2 (θt+1 θt) 2 θt+1 θt 2 2. (by Lemma 4.1) (79) Denote θζ1 := θt + ζ1 (θζ θt) with some ζ1 [0, 1]. We have, dθζ dπ θtr dθt dθ2 ζ1 , θζ θt 2 (Fundamental theorem of calculus) (80) 2 θζ θt 2 dζ1 (by Cauchy Schwarz) (81) 0 3 dπ θζ1r 2 θζ θt 2 dζ1 (by Lemma 4.1) (82) 0 3 dπ θζ1r 2 ζ θt+1 θt 2 dζ1 (θζ := θt + ζ (θt+1 θt)) (83) 0 3 dπ θζ1r 2 η dπ θtˆrt dθt ζ [0, 1], and θt+1 = θt + η dπ θtˆrt dθt where the second inequality is because of the Hessian is symmetric, and its operator norm is equal to its spectral radius. Therefore, we have, dπ θζr 2 dπ θtr dθt dθζ dπ θtr dθt 2 (by triangle inequality) (85) 2 + 3 η dπ θtˆrt dθt 2 dζ1. (by Eq. (80)) (86) Denote θζ2 := θt + ζ2 (θζ1 θt) with ζ2 [0, 1]. Using similar calculation in Eq. (80), we have, 2 dπ θtr dθt 2 + dπ θζ1r dθζ1 dπ θtr dθt 2 + 3 η dπ θtˆrt dθt 2 dζ2. (88) Combining Eqs. (85) and (87), we have, 1 + 3 η dπ θtˆrt dθt 3 η dπ θtˆrt dθt 2 dζ2dζ1, (89) Stochastic Gradient Succeeds for Bandits which, by recurring the above arguments, implies that, dπ θζr 3 η dπ θtˆrt dθt Next, we have, 3 η dπ θtˆrt dθt 2 R2max (1 πθt(at))2 (by Eq. (49)) (91) < 3 2 9 Rmax πθt(at) (0, 1), and η < 2 9 Rmax Combining Eqs. (90) and (91), we have, dπ θζr 1 3 η dπ θt ˆrt dθt 3 η dπ θtˆrt dθt 2 (0, 1) from Eq. (91) 2 Rmax dπ θtr dθt 2 (by Eq. (49)) (95) 2 Rmax η dπ θtr dθt Combining Eqs. 
(78) and (94), we have, (πθt+1 πθt) r Ddπ θtr dθt , θt+1 θt E 3 2 θt+1 θt 2 2 (97) 3 2 9 Rmax η dπ θtr dθt 2 θt+1 θt 2 2. Lemma 4.3 (Strong growth conditions / Self-bounding noise property). Using Algorithm 1, we have, for all t 1, " dπ θtˆrt dθt 8 R3 max K3/2 2 dπ θtr dθt where := mini =j |r(i) r(j)|. Proof. Given t 1, denote kt as the action with largest probability, i.e., kt := arg maxa [K] πθt(a). We have, According to Eq. (56), we have, " dπ θtˆrt dθt a [K] Pr (at = a) " dπ θtˆrt dθt a [K] πθt(a) (1 πθt(a))2 (101) = 2 R2 max πθt(kt) (1 πθt(kt))2 + X a =kt πθt(a) (1 πθt(a))2 (102) 2 R2 max 1 πθt(kt) + X a =kt πθt(a) (πθt(a) (0, 1) for all a [K]) (103) = 4 R2 max (1 πθt(kt)) . (104) Stochastic Gradient Succeeds for Bandits On the other hand, we have, dπ θtr dθt a [K] πθt(a)2 (r(a) π θtr)2 (105) a [K] (r(a ) π θtr)2 X a [K] πθt(a)2 (r(a) π θtr)2 P a [K] (r(a ) π θtr)2 (106) a [K] (r(a ) π θtr)2 a [K] πθt(a) (r(a) π θtr)2 P a [K] (r(a ) π θtr)2 (Jensen s inequality) (107) a [K] (r(a ) π θtr)2 a [K] πθt(a) (r(a) π θtr)2 #2 1 4 K R2max a [K] πθt(a) (r(a) π θtr)2 #2 , r [ Rmax, Rmax]K (109) which implies that, dπ θtr dθt a [K] πθt(a) (r(a) π θtr)2. (110) Using similar calculations in the proofs for Mei et al. (2021a, Lemma 2), we have, a [K] πθt(a) (r(a) π θtr)2 = i=1 πθt(i) r(i)2 K X i=1 πθt(i) r(i) 2 (111) i=1 πθt(i) r(i)2 i=1 πθt(i)2 r(i)2 2 i=1 πθt(i) r(i) j=i+1 πθt(j) r(j) (112) i=1 πθt(i) r(i)2 (1 πθt(i)) 2 i=1 πθt(i) r(i) j=i+1 πθt(j) r(j) (113) i=1 πθt(i) r(i)2 X j =i πθt(j) 2 i=1 πθt(i) r(i) j=i+1 πθt(j) r(j) (114) j=i+1 πθt(j) r(i)2 + r(j)2 2 i=1 πθt(i) r(i) j=i+1 πθt(j) r(j) (115) j=i+1 πθt(j) (r(i) r(j))2, (116) which implies that, a [K] πθt(a) (r(a) π θtr)2 j=i+1 πθt(j) (r(i) r(j))2 (fewer terms) (117) i=1 πθt(i) πθt(kt) (r(i) r(kt))2 + πθt(kt) j=kt+1 πθt(j) (r(kt) r(j))2 (fewer terms) (118) = πθt(kt) X a =kt πθt(a) (r(a) r(kt))2 (119) K (1 πθt(kt)) , (by Eq. (99)) (120) Stochastic Gradient Succeeds for Bandits where := mini =j |r(i) r(j)|. Therefore, we have, " dπ θtˆrt dθt 4 R2 max (1 πθt(kt)) (by Eq. (100)) (121) 4 R2 max K 2 X a [K] πθt(a) (r(a) π θtr)2 (by Eq. (117)) (122) 4 R2 max K 2 2 K Rmax dπ θtr dθt 2 (by Eq. (110)) (123) = 8 R3 max K3/2 2 dπ θtr dθt Lemma 4.6 (Constant learning rates). Using Algorithm 1 with η = 2 40 K3/2 R3max , we have, for all t 1, π θtr Et[π θt+1r] 2 80 K3/2 R3max dπ θtr dθt Proof. Using the learning rate, 40 K3/2 R3max (125) = 4 45 Rmax 2 R2max 1 K3/2 45 4 45 Rmax 4 1 2 40, ( 2 Rmax, and K 2) (127) < 4 45 Rmax , (128) we have η 0, 2 9 Rmax . According to Lemma 4.2, we have, (πθt+1 πθt) r Ddπ θtr dθt , θt+1 θt E 3 2 9 Rmax η dπ θtr dθt 2 θt+1 θt 2 2 (129) 3 2 9 Rmax 4 45 Rmax dπ θtr dθt 2 θt+1 θt 2 2 (by Eq. (128)) (130) 2 dπ θtr dθt 2 θt+1 θt 2 2, (131) which implies that, π θtr π θt+1r Ddπ θtr dθt , θt+1 θt E + 5 2 dπ θtr dθt 2 θt+1 θt 2 2 (132) = η Ddπ θtr dθt , dπ θtˆrt dθt 2 dπ θtr dθt 2 η2 dπ θtˆrt dθt Stochastic Gradient Succeeds for Bandits where the last equation uses Algorithm 1. Taking expectation over at πθt( ) and Rt(at) Pat, we have, π θtr Et[π θt+1r] η Ddπ θtr dθt , Et dπ θtˆrt dθt 2 dπ θtr dθt " dπ θtˆrt dθt = η dπ θtr dθt 2 dπ θtr dθt " dπ θtˆrt dθt (by Proposition 3.1) (135) η dπ θtr dθt 2 dπ θtr dθt 2 η2 8 R3 max K3/2 2 dπ θtr dθt 2 (by Lemma 4.3) (136) = η + η2 20 R3 max K3/2 80 K3/2 R3max dπ θtr dθt 2 . (by Eq. (125)) Corollary 4.7. Using Algorithm 1, we have, the sequence {π θtr}t 1 converges w. p. 1. Proof. Setting Yt = r(a ) π θtr, we have Yt [ Rmax, Rmax] by Eq. (1). 
Define Ft as the σ-algebra generated by {a1, R1(a1), a2, R2(a2), . . . , at 1, Rt 1(at 1)}. Note that Yt is Ft-measurable since θt is a deterministic function of a1, R1(a1), . . . , at 1, Rt 1(at 1). According to Lemma 4.6, using Algorithm 1, we have, for all t 1, π θtr Et[π θt+1r] 0, which indicates that E[Yt+1|Ft] Yt. Hence, the conditions of Doob s super-martingale theorem (Theorem C.1) are satisfied and the result follows. A.1. Proof of Theorem 5.1 Theorem 5.1 (Asymptotic global convergence). Using Algorithm 1, we have, almost surely, πθt(a ) 1, as t , (138) which implies that inft 1 πθt(a ) > 0. Proof. According to Algorithm 1, for each a [K], the update is, θt+1(a) = θt(a) + η πθt(a) I {at = a} πθt(a) Rt(a) Rt(at) . (139) Given i [K], define the following set P(i) of generalized one-hot policy , A(i) := {j [K] : r(j) = r(i)} , (140) P(i) := π (K) : X j A(i) π(j) = 1 . (141) We make the following two claims. Claim 1. Almost surely, πθt approaches one generalized one-hot policy , i.e., there exists (a possibly random) i [K], such that P j A(i) πθt(j) 1 almost surely as t . Claim 2. Almost surely, πθt cannot approach any sub-optimal generalized one-hot policies , i.e., i in the previous claim must be an optimal action. From Claim 2, it follows that P j A(a ) πθt(j) 1 almost surely, as t and thus the policy sequence obtained almost surely convergences to a globally optimal policy π . Proof of Claim 1. According to Corollary 4.7, we have that for some (possibly random) c [ Rmax, Rmax], almost surely, lim t π θtr = c . (142) Stochastic Gradient Succeeds for Bandits Thanks to π θtr [ Rmax, Rmax] and π θtr Et[π θt+1r] 0 by Lemma 4.6, we have that Xt = π θtr (t 1) satisfies the conditions of Corollary 3 in (Mei et al., 2022). Hence, by this result, almost surely, lim t Et[π θt+1r] π θt+1r = 0 , (143) which, combined with Eq. (142) also gives that limt Et[π θt+1r] = c almost surely. Hence, lim t Et[π θt+1r] π θtr = c c = 0, a.s. (144) According to Lemma 4.6, we have, Et[π θt+1r] π θtr 2 80 K3/2 R3max dπ θtr dθt 80 K3/2 R3max i=1 πθt(i)2 r(i) π θtr 2 . (by Eq. (14)) (146) Combining Eqs. (144) and (145), we have, with probability 1, i=1 πθt(i)2 r(i) π θtr 2 = 0, (147) which implies that, for all i [K], almost surely, lim t πθt(i)2 r(i) π θtr 2 = 0. (148) We claim that c, the almost sure limit of π θtr, is such that almost surely, for some (possibly random) i [K], c = r(i) almost surely. We prove this by contradiction. Let Ei = {c = r(i)}. Hence, our goal is to show that P( i Ei) = 1. Clearly, this follows from P( i Ec i ) = 0, hence, we prove this. On Ec i , since limt π θtr = r(i), we also have lim t r(i) π θtr 2 > 0, almost surely on Ec i . (149) This, together with Eq. (148) gives that almost surely on Ec i , lim t πθt(i)2 = 0. (150) Hence, on i Ec i , almost surely, for all i [K], limt πθt(i)2 = 0. This contradicts with that P i πθt(i) = 1 holds for all t 1, and hence we must have that P( i Ec i ) = 0, finishing the proof that P( i Ei) = 1. Now, let i [K] be the (possibly random) index of the action for which c = r(i) almost surely. Recall that A(i) contains all actions j with r(j) = r(i) (cf. Eq. (140)). Clearly, it holds that for all j A(i), lim t π θtr = r(j), a.s., (151) and we have, for all k A(i), lim t r(k) π θtr 2 > 0, a.s., (152) which implies that, k A(i) πθt(k)2 = 0, a.s. (153) Therefore, we have, j A(i) πθt(j) = 1, a.s., (154) Stochastic Gradient Succeeds for Bandits which means πθt a.s. approaches the generalized one-hot policy P(i) in Eq. 
(141) as t , finishing the proof of the first claim. Proof of Claim 2. Recall that this claim stated that limt P j A(a ) πθt(j) = 1. The brief sketch of the proof is as follows: By Claim 1, there exists a (possibly random) i [K] such that P j A(i) πθt(j) 1 almost surely, as t . If i = a almost surely, Claim 2 follows. Hence, it suffices to consider the event that {i = a } and show that this event has zero probability mass. Hence, in the rest of the proof we assume that we are on the event when i = a . Since i = a , there exists at least one good action a+ [K] such that r(a+) > r(i). The two cases are as follows. 2a) All good actions are sampled finitely many times as t . 2b) At least one good action is sampled infinitely many times as t . In both cases, we show that P j A(i) exp{θt(j)} < as t (but for different reasons), which is a contradiction with the assumption of P j A(i) πθt(j) 1 as t , given that a good action s parameter is almost surely lower bounded. Hence, i = a almost surely does not happen, which means that almost surely i = a . Let us now turn to the details of the proof. We start with some useful extra notation. For each action a [K], for t 2, we have the following decomposition, θt(a) = θt(a) Et 1[θt(a)] | {z } Wt(a) + Et 1[θt(a)] θt 1(a) | {z } Pt 1(a) +θt 1(a), (155) while we also have, θ1(a) = θ1(a) E[θ1(a)] | {z } W1(a) +E[θ1(a)], (156) where E[θ1(a)] accounts for possible randomness in initialization of θ1. Define the following notations, Zt(a) := W1(a) + + Wt(a), ( cumulative noise ) (157) Wt(a) := θt(a) Et 1[θt(a)], ( noise ) (158) Pt(a) := Et[θt+1(a)] θt(a). ( progress ) (159) Recursing Eq. (155) gives, θt(a) = E[θ1(a)] + Zt(a) + P1(a) + + Pt 1(a) | {z } cumulative progress We have that Et[Wt+1(a)] = 0, for t = 0, 1, . . . . Let ( 1, if at = a ; 0, otherwise . (161) The update rule (cf. Algorithm 1) is, θt+1(a) = θt(a) + η πθt(a) I {at = a} πθt(a) Rt(a) Rt(at) , (162) where at πθt( ), and xt(a) Pa. Let Ft be the σ-algebra generated by a1, x1(a1), , at 1, xt 1(at 1): Ft = σ({a1, R1(a1), , at 1, Rt 1(at 1)}) . (163) Note that θt, It are Ft-measurable and ˆxt is Ft+1-measurable for all t 1. Let Et denote the conditional expectation with respect to Ft: Et[X] = E[X|Ft]. Stochastic Gradient Succeeds for Bandits Using the above notations, we have, Wt+1(a) = θt+1(a) Et[θt+1(a)] (164) = θt(a) + η [It(a) πθt(a)] Rt(at) θt(a) + η πθt(a) r(a) π θtr (165) = η [It(a) πθt(a)] Rt(at) η πθt(a) r(a) π θtr , (166) which implies that, Zt(a) = W1(a) + + Wt(a) (167) s=1 η [Is(a) πθs(a)] Rs(as) η πθs(a) r(a) π θsr . (168) We also have, Pt(a) = Et[θt+1(a)] θt(a) = η πθt(a) r(a) π θtr . (169) Now we apply Theorem 1 in Abbasi-Yadkori et al. (2011) to bound Zt(a). Fix any a. Let ηt = η, Xt = πθt(a) r(a) π θtr , (170) s=1 X2 s, St = s=1 ηs Xs, (171) then ηt = η and hence is η 2-Sub-Gaussian. Consequently, there exists event E1 such that P(E1) 1 δ, and when E1 holds, St 2 V 1 t η2 2 log det( Vt)1/2 1 + S1 t (a) 2 log (1 + S1 t (a))1/2 where S1 t (a) = Pt 1 s=1 πθs(a)2 r(a) π θsr 2. Noted that |Rt(at)| Rmax for all t. Similarly, there exists event E2 such that P(E2) 1 δ, and when E2 holds, s=1 It(a) Rs(as) 1 + S2 t (a) 2 log (1 + S2 t (a))1/2 where S2 t (a) = Pt 1 s=1 Is(a)2; there exists event E3 such that P(E3) 1 δ, and when E3 holds, s=1 πθs(a) Rs(as) 1 + S3 t (a) 2 log (1 + S3 t (a))1/2 where S3 t (a) = Pt 1 s=1 πθs(a)2. Let s=1 It(a), B = s=1 πθs(a), (176) Since 0 It(a) 1 and 0 πθt(a) 1 for all t, we have s=1 |Is(a)| = A, S3 t (a) s=1 |πθs(a)| = B. 
(177) Stochastic Gradient Succeeds for Bandits When E2 and E3 hold, Eqs. (174) and (175) indicate that |A B| A + B A log(1 + A) + p B log(1 + B) . (178) 2δ)/2, we have A log(1 + A) + B + C1 p B log(1 + B) (179) A C1 log(1 + A) B log(1 + B) + C2 1 (log(1 + A))2 There exists C2 C1, A > 10 such that when A A , C1 p log(1 + A) < C2 A1/4 and B log(1 + B) + C2 1 (log(1 + A))2 4 + C1 log(1 + A) (1 + C1) B + C2 2 A 4 + C2 A1/4 (1 + C1) B + C2 A1/4 2 + C2 A1/4 A C2 A1/4 p (1 + C1) B (183) (1 + C1) B + C2 2 4 + C2 (1 + C1) B + C2 A 4 (1 + C1) B + C2 4C1 B + 4 + C2. (186) A log(1 + A ) + B + C1 p B log(1 + B). (187) Hence, whether A A or not, we have A max {4 C1, 1 + C1} B + max n 4 + C2, C1 p A log(1 + A ) o . (188) Moreover, there exists C3 such that s=1 (It(a) πθt(a)) v u u t1 + Pt 1 s=1 πθs Rs(as) (1 + Pt 1 s=1 πθs Rs(as))1/2 1 + S4 t (a) 2 log (1 + S4 t (a))1/2 where S4 t (a) = Pt 1 s=1 πθs(a). Since C1 and C2 only depends on δ, C3 only depends on δ. In other words, C3 is a constant when δ is fixed. Since It(a) πθt(a) = (1 πθt(a)) (1 It(a)) and 0 1 πθt(a) 1, 0 1 It(a) 1, using the similar calculation to that of deriving Eq. (190), we can show that, there exists event E4 such that P(E4) 1 2δ, and when E4 holds, s=1 (It(a) πθs(a)) v u u t1 + Pt 1 s=1 πθs Rs(as) (1 + Pt 1 s=1 πθs Rs(as))1/2 1 + S5 t (a) 2 log (1 + S5 t (a))1/2 Stochastic Gradient Succeeds for Bandits where S5 t (a) = Pt 1 s=1(1 πθs(a)). Recall that i is the index of the (random) action I [K] with j A(I) πθt(j) = 1, a.s. (193) As noted earlier we consider the event {I = a }, where a is the index of an optimal action and we will show that this event has zero probability. Since {I = a } = i [K]{I = i, i = a }, it suffices to show that for any fixed i [K] index with r(i) < r(a ), {I = i, i = a } has zero probability. Hence, in what follows we fix such a suboptimal action s index i [K] and consider the event {I = i, i = a }. Partition the action set [K] into three parts using r(i) as follows, A(i) := {j [K] : r(j) = r(i)} , (from Eq. (140)) (194) A+(i) := a+ [K] : r(a+) > r(i) , (195) A (i) := a [K] : r(a ) < r(i) . (196) Because i was the index of a sub-optimal action, we have A+(i) = . According to Eq. (193), on {I = i} {I = i, i = a }, we have π θtr r(i) as t because r(i) π θtr = X k A(i) πθt(k) (r(i) r(k)) (197) k A(i) πθt(k) |r(i) r(k)| (198) j A(i) πθt(j). r [0, 1]K (199) Therefore, there exists τ 1 such that almost surely on {I = i, i = a } τ < while we also have r(a+) c π θtr r(a ) + c , for all t τ, (200) for all a+ A+(i), a A (i), where c > 0. Hence, for all t τ, a+ A+(i), a A (i), Pt(a+) > 0 > Pt(a ). For all a+ A+(i), when t > τ, S1 t (a+) = s=1 πθs(a+)2 (r(a+) π θsr)2 R2 max s=1 πθs(a+)2 s=1 πθs(a+) = R2 max S4 t (a+), (201) s=τ Ps(a+) = s=τ η πθs(a+) (r(a+) π θsr) η c s=τ πθs(a+). (202) Stochastic Gradient Succeeds for Bandits Hence, when E1, E2 and E3 hold, we have θt(a+) = E[θ1(a+)] + Zt(a+) + P1(a+) + + Pτ 1(a+) + Pτ(a+) + + Pt 1(a+) by Eq. (160) (203) E[θ1(a+)] η 1 + S1 t (a+) 2 log (1 + S1 t (a+))1/2 1 + S4 t (a+) 2 log (1 + S4 t (a+))1/2 + P1(a+) + + Pτ 1(a+) + Pτ(a+) + + Pt 1(a+) (by Eqs. (173) and (190)) (206) η R2 max (1 + C3) v u u t1 + Pt 1 s=1 πθs(a+) (1 + Pt 1 s=1 πθs(a+))1/2 + P1(a+) + + Pτ 1(a+) + η c s=τ πθs(a+) If P s=1 πθs(a+) < , θt(a+) is always finite and inft 1 θt(a+) > ; if P s=1 πθs(a+) = , we have ( ) goes to faster than ( ), and inft 1 θt(a+) > . Now take any ω E := {I = i, i = a }. 
Because P(E \ (E E1 E2 E3)) P(Ω\ (E1 E2 E3)) 3δ 0 as δ 0, we have that P-almost surely for all ω E there exists δ > 0 such that ω E E1 E2 E3 while Eq. (208) also holds for this δ. Take such a δ. By Eq. (208), inf t 1 θt(a+)(ω) > . (209) Hence, almost surely on E, c1(a+) := inf t 1 θt(a+) > . (210) Furthermore, c1 := min a+ A+ inf t 1 θt(a+) = min a+ A+ c1(a+) > . (211) Similarly, we can show that, almost surely on E, c2 := max a A sup t 1 θt(a ) < . (212) First case. 2a). Consider the event, | {z } E0(a+) i.e., any good action a+ A+(i) has finitely many updates as t . Pick a+ A+(i), such that P(N (a+) < ) > 0. According to the extended Borel-Cantelli lemma (Lemma C.2), we have, almost surely, n X t 1 πθt(a+) = o = N (a+) = . (214) Hence, taking complements, we have, n X t 1 πθt(a+) < o = N (a+) < (215) Stochastic Gradient Succeeds for Bandits also holds almost surely. Note that, for all good action a+, θt+1(a+) θt(a+) + ( η (1 πθt(a+)) Rt(a+), if at = a+ , η πθt(a+) Rt(at), otherwise . (216) Since the first update will be conducted finitely many times, and the second update will be conducted for infinitely many times, we have c3 := sup t 1 θt(a+) < . (217) j A(i) πθt(j) = a+ A+(i) eθt(a+) + P a A (i) eθt(a ) P a [K] eθt(a) (218) P a+ A+(i) eθt(a+) + P a A (i) ec2 P a [K] eθt(a) (219) a+ A+(i) eθt(a+) + ec2 c1 P a A (i) ec1 P a [K] eθt(a) (220) a+ A+(i) eθt(a+) + ec2 c1 |A (i)| a+ A+(i) ec1 P a [K] eθt(a) |A+(i)| 1 (221) a+ A+(i) eθt(a+) + ec2 c1 |A (i)| a+ A+(i) eθt(a+) P a [K] eθt(a) (222) = 1 + ec2 c1 |A (i)| a+ A+(i) πθt(a+). (223) According to Eq. (214), N (a+) < for all a+ A+(i). Since a A (i) πθt(a ) ec2 c1 |A (i)| a+ A+(i) πθt(a+) < , (224) we have N (a ) < for all a A (i). Therefore. P j A(i) N (i) = , which indicates that for all j A(i), the first update of i will be conducted for infinitely many times, and the second update will be conducted for finitely many times. According to Assumption 2.1, we have |A(i)| = 1, and for all j A(i), we have θt(j) θ1(j) + η r(j) s=1 (1 πθs(j)) (225) θ1(j) + η r(j) 1 + ec2 c1 |A (i)| a+ A+(i) πθt(a+) < . (226) Hence, we have c4 := lim sup t θt(j) < . (227) Stochastic Gradient Succeeds for Bandits Therefore, we have j A(i) πθt(j) = P j A(i) eθt(j) P j A(i) eθt(j) + P a+ A+(i) eθt(a+) + P a A (i) eθt(a ) (228) j A(i) eθt(j) P j A(i) eθt(j) + P a+ A+(i) eθt(a+) eθt(a ) > 0 (229) j A(i) eθt(j) P j A(i) eθt(j) + ec2 |A+(i)| (by Eq. (212)) (230) ec4 |A(i)| ec4 |A(i)| + ec2 |A+(i)| (by Eq. (227)) (231) which is a contradiction with the assumption of Eq. (193), showing that P(E0 E ) = 0. Second case. 2b). Consider the complement Ec 0 of E0, where E0 is by Eq. (213). Ec 0 indicates the event for at least one good action a+ A+(i) has infinitely many updates as t . We now show that also P(E ) = 0 where E = Ec 0 {I = i, i = a } = ( a+ A(i){N (a+) = }) {I = i, i = a }.3 Let A+(i) := {a+ A+(i) : N (a+) = }, and E (a+) := a+ A(i){N (a+) = } = a+ A+(i){N (a+) = }. (233) Then E = E (a+) {I = i, i = a }. Since E E (a+), the statement follows if P(E (a+)) = 0. Hence, assume that P(E (a+)) > 0. Fix δ [0, 1]. Using a similar calculation to that of Eq. (208), there exists an event Eδ such that P(Eδ) 1 2δ, and on Eδ, for all t τ, for all a+ A+(i), θt(a+) = E[θ1(a+)] + Zt(a+) + P1(a+) + + Pτ 1(a+) by Eq. (160) (234) + Pτ(a+) + + Pt 1(a+) (235) η R2 max (1 + C3) v u u t1 + Pt 1 s=1 πθs(a+) (1 + Pt 1 s=1 πθs(a+))1/2 + P1(a+) + + Pτ 1(a+) + η c s=τ πθs(a+) (by Eqs. (173) and (190)) . (237) On E (a+) Eδ, Nt 1(a+) as t , which with Eq. (214) indicates that P t 1 πθt(a+) = . 
When t , both ( ) and ( ) go to infinity while ( ) goes to infinity faster than ( ). Hence, we have θt(a+) as t . Since P(E (a+) \ (E (a+) Eδ)) 0 as δ 0, with an argument parallel to that used in the previous analysis (cf. the argument after Eq. (208)), we have, almost surely on E (a+), lim t θt(a+) = , (238) which implies that there exists τ1 1 such that on E (= E (a+) {I = i, i = a }) we have almost surely that τ1 < + while we also have that for all t τ1, for all a+ A+(i), r(i) r(a ) exp{θt(a+) c1} < c := r(a+) r(i) 3Here, E is redefined to minimize clutter; the previous definition is not used in this part of the proof. Stochastic Gradient Succeeds for Bandits For all a+ A+(i)/ A+(i), since limt a+ < , we have max a+ A+(i)/ A+(i) θt( a+) < . (240) Recall in Eq. (212) we show c2 = maxa A supt 1 θt(a ) < . We have max a (A+ A )/ A+(i) θt(a) < . (241) Fix any a+ A+, since limt θt(a+) = , there exists τ2 < such that when t τ2, a (A+ A )/ A+(i) exp{θt(a)} 0.1 exp{θt( a+)} (242) a (A+ A )/ A+(i) πθt(a) 0.1 πθt( a+) (243) j A(i) πt(j) = X a (A+ A )/ A+(i) πθt(a) + X a+ A+ πθt(a+) 0.1 πθt( a+) + X a+ A+ πθt(a+) 1.1 X a+ A+ πθt(a+). Hence, on E , for t τ1, almost surely, j A(i) πθt(j) r(i) + X a A (i) πθt(a ) r(a ) + X a+ A+(i) πθt(a+) r(a+) (245) a A (i) πθt(a ) r(i) r(a ) + X a+ A+(i) πθt(a+) r(a+) r(i) (246) a A (i) πθt(a ) r(i) r(a ) + X a+ A+(i) πθt( a+) r( a+) r(i) (247) r(a+) r(i) > 0, Eq. (195) (248) a+ A+(i) πθt( a+) r(a+) r(i) X πθt(a ) πθt( a+) | A+(i)| r(i) r(a ) (249) a+ A+(i) πθt( a+) r(a+) r(i) X πθt(a ) πθt( a+) r(i) r(a ) (250) a+ A+(i) πθt( a+) r( a+) r(i) X r(i) r(a ) exp{θt( a+) θt(a )} a+ A+(i) πθt(a+) r(a+) r(i) X r(i) r(a ) exp{θt(a+) c2} (by Eq. (212)) (252) > r(i) + c X a+ A+(i) πθt(a+) . (by Eq. (239)) (253) Therefore, on E , for all t τ1, for any j A(i), almost surely, by Eq. (169), we have Pt(j) = η πθt(j) (r(j) π θtr) < c X a+ A+(i) πθt(i) πθt(a+) < 0. (254) Since |A(i)| = 1, by Assumption 2.1, limt πθt(i) = 1, there exists τ2 such that when t τ3, πθt(i) > 1/2. Hence, when t max{τ1, τ3}, we have Pt(j) < c P a+ A+(i) πθt(a+)/2. Let τ = max{τ1, τ2, τ3}. When t > τ , for Stochastic Gradient Succeeds for Bandits s=1 πθs(j)2 (r(j) π θsr)2 = s=1 (1 πθs(j)) = s=1 (1 πθs(j)) + j A(i) πθs(j) i (256) s=1 (1 πθs(j)) + 1.1 a+ A+ πθs(a+). (by Eq. (244)) (257) Hence, for t τ , when E1 and E4 hold, we have θt(j) = E[θ1(j)] + Zt(j) + P1(j) + + Pτ 1(j) + Pτ(a+) + + Pt 1(j) by Eq. (160) (258) E[θ1(j)] + η 1 + S1 t (j) 2 log (1 + S1 t (j))1/2 + η C3 Rmax 1 + S5 t (j) 2 log (1 + S5 t (j))1/2 + P1(j) + + Pτ 1(j) + Pτ (j) + + Pt 1(j) (by Eqs. (173) and (192)) (260) E[θ1(j)] + η 1 + S1 t (j) 2 log (1 + S1 t (j))1/2 + η C3 Rmax 1 + S5 t (j) 2 log (1 + S5 t (j))1/2 | {z } ( ) (261) + P1(j) + + Pτ 1(j) 1 j=τ |Ps(j)| a+ A+ πθt(a+) Since ( ) and ( ) go to infinity when t , ( ) goes to infinity faster than ( ), ( ) goes to infinity faster than ( ), we have supt 1 θt(j) < . Let E δ = E1 E4. Since P(E c δ ) 2δ 0 as δ 0, with an argument parallel to that used in the previous analysis (cf. the argument after Eq. (208)), we get that there exists a random constant c5(j) such that almost surely on E , c5(j) < and supt τ1 θt(j) c5(j). Define c5 := maxj A(i) c5(j). Then, almost surely on E , c5 < and sup t τ1 max j A(i) θt(j) c5 . (263) By Eq. (238), there exists a+ A+(i), τ 1, such that almost surely on E , τ < while we also have inf t τ θt(a+) 0, (264) for all t τ . 
Hence, on E', almost surely for all t \ge \max(\tau', \tau''),

\sum_{j \in \mathcal{A}(i)} \pi_{\theta_t}(j) = \frac{ \sum_{j \in \mathcal{A}(i)} e^{\theta_t(j)} }{ \sum_{j \in \mathcal{A}(i)} e^{\theta_t(j)} + \sum_{a^+ \in \mathcal{A}^+(i)} e^{\theta_t(a^+)} + \sum_{a^- \in \mathcal{A}^-(i)} e^{\theta_t(a^-)} }   (265)
\le \frac{ \sum_{j \in \mathcal{A}(i)} e^{\theta_t(j)} }{ \sum_{j \in \mathcal{A}(i)} e^{\theta_t(j)} + e^{\theta_t(\bar a^+)} }   (e^{\theta_t(k)} > 0 for any k \in [K])   (266)
\le \frac{ \sum_{j \in \mathcal{A}(i)} e^{\theta_t(j)} }{ \sum_{j \in \mathcal{A}(i)} e^{\theta_t(j)} + 1 }   (by Eq. (264))   (267)
\le \frac{ e^{c_5} \, |\mathcal{A}(i)| }{ e^{c_5} \, |\mathcal{A}(i)| + 1 } < 1,   (by Eq. (263))   (268)

which contradicts Eq. (193). Hence, \Pr(E') = 0, finishing the proof.

Lemma 5.4 (Non-uniform Łojasiewicz (NŁ), Mei et al. (2020b, Lemma 3)). Assume r has a unique maximizing action a^*. Let \pi^* = \arg\max_{\pi \in \Delta} \pi^\top r. Then,

\| d\pi_\theta^\top r / d\theta \|_2 \ge \pi_\theta(a^*) \, (\pi^* - \pi_\theta)^\top r.   (270)

Proof. Using the definition of the softmax Jacobian, we have,

\| d\pi_\theta^\top r / d\theta \|_2^2 = \sum_{a \in [K]} \pi_\theta(a)^2 \big( r(a) - \pi_\theta^\top r \big)^2   (271)
\ge \pi_\theta(a^*)^2 \big( r(a^*) - \pi_\theta^\top r \big)^2,   (fewer terms)   (272)

which implies Eq. (270).

A.2. Proof of Theorem 5.5

Theorem 5.5 (Convergence rate and regret). Using Algorithm 1 with \eta = \frac{\Delta^2}{40 K^{3/2} R_{\max}^3}, we have, for all t \ge 1,

\mathbb{E}[(\pi^* - \pi_{\theta_t})^\top r] \le \frac{C}{t},   (273)

and

\sum_{t=1}^{T} \mathbb{E}[(\pi^* - \pi_{\theta_t})^\top r] \le \min\big\{ \sqrt{2 R_{\max} C T}, \; C (\log T + 1) \big\},   (274)

where C := \frac{80 K^{3/2} R_{\max}^3}{\Delta^2 \, \mathbb{E}[c^2]}, and c := \inf_{t \ge 1} \pi_{\theta_t}(a^*) > 0 is from Theorem 5.1.

Proof. First part, Eq. (273). According to Lemma 4.6, we have,

\mathbb{E}_t[\pi_{\theta_{t+1}}^\top r] - \pi_{\theta_t}^\top r \ge \frac{\Delta^2}{80 K^{3/2} R_{\max}^3} \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2^2   (275)
\ge \frac{\Delta^2}{80 K^{3/2} R_{\max}^3} \, \pi_{\theta_t}(a^*)^2 \big( r(a^*) - \pi_{\theta_t}^\top r \big)^2   (by Lemma 5.4)   (276)
\ge \frac{\Delta^2 \, \inf_{s \ge 1} \pi_{\theta_s}(a^*)^2}{80 K^{3/2} R_{\max}^3} \, \big( r(a^*) - \pi_{\theta_t}^\top r \big)^2   (277)
= \frac{\Delta^2 \, c^2}{80 K^{3/2} R_{\max}^3} \, \big( r(a^*) - \pi_{\theta_t}^\top r \big)^2.   (by Theorem 5.1)   (278)

Denote \delta(\theta_t) := (\pi^* - \pi_{\theta_t})^\top r as the sub-optimality gap. We have,

\delta(\theta_t) - \mathbb{E}_t[\delta(\theta_{t+1})] = (\pi^* - \pi_{\theta_t})^\top r - \big( \pi^* - \mathbb{E}_t[\pi_{\theta_{t+1}}] \big)^\top r   (279)
= \mathbb{E}_t[\pi_{\theta_{t+1}}^\top r] - \pi_{\theta_t}^\top r   (280)
\ge \frac{\Delta^2 \, c^2}{80 K^{3/2} R_{\max}^3} \, \delta(\theta_t)^2.   (281)

Taking expectation, we have,

\mathbb{E}[\delta(\theta_t)] - \mathbb{E}[\delta(\theta_{t+1})] \ge \frac{\Delta^2 \, \mathbb{E}[c^2]}{80 K^{3/2} R_{\max}^3} \, \mathbb{E}[\delta(\theta_t)^2]   (c := \inf_{t \ge 1} \pi_{\theta_t}(a^*) > 0 does not depend on t)   (282)
\ge \frac{\Delta^2 \, \mathbb{E}[c^2]}{80 K^{3/2} R_{\max}^3} \, \big( \mathbb{E}[\delta(\theta_t)] \big)^2   (by Jensen's inequality)   (283)
= \frac{1}{C} \, \big( \mathbb{E}[\delta(\theta_t)] \big)^2.   (284)

Therefore, we have,

\frac{1}{\mathbb{E}[\delta(\theta_t)]} = \frac{1}{\mathbb{E}[\delta(\theta_1)]} + \sum_{s=1}^{t-1} \Big( \frac{1}{\mathbb{E}[\delta(\theta_{s+1})]} - \frac{1}{\mathbb{E}[\delta(\theta_s)]} \Big)   (285)
= \frac{1}{\mathbb{E}[\delta(\theta_1)]} + \sum_{s=1}^{t-1} \frac{ \mathbb{E}[\delta(\theta_s)] - \mathbb{E}[\delta(\theta_{s+1})] }{ \mathbb{E}[\delta(\theta_{s+1})] \, \mathbb{E}[\delta(\theta_s)] }   (286)
\ge \frac{1}{\mathbb{E}[\delta(\theta_1)]} + \sum_{s=1}^{t-1} \frac{1}{ \mathbb{E}[\delta(\theta_{s+1})] \, \mathbb{E}[\delta(\theta_s)] } \cdot \frac{1}{C} \big( \mathbb{E}[\delta(\theta_s)] \big)^2   (by Eq. (282))   (287)
\ge \frac{1}{\mathbb{E}[\delta(\theta_1)]} + \frac{t-1}{C}   (\mathbb{E}[\delta(\theta_s)] \ge \mathbb{E}[\delta(\theta_{s+1})] > 0)   (288)--(289)
\ge \frac{t}{C},   (\mathbb{E}[\delta(\theta_1)] \le 2 R_{\max} \le C)   (290)

which implies Eq. (273).

Second part, Eq. (274). According to Eq. (273), we have,

\sum_{t=1}^{T} \mathbb{E}[\delta(\theta_t)] \le \sum_{t=1}^{T} \frac{C}{t} \le C (\log T + 1).   (291)

On the other hand, we have,

\sum_{t=1}^{T} \mathbb{E}[\delta(\theta_t)] \le \sqrt{T} \, \Big[ \sum_{t=1}^{T} \big( \mathbb{E}[\delta(\theta_t)] \big)^2 \Big]^{1/2}   (by Cauchy--Schwarz)   (292)
\le \sqrt{T} \, \Big[ \sum_{t=1}^{T} C \, \big( \mathbb{E}[\delta(\theta_t)] - \mathbb{E}[\delta(\theta_{t+1})] \big) \Big]^{1/2}   (by Eq. (282))   (293)
= \sqrt{ C \, T \, \big( \mathbb{E}[\delta(\theta_1)] - \mathbb{E}[\delta(\theta_{T+1})] \big) }   (294)
\le \sqrt{ 2 R_{\max} C T }.   (\mathbb{E}[\delta(\theta_{T+1})] \ge 0 and \mathbb{E}[\delta(\theta_1)] \le 2 R_{\max})   (295)

Combining Eqs. (291) and (292), we have Eq. (274).

B. Proofs for Using Baselines

The following Algorithm 2 is the same as the gradient bandit algorithm in Sutton & Barto (2018, Section 2.8).

Algorithm 2 Gradient bandit algorithm with baselines
  Input: initial parameters \theta_1 \in \mathbb{R}^K, learning rate \eta > 0.
  Output: policies \pi_{\theta_t} = softmax(\theta_t).
  while t \ge 1 do
    Sample one action a_t \sim \pi_{\theta_t}(\cdot).
    Observe one reward sample R_t(a_t) \sim P_{a_t}.
    Choose a baseline B_t \in \mathbb{R}.
    for all a \in [K] do
      if a = a_t then
        \theta_{t+1}(a) \leftarrow \theta_t(a) + \eta \, (1 - \pi_{\theta_t}(a)) \, (R_t(a_t) - B_t).
      else
        \theta_{t+1}(a) \leftarrow \theta_t(a) - \eta \, \pi_{\theta_t}(a) \, (R_t(a_t) - B_t).
      end if
    end for
  end while
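For concreteness, here is a minimal runnable sketch of Algorithm 2 (Python with NumPy assumed). The running-average baseline and the Gaussian reward noise are illustrative choices on our part; Algorithm 2 itself leaves the baseline B_t and the reward distributions P_a unspecified.

```python
import numpy as np

def gradient_bandit_with_baseline(r_mean, T=5000, eta=0.1, noise_std=0.1, seed=0):
    """Softmax gradient bandit with a baseline (Algorithm 2)."""
    rng = np.random.default_rng(seed)
    K = len(r_mean)
    theta = np.zeros(K)          # theta_1
    baseline = 0.0               # B_t; here a running average of observed rewards
    for t in range(1, T + 1):
        pi = np.exp(theta - theta.max()); pi /= pi.sum()   # pi_{theta_t} = softmax(theta_t)
        a = rng.choice(K, p=pi)                            # a_t ~ pi_{theta_t}(.)
        reward = r_mean[a] + noise_std * rng.normal()      # R_t(a_t) ~ P_{a_t}
        # theta_{t+1}(a)  <- theta_t(a)  + eta (1 - pi(a)) (R_t - B_t)   for a = a_t
        # theta_{t+1}(a') <- theta_t(a') - eta pi(a') (R_t - B_t)        for a' != a_t
        update = -eta * pi * (reward - baseline)
        update[a] = eta * (1.0 - pi[a]) * (reward - baseline)
        theta += update
        baseline += (reward - baseline) / t                # update the running-average baseline
    pi = np.exp(theta - theta.max())
    return pi / pi.sum()

print(np.round(gradient_bandit_with_baseline(np.array([0.2, 0.5, 0.9, 0.1])), 3))
# the returned distribution should place most of its mass on the third arm
```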
Proposition B.1. Algorithm 2 is equivalent to the following stochastic gradient ascent update on \pi_\theta^\top r:

\theta_{t+1} = \theta_t + \eta \, \frac{d \pi_{\theta_t}^\top (\hat r_t - \hat b_t)}{d \theta_t} = \theta_t + \eta \, \big( \mathrm{diag}(\pi_{\theta_t}) - \pi_{\theta_t} \pi_{\theta_t}^\top \big) (\hat r_t - \hat b_t),   (297)

where d\pi_\theta / d\theta = \mathrm{diag}(\pi_\theta) - \pi_\theta \pi_\theta^\top is the Jacobian of \theta \mapsto \pi_\theta := softmax(\theta), \hat r_t(a) := \frac{\mathbb{I}\{a_t = a\}}{\pi_{\theta_t}(a)} R_t(a) for all a \in [K] is the importance sampling (IS) estimator (and we set R_t(a) = 0 for all a \ne a_t), and the baseline estimator is defined as \hat b_t(a) := \frac{\mathbb{I}\{a_t = a\}}{\pi_{\theta_t}(a)} B_t for all a \in [K].

Proof. Using the definition of the softmax Jacobian, \hat r_t and \hat b_t, we have, for all a \in [K],

\theta_{t+1}(a) = \theta_t(a) + \eta \, \pi_{\theta_t}(a) \Big( \hat r_t(a) - \hat b_t(a) - \pi_{\theta_t}^\top (\hat r_t - \hat b_t) \Big)   (298)
= \theta_t(a) + \eta \, \Big( \pi_{\theta_t}(a) \big( \hat r_t(a) - \hat b_t(a) \big) - \pi_{\theta_t}(a) \big( R_t(a_t) - B_t \big) \Big)   (299)
= \theta_t(a) + \eta \, (1 - \pi_{\theta_t}(a)) \, (R_t(a_t) - B_t) if a_t = a, and \theta_t(a) - \eta \, \pi_{\theta_t}(a) \, (R_t(a_t) - B_t) otherwise,

which is exactly the update rule of Algorithm 2.

Lemma B.2 (Unbiased stochastic gradient with bounded variance / scale). Using Algorithm 2, we have, for all t \ge 1,

\mathbb{E}_t\Big[ \frac{d \pi_{\theta_t}^\top (\hat r_t - \hat b_t)}{d \theta_t} \Big] = \frac{d \pi_{\theta_t}^\top r}{d \theta_t},   (300)

\mathbb{E}_t\Big[ \Big\| \frac{d \pi_{\theta_t}^\top (\hat r_t - \hat b_t)}{d \theta_t} \Big\|_2^2 \Big] \le 2 \bar R_{\max}^2,   (301)

where \mathbb{E}_t[\cdot] is with respect to the randomness from the on-policy sampling a_t \sim \pi_{\theta_t}(\cdot) and the reward sampling R_t(a_t) \sim P_{a_t}, and \bar R_{\max} is the range of the rewards minus the baselines, i.e.,

|R_t(a_t) - B_t| \le \bar R_{\max}.   (302)

Proof. First part, Eq. (300). For every action a \in [K], the true softmax PG is,

\frac{d \pi_{\theta_t}^\top r}{d \theta_t(a)} = \pi_{\theta_t}(a) \big( r(a) - \pi_{\theta_t}^\top r \big).   (303)

For all a \in [K], the stochastic softmax PG is,

\frac{d \pi_{\theta_t}^\top (\hat r_t - \hat b_t)}{d \theta_t(a)} = \pi_{\theta_t}(a) \Big( \hat r_t(a) - \hat b_t(a) - \pi_{\theta_t}^\top (\hat r_t - \hat b_t) \Big)   (304)
= \pi_{\theta_t}(a) \big( \hat r_t(a) - \hat b_t(a) \big) - \pi_{\theta_t}(a) \big( R_t(a_t) - B_t \big)   (305)
= \big( \mathbb{I}\{a_t = a\} - \pi_{\theta_t}(a) \big) \big( R_t(a_t) - B_t \big).   (306)

For the sampled action a_t, we have,

\mathbb{E}_{R_t(a_t) \sim P_{a_t}}\Big[ \frac{d \pi_{\theta_t}^\top (\hat r_t - \hat b_t)}{d \theta_t(a_t)} \Big] = \mathbb{E}_{R_t(a_t) \sim P_{a_t}}\big[ (1 - \pi_{\theta_t}(a_t)) (R_t(a_t) - B_t) \big] = (1 - \pi_{\theta_t}(a_t)) \big( r(a_t) - B_t \big).   (307)--(309)

For any other action a \ne a_t that is not sampled, we have,

\mathbb{E}_{R_t(a_t) \sim P_{a_t}}\Big[ \frac{d \pi_{\theta_t}^\top (\hat r_t - \hat b_t)}{d \theta_t(a)} \Big] = \mathbb{E}_{R_t(a_t) \sim P_{a_t}}\big[ - \pi_{\theta_t}(a) (R_t(a_t) - B_t) \big] = - \pi_{\theta_t}(a) \big( r(a_t) - B_t \big).   (310)--(312)

Combining Eqs. (307) and (310), we have, for all a \in [K],

\mathbb{E}_{R_t(a_t) \sim P_{a_t}}\Big[ \frac{d \pi_{\theta_t}^\top (\hat r_t - \hat b_t)}{d \theta_t(a)} \Big] = \big( \mathbb{I}\{a_t = a\} - \pi_{\theta_t}(a) \big) \big( r(a_t) - B_t \big).   (313)

Taking expectation over a_t \sim \pi_{\theta_t}(\cdot), we have,

\mathbb{E}_t\Big[ \frac{d \pi_{\theta_t}^\top (\hat r_t - \hat b_t)}{d \theta_t(a)} \Big] = \pi_{\theta_t}(a) \, (1 - \pi_{\theta_t}(a)) \, (r(a) - B_t) + \sum_{a' \ne a} \pi_{\theta_t}(a') \, \big( - \pi_{\theta_t}(a) \big) \, (r(a') - B_t)   (314)--(316)
= \pi_{\theta_t}(a) \sum_{a' \ne a} \pi_{\theta_t}(a') \big[ (r(a) - B_t) - (r(a') - B_t) \big]   (317)
= \pi_{\theta_t}(a) \big( r(a) - \pi_{\theta_t}^\top r \big).   (318)

Combining Eqs. (303) and (314), we have, for all a \in [K],

\mathbb{E}_t\Big[ \frac{d \pi_{\theta_t}^\top (\hat r_t - \hat b_t)}{d \theta_t(a)} \Big] = \frac{d \pi_{\theta_t}^\top r}{d \theta_t(a)},   (319)

which implies Eq. (300) since a \in [K] is arbitrary.

Second part, Eq. (301). The squared stochastic PG norm is,

\Big\| \frac{d \pi_{\theta_t}^\top (\hat r_t - \hat b_t)}{d \theta_t} \Big\|_2^2 = \sum_{a \in [K]} \big( \mathbb{I}\{a_t = a\} - \pi_{\theta_t}(a) \big)^2 \big( R_t(a_t) - B_t \big)^2   (by Eq. (304))   (321)
\le \bar R_{\max}^2 \sum_{a \in [K]} \big( \mathbb{I}\{a_t = a\} - \pi_{\theta_t}(a) \big)^2   (by Eq. (302))   (322)
= \bar R_{\max}^2 \Big[ (1 - \pi_{\theta_t}(a_t))^2 + \sum_{a \ne a_t} \pi_{\theta_t}(a)^2 \Big]   (323)
\le \bar R_{\max}^2 \Big[ (1 - \pi_{\theta_t}(a_t))^2 + \Big( \sum_{a \ne a_t} \pi_{\theta_t}(a) \Big)^2 \Big]   (\|x\|_2 \le \|x\|_1)   (324)
= 2 \bar R_{\max}^2 \big( 1 - \pi_{\theta_t}(a_t) \big)^2.   (325)

Therefore, we have, for all a \in [K], conditioning on a_t = a,

\mathbb{E}_t\Big[ \Big\| \frac{d \pi_{\theta_t}^\top (\hat r_t - \hat b_t)}{d \theta_t} \Big\|_2^2 \,\Big|\, a_t = a \Big] \le 2 \bar R_{\max}^2 \big( 1 - \pi_{\theta_t}(a) \big)^2.   (326)

Taking expectation over a_t \sim \pi_{\theta_t}(\cdot), we have,

\mathbb{E}_t\Big[ \Big\| \frac{d \pi_{\theta_t}^\top (\hat r_t - \hat b_t)}{d \theta_t} \Big\|_2^2 \Big] = \sum_{a \in [K]} \Pr(a_t = a) \, \mathbb{E}_t\Big[ \Big\| \frac{d \pi_{\theta_t}^\top (\hat r_t - \hat b_t)}{d \theta_t} \Big\|_2^2 \,\Big|\, a_t = a \Big]   (327)
\le \sum_{a \in [K]} \pi_{\theta_t}(a) \cdot 2 \bar R_{\max}^2 \big( 1 - \pi_{\theta_t}(a) \big)^2   (328)
\le 2 \bar R_{\max}^2 \sum_{a \in [K]} \pi_{\theta_t}(a)   (\pi_{\theta_t}(a) \in (0, 1) for all a \in [K])   (329)
= 2 \bar R_{\max}^2.
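Lemma B.2's first claim, that subtracting any baseline leaves the stochastic gradient unbiased, is easy to confirm by simulation. The sketch below (Python with NumPy assumed; the specific means, noise level, and baseline value are arbitrary choices) averages the stochastic gradient of Eq. (306) over many rounds and compares it with the true gradient of Eq. (303).

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
r = np.array([0.1, 0.4, 0.8, 0.3])              # true mean rewards (arbitrary)
theta = rng.normal(size=K)
pi = np.exp(theta - theta.max()); pi /= pi.sum()
B = 2.0                                          # a fixed, deliberately off-scale baseline
true_grad = pi * (r - pi @ r)                    # Eq. (303)

n = 1_000_000
a = rng.choice(K, size=n, p=pi)                  # a_t ~ pi_theta
R = r[a] + 0.2 * rng.normal(size=n)              # noisy reward samples
est = ((np.eye(K)[a] - pi) * (R - B)[:, None]).mean(axis=0)   # average of Eq. (306)

print(np.round(true_grad, 4))
print(np.round(est, 4))   # matches up to O(1/sqrt(n)) Monte Carlo error, for any choice of B
```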
Lemma B.3 (NS between iterates). Using Algorithm 2 with \eta \in (0, 2/(9 \bar R_{\max})], we have, for all t \ge 1,

D(\theta_{t+1}, \theta_t) := (\pi_{\theta_{t+1}} - \pi_{\theta_t})^\top r - \big\langle d\pi_{\theta_t}^\top r / d\theta_t, \, \theta_{t+1} - \theta_t \big\rangle \ge - \frac{\beta(\theta_t)}{2} \, \| \theta_{t+1} - \theta_t \|_2^2,   (330)

where \bar R_{\max} is from Eq. (302), and \beta(\theta_t) = \big( 3 + 9 \bar R_{\max} \eta \big) \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2.

Proof. In the proof of Lemma 4.2, replacing R_{\max} with \bar R_{\max}, and replacing d\pi_{\theta_t}^\top \hat r_t / d\theta_t with d\pi_{\theta_t}^\top (\hat r_t - \hat b_t) / d\theta_t, we have the results.

Lemma 6.1 (Strong growth condition / self-bounding noise property). Using Algorithm 2, we have, for all t \ge 1,

\mathbb{E}_t\Big[ \Big\| \frac{d \pi_{\theta_t}^\top (\hat r_t - \hat b_t)}{d \theta_t} \Big\|_2^2 \Big] \le \frac{8 \bar R_{\max}^2 R_{\max} K^{3/2}}{\Delta^2} \, \Big\| \frac{d \pi_{\theta_t}^\top r}{d \theta_t} \Big\|_2,

where \Delta := \min_{i \ne j} |r(i) - r(j)|, and \bar R_{\max} is from Eq. (302).

Proof. Given t \ge 1, denote k_t as the action with the largest probability, i.e., k_t := \arg\max_{a \in [K]} \pi_{\theta_t}(a). According to Eq. (327), we have,

\mathbb{E}_t\big[ \| d\pi_{\theta_t}^\top (\hat r_t - \hat b_t) / d\theta_t \|_2^2 \big] = \sum_{a \in [K]} \Pr(a_t = a) \, \mathbb{E}_t\big[ \| d\pi_{\theta_t}^\top (\hat r_t - \hat b_t) / d\theta_t \|_2^2 \,\big|\, a_t = a \big]   (334)
\le \sum_{a \in [K]} \pi_{\theta_t}(a) \cdot 2 \bar R_{\max}^2 (1 - \pi_{\theta_t}(a))^2   (335)
= 2 \bar R_{\max}^2 \Big[ \pi_{\theta_t}(k_t) (1 - \pi_{\theta_t}(k_t))^2 + \sum_{a \ne k_t} \pi_{\theta_t}(a) (1 - \pi_{\theta_t}(a))^2 \Big]   (336)
\le 2 \bar R_{\max}^2 \Big[ 1 - \pi_{\theta_t}(k_t) + \sum_{a \ne k_t} \pi_{\theta_t}(a) \Big]   (\pi_{\theta_t}(a) \in (0, 1) for all a \in [K])   (337)
= 4 \bar R_{\max}^2 \big( 1 - \pi_{\theta_t}(k_t) \big).   (338)

Therefore, we have,

\mathbb{E}_t\big[ \| d\pi_{\theta_t}^\top (\hat r_t - \hat b_t) / d\theta_t \|_2^2 \big] \le 4 \bar R_{\max}^2 \big( 1 - \pi_{\theta_t}(k_t) \big)   (339)
\le 4 \bar R_{\max}^2 \, \frac{K}{\Delta^2} \sum_{a \in [K]} \pi_{\theta_t}(a) \big( r(a) - \pi_{\theta_t}^\top r \big)^2   (by Eqs. (117)--(120))   (340)
\le 4 \bar R_{\max}^2 \, \frac{K}{\Delta^2} \cdot 2 \sqrt{K} \, R_{\max} \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2   (by Eq. (110))   (341)
= \frac{8 \bar R_{\max}^2 R_{\max} K^{3/2}}{\Delta^2} \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2.   (342)

Lemma B.4 (Constant learning rate). Using Algorithm 2 with \eta = \frac{\Delta^2}{40 K^{3/2} \bar R_{\max}^2 R_{\max}}, we have, for all t \ge 1,

\pi_{\theta_t}^\top r - \mathbb{E}_t[\pi_{\theta_{t+1}}^\top r] \le - \frac{\Delta^2}{80 K^{3/2} \bar R_{\max}^2 R_{\max}} \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2^2,

where \bar R_{\max} is from Eq. (302).

Proof. Using the learning rate,

\eta = \frac{\Delta^2}{40 K^{3/2} \bar R_{\max}^2 R_{\max}} \le \frac{(2 \bar R_{\max})(2 R_{\max})}{40 \cdot 2^{3/2} \, \bar R_{\max}^2 R_{\max}} < \frac{4}{45 \bar R_{\max}} < \frac{2}{9 \bar R_{\max}},   (\Delta \le 2 R_{\max}, \Delta \le 2 \bar R_{\max}, and K \ge 2)   (343)--(346)

we have \eta \in (0, 2/(9 \bar R_{\max})]. According to Lemma B.3, we have,

(\pi_{\theta_{t+1}} - \pi_{\theta_t})^\top r - \big\langle d\pi_{\theta_t}^\top r / d\theta_t, \, \theta_{t+1} - \theta_t \big\rangle \ge - \Big( \tfrac{3}{2} + \tfrac{9 \bar R_{\max} \eta}{2} \Big) \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2 \, \| \theta_{t+1} - \theta_t \|_2^2   (347)
\ge - \tfrac{5}{2} \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2 \, \| \theta_{t+1} - \theta_t \|_2^2,   (by Eq. (346))   (348)--(349)

which implies that,

\pi_{\theta_t}^\top r - \pi_{\theta_{t+1}}^\top r \le - \big\langle d\pi_{\theta_t}^\top r / d\theta_t, \, \theta_{t+1} - \theta_t \big\rangle + \tfrac{5}{2} \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2 \, \| \theta_{t+1} - \theta_t \|_2^2   (350)
= - \eta \, \big\langle d\pi_{\theta_t}^\top r / d\theta_t, \, d\pi_{\theta_t}^\top (\hat r_t - \hat b_t) / d\theta_t \big\rangle + \tfrac{5}{2} \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2 \, \eta^2 \, \| d\pi_{\theta_t}^\top (\hat r_t - \hat b_t) / d\theta_t \|_2^2,   (351)

where the last equation uses the update rule of Algorithm 2. Taking expectation over a_t \sim \pi_{\theta_t}(\cdot) and R_t(a_t) \sim P_{a_t}, we have,

\pi_{\theta_t}^\top r - \mathbb{E}_t[\pi_{\theta_{t+1}}^\top r] \le - \eta \, \big\langle d\pi_{\theta_t}^\top r / d\theta_t, \, \mathbb{E}_t[ d\pi_{\theta_t}^\top (\hat r_t - \hat b_t) / d\theta_t ] \big\rangle + \tfrac{5}{2} \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2 \, \eta^2 \, \mathbb{E}_t\big[ \| d\pi_{\theta_t}^\top (\hat r_t - \hat b_t) / d\theta_t \|_2^2 \big]   (352)
= - \eta \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2^2 + \tfrac{5}{2} \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2 \, \eta^2 \, \mathbb{E}_t\big[ \| d\pi_{\theta_t}^\top (\hat r_t - \hat b_t) / d\theta_t \|_2^2 \big]   (by Lemma B.2)   (353)
\le - \eta \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2^2 + \eta^2 \, \frac{20 \bar R_{\max}^2 R_{\max} K^{3/2}}{\Delta^2} \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2^2   (by Lemma 6.1)   (354)
= - \Big( \eta - \eta^2 \, \frac{20 \bar R_{\max}^2 R_{\max} K^{3/2}}{\Delta^2} \Big) \| d\pi_{\theta_t}^\top r / d\theta_t \|_2^2 = - \frac{\Delta^2}{80 K^{3/2} \bar R_{\max}^2 R_{\max}} \, \| d\pi_{\theta_t}^\top r / d\theta_t \|_2^2.   (by Eq. (343))   (355)

Theorem B.5. Using Algorithm 2, the sequence \{\pi_{\theta_t}^\top r\}_{t \ge 1} converges with probability one.

Proof. As in the proof of Theorem 5.1, we set

W_{t+1}(a) = \theta_{t+1}(a) - \mathbb{E}_t[\theta_{t+1}(a)]   (356)
= \theta_t(a) + \eta \, \big( I_t(a) - \pi_{\theta_t}(a) \big) \big( R_t(a_t) - B_t \big) - \Big( \theta_t(a) + \eta \, \pi_{\theta_t}(a) \big( r(a) - \pi_{\theta_t}^\top r \big) \Big)   (357)
= \eta \, \big( I_t(a) - \pi_{\theta_t}(a) \big) \big( R_t(a_t) - B_t \big) - \eta \, \pi_{\theta_t}(a) \big( r(a) - \pi_{\theta_t}^\top r \big),   (358)

which implies that,

Z_t(a) = W_1(a) + \cdots + W_t(a)   (359)
= \sum_{s=1}^{t-1} \Big[ \eta \, \big( I_s(a) - \pi_{\theta_s}(a) \big) \big( R_s(a_s) - B_s \big) - \eta \, \pi_{\theta_s}(a) \big( r(a) - \pi_{\theta_s}^\top r \big) \Big].   (360)

We also have,

P_t(a) = \mathbb{E}_t[\theta_{t+1}(a)] - \theta_t(a) = \eta \, \pi_{\theta_t}(a) \big( r(a) - \pi_{\theta_t}^\top r \big).   (361)

In the remaining part of the proof of Theorem 5.1, replacing R_{\max} with \bar R_{\max}, we have the results.

C. Miscellaneous Extra Supporting Results

Recall that (X_t, \mathcal{F}_t)_{t \ge 1} is a sub-martingale (super-martingale, martingale) if (X_t)_{t \ge 1} is adapted to the filtration (\mathcal{F}_t)_{t \ge 1} and \mathbb{E}[X_{t+1} | \mathcal{F}_t] \ge X_t (\mathbb{E}[X_{t+1} | \mathcal{F}_t] \le X_t, \mathbb{E}[X_{t+1} | \mathcal{F}_t] = X_t, respectively) holds almost surely for any t \ge 1. For brevity, let \mathbb{E}_t[\cdot] denote \mathbb{E}[\cdot | \mathcal{F}_t], where the filtration should be clear from the context; we also extend this notation to t = 0 so that \mathbb{E}_0[U] = \mathbb{E}[U].

Theorem C.1 (Doob's supermartingale convergence theorem (Doob, 2012)). If (Y_t)_{t \ge 1} is an \{\mathcal{F}_t\}_{t \ge 1}-adapted sequence such that \mathbb{E}[Y_{t+1} | \mathcal{F}_t] \le Y_t and \sup_t \mathbb{E}[|Y_t|] < \infty, then \{Y_t\}_{t \ge 1} converges almost surely (a.s.), and, in particular, Y_t \to Y_\infty a.s. as t \to \infty, where Y_\infty = \limsup_t Y_t is such that \mathbb{E}[|Y_\infty|] < \infty.

Lemma C.2 (Extended Borel--Cantelli lemma, Corollary 5.29 of Breiman (1992)). Let (\mathcal{F}_n)_{n \ge 1} be a filtration and let A_n \in \mathcal{F}_{n+1} for each n \ge 1. Then, almost surely,

\{ \omega : \omega \in A_n \text{ infinitely often} \} = \Big\{ \omega : \sum_{n=1}^{\infty} \Pr(A_n | \mathcal{F}_n) = \infty \Big\}.
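To close, here is a small end-to-end simulation consistent with the results above (Python with NumPy assumed; the Bernoulli reward model, the arm means, the horizon, and the larger-than-prescribed constant step size are our choices, so the printed numbers are only indicative). It runs Algorithm 1 with a constant step size and prints the sub-optimality gap (\pi^* - \pi_{\theta_t})^\top r at a few time points; under Theorem 5.5 this gap decays at an O(1/t) rate in expectation.

```python
import numpy as np

def run_algorithm1(r_mean, eta, T, seed=0):
    """Algorithm 1: softmax gradient bandit, no baseline, constant step size.
    Bernoulli(r_mean[a]) rewards keep R_t in [0, 1], so R_max = 1 here."""
    rng = np.random.default_rng(seed)
    K = len(r_mean)
    theta = np.zeros(K)
    gaps = np.empty(T)
    r_star = r_mean.max()
    for t in range(T):
        pi = np.exp(theta - theta.max()); pi /= pi.sum()
        gaps[t] = r_star - pi @ r_mean                 # (pi* - pi_{theta_t})^T r
        a = rng.choice(K, p=pi)                        # a_t ~ pi_{theta_t}(.)
        R = float(rng.random() < r_mean[a])            # R_t(a_t) ~ Bernoulli(r(a_t))
        update = -eta * pi * R                         # -eta pi(a') R_t for a' != a_t
        update[a] = eta * (1.0 - pi[a]) * R            # +eta (1 - pi(a_t)) R_t for a_t
        theta += update
    return gaps

r_mean = np.array([0.9, 0.6, 0.4, 0.2, 0.1])
K, R_max = len(r_mean), 1.0
Delta = np.min(np.diff(np.sort(r_mean)))               # min_{i != j} |r(i) - r(j)|
print("step size prescribed by Lemma 4.6:", Delta**2 / (40 * K**1.5 * R_max**3))
gaps = run_algorithm1(r_mean, eta=0.5, T=100_000)      # a larger constant step for a short illustrative run
for t in [10, 100, 1_000, 10_000, 99_999]:
    print(t, round(gaps[t], 4))                        # the gap should keep shrinking toward 0
```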