Stochastic Bandits with Context Distributions

Johannes Kirschner, Department of Computer Science, ETH Zurich, jkirschner@inf.ethz.ch
Andreas Krause, Department of Computer Science, ETH Zurich, krausea@ethz.ch

We introduce a stochastic contextual bandit model where at each time step the environment chooses a distribution over a context set and samples the context from this distribution. The learner observes only the context distribution while the exact context realization remains hidden. This allows for a broad range of applications where the context is stochastic or when the learner needs to predict the context. We leverage the UCB algorithm in this setting and show that it achieves an order-optimal high-probability bound on the cumulative regret for linear and kernelized reward functions. Our results strictly generalize previous work in the sense that both our model and the algorithm reduce to the standard setting when the environment chooses only Dirac delta distributions and therefore provides the exact context to the learner. We further analyze a variant where the learner observes the realized context after choosing the action. Finally, we demonstrate the proposed method on synthetic and real-world datasets.

1 Introduction

In the contextual bandit model, a learner interacts with an environment over several rounds. At the beginning of each round, the environment provides a context, and in turn the learner chooses an action which leads to an a priori unknown reward. The learner's goal is to choose actions that maximize the cumulative reward and eventually compete with the best mapping from context observations to actions. This model creates a dilemma of exploration and exploitation, as the learner needs to balance exploratory actions to estimate the environment's reward function and exploitative actions that maximize the total return. Contextual bandit algorithms have been successfully used in many applications, including online advertisement, recommender systems and experimental design.

The contextual bandit model, as usually studied in the literature, assumes that the context is observed exactly. This is not always the case in applications, for instance when the context is itself a noisy measurement or a forecasting mechanism. An example of such a context could be a weather or stock market prediction. In other cases, such as recommender systems, privacy constraints can restrict access to certain user features, but we might instead be able to infer a distribution over those. To allow for uncertainty in the context, we consider a setting where the environment provides a distribution over the context set. The exact context is assumed to be a sample from this distribution, but remains hidden from the learner. Such a model, to the best of our knowledge, has not been discussed in the literature before. Not knowing the context realization makes the learning problem more difficult, because the learner needs to estimate the reward function from noisy observations without knowing the exact context that generated the reward. Our setting recovers the classical contextual bandit setting when the context distribution is a Dirac delta distribution. We also analyze a natural variant of the problem where the exact context is observed after the player has chosen the action. This allows for different applications where, at the time of decision, the context needs to be predicted (e.g.
weather conditions), but when the reward is obtained, the exact context can be measured.

We focus on the setting where the reward function is linear in terms of action-context feature vectors. For this case, we leverage the UCB algorithm on a specifically designed bandit instance without feature uncertainty to recover an $\mathcal{O}(d\sqrt{T})$ high-probability bound on the cumulative regret. Our analysis includes a practical variant of the algorithm that requires only sampling access to the context distributions provided by the environment. We also extend our results to the kernelized setting, where the reward function is contained in a known reproducing kernel Hilbert space (RKHS). For this case, we highlight an interesting connection to distributional risk minimization and we show that the natural estimator for the reward function is based on so-called kernel mean embeddings. We discuss related work in Section 6.

2 Stochastic Bandits with Context Distributions

We formally define the setting of stochastic bandits with context distributions as outlined in the introduction. Let $X$ be a set of actions and $C$ a context set. The environment is defined by a fixed but unknown reward function $f : X \times C \to \mathbb{R}$. At iteration $t \in \mathbb{N}$, the environment chooses a distribution $\mu_t \in \mathcal{P}(C)$ over the context set and samples a context realization $c_t \sim \mu_t$. The learner observes only $\mu_t$ but not $c_t$, and then chooses an action $x_t \in X$. We allow an adaptive adversary to choose the context distribution, that is, $\mu_t$ may depend in an arbitrary way on previous choices of the learner up to time $t$. Given the learner's choice $x_t$, the environment provides a reward $y_t = f(x_t, c_t) + \epsilon_t$, where $\epsilon_t$ is $\sigma$-subgaussian, additive noise. The learner's goal is to maximize the cumulative reward $\sum_{t=1}^T f(x_t, c_t)$, or equivalently, minimize the cumulative regret

$$R_T = \sum_{t=1}^T f(x_t^*, c_t) - f(x_t, c_t) \quad (1)$$

where $x_t^* = \arg\max_{x \in X} \mathbb{E}_{c \sim \mu_t}[f(x, c)]$ is the best action provided that we know $f$ and $\mu_t$, but not $c_t$. Note that this way we compete with the best possible mapping $\pi^* : \mathcal{P}(C) \to X$ from the observed context distribution to actions, the mapping that maximizes the expected reward $\sum_{t=1}^T \mathbb{E}_{c_t \sim \mu_t}[f(\pi^*(\mu_t), c_t) \mid \mathcal{F}_{t-1}, \mu_t]$, where $\mathcal{F}_t = \{(x_s, \mu_s, y_s)\}_{s=1}^t$ is the filtration that contains all information available at the end of round $t$. It is natural to ask if it is possible to compete with the stronger baseline that chooses actions given the context realization $c_t$, i.e. $x_t^* = \arg\max_{x \in X} f(x, c_t)$. While this can be possible in special cases, a simple example shows that in general the learner would suffer $\Omega(T)$ regret. In particular, assume that $c_t \sim \mathrm{Bernoulli}(0.6)$ and $X = \{0, 1\}$. Let $f(0, c) = c$ and $f(1, c) = 1 - c$. Clearly, any policy that does not know the realizations $c_t$ must have $\Omega(T)$ regret when competing against this stronger baseline.

From now on, we focus on linearly parameterized reward functions $f(x, c) = \phi_{x,c}^\top \theta$ with given feature vectors $\phi_{x,c} \in \mathbb{R}^d$ for $x \in X$ and $c \in C$, and unknown parameter $\theta \in \mathbb{R}^d$. This setup is commonly referred to as the linear bandit setting. For the analysis we require the standard boundedness assumptions $\|\phi_{x,c}\|_2 \le B$ and $\|\theta\|_2 \le B$, where we set $B = 1$ for the sake of simplicity. In Section 4.2, we further consider a variant of the problem where the learner observes $c_t$ after taking the decision $x_t$. This simplifies the estimation problem, because we have data $\{(x_t, c_t, y_t)\}$ with exact context $c_t$ available, just like in the standard setting. The exploration problem, however, remains subtle, as at the time of decision the learner still knows only $\mu_t$ and not $c_t$. In Section 4.3 we extend our algorithm and analysis to kernelized bandits, where $f \in \mathcal{H}$ is contained in a reproducing kernel Hilbert space $\mathcal{H}$.
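To make the interaction protocol concrete, the following minimal sketch simulates the setting above for a finite action set, a finite context set and a linear reward $f(x,c) = \phi_{x,c}^\top\theta$, and accumulates the regret (1) against the baseline $x_t^*$. All names (phi, theta, play_round, the random-learner baseline) are illustrative and not part of the paper.

```python
# Sketch of the interaction protocol of Section 2 (illustrative, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_contexts, d = 5, 3, 4
phi = rng.normal(size=(n_actions, n_contexts, d))                    # feature vectors phi_{x,c}
phi /= np.maximum(1.0, np.linalg.norm(phi, axis=-1, keepdims=True))  # enforce ||phi_{x,c}||_2 <= 1
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)                                       # enforce ||theta||_2 <= 1
sigma = 0.1                                                          # noise level of epsilon_t

def play_round(mu, choose_action):
    """One round: the learner sees only mu_t, the reward uses the hidden realization c_t."""
    c = rng.choice(n_contexts, p=mu)                  # hidden context realization c_t ~ mu_t
    x = choose_action(mu)                             # learner's decision based on mu_t alone
    y = phi[x, c] @ theta + sigma * rng.normal()      # noisy reward y_t = f(x_t, c_t) + eps_t
    expected_rewards = (phi @ theta) @ mu             # E_{c~mu_t}[f(x, c)] for every action x
    x_star = int(np.argmax(expected_rewards))         # baseline action x_t^* of regret (1)
    instant_regret = (phi[x_star, c] - phi[x, c]) @ theta
    return x, y, instant_regret

# usage: a uniformly random learner accumulating the regret (1) over a few rounds
total_regret = 0.0
for _ in range(100):
    mu = rng.dirichlet(np.ones(n_contexts))           # context distribution mu_t chosen by the environment
    _, _, r = play_round(mu, lambda mu: int(rng.integers(n_actions)))
    total_regret += r
print("cumulative regret of random play:", total_regret)
```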
3 Background

We briefly review standard results from the linear contextual bandit literature and the upper confidence bound (UCB) algorithm that we build on later (Abbasi-Yadkori et al., 2011). The linear contextual bandit setting can be defined as a special case of our setup, where the choice of $\mu_t$ is restricted to Dirac delta distributions $\mu_t = \delta_{c_t}$, and therefore the learner knows beforehand the exact context which is used to generate the reward. In an equivalent formulation, the environment provides at time $t$ a set of action-context feature vectors $\Psi_t = \{\phi_{x,c_t} : x \in X\} \subset \mathbb{R}^d$ and the algorithm chooses an action $x_t$ with corresponding features $\phi_t := \phi_{x_t,c_t} \in \Psi_t$. We emphasize that in this formulation the context $c_t$ is extraneous to the algorithm, and everything can be defined in terms of the time-varying action-feature sets $\Psi_t$. As before, the learner obtains a noisy reward observation $y_t = \phi_t^\top \theta + \epsilon_t$, where $\epsilon_t$ is conditionally $\rho$-subgaussian with variance proxy $\rho^2$, i.e. $\forall \lambda \in \mathbb{R},\ \mathbb{E}[e^{\lambda \epsilon_t} \mid \mathcal{F}_{t-1}, \phi_t] \le \exp(\lambda^2 \rho^2 / 2)$. Also here, the standard objective is to minimize the cumulative regret $R_T = \sum_{t=1}^T \phi_t^{*\top}\theta - \phi_t^\top\theta$, where $\phi_t^* = \arg\max_{\phi \in \Psi_t} \phi^\top \theta$ is the feature vector of the best action at time $t$.

To define the UCB algorithm, we make use of the confidence sets derived by Abbasi-Yadkori et al. (2011) for online least squares regression. At the end of round $t$, the algorithm has adaptively collected data $\{(\phi_1, y_1), \dots, (\phi_t, y_t)\}$ that we use to compute the regularized least squares estimate $\hat\theta_t = \arg\min_{\theta \in \mathbb{R}^d} \sum_{s=1}^t (\phi_s^\top \theta - y_s)^2 + \lambda \|\theta\|_2^2$ with $\lambda > 0$. We denote the closed-form solution by $\hat\theta_t = V_t^{-1} \sum_{s=1}^t \phi_s y_s$ with $V_t = V_{t-1} + \phi_t \phi_t^\top$ and $V_0 = \lambda I_d$, where $I_d \in \mathbb{R}^{d \times d}$ is the identity matrix.

Lemma 1 (Abbasi-Yadkori et al. (2011)). For any stochastic sequence $\{(\phi_t, y_t)\}_t$ and estimator $\hat\theta_t$ as defined above for $\rho$-subgaussian observations, with probability at least $1 - \delta$, at any time $t \in \mathbb{N}$,
$$\|\theta - \hat\theta_t\|_{V_t} \le \beta_t, \qquad \text{where} \qquad \beta_t = \beta_t(\rho, \delta) = \rho \sqrt{2 \log \frac{\det(V_t)^{1/2}}{\delta \det(V_0)^{1/2}}} + \lambda^{1/2} \|\theta\|_2 .$$

Note that the size of the confidence set depends on the variance proxy $\rho$, which will be important in the following. In each round $t+1$, the UCB algorithm chooses an action $\phi_{t+1}$ that maximizes an upper confidence bound on the reward,
$$\phi_{t+1} := \arg\max_{\phi \in \Psi_{t+1}} \ \phi^\top \hat\theta_t + \beta_t \|\phi\|_{V_t^{-1}} .$$
The following result shows that the UCB policy achieves sublinear regret (Dani et al., 2008; Abbasi-Yadkori et al., 2011).

Lemma 2. In the standard contextual bandit setting with $\rho$-subgaussian observation noise, the regret of the UCB policy with $\beta_t = \beta_t(\rho, \delta)$ is bounded with probability $1 - \delta$ by
$$R_T^{UCB} \le \beta_T \sqrt{8 T \log \frac{\det V_T}{\det V_0}} .$$

The data-dependent terms can be further upper-bounded to obtain $R_T^{UCB} \le \mathcal{O}(d\sqrt{T})$ up to logarithmic factors in $T$ (Abbasi-Yadkori et al., 2011, Theorem 3). A matching lower bound is given by Dani et al. (2008, Theorem 3).
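For concreteness, the following is a minimal sketch of the regularized least-squares update and the UCB rule above. Here $\beta_t$ is passed in as a scalar rather than computed from Lemma 1, and the class and method names are illustrative; the reduction in Section 4 plugs a different feature set into exactly this kind of step.

```python
# Sketch of a standard linear UCB step (illustrative; beta is a given scalar).
import numpy as np

class LinUCB:
    def __init__(self, d, lam=1.0):
        self.V = lam * np.eye(d)        # V_0 = lambda * I_d
        self.b = np.zeros(d)            # running sum of phi_s * y_s

    def theta_hat(self):
        return np.linalg.solve(self.V, self.b)   # regularized least-squares estimate

    def select(self, Psi, beta):
        """Return argmax over rows phi of Psi of  phi^T theta_hat + beta * ||phi||_{V^{-1}}."""
        theta = self.theta_hat()
        V_inv_Psi = np.linalg.solve(self.V, Psi.T)            # V^{-1} phi for each row phi
        widths = np.sqrt(np.einsum("ij,ji->i", Psi, V_inv_Psi))
        return int(np.argmax(Psi @ theta + beta * widths))

    def update(self, phi, y):
        self.V += np.outer(phi, phi)    # V_t = V_{t-1} + phi_t phi_t^T
        self.b += y * phi
```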
4 UCB with Context Distributions

In our setting, where we only observe a context distribution $\mu_t$ (e.g. a weather prediction) instead of the context $c_t$ (e.g. realized weather conditions), also the features $\phi_{x,c_t}$ (e.g. the last layer of a neural network that models the reward $f(x, c) = \phi_{x,c}^\top\theta$) are uncertain. We propose an approach that transforms the problem such that we can directly use a contextual bandit algorithm as in the standard setting.

Given the distribution $\mu_t$, we define a new set of feature vectors $\bar\Psi_t = \{\bar\psi_{x,\mu_t} : x \in X\}$, where we denote by $\bar\psi_{x,\mu_t} = \mathbb{E}_{c \sim \mu_t}[\phi_{x,c} \mid \mathcal{F}_{t-1}, \mu_t]$ the expected feature vector of action $x$ under $\mu_t$. Each feature $\bar\psi_{x,\mu_t}$ corresponds to exactly one action $x \in X$, so we can use $\bar\Psi_t$ as the feature context set at time $t$ and use the UCB algorithm to choose an action $x_t$. The choice of the UCB algorithm here is only for the sake of the analysis; any other algorithm that works in the linear contextual bandit setting can be used. The complete algorithm is summarized in Algorithm 1. We compute the UCB action $x_t$ with corresponding expected features $\bar\psi_t := \bar\psi_{x_t,\mu_t} \in \bar\Psi_t$, and the learner provides $x_t$ to the environment. We then proceed and use the reward observation $y_t$ to update the least squares estimate. That this is a sensible approach is not immediate, because $y_t$ is a noisy observation of $\phi_{x_t,c_t}^\top\theta$, whereas UCB expects the reward $\bar\psi_t^\top\theta$. We address this issue by constructing the feature set $\bar\Psi_t$ in such a way that $y_t$ acts as an unbiased observation also for the action choice $\bar\psi_t$. As computing exact expectations can be difficult and in applications often only sampling access to $\mu_t$ is possible, we also analyze a variant of Algorithm 1 where we use finite sample averages $\hat\psi_{x,\mu_t} = \frac{1}{L}\sum_{l=1}^L \phi_{x,\tilde c_l}$ for $L \in \mathbb{N}$ i.i.d. samples $\tilde c_l \sim \mu_t$ instead of the expected features $\bar\psi_{x,\mu_t}$. The corresponding feature set is $\hat\Psi_t = \{\hat\psi_{x,\mu_t} : x \in X\}$. For both variants of the algorithm we show the following regret bound.

Algorithm 1: UCB for linear stochastic bandits with context distributions
  Initialize $\hat\theta_0 = 0 \in \mathbb{R}^d$, $V_0 = \lambda I \in \mathbb{R}^{d \times d}$
  For step $t = 1, 2, \dots, T$:
    Environment chooses $\mu_t \in \mathcal{P}(C)$   // context distribution
    Learner observes $\mu_t$
    Set $\bar\Psi_t = \{\bar\psi_{x,\mu_t} : x \in X\}$ with $\bar\psi_{x,\mu_t} := \mathbb{E}_{c\sim\mu_t}[\phi_{x,c}]$   // variant 1, expected version
    Alternatively, sample $\tilde c_{t,1}, \dots, \tilde c_{t,L} \sim \mu_t$ with $L = t$ and set $\hat\Psi_t = \{\hat\psi_{x,\mu_t} : x \in X\}$ with $\hat\psi_{x,\mu_t} = \frac{1}{L}\sum_{l=1}^L \phi_{x,\tilde c_{t,l}}$   // variant 2, sampled version
    Run a UCB step with $\bar\Psi_t$ (resp. $\hat\Psi_t$) as context set:   // reduction
      choose action $x_t = \arg\max_{\bar\psi_{x,\mu_t} \in \bar\Psi_t}\ \bar\psi_{x,\mu_t}^\top\hat\theta_{t-1} + \beta_t \|\bar\psi_{x,\mu_t}\|_{V_{t-1}^{-1}}$   // UCB action
    Environment provides $y_t = \phi_{x_t,c_t}^\top\theta + \epsilon_t$ where $c_t \sim \mu_t$   // reward observation
    Update $V_t = V_{t-1} + \bar\psi_{x_t,\mu_t}\bar\psi_{x_t,\mu_t}^\top$ and $\hat\theta_t = V_t^{-1}\sum_{s=1}^t \bar\psi_{x_s,\mu_s} y_s$   // least-squares update

Theorem 1. The regret of Algorithm 1 with expected feature sets $\bar\Psi_t$ and $\beta_t = \beta_t(\sqrt{4 + \sigma^2}, \delta/2)$ is bounded at time $T$ with probability at least $1 - \delta$ by
$$R_T \le \beta_T \sqrt{8T \log \frac{\det V_T}{\det V_0}} + 4\sqrt{2T \log \frac{2}{\delta}} .$$
Further, for finite action sets $X$, if the algorithm uses sampled feature sets $\hat\Psi_t$ with $L = t$ and $\beta_t = \hat\beta_t$ as defined in (11), Appendix A.2, then with probability at least $1 - \delta$,
$$R_T \le \hat\beta_T \sqrt{8T \log \frac{\det V_T}{\det V_0}} + 4\sqrt{2T \log \frac{2|X|\pi_T}{\delta}} .$$

As before, one can further upper bound the data-dependent terms to obtain an overall regret bound of order $R_T \le \mathcal{O}(d\sqrt{T})$, see (Abbasi-Yadkori et al., 2011, Theorem 2). With iterative updates of the least-squares estimator, the per-step computational complexity is $\mathcal{O}(Ld^2|X|)$ if the UCB action is computed by simple enumeration over all actions.
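The sampled variant of Algorithm 1 is easy to implement on top of a standard linear UCB step. The sketch below reuses the toy problem (phi, theta, sigma) and the hypothetical LinUCB class from the two sketches above, and treats $\beta_t$ as a fixed tuning constant rather than the exact expression (11); it is an illustration of the reduction, not the paper's implementation.

```python
# Sketch of Algorithm 1, variant 2 (sampled features), reusing the earlier sketches.
rng = np.random.default_rng(1)
agent = LinUCB(d=d, lam=1.0)
T, beta = 500, 1.0                                     # beta: tuning constant in this sketch
total_regret = 0.0
for t in range(1, T + 1):
    mu = rng.dirichlet(np.ones(n_contexts))            # environment picks mu_t
    L = t                                              # sample size L = t (variant 2)
    samples = rng.choice(n_contexts, size=L, p=mu)     # c_{t,1}, ..., c_{t,L} ~ mu_t
    Psi_hat = phi[:, samples, :].mean(axis=1)          # hat psi_{x,mu_t}, one row per action
    x = agent.select(Psi_hat, beta)                    # UCB action on the reduced problem
    c = rng.choice(n_contexts, p=mu)                   # hidden context realization c_t
    y = phi[x, c] @ theta + sigma * rng.normal()       # reward observation y_t
    agent.update(Psi_hat[x], y)                        # least-squares update with hat psi_{x_t,mu_t}
    x_star = int(np.argmax((phi @ theta) @ mu))        # baseline action of regret (1)
    total_regret += (phi[x_star, c] - phi[x, c]) @ theta
print("cumulative regret of Algorithm 1 (sampled variant):", total_regret)
```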
4.1 Regret analysis: Proof of Theorem 1

Recall that $x_t$ is the action that the UCB algorithm selects at time $t$, $\bar\psi_t$ the corresponding feature vector in $\bar\Psi_t$, and we define $\bar\psi_t^* = \arg\max_{\bar\psi \in \bar\Psi_t} \bar\psi^\top\theta$. We show that the regret $R_T$ is bounded in terms of the regret $R_T^{UCB} := \sum_{t=1}^T \bar\psi_t^{*\top}\theta - \bar\psi_t^\top\theta$ of the UCB algorithm on the contextual bandit defined by the sequence of action feature sets $\bar\Psi_t$.

Lemma 3. The regret of Algorithm 1 with the expected feature sets $\bar\Psi_t$ is bounded at time $T$, with probability at least $1 - \delta$, by
$$R_T \le R_T^{UCB} + 4\sqrt{2T \log \tfrac{1}{\delta}} .$$
Further, if the algorithm uses the sample-based features $\hat\Psi_t$ with $L = t$ at iteration $t$, the regret is bounded at time $T$, with probability at least $1 - \delta$, by
$$R_T \le R_T^{UCB} + 4\sqrt{2T \log \tfrac{|X|\pi_T}{\delta}} .$$

Proof. Consider first the case where we use the expected features $\bar\psi_{x_t,\mu_t}$. We add and subtract $(\bar\psi_{x_t^*,\mu_t} - \bar\psi_{x_t,\mu_t})^\top\theta$ and use $\bar\psi_{x_t^*,\mu_t}^\top\theta \le \bar\psi_t^{*\top}\theta$ to bound the regret by
$$R_T \le R_T^{UCB} + \sum_{t=1}^T D_t ,$$
where we defined $D_t = (\phi_{x_t^*,c_t} - \bar\psi_{x_t^*,\mu_t} + \bar\psi_{x_t,\mu_t} - \phi_{x_t,c_t})^\top\theta$. It is easy to verify that $\mathbb{E}_{c_t \sim \mu_t}[D_t \mid \mathcal{F}_{t-1}, \mu_t, x_t] = 0$, that is, $D_t$ is a martingale difference sequence with $|D_t| \le 4$, and $M_T = \sum_{t=1}^T D_t$ is a martingale. The first part of the lemma therefore follows from the Azuma-Hoeffding inequality (Lemma 4, Appendix). For the sample-based version, the reasoning is similar, but we need to ensure that the features $\hat\psi_{x,\mu_t} = \frac{1}{L}\sum_{l=1}^L \phi_{x,\tilde c_l}$ are sufficiently concentrated around their expected counterparts $\bar\psi_{x,\mu_t}$ for any $x \in X$ and $t \in \mathbb{N}$. We provide details in Appendix A.1.

Proof of Theorem 1. Clearly, the lemma gives a regret bound for Algorithm 1 if the regret term $R_T^{UCB}$ is bounded. The main difficulty is that the reward observation $y_t = \phi_{x_t,c_t}^\top\theta + \epsilon_t$ is generated from a different feature vector than the feature $\bar\psi_t \in \bar\Psi_t$ that is chosen by UCB. Note that in general it is not even true that $\phi_{x_t,c_t} \in \bar\Psi_t$. However, a closer inspection of the reward signal reveals that $y_t$ can be written as
$$y_t = \bar\psi_t^\top\theta + \xi_t + \epsilon_t \qquad \text{with} \qquad \xi_t := (\phi_{x_t,c_t} - \bar\psi_t)^\top\theta . \quad (2)$$
For the variant that uses the expected features $\bar\psi_t = \bar\psi_{x_t,\mu_t}$, our construction already ensures that $\mathbb{E}[\xi_t \mid \mathcal{F}_{t-1}, \mu_t, x_t] = \mathbb{E}[\phi_{x_t,c_t} - \bar\psi_{x_t,\mu_t} \mid \mathcal{F}_{t-1}, \mu_t, x_t]^\top\theta = 0$. Note that the distribution of $\xi_t$ depends on $x_t$ and is therefore heteroscedastic in general. However, by boundedness of the rewards, $|\xi_t| \le 2$, and hence $\xi_t$ is 2-subgaussian, which allows us to continue with a homoscedastic noise bound. We see that $y_t$ acts like an observation of $\bar\psi_t^\top\theta$ perturbed by $\sqrt{4 + \sigma^2}$-subgaussian noise (for two independent random variables $X$ and $Y$ that are $\sigma_1$- and $\sigma_2$-subgaussian respectively, $X + Y$ is $\sqrt{\sigma_1^2 + \sigma_2^2}$-subgaussian). Therefore, the construction of the confidence bounds for the least squares estimator w.r.t. $\bar\psi_t$ remains valid at the cost of an increased variance proxy, and we are required to use $\beta_t$ with $\rho = \sqrt{4 + \sigma^2}$ in the definition of the confidence set. The regret bound for the UCB algorithm (Lemma 2) and an application of the union bound complete the proof for this case. When we use the sample-based features $\hat\psi_{x,\mu_t}$, the noise term $\xi_t$ can be biased, because $x_t$ depends on the sampled features and $\mathbb{E}[\hat\psi_{x_t,\mu_t} \mid \mathcal{F}_{t-1}, \mu_t] \ne \bar\psi_{x_t,\mu_t}$. This bias carries over to the least-squares estimator, but can be controlled by a more careful analysis. See Appendix A.2 for details.
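For completeness, the concentration step behind the first bound of Lemma 3 above is the standard Azuma-Hoeffding inequality applied to the bounded martingale difference sequence $D_t$; the following is a worked restatement under the boundedness assumptions of Section 2, not a new result.

```latex
% Azuma-Hoeffding step for the first part of Lemma 3; |D_t| <= 4 follows from
% ||phi_{x,c}||_2 <= 1 and ||theta||_2 <= 1.
\begin{align*}
&\mathbb{E}[D_t \mid \mathcal{F}_{t-1}, \mu_t, x_t] = 0, \qquad |D_t| \le 4
\\[2pt]
&\Longrightarrow\quad
\Pr\Big[\textstyle\sum_{t=1}^{T} D_t \;\ge\; 4\sqrt{2T\log(1/\delta)}\Big] \;\le\; \delta,
\end{align*}
% so that with probability at least 1 - delta,  R_T <= R_T^{UCB} + 4 sqrt(2 T log(1/delta)).
```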
4.2 When the context realization is observed

We now turn our attention to the alternative setting where it is possible to observe the realized context $c_t$ (e.g. actual weather measurements) after the learner has chosen $x_t$. In Algorithm 1, so far our estimate $\hat\theta_t$ only uses the data $\{(x_s, \mu_s, y_s)\}_{s=1}^t$, but with the context observation we have $\{(x_s, c_s, y_s)\}_{s=1}^t$ available. It makes sense to use the additional information to improve our estimate $\hat\theta_t$, and as we show below this reduces the amount of exploration of the UCB algorithm. The pseudo-code of the modified algorithm is given in Algorithm 2 (Appendix B), where the only difference is that we replace the estimate of $\theta$ by the least squares estimate $\hat\theta_t = \arg\min_{\theta \in \mathbb{R}^d} \sum_{s=1}^t (\phi_{x_s,c_s}^\top\theta - y_s)^2 + \lambda\|\theta\|_2^2$. Since now the observation noise $\epsilon_t = y_t - \phi_{x_t,c_t}^\top\theta$ is only $\sigma$-subgaussian (instead of $\sqrt{4 + \sigma^2}$-subgaussian), we can use the smaller scaling factor $\beta_t$ with $\rho = \sigma$ to obtain a tighter upper confidence bound.

Theorem 2. The regret of Algorithm 2 based on the expected feature sets $\bar\Psi_t$ and $\beta_t = \beta_t(\sigma, \delta/3)$ is bounded with probability at least $1 - \delta$ by
$$R_T \le \beta_T \sqrt{8T \log \frac{\det V_T}{\det V_0}} + 4(1 + \lambda^{-1/2}\beta_T)\sqrt{2T \log \tfrac{3}{\delta}} .$$

The importance of this result is that it justifies the use of the smaller scaling $\beta_t$ of the confidence set, which affects the action choice of the UCB algorithm. In practice, $\beta_t$ has a large impact on the amount of exploration, and a tighter choice can significantly reduce the regret, as we show in our experiments. We note that in this case the reduction to the regret bound of UCB is slightly more involved than previously. As before, we use Lemma 3 to reduce a regret bound on $R_T$ to the regret $R_T^{UCB}$ that the UCB algorithm obtains on the sequence of context-feature sets $\bar\Psi_t$. Since the UCB action is now based on tighter confidence bounds, we expect the regret $R_T^{UCB}$ to be smaller, too. This does not follow directly from the UCB analysis, as there the estimator is based on the features $\bar\psi_{x_t,\mu_t}$ instead of $\phi_{x_t,c_t}$. We defer the complete proof to Appendix B.1, where we also show a similar result for the sample-based feature sets $\hat\Psi_t$, analogous to Theorem 1.

4.3 Kernelized stochastic bandits with context distributions

In the kernelized setting, the reward function $f : X \times C \to \mathbb{R}$ is a member of a known reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ with kernel function $k : (X \times C)^2 \to \mathbb{R}$. In the following, let $\|\cdot\|_{\mathcal{H}}$ denote the Hilbert norm, and denote by $k_{x,c} := k((x,c), \cdot\,) \in \mathcal{H}$ the kernel features. For the analysis we further make the standard boundedness assumptions $\|f\|_{\mathcal{H}} \le 1$ and $\|k_{x,c}\|_{\mathcal{H}} \le 1$. We provide details on how to estimate $f$ given data $\{(x_s, \mu_s, y_s)\}_{s=1}^t$ with uncertain context. As in the linear case, the estimator $\hat f_t$ can be defined as an empirical risk minimizer with parameter $\lambda > 0$,
$$\hat f_t = \arg\min_{f \in \mathcal{H}} \ \sum_{s=1}^t \big(\mathbb{E}_{c \sim \mu_s}[f(x_s, c)] - y_s\big)^2 + \lambda \|f\|_{\mathcal{H}}^2 . \quad (3)$$
In the literature this is known as distributional risk minimization (Muandet et al., 2017, Section 3.7.3). The following representer theorem shows that the solution can be expressed as a linear combination of kernel mean embeddings $\bar k_{x,\mu} := \mathbb{E}_{c \sim \mu}[k_{x,c}] \in \mathcal{H}$.

Theorem 3 (Muandet et al. (2012, Theorem 1)). Any $f \in \mathcal{H}$ that minimizes the regularized risk functional (3) admits a representation of the form $f = \sum_{s=1}^t \alpha_s \bar k_{x_s,\mu_s}$ for some $\alpha_s \in \mathbb{R}$.

It is easy to verify that the solution to (3) can be written as
$$\hat f_t(x, c) = \bar k_t(x, c)^\top (\bar K_t + \lambda I)^{-1} y_t \quad (4)$$
where $\bar k_t(x, c) = [\bar k_{x_1,\mu_1}(x, c), \dots, \bar k_{x_t,\mu_t}(x, c)]^\top$, $(\bar K_t)_{a,b} = \mathbb{E}_{c \sim \mu_b}[\bar k_{x_a,\mu_a}(x_b, c)]$ for $1 \le a, b \le t$ is the kernel matrix, and $y_t = [y_1, \dots, y_t]^\top$ denotes the vector of observations. Likewise, the estimator can be computed from sample-based kernel mean embeddings $\hat k^L_{x,\mu} := \frac{1}{L}\sum_{i=1}^L k((x, \tilde c_i), \cdot\,) \in \mathcal{H}$ for i.i.d. samples $\tilde c_i \sim \mu$. This allows for an efficient implementation also in the kernelized setting, at the usual cost of inverting the kernel matrix. With iterative updates, the overall cost amounts to $\mathcal{O}(LT^3)$. The cubic scaling in $T$ can be avoided with finite-dimensional feature approximations or inducing point methods, e.g. Rahimi and Recht (2008); Mutny and Krause (2018).
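To make the estimator (4) concrete, the following sketch evaluates $\hat f_t$ with sample-based kernel mean embeddings $\hat k^L_{x,\mu}$, using an RBF kernel on concatenated action-context vectors. The kernel choice, the toy data and all names are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the distributional regression estimator (4) with sampled mean embeddings.
import numpy as np

def rbf(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2, axis=-1))

def mean_embedding_gram(X_hist, C_samples, gamma=1.0):
    """(K_t)_{a,b}: kernel averaged over the sampled contexts of rounds a and b."""
    t = len(X_hist)
    K = np.zeros((t, t))
    for a in range(t):
        for b in range(t):
            vals = [rbf(np.r_[X_hist[a], ca], np.r_[X_hist[b], cb], gamma)
                    for ca in C_samples[a] for cb in C_samples[b]]
            K[a, b] = np.mean(vals)
    return K

def f_hat(x, c, X_hist, C_samples, y, lam=1.0, gamma=1.0):
    """Evaluate hat f_t(x, c) = k_t(x, c)^T (K_t + lam I)^{-1} y_t, sample-based."""
    t = len(X_hist)
    K = mean_embedding_gram(X_hist, C_samples, gamma)
    k_vec = np.array([np.mean([rbf(np.r_[X_hist[s], cs], np.r_[x, c], gamma)
                               for cs in C_samples[s]]) for s in range(t)])
    return k_vec @ np.linalg.solve(K + lam * np.eye(t), y)

# usage with random toy data: 2-d actions, 1-d contexts, L = 5 context samples per round
rng = np.random.default_rng(0)
X_hist = [rng.normal(size=2) for _ in range(10)]
C_samples = [rng.normal(size=(5, 1)) for _ in range(10)]
y = rng.normal(size=10)
print(f_hat(rng.normal(size=2), rng.normal(size=1), X_hist, C_samples, y))
```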
The UCB algorithm can be defined using an analogous concentration result for the RKHS setting (Abbasi-Yadkori, 2012). We provide details and the complete kernelized algorithm (Algorithm 3) in Appendix C. The corresponding regret bound is summarized in the following theorem.

Theorem 4. At any time $T \in \mathbb{N}$, the regret of Algorithm 3 with exact kernel mean embeddings $\bar k_{x,\mu}$ and $\beta_t$ as defined in Lemma 6 in Appendix C is bounded with probability at least $1 - \delta$ by
$$R_T \le \beta_T \sqrt{8T \log \det\!\big(I + (\lambda\rho)^{-1}\bar K_T\big)} + 4\sqrt{2T \log \tfrac{2}{\delta}} .$$

Again, the data-dependent log-determinant in the regret bound can be replaced with kernel-specific bounds, referred to as the maximum information gain $\gamma_T$ (Srinivas et al., 2010).

5 Experiments

We evaluate the proposed method on a synthetic example as well as on two benchmarks that we construct from real-world data. Our focus is on understanding the effect of the sample size $L$ used to define the context set $\hat\Psi_t$. We compare three different observational modes, with decreasing amount of information available to the learner. First, in the exact setting, we allow the algorithm to observe the context realization before choosing an action, akin to the usual contextual bandit setting. Note that this variant possibly obtains negative reward on the regret objective (1), because $x_t^*$ is computed to maximize the expected reward over the context distribution, independent of $c_t$. Second, in the observed setting, decisions are based on the context distribution, but the regression is based on the exact context realization. Last, only the context distribution is used in the hidden setting. We evaluate the effect of the sample sizes $L = 10, 100$ and compare to the variant that uses the exact expectation of the features. As is common practice, we treat the confidence parameter $\beta_T$ as a tuning parameter that we choose to minimize the regret after $T = 1000$ steps. Below we provide details on the experimental setup; the evaluation is shown in Figure 1. In all experiments, the exact version significantly outperforms the distributional variants or even achieves negative regret, as anticipated. Consistent with our theory, observing the exact context after the action choice improves performance compared to the unobserved variant. The sample-based algorithm is competitive with the expected features already for $L = 100$ samples.

[Figure 1: cumulative regret over $T = 1000$ steps on (a) the synthetic example, (b) Movielens and (c) the wheat yield data, comparing the variants exact, obs (E), obs (L=10), obs (L=100), hidden (E), hidden (L=10), hidden (L=100).]

Figure 1: The plots show cumulative regret as defined in (1). As expected, the variant that does not observe the context (hidden) is outperformed by the variant that uses the context realization for regression (obs). The sample size $L$ used to construct the feature sets from the context distribution has a significant effect on the regret, where with $L = 100$ performance is already competitive with the policy that uses the exact expectation over features (E). The exact baseline, which has access to the context realization before taking the decision, achieves negative regret on benchmarks (a) and (b), as the regret objective (1) compares to the action maximizing the expected reward. The error bars show two times the standard error over 100 trials for (a) and (c), and 200 trials for (b). The variance in the Movielens experiment is fairly large, likely because our linear model is misspecified; at first glance, it looks like the sample-based version outperforms the expected version in one case, but from repeated trials we confirmed that this is only an effect of the randomness in the results.

Synthetic Example. As a simple synthetic benchmark, we set the reward function to $f(x, c) = -\sum_{i=1}^5 (x_i - c_i)^2$, where both actions and contexts are vectors in $\mathbb{R}^5$. We choose this quadratic form to create a setting where the optimal action strongly depends on the context $c_t$. As linear parametrization we choose $\phi(x, c) = (x_1^2, \dots, x_5^2,\ c_1^2, \dots, c_5^2,\ x_1 c_1, \dots, x_5 c_5)$. The action set consists of $k = 100$ elements that we sample at the beginning of each trial from a standard Gaussian distribution. For the context distribution, we first sample a random element $m_t \in \mathbb{R}^5$, again from a multivariate normal distribution, and then set $\mu_t = \mathcal{N}(m_t, 1)$. Observation noise is Gaussian with standard deviation 0.1.
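The sketch below shows how the synthetic benchmark above can be generated, reading the quadratic reward as the negative squared distance $-\sum_i (x_i - c_i)^2$ so that it is exactly linear in the stated feature map; the variable names and the single illustrated round are ours, not the paper's code.

```python
# Sketch of the synthetic benchmark (reward, feature map and context distribution).
import numpy as np

rng = np.random.default_rng(0)
k, dim, noise_sd = 100, 5, 0.1
actions = rng.normal(size=(k, dim))                  # action set, resampled per trial

def features(x, c):
    return np.concatenate([x ** 2, c ** 2, x * c])   # phi(x, c), 15-dimensional

# theta chosen so that phi(x, c) @ theta == -sum_i (x_i - c_i)^2
theta = np.concatenate([-np.ones(dim), -np.ones(dim), 2 * np.ones(dim)])

def sample_round():
    m = rng.normal(size=dim)                         # mean m_t of the context distribution
    mu = (m, 1.0)                                    # mu_t = N(m_t, I), stored as (mean, sd)
    c = m + rng.normal(size=dim)                     # hidden context realization c_t ~ mu_t
    return mu, c

# one example round with the first action
mu, c = sample_round()
x = actions[0]
y = features(x, c) @ theta + noise_sd * rng.normal()
print("reward observation:", y, " true f(x, c):", -np.sum((x - c) ** 2))
```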
Movielens Data. Using matrix factorization, we construct 6-dimensional features for user ratings of movies in the movielens-1m dataset (Harper and Konstan, 2016). We use the learned embedding as ground truth to generate the reward, which we round to half-integers between 0 and 5, like the actual ratings. Therefore our model is misspecified in this experiment. Besides the movie ratings, the dataset provides basic demographic data for each user. In the interactive setting, the context realization is a randomly sampled user from the data. The context distribution is set to the empirical distribution of users in the dataset with the same demographic data. The setup is motivated by a setting where the system interacts with new users, for which we have already obtained the basic demographic data, but not yet the exact user features (which in collaborative filtering are computed from the user's ratings). We provide further details in Appendix D.1.

Crop Yield Data. We use a wheat yield dataset that was systematically collected by the Agroscope institute in Switzerland over 15 years on 10 different sites. For each site and year, a 16-dimensional suitability factor based on recorded weather conditions is available. The dataset contains 8849 yield measurements for 198 crops. From this we construct a dataset $D = \{(x_i, w_i, y_i)\}$, where $x_i$ is the identifier of the tested crop, $w_i \in \mathbb{R}^{16+10}$ is the 16-dimensional suitability factor obtained from weather measurements, augmented with a one-hot encoding of the site, and $y_i$ is the normalized crop yield. We fit a bilinear model $y_i \approx w_i^\top W V x_i$ to obtain 5-dimensional features $Vx$ for each variety $x$ and site features $w^\top W$ that take the weather conditions $w$ of the site into account. From this model, we generate the ground-truth reward. Our goal is to provide crop recommendations that maximize yield on a given site with characteristics $w$. Since $w$ is based on weather measurements that are not available ahead of time, we set the context distribution such that each feature of $w$ is perturbed by a Gaussian distribution centered at the true $w$. We set the variance of the perturbation to the empirical variance of the corresponding feature for the current site over all 15 years. Further details are in Appendix D.2.

6 Related Work

There is a large array of work on bandit algorithms; for a survey see Bubeck and Cesa-Bianchi (2012) or the book by Lattimore and Szepesvári (2018). Of interest to us is the stochastic contextual bandit problem, where the learner chooses actions after seeing a context, and the goal is to compete with a class of policies that map contexts to actions. This is akin to reinforcement learning (Sutton and Barto, 2018), but the contextual bandit problem is different in that the sequence of contexts is typically allowed to be arbitrary (even adversarially chosen) and does not necessarily follow a specific transition model. The contextual bandit problem in this formulation dates back to at least Abe and Long (1999) and Langford and Zhang (2007).
The perhaps best understood instance of this model is the linear contextual bandit, where the reward function is a linear map of feature vectors (Auer et al., 2002). One of the most popular algorithms is the Upper Confidence Bound (UCB) algorithm, first introduced by Auer (2002) for the multi-armed bandit problem and later extended to the linear case by Li et al. (2010). The analysis of this algorithm was improved by Dani et al. (2008), Abbasi-Yadkori et al. (2011) and Li et al. (2019), where the main technical challenge is to construct tight confidence sets for an online version of the least squares estimator. Alternative exploration strategies have been considered as well, for instance Thompson sampling (Thompson, 1933), which was analyzed for the linear model by Agrawal and Goyal (2013) and Abeille and Lazaric (2017). Other notable approaches include an algorithm that uses a perturbed data history as exploration mechanism (Kveton et al., 2019), and a mostly greedy algorithm that leverages the randomness in the context to obtain sufficient exploration (Bastani et al., 2017).

In kernelized bandits, the reward function is contained in a given reproducing kernel Hilbert space (RKHS). This setting is closely related to Bayesian optimization (Mockus, 1982). Again, the analysis hinges on the construction of confidence sets and on bounding a quantity referred to as the information gain by the decay of the kernel's eigenspectrum. An analysis of the UCB algorithm for this setting was provided by Srinivas et al. (2010). It was later refined by Abbasi-Yadkori (2012); Valko et al. (2013); Chowdhury and Gopalan (2017); Durand et al. (2018) and extended to the contextual setting by Krause and Ong (2011). Interestingly, in our reduction the noise distribution depends on the action, a setting also referred to as heteroscedastic bandits. Heteroscedastic bandits were previously considered by Hsieh et al. (2019) and Kirschner and Krause (2018).

Stochastic uncertainty on the action choice has been studied by Oliveira et al. (2019) in the context of Bayesian optimization. Closer related is the work by Yun et al. (2017), who introduce a linear contextual bandit model where the observed feature is perturbed by noise and the objective is to compete with the best policy that has access to the unperturbed feature vector. The main difference to our setting is that we assume that the environment provides a distribution over feature vectors (instead of a single, perturbed vector) and we compute the best action as a function of the distribution. As a consequence, we are able to obtain $\mathcal{O}(\sqrt{T})$ regret bounds without further assumptions on the context distribution, while Yun et al. (2017) get $\mathcal{O}(T^{7/8})$ with identical noise on each feature and $\mathcal{O}(T^{2/3})$ for Gaussian feature distributions. Most closely related is the work by Lamprier et al. (2018) on linear bandits with stochastic context. The main difference to our setting is that the context distribution in Lamprier et al. (2018) is fixed over time, which allows aggregated estimates of the mean feature vector to be built over time. Our setting is more general in that it allows an arbitrary sequence of distributions as well as correlation between the feature distributions of different actions. Moreover, in contrast to previous work, we discuss the kernelized setting and the variant of the setting where the context is observed exactly after the action choice. Finally, adversarial contextual bandit algorithms also apply in our setting, for example the EXP4 algorithm of Auer et al. (2002) or ILTCB of Agarwal et al. (2014).
Here, the objective is to compete with the best policy in a given class of policies, which in our setting would require working with a covering of the set of distributions $\mathcal{P}(C)$. However, these algorithms do not exploit the linear reward assumption and are therefore arguably less practical in our setting.

7 Conclusion

We introduced context distributions for stochastic bandits, a model that is naturally motivated in many applications and allows us to capture the learner's uncertainty in the context realization. The method we propose is based on the UCB algorithm, and in fact both our model and algorithm strictly generalize the standard setting in the sense that we recover the usual model and the UCB algorithm if the environment chooses only Dirac delta distributions. The most practical variant of the proposed algorithm requires only sample access to the context distributions and satisfies a high-probability regret bound that is order optimal in the feature dimension and the horizon, up to logarithmic factors.

Acknowledgments

The authors thank Agroscope for providing the crop yield dataset, in particular Didier Pellet, Lilia Levy and Juan Herrera, who collected the winter wheat data, and Annelie Holzkämper, who developed the environmental suitability factors model. Further, the authors acknowledge the work by Mariyana Koleva and Dejan Mirčić, who performed the initial data cleaning and exploration as part of their Master's theses. This research was supported by SNSF grant 200020 159557 and has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No 815943.

References

Abbasi-Yadkori, Y. (2012). Online learning for linearly parametrized control problems. PhD thesis.

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312-2320.

Abe, N. and Long, P. M. (1999). Associative reinforcement learning using linear probabilistic concepts. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 3-11. Morgan Kaufmann Publishers Inc.

Abeille, M. and Lazaric, A. (2017). Linear Thompson sampling revisited. In Artificial Intelligence and Statistics, pages 176-184.

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. (2014). Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638-1646.

Agrawal, S. and Goyal, N. (2013). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127-135.

Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77.

Bastani, H., Bayati, M., and Khosravi, K. (2017). Mostly exploration-free algorithms for contextual bandits. arXiv preprint arXiv:1704.09011.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1).

Chowdhury, S. R. and Gopalan, A. (2017). On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 844-853. JMLR.org.

Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In COLT, pages 355-366. Omnipress.
Durand, A., Maillard, O.-A., and Pineau, J. (2018). Streaming kernel regression with provably adaptive mean, variance, and regularization. The Journal of Machine Learning Research, 19(1):650-683.

Harper, F. M. and Konstan, J. A. (2016). The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19.

Holzkämper, A., Calanca, P., and Fuhrer, J. (2013). Identifying climatic limitations to grain maize yield potentials using a suitability evaluation approach. Agricultural and Forest Meteorology, 168:149-159.

Hsieh, P.-C., Liu, X., Bhattacharya, A., and Kumar, P. (2019). Stay with me: Lifetime maximization through heteroscedastic linear bandits with reneging. In International Conference on Machine Learning, pages 2800-2809.

Hug, N. (2017). Surprise, a Python library for recommender systems. http://surpriselib.com.

Kirschner, J. and Krause, A. (2018). Information directed sampling and bandits with heteroscedastic noise. In Proc. International Conference on Learning Theory (COLT).

Koren, Y., Bell, R., and Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, (8):30-37.

Krause, A. and Ong, C. S. (2011). Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447-2455.

Kveton, B., Szepesvari, C., Ghavamzadeh, M., and Boutilier, C. (2019). Perturbed-history exploration in stochastic linear bandits. arXiv preprint arXiv:1903.09132.

Lamprier, S., Gisselbrecht, T., and Gallinari, P. (2018). Profile-based bandit with unknown profiles. The Journal of Machine Learning Research, 19(1):2060-2099.

Langford, J. and Zhang, T. (2007). The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 817-824. Citeseer.

Lattimore, T. and Szepesvári, C. (2018). Bandit Algorithms.

Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661-670. ACM.

Li, Y., Wang, Y., and Zhou, Y. (2019). Tight regret bounds for infinite-armed linear contextual bandits.

Mockus, J. (1982). The Bayesian approach to global optimization. System Modeling and Optimization, pages 473-481.

Muandet, K., Fukumizu, K., Dinuzzo, F., and Schölkopf, B. (2012). Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems, pages 10-18.

Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B., et al. (2017). Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1-141.

Mutny, M. and Krause, A. (2018). Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In Advances in Neural Information Processing Systems, pages 9005-9016.

Oliveira, R., Ott, L., and Ramos, F. (2019). Bayesian optimisation under uncertain inputs. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1177-1184.

Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177-1184.

Srinivas, N., Krause, A., Seeger, M., and Kakade, S. M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015-1022.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294.

Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. (2013). Finite-time analysis of kernelised contextual bandits. arXiv:1309.6869 [cs, stat].

Yun, S.-Y., Nam, J. H., Mo, S., and Shin, J. (2017). Contextual multi-armed bandits under feature uncertainty. arXiv preprint arXiv:1703.01347.