# multitask_offpolicy_learning_from_bandit_feedback__588cc2ff.pdf Multi-Task Off-Policy Learning from Bandit Feedback Joey Hong 1 Branislav Kveton 2 Manzil Zaheer 3 Sumeet Katariya 2 Mohammad Ghavamzadeh 4 Many practical problems involve solving similar tasks. In recommender systems, the tasks can be users with similar preferences; in search engines, the tasks can be items with similar affinities. To learn statistically efficiently, the tasks can be organized in a hierarchy, where the task affinity is captured using an unknown latent parameter. We study the problem of off-policy learning for similar tasks from logged bandit feedback. To solve the problem, we propose a hierarchical off-policy optimization algorithm Hier OPO. The key idea is to estimate the task parameters using the hierarchy and then act pessimistically with respect to them. To analyze the algorithm, we develop novel Bayesian error bounds. Our bounds are the first in off-policy learning that improve with a more informative prior and capture statistical gains due to hierarchical models. Therefore, they are of a general interest. Hier OPO also performs well in practice. Our experiments demonstrate the benefits of using the hierarchy over solving each task independently. 1. Introduction Many interactive systems, such as search and recommender systems, can be modeled as a contextual bandit (Li et al., 2010; Chu et al., 2011), where a policy observes a context, takes one of possible actions, and then receives a stochastic reward for that action. In many applications, it is prohibitively expensive to learn policies online by contextual bandit algorithms, because exploration has a major impact on user experience. However, offline data collected by a previously deployed policy are often available. Offline, or off-policy, optimization using such logged data is a practical way of learning policies without costly online interactions (Dudik et al., 2014; Swaminathan & Joachims, 2015). 1University of California, Berkeley 2Amazon 3Deep Mind 4Google Research. Correspondence to: Joey Hong . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). Because we cannot explore beyond the logged dataset, it is important to use the logged data in the most statistically efficient way. One way of achieving this is by modeling the structure of the solved problem. As an example, in bandit algorithms, we could achieve higher statistical efficiency by using information about the reward distribution (Garivier & Cappe, 2011), a prior distribution over model parameters (Thompson, 1933; Agrawal & Goyal, 2012; Chapelle & Li, 2012; Russo et al., 2018), or features (Dani et al., 2008; Abbasi-Yadkori et al., 2011; Agrawal & Goyal, 2013). In this work, we consider a natural structure where multiple similar tasks are related through a hierarchical Bayesian model (Gelman et al., 2013; Kveton et al., 2021; Hong et al., 2022b). Each task is parameterized by a task parameter sampled i.i.d. from a distribution parameterized by a hyperparameter. The unknown hyper-parameter relates the tasks. Specifically, data from any task helps in improving its estimate, which in turn improves estimates of all other task parameters. Although the tasks are similar, they are sufficiently different to require different polices, and we address this multi-task off-policy learning problem in this work. To solve the problem, we propose an algorithm called hierarchical off-policy optimization (Hier OPO). 
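To make the hierarchical structure concrete before the formal setup in Section 2, the following minimal sketch samples a hyper-parameter from its hyper-prior, task parameters around it, and a logged reward. The Gaussian choices, dimensions, and names are illustrative and ours, matching the linear Gaussian model analyzed later rather than any released code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_tasks = 5, 10                         # feature dimension, number of tasks

# Hyper-prior Q = N(mu_q, Sigma_q) over the unknown hyper-parameter mu_*.
mu_q, Sigma_q = np.zeros(d), 1.0 * np.eye(d)
mu_star = rng.multivariate_normal(mu_q, Sigma_q)

# Task prior: each task parameter theta_s is drawn i.i.d. around mu_*.
Sigma_0 = 0.25 * np.eye(d)
thetas = rng.multivariate_normal(mu_star, Sigma_0, size=num_tasks)

# One logged interaction: feature phi(x, a) of the taken action, noisy linear reward.
def log_interaction(task, phi, noise_sd=0.5):
    reward = phi @ thetas[task] + rng.normal(scale=noise_sd)
    return task, phi, reward

print(log_interaction(task=3, phi=rng.uniform(-0.5, 0.5, size=d)))
```

Data from every task informs the estimate of the shared hyper-parameter, which in turn sharpens the estimates of all task parameters; this pooling is what Hier OPO exploits.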
Since off-policy algorithms must reason about counterfactual rewards of actions that may not have been taken frequently in the logged dataset, a common approach is to learn pessimistic, or lower confidence bound (LCB), estimates of the mean rewards and act according to them (Buckman et al., 2021; Jin et al., 2021). Hier OPO is an instance of this approach where high-probability LCBs are estimated using a hierarchical model. Our work makes four major contributions. First, we formalize the problem of multi-task off-policy optimization as a multi-task bandit in a hierarchical model. Second, we propose an efficient algorithm for solving the problem, which we call Hier OPO. The key idea in Hier OPO is to compute lower confidence bounds on the mean rewards of actions and act according to them. The LCBs can be computed in a closed form in linear Gaussian models (Lindley & Smith, 1972). Third, we analyze the quality of our policies using Bayesian error bounds. Our bounds capture the effect of a more informative prior and statistical gains due to hierarchical models. These are the first such bounds in off-policy learning and should be of a wide interest. Finally, we evalu- Multi-Task Off-Policy Learning from Bandit Feedback ate Hier OPO on both synthetic and real-world problems. We start with introducing our notation. Random variables are capitalized, except for Greek letters like θ. For any positive integer n, we define [n] = {1, . . . , n}. The indicator function is 1{ }. The i-th entry of vector v is denoted by vi. If the vector is already indexed, such as vj, we write vj,i. We denote the maximum and minimum eigenvalues of matrix M Rd d by λ1(M) and λd(M), respectively. In the classic contextual bandit (Li et al., 2010), the agent observes a context x X, where X is a context set; takes an action a A, where A is an action set; and observes a stochastic reward Y P( | x, a; θ), where P( | x, a; θ) is the reward distribution parameterized by a model parameter θ Θ. We denote the mean reward of action a in context x under parameter θ by r(x, a; θ) = EY P ( |x,a;θ) [Y ] and assume that the rewards are σ2-sub-Gaussian. 2.1. Multi-Task Bandit In this paper, we simultaneously solve m similar contextual bandit instances, which we call tasks. Therefore, our problem is a multi-task contextual bandit (Azar et al., 2013; Deshmukh et al., 2017; Cella et al., 2020; Kveton et al., 2021; Moradipari et al., 2022). The set of all tasks is denoted by S and we index the tasks by s S. The reward distribution in task s is parameterized by a task parameter θs, Θ, which is sampled i.i.d. from a task prior distribution θs, P( | µ ). The task prior is parameterized by an unknown hyper-parameter µ , which is sampled from a hyper-prior Q. The hyper-prior represents the agent s prior knowledge about µ . In a recommender system, the task could be a user, the task parameter could be their preferences, and the hyper-parameter could be the preferences of an average user. We use this example in our experiments in Section 7. A similar setup was studied in the online setting by Hong et al. (2022b). Unlike prior works in multi-task bandits, we focus on the offline setting where we learn task-specific policies from logged data. One distinguishing characteristic of our setting is that each task has its own parameter θs, , and thus may require a different policy. As a result, we learn a separate policy πs : X A for each task s. To simplify notation, we focus on deterministic policies. 
Our results can be extended to stochastic policies by accounting for an additional expectation over actions. We denote the set of stationary deterministic policies by Π = {π : X A}. We learn the policies from a logged dataset of size n. The dataset is given by D = {(St, Xt, At, Yt)}t [n], where St is a task, Xt Px is a context, At π0(Xt) is an action, and Yt P( | Xt, At; θSt, ) is a reward in interaction t. Figure 1. A graphical model of a multi-task contextual bandit. Here Px is a context distribution and π0 Π is a logging policy. To simplify notation, we assume that π0 is the same for all tasks. It can be stochastic. A graphical model of our setting, which shows dependencies among all variables in our problem, is in Figure 1. We do not assume that π0 is known, although this assumption is common (Dudik et al., 2014; Swaminathan & Joachims, 2015). 2.2. Objective The value of policy πs Π in task s S with parameter θs, is defined as V (πs; θs, ) = E [r(X, πs(X); θs, )] , where the randomness is only over context X Px. The optimal policy πs, is defined as πs, = arg max π Π V (π; θs, ) . Let ˆπs Π be some estimated optimal policy from logged dataset D. A standard approach in off-policy optimization is to derive an (ε, δ) bound V (ˆπs; θs, ) V (πs, ; θs, ) ε , (1) which holds with probability at least 1 δ for a specified maximum error ε (Strehl et al., 2010; Li et al., 2018). The bound says that the policy ˆπs is at most ε worse than the optimum πs, with a high probability. The error ε depends on δ, D, ˆπs, and problem hardness. Such bounds can be derived using concentration inequalities for sub-Gaussian rewards (Boucheron et al., 2013), under the assumptions that the parameter θs, is fixed and bounded. We call this setting frequentist. In our work, we study a Bayesian setting, where the prior distribution of θs, and dataset D allow the agent to derive the posterior distribution of θs, , ˆPs(θ) = P (θs, = θ | D). To model that θs, is random, and that the prior and dataset D provide additional information about θs, , it is natural to guarantee (1) in expectation over the posterior of θs, . We formalize this as an (ε, δ) bound P (V (ˆπs; θs, ) V (πs, ; θs, ) ε | D) 1 δ , (2) Multi-Task Off-Policy Learning from Bandit Feedback where ε is a specified maximum error that depends on δ, D, ˆπs, and problem hardness. The main difference from (1) is that θs, , and thus also πs, , are random. The Bayesian view allows us to derive (ε, δ) error bounds with two new properties. First, the error ε decreases with a more informative prior on θs, (Section 4). In frequentist bounds, the prior plays the role of a regularizer, unrelated to the estimated model parameter, and thus cannot capture this effect. Second, we show that the hierarchical model in Figure 1 can improve statistical efficiency in multi-task bandits (Section 5). Our bounds and analyses are motivated by Bayes regret bounds in bandits (Russo & Van Roy, 2014; Lu & Van Roy, 2019; Kveton et al., 2021; Atsidakou et al., 2022; Hong et al., 2022b;a), which have similar properties and improve upon their frequentist counterparts similarly (Abbasi-Yadkori et al., 2011; Agrawal & Goyal, 2013). 3. Algorithm Prior works in off-policy bandit and reinforcement learning often design pessimistic lower confidence bounds and then act according to them (Jin et al., 2021). We adopt the same design principle. Our LCBs satisfy Ls(x, a) r(x, a; θs, ) with a high probability for task parameter θs, | D, jointly over all tasks s, contexts x, and actions a. 
Specifically, we define them as Ls(x, a) = ˆrs(x, a) cs(x, a), where ˆrs(x, a) = E [r(x, a; θs, ) | D] , cs(x, a) = α q var [r(x, a; θs, ) | D] , (3) are the estimated mean reward and its confidence interval width, respectively; and α > 0 is a tunable parameter. Linear models are an important class of contextual bandit models (Dani et al., 2008; Abbasi-Yadkori et al., 2011; Li et al., 2010) and we also consider them here. Specifically, we assume that r(x, a; θs, ) = ϕ(x, a) θs, for each task s, where θs, is the task parameter and ϕ : X A Rd is some feature extractor. Under this assumption, we can write (3) using the posterior mean and covariance of θs, as ˆrs(x, a) = ϕ(x, a) E [θs, | D] , cs(x, a) = α q ϕ(x, a) cov [θs, | D] ϕ(x, a) . (4) The above decomposition is desirable because it separates the posterior of the task parameter from context. The rest of this section is organized as follows. We derive the mean reward estimate and its confidence interval width for a general model in Section 3.1. We instantiate these in a linear Gaussian model in Section 3.2 and then discuss the resulting algorithm in Section 3.3. Alternative algorithm designs are discussed in Section 3.4. 3.1. Hierarchical Pessimism In any task s, the mean E [θs, | D] in (4) can be estimated hierarchically as follows. Let Ds be the subset of dataset D corresponding to task s, where St = s. Recall that µ is the hyper-parameter in Figure 1. Then, by the law of total expectation, E [θs, | D] = E [E [θs, | µ , D] | D] (5) = E [E [θs, | µ , Ds] | D] . The second equality holds since conditioning on µ makes θs, independent of D\Ds. Our decomposition is motivated by the observation that estimating each E [θs, | µ , Ds] is an easier problem than estimating E [θs, | D], since all observations in Ds are from a single task s. The information sharing between the tasks is still captured by µ , which is learned from the entire logged dataset D. Similarly, the covariance cov [θs, | D] in (4) can be decomposed using the law of total covariance, cov [θs, | D] (6) = E [cov [θs, | µ , D] | D] + cov [E [θs, | µ , D] | D] = E [cov [θs, | µ , Ds] | D] + cov [E [θs, | µ , Ds] | D] . Again, the second equality holds since conditioning on µ makes θs, independent of D \ Ds. Note that (6) comprises two interpretable terms. The first captures the uncertainty of θs, conditioned on µ , whereas the second captures the uncertainty in µ . This decomposes two sources of uncertainty in our problem, and is a powerful tool for structured uncertainty estimation (Hong et al., 2022a). Now we plug (5) and (6) into (4), and get ˆrs(x, a) = ϕ(x, a) E [E [θs, | µ , Ds] | D] , cs(x, a) = α q ϕ(x, a) ˆΣsϕ(x, a) , where ˆΣs = E [cov [θs, | µ , Ds] | D] + cov [E [θs, | µ , Ds] | D] . With this in mind, we propose a general algorithm for hierarchical off-policy optimization, which we call Hier OPO. Its pseudo-code is showed in Algorithm 1. 3.2. Hierarchical Gaussian Pessimism The computation of (5) and (6) requires integrating out the hyper-parameter µ and task parameter θs, . This is generally impossible in a closed form, although many powerful approximations exist (Doucet et al., 2001). In this section, we leverage the conjugacy of a Gaussian hyper-prior, task priors, and reward distributions to obtain closed-form estimates of all model parameters. In this case, Hier OPO can Multi-Task Off-Policy Learning from Bandit Feedback Algorithm 1 Hier OPO: Hierarchical off-policy optimization. 
1: Input: Dataset D 2: for s S, x X do 3: for a A do 4: Compute ˆrs(x, a) and cs(x, a) (Section 3.1) 5: Ls(x, a) ˆrs(x, a) cs(x, a) 6: ˆπs(x) arg max a A Ls(x, a) 7: Output: ˆπ (ˆπs)s S be implemented exactly and efficiently. We also analyze it under these assumptions (Section 5). In particular, we consider a linear Gaussian model with the hyper-prior Q = N(µq, Σq) and the task prior P( | µ ) = N( ; µ , Σ0). The mean vector µq Rd, as well as both the covariance matrices Σq, Σ0 Rd d, are assumed to be known. The reward distribution of action a in context x is N(ϕ(x, a) θs, , σ2), where ϕ is a feature extractor and σ > 0 is a known reward noise. It follows that the mean reward is linear in features. To derive (5) and (6), we start with understanding posterior distributions of θs, and µ . Specifically, conditioning in Gaussian graphical models preserves Gaussianity, and thus θs, | µ , Ds N( µs, Σs) for some µs and Σs. Using the structure of our model (Figure 1), we further note that this is a standard linear model posterior with a Gaussian prior N(µ , Σ0), where µs = E [θs, | µ , Ds] = Σs(Σ 1 0 µ + Bs) , Σs = cov [θs, | µ , Ds] = (Σ 1 0 + Gs) 1 , (7) and the statistics Bs = σ 2 n X t=1 1{St = s} ϕ(Xt, At)Yt , Gs = σ 2 n X t=1 1{St = s} ϕ(Xt, At)ϕ(Xt, At) , are computed using the subset Ds of the logged dataset D. The posterior of the hyper-parameter µ | D, known as the hyper-posterior, also has a closed-form N( µ, Σ) (Section 4.2 of Hong et al. 2022b), where µ = E [µ | D] = Σ Σ 1 q µq + X s S (Σ0 + G 1 s ) 1G 1 s Bs , Σ = cov [µ | D] = Σ 1 q + X s S (Σ0 + G 1 s ) 1 1 . One way of interpreting (9) is as a multivariate Gaussian posterior where each task is a single observation. The observation of task s is the least-squares estimate of θs, from task s, given by G 1 s Bs, with covariance Σ0 + G 1 s . The tasks with more observations affect the estimate µ more, since their G 1 s approaches a zero matrix, and as a result Σ0 + G 1 s Σ0. This uncertainty is intrinsic, since even θs, is a noisy observation of µ . To complete our derivations, we only need to substitute (7) and (9) into (5) and (6). The posterior mean of θs, is E [E [θs, | µ , Ds] | D] = E h Σs(Σ 1 0 µ + Bs) D i = Σs(Σ 1 0 E [µ | D] + Bs) = Σs(Σ 1 0 µ + Bs) , where we simply combine (7) and (9). Similarly, the posterior covariance of θs, requires computing E [cov [θs, | µ , Ds] | D] = E h Σs D i = Σs , cov [E [θs, | µ , Ds] | D] = cov h Σs(Σ 1 0 µ + Bs) D i = cov h ΣsΣ 1 0 µ D i = ΣsΣ 1 0 ΣΣ 1 0 Σs . Finally, the estimated mean reward and its confidence interval width are given by ˆrs(x, a) = ϕ(x, a) Σs(Σ 1 0 µ + Bs) , cs(x, a) = α q ϕ(x, a) ˆΣsϕ(x, a) , (10) where ˆΣs = Σs + ΣsΣ 1 0 ΣΣ 1 0 Σs. The posterior covariance ˆΣs is tractable and has the following desirable properties. First, the hyper-parameter uncertainty only affects the second term through Σ. Moreover, since Σs appears in both terms, ˆΣs decreases with more observations from task s. 3.3. Gaussian Hier OPO Now we plug the estimated mean reward and its confidence interval width in (10) into Hier OPO. The resulting method is a form of hierarchical regression of the hyper-parameter and task parameters. The task parameter θs, is estimated using Σs(Σ 1 0 µ + Bs) in (10) and the hyper-parameter µ is estimated using µ in (9). Since Hier OPO is a variant of hierarchical regression, it is fairly general and we expect it to perform well beyond our assumptions in the algorithm design, such as when the reward noise is sub-Gaussian. 
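The closed forms above reduce Hier OPO to a few lines of linear algebra. The sketch below is a hedged reference implementation of the Gaussian case, not the authors' code: it accumulates the per-task statistics B_s and G_s in (8), forms the hyper-posterior in (9), and returns the LCB in (10). Tasks are indexed by integers and each G_s is assumed invertible (otherwise the algebraically equivalent form (G_s Σ0 + I)^{-1} can be used).

```python
import numpy as np

def hier_opo_lcbs(data, num_tasks, d, mu_q, Sigma_q, Sigma_0, sigma, alpha):
    """Closed-form Hier OPO LCBs for the linear Gaussian model of Section 3.2.

    `data` is a list of logged tuples (task s, feature phi(x, a), reward y),
    with tasks indexed 0..num_tasks-1. Assumes each per-task Gram matrix G_s
    is invertible, i.e., every task has enough logged data.
    """
    inv = np.linalg.inv

    # Per-task sufficient statistics B_s and G_s from the logged data (Eq. 8).
    B = np.zeros((num_tasks, d))
    G = np.zeros((num_tasks, d, d))
    for s, phi, y in data:
        B[s] += phi * y / sigma**2
        G[s] += np.outer(phi, phi) / sigma**2

    # Hyper-posterior N(mu_bar, Sigma_bar) of mu_* given all tasks (Eq. 9).
    prec = inv(Sigma_q)
    mean_term = prec @ mu_q
    for s in range(num_tasks):
        M = inv(Sigma_0 + inv(G[s]))             # (Sigma_0 + G_s^{-1})^{-1}
        prec += M
        mean_term += M @ inv(G[s]) @ B[s]
    Sigma_bar = inv(prec)
    mu_bar = Sigma_bar @ mean_term

    # Marginal posterior of theta_s and the resulting LCB (Eqs. 7 and 10).
    def lcb(s, phi):
        Sigma_tilde = inv(inv(Sigma_0) + G[s])                  # cov[theta_s | mu_*, D_s]
        mean_s = Sigma_tilde @ (inv(Sigma_0) @ mu_bar + B[s])   # E[theta_s | D]
        Sigma_hat = (Sigma_tilde
                     + Sigma_tilde @ inv(Sigma_0) @ Sigma_bar @ inv(Sigma_0) @ Sigma_tilde)
        return phi @ mean_s - alpha * np.sqrt(phi @ Sigma_hat @ phi)

    return lcb
```

The pessimistic policy then takes π̂_s(x) = arg max_a L_s(x, a) over the candidate feature vectors, exactly as in Algorithm 1.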
To simplify the presentation of Hier OPO, we assume that X and A are finite. However, since Hier OPO is based on a hierarchical regression of the task parameter in (10) and the hyper-parameter in (9), this assumption is not necessary. In fact, the mean reward estimate and its confidence interval at any feature vector ϕ(x, a) can be computed as described in (10). Our error bounds in Section 5 are also independent of Multi-Task Off-Policy Learning from Bandit Feedback the number of contexts or actions. Finally, when the action space cannot be enumerated, it may be hard to compute the most pessimistic action ˆπs(x) = arg maxa A Ls(x, a) in context x. In this case, our algorithm and analysis can be extended to any maximization oracle with a fixed approximation ratio. The computations in (10) rely on matrix inversions, whose computational cost is O(d3) for d features. Note that this is only needed for the exact implementation. Approximate inference, which trades off the computational cost for accuracy (Doucet et al., 2001), is possible. Any approach for Gaussian graphical models would apply in our setting. 3.4. Alternative Designs A natural question to ask is how the hierarchy helps with improving pessimistic reward estimates. To answer it, we compare Hier OPO to two baselines, based on pessimistic least-squares estimators (Li et al., 2022) that do not model our structure. The first one is unrealistic because it assumes that µ is known. We call it Oracle OPO. Here the posterior mean reward and its confidence interval width are ˆrs(x, a) = ϕ(x, a) Σs(Σ 1 0 µ + Bs) , cs(x, a) = α q ϕ(x, a) Σsϕ(x, a) . (11) This is an improvement of (10) in two aspects. First, the estimate µ of µ is replaced with the actual µ . Second, the confidence interval width is provably narrower because Σs Σs + ΣsΣ 1 0 ΣΣ 1 0 Σs . The second method does not model the hyper-parameter µ . Instead, its uncertainty is incorporated into that of modeled θs, . This is achieved by replacing Σ0 in (11) with Σq +Σ0, and µ with µq. We call the method Flat OPO. Its posterior mean reward and confidence interval width are ˆrs(x, a) = ϕ(x, a) Σs((Σq + Σ0) 1µq + Bs) , cs(x, a) = α q ϕ(x, a) Σsϕ(x, a) , Σs = ((Σq + Σ0) 1 + Gs) 1 . In comparison to (10), this method is worse in two aspects. First, the prior mean µq of µ is used instead of its estimate µ. Second, as the number of tasks m increases, we expect λ1( Σ) 0 and then Σs ΣsΣ 1 0 ΣΣ 1 0 Σs + Σs . Therefore, our approach should be more statistically efficient, which we prove formally in Section 5. Finally, note that optimistic methods, such as posterior sampling (Thompson, 1933; Russo et al., 2018) and Bayes UCB (Kaufmann et al., 2012), cannot be used in our setting. In fact, optimism is harmful because it leads to taking highly uncertain actions whose uncertainty is not reduced, since there are no additional observations from the environment. In this case, pessimism and robustness are desired, and these are our main design principles. 4. Single-Task Analysis To illustrate Bayesian error bounds, we start with a classic contextual bandit parameterized by θ Rd. The mean reward of action a A in context x X under parameter θ Rd is r(x, a; θ) = ϕ(x, a) θ. We assume that θ N(θ0, Σ0) and the reward noise is N(0, σ2). This model is identical to a single task in Section 3.2. The logged dataset is D = {(Xt, At, Yt)}n t=1, the LCB is L(x, a) = ˆr(x, a) c(x, a), and the pessimistic policy is ˆπ(x) = arg max a A L(x, a). 
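As a minimal illustration of this single-task pessimistic rule, the sketch below computes the Gaussian posterior of θ* from the logged features and rewards and acts greedily with respect to the LCB; the closed forms it uses are stated formally in (12) next, and the naming is ours.

```python
import numpy as np

def single_task_pessimistic_policy(Phi, Y, theta_0, Sigma_0, sigma, alpha):
    """Pessimistic single-task rule for the Gaussian model of Section 4.

    Phi is the n x d matrix of logged features phi(X_t, A_t) and Y the rewards.
    Returns a function mapping the candidate feature vectors of a context
    (one row per action) to the index of the action with the largest LCB.
    """
    G = Phi.T @ Phi / sigma**2                          # Gram matrix of logged actions
    B = Phi.T @ Y / sigma**2
    Sigma_hat = np.linalg.inv(np.linalg.inv(Sigma_0) + G)
    theta_hat = Sigma_hat @ (np.linalg.inv(Sigma_0) @ theta_0 + B)

    def act(candidate_features):
        mean = candidate_features @ theta_hat
        width = np.sqrt(np.einsum("ad,de,ae->a",
                                  candidate_features, Sigma_hat, candidate_features))
        return int(np.argmax(mean - alpha * width))     # arg max of the LCB
    return act
```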
Following the same reasoning as in the derivation of (7), the estimated mean reward and its confidence interval width are ˆr(x, a) = ϕ(x, a) ˆΣ(Σ 1 0 θ0 + B) , c(x, a) = α q ϕ(x, a) ˆΣϕ(x, a) , B = σ 2 n X t=1 ϕ(Xt, At)Yt , ˆΣ = (Σ 1 0 + G) 1 , G = σ 2 n X t=1 ϕ(Xt, At)ϕ(Xt, At) . As in Section 2, the value of policy π Π under model parameter θ is V (π; θ ) = E [r(X, π(X); θ )]. The optimal policy is π = arg max π Π V (π; θ ). The quality of policy ˆπ is measured by an (ε, δ) bound P (V (ˆπ; θ ) V (π ; θ ) ε | D) 1 δ . (13) A better policy has a lower ε > 0 at a fixed δ > 0. Now we are ready to proceed with the analysis. 4.1. Bayesian Error Bound We start with assumptions. First, we assume that the length of feature vectors is bounded. Assumption 1. For any context x X and action a A, the feature vector satisfies ϕ(x, a) 2 1. This assumption is standard and without loss of generality. Second, we assume that the dataset D is well-explored (Swaminathan et al., 2017; Jin et al., 2021). Multi-Task Off-Policy Learning from Bandit Feedback Assumption 2. Take G in (12) and let G = E ϕ(X, π (X))ϕ(X, π (X)) θ . Then there exists γ > 0 such that G γσ 2n G holds for any θ . Assumption 2 relates the logging policy π0, which induces G, to the optimal policy π , which induces σ 2n G . It can be loosely interpreted as follows. As n increases, G σ 2n E ϕ(X, π0(X))ϕ(X, π0(X)) θ , and γ becomes the maximum ratio between probabilities of taking actions by π and π0 in any direction. Therefore, the assumption measures the degree of overlap between π and π0. It also relates a finite-sample G to the expected G . Therefore, we do not need to reason about a finite-sample behavior of G in our analysis. Note that Assumption 2 holds for γ = 0. Unfortunately, this setting would negate the desired scaling with sample size n in our error bounds and is impractical. The higher the value of γ, the more π and π0 are similar. When π0 is uniform, we obtain γ = Ω(1/d) for large n. Now we state our main claim for the single-task setting. Theorem 1. Fix dataset D and choose any γ > 0 such that Assumption 2 holds. Let ˆπ(x) = arg max a A L(x, a). Then for any δ (0, e 1], the error in (13) is 5d log(1/δ) | {z } α 4d λd(Σ 1 0 ) + γσ 2n . Proof. The claim is proved in Appendix A in three steps. First, we establish that c(x, a) is a high-probability confidence interval width for α = p 5d log(1/δ). Second, we show that the suboptimality of policy ˆπ can be bounded by E [c(X, π (X))] for any parameter θ . Third, we combine c(x, a) with Assumption 2, and relate the logging policy π0 that induces c(x, a) with the optimal policy π . 4.2. Frequentist Error Bound To understand the benefit of a Bayesian analysis, we compare Theorem 1 to a frequentist bound. The main difference in the frequentist bound is that θ is fixed. Thus a natural counterpart of (13) is V (ˆπ; θ ) V (π ; θ ) ε , which holds with probability at least 1 δ for an unknown but fixed θ . Under the assumptions that θ 2 κ, and that (Yt)n t=1 are independent σ2-sub-Gaussian rewards, we get a similar bound to Theorem 1, which is stated in Theorem 5 (Appendix B). The main difference is that 2d(log(1/δ) + b) + κλ 1 2 d (Σ0) , (14) where Σ0 should be viewed as a regularization parameter instead of the prior covariance. Before we discuss α, we discuss two key differences in how the bounds are stated. First, the frequentist bound holds for any model parameter θ such that θ 2 κ. Therefore, it is arguably stronger than the Bayesian bound in Theorem 5. 
This is analogous to differences in Bayesian and frequentist cumulative regret bounds (Russo & Van Roy, 2014). Second, because both bounds depend on γ in Assumption 2, which depends on θ , we make an assumption that π0 is uniform. In this case, γ = Ω(1/d) for any θ . Therefore, γ has no impact on the next discussion and we may treat it as a constant. Under the above assumptions, the only major difference in the bounds is the term κλ 1 2 d (Σ0) in (14). This term can have a major effect. For instance, suppose that Σ0 = Id/n in (12). From a Bayesian viewpoint, this corresponds to a very informative prior with width p 1/n, and the Bayesian bound in Theorem 1 is O(dn 1 2 ). From a frequentist point of view, this amounts to O(n) regularization, and the frequentist bound becomes O(dn 1 2 + d 1 2 ). As n , we get that the Bayesian bound can be arbitrarily better. 5. Multi-Task Analysis Now we study our multi-task setting, where the estimated mean reward and its confidence interval width are defined in (10). Similarly to Section 4, our analysis is Bayesian. We derive an error bound for a single task and discuss how to extend it to other bounds, such as for all tasks, later. 5.1. Bayesian Error Bound To derive the bound in (2), we make similar assumptions to Section 4. First, we assume that the length of feature vectors is bounded (Assumption 1). Second, we assume that the dataset D is well-explored for all tasks. Assumption 3. Take Gs in (8) and let ns be the number of interactions with task s. Let Gs, = E ϕ(X, πs, (X))ϕ(X, πs, (X)) θs, . Then there exists γ > 0 such that Gs γσ 2ns Gs, holds for any θs, in any task s S. This is essentially Assumption 2 applied to all tasks. For a uniform logging policy, γ = Ω(1/d) when ns is large for all tasks s S. So the assumption is not very strong. Our main technical result is presented below. Theorem 2. Fix dataset D and choose any γ > 0 such that Assumption 3 holds. Take policy ˆπ computed by Hier OPO and let α = p 5d log(1/δ). Then for any δ (0, e 1], the Multi-Task Off-Policy Learning from Bandit Feedback error in (2) is 4d λd(Σ 1 0 ) + γσ 2ns | {z } Task term v u u t 4d λd(Σ 1 q ) + P z S 1 λ1(Σ0)+γ 1σ2λ1(G 1 z, )n 1 z | {z } Hyper-parameter term Moreover, suppose that ϕ(x, a) has at most one non-zero entry for any x X and a A, and that both Σq and Σ0 are diagonal. Then λ1(G 1 z, ) 1. Proof. The claim is proved in Appendix C, following the same three steps as in the proof of Theorem 1. The main difference is in the definitions of c(x, a) and policies, and that Assumption 3 is used instead of Assumption 2. This shows the generality of our proof technique and indicates that it may apply to other graphical models. 5.2. Discussion Our error bound is Bayesian and proved for a distribution of the task parameter θs, | D. The bound has two terms. The former captures the error in estimating the task parameter θs, if the hyper-parameter µ was known. It is similar to Theorem 1 and hence we call it the task term. The latter term captures the error in estimating µ and hence we call it the hyper-parameter term. We discuss each term next. The task term depends on all quantities of interest in an expected manner. It is O(d p log(1/δ)/ns), where d is the number of features, ns is the number of observations, and δ is the probability that the bound fails. This dependence is standard in confidence intervals for linear models with an infinite number of contexts (Abbasi-Yadkori et al., 2011; Agrawal & Goyal, 2013; Abeille & Lazaric, 2017). 
Since λd(Σ 1 0 ) can be viewed as the minimum number of prior pseudo-observations in any direction in Rd, the task term decreases with a more informative prior. Finally, the task term decreases when the observation noise σ decreases, and the similarity of the logging and optimal policies γ increases (Assumption 3). The hyper-parameter term mimics scaling of the task term at the hyper-parameter level. In particular, the minimum number of prior pseudo-observations in any direction in Rd becomes λd(Σ 1 q ) and each task becomes an observation, which is reflected by the sum over all tasks z. The hyperparameter term decreases as the number of observations nz in any task z increases, the maximum width of the task prior p λ1(Σ0) decreases, reward noise σ decreases, and the similarity of logging and optimal policies γ increases. To show that our bound captures the problem structure, we compare it to two baselines in Section 3.4: Oracle OPO and Flat OPO. Oracle OPO knows µ and has more information than Hier OPO. Its error can be bounded by Theorem 1 and is always lower than that of Hier OPO, because the bound corresponds to the first term in Theorem 2. On the other hand, Flat OPO solves each task independently. Its error can be bounded using Theorem 1 where the task covariance Σ0 is replaced by Σq + Σ0, to account for the additional uncertainty due to unknown µ . The resulting error bound becomes 4d λd((Σq + Σ0) 1) + γσ 2ns and is always higher than the task term in Theorem 2. As the number of tasks increases, the hyper-parameter term in Theorem 2 goes to zero, and the error bound of Hier OPO would be provably lower. Finally, we would like to comment on the second claim in Theorem 2, which results in a tighter bound. This claim is proved under additional assumptions that are satisfied by a multi-armed bandit, for instance. 5.3. Extensions The error bound in Theorem 2 is derived for a fixed task s S. This decision was taken deliberately because other bounds can be easily derived from it. For instance, to get a bound for all tasks, we only need a union bound over all θs, | D. As a result, Theorem 2 holds for all s S with probability at least 1 mδ. The bound in Theorem 2 also holds for a new task sampled from the task prior. This is because the hyper-parameter estimation in (9), which affects the hyper-parameter term in Theorem 2, separates all tasks from the evaluated one. 6. Related Work Off-policy optimization. In off-policy optimization, data collected by a deployed policy are used to learn improved policies offline (Li et al., 2010), without a direct interaction with the environment. Off-policy estimation and optimization can be model-free or model-based. Many modelfree methods are based on inverse propensity scores (IPS) (Horvitz & Thompson, 1952; Ionides, 2008; Strehl et al., 2010; Swaminathan & Joachims, 2015). These methods have a low bias and high variance, unless corrected. Modelbased methods estimate a reward model for context-action pairs, which is then used to find an optimal policy (Bottou et al., 2013; Dudik et al., 2014). These approaches tend to have a high bias and low variance. Doubly-robust estimation (Robins et al., 1994; Dudik et al., 2014) is used frequently to combine model-based and model-free methods. 
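For readers unfamiliar with these estimators, the following generic sketch contrasts the three families on a logged dataset with known propensities. It illustrates standard IPS, direct-method, and doubly-robust value estimates for a deterministic target policy; it is not part of Hier OPO, and the names and signatures are ours.

```python
def value_estimates(logged, pi, r_hat):
    """Standard off-policy value estimates for a deterministic target policy pi.

    `logged` is a list of tuples (x, a, y, p0), where p0 = pi_0(a | x) is the
    propensity of the logged action, and r_hat(x, a) is a learned reward model.
    """
    n = len(logged)
    ips = sum((pi(x) == a) / p0 * y for x, a, y, p0 in logged) / n    # low bias, high variance
    dm = sum(r_hat(x, pi(x)) for x, a, y, p0 in logged) / n           # model-based: higher bias, low variance
    dr = sum(r_hat(x, pi(x)) + (pi(x) == a) / p0 * (y - r_hat(x, a))  # doubly-robust combination
             for x, a, y, p0 in logged) / n
    return ips, dm, dr
```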
We take a model-based approach in this work.

[Figure 2: three panels, Linear Bandit (σq = 0.5, m = 10, d = 5, K = 10), Linear Bandit (σq = 1.0, m = 10, d = 5, K = 10), and Linear Bandit (σq = 1.0, n = 500, d = 5, K = 10); x-axis: dataset size n (left, middle) or number of tasks m (right); y-axis: suboptimality; curves: Hier OPO, Flat OPO, Oracle OPO.]
Figure 2. Evaluation of off-policy algorithms on the synthetic multi-task bandit problem. In the left and middle plots, we vary the dataset size n for small σq = 0.5 and large σq = 1.0, respectively. In the right plot, we vary the number of tasks m.

Offline reinforcement learning. Pessimism has been studied extensively in offline reinforcement learning (Buckman et al., 2021; Jin et al., 2021). Specifically, Jin et al. (2021) showed that pessimistic value iteration is minimax optimal in linear Markov decision processes (MDPs). Multi-task offline reinforcement learning was also studied by Lazaric & Ghavamzadeh (2010). That paper applied expectation-maximization to solve the problem but did not prove any error bounds. In comparison, we consider the simpler setting of contextual bandits and prove error bounds that show improvements due to using the multi-task structure. These are the first error bounds of their kind.

Online learning. Off-policy methods learn from data collected by another policy. In contrast, online methods learn from the data that they collect, and need to balance exploration and exploitation to attain low regret in the long term. Two popular exploration techniques are upper confidence bounds (UCBs) (Auer et al., 2002) and posterior sampling (Thompson, 1933), and both have been studied extensively in linear models (Dani et al., 2008; Abbasi-Yadkori et al., 2011; Chu et al., 2011; Agrawal & Goyal, 2013). Bandit algorithms for hierarchical models have also been studied extensively (Bastani et al., 2019; Kveton et al., 2021; Basu et al., 2021; Simchowitz et al., 2021; Wan et al., 2021; Hong et al., 2022b; Peleg et al., 2022; Wan et al., 2022). Perhaps surprisingly, all of these are based on posterior sampling. Our marginal posterior derivations in (10) can be used to derive UCBs for this setting. Specifically, Us(x, a) = r̂s(x, a) + cs(x, a) is an upper confidence bound on the mean reward of action a in context x and task s.

7. Experiments
In this section, we empirically compare Hier OPO to the baselines Oracle OPO and Flat OPO (Section 3.4). All methods are implemented as described in Section 3. We set α = 0.1, which led to good performance in our initial experiments. The goal of our experiments is to show that the hierarchy can greatly improve the statistical efficiency of off-policy algorithms. We include an additional experiment in Appendix D, where Hier OPO is applied to a challenging image classification task with deep neural networks.

7.1. Synthetic Problem
Our first experiment is with a synthetic multi-task bandit with d = 5 features and K = 10 actions. For each action a ∈ A and interaction t ∈ [n], we sample a feature vector uniformly at random from [−0.5, 0.5]^d. The hierarchy is defined as follows. The hyper-prior is N(0_d, Σq), where Σq = σq² I_d is its covariance. The task covariance is Σ0 = σ0² I_d. We experiment with σq ∈ {0.5, 1} and σ0 = 0.5.
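The following short sketch instantiates one such synthetic problem with the parameters just listed; the reward noise, logging protocol, and evaluation criterion are described in the next paragraphs, and the function structure is ours.

```python
import numpy as np

def make_synthetic_problem(rng, m=10, d=5, K=10, sigma_q=1.0, sigma_0=0.5):
    """One instance of the synthetic multi-task linear bandit of Section 7.1."""
    mu_star = rng.multivariate_normal(np.zeros(d), sigma_q**2 * np.eye(d))      # mu_* ~ N(0, Sigma_q)
    thetas = rng.multivariate_normal(mu_star, sigma_0**2 * np.eye(d), size=m)   # theta_s ~ N(mu_*, Sigma_0)

    def sample_context():
        # K candidate actions, each with features drawn uniformly from [-0.5, 0.5]^d.
        return rng.uniform(-0.5, 0.5, size=(K, d))

    return thetas, sample_context
```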
We expect the hierarchy to be more beneficial when σq > σ0, since the uncertainty of the hyper-parameter is higher and it is more valuable to learn it. The reward distribution of task s is N(ϕ(x, a) θs, , σ2) with noise σ = 0.5. Our results are averaged over multiple runs. At the beginning of each run, the hyper-parameter µ is sampled from the hyper-prior N(0d, Σq). After that, each task parameter is sampled i.i.d. as θs, N(µ , Σ0). The logged dataset D is generated as follows. In each interaction t [n], we randomly select a task, take a random action, and record its stochastic reward. The learned policies ˆπs are evaluated on the same problem that generated D. The evaluation criterion is the suboptimality of learned policies averaged over the tasks, which we define as 1 m s S V (πs, ; θs, ) V (ˆπs; θs, ) . The policy values and the optimal policy are estimated on another logged dataset of size 10 000. In our experiments, we vary either the logged dataset size n or the number of tasks m, while keeping the other fixed. The default settings are m = 10 tasks and logged dataset size n = 500. In Figure 2, we show the mean and standard error of the suboptimality of each algorithm averaged over 30 random runs. As expected, Hier OPO outperforms Flat OPO and is close to Oracle OPO. The improvement is higher when the hyper-parameter uncertainty σq is higher. The difference between Hier OPO and Flat OPO is the most noticeable in the limited data regime, where n is small or m is large. In both cases, the number of observations per task is small. Multi-Task Off-Policy Learning from Bandit Feedback 1000 2000 3000 4000 5000 Dataset Size n Suboptimality Movie Lens Bandit (m = 100, d = 10, K = 30) Hier OPO Flat OPO Oracle OPO Figure 3. Evaluation of off-policy algorithms on a multi-user recommendation problem in Section 7.2. 7.2. Multi-User Recommendation Now we consider a multi-user recommendation problem. The problem is simulated using the Movie Lens 1M dataset (Lam & Herlocker, 2016), with 1 million ratings for 3 883 movies from 6 040 users. As a first step, we complete the sparse rating matrix M using alternating least squares (Davenport & Romberg, 2016) with rank d = 10. This rank is sufficiently high to have a low prediction error, but also low enough to prevent overfitting. The learned factorization is M = UV . The i-th row of U, denoted by Ui, represents user i. The j-th row of V , denoted by Vj, represents movie j. The reward for recommending movie j to user i is sampled from N(V j Ui, σ2). The reward noise σ = 0.759 is estimated from data. The feature vectors in each interaction are latent factors Vj of 30 randomly chosen movies. To simulate similar users, we cluster user latent factors. In particular, we apply a Gaussian mixture model (GMM) with k = 7 clusters to rows of U. We choose the smallest value of k that yields well-separated clusters (Bishop, 2006). The hyper-prior parameters µq and Σq are set to the mean and covariance of all cluster centers, respectively. The cluster with most users represents tasks. We set µ and Σ0 to its center and covariance estimated by the GMM, respectively. Thus all users in the cluster are related through µ and Σ0. We want to stress that the GMM is only used to estimate the hyper-parameters of the hierarchical model. The task parameters are user latent factors Ui. This is to ensure that our setup is as realistic as possible. The number of tasks is m = 100 and they are chosen randomly in each run. The dataset D is logged as follows. 
In each interaction t [n], we randomly select a task, take a random action, and record its stochastic reward. The evaluation criteria are the same as in Section 7.1. In Figure 3, we report the mean and standard error of the suboptimality of all algorithms averaged over 10 random runs. For all dataset sizes n, Hier OPO performs very well: its suboptimality is close to that of Oracle OPO and significantly lower than that of Flat OPO. This clearly shows the benefit of hierarchies for efficient off-policy learning. Also note that the hierarchy in this experiment is estimated from data. Therefore, it is misspecified; yet hugely beneficial. 8. Conclusions In this work, we propose hierarchical off-policy optimization (Hier OPO), a general off-policy algorithm for solving similar contextual bandit tasks related through a hierarchy. Our algorithm leverages the hierarchical structure to learn tighter, and hence more sample efficient, lower confidence bounds on the mean rewards of actions and acts according to them. We derive Bayesian error bounds for our policies, which become tighter with a more informative prior and demonstrate the benefit of hierarchies. Finally, we empirically validate the effectiveness of hierarchies on synthetic and real-world problems. Our work is the first to propose a practical and analyzable algorithm for off-policy learning with hierarchical Bayesian models. As a result, there are many potential directions for future work. First, some applications require more complex graphical models than a two-level hierarchy with a single hyper-parameter (Hong et al., 2022a; Aouali et al., 2023). We believe that our methodology directly extends to these. Second, we believe that Hier OPO and its analysis can be extended to reinforcement learning (Lazaric & Ghavamzadeh, 2010). Third, Hier OPO is a model-based approach to offpolicy learning (Section 6). Since model-based approaches tend to be biased, due to using a potentially misspecified model, it is important to develop model-free methods for multi-task off-policy learning. Finally, our theory shows benefits of a Bayesian analysis over a frequentist one (Section 4), and also benefits of hierarchies (Section 5). We are not aware of matching lower bounds for our setting and these are not common in offline learning, unlike in bandits (Lattimore & Szepesvari, 2019). This leaves open the possibility that our bounds are loose. We believe that this is highly unlikely, since the bounds are derived using exact Gaussian posteriors. Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pp. 2312 2320, 2011. Abeille, M. and Lazaric, A. Linear Thompson sampling revisited. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017. Agrawal, S. and Goyal, N. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceeding of the 25th Annual Conference on Learning Theory, pp. 39.1 39.26, 2012. Multi-Task Off-Policy Learning from Bandit Feedback Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, pp. 127 135, 2013. Aouali, I., Kveton, B., and Katariya, S. Mixed-effect Thompson sampling. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, 2023. Atsidakou, A., Katariya, S., Sanghavi, S., and Kveton, B. Bayesian fixed-budget best-arm identification. 
Co RR, abs/2211.08572, 2022. URL https://arxiv.org/ abs/2211.08572. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235 256, 2002. Azar, M. G., Lazaric, A., and Brunskill, E. Sequential transfer in multi-armed bandit with finite set of models. In Advances in Neural Information Processing Systems 26, pp. 2220 2228, 2013. Bastani, H., Simchi-Levi, D., and Zhu, R. Meta dynamic pricing: Transfer learning across experiments. Co RR, abs/1902.10918, 2019. URL https://arxiv.org/ abs/1902.10918. Basu, S., Kveton, B., Zaheer, M., and Szepesvari, C. No regrets for learning the prior in bandits. In Advances in Neural Information Processing Systems 34, 2021. Bishop, C. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006. Bottou, L., Peters, J., Quinonero-Candela, J., Charles, D., Chickering, M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(101):3207 3260, 2013. Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. Buckman, J., Gelada, C., and Bellemare, M. The importance of pessimism in fixed-dataset policy optimization. In Proceedings of the 9th International Conference on Learning Representations, 2021. Cella, L., Lazaric, A., and Pontil, M. Meta-learning with stochastic linear bandits. In Proceedings of the 37th International Conference on Machine Learning, 2020. Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pp. 2249 2257, 2012. Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 208 214, 2011. Dani, V., Hayes, T., and Kakade, S. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, pp. 355 366, 2008. Davenport, M. and Romberg, J. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4): 608 622, 2016. Deshmukh, A. A., Dogan, U., and Scott, C. Multi-task learning for contextual bandits. In Advances in Neural Information Processing Systems 30, pp. 4848 4856, 2017. Doucet, A., de Freitas, N., and Gordon, N. Sequential Monte Carlo Methods in Practice. Springer, New York, NY, 2001. Dudik, M., Erhan, D., Langford, J., and Li, L. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485 511, 2014. Garivier, A. and Cappe, O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceeding of the 24th Annual Conference on Learning Theory, pp. 359 376, 2011. Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. Bayesian Data Analysis. Chapman & Hall, 2013. Hong, J., Kveton, B., Katariya, S., Zaheer, M., and Ghavamzadeh, M. Deep hierarchy in bandits. In Proceedings of the 39th International Conference on Machine Learning, 2022a. Hong, J., Kveton, B., Zaheer, M., and Ghavamzadeh, M. Hierarchical Bayesian bandits. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, 2022b. Horvitz, D. G. and Thompson, D. J. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663 685, 1952. 
Ionides, E. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295 311, 2008. Jin, Y., Yang, Z., and Wang, Z. Is pessimism provably efficient for offline RL? In Proceedings of the 38th International Conference on Machine Learning, 2021. Multi-Task Off-Policy Learning from Bandit Feedback Kaufmann, E., Cappe, O., and Garivier, A. On Bayesian upper confidence bounds for bandit problems. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pp. 592 600, 2012. Kveton, B., Konobeev, M., Zaheer, M., Hsu, C.-W., Mladenov, M., Boutilier, C., and Szepesvari, C. Meta Thompson sampling. In Proceedings of the 38th International Conference on Machine Learning, 2021. Lake, B., Salakhutdinov, R., and Tenenbaum, J. Humanlevel concept learning through probabilistic program induction. Science, 350(6266):1332 1338, 2015. Lam, S. and Herlocker, J. Movie Lens Dataset. http://grouplens.org/datasets/movielens/, 2016. Lattimore, T. and Szepesvari, C. Bandit Algorithms. Cambridge University Press, 2019. Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302 1338, 2000. Lazaric, A. and Ghavamzadeh, M. Bayesian multi-task reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning, 2010. Li, G., Ma, C., and Srebro, N. Pessimism for offline linear contextual bandits using ℓp confidence sets. In Advances in Neural Information Processing Systems 35, 2022. Li, L., Chu, W., Langford, J., and Schapire, R. A contextualbandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 2010. Li, S., Abbasi-Yadkori, Y., Kveton, B., Muthukrishnan, S., Vinay, V., and Wen, Z. Offline evaluation of ranking policies with click models. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1685 1694, 2018. Lindley, D. and Smith, A. Bayes estimates for the linear model. Journal of the Royal Statistical Society: Series B (Methodological), 34(1):1 18, 1972. Lu, X. and Van Roy, B. Information-theoretic confidence bounds for reinforcement learning. In Advances in Neural Information Processing Systems 32, 2019. Moradipari, A., Turan, B., Abbasi-Yadkori, Y., Alizadeh, M., and Ghavamzadeh, M. Parameter and feature selection in stochastic linear bandits. In Proceedings of the 39th International Conference on Machine Learning, 2022. Peleg, A., Pearl, N., and Meirr, R. Metalearning linear bandits by prior update. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, 2022. Robins, J., Rotnitzky, A., and Zhao, L. P. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846 866, 1994. Russo, D. and Van Roy, B. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39 (4):1221 1243, 2014. Russo, D., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1 96, 2018. Simchowitz, M., Tosh, C., Krishnamurthy, A., Hsu, D., Lykouris, T., Dudik, M., and Schapire, R. Bayesian decisionmaking under misspecified priors with applications to meta-learning. In Advances in Neural Information Processing Systems 34, 2021. Strehl, A., Langford, J., Li, L., and Kakade, S. Learning from logged implicit exploration data. 
In Advances in Neural Information Processing Systems 23, 2010. Swaminathan, A. and Joachims, T. Counterfactual risk minimization: Learning from logged bandit feedback. In Proceedings of the 32nd International Conference on Machine Learning, pp. 814 823, 2015. Swaminathan, A., Krishnamurthy, A., Agarwal, A., Dudik, M., Langford, J., Jose, D., and Zitouni, I. Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems 30, 2017. Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933. Wan, R., Ge, L., and Song, R. Metadata-based multi-task bandits with Bayesian hierarchical models. In Advances in Neural Information Processing Systems 34, 2021. Wan, R., Ge, L., and Song, R. Towards scalable and robust structured bandits: A meta-learning framework. Co RR, abs/2202.13227, 2022. URL https://arxiv.org/ abs/2202.13227. Multi-Task Off-Policy Learning from Bandit Feedback A. Proof of Theorem 1 The proof is under the assumption that the logged dataset D is fixed and the model parameter θ is random. Specifically, since we conduct a Bayesian analysis, we condition on all available observations and have θ | D N(ˆθ, ˆΣ). We start with the concentration of the model parameter. To simplify notation, we define r(x, a) = r(x, a; θ ). Lemma 3. Let E = { x X, a A : |r(x, a) ˆr(x, a)| c(x, a)} be the event that all high-probability confidence intervals hold. Then P (E | D) 1 δ. Proof. We start with the Cauchy Schwarz inequality, r(x, a) ˆr(x, a) = ϕ(x, a)(θ ˆθ) = ϕ(x, a)ˆΣ 1 2 ˆΣ 1 2 (θ ˆθ) ϕ(x, a) ˆΣ θ ˆθ ˆΣ 1 . To prove our claim, we show that θ ˆθ ˆΣ 1 p 5d log(1/δ) holds conditioned on D with probability at least 1 δ. The proof uses that θ ˆθ | D N(0d, ˆΣ). Because of that, ˆΣ 1 2 (θ ˆθ) | D is a d-dimensional vector of independent standard normal variables. Thus (θ ˆθ) ˆΣ 1(θ ˆθ) | D is a chi-squared random variable with d degrees of freedom. Then, by Lemma 1 of Laurent & Massart (2000), P θ ˆθ ˆΣ 1 α D = P (θ ˆθ) ˆΣ 1(θ ˆθ) 5d log(1/δ) D P (θ ˆθ) ˆΣ 1(θ ˆθ) 2 p d log(1/δ) + 2 log(1/δ) + d D = P (θ ˆθ) ˆΣ 1(θ ˆθ) d 2 p d log(1/δ) + 2 log(1/δ) D δ . The first inequality holds under the assumption that δ < (0, e 1], which implies log(1/δ) 1 and log(1/δ) p log(1/δ). This completes our proof. We use Lemma 3 to bound the suboptimality of ˆπ in any context by the confidence interval width induced by π . Lemma 4. Let ˆπ(x) = arg max a A L(x, a). Then on event E (Lemma 3), r(x, π (x)) r(x, ˆπ(x)) 2c(x, π (x)) holds jointly for all contexts x X. Proof. For any context x X, we can decompose r(x, π (x)) r(x, ˆπ(x)) as r(x, π (x)) r(x, ˆπ(x)) = r(x, π (x)) L(x, ˆπ(x)) + L(x, ˆπ(x)) r(x, ˆπ(x)) [r(x, π (x)) L(x, π (x))] + [L(x, ˆπ(x)) r(x, ˆπ(x))] . Now we bound each term separately. On event E, we have r(x, π (x)) ˆr(x, π (x)) c(x, π (x)) and thus r(x, π (x)) L(x, π (x)) = r(x, π (x)) ˆr(x, π (x)) + c(x, π (x)) 2c(x, π (x)) . Again, on event E, we have ˆr(x, ˆπ(x)) r(x, ˆπ(x)) c(x, ˆπ(x)) and thus L(x, ˆπ(x)) r(x, ˆπ(x)) = ˆr(x, ˆπ(x)) r(x, ˆπ(x)) c(x, ˆπ(x)) 0 . Finally, we combine the above two inequalities and get r(x, π (x)) r(x, ˆπ(x)) 2c(x, π (x)) . This completes the proof. Multi-Task Off-Policy Learning from Bandit Feedback In the rest of the analysis, we fix θ and the only randomness is due to X Px. 
On event E in Lemma 3, we have V (π ; θ ) V (ˆπ; θ ) = E [r(X, π (X)) r(X, ˆπ(X))] 2E [c(X, π (X))] (15) 5d log(1/δ) E q ϕ(X, π (X)) ˆΣϕ(X, π (X)) 5d log(1/δ) E h ϕ(X, π (X)) ˆΣϕ(X, π (X)) i . The first inequality is by Lemma 4. The second inequality follows from the concavity of the square root. The last step is an upper bound on the expected confidence interval width. Let Γ = Σ 1 0 + γσ 2n G . By Assumption 2, ˆΣ 1 Γ and thus ˆΣ Γ 1. So, for any policy π , we have E h ϕ(X, π (X)) ˆΣϕ(X, π (X)) i E ϕ(X, π (X)) Γ 1ϕ(X, π (X)) = E h tr(Γ 1 2 ϕ(X, π (X))ϕ(X, π (X)) Γ 1 = tr(G Γ 1) = tr((Σ 1 0 G 1 + γσ 2n Id) 1) d λd(Σ 1 0 G 1 + γσ 2n Id) . The first inequality follows from Assumption 2. The first equality holds because v v = tr(vv ) for any v Rd. The next three equalities use that the expectation of the trace is the trace of the expectation, the cyclic property of the trace, and the definition of matrix inverse. The last inequality follows from tr(A 1) dλ1(A 1) = dλ 1 d (A), which holds for any PSD matrix A Rd d. Now we apply basic eigenvalue identities and inequalities, and get λd(Σ 1 0 G 1 + γσ 2n Id) = λd(Σ 1 0 G 1 ) + γσ 2n = λd((G Σ0) 1) + γσ 2n = 1 λ1(G Σ0) + γσ 2n 1 λ1(G )λ1(Σ0) + γσ 2n 1 λ1(Σ0) + γσ 2n = λd(Σ 1 0 ) + γσ 2n . The last inequality uses λ1(G ) 1, which follows from Assumption 1. To finalize the proof, we chain all inequalities starting from (15) and get that V (π ; θ ) V (ˆπ; θ ) p 5d log(1/δ) 4d λd(Σ 1 0 ) + γσ 2n holds on event E, which occurs with probability at least 1 δ for θ | D N(ˆθ, ˆΣ). This completes the proof. B. Frequentist Single-Task Analysis In this section, we derive a frequentist counterpart to the bound in Theorem 1. Theorem 5. Fix dataset D and let that the rewards be drawn independently as Yt ϕ(Xt, At) θ Sub G(σ2) for some σ > 0. Let ˆπ(x) = arg max a A L(x, a). Then for any θ Θ such that θ 2 κ holds, any γ such that Assumption 2 holds, and any δ (0, 1), V (π ; θ ) V (ˆπ; θ ) p 2 log(2 |Φ| /δ) + κλ 1 4d λd(Σ 1 0 ) + γσ 2n holds with probability at least 1 δ, where Φ Rd is the set of all feature vectors. Multi-Task Off-Policy Learning from Bandit Feedback The above result compares to the Bayesian error bound in Theorem 1 as follows. Under the assumption that Φ is an ε-grid over [0, 1]d, we get log(2 |Φ| /δ) = O(d log(1/εδ)) and the main difference in the bounds is κλ 1 2 d (Σ0) in Theorem 5. To prove Theorem 1, we start with the concentration of the model parameter and define r(x, a) = r(x, a; θ ), similarly to Appendix A. We also use ϕt = ϕ(Xt, At). Lemma 6. Let E = { x X, a A : |r(x, a) ˆr(x, a)| c(x, a)} be the event that all high-probability confidence intervals hold, where c(x, a) = q 2 log(2 |Φ| /δ) + κλ 1 2 d (Σ0) ϕ(x, a) ˆΣ . Then P (E) 1 δ. Proof. We start with the observation that the regularized least-squares estimate of θ can be expressed as ˆθ = σ 2(Σ 1 0 + G) 1 n X = σ 2(Σ 1 0 + G) 1 n X t=1 ϕt(Yt ϕ t θ ) + (Σ 1 0 + G) 1(Σ 1 0 + G)θ (Σ 1 0 + G) 1Σ 1 0 θ = σ 2(Σ 1 0 + G) 1 n X t=1 ϕt(Yt ϕ t θ ) + θ (Σ 1 0 + G) 1Σ 1 0 θ . Therefore, for any ϕ Φ, ϕ (ˆθ θ ) = σ 2 n X t=1 ϕ (Σ 1 0 + G) 1ϕt(Yt ϕ t θ ) ϕ (Σ 1 0 + G) 1Σ 1 0 θ . (16) Now note that σ 2 Pn t=1 ϕ (Σ 1 0 + G) 1ϕt(Yt ϕ t θ ) is a weighted sum of independent sub-Gaussian random variables Yt ϕ t θ Sub G(σ2). By definition, its variance proxy is bounded from above as t=1 ϕ (Σ 1 0 + G) 1ϕtϕ t (Σ 1 0 + G) 1ϕ = ϕ (Σ 1 0 + G) 1G(Σ 1 0 + G) 1ϕ ϕ (Σ 1 0 + G) 1ϕ = ϕ 2 ˆΣ , where the last inequality follows from G Σ 1 0 + G. 
By the concentration of sub-Gaussian random variables, we have t=1 ϕ (Σ 1 0 + G) 1ϕt(Yt ϕ t θ ) p 2 log(2/δ) ϕ ˆΣ To bound the second term in (16), we apply the Cauchy Schwarz inequality and get ϕ ˆΣΣ 1 0 θ Σ 1 0 θ ˆΣ ϕ ˆΣ = q θ Σ 1 0 ˆΣΣ 1 0 θ ϕ ˆΣ q θ Σ 1 0 θ ϕ ˆΣ κλ 1 2 d (Σ0) ϕ ˆΣ . The second and third inequalities follow from ˆΣ Σ0 and θ Σ 1 0 θ κ2λ 1 d (Σ0), respectively. In the next step, we chain all inequalities starting from (16) and get that P ϕ (ˆθ θ ) p 2 log(2/δ) + κλ 1 2 d (Σ0) ϕ ˆΣ δ holds for any ϕ Φ with probability at least 1 δ. To finalize the proof, we apply a union bound over all ϕ. The rest of the proof proceeds exactly as in Appendix A, since that proof is for any model parameter θ on event E. This completes the proof of Theorem 5. Multi-Task Off-Policy Learning from Bandit Feedback C. Proof of Theorem 2 The proof is under the assumption that the logged dataset D is fixed and the task parameter θs, is random. In particular, since we conduct a Bayesian analysis, we condition on all available observations and have θs, | D N(ˆθs, ˆΣs), where ˆθs = Σs(Σ 1 0 µ + Bs) and ˆΣs are derived in Section 3.2. We start with the concentration of the task parameter. To simplify notation, let rs(x, a) = r(x, a; θs, ). Lemma 7. Let E = { x X, a A : |rs(x, a) ˆrs(x, a)| cs(x, a)} be the event that all high-probability confidence intervals in task s S hold. Then P (E | D) 1 δ. Proof. The proof is analogous to Lemma 3, since only the mean and covariance of θs, | D changed, and this is reflected in the definitions of ˆrs(x, a) and cs(x, a). On event E in Lemma 7, similarly to Lemma 4, we have that rs(x, πs, (x)) rs(x, ˆπs(x)) 2cs(x, πs, (x)) holds for all contexts x X with probability at least 1 δ. Since the above bound holds for any context, we can use use it to bound the suboptimality of ˆπs by the expected confidence interval width induced by πs, . In the rest of the analysis, we fix θs, and the only randomness is due to X Px. On event E in Lemma 7, we have V (πs, ; θs, ) V (ˆπs; θs, ) = E [rs(X, πs, (X)) rs(X, ˆπs(X))] 2E [cs(X, πs, (X))] (17) 5d log(1/δ) E q ϕ(X, πs, (X)) ˆΣsϕ(X, πs, (X)) 5d log(1/δ) E h ϕ(X, πs, (X)) ( ΣsΣ 1 0 ΣΣ 1 0 Σs + Σs)ϕ(X, πs, (X)) i . These steps are the same as in (15). The latter term, which represents the conditional task uncertainty, is bounded exactly as in Theorem 1, E h ϕ(X, πs, (X)) Σsϕ(X, πs, (X)) i d λd(Σ 1 0 ) + γσ 2ns . For the former term, which represents the hyper-parameter uncertainty, we have E h ϕ(X, πs, (X)) ΣsΣ 1 0 ΣΣ 1 0 Σsϕ(X, πs, (X)) i = tr(Gs, ΣsΣ 1 0 ΣΣ 1 0 Σs) dλ1(Gs, ΣsΣ 1 0 ΣΣ 1 0 Σs) . To bound the maximum eigenvalue, we further proceed as λ1(Gs, ΣsΣ 1 0 ΣΣ 1 0 Σs) λ1(Gs, )λ1( ΣsΣ 1 0 )λ1( Σ)λ1(Σ 1 0 Σs) λ1( Σ) = 1 λd(Σ 1 q + P z S(Σ0 + G 1 z ) 1) . The second inequality uses λ1(Gs, ) 1, which follows from Assumption 1, and λ1( ΣsΣ 1 0 ) 1. Finally, we apply basic eigenvalue identities and inequalities, and get z S (Σ0 + G 1 z ) 1 ! λd(Σ 1 q ) + X z S λd((Σ0 + G 1 z ) 1) = λd(Σ 1 q ) + X z S λ 1 1 (Σ0 + G 1 z ) λd(Σ 1 q ) + X 1 λ1(Σ0) + λ1(G 1 z ) λd(Σ 1 q ) + X 1 λ1(Σ0) + γ 1σ2λ1(G 1 z, )n 1 z , Multi-Task Off-Policy Learning from Bandit Feedback where we use Assumption 3 in the last inequality. To finalize the proof, we chain all inequalities starting from (17) and get that V (πs, ; θs, ) V (ˆπs; θs, ) p 5d log(1/δ) 4d λd(Σ 1 q ) + P z S(λ1(Σ0) + γ 1σ2λ1(G 1 z, )n 1 z ) holds on event E, which occurs with probability at least 1 δ for θs, | D. This completes the proof of the first claim in Theorem 2. 
Note that the bound in the first claim depends on $\lambda_1(G_{z,*}^{-1})$, which can be large when $\lambda_d(G_{z,*})$ is small. We can eliminate this dependence under the additional assumption in Theorem 2. Under that assumption, all matrices are diagonal and thus commute, and
$$
\lambda_1\!\left(G_{s,*}\,\tilde{\Sigma}_s\Sigma_0^{-1}\bar{\Sigma}\Sigma_0^{-1}\tilde{\Sigma}_s\right)
= \lambda_1\!\left(G_{s,*}\,\bar{\Sigma}\,\tilde{\Sigma}_s\Sigma_0^{-1}\Sigma_0^{-1}\tilde{\Sigma}_s\right)
\leq \lambda_1\!\left(G_{s,*}\,\bar{\Sigma}\right)
= \lambda_d^{-1}\!\left(\bar{\Sigma}^{-1} G_{s,*}^{-1}\right)
= \frac{1}{\lambda_d\!\left(\Sigma_q^{-1} G_{s,*}^{-1} + \sum_{z \in S}\left(G_{s,*}\Sigma_0 + G_{s,*} G_z^{-1}\right)^{-1}\right)}\,.
$$
Finally, we bound the minimum eigenvalue from below using basic eigenvalue identities and inequalities,
$$
\lambda_d\!\left(\Sigma_q^{-1} G_{s,*}^{-1} + \sum_{z \in S}\left(G_{s,*}\Sigma_0 + G_{s,*} G_z^{-1}\right)^{-1}\right)
\geq \lambda_d(\Sigma_q^{-1})\,\lambda_1^{-1}(G_{s,*})
+ \sum_{z \in S} \lambda_1^{-1}\!\left(G_{s,*}\Sigma_0 + G_{s,*} G_z^{-1}\right)
$$
$$
\geq \lambda_d(\Sigma_q^{-1})
+ \sum_{z \in S} \frac{1}{\lambda_1(G_{s,*})\,\lambda_1(\Sigma_0) + \lambda_1(G_{s,*} G_z^{-1})}
\geq \lambda_d(\Sigma_q^{-1})
+ \sum_{z \in S} \frac{1}{\lambda_1(\Sigma_0) + \gamma^{-1}\sigma^2 n_z^{-1}}\,.
$$
In the last two inequalities, we use that $\lambda_1(G_{s,*}) \leq 1$. In the last inequality, we also use that Assumption 3 holds for any task parameter, including $\theta_{z,*} = \theta_{s,*}$. Thus $G_z \succeq \gamma\sigma^{-2} n_z G_{s,*}$ and $G_z^{-1} \preceq \gamma^{-1}\sigma^2 n_z^{-1} G_{s,*}^{-1}$. This completes the proof of the second claim in Theorem 2.

Figure 4. Evaluation of off-policy algorithms on image classification with Omniglot, using three randomly selected alphabets. [Three panels, one per alphabet (Omniglot bandit, $m = 100$, $d = 64$, $K = 10$); x-axis: dataset size $n$; y-axis: suboptimality; methods: HierOPO, FlatOPO, and OracleOPO.]

Figure 5. Visualization of the three alphabets used for evaluation. The first four images are randomly selected characters from the alphabet. The fifth is a visualization of the hyper-parameter $\mu_*$ estimated by HierOPO, obtained by interpolating the estimated hyper-parameter among characters in the alphabet. Note that the hyper-parameter captures common structure among different characters in the alphabet.

D. Additional Experiment on Image Classification

In this section, we report an additional experiment on image classification using a real-world dataset commonly used in meta-learning. We use Omniglot (Lake et al., 2015), a dataset of 1,623 handwritten characters from 50 alphabets with 20 examples per character. We train a 4-layer CNN to classify the characters from 30 of the alphabets and use it to extract $d = 64$ features. The remaining 20 alphabets are used to evaluate the algorithms as in Section 7.

We create a multi-task contextual bandit from the evaluation data as follows. First, an alphabet is sampled uniformly at random from the subset of alphabets reserved for evaluation. Then, for each task, a character is chosen uniformly at random from the alphabet as the positive class. Because all tasks correspond to characters from the same alphabet, the tasks have a hierarchical relationship. In each round of a task, $K = 10$ images are chosen from the dataset, one of which is guaranteed to be of the chosen character. The context is a concatenation of the $d$-dimensional feature vectors of the chosen images, extracted by the CNN trained on the training alphabets. The reward for selecting an image of the positive class is drawn from $\mathrm{Ber}(0.9)$; otherwise, the reward is drawn from $\mathrm{Ber}(0.1)$.
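For concreteness, the following sketch (ours, not the authors' code) generates logged data for one task under this construction. Here `char_features` and `other_features` are hypothetical arrays of CNN features for the positive character and for all other characters, and the uniform logging policy is a simplifying assumption rather than a restatement of the logging policy in Section 7.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task_dataset(char_features, other_features, n, K=10):
    """Generate logged bandit data for one task (one positive character).

    char_features: (N_pos x d) CNN features of images of the positive character.
    other_features: (N_neg x d) CNN features of images of other characters.
    Returns contexts (n x K x d), logged actions (n,), and rewards (n,).
    """
    d = char_features.shape[1]
    contexts = np.empty((n, K, d))
    actions = np.empty(n, dtype=int)
    rewards = np.empty(n)
    for t in range(n):
        pos = char_features[rng.integers(len(char_features))]
        neg = other_features[rng.choice(len(other_features), size=K - 1, replace=False)]
        slot = rng.integers(K)                   # position of the positive image
        ctx = np.insert(neg, slot, pos, axis=0)  # K candidate images
        a = rng.integers(K)                      # uniform logging policy (assumption)
        p = 0.9 if a == slot else 0.1            # Ber(0.9) for the positive class, Ber(0.1) otherwise
        contexts[t], actions[t], rewards[t] = ctx, a, rng.binomial(1, p)
    return contexts, actions, rewards
```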
We estimate the hierarchical model in Section 3.2 using the CNN trained on the training alphabets. We estimate the hyper-prior parameters $\mu_q$ and $\Sigma_q$ as the mean and covariance, respectively, of the features of all images in the test set. Then, for each alphabet, we set $\mu_*$ and $\Sigma_0$ to the mean and covariance of the features of the images in that alphabet. Recall that $\mu_*$ and $\Sigma_0$ are unknown to all methods except OracleOPO.

In Figure 4, we report results for three randomly selected alphabets from the test set over 10 random runs, where each run consists of choosing $m = 100$ characters, generating a dataset of size $4n$, and running each algorithm on it. Here HierOPO and OracleOPO both outperform FlatOPO initially, but their performance converges to that of FlatOPO as the dataset size $n$ grows (particularly in the third alphabet). This is because HierOPO and OracleOPO assume a Gaussian hierarchical structure over characters in an alphabet, which is likely violated. However, as shown in Figure 5, HierOPO still learns commonalities between characters in an alphabet, such as shape and curvature, which leads to improved performance when $n$ is small.
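The moment-based estimation above is straightforward to implement. A minimal sketch, with hypothetical inputs `test_features` (CNN features of all test-set images) and `alphabet_ids` (the alphabet label of each image), is:

```python
import numpy as np

def estimate_priors(test_features, alphabet_ids):
    """Moment-based estimates of the hyper-prior and the per-alphabet priors.

    test_features: (N x d) CNN features of all test-set images.
    alphabet_ids: (N,) alphabet label of each image.
    Returns (mu_q, Sigma_q) and a dict {alphabet: (mu_star, Sigma_0)}.
    """
    mu_q = test_features.mean(axis=0)
    Sigma_q = np.cov(test_features, rowvar=False)
    per_alphabet = {}
    for a in np.unique(alphabet_ids):
        F = test_features[alphabet_ids == a]
        per_alphabet[a] = (F.mean(axis=0), np.cov(F, rowvar=False))
    return (mu_q, Sigma_q), per_alphabet
```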