# Active Offline Policy Selection

Ksenia Konyushkova (DeepMind, kksenia@deepmind.com), Yutian Chen (DeepMind, yutianc@deepmind.com), Tom Le Paine (DeepMind, tpaine@deepmind.com), Caglar Gulcehre (DeepMind, caglarg@deepmind.com), Cosmin Paduraru (DeepMind, paduraru@deepmind.com), Daniel J Mankowitz (DeepMind, dmankowitz@deepmind.com), Misha Denil (DeepMind, mdenil@deepmind.com), Nando de Freitas (DeepMind, nandodefreitas@deepmind.com)

This paper addresses the problem of policy selection in domains with abundant logged data but a restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and recommendation domains, among others. Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, there is still a big gap between evaluation by OPE and full online evaluation in the real environment, and large amounts of online interaction are often not possible in practice. To overcome this problem, we introduce active offline policy selection, a novel sequential decision approach that combines logged data with online interaction to identify the best policy. This approach uses OPE estimates to warm start the online evaluation. Then, in order to utilize the limited environment interactions wisely, we decide which policy to evaluate next based on a Bayesian optimization method with a kernel function that represents policy similarity. We use multiple benchmarks with a large number of candidate policies to show that the proposed approach improves upon state-of-the-art OPE estimates and pure online policy evaluation.

Equal contribution. The paper website is at https://sites.google.com/corp/view/active-ops and the code is at https://github.com/deepmind/active_ops.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

## 1 Introduction

Reinforcement learning (RL) has recently proven to be successful in a range of applications from computer and board games [46, 55, 58] to robotics and chip design [37, 10, 45]. However, many challenges of real-world systems still prevent RL from being applied at scale in practice [16]. One of the limiting factors in real applications is that environment interactions are often expensive. Thus, training and evaluating agents becomes prohibitively slow and costly. Offline RL attempts to address this problem by training agents on a dataset without environment interactions [2, 23]. This enables researchers to train policies with multiple algorithms and different hyperparameter settings [50] on a single dataset. The best policy can then be chosen with off-policy policy evaluation (OPE) approaches [53, 40] (Fig. 1, left). These techniques attempt to estimate the expected agent performance by relying on the same pre-recorded dataset [28, 41, 36, 72].

Figure 1: Left: Offline policy selection attempts to choose the best policy from a set of policies, given only a dataset of logged experiences. Right: Active offline policy selection additionally assumes a restricted environment interaction budget which is assigned intelligently to evaluate the most promising policies.

Unfortunately, OPE estimates are not yet precise enough to be relied upon for choosing which agent to deploy.
Moreover, OPE is intrinsically limited by the pre-recorded dataset: when there is a distribution shift between a trained policy and the behavioural policy, i.e., when the trained policy behaves very differently from the data collection policy, OPE is imprecise [38]. In many applications, however, a subset of policies can be evaluated online prior to final deployment [23, 19], even though this process can be hard and laborious. For example, in robotics a small subset of offline RL policies can be tested on the real hardware. Another example is recommender systems, which allow for restricted policy evaluation via A/B tests on a small fraction of user traffic.

In this paper, we introduce the active offline policy selection problem. The goal is to identify the best policy while benefiting from both offline evaluation with logged data and online evaluation with limited interaction (Fig. 1, right). We advance a solution based on Bayesian optimization (BO). It entails learning a Gaussian process (GP) surrogate function that maps policies to their expected returns. Then, we build on the GP statistics to construct an acquisition function that decides which policy to test next. To make BO successful in this problem setting, our approach has two key features. First, we incorporate existing OPE estimates as additional noisy observations (Fig. 2). This allows us to warm start online policy evaluation and to overcome the difficulties of GP hyper-parameter optimisation at the start. Second, we model the correlation between policies through a kernel function based on the actions that the policies take in the same states of the environment (Fig. 3). This makes our method data efficient [26], as information about the performance of one policy informs us about the performance of similarly behaving policies without costly execution in the environment. It is particularly valuable when the number of candidate policies is large (or even larger than the interaction budget).

The contributions of this paper are as follows.

1. We introduce the active offline policy selection problem. This problem is important in practice, but to the best of our knowledge it has not been studied in the literature so far.
2. We build a BO solution with an extended observation model to integrate both OPE estimates and interaction data. The use of OPE estimates enables us to warm-start learning a GP.
3. We propose a novel GP kernel to capture the dependency between the policies through the actions that they take. As a result, our method infers the value of one policy from those of similar policies.

The rest of this paper is organized as follows. Section 1.1 gives background on offline policy evaluation and selection. In section 2, we propose a GP model for policy value estimation, explain how we approach the sequential decision making with BO, and introduce a novel kernel based on the similarity of policy outputs. Section 3 discusses related work. Section 4 shows that active policy evaluation can improve upon OPE after merely a few interactions, thanks to the kernel that ensures data-efficiency. Additionally, our method works reliably with OPEs of varying quality and scales well with the growing number of candidate policies. The paper is concluded with a discussion in section 5.
### 1.1 Off-policy policy evaluation and selection

We consider a Markov decision process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, T, R, d_0, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, transition distribution $T(s'|s, a)$, reward function $R(s, a)$, initial state distribution $d_0(s)$, and discount factor $\gamma \in [0, 1]$. A policy $\pi(a|s)$ maps from $\mathcal{S}$ to $\mathcal{A}$. The value of a policy $\pi$ is measured by the expected sum of discounted rewards:

$$\mu_\pi = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right], \quad \text{with } s_0 \sim d_0(\cdot),\ a_t \sim \pi(\cdot|s_t),\ s_{t+1} \sim T(\cdot|s_t, a_t). \quad (1)$$

Figure 2: Gaussian process of the policy value function. We observe OPE estimates for all policies and noisy episodic returns for some of them. Neighboring policies (measured by the kernel $\mathcal{K}$) have similar values, and the posterior variance is lower where there are more observations. Policies are aligned in 1D for visualization purposes, but in practice they reside in a high-dimensional space.

Typically, estimating the value of a given policy requires executing the policy in the environment many times. However, practical constraints make it difficult to run many policies of unknown quality in the environment. To address this, OPE estimates a value $\hat{\mu}_\pi$ using a dataset of trajectories $\mathcal{D}$ collected by a separate behavior policy $\pi_\beta$. In offline policy selection (OPS) the task is to select the policy with the highest value from a set of policies $\pi_1, \pi_2, \ldots, \pi_K$ given access only to a dataset $\mathcal{D}$ (Fig. 1, left). A straightforward approach is to select the policy with the highest OPE estimate $\hat{\mu}_{\pi_1}, \ldots, \hat{\mu}_{\pi_K}$ [20]; alternative approaches can be used depending on the quality metric [71].

## 2 Active offline policy selection

### 2.1 Problem definition

We now formally define the active offline policy selection problem. Suppose we are given a set of $K$ candidate policies $\pi_k$, $1 \leq k \leq K$. Denote by $\mu_k$ the unknown policy value and by $\rho_k$ an OPE estimate computed from an offline dataset. At every step $i$, we can choose a policy $k_i$ to execute in the environment once and observe the trajectory with (discounted) episodic return $r_i$, which is a noisy sample of the true policy value: $\mathbb{E}[r_i] = \mu_{k_i}$. We would like to find the policy with the highest value, $k^* = \arg\max_k \mu_k$, with as few trajectories (budget) as possible. A policy selection algorithm estimates the mean value of each policy as $m_k$ and recommends the policy that maximizes it: $\hat{k} = \arg\max_k m_k$. The simple regret of the recommended policy measures how close we are to the best policy:

$$\text{regret} = \mu_{k^*} - \mu_{\hat{k}}. \quad (2)$$

To solve this sequential decision problem, we introduce a GP over the policy returns and OPE estimates (subsection 2.2), and design the GP kernel over policies to model their correlated returns (subsection 2.3). Then we use these as part of a BO strategy (subsection 2.4).

### 2.2 Gaussian process over policy values

Given a limited interaction budget and a large number of policies, it is impossible to obtain accurate estimates of their values by executing them many times. Fortunately, similar policies tend to have similar values. Thus, propagating the information between them helps to improve the value prediction. Intuitively, the number of policies that we need to execute to find a good policy should depend on their diversity instead of the total number of candidate policies (see experiments supporting this in Fig. 7). Formally, we use a GP to predict a joint distribution over the policy values.
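To make the selection loop and the simple-regret metric of subsection 2.1 concrete, the sketch below runs a UCB-style selection with a deliberately simplified per-policy surrogate in place of the GP described next. The function names (`active_policy_selection`, `execute_episode`) and the running-average update are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np


def active_policy_selection(ope_estimates, execute_episode, true_values,
                            budget, beta=2.0):
    """Sketch of the selection loop of subsection 2.1 (hypothetical interface).

    ope_estimates: array of OPE scores rho_k, one per candidate policy.
    execute_episode(k): returns one noisy episodic return of policy k.
    true_values: ground-truth values mu_k, used here only to report regret.
    The per-policy running average below is a stand-in for the GP posterior
    of subsection 2.2, which additionally shares information across policies.
    """
    means = np.asarray(ope_estimates, dtype=float).copy()
    variances = np.full(len(means), np.var(means) + 1e-6)
    counts = np.zeros(len(means))
    regrets = []
    for _ in range(budget):
        # UCB acquisition, anticipating Eq. 7 in subsection 2.4.
        k = int(np.argmax(means + np.sqrt(beta * variances)))
        r = execute_episode(k)
        counts[k] += 1
        # Treat the OPE estimate as one pseudo-observation and average it
        # with the observed returns; uncertainty shrinks with more samples.
        means[k] += (r - means[k]) / (counts[k] + 1.0)
        variances[k] = variances[k] * counts[k] / (counts[k] + 1.0)
        # Recommendation and simple regret (Eq. 2).
        k_hat = int(np.argmax(means))
        regrets.append(float(np.max(true_values) - true_values[k_hat]))
    return regrets
```

Replacing the independent running average with the joint GP posterior described next is what lets an observation of one policy also update the beliefs about similarly behaving policies.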
Figure 3: Left: The distance between two policies $\pi_1$ and $\pi_2$ is computed as an average distance across pairs of action vectors that are selected by each policy on a set of states. Middle: Pairwise distance matrix between policies on the humanoid-run task. High distance is shown in yellow and low distance in dark blue. The 3×3 block structure of the matrix reflects the 3 types of training algorithm. Smaller blocks correspond to the increasing number of training steps with the same hyperparameters. Right: Policy value ($\mu_\pi$) versus action distance to the behavior policy $d(\pi, \pi_\beta)$. Policies with the same training algorithm are in the same color.

A GP is a stochastic process, that is, an indexed collection of random variables, such that every finite collection of them has a multivariate normal distribution [54]. In our case, we assume that the policy value $\mu$ as a function of policy $\pi$ follows a GP, with a kernel function $\mathcal{K}(\pi_i, \pi_j)$ over policies and mean $m$. Specifically, we consider the following generative process for the latent policy value $\mu(\pi)$, OPE estimate $\rho$, and episodic return $r$:

$$\mu(\pi) \sim GP(m(\cdot), \mathcal{K}(\cdot, \cdot)), \quad \rho \sim \mathcal{N}(\mu(\pi), \sigma_\rho^2), \quad r \sim \mathcal{N}(\mu(\pi), \sigma_r^2). \quad (3)$$

We use a constant mean $m$ without loss of generality. We assume a flat prior for the hyper-parameter $m$, and weakly informative inverse Gamma priors for the variances $\sigma_\rho^2$ and $\sigma_r^2$. Our probabilistic model is illustrated in Fig. 2.

As we have a finite number of policies, the input space is a finite set $\mathcal{X} = \{\pi_1, \ldots, \pi_K\}$, where each index corresponds to a candidate policy. The joint distribution of the $K$ policy values reduces to a multivariate normal distribution,

$$\mu_1, \ldots, \mu_K \sim \mathcal{N}(m \mathbf{1}_K, \mathbf{K}), \quad \text{with } \mathbf{K}_{k,k'} = \mathcal{K}(\pi_k, \pi_{k'}), \quad (4)$$

where $\mathbf{1}_K$ is a $K$-length vector of ones. For each policy $\pi_k$, we have one noisy observation from an OPE method, $\rho_k$, and zero or multiple noisy episodic return observations $r_k^i$. As the policies are related through the covariance matrix $\mathbf{K}$, observing a return sample from one policy updates the posterior distribution of all of them. Given the OPE estimates $\boldsymbol{\rho} = [\rho_1, \ldots, \rho_K]$ and $N_k$ return observations $\mathbf{r}_k = [r_k^1, \ldots, r_k^{N_k}]$ for each policy $\pi_k$, the posterior distribution of the mean return is also a Gaussian,

$$\mu_1, \ldots, \mu_K \mid \boldsymbol{\rho}, \{\mathbf{r}_k\}_k \sim \mathcal{N}(\mathbf{m}, \boldsymbol{\Sigma}), \quad \text{with } \mathbf{m} = \mathbf{K}(\mathbf{K} + \boldsymbol{\Lambda})^{-1}\mathbf{y}, \quad \boldsymbol{\Sigma} = \mathbf{K} - \mathbf{K}(\mathbf{K} + \boldsymbol{\Lambda})^{-1}\mathbf{K}, \quad (5)$$

where $\boldsymbol{\Lambda}$ is a diagonal matrix with entries $\Lambda_{kk} = \left(\frac{1}{\sigma_\rho^2} + \frac{N_k}{\sigma_r^2}\right)^{-1}$, and $\mathbf{y}$ is a vector with entries $y_k = \Lambda_{kk}\left(\frac{\rho_k}{\sigma_\rho^2} + \frac{\sum_i r_k^i}{\sigma_r^2}\right)$; that is, each policy contributes one pseudo-observation that fuses its OPE estimate and return samples with the corresponding effective noise variance.

### 2.3 Kernel over policies

A key component of the GP model is the kernel that measures our belief about the policy correlation. To obtain a kernel we make a key assumption: policies that are similar in the actions that they take yield similar returns. Then, our insight is to measure the distance between the policies through the actions that each of them takes on a fixed set of states from the offline dataset (Fig. 3, left). Formally, to measure the similarity between $\pi_1$ and $\pi_2$, we obtain the policy actions at a given state $s$ and measure the distance between these two action vectors, $d(\pi_1(s), \pi_2(s))$. (In this work, we use Euclidean distance for deterministic policies with continuous actions and Hamming distance for discrete actions; an extension to stochastic policies is straightforward.) We then define the distance between the two policies as the average distance over the states of the offline dataset $\mathcal{D}$,

$$d(\pi_1, \pi_2) = \mathbb{E}_{s \sim \mathcal{D}}\left[d(\pi_1(s), \pi_2(s))\right]. \quad (6)$$
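The two computations just described, the posterior of Eq. 5 and the action-based distance of Eq. 6, can be sketched compactly. The sketch below assumes known noise variances, a precomputed kernel matrix, and deterministic continuous-action policies; the function names are illustrative rather than taken from the released code.

```python
import numpy as np


def gp_posterior(K, ope, returns, sigma_rho2, sigma_r2, prior_mean=0.0):
    """Posterior mean and covariance of policy values as in Eq. 5.

    K: (num_policies, num_policies) kernel matrix (Eq. 4).
    ope: OPE estimates rho_k; returns: list of arrays of episodic returns r_k^i.
    Each policy contributes one pseudo-observation y_k with effective noise
    Lambda_kk that fuses its OPE estimate and its return samples.
    """
    ope = np.asarray(ope, dtype=float)
    n_k = np.array([len(r) for r in returns], dtype=float)
    lam = 1.0 / (1.0 / sigma_rho2 + n_k / sigma_r2)          # Lambda diagonal
    y = lam * (ope / sigma_rho2 +
               np.array([np.sum(r) for r in returns]) / sigma_r2)
    kl = K + np.diag(lam)
    mean = prior_mean + K @ np.linalg.solve(kl, y - prior_mean)
    cov = K - K @ np.linalg.solve(kl, K)
    return mean, cov


def policy_distance(policy_a, policy_b, states):
    """Average Euclidean action distance over a set of states (Eq. 6),
    assuming deterministic policies with continuous actions."""
    dists = [np.linalg.norm(policy_a(s) - policy_b(s)) for s in states]
    return float(np.mean(dists))
```

A kernel matrix built from `policy_distance` via the Matérn 1/2 form described next can be passed directly as `K` here.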
In practice we approximate $d$ using a randomly sampled subset of states $\mathcal{D}' \subset \mathcal{D}$. An example of a pairwise distance matrix for a set of policies is shown in Fig. 3, middle. Rows and columns are the policies, which are arranged first by the type of the training algorithm, then by hyperparameters, and finally by the number of training steps (none of which is observed by our method). Note the block structure, which reflects the different nature of the training algorithms. Smaller blocks of increasing policy distances correspond to policies with more training steps and the same hyperparameters. Additionally, Fig. 3, right shows an example of policy distances to the behaviour policy. The values of policies produced by the same algorithm change gradually as a function of this distance. This provides an indication that our kernel construction is informative for predicting the value of a policy.

Finally, we compute a Matérn 1/2 kernel as $\mathcal{K}(\pi_1, \pi_2) = \sigma_k^2 \exp(-d(\pi_1, \pi_2)/l)$, where $\sigma_k$ and $l$ are the trainable variance and length-scale hyperparameters. This is a popular kernel choice [56]. As the distance metric is the average distance over states, using it in this kernel is equivalent to a multiplication of kernels on these states, and it yields a valid positive semi-definite matrix.

### 2.4 Active offline policy selection with Bayesian optimization

Under a GP formulation, we can employ BO [56] to search for the best policy efficiently. BO optimizes an unknown function $\mu(\pi)$ with $\pi \in \mathcal{X}$ using a limited number of function queries that render noisy observations. BO has been used successfully in many applications, including experimental design [5], robotics [42, 43], hyper-parameter optimization [59, 27, 12], and preference learning and interactive machine learning [9, 8]. The key component of a BO algorithm is an acquisition function $u_i(\pi)$ that balances exploration and exploitation when selecting a query. Widely used acquisition functions include upper confidence bounds (UCB), expected improvement (EI), and epsilon-greedy. Here, we simply use the UCB acquisition function [60, 30] (experiments with other acquisition functions can be found in the supplementary materials). Specifically, at every step $i$, we compute the score for a policy $\pi_k$ using

$$u_i(k) = m_k + \sqrt{\beta_i \Sigma_{k,k}}, \quad (7)$$

where $\beta_i$ is a constant depending on $i$. We then choose the policy with the highest score, $k_i = \arg\max_k u_i(k)$, to execute next. After observing a new return $r_{k_i}$, we update $\mathbf{m}$ and $\boldsymbol{\Sigma}$ in Eq. 5. We also update the hyper-parameters ($m$, $\sigma_\rho^2$, $\sigma_r^2$, $\sigma_k^2$ and $l$) with the maximum a posteriori estimate (see details in the appendix).

Estimating the hyper-parameters with limited data is challenging in practice. Fortunately, in the active offline policy selection formulation, we can take advantage of the OPE estimates as prior observations to fit the hyper-parameters before launching the interactive optimization process. In some ablations where OPE is not available we instead execute 5 randomly sampled policies before fitting the hyperparameters.

## 3 Related work

Offline RL has become increasingly popular due to its ability to leverage offline data to learn policies for RL in the real world. As a result, several benchmark datasets were recently announced [23, 19]. Offline RL methods include policy-constraint approaches that regularize the learned policy to stay close to the behavior policy [70, 21], value-based approaches that encourage more conservative estimates, either through regularization or uncertainty [35, 22], model-based approaches [73, 31], and standard RL methods used in off-policy RL [7]. OPE is studied extensively across many domains [39, 66, 29, 49].
It includes methods that use importance sampling [52] or stationary state distributions [41, 48, 67], value functions [63, 44, 64], or learned transition models [74], as well as methods that combine these approaches [47, 15, 28, 18]. OPS is a related practical problem that has received increasing interest [50, 71, 20]. Our work has two notable distinctions in contrast to these prior works: 1) we consider similarities between policies when making value predictions, and 2) we focus on the active setting, where a small number of evaluations in the environment can help to quickly improve the prediction of the best policy.

Similar to the way our kernel is computed based on a policy's actions, the Policy Evaluation Networks work [24] considers network fingerprinting, which learns a policy embedding through the outputs of the policy on probing states. The principal difference to our approach is that we only consider a very limited online policy evaluation budget (on the order of 100), while network fingerprinting needs many policy evaluations to train the policy embeddings (on the order of 100 000).

Applying BO to a finite set of indices reduces the problem to a Bayesian multi-armed bandit problem. Exploring the shared structure among arms using Bayesian models has been studied in the literature [60, 33, 26]. Our work builds upon the Bayesian bandit model to address the problem of policy selection, additionally exploiting the offline dataset.

Active learning is used in RL for better exploration, and for learning reward or world models [17, 3, 14, 34, 6]. It has not been studied for the OPE or OPS problems to the best of our knowledge. Related to the policy representation through actions in the kernel, classifiers in active learning can also be characterised by their predictions on a subset of datapoints [32]. Furthermore, diversity in policies was measured through the action distribution on sampled states in a population-based setting [51].

## 4 Experiments

### 4.1 Datasets and environments

In our experiments we consider three domains (Fig. 4) with different properties. Two domains are standard benchmarks in RL for continuous and discrete control, and another one is motivated by the challenges of evaluating control policies in robotics. For each domain we describe the environments, datasets for offline RL and OPE, and sets of policies below. More details are provided in Appendix B.

Figure 4: Environments used in the experiments: 9 environments from the dm-control domain (3 examples on the left), 4 from the MPG domain (2 examples in the middle), and 3 from the Atari domain (1 example on the right).

DeepMind Control Suite (dm-control). This is a standard set of continuous control environments [65]. It contains 9 tasks of locomotion and manipulation, some of which are shown in Fig. 4, left. The observations are the state of the MDP, including joint angles and velocities. The dimensionality and semantics of the continuous actions vary across tasks. We use the offline RL policies from DOPE [20] (a benchmark for OPE and OPS) that are trained using the dataset from the RL Unplugged benchmark [23]. For each task, there are up to 96 policies with various algorithms (D4PG [7], ABM [57], and CRR [70]), hyperparameters, and numbers of training steps.

Manipulation Playground (MPG). This is a simulated robotics environment. The task is to control a Kinova Jaco arm with 9 degrees of freedom.
Joint velocity control of the 6 arm and 3 hand joints is used. The policies are learnt from proprioception and from 64×64 image inputs from frontal and in-hand cameras. We perform 4 manipulation tasks within a 20×20 cm basket (Fig. 4, middle). The dataset is generated by running an online RL algorithm as in [70], and the policies are generated as in [50]. For each task, there are up to 256 policies trained with various algorithms (including BC, D4PG [7], and CRR [70]) and hyperparameters.

Atari. This is a popular benchmark with discrete actions in online and offline RL [23]. For each of 3 games we trained 210 policies on an offline dataset [23] from high-dimensional pixel observations with six offline RL methods, including Double DQN [68], CQL [35], BCQ [21], REM [2] and BVE [22]. As the range of returns is quite different from the other two domains, we scale the discounted returns to have approximately the same mean and variance based on the offline RL dataset.

We obtain OPE estimates for each policy by running Fitted Q-Evaluation (FQE) [44, 36] on the offline RL data. FQE is the default OPE method used in our experiments unless stated otherwise.

Figure 5: Comparison of the A-OPS approach, offline policy selection with OPE, and online policy selection with IND+UNIFORM on dm-control, MPG and Atari environments. The results are averaged across 9, 4 and 3 tasks, with 100 experiments in each of them. Shaded areas show the standard deviation of the mean. Our method A-OPS quickly surpasses the baselines and achieves the lowest regrets.

### 4.2 Experimental setting and implementation

To evaluate the policy selection procedure, in each experiment we randomly select a subset of $K$ policies out of all trained policies. We set $K = 50$ for dm-control and $K = 200$ for MPG and Atari. Then, we repeat each experiment 100 times, and report the average results and the standard deviation of the mean estimate. As a performance metric we monitor the simple regret (Eq. 2) as a function of the number of executed episodes. The algorithms are progressively evaluated as each trajectory is added. To make the results comparable across domains, we first compute the gap between the best and the worst performing policy in each environment across all the policies. Then, we normalise the scores in each experiment by dividing them by this gap. The results by environment are discussed in the appendix.

Independent policy model. To study the contribution of modeling the policy correlation with a GP, we consider a baseline model with independently distributed values $\mu_k$ across policies. The data generating process for the $k$-th policy follows a hierarchical Gaussian distribution:

$$r_k^i \overset{\text{iid}}{\sim} \mathcal{N}(\mu_k, \sigma_{k,r}^2)\ \ \forall i, \quad \text{with } \mu_k \sim \mathcal{N}(\rho_k, \sigma_{k,\rho}^2), \quad (8)$$

where we assume weakly informative inverse Gamma priors for the variances $\sigma_{k,r}^2$ and $\sigma_{k,\rho}^2$, as in the GP. We refer to this model as the independent policy model (IND). Such a model can be used in combination with any policy sampling strategy.

All online and offline RL algorithms, as well as offline policy evaluation algorithms in this work, are implemented with Acme [25] and Reverb [11], and run on GPUs on an internal cluster. We implement GP and IND together with all BO algorithms using the same TensorFlow [1] codebase, and run all the policy selection experiments using CPUs. Details about the tasks and policies, hyperparameter values, and additional BO algorithms and ablations are presented in the appendix.
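As a small illustration of this protocol, the sketch below gives the closed-form posterior over a single policy's value under the IND baseline of Eq. 8 (simplified by treating the noise variances as known rather than inferred) together with the gap-normalised regret described above. The function names are hypothetical.

```python
import numpy as np


def independent_posterior(ope_k, returns_k, sigma_rho2, sigma_r2):
    """Posterior over mu_k under the IND model of Eq. 8, treating the noise
    variances as known (the paper instead places inverse-Gamma priors on them)."""
    precision = 1.0 / sigma_rho2 + len(returns_k) / sigma_r2
    mean = (ope_k / sigma_rho2 + np.sum(returns_k) / sigma_r2) / precision
    return mean, 1.0 / precision  # posterior mean and variance


def normalized_regret(recommended_value, all_policy_values):
    """Simple regret (Eq. 2) divided by the best-worst gap, as in Sec. 4.2."""
    values = np.asarray(all_policy_values, dtype=float)
    gap = values.max() - values.min()
    return float((values.max() - recommended_value) / gap)
```

With no return observations the posterior simply reproduces the OPE estimate, which is why the IND baselines start from the OPE ranking when OPE is available.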
### 4.3 Comparison to the existing methods

In our main experiments we would like to verify that the proposed method performs better than two strategies commonly employed in practice (see section 1). To this end, we compare:

1. A-OPS: Active-Offline Policy Selection, our proposed method that uses OPE estimates to bootstrap GP learning and performs UCB policy sampling, as discussed in section 2.
2. OPE: standard offline policy selection that selects the policy with the highest OPE estimate [50].
3. IND+UNIFORM: standard online selection where we execute policies uniformly and select the one with the highest return belief (estimated with the independent policy model IND).

Fig. 5 presents the simple regret as a function of the number of executed trajectories. The results are averaged across 9 tasks from dm-control, 4 tasks from MPG, and 3 games from Atari. Our proposed method A-OPS combines offline and online policy evaluations. It quickly outperforms OPE and improves with more trajectories. The A-OPS estimate of the best policy may be erroneous initially due to noisy reward samples (e.g., in MPG), but it quickly updates, thus allowing for significant improvements over OPE after only a few actively selected trajectories.

When comparing offline and online policy selection, we notice that for the range of considered budgets, online policy selection with IND+UNIFORM performs worse than OPE. This suggests that single-trajectory returns are noisier than OPE estimates in these environments (more discussion on this is in the appendix). Naturally, we expect that given a larger budget the online baseline will eventually surpass its offline counterpart in all domains.

### 4.4 Component ablations

We are not aware of other works on active offline policy selection. Thus, in this section we investigate alternative solutions that help us to understand the importance of the proposed components of A-OPS. We design our ablation studies to test the contribution of the three components of A-OPS: 1) OPE: initialisation of algorithms with or without OPE estimates; 2) active learning: active policy selection (UCB) or uniform sampling (UNIFORM); 3) policy kernel: GP model or the independent model baseline (IND). Tab. 1 summarises the 8 possible combinations of the components. We refer to our full method (GP+UCB+OPE) simply as A-OPS. We also include the pure offline OPE estimates as before. The results with all the baselines in two domains are presented in Fig. 6.

Table 1: Design of the ablation experiments across three axes: the use of OPE, GP or independent policy model, and active policy selection or uniform sampling. We refer to our method that combines all the components as A-OPS.

|         | No OPE, Independent | No OPE, GP | With OPE, Independent | With OPE, GP   |
|---------|---------------------|------------|-----------------------|----------------|
| Uniform | IND+UNIFORM         | GP+UNIFORM | IND+UNIFORM+OPE       | GP+UNIFORM+OPE |
| UCB     | IND+UCB             | GP+UCB     | IND+UCB+OPE           | A-OPS          |

Figure 6: Ablations. In three domains, we ablate each of the three components of our approach: 1) the use of OPE (with OPE in the first row and without in the second); 2) the choice of policy model: GP (purple and red) and independent (green and blue); 3) the policy selection strategy: active (dark, solid line) and uniform sampling (lighter, dashed line). In general, active is better than uniform, methods that exploit policy correlation are better than methods that do not, and methods that use an OPE prior are better than those that do not.
First of all, we observe that A-OPS outperforms all other methods across a wide range of budgets; thus, all the components are crucial for the performance of our method and excluding any of them leads to significantly worse results. Next, we answer three questions about each component individually.

1. How important is it to incorporate OPE estimates? Across all the experiments, incorporating OPE estimates (top row of Fig. 6) always considerably improves the results compared to starting policy selection from scratch (bottom row). For example, in MPG when A-OPS starts with OPE, it is able to cut the OPE regret by more than half before executing 100 trajectories. For this budget, GP+UCB, which starts from scratch, does not reach the same regret as OPE.

2. How informative is our kernel? To understand the value of modelling the correlation between policies through the kernel, we compare the GP method with the independent policy model (IND). In Fig. 6, purple and red lines use GP as the policy model and green and blue use the IND model. In the vast majority of settings (11 out of 12) the use of a kernel significantly improves the results, and in one setting (IND+UNIFORM+OPE vs GP+UNIFORM+OPE in the MPG domain) they perform on par. We conclude that the kernel is a key ingredient for improving data efficiency.

3. How important is the selection strategy? The active selection of policies is generally beneficial for identifying a good policy. For this we refer to the dashed lines that correspond to UNIFORM policy selection and the solid lines that correspond to the use of UCB. Using IND+UCB yields a high regret in the initial exploration stage, but it improves substantially over IND+UNIFORM later. Moreover, incorporating OPE estimates and a kernel (resulting in A-OPS) significantly shortens the exploration stage.

Figure 7: Left: The performance of A-OPS on dm-control with varying types of OPE estimates and a budget of 50 trajectories. The OPE methods are Model-Based OPE with an auto-regressive transition model (MB-AR), Fitted Q-Evaluation with a distributional Q function and distributional loss (FQE-D), Model-Based OPE with a feedforward transition model (MB-FF), Fitted Q-Evaluation with a standard Q function and L2 loss (FQE-L2), Doubly Robust (DR), Variational Power Method (VPM), Importance Sampling (IS), and Distribution Correction Estimation (DICE). A-OPS can improve upon the initial OPE estimates in all cases. Right: Performance of A-OPS and IND+UNIFORM with a varying number of policies available for execution. IND+UNIFORM is affected a lot by the increase in the number of policies. Our approach A-OPS performs similarly when the number of policies is increased by a factor of almost 10.

### 4.5 More ablation studies

Incorporating the OPE estimates is clearly beneficial for policy selection. Here, we further study: How does the performance depend on the quality of OPE estimates? Fig. 7, left compares A-OPS using estimates from different OPE algorithms in DOPE [20] on dm-control with a small budget of 50 trajectories. A-OPS always improves upon the corresponding OPE estimate, and the performance clearly depends on the quality of the associated OPE estimate. The performance of GP+UCB with 50 trajectories, which does not rely on any OPE estimate, is shown as a horizontal line.
Some OPE estimates (e.g., IS, VPM and DICE) have large estimation errors and even a negative correlation with the ranks of policies [20]. As expected, A-OPS is negatively impacted by poor OPE estimates, and ignoring them could yield better results. It should be possible to correct for this to a greater extent with more online observations. For now, we follow the recommendation to use the FQE estimates.

In subsection 2.2 we hypothesised that the A-OPS method should be data efficient, as it should scale with the diversity of policies and not their number. We next verify this and answer the question: How does the policy selection algorithm scale with the growing number of candidate policies? In Fig. 7, right we show the simple regret for A-OPS and IND+UNIFORM in the MPG domain when we sample 25, 50, 100 and 200 policies. We see that when the number of policies is not very high (25), the algorithm that treats policies independently and samples them uniformly (IND+UNIFORM-25) can closely approach the performance of our method A-OPS-25 after 100 trajectories. However, when the number of policies grows, this small interaction budget is not enough for the naive algorithm: IND+UNIFORM-200 significantly degrades in performance. At the same time, A-OPS-200 is minimally affected by the increase in the number of policies, as the diversity of the policies stays approximately the same.

Figure 8: Visualisation of the qualitative performance of the A-OPS algorithm on the Robotank environment with 20 randomly selected policies (ordered by ground truth only for illustration purposes). From left to right we show the state at 1, 10, 50, and 100 algorithm iterations. The initial OPE scores are shown in light blue, the policy selected by A-OPS is highlighted with a magenta star, the observation of the currently executed policy is shown as an orange circle, past observations are in pink, and the prediction with its standard deviation is in purple.

### 4.6 Qualitative results

Finally, to understand the behaviour of the A-OPS method, we show its progress for one of the experiments with the Robotank game in Fig. 8. For visualisation purposes, we use only 20 policies, which are ordered by their ground truth returns (not observed by the method). Perfect value predictions would be aligned along the diagonal line. We show the initial OPE values, the current selection of the best policy, the current observation of the executed policy, past observations, and the algorithm's prediction with its standard deviation at steps 1, 10, 50, and 100. The A-OPS prediction is determined by the OPE observations and past observations of the policy returns. The kernel reinforces smoothness of the function. We can see that the initial selection is dominated by the OPE prediction and the selected policy is quite far from the best policy. With more online observations, the predictions improve. Our algorithm is data-efficient because 1) it samples more promising policies more frequently (more observations with higher scores), and 2) the knowledge from one policy is propagated to the policies related by the kernel (predictions for some policies are updated without any new observations of their values).

## 5 Conclusions

In this paper, we introduced the active offline policy selection problem. We proposed a BO solution that integrates OPE estimates with evaluations obtained by interacting with the environment. We also designed a novel kernel that models correlation over policies.
Experiments on three control domains showed that our solution is very effective. Furthermore, our experiments ablated the different components of the solution, highlighting the value of modelling correlation via the kernel, active selection, and incorporation of OPE estimates in the observation model.

Limitations and future work. Having tested the method and demonstrated its efficacy in simulation, our future plan is to conduct experiments in real domains. The A-OPS approach could be of use in many safety-critical applications. This work does not deal with this question explicitly, as we assume safety constraints to be implemented directly on the hardware. This means that if an unsafe policy is attempted, the safety controller would terminate the policy before reaching the end of an episode, resulting in low reward and thus discouraging A-OPS from trying this policy again. Alternatively, we could alleviate this limitation by applying constrained BO techniques [62] to search for the best policy without violating safety constraints. In this work, we explore the policy structure based on the offline dataset, and use every new trajectory only to obtain a noisy observation of policy values. This allows a clean Bayesian optimization formulation. Going forward, more information could be extracted from the new trajectories, for example, to further improve OPE estimates, the GP kernel, or the policies themselves. However, using this information would make the solution much more complex, and we may need to take the additional computational cost into account. Finally, to construct a better policy kernel, we may investigate how to measure the policy similarity on the most informative subset of states in the offline dataset.

Acknowledgements. We would like to thank Bobak Shahriari for the useful discussions and feedback on the paper.

## References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, 2020.

[3] Takayuki Akiyama, Hirotaka Hachiya, and Masashi Sugiyama. Efficient exploration through active learning for value function approximation in reinforcement learning. Neural Networks, 2010.

[4] Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In Conference on Computational Learning Theory, 2010.

[5] Javad Azimi, Ali Jalali, and Xiaoli Zhang Fern. Hybrid batch Bayesian optimization. In International Conference on Machine Learning, 2012.

[6] Philip Ball, Jack Parker-Holder, Aldo Pacchiano, Krzysztof Choromanski, and Stephen Roberts. Ready policy one: World building through active learning. In International Conference on Machine Learning, 2020.

[7] Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, TB Dhruva, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. In International Conference on Learning Representations, 2018.

[8] E. Brochu, T. Brochu, and N. de Freitas. A Bayesian interactive optimization approach to procedural animation design. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2010.

[9] E. Brochu, N. de Freitas, and A. Ghosh. Active preference learning with discrete choice data. In Advances in Neural Information Processing Systems, 2007.
[10] Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning. In Robotics: Science and Systems Conference, 2020.

[11] Albin Cassirer, Gabriel Barth-Maron, Eugene Brevdo, Sabela Ramos, Toby Boyd, Thibault Sottiaux, and Manuel Kroiss. Reverb: A framework for experience replay. arXiv preprint arXiv:2102.04736, 2021.

[12] Yutian Chen, Aja Huang, Ziyu Wang, Ioannis Antonoglou, Julian Schrittwieser, David Silver, and Nando de Freitas. Bayesian optimization in AlphaGo. arXiv preprint arXiv:1812.06855, 2018.

[13] Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning, 2018.

[14] Christian Daniel, Malte Viering, Jan Metz, Oliver Kroemer, and Jan Peters. Active reward learning. In Robotics: Science and Systems Conference, 2014.

[15] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.

[16] Gabriel Dulac-Arnold, Daniel J. Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.

[17] Arkady Epshteyn, Adam Vogel, and Gerald DeJong. Active reinforcement learning. In International Conference on Machine Learning, 2008.

[18] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, 2018.

[19] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.

[20] Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine, and Thomas Paine. Benchmarks for deep off-policy evaluation. In International Conference on Learning Representations, 2021.

[21] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2019.

[22] Caglar Gulcehre, Sergio Gómez Colmenarejo, Ziyu Wang, Jakub Sygnowski, Thomas Paine, Konrad Zolna, Yutian Chen, Matthew Hoffman, Razvan Pascanu, and Nando de Freitas. Regularized behavior value estimation. arXiv preprint arXiv:2103.09575, 2021.

[23] Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Tom Le Paine, Sergio Gomez Colmenarejo, Konrad Zolna, Rishabh Agarwal, Josh Merel, Daniel Mankowitz, Cosmin Paduraru, et al. RL Unplugged: A suite of benchmarks for offline reinforcement learning. arXiv preprint arXiv:2006.13888, 2020.

[24] Jean Harb, Tom Schaul, Doina Precup, and Pierre-Luc Bacon. Policy evaluation networks. arXiv preprint arXiv:2002.11833, 2020.

[25] Matt Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Feryal Behbahani, Tamara Norman, Abbas Abdolmaleki, Albin Cassirer, Fan Yang, Kate Baumli, et al. Acme: A research framework for distributed reinforcement learning. arXiv preprint arXiv:2006.00979, 2020.

[26] Matthew Hoffman, Bobak Shahriari, and Nando de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Artificial Intelligence and Statistics, 2014.

[27] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, 2011.
[28] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.

[29] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, 2018.

[30] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On Bayesian upper confidence bounds for bandit problems. In Artificial Intelligence and Statistics, 2012.

[31] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951, 2020.

[32] Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. Discovering general-purpose active learning strategies. arXiv preprint arXiv:1810.04114, 2019.

[33] Andreas Krause and Cheng S Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, 2012.

[34] David Krueger, Jan Leike, Owain Evans, and John Salvatier. Active reinforcement learning: Observing rewards at a cost. arXiv preprint arXiv:2011.06709, 2020.

[35] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.

[36] Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning, 2019.

[37] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 2016.

[38] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

[39] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In World Wide Web, 2010.

[40] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In ACM International Conference on Web Search and Data Mining, 2011.

[41] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, 2018.

[42] Daniel J Lizotte, Tao Wang, Michael H Bowling, and Dale Schuurmans. Automatic gait optimization with Gaussian process regression. In International Joint Conference on Artificial Intelligence, 2007.

[43] R. Martinez-Cantin, N. de Freitas, E. Brochu, J. Castellanos, and A. Doucet. A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots, 2009.

[44] Martino Migliavacca, Alessio Pecorino, Matteo Pirotta, Marcello Restelli, and Andrea Bonarini. Fitted policy search: Direct policy search using a batch reinforcement learning approach. In Workshop on Evolutionary and Reinforcement Learning for Autonomous Robot Systems, 2010.

[45] Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Sungmin Bae, et al. Chip placement with deep reinforcement learning. arXiv preprint arXiv:2004.10746, 2020.
[46] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[47] Remi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, 2016.

[48] Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, 2019.

[49] Xinkun Nie, Emma Brunskill, and Stefan Wager. Learning when-to-treat policies. arXiv preprint arXiv:1905.09751, 2019.

[50] Tom Le Paine, Cosmin Paduraru, Andrea Michi, Caglar Gulcehre, Konrad Zolna, Alexander Novikov, Ziyu Wang, and Nando de Freitas. Hyperparameter selection for offline reinforcement learning. arXiv preprint arXiv:2007.09055, 2020.

[51] Jack Parker-Holder, Aldo Pacchiano, Krzysztof M Choromanski, and Stephen J Roberts. Effective diversity in population based reinforcement learning. In Advances in Neural Information Processing Systems, 2020.

[52] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, 2000.

[53] Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning, 2000.

[54] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.

[55] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 2020.

[56] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 2015.

[57] Noah Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing what worked: Behavior modelling priors for offline reinforcement learning. In International Conference on Learning Representations, 2020.

[58] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

[59] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012.

[60] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, 2010.

[61] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.

[62] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning, 2015.
[63] Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning, 2009.

[64] Richard S Sutton, A Rupam Mahmood, and Martha White. An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research, 2016.

[65] Yuval Tassa, Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, and Nicolas Heess. dm_control: Software and tasks for continuous control. arXiv preprint arXiv:2006.12983, 2020.

[66] Georgios Theocharous, Philip S Thomas, and Mohammad Ghavamzadeh. Personalized ad recommendation systems for life-time value optimization with guarantees. In International Joint Conference on Artificial Intelligence, 2015.

[67] Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and Q-function learning for off-policy evaluation. In International Conference on Machine Learning, 2020.

[68] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In National Conference on Artificial Intelligence (AAAI), 2016.

[69] Cameron Voloshin, Hoang M Le, Nan Jiang, and Yisong Yue. Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854, 2019.

[70] Ziyu Wang, Alexander Novikov, Konrad Zołna, Jost Tobias Springenberg, Scott Reed, Bobak Shahriari, Noah Siegel, Josh Merel, Caglar Gulcehre, Nicolas Heess, and Nando de Freitas. Critic regularized regression. In Advances in Neural Information Processing Systems, 2020.

[71] Mengjiao Yang, Bo Dai, Ofir Nachum, George Tucker, and Dale Schuurmans. Offline policy selection under uncertainty. arXiv preprint arXiv:2012.06919, 2020.

[72] Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. Off-policy evaluation via the regularized Lagrangian. arXiv preprint arXiv:2007.03438, 2020.

[73] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239, 2020.

[74] Michael R Zhang, Tom Le Paine, Ofir Nachum, Cosmin Paduraru, George Tucker, Ziyu Wang, and Mohammad Norouzi. Autoregressive dynamics models for offline policy evaluation and optimization. arXiv preprint arXiv:2104.13877, 2021.