# Reward-Free Kernel-Based Reinforcement Learning

Sattar Vakili¹, Farhang Nabiei¹, Da-shan Shiu¹, Alberto Bernacchia¹

¹MediaTek Research. Correspondence to: Sattar Vakili.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

**Abstract.** Achieving sample efficiency in Reinforcement Learning (RL) primarily hinges on efficient exploration of the underlying environment, but it is still unknown what the best exploration strategies are in different settings. We consider the reward-free RL problem, which operates in two phases: an exploration phase, where the agent gathers exploration trajectories over episodes irrespective of any predetermined reward function, and a subsequent planning phase, where a reward function is introduced. The agent then utilizes the episodes from the exploration phase to compute a near-optimal policy. Existing algorithms and sample complexities for reward-free RL are limited to tabular, linear, or very smooth function approximations, leaving the problem largely open for more general cases. We consider a broad range of kernel-based function approximations, including non-smooth kernels, and propose an algorithm based on adaptive domain partitioning. We show that our algorithm achieves order-optimal sample complexity for a large class of common kernels, which includes Matérn and Neural Tangent kernels.

## 1. Introduction

Reinforcement Learning (RL) policies using complex function approximations have been empirically effective in various fields, including gaming (Silver et al., 2016; Lee et al., 2018; Vinyals et al., 2019), autonomous driving (Kahn et al., 2017), microchip design (Mirhoseini et al., 2021), robot control (Kalashnikov et al., 2018), and algorithm search (Fawzi et al., 2022). To achieve sample efficiency, these RL policies must learn the transition model, either directly or indirectly, necessitating efficient exploration. In the context of offline RL, the agent learns the optimal policy solely from a pre-collected offline dataset, without any further interaction with the environment. Therefore, the offline dataset should adequately represent the trajectory produced by the optimal policy.

In real-world RL applications, the reward function is often crafted by the learner based on domain knowledge. The learner may have multiple reward functions to select from, or may employ an adaptive algorithm for reward design. In such situations, it is preferable to have an offline dataset that encapsulates all potential optimal trajectories associated with a variety of reward functions. With such a comprehensive offline dataset, the RL agent can estimate the corresponding policy for any arbitrary reward function. To methodically study this problem, we concentrate on the reward-free RL framework, which includes an exploration phase and a planning phase (Figure 1). In the exploration phase, the agent interacts with the environment without any pre-determined rewards and gathers empirical trajectories over episodes for the subsequent planning phase. During the planning phase, the agent uses the offline data accumulated in the exploration phase to compute the optimal policy for a given extrinsic reward function r, without further interactions with the environment.

Figure 1. The reward-free RL framework: in the exploration phase, the agent collects N trajectories without knowing the reward; in the planning phase, the reward r is revealed and the agent designs a policy π for reward r.
The reward-free RL framework has been progressively examined under increasingly complex models (tabular, linear, kernel-based, and deep-learning-based) in several works, including (Jin et al., 2020a; Wang et al., 2020; Qiu et al., 2021). The existing literature adequately addresses the tabular and linear settings. It tends to falter, however, providing only partial and incomplete results, when dealing with the more intricate kernel-based and deep-learning-based settings. The contribution of this paper is to further the literature by providing order-optimal results in the kernel-based setting, applicable to a broad class of common kernels. Our main objectives are: (i) designing algorithms for both the exploration and planning phases in the reward-free RL framework with kernel-based modeling; (ii) improving the existing results by proving order-optimal sample complexities for a broad class of common kernels.

To touch on the technical importance of our results, we show in Table 1 the sample complexity in various settings. The sample complexity refers to the number of exploration episodes collected in the exploration phase to obtain ε-optimal policies in the planning phase, where a policy is referred to as ε-optimal if its value function is at most ε away from the optimal value function (see Definition 2.1). Here, $\mathcal{S}$, $\mathcal{A}$ and $H \in \mathbb{N}$ represent the state space, the action space and the episode length, respectively.

In the kernel-based setting, previous work (Qiu et al., 2021) provided algorithms whose sample complexities are, up to logarithmic factors, order optimal with respect to ε. However, these results are primarily applicable to very smooth kernels, specifically those characterized by exponentially decaying eigenvalues. This limitation effectively excludes significant kernel families, such as Matérn and Neural Tangent kernels. For a more nuanced understanding of the existing results, let $\{\lambda_m > 0\}_{m=1}^{\infty}$ represent the Mercer eigenvalues of a kernel k, sequenced in diminishing order, and let $\{\phi_m\}_{m=1}^{\infty}$ denote the corresponding eigenfeatures (for details, refer to Section 2.3). The kernel k is characterized as having an exponential eigendecay when its eigenvalues $\lambda_m$ diminish exponentially with respect to m, specifically $\lambda_m = O(\iota^m)$ for some $0 < \iota < 1$; an example is the Squared Exponential kernel. In contrast, the kernel k is described as having a polynomial eigendecay when its eigenvalues $\lambda_m$ decline at a rate no slower than $m^{-p}$ for some $p > 1$. This decay profile is characteristic of numerous kernels, both of practical importance and theoretical interest, such as the Matérn family of kernels (Borovitskiy et al., 2020) and the Neural Tangent (NT) kernel (Arora et al., 2019). Specifically, for a Matérn kernel with smoothness parameter ν in a d-dimensional space, $p = \frac{2\nu+d}{d}$ (e.g., see Yang et al., 2020a). Similarly, for an NT kernel with $s-1$ times differentiable activations, $p = \frac{2s-1+d}{d}$ (Vakili et al., 2021b).

Leveraging the scaling of the kernel spectrum with the size of the domain can improve the sample complexity. We focus on kernels with polynomial eigendecay within a hypercubical domain of side length ρ, where the eigenvalues scale as $m^{-p}\rho^{\alpha}$ for some $\alpha > 0$, as detailed in Definition 4.1. This approach covers a broad spectrum of prevalent kernels, such as the Matérn family, where $\alpha = 2\nu$ (a small illustration of these parameters is given in the sketch below). While we employ a hypercube domain for technical consistency, this assumption is flexible and can extend to other regular, compact subsets of $\mathbb{R}^d$.
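To make the roles of $p$, $\alpha$ and the resulting rates concrete, the following small Python helper (ours, for illustration only; the function names are not part of any library or of the paper) computes the eigendecay parameters for the Matérn and NT kernels as given above, together with the sample-complexity exponent $2+\frac{2d}{\alpha}$ established later in Theorem 4.5.

```python
def matern_eigendecay(nu: float, d: int):
    """Eigendecay parameters for a Matérn kernel with smoothness nu on a
    d-dimensional hypercube: lambda_m <= C_p * m^{-p} * rho^alpha (Definition 4.1)."""
    p = (2 * nu + d) / d
    alpha = 2 * nu
    return p, alpha


def nt_eigendecay(s: int, d: int):
    """Eigendecay exponent for an NT kernel with (s-1)-times differentiable
    activations; its RKHS matches that of a Matérn kernel with nu = s - 1/2."""
    return matern_eigendecay(nu=s - 0.5, d=d)


def sample_complexity_exponent(alpha: float, d: int):
    """Exponent q such that N = O((H^3 / eps)^q) exploration episodes suffice
    (Theorem 4.5). For Matérn, alpha = 2*nu gives q = 2 + d/nu."""
    return 2 + 2 * d / alpha


if __name__ == "__main__":
    p, alpha = matern_eigendecay(nu=2.5, d=4)
    print(p, alpha, sample_complexity_exponent(alpha, d=4))  # 2.25, 5.0, 3.6
```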
Our contribution lies in devising algorithms for both the exploration and planning phases of the reward-free RL framework, establishing sample complexities for kernels with polynomially decaying eigenvalues. We demonstrate that $O\big((\frac{H^3}{\epsilon})^{2+\frac{2d}{\alpha}}\big)$ exploration episodes are sufficient to guarantee ε-optimal policies during planning. When applied to the Matérn kernel, our sample complexity becomes $O\big((\frac{H^3}{\epsilon})^{2+\frac{d}{\nu}}\big)$. This is a significant improvement, contrasted with existing work (Qiu et al., 2021), where the sample complexity becomes unbounded for many parameter values of the Matérn kernel, such as when $\nu < \frac{d(d+1)}{2}$. In addition, our sample complexity matches the $\Omega\big((\frac{1}{\epsilon})^{2+\frac{d}{\nu}}\big)$ lower bound for the kernel-based bandit problem with Matérn kernel (see Scarlett et al., 2017, Table I), which can be considered a degenerate special case with H = 1, |S| = 1, indicating that the performance in ε cannot be further improved.

Table 1. Existing results on the sample complexity of reward-free RL. The sample complexity refers to the number of exploration episodes collected in the exploration phase to obtain ε-optimal policies in the planning phase. The notation $\mathcal{S}$, $\mathcal{A}$ and H denotes the state space, the action space and the episode length, respectively. The parameter d denotes the state-action space dimension and α represents smoothness properties of the kernel. While the existing results fail to obtain even finite sample complexities in the general kernel-based setting, we report an order-optimal sample complexity, given in the last row of this table.

| Setting | Sample complexity |
| --- | --- |
| Tabular (Ménard et al., 2021) | $O\big(\lvert\mathcal{S}\rvert^2\lvert\mathcal{A}\rvert H^3/\epsilon^2\big)$ |
| Linear (Wagenmaker et al., 2022) | $O\big(d^2 H^5/\epsilon^2\big)$ |
| Kernel-based with exponential eigendecay (Qiu et al., 2021) | $O\big(\tfrac{H^6}{\epsilon^2}\,\mathrm{polylog}(\tfrac{1}{\epsilon})\big)$ |
| Kernel-based with polynomial eigendecay (this work) | $O\big((\tfrac{H^3}{\epsilon})^{2+\frac{2d}{\alpha}}\big)$ |

We obtain these sample complexities by designing algorithms tailored for the polynomial class of kernels. The main design ideas include leveraging a hypothetical reward for the exploration phase proportional to the uncertainty in kernel-based regression, and an adaptive domain partitioning technique inspired by the recent work of Vakili & Olkhovskaya (2023). In this method, the algorithm creates a partitioning of the state-action domain and builds value function estimates only based on the observations within the same partition element. See details in Section 3. In Section 2, we present the episodic Markov Decision Process (MDP) setting, formalize the reward-free RL framework and our assumptions, and overview kernel ridge regression. In Section 4, we present the sample complexity analysis.

### 1.1. Literature Review

The reward-free RL framework under the episodic setting has been studied with tabular models in Jin et al. (2020a); Zhang et al. (2020); Ménard et al. (2021); Kaufmann et al. (2021), and with linear models in Wang et al. (2020); Zanette et al. (2020c); Wagenmaker et al. (2022), with the best sample complexities reported in Table 1. The problem has also been studied under the linear mixture model in Zhang et al. (2021); Chen et al. (2021); Zhang et al. (2023). The sample complexity of the RL problem in a discounted MDP setting with an infinite horizon has been considered under various tabular, linear and kernel-based settings in (Kearns & Singh, 1998; Azar et al., 2013; Sidford et al., 2018; Agarwal et al., 2020; Yang & Wang, 2019; Yeh et al., 2023). These works, however, assume the existence of a generative oracle (Kakade, 2003), which provides sample transitions from any state-action pair of the algorithm's choice.
This assumption simplifies the problem compared to the reward-free RL framework considered in this work, where the agent must follow the MDP trajectory within each episode and cannot arbitrarily query transitions from state-action pairs. Specifically, we design an exploration algorithm based on uncertainty estimates obtained from the kernel-based model, which adds significant challenges to the analysis that are not present in the oracle setting. Our algorithm design is inspired by the domain partitioning technique used in Vakili & Olkhovskaya (2023), as well as in Janz et al. (2020) for kernel-based bandits. In comparison, Vakili & Olkhovskaya (2023) considered the standard episodic RL setting, where the reward function is known to the policy a priori. That differs from the reward-free RL framework considered in this work, and their results do not apply here.

There is an extensive literature on the analysis of RL policies which do not rely on a generative model or an exploratory behavioral policy. The literature has primarily focused on the tabular setting (Jin et al., 2018; Auer et al., 2008; Bartlett & Tewari, 2012). Recent literature has placed a notable emphasis on employing function approximation in RL, particularly within the context of generalized linear settings. This approach involves representing the value function or the transition model through a linear transformation of a well-defined feature mapping. Important contributions include the work of Jin et al. (2020b); Yao et al. (2014), as well as subsequent studies by Russo (2019); Zanette et al. (2020a;b); Neu & Pike-Burke (2020); Yang & Wang (2020). Furthermore, there have been several efforts to extend these techniques to a kernelized setting, as explored in Yang et al. (2020a); Yang & Wang (2020); Chowdhury & Gopalan (2019); Yang et al. (2020b); Domingues et al. (2021). These works are also inspired by methods originally designed for linear bandits (Abbasi-Yadkori et al., 2011; Agrawal & Goyal, 2013), as well as kernelized bandits (Srinivas et al., 2010; Valko et al., 2013; Chowdhury & Gopalan, 2017).

## 2. Problem Formulation

In this section, we present the episodic MDP setting, the reward-free RL framework, background on kernel methods, and our technical assumptions.

### 2.1. Episodic MDP

An episodic MDP can be described by the tuple $M = (\mathcal{S}, \mathcal{A}, H, P, r)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, the integer H is the length of each episode, $r = \{r_h\}_{h=1}^H$ are the reward functions and $P = \{P_h\}_{h=1}^H$ are the transition probability distributions.¹ We use the notation $\mathcal{Z} = \mathcal{S}\times\mathcal{A}$ to denote the state-action space. For each $h \in [H]$, the reward $r_h : \mathcal{Z} \to [0, 1]$ is the reward function at step h, which is assumed to be deterministic for simplicity, and $P_h(\cdot \mid s, a)$ is the unknown transition probability distribution on $\mathcal{S}$ for the next state from the state-action pair (s, a).

Figure 2. Illustration of an episodic MDP with an episode of length H.

A policy $\pi = \{\pi_h\}_{h=1}^H$, at each step h, determines the (possibly random) action $\pi_h : \mathcal{S} \to \mathcal{A}$ taken by the agent at state s. At the beginning of each episode, the environment picks an arbitrary state $s_1$. The agent determines a policy $\pi = \{\pi_h\}_{h=1}^H$. Then, at each step $h \in [H]$, the agent observes the state $s_h \in \mathcal{S}$ and picks an action $a_h = \pi_h(s_h)$. The new state $s_{h+1}$ is then drawn from the transition distribution $P_h(\cdot \mid s_h, a_h)$. The episode ends when the agent receives the final reward $r_H(s_H, a_H)$. A minimal sketch of this interaction protocol is given below.
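The following minimal Python sketch (ours, purely illustrative; the `policy`, `transition` and `reward` callables are hypothetical placeholders, not the paper's algorithm) rolls out one episode of length H under the protocol just described.

```python
from typing import Callable, List, Tuple

State = Tuple[float, ...]   # a point in the state space S
Action = Tuple[float, ...]  # a point in the action space A

def rollout_episode(
    s1: State,
    policy: List[Callable[[State], Action]],            # pi_h for h = 1..H
    transition: List[Callable[[State, Action], State]],  # samples s' ~ P_h(.|s, a)
    reward: List[Callable[[State, Action], float]],      # r_h(s, a) in [0, 1]
    H: int,
):
    """At step h: observe s_h, take a_h = pi_h(s_h), receive r_h(s_h, a_h),
    then transition to s_{h+1} ~ P_h(.|s_h, a_h)."""
    s = s1
    trajectory, total_reward = [], 0.0
    for h in range(H):
        a = policy[h](s)
        total_reward += reward[h](s, a)
        trajectory.append((s, a))
        s = transition[h](s, a)
    return trajectory, total_reward
```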
We are interested in maximizing the expected total reward in the episode, starting at step h, i.e., the value function, defined as
$$V^\pi_h(s) = \mathbb{E}_\pi\left[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\middle|\, s_h = s\right], \qquad \forall s \in \mathcal{S},\ h \in [H], \qquad (1)$$
where the expectation is taken with respect to the randomness in the trajectory $\{(s_h, a_h)\}_{h=1}^H$ obtained by the policy π.

¹We intentionally do not use the standard term "transition kernel" for $P_h$, to avoid confusion with the term "kernel" in kernel-based learning.

It can be shown that, under mild assumptions (e.g., continuity of $P_h$, compactness of $\mathcal{Z}$, and boundedness of r), there exists an optimal policy $\pi^*$ which attains the maximum possible value of $V^\pi_h(s)$ at every step and at every state (e.g., see Puterman, 2014). We use the notation $V^*_h(s) = \max_\pi V^\pi_h(s)$, for all $s \in \mathcal{S}$, $h \in [H]$. By definition, $V^{\pi^*}_h = V^*_h$. An ε-optimal policy is defined as follows.

**Definition 2.1** (ε-optimal policy). For ε > 0, a policy π is called ε-optimal if it achieves near-optimal values from any initial state, as follows: $V^\pi_1(s) \ge V^*_1(s) - \epsilon$, for all $s \in \mathcal{S}$.

For a value function V, we define the following notation:
$$[P_h V](s, a) := \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)}\big[V(s')\big]. \qquad (2)$$
We also define the state-action value function $Q^\pi_h : \mathcal{Z} \to [0, H]$ as follows:
$$Q^\pi_h(s, a) = \mathbb{E}_\pi\left[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\middle|\, s_h = s,\ a_h = a\right],$$
where the expectation is taken with respect to the randomness in the trajectory $\{(s_h, a_h)\}_{h=1}^H$ obtained by the policy π. The Bellman equation associated with a policy π is then represented as
$$Q^\pi_h(s, a) = r_h(s, a) + [P_h V^\pi_{h+1}](s, a), \qquad V^\pi_h(s) = \max_{a} Q^\pi_h(s, a), \qquad V^\pi_{H+1} = 0.$$
We may specify the reward function in the $V^\pi$, $Q^\pi$, $V^*$, $Q^*$ notations for clarity, for example $V^\pi(s; r)$ and $Q^*(z; r)$.

### 2.2. Reward-Free RL Framework

We aim to learn ε-optimal policies using as small a number of collected exploration episodes as possible. In particular, we consider the reward-free RL framework that consists of two phases: exploration and planning. In the exploration phase, we collect N exploration episodes $\{(s^n_1, a^n_1, s^n_2, a^n_2, \ldots, s^n_H)\}_{n=1}^N$ without any revealed reward function. Then, in the planning phase, the reward r is revealed, and we design a policy for reward r using the trajectories collected in the exploration phase. We refer to N as the sample complexity of designing an ε-optimal policy. Under certain assumptions, the question is: how many exploration episodes are required to obtain ε-optimal policies? We provide an answer in Theorem 4.5.

### 2.3. Kernel Ridge Regression

We assume that the unknown transition probability distribution can be represented using a reproducing kernel Hilbert space (RKHS); see Assumption 2.2. This is a very general assumption, considering that the RKHS of common kernels can approximate almost all continuous functions on the compact subsets of $\mathbb{R}^d$ (Srinivas et al., 2010). Consider a positive definite kernel $k : \mathcal{Z}\times\mathcal{Z} \to \mathbb{R}$. Let $\mathcal{H}_k$ be the RKHS induced by k, where $\mathcal{H}_k$ contains a family of functions defined on $\mathcal{Z}$. Let $\langle\cdot,\cdot\rangle_{\mathcal{H}_k} : \mathcal{H}_k\times\mathcal{H}_k \to \mathbb{R}$ and $\|\cdot\|_{\mathcal{H}_k} : \mathcal{H}_k \to \mathbb{R}$ denote the inner product and the norm of $\mathcal{H}_k$, respectively. The reproducing property implies that for all $f \in \mathcal{H}_k$ and $z \in \mathcal{Z}$, $\langle f, k(\cdot, z)\rangle_{\mathcal{H}_k} = f(z)$. Without loss of generality, we assume $k(z, z) \le 1$ for all z. Mercer's theorem implies that, under certain mild conditions, k can be represented using an infinite-dimensional feature map:
$$k(z, z') = \sum_{m=1}^{\infty} \lambda_m\,\phi_m(z)\,\phi_m(z'), \qquad (4)$$
where $\lambda_m > 0$, and $\sqrt{\lambda_m}\,\phi_m \in \mathcal{H}_k$ form an orthonormal basis of $\mathcal{H}_k$. In particular, any $f \in \mathcal{H}_k$ can be represented using this basis and weights $w_m \in \mathbb{R}$ as
$$f = \sum_{m=1}^{\infty} w_m\,\lambda_m^{1/2}\,\phi_m, \qquad (5)$$
where $\|f\|^2_{\mathcal{H}_k} = \sum_{m=1}^{\infty} w_m^2$.
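As a finite-dimensional illustration of the Mercer expansion (4) (a sketch we add for intuition; it is not part of the paper's method), one can eigendecompose the kernel matrix on a grid: the eigenvalues of the scaled kernel matrix serve as a numerical proxy for the Mercer eigenvalues, and truncating the expansion approximately reconstructs the kernel.

```python
import numpy as np

def matern_half_kernel(x, y, lengthscale=0.2):
    """Matérn kernel with nu = 1/2 (exponential kernel) on [0, 1]."""
    return np.exp(-np.abs(x[:, None] - y[None, :]) / lengthscale)

# Discretize Z = [0, 1]; eigendecomposing K/n approximates the Mercer pairs
# (lambda_m, phi_m) under the uniform measure.
n = 400
z = np.linspace(0.0, 1.0, n)
K = matern_half_kernel(z, z)
eigvals, eigvecs = np.linalg.eigh(K / n)
order = np.argsort(eigvals)[::-1]                 # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# For Matérn with nu = 1/2 in d = 1, p = (2*nu + d)/d = 2, so lambda_m ~ m^{-2}.
print(eigvals[:5])

# Truncated Mercer reconstruction: k(z, z') ~ sum_{m<=M} lambda_m phi_m(z) phi_m(z').
M = 50
K_trunc = (eigvecs[:, :M] * eigvals[:M]) @ eigvecs[:, :M].T * n
print(np.max(np.abs(K - K_trunc)))                # reconstruction error shrinks as M grows
```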
A formal statement and the details are provided in Appendix A. We refer to $\lambda_m$ and $\phi_m$ as the (Mercer) eigenvalues and eigenfeatures of k, respectively.

Kernel-based models provide powerful predictors and uncertainty estimators, which can be leveraged to guide the RL algorithm. In particular, consider a fixed unknown function $f \in \mathcal{H}_k$ and a set $Z_n = \{z_i\}_{i=1}^n \subset \mathcal{Z}$ of n inputs. Assume n noisy observations $\{Y(z_i) = f(z_i) + \varepsilon_i\}_{i=1}^n$ are provided, where $\varepsilon_i$ are independent zero-mean noise terms. Kernel ridge regression provides the following predictor and uncertainty estimate, respectively (see, e.g., Schölkopf et al., 2002):
$$\mu_{n,f}(z) = k^{\top}_{Z_n}(z)\,\big(K_{Z_n} + \tau^2 I_n\big)^{-1} Y_{Z_n}, \qquad \big(\sigma_n(z)\big)^2 = k(z, z) - k^{\top}_{Z_n}(z)\,\big(K_{Z_n} + \tau^2 I_n\big)^{-1} k_{Z_n}(z), \qquad (6)$$
where $k_{Z_n}(z) = [k(z, z_1), \ldots, k(z, z_n)]^{\top}$ is an $n\times 1$ vector of the kernel values between z and the observations, $K_{Z_n} = [k(z_i, z_j)]_{i,j=1}^n$ is the $n\times n$ kernel matrix, $Y_{Z_n} = [Y(z_1), \ldots, Y(z_n)]^{\top}$ is the $n\times 1$ observation vector, $I_n$ is the n-dimensional identity matrix, and $\tau > 0$ is a free regularization parameter. The predictor and uncertainty estimate can be interpreted as the posterior mean and variance of a surrogate centered Gaussian process (GP) model with covariance k and zero-mean Gaussian noise with variance $\tau^2$ (e.g., see Williams & Rasmussen, 2006).

### 2.4. Technical Assumption

We assume that the unknown transition probability distributions $P_h(s \mid \cdot, \cdot)$ of the MDP belong to the 1-ball of the RKHS. We use the notation $\mathcal{B}_{k,R} = \{f : \|f\|_{\mathcal{H}_k} \le R\}$ to denote the R-ball of the RKHS.

**Assumption 2.2.** We assume
$$P_h(s \mid \cdot, \cdot) \in \mathcal{B}_{k,1}, \qquad \forall h \in [H],\ s \in \mathcal{S}. \qquad (7)$$

This is a mild assumption, considering the generality of RKHSs, and it is also assumed to hold in (Qiu et al., 2021; Yang et al., 2020a). Similar assumptions are made in linear MDPs, which are significantly more restrictive (e.g., see Wang et al., 2020; Jin et al., 2020b). A consequence of Assumption 2.2 is that for any integrable $V : [0,1]^d \to [0,H]$, $[P_h V] \in \mathcal{B}_{k,H}$. This is formalized in the following lemma (see Yeh et al., 2023, Lemma 3).

**Lemma 2.3.** Consider any integrable $V : [0, 1]^d \to [0, H]$. Under Assumption 2.2, we have
$$[P_h V] \in \mathcal{B}_{k,H}. \qquad (8)$$

## 3. Algorithm

In this section, we design novel algorithms for both the exploration and planning phases in the kernel-based reward-free RL framework described in Section 2. The two main ideas in our design are (i) the use of a hypothetical reward in the exploration phase and (ii) domain partitioning in the application of kernel-based confidence intervals.

**Hypothetical reward.** In the exploration phase, we craft a carefully chosen hypothetical reward function that incentivizes efficient exploration. We choose the term hypothetical reward since it is different from the actual rewards revealed to the agent later in the planning phase. In other words, in the exploration phase, when the reward is not yet revealed to the agent, we design a notion of reward that encourages efficient exploration. We use the uncertainty estimates provided by kernel regression to define the hypothetical reward and motivate exploration of uncertain regions in the state-action space. In particular, for episode n in the exploration phase, we choose the hypothetical reward $\tilde{r}^n_h = \beta(\delta)\,\sigma^n_h / H$, where $\sigma^n_h$ is the posterior standard deviation of the kernel-based model conditioned on (possibly a subset of) the past n−1 episodes, and β(δ) is a 1−δ confidence-interval multiplier that is specified in Theorem 4.5.
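The following Python sketch (ours, for illustration; not the authors' code) implements the kernel ridge predictor and uncertainty estimate in (6) and the resulting hypothetical reward $\tilde{r} = \beta(\delta)\,\sigma/H$ used in the exploration phase.

```python
import numpy as np

def krr_predict(k, Z_obs, Y_obs, z_query, tau=1.0):
    """Kernel ridge regression predictor and uncertainty estimate, Eq. (6).

    k       : kernel function k(Z1, Z2) -> (len(Z1), len(Z2)) matrix
    Z_obs   : (n, d) observed inputs;  Y_obs : (n,) noisy observations
    z_query : (m, d) query points
    Returns (mu, sigma), each of shape (m,).
    """
    n = len(Z_obs)
    prior_var = np.diag(k(z_query, z_query))
    if n == 0:                                    # no data: prior mean 0, prior std
        return np.zeros(len(z_query)), np.sqrt(prior_var)
    K = k(Z_obs, Z_obs) + tau**2 * np.eye(n)
    k_star = k(z_query, Z_obs)                    # (m, n)
    mu = k_star @ np.linalg.solve(K, Y_obs)
    var = prior_var - np.einsum(
        "ij,ij->i", k_star, np.linalg.solve(K, k_star.T).T
    )
    return mu, np.sqrt(np.maximum(var, 0.0))

def hypothetical_reward(sigma, beta, H):
    """Exploration-phase hypothetical reward r~^n_h = beta(delta) * sigma^n_h / H."""
    return beta * sigma / H
```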
Using this hypothetical reward incentivizes the agent to move along the Markovian trajectory towards state-actions with higher uncertainty. This is different from pure, uniform exploration. It is also different from (Yeh et al., 2023), where the existence of a generative oracle was assumed that can provide transition samples for state-action pairs of the agent's choice. The reward-free RL framework considered in this work is more sophisticated in the sense that the agent must stay on the Markovian trajectory and cannot observe arbitrary state-actions.

Figure 3. An example of adaptive domain partitioning on $\mathcal{Z} = [0, 1]^2$. The dots represent a sequence of points. Partitions are created by dividing every square Z′ that satisfies the condition $\rho_{Z'}^{-\alpha} < N(Z') + 1$ into four equal smaller squares. Here, $\rho_{Z'}$ and N(Z′) denote the side length of a square Z′ and the number of dots within Z′, respectively. In this example, α is set to 3.

**Domain partitioning.** In both the exploration and planning phases, we use domain partitioning to improve the precision of the predictions and the analytical guarantees on the approximations. In particular, we partition the state-action space into subdomains and only use the observations within the same subdomain for kernel-based prediction (disregarding the rest of the observations). As shown in Vakili & Olkhovskaya (2023), this allows a tradeoff between the standard deviation of the kernel-based model and the confidence-interval width coefficient. An optimal procedure for domain partitioning leads to improved performance. An example of partitioning on a 2-dimensional state-action domain is shown in Figure 3.

### 3.1. Exploration Phase

The exploration algorithm simply employs optimistic least-squares value iteration (LSVI) with the hypothetical reward. Optimistic LSVI is a standard approach in episodic MDPs which, inspired by the principle of optimism in the face of uncertainty, computes an upper confidence bound on the state-action value function. For this purpose, kernel ridge regression is used to form the prediction $f^n_h$ and the uncertainty estimate $\sigma^n_h$ for the $[P_h V_{h+1}]$ term in the state-action value function. The upper confidence bound is then defined as
$$Q^n_h(\cdot,\cdot) = \Pi_{[0,H]}\big[\tilde{r}^n_h(\cdot,\cdot) + f^n_h(\cdot,\cdot) + \beta(\delta)\,\sigma^n_h(\cdot,\cdot)\big]. \qquad (9)$$

**Algorithm 1: Exploration Phase**
- Input: τ, β(δ), k, S, A, H, P.
- For all $h \in [H]$, let $\mathcal{S}^1_h = \{[0,1]^d\}$.
- **for** episode n = 1, 2, ..., N **do**
  - Receive the initial state $s^n_1$. Set $V^n_{H+1}(s) = 0$ for all s.
  - **for** step h = H, ..., 1 **do**: obtain $Q^n_h$ as in (9), based on (11) and (12); set $V^n_h(\cdot) = \max_{a\in\mathcal{A}} Q^n_h(\cdot, a)$.
  - **for** step h = 1, 2, ..., H **do**: take action $a^n_h \in \operatorname{argmax}_{a\in\mathcal{A}} Q^n_h(s^n_h, a)$; receive the next state $s^n_{h+1}$; split any element $Z' \in \mathcal{S}^{n-1}_h$ for which $\rho_{Z'}^{-\alpha} < N^n_h(Z') + 1$ along the middle of each side, and obtain $\mathcal{S}^n_h$.

The notation $\Pi_{[a,b]}[\cdot]$ denotes projection onto the interval [a, b]. Since the rewards are assumed to be at most 1, the state-action value function at step h is also bounded by H, hence the projection onto [0, H]. In episode n, $\pi^n$ is then the greedy policy with respect to $Q^n = \{Q^n_h\}_{h=1}^H$. Under Assumption 2.2, the estimate $f^n_h$, the parameter β(δ) and the uncertainty estimate $\sigma^n_h$ can all be designed using kernel ridge regression. To overcome the suboptimal performance guarantees rooted in the online confidence intervals of kernel ridge regression, we use a carefully designed domain partitioning.
The proposed algorithm partitions the state-action space $\mathcal{Z}$ into subdomains and builds the kernel ridge regression only based on the observations within each subdomain. By doing so, we obtain tighter confidence intervals, ultimately resulting in tighter performance guarantees. To formalize this procedure, we consider the state-action space $\mathcal{Z} \subseteq [0,1]^d$. Let $\mathcal{S}^n_h$, $h \in [H]$, $n \in [N]$, be sets of hypercubes overlapping only at edges and covering the entire $[0,1]^d$. For any hypercube $Z' \in \mathcal{S}^n_h$, we use $\rho_{Z'}$ to denote the length of any of its sides, and $N^n_h(Z')$ to denote the number of observations at step h in Z′ up to episode n:
$$N^n_h(Z') = \sum_{i=1}^{n} \mathbb{1}\{(s^i_h, a^i_h) \in Z'\}. \qquad (10)$$
For all $h \in [H]$, we initialize $\mathcal{S}^1_h = \{[0,1]^d\}$. At exploration episode n, for each step h, after observing a sample from $[P_h V^n_{h+1}]$ at $(s^n_h, a^n_h)$, we construct a new cover $\mathcal{S}^n_h$ as follows. We divide every element $Z' \in \mathcal{S}^{n-1}_h$ that satisfies $\rho_{Z'}^{-\alpha} < N^n_h(Z') + 1$ into two equal halves along each side, generating $2^d$ hypercubes. The parameter $\alpha > 0$ in the splitting rule is a constant specified in Definition 4.1. Subsequently, we define $\mathcal{S}^n_h$ as the set of newly created hypercubes together with the elements of $\mathcal{S}^{n-1}_h$ that were not split. The construction of the cover sets described above ensures that the number $N^n_h(Z')$ of observations within each cover element Z′ remains relatively small, taking into account the size of Z′, while also controlling the total number $|\mathcal{S}^n_h|$ of cover elements. The key parameter managing this tradeoff is α, which is carefully chosen to achieve an appropriate width for the confidence interval, as shown in Section 4.

Our exploration algorithm is derived by adopting the structure of optimistic LSVI, as described above, where the predictor and the uncertainty estimate are designed based on kernel ridge regression restricted to cover elements. In particular, for $z \in \mathcal{Z}$, let $Z^n_h(z) \in \mathcal{S}^n_h$ be the cover element at step h of episode n containing z. Define $\mathcal{Z}^n_h(z) = \{(s^i_h, a^i_h) \in Z^n_h(z) : i < n\}$ to be the set of past observations belonging to the same cover element as z. We then set
$$f^n_h(z) = k^{\top}_{\mathcal{Z}^n_h(z)}(z)\,\big(K_{\mathcal{Z}^n_h(z)} + \tau^2 I\big)^{-1} Y_{\mathcal{Z}^n_h(z)}, \qquad (11)$$
where $k_{\mathcal{Z}^n_h(z)}(z) = [k(z, z')]^{\top}_{z' \in \mathcal{Z}^n_h(z)}$ collects the kernel values between z and all observations z′ in $\mathcal{Z}^n_h(z)$, and $K_{\mathcal{Z}^n_h(z)} = [k(z', z'')]_{z', z'' \in \mathcal{Z}^n_h(z)}$ is the kernel matrix for the observations in $\mathcal{Z}^n_h(z)$. Starting from $V^n_{H+1} = 0$, for $h = H, \ldots, 1$, the observation vector
$$Y_{\mathcal{Z}^n_h(z)} = \big[V^n_{h+1}(s'_{h+1})\big]_{z' \in \mathcal{Z}^n_h(z)},$$
where $s'_{h+1}$ is drawn from the transition distribution $P_h(\cdot \mid z')$, collects the observed values at the observation points $z' \in \mathcal{Z}^n_h(z)$. The vectors $k_{\mathcal{Z}^n_h(z)}(z)$ and $Y_{\mathcal{Z}^n_h(z)}$ are $N^{n-1}_h(Z^n_h(z))$-dimensional column vectors, and $K_{\mathcal{Z}^n_h(z)}$ and I are $N^{n-1}_h(Z^n_h(z)) \times N^{n-1}_h(Z^n_h(z))$ matrices. Note that, keeping the Bellman equation in mind, $f^n_h$ is the (kernel ridge) predictor for $[P_h V^n_{h+1}]$ using some of the past n−1 observations $\{V^n_{h+1}(s^i_{h+1})\}_{i=1}^{n-1}$ at the points $\{z^i_h\}_{i=1}^{n-1}$. Recall that $\mathbb{E}\big[V^n_{h+1}(s^i_{h+1})\big] = [P_h V^n_{h+1}](z^i_h)$, where the expectation is taken with respect to $P_h(\cdot \mid z^i_h)$. The observation noise $V^n_{h+1}(s^i_{h+1}) - [P_h V^n_{h+1}](z^i_h)$ is due to the random transitions and is bounded by $H - h \le H$.

The exploration bonus is determined based on the uncertainty estimate of the kernel ridge regression model on cover elements, defined as
$$\sigma^n_h(z) = \sqrt{k(z, z) - k^{\top}_{\mathcal{Z}^n_h(z)}(z)\,\big(K_{\mathcal{Z}^n_h(z)} + \tau^2 I\big)^{-1} k_{\mathcal{Z}^n_h(z)}(z)}. \qquad (12)$$
The policy is then the greedy policy with respect to $Q^n_h$ given in (9); a small sketch of the splitting rule is given below.
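The following Python sketch (ours, with data structures of our choosing; an illustration of the splitting rule rather than the paper's implementation) maintains the cover for one step h: each hypercube tracks the observations assigned to it and is split into $2^d$ children once $\rho_{Z'}^{-\alpha} < N^n_h(Z') + 1$.

```python
import itertools
import numpy as np

class Cell:
    """A hypercube cover element Z' = prod_j [lower_j, lower_j + side]."""
    def __init__(self, lower, side):
        self.lower = np.asarray(lower, dtype=float)
        self.side = float(side)
        self.points = []                      # observations (z, y) assigned to this cell

    def contains(self, z):
        return bool(np.all((z >= self.lower) & (z <= self.lower + self.side)))

    def should_split(self, alpha):
        # Splitting rule: rho^{-alpha} < N(Z') + 1.
        return self.side ** (-alpha) < len(self.points) + 1

    def split(self):
        # Halve each side, producing 2^d children and redistributing the points.
        d, half = len(self.lower), self.side / 2.0
        children = [Cell(self.lower + half * np.array(corner), half)
                    for corner in itertools.product([0, 1], repeat=d)]
        for z, y in self.points:
            next(c for c in children if c.contains(z)).points.append((z, y))
        return children

def add_observation(cover, z, y, alpha):
    """Assign (z, y) to its cell, then split any cell violating the rule."""
    next(c for c in cover if c.contains(z)).points.append((z, y))
    new_cover = []
    for c in cover:
        new_cover.extend(c.split() if c.should_split(alpha) else [c])
    return new_cover

# Start from the single cell [0, 1]^d, as in Algorithm 1.
cover = [Cell(lower=np.zeros(2), side=1.0)]
```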
Specifically, at step h of exploration episode n, after observing $s^n_h$, the following action is chosen:
$$a^n_h \in \operatorname*{argmax}_{a \in \mathcal{A}} Q^n_h(s^n_h, a). \qquad (13)$$
A pseudocode is provided in Algorithm 1.

### 3.2. Planning Phase

In the planning phase, when the reward function r is revealed, a near-optimal policy π̂ is derived using the trajectories collected during the exploration phase. Similar to the exploration policy, we compute a prediction $g_h$ and a confidence-interval width $w_h$ for $[P_h V_{h+1}]$, and define
$$Q_h(\cdot,\cdot) = \Pi_{[0,H]}\big[r_h(\cdot,\cdot) + g_h(\cdot,\cdot) + w_h(\cdot,\cdot)\big]. \qquad (14)$$
The policy π̂ is then obtained as the greedy policy with respect to Q: $\hat\pi_h(\cdot) = \operatorname{argmax}_{a\in\mathcal{A}} Q_h(\cdot, a)$.

Due to the domain partitioning, the confidence-interval width may increase over the exploration episodes n for certain points. To address this observation, specific to the domain partitioning technique, we take the following steps. First, we identify the exploration episode that has the smallest confidence interval for $z \in \mathcal{Z}$. Specifically, let us define
$$w_h(z) = \min_{n \le N} \beta(\delta)\,\sigma^n_h(z). \qquad (15)$$
Also, let $n_h(z) = \operatorname{argmin}_{n \le N} \beta(\delta)\,\sigma^n_h(z)$ be the exploration episode that provides the tightest confidence interval for the point z. Recall that for $z \in \mathcal{Z}$, we defined $Z^n_h(z) \in \mathcal{S}^n_h$ to be the cover element at step h of episode n containing z, and $\mathcal{Z}^n_h(z) = \{(s^i_h, a^i_h) \in Z^n_h(z) : i < n\}$ as the set of past observations belonging to the same cover element as z. Keeping the Bellman equation in mind, and starting with $V_{H+1} = 0$, $g_h$ is the kernel ridge predictor for $[P_h V_{h+1}]$ using some of the N observations $\{V_{h+1}(s^n_{h+1})\}_{n=1}^N$ at the points $\{z^n_h\}_{n=1}^N$ collected in the exploration phase. Specifically, in computing $g_h(z)$, we only use observations that are within the same subdomain as z in the partition $\mathcal{S}^{n_h(z)}_h$ at episode $n_h(z)$ of the exploration phase:
$$g_h(z) = k^{\top}_{\mathcal{Z}^{n_h(z)}_h(z)}(z)\,\big(K_{\mathcal{Z}^{n_h(z)}_h(z)} + \tau^2 I\big)^{-1} Y_{\mathcal{Z}^{n_h(z)}_h(z)}, \qquad (16)$$
where $Y_{\mathcal{Z}^{n_h(z)}_h(z)} = \big[V_{h+1}(s'_{h+1})\big]_{z' \in \mathcal{Z}^{n_h(z)}_h(z)}$. A pseudocode is given in Algorithm 2.

**Algorithm 2: Planning Phase**
- Input: τ, β(δ), k, M(S, A, H, P, r), and the exploration data $\{(s^n_h, a^n_h)\}_{(h,n)\in[H]\times[N]}$.
- **for** h = H, H−1, ..., 1 **do**
  - Compute the prediction $g_h$.
  - Let $Q_h(\cdot,\cdot) = \Pi_{[0,H]}[r_h(\cdot,\cdot) + g_h(\cdot,\cdot) + w_h(\cdot,\cdot)]$ as in (14).
  - $V_h(\cdot) = \max_{a\in\mathcal{A}} Q_h(\cdot, a)$.
  - $\hat\pi_h(\cdot) = \operatorname{argmax}_{a\in\mathcal{A}} Q_h(\cdot, a)$.
- Output: $\{\hat\pi_h\}_{h\in[H]}$.

### 3.3. Computational Complexity

The runtime complexity of our algorithm in the exploration phase is $O(HN^4 + H\lvert\mathcal{A}\rvert N^3)$, similar to the runtime complexity in Qiu et al. (2021). At each episode n and each step h, the computation of the kernel ridge regression statistics in each hypercube incurs a cost of $O(N_c^3 + \lvert\mathcal{A}_c\rvert N_c^2)$, where $\lvert\mathcal{A}_c\rvert$ is the number of actions in the hypercube and $N_c$ is the number of previous observations in the hypercube. Summing over all hypercubes, we bound the computational complexity by $O(n^3 + \lvert\mathcal{A}\rvert n^2)$, where $\lvert\mathcal{A}\rvert$ is the total number of actions. This bound is derived using the simple arithmetic fact that the cube of a sum of natural numbers is larger than the sum of their cubes. Summing over steps and episodes, we arrive at the overall runtime complexity of $O(HN^4 + H\lvert\mathcal{A}\rvert N^3)$. We expect an improved runtime for our algorithm in practice, due to the inequalities used in this calculation. The computational complexity of identifying the partition with the tightest confidence interval in the planning phase does not exceed that of performing kernel ridge regression on all partitions, a computation similar to that employed in the exploration phase; therefore, the overall computational complexity remains unaffected by this step. Sparse approximation methods such as the Nyström method significantly reduce the computational complexity while maintaining the kernel-based confidence intervals and, consequently, the eventual rates (see, e.g., Vakili et al., 2022, and references therein). These results are generally applicable to kernel ridge regression and are not specific to our problem. A small illustrative sketch of the planning-phase backup of Section 3.2 is given below.
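To make the planning-phase backup concrete, the following sketch (ours; the helpers `sigma_at_episode` and `krr_predict_in_cell` are hypothetical stand-ins for the per-partition kernel ridge quantities in (15)–(16)) selects, for each query point, the exploration episode with the tightest confidence width and forms the optimistic estimate of (14).

```python
import numpy as np

def planning_q_value(z, h, r_h, episodes, beta, H,
                     sigma_at_episode, krr_predict_in_cell):
    """Optimistic planning estimate Q_h(z) = Pi_[0,H][r_h(z) + g_h(z) + w_h(z)].

    sigma_at_episode(z, h, n)    -> sigma^n_h(z), per-partition posterior std
    krr_predict_in_cell(z, h, n) -> kernel ridge prediction of [P_h V_{h+1}](z)
                                    using only the data in z's cell at episode n
    """
    # w_h(z) = min_n beta * sigma^n_h(z); n_h(z) is the minimizing episode, Eq. (15).
    widths = np.array([beta * sigma_at_episode(z, h, n) for n in episodes])
    n_star = episodes[int(np.argmin(widths))]
    w_h = float(widths.min())
    g_h = krr_predict_in_cell(z, h, n_star)          # Eq. (16)
    return float(np.clip(r_h(z) + g_h + w_h, 0.0, H))

def greedy_planning_policy(s, h, actions, r_h, episodes, beta, H,
                           sigma_at_episode, krr_predict_in_cell):
    """pi_h(s) = argmax_a Q_h(s, a) over a finite set of candidate actions."""
    q_vals = [planning_q_value((s, a), h, r_h, episodes, beta, H,
                               sigma_at_episode, krr_predict_in_cell)
              for a in actions]
    return actions[int(np.argmax(q_vals))]
```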
## 4. Sample Complexity Analysis

In this section, we present the main result of the paper. In Theorem 4.5, we establish an $O\big((\frac{H^3}{\epsilon})^{2+\frac{2d}{\alpha}}\big)$ sample complexity for the kernel-based reward-free RL problem, for a general class of kernels with polynomial eigendecay that includes the Matérn family and Neural Tangent kernels. The parameter α captures smoothness properties of the kernel and is specified in the next definition.

**Definition 4.1** (Polynomial Eigendecay). Consider the Mercer eigenvalues $\{\lambda_m\}_{m=1}^{\infty}$ of $k : \mathcal{Z}\times\mathcal{Z}\to\mathbb{R}$, given in Equation (4), in decreasing order, as well as the corresponding eigenfeatures $\{\phi_m\}_{m=1}^{\infty}$. Assume $\mathcal{Z}$ is a d-dimensional hypercube with side length $\rho_Z$. For some $C_p, \alpha > 0$ and $p > 1$, the kernel k is said to have a polynomial eigendecay if, for all $m \in \mathbb{N}$, $\lambda_m \le C_p\, m^{-p}\,\rho_Z^{\alpha}$. In addition, for some $\eta \ge 0$, $m^{-p\eta}\phi_m(z)$ is uniformly bounded over all m and z. We use the notation $p^* = p(1-2\eta)$.

The polynomial eigendecay profile encompasses a large class of common kernels, e.g., the Matérn family of kernels. For a Matérn kernel with smoothness parameter ν, $p = \frac{2\nu+d}{d}$ and $\alpha = 2\nu$ (e.g., see Yang et al., 2020a). Another example is the NT kernel (Arora et al., 2019). It has been shown that the RKHS of the NT kernel, when the activations are s−1 times differentiable, is equivalent to the RKHS of a Matérn kernel with smoothness $\nu = s - \frac{1}{2}$ (Vakili et al., 2021b). For instance, the RKHS of an NT kernel with ReLU activations is equivalent to the RKHS of a Matérn kernel with $\nu = \frac{1}{2}$; in this case, $p = 1 + \frac{1}{d}$ and α = 1. The hypercube domain assumption is a technical formality that can be relaxed to other regular compact subsets of $\mathbb{R}^d$. The uniform boundedness of $m^{-p\eta}\phi_m(z)$ for some η > 0 also holds for a broad class of kernels, including the Matérn family, as discussed in (Yang et al., 2020a). Several works, including (Vakili et al., 2021b; Kassraie & Krause, 2022), have employed an averaging technique over subsets of eigenfeatures, demonstrating that the effective value of η can be taken to be 0 in the case of Matérn and NT kernels.

### 4.1. Confidence Intervals for State-Action Value Functions

Confidence intervals are an important building block in the design and analysis of RL algorithms. For a fixed function f in the RKHS of a known kernel, 1−δ confidence intervals of the form $|f(z) - \mu_{n,f}(z)| \le \beta(\delta)\,\sigma_n(z)$ are established in several works (Srinivas et al., 2010; Chowdhury & Gopalan, 2017; Abbasi-Yadkori, 2013; Vakili et al., 2021a) under various assumptions. In the RL setting, however, these confidence intervals cannot be directly applied. This is due to the randomness of the target function itself: specifically, in our case, the target function is $[P_h V^n_{h+1}]$, which is not a fixed function, due to the temporal dependence within an episode. An argument based on the covering number of the state-action value function class was used in Yang et al. (2020a) to establish uniform confidence intervals over all $z \in \mathcal{Z}$ and all f in a specific function class. Vakili & Olkhovskaya (2023) proved a variant that offers flexibility with respect to setting the parameters of the confidence interval.
Their approach leads to a more refined confidence interval which, with a proper choice of parameters, contributes to the improved results in the RL setting. We first give a formal definition of the two complexity terms that appear in our confidence intervals: the maximum information gain and the covering number of the state-action value function class.

**Definition 4.2** (Maximum Information Gain). In the kernel ridge regression setting described in Section 2.3, the following quantity is referred to as the maximum information gain:
$$\Gamma_{k,\tau}(n) = \max_{Z_n \subset \mathcal{Z}} \frac{1}{2}\log\det\Big(I + \frac{1}{\tau^2} K_{Z_n}\Big).$$
Upper bounds on the maximum information gain based on the spectrum of the kernel are established in (Janz et al., 2020; Srinivas et al., 2010; Vakili et al., 2021c).

**State-action value function class.** Let us use $\mathcal{Q}_{k,h}(R, B)$ to denote the class of state-action value functions. In particular, for a set of observations $\mathcal{Z}$, let $\sigma_h(z)$ be the uncertainty estimate obtained from kernel ridge regression as given in (6). We define
$$\mathcal{Q}_{k,h}(R, B) = \Big\{ Q : Q(z) = \Pi_{[0,H]}\{Q_0(z) + \beta\,\sigma_h(z)\},\ \ \|Q_0\|_{\mathcal{H}_k} \le R,\ \beta \le B,\ |\mathcal{Z}| \le N \Big\}. \qquad (17)$$

**Definition 4.3** (Covering Set and Number). Consider a function class F. For ε > 0, we define the minimum ε-covering set C(ε) as the smallest subset of F that covers it up to an ε error in the $l_\infty$ norm; that is, for all $f \in F$, there exists a $g \in C(\epsilon)$ such that $\|f - g\|_{l_\infty} \le \epsilon$. We refer to the size of C(ε) as the ε-covering number. We use the notation $\mathcal{N}_{k,h}(\epsilon; R, B)$ to denote the ε-covering number of $\mathcal{Q}_{k,h}(R, B)$, appearing in the confidence interval.

**Lemma 4.4** (Confidence Interval; Theorem 1 of Vakili & Olkhovskaya (2023)). Let $f^n_h$ and $\sigma^n_h$ denote the kernel ridge predictor and uncertainty estimate of $[P_h V^n_{h+1}]$, using n observations $\{V^n_{h+1}(s^i_{h+1})\}_{i=1}^{n}$ at $Z^n_h = \{z^i_h\}_{i=1}^{n} \subset \mathcal{Z}$, where $s^i_{h+1}$ is the next state drawn from $P_h(\cdot \mid z^i_h)$. Let $R_N = H\sqrt{2\big(\Gamma_{k,\tau}(N) + 1 + \log(\frac{2}{\delta})\big)}$. For ε, δ ∈ (0, 1), with probability at least 1−δ, we have, for all $(z, h) \in \mathcal{Z}\times[H]$ and $n \in [N]$,
$$\big|[P_h V^n_{h+1}](z) - f^n_h(z)\big| \le \beta^n_h(\delta, \epsilon)\,\sigma^n_h(z) + \epsilon,$$
where $\beta^n_h(\delta, \epsilon)$ is set to any value satisfying
$$\beta^n_h(\delta, \epsilon) \ \ge\ H + 3\sqrt{n}\,\epsilon + H\sqrt{2\Big(\Gamma_{k,\tau}(n) + \log \mathcal{N}_{k,h}\big(\epsilon; R_N, \beta^n_h(\delta,\epsilon)\big) + 1 + \log\tfrac{2NH}{\delta}\Big)}.$$

### 4.2. Sample Complexity

With the auxiliary results laid out in the previous section, we now give a formal presentation of the sample complexity.

**Theorem 4.5.** Consider the reward-free kernel-based RL problem presented in Section 2, under Assumption 2.2. Run Algorithm 1 with N exploration episodes, and let π̂ be the policy obtained in the planning phase according to Algorithm 2. There exist $N = O\big((\frac{H^3}{\epsilon})^{2+\frac{2d}{\alpha}}\,\mathrm{polylog}(\frac{H}{\epsilon\delta})\big)$ and $\beta(\delta) = O\big(H\sqrt{\log(\frac{NH}{\delta})}\big)$ which guarantee that, with probability at least 1−δ, we have $V^*_1(s) - V^{\hat\pi}_1(s) \le \epsilon$ for all $s \in \mathcal{S}$.

*Proof.* Here, we present a summary of the steps in the proof; details are provided in Appendix B. In the proof, we show that $V^*_1(s; r) - V^{\hat\pi}_1(s; r) = O\big(\frac{H}{N}\sum_{n=1}^{N} V^*_1(s; \tilde{r}^n)\big)$, which creates a connection between the suboptimality in the planning phase (the left-hand side) and the total value in the exploration phase with hypothetical rewards (the right-hand side). We then use confidence intervals for kernel-based regression over partitions to bound the right-hand side, and eventually obtain
$$V^*_1(s; r) - V^{\hat\pi}_1(s; r) = O\Big(H^3\, N^{-\frac{\alpha}{2(\alpha+d)}}\, \log(N)\Big),$$
up to logarithmic factors in H and 1/δ. From here we can see that, with a choice of $N = O\big((\frac{H^3}{\epsilon})^{2+\frac{2d}{\alpha}}\big)$ (up to polylogarithmic factors), it is guaranteed that $V^*_1(s; r) - V^{\hat\pi}_1(s; r) \le \epsilon$. □

**Evaluation of the result.** When we instantiate the sample complexity for the Matérn kernel, we obtain a sample complexity of $O\big((\frac{H^3}{\epsilon})^{2+\frac{d}{\nu}}\big)$; a short derivation of how this exponent arises is given below.
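The following short derivation (our reconstruction of the last step of the proof sketch, under the suboptimality bound stated above and ignoring logarithmic factors) shows how the exponent $2+\frac{2d}{\alpha}$ arises and how the Matérn case follows.

```latex
% Suboptimality after N exploration episodes (Theorem 4.5 proof sketch, up to log factors):
%   V_1^*(s; r) - V_1^{\hat\pi}(s; r) \lesssim H^3 \, N^{-\frac{\alpha}{2(\alpha+d)}} .
% Setting the right-hand side to \epsilon and solving for N:
\[
  H^3 N^{-\frac{\alpha}{2(\alpha+d)}} \le \epsilon
  \quad\Longleftrightarrow\quad
  N \ge \Big(\tfrac{H^3}{\epsilon}\Big)^{\frac{2(\alpha+d)}{\alpha}}
      = \Big(\tfrac{H^3}{\epsilon}\Big)^{2+\frac{2d}{\alpha}} .
\]
% For the Mat\'ern kernel, \alpha = 2\nu, so the exponent becomes
\[
  2+\frac{2d}{\alpha} \;=\; 2+\frac{d}{\nu},
  \qquad\text{matching the }\Omega\big((1/\epsilon)^{2+\frac{d}{\nu}}\big)
  \text{ kernel-bandit lower bound of Scarlett et al. (2017).}
\]
```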
This is order optimal in ε, given the $\Omega\big((\frac{1}{\epsilon})^{2+\frac{d}{\nu}}\big)$ sample complexity lower bound for kernel-based bandits with Matérn kernels (see Scarlett et al., 2017, Table I), up to logarithmic factors in ε. We note that the kernel-based bandit problem corresponds to a degenerate case with H = 1, |S| = 1; thus, the dependence on ε cannot be further improved. In addition, our sample complexity, in its dependency on ε, matches that of the simpler discounted kernel-based RL problem assuming the existence of a generative oracle that can provide arbitrary transition samples (see Yeh et al., 2023, Table 1). Regarding the dependency of the sample complexity on the episode length H, however, it is still unresolved whether improvements are possible. For comparison, in the oracle setting (Yeh et al., 2023), the sample complexity is proportional to $(\frac{1}{1-\gamma})^{1+3(2+\frac{d}{\nu})}$, where $\gamma \in (0,1)$ is the discount factor in the discounted MDP setting. Informally interpreting $\frac{1}{1-\gamma}$ as the effective episode length, our results show a similar dependency on H. Notably, our sample complexity reflects an improvement by a factor of H.

## 5. Conclusion

We considered the reward-free RL framework with kernel-based modeling and developed algorithms for both the exploration and planning phases. Our results show an order-optimal sample complexity for a general class of common kernels with polynomially decaying eigenvalues, which includes Matérn and Neural Tangent kernels. We significantly improve the state of the art, as the existing work does not apply to this class of kernels (yielding an unbounded sample complexity). In addition, the scaling of the sample complexity in ε matches that of the lower bound in kernel-based bandits with the Matérn kernel, showing its optimality.

## Impact Statement

This paper presents research aimed at advancing the field of Machine Learning. While there are numerous potential societal implications associated with RL, we do not believe any require specific emphasis given the theoretical nature of the work.

## References

Abbasi-Yadkori, Y. Online learning for linearly parametrized control problems. PhD thesis, University of Alberta, 2013.

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, volume 24, 2011.

Agarwal, A., Kakade, S., and Yang, L. F. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pp. 67–83. PMLR, 2020.

Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127–135. PMLR, 2013.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. On exact computation with an infinitely wide neural net. Advances in Neural Information Processing Systems, 32, 2019.

Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008.

Azar, M. G., Munos, R., and Kappen, H. J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.

Bartlett, P. L. and Tewari, A. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. CoRR, abs/1205.2661, 2012. URL http://arxiv.org/abs/1205.2661.

Borovitskiy, V., Terenin, A., Mostowsky, P., et al. Matérn Gaussian processes on Riemannian manifolds.
In Advances in Neural Information Processing Systems, volume 33, pp. 12426 12437, 2020. Chen, X., Hu, J., Yang, L. F., and Wang, L. Near-optimal reward-free exploration for linear mixture mdps with plug-in solver. ar Xiv preprint ar Xiv:2110.03244, 2021. Reward-Free Kernel-Based RL Chowdhury, S. R. and Gopalan, A. On kernelized multiarmed bandits. In International Conference on Machine Learning, pp. 844 853. PMLR, 2017. Chowdhury, S. R. and Gopalan, A. Online learning in kernelized Markov decision processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3197 3205. PMLR, 2019. Christmann, A. and Steinwart, I. Support Vector Machines. Springer New York, NY, 2008. Domingues, O. D., M enard, P., Pirotta, M., Kaufmann, E., and Valko, M. Kernel-based reinforcement learning: A finite-time analysis. In International Conference on Machine Learning, pp. 2783 2792. PMLR, 2021. Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera Paredes, B., Barekatain, M., Novikov, A., R Ruiz, F. J., Schrittwieser, J., Swirszcz, G., et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47 53, 2022. Janz, D., Burt, D., and Gonz alez, J. Bandit optimisation of functions in the Mat ern kernel RKHS. In International Conference on Artificial Intelligence and Statistics, pp. 2486 2495. PMLR, 2020. Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? Advances in Neural Information Processing Systems, 31, 2018. Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pp. 4870 4879. PMLR, 2020a. Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137 2143. PMLR, 2020b. Kahn, G., Villaflor, A., Pong, V., Abbeel, P., and Levine, S. Uncertainty-aware reinforcement learning for collision avoidance. ar Xiv preprint ar Xiv:1702.01182, 2017. Kakade, S. M. On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003. Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pp. 651 673. PMLR, 2018. Kassraie, P. and Krause, A. Neural contextual bandits without regret. In International Conference on Artificial Intelligence and Statistics, pp. 240 278. PMLR, 2022. Kaufmann, E., M enard, P., Domingues, O. D., Jonsson, A., Leurent, E., and Valko, M. Adaptive reward-free exploration. In Algorithmic Learning Theory, pp. 865 891. PMLR, 2021. Kearns, M. and Singh, S. Finite-sample convergence rates for q-learning and indirect algorithms. In Advances in Neural Information Processing Systems, volume 11. MIT Press, 1998. Lee, K., Kim, S.-A., Choi, J., and Lee, S.-W. Deep reinforcement learning in continuous action spaces: a case study in the game of simulated curling. In International Conference on Machine Learning,, pp. 2937 2946. PMLR, 2018. M enard, P., Domingues, O. D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pp. 7599 7608. PMLR, 2021. Mercer, J. 
Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 209:415 446, 1909. ISSN 02643952. URL http://www.jstor.org/stable/91043. Mirhoseini, A., Goldie, A., Yazgan, M., Jiang, J. W., Songhori, E., Wang, S., Lee, Y.-J., Johnson, E., Pathak, O., Nazi, A., et al. A graph placement methodology for fast chip design. Nature, 594(7862):207 212, 2021. Neu, G. and Pike-Burke, C. A unifying view of optimism in episodic reinforcement learning. In Advances in Neural Information Processing Systems, volume 33, pp. 1392 1403, 2020. Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. Qiu, S., Ye, J., Wang, Z., and Yang, Z. On reward-free rl with kernel and neural function approximations: Singleagent mdp and markov game. In International Conference on Machine Learning, pp. 8737 8747. PMLR, 2021. Russo, D. Worst-case regret bounds for exploration via randomized value functions. Advances in Neural Information Processing Systems, 32, 2019. Scarlett, J., Bogunovic, I., and Cevher, V. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory, pp. 1723 1742. PMLR, 2017. Reward-Free Kernel-Based RL Sch olkopf, B., Smola, A. J., Bach, F., et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002. Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. Nearoptimal time and sample complexities for solving markov decision processes with a generative model. Advances in Neural Information Processing Systems, 31, 2018. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484 489, 2016. Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 1015 1022, 2010. Vakili, S. and Olkhovskaya, J. Kernelized reinforcement learning with order optimal regret bounds. ar Xiv preprint ar Xiv:2306.07745, 2023. Vakili, S., Bouziani, N., Jalali, S., Bernacchia, A., and Shiu, D.-s. Optimal order simple regret for Gaussian process bandits. Advances in Neural Information Processing Systems, 34:21202 21215, 2021a. Vakili, S., Bromberg, M., Garcia, J., Shiu, D.-s., and Bernacchia, A. Uniform generalization bounds for overparameterized neural networks. ar Xiv preprint ar Xiv:2109.06099, 2021b. Vakili, S., Khezeli, K., and Picheny, V. On information gain and regret bounds in Gaussian process bandits. In International Conference on Artificial Intelligence and Statistics, pp. 82 90. PMLR, 2021c. Vakili, S., Scarlett, J., Shiu, D.-s., and Bernacchia, A. Improved convergence rates for sparse approximation methods in kernel-based learning. In International Conference on Machine Learning, pp. 21960 21983. PMLR, 2022. Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. Finite-time analysis of kernelised contextual bandits. In Uncertainty in Artificial Intelligence, 2013. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. 
Grandmaster level in starcraft II using multi-agent reinforcement learning. Nature, 575 (7782):350 354, 2019. Wagenmaker, A. J., Chen, Y., Simchowitz, M., Du, S., and Jamieson, K. Reward-free RL is no harder than rewardaware RL in linear markov decision processes. In International Conference on Machine Learning, pp. 22430 22456. PMLR, 2022. Wang, R., Du, S. S., Yang, L., and Salakhutdinov, R. R. On reward-free reinforcement learning with linear function approximation. Advances in neural information processing systems, 33:17816 17826, 2020. Williams, C. K. and Rasmussen, C. E. Gaussian processes for machine learning. MIT press Cambridge, MA, 2006. Yang, L. and Wang, M. Sample-optimal parametric qlearning using linearly additive features. In International Conference on Machine Learning, pp. 6995 7004. PMLR, 2019. Yang, L. and Wang, M. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pp. 10746 10756. PMLR, 2020. Yang, Z., Jin, C., Wang, Z., Wang, M., and Jordan, M. Provably efficient reinforcement learning with kernel and neural function approximations. Advances in Neural Information Processing Systems (Neur IPS), 2020a. Yang, Z., Jin, C., Wang, Z., Wang, M., and Jordan, M. I. On function approximation in reinforcement learning: Optimism in the face of large state spaces. ar Xiv preprint ar Xiv:2011.04622, 2020b. Yao, H., Szepesv ari, C., Pires, B. A., and Zhang, X. Pseudo MDPs and factored linear action models. In 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 1 9. IEEE, 2014. Yeh, S.-Y., Chang, F.-C., Yueh, C.-W., Wu, P.-Y., Bernacchia, A., and Vakili, S. Sample complexity of kernelbased q-learning. In International Conference on Artificial Intelligence and Statistics, pp. 453 469. PMLR, 2023. Zanette, A., Brandfonbrener, D., Brunskill, E., Pirotta, M., and Lazaric, A. Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pp. 1954 1964. PMLR, 2020a. Zanette, A., Lazaric, A., Kochenderfer, M., and Brunskill, E. Learning near optimal policies with low inherent Bellman error. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 10978 10989. PMLR, 13 18 Jul 2020b. Reward-Free Kernel-Based RL Zanette, A., Lazaric, A., Kochenderfer, M. J., and Brunskill, E. Provably efficient reward-agnostic navigation with linear value iteration. Advances in Neural Information Processing Systems, 33:11756 11766, 2020c. Zhang, J., Zhang, W., and Gu, Q. Optimal horizon-free reward-free exploration for linear mixture mdps. In International Conference on Machine Learning, pp. 41902 41930. PMLR, 2023. Zhang, W., Zhou, D., and Gu, Q. Reward-free model-based reinforcement learning with linear function approximation. Advances in Neural Information Processing Systems, 34:1582 1593, 2021. Zhang, Z., Du, S. S., and Ji, X. Nearly minimax optimal reward-free reinforcement learning. ar Xiv preprint ar Xiv:2010.05901, 2020. Reward-Free Kernel-Based RL A. RKHS and Mercer Theorem Mercer theorem (Mercer, 1909) provides a representation of the kernel in terms of an infinite dimensional feature map (e.g., see, Christmann & Steinwart, 2008, Theorem 4.49). Let Z be a compact metric space and µ be a finite Borel measure on Z (we consider Lebesgue measure in a Euclidean space). 
Let $L^2_\mu(\mathcal{Z})$ be the set of square-integrable functions on $\mathcal{Z}$ with respect to μ. We further say a kernel is square-integrable if
$$\int_{\mathcal{Z}}\int_{\mathcal{Z}} k^2(z, z')\, d\mu(z)\, d\mu(z') < \infty.$$

**Theorem A.1** (Mercer Theorem). Let $\mathcal{Z}$ be a compact metric space and μ be a finite Borel measure on $\mathcal{Z}$. Let k be a continuous and square-integrable kernel, inducing the integral operator $T_k : L^2_\mu(\mathcal{Z}) \to L^2_\mu(\mathcal{Z})$ defined by
$$(T_k f)(\cdot) = \int_{\mathcal{Z}} k(\cdot, z')\, f(z')\, d\mu(z'),$$
where $f \in L^2_\mu(\mathcal{Z})$. Then, there exists a sequence of eigenvalue–eigenfeature pairs $\{(\lambda_m, \phi_m)\}_{m=1}^{\infty}$ such that $\lambda_m > 0$ and $T_k\phi_m = \lambda_m\phi_m$ for $m \ge 1$. Moreover, the kernel function can be represented as
$$k(z, z') = \sum_{m=1}^{\infty} \lambda_m\,\phi_m(z)\,\phi_m(z'),$$
where the convergence of the series holds uniformly on $\mathcal{Z}\times\mathcal{Z}$.

According to the Mercer representation theorem (e.g., see Christmann & Steinwart, 2008, Theorem 4.51), the RKHS induced by k can consequently be represented in terms of $\{(\lambda_m, \phi_m)\}_{m=1}^{\infty}$.

**Theorem A.2** (Mercer Representation Theorem). Let $\{(\lambda_m, \phi_m)\}_{m=1}^{\infty}$ be the Mercer eigenvalue–eigenfeature pairs. Then, the RKHS of k is given by
$$\mathcal{H}_k = \Big\{ f = \sum_{m=1}^{\infty} w_m\,\lambda_m^{\frac{1}{2}}\,\phi_m(\cdot)\ :\ w_m \in \mathbb{R},\ \ \|f\|^2_{\mathcal{H}_k} := \sum_{m=1}^{\infty} w_m^2 < \infty \Big\}.$$
The Mercer representation theorem indicates that the scaled eigenfeatures $\{\sqrt{\lambda_m}\,\phi_m\}_{m=1}^{\infty}$ form an orthonormal basis for $\mathcal{H}_k$.

## B. Detailed Analysis of Sample Complexity

Recall the reward-free kernel-based RL problem formulated in Section 2. In the exploration phase, trajectories are collected for N episodes based on Algorithm 1. In the planning phase, after the reward r is revealed, a policy π̂ is obtained according to Algorithm 2. We establish an upper bound on $V^*_1(s; r) - V^{\hat\pi}_1(s; r)$, where $V^*_h(s; r)$ is the optimal value function with reward r and $V^{\hat\pi}_h(s; r)$ is the value function of the policy π̂. It is then straightforward to obtain the appropriate sample complexity N which guarantees $V^*_1(s; r) - V^{\hat\pi}_1(s; r) \le \epsilon$.

**Proof structure.** We structure the proof as follows. We first recall some notation, prove an upper bound on the total number of partition elements created by the algorithm in the exploration phase, and introduce two high-probability events based on confidence intervals. We then present the two main steps of the proof in Sections B.1 and B.2, respectively. In the first step, we bound the suboptimality of the policy in the planning phase using the total value function in the exploration phase. In the second step, we bound the total value function in the exploration phase. The proofs of the lemmas are given in Appendix C.

Recall the notations $V^\pi_h(s; r)$ and $Q^\pi_h(s; r)$ for the value function and state-action value function of a policy π with reward r. In contrast, $V^n_h$ and $Q^n_h$ represent the proxies for the value function and state-action value function used in episode n of the exploration phase (Algorithm 1), and $V_h$ and $Q_h$ are those used in the planning phase (Algorithm 2).

Next, we bound the total number of partition elements used by the algorithm in the exploration phase; this result is used in several places in the proof. For step h, let $\mathcal{U}^N_h = \bigcup_{n=1}^{N} \mathcal{S}^n_h$ be the union of all cover elements used by the algorithm over all exploration episodes. The size of $\mathcal{U}^N_h$ is bounded in the following lemma.

**Lemma B.1** (Lemma 2 in Janz et al. (2020)). The size of $\mathcal{U}^N_h$ satisfies
$$|\mathcal{U}^N_h| \le C_1\, N^{\frac{d}{d+\alpha}}, \qquad (18)$$
for some constant $C_1 > 0$.

The size of $\mathcal{U}^N_h$ depends on the dimension of the domain and on the parameter α used in the splitting rule of the domain partitioning; a heuristic argument for this scaling is sketched below.
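The scaling in Lemma B.1 can be understood through the following heuristic balance argument (ours, for intuition only; the formal proof is in Janz et al. (2020)).

```latex
% Heuristic for Lemma B.1. Suppose the N observations at step h are spread so
% that the cover consists of cells of a common side length \rho, each holding
% roughly the maximal number of points allowed before a split, about
% \rho^{-\alpha} (the rule \rho^{-\alpha} < N(Z') + 1). Then
\[
  (\#\text{cells}) \times (\text{points per cell})
  \;\approx\; \rho^{-d}\cdot\rho^{-\alpha} \;=\; N
  \quad\Longrightarrow\quad \rho \approx N^{-\frac{1}{d+\alpha}},
\]
% so the number of cells is
\[
  \rho^{-d} \;\approx\; N^{\frac{d}{d+\alpha}},
\]
% matching the bound |\mathcal{U}^N_h| \le C_1 N^{d/(d+\alpha)}.
```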
Let us define the event $\mathcal{E}$ as the event that all the kernel-based confidence intervals in the exploration phase hold true; i.e., for all $z \in \mathcal{Z}$ and $(h, n) \in [H]\times[N]$,
$$\big|[P_h V^n_{h+1}](z) - f^n_h(z)\big| \le \beta(\delta)\,\sigma^n_h(z), \qquad (19)$$
where $f^n_h$ and $\sigma^n_h$ are the kernel ridge predictor and uncertainty estimate of $[P_h V^n_{h+1}]$ defined in Equations (11) and (12), respectively. Similarly, let us define the event $\tilde{\mathcal{E}}$ as the event that all the kernel-based confidence intervals in the planning phase hold true; i.e., for all $z \in \mathcal{Z}$ and $h \in [H]$,
$$\big|[P_h V_{h+1}](z) - g_h(z)\big| \le w_h(z), \qquad (20)$$
where $g_h$ and $w_h$ are defined in Equations (16) and (15), respectively.

**Lemma B.2.** With a choice of $\beta(\delta) = O\big(H\sqrt{\log(\frac{NH}{\delta})}\big)$ with a sufficiently large implied constant, the events $\mathcal{E}$ and $\tilde{\mathcal{E}}$ each hold with probability at least 1−δ/3: $\Pr(\mathcal{E}) \ge 1-\delta/3$ and $\Pr(\tilde{\mathcal{E}}) \ge 1-\delta/3$.

### B.1. Suboptimality of the Planning Phase

The goal of this section of the proof is to bound $V^*_1(s; r) - V^{\hat\pi}_1(s; r)$ using $\sum_{n=1}^{N} V^*_1(s; \tilde{r}^n)$; this creates the connection between the planning and exploration phases. To bound $V^*_1(s; r) - V^{\hat\pi}_1(s; r)$, we bound the following two terms in the following two lemmas: $V^*_1(s; r) - V_1(s)$ and $V_1(s) - V^{\hat\pi}_1(s; r)$.

**Lemma B.3.** Conditioned on $\tilde{\mathcal{E}}$, we have $V^*_h(s_1; r) - V_h(s_1) \le 0$.

**Lemma B.4.** Conditioned on $\tilde{\mathcal{E}}$, we have $V_1(s_1) - V^{\hat\pi}_1(s_1; r) \le 2\sum_{h=1}^{H} w_h(s_h, \hat\pi(s_h))$.

Combining Lemmas B.3 and B.4, we obtain
$$V^*_1(s_1; r) - V^{\hat\pi}_1(s_1; r) \le 2\sum_{h=1}^{H} w_h(s_h, \hat\pi(s_h)). \qquad (21)$$
By the definition of $w_h$, we have $w_h(z) \le \beta(\delta)\,\sigma^n_h(z)$ for all $n \le N$. Recall the definition $\tilde{r}^n_h = \beta(\delta)\,\sigma^n_h/H$. We have
$$\sum_{h=1}^{H} w_h(s_h, \hat\pi(s_h)) \le H\, V^*_1(s_1; \tilde{r}^n). \qquad (22)$$
Summing over n and dividing by N,
$$\sum_{h=1}^{H} w_h(s_h, \hat\pi(s_h)) \le \frac{H}{N}\sum_{n=1}^{N} V^*_1(s_1; \tilde{r}^n). \qquad (23)$$
From (21) and (23), we can see that
$$V^*_1(s_1; r) - V^{\hat\pi}_1(s_1; r) \le \frac{2H}{N}\sum_{n=1}^{N} V^*_1(s_1; \tilde{r}^n). \qquad (24)$$
This connects the suboptimality of the planning phase to the total value of the exploration phase.

### B.2. Total Value Function in the Exploration Phase

In this section, our goal is to bound $\sum_{n=1}^{N} V^*_1(s; \tilde{r}^n)$. We first bound $V^*_h(s; \tilde{r}^n) - V^n_h(s)$ in Lemma B.5, and then bound $\sum_{n=1}^{N} V^n_1(s)$ in Lemma B.6.

**Lemma B.5.** Conditioned on $\mathcal{E}$, we have $V^*_h(s; \tilde{r}^n) - V^n_h(s) \le 0$.

**Lemma B.6.** Define $\zeta^n_h = [P_h V^n_{h+1}](s^n_h, a^n_h) - V^n_{h+1}(s^n_{h+1})$. Conditioned on $\mathcal{E}$, we have
$$V^n_1(s^n_1) \le \sum_{h=1}^{H} \zeta^n_h + \Big(2 + \frac{1}{H}\Big)\sum_{h=1}^{H} \beta(\delta)\,\sigma^n_h(s^n_h, a^n_h).$$

Summing both sides over n, we obtain
$$\sum_{n=1}^{N} V^n_1(s^n_1) \le \sum_{n=1}^{N}\sum_{h=1}^{H} \zeta^n_h + \Big(2 + \frac{1}{H}\Big)\sum_{n=1}^{N}\sum_{h=1}^{H} \beta(\delta)\,\sigma^n_h(s^n_h, a^n_h).$$
Using Lemma B.5, we get
$$\sum_{n=1}^{N} V^*_1(s; \tilde{r}^n) \le \underbrace{\sum_{n=1}^{N}\sum_{h=1}^{H} \zeta^n_h}_{\text{Term 1}} + \underbrace{\Big(2 + \frac{1}{H}\Big)\sum_{n=1}^{N}\sum_{h=1}^{H} \beta(\delta)\,\sigma^n_h(s^n_h, a^n_h)}_{\text{Term 2}}.$$
We next bound the two terms on the right-hand side.

**Term 1.** By the Azuma–Hoeffding inequality, with probability at least 1−δ/3, this term is at most $O\big(H\sqrt{NH\log(1/\delta)}\big)$.

**Term 2.** We bound the total uncertainty of the kernel ridge regression, measured by $\sum_{n=1}^{N} (\sigma^n_h(z^n_h))^2$:
$$\sum_{n=1}^{N} \big(\sigma^n_h(z^n_h)\big)^2 = \sum_{Z' \in \mathcal{U}^N_h}\,\sum_{z^n_h \in Z'} \big(\sigma^n_h(z^n_h)\big)^2 \le \sum_{Z' \in \mathcal{U}^N_h} \frac{2}{\log(1 + 1/\tau^2)}\,\Gamma_{k,\tau}\big(N^N_{h,Z'}\big) = O\big(|\mathcal{U}^N_h|\,\log(N)\big) = O\big(N^{\frac{d}{d+\alpha}}\log(N)\big).$$
The inequality is commonly used in kernelized bandits (for example, see Srinivas et al. (2010), Lemma 5.4); the last two equalities follow from Equation (33) in Appendix C and from Lemma B.1, respectively. Therefore, using the Cauchy–Schwarz inequality,
$$\sum_{n=1}^{N} \sigma^n_h(z^n_h) \le \sqrt{N \sum_{n=1}^{N} \big(\sigma^n_h(z^n_h)\big)^2} = O\Big(\sqrt{N^{1+\frac{d}{d+\alpha}}\log(N)}\Big). \qquad (27)$$
Replacing the value of β(δ) and summing over h, we obtain that Term 2 is $O\big(H^2\, N^{\frac{1}{2}(1+\frac{d}{d+\alpha})}\,\log(\frac{NH}{\delta})\big)$. Combining the bounds on the two terms, we get
$$\sum_{n=1}^{N} V^*_1(s; \tilde{r}^n) = O\Big(H^2\, N^{\frac{1}{2}(1+\frac{d}{d+\alpha})}\,\log\big(\tfrac{NH}{\delta}\big)\Big). \qquad (29)$$

### B.3. Sample Complexity

From Equations (24) and (29), proven in the previous sections, we have, with probability at least 1−δ,
$$V^*_1(s_1; r) - V^{\hat\pi}_1(s_1; r) = O\Big(H^3\, N^{-\frac{\alpha}{2(\alpha+d)}}\,\log\big(\tfrac{NH}{\delta}\big)\Big).$$
Then, the choice of $N = \big(\frac{H^3}{\epsilon}\big)^{2+\frac{2d}{\alpha}}\,\mathrm{polylog}\big(\frac{H^3}{\epsilon\delta}\big)$, with a sufficiently large constant, ensures that $V^*_1(s_1; r) - V^{\hat\pi}_1(s_1; r) \le \epsilon$; i.e., the policy π̂ obtained in the planning phase is ε-optimal.
C. Proof of Lemmas

In this section we provide the proofs of the lemmas.

C.1. Proof of Lemma B.2

The lemma is a consequence of the confidence intervals given in Lemma 4.4. We only need to prove that $\beta(\delta)$ given in Theorem 4.5 satisfies the condition on the confidence-interval width multiplier given in Lemma 4.4. Consider a cover element $Z' \in U^N_h$. Using Lemma 4.4, we have, with probability at least $1 - \delta$, for all $h \in [H]$, $n \in [N]$, $z \in Z'$, and for some $\epsilon^n_h \in (0, 1)$,
$$\big|[P_h V^n_{h+1}](z) - f^n_h(z)\big| \le \beta^n_h(\delta, \epsilon^n_h)\, \sigma^n_h(z) + \epsilon^n_h, \qquad (32)$$
where $\beta^n_h(\delta, \epsilon^n_h)$ is the smallest value satisfying
$$\beta^n_h(\delta, \epsilon^n_h) \ge H + (1 + H)\sqrt{\Gamma_{k,\tau}(N') + \log \mathcal{N}_{k,h}\big(\epsilon^n_h; R_N, \beta^n_h(\delta, \epsilon^n_h)\big) + 1 + \log(NH/\delta)},$$
with $N' = N^n_{h,Z'}$ and $\epsilon^n_h = \frac{H\sqrt{\log(NH/\delta)}}{\sqrt{N^n_{h,Z'}}}$.

We use the following lemma to bound the maximum information gain term.

Lemma C.1 (Lemma 2 in Vakili & Olkhovskaya, 2023). Consider a positive definite kernel $k : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ with polynomial eigendecay on a hypercube with side length $\rho_{Z'}$. The maximum information gain given in Definition 4.2 satisfies
$$\Gamma_{k,\tau}(T) = O\Big(\rho_{Z'}^{\frac{\alpha}{p}}\, T^{\frac{1}{p}}\, \big(\log(T)\big)^{1 - \frac{1}{p}}\Big).$$

We thus have
$$\Gamma_{k,\tau}\big(N^n_{h,Z'}\big) = O\Big(\rho_{Z'}^{\frac{\alpha}{p}}\, \big(N^n_{h,Z'}\big)^{\frac{1}{p}}\, \big(\log(N^n_{h,Z'})\big)^{1 - \frac{1}{p}}\Big) = O\Big(\big(\log(N^n_{h,Z'})\big)^{1 - \frac{1}{p}}\Big) = O\big(\log(N)\big), \qquad (33)$$
where the first equality is based on Lemma C.1 and the second equality is by the design of the partitioning in the exploration algorithm. Recall that each hypercube is partitioned when $\rho_{Z'}^{-\alpha} < N^n_{h,Z'} + 1$, ensuring that $N^n_{h,Z'}$ remains at most $\rho_{Z'}^{-\alpha}$.

We use the following lemma to bound the covering number of the class of value functions.

Lemma C.2 (Lemma 3 in Vakili & Olkhovskaya, 2023). Recall the class of state-action value functions $\mathcal{Q}_{k,h}(R, B)$, where $k : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ satisfies the polynomial eigendecay on a hypercube with side length $\rho_{Z'}$. We have
$$\log \mathcal{N}_{k,h}(\epsilon; R, B) = O\bigg( \Big(\frac{R^2 \rho_{Z'}^{\alpha}}{\epsilon^2}\Big)^{\frac{1}{p-1}} \Big(1 + \log\frac{R}{\epsilon}\Big) + \Big(\frac{B^2 \rho_{Z'}^{\alpha}}{\epsilon^2}\Big)^{\frac{2}{p-1}} \Big(1 + \log\frac{B}{\epsilon}\Big) \bigg).$$

For the covering number, with the choice of $\epsilon^n_h = \frac{H\sqrt{\log(NH/\delta)}}{\sqrt{N^n_{h,Z'}}}$, we have
$$\log \mathcal{N}_{k,h}\big(\epsilon^n_h; R_N, \beta^n_h(\delta, \epsilon^n_h)\big) = O\bigg( \Big(\frac{R_N^2 \rho_{Z'}^{\alpha}}{(\epsilon^n_h)^2}\Big)^{\frac{1}{p-1}} \Big(1 + \log\frac{R_N}{\epsilon^n_h}\Big) + \Big(\frac{(\beta^n_h(\delta, \epsilon^n_h))^2 \rho_{Z'}^{\alpha}}{(\epsilon^n_h)^2}\Big)^{\frac{2}{p-1}} \Big(1 + \log\frac{\beta^n_h(\delta, \epsilon^n_h)}{\epsilon^n_h}\Big) \bigg)$$
$$= O\bigg( \Big(\frac{R_N^2}{H^2 \log(NH/\delta)}\Big)^{\frac{1}{p-1}} \Big(1 + \log\frac{R_N}{\epsilon^n_h}\Big) + \Big(\frac{(\beta^n_h(\delta, \epsilon^n_h))^2}{H^2 \log(NH/\delta)}\Big)^{\frac{2}{p-1}} \Big(1 + \log\frac{\beta^n_h(\delta, \epsilon^n_h)}{\epsilon^n_h}\Big) \bigg),$$
where the second equality uses $\rho_{Z'}^{\alpha}/(\epsilon^n_h)^2 = \rho_{Z'}^{\alpha} N^n_{h,Z'} / (H^2 \log(NH/\delta)) \le 1/(H^2 \log(NH/\delta))$, which again follows from the splitting rule $N^n_{h,Z'} \le \rho_{Z'}^{-\alpha}$. We thus see that the choice of $\beta^n_h(\delta, \epsilon^n_h) = \Theta\big(H\sqrt{\log(NH/\delta)}\big)$ satisfies the requirement on the confidence-interval width over $Z'$. We now apply a probability union bound over all $Z' \in U^N_h$ to obtain
$$\beta(\delta) = \Theta\Big(H\sqrt{\log\big(HN|U^N_h|/\delta\big)}\Big) = \Theta\Big(H\sqrt{\log(NH/\delta)}\Big), \qquad (34)$$
where the second equality uses Lemma B.1. For this value of $\beta(\delta)$, we have, with probability at least $1 - \delta$, for all $h \in [H]$, $n \in [N]$, and $z \in \mathcal{Z}$,
$$\big|[P_h V^n_{h+1}](z) - f^n_h(z)\big| \le \beta(\delta)\, \sigma^n_h(z) + \epsilon^n_h, \qquad (35)$$
where $\epsilon^n_h$ is the parameter of the covering number corresponding to the cover element $Z'$ with $z \in Z'$. Thus $\Pr(\mathcal{E}) \ge 1 - \delta$ (replacing $\delta$ with $\delta/3$ gives the statement of the lemma). Finally, since $\epsilon^n_h = \frac{H\sqrt{\log(NH/\delta)}}{\sqrt{N^n_{h,Z'}}}$ is always smaller than the first term, we have
$$\big|[P_h V^n_{h+1}](z) - f^n_h(z)\big| \le \beta(\delta)\, \sigma^n_h(z), \qquad (36)$$
where $\beta$ is multiplied by a factor of $2$ that does not affect its expression in Equation (34). The proof of $\Pr(\bar{\mathcal{E}}) \ge 1 - \delta$ is similar.

C.2. Proof of Lemma B.3

This can be proven by induction. For $h = H + 1$, we have $V^*_{H+1}(s; r) = \bar{V}_{H+1}(s) = 0$ for any $s \in \mathcal{S}$. Also, conditioned on $\bar{\mathcal{E}}$, we have
$$Q^*_h(z; r) - \bar{Q}_h(z) = r_h(z) + [P_h V^*_{h+1}](z; r) - \min\big\{r_h(z) + g_h(z) + w_h(z),\ H\big\}$$
$$\le \max\big\{ [P_h V^*_{h+1}](z; r) - g_h(z) - w_h(z),\ 0 \big\}$$
$$= \max\big\{ [P_h V^*_{h+1}](z; r) - [P_h \bar{V}_{h+1}](z) + [P_h \bar{V}_{h+1}](z) - g_h(z) - w_h(z),\ 0 \big\}$$
$$\le \max\big\{ [P_h V^*_{h+1}](z; r) - [P_h \bar{V}_{h+1}](z),\ 0 \big\}$$
$$\le 0.$$
The first inequality uses $Q^*_h(z; r) \le H$; the second inequality holds under the event $\bar{\mathcal{E}}$, which gives $[P_h \bar{V}_{h+1}](z) - g_h(z) - w_h(z) \le 0$; and the last inequality follows from the induction assumption. Then, we have
$$V^*_h(s; r) - \bar{V}_h(s) = \max_{a \in \mathcal{A}} Q^*_h(s, a; r) - \max_{a \in \mathcal{A}} \bar{Q}_h(s, a) \le \max_{a \in \mathcal{A}} \big\{ Q^*_h(s, a; r) - \bar{Q}_h(s, a) \big\} \le 0.$$
That completes the proof.
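The induction in the proof of Lemma B.3 is the usual optimism argument: clipping the backup at $H$ and inflating the estimate $g_h$ by the width $w_h$ keeps $\bar{Q}_h$ above $Q^*_h$ whenever the event $\bar{\mathcal{E}}$ holds. The following toy tabular check is illustrative only (the paper's setting is kernel-based, and the MDP, widths, and perturbations below are arbitrary): it runs this backup with an estimate $g_h$ that is within $w_h$ of $[P_h \bar{V}_{h+1}]$ by construction and verifies $\bar{V}_h \ge V^*_h$.

```python
# Toy tabular check of the optimism argument in Lemma B.3 (illustrative only):
# if the planning-phase estimate g_h is within w_h of [P_h Vbar_{h+1}] (event
# Ebar), the clipped values Qbar_h = min{r_h + g_h + w_h, H} dominate Q*_h.
import numpy as np

rng = np.random.default_rng(3)
S, A, H = 6, 3, 8
P = rng.dirichlet(np.ones(S), size=(H, S, A))     # P[h, s, a] is a distribution
r = rng.random((H, S, A)) / H                      # rewards in [0, 1/H]
w = 0.05 * np.ones((H, S, A))                      # confidence widths

V_star = np.zeros(S)
V_bar = np.zeros(S)
for h in reversed(range(H)):
    # Optimal backup with the true transition kernel.
    Q_star = r[h] + P[h] @ V_star                  # shape (S, A)
    V_star = Q_star.max(axis=1)
    # Planning-phase backup: g_h is any estimate within w_h of P_h V_bar.
    g = P[h] @ V_bar + rng.uniform(-w[h], w[h])    # event Ebar holds by design
    Q_bar = np.minimum(r[h] + g + w[h], H)
    V_bar = Q_bar.max(axis=1)

print("optimism holds:", bool(np.all(V_bar >= V_star - 1e-12)))
```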
C.3. Proof of Lemma B.4

We prove this by induction. We have $\bar{V}_{H+1}(s) = V^\pi_{H+1}(s) = 0$ for any $s \in \mathcal{S}$. Also, conditioned on $\bar{\mathcal{E}}$, we have
$$\bar{V}_h(s_h) - V^\pi_h(s_h; r) = r_h(s_h, \pi_h(s_h)) + g_h(s_h, \pi_h(s_h)) + w_h(s_h, \pi_h(s_h)) - Q^\pi_h(s_h, \pi_h(s_h); r)$$
$$\le r_h(s_h, \pi_h(s_h)) + [P_h \bar{V}_{h+1}](s_h, \pi_h(s_h)) + 2 w_h(s_h, \pi_h(s_h)) - r_h(s_h, \pi_h(s_h)) - [P_h V^\pi_{h+1}](s_h, \pi_h(s_h); r)$$
$$= [P_h \bar{V}_{h+1}](s_h, \pi_h(s_h)) - [P_h V^\pi_{h+1}](s_h, \pi_h(s_h); r) + 2 w_h(s_h, \pi_h(s_h))$$
$$= \mathbb{E}_{s_{h+1} \sim P_h(\cdot \mid s_h, \pi_h(s_h))}\big[\bar{V}_{h+1}(s_{h+1}) - V^\pi_{h+1}(s_{h+1}; r)\big] + 2 w_h(s_h, \pi_h(s_h))$$
$$\le 2\sum_{h'=h}^{H} w_{h'}(s_{h'}, \pi(s_{h'})).$$
The first inequality comes from the event $\bar{\mathcal{E}}$ and the second inequality comes from the induction assumption.

C.4. Proof of Lemma B.5

This can be proven by induction. For $h = H + 1$, we have $V^*_{H+1}(s; r_n) = V^n_{H+1}(s) = 0$ for any $s \in \mathcal{S}$. Also, conditioned on $\mathcal{E}$, we have
$$Q^*_h(z; r_n) - Q^n_h(z) = r^n_h(z) + [P_h V^*_{h+1}](z; r_n) - \min\big\{ r^n_h(z) + f^n_h(z) + \beta(\delta)\sigma^n_h(z),\ H \big\}$$
$$\le \max\big\{ [P_h V^*_{h+1}](z; r_n) - f^n_h(z) - \beta(\delta)\sigma^n_h(z),\ 0 \big\}$$
$$\le \max\big\{ [P_h V^*_{h+1}](z; r_n) - [P_h V^n_{h+1}](z),\ 0 \big\}$$
$$\le 0.$$
The first inequality uses $Q^*_h(z; r_n) \le H$; the second inequality holds under the event $\mathcal{E}$, which gives $f^n_h(z) + \beta(\delta)\sigma^n_h(z) \ge [P_h V^n_{h+1}](z)$; and the last inequality follows from the induction assumption. The claim for the value functions then follows by taking the maximum over actions, as in the proof of Lemma B.3.

C.5. Proof of Lemma B.6

Recall $\zeta^n_h = [P_h V^n_{h+1}](s^n_h, a^n_h) - V^n_{h+1}(s^n_{h+1})$. We have, under the event $\mathcal{E}$,
$$V^n_h(s^n_h) = Q^n_h(s^n_h, a^n_h) = f^n_h(s^n_h, a^n_h) + r^n_h(s^n_h, a^n_h) + \beta(\delta)\sigma^n_h(s^n_h, a^n_h)$$
$$\le [P_h V^n_{h+1}](s^n_h, a^n_h) + r^n_h(s^n_h, a^n_h) + 2\beta(\delta)\sigma^n_h(s^n_h, a^n_h)$$
$$= [P_h V^n_{h+1}](s^n_h, a^n_h) + \Big(2 + \frac{1}{H}\Big)\beta(\delta)\sigma^n_h(s^n_h, a^n_h)$$
$$= \zeta^n_h + V^n_{h+1}(s^n_{h+1}) + \Big(2 + \frac{1}{H}\Big)\beta(\delta)\sigma^n_h(s^n_h, a^n_h).$$
The inequality holds under the event $\mathcal{E}$. Summing the telescoping series $V^n_h(s^n_h) - V^n_{h+1}(s^n_{h+1}) \le \zeta^n_h + (2 + \frac{1}{H})\beta(\delta)\sigma^n_h(s^n_h, a^n_h)$ over $h$, we get
$$V^n_1(s^n_1) \le \sum_{h=1}^{H} \zeta^n_h + \Big(2 + \frac{1}{H}\Big) \sum_{h=1}^{H} \beta(\delta)\sigma^n_h(s^n_h, a^n_h).$$
Taking the summation over $n$,
$$\sum_{n=1}^{N} V^n_1(s^n_1) \le \sum_{n=1}^{N} \sum_{h=1}^{H} \zeta^n_h + \Big(2 + \frac{1}{H}\Big) \sum_{n=1}^{N} \sum_{h=1}^{H} \beta(\delta)\sigma^n_h(s^n_h, a^n_h),$$
which completes the proof.
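The martingale terms $\zeta^n_h$ isolated by Lemma B.6 are what the Azuma-Hoeffding step (Term 1 in Section B.2) controls: each has zero conditional mean and magnitude at most $H$, so their double sum grows like $\sqrt{H^3 N}$ rather than like $N H^2$. The following toy Monte Carlo sketch (not from the paper; the 5-state chain, random transitions, and fixed proxy value function are arbitrary illustrative choices) simulates such terms and compares the realized sum to the Azuma-style bound.

```python
# Toy Monte Carlo illustration (not from the paper) of the martingale terms
# zeta^n_h = [P_h V^n_{h+1}](s^n_h, a^n_h) - V^n_{h+1}(s^n_{h+1}) from Lemma B.6.
# Each term has zero conditional mean and magnitude at most H, so by
# Azuma-Hoeffding the double sum over n and h stays O(H * sqrt(N * H * log(1/delta)))
# with high probability, far below the worst-case N * H * H.
import numpy as np

rng = np.random.default_rng(2)
S, H, N, delta = 5, 10, 2000, 0.05
P = rng.dirichlet(np.ones(S), size=S)       # one transition distribution per state
V_next = H * rng.random(S)                  # a fixed proxy value function in [0, H]

total = 0.0
for n in range(N):
    s = rng.integers(S)                     # arbitrary initial state each episode
    for h in range(H):
        s_next = rng.choice(S, p=P[s])
        total += P[s] @ V_next - V_next[s_next]   # zeta: conditional mean is zero
        s = s_next

azuma = H * np.sqrt(2 * N * H * np.log(2 / delta))
print(f"|sum of zeta| = {abs(total):.1f}")
print(f"Azuma-style bound = {azuma:.1f},  worst case N*H*H = {N * H * H}")
```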