Published as a conference paper at ICLR 2023

QUASI-OPTIMAL REINFORCEMENT LEARNING WITH CONTINUOUS ACTIONS

Yuhan Li*^a, Wenzhuo Zhou*^b, Ruoqing Zhu^a
^a University of Illinois Urbana-Champaign, ^b University of California Irvine
*Equal contribution. Correspondence to: Wenzhuo Zhou and Ruoqing Zhu.

Many real-world applications of reinforcement learning (RL) require making decisions in continuous action environments. In particular, determining the optimal dose level plays a vital role in developing medical treatment regimes. One challenge in adapting existing RL algorithms to medical applications, however, is that the popular infinite-support stochastic policies, e.g., the Gaussian policy, may assign riskily high dosages and seriously harm patients. Hence, it is important to induce a policy class whose support only contains near-optimal actions, and to shrink the action-searching area for effectiveness and reliability. To achieve this, we develop a novel quasi-optimal learning algorithm, which can be easily optimized in off-policy settings with guaranteed convergence under general function approximation. Theoretically, we analyze the consistency, sample complexity, adaptability, and convergence of the proposed algorithm. We evaluate our algorithm with comprehensive simulated experiments and a dose suggestion real-data application to the Ohio Type 1 diabetes dataset.

1 INTRODUCTION

Learning good strategies in a continuous action space is important for many real-world problems (Lillicrap et al., 2015), including precision medicine, autonomous driving, etc. In particular, when developing a new dynamic regime to guide the use of medical treatments, it is often necessary to decide the optimal dose level (Murphy, 2003; Laber et al., 2014; Chen et al., 2016; Zhou et al., 2021). In infinite-horizon sequential decision-making settings (Luckett et al., 2019; Shi et al., 2021), learning such a dynamic treatment regime falls into a reinforcement learning (RL) framework.

Many RL algorithms (Mnih et al., 2013; Silver et al., 2017; Nachum et al., 2017; Chow et al., 2018b; Hessel et al., 2018) have achieved considerable success when the action space is finite. A straightforward approach to adapting these methods to continuous domains is to discretize the continuous action space. However, this strategy either causes a large bias under coarse discretization (Lee et al., 2018a; Cai et al., 2021a;b) or suffers from the curse of dimensionality (Chou et al., 2017) under a fine grid.

There has been recent progress on model-free reinforcement learning in continuous action spaces without utilizing discretization. In policy-based methods (Williams, 1992; Sutton et al., 1999; Silver et al., 2014; Duan et al., 2016), a Gaussian distribution is used frequently for policy distribution representation, while its mean and variance are parameterized using function approximation and updated via policy gradient descent. In addition, many actor-critic based approaches, e.g., soft actor-critic (Haarnoja et al., 2018b), ensemble critic (Fujimoto et al., 2018) and Smoothie (Nachum et al., 2018a), have been developed to improve the performance in continuous action spaces. These works likewise target a Gaussian policy for action allocation. However, there are two less-investigated issues in the aforementioned RL approaches, especially for their applications in healthcare (Fatemi et al., 2021; Yu et al., 2021).
First, existing methods that use an infinite-support Gaussian policy as the treatment policy may assign arbitrarily high dose levels, which may potentially harm the patient (Yanase et al., 2020). Hence, these approaches are not reliable in practice due to safety and ethical concerns. It would be more desirable to develop a policy class that identifies the near-optimal (Tang et al., 2020), or at least safe, action regions, and reduces the optimal action search area for reliability and effectiveness. Actions outside the identified region are discriminated as non-optimal and would be screened out with zero density in the policy distribution. Second, for many real-world applications, the action space is bounded due to practical constraints. Examples include autonomous driving with a limited steering angle and dose assignment with a budget or safety constraint. In these scenarios, modeling an optimal policy by an infinite-support probability distribution, e.g., a Gaussian policy, would inevitably introduce a non-negligible off-support bias, as shown in Figure 2. In consequence, the off-support bias damages the performance of policy learning and results in a biased decision-making procedure. Instead, constructing a policy class with finite but adjustable support is one of the demanding solutions.

In this work, we take a substantial step towards solving the aforementioned issues by developing a novel quasi-optimal learning algorithm. Our development hinges upon a novel quasi-optimal Bellman operator and stationarity equation, which is solved via minimizing an unbiased kernel embedding loss. Quasi-optimal learning estimates an implicit stochastic policy distribution whose support region only contains near-optimal actions. In addition, our algorithm overcomes the difficulties of the non-smoothness learning issue and the double sampling issue (Baird, 1995), and can be easily optimized using sampled transitions in off-policy scenarios without training instability and divergence. The main contributions of this paper can be summarized as follows:

- We construct a novel Bellman operator and develop a reliable stochastic policy class, which is able to identify quasi-optimal action regions in scenarios with a bounded or unbounded action space. This addresses the shortcomings of existing approaches relying on modeling an optimal policy with infinite-support distributions.
- We formalize an unbiased learning framework for estimating the designed quasi-optimal policy. Our framework avoids the double sampling issue and can be optimized using sampled transitions, which is beneficial in offline policy optimization tasks.
- We thoroughly investigate the theoretical properties of the quasi-optimal learning algorithm, including the adaptability of the quasi-optimal policy class, the loss consistency, the finite-sample bound for the performance error, and the convergence analysis of the algorithm.
- Empirical analyses are conducted with comprehensive numerical experiments and a real-world case study, to evaluate the model performance in practice.

2 PRELIMINARIES

Notations We first give an introduction to our notations. For two strictly positive sequences $\{\Psi(m)\}_{m\geq 1}$ and $\{\Upsilon(m)\}_{m\geq 1}$, the notation $\{\Psi(m)\}_{m\geq 1}\gtrsim\{\Upsilon(m)\}_{m\geq 1}$ means that there exists a sufficiently small constant $c>0$ such that $\Psi(n)\geq c\,\Upsilon(n)$. $\|\cdot\|_{L_p}$ and $\|\cdot\|_{\infty}$ denote the $L_p$-norm and the supremum norm, respectively. We define the set indicator function $\mathbb{1}_{\mathrm{set}}(x)=1$ if $x\in\mathrm{set}$ and $0$ otherwise.
The notation $\mathbb{P}_n$ denotes the empirical measure, i.e., $\mathbb{P}_n=\frac{1}{n}\sum_{i=1}^{n}$. For two sets $\aleph_0$ and $\aleph_1$, the notation $\aleph_0\setminus\aleph_1$ indicates the set $\aleph_0$ excluding the elements in the set $\aleph_1$. We write $|\aleph_0|$ as the cardinality of the set $\aleph_0$. For any Borel set $\aleph_2$, we denote $\sigma(\aleph_2)$ as the Borel measure of $\aleph_2$. We denote a probability simplex over a space $\mathcal{F}$ by $\Delta(\mathcal{F})$, and in particular, $\mathrm{convex}(\mathcal{F})$ indicates the convex probability simplex over $\mathcal{F}$. We denote $\lfloor\cdot\rfloor$ as the floor function, and use $\mathcal{O}(\cdot)$ as the big-O convention.

Background A Markov decision process (MDP) is defined as a tuple $\langle\mathcal{S},\mathcal{A},\mathbb{P},R,\gamma\rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathbb{P}:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ is the unknown transition kernel, $R:\mathcal{S}\times\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is a bounded reward function, and $\gamma\in[0,1)$ is the discount factor. In this paper, we focus on the scenario of a continuous action space. We assume the offline data consist of $n$ i.i.d. trajectories, i.e., $\mathcal{D}_{1:n}=\{S_i^1,A_i^1,R_i^1,S_i^2,\ldots,S_i^{T},A_i^{T},R_i^{T},S_i^{T+1}\}_{i=1}^{n}$, where the length of trajectory $T$ is assumed to be non-random for simplicity. A policy $\pi$ is a map from the state space to the action space, $\pi:\mathcal{S}\to\mathcal{A}$. The learning goal is to search for an optimal policy $\pi^*$ which maximizes the expected discounted sum of rewards. Here $V_t^{\pi}(s)=\mathbb{E}^{\pi}\big[\sum_{k=1}^{\infty}\gamma^{k-1}R_{t+k}\,\big|\,S_t=s\big]$ is the value function under a policy $\pi$, where $\mathbb{E}^{\pi}$ is taken by assuming that the system follows the policy $\pi$, and the Q-function is defined as $Q_t^{\pi}(s,a)=\mathbb{E}^{\pi}\big[\sum_{k=1}^{\infty}\gamma^{k-1}R_{t+k}\,\big|\,S_t=s,A_t=a\big]$. In a time-homogeneous Markov process (Puterman, 2014), $V_t^{\pi}(s)$ and $Q_t^{\pi}(s,a)$ do not depend on $t$. The optimal value function $V^*$ is the unique fixed point of the Bellman operator $\mathcal{B}$,
$$\mathcal{B}V(s):=\max_{a}\ \mathbb{E}_{S_{t+1}\sim\mathbb{P}(\cdot\,|\,s,a)}\big[R_t+\gamma V(S_{t+1})\,\big|\,S_t=s,A_t=a\big].$$
Then $\mathcal{B}V^*(s)=V^*(s)$ for any $s\in\mathcal{S}$. An optimal policy $\pi^*$ can be obtained by taking the greedy action of $Q^*(s,a)$, that is, $\pi^*(s)=\arg\max_{a}Q^*(s,a)$. For the rest of the paper, we use the short notation $\mathbb{E}_{s'|s,a}$ for the conditional expectation $\mathbb{E}_{s'\sim\mathbb{P}(\cdot|s,a)}$; and $\mathbb{E}_{S_t,A_t,S_{t+1}}$ is short for $\mathbb{E}_{S_t\sim\upsilon,\,A_t\sim\pi_b(\cdot|S_t),\,S_{t+1}\sim\mathbb{P}(\cdot|S_t,A_t)}$, where $\upsilon$ is some fixed distribution and $\pi_b$ is some behavior policy.

3 METHODOLOGY

To start with, we first revisit the Bellman optimality equation from a policy-explicit view,
$$\mathcal{B}V^*(s):=\max_{\pi}\ \mathbb{E}_{a\sim\pi(\cdot|s),\,S_{t+1}|s,a}\big[R(S_{t+1},s,a)+\gamma V^*(S_{t+1})\big]=V^*(s).\qquad(1)$$
To obtain the optimal policy $\pi^*$ and value function $V^*$, an optimization idea is to minimize the discrepancy between the two sides of the equation under an $L_2$ loss. Unfortunately, there are several major challenges when it comes to optimization: (1) Non-smoothness: the Bellman operator involves a non-smooth hard-max operator, which leads to training instability; (2) Policy class: as discussed in Section 1, it is necessary to induce an optimal policy class whose support consists of quasi-optimal sub-regions for reliability, and avoids the off-support bias in Figure 2; (3) Double sampling: the conditional expectation $\mathbb{E}_{S_{t+1}|s,a}$ appears inside a squared loss, so an unbiased sample approximation requires two independent draws of $S_{t+1}$ from the same state-action pair, which is usually infeasible in real-world environments; (4) Off-policy data: directly minimizing the Bellman error is not easy to combine with off-policy data. To address these issues, we propose a quasi-optimal counterpart of the Bellman equation (1).

3.1 QUASI-OPTIMAL BELLMAN OPERATOR

In this subsection, we aim to tackle the first two challenges.
We propose a quasi-optimal counterpart of the Bellman operator $\mathcal{B}$ that simultaneously circumvents the non-smoothness obstacle and induces a novel policy class which can identify quasi-optimal sub-regions in continuous action spaces. We leverage the Legendre-Fenchel transform (Hiriart-Urruty & Lemaréchal, 2012) on the Bellman operator $\mathcal{B}$. For a convex probability simplex $\mathrm{convex}(\mathcal{A})$ and a strongly convex and continuous proximity function $\mathrm{prox}(\pi):\mathrm{convex}(\mathcal{A})\to\mathbb{R}$, the Fenchel transform counterpart of $\mathcal{B}$ is defined as
$$\mathcal{B}_{\mu}V^*_{\mu}(s)=\max_{\pi\in\mathrm{convex}(\mathcal{A})}\int_{a\in\mathcal{A}}\Big[Q^*_{\mu}(s,a)\,\pi(a|s)+\mu\,\mathrm{prox}\big(\pi(a|s)\big)\Big]\,da,\qquad(2)$$
where $Q^*_{\mu}(s,a)=\mathbb{E}_{S_{t+1}|s,a}\big[R(S_{t+1},s,a)+\gamma V^*_{\mu}(S_{t+1})\big]$, and $V^*_{\mu}(s)$ is the unique fixed point of the quasi-optimal Bellman operator $\mathcal{B}_{\mu}$. Note that, besides the smoothing purpose, we are also interested in constructing a stochastic optimal policy class that can screen out the non-optimal and sub-optimal actions. Therefore, we further define a special prox function class motivated by the rationale of the q-logarithm as $\mathrm{prox}(x)=\log_q(x):=\frac{x(1-x^{q-1})}{q-1}$, where $\int_{a\in\mathcal{A}}\mathrm{prox}(\pi(a|s))\,da=\frac{1}{q-1}\big(1-\int_{a\in\mathcal{A}}\pi^{q}(a|s)\,da\big)$ essentially generalizes Shannon's entropy (Martins et al., 2020). In this paper, we focus on the setting $q=2$.

Assumption 3.1. For any policy distribution $\pi\in\mathrm{convex}(\mathcal{A})$, its density is bounded above by a constant, i.e., $\|\pi(\cdot|s)\|_{\infty}\leq C$ for all $s\in\mathcal{S}$.

This assumption avoids some extreme cases where a stochastic policy distribution degenerates to a deterministic one. In the following, we show several nice properties of the proposed Bellman operator.

Proximal Approximation The operator $\mathcal{B}_{\mu}$ is a proximal approximation to $\mathcal{B}$. This delivers two messages: firstly, the approximation bias is upper bounded; secondly, the operator $\mathcal{B}_{\mu}$ is a smoothed substitute for $\mathcal{B}$. In particular, Theorem 3.1 demonstrates that the approximation bias vanishes to zero for small enough $\mu$. In addition, the operator $\mathcal{B}_{\mu}$ has a differentiable and analytical form (3), which justifies that $\mathcal{B}_{\mu}$ is a smoothed counterpart of $\mathcal{B}$; see Corollary S.1 in Appendix for details.

Theorem 3.1 (Proximal bias). Under Assumption 3.1, for any $s\in\mathcal{S}$ and value function $V$, $\mathcal{B}_{\mu}V(s)-\mathcal{B}V(s)\in[\mu(1-C),\mu]$. Moreover,
$$\mathcal{B}_{\mu}V^*_{\mu}(s)=\mu\Big[1-\frac{1}{\sigma(\mathcal{W}_s)}\Big(1-\frac{\int_{a'\in\mathcal{W}_s}Q^*_{\mu}(s,a')\,da'}{2\mu}\Big)^{2}\Big]+\frac{1}{4\mu}\int_{a\in\mathcal{W}_s}Q^{*2}_{\mu}(s,a)\,da,\qquad(3)$$
where $\mathcal{W}_s$ denotes the support of $\pi^*_{\mu}$ in (4) for a given state $s$.

Quasi-optimal Support Region In addition to the proximal approximation property, another unique and important property of $\mathcal{B}_{\mu}$ is inducing a policy $\pi^*_{\mu}$ whose support region contains all the actions with action-value higher than a certain threshold. The induced policy $\pi^*_{\mu}$ is bridged from the oracle Q-function:
$$\pi^*_{\mu}(a|s)=\Big[\frac{Q^*_{\mu}(s,a)-\frac{1}{\sigma(\mathcal{W}_s)}\int_{a'\in\mathcal{W}_s}Q^*_{\mu}(s,a')\,da'}{2\mu}+\frac{1}{\sigma(\mathcal{W}_s)}\Big]\,\mathbb{1}_{\mathcal{W}_s}(a),\qquad(4)$$
where the support of $\pi^*_{\mu}$, i.e., the screening set, is given by
$$\mathcal{W}_s:=\Big\{a\in\mathcal{A}:\int_{a'\in\mathcal{M}_s(a)}Q^*_{\mu}(s,a')\,da'-\sigma\big(\mathcal{M}_s(a)\big)\,Q^*_{\mu}(s,a)\leq 2\mu\Big\},\qquad(5)$$
with
$$\mathcal{M}_s(a):=\big\{a'\in\mathcal{A}:Q^*_{\mu}(s,a')>Q^*_{\mu}(s,a)\big\}.\qquad(6)$$
This mechanism allows us to identify multiple sub-regions in the entire action space which only contain near-optimal actions, and to weed out the sub-optimal and non-optimal support regions. Note that the identified sub-regions might not form a connected set in general, which is beneficial in the situation that the true Q-function has multiple modes.

Figure 1: An illustrating example of the quasi-optimal sub-regions. In the left panel, the lowest admissible action-value corresponds to the horizontal red dashed line, and the integral difference is the shadowed pink area, which equals $2\mu$. As shown in the right panel, when $\mu$ decreases, the pink area shrinks, and the quasi-optimal sub-regions become narrower.

The screening set in (5) indicates that the threshold parameter $\mu$ not only controls the degree of smoothness, but also determines how the quasi-optimal region behaves and controls the screening intensity, as shown in Figure 1. A numerical sketch of this screening step on a discretized action space is given below.
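To make the screening mechanism concrete, here is a minimal numerical sketch on a discretized one-dimensional action space. It is illustrative only: the toy Q-function, the grid, the value of µ, and the helper names (`quasi_optimal_region`, `induced_policy`) are our own choices, and the code follows the reading of (4)-(6) given above rather than the authors' implementation.

```python
import numpy as np

def quasi_optimal_region(actions, q_values, mu):
    """Approximate the quasi-optimal support W_s in (5)-(6) on a grid.

    An action a is kept when the integral difference
    int_{M_s(a)} (Q(s,a') - Q(s,a)) da', with M_s(a) = {a': Q(s,a') > Q(s,a)},
    does not exceed 2*mu (the pink area in Figure 1).
    """
    da = actions[1] - actions[0]                      # grid spacing
    keep = np.zeros_like(q_values, dtype=bool)
    for i, q in enumerate(q_values):
        better = q_values > q                         # M_s(a) on the grid
        area = np.sum(q_values[better] - q) * da      # integral difference
        keep[i] = area <= 2 * mu
    return keep

def induced_policy(actions, q_values, mu):
    """Density of the induced policy pi_mu in (4), restricted to W_s."""
    da = actions[1] - actions[0]
    keep = quasi_optimal_region(actions, q_values, mu)
    sigma = keep.sum() * da                           # Lebesgue measure of W_s
    q_bar = np.sum(q_values[keep]) * da / sigma       # average Q over W_s
    dens = np.where(keep, (q_values - q_bar) / (2 * mu) + 1.0 / sigma, 0.0)
    return np.clip(dens, 0.0, None)                   # clip tiny negative boundary values

# toy bimodal Q-function on a bounded action space (illustrative only)
actions = np.linspace(-1.0, 1.0, 401)
q_values = np.exp(-20 * (actions - 0.5) ** 2) + 0.8 * np.exp(-20 * (actions + 0.4) ** 2)
pi = induced_policy(actions, q_values, mu=0.05)
print("support fraction:", np.mean(pi > 0),
      "integrates to:", pi.sum() * (actions[1] - actions[0]))
```

With a bimodal toy Q-function the kept indices form two disjoint intervals around the two modes, and shrinking µ narrows both, mirroring Figure 1.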
3.2 q-GAUSSIAN POLICY DISTRIBUTION

Figure 2: An illustrating example of a bounded action space and the q-Gaussian policy distribution. The Gaussian policy assigns non-zero probability density to all actions, even those outside the support boundary of the true action space, which causes the off-support bias. In contrast, the q-Gaussian policy relieves such off-support bias thanks to the boundedness of the quasi-optimal region.

In this section, we bridge the induced policy distribution $\pi^*_{\mu}$ to an explainable q-Gaussian distribution. The q-Gaussian distribution avoids heavy tails, which makes it widely used in practice to model the effect of external stochasticity (d'Onofrio, 2013). In continuous action problems, e.g., medical dose suggestion, the q-Gaussian distribution is a more suitable choice than the Gaussian distribution for policy modeling, since it can filter out non-optimal and risky dose levels, i.e., too high or too low dosages. Motivated by the fact that the induced policy $\pi^*_{\mu}$ is able to identify quasi-optimal support sub-regions, and that the q-Gaussian policy distribution realizes bounded support, we conjecture that the q-Gaussian policy distribution might be recovered from the induced policy $\pi^*_{\mu}$. Fortunately, the q-Gaussian policy distribution is indeed a special case of the induced policy if $Q^*_{\mu}(s,a)$ is a concave quadratic function with respect to the action $a$. We illustrate this phenomenon in Theorem 3.2.

Theorem 3.2. Suppose $Q^*_{\mu}(s,a)$ is a concave quadratic function over $a\in\mathcal{A}$, i.e., $Q^*_{\mu}(s,a)=-\alpha_1(s)a^2+\alpha_2(s)a+\alpha_3(s):=Q^{N}_{\mu}(s,a)$, where $\alpha_1(s),\alpha_2(s),\alpha_3(s)$ are functions over $s\in\mathcal{S}$ and $\alpha_1(s)>0$ for all $s$. Then the induced policy distribution $\pi^*_{\mu}(\cdot|s)$ follows a q-Gaussian distribution with density function
$$\pi^*_{\mu}(a|s)=\Big[\frac{\alpha_1(s)}{2\mu}\Big(\frac{\big(12\alpha_1^2(s)\mu\big)^{2/3}}{4\alpha_1^2(s)}-\Big(a-\frac{\alpha_2(s)}{2\alpha_1(s)}\Big)^{2}\Big)\Big]_{+}:=\pi^{N}_{\mu}(a|s),\qquad(7)$$
and a closed-form quasi-optimal support region
$$\mathcal{W}^{N}_{s}:=\Big[\frac{\alpha_2(s)-\big(12\alpha_1^2(s)\mu\big)^{1/3}}{2\alpha_1(s)},\ \frac{\alpha_2(s)+\big(12\alpha_1^2(s)\mu\big)^{1/3}}{2\alpha_1(s)}\Big].\qquad(8)$$

The policy distribution $\pi^{N}_{\mu}(\cdot|s)$ behaves as an affine transformation of the standard q-Gaussian distribution with mean $\frac{\alpha_2(s)}{2\alpha_1(s)}$, at which the maximum action-value is attained, i.e., $\frac{\alpha_2(s)}{2\alpha_1(s)}=\arg\max_{a\in\mathcal{A}}Q^{N}_{\mu}(s,a)$. Note that the width of the quasi-optimal region, $\frac{(12\alpha_1^2(s)\mu)^{1/3}}{\alpha_1(s)}$, is determined by the threshold parameter $\mu$. The actions in the region $\mathbb{R}\setminus\mathcal{W}^{N}_{s}$ are discriminated as non-optimal and are assigned zero probability density. For a small $\mu$, i.e., strong screening intensity, a narrow region is identified as quasi-optimal, which yields a relatively conservative action recommendation. In contrast, with a large $\mu$, more actions are included in the support. In an extreme case, $\mathcal{W}^{N}_{s}$ degenerates to $\mathbb{R}$ as $\mu\to\infty$. In Theorem 4.1 of Section 4, we formally investigate how the intensity of $\mu$ affects the induced policy distribution.
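The closed form in Theorem 3.2 is easy to work with directly. The sketch below evaluates the bounded-support density in (7) and the quasi-optimal interval in (8), and draws actions by rejection sampling; the coefficients α1, α2 and the value of µ are illustrative, and the density expression is the normalized parabola implied by (8), not necessarily the paper's exact notation.

```python
import numpy as np

def q_gaussian_policy(alpha1, alpha2, mu):
    """Closed-form q-Gaussian policy of Theorem 3.2 (q = 2).

    Assumes Q_mu(s, a) = -alpha1 * a**2 + alpha2 * a + alpha3 with alpha1 > 0;
    alpha3 does not affect the policy. Returns (density, support), where
    support = [mean - h, mean + h] is the quasi-optimal region in (8).
    """
    mean = alpha2 / (2 * alpha1)                            # argmax of Q_mu(s, .)
    h = (12 * alpha1 ** 2 * mu) ** (1 / 3) / (2 * alpha1)   # half-width of W_s

    def density(a):
        a = np.asarray(a, dtype=float)
        return (alpha1 / (2 * mu)) * np.clip(h ** 2 - (a - mean) ** 2, 0.0, None)

    return density, (mean - h, mean + h)

def sample(alpha1, alpha2, mu, size, rng=np.random.default_rng(0)):
    """Rejection-sample actions from the bounded-support q-Gaussian policy."""
    density, (lo, hi) = q_gaussian_policy(alpha1, alpha2, mu)
    peak = density((lo + hi) / 2)                           # density maximum at the mean
    out = []
    while len(out) < size:
        a = rng.uniform(lo, hi, size)
        u = rng.uniform(0, peak, size)
        out.extend(a[u < density(a)].tolist())
    return np.array(out[:size])

density, support = q_gaussian_policy(alpha1=2.0, alpha2=1.0, mu=0.05)
print("quasi-optimal region:", support)                     # shrinks as mu decreases
print("sampled actions:", sample(2.0, 1.0, 0.05, size=5))
```

Every sampled action lies inside the interval in (8), which is exactly the screening behavior that an unbounded Gaussian policy cannot provide.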
So far, we have obtained closed-form representations for the general policy $\pi^*_{\mu}(\cdot|s)$ and the q-Gaussian policy $\pi^{N}_{\mu}$. However, how to estimate the policy from data remains unknown. As indicated by the challenges in Section 3, we need to address the double sampling issue and utilize off-policy data in optimization. Both challenges cannot be easily solved by minimizing the Bellman error. Fortunately, kernel embedding helps us to bypass these difficulties.

3.3 KERNEL EMBEDDING ON QUASI-OPTIMAL ERROR

In this subsection, we introduce the quasi-optimal learning framework for solving the induced policy $\pi^*_{\mu}$. First, we establish a stationarity equation in Theorem 3.3, which helps to incorporate off-policy data. Then we leverage the idea of kernel embedding (Gretton et al., 2012) to obtain an unbiased empirical loss without the double sampling issue.

Theorem 3.3 (Stationarity equation). Let $V^*_{\mu}$ be the fixed point of the quasi-optimal Bellman operator $\mathcal{B}_{\mu}$, and $\pi^*_{\mu}$ be the induced policy in (4). For any $s\in\mathcal{S}$, $a\in\mathcal{A}$, and $\mu\in(0,\infty)$, the pair $(V^*_{\mu},\pi^*_{\mu})$ satisfies the following equation:
$$\mathbb{E}_{S_{t+1}|s,a}\big[R(S_{t+1},s,a)+\gamma V_{\mu}(S_{t+1})\big]-\mu\,\mathrm{prox}'\big(\pi_{\mu}(a|s)\big)-\eta(s)+\varpi(s,a)=V_{\mu}(s).\qquad(9)$$
Here $\mathrm{prox}'(x)=2x-1$, and $\eta(s):\mathcal{S}\to[-\mu C,0]$ and $\varpi(s,a):\mathcal{S}\times\mathcal{A}\to\mathbb{R}_{+}$ are Lagrange multipliers satisfying $\varpi(s,a)\,\pi_{\mu}(a|s)=0$.

We call the discrepancy between the two sides of (9) the quasi-optimal error. Equation (9) connects the quasi-optimal value function $V^*_{\mu}$ and the policy $\pi^*_{\mu}$ at any arbitrary state-action pair. This provides an easy way to incorporate off-policy data, i.e., state-action pairs sampled from the state-action visitation under the behavior policy, without adjusting for the distribution mismatch.

Min-max Optimization One way to solve equation (9) is to minimize the quasi-optimal error under an $L_2$ loss function. Unfortunately, the double sampling issue would still appear if we replaced the unknown $\mathbb{E}_{S_{t+1}|s,a}\big[R(S_{t+1},s,a)+\gamma V_{\mu}(S_{t+1})\big]$ in the quasi-optimal error by its one-sample bootstrapping counterpart $R_t+\gamma V_{\mu}(S_{t+1})$. Alternatively, inspired by the average Bellman error (Jiang et al., 2017), we propose to minimize a weighted average quasi-optimal error, so that the unwanted conditional variance of the bootstrapping counterpart under the $L_2$ loss vanishes. We define the loss $L(V_{\mu},\pi_{\mu},\eta,\varpi,u)$ as
$$\mathbb{E}_{S_t,A_t,S_{t+1}}\Big[u(S_t,A_t)\Big(G_{V_{\mu},\pi_{\mu}}(S_t,A_t,S_{t+1})-\eta(S_t)+\varpi(S_t,A_t)-V_{\mu}(S_t)\Big)\Big],$$
where $G_{V_{\mu},\pi_{\mu}}(s,a,s'):=R(s',s,a)+\gamma V_{\mu}(s')-\mu\,\mathrm{prox}'(\pi_{\mu}(a|s))$ and $u(\cdot):\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is a bounded function in the $L_2$ space $L_2(C_0):=\{u\in L_2:\|u\|_{L_2}\leq C_0\}$. Essentially, the weight function $u$ fits the discrepancy of (9) and promotes the sample points with large quasi-optimal errors. As $L(V^*_{\mu},\pi^*_{\mu},\eta,\varpi,u)=0$ holds for any $u$ function, this leads to a minimax optimization:
$$\min_{V_{\mu},\pi_{\mu},\eta,\varpi}\ \max_{u\in L_2(C_0)}\ L^2(V_{\mu},\pi_{\mu},\eta,\varpi,u).\qquad(10)$$

Algorithm 1 Quasi-optimal Learning in Continuous Action Spaces
1: Input: observed transition pairs $\{(S_i^t,A_i^t,R_i^t,S_i^{t+1}):t=1,\ldots,T\}_{i=1}^{n}$.
2: Initialize the parameters of interest $(\theta,\xi)=(\theta^0,\xi^0)$, the mini-batch size $n_0$, the learning rate $\alpha_0$, the prox parameter $\mu$, the kernel bandwidth $bw_0$, and the stopping criterion $\varepsilon$.
3: For iterations $j=1$ to $k$:
4:   Randomly sample a mini-batch $\{(S_i^t,A_i^t,R_i^t,S_i^{t+1}):t=1,\ldots,T\}_{i=1}^{n_0}$.
5:   Decay the learning rate $\alpha_j=\mathcal{O}(j^{-1/2})$.
6:   Compute stochastic gradients with respect to $\theta$ and $\xi$: $\nabla_{\theta}=\mathbb{P}_{n_0}\nabla_{\theta}\widehat{L}_U$ and $\nabla_{\xi}=\mathbb{P}_{n_0}\nabla_{\xi}\widehat{L}_U$.
7:   Update the parameters of interest as $\theta^{j}\leftarrow\theta^{j-1}-\alpha_j\nabla_{\theta}\widehat{L}_U$, $\xi^{j}\leftarrow\xi^{j-1}-\alpha_j\nabla_{\xi}\widehat{L}_U$.
8:   Stop if $\|(\theta^{j},\xi^{j})-(\theta^{j-1},\xi^{j-1})\|\leq\varepsilon$.
9: Return $\widehat{\theta}\leftarrow\theta^{j}$, $\widehat{\xi}\leftarrow\xi^{j}$.

Kernel Representation Solving the minimax optimization problem (10) is unstable, and it is also intractable due to the difficulty of representing $u$ in the $L_2$ space. Fortunately, we identify a continuity invariance between the reward function and the optimal weight function $u^*(\cdot)$ (see Theorem S.2 in Appendix): the optimal $u^*(\cdot)$ is continuous as long as the reward function is continuous, which is widely satisfied in real-world applications.
For a positive definite kernel $K$, a bounded reproducing kernel Hilbert space (RKHS) $\mathcal{H}_{\mathrm{RKHS}}(C_0):=\{u\in\mathcal{H}_{\mathrm{RKHS}}:\|u\|_{K}\leq C_0\}$ has a diminishing approximation error to any continuous function class as $C_0\to\infty$ (Bach, 2017). Together with the continuity invariance, this provides a basis for representing the weight function in a bounded RKHS. This kernel representation further leads to a closed form of the inner optimization maximizer (Gretton et al., 2012); the detailed derivation is provided in Theorem S.3 in Appendix. Upon this, the minimax optimization is reduced to only minimizing the loss
$$L_U=\mathbb{E}_{S_t,\widetilde{S}_t,A_t,\widetilde{A}_t,S_{t+1},\widetilde{S}_{t+1}}\Big[\Lambda_{V_{\mu},\pi_{\mu}}(S_t,A_t,S_{t+1})\,K\big(S_t,A_t;\widetilde{S}_t,\widetilde{A}_t\big)\,\Lambda_{V_{\mu},\pi_{\mu}}(\widetilde{S}_t,\widetilde{A}_t,\widetilde{S}_{t+1})\Big],\qquad(11)$$
where $\Lambda_{V_{\mu},\pi_{\mu}}(s,a,s'):=G_{V_{\mu},\pi_{\mu}}(s,a,s')-\eta(s)+\varpi(s,a)-V_{\mu}(s)$ and $(\widetilde{S}_t,\widetilde{A}_t,\widetilde{S}_{t+1})$ is an independent copy of the transition pair $(S_t,A_t,S_{t+1})$. Observe that the loss $L_U$ is symmetric and kernel-represented. This motivates us to use an unbiased U-statistic estimator to obtain the sample loss. Given the observed data $\mathcal{D}_{1:n}$ with $n$ trajectories of length $T$, we can use a trajectory-based U-statistic estimator to capture the within-trajectory loss; the total loss $L_U$ is then aggregated as the empirical mean of the $n$ i.i.d. within-trajectory losses:
$$\min_{V_{\mu},\pi_{\mu},\eta,\varpi}\ \widehat{L}_U=\mathbb{P}_{n}\Big[\frac{1}{T(T-1)}\sum_{1\leq j\neq k\leq T}\Lambda_{V_{\mu},\pi_{\mu}}(S_i^j,A_i^j,S_i^{j+1})\,K\big(S_i^j,A_i^j;S_i^k,A_i^k\big)\,\Lambda_{V_{\mu},\pi_{\mu}}(S_i^k,A_i^k,S_i^{k+1})\Big]$$
$$\text{s.t.}\quad \varpi(a|s)\geq 0,\ \ \pi_{\mu}(a|s)\,\varpi(a|s)=0\ \ \text{and}\ \ \eta(s)\in[-\mu C,0]\ \ \text{for all }s\in\mathcal{S},\,a\in\mathcal{A}.\qquad(12)$$
The sample loss $\widehat{L}_U$ is unbiased and consistent for the population loss $L_U$; the consistency is justified in Theorem 4.2 by examining the tail behavior of $\widehat{L}_U$. In essence, solving (12) is a computationally intensive non-linear programming problem. Alternatively, we convert the constrained problem into an unconstrained one by restricting the Lagrange multipliers. Thus, it can be solved by an unconstrained true gradient algorithm, i.e., Algorithm 1, under the function approximation $(V_{\mu},\pi_{\mu},\eta,\varpi)=(V_{\mu}^{\theta_1},\pi_{\mu}^{\theta_2},\eta^{\xi_2},\varpi^{\xi_1})$; see Appendix for details.

4 THEORETICAL RESULTS

In this section, we study the theoretical properties of the proposed method. First, we study some general properties of the proposed quasi-optimal Bellman operator, given in Propositions S.1 and S.2 of Appendix. In Theorem 4.1, we disclose the effect of the intensity of the prox parameter $\mu$ on the induced optimal policy distribution. Moreover, a non-asymptotic concentration bound is established in Theorem 4.2, showing the consistency and measuring the rate of convergence of $\widehat{L}_U$ to $L_U$. Further, the overall performance error of the algorithm is given in Theorem 4.3, where the performance error is decomposed into four sources. Finally, we show that the proposed quasi-optimal learning is a convergent algorithm. Before we present the theoretical results, we introduce some assumptions on the boundedness condition of the MDP and the sample trajectory properties, respectively.

Assumption 4.1. The reward function $R(s',s,a)$ is uniformly bounded, i.e., $\|R(\cdot)\|_{\infty}\leq R_{\max}$.

Assumption 4.2. Suppose $\{S_t,A_t\}_{t\geq 1}$ is a strictly stationary and exponentially $\beta$-mixing sequence with a mixing coefficient $\beta(m)\lesssim\exp(-\delta_1 m)$ for $m\geq 1$. We further assume that the behavior policy $\pi_b$, which is used to collect the offline data $\mathcal{D}_{1:n}$, satisfies $\min_{a\in\mathcal{A},s\in\mathcal{S}}\pi_b(a|s)>0$.

Theorem 4.1 (Policy Adaptability). Under Assumption 4.1, for all $s\in\mathcal{S}$, the quasi-optimal policy distribution $\pi^*_{\mu}(\cdot|s)$ degenerates to a uniform distribution over $\mathcal{A}$ as $\mu\to\infty$, and $\pi^*_{\mu}(\cdot|s)$ concentrates to a point mass as $\mu\to 0$ and $C\to\infty$.
Theorem 4.1 formally investigates the effect of $\mu$ on $\pi^*_{\mu}(\cdot|s)$. In the extreme case that $\mu\to 0$ and $C\to\infty$, only the action maximizing $Q^*_{\mu}(s,a)$ is included in the quasi-optimal region. In the following, we establish a non-asymptotic concentration inequality for the empirical loss in the non-i.i.d. case.

Theorem 4.2. For any $\mu\in(0,\infty)$ and $\epsilon>0$, under Assumptions 4.1-4.2, the $\epsilon$-divergence of $|\widehat{L}_U-L_U|$ is bounded in probability, i.e.,
$$\mathbb{P}\big(|\widehat{L}_U-L_U|>\epsilon\big)\leq C_1\exp\Big(-\frac{\epsilon^{2}T}{C_2\epsilon M_{\max}^{2}+M_{\max}^{2}+\big(\frac{\epsilon^{2}T}{C_2 M_{\max}^{2}}\big)\log T\log\log T}\Big)+C_3\exp\big(-n\epsilon^{2}\big),$$
where $C_1$, $C_2$ and $C_3$ are constants depending on $\delta_1$, and $M_{\max}=\frac{4}{1-\gamma}R_{\max}+\mu C$.

Theorem 4.2 implies that $\widehat{L}_U$ is a consistent estimator of $L_U$, and thus the double sampling issue is avoided. Note that the concentration bound is sharper than the bound established in Chakrabortty & Kuchibhotla (2018), since we utilize a novel temporal correlatedness structure to decompose the U-statistic. We now analyze the performance error between the finite-sample learner and the true solution, which can be decomposed into four source errors.

Theorem 4.3. Under Assumptions 4.1-4.2, let $\widehat{V}^{\theta_1,k}_{\mu}$ be the optimizer from Algorithm 1, $V^*$ be the optimal value function, and $\kappa_{\min}$ be the smallest eigenvalue corresponding to an orthonormal basis of the $L_2(\mathcal{S}\times\mathcal{A})$ space. With probability $1-\delta$, the performance error is upper bounded by
$$\big\|\widehat{V}^{\theta_1,k}_{\mu}-V^{*}\big\|^{2}_{L_2}\leq\underbrace{\frac{C_4}{\kappa_{\min}(1-\gamma)^{2}}\sqrt{\frac{C_5 D_{\text{P-dim}}\,\Delta\,\log(8C_4/\delta)}{T}}}_{\text{generalization error}}+\underbrace{\frac{C_7\,\mu^{2}\,(C+|1-C|^{-1})^{2}}{(1-\gamma)^{2}}}_{\text{proximal bias}}+\underbrace{C_8\big\|\widehat{V}^{\theta_1}_{\mu}-\widehat{V}^{\theta_1,k}_{\mu}\big\|^{2}_{L_2}}_{\text{optimization error}}+\ \epsilon_{\text{approximation error}},$$
where $\Delta=D_{\text{P-dim}}\log\big(T/(2\delta)\big)+\log_{+}\big(C_5 C_6^{D_{\text{P-dim}}}\big)$, $D_{\text{P-dim}}=\text{P-dim}(\Theta_1)+\text{P-dim}(\Theta_2)+\text{P-dim}(\Xi_1)+\text{P-dim}(\Xi_2)$, and $C_4,\ldots,C_8$ are some constants. Here $\text{P-dim}(\cdot)$ denotes the pseudo-dimension operator (Györfi, 2010), and $\Theta_1$, $\Theta_2$, $\Xi_1$ and $\Xi_2$ are the function spaces for $V_{\mu}$, $\pi_{\mu}$, $\varpi$ and $\eta$, respectively. The $\epsilon_{\text{approximation error}}$ arises from the parametrization $(V_{\mu}^{\theta_1},\pi_{\mu}^{\theta_2},\varpi^{\xi_1},\eta^{\xi_2})$ of $(V_{\mu},\pi_{\mu},\varpi,\eta)$.

The above sample complexity bound gives an insight into the performance error of the proposed algorithm. The generalization error is $\varepsilon_{\text{gerr}}=\mathcal{O}(1/\sqrt{T})$ if $n$ is of the same order as $T$, the proximal bias is $\varepsilon_{\text{prox}}=\mathcal{O}(\mu^{2})$, and the optimization error is $\varepsilon_{\text{optim}}=\mathcal{O}(1/k)$ for $k$ iterations. Although the prox function introduces a proximal bias in the quasi-optimal Bellman operator $\mathcal{B}_{\mu}$, it leads to a smoothed approximation of $\mathcal{B}$. There exists a trade-off between the proximal bias and the approximation error: as $\mu$ increases, the proximal bias grows but the approximation error decreases, since the true function space becomes smoother and easier to approximate; on the other hand, a small $\mu$ leads to a small proximal bias but a relatively large approximation error.

Theorem 4.4. Suppose $\widehat{L}_U$ in Algorithm 1 is differentiable, but not necessarily convex, its gradient $\nabla\widehat{L}_U(\theta,\xi)$ is $M_L$-Lipschitz, and the stochastic gradient has bounded variance, $\mathrm{Var}(\nabla_{\theta}+\nabla_{\xi})\leq\sigma_0^{2}$. Suppose the learning rates $\{\alpha_j\}$ are set to $\alpha_j=\min\big\{\frac{2}{M_L},\frac{\Lambda}{\sigma_0\sqrt{j}}\big\}$ for some $\Lambda\geq 0$ and $\varepsilon$ is sufficiently small. Let $k^*=\widetilde{k}$ with $\mathbb{P}(\widetilde{k}=j)=\frac{\alpha_j(2-M_L\alpha_j)}{\sum_{j=1}^{k}\alpha_j(2-M_L\alpha_j)}$ for $j=1,\ldots,k$. Then, if $(\widehat{\theta},\widehat{\xi})$ is the optimization solution and $(\theta^{1},\xi^{1})$ is the first-step solution, we have
$$\big\|\nabla\widehat{L}_U(\widehat{\theta},\widehat{\xi})\big\|^{2}_{L_2}\leq\frac{2M_L\big(\widehat{L}_U(\theta^{1},\xi^{1})-\min_{\theta,\xi}\widehat{L}_U(\theta,\xi)\big)}{M_L\,k}+\frac{\sigma_0\sqrt{M_L}\,\Lambda}{\sqrt{k}}+\frac{\Lambda\sigma_0 M_L}{k}.$$

Theorem 4.4 implies that the quasi-optimal learning algorithm converges to a stationary point at a sub-linear rate $\mathcal{O}(1/\sqrt{k})$ even if the empirical loss is non-convex. This property serves as a basis for applying non-linear function approximation with convergence guarantees. Theorem 4.4 is adapted from Corollary 2.2 in Ghadimi & Lan (2013) under a decaying learning rate and a Euclidean stopping criterion. The convergence of Algorithm 1 is blessed by our unbiased stochastic gradient estimator; a minimal implementation sketch of the empirical loss (12) and this stochastic update is given below.
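The following sketch makes the empirical loss (12) and the stochastic update of Algorithm 1 concrete under simple function approximation. It is an illustration under assumptions, not the authors' implementation: the two-layer networks, the Gaussian kernel, the squashing used to impose η ∈ [−µC, 0] and ϖ ≥ 0, the constant learning rate (Algorithm 1 uses a decaying one), and all hyperparameters are our own choices.

```python
import torch
import torch.nn as nn

def mlp(din, dout):
    return nn.Sequential(nn.Linear(din, 64), nn.Tanh(), nn.Linear(64, dout))

class QuasiOptimalLearner(nn.Module):
    """Minimal sketch: minimize the U-statistic kernel loss (12) by SGD.

    V, pi, eta, omega are small MLPs; pi outputs a nonnegative density value,
    eta is kept in [-mu*C, 0] and omega >= 0 via squashing, which stands in for
    the explicit constraints in (12).
    """
    def __init__(self, s_dim, mu=0.05, gamma=0.9, C=10.0, bandwidth=1.0):
        super().__init__()
        self.V, self.pi = mlp(s_dim, 1), mlp(s_dim + 1, 1)
        self.eta, self.omega = mlp(s_dim, 1), mlp(s_dim + 1, 1)
        self.mu, self.gamma, self.C, self.bw = mu, gamma, C, bandwidth

    def residual(self, s, a, r, s_next):
        sa = torch.cat([s, a], dim=-1)
        pi = nn.functional.softplus(self.pi(sa))              # policy density >= 0
        eta = -self.mu * self.C * torch.sigmoid(self.eta(s))  # eta in [-mu*C, 0]
        omega = nn.functional.softplus(self.omega(sa))        # Lagrange multiplier >= 0
        g = r + self.gamma * self.V(s_next) - self.mu * (2 * pi - 1)
        return g - eta + omega - self.V(s)                    # Lambda in (11)

    def loss(self, s, a, r, s_next):
        lam = self.residual(s, a, r, s_next)                  # (T, 1) for one trajectory
        sa = torch.cat([s, a], dim=-1)
        K = torch.exp(-torch.cdist(sa, sa).pow(2) / (2 * self.bw ** 2))
        K = K - torch.diag_embed(torch.diagonal(K))           # drop j == k terms (U-statistic)
        T = s.shape[0]
        return (lam.T @ K @ lam).squeeze() / (T * (T - 1))

# one stochastic gradient step on a fake mini-batch of one trajectory (illustrative)
learner = QuasiOptimalLearner(s_dim=3)
opt = torch.optim.SGD(learner.parameters(), lr=1e-2)
s, a = torch.randn(24, 3), torch.rand(24, 1)
r, s_next = torch.randn(24, 1), torch.randn(24, 3)
opt.zero_grad()
L = learner.loss(s, a, r, s_next)
L.backward()
opt.step()
print(float(L))
```

Averaging this per-trajectory loss over a mini-batch of trajectories recovers the outer empirical mean in (12), and iterating the gradient step with a decaying learning rate gives the loop analyzed in Theorem 4.4.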
5 RELATED WORKS

In this work, we propose a provably convergent and sample-efficient off-policy optimization algorithm. Our learning algorithm is trained in a fully offline fashion, without any future online interaction with the environment. This connects our work to offline RL algorithms (Lange et al., 2012; Levine et al., 2020). Due to space limitations, we defer the discussion on offline RL to Appendix. Algorithmically, our work is related to entropy-regularized reinforcement learning algorithms (Rawlik et al., 2012; Haarnoja et al., 2017), but these works are fundamentally different from ours. Our formulation is motivated by constructing a proximal counterpart of the Bellman operator, which serves as a basis for the subsequent quasi-optimal learning algorithm. Besides, the major drawback of the existing algorithms (Lee et al., 2018b; Chow et al., 2018b; Vieillard et al., 2020) is the lack of theoretical guarantees when accompanied by function approximation: it is not clear whether the algorithm is convergent, generalizable, and consistent. In contrast, our algorithm is thoroughly examined on both theoretical and empirical fronts. Nachum et al. (2017) and Chow et al. (2018b) exploit a stationarity condition analogous to Theorem 3.3 and minimize an upper bound of the error, which is biased and encounters the double sampling issue. In contrast, our work leverages kernel embedding to bypass the double sampling issue, and is provably consistent. Unlike our algorithm, the algorithms for continuous control problems, e.g., (Haarnoja et al., 2018b; Nachum et al., 2018b; Lee et al., 2019), do not check policy optimality, but separately model a pre-specified policy class. This may introduce an additional bias if the pre-specified policy class is misspecified.

Our approach exemplifies more recent efforts that aim to learn an optimal policy with continuous actions (Lillicrap et al., 2015). One of our key innovations is to develop a policy class that can identify quasi-optimal sub-regions, where the induced policy has a closed form in terms of the value function. This distinguishes us from approaches such as (Silver et al., 2014; Mnih et al., 2016; Kumar et al., 2019; 2020). These methods typically require prior knowledge to determine a pre-specified policy class and commonly use a Gaussian-family distribution, unfortunately facing the risk of off-support bias. Our work is also relevant to safe/risk-sensitive RL. When the risk measure is defined based on the reward, e.g., the quantile of the return, it draws connections to our algorithm. Given the potential application scenarios, quasi-optimal learning is also related to RL in the healthcare domain and the trade-offs between safety and optimality. Tang et al. (2020) construct set-valued policies of near-optimal actions, allowing interaction between the clinician and the decision support system. However, their method is not applicable in a fully offline setting. Fatemi et al. (2021) assess regions of risk and identify treatments to avoid in a safety-critical environment. Nevertheless, a near-optimal regret guarantee is vacuous in their framework. Due to page limits, we provide a detailed discussion on safe and healthcare RL in Appendix.

6 EXPERIMENTS

In this section, we evaluate our proposed method on synthetic and real environments.
We compare our method to state-of-the-art baselines including DDPG (Lillicrap et al., 2015), SAC (Haarnoja et al., 2018a), BEAR (Kumar et al., 2019), Greedy-GQ (Ertefaie & Strawderman, 2018), and V-Learning (Luckett et al., 2019). We also compete with two safe RL algorithms, CQL (Kumar et al., 2020) and IQN (Dabney et al., 2018a), for a comprehensive comparison from the safe RL point of view.

Figure 3: Boxplots of the discounted return over 50 repeated experiments in Environments I-IV, across sample sizes (N = 25, 50; T = 24, 36), for the proposed method, IQN, CQL, DDPG, SAC, BEAR, V-Learning, and Greedy-GQ.

Synthetic Data The four environments are simulated to mimic real environments for continuous treatment applications. In Environments I and II, we consider a bounded action space to evaluate the potential of quasi-optimal learning for addressing off-support bias. The design of Environment III is to mimic a safety-critical environment by incorporating the notion of safety into the reward function (Jia et al., 2020), i.e., the optimal dosage is unique, and a high dosage leads to excessive toxicity while a lower dosage is ineffective (Zang et al., 2014). This is helpful for examining safety performance. In Environment IV, all the methods are implemented and compared in a more complex environment. A detailed discussion of the experiment designs and settings is deferred to Section D in Appendix.

Figure 3 shows that our proposed method outperforms competing methods with a relatively small variance. This mainly benefits from identifying the quasi-optimal region, which guarantees that the suggested action is near-optimal, hence improving the performance. In comparison, SAC and BEAR use a Gaussian policy and assign non-negligible positive densities to all actions, even the non-optimal ones, which damages model performance. Meanwhile, even though the safe RL methods (i.e., CQL and IQN) show better performance and smaller variance compared with non-safe methods, their performance is still negatively affected by assigning non-zero densities to non-optimal actions. In addition, in Environments I and II with bounded action support, the competing methods are affected by an off-support bias which lowers their discounted return. In Environments III and IV, the performance gains of the proposed method come mainly from recovering the quasi-optimal regions well. Also, note that our algorithm achieves stable performance in small sample size settings, which is blessed by the smoothness and optimization-friendliness of our algorithm. This is promising as limited data is common in medical applications. Additional experiment results, including a safety criterion (i.e., the distribution of the discounted sum of rewards) and a sensitivity analysis of $\mu$, are provided in Appendix.

Real Data: An Ohio Type 1 Diabetes Case Study We study the Ohio type 1 diabetes (OhioT1DM) dataset (Marling & Bunescu, 2020), which contains 2 cohorts of patients with Type-1 diabetes, each patient with 8 weeks of life-event data including health status measurements and insulin injection dosage. Clinicians are interested in adjusting insulin injection dose levels (Marling & Bunescu, 2020; Bao et al., 2011) based on a patient's health status to maintain the glucose level in a certain range for safe dose suggestions.
As each individual has dramatically distinctive glucose dynamics, we follow Zhu et al. (2020) to regard each patient's data as an independent dataset, and the data from each day as a trajectory. The state variables are health status measurements, and the action space is a bounded insulin dose range. The glycemic index is regarded as a reward function to measure the goodness of the dose suggestion. See more details of the experiment setup in Appendix. As shown in Table 1, the proposed method achieves the best performance for almost all patients. The proposed method mitigates the off-support bias in this bounded dosage space and outperforms the competing methods. This finding is consistent with the results in the synthetic data and demonstrates the potential of our method in continuous action spaces. Besides model performance, we illustrate the safety guarantee of quasi-optimal learning with additional experiment results and analyses in Appendix.

Table 1: The discounted return for the policy improvement based on 50 repeated experiments.

| Patient ID | Proposed | DDPG | SAC | BEAR | Greedy-GQ | VL | CQL | IQN |
|---|---|---|---|---|---|---|---|---|
| 540 | 18.6 ± 0.6 | 14.1 ± 2.3 | 14.2 ± 1.2 | 13.7 ± 0.9 | 15.5 ± 2.4 | 14.1 ± 2.4 | 17.0 ± 0.9 | 18.2 ± 0.9 |
| 544 | 11.0 ± 0.7 | 7.5 ± 1.5 | 7.5 ± 2.5 | 5.9 ± 0.8 | 6.3 ± 2.9 | 8.1 ± 2.9 | 9.3 ± 1.0 | 9.8 ± 1.0 |
| 552 | 6.3 ± 0.4 | 4.8 ± 0.5 | 5.7 ± 1.0 | 3.6 ± 0.6 | 4.1 ± 1.8 | 5.2 ± 1.3 | 6.7 ± 0.7 | 6.1 ± 0.8 |
| 567 | 29.9 ± 1.5 | 30.0 ± 2.0 | 27.3 ± 2.2 | 29.6 ± 1.2 | 24.8 ± 3.8 | 20.2 ± 2.8 | 31.5 ± 1.1 | 29.8 ± 0.6 |
| 584 | 32.1 ± 0.8 | 27.0 ± 2.0 | 23.3 ± 3.2 | 26.9 ± 1.3 | 17.8 ± 3.2 | 18.7 ± 2.6 | 26.6 ± 1.3 | 27.7 ± 1.2 |
| 596 | 5.5 ± 1.1 | 4.1 ± 0.8 | 4.5 ± 0.9 | 2.7 ± 1.0 | 2.7 ± 1.8 | 3.7 ± 3.0 | 4.6 ± 0.6 | 4.7 ± 0.6 |
| 559 | 24.1 ± 1.4 | 20.1 ± 1.2 | 19.6 ± 1.2 | 19.6 ± 0.7 | 17.3 ± 1.6 | 20.6 ± 2.7 | 22.1 ± 1.3 | 22.6 ± 1.2 |
| 563 | 11.6 ± 0.6 | 8.4 ± 0.9 | 9.3 ± 0.7 | 8.4 ± 0.7 | 9.2 ± 1.5 | 8.8 ± 1.9 | 9.4 ± 0.7 | 9.9 ± 0.8 |
| 570 | 25.0 ± 0.8 | 24.5 ± 1.4 | 26.1 ± 0.8 | 25.8 ± 0.8 | 22.8 ± 1.6 | 22.6 ± 1.5 | 25.8 ± 0.9 | 25.9 ± 0.8 |
| 575 | 15.5 ± 1.0 | 10.4 ± 1.3 | 8.8 ± 1.4 | 10.2 ± 1.0 | 5.7 ± 2.8 | 8.5 ± 2.3 | 12.6 ± 0.9 | 12.7 ± 1.2 |
| 588 | 18.6 ± 0.7 | 14.2 ± 1.3 | 13.5 ± 1.5 | 12.0 ± 0.9 | 10.0 ± 3.1 | 8.6 ± 2.3 | 15.7 ± 0.8 | 15.9 ± 1.3 |
| 591 | 15.4 ± 1.0 | 12.3 ± 0.6 | 11.9 ± 0.6 | 12.8 ± 0.7 | 10.7 ± 1.7 | 10.5 ± 2.6 | 14.9 ± 0.6 | 15.2 ± 0.7 |

7 CONCLUSIONS

We introduce a novel quasi-optimal learning algorithm for continuous action allocation, which is particularly useful for determining the dose level when developing medical treatment regimes. The quasi-optimal learning algorithm is provably convergent in off-policy settings, and a PAC bound is provided to analyze its sample complexity. The promising results raise some interesting directions for future work, including extending the framework to online settings that interact with environments.

8 ACKNOWLEDGMENTS

The authors are grateful to the anonymous reviewers and area chair for valuable comments and suggestions. Ruoqing Zhu's research is partially supported by a grant from the National Science Foundation DMS-2210657.

9 REPRODUCIBILITY STATEMENT

We include the reproducible code for all the experiments and the guideline for access to the Ohio Type 1 Diabetes dataset in the GitHub link https://github.com/liyuhan529/Quasi-optimal-Learning. The experiment details are provided in Appendix for reproducibility purposes. All proofs of the main theorems and additional theorems are included in Appendix.

10 ETHICS STATEMENT

The proposed method aims at finding an optimal policy in continuous action space, with a special focus on medical applications. We believe the proposed work has potential applications in sequential treatment dose suggestion, e.g., managing diabetes through insulin injection.
We admit that the proposed method needs additional validation experiments in controlled settings for practical use in medical applications to avoid abundant risks. The proposed algorithm should not be used as a stand-alone tool nor as a replacement of human experts. András Antos, Csaba Szepesvári, and Rémi Munos. Value-iteration based fitted policy iteration: learning with a single trajectory. In 2007 IEEE international symposium on approximate dynamic programming and reinforcement learning, pp. 330 337. IEEE, 2007. András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellmanresidual minimization based fitted policy iteration and a single sample path. Machine Learning, 71 (1):89 129, 2008. Miguel Angel Arcones and Bin Yu. Central limit theorems for empirical andu-processes of stationary mixing sequences. Journal of Theoretical Probability, 7(1):47 71, 1994. Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629 681, 2017. Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pp. 30 37. Elsevier, 1995. Jiansong Bao, Heather R Gilbertson, Robyn Gray, Diane Munns, Gabrielle Howard, Peter Petocz, Stephen Colagiuri, and Jennie C Brand-Miller. Improving the estimation of mealtime insulin dose in adults with type 1 diabetes: the normal insulin demand for dose adjustment (nidda) study. Diabetes Care, 34(10):2146 2151, 2011. Dimitri P Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3): 334 334, 1997. Hengrui Cai, Chengchun Shi, Rui Song, and Wenbin Lu. Deep jump learning for off-policy evaluation in continuous treatment settings. Advances in Neural Information Processing Systems, 34:15285 15300, 2021a. Hengrui Cai, Chengchun Shi, Rui Song, and Wenbin Lu. Jump interval-learning for individualized decision making. ar Xiv preprint ar Xiv:2111.08885, 2021b. Abhishek Chakrabortty and Arun Kumar Kuchibhotla. Tail bounds for canonical u-statistics and u-processes with unbounded kernels. 2018. Published as a conference paper at ICLR 2023 Guanhua Chen, Donglin Zeng, and Michael R Kosorok. Personalized dose finding using outcome weighted learning. Journal of the American Statistical Association, 111(516):1509 1521, 2016. Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pp. 1042 1051. PMLR, 2019. Po-Wei Chou, Daniel Maturana, and Sebastian Scherer. Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In International conference on machine learning, pp. 834 843. PMLR, 2017. Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A lyapunovbased approach to safe reinforcement learning. Advances in neural information processing systems, 31, 2018a. Yinlam Chow, Ofir Nachum, and Mohammad Ghavamzadeh. Path consistency learning in tsallis entropy regularized mdps. In International Conference on Machine Learning, pp. 979 988, 2018b. Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In International conference on machine learning, pp. 1096 1105. PMLR, 2018a. Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. 
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018b. Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. Sbeed: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, pp. 1125 1134. PMLR, 2018. Alberto d Onofrio. Bounded noises in physics, biology, and engineering. Springer, 2013. Yoel Drori and Ohad Shamir. The complexity of finding stationary points with stochastic gradient descent. In International Conference on Machine Learning, pp. 2658 2667. PMLR, 2020. Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International conference on machine learning, pp. 1329 1338. PMLR, 2016. Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 2005. Ashkan Ertefaie and Robert L Strawderman. Constructing dynamic treatment regimes over indefinite time horizons. Biometrika, 105(4):963 977, 2018. Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized policy iteration with nonparametric function spaces. The Journal of Machine Learning Research, 17(1):4809 4874, 2016. Mehdi Fatemi, Taylor W Killian, Jayakumar Subramanian, and Marzyeh Ghassemi. Medical deadends and learning to identify high-risk states and treatments. Advances in Neural Information Processing Systems, 34:4856 4870, 2021. Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actorcritic methods. In International conference on machine learning, pp. 1587 1596. PMLR, 2018. Javier Garcıa and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437 1480, 2015. Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341 2368, 2013. Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723 773, 2012. Published as a conference paper at ICLR 2023 Shangding Gu, Long Yang, Yali Du, Guang Chen, Florian Walter, Jun Wang, Yaodong Yang, and Alois Knoll. A review of safe reinforcement learning: Methods, theory and applications. ar Xiv preprint ar Xiv:2205.10330, 2022. László Györfi. A distribution-free theory of nonparametric regression, 2010. Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352 1361. PMLR, 2017. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861 1870. PMLR, 2018a. Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. ar Xiv preprint ar Xiv:1812.05905, 2018b. Fang Han. An exponential inequality for u-statistics under mixing conditions. Journal of Theoretical Probability, 31(1):556 578, 2018. Katharine E Henry, David N Hager, Peter J Pronovost, and Suchi Saria. A targeted real-time early warning score (trewscore) for septic shock. 
Science translational medicine, 7(299):299ra122 299ra122, 2015. Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-second AAAI conference on artificial intelligence, 2018. Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of Convex Analysis. Springer Science & Business Media, 2012. Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pp. 409 426. Springer, 1994. Matthew W Hoffman, Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Regularized least squares temporal difference learning with nested 2 and 1 penalization. In European Workshop on Reinforcement Learning, pp. 102 114. Springer, 2011. Yan Jia, John Burden, Tom Lawton, and Ibrahim Habli. Safe reinforcement learning for sepsis treatment. In 2020 IEEE International Conference on Healthcare Informatics (ICHI), pp. 1 7. IEEE, 2020. Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low bellman rank are pac-learnable. In International Conference on Machine Learning, pp. 1704 1713. PMLR, 2017. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Matthieu Komorowski, Leo A Celi, Omar Badawi, Anthony C Gordon, and A Aldo Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature medicine, 24(11):1716 1720, 2018. Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019. Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179 1191, 2020. Eric B Laber, Daniel J Lizotte, Min Qian, William E Pelham, and Susan A Murphy. Dynamic treatment regimes: Technical challenges and applications. Electronic journal of statistics, 8(1): 1225, 2014. Published as a conference paper at ICLR 2023 Michail G Lagoudakis and Ronald Parr. Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107 1149, 2003. Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement learning, pp. 45 73. Springer, 2012. Kyowoon Lee, Sol-A Kim, Jaesik Choi, and Seong-Whan Lee. Deep reinforcement learning in continuous action spaces: a case study in the game of simulated curling. In International conference on machine learning, pp. 2937 2946. PMLR, 2018a. Kyungjae Lee, Sungjoon Choi, and Songhwai Oh. Sparse markov decision processes with causal sparse tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1466 1473, 2018b. Kyungjae Lee, Sungyub Kim, Sungbin Lim, Sungjoon Choi, and Songhwai Oh. Tsallis reinforcement learning: A unified framework for maximum entropy reinforcement learning. ar Xiv preprint ar Xiv:1902.00137, 2019. Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. ar Xiv preprint ar Xiv:2005.01643, 2020. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 
Continuous control with deep reinforcement learning. ar Xiv preprint ar Xiv:1509.02971, 2015. Daniel J Luckett, Eric B Laber, Anna R Kahkoska, David M Maahs, Elizabeth Mayer-Davis, and Michael R Kosorok. Estimating dynamic treatment regimes in mobile health using v-learning. Journal of the American Statistical Association, 2019. Daniel J Luckett, Eric B Laber, Anna R Kahkoska, David M Maahs, Elizabeth Mayer-Davis, and Michael R Kosorok. Estimating dynamic treatment regimes in mobile health using v-learning. Journal of the American Statistical Association, 115(530):692 706, 2020. Xiaoteng Ma, Li Xia, Zhengyuan Zhou, Jun Yang, and Qianchuan Zhao. Dsac: distributional soft actor critic for risk-sensitive reinforcement learning. ar Xiv preprint ar Xiv:2004.14547, 2020. Hamid R Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 719 726, 2010. Cindy Marling and Razvan Bunescu. The ohiot1dm dataset for blood glucose level prediction: Update 2020. KHD@ IJCAI, 2020. André Martins, António Farinhas, Marcos Treviso, Vlad Niculae, Pedro Aguiar, and Mario Figueiredo. Sparse and continuous attention mechanisms. Advances in Neural Information Processing Systems, 33:20989 21001, 2020. Borislav Mavrin, Hengshuai Yao, Linglong Kong, Kaiwen Wu, and Yaoliang Yu. Distributional reinforcement learning for efficient exploration. In International conference on machine learning, pp. 4424 4434. PMLR, 2019. Florence Merlevède, Magda Peligrad, Emmanuel Rio, et al. Bernstein inequality and moderate deviations under strong mixing conditions. In High dimensional probability V: the Luminy volume, pp. 273 292. Institute of Mathematical Statistics, 2009. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. ar Xiv preprint ar Xiv:1312.5602, 2013. Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928 1937. PMLR, 2016. Published as a conference paper at ICLR 2023 Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In ICML, 2010. Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008. Susan A Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331 355, 2003. Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2775 2785, 2017. Ofir Nachum, Mohammad Norouzi, George Tucker, and Dale Schuurmans. Smoothed action value functions for learning gaussian policies. In International Conference on Machine Learning, pp. 3692 3700. PMLR, 2018a. Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Trust-pcl: An off-policy trust region method for continuous control. In International Conference on Learning Representations, 2018b. Tu-Hoa Pham, Giovanni De Magistris, and Ryuki Tachibana. 
Optlayer-practical constrained optimization for deep reinforcement learning in the real world. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6236 6243. IEEE, 2018. David Pollard. Convergence of stochastic processes. Springer Science & Business Media, 2012. Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. Aniruddh Raghu, Matthieu Komorowski, Imran Ahmed, Leo Celi, Peter Szolovits, and Marzyeh Ghassemi. Deep reinforcement learning for sepsis treatment. ar Xiv preprint ar Xiv:1711.09602, 2017. Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. Proceedings of Robotics: Science and Systems VIII, 2012. Martin Riedmiller. Neural fitted q iteration first experiences with a data efficient neural reinforcement learning method. In European conference on machine learning, pp. 317 328. Springer, 2005. Christian P Robert, George Casella, and George Casella. Monte Carlo statistical methods, volume 2. Springer, 1999. David Rodbard. Interpretation of continuous glucose monitoring data: Glycemic variability and quality of glycemic control. Diabetes Technology & Therapeutics, 11(S1):S 55, 2009. Bruno Scherrer, Victor Gabillon, Mohammad Ghavamzadeh, and Matthieu Geist. Approximate modified policy iteration. ar Xiv preprint ar Xiv:1205.3054, 2012. Takuma Seno and Michita Imai. d3rlpy: An offline deep reinforcement learning library. ar Xiv preprint ar Xiv:2111.03788, 2021. Chengchun Shi, Alin Fan, Rui Song, and Wenbin Lu. High-dimensional a-learning for optimal dynamic treatment regimes. Annals of Statistics, 46(3):925 967, 2018. Chengchun Shi, Shengxing Zhang, Wenbin Lu, and Rui Song. Statistical inference of the value function for reinforcement learning in infinite-horizon settings. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 2021. Chengchun Shi, Shikai Luo, Yuan Le, Hongtu Zhu, and Rui Song. Statistically efficient advantage learning for offline reinforcement learning in infinite horizons. Journal of the American Statistical Association, pp. 1 14, 2022. Published as a conference paper at ICLR 2023 David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International conference on machine learning, pp. 387 395. PMLR, 2014. David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354 359, 2017. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. Richard S Sutton, David Mc Allester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999. Csaba Szepesvári. Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence and machine learning, 4(1):1 103, 2010. Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the cvar via sampling. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. Shengpu Tang, Aditya Modi, Michael Sjoding, and Jenna Wiens. Clinician-in-the-loop decision making: Reinforcement learning with near-optimal set-valued policies. In International Conference on Machine Learning, pp. 9387 9396. PMLR, 2020. 
Nino Vieillard, Olivier Pietquin, and Matthieu Geist. Munchausen reinforcement learning. Advances in Neural Information Processing Systems, 33:4235 4246, 2020. Robert Vincent. Reinforcement learning in models of adaptive medical treatment strategies. Mc Gill University (Canada), 2014. Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229 256, 1992. Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pp. 550 559. PMLR, 2020. Fumitaka Yanase, Tomoko Fujii, Thummaporn Naorungroj, Alessandro Belletti, Nora Luethi, Anitra C Carr, Paul J Young, and Rinaldo Bellomo. Harm of iv high-dose vitamin c therapy in adult patients: a scoping review. Critical care medicine, 48(7):e620 e628, 2020. Ken-ichi Yoshihara. Limiting behavior of u-statistics for stationary, absolutely regular processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 35(3):237 252, 1976. Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pp. 94 116, 1994. Chao Yu, Jiming Liu, Shamim Nemati, and Guosheng Yin. Reinforcement learning in healthcare: A survey. ACM Computing Surveys (CSUR), 55(1):1 36, 2021. Yong Zang, J Jack Lee, and Ying Yuan. Adaptive designs for identifying optimal biological dose for molecularly targeted agents. Clinical Trials, 11(3):319 327, 2014. Wenzhuo Zhou, Ruoqing Zhu, and Donglin Zeng. A parsimonious personalized dose-finding model via dimension reduction. Biometrika, 108(3):643 659, 2021. Wenzhuo Zhou, Ruoqing Zhu, and Annie Qu. Estimating optimal infinite horizon dynamic treatment regimes via pt-learning. Journal of the American Statistical Association, pp. 1 14, 2022. Liangyu Zhu, Wenbin Lu, and Rui Song. Causal effect estimation and optimal dose suggestions in mobile health. In International Conference on Machine Learning, pp. 11588 11598. PMLR, 2020. Published as a conference paper at ICLR 2023 Supplementary Materials for Quasi-optimal Reinforcement Learning with Continuous Actions A Additional Related Works 18 B Technical Proofs 19 B.1 Proofs on Constructing Quasi-Optimal Bellman Operator . . . . . . . . . . . . . . 19 B.1.1 Proof of Theorem S.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 B.1.2 Proof of Corollary S.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 B.1.3 Proof of Theorem 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 B.1.4 Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 B.2 Proofs on Quasi-Optimal Staionarity Equation . . . . . . . . . . . . . . . . . . . . . 21 B.2.1 Proof of Theorem 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 B.3 Proofs on Kernel Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 B.3.1 Proof of Theorem S.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 B.3.2 Proof of Theorem S.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 B.4 Proofs on Generic Properties of Quasi-optimal Bellman Operator . . . . . . . . . . 24 B.4.1 Proof of Proposition S.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 B.4.2 Proof of Proposition S.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 B.5 Proof of Theorem 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 B.6 Proof of Lemma S.1 . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . 27
B.7 Proof of Theorem 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
B.8 Proof of Theorem 4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
B.9 Proof of Theorem 4.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
C Practical Implementation 40
D Experiment Details and Additional Results 41
D.1 Details of Simulation Settings and Real Data Analysis . . . . . . . . . . . . . . . . . 41
D.2 Additional Experiment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
D.3 Additional Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
D.3.1 Sensitivity Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
D.3.2 Distribution Evaluation Criterion . . . . . . . . . . . . . . . . . . . . . . . 44
D.3.3 Model Performance on Large Dataset . . . . . . . . . . . . . . . . . . . . 45
D.3.4 Safe Transitions and Learned Policy Distribution . . . . . . . . . . . . . . 45

A ADDITIONAL RELATED WORKS

We discuss additional related works in this section.

Safe RL Safe Reinforcement Learning (safe-RL) aims at finding an optimal policy while ensuring safety (García & Fernández, 2015). In the safe-RL framework, the definition of safety and its guarantee vary with the specific purpose of the learning task. In our view, there are three main streams of work in safe RL:
Safe Exploration: ensuring safe action allocations during the exploration process by incorporating prior knowledge, which typically applies in online RL settings (Pham et al., 2018).
Safety Constraints: finding an optimal policy that satisfies external, user-specified safety constraints (Chow et al., 2018a; Gu et al., 2022).
Risk-sensitivity and Conservatism: finding a policy that maximizes the infinite-horizon cumulative discounted reward while incorporating a notion of risk (Morimura et al., 2010; Mavrin et al., 2019), e.g., value at risk (quantile), percentile performance, chance constraints, or the variance of the return.
In medical applications, specifying explicit constraints is typically hard to realize in practice (Vincent, 2014). Alternatively, the notion of safety is usually incorporated in the design of the reward function, where high-risk actions lead to significantly low rewards (Raghu et al., 2017; Jia et al., 2020). Based on this, our quasi-optimal learning is closely related to the risk-sensitive RL framework, which aims to control value at risk to ensure safety: for example, maintaining the discounted return above a certain threshold (Tamar et al., 2015), reducing the variability of performance by avoiding extremely low performance (Ma et al., 2020), or maximizing a robust performance criterion, e.g., a quantile of the discounted return (Dabney et al., 2018b). Commonly used algorithms in risk-sensitive RL include conservative Q-learning (CQL; Kumar et al., 2020) and the implicit quantile network (IQN; Dabney et al., 2018a). CQL learns a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value, and thus avoids selecting high-risk actions with over-estimated action values. IQN models the full quantile function of the state-action return distribution and yields risk-sensitive policies.
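For readers less familiar with the conservatism mechanism, the following is a minimal schematic sketch (in PyTorch) of a CQL-style penalty added to a plain temporal-difference loss. The toy QNet, the uniform action proposals in [0, 1], the simplified TD target without a separate target network or policy, and the penalty weight alpha are illustrative assumptions, not the baseline's actual implementation.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small Q(s, a) network used only to keep the sketch self-contained."""
    def __init__(self, s_dim=2, a_dim=1):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(s_dim + a_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, s, a):
        return self.f(torch.cat([s, a], dim=-1)).squeeze(-1)

def cql_style_loss(q_net, s, a, r, s_next, a_next, alpha=1.0, gamma=0.9, n_prop=10):
    """Schematic CQL-style objective: TD error plus a conservative penalty that
    pushes down Q-values of sampled (out-of-data) actions relative to data actions."""
    with torch.no_grad():
        td_target = r + gamma * q_net(s_next, a_next)          # simplified TD target
    td_loss = ((q_net(s, a) - td_target) ** 2).mean()
    # Conservative term: log-sum-exp of Q over uniformly proposed actions in [0, 1].
    props = torch.rand(s.shape[0], n_prop, a.shape[-1])
    q_prop = torch.stack([q_net(s, props[:, i]) for i in range(n_prop)], dim=1)
    penalty = torch.logsumexp(q_prop, dim=1).mean() - q_net(s, a).mean()
    return td_loss + alpha * penalty

# Toy usage on random tensors standing in for a batch of offline transitions.
q = QNet()
s, a = torch.rand(8, 2), torch.rand(8, 1)
r, s_next, a_next = torch.rand(8), torch.rand(8, 2), torch.rand(8, 1)
print(cql_style_loss(q, s, a, r, s_next, a_next).item())
```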
For a more comprehensive empirical study, we compare the proposed algorithm with the aforementioned two safe RL baselines, conduct additional numerical experiments, and analyze the results from the safety point of view.

RL in healthcare Reinforcement learning has a wide variety of applications in healthcare (Yu et al., 2021). Some recent works aim to address safety issues when applying RL to healthcare domains. Tang et al. (2020) considers identifying set-valued policies with near-optimal actions, which allows expert knowledge from clinicians to be incorporated to assist decision making. Following the same rationale as our proposed quasi-optimal region, Tang et al. (2020) also utilizes the value function to threshold a near-optimal action set. However, this method is developed only for discrete action spaces and is not directly applicable in fully offline settings. Fatemi et al. (2021) considers identifying high-risk states in data-constrained offline settings by training two separate Q-functions that model the probabilities of negative and positive outcomes, respectively. They identify treatments in proportion to their chance of leading to dead-ends and attain safety by excluding these treatments from consideration. However, since they aim to identify possible dead-ends of the state space and the corresponding treatments, there is a trade-off between safety and optimality; in particular, a gap to optimal treatment allocation remains. Other interesting works in RL for healthcare include Henry et al. (2015) and Komorowski et al. (2018), which adopt RL algorithms for sepsis treatment recommendations, and Jia et al. (2020), which redefines the state variables and reward function to reflect practical safety concerns in sepsis treatment. We refer readers to Yu et al. (2021) for a more comprehensive review.

Offline RL The mainstream approaches in offline RL include fitted Q-iteration (FQI; Ernst et al. (2005); Riedmiller (2005); Munos & Szepesvári (2008); Szepesvári (2010)), fitted policy iteration (Antos et al., 2007; Lagoudakis & Parr, 2003; Scherrer et al., 2012), Bellman Residual Minimization (BRM; Antos et al. (2008); Hoffman et al. (2011); Farahmand et al. (2016); Dai et al. (2018); Chen & Jiang (2019); Xie & Jiang (2020)), gradient Q-learning (Maei et al., 2010; Ertefaie & Strawderman, 2018), and Advantage learning (Murphy, 2003; Shi et al., 2018; 2022). We refer the reader to Levine et al. (2020) for a more comprehensive discussion of offline RL. Among these lines of work, ours is most closely related to Bellman Residual Minimization. BRM methods learn the value function by solving a nested optimization problem, where the function space used for the inner and outer optimization must be the same. From the perspective of the coupled optimization, their inner optimization plays a role similar to the inner maximization of our min-max framework. Beyond the fundamental difference in derivation, our min-max optimization can be reduced to a single minimization problem with the aid of the kernel representation, whereas BRM methods have to solve an unstable minimax optimization problem. Most importantly, our quasi-optimal learning framework provides a practical way to learn a reliable policy in a continuous action space via quasi-optimal region identification; to the best of our knowledge, no existing RL algorithm can achieve this.

B TECHNICAL PROOFS

B.1 PROOFS ON CONSTRUCTING QUASI-OPTIMAL BELLMAN OPERATOR

B.1.1 PROOF OF THEOREM S.1

Theorem S.1.
Assume the induced policy has density function π µ(a|s) C for all a, s, where C is a given constant. Then the proximal Bellman operator Bµ in equation (2) has a closed form equivalent: BµV µ (s) = µ a Ws,1 Q µ(s, a)da 2µσ(Ws,1) 1 σ(Ws,1) + Cσ(Ws,1) R a Ws,2 Q µ(s, a)da Cσ(Ws,2) R a Ws,1 Q µ(s, a)da 2σ(Ws,1) µC2σ(Ws,2)(σ(Ws,2) + σ(Ws,1)) (13) where Ws,1 refers to the set {a A : C > π µ(a|s) > 0}, Ws,2 refers to the set {a A : π µ(a|s) = C}. Proof: The proof is mainly to check the KKT conditions of the maximization. The Lagrangian function of the RHS of (2) can be expressed as follows: L(π, η, ϖ1, ϖ2) = Ea π( |s) [Qµ(s, a) + µprox(π(a|s))] η(s) Z a A π(a|s)da 1 + ϖ1(s, a)π(a|s) ϖ2(s, a)(π(a|s) C). The following KKT conditions are necessary for the maximizer π µ in the equation: a A π µ(a|s)da 1 = 0, π µ(a|s) 0, π µ(a|s) C. Duality: ϖ1(s, a) 0, ϖ2(s, a) 0. Complementary slackness: ϖ1(s, a)π µ(a|s) = 0, ϖ2(s, a)(π µ(a|s) C) = 0. Stationarity: Q µ(s, a) + µ(1 2π µ(a|s)) η(s) + ϖ1(s, a) ϖ2(s, a) = 0. We can obtain the equation for πµ(a|s) from the stationary condition such that π µ(a|s) = 1 2µ[ η(s) Q µ(s, a) ϖ1(s, a) + ϖ2(s, a)]. Combined with complementary slackness condition, If π µ(a|s) = 0, then ϖ1(s, a) 0, ϖ2(s, a) = 0, thus Q µ(s, a) η(s) µ. If C >π µ(a|s) > 0, then ϖ1(s, a) = ϖ2(s, a) = 0, thus η(s) µ + 2µC > Q µ(s, a) > η(s) µ. If π µ(a|s) = C, then ϖ1(s, a) = 0, ϖ2(s, a) 0, thus Q µ(s, a) η(s) µ + 2µC. Therefore, π µ(s, a) can be expressed as: 0 if Q µ(s, a) η(s) µ 1 2 1 2µ η(s) Q µ(s, a) if η(s) µ + 2µC > Q µ(s, a) > η(s) µ C if Q µ(s, a) η(s) µ + 2µC (14) Published as a conference paper at ICLR 2023 Meanwhile, notice that R a A π µ(s, a) = 1, we can show that η(s) has a closed form: a Ws,1 Q µ(s, a)da 2µ + 2µCσ(Ws,2) where Ws,1 refers to the set {a A : C > π µ(a|s) > 0}, Ws,2 refers to the set {a A : π µ(a|s) = C}, and σ(Ws,1), σ(Ws,1) refers to the interval length of the corresponding set. We take η(s) back to (14), we then have 0 if Q µ(s, a) η(s) µ a Ws,1 Q µ(s,a)da 2µσ(Ws,1) + 1 Cσ(Ws,2) σ(Ws,1) if η(s) µ + 2µC > Q µ(s, a) > η(s) µ C if Q µ(s, a) η(s) µ + 2µC (15) We finally plug in the closed form of π µ(a|s) to (2), by some algebra, we have BµV µ (s) = µ a Ws,1 Q µ(s, a)da 2µσ(Ws,1) 1 σ(Ws,1) + Cσ(Ws,1) R a Ws,2 Q µ(s, a)da Cσ(Ws,2) R a Ws,1 Q µ(s, a)da 2σ(Ws,1) µC2σ(Ws,2)(σ(Ws,2) + σ(Ws,1)) B.1.2 PROOF OF COROLLARY S.1 Corollary S.1. When σ(Ws,2) = 0, we denote W1 as W, the closed form in (13) can be simplified as BµV µ (s) = µ 1 a Ws Q µ(s, a )da 2µ)2 a Ws Q µ 2(s, a)da Proof: We plug in σ(Ws,2) = 0 to (13), then could obtain the result. B.1.3 PROOF OF THEOREM 3.1 Proof of Theorem 3.1: For any generic value function V (s) and the corresponding generic Q-function Q(s, a), we first build the lower bound: BµV (s) = max π convex(A) Ea π( |s)[Q(s, a) + µ(1 π(a|s))] max π convex(A) Ea π( |s)[Q(s, a) + µ µC] = BV (s) + µ(1 C). For the upper bound: BµV (s) = max π convex(A)(A) Ea π( |s)[Q(s, a) + µ(1 π(a|s))] max π convex(A)(A) Ea π( |s)[Q(s, a) + µ] = BV (s) + µ. Therefore, we have BµV (s) BV (s) [µ(1 C), µ]. B.1.4 PROOF OF THEOREM 3.2 Proof of Theorem 3.2: Suppose Q µ(s, a) = α1(s)a2 + α2(s)a + α3(s) with α1(s) > 0. We assume the density won t reach its boundary value C for this theorem, and we proceed by simplifying αi(s) as αi for i = 1, 2, 3. By Equation (4), we have π µ(a|s) = Q µ(s, a) a Ws Q µ(s, a)da 2µσ(Ws) + 1 σ(Ws) Published as a conference paper at ICLR 2023 We first try to find the support set of π µ(a|s). 
Since Q µ takes the maximum value at y = α2 2α1 , by the symmetric property of quadratic function, the support set should be of the form Ws = [y l, y+l](l > 0). Additionally, the boundary point of the support set should be the solution of a Ws Q µ(s, a)da 2µσ(Ws) + 1 σ(Ws) = 0, with respect to a. Thus, we can find the boundary point of the support set by solving the equation with respect to l: α1(y l)2 + α2(y l) + α3 = 1 y l ( α1a2 + α2a + α3)da µ It turns out that l = (12α2 1µ) 1 3 2α1 . Thus, the support set has the closed-form " α2 (12α2 1µ) 1 3 2α1 , α2 + (12α2 1µ) 1 3 2α1 Therefore σ(Ws) = (12α2 1µ) 1 3 α1 , and R a Ws Q µ(s, a)da 2µσ(Ws) = (12α2 1µ) 2 3 3α2 2 24µα1 + α3 We plug in the result to the closed form of π µ(a|s), and obtain the probability density function π µ(a|s) = α1 12µ) 1 3 + . It is clear that the resulting distribution of π µ(a|s) is of the exact form of q-Gaussian distribution with q = 0, β = α1 2µ and centered at α2 2α1 . B.2 PROOFS ON QUASI-OPTIMAL STAIONARITY EQUATION B.2.1 PROOF OF THEOREM 3.3 Proof of Theorem 3.3: By the stationary condition from Theorem S.1 we have Q µ(s, a) + µ(1 2π µ(a|s)) η(s) + ϖ1(s, a) ϖ2(s, a) = 0, therefore, by the definition of Q µ(s, a), we have ESt+1|s,a[R(St+1, s, a)]+γESt+1|s,a[V µ (St+1)]+µ(1 2π µ(a|s)) η(s)+ϖ1(s, a) ϖ2(s, a) = 0. (16) Notice that ESt+1|s,a[R(St+1, s, a)] = r(s, a), and we take expectation with respect to a following the policy distribution π µ(a|s) from both sides of (16), 0 = Ea π µ(a|s) r(s, a) + γESt+1|s,a[V µ (St+1)] + µ(1 2π µ(a|s)) η(s) + ϖ1(s, a) ϖ2(s, a) , a A π µ(a|s) r(s, a) + γESt+1|s,a[V µ (St+1)] + µ(1 2π µ(a|s)) η(s) + ϖ1(s, a) ϖ2(s, a) da. According to the proximal Bellman optimality equation BµV µ (s) = V µ (s), where V µ (s) is the fixed point of Bµ. With the explicit definition of V µ , we observe that a A π µ(a|s) r(s, a) + γESt+1|s,a[V µ (St+1)] + µ(1 π µ(a|s)) da Z a A µπ 2 µ (a|s)da a A π µ(a|s) η(s)da + Z a A π µ(a|s)ϖ1(s, a)da Z a A π µ(a|s)ϖ2(s, a)da a A µπ 2 µ (a|s)da Z a A π µ(a|s) η(s)da + Z a A π µ(a|s)ϖ1(s, a)da a A π µ(a|s)ϖ2(s, a)da Published as a conference paper at ICLR 2023 Meanwhile R a A π µ(a|s) η(s)da = η(s) R a A π µ(a|s)da = η(s) by the property of density, R a A π µ(a|s)ϖ1(s, a)da = 0, and R a A π µ(a|s)ϖ2(s, a)da = C R a A ϖ2(s, a)da by complete slackness, we further have V µ (s) µ Z a A π 2 µ (a|s)da η(s) C Z a A ϖ2(s, a)da = 0. Since 0 π µ(a|s) C, thus µ R a A π 2 µ (a|s)da = µEπ µ(a|s) [0, C]. Therefore, η(s) := η(s) V µ (s) [ µC C Z a A ϖ2(s, a)da, C Z a A ϖ2(s, a)da]. The stationary condition can be reformulated as ESt+1|s,a R(St+1, s, a)+γV µ (St+1) µprox (π µ(a|s)) η(s)+ϖ1(s, a) ϖ2(s, a) V µ (s) = 0. (17) Obviously, (π µ, V µ ) is a solution for the above equation for some η(s), ϖ1(s, a), and ϖ2(s, a), such that ϖ1(s, a) 0,ϖ2(s, a) 0, ϖ1(s, a) πµ(a|s) = 0, ϖ2(s, a) (C πµ(a|s)) = 0 and η(s) h µC C Z a A ϖ2(s, a)da, C Z a A ϖ2(s, a)da i . When σ(Ws,2) = 0, we have ϖ2(s, a) = 0. Plugging in to equation (17), and denote Ws,1 as Ws, we have the exact form of (9). B.3 PROOFS ON KERNEL REPRESENTATION B.3.1 PROOF OF THEOREM S.2 Theorem S.2. We define the optimal weight function as u = arg maxu L2(C0) L2(Vµ, πµ, η, ϖ, u). Let C(S A) be all continuous functions on S A. For any (s, a) S A and s S, the optimal weight function u (St, At) L2(C0) C(S A) and is unique if the reward function R(s , s, a) and the transition kernel P(s |s, a) are continuous over (s, a). Proof: Denote eu = GVµ,πµ(St, At, St+1) η(St) + ϖ(St, At) Vµ(St). 
It follows from the definition of L2(Vµ, πµ, η, ϖ, u), we have that min Vµ,πµ,η,ϖ max u L2(Vµ, πµ, η, ϖ, u) = min Vµ,πµ,η,ϖ max u ESt,At,St+1 GVµ,πµ(St, At, St+1) η(St) + ϖ(St, At)) Vµ(St) u(St, At) 2 = min Vµ,πµ,η,ϖ max u GVµ,πµ(St, At, St+1) η(St) + ϖ(St, At)) Vµ(St) , u(St, At) 2 = min Vµ,πµ,η,ϖ GVµ,πµ(St, At, St+1) η(St) + ϖ(St, At)) Vµ(St) , C0eu eu L2 ) 2 = min Vµ,πµ,η,ϖ GVµ,πµ(St, At, St+1) η(St) + ϖ(St, At)) Vµ(St) , GVµ,πµ(St, At, St+1) η(St)+ ϖ(St, At)) Vµ(St) C0eu eu L2 , eu eu L2 = min Vµ,πµ,η,ϖ GVµ,πµ(St, At, St+1) η(St) + ϖ(St, At)) Vµ(St) , GVµ,πµ(St, At, St+1) η(St)+ ϖ(St, At)) Vµ(St) = min Vµ,πµ,η,ϖ ESt,At hp C0 GVµ,πµ(St, At, St+1) η(St) + ϖ(St, At)) Vµ(St) i2 , where the third equality is obtained by maximization condition of the inner product between u and GVµ,πµ(St, At, St+1) η(St) + ϖ(St, At) Vµ(St) is that the two terms should have the same direction; the fourth equality is obtained by the equality condition of the Cauchy-Schwartz inequality. Published as a conference paper at ICLR 2023 Such finding indicates that there exists a closed form solution of the the optimal weight function u , such that u (s, a) = GV µ ,π µ(s, a, s ) η(s) + ϖ(s, a) V µ (s), which is equal to eu when (Vµ, πµ) = (V µ , π µ). Notice that for a given µ, Ws is fully determined by Q µ(s, a), thus by Equation (3),(4), we have that π µ(a|s), V µ (s) is continuous over Q µ(s, a). Additionally, by the complete slackness and stationary condition in Theorem S.1, we have η(s) + ϖ(s, a) = Q µ(s, a) µ + V µ (s), if ϖ(s, a) = 0; η(s) = Q µ(s, a) µ + 2µπ µ(a|s) + V µ (s), if ϖ(s, a) = 0. Since V µ , π µ can be represented by functions of Q µ(s, a), the Lagrange multipliers η(s) + ϖ(s, a) can also be represented by a function of Q µ(s, a), and is also continuous over Q µ(s, a). As π µ(a|s), V µ (s), η(s) + ϖ(s, a) are all continuous over Q µ(s, a), we only need to prove that Q µ(s, a) is continuous over (s, a). By the stationarity equation in Theorem 3.3, Es |s,a[R(s , s, a)] = g(Q µ(s, a)). Since the reward function R(s , s, a) and the transition kernel P(s |s, a) are continuous over (s, a) by assumption, Q µ(s, a) is continuous for any (s, a) as Es |s,a[R(s , s, a)] is continuous for any (s, a). Therefore, the optimal weight function u (s, a) is continuous over any arbitrary state-action pair (s, a). B.3.2 PROOF OF THEOREM S.3 Theorem S.3. Suppose u HC0 K is reproduced by a universal kernel K( , ), then the minimax optimizer (10) can be decoupled to a single-stage minimization problem as min Vµ,πµ,η,ϖ LU = ESt,e St,At, e At,St+1, St+1 GVµ,πµ St, At, St+1 η St + ϖ At | St Vµ St C0K St, At; e St, e At GVµ,πµ(e St, e At, e St+1) η(e St) + ϖ( e At | e St) Vµ(e St) i , where (e St, e At, e St+1) is an independent copy of the transition pair (St, At, St+1). Proof: Let u = ESt,At GVµ,πmu(St, At, St+1) η(St) + ϖ(At|St) Vµ(St) K( , {St, At}) ., and define the inner product , HRKHS in HC0 K . 
It follows from the definition of L(Vµ, πµ, η, ϖ, u) and kernel reproducing property we have, min Vµ,πµ,η,ϖ max u L2 (Vµ, πµ, η, ϖ, u) = min Vµ,πµ,η,ϖ max u ESt,At GVµ,πµ St, At, St+1 η St + ϖ At | St Vµ St u St, At 2 = min Vµ,πµ,η,ϖ max u ESt,At GVµ,πµ St, At, St+1 η St + ϖ At | St Vµ St K ; St, At , u St, At = min Vµ,πµ,η,ϖ max u ESt,At GVµ,πµ St, At, St+1) η St + ϖ At | St Vµ St K ; St, At , u St, At +2 = min Vµ,πµ,η,ϖ ESt,At GVµ,πµ St, At, St+1 η St + ϖ At | St Vµ St K ; St, At , C0eu eu HRKHS Published as a conference paper at ICLR 2023 where the last equality holds because of the maximization of inner product between u and ESt,At GVµ,πµ(St, At, St+1) η(St) + ϖ(At|St) Vµ(St) K( ; St, At)] should have the same direction. Then we have, min Vµ,πµ,η,ϖ D ESt,At GVµ,πµ St, At, St+1) η St + ϖ At | St Vµ St K ; St, At , C0 u/ eu HRKHS E2 = min V , µπµ,η,ϖ D ESt,At GVµ,πµ St, At, St+1 η St + ϖ At | St Vµ St K( ; St, At) , ESt,At h GVµ,πµ St, At, St+1) η St + ϖ At | St Vµ St K( ; St, At) E u eu HRKHS , C0eu eu HRKHS HRKHS = min Vµ,πµ,η,ϖ ESt,At GVµ,πµ St, At, St+1 η St + ϖ At | St Vµ St K ; St, At , C0Ee St, e At h GVµ,πµ e St, e At, e St+1 η e St + ϖ( e At | e St) Vµ(e St) K ; e St, e At i E where the first equality is by the equality condition of Cauchy-Schwarz inequality, i.e. u/ u HRKHS is linear dependent of ESt,At GVµ,πµ(St, At, St+1) η(St) + ϖ(At|St) Vµ(St) K( ; St, At)] . Then, by the reproducing property of K(St, At; St, At), we have min Vµ,πµ,η,ϖ max u HC0 K L2(Vµ, πµ, η, ϖ, u) = min Vµ,πµ,η,ϖ ESt,e St,At, e At h GVµ,πµ(St, At, St+1) η(St) + ϖ(St, At)) Vµ(St) C0 D K St, At; , K e St, e At; E GVµ,πµ(e St, e At, St+1) η(e St) + ϖ( e At|e St) Vµ(e St) i = min Vµ,πµ,η,ϖ ESt,e St,At, e At C0 h GVµ,πµ(St, At, St+1) η(St) + ϖ(St, At)) Vµ(St) K St, At; e St, e At GVµ,πµ(e St, e At, e St+1) η(e St) + ϖ( e At|e St) Vµ(e St) i = min Vµ,πµ,η,ϖ ESt,e St,At, e At,St+1,e St+1C0 h e GVµ,πµ(St, At, St+1) η(St) + ϖ(St, At)) Vµ(St) K St, At; e St, e At e GVµ,πµ(e St, e At, e St+1) η(f St) + ϖ( e At|e St) Vµ(e St) i . Thus, we finish the proof. B.4 PROOFS ON GENERIC PROPERTIES OF QUASI-OPTIMAL BELLMAN OPERATOR B.4.1 PROOF OF PROPOSITION S.1 Proposition S.1. The quasi-optimal Bellman operator Bµ is γ-contractive with respect to the supreme norm over S. That is BµV BµV γ V V , for any generic value functions {V, V : S R}. Proposition S.1 justifies that there exists a unique fixed point of Bµ, i.e., V µ , indicating that the quasi-optimal value function V µ and the induced policy π µ are well defined and unique. Proof: By the definition of Bµ, the explicit form corresponding to V is as follows: BµV (s) = max π Ea π( |s) h ESt+1|s,a[R(St+1, s, a) + γV (St+1)] + µprox (π(a|s)) i . Published as a conference paper at ICLR 2023 For any two arbitrary value functions V and V , we have BµV (s) BµV (s) = max π1 Ea π( |s) h ESt+1|s,a[R(St+1, s, a) + γV (St+1)] + µprox (π1(a|s)) i max π2 Ea π( |s) h ESt+1|s,a[R(St+1, s, a) + γV (St+1)] + µprox (π2(a|s)) i Ea π( |s) h ESt+1|s,a[R(St+1, s, a) + γV (St+1)] + µprox (π(a|s)) i Ea π( |s) h ESt+1|s,a[R(St+1, s, a) + γV (St+1)] + µprox (π(a|s)) i) = max π γEa π( |s),St+1|s,a h V (St+1) V (St+1) i γ V (s) V (s) . B.4.2 PROOF OF PROPOSITION S.2 Proposition S.2. For any s S, the performance error between V µ (s) and V (s) satisfies V µ V µ max{|1 C|, 1} where C is the upper bound for induced policy πµ. Proof of Proposition S.2: V µ V = BµV µ BV BµV µ BµV + BµV BV . Notice that BµV µ BµV γ V µ V by Theorem S.1, and BµV BV µ max{|1 C|, 1} by Proposition 3.1. 
Therefore, (1 γ) V µ V µ max{|1 C|, 1}. We finish the proof. B.5 PROOF OF THEOREM 4.1 Proof of Theorem 4.1: We first prove that when µ , π µ would degenerate to uniform distribution over A. By (4), we only need to prove that for arbitrary small ϵ > 0 a Ws Q µ(s, a)da 2µσ(Ws) + 1 σ(Ws) 1 σ(A) Lower bound: a Ws Q µ(s, a) 2µσ(Ws) + 1 σ(Ws) Q µ(s, a) 2µ σ(Ws) maxa Q µ(s, a ) 2µσ(Ws) + 1 σ(Ws) (18) 2µ maxa Q µ(s, a ) 2µ + 1 σ(A) (19) Thus, we aim to prove that Q µ(s, a) maxa Q µ(s, a ) 2µ Let V be the unique fixed point of (1), and Hmax = max H(π), where H(π) = Ea π( |s)[1 π(a|s)]. Published as a conference paper at ICLR 2023 Let r(s, a) := ESt+1|s,a[R(St+1, s, a)], by the definition of Q µ, we have 2µ γESt+1|s,a V µ St+1 2µ = r(s, a) 2µ Q µ(s, a) 2µ γESt+1|s,a V µ St+1 V St+1 2µ γESt+1|s,a V St+1 2µ = r(s, a) Therefore, Q µ(s, a) 2(1 γ) r(s, a) 2µ + γEs |s,a [V (s )] 2(1 γ) Rmax 2(1 γ)µ. (20) Meanwhile, from another perspective, the proximal Bellman operator (2) can be treated as a new MDP with the immediate reward r(s, a) + µH(π( |s)) for given s, a. Combine with the fact that 1 γ = max π Eπ h X t=2 γt 1(µ µπ(At|St))|S1 = s, A1 = a i . Let πH = argmaxπH(π(a|s)), then 2(1 γ) = Q µ(s, a) 2µ max π Eπ t=2 γt 1 µ µπ At | St | S1 = s, A1 = a QπH µ (s, a) t=2 γt 1 µ µπH At | St | S1 = s, A1 = a t=1 γt 1 r (St, At) 2µ | S1 = s, A1 = a Rmax 2(1 γ)µ. Based on (20) and (21), we have 2µ maxa Q µ(s, a ) 2µ = Q µ(s, a) 2(1 γ) + γHmax 2(1 γ) maxa Q µ(s, a ) 2µ Rmax (1 γ)µ. (22) Similarly, we also have 2µ maxa Q µ(s, a ) 2µ Rmax (1 γ)µ. (23) Therefore, we have the lower bound approaching to 1 σ(A). For the upper bound, we have R a A π µ(a|s)da = 1, thus n Q µ(s, a) a Ws Q µ(s, a )da 2µσ(Ws) + 1 σ(Ws) nmina Q µ(s, a ) 2µ a Ws Q µ(s, a )da 2µσ(Ws) + 1 σ(Ws) 1 σ(A) mina Q µ(s, a ) 2µ a Ws) Q µ(s, a )da 2µ + 1 σ(Ws). Published as a conference paper at ICLR 2023 By (23), we then have a Ws Q µ(s, a)da 2µσ(Ws) + 1 σ(Ws) = Q µ(s, a) 2µ maxa Q µ(s, a ) 2µ (24) + maxa Q µ(s, a ) 2µ a Ws Q µ(s, a )da 2µ + 1 σ(Ws) 1 σ(A) + Rmax (1 γ)µ (25) Therefore, by the lower bound and upper bound, we conclude that πµ(a|s) will decay to the uniform distribution on A as µ . For the case when µ 0, we prove that πµ would converge to the uniform distribution with the length of the support set equal to 1 C. Therefore, when C , it will converge to the point mass. According to (15), we only need to prove σ(Ws,1) 0 as µ 0. Meanwhile by Theorem (S.1), a Ws,1, if σ(Ws,1)Q µ(s, a) Z a Ws,1 Q µ(s, a )da 2µ + 2µCσ(Ws,2) (0, 2µCσ(Ws,1)). As µ 0, (0, 2µCσ(Ws,1)) 0. Thus, by squeeze theorem, we have σ(Ws,1)Q µ(s, a) R a Ws,1 Q µ(s, a )da 2µ + 2µCσ(Ws,2) 0 as µ 0, which is equivalent to σ(Ws,1)Q µ(s, a) Z a Ws,1 Q µ(s, a )da 0 for all a Ws,1. Therefore, Ws,1 could only include a with the same value of Q µ(s, a), which should only be a series of points rather than an interval. Thus, σ(Ws,1) = 0, and π µ(a|s) would converge to uniform distribution with interval length 1 B.6 PROOF OF LEMMA S.1 Before we prove the main result, we first provide a helper lemma for studying the boundedness of the symmetric kernel in the U-statistic. Lemma S.1. Under Assumption 1, for any s S, a A and µ (0, ), we have that sup s S,a A GVµ,πµ(s, a, s ) η(s) + ϖ(s, a) Vµ(s) Mmax, where Mmax = 4 1 γ Rmax + µC. Proof of Lemma S.1: GVµ,πµ(s, a, s ) η(s) + ϖ(s, a) Vµ(s) =R(s , s, a) + γVµ(s ) + µ 2µπµ(a|s) η(s) + ϖ(s, a) Vµ(s) Rmax + µ + µC + γVµ(s ) Vµ(s) 2µπµ(a|s) + ϖ(s, a) | {z } (a) By checking the KKT conditions, we can further simplify the term (a). Specifically, 1. 
If πµ = 0, then ϖ 0. By the stationarity equation (9), we have (a) = ϖ(s, a) = η(s) Qµ(s, a) µ + Vµ(s) Rmax + γ Rmax µH 1 γ µ + Rmax + µH H := Ea πµ( |s)(1 πµ(a|s)) 2 1 γ Rmax µ + µH 2 1 γ Rmax. Published as a conference paper at ICLR 2023 2. If πµ (0, C], then ϖ = 0 (a) = 2µπµ(a|s) < 0. Therefore, Gπµ(s, a, s ) η(s) + ϖ(s, a) Vµ(s) Rmax + µ + µC + γVµ(s ) Vµ(s) + 2 1 γ Rmax Rmax + µ + µC + γ Rmax + µH 1 γ Rmax + µH 1 γ + 2 1 γ Rmax 4 1 γ Rmax + µC + µ µH 4 1 γ Rmax + µC. Thus, we gain the upper bound. For the lower bound, the same technique is applied, and we can also gain that GVµ,πµ(s, a, s ) η(s) + ϖ(s, a) Vµ(s) 4 1 γ Rmax µC. Therefore, this completes the proof. B.7 PROOF OF THEOREM 4.2 Proof of Theorem 4.2: We first define an operator P from GVµ,πµ(Sk, Ak, Sk+1) to GVµ,πµ(Sk, Ak, Sk+1) η(Sk) + ϖ(Sk, Ak) to simplify the expression, such that PGVµ,πµ(Sk, Ak, Sk+1) := GVµ,πµ(Sk, Ak, Sk+1) η(Sk) + ϖ(Ak|Sk), We further define several other notations 1 j =k T K(Sj, Aj; Sk, Ak){PGVµ,πµ(Sj, Aj, Sj+1) Vµ(Sj)} {PGVµ,πµ(Sk, Ak, Ak+1) Vµ(Sk)} K St, At, St+1; e St, e At, e St+1 := K St, At; e St, e At PGVµ,πµ St, At, St+1 Vµ St n PGVµ,πµ e St, e At, e St+1 Vµ e St o . Let the expectation with respect to stationary trajectory and i.i.d training set as ET and E respectively. For any finite threshold parameter µ < and any ϵ > 0, we have P c LU LU > ϵ = P c LU E (UT ) + E (UT ) LU > ϵ P c LU E (UT ) > ϵ + P |E (UT ) LU| > ϵ | {z } (ii) For (i), since the Gaussian kernel satisfy that |K( ; )| 1, then by Lemma S.1, we have K s, a, s ; s, a, s M 2 max, for any s, s, a, a. By Hoeffding s inequality, we have (i) 2 exp n nϵ2 For the term (ii), the expectation of UT as ET (UT ) can be calculated as follows: ET (UT ) = T 2 1 j =k T ET h K(Sj, Aj; Sk, Ak){PGVµ,πµ(Sj, Aj, Sj+1) Vµ(Sj)} {PGVµ,πµ(Sk, Ak, Sk+1) Vµ(Sj)} i . Published as a conference paper at ICLR 2023 If with-in trajectory samples are independent, then it is obvious that ET (UT ) = ET h K St, At, St+1; e St, e At, e St+1 i := U . However, for weakly dependent data, dependency may introduce an additional bias term ET (UT ) U , thus we further decompose the term (ii) as (ii) = P(|E (UT ) E [ET (UT )]| | {z } (iii) + | E [ET (UT )] EU ) | | {z } (iv) For the term (iii), we follow a similar idea to use a novel decomposition of the variance term of U-statistic from Han (2018). The idea is to break down the summation of U-statistic into numerous parts, where the current time is affected by randomness, and the historical time will be canceled out after conditioning on the future. As | K( ; )| is bounded by M 2 max, under the mixing condition of Assumption 4.2, the exponential inequality from Merlevède et al. (2009) can be applied to to bound each decomposition part. Then we follow the Theorem 3.1 from Han (2018) that for any ϵ0, P(|E (UT ) E [ET (UT )]| > ϵ0) 2 exp n M 4 max Tϵ2 0C 1 + M 2 max log log(4T) log T where C 1 is some constant. Then, we proceed to bound the term (iv). By Hoeffding decomposition of kernel function K St, At, St+1; St, At, St+1 , there exist kernel functions K1(St, At, St+1) and K2 St, At, St+1; St, At, St+1 such that K1 (s, a, s ) = ET K s, a, s ; e St, e At, e St+1 U , K2 s, a, s ; es, ea, es = K (s, a, s ; es, ea, es ) K1 (s, a, s ) K1 es, ea, es U , and ET K1(St, At, St+1) = 0, ET K2 St, At, St+1; St, At, St+1 = 0. Then by Hoeffding decomposition of UT , we have t=1 K1(St, At, St+1) + U K2. 
Taking the expectation from both sides: ET [UT ] = U + 2 k=1 ET K1(St, At, St+1) + ET [U K2] = U + ET [U K2] Therefore, by Lyapunov inequality, we can bound the bias term |ET [UT ] U | = ET U K2 ET U K2 ET h U 2 K2 1 h1 l1 T,1 h2 l2 T ET h K2 (Sh1, Ah1, Sh1+1; Sl1, Al1, Sl1+1) K2 (Sh2, Ah2, Sh2+1; Sl2, Al2, Sl2+1) i 4 T 2(T 1)2 . We proceed by the discussing the relationship between h1, h2, l1, l2. Published as a conference paper at ICLR 2023 Case 1.1: If 1 h1 h2 l1 l2 T and l2 l1 h1 h2. Under the mixing condition assumption, and by Generalized Correlation inequality in Lemma 2 of, we have ET h K2 Sh1, Ah1, Sh1+1; Sl1, Al1, Sl1+1 K2 Sh2, Ah2, Sh2+1; Sl2, Al2, Sl2+1 i 4 M 2r max 1/r β1/s (h2 h1) , where 1/r + 1/s = 1, s > 1. Case 1.2: If 1 h1 h2 l1 l2 T and h1 h2 l2 l1. Similar as Case 1.1, we have ET h K2 Sh1, Ah1, Sh1+1; Sl1, Al1, Sl1+1 K2 Sh2, Ah2, Sh2+1; Sl2, Al2, Sl2+1 i 4 M 2r max 1/r β1/s (l2 l1) . Combine Case 1.1 and Case 1.2, we apply the bounded inequalities (2.17-2.21) from Yoshihara (1976), and have the following result X 1 h1 h2 l1 l2 T ET h K2 Sh1, Ah1, Sh1+1; Sl1, Al1, Sl1+1 K2 Sh2, Ah2, Sh2+1; Sl2, Al2, Sl2+1 i l2 l1 h2 h1 1 h1 h2 l1 l2 T ET h K2 Sh1, Ah1, Sh1+1; Sl1, Al1, Sl1+1 K2 Sh2, Ah2, Sh2+1; Sl2, Al2, Sl2+1 i + X h2 h1 l2 l2 1 h1 h2 l1 l2 T ET h K2 Sh1, Ah1, Sh1+1; Sl1, Al1, Sl1+1 K2 Sh2, Ah2, Sh2+1; Sl2, Al2, Sl2+1 i M 2 max T 2 T X j=1 (j + 1)β1/s(j) = O M 2 max T 3 τ , 2 s+1 2 1 δ1 1 δ1 1 1 + 1 s+1 . (29) Case 2: If 1 h1 l1 h2 l2 T. Using similar technique as Case 1.1 and 1.2, we have X 1 h1 l1 h2 l2 T ET h K2 Sh1, Ah1, Sh1+1; Sl1, Al1, Sl1+1 K2 Sh2, Ah2, Sh2+1; Sl2, Al2, Sl2+1 i l2 h2 l1 h1 1 h1 l1 h2 l2 T ET h K2 Sh1, Ah1, Sh1+1; Sl1, Al1, Sl1+1 K2 Sh2, Ah2, Sh2+1; Sl2, Al2, Sl2+1 i + X l1 h1 l2 h2 1 h1 l1 h1 l2 T ET h K2 Sh1, Ah1, Sh1+1; Sl1, Al1, Sl1+1 K2 Sh2, Ah2, Sh2+1; Sl2, Al2, Sl2+1 i = O M 2 max T 3 τ Published as a conference paper at ICLR 2023 Case 3: If 1 h1 l1 T and 1 h2 = l2 T. Following the same technique, we have X 1 h1 l1 T ET h K2 Sh1, Ah1, Sh1+1; Sl1, Al1, Sl1+1 K2 Sh2, Ah2, Sh2+1; Sl2, Al2, Sl2+1 i ET h K2 Sh1, Ah1, Sh1+1; Sl1, Al1, Sl1+1 K2 Sh2, Ah2, Sh2+1; Sl2, Al2, Sl2+1 i + ET h K2 Sh1, Ah1, Sh1+1; l1, Al1, Sl1+1 K2 Sh2, Ah2, Sh2+1; Sl2, Al2, Sl2+1 i U 2 max T 2 + M 2 max T 2 T X j=1 β1/s(j) = O M 2 max T 2 . Case 4: If 1 h1 = l1 T and 1 h2 l2 T. Using the same technique, we can obtain the same rate as follows: X 1 h2 l2 T ET h K2 Sh1, Ah1, Sh1+1; Sl1, Al1, Sl1+1 K2 Sh2, Ah2, Sh2+1; Sl2, Al2, Sl2+1 i = O M 2 max T 2 . Combine Case 1-4 with the equation (28), we conclude that |EUT U | C 0M 2 max T 1+τ 2 a.s. We further use the continuous mapping theorem to conclude that E[ET (UT )] EU C 0M 2 max T 1+τ 2 a.s., (30) where τ is defined in (29) and C 0 is a constant. As τ > 0, we have T 1+τ 2 . 
Combine (27) and (30), for sufficiently large T, we have (ii) = P (|E (UT ) E [ET (UT )]| + | E [ET (UT )] EU ) |> ϵ TC 1 ϵ/2 C 0M 2 max T (1+τ)/2 2 M 4max + M 2max ϵ/2 C 0M 2max T (1+τ)/2 log T log log 4T TC 1ϵ2/4 Tc1ϵC 0M 2 max T (1+τ)/2 + TC 1C 0 2M 4 max T (1+τ) M 4max + M 2max ϵ/2 C 0M 2max T (1+τ)/2 log T log log 4T Tc1ϵ2/4 TT (1+τ)/2c1ϵC 0M 2 max + c1C 0 2M 4 max T τ M 4max + M 2max ϵ/2 C 0M 2max T (1+τ)/2 log T log log 4T Then by the monotonicity of exp( ), TT (1+τ)/2C 1ϵC 0M 2 max T τC 1C 0 2M 4 max TC 1ϵ2/4 M 4max + log T log log 4TM 2maxϵ/2 T (1 + τ)/2 log T log log 4TC 0M 4max TC 1ϵ2/4 T 1/2C 1ϵC 0M 2 max + T τC 1C 0 2M 4 max M 4max + log T log log 4TM 2maxϵ/2 T 1/2 log T log log 4TC 0M 4max c C 1ϵ2T/4 C 0C 1ϵM 2 max M 2max ϵ/2 C 0M 2max/ T log T log log 4T + M 4max Published as a conference paper at ICLR 2023 where C 1 is a constant. Combine (26) and (32), we simplify the terms and then P(| c LU LU| > ϵ) C1 exp ϵ2T C2ϵM 2 max M 2max + ( ϵ 2 C2M 2 max T ) log T log log(T) + C3 exp nϵ2 where C1, C2, C3 are some constants depending on δ1 respectively, and Mmax = 4 1 γ Rmax + µC. B.8 PROOF OF THEOREM 4.3 Proof of Theorem 4.3. To bound the performance error, we first decompose it as b V θ1,k µ V 2 L2 b V θ1 µ V θ1,k µ 2 L2 + b V θ1 µ V 2 L2 + ϵapproximation error where the first term is the optimization error and the last term is the approximation error. Then we proceed to bound b V θ1 µ V 2 L2 b V θ1 µ V eπµ µ L2 | {z } 1 + V eπµ µ V L2 | {z } 2 where V eπµ µ satisfying the stationarity equation (9) and V is the unique fixed point of B. First, we move to bound 1. Follow a similar kernel reproducing property and a eigen decomposition spirit in Bertsekas (1997); Sutton & Barto (2018); Zhou et al. (2022), we have 2 κmin LU(b V θ1 µ , bπθ2 µ , ηξ1, ϖξ2) LU(V eπµ µ , eπµ, η, ϖ) + 2 µprox (bπθ2 µ (At|St)) µprox (eπµ(At|St)) bηξ1(St) η(St) + bϖξ2(St, At) ϖ(St, At) 2 L2 γ ESt+1|St,At[b V θ1 µ (St+1)] ESt+1|St,At[V eπµ µ (St+1)] b V θ1 µ (St) V eπµ µ (St) 2 L2. Then by µprox (bπθ2 µ (At|St)) µprox (eπµ(At|St)) 2 L2 µ2 bπθ2 µ (At|St) eπµ(At|St) 2 L2 Cµ2. and the auxiliary functions ηξ1(s) [ Cµ, 0] for any s S, then bηξ1(St) η(St) 2 L2 (Cµ + Cµ)2 = (Cµ)2 bηξ1 1 (St) η1(St) 2 L2 2 κmin LU(b V θ1 µ , bπθ2 µ , bηξ1, bϖξ2) LU(V eπµ µ , eπµ, η, ϖ) Then we conclude that b V θ1 µ (St) V eπµ µ (St) 2 L2 C5(LU(b V θ1 µ , bπθ2 µ , bηξ1, bϖξ2) LU(V eπµ µ , eπµ, η, ϖ)) κmin(1 γ)2 + C6µ2 C5(LU(b V θ1 µ , bπθ2 µ , bηξ1, bϖξ2) L U κmin(1 γ)2 + C6µ2 where C5 and C6 are some constants, and L U := inf {Vµ,πµ,η,ϖ} LU(Vµ, πµ, η, ϖ) Now, we have the remainder term 2 to bound. 2 V eπµ µ V µ L2 | {z } 1 2 + V µ V L2 | {z } 2 2 We first bound 1 2. For any s S, then we have that BµV eπµ µ (s) = max π Ea π( |s), St+1|s,a h R(St+1, s, a) + γV eπµ µ (St+1) + µprox(π(a|s)) i =Ea eπµ( |s), St+1|s,a h R(St+1, s, a) + γV eπµ µ (St+1) + µprox(eπµ(a|s)) i =Ea eπµ( |s), St+1|s,a h R(St+1, s, a) + γV eπµ µ (St+1) + µ(1 eπµ(a|s)) i =Ea eπµ( |s), St+1|s,a h R(St+1, s, a) + γV eπµ µ (St+1) + µ µeπµ(a|s) i + Ea eπµ( |s) h µeπµ(a|s) i . Published as a conference paper at ICLR 2023 As (V eπµ µ , eπµ) is the solution of the stationarity equation, Ea eπµ( |s), St+1|s,a h R(St+1, s, a) + γV eπµ µ (St+1) + µ µeπµ(a|s) i V eπµ µ (s) and since Ea eπµ( |s) h µeπµ(a|s) i µ, then we have BµV eπµ µ (s) V eπµ µ (s) + µC. For the lower bound, as Ea eπµ( |s), St+1|s,a h R(St+1, s, a) + γV eπµ µ (St+1) + µ µeπµ(a|s) V eπµ µ (s) | St = s i Cµ so similarly, we conclude that Cµ + BµV eπµ µ (s) V eπµ µ (s). 
If follows the definition of the proximal Bellman operator Bµ and due to the monotonicity of the Bellman operator that BµV1(s) BµV2(s) for generic value functions V1(s) V2(s), and the BµV (s) Beπµ µ V (s) for any generic value function V , where Beπµ µ is the proximal Bellman evaluation operator, i.e., Beπµ µ V (s) := Ea eπµ( |s), St+1|s,a h R(St+1, s, a) + γV (St+1) + µprox eπµ(a|s) i . Note that, V eπµ µ is unique fixed point of the Bellman operator Beπµ µ , thus limi (Beπµ µ )i V eπµ µ (s) = V eπµ µ (s), where i Z+. And for any initial value function. e.g., V eπµ µ , limi (Bµ)i V eπµ µ (s) = V µ (s) holds. Therefore the following inequality holds that V eπµ µ (s) = lim i (Beπµ µ )i V eπµ µ (s) lim i (Beπµ µ )i V eπµ µ + Cµ (s) lim i (Bµ)i V eπµ µ + Cµ (s) = V eπµ µ (s) lim i (Bµ)i V eπµ µ (s) + i=1 Cµγi 1 V µ (s) + Cµ (1 γ). (33) We repeatedly apply a similar procedure, without loss of generality. We first show one step that Bµ(BµV eπµ µ (s)) Bµ(V eπµ µ (s) + Cµ) = Bµ(V eπµ µ (s)) + Cµγ V eπµ µ (s) + Cµ + Cµγ. Then we apply infinite many time Bµ, then we can have that V µ (s) = lim i (Bµ)i V eπµ µ (s) V eπµ µ (s) + i=1 Cµγi 1 = V eπµ µ (s) + Cµ (1 γ). (34) Combine with the inequalities (33)-(34), we immediately have that V µ V eπµ µ L2 Cµ (1 γ) Next, by Proposition S.2, we have V µ V µ max{|1 C|, 1} Now, we need to bound the excess risk. The excess risk can be decomposed into approximation error and estimation error, i.e. LU(b V θ1 µ ,bπθ2 µ , bηξ1, bϖξ2) L U = inf (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 LU(V θ1 µ , πθ2 µ , ηξ1, ϖξ2) L U | {z } approx LU b V θ1 µ , bπθ2 µ , bηξ1, bϖξ2 inf (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 LU(V θ1 µ , πθ2 µ , ηξ1, ϖξ2) Published as a conference paper at ICLR 2023 where approx is the approximation error and est is the estimation error. The approximation error is assumed to be zero in our proof for simplicity. At first, we consider to bound the estimation error. LU b V θ1 µ , bπθ2 µ , bηξ1, bϖξ2 inf (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 LU(V θ1 µ , πθ2 µ , ηξ1, ϖξ2) := LU b V θ1 µ , bπθ2 µ , bηξ1, bϖξ2 LU(V π µ µ , π µ, ηξ1, ϖξ2) LU b V θ1 µ , bπθ2 µ , bηξ1, bϖξ2 LU(V π µ µ , π µ, ηξ1, ϖξ2) + c LU(V π µ µ , π µ, ηξ1, ϖξ2) c LU b V θ1 µ , bπθ2 µ , bηξ1, bϖξ2 LU b V θ1 µ , bπθ2 µ , bηξ1, bϖξ2 c LU b V θ1 µ , bπθ2 µ , bηξ1, bϖξ2 LU(V π µ µ , π µ, ηξ1, ϖξ2) c LU(V π µ µ , π µ, ηξ1, ϖξ2) (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 LU(V θ1 µ , πθ2 µ , ηξ1, ϖξ2) c LU(V θ1 µ , πθ2 µ , ηξ1, ϖξ2) . where ηξ1, ϖξ2 are Lagrange multipliers satisfying minimal Bayes risk associated with V π µ µ , π µ for the rest of this proof. Observe that the randomness of sup(V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2|LU(V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) c LU(V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2)| can be decomposed into two parts, one is from the n number of i.i.d. trajectories and another one is from the dependent transition within each trajectory. For each single trajectory, we define the quantity U (V θ1 µ , πθ2 µ , ηξ1, ϖξ2) =ΛV θ1 µ ,πθ2 µ (St i, At i, St+1 i )K(St i, At i; e St i, e At i)ΛV θ1 µ ,πθ2 µ (e St i, e At i, V θ1 µ ), where ET is defined as taking expectation to single stationary trajectory and E is defined as taking expectation to i.i.d. trajectory random variable D1, respectively. Without loss of generality, we assume C0 = 1. The U-statistic approximation for ET (U ) is as follows: UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2) := 2 T(T 1) h ΛV θ1 µ ,πθ2 µ (Sj i , Aj i, Sj+1 i )K Sj i , Aj i; Sk i , Ak i ΛV θ1 µ ,πθ2 µ (Sk i , Ak i , V θ1 µ ) i . 
Then the uniform process is bounded by (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 LU(V θ1 µ , πθ2 µ , ηξ1, ϖξ2) c LU(V θ1 µ , πθ2 µ , ηξ1, ϖξ2) (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 LU(V θ1 µ , πθ2 µ , ηξ1, ϖξ2) P(Di:n) n ET [U (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 P(Di:n) n ET [U (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] P(Di:n) n UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 LU(V θ1 µ , πθ2 µ , ηξ1, ϖξ2) P(Di:n) n ET [U (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 ET [U (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] , where P(Di:n) n is the empirical measure with respect to Di:n = {Di}n i=1 and we simply denotes it as Pn in the following proof. The last term is the bound for uniform process w.r.t sum of trajectories. In this sense, it is necessary to bound 2 = sup (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 ET [U (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] , since the trajectories {Di}n i=1 are i.i.d. Now, we process to bound 1. 1 can be re-expressed as the empirical process of {Di}n i=1 w.r.t. the probability space (ΩN, FN, P) equipped with empirical measure Pn such that 1 = sup (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 E(ET [U (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)]) Pn ET [U (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 EG(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) Pn G(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) , Published as a conference paper at ICLR 2023 where G(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) is the random function associated with random variable Di. To bound 1, it is needed to calculate the covering number N (ϵ, Fθ,ξ, {Di}n i=1) by Pollard s tail inequality (Pollard, 2012), where the function space is the composite space Fθ,ξ = G(Θ1 Θ2 Ξ1 Ξ2). Specifically, G(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) = ET M(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di)K({St i, At i}, {e St i, e At i}) f M(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) , where M(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) = µV θ1 µ ,πθ2 µ (St i, At i, St+1 i ) and f M(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) = ΛV θ1 µ ,πθ2 µ (e St i, e At i, e St+1 i ). Next, we proceed to bound the distance in composite space Fθ,ξ. In particular, let (V θ1 µ , πθ2 µ , ηξ1, ϖξ2), (V θ1 µ , πθ2 µ , ηξ1 , ϖξ2 ) Θ1 Θ2 Ξ1 Ξ2 are two arbitrary functions, then the empirical norm distance w.r.t. {Di}n i=1 for the two function can be upper bounded by Pn G(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) G(V θ1 µ , πθ2 µ , ηξ1 , ϖξ2 ; Di) ET M(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di)K({St, At}, {e St, e At})f M(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) ET M(V θ1 µ , πθ2 µ , ηξ1 , ϖξ2 ; Di)K(St, At; e St, e At)f M(V θ1 µ , πθ2 µ , ηξ1 , ϖξ2 ; Di) ET h K(St, At; e St, e At) M(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di)f M(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) M(V θ1 µ , πθ2 µ , ηξ1 , ϖξ2 ; Di)f M(V θ1 µ , πθ2 µ , ηξ1 , ϖξ2 ; Di) i =Pn n ET K(St, At; e St, e At) M(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di)f M(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) M(V θ1 µ , πθ2 µ , ηξ1 , ϖξ2 ; Di)f M(V θ1 µ , πθ2 µ , ηξ1 , ϖξ2 ; Di) o Pn n ET K(St, At; e St, e At) ET M(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) + M(V θ1 µ , πθ2 µ , ηξ1 , ϖξ2 ; Di) ET f M(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) f M(V θ1 µ , πθ2 µ , ηξ1 , ϖξ2 ; Di) o Pn ET |ηξ1 ηξ1 | + µPn ET |prox (πθ2 µ ) prox (πθ2 µ )| + Pn ET |(γV θ1 µ V θ1 µ ) (γV θ1 µ V θ1 µ )| + Pn ET |ψθ ϖξ2 | =Mmax,1 Pn ET |ηξ1 ηξ1 | + µPn ET |πθ2 µ πθ2 µ | + (1 + γ)Pn ET |V θ1 µ V θ1 µ | + Pn ET |ψθ ϖξ2 | Mmax,1 Pn ηξ1 ηξ1 + µPn πθ2 µ πθ2 µ + (1 + γ)Pn V θ1 µ V θ1 µ + Pn ϖξ2 ϖξ2 , where Mmax,1 = 2Mmax. 
Therefore, as the proximal parameter 0 µ µmax < , for any ε > 0 the metric entropy log N ((µmax + 4)Mmax,1ε, Fθ,ξ, {Di}n i=1) can be bound with respect to separate metric entropy of (Θ1, Θ2, Ξ1, Ξ2). Denote min(2(µmax + 4)Mmax, 1) as e C, then N (µmax + 4)Mmax,1ε, Fθ,ξ, {Di}n i=1 N e Cε, Θ1, {Di}n i=1 N e Cε, Θ2, {Di}n i=1 N e Cε, Ξ1, {Di}n i=1 N e Cϵ, Ξ2, {Di}n i=1 To bound these factors, we first introduce a idea of pseudo-dimension , that is, for any set X, any points x1:N X N, any class F of functions on X taking values in [0, C] with pseudo-dimension DF < and any ϵ > 0, we have N ϵ, F, x1:N e (DF + 1) 2e C Therefore, we have N 2(µmax + 4)Mmaxϵ, Fθ,ξ, {Di}n i=1 e4 (DΘ1 + 1) (DΘ2 + 1) (DΞ1 + 1) (DΞ2 + 1) 2e Mmax DΘ1+DΘ2+DΞ1+DΞ2 Published as a conference paper at ICLR 2023 which implies 2, Fθ,ξ, {Di}n i=1 e4 (DΘ1 + 1) (DΘ2 + 1) (DΞ1 + 1) (DΞ2 + 1) 8(µmax + 4)M 3 maxe e Cϵ DΘ1+DΘ2+DΞ1+DΞ2 where C1 = e4 (DΘ1 + 1) (DΘ2 + 1) (DΞ1 + 1) (DΞ2 + 1) 8(µmax+4)M 3 maxe e C DΘ1+DΘ2+DΞ1+DΞ2 and DFθ,ξ = DΘ1 + DΘ2 + DΞ1 + DΞ2 i.e., the effective psuedo dimension. Then we apply Pollard tail inequality, for any n 32/ϵ2, we have (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 EG(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) Pn G(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) ϵ DFθ,ξ exp nϵ2 Then we can obtain (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 EG(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) Pn G(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) 2 # (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 EG(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) Pn G(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) 2 t)dt (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 EG(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) Pn G(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) 2 t)dt (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 EG(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) Pn G(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) 2 t)dt DFθ,ξ exp nt2 =u + 64C1 1 n exp nu 512M 2max With probability 1 δ, minimizing the RHS with respect to u, and plug the minimizer in, we have (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) EG(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) Pn G(V θ1 µ , πθ2 µ , ηξ1, ϖξ2; Di) 2 # 64DFθ,ξ log(8C1 1 where C2 = 8C1. Therefore, we conclude that, with probability 1 δ, we have 64DFθ,ξ log 8C1 C3DFθ,ξ log 8C1 n Next, we proceed to bound 2. To simply the notation, we denote the U-statistic kernel as K(St, At; e St, e At) := ΛV θ1 µ ,πθ2 µ (St i, At i, St+1 i )K St i, At i; e St i, e At i ΛV θ1 µ ,πθ2 µ (e St i, e At i, St+1 i ). By Hoeffding s decomposition of kernel function K(St, At; e St, e At), there exists kernel functions K1(St, At) and K2(St, At; e St, e At) that ET K1(e St, e At) = 0 and ET K2(s, a; e St, e At) = 0. The U-statistic UT can be decomposed into UT = ET [U ] + 2 t=1 K1 St, At + U K2 and ET [UT ] = ET [U ] + ET [U K2], Published as a conference paper at ICLR 2023 where U K2 := U K2(V θ1 µ , πθ2 µ , ηξ1, ϖξ2) is defined similarly as in the proof of Theorem 4.2. The details of the decomposition can be seen in the proof of Theorem 4.2. The term 2 can be immediately decomposed as follows (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] ET [UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] + (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 ET [U (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] ET [UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2) ET [UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 ET [U K2(V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] Note that the second term is not exactly zero since the samples are weakly dependent. But next, we will show that 2 2 converges to zero. First, we check the conditions of Lemma 3.1 in Arcones & Yu (1994). 
Observe that K( , ) 1 and according to Lemma S.1, then (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 K St, At; e St, e At M 2 max K St, At; e St, e At M 2 max. Therefore, the kernel K is a uniformly bounded function. Under Assumption 4.2 that β(m) m δ1 for δ1 > 1. Therefore, β(m)mδ1 0. By using a similar technique of calculating the metric entropy, for any ϵ > 0, we have the covering number that N (ε, Fθ,ξ, L2) N (ε, Fθ,ξ, {Di}n i=1) < Then the conditions of Lemma 3.1 in Arcones & Yu (1994) are satisfied, we have (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 | TU K2(V θ1 µ , πθ2 µ , ηξ1, ϖξ2)| = op(1) (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) Θ1 Θ2 Ξ1 Ξ2 |U K2(V θ1 µ , πθ2 µ , ηξ1, ϖξ2)| = op(1). (35) Since U K2(V θ1 µ , πθ2 µ , ηξ1, ϖξ2) is uniformly bounded, then (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) |U K2(V θ1 µ , πθ2 µ , ηξ1, ϖξ2)| < As x, T , then (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) |U K2(V θ1 µ , πθ2 µ , ηξ1, ϖξ2)|I{ sup (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) |U K2(V θ1 µ , πθ2 µ , ηξ1, ϖξ2)| > x} which means sup(V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) |U K2(V θ1 µ , πθ2 µ , ηξ1, ϖξ2)| is uniformly integrable. Combine with the weak convergence in (35), then as T (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) E|U K2(V θ1 µ , πθ2 µ , ηξ1, ϖξ2)| E sup (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) |U K2(V θ1 µ , πθ2 µ , ηξ1, ϖξ2)| 0. Then we move to bound 1 2. The U-statistic UT is not degenerate, so we adopt Hoeffding s representation (Hoeffding, 1994) such that it reduces the problem to a first-order analysis. Specifically, let σ(T) is the collection of all permutations of {1, 2, ..., T}, the U-statistic UT can be re-expressed as UT V θ1 µ , πθ2 µ , ηξ1, ϖξ2 = 1 t=1 K Xσ(t), Xσ(T0+t) , Published as a conference paper at ICLR 2023 where T0 = T/2 . By the trick, we have the following inequality (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) 1 i =j T K Xi, Xj ET [UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) t=1 K Xσ(t), Xσ(T0+t) ET [UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) i=t K Xσ(t), Xσ(T0+t) 1 σ(T ) ET [UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) t=1 K Xσ(t), Xσ(T0+t) ET [UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] σ(T ) ET sup (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) t=1 K Xσ(t), Xσ(T0+t) ET [UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) t=1 K Xt, X(T0+t) ET [UT (V θ1 µ , πθ2 µ , ηξ1, ϖξ2)] (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) t=1 K Xt, X(T0+t) ET t=1 K Xt, X(T0+t) # (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) t=1 G e Xt ET t=1 G e Xt # , where e Xt = Xt, X(T0+t) which itself is a two-dimensional stationary sequences under mixing condition. Note that the last term is the expectation of the suprema of the empirical process 1/T0 PT0 t=1 G( e Xt) ET [1/T0 PT0 t=1 G( e Xt)] on the space Gθ,ξ. The distance in Gθ,ξ can be bounded by the following, N min{(2µmax + 4)Mmax, 1}ε, Gθ,ξ, { e Xt}T0 t=1 N e Cε, Θ1, {Di}n i=1 N e Cε, Θ2, {Di}n i=1 N e Cε, Ξ1, {Di}n i=1 N e Cϵ, Ξ2, {Di}n i=1 =e4 (DΘ1 + 1) (DΘ2 + 1) (DΞ1 + 1) (DΞ2 + 1) 2e Mmax DΘ1+DΘ2+DΞ1+DΞ2 which implies 16, Gθ,ξ, { e Xt}T0 t=1 e4 (DΘ1 + 1) (DΘ2 + 1) (DΞ1 + 1) (DΞ2 + 1) 64(max+32)U 2 maxe e Cϵ DΘ1+DΘ2+DΞ1+DΞ2 where D Gθ,ξ = DΘ1 + DΘ2 + DΞ1 + DΞ2. First, without loss of generality, let T0 = 2m T0k T0 for appropriate positive integers m T0k T0 as in (Yu, 1994). Follow Lemma 5 in Antos et al. (2008), we obtain that (V θ1 µ ,πθ2 µ ,ηξ1,ϖξ2) t=1 G e Xt ET t=1 G e Xt # ϵ D Gθ,ξ exp 4C4m T0ϵ2 + 2m T0β(k T0) Published as a conference paper at ICLR 2023 where C4 = 1 2 1 8M 2 max 2 . If D G 2, and let β(m) exp ( δ1m) , T 1, m T = l C4T0ϵ2/δ1 1 2 m , m T0 = T0/ (2k T0), where D Gθ,ξ 2, C3, C4, δ1, we apply Lemma 14 in Antos et al. 
(2008), then 2m T0βk T0 + C1 D Gθ,ξ exp 4C2m T0ϵ2 δ and we have, with probability 1 δ, = (D Gθ,ξ/2) log T0 + log(e/δ) + log+ C3C D Gθ,ξ /2 = = (D Gθ,ξ/2) log(T/2) + log(e/δ) + log+ C3C D Gθ,ξ /2 Now, we conclude that b V θ1,k µ V 2 L2 C1 κmin(1 γ)2 C3D log 8C1 C2 µ2(C + |1 C| 1)2 (1 γ)2 + C5 b V θ1 µ b V θ1,k µ 2 L2 + ϵapproximation error where = (D Gθ,ξ/2) log( T/2 ) + log(e/δ) + log+ C3C D Gθ,ξ 4 /2 , D Gθ,ξ = P-dim(Θ1) + P-dim(Θ2) + P-dim(Ξ1) + P-dim(Ξ2), and C1, ..., C5 are some constants. Adapt the notations for the constants number from Theorem 4.2. By some algebra, we conclude that b V θ1,k µ V π 2 L2 C4 κmin(1 γ)2 C5DP-dim log 8C4 | {z } generalization error C7 µ2(C + |1 C| 1)2 (1 γ)2 | {z } proximal bias + C8 b V θ1 µ b V θ1,k µ 2 L2 | {z } optimization error +ϵapproximation error where = DP-dim log( T/2 ) δ) + log+ C5C DP-dim 6 2 , DP-dim = P-dim(Θ1) + P-dim(Θ2) + P-dim(Ξ1) + P-dim(Ξ2), and C4, ..., C8 are some constants. B.9 PROOF OF THEOREM 4.4 We note that SGD converges has a global convergence to a stationary point with a sublinear rate in the case of convexity. However, the resulting dose not typically holds for the non-convex analysis. The intuition behind the proof is that our quasi-optimal algorithm can be regarded as a special case of the randomized stochastic descent (RSD) algorithm for solving the non-convex minimization problem. The convergence analysis of for randomized stochastic descent algorithm has been established in Corollary 2.2 of (Ghadimi & Lan, 2013). That is, RSD is provably convergent to a stationary point. Follow Theorem 3 in (Drori & Shamir, 2020), an unbiased SGD algorithm, i.e., the quasi-optimal Published as a conference paper at ICLR 2023 algorithm with diminishing learning rate and evaluated on Euclidean distance. Therefore, it suffices to show that the gradient of the loss is unbiased. Now we show that the gradient is unbiased, as follows θ1LU = E h θ1(γV θ1 µ (St+1) V θ1 µ (St))K(St, At; St, At)ΛVµ,πµ( St, At, St+1) + ΛVµ,πµ(St, At, St+1)K(St, At; St, At) θ1(γV θ1 µ ( St+1) V θ1 µ ( St)) i θ2LU = E h 2µ( θ2πθ2 µ (At|St)K(St, At; St, At)ΛVµ,πµ( St, At, St+1) ΛVµ,πµ(St, At, St+1)K(St, At; St, At)(2µ( θ2πθ2 µ ( At|St)) i ξ1LU = E h ξ1ϖ(St, At)K(St, At; St, At)ΛVµ,πµ( St, At, St+1) + ΛVµ,πµ(St, At, St+1)K(St, At; St, At) ξ1ϖ( St, At) i ξ2LU = E h ξ2η(St)K(St, At; St, At)ΛVµ,πµ( St, At, St+1) ΛVµ,πµ(St, At, St+1)K(St, At; St, At) ξ2η( St) i We conclude that the gradient estimator is unbiased. Follow Theorem 3 in (Drori & Shamir, 2020), under the conditions stated in Theorem 4.4, we adapt Corollary 2.2 to our quasi-optimal algorithm, it completes the proof. C PRACTICAL IMPLEMENTATION In practice, {V µ , π µ, η, ϖ} needs to be parameterized for practical implementation. However, noticing that V µ and π µ are both associated with Q µ with closed-form expressions (3)(4). Thus, we propose to represent (V µ , π µ) by modeling Q µ. Additionally, by modeling Q µ as a quadratic function, the induced policy would follow a q-Gaussian distribution. Therefore, we model the coefficients associated with the quadratic form as a linear combination of basis function φ(s) such that Q µ(s, a; θ) = exp{θT 1 φ(s)}a2 + θT 2 φ(s)a + θT 3 φ(s), where φ(s) = [φ1(s), φ2(s), ..., φm(s)]T is the m-dimensional basis function, and θ = [θ1, θ2, θ3]T is the 3m-dimensional parameters we need to estimate. The advantage of such parametrization lies in that the parameter space could be reduced. 
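As a concrete illustration of this parameterization, the sketch below computes the policy density induced by a quadratic Q-function, using the closed forms from the proof of Theorem 3.2 (vertex at α₂/(2α₁) and support half-width (12α₁²µ)^{1/3}/(2α₁)). The basis vector φ(s), the example parameter values, and the sign convention Q_µ(s, a; θ) = −exp(θ₁ᵀφ(s))a² + θ₂ᵀφ(s)a + θ₃ᵀφ(s) (a concave quadratic in a; the leading minus sign appears to have been lost in the extracted text) are assumptions, and the sketch assumes the density never reaches its cap C and its support lies inside A.

```python
import numpy as np

def quasi_optimal_policy(phi_s, theta1, theta2, theta3, mu):
    """Induced quasi-optimal density for Q(s, a) = -a1*a^2 + a2*a + a3, following the
    closed forms in the proof of Theorem 3.2 (density below its cap C, support inside A).
    theta3 only shifts Q(s, .) by a constant and cancels in the induced density."""
    a1 = np.exp(theta1 @ phi_s)          # positive leading coefficient -> concave Q
    a2 = theta2 @ phi_s
    center = a2 / (2.0 * a1)             # vertex of the quadratic Q
    half_width = (12.0 * a1 ** 2 * mu) ** (1.0 / 3.0) / (2.0 * a1)
    support = (center - half_width, center + half_width)

    def density(a):
        # q-Gaussian (q = 0) density: positive part of an inverted parabola.
        val = 3.0 / (4.0 * half_width) - a1 * (a - center) ** 2 / (2.0 * mu)
        return np.maximum(val, 0.0)

    return support, density

# Example: a two-dimensional basis phi(s) = [1, s] with illustrative parameters.
phi_s = np.array([1.0, 0.5])
support, density = quasi_optimal_policy(phi_s, np.array([0.1, -0.2]),
                                         np.array([1.0, 0.3]), np.array([0.0, 0.0]), mu=0.05)
grid = np.linspace(*support, 201)
print(support, density(grid).mean() * (support[1] - support[0]))   # integrates to ~1
```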
To solve the constrained optimization problem, we propose a computationally efficient algorithm by transforming the original constrained optimization problem into an unconstrained minimization problem. Specifically, we impose restrictions on the representation of Lagrangian multipliers (η(s), ϖ(s, a)) so that they satisfy their constraints automatically. Although such re-parametrization may sacrifice model flexibility, it gains great computational advantage as the unconstrained optimization problem would be much simpler. To be specific, we parametrize ϖ as ϖ(s, a; θ) = max 0, Q µ(s, a; θ) a W1(s) Q µ(s, a; θ)da 2µσ(Ws) 1 σ(Ws) Therefore, ϖ(St, At) 0 and π µ(At|St) ϖ(St, At) = 0 are automatically satisfied. Also, by specifying the expression of Lagrangian multipliers, ϖ(s, a) share the same set of parameters θ as (V µ , π µ). We also define η(s; ξ) = n µC 1 + exp( k0(ξT s b0)) where b0 is the sigmoid s midpoint and k0 is the logistic growth rate. By flipping the sigmoid function to parametrize η(s; ξ), the constraint η(s) [ µC, 0] is also automatically satisfied. Published as a conference paper at ICLR 2023 D EXPERIMENT DETAILS AND ADDITIONAL RESULTS For the reproducing purpose, we include our code for all the experiments and the guideline for access to the Ohio Type I Diabetes dataset in an anonymous Git Hub link https://anonymous.4open. science/r/Quasi-optimal-Learning-with-Continuous-Treatments-9B88. D.1 DETAILS OF SIMULATION SETTINGS AND REAL DATA ANALYSIS The details of the data generative model of each environment in Section 6 are stated below: Environment I: We consider a bounded action space where A = [0, 1], and a 2-dimensional state space. At i iid Unif(0, 1), the state transition function is defined as St+1 i,1 = 1 exp( At i) 1+exp( At i)St i,1 + 0.25St i,1St i,2 + ϵt i,1, St+1 i,2 = 1 exp( At i) 1+exp( At i)St i,2 + 0.25St i,1St i,2 + ϵt i,2, where ϵt i,1, ϵt i,2 iid N(0, 0.52), and the reward function is defined as Rt i = 3 exp(St+1 i,1 St+1 i,2 )(At i)2 + (St+1 i,1 + St+1 i,2 + 0.5)At i + St+1 i,1 + St+1 i,2 . Environment II: We consider a bounded action space where A = [0, 1], and a 2-dimensional state space. At i iid Unif(0, 1), the state transition function is defined as St+1 i,1 = 0.75(2At i 1) St i,1 + 0.25St i,1St i,2 + ϵt i,1, St+1 i,2 = 0.75(1 2At i)St i,2 + 0.25St i,1St i,2 + ϵt i,2. where ϵt i,1, ϵt i,2 i.i.d N(0, 0.52), and Rt i = 0.25(St+1 i,1 )3+2St+1 i,1 +0.5(St+1 i,2 )3+St+1 i,2 +0.25(2At i 1). Environment III: We consider an unbounded action space where A = ( , ), and a 8-dimensional state space. We sampled action uniformly from a bounded space, At i iid Unif( 100, 100), while it is allowed to select actions on R for the learned policy. The state transition function is defined as, St+1 i N(µt+1 i , Σ), where Σ is a pre-specified covariance matrix, and µt i = [µt i,1, ..., µt i,8], µt+1 i,j = exp(At i/100 + µt i,j) exp( (At i/100 + µt i,j)) exp(At i/100 + µt i,j) + exp( (At i/100 + µt i,j)) for j = 1, 2, 3, 4, µt+1 i,j = exp( At i/100 + µt i,j) exp( ( At i/100 + µt i,j)) exp( At i/100 + µt i,j) + exp( ( At i/100 + µt i,j)) for j = 5, 6, 7, 8. Rt i = exp(St+1 i,1 /2+St+1 i,5 /2)(At i/100)2 +2(St+1 i,2 +St+1 i,3 +St+1 i,6 +St+1 i,7 +0.5)At i/100+St+1 i,4 + St+1 i,8 . Environment IV: This environment shares the same transition kernel as Environment III, the only difference is the reward function here is Rt i = (St+1 i,1 /2)3 + (St+1 i,2 /2)3 + St+1 i,3 + St+1 i,4 + 2[(St+1 i,5 /2)3 + (St+1 i,6 /2)3] + 0.5(St+1 i,7 + St+1 i,8 ). 
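To make the data-generating process concrete, a minimal sketch of a behavior-data generator for Environment I is given below. The uniform behavior policy, the shrinkage factor (1 − e^{−A})/(1 + e^{−A}), and the N(0, 0.5²) noise follow the description above; the initial-state distribution is an assumption, and the reward is left as a user-supplied callable because some signs in the extracted reward formulas may have been lost.

```python
import numpy as np

def simulate_env1(n_traj=25, horizon=24, reward_fn=None, seed=0):
    """Generate behavior trajectories from Environment I: A ~ Unif(0, 1),
    S'_j = (1 - e^{-A}) / (1 + e^{-A}) * S_j + 0.25 * S_1 * S_2 + N(0, 0.5^2), j = 1, 2."""
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(n_traj):
        s = rng.uniform(0.0, 1.0, size=2)            # assumed initial-state distribution
        for _ in range(horizon):
            a = rng.uniform(0.0, 1.0)                # uniform behavior policy
            shrink = (1.0 - np.exp(-a)) / (1.0 + np.exp(-a))
            interact = 0.25 * s[0] * s[1]
            s_next = shrink * s + interact + rng.normal(0.0, 0.5, size=2)
            r = reward_fn(s_next, a) if reward_fn is not None else 0.0
            data.append((s.copy(), a, r, s_next.copy()))
            s = s_next
    return data

transitions = simulate_env1()
print(len(transitions))    # n_traj * horizon transitions
```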
For all four environments, we consider different sample sizes, with the number of trajectories n ∈ {25, 50} and the length of each trajectory T ∈ {24, 36}. The discount factor γ is set to 0.9.

Motivation of the synthetic experiment design: We aim to test the performance of the proposed method in settings with bounded and unbounded continuous action spaces and with unimodal and multimodal reward functions. The motivation for testing the proposed method in a bounded action space is to examine whether it can handle the off-support bias, as illustrated in Figure 2. The reason for considering a multimodal synthetic environment is to evaluate whether the quasi-optimal policy class (the q-Gaussian policy class) works in a relatively complex situation. Since the q-Gaussian policy distribution is unimodal, it is necessary to test whether it remains robust in scenarios where the optimal policy may behave multimodally. We summarize the synthetic experiments as follows:

Environment I:
Setting: Bounded action space and unimodal reward function.
Purpose: To evaluate whether quasi-optimal learning works in a scenario where it might suffer from the off-support bias issue, as the continuous action space is bounded.

Environment II:
Setting: Bounded action space and multimodal reward function.
Purpose: In addition to the purpose of Environment I, we aim to apply quasi-optimal learning in a more challenging environment. This also evaluates the robustness of the unimodal q-Gaussian policy when the true optimal policy follows a multimodal probability distribution.

Environment III:
Setting: High-dimensional state space and well-separated reward function. The well-separated reward function is designed so that selecting non-optimal or sub-optimal actions greatly damages the rewards and increases the risk.
Purpose: To evaluate the reliability/safety of quasi-optimal learning. We examine whether quasi-optimal learning performs well in this scenario, as we expect it to identify the quasi-optimal sub-regions and avoid choosing those non-optimal/sub-optimal actions that greatly damage performance.

Environment IV:
Setting: High-dimensional state space and a more complex well-separated reward function.
Purpose: In addition to the purpose of Environment III, we aim to evaluate quasi-optimal learning in a more complex environment, which poses greater challenges in recovering the quasi-optimal regions. Indeed, a more complex reward structure makes value function learning harder and thus makes identifying the quasi-optimal regions more challenging.

Ohio Type 1 Diabetes Dataset: For individuals in the first cohort, we treat glucose level, carbohydrate intake, and acceleration level as state variables, i.e., $S^t_{i,1}$, $S^t_{i,2}$, and $S^t_{i,3}$. For individuals in the second cohort, heart rate is used instead of acceleration level as $S^t_{i,3}$. The reward function is defined as
$R^t_i = -\mathbb{1}(S^t_{i,1} > 140)\,(S^t_{i,1} - 140)^{1.1} - \mathbb{1}(S^t_{i,1} < 80)\,(S^t_{i,1} - 80)^2$.

D.2 ADDITIONAL EXPERIMENT DETAILS

In our implementation, since the objective function $\hat{L}_U$ may not be convex with respect to $(\theta, \xi)$, we determine the initial point by randomly generating 200 initial values for all parameters and selecting the one with the smallest objective function value.
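A minimal sketch of this multi-start initialization is shown below; the Gaussian proposal distribution, the parameter dimension, and the placeholder objective standing in for the empirical loss $\hat{L}_U$ are illustrative assumptions.

```python
import numpy as np

def pick_initial_point(objective, dim, n_restarts=200, scale=1.0, seed=0):
    """Draw n_restarts candidate parameter vectors at random and keep the one
    with the smallest empirical objective value, as described above."""
    rng = np.random.default_rng(seed)
    best_params, best_val = None, np.inf
    for _ in range(n_restarts):
        candidate = rng.normal(0.0, scale, size=dim)   # assumed Gaussian proposals
        val = objective(candidate)
        if val < best_val:
            best_params, best_val = candidate, val
    return best_params, best_val

# Usage with a placeholder objective standing in for the empirical loss.
params0, val0 = pick_initial_point(lambda p: np.sum(p ** 2), dim=12)
print(val0)
```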
For the discretization-based methods, i.e., Greedy-GQ and V-learning, we discretize the original action space into 20 bins in the synthetic experiments and 14 bins in the real data analysis. The number of bins is chosen by analyzing the distribution of actions and the scale of rewards: too few bins cannot accurately approximate the whole dynamic, while too many bins may damage the performance of these methods. We use radial basis functions to approximate the value functions for these two methods, following the recommendations of the original implementations (Ertefaie & Strawderman, 2018; Luckett et al., 2019). For the deep RL-based continuous control methods, i.e., DDPG, SAC, BEAR, CQL and IQN, we implement them mainly based on a well-known offline deep reinforcement learning library (Seno & Imai, 2021).

For the general optimization and function approximation settings, we use a multi-layer perceptron (MLP) with 2 hidden layers, each with 32 nodes, for function approximation. We set the batch size to 64 and use the ReLU function as the activation function. In addition to the summary provided below, the initial learning rate is chosen from the set {3 × 10⁻⁴, 1 × 10⁻⁴, 3 × 10⁻⁵}. We use Adam (Kingma & Ba, 2014) as the optimizer for learning the neural network parameters. We set the discount factor to γ = 0.9 for all experiments.

To evaluate the policy obtained from the proposed method in the synthetic experiments, we generate 100 independent trajectories, each with a length of 100, based on the learned policy. We use rejection sampling (Robert et al., 1999) to randomly sample each action from the induced density π_µ(a|s) and calculate the discounted sum of rewards for each trajectory. We then compare the discounted return of each method. The boxplots of the synthetic experiment results based on 50 runs are presented in Figure 3. For the real data analysis, since the data-generating process is unknown, we follow Luckett et al. (2020) and use a Monte Carlo approximation of the estimated V-function at the initial state of each trajectory to evaluate the performance of each method. To better evaluate the stability and performance of each method, we randomly select 10 or 20 trajectories from each individual based on the available trajectories, repeat this selection 50 times, and apply all methods to the selected data. The baseline refers to the observed discounted return. The mean and standard deviation of the improvements in the Monte Carlo discounted returns are presented in Table 1.

We report all hyperparameters used in training and additional experiment results in this section. The value of µ is selected from the set {0.01, 0.05, 0.1, 0.2, 0.3, 0.5}. We select µ by cross-validation for each experiment; specifically, we select the µ with the largest fitted V-function value on the initial states of each trajectory, i.e., $\sum_{i=1}^{n} \widehat{V}_\mu(S_i^1) - (1-\gamma)^{-1}\mu$, where the correction term mitigates the effect of the threshold parameter µ. In our implementation, we set C = 5 for all synthetic experiments and the real data analysis, and check that the induced policy π_µ never reaches the boundary value. We set the learning rate α_j for the j-th iteration to be $\alpha_0/(1 + d\, j)$, where α_0 is the learning rate of the initial iteration and d is the decay rate of the learning rate. When n = 25, we set the batch size to 5, and when n = 50, we set the batch size to 7. We use the L2 distance between successive parameter iterates as the stopping criterion for the SGD algorithm.
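The action-sampling step in the roll-out evaluation above uses standard rejection sampling from the induced density π_µ(a|s). The sketch below assumes a one-dimensional bounded action space and a hypothetical density handle `pi_mu_density`; neither the handle nor the default bounds come from the released code.

```python
import numpy as np

def sample_action_rejection(pi_mu_density, state, a_low=0.0, a_high=1.0,
                            density_bound=None, rng=None, max_tries=10_000):
    """Draw one action from an induced density pi_mu(a|s) via rejection sampling.

    `pi_mu_density(a, state)` returns the (possibly unnormalized) policy density;
    `density_bound` is an upper bound M on that density over [a_low, a_high].
    The proposal distribution is Uniform(a_low, a_high).
    """
    rng = np.random.default_rng() if rng is None else rng
    if density_bound is None:
        # Crude numerical bound on a grid; adequate for a 1-d bounded action space.
        grid = np.linspace(a_low, a_high, 512)
        density_bound = max(pi_mu_density(a, state) for a in grid) * 1.05
    for _ in range(max_tries):
        a = rng.uniform(a_low, a_high)          # proposal draw
        u = rng.uniform(0.0, density_bound)     # vertical coordinate
        if u <= pi_mu_density(a, state):        # accept if the point falls under the density
            return a
    raise RuntimeError("rejection sampling failed; check density_bound")
```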
The value of µ selected for each experiment, along with the learning rates and their descent rates, is shown in Tables 2, 3, and 4.

Table 2: Hyperparameters for each synthetic environment
Hyperparameters   Environment I   Environment II   Environment III   Environment IV
µ                 0.1             0.05             0.05              0.05
Learning Rate     0.002           0.0005           10⁻⁵              10⁻⁵
Descent Rate      10⁻⁴            10⁻⁴             10⁻⁴              10⁻⁴

Table 3: Hyperparameters for Ohio Type I Diabetes Analysis (Cohort I)
Patient ID      540     544     552     567     584     596
µ               0.1     0.1     0.1     0.05    0.05    0.1
Learning Rate   0.001   0.001   0.001   0.0005  0.001   0.02
Descent Rate    10⁻⁴    10⁻⁴    10⁻⁴    10⁻⁴    10⁻⁴    2 × 10⁻⁴

Table 4: Hyperparameters for Ohio Type I Diabetes Analysis (Cohort II)
Patient ID      559     563     570     575     588     591
µ               0.2     0.1     0.2     0.05    0.05    0.05
Learning Rate   0.005   0.0001  0.005   0.0001  0.0001  0.0001
Descent Rate    10⁻⁴    10⁻⁴    10⁻⁴    10⁻⁴    10⁻⁴    10⁻⁴

Table 5: The mean running time in seconds of each method over 50 experiment runs in Environment I. The synthetic experiments are conducted on a single 2.3 GHz Dual-Core Intel Core i5 CPU.
n    T    Proposed   SAC     DDPG    BEAR    Greedy-GQ
25   24   23.11      22.17   14.31   35.42   11.39
25   36   28.91      28.12   18.35   42.16   14.47
50   24   28.88      29.91   19.42   46.73   15.62
50   36   45.23      44.46   36.82   63.54   24.81

Table 6: The mean running time in seconds of each method over 50 experiment runs in Environment II. The synthetic experiments are conducted on a single 2.3 GHz Dual-Core Intel Core i5 CPU.
n    T    Proposed   SAC     DDPG    BEAR    Greedy-GQ
25   24   30.93      27.29   20.12   44.01   14.56
25   36   39.34      36.91   26.43   52.86   19.43
50   24   41.12      42.25   28.42   55.17   21.56
50   36   60.16      56.47   45.71   72.12   32.14

D.3 ADDITIONAL EXPERIMENT RESULTS

D.3.1 SENSITIVITY ANALYSES

To validate the cross-validation procedure in practice and analyze the effect of µ on model performance, we conduct sensitivity analyses over µ. The results are summarized in Figure 4 and confirm that the cross-validation procedure indeed selects a proper µ that maximizes the discounted return.

Figure 4: The sensitivity analyses of µ over 50 repeated experiments (discounted return versus µ ∈ {0.01, 0.05, 0.1, 0.2, 0.3, 0.5} for Environments I–IV, with sample sizes N = 25, T = 24; N = 25, T = 36; N = 50, T = 24; and N = 50, T = 36).

D.3.2 DISTRIBUTION EVALUATION CRITERION

To measure the performance on safety, we evaluate the distribution of the Monte Carlo discounted sum of rewards for each roll-out trajectory (Dabney et al., 2018a), instead of only its empirical mean, i.e., the discounted return. In particular, we generate 100 trajectories under the learned policy and record the discounted sum of rewards of each single trajectory. We then draw the density plots in Figure 5 for all four environments. As shown in Figure 5, the distribution for quasi-optimal learning has a thinner left tail, in line with the two safe RL baselines IQN and CQL. This indicates a smaller chance of entering a low-reward trajectory caused by allocating high-risk actions. In contrast, the non-safe RL approach SAC places more mass on both extremes; hence, SAC enters a low-reward trajectory with higher probability (heavier left tail) than quasi-optimal learning and the two safe RL baselines. This validates that quasi-optimal learning can avoid risky actions, as the two safe RL baselines do.
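The roll-out evaluation just described can be summarized in a short sketch. Here `policy_sample`, `env_reset`, and `env_step` are hypothetical handles standing in for the learned policy and the environment simulator; they are not names from the released code.

```python
import numpy as np

def discounted_return_distribution(policy_sample, env_reset, env_step,
                                   n_traj=100, horizon=100, gamma=0.9, seed=0):
    """Sketch of the distribution-based evaluation criterion.

    Rolls out `n_traj` trajectories under the learned policy and returns one
    discounted sum of rewards per trajectory, whose empirical distribution
    (e.g., its left tail) can then be inspected.
    """
    rng = np.random.default_rng(seed)
    returns = np.empty(n_traj)
    for i in range(n_traj):
        state = env_reset(rng)
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy_sample(state, rng)
            state, reward = env_step(state, action, rng)
            total += discount * reward
            discount *= gamma
        returns[i] = total
    return returns
```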
Figure 5: The distribution of the Monte Carlo discounted sum of rewards over 50 repeated experiments (densities of the discounted return for the proposed method, CQL, IQN, and SAC in Environments I–IV).

D.3.3 MODEL PERFORMANCE ON LARGE DATASET

We evaluate the model performance in a large-sample scenario (10,000 transition pairs; n = 100, T = 100) for all four environments. The results are presented in Figure 6. The deep RL baseline methods show some improvement in performance and variance reduction with the increased number of training samples. Nevertheless, quasi-optimal learning still outperforms all competing methods, as shown in Figure 6.

Figure 6: The boxplot of discounted return over 30 repeated experiments with sample size N = 100, T = 100 (Environments I–IV).

D.3.4 SAFE TRANSITIONS AND LEARNED POLICY DISTRIBUTION

Safe Transition: We illustrate the safety of the proposed method by evaluating the proportion of safe transitions, i.e., transitions from a fixed current state to a safe next state. The goal of the Ohio T1DM case study is to maintain the glucose level in a safe range; the safe state in this study is defined as a state whose glucose level is within the range of 80-140 mg/dL. The reward function, i.e., the index of glycemic control (Rodbard, 2009), is
$$R_i^t = -\mathbb{1}(S_{i,1}^t > 140)(S_{i,1}^t - 140)^{1.1} - \mathbb{1}(S_{i,1}^t < 80)(S_{i,1}^t - 80)^2,$$
where $S_{i,1}^t$ is the glucose level of patient i at decision stage t. This reward favors the safe range and penalizes the risky scenarios in which the glucose level falls outside the range of 80-140 mg/dL. The details of the evaluation procedure are summarized as follows. In the offline Ohio T1DM dataset, we pick out the observed states that transitioned to risky states, i.e., states outside the safe range of glucose levels. For these states, we calculate the proportion of safe transitions, where the corresponding next states are sampled from the transition kernel under the learned policy. The transition kernel is estimated by maximum likelihood estimation from the offline dataset. We summarize the safe proportions over 1000 transition samplings in Figure 7. As shown, quasi-optimal learning achieves a safe proportion of 82.2%, which outperforms 67.3% for the safe RL baseline IQN and 44.6% for the non-safe RL baseline SAC. From these results, we may conclude that quasi-optimal learning enjoys better safety when applied to the medical domain.

Figure 7: Proportions of Safe Transition from each Method (Proposed, IQN, SAC).

The Learned Policy Distribution: Here, we illustrate the validity of the quasi-optimal policy distribution at a fixed state. In the Ohio T1DM dataset, we select a patient state with a glucose level of 217 mg/dL, which indicates moderate hyperglycemia. For this state, we draw a density plot in Figure 8 of the policy distributions learned by quasi-optimal learning, IQN, and SAC. Figure 8 shows that quasi-optimal learning identifies the support region [3.15, 6.19]. As the patient is under moderate hyperglycemia, a moderate insulin dosage, i.e., within [3.15, 6.19], works well to decrease the glucose level into the safe range, while avoiding dropping the patient's glucose level too far and causing hypoglycemia.
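For concreteness, the safe-transition evaluation described above can be sketched as follows. The handles `policy_sample` (learned dosing policy) and `transition_sample` (MLE-estimated transition kernel) are hypothetical names introduced for illustration, not identifiers from the released code.

```python
import numpy as np

def safe_transition_proportion(risky_states, policy_sample, transition_sample,
                               n_samples=1000, safe_range=(80.0, 140.0), seed=0):
    """Sketch of the safe-transition evaluation on the Ohio T1DM data.

    `risky_states` are observed states that transitioned to risky states in the
    offline data; `policy_sample(state, rng)` draws a dose from the learned policy
    and `transition_sample(state, action, rng)` draws a next glucose level from
    the estimated transition kernel. Returns the fraction of sampled next states
    whose glucose level lies in the safe range.
    """
    rng = np.random.default_rng(seed)
    low, high = safe_range
    n_safe = 0
    for _ in range(n_samples):
        state = risky_states[rng.integers(len(risky_states))]
        action = policy_sample(state, rng)
        next_glucose = transition_sample(state, action, rng)
        n_safe += int(low <= next_glucose <= high)
    return n_safe / n_samples
```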
In comparison, SAC is risky, as it has a non-negligible probability of assigning an excessively low or high insulin dosage to the patient. The policy learned by the safe RL algorithm IQN tends to avoid assigning extreme dosages, but it has wider support than the one learned by quasi-optimal learning. Regarding both efficiency and safety, quasi-optimal learning therefore has certain advantages compared with IQN in this case.

Figure 8: The Learned Policy Distribution of each Method for the Same Given State (density over dose level (bolus)).