# Reward-Weighted Regression Converges to a Global Optimum

Miroslav Štrupl¹*, Francesco Faccio¹, Dylan R. Ashley¹, Rupesh Kumar Srivastava², Jürgen Schmidhuber¹,²,³
¹ The Swiss AI Lab IDSIA, Università della Svizzera italiana (USI) & SUPSI, Lugano, Switzerland
² NNAISENSE, Lugano, Switzerland
³ King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
{struplm, francesco, dylan.ashley}@idsia.ch, rupesh@nnaisense.com, juergen@idsia.ch

*Equal contribution. Correspondence to struplm@idsia.ch
Copyright 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces, we prove R-linear convergence of the state-value function to the optimum.

1 Introduction

Reinforcement learning (RL) is a branch of artificial intelligence that considers learning agents interacting with an environment (Sutton and Barto 2018). RL has enjoyed several notable successes in recent years. These include both successes of special prominence within the artificial intelligence community, such as achieving the first superhuman performance in the ancient game of Go (Silver et al. 2016), and successes of immediate real-world value, such as providing autonomous navigation of stratospheric balloons to bring internet access to remote locations (Bellemare et al. 2020).

One prominent family of algorithms that tackle the RL problem is the Reward-Weighted Regression (RWR) family (Peters and Schaal 2007). The RWR family is notable in that it naturally extends to continuous state and action spaces. The lack of this capability in many other methods is a strong limitation, preventing them from tackling some of the more practically relevant RL problems, such as many robotics tasks (Plappert et al. 2018). Recently, RWR variants were able to learn high-dimensional continuous control tasks (Peng et al. 2019). RWR works by transforming the RL problem into a form solvable by well-studied expectation-maximization (EM) methods (Dempster, Laird, and Rubin 1977). EM methods are, in general, guaranteed to converge to a point whose gradient is zero with respect to the parameters. However, these points can be local optima or saddle points (Wu 1983). These benefits and limitations transfer to the RL setting, where it has been shown that an EM-based return maximizer is guaranteed to yield monotonic improvements in the expected return (Dayan and Hinton 1997). However, it has been challenging to assess under which conditions, if any, RWR is guaranteed to converge to the optimal policy. This paper presents a breakthrough in this challenge.
The EM probabilistic framework requires that the reward obtained by the RL agent be strictly positive, such that it can be treated as an improper probability distribution. Several reward transformations have been proposed, e.g., by Peters and Schaal (2007, 2008); Peng et al. (2019); Abdolmaleki et al. (2018b), frequently involving exponential transformations. In the past, it has been claimed that a positive, strictly increasing transformation $u_\tau(r)$ with $\int_0^{\infty} u_\tau(r)\,dr = \mathrm{const}$ would not alter the optimal solution of the MDP (Peters and Schaal 2007). Unfortunately, as we demonstrate in Appendix A, this is not the case. We must, as a consequence, consider only linear transformations of the reward if we want to prove convergence. A possible disadvantage of relying on linear transformations is that it is necessary to know a lower bound on the reward to construct such a transformation. In this work, we provide the first proof of RWR's global convergence in a setting without function approximation or reward transformations¹.

The paper is structured as follows: Section 2 introduces the MDP setting and other preliminary material; Section 3 presents a closed-form update for RWR based on the state- and action-value functions; Section 4 shows that the update induces monotonic improvement related to the variance of the action-value function with respect to the actions sampled by the policy; Section 5 proves global convergence of the algorithm in the general compact setting and convergence rates in the finite setting; Section 6 illustrates experimentally that for a simple MDP the presented update scheme converges to the optimal policy; Section 7 discusses related work; and Section 8 concludes. The extended version of this paper is available at https://arxiv.org/abs/2107.09088.

¹Without loss of generality we do assume that a linear reward transformation is already provided, such that the reward is positive.
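To make the role of the linear (affine) transformation concrete, the following minimal sketch, our own illustration and not code from the paper, shifts a bounded reward using a known lower bound so that it becomes strictly positive. In the infinite-horizon discounted setting such a shift changes every return by the same constant, so it leaves the set of optimal policies unchanged, unlike nonlinear (e.g., exponential) transformations.

```python
# Minimal sketch (our own, hypothetical names): make rewards strictly positive
# via an affine shift built from a known lower bound on the reward.
import numpy as np

def shift_rewards(R, r_min, eps=1e-3):
    """Return R' = R - r_min + eps > 0.

    In a discounted infinite-horizon MDP, adding a constant c to every reward
    adds c / (1 - gamma) to every return, so the ranking of policies and hence
    the optimal policy are preserved.
    """
    R = np.asarray(R, dtype=float)
    assert R.min() >= r_min, "r_min must be a valid lower bound on the reward"
    return R - r_min + eps

# Example: a (state, action) reward table with negative entries.
R = np.array([[-1.0, 0.5], [0.2, -0.3]])
R_pos = shift_rewards(R, r_min=-1.0)
assert (R_pos > 0).all()
```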
2 Background

Here we consider a Markov Decision Process (MDP) (Stratonovich 1960; Puterman 2014) $M = (S, A, p_T, R, \gamma, \mu_0)$. We assume that the state and action spaces $S \subseteq \mathbb{R}^{n_S}$, $A \subseteq \mathbb{R}^{n_A}$ are compact subspaces² (equipped with the subspace topology), with measurable structure given by the measure spaces $(S, \mathcal{B}(S), \mu_S)$ and $(A, \mathcal{B}(A), \mu_A)$, where $\mathcal{B}(\cdot)$ denotes the Borel $\sigma$-algebra, and the reference measures $\mu_S$, $\mu_A$ are assumed to be finite and strictly positive on $S$ and $A$ respectively. The distributions of state (action) random variables (except in Section 5, where greedy policies are used) are assumed to be dominated by $\mu_S$ ($\mu_A$), thus having a density with respect to $\mu_S$ ($\mu_A$). Therefore, we reserve the symbols $ds$, $da$ in integral expressions not for integration with respect to the Lebesgue measure, as usual, but for integration with respect to $\mu_S$ and $\mu_A$ respectively, e.g., $\int_S (\cdot)\, ds := \int_S (\cdot)\, d\mu_S(s)$. Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and $f: \Omega \to \mathbb{R}_+$ an $\mathcal{F}$-measurable function (density). We denote by $f \cdot \mu$ the measure which assigns to every set $B \in \mathcal{F}$ the measure $(f \cdot \mu)(B) := \int_B f\, d\mu$.

In the MDP framework, at each step an agent observes a state $s \in S$, chooses an action $a \in A$, and subsequently transitions into state $s'$ with probability density $p_T(s'|s, a)$ to receive a deterministic reward $R(s, a)$. The transition probability kernel is assumed to be continuous in total variation in $(s, a) \in S \times A$ (the product topology is assumed on $S \times A$), and thus the density $p_T(s'|s, a)$ is continuous (in $L^1$ norm). $R(s, a)$ is assumed to be a continuous function on $S \times A$.

The agent starts from an initial state (chosen under a probability density $\mu_0(s)$) and is represented by a stochastic policy $\pi$: a probability kernel which provides the conditional probability distribution of performing action $a$ in state $s$.³ The policy is deterministic if, for each state $s$, there exists an action $a$ such that $\pi(\{a\}|s) = 1$. The return $R_t$ is defined as the cumulative discounted reward from time step $t$: $R_t = \sum_{k=0}^{\infty} \gamma^k R(s_{t+k+1}, a_{t+k+1})$, where $\gamma \in (0, 1)$ is a discount factor. We discuss the undiscounted case ($\gamma = 1$) in Appendix B, which covers the scenario with absorbing states. The agent's performance is measured by the cumulative discounted expected reward (i.e., the expected return), defined as $J(\pi) = \mathbb{E}_\pi[R_0]$. The state-value function $V^\pi(s) = \mathbb{E}_\pi[R_t \mid s_t = s]$ of a policy $\pi$ is defined as the expected return for being in a state $s$ while following $\pi$. The maximization of the expected cumulative reward can be expressed in terms of the state-value function by integrating it over the state space $S$: $J(\pi) = \int_S \mu_0(s) V^\pi(s)\, ds$. The action-value function $Q^\pi(s, a)$, defined as the expected return for performing action $a$ in state $s$ and then following policy $\pi$, is $Q^\pi(s, a) = \mathbb{E}_\pi[R_t \mid s_t = s, a_t = a]$. State- and action-value functions are related by $V^\pi(s) = \int_A \pi(a|s) Q^\pi(s, a)\, da$. We define as $d^\pi(s')$ the discounted weighting of states encountered starting at $s_0 \sim \mu_0(s)$ and following the policy $\pi$: $d^\pi(s') = \int_S \sum_{t=1}^{\infty} \gamma^{t-1} \mu_0(s)\, p_{s_t|s_0,\pi}(s'|s)\, ds$, where $p_{s_t|s_0,\pi}(s'|s)$ is the probability density of transitioning to $s'$ after $t$ time steps, starting from $s$ and following policy $\pi$.

We assume that the reward function $R(s, a)$ is strictly positive⁴, so that the state- and action-value functions are also bounded: $V^\pi(s) \le \frac{1}{1-\gamma}\|R\|_\infty =: B_V < +\infty$. We define the operator⁵ $W: L^\infty(S) \to C(S \times A)$ as
$$[W(V)](s, a) := R(s, a) + \gamma \int_S V(s')\, p_T(s'|s, a)\, ds'$$
and the Bellman optimality operator $T: L^\infty(S \times A) \to C(S \times A)$ as
$$[T(Q)](s, a) := R(s, a) + \gamma \int_S \max_{a'} Q(s', a')\, p_T(s'|s, a)\, ds'.$$
An action-value function $Q^\pi$ is optimal if it is the unique fixed point of $T$. If $Q^\pi$ is optimal, then $\pi$ is an optimal policy.

²This allows for state and action vectors that have discrete, continuous, or mixed components.
³In Sections 3 and 4, a policy is given through its conditional density with respect to $\mu_A$. We also refer to this density as a policy.
⁴It is enough to assume that the reward is bounded, so that it can be linearly mapped to a positive value.
⁵$W$ maps to continuous functions since $R(s, a)$ is continuous and continuity of the integral follows from continuity of $p_T$ in $L^1$ norm and boundedness of $V$.
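When $S$ and $A$ are finite, the reference measures can be taken to be counting measures, so the integrals in $W$ and $T$ become sums over tabular arrays. The sketch below is our own illustration of the two operators on a small randomly generated MDP; the array names and sizes are assumptions, not taken from the paper.

```python
# Tabular sketch (our own, hypothetical toy MDP) of the operators W and T.
import numpy as np

n_S, n_A = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_S), size=(n_S, n_A))   # P[s, a, s'] = p_T(s'|s, a)
R = rng.uniform(0.1, 1.0, size=(n_S, n_A))         # strictly positive rewards
gamma = 0.9

def W(V):
    """[W(V)](s, a) = R(s, a) + gamma * sum_s' p_T(s'|s, a) V(s')."""
    return R + gamma * P @ V

def T(Q):
    """Bellman optimality operator:
    [T(Q)](s, a) = R(s, a) + gamma * sum_s' p_T(s'|s, a) max_a' Q(s', a')."""
    return R + gamma * P @ Q.max(axis=1)

# Iterating T (a contraction) approaches the optimal action-value function Q*.
Q = np.zeros((n_S, n_A))
for _ in range(500):
    Q = T(Q)
print("Q* ≈", np.round(Q, 4))
```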
3 Reward-Weighted Regression

Reward-Weighted Regression (RWR; see Dayan and Hinton 1997; Peters and Schaal 2007; Peng et al. 2019) is an iterative algorithm which consists of two main steps. First, a batch of episodes is generated using the current policy $\pi_n$ (all policies in this section are given as conditional densities with respect to $\mu_A$). Then, a new policy is fitted (using supervised learning under the maximum-likelihood criterion) to a sample representation of the conditional distribution of an action given a state, weighted by the return. The RWR optimization problem is:
$$\pi_{n+1} = \arg\max_{\pi \in \Pi}\; \mathbb{E}_{s \sim d^{\pi_n}(\cdot),\, a \sim \pi_n(\cdot|s)}\, \mathbb{E}_{R_t \sim p(\cdot|s_t=s,\, a_t=a,\, \pi_n)} \left[ R_t \log \pi(a|s) \right], \quad (1)$$
where $\Pi$ is the set of all conditional probability densities (with respect to $\mu_A$)⁶. Notice that $\pi_{n+1}$ is well defined, as its expression does not depend on $t$. This is equivalent to the following:
$$\pi_{n+1} = \arg\max_{\pi \in \Pi}\; \mathbb{E}_{s \sim d^{\pi_n}(\cdot),\, a \sim \pi_n(\cdot|s)} \left[ Q^{\pi_n}(s, a) \log \pi(a|s) \right]. \quad (2)$$
We start by deriving a closed-form solution to the optimization problem:

Theorem 3.1. Let $\pi_0$ be an initial policy and let $\forall s \in S, a \in A:\ R(s, a) > 0$. At each iteration $n > 0$, the solution of the RWR optimization problem is:
$$\pi_{n+1}(a|s) = \frac{Q^{\pi_n}(s, a)\, \pi_n(a|s)}{V^{\pi_n}(s)}. \quad (3)$$

Proof. The optimization problem (2) can be rewritten as
$$\pi_{n+1} = \arg\max_{\pi \in \Pi} \int_S \int_A d^{\pi_n}(s)\, \pi_n(a|s)\, Q^{\pi_n}(s, a) \log \pi(a|s)\, da\, ds.$$
Define $\hat f(s, a) := d^{\pi_n}(s)\, \pi_n(a|s)\, Q^{\pi_n}(s, a)$. $\hat f(s, a)$ can be normalized such that it becomes a density that we fit by $\pi_{n+1}$:
$$f(s, a) = \frac{\hat f(s, a)}{\int_S \int_A \hat f(s, a)\, da\, ds} = \frac{d^{\pi_n}(s)\, \pi_n(a|s)\, Q^{\pi_n}(s, a)}{\int_S \int_A d^{\pi_n}(s)\, \pi_n(a|s)\, Q^{\pi_n}(s, a)\, da\, ds}.$$
Writing $f(s) := \int_A f(s, a)\, da$ and $f(a|s) := f(s, a)/f(s)$, for the function to be maximized we have:
$$\int_S \int_A f(s, a) \log \pi(a|s)\, da\, ds = \int_S f(s) \int_A f(a|s) \log \pi(a|s)\, da\, ds \le \int_S f(s) \int_A f(a|s) \log f(a|s)\, da\, ds,$$
where the last inequality holds for any policy $\pi$, since for all $s \in S$ we have $\int_A f(a|s) \log \pi(a|s)\, da \le \int_A f(a|s) \log f(a|s)\, da$, as $f(a|s)$ is the maximum-likelihood fit. Note that for all states $s \in S$ such that $d^{\pi_n}(s) = 0$, we have $f(s, a) = 0$. Therefore, for such states, the policy will not contribute to the objective and can be defined arbitrarily. Now, assume $d^{\pi_n}(s) > 0$. The objective function achieves a maximum when the two distributions are equal:
$$\pi_{n+1}(a|s) = f(a|s) = \frac{f(s, a)}{f(s)} = \frac{f(s, a)}{\int_A f(s, a)\, da} = \frac{d^{\pi_n}(s)\, \pi_n(a|s)\, Q^{\pi_n}(s, a)}{\int_A d^{\pi_n}(s)\, \pi_n(a|s)\, Q^{\pi_n}(s, a)\, da} = \frac{\pi_n(a|s)\, Q^{\pi_n}(s, a)}{\int_A \pi_n(a|s)\, Q^{\pi_n}(s, a)\, da} = \frac{Q^{\pi_n}(s, a)\, \pi_n(a|s)}{V^{\pi_n}(s)}.$$
We can now set $\pi_{n+1}(a|s) = Q^{\pi_n}(s, a)\, \pi_n(a|s) / V^{\pi_n}(s)$ also for all $s$ such that $d^{\pi_n}(s) = 0$, which completes the proof. Note that $V^{\pi_n}(s)$ is positive thanks to the assumption of positive rewards. Similarly, the denominator $\int_S \int_A \hat f(s, a)\, da\, ds = \int_S d^{\pi_n}(s)\, V^{\pi_n}(s)\, ds > 0$ is positive.

When function approximation is used for the policy $\pi$, the term $f(s)$ weighs the mismatch between $\pi(a|s)$ and $f(a|s)$. Indeed, we have $f(s) \propto d^{\pi}(s) V^{\pi}(s)$, suggesting that the error occurring with function approximation would be weighted more for states visited often and with a bigger value. In our setting, however, the two terms are equal since no function approximation is used.

Theorem 3.1 provides us with an interpretation of how the RWR update rule works: at each iteration, given a state $s$, the probability of an action $a$ produced by policy $\pi_n$ is weighted by the expected return obtained from state $s$, choosing action $a$ and following $\pi_n$. This result is then normalized by $V^{\pi_n}(s)$, providing a new policy $\pi_{n+1}$. Alternatively, we can interpret this new policy as the fraction of return obtained by policy $\pi_n$ from state $s$, after choosing action $a$ with probability $\pi_n(\cdot|s)$. Intuitively, assigning more weight to actions which lead to better return should improve the policy. We prove this in the next section.

⁶We can restrict ourselves to probability kernels dominated by $\mu_A$ instead of all probability kernels thanks to the Lebesgue decomposition.
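In the finite setting, the update of Theorem 3.1 can be iterated directly with exact policy evaluation. The following sketch is our own illustration; the random toy MDP, array names, and iteration counts are assumptions, not the paper's experimental setup.

```python
# Tabular sketch (our own, hypothetical toy MDP) of the closed-form RWR update:
#     pi_{n+1}(a|s) = Q^{pi_n}(s, a) * pi_n(a|s) / V^{pi_n}(s).
import numpy as np

n_S, n_A, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_S), size=(n_S, n_A))   # P[s, a, s'] = p_T(s'|s, a)
R = rng.uniform(0.1, 1.0, size=(n_S, n_A))         # strictly positive rewards

def evaluate(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
    R_pi = (pi * R).sum(axis=1)                     # expected reward per state
    P_pi = np.einsum("sa,sat->st", pi, P)           # induced state transition matrix
    V = np.linalg.solve(np.eye(n_S) - gamma * P_pi, R_pi)
    Q = R + gamma * P @ V
    return V, Q

pi = np.full((n_S, n_A), 1.0 / n_A)                 # uniform initial policy
for n in range(200):
    V, Q = evaluate(pi)
    pi = Q * pi / V[:, None]                        # RWR update (Theorem 3.1)

# The policy mass concentrates on greedy actions of the (near-)optimal Q.
print(np.round(pi, 3))
print("V_n ≈", np.round(V, 3))
```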
4 Monotonic Improvement Theorem

Here we prove that the update defined in Theorem 3.1 leads to monotonic improvement.⁷

Theorem 4.1. Fix $n > 0$ and let $\pi_0 \in \Pi$ be a policy⁸. Assume $\forall s \in S, a \in A:\ R(s, a) > 0$. Define the operator $B: \Pi \to \Pi$ such that $B(\pi) := \frac{Q^\pi \pi}{V^\pi}$ for $\pi \in \Pi$. Thus $\pi_{n+1} = B(\pi_n)$, i.e., $\forall s \in S, a \in A$: $\pi_{n+1}(a|s) = (B\pi_n)(a|s) = \frac{Q^{\pi_n}(s, a)\, \pi_n(a|s)}{V^{\pi_n}(s)}$. Then $\forall s \in S, a \in A$ we have that $V^{\pi_{n+1}}(s) \ge V^{\pi_n}(s)$ and $Q^{\pi_{n+1}}(s, a) \ge Q^{\pi_n}(s, a)$. Moreover, if for some $s \in S$ it holds that $\mathrm{Var}_{a \sim \pi_n(\cdot|s)}[Q^{\pi_n}(s, a)] > 0$, then the first inequality above is strict, i.e., $V^{\pi_{n+1}}(s) > V^{\pi_n}(s)$.

Proof. We start by defining a function $V^{\pi_{n+1}, \pi_n}(s)$ as the expected return for using policy $\pi_{n+1}$ in state $s$ and then following policy $\pi_n$: $V^{\pi_{n+1}, \pi_n}(s) := \int_A \pi_{n+1}(a|s)\, Q^{\pi_n}(s, a)\, da$. By showing that $\forall s \in S$, $V^{\pi_{n+1}, \pi_n}(s) \ge V^{\pi_n}(s)$, we get that $\forall s \in S$, $V^{\pi_{n+1}}(s) \ge V^{\pi_n}(s)$.⁹ Now, let $s$ be fixed:
$$V^{\pi_{n+1}, \pi_n}(s) \ge V^{\pi_n}(s)$$
$$\iff \int_A \pi_{n+1}(a|s)\, Q^{\pi_n}(s, a)\, da \ge \int_A \pi_n(a|s)\, Q^{\pi_n}(s, a)\, da$$
$$\iff \int_A \frac{\pi_n(a|s)\, Q^{\pi_n}(s, a)^2}{V^{\pi_n}(s)}\, da \ge \int_A \pi_n(a|s)\, Q^{\pi_n}(s, a)\, da$$
$$\iff \int_A \pi_n(a|s)\, Q^{\pi_n}(s, a)^2\, da \ge \left( \int_A \pi_n(a|s)\, Q^{\pi_n}(s, a)\, da \right)^2$$
$$\iff \mathbb{E}_{a \sim \pi_n(\cdot|s)}\!\left[Q^{\pi_n}(s, a)^2\right] \ge \mathbb{E}_{a \sim \pi_n(\cdot|s)}\!\left[Q^{\pi_n}(s, a)\right]^2$$
$$\iff \mathrm{Var}_{a \sim \pi_n(\cdot|s)}\!\left[Q^{\pi_n}(s, a)\right] \ge 0,$$
which always holds. Finally, $\forall s \in S, a \in A$:
$$Q^{\pi_{n+1}}(s, a) = R(s, a) + \gamma \int_S p_T(s'|s, a)\, V^{\pi_{n+1}}(s')\, ds' \ge R(s, a) + \gamma \int_S p_T(s'|s, a)\, V^{\pi_n}(s')\, ds' = Q^{\pi_n}(s, a).$$

Theorem 4.1 provides a relationship between the improvement in the state-value function and the variance of the action-value function with respect to the actions sampled. Note that if at a certain point the policy becomes deterministic, or it becomes the greedy policy of its own action-value function (i.e., the optimal policy), then the operator $B$ will map the policy to itself and there will be no improvement.

⁷The case where the MDP has non-negative rewards and the undiscounted case are more complex and treated in Appendix B.
⁸Also in this section all policies are given as conditional densities with respect to $\mu_A$.
⁹The argument is the same as given in (Puterman 2014); see the section on monotonic policy improvement.
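The proof in fact yields an exact identity for the one-step gain measured against the old action-value function: $V^{\pi_{n+1},\pi_n}(s) - V^{\pi_n}(s) = \mathrm{Var}_{a\sim\pi_n(\cdot|s)}[Q^{\pi_n}(s,a)] / V^{\pi_n}(s)$. Below is a short numerical check of this identity; it is our own illustration with arbitrary positive numbers, not material from the paper.

```python
# Check (our own, hypothetical numbers) of the identity behind Theorem 4.1 at
# a single fixed state s:
#     V^{pi_{n+1}, pi_n}(s) - V^{pi_n}(s) = Var_{a~pi_n}[Q^{pi_n}(s, a)] / V^{pi_n}(s)
import numpy as np

rng = np.random.default_rng(2)
n_A = 4
pi_n = rng.dirichlet(np.ones(n_A))          # current policy at the fixed state s
Q_n = rng.uniform(0.1, 1.0, size=n_A)       # positive action values Q^{pi_n}(s, .)

V_n = pi_n @ Q_n                            # V^{pi_n}(s) = E_{a~pi_n}[Q_n(s, a)]
pi_next = Q_n * pi_n / V_n                  # RWR / B-operator update at state s

lhs = pi_next @ Q_n - V_n                   # improvement against the old Q
rhs = (pi_n @ Q_n**2 - V_n**2) / V_n        # Var_{a~pi_n}[Q_n(s, a)] / V_n

assert np.isclose(lhs, rhs)
print(f"improvement = {lhs:.6f}, Var/V = {rhs:.6f}")
```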
5 Convergence Results

Weak Convergence in Topological Factor

It is worth discussing what type of convergence we can achieve by iterating the $B$-operator, $\pi_n := B(\pi_{n-1})$, where the $\pi_n$ are probability densities with respect to a fixed reference measure $\mu_A$. Consider first the classic continuous-variable case, where $\mu_A$ is the Lebesgue measure, and fix $s \in S$. Optimal policies are known to be greedy on the optimal action-value function $Q^*(s, a)$. That is, they concentrate all mass on $\arg\max_a Q^*(s, a)$. If $\arg\max_a Q^*(s, a)$ consists of just a single point $\{a^*\}$, then the optimal policy (measure) $\pi^*(\cdot|s)$ for $s$ concentrates all its mass in $\{a^*\}$. This means that the optimal policy does not have a density with respect to the Lebesgue measure. Furthermore, $(\pi_n(\cdot|s) \cdot \mu_A)(\{a^*\}) = \int_{\{a^*\}} \pi_n(a|s)\, d\mu_A(a) = 0$, while $\pi^*(\{a^*\}|s) = 1$. However, we still want to show that the measures $\pi_n(\cdot|s) \cdot \mu_A$ get concentrated in a neighbourhood of $a^*$ and that this neighbourhood gets tinier as $n$ increases. We will use the concept of weak convergence to prove this.

Another problem arises when considering the above: since $\arg\max_a Q^*(s, a)$ can consist of multiple points, the set of optimal policies is $\mathcal{P}(\arg\max_a Q^*(s, a))$, where $\mathcal{P}(F) := \{\mu : \mu \text{ is a probability measure on } \mathcal{B}(A),\ \mu(F) = 1\}$ for $F \in \mathcal{B}(A)$. We want to prove convergence even when the sequence of policies $\pi_n$ oscillates near $\mathcal{P}(\arg\max_a Q^*(s, a))$. A way of coping with this is to make $\arg\max_a Q^*(s, a)$ a single point through topological factorisation, obtaining the limit by working in a quotient space. The notion of convergence we will be using is described in the following definition.

Definition 1. (Weak convergence of measures in a metric space relative to a compact set) Let $(X, d)$ be a metric space, $F \subseteq X$ a compact subset, and $\mathcal{B}(X)$ its Borel $\sigma$-algebra. Denote by $(\tilde X, \tilde d)$ a metric space resulting as a topological quotient with respect to $F$, and by $\nu$ the quotient map $\nu: X \to \tilde X$ (see Lemma C.2 for details). A sequence of probability measures $P_n$ is said to converge weakly relative to $F$ to a measure $P$, denoted $P_n \xrightarrow{w(F)} P$, if and only if the image measures of $P_n$ under $\nu$ converge weakly to the image measure of $P$ under $\nu$: $\nu P_n \xrightarrow{w} \nu P$.

Note that the limit is meant to be unique just in the quotient space; thus if $P$ is a weak limit (relative to $F$) of a sequence $(P_n)$, then also all measures $P'$ for which $\nu P' = \nu P$ are relative weak limits, i.e., $P'|_{\mathcal{B}(X) \cap F^c} = P|_{\mathcal{B}(X) \cap F^c}$. Thus, they can differ on $\mathcal{B}(X) \cap F$: while the total mass assigned to $F$ must be the same for $P$ and $P'$, the distribution of mass inside $F$ may differ.

Main Results

Consider for all $n > 0$ the sequence generated by $\pi_n := B(\pi_{n-1})$. For convenience, for all $n \ge 0$, we define $Q_n := Q^{\pi_n}$, $V_n := V^{\pi_n}$. First we note that, since the reward is bounded, the monotonic sequences of value functions converge point-wise to a limit:
$$(\forall s \in S): \quad V_n(s) \nearrow V_L(s) \le B_V < +\infty,$$
$$(\forall s \in S, a \in A): \quad Q_n(s, a) \nearrow Q_L(s, a) \le B_V < +\infty,$$
where $B_V = \frac{1}{1-\gamma}\|R\|_\infty$. Further, every $Q_n$ is continuous since $Q_n = W(V_n)$ and $W$ maps all bounded functions to continuous functions. The convergence proof proceeds in four steps:

1. First we show in Lemma 5.1 that $Q_L$ can be expressed in terms of $V_L$ through the $W$ operator. This helps when showing that $Q_n$ converges uniformly to $Q_L$.
2. Then we demonstrate in Lemma 5.2 that for all $s \in S$ the sequence of policy measures $\pi_n(\cdot|s) \cdot \mu_A$ converges weakly relative to the set $M(s) := \arg\max_a Q_L(s, a)$ to a measure that assigns all probability mass to greedy actions of $Q_L(s, \cdot)$, i.e., $\pi_n(\cdot|s) \cdot \mu_A \xrightarrow{w(M(s))} \pi_L(\cdot|s) \in \mathcal{P}(M(s))$. However, we are interested just in those $\pi_L$ which are kernels, i.e., $\pi_L \in \Pi_L := \{\pi'_L : \pi'_L \text{ is a probability kernel from } (S, \mathcal{B}(S)) \text{ to } (A, \mathcal{B}(A)),\ \forall s \in S,\ \pi'_L(\cdot|s) \in \mathcal{P}(M(s))\}$, the set of all greedy policies on $Q_L$.
3. At this point we do not know yet whether $Q_L$ and $V_L$ are the value functions of $\pi_L$. We prove this in Lemma 5.3 (together with the previous Lemmas) by showing that they are fixed points of the Bellman operator.
4. Finally, we state the main results in Theorem 5.1. Since $V_L$ and $Q_L$ are value functions for $\pi_L$ and $\pi_L$ is greedy with respect to $Q_L$, $Q_L$ is the unique fixed point of the Bellman optimality operator:
$$Q_L(s, a) = [T(Q_L)](s, a) = R(s, a) + \gamma \int_S \max_{a'} Q_L(s', a')\, p_T(s'|s, a)\, ds'.$$
Therefore $Q_L$ and $V_L$ are optimal value functions and $\pi_L$ is an optimal policy for the MDP.

Lemma 5.1. The following holds: 1. $Q_L = W(V_L)$; 2. $Q_L$ is continuous; 3. $Q_n$ converges to $Q_L$ uniformly.

Proof. 1. Fix $(s, a) \in S \times A$. We aim to show $Q_L(s, a) - [W(V_L)](s, a) = 0$. Since $Q_n = W(V_n)$, we can write:
$$|Q_L(s, a) - [W(V_L)](s, a)| = |Q_L(s, a) - Q_n(s, a) - [W(V_L)](s, a) + [W(V_n)](s, a)| \le |Q_L(s, a) - Q_n(s, a)| + |[W(V_L)](s, a) - [W(V_n)](s, a)|.$$
The first part can be made arbitrarily small as $Q_n(s, a) \to Q_L(s, a)$. Consider the second part and fix $\epsilon > 0$. Since $V_n \to V_L$ point-wise, by Severini-Egorov's theorem (Severini 1910) there exists $S_\epsilon \subseteq S$ with $(p_T(\cdot|s, a) \cdot \mu_S)(S_\epsilon^c) < \epsilon$ such that $\|V_n - V_L\|_\infty \to 0$ on $S_\epsilon$. Thus there exists $n_0$ such that $\sup_{s' \in S_\epsilon} |V_n(s') - V_L(s')| < \epsilon$ for all $n > n_0$. Now let us rewrite the second part for $n > n_0$:
$$|[W(V_L)](s, a) - [W(V_n)](s, a)| \le \gamma \int_S |V_L(s') - V_n(s')|\, p_T(s'|s, a)\, d\mu_S(s')$$
$$= \gamma \int_{S_\epsilon} |V_L(s') - V_n(s')|\, p_T(s'|s, a)\, d\mu_S(s') + \gamma \int_{S_\epsilon^c} |V_L(s') - V_n(s')|\, p_T(s'|s, a)\, d\mu_S(s')$$
$$\le \gamma \epsilon + \gamma B_V \int_{S_\epsilon^c} p_T(s'|s, a)\, d\mu_S(s') \le \gamma \epsilon\, (1 + B_V),$$
which can be made arbitrarily small.
2. $Q_L$ is continuous because $W$ maps all bounded measurable functions to continuous functions.
3. Since $Q_n$ and $Q_L$ are continuous functions on a compact space and $Q_n$ is a monotonically increasing sequence that converges point-wise to $Q_L$, we can apply Dini's theorem (see Th. 7.13 on page 150 in (Rudin 1976)), which ensures uniform convergence of $Q_n$ to $Q_L$.
Lemma 5.2. Let $\pi_n$ be a sequence generated by $\pi_n := B(\pi_{n-1})$. Let $\pi_0$ be continuous in actions and $\forall s \in S, a \in A:\ \pi_0(a|s) > 0$. Define $M(s) := \arg\max_a Q_L(s, a)$. Then $\Pi_L \ne \emptyset$ and, for every $\pi_L \in \Pi_L$ and every $s \in S$, we have $\pi_n(\cdot|s) \cdot \mu_A \xrightarrow{w(M(s))} \pi_L(\cdot|s)\ (\in \mathcal{P}(M(s)))$.

Proof. First notice that the set $\Pi_L$ is nonempty¹⁰. Fix $\pi_L \in \Pi_L$ and $s \in S$. In order to prove that $\pi_n(\cdot|s) \cdot \mu_A \xrightarrow{w(M(s))} \pi_L(\cdot|s)$, we will use a characterization of relative weak convergence that follows from an adaptation of the Portmanteau Lemma (Billingsley 2013) (see Appendix C.3). In particular, it is enough to show that for all open sets $U \subseteq A$ such that $U \cap M(s) = \emptyset$ or such that $M(s) \subseteq U$, we have that $\liminf_n (\pi_n(\cdot|s) \cdot \mu_A)(U) \ge \pi_L(\cdot|s)(U)$. The case $U \cap M(s) = \emptyset$ is trivial since $\pi_L(\cdot|s)(U) = 0$. For the remaining case $M(s) \subseteq U$ it holds that $\pi_L(\cdot|s)(U) = 1$. Thus we have to prove $\liminf_n (\pi_n(\cdot|s) \cdot \mu_A)(U) = 1$. If we are able to construct an open set $D \subseteq U$ such that $(\pi_n(\cdot|s) \cdot \mu_A)(D) \to 1$ for $n \to \infty$, then we will get that $\liminf_n (\pi_n(\cdot|s) \cdot \mu_A)(U) \ge 1$, satisfying the condition for relative weak convergence $\pi_n(\cdot|s) \cdot \mu_A \xrightarrow{w(M(s))} \pi_L(\cdot|s)$. The remainder of the proof will focus on constructing such a set.

Fix $a^* \in M(s)$ and $0 < \epsilon < 1/3$. Define a continuous map $\lambda: A \to \mathbb{R}_+$ and closed sets $A_\epsilon$ and $B_\epsilon$:
$$\lambda(a) := \frac{Q_L(s, a)}{Q_L(s, a^*)}, \qquad A_\epsilon := \{a \in A \mid \lambda(a) \le 1 - 2\epsilon\}, \qquad B_\epsilon := \{a \in A \mid \lambda(a) \ge 1 - \epsilon\},$$
where continuity of the map stems from $Q_L(s, a^*) > 0$ and continuity of $Q_L$ (Lemma 5.1). We will prove that the candidate set is $D = A_\epsilon^c$. In particular, we must prove that $A_\epsilon^c \subseteq U$ and that $(\pi_n(\cdot|s) \cdot \mu_A)(A_\epsilon) \to 0$. Using Lemma C.1 (Appendix) on the function $\lambda$, we can choose $\epsilon > 0$ such that $A_\epsilon^c \subseteq U$, satisfying the first condition. We are left to prove that $(\pi_n(\cdot|s) \cdot \mu_A)(A_\epsilon) \to 0$. Assume $A_\epsilon \ne \emptyset$ (otherwise the condition is proven): for all $a \in A_\epsilon$ and $b \in B_\epsilon$ it holds that
$$\frac{Q_L(s, a)}{Q_L(s, b)} \le \frac{Q_L(s, a)}{Q_L(s, a^*)(1 - \epsilon)} \le \frac{1 - 2\epsilon}{1 - \epsilon} =: \alpha_1 < 1.$$
By Lemma 5.1, $Q_n$ converges uniformly to $Q_L$. Therefore we can fix $n_0 > 0$ such that $\|Q_n - Q_L\|_\infty < \epsilon'$ for all $n \ge n_0$, where we define $\epsilon' := 0.1\, Q_L(s, a^*)(1 - \epsilon)(1 - \alpha_1)$. Now we can proceed by bounding the $Q_n$ ratio from above. For all $n \ge n_0$, $a \in A_\epsilon$ and $b \in B_\epsilon$:
$$\frac{Q_n(s, a)}{Q_n(s, b)} \le \frac{Q_L(s, a) + \epsilon'}{Q_L(s, a^*)(1 - \epsilon) - \epsilon'} = \frac{Q_L(s, a) + 0.1\, Q_L(s, a^*)(1 - \epsilon)(1 - \alpha_1)}{Q_L(s, a^*)(1 - \epsilon)\,(1 - 0.1(1 - \alpha_1))} \le \frac{\alpha_1 + 0.1(1 - \alpha_1)}{0.9 + 0.1\alpha_1} =: \alpha < 1.$$
Finally, we can bound the policy ratio. For all $n \ge n_0$, $a \in A_\epsilon$, $b \in B_\epsilon$:
$$\frac{\pi_n(a|s)}{\pi_n(b|s)} = \frac{\pi_0(a|s)}{\pi_0(b|s)} \prod_{i=0}^{n-1} \frac{Q_i(s, a)}{Q_i(s, b)} \le \alpha^n c(a, b), \qquad c(a, b) := \alpha^{-n_0}\, \frac{\pi_0(a|s)}{\pi_0(b|s)} \prod_{i=0}^{n_0-1} \frac{Q_i(s, a)}{Q_i(s, b)}.$$
The function $c: A_\epsilon \times B_\epsilon \to \mathbb{R}_+$ is continuous, as $\pi_0$ and the $Q_i$ are continuous (and the denominators are non-zero due to $\pi_0(b|s) > 0$ and $Q_i(s, b) > 0$). Since $A_\epsilon \times B_\epsilon$ is a compact set, there exists $c_m$ such that $c \le c_m$. Thus we have that for all $n > n_0$: $\pi_n(a|s) \le \alpha^n c_m\, \pi_n(b|s)$. Integrating with respect to $a$ over $A_\epsilon$ and then with respect to $b$ over $B_\epsilon$ (using the reference measure $\mu_A$ in both cases) we obtain:
$$(\pi_n(\cdot|s) \cdot \mu_A)(A_\epsilon)\, \mu_A(B_\epsilon) \le \alpha^n c_m\, (\pi_n(\cdot|s) \cdot \mu_A)(B_\epsilon)\, \mu_A(A_\epsilon).$$
Rearranging terms, we have:
$$(\pi_n(\cdot|s) \cdot \mu_A)(A_\epsilon) \le \alpha^n \left( \frac{c_m\, \mu_A(A_\epsilon)}{\mu_A(B_\epsilon)} \right) (\pi_n(\cdot|s) \cdot \mu_A)(B_\epsilon),$$
since the numerator in brackets is composed of finite measures of sets, thus finite numbers, while the denominator $\mu_A(B_\epsilon) > 0$. Indeed, define the open set $C := \{a \in A \mid \lambda(a) > 1 - \epsilon\} \subseteq B_\epsilon$; since $a^* \in C$, the set $C$ is nonempty and open, so $\mu_A(B_\epsilon) \ge \mu_A(C) > 0$ ($\mu_A$ is strictly positive). To conclude, we have proven that for arbitrarily small $\epsilon > 0$, the term $(\pi_n(\cdot|s) \cdot \mu_A)(A_\epsilon)$ tends to $0$, satisfying the condition for relative weak convergence $\pi_n(\cdot|s) \cdot \mu_A \xrightarrow{w(M(s))} \pi_L(\cdot|s)$.

¹⁰The argument goes as follows: $H := \bigcup_{s \in S} \{s\} \times M(s)$ is a closed set; then $f(s) := \sup M(s)$ is upper semi-continuous and therefore measurable. Then the graph of $f$ is measurable, so we can define a probability kernel $\pi_L(B|s) := 1_B(f(s))$ for all measurable $B$.
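The mechanism at the heart of Lemma 5.2, namely that probability mass on actions whose $Q$-value ratio to a greedy action is bounded below one decays geometrically under the $B$-operator, can be observed numerically already in a degenerate single-state example. The following sketch is our own illustration; the reward vector, discount factor, and all names are assumptions, not taken from the paper.

```python
# Single-state illustration (our own, hypothetical numbers) of geometric decay
# of the policy mass on non-greedy actions under the B-operator.
import numpy as np

R = np.array([1.0, 0.6, 0.3])        # one state, three actions, positive rewards
gamma = 0.9
pi = np.ones(3) / 3                  # uniform initial policy

prev_mass = None
for n in range(31):
    V = pi @ R / (1.0 - gamma)       # single state: V = pi.R + gamma * V
    Q = R + gamma * V                # Q(a) = R(a) + gamma * V
    pi = Q * pi / (pi @ Q)           # RWR / B-operator update
    mass = pi[1:].sum()              # mass on the two non-greedy actions
    if prev_mass is not None and n % 5 == 0:
        # The ratio settles near max_a Q(a)/Q(a*) for non-greedy a, i.e. < 1.
        print(f"n={n:2d}  non-greedy mass={mass:.3e}  ratio={mass / prev_mass:.3f}")
    prev_mass = mass
```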
Lemma 5.3. Assume that, for each $s \in S$ and each $\pi_L \in \Pi_L$, we have $\pi_n(\cdot|s) \cdot \mu_A \xrightarrow{w(M(s))} \pi_L(\cdot|s)\ (\in \mathcal{P}(M(s)))$. Then the following holds:
$$V_L(s) = \int_A Q_L(s, a)\, d\pi_L(a|s). \quad (4)$$

Proof. Fix $s \in S$ and $\pi_L \in \Pi_L$. We aim to show $V_L(s) - \int_A Q_L(s, a)\, d\pi_L(a|s) = 0$. Since $V_n(s) - \int_A Q_n(s, a)\, \pi_n(a|s)\, d\mu_A(a) = 0$, we have:
$$\left| V_L(s) - \int_A Q_L(s, a)\, d\pi_L(a|s) \right| = \left| V_L(s) - V_n(s) - \int_A Q_L(s, a)\, d\pi_L(a|s) + \int_A Q_n(s, a)\, \pi_n(a|s)\, d\mu_A(a) \right|$$
$$\le |V_L(s) - V_n(s)| + \left| \int_A Q_L(s, a)\, d\pi_L(a|s) - \int_A Q_n(s, a)\, \pi_n(a|s)\, d\mu_A(a) \right|.$$
The first part can be made arbitrarily small since $V_n(s) \to V_L(s)$. For the second part:
$$\left| \int_A Q_L(s, a)\, d\pi_L(a|s) - \int_A Q_n(s, a)\, \pi_n(a|s)\, d\mu_A(a) \right|$$
$$\le \left| \int_A Q_L(s, a)\, d\pi_L(a|s) - \int_A Q_L(s, a)\, \pi_n(a|s)\, d\mu_A(a) \right| + \left| \int_A Q_L(s, a)\, \pi_n(a|s)\, d\mu_A(a) - \int_A Q_n(s, a)\, \pi_n(a|s)\, d\mu_A(a) \right|$$
$$\le \left| \int_A Q_L(s, a)\, d\pi_L(a|s) - \int_A Q_L(s, a)\, \pi_n(a|s)\, d\mu_A(a) \right| + \int_A |Q_L(s, a) - Q_n(s, a)|\, \pi_n(a|s)\, d\mu_A(a),$$
where the first term tends to zero since $\pi_n(\cdot|s) \cdot \mu_A \xrightarrow{w(M(s))} \pi_L(\cdot|s)$ and $Q_L$ is continuous and constant on $M(s)$, satisfying the conditions of the adapted Portmanteau Lemma (Billingsley 2013) (see Appendix C.3). The second term can be made arbitrarily small since Lemma 5.1 ensures uniform convergence of $Q_n$ to $Q_L$.

Theorem 5.1. Let $\pi_n$ be a sequence generated by $\pi_n := B(\pi_{n-1})$. Let $\pi_0$ be such that $\forall s \in S, a \in A:\ \pi_0(a|s) > 0$ and continuous in actions. Then for all $s \in S$, $\pi_n(\cdot|s) \cdot \mu_A \xrightarrow{w(M(s))} \pi_L(\cdot|s)$, where $\pi_L \in \Pi_L$ is an optimal policy for the MDP. Moreover, $\lim_n V_n = V_L$ and $\lim_n Q_n = Q_L$ are the optimal state- and action-value functions.

Proof. Fix $\pi_L \in \Pi_L$ (we have already shown that $\Pi_L \ne \emptyset$). Due to Lemma 5.2, we know that for all $s \in S$, $\pi_L(\cdot|s)$ is the relative weak limit $\pi_n(\cdot|s) \cdot \mu_A \xrightarrow{w(M(s))} \pi_L(\cdot|s)$, and further we know that $\pi_L$ is greedy on $Q_L(s, a)$ (from the definition of $\Pi_L$). Moreover, thanks to Lemmas 5.3 and 5.1, $V_L(s)$ and $Q_L(s, a)$ are the state- and action-value functions of $\pi_L$, because they are fixed points of the Bellman operator. Since $\pi_L(\cdot|s) \in \mathcal{P}(\arg\max_a Q_L(s, a))$, $V_L(s)$ and $Q_L(s, a)$ are also the unique fixed points of the Bellman optimality operator; hence $V_L$, $Q_L$ are optimal value functions and $\pi_L$ is an optimal policy.

This result has several implications. First, it provides a solid theoretical ground for both previous and future works that are based on RWR (Dayan and Hinton 1997; Peters and Schaal 2007; Peng et al. 2019) and lends us some additional understanding of the properties of similar algorithms (e.g., Abdolmaleki et al. (2018b)). It should also be stressed that the results presented herein are for compact state and action spaces: traits of some key domains such as robotic control. In addition to the above, one should also note that the upper bound on $(\pi_n(\cdot|s) \cdot \mu_A)(A_\epsilon)$ constructed in Lemma 5.2 can be used to study convergence orders and convergence rates of RWR. The following corollary, for example, proves R-linear convergence for the special case of finite state and action spaces:

Corollary 5.1. Under the assumptions of Lemma 5.2, if $S$ and $A$ are finite, then $\|V^* - V_n\|_\infty = O(\alpha_m^n)$, where $0 \le \alpha_m < 1$, $\alpha_m := \frac{2\lambda_m}{0.9 + 1.1\lambda_m}$, and $\lambda_m := \max_{s \in S} \max_{a \in A \setminus M(s)} \frac{Q^*(s, a)}{V^*(s)}$.

A proof of the above is included in Appendix D. We observe that in the finite case, $V^* - V_n$ converges to $0$ R-linearly (i.e., it is bounded by a Q-linearly converging sequence $\alpha_m^n$). We provide an example of a finite MDP in Lemma D.1 which exhibits a linear convergence rate, showing that the upper bound from Corollary 5.1 is asymptotically tight with regard to the convergence order. Therefore it is not possible to achieve an order of convergence better than linear. Furthermore, the example in Lemma D.2 shows that, in the continuous case, the convergence order can be sub-linear. Appendix E discusses the motivation of our approach.
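As an informal numerical companion to Corollary 5.1, the sketch below tracks the value error $V^* - V_n$ and its successive ratios on a toy single-state MDP. It is our own construction, not material from the paper; the constant `alpha_m` is computed from the formula stated in the corollary above.

```python
# Empirical check (our own, hypothetical toy MDP) of R-linear convergence of
# V* - V_n under B-operator iteration, compared against alpha_m from
# Corollary 5.1 as stated above.
import numpy as np

R = np.array([1.0, 0.5, 0.2])     # one state, three actions, positive rewards
gamma = 0.5
V_star = R.max() / (1.0 - gamma)  # optimal value of the single state
Q_star = R + gamma * V_star
lam_m = np.max(Q_star[1:]) / V_star              # max over non-greedy actions
alpha_m = 2 * lam_m / (0.9 + 1.1 * lam_m)        # constant from Corollary 5.1

pi = np.ones(3) / 3
errors = []
for n in range(40):
    V = pi @ R / (1.0 - gamma)    # exact evaluation: V = pi.R + gamma * V
    Q = R + gamma * V
    errors.append(V_star - V)
    pi = Q * pi / (pi @ Q)        # B-operator / RWR update

ratios = np.array(errors[1:]) / np.array(errors[:-1])
print(f"lambda_m = {lam_m:.3f}, alpha_m = {alpha_m:.3f}")
print("late error ratios:", np.round(ratios[-5:], 3))
# The late ratios settle near lambda_m (< alpha_m), consistent with the bound
# ||V* - V_n|| = O(alpha_m^n).
```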
6 Demonstration of RWR Convergence

To illustrate that the update scheme of Theorem 3.1 converges to the optimal policy, we test it on a simple environment that meets the assumptions of the Theorem. In particular, we ensure that rewards are positive and that no function approximation is used for value functions and policies. In order to meet these criteria, we use the modified four-room gridworld domain (Sutton, Precup, and Singh 1999) shown on the left of Figure 1. Here the agent starts in the upper left corner and must navigate to the bottom right corner (i.e., the goal state). In non-goal states, actions are restricted to moving one square at each step in any of the four cardinal directions. If the agent tries to move into a square containing a wall, it remains in place. In the goal state, all actions lead to the agent remaining in place. The agent receives a reward of 1 when transitioning from a non-goal state to the goal state and a reward of 0.001 otherwise. The discount rate is 0.9 at each step.

Figure 1: (Top) The value of states under the optimal policy in the four-room gridworld domain. (Bottom) The root-mean-squared value error of reward-weighted regression in the four-room gridworld domain compared to the optimal policy, and the return obtained by running the learned policy of reward-weighted regression. All lines are averages of 100 runs under different uniform random initial policies. Shading shows standard deviation.

At each iteration, we use Bellman updates to obtain a reliable estimate of $Q_n$ and $V_n$, before updating $\pi_n$ using the operator in Theorem 3.1. The bottom left of Figure 1 shows the root-mean-squared value error (RMSVE) of the learned policy at each iteration compared to the optimal policy, while the bottom right shows the return obtained by the learned policy at each iteration. Smooth convergence can be observed under reward-weighted regression. The source code for this experiment is available at https://github.com/dylanashley/reward-weighted-regression.
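The loop just described (iterative Bellman backups to estimate $Q_n$ and $V_n$, followed by the update of Theorem 3.1) can be sketched as follows. This is our own schematic on a small random tabular MDP; it does not reproduce the four-room gridworld, its reward structure, or the exact evaluation schedule used in the experiment.

```python
# Schematic (our own, hypothetical toy MDP) of the experimental loop:
# iterative policy evaluation, then the RWR update, tracking RMSVE.
import numpy as np

n_S, n_A, gamma = 5, 4, 0.9
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(n_S), size=(n_S, n_A))   # P[s, a, s'] = p_T(s'|s, a)
R = rng.uniform(0.001, 1.0, size=(n_S, n_A))       # strictly positive rewards

def policy_evaluation(pi, sweeps=1000):
    """Bellman expectation backups until Q_n, V_n are reliable estimates."""
    Q = np.zeros((n_S, n_A))
    for _ in range(sweeps):
        V = (pi * Q).sum(axis=1)
        Q = R + gamma * P @ V
    return (pi * Q).sum(axis=1), Q

# Optimal values via value iteration, used only for the error curve.
Q_opt = np.zeros((n_S, n_A))
for _ in range(2000):
    Q_opt = R + gamma * P @ Q_opt.max(axis=1)
V_opt = Q_opt.max(axis=1)

pi = np.full((n_S, n_A), 1.0 / n_A)                # uniform random initial policy
for n in range(100):
    V, Q = policy_evaluation(pi)
    rmsve = np.sqrt(np.mean((V_opt - V) ** 2))     # root-mean-squared value error
    pi = Q * pi / V[:, None]                        # RWR update (Theorem 3.1)
print(f"final RMSVE: {rmsve:.2e}")
```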
7 Related Work

The principle behind expectation-maximization was first applied to artificial neural networks by Von der Malsburg (1973). The reward-weighted regression (RWR) algorithm, though, originated in the work of Peters and Schaal (2007), which sought to bring earlier work of Dayan and Hinton (1997) to the domain of operational space control and reinforcement learning. However, Peters and Schaal (2007) only considered the immediate-reward reinforcement learning (RL) setting. This was later extended to the episodic setting separately by Wierstra et al. (2008a) and then by Kober and Peters (2011). Wierstra et al. (2008a) went even further and also extended RWR to partially observable Markov decision processes, and Kober and Peters (2011) applied it to motor learning in robotics. Separately, Wierstra et al. (2008b) extended RWR to perform fitness maximization for evolutionary methods. Hachiya, Peters, and Sugiyama (2009, 2011) later found a way of reusing old samples to improve RWR's sample complexity. Much later, Peng et al. (2019) modified RWR to produce an algorithm for off-policy RL, using deep neural networks as function approximators.

Other methods based on principles similar to RWR have been proposed. Neumann and Peters (2008), for example, proposed a more efficient version of the well-known fitted Q-iteration algorithm (Riedmiller 2005; Ernst, Geurts, and Wehenkel 2005; Antos, Munos, and Szepesvári 2007) by using what they refer to as advantage-weighted regression, which itself is based on the RWR principle. Ueno et al. (2012) later proposed weighted likelihood policy search and showed that their method has guaranteed monotonic increases in the expected reward. Osa and Sugiyama (2018) subsequently proposed a hierarchical RL method called hierarchical policy search via return-weighted density estimation and showed that it is closely related to the episodic version of RWR by Kober and Peters (2011). Notably, all of the aforementioned works, as well as a number of other similar proposed RL methods (e.g., Peters, Mülling, and Altun (2010); Neumann (2011); Abdolmaleki et al. (2018b); Abdolmaleki et al. (2018a)), are based on the expectation-maximization framework of Dempster, Laird, and Rubin (1977) and are thus known to yield monotonic improvements of the policy in the RL setting under certain conditions. However, it has remained an open question under which conditions convergence to the optimum is guaranteed.

8 Conclusion and Future Work

We provided the first global convergence proof for Reward-Weighted Regression (RWR) in the absence of reward transformation and function approximation. The convergence achieved is linear when using finite state and action spaces and can be sub-linear in the continuous case. We also highlighted problems that may arise under nonlinear reward transformations, potentially resulting in changes to the optimal policy. In real-world problems, access to true value functions may be unrealistic. Future work will study RWR's convergence under function approximation. In such a case, the best scenario that one can expect is to achieve convergence to a local optimum. One possible approach is to follow a procedure similar to standard policy gradients (Sutton et al. 1999) and derive a class of value function approximators that is compatible with the RWR objective. It might then be possible to prove local convergence under value function approximation using stochastic approximation techniques (Borkar 2008; Sutton, Maei, and Szepesvári 2009; Sutton et al. 2009). This would require casting the value function and policy updates as a system of equations and studying the convergence of the corresponding ODE under specific assumptions. Our RWR is on-policy, using only recent data to update the current policy. Future work will also study convergence in challenging off-policy settings (using all past data), which require correcting the mismatch between state distributions, typically through a mechanism like importance sampling.

Acknowledgements

We would like to thank Sjoerd van Steenkiste and František Žák for their insightful comments. This work was supported by the European Research Council (ERC, Advanced Grant Number 742870), the Swiss National Supercomputing Centre (CSCS, Project s1090), and by the Swiss National Science Foundation (Grant Number 200021_192356, Project NEUSYM). We also thank both the NVIDIA Corporation for donating a DGX-1 as part of the Pioneers of AI Research Award and IBM for donating a Minsky machine.

References

Abdolmaleki, A.; Springenberg, J. T.; Degrave, J.; Bohez, S.; Tassa, Y.; Belov, D.; Heess, N.; and Riedmiller, M. 2018a. Relative Entropy Regularized Policy Iteration. arXiv:1812.02256.
Abdolmaleki, A.; Springenberg, J. T.; Tassa, Y.; Munos, R.; Heess, N.; and Riedmiller, M. A. 2018b. Maximum a Posteriori Policy Optimisation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
Antos, A.; Munos, R.; and Szepesvári, C. 2007. Fitted Q-iteration in continuous action-space MDPs. In Platt, J. C.; Koller, D.; Singer, Y.; and Roweis, S. T., eds., Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, 9-16. Curran Associates, Inc.
Bellemare, M. G.; Candido, S.; Castro, P. S.; Gong, J.; Machado, M. C.; Moitra, S.; Ponda, S. S.; and Wang, Z. 2020. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588(7836): 77-82.
Billingsley, P. 2013. Convergence of Probability Measures. John Wiley & Sons.
Borkar, V. S. 2008. Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Hindustan Book Agency.
Dayan, P.; and Hinton, G. E. 1997. Using Expectation-Maximization for Reinforcement Learning. Neural Comput., 9(2): 271-278.
Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1): 1-22.
Ernst, D.; Geurts, P.; and Wehenkel, L. 2005. Tree-Based Batch Mode Reinforcement Learning. J. Mach. Learn. Res., 6: 503-556.
Hachiya, H.; Peters, J.; and Sugiyama, M. 2009. Efficient Sample Reuse in EM-Based Policy Search. In Buntine, W. L.; Grobelnik, M.; Mladenic, D.; and Shawe-Taylor, J., eds., Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2009, Bled, Slovenia, September 7-11, 2009, Proceedings, Part I, volume 5781 of Lecture Notes in Computer Science, 469-484. Springer.
Hachiya, H.; Peters, J.; and Sugiyama, M. 2011. Reward-Weighted Regression with Sample Reuse for Direct Policy Search in Reinforcement Learning. Neural Comput., 23(11): 2798-2832.
Kober, J.; and Peters, J. 2011. Policy search for motor primitives in robotics. Mach. Learn., 84(1-2): 171-203.
Munkres, J. 2000. Topology. Prentice Hall, Incorporated.
Neumann, G. 2011. Variational Inference for Policy Search in changing situations. In Getoor, L.; and Scheffer, T., eds., Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, 817-824. Omnipress.
Neumann, G.; and Peters, J. 2008. Fitted Q-iteration by Advantage Weighted Regression. In Koller, D.; Schuurmans, D.; Bengio, Y.; and Bottou, L., eds., Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, 1177-1184. Curran Associates, Inc.
Osa, T.; and Sugiyama, M. 2018. Hierarchical Policy Search via Return-Weighted Density Estimation. In McIlraith, S. A.; and Weinberger, K. Q., eds., Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 3860-3867. AAAI Press.
Peng, X. B.; Kumar, A.; Zhang, G.; and Levine, S. 2019. Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning. arXiv:1910.00177.
Peters, J.; Mülling, K.; and Altun, Y. 2010. Relative Entropy Policy Search. In Fox, M.; and Poole, D., eds., Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010. AAAI Press.
Peters, J.; and Schaal, S. 2007. Reinforcement Learning by Reward-Weighted Regression for Operational Space Control. In Ghahramani, Z., ed., Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, volume 227 of ACM International Conference Proceeding Series, 745-750. ACM.
Peters, J.; and Schaal, S. 2008. Learning to Control in Operational Space. The International Journal of Robotics Research, 27(2): 197-212.
Plappert, M.; Andrychowicz, M.; Ray, A.; McGrew, B.; Baker, B.; Powell, G.; Schneider, J.; Tobin, J.; Chociej, M.; Welinder, P.; Kumar, V.; and Zaremba, W. 2018. Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research. arXiv:1802.09464.
Pollard, D. 2001. A User's Guide to Measure Theoretic Probability. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Puterman, M. L. 2014. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
Riedmiller, M. A. 2005. Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method. In Gama, J.; Camacho, R.; Brazdil, P.; Jorge, A.; and Torgo, L., eds., Machine Learning: ECML 2005, 16th European Conference on Machine Learning, Porto, Portugal, October 3-7, 2005, Proceedings, volume 3720 of Lecture Notes in Computer Science, 317-328. Springer.
Rudin, W. 1976. Principles of Mathematical Analysis. McGraw-Hill, New York, 3rd edition.
Severini, C. 1910. Sulle successioni di funzioni ortogonali [On Sequences of Orthogonal Functions]. Atti dell'Accademia Gioenia, serie 5a (in Italian), 3(5): Memoria XIII, 1-7, JFM 41.0475.04. Published by the Accademia Gioenia in Catania.
Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T. P.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587): 484-489.
Stratonovich, R. 1960. Conditional Markov processes. Theory of Probability and Its Applications, 5(2): 156-178.
Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. USA: A Bradford Book.
Sutton, R. S.; Maei, H. R.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesvári, C.; and Wiewiora, E. 2009. Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 993-1000. New York, NY, USA: Association for Computing Machinery.
Sutton, R. S.; Maei, H. R.; and Szepesvári, C. 2009. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, 1609-1616.
Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 1999. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS'99, 1057-1063. Cambridge, MA, USA: MIT Press.
Sutton, R. S.; Precup, D.; and Singh, S. P. 1999. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artif. Intell., 112(1-2): 181-211.
Ueno, T.; Hayashi, K.; Washio, T.; and Kawahara, Y. 2012. Weighted Likelihood Policy Search with Model Selection. In Bartlett, P. L.; Pereira, F. C. N.; Burges, C. J. C.; Bottou, L.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, 2366-2374.
Von der Malsburg, C. 1973. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14(2): 85-100.
Wierstra, D.; Schaul, T.; Peters, J.; and Schmidhuber, J. 2008a. Episodic Reinforcement Learning by Logistic Reward-Weighted Regression. In Kůrková, V.; Neruda, R.; and Koutník, J., eds., Artificial Neural Networks - ICANN 2008, 18th International Conference, Prague, Czech Republic, September 3-6, 2008, Proceedings, Part I, volume 5163 of Lecture Notes in Computer Science, 407-416. Springer.
Wierstra, D.; Schaul, T.; Peters, J.; and Schmidhuber, J. 2008b. Fitness Expectation Maximization. In Rudolph, G.; Jansen, T.; Lucas, S. M.; Poloni, C.; and Beume, N., eds., Parallel Problem Solving from Nature - PPSN X, 10th International Conference, Dortmund, Germany, September 13-17, 2008, Proceedings, volume 5199 of Lecture Notes in Computer Science, 337-346. Springer.
Wu, C. J. 1983. On the Convergence Properties of the EM Algorithm. The Annals of Statistics, 11(1): 95-103.