# Actor-Critic based Improper Reinforcement Learning

Mohammadi Zaki (1), Avinash Mohan (2), Aditya Gopalan (1), Shie Mannor (3)

(1) Department of ECE, IISc, Bangalore, India. (2) Boston University, Massachusetts, USA. (3) Faculty of Electrical Engineering, Technion, Haifa, Israel, and NVIDIA Research, Israel. Correspondence to: Mohammadi Zaki.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We consider an improper reinforcement learning setting where a learner is given M base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a given target environment with relatively few trials. Towards this, we propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic (AC) based scheme and a Natural Actor-Critic (NAC) scheme depending on the available information. Both algorithms operate over a class of improper mixtures of the given controllers. For the first case, we derive convergence rate guarantees assuming access to a gradient oracle. For the AC-based approach we provide convergence rate guarantees to a stationary point in the basic AC case and to a global optimum in the NAC case. Numerical results on (i) the standard control-theoretic benchmark of stabilizing a cartpole, and (ii) a constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable.

1. Introduction

A natural approach to designing effective controllers for large, complex systems is to first approximate the system using a tried-and-true Markov decision process (MDP) model, such as the Linear Quadratic Regulator (LQR) (Dean et al., 2017) or tabular MDPs (Auer et al., 2009), and then compute (near-)optimal policies for the assumed model. Though this yields favorable results in principle, it is quite possible that errors in describing or understanding the system, leading to misspecified models, result in overfitting and hence subpar controllers in practice. Moreover, in many cases, the stability of the designed controller may be crucial and more desirable than optimizing a fine-grained cost function. From the controller design standpoint, it is often easier, cheaper and more interpretable to specify or hardcode control policies based on domain-specific principles, e.g., anti-lock braking system (ABS) controllers (Radac & Precup, 2018). For these reasons, we investigate in this paper a promising, general-purpose reinforcement learning (RL) approach towards designing controllers (we use the terms policy and controller interchangeably in this article), given predesigned ensembles of basic or atomic controllers, which (a) allows for flexibly combining the given controllers to obtain richer policies than the atomic policies, and, at the same time, (b) can preserve the basic structure of the given class of controllers and confer a high degree of interpretability on the resulting hybrid policy.

Overview of the approach. We consider a situation where we are given black-box access to M controllers (maps from states to action distributions) {K1, ..., KM} for an unknown MDP.
By this we mean that we can choose to invoke any of the given controllers at any point during the operation of the system. With the understanding that the given family of controllers is reasonable, we frame the problem of learning the best combination of the controllers by trial and error. We first set up an improper policy class of all randomized mixtures of the M given controllers; each such mixture is parameterized by a probability distribution over the M base controllers. Applying an improper policy in this class amounts to selecting, independently at each time, a base controller according to this distribution and implementing the recommended action as a function of the present state of the system. The learner's goal is to find the best performing mixture policy by iteratively testing from the pool of given controllers and observing the resulting state-action-reward trajectory. Note that the underlying parameterization in our setting is over a set of given controllers, which could be potentially abstract and defined for complex MDPs with continuous state/action spaces, instead of the (standard) policy gradient (PG) view where the parameterization directly defines the policy in terms of the state-action map. Our problem, therefore, hews more closely to a meta-RL framework, in that we operate over a set of controllers that have themselves been designed using some optimization framework to which we are agnostic. This has the advantage of conferring a great deal of generality, since the class of controllers can now be chosen to promote any desirable secondary characteristic such as interpretability, ease of implementation or cost effectiveness.

It is also worth noting that our approach is different from treating each of the base controllers as an expert and applying standard mixture-of-experts algorithms, e.g., Hedge or Exponentiated Gradient (Littlestone & Warmuth, 1994; Auer et al., 1995; Kocák et al., 2014; Neu, 2015). Whereas the latter approach is tailored to converge to the best single controller (under the usual gradient approximation framework) and hence qualifies as a proper learning algorithm, our optimization problem is over the improper class of mixture policies, which not only contains each atomic controller but also allows for a true mixture (i.e., one which puts positive probability on at least two elements) of many atomic controllers to achieve optimality; we exhibit concrete examples where this is indeed possible.

Our Contributions. We make the following contributions in this context. We develop a gradient-based RL algorithm to iteratively tune a softmax parameterization of an improper (mixture) policy defined over the base controllers (Algorithm 1). While this algorithm, Softmax Policy Gradient (or Softmax PG), relies on the availability of value function gradients, we later propose a modification, which we call Grad Est (see Alg. 6 in the appendix), to Softmax PG to rectify this. Grad Est uses a combination of rollouts and Simultaneous Perturbation Stochastic Approximation (SPSA) (Borkar, 2008) to estimate the value gradient at the current mixture distribution. We show a convergence rate of O(1/t) to the optimal value function for finite state-action MDPs.
To do this, we employ a novel non-uniform Łojasiewicz-type inequality (Łojasiewicz, 1963) that lower-bounds the 2-norm of the value gradient in terms of the suboptimality of the current mixture policy's value. Essentially, this helps establish that when the gradient of the value function hits zero, the value function is itself close to the optimum.

Policy-gradient methods are well known to suffer from high variance (Peters & Schaal, 2008; Bhatnagar et al., 2009). To circumvent this issue, we develop an algorithm that can switch between a simple Actor-Critic (AC) based scheme and a Natural Actor-Critic (NAC) scheme depending on the available information. The algorithm, ACIL (Sec. 5), executes on a single sample path, without requiring any forced resets, as is common in many RL algorithms. We provide convergence rate guarantees to a stationary point in the basic AC case and to a global optimum in the NAC case, under some additional (but standard) assumptions (of uniform ergodicity). The total complexity of AC is measured to attain an (ε + critic error)-accurate stationary point; the total complexity of NAC is measured to attain an (ε + critic error + actor error)-accurate solution. We use linear function approximation for the value function, and our convergence analysis shows exactly how this approximation affects the final complexity bound.

We corroborate our theory using extensive simulation studies. For the PG-based method we use Grad Est in two different settings: (a) the well-known cartpole system and (b) a scheduling task in a constrained queueing system. We discuss both settings in detail in Sec. 2, where we also demonstrate the power of our improper learning approach in finding control policies with provably good performance. In our experiments (see Sec. 6), we eschew access to exact value gradients and instead rely on a combination of rollouts and SPSA to estimate them. For the actor-critic based learner, we demonstrate simulations on various queueing networks using the natural actor-critic based ACIL. All the results show that our proposed algorithms quickly converge to the correct mixture of available atomic controllers.

Related Work (brief). We provide a quick survey of relevant literature; a detailed survey is deferred to the appendix.

Policy gradient. The basic policy gradient method has become a cornerstone of modern RL and given birth to an entire class of highly efficient policy search techniques such as CPI (Kakade & Langford, 2002), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), and MADDPG (Lowe et al., 2020). A growing body of recent work shows promising results about convergence rates for PG algorithms over finite state-action MDPs (Agarwal et al., 2020a; Shani et al., 2020; Bhandari & Russo, 2019; Mei et al., 2020), where the parameterization is over the entire space of state-action pairs, i.e., R^{|S| x |A|}. These advances, however, are partially offset by negative results such as those in Li et al. (2021), which show that the convergence time can be as large as Ω(|S|^{2^{1/(1−γ)}}), where S is the state space of the MDP and γ the discount factor, even with exact gradient knowledge.

Improper learning. The above works concern proper learning, where the policy search space is usually taken to be the set of all deterministic policies for an MDP. Improper learning, on the other hand, has been studied in statistical learning theory for the IID setting (Daniely et al., 2014; 2013).
In this representation-independent learning framework, the learning algorithm is not restricted to output a hypothesis from a given set of hypotheses.

Boosting. Agarwal et al. (2020b) attempt to frame and solve policy optimization over an improper class by boosting a given class of controllers. This work, however, is situated in the context of non-stochastic control and assumes perfect knowledge of (i) the memory-boundedness of the MDP, and (ii) the state noise vector in every round, which amounts to essentially knowing the MDP transition dynamics. We work in the stochastic MDP setting and assume no access to the MDP's transition kernel. Further, it is assumed in (Agarwal et al., 2020b) that all the available atomic controllers are stabilizing, which, when working with an unknown MDP, is a very strong assumption to make. We make no such assumption on our atomic controller class; we show that our algorithms can begin with provably unstable controllers and yet succeed in stabilizing the system (Sec. 2.2 and 6).

Options framework. Our work differs in spirit from the options framework (Barreto et al., 2017; Sutton et al., 1999) for hierarchical RL, in that we allow each controller to be applied in each round rather than waiting for a subtask to complete. The current work deals with finding an optimal mixture of basic controllers to solve a particular task. However, if we allow for a state-dependent choice of controllers, then the proposed methods can be generalized to solve hierarchical RL tasks.

Ensemble policy-based RL. Our current work deals with accessing given (possibly separately trained) controllers as black boxes and learning to combine them optimally. In contrast, in ensemble RL approaches (Maclin & Opitz, 2011; Xiliang et al., 2018; Wiering & van Hasselt, 2008) the base policies are learnt on the fly by the agent (e.g., via Q-learning or SARSA), whereas the combining rule is fixed upfront (e.g., majority voting, rank voting, Boltzmann multiplication, etc.). Moreover, the base policies have access to the new system in ensemble RL, which gives them a distinct advantage. Our method can serve as a meta-RL adaptation framework with theoretical guarantees, which can use such pre-trained models and combine them optimally. To the best of our knowledge, ensemble RL works like (Xiliang et al., 2018; Wiering & van Hasselt, 2008) do not provide theoretical guarantees on the learnt combined policy; our work, on the other hand, provides a firm theoretical as well as empirical basis for the methods we propose.

Improper learning with given base controllers. Probably the closest work to ours is that of Banijamali et al. (2019), which aims at finding the best convex combination of a given set of base controllers for a given MDP. They, however, frame it as a planning problem where the transition kernel P is known to the agent. Furthermore, we treat the base controllers as black-box entities, whereas they exploit their structure to compute the state-occupancy measures.

Actor-critic methods. Actor-critic (AC) methods were first introduced in Konda & Tsitsiklis (2000). Natural actor-critic methods were first introduced in (Peters & Schaal, 2008; Bhatnagar et al., 2009). While many studies are available for the asymptotic convergence of AC and NAC, we use the new techniques proposed by Xu et al. (2020) and Barakat et al. (2021) for showing convergence results.
2. Motivating Examples

We begin with two examples that help illustrate the need for improper learning over a given set of atomic controllers. These examples concretely demonstrate the power of this approach to find (improper) control policies that go well beyond what the atomic set can accomplish, while retaining some of their desirable properties (such as interpretability and simplicity of implementation).

2.1. Ergodic Control of the Cartpole System

Consider the cartpole system, which has, over the years, become a benchmark for testing control strategies (Khalil, 2015). The system's dynamics, evolving in R^4, can be approximated via a Linear Quadratic Regulator around an (unstable) equilibrium state vector that we designate the origin (x = 0). The objective then reduces to finding a (potentially randomized) control policy u ≡ {u(t), t ≥ 0} that solves

inf_u J ≡ E_u Σ_{t=0}^{∞} [ x^T(t) Q x(t) + R u^2(t) ]   subject to   x(t+1) = A_open x(t) + b u(t)  at all times t ≥ 0.

Under standard assumptions of controllability and observability, this optimization has a stationary, linear solution u*(t) = −K*^T x(t) (Bertsekas, 2011). Moreover, setting A := A_open − b K*^T, it is well known that the dynamics x(t+1) = A x(t), t ≥ 0, are stable.

The usual design strategy for a given cartpole involves a combination of system identification, followed by linearization and computing the controller gain K. This would typically produce a controller with tolerable performance fairly quickly, but would also suffer from nonidealities of parameter estimation. To alleviate this problem, first consider a generic (ergodic) control policy that builds on this strategy by switching across a menu of controllers {K1, ..., KN} produced as above. That is, at any time t, this policy chooses Ki, i ∈ [N], w.p. pi, so that the control input at time t is u(t) = −K_i^T x(t) w.p. pi. Let A^(i) := A_open − b K_i^T. The resulting controlled dynamics are given by x(t+1) = A^(r(t)) x(t), t ≥ 0, where r(t) = i w.p. pi, IID across t. This is an example of an ergodic parameter linear system (EPLS) (Bolzern et al., 2008), which is said to be Exponentially Almost Surely Stable (EAS) if the state norm decays at least exponentially fast with time: P( limsup_{t→∞} (1/t) log ||x(t)|| ≤ −ρ ) = 1 for some ρ > 0. Let the random variable λ(ω) := limsup_{t→∞} (1/t) log ||x(t, ω)||. For our dynamics x(t+1) = A^(r(t)) x(t), t ≥ 0, the Lyapunov exponent (1/t) log ||x(t)|| is asymptotically at most Σ_{i=1}^{N} p_i log ||A^(i)|| a.s. (see appendix for details). A good mixture controller can now be designed by choosing {p1, ..., pN} such that λ(ω) ≤ −ρ for some ρ > 0, ensuring exponentially almost sure stability (subject to log ||A^(i)|| < 0 for some i). As we show in the sequel, our policy gradient algorithm (Softmax PG) learns an improper mixture {p1, ..., pN} that (i) can stabilize the system even when a majority of the constituent atomic controllers {K1, ..., KN} are unstable, i.e., converges to a mixture that ensures that the average exponent λ(ω) < 0, and (ii) shows better performance than each of the atomic controllers.
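To make the stability criterion concrete, the sketch below (an illustration only; the closed-loop matrices A1, A2 and the mixture weights p are made-up placeholders, not the cartpole's) simulates the switched dynamics x(t+1) = A^(r(t)) x(t) and compares the empirical Lyapunov exponent against the bound Σ_i p_i log ||A^(i)||; a negative value certifies exponential almost-sure stability.

```python
import numpy as np

def lyapunov_exponent(A_list, p, T=20000, x0=None, rng=None):
    """Empirically estimate limsup (1/t) log ||x(t)|| for x(t+1) = A^(r(t)) x(t),
    where r(t) ~ p i.i.d. over the given closed-loop matrices."""
    rng = np.random.default_rng(rng)
    n = A_list[0].shape[0]
    x = np.ones(n) if x0 is None else np.asarray(x0, dtype=float)
    log_norm = 0.0
    for _ in range(T):
        A = A_list[rng.choice(len(A_list), p=p)]
        x = A @ x
        # renormalise to avoid overflow/underflow while accumulating log ||x(t)||
        log_norm += np.log(np.linalg.norm(x))
        x /= np.linalg.norm(x)
    return log_norm / T

# Two illustrative closed-loop matrices: A1 is unstable, A2 is stable.
A1 = np.array([[1.05, 0.1], [0.0, 0.9]])
A2 = np.array([[0.7, 0.0], [0.1, 0.8]])
p = [0.3, 0.7]                        # candidate mixture weights

bound = sum(pi * np.log(np.linalg.norm(A, 2)) for pi, A in zip(p, [A1, A2]))
est = lyapunov_exponent([A1, A2], p)
print(f"upper bound sum_i p_i log||A^(i)||: {bound:.3f}")
print(f"empirical Lyapunov exponent      : {est:.3f}")   # < 0 => EAS stability
```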
2.2. Scheduling in Constrained Queueing Networks

[Figure 1: K1 and K2 by themselves can only stabilize C1 ∪ C2 (gray rectangles). With improper learning, we enlarge the set of stabilizable arrival rates by the triangle ABC (shown in purple), i.e., the extra capacity achieved through improper learning, bounded by the line λ1 + λ2 = 1.]

We consider a system that comprises two queues fed by independent, stochastic arrival processes A_i(t), i ∈ {1, 2}, t ∈ N. The length of queue i, measured at the beginning of time slot t, is denoted by Q_i(t) ∈ Z_+. A common server serves both queues and can drain at most one packet from the system in a time slot. The server, therefore, needs to decide which of the two queues it intends to serve in a given slot (we assume that once the server chooses to serve a packet, service succeeds with probability 1). The server's decision is denoted by the vector D(t) ∈ A := {[0,0], [1,0], [0,1]}, where a 1 denotes service and a 0 denotes lack thereof. Let E[A_i(t)] = λ_i, and note that the arrival rate λ = [λ1, λ2] is unknown to the learner. We aim to find a (potentially randomized) policy π to minimize the discounted system backlog, given by J_π(Q(0)) := E^π_{Q(0)} Σ_{t=0}^{∞} γ^t (Q_1(t) + Q_2(t)). Any policy with J_π(·) < ∞ is said to be stabilizing (or, equivalently, a stable policy). It is well known that there exist stabilizing policies iff λ1 + λ2 < 1 (Tassiulas & Ephremides, 1992). A policy π_{µ1,µ2} that chooses Queue i w.p. µ_i in every slot can provably stabilize the system iff µ_i > λ_i, i ∈ {1, 2}. Now, assume our control set consists of two stationary policies K1, K2 with K1 ≡ π_{ε,1−ε}, K2 ≡ π_{1−ε,ε} and sufficiently small ε > 0. That is, we have M = 2 controllers K1, K2. Clearly, neither of these can, by itself, stabilize a network with λ = [0.49, 0.49]. However, an improper mixture of the two that selects K1 and K2 each with probability 1/2 can. In fact, as Fig. 1 shows, our improper learning algorithm can stabilize all arrival rates in C1 ∪ C2 ∪ ABC, without prior knowledge of [λ1, λ2]. In other words, our algorithm enlarges the stability region by the triangle ABC, over and above C1 ∪ C2.

We will return to these examples in Sec. 6 and show, using experiments, (1) how our improper learner converges to the stabilizing mixture of the available policies and (2) how, if the optimal policy is among the available controllers, our algorithm can find and converge to it.

3. Problem Statement and Notation

A (finite) Markov Decision Process (S, A, P, r, ρ, γ) is specified by a finite state space S, a finite action space A, a transition probability matrix P, where P(s'|s, a) is the probability of transitioning into state s' upon taking action a ∈ A in state s, a single-stage reward function r : S × A → R, a starting state distribution ρ over S, and a discount factor γ ∈ (0, 1). A (stationary) policy or controller π : S → P(A) specifies a decision-making strategy in which the learner chooses actions (a_t) adaptively based on the current state (s_t), i.e., a_t ∼ π(·|s_t). π and ρ, together with P, induce a probability measure P^π_ρ on the space of all sample paths of the underlying Markov process, and we denote by E^π_ρ the associated expectation operator. The value function of policy π (also called the value of policy π), denoted by V^π, is the total expected discounted reward obtained by following π, i.e., V^π(ρ) := E^π_ρ Σ_{t=0}^{∞} γ^t r(s_t, a_t).

Improper Learning. We assume that the learner is provided with a finite number of (stationary) controllers C := {K1, ..., KM} and, as described below, sets up a parameterized improper policy class I_soft(C) that depends on C. The aim, therefore, is to identify the best policy for the given MDP within this class, i.e.,

π* = argmax_{π ∈ I_soft(C)} V^π(ρ).    (1)
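For intuition, here is a small self-contained sketch of what the objective in (1) looks like when the MDP is known (the toy transition tensor, rewards, and controllers below are invented for illustration and are not from the paper): it evaluates mixtures of two base controllers exactly by solving the Bellman linear system, something the learner cannot do since P is unknown, which is what the algorithms below address. Depending on the MDP, the maximizer of the sweep may be at a corner of the simplex or strictly inside it.

```python
import numpy as np

def mixture_value(P, r, controllers, p, gamma, rho):
    """Exact value V^pi(rho) of the improper mixture that picks controller K_m with
    probability p[m] in every round, so that pi(a|s) = sum_m p[m] K_m(s, a).
    P: (S, A, S) transition tensor, r: (S, A) rewards, controllers: list of (S, A) stochastic matrices."""
    S, A = r.shape
    pi = sum(pm * K for pm, K in zip(p, controllers))      # induced action probabilities, shape (S, A)
    P_pi = np.einsum('sa,sat->st', pi, P)                   # state transition matrix under pi
    r_pi = (pi * r).sum(axis=1)                             # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)     # solve (I - gamma P_pi) V = r_pi
    return rho @ V

# toy 2-state, 2-action MDP and two deterministic base controllers (illustrative numbers)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K1 = np.array([[1.0, 0.0], [1.0, 0.0]])    # always plays action 0
K2 = np.array([[0.0, 1.0], [0.0, 1.0]])    # always plays action 1
rho = np.array([0.5, 0.5])

for q in np.linspace(0, 1, 5):              # sweep mixtures p = (q, 1 - q)
    print(q, mixture_value(P, r, [K1, K2], [q, 1 - q], gamma=0.9, rho=rho))
```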
We now describe the construction of the class I_soft(C).

The Softmax Policy Class. We assign a weight θ_m ∈ R to each controller K_m ∈ C and define θ := [θ_1, ..., θ_M]. The improper class I_soft is parameterized by θ as follows. In each round, the policy π_θ ∈ I_soft(C) chooses a controller drawn from softmax(θ), i.e., the probability of choosing controller K_m is given by

π_θ(m) := e^{θ_m} / Σ_{m'=1}^{M} e^{θ_{m'}}.

Note, therefore, that in every round, our algorithm interacts with the MDP only through the controller sampled in that round. In the rest of the paper, we will deal exclusively with a fixed and given C and the resultant I_soft. Therefore, we overload the notation π_{θ_t}(a|s), for any a ∈ A and s ∈ S, to denote the probability with which the algorithm chooses action a in state s at time t. For ease of notation, whenever the context is clear, we will also drop the subscript θ, i.e., π_{θ_t} ≡ π_t. Hence, we have, at any time t ≥ 0, π_{θ_t}(a|s) = Σ_{m=1}^{M} π_{θ_t}(m) K_m(s, a). Since we deal with gradient-based methods in the sequel, we define the value gradient of policy π_θ ∈ I_soft by ∇_θ V^{π_θ} ≡ dV^{π_{θ_t}} / dθ_t. We say that V^{π_θ} is β-smooth if ∇_θ V^{π_θ} is β-Lipschitz (Agarwal et al., 2020a). Finally, for any two integers a and b, let I_{ab} denote the indicator that a = b.

Algorithm 1: Softmax PG
  Input: learning rate η > 0, initial state distribution µ.
  Initialize: θ^1_m = 1 for all m ∈ [M]; s_1 ∼ µ.
  for t = 1 to T:
    Choose controller m_t ∼ π_t
    Play action a_t ∼ K_{m_t}(s_t, ·)
    Observe s_{t+1} ∼ P(·|s_t, a_t)
    Update: θ_{t+1} = θ_t + η ∇_{θ_t} V^{π_{θ_t}}
  end for

Comparison to the standard PG setting. The problem we define is different from the usual policy gradient setting, where the parameterization completely defines the policy in terms of the state-action mapping. One can use the methodology followed in (Mei et al., 2020) by assigning a parameter θ_{s,m} for every s ∈ S, m ∈ [M]. With some calculation, it can be shown that this is equivalent to the tabular setting with S states and M actions, with the new reward defined by r(s, m) := Σ_{a∈A} K_m(s, a) r(s, a), where r(s, a) is the usual expected reward obtained at state s when playing action a ∈ A. By following the approach in (Mei et al., 2020) on this modified setting, it can be shown that the policy converges, for each s ∈ S, as π_θ(m*(s)|s) → 1, which is the optimal policy. However, the problem that we address is to select a single mixture from within I_soft, the convex hull of the given M controllers, which would guarantee maximum return if one plays that single mixture for all time.

4. Improper Learning using Gradients

In this and the following sections, we propose and analyze a policy gradient-based algorithm that provably finds the best, potentially improper, mixture of controllers for the given MDP. While we employ gradient ascent to optimize the mixture weights, the fact that this procedure works at all is far from obvious. We begin by noting that V^{π_θ}, as described in Section 3, is nonconcave in θ for both direct and softmax parameterizations, which renders standard tools of convex optimization inapplicable.

Lemma 4.1 (Non-concavity of the value function). There is an MDP and a set of controllers for which the maximization of the value function (i.e., (1)) is non-concave for both the softmax and direct parameterizations, i.e., θ ↦ V^{π_θ} is non-concave.

The proof follows from a counterexample whose construction we show in the appendix. Our PG algorithm, Softmax PG, is shown in Algorithm 1. The parameters θ ∈ R^M which define the policy are updated by following the gradient of the value function at the current policy parameters.
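To make Algorithm 1 concrete in the gradient-oracle setting, here is a minimal runnable sketch (the two-state MDP, base controllers, step size, and horizon are invented for illustration; the exact gradient is computed in closed form via the policy gradient theorem, standing in for the oracle assumed by the analysis):

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def value_and_grad(theta, P, r, controllers, gamma, mu):
    """Exact V^{pi_theta}(mu) and its gradient w.r.t. theta for the softmax-over-controllers
    class (gradient-oracle setting of Algorithm 1); small tabular MDPs only."""
    S, A = r.shape
    w = softmax(theta)                                         # pi_theta(m)
    pi = sum(wm * K for wm, K in zip(w, controllers))          # pi_theta(a|s), shape (S, A)
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)        # V^{pi_theta}(s)
    Q = r + gamma * np.einsum('sat,t->sa', P, V)               # Q^{pi_theta}(s, a)
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)   # discounted visitation
    # d pi(a|s) / d theta_m = w[m] * (K_m(s,a) - pi(a|s)); the policy gradient theorem then gives:
    grad = np.array([
        (1.0 / (1 - gamma)) * np.sum(d[:, None] * w[m] * (controllers[m] - pi) * Q)
        for m in range(len(controllers))
    ])
    return mu @ V, grad

# illustrative toy problem: 2 states, 2 actions, M = 2 deterministic base controllers
P = np.array([[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 1.0]])
controllers = [np.array([[1., 0.], [1., 0.]]), np.array([[0., 1.], [0., 1.]])]
mu = np.array([0.5, 0.5])

theta, eta = np.ones(2), 0.5                                   # Algorithm 1 with exact gradients
for t in range(200):
    V, g = value_and_grad(theta, P, r, controllers, gamma=0.9, mu=mu)
    theta = theta + eta * g
print("learnt mixture pi_theta:", softmax(theta), "value:", V)
```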
Convergence Guarantees. The following result shows that with Softmax PG, the value function converges to that of the best in-class policy at a rate O(1/t). Furthermore, the theorem shows an explicit dependence on the number of controllers M, in place of the usual |S|. Note that with perfect gradient knowledge the algorithm becomes deterministic; this is a standard assumption in the analysis of PG algorithms (Fazel et al., 2018; Agarwal et al., 2020a; Mei et al., 2020).

Theorem 4.2 (Convergence of Policy Gradient). With {θ_t}_{t≥1} generated as in Algorithm 1 and using a learning rate η = (1−γ)^2 / (7γ^2 + 4γ + 5), for all t ≥ 1,

V*(ρ) − V^{π_{θ_t}}(ρ) ≤ O( M / (c_t^2 (1−γ)^3 t) ),   where   c_t := min_{1≤s≤t} min_{m: π*(m)>0} π_{θ_s}(m).

Remark 4.3. The quantity c_t in the statement is the minimum probability that Softmax PG puts on the controllers to which the best mixture π* assigns positive probability mass. Empirical evidence (Sec. 6) makes us conjecture that lim_{t→∞} c_t is positive, which would yield a convergence rate of O(1/t).

Remark 4.4. The proof of the above theorem uses the β-smoothness of the value function under the softmax parameterization, along with a new non-uniform Łojasiewicz-type inequality (NUŁI) for our probabilistic mixture class, which lower-bounds the magnitude of the gradient of the value function; we state it below.

Lemma 4.5 (NUŁI).

||∇_θ V^{π_θ}(µ)||_2 ≥ ( min_{m: π*(m)>0} π_θ(m) ) · || d^{π*}_ρ / d^{π_θ}_µ ||_∞^{−1} · [ V*(ρ) − V^{π_θ}(ρ) ].

The proof of Theorem 4.2 then follows by an induction argument over t ≥ 1.

Technical Challenges. We note here that while the basic recipe for the analysis of Theorem 4.2 is similar to (Mei et al., 2020), our setting does not directly inherit the intuition of standard PG (sPG) analysis. (1) With |S × A| < ∞, the sPG analysis critically depends on the fact that a deterministic optimal policy exists, and shows convergence to it. In contrast, in our setting, π* could be a strictly randomized mixture of the base controllers (see Sec. 2). (2) A crucial step in sPG analysis is establishing that the value function V^π(s), s ∈ S, increases monotonically with time, along with the parameter θ_{s,a*(s)} of the optimal action. In the appendix, we supply a simple counterexample showing that monotonicity of the V-function is not guaranteed in our setting for every s ∈ S. (3) The value function gradient in sPG has no cross-contamination from other states, in the sense that modifying the parameter at one state does not affect the values of the others. This plays a crucial part in simplifying the proof of global convergence to the optimal policy in sPG analysis. Our setting cannot leverage this property, since the value function gradient at a given controller possesses contributions from all states.

For the special case of S = 1, which is the multi-armed bandit, each controller is a probability distribution over the A arms of the bandit. We call this special case Bandit-over-Bandits. We obtain a convergence rate of O(M^2/t) to the optimum and recover an O(M^2 log T) regret bound when our Softmax PG algorithm is applied to this special case. We refer to the appendix for details.

Discussion on c_t. Convergence in Theorem 4.2 depends inversely on c_t^2. It follows that in order for Softmax PG to converge, c_t must either (a) converge to a positive constant, or (b) decay (to 0) slower than O(1/√t). The technical challenges discussed above render proving this analytically extremely hard. Hence, while we currently do not show this theoretically, our experiments in Sec. 6 repeatedly confirm that its empirical analog, c̄_t (defined formally in Sec. 6), approaches a positive value. Hence, we conjecture that the rate of convergence in Thm 4.2 is O(1/t).
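For intuition, the convergence argument can be summarized schematically as follows (a sketch with all constants suppressed; β denotes the smoothness constant and C a generic placeholder absorbing the distribution-mismatch factor, so this is not the paper's exact chain of inequalities):

```latex
% Schematic of the induction behind Theorem 4.2 (constants suppressed; this is an
% illustration, not the paper's exact chain of inequalities).
\begin{align*}
\text{Ascent lemma } (\eta \le 1/\beta):\quad
  & V^{\pi_{\theta_{t+1}}}(\mu) \ \ge\ V^{\pi_{\theta_t}}(\mu)
    + \tfrac{\eta}{2}\,\bigl\|\nabla_\theta V^{\pi_{\theta_t}}(\mu)\bigr\|_2^2, \\
\text{Lemma 4.5 (NU\L{}I)}:\quad
  & \bigl\|\nabla_\theta V^{\pi_{\theta_t}}(\mu)\bigr\|_2 \ \ge\ C\, c_t\,\delta_t,
    \qquad \delta_t := V^*(\rho) - V^{\pi_{\theta_t}}(\rho), \\
\text{combining (up to the $\rho$ vs.\ $\mu$ mismatch)}:\quad
  & \delta_{t+1} \ \le\ \delta_t - \tfrac{\eta C^2 c_t^2}{2}\,\delta_t^2, \\
\text{induction over } t:\quad
  & \delta_t \ =\ O\!\Bigl(\tfrac{1}{c_t^2\, t}\Bigr).
\end{align*}
```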
5. Actor-Critic based Improper Learning

Softmax PG follows a gradient ascent scheme to solve the optimization problem (1), but is limited by the requirement of the true gradient in every round. To address situations where this might be unavailable, we resort to a Monte Carlo sampling based procedure (see appendix, Alg. 6), which may lead to high variance. In this section, we take an alternative approach and provide a new algorithm based on an actor-critic framework for solving our problem. Actor-critic methods are well known to have lower variance than their Monte Carlo counterparts (Konda & Tsitsiklis, 2000).

We begin by proposing modifications to the standard Q-function and advantage-function definitions. Recall that we wish to solve the following optimization problem: max_{π∈I_soft} E_{s∼ρ}[V^π(s)], where π is some distribution over the M base controllers. Let Q^π(s, m) := Σ_{a∈A} K_m(s, a) Q^π(s, a), and let A^π(s, m) := Σ_{a∈A} K_m(s, a) A^π(s, a) = Σ_{a∈A} K_m(s, a) Q^π(s, a) − V^π(s), where Q^π and A^π are the usual action-value and advantage functions, respectively. We also define the new reward function r(s, m) := Σ_{a∈A} K_m(s, a) r(s, a) and a new transition kernel P(s'|s, m) := Σ_{a∈A} K_m(s, a) P(s'|s, a). Then, following the distribution π over the controllers induces a Markov chain on the state space S. Define ν_π(s, m) as the state-controller visitation measure induced by the policy π: ν_π(s, m) := (1−γ) Σ_{t≥0} γ^t P^π(s_t = s, m_t = m) = d^π_ρ(s) π(m). With these definitions, we have the following variant of the policy-gradient theorem.

Lemma 5.1 (Modified Policy Gradient Theorem).

∇_θ V^{π_θ}(ρ) = (1/(1−γ)) E_{(s,m)∼ν_{π_θ}}[ Q^{π_θ}(s, m) ψ_θ(m) ] = (1/(1−γ)) E_{(s,m)∼ν_{π_θ}}[ A^{π_θ}(s, m) ψ_θ(m) ],

where ψ_θ(m) := ∇_θ log π_θ(m). Note the independence of the score function ψ from the state s. For the gradient ascent update of the parameters θ we need to estimate A^{π_θ}(s, m), where (s, m) are drawn according to ν_{π_θ}(·, ·). We recall how to sample from ν_π. Following Konda & Tsitsiklis (2000) and recent works such as Xu et al. (2020) and Barakat et al. (2021), and casting into our setting, observe that ν_π is the stationary distribution of a Markov chain over the pair (s, m) with state-to-state transition kernel defined by P̂(s'|s, m) := γ P(s'|s, m) + (1−γ) ρ(s') and m ∼ π(·).

Algorithm Description. We present the algorithm in detail in Algorithm 2, along with a subroutine (Algorithm 3) which updates the critic's parameters. ACIL is a single-trajectory algorithm, in the sense that it does not require a forced reset along the run.

Algorithm 2: Actor-Critic based Improper RL (ACIL)
  Input: features φ, actor step size α, critic step size β, regularization parameter λ, mode (AC or NAC).
  Initialize: θ_0 = (1, 1, ..., 1)^T ∈ R^M, s_0 ∼ ρ, flag = 1{NAC} (selects AC or NAC).
  for t = 0 to T − 1:
    s_init = s_{t−1,B} (when t = 0, s_init = s_0)
    (w_t, s_{t,0}) ← Critic-TD(s_init, π_{θ_t}, φ, β, T_c, H)
    F_t(θ_t) ← 0
    for i = 0 to B − 1:
      m_{t,i} ∼ π_{θ_t};  a_{t,i} ∼ K_{m_{t,i}}(s_{t,i}, ·);  s_{t,i+1} ∼ P(·|s_{t,i}, m_{t,i})
      E_{w_t}(s_{t,i}, m_{t,i}, s_{t,i+1}) = r(s_{t,i}, m_{t,i}) + (γφ(s_{t,i+1}) − φ(s_{t,i}))^T w_t
      F_t(θ_t) ← F_t(θ_t) + (1/B) ψ_{θ_t}(m_{t,i}) ψ_{θ_t}(m_{t,i})^T
    if flag (NAC):
      G_t := F_t(θ_t) + λI
      θ_{t+1} = θ_t + (α/B) G_t^{−1} Σ_{i=0}^{B−1} E_{w_t}(s_{t,i}, m_{t,i}, s_{t,i+1}) ψ_{θ_t}(m_{t,i})
    else (AC):
      θ_{t+1} = θ_t + (α/B) Σ_{i=0}^{B−1} E_{w_t}(s_{t,i}, m_{t,i}, s_{t,i+1}) ψ_{θ_t}(m_{t,i})
    π_{θ_{t+1}} = softmax(θ_{t+1})
  Output: θ_T̂, with T̂ chosen uniformly at random from {1, ..., T}.

We begin with the critic's updates. The critic uses linear function approximation V_w(s) := φ(s)^T w and TD learning to update its parameters w ∈ R^d. We assume that φ(·) : S → R^d is a known feature mapping. Let Φ be the corresponding |S| × d matrix; we assume that the columns of Φ are linearly independent. Next, based on the critic's parameters, the actor approximates the A^π(s, m) function using the TD error E_w(s, m, s') = r(s, m) + (γφ(s') − φ(s))^T w.
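As an illustration of the actor's side of Algorithm 2, the sketch below implements one (N)AC update from a mini-batch of (s, m, s') transitions. This is a simplified reading of the pseudocode, not the paper's implementation: the feature map `phi`, the controller-level reward `r_tilde`, and the batch are assumed to be supplied, λ > 0 is assumed, and the minibatch-averaged form of the update is one reasonable interpretation of the flattened pseudocode above.

```python
import numpy as np

def acil_actor_step(theta, w, batch, phi, r_tilde, gamma, alpha, lam, natural=True):
    """One actor update of ACIL (Alg. 2), sketched under simplifying assumptions.
    `batch` is a list of (s, m, s_next) tuples collected by playing controllers m ~ softmax(theta);
    `phi(s)` are the critic features, `w` the critic weights, `r_tilde(s, m)` the controller-level reward."""
    M = theta.shape[0]
    pi = np.exp(theta - theta.max()); pi /= pi.sum()
    grad_est = np.zeros(M)
    fisher = lam * np.eye(M)                       # regularised Fisher estimate F_t + lambda * I
    for (s, m, s_next) in batch:
        # TD error E_w(s, m, s') used as an advantage estimate for the pair (s, m)
        td = r_tilde(s, m) + (gamma * phi(s_next) - phi(s)) @ w
        score = -pi.copy(); score[m] += 1.0        # psi_theta(m) = e_m - pi_theta (softmax score)
        grad_est += td * score / len(batch)
        fisher += np.outer(score, score) / len(batch)
    if natural:                                    # NAC branch: precondition by the Fisher matrix
        step = np.linalg.solve(fisher, grad_est)
    else:                                          # plain AC branch
        step = grad_est
    return theta + alpha * step
```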
Algorithm 3: Critic-TD subroutine
  Input: s_init, π, φ, β, T_c, H.
  Initialize: w_0.
  for k = 0 to T_c − 1:
    s_{k,0} = s_{k−1,H} (when k = 0, s_{k,0} = s_init)
    for j = 0 to H − 1:
      m_{k,j} ∼ π(·);  a_{k,j} ∼ K_{m_{k,j}}(s_{k,j}, ·);  s_{k,j+1} ∼ P(·|s_{k,j}, m_{k,j})
      E_{w_k}(s_{k,j}, m_{k,j}, s_{k,j+1}) = r(s_{k,j}, m_{k,j}) + (γφ(s_{k,j+1}) − φ(s_{k,j}))^T w_k
    w_{k+1} = w_k + (β/H) Σ_{j=0}^{H−1} E_{w_k}(s_{k,j}, m_{k,j}, s_{k,j+1}) φ(s_{k,j})
  Output: w_{T_c}, s_{T_c−1,H}.

In order to provide guarantees on the convergence rate of Algorithm ACIL, we make the following assumptions, which are standard in the RL literature (Konda & Tsitsiklis, 2000; Bhandari et al., 2018; Xu et al., 2020).

Assumption 5.2 (Uniform Ergodicity). For any θ ∈ R^M, consider the Markov chain induced by the policy π_θ and the transition kernel P(·|s, m), and let ξ_{π_θ} be its stationary distribution. We assume that there exist constants κ > 0 and ξ ∈ (0, 1) such that

sup_{s∈S} || P(s_t ∈ · | s_0 = s, π_θ) − ξ_{π_θ}(·) ||_TV ≤ κ ξ^t.

Further, let L_π := E_{ν_π}[ φ(s)(γφ(s') − φ(s))^T ] and v_π := E_{ν_π}[ r(s, m, s') φ(s) ]. The optimal solution of the critic's TD learning is then w*_π := −L_π^{−1} v_π.

Assumption 5.3. There exists a positive constant Γ_L such that for all w ∈ R^d, we have ⟨w − w*_π, L_π (w − w*_π)⟩ ≤ −Γ_L ||w − w*_π||^2_2.

Based on the above two assumptions, let L_V denote the resulting smoothness constant of the map θ ↦ V^{π_θ}, and let C_{κ,ξ} := 1 + log_ξ(1/κ) + 1/(1−ξ).

Theorem 5.4. Consider the actor-critic improper learning algorithm ACIL (Alg. 2). Assume sup_{s∈S} ||φ(s)||_2 ≤ 1. Under Assumptions 5.2 and 5.3, with step sizes chosen as α = 1/(4 L_V) and β = min{O(Γ_L), O(1/Γ_L)}, batch sizes H = O(1/ε) and B = O(1/ε), T_c = O((M/Γ_L) log(1/ε)) and T = O(M/((1−γ)^2 ε)), we have E[ ||∇_θ V(θ_T̂)||^2_2 ] ≤ ε + O(ε_critic). Hence, the total sample complexity is O( M (1−γ)^{−2} ε^{−2} log(1/ε) ). Here, ε_critic := max_{θ∈R^M} E_{ν_{π_θ}}[ (V^{π_θ}(s) − V_{w*_{π_θ}}(s))^2 ], which equals zero if the value function lies in the linear span of the features.

Next, we provide the global optimality guarantee for the natural actor-critic version of ACIL.

Theorem 5.5. Assume sup_{s∈S} ||φ(s)||_2 ≤ 1. Under Assumptions 5.2 and 5.3, with step sizes chosen as α = λ^2 and β = min{O(Γ_L), O(1/Γ_L)}, batch sizes H = O(1/(Γ_L ε^2)) and B = O(1/((1−γ)^2 ε^2)), T_c = O((M/Γ_L) log(1/ε)), T = O(M/((1−γ)^2 ε)) and λ = O(√ε_critic), we have

V(π*) − (1/T) Σ_{t=0}^{T−1} E[ V(π_{θ_t}) ] ≤ ε + O( √ε_actor / (1−γ)^3 ) + O( √ε_critic ).

Hence, the total sample complexity is O( (M / ((1−γ)^4 ε^3)) log(1/ε) ), where ε_actor := max_{θ∈R^M} min_w E_{ν_{π_θ}}[ (ψ_θ(m)^T w − A^{π_θ}(s, m))^2 ] and ε_critic is the same as before.

6. Numerical Results

6.1. Simulations with Softmax PG

We now discuss the results of implementing Softmax PG (Alg. 1) on the cartpole system and on the constrained queueing examples described in Sec. 2. Since neither value functions nor value gradients for these problems are available in closed form, we modify Softmax PG (Algorithm 1) to make it generally implementable using a combination of (1) rollouts to estimate the value function of the current (improper) policy and (2) simultaneous perturbation stochastic approximation (SPSA) to estimate its value gradient. Specifically, we use the approach in (Flaxman et al., 2005), noting that for a function V : R^M → R, the gradient satisfies

∇V(θ) ≈ (M/α) E[ (V(θ + α·u) − V(θ)) u ],

where the perturbation parameter α ∈ (0, 1) and u is sampled uniformly at random from the unit sphere. This expression requires evaluation of the value function at the point (θ + α·u). Since the value function may not be explicitly computable, we employ rollouts for its evaluation. The full algorithm, Grad Est, can be found in the appendix (Alg. 6).
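A minimal sketch of this rollout-plus-SPSA estimator is given below (illustrative only: `sample_episode` is an assumed environment-specific routine returning one discounted return, and the number of rollouts and directions are arbitrary choices rather than those of Grad Est):

```python
import numpy as np

def rollout_value(theta, sample_episode, n_rollouts=20):
    """Monte-Carlo estimate of V(theta): average discounted return of the mixture policy
    softmax(theta) over a few rollouts. `sample_episode(theta)` is an environment-specific
    routine (assumed given) that plays controllers m ~ softmax(theta) and returns one return."""
    return np.mean([sample_episode(theta) for _ in range(n_rollouts)])

def spsa_gradient(theta, sample_episode, alpha=0.1, n_dirs=8, rng=None):
    """Random-directions estimate of grad V(theta), a la Flaxman et al. (2005):
    grad V(theta) ~ (M / alpha) * E[(V(theta + alpha * u) - V(theta)) u], u uniform on the sphere."""
    rng = np.random.default_rng(rng)
    M = theta.shape[0]
    v0 = rollout_value(theta, sample_episode)
    g = np.zeros(M)
    for _ in range(n_dirs):
        u = rng.normal(size=M)
        u /= np.linalg.norm(u)                    # uniform direction on the unit sphere
        v1 = rollout_value(theta + alpha * u, sample_episode)
        g += (M / alpha) * (v1 - v0) * u / n_dirs
    return g

# Usage inside the Softmax PG loop (replacing the exact-gradient update of Algorithm 1):
#   g_hat = spsa_gradient(theta, sample_episode)
#   theta = theta + eta * g_hat
```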
Note that all of the simulations shown have been averaged over #L = 20 trials, with the means and standard deviations plotted. We also show empirically that c_t in Theorem 4.2 is indeed strictly positive. In the sequel, for every trial l ∈ [#L], let

c̄ˡ_t := inf_{1≤s≤t} min_{m ∈ {m'∈[M]: π*(m')>0}} π_{θ_s}(m),   c̄_t := (1/#L) Σ_{l=1}^{#L} c̄ˡ_t,   and   c_T := min_{l∈[#L]} min_{1≤t≤T} c̄ˡ_t.

That is, the sequences {c̄ˡ_t} record the minimum probability that the algorithm puts, over rounds 1 : t in trial l, on controllers with π*(·) > 0; {c̄_t}_{t=1}^{T} is their average across the different trials; and c_T is the minimum such probability that the algorithm learns across all rounds 1 ≤ t ≤ T and across trials.

[Figure 2: Softmax PG applied to the cartpole control and path-graph scheduling tasks. Panels: (a) cartpole with {K1 = Kopt, K2 = Kopt + Δ}; (b) cartpole with {K1 = Kopt − Δ, K2 = Kopt + Δ}; (c) Softmax PG applied to a path graph network; (d) a 2-queue system with time-varying arrival rates. Each plot shows the learnt probabilities of the various base controllers over time, and the minimum probabilities c̄_t and c_T as described in the text.]

[Figure 3: Natural actor-critic based improper learning applied to various queueing networks, showing convergence to the best mixture policy. Panels: (a) arrival rate (λ1, λ2) = (0.4, 0.4); (b) arrival rate (λ1, λ2) = (0.35, 0.35); (c) (estimated) 2-queue system with time-varying arrival rates.]

Simulations for the Cartpole. We study two different settings for the cartpole example. Let Kopt be the optimal controller for the given system, computed via standard procedures (details can be found in (Bertsekas, 2011)). We set M = 2 and consider two scenarios: (i) the two base controllers are C ≡ {Kopt, Kopt + Δ}, where Δ is a random matrix, each entry of which is drawn IID from N(0, 0.1); and (ii) C ≡ {Kopt − Δ, Kopt + Δ}. In the first case a corner point of the simplex is optimal; in the second case a strict improper mixture of the available controllers is optimal. As we can see in Figs. 2(a) and 2(b), our policy gradient algorithm converges to the best controller/mixture in both cases. The details of all the hyperparameters for this setting are provided in the appendix. We note that in the second setting, even though none of the controllers, applied individually, stabilizes the system, our Softmax PG algorithm finds and follows an improper mixture of the controllers which stabilizes the given cartpole.

Constrained Queueing Networks. We present simulation results for the following networks. (i) Path graph networks. The scheduling constraints in the first network we study dictate that Queues i and i+1 cannot be served simultaneously, for i ∈ [N−1], in any round t ≥ 0. Such queueing systems are called path graph networks (Mohan et al., 2020). We work with N = 4. Therefore, the sets of queues which can be served simultaneously are A = {∅, {1}, {2}, {3}, {4}, {1,3}, {2,4}, {1,4}}; the constituents of A are called independent sets in the literature. In each round t, the scheduler selects an independent set and serves the queues therein. Let Q_j(t) be the backlog of Queue j at time t. We use the following base controllers: (i) K1, the Max Weight (MW) controller (Tassiulas & Ephremides, 1992), chooses the set s_t := argmax_{s∈A} Σ_{j∈s} Q_j(t), i.e., the set with the largest total backlog; (ii) K2, the Maximum Egress Rate (MER) controller, chooses the set s_t := argmax_{s∈A} Σ_{j∈s} I{Q_j(t) > 0}, i.e., the set with the maximum number of non-empty queues. We also choose K3, K4 and K5, which serve the sets {1,3}, {2,4} and {1,4}, respectively, with probability 1.
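For concreteness, a possible encoding of the two rule-based atomic controllers is sketched below (an illustrative implementation with 1-indexed queues; the paper does not specify its own code):

```python
import numpy as np

# Independent sets of the 4-queue path graph (the action set A from the text).
IND_SETS = [set(), {1}, {2}, {3}, {4}, {1, 3}, {2, 4}, {1, 4}]

def max_weight(Q):
    """K1: Max Weight -- serve the independent set with the largest total backlog."""
    return max(IND_SETS, key=lambda s: sum(Q[j - 1] for j in s))

def max_egress_rate(Q):
    """K2: Maximum Egress Rate -- serve the set with the most non-empty queues."""
    return max(IND_SETS, key=lambda s: sum(Q[j - 1] > 0 for j in s))

# e.g. with backlogs Q = (5, 0, 1, 2): MW picks {1, 4} (weight 7); MER returns the
# first maximiser in list order, here {1, 3} (tied with {1, 4}).
print(max_weight(np.array([5, 0, 1, 2])), max_egress_rate(np.array([5, 0, 1, 2])))
```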
We fix the arrival rates to the queues at (0.495, 0.495, 0.495, 0.495). It is well known that the MER rule is mean-delay optimal in this case (Mohan et al., 2020). In Fig. 2(c), we plot the probability of choosing K_i, i ∈ [5], learnt by our algorithm. The probability of choosing MER indeed converges to 1.

(ii) Non-stationary arrival rates. Recall the two-queue example discussed in Sec. 2.2. The scheduler there is now given two base/atomic controllers C := {K1, K2}, i.e., M = 2, where controller K_i serves Queue i with probability 1, i = 1, 2. As can be seen in Fig. 2(d), the arrival rates λ to the two queues vary over time (adversarially) during the learning. In particular, λ varies from (0.3, 0.6) → (0.6, 0.3) → (0.49, 0.49). Our PG algorithm successfully tracks this change and adapts to the optimal improper stationary policy in each case.

In all the simulations shown above, we note that the empirical trajectories of c̄_t and c_T become flat after some initial rounds and are bounded away from zero. This supports our conjecture that lim_{t→∞} c_t in Theorem 4.2 is bounded away from zero, rendering the theorem statement non-vacuous. Note that Alg. 1 performs well in challenging scenarios, even with estimates of the value function and its gradient.

6.2. Simulations with ACIL

We perform some queueing-theoretic simulations with the natural actor-critic version of ACIL, which we call NACIL in this section. Unlike Softmax PG, ACIL estimates gradients using temporal differences instead of SPSA. We study three different settings: (1) the optimal policy is a strict improper combination of the available controllers; (2) it is at a corner point, i.e., one of the available controllers is itself optimal; and (3) the arrival rates are time-varying, as in the previous section. Our simulations show that in all cases ACIL converges to the correct controller mixture. Recall the example discussed in Sec. 2.2. We consider the case of Bernoulli arrivals with rates λ = [λ1, λ2] and are given two base/atomic controllers {K1, K2}, where controller K_i serves Queue i with probability 1, i = 1, 2. As can be seen in Fig. 3(a), when λ = [0.4, 0.4] (equal arrival rates), NACIL converges to an improper mixture policy that serves each queue with probability [0.5, 0.5]. Next, Fig. 3(b) shows a situation where one of the base controllers, the Longest-Queue-First (LQF) policy, is the optimal controller; NACIL converges correctly to the corner point. Lastly, Fig. 3(c) shows a setting similar to (ii) of Sec. 6.1 above. Here there is a single transition of (λ1, λ2) from (0.4, 0.3) → (0.3, 0.4), which occurs at t = 10^5/3 and is unknown to the learner. We show the probability of choosing controller 1; NACIL tracks the changing arrival rates over time. We supply some more simulations with NACIL in the appendix due to space limitations.

Acknowledgment. This work was partially supported by the Israel Science Foundation under contract 2199/20.
Mohammadi Zaki was supported by the Aerospace Network Research Consortium (ANRC) Grant on Airplane IOT Data Management. Abbasi-Yadkori, Y. and Szepesv ari, C. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pp. 1 26, Budapest, Hungary, 09 11 Jun 2011. JMLR Workshop and Conference Proceedings. Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Optimality and approximation with policy gradient methods in markov decision processes. In Proceedings of Thirty Third Conference on Learning Theory, pp. 64 66. PMLR, 2020a. Agarwal, N., Brukhim, N., Hazan, E., and Lu, Z. Boosting for control of dynamical systems. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 96 103. PMLR, 13 18 Jul 2020b. Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pp. 322 331, 1995. Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, volume 21, pp. 89 96. Curran Associates, Inc., 2009. Banijamali, E., Abbasi-Yadkori, Y., Ghavamzadeh, M., and Vlassis, N. Optimizing over a restricted policy class in mdps. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 3042 3050. PMLR, 16 18 Apr 2019. Barakat, A., Bianchi, P., and Lehmann, J. Analysis of a target-based actor-critic algorithm with linear function approximation. Co RR, abs/2106.07472, 2021. Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. Successor features for transfer in reinforcement learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. Bertsekas, D. P. Dynamic programming and optimal control 3rd edition, volume ii. Belmont, MA: Athena Scientific, 2011. Bhandari, J. and Russo, D. Global optimality guarantees for policy gradient methods. Ar Xiv, abs/1906.01786, 2019. Actor-Critic based Improper Reinforcement Learning Bhandari, J., Russo, D., and Singal, R. A finite time analysis of temporal difference learning with linear function approximation. Oper. Res., 69:950 973, 2018. Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor critic algorithms. Automatica, 45(11): 2471 2482, 2009. ISSN 0005-1098. Bolzern, P., Colaneri, P., and De Nicolao, G. Almost sure stability of stochastic linear systems with ergodic parameters. European Journal of Control, 14(2):114 123, 2008. Borkar, V. S. Stochastic Approximation. Cambridge Books. Cambridge University Press, December 2008. Cassel, A., Cohen, A., and Koren, T. Logarithmic regret for learning linear quadratic regulators efficiently. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 1328 1337. PMLR, 13 18 Jul 2020. Chen, X. and Hazan, E. Black-box control for linear dynamical systems. ar Xiv preprint ar Xiv:2007.06650, 2020. Daniely, A., Linial, N., and Shalev-Shwartz, S. 
More data speeds up training time in learning halfspaces over sparse vectors. In Advances in Neural Information Processing Systems, volume 26, pp. 145 153. Curran Associates, Inc., 2013. Daniely, A., Linial, N., and Shalev-Shwartz, S. From average case complexity to improper learning complexity. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, STOC 14, pp. 441 448, New York, NY, USA, 2014. Association for Computing Machinery. Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. On the Sample Complexity of the Linear Quadratic Regulator. ar Xiv e-prints, art. ar Xiv:1710.01688, October 2017. Denisov, D. and Walton, N. Regret analysis of a markov policy gradient algorithm for multi-arm bandits. Ar Xiv, abs/2007.10229, 2020. Durrett, R. Probability: Theory and examples, 2011. Fazel, M., Ge, R., Kakade, S. M., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator, 2018. Flaxman, A. D., Kalai, A. T., and Mc Mahan, H. B. Online convex optimization in the bandit setting: Gradient descent without a gradient. SODA 05, pp. 385 394, USA, 2005. Society for Industrial and Applied Mathematics. Gao, B. and Pavel, L. On the properties of the softmax function with application in game theory and reinforcement learning. Ar Xiv, abs/1704.00805, 2017. Gopalan, A. and Mannor, S. Thompson Sampling for Learning Parameterized Markov Decision Processes. In Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pp. 861 898, Paris, France, 03 06 Jul 2015. PMLR. Ibrahimi, M., Javanmard, A., and Roy, B. Efficient reinforcement learning for high dimensional linear quadratic systems. In Advances in Neural Information Processing Systems, volume 25, pp. 2636 2644. Curran Associates, Inc., 2012. Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In In Proc. 19th International Conference on Machine Learning, 2002. Khalil, H. K. Nonlinear Control. Pearson, 2015. Koc ak, T., Neu, G., Valko, M., and Munos, R. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, volume 27, pp. 613 621. Curran Associates, Inc., 2014. Konda, V. and Tsitsiklis, J. Actor-critic algorithms. In Solla, S., Leen, T., and M uller, K. (eds.), Advances in Neural Information Processing Systems, volume 12. MIT Press, 2000. Lai, T. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1): 4 22, 1985. ISSN 0196-8858. Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. Softmax policy gradient methods can take exponential time to converge. ar Xiv preprint ar Xiv:2102.11270, 2021. Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. Inform. Comput., 108(2):212 261, 1994. Łojasiewicz, S. Les equations aux d eriv ees partielles (paris, 1962), 1963. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. Multi-agent actor-critic for mixed cooperativecompetitive environments, 2020. Maclin, R. and Opitz, D. W. Popular ensemble methods: An empirical study. Co RR, abs/1106.0257, 2011. Mania, H., Tu, S., and Recht, B. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, volume 32, pp. 10154 10164. Curran Associates, Inc., 2019. Actor-Critic based Improper Reinforcement Learning Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. 
On the global convergence rates of softmax policy gradient methods. In Proceedings of the 37th International Conference on Machine Learning, pp. 6820 6829. PMLR, 2020. Mohan, A., Chattopadhyay, A., and Kumar, A. Hybrid mac protocols for low-delay scheduling. In 2016 IEEE 13th International Conference on Mobile Ad Hoc and Sensor Systems (MASS), pp. 47 55, Los Alamitos, CA, USA, oct 2016. IEEE Computer Society. Mohan, A., Gopalan, A., and Kumar, A. Throughput optimal decentralized scheduling with single-bit state feedback for a class of queueing systems. Ar Xiv, abs/2002.08141, 2020. Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, volume 28, pp. 3168 3176. Curran Associates, Inc., 2015. Osband, I., Russo, D., and Van Roy, B. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, volume 26, pp. 3003 3011. Curran Associates, Inc., 2013. Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. Learning unknown markov decision processes: A thompson sampling approach. In NIPS, 2017. Peters, J. and Schaal, S. Natural actor-critic. Neurocomputing, 71(7):1180 1190, 2008. ISSN 0925-2312. Progress in Modeling, Theory, and Application of Computational Intelligenc. Radac, M.-B. and Precup, R.-E. Data-driven model-free slip control of anti-lock braking systems using reinforcement q-learning. Neurocomput., 275(C):317 329, January 2018. Rummery, G. A. and Niranjan, M. On-line q-learning using connectionist systems. Technical report, 1994. Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1889 1897, Lille, France, 07 09 Jul 2015. PMLR. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms, 2017. Shani, L., Efroni, Y., and Mannor, S. Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps. Ar Xiv, abs/1909.02769, 2020. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of go with deep neural networks and tree search. Nature, 529:484 503, 2016. Singh, S., Okun, A., and Jackson, A. Artificial intelligence: Learning to play Go from scratch. 550(7676):336 337, October 2017. doi: 10.1038/550336a. Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. Sutton, R. S., Precup, D., and Singh, S. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1): 181 211, 1999. ISSN 0004-3702. Sutton, R. S., Mc Allester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, volume 12, pp. 1057 1063. MIT Press, 2000. Tassiulas, L. and Ephremides, A. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE Transactions on Automatic Control, 37(12):1936 1948, 1992. doi: 10.1109/9.182479. Wiering, M. A. 
and van Hasselt, H. Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(4):930 936, 2008. Xiliang, C., Cao, L., Li, C.-x., Xu, Z.-x., and Lai, J. Ensemble network architecture for deep reinforcement learning. Mathematical Problems in Engineering, 2018:1 6, 04 2018. Xu, T., Wang, Z., and Liang, Y. Improving sample complexity bounds for (natural) actor-critic algorithms. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 4358 4369. Curran Associates, Inc., 2020. Actor-Critic based Improper Reinforcement Learning A. Glossary of Symbols 1. S: State space 2. A : Action space 3. S : Cardinality of S 4. A : Cardinality of A 5. M : Number of controllers 6. Ki Controller i, i = 1, , M. For finite SA space MDP, Ki is a matrix of size S A, where each row is a probability distribution over the actions. 7. C : Given collection of M controllers. 8. Isoft(C) : Improper policy class setup by the learner. 9. θ RM : Parameter assigned to the controllers to controllers, representing weights, updated each round by the learner. 10. π(.) : Probability of choosing controllers 11. π(. s) Probability of choosing action given state s. Note that in our setting, given π(.) over controllers (see previous item) and the set of controllers, π(. s) is completely defined, i.e., π(a s) = M P m=1 π(m)Km(s, a). Hence we use simply π to denote the policy followed, whenever the context is clear. 12. r(s, a) : Immediate (one-step) reward obtained if action a is played in state s. 13. P(s s, a) Probability of transitioning to state s from state s having taken action a. 14. V π(ρ) := Es0 ρ [V π(s0)] = Eπ ρ P t=0 γtr(st, at) Value function starting with initial distribution ρ over states, and following policy π. 15. Qπ(s, a) := E r(s, a) + γ P s S P(s s, a)V π(s ) . 16. Qπ(s, m) := E P a A Km(s, a)r(s, a) + γ P s S P(s s, a)V π(s ) . 17. Aπ(s, a) := Qπ(s, a) V π(s) 18. A(s, m) := Qπ(s, m) V π(s). 19. dπ ν := Es0 ν t=0 P st = s so, π, P . Denotes a distribution over the states, is called the discounted state visitation measure 20. c : inft 1 min m {m [M]:π (m )>0} πθt(m). = maxs dπ µ (s) µ(s) . µ = maxs 1 µ(s). Actor-Critic based Improper Reinforcement Learning B. Expanded Survey of Related Work In this section, we provide a detailed survey of related works. It is vital to distinguish the approach investigated in the present paper from the plethora of existing algorithms based on proper learning . Essentially, these algorithms try to find an (approximately) optimal policy for the MDP under investigation. These approaches can broadly be classified in two groups: model-based and model-free. The former is based on first learning the dynamics of the unknown MDP followed by planning for this learnt model. Algorithms in this class include Thompson Sampling-based approaches (Osband et al., 2013; Ouyang et al., 2017; Gopalan & Mannor, 2015), Optimism-based approaches such as the UCRL algorithm (Auer et al., 2009), both achieving order-wise optimal O( T) regret bound. A particular class of MDPs which has been studied extensively is the Linear Quadratic Regulator (LQR) which is a continuous state-action MDP with linear state dynamics and quadratic cost (Dean et al., 2017). Let xt Rm be the current state and let ut Rn be the action applied at time t. 
The infinite horizon average cost minimization problem for LQR is to find a policy to choose actions {ut}t 1 so as to minimize t=1 x T t Qxt + u T t Rut such that xt+1 = Axt + But + n(t), n(t) is iid zero-mean noise. Here the matrices A and B are unknown to the learner. Earlier works like (Abbasi-Yadkori & Szepesv ari, 2011; Ibrahimi et al., 2012) proposed algorithms based on the well-known optimism principle (with confidence ellipsoids around estimates of A and B). These show regret bounds of O( However, these approaches do not focus on the stability of the closed-loop system. (Dean et al., 2017) describes a robust controller design which seeks to minimize the worst-case performance of the system given the error in the estimation process. They show a sample complexity analysis guaranteeing convergence rate of O(1/ N) to the optimal policy for the given LQR, N being the number of rollouts. More recently, certainity equivalence (Mania et al., 2019) was shown to achieve O( T) regret for LQRs. Further, (Cassel et al., 2020) show that it is possible to achieve O(log T) regret if either one of the matrices A or B are known to the learner, and also provided a lower bound showing that Ω( T) regret is unavoidable when both are unknown. The model-free approach on the other hand, bypasses model estimation and directly learns the value function of the unknown MDP. While the most popular among these have historically been Q-learning, TD-learning (Sutton & Barto, 2018) and SARSA (Rummery & Niranjan, 1994), algorithms based on gradient-based policy optimization have been gaining considerable attention of late, following their stunning success with playing the game of Go which has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. (Silver et al., 2016) and more recently (Singh et al., 2017) use policy gradient method combined with a neural network representation to beat human experts. Indeed, the Policy Gradient method has become the cornerstone of modern RL and given birth to an entire class of highly efficient policy search algorithms such as TRPO (Schulman et al., 2015), PPO(Schulman et al., 2017), and MADDPG (Lowe et al., 2020). Despite its excellent empirical performance, not much was known about theoretical guarantees for this approach until recently. There is now a growing body of promising results showing convergence rates for PG algorithms over finite state-action MDPs (Agarwal et al., 2020a; Shani et al., 2020; Bhandari & Russo, 2019; Mei et al., 2020), where the parameterization is over the entire space of state -action pairs, i.e., RS A. In particular, (Bhandari & Russo, 2019) show that projected gradient descent does not suffer from spurious local optima on the simplex, (Agarwal et al., 2020a) show that the with softmax parameterization PG converges to the global optima asymptotically. (Shani et al., 2020) show a O(1/ t) convergence rate for mirror descent. (Mei et al., 2020) show that with softmax policy gradient convergence to the global optima occurs at a rate O(1/t) and at O(e t) with entropy regularization. We end this section noting once again that all of the above works concern proper learning. Improper learning, on the other hand, has been separately studied in statistical learning theory in the IID setting (Daniely et al., 2014; 2013). 
In this framework, which is also called representation-independent learning, the learning algorithm is not restricted to output a hypothesis from the given hypothesis class. To the best of our knowledge, improper learning has not been studied in the RL literature; (Agarwal et al., 2020b) is the only existing work that attempts to frame and solve policy optimization over an improper class, via boosting a given class of controllers. However, that paper is situated in the rather different context of non-stochastic control and assumes perfect knowledge of (i) the memory-boundedness of the MDP, and (ii) the state noise vector in every round, which amounts to essentially knowing the MDP transition dynamics. We work in the stochastic MDP setting and, moreover, assume no access to the MDP's transition kernel. Further, (Agarwal et al., 2020b) also assumes that all the atomic controllers available to them are stabilizing, which, when working with an unknown MDP, is a very strong assumption to make. We make no such assumptions on our atomic controller class and, as we show in Sec. 2 and Sec. 6, our algorithms can even begin with provably unstable controllers and yet succeed in stabilizing the system.

In summary, the problem that we address concerns finding the best among a given class of controllers. None of these need be optimal for the MDP at hand. Moreover, our PG algorithm could very well converge to an improper mixture of these controllers, meaning that the output of our algorithms need not be any of the atomic controllers we are provided with. This setting, to the best of our knowledge, has not been investigated in the RL literature hitherto.

C. Details of Setup and Modelling of the Cartpole

Figure 4: The Cartpole system. The mass of the pendulum is denoted by $m_p$, that of the cart by $m_k$, the force used to drive the cart by $F$, and the distance of the center of mass of the cart from its starting position by $s$. $\theta$ denotes the angle the pendulum makes with the normal, and its length is denoted by $2l$. Gravity is denoted by $g$.

As shown in Fig. 4, the system comprises a pendulum whose pivot is mounted on a cart which can be moved in the horizontal direction by applying a force. The objective is to modulate the direction and magnitude of this force $F$ to keep the pendulum from keeling over under the influence of gravity. The state of the system at time $t$ is given by the 4-tuple $x(t) := [s, \dot{s}, \theta, \dot{\theta}]$, with $x(\cdot) = 0$ corresponding to the pendulum being upright and stationary. One strategy used to design control policies for this system is to first approximate the dynamics around $x(\cdot) = 0$ with a linear model and a quadratic cost, and design a linear controller for these approximate dynamics. After time discretization, the objective reduces to finding a (potentially randomized) control policy $u \equiv \{u(t), t \geq 0\}$ that solves
$$\inf_u\; J(x(0)) = \mathbb{E}_u \left[ \sum_{t=0}^{\infty} x^\top(t) Q x(t) + R\, u^2(t) \right], \quad \text{s.t. } x(t+1) = A_{\mathrm{open}}\, x(t) + b\, u(t),$$
where the entries of $A_{\mathrm{open}}$ and $b$ are obtained by linearizing and discretizing the cartpole dynamics around $x(\cdot) = 0$, and depend on $g$, $l$, $m_p$ and $m_k$. Under standard assumptions of controllability and observability, this optimization has a stationary, linear solution $u^*(t) = -K^* x(t)$ (details are available in (Bertsekas, 2011), Chap. 3). Moreover, setting $A := A_{\mathrm{open}} - b K^*$, it is well known that the dynamics $x(t+1) = A x(t)$, $t \geq 0$, are stable.
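As an illustration of this design route (not code from the paper), the following sketch computes an LQR gain for a generic discretized linear system via the discrete-time algebraic Riccati equation using SciPy; the matrices below are placeholders, not the actual cartpole linearization.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Placeholder discretized linear dynamics x(t+1) = A_open x(t) + b u(t).
A_open = np.array([[1.0, 0.02],
                   [0.3, 1.0]])
b = np.array([[0.0], [0.02]])
Q = np.eye(2)           # state cost
R = np.array([[1.0]])   # control cost

# Solve the DARE and form the stationary LQR gain K*.
P = solve_discrete_are(A_open, b, Q, R)
K_star = np.linalg.solve(R + b.T @ P @ b, b.T @ P @ A_open)

# Closed-loop matrix; stability corresponds to spectral radius < 1.
A_cl = A_open - b @ K_star
print(np.max(np.abs(np.linalg.eigvals(A_cl))))
```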
C.1. Details of simulation settings for the cartpole system

In this section we describe the adjustments made specifically for the cartpole experiments. First, we scale down the estimated gradient of the value function returned by the Grad Est subroutine (Algorithm 6) by a factor of 10 (in the cartpole simulation only), i.e., we use $\widehat{\nabla V^{\pi}(\mu)}/10$. Next, we provide the values of the constants described in Sec. C in Table 1.

Table 1: Values of the hyperparameters used for the cartpole simulation.
Gravity $g$: 9.8
Mass of pole $m_p$: 0.1
Length of pole $l$: 1
Mass of cart $m_k$: 1
Total mass $m_t$: 1.1

D. Stability for Ergodic Parameter Linear Systems (EPLS)

For simplicity and ease of understanding, we connect the current discussion to the cartpole example of Sec. 2.1. Consider a generic (ergodic) control policy that switches across a menu of controllers $\{K_1, \dots, K_N\}$: at any time $t$, it chooses controller $K_i$, $i \in [N]$, w.p. $p_i$, so that the control input at time $t$ is $u(t) = -K_i^\top x(t)$ w.p. $p_i$. Let $A(i) := A_{\mathrm{open}} - b K_i^\top$. The resulting controlled dynamics are given by
$$x(t+1) = A(r(t))\, x(t), \qquad x(0) = x_0, \tag{3}$$
where $r(t) = i$ w.p. $p_i$, IID across time. In the literature, this belongs to a class of systems known as Ergodic Parameter Linear Systems (EPLS) (Bolzern et al., 2008), which are said to be Exponentially Almost Surely Stable (EAS) if there exists $\rho > 0$ such that for any $x(0)$,
$$\mathbb{P}\left[\omega \in \Omega : \limsup_{t \to \infty} \tfrac{1}{t} \log \|x(t, \omega)\| \leq -\rho \right] = 1. \tag{4}$$
In other words, w.p. 1, the trajectories of the system decay to the origin exponentially fast. The random variable $\lambda(\omega) := \limsup_{t} \tfrac{1}{t} \log \|x(t, \omega)\|$ in (4) is called the Lyapunov exponent of the system. For our EPLS,
$$\lambda(\omega) = \limsup_{t} \tfrac{1}{t} \log \Big\| \prod_{s=1}^{t} A(r(s, \omega))\, x(0) \Big\| \leq \limsup_{t} \tfrac{1}{t} \log \|x(0)\| + \limsup_{t} \tfrac{1}{t} \sum_{s=1}^{t} \log \|A(r(s, \omega))\| \stackrel{(*)}{=} \lim_{t} \tfrac{1}{t} \sum_{s=1}^{t} \log \|A(r(s, \omega))\| \stackrel{(**)}{=} \mathbb{E} \log \|A(r)\| = \sum_{i=1}^{N} p_i \log \|A(i)\|, \tag{5}$$
where the equalities $(*)$ and $(**)$ are due to the ergodic law of large numbers. The control policy can now be designed by choosing $\{p_1, \dots, p_N\}$ such that $\lambda(\omega) \leq -\rho$ for some $\rho > 0$, ensuring exponentially almost sure stability.

E. The Constrained Queueing Example

The system, shown in Fig. 5, comprises two queues fed by independent, stochastic arrival processes $A_i(t)$, $i \in \{1, 2\}$, $t \in \mathbb{N}$. The length of Queue $i$, measured at the beginning of time slot $t$, is denoted by $Q_i(t) \in \mathbb{Z}_+$. A common server serves both queues and can drain at most one packet from the system in a time slot (hence, a constrained queueing system). The server therefore needs to decide which of the two queues it intends to serve in a given slot (we assume that once the server chooses to serve a packet, service succeeds with probability 1). The server's decision is denoted by the vector $D(t) \in \mathcal{A} := \{[0, 0], [1, 0], [0, 1]\}$, where a 1 denotes service and a 0 denotes lack thereof.

Figure 5: $Q_i(t)$ is the length of Queue $i$ ($i \in \{1, 2\}$) at the beginning of time slot $t$, $A_i(t)$ is its packet arrival process and $D(t) \in \{[0, 0], [1, 0], [0, 1]\}$.

For simplicity, we assume that the processes $(A_i(t))_{t=0}^{\infty}$ are both IID Bernoulli, with $\mathbb{E} A_i(t) = \lambda_i$. Note that the arrival rate vector $\lambda = [\lambda_1, \lambda_2]$ is unknown to the learner. Defining $(x)_+ := \max\{0, x\}$ for $x \in \mathbb{R}$, the queue length evolution is given by
$$Q_i(t+1) = (Q_i(t) - D_i(t))_+ + A_i(t+1), \quad i \in \{1, 2\}. \tag{6}$$
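As a small illustration (our own sketch, not part of the paper's experiments), the dynamics (6) under an improper mixture of the two "always serve Queue $i$" controllers can be simulated as follows; the arrival rates and mixture weights below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([0.49, 0.49])   # unknown to the learner; used only by the simulator
pi_mix = np.array([0.5, 0.5])  # improper mixture over the two base controllers

T = 10_000
Q = np.zeros(2)
total_backlog = 0.0
for t in range(T):
    m = rng.choice(2, p=pi_mix)          # pick a base controller
    D = np.zeros(2)
    D[m] = 1.0                           # controller m always serves Queue m
    A = rng.binomial(1, lam)             # Bernoulli arrivals
    Q = np.maximum(Q - D, 0.0) + A       # dynamics (6)
    total_backlog += Q.sum()

print("average backlog:", total_backlog / T)
```

With $\lambda = [0.49, 0.49]$, each queue receives service at rate $0.5 > 0.49$ under the uniform mixture, so the average backlog stays bounded, whereas either base controller alone lets the unserved queue grow without bound.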
F. Non-concavity of the Value Function

We show here that the value function $V^{\pi}(\rho)$ is in general non-concave, so standard convex optimization techniques for maximization may get stuck in local optima. We note once again that this is different from the non-concavity of $V^{\pi}$ when the parameterization is over the entire state-action space, i.e., $\mathbb{R}^{|S| \times |A|}$. We show that the value function is non-concave for both softmax and direct parameterization, where by direct parameterization we mean that the controllers $K_m$ are weighted by $\theta_m \in \mathbb{R}$ with $\theta_i \geq 0$, $i \in [M]$, and $\sum_{i=1}^{M} \theta_i = 1$. The softmax case is outlined in Remark F.2.

Lemma F.1 (Non-concavity of the value function). There is an MDP and a set of controllers for which the maximization problem of the value function (i.e., (1)) is non-concave (for both direct and softmax parameterization), i.e., $\theta \mapsto V^{\pi_\theta}$ is non-concave.

Figure 6: An example of an MDP with controllers as defined in (7) having a non-concave value function. The MDP has $|S| = 5$ states and $|A| = 2$ actions. States $s_3$, $s_4$ and $s_5$ are terminal states. The only transition with nonzero reward is $s_2 \to s_4$.

Proof. Consider the MDP shown in Figure 6 with 5 states $s_1, \dots, s_5$. States $s_3$, $s_4$ and $s_5$ are terminal states. The figure also shows the allowed transitions and the rewards obtained from those transitions. Let the action set $A$ consist of three actions $\{a_1, a_2, a_3\} \equiv \{\text{right}, \text{up}, \text{null}\}$, where null is a dummy action included to accommodate the three terminal states. Let us consider the case $M = 2$. The two controllers $K_i \in \mathbb{R}^{|S| \times |A|}$, $i = 1, 2$ (where each row is a probability distribution over $A$), are
$$K_1 = \begin{pmatrix} 1/4 & 3/4 & 0 \\ 3/4 & 1/4 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}, \qquad K_2 = \begin{pmatrix} 3/4 & 1/4 & 0 \\ 1/4 & 3/4 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}. \tag{7}$$
Let $\theta^{(1)} = (1, 0)^\top$ and $\theta^{(2)} = (0, 1)^\top$, and fix the initial state to be $s_1$. Since a nonzero reward is only earned on the $s_2 \to s_4$ transition, for any policy $\pi$ we have $V^{\pi}(s_1) = \pi(a_1|s_1)\,\pi(a_2|s_2)\, r$. We also have
$$\frac{K_1 + K_2}{2} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 1/2 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}. \tag{8}$$
We will show that $\tfrac{1}{2} V^{\pi_{\theta^{(1)}}} + \tfrac{1}{2} V^{\pi_{\theta^{(2)}}} > V^{\pi_{(\theta^{(1)}+\theta^{(2)})/2}}$. Writing $V^{K}(s)$ for the value obtained by starting from state $s$ and following a controller matrix $K$ for all time, we observe that
$$V^{\pi_{\theta^{(1)}}}(s_1) = V^{K_1}(s_1) = (1/4)(1/4)\, r = r/16, \qquad V^{\pi_{\theta^{(2)}}}(s_1) = V^{K_2}(s_1) = (3/4)(3/4)\, r = 9r/16,$$
while, on the other hand,
$$V^{\pi_{(\theta^{(1)}+\theta^{(2)})/2}}(s_1) = V^{(K_1+K_2)/2}(s_1) = (1/2)(1/2)\, r = r/4.$$
Hence
$$\tfrac{1}{2} V^{\pi_{\theta^{(1)}}} + \tfrac{1}{2} V^{\pi_{\theta^{(2)}}} = r/32 + 9r/32 = 10r/32 = 1.25 \cdot (r/4) > r/4 = V^{\pi_{(\theta^{(1)}+\theta^{(2)})/2}},$$
which shows that $\theta \mapsto V^{\pi_\theta}$ is non-concave and concludes the proof for direct parameterization.

Remark F.2. For softmax parameterization, we choose the same two controllers $K_1, K_2$ as above. Fix some $\varepsilon \in (0, 1)$ and set $\theta^{(1)} = (\log(1-\varepsilon), \log \varepsilon)^\top$ and $\theta^{(2)} = (\log \varepsilon, \log(1-\varepsilon))^\top$. A similar calculation using the softmax projection, together with the fact that $\pi_\theta(a|s) = \sum_{m=1}^{M} \pi_\theta(m) K_m(s, a)$, shows that under $\theta^{(1)}$ we follow the matrix $(1-\varepsilon)K_1 + \varepsilon K_2$, which yields a value of $(1/4 + \varepsilon/2)^2\, r$, while under $\theta^{(2)}$ we follow $\varepsilon K_1 + (1-\varepsilon)K_2$, which yields a value of $(3/4 - \varepsilon/2)^2\, r$. On the other hand, $(\theta^{(1)} + \theta^{(2)})/2$ amounts to playing the matrix $(K_1 + K_2)/2$, yielding a value of $r/4$, as above. One can verify easily that $(1/4 + \varepsilon/2)^2\, r + (3/4 - \varepsilon/2)^2\, r > 2 \cdot r/4$ for $\varepsilon \neq 1/2$, which shows the non-concavity of $\theta \mapsto V^{\pi_\theta}$ under softmax parameterization.
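The counterexample above is easy to check numerically; the short sketch below (our own illustration) evaluates $V(s_1) = \pi(a_1|s_1)\,\pi(a_2|s_2)\, r$ for $K_1$, $K_2$ and their midpoint, with $r = 1$.

```python
import numpy as np

r = 1.0
K1 = np.array([[0.25, 0.75, 0.0],
               [0.75, 0.25, 0.0],
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 1.0]])
K2 = K1[:, [1, 0, 2]]   # swap the first two action columns to obtain K2

def value_s1(K):
    # Only the path s1 --a1--> s2 --a2--> s4 earns the reward r.
    return K[0, 0] * K[1, 1] * r

v1, v2 = value_s1(K1), value_s1(K2)
v_mid = value_s1(0.5 * (K1 + K2))
print(v1, v2, v_mid)                 # 0.0625, 0.5625, 0.25
print(0.5 * (v1 + v2) > v_mid)       # True: the midpoint lies below the chord
```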
G. Example showing that the value function need not be pointwise (over states) monotone over the improper class

Consider the same MDP as in Sec. F, but with different base controllers, and let the initial state be $s_1$. The two base controllers $K_i \in \mathbb{R}^{|S| \times |A|}$, $i = 1, 2$ (where each row is a probability distribution over $A$), are
$$K_1 = \begin{pmatrix} 1/4 & 3/4 & 0 \\ 1/4 & 3/4 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}, \qquad K_2 = \begin{pmatrix} 3/4 & 1/4 & 0 \\ 3/4 & 1/4 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}.$$
Let $\theta^{(1)} = (1, 0)^\top$ and $\theta^{(2)} = (0, 1)^\top$, and fix the initial state to be $s_1$. Since a nonzero reward is only earned on the $s_2 \to s_4$ transition, for any policy $\pi$ we have $V^{\pi}(s_1) = \pi(a_1|s_1)\,\pi(a_2|s_2)\, r$ and $V^{\pi}(s_2) = \pi(a_2|s_2)\, r$. Note that the optimal policy of this MDP is deterministic, with $\pi^*(a_1|s_1) = 1$ and $\pi^*(a_2|s_2) = 1$, and all transitions are deterministic. However, the optimal policy (with initial state $s_1$) given only $K_1$ and $K_2$ is a strict mixture: for any $\theta = [\theta, 1-\theta]$, $\theta \in [0, 1]$, the value of the policy $\pi_\theta$ is
$$V^{\pi_\theta}(s_1) = \frac{(3 - 2\theta)(1 + 2\theta)}{16}\, r, \tag{9}$$
which is maximized at $\theta = 1/2$. This means that the optimal mixture chooses $K_1$ and $K_2$ with probabilities $(1/2, 1/2)$, i.e., plays
$$K' = \frac{K_1 + K_2}{2} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 1/2 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}.$$
We observe the following:
$$V^{\pi_{\theta^{(1)}}}(s_1) = V^{K_1}(s_1) = (1/4)(3/4)\, r = 3r/16, \qquad V^{\pi_{\theta^{(2)}}}(s_1) = V^{K_2}(s_1) = (3/4)(1/4)\, r = 3r/16, \qquad V^{\pi_{\theta^{(1)}}}(s_2) = V^{K_1}(s_2) = (3/4)\, r,$$
while, on the other hand,
$$V^{\pi_{(\theta^{(1)}+\theta^{(2)})/2}}(s_1) = V^{K'}(s_1) = (1/2)(1/2)\, r = r/4, \qquad V^{\pi_{(\theta^{(1)}+\theta^{(2)})/2}}(s_2) = V^{K'}(s_2) = (1/2)\, r = r/2.$$
We see that $V^{K'}(s_1) > \max\{V^{K_1}(s_1), V^{K_2}(s_1)\}$, yet $V^{K'}(s_2) < V^{K_1}(s_2)$. This implies that playing an improved mixture policy (here, the optimal one given that the initial state is $s_1$) does not necessarily improve the value at all states.

H. Proof details for Bandits-over-bandits

In this section we consider the instructive sub-case $|S| = 1$, i.e., the multi-armed bandit (MAB). We provide regret bounds for two cases: (1) when the value gradient $\frac{d V^{\pi_{\theta_t}}(\mu)}{d\theta_t}$ (used in the gradient update) is available in each round, and (2) when it needs to be estimated. Note that each controller in this case is a probability distribution over the $|A|$ arms of the bandit. We consider the scenario where the agent, at each time $t \geq 1$, has to choose a probability distribution $K_{m_t}$ from a set of $M$ probability distributions over the actions $A$; she then plays an action $a_t \sim K_{m_t}$. This differs from the standard MAB because the learner cannot choose actions directly, but instead chooses from a given set of controllers that play actions. Note that the value function has no state argument since $|S| = 1$. Let $\mu \in [0, 1]^{|A|}$ be the vector of arm means. The value function for any mixture $\pi \in \mathcal{P}([M])$ is
$$V^{\pi} = \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}[r_t \mid \pi] = \frac{1}{1-\gamma} \sum_{m=1}^{M} \pi_m\, \mu^\top K_m = \frac{1}{1-\gamma} \sum_{m=1}^{M} \pi_m\, r^{\mu}_m, \tag{10}$$
where $r^{\mu}_m$ is interpreted as the mean reward obtained if controller $m$ is chosen at any round $t$. Since $V^{\pi}$ is linear in $\pi$, the maximum is attained at one of the base controllers: $\pi^*$ puts mass 1 on $m^*$, where $m^* := \arg\max_{m \in [M]} V^{K_m}$ and $V^{K_m}$ is the value obtained by using $K_m$ for all time. In the sequel, we assume $\Delta_i := r^{\mu}_{m^*} - r^{\mu}_i > 0$ for all $i \neq m^*$.

H.1. Proofs for MABs with perfect gradient knowledge

With access to the exact value gradient at each step, we have the following result when Softmax PG (Algorithm 1) is applied to the bandits-over-bandits case.

Theorem H.1. With $\eta = \frac{2(1-\gamma)}{5}$, $\theta^{(1)}_m = 1/M$ for all $m \in [M]$, and availability of the true gradient, we have, for all $t \geq 1$,
$$V^{\pi^*} - V^{\pi_{\theta_t}} \leq \frac{5}{1-\gamma} \cdot \frac{M^2}{t}.$$
Also, defining the regret over a time horizon of $T$ rounds as
$$R(T) := \sum_{t=1}^{T} \left( V^{\pi^*} - V^{\pi_{\theta_t}} \right), \tag{11}$$
we show, as a corollary to Thm. H.4, that:

Corollary H.2. $R(T) \leq \min\left\{ \frac{5M^2}{1-\gamma} \log T,\ \sqrt{\frac{5}{1-\gamma}}\, M \sqrt{T} \right\}$.
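Before the proofs, here is a minimal numerical sketch (our own, not from the paper) of the bandits-over-bandits objective (10) and one exact softmax PG update per round; the arm means and controllers below are arbitrary.

```python
import numpy as np

gamma = 0.9
eta = 2 * (1 - gamma) / 5              # step size from Theorem H.1
mu = np.array([0.2, 0.5, 0.9])         # arm means (unknown in the learning setting)
K = np.array([[0.8, 0.1, 0.1],         # each row: a base controller over the 3 arms
              [0.1, 0.1, 0.8]])
r = K @ mu                             # r^mu_m = mu^T K_m

theta = np.full(len(K), 1.0 / len(K))  # uniform initialization
for t in range(200):
    pi = np.exp(theta) / np.exp(theta).sum()      # softmax over controllers
    V = (pi @ r) / (1 - gamma)                    # eq. (10)
    grad = pi * (r - pi @ r) / (1 - gamma)        # exact gradient, cf. eq. (12)
    theta = theta + eta * grad

print(pi, V, r.max() / (1 - gamma))    # pi moves toward the better controller
```

The proofs of Theorem H.4 and Corollary H.2 are developed next.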
Recall from eq (10), that the value function for any given policy π P([M]), that is a distribution over the given M controllers (which are itself distributions over actions A) can be simplified as: V π = 1 1 γ m=1 πmµTKm = 1 1 γ where µ here is the (unknown) vector of mean rewards of the arms A. Here, rµ m := µTKm, i = 1, , M, represents the mean reward obtained by choosing to play controller Km, m M. For ease of notation, we will drop the superscript µ in the proofs of this section. We first show a simplification of the gradient of the value function w.r.t. the parameter θ. Fix a m [M], θm V πθ = 1 1 γ θm πθ(m)rm = 1 1 γ m=1 πθ(m ) {Imm πθ(m)} rm. (12) Next we show that V π is β smooth. A function f : RM R is β smooth, if θ , θ RM f(θ ) f(θ) d dθf(θ), θ θ β 2 θ θ 2 2 . Actor-Critic based Improper Reinforcement Learning Let S := d2 dθ2 V πθ. This is a matrix of size M M. Let 1 i, j M. = 1 1 γ d(πθ(i)(r(i) πT θr)) dθj (14) dθj (r(i) πT θr) + πθ(i)d(r(i) πT θr) dθj = 1 1 γ πθ(j)(r(i) πT θr) πθ(i)πθ(j)(r(i) πT θr) πθ(i)πθ(j)(r(j) πT θr) . (16) Next, let y RM, j=1 Sijy(i)y(j) πθ(j)(r(i) πT θr) πθ(i)πθ(j)(r(i) πT θr) πθ(i)πθ(j)(r(j) πT θr) y(i)y(j) i=1 πθ(i)(r(i) πT θr)y(i)2 2 j=1 πθ(i)πθ(j)(r(i) πT θr)y(i)y(j) i=1 πθ(i)(r(i) πT θr)y(i)2 2 i=1 πθ(i)(r(i) πT θr)y(i) j=1 πθ(j)y(j) i=1 πθ(i)(r(i) πT θr)y(i)2 + 2 1 γ i=1 πθ(i)(r(i) πT θr)y(i) j=1 πθ(j)y(j) πθ (r πT θr) y y 1 + 2 1 γ πθ (r πT θr) 1 . y . πθ 1 y . The last equality is by the assumption that reward are bounded in [0,1]. We observe that, πθ (r πT θr) 1 = πθ(i)(r(i) πT θr) m=1 πθ(i) r(i) πT θr = max i=1,...,M r(i) πT θr 1. Next, for any i [M], πθ(i)(r(i) πT θr) = πθ(i)r(i) πθ(i)2r(i) X j =i πθ(i)πθ(j)r(j) = πθ(i)(1 πθ(i)) + πθ(i)(1 πθ(i)) 2.1/4 = 1/2. Combining the above two inequalities with the fact that πθ 1 = 1 and y y 2, we get, y TSy 1 1 γ πθ (r πT θr) y y 1+ 2 1 γ πθ (r πT θr) 1 . y . πθ 1 y 1 1 γ (1/2+2) y 2 2 . Hence V πθ is β smooth with β = 5 2(1 γ). Actor-Critic based Improper Reinforcement Learning We establish a lower bound on the norm of the gradient of the value function at every step t as below (these type of inequalities are called Łojaseiwicz inequalities (Łojasiewicz, 1963)) Lemma H.3. [Lower bound on norm of gradient] V πθ 2 πθm V π V πθ . Proof of Lemma H.3. Proof. Recall from the simplification of gradient of V π, i.e., eq (12): θm V πθ = 1 1 γ m =1 πθ(m) {Imm πθ(m )} r m = 1 1 γ π(m) r(m) πTr . Taking norm both sides, θV πθ = 1 1 γ m=1 (π(m))2 (r(m) πTr)2 (π(m ))2 (r(m ) πTr)2 = 1 1 γ (π(m )) r(m ) πTr = 1 1 γ (π(m )) (π π)T r = (π(m )) h V π V πθ i . where π = em . We will now prove Theorem H.4 and corollary H.2. We restate the result here. Theorem H.4. With η = 2(1 γ) 5 and with θ(1) m = 1/M for all m [M], with the availability for true gradient, we have t 1, V π V πθt 5 1 γ M 2 Proof. First, note that since V π is smooth we have: V πθt V πθt+1 d dθt V πθt, θt+1 θt + 5 2(1 γ) θt+1 θt 2 2 = η d dθt V πθt 2 + 5 4(1 γ)η2 d dθt V πθt = d dθt V πθt d dθt V πθt (πθt(m ))2 h V π V πθ i2 Lemma H.3 ( inf 1 s t πθt(m ) | {z } =:ct )2 h V π V πθ i2 . Actor-Critic based Improper Reinforcement Learning The first equality is by smoothness, second inequality is by the update equation in algorithm 1. Next, let δt := V π V πθt. We have, δt+1 δt (1 γ) 5 c2 tδ2 t . (17) Claim: t 1, δt 5 c2 t (1 γ) 1 t . We prove the claim by using induction on t 1. Base case. Since δt 1 1 γ , the claim is true for all t 5. Induction step: Let φt := 5 c2 t (1 γ). Fix a t 2, assume δt φt Let g : R R be a function defined as g(x) = x 1 φt x2. 
One can verify easily that g is monotonically increasing in 0, φt 2 . Next with equation 19, we have This completes the proof of the claim. We will show that ct 1/M in the next lemma. We first complete the proof of the corollary assuming this. We fix a T 1. Observe that, δt 5 (1 γ)c2 t 1 t 5 (1 γ)c2 T 1 t . t=1 V π V πθt = 1 1 γ t=1 (π πθt)Tr 5 log T (1 γ)c2 T + 1. Also we have that, t=1 V π V πθt = 5 (1 γ)c2 T (δt δt+1) 1 We next show that with θ(1) m = 1/M, m, i.e., uniform initialization, inft 1 ct = 1/M, which will then complete the proof of Theorem H.4 and of corollary H.2. Lemma H.5. We have inft 1 πθt(m ) > 0. Furthermore, with uniform initialization of the parameters θ(1) m , i.e., 1/M, m [M], we have inft 1 πθt(m ) = 1 M . Proof. We will show that there exists t0 such that inft 1 πθt(m ) = min 1 t t0 πθt(m ), where t0 = min {t : πθt(m ) C}. We define the following sets. S1 = θ : d V πθ dθm , m = m S2 = {θ : πθ(m ) πθ(m), m = m } S3 = {θ : πθ(m ) C} Note that S3 depends on the choice of C. Let C := M M+ . We claim the following: Claim 2. (i)θt S1 = θt+1 S1 and (ii)θt S1 = πθt+1(m ) πθt(m ). Actor-Critic based Improper Reinforcement Learning Proof of Claim 2. (i) Fix a m = m . We will show that if d V πθ dθt(m ) d V πθ dθt(m), then d V πθ dθt+1(m ) d V πθ dθt+1(m). This will prove the first part. Case (a): πθt(m ) πθt(m). This implies, by the softmax property, that θt(m ) θt(m). After gradient ascent update step we have: θt+1(m ) = θt(m ) + η d V πθt θt(m) + η d V πθt dθt(m) = θt+1(m). This again implies that θt+1(m ) θt+1(m). By the definition of derivative of V πθ w.r.t θt (see eq (12)), dθt+1(m ) = 1 1 γ πθt+1(m )(r(m ) πT θt+1r) = 1 1 γ πθt+1(m)(r(m) πT θt+1r) This implies θt+1 S1. Case (b): πθt(m ) < πθt(m). We first note the following equivalence: dθ(m ) d V πθ dθ(m) (r(m ) r(m)) 1 πθ(m ) (r(m ) πT θr). which can be simplified as: (r(m ) r(m)) 1 πθ(m ) (r(m ) πT θr) = (r(m ) r(m)) (1 exp (θt(m ) θt(m))) (r(m ) πT θr). The above condition can be rearranged as: r(m ) r(m) (1 exp (θt(m ) θt(m))) r(m ) πT θtr . By lemma I.10, we have that V πθt+1 V πθt = πT θt+1r πT θtr. Hence, 0 < r(m ) πT θt+1r πT θtr. Also, we note: θt+1(m ) θt+1(m) = θt(m ) + η d V πt dθt(m ) θt+1(m) η d V πt dθt(m) θt(m ) θt(m). This implies, 1 exp (θt+1(m ) θt+1(m)) 1 exp (θt(m ) θt(m)). Next, we observe that by the assumption πt(m ) < πt(m), we have 1 exp (θt(m ) θt(m)) = 1 πt(m ) Hence we have, (1 exp (θt+1(m ) θt+1(m))) r(m ) πT θt+1r (1 exp (θt(m ) θt(m))) r(m ) πT θtr r(m ) r(m). Actor-Critic based Improper Reinforcement Learning Equivalently, 1 πt+1(m ) (r(m ) πT t+1r) r(m ) r(m). Finishing the proof of the claim 2(i). (ii) Let θt S1. We observe that: πt+1(m ) = exp(θt+1(m )) m=1 exp(θt+1(m)) = exp(θt(m ) + η d V πt dθt(m )) m=1 exp(θt(m) + η d V πt exp(θt(m ) + η d V πt dθt(m )) m=1 exp(θt(m) + η d V πt dθt(m )) = exp(θt(m )) m=1 exp(θt(m)) = πt(m ) This completes the proof of Claim 2(ii). Claim 3. S2 S1 and S3 S1. Proof. To show that S2 S1, let θ c S2. We have πθ(m ) πθ(m), m = m . dθ(m ) = 1 1 γ πθ(m )(r(m ) πT θr) > 1 1 γ πθ(m)(r(m) πT θr) This shows that θ S1. For showing the second part of the claim, we assume θ S3 Sc 2, because if θ S2, we are done. Let m = m . 
We have, dθ(m ) d V πθ dθ(m) = 1 1 γ πθ(m )(r(m ) πT θr) πθ(m)(r(m) πT θr) 2πθ(m )(r(m ) πT θr) + i =m ,m πθ(i)(r(i) πT θr) i =m ,m πθ(i) (r(m ) πT θr) i =m ,m πθ(i)(r(m ) r(i)) i =m ,m πθ(i) (r(m ) πT θr) i =m ,m πθ(i) i =m ,m πθ(i) i =m ,m πθ(i) Actor-Critic based Improper Reinforcement Learning Observe that, M P i =m ,m πθ(i) = 1 π(m ) π(m). Using this and rearranging we get, dθ(m ) d V πθ dθ(m) 1 1 γ 1 1 γ π(m) 1 The last inequality follows because θ S3 and the choice of C. This completes the proof of Claim 3. Claim 4. There exists a finite t0, such that θt0 S3. Proof. The proof of this claim relies on the asymptotic convergence result of (Agarwal et al., 2020a). We note that their convergence result hold for our choice of η = 2(1 γ) 5 . As noted in (Mei et al., 2020), the choice of η is used to justify the gradient ascent lemma I.10. Hence we have πθt 1 as t . Therefore, there exists a finite t0 such that πθt0(m ) C and hence θt0 S3. This completes the proof that there exists a t0 such that inf t 1 πθt(m ) = inf 1 t t0 πθt(m ), since once the θt S3, by Claim 3, θt S1. Further, by Claim 2, t t0, θt S1 and πθt(m ) is non-decreasing after t0. With uniform initialization θ1(m ) = 1 M θ1(m), for all m = m . Hence, πθ1(m ) πθ1(m) for all m = m . This implies θ1 S2, which implies θ1 S1. As established in Claim 2, S1 remains invariant under gradient ascent updates, implying t0 = 1. Hence we have that inf t 1 πθt(m ) = πθ1(m ) = 1/M, completing the proof of Theorem H.4 and corollary H.2. Proofs for MABs with noisy gradients When value gradients are unavailable, we follow a direct policy gradient algorithm instead of softmax projection. The full pseudo-code is provided here in Algorithm 4. At each round t 1, the learning rate for η is chosen asynchronously for each controller m, to be απt(m)2, to ensure that we remain inside the simplex, for some α (0, 1). To justify its name as a policy gradient algorithm, observe that in order to minimize regret, we need to solve the following optimization problem: min π P([M]) m=1 π(m)(rµ(m ) rµ(m)). A direct gradient with respect to the parameters π(m) gives us a rule for the policy gradient algorithm. The other changes in the update step (eq 18), stem from the fact that true means of the arms are unavailable and importance sampling. We have the following result. Theorem H.6. With value of α chosen to be less than min rµ m min , (πt) is a Markov process, with πt(m ) 1 as t , a.s. Further the regret till any time T is bounded as m α 2 min log T + C, where C := 1 1 γ P t 1 P πt(m (t)) 1 We make couple of remarks before providing the full proof of Theorem H.6. Remark H.7. The cost of not knowing the true gradient seems to cause the dependence on min in the regret, as is not the case when true gradient is available (see Theorem H.4 and Corollary H.2). The dependence on min as is well known from the work of (Lai & Robbins, 1985), is unavoidable. Remark H.8. The dependence of α on min can be removed by a more sophisticated choice of learning rate, at the cost of an extra log T dependence on regret (Denisov & Walton, 2020). Actor-Critic based Improper Reinforcement Learning Algorithm 4 Projection-free Policy Gradient (for MABs) Input: learning rate η (0, 1) Initialize each π1(m) = 1 M , for all m [M]. for t = 1 to T do m (t) argmax m [M] πt(m) Choose controller mt πt. Play action at Kmt. Receive reward Rmt by pulling arm at. Update m [M], m = m (t) : πt+1(m) = πt(m) + η Rm Im πt(m) Rm (t)Im (t) Set πt+1(m (t)) = 1 P m =m (t) πt+1(m). Proof. 
The proof is an extension of that of Theorem 1 of (Denisov & Walton, 2020) for the setting that we have. The proof is divided into three main parts. In the first part we show that the recurrence time of the process {πt(m )}t 1 is almost surely finite. Next we bound the expected value of the time taken by the process πt(m ) to reach 1. Finally we show that almost surely, lim t πt(m ) 1, in other words the process {πt(m )}t 1 is transient. We use all these facts to show a regret bound. Recall m (t) := argmax m [M] πt(m). We start by defining the following quantity which will be useful for the analysis of algorithm 4. Let τ := min t 1 : πt(m ) > 1 Next, let S := π P([M]) : 1 α 2 π(m ) < 1 In addition, we define for any a R, Sa := π P([M]) : 1 α a π(m ) < 1 x . Observe that if π1(m ) 1/a and π2(m ) < 1/a then π1 Sa. This fact follows just by the update step of the algorithm 4, and choosing η = απt(m) for every m = m . Lemma H.9. For α > 0 such that α < min r(m ) min , we have that sup π S E τ π1 = π < . Proof. The proof here is for completeness. We first make note of the following useful result: For a sequence of positive real numbers {an}n 1 such that the following condition is met: a(n + 1) a(n) b.a(n)2, for some b > 0, the following is always true: an a1 1 + bt. This inequality follows by rearranging and observing the an is a non-increasing sequence. A complete proof can be found in eg. ((Denisov & Walton, 2020), Appendix A.1). Returning to the proof of lemma, we proceed by showing that the sequence 1/πt(m ) ct is a supermartingale for some c > 0. Let min := for ease of notation. Note that if the condition on α holds then there exists an ε > 0, such that (1 + ε)(1 + α) < r /(r ), where r := r(m ). We choose c to be 1 + α α(r )(1 + ε) > 0. Next, let x to be greater than M and satisfying: x x αM 1 + ε. Actor-Critic based Improper Reinforcement Learning Let ξx := min{t 1 : πt(m ) > 1/x}. Since for t = 1, . . . , ξx 1, m (t) = m , we have πt+1(m ) = (1 + α)πt(m ) w.p. πt(m )r and πt+1(m ) = πt(m ) + απt(m )2/πt(m )2 w.p. πt(m )r (t), where r (t) := r(m (t)). Let y(t) := 1/πt(m ), then we observe by a short calculation that, y(t) α 1+αy(t), w.p. r y(t) y(t) + α y(t) πt(m (t))y(t) α. w.p.πt(m )r (t) y(t) otherwise. We see that, E y(t + 1) H(t) y(t) y(t).(y(t) α 1 + αy(t)) + πt(m )r (t).(y(t) + α y(t) πt(m (t))y(t) α) y(t)( r y(t) + πt(m )r (t)) α(r )(1 + ε) αr The inequality holds because r (t) r and that πt(m ) > 1/M. By the Optional Stopping Theorem (Durrett, 2011), c E [ξx t] E [y(ξx t) E [y(1)]] x 1 α. The final inequality holds because π1(m ) 1 α Next, applying the monotone convergence theorem gives theta E [ξx] x c(1 α). Finally to show the result of lemma H.9, we refer the reader to (Appendix A.2, (Denisov & Walton, 2020)), which follow from standard Markov chain arguments. Next we define an embedded Markov Chain {p(s), s Z+} as follows. First let σ(k) := min t τ(k) : πt(m ) < 1 2 and τ(k) := min t σ(k 1) : πt(m ) 1 2 . Note that within the region [τ(k), σ(k)), πt(m ) 1/2 and in [σ(k), τ(k + 1)), πt(m ) < 1/2. We next analyze the rate at which πt(m ) approaches 1. Define p(s) := πts(m ) where ts = s + i=0 (τ(i + 1) σ(i)) i=0 (σ(i) τ(i)), i=0 (σ(i) τ(i)) Also let, σs := min {t > 0 : πt+ts(m ) > 1/2} and, τs := min {t > σs : πt+ts(m ) 1/2} Lemma H.10. The process {p(s)}s 1, is a submartingale. Further, p(s) 1, as s . Finally, E [p(s)] 1 1 1 + α 2 P Proof. 
We first observe that, ( πts+1(m ) if πts+1(m ) 1/2 πts+τ+s(m ) if πts+1(m ) < 1/2 Since πts+τs(m ) 1/2, we have that, p(s + 1) πts+1(m ) and p(s) = πts(m ). Actor-Critic based Improper Reinforcement Learning Since at times ts, πts(m ) > 1/2, we know that m is the leading arm. Thus by the update step, for all m = m , πts+1(m) = πts(m) + απts(m)2 Im Rm(ts) πts(m) Im Rm (ts) Taking expectations both sides, E πts+1(m) H(ts) πts(m) = απts(m)2(rm rm ) = α mπts(m)2. Summing over all m = m : E πts+1(m ) H(ts) + πts(m ) = α X m =m mπts(m)2. By Jensen s inequality, m =m mπts(m)2 = m =m πts(m) = 2 (1 πts(m ))2 Hence we get, p(s) E p(s + 1) H(ts) α 2 (1 p(s))2 ! = E p(s + 1) H(ts) p(s) + α 2 (1 p(s))2 This implies immediately that {p(s)}s 1 is a submartingale. Since, {p(s)} is non-negative and bounded by 1, by Martingale Convergence Theorem, lims p(s) exists. We will now show that the limit is 1. Clearly, it is sufficient to show that lim sup s p(s) = 1. For a > 2, let φa := min s 1 : p(s) a 1 As is shown in (Denisov & Walton, 2020), it is sufficient to show φa < , with probability 1, because then one can define a sequence of stopping times for increasing a, each finite w.p. 1. which implies that p(s) 1. By the previous display, we have E p(s + 1) H(ts) p(s) α 2 P Actor-Critic based Improper Reinforcement Learning as long as p(s) a 1 a . Hence by applying Optional Stopping Theorem and rearranging we get, E [φa] lim s E [φa s] α (1 E [p(1)]) < . Since φa is a non-negative random variable with finite expectation, φa < a.s. Let q(s) = 1 p(s). We have : E [q(s + 1)] E [q(s)] α 2 (q(s))2 P By the useful result H.2, we get, E [q(s)] E [q(1)] 1 + α 2E[q(1)] P !s 1 1 + α 2 P This completes the proof of the lemma. Finally we provide a lemma to tie the results above. We refer (Appendix A.5 (Denisov & Walton, 2020)) for the proof of this lemma. Lemma H.11. X t 1 P [πt(m ) < 1/2] < . Also, with probability 1, πt(m ) 1, as t . Proof of regret bound: Since r r(m) 1, we have by the definition of regret (see eq 11) m=1 π (m)rm πt(m)rm Here we recall that π = em , we have: R(T) = 1 1 γ E m=1 (π (m)rm πt(m)rm) t=1 (π (m)rm πt(m)rm) m=1 πt(m)rm m=1 πt(m)rm t=1 r (1 πt(m )) m =m πt(m)rm m =m r πt(m) m =m πt(m)rm m =m (r rm)E Actor-Critic based Improper Reinforcement Learning Hence we have, R(T) = 1 1 γ m =m (r rm)E t=1 (1 πt(m )) We analyze the following term: t=1 (1 πt(m )) t=1 (1 πt(m ))I{πt(m ) 1/2} t=1 (1 πt(m ))I{πt(m ) < 1/2} t=1 (1 πt(m ))I{πt(m ) 1/2} where, C1 := P t=1 P [πt(m ) < 1/2] < by Lemma H.11. Next we observe that, t=1 (1 πt(m ))I{πt(m ) 1/2} s=1 q(s)I{πt(m ) 1/2} 1 1 + α 2 P Putting things together, we get, α 2 log T + C1 This completes the proof of Theorem H.6. Actor-Critic based Improper Reinforcement Learning I. Proofs for MDPs First we recall the policy gradient theorem. Theorem I.1 (Policy Gradient Theorem (Sutton et al., 2000)). θV πθ(µ) = 1 1 γ s S dπθ µ (s) X θ Qπθ(s, a). Let s S and m [m]. Let Qπθ(s, m) := P a A Km(s, a)Qπθ(s, a). Also let A(s, m) := Q(s, m) V (s). Lemma I.2 (Gradient Simplification). The softmax policy gradient with respect to the parameter θ RM is θm V πθ(µ) = s S dπθ µ (s)πθ(m) A(s, m), where A(s, m) := Q(s, m) V (s) and Q(s, m) := P a A Km(s, a)Qπθ(s, a), and dπθ µ (.) is the discounted state visitation measure starting with an initial distribution µ and following policy πθ. The interpretation of A(s, m) is the advantage of following controller m at state s and then following the policy πθ for all time versus following πθ always. 
As mentioned in section 4, we proceed by proving smoothness of the V π function over the space RM. Proof. From the policy gradient theorem I.1, we have: θm V πθ(µ) = 1 1 γ s S dπθ µ (s) X θ Qπθ(s, a) s S dπθ µ (s) X m=1 πθ(m)Km(s, a) s S dπθ µ (s) θm πθ(m) Km(s, a)Q(s, a) s S dπθ µ (s) X m=1 πm Km(s, a) s S dπθ µ (s)πm X m=1 πm Km(s, a) s S dπθ µ (s)πm a A Km (s, a)Q(s, a) X m=1 πm Km(s, a)Q(s, a) s S dπθ µ (s)πm h Q(s, m ) V (s) i s S dπθ µ (s)πm Aπθ(s, m ). Lemma I.3. V πθ (µ) is 7γ2+4γ+5 2(1 γ)2 -smooth. Proof. The proof uses ideas from (Agarwal et al., 2020a) and (Mei et al., 2020). Let θα = θ + αu, where u RM, α R. For any s S, Actor-Critic based Improper Reinforcement Learning m=1 πθm (Imm πθm) Km(s, a)u(m ) Km (s, a)u(m ) m=1 Km(s, a)u(m ) m =1 πθm Km (s, a) |u(m )| + X m=1 πθm πθm Km(s, a) |u(m )| m =1 πθm |u(m )| X a Km (s, a) m=1 πθm πθm |u(m )| X m =1 πθm |u(m )| + m=1 πθm πθm |u(m )| m =1 πθm |u(m )| 2 u 2 . Next we bound the second derivative. 2πθα(a s) α2 α=0 * 2πθα(a s) α2 α=0u, u Let Ha,θ := 2πθα(a s) θ2 RM M. We have, Ha,θ i,j = θj m=1 πθi (Imi πθm) Km(s, a) πθi Ki(s, a) m=1 πθiπθm Km(s, a) = πθj(Iij πθi)Ki(s, a) m=1 Km(s, a) πθiπθm = πj(Iij πi)Ki(s, a) m=1 Km(s, a) (πj(Iij πi)πm + πiπj(Imj πm)) (Iij πi)Ki(s, a) m=1 πm(Iij πi)Km(s, a) m=1 πi(Imj πm)Km(s, a) Actor-Critic based Improper Reinforcement Learning Plugging this into the second derivative, we get, θ2 πθ(a|s)u, u i=1 Ha,θ i,j uiuj (Iij πi)Ki(s, a) m=1 πm(Iij πi)Km(s, a) m=1 πi(Imj πm)Km(s, a) i=1 πi Ki(s, a)u2 i j=1 πiπj Ki(s, a)uiuj m=1 πiπm Km(s, a)u2 i m=1 πiπjπm Km(s, a)uiuj j=1 πiπj Kj(s, a)uiuj m=1 πiπjπm Km(s, a)uiuj i=1 πi Ki(s, a)u2 i 2 j=1 πiπj Ki(s, a)uiuj m=1 πiπm Km(s, a)u2 i + 2 m=1 πiπjπm Km(s, a)uiuj m=1 πm Km(s, a) m=1 πm Km(s, a) m=1 πm Km(s, a) i=1 πi |ui| j=1 πj |uj| m=1 πm Km(s, a) i=1 πi |ui| j=1 πj |uj| 3 u 2 2 . The rest of the proof is similar to (Mei et al., 2020) and we include this for completeness. Define P(α) RS S, where (s, s ), [P(α)](s,s ) = X a A πθα(a s).P(s |s, a). The derivative w.r.t. α is, απθα(a s) α=0 .P(s |s, a). For any vector x RS, απθα(a s) α=0 .P(s |s, a).x(s ). The l norm can be upper-bounded as, αP(α) α=0x = max s S απθα(a s) α=0 .P(s |s, a).x(s ) Actor-Critic based Improper Reinforcement Learning απθα(a s) α=0 .P(s |s, a). x Now we find the second derivative, 2P(α) taking the l norm, P(s |s, a)x(s ) P(s |s, a) x 3 u 2 x . Next we observe that the value function of πθα : V πθα (s) = X a A πθα(a|s)r(s, a) a A πθα(a|s) X s S P(s |s, a)V πθα (s ). In matrix form, V πθα = rθα + γP(α)V πθα = (Id γP(α)) V πθα = rθα V πθα = (Id γP(α)) 1 rθα. Let M(α) := (Id γP(α)) 1 = P t=0 γt[P(α)]t. Also, observe that 1 = 1 1 γ (Id γP(α)) 1 = M(α)1 = 1 1 γ 1. = i [M(α)]i,: 1 = 1 1 γ where [M(α)]i,: is the ith row of M(α). Hence for any vector x RS, M(α)x 1 1 γ x . By assumption I.6, we have rθα = maxs |rθα(s)| 1. Next we find the derivative of rθα w.r.t α. a A πθα(m )(Imm πθα(m))Km(s, a)r(s, a)u(m ) a A πθα(m )Km (s, a)r(s, a)u(m ) a A πθα(m )πθα(m)Km(s, a)r(s, a)u(m ) a A πθα(m )Km (s, a)r(s, a) a A πθα(m )πθα(m)Km(s, a)r(s, a) Actor-Critic based Improper Reinforcement Learning Similarly, we can calculate the upper-bound on second derivative, rθα 5/2 u 2 2 . 
Next, the derivative of the value function w.r.t α is given by, α = γe T s M(α) P(α) α M(α)rθα + e T s M(α) rθα And the second derivative, α2 = 2γ2e T s M(α) P(α) α M(α) P(α) α M(α)rθα | {z } T 1 + γe T s M(α) 2P(α) α2 M(α)rθα | {z } T 2 + 2γe T s M(α) P(α) α | {z } T 3 + e T s M(α) 2rθα α2 | {z } T 4 We use the above derived bounds to bound each of the term in the above display. The calculations here are same as shown for Lemma 7 in (Mei et al., 2020), except for the particular values of the bounds. Hence we directly, mention the final bounds that we obtain and refer to (Mei et al., 2020) for the detailed but elementary calculations. |T1| 4 (1 γ)3 u 2 2 |T2| 3 (1 γ)2 u 2 2 |T3| 2 (1 γ)2 u 2 2 |T4| 5/2 (1 γ) u 2 2 . Combining the above bounds we get, 2V πθα (s) (1 γ)3 + 3γ (1 γ)2 + 4γ (1 γ)2 + 5/2 (1 γ) = 7γ2 + 4γ + 5 2(1 γ)3 u 2 . Finally, let y RM and fix a θ RM: y T 2V πθ(s) θ2 y = y y 2 θ2 u, u . y 2 2 = max u 2=1 = max u 2=1 Actor-Critic based Improper Reinforcement Learning 7γ2 + 4γ + 5 2(1 γ)3 y 2 2 . Let θξ := θ + ξ(θ θ) where ξ [0, 1]. By Taylor s theorem s, θ, θ , V πθ (s) V πθ(s) V πθ(s) (θ θ)T 2V πθξ (s) 7γ2 + 4γ + 5 4(1 γ)3 θ θ 2 2 . Since V πθ(s) is 7γ2+4γ+5 2(1 γ)3 smooth for every s, V πθ(µ) is also 7γ2+4γ+5 2(1 γ)3 smooth. Lemma I.4 (Value Difference Lemma-1). For any two policies π and π , and for any state s S, the following is true. V π (s) V π(s) = 1 1 γ s S dπ s (s ) m=1 π m A(s , m). V π (s) V π(s) = m=1 π m Q (s, m) m=1 πm Q(s, m) m=1 π m Q (s, m) Q(s, m) + m=1 (π m πm) Q(s, m) m=1 (π m πm) Q(s, m) + a A Km(s, a) a A πθ(a|s) s S P(s |s, a) h V π (s ) V π(s ) i s S dπ s (s ) m =1 (π m πm ) Q(s , m ) s S dπ s (s ) m =1 π m ( Qs , m V (s )) s S dπ s (s ) m =1 π m A(s , m ). Lemma I.5. (Value Difference Lemma-2) For any two policies π and π and state s S, the following is true. V π (s) V π(s) = 1 1 γ s S dπ s (s ) m=1 (π m πm) Qπ (s , m). Actor-Critic based Improper Reinforcement Learning Proof. We will use Q for Qπ and Q for Qπ as a shorthand. V π (s) V π(s) = m=1 π m Q (s, m) m=1 πm Q(s, m) m=1 (π m πm) Q (s, m) + m=1 πm( Q (s, m) Q(s, m)) m=1 (π m πm) Q (s, m)+ a A Km(s, a) X s S P(s |s, a)V (s ) X a A Km(s, a) X s S P(s |s, a)V (s ) m=1 (π m πm) Q (s, m) + γ X a A πθ(a|s) X s S P(s |s, a) [V (s) V (s )] s S dπ s (s ) m=1 (π m πm) Q (s , m). Assumption I.6. The reward r(s, a) [0, 1], for all pairs (s, a) S A. Assumption I.7. Let π := argmax π PM V π(s0). We make the following assumption. Em π [Qπθ(s, m)] V πθ(s) 0, s S, πθ Π. Let the best controller be a point in the M simplex, i.e., K := M P m=1 π m Km. Lemma I.8 (NUŁI). θV πθ(µ) 2 1 min m:π θm>0 πθm dπ ρ d πθ µ [V (ρ) V πθ(ρ)] . θV πθ(µ) 2 = (Cauchy-Schwarz) s S dπθ µ (s)πm A(s, m) s S dπθ µ (s) A(s, m) min m:π θm>0 πθm s S dπθ µ (s) A(s, m) min m:π θm>0 πθm s S dπθ µ (s) A(s, m) Actor-Critic based Improper Reinforcement Learning min m:π θm>0 πθm s S dπθ µ (s) π m 1 γ A(s, m) min m:π θm>0 πθm s S dπθ µ (s) π m 1 γ A(s, m) Assumption I.7 min m:π θm>0 πθm ! dπ ρ dπθ µ m=1 π m A(s, m) min m:π θm>0 πθm ! dπ ρ dπθ µ [V (ρ) V πθ(ρ)] Lemma I.4. I.1. Proof of the Theorem 4.2 Lemma I.9 (Modified Policy Gradient Theorem). θV πθ(ρ) = E(s,m) νπθ [ Qπθ(s, m)ψθ(m)] = E(s,m) νπθ [ Aπθ(s, m)ψθ(m)], where ψθ(m) := θ log(πθ(m)). Let β := 7γ2+4γ+5 (1 γ)2 . We have that, V (ρ) V πθ(ρ) = 1 1 γ s S dπθ ρ (s) m=1 (π m πm) Qπ (s, m) (Lemma I.5) dπθ ρ (s) dπθ µ (s)dπθ µ (s) m=1 (π m πm) Qπ (s, m) m=1 (π m πm) Qπ (s, m) m=1 (π m πm) Qπ (s, m) [V (µ) V πθ(µ)] (Lemma I.5). Let δt := V (µ) V πθt(µ). 
δt+1 δt = V πθt(µ) V πθt+1(µ) (Lemma I.3) 2 (Lemma I.10 ) min m:π θm>0 πθm !2 dπ ρ dπθ µ δ2 t (Lemma 4.5) 2β (1 γ)2 1 min m:π θm>0 πθm !2 dπ ρ dπθ µ 2β (1 γ)2 1 min 1 s t min m:π θm>0 πθm !2 dπ ρ dπθ µ 2β 1 M (1 γ)2 dπ µ µ Actor-Critic based Improper Reinforcement Learning where ct := min 1 s t min m:π m>0 πθs(m). Hence we have that, c2 tδ2 t . (19) The rest of the proof follows from a induction argument over t 1. Base case: Since δt 1 1 γ , and ct (0, 1), the result holds for all t 2βM (1 γ) For ease of notation, let φt := 2βM c2 t (1 γ)2 . We need to show that δt φt t , for all t 1. Induction step: Fix a t 2, assume δt φt Let g : R R be a function defined as g(x) = x 1 φt x2. One can verify easily that g is monotonically increasing in 0, φt 2 . Next with equation 19, we have where the last step follows from the fact that ct+1 ct (infimum over a larger set does not increase the value). This completes the proof. Lemma I.10. Let f : RM R be β smooth. Then gradient ascent with learning rate 1 β guarantees, for all x, x RM: f(x) f(x ) 1 f(x) f(x ) f(x) 2 . x x 2 2 Actor-Critic based Improper Reinforcement Learning J. Proofs for (Natural) Actor-critic based improper learning We will begin with some useful lemmas. Lemma J.1. For any θ, θ RM, we have ψθ(m) ψθ (m) 2 θ θ 2. Proof. Recall, ψθ(m) := θ log πθ(m). Fix m [M], = 1{m = m} eθm = 1{m = m} πθ(m ). ψθ(m) ψθ (m) 2 θ θ 2 = θ log πθ(m) θ log πθ (m) 2 = πθ(.) πθ (.) 2 ( ) θ θ 2 . Here (*) follows from the fact that the softmax function is 1-Lipschitz (Gao & Pavel, 2017). Lemma J.2. For all m [M] and θ RM, ψθ(m) 2 Proof. Proof follows by noticing that ψθ(m) 2 = θ log πθ(m) 2 2, where the last inequality follows because the 2-norm of a probability vector is bounded by 1. Lemma J.3. For all θ, θ RM, πθ(.) πθ (.) T V πθ(.) πθ (.) T V = 1 2 πθ(.) πθ (.) 1 M 2 πθ(.) πθ (.) 2 . The inequality follows from relation between 1-norm and 2-norm. Proposition J.4. For any θ, θ RM, V (θ) V (θ ) where LV = 2 1 γ , and Cκξ = 1 + logξ 1 κ + 1 1 ξ . Proof. We follow the same steps as in Proposition 1 in (Xu et al., 2020) along with Lemmas J.3,J.2,J.1 and that the maximum reward is bounded by 1. We will now restate a useful result from (Xu et al., 2020), about the convergence of the critic parameter wt to the equilibrium point w of the underlying ODE, applied to our setting. Actor-Critic based Improper Reinforcement Learning Proposition J.5. Suppose assumptions 5.3 and 5.2 hold. Then is β min n ΓL o and H 4 ΓL + 2α h 1536[1+(κ 1)ξ] i . We have E h w Tc w 2 2 i 1 ΓL 16 α Tc w0 w 2 2 + 4 ΓL + 2α 1536(1 + R2 w)[1 + (κ 1)ξ] (1 ξ)H . If we further let Tc 16 ΓLα log 2 w0 w 2 2 ε and H 4 ΓL + 2α 3072(R2 w+1)[1+(κ 1)ξ] (1 ξ)ΓLε , then we have E h w Tc w 2 2 i ε with total sample complexity given by Tc H = O 1 Proof. Proof follows along the similar lines as in Thm. 4 in Xu et al. (2020) and by using φ(s)(γφ(s ) φ(s)) F (1 + γ) 2 and assuming φ(s) 2 1 for all s, s S. J.1. Actor-critic based improper learning Proof of Theorem 5.4. Let vt(w) := 1 B i=0 E(st,i, mt,i, st,i+1)ψθt(mt,i) and Aw(s, m) := E P [E(s, m, s )|(s, m)] and g(w, θ) := Eνθ[Aw(s, m)ψθ(m)] for all θ RM, w Rd, s S, m [M]. Using Prop J.4 we get, V (θt+1) V (θt) + θV (θt), θt+1 θt 2 θt+1 θt 2 2 = V (θt) + α θV (θt), vt(wt) θV (θt) + θV (θt) 2 vt(wt) 2 2 = V (θt) + α θV (θt) 2 2 + α θV (θt), vt(wt) θV (θt) 2 vt(wt) 2 2 MLV α2 θV (θt) 2 2 1 MLV α2 vt(wt) θV (θt) 2 2 Taking expectations and rearranging, we have 1 MLV α2 E h θV (θt) 2 2 |Ft i E [V (θt+1)|Ft] V (θt) + 1 MLV α2 E h vt(wt) θV (θt) 2 2 |Ft i . 
Next we will upperbound E h vt(wt) θV (θt) 2 2 |Ft i . vt(wt) θV (θt) 2 2 3 vt(wt) vt(w θt) 2 2 + 3 vt(w θt) g(w θt) 2 2 + 3 g(w θt) θV (θt) 2 2 . vt(wt) vt(w θt) 2 2 = i=0 [Ewt(st,i, mt,i, st,i+1) Ew θt(st,i, mt,i, st,i+1)]ψ(mt,i) [Ewt(st,i, mt,i, st,i+1) Ew θt(st,i, mt,i, st,i+1)]ψ(mt,i) 2 [Ewt(st,i, mt,i, st,i+1) Ew θt(st,i, mt,i, st,i+1)] 2 (γφ(st,i+1) φ(st,i)) (wt w θt) 2 Actor-Critic based Improper Reinforcement Learning (wt w θt) 2 2 = 8 (wt w θt) 2 2 . Next we have, g(w θt) θV (θt) 2 2 = Eνθt[Aw θt(s, m)ψθt(m)] Eνθt[Aπθt(s, m)ψθt(m)] 2 Aw θt(s, m) Aπθt(s, m) 2 h |γE h Vw θt(s ) Vπθt(s )|s, m i + +Vπθt(s) Vw θt(s)|2i Finally we bound the last term vt(w θt) g(w θt) 2 2 by using Assumption 5.2 we have, vt(w θt) g(w θt) 2 2 E i=0 Ew θt(st,i, mt,i, st,i+1)ψθt(mt,i) Eνθt[Aw θt(s, m)ψθt(m)] We will now proceed in the similar manner as in (Xu et al., 2020) (eq 24 to eq 26), and using Lemma J.2, we have E h vt(w θt) g(w θt) 2 2 |Ft i 32(1 + Rw)2[1 + (κ 1)ξ] Putting things back we have, E h vt(wt) θV (θt) 2 2 Ft i 96(1 + Rw)2[1 + (κ 1)ξ] B(1 ξ) + 24E h (wt w θt) 2 2 i + 24 critic. Hence we get, MLV α2 E h θV (θt) 2 2 i E [V (θt+1)] E [V (θt)] MLV α2 96(1 + Rw)2[1 + (κ 1)ξ] B(1 ξ) + 24E h (wt w θt) 2 2 i + 24 critic We put α = 1 4LV M above to get, E h θV (θt) 2 2 i E [V (θt+1)] E [V (θt)] 96(1 + Rw)2[1 + (κ 1)ξ] B(1 ξ) + 24E h (wt w θt) 2 2 i + 24 critic which simplifies as E h θV (θt) 2 2 i M (E [V (θt+1)] E [V (θt)]) + 384(1 + Rw)2[1 + (κ 1)ξ] B(1 ξ) + 96E h (wt w θt) 2 2 i + 96 critic. Actor-Critic based Improper Reinforcement Learning Taking summation over t = 0, 1, 2, . . . , T 1 and dividing by T, E h θV (θ b T ) 2 2 t=0 E h θV (θt) 2 2 i M (E [V (θT )] E [V (θ0)]) T + 384(1 + Rw)2[1 + (κ 1)ξ] B(1 ξ) + 96 1 t=0 E h (wt w θt) 2 2 i + 96 critic M (1 γ)T + 384(1 + Rw)2[1 + (κ 1)ξ] B(1 ξ) + 96 1 t=0 E h (wt w θt) 2 2 i + 96 critic We now let B 1152 (1+Rw)2[1+(κ 1)ξ](1 ξ)ε, E h (wt w θt) 2 2 i ε 288 and T 48LV M (1 γ)ε , then we have E h θV (θ b T ) 2 2 i ε + O( critic). This leads to the final sample complexity of (B + HTc)T = 1 ε + M (1 γ)2ε = O M (1 γ)2ε2 log 1 J.2. Natural-actor-critic based improper learning J.2.1. PROOF OF THEOREM 5.5 Proof. We first show that the natural actor-critic improper learner converges to a stationary point. We will then show convergence to the global optima which is what is different from that of (Xu et al., 2020). Let vt(w) := 1 B PB 1 i=0 Ew(st,i, mt,i, st,i+1)ψθt(mt,i), Aw(s, m) := E P [E(s, m, s )|s, m] and g(w, θ) := Eνθ[Aw(s, m)ψθ(m)] for w Rd and θ RM. Also let ut(w) := [Ft(θt) + λI] 1 h 1 B PB 1 i=0 Ew(st,i, mt,i, st,i+1)ψθt(mt,i) i = [Ft(θt) + λI] 1vt(w). Recall Prop J.4. We have Lemma J.6. Assume sups S φ(s) 2 1. Under Assumptions 5.2 and 5.3 with step-sizes chosen as α = λ2 E[ θV (θ b T ) 2 2] = 1 t=0 E[ θV (θt) 2 2] MLV (1 + λ)2 λ2 E [V (θT )] V (θ0) λ2 [2(1 + λ)2 + λ2] PT 1 t=0 E h wt w θt 2 2 + [2(1 + λ)2 + λ2] 32 λ4(1 γ)2 + 432(1 + 2Rw)2 (1 ξ)B + 216 λ2 [2(1 + λ)2 + λ2] critic. Proof. Proof is similar to first part of proof of Thm 6 in (Xu et al., 2020) and similar to Thm 5.4, along with using Prop J.4 and Lemmas J.3, J.2 and J.1. We now move to proving the global optimality of natural actor critic based improper learner. Let KL( , ) be the KLdivergence between two distributions. We denote D(θ) := KL(π , πθ), uλ θt := (F(θt) + λI) 1 θV (θt) and u θt := Actor-Critic based Improper Reinforcement Learning F(θt) θV (θt). 
We see that D(θt) D(θt+1) m=1 π (m) log πθt+1(m) log πθt(m) s S dπ ρ (s) m=1 π (m) log πθt+1(m) log πθt(m) = Eνπ [log πθt+1(m) log πθt(m)] (ii) Eνπ [ θ log(πθt(m))] (θt+1 θt) θt+1 θt 2 2 2 = Eνπ [ψθt(m)] (θt+1 θt) θt+1 θt 2 2 2 = αEνπ [ψθt(m)] ut(wt) α2 ut(wt) 2 2 2 = αEνπ [ψθt(m)] uλ θt + αEνπ [ψθt(m)] (ut(wt) uλ θt) α2 ut(wt) 2 2 2 = αEνπ [ψθt(m)] u θt + αEνπ [ψθt(m)] (uλ θt u θt) + αEνπ [ψθt(m)] (ut(wt) uλ θt) α2 ut(wt) 2 2 2 = αEνπ [Aπθt(s, m)] + αEνπ [ψθt(m) u θt Aπθt(s, m)] + αEνπ [ψθt(m)] (uλ θt u θt) + αEνπ [ψθt(m)] (ut(wt) uλ θt) α2 ut(wt) 2 2 2 (iii) = (1 γ)(V (π ) V (πθt)) + αEνπ [ψθt(m) u θt Aπθt(s, m)] + αEνπ [ψθt(m)] (uλ θt u θt) + αEνπ [ψθt(m)] (ut(wt) uλ θt) α2 ut(wt) 2 2 2 (1 γ)(V (π ) V (πθt)) α q Eνπ [[ψθt(m) u θt Aπθt(s, m)]2] + αEνπ [ψθt(m)] (uλ θt u θt) + αEνπ [ψθt(m)] (ut(wt) uλ θt) α2 ut(wt) 2 2 2 (iv) (1 γ)(V (π ) V (πθt)) Eνπ [[ψθt(m) u θt Aπθt(s, m)]2] + αEνπ [ψθt(m)] (uλ θt u θt) + αEνπ [ψθt(m)] (ut(wt) uλ θt) α2 ut(wt) 2 2 2 (v) (1 γ)(V (π ) V (πθt)) v u u t 1 1 γ Eνπ [[ψθt(m) u θt Aπθt(s, m)]2] + αEνπ [ψθt(m)] (uλ θt u θt) + αEνπ [ψθt(m)] (ut(wt) uλ θt) α2 ut(wt) 2 2 2 (vi) (1 γ)(V (π ) V (πθt)) v u u t 1 1 γ Eνπ [[ψθt(m) u θt Aπθt(s, m)]2] αCsoftλ 2α ut(wt) uλ θt 2 α2 ut(wt) 2 2 2 . where (i) is by taking an extra expectation without changing the inner summand, (ii) follows by Lemma J.1 and Lemma 5 in (Xu et al., 2020), (iii) follows by the value difference lemma (Lemma I.4), (iv) follows by defining νπ νπθt maxs,m νπ (s,m) νπθt (s,m), (v) follows because νπθt(s, m) (1 γ)νπθ0(s, m), (vi) follows by Lemma6 in (Xu et al., 2020)and Actor-Critic based Improper Reinforcement Learning Next, we denote actor := maxθ RM minw Rd Eνπθ [[ψ θ w Aπθ(s, m)]2] as the actor error. D(θt) D(θt+1) (1 γ)(V (π ) V (πθt)) v u u t 1 1 γ actor αCsoftλ 2α ut(wt) uλ θt 2 α2 ut(wt) 2 2 2 (1 γ)(V (π ) V (πθt)) v u u t 1 1 γ actor αCsoftλ 2α ut(wt) uλ θt 2 α2 ut(wt) uλ(θt) 2 2 2 α2 λ2 θV (θt) 2 2 . Rearranging and dividing by (1 γ)α, and taking expectation both sides we get V (π ) E [V (πθt)] E [D(θt)] E [D(θt+1)] (1 γ)α + 2 q E[ ut(wt) uλ θt 2 1 γ + αE h ut(wt) uλ(θt) 2 2 + α λ2(1 γ)E h θV (θt) 2 2 i + v u u t 1 (1 γ)3 actor + Csoftλ Next we use the same argument as in eq (33) and Lemma 2 in Xu et al. (2020) to bound the second term. E h ut(wt) uλ θt 2 B + 108E h wt w θt 2 2 λ2 + 216 critic where C := 18 λ2 24(1+2Rw)2[1+(κ 1)ξ] B(1 ξ) + 4 λ4(1 γ)2 . 8[1+(κ 1)ξ] (1 ξ)B . Using this in the bound and using b for positive a, b above, we have, V (π ) E [V (πθt)] E [D(θt)] E [D(θt+1)] (1 γ)α + 2 1 γ v u u t E h wt w θt 2 B + 108 E h wt w θt 2 2 λ2 + 216 critic + α λ2(1 γ)E h θV (θt) 2 2 i + v u u t 1 (1 γ)3 actor + Csoftλ Actor-Critic based Improper Reinforcement Learning Summing over all t = 0, 1, . . . , T 1 and then dividing by T we get, t=0 E [V (πθt)] D(θ0) E [D(θT)] (1 γ)αT + 2 (1 γ) + 22 (1 γ)T v u u t E h wt w θt 2 B + 216 critic + 54α (1 γ)T E h wt w θt 2 2 + α λ2(1 γ)T t=0 E h θV (θt) 2 2 i + v u u t 1 (1 γ)3 actor + Csoftλ We now put the value of α λ2 MLV (1+λ), we get, t=0 E [V (πθt)] critic + C4 E h wt w θt 2 critic + C7 t=0 E h wt w θt 2 2 t=0 E h θV (θt) 2 2 i v u u t 1 (1 γ)3 actor + C9λ. Letting T = O M (1 γ)2ε , B = O 1 (1 γ)2ε2 then E h θV (θt) 2 2 i ε2 and t=0 E [V (πθt)] ε + O actor (1 γ)3 + O( critic) + O(λ). This leads to the total sample complexity as (B + HTc)T = O 1 (1 γ)2ε2 + = O M (1 γ)4ε3 log 1 K. Simulation Details In this section we describe the details of the Sec. 6. 
Recall that since neither value functions nor value gradients are available in closed-form, we modify Soft Max PG (Algorithm 1) to make it generally implementable using a combination of (1) rollouts to estimate the value function of the current (improper) policy and (2) a stochastic approximation-based approach to estimate its value gradient. The Softmax PG with Gradient Estimation or SPGE (Algorithm 5), and the gradient estimation algorithm 6, Grad Est, are shown below. Actor-Critic based Improper Reinforcement Learning Figure 7: A chain MDP with 10 states. Algorithm 5 Softmax PG with Gradient Estimation (SPGE) 1: Input: learning rate η > 0, perturbation parameter α > 0, Initial state distribution µ 2: Initialize each θ1 m = 1, for all m [M], s1 µ 3: for t = 1 to T do 4: Choose controller mt πt. 5: Play action at Kmt(st, :). 6: Observe st+1 P(.|st, at). 7: \ θt V πθt(µ) = Grad Est(θt, α, µ) 8: Update: θt+1 = θt + η. \ θt V πθt(µ). 9: end for Algorithm 6 Grad Est (subroutine for SPGE) 1: Input: Policy parameters θ, parameter α > 0, Initial state distribution µ. 2: for i = 1 to #runs do 3: ui Unif(SM 1). 4: θα = θ + α.ui 5: πα = softmax(θα) 6: for l = 1 to #rollouts do 7: Generate trajectory (s0, a0, r0, s1, a1, r1, . . . , slt, alt, rlt) using the policy πα : and s0 µ. 8: rewardl = lt P 9: end for 10: mr(i) = mean(reward) 11: end for 12: Grad Value = 1 #runs i=1 mr(i).ui. M 13: Return: Grad Value. Next we report some extra simulations we performed under different environments. K.1. State Dependent controllers Chain MDP We consider a linear chain MDP as shown in Figure 7. As evident from the figure, |S| = 10 and the learner has only two actions available, which are A = {left, right}. Hence the name chain . The numbers on the arrows represent the reward obtained with the transition. The initial state is s1. We let s100 as the terminal state. Let us define 2 base controllers, K1 and K2, as follows. K1(left sj) = 1, j [9]\{5} 0.1, j = 5 0, j = 10. K2(left sj) = 1, j [9]\{6} 0.1, j = 6 0, j = 10. and obviously Ki(right|sj) = 1 Ki(left|sj) for i = 1, 2. An improper mixture of the two controllers, i.e., (K1+K2)/2 is the optimal in this case. We show that our policy gradient indeed converges to the correct combination, see Figure 8. We here provide an elementary calculation of our claim that the mixture Kmix := (K1 + K2)/2 is indeed better than applying K1 or K2 for all time. We first analyze the value function due to Ki, i = 1, 2 (which are the same due to symmetry of the problem and the probability values described). V Ki(s1) = E t 0 γtrt(at, st) Actor-Critic based Improper Reinforcement Learning 0 1 2 3 4 5 Number of rounds 10 4 Probability of choosing K 1 Probability of choosing K 1 Figure 8: Softmax PG alg applied to the linear Chain MDP with various randomly chosen initial distribution. Plot shows probability of choosing controller K1 averaged over #trials = 0.1 γ9 + 0.1 0.9 0.1 γ11 + 0.1 0.9 0.1 0.9 0.1 γ13 . . . = 0.1 γ9 1 + 0.1 0.9γ2 + 0.1 0.9γ2 2 + . . . = 0.1 γ9 1 0.1 0.9 γ2 . We will next analyze the value if a true mixture controller i.e., Kmix is applied to the above MDP. The analysis is a little more intricate than the above. We make use of the following key observations, which are elementary but crucial. 1. Let Paths be the set of all sequence of states starting from s1, which terminate at s10 which can be generated under the policy Kmix. Observe that V Kmix(s1) = X p Paths γlength(p)P p Recall that reward obtained from the transition s9 s10 is 1. 2. 
Number of distinct paths with exactly $n$ loops: $2^n$.

3. Probability of each such distinct path with $n$ loops: $(0.55 \cdot 0.45)^n \cdot 0.55 \cdot 0.55$, so its contribution to the value is $(0.55)^2\, \gamma^{9+2n} \,(0.55 \cdot 0.45)^n = (0.55)^2\, \gamma^9\, (0.55 \cdot 0.45\, \gamma^2)^n$.

4. Finally, putting everything together,
$$V^{K_{\mathrm{mix}}}(s_1) = \sum_{n=0}^{\infty} 2^n\, (0.55)^2\, \gamma^9\, (0.55 \cdot 0.45\, \gamma^2)^n = \frac{(0.55)^2\, \gamma^9}{1 - 2 \cdot 0.55 \cdot 0.45\, \gamma^2} > V^{K_i}(s_1).$$

This shows that a mixture performs better than the constituent controllers. The plot in Fig. 8 shows that the Softmax PG algorithm (even with estimated gradients and value functions) correctly converges to a $(0.5, 0.5)$ mixture.

Figure 9: Softmax policy gradient applied to the constrained queueing system, showing convergence to the best mixture policy. Panels: (a), (b) arrival rates $(\lambda_1, \lambda_2) = (0.49, 0.49)$; (c) arrival rates $(\lambda_1, \lambda_2) = (0.3, 0.4)$; (d) (estimated) value functions for the case with the two base policies and Longest Queue First (LQF); (e) the case with 3 experts: Always Queue 1, Always Queue 2 and LQF.

K.2. Stationary Bernoulli Queues

We study two settings: (1) one in which the optimal policy is a strict improper combination of the available controllers, and (2) one in which it is at a corner point, i.e., one of the available controllers is itself optimal. Our simulations show that in both cases PG converges to the correct controller distribution. Recall the example discussed in Sec. 2.2. We consider Bernoulli arrivals with rates $\lambda = [\lambda_1, \lambda_2]$ and are given two base/atomic controllers $\{K_1, K_2\}$, where controller $K_i$ serves Queue $i$ with probability 1, $i = 1, 2$. As can be seen in Fig. 9(b), when $\lambda = [0.49, 0.49]$ (equal arrival rates), Grad Est converges to an improper mixture policy that serves each queue with probability $[0.5, 0.5]$. Note that this strategy also stabilizes the system, whereas both base controllers lead to instability (the queue length of the unserved queue would obviously increase without bound). Figure 9(c) shows that with unequal arrival rates too, Grad Est quickly converges to the best policy. Fig. 9(d) shows the evolution of the value function of Grad Est (in blue) compared with those of the base controllers (red) and the Longest Queue First (LQF) policy which, as the name suggests, always serves the longest queue in the system (black). LQF, like any policy that always serves a nonempty queue whenever there is one (the tie-breaking rule is irrelevant), is known to be optimal in the sense of delay minimization for this system (Mohan et al., 2016). Finally, Fig. 9(e) shows the result of the second experimental setting with three base controllers, one of which is delay optimal. The first two are $K_1, K_2$ as before and the third controller, $K_3$, is LQF. Notice that $K_1, K_2$ are both queue length-agnostic, meaning they could attempt to serve empty queues as well; LQF, on the other hand, always and only serves nonempty queues. Hence, in this case the optimal policy is attained at one of the corner points, i.e., $[0, 0, 1]$. The plot shows the PG algorithm converging to the correct point on the simplex.

Here we justify the values of the two policies which always serve one fixed queue, plotted as straight lines in Figure 9(d). Let us find the value of the policy which always serves Queue 1 (the calculation for the other expert, serving Queue 2 only, is similar). Let $q_i(t)$ denote the length of Queue $i$ at time $t$. We note that, since this expert always recommends
serving one fixed queue, the expected cost suffered in round $t$ is $\mathbb{E}[c_t] = \mathbb{E}[q_1(t) + q_2(t)] = 0 + t\lambda_2$. Starting with empty queues at $t = 0$,
$$V^{\mathrm{Expert1}}(0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t c_t \,\Big|\, \mathrm{Expert1}\right] \approx \sum_{t=0}^{\infty} \gamma^t\, t\, \lambda_2 = \lambda_2 \cdot \frac{\gamma}{(1-\gamma)^2}.$$
With the values $\gamma = 0.9$ and $\lambda_2 = 0.49$, we get $V^{\mathrm{Expert1}}(0) \approx 44$, which is in good agreement with the bound shown in the figure.

Figure 10: An example of a path graph network: (a) a basic path-graph interference system with $N = 4$ communication links; (b) the associated conflict (interference) graph, which is a path graph. The interference constraints are such that physically adjacent queues cannot be served simultaneously.

K.3. Details of Path (Interference) Graph Networks

Consider a system of parallel transmitter-receiver pairs as shown in Figure 10(a). Due to the physical arrangement of the Tx-Rx pairs, no two adjacent systems can be served simultaneously because of interference. This type of communication system is commonly referred to as a path graph network (Mohan et al., 2020). Figure 10(b) shows the corresponding conflict graph. Each Tx-Rx pair can be thought of as a queue, and an edge between two queues indicates that they cannot be served simultaneously. The sets of queues which can be served simultaneously are called independent sets in the queueing theory literature; in the figure above, the independent sets are $\{\emptyset, \{1\}, \{2\}, \{3\}, \{4\}, \{1, 3\}, \{2, 4\}, \{1, 4\}\}$ (a small enumeration sketch is given below). Finally, in Table 2, we report the mean delay values of the 5 base controllers used in our simulation of Fig. 2(c), Sec. 6. We see that controller $K_2$, which was chosen to be MER, indeed has the lowest associated cost, and, as shown in Fig. 2(c), our Softmax PG algorithm (with estimated value functions and gradients) converges to it.

Table 2: Mean packet delay values for the path graph network simulation.
Controller: Mean delay (# time slots) over 200 trials / Standard deviation
K1 (MW): 22.11 / 0.63
K2 (MER): 20.96 / 0.65
K3 ({1, 3}): 80.10 / 0.92
K4 ({2, 4}): 80.22 / 0.90
K5 ({1, 4}): 80.13 / 0.91
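For completeness, the following small Python sketch (our own illustration, not part of the paper's code) enumerates the independent sets of the 4-node path conflict graph listed above.

```python
from itertools import combinations

N = 4
edges = {(1, 2), (2, 3), (3, 4)}   # conflict (interference) graph: a path on 4 queues

def independent(S):
    # A set of queues is independent if no two of its members share an edge.
    return all((i, j) not in edges and (j, i) not in edges
               for i, j in combinations(S, 2))

ind_sets = [set(S) for k in range(N + 1)
            for S in combinations(range(1, N + 1), k) if independent(S)]
print(ind_sets)   # [set(), {1}, {2}, {3}, {4}, {1, 3}, {1, 4}, {2, 4}]
```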
Finally, in Table 2 we report the mean delay values of the 5 base controllers used in our simulation of Fig. 2(c), Sec. 6. The controller K2, which was chosen to be MER, indeed has the lowest associated cost, and, as shown in Fig. 2(c), our Softmax PG algorithm (with estimated value functions and gradients) converges to it.

Table 2: Mean packet delay values for the path graph network simulation.

Controller | Mean delay (# time slots) over 200 trials | Standard deviation
K1 (MW) | 22.11 | 0.63
K2 (MER) | 20.96 | 0.65
K3 ({1, 3}) | 80.10 | 0.92
K4 ({2, 4}) | 80.22 | 0.90
K5 ({1, 4}) | 80.13 | 0.91

K.4. Cartpole Experiments

We further investigate the example from our simulation in which the two constituent controllers are K_opt^+ and K_opt^-. We use OpenAI Gym to simulate this setting. Figure 2(b) showed that our Softmax PG algorithm (with estimated values and gradients) converges to an improper mixture of the two controllers, namely (0.53, 0.47). Let Kconv be the (randomized) controller that chooses K1 with probability 0.53 and K2 with probability 0.47. Recall from Sec. 2.1 that this control law converts the linearized cartpole into an Ergodic Parameter Linear System (EPLS). In Table 3 we report the average number of rounds the pendulum stays upright when each controller is applied for all time, over trajectories of length 500 rounds. The third column displays an interesting feature of our algorithm: over 100 trials, the base controllers fail to stabilize the pendulum in a relatively large number of trials, whereas Kconv succeeds most of the time.

Table 3: Number of rounds the constituent controllers manage to keep the cartpole upright.

Controller | Mean number of rounds before the pendulum falls (max 500) | # Trials out of 100 in which the pendulum falls before 500 rounds
K1 (K_opt^+) | 403 | 38
K2 (K_opt^-) | 355 | 46
Kconv | 465 | 8

We mention here that if one follows K*, the optimal controller matrix obtained by solving the standard Discrete-time Algebraic Riccati Equation (DARE) (Bertsekas, 2011), the pole does not fall in any of the 100 trials. However, as indicated in Sec. 1, constructing the optimal controller for this system from scratch requires sample complexity that is exponential in the state dimension (Chen & Hazan, 2020). Kconv, on the other hand, performs very close to the optimum while being sample efficient.

Choice of hyperparameters. In these simulations, we set the learning rate to 10^{-4}, #runs = 10, #rollouts = 10, l_t = 30, discount factor γ = 0.9, and α = 1/#runs. All simulations were run for 20 trials and the results shown are averaged over them. We capped the queue sizes at 1000.

K.5. Some Extra Simulations for the Natural Actor-Critic based Improper Learner (NACIL)

First, we consider a queueing setting with two queues to be served and two base controllers, similar to the setting discussed in Sec. 2. However, here the two queues have different arrival rates, (λ1, λ2) = (0.4, 0.3). We plot in Fig. 11 the probability of choosing each of the two controllers. We see that ACIL converges to the correct mixture of the base controllers.

Figure 11: NACIL applied to a queueing system with two queues having arrival rates (λ1, λ2) = (0.4, 0.3). The plot shows the probability of choosing controllers K1 and K2, averaged over 20 trials.

Next, we show a simulation of the setting in Sec. K.1, which we called the Chain MDP. Recall that this setting has two base controllers K1 and K2, but a (1/2, 1/2) mixture of the two was shown (analytically) to perform better than either individual controller. As the plot in Fig. 12 shows, NACIL identifies the correct combination and follows it.

Figure 12: NACIL applied to the linear Chain MDP with various randomly chosen initial distributions. The plot shows the probability of choosing controller K1, averaged over 20 trials.

Choice of hyperparameters. For the queueing-theoretic simulations of Algorithm 2 (ACIL), we choose α = 10^{-4} and β = 10^{-3}. We choose the identity feature mapping φ(s) = s, where the current state s of the system is an N-length vector whose i-th entry is the length of queue i. λ was chosen to be 0.1. The other parameters are B = 50, H = 30 and Tc = 20. We use a buffer of size 1000 to keep the states bounded, i.e., if a queue exceeds size 1000, further arrivals to it are ignored and its length is not increased; this is used to keep ‖φ(s)‖_2 bounded across time.
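To make the role of these quantities concrete, here is a minimal sketch (our own illustration; the names phi and critic_update are hypothetical, and the TD(0)-style update shown is only schematic, not the exact critic update of ACIL/NACIL) of the capped identity feature map φ(s) = s together with a linear critic w⊤φ(s) updated at rate β = 10^{-3}.

```python
import numpy as np

N = 2          # number of queues
CAP = 1000     # buffer size: queue lengths are capped so that ||phi(s)||_2 stays bounded
beta = 1e-3    # critic step size (beta = 10^-3 above)
gamma = 0.9    # discount factor

def phi(state):
    """Identity feature map phi(s) = s applied to the capped queue-length vector."""
    return np.minimum(np.asarray(state, dtype=float), CAP)

def critic_update(w, state, cost, next_state):
    """Schematic semi-gradient TD(0) step for a linear critic w^T phi(s) on costs."""
    td_error = cost + gamma * (w @ phi(next_state)) - (w @ phi(state))
    return w + beta * td_error * phi(state)

w = np.zeros(N)
w = critic_update(w, state=[3, 5], cost=8.0, next_state=[4, 4])
print(w)
```

Capping the queue lengths at 1000 is what keeps ‖φ(s)‖_2, and hence the size of each critic update, bounded over time.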
L. Additional Comments

Comment on the simple experimental settings. Given the progress in the field of RL, the motivating examples may seem simple enough to be trainable from scratch. Our main point, however, is that there are situations where one may have trained controllers for a range of environments in simulation, while the real-life environment differs from the simulated ones. We demonstrate that exploiting such basic pre-learnt controllers via our approach can help generate a better (meta) controller for a new, unseen environment, instead of learning a controller for the new environment from scratch.

On characterizing the performance of the optimal mixture policy. As correctly noticed by a reviewer, the inverted pendulum experiment showed that the optimal mixture policy can vastly outperform the component controllers. Currently, however, we do not provide theoretical guarantees for this gap, since it depends on the structure of the policy space and the underlying MDP, which makes a general characterization very challenging. We hope to explore this in future work.

M. Discussion

We have considered the problem of taking a menu of baseline controllers and combining them via improper probabilistic mixtures to form a superior controller. In many relevant MDP learning settings, we saw that this is indeed possible, and the policy gradient and actor-critic based analyses indicate that this approach may be widely applicable.

This work opens up a plethora of avenues. One can consider a richer class of mixtures that looks at the current state and mixes accordingly; for example, an attention model, or other state-dependent models, could be used to choose which controller to apply. Another direction is to artificially force switching across controllers to occur less frequently than every round. This can help create momentum and allow the controlled process to mix better when complex controllers are used.

A few caveats are in order regarding the potential societal impact and consequences of this work. This paper offers a way of combining or blending a given class of decision-making entities in the hope of producing a better one. In this process, the definitions of what constitutes optimal or expected behavior from a policy are likely to be subjective, and may encode biases and attitudes of the system designer(s). More importantly, it is possible that the base policy class (or some of its elements) has undesirable properties to begin with (e.g., bias or insensitivity), which could get amplified in the improper learning process as an unintended outcome. We advise ample caution to practitioners who contemplate adopting this method.

Finally, in the present setting the base controllers are fixed. It would be interesting to consider adding adaptive or learning controllers alongside the fixed ones. Including the fixed base controllers can provide a baseline performance below which the performance of the learning controllers would not drop.