# Asynchronous Coagent Networks

James E. Kostas\*, Chris Nota\*, Philip S. Thomas (\*equal contribution). College of Information and Computer Sciences, University of Massachusetts, Amherst, MA, USA. Correspondence to: James E. Kostas.

Coagent policy gradient algorithms (CPGAs) are reinforcement learning algorithms for training a class of stochastic neural networks called coagent networks. In this work, we prove that CPGAs converge to locally optimal policies. Additionally, we extend prior theory to encompass asynchronous and recurrent coagent networks. These extensions facilitate the straightforward design and analysis of hierarchical reinforcement learning algorithms like the option-critic, and eliminate the need for complex derivations of customized learning rules for these algorithms.

## 1. Introduction

Reinforcement learning (RL) policies are often represented by stochastic neural networks (SNNs). SNNs are networks where the outputs of some units are not deterministic functions of the units' inputs. Several classes of algorithms from various branches of RL research, such as those using options (Sutton et al., 1999) or hierarchical architectures (Bacon et al., 2017), can be formulated as using SNN policies (see Section 2 for more examples). In this paper we study the problem of deriving learning rules for RL agents with SNN policies.

Coagent networks are one formulation of SNN policies for RL agents (Thomas & Barto, 2011). Coagent networks are comprised of conjugate agents, or coagents; each coagent is an RL algorithm learning and acting cooperatively with the other coagents in its network. In this paper, we focus specifically on the case where each coagent is a policy gradient RL algorithm, and call the resulting algorithms coagent policy gradient algorithms (CPGAs). Intuitively, CPGAs cause each individual coagent to view the other coagents as part of the environment. That is, individual coagents learn and interact with the combination of both the environment and the other coagents as if this combination were a single environment.

Typically, algorithm designers using SNNs must create specialized training algorithms for their architectures and prove the correctness of these algorithms. CPGAs provide an alternative: they allow algorithm designers to easily derive stochastic gradient update rules for the wide variety of policy architectures that can be represented as SNNs. To analyze a given policy architecture (SNN), one must simply identify the inputs and outputs of each coagent in the network. The theory in this paper then immediately provides a policy gradient (stochastic gradient ascent) learning rule for each coagent, providing a simple mechanism for obtaining convergent update rules for complex policy architectures. This process is applicable to SNNs across several branches of RL research.

This paper also extends that theory and the theory in prior work to encompass recurrent and asynchronous coagent networks. A network is recurrent if it contains cycles. A network is asynchronous if units in the neural network do not execute simultaneously or at the same rate. The latter property allows distributed implementations of large neural networks to operate asynchronously.
Additionally, a coagent network's capacity for temporal abstraction (learning, reasoning, and acting across different scales of time and task) may be enhanced, not just through the network topology, but by designing networks where different coagents execute at different rates. Furthermore, these extensions facilitate the straightforward design and analysis of hierarchical RL algorithms like the option-critic.

The contributions of this paper are: 1) a complete and formal proof of a key CPGA result on which this paper relies (prior work provides an informal and incomplete proof), 2) a generalization of the CPGA framework to handle asynchronous recurrent networks, 3) empirical support of our theoretical claims regarding the gradients of asynchronous CPGAs, and 4) a proof that asynchronous CPGAs generalize the option-critic framework, which serves as a demonstration of how CPGAs eliminate the need for the derivation of custom learning rules for architectures like the option-critic. Our simple mechanistic approach to gradient derivation for the option-critic is a clear example of the usefulness of the coagent framework to any researcher or algorithm designer creating or analyzing stochastic networks for RL.

## 2. Related Work

Klopf (1982) theorized that traditional models of classical and operant conditioning could be explained by modeling biological neurons as hedonistic, that is, seeking excitation and avoiding inhibition. The ideas motivating coagent networks bear a deep resemblance to Klopf's proposal. Stochastic neural networks have applications dating back at least to Marvin Minsky's stochastic neural analog reinforcement calculator, built in 1951 (Russell & Norvig, 2016). Research on stochastic learning automata continued this work (Narendra & Thathachar, 1989); one notable example is the adaptive reward-penalty learning rule for training stochastic networks (Barto, 1985). Similarly, Williams (1992) proposed the well-known REINFORCE algorithm with the intent of training stochastic networks. Since then, REINFORCE has primarily been applied to deterministic networks. However, Thomas (2011) proposed CPGAs for RL, building on the original intent of Williams (1992).

Since their introduction, CPGAs have proven to be a practical tool for defining and improving RL agents. CPGAs have been used to discover motor primitives in simulated robotic control tasks and to solve RL problems with high-dimensional action spaces (Thomas & Barto, 2012). They are the RL precursor to the more general stochastic computation graphs. The formalism of stochastic computation graphs was proposed to describe networks with a mixture of stochastic and deterministic nodes, with applications to supervised learning, unsupervised learning, and RL (Schulman et al., 2015).

Several recently proposed approaches fit into the formalism of stochastic networks, but the relationship has frequently gone unnoticed. One notable example is the option-critic architecture (Bacon et al., 2017). The option-critic provides a framework for learning options (Sutton et al., 1999), a type of high-level and temporally extended action, and how to choose between options. Below, we show that this framework is a special case of coagent networks. Subsequent work, such as the double actor-critic architecture for learning options (Zhang & Whiteson, 2019) and the hierarchical option-critic (Riemer et al., 2018), also fits within and can be informed by the coagent framework.
The Horde architecture (Sutton et al., 2011) is similar to the coagent framework in that it consists of independent RL agents working cooperatively. However, the Horde architecture does not cause the collection of all agents to follow the gradient of the expected return with respect to each agent's parameters, the property that allows us to show the convergence of CPGAs.

CPGAs can be viewed as a special case of multi-agent reinforcement learning (MARL), the application of RL in settings where multiple agents exist and interact. However, MARL differs from CPGAs in that MARL agents typically have separate manifestations within the environment; additionally, the objectives of the MARL agents may or may not be aligned. Working within the MARL framework, researchers have proposed a variety of principled algorithms for cooperative multi-agent learning (Guestrin et al., 2002; Zhang et al., 2007; Liu et al., 2014). An overview of MARL is given by Buşoniu et al. (2010).

## 3. Background

We consider an MDP, $M = (\mathcal{S}, \mathcal{A}, \mathcal{R}, P, R, d_0, \gamma)$, where $\mathcal{S}$ is the finite set of possible states, $\mathcal{A}$ is the finite set of possible actions, and $\mathcal{R}$ is the finite set of possible rewards (although this work extends to MDPs where these sets are infinite and uncountable, the assumption that they are finite simplifies notation). Let $t \in \{0, 1, 2, \dots\}$ denote the time step. $S_t$, $A_t$, and $R_t$ are the state, action, and reward at time $t$, and are random variables that take values in $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$, respectively. $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the transition function, given by $P(s, a, s') := \Pr(S_{t+1}=s' \mid S_t=s, A_t=a)$. $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \times \mathcal{R} \to [0,1]$ is the reward distribution, given by $R(s, a, s', r) := \Pr(R_t=r \mid S_t=s, A_t=a, S_{t+1}=s')$. The initial state distribution, $d_0 : \mathcal{S} \to [0,1]$, is given by $d_0(s) := \Pr(S_0=s)$. The reward discount parameter is $\gamma \in [0,1]$. An episode is a sequence of states, actions, and rewards, starting from $t=0$ and continuing indefinitely. We assume that the discounted sum of rewards over an episode is finite.

A policy, $\pi : \mathcal{S} \times \mathcal{A} \to [0,1]$, is a stochastic method of selecting actions, such that $\pi(s, a) := \Pr(A_t=a \mid S_t=s)$. A parameterized policy is a policy that takes a parameter vector $\theta \in \mathbb{R}^n$. Different parameter vectors result in different policies. More formally, we redefine the symbol $\pi$ to denote a parameterized policy, $\pi : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^n \to [0,1]$, such that for all $\theta \in \mathbb{R}^n$, $\pi(\cdot, \cdot, \theta)$ is a policy. We assume that $\partial \pi(s, a, \theta)/\partial \theta$ exists for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $\theta \in \mathbb{R}^n$. An agent's goal is typically to find a policy that maximizes the objective function $J(\pi) := \mathbf{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid \pi\right]$, where conditioning on $\pi$ denotes that, for all $t$, $A_t \sim \pi(S_t, \cdot)$. The state-value function, $v^{\pi} : \mathcal{S} \to \mathbb{R}$, is defined as $v^{\pi}(s) := \mathbf{E}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid S_t=s, \pi\right]$. The discounted return, $G_t$, is defined as $G_t := \sum_{k=0}^{\infty} \gamma^k R_{t+k}$. We denote the objective function for a policy that has parameters $\theta$ as $J(\theta)$, and condition probabilities on $\theta$ to denote that the parameterized policy uses parameter vector $\theta$.

A coagent network is a parameterized policy that consists of an acyclic network of nodes (coagents), which do not share parameters. Each coagent can have several inputs that may include the state at time $t$, a noisy and incomplete observation of the state at time $t$, and/or the outputs of other coagents.

Figure 1. Diagram of the three-step process for action generation for a fully connected feedforward network (we do not require the network to have this structure). The circle in the middle denotes the $i$th coagent. First, preceding nodes are executed to compute the inputs to this coagent. Second, the coagent uses these inputs to produce its output, $U_t$. Third, the remainder of the network is executed to produce an action. For each coagent, all inputs are optional; that is, our approach encompasses architectures where $S_t$ and/or components of $U^{\text{pre}}_t$ are not provided to some coagents.
When considering the $i$th coagent, $\theta$ can be partitioned into two vectors, $\theta_i \in \mathbb{R}^{n_i}$ (the parameters used by the $i$th coagent) and $\bar{\theta}_i \in \mathbb{R}^{n - n_i}$ (the parameters used by all other coagents). From the point of view of the $i$th coagent, $A_t$ is produced from $S_t$ in three stages: execution of the nodes prior to the $i$th coagent (nodes whose outputs are required to compute the input to the $i$th coagent), execution of the $i$th coagent, and execution of the remaining nodes in the network to produce the final action. This process is depicted graphically in Figure 1 and described in detail below.

First, we define a parameterized distribution $\pi^{\text{pre}}_i(S_t, \cdot, \bar{\theta}_i)$ to capture how the previous coagents in the network produce their outputs given the current state. The output of the previous coagents is a random variable, which we denote by $U^{\text{pre}}_t$, and which takes continuous and/or discrete values in some set $\mathcal{U}^{\text{pre}}$. $U^{\text{pre}}_t$ is sampled from the distribution $\pi^{\text{pre}}_i(S_t, \cdot, \bar{\theta}_i)$. Next, the $i$th coagent takes $S_t$ and $U^{\text{pre}}_t$ as input. We denote this input, $(S_t, U^{\text{pre}}_t)$, as $X_t$ (or $X^i_t$ if it is not unambiguously referring to the $i$th coagent), and refer to $X_t$ as the local state. Given this input, the coagent produces the output $U^i_t$ (below, when it is unambiguously referring to the output of the $i$th coagent, we make the $i$ implicit and write $U_t$). The conditional distribution of $U^i_t$ is given by the $i$th coagent's policy, $\pi_i(X_t, \cdot, \theta_i)$. Although we allow the $i$th coagent's output to depend directly on $S_t$, it may be parameterized to only depend on $U^{\text{pre}}_t$. Finally, $A_t$ is sampled according to a distribution $\pi^{\text{post}}_i(X_t, U^i_t, \cdot, \bar{\theta}_i)$, which captures how the subsequent coagents in the network produce $A_t$. Below, we sometimes make $\theta_i$ and $\bar{\theta}_i$ implicit and write the three policy functions as $\pi^{\text{pre}}_i(S_t, \cdot)$, $\pi_i(X_t, \cdot)$, and $\pi^{\text{post}}_i(X_t, U^i_t, \cdot)$. Figure 2 depicts the setup that we have described and makes relevant independence properties clear.

## 4. The Coagent Policy Gradient Theorem

Consider what would happen if the $i$th coagent ignored all of the complexity in this problem setup and simply learned and interacted with the combination of both the environment and the other coagents as if this combination were a single environment. From the $i$th coagent's point of view, the actions would be $U_t$, the rewards would remain $R_t$, and the state would be $X_t$ (that is, $S_t$ and $U^{\text{pre}}_t$). Note that the coagent may ignore components of this local state, such as the $S_t$ component. Each coagent could naively implement an unbiased policy gradient algorithm, like REINFORCE (Williams, 1992), based only on these inputs and outputs. We refer to the expected update in this setting as the local policy gradient, $\Delta_i$, for the $i$th coagent. Formally, the local policy gradient of the $i$th coagent is:

$$\Delta_i(\theta_i) := \mathbf{E}\left[\sum_{t=0}^{\infty} \gamma^t\, G_t\, \frac{\partial \ln\left(\pi_i(X_t, U_t, \theta_i)\right)}{\partial \theta_i} \;\middle|\; \theta\right].$$

The local policy gradient should not be confused with $\nabla J(\theta)$, which we call the global policy gradient, or with the $i$th component of $\nabla J(\theta)$, $\partial J(\theta)/\partial \theta_i$ (notice that $\nabla J(\theta) = \left[\left(\partial J(\theta)/\partial \theta_1\right)^{\!\top}, \dots, \left(\partial J(\theta)/\partial \theta_m\right)^{\!\top}\right]^{\!\top}$, where $m$ is the number of coagents).
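To make this definition concrete, the following minimal sketch (Python with NumPy; the linear-softmax coagent policy and the logging format are illustrative assumptions, not from the paper) computes a REINFORCE-style sample of the local policy gradient for a single coagent from one logged episode of local states $X_t$, outputs $U_t$, and rewards $R_t$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def local_policy_gradient_estimate(theta_i, episode, gamma=0.99):
    """One unbiased REINFORCE-style sample of the ith coagent's local policy
    gradient, Delta_i(theta_i), under a linear-softmax coagent policy
    pi_i(x, u, theta_i) = softmax(theta_i @ phi(x))[u].

    theta_i : array of shape (n_outputs, n_features)
    episode : list of (phi_x, u, r) tuples, where phi_x is a feature vector for
              the local state X_t, u is the coagent's output U_t, and r is R_t.
    """
    rewards = np.array([r for (_, _, r) in episode], dtype=float)
    grad = np.zeros_like(theta_i)
    for t, (phi_x, u, _) in enumerate(episode):
        # Discounted return G_t from time t onward.
        k = np.arange(len(rewards) - t)
        G_t = float(np.sum((gamma ** k) * rewards[t:]))
        # d ln pi_i(X_t, U_t, theta_i) / d theta_i for a linear-softmax policy.
        probs = softmax(theta_i @ phi_x)
        dlogpi = -np.outer(probs, phi_x)
        dlogpi[u] += phi_x
        grad += (gamma ** t) * G_t * dlogpi
    return grad
```

A coagent running REINFORCE would then apply $\theta_i \leftarrow \theta_i + \alpha\,\texttt{grad}$; the coagent policy gradient theorem below is what justifies all coagents doing this simultaneously.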
Unlike a network following $\nabla J(\theta)$ or a coagent following its corresponding component of $\nabla J(\theta)$, a coagent following a local policy gradient faces a nonstationary sequence of partially observable MDPs as the other coagents (part of its environment) update and learn. One could naively design an algorithm that uses this local policy gradient and simply hope for good results, but without theoretical analysis, this hope would not be justified. The coagent policy gradient theorem (CPGT) justifies this approach: if, at a fixed value of $\theta$, all coagents update their parameters following unbiased estimates, $\hat{\Delta}_i(\theta_i)$, of their local policy gradients, then the entire network will follow an unbiased estimator of $\nabla J(\theta)$. For example, if every coagent performs the following update simultaneously at the end of each episode, then the entire network will be performing stochastic gradient ascent on $J$ (without using backpropagation):

$$\theta_i \leftarrow \theta_i + \alpha \sum_{t=0}^{\infty} \gamma^t\, G_t\, \frac{\partial \ln\left(\pi_i(X_t, U_t, \theta_i)\right)}{\partial \theta_i}.$$

In practice, one would use a more sophisticated policy gradient algorithm than this simple variant of REINFORCE.

Although Thomas & Barto (2011) present the CPGT in their Theorem 3, the provided proof is lacking in two ways. First, it is not general enough for our purposes because it only considers networks with two coagents. Second, it is missing a crucial step. They define a new MDP, the CoMDP, which models the environment faced by a coagent, but they do not show that this definition accurately describes the environment that the coagent faces. Without this step, Thomas & Barto (2011) have shown that there is a new MDP for which the policy gradient is a component of $\nabla J(\theta)$, but not that this MDP has any relation to the coagent network.

Figure 2. Bayesian network depicting the relationships of relevant random variables.

### 4.1. The Conjugate Markov Decision Process (CoMDP)

In order to reason about the local policy gradient, we begin by modeling the $i$th coagent's environment as an MDP, called the CoMDP. Given $M$, $i$, $\pi^{\text{pre}}_i$, $\pi^{\text{post}}_i$, and $\bar{\theta}_i$, we define a corresponding CoMDP, $\tilde{M}_i$, as $\tilde{M}_i := (\mathcal{X}_i, \mathcal{U}_i, \tilde{\mathcal{R}}_i, \tilde{P}_i, \tilde{R}_i, \tilde{d}^i_0, \tilde{\gamma}_i)$. Although our proof of the CPGT relies heavily on these definitions, due to space limitations the formal definitions for each of these terms are provided in Section A of the supplementary material. A brief summary of definitions follows.

We write $\tilde{X}^i_t$, $\tilde{U}^i_t$, and $\tilde{R}^i_t$ to denote the state, action, and reward of $\tilde{M}_i$ at time $t$. These random variables in the CoMDP are written with tildes to provide a visual distinction between terms from the CoMDP and the original MDP. Additionally, when it is clear that we are referring to the $i$th CoMDP, we often make $i$ implicit (for example, we might write $\tilde{X}^i_t$ as $\tilde{X}_t$). The input (analogous to the state) to the $i$th coagent is in the set $\mathcal{X}_i := \mathcal{S} \times \mathcal{U}^{\text{pre}}_i$. For $x \in \mathcal{X}_i$, we denote the $\mathcal{S}$ component as $x.s$ and the $\mathcal{U}^{\text{pre}}_i$ component as $x.u^{\text{pre}}$. We also sometimes denote an $x \in \mathcal{X}_i$ as $(x.s, x.u^{\text{pre}})$. $\mathcal{U}_i$ is an arbitrary set that denotes the outputs of the $i$th coagent. $\tilde{\mathcal{R}}_i := \mathcal{R}$ and $\tilde{\gamma}_i := \gamma$ represent the reward set and the discount factor, respectively. We denote the transition function as $\tilde{P}_i(x, u, x')$, the reward function as $\tilde{R}_i(x, u, x', r)$, and the initial state distribution as $\tilde{d}^i_0(x)$. We write $\tilde{J}_i(\theta_i)$ to denote the objective function of $\tilde{M}_i$.
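Operationally, the CoMDP is just the environment a single coagent experiences when the rest of the network is folded into the dynamics. The sketch below (Python; the `reset()`/`step()` environment interface and the sampler names are assumptions for illustration, not definitions from the paper) wraps a generic episodic environment together with fixed $\pi^{\text{pre}}_i$ and $\pi^{\text{post}}_i$ samplers so that the $i$th coagent can interact with it like any other MDP.

```python
class CoMDP:
    """A sketch of the ith coagent's local environment: the original MDP plus
    all other coagents, folded into the transition dynamics.

    env     : episodic environment with reset() -> s and step(a) -> (s', r, done)
    pi_pre  : callable s -> u_pre, sampling the outputs of the preceding coagents
    pi_post : callable (s, u_pre, u) -> a, sampling the remaining coagents to
              produce the network's action
    """

    def __init__(self, env, pi_pre, pi_post):
        self.env, self.pi_pre, self.pi_post = env, pi_pre, pi_post

    def reset(self):
        s = self.env.reset()
        self._x = (s, self.pi_pre(s))        # local state X_0 = (S_0, U^pre_0)
        return self._x

    def step(self, u):
        s, u_pre = self._x
        a = self.pi_post(s, u_pre, u)        # remaining coagents produce A_t
        s_next, r, done = self.env.step(a)   # original MDP transitions as usual
        self._x = (s_next, self.pi_pre(s_next))
        return self._x, r, done              # (X_{t+1}, R_t, terminal flag)
```

The properties in the next subsection establish that this local environment really is an MDP and that its dynamics match what the coagent observes inside the full network.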
### 4.2. The CoMDP Models the Coagent's Environment

Here we show that our definition of the CoMDP correctly models the coagent's environment. We do so by presenting a series of properties and lemmas that each establish different components of the relationship between the CoMDP and the environment faced by a coagent. The proofs of these properties and theorems are provided in Section B of the supplementary material.

In Properties 1 and 2, by manipulating the definitions of $\tilde{d}^i_0$ and $\pi^{\text{pre}}_i$, we show that $\tilde{d}^i_0$ and the distribution of $\tilde{X}_0.s$ capture the distribution of the inputs to the $i$th coagent.

Property 1. For all $x \in \mathcal{X}_i$, $\tilde{d}^i_0(x) = \Pr(S_0 = x.s,\, U^{\text{pre}}_0 = x.u^{\text{pre}})$.

Property 2. For all $s \in \mathcal{S}$, $\Pr(\tilde{X}_0.s = s) = d_0(s)$.

In Property 3, we show that $\tilde{P}_i$ captures the distribution of the inputs that the $i$th coagent will see given the input at the previous step and the output that it selected.

Property 3. For all $x \in \mathcal{X}_i$, $x' \in \mathcal{X}_i$, and $u \in \mathcal{U}_i$, $\tilde{P}_i(x, u, x') = \Pr(S_{t+1} = x'.s,\, U^{\text{pre}}_{t+1} = x'.u^{\text{pre}} \mid S_t = x.s,\, U^{\text{pre}}_t = x.u^{\text{pre}},\, U_t = u)$.

In Property 4, we show that $\tilde{R}_i$ captures the distribution of the rewards that the $i$th coagent receives given the output that it selected and the inputs at the current and next steps.

Property 4. For all $x \in \mathcal{X}_i$, $x' \in \mathcal{X}_i$, $u \in \mathcal{U}_i$, and $r \in \mathcal{R}$, $\tilde{R}_i(x, u, x', r) = \Pr(R_t = r \mid S_t = x.s,\, U^{\text{pre}}_t = x.u^{\text{pre}},\, U_t = u,\, S_{t+1} = x'.s,\, U^{\text{pre}}_{t+1} = x'.u^{\text{pre}})$.

In Properties 5 and 6, we show that the distributions of $\tilde{X}_t$ and $\tilde{X}_t.s$ capture the distribution of inputs to the $i$th coagent.

Property 5. For all $s \in \mathcal{S}$ and $u^{\text{pre}} \in \mathcal{U}^{\text{pre}}_i$, $\Pr(\tilde{X}_t = (s, u^{\text{pre}})) = \Pr(S_t = s,\, U^{\text{pre}}_t = u^{\text{pre}})$.

Property 6. For all $s \in \mathcal{S}$, $\Pr(\tilde{X}_t.s = s) = \Pr(S_t = s)$.

In Property 7, we show that the distribution of $\tilde{X}_t.u^{\text{pre}}$ given $\tilde{X}_t.s$ captures the distribution $\pi^{\text{pre}}_i$.

Property 7. For all $s \in \mathcal{S}$ and $u^{\text{pre}} \in \mathcal{U}^{\text{pre}}_i$, $\Pr(\tilde{X}_t.u^{\text{pre}} = u^{\text{pre}} \mid \tilde{X}_t.s = s) = \pi^{\text{pre}}_i(s, u^{\text{pre}})$.

In Property 8, we show that the distribution of $\tilde{X}_{t+1}.s$ given $\tilde{X}_t.s$, $\tilde{X}_t.u^{\text{pre}}$, and $\tilde{U}_t$ captures the distribution of the $\mathcal{S}$ component of the input that the $i$th coagent will see given the input at the previous step and the output that it selected.

Property 8. For all $s \in \mathcal{S}$, $s' \in \mathcal{S}$, $u^{\text{pre}} \in \mathcal{U}^{\text{pre}}_i$, and $u \in \mathcal{U}_i$, $\Pr(\tilde{X}_{t+1}.s = s' \mid \tilde{X}_t.s = s,\, \tilde{X}_t.u^{\text{pre}} = u^{\text{pre}},\, \tilde{U}_t = u) = \Pr(S_{t+1} = s' \mid S_t = s,\, U^{\text{pre}}_t = u^{\text{pre}},\, U_t = u)$.

In Property 9, we show that, given the $\mathcal{S}$ component of the input, the $\mathcal{U}^{\text{pre}}_i$ component of the input that the $i$th coagent will see is independent of the previous input and output.

Property 9. For all $s \in \mathcal{S}$, $s' \in \mathcal{S}$, $u^{\text{pre}} \in \mathcal{U}^{\text{pre}}_i$, $u'^{\text{pre}} \in \mathcal{U}^{\text{pre}}_i$, and $u \in \mathcal{U}_i$, $\Pr(\tilde{X}_{t+1}.u^{\text{pre}} = u'^{\text{pre}} \mid \tilde{X}_{t+1}.s = s') = \Pr(\tilde{X}_{t+1}.u^{\text{pre}} = u'^{\text{pre}} \mid \tilde{X}_{t+1}.s = s',\, \tilde{X}_t = (s, u^{\text{pre}}),\, \tilde{U}_t = u)$.

In Property 10, we use the preceding properties to show that the distribution of $\tilde{R}^i_t$ captures the distribution of the rewards that the $i$th coagent receives.

Property 10. For all $r \in \mathcal{R}$, $\Pr(R_t = r) = \Pr(\tilde{R}^i_t = r)$.

We then use Properties 3 and 4 and the definition of $\tilde{M}_i$ to show that:

Lemma 1. $\tilde{M}_i$ is a Markov decision process.

Finally, in Lemma 2, we use the properties above to show that the CoMDP $\tilde{M}_i$ (built from $M$, $i$, $\pi^{\text{pre}}_i$, $\pi^{\text{post}}_i$, and $\bar{\theta}_i$) correctly models the local environment of the $i$th coagent.

Lemma 2. For all $M$, $i$, $\pi^{\text{pre}}_i$, $\pi^{\text{post}}_i$, and $\bar{\theta}_i$, and given a coagent policy parameterized by $\theta_i$, the corresponding CoMDP $\tilde{M}_i$ satisfies Properties 1–6 and Property 10.

Lemma 2 is stated more specifically and formally in the supplementary material.

### 4.3. The Coagent Policy Gradient Theorem

Again, all proofs of the properties and theorems below are provided in Section B of the supplementary material. We use Property 10 to show that $M$'s objective, $J(\theta)$, is equivalent to the objective $\tilde{J}_i(\theta_i)$ of the $i$th CoMDP.

Property 11. For all coagents $i$ and all $\theta_i$, given the same $\theta = (\theta_i, \bar{\theta}_i)$, $J(\theta) = \tilde{J}_i(\theta_i)$.
Next, using Lemmas 1 and 2, we show that the local policy gradient, $\Delta_i$ (the expected value of the naive REINFORCE update), is equivalent to the gradient $\partial \tilde{J}_i(\theta_i)/\partial \theta_i$ of the $i$th CoMDP.

Lemma 3. For all coagents $i$ and all $\theta_i$, $\partial \tilde{J}_i(\theta_i)/\partial \theta_i = \Delta_i(\theta_i)$.

The CPGT states that the local policy gradients are the components of the global policy gradient. Notice that this intuitively follows by transitivity from Property 11 and Lemma 3.

Theorem 1 (Coagent Policy Gradient Theorem). $\nabla J(\theta) = \left[\Delta_1(\theta_1)^{\!\top}, \Delta_2(\theta_2)^{\!\top}, \dots, \Delta_m(\theta_m)^{\!\top}\right]^{\!\top}$, where $m$ is the number of coagents and $\Delta_i$ is the local policy gradient of the $i$th coagent.

Corollary 1. If $\alpha_t$ is a deterministic positive step size with $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$, additional technical assumptions are met (Bertsekas & Tsitsiklis, 2000, Proposition 3), and each coagent updates its parameters, $\theta_i$, with an unbiased local policy gradient update $\theta_i \leftarrow \theta_i + \alpha_t \hat{\Delta}_i(\theta_i)$, then $J(\theta)$ converges to a finite value and $\lim_{t \to \infty} \nabla J(\theta) = 0$.

## 5. Asynchronous Recurrent Networks

Having formally established the CPGT, we now turn to extending the CPGA framework to asynchronous and cyclic networks: networks where the coagents execute (that is, look at their local state and choose actions) asynchronously and without any necessary order. This extension allows for distributed implementations, where nodes may not execute synchronously. It also facilitates temporal abstraction, since, by varying coagent execution rates, one can design algorithms that learn and act across different levels of abstraction.

We first consider how we may modify an MDP to allow coagents to execute at arbitrary points in time, including at points in between our usual time steps. Our motivation is to consider continuous time. As a theoretical construct, we can approximate continuous time with arbitrarily high precision by breaking a time step of the MDP into an arbitrarily large number of shorter steps, which we call atomic time steps. We assume that the environment performs its usual update regularly every $n \in \mathbb{Z}^+$ atomic time steps, and that each coagent executes (chooses an output in its respective $\mathcal{U}_i$) at each atomic time step with some probability, given by an arbitrary but fixed distribution that may be conditioned on the local state. On atomic time steps where the $i$th coagent does not execute, it continues to output the last chosen action in $\mathcal{U}_i$ until its next execution. The duration of atomic time steps can be arbitrarily small to allow for arbitrarily close approximations to continuous time or to model, for example, a CPU cluster that performs billions of updates per second. The objective is still the expected value of $G_0$, the discounted sum of rewards from all atomic time steps: $J(\theta) = \mathbf{E}[G_0 \mid \theta] = \mathbf{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid \theta\right]$. (Note that $\gamma$ in this equation is the $n$th root of the original $\gamma$, where $n$ is the number of atomic time steps per environment update.)

Next, we extend the coagent framework to allow cyclic connections. Previously, we considered a coagent's local state to be captured by $X^i_t = (S_t, U^{\text{pre}}_t)$, where $U^{\text{pre}}_t$ is some combination of outputs from coagents that come before the $i$th coagent topologically. We now allow coagents to also consider the output of all $m$ coagents on the previous time step, $U^{\text{all}}_{t-1} = (U^1_{t-1}, U^2_{t-1}, \dots, U^m_{t-1})$. In the new setting, the local state at time $t$ is therefore given by $X^i_t = (S_t, U^{\text{pre}}_t, U^{\text{all}}_{t-1})$. The corresponding local state set is given by $\mathcal{X}_i = \mathcal{S} \times \mathcal{U}^{\text{pre}}_i \times \mathcal{U}_1 \times \cdots \times \mathcal{U}_m$. In this construction, when $t = 0$, we must consider some initial output of each coagent, $U^{\text{all}}_{-1}$.
For the $i$th coagent, we define $U^i_{-1}$ to be drawn from some initial distribution, $h^i_0$, such that for all $u \in \mathcal{U}_i$, $h^i_0(u) = \Pr(U^i_{-1} = u)$.

We redefine how each coagent selects actions in the asynchronous setting. First, we define a random variable, $E^i_t$, the value of which is 1 if the $i$th coagent executes on atomic time step $t$, and 0 otherwise. Each coagent has a fixed execution function, $\beta_i : \mathcal{X}_i \times \mathbb{N} \to [0, 1]$, which defines the probability of the $i$th coagent executing on time step $t$, given the coagent's local state. That is, for all $x \in \mathcal{X}_i$, $\beta_i(x, t) := \Pr(E^i_t = 1 \mid X^i_t = x)$. Finally, the action that the $i$th coagent selects at time $t$, $U^i_t$, is sampled from $\pi_i(X^i_t, \cdot, \theta_i)$ if $E^i_t = 1$, and is $U^i_{t-1}$ otherwise. That is, if the agent does not execute on atomic time step $t$, then it repeats its action from time $t-1$.

We cannot directly apply the CPGT to this setting: the policy and environment are non-Markovian. That is, we cannot determine the distribution over the output of the network given only the current state, $S_t$, since the output may also depend on $U^{\text{all}}_{t-1}$. However, we show that the asynchronous setting can be reduced to the acyclic, synchronous setting using formulaic changes to the state set, transition function, and network structure. This allows us to derive an expression for the gradient with respect to the parameters of the original, asynchronous network, and thus to train such a network. We prove a result similar to the CPGT that allows us to update the parameters of each coagent using only states and actions from atomic time steps on which the coagent executes.

### 5.1. The CPGT for Asynchronous Networks

We first extend the definition of the local policy gradient, $\Delta_i$, to the asynchronous setting. In the synchronous setting, the local policy gradient captures the update that a coagent would perform if it was following an unbiased policy gradient algorithm using its local inputs and outputs. In the asynchronous setting, we capture the update that an agent would perform if it were to consider only the local inputs and outputs it sees when it executes. Formally, we define the asynchronous local policy gradient:

$$\Delta_i(\theta_i) := \mathbf{E}\left[\sum_{t=0}^{\infty} E^i_t\, \gamma^t\, G_t\, \frac{\partial \ln\left(\pi_i(X_t, U_t, \theta_i)\right)}{\partial \theta_i} \;\middle|\; \theta\right].$$

The only change from the synchronous version is the introduction of $E^i_t$. Note that when the coagent does not execute ($E^i_t = 0$), the entire inner expression is 0. In other words, these states and actions can be ignored. An algorithm estimating $G_t$ would still need to consider the rewards from every atomic time step, including time steps where the coagent does not execute. However, the algorithm may still be designed such that the coagents only perform a computation when executing. For example, during execution, coagents may be given the discounted sum of rewards since their last execution to serve as a summary of all rewards since that execution.

The important question is then: does something like the CPGT hold for the asynchronous local policy gradient? If each coagent executes a policy gradient algorithm using unbiased estimates of $\Delta_i$, does the network still perform gradient ascent on the asynchronous setting's objective, $J$? The answer turns out to be yes.

Theorem 2 (Asynchronous Coagent Policy Gradient Theorem). $\nabla J(\theta) = \left[\Delta_1(\theta_1)^{\!\top}, \Delta_2(\theta_2)^{\!\top}, \dots, \Delta_m(\theta_m)^{\!\top}\right]^{\!\top}$, where $m$ is the number of coagents and $\Delta_i$ is the asynchronous local policy gradient of the $i$th coagent.
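Before turning to the proof of Theorem 2, here is a minimal sketch (Python with NumPy; it reuses the illustrative linear-softmax coagent and logging format from the earlier sketch, with an added execution flag) of the only change the asynchronous setting makes to a coagent's update: each term is multiplied by the execution indicator $E^i_t$, while the return $G_t$ still sums rewards over every atomic time step.

```python
import numpy as np

def async_local_policy_gradient_estimate(theta_i, episode, gamma=0.99):
    """One sample of the asynchronous local policy gradient Delta_i(theta_i).

    episode : list of (phi_x, u, r, e) tuples logged at every atomic time step,
              where e = 1 if the ith coagent executed on that step, else 0.
    """
    rewards = np.array([r for (_, _, r, _) in episode], dtype=float)
    grad = np.zeros_like(theta_i)
    for t, (phi_x, u, _, e) in enumerate(episode):
        if e == 0:
            continue                              # E^i_t = 0: no contribution
        k = np.arange(len(rewards) - t)
        G_t = float(np.sum((gamma ** k) * rewards[t:]))  # rewards from all steps
        logits = theta_i @ phi_x
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        dlogpi = -np.outer(probs, phi_x)
        dlogpi[u] += phi_x
        grad += (gamma ** t) * G_t * dlogpi
    return grad
```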
Proof. The general approach is to show that for any MDP $M$, with an asynchronous network represented by $\pi$ with parameters $\theta$, there is an augmented MDP, $\hat{M}$, with objective $\hat{J}$ and an acyclic, synchronous network, $\hat{\pi}$, with the same parameters $\theta$, such that $\hat{J}(\theta) = J(\theta)$. Thus, we reduce the asynchronous problem to an equivalent synchronous problem. Applying the CPGT to this reduced setting allows us to derive Theorem 2.

The original MDP, $M$, is given by the tuple $(\mathcal{S}, \mathcal{A}, P, R, d_0, \gamma)$. We define the augmented MDP, $\hat{M}$, as the tuple $(\hat{\mathcal{S}}, \hat{\mathcal{A}}, \hat{P}, \hat{R}, \hat{d}_0, \hat{\gamma})$. We would like $\hat{M}$ to hold all of the information necessary for each coagent to compute its next output, including the previous outputs of all coagents. This will allow us to construct an acyclic version of the network to which we may apply the CPGT. We define $\mathcal{U}^{\text{all}} = \mathcal{U}_1 \times \mathcal{U}_2 \times \cdots \times \mathcal{U}_m$ to be the combined output set of all $m$ coagents in $\pi$, and $\mathcal{E} = \{0, 1\}^m$ to be the set of possible combinations of coagent executions. We define the state set to be $\hat{\mathcal{S}} = \mathcal{S} \times \mathcal{U}^{\text{all}}$ and the action set to be $\hat{\mathcal{A}} = \mathcal{A} \times \mathcal{U}^{\text{all}} \times \mathcal{E}$. We write the random variables representing the state, action, and reward at time $t$ as $\hat{S}_t$, $\hat{A}_t$, and $\hat{R}_t$, respectively. Additionally, we refer to the components of values $\hat{s} \in \hat{\mathcal{S}}$ and $\hat{a} \in \hat{\mathcal{A}}$ and the components of the random variables $\hat{S}_t$ and $\hat{A}_t$ using the same notation as for the components of $X_t$ above (for example, $\hat{s}.s$ is the $\mathcal{S}$ component of $\hat{s}$, $\hat{A}_t.u^{\text{all}}$ is the $\mathcal{U}^{\text{all}}$ component of $\hat{A}_t$, etc.). For vector components, we write the $i$th component of the vector using a subscript $i$ (for example, $\hat{s}.u^{\text{all}}_i$ is the $i$th component of $\hat{s}.u^{\text{all}}$).

The transition function, $\hat{P}$, captures the original transition function and the fact that $\hat{S}_{t+1}.u^{\text{all}} = \hat{A}_t.u^{\text{all}}$. For all $\hat{s}, \hat{s}' \in \hat{\mathcal{S}}$ and $\hat{a} \in \hat{\mathcal{A}}$, $\hat{P}(\hat{s}, \hat{a}, \hat{s}')$ is given by $P(\hat{s}.s, \hat{a}.a, \hat{s}'.s)$ if $\hat{s}'.u^{\text{all}} = \hat{a}.u^{\text{all}}$, and 0 otherwise. For all $\hat{s}, \hat{s}' \in \hat{\mathcal{S}}$, $\hat{a} \in \hat{\mathcal{A}}$, and $r \in \mathcal{R}$, the reward distribution is simply given by $\hat{R}(\hat{s}, \hat{a}, \hat{s}', r) = R(\hat{s}.s, \hat{a}.a, \hat{s}'.s, r)$. The initial state distribution, $\hat{d}_0$, captures the original state distribution and the initialization of each coagent. For all $\hat{s} \in \hat{\mathcal{S}}$, it is given by $\hat{d}_0(\hat{s}) = d_0(\hat{s}.s) \prod_{i=1}^{m} h^i_0(\hat{s}.u^{\text{all}}_i)$. The discount parameter is $\hat{\gamma} = \gamma$. The objective is the usual: $\hat{J}(\theta) = \mathbf{E}[\hat{G}_0 \mid \theta]$, where $\hat{G}_0 = \sum_{t=0}^{\infty} \hat{\gamma}^t \hat{R}_t$.

Next we define the synchronous network, $\hat{\pi}$, in terms of components of the original asynchronous network, $\pi$: specifically, each $\pi_i$, $\beta_i$, and $\theta_i$. We must modify the original network to accept inputs in $\hat{\mathcal{S}}$ and produce outputs in $\hat{\mathcal{A}}$. Recall that in the asynchronous network, the local state at time $t$ of the $i$th coagent is given by $X^i_t = (S_t, U^{\text{pre}}_t, U^{\text{all}}_{t-1})$. In the augmented MDP, the information in $U^{\text{all}}_{t-1}$ is contained in $\hat{S}_t$, so the local state of the $i$th coagent in the synchronous network is $\hat{X}^i_t = (\hat{S}_t, U^{\text{pre}}_t)$, with accompanying state set $\hat{\mathcal{X}}_i = \hat{\mathcal{S}} \times \mathcal{U}^{\text{pre}}_i$. To produce the $\mathcal{U}^{\text{all}}$ component of the action, $\hat{A}_t.u^{\text{all}}$, we append the output of each coagent to the action. In doing so, we have removed the need for cyclic connections, but still must deal with the asynchronous execution. The critical step is as follows: we represent each coagent in the asynchronous network by two coagents in the synchronous network, the first of which represents the execution function, $\beta_i$, and the second of which represents the original policy, $\pi_i$. At time step $t$, the first coagent accepts $\hat{X}^i_t$ and outputs 1 with probability $\beta_i((\hat{S}_t.s, U^{\text{pre}}_t, \hat{S}_t.u^{\text{all}}), t)$, and 0 otherwise. We append the output of every such coagent to the action in order to produce the $\mathcal{E}$ component of the action, $\hat{A}_t.e$.
Because the coagent representing $\beta_i$ executes before the coagent representing $\pi_i$, from the latter's perspective, the output of the former is present in $U^{\text{pre}}_t$; that is, $U^{\text{pre}}_t.e_i = \hat{A}_t.e_i$. If $U^{\text{pre}}_t.e_i = 1$, the coagent samples a new action from $\pi_i$. Otherwise, it repeats its previous action, which can be read from its local state (that is, $\hat{X}^i_t.u^{\text{all}}_i = U^i_{t-1}$). Formally, for all $(\hat{s}, u^{\text{pre}}) \in \hat{\mathcal{X}}_i$ and $\theta_i$, the probability of the latter coagent producing action $u \in \mathcal{U}_i$ is given by:

$$\hat{\pi}_i((\hat{s}, u^{\text{pre}}), u, \theta_i) := \begin{cases} \pi_i((\hat{s}.s,\, u^{\text{pre}},\, \hat{s}.u^{\text{all}}),\, u,\, \theta_i) & \text{if } u^{\text{pre}}.e_i = 1,\\ 1 & \text{if } u^{\text{pre}}.e_i = 0 \text{ and } \hat{s}.u^{\text{all}}_i = u,\\ 0 & \text{otherwise.} \end{cases}$$

This completes the description of $\hat{\pi}$. In the supplementary material, we prove that this network exactly captures the behavior of the asynchronous network; that is, $\hat{\pi}((s, u), (a, u', e), \theta) = \Pr(A_t = a,\, U^{\text{all}}_t = u',\, E_t = e \mid S_t = s,\, U^{\text{all}}_{t-1} = u,\, \theta)$ for all possible values of $s$, $u$, $a$, $u'$, $e$, and $\theta$ in their appropriate sets. The proof that $\hat{J}(\theta) = J(\theta)$ is given in Section C of the supplementary material, but it follows intuitively from the fact that 1) the hidden state of the network is now captured by the state set, 2) $\hat{\pi}$ accurately captures the dynamics of the hidden state, and 3) this hidden state does not materially affect the transition function or the reward distribution with respect to the original states and actions.

Having shown that the expected return in the asynchronous setting is equal to the expected return in the synchronous setting, we turn to deriving the asynchronous local policy gradient, $\Delta_i$. It follows from $\hat{J}(\theta) = J(\theta)$ that $\nabla \hat{J}(\theta) = \nabla J(\theta)$. Since $\hat{\pi}$ is a synchronous, acyclic network, and $\hat{M}$ is an MDP, we can apply the CPGT to find an expression for $\nabla \hat{J}(\theta)$. For the $i$th coagent in the synchronous network, this gives us:

$$\frac{\partial \hat{J}(\theta)}{\partial \theta_i} = \mathbf{E}\left[\sum_{t=0}^{\infty} \hat{\gamma}^t\, \hat{G}_t\, \frac{\partial \ln\left(\hat{\pi}_i((\hat{S}_t, U^{\text{pre}}_t), U_t, \theta_i)\right)}{\partial \theta_i} \;\middle|\; \theta\right].$$

Consider $\partial \ln\left(\hat{\pi}_i((\hat{S}_t, U^{\text{pre}}_t), U_t, \theta_i)\right)/\partial \theta_i$, which we abbreviate as $\partial \ln \hat{\pi}_i / \partial \theta_i$. When $U^{\text{pre}}_t.e_i = 0$, we know that the action is $U^i_t = \hat{S}_t.u^{\text{all}}_i = U^i_{t-1}$ regardless of $\theta$. Therefore, in these local states, $\partial \ln \hat{\pi}_i / \partial \theta_i$ is zero. When $U^{\text{pre}}_t.e_i = 1$, we see from the definition of $\hat{\pi}$ that $\partial \ln \hat{\pi}_i / \partial \theta_i = \partial \ln \pi_i / \partial \theta_i$. Therefore, we see that in all cases, $\partial \ln \hat{\pi}_i / \partial \theta_i = (U^{\text{pre}}_t.e_i)\, \partial \ln \pi_i / \partial \theta_i$. Substituting this into the above expression yields:

$$\mathbf{E}\left[\sum_{t=0}^{\infty} (U^{\text{pre}}_t.e_i)\, \hat{\gamma}^t\, \hat{G}_t\, \frac{\partial \ln\left(\pi_i((\hat{S}_t.s, U^{\text{pre}}_t, \hat{S}_t.u^{\text{all}}), U_t, \theta_i)\right)}{\partial \theta_i} \;\middle|\; \theta\right].$$

In the proof that $\hat{J}(\theta) = J(\theta)$ given in Section C of the supplementary material, we show that the distribution over all analogous random variables is equivalent in both settings (for example, for all $s \in \mathcal{S}$, $\Pr(S_t = s) = \Pr(\hat{S}_t.s = s)$). Substituting each of the random variables of $M$ into the above expression yields precisely the asynchronous local policy gradient, $\Delta_i$.

To empirically test the Asynchronous Coagent Policy Gradient Theorem (ACPGT), we compare the gradient ($\nabla J$) estimates of the ACPGT with a finite difference method. The results are presented in Figure 5 (Section D) of the supplementary material; this data provides empirical support for the ACPGT.
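The finite-difference comparison described above can be reproduced in outline as follows (a minimal sketch; `estimate_J` and `acpgt_gradient_estimate` are hypothetical stand-ins for a Monte Carlo return estimator and a coagent gradient estimator, not functions from the paper): perturb one parameter at a time, estimate $J$ on either side, and compare the centered differences against averaged ACPGT samples.

```python
import numpy as np

def finite_difference_gradient(estimate_J, theta, eps=1e-2, n_episodes=10000):
    """Centered finite-difference estimate of grad J(theta).

    estimate_J : callable (theta, n_episodes) -> Monte Carlo estimate of J(theta)
    """
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        d = np.zeros_like(theta)
        d.flat[j] = eps
        grad.flat[j] = (estimate_J(theta + d, n_episodes)
                        - estimate_J(theta - d, n_episodes)) / (2.0 * eps)
    return grad

# Up to Monte Carlo error, the two estimates should agree:
# fd    = finite_difference_gradient(estimate_J, theta)
# acpgt = np.mean([acpgt_gradient_estimate(theta) for _ in range(10000)], axis=0)
# print(np.max(np.abs(fd - acpgt)))
```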
## 6. Applications to Past and Future Work

Consider past RL approaches that use stochastic networks, such as stochastic computation graphs, hierarchical networks like the option-critic, or any other form of SNN; one could use the ACPGT to immediately obtain a learning rule for any of these varieties of SNN. In other words, the theory in the sections above facilitates the derivation of gradient and learning rules for arbitrary stochastic architectures. The ACPGT applies when there are recurrent connections. It applies in the asynchronous case, even when the units (coagents) have different rates or execution probabilities, and/or have execution functions that depend on the output of other units or the state. It also applies to architectures that are designed for temporal abstraction, like the option-critic depicted in Figure 3. Architectures adding additional levels of abstraction, such as the hierarchical option-critic (Riemer et al., 2018), can be analyzed with similar ease. Architectures designed for other purposes, such as partitioning the state space between different parts of the network (Sutton et al., 2011), or simplifying a high-dimensional action space (Thomas & Barto, 2012), can also be analyzed with the coagent framework.

The diversity of applications is not limited to network topology variations. For example, one could design an asynchronous deep neural network where each unit is a separate coagent. Alternatively, one could design an asynchronous deep neural network where each coagent is itself a deep neural network. In both cases, the coagents could run at different rates to facilitate temporal abstraction. In the latter case, the gradient generated by an algorithm following the ACPGT could be used to train each coagent internally with backpropagation. Notice that the former architecture is trained without backpropagation while the latter architecture combines an ACPGT algorithm with backpropagation: the ACPGT facilitates easy design and analysis for both types of architectures and algorithms.

Deriving the policy gradient for a particular coagent simply requires identifying the inputs and outputs and plugging them into the ACPGT formula. In this way, an algorithm designer can rapidly design a principled policy gradient algorithm for any stochastic network, even in the asynchronous and recurrent setting. In the next section, we give an example of this process.
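Before that example, the sketch below illustrates the second design above: a coagent that is itself a small neural network, trained internally with backpropagation by turning its sampled ACPGT update into a surrogate loss (PyTorch; the module, its sizes, and the logging format are illustrative assumptions, not an architecture from the paper).

```python
import torch
import torch.nn as nn

class NeuralCoagent(nn.Module):
    """A coagent whose policy pi_i is a small neural network over its local state."""

    def __init__(self, local_state_dim, n_outputs, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(local_state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_outputs),
        )

    def update(self, episode, optimizer, gamma=0.99):
        """One ACPGT/REINFORCE-style update from an episode of atomic time steps.

        episode : list of (x, u, r, e) with local-state tensor x, output index u,
                  reward r, and execution indicator e for this coagent.
        """
        # Discounted returns G_t computed over every atomic time step.
        G, returns = 0.0, []
        for (_, _, r, _) in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()

        loss = torch.zeros(())
        for t, (x, u, _, e) in enumerate(episode):
            if e == 0:
                continue                     # only execution steps enter the update
            logp = torch.distributions.Categorical(logits=self.net(x)).log_prob(
                torch.tensor(u))
            # Negated so gradient descent on `loss` ascends the local policy gradient.
            loss = loss - (gamma ** t) * returns[t] * logp

        if loss.requires_grad:
            optimizer.zero_grad()
            loss.backward()                  # backpropagation inside the coagent
            optimizer.step()
```

The gradient that flows through `loss.backward()` is a sampled asynchronous local policy gradient for this coagent, so Theorem 2 applies regardless of how each coagent computes its internal update.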
Figure 3. The option-critic framework, described by Bacon et al. (2017), depicted as a three-coagent network consisting of 1) $\beta$, which contains the termination functions for each option, 2) $\pi_\Omega$, the policy over options, and 3) $\pi_\omega$, which contains the option policies and selects the action $a_t$. Each option has a corresponding label. $\Omega$ is the set of the available option labels. The label corresponding to the option selected at time $t$ is denoted as $\omega_t$. $\beta$'s local state is $s_t$ and the previous option label, $\omega_{t-1}$ (the latter via a recurrent connection from $\pi_\Omega$). $\beta$ sends an action, $e_t \in \{0, 1\}$, to $\pi_\Omega$. $\pi_\Omega$'s execution function outputs 1 (a 100% chance of execution) if $e_t = 1$, and outputs 0 (no chance of execution) if $e_t = 0$. That is, $\pi_\Omega$ executes if and only if $e_t = 1$. $\pi_\Omega$'s local state is $s_t$ and $e_t$. If it executes, $\pi_\Omega$ outputs a new option label $\omega_t \in \Omega$. $\pi_\omega$ chooses an action, $a_t \in \mathcal{A}$, based on its local state, $s_t$ and $\omega_t$. Internally, $\pi_\omega$ looks up the option policy corresponding to the option label $\omega_t$ and selects an action based on that option.

## 7. Case Study: Option-Critic

The coagent framework allows one to design an arbitrary hierarchical architecture, and then immediately produce a learning rule for that architecture. In this section, we study the well-known option-critic framework (Bacon et al., 2017) as an example, demonstrating how the CPGT drastically simplifies the gradient derivation process. The option-critic framework aspires to many of the same goals as coagent networks: namely, hierarchical learning and temporal abstraction. We show that the architecture is equivalently described in terms of a simple, three-node coagent network, depicted and described in Figure 3. More formal and complete definitions are provided in Section E.1 of the supplemental material, but Figure 3 is sufficient for the understanding of this section.

Bacon et al. (2017) give gradients for two of the three coagents. $\pi_\omega$, the policy that selects the action, and $\beta$, the termination functions, are parameterized by weights $\theta$ and $\vartheta$, respectively. Bacon et al. (2017) gave the corresponding policy gradients, which we rewrite as:

$$\sum_{x \in (\mathcal{S} \times \Omega)} d^{\pi_\Omega}(x) \sum_{a \in \mathcal{A}} \frac{\partial \pi_\omega(x, a)}{\partial \theta}\, Q_U(x, a), \tag{1}$$

$$\sum_{x \in (\mathcal{S} \times \Omega)} d^{\pi_\Omega}(x)\, \frac{\partial \beta(x, 0)}{\partial \vartheta}\, A_\Omega(x.s, x.\omega), \tag{2}$$

where $d^{\pi_\Omega}(x)$ is a discounted weighting of state-option pairs, given by $d^{\pi_\Omega}(x) := \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = x.s, \omega_t = x.\omega)$, $Q_U(x, a)$ is the expected return from choosing option $x.\omega$ and action $a$ at state $x.s$ under the current policy, and $A_\Omega(s, \omega)$ is the advantage of choosing option $\omega$, given by $A_\Omega(s, \omega) = Q_\Omega(s, \omega) - V_\Omega(s)$, where $Q_\Omega(s, \omega)$ is the expected return from choosing option $\omega$ in state $s$, and $V_\Omega(s)$ is the expected return from beginning in state $s$ with no option selected.

Previously, the CPGT was written in terms of expected values. An equivalent expression of the local gradient for policy $\pi_i$ is the sum over the local state set, $\mathcal{X}_i$, and the local action set, $\mathcal{U}_i$:

$$\partial J(\theta)/\partial \theta_i = \sum_{x \in \mathcal{X}_i} d^{\pi}_i(x) \sum_{u \in \mathcal{U}_i} \frac{\partial \pi_i(x, u, \theta_i)}{\partial \theta_i}\, Q_i(x, u),$$

where $Q_i(x, u) = \mathbf{E}[G_t \mid X^i_t = x, U^i_t = u]$. Deriving the policy gradient for a particular coagent simply requires identifying the inputs and outputs and plugging them into this formula.

First consider the policy gradient for $\pi_\omega$, that is, $\partial J / \partial \theta$: the input set is $\mathcal{X}_\omega = \mathcal{S} \times \Omega$, and the action set is $\mathcal{A}$. The local discounted state distribution (the $d^{\pi}_i$ term) is given exactly by $d^{\pi_\Omega}$, and the local state-action value function (the $Q_i$ term) is given exactly by $Q_U$. Directly substituting these terms into the CPGT immediately yields (1). Note that this derivation is completely trivial using the CPGT: only direct substitution is required. In contrast, the original derivation from Bacon et al. (2017) required a degree of complexity and several steps.

Next consider $\partial J / \partial \vartheta$. The input set is again $\mathcal{X}_\beta = \mathcal{S} \times \Omega$, but the action set is $\{0, 1\}$. The local state distribution is again the distribution over state-option pairs, given by $d^{\pi_\Omega}$. The CPGT expression gives us

$$\sum_{x \in (\mathcal{S} \times \Omega)} d^{\pi_\Omega}(x) \sum_{u \in \{0, 1\}} \frac{\partial \beta(x, u)}{\partial \vartheta}\, Q_\beta(x, u).$$

In Section E.2 of the supplementary material we show that this is equivalent to (2). Notice that, unlike $\pi_\Omega$, both of these coagents execute every atomic time step. The execution function, therefore, always produces 1 and can therefore be ignored for the purposes of the gradient calculation.

Finally, consider the gradient of $J$ with respect to the parameters of $\pi_\Omega$. Bacon et al. (2017) do not provide a policy gradient for $\pi_\Omega$, but suggest policy gradient methods at the SMDP level, planning, or intra-option Q-learning (Sutton et al., 1999). One option is to use the ACPGT approach. The execution function is simply the output of the termination coagent at time $t$, $e_t$. We use the expected value form for simplicity, substituting terms in as above:

$$\partial J / \partial \mu = \mathbf{E}\left[\sum_{t=0}^{\infty} e_t\, \gamma^t\, G_t\, \frac{\partial \ln\left(\pi_\Omega(X_t, \omega_t, \mu)\right)}{\partial \mu} \;\middle|\; \{\theta, \vartheta, \mu\}\right],$$

where $\mu$ represents the parameters of $\pi_\Omega$ and $X_t \in (\mathcal{S} \times \{0, 1\})$ is the local state of the $\pi_\Omega$ coagent at time $t$. While Bacon et al. (2017) introduce an asynchronous framework, the theoretical tools they use for the synchronous components do not provide a gradient for the asynchronous component. Instead, their suggestions rely on a piecemeal approach. While this approach is reasonable, it invokes ideas beyond the scope of their work for training the asynchronous component. Our approach has the benefit of providing a unified approach to training all components of the network.
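For concreteness, the sketch below (Python with NumPy; the sampler signatures and data layout are illustrative assumptions, not the option-critic implementation) runs one atomic time step of the three-coagent network in Figure 3, showing how $\beta$ gates the execution of $\pi_\Omega$.

```python
import numpy as np

def option_critic_step(s, omega_prev, beta, pi_Omega, pi_omega, rng):
    """One time step of the option-critic viewed as a three-coagent network.

    beta     : callable (s, omega_prev) -> probability of terminating omega_prev
    pi_Omega : callable s -> array of probabilities over option labels
    pi_omega : callable (s, omega) -> array of probabilities over primitive actions
    """
    # Coagent 1 (termination functions): e_t = 1 requests a new option.
    e_t = int(rng.random() < beta(s, omega_prev))

    # Coagent 2 (policy over options): executes if and only if e_t = 1;
    # otherwise it repeats its previous output, the current option label.
    if e_t == 1:
        option_probs = pi_Omega(s)
        omega_t = rng.choice(len(option_probs), p=option_probs)
    else:
        omega_t = omega_prev

    # Coagent 3 (option policies): executes every step, conditioned on (s, omega_t).
    action_probs = pi_omega(s, omega_t)
    a_t = rng.choice(len(action_probs), p=action_probs)
    return e_t, omega_t, a_t
```

Each sampler above is one coagent, so (1), (2), and the $\pi_\Omega$ expression follow from identifying its local states and outputs and substituting into the CPGT or ACPGT formula.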
The ACPGT provided a simplified and unified approach to these gradient derivations. Using the option-critic framework as an example, we have shown that the ACPGT is a useful tool for analyzing arbitrary stochastic networks. Notice that a more complex architecture containing many levels of hierarchy could be analyzed with similar ease.

## 8. Conclusion

We provide a formal and general proof of the coagent policy gradient theorem (CPGT) for stochastic policy networks, and extend it to the asynchronous and recurrent setting. This result demonstrates that, if coagents apply standard policy gradient algorithms from the perspective of their inputs and outputs, then the entire network will follow the policy gradient, even in asynchronous or recurrent settings. We empirically support the CPGT, and use the option-critic framework as an example to show how our approach facilitates and simplifies gradient derivation for arbitrary stochastic networks. Future work will focus on the potential for massive parallelization of asynchronous coagent networks, and on the potential for many levels of implicit temporal abstraction through varying coagent execution rates.

## Acknowledgements

Research reported in this paper was sponsored in part by the CCDC Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196 (ARL IoBT CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. We would like to thank Scott Jordan for help and feedback with the editing process. We would also like to thank the reviewers and meta-reviewers for their feedback, which helped us improve this work.

## References

Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Barto, A. G. Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4(4):229–256, 1985.

Bertsekas, D. P. and Tsitsiklis, J. N. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.

Buşoniu, L., Babuška, R., and De Schutter, B. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications-1, pp. 183–221. Springer, 2010.

Guestrin, C., Lagoudakis, M., and Parr, R. Coordinated reinforcement learning. In ICML, volume 2, pp. 227–234, 2002.

Klopf, A. H. The hedonistic neuron: A theory of memory, learning, and intelligence. 1982.

Liu, B., Singh, S., Lewis, R. L., and Qin, S. Optimal rewards for cooperative agents. IEEE Transactions on Autonomous Mental Development, 6(4):286–297, 2014.

Narendra, K. S. and Thathachar, M. A. Learning Automata: An Introduction. Prentice-Hall, Inc., 1989.

Riemer, M., Liu, M., and Tesauro, G. Learning abstract options. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), NeurIPS 31, pp. 10424–10434. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8243-learning-abstract-options.pdf.

Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.
Schulman, J., Heess, N., Weber, T., and Abbeel, P. Gradient estimation using stochastic computation graphs. In NeurIPS, pp. 3528–3536, 2015.

Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211, 1999.

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., and White, A. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proc. of 10th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2011), pp. 761–768, 2011.

Thomas, P. S. Policy gradient coagent networks. In NeurIPS, pp. 1944–1952, 2011.

Thomas, P. S. and Barto, A. G. Conjugate Markov decision processes. In ICML, pp. 137–144, 2011.

Thomas, P. S. and Barto, A. G. Motor primitive discovery. In Proceedings of the IEEE Conference on Development and Learning and Epigenetic Robotics, pp. 1–8, 2012.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.

Zhang, S. and Whiteson, S. DAC: The double actor-critic architecture for learning options. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), NeurIPS 32, pp. 2010–2020. Curran Associates, Inc., 2019.

Zhang, X., Aberdeen, D., and Vishwanathan, S. Conditional random fields for multi-agent reinforcement learning. In ICML, pp. 1143–1150. ACM, 2007.