Published as a conference paper at ICLR 2020

# SVQN: Sequential Variational Soft Q-Learning Networks

Shiyu Huang, Hang Su, Jun Zhu, Ting Chen
Dept. of Comp. Sci. & Tech., BNRist Center, Institute for AI, THBI Lab, Tsinghua University
hsy17@mails.tsinghua.edu.cn; {suhangss, dcszj, tingchen}@tsinghua.edu.cn
J. Zhu and T. Chen are corresponding authors. J. Zhu is also with RealAI.

ABSTRACT

Partially Observable Markov Decision Processes (POMDPs) are popular and flexible models for real-world decision-making applications that require information from past observations to make optimal decisions. Standard reinforcement learning algorithms for Markov Decision Process (MDP) tasks are not applicable, as they cannot infer the unobserved states. In this paper, we propose a novel algorithm for POMDPs, named Sequential Variational Soft Q-learning Networks (SVQNs), which formalizes the inference of hidden states and maximum entropy reinforcement learning (MERL) under a unified graphical model and optimizes the two modules jointly. We further design a deep recurrent neural network to reduce the computational complexity of the algorithm. Experimental results show that SVQNs can utilize past information for efficient inference to help decision making, and outperform other baselines on several challenging tasks. Our ablation study shows that SVQNs generalize over time and are robust to disturbances of the observation.

1 INTRODUCTION

In recent years, substantial progress has been made in deep reinforcement learning for solving various challenging tasks, including the game of Go (Silver et al., 2016), Atari games (Mnih et al., 2015), StarCraft (Zambaldi et al., 2018; Pang et al., 2018) and first-person shooting (FPS) games (Lample & Chaplot, 2017; Wu & Tian, 2016; Huang et al., 2019). However, many real-world decision-making problems are partially observable (Åström, 1965), which prevents them from being solved by standard reinforcement learning algorithms. Formally, such problems are often defined as Partially Observable Markov Decision Processes (POMDPs) (Kaelbling et al., 1998), which demand information from past observations to support the decision-making process (McCallum, 1993). Although numerous efforts (Hausknecht & Stone, 2015; Foerster et al., 2016; Igl et al., 2018; Zhu et al., 2018) have been made to tackle this problem, various challenges remain. For example, Egorov (2015) tries to solve POMDPs by using the agent's belief as the input of a DQN (Mnih et al., 2015), but this algorithm needs access to the environment model. In many reinforcement learning tasks, however, the agent cannot acquire the underlying transition function, making such algorithms inapplicable.

Some recent work (Karkus et al., 2017; McAllester & Singh, 2013; Babayan et al., 2018) tries to solve POMDPs in the model-free setting, i.e., the agent does not need to know or learn the transition function of the environment. For instance, Karkus et al. (2017) trained an agent to navigate in a partially observable grid world under the model-free setting: the agent can only observe a part of the grid world and does not learn the transition function. The agent uses its local observations to update its beliefs (McAllester & Singh, 2013; Babayan et al., 2018). In their experiments, the ground-truth state is the full map plus the location of the agent, so the representation of the state is explicit.
However, in some complex tasks, it is impossible to acquire or hand-design the state or the beliefs. To solve this unknown state-representation problem, Hausknecht & Stone (2015) and Zhu et al. (2018) represent the state as latent variables of neural networks. However, they only use a deep recurrent neural network to capture the historical information and fail to exploit the Markov property of the state in POMDPs. Igl et al. (2018) apply sequential Monte Carlo (SMC) (Le et al., 2017) to introduce an inductive bias into the neural network that embodies the Markov property of the state, and they can infer the state from past observations online. However, they separate the planning algorithm from the inference of the state.

To infer the hidden states and optimize the planning module jointly, we represent POMDPs as a unified probabilistic graphical model (PGM) and derive a single evidence lower bound (ELBO), which we optimize with structured variational inference. In our implementation, we design generative models to infer the hidden variables; however, the distribution of the latent variables is conditioned on previous hidden states. This differs from standard VAEs (Kingma & Welling, 2013), whose prior over the latent variables can be a standard Gaussian distribution, so we introduce an additional approximate function to tackle the conditional prior. Planning can also be solved under the PGM framework: maximum entropy reinforcement learning (MERL) (Levine, 2018) provides a tool to formalize planning as a probabilistic inference task. In this paper, we propose a novel end-to-end neural network called the Sequential Variational Soft Q-learning Network (SVQN), which integrates the learning of hidden states and the optimization of planning within the same framework. A deep recurrent neural network (RNN) (Cho et al., 2014; Hochreiter & Schmidhuber, 1997) in SVQNs reduces the computational complexity, because feature extraction can share the same weights in the RNN. Experimental results show that SVQNs can utilize past information for efficient inference to help decision making, and outperform other baselines on several challenging tasks. Our ablation study shows that SVQNs generalize over time and are robust to disturbances of the observation.

Contributions: (1) We derive the variational lower bound for POMDPs, which allows us to integrate the optimization of the control problem and the learning of the hidden state under a unified graphical model. (2) We tackle the difficulty of inferring the hidden state and solve the conditional prior problem with generative models. (3) We design an end-to-end deep recurrent neural network, which reduces the computational complexity and can be trained efficiently.

2 RELATED WORK

We summarize related work on POMDPs and on inference methods for sequential data.

Model-based and model-free methods for POMDPs: When the environment model is accessible, POMDPs can be solved by model-based methods. Egorov (2015) used model-based methods to solve POMDPs, but their agents need to know the belief-update function and the transition function. When the environment model is unknown, model-free methods must be applied.
Recently, some researchers (Hausknecht & Stone, 2015; Zhu et al., 2018) used recurrent neural networks to capture the historical information, but they failed to exploit the Markov property of the state in POMDPs. Our work proposes generative models for algorithm learning, which tackles the difficulty of inferring hidden states and introduces an inductive bias into the network structure. Igl et al. (2018) applied sequential Monte Carlo (SMC) to POMDPs and can infer the hidden state from past observations online, but they separate the planning algorithm from the inference of the hidden state. Our algorithm is derived from a unified graphical model, which allows the inference model and the planning algorithm to be trained jointly.

Exploration in POMDPs: In contrast to MDP methods, POMDP methods must explore to gather information that yields no immediate reward but can be used later to gain higher rewards. Pathak et al. (2017) and Choi et al. (2018) designed deep reinforcement learning algorithms to aid exploration in their tasks. Their tasks are actually POMDP tasks, but they treat them as MDP tasks and add intuitively motivated exploration tricks to standard reinforcement learning algorithms, such as reconstructing observations and actions based on concepts like curiosity and attention. The generative models in our method also reconstruct observations and actions, but our algorithm is derived from solid theoretical foundations.

Inference for sequential data: The observations in POMDP tasks form sequential data, and we need to infer the hidden state from them. Coquelin et al. (2009) used a particle filter to estimate the belief state given past observations, but their method needs access to the environment model. Chung et al. (2015) derived an evidence lower bound over latent variables for sequential data and designed a deep neural network to infer recurrent latent variables. Our inference method bears some superficial resemblance to theirs, but we derive the evidence lower bound from a different graphical model and tackle the problem of integrating the RL planning algorithm instead of only inferring the hidden variables.

Figure 1: The graphical models for Markov decision processes (MDPs) (a) and partially observable Markov decision processes (POMDPs) (b). Grey nodes are observed, white nodes are hidden. In POMDPs, the state $s$ is not observable and must be inferred from past observations. $O_t$ is a binary random variable, where $O_t = 1$ denotes that the action is optimal at time $t$, and $O_t = 0$ denotes that it is not.

3 PRELIMINARIES

We briefly review the background and notation related to our method, including partially observable Markov decision processes and maximum entropy reinforcement learning, which solves optimal control problems in a probabilistic framework.

3.1 PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES

In many control tasks, the complete state of the environment is unknown to the agent. Instead, the agent only receives local observations, which are typically conditioned on the current state of the system, and must make decisions based on historical information. This kind of problem can be formalized as a partially observable Markov decision process (POMDP) (Smith & Simmons, 2004).
Formally, a POMDP is represented as a tuple $(S, A, O, T, Z, r)$, where $S$, $A$ and $O$ are the state space, action space and observation space, respectively. The reward function $r(s, a)$ gives the reward received when taking action $a$ in state $s$. $T(s, a, s')$ is the state-transition function, which defines the probability of the succeeding state $s'$ after taking action $a$ in state $s$. $Z(s, a, o)$ is the observation function, which defines the probability of emitting observation $o$ after taking action $a$ in state $s$. POMDPs use the belief $b_t(s)$ to maintain a distribution over the unknown state at time $t$, and the agent updates its belief with a Bayesian filter when receiving a new observation:

$$b_t(s') = \eta\, Z(s', a_t, o_t) \sum_{s \in S} T(s, a_t, s')\, b_{t-1}(s), \qquad (1)$$

where $\eta$ is a normalizing constant and $a_t$ is the action at time $t$. POMDP planning seeks a policy $\pi$ that maximizes the accumulated reward

$$V_\pi(b_0) = \mathbb{E}\left(\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, b_0, \pi\right), \qquad (2)$$

where $s_t$ is the state at time $t$ and $\gamma \in (0, 1)$ is a discount factor.

3.2 MAXIMUM ENTROPY REINFORCEMENT LEARNING

POMDPs require solving two problems: inference of the state and optimal control. To integrate both under a unified framework, we represent POMDPs as a probabilistic graphical model, so the optimal control problem must be solved as a probabilistic inference task. Fortunately, maximum entropy reinforcement learning (MERL) provides an effective tool to solve the optimal control problem under the PGM framework. The graphical model of Markov decision processes is shown in Fig. 1(a). We borrow the notation from Levine (2018) to illustrate the algorithm. Levine (2018) introduces a binary random variable $O_t$ to the graphical model, where $O_t = 1$ denotes that the action is optimal at time $t$, and $O_t = 0$ denotes that it is not. The probability distribution of $O$ is $p(O_t = 1 \mid s_t, a_t) = \exp(r(s_t, a_t))$, and the variational lower bound is given by

$$\log p(O_{1:T}) \ge \mathbb{E}_{(s_{1:T}, a_{1:T}) \sim \pi(s_{1:T}, a_{1:T})}\left[\sum_{t=1}^{T} r(s_t, a_t) - \log \pi(a_t \mid s_t)\right], \qquad (3)$$

where $\pi(a \mid s)$ is the policy function. Standard reinforcement learning only maximizes the cumulative reward, whereas MERL maximizes an extra term, the policy entropy at each visited state. To obtain the optimal solution for MERL, two messages are introduced, i.e., $\beta_t(s_t, a_t) = p(O_{t:T} \mid s_t, a_t)$ and $\beta_t(s_t) = p(O_{t:T} \mid s_t)$, and their relations are given by

$$\beta_t(s_t) = \int_A \beta_t(s_t, a_t)\, p(a_t \mid s_t)\, da_t, \qquad \beta_t(s_t, a_t) = \int_S \beta_{t+1}(s_{t+1})\, p(s_{t+1} \mid s_t, a_t)\, p(O_t \mid s_t, a_t)\, ds_{t+1}, \qquad (4)$$

and the optimal policy is given by

$$\pi(a_t \mid s_t, O_{t:T}) = \frac{\beta_t(s_t, a_t)}{\beta_t(s_t)}. \qquad (5)$$

We can define the Q-value function and V-value function as

$$Q(s_t, a_t) = \log \beta_t(s_t, a_t), \qquad (6)$$

$$V(s_t) = \log \beta_t(s_t), \qquad (7)$$

and the update functions are

$$V(s_t) = \log \int_A p(a_t \mid s_t) \exp(Q(s_t, a_t))\, da_t, \qquad Q(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\left[\exp(V(s_{t+1}))\right]. \qquad (8)$$

To maximize the variational lower bound in Eq. (3), we can use a parameterized Q-function $Q_\theta(s_t, a_t)$, where $\theta$ denotes the function parameters, learned via gradient descent:

$$\theta \leftarrow \theta - \alpha\, \mathbb{E}\left[\frac{dQ_\theta(s_t, a_t)}{d\theta}\left(Q_\theta(s_t, a_t) - r(s_t, a_t) - \log \int_A p(a_{t+1} \mid s_{t+1}) \exp(Q_\theta(s_{t+1}, a_{t+1}))\, da_{t+1}\right)\right], \qquad (9)$$

where $\alpha$ is the learning rate. This algorithm is called soft Q-learning, because the update of the Q-value can be considered a soft version of the standard Bellman backup. For a better understanding of MERL, we refer readers to the tutorial by Levine (2018).
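As a concrete illustration of the soft Bellman backup in Eq. (8) and the target inside the update of Eq. (9), the following sketch computes the soft value and the soft TD target for a discrete action space. It assumes a uniform action prior $p(a \mid s) = 1/|A|$ and adds a discount factor, which is common in practice even though Eqs. (8)-(9) do not show one; this is a hedged sketch, not the authors' implementation.

```python
import math
import torch

def soft_value(q_values: torch.Tensor) -> torch.Tensor:
    """V(s) = log sum_a p(a|s) exp(Q(s, a)), as in Eq. (8),
    assuming a uniform action prior p(a|s) = 1/|A|.

    q_values: tensor of shape [batch, num_actions] holding Q_theta(s, a).
    """
    num_actions = q_values.shape[-1]
    return torch.logsumexp(q_values, dim=-1) - math.log(num_actions)

def soft_td_target(reward: torch.Tensor, next_q_values: torch.Tensor,
                   discount: float = 0.99) -> torch.Tensor:
    """Soft Bellman target r(s_t, a_t) + gamma * V(s_{t+1}); the bracketed term
    in Eq. (9) is Q_theta(s_t, a_t) minus this target (without the discount)."""
    return reward + discount * soft_value(next_q_values)

# Example: a batch of two transitions with three discrete actions.
q_next = torch.tensor([[1.0, 0.5, -0.2], [0.3, 0.3, 0.3]])
r = torch.tensor([0.1, -1.0])
target = soft_td_target(r, q_next)
# Minimizing the squared gap between Q_theta(s_t, a_t) and this (detached)
# target recovers the gradient-descent update of Eq. (9).
```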
4 SEQUENTIAL VARIATIONAL SOFT Q-LEARNING NETWORKS

We now present our algorithm in detail. We first derive the variational lower bound for POMDPs, and then illustrate how to deal with the conditional prior.

4.1 VARIATIONAL LOWER BOUND FOR POMDPS

Different from MDPs, the state $s$ of a POMDP in the PGM (shown in Fig. 1(b)) is unobservable and needs to be inferred from the actions $a$ and the observations $o$. We therefore derive a different variational lower bound for POMDPs, which can be used to infer the hidden state and do planning jointly. Unlike the observation function $Z(s, a, o)$ in Section 3.1, we assume that the observation is emitted from the hidden state $s$ alone, i.e., the distribution of observations is conditioned only on states: $o_t \sim p(o_t \mid s_t)$. This assumption holds in many tasks; for example, the observation in a partially observed maze is determined only by the full map and the agent's location.

Figure 2: The structure of sequential variational soft Q-learning networks (SVQNs). Black solid lines represent forward paths of the neural network, gray dashed lines represent reconstruction paths, and blue arrows stand for sampling latent variables using the reparameterization trick. Double arrows indicate that the algorithm needs to minimize the KL-divergence between two probability distributions. The model takes the observation $o_t$, previous action $a_{t-1}$ and reward $r_{t-1}$ as inputs, and uses a neural network $f_\theta(o_t, a_{t-1}, r_{t-1})$ to extract the low-dimensional hidden feature $w_t$. The recurrent unit $rnn_\theta(w_t, h_{t-1})$ captures the historical information $h_t$. $q_\theta(s_t \mid h_t)$ and $p^{prior}_\theta(s_t \mid s_{t-1}, a_{t-1})$ are proposal distributions over hidden states. The hidden state $s_t$ and the inner hidden state $\hat{s}_t$ are sampled from $q_\theta(s_t \mid h_t)$ and $p^{prior}_\theta(s_t \mid s_{t-1}, a_{t-1})$, respectively. The Q-function $Q_\theta(s_t, a_t)$ is learned via the temporal difference (TD) algorithm with the soft update.

We apply structured variational inference to optimize the evidence lower bound of POMDPs. In structured variational inference, different parts of the proposal distribution can be optimized separately, i.e., we can fix some approximate functions while optimizing the others. In POMDPs, we use two approximate functions, $q_\pi(a_t \mid s_t)$ and $q_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)$: $q_\pi(\cdot)$ approximates the optimal policy and $q_\theta(\cdot)$ approximates the function that infers hidden states, where $\theta$ denotes the parameters of the approximate function. When $q_\theta(\cdot)$ is fixed, the learning procedure is the same as MERL, so $q_\pi(\cdot)$ can be learned via the soft Q-learning algorithm. Conversely, when $q_\pi(\cdot)$ is fixed as the optimal policy, we can learn the inference function $q_\theta(\cdot)$ for hidden states. We derive the evidence lower bound (ELBO) as follows:

$$\log p(O_{0:T}, a_{0:T}, o_{1:T}) = \log \mathbb{E}_{q_\theta(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T})}\left[\frac{p(s_{1:T}, O_{0:T}, a_{0:T}, o_{1:T})}{q_\theta(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T})}\right] \ge \mathbb{E}_{q_\theta(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T})}\left[\log \frac{p(s_{1:T}, O_{0:T}, a_{0:T}, o_{1:T})}{q_\theta(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T})}\right], \qquad (10)$$

and the ELBO can be written as

$$\mathcal{L}(O_{0:T}, a_{0:T}, o_{1:T}) = \mathbb{E}_{q_\theta(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T})} \sum_{t=1}^{T} \Big\{ r(s_t, a_t) + \log\left[p(a_t)\, p(o_t \mid s_t)\right] - D_{KL}\left[q_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)\, \|\, p(s_t \mid s_{t-1}, a_{t-1})\right] \Big\}. \qquad (11)$$

To obtain the optimal action in POMDPs, we need to maximize the ELBO above. The first term in the ELBO is the cumulative reward $\sum_{t=1}^{T} r(s_t, a_t)$, which can be optimized with a maximum entropy reinforcement learning algorithm. We apply generative models to optimize the remaining terms. The term $p(o_t \mid s_t)$ means that the hidden state $s_t$ must be able to generate the current observation $o_t$.
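As a concrete (and hedged) illustration of this reconstruction requirement: if $p(o_t \mid s_t)$ is taken to be a fixed-variance Gaussian, maximizing $\log p(o_t \mid s_t)$ reduces, up to a constant, to minimizing a mean-squared reconstruction error, which is how the reconstruction losses in Section 4.3 treat it. The decoder below is a minimal sketch under assumed layer sizes, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ObservationDecoder(nn.Module):
    """Decodes the latent state s_t into a (flattened) observation o_t,
    realizing the p(o_t | s_t) term of the ELBO in Eq. (11).
    Layer sizes are illustrative assumptions."""
    def __init__(self, state_dim: int = 64, obs_dim: int = 84 * 84):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 512), nn.ReLU(),
            nn.Linear(512, obs_dim),
        )

    def forward(self, s_t: torch.Tensor) -> torch.Tensor:
        return self.net(s_t)

# With a fixed-variance Gaussian p(o_t | s_t), maximizing log p(o_t | s_t)
# is equivalent (up to a constant) to minimizing the MSE reconstruction loss.
decoder = ObservationDecoder()
s_t = torch.randn(8, 64)          # hidden states sampled from q_theta
o_t = torch.rand(8, 84 * 84)      # flattened grayscale observations in [0, 1]
recon_loss = ((decoder(s_t) - o_t) ** 2).mean()
```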
The Kullback-Leibler divergence term $D_{KL}\left[q_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)\, \|\, p(s_t \mid s_{t-1}, a_{t-1})\right]$ indicates that we should minimize the gap between the approximate function $q_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)$ and the prior $p(s_t \mid s_{t-1}, a_{t-1})$. In standard VAEs, the prior $p(\cdot)$ is generally fixed and can be set to a standard Gaussian. In this generative model, however, the prior of the hidden state $s_t$ is conditioned on the previous state $s_{t-1}$ and action $a_{t-1}$. We show how to deal with this conditional prior in the next subsection. More details on the derivation of the ELBO can be found in Appendix A.

4.2 VARIATIONAL AUTOENCODERS FOR THE CONDITIONAL PRIOR

The KL-divergence term in Eq. (11) introduces a conditional prior for the hidden state $s_t$, but the true conditional distribution $p(s_t \mid s_{t-1}, a_{t-1})$ is unknown, which makes the inference intractable. Hence, we propose a parameterized function $p^{prior}_\theta(s_t \mid s_{t-1}, a_{t-1})$ to approximate the true distribution $p(s_t \mid s_{t-1}, a_{t-1})$, and we use a standard VAE to learn $p^{prior}_\theta(\cdot)$. The pair $(s_{t-1}, a_{t-1})$ is treated as the inner observed data of this VAE, and the inner hidden state $\hat{s}_t$ is inferred from $(s_{t-1}, a_{t-1})$. We can write down the ELBO for the conditional prior directly from the ELBO of standard VAEs (see Appendix B):

$$\mathcal{L}(s_{t-1}, a_{t-1}) = -D_{KL}\left(p^{prior}_\theta(s_t \mid s_{t-1}, a_{t-1})\, \|\, p(s_t)\right) + \mathbb{E}_{p^{prior}_\theta(s_t \mid s_{t-1}, a_{t-1})}\left[\log p(s_{t-1}, a_{t-1} \mid s_t)\right], \qquad (12)$$

where $p(s_t)$ can be set to a standard Gaussian. Thus $p^{prior}_\theta(\cdot)$ can be learned via the standard VAE paradigm. The outputs of $p^{prior}_\theta(\cdot)$ are $\mu^{prior}$ and $\sigma^{prior}$, so the KL-divergence term in Eq. (11) can be written as

$$D_{KL}\left[q_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)\, \|\, p(s_t \mid s_{t-1}, a_{t-1})\right] \approx D_{KL}\left[q_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)\, \|\, p^{prior}_\theta(s_t \mid s_{t-1}, a_{t-1})\right] = D_{KL}\left[q_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)\, \|\, \mathcal{N}(\mu^{prior}, \mathrm{diag}((\sigma^{prior})^2))\right]. \qquad (13)$$

Unlike standard VAEs, the second distribution in the KL-divergence of Eq. (13) is not a standard normal distribution, but we can still obtain the final loss function by expanding the KL-divergence. We present the final formula of the loss function for Eq. (13) in Appendix C.

4.3 SEQUENTIAL VARIATIONAL SOFT Q-LEARNING NETWORKS

We design a deep recurrent neural network to optimize the ELBO derived in Section 4.1 and Section 4.2. Fig. 2 shows the overall structure of our algorithm. In our implementation, there are two generative models: one learns the conditional prior $p^{prior}_\theta(\cdot)$ and reconstructs the inner observed data $(s_{t-1}, a_{t-1})$, and the other learns $q_\theta(\cdot)$ in Eq. (13) and reconstructs the current observation $o_t$.

For the first generative model, there are two losses, $L^{inner}_{KL}$ and $L^{inner}_{MSE}$. $L^{inner}_{KL}$ is the negated KL-divergence between $p^{prior}_\theta(s_t \mid s_{t-1}, a_{t-1})$ and a standard Gaussian:

$$L^{inner}_{KL} = -D_{KL}\left[p^{prior}_\theta(s_t \mid s_{t-1}, a_{t-1})\, \|\, \mathcal{N}(0, I)\right], \qquad (14)$$

and $L^{inner}_{MSE}$ is the reconstruction loss of $(s_{t-1}, a_{t-1})$:

$$L^{inner}_{MSE} = \mathrm{MSE}\left((s_{t-1}, a_{t-1}),\, \varphi^{inner}_\theta(\hat{s}_t)\right), \qquad (15)$$

where $\mathrm{MSE}(\cdot)$ is the mean-squared error function, the inner hidden state $\hat{s}_t$ is sampled from $p^{prior}_\theta(s_t \mid s_{t-1}, a_{t-1})$, and $\varphi^{inner}_\theta(\cdot)$ is a reconstruction function.

For the second generative model, there are also two losses, $L^{elbo}_{KL}$ and $L^{elbo}_{MSE}$. $L^{elbo}_{KL}$ is defined as

$$L^{elbo}_{KL} = D_{KL}\left[q_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)\, \|\, p^{prior}_\theta(s_t \mid s_{t-1}, a_{t-1})\right], \qquad (16)$$

and $L^{elbo}_{MSE}$ is the reconstruction loss of $o_t$:

$$L^{elbo}_{MSE} = \mathrm{MSE}\left(o_t,\, \varphi^{elbo}_\theta(s_t)\right), \qquad (17)$$

where the hidden state $s_t$ is sampled from $q_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)$ and $\varphi^{elbo}_\theta(\cdot)$ is a reconstruction function.
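When both $q_\theta$ and $p^{prior}_\theta$ are diagonal Gaussians parameterized by a mean and a log-variance, the KL terms above have a closed form (the univariate case is expanded in Appendix C). The snippet below is a hedged sketch of how $L^{elbo}_{KL}$ in Eq. (16) and the KL inside Eq. (14) could be computed in that setting; the tensor shapes are assumptions.

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """D_KL[ N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ],
    summed over latent dimensions and averaged over the batch."""
    kl = 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    )
    return kl.sum(dim=-1).mean()

# Posterior q_theta(s_t | .) and learned prior p^prior_theta(s_t | s_{t-1}, a_{t-1}),
# both produced by network heads as (mean, log-variance) pairs.
mu_q, logvar_q = torch.randn(8, 64), torch.randn(8, 64)
mu_p, logvar_p = torch.randn(8, 64), torch.randn(8, 64)

L_elbo_kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)              # Eq. (16)
# KL of the learned prior against N(0, I); Eq. (14) uses its negation.
kl_inner = gaussian_kl(mu_p, logvar_p,
                       torch.zeros_like(mu_p), torch.zeros_like(logvar_p))
L_inner_kl = -kl_inner                                               # Eq. (14)
```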
Finally, we obtain two combined loss functions for the two generative models, the reconstruction loss $L_{MSE} = L^{inner}_{MSE} + L^{elbo}_{MSE}$ and the KL-divergence loss $L_{KL} = L^{inner}_{KL} + L^{elbo}_{KL}$. For the planning algorithm, we use soft Q-learning (Levine, 2018); its loss is the temporal difference error $L_{TD}$, updated via Eq. (9). All of these losses are jointly optimized with stochastic gradient descent.

To reduce the computational complexity, we use a deep recurrent neural network to capture the historical information. We first use $f_\theta(o_t, a_{t-1}, r_{t-1})$ to extract a low-dimensional hidden feature $w_t$ from the current input, and then feed this feature to a recurrent unit $rnn_\theta(\cdot)$ to obtain the recurrent output $h_t$. Because $h_t$ contains the information of past observations, the loss function in Eq. (16) can be rewritten as

$$L^{elbo}_{KL} = D_{KL}\left[q_\theta(s_t \mid h_t)\, \|\, p^{prior}_\theta(s_t \mid s_{t-1}, a_{t-1})\right]. \qquad (18)$$

In training, we tried both the LSTM cell (Hochreiter & Schmidhuber, 1997) and the GRU cell (Cho et al., 2014) as the basic recurrent unit. Because of the generalization ability of the recurrent neural network (Lample & Chaplot, 2017), we can sample sequences of a fixed length H for training instead of using the full-length data; in the experiments, we also study how the training length H influences the final performance. We use a parallel training strategy: the program hosts multiple games in parallel, and they all send batched data to a central data memory. This is similar to the data collection method in ELF (Tian et al., 2017). More details about the training strategy can be found in Appendix E.

5 EXPERIMENTS

We evaluate our algorithm on flickering Atari (Hausknecht & Stone, 2015) and the ViZDoom platform (Kempka et al., 2016). Flickering Atari was previously used as the test environment in DRQN (Hausknecht & Stone, 2015), ADRQN (Zhu et al., 2018) and DVRL (Igl et al., 2018). The ViZDoom platform is a 3D FPS game for AI research. The agent in ViZDoom needs to navigate in the 3D environment to accomplish various tasks, such as gathering resources, shooting enemies and looking for the exit. Because the agent can only see in one direction at each time step, the tasks are naturally partially observable.

Figure 3: Screenshots of the Atari games and ViZDoom tasks. From left to right: Pong, Chopper Command, Double Dunk, Asteroids, Health Gather, Health Gather v2 and Defend Center. When only one frame is given as the input, the Atari games are POMDPs because the velocity of a moving object cannot be obtained from a single observation. Because the agent in the ViZDoom environment can only see in one direction, these tasks are naturally partially observable.

5.1 EXPERIMENT SETUP

We use several recent algorithms as baselines. Because the true state and transition function are unknown, only methods that can be trained in the model-free setting are used for comparison. The baselines are listed below:

Deep Q-Networks (DQN) (Mnih et al., 2015): A method that uses the standard Bellman backup and the temporal difference error as the objective function.

Deep Soft Q-Networks (DSQN) (Levine, 2018): A method that uses the soft Bellman backup and the temporal difference error as the objective function.

Deep Recurrent Q-Networks (DRQN) (Hausknecht & Stone, 2015): A method that applies an LSTM on top of the DQN.
Action-Specific Deep Recurrent Q-Networks (ADRQN) (Zhu et al., 2018): A method that extends DRQN by using both the observation and the action as inputs.

Deep Variational Reinforcement Learning (DVRL) (Igl et al., 2018): A method that combines sequential Monte Carlo and A2C (Dhariwal et al., 2017) to solve POMDPs.

5.2 EVALUATION ON FLICKERING ATARI

Atari environments (Bellemare et al., 2013) are widely used as a benchmark for deep reinforcement learning algorithms due to their high-dimensional observation spaces and numerous challenging tasks. Flickering Atari was introduced by Hausknecht & Stone (2015): at each step, the observation may be obscured with a certain probability, i.e., the raw screen is either fully observable or fully obscured with black pixels. The experimental settings are kept in line with DVRL (Igl et al., 2018): only one frame is used, the frameskip is set to four and each frame is obscured with a probability of 0.5. We choose four Atari games (Pong, Chopper Command, Double Dunk and Asteroids) for evaluation. Fig. 3 shows screenshots of these games. When only one frame is given as the input, these tasks are POMDPs because the velocity of a moving object cannot be obtained from a single observation. These tasks have high-dimensional, continuous observation spaces and discrete action spaces.

The basic network architecture and hyper-parameters are similar to DQN (Mnih et al., 2015). For the recurrent neural networks, we use a sequence length of 5 for training. All algorithms are trained for 10,000,000 steps and run for 100 episodes during evaluation. The training details can be found in Appendix E.

Table 1 shows the performance of the different algorithms on these Atari games. Our algorithms significantly outperform the other baselines on three of the games and obtain a score close to DVRL on Chopper Command. Compared with DRQN and ADRQN, our method introduces an inductive bias into the network, which helps state estimation and RL planning for POMDPs. Compared with DVRL, our method achieves competitive performance with lower sampling complexity.

| Game | DQN | DSQN | DRQN | ADRQN | DVRL | SVQN(GRU) | SVQN(LSTM) |
|---|---|---|---|---|---|---|---|
| Pong | -4.9 (±2.6) | -2.1 (±2.3) | 1.6 (±7.8) | 7 (±4.6) | 18.2 (±2.7) | 19.2 (±1.3) | 18.6 (±1.9) |
| Chopper | 1350 (±731) | 1250 (±522) | 1090 (±409) | 1608 (±707) | 6602 (±449) | 6005 (±258) | 5805 (±312) |
| DDunk | -16.2 (±3.1) | -16.9 (±2.8) | -14.4 (±3.2) | -12.9 (±3.6) | -6.0 (±1.3) | -5.8 (±1.4) | -5.5 (±1.2) |
| Asteroids | 935 (±410) | 940 (±320) | 871 (±340) | 1040 (±432) | 1539 (±73) | 1645 (±102) | 1585 (±86) |

Table 1: Evaluation results of different models on flickering Atari. The values are the final evaluation scores after training; values in parentheses are standard deviations. Evaluations use the Mann-Whitney rank test, and bold numbers indicate statistical significance at the 5% level. Our algorithms outperform the other baselines on three of the games and obtain a score close to DVRL on Chopper Command.

5.3 EVALUATION ON VIZDOOM TASKS

We designed three tasks in ViZDoom to evaluate our algorithm.

Health Gather: The agent needs to gather health supplies in a flat map. The agent loses health at every time step; if it cannot gather enough health supplies, it dies and the game is over. At each time step, the agent receives a reward of 0.001, which encourages it to live longer. When it collects a health supply, it receives a reward of 1.
The observation for the agent is the grayscale image of its vision and its health value. The agent has three actions to choose from: TURN LEFT, TURN RIGHT and MOVE FORWARD.

Health Gather v2: This task is more difficult than the previous one, with a more complex map that contains poison. When the agent encounters poison, it loses health. We use the same reward scheme, observations and actions as in Health Gather.

Defend Center: In this task, the agent stands at the center of a flat map. Monsters walk toward the agent from the edge of the map, and the agent loses health when they touch it. The agent holds a gun with limited ammo and can kill the monsters by pressing the shooting button. The agent receives a reward of 4 for killing a monster, a reward of -0.2 for using ammo, and a reward of -1 for losing health. The observation for the agent is the grayscale image of its vision, its health value and the remaining ammo. The agent has three actions to choose from: TURN LEFT, TURN RIGHT and ATTACK. Fig. 3 shows screenshots of these three tasks.

All the algorithms are trained with the same basic network architecture and use the same hyper-parameters. All models take only one observation as input at each time step, and the vision inputs are resized to a resolution of 84×84. For the recurrent neural networks, we use a sequence length of 5 for training. All algorithms are trained for 300,000 steps and run for 20 episodes during evaluation. The discount factor γ is set to 0.95, the learning rate is 0.0001 and the Adam optimizer (Kingma & Ba, 2014) is used for training. The network architecture for these tasks and the training details are shown in Appendix D.

Figure 4: (a) Training curves on Health Gather. Both SVQN models outperform the other baselines. (b) Performance of models with different training sequence lengths on Health Gather. When the training sequence length is too long, it is hard for the algorithm to gather useful information through a long gradient flow. (c) Evaluation results of different models under different observation probabilities on Health Gather, where all models are trained with full observation. The results show that SVQN models are more robust to disturbances of the observation than the other algorithms.

| Task | DQN | DSQN | DRQN | ADRQN | SVQN(GRU) | SVQN(LSTM) |
|---|---|---|---|---|---|---|
| Health Gather | 27.8 (±1.4) | 23.3 (±1.7) | 32.4 (±2.0) | 29.9 (±1.6) | 37.7 (±0.9) | 42.3 (±1.7) |
| Health Gather v2 | 9.8 (±1.5) | 6.7 (±2.1) | 13.2 (±2.5) | 14.3 (±1.2) | 15.8 (±1.1) | 18.2 (±0.9) |
| Defend Center | 35.4 (±1.3) | 38.5 (±1.5) | 46.0 (±0.5) | 45.8 (±2.2) | 50.3 (±2.1) | 48 (±1.4) |

Table 2: Evaluation results of different models on ViZDoom. The values are the final evaluation scores after training; values in parentheses are standard deviations. Evaluations use the Mann-Whitney rank test, and bold numbers indicate statistical significance at the 5% level. The SVQN models achieve the best performance on all three tasks.

Fig. 4(a) shows the training curves of the different methods on Health Gather; both SVQN models outperform the other baselines. Table 2 reports the final scores of the different models on the three tasks, where the SVQN models achieve the best performance. The experiments on ViZDoom indicate that the generative models of SVQNs can improve the agent's exploration ability in complicated unknown environments.

5.4 ABLATION STUDY

Training sequence length: We study how the training sequence length H impacts the algorithm's performance.
We use the SVQN(LSTM) model and the Health Gather environment for this study. Fig. 4(b) shows the training processes of models with different training sequence lengths (H = 2, 5, 10, 15). When the sequence length is too short (H = 2), the model cannot gather enough historical information, which degrades performance. When the sequence length is too long (H = 15), it is hard for the network to optimize through a long gradient flow, which also degrades performance. This experiment also shows that, with a limited training length, the algorithm generalizes to sequences of arbitrary length during testing. Thanks to this generalization ability, we can train the agent with fixed-length data instead of full-length data, which reduces the computational complexity.

Observation probability: We also study whether the different models are robust to noise in the observation. We modify the game Health Gather so that, at each time step, the observation of the screen is either fully revealed or fully obscured with a fixed observation probability p. Fig. 4(c) shows the evaluation results of the different models under different observation probabilities (p = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9). All models are trained in the standard environment with full observations. The results show that SVQN models are relatively robust to disturbances of the observation compared to the other algorithms.

6 CONCLUSIONS

We propose a novel algorithm named Sequential Variational Soft Q-Learning Networks (SVQN) to solve POMDPs with discrete action spaces. SVQN is model-free and does not need to know the true state representation. We apply generative models to deal with the conditional prior of hidden states and use a recurrent neural network to reduce the computational complexity: with a short training sequence length, it generalizes to test sequences of arbitrary length. Our deep neural network can be trained end-to-end, optimizing planning and the inference of hidden states jointly. Experimental results show that SVQN outperforms previous methods on challenging tasks and is robust to disturbances of the observation. SVQN is also flexible and can be integrated with other maximum entropy reinforcement learning algorithms, such as soft actor-critic (Haarnoja et al., 2018). In the future, we will develop algorithms for POMDP problems with continuous action spaces.

ACKNOWLEDGMENTS

This work was supported by the National Key Research and Development Program of China (No. 2017YFA0700904), NSFC Projects (Nos. 61620106010, U19B2034, U1811461), Beijing NSF Project (No. L172037), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua-Huawei Joint Research Program, a grant from Tsinghua Institute for Guo Qiang, Tiangong Institute for Intelligent Computing, the JP Morgan Faculty Research Program and the NVIDIA NVAIL Program with GPU/DGX Acceleration.

REFERENCES

K. J. Åström. Optimal control of Markov decision processes with incomplete state estimation. J. Math. Anal. Appl., 10:174-205, 1965.

Benedicte M Babayan, Naoshige Uchida, and Samuel J Gershman. Belief state representation in the dopamine system. Nature Communications, 9(1):1891, 2018.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Jongwook Choi, Yijie Guo, Marcin Moczulski, Junhyuk Oh, Neal Wu, Mohammad Norouzi, and Honglak Lee. Contingency-aware exploration in reinforcement learning. arXiv preprint arXiv:1811.01483, 2018.

Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pp. 2980-2988, 2015.

Pierre-Arnaud Coquelin, Romain Deguest, and Rémi Munos. Particle filter-based policy gradient in POMDPs. In Advances in Neural Information Processing Systems, pp. 337-344, 2009.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Maxim Egorov. Deep reinforcement learning with POMDPs. 2015.

Jakob N Foerster, Yannis M Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate to solve riddles with deep distributed recurrent Q-networks. arXiv preprint arXiv:1602.02672, 2016.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series, 2015.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Shiyu Huang, Hang Su, Jun Zhu, and Ting Chen. Combo-Action: Training agent for FPS game with auxiliary tasks. AAAI, 2019.

Maximilian Igl, Luisa Zintgraf, Tuan Anh Le, Frank Wood, and Shimon Whiteson. Deep variational reinforcement learning for POMDPs. arXiv preprint arXiv:1806.02426, 2018.

Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99-134, 1998.

Peter Karkus, David Hsu, and Wee Sun Lee. QMDP-Net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems, pp. 4694-4704, 2017.

Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1-8. IEEE, 2016.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Tuan Anh Le, Maximilian Igl, Tom Rainforth, Tom Jin, and Frank Wood. Auto-encoding sequential Monte Carlo. arXiv preprint arXiv:1705.10306, 2017.

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.

David A McAllester and Satinder Singh. Approximate planning for factored POMDPs using belief state simplification.
arXiv preprint arXiv:1301.6719, 2013.

R Andrew McCallum. Overcoming incomplete perception with utile distinction memory. In Proceedings of the Tenth International Conference on Machine Learning, pp. 190-196, 1993.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Zhen-Jia Pang, Ruo-Ze Liu, Zhou-Yu Meng, Yi Zhang, Yang Yu, and Tong Lu. On reinforcement learning for full-length game of StarCraft. arXiv preprint arXiv:1809.09095, 2018.

Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16-17, 2017.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

Trey Smith and Reid Simmons. Heuristic search value iteration for POMDPs. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 520-527. AUAI Press, 2004.

Yuandong Tian, Qucheng Gong, Wenling Shang, Yuxin Wu, and C Lawrence Zitnick. ELF: An extensive, lightweight and flexible research platform for real-time strategy games. In Advances in Neural Information Processing Systems, pp. 2659-2669, 2017.

Yuxin Wu and Yuandong Tian. Training agent for first-person shooter game with actor-critic curriculum learning. 2016.

Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, et al. Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830, 2018.

Pengfei Zhu, Xin Li, Pascal Poupart, and Guanghui Miao. On improving deep reinforcement learning for POMDPs. arXiv preprint arXiv:1804.06309, 2018.

A DERIVATION OF THE VARIATIONAL LOWER BOUND

$$
\begin{aligned}
\log p(O_{0:T}, a_{0:T}, o_{1:T})
&= \log \mathbb{E}_{q(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T})}\left[\frac{p(s_{1:T}, O_{0:T}, a_{0:T}, o_{1:T})}{q(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T})}\right] \\
&\ge \mathbb{E}_{q(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T})}\left[\log \frac{p(s_{1:T}, O_{0:T}, a_{0:T}, o_{1:T})}{q(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T})}\right] \\
&= \int q(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T}) \log \frac{p(s_{1:T}, O_{0:T}, a_{0:T}, o_{1:T})}{q(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T})}\, ds_{1:T} \\
&= \sum_{t=1}^{T} \int q(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T}) \log \frac{p(a_t)\, p(O_t \mid s_t, a_t)\, p(s_t \mid s_{t-1}, a_{t-1})\, p(o_t \mid s_t)}{q(s_t \mid s_{t-1}, a_{t-1}, o_t)}\, ds_{1:T} \\
&= \sum_{t=1}^{T} \int q(s_{1:t} \mid O_{0:t}, a_{0:t}, o_{1:t}) \log \frac{p(a_t)\, p(O_t \mid s_t, a_t)\, p(s_t \mid s_{t-1}, a_{t-1})\, p(o_t \mid s_t)}{q(s_t \mid s_{t-1}, a_{t-1}, o_t)}\, ds_{1:t} \\
&= \sum_{t=1}^{T} \bigg\{ \int q(s_{1:t} \mid O_{0:t}, a_{0:t}, o_{1:t}) \log \left[p(a_t)\, p(O_t \mid s_t, a_t)\, p(o_t \mid s_t)\right] ds_{1:t} + \int q(s_{1:t} \mid O_{0:t}, a_{0:t}, o_{1:t}) \log \frac{p(s_t \mid s_{t-1}, a_{t-1})}{q(s_t \mid s_{t-1}, a_{t-1}, o_t)}\, ds_{1:t} \bigg\} \\
&= \sum_{t=1}^{T} \bigg\{ \int q(s_{1:t} \mid O_{0:t}, a_{0:t}, o_{1:t}) \log \left[p(a_t)\, p(O_t \mid s_t, a_t)\, p(o_t \mid s_t)\right] ds_{1:t} - \int q(s_{1:t-1} \mid O_{0:t-1}, a_{0:t-1}, o_{1:t-1})\, D_{KL}\left[q(s_t \mid s_{t-1}, a_{t-1}, o_t)\, \|\, p(s_t \mid s_{t-1}, a_{t-1})\right] ds_{1:t-1} \bigg\} \\
&= \mathbb{E}_{q(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T})} \sum_{t=1}^{T} \Big\{ \log \left[p(a_t)\, p(O_t \mid s_t, a_t)\, p(o_t \mid s_t)\right] - D_{KL}\left[q(s_t \mid s_{t-1}, a_{t-1}, o_t)\, \|\, p(s_t \mid s_{t-1}, a_{t-1})\right] \Big\} \\
&= \mathbb{E}_{q(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T})} \sum_{t=1}^{T} \Big\{ r(s_t, a_t) + \log \left[p(a_t)\, p(o_t \mid s_t)\right] - D_{KL}\left[q(s_t \mid s_{t-1}, a_{t-1}, o_t)\, \|\, p(s_t \mid s_{t-1}, a_{t-1})\right] \Big\},
\end{aligned}
$$

where $s_{1:T} \sim q(s_{1:T} \mid O_{0:T}, a_{0:T}, o_{1:T})$.

B VARIATIONAL AUTOENCODERS

Variational auto-encoders (VAEs) (Kingma & Welling, 2013) are effective generative models that can recover complex distributions over the data space.
In the VAE, there are observed data x and an underlying latent factor z. It is usually intractable to infer the posterior p(z|x), so the VAE uses a variational function q(z|x) to approximate the true posterior. The lower bound of the marginal distribution of the observed data is given by

$$\log p(x) = \log \mathbb{E}_{q(z|x)}\left[\frac{p(x, z)}{q(z|x)}\right] \ge \mathbb{E}_{q(z|x)}\left[\log \frac{p(x, z)}{q(z|x)}\right], \qquad (20)$$

and the lower bound can be equivalently written as

$$\mathcal{L}(x) = -D_{KL}\left(q(z|x)\, \|\, p(z)\right) + \mathbb{E}_{q(z|x)}\left[\log p(x|z)\right], \qquad (21)$$

where $D_{KL}(\cdot \| \cdot)$ is the Kullback-Leibler divergence between two distributions. The approximate posterior q(z|x) is often set to a Gaussian $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, where $\mu$ and $\sigma$ are the outputs of a non-linear function of x. The generative model p(x|z) and the inference model q(z|x) can be trained jointly via standard backpropagation techniques.

C OPTIMIZE THE KL IN EQ. (13)

We give the final form of the loss function below:

$$
\begin{aligned}
D_{KL}(p\, \|\, q) &= \int p(x) \log \frac{p(x)}{q(x)}\, dx = \int p(x) \log p(x)\, dx - \int p(x) \log q(x)\, dx \\
&= \frac{1}{2}\left[\frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{\sigma_q^2} - 2\left(\log \sigma_p - \log \sigma_q\right) - 1\right],
\end{aligned}
$$

where p and q are two Gaussian distributions, and $\mu_p, \sigma_p$ and $\mu_q, \sigma_q$ are their means and standard deviations, respectively.

D NETWORK ARCHITECTURE

D.1 ATARI

The network architecture for Atari is shown below.

D.2 VIZDOOM

The network architecture for ViZDoom is shown below.

E TRAINING DETAILS

Training architecture: The data-producing module, the training module and the evaluation module run in parallel. The framework is implemented in Python and TensorFlow.

E.1 ATARI

All models take only one observation as input at each time step, and the vision inputs are resized to a resolution of 84×84. For the recurrent neural networks, we use a sequence length of 5 for training. All algorithms are trained for 10,000,000 steps and run for 100 episodes during evaluation.

E.2 VIZDOOM

All the algorithms are trained with the same basic network architectures and use the same hyper-parameters. All models take only one observation as input at each time step, and the vision inputs are resized to a resolution of 84×84. For the recurrent neural networks, we use a sequence length of 5 for training. All algorithms are trained for 300,000 steps and run for 20 episodes during evaluation. The discount factor γ is set to 0.95, the learning rate is 0.0001 and the Adam optimizer is used for training.
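As a quick, illustrative sanity check (not part of the paper), the snippet below compares the closed form in Appendix C against a Monte Carlo estimate of $D_{KL}(p \| q)$ for two univariate Gaussians; the chosen means and standard deviations are arbitrary examples.

```python
import math
import torch

mu_p, sigma_p = 0.5, 1.2
mu_q, sigma_q = -0.3, 0.8

# Closed form from Appendix C:
# 0.5 * [ (sigma_p^2 + (mu_p - mu_q)^2) / sigma_q^2 - 2*(log sigma_p - log sigma_q) - 1 ]
closed = 0.5 * ((sigma_p**2 + (mu_p - mu_q)**2) / sigma_q**2
                - 2 * (math.log(sigma_p) - math.log(sigma_q)) - 1)

# Monte Carlo estimate: E_{x ~ p}[ log p(x) - log q(x) ]
p = torch.distributions.Normal(mu_p, sigma_p)
q = torch.distributions.Normal(mu_q, sigma_q)
x = p.sample((200_000,))
mc = (p.log_prob(x) - q.log_prob(x)).mean().item()

print(f"closed form: {closed:.4f}, Monte Carlo: {mc:.4f}")  # the two should agree closely
```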