# Multi-View Reinforcement Learning

Minne Li, University College London, London, United Kingdom, minne.li@cs.ucl.ac.uk
Lisheng Wu, University College London, London, United Kingdom, lisheng.wu.17@ucl.ac.uk
Haitham Bou Ammar, University College London, London, United Kingdom, haitham.bouammar71@googlemail.com
Jun Wang, University College London, London, United Kingdom, junwang@cs.ucl.ac.uk

Equal contributions. Honorary Lecturer at University College London.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

This paper is concerned with multi-view reinforcement learning (MVRL), which allows for decision making when agents share common dynamics but adhere to different observation models. We define the MVRL framework by extending partially observable Markov decision processes (POMDPs) to support more than one observation model and propose two solution methods through observation augmentation and cross-view policy transfer. We empirically evaluate our method and demonstrate its effectiveness in a variety of environments. Specifically, we show reductions in sample complexity and computational time for acquiring policies that handle multi-view environments.

1 Introduction

In reinforcement learning (RL), tasks are defined as Markov decision processes (MDPs) with state and action spaces, transition models, and reward functions. An RL agent interacts with its environment by executing an action in the current state according to some policy. Based on the action choice, the environment responds by transitioning the agent to a new state and providing an instantaneous reward quantifying the quality of the executed action. This process repeats until a terminal condition is met. The goal of the agent is to learn an optimal action-selection rule that maximizes the total expected return from any initial state. Though minimally supervised, this framework has become a powerful tool for decision making under uncertainty, with applications ranging from computer games [24, 30] to neural architecture search [42], robotics [3, 25, 27], and multi-agent systems [22, 34, 38, 40].

Common RL algorithms, however, only consider observations from one view of the state space [17]. Such an assumption can become too restrictive in real-life scenarios. To illustrate, imagine designing an autonomous vehicle that is equipped with multiple sensors. For such an agent to execute safe actions, data fusion is necessary so as to account for all available information about the world. Consequently, agent policies now have to be conditioned on varying state descriptions, which, in turn, leads to challenging representation and learning questions. In fact, acquiring good-enough policies in the multi-view setting is more complex than in standard RL due to the increased sample complexity of reasoning about varying views. If solved, however, multi-view RL will allow for data fusion, fault tolerance to sensor deterioration, and policy generalization across domains.

Numerous algorithms for multi-view learning in supervised tasks have been proposed. Interested readers are referred to the surveys in [23, 39, 41], and references therein, for a detailed exposition. Though abundant in supervised learning tasks, multi-view data fusion for decision making has gained less attention. In fact, our search revealed only a few papers attempting to target this exact problem.
A notable algorithm is the work in [6], which proposed a double-task deep Q-network for multi-view reinforcement learning. We believe this attempt handles the multi-view decision problem indirectly by carrying over innovations from computer vision: observations from cameras at different angles are augmented into one state and fed to a standard deep Q-network. Our attempt, on the other hand, aims to resolve multi-view decision making directly by learning joint models for autonomous planning. As a by-product of our method, we arrive at a learning pipeline that allows for improvements in both policy learning and feature representations.

Closest to our work are algorithms from the domain of partially observable Markov decision processes (POMDPs) [17]. There, the environment's state is also hidden, and the agent is equipped with a sensor (i.e., an observation function) with which to build a belief of the latent state in order to execute and learn its policy. Although most algorithms consider one observation function (i.e., one view), one can generalize the definition of POMDPs to multiple types of observations by allowing for a joint observation space, and consequently a joint observation function, across varying views. Though possible in principle, we are not aware of any algorithm from the POMDP literature targeting this scenario. Our problem, in fact, can become substantially harder from both the representation and learning perspectives. To illustrate, consider a POMDP with only two views, the first being images and the second a low-dimensional time series corresponding to, say, joint angles and angular velocities. Following the idea of constructing a joint observation space, one would be looking for a map from a history of observations (and potentially actions) to a new observation at subsequent time steps. Such an observation can be an image, a time series, or both. In this setting, constructing a joint observation mapping is difficult due to the varying nature of the outputs and their occurrences in the history. Due to the large sample and computational complexities involved in designing larger deep learners with varying output units, and switching mechanisms to differentiate between views, we rather advocate a more grounded framework by drawing inspiration from the well-established multi-view supervised learning literature, leading us to multi-view reinforcement learning³.

Our framework for multi-view reinforcement learning also shares similarities with multi-task reinforcement learning, a framework that has gained considerable attention [3, 10, 11, 12, 26, 32, 36]. Particularly, one can imagine multi-view RL as a case of multi-task RL where tasks share action spaces, transition models, and reward functions, but differ in their state representations. Though a bridge between multi-view and multi-task can be constructed, it is worth mentioning that most works on multi-task RL consider the same observation and action spaces but varying dynamics and/or reward functions [33]. As such, these methods fail to handle fusion and generalization across feature representations that vary between domains. A notable exception is the work in [2], which transfers knowledge between task groups with varying state and/or action spaces. Though successful, this method assumes a model-free setting with linear policies. As such, it fails to efficiently handle high-dimensional environments, which require deep network policies.
In this paper, we contribute by introducing a framework for multi-view reinforcement learning that generalizes partially observable Markov decision processes (POMDPs) to ones that exhibit multiple observation models. We first derive a straightforward solution based on state augmentation that demonstrates superior performance on various benchmarks when compared to state-of-the-art proximal policy optimization (PPO) in multi-view scenarios. We then provide an algorithm for multi-view model learning, and propose a solution capable of transferring policies learned from one view to another. This, in turn, greatly reduces the amount of training samples needed, by around two orders of magnitude in most tasks, when compared to PPO. Finally, in another set of experiments, we show that our algorithm outperforms PILCO [9] in terms of sample complexity, especially on high-dimensional and non-smooth systems, e.g., Hopper⁴. Our contributions can be summarized as: 1) formalizing multi-view reinforcement learning as a generalization of POMDPs; 2) proposing two solutions to multi-view RL based on state augmentation and policy transfer; and 3) demonstrating policy improvements against state-of-the-art methods on a variety of control benchmarks.

³ We note that other titles for this work are also possible, e.g., Multi-View POMDPs or Multi-Observation POMDPs. The emphasis is on the fact that little literature has considered RL with varying sensory observations.
⁴ It is worth noting that our experiments reveal that PILCO, for instance, faces challenges when dealing with high-dimensional systems, e.g., Hopper. In future work, we plan to investigate latent-space Gaussian process models with the aim of carrying sample-efficient algorithms to high-dimensional systems.

2 Multi-View Reinforcement Learning

This section introduces multi-view reinforcement learning by extending MDPs to allow for multiple state representations and observation densities. We show that POMDPs can be seen as a special case of our framework, where inference about the latent state only uses one view of the state space.

2.1 Multi-View Markov Decision Processes

To allow agents to reason about varying state representations, we generalize the notion of an MDP to a multi-view MDP, which is defined by the tuple $\mathcal{M}_{\text{multi-view}} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{O}^{1}, \mathcal{P}^{1}_{\text{obs}}, \dots, \mathcal{O}^{N}, \mathcal{P}^{N}_{\text{obs}} \rangle$. Here, $\mathcal{S} \subseteq \mathbb{R}^{d}$, $\mathcal{A} \subseteq \mathbb{R}^{m}$, and $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ represent the standard MDP's state and action spaces, as well as the transition model, respectively. Contrary to MDPs, multi-view MDPs incorporate additional components responsible for formally representing multiple views belonging to different observation spaces. We use $\mathcal{O}^{j}$ and $\mathcal{P}^{j}_{\text{obs}}: \mathcal{S} \times \mathcal{O}^{j} \to [0, 1]$ to denote the observation space and observation model of each sensor $j \in \{1, \dots, N\}$. At some time step $t$, the agent executes an action $a_t \in \mathcal{A}$ according to its policy, which conditions on a history of heterogeneous observations (and potentially actions) $\mathcal{H}_t = \{o^{i_1}_1, \dots, o^{i_t}_t\}$, where $o^{i_k}_k \sim \mathcal{P}^{i_k}_{\text{obs}}(\cdot \mid s_k)$ for $k \in \{1, \dots, t\}$ and⁵ $i_k \in \{1, \dots, N\}$. We define $\mathcal{H}_t$ to represent the history of observations, and introduce a superscript $i_t$ to denote the type of view at the $t$-th time instance. Each $i_t$ depends on the observation function (i.e., the availability of specific views) and is conditioned on the environment's state. As per our definition, we allow for $N$ different types of observations; therefore, $i_t$ is allowed to vary from one to $N$.
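To make the tuple above concrete, the following is a minimal Python sketch (not from the paper) of a multi-view MDP in which a hidden state evolves under a shared transition model while each step emits an observation from one of $N$ view-specific observation models. All class and function names, and the toy dynamics, are illustrative assumptions.

```python
import numpy as np

class MultiViewMDP:
    """Minimal sketch of the tuple <S, A, P, O^1, P^1_obs, ..., O^N, P^N_obs>.

    `transition` plays the role of P(.|s, a); each entry of `observation_models`
    is one observation model P^j_obs(.|s). All names are illustrative."""

    def __init__(self, transition, observation_models, init_state):
        self.transition = transition          # maps (s, a) to a sampled next state
        self.observe = observation_models     # list of callables, one per view
        self.state = init_state               # hidden from the agent

    def step(self, action, rng):
        self.state = self.transition(self.state, action)
        view = rng.integers(len(self.observe))        # which sensor fires, i.e. i_t
        return view, self.observe[view](self.state)   # (i_t, o_t^{i_t})


# Toy instantiation: a 1-D random walk observed either directly (view 0)
# or through a noisy, dummy-padded 3-D sensor (view 1).
rng = np.random.default_rng(0)
env = MultiViewMDP(
    transition=lambda s, a: s + a + 0.01 * rng.normal(),
    observation_models=[
        lambda s: np.array([s]),
        lambda s: np.concatenate([[s + 0.1 * rng.normal()], np.zeros(2)]),
    ],
    init_state=0.0,
)
history = [env.step(a, rng) for a in [1.0, -1.0, 1.0]]  # H_t as (i_k, o_k) pairs
```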
Depending on the selected action, the environment then transitions to a successor state $s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)$, which is not directly observable by the agent. On the contrary, the agent only receives a successor view $o^{i_{t+1}}_{t+1} \sim \mathcal{P}^{i_{t+1}}_{\text{obs}}(\cdot \mid s_{t+1})$ with $i_{t+1} \in \{1, \dots, N\}$.

2.2 Multi-View Reinforcement Learning Objective

As in standard literature, the goal is to learn a policy that maximizes the total expected return. To formalize such a goal, we introduce a multi-view trajectory $\tau$, which augments standard POMDP trajectories with multi-view observations, i.e., $\tau = [s_1, o^{i_1}_1, a_1, \dots, s_T, o^{i_T}_T]$, and consider finite-horizon cumulative rewards computed on the real state (hidden to the agent) and the agent's actions. With this, the goal of the agent is to determine a policy $\pi_{\omega}$ that maximizes the following optimization objective:

$$\max_{\omega} \; \mathbb{E}_{\tau \sim p_{\omega}(\tau)}\left[\sum_{t=1}^{T} \gamma^{t} \mathcal{R}(s_t, a_t)\right], \qquad (1)$$

where $\gamma$ is the discount factor. The last component needed for us to finalize our problem definition is to understand how to factor multi-view trajectory densities. Knowing that the trajectory density is defined over joint observations, states, and actions, we write:

$$p_{\omega}(\tau) = \mathcal{P}_0(s_1)\, \mathcal{P}^{i_1}_{\text{obs}}\big(o^{i_1}_1 \mid s_1\big) \prod_{t=2}^{T} \mathcal{P}(s_t \mid s_{t-1}, a_{t-1})\, \pi_{\omega}(a_{t-1} \mid \mathcal{H}_{t-1})\, \mathcal{P}^{i_t}_{\text{obs}}\big(o^{i_t}_t \mid s_t\big),$$

with $\mathcal{P}_0(s_1)$ being the initial state distribution.

The generalization above arrives with additional sample and computational burdens, rendering current solutions to POMDPs inadequate. Among various challenges, multi-view policy representations capable of handling varying sensory signals can become expensive to both learn and represent. That being said, we can still reason about such structures and follow a policy-gradient technique to learn the parameters of the network. However, a multi-view policy network needs to adhere to a crucial property, namely fusing heterogeneous observations, and can be regarded as a special case of representation fusion networks from multi-view representation learning [23, 39, 41]. We give our derivation of the general gradient update laws following model-free policy gradients in Sect. 3.1. Contrary to standard POMDPs, our trajectory density is generalized to support multiple state views by requiring different observation models through time. Such a generalization allows us to advocate a more efficient model-based solver that enables cross-view policy transfer (i.e., conditioning one view's policies on another) and few-shot RL, as we shall introduce in Sect. 3.2.

⁵ Please note that it is possible to incorporate the actions in $\mathcal{H}_t$ by simply introducing additional action variables.

3 Solution Methods

This section presents our model-free and model-based approaches to solving multi-view reinforcement learning. Given a policy network, our model-free solution derives a policy gradient theorem, which can be approximated using Monte Carlo estimates and history-dependent baselines when updating model parameters. We then propose a model-based alternative that learns joint dynamical (and emission) models to allow for control in the latent space and cross-view policy transfer.

3.1 Model-Free Multi-View Reinforcement Learning through Observation Augmentation

The type of algorithm we employ for model-free multi-view reinforcement learning falls in the class of policy gradient algorithms. Given existing advancements in Monte Carlo estimation, variance-reduction methods (e.g., observation-based baselines $\mathcal{B}_{\phi}(\mathcal{H}_t)$), and the problem definition in Eq. (1), we can proceed by giving the rule for updating the policy parameters $\omega$ as:

$$\omega_{k+1} = \omega_{k} + \eta_{k}\, \mathbb{E}_{\tau}\left[\sum_{t=1}^{T-1} \nabla_{\omega} \log \pi_{\omega}(a_t \mid \mathcal{H}_t) \left(\sum_{t'=t}^{T} \gamma^{t'} \mathcal{R}(s_{t'}, a_{t'}) - \mathcal{B}_{\phi}(\mathcal{H}_t)\right)\right], \qquad (2)$$

with $\eta_k$ a learning rate. Please refer to the appendix for the detailed derivation.
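As a concrete illustration of the update in Eq. (2), below is a minimal PyTorch sketch (not the paper's implementation) of a Monte Carlo policy-gradient step with an observation-history baseline $\mathcal{B}_{\phi}(\mathcal{H}_t)$. Histories are assumed to be flattened fixed-length windows of (possibly fused) observations; all dimensions, module names, and the discrete-action assumption are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: each view is padded/fused to obs_dim per step.
obs_dim, act_dim, hist_len = 8, 3, 4

policy = nn.Sequential(nn.Linear(hist_len * obs_dim, 64), nn.Tanh(),
                       nn.Linear(64, act_dim))          # pi_omega(a | H_t)
baseline = nn.Sequential(nn.Linear(hist_len * obs_dim, 64), nn.Tanh(),
                         nn.Linear(64, 1))              # B_phi(H_t)
opt = torch.optim.Adam([*policy.parameters(), *baseline.parameters()], lr=3e-4)

def reinforce_update(histories, actions, returns):
    """One Monte Carlo step of Eq. (2): `histories` are flattened H_t windows,
    `actions` the executed a_t, `returns` the discounted rewards-to-go."""
    logits = policy(histories)
    logp = torch.distributions.Categorical(logits=logits).log_prob(actions)
    b = baseline(histories).squeeze(-1)
    advantage = returns - b.detach()                     # variance reduction
    loss = -(logp * advantage).mean() + 0.5 * (returns - b).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Toy call with random data, just to show the expected shapes.
H = torch.randn(32, hist_len * obs_dim)
a = torch.randint(act_dim, (32,))
G = torch.randn(32)
reinforce_update(H, a, G)
```

In practice the paper optimizes policies with PPO rather than vanilla REINFORCE; the sketch only shows how the history-conditioned gradient and baseline enter the update.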
While a policy gradient algorithm can be implemented, the above update rule is oblivious to the availability of multiple views of the state space and closely resembles standard (one-view) POMDP scenarios. This only increases the number of samples required for training, as well as the variance of the gradient estimates. We thus introduce a straightforward model-free MVRL algorithm by leveraging fusion networks from multi-view representation learning. Specifically, we assume that corresponding observations from all views, i.e., observations that share the same latent state, are accessible during training. Although the parameter update rule is exactly the same as defined in Eq. (2), this method manages to utilize the knowledge of shared dynamics across different views, and is thus more effective than independent model-free learners, i.e., regarding each view as a separate environment and learning a policy for each.

3.2 Model-Based Multi-View Reinforcement Learning

Figure 1: The graphical model of multi-view learning.

We now propose a model-based approach that learns approximate transition models for multi-view RL, allowing for policies that are simpler to learn and represent. Our learnt model can also be used for cross-view policy transfer (i.e., conditioning one view's policies on another), few-shot RL, and typical model-based RL, enabling policy improvements through back-propagation in learnt joint models.

3.2.1 Multi-View Model Learning

The purpose of model learning is to abstract a hidden joint model shared across varying views of the state space. For accurate predictions, we envision the generative model in Fig. (1). Here, observed random variables are denoted by $o^{i_t}_t$ for a time step $t$ and $i_t \in \{1, \dots, N\}$, while $s_t \in \mathbb{R}^{d}$ represents latent variables that temporally evolve conditioned on applied actions. As multiple observations are allowed, our model generalizes to multi-view by supporting varying emission models depending on the nature of observations. Crucially, we do not assume Markov transitions in the latent space, as we believe that reasoning about multiple views requires more than one-step history information.

To define our optimization objective, we follow standard variational inference. Before deriving the variational lower bound, however, we first introduce additional notation to ease exposition. Recall that $o^{i_t}_t$ is the observation vector at time step $t$ for view $i_t$⁶. To account for action conditioning, we further augment $o^{i_t}_t$ with executed controls, leading to $o^{i_t,C}_t = [o^{i_t}_t, a_t]^{\mathsf{T}}$ for $t \in \{1, \dots, T-1\}$, and $o^{i_T,C}_T = o^{i_T}_T$ for the last time step $T$. Analogous to standard latent-variable models, our goal is to learn latent transitions that maximize the marginal likelihood of observations, $\max \log p(o^{i_1,C}_1, \dots, o^{i_T,C}_T) = \max \log p(o^{i_1}_1, a_1, \dots, o^{i_{T-1}}_{T-1}, a_{T-1}, o^{i_T}_T)$. According to the graphical model in Fig. (1), observations are generated from latent variables which are temporally evolving. Hence, we regard observations as the result of a process in which latent states have been marginalized out:

$$p\big(o^{MC}_{1:T}\big) = \int \cdots \int p\big(\underbrace{o^{i_1}_1, a_1, \dots, o^{i_{T-1}}_{T-1}, a_{T-1}, o^{i_T}_T}_{\text{observations and actions}}, \underbrace{s_1, \dots, s_T}_{\text{latent variables}}\big)\, \underbrace{\mathrm{d}s_1 \cdots \mathrm{d}s_T}_{\text{marginalization}},$$

where $o^{MC}_{1:T}$ collects all multi-view observations and actions across time steps. To devise an algorithm that reasons about latent dynamics, two components need to be better understood.

⁶ Different from the model-free solver introduced in Sect. 3.1, we do not assume accessibility to all views.
The first relates to factoring our joint density, while the second relates to approximating multi-dimensional integrals. To factorize our joint density, we follow the modeling assumptions in Fig. (1) and write:

$$p\big(o^{MC}_{1:T}, s_1, \dots, s_T\big) = p_{\theta_1}(s_1) \prod_{t=1}^{T} p_{\theta^{i_t}_2}\big(o^{i_t}_t \mid s_t\big) \prod_{t=2}^{T} p_{\theta^{i_t}_3}\big(s_t \mid \hat{\mathcal{H}}_t\big),$$

where in the last step we introduced $\theta^{i_t} = \{\theta_1, \theta^{i_t}_2, \theta^{i_t}_3\}$ to emphasize the modeling parameters that need to be learnt, and $\hat{\mathcal{H}}_t$ to concatenate state and action histories back to $s_1$. Having dealt with density factorization, another problem to circumvent is that of computing intractable multi-dimensional integrals. This can be achieved by introducing a variational distribution over latent variables, $q_{\phi}(s_1, \dots, s_T \mid o^{MC}_{1:T})$, which transforms integration into an optimization problem:

$$\log p\big(o^{MC}_{1:T}\big) = \log \int \frac{q_{\phi}(s_{1:T} \mid o^{MC}_{1:T})}{q_{\phi}(s_{1:T} \mid o^{MC}_{1:T})}\, p\big(o^{MC}_{1:T}, s_{1:T}\big)\, \mathrm{d}s_{1:T} \;\geq\; \mathbb{E}_{q_{\phi}(s_{1:T} \mid o^{MC}_{1:T})}\left[\log \frac{p\big(o^{MC}_{1:T}, s_{1:T}\big)}{q_{\phi}(s_{1:T} \mid o^{MC}_{1:T})}\right],$$

where we used the concavity of the logarithm and Jensen's inequality in the second step of the derivation. We assume a mean-field decomposition for the variational distribution, $q_{\phi}(s_1, \dots, s_T \mid o^{MC}_{1:T}) = \prod_{t=1}^{T} q_{\phi_t}(s_t \mid \mathcal{H}_t)$, with $\mathcal{H}_t$ being the observation and action history. This leads to:

$$\log p\big(o^{MC}_{1:T}\big) \;\geq\; \sum_{t=1}^{T} \mathbb{E}_{q_{\phi_t}(s_t \mid \mathcal{H}_t)}\left[\log p_{\theta^{i_t}_2}\big(o^{i_t}_t \mid s_t\big)\right] - \sum_{t=2}^{T} \mathrm{KL}\left(q_{\phi_t}(s_t \mid \mathcal{H}_t)\,\big\|\, p_{\theta^{i_t}_3}(s_t \mid \hat{\mathcal{H}}_t)\right) - \mathrm{KL}\left(q_{\phi_1}(s_1)\,\big\|\, p_{\theta_1}(s_1)\right),$$

where $\mathrm{KL}(p \,\|\, q)$ denotes the Kullback–Leibler divergence between two distributions. Assuming shared variational parameters (e.g., one variational network), model learning can be formulated as:

$$\max_{\theta, \phi} \; \sum_{t=1}^{T} \mathbb{E}_{q_{\phi^{i_t}}(s_t \mid \mathcal{H}_t)}\left[\log p_{\theta^{i_t}}\big(o^{i_t}_t \mid s_t\big)\right] - \mathrm{KL}\left(q_{\phi^{i_t}}(s_t \mid \mathcal{H}_t)\,\big\|\, p_{\theta^{i_t}}(s_t \mid \hat{\mathcal{H}}_t)\right). \qquad (3)$$

Intuitively, Eq. (3) fits the model by maximizing the multi-view observation likelihood while being regularized through the KL term. Clearly, this is similar to the standard evidence lower bound, with additional components related to handling multi-view types of observations.

3.2.2 Distribution Parameterization and Implementation Details

To finalize our problem definition, the choices for the modeling and variational distributions are ultimately problem-dependent. To encode transitions beyond Markov assumptions, we use a memory-based model $g_{\theta}$ (e.g., a recurrent neural network) to serve as the history encoder and future predictor, i.e., $h_t = g_{\theta}(s_{t-1}, h_{t-1}, a_{t-1})$. Introducing memory splits the model into stochastic and deterministic parts, where the deterministic part is the memory model $g_{\theta}$, while the stochastic part is the conditional prior distribution on latent states $s_t$, i.e., $p_{\theta^{i_t}}(s_t \mid h_t)$. We assume that this distribution is Gaussian, with its mean and variance parameterized by a feed-forward neural network taking $h_t$ as input. As for the observation model, the exact form is domain-specific, depending on the available observation types. In the case when our observation is a low-dimensional vector, we chose a Gaussian parameterization with mean and variance output by a feed-forward network as above. When dealing with images, we parameterized the mean by a deconvolutional neural network [13] and kept an identity covariance. The variational distribution $q_{\phi}$ can thus, respectively, be parameterized by a feed-forward neural network and a convolutional neural network [18] for these two types of views.

With the above assumptions, we can now derive the training loss used in our experiments. First, we rewrite Eq. (3) as

$$\max_{\theta, \phi} \; \sum_{t=1}^{T} \mathbb{E}_{q^{i_t}}\left[\log p_{\theta^{i_t}}\big(o^{i_t}_t \mid s_t\big)\right] + \mathbb{E}_{q^{i_t}}\left[\log p_{\theta^{i_t}}(s_t \mid h_t)\right] + \mathcal{H}\big[q^{i_t}\big], \qquad (4)$$

where $\mathcal{H}$ denotes the entropy and $q^{i_t}$ represents $q_{\phi^{i_t}}(s_t \mid h_t, o_t)$. From the first two terms in Eq. (4), we realize that the optimization problem for multi-view model learning consists of two parts: 1) observation reconstruction, and 2) transition prediction.
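Before detailing these two parts, the following PyTorch sketch (an assumption on our part, not the paper's code) summarizes the parameterization just described for low-dimensional views: a deterministic GRU memory $h_t = g_{\theta}(s_{t-1}, h_{t-1}, a_{t-1})$, a Gaussian prior head for $p_{\theta^{i_t}}(s_t \mid h_t)$, and per-view Gaussian posteriors $q_{\phi^{i_t}}(s_t \mid h_t, o_t)$ and decoders. Image views would replace the linear decoders with a deconvolutional network; all sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class MultiViewLatentModel(nn.Module):
    """Sketch of the memory model g_theta, the prior p(s_t|h_t), and
    per-view posteriors/decoders; sizes and names are illustrative."""

    def __init__(self, obs_dims, act_dim, latent=30, hidden=200):
        super().__init__()
        self.memory = nn.GRUCell(latent + act_dim, hidden)    # h_t = g(s_{t-1}, h_{t-1}, a_{t-1})
        self.prior = nn.Linear(hidden, 2 * latent)            # mean / log-variance of p(s_t | h_t)
        self.posteriors = nn.ModuleList(                      # q(s_t | h_t, o_t^{i_t}), one per view
            [nn.Linear(hidden + d, 2 * latent) for d in obs_dims])
        self.decoders = nn.ModuleList(                        # p(o_t^{i_t} | s_t), one per view
            [nn.Linear(latent, d) for d in obs_dims])

    def step(self, view, obs, prev_s, prev_h, prev_a):
        h = self.memory(torch.cat([prev_s, prev_a], -1), prev_h)
        prior_mu, prior_logvar = self.prior(h).chunk(2, -1)
        post_mu, post_logvar = self.posteriors[view](torch.cat([h, obs], -1)).chunk(2, -1)
        s = post_mu + (0.5 * post_logvar).exp() * torch.randn_like(post_mu)  # reparameterized sample
        recon = self.decoders[view](s)                        # mean of the Gaussian observation model
        return s, h, recon, (prior_mu, prior_logvar), (post_mu, post_logvar)

# Shapes for a two-view setup (4-D and 6-D observations), batch of 16.
model = MultiViewLatentModel(obs_dims=[4, 6], act_dim=2)
s = torch.zeros(16, 30); h = torch.zeros(16, 200); a = torch.zeros(16, 2)
s, h, recon, prior, post = model.step(view=1, obs=torch.randn(16, 6),
                                      prev_s=s, prev_h=h, prev_a=a)
```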
Observation reconstruction operates by: 1) inferring the latent state $s_t$ from the observation $o_t$ using the variational model, and 2) decoding $s_t$ into $\tilde{o}_t$ (an approximation of $o_t$). Transition prediction, on the other hand, operates by feeding the previous latent state $s_{t-1}$ and the previous action $a_{t-1}$ into the memory model to predict the next latent state $\hat{s}_t$. Both parts are optimized by maximizing the log-likelihood under a Gaussian distribution with unit variance. This equates to minimizing the mean squared error between model outputs and the actual variable values:

$$\mathcal{L}_{r}\big(\theta^{i_t}, \phi^{i_t}\big) = -\sum_{t=1}^{T} \log p_{\theta^{i_t}}\big(o^{i_t}_t \mid s_t\big) = \sum_{t=1}^{T} \big\|\tilde{o}^{i_t}_t - o^{i_t}_t\big\|_2^{2},$$

$$\mathcal{L}_{p}\big(\theta, \theta^{i_t}, \phi^{i_t}\big) = -\sum_{t=1}^{T} \mathbb{E}_{q^{i_t}}\left[\log p_{\theta^{i_t}}(s_t \mid h_t)\right] = \sum_{t=1}^{T} \big\|\hat{s}_t - s_t\big\|_2^{2},$$

where $\|\cdot\|_2$ is the Euclidean norm. Optimizing Eq. (4) also requires maximizing the entropy of the variational model, $\mathcal{H}\big[q_{\phi^{i_t}}(s_t \mid h_t, o_t)\big]$. Intuitively, the variational model aims to increase the element-wise similarity of the latent state $s$ among corresponding observations [14]. Thus, we represent the entropy term $\mathcal{L}_{H}(\theta^{i_t}, \phi^{i_t})$ using the means of $q_{\phi^{i_t}}(s_t \mid h_t, o_t)$, where $\bar{\mu}^{i_t} \in \mathbb{R}^{K}$ is the average value of the mean of the diagonal Gaussian representing $q_{\phi^{i_t}}(s_t \mid h_t, o_t)$ over each training batch.

3.2.3 Policy Transfer and Few-Shot Reinforcement Learning

As introduced in Section 2.2, trajectory densities in MVRL generalize to support multiple state views by requiring different observation models through time. Such a generalization enables us to achieve cross-view policy transfer and few-shot RL, where we only require very little data from a specific view to train the multi-view model. The model can then be used for action selection by: 1) inferring the corresponding latent state, and 2) feeding the latent state into the policy learned from another view with greater accessibility. Details can be found in Appendix A.

Concretely, our learned model should be able to reconstruct corresponding observations for views with shared underlying dynamics (latent state $s$). During model learning, we thus validate the variational and observation models by: 1) inferring the latent state $s$ from the first view's observation $o^{1}$, and 2) comparing the reconstructed corresponding observations of the other views, $\tilde{o}^{i_t}$, with the actual observations $o^{i_t}$ by computing the transformation loss $\mathcal{L}_{t} = \sum_{i_t=2}^{N} \|\tilde{o}^{i_t} - o^{i_t}\|_2^{2}$. Similarly, the memory model can be validated by: 1) reconstructing the predicted latent state $\hat{s}^{1}$ of the first view using the observation models of the other views to get $\hat{o}^{i_t}$, and 2) comparing $\hat{o}^{i_t}$ with the actual observation $o^{i_t}$ by computing the prediction transformation loss $\mathcal{L}_{pt} = \sum_{i_t=2}^{N} \|\hat{o}^{i_t} - o^{i_t}\|_2^{2}$.

4 Experiments

We evaluate our method on a variety of dynamical systems varying in the dimensionality of their state representations. We consider both high- and low-dimensional problems to demonstrate the effectiveness of our model-free and model-based solutions. On the model-free side, we demonstrate performance against state-of-the-art methods, such as Proximal Policy Optimization (PPO) [29]. On the model-based side, we are interested in knowing whether our model successfully learns shared dynamics across varying views, and whether these can then be utilized to enable efficient control. We consider dynamical systems from the Atari suite, Roboschool [29], PyBullet [8], and the Highway environments [20]. We generate varying views either by transforming state representations, introducing noise, or by augmenting state variables with additional dummy components.
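For the control tasks below, the additional view is built from the original observation by appending two dummy dimensions and injecting noise of scale 0.1 after observation normalization, as stated in Section 4.2. A minimal sketch of such a view generator is given here; the function name and sampling choices are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def second_view(obs, rng, n_dummy=2, noise_scale=0.1):
    """Generate a second view from a normalized observation by appending
    dummy dimensions and adding large-scale noise (cf. Section 4.2)."""
    noisy = obs + noise_scale * rng.normal(size=obs.shape)
    dummy = rng.normal(size=n_dummy)          # uninformative extra state variables
    return np.concatenate([noisy, dummy])

rng = np.random.default_rng(0)
o1 = np.array([0.02, -0.5, 0.013, 0.4])       # e.g., a normalized Cartpole observation
o2 = second_view(o1, rng)                      # 6-D second view of the same latent state
```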
When considering the game of Pong, we allow for varying observations by introducing various transformations of the image representing the state, e.g., rotation, flipping, and horizontal swapping. We then compare our multi-view model with a state-of-the-art modeling approach titled World Models [16]; see Section 4.1. Given successful modeling results, we proceed to demonstrate control in both model-free and model-based scenarios in Section 4.2. Our results demonstrate that although multi-view model-free algorithms can present advantages when compared to standard RL, multi-view model-based techniques are considerably more efficient in terms of sample complexity.

Figure 2: Training multi-view models on Atari Pong. Panels: (a) Γo and Γt, (b) Γo and Γh, (c) Γo and Γc, (d) Γo and Γm. Legend: p: prediction loss $\mathcal{L}_p$; r: reconstruction loss $\mathcal{L}_r$; t: transformation loss $\mathcal{L}_t$; pt: predicted transformation loss $\mathcal{L}_{pt}$. These results demonstrate that our method correctly converges in terms of loss values.

Figure 3: Difference between inferred latent states from Γo and Γt. Panels: (a) log(σ²), multi-view; (b) distance between µo and µt; (c) log(σ²), WM. The results demonstrate that our method is capable of learning key elements, a property essential for multi-view dynamics learning, and that extracting such key elements is challenging for world models.

4.1 Modeling Results

To evaluate multi-view model learning, we generated five views by varying state representations (i.e., images) in the Atari Pong environment. We kept the dynamics unchanged and considered four sensory transformations of the observation frame Γo. Namely, we considered the varying views as: 1) transposed images Γt, 2) horizontally-swapped images Γh, 3) inverse images Γc, and 4) mirror-symmetric images Γm. Exact details on how these views have been generated can be found in Appendix C.1.1. For simplicity, we built pair-wise multi-view models between Γo and one of the above four variants.

Fig. (2) illustrates convergence results of multi-view prediction and transformation for different views of Atari Pong. Fig. (3) further investigates the learnt shared dynamics among different views (Γo and Γt). Fig. (3a) illustrates the converged log(σ²), the log variance of the latent state variable, in Γo and Γt. Observe that a small group of elements (indexed as 14, 18, 21, 24 and 29) have relatively low variance in both views, thus keeping stable values across different observations. We consider these elements as the critical part in representing the shared dynamics and define them as key elements. Clearly, learning a shared group of key elements across different views is the target of the multi-view model. Results in Fig. (3b) illustrate the distance between µo and µt for the multi-view model, demonstrating convergence. As the same group of elements is, in fact, close to the key elements learnt by the multi-view model, we conclude that we can capture shared dynamics across different views. Further analysis of these key elements is also presented in the appendix. In Fig. (3c), we also report the converged value of log(σ²) of World Models (WM) [16] under the multi-view setting. Although the world model can still infer the latent dynamics of both environments, the large difference between the learnt dynamics demonstrates that varying views pose a challenge to world models; our algorithm, however, is capable of capturing such hidden shared dynamics.
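The four transformed views above are defined precisely in Appendix C.1.1; the numpy sketch below gives one plausible interpretation of each transformation on a single grayscale frame, purely for illustration.

```python
import numpy as np

def make_views(frame):
    """Illustrative versions of the four Pong view transformations named in
    Section 4.1; the exact definitions are given in Appendix C.1.1."""
    h, w = frame.shape[:2]
    return {
        "transpose": np.swapaxes(frame, 0, 1),               # Gamma_t
        "h_swap":    np.concatenate([frame[:, w // 2:],       # Gamma_h: swap left/right halves
                                     frame[:, :w // 2]], axis=1),
        "inverse":   255 - frame,                             # Gamma_c: inverted pixel intensities
        "mirror":    frame[:, ::-1],                          # Gamma_m: mirror symmetry
    }

views = make_views(np.random.randint(0, 256, size=(84, 84), dtype=np.uint8))
```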
4.2 Policy Learning Results

Given successful modeling results, in this section we present results on controlling systems across multiple views within an RL setting. Namely, we evaluate our multi-view RL approach on several high- and low-dimensional tasks. Our systems consisted of: 1) Cartpole ($\mathcal{O} \subseteq \mathbb{R}^{4}$, $a \in \{0, 1\}$), where the goal is to balance a pole by applying left or right forces to a pivot, 2) Hopper ($\mathcal{O} \subseteq \mathbb{R}^{15}$, $\mathcal{A} \subseteq \mathbb{R}^{3}$), where the focus is on locomotion such that the dynamical system hops forward as fast as possible, 3) RACECAR ($\mathcal{O} \subseteq \mathbb{R}^{2}$, $\mathcal{A} \subseteq \mathbb{R}^{2}$), where the observation is the position $(x, y)$ of a randomly placed ball in the camera frame and the reward is based on the distance to the ball, and 4) parking ($\mathcal{O} \subseteq \mathbb{R}^{6}$, $\mathcal{A} \subseteq \mathbb{R}^{2}$), where an ego-vehicle must park in a given space with an appropriate heading (a goal-conditioned continuous control task).

Figure 4: Policy learning results on (a) Cartpole, (b) Hopper, (c) RACECAR, and (d) Parking, demonstrating that our method outperforms others in terms of sample complexity.

The evaluation metric we used was the average testing return across all views with respect to the amount of training samples (number of interactions). We use the same setting in all experiments to generate multiple views. Namely, the original environment observation is used as the first view, and the second view is generated by adding dummy dimensions (two dims) and large-scale noise (0.1 after observation normalization) to the original observation. Such a setting allows us to understand whether our model can learn shared dynamics with mis-specified state representations, and whether such a model can then be used for control. For all experiments, we trained the multi-view model with a few samples gathered from all views and used the resulting model for policy transfer (MV-PT) between views during the test period. We chose the state-of-the-art PPO [29], an algorithm based on the work in [27], as the baseline, training separate models on different views and aggregating the results together. The multi-view model-free (MV-MF) method is trained by augmenting PPO with concatenated observations. Relevant parameter values and implementation details are listed in Appendix C.2.

Fig. (4) shows the average testing return (the average testing success rate for the parking task) across all views. On the Cartpole and parking tasks, our MV-MF algorithms present improvements when compared to strong model-free baselines such as PPO, showing the advantage of leveraging information from multiple views over training independently within each view. On the other hand, multi-view model-based techniques give the best performance on all tasks and reduce the number of samples needed by around two orders of magnitude in most tasks. This shows that MV-PT greatly reduces the amount of training samples required to reach good performance.

Table 1: Model-based RL results on the Cartpole (number of interactions until success).
PILCO: 240; PILCO-aug: 320; MLP: 2334 ± 358; MV-MB: 381 ± 28.

We also conducted a pure model-based RL experiment and compared our multi-view dynamics model against: 1) PILCO [9], which can be regarded as a state-of-the-art model-based solution using a Gaussian process dynamics model, 2) PILCO with augmented states, i.e., a single PILCO model to approximate the data distribution from all views, and 3) a multilayer perceptron (MLP). We use the same planning algorithm for all model-based methods, e.g., hill climbing or Model Predictive Control (MPC) [25], depending on the task at hand.
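At test time, MV-PT follows the two-step recipe from Section 3.2.3: infer the shared latent state from the current view's observation with the multi-view model, then feed that latent into a policy trained on another view. A minimal sketch, reusing the hypothetical MultiViewLatentModel from the earlier sketch and a stand-in latent-space policy, is shown below; none of this is the paper's actual code.

```python
import torch
import torch.nn as nn

def act_via_policy_transfer(model, policy, view, obs, prev_s, prev_h, prev_a):
    """Cross-view policy transfer (MV-PT) sketch: 1) infer the shared latent
    state from the current view's observation with the multi-view model,
    2) query a policy that was trained on a different view but conditions
    on the same latent space."""
    with torch.no_grad():
        s, h, _, _, _ = model.step(view, obs, prev_s, prev_h, prev_a)
        action = policy(s).argmax(dim=-1)   # greedy action in the shared latent space
    return action, s, h

# `model` is assumed to be the MultiViewLatentModel sketched earlier;
# the policy below is a stand-in network over its 30-D latent state.
policy = nn.Sequential(nn.Linear(30, 64), nn.Tanh(), nn.Linear(64, 2))
```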
Table 1 shows the results on the Cartpole environment, where we evaluate all methods by the number of interactions until success. Each training rollout has at most 40 steps of interaction, and we define success as reaching an average testing return of 195.0. Although the multi-view model performs slightly worse than PILCO on the Cartpole task, we found that model-based control alone cannot perform well on tasks without suitable environment rewards. For example, the reward function in Hopper primarily encourages higher speed and lower energy consumption. Such high-level reward functions make it hard for model-based methods to succeed; therefore, the results of model-based algorithms on all tasks are lower than what could be obtained with specifically designed reward functions. Tailoring reward functions, though interesting, is out of the scope of this paper. MV-PT, on the other hand, outperforms the others significantly; see Fig. (4).

5 Related Work

Our work has extended model-based RL to the multi-view scenario. Model-based RL for POMDPs has been shown to be more effective than model-free alternatives in certain tasks [37, 35, 21, 19]. One of the classical combinations of model-based and model-free algorithms is Dyna-Q [31], which learns the policy from both the model and the environment by supplementing real-world on-policy experiences with simulated trajectories. However, using trajectories from a non-optimal or biased model can lead to learning a poor policy [15]. To model the world environment of Atari games, autoencoders have been used to predict the next observation and environment rewards [19]. Some previous works [28, 16, 19] maintain a recurrent architecture to model the world using unsupervised learning and have proven its efficiency in helping RL agents in complex environments. Mb-Mf [25] is a framework bridging the gap between model-free and model-based methods by employing MPC to pre-train a policy within the learned model before training it with a standard model-free method. However, these models can only be applied to a single environment and need to be built from scratch for new environments. Although using a similar recurrent architecture, our work differs from the above works by learning the shared dynamics over multiple views. Also, many of the above advancements are orthogonal to our proposed approach, which can benefit from model ensembles, e.g., pre-training the model-free policy within the multi-view model when reward models are accessible.

Another related research area is multi-task learning (or meta-learning). To achieve multi-task learning, recurrent architectures [10, 36] have also been used to learn to reinforcement learn by adapting to different MDPs automatically. These have been shown to be comparable to the UCB1 algorithm [4] on bandit problems. Meta-learning shared hierarchies (MLSH) [12] shares sub-policies among different tasks during training, where high-hierarchy actions are obtained and reused in other tasks. The model-agnostic meta-learning algorithm (MAML) [11] minimizes the total error across multiple tasks by locally conducting few-shot learning to find the optimal parameters for both supervised learning and RL. Actor-mimic [26] distills multiple DQNs pre-trained on different tasks into one single network to accelerate the learning process by initializing the learning model with the learned parameters of the distilled network. To achieve promising results, these pre-trained DQNs have to be expert policies.
Distral [32] learns multiple tasks jointly and trains a shared policy as the "centroid" by distillation. Concurrently with our work, ADRL [5] has extended model-free RL to multi-view environments and proposed an attention-based policy aggregation method based on the Q-value of the actor-critic worker for each view. Most of the above approaches consider the problem within the model-free RL paradigm and focus on finding the common structure in the policy space. However, model-free approaches require large amounts of data to explore in high-dimensional environments. In contrast, we explicitly maintain a multi-view dynamics model to capture the latent structures and dynamics of the environment, thus obtaining more stable correlation signals. Some algorithms from meta-learning have been adapted to the model-based setting [1, 7]. These focus on model adaptation when the model is incomplete or the underlying MDPs are evolving. By taking the unlearnt model as a new task and continuously learning new structures, the agent can keep its model up to date. Different from these approaches, we focus on how to establish the common dynamics over compact representations of observations generated from different emission models.

6 Conclusions

In this paper, we proposed multi-view reinforcement learning as a generalization of partially observable Markov decision processes that exhibit multiple observation densities. We derive model-free and model-based solutions to multi-view reinforcement learning, and demonstrate the effectiveness of our method on a variety of control benchmarks. Notably, we show that model-free multi-view methods through observation augmentation significantly reduce the number of training samples when compared to state-of-the-art reinforcement learning techniques, e.g., PPO, and demonstrate that model-based approaches through cross-view policy transfer allow for extremely efficient learners needing significantly fewer training samples. There are multiple interesting avenues for future work. First, we would like to apply our technique to real-world robotic systems such as self-driving cars, and second, use our method for transferring between varying views across domains.

References

[1] M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.

[2] H. B. Ammar, E. Eaton, J. M. Luna, and P. Ruvolo. Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 3345–3351. AAAI Press, 2015.

[3] H. B. Ammar, E. Eaton, P. Ruvolo, and M. E. Taylor. Online multi-task learning for policy gradient methods. In Proceedings of the 31st International Conference on Machine Learning, ICML'14, pages II-1206–II-1214. JMLR.org, 2014.

[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, May 2002.

[5] E. Barati and X. Chen. An actor-critic-attention mechanism for deep reinforcement learning in multi-view environments. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 2002–2008. International Joint Conferences on Artificial Intelligence Organization, 7 2019.

[6] J. Chen, T. Bai, X. Huang, X. Guo, J. Yang, and Y. Yao. Double-task deep q-learning with multiple views.
In Proceedings of the IEEE International Conference on Computer Vision, pages 1050–1058, 2017.

[7] I. Clavera, A. Nagabandi, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn. Learning to adapt: Meta-learning for model-based control. arXiv preprint arXiv:1803.11347, 2018.

[8] E. Coumans and Y. Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2019.

[9] M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.

[10] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

[11] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

[12] K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017.

[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[14] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pages 513–520, 2007.

[15] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep q-learning with model-based acceleration. In ICML, pages 2829–2838, 2016.

[16] D. Ha and J. Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pages 2455–2467, 2018.

[17] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, May 1998.

[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[19] F. Leibfried, N. Kushman, and K. Hofmann. A deep learning approach for joint video frame and reward prediction in Atari games. arXiv preprint arXiv:1611.07078, 2016.

[20] E. Leurent. An environment for autonomous driving decision-making. https://github.com/eleurent/highway-env, 2018.

[21] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.

[22] M. Li, Z. Qin, Y. Jiao, Y. Yang, J. Wang, C. Wang, G. Wu, and J. Ye. Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In The World Wide Web Conference, pages 983–994. ACM, 2019.

[23] Y. Li, M. Yang, and Z. M. Zhang. A survey of multi-view representation learning. IEEE Transactions on Knowledge and Data Engineering, 2018.

[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[25] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.

[26] E. Parisotto, J. L. Ba, and R. Salakhutdinov.
Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.

[27] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, March 2008.

[28] J. Schmidhuber. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. arXiv preprint arXiv:1511.09249, 2015.

[29] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[30] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.

[31] R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier, 1990.

[32] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506, 2017.

[33] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu. Distral: Robust multitask reinforcement learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4496–4506. Curran Associates, Inc., 2017.

[34] Z. Tian, Y. Wen, Z. Gong, F. Punakkath, S. Zou, and J. Wang. A regularized opponent model with maximum entropy objective. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 602–608. International Joint Conferences on Artificial Intelligence Organization, 7 2019.

[35] N. Wahlström, T. B. Schön, and M. P. Deisenroth. From pixels to torques: Policy learning with deep dynamical models. arXiv preprint arXiv:1502.02251, 2015.

[36] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

[37] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746–2754, 2015.

[38] Y. Wen, Y. Yang, R. Luo, J. Wang, and W. Pan. Probabilistic recursive reasoning for multi-agent reinforcement learning. In International Conference on Learning Representations, 2019.

[39] C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013.

[40] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang. Mean field multi-agent reinforcement learning. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5571–5580, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[41] J. Zhao, X. Xie, X. Xu, and S. Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017.

[42] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.