Published as a conference paper at ICLR 2019

DOM-Q-NET: GROUNDED RL ON STRUCTURED LANGUAGE

Sheng Jia, University of Toronto, Vector Institute (sheng.jia@utoronto.ca)
Jamie Kiros, Google Brain (kiros@google.com)
Jimmy Ba, University of Toronto, Vector Institute (jba@cs.toronto.ca)

ABSTRACT

Building agents to interact with the web would allow for significant improvements in knowledge understanding and representation learning. However, web navigation tasks are difficult for current deep reinforcement learning (RL) models due to the large discrete action space and the varying number of actions between states. In this work, we introduce DOM-Q-NET, a novel architecture for RL-based web navigation that addresses both of these problems. It parametrizes Q functions with separate networks for different action categories: clicking a DOM element and typing a string input. Our model uses a graph neural network to represent the tree-structured HTML of a standard web page. We demonstrate the capabilities of our model on the MiniWoB environment, where we match or outperform existing work without the use of expert demonstrations. Furthermore, we show 2x improvements in sample efficiency when training in the multi-task setting, allowing our model to transfer learned behaviours across tasks.

1 INTRODUCTION

Over the past years, deep reinforcement learning (RL) has shown huge success in solving tasks such as playing arcade games (Mnih et al., 2015) and manipulating robotic arms (Levine et al., 2016). Recent advances in neural networks allow RL agents to learn control policies from raw pixels without feature engineering by human experts. However, most deep RL methods focus on solving problems in either simulated physics environments, where the inputs to the agents are joint angles and velocities, or simulated video games, where the inputs are rendered graphics. Agents trained in such simulated environments have little knowledge about the rich semantics of the world.

The World Wide Web (WWW) is a rich repository of knowledge about the real world. To navigate this complex web environment, an agent needs to learn about the semantic meaning of texts, images, and the relationships between them. Each action corresponds to interacting with the Document Object Model (DOM) of tree-structured HTML. Tasks like finding a friend on a social network, clicking an interesting link, and rating a place on Google Maps can be framed as accessing a particular DOM element and modifying its value with the user input.

In contrast to Atari games, the difficulty of web tasks comes from their diversity, large action spaces, and sparse reward signals. A common solution in previous works is for the agent to mimic expert demonstrations via imitation learning (Shi et al., 2017; Liu et al., 2018). Liu et al. (2018) achieved state-of-the-art performance with very few expert demonstrations on the MiniWoB (Shi et al., 2017) benchmark tasks, but their exploration policy requires constrained action sets, hand-crafted with expert knowledge of HTML.

In this work, our contribution is a novel architecture, DOM-Q-NET, that parametrizes factorized Q functions for web navigation and can be trained to match or outperform existing work on MiniWoB without using any expert demonstration. A graph neural network (Scarselli et al., 2009; Li et al., 2016; Kipf & Welling, 2016) is used as the main backbone to provide three levels of state and action representations.
In particular, our model uses neural message passing and the readout (Gilmer et al., 2017) of the local DOM representations to produce neighbor and global representations for the web page.

We also propose to use three separate multilayer perceptrons (MLPs) (Rumelhart et al., 1985) to parametrize a factorized Q function for different action categories: click, type and mode. The entire architecture is fully differentiable, and all of its components are jointly trained. Moreover, we evaluate our model on multitask learning of web navigation tasks, and demonstrate the transferability of learned behaviours on the web interface. To our knowledge, this is the first instance of an RL agent solving multiple tasks in MiniWoB at once. We show that the multi-task agent achieves on average 2x sample efficiency compared to the single-task agent.

2 BACKGROUND

2.1 REPRESENTING WEB PAGES USING DOMS

The Document Object Model (DOM) is a programming interface for HTML documents, and it defines the logical structure of such documents. DOMs are connected in a tree structure, and we frame web navigation as accessing a DOM and optionally modifying it by the user input. As an elementary object, each DOM has a tag and other attributes such as class and is-focused, similar to an object in Object-Oriented Programming. Browsers use those attributes to render web pages for users.

2.2 REINFORCEMENT LEARNING

In the traditional reinforcement learning setting, an agent interacts with an infinite-horizon, discounted Markov Decision Process (MDP) to maximize its total discounted future rewards. An MDP is defined as a tuple $(S, A, T, R, \gamma)$ where $S$ and $A$ are the state space and the action space respectively, $T(s'|s, a)$ is the transition probability of reaching state $s' \in S$ by taking action $a \in A$ from $s \in S$, $R$ is the immediate reward of the transition, and $\gamma$ is a discount factor. The Q-value function for a tuple of actions is defined as $Q^{\pi}(s, a) = \mathbb{E}\big[\sum_{t=0}^{T} \gamma^{t} r_t \,\big|\, s_0 = s, a_0 = a\big]$, where $T$ is the number of timesteps till termination. The formula represents the expected future discounted reward starting from state $s$, performing action $a$, and following the policy until termination. The optimal Q-value function $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)\ \forall s \in S, a \in A$ (Sutton & Barto, 1998) satisfies the Bellman optimality equation $Q^{*}(s, a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a' \in A} Q^{*}(s', a')\big]$.

2.3 GRAPH NEURAL NETWORKS

For an undirected graph $G = (V, E)$, the Message Passing Neural Network (MPNN) framework (Gilmer et al., 2017) formulates two phases of the forward pass to update the node-level feature representations $h_v$, where $v \in V$, and the graph-level feature vector $\hat{y}$. The message passing phase updates the hidden state of each node by applying a vertex update function $U_t$ to the current hidden state and the message, $h_v^{t+1} = U_t(h_v^t, m_v^{t+1})$, where the passed message is computed as $m_v^{t+1} = \sum_{w \in N(v)} M_t(h_v^t, h_w^t, e_{vw})$. Here $N(v)$ denotes the neighbors of $v$ in $G$, and $e_{vw}$ is an edge feature. This process runs for $T$ timesteps. The readout phase uses the readout function $R$ to compute the graph-level feature vector $\hat{y} = R(\{h_v^T \,|\, v \in G\})$.
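As a toy worked example of the two phases, take the simplest possible choices, $M_t(h_v^t, h_w^t, e_{vw}) = h_w^t$, $U_t(h, m) = h + m$, and a sum readout (these choices are only for illustration, not from the MPNN paper), and consider a path graph $a - b - c$ with scalar initial states $h^0 = (1, 2, 3)$. One message-passing step followed by the readout gives:

$$m_a^{1} = h_b^{0} = 2, \qquad m_b^{1} = h_a^{0} + h_c^{0} = 4, \qquad m_c^{1} = h_b^{0} = 2,$$
$$h^{1} = (1 + 2,\ 2 + 4,\ 3 + 2) = (3, 6, 5), \qquad \hat{y} = R(h^{1}) = 3 + 6 + 5 = 14.$$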
2.4 REINFORCEMENT LEARNING WITH GRAPH NEURAL NETWORKS

There has been work in robot locomotion that uses graph neural networks (GNNs) to model the physical body (Wang et al., 2018; Hamrick et al., 2018). NerveNet demonstrates that policies learned with a GNN transfer better to other learning tasks than policies learned with an MLP (Wang et al., 2018). It uses GNNs to parametrize the entire policy, whereas DOM-Q-NET uses GNNs to provide representational modules for factorized Q functions. Note that the graph structure of a robot is static, whereas the graph structure of a web page can change at each time step. Locomotion-based control tasks provide dense rewards, whereas web navigation tasks are sparse reward problems with only 0/1 reward at the end of the episode. For our tasks, the model also needs to account for the dependency of actions on goal instructions.

2.5 PREVIOUS WORK ON RL ON WEB INTERFACES

Shi et al. (2017) constructed the Mini World of Bits (MiniWoB) benchmark, which consists of many toy web navigation tasks. This environment provides both the image and the HTML of a web page. Their work showed that an agent using the visual input cannot solve most of the tasks, even given demonstrations. Liu et al. (2018) then proposed the DOM-NET architecture, which uses a series of attention operations between DOM elements and the goal. With their workflow-guided exploration, which uses a formal language to constrain the action space of the agent, they achieved state-of-the-art performance and sample efficiency in using demonstrations. Unlike these previous approaches, we aim to tackle web navigation without any expert demonstration or prior knowledge.

3 NEURAL DOM Q NETWORK

Figure 1: Given the web page on the right, its DOM tree representation is shown as a graph where each DOM represents a node from V. Different colors indicate different tag attributes of the DOMs. DOMs are embedded as a local module, $e_{local}$, and propagated by a GNN to produce a neighbor module, $e_{neighbor}$. The global module, $e_{global}$, is aggregated from the neighbor module. The $Q_{dom}$ stream uses all three modules, whereas the $Q_{token}$ and $Q_{mode}$ streams only use the global module. Here, the Q values of the Submit button and the "sr" token are computed by $Q_{dom}$ and $Q_{token}$ respectively.

Consider the problem of navigating through multiple web pages or menus to locate a piece of information. Let V be the set of DOMs in the current web page. There are often multiple goals that can be achieved in the same web environment. We consider goals that are presented to the agent in the form of a natural language sentence, e.g. "Select sr and click Submit" in Figure 1 and "Use the textbox to enter Kanesha and press Search, then find and click the 9th search result" in Figure 2. Let G represent the set of word tokens in the given goal sentence. The RL agent only receives a reward if it successfully accomplishes the goal, so this is a sparse reward problem. The primary means of navigation are interactions with the buttons and the text fields on the web pages.

There are two major challenges in representing the state-action value function for web navigation: the action space is enormous, and the number of actions can vary drastically between states. We propose DOM-Q-NET to address both of these problems in the following.
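To make the state concrete, the following is a minimal sketch of extracting the DOM node set V (with tag, class, and text attributes) and the parent-child edges from a raw HTML string, using Python's standard html.parser. This is only a hypothetical preprocessing step for illustration; the MiniWoB environment and the interface used in the paper may expose the DOM differently.

```python
# Hypothetical preprocessing sketch: turn raw HTML into the node set V
# (tag / class / text per DOM) and parent-child edges used by the GNN.
# MiniWoB's actual interface may differ; names here are illustrative.
from html.parser import HTMLParser

class DOMTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.nodes = []    # each node: {"tag", "class", "text"}
        self.edges = []    # (parent_index, child_index) pairs
        self.stack = []    # indices of currently open elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        idx = len(self.nodes)
        self.nodes.append({"tag": tag, "class": attrs.get("class", ""), "text": ""})
        if self.stack:                       # link to the enclosing element
            self.edges.append((self.stack[-1], idx))
        self.stack.append(idx)
        # Note: void elements such as <input> are never popped here; a fuller
        # builder would close them explicitly.

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and data.strip():      # attach visible text to the open element
            self.nodes[self.stack[-1]]["text"] += data.strip()

builder = DOMTreeBuilder()
builder.feed('<div class="form"><button id="subbtn">Submit</button></div>')
print(builder.nodes)  # [{'tag': 'div', 'class': 'form', 'text': ''}, {'tag': 'button', 'class': '', 'text': 'Submit'}]
print(builder.edges)  # [(0, 1)]
```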
3.1 ACTION SPACE FOR WEB NAVIGATION

In contrast to typical RL tasks that require choosing only one action a from an action space A, such as choosing one of all combinations of the controller's joint movements for Atari (Mnih et al., 2015), we frame acting on the web with three distinct categories of actions:

DOM selection $a_{dom}$ chooses a single DOM in the current web page, $a_{dom} \in V$. DOM selection covers the typical interactive actions such as clicking buttons or checkboxes, as well as choosing which text box to fill with the string input.

Word token selection $a_{token} \in G$ picks a word token from the given goal sentence to fill in the selected text box. The assumption that the typed string comes from the goal instruction aligns with the previous work (Liu et al., 2018).

Mode $a_{mode} \in \{\text{click}, \text{type}\}$ tells the environment whether the agent's intention is to click or type when acting on the web page. $a_{mode}$ is represented as a binary action.

At each time step, the environment receives a tuple of actions, namely $a = (a_{dom}, a_{token}, a_{mode})$, though it does not process $a_{token}$ unless $a_{mode} = \text{type}$.

Figure 2: A successful trajectory executed by our model for search-engine, consisting of the actions (DOM(search-box), Kanesha, TYPE), (DOM(search), , CLICK), (DOM("3"), , CLICK), and (DOM("Kanesha"), , CLICK) over states $S_1$ to $S_4$. $S_i$ is the state, and $A_i = (a_{dom}, a_{token}, a_{mode})$ is the tuple of actions for the three distinct action categories at timestep $i$. DOM(x) represents the index of the corresponding element x in the web page.

3.2 FACTORIZED Q FUNCTION

One way to represent the state-action value function is to consider all the permutations of $a_{dom}$ and $a_{token}$. For example, Mnih et al. (2015) consider the permutations of joystick direction and button clicking for Atari. For MiniWoB, this introduces an enormous action space of size $|V| \times |G|$. The number of DOMs and goal tokens, $|V|$ and $|G|$, can reach up to 60 and 18 respectively, and the total number of actions becomes over 1,000 for some hard tasks. To reduce the action space, we consider a factorized state-action value function where the action values of $a_{dom}$ and $a_{token}$ are independent of each other. Formally, we define the optimal Q-value function as the sum of the individual value functions of the three action categories:

$$Q^{*}(s, a) = Q^{*}(s, a_{dom}, a_{token}, a_{mode}) = Q^{*}(s, a_{dom}) + Q^{*}(s, a_{token}) + Q^{*}(s, a_{mode}). \quad (1)$$

Under the independence assumption, we can find the optimal policy by selecting the greedy action w.r.t. each Q-value function individually. Therefore, the computational cost of finding the optimal action of the factorized Q function is linear in the number of DOM elements and the number of word tokens rather than quadratic:

$$a^{*} = \Big(\arg\max_{a_{dom}} Q^{*}(s, a_{dom}),\ \arg\max_{a_{token}} Q^{*}(s, a_{token}),\ \arg\max_{a_{mode}} Q^{*}(s, a_{mode})\Big). \quad (2)$$
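As a concrete illustration of Eq. (2), the minimal sketch below performs greedy action selection given score vectors produced by the three Q streams; the function and variable names are illustrative assumptions, not part of the released implementation.

```python
# Sketch of Eq. (2): greedy action selection under the factorized Q function.
# q_dom, q_token, q_mode are assumed per-candidate scores produced by the
# three Q streams for the current state; shapes and names are illustrative.
import torch

def select_greedy_action(q_dom: torch.Tensor,    # [|V|]  one score per DOM element
                         q_token: torch.Tensor,  # [|G|]  one score per goal token
                         q_mode: torch.Tensor):  # [2]    scores for (click, type)
    # Each factor is maximized independently: cost is O(|V| + |G|) instead of O(|V| * |G|).
    a_dom = int(torch.argmax(q_dom))
    a_token = int(torch.argmax(q_token))
    a_mode = "click" if int(torch.argmax(q_mode)) == 0 else "type"
    return a_dom, a_token, a_mode

# Example with random scores for a page with 5 DOMs and a 3-token goal:
print(select_greedy_action(torch.randn(5), torch.randn(3), torch.randn(2)))
```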
3.3 LEARNING STATE-ACTION EMBEDDINGS OF WEB PAGES

Many actions on the web, such as clicking different checkboxes and filling in unseen types of forms, share similar tag or class attributes. Our goal is to design a neural network architecture that effectively captures such invariance for web pages, yet is flexible enough to deal with the varying number of DOM elements and goal tokens at different time steps. Furthermore, when locating a piece of information on the web, an agent needs to be aware of both the local information, e.g. the name of a button and its surrounding text, and the global information, e.g. the general theme, of the web page. The cue for clicking a particular button on a menu is likely scattered across the page.

To address the above problem, we propose a GNN-based RL agent that computes the factorized Q-value for each DOM in the current web page, called DOM-Q-NET, as shown in Figure 1. It uses the additional information of the tree-structured HTML to guide the learning of state-action representations, i.e. embeddings e, which are shared among the factorized Q networks. Explicitly modeling the HTML tree structure provides the agent with relational information among the DOM elements. Given a web page, our model learns a concatenated embedding vector $e^i = \big[e^i_{local}, e^i_{neighbor}, e_{global}\big]$ using low-level and high-level modules that correspond to the node-level and graph-level outputs of the GNN.

Local Module: $e^i_{local}$ is the concatenation of each embedded attribute $e_{Attr}$ of the DOM $v_i$, which includes the tag, class, focus, tampered, and text information of the DOM element. In particular, we use the maximum cosine distance between the text and each goal token to measure the soft alignment of the DOM $v_i$ with the $j$th word embedding, $e^j_{goal}$, in the goal sentence. Liu et al. (2018) use exact alignment to obtain tokens that appear in the goal sentence, but our method can also detect synonyms that are not exactly matched.

$$e^i_{local} = \big[\, e^i_{Attr},\ \max_j \cos(e^i_{Attr}, e^j_{goal}) \,\big] \quad (3)$$

This provides the unpropagated action representation of clicking each DOM, and serves as the skip connection of the GNN.

Neighbor Module: $e^i_{neighbor}$ is the node representation that incorporates the neighbor context of the DOM $v_i$ using a graph neural network. The model performs message passing between the nodes of the tree with the weights $w_{GNN}$. The local module is used to initialize this process. $m^{i,t}$ is an intermediate state for each step of the message passing, and we adopt Gated Recurrent Units (Cho et al., 2014) for the nonlinear vertex update (Li et al., 2016). This process is performed for $T$ steps to obtain the final neighbor embeddings.

$$m^{i,t+1}_{neighbor} = \sum_{k \in N(i)} w_{GNN}\, e^{k,t}_{neighbor}, \qquad e^{i,0}_{neighbor} = e^i_{local} \quad (4)$$

$$e^{i,t+1}_{neighbor} = \mathrm{GRU}\big(e^{i,t}_{neighbor}, m^{i,t+1}_{neighbor}\big), \qquad e^i_{neighbor} = e^{i,T}_{neighbor} \quad (5)$$

By incorporating the context information, this module contains the state representation of the current page and the propagated action representation of clicking the particular DOM, so the Q-value function could be approximated using this module alone.
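A minimal PyTorch-style sketch of the neighbor module in Eqs. (4)-(5), assuming the local embeddings and an edge list for the DOM tree are given; the single shared linear map stands in for $w_{GNN}$, and all names and shapes are illustrative rather than the released implementation.

```python
# Sketch of the neighbor module, Eqs. (4)-(5): GRU vertex updates over summed
# neighbor messages, initialized from the local embeddings. Illustrative only.
import torch
import torch.nn as nn

class NeighborModule(nn.Module):
    def __init__(self, d, T=3):
        super().__init__()
        self.T = T
        self.w_gnn = nn.Linear(d, d, bias=False)  # shared message weights, Eq. (4)
        self.gru = nn.GRUCell(d, d)               # nonlinear vertex update, Eq. (5)

    def forward(self, e_local, edges):
        # e_local: [N, d]; edges: (src, dst) LongTensors, listing both directions
        # of each parent-child link so messages flow up and down the DOM tree.
        src, dst = edges
        e = e_local                                       # e^{i,0} = e^i_local
        for _ in range(self.T):
            msg = self.w_gnn(e)[src]                      # w_GNN e^{k,t} for each sender
            m = torch.zeros_like(e).index_add(0, dst, msg)  # sum over neighbors N(i)
            e = self.gru(m, e)                            # e^{i,t+1} = GRU(e^{i,t}, m^{i,t+1})
        return e                                          # e^i_neighbor = e^{i,T}

# Example: 3 DOM nodes in a chain 0-1-2, 16-dimensional embeddings.
edges = (torch.tensor([0, 1, 1, 2]), torch.tensor([1, 0, 2, 1]))
print(NeighborModule(d=16)(torch.randn(3, 16), edges).shape)  # torch.Size([3, 16])
```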
Global Module: $e_{global}$ is the high-level feature representation of the entire web page after the readout phase. It is used by all three factorized Q networks. We investigate two readout functions to obtain such a global embedding, with and without explicitly incorporating the goal information.

1) We use max-pooling to aggregate all of the DOM embeddings of the web page:

$$e_{global} = \mathrm{maxpool}\big(\big\{\,[e^i_{local}, e^i_{neighbor}]\ \big|\ v_i \in V \,\big\}\big) \quad (6)$$

2) We use goal-attention with the goal vector as the attention query. This is in contrast to Velickovic et al. (2018), where attention is used in the message passing phase and the query is not a task-dependent representation. To form the goal vector $h_{goal}$, each goal token $e_{token}$ is concatenated with a one-hot positional encoding vector $e_{pos}$, as shown in Figure 1. Next, a position-wise feedforward network with ReLU activation is applied to each concatenated vector before max-pooling to obtain the goal representation. Motivated by Vaswani et al. (2017), we use scaled dot-product attention with local embeddings as keys and neighbor embeddings as values. Note that $E_{local}$ and $E_{neighbor}$ are packed representations of $(e^1_{local}, \ldots, e^{|V|}_{local})$ and $(e^1_{neighbor}, \ldots, e^{|V|}_{neighbor})$ respectively, where $E_{local} \in \mathbb{R}^{|V| \times d_k}$, $E_{neighbor} \in \mathbb{R}^{|V| \times d_k}$, and $d_k$ is the dimension of the text token embeddings.

$$e_{attn} = \mathrm{softmax}\Big(\frac{h_{goal} E_{local}^{T}}{\sqrt{d_k}}\Big) E_{neighbor}, \qquad e_{global\text{-}attn} = [e_{global}, e_{attn}] \quad (7)$$

An illustrative diagram is shown in Appendix 6.2, and a simpler method of concatenating the node-level features with the goal vector is shown in Appendix 6.3. This method is also found to be effective in incorporating the goal information, but it increases the size of the model.

Learning: The Q-value function for choosing a DOM is parametrized by a two-layer MLP, $Q^i_{dom} = \mathrm{MLP}(e^i; w_{dom})$, which takes the concatenation of DOM embeddings $e^i = \big[e^i_{local}, e^i_{neighbor}, e_{global}\big]$ as input. Similarly, the Q-value functions for choosing the word token and the mode are computed as $\mathrm{MLP}([e_{token}, e_{global}]; w_{token})$ and $\mathrm{MLP}(e_{global}; w_{mode})$ respectively (see Figure 1). All of the model parameters, including the embedding matrices, are learned from scratch. Let $\theta = (E, w_{GNN}, w_{dom}, w_{token}, w_{mode})$ be the model parameters, comprising the embedding matrices, the weights of the graph neural network, and the weights of the factorized Q-value functions. The model parameters are updated by minimizing the squared TD error (Sutton, 1988):

$$\min_{\theta}\ \mathbb{E}_{(s, a, r, s') \sim \text{replay}}\Big[\big( y^{DQN} - Q(s, a_{dom}; \theta) - Q(s, a_{token}; \theta) - Q(s, a_{mode}; \theta) \big)^{2}\Big], \quad (8)$$

where the transition tuples $(s, a, r, s')$ are sampled from the replay buffer and $y^{DQN}$ is the factorized target Q-value, computed with the target network parameters $\theta^{-}$ as in the standard DQN algorithm:

$$y^{DQN} = r + \gamma \Big( \max_{a'_{dom}} Q(s', a'_{dom}; \theta^{-}) + \max_{a'_{token}} Q(s', a'_{token}; \theta^{-}) + \max_{a'_{mode}} Q(s', a'_{mode}; \theta^{-}) \Big). \quad (9)$$

3.4 MULTITASK LEARNING FOR TRANSFERRING LEARNED BEHAVIOURS

To assess the effectiveness of our model at transferring learned behaviours and solving multiple tasks, we train a single agent acting in multiple environments. Transitions from different tasks are collected in a shared replay buffer, and the network is updated after performing an action in each environment. See Algorithm 1 for details.

4 EXPERIMENTS

We first evaluate the generalization capability of the proposed model to large action spaces by comparing it against previous works. Tasks with various difficulties, as defined in Appendix 6.4, are chosen from MiniWoB. Next, we investigate the gain in sample efficiency our model obtains from multitask learning. We then perform an ablation study to justify the effectiveness of each representational module, followed by comparisons of the gains in sample efficiency from goal-attention in the multitask and single-task settings. Hyperparameters are explained in Appendix 6.1.

4.1 DOM-Q-NET BENCHMARK ON MINIWOB

We use the Q-learning algorithm with four components of Rainbow (Hessel et al., 2018) to train our agent, because web navigation tasks are sparse reward problems and off-policy learning with a replay buffer is more sample-efficient. The four components are DDQN (Van Hasselt et al., 2016), prioritized replay (Schaul et al., 2016), multi-step learning (Sutton, 1988), and NoisyNet (Fortunato et al., 2018). To align with the settings used by Liu et al. (2018), we consider the tasks that only require clicking DOM elements and typing strings. The agent receives a +1 reward if the task is completed correctly, and a 0 reward otherwise. We perform T = 3 steps of neural message passing for all tasks except social-media, for which we use T = 7 steps to address the large DOM space.
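As a compact illustration of the objective in Eqs. (8)-(9), the sketch below computes the factorized TD target and squared TD error for a single transition, assuming the online and target Q streams already return per-candidate values; the Rainbow components above (DDQN, prioritized replay, multi-step returns, NoisyNet) are omitted for brevity, and all names are illustrative.

```python
# Sketch of the factorized TD error in Eqs. (8)-(9). Illustrative only: the
# q_* tensors are assumed outputs of the online/target Q streams for one state,
# and the Rainbow extensions used in the paper are not included here.
import torch

def factorized_td_error(q_dom, q_token, q_mode,               # online Q(s, .; theta)
                        a_dom, a_token, a_mode,                # indices of the executed actions
                        r, gamma, done,
                        tq_dom_next, tq_token_next, tq_mode_next):  # target Q(s', .; theta^-)
    # Eq. (9), plus a standard terminal mask: bootstrap with the sum of the three maxima.
    next_value = tq_dom_next.max() + tq_token_next.max() + tq_mode_next.max()
    y = r + gamma * (1.0 - done) * next_value
    # Eq. (8): squared error between the target and the sum of the chosen Q values.
    q_sa = q_dom[a_dom] + q_token[a_token] + q_mode[a_mode]
    return (y.detach() - q_sa) ** 2

# Example with random Q outputs for a page with 6 DOMs and a 4-token goal:
loss = factorized_td_error(torch.randn(6), torch.randn(4), torch.randn(2),
                           2, 1, 0,
                           torch.tensor(1.0), 0.99, torch.tensor(0.0),
                           torch.randn(6), torch.randn(4), torch.randn(2))
print(loss)
```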
Evaluation metric: We plot the moving average of the rewards of the last 100 episodes during training. Following previous works (Shi et al., 2017; Liu et al., 2018), we report the success rate, which is the percentage of test episodes ending with a reward of +1. Each reported success rate is an average over 4 different runs, and Appendix 6.6 explains our experiment protocol.

Results: Figure 3 shows that DOM-Q-NET reaches a 100% success rate on most of the tasks selected by Liu et al. (2018), except for click-widget, social-media, and email-inbox. Our model still reaches an 86% success rate on social-media, and the use of goal-attention enables the model to solve click-widget and social-media with a 100% success rate. We did not use any prior knowledge such as constraining the action set during exploration, using pre-defined fields of the goal, or showing expert demonstrations. Notably, our model solves a long-horizon task, choose-date, that previous works were unable to solve even with demonstrations. This task expects many similar actions, but has a large action space. Even with imitation learning or guided exploration, the neural network needs to learn a representation that generalizes to unseen, diverse DOM states and actions, which our model proves to do.

Figure 3: Performance comparisons of DOM-Q-NET with Shi et al. (2017) and Liu et al. (2018).

Figure 4: Multitask comparisons on six tasks, including Navigate-Tree, Enter-Password, Click-Option, and Click-Checkboxes, plotting the average reward of the last 100 episodes against the number of timesteps for multitask_dom_q_net+g_a and dom_q_net+g_a. The 9-task multitask DOM-Q-NET with goal-attention consistently has better sample efficiency. Results for other tasks are shown in Appendix 6.7.1. (g_a = goal-attention.)

4.2 MULTITASK

Two metrics are used for comparing the sample efficiency of multitask and single-task agents:

$M_{total}$ (multitask agent): total number of frames observed upon solving all the tasks.
$M_{total}$ (single-task agents): sum of the number of frames observed for solving each task.
$M_{task}$: number of frames observed for solving a specific task.

We trained a multitask agent that solves 9 tasks with 2x sample efficiency, using about $M_{total} = 63000$ frames, whereas the single-task agents use $M_{total} = 127000$ frames combined. Figure 4 shows the plots for 6 of the 9 tasks. In particular, login-user and click-checkboxes are solved with 40000 fewer frames using multitask learning, but such gains are not as obvious when the task is simple, as in the case of navigate-tree. Next, we included two hard tasks, shown in Figure 5. Compared to the $M_{total} = 477000$ frames needed to solve the 11 tasks with single-task agents, the multitask agent has only observed $M_{total} = 29000 \times 11 = 319000$ frames when the last task, social-media, is solved, as shown in Figure 5. Additionally, the plots indicate that multitask learning with the simpler tasks makes more efficient use of the observed frames for hard tasks, achieving a better $M_{task}$ than multitask learning with only those two tasks.
These results indicate that our model enables positive transfer of learned behaviours between distinct tasks.

Figure 5: Comparisons of sample efficiency for 2 hard tasks, social-media (left) and search-engine (right), under multitask learning. The curves compare 2Hard+9_multitask_dom_q_net+g_a, 2Hard_multitask_dom_q_net+g_a, and dom_q_net+g_a; "9 multitask" refers to the tasks discussed in Figure 4.

Figure 6: Ablation experiments on click-checkboxes and one other task for the l=local, n=neighbor, g=global modules. dom_q_net-g is DOM-Q-NET without the global module, dom_q_net-l-g is DOM-Q-NET with only the neighbor module, and dom_q_net-n-g is DOM-Q-NET with only the local module.

4.3 ABLATION STUDY ON THE DOM REPRESENTATION MODULES

We perform ablation experiments to justify the effectiveness of using each module in the $Q_{dom}$ stream. We compare the proposed model against three discounted versions that omit some modules when computing $Q_{dom}$: (a) $e_{dom} = e_{local}$, (b) $e_{dom} = e_{neighbor}$, (c) $e_{dom} = [e_{local}^{T}, e_{neighbor}^{T}]^{T}$. Figure 6 shows the two tasks chosen. The failure case on click-checkboxes shows that DOM selection without the neighbor module simply does not work, because many DOMs have the same attributes and thus exactly the same representations despite their different contexts. Liu et al. (2018) addressed this issue by hand-crafting the message passing. The faster convergence of the full DOM-Q-NET to the optimal behaviour indicates the limitation of the neighbor module alone, and shows how the global and local modules provide shortcuts to the high-level and low-level representations of the web page.

4.4 EFFECTIVENESS OF GOAL-ATTENTION

Most of the MiniWoB tasks have only one desired control policy, such as "put a query word in the search box and find the matching link", where the word tokens for the query and the link are aligned with the DOMs. Hence, our model solves most of the tasks without feeding the goal representation to the network, with exceptions like click-widget. Appendix 6.7 shows comparisons of the model with different goal encoding methods, including goal-attention. The effect of goal-attention is not obvious on some tasks. However, Figure 7 shows that the gain in sample efficiency from using goal-attention is considerable in the multitask learning setting, and this gain is much bigger than the gain in the single-task setting. This indicates that the agent successfully learns to pay attention to different parts of the DOM tree given different goal instructions when solving multiple tasks.
Figure 7: Effects of goal-attention for single-task and multi-task learning on enter-password and one other task, comparing dom_q_net, dom_q_net+g_a, multitask_dom_q_net, and multitask_dom_q_net+g_a. (g_a = goal-attention.)

5 DISCUSSION

We propose a new architecture for parametrizing factorized Q functions using goal-attention, local word embeddings, and a graph neural network, and we contribute a formulation of web navigation with this model. Without any demonstration, it solves relatively hard tasks with large action spaces, and it transfers learned behaviours through multitask learning, two important factors for web navigation. For future work, we will investigate exploration strategies for tasks like email-inbox, where the environment does not provide a simple instance of the task that the agent can use to generalize learned behaviours. Liu et al. (2018) demonstrated an interesting way to guide exploration. Another direction is to reduce the computational cost of evaluating the Q value for each DOM element. Finally, we intend to apply our methods to using search engines. Tasks like question answering could benefit from the ability of an agent to query a search engine, navigate the results page, and obtain relevant information for solving the desired goal. The ability to query and navigate search results could also be used to bootstrap agents in realistic environments to obtain task-oriented knowledge and improve sample efficiency.

Acknowledgement: We acknowledge using the segment tree implementation from Dopamine (Castro et al., 2018) for this project.

Reproducibility: Our code and demo are available at https://github.com/Sheng-J/DOM-Q-NET and https://www.youtube.com/channel/UCrGsYub9lKCYO8dlREC3dnQ respectively.

REFERENCES

Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A research framework for deep reinforcement learning. CoRR, abs/1812.06110, 2018. URL http://arxiv.org/abs/1812.06110.

Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014. URL http://arxiv.org/abs/1406.1078.

Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. In ICLR, 2018.

Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In ICML, 2017.

Jessica B Hamrick, Kelsey R Allen, Victor Bapst, Tina Zhu, Kevin R McKee, Joshua B Tenenbaum, and Peter W Battaglia. Relational inductive bias for physical construction in humans and machines. arXiv preprint arXiv:1806.01203, 2018.

Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016. URL http://arxiv.org/abs/1609.02907.
Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. In JMLR, 2016.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. In ICLR, 2016.

Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In ICLR, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. In Nature, 2015.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. In IEEE Transactions on Neural Networks, 2009.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In ICLR, 2016.

Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of Bits: An open-domain platform for web-based agents. In ICML, 2017.

Richard S Sutton. Learning to predict by the methods of temporal differences. In Machine Learning, 1988.

Richard S Sutton and Andrew G Barto. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, 2016.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.

Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. NerveNet: Learning structured policy with graph neural networks. In ICLR, 2018.
6 APPENDIX

6.1 HYPERPARAMETERS

Table 1: Hyperparameters for training with Rainbow DQN (4 components)

Optimization algorithm: Adam (Kingma & Ba, 2014)
Learning rate: 0.00015
Batch size: 128
Discount factor: 0.99
DQN target network update period: 200 online network updates
Number of updates per frame: 1
Number of exploration steps: 50
N-step (multi-step) bootstrap: 8
NoisyNets σ0: 0.5
Use DDQN: True
Easy tasks, number of steps for training: 5000
Medium tasks, number of steps for training: 50000
Hard tasks, number of steps for training: 2000000

Table 2: Hyperparameters for DOM-Q-NET

Vocabulary size (tag): 80
Vocabulary size (text): 400
Vocabulary size (class): 80
Embedding dimension (tag): 16
Embedding dimension (text): 32
Embedding dimension (class): 16
Dimension of fully connected (FC) layers: 128
Number of FC layers for the 3 factorized Q networks: 2 each
Hidden layer activation: ReLU
Number of steps for neural message passing: 3 (7 for the social-media task)
Max number of DOMs: 160
Max number of goal tokens: 18
Out of vocabulary: random vector generation (choose-option, click-checkboxes)

Table 3: Hyperparameters for the replay buffer

α (prioritization exponent): 0.5
β (for computing importance sampling weights): 0
Single-task buffer size: 15000
Multi-task buffer size: 100000

6.2 GOAL-ATTENTION OUTPUT MODEL

Figure 8 shows the readout phase of the graph neural network using goal-attention. The graph-level feature vector $h_{global}$ is computed as the weighted average of the node-level representations processed with $T$ steps of message passing, $\{h_1, \ldots, h_{|V|}\}$. The weights $\{\alpha_1, \ldots, \alpha_{|V|}\}$ are computed with the goal vector as the query and the node-level features as keys. For our model, we use scaled dot-product attention (Vaswani et al., 2017) with local embeddings as keys and neighbor embeddings as values, as illustrated in Section 3.3.

Figure 8: Goal-attention readout phase.

6.3 GOAL ENCODER

Three types of goal encoding modules for the global module are investigated:

1. Concatenating each node-level feature with the goal vector.
2. Goal-attention, as illustrated in Appendix 6.2.
3. Using both concatenation and attention, as shown in Figure 9.

Figure 9: Goal encoder combining goal concatenation and goal-attention, producing $e_{global\text{-}attn} = [e_{global}, e_{attn}]$.

The benchmark results for multitask learning and the 23 tasks in Appendix 6.7 also compare the performance of the different goal encoding modules.

6.4 MINIWOB TASK DIFFICULTY DEFINITIONS

Easy task: any task solvable under 5000 timesteps by single-task DOM-Q-NET: {click-dialog, click-test, focus-text, focus-text-2, click-test-2, click-button, click-link, click-button-sequence, click-tab, click-tab-2, navigate-tree}

Medium task: any task solvable under 50000 timesteps by single-task DOM-Q-NET: {enter-text, click-widget, click-option, click-checkboxes, enter-text-dynamic, enter-password, login-user, email-inbox-delete}

Hard task: any task solvable under 200000 timesteps by single-task DOM-Q-NET, or any unsolved task: {choose-date, search-engine, social-media, email-inbox}

6.5 MULTITASK LEARNING

Algorithm 1: Multitask Learning with a Shared Replay Buffer
Require: an off-policy RL algorithm A (e.g. DQN, DDPG, NAF, SDQN); a set of environments for multiple tasks K (e.g. click-checkboxes, social-media)
2: Initialize the shared replay buffer R
3: Initialize A by initializing the shared network parameters θ
4: Initialize each environment and sample s_0^(k)
5: for i = 1 to M do
6:   for each k in K do
7:     Sample an action a_t^(k) using the behavioral policy from A: a_t^(k) ~ π(s_t^(k))
8:     Execute the action a_t^(k) in environment k and observe a reward r_t^(k) and a new state s_{t+1}^(k)
9:     Store the transition (s_t^(k), a_t^(k), r_t^(k), s_{t+1}^(k)) in R
10:    Sample a minibatch B from R, and perform one step of optimization w.r.t. θ
11:    if the episode for task k terminated then
12:      Reset task k and sample s_0^(k)
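Below is a minimal Python sketch of the loop in Algorithm 1, assuming generic environment, agent, and replay-buffer objects with the obvious methods (reset, step, act, add, sample, update); these interfaces are illustrative placeholders rather than the released code.

```python
# Sketch of Algorithm 1: one shared agent and one shared replay buffer serve
# several task environments; the interfaces (reset/step/act/add/sample/update)
# are illustrative placeholders, not the paper's released implementation.
def multitask_train(envs, agent, replay, num_iterations, batch_size=128):
    # envs: dict mapping task name -> environment instance
    states = {k: env.reset() for k, env in envs.items()}   # sample s_0 for each task
    for _ in range(num_iterations):
        for k, env in envs.items():
            a = agent.act(states[k])                        # behavioral policy of A
            next_s, r, done = env.step(a)                   # execute a, observe r and s'
            replay.add(states[k], a, r, next_s, done)       # shared buffer R
            batch = replay.sample(batch_size)
            agent.update(batch)                             # one optimization step on shared theta
            states[k] = env.reset() if done else next_s     # reset task k when its episode ends
    return agent
```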
6.6 EXPERIMENT PROTOCOL

We report the success rate over 100 test episodes at the end of training, once the agent has converged to its highest performance. The final success rate reported in Figure 3 is the average success rate over 4 different random seeds/runs. In particular, we evaluate the RL agent after training it for a fixed number of frames, depending on the difficulty of the task, as defined in Appendix 6.4. As shown in Table 4, the results presented in this paper are based on a total of 536 experiments using the set of hyperparameters in Tables 1, 2, and 3.

Table 4: Experiment statistics

Number of tasks: 23
Number of tasks concurrently running for multitask: 9
Number of goal encoding modules compared: 4
N1 = (23 + 9) × 4 = 128
Number of tasks for ablation study: 2
Number of discounted models compared for ablation study: 3
N2 = 2 × 3 = 6
Number of experiments for computing the average of a result: 4
Number of experiments for 11-task multitask learning: 11
Ntotal = (128 + 6 + 11) × 4 = 580

6.7 BENCHMARK RESULTS

We present the learning curves of both single-task and multitask agents. We also provide the learning curves of the model with the different goal-encoding modules of Appendix 6.3. The x-axis represents the number of timesteps, and the y-axis represents the moving average of the last 100 rewards. For medium and hard tasks, we also show the fraction of transitions with positive (non-zero) rewards in the replay buffer and the number of unique positive transitions sampled throughout training. This demonstrates the sparsity of the rewards for each task and lets us investigate whether failures come from exploration. Note that we use multi-step bootstrapping (Sutton, 1988), so some transitions that do not directly lead to rewards are still counted as positive here.

6.7.1 MULTITASK (9 TASKS) RESULTS

The following plots show the learning curves for the 9 tasks used in multitask learning.

6.7.2 SOME EASY AND MEDIUM TASKS

We omit the plots for very simple tasks requiring fewer than 1000 training steps.

6.7.3 MEDIUM TASKS WITH REPLAY BUFFER INFORMATION

The plots on the left show the moving average of the rewards over the last 100 episodes. The plots in the center show the fraction of positive transitions in the replay buffer. The plots on the right show the number of unique positive transitions in each training batch.

6.7.4 HARD TASKS

The plots on the left show the moving average of the rewards over the last 100 episodes. The plots in the center show the fraction of positive transitions in the replay buffer. The plots on the right show the number of unique positive transitions in each training batch.
[Plots for the hard tasks social-media and email-inbox: average reward of the last 100 episodes, fraction of positive transitions in the replay buffer, and unique number of positive transitions per batch, each comparing dom_q_net, dom_q_net+goal_attn, dom_q_net+goal_cat, and dom_q_net+goal_cat_attn.]