# End-to-End Goal-Driven Web Navigation

Rodrigo Nogueira
Tandon School of Engineering
New York University
rodrigonogueira@nyu.edu

Kyunghyun Cho
Courant Institute of Mathematical Sciences
New York University
kyunghyun.cho@nyu.edu

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

## Abstract

We propose goal-driven web navigation as a benchmark task for evaluating an agent with the ability to understand natural language and plan in partially observed environments. In this challenging task, an agent navigates through a website, represented as a graph consisting of web pages as nodes and hyperlinks as directed edges, to find a web page in which a query appears. The agent is required to have sophisticated high-level reasoning based on natural language and an efficient sequential decision-making capability to succeed. We release a software tool, called WebNav, that automatically transforms a website into this goal-driven web navigation task, and as an example, we build WikiNav, a dataset constructed from the English Wikipedia. We extensively evaluate different variants of neural-net-based artificial agents on WikiNav and observe that the proposed goal-driven web navigation well reflects the advances in models, making it a suitable benchmark for evaluating future progress. Furthermore, we extend WikiNav with question-answer pairs from Jeopardy! and test the proposed agent, based on recurrent neural networks, against strong inverted-index-based search engines. The artificial agents trained on WikiNav outperform the search-engine-based approaches, demonstrating the capability of the proposed goal-driven navigation as a good proxy for measuring progress on real-world tasks such as focused crawling and question-answering.

## 1 Introduction

In recent years, there have been many exciting advances in building an artificial agent that can be trained with one learning algorithm to solve many relatively large-scale, complicated tasks (see, e.g., [8, 10, 6]). In many of these works, the target tasks were computer games, such as Atari games [8] and a racing car game [6]. These successes have stimulated researchers to apply a similar learning mechanism to language-based tasks, such as multi-user dungeon (MUD) games [9, 4]. Instead of visual perception, an agent perceives the state of the world through its written description. The set of actions allowed to the agent is either fixed or dependent on the current state. This type of task can efficiently evaluate the agent's ability in not only planning but also language understanding.

We, however, notice that these MUD games do not exhibit the complex nature of natural languages to the full extent. For instance, the largest game world tested by Narasimhan et al. [9] uses a vocabulary of only 1,340 unique words, and the largest game tested by He et al. [4] uses only 2,258 words. Furthermore, the description of a state at each time step is almost always limited to the visual description of the current scene, lacking any use of higher-level concepts present in natural languages.

In this paper, we propose goal-driven web navigation as a large-scale alternative to text-based games for evaluating artificial agents with natural language understanding and planning capability. The proposed goal-driven web navigation treats a whole website as a graph, in which the web pages are nodes and hyperlinks are directed edges. An agent is given a query, which consists of one
or more sentences taken from a randomly selected web page in the graph, and navigates the network, starting from a predefined starting node, to find a target node in which the query appears. Unlike the text-based games, this task uses existing text as it is, resulting in a large vocabulary with a truly natural language description of the state. Furthermore, the task is more challenging because the action space changes greatly with respect to the state the agent is in.

We release a software tool, called WebNav, that converts a given website into a goal-driven web navigation task. As an example of its use, we provide WikiNav, which was built from the English Wikipedia. We design artificial agents based on neural networks (called NeuAgents) trained with supervised learning, and report their respective performances on the benchmark task as well as the performance of human volunteers. We observe that the difficulty of a task generated by WebNav is well controlled by two parameters: (1) the maximum number of hops from the starting node to a target node, $N_h$, and (2) the length of the query, $N_q$.

Furthermore, we extend WikiNav with an additional set of queries constructed from Jeopardy! questions, to which we refer as WikiNav-Jeopardy. We evaluate the proposed NeuAgents against three search-based strategies: (1) SimpleSearch, (2) Apache Lucene, and (3) the Google Search API. The results in terms of document recall indicate that the NeuAgents outperform those search-based strategies, implying the potential of the proposed task as a good proxy for practical applications such as question-answering and focused crawling.

## 2 Goal-driven Web Navigation

A task $\mathcal{T}$ of goal-driven web navigation is characterized by

$$\mathcal{T} = (A, s_S, G, q, R, \Omega). \tag{1}$$

The world in which an agent $A$ navigates is represented as a graph $G = (N, E)$. The graph consists of a set of nodes $N = \{s_i\}_{i=1}^{N_N}$ and a set of directed edges $E = \{e_{i,j}\}$ connecting those nodes. Each node represents a page of the website, which, in turn, is represented by the natural language text $D(s_i)$ in it. There exists an edge going from a page $s_i$ to $s_j$ if and only if there is a hyperlink in $D(s_i)$ that points to $s_j$. One of the nodes is designated as a starting node $s_S$, from which every navigation begins. A target node is one whose natural language description contains a query $q$, and there may be more than one target node.

At each time step, the agent $A$ reads the natural language description $D(s_t)$ of the current node on which it has landed. At no point is the whole world, consisting of the nodes and edges, visible to the agent, nor is its structure or map (the graph structure without any natural language description), thus making this task partially observed.

Once the agent $A$ reads the description $D(s_i)$ of the current node $s_i$, it can take one of the available actions. The set of possible actions is defined as the union of all the outgoing edges $e_{i,\cdot}$ and the stop action, giving the agent a state-dependent action space. Each edge $e_{i,k}$ corresponds to the agent jumping to a next node $s_k$, while the stop action corresponds to the agent declaring that the current node $s_i$ is one of the target nodes. Each edge $e_{i,k}$ is represented by the description of the next node, $D(s_k)$. In other words, deciding which action to take is equivalent to taking a peek at each neighboring node and judging whether that node is likely to lead ultimately to a target node. The agent $A$ receives a reward $R(s_i, q)$ when it chooses the stop action.
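To make this interaction concrete, here is a minimal Python sketch of the task as just formalized. The class and method names (`WebNavEnv`, `observe`, `step`) are our own illustration, not the interface of the released WebNav tool; the reward follows the binary definition given below.

```python
class WebNavEnv:
    """A minimal sketch of goal-driven web navigation.

    Hypothetical interface; the released WebNav tool may differ.
    """

    STOP = None  # sentinel for the stop action

    def __init__(self, graph, text, start_node, query):
        self.graph = graph         # {node_id: [ids of linked pages]}, never shown whole
        self.text = text           # {node_id: description D(s_i)}
        self.current = start_node  # s_S
        self.query = query         # q: one or more sentences from a target page

    def observe(self):
        """The agent sees only the current page plus a peek at each neighbor's
        description, never the whole graph (partial observability)."""
        neighbors = self.graph[self.current]
        return self.text[self.current], [(n, self.text[n]) for n in neighbors]

    def step(self, action):
        """Follow an outgoing edge, or stop and receive the binary reward."""
        if action is self.STOP:
            # R(s_i, q) = 1 iff the query appears in the current page
            reward = 1.0 if self.query in self.text[self.current] else 0.0
            return reward, True    # episode ends
        assert action in self.graph[self.current], "must follow an outgoing edge"
        self.current = action
        return 0.0, False
```

An agent interacts by alternating `observe` and `step` until it emits the stop action or exhausts its hop budget.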
This task uses a simple binary reward:

$$R(s_i, q) = \begin{cases} 1, & \text{if } q \subseteq D(s_i) \\ 0, & \text{otherwise.} \end{cases}$$

**Constraints** It is clear that there exists a degenerate policy with which the agent succeeds at every trial: traverse the graph breadth-first until a node containing the query is found. To avoid this kind of degenerate policy, the task includes a set of four rules/constraints $\Omega$:

1. An agent can follow at most $N_n$ edges at each node.
2. An agent has a finite memory of size smaller than $T$.
3. An agent moves up to $N_h$ hops away from $s_S$.
4. A query of size $N_q$ comes from at least two hops away from the starting node.

The first constraint alone prevents degenerate policies such as breadth-first search, forcing the agent to make as good a decision as possible at each node. The second constraint ensures that the agent cannot cheat by using earlier trials to reconstruct the whole graph structure (during test time) or by storing the entire world in its memory (during training). The third constraint, which is optional, exists for computational reasons. The fourth constraint is included because the agent is allowed to read the content of a next node.

Table 1: Dataset statistics of WikiNav-4-*, WikiNav-8-*, WikiNav-16-*, and WikiNav-Jeopardy.

|       | WikiNav-4-* | WikiNav-8-* | WikiNav-16-* | WikiNav-Jeopardy |
|-------|-------------|-------------|--------------|------------------|
| Train | 6.0k        | 1M          | 12M          | 113k             |
| Valid | 1k          | 20k         | 20k          | 10k              |
| Test  | 1k          | 20k         | 20k          | 10k              |

## 3 WebNav: Software

As a part of this work, we build and release a software tool that turns a website into a goal-driven web navigation task.¹ We call this tool WebNav. Given a starting URL, WebNav reads the whole website and constructs a graph with the web pages in the website as nodes. Each node is assigned a unique identifier $s_i$. The text content of each node, $D(s_i)$, is a cleaned version of the actual HTML content of the corresponding web page. WebNav turns intra-site hyperlinks into a set of edges $e_{i,j}$.

¹ The source code and datasets are publicly available at github.com/nyu-dl/WebNav.

In addition to transforming a website into a graph $G$ from Eq. (1), WebNav automatically selects queries from the nodes' texts and divides them into training, validation, and test sets. We ensure that there is no overlap among the three sets by making each target node, from which a query is selected, belong to only one of them.

Each generated example is defined as a tuple

$$X = (q, s^*, p^*), \tag{2}$$

where $q$ is a query from a web page $s^*$, which was found by following a randomly selected path $p^* = (s_S, \ldots, s^*)$. In other words, WebNav starts from a starting page $s_S$, random-walks the graph for a predefined number of steps ($N_h/2$, in our case), reaches a target node $s^*$, and selects a query $q$ from $D(s^*)$. A query consists of $N_q$ sentences and is selected from among the top-5 candidate sentences in the target node with the highest average TF-IDF, thus discouraging WebNav from choosing a trivial query.

For evaluation purposes alone, it is enough to use only the query $q$ itself as an example. However, we include both one target node (among potentially many other target nodes) and one path from the starting node to this target node (again, among many possible connecting paths) so that they can be exploited when training an agent. They are not to be used when evaluating a trained agent.
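As a rough illustration of this query-selection heuristic, the sketch below scores each sentence of the target page by its average TF-IDF and draws the query from the top-5 candidates. It uses scikit-learn's `TfidfVectorizer`; the function name `select_query` and the exact details (e.g., taking $N_q$ consecutive sentences starting at the chosen candidate) are our assumptions, not the released WebNav code.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer

def select_query(corpus, target_sentences, n_q=2, n_candidates=5, seed=0):
    """Sketch of WebNav-style query selection (details are assumptions).

    corpus           : list of page texts used to fit the TF-IDF statistics
    target_sentences : sentences of the target node's description D(s*)
    n_q              : number of sentences in the query (N_q)
    """
    vectorizer = TfidfVectorizer().fit(corpus)
    tfidf = vectorizer.transform(target_sentences)  # sentence x term, sparse
    # Average TF-IDF over the terms that actually occur in each sentence.
    totals = tfidf.sum(axis=1).A1
    counts = (tfidf != 0).sum(axis=1).A1.clip(min=1)
    scores = totals / counts
    # Rank sentences and keep the n_candidates highest-scoring ones.
    ranked = sorted(range(len(target_sentences)), key=lambda i: -scores[i])
    start = random.Random(seed).choice(ranked[:n_candidates])
    # Assumption: the query is the chosen sentence plus its n_q - 1 successors.
    return " ".join(target_sentences[start:start + n_q])
```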
## 4 WikiNav: A Benchmark Task

With WebNav, we built a benchmark goal-driven navigation task using Wikipedia as the target website. We used the dump file of the English Wikipedia from September 2015, which consists of more than five million web pages.

We built a set of separate tasks with different levels of difficulty by varying the maximum number of allowed hops $N_h \in \{4, 8, 16\}$ and the size of the query $N_q \in \{1, 2, 4\}$. We refer to each task as WikiNav-$N_h$-$N_q$. For each task, we generate training, validation, and test examples from the pages half as many hops away from a starting page as the maximum number of hops allowed.² We use "Category:Main topic classifications" as the starting node $s_S$.

² This limit is an artificial limit we chose for computational reasons.

As a minimal cleanup procedure, we excluded meta articles whose titles start with "Wikipedia". Any hyperlink that leads to a web page outside Wikipedia was removed in advance, together with the following sections: "References", "External Links", "Bibliography", and "Partial Bibliography".

Table 2: Per-page statistics of the English Wikipedia.

|      | Hyperlinks | Words  |
|------|------------|--------|
| Avg. | 4.29       | 462.5  |
| Var  | 13.85      | 990.2  |
| Max  | 300        | 132881 |
| Min  | 0          | 1      |

In Table 2, we present basic per-article statistics of the English Wikipedia. It is evident from these statistics that the world of WikiNav-$N_h$-$N_q$ is large and complicated, even after the cleanup procedure. We ended up with a fairly small dataset for WikiNav-4-*, but large ones for WikiNav-8-* and WikiNav-16-*. See Table 1 for details.

### 4.1 Related Work: Wikispeedia

This work is not the first to notice the possibility of using a website, or possibly the whole web, as a world in which intelligent agents explore to achieve a certain goal. The most relevant recent work to ours is perhaps Wikispeedia [14, 12, 13].

West et al. [14, 12, 13] proposed the following game, called Wikispeedia. The game's world is nearly identical to the goal-driven navigation task proposed in this work. More specifically, they converted "Wikipedia for Schools", which contained approximately 4,000 articles as of 2008, into a graph whose nodes are articles and whose directed edges are hyperlinks. From this graph, a pair of nodes is randomly selected and provided to an agent. The agent's goal is to start from the first node, navigate the graph, and reach the second node. Similarly to WikiNav, the agent has access to the text content of the current node and of all the immediate neighboring nodes. One major difference is that the target is given as a whole article, meaning that there is a single target node in Wikispeedia, while there may be multiple target nodes in the proposed WikiNav.

From this description, we see that goal-driven web navigation is a generalization and re-framing of Wikispeedia. First, we constrain a query to contain less information, making it much more difficult for an agent to navigate to a target node. Furthermore, a major research question of West and Leskovec [13] was to understand "how humans navigate and find the information they are looking for", whereas in this work we are fully focused on proposing an automatic tool for building challenging goal-driven tasks for designing and evaluating artificial intelligent agents.
## 5 WikiNav-Jeopardy: Jeopardy! on WikiNav

One of the potential practical applications utilizing goal-driven navigation is question-answering based on world knowledge. In this Q&A task, a query is a question, and an agent navigates a given information network, e.g., a website, to retrieve an answer. In this section, we propose and describe an extension of WikiNav in which query-target pairs are constructed from actual Jeopardy! question-answer pairs. We refer to this extension of WikiNav as WikiNav-Jeopardy.

We first extract all the question-answer pairs from J! Archive³, which has more than 300k such pairs. We keep only those pairs whose answers are titles of Wikipedia articles, leaving us with 133k pairs. We divide those pairs into 113k training, 10k validation, and 10k test examples, while carefully ensuring that no article appears in more than one partition. Additionally, we do not shuffle the original pairs, to ensure that the train and test examples are from different episodes.

³ www.j-archive.com

For each training pair, we find one path from the starting node "Category:Main topic classifications" to the target node and include it for supervised learning. For reference, the average number of hops to the target node is 5.8, the standard deviation is 1.2, and the minimum and maximum are 2 and 10, respectively. See Table 3 for sample query-answer pairs.

Table 3: Sample query-answer pairs from WikiNav-Jeopardy.

| Query | Answer |
|-------|--------|
| For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory. | Copernicus |
| In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state. | Washington |
| This company's Accutron watch, introduced in 1960, had a guarantee of accuracy to within one minute a month. | Bulova |

## 6 NeuAgent: Neural Network Based Agent

### 6.1 Model Description

**Core Function** The core of the NeuAgent is a parametric function $f_{\text{core}}$ that takes as input the content of the current node $\phi_c(s_i)$ and a query $\phi_q(q)$, and that returns the hidden state of the agent. This parametric function can be implemented either as a feedforward neural network $f_{\text{ff}}$:

$$h_t = f_{\text{ff}}(\phi_c(s_i), \phi_q(q)),$$

which does not take into account the previous hidden state of the agent, or as a recurrent neural network $f_{\text{rec}}$:

$$h_t = f_{\text{rec}}(h_{t-1}, \phi_c(s_i), \phi_q(q)).$$

We refer to these two types of agents as NeuAgent-FF and NeuAgent-Rec, respectively. For NeuAgent-FF, we use a single tanh layer, while for NeuAgent-Rec we use long short-term memory (LSTM) units [5], which have recently become a de facto standard.

Figure 1: Graphical illustration of a single step performed by the baseline model, NeuAgent.

Based on the new hidden state $h_t$, the NeuAgent computes the probability distribution over all the outgoing edges $e_{i,\cdot}$. The probability of each outgoing edge is proportional to the similarity between the hidden state $h_t$ and the content representation of the next node:

$$p(e_{i,j} \mid \tilde{p}) \propto \exp\left(\phi_c(s_j)^\top h_t\right). \tag{3}$$

Note that the NeuAgent peeks at the content of the next node $s_j$ by considering its vector representation $\phi_c(s_j)$. In addition to all the outgoing edges, we also allow the agent to stop, with probability

$$p(\varnothing \mid \tilde{p}) \propto \exp\left(v^\top h_t\right), \tag{4}$$

where the stop-action vector $v$ is a trainable parameter. In the case of NeuAgent-Rec, all these (unnormalized) probabilities are conditioned on the history $\tilde{p}$, the sequence of actions (nodes) selected by the agent so far. We apply a softmax normalization to the unnormalized probabilities to obtain the probability distribution over all the possible actions at the current node $s_i$.

The NeuAgent then selects its next action based on this action probability distribution (Eqs. (3) and (4)). If the stop action is chosen, the NeuAgent returns the current node as an answer and receives a reward $R(s_i, q)$, which is one if correct and zero otherwise. If the agent selects one of the outgoing edges, it moves to the selected node and repeats this process of reading and acting. See Fig. 1 for a single step of the described NeuAgent.
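A small NumPy sketch of a single NeuAgent-FF decision step may help fix ideas. The single tanh layer over the concatenated node and query vectors follows the description above, but the exact parameterization (shapes of `W` and `b`) is our assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def neuagent_ff_step(phi_node, phi_query, phi_neighbors, W, b, v):
    """One step of NeuAgent-FF (a sketch; exact parameterization assumed).

    phi_node      : (d,)   content vector phi_c(s_i) of the current node
    phi_query     : (d,)   query vector phi_q(q)
    phi_neighbors : (k, d) peeked content vectors phi_c(s_j) of the next nodes
    W, b          : (d, 2d), (d,) parameters of the single tanh layer
    v             : (d,)   trainable stop-action vector
    Returns a distribution over k+1 actions: k outgoing edges, then stop.
    """
    h = np.tanh(W @ np.concatenate([phi_node, phi_query]) + b)  # hidden state h_t
    edge_scores = phi_neighbors @ h   # Eq. (3): phi_c(s_j)^T h_t for each edge
    stop_score = v @ h                # Eq. (4): v^T h_t
    return softmax(np.append(edge_scores, stop_score))

# Toy usage: 3 neighbors, d = 4.
rng = np.random.default_rng(0)
d, k = 4, 3
p = neuagent_ff_step(rng.normal(size=d), rng.normal(size=d),
                     rng.normal(size=(k, d)),
                     rng.normal(size=(d, 2 * d)), np.zeros(d),
                     rng.normal(size=d))
print(p)  # probabilities over 3 edges + stop; sums to 1
```

At test time, the agent could act greedily on this distribution or keep the $N_n$ best traces via the beam search of Section 6.2.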
**Content Representation** The NeuAgent represents the content of a node $s_i$ as a vector $\phi_c(s_i) \in \mathbb{R}^d$. In this work, we use a continuous bag-of-words vector for each document:

$$\phi_c(s_i) = \frac{1}{|D(s_i)|} \sum_{k=1}^{|D(s_i)|} e_k.$$

Each word vector $e_k$ comes from a pretrained continuous bag-of-words model [7]. These word vectors are fixed throughout training.

**Query Representation** In the case of a query, we consider two types of representation. The first is a continuous bag-of-words (BoW) vector, just as used for representing the content of a node. The other is a dynamic representation based on the attention mechanism [2].

In the attention-based query representation, the query is first projected into a set of context vectors. The context vector of the $k$-th query word is

$$c_k = \sum_{k'=k-u/2}^{k+u/2} W_{k'} e_{k'},$$

where $W_{k'} \in \mathbb{R}^{d \times d}$ and $e_{k'}$ are, respectively, a trainable weight matrix and a pretrained word vector, and $u$ is the window size. Each context vector is scored at each time step $t$ by $\beta_k^t = f_{\text{att}}(h_{t-1}, c_k)$ with respect to the previous hidden state of the NeuAgent, and all the scores are normalized to be positive and sum to one, i.e.,

$$\alpha_k^t = \frac{\exp(\beta_k^t)}{\sum_{l=1}^{|q|} \exp(\beta_l^t)}.$$

These normalized scores are used as the coefficients in computing the weighted sum of the context vectors, which yields the query representation at time $t$:

$$\phi_q^t(q) = \sum_{k=1}^{|q|} \alpha_k^t c_k.$$

Later, we empirically compare these two query representations.

### 6.2 Inference: Beam Search

Once the NeuAgent is trained, there are a number of approaches to using it to solve the proposed task. The most naive approach is simply to let the agent make a greedy decision at each time step, i.e., to follow the outgoing edge with the highest probability, $\arg\max_k \log p(e_{i,k} \mid \cdot)$. A better approach is to exploit the fact that the agent is allowed to explore up to $N_n$ outgoing edges per node. We use a simple, forward-only beam search with the beam width capped at $N_n$. The beam search simply keeps the $N_n$ most likely traces, in terms of $\log p(e_{i,k} \mid \cdot)$, at each time step.

### 6.3 Training: Supervised Learning

In this paper, we investigate supervised learning, where we train the agent to follow an example trace $p^* = (s_S, \ldots, s^*)$ included in the training set at each step (see Eq. (2)). In this case, the cost per training example is

$$C_{\text{sup}} = -\log p(\varnothing \mid p^*, q) - \sum_{k=1}^{|p^*|} \log p(p_k^* \mid p_{<k}^*, q),$$

i.e., the negative log-probability of taking each annotated edge along the trace plus that of choosing the stop action at the target node.
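The per-example cost can be sketched in a few lines of Python. The callable `step_log_probs` is a hypothetical wrapper around the trained agent that returns the log-probability of each available action at the end of a trace, with `None` denoting the stop action; it is our illustration, not the released code.

```python
def supervised_cost(step_log_probs, trace):
    """Per-example supervised cost C_sup for teacher-forced training (a sketch).

    step_log_probs(prefix) -> dict mapping each available action to its
    log-probability at the node ending `prefix` (None denotes stop).
    trace : the annotated path p* = (s_S, ..., s*) from the training tuple.
    """
    cost = 0.0
    for k in range(1, len(trace)):
        # -log p(p*_k | p*_{<k}, q): log-probability of the annotated next node
        cost -= step_log_probs(trace[:k])[trace[k]]
    # -log p(stop | p*, q): the agent must stop at the target node s*
    cost -= step_log_probs(trace)[None]
    return cost
```

Minimizing this cost over the training tuples $X = (q, s^*, p^*)$ teaches the agent both to follow promising edges and to stop once it reaches a target node.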