# End-to-End Goal-Driven Web Navigation

Rodrigo Nogueira
Tandon School of Engineering
New York University
rodrigonogueira@nyu.edu

Kyunghyun Cho
Courant Institute of Mathematical Sciences
New York University
kyunghyun.cho@nyu.edu

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

## Abstract

We propose goal-driven web navigation as a benchmark task for evaluating an agent with the ability to understand natural language and plan in partially observed environments. In this challenging task, an agent navigates through a website, represented as a graph consisting of web pages as nodes and hyperlinks as directed edges, to find a web page in which a query appears. The agent is required to have sophisticated high-level reasoning based on natural language and an efficient sequential decision-making capability to succeed. We release a software tool, called WebNav, that automatically transforms a website into this goal-driven web navigation task, and as an example, we build WikiNav, a dataset constructed from the English Wikipedia. We extensively evaluate different variants of neural-net-based artificial agents on WikiNav and observe that the proposed goal-driven web navigation well reflects the advances in models, making it a suitable benchmark for evaluating future progress. Furthermore, we extend WikiNav with question-answer pairs from Jeopardy! and test the proposed agent, based on recurrent neural networks, against strong inverted-index-based search engines. The artificial agents trained on WikiNav outperform the search-engine-based approaches, demonstrating the capability of the proposed goal-driven navigation as a good proxy for measuring progress on real-world tasks such as focused crawling and question-answering.

## 1 Introduction

In recent years, there have been many exciting advances in building an artificial agent that can be trained with one learning algorithm to solve many relatively large-scale, complicated tasks (see, e.g., [8, 10, 6]). In many of these works, the target tasks were computer games, such as Atari games [8] and a racing car game [6]. These successes have stimulated researchers to apply a similar learning mechanism to language-based tasks, such as multi-user dungeon (MUD) games [9, 4]. Instead of visual perception, an agent perceives the state of the world through its written description. The set of actions allowed to the agent is either fixed or dependent on the current state. This type of task can efficiently evaluate the agent's ability in not only planning but also language understanding.

We, however, notice that these MUD games do not exhibit the complex nature of natural languages to the full extent. For instance, the largest game world tested by Narasimhan et al. [9] uses a vocabulary of only 1,340 unique words, and the largest game tested by He et al. [4] uses only 2,258 words. Furthermore, the description of a state at each time step is almost always limited to the visual description of the current scene, lacking any use of higher-level concepts present in natural languages.

In this paper, we propose goal-driven web navigation as a large-scale alternative to text-based games for evaluating artificial agents with natural language understanding and planning capability. The proposed goal-driven web navigation treats a whole website as a graph, in which the web pages are nodes and hyperlinks are directed edges. An agent is given a query, which consists of one
or more sentences taken from a randomly selected web page in the graph, and navigates the network, starting from a predefined starting node, to find a target node in which the query appears. Unlike the text-based games, this task uses existing text as it is, resulting in a large vocabulary with a truly natural language description of the state. Furthermore, the task is more challenging because the action space changes greatly with respect to the state the agent is in.

We release a software tool, called WebNav, that converts a given website into a goal-driven web navigation task. As an example of its use, we provide WikiNav, which was built from the English Wikipedia. We design artificial agents based on neural networks (called NeuAgents) trained with supervised learning, and report their respective performances on the benchmark task as well as the performance of human volunteers. We observe that the difficulty of a task generated by WebNav is well controlled by two parameters: (1) the maximum number of hops from the starting node to a target node, $N_h$, and (2) the length of the query, $N_q$.

Furthermore, we extend WikiNav with an additional set of queries constructed from Jeopardy! questions, to which we refer as WikiNav-Jeopardy. We evaluate the proposed NeuAgents against three search-based strategies: (1) SimpleSearch, (2) Apache Lucene, and (3) the Google Search API. The results in terms of document recall indicate that the NeuAgents outperform those search-based strategies, implying the potential of the proposed task as a good proxy for practical applications such as question-answering and focused crawling.

## 2 Goal-driven Web Navigation

A task $\mathcal{T}$ of goal-driven web navigation is characterized by

$$\mathcal{T} = (A, s_S, G, q, R, \Omega). \tag{1}$$

The world in which an agent $A$ navigates is represented as a graph $G = (N, E)$. The graph consists of a set of nodes $N = \{s_i\}_{i=1}^{N_N}$ and a set of directed edges $E = \{e_{i,j}\}$ connecting those nodes. Each node represents a page of the website, which, in turn, is represented by the natural language text $D(s_i)$ in it. There exists an edge going from a page $s_i$ to $s_j$ if and only if there is a hyperlink in $D(s_i)$ that points to $s_j$. One of the nodes is designated as a starting node $s_S$, from which every navigation begins. A target node is one whose natural language description contains a query $q$, and there may be more than one target node.

At each time step, the agent $A$ reads the natural language description $D(s_t)$ of the current node on which it has landed. At no point is the whole world, consisting of the nodes and edges, visible to the agent, nor is its structure or map (the graph structure without any natural language description), thus making this task partially observed.

Once the agent $A$ reads the description $D(s_i)$ of the current node $s_i$, it can take one of the available actions. The set of possible actions is defined as the union of all the outgoing edges $e_{i,\cdot}$ and the stop action, giving the agent a state-dependent action space. Each edge $e_{i,k}$ corresponds to the agent jumping to a next node $s_k$, while the stop action corresponds to the agent declaring that the current node $s_i$ is one of the target nodes. Each edge $e_{i,k}$ is represented by the description of the next node, $D(s_k)$. In other words, deciding which action to take is equivalent to taking a peek at each neighboring node and judging whether that node is likely to lead ultimately to a target node. The agent $A$ receives a reward $R(s_i, q)$ when it chooses the stop action.
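To make this interaction concrete, here is a minimal Python sketch of the task as just formalized. The class and method names (`WebNavEnv`, `observe`, `step`) are our own illustration, not the interface of the released WebNav tool; the reward follows the binary definition given below.

```python
class WebNavEnv:
    """A minimal sketch of goal-driven web navigation.

    Hypothetical interface; the released WebNav tool may differ.
    """

    STOP = None  # sentinel for the stop action

    def __init__(self, graph, text, start_node, query):
        self.graph = graph         # {node_id: [ids of linked pages]}, never shown whole
        self.text = text           # {node_id: description D(s_i)}
        self.current = start_node  # s_S
        self.query = query         # q: one or more sentences from a target page

    def observe(self):
        """The agent sees only the current page plus a peek at each neighbor's
        description, never the whole graph (partial observability)."""
        neighbors = self.graph[self.current]
        return self.text[self.current], [(n, self.text[n]) for n in neighbors]

    def step(self, action):
        """Follow an outgoing edge, or stop and receive the binary reward."""
        if action is self.STOP:
            # R(s_i, q) = 1 iff the query appears in the current page
            reward = 1.0 if self.query in self.text[self.current] else 0.0
            return reward, True    # episode ends
        assert action in self.graph[self.current], "must follow an outgoing edge"
        self.current = action
        return 0.0, False
```

An agent interacts by alternating `observe` and `step` until it emits the stop action or exhausts its hop budget.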
This task uses a simple binary reward:

$$R(s_i, q) = \begin{cases} 1, & \text{if } q \subseteq D(s_i) \\ 0, & \text{otherwise.} \end{cases}$$

**Constraints** It is clear that there exists a degenerate policy with which the agent succeeds at every trial: traverse the graph breadth-first until a node containing the query is found. To avoid this kind of degenerate policy, the task includes a set of four rules/constraints $\Omega$:

1. An agent can follow at most $N_n$ edges at each node.
2. An agent has a finite memory of size smaller than $T$.
3. An agent moves up to $N_h$ hops away from $s_S$.
4. A query of size $N_q$ comes from at least two hops away from the starting node.

The first constraint alone prevents degenerate policies such as breadth-first search, forcing the agent to make as good a decision as possible at each node. The second constraint ensures that the agent cannot cheat by using earlier trials to reconstruct the whole graph structure (during test time) or by storing the entire world in its memory (during training). The third constraint, which is optional, exists for computational reasons. The fourth constraint is included because the agent is allowed to read the content of a next node.

Table 1: Dataset statistics of WikiNav-4-*, WikiNav-8-*, WikiNav-16-*, and WikiNav-Jeopardy.

|       | WikiNav-4-* | WikiNav-8-* | WikiNav-16-* | WikiNav-Jeopardy |
|-------|-------------|-------------|--------------|------------------|
| Train | 6.0k        | 1M          | 12M          | 113k             |
| Valid | 1k          | 20k         | 20k          | 10k              |
| Test  | 1k          | 20k         | 20k          | 10k              |

## 3 WebNav: Software

As a part of this work, we build and release a software tool that turns a website into a goal-driven web navigation task.¹ We call this tool WebNav. Given a starting URL, WebNav reads the whole website and constructs a graph with the web pages in the website as nodes. Each node is assigned a unique identifier $s_i$. The text content of each node, $D(s_i)$, is a cleaned version of the actual HTML content of the corresponding web page. WebNav turns intra-site hyperlinks into a set of edges $e_{i,j}$.

¹ The source code and datasets are publicly available at github.com/nyu-dl/WebNav.

In addition to transforming a website into a graph $G$ from Eq. (1), WebNav automatically selects queries from the nodes' texts and divides them into training, validation, and test sets. We ensure that there is no overlap among the three sets by making each target node, from which a query is selected, belong to only one of them.

Each generated example is defined as a tuple

$$X = (q, s^*, p^*), \tag{2}$$

where $q$ is a query from a web page $s^*$, which was found by following a randomly selected path $p^* = (s_S, \ldots, s^*)$. In other words, WebNav starts from a starting page $s_S$, random-walks the graph for a predefined number of steps ($N_h/2$, in our case), reaches a target node $s^*$, and selects a query $q$ from $D(s^*)$. A query consists of $N_q$ sentences and is selected from among the top-5 candidate sentences in the target node with the highest average TF-IDF, thus discouraging WebNav from choosing a trivial query.

For evaluation purposes alone, it is enough to use only the query $q$ itself as an example. However, we include both one target node (among potentially many other target nodes) and one path from the starting node to this target node (again, among many possible connecting paths) so that they can be exploited when training an agent. They are not to be used when evaluating a trained agent.
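As a rough illustration of this query-selection heuristic, the sketch below scores each sentence of the target page by its average TF-IDF and draws the query from the top-5 candidates. It uses scikit-learn's `TfidfVectorizer`; the function name `select_query` and the exact details (e.g., taking $N_q$ consecutive sentences starting at the chosen candidate) are our assumptions, not the released WebNav code.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer

def select_query(corpus, target_sentences, n_q=2, n_candidates=5, seed=0):
    """Sketch of WebNav-style query selection (details are assumptions).

    corpus           : list of page texts used to fit the TF-IDF statistics
    target_sentences : sentences of the target node's description D(s*)
    n_q              : number of sentences in the query (N_q)
    """
    vectorizer = TfidfVectorizer().fit(corpus)
    tfidf = vectorizer.transform(target_sentences)  # sentence x term, sparse
    # Average TF-IDF over the terms that actually occur in each sentence.
    totals = tfidf.sum(axis=1).A1
    counts = (tfidf != 0).sum(axis=1).A1.clip(min=1)
    scores = totals / counts
    # Rank sentences and keep the n_candidates highest-scoring ones.
    ranked = sorted(range(len(target_sentences)), key=lambda i: -scores[i])
    start = random.Random(seed).choice(ranked[:n_candidates])
    # Assumption: the query is the chosen sentence plus its n_q - 1 successors.
    return " ".join(target_sentences[start:start + n_q])
```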
## 4 WikiNav: A Benchmark Task

With WebNav, we built a benchmark goal-driven navigation task using Wikipedia as the target website. We used the dump file of the English Wikipedia from September 2015, which consists of more than five million web pages.

We built a set of separate tasks with different levels of difficulty by varying the maximum number of allowed hops $N_h \in \{4, 8, 16\}$ and the size of the query $N_q \in \{1, 2, 4\}$. We refer to each task as WikiNav-$N_h$-$N_q$. For each task, we generate training, validation, and test examples from the pages half as many hops away from a starting page as the maximum number of hops allowed.² We use "Category:Main topic classifications" as the starting node $s_S$.

² This limit is an artificial limit we chose for computational reasons.

As a minimal cleanup procedure, we excluded meta articles whose titles start with "Wikipedia". Any hyperlink that leads to a web page outside Wikipedia was removed in advance, together with the following sections: "References", "External Links", "Bibliography", and "Partial Bibliography".

Table 2: Per-page statistics of the English Wikipedia.

|      | Hyperlinks | Words  |
|------|------------|--------|
| Avg. | 4.29       | 462.5  |
| Var  | 13.85      | 990.2  |
| Max  | 300        | 132881 |
| Min  | 0          | 1      |

In Table 2, we present basic per-article statistics of the English Wikipedia. It is evident from these statistics that the world of WikiNav-$N_h$-$N_q$ is large and complicated, even after the cleanup procedure. We ended up with a fairly small dataset for WikiNav-4-*, but large ones for WikiNav-8-* and WikiNav-16-*. See Table 1 for details.

### 4.1 Related Work: Wikispeedia

This work is not the first to notice the possibility of using a website, or possibly the whole web, as a world in which intelligent agents explore to achieve a certain goal. The most relevant recent work to ours is perhaps Wikispeedia [14, 12, 13].

West et al. [14, 12, 13] proposed the following game, called Wikispeedia. The game's world is nearly identical to the goal-driven navigation task proposed in this work. More specifically, they converted "Wikipedia for Schools", which contained approximately 4,000 articles as of 2008, into a graph whose nodes are articles and whose directed edges are hyperlinks. From this graph, a pair of nodes is randomly selected and provided to an agent. The agent's goal is to start from the first node, navigate the graph, and reach the second node. Similarly to WikiNav, the agent has access to the text content of the current node and of all the immediate neighboring nodes. One major difference is that the target is given as a whole article, meaning that there is a single target node in Wikispeedia, while there may be multiple target nodes in the proposed WikiNav.

From this description, we see that goal-driven web navigation is a generalization and re-framing of Wikispeedia. First, we constrain a query to contain less information, making it much more difficult for an agent to navigate to a target node. Furthermore, a major research question of West and Leskovec [13] was to understand "how humans navigate and find the information they are looking for", whereas in this work we are fully focused on proposing an automatic tool for building challenging goal-driven tasks for designing and evaluating artificial intelligent agents.
## 5 WikiNav-Jeopardy: Jeopardy! on WikiNav

One of the potential practical applications utilizing goal-driven navigation is question-answering based on world knowledge. In this Q&A task, a query is a question, and an agent navigates a given information network, e.g., a website, to retrieve an answer. In this section, we propose and describe an extension of WikiNav in which query-target pairs are constructed from actual Jeopardy! question-answer pairs. We refer to this extension of WikiNav as WikiNav-Jeopardy.

We first extract all the question-answer pairs from J! Archive³, which has more than 300k such pairs. We keep only those pairs whose answers are titles of Wikipedia articles, leaving us with 133k pairs. We divide those pairs into 113k training, 10k validation, and 10k test examples, while carefully ensuring that no article appears in more than one partition. Additionally, we do not shuffle the original pairs, to ensure that the train and test examples are from different episodes.

³ www.j-archive.com

For each training pair, we find one path from the starting node "Category:Main topic classifications" to the target node and include it for supervised learning. For reference, the average number of hops to the target node is 5.8, the standard deviation is 1.2, and the minimum and maximum are 2 and 10, respectively. See Table 3 for sample query-answer pairs.

Table 3: Sample query-answer pairs from WikiNav-Jeopardy.

| Query | Answer |
|-------|--------|
| For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory. | Copernicus |
| In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state. | Washington |
| This company's Accutron watch, introduced in 1960, had a guarantee of accuracy to within one minute a month. | Bulova |

## 6 NeuAgent: Neural Network Based Agent

### 6.1 Model Description

**Core Function** The core of the NeuAgent is a parametric function $f_{\text{core}}$ that takes as input the content of the current node $\phi_c(s_i)$ and a query $\phi_q(q)$, and that returns the hidden state of the agent. This parametric function can be implemented either as a feedforward neural network $f_{\text{ff}}$:

$$h_t = f_{\text{ff}}(\phi_c(s_i), \phi_q(q)),$$

which does not take into account the previous hidden state of the agent, or as a recurrent neural network $f_{\text{rec}}$:

$$h_t = f_{\text{rec}}(h_{t-1}, \phi_c(s_i), \phi_q(q)).$$

We refer to these two types of agents as NeuAgent-FF and NeuAgent-Rec, respectively. For NeuAgent-FF, we use a single tanh layer, while for NeuAgent-Rec we use long short-term memory (LSTM) units [5], which have recently become a de facto standard.

Figure 1: Graphical illustration of a single step performed by the baseline model, NeuAgent.

Based on the new hidden state $h_t$, the NeuAgent computes the probability distribution over all the outgoing edges $e_{i,\cdot}$. The probability of each outgoing edge is proportional to the similarity between the hidden state $h_t$ and the content representation of the next node:

$$p(e_{i,j} \mid \tilde{p}) \propto \exp\left(\phi_c(s_j)^\top h_t\right). \tag{3}$$

Note that the NeuAgent peeks at the content of the next node $s_j$ by considering its vector representation $\phi_c(s_j)$. In addition to all the outgoing edges, we also allow the agent to stop, with probability

$$p(\varnothing \mid \tilde{p}) \propto \exp\left(v^\top h_t\right), \tag{4}$$

where the stop-action vector $v$ is a trainable parameter. In the case of NeuAgent-Rec, all these (unnormalized) probabilities are conditioned on the history $\tilde{p}$, the sequence of actions (nodes) selected by the agent so far. We apply a softmax normalization to the unnormalized probabilities to obtain the probability distribution over all the possible actions at the current node $s_i$.

The NeuAgent then selects its next action based on this action probability distribution (Eqs. (3) and (4)). If the stop action is chosen, the NeuAgent returns the current node as an answer and receives a reward $R(s_i, q)$, which is one if correct and zero otherwise. If the agent selects one of the outgoing edges, it moves to the selected node and repeats this process of reading and acting. See Fig. 1 for a single step of the described NeuAgent.
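A small NumPy sketch of a single NeuAgent-FF decision step may help fix ideas. The single tanh layer over the concatenated node and query vectors follows the description above, but the exact parameterization (shapes of `W` and `b`) is our assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def neuagent_ff_step(phi_node, phi_query, phi_neighbors, W, b, v):
    """One step of NeuAgent-FF (a sketch; exact parameterization assumed).

    phi_node      : (d,)   content vector phi_c(s_i) of the current node
    phi_query     : (d,)   query vector phi_q(q)
    phi_neighbors : (k, d) peeked content vectors phi_c(s_j) of the next nodes
    W, b          : (d, 2d), (d,) parameters of the single tanh layer
    v             : (d,)   trainable stop-action vector
    Returns a distribution over k+1 actions: k outgoing edges, then stop.
    """
    h = np.tanh(W @ np.concatenate([phi_node, phi_query]) + b)  # hidden state h_t
    edge_scores = phi_neighbors @ h   # Eq. (3): phi_c(s_j)^T h_t for each edge
    stop_score = v @ h                # Eq. (4): v^T h_t
    return softmax(np.append(edge_scores, stop_score))

# Toy usage: 3 neighbors, d = 4.
rng = np.random.default_rng(0)
d, k = 4, 3
p = neuagent_ff_step(rng.normal(size=d), rng.normal(size=d),
                     rng.normal(size=(k, d)),
                     rng.normal(size=(d, 2 * d)), np.zeros(d),
                     rng.normal(size=d))
print(p)  # probabilities over 3 edges + stop; sums to 1
```

At test time, the agent could act greedily on this distribution or keep the $N_n$ best traces via the beam search of Section 6.2.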
**Content Representation** The NeuAgent represents the content of a node $s_i$ as a vector $\phi_c(s_i) \in \mathbb{R}^d$. In this work, we use a continuous bag-of-words vector for each document:

$$\phi_c(s_i) = \frac{1}{|D(s_i)|} \sum_{k=1}^{|D(s_i)|} e_k.$$

Each word vector $e_k$ comes from a pretrained continuous bag-of-words model [7]. These word vectors are fixed throughout training.

**Query Representation** In the case of a query, we consider two types of representation. The first is a continuous bag-of-words (BoW) vector, just as used for representing the content of a node. The other is a dynamic representation based on the attention mechanism [2].

In the attention-based query representation, the query is first projected into a set of context vectors. The context vector of the $k$-th query word is

$$c_k = \sum_{k'=k-u/2}^{k+u/2} W_{k'} e_{k'},$$

where $W_{k'} \in \mathbb{R}^{d \times d}$ and $e_{k'}$ are, respectively, a trainable weight matrix and a pretrained word vector, and $u$ is the window size. Each context vector is scored at each time step $t$ by $\beta_k^t = f_{\text{att}}(h_{t-1}, c_k)$ with respect to the previous hidden state of the NeuAgent, and all the scores are normalized to be positive and sum to one, i.e.,

$$\alpha_k^t = \frac{\exp(\beta_k^t)}{\sum_{l=1}^{|q|} \exp(\beta_l^t)}.$$

These normalized scores are used as the coefficients in computing the weighted sum of the context vectors, which yields the query representation at time $t$:

$$\phi_q^t(q) = \sum_{k=1}^{|q|} \alpha_k^t c_k.$$

Later, we empirically compare these two query representations.

### 6.2 Inference: Beam Search

Once the NeuAgent is trained, there are a number of approaches to using it to solve the proposed task. The most naive approach is simply to let the agent make a greedy decision at each time step, i.e., to follow the outgoing edge with the highest probability, $\arg\max_k \log p(e_{i,k} \mid \cdot)$. A better approach is to exploit the fact that the agent is allowed to explore up to $N_n$ outgoing edges per node. We use a simple, forward-only beam search with the beam width capped at $N_n$. The beam search simply keeps the $N_n$ most likely traces, in terms of $\log p(e_{i,k} \mid \cdot)$, at each time step.

### 6.3 Training: Supervised Learning

In this paper, we investigate supervised learning, where we train the agent to follow an example trace $p^* = (s_S, \ldots, s^*)$ included in the training set at each step (see Eq. (2)). In this case, the cost per training example is

$$C_{\text{sup}} = -\log p(\varnothing \mid p^*, q) - \sum_{k=1}^{|p^*|} \log p(p_k^* \mid p_{<k}^*, q),$$

i.e., the negative log-probability of taking each annotated edge along the trace plus that of choosing the stop action at the target node.
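The per-example cost can be sketched in a few lines of Python. The callable `step_log_probs` is a hypothetical wrapper around the trained agent that returns the log-probability of each available action at the end of a trace, with `None` denoting the stop action; it is our illustration, not the released code.

```python
def supervised_cost(step_log_probs, trace):
    """Per-example supervised cost C_sup for teacher-forced training (a sketch).

    step_log_probs(prefix) -> dict mapping each available action to its
    log-probability at the node ending `prefix` (None denotes stop).
    trace : the annotated path p* = (s_S, ..., s*) from the training tuple.
    """
    cost = 0.0
    for k in range(1, len(trace)):
        # -log p(p*_k | p*_{<k}, q): log-probability of the annotated next node
        cost -= step_log_probs(trace[:k])[trace[k]]
    # -log p(stop | p*, q): the agent must stop at the target node s*
    cost -= step_log_probs(trace)[None]
    return cost
```

Minimizing this cost over the training tuples $X = (q, s^*, p^*)$ teaches the agent both to follow promising edges and to stop once it reaches a target node.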