# WebVLN: Vision-and-Language Navigation on Websites

Qi Chen*, Dileepa Pitawela*, Chongyang Zhao*, Gengze Zhou, Hsiang-Ting Chen, Qi Wu
Australian Institute for Machine Learning, The University of Adelaide
{qi.chen04, dileepa.pitawela, chongyang.zhao, gengze.zhou, tim.chen, qi.wu01}@adelaide.edu.au
*These authors contributed equally. Corresponding author.
Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

The Vision-and-Language Navigation (VLN) task aims to enable AI agents to accurately understand and follow natural language instructions to navigate through real-world environments, ultimately reaching specific target locations. We recognise a promising opportunity to extend VLN to a comparable navigation task that holds substantial significance in our daily lives, albeit within the virtual realm: navigating websites on the Internet. This paper proposes a new task named Vision-and-Language Navigation on Websites (WebVLN), where we use question-based instructions to train an agent, emulating how users naturally browse websites. Unlike the existing VLN task, which only pays attention to vision and instruction (language), the WebVLN agent further considers underlying web-specific content such as HTML, which is not visible on the rendered webpages yet contains rich visual and textual information. Toward this goal, we contribute a dataset, WebVLN-v1, and introduce a novel approach called Website-aware VLN Network (WebVLN-Net), which is built upon the foundation of state-of-the-art VLN techniques. Experimental results show that WebVLN-Net outperforms current VLN and web-related navigation methods. We believe that the introduction of the new WebVLN task and its dataset will establish a new dimension within the VLN domain and contribute to the broader vision-and-language research community. Code is available at: https://github.com/WebVLN/WebVLN.

## Introduction

Vision-and-Language Navigation (VLN) (Anderson et al. 2018) aims to seamlessly integrate visual perception and action with language understanding, enabling AI agents to navigate and interact effectively within real-world environments. Interestingly, a resemblance can be found in the virtual online environment, where users might rely on AI agents to assist them in gathering information about certain products even when they can only offer broad and vague instructions such as "How much does a pair of grey and orange striped socks cost?". This extends beyond the boundaries of traditional VLN tasks, involving not only vision and instruction (language) but also the abundant information embedded within webpages, such as HTML. With this consideration, we introduce an extended VLN task, denoted as Vision-and-Language Navigation on Websites (WebVLN).

Figure 1: An example of the WebVLN task. An agent is initiated on the homepage of a website and asked a question Q with an auxiliary description D. To respond, the agent is required to intelligently navigate and explore the website, gather information through observation, and finally provide an accurate response/answer R in a free-form sentence.

Figure 1 shows an example of the WebVLN task. In this scenario, an agent starts its journey from a website's front page, presented with a question Q accompanied by an auxiliary description D. The agent emulates genuine user behaviour and navigates through the website. It processes the current view of the webpage and engages in common web browsing activities such as reading the images and text, and clicking on links to navigate to the next pages. The agent's objective is to efficiently traverse the website and reach a target webpage, which contains the necessary information to answer the question Q and produce an accurate response R to the question.
The new task poses several new challenges. First, the choices available to an AI agent navigating a website are substantially more numerous than those in traditional discrete VLN scenarios, which are confined to adjacent navigable viewpoints in physical environments. In WebVLN, the range of choices for each observed webpage is significantly broader, as each page contains a vast array of content, features, links, and interactive elements. Each webpage offers multiple avenues for navigation, such as clicking on various links, buttons, and dropdown menus. Second, due to the intrinsic diversity of available choices on each webpage, WebVLN constructs a more intricate and complex navigation graph than traditional VLN, making it nearly impossible to explore all the content on websites with a naive heuristic trial-and-error scheme. Thus, in the WebVLN task, an ideal method should seek to maximise accurate choices while minimising the need for exploration by leveraging the varied information available within the webpage.

Due to the lack of an off-the-shelf dataset for WebVLN, we have collected a new WebVLN-v1 dataset to facilitate research in this field. It comprises 8,990 records/paths with 14,825 QA pairs derived from three different shopping websites (aliased as SA, HB and ES). Differing from other VLN datasets (Anderson et al. 2018; Qi et al. 2020b) that only consider the visual inputs of the environment, our WebVLN-v1 incorporates both visual and textual contents extracted from the websites. Furthermore, in comparison to other web-related datasets, such as web navigation (Liu et al. 2018; Xu et al. 2021; Mazumder and Riva 2020; Yao et al. 2022) and web QA (Chang et al. 2022; Hsiao et al. 2022), our WebVLN-v1 seamlessly integrates both navigation and QA environments with question-based instructions, resulting in a unified benchmark.

To tackle the challenging WebVLN task, we propose a new method called Website-aware Vision-and-Language Navigation Network (WebVLN-Net), based on the widely used VLN framework VLN-BERT (Hong et al. 2021). Besides the visual input (screenshot) and instruction (question & description), WebVLN-Net considers the underlying HTML of each webpage and extracts elements such as clickable buttons. Upon reaching a stop token, our model starts answering the question using information from both the click history and the current stop webpage. The evaluation of model performance is based on metrics from both the VLN and VQA domains. Specifically, for VLN, we consider success rate (SR), oracle success rate (OSR), success rate weighted by path length (SPL), and trajectory length (TL), while adopting Wu-Palmer Similarity (WUPS) (Wu and Palmer 1994) for VQA evaluation due to the open-ended setting, i.e., generating a free-form sentence as an answer.
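As a concrete reference, below is a minimal sketch of how the navigation-side metrics (SR, OSR, SPL, TL) could be computed for one episode on the website graph. The helper name and the definition of success as stopping exactly on the target webpage are assumptions for illustration, not the released evaluation code; WUPS would additionally require WordNet Wu-Palmer similarity over answer words.

```python
import networkx as nx


def navigation_metrics(graph: nx.DiGraph, pred_path: list, gt_path: list) -> dict:
    """SR, OSR, SPL and TL for a single episode on the website graph.

    graph:     directed site graph; nodes are webpages, edges are click transitions
    pred_path: sequence of webpages visited by the agent (starting at the homepage)
    gt_path:   ground-truth shortest path from the homepage to the target webpage
    """
    target = gt_path[-1]
    success = float(pred_path[-1] == target)          # SR: the agent stopped on the target page
    oracle = float(target in pred_path)               # OSR: the target page was visited at some point
    tl = len(pred_path) - 1                           # TL: number of click transitions taken
    shortest = nx.shortest_path_length(graph, gt_path[0], target)
    spl = success * shortest / max(tl, shortest, 1)   # SPL: success weighted by path efficiency
    return {"SR": success, "OSR": oracle, "SPL": spl, "TL": tl}
```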
In summary, our contributions include:

- A new task, Vision-and-Language Navigation on Websites (WebVLN), where the agent emulates human web browsing behaviours and navigates to a specified target webpage based on the input question and its auxiliary description, subsequently answering the question using information extracted from the target webpage.
- A new WebVLN-v1 dataset, consisting of 8,990 records/paths and 14,825 question-answer (QA) pairs derived from three different websites, covering both navigation and QA in web environments.
- A new method, named Website-aware VLN Network (WebVLN-Net), which not only considers the visual input (screenshot) and linguistic instruction but also uses web-specific content (i.e., the HTML of the webpage) to enhance decision-making precision.

## Related Work

As WebVLN is a new task, we briefly overview the most closely related work on Vision-and-Language Navigation (VLN) and on other web-related navigation and QA tasks.

### Vision-and-Language Navigation (VLN)

The VLN task (Anderson et al. 2018) extends vision-and-language research with sequential action prediction and is one of the most influential tasks in Embodied AI. Research on VLN is dedicated to aligning linguistic instructions with visual cues and actions: some works decompose navigation instructions into fine-grained sub-goals for planning (Hong et al. 2020; He et al. 2021a; Zhu et al. 2020), while others concentrate on utilising object information to identify landmarks from observations (Gao et al. 2021; Qi et al. 2020a, 2021). Temporal information is specifically modelled in (Hao et al. 2020; Hong et al. 2021; Chen et al. 2021, 2022; Qiao et al. 2022, 2023; Zhao, Qi, and Wu 2023) to capture long-range dependencies across past observations and actions, which are crucial during navigation, and some methods incorporate external knowledge during navigation (Li et al. 2022; Gao et al. 2021). Recently, several methods leverage commonsense knowledge from LLMs and build an LLMs-as-agent pipeline to perform zero-shot VLN (Zhou, Hong, and Wu 2023). However, existing VLN tasks require spatial awareness of agents and mainly focus on photo-realistic environments, so they do not consider web-specific information (e.g., descriptions denoted by alt in HTML) when directly applied to website navigation.

### Web Navigation and Question-Answering

The web navigation task (Toyama et al. 2021; Yao et al. 2022; Burns et al. 2022) involves developing algorithms or models that enable automated agents to navigate and interact with websites on the Internet. There are several related datasets (Liu et al. 2018; Xu et al. 2021; Mazumder and Riva 2020; Yao et al. 2022; Deng et al. 2023; Zhou et al. 2023). For example, MiniWoB++ (Liu et al. 2018), RUSS (Xu et al. 2021) and FLIN (Mazumder and Riva 2020) cover sites with diverse user instructions, from simple tasks to complex ones such as booking flights. Many previous works apply various methods to these datasets, which, however, depend on the Document Object Model (DOM) structure (Jia, Kiros, and Ba 2019; He et al. 2021b) and hence hamper their flexibility. As for web QA, it mimics the human behaviour of posing a question, aggregating information on the webpage, and generating a response. Several benchmarks, such as WebQA (Chang et al. 2022) and ScreenQA (Hsiao et al. 2022), have been proposed. However, they only offer a single webpage for each question.
In contrast, we break the boundary between web navigation and web QA by merging them into a unified task called WebVLN, aligning more closely with human behaviour. Moreover, we design a framework (i.e., WebVLN-Net) that can be easily adapted to different websites.

## WebVLN Task and Simulator

### Problem Definition

As shown in Figure 1, the WebVLN task requires an agent to follow natural language instructions (i.e., a question and a description) to navigate from a homepage to a target webpage and to answer the question based on information from both the trajectory and the target webpage. Formally, at the beginning of each episode, the agent is given an input question Q and an auxiliary description D (e.g., details of the target item), since questions are often brief and may lack enough information for locating a unique item (if the question itself has sufficient information to locate the target webpage, the auxiliary description remains empty). Then, the agent observes the current webpage $W^{(i)} = \langle I^{(i)}, B^{(i)} \rangle$, where $I^{(i)}$ and $B^{(i)}$ are the screenshot and the set of clickable buttons of the current ($i$-th) page, respectively. Here, each clickable button $b \in B^{(i)}$ is represented by its description $d$ (i.e., alt in HTML) and image $e$, namely $b = \langle d, e \rangle$; if a button has only a description or only an image, the missing element is left empty. In this setting, the agent must execute a sequence of actions $A$, where each action $a_t \in A$ leads to a new state $s_t \in S$. Each state $s_t$ contains information derived from both the current page $W^{(i)}$ the agent is located on and the current action $a_t$ the agent performs. Besides, we define a special End Of Action token (i.e., [EOA]) as the stop token, which should be predicted when the current state corresponds to the target webpage. Ideally, given the question Q and auxiliary description D, the agent should predict a response/answer R based on the contents of the target state $s_{[\mathrm{EOA}]}$.

### WebVLN Simulator

We establish a WebVLN simulator on three different shopping websites (aliased as SA, HB, and ES), covering 1,485 products such as socks, fossils and blankets, and mirroring the actions of human users. The details are as follows.

**Observations.** To build the simulator, we enable an agent to navigate within a website by interacting with various buttons. On the $i$-th webpage $W^{(i)}$, the simulator generates an RGB image observation $I^{(i)}$ (i.e., a screenshot) that corresponds to the agent's viewpoint. It is worth noting that, to prevent any instances of mismatch, we provide the entire screenshot of the current webpage rather than just the partial view within the display window.

**Action Space.** The primary difficulty in simulator implementation lies in defining the action space, which depends on the current state. Naturally, we do not want the agent to jump indiscriminately to currently unreachable pages. Hence, at each step $t$, the simulator additionally generates a subset of next-step reachable clickable buttons $B_{t+1} \subseteq B$, where $B$ contains all clickable buttons present on the website. The agent interacts with the simulator by selecting a new button $b_{t+1} \in B_{t+1}$. To establish $B_{t+1}$, the simulator constructs a directed graph for each website, denoted as $G = \langle B, E \rangle$, where the existence of an edge indicates a navigable transition between two webpages that is accessible to the agent. At each step, we also incorporate a specific button $b_{[\mathrm{EOA}]}$ for the stop action.
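To make the observation/action-space design concrete, the following is a minimal sketch of such a simulator, assuming the site graph is stored as an adjacency mapping from (page, button) pairs to next pages. The class and field names (`WebVLNSimulator`, `Button`, etc.) are illustrative only, not the released simulator code.

```python
from dataclasses import dataclass, field

EOA = "[EOA]"  # special stop action


@dataclass
class Button:
    """A clickable element: description (HTML `alt`) and/or image path; a missing part stays empty."""
    button_id: str
    description: str = ""
    image: str = ""


@dataclass
class WebVLNSimulator:
    """Toy simulator: webpages are nodes, clickable buttons define edges of a directed graph."""
    screenshots: dict            # page_id -> path to the full-page screenshot I^(i)
    buttons: dict                # page_id -> list[Button] present on that page
    edges: dict                  # (page_id, button_id) -> next page_id
    current_page: str = "home"
    trajectory: list = field(default_factory=list)

    def observation(self):
        """Return the current observation: screenshot plus reachable buttons B_{t+1} and the stop button."""
        reachable = [b for b in self.buttons.get(self.current_page, [])
                     if (self.current_page, b.button_id) in self.edges]
        return {
            "screenshot": self.screenshots[self.current_page],
            "buttons": reachable + [Button(EOA, description="stop here and answer")],
        }

    def step(self, button_id: str):
        """Click a button; selecting [EOA] ends the episode on the current page."""
        self.trajectory.append(self.current_page)
        if button_id == EOA:
            return self.current_page, True   # episode finished; answer from this page
        self.current_page = self.edges[(self.current_page, button_id)]
        return self.current_page, False
```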
## WebVLN-v1 Dataset Construction

### Automatic Path Generation

Following the path-generation procedure of R2R (Anderson et al. 2018), we sample a target webpage and then construct the shortest path from the homepage to the selected target according to the aforementioned graph $G$. For each path, we manually check whether it is reasonable from a human perspective and discard unreasonable ones (e.g., clicking an advertisement icon multiple times in succession). Furthermore, we remove paths with fewer than 2 webpage transitions to maintain dataset quality and diversity, resulting in a total of 8,990 paths.

### LLM-aided Question-Answer Generation

To alleviate the workload on humans, we generate QA pairs with the assistance of Large Language Models (LLMs; here, ChatGPT with the gpt-3.5-turbo model) based on the multimodal content of each webpage. We describe the details in the following.

**Image Data Handling & HTML Cleaning.** Firstly, we employ BLIP-2 (Li et al. 2023), a large and powerful image-captioning model, to transform the website images into captions, ensuring that the LLM is able to capture the visual information. Another critical step is processing the word list, since the original texts taken directly from webpages are often disorganised and challenging to work with, and also include irrelevant information such as messy code. Hence, we adopt a rule-based filtering approach to systematically eliminate such information, which enhances the logic and readability of these texts.

**Designing Rules for Generating QA Pairs.** We then design a series of rules to guide the LLM's behaviour when generating the required QA pairs. The initial rules include:

- Provide 3 questions and their answers that can be directly found from the information provided in the text.
- Ask the first question about price, the second about available sizes, and the third about material. If precise answers cannot be found for those questions, then ask questions about colours and availability in stock.
- Phrase your questions in a clear and concise manner to ensure they can be accurately answered by the given content. Answers should be to the point without additional information.

To mitigate the impact of negative information, we introduce two additional rules as follows.

- The provided text is all from an online shopping website; there is some disturbing information which is irrelevant to the products, such as "sign in". Make sure your questions and answers focus on the products themselves.
- The provided texts may contain punctuation and symbols which are irrelevant to the products; you should be able to distinguish them and make sure they won't appear in the generated questions and answers.

**Prompt for Generating Final QA Pairs.** We obtain the final prompt P by directly concatenating the three aforementioned terms, i.e., P = {"There is a picture of the product with the caption of" + caption + "After that, here are all the words that appear on the website:" + word list + "Lastly, I will give the following instructions, and you will be strictly following the instructions:" + rules}. Moreover, due to the varying lengths of these three terms (with the word list being the longest and the caption the shortest), it becomes necessary to appropriately adjust the weighting of each component. This is particularly crucial given that the caption, despite its brevity, holds rich information.
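A minimal sketch of this prompt-assembly step is given below, assuming the caption, cleaned word list and rules are already available. The function names are illustrative, the `RULES` string is abbreviated, and the call uses the public OpenAI chat-completions client as a stand-in for however the authors actually queried gpt-3.5-turbo.

```python
from openai import OpenAI  # assumes the `openai` Python package (>= 1.0) is installed

RULES = (
    "Provide 3 questions and their answers that can be directly found from the "
    "information provided in the text. Ask the first question about price, the second "
    "about available sizes and the third about material. ..."  # remaining rules as listed above
)


def build_prompt(caption: str, word_list: list) -> str:
    # P = caption segment + word-list segment + rules segment, concatenated as described above.
    return (
        "There is a picture of the product with the caption of " + caption + ". "
        "After that, here are all the words that appear on the website: "
        + " ".join(word_list) + ". "
        "Lastly, I will give the following instructions, and you will be strictly "
        "following the instructions: " + RULES
    )


def generate_qa_pairs(caption: str, word_list: list) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_prompt(caption, word_list)}],
    )
    return response.choices[0].message.content
```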
**Quality Checking.** Due to the inherent uncertainty introduced by LLMs, it is vital to perform quality checks on the generated question-answer (QA) pairs. Concretely, we randomly select 100 QA samples from each website to assess their quality. Our quality checks adhere to the following criteria w.r.t. questions and answers, respectively:

- The questions generated by the LLM should relate to the actual products visible on the website.
- The generated answers should be correct, brief, and correspond to the questions. Any answer surpassing the scope of the question is considered invalid.

We have undertaken multiple iterations of the constructed prompt until all the samples are reasonable and correct. In each iteration, a subset of samples is generated for manual quality assessment, and each sample is evaluated by at least two assessors for a reliable evaluation.

## WebVLN-v1 Dataset Analysis

### WebVLN-v1 Dataset vs. Related Datasets

We compare our WebVLN-v1 dataset with the most relevant datasets covering Embodied AI, mobile apps, and websites. Specifically, for Embodied AI datasets, we consider R2R (Anderson et al. 2018), REVERIE (Qi et al. 2020b) and EQA (Das et al. 2018), where the first two are widely used vision-and-language navigation (VLN) datasets while the last is a well-known embodied question answering dataset. Regarding app-based datasets, we compare PixelHelp (Li et al. 2020), MoTIF (Burns et al. 2022) and META-GUI (Sun et al. 2022). As for websites, we consider seven datasets for a comprehensive comparison, including MiniWoB++ (Liu et al. 2018), RUSS (Xu et al. 2021), FLIN (Mazumder and Riva 2020), WebShop (Yao et al. 2022), MIND2WEB (Deng et al. 2023), WebQA (Chang et al. 2022), and ScreenQA (Hsiao et al. 2022).

| Env. Type | Dataset | Ins. Level | Task | Number |
|---|---|---|---|---|
| Embodied | R2R | Low | Navigation | 21,567 |
| Embodied | EQA | High | Navigation + QA | 5,281 |
| Embodied | REVERIE | High | Localise Remote Object | 21,702 |
| Mobile App | PixelHelp | Low | Navigation | 187 |
| Mobile App | MoTIF | High | Navigation | 1,125 |
| Mobile App | META-GUI | High | Dialogue | 4,707 |
| Website | MiniWoB++ | Low | Navigation | - |
| Website | RUSS | Low | Navigation | 741 |
| Website | FLIN | High | Navigation | 53,520 |
| Website | WebShop | High | Navigation | 12,087 |
| Website | MIND2WEB | High | Navigation | 2,350 |
| Website | WebQA | High | Question-Answer (QA) | 46,500 |
| Website | ScreenQA | High | Question-Answer (QA) | - |
| Website | WebVLN-v1 (ours) | High | Navigation + QA | 14,825 |

Table 1: Comparison between our WebVLN-v1 dataset and related datasets, which cover three different types of environment, i.e., embodied scene, mobile app, and website. To ensure a more comprehensive comparison, we evaluate the datasets across three dimensions: the environment (Env.), the instruction (Ins.), and the target task. Here, Temp. refers to whether the environment changes temporally; Image, Text and HTML are the components that the environment covers; Que. and Des. are abbreviations of question and description, respectively, referring to the type of instruction that the dataset contains; Ins. Level indicates whether the instruction is a high-level statement or a low-level step-by-step command.

Table 1 demonstrates that our WebVLN-v1 dataset is unique, covering rich information (i.e., temporal sequence, image, text and HTML) from the environment and two different types of instruction (i.e., questions and descriptions/statements), while others cover them only partially. Moreover, our WebVLN-v1 is able to support both navigation and question-answering tasks, whereas other website-related datasets can only support one of them.
### WebVLN-v1 Statistics

**Word Cloud.** In Figure 2, we visualise the questions, auxiliary descriptions, and answers of our WebVLN-v1 dataset as Venn-style word clouds (Coppersmith and Kelly 2014). The size of each word is the harmonic mean of its counts.

Figure 2: Word cloud of (a) questions, (b) descriptions, and (c) answers on the proposed WebVLN-v1 dataset.

**Lengths of Question, Description, Answer and Path.** Figure 3 exhibits the distributions of question length, description length and answer length on the WebVLN-v1 dataset. Specifically, the majority of questions consist of 8-12 words (Figure 3(a)), while most descriptions have 7-14 words (Figure 3(b)). The answers, in Figure 3(c), exhibit a broader range, with some extending to 30-40 words, as they derive content from the question, description and webpage. The average path length is 3.32.

Figure 3: Distributions of (a) question length, (b) description length, and (c) answer length on the WebVLN-v1 dataset.

**Data Splits.** We split 60% of the samples as training data, 10% as validation data and 30% as testing data (i.e., 8,960/1,262/4,603). Notably, to prevent any information leakage, we carefully ensure that the training, validation, and testing sets each cover all three websites, while their records/paths remain distinct and non-overlapping.

## WebVLN-Net

For the WebVLN task, we propose a new model, called Website-aware Vision-and-Language Navigation Network (WebVLN-Net), which is based on the widely used vision-and-language navigation framework VLN-BERT (Hong et al. 2021). As shown in Figure 4, our model contains three main components: initialisation, navigation and answering. Specifically, we first initialise the state and context tokens with a pre-trained BERT model. Subsequently, we feed these initialised language tokens, along with the screenshot and button tokens extracted from the current webpage, into the navigation component. This process iterates until the target webpage is reached. Finally, the answering head in the answering component generates the final answer.

### State and Context Initialisation

At initialisation ($t = 0$), our model receives a word sequence comprising the classification token [CLS], the separation token [SEP], and the language tokens $\mathbf{V}$ extracted from both the question Q and the auxiliary description D. Here, [CLS] and [SEP] are predefined tokens in BERT. Similar to VLN-BERT (Hong et al. 2021), the [CLS] token is used to aggregate relevant vision-language cues from the input sequence. In this context, we define the embedded [CLS] token as the initial state representation $s_0$. We update it during the whole training phase, ensuring it is aware of the entire navigation and question-answering task. The process can be defined as

$$s_0, \mathbf{V} = \mathrm{Init}([\mathrm{CLS}], Q, [\mathrm{SEP}], D), \tag{1}$$

where $\mathrm{Init}(\cdot)$ denotes the initialisation process.
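The following is a minimal sketch of the initialisation step in Eq. (1), using an off-the-shelf BERT from HuggingFace Transformers as a stand-in for the model's language stream; how these layers are shared with the navigation Transformer follows VLN-BERT and is simplified away here.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")


def init_state_and_context(question: str, description: str):
    """Eq. (1): encode "[CLS] Q [SEP] D" once and return (s_0, language tokens V)."""
    enc = tokenizer(question, description, return_tensors="pt")  # adds [CLS]/[SEP] automatically
    with torch.no_grad():
        out = bert(**enc)
    hidden = out.last_hidden_state   # (1, seq_len, 768)
    s0 = hidden[:, 0]                # embedded [CLS] token -> initial state s_0
    V = hidden[:, 1:]                # remaining language tokens V, reused at every step
    return s0, V


s0, V = init_state_and_context(
    "What is the price of the Courtside Short Crew Socks?",
    "a pair of grey and orange striped socks",
)
```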
### Web Navigation

We adapt the model proposed in (Hong et al. 2021) to incorporate the learning of navigation and the concurrent selection of clickable buttons. As shown in Figure 4, at each time step, the network takes four different token sets as input: the preceding state token $s_{t-1}$, the language tokens $\mathbf{V}$, the screenshot tokens $\mathbf{I}_t$, and the button tokens $\mathbf{B}_t$. Specifically, we patchify the screenshot into a set of image patches and convert them to the tokens $\mathbf{I}_t$ using a Transformer-based image encoder. Likewise, as each button consists of an image and a description (i.e., alt in HTML), we introduce a button encoder containing an image encoder and a text encoder to derive the corresponding tokens. After that, we concatenate the tokens associated with the same button and project the concatenated token back to its original dimension using a linear projection layer. Subsequently, we feed all the tokens into a multi-layer Transformer to obtain an action probability $p_t$:

$$s_t, p_t = \mathrm{Nav}(s_{t-1}, \mathbf{V}, \mathbf{I}_t, \mathbf{B}_t). \tag{2}$$

Here, $\mathrm{Nav}(\cdot)$ refers to the navigation process at each step. Notably, the set of button tokens $\mathbf{B}_t$ includes an End Of Action ([EOA]) token, which is selected when the agent reaches the target webpage; the state is then updated to $s_{[\mathrm{EOA}]}$. In navigation steps ($t > 0$), the state token $s_t$, screenshot tokens $\mathbf{I}_t$, and button tokens $\mathbf{B}_t$ employ self-attention across the entire input sequence, while the language tokens $\mathbf{V}$ only serve as the keys and values in the Transformer. We regard the language tokens derived from the model's initialisation step as a good representation of both the question and auxiliary description (Q&D), obviating the need for subsequent encoding at later stages and saving computational resources.

Figure 4: Overall architecture of WebVLN-Net. We take an example where the question Q is "What is the price of the Courtside Short Crew Socks?" with an auxiliary description D of "a pair of grey and orange striped socks". The observation at each step contains a screenshot of the current webpage and all the clickable buttons derived from its HTML.

### Question Answering

To generate the final answer, we introduce an Answering Head, which is an M-layer Transformer decoder. Considering an open-ended question-answering setting, our objective is to generate the answer as a free-form sentence autoregressively. Mathematically,

$$R = \mathrm{Ans}(s_{[\mathrm{EOA}]}), \tag{3}$$

where $\mathrm{Ans}(\cdot)$ refers to the answering process. Here, $R$ is the predicted answer consisting of $L$ words (i.e., $R = \{w_l\}_{l=1}^{L}$) and $s_{[\mathrm{EOA}]}$ denotes the last state mentioned above.

### Training

For navigation, we train our network using an imitation learning (IL) objective. Concretely, our agent navigates on the ground-truth trajectory by following the teacher actions and computes a cross-entropy loss for each decision made. Formally, we minimise the navigation loss function, which can be formulated for each specific sample as

$$\mathcal{L}_{\mathrm{nav}} = -\sum_t a_t \log(p_t) - \eta \sum_t a_t^{*} \log(p_t), \tag{4}$$

where $a_t$ is the sampled action and $a_t^{*}$ is the teacher action. Here, $\eta$ represents a coefficient used to weigh the IL loss. The training of the Answering Head employs the autoregressive Teacher Forcing (Williams and Zipser 1989) approach, wherein the prediction at each step should maximise the likelihood of the subsequent token:

$$\max \sum_{l=1}^{L} \log p(w_l \mid w_{<l}).$$
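Below is a minimal PyTorch sketch of these two training objectives, assuming per-step action logits and ground-truth indices are already available; the function names, tensor shapes and the default value of `eta` are illustrative, not taken from the released training code.

```python
import torch
import torch.nn.functional as F


def navigation_loss(action_logits, sampled_actions, teacher_actions, eta=0.5):
    """Cross-entropy over the sampled and teacher actions at every step (cf. Eq. 4).

    action_logits:   (T, num_buttons) unnormalised scores over clickable buttons + [EOA]
    sampled_actions: (T,) indices of the actions the agent sampled
    teacher_actions: (T,) indices of the ground-truth (teacher) actions
    eta:             coefficient weighing the teacher/IL term
    """
    log_probs = F.log_softmax(action_logits, dim=-1)
    sampled_nll = -log_probs.gather(1, sampled_actions.unsqueeze(1)).sum()
    teacher_nll = -log_probs.gather(1, teacher_actions.unsqueeze(1)).sum()
    return sampled_nll + eta * teacher_nll


def answering_loss(word_logits, target_ids):
    """Teacher-forced autoregressive loss for the Answering Head.

    word_logits: (L, vocab_size) decoder outputs, one step per ground-truth answer token
    target_ids:  (L,) ground-truth answer token ids
    """
    return F.cross_entropy(word_logits, target_ids)
```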