# KnowIT VQA: Answering Knowledge-Based Questions about Videos

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Noa Garcia,¹ Mayu Otani,² Chenhui Chu,¹ Yuta Nakashima¹
¹Osaka University, Japan, ²Cyber Agent, Inc., Japan
{noagarcia, chu, n-yuta}@ids.osaka-u.ac.jp, otani mayu@cyberagent.co.jp

We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a video dataset with 24,282 human-generated question–answer pairs about a popular sitcom. The dataset combines visual, textual, and temporal coherence reasoning together with knowledge-based questions, which require the experience obtained from watching the series to be answered. Second, we propose a video understanding model that combines the visual and textual video content with specific knowledge about the show. Our main findings are: (i) the incorporation of knowledge produces outstanding improvements for VQA in video, and (ii) performance on KnowIT VQA still lags well behind human accuracy, indicating its usefulness for studying the limitations of current video modelling.

## Introduction

Visual question answering (VQA) was first introduced in (Malinowski and Fritz 2014) as a task for bringing together advancements in natural language processing and image understanding. Since then, VQA has developed rapidly, in part due to the release of a large number of datasets, such as (Malinowski and Fritz 2014; Antol et al. 2015; Krishna et al. 2017; Johnson et al. 2017; Goyal et al. 2017). The current trend for addressing VQA (Anderson et al. 2018; Kim et al. 2017a; Ben-Younes et al. 2017; Bai et al. 2018) is to predict the correct answer from a multi-modal representation, obtained by encoding images with a pre-trained convolutional neural network (CNN) and attention mechanisms (Xu et al. 2015), and encoding questions with a recurrent neural network (RNN). These kinds of models infer answers by focusing on the content of the images (e.g. *How many people are there wearing glasses?* in Fig. 1).

Figure 1: Types of questions addressed in KnowIT VQA.

Considering that the space spanned by the training question–image pairs is finite, using image content as the only source of information to predict answers presents two important limitations. On one hand, image features only capture the static information of the picture, leaving temporal coherence in video unattended (e.g. *How do they finish the conversation?* in Fig. 1), which is a strong constraint in real-world applications.
On the other hand, visual content by itself does not provide enough insight to answer questions that require knowledge (e.g. *Who owns the place where they are standing?* in Fig. 1). To address these limitations, video question answering (Video QA) (Tapaswi et al. 2016; Kim et al. 2017b; Lei et al. 2018) and knowledge-based visual question answering (KBVQA) (Wu et al. 2016; Wang et al. 2018) have emerged independently, each proposing specific datasets and models. However, a common framework for addressing multiple question types in VQA is still missing. The contribution of this work lies in this direction: we introduce a general framework in which both video understanding and knowledge-based reasoning are required to answer questions.

We first argue that a popular sitcom, such as The Big Bang Theory,[1] is an ideal testbed for modelling knowledge-based questions about the world. With this idea, we created KnowIT VQA,[2] a dataset for KBVQA in videos in which real-world natural language questions are designed to be answerable only by people who are familiar with the show. We then cast the problem as a multiple-choice challenge and introduce a two-piece model that (i) acquires, processes, and maps specific knowledge into a continuous representation inferring the motivation behind each question, and (ii) fuses video and language content together with the acquired knowledge in a multi-modal fashion to predict the answer.

[1] https://www.cbs.com/shows/big\ bang\ theory/
[2] Available at https://knowit-vqa.github.io/

Table 1: Comparison of Video QA and KBVQA datasets. Answers are either multiple-choice (MC_N, with N being the number of choices) or single word. The last four columns refer to the types of questions available in each dataset.

| Dataset | VQA-Type | Domain | # Imgs | # QAs | Answers | Vis. | Text. | Temp. | Know. |
|---|---|---|---|---|---|---|---|---|---|
| MovieQA (Tapaswi et al. 2016) | Video | Movie | 6,771 | 14,944 | MC5 | ✓ | ✓ | ✓ | – |
| KB-VQA (Wang et al. 2017) | KB | COCO | 700 | 2,402 | Word | ✓ | – | – | ✓ |
| PororoQA (Kim et al. 2017b) | Video | Cartoon | 16,066 | 8,913 | MC5 | ✓ | ✓ | ✓ | – |
| TVQA (Lei et al. 2018) | Video | TV show | 21,793 | 152,545 | MC5 | ✓ | ✓ | ✓ | – |
| R-VQA (Lu et al. 2018) | KB | Visual Genome | 60,473 | 198,889 | Word | ✓ | – | – | – |
| FVQA (Wang et al. 2018) | KB | COCO, ImageNet | 2,190 | 5,826 | Word | ✓ | – | – | ✓ |
| KVQA (Shah et al. 2019) | KB | Wikipedia | 24,602 | 183,007 | Word | ✓ | – | – | ✓ |
| OK-VQA (Marino et al. 2019) | KB | COCO | 14,031 | 14,055 | Word | ✓ | – | – | ✓ |
| KnowIT VQA (Ours) | Video, KB | TV show | 12,087 | 24,282 | MC4 | ✓ | ✓ | ✓ | ✓ |

## Related Work

### Video Question Answering

Video QA addresses specific challenges with respect to the interpretation of temporal information in videos, including action recognition (Maharaj et al. 2017; Jang et al. 2017; Zellers et al. 2019; Mun et al. 2017), story understanding (Tapaswi et al. 2016; Kim et al. 2017b), and temporal coherence (Zhu et al. 2017). Depending on the video source, the visual content may also be associated with textual data, such as subtitles or scripts, which provide an extra level of context for its interpretation. Most of the datasets proposed so far focus mainly on either the textual or the visual aspect of the video, without exploiting the combination of both modalities. In MovieQA (Tapaswi et al. 2016), for example, questions are mainly plot-focused, whereas in other collections questions are purely about the visual content, such as action recognition in MovieFIB (Maharaj et al. 2017), TGIF-QA (Jang et al. 2017), and Mario VQA (Mun et al. 2017), or temporal coherence in Video Context QA (Zhu et al. 2017). Only a few datasets, such as PororoQA (Kim et al. 2017b) or TVQA (Lei et al. 2018), present benchmarks for exploiting multiple sources of information, requiring models to jointly interpret multi-modal video representations.
Even so, reasoning beyond the video content in these kinds of approaches is complicated, as only the knowledge acquired from the training samples is used to generate the answer.

### Knowledge-Based Visual Question Answering

Answering questions about a visual query using only its content constrains the output to the space of knowledge contained in the training set. Considering that the amount of training data in any dataset is finite, the knowledge used to predict answers in standard visual question answering is rather limited. In order to answer questions beyond the image content, KBVQA proposes to inform VQA models with external knowledge. How to acquire and incorporate this knowledge, however, is still at an early stage. For example, (Zhu et al. 2015) creates a specific knowledge base with image-focused data for answering questions under a certain template, whereas more generic approaches (Wu et al. 2016) extract information from external knowledge bases, such as DBpedia (Auer et al. 2007), to improve VQA accuracy. As VQA datasets do not envisage questions about general information about the world, specific KBVQA datasets have recently been introduced, including KB-VQA (Wang et al. 2017) with question–image pairs generated from templates, R-VQA (Lu et al. 2018) with relational facts supporting each question, FVQA (Wang et al. 2018) with supporting facts extracted from generic knowledge bases, KVQA (Shah et al. 2019) for entity identification, and OK-VQA (Marino et al. 2019) with free-form questions without knowledge annotations. Most of these datasets impose hard constraints on their questions, such as being generated by templates (KB-VQA) or directly obtained from existing knowledge bases (FVQA), with OK-VQA being the only one that requires handling unstructured knowledge to answer natural questions about images. Following this direction, we present a framework for answering general questions that may or may not be associated with a knowledge base by introducing a new Video QA dataset, in which questions are freely proposed by qualified workers to study knowledge and temporal coherence together. To the best of our knowledge, this is the first work that explores external-knowledge questions in a collection of videos.

## KnowIT VQA Dataset

Due to the natural structure of TV shows, in which characters, scenes, and the general development of the story can be known in advance, TV data has been exploited for modelling real-world scenarios in video understanding tasks (Nagrani and Zisserman 2017; Frermann, Cohen, and Lapata 2018). We also rely on this idea and argue that popular sitcoms provide an ideal testbed to encourage progress in knowledge-based visual question answering, as they make it easier to model knowledge and temporal coherence over time. In particular, we introduce the KnowIT VQA dataset (standing for knowledge informed temporal VQA), a collection of videos from The Big Bang Theory annotated with knowledge-based questions and answers about the show.

Figure 2: Number of questions by KNOWLEDGE TYPE.

### Video Collection

Our dataset contains both visual and textual video data. Videos are collected from the first nine seasons of The Big Bang Theory TV show, with 207 episodes of about 20 minutes each. For the textual data, we obtained the subtitles directly from the DVDs. Additionally, we downloaded episode transcripts from a specialised website.[3] Whereas subtitles are annotated with temporal information, transcripts associate dialogue with characters. We align subtitles and transcripts with dynamic programming, so that each subtitle is annotated with both its speaker and its timestamp.

[3] https://bigbangtrans.wordpress.com/
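The alignment step pairs each timed subtitle line with the transcript line that names its speaker. A minimal sketch of one way to approximate it is shown below, using Python's `difflib` (a dynamic-programming-style sequence matcher) over lightly normalised lines; the field names and data structures are illustrative assumptions, not the authors' implementation.

```python
import difflib
import re

def normalise(text):
    """Lowercase and strip punctuation so near-identical lines can match."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def align_subs_to_transcript(subtitles, transcript):
    """Attach a speaker to each subtitle by aligning the two line sequences.

    subtitles:  list of dicts like {"start": 12.3, "end": 14.1, "text": "..."}
    transcript: list of dicts like {"speaker": "Sheldon", "text": "..."}
    Returns a copy of the subtitles with an added "speaker" field (None if unmatched).
    """
    sub_texts = [normalise(s["text"]) for s in subtitles]
    tra_texts = [normalise(t["text"]) for t in transcript]

    matcher = difflib.SequenceMatcher(a=sub_texts, b=tra_texts, autojunk=False)
    aligned = [dict(s, speaker=None) for s in subtitles]
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            aligned[block.a + offset]["speaker"] = transcript[block.b + offset]["speaker"]
    return aligned
```

In practice subtitle and transcript lines rarely match verbatim, so a real pipeline would align at the token level and score partial overlaps; the sketch only illustrates the idea of recovering speaker labels through sequence alignment.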
Transcripts also contain scene information, which we use to segment each episode into video scenes. Scenes are split uniformly into 20-second clips, obtaining 12,264 clips in total.

### QA Generation

To generate real-world natural language questions and answers, we used Amazon Mechanical Turk (AMT).[4] We required workers to be highly knowledgeable about The Big Bang Theory and instructed them to write knowledge-based questions about the show. Our aim was to generate questions answerable only by people familiar with the show, yet difficult for new viewers. For each clip, we showed workers the video and subtitles, along with a link to the episode transcript and summaries of all the episodes for extra context. Workers were asked to annotate each clip with a question, its correct answer, and three wrong but relevant answers. The QA generation process was done in batches of one season at a time, in two rounds. During the second round, we showed the already collected data for each clip in order to 1) get feedback on the quality of the collected data and 2) obtain a diverse set of questions. The QA collection process took about 3 months.

[4] https://www.mturk.com

### Knowledge Annotations

We define knowledge as the information that is not contained in a given video clip. To approximate the knowledge that viewers acquire by watching the series, we annotated each QA pair with expert information:

- KNOWLEDGE: the information that is required to answer the question, represented by a short sentence.

### Knowledge Base

Knowledge instances are represented as nodes in a graph $G$, in which an edge connects two instances whose similarity is greater than 0.998.[9] To find near-duplicate instances, we create clusters of nodes, $C_l$ with $l = 1, \dots, L$, by finding all the connected components in $G$, i.e. $C_l$ corresponds to the $l$-th subgraph of $G$, in which all nodes are connected to each other by a path of edges. We randomly choose one node in each cluster and remove the others.

[9] We experimentally found 0.998 to be a good tradeoff between near duplicates and semantically similar instances.
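A minimal sketch of this de-duplication step is given below. It assumes a precomputed pairwise similarity structure and the 0.998 threshold; the union-find clustering is a standard way to extract connected components and is our own illustration, not necessarily how the released knowledge base was built.

```python
def deduplicate(instances, similarity, threshold=0.998):
    """Keep one knowledge instance per connected component of the similarity graph.

    instances:  list of knowledge strings (the graph nodes)
    similarity: nested list / array such that similarity[i][j] is the pairwise
                similarity; an edge exists when it exceeds the threshold
    """
    n = len(instances)
    parent = list(range(n))              # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Connect every pair of near-duplicate nodes.
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] > threshold:
                union(i, j)

    # Keep a single representative node per cluster C_l
    # (the paper picks one at random; here we keep the first encountered).
    kept, seen_roots = [], set()
    for i in range(n):
        root = find(i)
        if root not in seen_roots:
            seen_roots.add(root)
            kept.append(instances[i])
    return kept
```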
### Knowledge Retrieval Module

Inspired by the ranking system in (Nogueira and Cho 2019), the knowledge retrieval module uses a question $q_i$ and its candidate answers $a_i^c$, with $c \in \{0,1,2,3\}$, to query the knowledge base $K$ and rank the knowledge instances $w_j \in K$ according to a relevance score $s_{ij}$. We first obtain a sequence input representation $x_{ij}$ as a concatenation of strings:

$$x_{ij} = \texttt{[CLS]} + q_i + a_i^{\alpha_0} + a_i^{\alpha_1} + a_i^{\alpha_2} + a_i^{\alpha_3} + \texttt{[SEP]} + w_j + \texttt{[SEP]},$$

where [SEP] separates the text used for querying from the knowledge to be queried. Although preliminary experiments showed that the order of the answers $a_i^c$ does not have a strong impact on the results, we automatically sort the answers according to a prior relevance score so that the model is order-invariant; $\alpha_c$ is then the original position of the answer with the $c$-th highest score. Details are provided below.

We tokenise $x_{ij}$ into a sequence of $n$ tokens, $\mathbf{x}_{ij}$,[10] and input it into a BERT network, namely BERT-scoring, denoted by $\mathrm{BERT}_S(\mathbf{x}_{ij})$, whose output is the vector corresponding to the [CLS] token. To compute $s_{ij}$, we use a fully-connected layer together with a sigmoid activation:

$$s_{ij} = \mathrm{sigmoid}\big(\mathbf{w}_S^{\top}\,\mathrm{BERT}_S(\mathbf{x}_{ij}) + b_S\big), \qquad (3)$$

where $\mathbf{w}_S$ and $b_S$ are the weight vector and the bias scalar of the fully-connected layer, respectively. The BERT-scoring parameters, $\mathbf{w}_S$, and $b_S$ are fine-tuned using matching (i.e. $i = j$) and non-matching (i.e. $i \neq j$) QA–knowledge pairs with the following loss:

$$\mathcal{L} = -\sum_{i = j} \log(s_{ij}) - \sum_{i \neq j} \log(1 - s_{ij}). \qquad (4)$$

For each $q_i$, all $w_j \in K$ are ranked according to $s_{ij}$. The top $k$ ranked instances, i.e. the most relevant samples for the query question, are retrieved.

[10] Sequences longer than $n$ are truncated, and sequences shorter than $n$ are zero-padded.

### Prior Score Computation

To prevent the model from producing different outputs for different candidate-answer orderings, we make it order-invariant by sorting the answers $a^c$, $c \in \{0,1,2,3\}$, according to a prior score $\xi_c$. For a given question $q$, $\xi_c$ is obtained by predicting the score of $a^c$ being the correct answer. We first build an input sentence $e_c$ as the concatenation of strings

$$e_c = \texttt{[CLS]} + q + \texttt{[SEP]} + a^c + \texttt{[SEP]},$$

and tokenise $e_c$ into a sequence of 120 tokens, $\mathbf{e}_c$. If $\mathrm{BERT}_E(\cdot)$ denotes a BERT network whose output is the vector corresponding to the [CLS] token, $\xi_c$ is obtained as

$$\xi_c = \mathbf{w}_E^{\top}\,\mathrm{BERT}_E(\mathbf{e}_c) + b_E. \qquad (5)$$

Finally, all $\xi_c$ with $c \in \{0,1,2,3\}$ are sorted in descending order, and the answers are ordered according to $\alpha_c = \delta$, where $\delta$ is the original position of the answer with the $c$-th highest score.

### Video Reasoning Module

In this module, the retrieved knowledge instances are jointly processed with the multi-modal representations of the video content to predict the correct answer. The module contains three components: visual representation, language representation, and answer prediction.

**Visual Representation.** We sample $n_f$ frames from each video clip and apply four different techniques to describe their visual content:

- **Image features:** Each frame is fed into ResNet50 (He et al. 2016) without the last fully-connected layer and is represented by a 2,048-dimensional vector. We concatenate the vectors from the $n_f$ frames and condense them into a 512-dimensional vector with a fully-connected layer (a sketch of this pathway is given after this list).
- **Concepts features:** For a given frame, we use the bottom-up object detector (Anderson et al. 2018) to obtain a list of objects and attributes. We encode all the objects and attributes in the $n_f$ frames into a $C$-dimensional bag-of-concepts representation, which is projected into a 512-dimensional space with a fully-connected layer. $C$ is the total number of available objects and attributes.
- **Facial features:** We use between 3 and 18 photos of the main cast of the show to train the state-of-the-art face recognition network in (Parkhi et al. 2015).[11] For each clip, we encode the detected faces as an $F$-dimensional bag-of-faces representation, which is projected into a 512-dimensional space with a fully-connected layer. $F$ is the total number of people the network is trained on.
- **Caption features:** For each frame, we generate a caption describing its visual content using (Xu et al. 2015). The $n_f$ captions extracted from each clip are passed to the language representation model.

[11] Characters trained in the face recognition network are: Amy, Barry, Bernadette, Dr. Beverly Hofstadter, Dr. V. M. Koothrappali, Emily, Howard, Leonard, Leslie, Lucy, Mary Cooper, Penny, Priya, Raj, Sheldon, Stuart, and Wil Wheaton.
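As a reference for the image-feature pathway above, the sketch below extracts a 2,048-dimensional ResNet50 descriptor per frame, concatenates the $n_f$ frame vectors, and projects them to 512 dimensions with a fully-connected layer. It is a minimal illustration with assumed tensor shapes, not the released implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageFeatures(nn.Module):
    """ResNet50 frame descriptors condensed into a single 512-d clip vector."""

    def __init__(self, num_frames):
        super().__init__()
        resnet = models.resnet50(pretrained=True)  # downloads ImageNet weights
        # Drop the final fully-connected layer; keep the 2,048-d pooled features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.project = nn.Linear(num_frames * 2048, 512)

    def forward(self, frames):
        # frames: (batch, num_frames, 3, 224, 224), already resized and normalised
        b, nf, c, h, w = frames.shape
        feats = self.backbone(frames.view(b * nf, c, h, w))  # (b*nf, 2048, 1, 1)
        feats = feats.view(b, nf * 2048)                     # concatenate frame vectors
        return self.project(feats)                           # (b, 512)

# Example: a batch of 2 clips with 5 sampled frames each.
clip = torch.randn(2, 5, 3, 224, 224)
v = ImageFeatures(num_frames=5)(clip)   # v.shape == (2, 512)
```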
**Language Representation.** Text data is processed with a fine-tuned BERT model, namely BERT-reasoning. We compute the language input $y_c$ as a concatenation of strings:

$$y_c = \texttt{[CLS]} + \mathrm{caps} + \mathrm{subs} + q + \texttt{[SEP]} + a^c + w + \texttt{[SEP]},$$

where caps is the concatenation of the $n_f$ captions (ordered by timestamp), subs the subtitles, and $w$ the concatenation of the $k$ retrieved knowledge instances. For each question $q$, four different $y_c$ are generated, one for each candidate answer $a^c$ with $c \in \{0,1,2,3\}$. We tokenise $y_c$ into a sequence of $m$ tokens, $\mathbf{y}_c$, as in BERT-scoring. Let $\mathrm{BERT}_R$ denote BERT-reasoning, whose output is the vector corresponding to the [CLS] token. For $a^c$, the language representation $\mathbf{u}_c$ is obtained as $\mathbf{u}_c = \mathrm{BERT}_R(\mathbf{y}_c)$.

**Answer Prediction.** To predict the correct answer, we concatenate the visual representation $\mathbf{v}$ (i.e. image, concepts, or facial features) with one of the language representations $\mathbf{u}_c$:

$$\mathbf{z}_c = [\mathbf{v}, \mathbf{u}_c]. \qquad (6)$$

$\mathbf{z}_c$ is projected into a single score $o_c$ with a fully-connected layer:

$$o_c = \mathbf{w}_R^{\top} \mathbf{z}_c + b_R. \qquad (7)$$

The predicted answer $\hat{a}$ is given by the index of the maximum value in $\mathbf{o} = (o_0, o_1, o_2, o_3)^{\top}$, i.e. $\hat{a} = a^{\arg\max_c o_c}$. With $c^{*}$ being the correct class, $\mathrm{BERT}_R$, $\mathbf{w}_R$, and $b_R$ are fine-tuned with the multi-class cross-entropy loss:

$$\mathcal{L}(\mathbf{o}, c^{*}) = -\log \frac{\exp(o_{c^{*}})}{\sum_{c} \exp(o_c)}. \qquad (8)$$

## Experimental Results

We evaluated and compared our model, ROCK, against several baselines on the KnowIT VQA dataset. Results per question type and overall accuracy are reported in Table 4. Models were trained with stochastic gradient descent with a momentum of 0.9 and a learning rate of 0.001. In the BERT implementations, we used the uncased base model with pre-trained initialisation.

Table 4: Accuracy of different methods on the KnowIT VQA dataset.

| Model | Vis. | Text. | Temp. | Know. | All |
|---|---|---|---|---|---|
| Random | 0.250 | 0.250 | 0.250 | 0.250 | 0.250 |
| **Answers** | | | | | |
| Longest | 0.324 | 0.308 | 0.395 | 0.342 | 0.336 |
| Shortest | 0.241 | 0.236 | 0.233 | 0.297 | 0.275 |
| word2vec sim | 0.166 | 0.196 | 0.233 | 0.189 | 0.186 |
| BERT sim | 0.199 | 0.239 | 0.198 | 0.226 | 0.220 |
| **QA** | | | | | |
| word2vec sim | 0.108 | 0.163 | 0.151 | 0.180 | 0.161 |
| BERT sim | 0.174 | 0.264 | 0.209 | 0.190 | 0.196 |
| TFIDF | 0.434 | 0.377 | 0.488 | 0.485 | 0.461 |
| LSTM Emb. | 0.444 | 0.428 | 0.512 | 0.515 | 0.489 |
| LSTM BERT | 0.446 | 0.464 | 0.500 | 0.532 | 0.504 |
| ROCK_QA | 0.542 | 0.475 | 0.547 | 0.535 | 0.530 |
| Humans (Rookies, Blind) | 0.406 | 0.407 | 0.418 | 0.461 | 0.440 |
| **Subs, QA** | | | | | |
| LSTM Emb. | 0.432 | 0.362 | 0.512 | 0.496 | 0.467 |
| LSTM BERT | 0.452 | 0.446 | 0.547 | 0.530 | 0.504 |
| TVQA_SQA | 0.602 | 0.551 | 0.512 | 0.468 | 0.509 |
| ROCK_SQA | 0.651 | 0.754 | 0.593 | 0.534 | 0.587 |
| Humans (Rookies, Subs) | 0.618 | 0.837 | 0.453 | 0.498 | 0.562 |
| **Vis, Subs, QA** | | | | | |
| TVQA | 0.612 | 0.645 | 0.547 | 0.466 | 0.522 |
| ROCK_VSQA Image | 0.643 | 0.739 | 0.581 | 0.539 | 0.587 |
| ROCK_VSQA Concepts | 0.647 | 0.743 | 0.581 | 0.538 | 0.587 |
| ROCK_VSQA Facial | 0.649 | 0.743 | 0.581 | 0.537 | 0.587 |
| ROCK_VSQA Caption | 0.666 | 0.772 | 0.581 | 0.514 | 0.580 |
| Humans (Rookies, Video) | 0.936 | 0.932 | 0.624 | 0.655 | 0.748 |
| **Vis, Subs, QA, Knowledge** | | | | | |
| ROCK Image | 0.654 | 0.681 | 0.628 | 0.647 | 0.652 |
| ROCK Concepts | 0.654 | 0.685 | 0.628 | 0.646 | 0.652 |
| ROCK Facial | 0.654 | 0.688 | 0.628 | 0.646 | 0.652 |
| ROCK Caption | 0.647 | 0.678 | 0.593 | 0.643 | 0.646 |
| ROCK_GT | 0.747 | 0.819 | 0.756 | 0.708 | 0.731 |
| Humans (Masters, Video) | 0.961 | 0.936 | 0.857 | 0.867 | 0.896 |

**Answers.** To detect potential biases in the dataset, we evaluated the accuracy of predicting the correct answer by considering only the candidate answers:

- Longest/Shortest: The predicted answer is the one with the largest/smallest number of words.
- word2vec/BERT sim: For word2vec, we use 300-dimensional pre-trained word2vec vectors (Mikolov et al. 2013). For BERT, we encode words with the output of the third-to-last layer of pre-trained BERT. Answers are encoded as the mean of their word representations. The prediction is the answer with the highest average cosine similarity to the other candidates.

In general, these baselines performed very poorly, with only Longest being better than random. Other than the tendency of correct answers to be longer, the results do not show any strong biases in terms of answer similarities.

**QA.** We also evaluated several baselines in which only questions and candidate answers are considered:

- word2vec/BERT sim: Questions and answers are represented by the mean of their word2vec or pre-trained BERT word representations. The predicted answer is the one with the highest cosine similarity to the question (see the sketch below).
- TFIDF: Questions and answers are represented as weighted frequency word vectors (tf-idf) and projected into a 512-dimensional space. The question and the four answer candidates are then concatenated and input into a four-class classifier to predict the correct answer.
- LSTM Emb./BERT: Each word in a question or candidate answer is encoded through an embedding layer or a pre-trained BERT network and input into an LSTM (Hochreiter and Schmidhuber 1997). The last hidden state of the LSTM is used as a 512-dimensional sentence representation. Question and answers are concatenated and input into a four-class classifier for prediction.
- ROCK_QA: The ROCK model with $m = 120$ tokens, trained and evaluated only with questions and answers as input.

Whereas methods based on sentence similarity performed worse than random, methods with classification layers trained for answer prediction (i.e. TFIDF, LSTM Emb./BERT, and ROCK_QA) obtained considerably better accuracy, even outperforming human workers.
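A minimal sketch of the QA-style similarity baseline follows. It assumes `word_vectors` is a dict-like mapping from word to pre-trained embedding (e.g. word2vec vectors loaded with gensim); the whitespace tokenisation and out-of-vocabulary handling are simplifications, not the exact evaluation code.

```python
import numpy as np

def sentence_vector(sentence, word_vectors, dim=300):
    """Mean of the word embeddings of a sentence (zeros if no word is covered)."""
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def predict_by_similarity(question, answers, word_vectors):
    """Pick the candidate answer whose mean embedding is closest to the question."""
    q_vec = sentence_vector(question, word_vectors)
    scores = [cosine(q_vec, sentence_vector(a, word_vectors)) for a in answers]
    return int(np.argmax(scores))   # index of the predicted answer
```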
**Subs, QA.** Models that use subtitles, questions, and answers as input:

- LSTM Emb./BERT: Subtitles are encoded with an additional LSTM and concatenated to the question and answer candidates before being fed into the four-class classifier.
- TVQA_SQA (Lei et al. 2018): Language is encoded with an LSTM layer and no visual information is used.
- ROCK_SQA: With $m = 120$ tokens, the input sequence only includes subtitles, questions, and candidate answers.

LSTM BERT and ROCK_SQA improved accuracy by 5.7% with respect to using only questions and answers. LSTM Emb., on the other hand, did not improve over the QA-only models, which may indicate that the word embeddings struggle to encode the long subtitle sequences.

**Vis, Subs, QA.** Video QA models based on both language and visual representations.
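These fused models follow the answer-prediction scheme of Eqs. (6)–(8): the clip-level visual vector is concatenated with each candidate's BERT language representation, projected to a scalar score, and trained with four-way cross-entropy. The sketch below illustrates that scoring step; the dimensions and the random mini-batch are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class AnswerScorer(nn.Module):
    """Scores each candidate answer from fused visual + language features (Eqs. 6-8)."""

    def __init__(self, vis_dim=512, lang_dim=768):
        super().__init__()
        self.score = nn.Linear(vis_dim + lang_dim, 1)   # plays the role of w_R, b_R

    def forward(self, v, u):
        # v: (batch, vis_dim) clip-level visual vector
        # u: (batch, 4, lang_dim) BERT [CLS] vectors, one per candidate answer
        z = torch.cat([v.unsqueeze(1).expand(-1, 4, -1), u], dim=-1)  # z_c = [v, u_c]
        return self.score(z).squeeze(-1)                # o = (o_0, ..., o_3)

scorer = AnswerScorer()
v = torch.randn(8, 512)                                  # batch of 8 clips
u = torch.randn(8, 4, 768)                               # 4 candidates per clip
o = scorer(v, u)
loss = nn.CrossEntropyLoss()(o, torch.randint(0, 4, (8,)))  # Eq. (8)
pred = o.argmax(dim=1)                                       # predicted answer index
```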