# Automatic Generation of Grounded Visual Questions

Shijie Zhang, Lizhen Qu, Shaodi You, Zhenglu Yang, Jiawan Zhang

School of Computer Science and Technology, Tianjin University, Tianjin, China; Data61-CSIRO, Canberra, Australia; Australian National University, Canberra, Australia; College of Computer and Control Engineering, Nankai University, Tianjin, China; The School of Computer Software, Tianjin University, Tianjin, China

{shijiezhang,jwzhang}@tju.edu.cn, {lizhen.qu,shaodi.you}@data61.csiro.au, yangzl@nankai.edu.cn

## Abstract

In this paper, we propose the first model able to generate visually grounded questions with diverse types for a single image. Visual question generation is an emerging topic which aims to ask questions in natural language based on visual input. To the best of our knowledge, there is no automatic method for generating meaningful questions of various types for the same visual input. To address this problem, we propose a model that automatically generates visually grounded questions with varying types. Our model takes as input both images and the captions generated by a dense caption model, samples the most probable question types, and generates the questions in sequence. The experimental results on two real-world datasets show that our model outperforms the strongest baseline in terms of both correctness and diversity by a wide margin.

## 1 Introduction

Multi-modal learning of vision and language is an important task in artificial intelligence because it is the basis of many applications such as education, user query prediction, interactive navigation, and so forth. Apart from describing visual scenes with declarative sentences [Chen and Zitnick, 2014; Gupta and Mannem, 2012; Karpathy and Fei-Fei, 2015; Hodosh et al., 2013; Kulkarni et al., 2011; Kuznetsova et al., 2012; Li et al., 2009; Vinyals et al., 2015; Xu et al., 2015], automatic answering of visually related questions (VQA) has recently attracted a lot of attention in the computer vision community [Antol et al., 2015; Malinowski and Fritz, 2014; Gao et al., 2015; Ren et al., 2015; Yu et al., 2015; Zhu et al., 2015]. However, there is little work on automatic generation of questions for images.

> "The art of proposing a question must be held of higher value than solving it." (Georg Cantor)

An intelligent system should be able to ask meaningful questions given its environment. Beyond demonstrating a high level of AI, in practice, multi-modal question-asking modules find their use in a wide range of AI systems such as child education and dialogue systems.

Figure 1: Automatically generated grounded visual questions, e.g., "What is the food in the picture?", "What color is the tablecloth?", "What number is on the cake?", "Where is the coffee?"

To the best of our knowledge, almost all existing VQA systems rely on manually constructed questions [Antol et al., 2015; Malinowski and Fritz, 2014; Gao et al., 2015; Ren et al., 2015; Yu et al., 2015; Zhu et al., 2015]. A common assumption of existing VQA systems is that answers are visually grounded, so that all relevant information can be found in the visual input. However, the construction of such datasets is labor-intensive and time-consuming, which limits the diversity and coverage of the questions being asked. As a consequence, this data incompleteness imposes a special challenge for supervised-learning-based VQA systems.
In light of the above analysis, we focus on the automatic generation of visually grounded questions, coined VQG. The generated questions should be grammatically well-formed, reasonable for the given images, and as diverse as possible. However, existing systems are either rule-based, generating questions with a few limited textual patterns [Ren et al., 2015; Zhu et al., 2015], or they are able to ask only one question per image and the generated questions are frequently not visually grounded [Simoncelli and Olshausen, 2001]. To tackle this task, we propose the first model capable of asking questions of various types for the same image. As illustrated in Fig. 2, we first apply DenseCap [Johnson et al., 2015] to construct dense captions that provide almost complete coverage of the information needed for questions. Then we feed these captions into the question type selector to sample the most probable question types. Taking as input the question types, the dense captions, as well as the visual features generated by VGG-16 [Simonyan and Zisserman, 2014], the question generator decodes all this information into questions. We conduct extensive experiments to evaluate our model as well as the most competitive baseline with three kinds of measures adapted from those commonly used in image caption generation and machine translation.

The contributions of our paper are three-fold:

- We propose the first model capable of asking visually grounded questions with diverse types for a single image.
- Our model outperforms the strongest baseline by up to 216% in terms of the coverage of asked questions.
- The grammaticality of the questions generated by our model, as well as their relatedness to the visual input, also outperforms the strongest baseline by a wide margin.

The rest of the paper is organized as follows: we cover the related work in Section 2, followed by presenting our model in Section 3. After introducing the experimental setup in Section 4, we discuss the results in Section 5 and draw the conclusion in Section 6.

## 2 Related Work

The generation of textual descriptions for visual information has gained popularity in recent years. The key challenge is to learn the alignment between text and visual information [Barnard et al., 2003; Kong et al., 2014; Zitnick et al., 2013]. Herein, a popular task is to describe images with a few declarative sentences, which are often referred to as image captions [Barnard et al., 2003; Chen and Zitnick, 2014; Gupta and Mannem, 2012; Karpathy and Fei-Fei, 2015; Hodosh et al., 2013; Kulkarni et al., 2011; Kuznetsova et al., 2012; Li et al., 2009; Vinyals et al., 2015; Xu et al., 2015].

**Visual Question Answering.** Automatic answering of questions based on visual input is one of the most popular tasks in computer vision [Geman et al., 2015; Malinowski and Fritz, 2014; Malinowski et al., 2015; Pirsiavash et al., 2014; Ren et al., 2015; Weston et al., 2015; Yu et al., 2015]. Most VQA models are evaluated on a few benchmark datasets [Antol et al., 2015; Malinowski and Fritz, 2014; Gao et al., 2015; Ren et al., 2015; Yu et al., 2015; Zhu et al., 2015]. The images in those datasets are sampled from the MS-COCO dataset [Lin et al., 2014], and the question-answer pairs are manually constructed [Antol et al., 2015; Gao et al., 2015; Yu et al., 2015; Zhu et al., 2015].
**Visual Question Generation.** Automatic question generation from text has been explored in depth in NLP; however, it is rarely studied for visual questions, despite the fact that such questions are highly desired in many applications. In order to generate multiple questions per image, the most common approach is to ask humans to manually build the question-answer pairs, which is labor-intensive and time-consuming [Antol et al., 2015; Gao et al., 2015; Malinowski and Fritz, 2014]. As one of the most recent examples, Zhu et al. [Zhu et al., 2015] manually create questions of seven wh-question types such as what, where, and when. Automatic generation of visual questions by using rules has also been explored. Yu et al. [Yu et al., 2015] consider question generation as a task of selectively removing content words that serve as answers from a caption and reformulating the resulting sentences as questions. In a similar manner, Ren et al. [Ren et al., 2015] carefully designed a handful of rules to transform image captions into questions of limited types. However, those rule-based methods are limited by the types of questions they can generate. Apart from that, model-based methods have also been studied to overcome the diversity issue; the most closely related work is [Simoncelli and Olshausen, 2001], which trains an image caption model on a dataset of visual questions. However, their model cannot generate more than one question per image.

**Knowledge Base (KB) based Question Answering (KB-QA).** KB-QA has attracted considerable attention due to the ubiquity of the World Wide Web and the rapid development of artificial intelligence (AI) technology. Large-scale structured KBs, such as DBpedia [Auer et al., 2007], Freebase [Bollacker et al., 2008], and YAGO [Suchanek et al., 2007], provide abundant resources and rich general human knowledge, which can be used to respond to users' queries in open-domain question answering (QA). However, how to bridge the gap between visual questions and structured data in KBs remains a huge challenge. Existing KB-QA methods can be broadly classified into two main categories, namely, semantic parsing based methods [Kwiatkowski et al., 2013; Reddy et al., 2016] and information retrieval based methods [Yao and Durme, 2014; Bordes et al., 2014]. Most semantic parsing based methods transform a question into its meaning representation (i.e., a logical form), which is then translated to a KB query to retrieve the correct answer(s). Information retrieval based methods initially retrieve a rough set of candidate answers, and subsequently perform an in-depth analysis to re-rank the candidates and select the correct ones. These methods focus on modeling the correlation of question-answer pairs from the perspective of question topic, relation mapping, answer type, and so forth.

## 3 Question Generation

Our goal is to generate visually grounded questions with diverse question types directly from images. We start by randomly picking a caption from a set of automatically generated captions, each of which describes a certain region of the image in natural language. Then we sample a reasonable question type conditioned on the caption. In the last step, our question generator learns the correlation between the caption and the image and generates a question of the chosen type. Formally, for each raw image $x$, our model generates a set of captions $\{c_1, c_2, \ldots, c_M\}$, samples a set of question types $\{t_1, t_2, \ldots, t_{\hat{M}}\}$, and then yields a set of grounded questions $\{q_1, q_2, \ldots, q_{\hat{M}}\}$.
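As a rough illustration of this three-step pipeline, the sketch below wires the components together in Python. The helper functions `dense_captions`, `sample_question_type`, and `generate_question` are hypothetical stand-ins for DenseCap, the question type selector, and the question generator; the actual sampling distributions are defined in the following sections.

```python
import random

QUESTION_TYPES = ["what", "when", "where", "who", "why", "how"]

def dense_captions(image):
    # Placeholder for DenseCap: returns (caption, region confidence) pairs.
    return [("floor is brown", 2.1), ("a cake with candles on a table", 1.4)]

def sample_question_type(caption):
    # Placeholder for the question type selector, which samples from P(t | c).
    return random.choice(QUESTION_TYPES)

def generate_question(image, caption, q_type):
    # Placeholder for the question generator, which decodes a question of
    # type `q_type` conditioned on the image and the chosen caption.
    return f"{q_type} ...? (grounded in: {caption})"

def ask_questions(image, n_questions=4):
    captions = dense_captions(image)
    questions = []
    for _ in range(n_questions):
        caption, _conf = random.choice(captions)                      # step 1: pick a caption
        q_type = sample_question_type(caption)                        # step 2: pick a type
        questions.append(generate_question(image, caption, q_type))   # step 3: generate
    return questions

print(ask_questions(image=None))
```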
Herein, a caption or a question is a sequence of words,

$$w = \{w_1, \ldots, w_L\}, \tag{1}$$

where $L$ is the length of the word sequence. Each word $w_i$ uses 1-of-$K$ encoding, where $K$ is the size of the vocabulary. A question type is represented by the first word of a question, adopting 1-of-$T$ encoding, where $T$ is the number of question types. Following [Zhu et al., 2015], we consider six question types in our experiments: what, when, where, who, why, and how.

Figure 2: The proposed framework. Dense captions and rich visual features feed a question type selector (an LSTM classifier) and a question generator (an image-caption correlation layer followed by an LSTM decoder), which outputs grounded questions such as "What is the food in the picture?", "What color is the tablecloth?", "What number is on the cake?", and "Where is the coffee?"

For each image $x_i$, we apply a dense caption model (DenseCap) [Johnson et al., 2015] trained on the Visual Genome dataset [Krishna et al., 2016] to produce a set of captions $C_i$. The generative process is then described as follows:

1. Choose a caption $c_n$ from $C_i$.
2. Choose a question type $t_n$ given $c_n$.
3. Generate a question $q_n$ conditioned on $c_n$ and $t_n$.

Denoting by $\theta$ all model parameters, for each image $x_i$ the joint distribution of $c_n$, $t_n$ and $q_n$ is factorized as follows:

$$P(q_n, t_n, c_n \mid x_i, C_i; \theta) = P(q_n \mid c_n, x_i, t_n; \theta_q)\, P(t_n \mid c_n; \theta_t)\, P(c_n \mid C_i), \tag{2}$$

where $\theta = \theta_q \cup \theta_t$, $P(q_n \mid c_n, x_i, t_n; \theta_q)$ is the distribution for generating questions, and $P(t_n \mid c_n; \theta_t)$ and $P(c_n \mid C_i)$ are the distributions for sampling question types and captions, respectively. More details are given in the following sections. Since we do not observe the alignment between captions and questions, $c_n$ is latent. Summing over $c_n$, we obtain:

$$P(q_n, t_n \mid x_i, C_i; \theta) = \sum_{c_n \in C_i} P(q_n, t_n, c_n \mid x_i, C_i; \theta)$$

Let $Q_i$ denote the question set of image $x_i$. The probability of the training dataset $D$ is given by taking the product of the above probabilities over all images and their questions:

$$P(D) = \prod_{i} \prod_{q_n \in Q_i} P(q_n, t_n \mid x_i, C_i; \theta) \tag{3}$$

For word representations, we initialize a word embedding matrix $E \in \mathbb{R}^{300 \times K}$ using GloVe [Pennington et al., 2014] vectors trained on 840 billion tokens. For the image representations, we apply a VGG-16 model [Simonyan and Zisserman, 2014] trained on ImageNet [Deng et al., 2009] without fine-tuning to produce 300-dimensional feature vectors. The dimension is chosen to match the size of the pre-trained word embeddings. Compared to the question generation model of [Simoncelli and Olshausen, 2001], which generates only one question per image, the probabilistic nature of this model allows generating questions of multiple types that refer to different regions of interest, because each caption predicted by DenseCap is associated with a different region.

### 3.1 Sample Captions and Question Types

The caption model DenseCap generates a set of captions for a given image. Each caption $c$ is associated with a region and a confidence $o_c$ for the proposed region. Intuitively, a caption with higher confidence should receive a higher probability than one with lower confidence. Thus, given a caption set $C_i$ of an image $x_i$, we define the prior distribution as:

$$P(c_k \mid C_i) = \frac{\exp(o_k)}{\sum_{j \in C_i} \exp(o_j)}$$

A caption is either a declarative sentence, a word, or a phrase. We are able to ask many different types of questions, but not all of them, for a chosen caption. For example, for the caption "floor is brown" we can ask "what color is the floor", but it would be awkward to ask a who question.
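The prior over captions is simply a softmax over the region confidences. A minimal sketch, assuming the confidences are available as a list of floats (the function and variable names are ours, not the authors'):

```python
import numpy as np

def caption_prior(confidences):
    """Softmax over DenseCap region confidences o_1, ..., o_M."""
    o = np.asarray(confidences, dtype=np.float64)
    o = o - o.max()               # subtract the max for numerical stability
    p = np.exp(o)
    return p / p.sum()

def sample_caption(captions, confidences, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(captions), p=caption_prior(confidences))
    return captions[idx]

# Captions proposed with higher region confidence are sampled more often.
caps = ["floor is brown", "a white tablecloth", "a cup of coffee"]
print(sample_caption(caps, [2.0, 0.5, 1.0]))
```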
Thus, our model draws a question type for a given caption with probability $P(t_n \mid c_n)$, assuming that the caption alone suffices to infer plausible question types. Our key idea is to learn the association between question types and key words/phrases in captions. The model $P(t_n \mid c_n)$ consists of two components. The first one is a Long Short-Term Memory (LSTM) network [Hochreiter and Schmidhuber, 1997] that maps a caption into a hidden representation. An LSTM is a recurrent neural network taking the following form:

$$h_t, m_t = \mathrm{LSTM}(x_t, h_{t-1}, m_{t-1}), \tag{4}$$

where $x_t$ is the input at time step $t$, and $h_t$ and $m_t$ are the hidden state and memory state of the LSTM at time step $t$, respectively. As the representation of the whole sequence, we take the last state $h_L$ generated at the end of the sequence. This representation is further fed into a softmax layer to compute a probability vector $p_t$ over all question types. The probability vector characterizes a multinomial distribution over all question types.

### 3.2 Generate Questions

At the core of our model is the question generation module, which models $P(q_n \mid c_n, x_i, t_n; \theta_q)$ given a chosen caption $c_n$ and a question type $t_n$. It is composed of three modules: i) an LSTM encoder to generate caption embeddings; ii) a correlation module to learn the association between images and captions; iii) a decoder consisting of an LSTM decoder and an n-gram language model.

A grounded question is deeply anchored in both the sampled caption and the associated image. In our preliminary experiments, we found it useful to let the LSTM encoder $\mathrm{LSTM}(x_t, h_{t-1}, m_{t-1})$ read the image features prior to reading captions. In particular, at time step $t = 0$, we initialize the state vector $m_0$ to zero and feed the image features as $x_0$. At the first time step, the encoder reads in a special token $S_0$ indicating the start of a sentence, which is a good practice adopted by many caption generation models [Vinyals et al., 2015]. After reading the whole caption of length $L$, the encoder yields the last state vector $m_L$ as the embedding of the caption.

The correlation module takes as input the caption embedding from the encoder and the image features from VGG-16, and produces a 300-dimensional joint feature map. We apply a linear layer of size $300 \times 600$ followed by a PReLU [He et al., 2015] layer to learn the associations between captions and images. Since an image gives the overall context and the chosen caption provides the focus in the image, the joint representation provides sufficient context to generate grounded questions. Although the LSTM encoder incorporates image features before reading captions, this correlation module enhances the correlation between images and text by building more abstract representations.

Our decoder extends the LSTM decoder of [Vinyals et al., 2015] with an n-gram language model. The LSTM decoder consists of an LSTM layer and a softmax layer. The LSTM layer starts by reading the joint feature map and the start token $S_0$ in the same fashion as the caption encoder. From time step $t = 0$, the softmax layer predicts the most likely word given the state vector at time $t$ yielded by the LSTM layer. A word sequence ends when the end-of-sequence token is produced.
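The sketch below illustrates the question type selector and the correlation module in PyTorch. The layer sizes follow the text (300-dimensional GloVe and image features, a 300 × 600 linear layer with PReLU); concatenating the caption embedding with the image feature as the input to that layer, and all class and variable names, are our assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

EMB_DIM, HID_DIM, N_TYPES = 300, 300, 6   # what / when / where / who / why / how

class TypeSelector(nn.Module):
    """Models P(t | c): an LSTM over the caption followed by a softmax layer."""
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB_DIM)
        self.lstm = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)
        self.out = nn.Linear(HID_DIM, N_TYPES)

    def forward(self, caption_ids):                        # caption_ids: (batch, L)
        h, _ = self.lstm(self.embed(caption_ids))
        return torch.softmax(self.out(h[:, -1]), dim=-1)   # last hidden state h_L

class CorrelationModule(nn.Module):
    """Fuses the 300-d caption embedding with the 300-d image feature."""
    def __init__(self):
        super().__init__()
        # 600 -> 300 linear layer plus PReLU, matching the 300 x 600 layer in the text.
        self.fuse = nn.Sequential(nn.Linear(2 * EMB_DIM, EMB_DIM), nn.PReLU())

    def forward(self, caption_vec, image_vec):
        return self.fuse(torch.cat([caption_vec, image_vec], dim=-1))

# The LSTM decoder (not shown) reads the fused feature and the start token,
# then emits one word per step until the end-of-sequence token, as in the text.
selector = TypeSelector(vocab_size=10000)
fusion = CorrelationModule()
type_probs = selector(torch.randint(0, 10000, (1, 7)))                # (1, 6)
joint = fusion(torch.randn(1, EMB_DIM), torch.randn(1, EMB_DIM))      # (1, 300)
```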
**Joint decoding.** Although the LSTM decoder alone can generate questions, we found that it frequently produces repeated words and phrases such as "the the". The problem did not disappear even when beam search [Koehn et al., 2003] was applied. This is due to the fact that the state vectors produced at adjacent time steps tend to be similar. Since repeated words and phrases are rarely observed in text corpora, we discount such occurrences by joint decoding with an n-gram language model. Given a word sequence $w = \{w_1, \ldots, w_N\}$, a bigram language model is defined as:

$$P(w) = P(w_1) \prod_{i=2}^{N} P(w_i \mid w_{i-1})$$

Instead of using neural models, we adopt a word-count-based estimation of the model parameters. In particular, we apply Kneser-Ney smoothing [Kneser and Ney, 1995] to estimate $P(w_i \mid w_{i-1})$, which is given by:

$$P(w_i \mid w_{i-1}) = \frac{\max(\mathrm{count}(w_{i-1}, w_i) - d,\, 0)}{\mathrm{count}(w_{i-1})} + \lambda(w_{i-1})\, P_{KN}(w_i),$$

where $\mathrm{count}(x)$ denotes the corpus frequency of term $x$, and $P_{KN}(w_i)$ is a back-off statistic of the unigram $w_i$ used in case the bigram $(w_{i-1}, w_i)$ does not appear in the training corpus. The parameter $d$ is usually fixed to 0.75 to avoid overfitting on low-frequency bigrams, and $\lambda(w_{i-1})$ is a normalizing constant conditioned on $w_{i-1}$. We incorporate the bigram statistics with the LSTM decoder from time step $t = 1$ because the LSTM decoder can well predict the first words of questions. The LSTM decoder essentially captures the conditional probability Pl(qt|q
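For reference, here is a count-based sketch of the interpolated Kneser-Ney bigram estimate above, assuming the training corpus is available as a list of tokenised sentences. The helper names and the way the bigram score is combined with the LSTM score in the final comment are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from itertools import chain

def kneser_ney_bigram(corpus, d=0.75):
    """Return a function P(word | prev) estimated from tokenised sentences."""
    bigrams = Counter(chain.from_iterable(zip(s, s[1:]) for s in corpus))
    unigram_count = Counter(chain.from_iterable(corpus))
    # Continuation count: in how many distinct bigram types each word appears
    # as the second element (used for the back-off statistic P_KN).
    continuation = Counter(w2 for (_, w2) in bigrams)
    n_bigram_types = len(bigrams)
    # Number of distinct words observed after each history (for lambda).
    followers = Counter(w1 for (w1, _) in bigrams)

    def prob(prev, word):
        denom = max(unigram_count[prev], 1)           # count(w_{i-1})
        discounted = max(bigrams[(prev, word)] - d, 0.0) / denom
        lam = d * followers[prev] / denom             # lambda(w_{i-1})
        p_kn = continuation[word] / n_bigram_types    # back-off P_KN(w_i)
        return discounted + lam * p_kn

    return prob

# During joint decoding one could, for example, score each candidate word as
#   log P_lstm(word | state) + alpha * log P_bigram(word | previous word),
# which penalises unlikely repetitions such as "the the".
bigram_prob = kneser_ney_bigram([["what", "color", "is", "the", "floor"],
                                 ["where", "is", "the", "coffee"]])
print(bigram_prob("the", "floor"), bigram_prob("the", "the"))
```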