# Neuro-Symbolic Visual Reasoning: Disentangling Visual from Reasoning

Saeed Amizadeh¹, Hamid Palangi*², Oleksandr Polozov*², Yichen Huang², Kazuhito Koishida¹

*Equal contribution. ¹Microsoft Applied Sciences Group (ASG), Redmond WA, USA. ²Microsoft Research AI, Redmond WA, USA. Correspondence to: Saeed Amizadeh.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

Visual reasoning tasks such as visual question answering (VQA) require an interplay of visual perception with reasoning about the question semantics grounded in perception. However, recent advances in this area are still primarily driven by perception improvements (e.g. scene graph generation) rather than reasoning. Neuro-symbolic models such as Neural Module Networks bring the benefits of compositional reasoning to VQA, but they are still entangled with visual representation learning, and thus neural reasoning is hard to improve and assess on its own. To address this, we propose (1) a framework to isolate and evaluate the reasoning aspect of VQA separately from its perception, and (2) a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception. To this end, we introduce a differentiable first-order logic formalism for VQA that explicitly decouples question answering from visual perception. On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models, leading to informative insights regarding the participating models as well as the task.

## 1. Introduction

Visual reasoning (VR) is the ability of an autonomous system to construct a rich representation of a visual scene and perform multi-step inference over the scene's constituents and their relationships. It stands among the key outstanding challenges in computer vision. Common tangible instantiations of VR include language-driven tasks such as Visual Question Answering (VQA) (Antol et al., 2015) and Visual Commonsense Reasoning (VCR) (Zellers et al., 2019).

Recent advances in computer vision, representation learning, and natural language processing have enabled continued progress on VQA with a wide variety of modeling approaches (Anderson et al., 2018; Andreas et al., 2016; Hudson & Manning, 2019a; 2018; Tan & Bansal, 2019). A defining characteristic of VR is the interaction between a perception system (i.e. object detection and scene representation learning) and a reasoning system (i.e. question interpretation and inference grounded in the scene). However, this interaction is difficult to capture and assess accurately. For example, the definition of VQA has evolved over time to eliminate language biases that impeded its robustness as a VR metric. The early VQA datasets were biased toward real-world language priors to the extent that many questions were answerable without looking at the image (Agrawal et al., 2018). Subsequent versions improved the balance but still mostly involved simple inference questions with little requirement for multi-step reasoning. To facilitate progress in VR, Hudson & Manning (2019b) proposed GQA, a procedurally generated VQA dataset of multi-step inference questions. Although GQA targets compositional multi-step reasoning, the current GQA Challenge primarily evaluates the visual perception of a VQA model rather than its reasoning.
As we show in Section 4, a neuro-symbolic VQA model that has access to ground-truth scene graphs achieves 96% accuracy on GQA. Moreover, language interpretation (i.e. semantic parsing) alone does not capture the complexity of VR because the language of the questions is procedurally generated. As a result, while GQA is well suited as an evaluation environment for VR (e.g. for multi-modal pretraining tasks (Tan & Bansal, 2019; Zhou et al., 2020)), a higher GQA accuracy does not necessarily imply a higher reasoning capability. In this work, we supplement GQA with a differentiable first-order logic framework, ∂-FOL, that allows us to isolate and assess the reasoning capability of a VQA model separately from its perception.

The ∂-FOL Framework¹: ∂-FOL is a neuro-symbolic VR model. Neuro-symbolic models such as MAC (Hudson & Manning, 2018), Neural Module Networks (Andreas et al., 2016), and Neural State Machines (Hudson & Manning, 2019a) implement compositional multi-step inference by modeling each step as a differentiable operator from a functional specification of the question (i.e. a program) or its approximation. This facilitates systematicity, compositionality, and out-of-distribution generalization in VQA because accurate inference of a given question commonly requires accurate inference over its constituents and entailed questions (Vedantam et al., 2019). They, however, commonly operate over the latent feature representations of objects and their relations, produced by the underlying perception module. This entanglement not only limits the interpretability of the learned neuro-symbolic inference blocks, but also limits the reasoning techniques applicable for VQA improvement.

¹ The PyTorch code for the ∂-FOL framework is publicly available at https://github.com/microsoft/DFOL-VQA.

Figure 1. The multi-step question answering process in the ∂-FOL framework, based on differentiable first-order logic. (The example traces the question "Is there a man on the left of all objects in the scene?", parsed into F(X, Y) = ∃X ∀Y: Man(X) ∧ Left(X, Y), through the semantic parser, the visual featurization and visual oracle producing attentions α(Man | x_i) and α(Left | x_i, y_j), the differentiable FOL reasoning steps, and the final aggregation and loss, yielding the answer "No".)

In contrast to SOTA neuro-symbolic approaches, ∂-FOL fully disentangles the visual representation learning of a VQA model from its inference mechanism, while still being end-to-end trainable with backpropagation (see Figure 1). This enables identifying GQA questions solvable via perception vs. reasoning and evaluating their respective contributions.

VQA Reasoning Evaluation Score: To assess the reasoning capability of a VQA model, we define the VQA reasoning evaluation score as the extent to which the model can answer a question despite imperfect visual perception. If the input image is noisy or the perception system is imperfect, the learned object representations do not contain enough information to determine certain attributes of the objects. This potentially impedes question answering and may require non-trivial reasoning. For example, an object detection module that misclassifies wolves as huskies may impede answering the question "Is there a husky in the living room?" Similarly, the question "What is behind the broken wooden chair?" relies on the representation of the corresponding object capturing the "broken", "wooden", and "chair" attributes. Many VQA models answer such questions nonetheless (e.g.
by disregarding weak attribute signals when a strong "chair" signal is present in a single object in the scene), which exemplifies the kind of visual reasoning we aim to assess in VQA. In contrast, questions that can be answered using a pre-trained perception system and parameter-less logical inference do not require reasoning per se, as their visual representations contain all the information necessary to answer the question.

Contributions: This work makes three contributions:

- We introduce differentiable first-order logic as a common formalism for compositional visual reasoning and use it as a foundation for the inference in ∂-FOL.
- We use ∂-FOL to define a disentangled evaluation methodology for VQA systems to assess the informativeness of perception as well as the power of reasoning separately. To this end, we introduce a VQA reasoning evaluation score, an augmentation of GQA evaluation that eliminates questions primarily resolved by perception. With it, we evaluate two representatives from two families of VQA models: MAC (Hudson & Manning, 2018) and LXMERT (Tan & Bansal, 2019).
- As a simple way of going beyond logical reasoning, we propose top-down calibration via the question context on top of FOL reasoning and show that it improves the accuracy of ∂-FOL on visually hard questions.

## 2. Related Work and Background

Visual Question Answering: VQA has been used as a front-line task to research and advance VR capabilities. The first release of the VQA dataset (Antol et al., 2015) initiated annual competitions and a wide range of modeling techniques aimed at addressing visual perception, language understanding, reasoning, and their combination (Anderson et al., 2018; Hudson & Manning, 2019a; 2018; Li et al., 2019; Lu et al., 2019; Tan & Bansal, 2019; Zhou et al., 2020). To reduce the annotation effort and control the problem complexity, CLEVR (Johnson et al., 2017) and GQA (Hudson & Manning, 2019b) propose synthetic construction of scenes and questions, respectively. Capturing and measuring the extent of human ability in VR accurately is a significant challenge in task design as well as modeling. Datasets have to account for language and real-world biases, such as non-visual and false-premise questions (Ray et al., 2016). VQA models, when uncontrolled, are known to solve the task by e.g. exploiting language priors (Agrawal et al., 2016; Zhou et al., 2015). Different techniques have been proposed to control this phenomenon. Agrawal et al. (2018) adversarially separate the distributions of the training and validation sets. Goyal et al. (2017) balance the VQA dataset by asking human subjects to identify distractors: visually similar images that yield different answers to the same questions. Recently, Selvaraju et al. (2020) augment the VQA dataset with human-annotated sub-questions that measure a model's reasoning consistency in answering complex questions. In this work, we propose another step to improve the accuracy of VQA reasoning assessment by capturing a hard subset of GQA questions where perception produces imperfect object representations.

Neuro-Symbolic Reasoning: ∂-FOL is a neuro-symbolic reasoning model (Garcez et al., 2019). In neuro-symbolic reasoning, answer inference is defined as a chain of differentiable modules wherein each module implements an operator from a latent functional program representation of the question.
The approach is applicable to a wide range of tasks, including visual QA (Andreas et al., 2016; Hudson & Manning, 2018; Vedantam et al., 2019), reading comprehension of natural language (Chen et al., 2020), and querying knowledge bases, databases, or other structured sources of information (Liang et al., 2016; Neelakantan et al., 2015; 2016; Saha et al., 2019). The operators can be learned, like in MAC (Hudson & Manning, 2018), or pre-defined, like in NMN (Andreas et al., 2016). In contrast to semantic parsing (Liang, 2016) or program synthesis (Gulwani et al., 2017; Parisotto et al., 2016), the model does not necessarily emit a symbolic program, although it can involve one as an intermediate step to construct the differentiable pipeline (as in ∂-FOL). Neuro-symbolic reasoning is also similar to neural program induction (NPI) (Cai et al., 2017; Pierrot et al., 2019; Reed & De Freitas, 2015), but the latter requires strong supervision in the form of traces, and the learned operators are not always composable or interpretable.

The main benefit of neuro-symbolic models is their compositionality. Because the learnable parameters of individual operators are shared across all questions, and subsegments of the differentiable pipeline correspond to constituents of each question instance, the intermediate representations produced by each module are likely composable with each other. This, in turn, facilitates interpretability, systematicity, and out-of-distribution generalization, commonly challenging desiderata of reasoning systems (Vedantam et al., 2019). In Section 6, we demonstrate them in ∂-FOL over VQA.

Neuro-symbolic models can be partially or fully disentangled from the representation learning of their underlying ground-world modality (e.g. vision in the case of VQA). Partial entanglement is the most common, wherein the differentiable reasoning operates on featurizations of the scene objects rather than raw pixels, but the featurizations live in an uninterpretable latent space. The Neural State Machine (NSM) (Hudson & Manning, 2019a) and the eXplainable and eXplicit Neural Modules (XNM) (Shi et al., 2019) are prominent examples of such frameworks. As for full disentanglement, the Neuro-Symbolic Concept Learner (NS-CL) (Mao et al., 2019) and Neural-Symbolic VQA (NS-VQA) (Yi et al., 2018) separate scene understanding, semantic parsing, and program execution with symbolic representations in between, similar to ∂-FOL. However, NS-CL and NS-VQA, as well as XNM, are based on operators that are heuristic realizations of the task-dependent domain-specific language (DSL) of their target datasets. In contrast, we propose a task-independent, mathematical formalism that is probabilistically derived from first-order logic, independent of any specific DSL. This highlights two important differences between ∂-FOL and NS-CL, NS-VQA, or XNM. First, compared to these frameworks, ∂-FOL is more general-purpose: it can implement any DSL that is representable in FOL. Second, our proposed disentangled evaluation methodology in Section 4 requires the reasoning framework to be mathematically sound so that we can reliably draw conclusions based on it; this is the case for our FOL inference formalism. Furthermore, while NS-CL and NS-VQA have only been evaluated on CLEVR (with synthetic scenes and a limited vocabulary), ∂-FOL is applied to real-life scenes in GQA.
Finally, we note that outside of VR, logic-based, differentiable neuro-symbolic formalisms have been widely used to represent knowledge in neural networks (Serafini & Garcez, 2016; Socher et al., 2013; Xu et al., 2018). A unifying framework for many such formalisms is Differentiable Fuzzy Logics (DFL) (van Krieken et al., 2020), which models quantified FOL within the neural framework. Despite the similarity in formulation, inference in DFL is generally of exponential complexity, whereas ∂-FOL proposes a dynamic programming strategy to perform inference in polynomial time, effectively turning it into the program-based reasoning of recent VQA frameworks. Furthermore, while these frameworks have been used to encode symbolic knowledge into the loss function, ∂-FOL is used to specify a unique feed-forward architecture for each individual instance in the dataset; in that sense, ∂-FOL is similar to recent neuro-symbolic frameworks proposed to tackle the SAT problem (Amizadeh et al., 2018; 2019; Selsam et al., 2018).

## 3. Differentiable First-Order Logic for VR

We begin with the formalism of differentiable first-order logic (DFOL) for VR systems, which forms the foundation for the ∂-FOL framework. DFOL is a formalism for inference over statements about an image and its constituents. It has two important properties: (a) it disentangles inference from perception, so that e.g. the operation "filter all the red objects in the scene" can be split into determining the redness of every object and attending to the ones deemed sufficiently red, and (b) it is end-to-end differentiable, which allows training the perception system from inference results. In Section 4, we show how DFOL enables us to measure the reasoning capabilities of VQA models.

### 3.1. Visual Perception

Given an image I, let V = {v_1, v_2, ..., v_N} be a set of feature vectors v_i ∈ ℝ^d representing a set of N objects detected in I. This detection can be done via different pre-trained models such as Faster-RCNN (Ren et al., 2015) for object detection, or Neural Motifs (Zellers et al., 2018) or Graph-RCNN (Yang et al., 2018) for scene graph generation. As is common in VQA, we assume V is given, and refer to it as the scene visual featurization. Furthermore, we introduce the notion of a neural visual oracle O = {M_f, M_r}, where M_f and M_r are neural models parametrized by vectors θ_f and θ_r, respectively. Conceptually, M_f(v_i, π | V) computes the likelihood of the natural language predicate π holding for object v_i (e.g. M_f(v_i, "red" | V)). Similarly, M_r(v_i, v_j, π | V) calculates the likelihood of π holding for a pair of objects v_i and v_j (e.g. M_r(v_i, v_j, "holding" | V)). O combined with the visual featurization forms the perception system of ∂-FOL.
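To make the oracle interface of Section 3.1 concrete, below is a minimal PyTorch-style sketch of what O = {M_f, M_r} could look like. The class name, layer sizes, and toy predicate vocabulary are illustrative assumptions, not the released DFOL-VQA implementation.

```python
import torch
import torch.nn as nn

class ToyVisualOracle(nn.Module):
    """Illustrative stand-in for the oracle O = {M_f, M_r} of Section 3.1.
    M_f scores unary predicates (names/attributes) per object;
    M_r scores binary predicates (relations) per object pair."""

    def __init__(self, obj_dim=2048, emb_dim=300, vocab=("man", "chair", "red", "left")):
        super().__init__()
        self.vocab = {p: i for i, p in enumerate(vocab)}          # hypothetical concept dictionary
        self.pred_emb = nn.Embedding(len(vocab), emb_dim)
        self.unary = nn.Sequential(nn.Linear(obj_dim + emb_dim, 512), nn.ReLU(), nn.Linear(512, 1))
        self.binary = nn.Sequential(nn.Linear(2 * obj_dim + emb_dim, 512), nn.ReLU(), nn.Linear(512, 1))

    def M_f(self, V, predicate):
        """alpha(pi | x_i) for every object; V has shape (N, obj_dim)."""
        e = self.pred_emb(torch.tensor(self.vocab[predicate])).expand(V.size(0), -1)
        return torch.sigmoid(self.unary(torch.cat([V, e], dim=-1))).squeeze(-1)       # (N,)

    def M_r(self, V, predicate):
        """alpha(pi | x_i, y_j) for every ordered pair; returns (N, N)."""
        N = V.size(0)
        pairs = torch.cat([V.unsqueeze(1).expand(N, N, -1),
                           V.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = self.pred_emb(torch.tensor(self.vocab[predicate])).expand(N, N, -1)
        return torch.sigmoid(self.binary(torch.cat([pairs, e], dim=-1))).squeeze(-1)  # (N, N)

oracle = ToyVisualOracle()
V = torch.randn(5, 2048)                    # 5 detected objects (e.g. Faster-RCNN features)
print(oracle.M_f(V, "chair").shape)         # torch.Size([5])
print(oracle.M_r(V, "left").shape)          # torch.Size([5, 5])
```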
### 3.2. First-Order Logic over Scenes

Given N objects in the scene, we denote by upper-case letters X, Y, Z, ... categorical variables over the object index set I = {1, ..., N}. Their values are denoted by subscripted lower-case letters; e.g. X = x_i states that X is set to refer to the i-th object in the scene. A k-ary predicate π : I^k → {⊤, ⊥} defines a Boolean function on k variables X, Y, Z, ... defined over I. In the context of visual scenes, we use unary predicates π(·) to describe object names and attributes (e.g. Chair(x_i) and Red(y_j)), and binary predicates π(·, ·) to describe relations between pairs of objects (e.g. On(y_j, x_i)).²

² V can also include features of the relations between objects; relation features have been shown to be helpful in tasks such as image captioning and information retrieval (Lee et al., 2019).

Given the definitions above, we naturally define quantified first-order logic (FOL) formulae F, e.g.

F(X, Y) = ∃X ∀Y : Chair(X) ∧ Left(X, Y)    (1)

states that "There is a chair in the scene that is to the left of all other objects." FOL is a more compact way to describe the visual scene than the popular scene graph (Yang et al., 2018) notation, which can be seen as a Propositional Logic description of the scene, also known as grounding the formula. More importantly, while a scene graph is only used to describe the scene, FOL allows us to perform inference over it. For instance, the formula in Eq. (1) also encodes the binary question "Is there a chair in the scene to the left of all other objects?" In other words, a FOL formula encodes both a descriptive statement and a hypothetical question about the scene. This is the key intuition behind ∂-FOL and the common formalism behind its methodology.

### 3.3. Inference

Given a natural language (binary) question Q and a corresponding FOL formula F_Q, the answer a_Q is the result of evaluating F_Q. We reformulate this probabilistically as

Pr(a_Q = yes | V) = Pr(F_Q | V) ≜ α(F_Q).    (2)

The naïve approach to calculating the probability in Eq. (2) requires evaluating every instantiation of F_Q, of which there are exponentially many. Instead, we propose a dynamic programming strategy based on the intermediate notion of attention, which casts inference as a multi-hop execution of a functional program in polynomial time. Assume F_Q is minimal and contains only the operators ¬ and ∧ (which are functionally complete). We begin by defining the concept of attention, which in ∂-FOL naturally arises by instantiating a variable in the formula to an object:

Definition 3.1. Given a FOL formula F over the variables X, Y, Z, ..., the attention on the object x_i w.r.t. F is:

α(F | x_i) ≜ Pr(F_{X=x_i} | V)    (3)
where F_{X=x_i} ≜ F(x_i, Y, Z, ...), ∀i ∈ [1..N]    (4)

Similarly, one can compute the joint attention α(F | x_i, y_j, ...) by fixing more than one variable to certain objects. For example, given the formula in Eq. (1), α(F | x_i) represents the probability that "The i-th object in the scene is a chair that is to the left of all other objects," and α(F | y_j) represents the probability that "The j-th object in the scene is to the right of a chair." Next, we define the attention vector on variable X w.r.t. formula F as α(F | X) = [α(F | x_i)]_{i=1}^{N}. In a similar way, we define the attention matrix on two variables X and Y w.r.t. formula F as α(F | X, Y) = [α(F | x_i, y_j)]_{i,j=1}^{N}.

Given these definitions, the following lemma gives us the first step toward calculating the likelihood in Eq. (2) from attention values in polynomial time:

Lemma 3.1. Let F be a FOL formula whose left-most variable X = LMV(F) appears with logical quantifier q ∈ {∀, ∃, ∄}. Then we have:

α(F) = Pr(F | V) = A_q(α(F | X))    (5)

where α(F | X) is the attention vector on X and A_q(·) is the quantifier-specific aggregation function defined as:

A_∀(a_1, ..., a_N) = ∏_{i=1}^{N} a_i    (6)
A_∃(a_1, ..., a_N) = 1 - ∏_{i=1}^{N} (1 - a_i)    (7)
A_∄(a_1, ..., a_N) = ∏_{i=1}^{N} (1 - a_i)    (8)

Furthermore, given two matrices A and B, we define the matrix Q-product C = A ⊗_q B w.r.t. the quantifier q as:

C_{i,j} = [A ⊗_q B]_{i,j} ≜ A_q(A_{i·} ⊙ B_{·j})    (9)

where A_{i·} and B_{·j} are respectively the i-th row of A and the j-th column of B, and ⊙ denotes the Hadamard (element-wise) product.
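The aggregation functions of Eqs. (6)-(8) and the Q-product of Eq. (9) are straightforward to implement over attention values in [0, 1]. The short NumPy sketch below illustrates them; the function names are ours, not the paper's code.

```python
import numpy as np

def A_forall(a):    # Eq. (6): all instantiations must hold
    return np.prod(a, axis=-1)

def A_exists(a):    # Eq. (7): at least one instantiation holds
    return 1.0 - np.prod(1.0 - a, axis=-1)

def A_none(a):      # Eq. (8): no instantiation holds
    return np.prod(1.0 - a, axis=-1)

AGG = {"forall": A_forall, "exists": A_exists, "not_exists": A_none}

def q_product(A, B, q):
    """Eq. (9): C[i, j] = A_q(A[i, :] * B[:, j]), i.e. the shared variable is
    aggregated away according to its quantifier q."""
    prods = A[:, None, :] * B.T[None, :, :]    # (n, m, k): Hadamard product of row i of A and column j of B
    return AGG[q](prods)                       # (n, m)

# Toy check: alpha(F) for F = "exists X: ..." is A_exists over the attention vector (Lemma 3.1).
alpha_F_given_X = np.array([0.1, 0.8, 0.05])
print(A_exists(alpha_F_given_X))               # 1 - 0.9*0.2*0.95 = 0.829
```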
In general, the Q-product can be used to aggregate attention tensors (multi-variate logical formulae) along a certain axis (a specific variable) according to that variable's quantifier.

Lemma 3.1 reduces the computation of the answer likelihood to computing the attention vector of the left-most variable w.r.t. F. The latter can be calculated recursively in polynomial time as described below.

Lemma 3.2 (Base Case). If F consists only of the literal ⊤, the attention vector α(F | X) is the all-ones vector 1.

Lemma 3.3 (Recursion Case). We have three cases:

(A) Negation Operator: if F(X, Y, Z, ...) = ¬G(X, Y, Z, ...), then we have:

α(F | X) = 1 - α(G | X) ≜ Neg(α(G | X))    (10)

(B) Filter Operator: if F(X, Y, Z, ...) = π(X) ∧ G(X, Y, Z, ...) where π(·) is a unary predicate, then:

α(F | X) = α(π | X) ⊙ α(G | X) ≜ Filter_π(α(G | X))    (11)

(C) Relate Operator: if F(X, Y, Z, ...) = ⋀_{π ∈ Π_XY} π(X, Y) ∧ G(Y, Z, ...) where Π_XY is the set of all binary predicates defined on variables X and Y in F, then we have:

α(F | X) = ⨀_{π ∈ Π_XY} [α(π | X, Y) ⊗_q α(G | Y)] ≜ Relate_{Π_XY, q}(α(G | Y))    (12)

where q is the quantifier of variable Y in F.

The attention vector α(π | X) and the attention matrix α(π | X, Y) in Eqs. (11) and (12), respectively, form the leaves of the recursion tree and contain the probabilities of the atomic predicate π holding for specific object instantiations. These probabilities are directly calculated by the visual oracle O. In particular, we propose:

α(π | x_i) = M_f(v_i, π | V), π ∈ Π_u    (13)
α(π | x_i, y_j) = M_r(v_i, v_j, π | V), π ∈ Π_b    (14)

where Π_u and Π_b denote the sets of all unary and binary predicates in the model's concept dictionary.

The recursion steps in Lemma 3.3 can be seen as operators that, given an input attention vector, produce an output attention vector. In fact, Eq. (11) and Eq. (12) are respectively the DFOL embodiments of the abstract Filter and Relate operations widely used in multi-hop VQA models. In other words, by abstracting the recursion steps in Lemma 3.3 into operators, we turn a descriptive FOL formula into an executable program which can be evaluated to produce the probability distribution of the answer. For example, by applying the steps in Lemmas 3.1-3.3 to Eq. (1), we get the following program to calculate its likelihood:

α(F) = A_∃(Filter_Chair(Relate_{{Left}, ∀}(1)))    (15)

Algorithm 1 presents the final operationalization of question answering as inference over formulae in DFOL:

    Algorithm 1: Question answering in DFOL.
      Input: question F_Q (binary or open), threshold θ
      if F_Q is a binary question then
        return α(F_Q) > θ
      else
        let {a_1, ..., a_k} be the plausible answers for F_Q
        return argmax_{1 ≤ i ≤ k} α(F_{Q, a_i})

For open questions such as "What is the color of the chair to the left of all objects?", it translates them into a set of binary questions over the plausible set of answer options (e.g. all color names), which can be predefined or learned.
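Putting Lemmas 3.2-3.3 together, the program of Eq. (15) reduces to a few array operations. The sketch below is a simplified illustration on a three-object scene with made-up oracle outputs; it assumes a single relation predicate per Relate step and is not the framework's actual DSL executor.

```python
import numpy as np

# Operators from Lemma 3.3, acting on attention vectors of length N.
def Neg(att_G):
    return 1.0 - att_G                                    # Eq. (10)

def Filter(alpha_pi_X, att_G):
    return alpha_pi_X * att_G                             # Eq. (11), Hadamard product

def Relate(alpha_pi_XY, att_G_Y, q):
    # Eq. (12) with one relation predicate: aggregate over Y with quantifier q.
    prods = alpha_pi_XY * att_G_Y[None, :]                # (N, N)
    if q == "forall":
        return np.prod(prods, axis=1)
    if q == "exists":
        return 1.0 - np.prod(1.0 - prods, axis=1)
    raise ValueError(q)

def A_exists(att):
    return 1.0 - np.prod(1.0 - att)

# Program for Eq. (1)/(15): alpha(F) = A_exists(Filter_Chair(Relate_{Left, forall}(1)))
alpha_chair = np.array([0.9, 0.1, 0.2])                   # hypothetical oracle outputs alpha(Chair | x_i)
alpha_left = np.array([[1.0, 0.8, 0.9],                   # hypothetical alpha(Left | x_i, y_j);
                       [0.1, 1.0, 0.3],                   # the self-pair is set to 1.0 for simplicity
                       [0.2, 0.6, 1.0]])

att = np.ones(3)                                          # base case: alpha(True | Y) = 1 (Lemma 3.2)
att = Relate(alpha_left, att, q="forall")                 # "x_i is to the left of all objects"
att = Filter(alpha_chair, att)                            # "... and x_i is a chair"
print(round(float(A_exists(att)), 3))                     # ~0.66: the scene likely contains such a chair
```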
## 4. VQA Reasoning Evaluation Score

In this section, we describe our methodology for VQA reasoning evaluation. Given a VQA model M over the visual featurization V, our goal is to study and measure: (Q1) how informative the visual featurization V is on its own to accomplish a certain visual reasoning task, and (Q2) how much the reasoning capabilities of the model M can compensate for imperfections in perception to accomplish that task. To this end, we use the GQA dataset (Hudson & Manning, 2019b) of multi-step functional visual questions.

The GQA dataset consists of 22M questions defined over 130K real-life images. Each image in the Train/Validation splits is accompanied by a scene graph annotation, and each question in the Train/Validation/Test-Dev splits comes with its equivalent program. We translate the programs in GQA into a domain-specific language (DSL) built on top of the four basic operators Filter, Relate, Neg and A_q introduced in the previous section. The DSL covers 98% of the questions in GQA. See the Supplementary Material for its definition.

The DFOL formalism allows us to establish an upper bound on the reasoning accuracy of a neuro-symbolic VQA model when the information in its visual featurization is perfect. To measure it, we build a golden visual oracle from the information in the ground-truth GQA scene graphs. The parameter-less ∂-FOL inference from Section 3 achieves 96% accuracy on the GQA validation split using this golden oracle and the golden programs. We manually inspected the remaining 4% and found that almost all involved errors in the scene graph or the golden program. This result not only verifies the soundness of ∂-FOL as a probabilistic relaxation of the GQA DSL, but also establishes that question understanding alone does not constitute the source of complexity in compositional question answering on GQA. In other words, the main contributing factor to the performance of GQA models is the representation learning in their underlying perception systems. However, even with imperfect perception, many models successfully recover the right answer using language priors, real-world biases, and other non-trivial learned visual reasoning. Using ∂-FOL, we present a metric to quantify this phenomenon.

Reasoning with Imperfect Perception: Let V be a fixed scene featurization, often produced by e.g. a pre-trained Faster-RCNN model. Let Q be a GQA question and F_Q be its corresponding DFOL formula. The VQA reasoning evaluation is based on two key observations:

1. If the probabilistic inference over F_Q produces a wrong answer, the featurization V does not contain enough information to correctly classify all attributes, classes, and relations involved in the evaluation of F_Q.
2. If V is informative enough to enable correct probabilistic inference over F_Q, then Q is an easy question: the right answer is credited to perception alone.

Let the base model M∅ be an evaluation of Algorithm 1 given some visual oracle O trained and run over the features V. Note that the inference process of M∅ described in Section 3 involves no trainable parameters. Thus, its accuracy stems entirely from the accuracy of O on the attributes/relations involved in any given question.³ Assuming a commonly reasonable architecture for the oracle O (e.g. a deep feed-forward network over V followed by a sigmoid activation) trained end-to-end with backpropagation from the final answer through M∅, the accuracy of M∅ thus indirectly captures the amount of information in V directly involved in the inference of a given question, i.e. Q1 above.

³ This is not the same as the classification accuracy of O in general, because only a small fraction of the objects and attributes in the scene are typically involved in any given question.

With this in mind, we arrive at the following procedure for quantifying the extent of reasoning of a VQA model M:

1. Fix an architecture for O as described above. We propose a standard one in our experiments in Section 6.
2. Train the oracle O on the Train split of GQA using backpropagation through M∅ from the final answer.
3. Let T be a test set for GQA. Evaluate M∅ on T using the trained oracle O and the ground-truth GQA programs.
4. Let T_e and T_h be, respectively, the sets of questions that M∅ answers successfully and unsuccessfully (i.e. T_e ∪ T_h = T).
5. Measure the accuracy of M on T_h.
6. Measure the error of M on T_e.

The easy set T_e and hard set T_h define, respectively, GQA instances where the visual featurization alone is sufficient or insufficient to arrive at the answer. By measuring a model's accuracy on the hard set (or its error on the easy set), we determine the extent to which it uses the information in the featurization V to answer a hard question (or, respectively, fails to do so on an easily solvable question), i.e. Q2 above. Importantly, M need not be a DFOL-based model, or even a neuro-symbolic model, or even based on any notion of a visual oracle; we only require it to take as input the same visual features V. Thus, its accuracy on T_h or error on T_e is entirely attributable to its internal interaction between the vision and language modalities. Furthermore, we can meaningfully compare M's reasoning score to that of any other VQA model M′ that is based on the same featurization. (Although the comparison is not always fair, as the models may differ in e.g. their pre-training data, it is still meaningful.)
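The procedure above boils down to a partition of the test set followed by two measurements. The following sketch, using hypothetical prediction dictionaries, illustrates how T_e, T_h, Acc_h and Err_e could be computed for any candidate model M; the function and variable names are ours.

```python
def reasoning_evaluation(base_preds, model_preds, gold):
    """Sketch of Section 4's disentangled scores.
    base_preds : answers from the parameter-less base model (oracle + FOL only)
    model_preds: answers from the VQA model M under evaluation
    gold       : ground-truth answers
    All three are dicts keyed by question id."""
    easy = {q for q in gold if base_preds[q] == gold[q]}    # T_e: perception alone suffices
    hard = {q for q in gold if q not in easy}               # T_h: featurization is insufficient

    acc_h = sum(model_preds[q] == gold[q] for q in hard) / max(len(hard), 1)
    err_e = sum(model_preds[q] != gold[q] for q in easy) / max(len(easy), 1)
    return acc_h, err_e

# Toy usage with three questions.
gold        = {"q1": "yes", "q2": "red", "q3": "no"}
base_preds  = {"q1": "yes", "q2": "blue", "q3": "yes"}      # the base model solves only q1
model_preds = {"q1": "yes", "q2": "red", "q3": "yes"}
print(reasoning_evaluation(base_preds, model_preds, gold))  # (0.5, 0.0)
```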
## 5. Top-Down Contextual Calibration

We now present top-down contextual calibration as one way of augmenting logical reasoning to compensate for imperfect perception. Note that the FOL reasoning is a bottom-up process in the sense that every time the oracle is queried, it does not take into consideration the broader context of the question. Nevertheless, considering additional information such as the question context can be useful, especially when the visual perception is imperfect.

Figure 2. (Left) The architecture of the top-down neural attention calibrator. (Right) Four examples of the calibration function (Eq. (16)) shape, determining whether to sharpen (a=b=3, c=0.5, d=1), blur (a=b=0.3, c=0.5, d=1), suppress (a=15, b=0.3, c=0.5, d=1) or excite (a=0.3, b=15, c=0.5, d=1) the attention values depending on the parameter values a, b, c and d.

Every formula F defines two conditional likelihoods on the attention values α(F | x) over the population of all images in the dataset: P⁺_F(α) ≜ Pr(α(F | x) | F = ⊤) and P⁻_F(α) ≜ Pr(α(F | x) | F = ⊥). In general, the bottom-up process assumes these two distributions are well separated at the extremes for every F. However, due to the imperfection of O, that is not the case in practice. The Bayesian way to address this issue is to estimate these likelihoods and use the posterior α′(F | x) ≜ Pr(F | α(F | x)) instead of α(F | x). This is the classical notion of calibration in binary classification (Platt, 2000). In our framework, we have developed a neural version of Beta Calibration (Kull et al., 2017) to calculate the above posterior. In particular, we assume the likelihoods P⁺_F(α) and P⁻_F(α) can be modeled as two Beta distributions with parameters [a⁺, b⁺] and [a⁻, b⁻], respectively.
Then, the posterior becomes α′(F | x) = C(α(F | x)), where

C(α) = c α^a / (c α^a + d (1 - c)(1 - α)^b)    (16)

is called the calibration function. Here a = a⁺ - a⁻, b = b⁻ - b⁺, and c = Pr(F = ⊤) is the prior. Furthermore, d = B(a⁺, b⁺)/B(a⁻, b⁻), where B(·, ·) is the Beta function. By a_F^(i), b_F^(i), c_F^(i), d_F^(i) we denote the parameters of the calibration function that are applied after the i-th operator of F during the attention calculation. Instead of estimating these parameters for each possible F and i, we amortize the computation by modeling them as a function of the question context using a Bi-LSTM (Huang et al., 2015):

[a_F^(i), b_F^(i), c_F^(i), d_F^(i)] = M_c(M_lstm^(i)(S_F; θ_lstm); θ_c)    (17)

where M_c is an MLP with parameters θ_c and M_lstm^(i) denotes the i-th state of a Bi-LSTM parametrized by θ_lstm. Here S_F denotes the context of the formula F, defined as the sequence of the predicates present in the program. For example, for the formula in Eq. (1), we have S_F = [Chair, Left]. The word embeddings of this context are then fed to the Bi-LSTM network as its input. Figure 2 (Left) shows our proposed top-down calibration mechanism and how it affects the DFOL reasoning process. To train the calibrator, we first train the Base model without the calibrator as before. We then freeze the weights of the visual oracle O in the Base model, add the calibrator on top, and run backpropagation again through the resulting architecture on the training data to tune the weights of the calibrator.

Note that for parameter values a = b = d = 1 and c = 0.5, the calibration function in Eq. (16) is just the identity function; that is, the calibration does nothing and the reasoning stays purely logical. As the parameters deviate from these values, the behavior of reasoning deviates from purely logical reasoning. Interestingly, depending on the values of its parameters, the behavior of the calibration function is quite often interpretable. In Figure 2 (Right), we show how the calibrator can, for example, sharpen, blur, suppress or excite visual attention values via the parameters of the calibration function. This behavior is context-dependent and learned by the calibrator from data. For example, if the model sees the phrase "broken wooden chair" enough times but the visual featurization is not informative enough to always detect "broken" in the image, the calibrator may decide to excite the visual attention values upon seeing that phrase, so it can make up for the imperfection of the visual system and still answer the question correctly.

It is important to note that even though the calibrator tries to pick up informative signals from the language priors, it does not simply replace the visual attention values with them. Instead, it modulates the visual attention via the language priors. For example, if the attention values upon seeing "broken wooden chair" are close to zero for an image (indicating that the phrase cannot really be grounded in that image), then the calibration function will not raise the attention values significantly, as shown in Figure 2 (Right), even though the calibrator has learned to "excite" visual attentions for that phrase. This soft-thresholding behavior of C(·) is entirely learned from data.
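For concreteness, the calibration function of Eq. (16) can be written in a few lines. The sketch below checks the identity behavior at a = b = d = 1, c = 0.5 and evaluates the sharpening and excitation regimes from Figure 2 (parameter values taken from the figure; variable names are ours).

```python
import numpy as np

def calibrate(alpha, a, b, c, d):
    """Calibration function of Eq. (16):
    C(alpha) = c*alpha^a / (c*alpha^a + d*(1-c)*(1-alpha)^b)."""
    num = c * alpha ** a
    return num / (num + d * (1.0 - c) * (1.0 - alpha) ** b)

alpha = np.linspace(0.01, 0.99, 5)
print(np.allclose(calibrate(alpha, 1, 1, 0.5, 1), alpha))   # True: identity, purely logical reasoning
print(calibrate(np.array([0.3, 0.7]), 3, 3, 0.5, 1))        # sharpening regime (a=b=3, c=0.5, d=1)
print(calibrate(np.array([0.3, 0.7]), 0.3, 15, 0.5, 1))     # excitation regime (a=0.3, b=15, c=0.5, d=1)
```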
Finally, we note that modulating the visual attentions via the question context is only one way of filling in the holes of perception. Other informative signals, such as the visual context and the feature-level, cross-modal interaction of language and vision, can be exploited to improve the accuracy of ∂-FOL even further.

## 6. Experiments

In this section, we experimentally demonstrate how our framework can be used to evaluate the visual and the reasoning aspects of VQA in a decoupled manner. To this end, we have performed experiments using our framework and candidate VQA models on the GQA dataset.

The visual oracle: For the experiments in this section, we have chosen a feed-forward architecture with 3 hidden layers and an output embedding layer that covers all the concepts in the GQA vocabulary. The weights of the embedding layer are initialized using GloVe (Pennington et al., 2014).

The visual featurization: We use the standard Faster-RCNN object featurization released with the GQA dataset. The feature vectors are further augmented with the bounding-box positional features of each detected object. For binary relations, we simply concatenate the feature vectors of the two objects involved after a linear projection. For the sake of meaningful comparison in this section, we have made sure all the participating models use the same Faster-RCNN object featurization.

Training setup: For training all ∂-FOL models, we have used the Adam optimizer with learning rate 10⁻⁴ and weight decay 10⁻¹⁰. The dropout ratio is set to 0.1. We have also applied gradient clipping with norm 0.65. For better convergence, we have implemented a curriculum training scheme where we start training with short programs and over time add longer programs to the training data.

Evaluation metrics: In addition to accuracy, we have also computed the consistency metric as defined by the GQA Challenge (Hudson & Manning, 2019b).
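As a rough illustration of the setup described above (not the released configuration), the sketch below wires a 3-hidden-layer oracle with a GloVe-initialized concept-embedding output layer to the Adam optimizer and gradient clipping just mentioned. The vocabulary size, feature dimensions, and the stand-in loss are placeholders.

```python
import torch
import torch.nn as nn

class OracleSketch(nn.Module):
    """Rough sketch of Section 6's oracle; sizes and the GloVe tensor are placeholders."""

    def __init__(self, glove_vocab_vectors, feat_dim=2048 + 4, hidden=512, p_drop=0.1):
        super().__init__()
        n_concepts, emb_dim = glove_vocab_vectors.shape
        self.trunk = nn.Sequential(                       # 3 hidden layers over object features
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, emb_dim),
        )
        # Output embedding layer over the concept vocabulary, initialized with GloVe vectors.
        self.concepts = nn.Parameter(glove_vocab_vectors.clone())

    def forward(self, obj_features):
        # Probability of every concept holding for each object.
        return torch.sigmoid(self.trunk(obj_features) @ self.concepts.t())

glove = torch.randn(1000, 300)                            # placeholder for GloVe-initialized concept vectors
oracle = OracleSketch(glove)
opt = torch.optim.Adam(oracle.parameters(), lr=1e-4, weight_decay=1e-10)
probs = oracle(torch.randn(5, 2052))                      # 5 objects: RCNN features + box positions (assumed 4-dim)
loss = probs.mean()                                       # stand-in for the answer loss backpropagated through DFOL
loss.backward()
torch.nn.utils.clip_grad_norm_(oracle.parameters(), 0.65)
opt.step()
```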
### 6.1. How Informative is the GQA Visual Featurization?

Using the settings above, we have trained the Base model M∅. Table 1 shows the accuracy and the consistency of this model evaluated on the (balanced) Test-Dev split. Since we wish to use the Base model to isolate only the visual informativeness of the data, we have used the golden programs (provided in GQA) for calculating the metrics in this experiment.

Table 1. The Test-Dev metrics for the Base model. 51.86% of questions are answerable via pure FOL over Faster-RCNN features.

| Split  | Accuracy | Consistency |
|--------|----------|-------------|
| Open   | 42.73%   | 88.74%      |
| Binary | 65.08%   | 86.65%      |
| All    | 51.86%   | 88.35%      |

Based on these results, the Faster-RCNN featurization is informative enough on its own to produce correct answers for 51.86% of the instances in the set, without requiring any extra reasoning capabilities beyond FOL; whereas for 48.14% of the questions, the visual signal in the featurization is not informative enough to accomplish the GQA task. Another interesting data point is that for about 2/3 of the binary questions the visual features are informative enough for question answering without needing any fancy reasoning model in place, which in turn can explain why many early classifier-based models for VQA work reasonably well on binary questions.

### 6.2. Evaluating the Reasoning Capabilities of Models

The Base model M∅ from the previous section can be further used to divide the test data into the hard and easy sets as described in Section 4 (i.e. T_h and T_e). In this section, we use these sets to measure the reasoning power of candidate VQA models by calculating the metrics Acc_h and Err_e as well as the consistency of each model. See the Supplementary Material for examples of challenging instances from T_h and deceptively simple instances from T_e.

For the comparison, we have picked two well-known representatives in the literature for which the code and checkpoints were open-sourced. The first is the MAC network (Hudson & Manning, 2018), which belongs to the family of multi-hop, compositional neuro-symbolic models (Andreas et al., 2016; Hudson & Manning, 2019a; Vedantam et al., 2019). The second is the LXMERT network (Tan & Bansal, 2019), which belongs to the family of Transformer-based vision-language models (Li et al., 2019; Lu et al., 2019). Both models consume the Faster-RCNN object featurization as their visual input and have been trained on GQA. Table 2 reports the statistics obtained by evaluating the two candidate models on the balanced Test-Dev split and its hard and easy subsets according to the Base model.

Table 2. The test metrics for MAC and LXMERT over balanced Test-Dev and its hard and easy subsets according to the Base model.

| Model  | Split  | Test-Dev Accuracy | Test-Dev Consistency | Hard Acc_h | Hard Consistency | Easy Err_e | Easy Consistency |
|--------|--------|-------------------|----------------------|------------|------------------|------------|------------------|
| MAC    | Open   | 41.66% | 82.28% | 18.12% | 74.87% | 26.70% | 84.54% |
| MAC    | Binary | 71.70% | 70.69% | 58.77% | 66.51% | 21.36% | 75.37% |
| MAC    | All    | 55.37% | 79.13% | 30.54% | 71.04% | 23.70% | 82.83% |
| LXMERT | Open   | 47.02% | 86.93% | 25.27% | 85.21% | 22.92% | 87.75% |
| LXMERT | Binary | 77.63% | 77.48% | 63.02% | 73.58% | 13.93% | 81.63% |
| LXMERT | All    | 61.07% | 84.48% | 38.43% | 81.05% | 17.87% | 86.52% |

Table 3. The test metrics for ∂-FOL and Calibrated ∂-FOL over balanced Test-Dev and its hard and easy subsets.

| Model            | Split  | Test-Dev Accuracy | Test-Dev Consistency | Hard Acc_h | Hard Consistency | Easy Err_e | Easy Consistency |
|------------------|--------|-------------------|----------------------|------------|------------------|------------|------------------|
| ∂-FOL            | Open   | 41.22% | 87.63% | 0.53%  | 11.46% | 2.53% | 90.70% |
| ∂-FOL            | Binary | 64.65% | 85.54% | 4.42%  | 61.11% | 2.21% | 86.33% |
| ∂-FOL            | All    | 51.45% | 87.22% | 1.81%  | 19.44% | 2.39% | 89.90% |
| Calibrated ∂-FOL | Open   | 41.22% | 86.37% | 0.53%  | 11.46% | 2.53% | 89.45% |
| Calibrated ∂-FOL | Binary | 71.99% | 79.28% | 37.82% | 70.90% | 9.20% | 84.45% |
| Calibrated ∂-FOL | All    | 54.76% | 84.48% | 12.91% | 57.72% | 6.32% | 88.51% |

From these results, it is clear that LXMERT is significantly superior to MAC on the original balanced Test-Dev set. Moreover, comparing the Acc_h values of the two models shows that the reasoning capability of LXMERT is significantly more effective than that of MAC when it comes to visually vague examples. This can be attributed to the fact that LXMERT, like many other models of its family, is massively pre-trained on large volumes of vision-language bi-modal data before being fine-tuned for the GQA task. This pre-trained knowledge comes to the aid of the reasoning process when there are holes in the visual perception.

Another interesting observation is the comparison between the accuracy gap (i.e. (1 - Err_e) - Acc_h) and the consistency gap between the hard and easy subsets for each model/split row in the table. While the accuracy gap is quite large between the two subsets (as expected), the consistency gap is much smaller (yet significant) in comparison. This shows that the notion of visual hardness (or easiness) captured by the Base model partitioning is in fact consistent; in other words, even when VQA models struggle in the face of visually hard examples in the hard set, their struggle is consistent across all logically related questions (i.e. the high hard-set consistency values in the table), which indicates that the captured notion of visual hardness is indeed meaningful.
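As a small worked example of these gaps, the snippet below computes them from the "All" rows of Table 2.

```python
# Accuracy gap between easy and hard subsets: (1 - Err_e) - Acc_h, using Table 2's "All" rows.
for model, acc_h, err_e, cons_h, cons_e in [("MAC", 0.3054, 0.2370, 0.7104, 0.8283),
                                            ("LXMERT", 0.3843, 0.1787, 0.8105, 0.8652)]:
    accuracy_gap = (1.0 - err_e) - acc_h
    consistency_gap = cons_e - cons_h
    print(f"{model}: accuracy gap = {accuracy_gap:.2%}, consistency gap = {consistency_gap:.2%}")
# MAC:    accuracy gap ~ 45.76%, consistency gap ~ 11.79%
# LXMERT: accuracy gap ~ 43.70%, consistency gap ~  5.47%
```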
Furthermore, one may notice the smaller consistency gap of LXMERT compared to that of the MAC network, suggesting that the consistent behavior of MAC is more sensitive to the hardness level of perception than that of LXMERT.

### 6.3. The Effect of Top-Down Contextual Calibration

Table 3 shows the result of applying the calibration technique from Section 5. Since we are using ∂-FOL as an actual VQA model in this experiment, we have trained a simple sequence-to-sequence semantic parser to convert the natural language questions in the test set to programs. As shown in Table 3, the top-down calibration significantly improves the accuracy over ∂-FOL. This improvement is even more significant when we look at the results on the hard set, confirming that exploiting even the simplest form of bi-modal interaction (in this case, the program context interacting with the visual attentions) can significantly improve the performance of reasoning in the face of imperfect perception. Nevertheless, this gain comes at a cost. Firstly, the consistency of the model over the entire set degrades. This is, however, to be expected; after all, we are moving from pure logical reasoning to something that is not always "logical". Secondly, by looking at the Err_e values, we observe that the calibrated model starts making significant mistakes on cases that are actually visually informative. This reveals one of the important dangers that VQA models might fall for once they start deviating from objective logical reasoning to attain better overall accuracy.

## 7. Conclusion

The neuro-symbolic ∂-FOL framework, based on a differentiable first-order logic defined over the VQA task, allows us to isolate and assess the reasoning capabilities of VQA models. Specifically, it identifies questions from the GQA dataset where the contemporary Faster-RCNN perception pipeline by itself produces imperfect representations that do not contain enough information to answer the question via straightforward sequential processing. Studying these questions on the one hand motivates endeavors for improvement on the visual perception front, and on the other hand provides insights into the reasoning capabilities of state-of-the-art VQA models in the face of imperfect perception, as well as the sensitivity of their consistent behavior to it. Furthermore, the accuracy and consistency on visually imperfect instances is a more accurate assessment of a model's VR ability than dataset performance alone. In conclusion, we believe that the methodology of vision-reasoning disentanglement, realized in ∂-FOL, provides an excellent tool to measure progress toward VR, and some form of it should ideally be adopted by VR leaderboards.

## Acknowledgement

We would like to thank Pengchuan Zhang for insightful discussions and Drew Hudson for helpful input during her visit at Microsoft Research. We also thank the anonymous reviewers for their invaluable feedback.

## References

Agrawal, A., Batra, D., and Parikh, D. Analyzing the behavior of visual question answering models. arXiv preprint arXiv:1606.07356, 2016.
Agrawal, A., Batra, D., Parikh, D., and Kembhavi, A. Don't just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4971-4980, 2018.
Amizadeh, S., Matusevych, S., and Weimer, M. Learning to solve circuit-SAT: An unsupervised differentiable approach.
In International Conference on Learning Representations, 2018.
Amizadeh, S., Matusevych, S., and Weimer, M. PDP: A general neural framework for learning constraint satisfaction solvers. arXiv preprint arXiv:1903.01969, 2019.
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077-6086, 2018.
Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39-48, 2016.
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., and Parikh, D. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425-2433, 2015.
Cai, J., Shin, R., and Song, D. Making neural programming architectures generalize via recursion. arXiv preprint arXiv:1704.06611, 2017.
Chen, X., Liang, C., Yu, A. W., Zhou, D., Song, D., and Le, Q. Neural symbolic reader: Scalable integration of distributed and symbolic representations for reading comprehension. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ryxjnREFwH.
Garcez, A. d., Gori, M., Lamb, L. C., Serafini, L., Spranger, M., and Tran, S. N. Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning. arXiv preprint arXiv:1905.06088, 2019.
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904-6913, 2017.
Gulwani, S., Polozov, O., Singh, R., et al. Program synthesis. Foundations and Trends in Programming Languages, 4(1-2):1-119, 2017.
Huang, Z., Xu, W., and Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.
Hudson, D. A. and Manning, C. D. Learning by abstraction: The neural state machine. In Advances in Neural Information Processing Systems, pp. 5901-5914, 2019a.
Hudson, D. A. and Manning, C. D. Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067, 2018.
Hudson, D. A. and Manning, C. D. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6700-6709, 2019b.
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901-2910, 2017.
Kull, M., Silva Filho, T., and Flach, P. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Artificial Intelligence and Statistics, pp. 623-631, 2017.
Lee, K., Palangi, H., Chen, X., Hu, H., and Gao, J. Learning visual relation priors for image-text matching and image captioning with neural scene graph generators. arXiv preprint arXiv:1909.09953, 2019. URL http://arxiv.org/abs/1909.09953.
Li, G., Duan, N., Fang, Y., Gong, M., Jiang, D., and Zhou, M. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066, 2019.
Liang, C., Berant, J., Le, Q., Forbus, K. D., and Lao, N. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. arXiv preprint arXiv:1611.00020, 2016.
Liang, P. Learning executable semantic parsers for natural language understanding. Communications of the ACM, 59(9):68-76, 2016.
Lu, J., Batra, D., Parikh, D., and Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13-23, 2019.
Mao, J., Gan, C., Kohli, P., Tenenbaum, J. B., and Wu, J. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. arXiv preprint arXiv:1904.12584, 2019.
Neelakantan, A., Le, Q. V., and Sutskever, I. Neural programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834, 2015.
Neelakantan, A., Le, Q. V., Abadi, M., McCallum, A., and Amodei, D. Learning a natural language interface with neural programmer. arXiv preprint arXiv:1611.08945, 2016.
Parisotto, E., Mohamed, A.-r., Singh, R., Li, L., Zhou, D., and Kohli, P. Neuro-symbolic program synthesis. arXiv preprint arXiv:1611.01855, 2016.
Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.
Pierrot, T., Ligner, G., Reed, S. E., Sigaud, O., Perrin, N., Laterre, A., Kas, D., Beguir, K., and de Freitas, N. Learning compositional neural programs with recursive tree search and planning. In Advances in Neural Information Processing Systems, pp. 14646-14656, 2019.
Platt, J. Probabilities for SV machines. In Advances in Large Margin Classifiers, pp. 61-74, 2000.
Ray, A., Christie, G., Bansal, M., Batra, D., and Parikh, D. Question relevance in VQA: Identifying non-visual and false-premise questions. arXiv preprint arXiv:1606.06622, 2016.
Reed, S. and De Freitas, N. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279, 2015.
Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91-99, 2015.
Saha, A., Ansari, G. A., Laddha, A., Sankaranarayanan, K., and Chakrabarti, S. Complex program induction for querying knowledge bases in the absence of gold programs. Transactions of the Association for Computational Linguistics, 7:185-200, March 2019. doi: 10.1162/tacl_a_00262. URL https://www.aclweb.org/anthology/Q19-1012.
Selsam, D., Lamm, M., Bünz, B., Liang, P., de Moura, L., and Dill, D. L. Learning a SAT solver from single-bit supervision. arXiv preprint arXiv:1802.03685, 2018.
Selvaraju, R. R., Tendulkar, P., Parikh, D., Horvitz, E., Ribeiro, M., Nushi, B., and Kamar, E. SQuINTing at VQA models: Interrogating VQA models with sub-questions. arXiv preprint arXiv:2001.06927, 2020.
Serafini, L. and Garcez, A. d. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422, 2016.
Shi, J., Zhang, H., and Li, J. Explainable and explicit visual reasoning over scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8376-8384, 2019.
Socher, R., Chen, D., Manning, C. D., and Ng, A. Reasoning with neural tensor networks for knowledge base completion.
In Advances in Neural Information Processing Systems, pp. 926-934, 2013.
Tan, H. and Bansal, M. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.
van Krieken, E., Acar, E., and van Harmelen, F. Analyzing differentiable fuzzy logic operators. arXiv preprint arXiv:2002.06100, 2020.
Vedantam, R., Desai, K., Lee, S., Rohrbach, M., Batra, D., and Parikh, D. Probabilistic neural-symbolic models for interpretable visual question answering. arXiv preprint arXiv:1902.07864, 2019.
Xu, J., Zhang, Z., Friedman, T., Liang, Y., and Broeck, G. A semantic loss function for deep learning with symbolic knowledge. In International Conference on Machine Learning, pp. 5502-5511, 2018.
Yang, J., Lu, J., Lee, S., Batra, D., and Parikh, D. Graph R-CNN for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 670-685, 2018.
Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., and Tenenbaum, J. Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems, pp. 1031-1042, 2018.
Zellers, R., Yatskar, M., Thomson, S., and Choi, Y. Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5831-5840, 2018.
Zellers, R., Bisk, Y., Farhadi, A., and Choi, Y. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6720-6731, 2019.
Zhou, B., Tian, Y., Sukhbaatar, S., Szlam, A., and Fergus, R. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.
Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J. J., and Gao, J. Unified vision-language pre-training for image captioning and VQA. In AAAI, 2020.