Published as a conference paper at ICLR 2019

VISUAL REASONING BY PROGRESSIVE MODULE NETWORKS

Seung Wook Kim1,2, Makarand Tapaswi1,2, Sanja Fidler1,2,3
1Department of Computer Science, University of Toronto  2Vector Institute, Canada  3NVIDIA
{seung,makarand,fidler}@cs.toronto.edu

ABSTRACT

Humans learn to solve tasks of increasing complexity by building on top of previously acquired knowledge. Typically, there exists a natural progression in the tasks that we learn: most do not require completely independent solutions, but can be broken down into simpler subtasks. We propose to represent a solver for each task as a neural module that calls existing modules (solvers for simpler tasks) in a functional program-like manner. Lower modules are a black box to the calling module, and communicate only via a query and an output. Thus, a module for a new task learns to query existing modules and composes their outputs in order to produce its own output. Our model effectively combines previous skill-sets, does not suffer from forgetting, and is fully differentiable. We test our model in learning a set of visual reasoning tasks, and demonstrate improved performance on all tasks by learning progressively. By evaluating the reasoning process using human judges, we show that our model is more interpretable than an attention-based baseline.

1 INTRODUCTION

Humans acquire skills and knowledge in a curriculum by building on top of previously acquired knowledge. For example, in school we first learn simple mathematical operations such as addition and multiplication before moving on to solving equations. Similarly, the ability to answer complex visual questions often requires the skills to understand attributes such as color, recognize a variety of objects, and be able to spatially relate them. Just like humans, machines may also benefit from sequentially learning tasks of progressive complexity and composing knowledge along the way.

The process of training a machine learning model to solve multiple tasks, or multi-task learning (MTL), has been widely studied (Long et al., 2017; Ruder, 2017; Ruder et al., 2017; Rusu et al., 2016). The dominant approach is to have a model that shares parameters (e.g., bottom layers of a CNN) with individualized prediction heads (Caruana, 1993; Long et al., 2017). By sharing parameters, models are able to learn better task-agnostic data representations. However, the tasks remain disconnected, as their outputs are not combined to solve tasks of increasing complexity. It is desirable for one task to learn to process the predictions from other tasks, thereby reaping the benefits of MTL.

In this paper, we address the problem of MTL where tasks exhibit a natural progression in complexity. We propose Progressive Module Networks (PMN), a framework for multi-task learning that progressively designs modules on top of existing modules. Each module is a neural network that can query modules for lower-level tasks, which in turn may query modules for even simpler tasks. The modules communicate by learning to query other modules and process their outputs, while the internal module processes remain a black box. This is similar to a computer program that uses available libraries without having to know their internal operations. Parent modules can choose which lower-level modules they want to query via a soft gating mechanism.
Examining the queries, replies, and choices a parent module makes, we can understand the reasoning behind the module's output. PMN is related to but different from Andreas et al. (2016) and Hu et al. (2017). PMN's modules are task-level modules, and they are compositional, i.e., modules build on modules which build on modules. This allows efficient use of data by not needing to re-learn previously acquired knowledge. By learning selective information flow between modules, interpretability arises naturally.

Figure 1: An example computation graph for PMN with four tasks. Green rectangles denote terminal modules, and yellow rectangles denote compositional modules. Blue arrows and red arrows represent calling and receiving outputs from submodules, respectively. White numbered circles denote computation order. For convenience, assume task levels correspond to the subscripts. Calling M3 invokes a chain of calls (blue arrows) to lower level modules which stop at the terminal modules. (Ln denotes the ordered list of modules that Mn calls, e.g., L3 = [M0, M1, M2], L2 = [M0, M1], L1 = [M0].)

We demonstrate PMN in learning a set of visual reasoning tasks such as counting, captioning, and Visual Question Answering (VQA). PMN outperforms baselines without module composition on all tasks. We further analyze the interpretability of PMN's reasoning process with human judges.

2 RELATED WORK

Multi-task learning. The dominant approach to multi-task learning is to have a model that shares parameters in a soft (Duong et al., 2015; Yang & Hospedales, 2017) or hard way (Caruana, 1993). Soft sharing refers to each task having independent weights that are constrained to be similar (e.g., via ℓ2 regularization (Duong et al., 2015) or the trace norm (Yang & Hospedales, 2017)), while hard sharing typically means that all tasks share the base network but have independent layers on top (Kokkinos, 2017; Misra et al., 2016). While sharing parameters helps to compute a task-agnostic representation that is not overfit to a specific task, the tasks do not directly share information or help each other. Bilen & Vedaldi (2016) propose the Multinet architecture where tasks can interact with each other in addition to shared image features. Multinet solves one task at each time step and appends the encoded output of each task to the existing data representation. A similar idea, Progressive Neural Networks (PNNs) (Rusu et al., 2016), uses a new neural network for each task; PNNs are designed to prevent catastrophic forgetting and transfer knowledge from previous tasks by making lateral connections to representations of previously learned tasks. Recently, Wang et al. (2017) propose the VQA-Machine, which exploits a set of existing algorithms to solve the VQA problem. Zamir et al. (2018) learn a computational taxonomy for task transfer learning on several vision problems. The major differences from this work are PMN's compositional modular structure, its ability to directly query other modules, and the overall process of learning increasingly complex tasks.

Module networks. Pioneering work on modular structure, NMN (Andreas et al., 2016; Hu et al., 2017) addresses VQA where questions have a compositional structure. Given an inventory of small networks, or modules, NMN produces a layout for assembling the modules for any question. PMN is different from NMN as it can be easily extended to new tasks using its compositional structure.
It also treats each task as a module, which opens up promising ways to train models that can compartmentalize and perform multiple tasks as they progressively improve their abilities.

Visual question answering. VQA has seen great progress in recent years: improved multimodal pooling functions (Fukui et al., 2016; Kim et al., 2018), multi-hop attention (Yang et al., 2016), driving attention through both bottom-up and top-down schemes (Anderson et al., 2018), and modeling attention between words and image regions recurrently (Hudson & Manning, 2018) are some of the important advances. There are also attempts to automatically generate programs or sequences of modules that yield a list of interpretable steps (Hu et al., 2017; Johnson et al., 2017b) using policy gradient optimization. Our approach treats visual reasoning as a compositional multi-task problem, and shows that using sub-tasks compositionally can help improve performance and interpretability.

3 PROGRESSIVE MODULE NETWORKS

Most complex reasoning tasks can be broken down into a series of sequential reasoning steps. We hypothesize that there exists a hierarchy with regards to complexity and order of execution: high level tasks (e.g., counting) are more complex and benefit from leveraging outputs from lower level tasks (e.g., classification). For any task, Progressive Module Networks (PMN) learn a module that requests and uses outputs from lower modules to aid in solving the given task. This process is compositional, i.e., lower-level modules may call modules at an even lower level. Solving a task corresponds to executing a directed acyclic computation graph where each node represents a module (see Fig. 1). PMN has a plug-and-play architecture where modules can be replaced by their improved versions. This opens up promising ways to train intelligent robots that can compartmentalize and perform multiple tasks while progressively learning from experience and improving abilities. PMN also chooses which lower level modules to use through a soft-gating mechanism. A natural consequence of PMN's modularity and gating mechanism is interpretability. While we do not need to know the internal workings of modules, we can examine the queries and replies, along with the information about which modules were used, to reason about why the parent module produced a certain output.

Formally, given a task n at level i, the task module Mn can query other modules Mk at level j such that j < i. Each module is designed to solve a particular task (output its best prediction) given an input and an environment E. Note that E is accessible to every module and represents a broader set of sensory information available to the model. For example, E may contain visual information such as an image, and text in the form of words (i.e., a question). PMN has two types of modules: (i) terminal modules that execute the simplest tasks and do not require information from other modules (Sec. 3.1); and (ii) compositional modules that learn to efficiently communicate with and exploit lower-level modules to solve a task (Sec. 3.2). We describe the tasks studied in this paper in Sec. 3.3 and provide a detailed example of how PMN is implemented and executed for VQA (Sec. 3.4).

3.1 TERMINAL MODULES

Terminal modules are by definition at the lowest level 0. They are analogous to base cases in a recursive function. Given an input query q, a terminal module Mℓ generates an output o = Mℓ(q), where Mℓ is implemented with a neural network. A typical example of a terminal module is an object classifier that takes as input a visual descriptor q and predicts the object label o.
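As a concrete illustration, the following is a minimal PyTorch-style sketch of such a terminal module (an object classifier). The class and variable names are hypothetical; the dimensions (2048-d region features, a 300-d embedding, 1,600 object classes) follow the description in Appendix A.1 and Sec. 4.1, and the extra classification head used for training follows Appendix B.2.

```python
import torch
import torch.nn as nn

class TerminalObjectModule(nn.Module):
    """Sketch of a terminal module M_obj: maps a region descriptor q to an output o.

    The module output o is the penultimate (300-d) vector; a linear head on top is
    used only for supervised training with object labels. Illustrative sketch only.
    """
    def __init__(self, feat_dim=2048, hidden_dim=300, num_classes=1600):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh())
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, q):
        o = self.embed(q)            # module output passed to higher-level modules
        logits = self.classifier(o)  # used to compute the training loss
        return o, logits

# usage: o, logits = TerminalObjectModule()(torch.randn(1, 2048))
```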
3.2 COMPOSITIONAL MODULES

A compositional module Mn makes a sequence of calls to lower level modules, which in turn make calls to their children, in a manner similar to depth-first search (see Fig. 1). We denote the list of modules that Mn is allowed to call by Ln = [Mm, . . . , Ml]. Every module in Ln has a level lower than Mn. Since the lower modules need not be sufficient to fully solve the new task, we optionally include a terminal module δn that performs residual reasoning. Also, many tasks require an attention mechanism to focus on certain parts of the data. We denote by Ωn a terminal module that performs such soft-attention. δn and Ωn are optionally inserted into the list Ln and treated like any other module.

The compositional aspect of PMN means that modules in Ln can have their own hierarchy of calls. We make Ln an ordered list, where calls are made sequentially, starting with the first module in the list. This way, information produced by earlier modules can be used when generating the query for the next one. For example, if one module performs object detection, we may want to use its output (bounding box proposals) to query other modules such as an attribute classifier. In this work, the list Ln, and thus the levels of tasks, are determined by hand. Relaxing this and letting the model learn the task hierarchy itself is a challenging direction that we leave for future work. Also, notice that the number of back-and-forth communications increases exponentially if each module makes use of every lower-level module. Thus, in practice we restrict the list Ln to those lower-level modules that may intuitively be needed by the task. We emphasize that Mn still (softly) chooses between them, and thus the expert intervention only removes the lower-level modules that are uninformative for the task.

Our compositional module Mn runs a (pre-determined) number Tn of passes over the list Ln. It keeps track of a state variable st at each time step t ≤ Tn. This state contains useful information obtained by querying other modules; for example, st can be the hidden state of a Recurrent Neural Network. Each time step corresponds to executing every module in Ln and updating the state variable. We describe the module components below, and Algorithm 1 shows how the computation is performed. An example implementation of the components and a demonstration of how they are used is detailed in Sec. 3.4.

State initializer. Given a query (input) qn, the initial state s1 is produced using a state initializer In.

Importance function. For each module Mk (and δn, Ωn) in Ln, we compute an importance score g_n^k with Gn(st). The purpose of g_n^k is to enable Mn to (softly) choose which modules to use. This also enables training all module components with backpropagation. Notice that g_n^k is input dependent, and thus the module Mn can effectively control which lower-level module outputs to use in state st. Here, Gn can be implemented as an MLP followed by either a softmax over submodules, or a sigmoid that outputs a score for each submodule. However, note that the proposed setup can be modified to adopt a hard-gating mechanism using a threshold or sampling with reinforcement learning.
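To make the gating concrete, here is a minimal sketch of an importance function with a softmax gate and an optional hard threshold in the spirit of the variant mentioned above. The class name, dimensions, and the renormalization of thresholded weights are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImportanceFunction(nn.Module):
    """Sketch of G_n: maps the current state s_t to one score per submodule in L_n."""
    def __init__(self, state_dim=512, num_submodules=7):
        super().__init__()
        self.score = nn.Linear(state_dim, num_submodules)

    def forward(self, s_t, hard_threshold=None):
        g = self.score(s_t)                 # raw importance scores g_n^k
        weights = F.softmax(g, dim=-1)      # soft gate over submodules
        if hard_threshold is not None:      # optional hard-gating variant
            weights = (weights > hard_threshold).float() * weights
            weights = weights / weights.sum(dim=-1, keepdim=True).clamp(min=1e-8)
        return weights

# usage: weights = ImportanceFunction()(torch.randn(1, 512))
```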
Algorithm 1 Computation performed by our Progressive Module Network, for one module Mn
 1: function Mn(qn)                          ▷ the environment E and module list Ln are global variables
 2:   s1 = In(qn)                            ▷ initialize the state variable
 3:   for t = 1 to Tn do                     ▷ Tn is the maximum time step
 4:     V = []                               ▷ wipe out scratch pad V
 5:     g_n^1, . . . , g_n^|Ln| = Gn(st)     ▷ compute importance scores
 6:     for k = 1 to |Ln| do                 ▷ Ln is the sequence of lower modules [Mm, . . . , Ml]
 7:       qk = Q_n→k(st, V, Gn(st))          ▷ produce query for Mk
 8:       ok = Ln[k](qk)                     ▷ call the k-th module Mk = Ln[k], generate output
 9:       vk = R_k→n(st, ok)                 ▷ receive and project output
10:       V.append(vk)                       ▷ write vk to pad V
11:     st+1 = Un(st, V, E, Gn(st))          ▷ update module state
12:   on = Ψn(s1, . . . , sTn, qn, E)        ▷ produce the output
13:   return on

Query transmitter and receiver. A query for module Mk in Ln is produced using a query transmitter, as qk = Q_n→k(st, V, Gn(st)). The output ok = Mk(qk) received from Mk is modified using a receiver function, as vk = R_k→n(st, ok). One can think of these functions as translators of the inputs and outputs into the module's own "language". Note that each module has a scratch pad V to store the outputs it receives from its list of lower modules Ln, i.e., vk is stored in V. Q_n→k and R_k→n stand for the query transmitter from task n to task k and the receiver from task k to task n, respectively.

State update function. After every module in Ln is executed, module Mn updates its internal state using a state update function Un as st+1 = Un(st, V, E, Gn(st)). This completes one time step of the module's computation. Once the state is updated, the scratch pad V is wiped clean and is ready for new outputs. An example is a simple gated sum of all outputs, i.e., st+1 = Σ_k g_n^k vk.

Prediction function. After Tn steps, the final module output is produced using a prediction function Ψn as on = Ψn(s1, . . . , sTn, qn, E). Recall that E is the environment.

All module functions: state initializer I, importance function G, query transmitter Q, receiver R, state update function U, residual module δ, attention module Ω, and prediction function Ψ are implemented as neural networks or simple assignment functions (e.g., set qk = vl). Note that all variables (e.g., ok, qk, vk, st) are continuous vectors to allow learning with standard backpropagation. For example, the output of the relationship detection module, which predicts an object bounding box, is an N-dimensional softmaxed vector (assuming there are a total of N boxes or image regions in E).

Training. We train our modules sequentially, from low level to high level tasks, one at a time. The internal weights of the lower level modules are not updated, thus preserving their performance on the original task. The new module only learns to communicate with them via the query transmitter Q and receiver R. We do train the weights of δ and Ω. We train I, G, Q, R, U, and Ψ by allowing gradients to pass through the lower level modules. The loss function depends on the task n.
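In code, one pass of a compositional module from Algorithm 1 can be sketched as follows. This is an illustrative PyTorch-style rendering with hypothetical component objects (the per-task instantiations of I, G, Q, R, U, and Ψ are described in Sec. 3.4 and Appendix A); it is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CompositionalModule(nn.Module):
    """Generic M_n from Algorithm 1 (sketch). `submodules` is the ordered list L_n;
    `transmitters`/`receivers` play the roles of Q_{n->k} / R_{k->n}; `init`, `gate`,
    `update`, `predict` play the roles of I_n, G_n, U_n, and Psi_n."""
    def __init__(self, submodules, transmitters, receivers, init, gate, update, predict, T=2):
        super().__init__()
        self.submodules = nn.ModuleList(submodules)        # frozen lower-level modules
        self.transmitters = nn.ModuleList(transmitters)
        self.receivers = nn.ModuleList(receivers)
        self.init, self.gate, self.update, self.predict = init, gate, update, predict
        self.T = T

    def forward(self, q_n, env):
        s = self.init(q_n)                                 # s_1 = I_n(q_n)
        states = [s]
        for _ in range(self.T):
            V = []                                         # wipe the scratch pad
            g = self.gate(s)                               # importance scores g_n^k
            for k, child in enumerate(self.submodules):
                q_k = self.transmitters[k](s, V, g)        # Q_{n->k}(s_t, V, G_n(s_t))
                o_k = child(q_k)                           # call M_k; weights stay frozen
                V.append(self.receivers[k](s, o_k))        # R_{k->n}(s_t, o_k)
            s = self.update(s, V, env, g)                  # U_n(s_t, V, E, G_n(s_t))
            states.append(s)
        return self.predict(states, q_n, env)              # Psi_n(s_1..s_Tn, q_n, E)
```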
3.3 PROGRESSIVE MODULE NETWORKS FOR VISUAL REASONING

We present an example of how PMN can be adopted for several tasks related to visual reasoning. In particular, we consider six tasks: object classification, attribute classification, relationship detection, object counting, image captioning, and visual question answering. Our environment E consists of: (i) image regions: N image features X = [X1, . . . , XN], each Xi ∈ R^d, with corresponding bounding box coordinates b = [b1, . . . , bN] extracted from Faster R-CNN (Ren et al., 2015); and (ii) language: a vector representation of a sentence S (in our example, a question). S is computed through a Gated Recurrent Unit (Cho et al., 2014) by feeding in word embeddings [w1, . . . , wT] at each time step. Below, we discuss each task and the module designed to solve it. We provide the detailed implementation and execution process of the VQA module in Sec. 3.4. For the other modules, we present a brief overview of what each module does in this section. Further implementation details of all module architectures are in Appendix A.

Figure 2: Example of PMN's module execution trace on the VQA task (the depicted example asks "What is the bird standing on?"; Mcap produces "a bird sitting on top of a wooden bench" and the predicted answer is "bench"). Numbers in circles indicate the order of execution. Intensity of gray blocks represents the depth of module calls. All variables, including queries and outputs stored in V, are continuous vectors to allow learning with standard backpropagation (e.g., a caption is composed of a sequence of softmaxed W-dimensional vectors for vocabulary size W). For Mcap, words with higher intensity in red are deemed more relevant by R_cap→vqa. Top: high level view of the module execution process. Bottom right: computed importance scores and the populated scratch pad. Note that we perform the first softmax operation on (Ωvqa, Mrel) to obtain an attention map and the second on (Mobj, Matt, δvqa, Mcnt, Mcap) to obtain the answer. Bottom left: visualization of the query Mvqa sends to Mrel, and the received output.

Object and Attribute Classification (level 0). Object classification is concerned with naming the object that appears in an image region, while attribute classification predicts the object's attributes (e.g., color). As these two tasks are fairly simple (though not necessarily easy), we place Mobj and Matt as terminal modules at level 0. Mobj consists of an MLP that takes as input a visual descriptor for a bounding box bi, i.e., qobj = Xi, and produces oobj = Mobj(qobj), the penultimate vector prior to classification. The attribute module Matt has a similar structure. These are the only modules for which we do not use the actual output labels, as we empirically obtained better results for higher level tasks this way.

Image Captioning (level 1). In image captioning, one needs to produce a natural language description of the image. We design our module Mcap as a compositional module that uses information from Lcap = [Ωcap, Mobj, Matt, δcap]. We implement the state update function as a two-layer GRU network, with st corresponding to the hidden states.
Similar to Anderson et al. (2018), at each time step the attention module Ωcap attends over the image regions X using the hidden state of the first layer. The attention map m is added to the scratch pad V. The query transmitters produce a query (the image vector at the attended location) using m to obtain nouns (Mobj) and adjectives (Matt). The residual module δcap processes other image-related semantic information. The outputs from modules in Lcap are projected to a common vector space (same dimensions) by the receivers and stored in the scratch pad. Based on their importance scores, the gated sum of the outputs is used to update the state. The natural language sentence ocap is obtained by predicting a word at each time step using a fully connected layer on the hidden state of the second GRU layer.

Relationship Detection (level 1). In this task the model is expected to produce triplets in the form of subject - relationship - object (Lu et al., 2016). We re-purpose this task as one that involves finding the relevant item (region) in an image that is related to a given input through a given relationship. The input to the module is qrel = [bi, r], where bi is a one-hot encoding of the input box and r is a one-hot encoding of the relationship category (e.g., above, behind). The module produces orel = bout, corresponding to the box for the subject/object related to the input bi through r. We place Mrel on the first level as it may use object and attribute information that can be useful to infer relationships, i.e., Lrel = [Mobj, Matt, δrel]. We train the module using the cross-entropy loss.

Object Counting (level 2). Our next task is counting the number of objects in the image. Given a vector representation of a natural language question (e.g., "how many cats are in this image?"), the goal of this module is to produce a numerical count. The counting task is at a higher level since it may also require understanding relationships between objects. For example, "how many cats are on the blue chair?" requires counting cats on top of the blue chair. We thus place Mcnt on the second level and give it access to Lcnt = [Ωcnt, Mrel]. The attention module Ωcnt finds relevant objects by using the input question vector. Mcnt may also query Mrel if the question requires relational reasoning. To answer "how many cats are on the blue chair", we can expect the query transmitter Q_cnt→rel to produce a query qrel = [bi, r] for the relationship module Mrel that includes the chair bounding box bi and the relationship "on top of" r, so that Mrel outputs boxes that contain cats on the chair. Note that both Ωcnt and Mrel produce attention maps over the boxes. The state update function softly chooses a useful attention map by calculating a softmax over the importance scores of Ωcnt and Mrel. For the prediction function Ψcnt, we adopt the counting algorithm of Zhang et al. (2018), which builds a graph representation from attention maps to count objects. Mcnt returns ocnt, the count vector corresponding to a softmaxed one-hot encoding of the count (with maximum count Z).

Visual Question Answering (level 3). VQA is our final and most complex task. Given a vector representation of a natural language question, qvqa, the VQA module Mvqa uses Lvqa = [Ωvqa, Mrel, Mobj, Matt, δvqa, Mcnt, Mcap]. Similar to Mcnt, Mvqa makes use of Ωvqa and Mrel to get an attention map. The produced attention map is fed to the downstream modules [Mobj, Matt, δvqa] using the query transmitters. Mvqa also queries Mcnt, which produces a count vector.
For the last entry Mcap in Lvqa, the receiver attends over the words of the entire caption produced by Mcap to find relevant answers. The received outputs are used depending on the importance scores. Finally, Ψvqa produces an output vector based on qvqa and all states st.

3.4 EXAMPLE: Mvqa FOR VISUAL QUESTION ANSWERING

We give a detailed example of how PMN is implemented for the VQA task. The entire execution process is depicted in Fig. 2, and the general algorithm is given in Alg. 1. The input qvqa is a vector representing a natural language question (i.e., the sentence vector S ∈ E). The state variable st is represented by a tuple (q_vqa^t, k_(t-1)), where q_vqa^t represents the query to ask at time t and k_(t-1) represents the knowledge gathered at time t-1. The state initializer Ivqa is composed of a GRU with hidden state dimension 512. The first input to the GRU is qvqa, and Ivqa sets s1 = (q_vqa^1, 0), where q_vqa^1 is the first hidden state of the GRU and 0 is a zero vector (no knowledge at first). For each time step t (with Tvqa = 2), Mvqa performs the following seven operations:

(1) The importance function Gvqa is executed. It is implemented as a linear layer R^512 → R^7 (for the seven modules in Lvqa) that takes st, specifically q_vqa^t ∈ st, as input.
(2) Q_vqa→Ω passes q_vqa^t to the attention module Ωvqa, which attends over the image regions X with q_vqa^t as the key vector. Ωvqa is implemented as an MLP that computes a dot-product soft-attention similar to Yang et al. (2016). The returned attention map vΩ is added to the scratch pad V.
(3) Q_vqa→rel produces an input tuple [b, r] for Mrel. The input object box b is produced by an MLP that does soft attention over the image boxes, and the relationship category r is produced through an MLP with q_vqa^t as input. Mrel is called with [b, r] and the returned map vrel is added to V.
(4) Q_vqa→obj, Q_vqa→att, and Q_vqa→δ first compute a joint attention map m as the summation of (vΩ, vrel) weighted by the softmaxed importance scores of (Ωvqa, Mrel), and they pass the sum of visual features X weighted by m to the corresponding modules. δvqa is implemented as an MLP. The receivers project the outputs into 512 dimensional vectors vobj, vatt, and vδ through a sequence of linear layers, batch norm, and tanh() nonlinearities. They are added to V.
(5) Q_vqa→cnt passes q_vqa^t to Mcnt, which returns ocnt. R_cnt→vqa projects the count vector ocnt into a 512 dimensional vector vcnt through the same sequence of layers as above. vcnt is added to V.
(6) Mvqa calls Mcap, and R_cap→vqa receives the natural language caption of the image. It converts the words in the caption into vectors [w1, . . . , wT] through an embedding layer. The embedding layer is initialized with 300 dimensional GloVe vectors (Pennington et al., 2014) and fine-tuned. It performs a softmax attention operation over [w1, . . . , wT] through an MLP with q_vqa^t ∈ st as the key vector, resulting in word probabilities p1, . . . , pT. The sentence representation Σ_i p_i w_i is projected into a 512 dimensional vector vcap using the same sequence of layers as vcnt. vcap is added to V.
(7) The state update function Uvqa first performs a softmax operation over the importance scores of (Mobj, Matt, δvqa, Mcnt, Mcap). We define an intermediate knowledge vector k_t as the summation of (vobj, vatt, vδ, vcnt, vcap) weighted by the softmaxed importance scores. Uvqa passes k_t as input to the GRU initialized by Ivqa, and we obtain q_vqa^(t+1), the new hidden state of the GRU. The new state st+1 is set to (q_vqa^(t+1), k_t).

This process allows the GRU to compute new question and state vectors based on what has been asked and seen.
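As an illustration of step (6), the caption receiver can be sketched roughly as follows. Names and the exact fusion of the key with the word embeddings are assumptions made for illustration; the dimensions (300-d GloVe embeddings, 512-d output) follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionReceiver(nn.Module):
    """Sketch of R_cap->vqa: attend over caption word embeddings with the current
    query q_vqa^t as key, then project the attended summary into a 512-d vector v_cap."""
    def __init__(self, vocab_size=10000, emb_dim=300, query_dim=512, out_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # would be GloVe-initialized
        self.key_proj = nn.Linear(query_dim, emb_dim)
        self.score = nn.Linear(emb_dim, 1)
        self.project = nn.Sequential(
            nn.Linear(emb_dim, out_dim), nn.BatchNorm1d(out_dim), nn.Tanh())

    def forward(self, caption_word_ids, q_t):
        # caption_word_ids: (B, T) word indices; q_t: (B, query_dim)
        w = self.embed(caption_word_ids)                      # (B, T, emb_dim)
        joint = w * self.key_proj(q_t).unsqueeze(1)           # fuse key with each word
        p = F.softmax(self.score(joint).squeeze(-1), dim=-1)  # word probabilities p_i
        sent = (p.unsqueeze(-1) * w).sum(dim=1)               # sum_i p_i * w_i
        return self.project(sent)                             # v_cap: (B, 512)

# usage: v_cap = CaptionReceiver()(torch.randint(0, 10000, (2, 12)), torch.randn(2, 512))
```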
After Tvqa steps, the prediction function Ψvqa computes the final output based on the initial question vector qvqa and all knowledge vectors k_t ∈ st. Here, qvqa and k_t are fused with gated-tanh layers and fed through a final classification layer similar to Anderson et al. (2018), and the logits for all time steps are added. The resulting logit is the final output ovqa, which corresponds to an answer in the vocabulary of the VQA task.

Note that the exact form of each module can be different. While we leave a more general architecture across tasks as future work, we stress that one of PMN's strengths is that once a module is trained, it can be used as a black box by the higher-level modules. Details of the other modules' architectures are provided in Appendix A.

4 EXPERIMENTS

We present experiments demonstrating the impact of progressive learning of modules. We also analyze and evaluate the reasoning process of PMN, as it is naturally interpretable. We conduct experiments on three datasets (see Appendix B.1 for details): Visual Genome (VG) (Krishna et al., 2016), VQA 2.0 (Goyal et al., 2017), and MS-COCO (Lin et al., 2014). These datasets contain natural images and are thus more complex in visual appearance and language diversity than CLEVR (Johnson et al., 2017a), which contains synthetic scenes. Neural module networks (Andreas et al., 2016; Hu et al., 2017) show excellent performance on CLEVR, but their performance on natural images is well below the state of the art. For all datasets, we extract bounding boxes and their feature representations using a pretrained model from Anderson et al. (2018).

4.1 PROGRESSIVE LEARNING OF TASKS AND MODULES

Object and Attribute Classification. We train these modules with annotated bounding boxes from the VG dataset. We follow Anderson et al. (2018) and use the 1,600 and 400 most commonly occurring object and attribute classes, respectively. Each extracted box is associated with the ground truth label of the object with the greatest overlap, and is ignored if there are no ground truth boxes with IoU > 0.5. This way, each box is annotated with one object label and zero or more attribute labels. Mobj achieves 54.9% top-1 accuracy and 86.1% top-5 accuracy. We report mean average precision (mAP) for attribute classification, which is a multi-label classification problem. Matt achieves 0.14 mAP and 0.49 weighted mAP. mAP is defined as the mean over all classes, and weighted mAP is weighted by the number of instances for each class. As there are many redundant classes (e.g., car, cars, vehicle) and boxes have sparse attribute annotations, the accuracy may seem artificially low.

Image Captioning. We report results on MS-COCO for image captioning. We use the standard split from the 2014 captioning challenge to avoid data contamination with VQA 2.0 or VG. This split contains 30% less data than the split proposed by Karpathy & Fei-Fei (2015) that most current works adopt. We report performance using the CIDEr (Vedantam et al., 2015) metric. A baseline (non-compositional module) achieves a strong CIDEr score of 108. Using the object and attribute modules, we obtain 109 CIDEr. While this is not a large improvement, we suspect a reason for this is the limited vocabulary: the MS-COCO dataset has a fixed set of 80 object categories and does not benefit from using knowledge from modules that are trained on more diverse data.
We believe the benefits of PMN would be clearer on a diverse captioning dataset with many more object classes. Also, including high-level modules such as Mvqa would be an interesting direction for future work.

Table 1: Performance of Mrel.
Model  | Composition (BASE, OBJ, ATT) | Object Acc. (%) | Subject Acc. (%)
Mrel0  | ✓, -, -                      | 51.0            | 55.9
Mrel1  | ✓, Mobj, Matt                | 53.4            | 57.8

Table 2: Accuracy for Mcnt.
Model  | Composition (BASE, OBJ, ATT, REL) | Acc. (%)
Mcnt0  | ✓, -, -, -                        | 45.4
Mcnt1  | ✓, Mobj, Matt, -                  | 47.4
Mcnt2  | ✓, Mobj, Matt, Mrel1              | 50.0

Relationship Detection. We use the top 20 most commonly occurring relationship categories, each of which is defined by a set of words with similar meaning (e.g., in, inside, standing in). Relationship tuples in the form of subject - relationship - object are extracted from Visual Genome (Krishna et al., 2016; Lu et al., 2016). We train and validate the relationship detection module using 200K/38K train/val tuples that have both subject and object boxes overlapping with the ground truth boxes (IoU > 0.7). Table 1 shows the improvement in performance when using modules. Even though the accuracy is relatively low, the model's errors are qualitatively reasonable. This is partially attributed to there being multiple correct answers although there is only one ground truth answer.

Object Counting. We extract questions starting with "how many" from VQA 2.0, which results in a training set of 50K questions. We additionally create 89K synthetic questions based on the VG dataset by counting the object boxes and forming "how many" questions. This synthetic data helps to increase the accuracy by 1% for all module variants.
Since the number of questions that involve both relational reasoning and counting (e.g., "how many people are sitting on the sofa?", "how many plates on table?") is limited, we also sample relational synthetic questions from VG. These questions are used only to improve the parameters of the query transmitter Q_cnt→rel for the relationship module. Table 2 shows a large improvement (4.6%) of the compositional module over the non-modular baseline.

When training for the next task (VQA), unlike the other modules whose parameters are fixed, we fine-tune the counting module because it expects the same form of input: an embedding of a natural language question. The performance of the counting module depends crucially on the quality of the attention map over bounding boxes. By employing more questions from the whole VQA dataset, we obtain a better attention map, and the performance of the counting module increases from 50.0% (cf. Table 2) to 55.8% with fine-tuning (see Appendix B.2 for more details).

Table 3: Model ablation for VQA. We report mean ± std computed over three runs. The steady increase indicates that information from modules helps, and that PMN makes use of lower modules effectively. The base model Mvqa0 does not use any lower level modules other than the residual and attention modules.
Model  | Composition (BASE, OBJ, ATT, REL, CNT, CAP) | Accuracy (%)
Mvqa0  | ✓, -, -, -, -, -                            | 62.05 ± 0.11
Mvqa1  | ✓, Mobj, Matt, -, -, -                      | 63.38 ± 0.05
Mvqa2  | ✓, Mobj, Matt, Mrel1, -, -                  | 63.64 ± 0.07
Mvqa3  | ✓, Mobj, Matt, -, Mcnt1, -                  | 64.06 ± 0.05
Mvqa4  | ✓, Mobj, Matt, Mrel1, Mcnt2, -              | 64.36 ± 0.06
Mvqa5  | ✓, Mobj, Matt, Mrel1, Mcnt2, Mcap1          | 64.68 ± 0.04

Table 4: Comparing VQA accuracy of PMN with state-of-the-art models. Rows marked Ens denote ensemble models. test-dev is the development test set and test-std is the standard test set for VQA 2.0. Within each split, the columns are Yes/No / Number / Other / All.
Model                 | Ens | VQA 2.0 val                   | VQA 2.0 test-dev              | VQA 2.0 test-std
Andreas et al. (2016) |  -  | 73.38 / 33.23 / 39.93 / 51.62 | -                             | -
Yang et al. (2016)    |  -  | 68.89 / 34.55 / 43.80 / 52.20 | -                             | -
Teney et al. (2018)   |  -  | 80.07 / 42.87 / 55.81 / 63.15 | 81.82 / 44.21 / 56.05 / 65.32 | 82.20 / 43.90 / 56.26 / 65.67
Teney et al. (2018)   |  ✓  | -                             | 86.08 / 48.99 / 60.80 / 69.87 | 86.60 / 48.64 / 61.15 / 70.34
Yu et al. (2018)      |  -  | -                             | 84.27 / 49.56 / 59.89 / 68.76 | -
Yu et al. (2018)      |  ✓  | -                             | -                             | 86.65 / 51.13 / 61.75 / 70.92
Zhang et al. (2018)   |  -  | -     / 49.36 / -     / 65.42 | 83.14 / 51.62 / 58.97 / 68.09 | 83.56 / 51.39 / 59.11 / 68.41
Kim et al. (2018)*    |  -  | -     / -     / -     / 66.04 | 85.43 / 54.04 / 60.52 / 70.04 | 85.82 / 53.71 / 60.69 / 70.35
Kim et al. (2018)*    |  ✓  | -                             | 86.68 / 54.94 / 62.08 / 71.40 | 87.22 / 54.37 / 62.45 / 71.84
Jiang et al. (2018)*  |  ✓  | -                             | 87.82 / 51.54 / 63.41 / 72.12 | 87.82 / 51.59 / 63.43 / 72.25
baseline Mvqa0        |  -  | 80.28 / 43.06 / 53.21 / 62.05 | -                             | -
PMN Mvqa5             |  -  | 82.48 / 48.15 / 55.53 / 64.68 | 84.07 / 52.12 / 57.99 / 68.07 | -
PMN Mvqa5             |  ✓  | -                             | 85.74 / 54.39 / 60.60 / 70.25 | 86.34 / 54.26 / 60.80 / 70.68

Visual Question Answering. We present ablation studies on the val set of VQA 2.0 in Table 3. As can be seen, PMN strongly benefits from utilizing the different modules, achieving a performance improvement of 2.6% over the baseline. Note that all results here are without additional questions from the VG data. We also compare the performance of PMN on the VQA task with state-of-the-art models in Table 4. Models are trained on the train split for results on VQA val, while for test-dev and test-std, models are trained on both the train and val splits. Although we start with a much lower baseline performance of 62.05% on the val set (vs. 65.42% (Zhang et al., 2018), 63.15% (Teney et al., 2018), 66.04% (Kim et al., 2018)), PMN's performance is on par with these models. Note that entries with * are works parallel to ours. Also, as Jiang et al. (2018) showed, performance depends strongly on engineering choices such as learning rate scheduling and ensembling models with different architectures.

Plug-and-play architecture. The query-answer communication within PMN results in a plug-and-play architecture where modules can be replaced by their improved versions. Instead of using generated captions from Mcap, when we feed ground-truth captions to the trained Mvqa5 in Table 3 (with 64.68% accuracy), it achieves 65.43%. We also tried training and validating Mvqa5 with ground-truth captions, which achieved 67.84%. These results show how PMN can be continually improved.

Three additional experiments on VQA. (1) To verify that the gain is not from increased model capacity, we trained a baseline with the number of parameters approximately matching that of the full PMN model. This baseline with more capacity also achieves 62.0%, confirming our claim. (2) We evaluated the impact of the additional data available. We convert the subject-relationship-object triplets used for the relationship detection task into additional QAs (e.g., Q: "what is on top of the desk?", A: "laptop") and train the Mvqa1 model (Table 3). This results in an accuracy of 63.05%, not only lower than Mvqa2 (63.64%), which uses the relationship module via PMN, but also lower than Mvqa1 at 63.38%. This suggests that while additional data may change the question distribution and reduce performance, PMN is robust and benefits from a separate relationship module. (3) Lastly, we conducted another experiment to show that PMN does make efficient use of the lower level modules. We give equal importance scores to all modules in the Mvqa5 model (Table 3) (thus, a fixed computation path), achieving 63.65% accuracy.
While this is higher than the baseline at 62.05%, it is lower than Mvqa5 at 64.68%, which softly chooses which sub-modules to use.

4.2 INTERPRETABILITY ANALYSIS

Visualizing the model's reasoning process. We present a qualitative analysis of the answering process. In Fig. 2, Mvqa makes the query qrel = [bi, r] to Mrel, where bi corresponds to the blue box ("bird") and r corresponds to the "on top of" relationship. Mvqa correctly chooses (i.e., assigns a higher importance score) to use Mrel rather than its own output produced by Ωvqa, since the question requires relational reasoning. With the attended green box obtained from Mrel, Mvqa mostly uses the object and captioning modules to produce the final answer. More examples are presented in Appendix C.

Table 5: Average human judgments on a scale from 0 to 4. ✓ indicates that the model got the final answer right, and ✗ that it got it wrong.
Correct? (PMN / Baseline) | # Q  | Human Rating (PMN / Baseline)
✓ / ✓                     | 715  | 3.13 / 2.86
✓ / ✗                     | 584  | 2.78 / 1.40
✗ / ✓                     | 162  | 1.73 / 2.47
✗ / ✗                     | 139  | 1.95 / 1.66
All images                | 1600 | 2.54 / 2.24

Judging Answering Quality. The modular structure and gating mechanism of PMN make it easy to interpret the reasoning behind its outputs. We conduct a human evaluation with Amazon Mechanical Turk on 1,600 randomly chosen questions. Each worker is asked to rate the explanations generated by the baseline model and by PMN, like a teacher grading student exams, on a scale of 0 (very bad), 1 (bad), 2 (satisfactory), 3 (good), 4 (very good). The baseline explanation is composed of the bounding box it attends to and the final answer. For PMN, we form a rule-based natural language explanation based on the prominent modules used. An example is shown in Fig. 3. Each question is assessed by three human workers. Incorrect reasoning steps are penalized, so if PMN produces wrong reasoning steps, it can get a low score. On the other hand, the baseline model often scores well on simple questions that do not need complex reasoning (e.g., "what color is the cat?").

Figure 3: Examples of PMN's reasoning process. Top: Q: "what is behind the men?"; the generated explanation reads "I first find the BLUE box, and then from that, I look at the GREEN box. The object 'tree' would be useful in answering the question. In conclusion, I think the answer is trees." PMN correctly first finds a person and then uses the relationship module to find the tree behind him. Bottom: Q: "what color is the curvy wire?"; the explanation reads "I look at the RED box. The object properties white long electrical would be useful in answering the question. In conclusion, I think the answer is white." PMN finds the wire, uses the attribute module to correctly infer its attributes (white, long, electrical), and then outputs the correct answer.

We report results in Table 5, and show more examples in Appendix D. Human evaluators tend to give low scores to wrong answers and high scores to correct answers regardless of the explanations, but PMN always scores higher when both PMN and the baseline get a question correct, or both get it wrong. Interestingly, a correct answer from PMN scores 1.38 points higher than a wrong baseline answer, but a correct baseline answer scores only 0.74 points higher than a wrong PMN answer. This shows that PMN gets partial marks even when its answer is wrong, since its reasoning steps are partially correct.

Low Data Regime. PMN benefits from re-using modules and only needs to learn the communication between them. This allows us to achieve good performance even when using a fraction of the training data. Table 6 presents the absolute gain in accuracy that PMN achieves.
For this experiment, we use Lvqa = [Ωvqa, Mrel, Mobj, Matt, δvqa, Mcap] (because of overlapping questions with Mcnt). When the amount of data is very small (1%), PMN does not help, because there is not enough data to learn to communicate with the lower modules. The maximum gain is obtained when using 10% of the data. This shows that PMN can help in situations where there is not a huge amount of training data, since it can exploit previously learned knowledge. The gain remains constant at about 2% from then on.

Table 6: Absolute gain in accuracy when using a fraction of the training data.
Fraction of VQA training data (in %) | 1     | 5    | 10   | 25   | 50   | 100
Absolute accuracy gain (in %)        | -0.49 | 2.21 | 4.01 | 2.66 | 1.79 | 2.04

5 CONCLUSION AND DISCUSSION

In this work, we proposed Progressive Module Networks (PMN), which train task modules in a compositional manner by exploiting previously learned lower-level task modules. PMN can produce queries to call other modules and make use of the returned information to solve the current task. Given experts in specific tasks, the parent module only needs to learn how to effectively communicate with them. It can also choose which lower level modules to use. Thus, PMN is data efficient and provides a more interpretable reasoning process. Also, since there is no need to know about the inner workings of the child modules, it opens up promising ways to train intelligent robots that can compartmentalize and perform multiple tasks as they progressively improve their abilities. Moreover, one task can benefit from unrelated tasks, unlike in conventional multi-task learning algorithms. PMN as it stands has a few limitations with respect to hand-designed structures and the need for additional supervision. Nevertheless, PMN is an important step towards more interpretable, compositional multi-task models. Some of the questions to be solved in the future include: 1) learning module lists automatically; 2) choosing a few modules (hard attention) to reduce overhead; 3) a more generic structure of module components across tasks; and 4) joint training of all modules.

Acknowledgments. Partially supported by the DARPA Explainable AI (XAI) program, Samsung and NSERC. We also thank NVIDIA for their donation of GPUs.

REFERENCES

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and Top-down Attention for Image Captioning and VQA. In CVPR, 2018.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural Module Networks. In CVPR, 2016.

Hakan Bilen and Andrea Vedaldi. Integrated Perception with Recurrent Multi-task Neural Networks. In NIPS, 2016.

Richard Caruana. Multitask Learning: A Knowledge-Based Source of Inductive Bias. In ICML, 1993.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. Low Resource Dependency Parsing: Cross-lingual Parameter Sharing in a Neural Network Parser. In Association for Computational Linguistics (ACL), 2015.

Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In Empirical Methods in Natural Language Processing (EMNLP), 2016.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In CVPR, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.

Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to Reason: End-to-End Module Networks for Visual Question Answering. In ICCV, 2017.

Drew Arad Hudson and Christopher D. Manning. Compositional Attention Networks for Machine Reasoning. In ICLR, 2018. URL https://openreview.net/forum?id=S1Euwz-Rb.

Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia v0.1: the Winning Entry to the VQA Challenge 2018. arXiv preprint arXiv:1807.09956, 2018.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In CVPR, 2017a.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Inferring and Executing Programs for Visual Reasoning. In ICCV, 2017b.

Andrej Karpathy and Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. In CVPR, 2015.

Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear Attention Networks. arXiv preprint arXiv:1805.07932, 2018.

Iasonas Kokkinos. UberNet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision using Diverse Datasets and Limited Memory. In CVPR, 2017.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arXiv preprint arXiv:1602.07332, 2016.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.

Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Philip S. Yu. Learning Multiple Tasks with Multilinear Relationship Networks. In NIPS, 2017.

Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual Relationship Detection with Language Priors. In ECCV, 2016.

Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-Stitch Networks for Multi-task Learning. In CVPR, 2016.

Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), 2014.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. In NIPS, 2015.

Sebastian Ruder. An Overview of Multi-Task Learning in Deep Neural Networks. arXiv preprint arXiv:1706.05098, 2017.

Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Sogaard. Learning What to Share Between Loosely Related Tasks. arXiv preprint arXiv:1705.08142, 2017.

Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive Neural Networks. arXiv preprint arXiv:1606.04671, 2016.

Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge. In CVPR, 2018.
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based Image Description Evaluation. In CVPR, 2015.

Peng Wang, Qi Wu, Chunhua Shen, and Anton van den Hengel. The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions. In CVPR, 2017.

Yongxin Yang and Timothy M. Hospedales. Trace Norm Regularised Deep Multi-Task Learning. In ICLR Workshop Track, 2017.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked Attention Networks for Image Question Answering. In CVPR, 2016.

Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering. IEEE Transactions on Neural Networks and Learning Systems, 2018. doi: 10.1109/TNNLS.2018.2817340.

Amir R. Zamir, Alexander Sax, William B. Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling Task Transfer Learning. In CVPR, 2018.

Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. Learning to Count Objects in Natural Images for Visual Question Answering. In ICLR, 2018. URL https://openreview.net/forum?id=B12Js_yRb.

A MODULE ARCHITECTURES

We discuss the detailed architecture of each module. We first describe the shared environment and the soft attention mechanism.

Environment. The sensory input that forms our environment E consists of: (i) image regions: N image regions X = [X1, . . . , XN], each Xi ∈ R^d, with corresponding bounding box coordinates b = [b1, . . . , bN] extracted from Faster R-CNN (Ren et al., 2015); and (ii) language: a vector representation of a sentence S (in our example, a question). S is computed through a one layer GRU by feeding in the embedding of each word [w1, . . . , wT] at each time step. For (i), we use a pretrained model from Anderson et al. (2018) to extract features and bounding boxes.

Soft attention. For all parts that use a soft-attention mechanism, an MLP is employed. Given a key vector k and a set of data to be attended {d1, . . . , dN}, we compute

attention_map = (z(f(k) ⊙ g(d1)), . . . , z(f(k) ⊙ g(dN)))     (1)

where f and g are each a linear layer followed by a ReLU activation that project k and di into the same dimension, ⊙ denotes element-wise multiplication, and z is a linear layer that projects the joint representation into a single number. Note that we do not specify a softmax function here because a sigmoid is used in some cases.
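A minimal sketch of this attention scorer follows. The dimensions and names are hypothetical, and the element-wise product used to fuse the key with each data vector is an assumption consistent with Eq. (1); the caller applies a softmax or sigmoid to the returned scores, as noted above.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Sketch of Eq. (1): score_i = z(f(k) * g(d_i)); returns raw scores per item."""
    def __init__(self, key_dim=512, data_dim=2048, joint_dim=512):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(key_dim, joint_dim), nn.ReLU())
        self.g = nn.Sequential(nn.Linear(data_dim, joint_dim), nn.ReLU())
        self.z = nn.Linear(joint_dim, 1)

    def forward(self, k, d):
        # k: (B, key_dim); d: (B, N, data_dim) -> scores: (B, N)
        joint = self.f(k).unsqueeze(1) * self.g(d)   # element-wise fusion per region
        return self.z(joint).squeeze(-1)

# usage: scores = SoftAttention()(torch.randn(2, 512), torch.randn(2, 36, 2048))
```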
A.1 OBJECT AND ATTRIBUTE CLASSIFICATION (LEVEL 0)

The input to both modules Mobj and Matt is a visual descriptor for a bounding box bi in the image, i.e., qobj = Xi. Mobj and Matt project the visual feature Xi to a 300-dimensional vector through a single layer neural network followed by a tanh() non-linearity. We expect this vector to represent the name and attributes of the box bi.

A.2 IMAGE CAPTIONING (LEVEL 1)

Mcap takes a zero vector as the model input and produces a natural language sentence as the output based on the environment E (the detected image regions in an image). It has Lcap = [Ωcap, Mobj, Matt, δcap] and goes through a maximum of Tcap = 16 time steps, or until it reaches the end-of-sentence token. Mcap is implemented similarly to the captioning model in Anderson et al. (2018). We employ a two-layered GRU (Cho et al., 2014) as the recurrent state update function Ucap, where st = (h_1^t, h_2^t) consists of the hidden states of the first and second layers of Ucap. Each layer has 1000-d hidden states. The state initializer Icap sets the initial hidden state of Ucap, i.e., the model state st, to a zero vector. For each time step t (with Tcap = 16), Mcap performs the following four operations:

(1) The importance function Gcap is executed. It is implemented as a linear layer R^1000 → R^4 (for the four modules in Lcap) that takes st, specifically h_1^t ∈ st, as input.
(2) Q_cap→Ω passes h_1^t to the attention module Ωcap, which attends over the image regions X with h_1^t as the key vector. Ωcap is implemented as a soft-attention mechanism, so that it produces attention probabilities pi (via softmax) for each image feature Xi ∈ E. The returned attention map vΩ is added to the scratch pad V.
(3) Q_cap→obj and Q_cap→att pass the sum of visual features X weighted by vΩ ∈ V to the corresponding modules. δcap is implemented as an MLP. The receivers project the outputs into 1000 dimensional vectors vobj, vatt, and vδ through a sequence of linear layers, batch norm, and tanh() nonlinearities. They are added to V.
(4) As stated above, Ucap is a two-layered GRU. At time t, the first layer takes as input the average visual features from the environment E, (1/N) Σ_i Xi, the embedding vector of the previous word w_(t-1), and h_2^t. For time t = 1, the beginning-of-sentence embedding and a zero vector are the inputs for w1 and h_1^1, respectively. The second layer is fed h_1^t as well as the information from the other modules,

ρ = Σ ( softmax(g_obj, g_att, g_δ) ⊙ (vobj, vatt, vδ) )     (2)

which is the gated summation of the outputs in V weighted by the softmaxed importance scores. We now have a new state st+1 = (h_1^(t+1), h_2^(t+1)).

The output of Mcap, ocap, is the sequence of words produced through Ψcap, which is a linear layer projecting each h_2^t in st to the output word vocabulary.

A.3 RELATIONSHIP DETECTION (LEVEL 1)

The relationship detection task requires one to produce triplets in the form of subject - relationship - object (Lu et al., 2016). We re-purpose this task as one that involves finding the relevant item (region) in an image that is related to a given input through a given relationship. The input to the module is qrel = [bi, r], where bi is a one-hot encoded input bounding box (whose i-th entry is 1 and others 0) and r is a one-hot encoded relationship category (e.g., above, behind). Mrel has Lrel = [Mobj, Matt, δrel] and goes through Trel = N steps, where N is the number of bounding boxes (image regions in the environment). So at time step t, the module looks at the t-th box. Mrel uses Mobj and Matt simply as feature extractors for each bounding box, and therefore it does not have a complex structure. The state initializer Irel projects r to a 512 dimensional vector with an embedding layer, and the resulting vector is set as the first state s1. For each time step t (with Trel = N), Mrel performs the following three operations:

(1) Q_rel→obj and Q_rel→att pass the image vector corresponding to the bounding box bt to Mobj and Matt. R_obj→rel and R_att→rel are identity functions, i.e., we do not modify the object and attribute vectors. The outputs vobj and vatt are added to V.
(2) δrel projects the coordinates of the current box bt to a 512 dimensional vector. The resulting vδ is added to V.
(3) Urel concatenates the visual feature Xt with vobj, vatt, and vδ from V. The concatenated vector is fed through an MLP, resulting in a 512 dimensional vector. This corresponds to the new state st+1.

After N steps, the prediction function Ψrel performs the following operations: the first state s1, which contains the relationship information, is multiplied element-wise with si+1 (note: si+1 corresponds to the input box bi).
Let this vector be l. Ψrel then produces an attention map bout over all bounding boxes in b. The inputs to the attention function are s2, . . . , sTrel (i.e., all image regions) and the key vector l. orel = bout is the output of Mrel, which represents an attention map indicating the bounding box that contains the related object.

A.4 COUNTING (LEVEL 2)

Given a vector representation of a natural language question (e.g., "how many cats are in this image?"), the goal of this module is to produce a count. The input qcnt = S ∈ E is a vector representing a natural language question. When training Mcnt, qcnt is computed through a one layer GRU with a hidden size of 512 dimensions. The input to the GRU at each time step is the embedding of each word from the question. Word embeddings are initialized with 300 dimensional GloVe word vectors (Pennington et al., 2014) and fine-tuned thereafter. Similar to the visual features obtained through a CNN, the question vector is treated as an environment variable. Mcnt has Lcnt = [Ωcnt, Mrel] and goes through only one time step. The state initializer Icnt is a simple function that just sets s1 = qcnt. For the single time step (Tcnt = 1), Mcnt performs the following four operations:

(1) The importance function Gcnt is executed. It is implemented as a linear layer R^512 → R^2 (for the two modules in Lcnt) that takes st as input.
(2) Q_cnt→Ω passes st to the attention module Ωcnt, which attends over the image regions X with st as the key vector. Ωcnt is implemented as an MLP that computes a dot-product soft-attention similar to Yang et al. (2016). The returned attention map vΩ is added to the scratch pad V.
(3) Q_cnt→rel produces an input tuple [b, r] for Mrel. The input object box b is produced by an MLP that does soft attention over the image boxes, and the relationship category r is produced through an MLP with st as input. Mrel is called with [b, r] and the returned map vrel is added to V.
(4) Ucnt first computes the probabilities of using vΩ or vrel by applying a softmax over the importance scores. vΩ and vrel are weighted and summed with the softmax probabilities, resulting in the new state s2 containing the attention map. Thus, the state update function chooses the map from Mrel if the given question involves relational reasoning.

The prediction function Ψcnt returns a count vector. The count vector is computed through the counting algorithm of Zhang et al. (2018), which builds a graph representation from attention maps to count objects. The method uses s2 (passed through a sigmoid) and the bounding box coordinates b as inputs. The algorithm of Zhang et al. (2018) is fully differentiable, and the resulting count vector corresponds to a one-hot encoding of a number. We let the range of the count be 0 to Z = 12. Please refer to Zhang et al. (2018) for details of the counting algorithm.

A.5 VISUAL QUESTION ANSWERING (LEVEL 3)

The description for the VQA task (Sec. 3.4) is included here again for completeness. The input qvqa is a vector representing a natural language question (i.e., the sentence vector S ∈ E). The state variable st is represented by a tuple (q_vqa^t, k_(t-1)), where q_vqa^t represents the query to ask at time t and k_(t-1) represents the knowledge gathered at time t-1. The state initializer Ivqa is composed of a GRU with hidden state dimension 512. The first input to the GRU is qvqa, and Ivqa sets s1 = (q_vqa^1, 0), where q_vqa^1 is the first hidden state of the GRU and 0 is a zero vector (no knowledge at first).
A.5 VISUAL QUESTION ANSWERING (LEVEL 3)
The description of the VQA task (Sec. 3.4) is included here again for completeness. The input q_vqa is a vector representing a natural language question (i.e. the sentence vector S ∈ E). The state variable s_t is a tuple (q^t_vqa, k_{t-1}), where q^t_vqa is the query to ask at time t and k_{t-1} is the knowledge gathered up to time t-1. The state initializer I_vqa is a GRU with hidden state dimension 512. The first input to the GRU is q_vqa, and I_vqa sets s_1 = (q^1_vqa, 0), where q^1_vqa is the first hidden state of the GRU and 0 is a zero vector (no knowledge at first).
For t = 1, . . . , T_vqa = 2, M_vqa performs the following seven operations:
(1) The importance function G_vqa is executed. It is implemented as a linear layer R^512 -> R^7 (one score for each of the seven modules in L_vqa) that takes s_t, specifically q^t_vqa ⊂ s_t, as input.
(2) Q^vqa_Ω passes q^t_vqa to the attention module Ω_vqa, which attends over the image regions X with q^t_vqa as the key vector. Ω_vqa is implemented as an MLP that computes dot-product soft-attention similar to Yang et al. (2016). The returned attention map v_Ω is added to the scratch pad V.
(3) Q^vqa_rel produces an input tuple [b, r] for M_rel. The input object box b is produced by an MLP that performs soft attention over the image boxes, and the relationship category r is produced by an MLP with q^t_vqa as input. M_rel is called with [b, r] and the returned map v_rel is added to V.
(4) Q^vqa_obj, Q^vqa_att, and Q^vqa_δ first compute a joint attention map m as the summation of (v_Ω, v_rel) weighted by the softmaxed importance scores of (Ω_vqa, M_rel), and they pass the sum of the visual features X weighted by m to the corresponding modules. The residual module δ_vqa is implemented as an MLP. The receivers project the outputs into 512-dimensional vectors v_obj, v_att, and v_δ through a sequence of linear layers, batch norm, and tanh() nonlinearities. They are added to V.
(5) Q^vqa_cnt passes q^t_vqa to M_cnt, which returns o_cnt. R^vqa_cnt projects the count vector o_cnt into a 512-dimensional vector v_cnt through the same sequence of layers as above. v_cnt is added to V.
(6) M_vqa calls M_cap, and R^vqa_cap receives a natural language caption of the image. It converts the words in the caption into vectors [w_1, . . . , w_T] through an embedding layer. The embedding layer is initialized with 300-dimensional GloVe vectors (Pennington et al., 2014) and fine-tuned. It performs a softmax attention operation over [w_1, . . . , w_T] through an MLP with q^t_vqa ⊂ s_t as the key vector, resulting in word probabilities p_1, . . . , p_T. The sentence representation Σ^T_i p_i w_i is projected into a 512-dimensional vector v_cap using the same sequence of layers as for v_cnt. v_cap is added to V.
(7) The state update function U_vqa first applies a softmax over the importance scores of (M_obj, M_att, δ_vqa, M_cnt, M_cap). We define an intermediate knowledge vector k_t as the summation of (v_obj, v_att, v_δ, v_cnt, v_cap) weighted by the softmaxed importance scores. U_vqa passes k_t as input to the GRU initialized by I_vqa, and we obtain q^{t+1}_vqa, the new hidden state of the GRU. The new state s_{t+1} is set to (q^{t+1}_vqa, k_t). This process allows the GRU to compute new question and state vectors based on what has been asked and seen.
After T_vqa steps, the prediction function Ψ_vqa computes the final output based on the initial question vector q_vqa and all knowledge vectors k_t ⊂ s_t. Here, q_vqa and k_t are fused with gated-tanh layers and fed through a final classification layer similar to Anderson et al. (2018), and the logits for all time steps are added. The resulting logit vector is the final output o_vqa, which corresponds to an answer in the vocabulary of the VQA task.
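The following sketch illustrates the overall control flow of M_vqa over T_vqa = 2 steps. The placeholder tensors standing in for the receiver-projected module outputs, the dimensions, and the assumed ordering of the importance scores are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

B, D, T_vqa, num_modules = 2, 512, 2, 7

gru = torch.nn.GRUCell(D, D)             # plays the role of I_vqa / U_vqa
g_vqa = torch.nn.Linear(D, num_modules)  # importance function G_vqa

q_sentence = torch.randn(B, D)             # question vector q_vqa from the environment
q_t = gru(q_sentence, torch.zeros(B, D))   # q^1_vqa; initial knowledge k_0 = 0
knowledge = []

for t in range(T_vqa):
    scores = g_vqa(q_t)                    # one score per module in L_vqa
    # Placeholder for the receiver-projected outputs (v_obj, v_att, v_delta,
    # v_cnt, v_cap); in PMN these come from querying the lower-level modules.
    module_outputs = torch.randn(B, 5, D)
    # Assumed ordering: the last five scores gate the gathered knowledge vectors,
    # while the first two gate the attention maps of Omega_vqa and M_rel.
    gates = F.softmax(scores[:, 2:], dim=-1)
    k_t = (gates.unsqueeze(-1) * module_outputs).sum(1)
    knowledge.append(k_t)
    q_t = gru(k_t, q_t)                    # new query q^{t+1}_vqa

# Psi_vqa would fuse q_vqa with each k_t and sum the per-step answer logits.
```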
B ADDITIONAL EXPERIMENTAL DETAILS
In this section, we provide more details about the datasets and module training.
B.1 DATASETS
We extract bounding boxes and their visual representations using a pretrained model from Anderson et al. (2018), which is a Faster R-CNN (Ren et al., 2015) based on ResNet-101 (He et al., 2016). It produces 10 to 100 boxes with 2048-d feature vectors for each region. To accelerate training, we remove overlapping bounding boxes that are most likely duplicates (area overlap IoU > 0.7) and keep only the 36 most confident boxes (when available).
MS-COCO contains 100K images with annotated bounding boxes and captions. It is widely used for benchmarking several vision tasks such as captioning and object detection.
Visual Genome was collected to relate image concepts to image regions. It has over 108K images with annotated bounding boxes, containing 1.7M visual question answering pairs, 3.8M object instances, 2.8M attributes, and 1.5M relationships between pairs of boxes. Since the dataset contains MS-COCO images, we ensure that we do not train on any MS-COCO validation or test images.
VQA 2.0 is the most popular visual question answering dataset, with 1M questions on 200K natural images. Questions in this dataset require reasoning about objects, actions, attributes, spatial relations, counting, and other inferred properties, making it an ideal dataset for our visual reasoning PMN.
B.2 TRAINING
Here, we give the training details of each module. We train our modules sequentially, from low-level to high-level tasks, one at a time. When training a higher-level module, the internal weights of the lower-level modules are not updated, thus preserving their performance on the original tasks. We do train the weights of the residual module and the attention module Ω. We train I, G, Q, R, U, and Ψ by allowing gradients to pass through the lower-level modules. Thus, while the existing lower modules are held fixed, the new module learns to communicate with them via the query transmitter Q and receiver R (a short illustrative sketch of this scheme is given at the end of this subsection).
Object and attribute classification. M_obj is trained to minimize the cross-entropy loss for predicting the object class via an additional linear layer on top of the module output. M_att also includes an additional linear layer on top of the module output and is trained to minimize the binary cross-entropy loss for predicting attribute classes, since one detected image region can contain zero or more attribute classes. We make use of 780K/195K train/val object instances paired with attributes from the Visual Genome dataset. Both are trained with the Adam optimizer at a learning rate of 0.0005 with batch size 32 for 20 epochs.
Image captioning. M_cap is trained using a cross-entropy loss at each time step (maximum likelihood). Parameters are updated using the Adam optimizer at a learning rate of 0.0005 with batch size 64 for 20 epochs. We use the standard split of the MS-COCO captioning dataset.
Relationship detection. M_rel is trained using a cross-entropy loss on subject - relationship - object pairs with the Adam optimizer at a learning rate of 0.0005 with batch size 128 for 20 epochs. The pairs are extracted from the Visual Genome dataset such that both the subject and object boxes overlap with the ground truth boxes (IoU > 0.7), resulting in 200K/38K train/val tuples.
Counting. M_cnt is trained using a cross-entropy loss on questions starting with "how many" from the VQA 2.0 dataset. We use the Adam optimizer with a learning rate of 0.0001 and batch size 128 for 20 epochs. As stated in the experiments section, we additionally create 89K synthetic questions to increase our training set by counting the object boxes in the VG dataset and forming "how many" questions (e.g. (Q: how many dogs are in this picture?, A: 3) from an image containing three bounding boxes of dogs). We also sample relational synthetic questions from each VG training image; these are used to train only the module communication parameters when the relationship module is included. We use the same 200K/38K split from the relationship detection task by concatenating "how many" + subject + relationship or "how many" + relationship + object (e.g. how many plates on table?, how many behind door?). The module communication parameters for M_rel in this case are Q^cnt_rel, which computes the relationship category and the input image region to be passed to M_rel. To be clear, we supervise the query q_rel = [b_i, r] sent to M_rel by minimizing a cross-entropy loss on b_i and r.
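As an illustration of the synthetic counting questions described above, the sketch below forms "how many" questions by counting object boxes per image. The annotation schema, the function name, and the naive pluralization are assumptions made for the example and do not reflect the actual Visual Genome format or the authors' generation script.

```python
from collections import Counter

def synthesize_counting_questions(image_annotations, max_count=12):
    """Generate synthetic (question, answer) pairs from per-image object lists.

    image_annotations: list of dicts like {"objects": ["dog", "dog", "dog", "person"]}
    (illustrative schema). Counts above max_count are skipped to match the 0..12 range.
    """
    qa_pairs = []
    for ann in image_annotations:
        counts = Counter(ann["objects"])
        for name, count in counts.items():
            if count <= max_count:
                # Naive pluralization, for illustration only.
                question = f"how many {name}s are in this picture?"
                qa_pairs.append((question, count))
    return qa_pairs

# Illustrative usage
anns = [{"objects": ["dog", "dog", "dog", "person"]}]
print(synthesize_counting_questions(anns))
# [('how many dogs are in this picture?', 3), ('how many persons are in this picture?', 1)]
```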
Visual Question Answering. M_vqa is trained using a binary cross-entropy loss on o_vqa with the Adam optimizer at a learning rate of 0.0005 with batch size 128 for 7 epochs. We empirically found the binary cross-entropy loss to work better than cross-entropy, as also reported by Anderson et al. (2018). Unlike the other modules, whose parameters are kept fixed, we fine-tune only the counting module, because it expects the same form of input as M_vqa - the embedding of a natural language question. The performance of the counting module depends crucially on the quality of the attention map over bounding boxes. By employing questions from the whole VQA dataset, we obtain a better attention map, and the performance of the counting module increases from 50.0% to 55.8% with fine-tuning. Since M_vqa and M_cnt expect the same form of input, the weights of the attention modules Ω_{vqa,cnt} and the query transmitters for the relationship module Q^{vqa,cnt}_rel are shared.
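The sequential training scheme described in B.2 (frozen lower-level modules, trainable communication parameters, and gradients still flowing through the frozen modules) can be summarized with a minimal sketch. The modules here are stand-in linear layers, and the variable names are illustrative assumptions rather than the actual PMN components.

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module):
    """Freeze a previously trained lower-level module: its internal weights are
    not updated, but gradients still flow through it to the calling module."""
    for p in module.parameters():
        p.requires_grad = False

# Stand-ins: a frozen lower-level module and the new module's communication layers.
query_transmitter = nn.Linear(512, 512)   # Q: maps the parent state to a query
receiver = nn.Linear(512, 512)            # R: projects the lower module's output
lower_module = nn.Linear(512, 512)        # placeholder for a previously trained module

freeze(lower_module)

# Only the new module's parameters (I, G, Q, R, U, Psi in the paper's notation)
# are handed to the optimizer.
optimizer = torch.optim.Adam(
    list(query_transmitter.parameters()) + list(receiver.parameters()), lr=5e-4)

state = torch.randn(8, 512)
out = receiver(lower_module(query_transmitter(state)))  # gradients pass through the frozen module
loss = out.pow(2).mean()                                # dummy loss for illustration
loss.backward()                                         # updates reach Q and R only
optimizer.step()
```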
C PMN EXECUTION ILLUSTRATED
We provide more examples of the execution traces of PMN on the visual question answering task in Figure 4. Each row in the figure corresponds to a different example. For each row, the left column shows the environment E, the middle column shows the final answer and visualizes step 3 of the execution process, and the right column shows the computed importance scores along with the populated scratch pad.
[Figure 4: the in-figure text (per-step annotations, importance scores, and four example questions such as "What is the bird standing on?", "Are they watching tv?", "What color is ...", and "What is on top of his head?", with final answers bench, yes, blue, and sunglasses) is omitted here; only the caption is reproduced below.]
Figure 4: Example of PMN's module execution trace on the VQA task. Numbers in circles indicate the order of execution. The intensity of the gray blocks represents the depth of module calls. All variables, including queries and outputs stored in V, are vectorized to allow gradients to flow (e.g., a caption is composed of a sequence of softmaxed W-dimensional vectors for vocabulary size W). For M_cap, words shown with higher intensity in red are deemed more relevant by R^vqa_cap.
D EXAMPLES OF PMN'S REASONING
We provide more examples from the human evaluation experiment on the interpretability of PMN compared with the baseline model in Figure 5.
[Figure 5: the in-figure text (side-by-side PMN and baseline reasoning statements for questions such as "What type of bird is this?", "How many screens are here?", "What color is the cat?", and "What is behind the trees?") is omitted here; only the caption is reproduced below.]
Figure 5: Examples of PMN's reasoning process compared with the baseline, given the question on the left. Check marks and crosses denote correct and wrong answers, respectively.