# Saliency-based Sequential Image Attention with Multiset Prediction

Sean Welleck, New York University, wellecks@nyu.edu
Jialin Mao, New York University, jialin.mao@nyu.edu
Kyunghyun Cho, New York University, kyunghyun.cho@nyu.edu
Zheng Zhang, New York University, zz@nyu.edu

**Abstract** Humans process visual scenes selectively and sequentially using attention. Central to models of human visual attention is the saliency map. We propose a hierarchical visual architecture that operates on a saliency map and uses a novel attention mechanism to sequentially focus on salient regions and take additional glimpses within those regions. The architecture is motivated by human visual attention and is used for multi-label image classification on a novel multiset task, demonstrating that it achieves high precision and recall while localizing objects with its attention. Unlike conventional multi-label image classification models, the model supports multiset prediction due to a reinforcement-learning based training process that allows for arbitrary label permutations and multiple instances per label.

## 1 Introduction

Humans can rapidly process complex scenes containing multiple objects despite having limited computational resources. The visual system uses various forms of attention to prioritize and selectively process subsets of the vast amount of visual input [6]. Computational models and various forms of psychophysical and neuro-biological evidence suggest that this process may be implemented using various "maps" that topographically encode the relevance of locations in the visual field [17, 39, 13]. Under these models, visual input is compiled into a saliency map that encodes the conspicuity of locations based on bottom-up features, computed in a parallel, feed-forward process [20, 17]. Top-down, goal-specific relevance of locations is then incorporated to form a priority map, which is used to select the next target of attention [39]. Processing a scene with multiple attentional shifts may thus be interpreted as a feed-forward process followed by sequential, recurrent stages [23]. Furthermore, the allocation of attention can be separated into covert attention, which is deployed to regions without eye movement and precedes eye movements, and overt attention, which is associated with an eye movement [6].

Despite their evident importance to human visual attention, the notions of incorporating saliency to decide attentional targets, integrating covert and overt attention mechanisms, and using multiple, sequential shifts while processing a scene have not been fully addressed by modern deep learning architectures. Motivated by the model of Itti et al. [17], we propose a hierarchical visual architecture that operates on a saliency map computed by a feed-forward process, followed by a recurrent process that uses a combination of covert and overt attention mechanisms to sequentially focus on relevant regions and take additional glimpses within those regions. We propose a novel attention mechanism for implementing the covert attention. Here, the architecture is used for multi-label image classification. Unlike conventional multi-label image classification models, this model can perform multiset classification due to the proposed reinforcement-learning based training.
## 2 Related Work

We first introduce relevant concepts from biological visual attention, then contextualize work in deep learning related to visual attention, saliency, and hierarchical reinforcement learning (RL). We observe that current deep learning models either exclusively focus on bottom-up, feed-forward attention or on overt sequential attention, and that saliency has traditionally been studied separately from object recognition.

### 2.1 Biological Visual Attention

Visual attention can be classified into covert and overt components. Covert attention precedes eye movements, and is intuitively used to monitor the environment and guide eye movements to salient regions [6, 21]. Two particular functions of covert attention motivate the Gaussian attention mechanism proposed below: noise exclusion, which modifies perceptual filters to enhance the signal portion of the stimulus and mitigate the noise; and distractor suppression, which refers to suppressing the representation strength outside an attended area [6]. Further inspiring the proposed attention mechanism is evidence from cueing [1], multiple object tracking [8], and fMRI [30] studies, which indicates that covert attention can be deployed to multiple, disjoint regions that vary in size and can be conceptually viewed as multiple "spotlights".

Overt attention is associated with an eye movement, so that the attentional focus coincides with the fovea's line of sight. The planning of eye movements is thought to be influenced by bottom-up (scene dependent) saliency as well as top-down (goal relevant) factors [21]. In particular, one major view is that two types of maps, the saliency map and the priority map, encode measures used to determine the target of attention [39]. Under this view, visual input is processed into a feature-agnostic saliency map that quantifies the distinctiveness of a location relative to other locations in the scene based on bottom-up properties. The saliency map is then combined with top-down information, resulting in a priority map.

The saliency map was initially proposed by Koch & Ullman [20], then implemented in a computational model by Itti et al. [17]. In their model, saliency is determined by relative feature differences and compiled into a "master saliency map". Attentional selection then consists of directing a fixed-size attentional region to the area of highest saliency, i.e. a "winner-take-all" process. The attended location's saliency is then suppressed and the process repeats, so that multiple attentional shifts can occur following a single feed-forward computation.

Subsequent research effort has been directed at finding neural correlates of the saliency map and priority map. Proposed areas for salience computation include the superficial layers of the superior colliculus (sSC) and inferior sections of the pulvinar (PI), while proposed areas for priority map computation include the frontal eye field (FEF) and deeper layers of the superior colliculus (dSC) [39]. Here, we only need to assume the existence of these maps as conceptual mechanisms involved in influencing visual attention, and refer the reader to [39] for a recent review.

We explore two aspects of Itti's model within the context of modern deep learning-based vision: the use of a bottom-up, featureless saliency map to guide attention, and the sequential shifting of attention to multiple regions. Furthermore, our model incorporates top-down signals with the bottom-up saliency map to create a priority map, and includes covert and overt attention mechanisms.
### 2.2 Visual Attention, Saliency, and Hierarchical RL in Deep Learning

Visual attention is a major area of interest in deep learning; existing work can be separated into sequential attention and bottom-up feed-forward attention. Sequential attention models choose a series of attention regions. Larochelle & Hinton [24] used an RBM to classify images with a sequence of fovea-like glimpses, while the Recurrent Attention Model (RAM) of Mnih et al. [31] posed single-object image classification as a reinforcement learning problem, where a policy chooses the sequence of glimpses that maximizes classification accuracy. The "hard attention" mechanism developed in [31] has since been widely used [27, 44, 35, 2]. Notably, an extension to multiple objects was made in the DRAM model [3], but DRAM is limited to datasets with a natural label ordering, such as SVHN [32]. Recently, Cheung et al. [9] developed a variable-sized glimpse inspired by biological vision, incorporating it into a simple RNN for single object recognition. Due to their fovea-like attention, which shifts based on task-specific objectives, the above models can be seen as having overt, top-down attention mechanisms.

An alternative approach is to alter the structure of a feed-forward network so that the convolutional activations are modified as the image moves through the network, i.e. in a bottom-up fashion. Spatial transformer networks [18] learn parameters of a transformation that can have the effect of stretching, rotating, and cropping activations between layers. Progressive Attention Networks [36] learn attention filters placed at each layer of a CNN to progressively focus on an arbitrary subset of the input, while Residual Attention Networks [41] learn feature-specific filters. Here, we consider an attentional stage that follows a feed-forward stage: a saliency map and an image representation are produced in a feed-forward stage, then an attention mechanism determines which parts of the image representation are relevant using the saliency map.

Saliency is typically studied in the context of saliency modeling, in which a model outputs a saliency map for an image that matches human fixation data, or salient object segmentation [25]. Separately, several works have considered extracting a saliency map for understanding classification network decisions [37, 47]. Zagoruyko & Komodakis [46] formulate a loss function that causes a student network to have similar "saliency" to a teacher network. They model saliency as a reduction operation $F : \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{H \times W}$ applied to a volume of convolutional activations, which we adopt due to its simplicity. Here, we investigate using a saliency map for a downstream task. Recent work has begun to explore saliency maps as inputs for prominent object detection [38] and image captioning [11], pointing to further uses of saliency-based vision models.

While we focus on using reinforcement learning for multiset classification with only class labels as annotation, RL has been applied to other computer vision tasks, including modeling eye movements based on annotated human scan paths [29], optimizing prediction performance subject to a computational budget [19], describing classification decisions with natural language [16], and object detection [28, 5, 4].

Finally, our architecture is inspired by work in hierarchical reinforcement learning. The model distinguishes between the upper-level task of choosing an image region to focus on and the lower-level task of classifying the object related to that region.
The tasks are handled by separate networks that operate at different time-scales, with the upper-level network specifying the task of the lower-level network. This hierarchical modularity relates to the meta-controller / controller architecture of Kulkarni et al. [22] and feudal reinforcement learning [12, 40]. Here, we apply a hierarchical architecture to multi-label image classification, with the two levels linked by a differentiable operation.

Figure 1: A high-level view of the model components. See Supplementary Materials section 3 for detailed views.

## 3 Architecture

The architecture is a hierarchical recurrent neural network consisting of two main components: the meta-controller and the controller. These components assume access to a saliency model, which produces a saliency map from an image, and an activation model, which produces an activation volume from an image. Figure 1 shows the high-level components, and Supplementary Materials section 3 shows detailed views of the overall architecture and individual components. In short, given a saliency map the meta-controller places an attention mask on an object, then the controller takes subsequent glimpses and classifies that object. The saliency map is updated to account for the processed locations, and the process repeats. The meta-controller and controller operate at different time-scales; for each step of the meta-controller, the controller takes $k + 1$ steps.

**Notation** Let $\mathcal{I}$ denote the space of images, $\mathcal{I} \subseteq \mathbb{R}^{h_I \times w_I}$, and $\mathcal{Y} = \{1, \ldots, n_c\}$ denote the set of labels. Let $\mathcal{S}$ denote the space of saliency maps, $\mathcal{S} \subseteq \mathbb{R}^{h_S \times w_S}$, let $\mathcal{V}$ denote the space of activation volumes, $\mathcal{V} \subseteq \mathbb{R}^{C \times h_V \times w_V}$, let $\mathcal{M}$ denote the space of covert attention masks, $\mathcal{M} \subseteq \mathbb{R}^{h_M \times w_M}$, let $\mathcal{P}$ denote the space of priority maps, $\mathcal{P} \subseteq \mathbb{R}^{h_M \times w_M}$, and let $\mathcal{A}$ denote an action space. The activation model is a function $f_A : \mathcal{I} \to \mathcal{V}$ mapping an input image to an activation volume. An example volume is the $512 \times h_V \times w_V$ activation tensor from the final conv layer of a ResNet.

**Meta-Controller** The meta-controller is a function $f_{MC} : \mathcal{S} \to \mathcal{M}$ mapping a saliency map to a covert attention mask. Here, $f_{MC}$ is a recurrent neural network defined as follows:

$$x_t = [S_t, \hat{y}_{t-1}], \quad e_t = W_{\text{encode}} x_t, \quad h_t = \mathrm{GRU}(e_t, h_{t-1}), \quad M_t = \mathrm{attn}(h_t).$$

Here $x_t$ is a concatenation of the flattened saliency map and a one-hot encoding of the previous step's class label prediction, and $\mathrm{attn}(\cdot)$ is the novel spatial attention mechanism defined below. The mask is then transformed by the interface layer into a priority map that directs the controller's glimpses towards a salient region, and is used to produce an initial glimpse vector for the controller.

**Gaussian Attention Mechanism** The spatial attention mechanism, inspired by covert visual attention, is a 2D discrete convolution of a mixture-of-Gaussians filter. Specifically, the attention mask $M$ is an $m \times n$ matrix with $M_{ij} = \varphi(i, j)$, where

$$\varphi(i, j) = \sum_{k=1}^{K} \alpha^{(k)} \exp\left(-\beta^{(k)} \left[ \left(\kappa_1^{(k)} - i\right)^2 + \left(\kappa_2^{(k)} - j\right)^2 \right]\right).$$

$K$ denotes the number of Gaussian components, and $\alpha^{(k)}, \beta^{(k)}, \kappa_1^{(k)}, \kappa_2^{(k)}$ respectively denote the importance, width, and $x, y$ center of component $k$. To implement the mechanism, the parameters $(\tilde{\alpha}, \tilde{\beta}, \tilde{\kappa}_1, \tilde{\kappa}_2)$ are output by a network layer as a $4K$-dimensional vector, and the elements are transformed to their proper ranges: $\kappa_1 = \sigma(\tilde{\kappa}_1)\,m$, $\kappa_2 = \sigma(\tilde{\kappa}_2)\,n$, $\alpha = \mathrm{softmax}(\tilde{\alpha})$, $\beta = \exp(\tilde{\beta})$. Then $M$ is formed by applying $\varphi$ to the coordinates $\{(i, j) \mid 1 \le i \le m,\ 1 \le j \le n\}$. Note that these operations are differentiable, allowing the attention mechanism to be used as a module in a network trained with back-propagation. Graves [15] proposed a 1D version of this mechanism; here we use a 2D version for spatial attention.
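To make the mechanism concrete, the following is a minimal, unbatched PyTorch sketch of the mask computation, assuming the raw $4K$ parameters come from a linear layer applied to the GRU hidden state; the function and variable names are illustrative rather than taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def gaussian_attention_mask(raw_params: torch.Tensor, m: int, n: int) -> torch.Tensor:
    """Build an m x n covert attention mask from a 4K-dimensional parameter vector.

    raw_params: output of a linear layer, split into K importances, widths,
    and (x, y) centers, then squashed to their ranges as described above.
    """
    K = raw_params.shape[-1] // 4
    a_raw, b_raw, k1_raw, k2_raw = raw_params.split(K, dim=-1)
    alpha = F.softmax(a_raw, dim=-1)            # component importance, sums to 1
    beta = torch.exp(b_raw)                     # positive (inverse) width
    kappa1 = torch.sigmoid(k1_raw) * m          # row center, in [0, m]
    kappa2 = torch.sigmoid(k2_raw) * n          # column center, in [0, n]

    i = torch.arange(1, m + 1, dtype=raw_params.dtype).view(m, 1, 1)
    j = torch.arange(1, n + 1, dtype=raw_params.dtype).view(1, n, 1)
    # phi(i, j) = sum_k alpha_k * exp(-beta_k * ((kappa1_k - i)^2 + (kappa2_k - j)^2))
    sq_dist = (kappa1 - i) ** 2 + (kappa2 - j) ** 2           # shape (m, n, K)
    mask = (alpha * torch.exp(-beta * sq_dist)).sum(dim=-1)   # shape (m, n)
    return mask

# Example: a 7x7 mask from hypothetical raw parameters for K = 3 components.
M = gaussian_attention_mask(torch.randn(12), m=7, n=7)
```

Because every operation above is differentiable with respect to `raw_params`, gradients can flow from the mask back into the meta-controller, matching the note in the text.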
**Interface** The interface layer transforms the meta-controller's output into a priority map and a glimpse vector that are used as input to the controller (diagram in Supplementary Materials 3.4). The priority map combines the top-down covert attention mask with the bottom-up saliency map: $P = M \odot S$. Since $P$ influences the region that is processed next, this can also be seen as a generalization of the "winner-take-all" step in the Itti model; here a learned function chooses a region of high saliency rather than greedily choosing the maximum location. To provide an initial glimpse vector $g_0 \in \mathbb{R}^C$ for the controller, the mask is used to spatially weight the activation volume:

$$g_0 = \sum_{i=1}^{h_V} \sum_{j=1}^{w_V} M_{i,j}\, V_{:,i,j}.$$

This is interpreted as the meta-controller taking an initial, possibly broad and variable-sized glimpse using covert attention. The weighting produced by the attention map retains the activations around the centers of attention while down-weighting outlying areas, effectively suppressing activations from noise outside of the attentional area. Since the activations are averaged into a single vector, there is a trade-off between attentional area and information retention.

**Controller** The controller is a recurrent neural network $f_C : (\mathcal{P}, g_0) \to \mathcal{A}$ that runs for $k + 1$ steps and maps a priority map and initial glimpse vector from the interface layer to parameters of a distribution, from which an action is sampled. The first $k$ actions select spatial indices of the activation volume, and the final action chooses a class label, i.e. $\mathcal{A}_{1,\ldots,k} = \{1, 2, \ldots, h_V w_V\}$ and $\mathcal{A}_{k+1} = \mathcal{Y}$. Specifically:

$$x_i = [P_t, \hat{y}_{t-1}, a_{i-1}, g_{i-1}], \quad e_i = W_{\text{encode}} x_i, \quad h_i = \mathrm{GRU}(e_i, h_{i-1}),$$
$$s_i = \begin{cases} W_{\text{location}} h_i & 1 \le i \le k \\ W_{\text{class}} h_i & i = k + 1 \end{cases}, \qquad p_i = \mathrm{softmax}(s_i), \qquad a_i \sim p_i,$$

where $t$ indexes the meta-controller time-step, $i$ indexes the controller time-step, and $a_i \in \mathcal{A}$ is an action sampled from the categorical distribution with parameter vector $p_i$. The glimpse vectors $g_i$, $i \in \{1, \ldots, k\}$, are formed by extracting the column of the activation volume $V$ at location $a_i = (x, y)_i$. Intuitively, the controller uses overt attention to choose glimpse locations using the information conveyed in the priority map and initial glimpse, compiling the information in its hidden state to make a classification decision. Recall that both covert attention and priority maps are known to influence eye saccades [21]. See Supplementary Materials 3.5 for a diagram.
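The following is a minimal sketch of the interface computation and of the controller's glimpse extraction at a sampled location, assuming single-example tensors and a flat location index; the function names are illustrative.

```python
import torch

def interface(mask: torch.Tensor, saliency: torch.Tensor, volume: torch.Tensor):
    """Priority map P = M * S and initial glimpse g0 = sum_ij M_ij V[:, i, j].

    mask:     (h, w) covert attention mask from the meta-controller
    saliency: (h, w) bottom-up saliency map
    volume:   (C, h, w) activation volume
    """
    priority = mask * saliency                            # element-wise product
    g0 = (mask.unsqueeze(0) * volume).sum(dim=(1, 2))     # (C,) weighted sum over space
    return priority, g0

def glimpse_at(volume: torch.Tensor, location: int) -> torch.Tensor:
    """Extract the activation column at a flat spatial index in {0, ..., h*w - 1},
    as sampled by one of the controller's k location actions."""
    C, h, w = volume.shape
    i, j = location // w, location % w
    return volume[:, i, j]                                # (C,)
```

In the experiments described below, the spatial grid is the 7x7 output of the activation model, so a flat location index ranges over 49 cells.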
**Update Mechanism** During a step $t$, the meta-controller takes saliency map $S_t$ as input and focuses on a region of $S_t$ using an attention mask $M_t$, then the controller takes glimpses at locations $(x, y)_1, (x, y)_2, \ldots, (x, y)_k$. At step $t + 1$, the saliency map should reflect the fact that some regions have already been attended to, in order to encourage attending to novel areas. While the meta-controller's hidden state can in principle prevent it from repeatedly focusing on the same regions, we explicitly update the saliency map with a function $\mathrm{update} : \mathcal{S} \to \mathcal{S}$ that suppresses the saliency of glimpsed locations and locations with nonzero attention mask values, thereby increasing the relative saliency of the remaining unattended regions:

$$[S_{t+1}]_{ij} = \begin{cases} 0 & \text{if } (i, j) \in \{(x, y)_1, (x, y)_2, \ldots, (x, y)_k\} \\ \max\left([S_t]_{ij} - [M_t]_{ij},\, 0\right) & \text{otherwise.} \end{cases}$$

This mechanism is motivated by the inhibition of return effect in the human visual system; after attention has been removed from a region, there is an increased response time to stimuli in the region, which may influence visual search and encourage attending to novel areas [13, 33].

**Saliency Model** The saliency model is a function $f_S : \mathcal{I} \to \mathcal{S}$ mapping an input image to a saliency map. Here, we use a saliency model that computes a map by compressing an activation volume using a reduction operation $F : \mathbb{R}^{C \times h_V \times w_V} \to \mathbb{R}^{h_V \times w_V}$ as in [46]. We choose $F(V) = \sum_{c=1}^{C} |V_c|^2$, and use the output of the activation model as $V$. Furthermore, the activation model is fine-tuned on a single-object dataset containing classes found in the multi-object dataset, so that the saliency model has high activations around classes of interest.
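Below is a minimal sketch of the activation-based saliency reduction and of the inhibition-of-return style update defined above, again assuming single-example tensors; the names are illustrative.

```python
import torch

def reduce_saliency(volume: torch.Tensor) -> torch.Tensor:
    """Bottom-up saliency from an activation volume: F(V) = sum_c V_c^2."""
    return (volume ** 2).sum(dim=0)                     # (h, w)

def update_saliency(saliency: torch.Tensor, mask: torch.Tensor, glimpses) -> torch.Tensor:
    """Suppress attended regions: subtract the covert attention mask, clamp at
    zero, and zero out the glimpsed locations."""
    updated = torch.clamp(saliency - mask, min=0.0)
    for (i, j) in glimpses:
        updated[i, j] = 0.0
    return updated
```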
### 4.1 Sequential Multiset Classification

Multi-label classification tasks can be categorized based on whether the labels are lists, sets, or multisets. We claim that multiset classification most closely resembles a human's free viewing of a scene; the exact labeling order of objects may vary by individual, and multiple instances of the same object may appear in a scene and receive individual labels. Specifically, let $D = \{(X_i, Y_i)\}_{i=1}^{n}$ be a dataset of images $X_i$ with labels $Y_i \subseteq \mathcal{Y}$, and consider the structure of $Y_i$. In list-based classification, the labels $Y_i = [y_1, \ldots, y_{|Y_i|}]$ have a consistent order, e.g. left to right. As a sequential prediction problem, there is exactly one true label for each prediction step, so a standard cross-entropy loss can be used at each prediction step, as in [3]. When the labels $Y_i = \{y_1, \ldots, y_{|Y_i|}\}$ are a set, one approach for sequential prediction is to impose an ordering $O(Y_i) \mapsto [y_{o_1}, \ldots, y_{o_{|Y_i|}}]$ as a preprocessing step, transforming the set-based problem into a list-based problem. For instance, $O(\cdot)$ may order the labels based on prevalence in the training data, as in [42]. Finally, multiset classification generalizes set-based classification to allow duplicate labels within an example, i.e. $Y_i = \{y_1^{m_1}, \ldots, y_{|Y_i|}^{m_{|Y_i|}}\}$, where $m_j$ denotes the multiplicity of label $y_j$. Here, we propose a training process that allows duplicate labels and is permutation-invariant with respect to the labels, removing the need for a hand-engineered ordering and supporting all three types of classification. With a saliency-based model, permutation invariance for labels is especially crucial, since the most salient (and hence first classified) object may not correspond to the first label.

### 4.2 Training

Our solution is to frame the problem in terms of maximizing a non-smooth reward function that encourages the desired classification and attention behavior, and to use reinforcement learning to maximize the expected reward. Assuming access to a trained saliency model and activation model, the meta-controller and controller can be jointly trained end-to-end.

**Reward** To support multiset classification, we propose a multiset-based reward for the controller's classification action. Specifically, consider an image $X$ with $m$ labels $Y = \{y_1, \ldots, y_m\}$. At meta-controller step $t$, $1 \le t \le m$, let $A_t$ be a multiset of available labels and let $f_t(X)$ be the corresponding class scores output by the controller. Then define

$$R_t^{\text{clf}} = \begin{cases} +1 & \text{if } \hat{y}_t \in A_t \\ -1 & \text{otherwise} \end{cases} \qquad\qquad A_{t+1} = \begin{cases} A_t \setminus \hat{y}_t & \text{if } \hat{y}_t \in A_t \\ A_t & \text{otherwise} \end{cases}$$

where $\hat{y}_t \sim \mathrm{softmax}(f_t(X))$ and $A_1 \equiv Y$. In short, a class label is sampled from the controller, and the controller receives a positive reward if and only if that label is in the multiset of available labels; if so, the label is removed from the available labels. Clearly, the reward for sampled labels $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_m$ equals the reward for $\sigma(\hat{y}_1), \sigma(\hat{y}_2), \ldots, \sigma(\hat{y}_m)$ for any permutation $\sigma$ of the $m$ elements. Note that list-based tasks can be supported by setting $A_t \equiv \{y_t\}$. The controller's location-choice actions simply receive a reward equal to the priority map value at the glimpse location, which encourages the controller to choose locations according to the priority map. That is, for locations $(x, y)_1, \ldots, (x, y)_k$ sampled from the controller, define $R_i^{\text{loc}} = P_{(x, y)_i}$.

**Objective** Let $n = 1 \ldots N$ index the example, $t = 1 \ldots M$ index the meta-controller step, and $i = 0 \ldots K$ index the controller step. The goal is to choose $\theta$ to maximize the total expected reward:

$$J(\theta) = \mathbb{E}_{p(\tau \mid f_\theta)} \left[ \sum_{n,t,i} R_{n,t,i} \right],$$

where the rewards $R_{n,t,i}$ are defined as above, and the expectation is over the distribution of trajectories produced using a model $f$ parameterized by $\theta$. An unbiased gradient estimator for $\theta$ can be obtained using the REINFORCE [43] estimator within the stochastic computation graph framework of Schulman et al. [34] as follows. Viewed as a stochastic computation graph, an input saliency map $S_{n,t}$ passes through a path of deterministic nodes, reaching the controller. Each of the controller's $k + 1$ steps produces a categorical parameter vector $p_{n,t,i}$, and a stochastic node is introduced by each sampling operation $a_{t,i} \sim p_{n,t,i}$. We then form a surrogate loss function $L(\theta) = \sum_{t,i} \log p_{t,i}\, R_{t,i}$ with the stochastic computation graph. By Corollary 1 of [34], the gradient of $L(\theta)$ gives an unbiased gradient estimator of the objective, which can be approximated using Monte-Carlo sampling: $\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta L(\theta)\right] \approx \frac{1}{B} \sum_{b=1}^{B} \nabla_\theta L(\theta)$. As is standard in reinforcement learning, a state-value function $V(s_{t,i})$ is used as a baseline to reduce the variance of the REINFORCE estimator, giving $L(\theta) = \sum_{t,i} \log p_{t,i} \left( R_{t,i} - V(s_{t,i}) \right)$. In our implementation, the controller outputs the state-value estimate, so that $s_{t,i}$ is the controller's hidden state.
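A minimal sketch of the multiset reward computation and of a REINFORCE-style surrogate with a value baseline is shown below; the explicit value-regression term and all names are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F
from collections import Counter

def classification_rewards(predicted_labels, true_labels):
    """+1 for each prediction still available in the label multiset, -1 otherwise.

    Matched labels are removed, so duplicates must be predicted the right
    number of times; the total reward is invariant to label order.
    """
    available = Counter(true_labels)
    rewards = []
    for y_hat in predicted_labels:
        if available[y_hat] > 0:
            rewards.append(1.0)
            available[y_hat] -= 1
        else:
            rewards.append(-1.0)
    return torch.tensor(rewards)

def surrogate_loss(log_probs, rewards, values):
    """REINFORCE surrogate with a learned state-value baseline.

    log_probs, rewards, values: 1-D tensors aligned over all (meta-controller,
    controller) steps of a trajectory. Minimizing the policy term ascends the
    expected-reward objective; the MSE term fits the baseline.
    """
    advantage = (rewards - values).detach()       # no gradient through the baseline here
    policy_loss = -(log_probs * advantage).sum()
    value_loss = F.mse_loss(values, rewards)
    return policy_loss + value_loss
```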
## 5 Experiments

We validate the classification performance, training process, and hierarchical attention with set-based and multiset-based classification experiments. To test the effectiveness of the permutation-invariant RL training, we compare against a baseline model that uses a cross-entropy loss on the probabilities $p_{t,i}$ and (randomly ordered) labels $y_i$ instead of the RL training, similar to the training proposed in [42].

Table 1: Metrics on the test set for the MNIST Set and Multiset tasks, and SVHN Multiset.

| Model | MNIST Set F1 | MNIST Set 0-1 | MNIST Multiset F1 | MNIST Multiset 0-1 | SVHN Multiset F1 | SVHN Multiset 0-1 |
|---|---|---|---|---|---|---|
| HSAL-RL | 0.990 | 0.960 | 0.978 | 0.935 | 0.973 | 0.947 |
| Cross-Entropy | 0.735 | 0.478 | 0.726 | 0.477 | 0.589 | 0.307 |

**Datasets** Two synthetic datasets, MNIST Set and MNIST Multiset, as well as the real-world SVHN dataset, are used. For MNIST Set and Multiset, each 100x100 image in the dataset has a variable number (1-4) of digits of varying sizes (20-50px) and positions, along with cluttering objects that introduce noise. Each label in an image from MNIST Set is unique, while MNIST Multiset images may contain duplicate labels. Each dataset is split into 60,000 training examples and 10,000 testing examples, and metrics are reported for the testing set. SVHN Multiset consists of SVHN examples with the label order randomized each time a batch is sampled. This removes the natural left-to-right order of the SVHN labels, turning the classification into a multiset task.

**Evaluation Metrics** To evaluate classification performance, macro-F1 and exact match (0-1) as defined in [26] are used. For evaluating the hierarchical attention mechanism, we use visualization as well as a saliency metric for the controller's glimpses, defined as

$$\mathrm{attn}_{\text{saliency}} = \frac{1}{k} \sum_{i=1}^{k} [S_t]_{(x, y)_i}$$

for a controller trajectory $(x, y)_1, \ldots, (x, y)_k, \hat{y}_t$ at meta-controller time step $t$, then averaged over all time steps and examples. A high score means that the controller tends to pick salient points as glimpse locations.

**Implementation Details** The activation and saliency model is a ResNet-34 network pre-trained on ImageNet. For the MNIST experiments, the ResNet is fine-tuned on a single-object MNIST Set dataset, and for SVHN it is fine-tuned by randomly selecting one of an image's labels each time a batch is sampled. Images are resized to 224x224, and the final (4th) convolutional layer is used ($V \in \mathbb{R}^{512 \times 7 \times 7}$). Since the label sets vary in size, the model is trained with an extra "stop" class, and during inference greedy argmax sampling is used until the "stop" class is predicted. See Supplementary Materials section 1 for further details.

### 5.1 Experimental Evaluation

In this section we analyze the model's classification performance, the contribution of the proposed RL training, and the behavior of the hierarchical attention mechanism.

**Classification Performance** Table 1 shows the evaluation metrics on the set-based and multiset-based classification tasks for the proposed hierarchical saliency-based model with RL training ("HSAL-RL") and the cross-entropy baseline ("Cross-Entropy") introduced above. HSAL-RL performs well across all metrics; on both the set and multiset tasks the model achieves very high precision, recall, and macro-F1 scores, but as expected, the multiset task is more difficult. We conclude that the proposed model and training process is effective for these set and multiset image classification tasks.

**Contribution of RL training** As seen in Table 1, performance is greatly reduced when the standard cross-entropy training is used, which is not invariant to the label ordering. This shows the importance of the RL training, which only assumes that the predictions are some permutation of the labels.

**Controller Attention** Based on $\mathrm{attn}_{\text{saliency}}$, the controller learns to glimpse in salient regions more often as training progresses, starting at 58.7 and ending at 126.5 (see the graph in Supplementary Materials section 2). The baseline, which does not have the reward signal for its glimpses, fails to improve over training (remaining near 25), demonstrating the importance and effectiveness of the controller's glimpse rewards.
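For concreteness, a small sketch of how the $\mathrm{attn}_{\text{saliency}}$ score could be computed from logged trajectories is given below; it is an illustrative reconstruction of the metric's definition, not the paper's evaluation code.

```python
import numpy as np

def attention_saliency(saliency_maps, glimpse_locations):
    """Mean saliency at the controller's glimpse locations.

    saliency_maps:     list of 2-D arrays S_t, one per meta-controller step
    glimpse_locations: list of the k (row, col) glimpses taken at each step
    The score is averaged over glimpses within a step, then over steps.
    """
    per_step = [np.mean([S_t[i, j] for (i, j) in glimpses])
                for S_t, glimpses in zip(saliency_maps, glimpse_locations)]
    return float(np.mean(per_step))
```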
**Hierarchical Attention Visualization** Figure 2 visualizes the hierarchical attention mechanism on three example inference processes; see Supplementary Materials section 4 for more examples, which we discuss here. In general, the upper-level attention highlights a region encompassing a digit, and the lower level glimpses near the digit before classifying. Notice the saliency map update over time, the priority map's structure due to the Gaussian attention mechanism, and the variable-sized focus of the priority map followed by finer-grained glimpses. Note that the predicted labels need not be in the same order as the ground truth labels (e.g. "689"), and that the model can predict multiple instances of a label (e.g. "33", "449"), illustrating multiset prediction. In some cases, the upper-level attention is sufficient to classify the object without the controller taking related glimpses, as in "373", where the glimpses are in a blank region for the 7. In "722", the covert attention is initially placed on both the 7 and the 2, then the controller focuses only on the 7; this can be interpreted as using the multiple-spotlight capability of covert attention, then directing overt attention to a single target.

Figure 2: The inference process showing the hierarchical attention on three different examples. Each column represents a single meta-controller step, two controller glimpses, and classification.

### 5.2 Limitations

**Saliency Map Input** Since the saliency map is the only top-level input, the quality of the saliency model is a potential performance bottleneck. As Figure 4 shows, in general there is no guarantee that all objects of interest will have high saliency relative to the locations around them. However, the modular architecture allows for plugging in alternative, rigorously evaluated saliency models, such as a state-of-the-art saliency model trained with human fixation data [10].

**Activation Resolution** Currently, the activation model returns the highest-level convolutional activations, which have a 7x7 spatial dimension for a 224x224 image. Consider the case shown in Figure 3. Even if the controller acted optimally, activations for multiple digits would be included in its glimpse vector due to the low resolution. This suggests activations with higher spatial resolution are needed, perhaps by incorporating dilated convolutions [45] or by using lower-level activations at attended areas, motivated by covert attention's known enhancement of spatial resolution [6, 7, 14].

Figure 3: The location of highest saliency from a 7x7 saliency map (right) is projected onto the 224x224 image (left).

Figure 4: The cat is a label in the ground truth set but does not have high salience relative to its surroundings.

## 6 Conclusion

We proposed a novel architecture, attention mechanism, and RL-based training process for sequential image attention, supporting multiset classification. The proposal is a first step towards incorporating notions of saliency, covert and overt attention, and sequential processing motivated by the biological visual attention literature into deep learning architectures for downstream vision tasks.

**Acknowledgments** This work was partly supported by NYU Global Seed Funding, STCSM 17JC1404100/1, and Huawei HIPP Open 2017.

## References

[1] Edward Awh and Harold Pashler. Evidence for split attentional foci. 26:834-846, 2000.

[2] Jimmy Ba, Roger Grosse, Ruslan Salakhutdinov, and Brendan Frey. Learning wake-sleep recurrent attention models. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 2593-2601, Cambridge, MA, USA, 2015. MIT Press.

[3] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.

[4] Miriam Bellver, Xavier Giro-i-Nieto, Ferran Marques, and Jordi Torres. Hierarchical object detection with deep reinforcement learning. arXiv preprint arXiv:1611.03718, 2016.

[5] Juan C. Caicedo and Svetlana Lazebnik. Active object localization with deep reinforcement learning.
In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV'15, pages 2488-2496, Washington, DC, USA, 2015. IEEE Computer Society.

[6] Marisa Carrasco. Visual attention: The past 25 years. Vision Research, 51(13):1484-1525, 2011. Vision Research 50th Anniversary Issue: Part 2.

[7] Marisa Carrasco, Patrick E. Williams, and Yaffa Yeshurun. Covert attention increases spatial resolution with or without masks: Support for signal enhancement. Journal of Vision, 2:467-479, 2002.

[8] Patrick Cavanagh and George A. Alvarez. Tracking multiple targets with multifocal attention. Trends in Cognitive Sciences, 9(7):349-354, 2005.

[9] Brian Cheung, Eric Weiss, and Bruno Olshausen. Emergence of foveal image sampling from learning to attend in visual scenes. arXiv preprint arXiv:1611.09430, 2016.

[10] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. Predicting human eye fixations via an LSTM-based saliency attentive model. arXiv preprint arXiv:1611.09571, 2016.

[11] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. Paying more attention to saliency: Image captioning with saliency and context attention. arXiv preprint arXiv:1706.08474, 2017.

[12] Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In Advances in Neural Information Processing Systems 5 (NIPS), pages 271-278, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc.

[13] Jillian H. Fecteau and Douglas P. Munoz. Salience, relevance, and firing: a priority map for target selection. Trends in Cognitive Sciences, 10(8):382-390, 2006.

[14] Jason Fischer and David Whitney. Attention narrows position tuning of population responses in V1. Current Biology, 19(16):1356-1361, 2009.

[15] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[16] Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. Generating visual explanations. In European Conference on Computer Vision (ECCV), 2016.

[17] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259, 1998.

[18] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. arXiv preprint arXiv:1506.02025, 2015.

[19] Sergey Karayev, Tobias Baumgartner, Mario Fritz, and Trevor Darrell. Timely object recognition. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 890-898. Curran Associates, Inc., 2012.

[20] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4(4):219-227, 1985.

[21] Eileen Kowler. Eye movements: The past 25 years. Vision Research, 51(13):1457-1483, 2011.

[22] Tejas D. Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3675-3683. Curran Associates, Inc., 2016.

[23] Victor A. F. Lamme and Pieter R. Roelfsema. The distinct modes of vision offered by feedforward and recurrent processing. Trends in Neurosciences, 23(11):571-579, 2000.

[24] Hugo Larochelle and Geoffrey E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine.
In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1243-1251. Curran Associates, Inc., 2010.

[25] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille. The secrets of salient object segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 280-287, June 2014.

[26] Yuncheng Li, Yale Song, and Jiebo Luo. Improving pairwise ranking for multi-label image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[27] Xiao Liu, Jiang Wang, Shilei Wen, Errui Ding, and Yuanqing Lin. Localizing by describing: Attribute-guided attention localization for fine-grained recognition. arXiv preprint arXiv:1605.06217, 2016.

[28] S. Mathe, A. Pirinen, and C. Sminchisescu. Reinforcement learning for visual object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2894-2902, June 2016.

[29] Stefan Mathe and Cristian Sminchisescu. Action from still image dataset and inverse optimal control to learn task specific visual scanpaths. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1923-1931. Curran Associates, Inc., 2013.

[30] Stephanie A. McMains and David C. Somers. Multiple spotlights of attentional selection in human visual cortex. Neuron, 42(4):677-686, 2004.

[31] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2204-2212. Curran Associates, Inc., 2014.

[32] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning.

[33] Raymond M. Klein. Inhibition of return. Trends in Cognitive Sciences, 4(4):138-147, 2000.

[34] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 3528-3536, Cambridge, MA, USA, 2015. MIT Press.

[35] Stanislau Semeniuta and Erhardt Barth. Image classification with recurrent attention models. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1-7. IEEE, 2016.

[36] Paul Hongsuck Seo, Zhe Lin, Scott Cohen, Xiaohui Shen, and Bohyung Han. Progressive attention networks for visual attribute prediction. arXiv preprint arXiv:1606.02393, 2016.

[37] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[38] Hamed R. Tavakoli and Jorma Laaksonen. Towards instance segmentation with object priority: Prominent object detection and recognition. arXiv preprint arXiv:1704.07402, 2017.

[39] Richard Veale, Ziad M. Hafed, and Masatoshi Yoshida. How is visual salience computed in the brain? Insights from behaviour, neurobiology and modelling. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 372(1714), 2017.

[40] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning.
arXiv preprint arXiv:1703.01161, 2017.

[41] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[42] Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. CNN-RNN: A unified framework for multi-label image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[43] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.

[44] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.

[45] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

[46] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.

[47] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.