Self-Adaptable Templates for Feature Coding

Xavier Boix1,2  Gemma Roig1,2  Salomon Diether1  Luc Van Gool1
1Computer Vision Laboratory, ETH Zurich, Switzerland
2LCSL, Massachusetts Institute of Technology & Istituto Italiano di Tecnologia, Cambridge, MA
{xboix,gemmar}@mit.edu  {boxavier,gemmar,sdiether,vangool}@vision.ee.ethz.ch

Hierarchical feed-forward networks have been successfully applied to object recognition. At each level of the hierarchy, features are extracted and encoded, followed by a pooling step. Within this processing pipeline, the common trend is to learn the feature coding templates, often referred to as codebook entries, filters, or over-complete bases. Recently, an approach that apparently does not use templates has been shown to obtain very promising results: second-order pooling (O2P) [1, 2, 3, 4, 5]. In this paper, we analyze O2P as a coding-pooling scheme. We find that at the testing phase, O2P automatically adapts the feature coding templates to the input features, rather than using templates learned during the training phase. From this finding, we are able to bring common concepts of coding-pooling schemes to O2P, such as feature quantization. This allows for significant accuracy improvements of O2P on standard benchmarks for image classification, namely Caltech101 and VOC07.

1 Introduction

Many object recognition schemes, inspired by biological vision, are based on feed-forward hierarchical architectures, e.g. [6, 7, 8]. At each level of the hierarchy, the algorithms can usually be divided into the steps of feature coding and spatial pooling. The feature coding extracts similarities between the set of input features and a set of templates (the so-called filters, over-complete bases, or codebook), and then the similarity responses are transformed using some non-linearities. Finally, the spatial pooling extracts a single vector from the set of transformed responses.
The specific architecture of the network (e.g. how many layers), and the specific algorithms for the coding-pooling at each layer, are usually set for a given recognition task and dataset, cf. [9]. Second-order pooling (O2P) is an alternative to the aforementioned coding-pooling schemes. O2P was introduced in medical imaging to analyze magnetic resonance images [1, 2], and lately, O2P has achieved state-of-the-art results on some traditional computer vision tasks [3, 4, 5, 10]. A surprising fact about O2P is that it is formulated without feature coding templates [5]. This is in contrast to the common coding-pooling schemes, in which the templates are learned during a training phase and, at the testing phase, remain fixed to the learned values.

Motivated by the intriguing properties of O2P, in this paper we try to re-formulate O2P as a coding-pooling scheme. In doing so, we find that O2P actually computes similarities to feature coding templates, as the rest of the coding-pooling schemes do. Yet, what remains uncommon about O2P is that the templates are recomputed for each specific input, rather than being fixed to learned values. In O2P, the templates are self-adapted to the input, and hence, they do not require learning.

From our formulation, we are able to bring common concepts of coding-pooling schemes to O2P, such as feature quantization. This allows us to achieve significant improvements in the accuracy of O2P for image classification. We report experiments on two challenging benchmarks for image classification, namely Caltech101 [11] and VOC07 [12].

(Both first authors contributed equally.)

2 Preliminaries

In this section, we introduce O2P as well as several coding-pooling schemes, and identify some common terminology from the literature. This will serve as a basis for the new formulation of O2P that we introduce in the following section.
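As a concrete point of reference for the O2P operation discussed throughout the paper, the following is a minimal sketch. It assumes the log-Euclidean variant (a matrix logarithm applied to the average of outer products of the features) common in the O2P literature; the function name, the regularization constant `eps`, and the use of the upper triangle as the output vector are illustrative choices, not taken from this paper.

```python
import numpy as np

def o2p(X, eps=1e-6):
    """Second-order pooling of a feature set X (N x M).

    Pools the features into a single M x M matrix of second-order
    statistics (the average of outer products), then applies a matrix
    logarithm (log-Euclidean mapping).  `eps` regularizes the matrix
    so the logarithm is well defined; its value is illustrative.
    """
    G = X.T @ X / X.shape[0]             # average outer product, M x M
    G += eps * np.eye(G.shape[0])        # keep it positive definite
    w, V = np.linalg.eigh(G)             # eigendecomposition (symmetric matrix)
    logG = (V * np.log(w)) @ V.T         # matrix log via the eigenvalues
    return logG[np.triu_indices_from(logG)]  # upper triangle as the output vector
```

Note that no templates appear anywhere in this computation, which is the "surprising fact" the paper sets out to explain.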
The algorithms that we analyze in this section are usually part of a layer of a hierarchical network for object recognition. The input to these algorithms is a set of feature vectors that come from the output of the previous layer, or from the raw image. Let {x_i} be the set of N input feature vectors, x_i ∈ R^M, indexed by i ∈ {1, ..., N}. The output of the algorithm is a single vector, which we denote as y, and it may have a different dimensionality than the input vectors. In the following subsections, we present the algorithms and terminology of template-based methods, and then we introduce the formulation of O2P that appears in the literature, which apparently does not use templates.

2.1 Coding-Pooling based on Evaluating Similarities to Templates

Template-based methods are built upon similarities between the input vectors and a set of templates. Depending on the terminology of each algorithm, the templates may be denoted as filters, a codebook, or an over-complete basis. From now on, we will refer to all of them as templates. We denote the set of P templates as {b_k ∈ R^M}. In this paper, b_k and the input feature vectors x_i have the same dimensionality, M. The set of templates is fixed to learned values during the training phase. There are many possible learning algorithms, but analyzing them is not necessary here.

The algorithms that are interesting for our purposes start by computing a similarity measure between the input feature vectors {x_i} and the templates {b_k}. Let Γ(x_i, b_k) be the similarity function, which depends on each algorithm. We define γ_i ∈ R^P as the vector that contains the similarities of x_i to the set of templates {b_k}, and γ ∈ R^{P×N} as the matrix whose columns are the vectors γ_i, i.e.

    γ_{ki} = Γ(x_i, b_k).    (1)

Once γ is computed, the algorithms that we analyze apply some non-linear transformation to γ, and then the resulting responses are merged together with the so-called pooling operation.
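Eq. (1) can be sketched as follows. The negative squared ℓ2 distance used as Γ here is an illustrative choice (as noted above, Γ depends on each algorithm), and the function name is ours.

```python
import numpy as np

def similarity_matrix(X, B):
    """gamma[k, i] = Gamma(x_i, b_k), as in Eq. (1).

    X: N x M input feature vectors; B: P x M templates.
    Gamma here is the negative squared l2 distance -- an
    illustrative choice, since each algorithm defines its own.
    """
    # ||x - b||^2 = ||b||^2 + ||x||^2 - 2 <b, x>, computed in batch
    sq = (B**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2.0 * B @ X.T
    return -sq  # gamma, a P x N matrix of similarities
```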
The pooling consists of generating a single response value for each template. We denote by g_k(γ) the function that includes both the non-linear transformation and the pooling operation, where g_k : R^{P×N} → R. We include both operations in the same function, but in the literature they are usually presented as two separate steps. Finally, the output vector y is built from {g_k(γ)}, {b_k} and {x_i}, depending on the algorithm. It is also quite common to concatenate the outputs of neighboring regions to generate the final output of the layer.

We now show how the presented terminology applies to some methods based on evaluating similarities to templates, namely assignment-based methods and the Fisher Vector. In the sequel, these algorithms will serve as a basis to reformulate O2P.

Assignment-based Methods. The popular Bag-of-Words and some of its variants fall into this category, e.g. [13, 14, 15]. These methods consist of assigning each input vector x_i to a set of templates (the so-called vector quantization), and then building a histogram of the assignments, which corresponds to the average pooling operation. We now present them using our terminology. After computing the similarities to the templates, γ (usually based on the ℓ2 distance), g_k(γ) computes both the vector quantization and the pooling. Let s be the number of templates to which each input vector is assigned, and let γ̂_i be the resulting assignment vector of x_i (i.e. γ̂_i is the result of applying vector quantization to x_i). γ̂_i has s entries set to 1 and the rest set to 0, which indicate the assignment. Finally, g_k(γ) also computes the pooling for the assignments corresponding to template k, i.e.

    g_k(γ) = (1/N) Σ_i γ̂_{ki}.
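The vector quantization and average pooling steps above can be sketched as follows: the hard top-s assignment builds the vectors γ̂_i, and the row-wise mean implements g_k(γ). Function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def bow_pool(gamma, s=1):
    """Hard-assignment coding followed by average pooling.

    gamma: P x N similarity matrix (Eq. (1)).  Each input is assigned
    to its s most similar templates (vector quantization), and the
    histogram of assignments is averaged over the N inputs.
    """
    P, N = gamma.shape
    assign = np.zeros((P, N))                # the assignment vectors, one column per input
    top = np.argsort(-gamma, axis=0)[:s]     # indices of the s most similar templates
    assign[top, np.arange(N)] = 1.0          # s entries per column set to 1
    return assign.mean(axis=1)               # g_k(gamma): one averaged value per template
```

With s = 1 this reduces to the classical Bag-of-Words histogram of nearest-template counts, normalized by N.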