Self-Adaptable Templates for Feature Coding

Xavier Boix1,2  Gemma Roig1,2  Salomon Diether1  Luc Van Gool1
1Computer Vision Laboratory, ETH Zurich, Switzerland
2LCSL, Massachusetts Institute of Technology & Istituto Italiano di Tecnologia, Cambridge, MA
{xboix,gemmar}@mit.edu  {boxavier,gemmar,sdiether,vangool}@vision.ee.ethz.ch

Hierarchical feed-forward networks have been successfully applied to object recognition. At each level of the hierarchy, features are extracted and encoded, followed by a pooling step. Within this processing pipeline, the common trend is to learn the feature coding templates, often referred to as codebook entries, filters, or over-complete bases. Recently, an approach that apparently does not use templates has been shown to obtain very promising results: second-order pooling (O2P) [1, 2, 3, 4, 5]. In this paper, we analyze O2P as a coding-pooling scheme. We find that at the testing phase, O2P automatically adapts the feature coding templates to the input features, rather than using templates learned during the training phase. From this finding, we are able to bring common concepts of coding-pooling schemes to O2P, such as feature quantization. This allows for significant accuracy improvements of O2P on standard benchmarks for image classification, namely Caltech101 and VOC07.

1 Introduction

Many object recognition schemes, inspired by biological vision, are based on feed-forward hierarchical architectures, e.g. [6, 7, 8]. At each level of the hierarchy, the algorithms can usually be divided into the steps of feature coding and spatial pooling. The feature coding extracts similarities between the set of input features and a set of templates (the so-called filters, over-complete bases, or codebook), and then the similarity responses are transformed using some non-linearities. Finally, the spatial pooling extracts a single vector from the set of transformed responses.
The specific architecture of the network (e.g. how many layers), and the specific algorithms for the coding-pooling at each layer, are usually set for a given recognition task and dataset, cf. [9]. Second-order pooling (O2P) is an alternative to the aforementioned coding-pooling schemes. O2P was introduced in medical imaging to analyze magnetic resonance images [1, 2], and lately, O2P has achieved state-of-the-art results on some traditional computer vision tasks [3, 4, 5, 10]. A surprising fact about O2P is that it is formulated without feature coding templates [5]. This is in contrast to the common coding-pooling schemes, in which the templates are learned during a training phase and, at the testing phase, remain fixed to the learned values.

Motivated by the intriguing properties of O2P, in this paper we try to re-formulate O2P as a coding-pooling scheme. In doing so, we find that O2P actually computes similarities to feature coding templates, as the rest of the coding-pooling schemes do. Yet, what remains uncommon about O2P is that the templates are recomputed for each specific input, rather than being fixed to learned values. In O2P, the templates are self-adapted to the input, and hence, they do not require learning.

From our formulation, we are able to bring common concepts of coding-pooling schemes to O2P, such as feature quantization. This allows us to achieve significant improvements in the accuracy of O2P for image classification. We report experiments on two challenging benchmarks for image classification, namely Caltech101 [11] and VOC07 [12].

(Both first authors contributed equally.)

2 Preliminaries

In this section, we introduce O2P as well as several coding-pooling schemes, and identify some common terminology from the literature. This will serve as a basis for the new formulation of O2P that we introduce in the following section.
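As a concrete point of reference for the O2P operation discussed throughout the paper, the following is a minimal sketch. It assumes the log-Euclidean variant (a matrix logarithm applied to the average of outer products of the features) common in the O2P literature; the function name, the regularization constant `eps`, and the use of the upper triangle as the output vector are illustrative choices, not taken from this paper.

```python
import numpy as np

def o2p(X, eps=1e-6):
    """Second-order pooling of a feature set X (N x M).

    Pools the features into a single M x M matrix of second-order
    statistics (the average of outer products), then applies a matrix
    logarithm (log-Euclidean mapping).  `eps` regularizes the matrix
    so the logarithm is well defined; its value is illustrative.
    """
    G = X.T @ X / X.shape[0]             # average outer product, M x M
    G += eps * np.eye(G.shape[0])        # keep it positive definite
    w, V = np.linalg.eigh(G)             # eigendecomposition (symmetric matrix)
    logG = (V * np.log(w)) @ V.T         # matrix log via the eigenvalues
    return logG[np.triu_indices_from(logG)]  # upper triangle as the output vector
```

Note that no templates appear anywhere in this computation, which is the "surprising fact" the paper sets out to explain.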
The algorithms that we analyze in this section are usually part of a layer of a hierarchical network for object recognition. The input to these algorithms is a set of feature vectors that come from the output of the previous layer, or from the raw image. Let {x_i} be the set of N input feature vectors, x_i ∈ R^M, indexed by i ∈ {1, ..., N}. The output of the algorithm is a single vector, which we denote as y, and it may have a different dimensionality than the input vectors. In the following subsections, we present the algorithms and terminology of template-based methods, and then we introduce the formulation of O2P that appears in the literature, which apparently does not use templates.

2.1 Coding-Pooling based on Evaluating Similarities to Templates

Template-based methods are built upon similarities between the input vectors and a set of templates. Depending on the terminology of each algorithm, the templates may be denoted as filters, a codebook, or an over-complete basis. From now on, we will refer to all of them as templates. We denote the set of P templates as {b_k ∈ R^M}. In this paper, b_k and the input feature vectors x_i have the same dimensionality, M. The set of templates is fixed to learned values during the training phase. There are many possible learning algorithms, but analyzing them is not necessary here.

The algorithms that are interesting for our purposes start by computing a similarity measure between the input feature vectors {x_i} and the templates {b_k}. Let Γ(x_i, b_k) be the similarity function, which depends on each algorithm. We define γ_i ∈ R^P as the vector that contains the similarities of x_i to the set of templates {b_k}, and γ ∈ R^{P×N} as the matrix whose columns are the vectors γ_i, i.e.

    γ_{ki} = Γ(x_i, b_k).    (1)

Once γ is computed, the algorithms that we analyze apply some non-linear transformation to γ, and then the resulting responses are merged together with the so-called pooling operation.
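Eq. (1) can be sketched as follows. The negative squared ℓ2 distance used as Γ here is an illustrative choice (as noted above, Γ depends on each algorithm), and the function name is ours.

```python
import numpy as np

def similarity_matrix(X, B):
    """gamma[k, i] = Gamma(x_i, b_k), as in Eq. (1).

    X: N x M input feature vectors; B: P x M templates.
    Gamma here is the negative squared l2 distance -- an
    illustrative choice, since each algorithm defines its own.
    """
    # ||x - b||^2 = ||b||^2 + ||x||^2 - 2 <b, x>, computed in batch
    sq = (B**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2.0 * B @ X.T
    return -sq  # gamma, a P x N matrix of similarities
```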
The pooling consists of generating a single response value for each template. We denote by g_k(γ) the function that includes both the non-linear transformation and the pooling operation, where g_k : R^{P×N} → R. We include both operations in the same function, but in the literature they are usually presented as two separate steps. Finally, the output vector y is built from {g_k(γ)}, {b_k} and {x_i}, depending on the algorithm. It is also quite common to concatenate the outputs of neighboring regions to generate the final output of the layer.

We now show how the presented terminology applies to some methods based on evaluating similarities to templates, namely assignment-based methods and the Fisher Vector. In the sequel, these algorithms will serve as a basis to reformulate O2P.

Assignment-based Methods. The popular Bag-of-Words and some of its variants fall into this category, e.g. [13, 14, 15]. These methods consist of assigning each input vector x_i to a set of templates (the so-called vector quantization), and then building a histogram of the assignments, which corresponds to the average pooling operation. We now present them using our terminology. After computing the similarities to the templates, γ (usually based on the ℓ2 distance), g_k(γ) computes both the vector quantization and the pooling. Let s be the number of templates to which each input vector is assigned, and let γ̂_i be the resulting assignment vector of x_i (i.e. γ̂_i is the result of applying vector quantization to x_i). γ̂_i has s entries set to 1 and the rest set to 0, which indicate the assignment. Finally, g_k(γ) also computes the pooling for the assignments corresponding to template k, i.e.

    g_k(γ) = (1/N) Σ_i γ̂_{ki}.
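The vector quantization and average pooling steps above can be sketched as follows: the hard top-s assignment builds the vectors γ̂_i, and the row-wise mean implements g_k(γ). Function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def bow_pool(gamma, s=1):
    """Hard-assignment coding followed by average pooling.

    gamma: P x N similarity matrix (Eq. (1)).  Each input is assigned
    to its s most similar templates (vector quantization), and the
    histogram of assignments is averaged over the N inputs.
    """
    P, N = gamma.shape
    assign = np.zeros((P, N))                # the assignment vectors, one column per input
    top = np.argsort(-gamma, axis=0)[:s]     # indices of the s most similar templates
    assign[top, np.arange(N)] = 1.0          # s entries per column set to 1
    return assign.mean(axis=1)               # g_k(gamma): one averaged value per template
```

With s = 1 this reduces to the classical Bag-of-Words histogram of nearest-template counts, normalized by N.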