# Object Recognition with Hidden Attributes

Xiaoyang Wang and Qiang Ji, Rensselaer Polytechnic Institute, Troy, NY, USA. xiaoyang.wangs@gmail.com, jiq@rpi.edu

**Abstract.** Attribute-based object recognition performs object recognition using the semantic properties of the object. Unlike existing approaches that treat attributes as a middle-level representation and require estimating the attributes during testing, we propose to incorporate hidden attributes, which are attributes used only during training to improve model learning and are not needed during testing. To achieve this goal, we develop two different approaches to incorporate hidden attributes. The first approach utilizes hidden attributes as additional information to improve the object classification model. The second approach further exploits the semantic relationships between the objects and the hidden attributes. Experiments on benchmark data sets demonstrate that both approaches effectively improve the learning of the object classifiers over baseline models that do not use attributes, and that their combination reaches the best performance. Experiments also show that the proposed approaches outperform both state-of-the-art methods that use attributes as a middle-level representation and approaches that learn classifiers with hidden information.

## 1 Introduction

Object recognition in computer vision generally refers to the classification of object images into different categories such as *bird*, *aeroplane*, *bicycle*, etc. In recent years, computer vision researchers have explored assigning a list of attributes [Farhadi et al., 2009] to object images. These attributes [Ferrari and Zisserman, 2007] are manually specified, semantically meaningful descriptions of the object shape (e.g. *is cylindrical*), parts (e.g. *has head*, *has leg*), materials (e.g. *made of wood*), color (e.g. *is red*), etc. Approaches that utilize the assigned attributes to benefit object recognition [Farhadi et al., 2009; Lampert et al., 2009; Wang and Mori, 2010; Parikh and Grauman, 2011a; Kovashka et al., 2011] can be called attribute-based object recognition.

*Figure 1: An example of attributes for two animals, where the bird has beak, has wing and is covered with feather, and the cow has ear, has snout and is furry.*

Most existing attribute-based object recognition approaches (e.g. [Farhadi et al., 2009; Lampert et al., 2009; Wang and Mori, 2010; Parikh and Grauman, 2011a; 2011b; Kovashka et al., 2011]) utilize attributes as an intermediate layer in a classifier cascade. In the testing phase of these approaches, attributes are first predicted by pre-trained attribute classifiers, and these predicted attributes are then used by attribute-based object classifiers as mid-level input for object recognition. Typical applications of these approaches include zero-shot transfer learning [Lampert et al., 2009], description of unfamiliar objects [Farhadi et al., 2009], and improving object classification [Wang and Mori, 2010], event recognition [Wang and Ji, 2012], and phone recognition [Zhao et al., 2015]. However, due to the tremendous variations in vision applications, attribute recognition itself is challenging. Moreover, poor-quality attribute measurements at the middle level adversely affect the subsequent object classification.
This dilemma motivates us to avoid using attributes as a middle-level representation and to explore incorporating attributes in a different way, where we have access to ground truth attributes during training but do not utilize predicted attributes, explicitly or implicitly, for final-stage recognition during testing. In this paper, we call these attributes that are available only during training *hidden attributes*. We expect that the hidden attributes utilized in our approach can still improve object recognition.

This hidden attribute setting lies in the learning with hidden information (LHI) paradigm [Vapnik and Vashist, 2009]. In this paradigm, the hidden information $a$ is utilized only during training. It helps learn a better classifier (e.g. the linear classifier $y = \mathrm{sign}(w^\top x)$) from feature $x$ to label $y$ that can outperform the traditional classifier (e.g. $y = \mathrm{sign}(w_0^\top x)$) learned without hidden information. Hence, in this paradigm, the hidden information serves only to obtain a better parameter vector $w$ of the same dimension as the original parameter vector $w_0$. Compared to the traditional mid-level attribute approaches, this paradigm avoids propagating erroneous attribute predictions to the subsequent object classification.

In this paper, we propose two novel formulations to incorporate hidden attributes during model learning. Our first approach (xLR+) utilizes hidden attributes as additional information to improve the target model that predicts the object label $y$ from the image feature $x$. In the second approach (LR-Rel+), we further incorporate the semantic relationships between the objects and the hidden attributes. Finally, we combine both formulations as regularization terms in one unified learning objective (xLR-Rel+), which achieves the best performance. In summary, the major contributions of this work are: 1) we propose to incorporate hidden attributes for object classification; 2) we propose the formulations xLR+, LR-Rel+ and xLR-Rel+ that use hidden attributes as extra information and exploit their relationships with objects.

## 2 Related Work

Utilizing attributes to enhance object recognition performance has drawn great attention in recent years. Work in this area can be divided into sequential attribute and object recognition, and joint attribute and object recognition. The sequential approaches [Farhadi et al., 2009; Lampert et al., 2009; Parikh and Grauman, 2011a; 2011b; Akata et al., 2013; Wang and Ji, 2014] utilize attributes as an intermediate representation between low-level image features and high-level categories. [Farhadi et al., 2009] use linear classifiers such as SVMs to predict attributes from shared image features, and then use the predicted attributes for object categorization. However, the sequential approaches still require training attribute classifiers on the training data and then predicting attribute labels or inferring attribute scores during testing. The performance of these methods is therefore subject to the performance of the attribute classifiers. To address this problem, the joint approaches perform attribute and object recognition simultaneously in order to exploit their interdependencies. Wang and Ji [Wang and Ji, 2013] utilize a Bayesian network (BN) with learned structure to improve both attribute prediction and object recognition using the captured attribute relationships.
Also, the multi-task learning approach in [Hwang et al., 2011] simultaneously learns multiple classifiers for the object recognition and attribute prediction tasks based on a shared-feature assumption. The joint approaches need to learn attribute and object classifiers simultaneously and are therefore computationally complex. In contrast, our approach focuses only on the object recognition task.

Recently, [Wang et al., 2014] proposed to incorporate hidden information for learning a logistic regression classifier, LR+. While its formulation looks similar to the first formulation in this paper, our work differs significantly from [Wang et al., 2014]. The work in [Wang et al., 2014] studies learning LR+ using hidden information as extra information. Comparatively, our work focuses on object recognition with hidden attributes. We incorporate hidden attributes as extra information in xLR+, exploit the object-attribute relationships in LR-Rel+, and further propose a combined formulation xLR-Rel+. Both LR-Rel+ and xLR-Rel+ are completely different from the LR+ in [Wang et al., 2014]. Our xLR+, LR-Rel+, and xLR-Rel+ approaches all outperform the LR+ approach of [Wang et al., 2014] in the experiments.

## 3 Object Recognition with Hidden Attributes

We first define both traditional supervised object recognition and the proposed object recognition with hidden attributes.

Traditional supervised object recognition can generally be formulated as: given a set of $N$ labeled training samples represented by the image feature vectors $X = \{x_1, \dots, x_N\} \subset \mathcal{X} \subseteq \mathbb{R}^d$ and the object labels $Y = \{y_1, \dots, y_N\} \subset \mathcal{Y}$ with $\mathcal{Y} = \{-1, 1\}$ for binary cases, learn a mapping function $f: \mathcal{X} \to \mathbb{R}$ with parameter $w$ from the function space $\mathcal{F}$ of all admissible functions (e.g. all linear functions $f(x) = w^\top x$) to predict the object label $y$ from the input image feature $x$ as accurately as possible. Generally, the object classifier parameter $w$ can be learned by minimizing the objective in Equation 1, where $l(y_i, x_i; w)$ is the loss function and $\|w\|_2^2$ is a regularization term to avoid overfitting:

$$\min_{w} \; \sum_{i=1}^{N} l(y_i, x_i; w) + \gamma \|w\|_2^2 \tag{1}$$

Object recognition with hidden attributes differs from the traditional supervised problem in that additional hidden information vectors (i.e., the ground truth attributes in this paper) $A = \{a_1, \dots, a_N\} \subset \mathcal{A}$ are also provided, where each $M$-dimensional vector $a_i$ corresponds to the training sample pair $(x_i, y_i)$. Object recognition with hidden attributes can hence be stated as: given $N$ labeled training triplets $\{(x_i, a_i, y_i)\}_{i=1}^{N}$, learn a mapping function $f': \mathcal{X} \to \mathbb{R}$ with parameter $w'$ from the same function space $\mathcal{F}$ of all admissible functions to predict the object label $y$ from the input image feature $x$ as accurately as possible. Following this definition, the new mapping function $f': \mathcal{X} \to \mathbb{R}$ does not depend on the hidden attribute space $\mathcal{A}$, but the hidden attributes influence the parameter $w'$ during training. We expect $w'$ to be better than $w$ at predicting $y$ from $x$.
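To make the baseline concrete, the following is a minimal numpy sketch (our own illustration, not code from the paper) of the regularized objective in Equation 1 and its gradient, instantiated with the logistic loss that Section 3.1 later adopts; all function names here are hypothetical.

```python
import numpy as np

def lr_loss_and_grad(w, X, y):
    """Logistic loss sum_i ln(1 + exp(-y_i w^T x_i)) and its gradient w.r.t. w."""
    margins = y * (X @ w)                             # shape (N,)
    loss = np.sum(np.log1p(np.exp(-margins)))
    grad = X.T @ (-y / (1.0 + np.exp(margins)))       # sum_i -y_i x_i / (1 + exp(m_i))
    return loss, grad

def baseline_objective(w, X, y, gamma):
    """Equation 1: sum_i l(y_i, x_i; w) + gamma * ||w||_2^2, with its gradient."""
    loss, grad = lr_loss_and_grad(w, X, y)
    return loss + gamma * (w @ w), grad + 2.0 * gamma * w
```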
### 3.1 Hidden Attributes as Extra Information

In the object recognition with hidden attributes setting, hidden attributes can be utilized as additional information to learn a mapping function $g: \mathcal{A} \to \mathbb{R}$ with parameter $w^*$ for the prediction of the class label $y$. In this paper, we call the mapping function $g$ with parameter $w^*$ the hypothetical model, and the function $f: \mathcal{X} \to \mathbb{R}$ the target model.

As shown in Figure 2, during training, the hypothetical model and the target model share the same object class labels $\{y_i\}_{i=1,\dots,N}$, but the input to the hypothetical model is the ground truth attributes $\{a_i\}_{i=1,\dots,N}$. Since both the image feature $x$ and the hidden attributes $a$ describe the object for each training sample, our basic idea in this formulation is to link the hypothetical model and the target model by regularizing the prediction score $f(x; w)$ of the target model to be close to the prediction score $g(a; w^*)$ of the hypothetical model. We denote this regularization as the dissimilarity regularization.

*Figure 2: The hypothetical model and the target model for object recognition with hidden attributes. The hypothetical model is learned with attributes, and it helps improve the learning of the target model.*

Suppose the loss functions of the target model and the hypothetical model for the $i$th training sample are $l(y_i, x_i; w)$ and $l(y_i, a_i; w^*)$ respectively. Under the dissimilarity regularization, we learn the parameters $w$ and $w^*$ of the two models simultaneously, with the dissimilarity regularization term and the two parameter regularization terms $\|w\|_2^2$ and $\|w^*\|_2^2$ incorporated as shown in Equation 2:

$$\min_{w, w^*} \; \sum_{i=1}^{N} l(y_i, x_i; w) + \mu \sum_{i=1}^{N} l(y_i, a_i; w^*) + \lambda \sum_{i=1}^{N} \{f(x_i; w) - g(a_i; w^*)\}^2 + \gamma_1 \|w\|_2^2 + \gamma_2 \|w^*\|_2^2 \tag{2}$$

where $\lambda$ is a positive coefficient for the dissimilarity regularization term, $\gamma_1$ and $\gamma_2$ are the positive coefficients for the corresponding parameter regularization terms, and $\mu$ is the positive weight on the loss function of the hypothetical model. With this objective function, our approach not only minimizes the supervised losses of both models, but also minimizes the dissimilarity between the predictions of the two models on the original feature space and the hidden information space respectively. The regularization term effectively ties the learning of the target classifier to that of the hypothetical model, so that the hidden attributes $a$ can influence the parameters of the target model.
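As an illustration only (the helper names are ours, not the authors'), the objective of Equation 2 for the linear models $f(x) = w^\top x$ and $g(a) = w^{*\top} a$ can be evaluated as follows, with the logistic loss of Equation 3 plugged in for both models.

```python
import numpy as np

def logistic_loss(scores, y):
    """sum_i ln(1 + exp(-y_i * s_i))."""
    return np.sum(np.log1p(np.exp(-y * scores)))

def xlr_plus_objective(w, w_star, X, A, y, lam, mu, gamma1, gamma2):
    """Equation 2: supervised losses of both models + dissimilarity + L2 regularizers."""
    f = X @ w          # target-model scores from image features, X is N x d
    g = A @ w_star     # hypothetical-model scores from hidden attributes, A is N x M
    return (logistic_loss(f, y)
            + mu * logistic_loss(g, y)
            + lam * np.sum((f - g) ** 2)        # dissimilarity regularization
            + gamma1 * (w @ w)
            + gamma2 * (w_star @ w_star))
```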
The general approach in Equation 2 incorporates a squared-difference term similar to those used in co-regularization based multi-view semi-supervised learning (SSL) approaches [Krishnapuram et al., 2005; Sindhwani and Rosenberg, 2008; Sindhwani et al., 2005; Farquhar et al., 2005; Belkin et al., 2006], which are also regarded as co-training approaches [Blum and Mitchell, 1998; Zhou and Li, 2010]. However, different from the SSL setting, which focuses on better utilizing additional unlabeled training data, the scenario of classification with hidden information assumes that the hidden information is available only during training and not during testing; we want to use such information to improve the target object classifier built on the primary features. Moreover, the squared-difference term in co-regularization SSL approaches is imposed on the unlabeled samples, whereas in our approach it is imposed on the labeled samples.

The proposed method can be applied to different types of linear classifiers, such as logistic regression (LR) and support vector machines (SVM), by selecting different loss functions. For instance, if we use the hinge loss $l(y_i, x_i; w) = \max(0, 1 - y_i w^\top x_i)$, the objective in Equation 2 applies our additional-information modeling to SVM learning. Since the hinge loss is still convex, subgradient-based optimization can be used to solve the objective. In this paper, we apply the formulation to the LR model, whose loss function is

$$l(y_i, x_i; w) \triangleq -\ln p(y_i \mid x_i; w) = \ln\left(1 + \exp(-y_i w^\top x_i)\right) \tag{3}$$

where $y_i \in \{-1, 1\}$. Since the LR loss term is convex and differentiable, gradient-based methods can be applied to solve the objective function.

For linear models, the dissimilarity regularization term in Equation 2 can be written as

$$\lambda \sum_{i=1}^{N} \{f(x_i; w) - g(a_i; w^*)\}^2 = \lambda (Xw - Aw^*)^\top (Xw - Aw^*) = \lambda \tilde{w}^\top C \tilde{w},$$

where $X = [x_1, x_2, \dots, x_N]^\top$ denotes the training data matrix, $A = [a_1, a_2, \dots, a_N]^\top$ denotes the hidden information matrix, $C = [X, -A]^\top [X, -A]$, and $\tilde{w} = [w^\top, w^{*\top}]^\top$. Since $\tilde{w}^\top C \tilde{w} = \{[X, -A]\tilde{w}\}^\top \{[X, -A]\tilde{w}\} \ge 0$ for any vector $\tilde{w}$, the matrix $C$ is positive semi-definite. Thus, the score dissimilarity term is a convex quadratic term. Its gradient with respect to the stacked parameter vector $\tilde{w}$ is

$$\frac{\partial}{\partial \tilde{w}} \left[ \lambda \sum_{i=1}^{N} \{f(x_i; w) - g(a_i; w^*)\}^2 \right] = 2\lambda C \tilde{w}. \tag{4}$$

This gradient can be directly combined with the gradients of the remaining terms in Equation 2 to optimize the objective for the learning of the parameters $w$ and $w^*$.
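The matrix form above can be checked numerically. The sketch below (our own, with hypothetical names) builds $C = [X, -A]^\top [X, -A]$ and returns the gradient $2\lambda C \tilde{w}$ of the dissimilarity term with respect to the stacked parameter vector, matching Equation 4.

```python
import numpy as np

def dissimilarity_gradient(w, w_star, X, A, lam):
    """Gradient of lam * sum_i (w^T x_i - w*^T a_i)^2 w.r.t. w_tilde = [w; w*]."""
    B = np.hstack([X, -A])            # N x (d + M), so B @ w_tilde = X w - A w*
    C = B.T @ B                       # positive semi-definite by construction
    w_tilde = np.concatenate([w, w_star])
    return 2.0 * lam * (C @ w_tilde)  # split back into the w and w* blocks as needed

# Sanity check against the summation form on random data:
# rng = np.random.default_rng(0)
# X, A = rng.normal(size=(5, 3)), rng.normal(size=(5, 2))
# w, w_star = rng.normal(size=3), rng.normal(size=2)
# g = dissimilarity_gradient(w, w_star, X, A, lam=0.5)
```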
### 3.2 Object-Attribute Relationships as Hidden Information

To further improve the performance of object recognition with hidden attributes, we propose to exploit additional information in the attributes, i.e. the relationships between objects and attributes. As a set of semantic descriptions of the objects, attributes hold strong relationships with object categories, determined by the intrinsic properties of the different categories. For instance, the object *bird* holds a co-occurrence relationship with the attribute *has wing*, and a mutually exclusive relationship with the attribute *has horn*. We believe such relationships, if captured as additional hidden information, would force the object classifier to fit not only the object labels but also the intrinsic properties of the objects. In this way, classifiers learned with attributes as hidden information can generalize better to the testing data.

To simplify the analysis, we consider the relationship between the object label $y$ and each of the $M$ attributes, i.e. $a^m$ with $m \in [1, M]$, in a pairwise manner. Suppose the relationship between the object label $y$ and attribute $a^m$ can be evaluated by a real-valued vector $t_{ym}$, and the relationship between the predicted object $\hat{y}$ and attribute $a^m$ can be evaluated by another real-valued vector $\hat{t}_{ym}$. Since the predicted object $\hat{y}$ is given by the mapping function $f: \mathcal{X} \to \mathbb{R}$ with parameter $w$, $\hat{t}_{ym}$ is also a function of $w$, written $\hat{t}_{ym}(w)$.

The essence of our method is to enforce the relationship between the predicted object $\hat{y}$ and each attribute to be close to the relationship between the ground truth object label and the corresponding attribute. In this way, the classifier learning is connected with the hidden information. Such a regularization is natural, since a perfect object classifier should also preserve the relationships between objects and attributes perfectly. Given the two relationship vectors $\hat{t}_{ym}(w)$ and $t_{ym}$, we can hence enforce the $\ell_2$ norm of the vector difference $\hat{t}_{ym}(w) - t_{ym}$ to be small. Combining such a relationship regularization with the terms for standard object classifier learning in Equation 1, our general formulation for exploiting object-attribute relationships can be written as in Equation 5:

$$\min_{w} \; \sum_{i=1}^{N} l(y_i, x_i; w) + \gamma \|w\|_2^2 + \beta \sum_{m=1}^{M} \|\hat{t}_{ym}(w) - t_{ym}\|_2^2 \tag{5}$$

where $\beta$ is the positive weight of the relationship regularization term. Compared to the formulation in Equation 2, this formulation does not require learning a hypothetical classifier and is therefore more computationally efficient.

To fulfill the general formulation in Equation 5, we further introduce the detailed definitions of $\hat{t}_{ym}(w)$ and $t_{ym}$. We utilize linear regression coefficients to evaluate the relationship between the object $y$ and each attribute $a^m$. Here, the regression coefficients $r_{ym}$ and $s_{ym}$ reconstruct the object label $y$ from the attribute $a^m$ as $y = r_{ym} + s_{ym} a^m$. The coefficients $r_{ym}$ and $s_{ym}$ can then be obtained by minimizing the mean squared error

$$\min_{r_{ym}, s_{ym}} \; \sum_{i=1}^{N} \left(y_i - r_{ym} - s_{ym} a_i^m\right)^2,$$

where $a_i^m$ is the value of attribute $a^m$ for sample $i$. Both coefficients $r_{ym}$ and $s_{ym}$ have specific meanings for representing the relationship. When $y$ and $a^m$ are binary values with 1 standing for the positive label and -1 for the negative label, $s_{ym}$ reflects the co-occurrence ($s_{ym} > 0$) and mutually exclusive ($s_{ym} < 0$) relationships, with its magnitude indicating the strength of the relationship. When $s_{ym} \approx 0$, the two variables tend to be unrelated. Also, $r_{ym}$ represents the bias between the $y$ and $a^m$ values; it gives prior information on whether $y$ tends to be present more frequently than $a^m$ or not.

Here, we define the matrix $\phi_m$, the object label vector $\mathbf{y}$, and the relationship evaluation vector $t_{ym}$ as

$$\phi_m = \begin{bmatrix} 1 & a_1^m \\ \vdots & \vdots \\ 1 & a_N^m \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \quad t_{ym} = \begin{bmatrix} r_{ym} \\ s_{ym} \end{bmatrix}.$$

The vector $t_{ym}$ then has the closed-form solution $t_{ym} = \phi_m^{+}\mathbf{y}$, where $\phi_m^{+}$ is the Moore-Penrose pseudo-inverse [Penrose, 1955] of the matrix $\phi_m$. Given the object and attribute labels in the training data, the vector $t_{ym}$ is a constant unrelated to the classifier parameter $w$.

The predicted object $\hat{y}$ can further be represented by the object classifier response $w^\top x$. Hence, the regression coefficients $\hat{r}_{ym}$ and $\hat{s}_{ym}$ should reconstruct $w^\top x$ from the attribute $a^m$ as $w^\top x = \hat{r}_{ym} + \hat{s}_{ym} a^m$. These two coefficients are obtained by minimizing the mean squared error

$$\min_{\hat{r}_{ym}, \hat{s}_{ym}} \; \sum_{i=1}^{N} \left(w^\top x_i - \hat{r}_{ym} - \hat{s}_{ym} a_i^m\right)^2.$$

With the training sample matrix $X = [x_1, \dots, x_N]^\top$ and the relationship evaluation vector $\hat{t}_{ym}(w) = [\hat{r}_{ym}, \hat{s}_{ym}]^\top$, the closed-form solution for the relationship evaluation vector is $\hat{t}_{ym}(w) = \phi_m^{+} X w$. Replacing $\hat{t}_{ym}(w)$ in Equation 5 with $\phi_m^{+} X w$ and keeping the pre-calculated constant term $t_{ym}$, the complete objective function is given in Equation 6:

$$\min_{w} \; \sum_{i=1}^{N} l(y_i, x_i; w) + \gamma \|w\|_2^2 + \beta \sum_{m=1}^{M} \|\phi_m^{+} X w - t_{ym}\|_2^2 \tag{6}$$

From Equation 6, we can see that our formulation capturing the object-attribute relationships brings in an additional quadratic term $w^\top X^\top (\phi_m^{+})^\top \phi_m^{+} X w$. Reshaping this term, we find it equals $\|\phi_m^{+} X w\|_2^2 \ge 0$ for any $w$. Hence, the matrix $X^\top (\phi_m^{+})^\top \phi_m^{+} X$ is positive semi-definite, and the whole quadratic term is convex and easy to optimize.

Similar to the formulation discussed in Section 3.1, the formulation in Equation 6 can be applied to different types of linear classifiers, such as logistic regression (LR) and support vector machines (SVM), by selecting different loss functions. As discussed in Section 3.1, if we use the hinge loss $l(y_i, x_i; w) = \max(0, 1 - y_i w^\top x_i)$, the objective in Equation 6 applies our relationship modeling to SVM learning; since the hinge loss is still convex, subgradient-based optimization can be used to solve the objective. In this paper, we apply the relationship modeling to the LR model. The optimization of the LR loss function has been discussed in Section 3.1.
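To illustrate the relationship regularizer (again a sketch under our own naming, not the authors' code), the vectors $t_{ym} = \phi_m^{+}\mathbf{y}$ and $\hat{t}_{ym}(w) = \phi_m^{+} X w$ can be computed with a pseudo-inverse per attribute, and the penalty in Equation 6 is the summed squared distance between them.

```python
import numpy as np

def relationship_vectors(A, y, X, w):
    """Per-attribute t_ym = phi_m^+ y and t_hat_ym(w) = phi_m^+ X w (Section 3.2)."""
    N, M = A.shape
    scores = X @ w
    t, t_hat = [], []
    for m in range(M):
        phi_m = np.column_stack([np.ones(N), A[:, m]])   # N x 2 design matrix [1, a^m]
        phi_pinv = np.linalg.pinv(phi_m)                 # Moore-Penrose pseudo-inverse, 2 x N
        t.append(phi_pinv @ y)                           # [r_ym, s_ym]
        t_hat.append(phi_pinv @ scores)                  # [r_hat_ym, s_hat_ym]
    return np.array(t), np.array(t_hat)

def relationship_penalty(A, y, X, w, beta):
    """beta * sum_m ||t_hat_ym(w) - t_ym||_2^2, the regularizer in Equation 6."""
    t, t_hat = relationship_vectors(A, y, X, w)
    return beta * np.sum((t_hat - t) ** 2)
```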
For our formulation in Equation 6, the gradient of the relationship term with respect to $w$ can be written as

$$\frac{\partial}{\partial w}\left[\beta \sum_{m=1}^{M} \|\phi_m^{+} X w - t_{ym}\|_2^2\right] = 2\beta \sum_{m=1}^{M} X^\top (\phi_m^{+})^\top \left(\phi_m^{+} X w - t_{ym}\right). \tag{7}$$

Since the relationship term in Equation 6 is convex, we can directly combine the gradient in Equation 7 with the gradients of the LR loss function and the $\ell_2$-norm parameter regularization term for object classifier learning.

### 3.3 Combined Formulation

The formulation discussed in Section 3.1 utilizes hidden attributes as additional information and enforces the score dissimilarity between the hypothetical model on the hidden attributes and the target model on the image features to be small. The formulation in Section 3.2, on the other hand, models the relationships between hidden attributes and the object, and enforces the preservation of the relationships between attributes and the object category. These two formulations incorporate different properties of the hidden attributes, and hence can be further combined into one objective function. The combined objective function can be written as

$$\min_{w, w^*} \; \sum_{i=1}^{N} l(y_i, x_i; w) + \mu \sum_{i=1}^{N} l(y_i, a_i; w^*) + \lambda \sum_{i=1}^{N} \{f(x_i; w) - g(a_i; w^*)\}^2 + \beta \sum_{m=1}^{M} \|\phi_m^{+} X w - t_{ym}\|_2^2 + \gamma_1 \|w\|_2^2 + \gamma_2 \|w^*\|_2^2. \tag{8}$$

This objective function hence minimizes the score dissimilarity term and the relationship regularization term simultaneously during the learning of the parameters $w$ and $w^*$. The gradient of the objective in Equation 8 with respect to $w$ and $w^*$ is the combination of the gradients of each term in the equation. The gradients of the proposed score dissimilarity term and the relationship regularization term are given in Equation 4 and Equation 7 respectively. Since each term in Equation 8 is convex, the optimization can be solved by gradient-descent based methods.
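Putting the pieces together, a plain gradient-descent sketch for the combined objective of Equation 8 might look as follows. It is a minimal illustration under the assumptions stated above (logistic loss, fixed step size, our own function names), not the authors' implementation.

```python
import numpy as np

def train_xlr_rel_plus(X, A, y, lam, mu, beta, gamma1, gamma2, lr=1e-3, iters=500):
    """Gradient descent on Equation 8 for w (target model) and w_star (hypothetical model)."""
    N, d = X.shape
    M = A.shape[1]
    w, w_star = np.zeros(d), np.zeros(M)
    # Precompute per-attribute pseudo-inverses and target relationship vectors t_ym.
    phis = [np.linalg.pinv(np.column_stack([np.ones(N), A[:, m]])) for m in range(M)]
    t = [p @ y for p in phis]
    for _ in range(iters):
        f, g = X @ w, A @ w_star
        # Gradients of the logistic losses with respect to the scores.
        df = -y / (1.0 + np.exp(y * f))
        dg = -y / (1.0 + np.exp(y * g))
        grad_w = X.T @ df + 2 * lam * (X.T @ (f - g)) + 2 * gamma1 * w
        grad_ws = mu * (A.T @ dg) - 2 * lam * (A.T @ (f - g)) + 2 * gamma2 * w_star
        # Relationship regularization gradient (Equation 7).
        for p, t_m in zip(phis, t):
            grad_w += 2 * beta * (X.T @ (p.T @ (p @ f - t_m)))
        w -= lr * grad_w
        w_star -= lr * grad_ws
    return w, w_star
```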
## 4 Experiments

We perform experiments on natural scene object classification on two benchmark datasets: the aPascal dataset [Farhadi et al., 2009] and the Animals with Attributes (AWA) dataset [Lampert et al., 2009]. The goal is to compare the performance of our proposed approaches incorporating hidden attributes with the basic approaches that do not use attributes, the traditional attribute approaches that use attributes as a middle-level representation, and the existing approaches for learning with hidden information.

**Models.** The models evaluated in our experiments include: the standard logistic regression (LR) and support vector machine (SVM) models learned with only the training data, the proposed formulation in Equation 2 using attributes as extra information (xLR+), the proposed formulation in Equation 6 incorporating attribute relationships (LR-Rel+), and the formulation in Equation 8 combining the formulations in Equation 2 and Equation 6 (xLR-Rel+).

The aPascal dataset contains 6340 training images and 6355 testing images collected from the Pascal VOC 2008 challenge. Each sample belongs to one of twenty object categories: people, bird, cat, cow, dog, horse, sheep, aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, and tv/monitor. The dataset also provides a 9751-dimensional base feature for each training and testing sample, formed by combining various color, texture, HOG, shape, and edge descriptors with a Bag-of-Words approach. This base feature is used in all of the following experiments. A list of 64 attributes is annotated for each sample in the dataset, with examples shown in Figure 1. Each attribute is quantized to the binary values -1 or 1 to represent the absence or presence of the attribute. These hidden attributes are used by the proposed algorithms xLR+, LR-Rel+ and xLR-Rel+.

We use a one-versus-all strategy to perform multi-class object classification on this dataset. A total of 20 classifiers are trained to predict each category against the remaining categories, and the final decision is made by comparing the scores for each object type. During classifier learning, the coefficients are tuned through a two-fold cross-validation procedure within the training set.
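For the multi-class experiments, a one-versus-all wrapper of the following kind trains one binary classifier per category and predicts by the highest score; this is illustrative only, and `train_xlr_rel_plus` refers to the hypothetical trainer sketched after Section 3.3.

```python
import numpy as np

def one_vs_all_train(X, A, labels, classes, train_fn, **hyper):
    """Train one binary classifier per class; class samples get +1, the rest -1."""
    return {c: train_fn(X, A, np.where(labels == c, 1.0, -1.0), **hyper)
            for c in classes}

def one_vs_all_predict(models, X_test):
    """Predict by comparing the per-class scores w_c^T x and taking the argmax."""
    classes = list(models)
    scores = np.column_stack([X_test @ models[c][0] for c in classes])  # use w only
    return [classes[i] for i in np.argmax(scores, axis=1)]
```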
The results are shown in Table 1, where both the overall accuracy and the mean per-class recognition accuracy are given. We also include the results from state-of-the-art middle-level representation work [Farhadi et al., 2009; Wang and Mori, 2010; Wang and Ji, 2013], and from the state-of-the-art learning with hidden information approaches, namely SVM+ [Vapnik and Vashist, 2009] and LR+ [Wang et al., 2014].

Table 1: Object recognition results on the aPascal dataset compared to state-of-the-art middle-level representation based attribute methods.

| Methods | Model | Mean (%) | Overall (%) |
|---|---|---|---|
|  | LR | 43.29 | 59.76 |
| This work | xLR+ | 43.87 | 60.23 |
| This work | LR-Rel+ | 47.52 | 62.03 |
| This work | xLR-Rel+ | 47.82 | 63.10 |
| [Farhadi et al., 2009] | SVM | 37.70 | 59.40 |
| [Wang and Mori, 2010] | LSVM | 50.84 | 59.15 |
| [Wang and Ji, 2013] | BN | 44.82 | 63.02 |
| [Vapnik and Vashist, 2009] | SVM+ | 42.08 | 60.02 |
| [Wang et al., 2014] | LR+ | 42.21 | 60.17 |

Firstly, we compare the baseline LR approach with our proposed models xLR+, LR-Rel+, and xLR-Rel+. From the results, we can see that by incorporating hidden attributes, the proposed models xLR+, LR-Rel+, and xLR-Rel+ all outperform LR in terms of both the overall and mean per-class accuracies, which shows the effectiveness of the proposed algorithms. In addition, xLR-Rel+ outperforms both xLR+ and LR-Rel+ on both accuracy measures, showing that the combination further improves the performance.

Secondly, we compare with the state-of-the-art middle-level representation approaches of [Farhadi et al., 2009], [Wang and Mori, 2010] and [Wang and Ji, 2013]. Although predicted attributes are not utilized during testing, all three of our models (i.e. xLR+, LR-Rel+ and xLR-Rel+) outperform the approach proposed in [Farhadi et al., 2009]. Compared to the results of [Wang and Mori, 2010], our xLR-Rel+ approach performs better in overall recognition rate by around 4%, and lower in mean per-class recognition rate by about 3%. This is expected: as argued in [Wang and Ji, 2013], [Wang and Mori, 2010] use a loss function specifically designed for skewed data, and the aPascal data is skewed, with 2571 of the 6355 testing samples in the person category. [Wang and Mori, 2010] also report their performance with the standard 0/1 loss function: 46.25% mean accuracy and 62.16% overall accuracy, both below our results. The approach in [Wang and Ji, 2013] also incorporates attribute relationships in the model, and its object recognition performance on aPascal is not as good as that of our xLR-Rel+ model for both the overall and mean per-class evaluations. These results show that our approaches are quite effective at improving object classifier learning compared to traditional middle-level representation methods.

Thirdly, we compare our methods with the learning with hidden information approaches, including SVM+ [Vapnik and Vashist, 2009] and LR+ [Wang et al., 2014]. From Table 1, we can see that all three of our models (i.e. xLR+, LR-Rel+ and xLR-Rel+) outperform both the SVM+ and LR+ approaches in both the mean and overall recognition accuracy. We perform the Wilcoxon rank sum test to evaluate the performance improvement of the proposed xLR-Rel+ model over both the LR and SVM+ approaches. Both tests show that the performance improvements are statistically significant, with p-values below 0.05.

To further compare the performance of our proposed model with the state-of-the-art learning with hidden information approaches, including SVM+ and Rank Transfer, we test the proposed algorithm for object classification on the Animals with Attributes (AWA) dataset [Lampert et al., 2009]. This dataset includes 6180 images belonging to 10 testing classes. These 10 testing classes are different wild animals: chimpanzee (CP), giant panda (GP), leopard (LP), persian cat (PC), pig (PG), hippopotamus (HP), humpback whale (HW), raccoon (RC), rat (RT), and seal (SL). To compare with the results in [Sharmanska et al., 2013], we follow the same experimental setting: the models are tested on recognizing each possible pair of the 10 animal classes, which gives 45 animal pairs. The provided 2000-dimensional SURF descriptors are used as features, and the predicted attributes in the form of probability estimates provided by [Lampert et al., 2009] are used as hidden information during training. With the provided features and hidden information, a binary object classifier is trained for each of the 45 animal pairs. We use 100 samples per object class for training and 200 samples per object class for testing. As in [Sharmanska et al., 2013], we repeat this training/testing split procedure 20 times. The results, with comparisons to the SVM+ and Rank Transfer methods, are presented in Table 2.

Table 2: Object recognition with hidden attributes on the AWA dataset.

| # | Class pair | SVM | SVM+ | Rank Transfer | xLR-Rel+ |
|---|---|---|---|---|---|
| 1 | CP vs. GP | 91.53 | 92.12 | 91.83 | 83.85 ± 1.45 |
| 2 | CP vs. LP | 94.16 | 94.23 | 94.80 | 98.03 ± 1.03 |
| 3 | CP vs. PC | 91.09 | 91.73 | 91.86 | 95.17 ± 1.24 |
| 4 | CP vs. PG | 87.45 | 88.06 | 88.59 | 86.57 ± 1.58 |
| 5 | CP vs. HP | 87.58 | 87.53 | 87.57 | 87.08 ± 1.65 |
| 6 | CP vs. HW | 98.12 | 98.57 | 98.52 | 99.60 ± 0.88 |
| 7 | CP vs. RC | 89.00 | 89.67 | 89.54 | 87.98 ± 1.43 |
| 8 | CP vs. RT | 86.84 | 87.96 | 88.47 | 92.95 ± 2.15 |
| 9 | CP vs. SL | 92.53 | 92.59 | 92.58 | 90.54 ± 2.19 |
| 10 | GP vs. LP | 95.13 | 94.95 | 95.11 | 97.74 ± 0.92 |
| 11 | GP vs. PC | 94.66 | 94.68 | 94.38 | 93.27 ± 1.86 |
| 12 | GP vs. PG | 88.67 | 88.95 | 88.69 | 81.87 ± 1.70 |
| 13 | GP vs. HP | 92.35 | 92.85 | 92.78 | 88.93 ± 1.68 |
| 14 | GP vs. HW | 98.77 | 98.76 | 98.88 | 98.97 ± 0.73 |
| 15 | GP vs. RC | 91.76 | 91.90 | 91.33 | 86.99 ± 1.91 |
| 16 | GP vs. RT | 90.50 | 90.61 | 90.33 | 90.69 ± 1.15 |
| 17 | GP vs. SL | 93.33 | 93.40 | 93.58 | 89.85 ± 1.05 |
| 18 | LP vs. PC | 95.50 | 95.65 | 95.92 | 97.65 ± 1.11 |
| 19 | LP vs. PG | 90.40 | 90.40 | 90.88 | 96.95 ± 0.87 |
| 20 | LP vs. HP | 93.60 | 93.83 | 93.81 | 96.12 ± 1.30 |
| 21 | LP vs. HW | 99.06 | 99.20 | 99.17 | 99.43 ± 1.41 |
| 22 | LP vs. RC | 83.23 | 83.18 | 83.15 | 90.66 ± 2.84 |
| 23 | LP vs. RT | 90.28 | 90.65 | 90.98 | 96.50 ± 1.44 |
| 24 | LP vs. SL | 94.98 | 95.14 | 95.49 | 97.09 ± 1.59 |
| 25 | PC vs. PG | 83.23 | 83.38 | 83.39 | 78.31 ± 1.99 |
| 26 | PC vs. HP | 92.66 | 93.14 | 93.41 | 94.14 ± 0.93 |
| 27 | PC vs. HW | 96.19 | 96.69 | 97.26 | 99.64 ± 1.42 |
| 28 | PC vs. RC | 90.46 | 90.94 | 91.20 | 88.40 ± 1.67 |
| 29 | PC vs. RT | 69.38 | 69.43 | 70.40 | 68.41 ± 1.89 |
| 30 | PC vs. SL | 86.06 | 86.97 | 86.91 | 90.43 ± 1.78 |
| 31 | PG vs. HP | 76.45 | 77.42 | 79.02 | 82.01 ± 2.72 |
| 32 | PG vs. HW | 96.78 | 97.04 | 97.32 | 98.66 ± 1.49 |
| 33 | PG vs. RC | 80.08 | 81.50 | 81.79 | 78.13 ± 2.10 |
| 34 | PG vs. RT | 72.25 | 72.63 | 73.68 | 73.70 ± 2.90 |
| 35 | PG vs. SL | 79.76 | 80.33 | 81.76 | 78.32 ± 2.35 |
| 36 | HP vs. HW | 93.83 | 93.63 | 93.75 | 98.17 ± 1.42 |
| 37 | HP vs. RC | 86.49 | 86.83 | 87.37 | 84.21 ± 1.80 |
| 38 | HP vs. RT | 85.12 | 85.99 | 87.37 | 90.55 ± 1.89 |
| 39 | HP vs. SL | 72.82 | 73.41 | 75.85 | 70.98 ± 3.16 |
| 40 | HW vs. RC | 96.92 | 97.11 | 97.15 | 99.32 ± 1.12 |
| 41 | HW vs. RT | 95.21 | 95.45 | 95.53 | 99.39 ± 0.92 |
| 42 | HW vs. SL | 86.44 | 86.89 | 86.93 | 96.80 ± 2.86 |
| 43 | RC vs. RT | 79.59 | 79.67 | 80.31 | 79.49 ± 2.71 |
| 44 | RC vs. SL | 92.22 | 92.55 | 92.80 | 83.09 ± 1.37 |
| 45 | RT vs. SL | 80.44 | 80.68 | 82.34 | 88.02 ± 2.55 |
|  | Average | 88.95 | 89.30 | 89.64 | 89.88 |

From the results in Table 2, our proposed xLR-Rel+ is very effective at incorporating hidden attributes. Among the 45 possible cases, SVM performs best in only 1 case, SVM+ performs best in 7 cases, the Rank Transfer method performs best in 11 cases, and our proposed xLR-Rel+ model performs best in 26 of the 45 cases. The average accuracies over all 45 pairs also show that our proposed model performs better than the SVM, SVM+, and Rank Transfer methods.

## 5 Conclusion

In this work, we propose to incorporate hidden attributes for object classification. Instead of predicting these attributes explicitly or implicitly during testing, we utilize the attributes only during training to improve the learning of the object classifier on the primary features. We develop two different approaches to incorporate the hidden attributes, one utilizing attributes as additional information and the other incorporating the relationships between attributes and objects. Finally, these two approaches are combined into one learning objective. We evaluate our approach on natural scene object classification. Experiments demonstrate the effectiveness of our approaches over state-of-the-art methods on benchmark datasets.

## Acknowledgments

This work is funded in part by the US Defense Advanced Research Projects Agency under grants HR0011-08-C-0135-S8 and HR0011-10-C-0112, and by the Army Research Office under grant W911NF-13-1-0395.

## References

- [Akata et al., 2013] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, Cordelia Schmid, et al. Label-embedding for attribute-based classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 819-826, 2013.
- [Belkin et al., 2006] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399-2434, 2006.
- [Blum and Mitchell, 1998] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Annual Conference on Computational Learning Theory (COLT), pages 92-100, 1998.
- [Farhadi et al., 2009] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1778-1785, 2009.
- [Farquhar et al., 2005] Jason Farquhar, David Hardoon, Hongying Meng, John Shawe-Taylor, and Sandor Szedmak. Two view learning: SVM-2K, theory and practice. In Advances in Neural Information Processing Systems (NIPS), pages 355-362, 2005.
- [Ferrari and Zisserman, 2007] Vittorio Ferrari and Andrew Zisserman. Learning visual attributes. In Advances in Neural Information Processing Systems (NIPS), pages 433-440, 2007.
- [Hwang et al., 2011] Sung Ju Hwang, Fei Sha, and K. Grauman. Sharing features between objects and their attributes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1761-1768, 2011.
- [Kovashka et al., 2011] A. Kovashka, S. Vijayanarasimhan, and K. Grauman. Actively selecting annotations among objects and attributes. In IEEE International Conference on Computer Vision (ICCV), pages 1403-1410, 2011.
- [Krishnapuram et al., 2005] Balaji Krishnapuram, David Williams, Ya Xue, Alexander Hartemink, Lawrence Carin, and Mario Figueiredo. On semi-supervised classification. Advances in Neural Information Processing Systems (NIPS), 17:721-728, 2005.
- [Lampert et al., 2009] C.H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 951-958, 2009.
- [Parikh and Grauman, 2011a] D. Parikh and K. Grauman. Interactively building a discriminative vocabulary of nameable attributes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1681-1688, 2011.
- [Parikh and Grauman, 2011b] D. Parikh and K. Grauman. Relative attributes. In IEEE International Conference on Computer Vision (ICCV), pages 503-510, 2011.
- [Penrose, 1955] Roger Penrose. A generalized inverse for matrices. Mathematical Proceedings of the Cambridge Philosophical Society, 51(3):406-413, 1955.
- [Sharmanska et al., 2013] V. Sharmanska, N. Quadrianto, and C.H. Lampert. Learning to rank using privileged information. In IEEE International Conference on Computer Vision (ICCV), pages 825-832, 2013.
- [Sindhwani and Rosenberg, 2008] Vikas Sindhwani and David S. Rosenberg. An RKHS for multi-view learning and manifold co-regularization. In International Conference on Machine Learning (ICML), pages 976-983, 2008.
- [Sindhwani et al., 2005] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of the ICML Workshop on Learning with Multiple Views, pages 74-79, 2005.
- [Vapnik and Vashist, 2009] V. Vapnik and A. Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5-6):544-557, 2009.
- [Wang and Ji, 2012] Xiaoyang Wang and Qiang Ji. A novel probabilistic approach utilizing clip attributes as hidden knowledge for event recognition. In International Conference on Pattern Recognition (ICPR), 2012.
- [Wang and Ji, 2013] Xiaoyang Wang and Qiang Ji. A unified probabilistic approach modeling relationships between attributes and objects. In IEEE International Conference on Computer Vision (ICCV), pages 2120-2127, 2013.
- [Wang and Ji, 2014] Xiaoyang Wang and Qiang Ji. Attribute augmentation with sparse coding. In International Conference on Pattern Recognition (ICPR), pages 4352-4357, 2014.
- [Wang and Mori, 2010] Yang Wang and Greg Mori. A discriminative latent model of object classes and attributes. In European Conference on Computer Vision (ECCV), volume 6315, pages 155-168, 2010.
- [Wang et al., 2014] Ziheng Wang, Xiaoyang Wang, and Qiang Ji. Learning with hidden information. In International Conference on Pattern Recognition (ICPR), pages 238-243, 2014.
- [Zhao et al., 2015] Yue Zhao, Nan Zhou, Libing Zhang, Licheng Wu, Rui Zheng, Xiaoyang Wang, and Qiang Ji. Shared speech attribute augmentation for English-Tibetan cross-language phone recognition. In IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), pages 539-543, 2015.
- [Zhou and Li, 2010] Zhi-Hua Zhou and Ming Li. Semi-supervised learning by disagreement. Knowledge and Information Systems, 24(3):415-439, 2010.