# Joint Dictionaries for Zero-Shot Learning

Soheil Kolouri, HRL Laboratories, LLC, skolouri@hrl.com · Mohammad Rostami, University of Pennsylvania, mrostami@seas.upenn.edu · Yuri Owechko, HRL Laboratories, LLC, yowechko@hrl.com · Kyungnam Kim, HRL Laboratories, LLC, kkim@hrl.com

The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

## Abstract

A classic approach toward zero-shot learning (ZSL) is to map the input domain to a set of semantically meaningful attributes that can be used later on to classify unseen classes of data (e.g. visual data). In this paper, we propose to learn a visual feature dictionary that has semantically meaningful atoms. Such a dictionary is learned via joint dictionary learning for the visual domain and the attribute domain, while enforcing the same sparse coding for both dictionaries. Our novel attribute-aware formulation provides an algorithmic solution to the domain shift/hubness problem in ZSL. Upon learning the joint dictionaries, images from unseen classes can be mapped into the attribute space by finding the attribute-aware joint sparse representation using solely the visual data. We demonstrate that our approach provides superior or comparable performance to that of the state of the art on benchmark datasets.

## Introduction

Most classification algorithms require a large pool of manually labeled data to learn the optimal parameters of a classifier. The recent exponential growth of visual data, the growing need for fine-grained multi-label annotations, and the consistent emergence of new classes (e.g. new products), however, have rendered manual labeling of data practically infeasible. Transfer learning has been proposed as a remedy to deal with this issue (Lampert, Nickisch, and Harmeling 2014). The idea is to learn on a limited number of classes and then, through knowledge transfer, learn how to classify images from the new classes either using only a few labeled data points, i.e. few- or one-shot learning (Fei-Fei, Fergus, and Perona 2006), or in the extreme case without any labeled data, i.e. zero-shot learning (ZSL) (Lampert, Nickisch, and Harmeling 2014). These transfer learning approaches address the challenge of annotated data unavailability and open the door towards lifelong learning machines.

To learn target classes with no labeled data, one needs to be able to generalize the relationship between the source data and its labels to the target classes. To address this challenge in ZSL, an intermediate shared space (i.e. the space of semantic attributes) is exploited, which allows for knowledge transfer from labeled classes to the unlabeled classes. The overarching idea in ZSL is that the source and the target classes share common attributes. The semantic attributes (e.g., "can fly", "is green") are often provided as accessible side information (e.g. a verbal description of a class) which uniquely describes classes of data. To achieve ZSL, the relationship between seen data and its corresponding attributes is first learned in the training phase. In the testing stage, this allows for parsing a target image from an unseen class into its semantic attributes in order to predict the corresponding label. To clarify the ZSL core idea and the required steps to perform ZSL, consider the following sentence: "Tardigrades (also known as water bears or moss piglets) are water-dwelling, eight-legged, segmented micro-animals" (source: Wikipedia).
Given this textual description, one can easily identify the creature in Figure 1, left, as a Tardigrade even though they may have never seen one before. Performing this task requires three capabilities: 1) parsing the textual information into semantic features, so we can describe the class Tardigrade as "bear-like", "piglet-like", "water-dwelling", "eight-legged", "segmented", and "microscopic animal"; 2) parsing the image into its visual attributes (see Figure 1); and 3) matching the parsed visual features to the parsed textual information, which often requires extensive prior knowledge. Recent textual features extracted from large unlabeled text corpora, including word2vec (Mikolov et al. 2013) and GloVe (Pennington, Socher, and Manning 2014), enable a learner to efficiently parse textual information. Deep convolutional neural networks (CNNs) (Krizhevsky, Sutskever, and Hinton 2012; Simonyan and Zisserman 2014; He et al. 2016; Huang et al. 2017) have revolutionized the field of computer vision and enable a learner to extract rich visual features from images. An extensive body of work in the field of ZSL is concentrated on modeling the relationship between visual features and semantic attributes (Palatucci et al. 2009; Akata et al. 2013; Socher et al. 2013; Norouzi et al. 2014; Lampert, Nickisch, and Harmeling 2009; Zhang and Saligrama 2015; Ding, Shao, and Fu 2017). In this paper, we provide a novel approach to model the relationship between the visual features and the textual information.

Figure 1: High-level overview of our approach. Left & right: visual and attribute feature extraction and representation using union of subspaces. Middle: constraining the dictionary atoms to be coupled.

Our specific contributions are:

1. A new formulation of ZSL via joint dictionary learning.
2. Extending the classic joint dictionary learning formulation to an attribute-aware formulation that addresses the domain shift/adaptation problem (Kodirov et al. 2015).
3. Demonstrating the benefit of a transductive learning scheme to reduce the hubness phenomenon (Dinu, Lazaridou, and Baroni 2014; Shigeto et al. 2015).

## Related Work

ZSL methods often focus on learning the relationship between the visual space and the semantic attribute space. Palatucci et al. (Palatucci et al. 2009) proposed to learn a linear compatibility between the visual space and the semantic attribute space. Lampert et al. (Lampert, Nickisch, and Harmeling 2014) posed the problem as an attribute classification problem, learned individual binary attribute classifiers in the training stage, and used the ensemble of classifiers to map visual features to their semantic attributes. Yu and Aloimonos (Yu and Aloimonos 2010) approached the problem from a probabilistic point of view and proposed to use generative models to learn prior distributions for image features with respect to each attribute. Li et al. (Li, Guo, and Schuurmans 2015) considered multi-class classification over all classes (observed and unseen) and tackled the ZSL problem directly, without introducing intermediate prediction steps. Deutsch et al. (Deutsch et al. 2017) used a multi-scale manifold regularization scheme in a transductive setting to address the ZSL problem. Recently, various authors have proposed to embed image features and semantic attributes in a shared metric space (i.e. a latent embedding) (Akata et al.
2013; Romera-Paredes and Torr 2015; Zhang and Saligrama 2015) while forcing the embedded representations for image features and their corresponding semantic attributes to be similar. Akata et al. (Akata et al. 2013), for instance, proposed a model that embeds the image features and the semantic attributes in a common space (i.e. a latent embedding) where the compatibility between them is measured via a bilinear function. Similarly, Romera-Paredes and Torr (Romera-Paredes and Torr 2015) utilized a principled choice of regularizers that enables a simple closed-form solution for learning a linear mapping that embeds the image features and the semantic attributes in a low-dimensional shared linear subspace.

Others have identified the major problems and challenges in ZSL to be the domain shift problem (Kodirov et al. 2015) and the hubness phenomenon (Dinu, Lazaridou, and Baroni 2014; Shigeto et al. 2015). In short, the domain shift problem arises from the fact that the distribution of features corresponding to the same attribute for seen and unseen images could be very different (e.g. stripes of tigers versus zebras). The hubness problem, on the other hand, states that there will often be attributes that are similar (have small distance) to vastly different visual features in the embedding space. Various transductive approaches have been presented to overcome the hubness problem (Fu et al. 2015).

The use of sparse dictionaries to model the space of visual features and semantic attributes as a union of linear subspaces has been shown to be an effective modeling scheme in recent ZSL papers (Isele, Rostami, and Eaton 2016; Kodirov et al. 2015; Zhang and Saligrama 2015). Zhang and Saligrama (Zhang and Saligrama 2015) showed that modeling test image features as sparse linear combinations of training image features is beneficial and formulated a ZSL method based on this principle. Using similar ideas, Isele et al. (Isele, Rostami, and Eaton 2016) used joint dictionary learning to learn a dynamical control system using high-level task descriptors in an online lifelong zero-shot reinforcement learning setting. Our JD-ZSL builds on similar ideas as in (Isele, Rostami, and Eaton 2016; Kodirov et al. 2015) and introduces a novel ZSL method based on learning joint sparse dictionaries for the image features and the semantic attributes. At its core, JD-ZSL is equipped with a novel entropy minimization regularizer, similar to the one proposed by (Grandvalet and Bengio 2004), which facilitates the solution to the ZSL problem by reducing the domain shift effect. We further show that a transductive approach applied to our attribute-aware JD-ZSL formulation provides state-of-the-art or close to state-of-the-art performance on various benchmark datasets. Finally, it should be noted that the idea of using joint dictionaries to map data from a given metric space to a second related space was pioneered by Yang et al. (Yang et al. 2010) in super-resolution applications.

Figure 1 captures the gist of our idea. Visual features are extracted via CNNs (left sub-figure), and the semantic attributes are provided via textual feature extractors like word2vec or via human annotations (right sub-figure). Both the visual features and the semantic attributes are assumed to be sparsely representable in a shared union of linear subspaces (left and right sub-figures).
The idea is then to enforce the sparse representation vectors for both domains to be equal and thus effectively couple the learned dictionaries for the visual and the attribute spaces. The intuition from a co-view perspective (Yu et al. 2014) is that both the visual and the attribute features provide information about the same class, and therefore each can augment the learning of the other. Each underlying class is common to both views, and one can find task embeddings that are consistent for both the visual features and their corresponding attributes. Having learned the coupled dictionaries, zero-shot classification can be performed by mapping images of unseen classes into the attribute space, where classification can be done simply via nearest neighbor or via a more elaborate scheme like label propagation. Given the coupled nature of the learned dictionaries, an image can be mapped to its semantic attributes by first finding its sparse representation with respect to the visual dictionary; the attribute dictionary can then be used to recover the attribute vector from the joint sparse representation, which in turn is used for classification.

## Problem Statement and Technical Rationale

Consider a visual feature metric space $\mathcal{X}$ of dimension $p$, an attribute metric space $\mathcal{Z}$ of dimension $q$, and a class label set $\mathcal{Y}$ that ranges over a finite alphabet of size $K$ (images can potentially have multiple memberships to the classes). As an example, $\mathcal{X} = \mathbb{R}^p$ for visual features extracted from a deep CNN, and $\mathcal{Z} = \{0,1\}^q$ when a binary code of length $q$ is used to identify the presence/absence of various characteristics in an object (Lampert, Nickisch, and Harmeling 2014). We are given a labeled dataset $\mathcal{D} = \{((x_i, z_i), y_i)\}_{i=1}^{N}$ of features of seen images and their corresponding semantic attributes, where $\forall i: x_i \in \mathcal{X}$, $z_i \in \mathcal{Z}$, and $y_i \in \mathcal{Y}$. We are also given the unlabeled attributes of unseen classes, $\mathcal{D}' = \{z'_j\}_{j=1}^{M}$ (i.e. we have access to textual information for a wide variety of objects but do not have access to the corresponding visual information). In ZSL the sets of seen and unseen classes are disjoint, and it is assumed that the semantic attributes are class specific. The goal is then to use $\mathcal{D}$ and $\mathcal{D}'$ to learn the relationship between $\mathcal{X}$ and $\mathcal{Z}$ so that when an unseen image (an image from an unseen class) is fed to the system, its corresponding attributes and consequently its label can be predicted. Finally, we assume that $\psi: \mathcal{Z} \rightarrow \mathcal{Y}$ is the mapping between the attribute space and the label space. We show that $\psi$ can be learned to be a nearest neighbor classifier, or a more elaborate labeling scheme.

To further clarify the problem, consider an instance of ZSL in which features extracted from images of horses and tigers are included in the seen visual features $X = [x_1, \ldots, x_N]$, where $x_i \in \mathcal{X}$, but $X$ does not contain features from images containing zebras. On the other hand, the semantic attributes contain information on all seen classes, $Z = [z_1, \ldots, z_N]$ with $z_i \in \mathcal{Z}$, and unseen classes, $Z' = [z'_1, \ldots, z'_M]$ with $z'_j \in \mathcal{Z}$, including the zebras. Intuitively, by learning the relationship between the image features and the attributes "has hooves", "has mane", and "has stripes" from the seen images, we must be able to assign an image of a zebra to its corresponding attributes, even though we have never seen a zebra before. More formally, we want to learn the mapping $\phi: \mathcal{X} \rightarrow \mathcal{Z}$ which relates the visual space and the attribute space.
Having learned this mapping, for an unseen image one can recover the corresponding attribute vector using the image features and then classify the image using the mapping $y = (\psi \circ \phi)(x)$, where $\circ$ denotes function composition.

### Technical Rationale

For the rest of our discussion we assume that $\mathcal{X} = \mathbb{R}^p$, $\mathcal{Z} = \mathbb{R}^q$, and $\mathcal{Y} = \mathbb{R}^K$. The simplest ZSL approach is to assume that the mapping $\phi: \mathbb{R}^p \rightarrow \mathbb{R}^q$ is linear, $\phi(x) = W^T x$ where $W \in \mathbb{R}^{p \times q}$, and then minimize the regression error $\frac{1}{N}\sum_i \|W^T x_i - z_i\|_2^2$ to learn $W$. Despite the existence of a closed-form solution for $W$, the solution contains the inverse of the covariance matrix of $X$, $\big(\frac{1}{N}\sum_i x_i x_i^T\big)^{-1}$, which requires a large number of data points for accurate estimation. To overcome this problem, various regularizations are considered for $W$. Decomposing $W$ as $W = P\Lambda Q$, where $P \in \mathbb{R}^{p \times l}$, $\Lambda \in \mathbb{R}^{l \times l}$, $Q \in \mathbb{R}^{l \times q}$, and $l < \min(p, q)$, can also be helpful. Intuitively, $P$ is a right linear operator that projects the $x$'s into a shared low-dimensional subspace, $Q$ is a left linear operator that projects the $z$'s into the same shared subspace, and $\Lambda$ provides a bilinear similarity measure in the shared subspace. The regression problem can then be transformed into maximizing $\frac{1}{N}\sum_i x_i^T P \Lambda Q z_i$, which is a weighted correlation between the embedded $x$'s and $z$'s. This is the essence of many ZSL techniques, including Akata et al. (Akata et al. 2013) and Romera-Paredes and Torr (Romera-Paredes and Torr 2015). This technique can be extended to nonlinear mappings using kernel methods; however, the choice of kernels remains a challenge.

On the other side of the spectrum, the mapping $\phi: \mathbb{R}^p \rightarrow \mathbb{R}^q$ can be chosen to be highly nonlinear, as in deep neural networks (DNNs). Let a DNN be denoted as $\phi(\cdot\,;\theta)$, where $\theta$ represents the parameters of the network (i.e. synaptic weights and biases). ZSL can then be addressed by minimizing $\frac{1}{N}\sum_i \|\phi(x_i;\theta) - z_i\|_2^2$ with respect to $\theta$. Alternatively, one can nonlinearly embed the $x$'s and $z$'s in a shared metric space via deep nets, $f(x;\theta_x): \mathbb{R}^p \rightarrow \mathbb{R}^l$ and $g(z;\theta_z): \mathbb{R}^q \rightarrow \mathbb{R}^l$, and maximize their similarity measure in the embedded space, $\frac{1}{N}\sum_i f(x_i;\theta_x)^T g(z_i;\theta_z)$, as in (Lei Ba et al. 2015). Nonlinear methods are computationally expensive, require a large training dataset, and can easily overfit to the training data. On the other hand, linear ZSL algorithms are efficient, easy to train, and generalizable, but they are often outperformed by nonlinear methods. As a compromise, we model nonlinearities in data distributions as a union of linear subspaces with coupled dictionaries. By jointly learning the visual and attribute dictionaries, we effectively model the relationship between the metric spaces. This allows a nonlinear scheme with a computational complexity comparable to linear techniques. Finally, the core assumption in our work is that there exists a joint sparse space that encodes both the visual features and the semantic attributes. This assumption holds when the visual features and the semantic attributes can be modeled as a union of low-dimensional subspaces and, more importantly, when there exist correspondences between these subspaces.
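For a concrete point of reference before the dictionary-based formulation, the simple linear baseline discussed at the beginning of this section can be sketched in a few lines: ridge-regularized regression from visual features to attributes, followed by nearest-prototype assignment. This is a minimal sketch under our own simplifying assumptions; the function names and the regularization weight are illustrative and are not part of the paper.

```python
import numpy as np

def linear_zsl_baseline(X, Z, Z_unseen, lam=1.0):
    """Ridge-regression ZSL baseline.
    X: (N, p) seen visual features; Z: (N, q) their attributes;
    Z_unseen: (M, q) attribute prototypes of the unseen classes."""
    p = X.shape[1]
    # Closed-form ridge solution of min_W ||X W - Z||_F^2 + lam ||W||_F^2
    W = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Z)       # (p, q)

    def predict(x_new):
        z_hat = W.T @ x_new                                        # predicted attributes, (q,)
        # Assign the label of the nearest unseen-class prototype in attribute space.
        return int(np.argmin(np.linalg.norm(Z_unseen - z_hat, axis=1)))

    return W, predict
```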
## Zero-Shot Learning using Joint Dictionaries

Joint dictionary learning has been proposed to couple related features from two metric spaces (Yang et al. 2010; Shekhar et al. 2014). Yang et al. (Yang et al. 2010) proposed the approach to tackle the problem of image super-resolution, while Shekhar et al. (Shekhar et al. 2014) used joint dictionary learning for multimodal biometrics recognition. Following a similar framework, the gist of our approach is to learn the mapping $\phi: \mathbb{R}^p \rightarrow \mathbb{R}^q$ through two dictionaries, $D_x \in \mathbb{R}^{p \times r}$ and $D_z \in \mathbb{R}^{q \times r}$, that model $X$ and $[Z, Z']$, respectively, where $r > \max(p, q)$. The goal is to find a shared sparse representation (i.e. sparse code) $a_i$ for $x_i$ and $z_i$, such that $x_i = D_x a_i$ and $z_i = D_z a_i$. Below we describe the training and testing phases of our method.

Standard dictionary learning is based on minimizing the empirical average estimation error $\frac{1}{N}\|X - D_x A\|_F^2$ on a given training set $X$, where an $\ell_1$ regularization on $A$ enforces sparsity:

$$ D_x^*, A^* = \operatorname*{argmin}_{D_x, A} \; \frac{1}{N}\|X - D_x A\|_F^2 + \lambda\|A\|_1 \quad \text{s.t. } \|D_x^{[i]}\|_2^2 \le 1. \tag{1} $$

Here $\lambda$ is the regularization parameter, which controls the sparsity of $A$, and $D_x^{[i]}$ is the $i$'th column of $D_x$. Alternatively, following the Lagrange multiplier technique, the Frobenius norm of $D_x$ could be used as a regularizer in place of the constraint.

In our joint dictionary learning framework, we aim to learn $D_x$ and $D_z$ such that they share the sparse coefficients $A$ to represent the seen visual features $X$ and their corresponding attributes $Z$, respectively. An important twist here is that the attribute dictionary, $D_z$, is also required to sparsify the semantic attributes of the other (unseen) classes, $Z'$. To obtain such coupled dictionaries we propose the following optimization:

$$ \operatorname*{argmin}_{D_x, A, D_z, B} \; \frac{1}{Np}\Big(\|X - D_x A\|_F^2 + p\lambda\|A\|_1\Big) + \frac{1}{Nq}\|Z - D_z A\|_F^2 + \frac{1}{Mq}\Big(\|Z' - D_z B\|_F^2 + q\lambda\|B\|_1\Big) \quad \text{s.t. } \|D_x^{[i]}\|_2^2 \le 1,\; \|D_z^{[i]}\|_2^2 \le 1. \tag{2} $$

The above formulation combines the dictionary learning problems for $X$ and $Z$ by coupling them via $A$, and also enforces $D_z$ to be a sparsifying dictionary (i.e. a good model) for $Z'$. The optimization in Eq. (2), while convex in each individual variable, is highly nonconvex in all variables jointly. Following the approach proposed in (Yang et al. 2012), we use an Expectation Maximization (EM)-like alternation to update the dictionaries $D_x$ and $D_z$. To do so, we rewrite the optimization problem as the following two steps:

1. For a fixed $D_x$, update $D_z$ via the following optimization:

$$ \min_{D_z, B} \; \frac{1}{Mq}\Big(\|Z' - D_z B\|_F^2 + q\lambda\|B\|_1\Big) + \frac{1}{Nq}\|Z - D_z A\|_F^2 \quad \text{s.t. } A = \operatorname*{argmin}_{A} \frac{1}{p}\|X - D_x A\|_F^2 + \lambda\|A\|_1, \;\; \|D_z^{[i]}\|_2^2 \le 1, \tag{3} $$

where $A$ is found by solving a Lasso optimization problem, and FISTA (Beck and Teboulle 2009) is used to update $D_z$ and $B$.

2. For a fixed $D_z$, update $D_x$ via:

$$ \min_{D_x} \; \|X - D_x A\|_F^2 \quad \text{s.t. } A = \operatorname*{argmin}_{A} \frac{1}{q}\|Z - D_z A\|_F^2 + \lambda\|A\|_1, \;\; \|D_x^{[i]}\|_2^2 \le 1, \tag{4} $$

which involves a Lasso optimization together with a simple regression with a closed-form solution.
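The alternation above can be prototyped compactly. The following is a minimal sketch under simplifying assumptions: sparse codes are obtained with scikit-learn's Lasso, and the dictionary updates are plain least-squares solutions followed by projection onto the unit-norm atom constraint, rather than the constrained FISTA updates used in the paper; function names, dictionary size, and hyperparameter values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def project_columns(D):
    # Project each dictionary atom onto the unit ball (||D^[i]||_2 <= 1).
    norms = np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1.0)
    return D / norms

def sparse_code(D, X, lam):
    # Column-wise Lasso: min_A ||X - D A||_F^2 + lam ||A||_1, with X of shape (dim, N).
    coder = Lasso(alpha=lam, fit_intercept=False, max_iter=2000)
    coder.fit(D, X)              # D acts as the design matrix, X as a multi-target output
    return coder.coef_.T         # A of shape (r, N)

def train_joint_dictionaries(X, Z, Z_unseen, r=128, lam=0.1, n_iter=20, seed=0):
    """X: (p, N) seen visual features; Z: (q, N) their attributes;
    Z_unseen: (q, M) attributes of unseen classes. Returns coupled Dx, Dz."""
    rng = np.random.default_rng(seed)
    Dx = project_columns(rng.standard_normal((X.shape[0], r)))
    Dz = project_columns(rng.standard_normal((Z.shape[0], r)))
    for _ in range(n_iter):
        A = sparse_code(Dx, X, lam)              # shared codes from the visual dictionary
        B = sparse_code(Dz, Z_unseen, lam)       # codes for unseen-class attributes
        # Least-squares dictionary updates (a simplification of the FISTA updates).
        Dx = project_columns(X @ np.linalg.pinv(A))
        Dz = project_columns(np.hstack([Z, Z_unseen]) @ np.linalg.pinv(np.hstack([A, B])))
    return Dx, Dz
```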
## Zero-Shot Prediction of Unseen Attributes

In the testing phase we are only given the extracted features of unseen images, $X' = [x'_1, \ldots, x'_l] \in \mathbb{R}^{p \times l}$, and the goal is to predict their corresponding semantic attributes. Here we introduce a progression of methods, which clarifies the logic behind our approach and enables us to efficiently predict the semantic attributes of the unseen images based on the dictionaries learned in the training phase.

### Attribute Agnostic Prediction

The attribute agnostic (AAg) formulation is the naive way of predicting semantic attributes from an unseen image $x'_i$. In the AAg formulation, we first find the sparse representation $\alpha_i$ of the unseen image $x'_i$ with respect to the learned dictionary $D_x$ by solving the following Lasso problem:

$$ \alpha_i = \operatorname*{argmin}_{a} \; \frac{1}{p}\|x'_i - D_x a\|_2^2 + \lambda\|a\|_1. \tag{5} $$

Here, one could simply use $\alpha_i$ and compare it to the sparse codes of the unseen attributes, $b_j$. In our experiments, however, we found that this approach is not suitable in our JD-ZSL setting, as the dictionaries could have redundant atoms that cause two similar image features or attributes to have different sparse codes. Instead, we do the comparison in the attribute space and predict the corresponding attributes via $\hat{z}_i = D_z\alpha_i$.

In the attribute-agnostic formulation, the sparse coefficients are calculated without any information from the attribute space. Not using the information from the attribute space can lead to the domain shift problem, in the sense that there is no guarantee that $\alpha_i$ would reconstruct a meaningful attribute in $\mathcal{Z}$. In other words, $\hat{z}_i = D_z\alpha_i$ could be far from the unseen attributes, $z'_m$, and therefore could not be assigned to any known attribute with high confidence. To alleviate this problem we progress to an extended solution, which we denote the Attribute Aware (AAw) prediction.

### Attribute Aware Prediction

In the attribute-aware (AAw) formulation we would like to find a sparse representation $\alpha_i$ that not only approximates the input visual feature, $x'_i \approx D_x\alpha_i$, but also provides an attribute prediction, $\hat{z}_i = D_z\alpha_i$, that is well resolved in the attribute space and does not suffer from the domain shift problem. Note that, ideally, $\hat{z}_i = z'_m$ for some $m \in \{1, \ldots, M\}$. To achieve this we define the soft assignment of $\hat{z}_i$ to $z'_m$, denoted by $p_m$, using the Student's t-distribution as a kernel to measure similarity between $\hat{z}_i = D_z\alpha_i$ and $z'_m$:

$$ p_m(\alpha_i) = \frac{\left(1 + \frac{\|D_z\alpha_i - z'_m\|_2^2}{\rho}\right)^{-\frac{\rho+1}{2}}}{\sum_k \left(1 + \frac{\|D_z\alpha_i - z'_k\|_2^2}{\rho}\right)^{-\frac{\rho+1}{2}}}, \tag{6} $$

where $\rho$ is the kernel parameter. The choice of the t-distribution is due to its long tail and low sensitivity to the choice of the kernel parameter $\rho$. Ideally, $p_m(\alpha_i) = 1$ for some $m \in \{1, \ldots, M\}$ and $p_j(\alpha_i) = 0$ for $j \ne m$. The ideal soft-assignment $p = [p_1, p_2, \ldots, p_M]$ would then be one-sparse and therefore would have minimum entropy. This motivates our attribute-aware formulation, which regularizes the AAg formulation in Equation (5) with the entropy of $p$:

$$ \alpha_i = \operatorname*{argmin}_{a} \; \underbrace{\frac{1}{p}\|x'_i - D_x a\|_2^2 + \gamma\, h(p(a))}_{g(a)} \; + \; \lambda\|a\|_1, \tag{7} $$

where $h(p(a)) = -\sum_m p_m(a)\log(p_m(a))$ is the entropy term and $\gamma$ is the regularization parameter. Such an entropy minimization scheme has been successfully used in several works (Grandvalet and Bengio 2004; Huang, Tran, and Tran 2016), either as a sparsifying regularization or to boost the confidence of classifiers. The entropy regularization enforces the prediction to be close to one of the unseen attributes, but it can potentially backfire, in that a low-entropy solution (aligned to a prototype) does not necessarily have to be the correct solution. In our experiments, we consistently observed higher performance for the AAw formulation.

The entropy regularization turns the optimization in Eq. (7) into a nonconvex problem. In (Huang, Tran, and Tran 2016), the authors use a generalized gradient descent approach similar to FISTA to optimize this non-convex problem. We use a similar scheme to optimize the objective function in Eq. (7). In short, we relax $g(a)$ using its quadratic approximation around the previous estimate of $a$, $a^{k-1}$, and update $a$ as the solution of the following problem:

$$ a^k = \operatorname*{argmin}_{a} \; \frac{1}{2t}\left\|a - \left(a^{k-1} - t\,\nabla g(a^{k-1})\right)\right\|_2^2 + \lambda\|a\|_1. \tag{8} $$

Eq. (8) is a LASSO problem, but the solution is readily available via soft-thresholding of $\big(a^{k-1} - t\,\nabla g(a^{k-1})\big)$. It only remains to compute $\nabla g$:

$$ \nabla g(a) = \frac{2}{p} D_x^T(D_x a - x') - \gamma \sum_m \big(1 + \log(p_m(a))\big)\,\nabla p_m(a), \qquad \nabla p_m(a) = \frac{\nabla l_m(a)\sum_k l_k(a) - l_m(a)\sum_k \nabla l_k(a)}{\big(\sum_k l_k(a)\big)^2}, $$

$$ l_m(a) = \left(1 + \frac{\|D_z a - z'_m\|_2^2}{\rho}\right)^{-\frac{\rho+1}{2}}, \qquad \nabla l_m(a) = -\frac{\rho+1}{\rho}\left(D_z^T(D_z a - z'_m)\right)\left(1 + \frac{\|D_z a - z'_m\|_2^2}{\rho}\right)^{-\frac{\rho+3}{2}}. $$

Due to the non-convex nature of the objective function, a good initialization is needed to reach a sensible solution; we therefore initialize $\alpha$ from the solution of the AAg formulation. Finally, the corresponding attributes are estimated by $\hat{z}_i = D_z\alpha_i$ for $i = 1, \ldots, l$.
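A compact sketch of the prediction step, assuming AAg initialization followed by ISTA-style (soft-thresholded gradient) updates on the entropy-regularized objective, could look like the following. The analytic gradient mirrors the expressions above, while the step size, iteration count, and function names are our own illustrative choices rather than the paper's exact implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def entropy_grad(a, Dz, Z_unseen, rho):
    """Gradient of h(p(a)) for the Student-t soft assignment; Z_unseen is (q, M)."""
    diffs = (Dz @ a)[:, None] - Z_unseen                        # (q, M)
    d2 = np.sum(diffs ** 2, axis=0)                             # squared distances, (M,)
    l = (1.0 + d2 / rho) ** (-(rho + 1.0) / 2.0)                # unnormalized similarities l_m
    S = l.sum()
    p = l / S                                                   # soft assignments p_m(a)
    coef = -((rho + 1.0) / rho) * (1.0 + d2 / rho) ** (-(rho + 3.0) / 2.0)
    dl = Dz.T @ (diffs * coef[None, :])                         # column m is grad of l_m
    dS = dl.sum(axis=1)
    dp = (dl * S - np.outer(dS, l)) / S ** 2                    # column m is grad of p_m
    return -(dp @ (np.log(p + 1e-12) + 1.0))                    # grad of the entropy h(p)

def predict_attributes(x, Dx, Dz, Z_unseen, lam=0.1, gamma=0.1, rho=1.0,
                       step=1e-3, n_steps=200):
    """Predict the attribute vector for one unseen image feature x (shape (p,))."""
    p_dim = Dx.shape[0]
    # Attribute-agnostic initialization, Eq. (5): Lasso fit of x on Dx.
    init = Lasso(alpha=lam, fit_intercept=False, max_iter=2000).fit(Dx, x)
    a = init.coef_.copy()
    for _ in range(n_steps):
        # Gradient of the smooth part g(a), then a soft-thresholding (ISTA) step.
        grad = (2.0 / p_dim) * Dx.T @ (Dx @ a - x) + gamma * entropy_grad(a, Dz, Z_unseen, rho)
        a = soft_threshold(a - step * grad, step * lam)
    return Dz @ a                                               # predicted attribute vector
```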
### From Predicted Attributes to Labels

In order to predict the image labels, one needs to assign the predicted attributes, $\hat{Z} = [\hat{z}_1, \ldots, \hat{z}_l]$, to the $M$ attributes of the unseen classes, $Z'$ (i.e. the prototypes). In other words, we still need to learn the mapping $\psi: \mathcal{Z} \rightarrow \mathcal{Y}$. Here we consider learning $\psi$ in two ways, namely an inductive approach and a transductive approach. In the inductive scheme, inference can be performed using a nearest neighbor (NN) approach, in which the label of each individual $\hat{z}_i$ is assigned to be the label of its nearest neighbor $z'_m$. In such an approach the structure of the $\hat{z}_i$'s is not taken into account, and the hubness problem can easily degrade the performance of the ZSL algorithm. Looking at the t-SNE embedding visualization (Maaten and Hinton 2008) of the $\hat{z}_i$'s and $z'_m$'s in Figure 2 (details are explained later), it can be seen that NN does not provide an optimal label assignment.

Figure 2: Attributes predicted from the input visual features for the unseen classes of images of the AwA dataset using our attribute agnostic (top row) and attribute aware (bottom row) formulations. The nearest neighbor and label propagation assignments of the labels, together with the ground truth labels, are visualized. It can be seen that the attribute aware formulation together with the label propagation scheme overcomes the hubness and domain shift problems. Best seen in color.

In the transductive setting, on the other hand, the attributes for all test (i.e. unseen) images are first predicted to form $\hat{Z} = [\hat{z}_1, \ldots, \hat{z}_l]$. Next, a graph is formed on $[Z', \hat{Z}]$, where the labels for $Z'$ are known and the task is to infer the labels of $\hat{Z}$. This problem can be formulated as graph-based semi-supervised label propagation (Zhou et al. 2003). We follow the work of Zhou et al. (Zhou et al. 2003) and spread the labels of $Z'$ to $\hat{Z}$. More precisely, we first reduce the dimension of $[Z', \hat{Z}]$ via t-SNE, then form a graph in the lower dimension and perform label propagation on this graph. Figure 2 reconfirms that label propagation in a transductive setting can significantly improve the performance of ZSL and resolve the hubness and domain shift issues, as also demonstrated in (Fu et al. 2015).
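For the transductive step, a minimal sketch using off-the-shelf components, assuming t-SNE for dimensionality reduction and scikit-learn's LabelSpreading (which implements the Zhou et al. 2003 scheme), might look as follows; the kernel choice, perplexity, and function name are illustrative rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.semi_supervised import LabelSpreading

def transductive_labels(Z_prototypes, proto_labels, Z_pred, perplexity=30, seed=0):
    """Z_prototypes: (M, q) unseen-class attribute prototypes with known labels;
    proto_labels: (M,) integer labels of the prototypes;
    Z_pred: (l, q) attributes predicted for the test images.
    Returns a label for every row of Z_pred."""
    # Embed prototypes and predictions jointly in 2-D, then spread labels on that graph.
    joint = np.vstack([Z_prototypes, Z_pred])
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(joint)
    y = np.concatenate([proto_labels, -np.ones(len(Z_pred), dtype=int)])   # -1 marks unlabeled
    model = LabelSpreading(kernel='rbf', gamma=1.0).fit(emb, y)
    return model.transduction_[len(Z_prototypes):]
```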
## Theoretical Discussion

The core step for ZSL in our scheme is to compute the joint sparse representation for an unseen image. Note that in the testing phase the sparse representation $a$ is estimated using Eq. (5), while the dictionaries are learned for optimal joint sparse representations as in Eq. (2). More specifically, we need to demonstrate that the following two problems lead to close approximations:

$$ \alpha = \operatorname*{argmin}_{a} \; \|x - D_x a\|_2^2 + \|z - D_z a\|_2^2 + \lambda\|a\|_1 = \operatorname*{argmin}_{a} \left\|\begin{bmatrix} x \\ z \end{bmatrix} - \begin{bmatrix} D_x \\ D_z \end{bmatrix} a\right\|_2^2 + \lambda\|a\|_1, \qquad \alpha^+ = \operatorname*{argmin}_{a} \; \|x - D_x a\|_2^2 + \lambda\|a\|_1, $$

in order to conclude that we can solve for $\alpha^+$ in the ZSL regime (i.e. predicting attributes for unseen images) to estimate $\alpha$ with good accuracy. The major challenge in the testing phase is that we are using the dictionary $D_x \in \mathbb{R}^{p \times r}$ to find the shared sparse parameters, $\alpha$, instead of $D = [D_x, D_z]^T \in \mathbb{R}^{(p+q) \times r}$. To study the effect of this change, we first point out that Eq. (1) can be interpreted as the result of maximum a posteriori (MAP) inference from a Bayesian perspective: the $\alpha$'s are drawn from a Laplace distribution and the dictionary $D$ is a Gaussian matrix with elements drawn i.i.d., $d_{ij} \sim \mathcal{N}(0, \sigma^2)$, so given a drawn dataset we learn a MAP estimate of a Gaussian matrix. To analyze the effect, we rely on the following theorem about LASSO with Gaussian matrices (Negahban et al. 2009):

**Theorem 1** (Negahban et al. 2009): Let $\alpha^s$ be the unique sparse solution of the linear system $x = Da$ with $\|\alpha^s\|_0 = k$ and $D \in \mathbb{R}^{p \times n}$. If $\alpha$ is the LASSO solution for the system from noisy observations, then with high probability $\|\alpha - \alpha^s\|_2 \le c\sqrt{\frac{k\log n}{p}}$, where $c \in \mathbb{R}^+$ is a constant that depends on the loss function measuring the data fidelity, here the Euclidean distance.

**Lemma 1**: The attribute prediction error in the ZSL setting is upper-bounded proportionally to $\left(\frac{1}{\sqrt{p}} + \frac{1}{\sqrt{q+p}}\right)$.

Proof: Note that if $\alpha$ is a solution of $[x^T, z^T]^T = Da$, it is trivially also a solution for $x = D_x a$. Now, using Theorem 1,

$$ \|\hat{z} - \hat{z}^+\|_2 = \|D_z(\alpha - \alpha^+)\|_2 \le \|D_z\|_2\big(\|\alpha - \alpha^s\|_2 + \|\alpha^s - \alpha^+\|_2\big) \le c\,\|D_z\|_2\sqrt{k\log r}\left(\frac{1}{\sqrt{q+p}} + \frac{1}{\sqrt{p}}\right), $$

where we have used the triangle inequality first and then applied the theorem to each term ($\alpha$ is obtained from the full dictionary $D$ with $p+q$ rows, $\alpha^+$ from $D_x$ with $p$ rows), and $\|D_z\|_2$ denotes the spectral norm of $D_z$.

This result accords with intuition. First, it indicates that sparseness of $z$, i.e. smaller $k$, decreases the error, which means that a good sparsifying dictionary leads to less ZSL error. Second, the error is inversely related to both $p$ and $p+q$, meaning that rich visual and attribute descriptions lead to minimal ZSL error. This suggests that for our approach to work, the existence of a good sparsifying dictionary as well as rich visual and attribute data is essential. Finally, although increasing the number of dictionary columns $r$ can intuitively improve sparsity, i.e. decrease $k$, this result shows that it can also increase the ZSL error; $r$ should therefore be tuned for optimal performance.

## Experiments

We carried out experiments on three benchmark ZSL datasets and empirically evaluated the resulting performance against recent ZSL algorithms.

**Datasets:** We conducted our experiments on three benchmark datasets, namely the Animals with Attributes (AwA1) (Lampert, Nickisch, and Harmeling 2014), the SUN attribute (Patterson and Hays 2012), and the Caltech-UCSD Birds-200-2011 (CUB) (Wah et al. 2011) datasets. The AwA1 dataset is a coarse-grained dataset containing 30475 images of 50 types of animals with 85 corresponding attributes for these classes. Semantic attributes for this dataset are obtained via human annotation. The images of the AwA1 dataset are not publicly available; therefore we use the publicly available features of dimension 4096 extracted from a VGG19 convolutional neural network, which was pretrained on the ImageNet dataset. Following the conventional usage of this dataset, 40 classes are used as source classes to learn the model and the remaining 10 classes are used as target (unseen) classes to test the performance of zero-shot classification. The SUN dataset is a fine-grained dataset and contains 717 classes of different scene categories
with 20 images per category (14340 images in total). Each image is annotated with 102 attributes that describe the corresponding scene. Following (Lampert, Nickisch, and Harmeling 2014), 707 classes are used to learn the dictionaries and the remaining 10 classes are used for testing. The CUB200 dataset is a fine-grained dataset containing 200 classes of different types of birds, with 11788 images, 312 attributes, and boundary segmentations for each image. The attributes are obtained via human annotation. The dataset is divided into four almost equal folds, where three folds are used to learn the model and the fourth fold is used for testing. For both the SUN and CUB200-2011 datasets we used features from VGG19 trained on the ImageNet dataset, which have 4096 dimensions.

| Method | SUN | CUB | AwA |
|---|---|---|---|
| (Romera-Paredes and Torr 2015) | 82.10 | - | 75.32 |
| (Zhang and Saligrama 2015) | 82.5 | 30.41 | 76.33 |
| (Zhang and Saligrama 2016) | 82.83 | 42.11 | 80.46 |
| (Bucher, Herbin, and Jurie 2016) | 84.41 | 43.29 | 77.32 |
| (Xu et al. 2017) | 83.5 | 53.6 | 84.5 |
| (Li et al. 2017) | - | 61.79 | 87.22 |
| (Ye and Guo 2017) | 85.40 | 57.14 | 85.66 |
| (Ding, Shao, and Fu 2017) | 86.0 | 45.2 | 82.8 |
| (Wang and Chen 2017) | - | 42.7 | 79.8 |
| (Kodirov, Xiang, and Gong 2017) | 91.0 | 61.4 | 84.7 |
| Ours AAg (5) | 82.05 | 35.81 | 77.73 |
| Ours AAw (6) | 83.22 | 38.36 | 83.33 |
| Ours Transductive AAw (TAAw) | 85.90 | 47.12 | 88.23 |
| Ours TAAw hit@3 | 94.52 | 58.19 | 91.73 |
| Ours TAAw hit@5 | 98.15 | 69.67 | 97.13 |

Table 1: Zero-shot classification results for three benchmark datasets. All methods use VGG19 features trained on the ImageNet dataset and the original continuous (or binned) attributes provided by the datasets. Results are either extracted directly from the corresponding paper or reimplemented with VGG19 features; "-" indicates that the results are not reported.

**Tuning parameters:** The optimization regularization parameters $\lambda$, $\rho$, and $\gamma$, as well as the number of dictionary atoms $r$, need to be tuned for maximal performance. We used standard k-fold cross validation to search for the optimal parameters for each dataset. After splitting the datasets accordingly into training, validation, and testing sets, we used performance on the validation set to tune the parameters in a brute-force search. We used the common evaluation metric in ZSL, flat hit@K classification accuracy, to measure performance; a test image is said to be classified correctly if its correct label is among the top K predicted labels. We report the hit@1 rate to measure ZSL image classification performance, and hit@3 and hit@5 for image retrieval performance. Each experiment is performed ten times and the mean is reported in Table 1.
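As a concrete reference for the hit@K metric described above, one simple way to compute it (matching predicted attribute vectors to unseen-class prototypes by Euclidean distance; the function and variable names are ours and the matching rule is an assumption, not quoted from the paper) is:

```python
import numpy as np

def hit_at_k(Z_pred, Z_prototypes, proto_labels, true_labels, k=1):
    """Z_pred: (l, q) predicted attributes; Z_prototypes: (M, q) unseen-class prototypes;
    proto_labels: (M,) labels of the prototypes; true_labels: (l,) ground-truth labels.
    An image counts as correct if its true label is among the labels of the k
    prototypes closest to its predicted attribute vector."""
    dists = np.linalg.norm(Z_pred[:, None, :] - Z_prototypes[None, :, :], axis=2)  # (l, M)
    topk = np.argsort(dists, axis=1)[:, :k]                # indices of the k nearest prototypes
    hits = [true_labels[i] in proto_labels[topk[i]] for i in range(len(Z_pred))]
    return float(np.mean(hits))
```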
**Results:** Figure 2 shows the 2D t-SNE embedding of the predicted attributes and the actual class attributes of the AwA dataset. The actual attributes are depicted by colored circles with black edges. The first column of Figure 2 shows the attribute predictions of the AAg and AAw formulations. It can be clearly seen that the entropy regularization in the AAw formulation improves the clustering quality, decreases data overlap, and reduces the domain shift problem. The nearest neighbor label assignment is shown in the second column, which demonstrates the domain shift and hubness problems of NN label assignment in the attribute space. The third column of Figure 2 shows the transductive approach, in which label propagation is performed on the graph of the predicted attributes. Note that label propagation addresses the domain shift and hubness problems and, when used with the AAw formulation, provides significantly better zero-shot classification accuracy. In this figure each colored cluster corresponds to the predicted labels for images from one unseen class, and the black circle close to that cloud denotes the attribute description embedding for that class. This figure conveys very helpful information. First, it can be seen that our algorithm can cluster the dataset in the attribute space, which explains why ZSL can be performed. Second, it is clear that entropy regularization improves the clustering quality and decreases data overlap, which stems from enforcing the predictions to be clustered. Finally, it demonstrates why nearest neighbor search is a naive approach for final label assignment; we conclude that label propagation techniques are more suitable.

Performance comparison results are summarized in Table 1. As pointed out by Xian et al. (Xian et al. 2017), the variety of image features used (e.g. various DNNs and various combinations of these features), the variation of attributes used (e.g. word2vec, human annotation), and the different data splits make direct comparison with the ZSL methods in the literature very challenging. In Table 1 we provide a fair comparison of our JD-ZSL performance to recent methods in the literature: all compared methods use the same visual features (i.e. VGG19) and the same attributes (i.e. the continuous or binned attributes) provided with the datasets. Note that our method achieves state-of-the-art or close to state-of-the-art performance. We report the hit@1 accuracy on unseen classes in the first rows of the table to measure image classification performance. For the sake of transparency and to provide the complete picture to the reader, we include results for the AAg formulation using nearest neighbor, the AAw formulation using nearest neighbor, and the AAw formulation using the transductive approach, denoted the transductive attribute aware (TAAw) formulation. As can be seen, the AAw formulation significantly improves over the AAg formulation, and adding the transductive approach (i.e. label propagation on predicted attributes) to the AAw formulation further boosts the classification accuracy, as also shown in Figure 2. In addition, our approach leads to better or comparable performance on all three datasets, which include zero-shot scene and object recognition tasks. More importantly, while the other methods can perform well on a specific dataset, our algorithm leads to competitive performance on all three datasets.

## Conclusions

We proposed a novel ZSL formulation, which models the relationship between visual features and semantic attributes via joint sparse dictionaries. The proposed method effectively projects the data into a shared subspace of sparse coefficients. We demonstrated that while a classic joint dictionary learning approach can still suffer from the domain shift problem, an entropy regularization scheme can help with this phenomenon and provide superior zero-shot performance. In addition, we demonstrated that a transductive approach towards assigning labels to the predicted attributes can boost the performance considerably and lead to state-of-the-art zero-shot classification. Finally, we compared our method to recent approaches in the literature and demonstrated its competitiveness on benchmark datasets.

## References

Akata, Z.; Perronnin, F.; Harchaoui, Z.; and Schmid, C. 2013. Label-embedding for attribute-based classification. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 819-826.

Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183-202.

Bucher, M.; Herbin, S.; and Jurie, F. 2016. Improving semantic embedding consistency by metric learning for zero-shot classification. In European Conference on Computer Vision, 730-746. Springer.

Deutsch, S.; Kolouri, S.; Kim, K.; Owechko, Y.; and Soatto, S. 2017. Zero shot learning via multi-scale manifold regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7112-7119.
Ding, Z.; Shao, M.; and Fu, Y. 2017. Low-rank embedded ensemble semantic dictionary for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2050-2058.

Dinu, G.; Lazaridou, A.; and Baroni, M. 2014. Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568.

Fei-Fei, L.; Fergus, R.; and Perona, P. 2006. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4):594-611.

Fu, Y.; Hospedales, T. M.; Xiang, T.; and Gong, S. 2015. Transductive multi-view zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(11):2332-2345.

Grandvalet, Y., and Bengio, Y. 2004. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, volume 17, 529-536.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In International Conference on Computer Vision, 770-778.

Huang, G.; Liu, Z.; Weinberger, K. Q.; and van der Maaten, L. 2017. Densely connected convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 4700-4708.

Huang, S.; Tran, D. N.; and Tran, T. D. 2016. Sparse signal recovery based on nonconvex entropy minimization. In IEEE International Conference on Image Processing, 3867-3871. IEEE.

Isele, D.; Rostami, M.; and Eaton, E. 2016. Using task features for zero-shot knowledge transfer in lifelong learning. In Proc. of International Joint Conference on Artificial Intelligence, 1620-1626.

Kodirov, E.; Xiang, T.; Fu, Z.; and Gong, S. 2015. Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, 2452-2460.

Kodirov, E.; Xiang, T.; and Gong, S. 2017. Semantic autoencoder for zero-shot learning. 3174-3183.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097-1105.

Lampert, C. H.; Nickisch, H.; and Harmeling, S. 2009. Learning to detect unseen object classes by between-class attribute transfer. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, 951-958. IEEE.

Lampert, C. H.; Nickisch, H.; and Harmeling, S. 2014. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(3):453-465.

Lei Ba, J.; Swersky, K.; Fidler, S.; et al. 2015. Predicting deep zero-shot convolutional neural networks using textual descriptions. In International Conference on Computer Vision, 4247-4255.

Li, Y.; Wang, D.; Hu, H.; Lin, Y.; and Zhuang, Y. 2017. Zero-shot recognition using dual visual-semantic mapping paths. 3279-3287.

Li, X.; Guo, Y.; and Schuurmans, D. 2015. Semi-supervised zero-shot classification with label representation learning. In Proceedings of the IEEE International Conference on Computer Vision, 4211-4219.

Maaten, L., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111-3119.

Negahban, S.; Yu, B.; Wainwright, M.; and Ravikumar, P. 2009. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers.
In Advances in Neural Information Processing Systems, 1348-1356.

Norouzi, M.; Mikolov, T.; Bengio, S.; Singer, Y.; Shlens, J.; Frome, A.; Corrado, G. S.; and Dean, J. 2014. Zero-shot learning by convex combination of semantic embeddings. International Conference on Learning Representations.

Palatucci, M.; Pomerleau, D.; Hinton, G. E.; and Mitchell, T. M. 2009. Zero-shot learning with semantic output codes. In Advances in Neural Information Processing Systems, 1410-1418.

Patterson, G., and Hays, J. 2012. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2751-2758. IEEE.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Empirical Methods on Natural Language Processing, volume 14, 1532-43.

Romera-Paredes, B., and Torr, P. 2015. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, 2152-2161.

Shekhar, S.; Patel, V. M.; Nasrabadi, N. M.; and Chellappa, R. 2014. Joint sparse representation for robust multimodal biometrics recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(1):113-126.

Shigeto, Y.; Suzuki, I.; Hara, K.; Shimbo, M.; and Matsumoto, Y. 2015. Ridge regression, hubness, and zero-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 135-151. Springer.

Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Socher, R.; Ganjoo, M.; Manning, C. D.; and Ng, A. 2013. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, 935-943.

Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 dataset.

Wang, Q., and Chen, K. 2017. Zero-shot visual recognition via bidirectional latent embedding. International Journal of Computer Vision 124(3):356-383.

Xian, Y.; Lampert, C. H.; Schiele, B.; and Akata, Z. 2017. Zero-shot learning: a comprehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:1707.00600.

Xu, X.; Shen, F.; Yang, Y.; Zhang, D.; Shen, H. T.; and Song, J. 2017. Matrix tri-factorization with manifold regularizations for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, 3798-3807.

Yang, J.; Wright, J.; Huang, T. S.; and Ma, Y. 2010. Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19(11):2861-2873.

Yang, J.; Wang, Z.; Lin, Z.; Cohen, S.; and Huang, T. 2012. Coupled dictionary training for image super-resolution. IEEE Transactions on Image Processing 21(8):3467-3478.

Ye, M., and Guo, Y. 2017. Zero-shot classification with discriminative semantic representation learning. 17140-17148.

Yu, X., and Aloimonos, Y. 2010. Attribute-based transfer learning for object categorization with zero/one training example. European Conference on Computer Vision, 127-140.

Yu, Z.; Wu, F.; Yang, Y.; Tian, Q.; Luo, J.; and Zhuang, Y. 2014. Discriminative coupled dictionary hashing for fast cross-media retrieval. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 395-404. ACM.

Zhang, Z., and Saligrama, V. 2015. Zero-shot learning via semantic similarity embedding. In International Conference on Computer Vision, 4166-4174.

Zhang, Z., and Saligrama, V. 2016.
Zero-shot learning via joint latent similarity embedding. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 6034-6042.

Zhou, D.; Bousquet, O.; Lal, T. N.; Weston, J.; and Schölkopf, B. 2003. Learning with local and global consistency. In Advances in Neural Information Processing Systems, volume 16, 321-328.