# Supervised Set-to-Set Hashing in Visual Recognition

I-Hong Jhuo, CODAIT, IBM. ihjhuo@gmail.com

## Abstract

Visual data, such as an image or a sequence of video frames, is often naturally represented as a point set. In this paper, we consider the fundamental problem of finding, for a query set, the nearest set in a collection of sets. This problem has obvious applications in large-scale visual retrieval and recognition, and also in applied fields beyond computer vision. One challenge stands out in solving the problem: set representation and the measure of similarity. In particular, the query set and the sets in the database collection can have varying cardinalities, and the training collection is large enough that linear scan is impractical. We propose a simple representation scheme that encodes both statistical and structural information of the sets. The derived representations are integrated in a kernel framework for flexible similarity measurement. For the query process, we adopt a learning-to-hash pipeline that turns the kernel representations into hash bits based on simple learners, using multiple kernel learning. Experiments on two visual retrieval datasets show unambiguously that our set-to-set hashing framework outperforms prior methods that do not account for the set-to-set search setting.

## 1 Introduction

Searching for similar data samples is a fundamental step in many large-scale applications. As data sizes explode, hashing techniques have emerged as a unique option for approximate nearest neighbor (ANN) search, since they can dramatically reduce both computational time and storage space. Successes are seen in areas including computer vision and information retrieval [Kulis et al., 2009; Sun et al., 2017; Wang et al., 2012a; Wang et al., 2016]. Hashing methods perform space partitioning to encode the original high-dimensional data points into binary codes. With the resulting binary hash codes, one can perform extremely rapid ANN search that entails only sublinear search complexity.

Conventional hashing schemes concern the point-to-point (P2P) search setting. They either depend on randomization and are data oblivious (represented by the classic Locality Sensitive Hashing, LSH), or are based on machine learning techniques that learn hash functions better tailored to the specific data and/or label distribution. The latter includes unsupervised [Gong and Lazebnik, 2011; Weiss et al., 2008], semi-supervised [Wang et al., 2012a; Wang et al., 2010], and supervised hashing [Kulis and Darrell, 2009; Salakhutdinov and Hinton, 2009; Mu et al., 2010; Liu et al., 2012a; Zhang and Li, 2014; Shen et al., 2015; Li et al., 2016; Weng et al., 2019].

A natural generalization of point-to-point search is set-to-set (S2S) search. For example, one can pose the facial image recognition problem as one that queries for the nearest subspace to a given point [Wang et al., 2013]. Indeed, there are several recent attempts studying point-to-hyperplane search, which is useful for active learning [Liu et al., 2012b], or subspace-to-subspace search [Basri et al., 2011], which models S2S search assuming linear structures in the sets. In this paper, we consider the set-to-set search problem in its full generality. This general setting finds applications ranging from video-based surveillance to 3D face retrieval from collections of 2D images [Berretti et al., 2010; Tuzel et al., 2007; Sivic et al., 2005].
Compared to the specialized settings discussed above, which come with a natural notion of distance, a central challenge here is how to measure the distance/similarity between sets. We propose a similarity measure that captures both the statistical and structural aspects of the sets (Section 3). To learn the hash bits, we adopt the dyadic hypercut as a weak learner [Moghaddam and Shakhnarovich, 2002] and derive a boosted algorithm that integrates both the structural and statistical similarities. The whole framework is illustrated in Fig. 1. In this paper, we focus on image applications, and hence coin the name Image Set Hashing (ISH). However, the core components of the proposed framework extend to generic scenarios.

[Figure 1: Illustration of the proposed ISH framework. First, statistical and structural information of the image sets are encoded. Second, appropriate kernel mappings are chosen to measure the similarities between image sets. A boosted algorithm based on the two kernels is used to construct the hash function.]

## 2 Related Work

In this section, we discuss representative works in randomization-based hashing and learning-based hashing for the P2P setting, and recent work on certain restricted S2S settings. Reviews of recent developments in hashing techniques can be found in [Wang et al., 2016; Wang et al., 2014].

LSH and its variants are iconic randomization-based hashing schemes. They are simple in theory, efficient in practice, and flexible enough to handle various distance measures [Charikar, 2002; Datar et al., 2004; Kulis and Darrell, 2009]. However, the data-oblivious design of the LSH family often causes a suboptimal recall-precision tradeoff. Hence, learning-based hashing schemes have been under active development recently. Generally, both the data and the label information are fed into a carefully designed learning pipeline to produce more adaptive and efficient hash codes. Depending on whether label information is used, these schemes are either unsupervised, e.g., spectral hashing [Weiss et al., 2008], graph hashing, and ITQ (iterative quantization) [Gong and Lazebnik, 2011], or (semi-)supervised, including [Lin et al., 2013; Liu et al., 2012a; Mu et al., 2010; Norouzi and Fleet, 2011; Sablayrolles et al., 2017; Wang et al., 2010]. In particular, recent efforts have built more powerful hashing schemes on top of deep learning [Salakhutdinov and Hinton, 2009; Salakhutdinov and Hinton, 2007; Liong et al., 2015; Masci et al., 2013]. All these methods deal only with the P2P setting.

Study of the S2S setting started only very recently, and is mostly about image applications. Statistical or geometric assumptions are often made on the sets to facilitate representation. For example, a statistical distribution of the data points in each set can be assumed, and the KL-divergence can be used to measure set similarity [Arandjelovic et al., 2005]. Alternatively, point sets can be assumed to lie on linear subspaces or more general geometric objects [Cevikalp and Triggs, 2010; Hu et al., 2011; Kim et al., 2007; Liu et al., 2014; Sun et al., 2014; Sun et al., 2015; Wang et al., 2012b]. Among the representation schemes, representation based on covariance matrices has led to superior performance on image sets (video frames) [Wang et al., 2012b; Lu et al., 2013; Tuzel et al., 2007].
For instance, in [Li et al., 2015], a covariance matrix is used in this way, and similarity is then measured via kernel mapping and learning. Promising results have been reported, but the framework is restricted to cases where the query is a single point. For image applications specifically, the Set Compression Tree [Relja and Zisserman, 2014] compresses a set of image descriptors jointly (rather than individual descriptors) and achieves a very small memory footprint (as low as 5 bits). However, all existing representation methods fail to account for holistic structural information; this may lead to incorrect similarity measurements on highly nonlinear data distributions. Our representation and hashing scheme is a first attempt to directly address the above problems in the general S2S setting.

[Figure 2: Extracting the structural information and measuring the structural similarity. In the top row, graphs are constructed to represent individual sets: nodes are data points and graph weights indicate the similarities. Dense cliques (red circles) are then extracted to reveal the holistic structure of a set. Point sets with similar clique structures (sets 2 and 1 as shown) are assumed to have higher similarity. The bottom row shows the similarity matrices, in which yellow blocks indicate the high similarities among data points in each clique. Best viewed in color.]

## 3 Structural and Statistical Modeling

In this section, we detail how the structural and statistical information is extracted from the point sets, and how the similarity between sets is measured.

### 3.1 Structure via Graph Modeling

The idea here is to use graphs to discover structure within the data, and then measure similarity via an appropriate kernel mapping on graphs [Tenenbaum et al., 2000; Gartner, 2003; Zhou et al., 2009]; Fig. 2 gives an example. To model the data points within a set, we derive an affinity matrix A based on quantized pairwise distances [Zhou et al., 2009]: if the distance is larger than a predefined threshold µ, the corresponding affinity value in A is set to 0, and to 1 otherwise. We use x_i to denote the point sets, and A_i to denote the corresponding affinity matrices thus constructed. With all the A_i at hand, the point-set similarity is defined as

$$ K_g(x_i, x_j) = \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \bar{A}_{ip}\, \bar{A}_{jq}\, g(x_{ip}, x_{jq}), \qquad (1) $$

where $\bar{A}_{ip} = \frac{1}{n_i}\sum_{u} A_{i,pu}$, $\bar{A}_{jq} = \frac{1}{n_j}\sum_{v} A_{j,qv}$, and $g(x_{ip}, x_{jq}) = \exp(-\gamma_g \|x_{ip} - x_{jq}\|^2)$. The parameter γ_g is a constant, and n_i and n_j are the numbers of data points in x_i and x_j, respectively.

To understand the captured structural information, each clique (formed by several 1 entries) in A_i can be regarded as one concept. If A_i is an all-one matrix, all data points in one set belong to one concept and each point set is treated as a single data point. When A_i is an identity matrix, each data point is independent, and no relation can be discovered. When A_i is a clique-based matrix, the data can be clustered into cliques and K_g is a clique-based graph kernel. In this way, we leverage the structural information for our set hashing.
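To make the construction concrete, below is a minimal NumPy sketch of the affinity matrix and the graph kernel of eq. (1). The function names `affinity_matrix` and `graph_kernel`, and the default values of `mu` and `gamma_g`, are illustrative assumptions; the point weights follow the reconstruction of the row-normalized affinities given after eq. (1), not a verified reference implementation.

```python
import numpy as np

def affinity_matrix(X, mu):
    """Binary affinity for one set X of shape (n, d): entry (p, u) is 1 when
    the pairwise distance ||x_p - x_u|| is at most the threshold mu, else 0."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return (np.sqrt(d2) <= mu).astype(float)

def graph_kernel(Xi, Xj, mu=1.0, gamma_g=1.0):
    """Structural similarity K_g between two point sets (sketch of eq. (1)).
    Each point is weighted by its normalized affinity-row sum, i.e. how densely
    it is connected inside its own set, before a Gaussian point-to-point match."""
    Ai, Aj = affinity_matrix(Xi, mu), affinity_matrix(Xj, mu)
    wi = Ai.sum(axis=1) / len(Xi)          # \bar{A}_{ip}
    wj = Aj.sum(axis=1) / len(Xj)          # \bar{A}_{jq}
    d2 = np.sum((Xi[:, None, :] - Xj[None, :, :]) ** 2, axis=-1)
    g = np.exp(-gamma_g * d2)              # g(x_ip, x_jq)
    return float(wi @ g @ wj)
```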
### 3.2 Statistical Information

Covariance matrices have provided effective local region representations for visual recognition and human identification [Tuzel et al., 2007; Liu et al., 2014]. Intuitively, they describe the local image statistics. In this work, we use covariance matrices to depict the statistical variance of the images within each set. Given N image sets X = {(x_1, l_1), ..., (x_i, l_i), ..., (x_N, l_N)}, x_i = {x_{i,1}, ..., x_{i,n_i}} is an image set, where x_{i,j} ∈ R^d is the j-th d-dimensional feature vector in x_i, the set consists of n_i images, and l_i is the label of the image set. Each image set is represented by a d × d covariance matrix:

$$ C_i = \frac{1}{n_i} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^\top, \qquad (2) $$

where $\bar{x}_i$ is the mean feature vector within the set. The diagonal elements of C_i represent the variance of each individual feature dimension, and the off-diagonal elements are their respective covariances. In this way, the covariance modeling provides a desirable statistic for the semantic variance among all images in an image set. Moreover, we use the Gaussian logarithm kernel [Jayasumana et al., 2013] to map the covariance matrix of each image set into a high-dimensional space:

$$ K_s(x_i, x_j) = \phi(C_i)^\top \phi(C_j) = \exp\big(-\|\log(C_i) - \log(C_j)\|_F^2 / 2\gamma_s^2\big), \qquad (3) $$

where γ_s is a positive constant, set to the mean distance of the training points. The kernel matrix K_s is computed for image sets in the Riemannian space, and ‖·‖_F denotes the matrix Frobenius norm. In the end, each pair of image sets x_i and x_j is mapped by the K_g and K_s kernel functions into a high-dimensional space.

## 4 Image Set Hashing

### 4.1 Learning Framework

Suppose we have an image set dataset X = {x_i, l_i}. The goal of hashing is to generate an array of hash functions h : R^d → {0, 1} through a designed function Ψ, where each bit is constructed as h(x) = sign(Ψ(x)). However, there is no straightforward way to generate hash codes for an image set. Inspired by [Vemulapalli et al., 2013; Li et al., 2015; Wang et al., 2012b], we develop a mapping framework to construct hash codes for image sets in a common Hamming space based on multiple kernels. Kernel methods [Hamm and Lee, 2009; Jayasumana et al., 2013; Vemulapalli et al., 2013; Wang et al., 2012b] are known to capture and unfold rich information in the data distribution. After the mapping process, we generate hash codes in a Hamming space by simultaneously considering the structural and statistical information and iteratively maximizing the discriminant margins via multiple kernel learning.

### 4.2 Weak Learners with a Boosting Algorithm for Hash Functions

Since a multi-class classification problem can always be treated as an array of two-class problems by adopting one-against-one or one-against-all strategies, we design a boosting algorithm to learn binary splits for constructing hash functions. Specifically, we consider the dyadic hypercut [Moghaddam and Shakhnarovich, 2002] with multiple kernel functions. A dyadic hypercut f is generated by a kernel together with a pair of training samples with different labels: f is parameterized by a positive sample x_a, a negative sample x_b, and a kernel function from {K_m}, where m indicates the statistical kernel (K_s) or the structural kernel (K_g):

$$ f(x) = \mathrm{sign}\big(K_m(x_a, x) - K_m(x_b, x) + \varepsilon\big), \qquad (4) $$

where ε ∈ R is a threshold. The size of the generated weak learner pool is |f| = M · n_a · n_b, where M, n_a, and n_b are the numbers of kernels, positive training samples, and negative training samples, respectively. With an efficient boosting process, we iteratively select a subset of weak learners according to the learning loss. Note that the learning process may be susceptible to overfitting when |f| is large.
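As a sketch, the covariance descriptor of eq. (2), the statistical kernel of eq. (3), and the dyadic-hypercut weak learner of eq. (4) could look as follows with NumPy/SciPy. The ridge term `eps`, the default parameter values, and the function names are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np
from scipy.linalg import logm

def covariance_descriptor(X, eps=1e-6):
    """d x d covariance of one image set X of shape (n, d), as in eq. (2);
    a small ridge keeps the matrix positive definite so the matrix log exists."""
    C = np.cov(X, rowvar=False, bias=True)      # bias=True -> 1/n normalization
    return C + eps * np.eye(C.shape[0])

def statistical_kernel(Ci, Cj, gamma_s=1.0):
    """Gaussian kernel on matrix logarithms of two covariance descriptors, eq. (3)."""
    diff = logm(Ci) - logm(Cj)
    return float(np.exp(-np.linalg.norm(diff, 'fro') ** 2 / (2 * gamma_s ** 2)))

def dyadic_hypercut(K_col_a, K_col_b, eps=0.0):
    """Weak learner of eq. (4): sign of the kernel-similarity gap between a positive
    anchor set a and a negative anchor set b. K_col_a[i] = K_m(x_a, x_i) and
    K_col_b[i] = K_m(x_b, x_i) for the sets x_i being encoded."""
    return np.sign(K_col_a - K_col_b + eps)
```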
To alleviate the overfitting issue, we adopt a boosting algorithm [Freund and Schapire, 1995; Moghaddam and Shakhnarovich, 2002] to combine a number of weak splits (weak learners) into a strong one. Specifically, we iteratively select the most discriminant weak learners generated from the multiple kernels while maintaining a weighted distribution w^t over the data. Each iteration t produces a weak hypothesis f(x) : x → {+1, −1} and a weighted error δ_t. The learning algorithm selects the weak learner f^t that minimizes δ_t and then updates the next distribution w^{t+1}. We adopt the exponential loss [Freund and Schapire, 1995] and minimize it to select the best weak learner f^t at iteration t:

$$ f^t = \arg\min_{f} \sum_{i=1}^{N} w_i^t \exp\big(-l_i f(x_i)\big), \qquad (5) $$

where w_i^t is the weight of x_i at iteration t. Once the best weak learner is obtained, we update the data distribution based on the weighted errors. The linear combination of weak learners, i.e., a strong split, is computed as

$$ F(x) = \mathrm{sign}\Big(\sum_{t=1}^{T} \lambda_t f^t(x)\Big), \qquad (6) $$

where λ_t = 1 − δ_t. At each iteration t, F = Σ_{τ=1}^{t−1} λ_τ f^τ is a linear combination of the previous (t − 1) weak learners.
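The weak-learner selection of eqs. (5)-(6) can be sketched as a plain AdaBoost-style loop in Python. The pool layout, the reweighting rule, and the weight λ_t = 1 − δ_t (taken literally from the text; classic AdaBoost would instead use ½ log((1 − δ_t)/δ_t)) are assumptions for illustration, not the paper's exact training code.

```python
import numpy as np

def boost_split(F_pool, labels, T=20):
    """Select T weak learners from a precomputed pool (sketch of Sec. 4.2).
    F_pool: array of shape (|f|, N) with F_pool[k, i] = f_k(x_i) in {-1, +1}.
    labels: array of shape (N,) in {-1, +1}.
    Returns the indices and weights lambda_t of the selected learners, so the
    strong split is F(x) = sign(sum_t lambda_t * f_t(x)) as in eq. (6)."""
    N = len(labels)
    w = np.ones(N) / N
    chosen, lambdas = [], []
    for _ in range(T):
        # exponential loss of every candidate under the current weights, eq. (5)
        losses = (w[None, :] * np.exp(-labels[None, :] * F_pool)).sum(axis=1)
        k = int(np.argmin(losses))
        delta = (w * (F_pool[k] != labels)).sum()   # weighted error of the pick
        lam = 1.0 - delta                           # lambda_t = 1 - delta_t (as in the text)
        chosen.append(k)
        lambdas.append(lam)
        w *= np.exp(-lam * labels * F_pool[k])      # reweight: emphasize mistakes
        w /= w.sum()
    return chosen, lambdas
```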
### 4.3 Objective Function

With the designed hash functions, the desired properties of the hash codes are: (1) each hash bit is independent in the binary representation of each sample; (2) when samples are close to each other in feature space (e.g., with similar distributions), the hash codes should take similar values with a small Hamming distance; (3) in the resulting Hamming space, samples with different content should have different hash codes, which pushes samples of different categories as far apart as possible while gathering samples of the same category close to each other. Based on these criteria, we derive our multiple kernel hashing as follows:

$$ \min_{H_q, H_r} \; \alpha D_s + \beta D_c + \nu_1 \sum_{r' \in \{1:R\}} \sum_{i \in \{1:N\}} \Omega $$
$$ \text{s.t.} \quad H^{r'}_{i,q} = \mathrm{sign}\Big(\sum_{t=1}^{T} \lambda_t f^{t,r'}_q(x_i)\Big), \; i \in \{1:N\}, \; r' \in \{1:R\}, $$
$$ \qquad\;\; H^{r'}_{i,r} = \mathrm{sign}\Big(\sum_{t=1}^{T} \lambda_t f^{t,r'}_r(x_i)\Big), \; i \in \{1:N\}, \; r' \in \{1:R\}, \qquad (7) $$

where H^{r'}_i is the hash value of the i-th image set produced by the r'-th strong split, i.e., hash function, and the subscripts r and q denote the retrieval and query parts of the training sets. R is the number of splits, and Ω = Σ_{t=1}^{T} (λ_t^q f^{t,r'}_q(x_i) + ν_2 λ_t^r f^{t,r'}_r(x_i)). f^{r'} is the r'-th strong split generated from the T weighted weak learners of eq. (6), λ_t is the coefficient of a weak learner trained via the boosting algorithm, and ν_1 and ν_2 are constant parameters. Minimizing the first two terms, D_s and D_c, seeks an optimal contrast between intra- and inter-category distances, which captures the discriminative property among all samples; they are defined in eq. (8) and eq. (9), and d(·,·) denotes the distance measure in Hamming space. In addition, D_c refines the hash codes generated from the training image sets (the q and r parts) by maximizing the category separability following the algorithm in [Rastegari et al., 2012]. Considering the two distance functions simultaneously helps to minimize the within-category distances while maximizing the between-category distances. By formulating the structural and statistical information with multiple kernels in our objective function, the image set hashing becomes more robust and discriminative.

$$ D_s = \sum_{(m,n) \in \mathcal{M}} d(H^m, H^n) - \nu_3 \sum_{(m,n) \in \mathcal{C}} d(H^m, H^n), \qquad (8) $$

$$ D_c = \sum_{(m,n) \in \mathcal{M}} d(H^m_q, H^n_r) - \nu_4 \sum_{(m,n) \in \mathcal{C}} d(H^m_q, H^n_r), \qquad (9) $$

where M and C denote the intra- and inter-category pairs, H can be H_q or H_r, and ν_3 and ν_4 are pre-computable constant parameters that balance the intra-category and inter-category scales.

### 4.4 Optimization

The optimization problem (7) is a typical nonsmooth, nonconvex minimization problem over multiple variables. We derive an iterative block coordinate descent algorithm [Tseng, 2001] for the optimization; Algorithm 1 gives the entire procedure.

Algorithm 1: Image Set Hashing
Input: a set of training image sets X = {x_i, l_i}, divided into q and r training parts, where x_i = {x_{ij}}_{j=1}^{n_i} ⊂ R^d, i ∈ {1, 2, ..., N}, l_i ∈ {1, ..., L}.
Initialize: compute the kernel matrices for the q training image sets, i.e., (K_m)_q, with the kernel functions K_g and K_s according to eq. (1) and eq. (3); similarly compute (K_m)_r for the r training image sets.
1: V_q ∈ R^{N×R}, V_r ∈ R^{N×R} ← kernel PCA with the statistical kernels for K_q and K_r, respectively
2: H_q ← sign(V_q^T (K_s)_q)
3: H_r ← sign(V_r^T (K_s)_r)
Optimization:
4: while not converged do
5:   Optimize H_q, H_r with eq. (8)
6:   Train R splits by weak learner selection in eq. (6) on the q kernels (K_m)_q using H_r as training labels, and inversely train another R splits on the r kernels (K_m)_r using H_q as training labels
7:   H_q ← sign(Σ_{r'=1}^{R} F^{r'}_q(x)) on (K_m)_q
8:   H_r ← sign(Σ_{r'=1}^{R} F^{r'}_r(x)) on (K_m)_r
9:   Optimize H = [H_q, H_r] ∈ {0, 1}^{R×2N} with eq. (9)
10:  Train R splits by weak learner selection in eq. (6) on the q kernels (K_m)_q using H_r as training labels, and inversely train another R splits on the r kernels (K_m)_r using H_q as training labels
11:  Check the convergence condition
12: end while
13: Output: Σ_{r'=1}^{R} F^{r'}_q(x) and Σ_{r'=1}^{R} F^{r'}_r(x) for query and database encoding in the testing process, respectively.

Here, we highlight several critical steps. We first compute the kernel matrices K_g and K_s with eq. (1) and eq. (3) for the q and r training image sets. After the kernel computation, we apply kernel PCA [Scholkopf et al., 1997] to the q and r parts to obtain the initial hash codes, i.e., H_q and H_r, based on their statistical kernels (Steps 1 to 3). Next, we update the H_q and H_r codes by optimizing eq. (8) to seek discriminability, using an efficient subgradient descent algorithm [Rastegari et al., 2012] for the binary optimization (this optimization yields codes with two properties: sample-wise balance and bit-wise balance). In Step 6, we use the updated hash codes H_q and H_r to train R two-class strong splits based on multiple kernels. More specifically, we adopt a cross-training strategy [M. Rastegari J. Choi and Davis, 2013]: the hash codes H_r serve as training labels for the strong splits trained on the q training image sets, and H_q plays the same role for the r training image sets. After that, we update the current hash codes H_q and H_r using the learned strong splits based on multiple kernel learning. To further improve the discriminability, we combine H_q and H_r and refine the learned hash codes with eq. (9). The process is then repeated; convergence typically occurs within a few outer iterations. Once the two strong split models (Σ_{r'=1}^{R} F^{r'}_q(x) and Σ_{r'=1}^{R} F^{r'}_r(x) in Algorithm 1) are obtained, we use them to generate the query and database hash codes in the testing process, respectively.
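For concreteness, the intra/inter-category Hamming contrast behind D_s and D_c in eqs. (8)-(9), which Steps 5 and 9 of Algorithm 1 optimize, can be sketched as below. The single-matrix signature, the helper names, and the toy data are illustrative simplifications (eq. (9) contrasts codes across the q and r parts rather than within one code matrix).

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two binary code vectors in {0, 1}^R."""
    return int(np.sum(a != b))

def discriminative_term(H, labels, nu=1.0):
    """Sum of within-category code distances minus nu times the between-category
    distances for one code matrix H of shape (N, R), in the spirit of eq. (8).
    Smaller values mean same-category sets share codes while different
    categories are pushed apart in Hamming space."""
    N = len(labels)
    intra = inter = 0
    for m in range(N):
        for n in range(m + 1, N):
            d = hamming(H[m], H[n])
            if labels[m] == labels[n]:
                intra += d
            else:
                inter += d
    return intra - nu * inter

# toy usage: 4 image sets, 8-bit codes, two categories
H = np.random.randint(0, 2, size=(4, 8))
labels = np.array([0, 0, 1, 1])
print(discriminative_term(H, labels))
```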
## 5 Experiments

We evaluate the effectiveness of the proposed Image Set Hashing (ISH) method on two well-known benchmarks: CIFAR-10 and a TV-series dataset, i.e., the Big Bang Theory. We also conduct extensive comparison studies with state-of-the-art methods, including Locality Sensitive Hashing (LSH) [Indyk and Motwani, 1998], Spectral Hashing (SH) [Weiss et al., 2008], Kernelized LSH (KLSH) [Kulis and Darrell, 2009], Semi-Supervised Hashing (SSH) [Wang et al., 2010], and the supervised methods Kernel-Based Supervised Hashing (KSH) [Liu et al., 2012a] and Hashing across Euclidean space and Riemannian manifold (HER) [Li et al., 2015]. For the competing techniques, we adopted the publicly released code of SH, KLSH, KSH, and HER in our experiments.

[Figure 3: Mean recall curves for Hamming ranking using 24 bits on the CIFAR-10 dataset. The number of retrieved samples is up to 1600.]

[Figure 4: Mean precision curves for Hamming ranking using 24 bits on the CIFAR-10 dataset. The number of retrieved samples is up to 1600.]

### 5.1 Experiment on CIFAR-10

We compare the performance of the different hashing techniques on the CIFAR-10 dataset. As a labeled subset of the 80M Tiny Images, the CIFAR-10 dataset consists of a total of 60K color images of size 32 × 32, organized into ten object categories of 6K images each. To evaluate the performance, we uniformly and randomly sample images from each category to form a total of 195 image sets, each containing about 25 to 50 images, for the training process (the q and r parts), 100 image sets as queries in testing, and 1577 image sets as the testing database for the KLSH, KSH, HER, and ISH methods. To test the LSH, SH, and SSH methods, we randomly select 1K images as queries and use the remaining images as database samples. For feature representation, each image is represented by a 512-dimensional GIST feature vector [Oliva and Torralba, 2001].

We evaluate two test scenarios: Hamming ranking and hash look-up. Fig. 3 and Fig. 4 show the mean recall and precision curves for different numbers of returned samples when using 24-bit hash codes. As we can see, the proposed ISH method produces a higher-quality Hamming embedding, significantly outperforming the competing methods in terms of precision, recall, and MAP. In general, the methods using set information provide better performance than those based on the P2P setting. For instance, the HER method generates the second-best MAPs for most of the test cases. The relative performance gain in MAP over the HER method ranges from 6% to 23.6%. Such gains confirm the value of exploiting statistical and clique-based structural information for hash function design. In Fig. 5, we also evaluate the results using the hash look-up strategy, showing the precision within Hamming radius 2 for hash codes from 8 to 48 bits. Again the proposed ISH method achieves the best precision across all cases.

### 5.2 Experiment on TV-series

The Big Bang Theory (BBT) video (image set) benchmark¹ was collected by [Bauml et al., 2013] and contains 3341 face videos from episodes 1 to 6 of season one.
The dataset includes around 5 to 8 main cast characters and has multiple characters in the same full-view scene shots. Even though most of the scenes are shot indoors, the benchmark is still extremely challenging since the face regions have quite low resolution, with an average size of 75 pixels. In the experiments, we use the provided face features, which are extracted from the face videos by the block Discrete Cosine Transformation (DCT); each face is thus represented by a 240-dimensional feature vector.

¹ https://cvhci.anthropomatik.kit.edu/baeuml/datasets.html

We use two different settings. In the first setting, we follow [Li et al., 2015] and use still images for the q part in training and as queries in testing; we denote this setting ISH0. In the second setting, denoted ISH, we use 150 image sets for each of the q and r parts in the training process. For the queries in testing, we use 100 image sets, and the remaining image sets form the database. This setting can fully utilize the statistical and structural information. For the comparison, the first group of compared methods consists of seven point-to-point (P2P) hashing methods, i.e., LSH, ITQ, SH, Discriminative Binary Codes (DBC) [Rastegari et al., 2012], SSH, MM-NN [Masci et al., 2013], and KSH (point). In addition, we generate kernels for the image sets (represented by covariance matrices) and employ them as input for the KLSH, KSH (set), and HER methods; this second group of methods thus uses kernels for image sets. Such a modification further justifies the advantage of exploiting the structural and statistical information of image sets.

Table 1: Mean Average Precision on the video (image set) benchmark TV-series (BBT, the TV drama The Big Bang Theory). The length of the hash codes ranges from 8 to 128 bits. Besides the proposed ISH, the first group of compared methods consists of seven P2P hashing methods, i.e., LSH, ITQ, SH, DBC, SSH, MM-NN, and KSH. The second group includes three modified techniques, i.e., KLSH, KSH, and HER, that use image set information as input.

| Method | 8 bits | 16 bits | 32 bits | 64 bits | 128 bits |
|---|---|---|---|---|---|
| LSH [Indyk and Motwani, 1998] | 0.2109 | 0.2086 | 0.2092 | 0.1963 | 0.1994 |
| ITQ [Gong and Lazebnik, 2011] | 0.2935 | 0.3025 | 0.2989 | 0.3029 | 0.3060 |
| SH [Weiss et al., 2008] | 0.2377 | 0.2652 | 0.2665 | 0.2623 | 0.2673 |
| DBC [Rastegari et al., 2012] | 0.4489 | 0.4495 | 0.4235 | 0.4005 | 0.3867 |
| SSH [Wang et al., 2010] | 0.2716 | 0.2855 | 0.2662 | 0.2584 | 0.3003 |
| MM-NN [Masci et al., 2013] | 0.3752 | 0.3955 | 0.4664 | 0.5124 | 0.4922 |
| KLSH [Kulis and Darrell, 2009] | 0.2450 | 0.2498 | 0.2381 | 0.2256 | 0.2325 |
| KSH (point) | 0.4090 | 0.4366 | 0.4454 | 0.4567 | 0.4604 |
| KSH [Liu et al., 2012a] (set) | 0.4590 | 0.4619 | 0.4534 | 0.4685 | 0.4631 |
| HER [Li et al., 2015] | 0.4606 | 0.5049 | 0.5227 | 0.5490 | 0.5539 |
| ISH0 | 0.4833 | 0.5279 | 0.5359 | 0.5501 | 0.5712 |
| ISH | 0.5018 | 0.5592 | 0.5864 | 0.6007 | 0.6280 |

[Figure 5: Mean precision within Hamming radius 2 on the CIFAR-10 dataset for hash codes from 8 to 48 bits. The curves of LSH, SH, and SSH are randomly sampled.]

The performance of the compared methods, evaluated by MAP, is shown in Table 1; we vary the number of hash bits from 8 to 128 bits.
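Since the comparison above is measured by mean average precision over a Hamming ranking, a short sketch of how such a MAP score can be computed from binary codes may be helpful. The function name and the stable-sort tie-breaking are assumptions for illustration, not the exact evaluation script used for the reported numbers.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """MAP over a Hamming ranking: for every query code, rank the database by
    Hamming distance and average the precision at each relevant position."""
    aps = []
    for q, ql in zip(query_codes, query_labels):
        dist = np.count_nonzero(db_codes != q, axis=1)   # Hamming distances
        order = np.argsort(dist, kind='stable')           # ranking, ties kept stable
        rel = (db_labels[order] == ql).astype(float)       # 1 where category matches
        if rel.sum() == 0:
            continue
        precision_at_hit = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((precision_at_hit * rel).sum() / rel.sum())
    return float(np.mean(aps))
```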
As we can see, ISH0 already generates the second-best MAPs even when the structural information is ignored in the q training part and in query testing. Moreover, the full ISH method achieves the best performance in all tested cases when both statistical and structural information are considered. This evidence shows that the structural information helps to generate more robust and discriminative hash codes.

[Figure 6: Mean precision curves for Hamming ranking using 128 bits on the BBT video dataset.]

In addition, we use 128-bit hash codes and show the performance curves for the compared methods. Fig. 6, Fig. 7, and Fig. 8 show the precision, recall, and precision-recall curves, respectively. From these results, the proposed ISH achieves the best performance among all the baseline techniques, including both P2P methods and modified S2S methods. The underlying reason is that the ISH method simultaneously captures statistical (covariance matrix) and structural (graph kernel) information, i.e., through the combination of weak learners, when generating each hash code. Hence, it captures the most intrinsic characteristics within the complicated variations of face images of the same subject. Moreover, comparing the P2P and S2S settings for the KSH method, we observe that S2S hashing consistently attains higher search accuracy, which further confirms the advantage of using set information. However, simply employing the covariance matrix as input limits the performance improvement. In summary, with the S2S setting and full exploration of the statistical and structural information, the proposed ISH yields significant performance gains across all the experiments.

[Figure 7: Mean recall curves for Hamming ranking using 128 bits on the BBT dataset.]

[Figure 8: Mean precision-recall curves for Hamming ranking using 128 bits on the BBT dataset.]

## 6 Conclusion

We have presented the set-to-set (S2S) ANN search problem and proposed to learn optimal hash codes for image sets by simultaneously exploiting statistical and structural information. The key idea is to map the image sets into a high-dimensional space where each image set can be characterized by a graph kernel and a statistical measurement. As a result, the proposed S2S hashing achieves a robust and discriminative representation for searching point sets. The experimental results demonstrate the effectiveness of the proposed ISH method, showing superior performance over several representative competing approaches.

## References

[Arandjelovic et al., 2005] O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, and T. Darrell. Face recognition with image sets using manifold density divergence. In CVPR, 2005.

[Basri et al., 2011] R. Basri, T. Hassner, and L. Zelnik-Manor. Approximate nearest subspace search. In IEEE Trans. PAMI, 2011.

[Bauml et al., 2013] M. Bauml, M. Tapaswi, and R. Stiefelhagen. Semi-supervised learning with constraints for person identification in multimedia data. In CVPR, 2013.

[Berretti et al., 2010] S. Berretti, A. D. Bimbo, and P. Pala. 3D face recognition using isogeodesic stripes. In IEEE Trans. PAMI, 2010.
[Cevikalp and Triggs, 2010] H. Cevikalp and B. Triggs. Face recognition based on image sets. In CVPR, 2010.

[Charikar, 2002] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.

[Datar et al., 2004] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality sensitive hashing scheme based on p-stable distributions. In SCG, 2004.

[Freund and Schapire, 1995] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory, 1995.

[Gartner, 2003] T. Gartner. A survey of kernels for structured data. In SIGKDD, 2003.

[Gong and Lazebnik, 2011] Y. Gong and S. Lazebnik. Iterative quantization: a procrustean approach to learning binary codes. In CVPR, 2011.

[Hamm and Lee, 2009] J. Hamm and D. Lee. Grassmann discriminant analysis: a unifying view on subspace-based learning. In ICML, 2009.

[Hu et al., 2011] Y. Hu, A. S. Mian, and R. Owens. Sparse approximated nearest points for image set classification. In CVPR, 2011.

[Indyk and Motwani, 1998] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, 1998.

[Jayasumana et al., 2013] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi. Kernel methods on the Riemannian manifold of symmetric positive definite matrices. In CVPR, 2013.

[Kim et al., 2007] T.-K. Kim, J. Kittler, and R. Cipolla. Discriminative learning and recognition of image set classes using canonical correlations. In IEEE Trans. PAMI, 2007.

[Kulis and Darrell, 2009] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, 2009.

[Kulis et al., 2009] B. Kulis, P. Jain, and K. Grauman. Fast similarity search for learned metrics. In IEEE Trans. PAMI, 2009.

[Li et al., 2015] Y. Li, R. Wang, Z. Huang, S. Shan, and X. Chen. Face video retrieval with image query via hashing across Euclidean space and Riemannian manifold. In CVPR, 2015.

[Li et al., 2016] W.-J. Li, S. Wang, and W.-C. Kang. Feature learning based deep supervised hashing with pairwise labels. In IJCAI, 2016.

[Lin et al., 2013] G. Lin, C. Shen, D. Suter, and A. V. D. Hengel. A general two-step approach to learning-based hashing. In ICCV, 2013.

[Liong et al., 2015] V. E. Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep hashing for compact binary codes learning. In CVPR, 2015.

[Liu et al., 2012a] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In CVPR, 2012.

[Liu et al., 2012b] W. Liu, J. Wang, Y. Mu, S. Kumar, and S.-F. Chang. Compact hyperplane hashing with bilinear functions. In ICML, 2012.

[Liu et al., 2014] M. Liu, S. Shan, R. Wang, and X. Chen. Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. In CVPR, 2014.

[Lu et al., 2013] J. Lu, G. Wang, and P. Moulin. Image set classification using holistic multiple order statistics features and localized multi-kernel metric learning. In ICCV, 2013.

[M. Rastegari J. Choi and Davis, 2013] M. Rastegari, J. Choi, S. Fakhraei, H. Daume, and L. Davis. Predictable dual-view hashing. In ICML, 2013.

[Masci et al., 2013] J. Masci, M. Bronstein, A. Bronstein, and J. Schmidhuber. Multimodal similarity-preserving hashing. In ICLR, 2013.

[Moghaddam and Shakhnarovich, 2002] B. Moghaddam and G. Shakhnarovich. Boosted dyadic kernel discriminants. In NIPS, 2002.

[Mu et al., 2010] Y. Mu, J. Shen, and S. Yan.
Weakly-supervised hashing in kernel space. In CVPR, 2010.

[Norouzi and Fleet, 2011] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In ICML, 2011.

[Oliva and Torralba, 2001] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. In IJCV, 2001.

[Rastegari et al., 2012] M. Rastegari, A. Farhadi, and D. Forsyth. Attribute discovery via predictable discriminative binary codes. In ECCV, 2012.

[Relja and Zisserman, 2014] A. Relja and A. Zisserman. Extremely low bit-rate nearest neighbor search using a set compression tree. In IEEE Trans. PAMI, 2014.

[Sablayrolles et al., 2017] A. Sablayrolles, M. Douze, H. Ju, and N. Usunier. How should we evaluate supervised hashing? In ICASSP, 2017.

[Salakhutdinov and Hinton, 2007] R. Salakhutdinov and G. E. Hinton. Learning a nonlinear embedding by preserving class neighborhood structure. In ICAIS, 2007.

[Salakhutdinov and Hinton, 2009] R. Salakhutdinov and G. Hinton. Semantic hashing. In International Journal of Approximate Reasoning, 2009.

[Scholkopf et al., 1997] B. Scholkopf, A. Smola, and K.-R. Muller. Kernel principal component analysis. In ICANN, 1997.

[Shen et al., 2015] F. Shen, C. Shen, W. Liu, and H. T. Shen. Supervised discrete hashing. In CVPR, 2015.

[Sivic et al., 2005] J. Sivic, M. Everingham, and A. Zisserman. Person spotting: video shot retrieval for face sets. In Image and Video Retrieval, 2005.

[Sun et al., 2014] X. Sun, Q. Qu, N. M. Nasrabadi, and T. D. Tran. Structured priors for sparse-representation-based hyperspectral image classification. In IEEE GRSL, 2014.

[Sun et al., 2015] X. Sun, N. M. Nasrabadi, and T. D. Tran. Task-driven dictionary learning for hyperspectral image classification with structured sparsity constraints. In IEEE Trans. GRSL, 2015.

[Sun et al., 2017] X. Sun, N. M. Nasrabadi, and T. D. Tran. Supervised multilayer sparse coding networks for image classification. In CoRR, 2017.

[Tenenbaum et al., 2000] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. In Science, 2000.

[Tseng, 2001] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. In J. of Optimization Theory and Applications, 2001.

[Tuzel et al., 2007] O. Tuzel, F. Porikli, and P. Meer. Human detection via classification on Riemannian manifolds. In CVPR, 2007.

[Vemulapalli et al., 2013] R. Vemulapalli, J. K. Pillai, and R. Chellappa. Kernel learning for extrinsic classification of manifold features. In CVPR, 2013.

[Wang et al., 2010] J. Wang, S. Kumar, and S.-F. Chang. Sequential projection learning for hashing with compact codes. In ICML, 2010.

[Wang et al., 2014] J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: a survey. In arXiv:1408.2927, 2014.

[Wang et al., 2012a] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large scale search. In IEEE Trans. PAMI, 2012.

[Wang et al., 2012b] R. Wang, H. Guo, L. S. Davis, and Q. Dai. Covariance discriminative learning: a natural and efficient approach to image set classification. In CVPR, 2012.

[Wang et al., 2013] X. Wang, S. Atev, J. Wright, and G. Lerman. Fast subspace search via Grassmannian based hashing. In ICCV, 2013.

[Wang et al., 2016] J. Wang, W. Liu, S. Kumar, and S.-F. Chang. Learning to hash for indexing big data - a survey. In Proceedings of the IEEE, 104(1):34-57, 2016.

[Weiss et al., 2008] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.
[Weng et al., 2019] L. Weng, I.-H. Jhuo, and W.-H. Cheng. Perceptual hashing for large-scale multimedia search. John Wiley & Sons, 2019.

[Zhang and Li, 2014] D. Zhang and W.-J. Li. Large-scale supervised multimodal hashing with semantic correlation maximization. In AAAI, 2014.

[Zhou et al., 2009] Z.-H. Zhou, Y.-Y. Sun, and Y.-F. Li. Multi-instance learning by treating instances as non-i.i.d. samples. In ICML, 2009.