# Proximity Preserving Binary Code Using Signed Graph-Cut

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Inbal Lavi,¹ Shai Avidan,¹ Yoram Singer,² Yacov Hel-Or³
¹Department of Electrical Engineering, Tel-Aviv University, Israel
²Department of Computer Science, Princeton University, NJ, USA
³School of Computer Science, The Interdisciplinary Center, Israel
inballavi@mail.tau.ac.il, toky@idc.ac.il

Abstract

We introduce a binary embedding framework, called Proximity Preserving Code (PPC), which learns similarity and dissimilarity between data points to create a compact and affinity-preserving binary code. This code can be used to apply fast and memory-efficient approximation to nearest-neighbor searches. Our framework is flexible, enabling different proximity definitions between data points. In contrast to previous methods that extract binary codes based on unsigned graph partitioning, our system models the attractive and repulsive forces in the data by incorporating positive and negative graph weights. The proposed framework is shown to boil down to finding the minimal cut of a signed graph, a problem known to be NP-hard. We offer an efficient approximation and achieve superior results by constructing the code bit after bit. We show that the proposed approximation is superior to the commonly used spectral methods with respect to both accuracy and complexity. Thus, it is useful for many other problems that can be translated into a signed graph-cut.

Introduction

Content-based image retrieval is a fundamental problem in computer vision, media indexing, and data analysis. A common solution consists of assigning each image an indicative feature vector and retrieving similar images by defining a distance metric in the feature-vector space. One of the successful uses of deep learning is data embedding (Chopra, Hadsell, and LeCun 2005; Koch, Zemel, and Salakhutdinov 2015), where a network is used to map input data into a feature-vector space satisfying some desired distance properties. This technique has many applications, such as word embedding for machine translation (Mikolov et al. 2013), face embedding for identity recognition (Taigman et al. 2014; Schroff, Kalenichenko, and Philbin 2015; Wen et al. 2016; Liu et al. 2017), and many more. The main idea behind data embedding is to find a mapping from the input space into a vector space where the distances in the embedding space conform with the desired task.

In a typical scenario, the embedding space is several hundreds of bytes long (e.g., 512 bytes in the FaceNet embedding (Schroff, Kalenichenko, and Philbin 2015)), and a new query may be compared to the existing images by nearest-neighbor (NN) search. As the number of images scales up, the memory required to store all the examples becomes too large, and the time complexity of NN search becomes a critical bottleneck. Many solutions have been proposed to mitigate this issue, including dimensionality reduction (Liu and Schisterman 2004) and approximate NN search (Muja and Lowe 2014). In recent years, a family of algorithms called Binary Hashing or Hamming Embedding has gained popularity. These algorithms find a mapping from a feature space into a Hamming space using a variety of methods.
The main advantages of a binary representation are the significant reductions in storage and in the time required to compare vectors: vectors are compared not in high-dimensional Euclidean space but in Hamming space, using the extremely fast XOR operation. This representation is particularly valuable in mobile systems, where on-device training is limited by computational constraints; such settings require ad-hoc hashing methods that can be computed on simple hardware and that generalize well to novel data points.

Many modern Hamming-embedding techniques are data-dependent. Data-dependent methods learn an affinity matrix between data points and attempt to preserve these affinities in Hamming space. In-sample techniques aim at generating a set of binary codes, a single code for each data point, whose Hamming distances conform with the affinity matrix. Out-of-sample techniques deal with novel samples that are not known in advance; they learn a general functional mapping from feature space into Hamming space that can be applied to query points. Affinity between data pairs can, for example, be related to the metric distances between their associated features, or to semantic relations indicating that data points belong to the same semantic class. The affinity matrix is usually relaxed to positive values, where small values indicate weak proximity (far pairs) and large values indicate strong proximity (near pairs). This encourages near pairs to be located close by in the Hamming space but does not constrain the far pairs.

We propose a binary hashing framework called Proximity Preserving Code (PPC). The main contribution of our method is that the binary code is constructed from both positive and negative proximity values, representing attractive and repulsive forces. These forces properly arrange the points in the Hamming space while respecting the pairwise affinities. Our solution models this proximity as a signed graph, and the code is computed by finding the min-cut of the graph. Due to the negative values, this problem can be formulated as the max-cut problem and is known to be NP-hard (Alon and Naor 2004). We demonstrate that our approach is more accurate and memory-efficient than state-of-the-art graph-based embeddings.

Previous Works

Previous works in Hamming embedding can be classified into two distinct categories: data-independent and data-dependent. Data-independent methods are composed of various techniques for dimensionality reduction or for dividing the N-dimensional space into buckets with equal distributions. One of the most popular data-independent hashing methods is Locality Sensitive Hashing (LSH) (Datar et al. 2004), a family of hash functions that map similar items to the same hash bucket with higher probability than dissimilar items.

Data-dependent methods learn the distribution of the data in order to create accurate and efficient mapping functions. These functions usually comprise three elements: the hash function, the similarity measure, and an optimization criterion. Hash functions vary and include linear functions (Norouzi and Blei 2011), nearest-vector assignment (He, Wen, and Sun 2013), kernel functions (Kulis and Darrell 2009), neural networks (Lin et al. 2015), and more. Similarity measures include the Hamming distance and different variants of the Euclidean or other compute-intensive distances that are precomputed for vector assignment (Jegou, Douze, and Schmid 2011).
Optimization criteria mainly use variants of similarity preservation and code balancing. We focus here on binary hashing methods.

An influential work in binary hashing is Spectral Hashing (Weiss, Torralba, and Fergus 2009). This method creates code words $\{c_i\}$ that preserve the data similarity. Defining an affinity matrix $W_{ij} = \exp(-\|x_i - x_j\|^2/\epsilon^2)$, the authors cast the hashing problem as a minimization of $\sum_{ij} W_{ij}\|c_i - c_j\|^2$, subject to $c_i \in \{-1, 1\}^k$, $\sum_i c_i = 0$ (code balancing), and $\frac{1}{n}\sum_i c_i c_i^T = I$ (independence). This minimization problem for a single bit can be cast as a graph-partitioning problem, which is known to be NP-hard. A good approximation is achieved using spectral methods: the code is obtained by computing the $k$ eigenvectors corresponding to the smallest eigenvalues of the graph Laplacian of $W$ and thresholding them at zero.

Liu et al. (2011) proposed Anchor Graph Hashing, which uses a low-rank matrix that approximates the affinity matrix, allowing a graph-Laplacian solution that is scalable both for training and for out-of-sample computation. Shen et al. (2013) present Inductive Manifold Hashing, a method that learns a manifold from the data and utilizes it to find a Hamming embedding; they demonstrate their results with several approaches, including Laplacian eigenmaps and t-SNE. Shen et al. (2015) and Liu et al. (2014) directly optimize the discrete problem and employ discrete coordinate descent to achieve better precision on the graph problem. Scalable Graph Hashing with Feature Transformation (Jiang and Li 2015) uses a feature-transformation method to approximate the affinity graph, allowing faster computation on large-scale datasets; it also learns the code bit-by-bit in a sequential manner, allowing error correction of the previously computed bits. Li, Hu, and Nie (2017) revisit the spectral solution of the Laplacian graph and propose a spectral rotation that improves the accuracy of the solutions.

All of the above approaches formulate the graph Laplacian by defining an affinity matrix that accounts for the similarities between points in the training set. However, they do not address the dissimilarity, or push-pull forces, in the data set. In this paper, we propose a binary embedding method that employs an affinity matrix of both positive and negative values. We argue that this type of affinity better represents the relationships between data points, allowing more accurate code generation. The characteristics and advantages of this work are as follows:

- Our code is constructed by solving a signed graph-cut problem, for which we propose a highly accurate solution. We demonstrate that the signed graph better encodes the forces present in the coding optimization.
- We show that the commonly used spectral solution, which works well for unsigned graph-cut problems, is unnecessary, costly, and inferior in this scenario (a short sketch of this spectral baseline is given below).
- The code is computed one bit at a time, allowing for error correction during the construction of the hashing functions.
- We split the optimization into two steps: we first optimize for a binary vector representing the in-sample data, and then fit hashing functions to obtain accurate codes for out-of-sample points.
- Our framework is flexible, allowing various proximity definitions, including semantic proximity. This can be useful for many applications, especially in low-computation environments.
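The spectral-relaxation baseline referred to above can be sketched in a few lines. The following is our own minimal illustration (not code from any of the cited works), assuming a Gaussian affinity with bandwidth `eps`; it builds the graph Laplacian, takes the eigenvectors associated with the smallest non-trivial eigenvalues, and thresholds them at zero to obtain the bits.

```python
# Sketch of a Spectral-Hashing-style baseline (our own illustration).
import numpy as np

def spectral_baseline_codes(X, k, eps):
    """X: (n, d) data matrix, k: number of bits, eps: affinity bandwidth."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / eps ** 2)          # Gaussian affinity matrix
    D = np.diag(W.sum(axis=1))
    L = D - W                                 # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    V = eigvecs[:, 1:k + 1]                   # skip the trivial (constant) eigenvector
    return np.where(V >= 0, 1, -1)            # (n, k) matrix of +/-1 bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))
    print(spectral_baseline_codes(X, k=16, eps=1.0).shape)  # (200, 16)
```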
Problem Formulation

We are given a set of $n$ data points $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, in some vector space, and a proximity relation between pairs of points $(i, j) \in S$, where $S = \{1..n\} \times \{1..n\}$. We assign each pair of points in $S$ to the Near or the Far group according to some proximity measure. This proximity measure can have a semantic meaning, a geometric meaning, or any other adjacency relation. Formally, we define

$N = \{(i, j) \mid x_i \text{ and } x_j \text{ are in the same class}\}$
$F = \{(i, j) \mid x_i \text{ and } x_j \text{ are in different classes}\}$

Note that $N$ and $F$ induce a partition of $S$ into two disjoint sets: $N \cup F = S$ and $N \cap F = \emptyset$. In a classification scenario, for example, two points belonging to the same class will be defined as Near; otherwise, they will be defined as Far.

Figure 1: (a) PPC algorithm overview. To compute the $t$th bit, we first find the optimal $\alpha$ for the existing code ($t-1$ bits), then compute a new bit $b$ by minimizing the loss $E$. For this bit, we compute a binary classifier $h_t(x)$ that is used as the $t$th hashing function. (b) Illustration of an unsigned graph (left) vs. a signed graph (right) describing the same relations between nodes. Graph edges in green (solid lines) have positive weights, and edges in red (dashed lines) have negative weights. The line thickness indicates the weight magnitude.

Another example of an adjacency relation is a neighborhood of a certain radius. For a distance metric $d_{ij} = d(x_i, x_j)$ in $\mathbb{R}^d$ and a given radius $r$ we define:

$N = \{(i, j) \mid d_{ij} \le r\}$ and $F = \{(i, j) \mid d_{ij} > r\}$   (1)

Denote the $p$-length binary code of point $x_i$ by $c_i \in \{\pm 1\}^p$. Our goal is to find $n$ binary codes $\{c_i\}_{i=1}^n$ that satisfy the following two requirements:

- Compactness: The length of the code should be short, i.e., $p$ should be as small as possible.
- Proximity preserving: The binary code should preserve the proximity of $X$. That is, there exists a constant $\alpha$ such that $d_H(c_i, c_j) \le \alpha$ for each pair $(i, j) \in N$, and $d_H(c_i, c_j) > \alpha$ for each $(i, j) \in F$, where $d_H(\cdot, \cdot)$ stands for the Hamming distance between two binary codes: $d_H(c_i, c_j) = \sum_{k=1}^{p}(1 - c_i[k]\,c_j[k]) = p - c_i^T c_j$. (In fact, this definition is twice the Hamming distance, but we stick with it for the sake of clarity.)

It can be shown that if proximity relationships are determined according to the $\ell_1$ or $\ell_2$ distance between points in $\mathbb{R}^d$, the proximity-preserving requirement can be fully satisfied using large enough codes (i.e., large $p$). Due to the compactness requirement, however, we relax the proximity-preserving requirement and look for an optimal code for a given code length.

Denote by $y_{ij}$ a proximity label associated with each pair of points $(i, j) \in S$:

$y_{ij} = +1$ if $(i, j) \in N$, and $y_{ij} = -1$ if $(i, j) \in F$.

For a given value $\alpha > 0$ we define:

$z_{ij} = y_{ij}\,(\alpha - d_H(c_i, c_j))$   (2)

We would like $z_{ij} \ge 0$ for each pair $(i, j)$, and accordingly we define a loss function:

$l(z_{ij}) = 1$ if $z_{ij} < 0$, and $l(z_{ij}) = 0$ otherwise.   (3)

The empirical loss for the entire set reads:

$E(\{y_{ij}\}, \{c_i\}) = \min_{\alpha} \sum_{(i,j) \in S} l(z_{ij})$   (4)

This loss penalizes pairs of points that are mislabeled, that is, pairs of points in $F$ whose Hamming distance is smaller than $\alpha$, or pairs of points in $N$ whose Hamming distance is larger than $\alpha$.

Definition 1 (Proximity Preserving Code). Given a set of data points $X$ along with their proximity labels $\{y_{ij}\}$, a Proximity Preserving Code (PPC) of length $p$ is a binary code $\{c_i\}_{i=1}^n$, $c_i \in \{\pm 1\}^p$, that minimizes $E(\{y_{ij}\}, \{c_i\})$.
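The following is a small numerical sketch of the definitions above (our own illustration; the function names are not from the paper): r-neighborhood labels $y_{ij}$ as in Equation 1, the (doubled) Hamming distance $d_H(c_i, c_j) = p - c_i^T c_j$, and the empirical loss of Equation 4, with $\alpha$ searched over all attainable thresholds.

```python
# Sketch of the proximity labels, Hamming distance, and empirical loss (Eqs. 1-4).
import numpy as np

def proximity_labels(X, r):
    """y[i, j] = +1 for pairs closer than r (Near), -1 otherwise (Far)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.where(d <= r, 1, -1)

def hamming(C):
    """C: (p, n) code matrix with +/-1 entries; returns d_H = p - c_i^T c_j for all pairs."""
    return C.shape[0] - C.T @ C

def empirical_loss(y, C):
    """Eq. 4: the number of violated pairs, minimized over the threshold alpha."""
    dH = hamming(C)
    best = np.inf
    for alpha in range(2 * C.shape[0] + 1):   # doubled Hamming distances lie in 0..2p
        z = y * (alpha - dH)                  # Eq. 2
        best = min(best, int(np.sum(z < 0)))  # Eq. 3 summed over all pairs
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-0.5, 0.5, size=(100, 2))
    y = proximity_labels(X, r=0.2)
    C = rng.choice([-1, 1], size=(8, 100))    # a random 8-bit code, for illustration only
    print(empirical_loss(y, C))
```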
In the following we describe the procedure for generating the PPC. In particular, we show that finding a PPC for a given set of points boils down to applying an integer low-rank matrix decomposition. We provide two possible approximate solutions and show their connection to the minimum signed graph-cut problem. Finally, we provide a solution for extracting hashing functions for out-of-sample data points.

Proximity Preserving Code

Recall the definition of $z_{ij}$ (Equation 2): $z_{ij} = y_{ij}(\alpha - d_H(c_i, c_j))$. Substituting the Hamming distance into this expression we get:

$z_{ij} = y_{ij}\,(\alpha - (p - c_i^T c_j)) = y_{ij}\,(c_i^T c_j - \beta)$   (5)

where we define $\beta = p - \alpha$. To simplify notation we define a code matrix $C \in \{\pm 1\}^{p \times n}$ by stacking the code words along its columns:

$C = [\,c_1\ c_2\ \cdots\ c_n\,]$

Similarly, we define $B = C^T C$, where $B_{ij} = c_i^T c_j$. Equation 5 can now be written over the entries of the matrix $B$: $z_{ij} = y_{ij}(B_{ij} - \beta)$, and the total loss (Equation 4) becomes:

$E(\{y_{ij}\}, B) = \min_{\beta} \sum_{ij} l(z_{ij})$   (6)

Denote by $b_k$ the rows of $C$ (equivalently, the columns of $C^T$), such that

$C^T = [\,b_1\ b_2\ \cdots\ b_p\,]$

Each $b_k \in \{\pm 1\}^n$ is a vector representing the $k$th bit of all $n$ code words. The matrix $B = C^T C$ can now be represented as a sum of $n \times n$ rank-1 matrices:

$B = \sum_{k=1}^{p} b_k b_k^T = \sum_{k=1}^{p} B^k$

where $B^k = b_k b_k^T$ is a rank-1 matrix extracted from the $k$th bit of the code words. Thus, each additional bit can either increase the rank of the matrix $B$ or leave it the same. Our goal then is to find a low-rank matrix $B = C^T C$ minimizing the loss defined in Equation 6.

The minimization defined in Equation 6 poses a combinatorial problem which is NP-hard. Therefore we relax the binary loss and re-define it using the logistic loss: $l(z_{ij}) = \ell(y_{ij}(B_{ij} - \beta))$, where $\ell(z) = \ln(1 + e^{-z})$. The relaxed total loss is therefore

$E(\{y_{ij}\}, B) = \min_{\beta} \sum_{ij} \ell(y_{ij}(B_{ij} - \beta))$   (8)

Bit Optimization

In the proposed process we generate the codes for the $n$ data points sequentially, bit after bit. In the following we detail the minimization process for bit $k$; this is also illustrated in Figure 1a. At this step we assume that $k-1$ bits of the PPC code have already been generated. Denote

$B^{1:k-1} = \sum_{\ell=1}^{k-1} B^{\ell}$, where $B^{\ell} = b_{\ell} b_{\ell}^T$.

For the $k$th bit, we minimize Equation 8 with respect to $B^k$ and $\beta$:

$E_k = \min_{\beta,\, B^k} \sum_{ij} \ell\big(y_{ij}(B^{1:k-1}_{ij} + B^k_{ij} - \beta)\big)$   (9)

Note that $B^{1:k-1}_{ij}$ is already known at step $k$. As mentioned above, $E_k$ is minimized using alternating minimization, described below.

Step I - optimizing $\beta$: $E_k$ is convex with respect to $\beta$, so any scalar search is applicable here. Since the loss $\ell(z)$ is nearly linear for $z \ll 0$, a fast yet sufficiently accurate approximation for $\beta$ is to choose the value that equates the number of misclassified pairs in the $N$ and $F$ sets. For the current code $\{c_i\}_{i=1}^n$, $c_i \in \{\pm 1\}^{k-1}$, and a constant value $\alpha$, define the misclassified sets:

$E_N(\alpha) = \{(i, j) \mid (i, j) \in N \text{ and } d_H(c_i, c_j) > \alpha\}$
$E_F(\alpha) = \{(i, j) \mid (i, j) \in F \text{ and } d_H(c_i, c_j) \le \alpha\}$

The value of $\alpha$ is set such that the cardinalities of the two sets are equal, i.e., $\hat\alpha$ satisfies

$|E_N(\hat\alpha)| = |E_F(\hat\alpha)|$   (10)

and accordingly $\hat\beta = (k-1) - \hat\alpha$. This is visualized in Step I of Figure 1a, which shows a histogram of the near pairs in blue and the far pairs in red; $\alpha$ is the vertical black line thresholding the Hamming distance.

Step II - optimizing $b_k$: For the estimated $\hat\beta$, Equation 9 becomes:

$E_k = \min_{B^k} \sum_{ij} \ell\big(y_{ij}(B^{1:k-1}_{ij} + B^k_{ij} - \hat\beta)\big) = \min_{B^k} \sum_{ij} \ell\big(y_{ij}(B^k_{ij} + \gamma^{k-1}_{ij})\big)$

where we define $\gamma^{k-1}_{ij} = B^{1:k-1}_{ij} - \hat\beta$. In a forward greedy selection process, we approximate the potential decrease in the loss using the gradient.
Our goal is to find $B^k$ that minimizes $E_k \approx E_{k-1} + \Delta E_k$, or alternatively maximizes $-\Delta E_k$, where

$\Delta E_k = \sum_{ij} \frac{\partial E_k}{\partial B^k_{ij}}\, B^k_{ij} = \sum_{ij} \ell'(y_{ij}\gamma^{k-1}_{ij})\, y_{ij}\, B^k_{ij}$

and $\ell'$ stands for the derivative of the logistic loss, $\ell'(z) = -1/(1 + e^{z})$. Defining $w_{ij} = -y_{ij}\,\ell'(y_{ij}\gamma^{k-1}_{ij})$, we arrive at the following maximization problem:

$\max_{b_k} \sum_{ij} w_{ij}\, B^k_{ij} = \max_{b_k} \sum_{ij} w_{ij}\, b_k[i]\, b_k[j]$

where the maximization is taken over all entries of $b_k \in \{\pm 1\}^n$. For the sake of simplicity we omit the superscript $k$ and denote $b_k$ by $b$. Collecting $\{w_{ij}\}$ into a matrix $W$, with $W(i, j) = w_{ij}$, the above maximization can be expressed in matrix form:

$\hat b = \arg\max_{b}\, b^T W b$  s.t.  $b \in \{\pm 1\}^n$   (11)

If the weight matrix $W$ were all positive (all entries positive), this problem could be interpreted as a graph min-cut problem. In our problem, however, the matrix $W$ comprises both positive and negative values, indicating pairs $(i, j)$ that are properly and improperly assigned as near or far according to the code computed so far. This is termed in the literature a signed min-cut problem, which is equivalent to the max-cut problem, whose solution is NP-hard.

In the proposed solution we start with an initial guess for the bit vector $b$ and improve it using a forward greedy selection scheme. We present two iterative approaches for the selection scheme: vector update and bit update.

Vector Update

Given an initial guess for $b$, the vector update method updates the entire vector at once. At each iteration the vector is improved by applying $b' = \mathrm{sign}(Wb)$, where $b'$ is the updated vector and satisfies $b'^T W b' \ge b^T W b$. The following four theorems prove that $b'$ is at least as good a vector as $b$:

Theorem 1. For any $n \times n$ matrix $W$ and $b, b' \in \{\pm 1\}^n$, if $b' = \mathrm{sign}(Wb)$, then $b'^T W b \ge b^T W b$.

Theorem 2. Assuming $W$ is positive semidefinite, if $b'^T W b \ge b^T W b$, then $b'^T W b' \ge b^T W b$.

Theorem 3. Any symmetric matrix $W$ can be made positive semidefinite by applying $W \leftarrow W + |\lambda| I$, where $\lambda$ is the smallest eigenvalue of $W$.

Theorem 4. Adding a constant value to the diagonal of the weight matrix $W$ does not affect the output code computed.

Proofs are given in the Appendix. Using the above theorems, Algorithm 1 summarizes the vector update iterations.

Algorithm 1: Vector Update (b, W)
  b̂ ← b
  W ← W + |λ|I, where λ is the smallest eigenvalue of W
  repeat
    b ← b̂
    b̂ ← sign(W b)
  until b̂ = b
  return b
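A minimal sketch of the vector-update scheme of Algorithm 1, under our own naming and assuming NumPy: shift $W$ by $|\lambda|I$ to make it positive semidefinite (Theorems 3 and 4), then repeatedly apply $b \leftarrow \mathrm{sign}(Wb)$, which by Theorems 1 and 2 cannot decrease $b^T W b$.

```python
# Sketch of Algorithm 1 (vector update); an illustration, not the authors' implementation.
import numpy as np

def vector_update(b, W, max_iter=1000):
    """b: initial guess in {-1, +1}^n; W: symmetric weight matrix (may contain negative entries)."""
    lam = np.linalg.eigvalsh(W).min()          # smallest eigenvalue of W
    W = W + abs(lam) * np.eye(W.shape[0])      # PSD shift (Theorem 3); argmax unchanged (Theorem 4)
    for _ in range(max_iter):
        b_new = np.sign(W @ b)
        b_new[b_new == 0] = 1                  # break sign(0) ties toward +1
        if np.array_equal(b_new, b):           # fixed point reached
            break
        b = b_new
    return b
```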
Bit Update

Unlike the vector update, the bit update method changes one bit at a time: for each bit in the vector $b$, we flip the bit and determine whether the new value improves the objective $b^T W b$. This is repeated for each bit sequentially, and over the entire vector, until convergence. The procedure can be applied very efficiently using the following scheme. Write $b = b_{(i)} + b_{(-i)}$, where $b_{(i)} = (0, \ldots, b_i, \ldots, 0)$ is a one-hot vector with the $i$th entry of $b$ at the $i$th coordinate, and $b_{(-i)} = b - b_{(i)}$ is the vector $b$ with 0 at the $i$th coordinate. When optimizing the $i$th bit:

$b^T W b = (b_{(i)} + b_{(-i)})^T W (b_{(i)} + b_{(-i)}) = b[i]^2\, W(i, i) + 2\, b_{(-i)}^T W b_{(i)} + \text{const}$

It can be verified that the only term affecting the optimization is $b_{(-i)}^T W b_{(i)}$. Therefore we can optimize each bit of $b$ by looking at the $i$th element of $b_{(-i)}^T W$; the only elements of $W$ affecting this value are those in the $i$th column of $W$. Thus,

$b[i] = \mathrm{sign}\big(b_{(-i)}^T W[:, i]\big)$   (12)

where $W[:, i]$ denotes the $i$th column of $W$. We apply this optimization scheme to each bit sequentially, and repeatedly over the entire vector $b$, until convergence. Each bit update is inserted immediately into $b$, so that the optimization of bit $i+1$ accounts for the preceding bits that have already been calculated. This update is computationally inexpensive, requiring $O(n)$ operations per bit update and $O(n^2)$ operations for one pass over the entire vector $b$. The method is summarized in Algorithm 2.

Algorithm 2: Bit Update (b, W)
  repeat
    b̂ ← b
    for i ← 1, …, n do
      b₍₋ᵢ₎ ← b,  b₍₋ᵢ₎[i] ← 0
      b[i] ← sign(b₍₋ᵢ₎ᵀ W[:, i])
  until b̂ = b
  return b

The two algorithms presented for the iterative bit optimization scheme provide a solution to the max-cut problem where both positive and negative weights appear on the graph edges. The iterations require only lightweight computations and stop when a local maximum is reached and the iteration scheme can no longer improve the current bit vector. We show in the Experiments Section that the bit update scheme achieves better codes than the vector update and is therefore preferable.

Initial Guess

Our method is based on an iterative scheme; we therefore start the optimization with an initial guess and improve upon it. A common solution is to relax the constraints $b[i] \in \{\pm 1\}$ and allow real-valued solutions, which lets the maximization problem be cast as an eigenvalue problem. The final solution is then obtained by thresholding the result, in a similar manner to Weiss, Torralba, and Fergus (2009). Interestingly, we have found that starting from a random guess and applying the suggested iterations produces more accurate solutions, with much faster compute time, than the traditional eigenvalue solutions.

In conclusion, the algorithms provided above can be used to solve the signed graph min-cut problem where $W$ consists of both positive and negative weights. We show empirically that our solution is equivalent to or outperforms other methods, by starting from a random guess and applying an update scheme until convergence. Note that the proposed update schemes can improve upon any approximate solution suggested in the literature, as the suggested iterations never deteriorate and can only improve the objective function. Our evaluations and experimental results are provided in the Experiments Section.
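A sketch of the bit-update scheme of Algorithm 2, combined with the random initial guess discussed above (the function names are ours): each coordinate is set greedily according to Equation 12, updates are inserted immediately, and passes over the vector continue until no bit changes.

```python
# Sketch of Algorithm 2 (bit update) starting from a random guess; an illustration only.
import numpy as np

def bit_update(b, W):
    """b: initial guess in {-1, +1}^n; W: symmetric weight matrix."""
    b = b.copy()
    n = len(b)
    changed = True
    while changed:
        changed = False
        for i in range(n):
            b_minus_i = b.copy()
            b_minus_i[i] = 0                                # zero the i-th coordinate
            new_bit = 1 if b_minus_i @ W[:, i] >= 0 else -1  # Eq. 12
            if new_bit != b[i]:
                b[i] = new_bit                              # insert the update immediately
                changed = True
    return b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 50
    W = rng.normal(size=(n, n)); W = (W + W.T) / 2          # signed, symmetric weights
    b0 = rng.choice([-1, 1], size=n)                        # random initial guess
    b = bit_update(b0, W)
    print(b0 @ W @ b0, "->", b @ W @ b)                     # the objective does not decrease
```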
Signed Graph Min-Cut Problem

Equation 11 suggests that our problem can be cast as a signed graph min-cut problem. A weighted graph is represented by a vertex set $V = \{1, \ldots, n\}$ and weights $W_{ij} = W[i, j] = W[j, i]$ for each pair of vertices $(i, j) \in V \times V$. The weight of the minimum cut $w(G, \bar G)$ is given by the following minimization:

Minimize $\frac{1}{2}\sum_{i,j} W_{ij}(1 - b_i b_j)$  s.t.  $b_i \in \{\pm 1\}$   (13)

where $b = [b_1, \ldots, b_n]$ is an indicator vector with $b_i = 1$ if $i \in G$ and $b_i = -1$ if $i \in \bar G$. The above minimization is an integer quadratic program whose solution is known to be NP-hard (Alon and Naor 2004). Note that this formulation can equivalently be expressed as maximizing $b^T W b$ s.t. $b \in \{\pm 1\}^n$, which is the expression given in Equation 11. The weights collected in Equation 13 refer only to pairs $(i, j)$ with $b_i \ne b_j$; thus, the minimal cut aims at including as many negative weights as possible while excluding positive weights. Since we are dealing with signed graphs, balancing the cut is not as critical as it is in unsigned graphs, since cutting a small component with few edges does not necessarily provide the smallest cut.

Alon and Naor (2004) cast the above problem as computing the $\|W\|_{\infty\to 1}$ norm and provide a semidefinite relaxation:

maximize $\sum_{ij} W_{ij}\, u_i \cdot v_j$  s.t.  $\|u_i\| = \|v_j\| = 1$.

The semidefinite program can be solved within an additive error of $\epsilon$ in polynomial time. Alon and Naor suggested three techniques for rounding the semidefinite solution into a binary solution ($b_i \in \{\pm 1\}$), which approximate the original solution up to a constant factor ($K_G$, Grothendieck's constant, $1.570 \le K_G \le 1.782$). In the Experiments Section we show that our iterative update approach can improve over Alon and Naor's solution when taking their solution as an initial guess. Moreover, taking a random guess as the initial solution provides a final solution that is comparable or better, so the benefit of using a costly approximate solution as an initial guess is questionable.

The minimization in Equation 13 can be equivalently rewritten in quadratic form:

Minimize $\frac{1}{2}\sum_{i,j} W_{ij}(b_i - b_j)^2$  s.t.  $b_i \in \{\pm 1\}$   (14)

The matrix form of the above minimization reads: minimize $b^T L b$ s.t. $b \in \{\pm 1\}^n$, where $L = D - W$ is the Laplacian of $W$ and $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. The Laplacian of a graph is frequently used for graph clustering or graph-cut via spectral methods. It was shown that taking the eigenvector of the second-smallest eigenvalue (the Fiedler vector) and thresholding it at zero provides a relaxed approximation to the minimization in Equation 14 (see Von Luxburg (2007) for more details). However, spectral methods commonly deal with positive weights, where the graph Laplacian is guaranteed to be positive semidefinite. This is not the situation in our case, where eigenvalues might be negative as well. Kunegis, Lommatzsch, and Bauckhage (2009) suggested an alternative graph Laplacian for signed graphs, $\bar L = \bar D - W$ with $\bar D_{ii} = \sum_j |W_{ij}|$, and proved that $\bar L$ is positive semidefinite. However, Knyazev (2017) argues that the signed Laplacian does not give better clustering results than the original definition of the Laplacian, even if the graph is signed. We show in our experiments that neither solution works as well as the greedy update scheme suggested in this paper.

Optimizing the Hashing Functions

Finally, we arrive at the out-of-sample extension and explain how to learn hashing functions that encode out-of-sample data points. We found that it is preferable to first optimize for the binary vector $b_k$, and then learn a hashing function $h_k(x)$ requiring $h_k(x_i) = b_k[i]$ for $i = 1..n$. Optimizing directly for the hashing function yields a non-linear optimization that often provides inaccurate results; splitting the optimization into two steps allows each step to be exploited in the best manner.

We assume that novel data points are drawn from the same distribution as the given data $X$; therefore, the hashing functions can be optimized using the empirical loss over $X$. We denote by $b_k^*$ the optimal $b_k$ resulting from the first step. This vector encodes the optimal binary values of the $k$th bit over all data points. We then train a binary classifier $h_k(x; \Theta)$ over the input pairs $\{(x_i, b_k^*[i])\}$ by minimizing a loss function

$\sum_{i=1}^{n} L\big(h_k(x_i; \Theta),\, b_k^*[i]\big)$

where $\Theta$ denotes the classifier's parameters. We use a kernel SVM (Schölkopf and Smola 2001) with Gaussian kernels to classify the points $\{x_i\}$ into $\pm 1$, but any standard classifier can be applied similarly. At step $k$, we train the hash function $h_k$ and construct the $k$th bit of the binary codes: $b_k = [h_k(x_1), h_k(x_2), \ldots, h_k(x_n)]^T$. This bit is updated immediately in the codes $\{c_i\}_{i=1}^n$, allowing the $(k+1)$th bit to account for the errors in $b_k$. This error-correcting scheme is another benefit of the two-step solution. As we proceed, the algorithm adds more bits to the PPC code, each aimed at decreasing the total loss. The process terminates when the total loss is below a given threshold, or when the number of bits exceeds $p$.
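The out-of-sample step can be sketched as follows, assuming scikit-learn is available; the hyper-parameters are illustrative rather than the paper's, and the error-correcting feedback (re-inserting each predicted bit into the codes before optimizing the next bit) is omitted for brevity.

```python
# Sketch of the per-bit hashing functions: one Gaussian-kernel SVM per bit.
import numpy as np
from sklearn.svm import SVC

def fit_hash_functions(X, B_star):
    """X: (n, d) training data; B_star: (p, n) optimal in-sample bits in {-1, +1}."""
    return [SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, b_k) for b_k in B_star]

def encode(classifiers, X_new):
    """Map novel points to their p-bit codes using the learned hashing functions."""
    return np.stack([clf.predict(X_new) for clf in classifiers])   # (p, m) matrix of +/-1
```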
Experiments and Results

Figure 2: Visualizations of the bits assigned to each data point in a 2-dimensional space according to PPC. Blue points have been assigned +1 and red points have been assigned -1.

To illustrate the optimization process, a synthetic example is shown in Figure 2. In this figure, 300 points are drawn in 2D in the range $[-0.5, 0.5] \times [-0.5, 0.5]$. The figure shows the first 4 bits (red/blue indicate $\pm 1$), where the proximity matrix was generated with an r-neighborhood proximity measure. This demonstrates how the PPC algorithm tries to separate the vector space into two labels by balancing between areas with high neighbor density and correcting for the errors of previous bits.

| Method | 12 | 16 | 24 | 32 | 48 | 64 | 96 | 128 |
|---|---|---|---|---|---|---|---|---|
| SH | 0.121379 | 0.121498 | 0.119628 | 0.121108 | 0.126965 | 0.12771 | 0.130133 | 0.129777 |
| IMH-tSNE | 0.154169 | 0.165783 | 0.168781 | 0.150094 | 0.165669 | 0.16049 | 0.163685 | 0.174634 |
| IMH-LE | 0.170701 | 0.164687 | 0.144931 | 0.154949 | 0.165396 | 0.152761 | 0.162143 | 0.152876 |
| SGH | 0.129056 | 0.12776 | 0.131998 | 0.138006 | 0.136264 | 0.144698 | 0.153421 | 0.157661 |
| LGHSR | 0.136739 | 0.144933 | 0.149855 | 0.148962 | 0.144178 | 0.144102 | 0.150296 | 0.147022 |
| SDH | 0.249316 | 0.191575 | 0.229962 | 0.250167 | 0.227537 | 0.257303 | 0.288238 | 0.326233 |
| PPC | 0.28365 | 0.312302 | 0.308905 | 0.329332 | 0.343186 | 0.352296 | 0.354291 | 0.355555 |

Table 1: Area under the precision-recall curve for varying code lengths on the CIFAR-10 dataset. The methods compared: Spectral Hashing (SH) (Weiss, Torralba, and Fergus 2009), Inductive Manifold Hashing (IMH) (Shen et al. 2013), Scalable Graph Hashing (SGH) (Jiang and Li 2015), Large Graph Hashing with Spectral Rotation (LGHSR) (Li, Hu, and Nie 2017), Supervised Discrete Hashing (SDH) (Shen et al. 2015), and our method (PPC).

Figure 3: Joint histogram of Hamming distances (x-axis) with respect to the Euclidean distances (y-axis) for PPC codes computed on data with r-neighborhood affinity of (left to right) r = 10, r = 20, and r = 30.

The mutual relationship between the actual distances and the Hamming distances is illustrated in Figure 3, which shows the joint histogram of $d_H(c_i, c_j)$ vs. the Euclidean distances $d_{ij}$ for three cases. This is another synthetic 2D example, with varying r-neighborhood proximity measures. The x-axis indicates the Hamming distances of the generated code and the y-axis the actual Euclidean distances. The histograms are plotted as gray-scale images, where the gray value of each entry indicates the number of pairs with the associated distances: the brighter the gray value, the greater the number of pairs (we display the log of the actual values for better visualization). In each case, most of the pairs with Euclidean distance below r = 10, 20, 30 (the pairs labeled as Near in this example) are concentrated to the left of the respective Hamming distances $d_H = 11, 12, 13$; these were also the final $\alpha$ values at the last step of each case. It is interesting to note that the conditional distributions of $d_H$ on the two sides of the $\alpha$ values are wide, while the order of the Euclidean distances is not necessarily preserved in the respective Hamming distances. This indicates that the bits are allocated solely to optimize the neighborhood constraints, and not to meet any other requirements such as preserving the ordinal distances. This allows for optimal allocation of the bit resources.
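A small sketch (our own) of the diagnostic shown in Figure 3: a joint, log-scaled histogram of the pairwise Hamming distances of the codes against the original Euclidean distances.

```python
# Sketch of a Hamming-vs-Euclidean joint histogram, in the spirit of Figure 3.
import numpy as np

def joint_histogram(X, C, bins=30):
    """X: (n, d) data; C: (p, n) code matrix with +/-1 entries."""
    d_euc = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_ham = C.shape[0] - C.T @ C                  # doubled Hamming distance, as in the paper
    iu = np.triu_indices(len(X), k=1)             # count each pair once
    H, _, _ = np.histogram2d(d_ham[iu], d_euc[iu], bins=bins)
    return np.log1p(H)                            # log of the counts, as in the figure
```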
We evaluate Proximity Preserving Code on several public datasets: MNIST (Deng 2012), CIFAR-10 (Krizhevsky, Nair, and Hinton 2014), and LabelMe (Russell et al. 2008). CIFAR-10 is a labeled subset of the 80 Million Tiny Images dataset, consisting of 60,000 32x32 color images represented by 512-dimensional GIST feature vectors (Oliva and Torralba 2001); it is split into 59,000 training images and 1,000 test images. MNIST is the well-known database of handwritten digits in grayscale images of size 28x28, split into a training set of 69,000 samples and a test set of 1,000 samples. LabelMe has 20,019 training images and 2,000 test images, each with a 512-D GIST descriptor; the descriptors were reduced to 40 dimensions using PCA. We use this dataset as unsupervised, and the affinity is defined by thresholding in the Euclidean GIST space such that each training point has an average of 100 neighbors.

We evaluate the results by computing a precision-recall graph over varying Hamming thresholds (denoted by $\alpha$ in Equation 2). We compare our method to the following state-of-the-art spectral hashing methods: Spectral Hashing (SH) (Weiss, Torralba, and Fergus 2009), Anchor Graph Hashing (AGH) (Liu et al. 2011), Inductive Manifold Hashing (IMH) (Shen et al. 2013), Scalable Graph Hashing (SGH) (Jiang and Li 2015), Supervised Discrete Hashing (SDH) (Shen et al. 2015), and Large Graph Hashing with Spectral Rotation (LGHSR) (Li, Hu, and Nie 2017). We use the default settings provided by the authors, and, as in Shen et al. (2013), we use an anchor number of m = 300 and a neighborhood number of s = 3. For IMH, we show results using both Laplacian eigenmaps (LE) and t-SNE.

We first compare our method in the unsupervised (or self-supervised) setting to the unsupervised methods listed above. Results are shown in Figure 4a, which presents the precision-recall graph of the LabelMe dataset with the self-supervised affinity labels. The results are for the 50-bit code computed for the train set vs. the test set, and they clearly show that our code is more accurate than the other methods over all Hamming thresholds. Next, we compare our method in the supervised scenario. Figure 4b shows the precision-recall of 50-bit codes for the MNIST dataset. The results are computed for the test set only, showing that our method outperforms the other methods in the more challenging out-of-sample scenario. Similarly, Figure 4c shows the comparison for the CIFAR-10 dataset.

Figure 4: Precision-recall of 50-bit codes with varying Hamming threshold α for (a) LabelMe (self-supervised), (b) MNIST, and (c) CIFAR-10. The methods compared: Spectral Hashing (SH) (Weiss, Torralba, and Fergus 2009), Anchor Graph Hashing (AGH) (Liu et al. 2011), Inductive Manifold Hashing (IMH) (Shen et al. 2013), Scalable Graph Hashing (SGH) (Jiang and Li 2015), Supervised Discrete Hashing (SDH) (Shen et al. 2015), and Large Graph Hashing with Spectral Rotation (LGHSR) (Li, Hu, and Nie 2017).

Figure 5: Loss of binary code as a function of code length for PPC with different initial guesses and iterative improvements, shown for 4 small datasets.
The initial guesses are: the signed eigenvector corresponding to the smallest non-trivial eigenvalue of the Laplacian (L), as described in Von Luxburg; the signed Laplacian (SL), as suggested by Kunegis, Lommatzsch, and Bauckhage; the sign of the random projection of the 3 smallest non-trivial eigenvectors of the original definition of the Laplacian, as suggested by Alon and Naor (AN); and a random guess (R). We show here results for PPC using only the initial guess, and with the bit update (BU).

To compare performance at different code lengths, we calculate the area under the curve (AUC) of the precision-recall graph in the out-of-sample scenario. Table 1 shows our results on the CIFAR-10 dataset, compared to the results of the spectral methods mentioned above. Our method consistently outperforms the other methods for both short and long codes.

Our solution for the signed min-cut problem includes an iterative scheme that continuously improves the initial guess. As mentioned before, we argue that the initial guess does not play a significant role in the final solution: at the end of the iterative process, an initial guess based on spectral methods provides results similar to those of a random initial guess. In the following experiment, we compute the codes only for the in-sample points (using a fixed random seed) and plot the loss of Equation 4 at each code length. We show our results on the benchmark presented in Norouzi and Blei (2011) for six small datasets, each consisting of 1,000 training points. Since we used the full versions of the MNIST and LabelMe datasets in the previous sections, we show here the four remaining datasets. We present the results generated from the following initial guesses: the signed eigenvector corresponding to the smallest non-trivial eigenvalue of the Laplacian (L) (Von Luxburg 2007), the signed Laplacian (SL) (Kunegis, Lommatzsch, and Bauckhage 2009), the sign of the random projection of the 3 smallest non-trivial eigenvectors of the Laplacian (AN) (Alon and Naor 2004), and a random guess (R). We show the effect of improving upon the initial guesses using the bit update (BU) presented in Algorithm 2. Results are shown in Figure 5. The results clearly show that the random guess with bit update performs as well as or surpasses the costly spectral computations, while the bit update improves upon all of the initial guesses.

A comparison between the vector update (Algorithm 1) and the bit update (Algorithm 2) for different initial guesses is shown in Figure 6. It is clear that the bit update method outperforms the vector update. This is reasonable, as the bit update takes smaller optimization steps, allowing for a broader search for the optimum, whereas the vector update takes large steps and converges quickly into a local optimum.

Figure 6: Loss of binary code as a function of code length for PPC with different initial guesses and iterative improvements, shown for 4 small datasets. Here we compare the effects of the bit update (BU) and the vector update (VU). The initial guesses are: the signed eigenvector corresponding to the smallest non-trivial eigenvalue of the Laplacian (L), as described in Von Luxburg; the signed Laplacian (SL), as suggested by Kunegis, Lommatzsch, and Bauckhage; the sign of the random projection of the 3 smallest non-trivial eigenvectors of the original definition of the Laplacian, as suggested by Alon and Naor (AN); and a random guess (R).
Conclusions

We have presented a binary hashing method called Proximity Preserving Code (PPC) based on the signed graph-cut problem. We propose an approximation to this problem and show its advantages over other methods suggested in the literature. We also introduce a hashing framework that can work for both supervised and unsupervised datasets. The framework computes binary codes that are more accurate than state-of-the-art graph hashing algorithms, especially in the challenging out-of-sample scenario. We believe the use of the signed graph problem, instead of a relaxation to the standard graph problem, can prove beneficial in other algorithms as well.

Appendix

The following four theorems are used in the Vector Update method in Algorithm 1. The proofs are provided below.

Theorem 1. For any $n \times n$ matrix $W$ and $b, b' \in \{\pm 1\}^n$, if $b' = \mathrm{sign}(Wb)$, then $b'^T W b \ge b^T W b$.

Proof. Denote $u = Wb$. Then $b'^T W b = b'^T u = \sum_i b'[i]\, u[i]$. Since $|b'[i]| \le 1$, the maximal value of this sum is obtained when $b'[i] = \mathrm{sign}(u[i])$, and therefore $b'^T W b \ge b^T W b$.

Theorem 2. Assuming $W$ is positive semidefinite (PSD), if $b'^T W b \ge b^T W b$, then $b'^T W b' \ge b^T W b$.

Proof. Express the vectors $b, b'$ using the eigenvectors of $W$: $W u_i = \lambda_i u_i$, $b = \sum_i \alpha_i u_i$ and $b' = \sum_i \alpha'_i u_i$. Given

$b'^T W b = \Big(\sum_i \alpha'_i u_i^T\Big)\Big(\sum_i \alpha_i \lambda_i u_i\Big) = \sum_i \alpha'_i \alpha_i \lambda_i \ \ge\ \sum_i \alpha_i^2 \lambda_i = b^T W b$

we would like to show that $b'^T W b' = \sum_i \alpha'^2_i \lambda_i \ge b^T W b$. Using the PSD property of $W$ ($\lambda_i \ge 0$), define

$v = (\alpha_1\sqrt{\lambda_1}, \ldots, \alpha_n\sqrt{\lambda_n})^T$ and $v' = (\alpha'_1\sqrt{\lambda_1}, \ldots, \alpha'_n\sqrt{\lambda_n})^T$.

Our assumption then reads $v'^T v \ge v^T v$. Using the Cauchy-Schwarz inequality, $\|v'\|\,\|v\| \ge v'^T v \ge \|v\|^2$, which implies $\|v'\| \ge \|v\|$, and accordingly $b'^T W b' = v'^T v' \ge v^T v = b^T W b$.

Theorem 3. A symmetric matrix $W$ can be made positive semidefinite by applying $W \leftarrow W + |\lambda| I$, where $\lambda$ is the smallest eigenvalue of $W$.

Proof. According to the Gershgorin circle theorem (Gershgorin 1931), for an $n \times n$ matrix $W$, define $R_i = \sum_{j=1, j \ne i}^{n} |w_{ij}|$; all eigenvalues of $W$ lie in at least one of the disks $\{v : |v - w_{ii}| \le R_i\}$. Adding $|\lambda|$ to each diagonal entry $w_{ii}$ shifts every eigenvalue of $W$ by $|\lambda|$, so all eigenvalues become greater than or equal to zero.

Theorem 4. Adding a constant value to the diagonal of the weight matrix $W$ does not affect the output code computed.

Proof. In PPC, we optimize the vector $b$ according to Equation 11. Therefore, using $b^T b = n$ for $b \in \{\pm 1\}^n$:

$\arg\max_b b^T (W + |\lambda| I) b = \arg\max_b \big(b^T W b + |\lambda|\, b^T b\big) = \arg\max_b \big(b^T W b + |\lambda|\, n\big) = \arg\max_b b^T W b$

References

Alon, N., and Naor, A. 2004. Approximating the Cut-Norm via Grothendieck's Inequality. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, 72-80. ACM.
Chopra, S.; Hadsell, R.; and LeCun, Y. 2005. Learning a Similarity Metric Discriminatively, with Application to Face Verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, 539-546.
Datar, M.; Immorlica, N.; Indyk, P.; and Mirrokni, V. S. 2004. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, 253-262. ACM.
Deng, L. 2012. The MNIST Database of Handwritten Digit Images for Machine Learning Research. IEEE Signal Processing Magazine 29(6):141-142.
Gershgorin, S. 1931. Über die Abgrenzung der Eigenwerte einer Matrix. Izvestija Akademii Nauk SSSR, Serija Matematika 7(3):749-754.
He, K.; Wen, F.; and Sun, J. 2013. K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2938-2945.
Jegou, H.; Douze, M.; and Schmid, C. 2011. Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(1):117-128.
Jiang, Q.-Y., and Li, W.-J. 2015. Scalable Graph Hashing with Feature Transformation. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
Knyazev, A. V. 2017. Signed Laplacian for Spectral Clustering Revisited. arXiv preprint.
Koch, G.; Zemel, R.; and Salakhutdinov, R. 2015. Siamese Neural Networks for One-Shot Image Recognition. In ICML Deep Learning Workshop, volume 2.
Krizhevsky, A.; Nair, V.; and Hinton, G. 2014. The CIFAR-10 Dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html.
Kulis, B., and Darrell, T. 2009. Learning to Hash with Binary Reconstructive Embeddings. In Advances in Neural Information Processing Systems, 1042-1050.
Kunegis, J.; Lommatzsch, A.; and Bauckhage, C. 2009. The Slashdot Zoo: Mining a Social Network with Negative Edges. In Proceedings of the 18th International Conference on World Wide Web, 741-750. ACM.
Li, X.; Hu, D.; and Nie, F. 2017. Large Graph Hashing with Spectral Rotation. In Thirty-First AAAI Conference on Artificial Intelligence.
Lin, K.; Yang, H.-F.; Hsiao, J.-H.; and Chen, C.-S. 2015. Deep Learning of Binary Hash Codes for Fast Image Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 27-35.
Liu, A., and Schisterman, E. F. 2004. Principal Component Analysis. Encyclopedia of Biopharmaceutical Statistics. New York: Marcel Dekker.
Liu, W.; Wang, J.; Kumar, S.; and Chang, S.-F. 2011. Hashing with Graphs. In Proceedings of the 28th International Conference on Machine Learning, 1-8.
Liu, W.; Mu, C.; Kumar, S.; and Chang, S.-F. 2014. Discrete Graph Hashing. In Advances in Neural Information Processing Systems, 3419-3427.
Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; and Song, L. 2017. SphereFace: Deep Hypersphere Embedding for Face Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient Estimation of Word Representations in Vector Space. 1st International Conference on Learning Representations.
Muja, M., and Lowe, D. G. 2014. Scalable Nearest Neighbor Algorithms for High Dimensional Data. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(11):2227-2240.
Norouzi, M., and Blei, D. M. 2011. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the 28th International Conference on Machine Learning, 353-360.
Oliva, A., and Torralba, A. 2001. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. International Journal of Computer Vision 42(3):145-175.
Russell, B. C.; Torralba, A.; Murphy, K. P.; and Freeman, W. T. 2008. LabelMe: A Database and Web-Based Tool for Image Annotation. International Journal of Computer Vision 77(1-3):157-173.
Schölkopf, B., and Smola, A. J. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 815-823.
Shen, F.; Shen, C.; Shi, Q.; Van Den Hengel, A.; and Tang, Z. 2013. Inductive Hashing on Manifolds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1562-1569.
Shen, F.; Shen, C.; Liu, W.; and Tao Shen, H. 2015. Supervised Discrete Hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 37-45.
Taigman, Y.; Yang, M.; Ranzato, M.; and Wolf, L. 2014. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1701-1708.
Von Luxburg, U. 2007. A Tutorial on Spectral Clustering. Statistics and Computing 17(4):395-416.
Weiss, Y.; Torralba, A.; and Fergus, R. 2009. Spectral Hashing. In Advances in Neural Information Processing Systems, 1753-1760.
Wen, Y.; Zhang, K.; Li, Z.; and Qiao, Y. 2016. A Discriminative Feature Learning Approach for Deep Face Recognition. In European Conference on Computer Vision, 499-515. Springer.