Redundancy-resistant Generative Hashing for Image Retrieval

Changying Du1, Xingyu Xie2, Changde Du3, Hao Wang1
1 360 Search Lab, Beijing 100015, China
2 College of Automation, Nanjing University of Aeronautics and Astronautics, Nanjing, China
3 Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
ducyatict@gmail.com, nuaaxing@gmail.com, duchangde2016@ia.ac.cn, cashenry@126.com

Abstract

By optimizing probability distributions over discrete latent codes, Stochastic Generative Hashing (SGH) bypasses the critical and intractable binary constraints on hash codes. While encouraging results have been reported, SGH still suffers from deficient usage of the latent codes, i.e., there often exist many uninformative latent dimensions in the code space, a disadvantage inherited from its auto-encoding variational framework. Motivated by the observation that code redundancy is usually more severe when a more complex decoder network is used, in this paper we propose a constrained deep generative architecture that simplifies the decoder for data reconstruction. Specifically, our new framework forces the latent hash codes not only to reconstruct the data through the generative network but also to retain a minimal squared L2 difference to the last real-valued hidden layer of that network. Furthermore, during posterior inference, we propose to regularize the standard auto-encoding objective with an additional term that explicitly accounts for the negative redundancy degree of the latent code dimensions. We interpret these modifications as Bayesian posterior regularization and design an adversarial strategy to optimize the generative, the variational, and the redundancy-resisting parameters. Empirical results show that our new method can significantly boost the quality of the learned codes and achieve state-of-the-art performance for image retrieval.

1 Introduction

Image retrieval aims to retrieve images relevant to a given query image from a very large image corpus. Due to the huge computation and storage cost required for similarity evaluation in the original feature space, the core problem in this area has been how to re-represent each image with sufficiently compact codes while preserving the original similarity structure as much as possible. Learning to hash (LTH) [Wang et al., 2017] is a popular class of techniques developed for this approximate nearest neighbor search problem. By using binary hash codes to represent the original data, the storage cost can be dramatically reduced. Furthermore, search with such binary representations can be conducted efficiently because (1) Hamming distance computation is supported via POPCNT on modern CPUs/GPUs; and (2) a constant or sub-linear search time complexity can be achieved by using the hash codes to construct an index [Kong and Li, 2012]. Since the expressiveness of the learned hash codes directly affects the final retrieval performance, improving code quality by capturing preferred properties of the hash function is a long-standing topic in LTH studies. Supervised hashing [Xia et al., 2014], semi-supervised hashing [Wang et al., 2012; Zhu et al., 2016], and cross-modal hashing [Jiang and Li, 2016] achieve this goal by assuming the images are labeled or tagged, which requires substantial human effort or may even be unrealistic for modern applications.
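As a concrete illustration of point (1) above, the following toy snippet (our own example, not taken from any cited method) computes the Hamming distance between two binary hash codes with a single XOR followed by a population count, which is exactly the operation that POPCNT performs in hardware.

```python
# Hamming distance between two short hash codes stored as Python integers;
# in practice 64-bit (or longer) codes are compared in the same way.
def hamming_distance(code_a: int, code_b: int) -> int:
    # XOR sets exactly the bits where the two codes disagree; counting the
    # set bits (a popcount) then gives the Hamming distance.
    return bin(code_a ^ code_b).count("1")

a = 0b10110010
b = 0b10010011
print(hamming_distance(a, b))  # 2: the codes differ in two bit positions
```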
In contrast, unsupervised hashing methods do not explicitly utilize any semantic relations, and it is thus more challenging for them to be effective in practice. Representative works along this line include spectral hashing [Weiss et al., 2008], isotropic hashing [Kong and Li, 2012], iterative quantization [Gong et al., 2013], K-means hashing [He et al., 2013], discrete graph hashing [Liu et al., 2014], spherical hashing [Heo et al., 2015], graph hashing [Jiang and Li, 2015], binary autoencoder (BAE) hashing [Carreira-Perpiñán and Raziperchikolaei, 2015], and stochastic generative hashing (SGH) [Dai et al., 2017], among others. Among them, SGH and BAE enjoy the theoretical advantage of minimum description length (MDL) owing to their auto-encoding architecture. This property is critical to achieve not only fast but also accurate retrieval. Besides, due to its probabilistic nature, SGH avoids the commonly used relaxation of the binary constraints (which often leads to inferior results) by optimizing probability distributions over discrete hash codes.

The apparent benefits claimed by SGH are inherited from the auto-encoding variational inference framework [Kingma and Welling, 2014; Rezende et al., 2014]. We note, however, that the MDL principle may be violated by the prior constraints in this framework. In fact, recent studies show that such a variational autoencoder (VAE) framework suffers from deficient usage of the latent codes, i.e., there often exist many uninformative latent dimensions in the code space [Sønderby et al., 2016], especially when the decoder network is complex. This phenomenon is caused by variational pruning [Hoffman, 2017; Yeung et al., 2017], a problem that occurs when many latent units collapse early in training, before they have learned a useful representation.

In this paper, we design a new deep generative architecture which forces the latent hash codes not only to reconstruct the data through the generative (decoder) network but also to retain a minimal squared L2 difference to the output of the last real-valued decoder hidden layer. We show that such a constrained deep structure has the effect of simplifying the decoder while providing sufficiently good generation/reconstruction quality. A simplified decoder network, together with a deep inference (encoder) network, is expected to alleviate variational pruning and hence code redundancy. Furthermore, during posterior inference, we propose to regularize the standard auto-encoding objective with an additional term that explicitly accounts for the negative redundancy degree of the latent code dimensions. We interpret these modifications as Bayesian posterior regularization and design an adversarial strategy to optimize the generative, the variational, and the redundancy-resisting parameters. Experimental results show that our redundancy-resistant generative hashing framework can significantly boost the quality of learned codes and achieve state-of-the-art retrieval performance on image data sets.

2 Model

As a powerful unsupervised representation learning framework, VAE [Kingma and Welling, 2014; Rezende et al., 2014] defines a latent variable generative model as follows:

$$p(Z) = \prod_{i=1}^{N} p(z_i), \qquad p_\theta(X|Z) = \prod_{i=1}^{N} p_\theta(x_i|z_i),$$

where $\theta$ denotes the parameters of the likelihood, which is parameterized by a decoder deep neural network (DNN), and $Z \in \mathbb{R}^{K \times N}$ and $X \in \mathbb{R}^{D \times N}$ denote the latent codes and the observed data, respectively.
Specifying $q_\phi(Z|X) = \prod_{i=1}^{N} q_\phi(z_i|x_i)$ as the DNN-parameterized inference model (encoder), VAE seeks to maximize the evidence lower bound (ELBO) w.r.t. $\phi$ and $\theta$:

$$\sum_{i=1}^{N} \log p_\theta(x_i) \geq \sum_{i=1}^{N} \mathbb{E}_{q_\phi(z_i|x_i)}\big[\log p_\theta(x_i|z_i)\big] - \sum_{i=1}^{N} D_{KL}\big(q_\phi(z_i|x_i)\,\|\,p(z_i)\big) \triangleq \mathrm{ELBO}.$$

2.1 Stochastic Generative Hashing

As an instance of VAE, SGH [Dai et al., 2017] uses a linear Gaussian likelihood and imposes a Bernoulli prior on Z:

$$p(z_i) = \prod_{j=1}^{K} \rho_j^{z_{ij}} (1-\rho_j)^{1-z_{ij}}, \qquad (1)$$

where $\rho = [\rho_j]_{j=1}^{K} \in [0,1]^K$. Correspondingly, the inference model is parameterized as

$$q_\phi(z_i|x_i) = \prod_{j=1}^{K} \big(\phi_j(x_i)\big)^{z_{ij}} \big(1-\phi_j(x_i)\big)^{1-z_{ij}}, \qquad (2)$$

where $\phi_j(x_i) = q_\phi(z_{ij}=1|x_i)$, and $\phi_j$ represents a scalar-valued linear or deep nonlinear transformation.

2.2 Redundancy-resistant Generative Hashing

To ensure efficient similarity search, we assume the latent hash code z of each data point to be binary; thus our Redundancy-resistant SGH (R-SGH) model shares the same Bernoulli prior p(Z) as in (1). Compared to SGH, one distinct feature of R-SGH lies in the new deep generative architecture for likelihood computation. Suppose the decoder DNN (generative network) has M+1 layers, where 1) the first layer is the hash code layer, which has prior p(Z) and is the latent variable of our Bayesian model; 2) the last layer is the data reconstruction layer, whose output forms the mean and covariance parameters of our Gaussian likelihood on the data; and 3) the M-th layer is the last real-valued decoder hidden layer, whose output is denoted by $H^{(M)} = [h_i^{(M)}]_{i=1}^{N}$. Our idea is to force Z not only to reconstruct the data through the decoder network but also to retain a minimal squared L2 difference to the output of the M-th decoding layer. Thus we propose to add the regularization term $\|Z - H^{(M)}\|_F^2$ to the VAE objective, where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, and we assume z and $h^{(M)}$ have the same dimension. Consider the extreme case in which this regularization term is perfectly optimized (Z exactly equals $H^{(M)}$), i.e., the hash code layer is identical to the last decoder hidden layer; then our decoder model is equivalent to a single-layer network. Such a constrained deep structure therefore has the effect of simplifying the decoder while providing sufficiently good generation/reconstruction quality. As studied in RNN-based variational language models [Yang et al., 2017], where the overly flexible LSTM decoder in VAE does not make effective use of the latent representation, a simplified decoder network together with a deep unconstrained inference (encoder) network is expected to alleviate variational pruning and hence code redundancy. For our R-SGH, we assume that the inference model has the same form as in (2), and $\phi$ is implemented by an encoder DNN (which transforms the input data into the Bernoulli parameters of the hash code through M hidden layers) with a layer structure mirroring that of the decoder.

Furthermore, during posterior inference, we propose to regularize the standard auto-encoding variational objective with an additional term that explicitly accounts for the negative redundancy degree of the latent code dimensions. The key insight is that, if a latent dimension can be approximated by a linear combination of the other latent dimensions, then this dimension is useless for data generation. To eliminate such redundancy and encourage all latent dimensions to explain the data, we formulate our overall model as follows:

$$\max_{\theta,\phi}\; \min_{A}\; \mathbb{E}_{q_\phi(Z|X)}\big[\delta \|Z - ZA\|_F^2 - \eta \|Z - H^{(M)}\|_F^2\big] + \mathrm{ELBO}, \quad \text{s.t. } a_{kk} = 0,\; k = 1, \ldots, K, \qquad (3)$$

where $A \in \mathbb{R}^{K \times K}$ denotes a combination coefficient matrix.
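To make the two regularization terms in (3) concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code) of how they can be evaluated on a mini-batch of sampled codes; the array layout (samples as rows, code dimensions as columns) and the variable names `Z`, `H_M`, `A` are assumptions made for this example.

```python
import numpy as np

def redundancy_penalty(Z, A):
    """||Z - Z A||_F^2: how poorly each code dimension is reproduced by a
    linear combination of the other dimensions (diag(A) is constrained to 0).
    Z: (N, K) batch of sampled binary codes; A: (K, K) coefficient matrix."""
    R = Z - Z @ A                      # residual; columns index code dimensions
    return np.sum(R ** 2)

def decoder_alignment_penalty(Z, H_M):
    """||Z - H^(M)||_F^2: squared L2 gap between the hash-code layer and the
    output of the last real-valued decoder hidden layer (same shape as Z)."""
    return np.sum((Z - H_M) ** 2)

# Toy usage with hypothetical sizes: N = 8 samples, K = 16 bits.
rng = np.random.default_rng(0)
Z = (rng.random((8, 16)) > 0.5).astype(float)      # sampled hash codes
H_M = rng.standard_normal((8, 16))                 # last decoder hidden layer
A = rng.standard_normal((16, 16)); np.fill_diagonal(A, 0.0)
delta, eta = 0.01, 0.01
regularizer = delta * redundancy_penalty(Z, A) - eta * decoder_alignment_penalty(Z, H_M)
print(regularizer)   # this quantity is added to the ELBO in (3)
```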
Intuitively, we always seek the best possible combination coefficients to account for the redundancy degree of the latent dimensions, and $\delta > 0$ and $\eta > 0$ are regularization parameters that balance the ELBO and the redundancy-related terms.

Before designing a specific adversarial algorithm to optimize $\theta$, $\phi$ and A alternately, we first interpret the formulation in (3) as Bayesian posterior regularization [Zhu et al., 2014]. Define the auxiliary function

$$l(Z|\eta, \delta, A) = \exp\!\big(\delta \|Z - ZA\|_F^2 - \eta \|Z - H^{(M)}\|_F^2\big);$$

then, given A, the formulation in (3) can be rewritten as

$$\max_{\theta,\phi}\; \mathrm{ELBO} + \mathbb{E}_{q_\phi(Z|X)}\big[\log l(Z|\eta, \delta, A)\big]. \qquad (4)$$

Solving problem (4), we obtain the posterior distribution

$$q_\phi(Z|X) \propto p_\theta(X|Z)\, p(Z)\, l(Z|\eta, \delta, A).$$

As a direct way to impose constraints or incorporate knowledge in Bayesian models, Bayesian posterior regularization is more natural and general than specially designed priors.

2.3 Adversarial Training

The VAE parameters $\theta, \phi$ aim to maximize (3), while the redundancy-resisting parameter A plays an adversarial role and minimizes it. To optimize them in turn, we formulate each sub-problem separately.

Optimizing θ
Given $\phi$ and A, problem (3) simplifies to

$$\max_\theta\; \sum_{i=1}^{N} \mathbb{E}_{q_\phi(z_i|x_i)}\big[\log p_\theta(x_i|z_i)\big], \qquad (5)$$

which is straightforward to solve using the reparameterization trick introduced in [Dai et al., 2017]. Specifically, to deal with the discrete variational distribution, we use the Doubly Stochastic Neuron (DSN), which produces the hash code variable as $z_{ij} = f(\phi_j(x_i), \xi_j, \varepsilon_j)$, where $\xi_j, \varepsilon_j \sim U(0, 1)$ and

$$f(e, \xi_j, \varepsilon_j) = \begin{cases} 1, & \text{if } e > \xi_j, \\ \mathbb{1}_{e > \varepsilon_j}, & \text{if } e = \xi_j, \\ 0, & \text{if } e < \xi_j. \end{cases}$$

Hence, the objective in (5) can be rewritten as

$$\sum_{i=1}^{N} \mathbb{E}_{\xi,\varepsilon}\big[\log p_\theta\big(x_i \,|\, f(\phi(x_i), \xi, \varepsilon)\big)\big], \qquad (6)$$

where $z_i = f(\phi(x_i), \xi, \varepsilon) = [f(\phi_j(x_i), \xi_j, \varepsilon_j)]_{j=1}^{K}$. A Monte Carlo estimator for the gradient of the objective in (6) w.r.t. $\theta$ can then be derived as in the vanilla VAE.

Optimizing φ
Given $\theta$ and A, problem (3) simplifies to

$$\max_\phi\; \mathbb{E}_{q_\phi(Z|X)}\big[\delta \|Z - ZA\|_F^2 - \eta \|Z - H^{(M)}\|_F^2\big] + \mathrm{ELBO}, \qquad (7)$$

which can also be solved using the reparameterization trick. However, we empirically observe that such direct treatment often leads to poor performance, since an optimum found this way is easily dominated by the first term. The underlying reason is that the ELBO is upper-bounded while the first term can be arbitrarily large. To this end, we introduce the ϵ-insensitive idea and reformulate the problem as

$$\max_\phi\; \mathrm{ELBO} + \delta\, \mathbb{E}_{q_\phi(Z|X)}\Big[\sum_{k=1}^{K} \min\big(\|r_k\|^2, \epsilon\big)\Big] - \eta\, \mathbb{E}_{q_\phi(Z|X)}\big[\|Z - H^{(M)}\|_F^2\big], \qquad (8)$$

where $r_k$ is the k-th column of $R = Z - ZA$. Following [Dai et al., 2017], we employ the distributional derivative of the DSN to derive the gradient of the objective in (8) w.r.t. $\phi$.

Optimizing A
Given $\phi$ and $\theta$, problem (3) simplifies to

$$\min_A\; \mathbb{E}_{q_\phi(Z|X)}\big[\|Z - ZA\|_F^2\big], \quad \text{s.t. } a_{kk} = 0,\; k = 1, \ldots, K. \qquad (9)$$

Applying the reparameterization trick again, this problem can be solved via gradient descent. Alternatively, an analytical solution can be obtained by adding an extra term $\lambda \|A\|_F^2$ to the objective, which converts (9) into

$$\min_A\; \mathbb{E}_{q_\phi(Z|X)}\big[\|Z - ZA\|_F^2\big] + \lambda \|A\|_F^2, \quad \text{s.t. } a_{kk} = 0,\; k = 1, \ldots, K, \qquad (10)$$

where $\lambda > 0$. It can be shown that (10) can be solved in closed form [Lu et al., 2012] as

$$A^* = -D\,(\mathrm{diag}(D))^{-1}, \quad \mathrm{diag}(A^*) = 0, \qquad (11)$$

where $D = \mathbb{E}_{q_\phi(Z|X)}\big[(ZZ^\top + \lambda I)^{-1}\big]$ and $\mathrm{diag}(A)$ denotes a vector whose i-th element is the i-th diagonal element of A. Setting $\lambda > 0$ ensures that $ZZ^\top + \lambda I$ is invertible. When $\lambda \to 0$, (10) degenerates to (9).
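As a sketch of the ridge-regularized update in (10)-(11), the snippet below computes the closed-form minimizer over A for one batch of sampled codes; using a single batch in place of the expectation, and the (N, K) array layout, are our own simplifications rather than the paper's exact implementation.

```python
import numpy as np

def closed_form_A(Z, lam=1e-3):
    """Minimize ||Z - Z A||_F^2 + lam * ||A||_F^2 subject to diag(A) = 0.
    Z: (N, K) batch of sampled codes; returns the (K, K) coefficient matrix."""
    K = Z.shape[1]
    D = np.linalg.inv(Z.T @ Z + lam * np.eye(K))   # (K, K); invertible for lam > 0
    A = -D / np.diag(D)[None, :]                   # A_ij = -D_ij / D_jj, cf. (11)
    np.fill_diagonal(A, 0.0)                       # enforce a_kk = 0
    return A

# Sanity check: the zero-diagonal constraint holds and the residual is finite.
rng = np.random.default_rng(0)
Z = (rng.random((64, 16)) > 0.5).astype(float)
A = closed_form_A(Z)
assert np.allclose(np.diag(A), 0.0)
print(np.sum((Z - Z @ A) ** 2))                    # redundancy residual on this batch
```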
Remarks
To optimize (7) and (9) in a tractable way, we alter the terms related to the redundancy-resisting parameter A and optimize (8) and (10) instead. This leads to different overall objectives for the maximization and the adversarial minimization. Inconsistent objectives during block coordinate descent/ascent may cause instability and mis-convergence of the optimization. In our adversarial framework, however, we did not empirically find this to be a problem. As will be seen in Figure 1, our algorithm converges steadily. We conjecture that this is due to the difference in equilibrium conditions between adversarial training and maximum likelihood training.

2.4 Complexity Analysis

Using the gradient estimators of $\{\theta, \phi\}$ and the analytical solution of A, the computational complexity of a single joint update of the parameters $\{\theta, \phi, A\}$ for the proposed model is $O(N C_{nn} + K^3)$, where N is the batch size during our mini-batch training and $C_{nn}$ is the cost of evaluating the encoder/decoder networks. After training, all parameters are frozen, and encoding a query image into its hash code has the same complexity $O(C_{nn})$ as other deep hashing methods.

3 Experiments

We evaluate the proposed method on two computer vision tasks: 1) image generation/reconstruction on MNIST; and 2) image retrieval on CIFAR-10 [Krizhevsky, 2009] and Caltech-256 [Griffin et al., 2007].

3.1 Experimental Setup

Compared Methods
(1) Spectral Hashing (SH) [Weiss et al., 2008]; (2) Iterative Quantization (ITQ) [Gong et al., 2013]; (3) PCA Hashing (PCAH), which projects the input via PCA and performs binarization according to the sign of each dimension; (4) Stochastic Generative Hashing (SGH) [Dai et al., 2017]; and (5) Deep-SGH, a variant of SGH that adopts a DNN as the encoder and decoder.

Parameter Settings
For the compared methods, we use the implementations provided by their authors (Deep-SGH is implemented directly on top of SGH) and set the parameters according to their original papers. Unless stated otherwise, 1) for our R-SGH, the prior parameter $\rho_j$ is set to 0.5 for every $j \in \{1, \ldots, K\}$, the threshold parameter $\epsilon$ is set to 0.05, and both $\delta$ and $\eta$ are set to 0.01; and 2) for R-SGH and Deep-SGH, the encoder and decoder network structures are set to [D-K-K-K] and [K-K-K-D] respectively, where D and K are the dimensions of the input data and the hash code. Note that the 4-layer encoder and decoder share the same hash code layer, and all hidden layers have the same dimension.

Evaluation Metrics
(1) L2 reconstruction error; (2) precision at S: the percentage of true relevant instances among the top S retrieved ones; (3) recall at S: the percentage of true relevant instances that are retrieved among the top S similar ones; and (4) mean average precision (mAP), which computes the area under the entire precision-recall curve and evaluates the overall retrieval performance of different hashing algorithms. Note that, when evaluating these metrics for a given query image, we use a fixed small set of its Euclidean nearest neighbors as the ground-truth relevant images.
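For concreteness, the following is a small self-contained sketch (our own illustration, not the paper's evaluation scripts) of precision at S, recall at S, and per-query average precision computed from a ranked retrieval list, with the ground-truth relevant set taken to be a fixed set of Euclidean nearest neighbors of the query as described above.

```python
def precision_recall_at_S(ranked_ids, relevant_ids, S):
    """ranked_ids: gallery indices sorted by increasing Hamming distance;
    relevant_ids: ground-truth neighbors of the query (its Euclidean NNs)."""
    retrieved, relevant = set(ranked_ids[:S]), set(relevant_ids)
    hits = len(retrieved & relevant)
    return hits / S, hits / len(relevant)

def average_precision(ranked_ids, relevant_ids):
    """Per-query average precision; mAP is its mean over all queries."""
    relevant = set(relevant_ids)
    hits, ap = 0, 0.0
    for rank, idx in enumerate(ranked_ids, start=1):
        if idx in relevant:
            hits += 1
            ap += hits / rank          # precision at each relevant position
    return ap / max(len(relevant), 1)

# Toy usage with a hypothetical ranking of 10 gallery items.
ranked = [3, 7, 1, 9, 0, 4, 2, 8, 6, 5]
ground_truth = [3, 1, 4]
print(precision_recall_at_S(ranked, ground_truth, S=5))   # (0.4, 2/3)
print(average_precision(ranked, ground_truth))
```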
3.2 Performance Evaluation

Generation on MNIST
To study the discriminative power of the learned representations, we re-generate images from 64-bit hash codes on the MNIST dataset. Specifically, we first randomly sample an x from the dataset and encode it into a binary representation via the learned $q_\phi(z|x)$; we then re-generate an artificial sample from $p_\theta(x|z)$ using its mean. We compare the proposed method with SGH. For fairness, we only consider the redundancy-resistant constraint on a shallow network, where both the encoder and the decoder are single-layer neural networks. The quantitative comparison is shown in Figure 1, where R-SGH eventually achieves a lower reconstruction error than SGH, indicating that the hash codes of R-SGH contain richer information for reconstruction owing to their low redundancy. The visualization results are shown in Figure 2, which again verify the advantage of our model.

Figure 1: Reconstruction error vs. the number of training epochs.

Figure 2: Illustration of MNIST images re-generated by R-SGH and SGH with 64-bit hash codes. The hash codes of R-SGH contain richer information for reconstruction owing to their low redundancy.

Image Retrieval on CIFAR-10
The CIFAR-10 dataset contains 60,000 color images from 10 object classes. Each original image is of size 32 × 32 and is represented as a 512-D GIST feature vector [Oliva and Torralba, 2001]. Following the same setting as in [Erin Liong et al., 2015], we randomly sample 1,000 images, 100 per class, as the query data, and use the remaining 59,000 images as the gallery set. The parameter δ in R-SGH is set to 0.1 on this dataset. For all considered methods, we repeat the experiments for 20 trials and report the averages as the final results.

Table 1 lists the mAP results of the different hashing methods. Figure 3 shows the recall and precision curves as functions of the number of retrieved instances under different numbers of hash bits.

Method     32 bits   64 bits   128 bits
ITQ        22.34     27.45     29.28
SH          6.97     12.23     19.39
PCAH        7.79     11.33     15.72
SGH        23.86     30.56     35.61
Deep-SGH   21.96     29.98     34.77
R-SGH      24.66     33.62     44.12

Table 1: mAP (%) results on the CIFAR-10 dataset.

Figure 3: Recall and precision curves on the CIFAR-10 dataset. The advantage of R-SGH is more evident when more hash bits are used, which verifies our conjecture that higher-dimensional latent representations come with higher information redundancy.

First, the clear advantage of our R-SGH over the other compared unsupervised hashing methods is easily observed in terms of mAP, which demonstrates that our redundancy-resistant generative hashing framework can consistently improve the overall retrieval performance. Second, R-SGH shows greater growth in mAP than SGH as the number of hash bits increases; this is because the redundancy-resistant terms in R-SGH can effectively activate the uninformative bits in the generative model. Third, R-SGH outperforms the other compared methods by a large margin when 128-bit hash codes are used. We attribute this to the fact that higher-dimensional latent representations come with higher information redundancy. Finally, we note that SGH and ITQ achieve better recall and precision when 32-bit hash codes are used. This is because very little information redundancy is contained in the hash code when the number of hash bits is small; in this case, the redundancy-resistant regularization interferes with the reconstruction procedure and may degrade performance. Fortunately, if low redundancy is expected in advance, we can prevent this negative impact by setting a small threshold parameter ϵ.
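The ranking step behind these precision/recall and mAP numbers is plain Hamming ranking of the gallery against each query code. Below is a minimal vectorized sketch under our own assumptions (codes stored as 0/1 bit arrays and packed into uint8 words; the per-byte bit count plays the role of the POPCNT instruction mentioned in the introduction).

```python
import numpy as np

def pack_codes(bits):
    """bits: (N, K) array of 0/1 hash bits -> (N, K/8) packed uint8 codes."""
    return np.packbits(bits.astype(np.uint8), axis=1)

def hamming_rank(query_bits, gallery_bits):
    """Return gallery indices sorted by increasing Hamming distance to the query."""
    q = pack_codes(query_bits[None, :])              # (1, K/8)
    g = pack_codes(gallery_bits)                     # (N, K/8)
    xor = np.bitwise_xor(g, q)                       # bytes marking differing bits
    dist = np.unpackbits(xor, axis=1).sum(axis=1)    # popcount per gallery item
    return np.argsort(dist, kind="stable")

# Toy usage: 1,000 gallery items and one query, all with 64-bit codes.
rng = np.random.default_rng(0)
gallery = (rng.random((1000, 64)) > 0.5).astype(np.uint8)
query = (rng.random(64) > 0.5).astype(np.uint8)
ranking = hamming_rank(query, gallery)
print(ranking[:10])    # indices of the 10 codes closest in Hamming distance
```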
Image Retrieval on Caltech-256
The Caltech-256 dataset consists of 29,780 images associated with 256 object categories. We randomly choose 1,000 images as the query set and regard the rest of the dataset as the training set. We use the VGG network (VGG-fc7 [Szegedy et al., 2015]) pre-trained on ILSVRC [Russakovsky et al., 2015] as the feature extractor, which transforms each image into a 1024-dimensional vector. We follow the same settings as in the CIFAR-10 experiment and use the same DNN structures.

Table 2 displays the mAP performance of the different hashing methods on the Caltech-256 dataset. Figure 4 shows the precision-recall curves for the different hashing methods under 32, 64 and 128 bits, respectively. We can see that R-SGH significantly outperforms the other compared methods on this dataset, even with 32-bit hash codes. This is probably due to the well-known fact that CNN features themselves carry some information redundancy. The improvements of R-SGH also show that the negative influence of information redundancy is even greater than that caused by dimensionality reduction for image retrieval; hence R-SGH is more suitable than the compared methods for CNN features or other redundancy-rich features.

Method     32 bits   64 bits   128 bits
ITQ        50.12     68.56     76.88
SH         41.64     52.32     62.93
PCAH       43.62     55.65     66.02
SGH        47.12     71.09     78.61
Deep-SGH   51.39     70.34     79.39
R-SGH      59.02     74.18     84.96

Table 2: mAP (%) results on the Caltech-256 dataset.

Figure 4: Precision-recall curves on the Caltech-256 dataset under 32-, 64- and 128-bit hash codes, respectively. The proposed R-SGH significantly outperforms the other compared methods even with 32-bit hash codes, which may be due to the well-known fact that CNN features themselves contain information redundancy.

3.3 Sensitivity Analysis

For the sensitivity analysis of the proposed model, we use the Caltech-256 dataset to evaluate the influence of the regularization parameters δ and η. Here, the number of hash bits for each image is fixed to 128. With η fixed to 0.01, Figure 5(a) shows the mAP performance of R-SGH with varying δ. With δ fixed to 0.01, Figure 5(b) shows the mAP performance of R-SGH with varying η. It is worth mentioning that R-SGH appears to be more sensitive to δ. An unsuitable δ can degrade the mAP performance severely, while η has little impact on the mAP over a relatively large range, i.e., [0.002, 0.07], as shown in Figure 5(b). Both proposed terms improve the discriminative power of the learned hash codes.

Figure 5: (a) mAP vs. δ with η fixed to 0.01. (b) mAP vs. η with δ fixed to 0.01.
4 Related Work

Learning-based hashing is an effective way to accelerate similarity search by designing compact binary codes in a low-dimensional space. To enhance the expressiveness of fixed-length hash codes, several hash learning methods already pursue independence (or low redundancy) between the hash bits. For example, the iterative quantization method [Gong et al., 2013] maximizes the variance of each binary bit to reduce correlation between bits and minimizes the binarization loss to obtain high image retrieval performance. Later, Liong et al. [Erin Liong et al., 2015] proposed to use a DNN to learn hash codes with three objectives: (1) the loss between the real-valued feature descriptor and the learned binary codes is minimized; (2) binary codes are distributed evenly on each bit; and (3) different bits are as independent as possible. They pursue independence between hash bits by constraining the weight matrix of the encoder to be of high rank. A more recent work named DeepBit [Lin et al., 2016] learns compact binary codes based on similar rules, i.e., minimal quantization loss and uncorrelated, evenly distributed bits. By bypassing quantization, our redundancy-resistant generative hashing method clearly differs from these methods.

Recent studies of VAEs for representation learning have revealed a decoupling of latent code dimensionality from expressive power. Caused by variational pruning [Hoffman, 2017; Yeung et al., 2017], this phenomenon can be seen as a virtue of automatic relevance determination if efficiency is not a concern (which is not the case for hashing-based similarity search), but also as a problem when many units collapse early in training before they have learned a useful representation. [Sønderby et al., 2016] observed that such units remain uninformative for the rest of training, presumably trapped in a local minimum or saddle point, with the optimization algorithm unable to re-activate them. In [Bowman et al., 2015; Sønderby et al., 2016], the authors used a warm-up strategy to alleviate this by initializing training with the reconstruction error only. As more principled approaches, [Hoffman, 2017] proposed to learn a shared shearing matrix to rotate the variational distribution, and [Tomczak and Welling, 2017] proposed a variational mixture of posteriors prior for the latent codes. The strategy most similar to ours is to restrict the power of the decoder, e.g., by word dropout [Bowman et al., 2015], lossy coding [Chen et al., 2016] or dilated CNNs [Yang et al., 2017]. Nevertheless, our adversarial learning framework is more general than these works.

5 Conclusion

We designed a new deep generative architecture which forces the latent hash codes not only to reconstruct the data through the generative network but also to retain a minimal difference to the last real-valued hidden layer of that network. Furthermore, during posterior inference, we proposed to regularize the standard auto-encoding variational objective with an additional term that explicitly accounts for the negative redundancy degree of the latent code dimensions. We designed an adversarial strategy to optimize the generative, the variational, and the redundancy-resisting parameters. Experimental results show that our redundancy-resistant generative hashing framework can significantly boost the quality of the learned codes and achieve state-of-the-art retrieval performance on image data.
Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61602449, 61672501, 61603372).

References

[Bowman et al., 2015] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

[Carreira-Perpiñán and Raziperchikolaei, 2015] Miguel A. Carreira-Perpiñán and Ramin Raziperchikolaei. Hashing with binary autoencoders. In CVPR, 2015.

[Chen et al., 2016] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.

[Dai et al., 2017] Bo Dai, Ruiqi Guo, Sanjiv Kumar, Niao He, and Le Song. Stochastic generative hashing. In ICML, 2017.

[Erin Liong et al., 2015] Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, and Jie Zhou. Deep hashing for compact binary codes learning. In CVPR, 2015.

[Gong et al., 2013] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on PAMI, 35(12):2916–2929, 2013.

[Griffin et al., 2007] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.

[He et al., 2013] Kaiming He, Fang Wen, and Jian Sun. K-means hashing: An affinity-preserving quantization method for learning binary compact codes. In CVPR, 2013.

[Heo et al., 2015] Jae-Pil Heo, Youngwoon Lee, Junfeng He, Shih-Fu Chang, and Sung-Eui Yoon. Spherical hashing: Binary code embedding with hyperspheres. IEEE Transactions on PAMI, 37(11):2304–2316, 2015.

[Hoffman, 2017] Matthew D. Hoffman. Learning deep latent Gaussian models with Markov chain Monte Carlo. In ICML, 2017.

[Jiang and Li, 2015] Qing-Yuan Jiang and Wu-Jun Li. Scalable graph hashing with feature transformation. In IJCAI, 2015.

[Jiang and Li, 2016] Qing-Yuan Jiang and Wu-Jun Li. Deep cross-modal hashing. arXiv preprint arXiv:1602.02255, 2016.

[Kingma and Welling, 2014] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[Kong and Li, 2012] Weihao Kong and Wu-Jun Li. Isotropic hashing. In NIPS, 2012.

[Krizhevsky, 2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Univ. of Toronto, 2009.

[Lin et al., 2016] Kevin Lin, Jiwen Lu, Chu-Song Chen, and Jie Zhou. Learning compact binary descriptors with unsupervised deep neural networks. In CVPR, 2016.

[Liu et al., 2014] Wei Liu, Cun Mu, Sanjiv Kumar, and Shih-Fu Chang. Discrete graph hashing. In NIPS, 2014.

[Lu et al., 2012] Can-Yi Lu, Hai Min, Zhong-Qiu Zhao, Lin Zhu, De-Shuang Huang, and Shuicheng Yan. Robust and efficient subspace segmentation via least squares regression. In ECCV, pages 347–360, 2012.

[Oliva and Torralba, 2001] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001.

[Rezende et al., 2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

[Russakovsky et al., 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[Sønderby et al., 2016] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In NIPS, 2016.

[Szegedy et al., 2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.

[Tomczak and Welling, 2017] Jakub M. Tomczak and Max Welling. VAE with a VampPrior. arXiv preprint arXiv:1705.07120, 2017.

[Wang et al., 2012] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Semi-supervised hashing for large-scale search. IEEE Transactions on PAMI, 34(12):2393–2406, 2012.

[Wang et al., 2017] Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. A survey on learning to hash. IEEE Transactions on PAMI, 2017.

[Weiss et al., 2008] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In NIPS, 2008.

[Xia et al., 2014] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, 2014.

[Yang et al., 2017] Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. In ICML, 2017.

[Yeung et al., 2017] Serena Yeung, Anitha Kannan, Yann Dauphin, and Li Fei-Fei. Tackling over-pruning in variational autoencoders. arXiv preprint arXiv:1706.03643, 2017.

[Zhu et al., 2014] Jun Zhu, Ning Chen, and Eric P. Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. JMLR, 15(1):1799–1847, 2014.

[Zhu et al., 2016] Han Zhu, Mingsheng Long, Jianmin Wang, and Yue Cao. Deep hashing network for efficient similarity retrieval. In AAAI, 2016.