# Deep Divergence Learning

Kubra Cilingir¹, Rachel Manzelli¹, Brian Kulis¹

¹Department of Electrical and Computer Engineering, Boston University, Boston, Massachusetts, USA. Correspondence to: Kubra Cilingir, Rachel Manzelli, Brian Kulis.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

Classical linear metric learning methods have recently been extended along two distinct lines: deep metric learning methods for learning embeddings of the data using neural networks, and Bregman divergence learning approaches for extending learned Euclidean distances to more general divergence measures such as divergences over distributions. In this paper, we introduce deep Bregman divergences, which are based on learning and parameterizing functional Bregman divergences using neural networks, and which unify and extend these existing lines of work. We show in particular how deep metric learning formulations, kernel metric learning, Mahalanobis metric learning, and moment-matching functions for comparing distributions arise as special cases of these divergences in the symmetric setting. We then describe a deep learning framework for learning general functional Bregman divergences, and show in experiments that this method yields superior performance on benchmark datasets as compared to existing deep metric learning approaches. We also discuss novel applications, including a semi-supervised distributional clustering problem and a new loss function for unsupervised data generation.

## 1. Introduction

The goal of metric learning is to use supervised data in order to learn a distance function (or more general divergence measure) that is tuned to the data and task at hand. Classical approaches to metric learning are generally focused on the linear regime, where one learns a linear mapping of the data and then applies the Euclidean distance in the mapped space for downstream tasks such as clustering, ranking, and classification (Davis et al., 2007; Weinberger & Saul, 2009; Goldberger et al., 2004). These methods, known as Mahalanobis metric learning approaches, have been analyzed theoretically, are scalable, and usually involve convex optimization problems that can be solved globally (Kulis, 2013; Bellet et al., 2015).

Classical metric learning methods have been extended along various axes; two important directions are deep metric learning and Bregman divergence learning. Deep metric learning approaches replace the linear mapping learned in Mahalanobis metric learning methods with more general mappings that are learned via neural networks (Hoffer & Ailon, 2015; Chopra et al., 2005). On the other hand, Bregman divergence methods replace the squared Euclidean distance with arbitrary Bregman divergences (Bregman, 1967), and learn the underlying generating function of the Bregman divergence via piecewise linear approximators (Siahkamari et al., 2019) or convex combinations of existing basis functions (Wu et al., 2009). These two extensions of classical metric learning are complementary and disjoint. For instance, Bregman divergence approaches can be utilized in scenarios where one needs to compare distributions (the well-known KL-divergence arises as a special case), but the learning problems are not directly applicable to the deep learning setting.
Similarly, deep metric learning methods still employ Euclidean distances, and are thus not directly amenable to problems where one needs to compare distributions.

In this paper, we introduce a framework for studying Bregman divergences that can naturally be learned in the deep setting. Figure 1 gives a high-level overview of our approach, which we term deep Bregman divergences, in comparison to existing metric learning approaches. These divergences are based on functional Bregman divergences (Frigyik et al., 2008), which were introduced as an extension of classical Bregman divergences but with functional inputs instead of vector inputs. In this functional setting, the underlying Bregman divergence is parameterized by a convex functional whose input itself is a function.

We first perform an analysis for the symmetric divergence case. In this setting, we prove a result about the form of any symmetric functional Bregman divergence and observe that many existing metric learning models arise as special cases of this form. These include deep metric learning methods, classical linear metric learning methods, and kernel metric learning. There are also special cases that include moment-matching functions, which yields connections to the Wasserstein distance (Arjovsky et al., 2017) and the maximum mean discrepancy (MMD) and kernel MMD (Gretton, 2012).

We then turn our attention to the strictly more general case, where the divergences need not be symmetric; the KL-divergence is a classical example of such an asymmetric Bregman divergence. In this setting, we describe a framework for learning an arbitrary deep Bregman divergence. Our approach is based on appropriately parameterizing the convex functional governing the underlying Bregman divergence with a neural network, and learning the resulting parameters of that network.

We describe several applications of our proposed deep Bregman divergence framework. First, we can extend existing deep metric learning formulations to learn more general deep Bregman divergences. Second, since our divergences can naturally be applied to compare distributions of data, another application is unsupervised generative learning, where the goal is to minimize a learned distributional divergence between real and generated data. In particular, we discuss connections to GAN models and describe some novel algorithms for unsupervised data generation. Third, we describe a semi-supervised distributional clustering problem. Here, the problem is to cluster data where each data point is represented as a distribution (for example, a movie's ratings may be represented as a distribution over user scores), using training data where we know whether pairs of distributions should be clustered together or not.

In all three of the above settings, we show empirical results that highlight the benefits of our framework. In particular, we show that learning asymmetric divergences offers performance gains over existing symmetric models on benchmark data, and achieves state-of-the-art classification performance in some settings. We also show that our clustering algorithm outperforms existing baselines on a simple proof-of-concept dataset as well as several human activity sensor datasets, and that our data generation results suggest that there may be value in further developing and studying new learned distributional divergence measures. Our code is available at https://github.com/kubrac/Deep_Bregman.

## 2. Related Work
Much of the early work on metric learning focused on the linear setting, often referred to as Mahalanobis metric learning. In this setting, the goal is to learn a global linear transformation of the data and apply standard distances, such as the Euclidean distance, on top of the learned transformation.

*Figure 1. Overview of our framework (linear/Mahalanobis metric learning, Bregman divergence learning, deep metric learning, and deep divergence learning, related along the axes "beyond Euclidean" and "weights to functions") in comparison to existing metric learning approaches. Deep Bregman divergences feature both the ability to learn divergences beyond Euclidean (such as divergences over distributions) and parameterizations that are amenable to deep learning architectures.*

This is often expressed as learning a distance function of the form $d_A(x, y) = (x - y)^T A (x - y)$, where $A$ is a positive semi-definite matrix. This is equivalent to learning a linear transformation $G$ with $A = G^T G$, since $d_A(x, y) = (x - y)^T G^T G (x - y) = \|Gx - Gy\|_2^2$. Examples of this approach to metric learning include MMC (Xing et al., 2003), MCML (Globerson & Roweis, 2005), LMNN (Weinberger & Saul, 2009), ITML (Davis et al., 2007), POLA (Shalev-Shwartz et al., 2004), LEGO (Jain et al., 2008), and others. See the surveys by Kulis (2013) and Bellet et al. (2015) for further references and details on some of these approaches. One advantage of the linear approach is that one can often provide performance guarantees; for instance, a significant amount of work has gone into proving regret bounds in the online setting (Shalev-Shwartz et al., 2004), as well as generalization bounds (Bellet & Habrard, 2015; Cao et al., 2016), for some Mahalanobis metric learning models.

While linear methods are simpler and can often be analyzed theoretically, in practice it is often useful to learn non-linear metrics. For instance, one can show that many linear models can be appropriately adapted to run in kernel space (Jain et al., 2012; Chatpatanasiri et al., 2010). Another more recent approach to moving beyond linear metric learning is the Bregman divergence learning framework discussed in the introduction (Siahkamari et al., 2019; Wu et al., 2009). Here one moves beyond learning Mahalanobis metrics to a strictly larger class of divergences that includes asymmetric divergences such as the KL-divergence, the Itakura-Saito divergence, and others. These may be considered non-linear approaches, since the resulting divergence does not in general involve linear transformations. The Bregman learning framework is thus more powerful than linear approaches but also remains well-principled: one can prove generalization bounds in this framework.

The third, and by far the most well-studied, approach to non-linear metric learning is known as deep metric learning, and involves learning a neural network to embed data into some new space, where standard distances such as the Euclidean distance are used. If $f$ is a function that maps an input $x$ to an embedding $f(x)$, then the resulting learned metric is typically $\|f(x) - f(y)\|_2^2$. Several popular loss functions have been proposed to learn such a metric; the two main ones are the contrastive loss (Chopra et al., 2005) and the triplet loss (Hoffer & Ailon, 2015). Both utilize supervision (pairwise constraints for the contrastive loss and relative constraints for the triplet loss) and use the learned distance $\|f(x) - f(y)\|_2^2$.
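Returning to the linear case above, the following is a minimal NumPy check of the identity $d_A(x, y) = \|Gx - Gy\|_2^2$ when $A = G^T G$; the dimensions and variable names are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
G = rng.standard_normal((3, d))  # linear map to a 3-dimensional space
A = G.T @ G                      # induced positive semi-definite matrix

x, y = rng.standard_normal(d), rng.standard_normal(d)

d_A = (x - y) @ A @ (x - y)         # Mahalanobis form
d_G = np.sum((G @ x - G @ y) ** 2)  # squared Euclidean after the map

assert np.isclose(d_A, d_G)
```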
Moreover, there has been considerable follow-up work on deep metric learning that explores how best to choose pairs or triples of points from a training set to achieve the best results (Hermans et al., 2017). There has also been work on deep metric learning with other losses, such as the angular loss (Wang et al., 2017) or the average precision for deep metric learning to rank (Cakir et al., 2019).

Our work also has ties to methods for comparing distributions. Examples of such measures that are relevant to our work include the maximum mean discrepancy (also known as an integral probability metric) (Gretton, 2012), the kernel MMD, and the Wasserstein distance (Arjovsky et al., 2017). Several notions of divergences over distributions have been used for unsupervised data generation in GAN-type models, including the Jensen-Shannon divergence (Goodfellow et al., 2014), the Wasserstein distance (Arjovsky et al., 2017), and the MMD (Li et al., 2015; 2017).

## 3. Deep Bregman Divergences

We now turn our attention to functional Bregman divergences, the main tool for our learning problems. Our goal is two-fold: we first prove a result that characterizes the form of a symmetric functional Bregman divergence and show connections between this form and existing metric learning models. Second, we consider a parameterization for arbitrary functional Bregman divergences that permits learning via neural networks.

### 3.1. Bregman Divergences and Functional Bregman Divergences

A Bregman divergence is a generalized measure of distance between objects, parameterized by a strictly convex function $\phi$ (Bregman, 1967). Let $\phi : \Omega \to \mathbb{R}$, where $\Omega$ is a closed, convex set. The Bregman divergence with respect to $\phi$ (for vector inputs) is defined as

$$D_\phi(x, y) = \phi(x) - \phi(y) - (x - y)^T \nabla \phi(y).$$

Note that the last term represents the derivative of $\phi$ in the direction of $x - y$. Examples of Bregman divergences include the squared Euclidean distance, parameterized by $\phi(x) = \|x\|_2^2$; the KL-divergence, parameterized by $\phi(x) = \sum_i x_i \log x_i$; and the Itakura-Saito distance, parameterized by $\phi(x) = -\sum_i \log x_i$.

Bregman divergences arise in many settings in machine learning and related areas. In the study of exponential family distributions, there is a bijection between the class of regular Bregman divergences and regular exponential families (see Banerjee et al. (2005)). In optimization, Bregman divergences arise frequently; for instance, mirror descent utilizes Bregman divergences, and Bregman divergences were originally proposed as part of constrained optimization (Bregman, 1967). In the study of clustering, Bregman divergences offer a straightforward way to extend the k-means algorithm beyond the use of the squared Euclidean distance (Banerjee et al., 2005). A consequence of this is a way to cluster multivariate Gaussians in a k-means framework (Davis & Dhillon, 2006); we will use this algorithm as a baseline later in the paper.

More recently, Frigyik et al. (2008) proposed and studied an extension of standard Bregman divergences called functional Bregman divergences, where instead of vector inputs, we compute a divergence between pairs of functions (or distributions). In this case, given two functions $p$ and $q$, and a strictly convex functional $\phi$ whose input space is a convex set of functions and whose output is in $\mathbb{R}$, the corresponding Bregman divergence is

$$D_\phi(p, q) = \phi(p) - \phi(q) - \int [p(x) - q(x)]\,\delta\phi(q)(x)\,dx.$$
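As a sketch of the vector-input definition above, the following NumPy snippet evaluates $D_\phi$ for the three generating functions just listed; the closed-form gradients are standard, and the probability-vector inputs are illustrative:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - (x - y)^T grad_phi(y)."""
    return phi(x) - phi(y) - (x - y) @ grad_phi(y)

x = np.array([0.4, 0.6])
y = np.array([0.5, 0.5])

# phi(x) = ||x||_2^2 yields the squared Euclidean distance.
sq_euc = bregman(lambda v: v @ v, lambda v: 2 * v, x, y)

# phi(x) = sum_i x_i log x_i yields the KL-divergence (probability vectors).
kl = bregman(lambda v: np.sum(v * np.log(v)), lambda v: np.log(v) + 1, x, y)

# phi(x) = -sum_i log x_i yields the Itakura-Saito distance.
itakura_saito = bregman(lambda v: -np.sum(np.log(v)), lambda v: -1.0 / v, x, y)
```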
Here $\delta\phi(q)$ is the functional derivative of $\phi$ at $q$, and the integral term computes this derivative in the direction of $p - q$.¹ An example of a functional Bregman divergence arises when we choose $\phi(p) = \int p(x)^2\,dx$; in this case, one can work out that the functional derivative of $\phi$ at $p$ is $2p$ and that the resulting functional divergence is $\int [p(x) - q(x)]^2\,dx$.

¹Note that Frigyik et al. (2008) utilize the more general Fréchet derivative. Also, for simplicity, we limit ourselves to Riemann integrals unless otherwise noted. See the Appendix for more details.

### 3.2. The Symmetric Setting

Our first goal is to relate functional Bregman divergences back to concepts in metric learning and other related learning models. To do this, let us define a symmetric functional Bregman divergence as a functional Bregman divergence such that $D_\phi(p, q) = D_\phi(q, p)$ for all $p$ and $q$.

Our first result characterizes the form of an arbitrary symmetric functional Bregman divergence. This result can be stated as follows:

**Theorem 3.1.** A functional Bregman divergence $D_\phi(p, q)$ is a symmetric functional Bregman divergence if and only if it has the following form:

$$D_\phi(p, q) = \iint (p(x) - q(x))(p(y) - q(y))\,\psi(x, y)\,dx\,dy,$$

where $\psi(x, y)$ is some symmetric positive semi-definite function.

For instance, the example from above, where $\phi(p) = \int p(x)^2\,dx$, can be seen as a special case where $\psi(x, y) = 1$ if $x = y$ and $0$ otherwise. The proof of the theorem appears in the appendix. In essence, this result extends an analogous result known from the vector setting, which states that any symmetric Bregman divergence must be a Mahalanobis distance, namely $D_\phi(x, y) = (x - y)^T A (x - y)$ for some positive semi-definite matrix $A$ (Bauschke & Borwein, 2001). Next we show that, for particular choices of the symmetric positive semi-definite function $\psi$, as well as restrictions on $p$ and $q$, the resulting divergence yields familiar forms.

**Deep Metric Learning and Moment-Matching.** Let us consider $\psi(x, y) = f_W(x)^T f_W(y)$, where $f_W(x)$ is an embedding given by a neural network parameterized by weights $W$. This is clearly a positive semi-definite function, as it is an inner product between embedded data points. Further, assume $p$ and $q$ are distributions. First, by Fubini's theorem, observe that we can rewrite the functional Bregman divergence in this special case as

$$D_\phi(p, q) = \| \mathbb{E}_p[f_W] - \mathbb{E}_q[f_W] \|^2.$$

This is a moment-matching type of metric. Note the similarity to the Wasserstein distance (Arjovsky et al., 2017) and the maximum mean discrepancy (Gretton, 2012). (In those cases, one further takes a supremum over the function $f_W$, which we would also do when performing optimization to learn $f_W$.)

This general form of the divergence is typically difficult to compute. We can consider the case when we have finite samples or, equivalently, we let $p$ and $q$ be given by empirical distributions over sets of points $P$ and $Q$, respectively. In this case, the resulting divergence simplifies to

$$D_\phi(p, q) = \left\| \frac{1}{|P|} \sum_{x \in P} f_W(x) - \frac{1}{|Q|} \sum_{y \in Q} f_W(y) \right\|^2.$$

This yields a divergence measure between distributions $p$ and $q$ that matches the first moment, similar to how MMD operates.

To make connections to deep metric learning, consider the case where $P$ and $Q$ are of size one, namely Dirac delta functions at points $x$ and $y$, respectively. Then the divergence is simply

$$D_\phi(p, q) = \|f_W(x) - f_W(y)\|^2,$$

or just the squared Euclidean distance after embedding the data via a neural network.
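A minimal PyTorch sketch of the empirical moment-matching divergence follows; the embedding network `f_W` is an arbitrary stand-in (the paper does not fix an architecture here), and the input dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Illustrative embedding network f_W; any architecture would do.
f_W = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 8))

def moment_matching_divergence(P, Q):
    """D_phi(p, q) = ||E_p[f_W] - E_q[f_W]||^2 over empirical samples P, Q."""
    return (f_W(P).mean(dim=0) - f_W(Q).mean(dim=0)).pow(2).sum()

P, Q = torch.randn(50, 2), torch.randn(60, 2)  # samples from p and q
d_pq = moment_matching_divergence(P, Q)

# With size-one sets (Dirac deltas), this reduces to the deep metric
# learning distance ||f_W(x) - f_W(y)||^2.
x, y = torch.randn(1, 2), torch.randn(1, 2)
d_xy = moment_matching_divergence(x, y)
```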
The form $\|f_W(x) - f_W(y)\|^2$ is precisely what nearly all deep metric learning methods employ: they learn a neural network to embed data, apply the (squared) Euclidean distance in the mapped space, and then apply a loss function such as a contrastive or triplet loss on top of this mapped distance (Chopra et al., 2005; Hoffer & Ailon, 2015).

**Linear Metric Learning.** If we replace the integral in the functional Bregman divergence with a Lebesgue integral (as in the original functional Bregman divergence paper) and then use the counting measure for integration, the integral in the functional Bregman divergence simply becomes a sum over the elements in the measure space. In this case, $\psi(x, y)$ is replaced by a positive semi-definite matrix $A$, and the function inputs to the divergence are replaced by vectors $x$ and $y$. The resulting divergence is then the usual Mahalanobis distance $D_\phi(x, y) = (x - y)^T A (x - y)$. Thus, we recover the usual Mahalanobis metric used in linear metric learning under our framework.

**Kernel Metric Learning.** We can also recover familiar kernel forms of the preceding functions. In the case of a kernel function $\psi(x, y) = \kappa(x, y)$, the divergence recovers the moment-matching objective but with the norm induced by the kernel's reproducing kernel Hilbert space, similar to kernel MMD (Gretton, 2012). Further, in the case of a kernel function $\kappa(x, y) = g(x)^T A\, g(y)$, where $g$ is an embedding into a reproducing kernel Hilbert space and $A$ is a positive-definite operator, the resulting divergence in the single-sample case yields the divergence studied for Mahalanobis metric learning in kernel space (Kulis et al., 2009).

A summary of some of the special cases described in this section appears in Table 1.

| Case | Integral Setting | $\psi(x, y)$ | Inputs to $D_\phi$ | $D_\phi$ |
|---|---|---|---|---|
| Mahalanobis distance | Lebesgue + counting measure | $A \succeq 0$ | Vectors $x, y$ | $(x - y)^T A (x - y)$ |
| Deep metric learning | Riemann | $f_W(x)^T f_W(y)$ | Dirac deltas at $x, y$ | $\|f_W(x) - f_W(y)\|^2$ |
| Moment matching | Riemann | $f_W(x)^T f_W(y)$ | Distributions $p, q$ | $\|\mathbb{E}_p[f_W] - \mathbb{E}_q[f_W]\|^2$ |

*Table 1. Some of the special cases of $D_\phi(p, q)$ for the symmetric divergence setting.*

### 3.3. The General Setting

Next we consider the more general setting, i.e., when the functional divergence may not be symmetric. Here our goal is to introduce a parameterization of functional divergences that is amenable to learning via neural networks. We term the resulting divergences deep Bregman divergences.

A key insight of Siahkamari et al. (2019) was that one can approximate a strictly convex function arbitrarily well with a piecewise linear function. In particular, they chose to parameterize the generating function $\phi$ of a vector Bregman divergence by the following max-affine function:

$$\phi(x) = \max_c \,(x^T w_c + b_c).$$

Here $c$ ranges from $1$ to $K$, where $K$ is the number of hyperplanes used to approximate the underlying strictly convex function. Such functions can be used to approximate any vector Bregman divergence arbitrarily well. Thus, learning a Bregman divergence amounts to learning the weights $w_c$ and biases $b_c$ given appropriate supervision.
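A small NumPy sketch of this max-affine parameterization in the vector case: since $\nabla\phi$ at $y$ is the slope of the hyperplane active at $y$, the divergence is nonnegative by construction. The random parameters here are for illustration only; in practice they would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 4, 3
W = rng.standard_normal((K, d))  # hyperplane slopes w_c
b = rng.standard_normal(K)       # offsets b_c

def phi(x):
    return np.max(W @ x + b)     # phi(x) = max_c (x^T w_c + b_c)

def bregman_max_affine(x, y):
    c_star = np.argmax(W @ y + b)            # hyperplane active at y
    return phi(x) - phi(y) - (x - y) @ W[c_star]

x, y = rng.standard_normal(d), rng.standard_normal(d)
print(bregman_max_affine(x, y))  # always >= 0 by convexity of phi
```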
We can perform an analogous parameterization in the functional divergence setting. By generalizing the piecewise linear functions of Siahkamari et al. (2019), we can define a convex generating functional. The following theorem demonstrates that every convex generating functional can be expressed in terms of linear functionals, thus justifying our choice of parameterization:

**Theorem 3.2.** Let $\phi(p)$ be a convex generating functional corresponding to a functional Bregman divergence $D_\phi$. Then $\phi$ can be formulated as

$$\phi(p) = \sup_{(w, b_w) \in \mathcal{A}} \left[ \int p(x)\,w(x)\,dx + b_w \right],$$

where $\mathcal{A}$ is a set of affine functionals in which each member is characterized by $w$ and $b_w$.

For our parameterization, we replace the supremum with a maximum, and denote each function pair as $(w_c, b_c)$ in a countable set of functionals $\mathcal{A}$. See Appendices A.2 and B for the proof and details. In the case where $p$ and $q$ are distributions, we may write this more succinctly as $\phi(p) = \max_c \,(\mathbb{E}_p[w_c] + b_c)$, where the expectation is taken with respect to the subscript distribution $p$. A straightforward application of the calculus of variations reveals that the functional derivative of $\phi$ at $q$ is simply $w_{q^*}$, where $q^* = \arg\max_c \left[ \int q(x)\,w_c(x)\,dx + b_c \right]$. Consequently, the functional Bregman divergence between $p$ and $q$ can be expressed as

$$D_\phi(p, q) = \left( \int p(x)\,w_{p^*}(x)\,dx + b_{p^*} \right) - \left( \int p(x)\,w_{q^*}(x)\,dx + b_{q^*} \right). \tag{1}$$

For distributions, this is more succinctly

$$D_\phi(p, q) = (\mathbb{E}_p[w_{p^*}] + b_{p^*}) - (\mathbb{E}_p[w_{q^*}] + b_{q^*}).$$

This parameterization of the functional $\phi$ is now amenable to learning a functional divergence given data. In particular, we now parameterize a divergence by the corresponding weight functions $w_1, \dots, w_K$ and biases $b_1, \dots, b_K$. If we assume that each of these weight functions is given by a deep neural network, then it becomes natural to set up learning problems where we aim to learn the underlying divergence from data. The resulting deep Bregman divergences will be shown to yield novel learning problems and strong empirical performance on benchmark metric learning tasks. In the next section we detail our approach to extending deep metric learning to this setting.
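Before turning to those learning problems, here is a minimal PyTorch sketch of this parameterization, with $K$ small networks standing in for the weight functions $w_c$; the architecture is our own stand-in for the shared-trunk network of Figure 2, and the input dimension and sample sets are illustrative:

```python
import torch
import torch.nn as nn

K = 3
# Each w_c is a small network; the b_c are scalar parameters.
w = nn.ModuleList([nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
                   for _ in range(K)])
b = nn.Parameter(torch.zeros(K))

def scores(P):
    """E_p[w_c] + b_c for each c, with p the empirical distribution of P."""
    return torch.stack([w[c](P).mean() + b[c] for c in range(K)])

def deep_bregman(P, Q):
    s_p, s_q = scores(P), scores(Q)
    p_star, q_star = s_p.argmax(), s_q.argmax()
    # Eq. (1): both maximizing functionals are evaluated on p.
    return s_p[p_star] - s_p[q_star]

P, Q = torch.randn(40, 2), torch.randn(40, 2) + 1.0
print(deep_bregman(P, Q))  # nonnegative; zero whenever p* == q*
```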
## 4. Learning Problems and Applications

In the previous section, we saw in the symmetric setting how different choices of the functions related to a functional Bregman divergence yield existing forms, as well as how one may parameterize a general asymmetric functional Bregman divergence using deep neural networks. We now connect these divergences to particular learning problems. In particular, we describe several novel applications and learning problems that arise from learning deep Bregman divergences.

### 4.1. From Deep Metric Learning to Deep Divergence Learning

Consider a learning problem where we aim to learn a deep divergence given supervised data. As with deep metric learning, we consider the case where $p$ and $q$ are empirical distributions over single points $x$ and $y$, respectively. We saw in the previous section that we parameterize our deep divergence by weight functions $w_1, \dots, w_K$ and biases $b_1, \dots, b_K$. To simplify, let us encompass all of these weight functions into a single large neural network with weights $W$. The network has $K$ different outputs, one for each weight function. Many architectures can capture this type of network; we consider an architecture where several layers are shared, and then the network branches into $K$ subnetworks, each with its own independent set of weights. See Figure 2 for the network that we employ in our benchmark experiments.

*Figure 2. The general architecture we employ for deep Bregman divergences on image data: an input x passes through shared convolutional and dense layers, then splits into K subnetworks. For K functionals, we produce K separate outputs, which are then used to compute the divergence over pairs of inputs.*

Now, suppose we pass $x$ through the network. Each subnetwork $c$ produces a single output $w_c(x) + b_c$, and there are $K$ total outputs, one per subnetwork. The index of the maximum output is $p^*$. Similarly, pass $y$ through the network; the index of the maximum output across the $K$ subnetworks is $q^*$. Then, by (1), the divergence is the difference between the output of $x$ at $p^*$ and the output of $x$ at $q^*$. For instance, suppose that each of the $K$ outputs corresponds to a different class. Then the divergence is zero if both points achieve a maximum value for the same class (i.e., they are both classified into the same class). The divergence is non-zero if the points are assigned to different classes, and the divergence grows as the two outputs become more disparate.

One can now set up a divergence learning problem over pairs or triples of points under this framework. Suppose we are given triples of points $(x, y, z)$, where $x$ should have a smaller divergence to $y$ than to $z$. One can easily apply existing deep metric learning loss functions with the learned divergence in place of the usual squared Euclidean distance. The contrastive loss and the triplet loss are the two most common; their definitions are, respectively,

$$\mathcal{L}_{\text{cont}}(x, y, z) = \frac{1}{2}\left( d(x, y) + \max\{m - d(x, z), 0\}^2 \right),$$

$$\mathcal{L}_{\text{triplet}}(x, y, z) = \max\{d(x, y) - d(z, x) + m, 0\},$$

where $m$ is a margin value that imposes a lower bound on the distance between dissimilar pairs in the contrastive case, and on the difference between the relative distances in the triplet case. Typically, the distance $d$ in both loss functions is the Euclidean distance; in our loss functions, we replace the distance measure with our learned deep Bregman divergence.
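A sketch of the two losses with the learned divergence plugged in for $d$, reusing the illustrative `deep_bregman` from the previous snippet; batching and pair/triplet mining are omitted:

```python
import torch

def contrastive_loss(d_xy, d_xz, m=1.0):
    """L_cont = (1/2) (d(x, y) + max{m - d(x, z), 0}^2)."""
    return 0.5 * (d_xy + torch.clamp(m - d_xz, min=0.0) ** 2)

def triplet_loss(d_xy, d_zx, m=1.0):
    """L_triplet = max{d(x, y) - d(z, x) + m, 0}."""
    return torch.clamp(d_xy - d_zx + m, min=0.0)

# Single points are treated as size-one empirical distributions.
x, y, z = torch.randn(1, 2), torch.randn(1, 2), torch.randn(1, 2)
loss = triplet_loss(deep_bregman(x, y), deep_bregman(z, x))
loss.backward()  # gradients flow into the subnetworks w_c and biases b_c
```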
In experiments, we compare existing deep metric learning approaches to the more general deep divergence learning problem considered here, and we will see that we obtain gains over the existing models on standard benchmarks.

### 4.2. Learning over Distributions

A key advantage of our framework is that we need not restrict ourselves to divergences between single points. As we saw earlier, we can also capture divergences between distributions of points, similar to those used for the MMD and the Wasserstein distance. Here we discuss applications involving learning divergences over distributions.

**Data Generation.** Consider the problem encountered in many GAN applications: we aim to learn a generator for data such that we minimize some distributional divergence between the real and generated data distributions. In the existing GAN literature, divergences considered include the Jensen-Shannon divergence (Goodfellow et al., 2014), the MMD (Li et al., 2015; 2017), and the Wasserstein distance (Arjovsky et al., 2017).

Under the deep divergence framework, rather than employing a fixed divergence, we can learn one from data. In this setting, we consider two distributions $p_{\text{synth}}$ and $p_{\text{real}}$, corresponding to the distributions of generated and real data, respectively. Assume that $p_{\text{synth}}$ is generated by passing randomly-generated input data through a generator $g$, as is standard with GAN models. As with GAN training, learning proceeds in an adversarial manner: we aim to learn a generator that minimizes $D_\phi(p_{\text{synth}}, p_{\text{real}})$, while simultaneously learning the weights of the underlying network parameterizing $D_\phi$ to maximize $D_\phi(p_{\text{synth}}, p_{\text{real}})$. As with GANs, we alternate between gradient updates for these two objectives.

We note that, in practice, it is useful to restrict our attention to the case $K = 2$, as it yields a particularly interpretable model. In this case, we can think of one of the two subnetworks as outputting a larger value on real data, and the other as outputting a larger value on synthetic data. Thus, the network that parameterizes the divergence is analogous to the discriminator in a GAN model. When training the underlying weights $W$ of this network, we can take pairs or triples of real and synthetic data and utilize a triplet or contrastive loss to encourage the output on real data to be larger for one subnetwork and the output on synthetic data to be larger for the other. Similarly, when training the generator $g$, we use a loss that encourages real and synthetic data to both have the same maximal output.

**Semi-Supervised Distributional Clustering.** As another application of learning divergences over distributions, consider a scenario where, instead of clustering a set of data points, we aim to cluster a set of distributions. In this setup, each distribution may correspond to an empirical distribution over a set of points; for instance, we may have a distribution of ratings for each item in an online store. The goal is, given a set of such distributions, to cluster the distributions into a set of clusters.

Davis & Dhillon (2006) considered a version of this problem where each distribution was given by a multivariate Gaussian. Since the KL-divergence between multivariate Gaussians is itself a Bregman divergence, one can use properties of Bregman divergences to generalize the k-means algorithm to this setting. Here, we consider a version of this problem that is semi-supervised (pairs of distributions that should or should not be clustered together are provided over a training set) and does not assume that each distribution is a multivariate Gaussian. Our approach also removes the implicit assumption that the means of the distributions are linearly separable across clusters.

Analogous to Davis & Dhillon, given a functional Bregman divergence defined over distributions, one can apply a generalization of k-means to cluster the distributions. As shown by Frigyik et al. (2008), the mean minimizes the expected functional Bregman divergence over a set of distributions, analogous to the finite-dimensional case. Thus, k-means can be generalized to a setting where the squared Euclidean distance between vectors is replaced by the corresponding functional Bregman divergence over distributions. If we represent each distribution by an empirical distribution over its underlying points, we can easily compute a parameterized functional Bregman divergence between pairs of distributions. In our experiments, we consider in particular learning a symmetric divergence on supervised data using the moment-matching distance with a contrastive or triplet loss. Once we have learned the divergence from data, we replace the squared Euclidean distance in the k-means algorithm with the learned divergence to directly cluster data in the test set.
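A rough sketch of the resulting clustering procedure, with any learned divergence `D` in place of the squared Euclidean distance. Representing each cluster mean by pooling the samples of its members approximates the mean distribution when the member sample sets have equal sizes; this pooling is our simplification, not a detail from the paper:

```python
import numpy as np

def divergence_kmeans(dists, D, k, n_iter=20, seed=0):
    """k-means over empirical distributions (arrays of samples), using a
    divergence D(p, q) in place of the squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    centers = [dists[i] for i in rng.choice(len(dists), k, replace=False)]
    for _ in range(n_iter):
        assign = [int(np.argmin([D(p, c) for c in centers])) for p in dists]
        for j in range(k):
            members = [p for p, a in zip(dists, assign) if a == j]
            if members:
                # Pooled samples act as the cluster's mean distribution.
                centers[j] = np.concatenate(members)
    return assign
```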
## 5. Experimental Results

We now empirically compare our proposed deep divergence framework to existing models. Due to space considerations, some further details and results are available in the supplementary material.

### 5.1. Clustering

**Clustering Multivariate Gaussian Distributions.** To begin, we consider a simple demonstration of the advantages of our approach on synthetic data for the semi-supervised distributional clustering problem. We generated $n = 500$ training points, each assigned to one of three clusters. Each data point is represented by a multivariate Gaussian; the means of these Gaussians were uniformly sampled over rings of radius 0.2, 0.6, and 1 (plus Gaussian noise), depending on the cluster identity, and the covariance of each Gaussian was 0.1 times the identity. See Figure 3 for a plot of the sampled means, along with data generated from these Gaussians. We also generated $n = 200$ test points in the same manner.

*Figure 3. (Left) Means of the n = 500 Gaussian distributions, color-coded by cluster identity. (Middle left) Data after generating 50 points from each Gaussian. (Middle right) Embedding learned by our method using contrastive loss with a moment-matching function. (Right) Embedding learned by the baseline deep learning approach.*

We compare three approaches to cluster the data. Our first baseline is the method of Davis & Dhillon (2006), an unsupervised clustering algorithm designed specifically to cluster multivariate Gaussian distributions. The second baseline applies deep metric learning to all generated points from all the Gaussians; we apply contrastive and triplet losses separately and learn a 3-layer multilayer perceptron (MLP) over the data in each case. The number of units in each layer was set to 1000, 500, and 2, with standard ReLU activations. The third approach is our method: we apply the (empirical) moment-matching function from the symmetric setting, treating each distribution as its own data point, in conjunction with a contrastive or triplet loss, to learn a 3-layer MLP with the same settings as the baseline MLP. On the test set, we use the learned divergence in place of the squared Euclidean distance in a k-means algorithm for both the second and third methods.

We compute the Rand index (RI) and adjusted Rand index (ARI) scores on the test set in each case, averaged over 10 runs for each of the three methods. The results are given in Table 2.

| | Baseline (Triplet) | Baseline (Contrastive) | Ours (Triplet) | Ours (Contrastive) | Davis & Dhillon |
|---|---|---|---|---|---|
| RI mean | 0.638 | 0.639 | 0.997 | 0.999 | 0.550 |
| RI std | 0.005 | 0.005 | 0.003 | 0.003 | 0.009 |
| ARI mean | 0.197 | 0.198 | 0.993 | 0.997 | 0.005 |
| ARI std | 0.012 | 0.013 | 0.007 | 0.006 | 0.012 |

*Table 2. Rand index and adjusted Rand index scores for the synthetic clustering experiments, where the baseline method treats each training point independently.*

The Davis & Dhillon method cannot cluster these multivariate Gaussians, as it is restricted to linear separability of the means. The baseline deep metric learning method fails due to the overlap of the generated data across clusters, whereas the distributional divergence approach clusters the test data perfectly in most runs. We can also visualize the embeddings learned by the second and third methods, where we see that our learned embeddings capture the correct cluster structure, as pictured in Figure 3.
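For reproducibility, a sketch of the synthetic data generation described above, under our reading of the setup; the per-cluster split and the noise scale on the ring radii are our assumptions, since the paper leaves them unspecified:

```python
import numpy as np

rng = np.random.default_rng(0)
radii = [0.2, 0.6, 1.0]      # one ring of Gaussian means per cluster
n_per_cluster = 167          # roughly 500 training points in total
n_samples = 50               # points generated from each Gaussian

distributions, labels = [], []
for c, r in enumerate(radii):
    for _ in range(n_per_cluster):
        theta = rng.uniform(0, 2 * np.pi)
        mean = r * np.array([np.cos(theta), np.sin(theta)])
        mean = mean + 0.01 * rng.standard_normal(2)  # assumed noise scale
        # Each data point is a Gaussian with covariance 0.1 * I.
        samples = mean + np.sqrt(0.1) * rng.standard_normal((n_samples, 2))
        distributions.append(samples)
        labels.append(c)
```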
**Human Activity Recognition Task.** We next perform further experiments on human activity recognition using time-varying sensor data, in order to demonstrate the capabilities of our distributional clustering method on real data. We use the WHARF (Bruno et al., 2014), MHEALTH (Banos et al., 2014; 2015), and WISDM (Weiss et al., 2019) datasets in our initial experiments. These datasets are collections of multimodal body sensor recordings as test subjects perform different activities of daily living (ADL), including but not limited to sitting, standing, eating, walking, and jogging.

The experimental setup is the same as in the previous task: we compute the Rand index and adjusted Rand index scores on the test set in each experiment, averaged over 10 runs for each of the three methods. Results are given in Table 3.

| Dataset | Method | RI mean | RI std | ARI mean | ARI std |
|---|---|---|---|---|---|
| WHARF | Baseline | 0.832 | 0.002 | 0.098 | 0.006 |
| | Ours | 0.887 | 0.004 | 0.364 | 0.022 |
| | Davis & Dhillon | 0.876 | 0.007 | 0.327 | 0.026 |
| MHEALTH | Baseline | 0.849 | 0.005 | 0.106 | 0.008 |
| | Ours | 0.860 | 0.007 | 0.149 | 0.018 |
| | Davis & Dhillon | 0.664 | 0.006 | 0.023 | 0.001 |
| WISDM | Baseline | 0.894 | 0.004 | 0.086 | 0.005 |
| | Ours | 0.907 | 0.003 | 0.127 | 0.014 |
| | Davis & Dhillon | 0.900 | 0.003 | 0.089 | 0.009 |

*Table 3. Rand index and adjusted Rand index scores for clustering experiments performed on real data, where the baseline method treats each training point independently.*

We note that only experiments using the contrastive loss are reported here; though our distributional loss formulation has the potential to be applied directly, we leave this as future work. In the appendix, we provide a visualization of the embeddings learned by our method and the baseline method, where we again see that our learned embeddings capture the correct cluster structure.

### 5.2. Deep Metric Learning Comparisons

Next we compare our general deep divergence learning framework against existing deep metric learning models on standard benchmarks, to demonstrate that our approach's flexibility yields improved performance on several datasets and tasks. We compare standard deep metric learning approaches to our proposed approach on the four benchmark datasets used in the original triplet loss paper (Hoffer & Ailon, 2015), namely MNIST, Cifar10, SVHN, and STL10, as well as Fashion MNIST. We use the same basic architecture for the deep Bregman divergence network as shown in Figure 2; for the Euclidean case we do not employ separate subnetworks in the dense layers. We treat several architecture choices as hyperparameters and validate over them using Bayesian optimization (tuned separately for each dataset); Table 5 lists the hyperparameters we search over, along with the ranges of values considered.

| Model hyperparameter | Range | Training hyperparameter | Range |
|---|---|---|---|
| layers | 2-5 | margin | 0.1-2.0 |
| conv filters | 16-128 | epochs | 10-40 |
| conv kernels | 3-9 | learning rate | $10^{-5}$-$10^{-1}$ |
| conv biases | T / F | batch size | 32-128 |
| poolings | T / F | optimizer | adam / sgd / rms |
| batchnorms | T / F | K in k-NN | 5-10 |
| dense units | 50-300 | | |
| normalization | T / F | | |

*Table 5. Hyperparameter intervals used for tuning. The first 100 iterations are used to narrow down the search space, then 200 more iterations are run for each benchmark. T: True, F: False.*

We consider the triplet loss and the contrastive loss separately, and report in bold the best values for each loss. For the triplet loss, we consider all triplets in a batch when computing the loss. We perform no data augmentation.
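Accuracy is measured with a k-NN classifier over the learned divergence. A minimal sketch of that evaluation follows; with an asymmetric divergence the argument order matters, and measuring from the query to the training points is our assumption:

```python
import numpy as np

def knn_predict(D, X_train, y_train, x, k=5):
    """Classify x by majority vote among the k training points with the
    smallest learned divergence D(x, x_i)."""
    dists = np.array([D(x, xi) for xi in X_train])
    nearest = np.argsort(dists)[:k]
    return np.bincount(y_train[nearest]).argmax()
```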
Results are shown in Table 4, where we see small but significant gains in classification accuracy for the Bregman method as compared to the standard deep metric learning approach, particularly in the triplet loss case. On Fashion MNIST, we outperform the current state of the art without data augmentation (94.23% from Assunção et al. (2018)), even though we are not directly training a classifier. We would also expect further gains in performance with more sophisticated architectures (e.g., ResNets and other more recent architectures), perhaps yielding near state-of-the-art performance on more datasets; however, the main goal of this comparison is not to achieve state-of-the-art performance but rather to present a fair comparison between the Bregman and Euclidean approaches on standard benchmarks.

| Dataset | Euclidean Triplet | Euclidean Contrastive | Deep Bregman Triplet | Deep Bregman Contrastive |
|---|---|---|---|---|
| MNIST | 99.50 | **99.63** | **99.61** | 99.56 |
| Fashion MNIST | 93.24 | 93.57 | **94.90** | **94.00** |
| SVHN | 92.58 | **94.88** | **94.03** | 94.12 |
| Cifar10 | 77.00 | 79.40 | **81.40** | **80.80** |
| STL10 | 59.97 | **63.10** | **62.64** | 60.91 |

*Table 4. k-NN classification accuracy on the given datasets (without data augmentation or learned features). Bold values indicate the better triplet loss result (Bregman versus Euclidean) and the better contrastive loss result (Bregman versus Euclidean).*

### 5.3. Unsupervised Data Generation

Finally, we consider some qualitative results showing that our approach can be used to generate data with performance similar to GANs. We consider the problem discussed earlier: we train a deep divergence model to minimize a learned divergence between real and synthetic data. We apply our approach to the 28x28 MNIST and CelebA datasets, as is standard for GAN applications. We adjust the strides to adapt the network to different input sizes, and keep the model structures close to standard in order to show the effectiveness of the divergence formulation we introduced. We use a generator consisting of 4 deconvolutional layers and a discriminator (i.e., the network parameterizing the deep Bregman divergence) with 4 convolutional layers, with a dropout rate of 0.5 between the layers as well as leaky ReLU and tanh activations. In the discriminator network, the convolutional layers are followed by two 2-layer subnetworks (again similar to Figure 2, with K = 2 in this case). For the discriminator, we use the contrastive loss with a margin of 0.4, whereas the generator directly attempts to minimize the deep Bregman divergence between the real and generated images. More hyperparameter details are given in the appendix.

Some randomly chosen results are presented in Figure 4, where we see that the distributional divergence learned by our method is able to generate realistic-looking images with no labeled supervision. We note that further theoretical analysis and experimentation is required to determine the situations in which our loss functions may be more desirable than existing GAN approaches.

*Figure 4. Real and generated sample batches from the CelebA (top row) and MNIST (bottom row) datasets.*

## 6. Conclusions

In this paper, we examined a novel generalization of both Bregman divergence learning and deep metric learning, which we call deep divergence learning. This framework offers several appealing advantages: it unifies a number of existing ideas in metric learning under a single framework, it suggests a way to extend deep metric learning beyond the Euclidean setting, and it naturally yields learning problems involving divergences over distributions. Empirically, we have seen advantages of our approach compared to existing deep metric learning methods.

## 7. Acknowledgements

This research was supported by NSF CAREER Award 1559558.

## References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN, 2017. arXiv:1701.07875.

Assunção, F., Lourenço, N., Machado, P., and Ribeiro, B. DENSER: Deep evolutionary network structured representation. arXiv preprint arXiv:1801.01563, 2018.

Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705-1749, 2005.
Banos, O., Garcia, R., Holgado-Terriza, J. A., Damas, M., Pomares, H., Rojas, I., Saez, A., and Villalonga, C. mHealthDroid: a novel framework for agile development of mobile health applications. In International Workshop on Ambient Assisted Living, pp. 91-98. Springer, 2014.

Banos, O., Villalonga, C., Garcia, R., Saez, A., Damas, M., Holgado-Terriza, J. A., Lee, S., Pomares, H., and Rojas, I. Design, implementation and validation of a novel open framework for agile development of mobile health applications. BioMedical Engineering OnLine, 14(2):S6, 2015.

Bauschke, H. H. and Borwein, J. M. Joint and separate convexity of the Bregman distance. Studies in Computational Mathematics, 8:23-36, 2001.

Bellet, A. and Habrard, A. Robustness and generalization for metric learning. Neurocomputing, 151:259-267, 2015.

Bellet, A., Habrard, A., and Sebban, M. Metric learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 9(1):1-151, 2015.

Bregman, L. M. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200-217, 1967.

Bruno, B., Mastrogiovanni, F., and Sgorbissa, A. A public domain dataset for ADL recognition using wrist-placed accelerometers. In The 23rd IEEE International Symposium on Robot and Human Interactive Communication, pp. 738-743. IEEE, 2014.

Cakir, F., He, K., Xia, X., Kulis, B., and Sclaroff, S. Deep metric learning to rank. In Computer Vision and Pattern Recognition, 2019.

Cao, Q., Guo, Z.-C., and Ying, Y. Generalization bounds for metric and similarity learning. Machine Learning, 102(1):115-132, 2016.

Chatpatanasiri, R., Korsrilabutr, T., Tangchanachaianan, P., and Kijsirikul, B. A new kernelization framework for Mahalanobis distance learning algorithms. Neurocomputing, 73(10-12):1570-1579, 2010.

Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

Davis, J. and Dhillon, I. S. Differential entropic clustering of multivariate Gaussians. In Advances in Neural Information Processing Systems (NIPS), 2006.

Davis, J., Kulis, B., Jain, P., Sra, S., and Dhillon, I. Information-theoretic metric learning. In Proc. 24th International Conference on Machine Learning (ICML), 2007.

Frigyik, B. A., Srivastava, S., and Gupta, M. R. Functional Bregman divergences and Bayesian estimation of distributions. IEEE Transactions on Information Theory, 54(11):5130-5139, 2008.

Globerson, A. and Roweis, S. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems (NIPS), 2005.

Goldberger, J., Roweis, S., Hinton, G., and Salakhutdinov, R. Neighbourhood components analysis. In Advances in Neural Information Processing Systems (NIPS), 2004.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial networks. In Advances in Neural Information Processing Systems (NIPS), 2014.

Gretton, A. A kernel two-sample test. Journal of Machine Learning Research, 13:723-773, 2012.

Hermans, A., Beyer, L., and Leibe, B. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.

Hoffer, E. and Ailon, N. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84-92. Springer, 2015.
Jain, P., Kulis, B., Dhillon, I., and Grauman, K. Online metric learning and fast similarity search. In Advances in Neural Information Processing Systems (NIPS), 2008.

Jain, P., Kulis, B., Davis, J., and Dhillon, I. Metric and kernel learning using a linear transformation. Journal of Machine Learning Research, 13:519-547, 2012.

Kulis, B. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287-364, 2013.

Kulis, B., Sustik, M., and Dhillon, I. Low-rank kernel learning with Bregman matrix divergences. Journal of Machine Learning Research, 10:341-376, 2009.

Li, C., Chang, W., Cheng, Y., Yang, Y., and Poczos, B. MMD GAN: Towards deeper understanding of moment matching network. In Neural Information Processing Systems, 2017.

Li, Y., Swersky, K., and Zemel, R. Generative moment matching networks. In International Conference on Machine Learning, 2015.

Shalev-Shwartz, S., Singer, Y., and Ng, A. Y. Online and batch learning of pseudo-metrics. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 94. ACM, 2004.

Siahkamari, A., Saligrama, V., Castanon, D., and Kulis, B. Learning Bregman divergences, 2019. arXiv:1905.11545.

Wang, J., Zhou, F., Wen, S., Liu, X., and Lin, Y. Deep metric learning with angular loss. In International Conference on Computer Vision, 2017.

Weinberger, K. Q. and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207-244, 2009.

Weiss, G. M., Yoneda, K., and Hayajneh, T. Smartphone and smartwatch-based biometrics using activities of daily living. IEEE Access, 7:133190-133202, 2019.

Wu, L., Jin, R., Hoi, S. C., Zhu, J., and Yu, N. Learning Bregman distance functions and its application for semi-supervised clustering. In Advances in Neural Information Processing Systems, pp. 2089-2097, 2009.

Xing, E. P., Jordan, M. I., Russell, S. J., and Ng, A. Y. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pp. 521-528, 2003.