# bayesian_distance_clustering__f4bdf772.pdf

Journal of Machine Learning Research 22 (2021) 1-27 Submitted 7/20; Revised 8/21; Published 9/21

Bayesian Distance Clustering

Leo L Duan LI.DUAN@UFL.EDU Department of Statistics University of Florida Gainesville, FL 32611, USA

David B Dunson DUNSON@DUKE.EDU Department of Statistical Science Duke University Durham, NC 27708, USA

Editor: Ruslan Salakhutdinov

Model-based clustering is widely used in a variety of application areas. However, fundamental concerns remain about robustness. In particular, results can be sensitive to the choice of kernel representing the within-cluster data density. Leveraging on properties of pairwise differences between data points, we propose a class of Bayesian distance clustering methods, which rely on modeling the likelihood of the pairwise distances in place of the original data. Although some information in the data is discarded, we gain substantial robustness to modeling assumptions. The proposed approach represents an appealing middle ground between distanceand model-based clustering, drawing advantages from each of these canonical approaches. We illustrate dramatic gains in the ability to infer clusters that are not well represented by the usual choices of kernel. A simulation study is included to assess performance relative to competitors, and we apply the approach to clustering of brain genome expression data. Keywords: Distance-based clustering, Mixture model, Model-based clustering, Model misspeciﬁcation, Pairwise distance matrix, Partial likelihood

1. Introduction

Clustering is a primary focus of many statistical analyses, providing a valuable tool for exploratory data analysis and simpliﬁcation of complex data. In the literature, there are two primary approaches distanceand model-based clustering. Let yi Y, for i = 1, . . . , n, denote the data and let d(y, y ) denote a distance between data points y and y . Then, distance-based clustering algorithms are typically applied to the n n matrix of pairwise distances D(n) (n) = {di,j}, with di,j = d(yi, yj) for all i, j pairs and (n) = {1, . . . , n}. For reviews, see Jain (2010); Xu and Tian (2015). In contrast, model-based clustering takes a likelihood-based approach in building a model for the original data y(n) that has the form:

yi iid f, f(y) =

h=1 πh K(y; θh), (1)

where π = (π1, . . . , πk) is a vector of probability weights in a ﬁnite mixture model, h is a cluster index, and K(y; θh) is the density of the data within cluster h. Typically, K(y; θ) is a density in a

c 2021 Leo Duan and David Dunson.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-688.html.

DUAN AND DUNSON

parametric family, such as the Gaussian, with θ denoting the parameters. The ﬁnite mixture model (1) can be obtained by marginalizing out the cluster index ci {1, . . . , k} in the following model:

yi K(θci), pr(ci = h) = πh. (2)

Using this data-augmented form, one can obtain maximum likelihood estimates of the model parameters π and θ = {θh} via an expectation-maximization algorithm (Fraley and Raftery, 2002). Alternatively, Bayesian methods are widely used to include prior information and characterize uncertainty in the parameters. For reviews, see Bouveyron and Brunet-Saumard (2014) and Mc Nicholas (2016). Distance-based algorithms tend to have the advantage of being relatively simple conceptually and computationally, while a key concern is the lack of characterization of uncertainty in clustering estimates and associated inferences. While model-based methods can address these concerns by exploiting a likelihood-based framework, a key disadvantage is a large sensitivity to the choice of kernel K( ; θ). Often, kernels are chosen for simplicity and computational convenience, and they place rigid assumptions on the shape of the clusters, which are not justiﬁed by the applied setting being considered. We are not the ﬁrst to recognize this problem, and there is literature attempting to address issues with kernel robustness in model-based clustering. One direction is to choose a ﬂexible class of kernels, which can characterize a wide variety of densities. For example, one can replace the Gaussian kernel with one that accommodates asymmetry, skewness and/or heavier tails (Karlis and Santourian (2009); Ju arez and Steel (2010); O Hagan et al. (2016); Gallaugher and Mc Nicholas (2018); among others). A related direction is to nonparametrically estimate the kernels speciﬁc to each cluster, while placing minimal constraints for identiﬁability, such as unimodality and sufﬁciently light tails. This direction is related to the mode-based clustering algorithms of Li et al. (2007); see also Rodr ıguez and Walker (2014) for a Bayesian approach using unimodal kernels. Unfortunately, as discussed by Hennig et al. (2015), a kernel that is too ﬂexible leads to ambiguity in deﬁning a cluster and identiﬁability issues: for example, one cluster can be the union of several clusters that are close. Practically, such ﬂexible kernels demand a large number of parameters, leading to daunting computation cost. A promising new strategy is to replace the likelihood with a robust alternative. Coretto and Hennig (2016) propose a pseudo-likelihood based approach for robust multivariate clustering, which captures outliers with an extra improper uniform component. Miller and Dunson (2018) propose a coarsened Bayes approach for robustifying Bayesian inference and apply it to clustering problems. Instead of assuming that the observed data are exactly generated from (1) in deﬁning a Bayesian approach, they condition on the event that the empirical probability mass function of the observed data is within some small neighborhood of that for the assumed model. Both of these methods aim to allow small deviations from a simple kernel. It is difﬁcult to extend these approaches to data with high complexity, such as clustering multiple time series, images, etc. We propose a new approach based on a Bayesian model for the pairwise distances, avoiding a complete speciﬁcation of the likelihood function for the data y(n). There is a rich literature proposing Bayesian approaches that replace an exact likelihood function with some alternative. Chernozhukov and Hong (2003) consider a broad class of such quasi-posterior distributions. Jeffreys (1961) proposed a substitution likelihood for quantiles for use in Bayesian inference; also refer to Dunson and Taylor (2005). Hoff (2007) proposed a Bayesian approach to inference in copula models, which avoids specifying models for the marginal distributions via an extended rank likelihood.

BAYESIAN DISTANCE CLUSTERING

Johnson (2005) proposed Bayesian tests based on modeling frequentist test statistics instead of the data directly. These are just some of many examples. Our proposed Bayesian distance clustering approach gains some of the advantages of modelbased clustering, such as uncertainty quantiﬁcation and ﬂexibility, while signiﬁcantly simplifying the model speciﬁcation task. There is a connection between our approach and nonnegative matrix factorization (NMF) methods (Kuang et al., 2012; Zhao et al., 2015; Kuang et al., 2015). Certain NMF algorithms can be viewed as fast approximations to our likelihood-based approach. Our major contributions are: (i) establishing a novel link between modeland distance-based frameworks, (ii) introducing a principled choice for assigning kernels for distances (equivalent to the afﬁnity/similarity score in NMFs), and (iii) providing a way to calibrate the parameters within the proposed probabilistic framework.

2. Partial likelihood for distances

2.1 Motivation for partial likelihood

Suppose that data y(n) are generated from model (1) or equivalently (2). We focus on the case in which yi = (yi,1, . . . , yi,p) Y Rp. The conditional likelihood of the data y(n) given clustering indices c(n) can be expressed as

L(y(n); c(n)) =

i:ci=h Kh(yi) =

h=1 Lh(y[h]), (3)

where we let Kh(y) denote the density of data within cluster h, and y[h] = {yi : ci = h} = {y[h] i , i = 1, . . . , nh} is the data in cluster h. Since the information of c(n) is stored by the index

with [h], we will omit c(n) in the notation when [h] appears. Referring to y[h] 1 as the seed for clus-

ter h, we can express the likelihood Lh(y[h]) using the change-of-variables (y[h] 1 , y[h] 2 . . . , y[h] nh) (y[h] 1 , d[h] 2,1, . . . , d[h] nh,1):

Kh(y[h] 1 )

i=2 Gh( d[h] i,1 | y[h] 1 )

= Kh y[h] 1 | d[h] 2,1, . . . , d[h] nh,1 Gh d[h] 2,1, . . . , d[h] nh,1 ,

where d[h] i,1 = y[h] i y[h] 1 denotes the difference vector between y[h] i and y[h] 1 (with G the transformed

kernel), and the second line is an equivalent factorization of the joint distribution, with Kh(y[h] 1 | .) the conditional density of y[h] 1 given the differences. Expression (4) is a product of the densities of the seed and (nh 1) differences. As the cluster size nh increases, the relative contribution of the seed density Kh(y[h] 1 | .) will decrease and the likelihood becomes dominated by Gh. With this heuristic justiﬁcation, we discard the Kh(y[h] 1 | .) term by treating the value of y[h] 1 as random and integrating out Kh(y[h] 1 | .). We now use a toy example to illustrate how to derive the function Gh from a known modelbased likelihood (later we will show how to specify Gh directly, when the model-based likelihood

DUAN AND DUNSON

is not known). Consider the case of yi R from a Gaussian mixture, starting from the likelihood of those yi s associated with ci = h:

Lh(y[h]) = (2πσ2 h) nh/2 exp

Pnh i=1(y[h] i µh)2

= (2πσ2 h) nh/2 exp

Pnh i=1 h (y[h] i y[h] 1 )2 + (y[h] 1 µh)2 + 2(y[h] i y[h] 1 )(y[h] 1 µh) i

To obtain Gh, based on the formula f(d, y) = f(y | d) R f(d, y)dy, we use the change-ofvariable d[h] i,1 = y[h] i y[h] 1 , and integrate out y[h] 1 [as the information of y[h] 1 is now in Kh y[h] 1 | d[h] 2,1, . . . , d[h] nh,1 ]:

Gh d[h] 2,1, . . . , d[h] nh,1 = Z (2πσ2 h) nh/2 exp

Pnh i=1 d[h]2 i,1 + nh(y[h] 1 µh)2 + 2(y[h] 1 µh) Pnh i=1 d[h] i,1 2σ2 h

= (2πσ2 h) (nh 1)/2 1 nh exp

Pnh i=1 d[h]2 i,1 (P d[h] i,1)2/nh 2σ2 h

(a) = (2πσ2 h) (nh 1)/2 1 nh

i,j exp d[h]2 i,j 4nhσ2 h

where (a) is due to 2 P i<j d[h]2 ij = P i P j( d[h] i,1 d[h] j,1)2 = nh P i d[h]2 i,1 +nh P j d[h]2 j,1 P j =i 2 d[h] i,1 d[h] j,1 P i=j 2 d[h] i,1 d[h] j,1 = 2nh[P d[h]2 i,1 (P d[h] i,1)2/nh]. In the above example, note that the right hand side appears to be a product density of all the pairwise differences D[h] = { d[h] i,j}(i,j), besides those formed with the seed. This is due to the linear

equality that d[h] i,j = d[h] i,1 d[h] j,1 therefore, there are effectively only (nh 1) free random variables; once they are given, the rest are completely determined. Generalizing from this form, we now specify G as

Gh d[h] 2,1, . . . , d[h] nh,1 =

j>i g1/nh h ( d[h] i,j), (5)

where gh : Rp R+ and each d[h] i,j is assigned a marginal density. The power 1/nh is a calibration parameter that adjusts the order discrepancy between the numbers of (nh 1)nh/2 marginal densities and (nh 1) effective random variables. We will formally justify this calibration in the theory section.

Remark 1 To clarify, despite the simple product form, (5) should not be treated as the independent densities of nh(nh 1)/2 differences. This is because these differences contain effectively (nh 1) random variables d[h] 2,1, . . . , d[h] nh,1, and (nh 1)(nh 2)/2 interaction terms d[h] i,j = d[h] i,1 d[h] j,1; these

BAYESIAN DISTANCE CLUSTERING

interaction terms induce dependence between d[h] i,1 and d[h] j,1:

Gh d[h] 2,1, . . . , d[h] nh,1 = nh Y

i=2 g1/nh h ( d[h] i,1) nh 1 Y

i<j g1/nh h ( d[h] i,1 d[h] j,1) .

For example, for d[h] 2,1 and d[h] 3,1, the related terms in (5) are:

g1/nh( d[h] 2,1)g1/nh( d[h] 3,1)g1/nh( d[h] 2,1 d[h] 3,1),

which is a non-separable function of d[h] 2,1 and d[h] 3,1, and hence not independent.

We now state the assumptions that we use for clustering.

Assumption 1 For those data within a cluster, y[h] i and y[h] j are independent and identically distributed.

Focusing on marginally specifying each gh, we can immediately obtain two key properties of d[h] i,j = y[h] i y[h] j : (1) Expectation zero, and (2) Marginal symmetry with skewness zero. Hence, the distribution of the differences is substantially simpler than the original data distribution Kh. This suggests using Gh for clustering will substantially reduce the model complexity and improve robustness. We connect the density of the differences to a likelihood of distances here used as a loose notion including metrics, semi-metrics and divergences. Consider di,j [0, ) as a transform of di,j, such as some norm di,j = di,j (e.g. Euclidean or 1-norm); hence, a likelihood for di,j is implicitly associated to a pushforward measure from the one on di,j (assuming a measurable transform). For example, an exponential density on di,j = di,j 1 can be taken as the result of assigning a multivariate Laplace on di,j. We can further generalize the notion of difference from subtraction to other types, such as ratio, cross-entropy, or an application-driven speciﬁcation (Izakian et al., 2015). To summarize, this motivates the practice of ﬁrst calculating a matrix of pairwise distances, and then assigning a partial likelihood for clustering. For generality, we slightly abuse notation and replace the difference array D with the distance matrix D in (5), and denote the density by Gh(D[h]). We will refer to (5) as the distance likelihood from now on. Conditional on the clustering labels,

L[D; c(n)] =

h=1 Gh(D[h]), (6)

with ci Pk h=1 πhδh independently, as is (2).

DUAN AND DUNSON

Figure 1: Illustration of the clustering uncertainty and its estimation using the distance-based clustering method. Left panel: two overlapping clusters (red and grey) generated from two skew Gaussian distributions with n = 400. Center panel: the oracle uncertainty pr(ci = ˆci | yi) calculated based on the generative distribution. Right panel: the matrix of the co-assignment probabilities pr(ci = cj | D) estimated using the distance likelihood of D.

To provide some intuition about the clustering uncertainty, we simulate two clusters using the bivariate skewed Gaussian distribution: in each dimension, the ﬁrst cluster has a skewness of 4 (red in Figure 1, left panel), location of 0 and scale of 1, and the second has a skewness of 2, location of 0.5 and scale of 1 (grey in Figure 1, left panel); the two sub-coordinates are generated independently. Via the skew Gaussian density Kh(yi) used to generate the data, we can compute the oracle assignment probability pr(ci = h | yi) for h = 1 and 2, and the most likely cluster assignment ˆci for each data point. Clearly, due to the overlapping of the two clusters, there is a signiﬁcant amount of uncertainty for those near the location (0, 0), as the pr(ci = ˆci) remains away from 0 (Figure 1, center panel) importantly, such uncertainty does not vanish even as n , as these points will remain nearly equidistant to the two cluster centers. Using the distance likelihood on D, we can obtain an easy quantiﬁcation of the uncertainty, by sampling ci from the posterior distribution and calculating the co-assignment probabilities pr(ci = cj | D); as shown in the right panel, the off-diagonal block shows that a signiﬁcant portion of data that can be co-assigned to either the ﬁrst or the second cluster with a non-trivial probability.

2.2 Choosing a distance density for clustering

To implement our Bayesian distance clustering approach, we need a deﬁnition of clusters, guiding us to choose a parametric form for gh(.) in (5). For conciseness, we will focus on the norm-based distances from now on. A popular intuition for a cluster is a group of data points, such that most of the distances among them are relatively small. That is, the probability of ﬁnding large distances within a cluster should be low. We now state the assumption.

Assumption 2 With σh > 0, a scale parameter and ϵh a function that rapidly declines towards 0 as t increases. pr(d[h] i,j tσh) ϵh(t) for sufﬁciently large t > 0. (7)

BAYESIAN DISTANCE CLUSTERING

For such a decline, it is common to consider the sub-exponential rate (Wainwright, 2019), ϵh(t) = O{exp( t/b)} with some constant b > 0. Ideally, we want σh to be small, so that most of the distances within a cluster are small. It is tempting to consider using a simple exponential density Exp(σh) for modeling d[h] i,j, however, we make an important observation here: the exponential distribution has a deterministic relationship between the mean σh and the variance σ2 h this means any slightly large Ed[h] i,j (such as

when the distribution of d[h] i,j does not follow a exponential decay near zero) will inﬂate the estimate of σh, making it difﬁcult to use small distances for clustering.

Figure 2: Histograms of Euclidean distances scaled by 1/σh (with σh p). Left is the ﬁrst two dimensions, and the right show that the distances formed within a cluster (cyan) tend to be much smaller than the ones across clusters (red). Each cluster s data are generated from a multivariate Laplace distribution yi Lap(µh, Σh) with h = 1, 2.

This motivates us to use a two-parameter distribution instead in this article, we use Gamma (αh, σh) for gh in (5), whose variance αhσ2 h is no longer completely determined by the mean αhσh.

gh(d[h] i,j) = 1 Γ(αh)σαh h xαh 1 exp d[h] i,j/σh . (8)

We defer the prior choice for αh and σh to a later section. For now, we verify that the Gamma distribution does have a sub-exponential tail that satisﬁes Assumption 1.

Lemma 2 (Bound on the right tail) If d has the density (8), for any αh 1 and t > 0,

pr(d tσh) Mtαh exp ( t),

where M = (αh) αh exp(αh).

Remark 3 The polynomial term tαh allows deviation from the exponential distribution at small t; its effect vanishes as t increases.

The assumption on the distances are connected to some implicit assumptions on the data distribution K(yi). As such a link varies with the speciﬁc form of the distance, we again focus on the vector norm of subtraction d[h] i,j = y[h] i y[h] j q, with x q = (Pp j=1 xq j)1/q and q 1. We now show that a sub-exponential tail for the vector norm distance is a necessary result of assuming sub-exponential tails in K(yi).

DUAN AND DUNSON

Lemma 4 (Tail of vector norm distance) If there exist bound constants m[h] 1 , m[h] 2 > 0, such that for all j = 1, . . . , p pr(|y[h] i,j Ey[h] i,j | t) m[h] 1 exp( m[h] 2 t), (9)

then, there exist another two constants νh, bh > 0, such that for any q 1

pr(d[h] ij > tbhpη) 2p exp{ tp(η 1/q)/2} for t > 2p1/q ην2 h. (10)

Remark 5 The concentration property (9) is less restrictive than common assumptions on the kernel in a mixture model, such as Gaussianity, log-concavity or unimodality.

3. Hyper-prior speciﬁcation for αh and σh

In Bayesian clustering, it is useful to choose the prior parameters in a reasonable range (Malsiner Walli et al., 2017). Recall in our gamma density, αh determines the mean for d[h] i,j at αhσh. To favor small values for the mode while accommodating a moderate degree of uncertainty, we use a Gamma prior αh Gamma(1.5, 1.0). To select a prior for σh, we associate it with a pre-speciﬁed maximum cluster number k. We can view k as a packing number that is, how many balls (clusters) we can ﬁt in a container of the data. To formalize, imagine a p-dimensional ellipsoid in Rp enclosing all the observed data. The smallest volume of such an ellipsoid is

vol(Data) = M min µ Rp,Q 0(det Q) 1/2, s.t. (yi µ)TQ(yi µ) 1 for i = 1, . . . , n,

which can be obtained via a fast convex optimization algorithm (Sun and Freund, 2004), with M = πp/2/Γ(p/2 + 1) and π 3.14. If we view each cluster as a high-probability ball of points originating from a common distribution, then the diameter the distance between the two points that are farthest apart is 4σh. This is calculated based on pr(d 4σh) 0.95 using the gamma density with shape 1.5 (the prior mean of αh). We denote the ball by B2σh, with vol(B2σh) = M(2σh)p. Setting k to the packing number

k vol(Data)

yields a sensible prior mean for σh. For conjugacy, we choose an inverse-gamma prior for σh with E(σh) = βh,

σh Inverse-Gamma(2, βσ), βσ = 1

The above prior can be used as a default in broad applications, and does not require tuning to each new application.

We describe several interesting properties for the distance likelihood.

BAYESIAN DISTANCE CLUSTERING

4.1 Calibration

Lemma 6 (Exchangeability) When the product density (5) is used for all Gh(D[h]), h = 1, . . . , k, the distance likelihood (6) is invariant to permutations of the indices i:

L{y(n); c(n)} = L{y(n ); c(n )},

with (n ) = {1 , . . . , n } denoting a set of permuted indices.

We ﬁll a missing gap between the model-based and distance likelihoods by considering an information-theoretic analysis of the two clustering approaches. This also leads to a principled choice of the power 1/nh in (5). To quantify the information in clustering, we ﬁrst brieﬂy review the concept of Bregman divergence (Bregman, 1967). Letting φ : S R be a strictly convex and differentiable function, with S the domain of φ, the Bregman divergence is deﬁned as

Bφ(x, y) = φ(x) φ(y) (x y)T φ(y),

where φ(y) denotes the gradient of φ at y. A large family of loss functions, such as squared norm and Kullback-Leibler divergence, are special cases of the Bregman divergence with suitable φ. For model-based clustering, when the regular exponential family ( regular as the parameter space is a non-empty open set) is used for the component kernel Kh, Banerjee et al. (2005) show that there always exists a re-parameterization of the kernel using Bregman divergence. Using our notation,

Kh(yi; θh) = exp T(yi) θh ψ(θh) κ(yi) exp [ Bφ {T(yi), µh}] bφ{T(yi)},

where T(yi) is a transformation of yi, in the same form as the minimum sufﬁcient statistic for θh (except this statistic is based on only one data point yi); µh is the expectation of T(yi) taken with respect to Kh(y; θh); ψ, κ and bφ are functions mapping to (0, ). With this re-parameterization, maximizing the model-based likelihood over c(n) becomes equivalent to minimizing the within-cluster Bregman divergence

h=1 H[h] y , H[h] y =

i=1 Bφ n T(y[h] i ), µh o .

We will refer to Hy as the model-based divergence. For the distance likelihood, considering those distances that can be viewed or re-parameterized as a pairwise Bregman divergence, we assume each g(d[h] i,j) in the distance likelihood (5) can be re-written with a calibrating power βh > 0 as

gβh(d[h] i,j) = zβh exp h βh Bφ n T(y[h] i ), T(y[h] j ) oi ,

with z > 0 the normalizing constant. A distance-based divergence Hd can be computed as

h=1 H[h] d , H[h] d = βh

1 2Bφ n T(y[h] i ), T(y[h] j ) o . (11)

We now compare these two divergences Hy and Hd at their expectations.

DUAN AND DUNSON

Lemma 7 (Expected Bregman Divergence) The distance-based Bregman divergence (11) in cluster h has

Ey[h]H[h] d = βh Ey[h] i Ey[h] j

1 2Bφ{T(y[h] i ), T(y[h] j )}

= (nhβh) Ey[h]

Bφ{T(y[h] i ), µh} + Bφ{µh, T(y[h] i )} 2

where the expectation over y[h] is taken with respect to Kh.

Remark 8 The term inside the expectation on the right hand side is the symmetrized Bregman divergence between T(y[h] i ) and µh (Banerjee et al., 2005). Therefore, Ey[h]H[h] d = (nhβh)Ey[h]H[h] y when Bφ( , ) is symmetric.

There is an order difference of O(nh) between distance-based and model-based divergences. Therefore, a sensible choice is simply setting βh = 1/nh. This power is related to the weights used in composite pairwise likelihood (Lindsay, 1988; Cox and Reid, 2004).

4.2 Relationship to Graph Cut

It is also interesting to consider the matrix form of the distance likelihood. We use C as an n k binary matrix encoding the cluster assignment, with Ci,h = 1 if ci = h, and all other Ci,h = 0. Then it can be veriﬁed that CTC = diag(n1, . . . , nk). Hence the distance likelihood, with the Gamma density, is

G(D; C) exp tr CT(log D)CΛ(CTC) 1 exp tr CTDC ΣCTC 1 , (12)

where D is the n n distance matrix, log is applied element-wise, Σ = diag(σ1, . . . , σk), and Λ = diag(α1 1, . . . , αh 1). If C contains zero columns, the inverse is replaced by a generalized inverse. One may notice some resemblance of (12) to the loss function in graph partitioning algorithms. Indeed, if we simplify the parameters to α1 = = αk = α0 and σ1 = = σk = σ0, then

G(D; C) exp tr CTAC CTC 1 , (13)

where A = κ1n,n D/σ0 + (α0 1) log D can be considered as an adjacency matrix of a graph formed by a log-Gamma distance kernel, with 1n,n as an n n matrix with all elements equal to 1; κ a constant so that each Ai,j > 0 (since κ enters the likelihood as a constant tr{CTκ1n,n C(CTC) 1} = nκ, it does not impact the likelihood of C). To compare, the popular normalized graph-cut loss (Bandeira et al., 2013) is

NCut-Loss =

Ai,j 2nh , (14)

which is the total edges deleted because of partitioning (weighted by n 1 h to prevent trivial cuts). There is an interesting link between (13) and (14).

BAYESIAN DISTANCE CLUSTERING

Lemma 9 Considering a graph with weighted adjacency matrix A, the normalized graph-cut loss is related to the negative log-likelihood (omitting constant) (13) via

2NCut-Loss = tr CTAC CTC 1 +

Pn j=1 Ai,j

Remark 10 The difference on the right is often known as the degree-based regularization (with Pn j=1 Ai,j the degree, nci the size of the cluster that data i is assigned to). When the cluster sizes are relatively balanced, we can ignore its effect.

Such a near-equivalence suggests that we can exploit popular graph clustering algorithms, such as spectral clustering, for good initiation of C as a warm-start of the Markov chain Monte Carlo algorithm.

5. Posterior computation

For the posterior computation, Gibbs sampler is easy to derive as it involves updating one element of the parameter at a time, from the full conditional distribution. However, this could lead to a heavy computational cost for our model. To understand this, consider the update step for each ci, which involves a draw from the categorical distribution:

pr(ci = h | .) = πh G(D; Ci,h) Pk h =1 πh G(D; Ci,h ) ,

where Ch denotes a matrix equal to the current value of C, except replacing the ith row with Ci,h = 1 and Ci,j = 0 for other j = h. Since G(D; C) involves a matrix inverse term (CTC) 1, the above ratio cannot be simpliﬁed to reduce the computational burden. The evaluation cost for each G(D; C) is dominated by the matrix multiplication steps within, hence having an overall cost of O(n2k). Therefore, iterating over h = 1, . . . , k and i = 1, . . . , n will lead to a high cost in one sweep of update. To solve this problem, we instead develop a more efﬁcient algorithm based on the approximate Hamiltonian Monte Carlo (HMC) algorithm. We use a continuous relaxation of each row Ci (on a simplex vertex) into the interior of the simplex, and denote the relaxation by Wi (k 1) \ . This is achieved via a tempered softmax re-parameterization (Maddison et al., 2017)

wi,h = exp(vi,h/t) Pk h =1 exp(vi,h /t) , h = 1, . . . , k.

At small t > 0 and close to 0, if one vi,h is slightly larger than the rest in {vi,1, . . . , vi,k}, then wi,h will be close to 1, and all the other wi,h s close to 0. In this article, we use t = 0.1 as a balance between the approximation accuracy and the numeric stability of the algorithm. In addition, we re-parameterize the other parameters using the softplus function σh = log[exp( σh) + 1] , αh = log[exp( αh) + 1] for h = 1, . . . , k, and and the softmax function (π1, . . . , πk) = softmax( π1, . . . , πk) (as deﬁned above except with t = 1), where σh, αh and πh are all unconstrained parameters in R amenable to the off-the-shelf continuous HMC algorithm. We denote the vectorized parameters by β = (v1,1, . . . , vn,k, σ1, . . . , σk, α1, . . . , αk, π1, . . . , πk). To sample from posterior distribution β Πβ|D( ), the HMC uses an auxiliary momentum variable

DUAN AND DUNSON

v and samples from a joint distribution Π(β, v) = Π(β | D)Π(v), where a common choice of Π(v) is the density of N(0, M). Denote U(β) = log Π(β | D) and K(v) = log π(v) = v T M 1v/2, which are commonly referred to as the potential energy and kinetic energy respectively. The total Hamiltonian energy function is H(β, v) = U(β) + K(v). At each state (β, v), a new proposal (β , v ) is generated by simulating Hamiltonian dynamics satisfying the Hamilton s equations:

t = H(β, v)

v = M 1v; v

t = H(β, v)

β = log Π(β | D)

Since the exact solution to the above is intractable, we can numerically approximate the temporal evolution using the leapfrog scheme, as described in the following pseudocode.

for Iteration =1, 2, . . . do

Sample v N(0, M), set β β and v v; for l = 1, . . . , L do

Update v v + ϵ

2 log Π(β |D)

β ; Update β β + ϵM 1v ;

Update v v + ϵ

2 log Π(β |D)

β ; if (β β)Tv < 0 then

Break; Sample u Uniform(0, 1); if u < min{1, exp[ H(β , v ) + H(β, v)]}. then

Set β β ; Algorithm 1: The pseudocode of the No-U-Turn Hamiltonian Monte Carlo sampler for the Bayesian distance clustering.

To accelerate the convergence of the Markov chain to stationarity, we ﬁrst use the BFGS optimization algorithm (implemented in the Py Torch package) to ﬁrst minimize U(β) and obtain the posterior mode ˆβ. We then initialize the Markov chain at β = ˆβ. A typical choice for the working parameter M 1 is to let it roughly scale with the covariance matrix of the posterior distribution (Neal, 2011). Using ˆβ, we calculate the observed Fisher information at ˆβ [the Hessian matrix of U(β) evaluated at ˆβ, denoted by Hess U(ˆβ)], which gives an approximation to the inverse covariance of β. Although it is tempting to set M 1 = [Hess U(ˆβ)] 1, the matrix inversion of the latter is often costly and ill-conditioned. To avoid this problem, we use a simpler and diagonal parameterization M 1 = diag(1/Hess U(ˆβ)i,i), which shows good empirical performances in all the examples within this article. To run the HMC sampler, we use the No-U-Turn Sampler (NUTS-HMC) algorithm (Hoffman and Gelman, 2014) implemented in the hamiltorch package (Cobb and Jalaian, 2020), which also automatically tunes the other two working parameters ϵ and L. After the automatic tuning, the algorithm reaches an acceptance rate close to 70% as commonly desired for good mixing of the Markov chains. To provide some running time, using a quad-core i7 CPU, at n = 1000, the HMC algorithm takes about 20 minutes for running 10, 000 iterations.

Remark 11 On the computational cost, the most expensive step in the HMC algorithm is the calculation of the derivative of log G(D; W) with respect to the matrix W, which involves the following

BAYESIAN DISTANCE CLUSTERING

tr[(XTAX)(XTBX) 1]

X = 2AX(XTBX) 1 2BX(XTBX) 1(XTAX)(XTBX) 1

where X Rn k, symmetric B Rn n and symmetric A Rn n. Since k is relatively small, the matrix inversion of the k k matrix is not costly [O(k3)] and dominated by the matrix multiplication O(n2k). Therefore, running over L leapfrog steps, the computational cost per iteration of HMC is O(Ln2k). Potentially, one could instead consider a Gibbs sampling algorithm, using a block-wise update of CT(log D)C and CTDC (instead of a full evaluation of the matrix product) when sampling each row of C. Despite having a similar computing complexity, a strength of HMC is that we can take advantage of the highly parallelized matrix operation on C, which is often faster than the sequential looping over each row of C. In comparison, the parametric/model-based clustering algorithm has a lower cost of O(n), although this often comes with a risk of model misspeciﬁcation for modern data. Therefore, choosing which class of methods involves a trade-off between computational speed versus model robustness.

The posterior samples of CCT give an estimate of the pairwise co-assignment probabilities pr(ci = cj) = Pk h=1 pr(ci = cj = h). To obtain estimates for pr(ci = h), we use symmetric simplex matrix factorization (Duan, 2020) on {pr(ci = cj)}i,j to obtain an n k matrix corresponding to {pr(ci = h)}i,h. For the diagnostics on the convergence, we calculate the autocorrelation (ACF) and the effective sample size (ESS) for each parameter, and we provide some diagnostic plots in the appendix. In this article, for the ease of visualization and interpretation, we use pr(ci = ˆci | D) as a measure of the uncertainty on the point estimate ˆci = maxh=1,...,k pr(ci = h | D). An alternative is to use the variation of information (Wade and Ghahramani, 2018) as a metric between the clusterings, leading to the discrete extension of the posterior mean and credible intervals. The readers can ﬁnd the method and toolbox within the reference. In addition, in the appendix, we show that the non-negative matrix factorization (NMF) algorithm, if using a calibrated similarity matrix as its input (such as using our distance likelihood), produces an almost indistinguishable result from the Bayesian distance clustering method. On the other hand, if the similarity is set less carefully (such as using the default choice in popular machine learning packages), we found a severe sensitivity that leads to over-/under-estimation of the uncertainty (as shown in Panel 4 of Figure 8). Therefore, for the sake of both conciseness and fairness, we choose to not present the NMF results without calibration in the numerical experiments.

DUAN AND DUNSON

6. Numerical experiments

6.1 Clustering with skewness-robust distance

2 0 2 4 6 Value

(a) Histogram and the true density (red line) of a mixture of two symmetric Gaussians.

0 1 2 3 4 5 Value

(b) Histogram and the true density (red line) of a mixture of two right skewed Gaussians.

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

(c) Assignment probability pr(ci = 1), under Bayesian distance clustering and the mixture of Gaussians. Dashed line is the oracle probability based on symmetric Gaussians.

0 1 2 3 4 5

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

(d) Assignment probability pr(ci = 1), under Bayesian distance clustering and the mixture of Gaussians. Dashed line is the oracle probability based on skewed Gaussians.

(e) The point estimates ˆci (represented by two colors) using Bayesian distance clustering and the mixture of Gaussians.

Figure 3: Clustering data from a two-component mixture of skewed Gaussians in R. Bayesian Distance clustering (BDC) gives posterior clustering probabilities close to the oracle probabilities regardless of whether the distribution is skewed or not (upper plots in panel c and d), while the mixture of Gaussians fails when the skewness is present (lower plot in panel d).

BAYESIAN DISTANCE CLUSTERING

p Bayes Dist. Clustering Mix. of Gaussians Mix. of Skewed Gaussians 1 0.80 (0.75, 0.84) 0.65 (0.55, 0.71) 0.81 (0.75, 0.85) 5 0.76 (0.71, 0.81) 0.55 (0.40, 0.61) 0.76 (0.72, 0.80) 10 0.72 (0.68, 0.76) 0.33(0.25, 0.46) 0.62 (0.53, 0.71) 30 0.71 (0.67, 0.76) 0.25 (0.20, 0.30) 0.43 (0.37, 0.50)

Table 1: Accuracy of clustering skewed Gaussians under different dimensions p. Adjusted Rand index (ARI) is computed for the point estimates using variation of information. The average and 95% conﬁdence interval are shown.

As described in Section 2.1, the vector norm-based distance is automatically robust to skewness. To illustrate, we generate n = 200 data from a two-component mixture of skewed Gaussians:

pr(ci = 1) = pr(ci = 2) = 0.5,

yi,j | ci = h SN(µh, 1, αh) for j = 1 . . . p,

where SN(µ, σ, α) has density π(y | µ, σ, α) = 2f{(y µ)/σ}F{α(y µ)/σ} with f and F the density and cumulative distribution functions for the standard Gaussian distribution. We start with p = 1 and assess the performance of the Bayesian distance clustering model under both non-skewed (α1 = α2 = 0, µ1 = 0, µ2 = 3) and skewed distributions (α1 = 8, α2 = 10, µ1 = 0, µ2 = 2). The results are compared against the mixture of Gaussians as implemented in the Mclust package. Figure 3(a,c) show that for non-skewed Gaussians, the proposed approach produces clustering probabilities close to their oracle probabilities, obtained using knowledge of the true kernels that generated the data. When the true kernels are skewed Gaussians, Figure 3(b,d) shows that the mixture of Gaussians gives inaccurate estimates of the clustering probability, whereas Bayesian distance clustering remains similar to the oracle. To evaluate the accuracy of the point estimate ˆci, we compute the adjusted Rand index (Rand, 1971) with respect to the true labels. We test under different p {1, 5, 10, 30}, and repeat each experiment 30 times. The results are compared against model-based clustering using symmetric and skewed Gaussians kernels, using independent variance structure. As shown in Table 1, the misspeciﬁed symmetric model deteriorates quickly as p increases. In contrast, Bayesian distance clustering maintains high clustering accuracy.

6.2 Clustering high dimensional data with subspace distance

For high-dimensional clustering, it is often useful to impose the additional assumption that each cluster lives near a different low-dimensional manifold. Clustering data based on these manifolds is known as subspace clustering. We exploit the sparse subspace embedding algorithm proposed by Vidal (2011) to learn pairwise subspace distances. Brieﬂy speaking, since the data in the same cluster are alike, each data point can be approximated as a linear combination of several other data points in the same subspace; hence a sparse locally linear embedding can be used to estimate an n n coefﬁcient matrix ˆW through

ˆW = arg min W:wi,i=0,P j wi,j=1

i=1 yi Wyi 2 2 + W 1,

DUAN AND DUNSON

Bayes Dist. Clustering Spectral Clustering HDClassif 0.57 (0.54, 0.60) 0.50 (0.48, 0.52) 0.35 (0.31, 0.43)

Table 2: Accuracy of clustering MNIST hand-written digit data. Adjusted Rand index (ARI) is computed for the point estimates using variation of information. The average ARI and 95% conﬁdence intervals are shown.

where the sparsity of ˆW ensures only the data in the same linear subspace can have non-zero embedding coefﬁcients. Afterward, we can deﬁne a subspace distance matrix as

di,j = 2 | ˆwi,j| maxj | ˆwi,j | + | ˆwj,i| maxj | ˆwj,j|

where we follow Vidal (2011) to normalize each row by its absolute maximum. We then use this distance matrix in our Bayesian distance clustering method. To assess the performance, we use the MNIST data of hand-written digits of 0 9, with each image having p = 28 28 pixels. In each experiment, we take n = 10, 000 random samples to ﬁt the clustering models, among which each digit has approximately 1000 samples, and we repeat experiments 10 times. For comparison, we also run the near low-rank mixture model in HDclassif package (Berg e et al., 2012) and spectral clustering based on the p-dimensional vector norm. Our method using subspace distances shows clearly higher accuracy, as shown in Table 2.

7. Clustering brain regions

We carry out a data application to segment the mouse brain according to the gene expression obtained from the Allen Mouse Brain Atlas dataset (Lein et al., 2007). Speciﬁcally, the data are in situ hybridization gene expression, represented by expression volume over spatial voxels. Each voxel is a (200µm)3 cube. We take the mid-coronal section of 41 58 voxels. Excluding the empty ones outside the brain, they have a sample size of n = 1781. For each voxel, there are records of expression volume over 3241 different genes. To avoid the curse of dimensionality for distances, we extract the ﬁrst p = 30 principal components and use them as the source data. Since gene expression is closely related to the functionality of the brain, we will use the clusters to represent the functional partitioning, and compare them in an unsupervised manner with known anatomical regions. The voxels belong to 12 macroscopic anatomical regions (Table 4).

BAYESIAN DISTANCE CLUSTERING

D D D D D DD D D

D D D D DDD D D D D D

D D D DDD D D D D D DD D D D D

D D D DDD D D D D D DD D

DD D D D DDDDDDDD

D DDD D E E EE DD D D

F CC C C E E E E E

CCC C C C C C E E

FC C CC C C C C C C E

F CC C C C C D D D C C C C

C C C CF F D D D D F C C C C EEE

C FF F F F F F D D C C C C C E E

F F F B B F

F F D C C C C C E

D D D F C C C

B B B B B B B

B BB B B B B B

BB B B B BB

B B A A A A A

BBB B B BB B B B A A A A A

BB BB B B B B B B A AA A AA

BBBBB BB BB B A A A A A A A

BB BBBBBB B B A A AAA A A

B BB BBBB B B B A AA A A A A

B B B BB B B B B B B

B AA A A A A

B B B B B B B B B B B

F AA A A AA

B BB BBBBB B B BAAA A A A A

B BBBBBB B B A A AAA A A

BBBB B BB BB B A A A A

BBBB B B B B B B A AA A

BB B B BB B B B A A A A

BB B B B B B

B B A A A A A

B BB B B B B

B B B B B B B

F DF F F C C

F F F D C C C C

F F F D C C C C E E

C C F F F FF FF D D C C C C C E E

C C C F D D D D D C C C C CEE E

C C C C D D D C C C

F CC CC C C C C C C E

F CC C C C C C E D E E E

C CC C C F E E E E

DD D D D D D

D DDD E E EE DD D

DD D D D D DDDDD D

D D D D D D D D D D D D D D DD D

D D D DDDD D D D D DD D

D D D D D D D DDD D D D D D D D DD D D

6 5 4 3 2 1 y1

(a) Anatomical structure labels.

F F F F F FF F F

F F F F FFF F F F F F

F F F FFF F F F F FFF F F F F

F F F FFF F F F F F FF F

EF F F F FFFFF FFF

F F FF F F F F F F F F F

F FF F F F F F F F

FFF F F F F F F F

FF F FF F F F F F F F

D DD D D C C C C C C C C C

D D D DC C C C C C C C C C C CCC

D DC B B B B C C C C C C C C C C

D C B B B B

C C C C C C C C C

B C C C C C C

B B B B B B B

B BB B B B B B

BB B B B BB

B B B B A A A

BBB B B BB B B B B A A A A

BB BB B B B B B B B AA A AA

BBBBB BB BB B B A A A A A A

FB BBBBBB B B B B AAA A A

F BB BBBB B B B B AA A A A A

F B B BB B B B B B B

B BA A A A A

F F B B B B B B B B B

B AA A A AA

F BB BBBBB B B BBAA A A A A

B BBBBBB B B B A AAA A A

BBBB B BB BB B B A A A

BBBB B B B B B B B AA A

BB B B BB B B B B A A A

BB B B B B B

B B B B A A A

B BB B B B B

B B B B B B B

B BC C C C C

B C C C C C C C

B C C C C C C C C C

D D D C C BB BC C C C C C C C C C

D D D C C C C C C C C C C CCC C

F F F F C C C C C C

F F F FF F F F F F F F

F FFF F F F F F F F F F

F FF F F F F F F F

EE E E F F F

F FFF F F F F FF F

EE E E F F FFFFF F

F F E E E E EE F F F F F F F F F

E E E EFFF F F F F F F F

E E F F FF F FFF F F F F F F F FF F F

6 5 4 3 2 1 y1

(b) Point estimate from Gaussian mixture model.

E E E E E EE E E

E E E E EEE E E E E E

E E E EEE E E E E EEE E E E E

E E E EEE E E E E E EE E

EE E E E EEEEDDDD

D D FF D D D DD DD D D

F FF F F F D D DD

FFF F F F F F F F

FF F FG G G G G F F F

G GG G G A A A A A G G F F

G G G AA A A A A A G G G F F FFF

G AA A A A A A A A G G G F F F F

A A A A A A

A A A A G G F F F

A A A A G G F

B B B B B B B

B BB B B B B B

BB B B B BB

B B B A A A A

BBB B B BB B B B A A A A A

BB BB B B B B B B A AA A AA

BBBBB BB BB B B A A A A A A

BB BBBBBB B B B A AAA A A

B BB BBBB B B B A AA A A A A

B B B BB B B B B B B

A AA A A A A

C B B B B B B B B B B

A AA A A AA

B BB BBBBB B B BAAA A A A A

B BBBBBB B B B A AAA A A

BBBB B BB BB B B A A A

BBBB B B B B B B A AA A

BB B B BB B B B A A A A

BB B B B B B

B B B A A A A

B BB B B B B

B B B B B B B

A AA A A G G

A A A A A G G F

A A A A G G G F F F

G G A A A AA AA A A G G G F F F F

G G A A A A A A A G G F F FFF F

G G G G A A G G G F

F F F FF F F F F F F F

F FFF F F F F F F D D D

F FF F F F D D D D

EE E E E E E

D DDD D D DD DD D

EE E E E E EEEEE E

E E E E E E EE E E E E E E EE E

E E E EEEE E E E E EE E

E E E E EE E EEE E E E E E E E EE E E

6 5 4 3 2 1 y1

(c) Point estimate from Bayesian Distance Clustering.

Figure 4: Clustering mouse brain using gene expression: visualizing the clustering result on the ﬁrst two principal components.

For clustering, we use an over-ﬁtted mixture with k = 20 and small Dirichlet concentration parameter α = 1/20. As shown by Rousseau and Mengersen (2011), asymptotically, small α < 1 leads to the automatic emptying of small clusters; we observe such behavior here in this large sample. In the Markov chain, most iterations have 7 major clusters. Table 5 lists the voxel counts at ˆc(n). Comparing the two tables, although we do not expect a perfect match between the structural and functional partitions, we do see a correlation in group sizes based on the top few groups. Indeed, visualized on the spatial grid (Figure 5), the point estimates from Bayesian distance clustering have very high resemblance to the anatomical structure. Comparatively, the clustering result from the Gaussian mixture model is completely different. To benchmark against other distance clustering approaches, we compute various similarity scores and list the results in Table 3. Competing methods include spectral clustering (Ng et al., 2002), DBSCAN (Ester et al., 1996) and HDClassif (Berg e et al., 2012); the ﬁrst two are applied on the same dimension-reduced data as used by Bayesian distance clustering, while the last one

DUAN AND DUNSON

BDC GMM Spectral Clustering DBSCAN HDClassif Adjusted Rand Index 0.49 0.31 0.45 0.43 0.43 Normalized Mutual Information 0.51 0.42 0.46 0.44 0.47 Adjusted Mutual Information 0.51 0.42 0.47 0.45 0.47

Table 3: Comparison of label point estimates using Bayesian distance clustering (BDC), Gaussian mixture model (GMM), spectral clustering, DBSCAN and HDClassif. The similarity measure is computed with respect to the anatomical structure labels.

0 20 40 60 X1

(a) Anatomical structure labels.

0 20 40 60 X1

(b) Point estimate from Gaussian mixture model.

0 20 40 60 X1

(c) Point estimate from Bayesian Distance Clustering.

0 20 40 60 X1

0.4 Uncertainty

(d) Uncertainty based on Bayesian Distance Clustering: pr(ci = ˆci)

Figure 5: Clustering mouse brain using gene expression: visualizing the clustering result on the spatial grid of brain voxels. Comparing with the anatomical structure (panel a), Bayesian Distance Clustering (panel c) has a higher similarity than the Gaussian mixture model (panel b). Most of the uncertainty (panel d) resides in the inner layers of the cortical plate (upper parts of the brain).

is applied directly on the high dimensional data. Among all the methods, the point estimates of Bayesian Distance Clustering have the highest similarity to the anatomical structure. Figure 5(d) shows the uncertainty about the point clustering estimates, in terms of the probability pr(ci = ˆci). Besides the area connecting neighboring regions, most of the uncertainty resides in the inner layers of the cortical plate (upper parts of the brain). As a result, the inner cortical plate can be either clustered with the outer layer or with the inner striatum region. Therefore, from a practical perspective, when segmenting the brain according to the functionality, it is more appropriate to treat the inner layers as a separate region.

BAYESIAN DISTANCE CLUSTERING

8. Discussion

The use of a distance likelihood reduces the sensitivity to the choice of a mixture kernel, giving the ability to exploit distances for characterizing complex and structured data. While we avoid specifying the kernel, one potential weakness is that there can be sensitivity to the choice of the distance metrics. For example, the Euclidean distance tends to produce a more spherical cluster, compared to the weighted Euclidean distance (see appendix). However, our results suggest that this sensitivity is often less than that of the assumed kernel. In many settings, there is a rich literature considering how to carefully choose the distance metric to reﬂect structure in the data (Pandit and Gupta, 2011). In such cases, the sensitivity of clustering results to the distance can be viewed as a positive. Clustering method necessarily relies on some notion of distances between data points. Another issue is that we give up the ability to characterize the distribution of the original data. An interesting solution is to consider a modular modeling strategy that connects the distance clustering to a post-clustering inference model while restricting the propagation of cluster information in one direction only. Related modular approaches have been shown to be much more robust than a single overarching full model (Jacob et al., 2017). Our concentration characterization of the within-cluster distance based on the vector norm holds for any arbitrary p. On the other hand, high-dimensional clustering is a subtle topic with challenging issues: (i) not all the coordinates in Rp contain discriminative information that is favorable for one particular partition; hence some alternative distances (Vidal, 2011), feature selection (Witten and Tibshirani, 2010), or multi-view clustering (Duan, 2020) may be necessary; (ii) the selection of number of clusters k becomes difﬁcult, and it was recently discovered (Chandra et al., 2020) that the model-based framework could lead to nonsensical results of converging to either one cluster or n clusters even under a correctly speciﬁed model, as p . One interesting extension of this work is to combine with the nearest neighbor algorithm we can choose to ignore the large distances and instead focus on modeling the K smallest ones for each data point this could also signiﬁcantly reduce the O(n2) computing and storage cost, via the sparse matrix computation. One possible model is to replace our Gamma distance density with a two-component mixture: one Gamma component concentrating near zero for modeling small distances, and one Uniform(0, maxi,j di,j) for handling large distances. Since the uniform density is a constant that does not depend on the speciﬁc value of the distance (as long it is in the support of the uniform), it effectively eliminates the inﬂuence of large distances in clustering.

Acknowledgement

This work was partially supported by grants R01-ES027498 and R01-MH118927 of the United States National Institutes of Health. The authors thank Amy Herring for helpful comments.

Afonso S Bandeira, Amit Singer, and Daniel A Spielman. A Cheeger Inequality for the Graph Connection Laplacian. SIAM Journal on Matrix Analysis and Applications, 34(4):1611 1630, 2013.

Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Bregman Divergences. Journal of Machine Learning Research, 6:1705 1749, 2005.

DUAN AND DUNSON

Laurent Berg e, Charles Bouveyron, and St ephane Girard. HDclassif: An R package for Modelbased Clustering and Discriminant Analysis of High-dimensional Data. Journal of Statistical Software, 46(6):1 29, 2012.

Charles Bouveyron and Camille Brunet-Saumard. Model-based Clustering of High-dimensional Data: A Review. Computational Statistics & Data Analysis, 71:52 78, 2014.

Lev M Bregman. The Relaxation Method of Finding the Common Point of Convex Sets and Its Application To the Solution of Problems in Convex Programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200 217, 1967.

Noirrit Kiran Chandra, Antonio Canale, and David B Dunson. Bayesian Clustering of High Dimensional Data. ar Xiv preprint ar Xiv:2006.02700, 2020.

Victor Chernozhukov and Han Hong. An MCMC Approach to Classical Estimation. Journal of Econometrics, 115(2):293 346, 2003.

Adam D Cobb and Brian Jalaian. Scaling Hamiltonian Monte Carlo Inference for Bayesian Neural Networks with Symmetric Splitting. ar Xiv preprint ar Xiv:2010.06772, 2020.

Pietro Coretto and Christian Hennig. Robust Improper Maximum Likelihood: Tuning, Computation, and a Comparison with Other Methods for Robust Gaussian Clustering. Journal of the American Statistical Association, 111(516):1648 1659, 2016.

David R Cox and Nancy Reid. A Note on Pseudolikelihood Constructed from Marginal Densities. Biometrika, 91(3):729 737, 2004.

Leo L Duan. Latent Simplex Position Model. Journal of Machine Learning Research, 21, 2020.

David B Dunson and Jack A Taylor. Approximate Bayesian Inference for Quantiles. Journal of Nonparametric Statistics, 17(3):385 400, 2005.

Martin Ester, Hans-Peter Kriegel, J org Sander, and Xiaowei Xu. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, volume 96, pages 226 231, 1996.

Chris Fraley and Adrian E Raftery. Model-based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association, 97(458):611 631, 2002.

Michael PB Gallaugher and Paul D Mc Nicholas. Finite Mixtures of Skewed Matrix Variate Distributions. Pattern Recognition, 80:83 93, 2018.

Christian Hennig, Marina Meila, Fionn Murtagh, and Roberto Rocci. Handbook of Cluster Analysis. CRC Press, 2015.

Peter D Hoff. Extending the Rank Likelihood for Semiparametric Copula Estimation. The Annals of Applied Statistics, 1(1):265 283, 2007.

BAYESIAN DISTANCE CLUSTERING

Matthew D Hoffman and Andrew Gelman. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593 1623, 2014.

Hesam Izakian, Witold Pedrycz, and Iqbal Jamal. Fuzzy Clustering of Time Series Data Using Dynamic Time Warping Distance. Engineering Applications of Artiﬁcial Intelligence, 39:235 244, 2015.

Pierre E Jacob, Lawrence M Murray, Chris C Holmes, and Christian P Robert. Better Together? Statistical Learning in Models Made of Modules. ar Xiv preprint ar Xiv:1708.08719, 2017.

Anil K Jain. Data Clustering: 50 Years Beyond K-means. Pattern Recognition Letters, 31(8): 651 666, 2010.

Harold Jeffreys. The Theory of Probability. OUP Oxford, 1961.

Valen E Johnson. Bayes Factors Based on Test Statistics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(5):689 701, 2005.

Miguel A Ju arez and Mark FJ Steel. Model-based Clustering of Non-gaussian Panel Data Based on Skew-t Distributions. Journal of Business and Economic Statistics, 28(1):52 66, 2010.

Dimitris Karlis and Anais Santourian. Model-based Clustering with Non-elliptically Contoured Distributions. Statistics and Computing, 19(1):73 83, 2009.

Da Kuang, Chris Ding, and Haesun Park. Symmetric Nonnegative Matrix Factorization for Graph Clustering. In Proceedings of the 2012 SIAM international conference on data mining, pages 106 117. SIAM, 2012.

Da Kuang, Sangwoon Yun, and Haesun Park. Sym NMF: Nonnegative Low-Rank Approximation of a Similarity Matrix for Graph Clustering. Journal of Global Optimization, 62(3):545 574, 2015.

Ed S Lein, Michael J Hawrylycz, Nancy Ao, Mikael Ayres, Amy Bensinger, Amy Bernard, Andrew F Boe, Mark S Boguski, Kevin S Brockway, and Emi J Byrnes. Genome-wide Atlas of Gene Expression in the Adult Mouse Brain. Nature, 445(7124):168, 2007.

Jia Li, Surajit Ray, and Bruce G Lindsay. A Nonparametric Statistical Approach to Clustering via Mode Identiﬁcation. Journal of Machine Learning Research, 8:1687 1723, 2007.

Bruce G Lindsay. Composite Likelihood Methods. Contemporary Mathematics, 80(1):221 239, 1988.

Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. International Conference on Learning Representations, 2017.

Gertraud Malsiner-Walli, Sylvia Fr uhwirth-Schnatter, and Bettina Gr un. Identifying Mixtures of Mixtures Using Bayesian Estimation. Journal of Computational and Graphical Statistics, 26(2): 285 295, 2017.

Paul D Mc Nicholas. Model-based Clustering. Journal of Classiﬁcation, 33(3):331 373, 2016.

DUAN AND DUNSON

Jeffrey W Miller and David B Dunson. Robust Bayesian Inference via Coarsening. Journal of the American Statistical Association, pages 1 13, 2018.

Radford M Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2:113 162, 2011.

Andrew Y Ng, Michael I Jordan, and Yair Weiss. On Spectral Clustering: Analysis and an Algorithm. In Advances in Neural Information Processing Systems, pages 849 856, 2002.

Adrian O Hagan, Thomas Brendan Murphy, Isobel Claire Gormley, Paul D Mc Nicholas, and Dimitris Karlis. Clustering With the Multivariate Normal Inverse Gaussian Distribution. Computational Statistics & Data Analysis, 93:18 30, 2016.

Shraddha Pandit and Suchita Gupta. A Comparative Study on Distance Measuring Approaches for Clustering. International Journal of Research in Computer Science, 2(1):29 31, 2011.

William M Rand. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66(336):846 850, 1971.

Carlos E Rodr ıguez and Stephen G Walker. Univariate Bayesian Nonparametric Mixture Modeling with Unimodal Kernels. Statistics and Computing, 24(1):35 49, 2014.

Judith Rousseau and Kerrie Mengersen. Asymptotic Behaviour of the Posterior Distribution in Overﬁtted Mixture Models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):689 710, 2011.

Peng Sun and Robert M Freund. Computation of Minimum-volume Covering Ellipsoids. Operations Research, 52(5):690 706, 2004.

Ren e Vidal. Subspace Clustering. IEEE Signal Processing Magazine, 28(2):52 68, 2011.

Sara Wade and Zoubin Ghahramani. Bayesian Cluster Analysis: Point Estimation and Credible Balls. Bayesian Analysis, 13(2):559 626, 2018.

Martin J Wainwright. High-dimensional Statistics: A Non-asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

Daniela M Witten and Robert Tibshirani. A Framework for Feature Selection in Clustering. Journal of the American Statistical Association, 105(490):713 726, 2010.

Dongkuan Xu and Yingjie Tian. A Comprehensive Survey of Clustering Algorithms. Annals of Data Science, 2(2):165 193, 2015.

Han Zhao, Pascal Poupart, Yongfeng Zhang, and Martin Lysy. Sof: Soft-Cluster Matrix Factorization for Probabilistic Clustering. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 29, 2015.

BAYESIAN DISTANCE CLUSTERING

Proof of Lemma 2

Proof We ﬁrst focus on x Gamma(α, 1), by the Markov s inequality

pr(x t) E exp(s X)

exp(st) = (1 s) αe ts,

where s < 1. Minimizing the right hand side over s yields s = 1 α/t, and

pr(x t) ( t

α)αe t+α = α αeαtαe t.

Scaling x by σh and adjusting the constant yield the results.

Proof of Lemma 4

Proof Equivalently, the sub-exponential tail can be characterized by the bound on its moment generating function

E exp{t(y[h] i,k µ[h] k )} exp(ν2 ht2/2) |t| 1/bh,

for k = 1, . . . , p. It immediately follows that the pairwise difference d[h] i,j = y[h] i y[h] j between two iid random variables must be sub-exponential as well, with

E exp(t d[h] i,j,k) exp(ν2 ht2) |t| 1/bh.

That is, sub-exponential (

2νh, bh). Then the vector norm

pr(d[h] i,j > pηt) = pr(

k=1 | d[h] i,j,k|q > pηqtq)

p pr(| d[h] i,j,k|q > pηq 1tq)

= p pr(| d[h] i,j,k| > pη 1/qt)

2p exp{ tp(η 1/q)/(2bh)} for t > p1/q η2ν2 h/bh.

where the ﬁrst inequality is due to pr(Pp i=1 ai > b) pr(There is at least one i: ai > b/p) Pp i=1 pr(ai > b/p) and second inequality uses Ed[h] i,j,k = 0 and the property of sub-exponential tail (Wainwright, 2019).

DUAN AND DUNSON

Proof of Lemma 7

Proof For a clear exposition, we omit the sub/super-script h for now and use xi = T(yi)

j=1 Bφ(xi, xj) =Eyi Eyj

j=1 {φ(xi) φ(xj) xi xj, φ(xj) }

i=1 {Eyiφ(xi) φ(µ) Eyixi µ, φ(µ)

+ φ(µ) φ(xj) Eyixi xj, φ(xj)

i=1 Eyi{φ(xi) φ(µ) xi µ, φ(µ) }

j=1 Eyj{φ(µ) φ(xj) µ xj, φ(xj) }

i=1 Ey{Bφ(xi, µ) + Bφ(µ, xi)},

where ., . denotes dot product, the second equality is due to Fubini theorem and Eyixi µ = 0.

Proof of Lemma 9

Proof Using 1n,m n m matrix with all elements equal 1. Since CTC = diag(n1, . . . , nk), the 2 times of normalized graph cut loss can be written as

tr A(1n,k C) CTC 1CT

= tr AC CTC 1CT + tr A1n,k CTC 1CT .

For the second term tr A1n,k CTC 1CT

=tr CTA1n,k CTC 1

Pn j=1 Ai,j

BAYESIAN DISTANCE CLUSTERING

Anatomical Structure Name Voxel Count Cortical plate 718 Striatum 332 Thalamus 295 Midbrain 229 Basic cell groups and regions 96 Pons 56 Vermal regions 22 Pallidum 14 Cortical subplate 6 Hemispheric regions 6 Cerebellum 5 Cerebral cortex 2

Table 4: Names and voxel counts in 12 macroscopic anatomical structures in the coronal section of the mouse brain. They represent the structural partitioning of the brain.

Index Voxel Count 1 626 2 373 3 176 4 113 5 79 6 39 7 12

Table 5: Group indices and voxel counts in 7 clusters found by Bayesian Distance Clustering, using the gene expression volume over the coronal section of the mouse brain. They represent the functional partitioning of the brain.

Numerical experiments showing the effect of discarding the seed

We show numerically that as nh increases, the seed conditional density Kh y[h] 1 | d[h] 2,1, . . . , d[h] nh,1

becomes very small in magnitude compared to the distance likelihood Gh d[h] 2,1, . . . , d[h] nh,1 . To compute those two terms, we use the Gaussian example presented in Section 2.1, simulated with σ2 h = 1, µh = 0. Since the values of those densities are very close to zero, we use the log-scale and compute the ratio, log Kh y[h] 1 | d[h] 2,1, . . . , d[h] nh,1

log Kh y[h] 1 | d[h] 2,1, . . . , d[h] nh,1 Gh d[h] 2,1, . . . , d[h] nh,1

as a measure of how small Kh(y[h] 1 | .) compared to Gh. We repeat each experiment for 30 times and create the box plots. As can be seen from Figure 6 (left panel), the ratio quickly drops to near zero after nh 10. In addition, we repeat the same experiment except using a multivariate Gaussian N(0, Ip); and the ﬁndings are very similar (right panel).

DUAN AND DUNSON

(b) p = 30.

Figure 6: Numerical experiments show that the seed conditional density Kh y[h] 1 | d[h] 2,1, . . . , d[h] nh,1

becomes much smaller than the distance likelihood as nh increases.

The effect of distances on the shapes of clusters

(a) Clustering using the Euclidean distance di,j = yi yj 2, each cluster has a spherical shape.

(b) Clustering using the weighted Euclidean distance di,j = p

(yi yj)TS(yi yj), with S = diag(s1, s2), each induced cluster has an elliptical shape.

Figure 7: Experiments show that the choice of distances may impact the shapes of the clusters and the uncertainty. Although, if one wants to separate clusters based on different shapes , a model on the cluster-speciﬁc covariances could be more useful than one on the pairwise distances.

BAYESIAN DISTANCE CLUSTERING

Connection with the nonnegative matrix factorization methods

Figure 8: Comparing the performance of the Bayesian distance clustering (BDC) and the nonnegative matrix factorization (NMF) using the symmetric simplex matrix factorization (Zhao et al., 2015; Duan, 2020). We apply the algorithms on the two clusters of skew Gaussian data as generated in Section 2.1, and plot the estimated co-assignment probability matrix for each method. As can be seen from Panels 2 and 3, when NMF is well calibrated in its similarity function (as we deﬁned in (13), with parameters set to the posterior mean estimate from BDC), it produces a result almost indistinguishable from BDC both are very close to the oracle co-assignment probability (Panel 1). On the other hand, NMF using an uncalibrated similarity exp( d2 i,j) produces a result (Panel 4) very different from the oracle.