# Compact Representation of Uncertainty in Clustering

Craig S. Greenberg¹,², Nicholas Monath¹, Ari Kobren¹, Patrick Flaherty³, Andrew McGregor¹, Andrew McCallum¹
¹College of Information and Computer Sciences, University of Massachusetts Amherst
²National Institute of Standards and Technology
³Department of Mathematics and Statistics, University of Massachusetts Amherst
{csgreenberg,nmonath,akobren,mcgregor,mccallum}@cs.umass.edu, flaherty@math.umass.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

For many classic structured prediction problems, probability distributions over the dependent variables can be efficiently computed using widely-known algorithms and data structures (such as forward-backward, and its corresponding trellis for exact probability distributions in Markov models). However, we know of no previous work studying efficient representations of exact distributions over clusterings. This paper presents definitions and proofs for a dynamic-programming inference procedure that computes the partition function, the marginal probability of a cluster, and the MAP clustering, all exactly. Rather than the Nth Bell number, these exact solutions take time and space proportional to the size of the substantially smaller powerset of N. Indeed, we improve upon the time complexity of the algorithm introduced by Kohonen and Corander [11] for this problem by a factor of N. While still large, this previously unknown result is intellectually interesting in its own right, makes feasible exact inference for important real-world small-data applications (such as medicine), and provides a natural stepping stone towards sparse-trellis approximations that enable further scalability (which we also explore). In experiments, we demonstrate the superiority of our approach over approximate methods in analyzing real-world gene expression data used in cancer treatment.

1 Introduction

Probabilistic models provide a rich framework for expressing and analyzing uncertain data because they provide a full joint probability distribution rather than an uncalibrated score or point estimate. There are many well-established, simple probabilistic models, for example Hidden Markov Models (HMMs) for modeling sequences. Inference in HMMs is performed using the forward-backward algorithm, which relies on an auxiliary data structure called a trellis (a graph-based dynamic programming table). This trellis structure serves as a compact representation of the distribution over state sequences. Many model structures compactly represent distributions and allow for efficient exact or approximate inference of joint and marginal distributions.

Clustering is a classic unsupervised learning task. Classic clustering algorithms, and even modern ones, however, only provide a point estimate of the best partitioning by some metric. In many applications, there are other partitions of the data that are nearly as good as the best one. Therefore, representing uncertainty in clustering can allow one to choose the most interpretable clustering from among a nearly equivalent set of options. We explore the benefits of representing uncertainty in clustering in a real-world gene expression analysis application in the experiments section.

Representing discrete distributions can be rather challenging, since the size of the support of the distribution can grow extremely rapidly.
In the case of HMMs, the number of sequences that need to be represented is exponential in the sequence length. Despite this, the forward-backward algorithm (i.e., belief propagation in a non-loopy graph) performs exact inference in time linear in the length of the sequence multiplied by the square of the size of the state space. In the case of clustering, the problem is far more difficult. The number of clusterings of N elements, known as the Nth Bell number [2], grows super-exponentially in the number of elements to be clustered. For example, there are more than a billion ways to cluster 15 elements. An exhaustive approach would require enumerating and scoring each clustering. We seek a more compact representation of distributions over clusterings.

In this paper, we present a dynamic programming inference procedure that exactly computes the partition function, the marginal probability of a cluster, and the MAP clustering. Crucially, our approach computes exact solutions in time and space proportional to the size of the powerset of N, which is substantially less than the Nth-Bell-number complexity of the exhaustive approach. While the size of the powerset of N is still large, this is a previously unknown result that bears intellectual interest on its own. It further acts as a stepping stone towards approximations enabling greater scalability and provides insight on small data sets, as shown in the experiments section.

The approach works by creating a directed acyclic graph (DAG), where each vertex represents an element of the powerset and there are edges between pairs of vertices that represent maximal subsets/minimal supersets of one another. We refer to this DAG as a cluster trellis. The dynamic programs can operate in either a top-down or bottom-up fashion on the cluster trellis, labeling vertices with local partition functions and maximum values. Likely splits and joins of clusters (see Appendix M), as well as marginals, can also be read from this structure. These programs work in any circumstance where the energy of a cluster can be computed. We prove that our programs return exact values and provide an analysis of their time and space complexity.

This paper also describes an approach to approximating the partition function, marginal probabilities, and MAP inference for clustering in reduced time and space. It works by performing exact computations on a sparsified version of the cluster trellis, where only a subset of the possible vertices are represented. This is analogous to using beam search [17] in HMMs. We prove that our programs return exact values for DAG-consistent partitions and that the time and space complexity are now measured in the size of the sparse cluster trellis. Proofs of all facts and theorems not given in the text can be found in the Appendix.

We develop our method in further detail in the context of correlation clustering [1]. In correlation clustering, the goal is to construct a clustering that maximizes the sum of cluster energies (minus the sum of the across-cluster energies), where a cluster energy can be computed from pairwise affinities among data points. We give a dynamic program that computes the energies of all possible clusters. Our approach proceeds in a bottom-up fashion with respect to the cluster trellis, annotating cluster energies at each step. This can all be found in the Appendix.
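To make the scale claims above concrete, the short script below compares the Nth Bell number (the number of clusterings, computed with the Bell triangle) against the 2^N - 1 clusters represented by the trellis and the 3^N cost of the dynamic programs introduced later. This is an illustrative aside, not part of the paper's algorithms.

```python
# Illustrative only: compare the number of clusterings (Bell numbers) with the
# number of clusters (2^N - 1, the trellis size) and the 3^N cost of the
# trellis dynamic programs described later in the paper.

def bell_numbers(n_max):
    """Return [B_0, ..., B_{n_max}] computed with the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells

if __name__ == "__main__":
    for n, b in enumerate(bell_numbers(15)):
        print(f"N={n:2d}  Bell={b:>13,d}  2^N={2**n:>7,d}  3^N={3**n:>10,d}")
    # For N = 15: Bell(15) = 1,382,958,545 (over a billion clusterings),
    # while the trellis has only 2^15 - 1 = 32,767 vertices.
```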
Previous work has examined the related problem of computing MAP k-clusterings exactly, including dynamic programming approaches [8, 9, 22], as well as using fast convolutions [11]. Our method has a smaller runtime complexity than using these approaches for computing the MAP clustering and partition function over all possible partitions (irrespective of k). Further, none of this related work discusses how to reduce complexity using approximation (as we do in Section 4), and it is unclear how their work might be extended for approximation. The most closely related work [10] models distributions over clusterings using Perturb-and-MAP [16]. Unlike the Perturb-and-MAP approach, our work focuses on exact inference in closed form.

Being able to compactly represent probability distributions over clusterings is a fundamental problem in managing uncertainty. This paper presents a dynamic programming approach to exact inference in clustering, reducing the time complexity of the problem from super-exponential in N to sub-quadratic in the size of the cluster trellis.

2 Uncertainty in Clustering

Clustering is the task of dividing a dataset into disjoint sets of elements. Formally,

Definition 1. (Clustering) Given a dataset of elements, $D = \{x_i\}_{i=1}^{N}$, a clustering is a set of subsets, $C = \{C_1, C_2, \ldots, C_K\}$, such that $C_i \subseteq D$, $\bigcup_{i=1}^{K} C_i = D$, and $C_i \cap C_j = \emptyset$ for all $C_i, C_j \in C$, $i \neq j$. Each element of $C$ is known as a cluster.

Figure 1: A cluster trellis, $T$, over a dataset $D = \{a, b, c, d\}$. Each node in the trellis represents a specific cluster, i.e., subset, of $D$ corresponding to its label. Solid lines indicate parent-child relationships. Note that a parent may have multiple children and a child may have multiple parents.

Our goal is to design data structures and algorithms for efficiently computing the probability distribution over all clusterings of $D$. We adopt an energy-based probability model for clustering, where the probability of a clustering is proportional to the product of the energies of the individual clusters making up the clustering. The primary assumption in energy-based clustering is that clustering energy is decomposable as the product of cluster energies. While it is intuitive that the probability of elements being clustered together would be independent of the clustering of elements disjoint from the cluster, one could conceive of distributions that violate that assumption. An additional assumption is that exponentiating pairwise scores preserves item similarity. This is the Gibbs distribution, which has been found useful in practice [6].

Definition 2. (Energy-Based Clustering) Let $D$ be a dataset, $C$ be a clustering of $D$, and $E_D(C)$ be the energy of $C$. Then the probability of $C$ with respect to $D$, $P_D(C)$, is equal to the energy of $C$ normalized by the partition function, $Z(D)$. This gives us $P_D(C) = \frac{E_D(C)}{Z(D)}$ and $Z(D) = \sum_{C' \in \mathcal{C}_D} E_D(C')$. The energy of $C$ is defined as the product of the energies of its clusters: $E_D(C) = \prod_{C_i \in C} E_D(C_i)$.

We use $\mathcal{C}_D$ to refer to all clusterings of $D$. In general, we assume that $D$ is fixed and so we omit subscripts to simplify notation. Departing from convention [12], clusterings with higher energy are preferred to those with lower energy.

Note that computing the membership probability of any element $x_i$ in any cluster $C_j$, as is done in mixture models, is ill-suited for our goal. In particular, this computation assumes a fixed clustering, whereas our work focuses on computations performed with respect to the distribution over all possible clusterings.
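To make Definition 2 concrete, the following minimal sketch enumerates every clustering of a tiny dataset and normalizes the product of cluster energies by the partition function. The particular cluster energy used here (the exponentiated sum of within-cluster pairwise affinities, in the spirit of the correlation clustering setting mentioned in the introduction) and the symmetric nested-dict `affinity` are illustrative assumptions; any positive cluster energy would work. This brute-force enumeration is exactly the Bell-number-sized computation that the trellis of Section 3 avoids.

```python
# Brute-force illustration of Definition 2 for a tiny dataset: enumerate every
# clustering of D (Bell(N) of them), score each as the product of its cluster
# energies, and normalize by the partition function Z(D).
import itertools
import math

def all_clusterings(elements):
    """Yield every set partition of `elements` (Bell(N) of them)."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for partial in all_clusterings(rest):
        # put `first` in its own cluster
        yield [[first]] + partial
        # or join `first` to one of the existing clusters
        for k in range(len(partial)):
            yield partial[:k] + [[first] + partial[k]] + partial[k + 1:]

def cluster_energy(cluster, affinity):
    """E(cluster) = exp(sum of pairwise affinities inside the cluster).
    `affinity` is assumed to be a symmetric nested dict {x: {y: score}}."""
    return math.exp(sum(affinity[a][b] for a, b in itertools.combinations(cluster, 2)))

def clustering_energy(clustering, affinity):
    """E(C) = product of the energies of the clusters in C (Definition 2)."""
    return math.prod(cluster_energy(c, affinity) for c in clustering)

def clustering_distribution(elements, affinity):
    """Return {clustering: probability}, normalized by the partition function Z(D)."""
    energies = {tuple(map(tuple, c)): clustering_energy(c, affinity)
                for c in all_clusterings(list(elements))}
    z = sum(energies.values())
    return {c: e / z for c, e in energies.items()}
```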
3 The Cluster Trellis

Recall that our goal is to compute a distribution over the valid clusterings of an instance of energy-based clustering as efficiently as possible. Given a dataset $D$, a naïve first step in computing such a distribution is to iterate through its unique clusterings and, for each, compute its energy and add it to a running sum. If the number of elements is $|D| = N$, the number of unique clusterings is the Nth Bell number, which is super-exponential in $N$ [14].

Note that a cluster $C$ may appear in many clusterings of $D$. For example, consider the dataset $D_0 = \{a, b, c, d\}$. The cluster $\{a, b\}$ appears in 2 of the clusterings of $D_0$. More precisely, in a dataset comprising $N$ elements, a cluster of $M$ elements appears in the $(N-M)$th Bell number of its clusterings. This allows us to make use of memoization to compute the distribution over clusterings more efficiently, in a procedure akin to variable elimination in graphical models [4, 25]. Unlike variable elimination, our procedure is agnostic to the ordering of the elimination. To support the exploitation of this memoization approach, we introduce an auxiliary data structure we call a cluster trellis.

Definition 3. (Cluster Trellis) A cluster trellis, $T$, over a dataset $D$ is a graph, $(V(T), E(T))$, whose vertices represent all valid clusters of elements of $D$. The edges of the graph connect a pair of vertices if one (the child node) is a maximal strict subset of the other (the parent node).

In this paper, we refer to a cluster trellis simply as a trellis. In more detail, each trellis vertex, $v \in V(T)$, represents a unique cluster of elements; the vertices in $T$ map one-to-one with the nonempty members of the powerset of the elements of $D$. We define $D(v)$ to be the elements in the cluster represented by $v$. There exists an edge from $v'$ to $v$ if $D(v) \subset D(v')$ and $D(v') = D(v) \cup \{x_i\}$ for some element $x_i \in D$ (or vice versa). See Figure 1 for a visual representation of a trellis over 4 elements. Each vertex stores the energy of its associated cluster, $E(D(v))$, which can be queried in constant time. We borrow terminology from trees and say vertex $v'$ is a parent of vertex $v$ if there is an edge from $v'$ to $v$, and that vertex $v''$ is an ancestor of $v$ if there is a directed path from $v''$ to $v$.

3.1 Computing the Partition Function

Computing a distribution over an event space requires computing a partition function, or normalizing constant. We present an algorithm for computing the partition function, $Z(D)$, with respect to all possible clusterings of the elements of $D$. Our algorithm uses the trellis and a particular memoization scheme to significantly reduce the computation required: from super-exponential to exponential.

The full partition function, $Z(D)$, can be expressed in terms of cluster energies and the partition functions of a specific set of subtrellises. A subtrellis rooted at $v$, denoted $T[v]$, contains all nodes in $T$ that are descendants of $v$. Formally, a subtrellis $T[v] = (V(T[v]), E(T[v]))$ has vertices and edges satisfying the following properties: (1) $V(T[v]) = \{u \mid u \in V(T) \wedge D(u) \subseteq D(v)\}$, and (2) $E(T[v]) = \{(u, u') \mid (u, u') \in E(T) \wedge u, u' \in V(T[v])\}$. Note that $T[v]$ is always a valid trellis. The following procedure not only computes $Z(D)$, but also generalizes in a way that the partition function with respect to clusterings of any subset $D(v) \subseteq D$ can also be computed. We refer to the partition function for a dataset $D(v)$ memoized at the trellis/subtrellis $T[D(v)]$ as the partition function for the trellis/subtrellis, $Z(T[D(v)])$.
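Before turning to Algorithm 1, the sketch below shows one possible in-memory representation of the trellis, encoding each vertex (a nonempty subset of $D$) as an $N$-bit mask. The bitmask encoding and the class interface are implementation choices of this sketch, not something the paper prescribes; any one-to-one labeling of the nonempty subsets with constant-time energy lookup would do.

```python
# A sketch of one possible in-memory representation of the cluster trellis,
# encoding each vertex (a nonempty subset of D) as an N-bit mask.

class ClusterTrellis:
    def __init__(self, elements, cluster_energy):
        """`elements` is the dataset D; `cluster_energy(frozenset)` returns E(cluster)."""
        self.elements = list(elements)
        self.n = len(self.elements)
        self.full = (1 << self.n) - 1          # bitmask of the full dataset D
        # Each vertex stores the energy of its cluster so it can be read in O(1).
        self.energy = {m: cluster_energy(self.subset(m))
                       for m in range(1, self.full + 1)}

    def subset(self, mask):
        """D(v): the elements encoded by `mask`."""
        return frozenset(x for i, x in enumerate(self.elements) if mask >> i & 1)

    def children(self, mask):
        """Masks of maximal strict subsets of D(v): drop one element at a time."""
        return [mask & ~(1 << i) for i in range(self.n)
                if mask >> i & 1 and mask & ~(1 << i)]

    def parents(self, mask):
        """Masks of minimal strict supersets of D(v): add one element at a time."""
        return [mask | (1 << i) for i in range(self.n) if not mask >> i & 1]
```

With this encoding, the set operations used by the dynamic programs below (removing a cluster from $D(v)$, testing whether $x_i \in D(v)$) become single bit operations.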
Algorithm 1 PartitionFunction($T$, $D$)
- Pick $x_i \in D$
- $Z(D) \leftarrow 0$
- for each $v \in V(T)^{(i)}$:
  - let $v'$ be such that $D(v') = D \setminus D(v)$
  - if $Z(D(v'))$ has not been assigned: $Z(D(v')) \leftarrow$ PartitionFunction($T[v']$, $D(v')$)
  - $Z(D) \leftarrow Z(D) + E(D(v)) \cdot Z(D(v'))$
- return $Z(D)$

Define $V(T)^{(i)} = \{v \mid v \in V(T) \wedge x_i \in D(v)\}$ and $V(T)^{(\setminus i)} = V(T) \setminus V(T)^{(i)}$. In other words, $V(T)^{(i)}$ is the set of all vertices in the trellis whose clusters contain the element $x_i$, and $V(T)^{(\setminus i)}$ is the set of all vertices whose clusters do not contain $x_i$.

Fact 1. Let $v \in V(T)$ and $x_i \in D(v)$. The partition function with respect to $D(v)$ can be written recursively, with $Z(D(v)) = \sum_{v_i \in V(T[v])^{(i)}} E(v_i)\, Z(D(v) \setminus D(v_i))$ and $Z(\emptyset) = 1$.

Proof. The partition function $Z(D(v))$ is defined as $Z(D(v)) = \sum_{C \in \mathcal{C}_{D(v)}} E(C)$. For a given element $x_i$ in $D(v)$, the set of all clusterings of $D(v)$ can be re-written to factor out the cluster containing $x_i$ in each clustering: $\mathcal{C}_{D(v)} = \{\{D(v_i)\} \cup C \mid v_i \in V(T[v])^{(i)},\ C \in \mathcal{C}_{D(v) \setminus D(v_i)}\}$. Note that $\mathcal{C}_{D(v) \setminus D(v_i)}$ refers to all clusterings of the elements $D(v) \setminus D(v_i)$. Using this expansion, and since $E(\{D(v_i)\} \cup C) = E(v_i)\,E(C)$, we can rewrite the partition function by performing algebraic re-arrangements and applying our definitions:
$Z(D(v)) = \sum_{v_i \in V(T[v])^{(i)}} \sum_{C \in \mathcal{C}_{D(v) \setminus D(v_i)}} E(v_i)\, E(C) = \sum_{v_i \in V(T[v])^{(i)}} E(v_i) \sum_{C \in \mathcal{C}_{D(v) \setminus D(v_i)}} E(C) = \sum_{v_i \in V(T[v])^{(i)}} E(v_i)\, Z(D(v) \setminus D(v_i)).$

As a result of Fact 1, we are able to construct a dynamic program for computing the partition function of a trellis as follows: (1) select an arbitrary element $x_i$ from the dataset; (2) construct $V(T)^{(i)}$ as defined above; (3) for each vertex $v_i \in V(T)^{(i)}$, compute and memoize the partition function of $D \setminus D(v_i)$ if it is not already cached; (4) sum over these vertices the products $E(v_i)\, Z(D \setminus D(v_i))$. The pseudocode for this dynamic program appears in Algorithm 1.

We use Algorithm 1 and Fact 1 to analyze the time and space complexity of computing the partition function. Consider a trellis $T$ over a dataset $D = \{x_i\}_{i=1}^{N}$. Our goal is to compute the partition function, $Z(T)$. When the partition functions of all subtrellises of $T$ have already been computed, Algorithm 1 is able to run without recursion.

Fact 2. Let $T$ be a trellis such that the partition function corresponding to each of its subtrellises is memoized and accessible in constant time. Then $Z(T)$ can be computed by summing exactly $2^{N-1}$ terms.

Given that the partition function of every strict subtrellis of $T$ (i.e., any subtrellis of $T$ that is not equivalent to $T$) has been memoized and is accessible in constant time, $Z(T)$ is computed by taking the sum of exactly $2^{N-1}$ terms, one for each vertex in $V(T)^{(i)}$.

We now consider the more general case, where the partition functions of all subtrellises of $T$ have not yet been computed:

Theorem 1. Let $T$ be a trellis over $D = \{x_i\}_{i=1}^{N}$. Then $Z(T)$ can be computed in $O(3^{N-1}) = O(|V(T)|^{\log_2 3})$ time.

A proof of Theorem 1 can be found in the Appendix in Section E.
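For concreteness, here is a minimal Python sketch of Algorithm 1 over the bitmask trellis sketched above. It follows the recursion in Fact 1: fix an element $x_i$ of the remaining set, enumerate the clusters that contain it, and memoize the partition function of each complement. The function name and the use of `functools.lru_cache` are choices of this sketch, not the paper's.

```python
# A sketch of Algorithm 1 (and of Fact 1) on the bitmask trellis sketched
# above: Z(S) = sum over clusters v ⊆ S containing a fixed element x_i of
# E(v) * Z(S \ v), with Z(∅) = 1. Each (cluster, complement) pair of masks is
# visited once, matching the O(3^(N-1)) count in Theorem 1. `trellis` is
# assumed to be a ClusterTrellis as sketched earlier.
import functools

def partition_function(trellis, mask=None):
    if mask is None:
        mask = trellis.full

    @functools.lru_cache(maxsize=None)
    def z(s):
        if s == 0:
            return 1.0                       # Z(∅) = 1
        pivot = s & -s                       # fix x_i: the lowest set bit of s
        rest = s ^ pivot                     # elements of s other than x_i
        total = 0.0
        # Enumerate every cluster v ⊆ s with x_i ∈ v: v = pivot ∪ u, u ⊆ rest.
        u = rest
        while True:
            v = pivot | u
            total += trellis.energy[v] * z(s ^ v)   # E(v) * Z(s \ v)
            if u == 0:
                break
            u = (u - 1) & rest
        return total

    return z(mask)
```

For example, `partition_function(ClusterTrellis("abcd", energy))` reproduces $Z(D)$ for the four-element dataset of Figure 1, assuming `energy` maps each frozenset of elements to a positive value.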
3.2 Finding the Maximal Energy Clustering

By making a minor alteration to Algorithm 1, we are also able to compute the value of, and find, the clustering with maximal energy. Specifically, at each vertex in the trellis, $v$, store the clustering of $D(v)$ with maximal energy (and its associated energy). We begin by showing that there exists a recursive form of the max-partition calculation analogous to the computation of the partition function in Fact 1.

Definition 4. (Maximal Clustering) Let $v \in V(T)$ and $x_i \in D(v)$. The maximal clustering over the elements of $D(v)$, $C^*(D(v))$, is defined as: $C^*(D(v)) = \operatorname{argmax}_{C \in \mathcal{C}_{D(v)}} E(C)$.

Fact 3. $C^*(D(v))$ can be written recursively as $C^*(D(v)) = \operatorname{argmax}_{v' \in V(T[v])^{(i)}} E(v')\, E(C^*(D(v) \setminus D(v')))$.

In other words, the maximal energy over the set of elements $D(v)$ can be written as the energy of the cluster containing $x_i$ in that clustering multiplied by the maximal energy over the remaining elements. Using this recursive definition, we modify Algorithm 1 to compute the maximal clustering instead of the partition function, resulting in Algorithm 2 (in the Appendix). The correctness of this algorithm is demonstrated by Fact 3. We can now analyze the time complexity of the algorithm. We use similar memoized notation for the algorithm, where $C^*(T[D(v)])$ is the memoized value for $C^*(D(v))$ stored at $v$.

Fact 4. Let $T_D$ be a trellis over $D = \{x_i\}_{i=1}^{N}$. Then $C^*(T_D)$ can be computed in $O(3^{N-1})$ time.

3.3 Computing Marginals

The trellis facilitates the computation of two types of cluster marginals. First, the trellis can be used to compute the probability of a specific cluster, $D(v)$, with respect to the distribution over all possible clusterings; and second, it can be used to compute the probability that any group of elements, $X$, are clustered together. We begin by analyzing the first style of marginal computation, as it is used in computing the second.

Let $\mathcal{C}(v) \subseteq \mathcal{C}_D$ be the set of clusterings that contain the cluster $D(v)$. Then the marginal probability of $D(v)$ is given by $P(D(v)) = \frac{\sum_{C \in \mathcal{C}(v)} E(C)}{Z(D)}$, where $Z(D)$ is the partition function with respect to the full trellis described in Section 2. This probability can be re-written in terms of the complement of $D(v)$, i.e.,
$P(D(v)) = \frac{\sum_{C \in \mathcal{C}(v)} E(C)}{Z(D)} = \frac{\sum_{C \in \mathcal{C}(v)} E(D(v))\, E(C \setminus \{D(v)\})}{Z(D)} = \frac{E(D(v)) \sum_{C' \in \mathcal{C}_{D \setminus D(v)}} E(C')}{Z(D)} = \frac{E(D(v))\, Z(D \setminus D(v))}{Z(D)}.$

Note that if $Z(D \setminus D(v))$ were memoized during Algorithm 1, then computing the marginal probability requires only constant time, using space equal to the size of the trellis. This is only true for clusters whose complements do not contain the element $x_i$ (selected to compute $Z(D)$ in Algorithm 1), i.e., for the $2^{N-1}$ of the $2^N - 1$ vertices in the trellis whose clusters contain $x_i$. Otherwise, we may need to repeat the calculation from Algorithm 1 to compute $Z(D \setminus D(v))$. We note that, due to memoization, the complexity of computing the partition function of the remaining vertices is no greater than the complexity of Algorithm 1.

This machinery makes it possible to compute the second style of marginal. Given a set of elements, $X$, the marginal probability of the elements of $X$ being clustered together is: $P(X) = \sum_{v \in V(T)\,:\,X \subseteq D(v)} P(D(v))$. The probability that the elements of $X$ are clustered together is distinct from the marginal probability of a single cluster in that $P(X)$ sums the marginal probabilities of all clusters that contain every element of $X$. Once the marginal probability of each cluster is computed, the marginal probability of any set of elements being clustered together can be computed in time and space linear in the size of the trellis.
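The following sketch implements the max-energy recursion of Fact 3 and the marginal computations above on the bitmask trellis, reusing `ClusterTrellis` and `partition_function` from the earlier sketches. The function names are illustrative choices; Algorithm 2 itself is given in the paper's Appendix.

```python
# Sketches of the max-energy clustering (Fact 3) and of the marginals in
# Section 3.3: P(D(v)) = E(D(v)) * Z(D \ D(v)) / Z(D).
import functools

def map_clustering(trellis):
    """Return (E(C*), C*) for the full dataset, with C* a list of clusters."""
    @functools.lru_cache(maxsize=None)
    def best(s):
        if s == 0:
            return 1.0, ()
        pivot = s & -s
        rest = s ^ pivot
        best_energy, best_clusters = -float("inf"), ()
        u = rest
        while True:
            v = pivot | u                      # a cluster containing x_i
            sub_energy, sub_clusters = best(s ^ v)
            energy = trellis.energy[v] * sub_energy
            if energy > best_energy:
                best_energy, best_clusters = energy, (v,) + sub_clusters
            if u == 0:
                break
            u = (u - 1) & rest
        return best_energy, best_clusters

    energy, clusters = best(trellis.full)
    return energy, [trellis.subset(m) for m in clusters]

def cluster_marginal(trellis, cluster_mask, z_full=None):
    """P(D(v)): probability that the cluster D(v) appears in the clustering."""
    if z_full is None:
        z_full = partition_function(trellis)
    complement = trellis.full ^ cluster_mask
    # (For brevity this recomputes Z(D \ D(v)) per cluster; a real
    # implementation would share one memo table, as discussed above.)
    return trellis.energy[cluster_mask] * partition_function(trellis, complement) / z_full

def together_marginal(trellis, x_mask, z_full=None):
    """P(X): probability that all elements in `x_mask` end up in the same cluster."""
    if z_full is None:
        z_full = partition_function(trellis)
    return sum(cluster_marginal(trellis, m, z_full)
               for m in range(1, trellis.full + 1) if m & x_mask == x_mask)
```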
4 The Sparse Trellis

The time to compute the partition function scales sub-quadratically with the size of the trellis (Theorem 1). Unfortunately, the size of the trellis scales exponentially with the size of the dataset, which limits the use of the trellis in practice. In this section, we show how to approximate the partition function and maximal partition using a sparse trellis, which is a trellis with some nodes omitted. Increasing the sparsity of a trellis enables the computation of approximate clustering distributions for larger datasets.

Definition 5. (Sparse Trellis) Given a trellis $T = (V(T), E(T))$, define a sparse trellis with respect to $T$ to be any $\hat{T} = (\hat{V}, \hat{E})$ satisfying the following properties: $\hat{V} \neq \emptyset$, $\hat{V} \subseteq V(T)$, and $\hat{E} = \{(v, v') \mid D(v') \subset D(v) \wedge \nexists u \in \hat{V} : D(v') \subset D(u) \subset D(v)\}$.

Note that there exist sparse trellises that contain no valid clusterings. As an example, consider $\hat{T} = (\hat{V} = \{v_1, v_2, v_3\}, \hat{E} = \emptyset)$ where $D(v_1) = \{a, b\}$, $D(v_2) = \{b, c\}$, and $D(v_3) = \{a, c\}$.

For ease of analysis, we focus on a specific family of sparse trellises, namely those which are closed under recursive complement¹. This property ensures that the trellises contain only valid partitions. For trellises in this family we show that the partition function and the clustering with maximal energy can be computed using algorithms similar to those described in Section 3. Since these algorithms have complexity measured in the number of nodes in the trellis, their efficiency improves with trellis sparsity. We also present the family of tree-structured sparse trellises, with tree-specific partition function and max-partition algorithms. The more general family of all sparse trellises is also discussed briefly.

The key challenge in analyzing a sparse trellis, $\hat{T}$, is how to treat any cluster $C$ that is not represented by a vertex of $\hat{T}$, i.e., $C = D(v)$ for some $v \in V(T) \setminus \hat{V}$. Although there are several feasible approaches to reasoning about such clusters, in this paper we assume that any cluster that is not represented by a vertex in $\hat{T}$ has zero energy. Since the energy of a clustering, $C$, is the product of its clusters' energies (Definition 2), $E(C) = 0$ if it contains one or more clusters with zero energy.

¹A set of sets, $S$, is closed under recursive complement iff $\forall x, y \in S$, $x \subset y \implies \exists z \in S : x \cup z = y \wedge x \cap z = \emptyset$.

4.1 Approximating the Partition Function and Max Partition

Given a sparse trellis, $\hat{T}$, we are able to compute the partition function by using Algorithm 1.

Fact 5. Let $\hat{T} = (\hat{V}, \hat{E})$ be a sparse trellis whose vertices are closed under recursive complement. Then Algorithm 1 computes $Z(\hat{T})$ in $O(|\hat{T}|^{\log_2 3})$ time.

If $\hat{T}$ is not closed under recursive complement, we cannot simply run Algorithm 1, because not all vertices for which the algorithm must compute energy (or the partition function) are guaranteed to exist. How to compute the partition function using such a trellis is an area of future study.

Given a sparse trellis, $\hat{T}$, closed under recursive complement, we are also able to compute the max partition by using Algorithm 2. Doing so takes $O(|\hat{T}|^{\log_2 3})$ time and $O(|\hat{T}|)$ space. The correctness and complexity analysis is the same as for the partition function above.

The often-used hierarchical (tree-structured) clustering encompasses one family of sparse trellises. Algorithms for tree-structured trellises can be found in the Appendix in Section J.
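A minimal sketch of this approximation, under the stated zero-energy assumption: keep only a chosen set of cluster masks, verify closure under recursive complement (footnote 1), and run the same recursion while restricting the loop to retained vertices. How the retained vertex set is chosen (e.g., from a hierarchical clustering) and the helper names are assumptions of this sketch, not the paper's prescription.

```python
# Sketch of the sparse-trellis approximation: clusters absent from the sparse
# trellis are treated as having zero energy, so only retained vertices can
# contribute to the sum in the recursion of Fact 1.
import functools

def closed_under_recursive_complement(vertex_masks):
    """Footnote 1: for x ⊂ y in the set, y \\ x must also be in the set."""
    masks = set(vertex_masks)
    return all((y ^ x) in masks
               for x in masks for y in masks
               if x != y and x & y == x)      # x is a strict subset of y

def sparse_partition_function(sparse_energy, full_mask):
    """Z over the sparse trellis; `sparse_energy` maps retained masks to E(cluster)."""
    vertices = list(sparse_energy)

    @functools.lru_cache(maxsize=None)
    def z(s):
        if s == 0:
            return 1.0
        pivot = s & -s                        # fix x_i within the remaining elements
        total = 0.0
        # Only retained clusters can contribute; every other cluster containing
        # x_i is treated as having zero energy.
        for v in vertices:
            if v & pivot and v & s == v:      # x_i ∈ D(v) and D(v) ⊆ s
                total += sparse_energy[v] * z(s ^ v)
        return total

    return z(full_mask)
```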
5 Experiments

In this section, we demonstrate the utility of the cluster trellis via experiments on real-world gene expression data. To begin, we provide a high-level background on cancer subtypes to motivate the use of our method in the experiment in Section 5.3.

5.1 Background

For an oncologist, determining a prognosis and constructing a treatment plan for a patient depends on the subtype of that patient's cancer [13]. This is because different subtypes respond well to different treatments; for example, one subtype may respond to radiation and not chemotherapy, while for other subtypes the reverse is true [20]. For example, basal and erbB2+ subtypes of breast cancer are more sensitive to paclitaxel- and doxorubicin-containing preoperative chemotherapy (approx. 45% pathologic complete response) than the luminal and normal-like cancers (approx. 6% pathologic complete response) [18]. Unfortunately, identifying cancer subtypes is often non-trivial. One common method of learning about a patient's cancer subtype is to cluster their gene expression data along with other available expression data for which previous treatments and treatment outcomes are known [21].

5.2 Data & Methods

We use breast cancer transcriptome profiling (FPKM-UQ) data from The Cancer Genome Atlas (TCGA) because much is known about the gene expression patterns of this cancer type, yet there is heterogeneity in the clinical response of patients who are classified into the same subtype by standard approaches [23]. The data are subselected for African American women with Stage I breast cancer. We select African American women because there is a higher prevalence of the basal-like subtype among premenopausal African American women [15] and there is some evidence that there is heterogeneity (multiple clusters) even within this subtype [23]. Stage I breast cancer patients were selected because of the prognostic value in distinguishing aggressive subtypes from non-aggressive subtypes at an early stage. Despite the considerable size of TCGA, there are only 11 samples meeting these basic, necessary inclusion/exclusion criteria.

Each of the 11 samples is a 20,000-dimensional feature vector, where each dimension is a measure of how strongly a given gene is expressed. We begin by sub-selecting the 3000 features with the greatest variance across the samples. We then add an infinitesimal value prior to taking the log of the remaining features, since gene expression data is believed to be normally distributed in log-space [19]. A similar data processing was shown to be effective in prior work [19].

We use correlation clustering as the energy model. Pairwise similarities are exponentiated negative Euclidean distances. We subtract from each the mean pairwise similarity so that similarities are both positive and negative. We then compute the marginal probabilities for each pair (i.e., the probability that the two samples appear in the same cluster). See Section 3.3 for how to compute these values using the trellis.

Figure 2: For each pair of patients with Stage I cancer, we plot the energy and marginal probability of the pair being in the same cluster, as described in Section 5.3.

Figure 3: The approximate vs. exact pairwise marginals for each pair of gene expressions. Approximate marginals are computed using a Perturb-and-MAP based method [10].

Figure 4: Heatmap of the pairwise energies between the patients. The pair 74ca and d6fa has an energy of -4.7, 74ca and 62da have 91.09, and d6fa and 62da have 44.5.

Figure 5: Heatmap of the marginal probability that a pair will be clustered together. Patients 74ca and d6fa have a pairwise marginal that is nearly one, despite having a low pairwise energy.
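A sketch of this preprocessing and energy model follows. The array shape, the small log offset, and the stand-in cluster energy (exponentiated sum of within-cluster affinities, matching the earlier sketches) are assumptions of this sketch; the exact correlation clustering energy is specified in the paper's Appendix.

```python
# Sketch of the preprocessing and pairwise-affinity construction described
# above. `expression` is assumed to be an (11 x 20000) NumPy array of FPKM-UQ
# values, one row per patient sample.
import itertools
import numpy as np

def preprocess(expression, n_features=3000, eps=1e-10):
    """Keep the highest-variance features and move to log-space."""
    top = np.argsort(expression.var(axis=0))[-n_features:]
    return np.log(expression[:, top] + eps)

def pairwise_affinities(samples):
    """Exponentiated negative Euclidean distances, centered by their mean."""
    diffs = samples[:, None, :] - samples[None, :, :]
    sims = np.exp(-np.linalg.norm(diffs, axis=-1))
    off_diag = ~np.eye(len(samples), dtype=bool)
    return sims - sims[off_diag].mean()

def cluster_energy_fn(affinities, elements):
    """Return a stand-in E(cluster) usable by the ClusterTrellis sketch above."""
    index = {x: i for i, x in enumerate(elements)}
    def energy(cluster):
        pairs = itertools.combinations(sorted(cluster, key=index.get), 2)
        # (A real implementation would work in log-space to avoid overflow.)
        return float(np.exp(sum(affinities[index[a], index[b]] for a, b in pairs)))
    return energy
```

The pairwise marginals discussed above can then be obtained by applying the `together_marginal` sketch from Section 3.3 to each pair of samples.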
5.3 Model Evaluation using Marginals

One method for evaluating a set of cancer subtype clustering models is to identify pairs of samples that the evaluator believes should be clustered together and to inspect their pairwise energies. However, high pairwise energies do not necessarily mean the points will be clustered together by the model (which considers how the pair's cluster assignment impacts the rest of the data). Similarly, a low pairwise energy does not necessarily mean the two samples will not be clustered together. The pairwise marginal, on the other hand, exactly captures the probability that the model will place the two samples in the same cluster. We test whether the corresponding unnormalized pairwise energies, or a simple approximation of the marginals, could reasonably be used as a proxy for exact pairwise marginals.

5.3.1 Pairwise Energies vs. Marginals & Exact vs. Approximate Marginals

Figure 2 plots the pairwise log energies vs. pairwise log marginals of the sub-sampled TCGA data². The pairwise scores and marginals are not strongly correlated, which suggests that unnormalized pairwise energies cannot reasonably be used as a proxy for pairwise marginals. For example, the sample pair of patients (partial ID numbers given) 74ca and d6fa has an energy of -4.7 (low), but a pairwise marginal that is nearly one. This is because both 74ca and d6fa have high energy with sample 62da, with pairwise energies 91.09 (the fourth largest) and 44.5, respectively. Figures 4 and 5 visualize the pairwise energies and pairwise marginals, respectively.

²The MAP clustering of the dataset is in the Appendix in Section P.

We also explore the extent to which an approximate method can accurately capture pairwise marginals. We use an approach similar to Perturb-and-MAP [10]. We sample clusterings by adding Gumbel-distributed noise to the pairwise energies and using Algorithm 2 to find the maximal clustering with the modified energies. We approximate the marginal probability of a given pair being clustered together by measuring how many of these sampled clusterings contain the pair in the same cluster. Figure 3 plots the approximate vs. exact pairwise marginal for each pair of points in the dataset. The figure shows that the approximate method overestimates many of the pairwise marginals. As with using the raw pairwise scores in place of exact marginals, using the approximate marginals in practice may lead to errors in data analysis.
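The sketch below shows one way the Perturb-and-MAP-style baseline compared against above might be implemented: Gumbel noise is added to the pairwise affinities, the MAP clustering is recomputed under each perturbed energy (using the `map_clustering`, `ClusterTrellis`, and `cluster_energy_fn` sketches from earlier), and pairwise co-clustering frequencies serve as approximate marginals. The noise scale and sample count are illustrative choices, not values from the paper.

```python
# Sketch of a Perturb-and-MAP-style approximation of pairwise marginals.
import numpy as np

def approximate_pairwise_marginals(affinities, elements, n_samples=1000,
                                   scale=1.0, seed=0):
    """Return an approximate co-clustering probability matrix."""
    rng = np.random.default_rng(seed)
    n = len(elements)
    counts = np.zeros((n, n))
    index = {x: i for i, x in enumerate(elements)}
    for _ in range(n_samples):
        noise = rng.gumbel(scale=scale, size=(n, n))
        noise = (noise + noise.T) / 2.0        # keep the perturbed affinities symmetric
        perturbed = affinities + noise
        trellis = ClusterTrellis(elements, cluster_energy_fn(perturbed, elements))
        _, clusters = map_clustering(trellis)  # MAP under the perturbed energies
        for cluster in clusters:
            for a in cluster:
                for b in cluster:
                    counts[index[a], index[b]] += 1
    return counts / n_samples
```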
6 Related Work

While there is, to the best of our knowledge, no prior work on compact representations for exactly computing distributions over clusterings, there is a small amount of related work on computing the MAP k-clustering exactly, as well as a wide array of related work on approximate methods, graphical models, probabilistic models for clustering, and clustering methods.

The first dynamic programming approach to computing the MAP k-clustering was given in [9], which focuses on minimizing the sum of squared distances within clusters. It works by considering the distributional form of the clusterings, i.e., all possible sizes of the clusters that comprise the clustering, and defines arcs between them. However, no full specification of the dynamic program is given and, as the author notes, many redundant computations are required, since there are many clusterings that share the same distributional form. In [8], the first implementation is given, with some of the redundancies removed, and the implementation and amount of redundancy are further improved upon in [22]. In each of these cases, the focus is on finding the best k-clustering, which can be done using these methods in $O(3^n)$ time. These methods can also be used to find the MAP clustering over all k; however, doing so would result in $O(n \cdot 3^n)$ time, which is worse than our $O(3^n)$ result.

In [11], the authors use fast convolutions to compute the MAP k-clustering and k-partition function. Fast convolutions use a Möbius transform and Möbius inversion on the subset lattice to compute the convolution in $\tilde{O}(n^2 2^n)$ time. It would seem promising to use this directly in our work; however, our algorithm divides the subset lattice in half, which prevents us from applying the fast transform directly. The authors note that, similar to the above dynamic programming approaches, their method can be used to compute the clustering partition function and MAP in $O(n \cdot 3^n)$, which is larger than our result of $O(3^n)$. Their use of convolutions to compute posteriors of k-clusterings also implies the existence of an $\tilde{O}(n^3 2^n)$ algorithm to compute the pairwise posterior matrix, i.e., the probability that items $i$ and $j$ are clustered together, though the authors mention that, due to numerical instability issues, using fast convolutions to compute the pairwise posterior matrix is only faster in theory.

Recently proposed perturbation-based methods [10] approximate distributions over clusterings as well as marginal distributions over clusters. They use the Perturb-and-MAP approach [16], originally proposed by Papandreou and Yuille, which is based on adding Gumbel-distributed noise to the clustering energy function. Unfortunately, for Perturb-and-MAP to approach the exact distribution, independent samples from the Gumbel distribution must be added to each clustering energy, which would require a super-exponential number of draws. To overcome this, Kappes et al. [10] propose adding Gumbel noise to the pairwise real-valued affinity scores, thus requiring fewer draws but introducing some dependence among samples. They must also perform an outer relaxation in order to obtain a computable bound for the log partition function. As a result, the method approaches a distribution with unknown approximation bounds.

7 Conclusion

In this paper, we present a data structure and dynamic-programming algorithm to compactly represent and compute probability distributions over clusterings. We believe this to be the first work on efficient representations of exact distributions over clusterings. We reduce the computation cost of the naïve exhaustive method from the Nth Bell number to sub-quadratic in the size of the substantially smaller powerset of N. We demonstrate how this result is a first step towards practical approximations enabling larger scalability and show a case study of the method applied to correlation clustering.

Acknowledgments

We thank the anonymous reviewers for their constructive feedback. This work was supported in part by the Center for Intelligent Information Retrieval, in part by DARPA under agreement number FA8750-13-2-0020, in part by the National Science Foundation Graduate Research Fellowship under Grant No. NSF-1451512, and in part by the National Science Foundation Grant 1637536. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References

[1] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine Learning, 2004.
[2] E. T. Bell. Exponential polynomials. Annals of Mathematics, 1934.
[3] Charles Blundell, Yee Whye Teh, and Katherine A. Heller. Bayesian rose trees. Conference on Uncertainty in Artificial Intelligence, 2010.
[4] Rina Dechter. Bucket elimination: A unifying framework for probabilistic inference. 1999.
[5] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
[6] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984.
[7] Katherine A. Heller and Zoubin Ghahramani. Bayesian hierarchical clustering. International Conference on Machine Learning, 2005.
[8] Lawrence Hubert, Phipps Arabie, and Jacqueline Meulman. Combinatorial data analysis: Optimization by dynamic programming. Society for Industrial and Applied Mathematics, 2001.
[9] Robert E. Jensen. A dynamic programming algorithm for cluster analysis. Operations Research, 1969.
[10] Jörg Hendrik Kappes, Paul Swoboda, et al. Probabilistic correlation clustering and image partitioning using perturbed multicuts. International Conference on Scale Space and Variational Methods in Computer Vision, 2015.
[11] Jukka Kohonen and Jukka Corander. Computing exact clustering posteriors with subset convolution. Communications in Statistics - Theory and Methods, 2016.
[12] Yann LeCun, Sumit Chopra, and Raia Hadsell. A tutorial on energy-based learning. 2006.
[13] Brian D. Lehmann and Jennifer A. Pietenpol. Identification and use of biomarkers in treatment strategies for triple-negative breast cancer subtypes. The Journal of Pathology, 2014.
[14] László Lovász. Combinatorial problems and exercises. 1993.
[15] Cancer Genome Atlas Network et al. Comprehensive molecular portraits of human breast tumours. Nature, 2012.
[16] George Papandreou and Alan L. Yuille. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. International Conference on Computer Vision, 2011.
[17] D. Raj Reddy et al. Speech understanding systems: A summary of results of the five-year research effort. Department of Computer Science, Carnegie Mellon University, 1977.
[18] Roman Rouzier, Charles M. Perou, W. Fraser Symmans, et al. Breast cancer molecular subtypes respond differently to preoperative chemotherapy. Clinical Cancer Research, 2005.
[19] Hachem Saddiki, Jon McAuliffe, and Patrick Flaherty. GLAD: a mixed-membership model for heterogeneous tumor subtype classification. Bioinformatics, 2014.
[20] Therese Sørlie, Charles M. Perou, Robert Tibshirani, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences, 2001.
[21] Therese Sørlie, Robert Tibshirani, Joel Parker, et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proceedings of the National Academy of Sciences, 2003.
[22] B. J. Van Os and Jacqueline J. Meulman. Improving dynamic programming strategies for partitioning. Journal of Classification, 2004.
[23] Ozlem Yersal and Sabri Barutca. Biological subtypes of breast cancer: Prognostic and therapeutic implications. World Journal of Clinical Oncology, 2014.
[24] Giacomo Zanella, Brenda Betancourt, Hanna Wallach, Jeffrey Miller, Abbas Zaidi, and Rebecca C. Steorts. Flexible models for microclustering with application to entity resolution. Advances in Neural Information Processing Systems, 2016.
[25] Nevin Lianwen Zhang and David Poole. Exploiting causal independence in Bayesian network inference. Journal of Artificial Intelligence Research, 1996.