LEARNING TO LINK

Maria-Florina Balcan (Carnegie Mellon University, ninamf@cs.cmu.edu), Travis Dick (University of Pennsylvania, tbd@seas.upenn.edu), Manuel Lang (Karlsruhe Institute of Technology, manuel.lang@student.kit.edu)

Clustering is an important part of many modern data analysis pipelines, including network analysis and data retrieval. There are many different clustering algorithms developed by various communities, and it is often not clear which algorithm will give the best performance on a specific clustering task. Similarly, we often have multiple ways to measure distances between data points, and the best clustering performance might require a non-trivial combination of those metrics. In this work, we study data-driven algorithm selection and metric learning for clustering problems, where the goal is to simultaneously learn the best algorithm and metric for a specific application. The family of clustering algorithms we consider consists of parameterized linkage-based procedures and includes single and complete linkage. The family of distance functions we learn over consists of convex combinations of base distance functions. We design efficient learning algorithms which receive samples from an application-specific distribution over clustering instances and learn a near-optimal distance and clustering algorithm from these classes. We also carry out a comprehensive empirical evaluation of our techniques showing that they can lead to significantly improved clustering performance on real-world datasets.

1 INTRODUCTION

Overview. Clustering is an important component of modern data analysis. For example, we might cluster emails as a pre-processing step for spam detection, or we might cluster individuals in a social network in order to suggest new connections. There are a myriad of different clustering algorithms, and it is not always clear which algorithm will give the best performance on a specific clustering task. Similarly, we often have multiple different ways to measure distances between data points, and it is not obvious which distance metric will lead to the best performance. In this work, we study data-driven algorithm selection and metric learning for clustering problems, where the goal is to use data to simultaneously learn the best algorithm and metric for a specific application such as clustering emails or users of a social network. An application is modeled as a distribution over clustering tasks: we observe an i.i.d. sample of clustering instances drawn from that distribution, and our goal is to choose an approximately optimal algorithm from a parameterized family of algorithms (according to some well-defined loss function). This corresponds to settings where we repeatedly solve clustering instances (e.g., clustering the emails that arrive each day) and we want to use historic instances to learn the best clustering algorithm and metric. The family of clustering algorithms we learn over consists of parameterized linkage-based procedures and includes single and complete linkage, which are widely used in practice and optimal in many cases (Awasthi et al., 2014; Saeed et al., 2003; White et al., 2010; Awasthi et al., 2012; Balcan and Liang, 2016; Grosswendt and Roeglin, 2015). The family of distance metrics we learn over consists of convex combinations of base distance functions.
We design efficient learning algorithms that receive samples from an application-specific distribution over clustering instances and simultaneously learn both a near-optimal distance metric and clustering algorithm from these classes. We contribute to a recent line of work that provides learning-theoretical guarantees for data-driven algorithm configuration (Gupta and Roughgarden, 2017; Balcan et al., 2017; 2018a;b; 2019). These papers analyze the intrinsic complexity of parameterized algorithm families in order to provide sample complexity guarantees, that is, bounds on the number of sample instances needed in order to find an approximately optimal algorithm for a given application domain. Our results build on the work of Balcan et al. (2017), who studied the problem of learning the best clustering algorithm from a class of linkage based procedures, but did not study learning the best metric. In addition to our sample complexity guarantees, we develop a number of algorithmic tools that enable learning application specific clustering algorithms and metrics for realistically large clustering instances. We use our efficient implementations to conduct comprehensive experiments on clustering domains derived from both real-world and synthetic datasets. These experiments demonstrate that learning application-specific algorithms and metrics can lead to significant performance improvements over standard algorithms and metrics. Our Results. We study linkage-based clustering algorithms that take as input a clustering instance S and output a hierarchical clustering of S represented as a binary cluster tree. Each node in the tree represents a cluster in the data at one level of granularity, with the leaves corresponding to individual data points and the root node corresponding to the entire dataset. Each internal node represents a cluster obtained by merging its two children. Linkage-based clustering algorithms build a cluster tree from the leaves up, starting with each point belonging to its own cluster and repeatedly merging the closest pair of clusters until only one remains. The parameters of our algorithm family control both the metric used to measure pointwise distances, as well as how the linkage algorithm measures distances between clusters (in terms of the distances between their points). This work has two major contributions. Our first contribution is to provide sample complexity guarantees for learning effective application-specific distance metrics for use with linkage-based clustering algorithms. The key challenge is that, if we fix a clustering algorithm from the family we study and a single clustering instance S, the algorithm output is a piecewise constant function of our metric family s parameters. This implies that, unlike many standard learning problems, the loss we want to minimize is very sensitive to the metric parameters and small perturbations to the optimal parameters can lead to high loss. Our main technical insight is that for any clustering instance S, we can partition the parameter space of our metric family into a relatively small number of regions such that the ordering over pairs of points in S given by the metric is constant on each region. The clustering output by all algorithms in the family we study only depends on the ordering over pairs of points induced by the metric, and therefore their output is also a piecewise constant function of the metric parameters with not too many pieces. 
We leverage this structure to bound the intrinsic complexity of the learning problem, leading to uniform convergence guarantees. By combining our results with those of Balcan et al. (2017), we show how to simultaneously learn both an application-specific metric and linkage algorithm. Our second main contribution is a comprehensive empirical evaluation of our proposed methods, enabled by new algorithmic insights for efficiently learning application-specific algorithms and metrics from sample clustering instances. For any fixed clustering instance, we show that we can use an execution tree data structure to efficiently construct a coarse partition of the joint parameter space so that on each region the output clustering is constant. Roughly speaking, the execution tree compactly describes all possible sequences of merges the linkage algorithm might make together with the parameter settings for the algorithm and metric that lead to that merge sequence. The learning procedure proposed by Balcan et al. (2017) takes a more combinatorial approach, resulting in partitions of the parameter space that have many unnecessary regions and increased overall running time. Balcan et al. (2018a) and Balcan et al. (2018b) also use an execution tree approach for different algorithm families, however their specific approaches to enumerating the tree are not efficient enough to be used in our setting. We show that using a depth-first traversal of the execution tree leads to significantly reduced memory requirements, since in our setting the execution tree is shallow but very wide. Using our efficient implementations, we evaluate our learning algorithms on several real world and synthetic clustering applications. We learn the best algorithm and metric for clustering applications derived from MNIST, CIFAR-10, Omniglot, Places2, and a synthetic rings and disks distribution. Across these different tasks the optimal clustering algorithm and metric vary greatly. Moreover, in most cases we achieve significant improvements in clustering quality over standard clustering algorithms and metrics. Related work. Gupta and Roughgarden (2017) introduced the theoretical framework for analyzing algorithm configuration problems that we study in this work. They provide sample complexity guarantees for greedy algorithms for canonical subset selection problems including the knapsack problem, maximum weight independent set, and machine scheduling. Some recent works provide sample complexity guarantees for learning application-specific clustering algorithms. Balcan et al. (2017) consider several parameterized families of linkage based clustering algorithms, one of which is a special case of the family studied in this paper. Their sample complexity results are also based on showing that for a single clustering instance, we can find a partitioning of the algorithm parameter space into regions where the output clustering is constant. The families of linkage procedures they study have a single parameter, while our linkage algorithm and metric families have multiple. Moreover, they suppose we are given a fixed metric for each clustering instance and do not study the problem of learning an application-specific metric. Balcan et al. (2018b) study the related problem of learning the best initialization procedure and local search method to use in a clustering algorithm inspired by Lloyd s method for k-means clustering. 
Their sample complexity results are again based on demonstrating that for any clustering instance, there exists a partitioning of the parameter space on which the algorithm's output is constant. The parameter space partitions in both of these related works are defined by linear separators. Due to the interactions between the distance metric and the linkage algorithm, our partitions are defined by quadratic functions.

The procedures proposed by prior work for finding an empirically optimal algorithm for a collection of problem instances roughly fall into two categories: combinatorial approaches and approaches based on an execution-tree data structure. Gupta and Roughgarden (2017) and Balcan et al. (2017) are two examples of the combinatorial approach. They show that the boundaries in the constant-output partition of the algorithm parameter space always occur at the solutions to finitely many equations that depend on the problem instance. To find an empirically optimal algorithm, they find all solutions to these problem-dependent equations to explicitly construct a partition of the parameter space. Unfortunately, only a small subset of the solutions are actual boundaries in the partition. Consequently, their partitions contain many extra regions and suffer from long running times. The execution-tree based approaches find the coarsest possible partitioning of the parameter space such that the algorithm output is constant. Balcan et al. (2018b) and Balcan et al. (2018a) both use execution trees to find empirically optimal algorithm parameters for different algorithm families. However, the specific algorithms used to construct and enumerate the execution tree are different from those explored in this paper and are not suitable in our setting.

2 LEARNING CLUSTERING ALGORITHMS

The problem we study is as follows. Let $X$ be a data domain. A clustering instance consists of a point set $S = \{x_1, \ldots, x_n\} \subseteq X$ and an (unknown) target clustering $Y = (C_1, \ldots, C_k)$, where the sets $C_1, \ldots, C_k$ partition $S$ into $k$ clusters. Linkage-based clustering algorithms output a hierarchical clustering of the input data, represented by a cluster tree. We measure the agreement of a cluster tree $T$ with the target clustering $Y = (C_1, \ldots, C_k)$ in terms of the Hamming distance between $Y$ and the closest pruning of $T$ into $k$ clusters (i.e., $k$ disjoint subtrees that contain all the leaves of $T$). More formally, we define the loss $\ell(T, Y) = \min_{P_1, \ldots, P_k} \min_{\sigma} \frac{1}{|S|} \sum_{i=1}^{k} |C_i \setminus P_{\sigma(i)}|$, where $A \setminus B$ denotes set difference, the first minimum is over all prunings $P_1, \ldots, P_k$ of the cluster tree $T$, and the second minimum is over all permutations $\sigma$ of the $k$ cluster indices. This formulation allows us to handle the case where each clustering task has a different number of clusters, and where the desired number might not be known in advance. Our analysis applies to any loss function $\ell$ measuring the quality of the output cluster tree $T$, but we focus on the Hamming distance for simplicity. Given a distribution $\mathcal{D}$ over clustering instances (i.e., point sets together with target clusterings), our goal is to find the algorithm $A$ from a family $\mathcal{A}$ with the lowest expected loss for an instance sampled from $\mathcal{D}$. As training data, we assume that we are given an i.i.d. sample of clustering instances annotated with their target clusterings drawn from the application distribution $\mathcal{D}$.
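For intuition about the loss just defined, the sketch below (our own illustration, not the paper's implementation) evaluates the Hamming loss for one fixed $k$-way pruning by matching pruned clusters to target clusters with a minimum-cost assignment; the full loss $\ell(T, Y)$ additionally minimizes this quantity over all prunings of $T$, a search that is omitted here. The function name and the use of SciPy are assumptions.

```python
# Hedged sketch: Hamming-style loss between one fixed pruning P_1, ..., P_k of
# a cluster tree and the target clustering C_1, ..., C_k, minimizing over the
# permutation sigma via a min-cost assignment.  (The full loss also searches
# over prunings of the tree, which is not shown here.)
import numpy as np
from scipy.optimize import linear_sum_assignment

def pruning_loss(pruning, target, n_points):
    """pruning, target: lists of k sets of point indices."""
    # cost[i][j] = |C_i \ P_j|: points of target cluster i missed by pruned cluster j
    cost = np.array([[len(C - P) for P in pruning] for C in target])
    rows, cols = linear_sum_assignment(cost)  # optimal permutation of cluster labels
    return cost[rows, cols].sum() / n_points
```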
We study linkage-based clustering algorithms. These algorithms construct a hierarchical clustering of a point set by starting with each point belonging to a cluster of its own and then repeatedly merging the closest pair of clusters until only one remains. There are two distinct notions of distance at play in linkage-based algorithms: first, the notion of distance between pairs of points (e.g., Euclidean distance between feature vectors, edit distance between strings, or the Jaccard distance between sets). Second, these algorithms must define a distance function between clusters, which we refer to as a merge function to avoid confusion. A merge function $D$ defines the distance between a pair of clusters $A, B \subseteq X$ in terms of the pairwise distances given by a metric $d$ between their points. For example, single linkage uses the merge function $D_{\min}(A, B; d) = \min_{a \in A, b \in B} d(a, b)$ and complete linkage uses the merge function $D_{\max}(A, B; d) = \max_{a \in A, b \in B} d(a, b)$.

Our parameterized family of linkage-based clustering algorithms allows us to vary both the metric used to measure distances between points, as well as the merge function used to measure distances between clusters. To vary the metric, we suppose we have access to $L$ metrics $d_1, \ldots, d_L$ defined on our data universe $X$, and our goal is to find the best convex combination of those metrics. That is, for any parameter vector $\beta \in \Delta_L = \{\beta \in [0, 1]^L \mid \sum_i \beta_i = 1\}$, we define a metric $d_\beta(x, x') = \sum_i \beta_i d_i(x, x')$. This definition is suitable across a wide range of applications, since it allows us to learn the best combination of a given set of metrics for the application at hand. Similarly, for varying the merge function, we suppose we have $L'$ merge functions $D_1, \ldots, D_{L'}$. For any parameter $\alpha \in \Delta_{L'}$, define the merge function $D_\alpha(A, B; d) = \sum_i \alpha_i D_i(A, B; d)$. For each pair of parameters $\beta \in \Delta_L$ and $\alpha \in \Delta_{L'}$, we obtain a different clustering algorithm (i.e., one that repeatedly merges the pair of clusters minimizing $D_\alpha(\cdot, \cdot; d_\beta)$). Pseudocode for this method is given in Algorithm 1. In the pseudocode, clusters are represented by binary trees with leaves corresponding to the points belonging to that cluster. For any clustering instance $S \subseteq X$, we let $\mathcal{A}_{\alpha,\beta}(S)$ denote the cluster tree output by Algorithm 1 when run with parameter vectors $\alpha$ and $\beta$.

Algorithm 1 Linkage Clustering
Input: Metrics $d_1, \ldots, d_L$, merge functions $D_1, \ldots, D_{L'}$, points $x_1, \ldots, x_n \in X$, parameters $\alpha$ and $\beta$.
1. Let $N = \{\mathrm{Leaf}(x_1), \ldots, \mathrm{Leaf}(x_n)\}$ be the initial set of nodes (one leaf per point).
2. While $|N| > 1$:
   (a) Let $A, B \in N$ be the clusters in $N$ minimizing $D_\alpha(A, B; d_\beta)$.
   (b) Remove clusters $A$ and $B$ from $N$ and add $\mathrm{Node}(A, B)$ to $N$.
3. Return the cluster tree (the only element of $N$).

First, we provide sample complexity results that hold for any collection of metrics $d_1, \ldots, d_L$ and any collection of merge functions $D_1, \ldots, D_{L'}$ that belong to the following family:

Definition 1. A merge function $D$ is 2-point-based if for any pair of clusters $A, B \subseteq X$ and any metric $d$, there exists a pair of points $(a, b) \in A \times B$ such that $D(A, B; d) = d(a, b)$. Moreover, the pair of points defining the merge distance must depend only on the ordering of pairwise distances. More formally, if $d$ and $d'$ are two metrics s.t. for all $a, a' \in A$ and $b, b' \in B$ we have $d(a, b) \le d(a', b')$ if and only if $d'(a, b) \le d'(a', b')$, then $D(A, B; d) = d(a, b)$ implies that $D(A, B; d') = d'(a, b)$.

For example, both single and complete linkage are 2-point-based merge functions, since they output the distance between the closest or farthest pair of points, respectively.
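To make the family concrete, the following is a compact Python sketch of Algorithm 1 (our own rendering, not the authors' code), with single and complete linkage as example 2-point-based merge functions; all function and variable names are assumptions.

```python
# Sketch of Algorithm 1: linkage clustering with a mixed merge function D_alpha
# and a mixed metric d_beta.  `metrics` is a list of callables d_i(x, y) and
# `merges` a list of callables D_i(A, B, d) over clusters A, B (lists of points).
import itertools

def linkage_cluster(points, metrics, merges, alpha, beta):
    d_beta = lambda x, y: sum(b * d(x, y) for b, d in zip(beta, metrics))
    D_alpha = lambda A, B: sum(a * D(A, B, d_beta) for a, D in zip(alpha, merges))
    # each node is a pair (tree, members); leaves hold a single point
    nodes = [((x,), [x]) for x in points]
    while len(nodes) > 1:
        # find the pair of clusters minimizing the mixed merge distance
        (i, ni), (j, nj) = min(
            itertools.combinations(enumerate(nodes), 2),
            key=lambda p: D_alpha(p[0][1][1], p[1][1][1]))
        nodes = [c for k, c in enumerate(nodes) if k not in (i, j)]
        nodes.append(((ni[0], nj[0]), ni[1] + nj[1]))  # merged internal node
    return nodes[0][0]  # the cluster tree, as nested tuples

# Example 2-point-based merge functions (single and complete linkage):
D_min = lambda A, B, d: min(d(a, b) for a in A for b in B)
D_max = lambda A, B, d: max(d(a, b) for a in A for b in B)
```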
Theorem 1. Fix any metrics $d_1, \ldots, d_L$, 2-point-based merge functions $D_1, \ldots, D_{L'}$, and distribution $\mathcal{D}$ over clustering instances with at most $n$ points. For any parameters $\epsilon > 0$ and $\delta > 0$, let $(S_1, Y_1), \ldots, (S_N, Y_N)$ be an i.i.d. sample of $N = O\bigl(\frac{1}{\epsilon^2}\bigl((L' + L)^2 L' \log\frac{(L' + L)^2 L' n}{\delta}\bigr)\bigr) = \tilde{O}\bigl(\frac{(L' + L)^2 L'}{\epsilon^2}\bigr)$ clustering instances with target clusterings drawn from $\mathcal{D}$. Then with probability at least $1 - \delta$ over the draw of the sample, we have
$$\sup_{(\alpha, \beta) \in \Delta_{L'} \times \Delta_L} \Bigl| \frac{1}{N} \sum_{i=1}^{N} \ell(\mathcal{A}_{\alpha,\beta}(S_i), Y_i) - \mathbb{E}_{(S, Y) \sim \mathcal{D}}\bigl[\ell(\mathcal{A}_{\alpha,\beta}(S), Y)\bigr] \Bigr| \le \epsilon.$$

The key step in the proof of Theorem 1 is to show that for any clustering instance $S$ with target clustering $Y$, the function $(\alpha, \beta) \mapsto \ell(\mathcal{A}_{\alpha,\beta}(S), Y)$ is piecewise constant with not too many pieces and where each piece is simple. Intuitively, this guarantees that for any collection of clustering instances, we cannot see too much variation in the loss of the algorithm on those instances as we vary over the parameter space. In Appendix A we use this fact to bound the empirical Rademacher complexity of the learning problem and obtain the uniform convergence guarantee in Theorem 1. In the remainder of this section, we prove the key structural property.

We let $\zeta = (\alpha, \beta) \in \Delta_{L'} \times \Delta_L$ denote a pair of parameter vectors for Algorithm 1, viewed as a vector in $\mathbb{R}^{L' + L}$. Our parameter space partition will be induced by the sign-pattern of $M$ quadratic functions.

Definition 2 (Sign-pattern Partition). The sign-pattern partition of $\mathbb{R}^p$ induced by $M$ functions $f_1, \ldots, f_M : \mathbb{R}^p \to \mathbb{R}$ is defined as follows: Let $F : \mathbb{R}^p \to \{\pm 1\}^M$ be the function $F(\zeta) = (\mathrm{sign}(f_1(\zeta)), \ldots, \mathrm{sign}(f_M(\zeta)))$. Two points $\zeta, \zeta' \in \mathbb{R}^p$ belong to the same region in the partition iff $F(\zeta) = F(\zeta')$. Each region is of the form $Z = \{\zeta \in \mathbb{R}^p \mid F(\zeta) = b\}$, for some sign-pattern vector $b \in \{\pm 1\}^M$.

We show that for any fixed metrics $d_1, \ldots, d_L$ and clustering instance $S = \{x_1, \ldots, x_n\} \subseteq X$, we can find a sign-pattern partitioning of $\Delta_L$ induced by linear functions such that, on each region, the ordering over pairs of points in $S$ induced by the metric $d_\beta$ is constant. An important consequence of this result is that for each region $Z$ in this partitioning of $\Delta_L$, the following holds: For any 2-point-based merge function $D$ and any pair of clusters $A, B \subseteq S$, there exists a pair of points $(a, b) \in A \times B$ such that $D(A, B; d_\beta) = d_\beta(a, b)$ for all $\beta \in Z$. In other words, restricted to $\beta$ parameters belonging to $Z$, the same pair of points $(a, b)$ defines the $D$-merge distance for the clusters $A$ and $B$.

Lemma 1. Fix any metrics $d_1, \ldots, d_L$ and a clustering instance $S \subseteq X$. There exists a set $\mathcal{H}$ of $O(|S|^4)$ linear functions mapping $\mathbb{R}^L$ to $\mathbb{R}$ with the following property: if two metric parameters $\beta, \beta' \in \Delta_L$ belong to the same region in the sign-pattern partition induced by $\mathcal{H}$, then the orderings over pairs of points in $S$ given by $d_\beta$ and $d_{\beta'}$ are the same. That is, for all points $a, b, a', b' \in S$ we have $d_\beta(a, b) \le d_\beta(a', b')$ iff $d_{\beta'}(a, b) \le d_{\beta'}(a', b')$.

Proof sketch. For any pair of points $a, b \in S$, the distance $d_\beta(a, b)$ is a linear function of the parameter $\beta$. Therefore, for any four points $a, b, a', b' \in S$, we have that $d_\beta(a, b) \le d_\beta(a', b')$ iff $h_{a,b,a',b'}(\beta) \le 0$, where $h_{a,b,a',b'}$ is the linear function given by $h_{a,b,a',b'}(\beta) = d_\beta(a, b) - d_\beta(a', b')$. Let $\mathcal{H} = \{h_{a,b,a',b'} \mid a, b, a', b' \in S\}$ be the collection of all such linear functions arising from any subset of 4 points. On each region of the sign-pattern partition induced by $\mathcal{H}$, all comparisons of pairwise distances in $S$ are fixed, implying that the ordering over pairs of points in $S$ is fixed.
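As a small illustration of the Lemma 1 construction, the toy sketch below (an assumption-laden example of ours) evaluates the sign pattern of the linear functions $h_{a,b,a',b'}$ at a given $\beta$; two parameter vectors with the same sign pattern order all pairwise distances identically.

```python
# Toy sketch of the Lemma 1 construction: the sign pattern of the linear
# functions h_{a,b,a',b'}(beta) = d_beta(a, b) - d_beta(a', b') at a given beta.
import itertools
import numpy as np

def sign_pattern(S, metrics, beta):
    """S: list of points, metrics: list of callables d_i(x, y), beta: weights."""
    pairs = list(itertools.combinations(range(len(S)), 2))
    signs = []
    for (a, b), (a2, b2) in itertools.combinations(pairs, 2):
        h = sum(w * (d(S[a], S[b]) - d(S[a2], S[b2]))
                for w, d in zip(beta, metrics))   # linear in beta
        signs.append(int(np.sign(h)))
    return tuple(signs)

# Equal sign patterns at beta and beta' imply the same ordering of pairwise
# distances, hence identical behaviour of any 2-point-based merge function.
```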
Building on Lemma 1, we now prove the main structural property of Algorithm 1. We argue that for any clustering instance $S \subseteq X$, there is a partition of $\Delta_{L'} \times \Delta_L \subseteq \mathbb{R}^{L' + L}$, induced by quadratic functions over $\alpha$ and $\beta$, into regions such that on each region the ordering over all pairs of clusters according to the merge distance $D_\alpha(\cdot, \cdot; d_\beta)$ is fixed. This implies that for all $(\alpha, \beta)$ in one region of the partition, the output of Algorithm 1 when run on $S$ is constant, since the algorithm output only depends on the ordering over pairs of clusters in $S$ given by $D_\alpha(\cdot, \cdot; d_\beta)$.

Lemma 2. Fix any metrics $d_1, \ldots, d_L$, any 2-point-based merge functions $D_1, \ldots, D_{L'}$, and clustering instance $S \subseteq X$. There exists a set $\mathcal{Q}$ of $O(|S|^{4L'})$ quadratic functions defined on $\mathbb{R}^{L' + L}$ so that if parameters $(\alpha, \beta)$ and $(\alpha', \beta')$ belong to the same region of the sign-pattern partition induced by $\mathcal{Q}$, then the orderings over pairs of clusters in $S$ given by $D_\alpha(\cdot, \cdot; d_\beta)$ and $D_{\alpha'}(\cdot, \cdot; d_{\beta'})$ are the same. That is, for all clusters $A, B, A', B' \subseteq S$, we have that $D_\alpha(A, B; d_\beta) \le D_\alpha(A', B'; d_\beta)$ iff $D_{\alpha'}(A, B; d_{\beta'}) \le D_{\alpha'}(A', B'; d_{\beta'})$.

Proof sketch. Let $\mathcal{H}$ be the linear functions constructed in Lemma 1 and fix any region $Z$ in the sign-pattern partition induced by $\mathcal{H}$. For any $i \in [L']$, since the merge function $D_i$ is 2-point-based and the ordering over pairs of points according to $d_\beta(\cdot, \cdot)$ in the region $Z$ is fixed, for any clusters $A, B, A', B' \subseteq S$ we can find points $a_i, b_i, a'_i, b'_i$ such that $D_i(A, B; d_\beta) = d_\beta(a_i, b_i)$ and $D_i(A', B'; d_\beta) = d_\beta(a'_i, b'_i)$ for all $\beta \in Z$. Therefore, expanding the definition of $D_\alpha$, we have that $D_\alpha(A, B; d_\beta) \le D_\alpha(A', B'; d_\beta)$ iff $q_{A,B,A',B'}(\alpha, \beta) \le 0$, where $q_{A,B,A',B'}(\alpha, \beta) = \sum_{i=1}^{L'} \sum_{j=1}^{L} \alpha_i \beta_j \bigl(d_j(a_i, b_i) - d_j(a'_i, b'_i)\bigr)$. Observe that the coefficients of each quadratic function depend only on $4L'$ points in $S$, so there are only $O(|S|^{4L'})$ possible quadratics of this form collected across all regions $Z$ in the sign-pattern partition induced by $\mathcal{H}$ and all subsets of 4 clusters. Together with $\mathcal{H}$, this set of quadratic functions partitions the joint parameter space into regions where the ordering over all pairs of clusters is fixed.

A consequence of Lemma 2 is that for any clustering instance $S$ with target clustering $Y$, the function $(\alpha, \beta) \mapsto \ell(\mathcal{A}_{\alpha,\beta}(S), Y)$ is piecewise constant, where the constant partitioning is the sign-pattern partition induced by $O(|S|^{4L'})$ quadratic functions. The proof of Theorem 1 now follows from using this fact to bound the empirical Rademacher complexity of the learning problem.

Extensions. The above analysis can be extended to handle several more general settings. First, we can accommodate many specific merge functions that are not included in the 2-point-based family, at the cost of increasing the number of quadratic functions $|\mathcal{Q}|$ needed in Lemma 2. For example, if one of the merge functions is average linkage, $D_{\mathrm{avg}}(A, B; d) = \frac{1}{|A| \cdot |B|} \sum_{a \in A, b \in B} d(a, b)$, then $|\mathcal{Q}|$ will be exponential in the dataset size $n$. Fortunately, our sample complexity analysis depends only on $\log(|\mathcal{Q}|)$, so this still leads to non-trivial sample complexity guarantees (though the computational algorithm selection problem becomes harder). We can also extend the analysis to more intricate methods for combining the metrics and merge functions. For example, our analysis applies to polynomial combinations of metrics and merges at the cost of increasing the complexity of the functions defining the piecewise constant partition.

3 EFFICIENT ALGORITHM SELECTION

In this section we provide efficient algorithms for learning low-loss clustering algorithms and metrics for application-specific distributions $\mathcal{D}$ defined over clustering instances.
We begin by focusing on the special case where we have a single metric and our goal is to learn the best combination of two merge functions (i.e., $L = 1$ and $L' = 2$). This special case is already algorithmically interesting. Next, we show how to apply similar techniques to the case of learning the best combination of two metrics when using the complete linkage merge function (i.e., $L = 2$ and $L' = 1$). Finally, we discuss how to generalize our techniques to other cases.

Learning the Merge Function. We will use the following simplified notation for mixing two base merge functions $D_0(A, B; d)$ and $D_1(A, B; d)$: for each parameter $\alpha \in [0, 1]$, let $D_\alpha(A, B; d) = (1 - \alpha) D_0(A, B; d) + \alpha D_1(A, B; d)$ denote the convex combination with weight $(1 - \alpha)$ on $D_0$ and weight $\alpha$ on $D_1$. We let $\mathcal{A}^{\mathrm{merge}}_\alpha(S; D_0, D_1)$ denote the cluster tree produced by the algorithm with parameter $\alpha$, and $\mathcal{A}^{\mathrm{merge}}(D_0, D_1) = \{\mathcal{A}^{\mathrm{merge}}_\alpha(\cdot; D_0, D_1) \mid \alpha \in [0, 1]\}$ denote the parameterized algorithm family.

Figure 1 (plot omitted): An example of the execution tree of $\mathcal{A}^{\mathrm{merge}}(D_{\min}, D_{\max})$ for a clustering instance with 4 points; branches are labeled by the parameter intervals $\alpha \in [0, 1/2]$ and $\alpha \in (1/2, 1]$, and the nested rectangles show the clustering at each node.

Our goal is to design efficient procedures for finding the algorithm from $\mathcal{A}^{\mathrm{merge}}(D_0, D_1)$ (and more general families) that has the lowest average loss on a sample of labeled clustering instances $(S_1, Y_1), \ldots, (S_N, Y_N)$, where $Y_i = (C_1^{(i)}, \ldots, C_{k_i}^{(i)})$ is the target clustering for instance $S_i$. Recall that the loss function $\ell(T, Y)$ computes the Hamming distance between the target clustering $Y$ and the closest pruning of the cluster tree $T$. Formally, our goal is to solve the following optimization problem: $\operatorname{argmin}_{\alpha \in [0,1]} \frac{1}{N} \sum_{i=1}^{N} \ell(\mathcal{A}^{\mathrm{merge}}_\alpha(S_i; D_0, D_1), Y_i)$. The key challenge is that, for a fixed clustering instance $S$, we can partition the parameter space $[0, 1]$ into finitely many intervals such that for each interval $I$, the cluster tree output by the algorithm $\mathcal{A}^{\mathrm{merge}}_\alpha(S; D_0, D_1)$ is the same for every parameter $\alpha \in I$. It follows that the loss function is a piecewise constant function of the algorithm parameter. Therefore, the optimization problem is non-convex and the loss derivative is zero wherever it is defined, rendering gradient descent and similar algorithms ineffective.

We solve the optimization problem by explicitly computing the piecewise constant loss function for each instance $S_i$. That is, for instance $i$ we find a collection of discontinuity locations $0 = c_0^{(i)} < \ldots < c_{M_i}^{(i)} = 1$ and values $v_1^{(i)}, \ldots, v_{M_i}^{(i)} \in \mathbb{R}$ so that for each $j \in [M_i]$, running the algorithm on instance $S_i$ with a parameter in $[c_{j-1}^{(i)}, c_j^{(i)})$ has loss equal to $v_j^{(i)}$. Given this representation of the loss function for each of the $N$ instances, finding the parameter with minimal average loss can be done in $O(M \log(M))$ time, where $M = \sum_i M_i$ is the total number of discontinuities from all $N$ loss functions. The bulk of the computational cost is incurred by computing the piecewise constant loss functions, which we focus on for the rest of the section.
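Before turning to that, the following sketch (our own illustration, not the paper's implementation) shows the simpler aggregation step just described: given each instance's piecewise constant loss as breakpoints and values, it sweeps the union of breakpoints and returns a parameter with minimal average loss. A sweep that updates the running sum incrementally attains the $O(M \log M)$ bound quoted above; this version trades that for brevity.

```python
# Hedged sketch: pick the parameter alpha minimizing the average of N piecewise
# constant loss functions.  Each loss is given as (breakpoints, values) with
# breakpoints[0] == 0.0, breakpoints[-1] == 1.0, and values[j] the loss on
# the interval [breakpoints[j], breakpoints[j + 1]).
import bisect

def best_parameter(piecewise_losses):
    cuts = sorted({c for bps, _ in piecewise_losses for c in bps if c < 1.0})
    best_alpha, best_loss = None, float("inf")
    for lo in cuts:                       # the average loss is constant on [lo, next cut)
        total = 0.0
        for bps, vals in piecewise_losses:
            j = bisect.bisect_right(bps, lo) - 1          # piece containing lo
            total += vals[min(j, len(vals) - 1)]
        avg = total / len(piecewise_losses)
        if avg < best_loss:
            best_alpha, best_loss = lo, avg
    return best_alpha, best_loss
```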
We exploit a more powerful structural property of the algorithm family to compute the piecewise constant losses: for a clustering instance $S$ and any length $t$, the sequence of first $t$ merges performed by the algorithm is a piecewise constant function of the parameter (our sample complexity results only used that the final tree is piecewise constant). For length $t = 0$, the partition is a single region containing all parameters in $[0, 1]$, since every algorithm trivially starts with the empty sequence of merges. For each length $t > 0$, the piecewise constant partition for the first $t$ merges is a refinement of the partition for $t - 1$ merges. We can represent this sequence of partitions using a partition tree, where each node in the tree is labeled by an interval, the nodes at depth $t$ describe the partition of $[0, 1]$ after $t$ merges, and edges represent subset relationships. This tree represents all possible execution paths for the algorithm family when run on the instance $S$ as we vary the algorithm parameter. In particular, each path from the root node to a leaf corresponds to one possible sequence of merges. We therefore call this tree the execution tree of the algorithm family when run on $S$. Figure 1 shows an example execution tree for the family $\mathcal{A}^{\mathrm{merge}}(D_{\min}, D_{\max})$. To find the piecewise constant loss function for a clustering instance $S$, it is sufficient to enumerate the leaves of the execution tree and compute the corresponding losses. The following result, proved in Appendix B, shows that the execution tree for $\mathcal{A}^{\mathrm{merge}}(D_0, D_1)$ is well defined.

Lemma 3. For any merge functions $D_0$ and $D_1$ and any clustering instance $S$, the execution tree for $\mathcal{A}^{\mathrm{merge}}(D_0, D_1)$ when run on $S$ is well defined. That is, there exists a partition tree s.t. for any node $v$ at depth $t$, the same sequence of first $t$ merges is performed by $\mathcal{A}^{\mathrm{merge}}_\alpha$ for all $\alpha$ in node $v$'s interval.

The fundamental operation required to perform a depth-first traversal of the execution tree is finding a node's children. That is, given a node, its parameter interval $[\alpha_{\mathrm{lo}}, \alpha_{\mathrm{hi}})$, and the set of clusters at that node $C_1, \ldots, C_m$, find all possible merges that will be chosen by the algorithm for $\alpha \in [\alpha_{\mathrm{lo}}, \alpha_{\mathrm{hi}})$. We know that for each pair of merges $(C_i, C_j)$ and $(C_{i'}, C_{j'})$, there is a single critical parameter value where the algorithm switches from preferring to merge $(C_i, C_j)$ to $(C_{i'}, C_{j'})$. A direct algorithm that runs in $O(m^4)$ time for finding the children of a node in the execution tree is to compute all $O(m^4)$ critical parameter values and test which pair of clusters will be merged on each interval between consecutive critical parameters. We provide a more efficient algorithm that runs in time $O(m^2 M)$, where $M \le m^2$ is the number of children of the node.

Fix any node in the execution tree. Given the node's parameter interval $I = [\alpha_{\mathrm{lo}}, \alpha_{\mathrm{hi}})$ and the set of clusters $C_1, \ldots, C_m$ resulting from that node's merge sequence, we use a sweep-line algorithm to determine all possible next merges and the corresponding parameter intervals. First, we calculate the merge for $\alpha = \alpha_{\mathrm{lo}}$ by enumeration in $O(m^2)$ time. Suppose clusters $C_i$ and $C_j$ are the optimal merge for $\alpha$. We then determine the largest value of $\alpha$ for which $C_i$ and $C_j$ are still merged by solving the linear equation $D_\alpha(C_i, C_j) = D_\alpha(C_k, C_l)$ for all other pairs of clusters $C_k$ and $C_l$, keeping track of the minimal solution larger than $\alpha$. Since there are only $O(m^2)$ alternative pairs of clusters, this takes $O(m^2)$ time. Denote the minimal solution larger than $\alpha$ by $c \in I$. We are guaranteed that $\mathcal{A}^{\mathrm{merge}}_\alpha$ will merge clusters $C_i$ and $C_j$ for all $\alpha \in [\alpha, c)$. We repeat this procedure starting from $\alpha = c$ to determine the next merge and corresponding interval, and so on, sweeping through the $\alpha$ parameter space until $\alpha \ge \alpha_{\mathrm{hi}}$. Algorithm 2 in Appendix B provides pseudocode for this approach and Figure 2 shows an example.

Figure 2 (plots omitted; panels (a) First iteration and (b) Final output): Depiction of Algorithm 2 when given three clusters, $C_1$, $C_2$, and $C_3$. Each line shows the $D_\alpha$-distance between one pair of clusters as a function of the parameter $\alpha$. On the first iteration, Algorithm 2 determines that clusters $C_2$ and $C_3$ are the closest for the parameter $\alpha = \alpha_{\mathrm{lo}}$, calculates the critical parameter values $c_{12}$ and $c_{13}$, and advances $\alpha$ to $c_{13}$. Repeating the process partitions $[\alpha_{\mathrm{lo}}, \alpha_{\mathrm{hi}})$ into merge-constant regions.
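A minimal Python sketch of this sweep, under the assumption that $D_0$ and $D_1$ are given as callables over the already-formed clusters (the name children_merges and all variable names are ours, not the paper's):

```python
# Hedged sketch of the sweep-line step (in the spirit of Algorithm 2): enumerate
# the possible next merges on [alpha_lo, alpha_hi) together with the
# sub-interval of the parameter space on which each one is chosen.
import itertools

def children_merges(clusters, D0, D1, alpha_lo, alpha_hi, eps=1e-12):
    D = lambda a, A, B: (1 - a) * D0(A, B) + a * D1(A, B)
    merges, intervals = [], []
    alpha = alpha_lo
    while alpha < alpha_hi:
        # pair of clusters merged at the current sweep position
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda p: D(alpha, clusters[p[0]], clusters[p[1]]))
        # largest alpha for which (i, j) still wins: cross its line against every
        # other pair's line and keep the smallest crossing point larger than alpha
        cut = alpha_hi
        for k, l in itertools.combinations(range(len(clusters)), 2):
            if (k, l) == (i, j):
                continue
            g0 = D0(clusters[i], clusters[j]) - D0(clusters[k], clusters[l])
            g1 = D1(clusters[i], clusters[j]) - D1(clusters[k], clusters[l])
            if abs(g0 - g1) > eps:
                c = g0 / (g0 - g1)          # solution of the linear equation
                if alpha < c < cut:
                    cut = c
        merges.append((i, j))
        intervals.append((alpha, cut))
        alpha = cut
    return merges, intervals
```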
Our next result bounds the running time of this procedure.

Lemma 4. Let $C_1, \ldots, C_m$ be a collection of clusters, $D_0$ and $D_1$ be any pair of merge functions, and $[\alpha_{\mathrm{lo}}, \alpha_{\mathrm{hi}})$ be a subset of the parameter space. If there are $M$ distinct cluster pairs $C_i, C_j$ that minimize $D_\alpha(C_i, C_j)$ for values of $\alpha \in [\alpha_{\mathrm{lo}}, \alpha_{\mathrm{hi}})$, then the running time of Algorithm 2 is $O(M m^2 K)$, where $K$ is the cost of evaluating the merge functions $D_0$ and $D_1$.

With this, our algorithm for computing the piecewise constant loss function for an instance $S$ performs a depth-first traversal of the leaves of the execution tree for $\mathcal{A}^{\mathrm{merge}}(D_0, D_1)$, using Algorithm 2 to determine the children of each node. When we reach a leaf in the depth-first traversal, we have both the corresponding parameter interval $I \subseteq [0, 1]$, as well as the cluster tree $T$ such that $\mathcal{A}^{\mathrm{merge}}_\alpha(S) = T$ for all $\alpha \in I$. We then evaluate the loss $\ell(T, Y)$ to get one piece of the piecewise constant loss function. Detailed pseudocode for this approach is given in Algorithm 3 in Appendix B.

Theorem 2. Let $S = \{x_1, \ldots, x_n\}$ be a clustering instance and $D_0$ and $D_1$ be any two merge functions. Suppose that the execution tree of $\mathcal{A}^{\mathrm{merge}}(D_0, D_1)$ on $S$ has $E$ edges. Then the total running time of Algorithm 3 is $O(E n^2 K)$, where $K$ is the cost of evaluating $D_0$ and $D_1$ once.

We can express the running time of Algorithm 3 in terms of the number of discontinuities of the function $\alpha \mapsto \mathcal{A}^{\mathrm{merge}}_\alpha(S)$. There is one leaf of the execution tree for each constant interval of this function, and the path from the root of the execution tree to that leaf is of length $n - 1$. Therefore, the cost associated with that path is at most $O(K n^3)$, and enumerating the execution tree to obtain the piecewise constant loss function for a given instance $S$ spends $O(K n^3)$ time for each constant interval of $\alpha \mapsto \mathcal{A}^{\mathrm{merge}}_\alpha(S)$. In contrast, the combinatorial approach of Balcan et al. (2017) requires that we run $\alpha$-linkage once for every interval in their partition of $[0, 1]$, which always contains $O(n^8)$ intervals (i.e., it is a refinement of the piecewise constant partition). Since each run of $\alpha$-linkage costs $O(K n^2 \log n)$ time, this leads to a running time of $O(K n^{10} \log n)$. The key advantage of our approach stems from the fact that the number of discontinuities of the function $\alpha \mapsto \mathcal{A}^{\mathrm{merge}}_\alpha(S)$ is often several orders of magnitude smaller than $O(n^8)$.
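To tie the pieces together, here is a compact depth-first enumeration in the spirit of Algorithm 3, written against the hypothetical children_merges helper sketched above; it returns one (interval, merge sequence) pair per leaf, from which the cluster tree and its loss on that interval can be computed.

```python
# Hedged sketch of the depth-first enumeration: walk the execution tree with an
# explicit stack and record the parameter interval and merge sequence at every
# leaf.  Uses the children_merges sketch above.
def enumerate_execution_tree(points, D0, D1):
    leaves = []                                          # ((lo, hi), merge history)
    stack = [([[x] for x in points], [], (0.0, 1.0))]    # root: singleton clusters
    while stack:
        clusters, history, (lo, hi) = stack.pop()
        if len(clusters) == 1:
            leaves.append(((lo, hi), history))           # one constant piece of the loss
            continue
        merges, intervals = children_merges(clusters, D0, D1, lo, hi)
        for (i, j), interval in zip(merges, intervals):
            nxt = [c for k, c in enumerate(clusters) if k not in (i, j)]
            nxt.append(clusters[i] + clusters[j])
            stack.append((nxt, history + [(clusters[i], clusters[j])], interval))
    return leaves
```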
Learning the Metric. Next we present efficient algorithms for computing the piecewise constant loss function for a single clustering instance when interpolating between two base metrics and using complete linkage. For a pair of fixed base metrics $d_0$ and $d_1$ and any parameter value $\beta \in [0, 1]$, define $d_\beta(a, b) = (1 - \beta) d_0(a, b) + \beta d_1(a, b)$. Let $\mathcal{A}^{\mathrm{metric}}_\beta(S; d_0, d_1)$ denote the output of running complete linkage with the metric $d_\beta$, and $\mathcal{A}^{\mathrm{metric}}(d_0, d_1)$ denote the family of all such algorithms. We prove that for this algorithm family, the execution tree is well defined and provide an efficient algorithm for finding the children of each node in the execution tree, allowing us to use a depth-first traversal to find the piecewise constant loss function for any clustering instance $S$.

Lemma 5. For any metrics $d_0$ and $d_1$ and any clustering instance $S$, the execution tree for the family $\mathcal{A}^{\mathrm{metric}}(d_0, d_1)$ when run on $S$ is well defined. That is, there exists a partition tree s.t. for any node $v$ at depth $t$, the same sequence of first $t$ merges is performed by $\mathcal{A}^{\mathrm{metric}}_\beta$ for all $\beta$ in node $v$'s interval.

Next, we provide an efficient procedure for determining the children of a node $v$ in the execution tree of $\mathcal{A}^{\mathrm{metric}}(d_0, d_1)$. Given the node's parameter interval $I = [\beta_{\mathrm{lo}}, \beta_{\mathrm{hi}})$ and the set of clusters $C_1, \ldots, C_m$ resulting from that node's sequence of merges, we again use a sweep-line procedure to find the possible next merges and the corresponding parameter intervals. First, we determine the pair of clusters that will be merged by $\mathcal{A}^{\mathrm{metric}}_\beta$ for $\beta = \beta_{\mathrm{lo}}$ by enumerating all pairs of clusters. Suppose the winning pair is $C_i$ and $C_j$ and let $x \in C_i$ and $x' \in C_j$ be the farthest pair of points between the two clusters. Next, we find the largest value of $\beta$ for which we will still merge the clusters $C_i$ and $C_j$. To do this, we enumerate all other pairs of clusters $C_k$ and $C_l$ and all pairs of points $y \in C_k$ and $y' \in C_l$, and solve the linear equation $d_\beta(x, x') = d_\beta(y, y')$, keeping track of the minimal solution larger than $\beta$. Denote the minimal solution larger than $\beta$ by $c$. We are guaranteed that for all $\beta \in [\beta, c)$, the pair of clusters merged will be $C_i$ and $C_j$. Then we repeat the process with $\beta = c$ to find the next merge and corresponding interval, and so on, until $\beta \ge \beta_{\mathrm{hi}}$. Pseudocode for this procedure is given in Algorithm 4 in Appendix B. The following lemma bounds the running time:

Lemma 6. Let $C_1, \ldots, C_m$ be a collection of clusters, $d_0$ and $d_1$ be any pair of metrics, and $[\beta_{\mathrm{lo}}, \beta_{\mathrm{hi}})$ be a subset of the parameter space. If there are $M$ distinct cluster pairs $C_i, C_j$ that complete linkage would merge when using the metric $d_\beta$ for $\beta \in [\beta_{\mathrm{lo}}, \beta_{\mathrm{hi}})$, the running time of Algorithm 4 is $O(M n^2)$.

Our algorithm for computing the piecewise constant loss function for an instance $S$ is almost identical to the one for the merge function: it performs a depth-first traversal of the leaves of the execution tree for $\mathcal{A}^{\mathrm{metric}}(d_0, d_1)$, using Algorithm 4 to determine the children of each node. Detailed pseudocode for this approach is given in Algorithm 5 in Appendix B. The following theorem characterizes the overall running time of the algorithm.

Theorem 3. Let $S = \{x_1, \ldots, x_n\}$ be a clustering instance and $d_0$ and $d_1$ be any two metrics. Suppose that the execution tree of $\mathcal{A}^{\mathrm{metric}}(d_0, d_1)$ on $S$ has $E$ edges. Then the total running time of Algorithm 5 is $O(E n^2)$.
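For reference, the critical values solved for by both sweep-line procedures have the same closed form; this is simply the solution of the linear equations above stated explicitly, not an additional result. For the metric case,
$$(1-\beta)\,d_0(x,x') + \beta\,d_1(x,x') = (1-\beta)\,d_0(y,y') + \beta\,d_1(y,y') \quad\Longleftrightarrow\quad \beta = \frac{\Delta_0}{\Delta_0 - \Delta_1}, \qquad \Delta_p = d_p(x,x') - d_p(y,y'),$$
provided $\Delta_0 \neq \Delta_1$; if $\Delta_0 = \Delta_1$ the two lines are parallel and that pair contributes no critical value. The same formula with $D_0, D_1$ in place of $d_0, d_1$ gives the values $c_{kl}$ computed in Algorithm 2.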
Extension to general algorithm families. In this section we introduced efficient algorithm selection procedures for two special cases where the algorithm family has a single parameter. Working with a one-dimensional parameter space allowed us to compute partitions of the parameter space using efficient sweep-line procedures. The key additional challenge when working with higher-dimensional parameter spaces is that the partition of the parameter space given by each level of the execution tree becomes more complicated, defined by the sign-pattern partition induced by a collection of quadratic functions. Efficient execution-tree based algorithm selection procedures for our more general algorithm family will require an analog of the sweep-line algorithms described above that efficiently computes the children of a node in the execution tree, together with a minimal number of quadratic functions describing the partition.

It may also be promising to explore combinatorial-style algorithm selection procedures. As a consequence of Lemma 2, for any collection of $N$ clustering instances with at most $n$ points, we can find $O(N n^{4L'})$ quadratic functions whose sign-pattern partition divides the parameter space into polynomially many regions where the algorithm output is constant on every instance. It follows that we can find an empirically optimal parameter in polynomial time if we are able to identify a parameter from each region of the sign-pattern partition in polynomial time. For example, when the parameter space is $\mathbb{R}^2$, for each vertex $v$ of each region in the sign-pattern partition, there is a pair of quadratic functions such that $q_i(v) = q_j(v) = 0$. By solving this equation for all pairs of quadratic functions, we are able to find a parameter from each region. An interesting future direction is to develop efficient analogs for higher-dimensional parameter spaces.

4 EXPERIMENTS

In this section we evaluate the performance of our learning procedures when finding algorithms for application-specific clustering distributions. Our experiments demonstrate that the best algorithm for different applications varies greatly, and that in many cases we can obtain large gains in cluster quality using a mixture of base merge functions or metrics.

Experimental setup. In each experiment we define a distribution $\mathcal{D}$ over clustering tasks. For each clustering instance, the loss of the cluster tree output by a clustering algorithm is measured in terms of the loss $\ell(T, Y)$, which computes the Hamming distance between the target clustering and the closest pruning of the cluster tree. We draw $N$ sample clustering tasks from the given distribution and use the algorithms developed in Section 3 to exactly compute the average empirical loss for every algorithm in one algorithm family. The theoretical results from Section 2 ensure that these plots generalize to new samples from the same distribution, so our focus is on demonstrating empirical improvements in clustering loss obtained by learning the merge function or metric.

Clustering distributions. Most of our clustering distributions are generated from classification datasets by sampling a subset of the dataset and using the class labels as the target clustering. We briefly describe our instance distributions together with the metrics used for each below. Complete details for the distributions can be found in Appendix C.

MNIST Subsets. The MNIST dataset (LeCun et al., 1998) contains images of hand-written digits from 0 to 9. We generate a random clustering instance from this data by choosing k = 5 random digits and sampling 200 images from each digit, giving a total of n = 1000 images. We measure the distance between any pair of images using the Euclidean distance between their pixel intensities.
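The following sketch shows how such an instance distribution can be materialized from any labeled dataset; the array names (images, labels) and parameter defaults mirror the MNIST setup above but are otherwise our own assumptions.

```python
# Hedged sketch: draw one clustering instance (S, Y) from a labeled dataset by
# picking `k` random classes and `per_class` random examples from each, using
# the class labels as the target clustering.
import numpy as np

def sample_instance(images, labels, k=5, per_class=200, rng=None):
    rng = rng or np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=k, replace=False)
    S, Y = [], []
    for c in classes:
        idx = rng.choice(np.flatnonzero(labels == c), size=per_class, replace=False)
        start = len(S)
        S.extend(images[i] for i in idx)
        Y.append(set(range(start, start + per_class)))   # target cluster as indices
    return S, Y
```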
CIFAR-10 Subsets. We similarly generate clustering instances from the CIFAR-10 dataset (Krizhevsky, 2009). To generate an instance, we select k = 5 classes at random and then sample 50 images from each class, leading to a total of n = 250 images. We measure distances between examples using the cosine distance between feature embeddings extracted from a pre-trained Google Inception network (Szegedy et al., 2015).

Omniglot Subsets. The Omniglot dataset (Lake et al., 2015) contains written characters from 50 alphabets, with a total of 1623 different characters. To generate a clustering instance from the Omniglot data, we choose one of the alphabets at random, sample k from {5, . . . , 10} uniformly at random, choose k random characters from the alphabet, and include all 20 examples of those characters in the clustering instance. We use two metrics for the Omniglot data: first, cosine distances between neural network feature embeddings of the character images from a simplified version of AlexNet (Krizhevsky et al., 2012). Second, each character is also described by a "stroke", which is a sequence of coordinates $(x_t, y_t)_{t=1}^{T}$ describing the trajectory of the pen when writing the character. We hand-design a metric based on the stroke data: the distance between a pair of characters is the average distance from a point on either stroke to its nearest neighbor on the other stroke. A formal definition is given in the appendix.
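A minimal sketch of one natural reading of this stroke distance follows; the exact averaging and any resampling conventions live in the paper's appendix, so the details below are assumptions.

```python
# Hedged sketch of the hand-designed stroke metric: average, over every point
# of both strokes, the distance to its nearest neighbour on the other stroke.
import numpy as np

def stroke_distance(stroke_a, stroke_b):
    """stroke_a, stroke_b: arrays of shape (T, 2) of pen coordinates."""
    A = np.asarray(stroke_a, dtype=float)
    B = np.asarray(stroke_b, dtype=float)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # pairwise distances
    nearest = np.concatenate([d.min(axis=1), d.min(axis=0)])    # NN distance per point
    return nearest.mean()
```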
Places2 Subsets. The Places2 dataset consists of images of 365 different place categories, including "volcano", "gift shop", and "farm" (Zhou et al., 2017). To generate a clustering instance from the Places2 data, we choose k randomly from {5, . . . , 10}, choose k random place categories, and then select 20 random examples from each chosen category. We use two metrics for this data distribution. First, we use cosine distances between feature embeddings generated by a VGG16 network (Simonyan and Zisserman, 2015) pre-trained on ImageNet (Deng et al., 2009). Second, we compute color histograms in HSV space for each image and use the cosine distance between the histograms.

Places2 Diverse Subsets. We also construct an instance distribution from a subset of the Places2 classes which have diverse color histograms. We expect the color histogram metric to perform better on this distribution. To generate a clustering instance, we pick k = 4 classes from aquarium, discotheque, highway, iceberg, kitchen, lawn, stage-indoor, underwater ocean deep, volcano, and water tower. We include 50 randomly sampled images from each chosen class, leading to a total of n = 200 points per instance.

Synthetic Rings and Disks. We consider a two-dimensional synthetic distribution where each clustering instance has 4 clusters, two of which are ring-shaped and two of which are disk-shaped. To generate each instance we sample 100 points uniformly at random from each ring or disk. The two rings have radii 0.4 and 0.8, respectively, and are both centered at the origin. The two disks have radius 0.4 and are centered at (1.5, 0.4) and (1.5, -0.4), respectively. For this data, we measure distances between points in terms of the Euclidean distance between them.

Results. Learning the Merge Function. Figure 3 shows the average loss when interpolating between single and complete linkage as well as between average and complete linkage for each of the clustering instance distributions described above. For each value of the parameter α ∈ [0, 1], we report the average loss over N = 1000 i.i.d. instances drawn from the corresponding distribution. We see that the optimal parameters vary across different clustering instances. For example, when interpolating between single and complete linkage, the optimal parameters are α = 0.874 for MNIST, α = 0.98 for CIFAR-10, α = 0.179 for Rings and Disks, and α = 0.931 for Omniglot. Moreover, using the parameter that is optimal for one distribution on another would lead to significantly worse clustering performance. Next, we also see that for different distributions, it is possible to achieve non-trivial improvements over single, complete, and average linkage by interpolating between them. For example, on the Rings and Disks distribution we see an improvement of almost 0.2 error, meaning that an additional 20% of the data is correctly clustered.

Figure 3 (plots omitted; panels show Hamming cost versus α ∈ [0, 1] for MNIST, CIFAR-10, Rings and Disks, and Omniglot): Empirical loss for interpolating between single and complete linkage ("SC" in the legend) as well as between average and complete linkage ("AC" in the legend) over 1000 sampled clustering instances.

Learning the Metric. Next we consider learning the best metric for the Omniglot, Places2, and Places2 Diverse instance distributions. Each of these datasets is equipped with one hand-designed metric and one metric based on neural-network embeddings. The parameter β = 0 corresponds to the hand-designed metric, while β = 1 corresponds to the embedding. Figure 4 shows the empirical loss for each parameter β averaged over N = 4000 samples for each distribution. On all three distributions the neural network embedding performs better than the hand-designed metric, but we can achieve non-trivial performance improvements by mixing the two metrics. On Omniglot, the optimal parameter is at β = 0.514, which improves the Hamming error by 0.091, meaning that we correctly cluster nearly 10% more of the data. For the Places2 distribution we see an improvement of approximately 1% with the optimal parameter being β = 0.88, while for the Places2 Diverse distribution the improvement is approximately 3.4% with the optimal β being 0.87.

Figure 4 (plots omitted; panels (a) Omniglot, (b) Places2, (c) Places2 Diverse show Hamming cost versus β ∈ [0, 1]): Empirical loss interpolating between two distance metrics on the Omniglot, Places2, and Places2 Diverse distributions. In each plot, β = 0 corresponds to the hand-crafted metric and β = 1 corresponds to the neural network embedding.

Number of Discontinuities. The efficiency of our algorithm selection procedures stems from the fact that their running time scales with the true number of discontinuities in each loss function, rather than a worst-case upper bound. Of all the experiments we ran, interpolating between single and complete linkage for MNIST had the most discontinuities per loss function, with an average of 362.6 discontinuities per function. Given that these instances have n = 1000 points, this leads to a speedup of roughly $n^8 / 362.6 \approx 2.76 \times 10^{21}$ over the combinatorial algorithm that solves for all $O(n^8)$ critical points and runs the clustering algorithm once for each. Table 1 in Appendix C shows the average number of discontinuities per loss function for all of the above experiments.

5 CONCLUSION

In this work we study both the sample and algorithmic complexity of learning linkage-based clustering algorithms with low loss for specific application domains. We give strong bounds on the number of sample instances required from an application domain in order to find an approximately optimal algorithm from a rich family of algorithms that allows us to vary both the metric and merge function used by the algorithm. We complement our sample complexity results with efficient algorithms for finding empirically optimal algorithms for a sample of instances.
Finally, we carry out experiments on both real-world and synthetic clustering domains demonstrating that our procedures can often find algorithms that significantly outperform standard linkage-based clustering algorithms. An important future direction for learning linkage-based clustering algorithms is the design of efficient algorithm selection procedures for higher-dimensional parameter spaces.

ACKNOWLEDGEMENTS

This work was supported in part by NSF grants CCF-1535967, IIS-1618714, IIS-1901403, CCF-1910321, SES-1919453, an Amazon Research Award, a Bloomberg Research Grant, a Microsoft Research Faculty Fellowship, and by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program.

REFERENCES

Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. In Information Processing Letters, 2012.
Pranjal Awasthi, Maria-Florina Balcan, and Konstantin Voevodski. Local algorithms for interactive clustering. In ICML, 2014.
Maria-Florina Balcan and Yingyu Liang. Clustering under perturbation resilience. In SIAM Journal on Computing, 2016.
Maria-Florina Balcan, Vaishnavh Nagarajan, Ellen Vitercik, and Colin White. Learning-theoretic foundations of algorithm configuration for combinatorial partitioning problems. In Proceedings of the Conference on Learning Theory (COLT), 2017.
Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik. Learning to branch. In ICML, 2018a.
Maria-Florina Balcan, Travis Dick, and Colin White. Data-driven clustering via parameterized Lloyd's families. In NeurIPS, 2018b.
Maria-Florina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and Ellen Vitercik. How much data is sufficient to learn high-performing algorithms? arXiv preprint arXiv:1908.02894, 2019.
Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.
R. C. Buck. Partition of space. Amer. Math. Monthly, 50:541-544, 1943. ISSN 0002-9890.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
Anna Grosswendt and Heiko Roeglin. Improved analysis of complete linkage clustering. In European Symposium on Algorithms, 2015.
Rishi Gupta and Tim Roughgarden. A PAC approach to application-specific algorithm selection. SIAM Journal on Computing, 46(3):992-1017, 2017.
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015. doi: 10.1126/science.aab3050.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.
Pascal Massart. Some applications of concentration inequalities to statistics. 9(2):245-303, 2000.
Mehreen Saeed, Onaiza Maqbool, Haroon Atique Babri, Syed Zahoor Hassan, and S. Mansoor Sarwar. Software clustering techniques and the use of combined algorithm. In European Conference on Software Maintenance and Reengineering, 2003.
S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
Karen Simonyan and Andrew Zisserman.
Very deep convolutional networks for large-scale image recognition. ICLR, 2015. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015. James R. White, Saket Navlakha, Niranjan Nagarajan, Mohammad-Reza Ghodsi, Carl Kingsford, and Mihai Pop. Alignment and clustering of phylogenetic markers implications for microbial diversity studies. In BCM Bioinformatics, 2010. Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. A APPENDIX FOR LEARNING CLUSTERING ALGORITHMS We begin by providing complete proofs for the piecewise structural Lemmas from the main body. Lemma 1. Fix any metrics d1, . . . , d L and a clustering instance S X. There exists a set H of O(|S|4) linear functions mapping RL to R with the following property: if two metric parameters β, β L belong to the same region in the sign-pattern partition induced by H, then the ordering over pairs of points in S given by dβ and dβ are the same. That is, for all points a, b, a , b S we have dβ(a, b) dβ(a , b ) iff dβ (a, b) dβ (a , b ). Proof. Let S be any clustering instance and fix points a, b, a , b S. For any parameter β L, by definition of dβ, we have that dβ(a, b) dβ(a , b ) i=1 βidi(a, b) i=1 βidi(a , b ) i=1 βi(di(a, b) di(a , b )) 0. Define the linear function ha,b,a ,b (β) = PL i=1 βi(di(a, b) di(a , b )). Then we have that dβ(a, b) dβ(a , b ) if ha,b,a ,b (β) 0 and dβ(a, b) > dβ(a , b ) if ha,b,a ,b (β) > 0. Let H = {ha,b,a ,b | a, b, a , b S} be the collection of all such linear functions collected over all possible subsets of 4 points in S. Now suppose that β and β belong to the same region in the sign-pattern partition induced by H. For any points a, b, a , b S, we are guaranteed that sign(ha,b,a ,b (β)) = sign(ha,b,a ,b (β )), which by the above arguments imply that dβ(a, b) dβ(a , b ) iff dβ (a, b) dβ (a , b ), as required. Lemma 2. Fix any metrics d1, . . . , d L, any 2-point-based merge functions D1, . . . , DL , and clustering instance S X. There exists a set Q of O(|S|4L ) quadratic functions defined on RL +L so that if parameters (α, β) and (α , β ) belong to the same region of the sign-pattern partition induced by Q, then the ordering over pairs of clusters in S given by Dα( , ; dβ) and Dα ( , ; dβ ) is the same. That is, for all clusters A, B, A , B S, we have that Dα(A, B; dβ) Dα(A , B ; dβ) iff Dα (A, B; dβ ) Dα (A , B ; dβ ). Proof. From Lemma 1, we know we can find a set H of O(|S|4) linear functions defined on RL that induce a signpattern partition of the β parameter space L RL into regions where the ordering over pairs of points according to the dβ distance is constant. Now let Z L be any region of the sign-pattern partition of L induced by H. From Lemma 1, we know that for all parameters β Z, the ordering over pairs of points in S according to dβ is fixed. For any 2-point-based merge function, the pair of points used to measure the distance between a pair of clusters depends only on the ordering of pairs of points according to distance. Therefore, since D1, . . . , DL are all 2-point-based, we know that for any pair of clusters (A, B) and each merge function index i [L ], there exists a pair of points (ai, bi) A B such that Di(A, B; dβ) = dβ(ai, bi) for all β Z. 
In other words, all of the merge functions measure distances between A and B using a fixed pair of points for all values of the metric parameter β in the region Z. Similarly, let A , B S be any other pair of clusters and (a i, b i) A B be the pairs of points defining Di(A , B ; dβ) for each i [L ]. Then for all β Z, we have that Dα(A, B; dβ) Dα(A , B ; dβ) i=1 αi Di(A, B; dβ) i=1 αi Di(A, B; dβ) j=1 βjdj(ai, bi) j=1 βjdj(a i, b i) j=1 αiβj dj(ai, bi) dj(a i, b i) 0. Now define the quadratic function q A,B,A ,B (α, β) = j=1 αiβj dj(ai, bi) dj(a i, b i) . (1) For all β Z, we are guaranteed that Dα(A, B; dβ) Dα(A , B ; dβ) if and only if q A,B,A ,B (α, β) 0. Notice that the coefficients of q A,B,A ,B only depend on 4L points in S, which implies that if we collect these quadratic functions over all quadruples of clusters A, B, A , B S, we will only obtain O(|S|4L ) different quadratic functions. These O(|S|4L ) functions induce a sign-pattern partition of L Z for which the desired conclusion holds. Next, observe that the coefficients in the quadratic functions defined above do not depend on the region Z we started with. It follows that the same set of O(|S|4L ) quadratic functions partition any other region Z in the sign-pattern partition induced by H so that the claim holds on L Z . Now let Q contain the linear functions in H (viewed as quadratic functions over RL +L by placing a zero coefficient on all quadratic terms and terms depending on α), together with the O(|S|4L ) quadratic functions defined above. Then we have that |Q| = O(|S|4 + |S|4L ) = O(|S|4L ). Now suppose that (α, β) and (α , β ) belong to the same region of the sign-pattern partition of L L RL +L induced by the quadratic functions Q. Since Q contains H, this implies that β and β belong to the same region Z in the sign-pattern partition induced by H. Moreover, since Q contains all the quadratic functions defined in (1), it follows that Dα(A, B; dβ) Dα(A , B ; dβ) if and only if Dα (A, B; dβ ) Dα (A , B ; dβ ), as required. Our proof of Theorem 1 depends crucially on the number of regions in the sign-pattern partition induced by a collection of functions. Based on the work of Buck (1943), we know that when the functions f1, . . . , f M are linear, then the number of regions is at most (3M)p a substantial improvement over the naive bound of 2M. We can also leverage this result to bound the number of regions by (3M)p2+p when the functions are quadratic by viewing each quadratic function as a linear function on a (p2 + p)-dimensional space. The following result summarizes these facts. Lemma 7. Let T be the number of regions in the sign-pattern partition of Rp induced by f1, . . . , f M : Rp R. The following statements hold. 1. If the functions are linear, i.e., fi(ζ) = w i ζ + ri for wi Rp and ri R, then T (3M)p. 2. If the functions are quadratic, i.e., fi(ζ) = ζ Qiζ + w i ζ + ri for Qi Rp p, wi Rp and ri R, then T (3M)p2+p. Proof. The proof of the first statement follows from the work of Buck (1943) which shows that if H is a collection of M hyperplanes in Rp, then the number of connected components of Rp \ H is at most (3M)p. To connect this to the sign-pattern partitioning induced by the collection of linear functions f1, . . . , f M, let H be the set of M hyperplanes defined by {ζ Rp | fi(ζ) = 0} for i [M]. The connected components of Rp \ H correspond exactly to the sign-pattern partition of the functions f1, . . . 
Our proof of Theorem 1 depends crucially on the number of regions in the sign-pattern partition induced by a collection of functions. Based on the work of Buck (1943), we know that when the functions $f_1, \dots, f_M$ are linear, the number of regions is at most $(3M)^p$, a substantial improvement over the naive bound of $2^M$. We can also leverage this result to bound the number of regions by $(3M)^{p^2+p}$ when the functions are quadratic, by viewing each quadratic function as a linear function on a $(p^2+p)$-dimensional space. The following result summarizes these facts.

Lemma 7. Let $T$ be the number of regions in the sign-pattern partition of $\mathbb{R}^p$ induced by $f_1, \dots, f_M : \mathbb{R}^p \to \mathbb{R}$. The following statements hold.
1. If the functions are linear, i.e., $f_i(\zeta) = w_i^\top \zeta + r_i$ for $w_i \in \mathbb{R}^p$ and $r_i \in \mathbb{R}$, then $T \le (3M)^p$.
2. If the functions are quadratic, i.e., $f_i(\zeta) = \zeta^\top Q_i \zeta + w_i^\top \zeta + r_i$ for $Q_i \in \mathbb{R}^{p \times p}$, $w_i \in \mathbb{R}^p$, and $r_i \in \mathbb{R}$, then $T \le (3M)^{p^2+p}$.

Proof. The first statement follows from the work of Buck (1943), which shows that if $H$ is a collection of $M$ hyperplanes in $\mathbb{R}^p$, then the number of connected components of $\mathbb{R}^p \setminus H$ is at most $(3M)^p$. To connect this to the sign-pattern partition induced by the collection of linear functions $f_1, \dots, f_M$, let $H$ be the set of $M$ hyperplanes defined by $\{\zeta \in \mathbb{R}^p \mid f_i(\zeta) = 0\}$ for $i \in [M]$. The connected components of $\mathbb{R}^p \setminus H$ correspond exactly to the sign-pattern partition of the functions $f_1, \dots, f_M$, and by the result of Buck (1943), it follows that the number of regions in the partition is at most $(3M)^p$. Equivalently, we know that $|\{F(\zeta) \mid \zeta \in \mathbb{R}^p\}| \le (3M)^p$, where $F(\zeta) = (\mathrm{sign}(f_1(\zeta)), \dots, \mathrm{sign}(f_M(\zeta)))$ is the sign-pattern function. That is, $F$ takes at most $(3M)^p$ distinct values.

Next we use the first statement to prove the second statement. Let $\phi : \mathbb{R}^p \to \mathbb{R}^{p^2+p}$ be the function that maps a vector $\zeta$ to the vector containing all products of pairs of elements of $\zeta$, concatenated to the beginning of $\zeta$:
$$\phi(\zeta) = (\zeta_1\zeta_1, \dots, \zeta_1\zeta_p, \dots, \zeta_p\zeta_1, \dots, \zeta_p\zeta_p, \zeta_1, \dots, \zeta_p).$$
Let $q(\zeta) = \zeta^\top Q \zeta + w^\top \zeta + r$ be a quadratic function, where $Q \in \mathbb{R}^{p \times p}$, $w \in \mathbb{R}^p$, and $r \in \mathbb{R}$. Now let $v = (q_{11}, \dots, q_{1p}, \dots, q_{p1}, \dots, q_{pp}, w_1, \dots, w_p)$. Then we have that $q(\zeta) = v^\top \phi(\zeta) + r$. In other words, $q$ is a linear function of $\phi(\zeta)$ of the form considered in the first statement. This guarantees that we can find $M$ linear functions $h_1, \dots, h_M : \mathbb{R}^{p^2+p} \to \mathbb{R}$ such that $f_i(\zeta) = h_i(\phi(\zeta))$ for all $i \in [M]$. Let $\mathcal{H} : \mathbb{R}^{p^2+p} \to \{\pm 1\}^M$ be the sign-pattern function for $h_1, \dots, h_M$. Then we have
$$\{F(\zeta) \mid \zeta \in \mathbb{R}^p\} = \{\mathcal{H}(\phi(\zeta)) \mid \zeta \in \mathbb{R}^p\} \subseteq \{\mathcal{H}(z) \mid z \in \mathbb{R}^{p^2+p}\},$$
and the last set has size at most $(3M)^{p^2+p}$ by the first statement for linear functions. It follows that the number of regions in the sign-pattern partition induced by $M$ quadratic functions is at most $(3M)^{p^2+p}$.
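A quick numerical sanity check of the first statement (ours, not part of the proof): sampling points densely and counting the distinct sign vectors they realize gives a lower bound on the number of regions, which stays far below $2^M$ and within the $(3M)^p$ bound.

```python
# Empirical check of Lemma 7, statement 1: M random hyperplanes in R^p induce
# far fewer sign patterns than the naive 2^M bound.  The number of distinct
# sign vectors observed on a dense sample lower-bounds the number of regions,
# which Lemma 7 upper-bounds by (3M)^p.
import numpy as np

rng = np.random.default_rng(0)
p, M = 2, 10
W = rng.normal(size=(M, p))                    # coefficients w_i
r = rng.normal(size=M)                         # offsets r_i

Z = rng.uniform(-10, 10, size=(200_000, p))    # dense sample of the parameter space
signs = np.sign(Z @ W.T + r)                   # one length-M sign vector per sample
patterns = {tuple(row) for row in signs}
print(f"{len(patterns)} sign patterns observed;  2^M = {2**M},  (3M)^p = {(3*M)**p}")
```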
With this, we are ready to prove Theorem 1.

Theorem 1. Fix any metrics $d_1, \dots, d_L$, 2-point-based merge functions $D_1, \dots, D_{L'}$, and distribution $\mathcal{D}$ over clustering instances with at most $n$ points. For any parameters $\epsilon > 0$ and $\delta > 0$, let $(S_1, Y_1), \dots, (S_N, Y_N)$ be an i.i.d. sample of
$$N = O\!\left(\frac{1}{\epsilon^2} (L + L')^2 L' \log\frac{(L + L')^2 L' \, n}{\delta}\right) = \tilde{O}\!\left(\frac{(L + L')^2 L'}{\epsilon^2}\right)$$
clustering instances with target clusterings drawn from $\mathcal{D}$. Then with probability at least $1 - \delta$ over the draw of the sample, we have
$$\sup_{(\alpha, \beta) \in \Delta_{L'} \times \Delta_L} \left| \frac{1}{N} \sum_{i=1}^{N} \ell(A_{\alpha,\beta}(S_i), Y_i) - \mathbb{E}_{(S,Y) \sim \mathcal{D}}\bigl[\ell(A_{\alpha,\beta}(S), Y)\bigr] \right| \le \epsilon.$$

Proof of Theorem 1. Our goal is to provide a uniform convergence guarantee that ensures the sample loss of any pair of parameters $(\alpha, \beta)$ is close to its expected loss on the underlying distribution. Towards that end, define the family of loss functions $\mathcal{F} = \{f_{\alpha,\beta} : (S, Y) \mapsto \ell(A_{\alpha,\beta}(S), Y) \mid (\alpha, \beta) \in \Delta_{L'} \times \Delta_L\}$, where the function $f_{\alpha,\beta}$ fixes the algorithm parameters and maps each clustering instance $S$ with target clustering $Y$ to that algorithm's loss when run on $(S, Y)$. We will bound the empirical Rademacher complexity of $\mathcal{F}$ to prove uniform convergence guarantees. Let $\mathcal{S} = \{(S_1, Y_1), \dots, (S_N, Y_N)\}$ be a sample of clustering instances with target clusterings drawn from the distribution $\mathcal{D}$. The empirical Rademacher complexity of the class of functions $\mathcal{F}$ on the sample $\mathcal{S}$ is given by
$$\hat{R}(\mathcal{F}, \mathcal{S}) = \frac{1}{N} \, \mathbb{E}_\sigma\!\left[\sup_{(\alpha,\beta) \in \Delta_{L'} \times \Delta_L} \sum_{i=1}^{N} \sigma_i f_{\alpha,\beta}(S_i, Y_i)\right],$$
where $\sigma$ is a vector of $N$ independent Rademacher random variables. From Lemma 2, we know that we can find a collection of $O(n^{4L'})$ quadratic functions for each clustering instance $S_i$ so that the output of Algorithm 1 when run on $S_i$ is constant on each region of the induced sign-pattern partition of the parameter space. Let $Q$ be the collection of $O(N n^{4L'})$ quadratic functions collected across all $N$ clustering instances $S_1, \dots, S_N$. Whenever two parameter settings $(\alpha, \beta)$ and $(\alpha', \beta')$ belong to the same region in the sign-pattern partition induced by $Q$, we know that for all instance indices $i$, we have $f_{\alpha,\beta}(S_i, Y_i) = f_{\alpha',\beta'}(S_i, Y_i)$. Applying the second statement of Lemma 7, we know that this partition has at most $T = (3|Q|)^{(L'+L)^2 + (L'+L)}$ regions. This implies that we can find a collection of $T$ parameter vectors $(\alpha_1, \beta_1), \dots, (\alpha_T, \beta_T)$ (by taking one from each region of the partition) so that the following holds: for every $(\alpha, \beta) \in \Delta_{L'} \times \Delta_L$, there exists a $j \in [T]$ such that for all clustering instances $(S_i, Y_i)$ in the sample $\mathcal{S}$ we have $f_{\alpha,\beta}(S_i, Y_i) = f_{\alpha_j,\beta_j}(S_i, Y_i)$. With this, we can bound the empirical Rademacher complexity as follows:
$$\hat{R}(\mathcal{F}, \mathcal{S}) = \frac{1}{N} \, \mathbb{E}_\sigma\!\left[\sup_{(\alpha,\beta)} \sum_{i=1}^{N} \sigma_i f_{\alpha,\beta}(S_i, Y_i)\right] = \frac{1}{N} \, \mathbb{E}_\sigma\!\left[\max_{j \in [T]} \sum_{i=1}^{N} \sigma_i f_{\alpha_j,\beta_j}(S_i, Y_i)\right].$$
Now, using the fact that the losses take values in $[0, 1]$ and the supremum is over only $T$ distinct elements, we can apply Massart's Lemma (Massart, 2000) to obtain $\hat{R}(\mathcal{F}, \mathcal{S}) = O(\sqrt{\log(T)/N})$. Taking logs and simplifying our bound on $T$, we have that $\log(T) = O\bigl((L + L')^2 L' \log(Nn)\bigr)$, giving the following bound on the empirical Rademacher complexity of $\mathcal{F}$:
$$\hat{R}(\mathcal{F}, \mathcal{S}) = O\!\left(\sqrt{\frac{(L + L')^2 L' \log(Nn)}{N}}\right).$$
Next, from standard Rademacher complexity bounds (Bartlett and Mendelson, 2002), we are guaranteed that with probability at least $1 - \delta$ we have that
$$\sup_{(\alpha,\beta) \in \Delta_{L'} \times \Delta_L} \left| \frac{1}{N} \sum_{i=1}^{N} f_{\alpha,\beta}(S_i, Y_i) - \mathbb{E}_{(S,Y) \sim \mathcal{D}}\bigl[f_{\alpha,\beta}(S, Y)\bigr] \right| = O\!\left(\sqrt{\frac{(L + L')^2 L' \log(Nn)}{N}} + \sqrt{\frac{\log(1/\delta)}{N}}\right) = O\!\left(\sqrt{\frac{(L + L')^2 L' \log(Nn) + \log(1/\delta)}{N}}\right),$$
where the last equality follows from the fact that for any $x, y \ge 0$ we have $\sqrt{x} + \sqrt{y} \le \sqrt{2(x + y)}$. Since the right hand side goes to zero as $N$ grows, it remains to choose $N$ sufficiently large that the right hand side is bounded by $\epsilon$. Let $A = (L + L')^2 L'$ and $B = (L + L')^2 L' \log(n) + \log(1/\delta)$, so that the error bound is $O\bigl(\sqrt{(A \log N + B)/N}\bigr)$. We are guaranteed that $\sqrt{(A \log N + B)/N} \le \epsilon$ whenever $N \ge \frac{1}{\epsilon^2}(A \log N + B)$. Using the fact that for $a \ge 1$ and $b \ge 0$, $x \ge 4a \log(2a) + 2b$ implies $x \ge a \log(x) + b$ (e.g., see Lemma A.2 in Shalev-Shwartz and Ben-David (2014)), it follows that when $N \ge \frac{4}{\epsilon^2}\bigl(A \log\frac{2A}{\epsilon^2} + B\bigr)$, we also have $\sqrt{(A \log N + B)/N} \le \epsilon$. Substituting $A$ and $B$ and simplifying the expression completes the proof.

B APPENDIX FOR EFFICIENT ALGORITHM SELECTION

B.1 LEARNING THE MERGE FUNCTION

In this section we provide details for learning the best combination of two merge functions. We also give detailed pseudocode for our sweepline algorithm for finding the children of a node in the execution tree (see Algorithm 2) and for the complete algorithm (see Algorithm 3).

Algorithm 2 Find all merges for A_merge(D_0, D_1)
Input: Set of clusters C_1, ..., C_m, merge functions D_0, D_1, parameter interval [α_lo, α_hi).
1. Let M = ∅ be the initially empty set of possible merges.
2. Let I = ∅ be the initially empty set of parameter intervals.
3. Let α = α_lo.
4. While α < α_hi:
   (a) Let C_i, C_j be the pair of clusters minimizing (1 − α) · D_0(C_i, C_j) + α · D_1(C_i, C_j).
   (b) For each k, l ∈ [m], let c_kl = Δ_0 / (Δ_0 − Δ_1), where Δ_p = D_p(C_i, C_j) − D_p(C_k, C_l) for p ∈ {0, 1}.
   (c) Let c = min( {c_kl | c_kl > α} ∪ {α_hi} ).
   (d) Add merge (C_i, C_j) to M and [α, c) to I.
   (e) Set α = c.
5. Return M and I.

Algorithm 3 Depth-first Enumeration of α-linkage Execution Tree
Input: Point set x_1, ..., x_n, merge functions D_0 and D_1.
1. Let r be the root node of the execution tree with r.N = {(x_1), ..., (x_n)} and r.I = [0, 1].
2. Let s be a stack of execution tree nodes, initially containing the root r.
3. Let T = ∅ be the initially empty set of possible cluster trees.
4. Let I = ∅ be the initially empty set of intervals.
5. While the stack s is not empty:
   (a) Pop execution tree node e off stack s.
   (b) If e.N has a single cluster, add e.N to T and e.I to I.
   (c) Otherwise, for each merge (C_i, C_j) and interval I_c returned by Algorithm 2 run on e.N and e.I:
      i. Let c be a new node with state given by e.N after merging C_i and C_j, and c.I = I_c.
      ii. Push c onto the stack s.
6. Return T and I.
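The following Python sketch (ours; D0 and D1 are assumed to be callables on pairs of clusters, and all names are illustrative) mirrors Algorithm 2: it sweeps α upward, recording each preferred merge and the sub-interval of [α_lo, α_hi) on which it is preferred, using the critical values c_kl = Δ_0 / (Δ_0 − Δ_1).

```python
# A sketch of the sweepline in Algorithm 2, assuming D0 and D1 take a pair of
# clusters and return a nonnegative distance.
import itertools

def find_all_merges(clusters, D0, D1, alpha_lo, alpha_hi, eps=1e-12):
    merges, intervals = [], []
    alpha = alpha_lo
    while alpha < alpha_hi:
        pairs = list(itertools.combinations(range(len(clusters)), 2))

        # (a) closest pair of clusters under D_alpha = (1 - alpha) D0 + alpha D1
        def D_alpha(pair):
            k, l = pair
            return ((1 - alpha) * D0(clusters[k], clusters[l])
                    + alpha * D1(clusters[k], clusters[l]))
        i, j = min(pairs, key=D_alpha)
        d0_ij, d1_ij = D0(clusters[i], clusters[j]), D1(clusters[i], clusters[j])

        # (b)-(c) smallest critical value c_kl > alpha at which some other pair overtakes (i, j)
        c = alpha_hi
        for k, l in pairs:
            delta0 = d0_ij - D0(clusters[k], clusters[l])
            delta1 = d1_ij - D1(clusters[k], clusters[l])
            if abs(delta0 - delta1) > eps:
                c_kl = delta0 / (delta0 - delta1)
                if alpha < c_kl < c:
                    c = c_kl

        # (d)-(e) record the merge and its interval, then advance the sweep
        merges.append((i, j))
        intervals.append((alpha, c))
        alpha = c
    return merges, intervals
```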
Lemma 3. For any merge functions $D_0$ and $D_1$ and any clustering instance $S$, the execution tree for $A_{\mathrm{merge}}(D_0, D_1)$ when run on $S$ is well defined. That is, there exists a partition tree such that for any node $v$ at depth $t$, the same sequence of first $t$ merges is performed by $A_{\mathrm{merge}}^{\alpha}$ for all $\alpha$ in node $v$'s interval.

Proof. The proof is by induction on the depth $t$. The base case is depth $t = 0$, in which case we can use a single node whose interval is $[0, 1]$. Since all algorithms in the family start with an empty sequence of merges, this satisfies the execution tree property. Now suppose that there is a tree of depth $t$ with the execution tree property. If $t = |S| - 1$ then we are finished, since the algorithms in $A_{\mathrm{merge}}(D_0, D_1)$ make exactly $|S| - 1$ merges. Otherwise, consider any leaf node $v$ of the depth-$t$ tree with parameter interval $I_v$. It is sufficient to show that we can partition $I_v$ into subintervals such that for $\alpha$ in each subinterval the next merge performed is constant. By the inductive hypothesis, we know that the first $t$ merges made by $A_{\mathrm{merge}}^{\alpha}$ are the same for all $\alpha \in I_v$. After performing these merges, the algorithm will have arrived at some set of clusters $C_1, \dots, C_m$ with $m = |S| - t$. For each pair of clusters $C_i$ and $C_j$, the distance $D_\alpha(C_i, C_j) = (1 - \alpha) \cdot D_0(C_i, C_j) + \alpha \cdot D_1(C_i, C_j)$ is a linear function of the parameter $\alpha$. Therefore, for any clusters $C_i$, $C_j$, $C_k$, and $C_l$, the algorithm will prefer to merge $C_i$ and $C_j$ over $C_k$ and $C_l$ for a (possibly empty) sub-interval of $I_v$, corresponding to the values of $\alpha \in I_v$ where $D_\alpha(C_i, C_j) < D_\alpha(C_k, C_l)$. For any fixed pair of clusters $C_i$ and $C_j$, taking the intersection of these intervals over all other pairs $C_k$ and $C_l$ guarantees that clusters $C_i$ and $C_j$ will be merged exactly for parameter values in some subinterval of $I_v$. For each merge with a non-empty parameter interval, we can introduce a child node of $v$ labeled by that parameter interval. These children partition $I_v$ into intervals where the next merge is constant, as required.

Lemma 4. Let $C_1, \dots, C_m$ be a collection of clusters, $D_0$ and $D_1$ be any pair of merge functions, and $[\alpha_{lo}, \alpha_{hi})$ be a subset of the parameter space. If there are $M$ distinct cluster pairs $C_i, C_j$ that minimize $D_\alpha(C_i, C_j)$ for values of $\alpha \in [\alpha_{lo}, \alpha_{hi})$, then the running time of Algorithm 2 is $O(M m^2 K)$, where $K$ is the cost of evaluating the merge functions $D_0$ and $D_1$.

Proof. The loop in step 4 of Algorithm 2 runs once for each possible merge, giving a total of $M$ iterations. Each iteration finds the closest pair of clusters according to $D_\alpha$ using $O(m^2)$ evaluations of the merge functions $D_0$ and $D_1$. Calculating the critical parameter value $c$ involves solving $O(m^2)$ linear equations whose coefficients are determined by four evaluations of $D_0$ and $D_1$. It follows that the cost of each iteration is $O(m^2 K)$, where $K$ is the cost of evaluating $D_0$ and $D_1$, and the overall running time is $O(M m^2 K)$.

Theorem 2. Let $S = \{x_1, \dots, x_n\}$ be a clustering instance and $D_0$ and $D_1$ be any two merge functions. Suppose that the execution tree of $A_{\mathrm{merge}}(D_0, D_1)$ on $S$ has $E$ edges. Then the total running time of Algorithm 3 is $O(E n^2 K)$, where $K$ is the cost of evaluating $D_0$ and $D_1$ once.

Proof. Fix any node $v$ in the execution tree with $m$ clusters $C_1, \dots, C_m$ and $M$ outgoing edges (i.e., $M$ possible merges from the state represented by $v$). We run Algorithm 2 to determine the children of $v$, which by Lemma 4 costs $O(M n^2 K)$, since $m \le n$. Summing over all non-leaf nodes of the execution tree, the total cost is $O(E n^2 K)$. In addition to computing the children of a given node, we need to construct the child nodes, but this takes constant time per child.
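For completeness, here is a minimal sketch (ours) of the depth-first enumeration in Algorithm 3, reusing find_all_merges from the sketch above; it records final partitions and their parameter intervals rather than full cluster trees, so it illustrates the control flow rather than the exact implementation.

```python
# A sketch of Algorithm 3's depth-first enumeration.  Each stack entry is an
# execution tree node: a clustering state together with the parameter interval
# on which the algorithm family reaches that state.
def enumerate_execution_tree(points, D0, D1):
    root = ([frozenset([i]) for i in range(len(points))], (0.0, 1.0))
    stack, leaves = [root], []
    while stack:
        clusters, (lo, hi) = stack.pop()
        if len(clusters) == 1:                       # a leaf of the execution tree
            leaves.append((clusters, (lo, hi)))
            continue
        merges, intervals = find_all_merges(clusters, D0, D1, lo, hi)
        for (i, j), interval in zip(merges, intervals):
            merged = [c for k, c in enumerate(clusters) if k not in (i, j)]
            merged.append(clusters[i] | clusters[j])
            stack.append((merged, interval))
    return leaves
```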
B.2 LEARNING THE METRIC

In this section we provide details for learning the best combination of two metrics. We also give detailed pseudocode for our sweepline algorithm for finding the children of a node in the execution tree (see Algorithm 4) and for the complete algorithm (see Algorithm 5).

Algorithm 4 Find all merges for A_metric(d_0, d_1)
Input: Set of clusters C_1, ..., C_m, metrics d_0, d_1, parameter interval [β_lo, β_hi).
1. Let M = ∅ be the initially empty set of possible merges.
2. Let I = ∅ be the initially empty set of parameter intervals.
3. Let β = β_lo.
4. While β < β_hi:
   (a) Let C_i, C_j be the pair of clusters minimizing max over a ∈ C_i, b ∈ C_j of d_β(a, b).
   (b) Let x ∈ C_i and x' ∈ C_j be the farthest pair of points between C_i and C_j.
   (c) For all pairs of points y and y' belonging to different clusters, let c_yy' = δ_0 / (δ_0 − δ_1), where δ_p = d_p(y, y') − d_p(x, x') for p ∈ {0, 1}.
   (d) Let c = min( {c_yy' | c_yy' > β} ∪ {β_hi} ).
   (e) Add merge (C_i, C_j) to M and [β, c) to I.
   (f) Set β = c.
5. Return M and I.

Algorithm 5 Depth-first Enumeration of β-linkage Execution Tree
Input: Point set x_1, ..., x_n, base metrics d_0 and d_1.
1. Let r be the root node of the execution tree with r.N = {(x_1), ..., (x_n)} and r.I = [0, 1].
2. Let s be a stack of execution tree nodes, initially containing the root r.
3. Let T = ∅ be the initially empty set of possible cluster trees.
4. Let I = ∅ be the initially empty set of intervals.
5. While the stack s is not empty:
   (a) Pop execution tree node e off stack s.
   (b) If e.N has a single cluster, add e.N to T and e.I to I.
   (c) Otherwise, for each merge (C_i, C_j) and interval I_c returned by Algorithm 4 run on e.N and e.I:
      i. Let c be a new node with state given by e.N after merging C_i and C_j, and c.I = I_c.
      ii. Push c onto the stack s.
6. Return T and I.

Lemma 5. For any metrics $d_0$ and $d_1$ and any clustering instance $S$, the execution tree for the family $A_{\mathrm{metric}}(d_0, d_1)$ when run on $S$ is well defined. That is, there exists a partition tree such that for any node $v$ at depth $t$, the same sequence of first $t$ merges is performed by $A_{\mathrm{metric}}^{\beta}$ for all $\beta$ in node $v$'s interval.

Proof. The proof is by induction on the depth $t$. The base case is depth $t = 0$, in which case we can use a single node whose interval is $[0, 1]$. Since all algorithms in the family start with an empty sequence of merges, this satisfies the execution tree property. Now suppose that there is a tree of depth $t$ with the execution tree property. If $t = |S| - 1$ then we are finished, since the algorithms in $A_{\mathrm{metric}}(d_0, d_1)$ make exactly $|S| - 1$ merges. Otherwise, consider any leaf node $v$ of the depth-$t$ tree with parameter interval $I_v$. It is sufficient to show that we can partition $I_v$ into subintervals such that for $\beta$ in each subinterval the next merge performed is constant. By the inductive hypothesis, we know that the first $t$ merges made by $A_{\mathrm{metric}}^{\beta}$ are the same for all $\beta \in I_v$. After performing these merges, the algorithm will have arrived at some set of clusters $C_1, \dots, C_m$ with $m = |S| - t$. Recall that algorithms in the family $A_{\mathrm{metric}}(d_0, d_1)$ run complete linkage using the metric $d_\beta$. Complete linkage can be implemented in such a way that it only makes comparisons between pairwise point distances (i.e., is $d_\beta(x, x')$ larger or smaller than $d_\beta(y, y')$?). To see this, for any pair of clusters, we can find the farthest pair of points between them using only distance comparisons. And, once we have the farthest pair of points between all pairs of clusters, we can find the pair of clusters to merge by again making only pairwise comparisons. It follows that if two parameters $\beta$ and $\beta'$ have the same outcome for all pairwise distance comparisons, then the next merge to be performed must be the same. We use this observation to partition the interval $I_v$ into subintervals where the next merge is constant. For any pair of points $x, x' \in S$, the distance $d_\beta(x, x') = (1 - \beta) \cdot d_0(x, x') + \beta \cdot d_1(x, x')$ is a linear function of the parameter $\beta$. Therefore, for any points $x, x', y, y' \in S$, there is at most one critical parameter value where the relative order of $d_\beta(x, x')$ and $d_\beta(y, y')$ changes. Between these $O(|S|^4)$ critical parameter values, the ordering over all pairwise distances is constant, and the next merge performed by the algorithm is also constant. Therefore, there must exist a partitioning of $I_v$ into at most $O(|S|^4)$ subintervals such that the next merge is constant on each subinterval. We let the children of $v$ correspond to the coarsest such partition.
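As a small illustration of the comparison-only argument above (ours, not the paper's code), the next complete-linkage merge under $d_\beta$ can be computed from the two precomputed base distance matrices using nothing but comparisons of pairwise $d_\beta$ values.

```python
# A sketch supporting the argument in Lemma 5: with the two base distance
# matrices precomputed, d_beta(x, x') = (1 - beta) d0[x, x'] + beta d1[x, x']
# is available in constant time, and the next complete-linkage merge is
# determined purely by comparisons between such pairwise distances.
import itertools
import numpy as np

def next_complete_linkage_merge(clusters, d0, d1, beta):
    """Return (i, j, x, y): the cluster pair merged next and the farthest pair
    of points realizing their complete-linkage distance under d_beta."""
    d_beta = (1 - beta) * d0 + beta * d1
    best = None
    for i, j in itertools.combinations(range(len(clusters)), 2):
        # farthest pair of points between clusters i and j (comparisons only)
        x, y = max(((a, b) for a in clusters[i] for b in clusters[j]),
                   key=lambda p: d_beta[p[0], p[1]])
        if best is None or d_beta[x, y] < d_beta[best[2], best[3]]:
            best = (i, j, x, y)
    return best
```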
Lemma 6. Let $C_1, \dots, C_m$ be a collection of clusters, $d_0$ and $d_1$ be any pair of metrics, and $[\beta_{lo}, \beta_{hi})$ be a subset of the parameter space. If there are $M$ distinct cluster pairs $C_i, C_j$ that complete linkage would merge when using the metric $d_\beta$ for $\beta \in [\beta_{lo}, \beta_{hi})$, then the running time of Algorithm 4 is $O(M n^2)$.

Proof. The loop in step 4 of Algorithm 4 runs once for each possible merge, giving a total of $M$ iterations. Each iteration finds the merge performed by complete linkage using the metric $d_\beta$, which takes $O(n^2)$ time, and then solves $O(n^2)$ linear equations to determine the largest value of $\beta$ for which the same merge is performed. It follows that the cost of each iteration is $O(n^2)$, leading to an overall running time of $O(M n^2)$. Note that we assume the pairwise distances $d_\beta(x, x')$ can be evaluated in constant time. This can always be achieved by precomputing two $n \times n$ distance matrices for the base metrics $d_0$ and $d_1$, respectively.

Theorem 3. Let $S = \{x_1, \dots, x_n\}$ be a clustering instance and $d_0$ and $d_1$ be any two metrics. Suppose that the execution tree of $A_{\mathrm{metric}}(d_0, d_1)$ on $S$ has $E$ edges. Then the total running time of Algorithm 5 is $O(E n^2)$.

Proof. Fix any node $v$ in the execution tree with $m$ clusters $C_1, \dots, C_m$ and $M$ outgoing edges (i.e., $M$ possible merges from the state represented by $v$). We run Algorithm 4 to determine the children of $v$, which by Lemma 6 costs $O(M n^2)$. Summing over all non-leaf nodes of the execution tree, the total cost is $O(E n^2)$.

C APPENDIX FOR EXPERIMENTS

Clustering distributions.

MNIST Subsets. Our first distribution over clustering tasks corresponds to clustering subsets of the MNIST dataset (LeCun et al., 1998), which contains 70,000 hand-written examples of the digits 0 through 9. We generate a random clustering instance from the MNIST data as follows: first, we select k = 5 digits from {0, ..., 9} at random; then we randomly select 200 examples belonging to each of the selected digits, giving a total of n = 1000 images. The target clustering for this instance is given by the ground-truth digit labels. We measure the distance between any pair of points as the Euclidean distance between their images represented as vectors of pixel intensities.
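A sketch of this sampling procedure (ours; it assumes `images` is an (N, 784) array of pixel intensities and `labels` the corresponding digit labels, loaded elsewhere; the function name is illustrative):

```python
# Sample one MNIST clustering instance: k random digits, 200 examples each,
# with a precomputed Euclidean distance matrix over the flattened images.
import numpy as np

def sample_mnist_instance(images, labels, rng, k=5, per_class=200):
    digits = rng.choice(10, size=k, replace=False)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == d), size=per_class, replace=False)
        for d in digits
    ])
    S, Y = images[idx].astype(float), labels[idx]
    # pairwise Euclidean distances via the ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b identity
    sq = (S ** 2).sum(axis=1)
    dists = np.sqrt(np.maximum(sq[:, None] + sq[None] - 2 * S @ S.T, 0.0))
    return S, Y, dists
```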
CIFAR-10 Subsets. We also consider a distribution over clustering tasks that corresponds to clustering subsets of the CIFAR-10 dataset (Krizhevsky, 2009). This dataset contains 6000 images of each of the following classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each example is a 32 × 32 color image with 3 color channels. We pre-process the data to obtain neural-network feature representations for each example. We include 50 randomly rotated and cropped versions of each example and obtain feature representations from layer in4d of a pre-trained Google Inception network. This gives a 144-dimensional feature representation for each of the 3,000,000 examples (50 randomly rotated copies of the 6000 examples for each of the 10 classes). We generate clustering tasks from CIFAR-10 as follows: first, select k = 5 classes at random; then choose 50 examples belonging to each of the selected classes, giving a total of n = 250 images. The target clustering for this instance is given by the ground-truth class labels. We measure the distance between any pair of images as the distance between their feature embeddings.

Omniglot Subsets. Next, we consider a distribution over clustering tasks corresponding to clustering subsets of the Omniglot dataset (Lake et al., 2015). The Omniglot dataset consists of written characters from 50 different alphabets, with a total of 1623 distinct characters. The dataset includes 20 examples of each character, leading to a total of 32,460 examples. We generate a random clustering instance from the Omniglot data as follows: first, we choose one of the alphabets at random. Next, we choose k uniformly in {5, ..., 10} and choose k random characters from that alphabet. The clustering instance includes 20k examples, and the target clustering is given by the ground-truth character labels. We use two different distance metrics on the Omniglot dataset. First, we use the cosine distance between neural network feature embeddings, where the network was trained to perform digit classification on MNIST. Second, each example has both an image of the written character and the stroke trajectory (i.e., a time series of (x, y) coordinates of the tip of the pen while the character was written). We also use the following distance defined in terms of the strokes: given two trajectories $s = (x_t, y_t)_{t=1}^{T}$ and $s' = (x'_t, y'_t)_{t=1}^{T'}$, we define the distance between them by
$$d(s, s') = \frac{1}{T + T'} \left( \sum_{t=1}^{T} d\bigl((x_t, y_t), s'\bigr) + \sum_{t=1}^{T'} d\bigl((x'_t, y'_t), s\bigr) \right),$$
where $d((x_t, y_t), s')$ denotes the Euclidean distance from the point $(x_t, y_t)$ to the closest point in $s'$. This is the average distance from a point on either trajectory to the nearest point on the other trajectory. This hand-designed metric provides a complementary notion of distance to the neural network feature embeddings.

Places2 Subsets. The Places2 dataset consists of images of 365 different place categories, including "volcano", "gift shop", and "farm" (Zhou et al., 2017). To generate a clustering instance from the Places2 data, we choose k randomly from {5, ..., 10}, choose k random place categories, and then select 20 random examples from each chosen category. We restrict ourselves to the first 1000 images from each class. We use two metrics for this data distribution. First, we use cosine distances between feature embeddings generated by a VGG16 network pre-trained on ImageNet. In particular, we use the activations just before the fully connected layers, but after the max-pooling is performed, so that we have 512-dimensional feature vectors. Second, we compute color histograms in HSV space for each image and use the cosine distance between the histograms. In more detail, we partition the hue space into 8 bins, the saturation space into 2 bins, and the value space into 4 bins, resulting in a 64-dimensional histogram counting how frequently each quantized color appears in the image. Two images are close under this metric if they contain similar colors.
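A sketch of the color-histogram metric just described (ours; it assumes the HSV conversion is done elsewhere and that all three channels are scaled to [0, 1); the paper's exact binning and normalization may differ):

```python
# Quantize HSV into 8 x 2 x 4 = 64 joint bins and compare the normalized
# histograms of two images with cosine distance.
import numpy as np

def hsv_histogram(hsv, bins=(8, 2, 4)):
    """hsv: (H, W, 3) array with hue, saturation, value channels in [0, 1)."""
    h = np.minimum((hsv[..., 0] * bins[0]).astype(int), bins[0] - 1)
    s = np.minimum((hsv[..., 1] * bins[1]).astype(int), bins[1] - 1)
    v = np.minimum((hsv[..., 2] * bins[2]).astype(int), bins[2] - 1)
    flat = (h * bins[1] + s) * bins[2] + v          # joint bin index in [0, 64)
    hist = np.bincount(flat.ravel(), minlength=int(np.prod(bins))).astype(float)
    return hist / hist.sum()

def cosine_distance(u, v):
    return 1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
```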
Places2 Diverse Subsets. We also construct an instance distribution from a subset of the Places2 classes that have diverse color histograms; we expect the color histogram metric to perform better on this distribution. To generate a clustering instance, we pick k = 4 classes from aquarium, discotheque, highway, iceberg, kitchen, lawn, stage-indoor, underwater ocean deep, volcano, and water tower. We include 50 randomly sampled images from each chosen class, leading to a total of n = 200 points per instance.

Synthetic Rings and Disks. We consider a two-dimensional synthetic distribution where each clustering instance has 4 clusters, two of which are ring-shaped and two of which are disk-shaped. To generate each instance, we sample 100 points uniformly at random from each ring or disk. The two rings have radii 0.4 and 0.8, respectively, and are both centered at the origin. The two disks have radius 0.4 and are centered at (1.5, 0.4) and (1.5, -0.4), respectively. For this data, we measure distances between points in terms of the Euclidean distance. (A sketch of this sampling procedure appears after Table 1.)

Average Number of Discontinuities. Next we report the average number of discontinuities in the loss function for a clustering instance sampled from each of the distributions described above, for each of the learning tasks we consider. In all cases, the average number of discontinuities is many orders of magnitude smaller than our upper bounds. The metric learning problems tend to have more discontinuities than learning the best merge function. Surprisingly, even though our only worst-case bound on the number of discontinuities when interpolating between average and complete linkage is exponential in n, the empirical number of discontinuities is always smaller than for interpolating between single and complete linkage. The results are shown in Table 1.

Distribution        Task    max n   Average # Discontinuities
Omniglot            SC      200     59.4
Omniglot            AC      200     33.9
Omniglot            metric  200     201.1
MNIST               SC      1000    362.6
MNIST               AC      1000    282.0
Rings and Disks     SC      400     29.0
Rings and Disks     AC      400     18.3
CIFAR-10            SC      250     103.2
CIFAR-10            AC      250     66.2
Places2             metric  200     241.0
Places2 Diverse     metric  200     269.6

Table 1: Average number of discontinuities of the piecewise constant loss function for an instance sampled from each distribution and learning task. Task SC corresponds to interpolating between single and complete linkage, AC to interpolating between average and complete linkage, and metric to interpolating between two base metrics. The column labeled max n is an upper bound on the size of each clustering instance.
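The sampling sketch referenced in the Synthetic Rings and Disks paragraph (ours; since the text does not specify ring thickness, points on each ring are drawn uniformly along a circle of the given radius):

```python
# Sample one rings-and-disks instance: two concentric circles of radius 0.4 and
# 0.8 centered at the origin, and two disks of radius 0.4 centered at
# (1.5, 0.4) and (1.5, -0.4), with 100 points per cluster.
import numpy as np

def sample_rings_and_disks(rng, per_cluster=100):
    clusters = []
    for radius in (0.4, 0.8):                       # ring-shaped clusters
        theta = rng.uniform(0, 2 * np.pi, per_cluster)
        clusters.append(np.column_stack([radius * np.cos(theta),
                                         radius * np.sin(theta)]))
    for center in ((1.5, 0.4), (1.5, -0.4)):        # disk-shaped clusters
        r = 0.4 * np.sqrt(rng.uniform(0, 1, per_cluster))   # uniform over the disk
        theta = rng.uniform(0, 2 * np.pi, per_cluster)
        clusters.append(center + np.column_stack([r * np.cos(theta),
                                                  r * np.sin(theta)]))
    labels = np.repeat(np.arange(4), per_cluster)
    return np.vstack(clusters), labels
```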