# Faster Cover Trees

Mike Izbicki (MIZBI001@UCR.EDU) and Christian R. Shelton (CSHELTON@CS.UCR.EDU)
University of California Riverside, 900 University Ave, Riverside, CA 92521

The cover tree data structure speeds up exact nearest neighbor queries over arbitrary metric spaces (Beygelzimer et al., 2006). This paper makes cover trees even faster. In particular, we provide:

1. A simpler definition of the cover tree that reduces the number of nodes from O(n) to exactly n,
2. An additional invariant that makes queries faster in practice,
3. Algorithms for constructing and querying the tree in parallel on multiprocessor systems, and
4. A more cache efficient memory layout.

On standard benchmark datasets, we reduce the number of distance computations by 10–50%. On a large-scale bioinformatics dataset, we reduce the number of distance computations by 71%. On a large-scale image dataset, our parallel algorithm with 16 cores reduces tree construction time from 3.5 hours to 12 minutes.

## 1. Introduction

Data structures for fast nearest neighbor queries are most often used to speed up the k-nearest neighbor classification algorithm. But many other learning tasks also require neighbor searches, and cover trees can speed up these tasks as well: localized support vector machines (Segata & Blanzieri, 2010), dimensionality reduction (Lisitsyn et al., 2013), and reinforcement learning (Tziortziotis et al., 2014). Making cover trees faster (the main contribution of this paper) therefore makes these other tasks faster as well.

Given a space of points 𝒳, a dataset X ⊆ 𝒳, a data point p ∈ 𝒳, and a distance function d : 𝒳 × 𝒳 → ℝ, the nearest neighbor of p in X is defined as

p_nn = argmin_{q ∈ X ∖ {p}} d(p, q)

The naive method for computing p_nn involves a linear scan of all the data points and takes time Θ(n), but many data structures have been created to speed up this process. The kd-tree (Friedman et al., 1977) is probably the most famous. It is simple and effective in practice, but it can only be used on Euclidean spaces. We must turn to other data structures when given an arbitrary metric space. The simplest and oldest of these structures is the ball tree (Omohundro, 1989). Although attractive for its simplicity, it provides only the trivial runtime guarantee that queries will take time O(n). Subsequent research focused on providing stronger guarantees, producing more complicated data structures like the metric skip list (Karger & Ruhl, 2002) and the navigating net (Krauthgamer & Lee, 2004). Because these data structures were complex and had large constant factors, they were mostly of theoretical interest. The cover tree (Beygelzimer et al., 2006) simplified navigating nets while maintaining good runtime guarantees. Further research has strengthened the theoretical runtime bounds provided by the cover tree (Ram et al., 2010). Our contributions make the cover tree faster in practice.

The cover tree as originally introduced is simpler than related data structures, but it is not simple. The original explanation required an implicit tree with an infinite number of nodes, while a smaller, explicit tree with O(n) nodes actually gets implemented. We refer to this presentation as the original cover tree.
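For reference, the definition above transcribes directly into a brute-force search. The following Haskell sketch (Haskell is the language of our implementation, but this fragment is illustrative only and its names are ours) is exactly the Θ(n) linear scan that the data structures below are designed to avoid:

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Brute-force nearest neighbor: a direct transcription of the argmin
-- definition. Assumes xs contains at least one point other than p.
bruteNN :: Eq a => (a -> a -> Double) -> [a] -> a -> a
bruteNN d xs p = minimumBy (comparing (d p)) (filter (/= p) xs)
```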
In the remainder of this paper we introduce the simplified cover tree and the nearest ancestor cover tree; parallel construction and querying algorithms; and a cache-efficient layout suitable for all three cover trees. We conclude with experiments showing: on standard benchmark datasets, we outperform both the original implementation (Beygelzimer et al., 2006) and MLPack's implementation (Curtin et al., 2013a); on a large bioinformatics dataset, our nearest ancestor tree uses 71% fewer distance comparisons than the original cover tree; and on a large image dataset, our parallelization algorithm reduces tree construction time from 3.5 hours to 12 minutes.

## 2. Simplified cover trees

Definition 1. Our simplified cover tree is any tree where: (a) each node p in the tree contains a single data point (also denoted by p); and (b) the following three invariants are maintained.

1. The leveling invariant. Every node p has an associated integer level(p). For each child q of p,
   level(q) = level(p) − 1.
2. The covering invariant. For every node p, define the function covdist(p) = 2^level(p). For each child q of p,
   d(p, q) ≤ covdist(p).
3. The separating invariant. For every node p, define the function sepdist(p) = 2^(level(p)−1). For all distinct children q1 and q2 of p,
   d(q1, q2) > sepdist(p).

Throughout this paper, we will use the functions children(p) and descendants(p) to refer to the set of nodes that are children or descendants of p, respectively. We also define the function maxdist as

maxdist(p) = max { d(p, q) : q ∈ descendants(p) }

In words, this is the greatest distance from p to any of its descendants. This value is upper bounded by 2^(level(p)+1), and its exact value can be cached within the data structure.¹

¹ As in the original cover tree, practical performance is improved on most datasets by redefining covdist(p) = 1.3^level(p) and sepdist(p) = 1.3^(level(p)−1). All of our experiments use this modified definition.

In order to query a cover tree (any variant), we use the generic space tree framework developed by Curtin et al. (2013b). This framework provides fast algorithms for finding the k-nearest neighbors, points within a specified distance, kernel density estimates, and minimum spanning trees. Each of these algorithms has two variants: a single tree algorithm for querying one point at a time, and a dual tree algorithm for querying many points at once. Our faster cover tree algorithms speed up all of these queries; but for simplicity, in this paper we focus only on the single tree nearest neighbor query. The pseudocode is shown in Algorithm 1.

```
Algorithm 1 Find nearest neighbor
function findNearestNeighbor(cover tree p, query point x, nearest neighbor so far y)
1: if d(p,x) < d(y,x) then
2:   y ← p
3: for each child q of p sorted by distance to x do
4:   if d(y,x) > d(y,q) − maxdist(q) then
5:     y ← findNearestNeighbor(q, x, y)
6: return y
```

Analysis of the cover tree's runtime properties is done using the data-dependent doubling constant c: the minimum value c such that every ball in the dataset can be covered by c balls of half the radius. We state without proof two facts: (a) any node in the cover tree can have at most O(c^4) children; and (b) the depth of any node in the cover tree is at most O(c^2 log n). These facts were proven for the original cover tree (Beygelzimer et al., 2006), and the proofs for our simplified cover tree are essentially the same. We can use these two facts to show that the runtime of Algorithm 1 is O(c^6 log n) for both the original and simplified cover tree.
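To make the data structure and query concrete, here is a minimal Haskell sketch (illustrative only; it is not our library code, and the type and names are ours). For the pruning test it uses the triangle-inequality bound d(x,q) − maxdist(q) on the distance from the query x to any point in q's subtree:

```haskell
import Data.List (sortOn)

type Metric a = a -> a -> Double

data CoverTree a = Node
  { point    :: a              -- the single data point stored at this node
  , level    :: Int            -- leveling invariant: children sit at level - 1
  , maxdist  :: Double         -- cached bound on d(point, any descendant)
  , children :: [CoverTree a]
  }

-- Algorithm 1: greedy descent with branch-and-bound pruning. A child q is
-- explored only if its subtree could contain a point closer to x than the
-- best candidate y found so far.
nearestNeighbor :: Metric a -> CoverTree a -> a -> a -> a
nearestNeighbor d p x y0 = foldl explore y1 sorted
  where
    y1     = if d (point p) x < d y0 x then point p else y0
    sorted = sortOn (\q -> d (point q) x) (children p)
    explore y q
      | d y x > d (point q) x - maxdist q = nearestNeighbor d q x y
      | otherwise                         = y   -- prune the whole subtree
```

A query starts with the root's own point as the initial candidate, e.g. `nearestNeighbor d root x (point root)`.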
Algorithm 2 shows how to insert into the simplified cover tree.

```
Algorithm 2 Simplified cover tree insertion
function insert(cover tree p, data point x)
1: if d(p,x) > covdist(p) then
2:   while d(p,x) > 2·covdist(p) do
3:     Remove any leaf q from p
4:     p′ ← tree with root q and p as only child
5:     p ← p′
6:   return tree with x as root and p as only child
7: return insert′(p, x)

function insert′(cover tree p, data point x)
prerequisites: d(p,x) ≤ covdist(p)
1: for q ∈ children(p) do
2:   if d(q,x) ≤ covdist(q) then
3:     q′ ← insert′(q, x)
4:     p′ ← p with child q replaced with q′
5:     return p′
6: return p with x added as a child
```

The algorithm is divided into two cases. In the first case, we cannot insert our data point x into the tree without violating the covering invariant, so we raise the level of the tree p by repeatedly taking a leaf node and using it as the new root. Because maxdist(p) ≤ 2·covdist(p), moving a leaf to the root in this way never violates the covering invariant; and when the while loop exits we have d(p,x) ≤ 2·covdist(p) = covdist(x), so making x the new root does not violate it either. In the second case, the insert′ function recursively descends the tree. On each function call, we search through children(p) to find a node we can insert into without violating the covering invariant. If we find such a node, we recurse; otherwise, we know we can add x to children(p) without violating the separating invariant.

In all cases, exactly one node is added per data point, so the resulting tree will have exactly n nodes. Since every node can have at most O(c^4) children and the depth of the tree is bounded by O(c^2 log n), the runtime of insertion is O(c^6 log n).

The simplified cover tree has the same runtime bounds as the original cover tree, but it has an improved constant factor because it needs only n nodes. Fewer nodes reduce both the overhead from traversing the data structure and the number of required distance comparisons. The original cover tree's nesting invariant dictated these extra nodes. In our definitions, the nesting invariant states that for every node p, if p has any children, then p also has itself as a child (this child need not satisfy the leveling invariant). The nesting invariant comes from the presentation of the original cover tree as an infinite data structure, but it does not play a key role in the cover tree's runtime analysis. Therefore, we can discard it and maintain the runtime guarantees. Figure 1 shows the reduction in nodes from using the simplified cover tree on benchmark datasets taken from the MLPack test suite (Curtin et al., 2013a).

Figure 1. Fraction of nodes in the original cover tree required for the simplified cover tree, per benchmark dataset (e.g. yearpredict, tinyImages, artificial40). Fewer nodes means less overhead from traversing the tree and fewer distance comparisons. See Table 1 for information on the datasets.
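The recursive case of Algorithm 2 translates almost line for line. Continuing the Haskell sketch from Section 2 (again illustrative; the root-raising first case of insert and the refreshing of the cached maxdist values along the descent path are elided):

```haskell
-- covdist per footnote 1; use 2 ^^ level q for Definition 1's version
covdist :: CoverTree a -> Double
covdist q = 1.3 ^^ level q

-- insert', the recursive descent of Algorithm 2.
-- Prerequisite: d (point p) x <= covdist p.
insert' :: Metric a -> CoverTree a -> a -> CoverTree a
insert' d p x = p { children = go (children p) }
  where
    go (q:qs)
      | d (point q) x <= covdist q = insert' d q x : qs   -- descend into q
      | otherwise                  = q : go qs
    -- No child covers x, so x becomes a new child. Separation holds
    -- because every existing child q satisfied d(q,x) > covdist(q),
    -- and covdist(q) = sepdist(p).
    go [] = [Node x (level p - 1) 0 []]
```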
Section 6 contains more details on the datasets in Figure 1, and Figure 3 in that section shows how this reduced node count translates into improved query performance.

## 3. Nearest ancestor cover trees

In this section, we exploit a similarity between simplified cover trees and binary search trees (BSTs). Insertion into both trees follows the same procedure: perform a depth first search to find the right location to insert the point. In particular, there is no rebalancing after the insertion. Many alternatives to plain BSTs produce better query times by introducing new invariants that force the insertion algorithm to rebalance the tree during the insertion step. Our definition of the simplified cover tree makes adding similar invariants easy. We now introduce one possible invariant.

Definition 2. A nearest ancestor cover tree is a simplified cover tree where every point p satisfies the nearest ancestor invariant: if q1 is an ancestor of p and q2 is a sibling of q1, then

d(p, q1) ≤ d(p, q2)

In other words, the nearest ancestor cover tree ensures that for every data point, each of its ancestors is the best possible ancestor for it at that level. Figure 2 shows a motivating one dimensional example.

Figure 2. Using the metric d(a,b) = |a − b|, both trees are valid simplified cover trees, but only the right tree is a valid nearest ancestor cover tree. Moving the 9 and 11 nodes reduces the value of maxdist for their ancestor nodes. This causes pruning to happen more often during nearest neighbor queries.

Algorithm 3 shows how to insert a point into a nearest ancestor cover tree. It uses the same insert function as Algorithm 2, but the helper function insert′ is modified in two ways: the children are visited in sorted order (line 1), and the fallback case calls rebalance (line 6).

```
Algorithm 3 Nearest ancestor cover tree insertion
function insert′(cover tree p, data point x)
1: for q ∈ children(p) sorted by distance to x do
2:   if d(q,x) ≤ covdist(q) then
3:     q′ ← insert′(q, x)
4:     p′ ← p with child q replaced with q′
5:     return p′
6: return rebalance(p, x)

function rebalance(cover tree p, data point x)
prerequisites: x can be added as a child of p without violating the covering or separating invariants
1: create tree x′ with root node x at level level(p) − 1; x′ contains no other points
2: p′ ← p
3: for q ∈ children(p) do
4:   (q′, moveset, stayset) ← rebalance′(p, q, x)
5:   p′ ← p′ with child q replaced with q′
6:   for r ∈ moveset do
7:     x′ ← insert(x′, r)
8: return p′ with x′ added as a child

function rebalance′(cover trees p and q, data point x)
prerequisites: p is an ancestor of q
1: if d(p,q) > d(q,x) then
2:   moveset, stayset ← ∅
3:   for r ∈ descendants(q) do
4:     if d(r,p) > d(r,x) then
5:       moveset ← moveset ∪ {r}
6:     else
7:       stayset ← stayset ∪ {r}
8:   return (null, moveset, stayset)
9: else
10:   moveset′, stayset′ ← ∅
11:   q′ ← q
12:   for r ∈ children(q) do
13:     (r′, moveset, stayset) ← rebalance′(p, r, x)
14:     moveset′ ← moveset′ ∪ moveset
15:     stayset′ ← stayset′ ∪ stayset
16:     if r′ = null then
17:       q′ ← q′ with the subtree r removed
18:     else
19:       q′ ← q′ with the subtree r replaced by r′
20:   for r ∈ stayset′ do
21:     if d(r,q′) ≤ covdist(q′) then
22:       q′ ← insert(q′, r)
23:       stayset′ ← stayset′ ∖ {r}
24:   return (q′, moveset′, stayset′)
```

First, we sort children(p) according to their distance from the data point x. This sorting ensures that our newly inserted point will satisfy the nearest ancestor invariant. But this new point x may cause other points to violate the nearest ancestor invariant. In particular, if x has a sibling q, q has a descendant r, and d(r,x) < d(r,q), then r now violates the nearest ancestor invariant. Our second modification is to call rebalance, which finds all of these violating data points and moves them underneath x.

Most of the work of the rebalance function happens in the helper function rebalance′: rebalance′ returns a valid nearest ancestor cover tree and a set of points that still need to be inserted, and rebalance just inserts those extra points. rebalance′ takes two nearest ancestor cover trees p and q (where p is an ancestor of q) and a point x. Its goal is to extract all points from q that would violate the nearest ancestor invariant if x became a sibling of p. It returns three values: a modified version of q; a set of points that cannot remain anywhere along the path from p to q, called the moveset; and a set of points that need to be reinserted somewhere along the path from p to q, called the stayset.

There are two cases. In the first case, the data point at node q must move. We then filter the descendants of q into the moveset or stayset as appropriate and return null for our modified q. In the second case, the data point at node q must stay. We recursively apply rebalance′ to each of q's children; we use the results to update the corresponding child, the moveset, and the stayset variables. Finally, we try to reinsert any nodes in the stayset. If rebalance′ was called with q a child of p, then the returned stayset will be empty; any children that could not be directly reinserted must be in the moveset.
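Because rebalancing is the subtle step, an executable statement of Definition 2 is useful as a test oracle for insertion. Continuing the Haskell sketch (deliberately brute force and quadratic; for validation only):

```haskell
allNodes :: CoverTree a -> [CoverTree a]
allNodes t = t : concatMap allNodes (children t)

pointsUnder :: CoverTree a -> [a]
pointsUnder t = point t : concatMap pointsUnder (children t)

-- Definition 2, checked exhaustively: every point r below a child q1 must
-- be at least as close to q1 as to any other child q2 of the same parent.
nearestAncestorOK :: Eq a => Metric a -> CoverTree a -> Bool
nearestAncestorOK d t = and
  [ d r (point q1) <= d r (point q2)
  | p  <- allNodes t
  , q1 <- children p
  , q2 <- children p
  , point q1 /= point q2
  , r  <- pointsUnder q1
  ]
```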
The rebalance′ function loops over the O(c^4) children of a node, and the maximum depth of the recursion is O(c^2 log n), so a single call to rebalance′ takes time O(c^6 log n). rebalance′ may be called up to O(c^4) times within rebalance, so the body of the for loop on line 12 executes O(c^10 log n) times in total. Unfortunately, we have no bound on the size of the moveset, except to note that it is usually small in practice. On the datasets in our experiments (see Table 1), its size is usually zero and at worst in the single digits. Figure 3(a) shows that nearest ancestor cover tree construction is not that much slower in practice, and Figure 3(b) shows that this slowdown is overshadowed by the resulting speedup in nearest neighbor queries.

## 4. Parallel cover tree

In this section we discuss parallelism on shared-memory, multiprocessor machines. Querying in parallel is easy: since the result of a neighbor query for one data point does not depend on the other data points, we can divide the query points among the processors and let each processor traverse the tree independently. Constructing the tree in parallel is more difficult. Our strategy is to split the data, create one cover tree on each processor, and then merge these trees together. Previous work on parallelizing cover tree construction applied only to the GPU (Kumar & Ramareddy, 2010); our approach is suitable for any shared-memory multiprocessor machine. We give a detailed description for merging simplified cover trees and discuss at a high level how to extend this procedure to nearest ancestor cover trees. Algorithm 4 shows the merging procedure.

```
Algorithm 4 Merging cover trees
function merge(cover tree p, cover tree q)
1: if level(q) > level(p) then
2:   swap p and q
3: while level(q) < level(p) do
4:   move a leaf node from q to the root;
5:   this raises the level of q by 1
6: (p, leftovers) ← mergeHelper(p, q)
7: for r ∈ leftovers do
8:   p ← insert(p, r)
9: return p

function mergeHelper(cover tree p, cover tree q)
prerequisites: level(p) = level(q), d(p,q) ≤ covdist(p)
1: children′ ← children(p)                        (Phase 1)
2: uncovered, sepcov, leftovers ← ∅
3: for r ∈ children(q) do
4:   if d(p,r) < covdist(p) then
5:     foundmatch ← false
6:     for s ∈ children′ do
7:       if d(s,r) ≤ sepdist(p) then
8:         (s′, leftovers_s) ← mergeHelper(s, r)
9:         children′ ← children′ ∪ {s′} ∖ {s}
10:        leftovers ← leftovers ∪ leftovers_s
11:        foundmatch ← true
12:        break from inner loop
13:    if not foundmatch then
14:      sepcov ← sepcov ∪ {r}
15:  else
16:    uncovered ← uncovered ∪ {r}
17: children′ ← children′ ∪ sepcov                (Phase 2)
18: p′ ← tree rooted at p with children(p′) = children′
19: p′ ← insert(p′, q)
20: leftovers′ ← ∅
21: for r ∈ leftovers do
22:   if d(r,p) ≤ covdist(p) then
23:     p′ ← insert(p′, r)
24:   else
25:     leftovers′ ← leftovers′ ∪ {r}
26: return (p′, leftovers′ ∪ uncovered)
```

The merge function's main purpose is to satisfy the prerequisites for mergeHelper, which has two phases. First, we find all the subtrees of q that can be inserted directly into p without violating any invariants, and we insert them. Second, we insert the remaining nodes from q into p directly via the insert function. The mergeHelper function returns a partially merged tree and a set of nodes called the leftovers that still need to be inserted into the tree.

The first phase uses the for loop starting on line 3 to categorize the children of q into three disjoint sets. The uncovered set contains all of q's children that would violate the covering invariant if inserted into p.
The sepcov set contains all of q's children that would violate neither the separating nor the covering invariant when inserted into p. Both of these sets are used in the second phase of mergeHelper. Every child of q that is not placed into the uncovered or sepcov sets gets merged with a suitable node in children(p) by recursively calling the mergeHelper function; any points that could not be inserted into the results of these recursive calls are added to the leftovers set.

In the second phase of mergeHelper, we insert as many nodes as possible into our merged tree p′. First, we update the children with the subtrees in sepcov. Then we insert the root of q. We know that d(p,q) ≤ covdist(p), so this insertion is guaranteed not to change the level of p′. Finally, we loop over the elements in leftovers and insert them into p′ only if doing so would not change the level of p′. Any elements of leftovers that cannot be inserted into p′ are added to leftovers′ and returned. It is important to do this insertion of leftovers at the lowest level possible (rather than wait until the recursion ends and have the insertion performed in merge) to avoid unnecessary distance computations.

The merge function does not maintain the nearest ancestor invariant. A modified version of merge that calls the rebalance function appropriately could, but for space reasons we do not provide this modified algorithm. In our experiments below, we use the merge function as given in Algorithm 4 to parallelize both simplified and nearest ancestor tree construction. In practice, this retains the benefits of the nearest ancestor cover tree because the nearest ancestor invariant ends up violated in only a few places.

Providing explicit bounds on the runtime of merge is difficult, but in practice it is fast. When parallelizing on two processors, approximately 1% of the distance calculations occur within merge, so merge is not our bottleneck. Instead, the main bottleneck of parallelism is cache performance: on modern desktop computers, last level cache is shared between all cores on a CPU, and cover tree construction results in many cache misses, an effect that is exaggerated when the tree is constructed in parallel.
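The overall construction strategy is short enough to sketch. The fragment below (illustrative; buildTree and merge stand for implementations of Algorithms 2 and 4, which we do not spell out in Haskell here) uses the Strategies combinators from the parallel package; rseq only sparks evaluation to weak head normal form, so a production version would force each whole tree, e.g. with rdeepseq and an NFData instance:

```haskell
import Control.Parallel.Strategies (parList, rseq, using)
import Data.List (foldl1')

-- Split the data round-robin into k chunks, build one tree per chunk in
-- parallel, then fold the trees together with merge (Algorithm 4).
parBuild :: Metric a -> Int -> [a] -> CoverTree a
parBuild d k xs = foldl1' (merge d) (trees `using` parList rseq)
  where
    trees      = map (buildTree d) chunks
    chunks     = [ everyKth i | i <- [0 .. k - 1] ]
    everyKth i = [ x | (j, x) <- zip [0 ..] xs, j `mod` k == i ]
```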
## 5. Cache efficiency

One of the biggest sources of overhead in the cover tree is cache misses, so our last improvement is to make cover trees more cache efficient. A simple way to reduce cache misses for tree data structures is to use the van Emde Boas tree layout (Frigo et al., 1999), which arranges nodes in memory according to a depth first traversal of the tree. This arrangement creates a cache oblivious data structure: the programmer does not need any special knowledge about the cache sizes to obtain optimal speedup, because the van Emde Boas tree layout works efficiently on any cache architecture. This layout has been known for a long time in the data structures community, but it seems unused in machine learning libraries. Frigo et al. (1999) provide a detailed tutorial.

Our implementation of the cache oblivious cover tree is static. That is, we first construct the cover tree, then we call a function pack that rearranges the tree in memory. This means we do not get the reduced cache misses while constructing the tree, but only while querying it. The pack function is essentially free to run because it requires only a single traversal through the dataset. Figure 4 shows that the van Emde Boas tree layout reduces cache misses by 5 to 20 percent, which in turn reduces the number of stalled CPU cycles by 2 to 15 percent.
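The packing idea can be sketched compactly: emit the nodes into one contiguous array in exactly the order a depth first traversal visits them, so that a query descending the tree touches mostly adjacent memory. A real pack also stores child offsets into the array; the fragment below (illustrative, using Data.Vector from the vector package) computes only the layout order:

```haskell
import qualified Data.Vector as V

-- The depth-first (preorder) layout order of the tree's points.
packOrder :: CoverTree a -> V.Vector a
packOrder t = V.fromList (go t)
  where
    go p = point p : concatMap go (children p)
```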
Figure 3. Number of distance comparisons for the original, simplified, and nearest ancestor cover trees on the benchmark datasets. (a) Tree construction only (normalized by the original cover tree): constructing a nearest ancestor cover tree usually takes longer than the original cover tree and the simplified cover tree. (b) Construction and query (normalized by n^2): construction plus querying is faster in the nearest ancestor cover tree. On most datasets, this faster query time more than offsets the increased construction cost, giving an overall speedup.

Figure 4. (a) Cache miss rate (cache misses / cache accesses) of our packed nearest ancestor cover tree compared to our unpacked tree and other implementations, demonstrating better cache performance. (b) Stalled CPU cycle rate (stalled cycles / total cycles). A stalled CPU cycle is one where the CPU does no work because it must wait for a memory access; reducing the number of cache misses results in fewer stalled cycles, and so in faster run times. We used the Linux perf stat utility to measure the cache-references, cache-misses, cycles, and stalled-cycles-frontend hardware counters. perf stat uses a sampling strategy with a negligible effect on program performance.

Figure 5. Run times of the all nearest neighbor procedure for those datasets that take more than 5 minutes. (a) Normalized tree construction time versus number of processors (single processor times: yearpredict 77 sec, twitter 107 sec, tinyImages 65 sec, mnist 12 sec). A single cover tree merge takes about 1% of the computation time; the main reason for the lack of perfect parallel speedup is the increased number of cache misses caused by inserting into multiple trees simultaneously. (b) Normalized total runtime (both construction and query) compared with the reference and MLPack implementations (single processor times: yearpredict 277 min, twitter 51 min, tinyImages 34 min, mnist 30 min). Runtimes in both figures are divided by that of our single processor implementation (shown in parentheses).

| dataset | num data points | num dimensions |
|---|---|---|
| yearpredict | 515345 | 90 |
| twitter | 583250 | 78 |
| tinyImages | 100000 | 384 |
| mnist | 70000 | 784 |
| corel | 68040 | 32 |
| covtype | 581012 | 55 |
| artificial40 | 10000 | 40 |
| faces | 10304 | 20 |

Table 1. All MLPack benchmark datasets with at least 20 dimensions and 10000 points, arranged in descending order by runtime of the all nearest neighbor search.

## 6. Experiments

We now validate our improvements empirically. Our first experiments use the Euclidean distance on a standard benchmark suite (described in Table 1). Our last experiments use non-Euclidean metrics on data from bioinformatics and computer vision. In each experiment, we use the all nearest neighbors experimental setup: we first construct the cover tree on the dataset; then, for each point in the dataset, we find its nearest neighbor. This is a standard technique for measuring the efficiency of nearest neighbor algorithms.

### 6.1. Tree-type comparison

Our first experiment compares the performance of the three types of cover trees: original, simplified, and nearest ancestor. We measure the number of distance comparisons required to build the tree on a dataset in Figure 3(a), and the number of distance comparisons required to find each data point's nearest neighbor with Algorithm 1 in Figure 3(b). Distance comparisons are a good proxy measure of runtime performance because the majority of the algorithm's runtime is spent computing distances, and the measure avoids the unwanted confounding variable of varying optimization effort across implementations. As expected, the simplified tree typically outperforms the original tree, and the nearest ancestor tree typically outperforms the simplified tree. We reiterate that this reduced need for distance comparisons carries over to all the other queries provided by the space tree framework (Curtin et al., 2013b).

### 6.2. Implementation comparison

We next compare our implementation against two good cover tree implementations currently in widespread use: the reference implementation used in the original paper (Beygelzimer et al., 2006) and MLPack's implementation (Curtin et al., 2013a). Both of these programs were written in C++ and compiled using g++ 4.4.7 with full optimizations.
Our implementation was written in Haskell and compiled with ghc 7.8.4, also with full optimizations.² All tests were run on an Amazon Web Services c3.8xlarge instance with 60 GB of RAM and 32 Intel Xeon E5-2680 CPU cores clocked at 2.80 GHz. Half of those cores are hyperthreads, so for simplicity we only parallelize out to 16 cores. Since the reference implementation and MLPack only come with the Euclidean distance built in, we only use that metric when comparing the three implementations. Figure 4 shows the cache performance of all three libraries, and Figure 5 shows their runtimes. Our implementation's cache performance and parallelization speedup are shown for the nearest ancestor cover tree. Neither the original implementation nor MLPack supports parallelization.

² Our code can be downloaded at http://github.com/mikeizbicki/hlearn#covertree.

### 6.3. Graph kernels and protein function

An important problem in bioinformatics is to predict the function of a protein based on its 3d structure. State of the art solutions model the protein's 3d structure as a graph and use support vector machines (with a graph kernel) for prediction. Computing graph kernels is relatively expensive, however, so research has focused on making the graph kernel computation faster (Vishwanathan et al., 2010; Shervashidze et al., 2011). Such research makes graph kernels scale to larger graphs, but does not help in the case where there are more graphs. Our contribution is to use cover trees to reduce the number of required kernel computations, letting us scale to more graphs. The largest dataset in previous research contained about 1200 proteins; with our cover tree, we perform nearest neighbor queries on all one hundred thousand proteins currently registered in the Protein Data Bank (Berman et al., 2000).

We use the random walk graph kernel in our experiment. It performs well on protein classification and is conceptually simple; see Vishwanathan et al. (2010) for more details. A naive computation of this kernel takes time O(v^6), where v is the number of vertices in the graph. Vishwanathan et al. present faster methods that take time only O(v^3). While considerably faster, this is still a relatively expensive distance computation.
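A kernel is not by itself a distance, so to plug graphs into a cover tree we need a metric. One standard choice (an assumption of this sketch rather than a detail spelled out above) is the distance between the kernel's feature-space embeddings, which costs a constant number of kernel evaluations per distance, and the self-kernel terms can be cached per graph:

```haskell
-- For a positive definite kernel k, this is the (pseudo)metric between
-- the feature-space embeddings of g1 and g2. The max 0 guards against
-- small negative values caused by floating point rounding.
kernelDistance :: (g -> g -> Double) -> g -> g -> Double
kernelDistance k g1 g2 = sqrt (max 0 (k g1 g1 - 2 * k g1 g2 + k g2 g2))
```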
The Protein Data Bank (Berman et al., 2000) contains information on the 3d primary structure of approximately one hundred thousand proteins. To perform our experiment, we follow a procedure similar to the one used to create the PROTEIN dataset in the experiments of Vishwanathan et al. (2010): the Protein Data Bank stores the 3d structure of the atoms in a protein in a PDB file; from this PDB file, we calculate the protein's secondary structure using the DSSP tool (Joosten et al., 2011); then the tool VLPG (Schäfer et al., 2012) generates graphs from the resulting secondary structure. Some PDB files contain information for multiple graphs, and some do not contain enough information to construct a graph. In total, our dataset consists of 250,000 graphs, and a typical graph has between 5 and 120 nodes and 0.1–3 edges per node.

Figure 6 shows the scaling behavior of all three cover trees on this dataset. On all of the data, the total construction and query cost is 29% that of the original cover tree.

Figure 6. Total distance comparisons (in millions, for construction only and for construction and query) as we increase the number of data points (in thousands) on the bioinformatics data, for the original, simplified, and nearest ancestor cover trees. The relationship is roughly linear, indicating that protein graphs have a relatively low intrinsic dimensionality. As expected, the nearest ancestor cover tree performs the best.

### 6.4. Earth mover's distance

The earth mover's distance (EMD) is a distance metric between histograms designed for image classification (Rubner et al., 1998). In our tests, we convert images into three dimensional histograms of their pixel values in the Lab CIE color space. Lab CIE represents colors in three dimensions; it is similar to the more familiar RGB and CMYK color spaces, but the distances between colors more accurately match how humans perceive color differences. We construct the histogram so that each dimension has 8 equally spaced intervals, for a total of 512 bins. We then create a signature of the histogram by recording only the 20 largest of the 512 bins.

Previous research on speeding up the EMD focused on computing individual EMD distances faster. The EMD takes a base distance as a parameter. For an arbitrary base distance, computing the EMD requires O(b^3 log b) time, where b is the size of the histogram signature. Faster algorithms exist for specific base metrics: with an L1 base metric, the EMD can be computed in time O(b^2) (Ling & Okada, 2007); and if the base metric is a so-called thresholded metric, we get an order of magnitude constant factor speedup (Pele & Werman, 2009). We specifically chose the Lab CIE color space because no faster EMD algorithm is known for it; it will stress-test our cover tree implementation.

In this experiment, we use the Yahoo! Flickr Creative Commons dataset, whose training set contains 1.5 million images, and we construct simplified and nearest ancestor cover trees on this data in parallel. Construction times are shown in Table 2. With the cheap L2 distance on the smaller datasets, tree construction happens quickly, so parallelization is less important; but with an expensive distance on this larger dataset, parallel construction makes a big difference.

| number of cores | simplified tree construction time | speedup | nearest ancestor tree construction time | speedup |
|---|---|---|---|---|
| 1 | 70.7 min | 1.0 | 210.9 min | 1.0 |
| 2 | 36.6 min | 1.9 | 94.2 min | 2.2 |
| 4 | 18.5 min | 3.8 | 48.5 min | 4.3 |
| 8 | 10.2 min | 6.9 | 25.3 min | 8.3 |
| 16 | 6.7 min | 10.5 | 12.0 min | 17.6 |

Table 2. Parallel cover tree construction using the earth mover's distance. On this large dataset with an expensive metric, we see better parallel speedup than on the datasets with the cheaper L2 metric. The nearest ancestor cover tree gets super-linear parallel speedup because we are merging with Algorithm 4, which does not attempt to rebalance.
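The signature construction described above is mechanical; a hedged Haskell sketch follows (the Lab component ranges and all names here are our assumptions, and the EMD computation itself, which runs on pairs of these signatures, is not shown):

```haskell
import qualified Data.Map.Strict as M
import Data.List (sortBy)
import Data.Ord (comparing, Down(..))

type Lab = (Double, Double, Double)   -- an image pixel in CIE Lab space

-- Bin the pixels into an 8x8x8 grid (512 bins) and keep the 20 heaviest
-- bins: the result pairs each surviving bin index with its pixel count.
signature :: [Lab] -> [((Int, Int, Int), Int)]
signature pixels =
    take 20 . sortBy (comparing (Down . snd)) . M.toList $
      M.fromListWith (+) [ (bin px, 1) | px <- pixels ]
  where
    -- assumed component ranges: L in [0,100], a and b in [-128,127]
    bin (l, a, b) = (idx 0 100 l, idx (-128) 127 a, idx (-128) 127 b)
    idx lo hi v   = min 7 (max 0 (floor (8 * (v - lo) / (hi - lo))))
```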
## 7. Conclusion

We have simplified the definition of the cover tree and introduced the new nearest ancestor invariant, which speeds up the cover tree in practice. It is possible that other invariants exist that would balance the tree better, providing even more speed improvements.

## Acknowledgments

Shelton was supported by DARPA award FA8750-12-20010.

## References

Berman, Helen M., Westbrook, John, Feng, Zukang, Gilliland, Gary, Bhat, T. N., Weissig, Helge, Shindyalov, Ilya N., and Bourne, Philip E. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000.

Beygelzimer, Alina, Kakade, Sham, and Langford, John. Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning, pp. 97–104, New York, NY, USA, 2006.

Curtin, Ryan R., Cline, James R., Slagle, Neil P., March, William B., Ram, P., Mehta, Nishant A., and Gray, Alexander G. MLPACK: A scalable C++ machine learning library. Journal of Machine Learning Research, 14:801–805, 2013a.

Curtin, Ryan R., March, William B., Ram, Parikshit, Anderson, David V., Gray, Alexander G., and Isbell Jr., Charles L. Tree-independent dual-tree algorithms. In Proceedings of the 30th International Conference on Machine Learning, pp. 1435–1443, 2013b.

Friedman, Jerome H., Bentley, Jon Louis, and Finkel, Raphael Ari. An algorithm for finding best matches in logarithmic expected time. Transactions on Mathematical Software, 3(3):209–226, 1977.

Frigo, Matteo, Leiserson, Charles E., Prokop, Harald, and Ramachandran, Sridhar. Cache-oblivious algorithms. In 40th Annual Symposium on Foundations of Computer Science, pp. 285–297, 1999.

Joosten, Robbie P., te Beek, Tim A. H., Krieger, Elmar, Hekkelman, Maarten L., Hooft, Rob W. W., Schneider, Reinhard, Sander, Chris, and Vriend, Gert. A series of PDB related databases for everyday needs. January 2011.

Karger, David R. and Ruhl, Matthias. Finding nearest neighbors in growth-restricted metrics. In 34th Annual ACM Symposium on the Theory of Computing, pp. 741–750, 2002.

Krauthgamer, Robert and Lee, James R. Navigating nets: Simple algorithms for proximity search. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 798–807, Philadelphia, PA, USA, 2004.

Kumar, R. Deepak and Ramareddy, K. Design and implementation of cover tree algorithm on CUDA-compatible GPU. International Journal of Computer Applications, 3(7):24–27, June 2010.

Ling, Haibin and Okada, Kazunori. An efficient earth mover's distance algorithm for robust histogram comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29:853, 2007.

Lisitsyn, Sergey, Widmer, Christian, and Garcia, Fernando J. Iglesias. Tapkee: An efficient dimension reduction library. Journal of Machine Learning Research, 14:2355–2359, 2013.

Omohundro, Stephen Malvern. Five balltree construction algorithms. Technical Report 89-063, International Computer Science Institute, December 1989.

Pele, Ofir and Werman, Michael. Fast and robust earth mover's distances. In ICCV, 2009.

Ram, Parikshit, Lee, Dongryeol, March, William, and Gray, Alexander G. Linear-time algorithms for pairwise statistical problems. In Advances in Neural Information Processing Systems, 2010.

Rubner, Yossi, Tomasi, Carlo, and Guibas, Leonidas J. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision, pp. 59–66, 1998.

Schäfer, Tim, May, Patrick, and Koch, Ina. Computation and visualization of protein topology graphs including ligand information. In German Conference on Bioinformatics 2012, volume 26, pp. 108–118, Dagstuhl, Germany, 2012.

Segata, Nicola and Blanzieri, Enrico. Fast and scalable local kernel machines. Journal of Machine Learning Research, 11:1883–1926, August 2010.

Shervashidze, Nino, Schweitzer, Pascal, van Leeuwen, Erik Jan, Mehlhorn, Kurt, and Borgwardt, Karsten M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, November 2011.

Tziortziotis, Nikolaos, Dimitrakakis, Christos, and Blekas, Konstantinos. Cover tree Bayesian reinforcement learning. Journal of Machine Learning Research, 15, 2014.
Vishwanathan, S. V. N., Schraudolph, Nicol N., Kondor, Risi, and Borgwardt, Karsten M. Graph kernels. Journal of Machine Learning Research, 11:1201–1242, August 2010.