# Efficient Projection onto the Perfect Phylogeny Model

Bei Jia (jiabe@bc.edu), Surjyendu Ray (raysc@bc.edu), Sam Safavi (safavisa@bc.edu), José Bento (jose.bento@bc.edu)
Boston College. Bei Jia is currently with Element AI.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

**Abstract.** Several algorithms build on the perfect phylogeny model to infer evolutionary trees. This problem is particularly hard when evolutionary trees are inferred from the fraction of genomes that have mutations in different positions, across different samples. Existing algorithms might do extensive searches over the space of possible trees. At the center of these algorithms is a projection problem that assigns a fitness cost to phylogenetic trees. In order to perform a wide search over the space of trees, it is critical to solve this projection problem fast. In this paper, we use Moreau's decomposition for proximal operators, and a tree reduction scheme, to develop a new algorithm to compute this projection. Our algorithm terminates with an exact solution in a finite number of steps, and is extremely fast. In particular, it can search over all evolutionary trees with fewer than 11 nodes, a size relevant for several biological problems (more than 2 billion trees), in about 2 hours.

## 1 Introduction

The perfect phylogeny model (PPM) [1, 2] is used in biology to study evolving populations. It assumes that the same position in the genome never mutates twice; hence, mutations only accumulate.

Consider a population of organisms evolving under the PPM. The evolution process can be described by a labeled rooted tree, $T = (r, V, E)$, where $r$ is the root, i.e., the common oldest ancestor, the nodes $V$ are the mutants, and the edges $E$ are the mutations acquired between older and younger mutants. Since each position in the genome only mutates once, we can associate with each node $v \neq r$ a unique mutated position: the mutation associated with the ancestral edge of $v$. By convention, we associate with the root $r$ a null mutation that is shared by all mutants in $T$. This allows us to refer to each node $v \in V$ both as a mutation at a position in the genome (the mutation associated with the ancestral edge of $v$) and as a mutant (the mutant with the fewest mutations that has mutation $v$). Hence, without loss of generality, $V = \{1, \dots, q\}$ and $E = \{2, \dots, q\}$, where $q$ is the length of the genome, and $r = 1$ refers both to the oldest common ancestor and to the null mutation shared by all.

One very important use of the PPM is to infer how the mutants of a common ancestor evolve [3–8]. A common type of data used for this purpose is the frequency with which different positions in the genome mutate across multiple samples, obtained, e.g., from whole-genome or targeted deep sequencing [9]. Consider a sample $s$, one of $p$ samples, obtained at a given stage of the evolution process. This sample contains many mutants, some with the same genome, some with different genomes. Let $F \in \mathbb{R}^{q \times p}$ be such that $F_{v,s}$ is the fraction of genomes in $s$ with a mutation at position $v$. Let $M \in \mathbb{R}^{q \times p}$ be such that $M_{v,s}$ is the fraction of mutant $v$ in $s$. By definition, the columns of $M$ must sum to 1. Let $U \in \{0, 1\}^{q \times q}$ be such that $U_{v,v'} = 1$ if and only if mutant $v$ is an ancestor of mutant $v'$, or $v = v'$. We denote the sets of all possible $U$ matrices, $M$ matrices, and labeled rooted trees $T$ by $\mathcal{U}$, $\mathcal{M}$, and $\mathcal{T}$, respectively. See Figure 1 for an illustration. The PPM implies

$$F = UM. \tag{1}$$

Our work contributes to the problem of inferring clonal evolution from mutation frequencies: how do we infer $M$ and $U$ from $F$? (Note that finding $U$ is the same as finding $T$; see Lemma B.2.)
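To make the model concrete, the following minimal sketch (our own illustration, not code from the paper; the 4-node tree and the values of $M$ are hypothetical) builds $U$ from parent pointers and evaluates $F = UM$ for a single sample. Note that $F_v$ is simply the total fraction of mutants in the subtree rooted at $v$.

```c
/* A minimal sketch (not from the paper) illustrating F = UM for a small
 * tree: node 1 is the root; parent[] gives the closest ancestor of each
 * node; U[v][s] = 1 iff v is an ancestor of s or v == s. */
#include <stdio.h>

#define Q 4 /* number of nodes/mutations, including the null mutation */

int main(void) {
    int parent[Q + 1] = {0, 0, 1, 1, 3};       /* 2, 3 children of root 1; 4 child of 3 */
    double M[Q + 1] = {0, 0.4, 0.3, 0.2, 0.1}; /* one sample; the column of M sums to 1 */
    double U[Q + 1][Q + 1] = {{0}};
    double F[Q + 1] = {0};

    /* U[v][s] = 1 iff v == s or v is an ancestor of s: walk up from s. */
    for (int s = 1; s <= Q; s++)
        for (int v = s; v != 0; v = parent[v])
            U[v][s] = 1.0;

    /* F = U M: F[v] is the fraction of genomes with a mutation at position v. */
    for (int v = 1; v <= Q; v++)
        for (int s = 1; s <= Q; s++)
            F[v] += U[v][s] * M[s];

    for (int v = 1; v <= Q; v++)
        printf("F[%d] = %.2f\n", v, F[v]); /* F[1] = 1.00 always: the null mutation */
    return 0;
}
```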
[Figure 1 here: three samples $s = 1, 2, 3$ of an evolving population, the evolutionary tree $T$ over mutants $g_1, \dots, g_7$, the matrix $U$, and the columns $(F_{1,3}, \dots, F_{7,3})$ and $(M_{1,3}, \dots, M_{7,3}) = (2, 2, 2, 1, 1, 1, 1)/10$ for sample $s = 3$.]

Figure 1: Black lines are genomes. Red circles indicate mutations. $g_i$ is the mutant with the fewest mutations that has position $i$ mutated. Mutation 1, the mutation at the null position $i = 1$, is shared by all mutants. $g_1$ is the organism before mutant evolution starts. In sample $s = 3$, 2/10 of the mutants are of type $g_2$, hence $M_{2,3} = 2/10$, and 3/10 of the mutations occur at position 7, hence $F_{7,3} = 3/10$. The tree shows the mutants' evolution.

Although model (1) is simple, simultaneously inferring $M$ and $U$ from $F$ can be hard [3]. One popular inference approach is the following optimization problem over $U$, $M$ and $F$:

$$\min_{U \in \mathcal{U}} \; C(U), \tag{2}$$

$$C(U) = \min_{M, F \in \mathbb{R}^{q \times p}} \; \|\hat{F} - F\| \quad \text{subject to} \quad F = UM, \; M \geq 0, \; M^\top \mathbf{1} = \mathbf{1}, \tag{3}$$

where $\|\cdot\|$ is the Frobenius norm, and $\hat{F} \in \mathbb{R}^{q \times p}$ contains the measured fractions of mutations per position in each sample, which are known and fixed. In a nutshell, we want to project our measurement $\hat{F}$ onto the space of valid PPM models.

Problem (2) is a hard mixed integer-continuous optimization problem. To approximately solve it, we might find a finite subset $\{U_i\} \subset \mathcal{U}$ that corresponds to a heuristically good subset of trees $\{T_i\} \subset \mathcal{T}$ and, for each fixed matrix $U_i$, solve (3), which is a convex optimization problem. We can then return $T_x$, where $x \in \arg\min_i C(U_i)$. Fortunately, in many biological applications, e.g., [3–8], the reconstructed evolutionary tree involves a very small number of mutated positions, e.g., $q \leq 11$. In practice, a position $v$ might be an effective position corresponding to a cluster of multiple real positions in the genome. For small $q$, we can compute $C(U)$ for many trees, and hence approximate $M$ and $U$ and obtain uncertainty measures for these estimates. This is important, since data is generally scarce and noisy.

Contributions: (i) we propose a new algorithm to compute $C(U)$ exactly in $O(q^2 p)$ steps, the first non-iterative algorithm to compute $C(U)$; (ii) we compare its performance against state-of-the-art iterative algorithms, and observe much faster convergence; in particular, our algorithm scales much better than $O(q^2 p)$ in practice; (iii) we implement our algorithm on a GPU, and show that it computes the cost of all (more than 2 billion) trees with 11 nodes in 2.5 hours.

## 2 Related work

A problem related to ours, but somewhat different, is that of inferring a phylogenetic tree from single-cell whole-genome sequencing data. Given all the mutations in a set of mutants, the problem is to arrange the mutants in a phylogenetic tree [10, 11]. Mathematically, this corresponds to inferring $T$ from a partial or corrupted observation of $U$. If the PPM is assumed, and all the mutations of all the mutants are correctly observed, this problem can be solved in linear time, e.g., [12]. In general, the problem is equivalent to finding a minimum-cost Steiner tree on a hypercube whose nodes and edges represent mutants and mutations, respectively, a problem known to be hard [13].

We now mention a few works on clonality inference, based on the PPM, that try to infer both $U$ and $M$ from $\hat{F}$.
No previous work solves problem (2) exactly in general, even for trees of size $q \leq 11$. Using our fast projection algorithm, we can solve (2) exactly by searching over all trees, if $q \leq 11$. Ref. [3] (AncesTree) reduces the space of possible trees $\mathcal{T}$ to subtrees of a heuristically constructed DAG. The authors use the element-wise 1-norm in (3) and, after introducing more variables to linearize the product $UM$, reduce this search to solving an MILP, which they try to solve via branch and bound. Ref. [6] (CITUP) searches the space of all unlabeled trees and, for each unlabeled tree, tries to solve an MIQP, again using branch and bound, which finds a labeling for the unlabeled tree while simultaneously minimizing the distance $\|\hat{F} - F\|$. Refs. [5] and [14] (PhyloSub/PhyloWGS) use a stochastic model to sample trees that are likely to explain the data. Their model is based on [15], which generates hierarchical clusterings of objects, from which lineage trees can be formed. A score is then computed for these trees, and the highest-scoring trees are returned.

Procedure (2) can be justified as maximum likelihood estimation if we assume the stochastic model $\hat{F} = F + N(0, I\sigma^2)$, where $F$, $U$ and $M$ satisfy the PPM, and $N(0, I\sigma^2)$ represents additive, component-wise, Gaussian measurement noise with zero mean and covariance $I\sigma^2$. Alternative stochastic models can be assumed, e.g., $M - U^{-1}\hat{F} = N(0, I\sigma^2)$, where $M$ is non-negative and its columns must sum to one, and $N(0, I\sigma^2)$ is as described before. For this model, and for each matrix $U$, the cost $C(U)$ is a projection of $U^{-1}\hat{F}$ onto the probability simplex $\{M : M \geq 0, M^\top \mathbf{1} = \mathbf{1}\}$. Several fast algorithms are known for this problem, e.g., [16–20] and references therein. In a $pq$-dimensional space, the exact projection onto the simplex can be done in $O(qp)$ steps.

Our algorithm is the first to solve (3) exactly in a finite number of steps. We can also use iterative methods to solve (3). One advantage of our algorithm is that it has no tuning parameters, and requires no effort to check for convergence at a given accuracy. Since iterative algorithms can converge very fast, we numerically compare the speed of our algorithm with different implementations of the Alternating Direction Method of Multipliers (ADMM) [21], which, if properly tuned, has a convergence rate that equals the fastest convergence rate among all first-order methods [22] under some convexity assumptions, and is known to produce good solutions for several other kinds of problems, even non-convex ones [23–29].

## 3 Main results

We now state our main results and explain the ideas behind their proofs. Detailed proofs can be found in the Appendix. Our algorithm computes $C(U)$ and the minimizers $M^*$ and $F^*$ of (3) by solving an equivalent problem. Without loss of generality, we assume that $p = 1$, since, by squaring the objective in (3), the problem decomposes into $p$ independent problems. Sometimes we denote $C(U)$ by $C(T)$, since given $U$ we can specify $T$, and vice-versa.

Let $\uparrow i$ be the closest ancestor of $i$ in $T = (r, V, E)$. Let $\Lambda_i$ be the set of all ancestors of $i$ in $T$, plus $i$ itself. Let $\delta_i$ be the set of children of $i$ in $T$.

**Theorem 3.1** (Equivalent formulation). Problem (3) can be solved by solving

$$\min_{t \in \mathbb{R}} \; t + L(t), \tag{4}$$

$$L(t) = \min_{Z \in \mathbb{R}^q} \; \frac{1}{2} \sum_{i \in V} (Z_i - Z_{\uparrow i})^2 \quad \text{subject to} \quad Z_i \leq t - N_i, \; \forall i \in V, \tag{5}$$

where $N_i = \sum_{j \in \Lambda_i} \hat{F}_j$ and, by convention, $Z_{\uparrow i} = 0$ for $i = r$. In particular, if $t^*$ minimizes (4), $Z^*$ minimizes (5) for $t = t^*$, and $M^*$, $F^*$ minimize (3), then

$$M^*_i = Z^*_{\uparrow i} - Z^*_i + \sum_{r \in \delta_i} (Z^*_r - Z^*_{\uparrow r}) \quad \text{and} \quad F^*_i = Z^*_{\uparrow i} - Z^*_i, \quad \forall i \in V. \tag{6}$$

Furthermore, $t^*$, $M^*$, $F^*$ and $Z^*$ are unique.
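To make the change of variables (6) concrete, here is a minimal sketch (our own illustration, not from the paper; the tree and the vector $Z^*$ below are hypothetical) that maps a minimizer $Z^*$ of (5) to $(M^*, F^*)$. The second relation used, $M^*_i = F^*_i - \sum_{r \in \delta_i} F^*_r$, is simply the inversion of $F = UM$ on a tree.

```c
/* A minimal sketch of the change of variables (6), on a hypothetical 4-node
 * tree (node 1 = root). Z[0] plays the role of Z_{parent(root)} = 0. */
#include <stdio.h>

#define Q 4

int main(void) {
    int parent[Q + 1] = {0, 0, 1, 1, 3};             /* parent[i] = closest ancestor of i */
    double Z[Q + 1] = {0.0, -1.0, -1.3, -1.2, -1.3}; /* a hypothetical minimizer of (5) */
    double F[Q + 1], M[Q + 1];

    for (int i = 1; i <= Q; i++)
        F[i] = Z[parent[i]] - Z[i];                  /* F*_i = Z*_{parent(i)} - Z*_i */

    for (int i = 1; i <= Q; i++)
        M[i] = F[i];
    for (int i = 2; i <= Q; i++)
        M[parent[i]] -= F[i];                        /* M*_i = F*_i - sum of F*_r over children r */

    double colsum = 0;
    for (int i = 1; i <= Q; i++) {
        printf("F[%d] = %.2f   M[%d] = %.2f\n", i, F[i], i, M[i]);
        colsum += M[i];
    }
    printf("sum of M = %.2f\n", colsum);             /* prints 1.00: the column of M sums to 1 */
    return 0;
}
```

With these values, the sketch prints $F = (1.0, 0.3, 0.2, 0.1)$ and $M = (0.5, 0.3, 0.1, 0.1)$, a valid PPM pair.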
Theorem 3.1 comes from a dual form of (3), which we build using Moreau's decomposition [30].

### 3.1 Useful observations

Let $Z^*(t)$ be the unique minimizer of (5) for a given $t$. The main ideas behind our algorithm depend on a few simple properties of the paths $\{Z^*(t)\}$ and $\{L'(t)\}$, where $L'(t)$ is the derivative of $L(t)$ with respect to $t$. Note that $L$ is also a function of $N$, as defined in Theorem 3.1, which depends on the input data $\hat{F}$.

**Lemma 3.2.** $L(t)$ is a convex function of $t$ and $N$. Furthermore, $L(t)$ is continuous in $t$ and $N$, and $L'(t)$ is non-decreasing with $t$.

**Lemma 3.3.** $Z^*(t)$ is continuous as a function of $t$ and $N$. $Z^*(t^*)$ is continuous as a function of $N$.

Let $B(t) = \{i : Z^*(t)_i = t - N_i\}$, i.e., the set of components of the solution at the boundary of (5). Variables in $B$ are called fixed, and we call the other variables free. Free (resp. fixed) nodes are the nodes corresponding to free (resp. fixed) variables.

**Lemma 3.4.** $B(t)$ is piecewise constant in $t$.

Consider dividing the tree $T = (r, V, E)$ into subtrees, each with at least one free node, using $B(t)$ as separation points. See Figure 4 in Appendix A for an illustration. Each $i \in B(t)$ belongs to at most $\mathrm{degree}(i)$ different subtrees, where $\mathrm{degree}(i)$ is the degree of node $i$, and each $i \in V \setminus B(t)$ belongs to exactly one subtree. Let $T_1, \dots, T_k$ be the set of resulting (rooted, labeled) trees. Let $T_w = (r_w, V_w, E_w)$, where the root $r_w$ is the node of $T_w$ closest to $r$. We call $\{T_w\}$ the subtrees induced by $B(t)$. We define $B_w(t) = B(t) \cap V_w$ and, when it does not create ambiguity, we drop the index $t$ in $B_w(t)$. Note that different $B_w(t)$'s might have elements in common. Also note that, by construction, if $i \in B_w$, then $i$ must be a leaf of $T_w$ or the root of $T_w$.

**Definition 3.5.** The $(T_w, B_w)$-problem is the optimization problem over the $|V_w \setminus B(t)|$ variables

$$\min_{\{Z_j : j \in V_w \setminus B(t)\}} \; \frac{1}{2} \sum_{j \in V_w} (Z_j - Z_{\uparrow j})^2, \tag{7}$$

where $\uparrow j$ is the parent of $j$ in $T_w$, $Z_{\uparrow j} = 0$ if $j = r_w$, and $Z_j = Z^*(t)_j = t - N_j$ if $j \in B_w(t)$.

**Lemma 3.6.** Problem (5) decomposes into $k$ independent problems. In particular, the minimizers $\{Z^*(t)_j : j \in V_w \setminus B(t)\}$ are determined as the solution of the $(T_w, B_w)$-problem. If $j \in V_w$, then $Z^*(t)_j = c_1 t + c_2$, where $c_1$ and $c_2$ depend on $j$ but not on $t$, and $0 \leq c_1 \leq 1$.

**Lemma 3.7.** $Z^*(t)$ and $L'(t)$ are piecewise linear and continuous in $t$. Furthermore, $Z^*(t)$ and $L'(t)$ change linear segments if and only if $B(t)$ changes.

**Lemma 3.8.** If $t \leq t'$, then $B(t') \subseteq B(t)$. In particular, $B(t)$ changes at most $q$ times with $t$.

**Lemma 3.9.** $Z^*(t)$ and $L'(t)$ have fewer than $q + 1$ different linear segments.

### 3.2 The Algorithm

In a nutshell, our algorithm computes the solution path $\{Z^*(t)\}_{t \in \mathbb{R}}$ and the derivative path $\{L'(t)\}_{t \in \mathbb{R}}$. From these paths, it finds the unique $t^*$ at which

$$\frac{d}{dt}\bigl(t + L(t)\bigr)\Big|_{t = t^*} = 0 \iff L'(t^*) = -1. \tag{8}$$

It then evaluates the path $Z^*(t)$ at $t = t^*$, and uses this value, along with (6), to find $M^*$ and $F^*$, the unique minimizers of (3). Finally, we compute $C(T) = \|\hat{F} - F^*\|$.

We know that $\{Z^*(t)\}$ and $\{L'(t)\}$ are continuous piecewise linear, with a finite number of different linear segments (Lemmas 3.7, 3.8 and 3.9). Hence, to describe $\{Z^*(t)\}$ and $\{L'(t)\}$, we only need to evaluate them at the critical values $t_1 > t_2 > \dots > t_k$ at which $Z^*(t)$ and $L'(t)$ change linear segments. We will later use Lemma 3.7 as a criterion to find the critical values. Namely, $\{t_i\}$ are the values of $t$ at which, as $t$ decreases, new variables become fixed and $B(t)$ changes. Note that variables never become free once fixed, by Lemma 3.8, which also implies that $k \leq q$. The values $\{Z^*(t_i)\}$ and $\{L'(t_i)\}$ are computed sequentially as follows.
If $t$ is very large, the constraint in (5) is not active, and $Z^*(t) = 0$, $L(t) = L'(t) = 0$. Lemma 3.7 tells us that, as we decrease $t$, the first critical value is the largest $t$ for which this constraint becomes active, and at which $B(t)$ changes for the first time. Hence, for $i = 1$, we have $t_i = \max_s \{N_s\}$, $Z^*(t_i) = 0$, $L'(t_i) = 0$, and $B(t_i) = \arg\max_s \{N_s\}$. Once we have $t_i$, we compute the rates $Z'(t_i)$ and $L''(t_i)$ from $B(t_i)$ and $T$, as explained in Section 3.3. Since the paths are piecewise linear, derivatives are not defined at the critical points. Hence, here, and throughout this section, these derivatives are taken from the left, i.e., $Z'(t_i) = \lim_{t \uparrow t_i} (Z^*(t_i) - Z^*(t))/(t_i - t)$ and $L''(t_i) = \lim_{t \uparrow t_i} (L'(t_i) - L'(t))/(t_i - t)$.

Since $Z'(t)$ and $L''(t)$ are constant for $t \in (t_{i+1}, t_i]$, for $t \in (t_{i+1}, t_i]$ we have

$$Z^*(t) = Z^*(t_i) + (t - t_i) Z'(t_i), \qquad L'(t) = L'(t_i) + (t - t_i) L''(t_i), \tag{9}$$

and the next critical value, $t_{i+1}$, is the largest $t < t_i$ at which new variables become fixed and $B(t)$ changes. The value $t_{i+1}$ is found by solving for $t < t_i$ in

$$Z^*(t)_r = Z^*(t_i)_r + (t - t_i) Z'(t_i)_r = t - N_r, \tag{10}$$

and keeping the largest solution among all $r \notin B$. Once $t_{i+1}$ is computed, we update $B$ with the new variables that became fixed, and we obtain $Z^*(t_{i+1})$ and $L'(t_{i+1})$ from (9). The process then repeats.

By Lemma 3.2, $L'$ never increases as $t$ decreases. Hence, we stop this process (a) as soon as $L'(t_i) < -1$, or (b) when all the variables are in $B$, and thus there are no more critical values to compute. In case (a), let $t_k$ be the last critical value with $L'(t_k) \geq -1$; in case (b), let $t_k$ be the last computed critical value. We use $t_k$ and (9) to compute $t^*$, at which $L'(t^*) = -1$, and also $Z^*(t^*)$. From $Z^*(t^*)$ we then compute $M^*$ and $F^*$, and $C(U) = \|\hat{F} - F^*\|$.

The algorithm is shown compactly in Alg. 1. Its inputs are $\hat{F}$ and $T$, represented, e.g., using a linked-nodes data structure. Its outputs are the minimizers of (3). It makes use of a procedure ComputeRates, which we explain later; this procedure terminates in $O(q)$ steps and uses $O(q)$ memory. Line 5 comes from solving (10) for $t$. In line 14, the symbols $M^*(Z^*, T)$ and $F^*(Z^*, T)$ remind us that $M^*$ and $F^*$ are computed from $Z^*$ and $T$ using (6). The correctness of Alg. 1 follows from the lemmas in Section 3.1 and the explanation above. In particular, since there are at most $q + 1$ different linear regimes, the bound $q$ in the for-loop does not prevent us from finding any critical value. Its time complexity is $O(q^2)$, since each line completes in $O(q)$ steps and is executed at most $q$ times.

**Theorem 3.10** (Complexity). Algorithm 1 finishes in $O(q^2)$ steps, and requires $O(q)$ memory.

**Theorem 3.11** (Correctness). Algorithm 1 outputs the solution to (3).

**Algorithm 1** Projection onto the PPM (input: $T$ and $\hat{F}$; output: $M^*$ and $F^*$)

1: $N_i = \sum_{j \in \Lambda_i} \hat{F}_j$ for all $i \in V$ ▷ This takes $O(q)$ steps using a DFS; see proof of Theorem 3.10
2: $i = 1$, $t_i = \max_r \{N_r\}$, $B(t_i) = \arg\max_r \{N_r\}$, $Z^*(t_i) = 0$, $L'(t_i) = 0$ ▷ Initialize
3: for $i = 1$ to $q$ do
4: &nbsp;&nbsp;$(Z'(t_i), L''(t_i)) = \text{ComputeRates}(B(t_i), T)$ ▷ Update rates of change
5: &nbsp;&nbsp;$P = \bigl\{P_r : P_r = \frac{N_r + Z^*(t_i)_r - t_i Z'(t_i)_r}{1 - Z'(t_i)_r}$ if $r \notin B(t_i)$ and $P_r < t_i$, and $P_r = -\infty$ otherwise$\bigr\}$
6: &nbsp;&nbsp;$t_{i+1} = \max_r P_r$ ▷ Update the next critical value from (9)
7: &nbsp;&nbsp;$B(t_{i+1}) = B(t_i) \cup \arg\max_r P_r$ ▷ Update the list of fixed variables
8: &nbsp;&nbsp;$Z^*(t_{i+1}) = Z^*(t_i) + (t_{i+1} - t_i) Z'(t_i)$ ▷ Update the solution path
9: &nbsp;&nbsp;$L'(t_{i+1}) = L'(t_i) + (t_{i+1} - t_i) L''(t_i)$ ▷ Update the objective's derivative
10: &nbsp;&nbsp;if $L'(t_{i+1}) < -1$ then break ▷ If we have already passed $t^*$, exit the for-loop
11: end for
12: $t^* = t_i - \frac{1 + L'(t_i)}{L''(t_i)}$ ▷ Find the solution to (8)
13: $Z^* = Z^*(t_i) + (t^* - t_i) Z'(t_i)$ ▷ Find the minimizer of (5) for $t = t^*$
14: return $M^*(Z^*, T)$, $F^*(Z^*, T)$ ▷ Return the solution to (3) using (6), which takes $O(q)$ steps

### 3.3 Computing the rates

We now explain how the procedure ComputeRates works. Recall that it takes as input the tree $T$ and the set $B(t_i)$, and that it outputs the derivatives $Z'(t_i)$ and $L''(t_i)$. A simple calculation shows that once we have $Z'(t_i)$, computing $L''(t_i)$ is easy.

**Lemma 3.12.** $L''(t_i)$ can be computed from $Z'(t_i)$ in $O(q)$ steps and with $O(1)$ memory as

$$L''(t_i) = \sum_{j \in V} \bigl(Z'(t_i)_j - Z'(t_i)_{\uparrow j}\bigr)^2, \tag{11}$$

where $\uparrow j$ is the closest ancestor of $j$ in $T$.

We note that if $j \in B(t_i)$, then, by definition, $Z'(t_i)_j = 1$. Assume now that $j \in V \setminus B(t_i)$. Lemma 3.6 implies that we can find $Z'(t_i)_j$ by solving the $(T_w = (r_w, V_w, E_w), B_w)$-problem as a function of $t$, where $w$ is such that $j \in V_w$. In a nutshell, ComputeRates is a recursive procedure that solves all the $(T_w, B_w)$-problems as explicit functions of $t$. It suffices to explain how ComputeRates solves one particular $(T_w, B_w)$-problem explicitly. To simplify notation, in the rest of this section we refer to $T_w$ and $B_w$ as $T$ and $B$. Recall that, by the definition of $T = T_w$ and $B = B_w$, if $i \in B$, then $i$ must be a leaf of $T$ or the root of $T$.

**Definition 3.13.** Consider a rooted tree $T = (r, V, E)$, a set $B \subseteq V$, and variables $\{Z_j : j \in V\}$ such that, if $j \in B$, then $Z_j = \alpha_j t + \beta_j$ for some $\alpha$ and $\beta$. We define the $(T, B, \alpha, \beta, \gamma)$-problem as

$$\min_{\{Z_j : j \in V \setminus B\}} \; \frac{1}{2} \sum_{j \in V} \gamma_j (Z_j - Z_{\uparrow j})^2, \tag{12}$$

where $\gamma > 0$, $\uparrow j$ is the closest ancestor of $j$ in $T$, and $Z_{\uparrow j} = 0$ if $j = r$. We refer to the solution of the $(T, B, \alpha, \beta, \gamma)$-problem as $\{Z^*_j : j \in V \setminus B\}$, which uniquely minimizes (12).

Note that (12) is unconstrained, and its solution, $Z^*$, is a linear function of $t$. Furthermore, the $(T_w, B_w)$-problem is the same as the $(T_w, B_w, 1, -N, 1)$-problem, which is what we actually solve. We now state three useful lemmas that help us solve any $(T, B, \alpha, \beta, \gamma)$-problem efficiently.

**Lemma 3.14** (Pruning). Consider the solution $Z^*$ of the $(T, B, \alpha, \beta, \gamma)$-problem. Let $j \in V \setminus B$ be a leaf. Then $Z^*_j = Z^*_{\uparrow j}$. Furthermore, consider the $(\tilde{T}, B, \alpha, \beta, \gamma)$-problem, where $\tilde{T} = (\tilde{r}, \tilde{V}, \tilde{E})$ is equal to $T$ with node $j$ pruned, and let its solution be $\tilde{Z}^*$. We have that $Z^*_i = \tilde{Z}^*_i$ for all $i \in \tilde{V}$.

**Lemma 3.15** (Star problem). Let $T$ be a star in which node 1 is the center node, node 2 is the root, and nodes $3, \dots, r$ are leaves. Let $B = \{2, \dots, r\}$. Let $Z^*_1 \in \mathbb{R}$ be the solution of the $(T, B, \alpha, \beta, \gamma)$-problem. Then

$$Z^*_1 = \frac{\gamma_1 \alpha_2 + \sum_{i=3}^{r} \gamma_i \alpha_i}{\gamma_1 + \sum_{i=3}^{r} \gamma_i} \, t \; + \; \frac{\gamma_1 \beta_2 + \sum_{i=3}^{r} \gamma_i \beta_i}{\gamma_1 + \sum_{i=3}^{r} \gamma_i}. \tag{13}$$

In particular, to find the rate at which $Z^*_1$ changes with $t$, we only need to know $\alpha$ and $\gamma$, not $\beta$.

**Lemma 3.16** (Reduction). Consider the $(T, B, \alpha, \beta, \gamma)$-problem such that $j, \uparrow j \in V \setminus B$, and such that $j$ has all of its children $1, \dots, r \in B$. Let $Z^*$ be its solution. Consider the $(\tilde{T}, \tilde{B}, \tilde{\alpha}, \tilde{\beta}, \tilde{\gamma})$-problem, where $\tilde{T} = (\tilde{r}, \tilde{V}, \tilde{E})$ is equal to $T$ with nodes $1, \dots, r$ removed, and $\tilde{B} = (B \setminus \{1, \dots, r\}) \cup \{j\}$. Let $\tilde{Z}^*$ be its solution.
If $(\tilde{\alpha}_i, \tilde{\beta}_i, \tilde{\gamma}_i) = (\alpha_i, \beta_i, \gamma_i)$ for all $i \in \tilde{B} \setminus \{1, \dots, r\}$, and $\tilde{\alpha}_j$, $\tilde{\beta}_j$ and $\tilde{\gamma}_j$ satisfy

$$\tilde{\alpha}_j = \frac{\sum_{i=1}^{r} \gamma_i \alpha_i}{\sum_{i=1}^{r} \gamma_i}, \qquad \tilde{\beta}_j = \frac{\sum_{i=1}^{r} \gamma_i \beta_i}{\sum_{i=1}^{r} \gamma_i}, \qquad \tilde{\gamma}_j = \frac{\gamma_j \sum_{i=1}^{r} \gamma_i}{\gamma_j + \sum_{i=1}^{r} \gamma_i}, \tag{14}$$

then $Z^*_i = \tilde{Z}^*_i$ for all $i \in \tilde{V} \setminus \{j\}$.

Lemma 3.15 and Lemma 3.16 allow us to recursively solve any $(T, B, \alpha, \beta, \gamma)$-problem, and obtain for it an explicit solution of the form $Z^*(t) = c_1 t + c_2$, where $c_1$ and $c_2$ do not depend on $t$. Assume that we have already repeatedly pruned $T$, by repeatedly invoking Lemma 3.14, so that every leaf of $T$ is in $B$. See Figure 2 (left). First, we find some node $j \in V \setminus B$ such that all of its children are in $B$. If $\uparrow j \in B$, then $\uparrow j$ must be the root, and the $(T, B, \alpha, \beta, \gamma)$-problem must be a star problem as in Lemma 3.15, which we can solve explicitly. Alternatively, if $\uparrow j \notin B$, then we invoke Lemma 3.16, and reduce the $(T, B, \alpha, \beta, \gamma)$-problem to a strictly smaller $(\tilde{T}, \tilde{B}, \tilde{\alpha}, \tilde{\beta}, \tilde{\gamma})$-problem, which we solve recursively. Once the $(\tilde{T}, \tilde{B}, \tilde{\alpha}, \tilde{\beta}, \tilde{\gamma})$-problem is solved, we have an explicit expression $\tilde{Z}^*_i(t) = c_{1i} t + c_{2i}$ for all $i \in \tilde{V} \setminus \{j\}$ and, in particular, an explicit expression for $\tilde{Z}^*_{\uparrow j}(t)$. The only free variable of the $(T, B, \alpha, \beta, \gamma)$-problem still to be determined is $Z^*_j(t)$. To compute $Z^*_j(t)$, we apply Lemma 3.15 to the $(\bar{T}, \bar{B}, \bar{\alpha}, \bar{\beta}, \bar{\gamma})$-problem, where $\bar{T}$ is a star around $j$, $\bar{\gamma}$ are the components of $\gamma$ corresponding to the neighbors of $j$, $\bar{\alpha}$ and $\bar{\beta}$ are such that $Z^*_i(t) = \bar{\alpha}_i t + \bar{\beta}_i$ for all neighbors $i$ of $j$, for which $Z^*_i(t)$ is already known, and $\bar{B}$ contains all the neighbors of $j$. See Figure 2 (right).

The algorithm is compactly described in Alg. 2. It is slightly different from the description above, for computational efficiency. Instead of computing $Z^*(t) = c_1 t + c_2$, we keep track only of $c_1$, the rates, and we do so only for the variables in $V \setminus B$. The algorithm assumes that the input $T$ has been pruned. The inputs $T$, $B$, $\alpha$, $\beta$ and $\gamma$ are passed by reference. They are modified inside the algorithm but, once ComputeRatesRec finishes, they keep their initial values. Throughout the execution of the algorithm, $T = (r, V, E)$ encodes (a) a doubly-linked list in which each node points to its children and to its parent, which we call T.a, and (b) a doubly-linked list of all the nodes in $V \setminus B$ whose children are all in $B$, which we call T.b. In the proof of Theorem 3.17, we show how this representation of $T$ can be kept updated with little computational effort. The input $Y$, also passed by reference, starts as an uninitialized array of size $q$, in which we store the rates. At the end, we read $Z'$ from $Y$.

**Algorithm 2** ComputeRatesRec (input: $T = (r, V, E)$, $B$, $\alpha$, $\beta$, $\gamma$, $Y$)

1: Let $j$ be some node in $V \setminus B$ whose children are all in $B$ ▷ We read $j$ from T.b in $O(1)$ steps
2: if $\uparrow j \in B$ then
3: &nbsp;&nbsp;Set $Y_j$ using (13) in Lemma 3.15 ▷ If $\uparrow j \in B$, the $(T, B, \alpha, \beta, \gamma)$-problem is star-shaped
4: else
5: &nbsp;&nbsp;Modify $(T, B, \alpha, \beta, \gamma)$ to match the $(\tilde{T}, \tilde{B}, \tilde{\alpha}, \tilde{\beta}, \tilde{\gamma})$-problem defined by Lemma 3.16 for the node $j$ of line 1
6: &nbsp;&nbsp;ComputeRatesRec($T$, $B$, $\alpha$, $\beta$, $\gamma$, $Y$) ▷ Sets $Y_i = Z'_i$ for all $i \in V \setminus B$; $Y_j$ is not yet defined
7: &nbsp;&nbsp;Restore $(T, B, \alpha, \beta, \gamma)$ to its value before line 5 was executed
8: &nbsp;&nbsp;Compute $Y_j$ from (13), using for $\alpha$, $\beta$, $\gamma$ in (13) the values $\bar{\alpha}$, $\bar{\beta}$, $\bar{\gamma}$, where $\bar{\gamma}$ are the components of $\gamma$ corresponding to the neighbors of $j$ in $T$, and $\bar{\alpha}$ and $\bar{\beta}$ are such that $Z^*_i = \bar{\alpha}_i t + \bar{\beta}_i$ for all neighbors $i$ of $j$ in $T$, for which $Z^*_i$ is already known
9: end if

Let $q$ be the number of nodes of the tree $T$ that is the input at the zeroth level of the recursion.

**Theorem 3.17.** Algorithm 2 correctly computes $Z'$ for the $(T, B, \alpha, \beta, \gamma)$-problem, and it can be implemented to finish in $O(q)$ steps and to use $O(q)$ memory.
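To make the recursion concrete, here is a minimal sketch in C (our own illustration, not the paper's implementation; the names `Line`, `star_solve` and `reduce` are ours) of the two primitives it alternates: the star solve (13) and the reduction (14). The example in `main` solves a 4-node chain, root (fixed) to `a` (free) to `b` (free) to leaf (fixed), with unit weights and fixed lines of the form $Z = t - N$, as in the $(T_w, B_w, 1, -N, 1)$-problem.

```c
/* A minimal sketch of the primitives behind ComputeRatesRec. A fixed node i
 * carries the line Z_i(t) = alpha_i * t + beta_i. Compile as C99. */
#include <stdio.h>

typedef struct { double alpha, beta; } Line; /* Z(t) = alpha*t + beta */

/* Lemma 3.15: free center with edge weight g_par to a fixed parent 'par',
 * and n fixed children ch[] with edge weights g[]. Returns the center's line. */
static Line star_solve(double g_par, Line par, int n, const double g[], const Line ch[]) {
    double W = g_par, A = g_par * par.alpha, B = g_par * par.beta;
    for (int i = 0; i < n; i++) {
        W += g[i];
        A += g[i] * ch[i].alpha;
        B += g[i] * ch[i].beta;
    }
    return (Line){ A / W, B / W };
}

/* Lemma 3.16: a free node j (with a free parent) whose n children are all
 * fixed is replaced by a fixed node carrying the children's weighted-average
 * line; *new_gamma combines g_j with the children's total weight as in (14). */
static Line reduce(int n, const double g[], const Line ch[], double g_j, double *new_gamma) {
    double G = 0, A = 0, B = 0;
    for (int i = 0; i < n; i++) { G += g[i]; A += g[i] * ch[i].alpha; B += g[i] * ch[i].beta; }
    *new_gamma = g_j * G / (g_j + G);
    return (Line){ A / G, B / G };
}

int main(void) {
    Line root = {1.0, -0.5}, leaf = {1.0, -2.0};   /* fixed: Z = t - 0.5 and Z = t - 2.0 */
    double gamma_b;
    /* Reduce: fold the fixed leaf into b, making b a fixed node of a smaller problem. */
    Line b_fixed = reduce(1, (double[]){1.0}, (Line[]){leaf}, 1.0, &gamma_b);
    /* Solve the smaller problem: a is a star center between root and b_fixed. */
    Line a = star_solve(1.0, root, 1, (double[]){gamma_b}, (Line[]){b_fixed});
    /* Back-substitute: b is a star center between a (now known) and the leaf. */
    Line b = star_solve(1.0, a, 1, (double[]){1.0}, (Line[]){leaf});
    printf("Z_a(t) = %.3f*t %+.3f\n", a.alpha, a.beta);
    printf("Z_b(t) = %.3f*t %+.3f\n", b.alpha, b.beta);
    return 0;
}
```

Running it prints $Z_a(t) = t - 1.000$ and $Z_b(t) = t - 1.500$, which matches minimizing the chain objective $\frac{1}{2}[(Z_a - Z_r)^2 + (Z_b - Z_a)^2 + (Z_l - Z_b)^2]$ directly.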
The correctness of Algorithm 2 follows from Lemmas 3.14–3.16 and the explanation above. Its complexity is bounded by the total time spent on the two lines that actually compute rates during the whole recursion, lines 3 and 8. All the other lines only transform the input problem into a more computable form. Lines 3 and 8 solve a star-shaped problem with at most $\mathrm{degree}(j)$ variables which, by inspecting (13), we know can be done in $O(\mathrm{degree}(j))$ steps. Since $j$ never takes the same value twice, the overall complexity is bounded by $O(\sum_{j \in V} \mathrm{degree}(j)) = O(|E|) = O(q)$. The $O(q)$ bound on memory is possible because all the variables that occupy significant memory are passed by reference, and are modified in place during the whole recursive procedure.

Figure 2: Red squares represent fixed nodes, and black circles free nodes. (Left) By repeatedly invoking Lemma 3.14, we can remove nodes 2, 3, and 4 from the original problem, since their associated optimal values are equal to the optimal value for node 1. (Right) We can compute the rates for all the free nodes of a subtree recursively, by applying Lemma 3.16 (reduce the problem and solve the smaller problem) and Lemma 3.15 (use the solution to define and solve a star problem around $j$). We know the linear behavior of the variables associated with red squares.

The following lemma shows how the recursive procedure that solves a $(T, B, \alpha, \beta, \gamma)$-problem can be used to compute the rates of change of $Z^*(t)$ for a $(T, B)$-problem. Its proof follows from the observation that the rate of change of the solution (13) with $t$ in Lemma 3.15 depends only on $\alpha$ and $\gamma$, and that the reduction equations (14) in Lemma 3.16 never make $\alpha$ or $\gamma$ depend on $\beta$.

**Lemma 3.18** (Rates only). Let $Z^*(t)$ be the solution of the $(T, B)$-problem, and let $\bar{Z}^*(t)$ be the solution of the $(T, B, 1, 0, 1)$-problem. Then $Z^*(t) = c_1 t + c_2$ and $\bar{Z}^*(t) = c_1 t$, for some $c_1$ and $c_2$.

We finally present the full algorithm that computes $Z'(t_i)$ and $L''(t_i)$ from $T$ and $B(t_i)$.

**Algorithm 3** ComputeRates (input: $T$ and $B(t_i)$; output: $Z'(t_i)$ and $L''(t_i)$)

1: $Z'(t_i)_j = 1$ for all $j \in B(t_i)$
2: for each $(T_w, B_w)$-problem induced by $B(t_i)$ do
3: &nbsp;&nbsp;Set $\tilde{T}_w$ to be $T_w$ pruned of all leaf nodes not in $B_w$, by repeatedly invoking Lemma 3.14
4: &nbsp;&nbsp;ComputeRatesRec($\tilde{T}_w$, $B_w$, $1$, $0$, $1$, $Y$)
5: &nbsp;&nbsp;$Z'(t_i)_j = Y_j$ for all $j \in V_w \setminus B_w$
6: end for
7: Compute $L''(t_i)$ from $Z'(t_i)$ using Lemma 3.12
8: return $Z'(t_i)$ and $L''(t_i)$

The following theorem follows almost directly from Theorem 3.17.

**Theorem 3.19.** Algorithm 3 correctly computes $Z'(t_i)$ and $L''(t_i)$ in $O(q)$ steps, and uses $O(q)$ memory.

## 4 Reducing computation time in practice

Our numerical results are obtained with an improved version of Algorithm 1. We now explain the main ideas behind this improvement. The bulk of the complexity of Alg. 1 comes from line 4, i.e., computing the rates $\{Z'(t_i)_j\}_{j \in V \setminus B(t_i)}$ from $B(t_i)$ and $T$. For a fixed $j \in V \setminus B(t_i)$, and by Lemma 3.6, the rate $Z'(t_i)_j$ depends only on one particular $(T_w = (r_w, V_w, E_w), B_w)$-problem induced by $B(t_i)$. If exactly this same problem is induced by both $B(t_i)$ and $B(t_{i+1})$, which happens if the new nodes that become fixed in line 7 of round $i$ of Algorithm 1 are not in $V_w \setminus B_w$, then we can save computation time in round $i + 1$ by not recomputing any rates for $j \in V_w \setminus B_w$, and by using for $Z'(t_{i+1})_j$ the value $Z'(t_i)_j$.
Furthermore, if only a few $\{Z'_j\}$ change from round $i$ to round $i + 1$, then we can also save computation time when updating $L''$ from $Z'$: we subtract from the sum on the right-hand side of (11) the terms that depend on the previous, now changed, rates, and add the new terms that depend on the new rates. Finally, if the rate $Z'_j$ does not change, then the value of $t < t_i$ at which $Z^*_j(t)$ might intersect $t - N_j$, and become fixed, given by $P_j$ in line 5, also does not change. (Note that this is not obvious from the formula for $P_r$ in line 5.) If not all $\{P_r\}$ change from round $i$ to round $i + 1$, we can also save computation time in computing the maximum, and the maximizers, in line 7 by storing $P$ in a maximum binary heap, and executing lines 5 and 7 by extracting all the maximal values from the top of the heap. Each time any $P_r$ changes, the heap needs to be updated.

## 5 Numerical results

Our algorithm, which solves (3) exactly in a finite number of steps, is of interest in itself. Still, it is interesting to compare it with other algorithms. In particular, we compare the convergence rate of our algorithm with two popular methods that solve (3) iteratively: the Alternating Direction Method of Multipliers (ADMM) and the Projected Gradient Descent (PGD) method. We apply ADMM and PGD to both the primal formulation (3) and the dual formulation (4). We implemented all the algorithms in C, and derived closed-form updates for ADMM and PGD; see Appendix F. We ran all algorithms on a single core of an Intel Core i5 2.5 GHz processor.

Figure 3 (left) compares the different algorithms for a random Galton–Watson input tree, truncated to have $q = 1000$ nodes, with the number of children of each node chosen uniformly within a fixed range, and a random input $\hat{F} \in \mathbb{R}^q$ with entries chosen i.i.d. from a normal distribution. We observed the same behavior for all random instances tested. We gave ADMM and PGD an advantage by optimally tuning them for each individual problem instance tested. In contrast, our algorithm requires no tuning, which is a clear advantage. At each iteration, the error is measured as $\max_j \{|M_j - M^*_j|\}$. Our algorithm is about 74× faster than its closest competitor (PGD-primal) at $10^{-3}$ accuracy. In Figure 3 (right), we show the average run time of our algorithm versus the problem size, for random inputs of the same form. The scaling of our algorithm is (almost) linear, and much faster than our $O(q^2 p)$, $p = 1$, theoretical bound.

[Figure 3 here: (left) error versus time, in seconds, for ADMM Primal, ADMM Dual, PGD Primal, PGD Dual, and our algorithm, which takes 0.0027 seconds; (right) average run time versus problem size, up to $q = 10000$.]

Figure 3: (Left) Time the different algorithms take to solve our problem for trees with 1000 nodes. (Right) Average run time of our algorithm for problems of different sizes. For each size, each point is averaged over 500 random problem instances.

Finally, we use our algorithm to solve (2) exactly, by computing $C(U)$ for all trees for a given input $\hat{F}$. Solving (2) exactly is very important for biology, since several relevant phylogenetic tree inference problems deal with trees of small size. We use an NVIDIA Quadro P5000 GPU to compute the cost of all possible trees with $q$ nodes in parallel, and return the tree with the smallest cost. Basically, we assign to each GPU virtual thread a unique tree, using Prüfer sequences [31], and then have each thread compute the cost of its tree.
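As background for this enumeration (a sketch of the tree indexing; our own illustration, not the paper's GPU kernel): by Cayley's formula there are $q^{q-2}$ labeled trees on $q$ nodes with a fixed root ($11^9 = 2{,}357{,}947{,}691 \approx 2.4$ billion for $q = 11$), and one natural mapping writes each thread index in base $q$ to obtain a Prüfer sequence, which decodes to a unique tree. The classic (here naive, $O(q^2)$) decoding loop:

```c
/* A minimal sketch: decode a Prüfer sequence into the edges of the unique
 * labeled tree it represents. Nodes are labeled 1..N; sequences have length N-2. */
#include <stdio.h>

#define N 5 /* there are N^(N-2) = 125 labeled trees on 5 nodes */

int main(void) {
    int prufer[N - 2] = {4, 4, 5}; /* a hypothetical sequence, e.g. from a thread index */
    int degree[N + 1];
    for (int v = 1; v <= N; v++) degree[v] = 1;
    for (int i = 0; i < N - 2; i++) degree[prufer[i]]++;

    /* Repeatedly attach the smallest remaining leaf to the next sequence entry. */
    for (int i = 0; i < N - 2; i++) {
        for (int v = 1; v <= N; v++)
            if (degree[v] == 1) {
                printf("edge (%d, %d)\n", v, prufer[i]);
                degree[v]--;          /* v is used up */
                degree[prufer[i]]--;  /* prufer[i] may itself become a leaf */
                break;
            }
    }
    /* The last two nodes of degree 1 form the final edge. */
    int u = 0;
    for (int v = 1; v <= N; v++)
        if (degree[v] == 1) {
            if (u) { printf("edge (%d, %d)\n", u, v); break; }
            u = v;
        }
    return 0;
}
```

For the sequence (4, 4, 5) this prints the edges (1, 4), (2, 4), (3, 5), (4, 5); rooting the result at node 1 gives the labeled rooted tree that a thread would then score with Algorithm 1.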
For $q = 10$, we compute the cost of all 100 million trees in about 8 minutes, and for $q = 11$, we compute the cost of all 2.5 billion trees in slightly less than 2.5 hours. Code to solve (3) using Alg. 1, with the improvements of Section 4, can be found in [32]. More results obtained with our algorithm can be found in Appendix G.

## 6 Conclusions and future work

We propose a new direct algorithm that, for a given tree, computes how close a given matrix of mutation frequencies per position is to satisfying the perfect phylogeny model. Our algorithm is faster than the state-of-the-art iterative methods for the same problem, even when they are optimally tuned. We use the proposed algorithm to build a GPU-based phylogenetic tree inference engine for trees of biologically relevant sizes. Unlike existing algorithms, which only heuristically search a small part of the space of possible trees, our algorithm performs a complete search over all trees relatively fast. It is an open problem to find direct algorithms that can provably solve our problem in linear time on average, or even for a worst-case input.

Acknowledgement: This work was partially funded by NIH/1U01AI124302, NSF/IIS-1741129, and an NVIDIA hardware grant.

## References

[1] Richard R. Hudson. Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology, 23(2):183–201, 1983.

[2] Motoo Kimura. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61(4):893–903, 1969.

[3] Mohammed El-Kebir, Layla Oesper, Hannah Acheson-Field, and Benjamin J. Raphael. Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics, 31(12):i62–i70, 2015.

[4] Mohammed El-Kebir, Gryte Satas, Layla Oesper, and Benjamin J. Raphael. Multi-state perfect phylogeny mixture deconvolution and applications to cancer sequencing. arXiv preprint arXiv:1604.02605, 2016.

[5] Wei Jiao, Shankar Vembu, Amit G. Deshwar, Lincoln Stein, and Quaid Morris. Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinformatics, 15(1):35, 2014.

[6] Salem Malikic, Andrew W. McPherson, Nilgun Donmez, and Cenk S. Sahinalp. Clonality inference in multiple tumor samples using phylogeny. Bioinformatics, 31(9):1349–1356, 2015.

[7] Victoria Popic, Raheleh Salari, Iman Hajirasouliha, Dorna Kashef-Haghighi, Robert B. West, and Serafim Batzoglou. Fast and scalable inference of multi-sample cancer lineages. Genome Biology, 16(1):91, 2015.

[8] Gryte Satas and Benjamin J. Raphael. Tumor phylogeny inference using tree-constrained importance sampling. Bioinformatics, 33(14):i152–i160, 2017.

[9] Anna Schuh, Jennifer Becq, Sean Humphray, Adrian Alexa, Adam Burns, Ruth Clifford, Stephan M. Feller, Russell Grocock, Shirley Henderson, Irina Khrebtukova, et al. Monitoring chronic lymphocytic leukemia progression by whole genome sequencing reveals heterogeneous clonal evolution patterns. Blood, 120(20):4191–4196, 2012.

[10] David Fernández-Baca. The perfect phylogeny problem. In Steiner Trees in Industry, pages 203–234. Springer, 2001.

[11] Dan Gusfield. Efficient algorithms for inferring evolutionary trees. Networks, 21(1):19–28, 1991.

[12] Zhihong Ding, Vladimir Filkov, and Dan Gusfield. A linear-time algorithm for the perfect phylogeny haplotyping (PPH) problem. Journal of Computational Biology, 13(2):522–553, 2006.

[13] Michael R. Garey and David S. Johnson. Computers and Intractability, volume 29. W. H. Freeman, New York, 2002.
[14] Amit G. Deshwar, Shankar Vembu, Christina K. Yung, Gun Ho Jang, Lincoln Stein, and Quaid Morris. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biology, 16(1):35, 2015.

[15] Zoubin Ghahramani, Michael I. Jordan, and Ryan P. Adams. Tree-structured stick breaking for hierarchical data. In Advances in Neural Information Processing Systems, pages 19–27, 2010.

[16] Laurent Condat. Fast projection onto the simplex and the $\ell_1$ ball. Mathematical Programming, 158(1-2):575–585, 2016.

[17] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the $\ell_1$-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM, 2008.

[18] Pinghua Gong, Kun Gai, and Changshui Zhang. Efficient Euclidean projections via piecewise root finding and its application in gradient projection. Neurocomputing, 74(17):2754–2766, 2011.

[19] Jun Liu and Jieping Ye. Efficient Euclidean projections in linear time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 657–664. ACM, 2009.

[20] Christian Michelot. A finite algorithm for finding the projection of a point onto the canonical simplex of $\mathbb{R}^n$. Journal of Optimization Theory and Applications, 50(1):195–200, 1986.

[21] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[22] Guilherme França and José Bento. An explicit rate bound for over-relaxed ADMM. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 2104–2108. IEEE, 2016.

[23] Ning Hao, Amir Reza Oghbaee, Mohammad Rostami, Nate Derbinsky, and José Bento. Testing fine-grained parallelism for the ADMM on a factor-graph. In Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International, pages 835–844. IEEE, 2016.

[24] Guilherme França and José Bento. How is distributed ADMM affected by network topology? arXiv preprint arXiv:1710.00889, 2017.

[25] Laurence Yang, José Bento, Jean-Christophe Lachance, and Bernhard Palsson. Genome-scale estimation of cellular objectives. arXiv preprint arXiv:1807.04245, 2018.

[26] Charles J. M. Mathy, Felix Gonda, Dan Schmidt, Nate Derbinsky, Alexander A. Alemi, José Bento, Francesco M. Delle Fave, and Jonathan S. Yedidia. SPARTA: Fast global planning of collision-avoiding robot trajectories.

[27] Daniel Zoran, Dilip Krishnan, José Bento, and Bill Freeman. Shape and illumination from shading using the generic viewpoint assumption. In Advances in Neural Information Processing Systems, pages 226–234, 2014.

[28] José Bento, Nate Derbinsky, Charles Mathy, and Jonathan S. Yedidia. Proximal operators for multi-agent path planning. In AAAI, pages 3657–3663, 2015.

[29] José Bento, Nate Derbinsky, Javier Alonso-Mora, and Jonathan S. Yedidia. A message-passing algorithm for multi-agent trajectory planning. In Advances in Neural Information Processing Systems, pages 521–529, 2013.

[30] Jean-Jacques Moreau. Décomposition orthogonale d'un espace hilbertien selon deux cônes mutuellement polaires. C. R. Acad. Sci. Paris, 255:238–240, 1962.

[31] H. Prüfer. Neuer Beweis eines Satzes über Permutationen. Archiv der Mathematik und Physik, 27:742–744, 1918.
[32] GitHub repository for the PPM projection algorithm. https://github.com/bentoayr/efficient-projection-onto-the-perfect-phylogeny-model. Accessed: 2018-10-26.

[33] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

[34] Yuchao Jiang, Yu Qiu, Andy J. Minn, and Nancy R. Zhang. Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing. Proceedings of the National Academy of Sciences, 113(37):E5528–E5537, 2016.

[35] Mohammed El-Kebir, Gryte Satas, Layla Oesper, and Benjamin J. Raphael. Inferring the mutational history of a tumor using multi-state perfect phylogeny mixtures. Cell Systems, 3(1):43–53, 2016.

[36] Iman Hajirasouliha, Ahmad Mahmoody, and Benjamin J. Raphael. A combinatorial approach for analyzing intra-tumor heterogeneity from high-throughput sequencing data. Bioinformatics, 30(12):i78–i86, 2014.

[37] Paola Bonizzoni, Anna Paola Carrieri, Gianluca Della Vedova, Riccardo Dondi, and Teresa M. Przytycka. When and how the perfect phylogeny model explains evolution. In Discrete and Topological Models in Molecular Biology, pages 67–83. Springer, 2014.

[38] AncesTree data used. https://github.com/raphael-group/ancestree/tree/master/data/simulated/cov_1000_samples_4_mut_100_clone_10_pcr_removed. Accessed: 2018-10-26.