# Fast Algorithms for Segmented Regression

Jayadev Acharya (JAYADEV@MIT.EDU), Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Ilias Diakonikolas (DIAKONIK@USC.EDU), University of Southern California, Los Angeles, CA 90089, USA
Jerry Li (JERRYZLI@MIT.EDU), Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Ludwig Schmidt (LUDWIGS@MIT.EDU), Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Authors ordered alphabetically. Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

**Abstract**

We study the fixed design segmented regression problem: given noisy samples from a piecewise linear function $f$, we want to recover $f$ up to a desired accuracy in mean-squared error. Previous rigorous approaches for this problem rely on dynamic programming (DP) and, while sample efficient, have running time quadratic in the sample size. As our main contribution, we provide new algorithms for the problem that run in time nearly linear in the sample size and that, while not being minimax optimal, achieve a significantly better sample-time tradeoff on large datasets compared to the DP approach. Our experimental evaluation shows that, compared with the DP approach, our algorithms provide a convergence rate that is only off by a factor of 2 to 4, while achieving speedups of three orders of magnitude.

## 1. Introduction

We study the regression problem, a fundamental inference task with numerous applications that has received tremendous attention in machine learning and statistics during the past fifty years (see, e.g., (Mosteller & Tukey, 1977) for a classical textbook). Roughly speaking, in a (fixed design) regression problem, we are given a set of $n$ observations $(x_i, y_i)$, where the $y_i$'s are the dependent variables and the $x_i$'s are the independent variables, and our goal is to model the relationship between them. The typical assumptions are that (i) there exists a simple function $f$ that (approximately) models the underlying relation, and (ii) the dependent observations are corrupted by random noise. More specifically, we assume that there exists a family of functions $\mathcal{F}$ such that for some $f \in \mathcal{F}$ the following holds: $y_i = f(x_i) + \epsilon_i$, where the $\epsilon_i$'s are i.i.d. random variables drawn from a tame distribution such as a Gaussian (later, we also consider model misspecification).

Throughout this paper, we consider the classical notion of mean squared error (MSE) to measure the performance (risk) of an estimator. As expected, the minimax risk depends on the family $\mathcal{F}$ that $f$ comes from. The natural case that $f$ is linear is fully understood: it is well known that the least-squares estimator is statistically efficient and runs in sample-linear time. The more general case that $f$ is non-linear, but satisfies some well-defined structural constraint, has been extensively studied in a variety of contexts (see, e.g., (Gallant & Fuller, 1973; Feder, 1975; Friedman, 1991; Bai & Perron, 1998; Yamamoto & Perron, 2013; Kyng et al., 2015; Avron et al., 2013; Meyer, 2008; Chatterjee et al., 2015)). In contrast to the linear case, this more general setting is not well understood from an information-theoretic and/or computational standpoint.

In this paper, we focus on the case that the function $f$ is promised to be piecewise linear with a given number $k$ of unknown pieces (segments). This is known as fixed design segmented regression, and it has received considerable attention in the statistics community (Gallant & Fuller, 1973; Feder, 1975; Bai & Perron, 1998; Yamamoto & Perron, 2013).
The special case of piecewise polynomial functions (splines) has been extensively used in the context of inference, including density estimation and regression; see, e.g., (Wegman & Wright, 1983; Friedman, 1991; Stone, 1994; Stone et al., 1997; Meyer, 2008).

Information-theoretic aspects of the segmented regression problem are well understood: roughly speaking, the minimax risk is inversely proportional to the number of samples. In contrast, the computational complexity of the problem is poorly understood: prior to our work, known algorithms for this problem with provable guarantees were quite slow. Our main contribution is a set of new provably fast algorithms that outperform previous approaches both in theory and in practice. Our algorithms run in time that is nearly linear in the number of data points $n$ and the number of intervals $k$. Their computational efficiency is established both theoretically and experimentally. We also emphasize that our algorithms are robust to model misspecification, i.e., they perform well even if the function $f$ is only approximately piecewise linear.

Note that if the segments of $f$ were known a priori, the segmented regression problem could be immediately reduced to $k$ independent linear regression problems. Roughly speaking, in the general case (where the location of the segment boundaries is unknown) one needs to discover the right segments using information provided by the samples. To address this algorithmic problem, previous works (Bai & Perron, 1998; Yamamoto & Perron, 2013) relied on dynamic programming that, while being statistically efficient, is computationally quite slow: its running time scales at least quadratically with the size $n$ of the data, hence it is rather impractical for large datasets.

Our main motivation comes from the availability of large datasets that has made computational efficiency the main bottleneck in many cases. In the words of (Jordan, 2013): "As data grows, it may be beneficial to consider faster inferential algorithms, because the increasing statistical strength of the data can compensate for the poor algorithmic quality." Hence, it is sometimes advantageous to sacrifice statistical efficiency in order to achieve faster running times, because we can then achieve the desired error guarantee faster (provided more samples are available). In our context, instead of using a slow dynamic program, we employ a subtle iterative greedy approach that runs in sample-linear time.

Our iterative greedy approach builds on the work of (Acharya et al., 2015a;b), but the details of our algorithms here and their analysis are substantially different. In particular, as we explain in the body of the paper, the natural adaptation of their analysis to our setting fails to provide any meaningful statistical guarantees. To obtain our results, we introduce novel algorithmic ideas and carefully combine them with additional probabilistic arguments.

## 2. Preliminaries

In this paper, we study the problem of fixed design segmented regression. We are given samples $x_i \in \mathbb{R}^d$ for $i \in [n]$ (where $[n] = \{1, \ldots, n\}$), and we consider the following classical regression model:

$$y_i = f(x_i) + \epsilon_i . \tag{1}$$

Here, the $\epsilon_i$ are i.i.d. sub-Gaussian noise variables with variance proxy $\sigma^2$, mean $\mathbb{E}[\epsilon_i] = 0$, and variance $s^2 = \mathbb{E}[\epsilon_i^2]$ for all $i$. We will let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ denote the vector of noise variables. We also assume that $f : \mathbb{R}^d \to \mathbb{R}$ is a $k$-piecewise linear function. Formally, this means:
**Definition 1.** The function $f : \mathbb{R}^d \to \mathbb{R}$ is a $k$-piecewise linear function if there exists a partition of the real line into $k$ disjoint intervals $I_1, \ldots, I_k$, $k$ corresponding parameters $\theta_1, \ldots, \theta_k \in \mathbb{R}^d$, and a fixed, known coordinate $j$ such that for all $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$ we have that $f(x) = \langle \theta_i, x \rangle$ if $x_j \in I_i$. Let $\mathcal{L}_{k,j}$ denote the space of $k$-piecewise linear functions with partition defined on coordinate $j$. Moreover, we say $f$ is flat on an interval $I \subseteq \mathbb{R}$ if $I \subseteq I_i$ for some $i = 1, \ldots, k$; otherwise, we say that $f$ has a jump on the interval $I$.

In the full paper, we also discuss the agnostic setting where the ground truth $f$ is not piecewise linear itself but only well approximated by a $k$-piecewise linear function. For simplicity of exposition, we assume that the partition coordinate $j$ is $1$ in the rest of the paper. We remark that this model also contains the problem of (fixed design) piecewise polynomial regression as an important subcase (see the full paper for details).

Following this generative model, a regression algorithm receives the $n$ pairs $(x_i, y_i)$ as input. The goal of the algorithm is then to produce an estimate $\hat f$ that is close to the true, unknown $f$ with high probability over the noise terms $\epsilon_i$ and any randomness in the algorithm. We measure the distance between our estimate $\hat f$ and the unknown function $f$ with the classical mean-squared error:

$$\mathrm{MSE}(\hat f) = \frac{1}{n} \sum_{i=1}^n \left(f(x_i) - \hat f(x_i)\right)^2 .$$

Throughout this paper, we let $X \in \mathbb{R}^{n \times d}$ be the data matrix, i.e., the matrix whose $j$-th row is $x_j^T$ for every $j$, and we let $r$ denote the rank of $X$. The following notation will also be useful. For any function $f : \mathbb{R}^d \to \mathbb{R}$, we let $f \in \mathbb{R}^n$ denote the vector with components $f_i = f(x_i)$ for $i \in [n]$. For any interval $I$, we let $X_I$ denote the data matrix consisting of all data points $x_i$ for $i \in I$, and for any vector $v \in \mathbb{R}^n$, we let $v_I \in \mathbb{R}^{|I|}$ be the vector of the $v_i$ for $i \in I$.

### 2.1. Our Contributions

Our main contributions are new, fast algorithms for the aforementioned segmented regression problem. We now informally state our main results and refer to later sections for more precise theorems.

**Theorem 2** (informal statement of Theorems 13 and 14). There is an algorithm GREEDYMERGING, which, given $X$ (of rank $r$), $y$, a target number of pieces $k$, and the variance of the noise $s^2$, runs in time $O(nd^2 \log n)$ and outputs an $O(k)$-piecewise linear function $\hat f$ so that with probability at least 0.99, we have
$$\mathrm{MSE}(\hat f) \le O\!\left(\frac{\sigma^2 k (r + \log n)}{n} + \sigma^2 \sqrt{\frac{k}{n}}\, \log n\right).$$

That is, our algorithm runs in time which is nearly linear in $n$ and still achieves a reasonable rate of convergence. While this rate is asymptotically slower than the rate of the dynamic programming estimator, our algorithm is significantly faster than the DP, so that in order to achieve a comparable MSE, our algorithm takes less total time given access to a sufficient number of samples. For more details on this comparison and an experimental evaluation, see Sections 2.2 and 4.

At a high level, our algorithm first forms a fine partition of $[n]$ and then iteratively merges pieces of this partition until only $O(k)$ pieces are left. In each iteration, the algorithm reduces the number of pieces in the following manner: we group consecutive intervals into pairs, which we call candidates. For each candidate interval, the algorithm computes an error term that is the error of a least squares fit combined with a regularizer depending on the variance of the noise $s^2$. The algorithm then finds the $O(k)$ candidates with the largest errors. We do not merge these candidates, but do merge all other candidate intervals, and we repeat this process until only $O(k)$ pieces are left.
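To make the merging criterion concrete, the following is a small numpy sketch (our illustration, not code from the paper) of the error term computed for a single candidate interval: the sum of squared residuals of a least-squares fit on the merged pair, minus the regularizer $s^2 |I|$. The function name and interface are our own choices.

```python
import numpy as np

def candidate_error(X, y, idx, s2):
    """Merging error of one candidate interval.

    X   : (n, d) data matrix
    y   : (n,) vector of responses
    idx : indices of the candidate interval I (the union of two
          neighboring intervals of the current partition)
    s2  : the noise variance s^2 (or an estimate of it)

    Returns ||y_I - X_I theta||^2 - s^2 |I|, where theta is the
    least-squares fit on I.
    """
    X_I, y_I = X[idx], y[idx]
    theta, *_ = np.linalg.lstsq(X_I, y_I, rcond=None)
    sse = float(np.sum((y_I - X_I @ theta) ** 2))
    return sse - s2 * len(idx)
```

In each round, the candidates with the largest such errors are kept as they are, and every other candidate pair is merged.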
A drawback of this algorithm is that we need to know the variance of the noise $s^2$, or at least have a good estimate. In practice, we might be able to obtain such an estimate, but ideally our algorithm would work without knowing $s^2$. By extending our algorithm, we obtain the following result:

**Theorem 3** (informal). There is an algorithm BUCKETGREEDYMERGE, which, given $X$ (of rank $r$), $y$, and a target number of pieces $k$, runs in time $O(nd^2 \log n)$ and outputs an $O(k \log n)$-piecewise linear function $\hat f$ so that with probability at least 0.99, we have
$$\mathrm{MSE}(\hat f) \le O\!\left(\frac{\sigma^2 k r \log n}{n} + \sigma^2 \sqrt{\frac{k}{n}}\, \log n\right).$$

At a high level, there are two fundamental changes to the algorithm: first, instead of merging with respect to the sum of squared errors of the least squares fit regularized by a term depending on $s^2$, we merge with respect to the average error the least squares fit incurs within the current interval. The second change is more substantial: instead of finding the $O(k)$ candidates with largest error and merging the rest, we now split the candidates into $\log n$ buckets based on the lengths of the candidate intervals. In this scheme, bucket $\alpha$ contains all candidates with length between $2^\alpha$ and $2^{\alpha+1}$, for $\alpha = 0, \ldots, \log n - 1$. Then we find the $k + 1$ candidates with largest error within each bucket and merge the remaining candidate intervals. We continue this process until we are left with $O(k \log n)$ intervals.

A potential disadvantage of our algorithms above is that they produce $O(k)$ or $O(k \log n)$ pieces, respectively. In order to address this issue, we also provide a postprocessing algorithm that converts the output of any of our algorithms and decreases the number of pieces down to $2k + 1$ while preserving the statistical guarantees above. The guarantee of this postprocessing algorithm is as follows.

**Theorem 4** (informal). There is an algorithm POSTPROCESSING that takes as input the output of either GREEDYMERGING or BUCKETGREEDYMERGE together with a target number of pieces $k$, runs in time $O(k^3 d^2 \log n)$, and outputs a $(2k + 1)$-piecewise linear function $\hat f_p$ so that with probability at least 0.99, we have
$$\mathrm{MSE}(\hat f_p) \le O\!\left(\frac{\sigma^2 k (r + \log n)}{n} + \sigma^2 \sqrt{\frac{k}{n}}\, \log n\right).$$

Qualitatively, an important aspect of this algorithm is that its running time depends only logarithmically on $n$. In practice, we expect $k$ to be much smaller than $n$, and hence the running time of this postprocessing step will usually be dominated by the running time of the piecewise linear fitting algorithm run before it.

In this document, we only give the formal pseudocode and proofs for GREEDYMERGING. We refer the reader to the full version for more details.

### 2.2. Comparison to prior work

**Dynamic programming.** Previous work on segmented regression with statistical guarantees (Bai & Perron, 1998; Yamamoto & Perron, 2013) relies heavily on dynamic programming-style algorithms to find the $k$-piecewise linear least-squares estimator. Somewhat surprisingly, we are not aware of any work which explicitly gives the best possible running time and statistical guarantee for this algorithm. For completeness, we prove the following result (Theorem 5), which we believe to be folklore. The techniques used in its proof will also be useful for us later.

**Theorem 5** (informal). The exact dynamic program runs in time $O(n^2 (d^2 + k))$ and outputs a $k$-piecewise linear estimator $\hat f$ so that with probability at least 0.99 we have
$$\mathrm{MSE}(\hat f) \le O\!\left(\frac{\sigma^2 k r}{n}\right).$$
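For comparison, here is a simplified Python sketch (ours, not the implementation referenced in the paper) of the kind of dynamic program Theorem 5 describes: it tabulates the least-squares error of every interval by incrementally updating the normal-equation statistics and then chooses the best split into at most $k$ pieces. Because this sketch solves a $d \times d$ system per interval, it runs in $O(n^2(d^3 + k))$ time rather than the $O(n^2(d^2 + k))$ stated above; the function names are our own.

```python
import numpy as np

def interval_errors(X, y):
    """err[i, j] = squared error of the best single linear fit to points i..j-1."""
    n, d = X.shape
    err = np.zeros((n + 1, n + 1))
    for i in range(n):
        A = np.zeros((d, d))   # running X_I^T X_I
        b = np.zeros(d)        # running X_I^T y_I
        c = 0.0                # running y_I^T y_I
        for j in range(i + 1, n + 1):
            x, yj = X[j - 1], y[j - 1]
            A += np.outer(x, x)
            b += yj * x
            c += yj * yj
            theta, *_ = np.linalg.lstsq(A, b, rcond=None)
            err[i, j] = max(c - b @ theta, 0.0)   # SSE = y^T y - b^T theta
    return err

def dp_segmented_regression(X, y, k):
    """Minimum total squared error of a k-piecewise linear fit, plus the segments."""
    n = X.shape[0]
    err = interval_errors(X, y)
    dp = np.full((n + 1, k + 1), np.inf)
    prev = np.zeros((n + 1, k + 1), dtype=int)
    dp[0, 0] = 0.0
    for j in range(1, n + 1):
        for p in range(1, k + 1):
            for i in range(j):
                cand = dp[i, p - 1] + err[i, j]
                if cand < dp[j, p]:
                    dp[j, p], prev[j, p] = cand, i
    # Backtrack the chosen segment boundaries.
    p = int(np.argmin(dp[n, 1:])) + 1
    j, segments = n, []
    while j > 0:
        i = prev[j, p]
        segments.append((i, j))
        j, p = i, p - 1
    return float(dp[n, 1:].min()), segments[::-1]
```

The quadratic number of intervals is exactly what makes this approach expensive for large $n$, which motivates the greedy merging algorithms of this paper.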
We now compare our guarantees with those of the DP. The main advantage of our approaches is computational efficiency: our algorithms run in nearly-linear time, while the running time of the DP has a quadratic dependence on $n$. While our statistical rate of convergence is slower, we are able to achieve the same MSE as the DP in asymptotically less time if enough samples are available. For instance, suppose we wish to obtain an MSE of $\eta$ with a $k$-piecewise linear function, and suppose for simplicity that $d = O(1)$ so that $r = O(1)$ as well. Then Theorem 5 tells us that the DP requires $n = k/\eta$ samples and runs in time $O(k^3/\eta^2)$. On the other hand, Theorem 2 tells us that GREEDYMERGING requires $n = \tilde{O}(k/\eta^2)$ samples and thus runs in time $\tilde{O}(k/\eta^2)$. For non-trivial values of $k$, this is a significant improvement in time complexity.

This gap also manifests itself strongly in our experiments (see Section 4). When given the same number of samples, our MSE is a factor of 2 to 4 worse than that achieved by the DP, but our algorithm is about 1,000 times faster already for $10^4$ data points. When more samples are available for our algorithm, it achieves the same MSE as the DP about 100 times faster.

**Histogram approximation.** Our results build upon the techniques of (Acharya et al., 2015a), who consider the problem of histogram approximation. Indeed, without too much work it is possible to convert their algorithm to our regression setting, but the resulting guarantee is not statistically meaningful (see the full version for a more detailed discussion). To overcome this difficulty, we propose a merging algorithm based on a new merging error measure. Utilizing more sophisticated proof techniques than in (Acharya et al., 2015a), we show that our new algorithm has good statistical guarantees in the regression setting.

### 2.3. Mathematical Preliminaries

**Tail bounds.** We will require the following tail bound on deviations of squares of sub-Gaussian random variables. We defer the proof to the supplementary material.

**Corollary 6.** Fix $\delta > 0$ and let $\epsilon_1, \ldots, \epsilon_n$ be as in (1). Recall that $s^2 = \mathbb{E}[\epsilon_i^2]$. Then, with probability $1 - \delta$, we have that simultaneously for all intervals $I \subseteq [n]$ the following inequality holds:
$$\left|\sum_{i \in I} \epsilon_i^2 - s^2 |I|\right| \le O\!\left(\sigma^2 \log\frac{n}{\delta}\right) \sqrt{|I|} .$$

**Linear regression.** Our analysis builds on the classical results for fixed design linear regression. In linear regression, the generative model is exactly of the form described in (1), except that $f$ is restricted to be a 1-piecewise linear function (as opposed to a $k$-piecewise linear function), i.e., $f(x) = \langle \theta^*, x \rangle$ for some unknown $\theta^*$. The problem of linear regression is very well understood, and the asymptotically best estimator is known to be the least-squares estimator.

**Definition 7.** Given $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$, the least squares estimator $f^{LS}$ is defined to be the linear function which minimizes $\sum_{i=1}^n (y_i - f(x_i))^2$ over all linear functions $f$. For any interval $I$, we let $f^{LS}_I$ denote the least squares estimator for the data points restricted to $I$, i.e., for the data pairs $\{(x_i, y_i)\}_{i \in I}$. We also let LEASTSQUARES$(X, y, I)$ denote an algorithm which solves linear least squares for these data points, i.e., which outputs the coefficients of the linear least squares fit for these points.

Following our previous definitions, we let $f^{LS} \in \mathbb{R}^n$ denote the vector whose $i$-th coordinate is $f^{LS}(x_i)$, and similarly for any $I \subseteq [n]$ we let $f^{LS}_I \in \mathbb{R}^{|I|}$ denote the vector whose $i$-th coordinate for $i \in I$ is $f^{LS}_I(x_i)$.
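As a concrete reference for the notation above, here is a minimal numpy sketch (ours) of what a LEASTSQUARES$(X, y, I)$ subroutine and the fitted vector $f^{LS}_I$ can look like; the interval used in the usage comment is hypothetical.

```python
import numpy as np

def least_squares(X, y, I):
    """LEASTSQUARES(X, y, I): coefficients of the linear least-squares fit
    to the data points {(x_i, y_i) : i in I}."""
    theta, *_ = np.linalg.lstsq(X[I], y[I], rcond=None)
    return theta

# The vector f^LS_I of fitted values on I is then X_I @ theta, e.g.:
#   I = np.arange(10, 20)            # hypothetical interval of indices
#   f_LS_I = X[I] @ least_squares(X, y, I)
```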
The following error rate is known for the least-squares estimator:

**Fact 8.** Let $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ be generated as in (1), where $f$ is a linear function. Let $f^{LS}(x)$ be the least squares estimator. Then,
$$\mathbb{E}\left[\mathrm{MSE}(f^{LS})\right] = O\!\left(\frac{\sigma^2 r}{n}\right).$$
Moreover, with probability $1 - \delta$, we have
$$\mathrm{MSE}(f^{LS}) = O\!\left(\frac{\sigma^2 (r + \log(1/\delta))}{n}\right).$$

Fact 8 can be proved with the following lemma, which we also use in our analysis. The lemma bounds the correlation of a random vector with any fixed $r$-dimensional subspace:

**Lemma 9** (cf. proof of Theorem 2.2 in (Rigollet, 2015)). Fix $\delta > 0$. Let $\epsilon_1, \ldots, \epsilon_n$ be i.i.d. sub-Gaussian random variables with variance proxy $\sigma^2$. Let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$, and let $S$ be a fixed $r$-dimensional affine subspace of $\mathbb{R}^n$. Then, with probability $1 - \delta$, we have
$$\sup_{v \in S \setminus \{0\}} \frac{\langle \epsilon, v \rangle}{\|v\|} \le O\!\left(\sigma \sqrt{r + \log(1/\delta)}\right) .$$

This lemma also yields the two following consequences. The first corollary bounds the correlation between sub-Gaussian random noise and any linear function on any interval (see the full paper for proofs):

**Corollary 10.** Fix $\delta > 0$ and $v \in \mathbb{R}^n$. Let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ be as in Lemma 9. Then with probability at least $1 - \delta$, we have that for all intervals $I$, and for all non-zero linear functions $f : \mathbb{R}^d \to \mathbb{R}$,
$$\frac{|\langle \epsilon_I, f_I + v_I \rangle|}{\|f_I + v_I\|} \le O\!\left(\sigma \sqrt{r + \log(n/\delta)}\right) .$$

**Corollary 11.** Fix $\delta > 0$ and $v \in \mathbb{R}^n$. Let $\epsilon$ be as in Lemma 9. Then with probability at least $1 - \delta$, we have that for every set $I$ that is a union of $k$ disjoint intervals $I_1, \ldots, I_k$ and every function $f : \mathbb{R}^d \to \mathbb{R}$ that is linear on each $I_\ell$,
$$\frac{|\langle \epsilon_I, f_I + v_I \rangle|}{\|f_I + v_I\|} \le O\!\left(\sigma \sqrt{kr + k \log(n/\delta)}\right) .$$

These bounds also imply the following statement, which we will need later. We defer the proof to the supplementary material.

**Lemma 12.** Fix $\delta > 0$. Then with probability $1 - \delta$ we have the following: for all disjoint sets of $k$ intervals $I_1, \ldots, I_k$ of $[n]$ so that $f$ is flat on each $I_\ell$, the following inequality holds:
$$\sum_{\ell=1}^k \left\|f^{LS}_{I_\ell} - f_{I_\ell}\right\|^2 \le O\!\left(\sigma^2 k (r + \log(n/\delta))\right) .$$

## 3. A greedy merging algorithm

In this section, we formally state our new algorithm for segmented regression and analyze its statistical and computational efficiency. See Algorithm 1 for the corresponding pseudocode. The algorithm expects two tuning parameters $\tau$ and $\gamma$ because it does not return a $k$-piecewise linear function but instead an $O(k)$-piecewise linear function; the tuning parameters allow us to trade off running time for fewer pieces. In typical use cases we have $\tau, \gamma = \Theta(1)$, in which case our algorithm produces an $O(k)$-piecewise linear function in $O(nd^2 \log n)$ time.

We now state the running time of GREEDYMERGING. The analysis is similar to the analysis presented in (Acharya et al., 2015a) and for space considerations is left to the supplementary material.

**Theorem 13.** Let $X$ and $y$ be as above. Then GREEDYMERGING$(\tau, \gamma, s, X, y)$ outputs a $\left(\left(2 + \frac{2}{\tau}\right)k + \gamma\right)$-piecewise linear function and runs in time $O(nd^2 \log(n/\gamma))$.

### 3.1. Analysis of GREEDYMERGING

**Theorem 14.** Let $\delta > 0$, and let $\hat f$ be the estimator returned by GREEDYMERGING. Let $m = (2 + \frac{2}{\tau})k + \gamma$ be the number of pieces in $\hat f$. Then, with probability $1 - \delta$, we have
$$\mathrm{MSE}(\hat f) \le O\!\left(\frac{\sigma^2 m (r + \log(n/\delta))}{n} + \sigma^2 (1 + \sqrt{\tau}) \sqrt{\frac{k}{n}}\, \log\frac{n}{\delta}\right).$$

*Proof.* We first condition on the event that Corollaries 6, 10 and 11, and Lemma 12 all hold with error parameter $O(\delta)$, so that together they all hold with probability at least $1 - \delta$. Let $\mathcal{I} = \{I_1, \ldots, I_m\}$ be the final partition of $[n]$ that our algorithm produces. Recall $f$ is the ground truth $k$-piecewise linear function. We partition the intervals in $\mathcal{I}$ into two sets:
$$F = \{I \in \mathcal{I} : f \text{ is flat on } I\} , \qquad J = \{I \in \mathcal{I} : f \text{ has a jump on } I\} .$$

We first bound the error over the intervals in $F$. By Lemma 12, we have
$$\sum_{I \in F} \left\|f_I - \hat f_I\right\|^2 \le O\!\left(\sigma^2 |F| (r + \log(n/\delta))\right) , \tag{3}$$
with probability at least $1 - O(\delta)$. We now turn our attention to the intervals in $J$ and distinguish two further cases.
We let $J_1$ be the set of intervals in $J$ which were never merged, and we let $J_2$ be the remaining intervals. If the interval $I \in J_1$ was never merged, the interval contains one point, call it $i$. Because we may assume that $x_i \neq 0$, we know that for this one point, our estimator satisfies $\hat f(x_i) = y_i$, since this is clearly the least squares fit for a linear estimator on one nonzero point. Hence Corollary 6 implies that the following inequality holds with probability at least $1 - O(\delta)$:
$$\sum_{I \in J_1} \left\|f_I - \hat f_I\right\|^2 = \sum_{I \in J_1} \sum_{i \in I} \epsilon_i^2 \le \sum_{I \in J_1} \left(s^2 |I| + O\!\left(\sigma^2 \log\frac{n}{\delta}\right)\sqrt{|I|}\right) \le \sigma^2 \left(m + O\!\left(m \log\frac{n}{\delta}\right)\right) . \tag{4}$$

We now finally turn our attention to the intervals in $J_2$. Fix an interval $I \in J_2$. By definition, the interval $I$ was merged in some iteration of the algorithm. This implies that in that iteration, there were $(1 + 1/\tau)k$ intervals $M_1, \ldots, M_{(1+1/\tau)k}$ so that for each interval $M_\ell$, we have
$$\left\|y_I - \hat f_I\right\|^2 - s^2 |I| \le \left\|y_{M_\ell} - f^{LS}_{M_\ell}\right\|^2 - s^2 |M_\ell| . \tag{5}$$

Since the true, underlying $k$-piecewise linear function $f$ has at most $k$ jumps, this means that there are at least $k/\tau$ intervals of the $M_\ell$ on which $f$ is flat. WLOG assume that these intervals are $M_1, \ldots, M_{k/\tau}$. We have, by definition,
$$\sum_{\ell=1}^{k/\tau} \left(\left\|y_{M_\ell} - f^{LS}_{M_\ell}\right\|^2 - s^2 |M_\ell|\right) = \sum_{\ell=1}^{k/\tau} \left\|f_{M_\ell} - f^{LS}_{M_\ell}\right\|^2 + 2 \sum_{\ell=1}^{k/\tau} \left\langle \epsilon_{M_\ell}, f_{M_\ell} - f^{LS}_{M_\ell} \right\rangle + \sum_{\ell=1}^{k/\tau} \sum_{i \in M_\ell} \left(\epsilon_i^2 - s^2\right) . \tag{6}$$

We will upper bound each term on the RHS in turn. First, since the function $f$ is flat on each $M_\ell$ for $\ell = 1, \ldots, k/\tau$, Lemma 12 implies
$$\sum_{\ell=1}^{k/\tau} \left\|f_{M_\ell} - f^{LS}_{M_\ell}\right\|^2 \le O\!\left(\sigma^2 \frac{k}{\tau} \left(r + \log\frac{n}{\delta}\right)\right) \tag{7}$$
with probability at least $1 - O(\delta)$. Moreover, note that the function $f^{LS}_{M_\ell}$ is a linear function on $M_\ell$ of the form $f^{LS}_{M_\ell}(x) = x^T \hat\beta$, where $\hat\beta \in \mathbb{R}^d$ is the least-squares fit on $M_\ell$. Because $f$ is just a fixed vector, the vector $f_{M_\ell} - f^{LS}_{M_\ell}$ lives in the affine subspace of vectors of the form $f_{M_\ell} + X_{M_\ell} \eta$, where $\eta \in \mathbb{R}^d$ is arbitrary. So Corollary 11 and (7) imply that
$$2 \sum_{\ell=1}^{k/\tau} \left\langle \epsilon_{M_\ell}, f_{M_\ell} - f^{LS}_{M_\ell} \right\rangle \le O\!\left(\sigma \sqrt{\frac{k}{\tau}\left(r + \log\frac{n}{\delta}\right)}\right) \left(\sum_{\ell=1}^{k/\tau} \left\|f_{M_\ell} - f^{LS}_{M_\ell}\right\|^2\right)^{1/2} \le O\!\left(\sigma^2 \frac{k}{\tau} \left(r + \log\frac{n}{\delta}\right)\right) \tag{8}$$
with probability $1 - O(\delta)$. By Corollary 6, we get that with probability $1 - O(\delta)$,
$$\sum_{\ell=1}^{k/\tau} \sum_{i \in M_\ell} \left(\epsilon_i^2 - s^2\right) \le O\!\left(\sigma^2 \log\frac{n}{\delta}\right) \sum_{\ell=1}^{k/\tau} \sqrt{|M_\ell|} \le O\!\left(\sigma^2 \log\frac{n}{\delta}\right) \sqrt{\frac{kn}{\tau}} . \tag{9}$$

Putting it all together, we get that
$$\sum_{\ell=1}^{k/\tau} \left(\left\|y_{M_\ell} - f^{LS}_{M_\ell}\right\|^2 - s^2 |M_\ell|\right) \le O\!\left(\sigma^2 \frac{k}{\tau} \left(r + \log\frac{n}{\delta}\right)\right) + O\!\left(\sigma^2 \log\frac{n}{\delta}\right) \sqrt{\frac{kn}{\tau}}$$
with probability $1 - O(\delta)$. Since the LHS of (5) is bounded by each individual summand above, this implies that the LHS is also bounded by their average, which implies that
$$\left\|y_I - \hat f_I\right\|^2 - s^2 |I| \le \frac{\tau}{k} \sum_{\ell=1}^{k/\tau} \left(\left\|y_{M_\ell} - f^{LS}_{M_\ell}\right\|^2 - s^2 |M_\ell|\right) \le O\!\left(\sigma^2 \left(r + \log\frac{n}{\delta}\right)\right) + O\!\left(\sigma^2 \log\frac{n}{\delta}\right) \sqrt{\frac{\tau n}{k}} . \tag{10}$$

We now similarly expand out the LHS of (5):
$$\left\|y_I - \hat f_I\right\|^2 - s^2 |I| = \left\|f_I - \hat f_I\right\|^2 + 2 \left\langle \epsilon_I, f_I - \hat f_I \right\rangle + \left\|\epsilon_I\right\|^2 - s^2 |I| . \tag{11}$$

Next, we aim to upper bound $\sum_{i \in I} (f(x_i) - \hat f(x_i))^2$. Hence, we lower bound the second and third terms of (11). The calculations here closely mirror those above.

**Algorithm 1** Piecewise linear regression by greedy merging.

1: GREEDYMERGING$(\tau, \gamma, s, X, y)$
2: $\mathcal{I}^0 \leftarrow \{\{1\}, \{2\}, \ldots, \{n\}\}$  {Initial partition of $[n]$ into intervals of length 1.}
3: **while** $|\mathcal{I}^j| > (2 + \frac{2}{\tau})k + \gamma$ **do**  {Iterative greedy merging; we start with $j \leftarrow 0$.}
4: &nbsp;&nbsp; Let $s_j$ be the current number of intervals.
5: &nbsp;&nbsp; **for** $u \in \{1, 2, \ldots, \lfloor \frac{s_j}{2} \rfloor\}$ **do**  {Compute the least squares fit and its error for merging neighboring pairs of intervals.}
6: &nbsp;&nbsp;&nbsp;&nbsp; $\theta_u \leftarrow$ LEASTSQUARES$(X, y, I_{2u-1} \cup I_{2u})$
7: &nbsp;&nbsp;&nbsp;&nbsp; $e_u \leftarrow \left\|y_{I_{2u-1} \cup I_{2u}} - X_{I_{2u-1} \cup I_{2u}} \theta_u\right\|_2^2 - s^2 |I_{2u-1} \cup I_{2u}|$
8: &nbsp;&nbsp; **end for**
9: &nbsp;&nbsp; Let $L$ be the set of indices $u$ with the $(1 + \frac{1}{\tau})k$ largest errors $e_u$, breaking ties arbitrarily.
10: &nbsp;&nbsp; Let $M$ be the set of the remaining indices.
11: &nbsp;&nbsp; $\mathcal{I}^{j+1} \leftarrow \bigcup_{u \in L} \{I_{2u-1}, I_{2u}\}$  {Keep the intervals with large merging errors.}
12: &nbsp;&nbsp; $\mathcal{I}^{j+1} \leftarrow \mathcal{I}^{j+1} \cup \{I_{2u-1} \cup I_{2u} \mid u \in M\}$  {Merge the remaining intervals.}
13: &nbsp;&nbsp; $j \leftarrow j + 1$
14: **end while**
15: **return** the least squares fit to the data on every interval in $\mathcal{I}^j$
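For readers who prefer runnable code to pseudocode, the following is a compact Python rendering of Algorithm 1 (our own sketch, not the authors' Julia implementation). The helper names are ours, $k$ is passed explicitly, and an odd leftover interval is simply carried over to the next round.

```python
import numpy as np

def ls_error(X, y, idx):
    """Sum of squared residuals of the least-squares fit on the indices idx."""
    theta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return float(np.sum((y[idx] - X[idx] @ theta) ** 2))

def greedy_merging(tau, gamma, s2, X, y, k):
    """Algorithm 1: greedy merging for piecewise linear regression.

    Returns the final partition (a list of index arrays) and the
    least-squares coefficients fitted on each final interval."""
    n = X.shape[0]
    intervals = [np.array([i]) for i in range(n)]     # partition into singletons
    target = (2 + 2.0 / tau) * k + gamma
    num_keep = int((1 + 1.0 / tau) * k)               # candidates kept unmerged per round

    while len(intervals) > target:
        # Pair up neighboring intervals; a trailing unpaired interval is carried over.
        pairs = [(intervals[2 * u], intervals[2 * u + 1])
                 for u in range(len(intervals) // 2)]
        leftover = [intervals[-1]] if len(intervals) % 2 == 1 else []
        if len(pairs) <= num_keep:
            break   # cannot make progress (gamma too small for this tau and k)

        # Regularized least-squares error of each candidate pair.
        errors = [ls_error(X, y, np.concatenate(pair)) - s2 * (len(pair[0]) + len(pair[1]))
                  for pair in pairs]

        # Keep the num_keep candidates with the largest errors unmerged; merge the rest.
        keep = set(np.argsort(errors)[::-1][:num_keep].tolist())
        new_intervals = []
        for u, (left, right) in enumerate(pairs):
            if u in keep:
                new_intervals.extend([left, right])
            else:
                new_intervals.append(np.concatenate([left, right]))
        intervals = new_intervals + leftover

    # Final estimate: least-squares fit on every interval of the final partition.
    fits = [np.linalg.lstsq(X[idx], y[idx], rcond=None)[0] for idx in intervals]
    return intervals, fits
```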
Returning to the proof of Theorem 14, by Corollary 10 we have that
$$2 \left\langle \epsilon_I, f_I - \hat f_I \right\rangle \ge - O\!\left(\sigma \sqrt{r + \log\frac{n}{\delta}}\right) \left\|f_I - \hat f_I\right\| ,$$
and by Corollary 6 we have
$$\left\|\epsilon_I\right\|^2 - s^2 |I| \ge - O\!\left(\sigma^2 \log\frac{n}{\delta}\right) \sqrt{|I|} ,$$
so that
$$\left\|y_I - \hat f_I\right\|^2 - s^2 |I| \ge \left\|f_I - \hat f_I\right\|^2 - O\!\left(\sigma \sqrt{r + \log\frac{n}{\delta}}\right) \left\|f_I - \hat f_I\right\| - O\!\left(\sigma^2 \log\frac{n}{\delta}\right) \sqrt{|I|} . \tag{12}$$

Putting (10) and (12) together yields that with probability $1 - O(\delta)$,
$$\left\|f_I - \hat f_I\right\|^2 \le O\!\left(\sigma \sqrt{r + \log\frac{n}{\delta}}\right) \left\|f_I - \hat f_I\right\| + O\!\left(\sigma^2 \left(r + \log\frac{n}{\delta}\right)\right) + O\!\left(\sigma^2 \log\frac{n}{\delta}\right) \left(\sqrt{\frac{\tau n}{k}} + \sqrt{|I|}\right) .$$
Letting $z = \|\hat f_I - f_I\|$, this inequality is of the form $z^2 \le bz + c$ where $b, c \ge 0$. In this case, we have
$$c = O\!\left(\sigma^2 \left(r + \log\frac{n}{\delta}\right)\right) + O\!\left(\sigma^2 \log\frac{n}{\delta}\right) \left(\sqrt{\frac{\tau n}{k}} + \sqrt{|I|}\right) .$$

We now prove the following lemma about the behavior of such quadratic inequalities:

**Lemma 15.** Suppose $z^2 \le bz + c$ where $b, c \ge 0$. Then $z^2 \le O(b^2 + c)$.

*Proof.* From the quadratic formula, the inequality implies that $z \le \frac{b + \sqrt{b^2 + 4c}}{2}$. From this, it is straightforward to demonstrate the desired claim.

Thus, from the lemma, we have
$$\left\|f_I - \hat f_I\right\|^2 \le O\!\left(\sigma^2 \left(r + \log\frac{n}{\delta}\right)\right) + O\!\left(\sigma^2 \log\frac{n}{\delta}\right) \left(\sqrt{\frac{\tau n}{k}} + \sqrt{|I|}\right) .$$

Hence the total error over all intervals in $J_2$ can be bounded by:
$$\sum_{I \in J_2} \left\|f_I - \hat f_I\right\|^2 \le O\!\left(k \sigma^2 \left(r + \log\frac{n}{\delta}\right)\right) + O\!\left(\sigma^2 \log\frac{n}{\delta}\right) \sum_{I \in J_2} \left(\sqrt{\frac{\tau n}{k}} + \sqrt{|I|}\right) \le O\!\left(k \sigma^2 \left(r + \log\frac{n}{\delta}\right)\right) + O\!\left(\sigma^2 \log\frac{n}{\delta}\right) \left(\sqrt{\tau k n} + \sqrt{k n}\right) . \tag{13}$$
In the last line we use that the intervals $I \in J_2$ are disjoint (and hence their cardinalities sum up to at most $n$), and that there are at most $k$ intervals in $J_2$ because the function $f$ is $k$-piecewise linear. Finally, applying a union bound and summing (3), (4), and (13) yields the desired conclusion.

## 4. Experiments

In addition to our theoretical analysis above, we also study the empirical performance of our new estimator for segmented regression on both real and synthetic data. As baseline, we compare our estimator (GREEDYMERGING) to the dynamic programming approach. Since our algorithm combines both combinatorial and linear-algebraic operations, we use the Julia programming language (version 0.4.2, http://julialang.org/) for our experiments because Julia achieves performance close to C on both types of operations. All experiments were conducted on a laptop computer with a 2.8 GHz Intel Core i7 CPU and 16 GB of RAM.

**Synthetic data.** Experiments with synthetic data allow us to study the statistical and computational performance of our estimator as a function of the problem size $n$. Our theoretical bounds indicate that the worst-case performance of the merging algorithm scales as $O(kd/n + \sqrt{k/n}\,\log n)$ for constant error variance. Compared to the $O(kd/n)$ rate of the dynamic program, this indicates that the relative performance of our algorithm can depend on the number of features $d$. Hence we use two types of synthetic data: a piecewise-constant function $f$ (effectively $d = 1$) and a piecewise linear function $f$ with $d = 10$ features.

We generate the piecewise constant function $f$ by randomly choosing $k = 10$ integers from the set $\{1, \ldots, 10\}$ as the function value in each of the $k$ segments. (We also repeated the experiment for other values of $k$; since the results are not qualitatively different, we only report the $k = 10$ case here.) Then we draw $n/k$ samples from each segment by adding an i.i.d. Gaussian noise term with variance 1 to each sample. For the piecewise linear case, we generate an $n \times d$ data matrix $X$ with i.i.d. Gaussian entries ($d = 10$). In each segment $I$, we choose the parameter values $\beta_I$ independently and uniformly at random from the interval $[-1, 1]$, so the true function values in this segment are given by $f_I = X_I \beta_I$. As before, we then add an i.i.d. Gaussian noise term with variance 1 to each function value.

Figure 1 shows the results of the merging algorithm and the exact dynamic program for sample sizes $n$ ranging from $10^2$ to $10^4$. Since the merging algorithm can produce a variable number of output segments, we run the merging algorithm with three different parameter settings corresponding to $k$, $2k$, and $4k$ output segments, respectively.
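The synthetic data described above can be reproduced in a few lines of numpy. This is our own sketch of that setup; the segment layout for the piecewise linear case, the random seed handling, and the function names are our choices and not specified in the paper.

```python
import numpy as np

def piecewise_constant_data(n, k=10, seed=0):
    """k segments of n/k points each; each segment value is a random integer
    from {1, ..., 10}; i.i.d. Gaussian noise with variance 1."""
    rng = np.random.default_rng(seed)
    levels = rng.integers(1, 11, size=k)
    f = np.repeat(levels, n // k).astype(float)
    y = f + rng.normal(0.0, 1.0, size=f.shape)
    X = np.ones((f.shape[0], 1))        # effectively d = 1 (constant fit per segment)
    return X, y, f

def piecewise_linear_data(n, k=5, d=10, seed=0):
    """Gaussian design matrix; per-segment coefficients uniform in [-1, 1];
    i.i.d. Gaussian noise with variance 1."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    f = np.empty(n)
    bounds = np.linspace(0, n, k + 1, dtype=int)
    for a, b in zip(bounds[:-1], bounds[1:]):
        beta = rng.uniform(-1.0, 1.0, size=d)
        f[a:b] = X[a:b] @ beta
    y = f + rng.normal(0.0, 1.0, size=n)
    return X, y, f
```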
As predicted by our theory, the plots show that the exact dynamic program has a better statistical performance. However, the MSE of the merging algorithm with $2k$ pieces is only worse by a factor of 2 to 4, and this ratio empirically increases only slowly with $n$ (if at all). The experiments also show that forcing the merging algorithm to return at most $k$ pieces can lead to a significantly worse MSE.

In terms of computational performance, the merging algorithm has a significantly faster running time, with speedups of more than 1,000 for $n = 10^4$ samples. As can be seen in Figure 2, this combination of statistical and computational performance leads to a significantly improved trade-off between the two quantities. When we have a sufficient number of samples, the merging algorithm achieves a given MSE roughly 100 times faster than the dynamic program.

[Figure 1 (plots of MSE, relative MSE ratio, and running time in seconds vs. $n$ for Merging $k$, Merging $2k$, Merging $4k$, and Exact DP; panels for piecewise constant and piecewise linear data). Caption: Experiments with synthetic data: results for piecewise constant models with $k = 10$ segments (top row) and piecewise linear models with $k = 5$ segments (bottom row, dimension $d = 10$). Compared to the exact dynamic program, the MSE achieved by the merging algorithm is worse but stays within a factor of 2 to 4 for a sufficient number of output segments. The merging algorithm is significantly faster and achieves a speed-up of about $10^3$ compared to the dynamic program for $n = 10^4$. This leads to a significantly better trade-off between statistical and computational performance (see also Figure 2).]

[Figure 2 (running time vs. MSE for the piecewise constant and piecewise linear setups). Caption: Computational vs. statistical efficiency in the synthetic data experiments. The solid lines correspond to the data in Figure 1, the dashed lines show the results from additional runs of the merging algorithms for larger values of $n$. The merging algorithm achieves the same MSE as the dynamic program about 100 times faster if a sufficient number of samples is available.]

**Real data.** We also investigate whether the merging algorithm can empirically be used to find linear trends in a real dataset. We use a time series of the Dow Jones index as input, and fit a piecewise linear function ($d = 2$) with 5 segments using both the dynamic program and our merging algorithm with $k = 5$ output pieces. As can be seen from Figure 3, the dynamic program produces a slightly better fit for the rightmost part of the curve, but the merging algorithm identifies roughly the same five main segments. As before, the merging algorithm is significantly faster and achieves a 200 times speed-up compared to the dynamic program (0.013 vs. 3.2 seconds).

[Figure 3 (Dow Jones data with the piecewise linear approximation). Caption: Results of fitting a 5-piecewise linear function ($d = 2$) to a Dow Jones time series. The merging algorithm produces a fit that is comparable to the dynamic program and is about 200 times faster (0.013 vs. 3.2 seconds).]

## Acknowledgements

Part of this research was conducted while Ilias Diakonikolas was at the University of Edinburgh, Jerry Li was at Microsoft Research Cambridge (UK), and Ludwig Schmidt was at UC Berkeley.
Jayadev Acharya was supported by a grant from the MIT-Shell Energy Initiative. Ilias Diakonikolas was supported in part by EPSRC grant EP/L021749/1, a Marie Curie Career Integration Grant, and a SICSA grant. Jerry Li was supported by NSF grant CCF-1217921, DOE grant DE-SC0008923, NSF CAREER Award CCF-145326, and an NSF Graduate Research Fellowship. Ludwig Schmidt was supported by grants from the MIT-Shell Energy Initiative, MADALGO, and the Simons Foundation.

## References

Acharya, J., Diakonikolas, I., Hegde, C., Li, J. Z., and Schmidt, L. Fast and near-optimal algorithms for approximating distributions by histograms. In PODS, pp. 249-263, 2015a.

Acharya, J., Diakonikolas, I., Li, J. Z., and Schmidt, L. Sample-optimal density estimation in nearly-linear time. CoRR, abs/1506.00671, 2015b. URL http://arxiv.org/abs/1506.00671.

Avron, H., Sindhwani, V., and Woodruff, D. Sketching structured matrices for faster nonlinear regression. In NIPS, pp. 2994-3002, 2013.

Bai, J. and Perron, P. Estimating and testing linear models with multiple structural changes. Econometrica, 66(1):47-78, 1998.

Chatterjee, S., Guntuboyina, A., and Sen, B. On risk bounds in isotonic and other shape restricted regression problems. Annals of Statistics, 43(4):1774-1800, 2015.

Feder, P. I. On asymptotic distribution theory in segmented regression problems: identified case. Annals of Statistics, 3(1):49-83, 1975.

Friedman, J. H. Multivariate adaptive regression splines. Annals of Statistics, 19(1):1-67, 1991.

Gallant, A. R. and Fuller, W. A. Fitting segmented polynomial regression models whose join points have to be estimated. Journal of the American Statistical Association, 68(341):144-147, 1973.

Jordan, M. I. On statistics, computation and scalability. Bernoulli, 19(4):1378-1390, 2013.

Kyng, R., Rao, A., and Sachdeva, S. Fast, provable algorithms for isotonic regression in all lp-norms. In NIPS, pp. 2701-2709, 2015.

Meyer, M. C. Inference using shape-restricted regression splines. Annals of Applied Statistics, 2(3):1013-1033, 2008.

Mosteller, F. and Tukey, J. W. Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley, Reading, MA, 1977.

Rigollet, P. High dimensional statistics. 2015. URL http://www-math.mit.edu/~rigollet/PDFs/RigNotes15.pdf.

Stone, C. J. The use of polynomial splines and their tensor products in multivariate function estimation. Annals of Statistics, 22(1):118-171, 1994.

Stone, C. J., Hansen, M. H., Kooperberg, C., and Truong, Y. K. Polynomial splines and their tensor products in extended linear modeling: 1994 Wald memorial lecture. Annals of Statistics, 25(4):1371-1470, 1997.

Wegman, E. J. and Wright, I. W. Splines in statistics. Journal of the American Statistical Association, 78(382):351-365, 1983.

Yamamoto, Y. and Perron, P. Estimating and testing multiple structural changes in linear models using band spectral regressions. Econometrics Journal, 16(3):400-429, 2013.