# Layered Sampling for Robust Optimization Problems

Hu Ding and Zixiu Wang
School of Computer Science and Technology, University of Science and Technology of China
*Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020.*

**Abstract.** In the real world, our datasets often contain outliers. Most existing algorithms for handling outliers have high time complexities (e.g., quadratic or cubic complexity). Coreset construction is a popular approach for compressing data so as to speed up optimization algorithms. However, the current coreset methods cannot be easily extended to handle the case with outliers. In this paper, we propose a new variant of the coreset technique, *layered sampling*, to deal with two fundamental robust optimization problems: $k$-median/means clustering with outliers and linear regression with outliers. This new coreset method is particularly suitable for speeding up the iterative algorithms (which often improve the solution within a local range) for those robust optimization problems.

## 1. Introduction

Coreset is a widely studied technique for solving many optimization problems (Phillips, 2016; Bachem et al., 2017; Munteanu et al., 2018; Feldman, 2020). The (informal) definition is as follows. Given an optimization problem with objective function $\mathcal{A}$, denote by $\mathcal{A}(P, C)$ the objective value determined by a dataset $P$ and a solution $C$; a small set $S$ is called a coreset if

$$\mathcal{A}(P, C) \approx \mathcal{A}(S, C) \qquad (1)$$

for any feasible solution $C$. Roughly speaking, a coreset is a small set of data that approximately represents a much larger dataset, so an existing algorithm can run on the coreset (instead of the original dataset) to reduce complexity measures such as running time, space, and communication. In the past years, coreset techniques have been successfully applied to many optimization problems, such as clustering (Chen, 2009; Feldman & Langberg, 2011; Huang et al., 2018), logistic regression (Huggins et al., 2016; Munteanu et al., 2018), linear regression (Dasgupta et al., 2009; Drineas et al., 2006), and Gaussian mixture models (Lucic et al., 2017; Karnin & Liberty, 2019).

A large part of the existing coreset construction methods are based on the theory of sensitivity, proposed by (Langberg & Schulman, 2010). Informally, each data point $p \in P$ has a sensitivity $\phi(p)$ (in fact, it suffices to compute an appropriate upper bound of $\phi(p)$) that measures its importance to the whole instance $P$ over all possible solutions, and $\Phi(P) = \sum_{p \in P} \phi(p)$ is called the total sensitivity. The coreset construction is a simple sampling procedure: each point $p$ is drawn i.i.d. from $P$ with probability proportional to $\frac{\phi(p)}{\Phi(P)}$; each sampled point $p$ is assigned a weight $w(p) = \frac{\Phi(P)}{m\,\phi(p)}$, where $m$ is the sample size depending on the pseudo-dimension of the objective function (Feldman & Langberg, 2011; Li et al., 2001); eventually, the set of weighted sampled points forms the desired coreset $S$.
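To make the sensitivity framework concrete, the following is a minimal sketch of generic sensitivity-based importance sampling. It assumes the upper bounds on the sensitivities are already available (computing them is problem-specific), and the function name and interface are illustrative choices of ours, not code from the paper.

```python
import numpy as np

def sensitivity_coreset(P, phi, m, rng=np.random.default_rng(0)):
    """Generic sensitivity-based coreset sampling (illustrative sketch).

    P   : (n, d) array of points.
    phi : (n,) array of (upper bounds on) sensitivities phi(p).
    m   : sample size.
    Returns the sampled points and weights w(p) = Phi(P) / (m * phi(p)).
    """
    total = phi.sum()                     # Phi(P), the total sensitivity
    prob = phi / total                    # sampling distribution
    idx = rng.choice(len(P), size=m, replace=True, p=prob)  # i.i.d. draws
    weights = total / (m * phi[idx])      # weight of each sampled point
    return P[idx], weights
```

Running an existing weighted solver on the returned (points, weights) pair, rather than on $P$, is then justified by (1).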
In the real world, datasets are noisy and contain outliers. Moreover, outliers can seriously affect the final results in data analysis (Chandola et al., 2009; Goodfellow et al., 2018). However, the sensitivity-based coreset approach is not appropriate for robust optimization problems involving outliers (e.g., $k$-means clustering with outliers). For example, it is not easy to compute the sensitivity $\phi(p)$, because the point $p$ could be an inlier or an outlier for different solutions; moreover, it is challenging to build a relation such as (1) between the original instance $P$ and the coreset $S$ (e.g., how should the number of outliers be determined for the instance $S$?).

### 1.1. Our Contributions

In this paper, we consider two important robust optimization problems: $k$-median/means clustering with outliers and linear regression with outliers. Quality-guaranteed algorithms for them exist, but they often have high complexities that seriously limit their applications in real scenarios (see Section 1.2 for more details). We observe that these problems can often be solved efficiently by heuristic algorithms in practice, though such algorithms only guarantee local optima in theory. For example, (Chawla & Gionis, 2013) proposed the algorithm $k$-means-- for the problem of $k$-means clustering with outliers, whose main idea is an alternating minimization strategy: the algorithm is an iterative procedure that alternately updates the outliers and the $k$ cluster centers in each iteration, and eventually the solution converges to a local optimum. The alternating minimization strategy is also widely used for linear regression with outliers, e.g., (Shen & Sanghavi, 2019). A common feature of these methods is that they usually start from an initial solution and then locally improve the solution round by round (a code sketch of this style of alternating minimization is given at the end of this subsection). Therefore, a natural question is: can we construct a coreset only for a local range in the solution space? Using such a coreset, we can substantially speed up those iterative algorithms.

Motivated by this question, we introduce a new variant of the coreset method called layered sampling. Given an initial solution $\tilde{C}$, we partition the given dataset $P$ into a consecutive sequence of layers surrounding $\tilde{C}$ and conduct random sampling in each layer; the union of the samples, together with the points located in the outermost layer, forms the coreset $S$. Our method is partly inspired by the coreset construction method for $k$-median/means clustering (without outliers) proposed by (Chen, 2009). However, we need to develop significantly new ideas in theory to prove its correctness for the case with outliers. The purpose of layered sampling is not to guarantee the approximation quality (as in (1)) for any solution $C$; instead, it only guarantees the quality for the solutions in a local range $\mathcal{L}$ of the solution space (the formal definition is given in Section 1.3). Informally, we need to prove the following result in place of (1):

$$\forall C \in \mathcal{L}, \quad \mathcal{A}(P, C) \approx \mathcal{A}(S, C). \qquad (2)$$

[Figure 1: a local range $\mathcal{L}$ inside the solution space. The red point represents the initial solution $\tilde{C}$, and our goal is to guarantee (2) for a local range around $\tilde{C}$.]

See Figure 1 for an illustration. In other words, the new method can help us find a local optimum faster. Our main results are stated in Theorems 1 and 2. The construction algorithms are easy to implement.
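As a concrete illustration of the alternating-minimization heuristics discussed above, here is a minimal sketch in the spirit of $k$-means-- (Chawla & Gionis, 2013). It is not the original authors' code; the function name, initialization, and stopping rule are our own illustrative choices.

```python
import numpy as np

def kmeans_minus_minus(P, k, z, n_iter=50, rng=np.random.default_rng(0)):
    """Alternating minimization for k-means with z outliers (illustrative sketch).

    Each round: (1) mark the z farthest points as outliers,
    (2) assign the remaining points to their nearest centers,
    (3) recompute each center as the mean of its cluster.
    """
    P = np.asarray(P, dtype=float)
    n, d = P.shape
    centers = P[rng.choice(n, size=k, replace=False)]        # simple initialization
    inliers = np.arange(n)
    for _ in range(n_iter):
        dist = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)  # (n, k)
        nearest = dist.min(axis=1)
        inliers = np.argsort(nearest)[: n - z]                # discard the z farthest points
        labels = dist[inliers].argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = inliers[labels == j]
            if len(members) > 0:
                new_centers[j] = P[members].mean(axis=0)
        if np.allclose(new_centers, centers):                 # reached a local optimum
            break
        centers = new_centers
    return centers, inliers
```

Layered sampling targets exactly this kind of procedure: since each round only moves the solution within a local range, a coreset that is accurate on that range suffices.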
### 1.2. Related Works

**k-median/means clustering (with outliers).** $k$-median/means clustering are two popular center-based clustering problems (Awasthi & Balcan, 2014). Using coreset techniques to reduce the complexities of $k$-median/means clustering algorithms has been studied extensively (Chen, 2009; Har-Peled & Kushal, 2007; Fichtenberger et al., 2013; Feldman et al., 2013); in particular, (Feldman & Langberg, 2011) proposed a unified coreset framework for a set of clustering problems. However, the research on using coresets to handle outliers is still quite limited. Recently, (Huang et al., 2018) showed that a uniform independent sample can serve as a coreset for clustering with outliers in Euclidean space; however, such uniform-sampling-based methods often miss some important points and therefore introduce an unavoidable error on the number of outliers. (Gupta, 2018) also studied the uniform random sampling idea, but under the assumption that each optimal cluster is large enough. Partly inspired by the method of (Mettu & Plaxton, 2004), (Chen et al., 2018) proposed a novel summary construction algorithm to reduce the input data size which guarantees an $O(1)$ factor of distortion on the clustering cost. In theory, the algorithms with provable guarantees for $k$-median/means clustering with outliers (Chen, 2008; Krishnaswamy et al., 2018; Friggstad et al., 2018) have high complexities and are difficult to implement in practice. Heuristic but practical algorithms have also been studied (Chawla & Gionis, 2013; Ott et al., 2014). Using the local search method, (Gupta et al., 2017) provided a 274-approximation algorithm for $k$-means clustering with outliers, but it needs to discard more than the desired number of outliers; to improve the running time, they also used $k$-means++ (Arthur & Vassilvitskii, 2007) to seed the coreset, which yields an $O(1)$-factor approximation. Based on the idea of $k$-means++, (Bhaskara et al., 2019) proposed an $O(\log k)$-approximation algorithm.

**Linear regression (with outliers).** Several coreset methods for ordinary linear regression (without outliers) have been proposed (Drineas et al., 2006; Dasgupta et al., 2009; Boutsidis et al., 2013). For the case with outliers, which is also called the Least Trimmed Squares (LTS) linear estimator, a uniform sampling approach was studied by (Mount et al., 2014; Ding & Xu, 2014); but, similar to the scenario of clustering with outliers, such a uniform sampling approach introduces an unavoidable error on the number of outliers. (Mount et al., 2014) also proved that it is impossible to achieve even an approximate solution for LTS within polynomial time, under the conjecture of the hardness of affine degeneracy (Erickson & Seidel, 1995), if the dimensionality $d$ is not fixed. Despite its high complexity, several practical algorithms have been proposed, and most of them are based on the idea of alternating minimization that improves the solution within a local range, such as (Rousseeuw, 1984; Rousseeuw & van Driessen, 2006; Hawkins, 1994; Mount et al., 2016; Bhatia et al., 2015; Shen & Sanghavi, 2019). (Klivans et al., 2018) provided another approach based on the sum-of-squares method.

### 1.3. Preliminaries

Below, we introduce several important definitions.

**i. k-Median/Means Clustering with Outliers.** Suppose $P$ is a set of $n$ points in $\mathbb{R}^d$. Given two integers $1 \leq z, k < n$, the problem of $k$-median clustering with $z$ outliers is to find a set of $k$ points $C = \{c_1, \cdots, c_k\} \subset \mathbb{R}^d$ and a subset $P' \subseteq P$ with $|P'| = n - z$, such that the following objective function

$$\mathcal{K}^z_1(P, C) = \frac{1}{n-z} \sum_{p \in P'} \min_{1 \leq j \leq k} \|p - c_j\| \qquad (3)$$

is minimized. Similarly, we have the objective function

$$\mathcal{K}^z_2(P, C) = \frac{1}{n-z} \sum_{p \in P'} \min_{1 \leq j \leq k} \|p - c_j\|^2 \qquad (4)$$

for $k$-means clustering with outliers. The set $C$ is also called a solution of the instance $P$. Roughly speaking, given a solution $C$, the farthest $z$ points to $C$ are discarded, and the remaining subset $P'$ is partitioned into $k$ clusters where each point is assigned to its nearest neighbor in $C$.
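The two objectives can be evaluated directly from their definitions: for a candidate $C$, drop the $z$ farthest points and average the (squared) distances of the rest. A minimal sketch follows; the function name is ours.

```python
import numpy as np

def clustering_cost_with_outliers(P, C, z, squared=False):
    """Evaluate K^z_1 (or K^z_2 if squared=True) from definitions (3)-(4)."""
    dist = np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2).min(axis=1)
    if squared:
        dist = dist ** 2
    kept = np.sort(dist)[: len(P) - z]   # discard the z farthest points
    return kept.mean()                   # average over the n - z remaining points
```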
**ii. Linear Regression with Outliers.** Given a vector $h = (h_1, h_2, \cdots, h_d) \in \mathbb{R}^d$, the linear function defined by $h$ is $y = \sum_{j=1}^{d-1} h_j x_j + h_d$ for the $d-1$ real variables $x_1, x_2, \cdots, x_{d-1}$; thus the linear function can be represented by the vector $h$. From a geometric perspective, the linear function can be viewed as a $(d-1)$-dimensional hyperplane in the space. Let $z$ be an integer between $1$ and $n$, and let $P = \{p_1, p_2, \cdots, p_n\}$ be a set of $n$ points in $\mathbb{R}^d$, where each $p_i = (x_{i,1}, x_{i,2}, \cdots, x_{i,d-1}, y_i)$ for $1 \leq i \leq n$. The objective is to find a subset $P' \subseteq P$ with $|P'| = n - z$ and a $(d-1)$-dimensional hyperplane, represented as a coefficient vector $h = (h_1, h_2, \cdots, h_d) \in \mathbb{R}^d$, such that

$$\mathcal{LR}^z_1(P', h) = \frac{1}{n-z} \sum_{p_i \in P'} |Res(p_i, h)| \qquad (5)$$

or

$$\mathcal{LR}^z_2(P', h) = \frac{1}{n-z} \sum_{p_i \in P'} \big(Res(p_i, h)\big)^2 \qquad (6)$$

is minimized, where $Res(p_i, h) = y_i - \sum_{j=1}^{d-1} h_j x_{i,j} - h_d$ is the residual of $p_i$ with respect to $h$. The objective functions (5) and (6) are called the least absolute error and the least squared error, respectively. As in the clustering case, for a fixed $h$ we write $\mathcal{LR}^z_1(P, h)$ (resp. $\mathcal{LR}^z_2(P, h)$) for the value obtained by taking $P'$ to be the $n-z$ points with the smallest absolute residuals.

**Remark 1.** All the above problems can be extended to the weighted case. Suppose each point $p$ has a non-negative weight $w(p)$; then the (squared) distance $\|p - c_j\|$ (resp. $\|p - c_j\|^2$) is replaced by $w(p)\,\|p - c_j\|$ (resp. $w(p)\,\|p - c_j\|^2$), and we perform the similar modification on $|Res(p_i, h)|$ and $(Res(p_i, h))^2$ for the problem of linear regression with outliers. Moreover, the total weight of the outliers should be equal to $z$. Namely, we can view each point $p$ as $w(p)$ unit-weight overlapping points.

**Solution range.** To analyze the performance of our layered sampling method, we also need to define the solution range for the clustering and linear regression problems. Consider the clustering problems first. Given a clustering solution $\tilde{C} = \{\tilde{c}_1, \cdots, \tilde{c}_k\} \subset \mathbb{R}^d$ and $L > 0$, we use $\mathcal{C}_L$ to denote the range of solutions

$$\mathcal{C}_L = \Big\{ C = \{c_1, \cdots, c_k\} \;\Big|\; \|\tilde{c}_j - c_j\| \leq L,\ 1 \leq j \leq k \Big\}. \qquad (7)$$

Next, we define the solution range for linear regression with outliers. Given an instance $P$, we often normalize the values in each of the first $d-1$ dimensions as a preprocessing step; without loss of generality, we can assume that $x_{i,j} \in [0, D]$ for some $D > 0$, for any $1 \leq i \leq n$ and $1 \leq j \leq d-1$. For convenience, denote by $\mathcal{R}_D$ the region $\{(s_1, s_2, \cdots, s_d) \mid 0 \leq s_j \leq D,\ 1 \leq j \leq d-1\}$; thus $P \subset \mathcal{R}_D$ after the normalization. It is easy to see that the region $\mathcal{R}_D$ is actually a vertical square cylinder in the space. Given a coefficient vector (hyperplane) $\tilde{h} = (\tilde{h}_1, \tilde{h}_2, \cdots, \tilde{h}_d) \in \mathbb{R}^d$ and $L > 0$, we use $\mathcal{H}_L$ to denote the range of hyperplanes

$$\mathcal{H}_L = \Big\{ h = (h_1, \cdots, h_d) \;\Big|\; |Res(p, h) - Res(p, \tilde{h})| \leq L,\ \forall p \in \mathcal{R}_D \Big\}. \qquad (8)$$

To understand the range defined in (8), imagine the two linear functions $h^+ = (\tilde{h}_1, \tilde{h}_2, \cdots, \tilde{h}_d + L)$ and $h^- = (\tilde{h}_1, \tilde{h}_2, \cdots, \tilde{h}_d - L)$; if we only consider the region $\mathcal{R}_D$, the range (8) contains all the linear functions sandwiched between $h^+$ and $h^-$. For both the clustering and regression problems, we say that the size of the solution range ($\mathcal{C}_L$ or $\mathcal{H}_L$) is $L$.
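For completeness, the regression objective can be evaluated in the same trimmed manner as the clustering one: compute all residuals, drop the $z$ largest in absolute value, and average the rest. A minimal sketch (function names are ours):

```python
import numpy as np

def residuals(P, h):
    """Res(p, h) = y - sum_j h_j * x_j - h_d for each p = (x_1,...,x_{d-1}, y)."""
    X, y = P[:, :-1], P[:, -1]
    return y - X @ h[:-1] - h[-1]

def regression_cost_with_outliers(P, h, z, squared=False):
    """Evaluate LR^z_1 (or LR^z_2 if squared=True) from definitions (5)-(6)."""
    res = np.abs(residuals(P, h))
    if squared:
        res = res ** 2
    kept = np.sort(res)[: len(P) - z]    # discard the z points with largest residual
    return kept.mean()
```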
## 2. The Layered Sampling Framework

We now present an overview of our layered sampling framework. For the sake of completeness, we first introduce the coreset construction method for ordinary $k$-median/means clustering proposed by (Chen, 2009). Suppose $\alpha, \beta \geq 1$. A bi-criteria $(\alpha, \beta)$-approximation is a solution that contains $\alpha k$ cluster centers and whose induced clustering cost is at most $\beta$ times the optimum. Usually, finding a bi-criteria approximation is much easier than achieving a single-criterion approximation. For example, one can obtain a bi-criteria approximation for $k$-median/means clustering in linear time with $\alpha = O(1)$ and $\beta = O(1)$ (Chen, 2009). Let $T = \{t_1, t_2, \cdots, t_{\alpha k}\} \subset \mathbb{R}^d$ be the obtained $(\alpha, \beta)$-approximate solution of the input instance $P$. For convenience, we use $\mathbb{B}(c, r)$ to denote the ball centered at a point $c$ with radius $r > 0$. At the beginning, Chen's coreset construction algorithm takes two carefully designed values $r > 0$ and $N = O(\log n)$, and partitions the space into $N+1$ layers $H_0, H_1, \cdots, H_N$, where $H_0 = \cup_{j=1}^{\alpha k} \mathbb{B}(t_j, r)$ and $H_i = \big(\cup_{j=1}^{\alpha k} \mathbb{B}(t_j, 2^i r)\big) \setminus \big(\cup_{j=1}^{\alpha k} \mathbb{B}(t_j, 2^{i-1} r)\big)$ for $1 \leq i \leq N$. It can be proved that $P$ is covered by $\cup_{i=0}^{N} H_i$; the algorithm then takes a random sample $S_i$ from each layer $P \cap H_i$, and the union $\cup_{i=0}^{N} S_i$ forms the desired coreset $S$ satisfying condition (1).

However, this approach cannot directly solve the case with outliers. First, it is not easy to obtain a bi-criteria approximation for the problem of $k$-median/means clustering with outliers (e.g., in linear time). Moreover, it is challenging to guarantee condition (1) for any feasible solution $C$, because the set of outliers could change when $C$ changes (this is also the major challenge for proving the correctness of our method later on). We therefore propose a modified version of Chen's coreset construction method and aim to guarantee (2) for a local range of solutions. We take the $k$-median clustering with outliers problem as an example. Let $\tilde{C} = \{\tilde{c}_1, \cdots, \tilde{c}_k\} \subset \mathbb{R}^d$ be a given solution. Assume $\epsilon > 0$ and $N \in \mathbb{Z}^+$ are two pre-specified parameters. With a slight abuse of notation, we still use $H_0, H_1, \cdots, H_N$ to denote the layers surrounding $\tilde{C}$, i.e.,

$$H_0 = \cup_{j=1}^{k} \mathbb{B}(\tilde{c}_j, r); \qquad (9)$$

$$H_i = \Big(\cup_{j=1}^{k} \mathbb{B}(\tilde{c}_j, 2^i r)\Big) \setminus \Big(\cup_{j=1}^{k} \mathbb{B}(\tilde{c}_j, 2^{i-1} r)\Big) \quad \text{for } 1 \leq i \leq N. \qquad (10)$$

In addition, let

$$H_{out} = \mathbb{R}^d \setminus \Big(\cup_{j=1}^{k} \mathbb{B}(\tilde{c}_j, 2^N r)\Big). \qquad (11)$$

Here, we set the value $r$ to satisfy the following condition:

$$|P \cap H_{out}| = \big(1 + \tfrac{1}{\epsilon}\big) z. \qquad (12)$$

That is, the union of the layers $\cup_{i=0}^{N} H_i$ covers $n - (1 + \frac{1}{\epsilon})z$ points of $P$ and excludes the farthest $(1 + \frac{1}{\epsilon})z$ points. Obviously, such a value $r$ always exists. Suppose $\tilde{P}$ is the set of $n - z$ inliers induced by $\tilde{C}$; since $\sum_{p \in \tilde{P}} \min_{1 \leq j \leq k} \|p - \tilde{c}_j\| = (n-z)\,\mathcal{K}^z_1(P, \tilde{C})$ and at least $\frac{z}{\epsilon}$ of these inliers lie in $H_{out}$, we have

$$2^N r \leq \frac{\epsilon}{z}(n-z)\,\mathcal{K}^z_1(P, \tilde{C}) \qquad (13)$$

via Markov's inequality. Our new coreset contains the following $N+2$ parts:

$$S = S_0 \cup S_1 \cup \cdots \cup S_N \cup S_{out}, \qquad (14)$$

where $S_i$ is still a random sample from $P \cap H_i$ for $0 \leq i \leq N$, and $S_{out}$ contains all the $(1 + \frac{1}{\epsilon})z$ points in $H_{out}$. The construction is summarized in Algorithm 1 below. In Section 3, we will show that the coreset $S$ of (14) satisfies (2) for the $k$-median clustering with outliers problem (and similarly for the $k$-means clustering with outliers problem).

Algorithm 1 (LAYERED SAMPLING FOR k-MED-OUTLIER)
Input: An instance $P \subset \mathbb{R}^d$ of $k$-median clustering with $z$ outliers, a solution $\tilde{C} = \{\tilde{c}_1, \cdots, \tilde{c}_k\}$, and two parameters $\epsilon, \eta \in (0, 1)$.
1. Let $\gamma = z/(n-z)$ and $N = \lceil \log \frac{1}{\gamma} \rceil$. Compute the value $r$ satisfying (12).
2. As described in (9), (10), and (11), the space is partitioned into $N+2$ layers $H_0, H_1, \cdots, H_N$ and $H_{out}$.
3. Randomly sample $\min\{\tilde{O}(\frac{1}{\epsilon^2} k d),\ |P \cap H_i|\}$ points, denoted by $S_i$, from $P \cap H_i$ for $0 \leq i \leq N$.
4. For each point $p \in S_i$, set its weight to be $|P \cap H_i| / |S_i|$; let $S_H = \cup_{i=0}^{N} S_i$.
Output $S = S_H \cup (P \cap H_{out})$.
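Algorithm 1 can be implemented in a few lines. The sketch below is ours, not the authors' code: the per-layer sample size is passed in as a parameter `m` rather than derived from the $\tilde{O}(\cdot)$ bound, and the value of $r$ is obtained by taking $2^N r$ to be the largest distance among the $n - (1+1/\epsilon)z$ closest points, which realizes condition (12).

```python
import numpy as np

def layered_sampling_kmed_outlier(P, C_init, z, eps, m, rng=np.random.default_rng(0)):
    """Layered sampling coreset for k-median with z outliers (sketch of Algorithm 1).

    Returns (points, weights): the union of per-layer samples S_H and all
    (1 + 1/eps) * z outermost points, with weights |P ∩ H_i| / |S_i| (weight 1 outside).
    """
    P = np.asarray(P, dtype=float)
    n = len(P)
    dist = np.linalg.norm(P[:, None, :] - C_init[None, :, :], axis=2).min(axis=1)
    gamma = z / (n - z)
    N = int(np.ceil(np.log2(1.0 / gamma)))
    n_out = int(np.ceil((1 + 1.0 / eps) * z))
    if n_out >= n:
        return P.copy(), np.ones(n)                          # coreset degenerates to P itself
    order = np.argsort(dist)
    inner, outer = order[: n - n_out], order[n - n_out:]     # layers H_0..H_N vs. H_out
    r = max(dist[inner].max(), 1e-12) / (2 ** N)             # so that 2^N r covers the inner layers
    layer = np.clip(np.ceil(np.log2(np.maximum(dist[inner], r) / r)), 0, N).astype(int)
    pts, wts = [P[outer]], [np.ones(n_out)]                  # H_out is kept entirely, weight 1
    for i in range(N + 1):
        idx = inner[layer == i]
        if len(idx) == 0:
            continue
        take = min(m, len(idx))
        chosen = rng.choice(idx, size=take, replace=False)
        pts.append(P[chosen])
        wts.append(np.full(take, len(idx) / take))           # weight |P ∩ H_i| / |S_i|
    return np.vstack(pts), np.concatenate(wts)
```

Running a weighted variant of an alternating-minimization solver on the returned (points, weights) pair then requires only $|S|$ distance evaluations per iteration instead of $n$.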
For the linear regression with outliers problem, we apply a similar layered sampling framework. Define $\mathbb{S}(h, r)$ to be the slab centered at a $(d-1)$-dimensional hyperplane $h$ with $r > 0$, i.e., $\mathbb{S}(h, r) = \{p \in \mathbb{R}^d \mid -r \leq Res(p, h) \leq r\}$. Let $P$ be an instance and $\tilde{h} = (\tilde{h}_1, \cdots, \tilde{h}_d) \in \mathbb{R}^d$ be a given hyperplane. We divide the space into $N+2$ layers $H_0, H_1, \cdots, H_N, H_{out}$, where

$$H_0 = \mathbb{S}(\tilde{h}, r); \qquad (15)$$

$$H_i = \mathbb{S}(\tilde{h}, 2^i r) \setminus \mathbb{S}(\tilde{h}, 2^{i-1} r) \quad \text{for } 1 \leq i \leq N; \qquad (16)$$

$$H_{out} = \mathbb{R}^d \setminus \mathbb{S}(\tilde{h}, 2^N r). \qquad (17)$$

Similar to (12), we also require the value $r$ to satisfy the following condition:

$$|P \cap H_{out}| = \big|P \setminus \mathbb{S}(\tilde{h}, 2^N r)\big| = \big(1 + \tfrac{1}{\epsilon}\big) z. \qquad (18)$$

Consequently, we have

$$2^N r \leq \frac{\epsilon}{z}(n-z)\,\mathcal{LR}^z_1(P, \tilde{h}). \qquad (19)$$

Then, we construct the coreset for linear regression with outliers in the same manner as (14).

## 3. k-Median/Means Clustering with Outliers

In this section, we provide the details of applying our layered sampling framework to the problem of $k$-median clustering with outliers; see Algorithm 1. The algorithm and analysis can be easily modified to handle $k$-means clustering with outliers, where the only difference is that we need to replace (13) by $2^N r \leq \sqrt{\frac{\epsilon}{z}(n-z)\,\mathcal{K}^z_2(P, \tilde{C})}$.

**Theorem 1.** Algorithm 1 returns a point set $S$ of size $|S| = \tilde{O}(\frac{1}{\epsilon^2} k d) + (1 + \frac{1}{\epsilon})z$, where the asymptotic notation $\tilde{O}(f) = O\big(f \cdot \mathrm{polylog}(\frac{d}{\gamma \epsilon \eta})\big)$. Moreover, with probability at least $1 - \eta$, for any $L > 0$ and any solution $C \in \mathcal{C}_L$, we have

$$\big|\mathcal{K}^z_1(S, C) - \mathcal{K}^z_1(P, C)\big| \leq \epsilon \big(\mathcal{K}^z_1(P, \tilde{C}) + L\big). \qquad (20)$$

Here, $S$ is a weighted instance of $k$-median clustering with outliers, and the total weight of the outliers is $z$ (see Remark 1).

**Remark 2.** (1) The running time of Algorithm 1 is $O(knd)$: for each point $p \in P$, we compute its shortest distance to $\tilde{C}$, $\min_{1 \leq j \leq k} \|p - \tilde{c}_j\|$; then we select the farthest $(1 + 1/\epsilon)z$ points and compute the value $r$ by running the linear-time selection algorithm (Blum et al., 1973); finally, we obtain the $N+1$ layers $H_i$ with $0 \leq i \leq N$ and take the samples $S_0, S_1, \cdots, S_N$ from them.
(2) Compared with the standard coreset guarantee (1), our result contains an additive error $\epsilon(\mathcal{K}^z_1(P, \tilde{C}) + L)$ in (20) that depends on the initial objective value $\mathcal{K}^z_1(P, \tilde{C})$ and the size $L$ of the solution range. In particular, the smaller the range size $L$, the lower the error of our coreset.
(3) The algorithm of (Chen et al., 2018) also returns a summary for compressing the input data, but there are two major differences compared with our result. First, their summary guarantees a constant factor of distortion on the clustering cost, while our error approaches $0$ if $\epsilon$ is small enough. Second, their construction algorithm (called successive sampling, from (Mettu & Plaxton, 2004)) needs to scan the data in multiple passes, while our Algorithm 1 is much simpler and only needs to read the data in one pass. We also compare these two methods in our experiments.

To prove Theorem 1, we first show that $S_H$ is a good approximation of $P \setminus H_{out}$. Fix a solution $C \in \mathcal{C}_L$ and view the distance from each point $p \in P$ to $C$, i.e., $\min_{1 \leq j \leq k} \|p - c_j\|$, as a random variable $x_p$. For any point $p \in P \cap H_i$ with $0 \leq i \leq N$, we have the following bounds for $x_p$. Suppose $p$ is covered by $\mathbb{B}(\tilde{c}_{j_1}, 2^i r)$, and let the nearest neighbor of $p$ in $C$ be $c_{j_2}$. Then we have the upper bound

$$x_p = \|p - c_{j_2}\| \leq \|p - c_{j_1}\| \leq \|p - \tilde{c}_{j_1}\| + \|\tilde{c}_{j_1} - c_{j_1}\| \leq 2^i r + L. \qquad (21)$$

Similarly, we have the lower bound

$$x_p \geq \max\{2^{i-1} r - L,\ 0\} \ \text{ if } i \geq 1; \qquad x_p \geq 0 \ \text{ if } i = 0. \qquad (22)$$

Therefore, we can take a sufficiently large random sample $\hat{S}_i$ from $P \cap H_i$ such that $\frac{1}{|\hat{S}_i|}\sum_{p \in \hat{S}_i} x_p \approx \frac{1}{|P \cap H_i|}\sum_{p \in P \cap H_i} x_p$ with certain probability. Specifically, combining (21) and (22), we have the following lemma through Hoeffding's inequality.

**Lemma 1.** Let $\eta \in (0, 1)$. If we randomly sample $O(\frac{1}{\epsilon^2}\log\frac{1}{\eta})$ points, denoted by $\hat{S}_i$, from $P \cap H_i$, then with probability $1 - \eta$,

$$\Big|\frac{1}{|\hat{S}_i|}\sum_{p \in \hat{S}_i} x_p - \frac{1}{|P \cap H_i|}\sum_{p \in P \cap H_i} x_p\Big| \leq \epsilon\,(2^i r + 2L).$$
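For completeness, here is a sketch of the standard Hoeffding calculation behind Lemma 1 (the paper does not spell it out, so the constants are only indicative). By (21) and (22), each $x_p$ with $p \in P \cap H_i$ lies in an interval of length at most $2^i r + 2L$, so for a uniform sample $\hat{S}_i$ of size $m$,

$$\Pr\Big[\Big|\frac{1}{m}\sum_{p \in \hat{S}_i} x_p - \frac{1}{|P \cap H_i|}\sum_{p \in P \cap H_i} x_p\Big| > \epsilon\,(2^i r + 2L)\Big] \;\leq\; 2\exp\Big(-\frac{2m\,\epsilon^2 (2^i r + 2L)^2}{(2^i r + 2L)^2}\Big) \;=\; 2e^{-2m\epsilon^2},$$

which is at most $\eta$ once $m \geq \frac{1}{2\epsilon^2}\ln\frac{2}{\eta} = O(\frac{1}{\epsilon^2}\log\frac{1}{\eta})$.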
Lemma 1 is only for a fixed solution $C$. To guarantee the result for any $C \in \mathcal{C}_L$, we discretize the range $\mathcal{C}_L$. Imagine building a grid inside each ball $\mathbb{B}(\tilde{c}_j, L)$ for $1 \leq j \leq k$, where the grid side length is $\frac{\epsilon}{\sqrt{d}} L$. Denote by $G_j$ the set of grid points inside $\mathbb{B}(\tilde{c}_j, L)$; then $G = G_1 \times G_2 \times \cdots \times G_k$ contains $O\big((\frac{2\sqrt{d}}{\epsilon})^{kd}\big)$ $k$-tuples of points of $\mathcal{C}_L$ in total. We increase the sample size of Lemma 1 by replacing $\eta$ with $\frac{\eta}{N|G|}$ in the sample size $O(\frac{1}{\epsilon^2}\log\frac{1}{\eta})$. As a consequence, by taking the union bound over the success probabilities, we have the following result.

**Lemma 2.** Let $S_i$ be the sample obtained from $P \cap H_i$ in Step 3 of Algorithm 1 for $0 \leq i \leq N$. Then with probability $1 - \eta$,

$$\Big|\frac{1}{|S_i|}\sum_{p \in S_i} x_p - \frac{1}{|P \cap H_i|}\sum_{p \in P \cap H_i} x_p\Big| \leq \epsilon\,(2^i r + 2L)$$

for each $i \in \{0, 1, \cdots, N\}$ and any $C \in G$.

Next, we show that the guarantee of Lemma 2 extends to any $C \in \mathcal{C}_L$ (in particular the solutions in $\mathcal{C}_L \setminus G$), up to a small additional error. For any solution $C = \{c_1, \cdots, c_k\} \in \mathcal{C}_L$, let $C' = \{c'_1, \cdots, c'_k\}$ be its nearest neighbor in $G$, i.e., $c'_j$ is the grid point of the cell containing $c_j$ in $G_j$ for $1 \leq j \leq k$. Also, denote by $x'_p$ the distance $\min_{1 \leq j \leq k} \|p - c'_j\|$. We then bound the error $\big|\frac{1}{|S_i|}\sum_{p \in S_i} x_p - \frac{1}{|P \cap H_i|}\sum_{p \in P \cap H_i} x_p\big|$ through $C'$. By the triangle inequality,

$$\Big|\frac{1}{|S_i|}\sum_{p \in S_i} x_p - \frac{1}{|P \cap H_i|}\sum_{p \in P \cap H_i} x_p\Big| \leq \underbrace{\Big|\frac{1}{|S_i|}\sum_{p \in S_i} x_p - \frac{1}{|S_i|}\sum_{p \in S_i} x'_p\Big|}_{(a)} + \underbrace{\Big|\frac{1}{|S_i|}\sum_{p \in S_i} x'_p - \frac{1}{|P \cap H_i|}\sum_{p \in P \cap H_i} x'_p\Big|}_{(b)} + \underbrace{\Big|\frac{1}{|P \cap H_i|}\sum_{p \in P \cap H_i} x'_p - \frac{1}{|P \cap H_i|}\sum_{p \in P \cap H_i} x_p\Big|}_{(c)}. \qquad (23)$$

In (23), the term (b) is bounded by Lemma 2 since $C' \in G$. To bound the terms (a) and (c), we study the difference $|x_p - x'_p|$ for each point $p$. Suppose the nearest neighbor of $p$ in $C$ (resp., $C'$) is $c_{j_1}$ (resp., $c'_{j_2}$). Then

$$x_p = \|p - c_{j_1}\| \leq \|p - c_{j_2}\| \leq \|p - c'_{j_2}\| + \|c'_{j_2} - c_{j_2}\| \leq x'_p + \epsilon L, \qquad (24)$$

where the last inequality comes from the fact that $c'_{j_2}$ and $c_{j_2}$ are in the same grid cell with side length $\frac{\epsilon}{\sqrt{d}} L$ (and hence diameter at most $\epsilon L$). Similarly, we have $x'_p \leq x_p + \epsilon L$. Overall, $|x_p - x'_p| \leq \epsilon L$. As a consequence, the terms (a) and (c) in (23) are both bounded by $\epsilon L$, and (23) becomes

$$\Big|\frac{1}{|S_i|}\sum_{p \in S_i} x_p - \frac{1}{|P \cap H_i|}\sum_{p \in P \cap H_i} x_p\Big| \leq O(\epsilon)\,(2^i r + L). \qquad (25)$$

For convenience, we use $P_H$ to denote the set $\cup_{i=0}^{N}(P \cap H_i)$.

**Lemma 3.** Let $S_0, S_1, \cdots, S_N$ be the samples obtained in Algorithm 1. Then, with probability $1 - \eta$,

$$\Big|\frac{1}{n-z}\Big(\sum_{p \in S_H} w(p)\,x_p - \sum_{p \in P_H} x_p\Big)\Big| \leq O(\epsilon)\,\big(\mathcal{K}^z_1(P, \tilde{C}) + L\big) \qquad (26)$$

for any $C \in \mathcal{C}_L$.

Proof. For convenience, let $Err_i = \big|\frac{1}{|S_i|}\sum_{p \in S_i} x_p - \frac{1}{|P \cap H_i|}\sum_{p \in P \cap H_i} x_p\big|$ for $0 \leq i \leq N$. We directly have $Err_i \leq O(\epsilon)(2^i r + L)$ from (25). Moreover, the left-hand side of (26) is at most

$$\frac{1}{n-z}\sum_{i=0}^{N} |P \cap H_i|\, Err_i \;\leq\; \frac{O(\epsilon)}{n-z}\sum_{i=0}^{N} |P \cap H_i|\,(2^i r + L) \;\leq\; O(\epsilon)\sum_{i=0}^{N} \frac{|P \cap H_i|}{n-z}\, 2^i r \;+\; O(\epsilon) L. \qquad (27)$$

It is easy to see that the term $\sum_{i=0}^{N}\frac{|P \cap H_i|}{n-z} 2^i r$ of (27) is at most $\frac{1}{n-z}\big(|P \cap H_0|\, r + 2\sum_{p \in P_H \setminus H_0}\mathrm{dist}(p, \tilde{C})\big) \leq r + 2\,\mathcal{K}^z_1(P, \tilde{C})$, since each $p \in P \cap H_i$ with $i \geq 1$ has distance larger than $2^{i-1} r$ to $\tilde{C}$ and $P_H$ contains only inliers with respect to $\tilde{C}$. Note that we set $N = \lceil\log\frac{1}{\gamma}\rceil$ in Algorithm 1. Together with (13), we know $r \leq \epsilon\,\mathcal{K}^z_1(P, \tilde{C})$, and thus $\sum_{i=0}^{N}\frac{|P \cap H_i|}{n-z} 2^i r \leq O(1)\,\mathcal{K}^z_1(P, \tilde{C})$. So (26) is true.
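The step $r \leq \epsilon\,\mathcal{K}^z_1(P, \tilde{C})$ used in the proof above is a routine calculation, spelled out here for completeness:

$$r = \frac{2^N r}{2^N} \;\leq\; \gamma \cdot 2^N r \;\overset{(13)}{\leq}\; \frac{z}{n-z}\cdot\frac{\epsilon}{z}(n-z)\,\mathcal{K}^z_1(P, \tilde{C}) \;=\; \epsilon\,\mathcal{K}^z_1(P, \tilde{C}),$$

where the first inequality uses $2^N \geq \frac{1}{\gamma}$, guaranteed by the choice $N = \lceil \log \frac{1}{\gamma} \rceil$ in Algorithm 1.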
Below, we always assume that (26) is true and consider proving (20) of Theorem 1. The set $P$ is partitioned into two parts, $P^C_{in}$ and $P^C_{out}$, by $C$, where $P^C_{out}$ is the set of the $z$ farthest points to $C$ (i.e., the outliers) and $P^C_{in} = P \setminus P^C_{out}$. Similarly, the coreset $S$ is partitioned into two parts $S^C_{in}$ and $S^C_{out}$ by $C$, where $S^C_{out}$ is the set of outliers with total weight $z$. In other words, we need to prove

$$\Big|\frac{1}{n-z}\Big(\sum_{p \in S^C_{in}} w(p)\,x_p - \sum_{p \in P^C_{in}} x_p\Big)\Big| \leq \epsilon\,\big(\mathcal{K}^z_1(P, \tilde{C}) + L\big). \qquad (28)$$

We consider two cases: (i) $P_H \setminus P^C_{in} = \emptyset$ and (ii) $P_H \setminus P^C_{in} \neq \emptyset$. Intuitively, case (i) indicates that the set $P^C_{in}$ occupies the whole region $\cup_{i=0}^{N} H_i$, while case (ii) indicates that the region $\cup_{i=0}^{N} H_i$ contains some outliers from $P^C_{out}$. In the following subsections, we prove that (20) holds in both cases. For ease of presentation, we use $w(U)$ to denote the total weight of a weighted point set $U$ (not to be confused with $|U|$, the number of points in $U$).

### 3.1. Case (i): $P_H \setminus P^C_{in} = \emptyset$

We prove the following key lemma first.

**Lemma 4.** If $P_H \setminus P^C_{in} = \emptyset$, then $S^C_{in} = S_H \cup (P^C_{in} \setminus P_H)$ and $S^C_{out} = P^C_{out}$ (recall $S_H = \cup_{i=0}^{N} S_i$ from Algorithm 1).

Proof. First, the assumption $P_H \setminus P^C_{in} = \emptyset$ implies

$$P_H \subseteq P^C_{in}; \qquad (29)$$

$$|P^C_{in} \setminus P_H| = |P^C_{in}| - |P_H|. \qquad (30)$$

In addition, since $S_H \subseteq P_H$, we have $S_H \subseteq P^C_{in}$ from (29). Consequently, the set $S_H \cup (P^C_{in} \setminus P_H) \subseteq P^C_{in}$. Therefore, for any $p \in S_H \cup (P^C_{in} \setminus P_H)$ and any $q \in P^C_{out}$, we have $x_p \leq x_q$. Moreover, the set

$$S \setminus \big(S_H \cup (P^C_{in} \setminus P_H)\big) = \big(S_H \cup (P \setminus P_H)\big) \setminus \big(S_H \cup (P^C_{in} \setminus P_H)\big) = (P \setminus P_H) \setminus (P^C_{in} \setminus P_H) \;\underset{\text{by (29)}}{=}\; P \setminus P^C_{in} = P^C_{out}. \qquad (31)$$

Note that $|P^C_{out}| = z$. As a consequence, $S^C_{in}$ should be exactly the set $S_H \cup (P^C_{in} \setminus P_H)$, and $S^C_{out} = P^C_{out}$.

**Lemma 5.** If $P_H \setminus P^C_{in} = \emptyset$, then (20) is true.

Proof. Because the set $S^C_{in}$ is equal to $S_H \cup (P^C_{in} \setminus P_H)$ by Lemma 4, the objective value

$$\mathcal{K}^z_1(S, C) = \frac{1}{n-z}\Big(\sum_{p \in S_H} w(p)\,x_p + \sum_{p \in P^C_{in} \setminus P_H} x_p\Big) = \frac{1}{n-z}\Big(\sum_{i=0}^{N}\frac{|P \cap H_i|}{|S_i|}\sum_{p \in S_i} x_p + \sum_{p \in P^C_{in} \setminus P_H} x_p\Big). \qquad (32)$$

From Lemma 3, the value of (32) is no larger than

$$\frac{1}{n-z}\Big(\sum_{p \in P_H} x_p + O(\epsilon)(n-z)\big(\mathcal{K}^z_1(P, \tilde{C}) + L\big) + \sum_{p \in P^C_{in} \setminus P_H} x_p\Big). \qquad (33)$$

Note that $P_H \setminus P^C_{in} = \emptyset$, and thus the sum of the two terms $\sum_{p \in P_H} x_p$ and $\sum_{p \in P^C_{in} \setminus P_H} x_p$ in (33) is $\sum_{p \in P^C_{in}} x_p$. Therefore,

$$\mathcal{K}^z_1(S, C) \leq \frac{1}{n-z}\sum_{p \in P^C_{in}} x_p + O(\epsilon)\big(\mathcal{K}^z_1(P, \tilde{C}) + L\big) = \mathcal{K}^z_1(P, C) + O(\epsilon)\big(\mathcal{K}^z_1(P, \tilde{C}) + L\big). \qquad (34)$$

Similarly, we have $\mathcal{K}^z_1(S, C) \geq \mathcal{K}^z_1(P, C) - O(\epsilon)\big(\mathcal{K}^z_1(P, \tilde{C}) + L\big)$. Thus, (20) is true.

### 3.2. Case (ii): $P_H \setminus P^C_{in} \neq \emptyset$

Since $S \setminus P_H = P \setminus P_H$ are the outermost $(1 + 1/\epsilon)z$ points to $\tilde{C}$, we first have the following claim.

**Claim 1.** Either $S^C_{in} \setminus P_H \subseteq P^C_{in} \setminus P_H$ or $P^C_{in} \setminus P_H \subseteq S^C_{in} \setminus P_H$ is true.

**Lemma 6.** If $P_H \setminus P^C_{in} \neq \emptyset$, we have $x_p \leq 2^N r + L$ for any $p \in S^C_{in} \cup P^C_{in} \cup P_H$.

Proof. We consider the points in the three parts $P_H$, $P^C_{in}$, and $S^C_{in}$ separately.
(1) Due to (21), we have $x_p \leq 2^N r + L$ for any $p \in P_H$.
(2) Arbitrarily select one point $p_0$ from $P_H \setminus P^C_{in}$. By (21) again, we have $x_{p_0} \leq 2^N r + L$. Also, because $P_H \setminus P^C_{in} \subseteq P^C_{out}$, we directly have $x_p \leq x_{p_0}$ for any $p \in P^C_{in}$. Namely, $x_p \leq 2^N r + L$ for any $p \in P^C_{in}$.
(3) Finally, we consider the points in $S^C_{in}$. If $w(S^C_{in} \cap P_H) > |P^C_{in} \cap P_H|$, i.e., $P_H$ contains more inliers of $S$ than of $P$, then the outer region $H_{out}$ must contain fewer inliers of $S$ than of $P$. Thus, from Claim 1, we have $S^C_{in} \setminus P_H \subseteq P^C_{in} \setminus P_H$. Hence $S^C_{in} = (S^C_{in} \setminus P_H) \cup (S^C_{in} \cap P_H) \subseteq (P^C_{in} \setminus P_H) \cup P_H = P^C_{in} \cup P_H$; from (1) and (2), we know $x_p \leq 2^N r + L$ for any $p \in S^C_{in}$.
Else, $w(S^C_{in} \cap P_H) \leq |P^C_{in} \cap P_H|$. Then $w(S^C_{in} \cap S_H) \leq |P^C_{in} \cap P_H|$ since $S^C_{in} \cap P_H = S^C_{in} \cap S_H$. Because $w(S_H) = |P_H|$, we have

$$w(S_H \setminus S^C_{in}) \geq |P_H \setminus P^C_{in}|. \qquad (35)$$

Also, the assumption $P_H \setminus P^C_{in} \neq \emptyset$ implies

$$w(S_H \setminus S^C_{in}) \geq |P_H \setminus P^C_{in}| > 0, \quad \text{i.e.,}\ S_H \setminus S^C_{in} \neq \emptyset. \qquad (36)$$

Arbitrarily select one point $p_0$ from $S_H \setminus S^C_{in}$. We know $x_{p_0} \leq 2^N r + L$ since $p_0 \in S_H \setminus S^C_{in} \subseteq P_H$. Also, for any point $p \in S^C_{in}$, we have $x_p \leq x_{p_0}$ because $p_0 \in S_H \setminus S^C_{in} \subseteq S^C_{out}$. Therefore $x_p \leq 2^N r + L$.

**Lemma 7.** If $P_H \setminus P^C_{in} \neq \emptyset$, then (20) is true.

Proof. We prove the upper bound of $\mathcal{K}^z_1(S, C)$ first, analyzing the clustering costs of the two parts $S^C_{in} \cap S_H$ and $S^C_{in} \setminus S_H$ separately:

$$\mathcal{K}^z_1(S, C) = \frac{1}{n-z}\Big(\underbrace{\sum_{p \in S^C_{in} \cap S_H} w(p)\,x_p}_{(a)} + \underbrace{\sum_{p \in S^C_{in} \setminus S_H} x_p}_{(b)}\Big).$$

Note that the points of $S^C_{in} \setminus S_H$ have unit weight (since $S^C_{in} \setminus S_H \subseteq P \setminus P_H$ consists of points from the outermost $(1 + \frac{1}{\epsilon})z$ points of $P$). Obviously, the part (a) is no larger than

$$\sum_{p \in S_H} w(p)\,x_p \leq \sum_{p \in P_H} x_p + O(\epsilon)(n-z)\big(\mathcal{K}^z_1(P, \tilde{C}) + L\big) \qquad (37)$$

from Lemma 3. The set $P_H$ consists of two parts, $P_H \cap P^C_{in}$ and $P_H \setminus P^C_{in}$. From Lemma 6 and the fact $|P_H \setminus P^C_{in}| \leq |P^C_{out}| = z$, we know $\sum_{p \in P_H \setminus P^C_{in}} x_p \leq z(2^N r + L)$. Thus, the upper bound of part (a) becomes

$$\sum_{p \in P_H \cap P^C_{in}} x_p + z(2^N r + L) + O(\epsilon)(n-z)\big(\mathcal{K}^z_1(P, \tilde{C}) + L\big). \qquad (38)$$

To bound part (b), we consider the size $|S^C_{in} \setminus S_H|$. Since the total weight of the outliers is $z$,

$$w(S_H \cap S^C_{in}) = w(S_H) - w(S_H \cap S^C_{out}) \geq |P_H| - z \geq |P_H \cap P^C_{in}| - z. \qquad (39)$$

Together with the fact $w(S_H \cap S^C_{in}) + |S^C_{in} \setminus S_H| = |P_H \cap P^C_{in}| + |P^C_{in} \setminus P_H| = n - z$, we have

$$|S^C_{in} \setminus S_H| \leq |P^C_{in} \setminus P_H| + z. \qquad (40)$$
Therefore, $\big|(S^C_{in} \setminus S_H) \setminus (P^C_{in} \setminus P_H)\big| \leq z$ from Claim 1 and (40). Through Lemma 6 again, we know that the part (b) is no larger than

$$\sum_{p \in P^C_{in} \setminus P_H} x_p + \big|(S^C_{in} \setminus S_H) \setminus (P^C_{in} \setminus P_H)\big|\,(2^N r + L) \leq \sum_{p \in P^C_{in} \setminus P_H} x_p + z(2^N r + L). \qquad (41)$$

Putting (38) and (41) together, we have

$$\mathcal{K}^z_1(S, C) \leq \mathcal{K}^z_1(P, C) + O(\epsilon)\big(\mathcal{K}^z_1(P, \tilde{C}) + L\big) + \frac{2z}{n-z}(2^N r + L). \qquad (42)$$

Recall that $\gamma = z/(n-z)$ in Algorithm 1. If $\gamma \geq \epsilon$, the size of our coreset $S$ is at least $(1 + 1/\epsilon)z \geq n$; that is, $S$ contains all the points of $P$. For the other case $\epsilon > \gamma$, together with (13), the term $\frac{2z}{n-z}(2^N r + L)$ in (42) is at most

$$O(\epsilon)\big(\mathcal{K}^z_1(P, \tilde{C}) + L\big). \qquad (43)$$

Overall, $\mathcal{K}^z_1(S, C) \leq \mathcal{K}^z_1(P, C) + O(\epsilon)\big(\mathcal{K}^z_1(P, \tilde{C}) + L\big)$ via (42). This completes the proof of the upper bound.

Now we consider the lower bound of $\mathcal{K}^z_1(S, C)$. Denote $X = S_H \cup (P^C_{in} \setminus P_H)$ and $Y = X \setminus S^C_{in}$. Obviously,

$$\mathcal{K}^z_1(S, C) \geq \frac{1}{n-z}\Big(\underbrace{\sum_{p \in X} w(p)\,x_p}_{(c)} - \underbrace{\sum_{p \in Y} w(p)\,x_p}_{(d)}\Big).$$

From Lemma 3, the part (c) is at least

$$\sum_{p \in P_H} x_p - O(\epsilon)(n-z)\big(\mathcal{K}^z_1(P, \tilde{C}) + L\big) + \sum_{p \in P^C_{in} \setminus P_H} x_p \;\geq\; \sum_{p \in P^C_{in}} x_p - O(\epsilon)(n-z)\big(\mathcal{K}^z_1(P, \tilde{C}) + L\big). \qquad (44)$$

Further, since $w(Y) \leq z$ and $Y \subseteq X \subseteq P^C_{in} \cup P_H$, the part (d) is no larger than $z(2^N r + L)$ from Lemma 6. In the same manner as for the upper bound, we obtain $\mathcal{K}^z_1(S, C) \geq \mathcal{K}^z_1(P, C) - O(\epsilon)\big(\mathcal{K}^z_1(P, \tilde{C}) + L\big)$.

## 4. Linear Regression with Outliers

In this section, we consider the problem of linear regression with outliers. Our algorithm (Algorithm 2) and analysis are for the objective function $\mathcal{LR}^z_1$, and the ideas can be extended to handle the objective function $\mathcal{LR}^z_2$.

Algorithm 2 (LAYERED SAMPLING FOR LIN-1-OUTLIER)
Input: An instance $P \subset \mathbb{R}^d$ of linear regression with $z$ outliers, a solution $\tilde{h} = (\tilde{h}_1, \cdots, \tilde{h}_d)$, and two parameters $\epsilon, \eta \in (0, 1)$.
1. Let $\gamma = z/(n-z)$ and $N = \lceil \log \frac{1}{\gamma} \rceil$. Compute the value $r$ satisfying (18).
2. As described in (15), (16), and (17), the space is partitioned into $N+2$ layers $H_0, H_1, \cdots, H_N$ and $H_{out}$.
3. Randomly sample $\min\{\tilde{O}(\frac{1}{\epsilon^2} d),\ |P \cap H_i|\}$ points, denoted by $S_i$, from $P \cap H_i$ for $0 \leq i \leq N$.
4. For each point $p \in S_i$, set its weight to be $|P \cap H_i| / |S_i|$; let $S_H = \cup_{i=0}^{N} S_i$.
Output $S = S_H \cup (P \cap H_{out})$.

**Theorem 2.** Algorithm 2 returns a point set $S$ of size $|S| = \tilde{O}(\frac{1}{\epsilon^2} d) + (1 + \frac{1}{\epsilon})z$. Moreover, with probability at least $1 - \eta$, for any $L > 0$ and any solution $h \in \mathcal{H}_L$, we have

$$\big|\mathcal{LR}^z_1(S, h) - \mathcal{LR}^z_1(P, h)\big| \leq \epsilon\big(\mathcal{LR}^z_1(P, \tilde{h}) + L\big). \qquad (45)$$

Here, $S$ is a weighted instance of linear regression with outliers, and the total weight of the outliers is $z$ (see Remark 1).

We still use $P_H$ to denote the set $\cup_{i=0}^{N}(P \cap H_i)$. First, we need to prove that $S_H$ is a good approximation of $P_H$. Given a hyperplane $h$, we define a random variable $x_p = |Res(p, h)|$ for each $p \in P$. If $p \in H_i$ for $0 \leq i \leq N$, then, similar to (21) and (22), we have the following bounds for $x_p$: $x_p \leq 2^i r + L$; $x_p \geq \max\{2^{i-1} r - L, 0\}$ if $i \geq 1$, and $x_p \geq 0$ if $i = 0$. We can then apply the same idea as in Lemma 3 to obtain the following lemma, where the only difference is the discretization of $\mathcal{H}_L$. Recall that $\tilde{h}$ is defined by the coefficients $\tilde{h}_1, \cdots, \tilde{h}_d$ and the input set $P$ is normalized within the region $\mathcal{R}_D$. We build a grid inside each vertical segment $\overline{l_j u_j}$ for $0 \leq j \leq d-1$, where $l_0 = (0, \cdots, 0, \tilde{h}_d - L)$, $u_0 = (0, \cdots, 0, \tilde{h}_d + L)$, and

$$l_j = (0, \cdots, 0, \underbrace{D}_{j\text{-th}}, 0, \cdots, 0, \tilde{h}_j D + \tilde{h}_d - L), \qquad (46)$$

$$u_j = (0, \cdots, 0, \underbrace{D}_{j\text{-th}}, 0, \cdots, 0, \tilde{h}_j D + \tilde{h}_d + L) \qquad (47)$$

for $j \neq 0$; the grid length is $\frac{\epsilon}{2d} L$. Denote by $G_j$ the set of grid points inside the segment $\overline{l_j u_j}$. Obviously, $G = G_0 \times G_1 \times \cdots \times G_{d-1}$ contains $(\frac{4d}{\epsilon})^d$ $d$-tuples of points, and each tuple determines a $(d-1)$-dimensional hyperplane in $\mathcal{H}_L$; moreover, we have the following claim.

**Claim 2.** For each $h \in \mathcal{H}_L$, there exists a hyperplane $h'$ determined by a $d$-tuple of points from $G$ such that $|Res(p, h) - Res(p, h')| \leq \epsilon L$ for any $p \in P$.
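Before stating the sampling lemma for regression, here is a minimal sketch of the layer construction used by Algorithm 2. It mirrors the earlier clustering sketch, with absolute residuals to the initial hyperplane in place of distances to the cluster centers; as before, the per-layer sample size `m` is a parameter and the function name is ours.

```python
import numpy as np

def layered_sampling_regression_outlier(P, h_init, z, eps, m, rng=np.random.default_rng(0)):
    """Layered sampling coreset for linear regression with z outliers (sketch of Algorithm 2).

    Layers are residual slabs S(h_init, 2^i r) around the initial hyperplane,
    instead of the balls around cluster centers used in Algorithm 1.
    """
    P = np.asarray(P, dtype=float)
    n = len(P)
    X, y = P[:, :-1], P[:, -1]
    res = np.abs(y - X @ h_init[:-1] - h_init[-1])           # |Res(p, h_init)|
    gamma = z / (n - z)
    N = int(np.ceil(np.log2(1.0 / gamma)))
    n_out = int(np.ceil((1 + 1.0 / eps) * z))
    if n_out >= n:
        return P.copy(), np.ones(n)                           # coreset degenerates to P itself
    order = np.argsort(res)
    inner, outer = order[: n - n_out], order[n - n_out:]      # layers H_0..H_N vs. H_out
    r = max(res[inner].max(), 1e-12) / (2 ** N)               # so that 2^N r covers the inner layers
    layer = np.clip(np.ceil(np.log2(np.maximum(res[inner], r) / r)), 0, N).astype(int)
    pts, wts = [P[outer]], [np.ones(n_out)]                   # H_out is kept entirely, weight 1
    for i in range(N + 1):
        idx = inner[layer == i]
        if len(idx) == 0:
            continue
        take = min(m, len(idx))
        chosen = rng.choice(idx, size=take, replace=False)
        pts.append(P[chosen])
        wts.append(np.full(take, len(idx) / take))            # weight |P ∩ H_i| / |S_i|
    return np.vstack(pts), np.concatenate(wts)
```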
**Lemma 8.** Let $S_0, S_1, \cdots, S_N$ be the samples obtained in Algorithm 2. Then, with probability $1 - \eta$,

$$\Big|\frac{1}{n-z}\Big(\sum_{p \in S_H} w(p)\,x_p - \sum_{p \in P_H} x_p\Big)\Big| \leq O(\epsilon)\big(\mathcal{LR}^z_1(P, \tilde{h}) + L\big) \qquad (48)$$

for any $h \in \mathcal{H}_L$.

We fix a solution $h \in \mathcal{H}_L$. Similar to the proof of Theorem 1 in Section 3, we consider the two parts $P^h_{in}$ and $P^h_{out}$ of $P$ partitioned by $h$, where $P^h_{out}$ is the set of the $z$ farthest points to the hyperplane $h$ (i.e., the outliers) and $P^h_{in} = P \setminus P^h_{out}$. Similarly, $S$ is partitioned into two parts $S^h_{in}$ and $S^h_{out}$ by $h$, where $S^h_{out}$ is the set of outliers with total weight $z$. For case (i) $P_H \setminus P^h_{in} = \emptyset$ and case (ii) $P_H \setminus P^h_{in} \neq \emptyset$, we can apply almost identical ideas as in Sections 3.1 and 3.2, respectively, to prove (45).

## 5. Conclusion

To reduce the time complexities of existing algorithms for clustering and linear regression with outliers, we propose a new variant of the coreset method which can guarantee the quality for any solution in a local range surrounding the given initial solution. Due to the space limit, we leave the complete experimental results to our supplement. In the future, it is worth considering applying our framework to a broader range of robust optimization problems, such as logistic regression with outliers and Gaussian mixture models with outliers.

## A. Proof of Claim 1

Since $S^C_{in}$ is the set of inliers of $S$ with respect to $C$, there must exist some value $r_S > 0$ such that

$$S^C_{in} = \{p \mid p \in S,\ \min_{1 \leq j \leq k}\|p - c_j\| \leq r_S\}. \qquad (49)$$

And therefore

$$S^C_{in} \setminus P_H = \{p \mid p \in S \setminus P_H,\ \min_{1 \leq j \leq k}\|p - c_j\| \leq r_S\}. \qquad (50)$$

Similarly, there exists some value $r_P > 0$ such that

$$P^C_{in} \setminus P_H = \{p \mid p \in P \setminus P_H,\ \min_{1 \leq j \leq k}\|p - c_j\| \leq r_P\}. \qquad (51)$$

Note that $S \setminus P_H = P \setminus P_H$. So, if $r_S \leq r_P$, we have $S^C_{in} \setminus P_H \subseteq P^C_{in} \setminus P_H$; otherwise, $P^C_{in} \setminus P_H \subseteq S^C_{in} \setminus P_H$.

## B. Proof of Claim 2

Let $h = (h_1, \cdots, h_d)$, and suppose $h' = (h'_1, \cdots, h'_d)$ is $h$'s nearest neighbor in $G$, i.e., $|h'_d - h_d| \leq \frac{\epsilon}{2d}L$ and $|D h'_j + h'_d - D h_j - h_d| \leq \frac{\epsilon}{2d}L$ for $1 \leq j \leq d-1$. Then

$$|h'_j - h_j| \leq \frac{1}{D}\Big(\frac{\epsilon}{2d}L + |h'_d - h_d|\Big) \leq \frac{\epsilon}{D d}L \qquad (52)$$

for $1 \leq j \leq d-1$. For any $p = (x_1, \cdots, x_d) \in \mathcal{R}_D$,

$$|Res(p, h) - Res(p, h')| \leq \sum_{j=1}^{d-1}|h'_j - h_j|\,|x_j| + |h'_d - h_d| \leq \sum_{j=1}^{d-1}|h'_j - h_j|\, D + |h'_d - h_d| \leq (d-1)\frac{\epsilon}{d}L + \frac{\epsilon}{2d}L \leq \epsilon L.$$

## References

Arthur, D. and Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics, 2007.

Awasthi, P. and Balcan, M.-F. Center based clustering: A foundational perspective. 2014.

Bachem, O., Lucic, M., and Krause, A. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.

Bhaskara, A., Vadgama, S., and Xu, H. Greedy sampling for approximate clustering in the presence of outliers. In Advances in Neural Information Processing Systems, pp. 11146–11155, 2019.

Bhatia, K., Jain, P., and Kar, P. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems 28 (NeurIPS 2015), Montreal, Canada, pp. 721–729, 2015.

Blum, M., Floyd, R. W., Pratt, V., Rivest, R. L., and Tarjan, R. E. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448–461, 1973.

Boutsidis, C., Drineas, P., and Magdon-Ismail, M. Near-optimal coresets for least-squares regression. IEEE Transactions on Information Theory, 59(10):6880–6892, 2013. doi: 10.1109/TIT.2013.2272457.

Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.

Chawla, S. and Gionis, A. k-means--: A unified approach to clustering and outlier detection.
In Proceedings of the 2013 SIAM International Conference on Data Mining, pp. 189–197. SIAM, 2013.

Chen, J., Azer, E. S., and Zhang, Q. A practical algorithm for distributed clustering and outlier detection. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, Canada, pp. 2253–2262, 2018.

Chen, K. A constant factor approximation algorithm for k-median clustering with outliers. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 826–835. Society for Industrial and Applied Mathematics, 2008.

Chen, K. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, 2009.

Dasgupta, A., Drineas, P., Harb, B., Kumar, R., and Mahoney, M. W. Sampling algorithms and coresets for ℓ_p regression. SIAM Journal on Computing, 38(5):2060–2078, 2009.

Ding, H. and Xu, J. Sub-linear time hybrid approximations for least trimmed squares estimator and related problems. In 30th Annual Symposium on Computational Geometry (SoCG 2014), Kyoto, Japan, June 8–11, 2014, pp. 110, 2014. doi: 10.1145/2582112.2582131.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Sampling algorithms for ℓ_2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1127–1136. Society for Industrial and Applied Mathematics, 2006.

Erickson, J. and Seidel, R. Better lower bounds on detecting affine and spherical degeneracies. Discrete & Computational Geometry, 13:41–57, 1995. doi: 10.1007/BF02574027.

Feldman, D. Core-sets: An updated survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(1), 2020.

Feldman, D. and Langberg, M. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC 2011), San Jose, CA, USA, pp. 569–578, 2011.

Feldman, D., Schmidt, M., and Sohler, C. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2013), New Orleans, Louisiana, USA, pp. 1434–1453, 2013.

Fichtenberger, H., Gillé, M., Schmidt, M., Schwiegelshohn, C., and Sohler, C. BICO: BIRCH meets coresets for k-means clustering. In European Symposium on Algorithms, pp. 481–492. Springer, 2013.

Friggstad, Z., Khodamoradi, K., Rezapour, M., and Salavatipour, M. R. Approximation schemes for clustering with outliers. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 398–414. SIAM, 2018.

Goodfellow, I. J., McDaniel, P. D., and Papernot, N. Making machine learning robust against adversarial inputs. Communications of the ACM, 61(7):56–66, 2018.

Gupta, S. Approximation algorithms for clustering and facility location problems. PhD thesis, University of Illinois at Urbana-Champaign, 2018.

Gupta, S., Kumar, R., Lu, K., Moseley, B., and Vassilvitskii, S. Local search methods for k-means with outliers. Proceedings of the VLDB Endowment, 10(7):757–768, 2017.

Har-Peled, S. and Kushal, A. Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry, 37(1):3–19, 2007.

Hawkins, D. M. The feasible solution algorithm for least trimmed squares regression. Computational Statistics and Data Analysis, 17, 1994. doi: 10.1016/0167-9473(92)00070-8.
Huang, L., Jiang, S., Li, J., and Wu, X. Epsilon-coresets for clustering (with outliers) in doubling metrics. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 814–825. IEEE, 2018.

Huggins, J., Campbell, T., and Broderick, T. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, pp. 4080–4088, 2016.

Karnin, Z. S. and Liberty, E. Discrepancy, coresets, and sketches in machine learning. CoRR, abs/1906.04845, 2019.

Klivans, A. R., Kothari, P. K., and Meka, R. Efficient algorithms for outlier-robust regression. In Conference On Learning Theory (COLT 2018), Stockholm, Sweden, pp. 1420–1430, 2018.

Krishnaswamy, R., Li, S., and Sandeep, S. Constant approximation for k-median and k-means with outliers via iterative rounding. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 646–659. ACM, 2018.

Langberg, M. and Schulman, L. J. Universal ε-approximators for integrals. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 598–607. SIAM, 2010.

Li, Y., Long, P. M., and Srinivasan, A. Improved bounds on the sample complexity of learning. Journal of Computer and System Sciences, 62(3):516–527, 2001.

Lucic, M., Faulkner, M., Krause, A., and Feldman, D. Training Gaussian mixture models at scale via coresets. The Journal of Machine Learning Research, 18(1):5885–5909, 2017.

Mettu, R. R. and Plaxton, C. G. Optimal time bounds for approximate clustering. Machine Learning, 56(1–3):35–60, 2004.

Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., and Wu, A. Y. On the least trimmed squares estimator. Algorithmica, 69(1):148–183, 2014. doi: 10.1007/s00453-012-9721-8.

Mount, D. M., Netanyahu, N. S., Piatko, C. D., Wu, A. Y., and Silverman, R. A practical approximation algorithm for the LTS estimator. Computational Statistics and Data Analysis, 99:148–170, 2016. doi: 10.1016/j.csda.2016.01.016.

Munteanu, A., Schwiegelshohn, C., Sohler, C., and Woodruff, D. On coresets for logistic regression. In Advances in Neural Information Processing Systems, pp. 6561–6570, 2018.

Ott, L., Pang, L., Ramos, F. T., and Chawla, S. On integrated clustering and outlier detection. In Advances in Neural Information Processing Systems, pp. 1359–1367, 2014.

Phillips, J. M. Coresets and sketches. Computing Research Repository, 2016.

Rousseeuw, P. and van Driessen, K. Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12(1):29–45, 2006. doi: 10.1007/s10618-005-0024-4.

Rousseeuw, P. J. Least median of squares regression. Journal of the American Statistical Association, 79, 1984. doi: 10.1080/01621459.1984.10477105.

Shen, Y. and Sanghavi, S. Iterative least trimmed squares for mixed linear regression. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, pp. 6076–6086, 2019.