# robust_gradientbased_markov_subsampling__c94eaeac.pdf The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20) Robust Gradient-Based Markov Subsampling Tieliang Gong, Quanhan Xi, Chen Xu Deparment of Mathematics and Statistics, University of Ottawa, Ottawa, ON, K1N6N5, Canada Subsampling is a widely used and effective method to deal with the challenges brought by big data. Most subsampling procedures are designed based on the importance sampling framework, where samples with high importance measures are given corresponding sampling probabilities. However, in the highly noisy case, these samples may cause an unstable estimator which could lead to a misleading result. To tackle this issue, we propose a gradient-based Markov subsampling (GMS) algorithm to achieve robust estimation. The core idea is to construct a subset which allows us to conservatively correct a crude initial estimate towards the true signal. Specifically, GMS selects samples with small gradients via a probabilistic procedure, constructing a subset that is likely to exclude noisy samples and provide a safe improvement over the initial estimate. We show that the GMS estimator is statistically consistent at a rate which matches the optimal in the minimax sense. The promising performance of GMS is supported by simulation studies and real data examples. Introduction The rapid development of science and technology in the past decade have introduced data of extraordinary size and complexity, which brings great challenges to conventional machine learning and statistical methods. A popular way to deal with big data is the divide and conquer strategy, which involves partitioning the data, running a particular learning algorithm on data segments in parallel, and then aggregating a global output by combining these individual parallel outputs. To this end, distributed computing platforms like Spark (Zaharia et al. 2010) and Ray (Moritz and et al 2018) have been developed. As a computationally cheaper alternative, subsampling techniques have also drawn a great deal of attention for processing big data (Fithian and Hastie 2014; Halko, Martinsson, and Tropp 2011; Derezi nski, Warmuth, and Hsu 2018a; Derezinski, Warmuth, and Hsu 2018b). These methods aim to select a representative subset from the full data for downstreaming learning tasks. The computational burden is then greatly alleviated as the selected subset is quite small. Copyright c 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Intuitively, uniform sampling is perhaps the most straightforward way to conduct subsampling. However, uniform sampling can be inefficient and unstable for data with high noise level. Therefore, we usually resort to informative subsampling, where important observations are given a higher chance to be selected. One representative informative subsampling procedure is leverage score sampling (Drineas et al. 2011; 2012; Drineas, Mahoney, and Muthukrishnan 2008; Rudi et al. 2018), which assigns sampling probability proportional to a distance measure within the covariates. Another informative subsampling procedure is given in (Zhu 2016), where the sampling probabilities are computed proportional to the quadratic loss gradient using a pilot estimator. Gradient-based subsampling (GS) notably uses information derived from the input data as well as the response. Following in this spirit, the work in (Ting and Brochu 2018) proposes an influence function as a information measure to calculate sampling probabilities, which is shown to lead to an asymptotically optimal sampling distribution in the sense of minimum variance for the resulting linear estimator. These informative subsampling methods usually boil down to solving a weighted least squares problem which is sensitive to unbalanced sampling probabilities. Consequently, the resulting estimators can be less precise in applications (Ma, Mahoney, and Yu 2015). Moreover, these methods assign higher probabilities to samples that are highly influential to the estimator. However, many of these samples can be noisy or outliers when the noise level is high, and these estimators would result in a misleading conclusion. In addition, the accuracy of GS and influence sampling depends on a reliable initial estimator which would be difficult to obtain in the noisy setting. In this paper, we develop a robust gradient-based Markov subsampling (GMS) method for linear regression. The procedure is as follows: We first obtain a crude estimator β0 based on a simple pilot selection. We later show that by selecting samples with small loss gradients, it is possible to construct a subset DS on which the empirical loss at β0 serves as a good approximation of the empirical loss at the oracle β . Since β0 is a rough estimator to β and quadratic loss is convex, the least square estimator ˆβ on DS is a safe improvement from β0 towards β (See Fig. 1 for an illustration). The pro- Figure 1: A random sample n = 200 is generated from yi = xiβ i + εi where xi N(0, 4) , εi U( 2, 2). The red line denotes the quadratic loss on full data, and the blue line denotes the loss function on a subset DS generated by the GMS procedure. posed GMS constructs such a DS by selecting samples of small gradient value through a Metropolis-Hastings (MH) type procedure. By doing so, samples with large gradient value are unlikely to be selected and outliers are avoided with high probability. We show that under mild conditions, the GMS estimator is statistically consistent with a rate at order O( d log(d)/nsub), where nsub denotes the subsample size and d is the number of regression covariates. The superior performance of GMS is also supported by simulation studies and real-world examples. The rest of this paper is organized as follows: Section 2 sets the notations and problem statement. Section 3 introduces the proposed GMS algorithm. Section 4 establishes the corresponding error bound. Section 5 presents experimental results on both simulations and real-world dataset. Section 6 concludes our work. Notations and Preliminaries To make our arguments in the following sections precise, some concepts and notations are introduced. Definition 1 (Vershynin 2018) A random variable X R is said to be sub-Gaussian with variance proxy σ2 if E[X] = 0 and its moment generating function satisfies E[exp(s X)] exp s2σ2 We denote a sub-Gaussian random variable as X sub G(σ2). More generally, a random vector X Rd is said to be sub-Gaussian with variance proxy σ2 if E[X] = 0 and u X is sub-Gaussian with variance proxy σ2 for any unit vector u Sd 1. In this case we write X sub Gd(σ2). In this paper, we consider a dataset D = (X, y) generated according to the linear model y = Xβ + ε, (1) where X = [x1, x2, , xn] Rn d is a design matrix, y = (y1, y2, , yn) Rn is a response vector , ε = (ε1, ε2, εn) and εi sub G(σ2) for i = 1, 2, , n. Denote the quadratic loss on D by L(β) = n i=1 ℓi(β), where ℓi(β) = (yi xiβ)2. We focus on the setting n d, where the least squares solution to model (1) is given by nΣ 1 n X y, (2) where Σn = X X/n denotes the empirical covariance matrix. Classic algorithms, including Cholesky decomposition, QR decomposition and Singular Value Decomposition compute βn in O(nd2) time. For a matrix M Rd d, we denote its maximal and minimal eigenvalues by λmax(M) and λmin(M). For a vector u Rd, we denote its ℓ2 norm by u . The following concepts play an important role in our theoretical analysis. Let {Xi}i 1 be a Markov chain on a general space X with invariant probability distribution π. Let P(x, dy) be a Markov transition kernel on a general space (X, B(X)) and P be its adjoint. Denote L2(π) by the Hilbert space consisting of square integrable functions with respect to π. For any function h : X R, we write π(h) := h(x)π(dx). Denote the additive reversiblization by R = (P + P )/2. Let P t(x, dy), (t N) be the t-step Markov transition kernel corresponding to P, then for i N, x X and a measurable set S, P t(x, S) = Pr(Xt+i S|Xi = x). Following the above notations, we introduce the definitions of ergodicity and spectral gap for a Markov chain. Definition 2 Let M(x) be a non-negative function. For an initial probability measure ρ( ) on B(X), a Markov chain is uniformly ergodic if P t(ρ, ) π( ) T V M(x)tn (3) for some M(x) < and t < 1, where T V denotes total variation norm. A Markov chain is geometrically ergodic if (3) holds for some t < 1, which eliminates the bounded assumption on M(x). Definition 3 (Absolute spectral gap) A Markov operator P admits an absolute spectral gap 1 λ if λ := sup { Ph π : h π = 1, π(h) = 0} < 1. Let α(λ) := 1+λ 1 λ. Obviously, α(λ) is strictly increasing with λ and α(λ) = 1 as λ = 0. Definition 4 (Right spectral gap) A Markov operator P admits an right spectral gap 1 λr if R has λr < 1, where λr := sup { Rh, h : h π = 1, π(h) = 0} . The spectral gap measures the convergence rate of a Markov chain towards its invariant state π (Rudolf 2011). The bigger the spectral gap, the faster the convergence to the stationary distribution. Since R is self-adjoint, it is known that λr λ. Robust Gradient-based Markov Subsampling The idea of gradient-based learning has been studied in the literature (Zhu 2016; Burke, Lewis, and Overton 2005; Burke et al. 2018). Recall that GS selects samples with probability directly proportional to their gradient values, providing a natural way to apply this idea to subsampling. Although GS performs well empirically, the resulting estimator can be sensitive to highly noisy data, particularly when the sampling ratio is small. In contrast, we explore the potential of selecting samples with small gradients to achieve robust estimation. To this end, we develop a gradient-based Markov subsampling (GMS) algorithm. Concretely, GMS consists of three steps: 1) pilot estimation; 2) gradient calculation; 3) Markov subsampling. Pilot estimation. The pilot estimation β0 can be calculated by (2) based on a small random subset with size n0 n. We empirically show that the GMS does not heavily rely on the quality of β0. As a result, n0 can be chosen to the user s preference in practice. Calculate gradient. Given the pilot β0, we calculate the gradient for the i-th sample by gi(β0) = ℓi(β) β |β=β0 = x i (yi xiβ0). (4) As discussed previously, we are attempting to find a subset DS on which L(β ) L(β0). Consider the first order Taylor expansion of L(β ) centered at the pilot β0: L(β ) = L(β0) + i=1 gi(β0), β0 β + Rn(β0), where Rn is the remainder. Since β0 is considered to be a crude estimate for β , β0 β is non-negligible. Thus, a natural way to satisfies L(β0) L(β ) is to select points with small gi(β0), i DS. Markov subsampling. It has been empirically shown that Markov chain samples may lead to more stable estimation compared to their i.i.d. counterparts (Gong, Zou, and Xu 2015; Sun, Sun, and Yin 2018). Taking this in mind, we implement probabilistic sampling through a Metropolis Hastings (MH) type procedure. Since we prefer samples with small gradient values, we use these values to design the probability in our acceptance step. This procedure can be summarized as follows. At some current sample zt, we accept a randomly selected candidate sample z with probability defined in (5). If accepted, the iteration is completed and we set z = zt+1. Otherwise, a new candidate is randomly selected and the process repeats. Finally, we accept the last n0 elements generated by this procedure after a user-specified burn-in period t0. The detailed procedure is summarized in Algorithm 1. Remark 1 It is known that the the subsamples generated by Algorithm 1 constitute an irreducible Markov Chain (since the transition probabilities are always positive), and therefore are uniformly ergodic (Down, Meyn, and Tweedie 1995). Oracle Pilot Estimate (a) Full data GMS Estimate GS Estimate Oracle GS Selected GMS Selected (b) Subsampled data Figure 2: (a) Oracle and pilot estimation; (b) Comparison on 10 data points selected by GS and GMS. The size of the subsampled data points are scaled by their corresponding gradient value. Remark 2 The overall computational complexity of GMS is O(max{nsubd2, d3}). Thus, the computational burden can be greatly reduced when nsub n. We give a toy example to show the potential benefits of GMS. We generate a dataset of size 50 by yi = 2xi + εi, where xi N(0, 3), εi U( 10, 10). Fig. 2 (a) plots the oracle (red solid line) and pilot estimation (purple dashed line) which is fitted by 10 randomly selected data points. Fig. 2(b) demonstrates the comparison of 10 samples selected by GS (green circles), GMS (blue circles) respectively. The purple area denotes the range where the gradient value is less than 15. It can be observed from Fig. 2 that samples in purple area help to correct the pilot line towards the oracle. In other words, by selecting the data points with small gradient values, GMS can safely improve the pilot estimation while also avoiding distractions caused by noisy samples. Algorithm 1 Robust Gradient-based Markov Subsampling Input: Dataset D = (xi, yi)n i=1, DS = , burn-in period: t0, subsample size nsub n. 1: Train a pilot estimator β0 based on a subsample with size n0 = nsub and calculate gi, i = 1, 2, , n by (4). 2: Randomly select a sample z1 from D, and set Ds = z1. 3: for 2 t (nsub + t0) do 4: while |DS| < t do 5: Randomly draw a candidate z =(x , y ) 6: Calculate the acceptance probability by p = min 1, gt 7: Set DS = DS z with probability p 8: If z is accepted, set zt+1 = z 9: end while 10: end for 11: Denote the last nsub samples of DS as (XS, y S). 12: Solve ˆβ = arg minβ y S XSβ 2. Output: ˆβ. Theoretical Assessment In this section, we provide theoretical support for the proposed GMS. In particular, our main interest is to bound the difference between the GMS estimator ˆβ and the oracle β . The main result is given in Theorem 1, which shows that the gradient based Markov subsampling algorithm is statistically consistent. We first present several lemmas as follows. Lemma 1 (Vershynin 2018) If X sub G(σ2), then for any t > 0, it holds P(|X| t) 2 exp t2 2σ2 . Lemma 2 (Sun et al. 2017) Let A, B Rd d be invertible, then for any matrix norm , if A 1 A B < 1, we have A 1 B 1 A 1 2 A B 1 A 1 A B . Lemma 3 (Fan, Jiang, and Sun 2018) Let {xi}i 1 be a Markov chain with invariant measure π and right spectral gap 1 λr > 0. Then for any bounded function f : X [a, b] and any t R, i=1 f(xi) nπ(f) 2 exp α(λr 0) 1ϵ2 Theorem 1 Suppose that the Markov chain samples generated by GMS are with invariant distribution π and right spectral gap 1 λr. Let ˆΣS = 1 nsub X S XS, Σ = cov(X) if nsub 8α(λr 0)d2 Σ 1 2 log(d2/δ), then with confidence at least 1 2δ, 8d(log(d) + log(1/δ)) where C = σ Σ 1 2. PROOF. Note that the Markov chain samples {xi}nsub i=1 are uniformly ergodic. Suppose that it has a stationary invariant distribution π with right spectral gap 1 λr. Without loss of generality, we assume that supxj X xj 1 for j = 1, 2, , nsub. Observe that = (n 1 sub X S XS) 1(n 1 sub X S y S) β = (X S XS) 1X S XSβ β + (n 1 sub X S XS) 1 X S ε nsub ˆΣ 1 S Σ 1 X S ε/nsub + Σ 1 X S ε/nsub Σ 1 2 ˆΣS Σ 1 ˆΣ 1 S ˆΣS Σ + Σ 1 1 ˆΣ 1 S ˆΣS Σ X S ε/nsub ν 1 Σ 1 X S ε/nsub . The second inequality comes from Lemma 2. Let us first consider X S ε/nsub. Recall that ε sub Gd(σ2), supxj X xj 1, then the random variable x j ε sub Gd(σ2). Denote by ϵ = maxj,k |ˆΣj,k Σj,k|, we have P( X S ε/nsub ϵ) i=1 P(|x i ε| nsubϵ/ 2d exp nsubϵ2 According to the fact that ˆΣS Σ dϵ. It follows from Lemma 3 that P( ˆΣS Σ ϵ) 2d2 exp nsubϵ2 Given that Σ 1 ˆΣS Σ 1 ν (0, 1), by taking ν = 1/2, we know P( ˆβ β ϵ) P Σ 1 ˆΣS Σ 1 + P Σ 1 X S ε/nsub ε 2d2e nsubα(λr 0) 1 8d2 Σ 1 2 +2de nsubϵ2 8dσ2 Σ 1 2 . ˆβ β σ Σ 1 2 8d log(d/δ) holds with confidence at least 1 δ if nsub 8α(λr 0)d2 Σ 1 2 log(d2/δ). This completes the proof. Remark 3 Theorem 1 indicates that the GMS estimator ˆβ is consistent, i.e. ˆβ β 0 as nsub . The founding condition requires that the right spectral gap of the invariant distribution π satisfies 1 λr > 0. GMS almost trivially satisfies this condition. To see this, notice that the Markov chain generated by GMS is uniformly ergodic, and hence geometrically ergodic. Following this, it is known that λ(P) < 1 if and only if the Markov chain is geometrically ergodic (Roberts and Rosenthal 1997), so the condition is satisfied. The convergence rate is with order of O( nsub ), which matches the optimal error bound for the i.i.d. samples in a minimax sense. Denote the mean squared error of the prediction X ˆβ by MSE(X ˆβ) = 1 n X( ˆβ β ) 2. According to the inequality MSE(X ˆβ) λmax(Σn) ˆβ β 2, we can obtain the following corollary bounding the prediction error immediately from Theorem 1. Corollary 1 Under the same conditions in Theorem 1, with confidence at least 1 δ, it holds MSE(X ˆβ) C2λmax(Σn) 8d(log(d) + log(1/δ)) GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method Figure 3: Boxplots of 50 times experiments for different subsampling methods with n = 1M, d = 500. From top to bottom: M1(N), M1(U), M1(t). From left to right: sr = 0.001, 0.005, 0.01. Experiments To assess the performance of GMS, we conduct experiments on both simulation studies and real data examples. All numerical studies are conducted in software R on Compute Canada clusters with 2.1 GHz CPUs and 128 GB memory. Simulation Studies In simulation studies, we generate the data by y = Xβ +ε, where the n d design matrix X is generated by a mixture of Gaussian distributions 1 2N(μ1, σ2 1) + 1 2N(μ2, σ2 2) in two different ways: (M1) μ1 = 2, σ1 = 3, μ2 = 2, σ2 = 10; (M2) μ1 = 0, σ1 = 3, μ2 = 0, σ2 = 10. The oracle β is generated uniformly from { 3, 2, 1, 0}. We generate three different types of i.i.d. noise, including uniform distribution with εi U( 5, 5), normal distribution with εi N(0, 25) and Student-t distribution with εi t(2). We denote the models combining these design matrices and noise distributions as follows: M1(U), M1(N), M1(t), M2(U), M2(N), M2(t). Also, we set n = 100K, 500K, 1M and with corresponding d = 50, 250, 500. For each dataset, we compare the proposed GMS with five representative sampling methods where each method is applied K = 50 times repeatedly. The quality of the fit is measured by the estimation error (EE): k=1 ˆβk β . In all experiments, the subsample size is set by nsub = sr n, where sr represents the sampling ratio. We set sr = 0.001, 0.005, 0.01 for each model. If required, a pilot estimator is calculated by uniform subsampling of size n0 = nsub. The sampling methods considered for comparison are: GS, leverage subsampling (LEV), unweighted leverage subsampling (LEVUNW) (Ma, Mahoney, and Yu 2015), uniform sampling and influence-based sampling (Ting and Brochu 2018). Note that LEVUNW conducts sampling identically to LEV, but solves the unweighted least squares problem instead. For influence-based sampling, the sampling weight for (xi, yi) is proportional to ψβ(xi, yi) , where ψβ(xi, yi) = (yi xiβ)Σ 1 n xi is the influence function. Table 1: EE comparison (mean standard deviation) 10 2 for different sampling methods for n = 1M, d = 500 Methods M1(N) M1(U) M1(t) sr = 0.1% sr = 0.5% sr = 1% sr = 0.1% sr = 0.5% sr = 1% sr = 0.1% sr = 0.5% sr = 1% GS 74.9(3.89) 21.7(0.68) 13.7(0.45) 42.9(1.36) 12.8(0.40) 8.33(0.26) 38.2(2.99) 8.54(0.31) 5.48(0.24) Influence 74.5(2.97) 21.5(0.78) 13.8(0.50) 42.8(1.54) 12.8(0.35) 8.30(0.21) 38.4(3.13) 8.61(0.36) 5.47(0.19) LEV 65.4(3.41) 21.7(0.52) 14.7(0.50) 37.6(1.64) 12.3(0.42) 8.57(0.22) 35.4(5.71) 14.1(2.51) 9.26(0.98) LEVUWN 65.0(2.81) 21.7(0.73) 14.7(0.40) 37.4(1.47) 12.5(0.46) 8.59(0.20) 37.8(7.52) 13.2(2.25) 9.72(1.25) UNIF 65.5(2.72) 21.6(0.66) 14.9(0.37) 37.8(1.52) 12.5(0.45) 8.64(0.25) 37.8(8.23) 13.6(2.15) 9.36(1.26) GMS 58.6(2.36) 18.9(0.61) 13.1(0.36) 35.5(1.35) 12.0(0.31) 8.26(0.23) 22.5(3.65) 7.22(0.50) 5.30(0.22) Table 2: EE comparison (mean standard deviation) 10 2 for different sampling methods for n = 1M, d = 500 Methods M2(N) M2(U) M2(t) sr = 0.1% sr = 0.5% sr = 1% sr = 0.1% sr = 0.5% sr = 1% sr = 0.1% sr = 0.5% sr = 1% GS 77.8(2.55) 22.2(0.67) 14.4(0.43) 43.3(1.65) 13.4(0.41) 8.62(0.24) 39.3(3.71) 8.95(0.48) 5.66(0.34) Influence 77.4(2.53) 22.3(0.49) 14.4(0.39) 44.0(1.39) 13.4(0.36) 8.73(0.33) 39.5(3.38) 9.00(0.51) 5.68(0.36) LEV 67.3(3.11) 22.4(0.61) 15.5(0.42) 39.3(1.26) 13.0(0.45) 8.89(0.21) 36.1(5.19) 14.1(2.22) 9.52(1.05) LEVUWN 67.6(2.85) 22.3(0.67) 15.4(0.37) 38.7(1.52) 12.9(0.43) 8.93(0.22) 37.0(6.46) 13.2(1.39) 13.0(2.73) UNIF 68.5(3.36) 22.5(0.64) 15.4(0.47) 38.8(1.52) 13.0(0.42) 8.81(0.31) 35.8(4.48) 13.5(2.03) 9.60(1.42) GMS 60.5(2.68) 19.4(0.65) 13.4(0.46) 37.0(1.53) 12.6(0.43) 8.61(0.26) 22.7(1.78) 7.52(0.66) 5.65(0.88) GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method GS GMS Influence LEV LEVUNW UNIF Subsampling Method Figure 4: Boxplots of 50 times experiments for different subsampling methods with n = 1M, d = 500. From top to bottom: M2(N), M2(U), M2(t). From left to right: sr = 0.001, 0.005, 0.01. Due to space limitation, we only show the results for the setting n = 1M, d = 500. Other results are given in the Table 3: EE comparison (mean standard deviation) for different sampling methods for real datasets Methods Online News Popularity Poker Hands Wave Energy Converters sr = 0.1% sr = 0.5% sr = 1% sr = 0.1% sr = 0.5% sr = 1% sr = 0.1% sr = 0.5% sr = 1% GS 36.4(26.7) 8.67(5.27) 7.06(4.08) 0.44(0.18) 0.118(0.039) 0.078(0.030) 1979(258.6) 767(105.1) 498(74.8) Influence 29.8(31.9) 8.02(5.15) 5.22(2.06) 0.41(0.17) 0.125(0.036) 0.079(0.023) 1990(339.1) 696(96.1) 503(51.9) LEV 18.6(19.1) 11.6(7.01) 8.55(4.55) 0.35(0.13) 0.12(0.04) 0.092(0.024) 1940(275.5) 801(99.1) 579(66.2) LEVUWN 21.7(13.1) 10.3(8.38) 8.85(4.92) 0.37(0.13) 0.123(0.03) 0.078(0.023) 1868(239.8) 808(86.8) 583(66.1) UNIF 18.1(14.2) 12.4(7.75) 7.23(3.01) 0.35(0.12) 0.135(0.035) 0.089(0.022) 1956(278.1) 815(90.5) 563(76.8) GMS 11.4(5.24) 7.48(4.81) 5.37(2.61) 0.26(0.09) 0.119(0.034) 0.079(0.024) 1747(269.1) 762(103.9) 517(67.1) supplementary material. Figs. 3 and 4 record the boxplots based on 50 times empirical estimation error. The mean and standard deviation of EE are reported in Tables 1 and 2. Several observations are worth making about the presented results. To begin, as the subsample size increases, the mean error and standard deviation for all methods tend to decrease monotonically. Note that GMS achieves the lowest error for all 6 models under all sampling ratios. In particular, GMS outperforms other competitors significantly when the sampling ratio is very small. This observation supports that the samples generated by GMS are more informative and lead to robust estimation. Moreover, LEV and LEVUWN perform quite similarly to uniform sampling. This is because the leverage score only depends the input information. Since both M1 and M2 are generated by mixtures of Gaussian and hence have nearly uniform leverage scores, LEV and LEVUWN do not show significant differences with uniform sampling. We also observe that influence-based subsampling performs almost identically to GS. Since the design matrix X is generated by i.i.d. mixtures of Gaussian, Σn approximates a diagonal matrix. Therefore, the influence function assigns similar sampling probability as gradient in the sampling process. In addition, in heavy-tailed noise cases, i.e. M1(t), M2(t), GMS still achieves the lowest error, which supports that GMS is more robust to highly noisy data. Real Data Examples We further evaluate the performance of GMS on 3 realworld datasets: Online News Popularity (n = 39797, d = 61), Wave Energy Converters (n = 288000, d = 32) and Poker Hands (n = 25010, d = 11) 1. For the WEC dataset, we remove 16 columns due to collinearity. Since the oracle β is unknown for real datasets, we utilize βn as a proxy to β in our performance metric. The comparisons of empirical estimation error with different sampling ratio are reported in Table 3. It can be observed from Table 3 that GMS still achieves the lowest error when the sampling ratio is very low. As subsample size increases, GMS is still able to achieve competitive performance among the 6 sampling algorithms. Conclusion In this paper, we propose a gradient-based Markov subsampling (GMS) algorithm for the linear regression problem. We analyze the performance of GMS in terms of error bounds. The theoretical results show that the GMS estimator 1https://archive.ics.uci.edu/ml/datasets.php is statistically consistent and the corresponding error bound matches the optimal rate in the minimax sense. Experiments on simulation studies and real data examples demonstrate the effectiveness of GMS. Future work includes extending GMS to general models (e.g. Ridge, Lasso), and accelerating the burn-in process of GMS. All these problems are under current research. Acknowledgments This work was supported in part by NSERC grant RGPIN2016-05024 and in part by NSFC grants 11690014 and 11971173. The content is solely the responsibility of the authors and does not necessarily represent the official views of the aforementioned funding agencies. Burke, J. V.; Curtis, F. E.; Lewis, A. S.; Overton, M. L.; and Sim oes, L. E. 2018. Gradient sampling methods for nonsmooth optimization. ar Xiv preprint ar Xiv:1804.11003. Burke, J. V.; Lewis, A. S.; and Overton, M. L. 2005. A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM Journal on Optimization 15(3):751 779. Derezi nski, M.; Warmuth, M. K.; and Hsu, D. 2018a. Correcting the bias in least squares regression with volumerescaled sampling. ar Xiv preprint ar Xiv:1810.02453. Derezinski, M.; Warmuth, M. K.; and Hsu, D. J. 2018b. Leveraged volume sampling for linear regression. In Advances in Neural Information Processing Systems, 2505 2514. Down, D.; Meyn, S. P.; and Tweedie, R. L. 1995. Exponential and uniform ergodicity of markov processes. The Annals of Probability 23(4):1671 1691. Drineas, P.; Mahoney, M. W.; Muthukrishnan, S.; and Sarl os, T. 2011. Faster least squares approximation. Numerische mathematik 117(2):219 249. Drineas, P.; Magdon-Ismail, M.; Mahoney, M. W.; and Woodruff, D. P. 2012. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research 13(Dec):3475 3506. Drineas, P.; Mahoney, M. W.; and Muthukrishnan, S. 2008. Relative-error cur matrix decompositions. SIAM Journal on Matrix Analysis and Applications 30(2):844 881. Fan, J.; Jiang, B.; and Sun, Q. 2018. Hoeffding s lemma for markov chains and its applications to statistical learning. ar Xiv preprint ar Xiv:1802.00211. Fithian, W., and Hastie, T. 2014. Local case-control sampling: Efficient subsampling in imbalanced data sets. Annals of statistics 42(5):1693 1724. Gong, T.; Zou, B.; and Xu, Z. 2015. Learning with ℓ1regularizer based on markov resampling. IEEE Transactions on Cybernetics 46(5):1189 1201. Halko, N.; Martinsson, P.-G.; and Tropp, J. A. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review 53(2):217 288. Ma, P.; Mahoney, M. W.; and Yu, B. 2015. A statistical perspective on algorithmic leveraging. The Journal of Machine Learning Research 16(1):861 911. Moritz, P., and et al. 2018. Ray: A distributed framework for emerging ai applications. In 13th {USENIX} Symposium on Operating Systems Design and Implementation (OSDI, 18), 561 577. Roberts, G., and Rosenthal, J. 1997. Geometric ergodicity and hybrid markov chains. Electronic Communications in Probability 2:13 25. Rudi, A.; Calandriello, D.; Carratino, L.; and Rosasco, L. 2018. On fast leverage score sampling and optimal learning. In Advances in Neural Information Processing Systems, 5672 5682. Rudolf, D. 2011. Explicit error bounds for markov chain monte carlo. ar Xiv preprint ar Xiv:1108.3201. Sun, Q.; Tan, K. M.; Liu, H.; and Zhang, T. 2017. Graphical nonconvex optimization for optimal estimation in gaussian graphical models. ar Xiv preprint ar Xiv:1706.01158. Sun, T.; Sun, Y.; and Yin, W. 2018. Explicit error bounds for markov chain monte carlo. ar Xiv preprint ar Xiv:1809.04216v1. Ting, D., and Brochu, E. 2018. Optimal subsampling with influence functions. In Advances in neural information processing systems, 3650 3659. Vershynin, R. 2018. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press. Zaharia, M.; Chowdhury, M.; Franklin, M. J.; Shenker, S.; and Stoica, I. 2010. Spark: Cluster computing with working sets. Hot Cloud 10(10-10):95. Zhu, R. 2016. Gradient-based sampling: an adaptive importance sampling for least-squares. In Advances in neural information processing systems, 406 414.