# Optimal Subsampling with Influence Functions

Daniel Ting, Tableau Software, Seattle, WA, USA (dting@tableau.com)
Eric Brochu, Tableau Software, Vancouver, BC, Canada (ebrochu@tableau.com)

## Abstract

Subsampling is a common and often effective method to deal with the computational challenges of large datasets. However, for most statistical models, there is no well-motivated approach for drawing a non-uniform subsample. We show that the concept of an asymptotically linear estimator and the associated influence function leads to asymptotically optimal sampling probabilities for a wide class of popular models. This is the only tight optimality result for subsampling we are aware of, as other methods only provide probabilistic error bounds or optimal rates. We also show that these optimal weights can differ depending on whether the task is parameter estimation or prediction. Furthermore, for linear regression models, which have well-studied procedures for non-uniform subsampling, we empirically show that our optimal influence-function-based method outperforms previous approaches even when using approximations to the optimal probabilities.

## 1 Introduction

As the amount of data increases, the question arises as to how best to deal with large datasets. While computational platforms such as Spark [28] and Ray [23] help process large datasets once a desired model is chosen, simply using smaller data can be a faster solution for exploratory data modeling, rapid prototyping, or other tasks where the accuracy obtainable from the full dataset is not needed. Sampling provides a flexible summarization of the data that can be applied to almost all tasks in a simple, straightforward manner. Ideally, data is sampled efficiently, preferentially taking the points that will accurately approximate the estimates from the full data set. However, we show existing preferential sampling techniques are inefficient and can demonstrate pathological behavior. Furthermore, most methods are inflexible, as they are derived only for specific linear or logistic regression models.

We propose using the influence function as a measure of sampling importance. The influence function measures the change in an objective or in values of interest due to a single point. It is a particularly general approach, as many models and estimators, such as maximum likelihood and M-estimators, can be cast in this framework, and it can work with non-differentiable objectives. We prove that the regularized version of our sampling design is asymptotically optimal among all regularized designs of the same expected size. While other sampling methods provide probabilistic guarantees on the resulting samples, we do not know of other results that can claim optimality.

Beyond the improved performance of our method, the influence-based approach allows one to understand the problem of optimal subsampling. The influence function and the notion of asymptotically linear estimators reduce the problem of accurately approximating an estimate on the full data set to a problem of calculating the mean of influence functions. Thus, the problem of finding optimal sampling probabilities, or a sampling design, for estimating a model is converted to the more straightforward problem of generating an optimal sampling design for a mean. This design can depend on the specific task at hand.
In particular, efficient sampling designs to estimate the parameters of a model can substantially differ from efficient sampling designs for predictions, even when both use the exact same model. This fact is also borne out in our experimental results. We explicitly derive sampling probabilities for linear regression, quantile regression, and generalized linear models (GLMs). In these examples, the influence of the residuals $y_i - \hat y_i$ may be separated from the influence of the regression design or predictors $X$. As a result, existing approaches are shown to exploit only one of the two, whereas our method appropriately incorporates both.

### 1.1 Related work

The dominant subsampling approach in the literature for least squares regression is based on statistical leverage scores [10, 19]. A number of papers [4, 5, 8, 21] address more general $L_p$ linear regression problems and derive a corresponding leverage for $L_p$ regression. These methods focus on generating sampling designs from the matrix of predictors $X$, and make no or limited use of the responses $Y$. The resulting sampling designs are obtained via a relatively expensive-to-compute random projection or low-distortion embedding. The sampling weights for leverage-based sampling are defined by

$$w^{(\mathrm{lev})}_i = H_{ii}, \quad \text{where } \hat Y = HY = XX^{\dagger}Y, \tag{1}$$

with $\dagger$ indicating the Moore–Penrose pseudo-inverse, so that $H$ is the hat matrix.

The gradient-based approach of [29] for linear models and local case-control sampling [13] for logistic models generate sampling probabilities based on the residuals given a pilot estimate of the coefficients. Very recent work covers optimal subsampling for logistic regression [27]. The gradient-based approach is similar to ours in that it provides a general method that can be applied across a wide range of models. Given a differentiable M-estimator objective $J(\theta) = \sum_i \ell(\theta; x_i)$, gradient-based sampling defines importance sampling weights

$$w^{(\mathrm{grad})}_i = \|\nabla_\theta \ell(\theta; x_i)\|. \tag{2}$$

Other techniques such as SGD [3], the Uluru algorithm [9], and random projection techniques [22] are relevant for fast model fitting on large datasets. In these cases, the data size is not reduced, and the result is an estimate for the specific model being fit. In contrast, we focus on sampling as a way to reduce computational complexity for one or more models while being able to use the sample for additional purposes, for example, coarse parameter tuning or rapid prototyping on a single machine.

Coreset generation [1, 15] is another technique for generating reduced-size datasets that can well approximate results generated from fitting a model on a full dataset. These techniques provide a probabilistic guarantee on the quality of the approximation according to some pre-specified sensitivity and cost metrics and some strong assumptions on the solution. In earlier literature, similar techniques without specific guarantees were also known as data squashing [11, 20]. We consider our work to be complementary to the coreset literature. Our work shows that the influence function can be used to improve the sensitivity measure used in many coreset designs.
For example, [15] defines a sensitivity metric on the log-likelihood for logistic regression, while our work shows that choice is inefficient but that appropriately scaled derivatives of the log-likelihood define an asymptotically optimal metric.

We also note the connection, or lack thereof, between sampling for statistical purposes and subsampling for computational purposes. While the former is a fundamental topic in statistics, the latter discards data, which is typically contrary to a statistician's goals. In sampling, the population data is unknown, and cost arises from a data collection process. In contrast, the data is known in subsampling and can be exploited to draw better samples; size is the main cost in this case. Though there is limited work on subsampling procedures in the statistics community, several papers have analyzed statistical properties of these procedures [19, 24].

## 2 Our method

Our key idea is that many parameter estimators can be asymptotically expressed as a mean of influence vectors. This allows the problem of optimal subsampling for a statistical model to be recast in terms of the well-studied problem of optimal subsampling for a mean. We give a brief overview of influence functions and asymptotically linear estimators; the interested reader may refer to [26] and [14] for more information. We then show how this technique can generate sampling weights. As the exact influence function is often expensive to compute, we address practical considerations in implementing our method. Finally, we analyze the asymptotic error of our method and show that, out of all importance sampling distributions satisfying a form of regularization, ours has the lowest asymptotic variance. We summarize our method in Algorithm 1.

**Algorithm 1** Basic Influence-Based Sampling$(X, Y)$

1. Given a model and objective, derive the influence function $\psi_\theta$ or an approximation of it.
2. Compute a pilot estimate of the parameters $\hat\theta$.
3. Draw an importance sample, taking $(X_i, Y_i)$ with probability $\pi_i \propto \|\psi_{\hat\theta}(X_i, Y_i)\|$ under the constraint $\pi_i \ge \alpha$.
4. Fit a model with the sample using importance weights $1/\pi_i$.

### 2.1 Influence and asymptotically linear estimators

We consider the class of plug-in estimators $\hat\theta(P)$ taking a distribution to a real-valued vector of parameter estimates. This is a highly flexible class of estimators. For example, any M-estimator $\hat\theta(P) = \operatorname{argmin}_\theta E_P\,\ell(\theta, X)$ is of this form for loss $\ell$; the argument for the M-estimator is the empirical distribution $P_n$. When it exists, the influence function for this estimator is defined by its functional Gâteaux derivative

$$\psi_\theta(x) = \lim_{\epsilon \downarrow 0} \frac{1}{\epsilon}\left[\hat\theta\bigl((1-\epsilon)P_\theta + \epsilon\delta_x\bigr) - \hat\theta(P_\theta)\right], \tag{3}$$

where $\delta_x$ is the Dirac delta measure at $x$. It represents the infinitesimal change in the estimate from adding the point $x$ to the sample. The estimator $\hat\theta$ is an asymptotically linear estimator with influence function $\psi_\theta$ if it satisfies

$$\sqrt{n}\,\bigl(\hat\theta - \theta\bigr) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \psi_\theta(X_i) + o_p(1), \tag{4}$$

with $E\psi_\theta = 0$ and $E\psi_\theta^T\psi_\theta < \infty$. Here, $o_p(1)$ denotes convergence to zero in probability in some normed space. Asymptotically linear estimators are pervasive in statistical modeling. Under sufficient regularity conditions, M-estimators, maximum likelihood estimators, Z-estimators, non-degenerate U-statistics, and Generalized Method of Moments estimators are all examples of asymptotically linear estimators.

For maximum likelihood estimation with correctly specified and sufficiently regular models, the influence function is easily derived. It is the gradient of the log-likelihood, also called the score $s_\theta(x)$, scaled by the inverse Fisher information $I_\theta = E\,s_\theta(X)s_\theta(X)^T$:

$$s_\theta(x) = \partial\ell(\theta; x)/\partial\theta, \qquad \psi_\theta = I_\theta^{-1} s_\theta. \tag{5}$$
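To make (5) concrete, the following is a minimal NumPy sketch of the MLE influence for logistic regression; the function name and the use of the empirical outer-product estimate of $I_\theta$ are illustrative choices of ours rather than anything prescribed by the paper.

```python
import numpy as np

def mle_influence_logistic(X, y, theta):
    """Influence vectors psi = I^{-1} s for logistic regression, following (5).

    X: (n, d) design matrix, y: (n,) labels in {0, 1},
    theta: (d,) parameters at which the influence is evaluated (e.g. a pilot).
    """
    p = 1.0 / (1.0 + np.exp(-X @ theta))       # predicted probabilities
    scores = (y - p)[:, None] * X              # per-point score vectors s_theta(x_i, y_i)
    fisher = scores.T @ scores / len(y)        # empirical Fisher information I_theta = E[s s^T]
    psi = np.linalg.solve(fisher, scores.T).T  # rows are psi_i = I_theta^{-1} s_i
    return psi
```

The sampling weight for point $i$ is then the norm of the $i$th row, `np.linalg.norm(psi, axis=1)`, as used in Algorithm 1.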
### 2.2 Optimal sampling design and estimation

It is not obvious how one should choose good sampling probabilities. For example, [11, 15] pick sampling weights to best approximate an objective function. For least squares regression, this is equivalent to sampling with probability proportional to the squared residuals. However, Fig. 1 shows how this can actually perform worse than uniform sampling. The reason for this poor performance is that accurately estimating the loss itself is different from accurately estimating the actual quantity of interest, the minimizer of the loss. The corresponding optimal sampling probabilities for each task are also different.

Figure 1: Left: Naive sampling weights versus influence-function-derived weights on Gaussian data where 5% of points are modified to have high leverage. Root leverage outperforms leverage, as predicted by theory. Squared residuals as weights optimally approximate the squared loss but perform worse than uniform weights when approximating the true predictions. Right: The influence-based weights are separated into the contributions of the regressors and the residual for the CASP data set. For predictions, the contribution of the regressors is the root leverage score. In this case, the regressors are more important than the residual. (Panels plot excess MSE of predictions and MSE of coefficients against expected sample size.)

The theory of asymptotically linear estimators addresses the problem of mapping the quantity of interest to optimal sampling probabilities in two ways. Influence functions measure how much a data point changes the actual quantity of interest, in other words, its importance for the task at hand. An asymptotic linear expansion into independent, zero-mean influence vectors then reduces the problem of drawing an optimal sample for an estimator to the problem of drawing an optimal sample for a multivariate mean. We further reduce this to the well-studied problem of designing optimal sampling probabilities for a univariate mean.

First, we describe optimal independent (Poisson) sampling for a univariate mean. Let independent $Z_i \sim \mathrm{Bernoulli}(\pi_i)$ indicate whether $x_i \in \mathbb{R}$ is included in the sample. An optimal design with expected size $n_{\mathrm{sub}}$ for minimizing the variance is given by

$$\pi^{(\mathrm{opt})} = \operatorname*{argmin}_{\pi:\, \sum_j \pi_j = n_{\mathrm{sub}}} \sum_i \frac{x_i^2}{\pi_i}(1 - \pi_i), \qquad \pi^{(\mathrm{opt})}_i = \min\{\lambda |x_i|,\, 1\}, \tag{6}$$

for $\lambda$ such that $\sum_i \pi^{(\mathrm{opt})}_i = n_{\mathrm{sub}}$. Such a scheme is a probability proportional to size (PPS) design with respect to the size measure $|x_i|$. To reduce estimation of multiple parameters with an asymptotically unbiased, asymptotically linear estimator to this univariate problem, consider the trace of the variance,

$$\operatorname{tr}\operatorname{Var}\left(\sum_i \frac{Z_i}{\pi_i}\psi_\theta(x_i)\right) = \sum_i \frac{\|\psi_\theta(x_i)\|^2}{\pi_i}(1 - \pi_i). \tag{7}$$

A PPS design with size equal to the norm of the influence function is optimal for minimizing the trace of the asymptotic variance. This optimality statement is made more precise in section 5.
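As a concrete illustration of the univariate design in (6), here is a minimal sketch that finds $\lambda$ by bisection (our choice of root-finder; the paper does not prescribe one), draws independent inclusion indicators, and forms the Horvitz–Thompson estimate of the mean. It assumes $n_{\mathrm{sub}} < n$ and that not all sizes are zero; section 4 adds the minimum-probability regularization.

```python
import numpy as np

def pps_probabilities(size, n_sub, tol=1e-10):
    """Inclusion probabilities pi_i = min(lambda * size_i, 1) with sum(pi) = n_sub, as in (6)."""
    size = np.abs(size)
    lo, hi = 0.0, n_sub / size.sum()
    while np.minimum(hi * size, 1.0).sum() < n_sub:   # grow hi until the budget is reachable
        hi *= 2.0
    while hi - lo > tol:                              # bisect on lambda
        mid = (lo + hi) / 2.0
        if np.minimum(mid * size, 1.0).sum() < n_sub:
            lo = mid
        else:
            hi = mid
    return np.minimum(hi * size, 1.0)

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=100_000)       # heavy-tailed data whose mean we want
pi = pps_probabilities(x, n_sub=5_000)
z = rng.random(len(x)) < pi                  # independent (Poisson) inclusion indicators
ht_mean = np.sum(x[z] / pi[z]) / len(x)      # Horvitz-Thompson estimate of the full-data mean
print(ht_mean, x.mean())
```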
## 3 Examples

We examine the influence functions for linear least squares regression, logistic regression and GLMs, and quantile regression. While asymptotically linear estimators encompass a wider class of estimators, we focus on this particular subset as we can demonstrate they can be computed efficiently, and they share some structure that allows us to understand the problem by decomposing the influence into informative residual and design components. We also examine the difference between targeting accurate estimation of parameters versus predictions. The form of the weights provides insight on sampling methods by allowing a decomposition into regression design and residual effects.

### 3.1 Linear Least Squares

For linear regression models, the influence function for the coefficients $\hat\theta$ is given by

$$\psi_\theta(x_i, y_i) = (y_i - x_i^T\theta)\,\Sigma^{-1}x_i, \tag{8}$$

where $\Sigma = \frac{1}{n}X^TX$ is the empirical second moment matrix and $\theta$ are the true coefficients. Taking the norm of the influence yields the sampling weight. This weight differs from the gradient-based sampling weights, $\|(y_i - \hat y_i)x_i\|$, given in (2) only by the scaling $\Sigma^{-1}$. The influence-based weights appropriately change with scalings of the data, while the gradient-based approach displays pathological behavior. A parameter $\theta_j$ is made more important by making it larger, or equivalently, by scaling the $j$th coordinate of the data to be smaller. Under gradient-based sampling, the opposite occurs; a coordinate is less important as it is scaled to be smaller. As shown in Fig. 2, influence-based sampling correctly increases the importance.

Figure 2: Sampling patterns for different importance measures for the model $Y = 5X + 1 + \epsilon$. Red and light blue respectively indicate high and low inclusion probability. Leverage focuses on the extremes of the regression design. It and the gradient ignore large residuals near the center. The influence picks an appropriate balance of residual and regression design effects.

### 3.2 Influence on predictions

The notion of influence can be extended beyond estimators for parameters. For example, points can be sampled according to their influence on predictions rather than on the coefficients, which is useful when exact coefficient values are unimportant as long as predictive performance is good. The influence on predictions has also been used in other contexts, such as the interpretability of machine learning models [17].

The influence on the prediction is easily derived from the influence on the coefficients. When each prediction $\hat y_i(\theta)$ is a twice differentiable function of the $k$-dimensional parameter $\theta$, then, with a slight abuse of notation, the influence on the vector of predictions can be easily computed by the chain rule,

$$\psi^{(\mathrm{pred})}_\theta(x_i, y_i) = \hat y'(\theta)\,\frac{d\theta}{d\delta_{(x_i,y_i)}} = \hat y'(\theta)\,\psi_\theta(x_i, y_i). \tag{9}$$

Here $\hat y'(\theta)$ is the $n \times k$ matrix of gradients $\nabla_\theta\,\hat y_i(\theta)$. In the context of linear least squares regression, the influence is $\psi^{(\mathrm{pred})}_\theta(x_i, y_i) = X(X^TX)^{-1}x_i(y_i - \hat y_i) = r_i H_i$, where $r_i$ is the residual of the $i$th prediction and $H_i$ is the $i$th row of the hat matrix. Since the hat matrix $H$ is idempotent and symmetric, the squared norm is $\|H_i\|^2 = e_i^T H^T H e_i = e_i^T H e_i = H_{ii}$. The prediction influence weight can thus be more succinctly written as $w^{(\mathrm{pred})}_i = |r_i|\sqrt{H_{ii}}$.

The influence on predictions provides a strong connection to leverage-based sampling. In the classical statistical setting where the experimenter does not have knowledge of the response $Y_i$ while setting the regression design $X$, a sensible measure of influence takes the expectation over the unknown $Y_i$. In this case, one obtains a sampling weight proportional to the root leverage $\sqrt{H_{ii}}$. If only the influence on the single prediction $\hat y_i$ is considered, then leverage score based sampling is recovered: the influence is $\psi^{(\mathrm{pred},i)}_{P_\theta}(x_i, y_i) = x_i^T(X^TX)^{-1}x_i(y_i - \hat y_i) = r_i H_{ii}$, and taking the norm and the expectation over an unknown $Y_i$ exactly recovers leverage score sampling. Thus, leverage-based sampling throws away two pieces of available information: the effect of the regression design on points other than the $i$th point and, more importantly, the information about the response $Y_i$.

The surprising result that root leverage can yield a better sampling weight than leverage is borne out in simulations. Fig. 1 shows that root leverage weights yield lower predictive loss than leverage-based weights when the model is correctly specified. However, we find the empirical results less clear in the case of real data with misspecified models. In this case, leverage scores can be correlated with the size of the residuals, and leverage sometimes performs better than root leverage.
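The following is a minimal sketch, assuming a pilot coefficient vector `theta0` is available, of the least-squares weights discussed in sections 3.1–3.2, with the leverage and gradient weights of (1)–(2) included for comparison. Forming the exact hat-matrix diagonal this way costs $O(nd^2)$ and is only meant to make the formulas concrete; section 4 discusses cheaper approximations.

```python
import numpy as np

def ols_sampling_weights(X, y, theta0):
    """Unnormalized sampling weights for least squares, given a pilot estimate theta0."""
    r = y - X @ theta0                                   # residuals at the pilot
    Sigma = X.T @ X / len(y)                             # empirical second moment matrix
    Sinv_x = np.linalg.solve(Sigma, X.T).T               # rows are Sigma^{-1} x_i
    h = np.einsum('ij,ji->i', X, np.linalg.solve(X.T @ X, X.T))  # leverage scores H_ii
    return {
        'influence':      np.abs(r) * np.linalg.norm(Sinv_x, axis=1),  # ||(y_i - x_i'theta0) Sigma^{-1} x_i||, eq. (8)
        'pred_influence': np.abs(r) * np.sqrt(h),                      # |r_i| sqrt(H_ii), section 3.2
        'leverage':       h,                                           # eq. (1)
        'gradient':       np.abs(r) * np.linalg.norm(X, axis=1),       # |r_i| ||x_i||, proportional to eq. (2)
    }
```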
### 3.3 Generalized Linear Model

GLMs can be solved via iteratively reweighted least squares. The solution is given by $\hat\theta = (X^TWX)^{-1}X^TWY$, where $W$ is the diagonal matrix with $W_{ii} = 1/\operatorname{Var}_{\hat\theta}(Y_i \mid x_i^T\hat\theta)$, the inverse predicted variance at $x_i$ if $\hat\theta$ is equal to the true parameters. It is easy to verify that the influence functions for $\hat\theta$ and $\hat y$ under correct model specification are given by

$$\psi_\theta(x_i, y_i) = (y_i - \hat y_i)\bigl(X^TWX\bigr)^{-1}x_i, \qquad \psi^{(\mathrm{pred})}_\theta(x_i, y_i) = r_i\,H_i^T, \tag{10}$$

where $r_i = y_i - \hat y_i$ is the residual and $H = X(X^TWX)^{-1}X^TW$, so that $\hat Y = HY$. Unlike linear least squares regression, the matrix $H$ is non-symmetric, so the norm of the influence function cannot be expressed exactly in terms of the diagonal of the hat matrix.

For the special case of logistic regression, local case-control sampling provides a sampling method that has both good empirical and theoretical properties. It chooses sampling probabilities proportional to the surprise $y_i(1 - \hat p_i) + (1 - y_i)\hat p_i$, so that a point is likely to be sampled only if it did not match the prediction. The surprise can also be expressed as the absolute value of the residual, $|y_i - \hat p_i|$. Thus, local case-control sampling is equivalent to influence-based sampling under the approximation that there is no effect due to the regression design.

### 3.4 Quantile regression

While linear least squares regression focuses on estimating the conditional mean, in some cases the quantity of interest is not the mean but the median or other quantiles of a distribution. Quantile regression [16] provides another useful generalization of linear models. It is of particular interest here since the objective is non-differentiable but still has an influence function. For quantile regression, the loss function is the non-differentiable check function $\ell_\tau(x) = -(1-\tau)\,x\,\mathbf{1}(x < 0) + \tau\,x\,\mathbf{1}(x \ge 0)$ rather than the squared residual. When the desired quantile is $\tau$ and the true conditional quantile is linear, the influence function is given by $\psi_\theta(x_i, y_i) = [\tau(1-\tau)]^{-1}V^{-1}x_i\,\rho(y_i - x_i^T\theta)$, where $\rho$ is a subgradient of the loss, $\rho(z) = -(1-\tau)$ if $z < 0$ and $\tau$ if $z > 0$, and $V = \int xx^T f(0 \mid x)\,dG(x)$ when the $X_i$ are randomly drawn from a distribution $X_i \sim G$ and the error $Y_i - x_i^T\theta$ has density $f(\cdot \mid x_i)$.

In the commonly considered special case where the error distribution is independent of the predictors $X$, $\Sigma = \frac{1}{n}X^TX$ is a consistent estimator of $V$ up to a proportionality constant. This gives the following estimated influence functions on the coefficients and predictions:

$$\hat\psi_\theta(x_i, y_i) = [\tau(1-\tau)]^{-1}\rho(r_i)\,\Sigma^{-1}x_i, \qquad \hat\psi^{(\mathrm{pred})}_\theta(x_i, y_i) = [\tau(1-\tau)]^{-1}\rho(r_i)\,H_i. \tag{11}$$

We note that this influence function has the same form as the influence function for linear regression in (8); the residual $r_i$ in the influence function for linear regression is simply replaced by $\rho(r_i)$ in the quantile influence function. We will refer to $\rho(r_i)$ as the "residual" for quantile regression.
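For quantile regression, a minimal sketch of the estimated coefficient-influence weight in (11) follows, under the same homoskedastic-error substitution of $\Sigma$ for $V$; the constant $[\tau(1-\tau)]^{-1}$ is dropped since it does not change the relative sampling probabilities, and `theta0` is again a pilot fit.

```python
import numpy as np

def quantile_influence_weights(X, y, theta0, tau):
    """Unnormalized influence weights for tau-quantile regression, following (11)."""
    r = y - X @ theta0                        # residuals at the pilot fit
    rho = np.where(r < 0, -(1.0 - tau), tau)  # subgradient "residual" rho(r_i)
    Sigma = X.T @ X / len(y)                  # stands in for V when errors are independent of X
    Sinv_x = np.linalg.solve(Sigma, X.T).T    # rows are Sigma^{-1} x_i
    return np.abs(rho) * np.linalg.norm(Sinv_x, axis=1)   # ||rho(r_i) Sigma^{-1} x_i||
```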
## 4 Implementation

To implement our method practically, we must address three main issues: the fact that the influence function depends on the unknown true parameter $\theta$; the problem that very small sampling probabilities can lead to high variance; and the computational expense of exactly computing the influence function.

**Influence function estimation:** In all the examples, the exact influence function depends on the true parameter $\theta$. A simple solution is to substitute a pilot estimate $\theta_0$ for the true parameter. In some cases, the pilot estimate may be readily available or easily obtained, such as applications in which estimated parameters from one day may be used as a warm start for training on the next day. If there is no pre-existing pilot estimate, then one can first draw a uniform sample, or even a reasonable convenience sample, from the data to form a pilot that can be used on the remaining data. Theorem 2 shows that simple consistency of the pilot is sufficient to ensure asymptotically optimal estimation of regularized sampling probabilities; it does not need to improve at a rate dependent on the number of points $n$.

**Regularization of probabilities:** It is typical for some estimated influences to be very close to 0. For example, any point that lies on a pilot estimate's regression line has zero influence. Since each sampled point is weighted by its inverse sampling probability, these points have very large weight when selected. Our solution is to regularize by setting a minimum sampling probability $\alpha$. This corresponds to adding the convex constraint $\alpha \le \pi_i \le 1$ to the optimization in section 2.2. The resulting sampling probabilities are $\pi_i = \max\{\alpha, \min\{1, \lambda\|\psi_{\theta_0}(x_i)\|\}\}$ for $\lambda$ such that $\sum_i \pi_i = n_{\mathrm{sub}}$. In other words, the points whose probabilities are not clipped at $\alpha$ or 1 are included with probability proportional to $\|\psi_{\theta_0}(x_i)\|$.

**Influence function computation:** For many estimators, computing the influence function requires a matrix pseudo-inverse for some matrix $\Sigma$ or computing a leverage score. These may be costly to compute. Approximate leverage scores may be computed [10] using fast Johnson–Lindenstrauss transforms. For matrix inverse approximations, a standard strategy [3, 2] is to replace $\Sigma$ with its diagonal, which amounts to using the Jacobi preconditioner in place of $\Sigma$ and ignoring covariance. We apply a slightly more complex approximation and estimate $\Sigma = X^TWX$ with a diagonal and a low-rank component, so that $\Sigma \approx S^{1/2}(I - VDV^T)S^{1/2}$, where $V$ has low rank $r$ and $S$ is a diagonal scaling matrix. This can be approximately computed using a truncated SVD of $QXW^{1/2}$, where $Q$ is an $n_{\mathrm{sub}} \times n$ sampling matrix or a fast Johnson–Lindenstrauss transform. Since the inverse is of interest, the truncated SVD should compute the smallest rather than largest singular values of $XW^{1/2}$. This can be done in $O(n_{\mathrm{sub}}dk)$ time, where $k$ is the dimension of the data. Computing $\Sigma^{-1}X^T$ takes $O(ndk)$ time.
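Putting the pieces together, here is a minimal end-to-end sketch of Algorithm 1 with the regularized probabilities of this section, specialized to least squares; the pilot size, the value of $\alpha$, and the bisection used to find $\lambda$ are illustrative choices of ours, and the sketch assumes $n_{\mathrm{sub}} < n$.

```python
import numpy as np

def regularized_probs(w, n_sub, alpha, tol=1e-10):
    """pi_i = max(alpha, min(1, lambda * w_i)) with sum(pi) approximately n_sub."""
    lo, hi = 0.0, 1.0
    while np.clip(hi * w, alpha, 1.0).sum() < n_sub:   # grow hi until the budget is reachable
        hi *= 2.0
    while hi - lo > tol:                               # bisect on lambda
        mid = (lo + hi) / 2.0
        if np.clip(mid * w, alpha, 1.0).sum() < n_sub:
            lo = mid
        else:
            hi = mid
    return np.clip(hi * w, alpha, 1.0)

def influence_subsample_ols(X, y, n_sub, alpha=1e-3, pilot_size=1000, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pilot estimate from a small uniform sample.
    idx = rng.choice(len(y), size=pilot_size, replace=False)
    theta0 = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    # 2. Influence-based size measure ||(y_i - x_i' theta0) Sigma^{-1} x_i||.
    r = y - X @ theta0
    Sigma = X.T @ X / len(y)
    w = np.abs(r) * np.linalg.norm(np.linalg.solve(Sigma, X.T).T, axis=1)
    # 3. Regularized inclusion probabilities and independent (Poisson) sampling.
    pi = regularized_probs(w, n_sub, alpha)
    keep = rng.random(len(y)) < pi
    # 4. Refit on the subsample with importance weights 1/pi (weighted least squares).
    sw = np.sqrt(1.0 / pi[keep])
    theta_hat = np.linalg.lstsq(X[keep] * sw[:, None], y[keep] * sw, rcond=None)[0]
    return theta_hat, keep, pi
```

In the experiments of section 6, the pilot sample is drawn first and the weighted subsample is taken from the remaining points; the sketch above reuses the full data when computing weights purely for brevity.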
## 5 Error analysis

Our main theoretical result is the optimality of influence-based sampling weights and a derivation of their asymptotic variance. To do this, we first define the class of limit importance sampling distributions in terms of Radon–Nikodym derivatives and establish the conditions under which the asymptotic variance for an importance sampling distribution can be derived. Theorem 1 maps each limit importance sampling distribution to an asymptotic variance. It further shows there is a unique importance sampling distribution that minimizes this asymptotic variance. Our main result, Theorem 2, states that a consistent estimate of the influence function is sufficient to attain this minimum asymptotic variance. The proofs are provided in the supplementary material.

Let $\varphi(\cdot)$ be some real-valued function on distributions in some $P$-Donsker class $\mathcal{F}$. Suppose it is Hadamard differentiable at $P_\theta$ under the uniform norm with influence function $\psi_\theta \in \ell^\infty(\mathcal{F})$. Assume the values $X_i$ are drawn i.i.d. from $P_\theta$. Consider the set of measures $Q$ that are mutually absolutely continuous with respect to $P_\theta$, with $\alpha \le dQ/dP \le 1$ almost everywhere and with total measure of $Q$ equal to some constant $c$. Let $\hat P^Q_n = n^{-1}\sum_{i=1}^n Z_i\,\frac{dP}{dQ}(X_i)\,\delta_{X_i}$ be the resulting estimated empirical measure, where $Z_i$ indicates that $X_i$ is in the subsample and equals 1 with probability $\frac{dQ}{dP}(X_i)$.

These statements are easy to understand when $P, Q$ have densities $p, q$ respectively. In this case, the Radon–Nikodym derivative $dQ/dP = q/p$ exists almost everywhere. The subsampling procedure changes the measure from the natural sampling distribution $P$ to the subsample's distribution, which is proportional to $Q$. The weights $p/q$ are the familiar importance sampling weights for evaluating an integral over $P$ given a sample from $Q$.

**Theorem 1.** For any $Q$ that satisfies the assumptions, $\sqrt{nc}\,\bigl(\varphi(\hat P^Q_n) - \varphi(P)\bigr)$ converges in distribution to $N(0, V_Q)$, where $V_Q = \int \psi(x)\psi(x)^T\left(\frac{dP}{dQ}\right)^2 dQ(x)$. Furthermore, there is a unique measure $Q_{\mathrm{opt}}$ which minimizes this asymptotic variance.

**Theorem 2.** Suppose the pilot estimate of the influence function is consistent, so that $\psi_{\theta_0} = \psi_\theta + o_p(1)$ under the uniform norm. Let $\hat\pi_n$ be estimated regularized inclusion probabilities for a sample of expected size $n_{\mathrm{sub}}$ based on PPS sampling with size measure equal to the norm of the estimated influence function. As $n_{\mathrm{sub}}, n \to \infty$ with $n_{\mathrm{sub}}/n \to c > 0$, the plug-in estimator satisfies $\sqrt{nc}\,\bigl(\varphi(\hat P^{\hat\pi_n}_n) - \varphi(P)\bigr) \to N(0, V_{Q_{\mathrm{opt}}})$ in distribution.

We note this tight optimality result is the only one we are aware of, as other methods only provide probabilistic bounds or rate guarantees, save for the recent work in [27] specifically for logistic regression. Furthermore, probabilistic guarantees are immediately obtained as confidence intervals from our limit result, and these guarantees are tight.

Our optimality result is obtained assuming correct specification of the influence. Other methods may make fewer assumptions, but at the cost of an optimality guarantee. These assumptions are often already made by the downstream model and, in some cases, are weak. For example, the squared error loss induces the same influence regardless of the underlying distribution. However, for quantile regression, the influence may depend on assumptions about how the error distribution changes with respect to the covariates $X$. This misspecification only affects the optimality and not the correctness of the procedure. Even with a misspecified influence, an importance-weighted subsample will still yield an unbiased estimate of the objective.

Figure 3: Results for linear regression on the CASP (left) and NEWS (right) data. Influence-based methods (colored) are top performers on CASP, requiring 1/2 the size of uniform sampling for the same error. As expected, parameter influence sampling outperformed prediction influence sampling for parameter estimation and vice versa for prediction. (Panels plot the MSE of coefficients and the excess MSE of predictions against expected sample size.)
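To illustrate how the limit result translates into the confidence intervals mentioned above, here is one standard construction of ours (not spelled out in the paper): since an asymptotically linear estimator behaves like a mean of influence vectors, each coordinate can be treated as a full-data mean estimated from the Poisson subsample, with the usual Horvitz–Thompson variance estimator supplying the interval width.

```python
import numpy as np

def poisson_ht_mean_ci(values, pi, n, z=1.96):
    """Normal confidence interval for a full-data mean from a Poisson (independent) subsample.

    values: the statistic evaluated at the sampled points (e.g. one coordinate of the
            estimated influence, or a per-point loss),
    pi:     the inclusion probabilities of those sampled points,
    n:      size of the full data set.
    """
    est = np.sum(values / pi) / n                         # Horvitz-Thompson estimate of the mean
    var = np.sum((1.0 - pi) / pi**2 * values**2) / n**2   # HT variance estimate under Poisson sampling
    half = z * np.sqrt(var)
    return est - half, est + half
```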
## 6 Experiments

We compared our sampling procedures with existing methods on three real datasets for least squares and quantile regression models. We did not include median $L_1$ regression because it yields sampling with probability proportional to leverage, which we did not consider to be an interesting comparison. Quantile regression, on the other hand, has a number of unique applications beyond what can be found with a robust version of least squares: for instance, quality of service guarantees, understanding high-value donors for a nonprofit, or other cases where large values are of most interest.

The datasets we used are the CASP [18] ($n = 45730$, $d = 9$) and Online News Popularity ($n = 38644$, $d = 59$) datasets from the UCI repository, which are the same as those used by [29], and the EPA Risk-Screening Environmental Indicators dataset [12] ($n \approx 9.6$M, $d = 10$), a much larger regression dataset. For the Online News Popularity dataset, we removed 4 columns due to collinearity. For the EPA dataset, we used $\log(1 + \mathrm{Score})$ as the response. We expanded the 2 geographic coordinates into $10 \times 10$ cubic tensor B-splines and added log-transformed predictors while removing the other Score variables. This yielded a total of 111 regressors.

In each case, we used a uniform random sample, with size equal to the size of the smallest subsample considered, to derive a pilot estimate, and drew a weighted subsample from the remainder. The quality of the fit on a subsample is measured either by the relative squared error in the coefficients, $\|\hat\theta - \theta_{\mathrm{opt}}\|^2/\|\theta_{\mathrm{opt}}\|^2$, or by the mean excess loss, $E\,\ell(y_i, \hat y_i) - E\,\ell(y_i, \hat y^{(\mathrm{opt})}_i)$.

We performed experiments on both approximate influence and leverage scores as well as their exact counterparts. For the approximate methods, we use a Fast Walsh–Hadamard Transform to project the $n$ rows to $10^4$ rows and compute the appropriate SVD or truncated SVD. We also consider two approximations to the influence: the diagonal approximation for the second moment matrix and the equal leverage approximation or, equivalently, the residual-weighted approximation.

As shown in Figs. 3 and 4, our experiments demonstrate that influence-based sampling outperforms all the other procedures. The improved performance depends both on the design $X_{\mathrm{sub}}$ and the residuals (Fig. 1). When accounting for correlation in the design is costly, simple approximations for the influence still yield significant improvements over uniform sampling (see supplementary materials). Note that while we did not compare run times for these experiments, our technique should take the same time as leverage-based sampling, excluding time used to generate a pilot, as the only additional cost is computing residuals using the fixed pilot.

Figure 4: Results for a 90%-quantile regression model on the CASP data set (left) and linear regression on the large RSEI data set (right) also show approximate influence-based sampling dominating other methods. (Panels plot the MSE of coefficients and the excess MSE of predictions against expected sample size.)
Approximate leverage score sampling was slightly worse than uniform sampling on CASP and much worse on RSEI.

## 7 Discussion

One strength of sampling is the ability to reuse samples in a variety of problems. Since the samples are independent samples with known inclusion probabilities, they can be used in any procedure that takes importance weights. Multi-objective sampling [6] provides ways to sample for multiple objectives and generate a sample that is certifiably good across multiple tasks.

One particularly useful scenario for sampling is the case when a single machine can store and process the subsample. However, generating the sample requires all the importance weights to first be computed and then iteratively normalized and regularized. To generate a sample on the fly, the adaptive threshold sampling technique in [25] can be used both to generate a fixed-size, without-replacement sample and to ensure a minimum sampling probability is satisfied.

Another consequence of this work is that a reasonably good sample can be drawn as a side effect of a stochastic gradient descent (SGD) procedure. A first-order SGD with no rescaling of the updates would generate gradient-based sampling weights, obviating the need for a special pilot at almost no additional cost. Computing just the residual as part of the SGD procedure or applying a diagonal approximation to the Hessian [3] would yield other natural approximations. In the case of a distributed SGD, the individual samples can be combined by a further round of subsampling to satisfy a storage constraint. If points are duplicated across machines, coordinated sampling [7] can be used to combine the samples.

We note that the regularization of probabilities is necessary to address inaccuracies from noisy pilot estimates, but the exact relationship is unknown. As such, we think theoretical analysis of the pilot and regularization would be interesting follow-on work.

## 8 Conclusion

We demonstrate both theoretically and empirically that influence functions yield good and principled subsampling procedures, achieving the best possible asymptotic variance over all independent, without-replacement sampling designs with the same size and regularization. Our approach is also highly general, as it applies to a wide range of statistical and machine learning models. Furthermore, it provides insight into the problem of optimal subsampling, allowing one to recast it as optimal sampling for a mean and to decompose the contributions of the regression design and response. This allows other existing work, such as that on multi-objective sampling, to generate variations of this basic method that can address a wider range of scenarios. It further allows us to show that several methods, including leverage score sampling, can be seen as approximations to our method. We also address computational difficulties with influence-based methods and show that fast approximations perform well.

## References

[1] O. Bachem, M. Lucic, and A. Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.

[2] A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10:1737–1754, 2009.

[3] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, 2010.

[4] K. Clarkson, P. Drineas, M. Magdon-Ismail, M. Mahoney, X. Meng, and D. Woodruff. The fast Cauchy transform and faster robust linear regression. SIAM Journal on Computing, 45(3):763–810, 2016.
[5] K. L. Clarkson. Subgradient and sampling algorithms for ℓ1 regression. In SODA, 2005.

[6] E. Cohen. Multi-objective weighted sampling. In HotWeb, 2015.

[7] E. Cohen and H. Kaplan. What you can do with coordinated samples. In RANDOM, 2013.

[8] A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. Mahoney. Sampling algorithms and coresets for ℓp regression. SIAM Journal on Computing, 38(5):2060–2078, 2009.

[9] P. Dhillon, Y. Lu, D. Foster, and L. Ungar. New subsampling algorithms for fast least squares regression. In NIPS, 2013.

[10] P. Drineas, M. Magdon-Ismail, M. Mahoney, and D. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475–3506, 2012.

[11] W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, and D. Pregibon. Squashing flat files flatter. In KDD, 1999.

[12] U.S. EPA. U.S. EPA risk-screening environmental indicators (RSEI) chemical weights, 2015.

[13] W. Fithian and T. Hastie. Local case-control sampling: Efficient subsampling in imbalanced data sets. Annals of Statistics, 42(5):1693, 2014.

[14] F. R. Hampel. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393, 1974.

[15] J. Huggins, T. Campbell, and T. Broderick. Coresets for scalable Bayesian logistic regression. In NIPS, 2016.

[16] R. Koenker. Quantile Regression. Cambridge University Press, 2005.

[17] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In ICML, 2017.

[18] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

[19] P. Ma, M. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 16:861–911, 2015.

[20] D. Madigan, N. Raghavan, W. DuMouchel, M. Nason, C. Posse, and G. Ridgeway. Likelihood-based data squashing: A modeling approach to instance construction. Data Mining and Knowledge Discovery, 6(2):173–190, 2002.

[21] X. Meng and M. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In STOC, 2013.

[22] X. Meng, M. Saunders, and M. Mahoney. LSRN: A parallel iterative solver for strongly over- or underdetermined systems. SIAM Journal on Scientific Computing, 36(2):C95–C118, 2014.

[23] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, W. Paul, M. I. Jordan, and I. Stoica. Ray: A distributed framework for emerging AI applications. arXiv preprint arXiv:1712.05889, 2017.

[24] G. Raskutti and M. W. Mahoney. A statistical perspective on randomized sketching for ordinary least-squares. Journal of Machine Learning Research, 17(1):7508–7538, 2016.

[25] D. Ting. Adaptive threshold sampling and estimation. arXiv preprint arXiv:1708.04970, 2017.

[26] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.

[27] H. Wang, R. Zhu, and P. Ma. Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association, 113(522):829–844, 2018.

[28] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In HotCloud, 2010.

[29] R. Zhu. Gradient-based sampling: An adaptive importance sampling for least-squares. In NIPS, 2016.