# Importance Sampling for Nonlinear Models

Prakash P. Rajmohan 1, Fred Roosta 2 3

While norm-based and leverage-score-based methods have been extensively studied for identifying important data points in linear models, analogous tools for nonlinear models remain significantly underdeveloped. By introducing the concept of the adjoint operator of a nonlinear map, we address this gap and generalize norm-based and leverage-score-based importance sampling to nonlinear settings. We demonstrate that sampling based on these generalized notions of norm and leverage scores provides approximation guarantees for the underlying nonlinear mapping, similar to linear subspace embeddings. As direct applications, these nonlinear scores not only reduce the computational complexity of training nonlinear models by enabling efficient sampling over large datasets, but also offer a novel mechanism for model explainability and outlier detection. Our contributions are supported by both theoretical analyses and experimental results across a variety of supervised learning scenarios.

1. Introduction

The process of training in machine learning (ML) typically boils down to solving an optimization problem of the form

$$L(\theta) = \sum_{i=1}^{n} \ell(f_i(\theta)), \qquad (1)$$

where $\ell$ is a loss function, and $f_i : \mathbb{R}^p \to \mathbb{R}$ is a potentially nonlinear mapping, parametrized by $\theta$, that represents the ML model evaluated at the $i$th training data point. Throughout this paper, we make the umbrella assumption that $n \gg p$, i.e., we operate in the underparameterized setting. For example, in simple linear regression, $f_i(\theta) = \langle \theta, x_i \rangle - y_i$, where $(x_i, y_i)$ is the $i$th input-output pair. Using the squared loss $\ell(t) = t^2$ gives rise to the familiar linear least-squares problem $L(\theta) = \|X\theta - y\|^2$, where $X \in \mathbb{R}^{n \times d}$ is the input data matrix whose $i$th row is $x_i$, and $y \in \mathbb{R}^n$ is the output vector whose $i$th component is $y_i$.

The growth of ML over the past decade has been largely driven by the explosion in the availability of data. However, this exponential increase in data volume presents significant computational challenges, particularly in solving (1) and developing diagnostic tools for post-training analysis. These challenges have been extensively studied and addressed in simpler linear settings. In this context, randomized numerical linear algebra (RandNLA) has emerged as a powerful paradigm for approximating underlying matrices and accelerating computations. RandNLA has led to significant advancements in algorithms for fundamental matrix problems, including matrix multiplication, least-squares problems, least-absolute deviation, and low-rank matrix approximation, among others. For further reading, see the lecture notes and surveys on this topic, such as Mahoney et al. (2011); Woodruff et al. (2014); Drineas & Mahoney (2018); Martinsson & Tropp (2020); Murray et al. (2023); Dereziński & Mahoney (2024). Arguably, linear least-squares problems have been among the most well-studied problems in RandNLA (Sarlos, 2006; Nelson & Nguyễn, 2013; Meng & Mahoney, 2013; Clarkson & Woodruff, 2017).

1 School of Electrical Engineering and Computer Science, University of Queensland, Brisbane, Australia. 2 School of Mathematics and Physics, University of Queensland, Brisbane, Australia. 3 ARC Training Centre for Information Resilience (CIRES), Brisbane, Australia. Correspondence to: Prakash P. Rajmohan, Fred Roosta.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).
Various randomized approximation techniques, ranging from data-independent oblivious methods (e.g., sketching and projections) to data-dependent non-oblivious sampling methods (e.g., non-uniform leverage score or row-norm sampling), have emerged as powerful tools to speed up computations directly (Woodruff et al., 2014) or to construct preconditioners for downstream linear algebra subroutines (Avron et al., 2010). While these tools have proven remarkably successful in accelerating computations for linear least-squares problems, extending them to more general problems of the form (1) remains challenging. Efforts to push these boundaries into the nonlinear realm often remain ad hoc and limited in scope (Gajjar & Musco, 2021; Erdélyi et al., 2020; Avron et al., 2019; 2017). A key gap lies in the lack of a systematic framework to capture and embed the nonlinear component of the objective function while preserving the critical approximation properties well established in linear embeddings. In this paper, we aim to bridge this gap to some extent. Specifically, we focus on non-uniform sampling in (1). Letting $\theta^*$ and $\theta^*_S$ denote the optimal parameters obtained from training the model over the full dataset and over a, potentially non-uniformly, sampled subset of the data, respectively, our goal is to ensure that, for any small $\varepsilon$, the samples are selected such that

$$L(\theta^*_S) \le L(\theta^*) + O(\varepsilon). \qquad (2)$$

Recent work by Gajjar et al. (2023; 2024) has made progress toward this goal by proposing a one-shot active learning strategy using leverage scores of the data for certain classes of single-neuron predictors. Here, we take a step further by introducing the concept of the adjoint operator of a nonlinear map. This allows us to generalize norm-based and leverage-score-based importance sampling to nonlinear settings, thereby obtaining approximation guarantees of the form (2) in many settings.

Contributions. Our contributions are as follows:

1.
By introducing the nonlinear adjoint operator, we provide a systematic framework for constructing importance sampling scores for (1). While numerical integration can generally approximate these scores, we show that for specific nonlinear models they can be derived directly. 2. We further show that, under certain assumptions, importance sampling based on these scores achieves (2). To our knowledge, this is the first work to extend such guarantees to NNs. 3. To validate the theoretical results, we present experiments for several supervised learning tasks. We show that, beyond reducing computational costs, our framework can be used post hoc for diagnostics, such as identifying important samples and detecting anomalies.

Notation. Vectors and matrices are denoted by bold lowercase and bold uppercase letters, respectively. The $i$th row of a data matrix $X \in \mathbb{R}^{n \times d}$ is denoted by $x_i \in \mathbb{R}^d$. The pseudoinverse of $X$ is denoted by $X^\dagger$, and the $i$th standard basis vector in $\mathbb{R}^n$ is denoted by $e_i$. The spectral norm and Frobenius norm of a matrix are denoted by $\|\cdot\|_2$ and $\|\cdot\|_F$, respectively. The inner product between two vectors is denoted by $\langle \cdot, \cdot \rangle$. We use $O(\cdot)$ for Big-O complexity and $\tilde{O}(\cdot)$ to omit logarithmic factors. The sub-sampled data matrix is represented by $X_S \in \mathbb{R}^{s \times d}$, where $s$ is the number of sub-sampled points.

2. Background and Related Work

Importance Sampling in Linear Models. In linear settings, importance sampling has been widely used for various linear algebra tasks such as matrix multiplication (Drineas et al., 2006a), least-squares regression (Drineas et al., 2006b), and low-rank approximation (Cohen et al., 2017). It approximates large data matrices by prioritizing data points with high information content, reducing computational (Clarkson & Woodruff, 2017; Nelson & Nguyễn, 2013) and storage costs (Iwen et al., 2021; Meng & Mahoney, 2013) while ensuring strong guarantees for downstream tasks. Formally, let $X \in \mathbb{R}^{n \times d}$, $n \gg d$, be a data matrix with row vectors $x_i \in \mathbb{R}^d$.
Each row is assigned a nonnegative weight $q_i \ge 0$ through an importance sampling scheme, with sampling probability $\tau_i = q_i / \sum_{j=1}^{n} q_j$. Consider selecting $s \ll n$ rows of $X$ independently at random, with replacement, according to $\{\tau_i\}$. Let $X_S \in \mathbb{R}^{s \times d}$ be the sub-sampled matrix, whose $i$th row is $x_i / \sqrt{s\,\tau_i}$. If $\{\tau_i\}$ is constructed appropriately, $X_S$ preserves key spectral and geometric properties of $X$ with high probability, acting as a lower-dimensional approximation. This is captured by a linear subspace embedding guarantee: for all $v \in \mathbb{R}^d$,

$$(1-\varepsilon)\,\|Xv\|_2^2 \;\le\; \|X_S v\|_2^2 \;\le\; (1+\varepsilon)\,\|Xv\|_2^2. \qquad (3)$$

Two common approaches for constructing $\{\tau_i\}$ are based on the norms and leverage scores of the rows of $X$. In row-norm sampling, we set $q_i = \|x_i\|_2^2$, meaning rows that contribute more to the overall $\ell_2$-energy, $\|X\|_F^2 = \sum_{i=1}^{n} \|x_i\|_2^2$, are sampled more frequently. Leverage scores quantify the influence of each row $x_i$ on the row space of $X$ and are defined as

$$q_i = \langle e_i, X X^\dagger e_i \rangle = \min_{\alpha \in \mathbb{R}^n} \|\alpha\|_2^2 \quad \text{subject to} \quad X^\top \alpha = x_i.$$

Thus, the $i$th leverage score captures the importance of $x_i$ in spanning the data subspace. These importance sampling schemes are particularly effective when the data exhibits high coherence (Paschou et al., 2007; Mahoney & Drineas, 2009; Gittens & Mahoney, 2013; Fan et al., 2011; Eshragh et al., 2022). It has been well established that, for any $\varepsilon \in (0,1)$, as long as the sample size $s$ is sufficiently large, specifically $s \in O(d \ln d / \varepsilon^2)$, the approximation in (3) holds with high probability (Martinsson & Tropp, 2020). These bounds are tight (up to logarithmic factors) (Woodruff et al., 2014; Chen & Price, 2019). The linear subspace embedding guarantee in (3) forms the foundation for approximating linear least-squares problems (Woodruff et al., 2014). However, for loss functions beyond least-squares, simple linear subspace embedding may not provide the desired guarantees.
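As a concrete illustration of the two linear schemes just described, the following minimal sketch (our own code and naming, not the authors' implementation) computes both score types and forms the rescaled subsample $X_S$:

```python
import numpy as np

def leverage_scores(X):
    # Leverage scores via a thin QR factorization: the unnormalized score
    # q_i = <e_i, X X^+ e_i> equals the squared norm of the i-th row of Q,
    # and the unnormalized scores sum to rank(X).
    Q, _ = np.linalg.qr(X)
    q = np.sum(Q**2, axis=1)
    return q / q.sum()

def row_norm_scores(X):
    # q_i = ||x_i||_2^2, normalized by ||X||_F^2.
    q = np.sum(X**2, axis=1)
    return q / q.sum()

def subsample(X, tau, s, rng):
    # Draw s rows with replacement according to tau and rescale row i by
    # 1/sqrt(s * tau_i), so that E[X_S^T X_S] = X^T X.
    idx = rng.choice(X.shape[0], size=s, p=tau)
    return X[idx] / np.sqrt(s * tau[idx])[:, None]
```

With either choice of $\tau$ and $s$ of order $d \ln d / \varepsilon^2$, the resulting $X_S$ satisfies the subspace embedding guarantee (3) with high probability.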
In such cases, alternative tools like importance sampling scores (e.g., Lewis weights (Apers et al., 2024; Johnson & Schechtman, 2001; Bourgain et al., 1989; Cohen & Peng, 2015)) or coresets (Feldman, 2020; Mirzasoleiman et al., 2020; Lucic et al., 2018; Har-Peled & Mazumdar, 2004) are commonly used. In particular for coresets, Langberg & Schulman (2010) introduced a sensitivity sampling framework that provides foundational coreset guarantees for broad classes of objectives. This approach (akin to importance sampling) assigns each data point a sampling probability proportional to its worst-case influence (sensitivity) on the objective function, yielding a coreset that achieves a $(1 \pm \varepsilon)$ approximation for all queries. Subsequent work has refined this idea for specific loss functions; for instance, Tremblay et al. (2019) combine sensitivity scores with determinantal point processes (DPPs) to reduce redundant draws and thus promote diversity among the selected points. For logistic regression, Munteanu et al. (2018) prove that although no sublinear-size coreset exists in the worst case, a sensitivity-based scheme can yield the first provably sublinear approximation coreset when the dataset's complexity measure $\mu(X)$ is bounded. These results position importance-sampling coresets as the natural extension of linear embeddings when moving beyond least-squares. For more general loss functions $\ell$, some works have extended approximation guarantees similar to (2) to specific problems, such as logistic regression (Samadian et al., 2020; Huggins et al., 2016; Tolochinksy et al., 2022), linear predictor models with hinge-like loss (Mai et al., 2021), $\ell_p$ regression (Chen & Derezinski, 2021; Musco et al., 2022), and kernel regression (Erdélyi et al., 2020).

Importance Sampling in Nonlinear Models.
Importance sampling techniques have long played a crucial role in modern large-scale machine learning tasks, serving as tools to identify the most important samples and to accelerate optimization procedures (Katharopoulos & Fleuret, 2018; Stich et al., 2017; Nabian et al., 2021; Canévet et al., 2016; Liu & Lee, 2017; Meng et al., 2022; Xu et al., 2016; Liu et al., 2024). However, their use in obtaining approximation guarantees of the form (2) has been very limited. To our knowledge, Gajjar et al. (2023; 2024) are among the first to consider nonlinear function families and analyze a leverage-score-based sampling scheme (also referred to as a one-shot active learning sampling scheme) in the context of single-index (or "single neuron") models, where $\ell$ is the squared loss and $f_i(\theta) = \phi(\langle \theta, x_i \rangle)$ for some scalar nonlinearity $\phi : \mathbb{R} \to \mathbb{R}$. For their derivations, the authors employ intricate tools from high-dimensional probability, as the mapping $x \mapsto \phi(\langle w, x \rangle)$ introduces a nonlinearity that invalidates the straightforward matrix Chernoff arguments typically used in linear settings. Gajjar et al. (2023) showed that if $\phi$ is $L$-Lipschitz, one can collect $\tilde{O}(d^2/\varepsilon^4)$ labels, sampled with probabilities proportional to the linear leverage scores of $X$, to guarantee a solution $\theta^*_S$ satisfying, with high probability, the bound

$$L(\theta^*_S) \le C\, L(\theta^*) + O(\varepsilon), \qquad (4)$$

for some sufficiently large constant $C > 0$. They also argued that such additive error bounds are necessary, as achieving a pure multiplicative error bound would require exponentially many samples in $d$. Subsequent work by Gajjar et al. (2024) improved the sample complexity to $\tilde{O}(d/\varepsilon^2)$, matching the linear benchmark up to polylogarithmic terms. Furthermore, they extended the analysis to the setting where $\phi$ is Lipschitz but unspecified, by employing a meticulous sampling-aware discretization of the function class. While the results in Gajjar et al.
(2023; 2024) represent significant contributions toward achieving (2) for nonlinear models, they have a notable drawback: the guarantees take the form (4), which involves an undesirable constant $C \gg 1$ (exceeding 1,000 in Gajjar et al. (2024)). Such large constants undermine the practical utility of the approximation guarantee in (4). We show that, under certain assumptions, it is possible to obtain the more desirable guarantee (2) (where $C = 1$).

3. Nonlinear Approximation: Our Approach

The core structure enabling the use of the subspace embedding property (3) in approximating linear least-squares problems is the linear inner product. Specifically, for $f(\theta) = \langle \theta, x \rangle - y$, the least-squares model can be expressed as

$$\sum_{i=1}^{n} f_i^2(\theta) = \sum_{i=1}^{n} \big(\langle \theta, x_i \rangle - y_i\big)^2 = \big\langle \hat{X}\hat{\theta}, \hat{X}\hat{\theta} \big\rangle,$$

where $\hat{X} = [X \mid -y] \in \mathbb{R}^{n \times (d+1)}$ and $\hat{\theta} = [\theta; 1] \in \mathbb{R}^{d+1}$. This structure allows the subspace embedding property (3) to be applied, enabling the approximation of the original least-squares objective using a carefully chosen subsample of the data. Using the notion of the nonlinear adjoint operator, we now derive an analogous inner-product-like representation for nonlinear mappings. This representation captures the nonlinearity in a novel way, enabling the derivation of nonlinear embeddings similar to (3).

3.1. Nonlinear Adjoint Operator

Consider a mapping $f : \mathbb{R}^p \to \mathbb{R}$ that is continuously differentiable. By the integral form of the mean value theorem,

$$f(\theta) = f(0) + \int_0^1 \big\langle \nabla f(t\theta), \theta \big\rangle\, dt = f(0) + \Big\langle \theta, \int_0^1 \nabla f(t\theta)\, dt \Big\rangle.$$

The above relation is an equation involving functions of $\theta$. The second term includes an inner product between the parameter $\theta$ and a vector-valued map, reminiscent of the role of dual operators in Banach spaces. This observation motivates the following definition, which extends beyond continuously differentiable functions.

Definition 3.1 (Adjoint Operator).
The adjoint operator of $f : \mathbb{R}^p \to \mathbb{R}$ is the vector-valued map $f^* : \mathbb{R}^p \to \mathbb{R}^p$,

$$f^*(\theta) \triangleq \int_0^1 \nabla f(t\theta)\, dt, \qquad (5)$$

for all $\theta \in \mathbb{R}^p$ such that the function $\varphi : [0,1] \to \mathbb{R}$ defined as $\varphi(t) \triangleq f(t\theta)$ is absolutely continuous on $[0,1]$. Note that the absolute continuity condition relaxes the requirement of differentiability to that of being differentiable almost everywhere. With the above definition, we can more compactly write

$$f(\theta) = \big\langle \hat{\theta}, \hat{f}^*(\theta) \big\rangle, \qquad \text{(6a)}$$

where

$$\hat{f}^*(\theta) \triangleq \begin{bmatrix} f^*(\theta) \\ f(0) \end{bmatrix}, \qquad \hat{\theta} \triangleq \begin{bmatrix} \theta \\ 1 \end{bmatrix}. \qquad \text{(6b)}$$

The term nonlinear adjoint operator is motivated by cases where $f(0) = 0$. In such cases, the above expression simplifies to $f(\theta) = \langle \theta, f^*(\theta) \rangle$, which can be viewed as a nonlinear analogue of the Riesz representation theorem in linear settings; see, for example, Buryšková (1998); Scherpen & Gray (2002). In the simplest case of linear regression, i.e., $f(\theta) = \langle \theta, x \rangle - y$, we have $f^*(\theta) = x$; that is, the input data constitutes the space of adjoint operators. Beyond simple linear settings, the explicit calculation of (5) involves evaluating an integral. Naturally, this integral can be approximated using numerical methods, such as quadrature schemes. Fortunately, for many machine learning models, the following property enables the direct computation of (5).

Proposition 3.1. Let $f = g \circ h$, where $g : \mathbb{R} \to \mathbb{R}$ and $h : \mathbb{R}^p \to \mathbb{R}$. Also, assume that $h$ is positively homogeneous of degree $\alpha \in \mathbb{R}$, i.e., $h(t\theta) = t^\alpha h(\theta)$ for any $t > 0$. Then

$$f^*(\theta) = \begin{cases} \dfrac{g(h(\theta)) - g(0)}{\alpha\, h(\theta)}\, \nabla_\theta h(\theta), & \text{if } h(\theta) \neq 0, \\[6pt] \dfrac{g'(0)}{\alpha}\, \nabla_\theta h(\theta), & \text{if } h(\theta) = 0. \end{cases}$$

Proof. The proof can be found in Appendix A.1.

Example 3.1 (Generalized Linear Predictors). Consider generalized linear predictor models, often called single-index models, where $f(\theta) = \phi(\langle \theta, x \rangle)$ for some function $\phi : \mathbb{R} \to \mathbb{R}$ and data point $x \in \mathbb{R}^d$. These models are foundational in studying complex nonlinear predictors in high dimensions (Hristache et al., 2001; Härdle et al., 2004; Kakade et al., 2011; Bietti et al., 2022).
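As a quick numerical sanity check of Proposition 3.1 for the single-index case just introduced ($g = \phi$, $h(\theta) = \langle \theta, x \rangle$, so $\alpha = 1$), the sketch below compares the resulting closed form against a direct trapezoidal discretization of the integral in Definition 3.1. All function names are our own, and the logistic $\phi$ is merely one convenient choice:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def adjoint_closed_form(theta, x, phi):
    # Proposition 3.1 with h(theta) = <theta, x> (alpha = 1):
    # f*(theta) = (phi(<theta,x>) - phi(0)) / <theta,x> * x, for <theta,x> != 0.
    z = theta @ x
    return (phi(z) - phi(0.0)) / z * x

def adjoint_quadrature(theta, x, dphi, num=20001):
    # Definition 3.1: f*(theta) = int_0^1 grad f(t*theta) dt, where
    # grad f(t*theta) = phi'(t <theta,x>) x; trapezoidal rule on [0, 1].
    z = theta @ x
    t = np.linspace(0.0, 1.0, num)
    y = dphi(t * z)
    integral = np.sum((y[:-1] + y[1:]) * 0.5 * np.diff(t))
    return integral * x
```

For the logistic $\phi$, the closed form further reduces to $f^*(\theta) = \tanh(\langle \theta, x \rangle / 2) / (2\langle \theta, x \rangle)\, x$, the expression noted in Example 3.1.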
By limiting networks to a single hidden unit, they provide a controllable instance of neural network architecture while retaining simplicity (Goel et al., 2017). In scientific machine learning, such models underpin efficient surrogate methods for parametric partial differential equations, where the cost of labeled queries can be prohibitive (Cohen & DeVore, 2015; O'Leary-Roseberry et al., 2022). Note that here $f(\theta) = g(h(\theta))$ where $g(t) = \phi(t)$ and $h(\theta) = \langle \theta, x \rangle$. Since the positive homogeneity degree of $h$ is $\alpha = 1$, we have

$$f^*(\theta) = \frac{\phi(\langle \theta, x \rangle) - \phi(0)}{\langle \theta, x \rangle}\, x.$$

We also get $\hat{f}^*(\theta) = [f^*(\theta); \phi(0)]$. A notable example is the logistic function $\phi(t) = 1/(1 + \exp(-t))$, for which

$$f^*(\theta) = \frac{\tanh(\langle \theta, x \rangle / 2)}{2\,\langle \theta, x \rangle}\, x.$$

Example 3.2 (ReLU Neural Networks). Suppose $r(\theta) = \phi(\psi(\theta))$, where $\phi : \mathbb{R} \to \mathbb{R}$ and $\psi(\theta) \triangleq a \max\{\langle b, x \rangle, 0\}$ for $x \in \mathbb{R}^d$ and $\theta = [a; b] \in \mathbb{R}^{d+1}$. Here, the positive homogeneity degree of $\psi$ is $\alpha = 2$. Hence,

$$r^*(\theta) = \int_0^1 \nabla_\theta\, \phi\big(t a \max\{\langle t b, x \rangle, 0\}\big)\, dt = \frac{\phi\big(a \max\{\langle b, x \rangle, 0\}\big) - \phi(0)}{2\, a \max\{\langle b, x \rangle, 0\}} \begin{bmatrix} \max\{\langle b, x \rangle, 0\} \\ a\, x\, \mathbb{1}\{\langle b, x \rangle > 0\} \end{bmatrix}. \qquad (8)$$

In the typical case where $\phi(z) = z$, this simplifies to

$$r^*(\theta) = \frac{1}{2} \begin{bmatrix} \max\{\langle b, x \rangle, 0\} \\ a\, x\, \mathbb{1}\{\langle b, x \rangle > 0\} \end{bmatrix}.$$

Now, consider a two-layer neural network with $m$ hidden neurons, ReLU activation, and a single output,

$$f(\theta) = \sum_{j=1}^{m} r(\theta_j) = \sum_{j=1}^{m} \phi\big(a_j \max\{\langle b_j, x \rangle, 0\}\big),$$

where $\theta_j = [a_j; b_j]$ and $\theta = [\theta_1; \ldots; \theta_m]$. We have

$$f^*(\theta) = \int_0^1 \nabla_\theta \sum_{j=1}^{m} \phi\big(t a_j \max\{\langle t b_j, x \rangle, 0\}\big)\, dt = \sum_{j=1}^{m} e_j \otimes r^*(\theta_j) = \begin{bmatrix} r^*(\theta_1) \\ r^*(\theta_2) \\ \vdots \\ r^*(\theta_m) \end{bmatrix},$$

where $\otimes$ denotes the Kronecker product and $r^*(\theta_j)$ is given in (8). Finally, we get $\hat{f}^*(\theta) = [f^*(\theta); m\,\phi(0)]$.

3.2. Parameter-dependent Approximations

The inner-product-like representation of a nonlinear function in (6) enables the development of sampling strategies for approximating general loss functions beyond linear least-squares and extends the concept of linear subspace embeddings to nonlinear settings. We focus on nonlinear least-squares problems, where $\ell(t) = t^2$ and (1) is expressed as $L(\theta) = \sum_{i=1}^{n} (f_i(\theta))^2$, with extensions to more general nonlinear losses discussed in Appendix A.3.
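To make the two-layer ReLU construction of Example 3.2 concrete, the following sketch (our own code; identity output $\phi(z) = z$, so $\phi(0) = 0$) stacks the per-neuron adjoints into $f^*(\theta)$ and exposes the inner-product identity $f(\theta) = \langle \theta, f^*(\theta) \rangle$:

```python
import numpy as np

def relu_unit_adjoint(a, b, x):
    # Example 3.2 with phi(z) = z: psi(theta) = a * max{<b,x>, 0} is
    # 2-homogeneous, so r*(theta_j) = (1/2) [max{<b,x>,0} ; a * 1{<b,x> > 0} * x].
    z = b @ x
    relu = max(z, 0.0)
    active = 1.0 if z > 0.0 else 0.0
    return np.concatenate(([0.5 * relu], 0.5 * a * active * x))

def two_layer_adjoint(theta, x, d):
    # theta stacks per-neuron parameters [a_1; b_1; ...; a_m; b_m];
    # f*(theta) stacks the per-neuron adjoints r*(theta_j).
    blocks = theta.reshape(-1, d + 1)
    return np.concatenate([relu_unit_adjoint(t[0], t[1:], x) for t in blocks])

def two_layer_forward(theta, x, d):
    # f(theta) = sum_j a_j * max{<b_j, x>, 0}.
    blocks = theta.reshape(-1, d + 1)
    return sum(t[0] * max(t[1:] @ x, 0.0) for t in blocks)
```

Since $\phi(0) = 0$ here, the augmented adjoint $\hat{f}^*(\theta)$ simply appends a zero, and $f(\theta) = \langle \theta, f^*(\theta) \rangle$ holds directly.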
Definition 3.2 (Nonlinear Dual Matrix). Given the nonlinear maps $f_i : \mathbb{R}^p \to \mathbb{R}$, $i = 1, \ldots, n$, the nonlinear dual matrix operator, $F^* : \mathbb{R}^p \to \mathbb{R}^{n \times p}$, is defined as

$$F^*(\theta) \triangleq \begin{bmatrix} f_1^*(\theta) & f_2^*(\theta) & \ldots & f_n^*(\theta) \end{bmatrix}^\top.$$

Using (6b), we also define $\hat{F}^*(\theta) \triangleq \big[\hat{f}_1^*(\theta)\;\; \hat{f}_2^*(\theta)\;\; \ldots\;\; \hat{f}_n^*(\theta)\big]^\top$. With (6) and Definition 3.2, we can now write

$$L(\theta) = \sum_{i=1}^{n} (f_i(\theta))^2 = \sum_{i=1}^{n} \big\langle \hat{\theta}, \hat{f}_i^*(\theta) \big\rangle^2 = \big\| \hat{F}^*(\theta)\,\hat{\theta} \big\|^2. \qquad (9)$$

This formulation resembles the ordinary least-squares setup, with the nonlinear dual matrix $\hat{F}^*(\theta)$ replacing the original input data matrix. This motivates the following notions of importance sampling scores, which generalize standard leverage scores or row-norms to those based on the rows of the dual matrix $\hat{F}^*(\theta)$.

Definition 3.3 (Nonlinear Leverage Scores). The nonlinear leverage score of $f_i$ is defined as

$$\tau_i(\theta) \triangleq \frac{\big\langle e_i, \hat{F}^*(\theta)\,\big[\hat{F}^*(\theta)\big]^\dagger e_i \big\rangle}{\mathrm{Rank}\big(\hat{F}^*(\theta)\big)}.$$

Definition 3.4 (Nonlinear Norm Scores). The nonlinear norm score of $f_i$ is defined as

$$\tau_i(\theta) \triangleq \frac{\big\|\hat{f}_i^*(\theta)\big\|^2}{\big\|\hat{F}^*(\theta)\big\|_F^2}.$$

It is easy to verify that, for both Definitions 3.3 and 3.4, the values $\{\tau_i(\theta)\}_{i=1}^{n}$ form a probability distribution, i.e., $\tau_i(\theta) \ge 0$ and $\sum_i \tau_i(\theta) = 1$.

Remark 3.1. Definitions 3.3 and 3.4 generalize their linear counterparts. Specifically, for linear models where $f(\theta) = \langle \theta, x \rangle$, these definitions reduce to the standard linear leverage and row-norm scores. To the best of our knowledge, this represents the first systematic extension of these concepts from linear to nonlinear settings.

Remark 3.2. While Example 3.1 and Example 3.2 illustrate explicit computations of the adjoint operator via Proposition 3.1, our approach conceptually extends to broader classes of nonlinear functions through the general Definition 3.1. When an explicit adjoint form is available, importance scores (row-norm or leverage) are computed directly from the nonlinear dual matrix via standard linear algebra routines (e.g., QR or SVD factorizations).
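For the single-index model of Example 3.1, the dual matrix and both score types from Definitions 3.3 and 3.4 can be assembled in a few lines (a sketch under our own naming; the logistic activation is illustrative only):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def dual_matrix(theta, X, phi):
    # Row i of F_hat*(theta) is [f_i*(theta) ; phi(0)], where for the
    # single-index model f_i(theta) = phi(<theta, x_i>) (Example 3.1):
    # f_i*(theta) = (phi(z_i) - phi(0)) / z_i * x_i, with z_i = <theta, x_i>.
    z = X @ theta                          # assumes all z_i != 0
    w = (phi(z) - phi(0.0)) / z
    return np.hstack([w[:, None] * X, np.full((X.shape[0], 1), phi(0.0))])

def nonlinear_leverage_scores(F_hat):
    # Definition 3.3, via a thin QR factorization of the dual matrix.
    Q, _ = np.linalg.qr(F_hat)
    q = np.sum(Q**2, axis=1)
    return q / q.sum()

def nonlinear_norm_scores(F_hat):
    # Definition 3.4: squared row norms over the squared Frobenius norm.
    q = np.sum(F_hat**2, axis=1)
    return q / q.sum()
```

By identity (9), $\|\hat{F}^*(\theta)\hat{\theta}\|^2$ with $\hat{\theta} = [\theta; 1]$ recovers $\sum_i f_i(\theta)^2$ exactly, which gives a direct way to validate the assembled matrix.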
In scenarios where the adjoint lacks a closed-form expression, we can approximate it numerically using Definition 3.1; the resulting dual matrix is then factorized similarly with QR or SVD methods. Let $S$ be the index set of $s$ samples, obtained by randomly sampling $s \ll n$ data points with replacement, using probabilities defined by one of the sampling distributions in Definitions 3.3 and 3.4. Note that $S$ depends on the choice of $\theta$ (this is addressed later in Section 3.3.1). The loss on the sampled data is defined as

$$L_S(\theta) \triangleq \sum_{i \in S} \frac{(f_i(\theta))^2}{s\, \tau_i(\theta)},$$

which is an unbiased estimator of the true loss. Similar to the derivation of (9), we have $L_S(\theta) = \|\hat{F}_S^*(\theta)\,\hat{\theta}\|^2$, where $\hat{F}_S^*(\theta) \in \mathbb{R}^{s \times (p+1)}$ is a subset of $\hat{F}^*(\theta)$, consisting of the rows indexed by $S$, with the $i$th row scaled by $1/\sqrt{s\,\tau_i(\theta)}$. The representation in (9), which relates the output of a nonlinear function to the norm of a matrix-vector product, enables the use of existing results on approximate matrix multiplication and linear subspace embedding to derive an approximation bound on $L_S(\theta)$; see, for example, Woodruff et al. (2014); Drineas & Mahoney (2018); Martinsson & Tropp (2020). Specifically, for a fixed $\theta$ and any of the sampling distributions mentioned, if the sample size is $O\big(p \log(p/\delta)/\varepsilon^2\big)$, then with probability at least $1 - \delta$, for all $v \in \mathbb{R}^{p+1}$,

$$(1-\varepsilon)\,\big\|\hat{F}^*(\theta)\,v\big\|^2 \le \big\|\hat{F}_S^*(\theta)\,v\big\|^2 \le (1+\varepsilon)\,\big\|\hat{F}^*(\theta)\,v\big\|^2,$$

which in turn implies, with probability at least $1 - \delta$,

$$(1-\varepsilon)\,L(\theta) \le L_S(\theta) \le (1+\varepsilon)\,L(\theta). \qquad (10)$$

3.3. Parameter-independent Approximation

The relation (10) and the optimality of $\theta^*_S$ with respect to $L_S(\cdot)$ allow us to take initial steps towards (2), as

$$L_S(\theta^*_S) \le L_S(\theta^*) \le (1+\varepsilon)\,L(\theta^*). \qquad (11)$$

However, the key missing component is that the left-hand side of (2) involves the loss evaluated over the full dataset, whereas the left-hand side of (11) is restricted to the loss over the sampled subset. The chain of inequalities to obtain (2) would be complete if we could find a sensible lower bound for $L_S(\theta^*_S)$ in terms of $L(\theta^*_S)$.
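The sampled loss $L_S$ is easy to simulate (a sketch with our own names; the f-values stand in for any nonlinear model evaluated at a fixed $\theta$). A convenient sanity property: each summand has expectation $L(\theta)/s$ under any sampling distribution, and when $\tau_i \propto f_i(\theta)^2$ (the norm-score flavor) every summand equals $L(\theta)/s$, so the estimator is exact:

```python
import numpy as np

def subsampled_loss(fvals, tau, idx):
    # L_S(theta) = sum_{i in S} f_i(theta)^2 / (s * tau_i): an unbiased
    # estimator of L(theta) = sum_i f_i(theta)^2 for any distribution tau.
    s = len(idx)
    return np.sum(fvals[idx]**2 / (s * tau[idx]))

rng = np.random.default_rng(4)
fvals = rng.standard_normal(200)       # stand-ins for f_i(theta)
L = np.sum(fvals**2)

# Probabilities proportional to f_i(theta)^2 make the estimator
# zero-variance: every summand equals L / s.
tau = fvals**2 / L
idx = rng.choice(fvals.size, size=25, p=tau)
```

The zero-variance case is of course idealized (it requires knowing $f_i(\theta)$ exactly), but it illustrates why score-proportional sampling concentrates so much better than uniform sampling.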
Additionally, (11) requires evaluating the importance sampling scores defined in Definitions 3.3 and 3.4 at $\theta^*$, which is inherently unknown. Therefore, to achieve (2), we must address two critical challenges: 1. Approximating the nonlinear scores defined in Definitions 3.3 and 3.4 with quantities that are independent of the parameter $\theta$. 2. Establishing a meaningful lower bound for $L_S(\theta^*_S)$ in terms of $L(\theta^*_S)$. Our next tasks involve addressing these challenges.

3.3.1. Estimating Sampling Scores

The relation in (11) requires calculating the scores from Definitions 3.3 and 3.4 at $\theta^*$, which is infeasible since $\theta^*$ is unknown. A solution is to approximate these nonlinear scores with ones independent of $\theta$. Sampling with near-optimal probabilities has been studied in RandNLA (Drineas & Mahoney, 2018; Woodruff et al., 2014), where it is known that if leverage or row-norm scores, $\{\tau_i\}$, are approximated by $\{\hat{\tau}_i\}$ such that $\beta\,\tau_i \le \hat{\tau}_i$ for some $\beta \in (0,1]$, the subspace embedding property (3) holds with sample size $O\big(p \log(p/\delta)/(\beta \varepsilon^2)\big)$. If the nonlinear scores can be similarly approximated in a manner independent of $\theta$, we can sample according to the approximate scores and still obtain a guarantee of the form (10) for any $\theta$. Within the context of the specific models considered in Examples 3.1 and 3.2, we demonstrate that leveraging the structure of the nonlinear adjoint operators enables the estimation of the nonlinear scores.

Example 3.3 (Generalized Linear Predictors). Consider the class of generalized linear predictor models from Example 3.1 with any activation function $\phi$ such that there exist constants $0 < l < u < \infty$ for which $l \le \phi^2(t)/t^2 \le u$ for all $t \in T$, where $T$ is some set of interest. An example is the Swish-type activation function given by

$$\phi(t) = t\left(c_1 + (c_2 - c_1)\,\frac{1}{1 + e^{-\zeta t}}\right)$$

for some $0 \le c_1 < c_2$ and $\zeta \in \mathbb{R}$. If $c_1 > 0$, we can take $l = c_1^2$ and $u = c_2^2$. For $c_1 = 0$, i.e., the typical Swish function, we can still define $0 < l \le \min_{t \in T} \phi^2(t)/t^2$, provided that $T$ is a bounded set.
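For the Swish-type activation above with $c_1 > 0$, the relation between the nonlinear and linear leverage scores can be probed numerically. The sketch below (our own code; $\phi(0) = 0$ here, so we can work with $F^*(\theta)$ directly, without the augmented column) checks empirically that every nonlinear leverage score is at least $(c_1/c_2)^2$ times its linear counterpart, a parameter-independent bound of exactly the kind sought in this section:

```python
import numpy as np

def swish_type(t, c1, c2, zeta=1.0):
    # phi(t) = t * (c1 + (c2 - c1) * sigmoid(zeta * t)); phi(t)/t lies in (c1, c2).
    return t * (c1 + (c2 - c1) / (1.0 + np.exp(-zeta * t)))

def leverage(M):
    # Unnormalized leverage scores via thin QR (they sum to rank(M)).
    Q, _ = np.linalg.qr(M)
    return np.sum(Q**2, axis=1)

rng = np.random.default_rng(5)
n, d, c1, c2 = 200, 5, 0.5, 2.0
X = rng.standard_normal((n, d))
theta = rng.standard_normal(d)

z = X @ theta
F = (swish_type(z, c1, c2) / z)[:, None] * X   # rows f_i*(theta); phi(0) = 0
tau_nonlinear = leverage(F) / d
tau_linear = leverage(X) / d
# Elementwise: tau_nonlinear >= (c1 / c2)**2 * tau_linear.
```

The bound follows because each row of $F^*(\theta)$ is a rescaling of the corresponding row of $X$ by a factor whose square lies in $(c_1^2, c_2^2)$, so the nonlinear Gram matrix is sandwiched between $c_1^2 X^\top X$ and $c_2^2 X^\top X$.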
Consider leverage score sampling according to Definition 3.3. Suppose $X \in \mathbb{R}^{n \times d}$ and $F^*(\theta) \in \mathbb{R}^{n \times d}$ are both full column rank. Since

$$[F^*(\theta)]^\top F^*(\theta) = \sum_{i=1}^{n} \left(\frac{\phi(\langle \theta, x_i \rangle)}{\langle \theta, x_i \rangle}\right)^2 x_i x_i^\top,$$

it follows that, as long as $c_1 > 0$ or $\theta \in C$ for some bounded set $C$, we have

$$0 \prec l\, X^\top X \preceq [F^*(\theta)]^\top F^*(\theta) \preceq u\, X^\top X.$$

Hence,

$$d\,\tau_i(\theta) = \Big\langle f_i^*(\theta), \big([F^*(\theta)]^\top F^*(\theta)\big)^{-1} f_i^*(\theta) \Big\rangle \ge \frac{1}{u}\Big\langle f_i^*(\theta), \big(X^\top X\big)^{-1} f_i^*(\theta) \Big\rangle \ge \frac{l}{u}\Big\langle x_i, \big(X^\top X\big)^{-1} x_i \Big\rangle = \frac{l}{u}\, d\,\tau_i,$$

where $\tau_i$ is the linear leverage score. This implies $\beta = l/u$.

Example 3.4 (ReLU Neural Networks). Revisiting the NN model in Example 3.2, we consider row-norm sampling according to Definition 3.4. Suppose $\phi$ is such that $c_1 \le (\phi(t) - \phi(0))^2/t^2 \le c_2$ for some $0 < c_1 \le c_2 < \infty$ and for all $t \in T$ for some set of interest $T$, e.g., a Swish-type or linear output layer. Recall that $\theta = [\theta_1; \ldots; \theta_m]$, where $\theta_j = [a_j; b_j]$. Also denote $\theta^* = [\theta^*_1; \ldots; \theta^*_m]$, where $\theta^*_j = [a^*_j; b^*_j]$. Suppose $a^*_j \neq 0$ for all $j \in \{1, 2, \ldots, m\}$. Let $l > 0$ be such that $\min_j (a^*_j)^2 \ge l$, and let $0 < u < \infty$ be such that $\sum_{j=1}^{m} \big(\|b^*_j\|^2 + (a^*_j)^2\big) \le u$. Define the set

$$C \triangleq \Big\{ [a_1; b_1; \ldots; a_m; b_m] \;\Big|\; \min_{j=1,\ldots,m} (a_j)^2 \ge l, \;\; \sum_{j=1}^{m} \big(\|b_j\|^2 + (a_j)^2\big) \le u \Big\}.$$

Clearly, by construction, $\theta^* \in C$. After some algebraic manipulations (see Appendix A.2), for any $x_i$ and any $\theta \in C$, we have

$$c_1\, l\, \|x_i\|^2 \le \big\|f_i^*(\theta)\big\|^2 \le c_2\, u\, \|x_i\|^2,$$

where $f_i^*(\theta)$ is the adjoint operator of $f_i(\theta) = \sum_{j=1}^{m} \phi_i\big(a_j \max\{\langle b_j, x_i \rangle, 0\}\big)$. This implies that

$$\tau_i(\theta) = \frac{\big\|\hat{f}_i^*(\theta)\big\|^2}{\big\|\hat{F}^*(\theta)\big\|_F^2} \ge \frac{\min\{c_1 l, 1\}}{\max\{c_2 u, 1\}}\, \tau_i,$$

where $\tau_i$ is the norm score for the $i$th row of

$$\hat{X} \triangleq \begin{bmatrix} x_1^\top & m\,\phi_1(0) \\ x_2^\top & m\,\phi_2(0) \\ \vdots & \vdots \\ x_n^\top & m\,\phi_n(0) \end{bmatrix}.$$

This implies $\beta = \min\{c_1 l, 1\}/\max\{c_2 u, 1\}$.

3.3.2. Lower Bounding $L_S(\theta^*_S)$

Let $\theta^*$ denote a solution to (1), and consider a ball of radius $R$, chosen large enough to contain $\theta^*$, denoted by $B_R$. For any $\varepsilon \in (0,1)$, we pick a discrete subset $N_\varepsilon \subset B_R$ such that, for every $\theta \in B_R$, there exists at least one $\theta' \in N_\varepsilon$ satisfying $\|\theta - \theta'\|_2 \le \varepsilon R$. The size of this set can be shown to be of order $1/\varepsilon^p$. This construction is well-known and is commonly referred to as an $\varepsilon$-net; see Appendix A.4.
This $\varepsilon$-net construction enables a union bound argument to control the probability of (10) for all $\theta \in N_\varepsilon$. Suppose $S$ is sampled according to a uniform estimate of the nonlinear scores, as described in Section 3.3.1, so that $S$ no longer depends on the choice of $\theta$. If, for a given $\theta$, (10) fails to hold with probability at most $\delta'$, then a union bound ensures that the overall failure probability of (10) over all $\theta \in N_\varepsilon$ cannot exceed $|N_\varepsilon|\,\delta'$. By choosing $\delta' = \delta/|N_\varepsilon| \propto \delta\,\varepsilon^p$, we ensure (10) holds with success probability at least $1 - \delta$ for all $\theta \in N_\varepsilon$. Suppose $L_S(\cdot)$ is continuous on $B_R$, which is almost always guaranteed for many loss functions in ML. The compactness of the ball implies that $L_S(\cdot)$ is also Lipschitz continuous with respect to $\theta$ on $B_R$; i.e., for any set of samples $S$, there exists a constant $0 \le L(f, X, S, R) < \infty$ such that for any $\theta, \theta' \in B_R$,

$$\big|L_S(\theta) - L_S(\theta')\big| \le L(f, X, S, R)\, \|\theta - \theta'\|_2.$$

Note that the Lipschitz constant may depend on $X$, $S$, $f$, and the radius of the ball $B_R$. Define $L(f, X, R) \triangleq \max_S L(f, X, S, R)$. Note that this is a maximization over a finite collection of numbers¹ and hence $L(f, X, R) < \infty$. Now, suppose $C \subseteq B_R$ with $\theta^* \in C$, and let

$$\theta^*_S \in \arg\min_{\theta \in C} L_S(\theta). \qquad (12)$$

Also, let $\theta_0 \in N_\varepsilon$ be such that $\|\theta^*_S - \theta_0\|_2 \le \varepsilon R$. Without loss of generality, we can assume $\theta_0 \in N_\varepsilon \cap C$, as otherwise we can simply increase the size of the net inside $C$ by $O(|N_\varepsilon|)$. From Lipschitz continuity and (10), we get

$$L(\theta^*_S) \le L(\theta_0) + \varepsilon R\, L(f, X, R) \le \frac{1}{1-\varepsilon}\, L_S(\theta_0) + \varepsilon R\, L(f, X, R) \le \frac{1}{1-\varepsilon}\, \big( L_S(\theta^*_S) + \varepsilon R\, L(f, X, R) \big) + \varepsilon R\, L(f, X, R),$$

which gives the lower bound

$$L_S(\theta^*_S) \ge (1-\varepsilon)\, L(\theta^*_S) - \varepsilon(2-\varepsilon)\, R\, L(f, X, R).$$

Combining this with (11), we get

$$L(\theta^*_S) \le L(\theta^*) + \frac{\varepsilon}{1-\varepsilon}\, \Big( 2\,L(\theta^*) + (2-\varepsilon)\, R\, L(f, X, R) \Big).$$

The culmination of the above discussions and derivations leads to the main result of this paper.

¹ The total number of possible unordered samples of any size from $n$ objects with replacement is $\sum_{s=1}^{n} \binom{n+s-1}{s}$.

Theorem 3.1 (Main Result).
Suppose $\theta^* \in C$ is a solution to (1) with squared loss $\ell(t) = t^2$, and suppose for some $0 < \beta \le 1$ we have $\beta\,\tau_i(\theta) \le \tau_i$ for $i = 1, 2, \ldots, n$ and for all $\theta \in C$. Consider random sampling according to $\{\tau_i\}$ with a sample $S$ of size at least $O\big(\big(p \log(p/\delta) + p^2 \log(p/\varepsilon)\big)/(\beta \varepsilon^2)\big)$ for some $\varepsilon \in (0,1)$ and $\delta \in (0,1)$. Let $\theta^*_S$ be defined as in (12). Then (2) holds with probability at least $1 - \delta$.

Remark 3.3. Theorem 3.1 applies more generally than to single-index models. However, when adapted to Example 3.3, it provides a result similar to Gajjar et al. (2024). While both works have the same dependence on $\varepsilon$, our result has quadratic dependence on the dimension, while theirs is linear. Additionally, our approach involves a constrained optimization subproblem (12), whereas the subproblems of Gajjar et al. (2024) are unconstrained. In contrast, Gajjar et al. (2024) provided a guarantee of the form (4) with a potentially large constant $C \gg 1$, while Theorem 3.1 provides the more desirable guarantee of the form (2).

4. Experiments

We conduct a series of experiments to demonstrate the effectiveness and versatility of the proposed nonlinear importance scores. First, we evaluate our approach on several benchmark regression datasets, including California Housing Prices (Pace & Barry, 1997), the Medical Insurance dataset (Lantz, 2019), and the Diamonds dataset (Wickham & Sievert, 2009). As a proof of concept, we demonstrate that our importance sampling method outperforms traditional linear and uniform sampling strategies by achieving lower training error using fewer samples. Second, we consider image classification using four standard datasets: SVHN (Street View House Numbers) (Netzer et al., 2011), FER-2013 (Facial Expression Recognition) (Goodfellow et al., 2013), NotMNIST (Bulatov, 2011), and QD (Quick, Draw!) (Ha & Eck, 2018).
Here, we demonstrate that our proposed nonlinear leverage scores in Definition 3.3 serve as a powerful diagnostic tool, identifying the most important points in training complex models and detecting anomalies by isolating outliers. We employ both single-index models and neural networks to validate our findings across varying model complexities and architectures. Additional experimental details are provided in Appendix A.5. The code is available here.

Regression Tasks. We compare the efficacy of three sampling methods: uniform sampling, linear importance sampling (based on leverage scores and row-norms of the original dataset), and nonlinear importance sampling as defined in Definitions 3.3 and 3.4. While computing nonlinear importance scores directly is impractical in real-world scenarios, these experiments serve to validate that such scores effectively identify critical samples. As shown in Figure 1, nonlinear importance sampling more effectively identifies impactful samples, achieving lower MSE with fewer training points compared to uniform and linear sampling. The Y-axis for regression tasks reports relative error on a logarithmic scale, so even small vertical shifts represent substantial absolute improvements.

Classification Tasks. We convert the datasets into binary classification tasks, comparing visually similar ("like") and dissimilar ("unlike") classes in SVHN and NotMNIST. "Like" classes are those that are harder to distinguish without prior knowledge, while "unlike" classes are more easily separable. In SVHN, we examine two scenarios: distinguishing the clearly distinct digits 1 vs. 0 and the more visually similar 1 vs. 7. Similarly, in NotMNIST, we consider the letter pairs (A vs. B) and (B vs. D). We also analyze more complex, noisy datasets like QD, where we distinguish between "Fish" and "Car". For FER, which groups expressions by valence (positive vs.
negative) based on (Mollahosseini et al., 2017), we track the evolution of important samples as the model approaches the optimal parameters (approximately 55% accuracy at initialization, then 65%, and 75% accuracy at termination). Figure 2 presents images with the highest (most unique) and lowest (easiest to classify) nonlinear leverage scores for each binary task. The results clearly show that samples with higher nonlinear leverage scores contain distinct patterns and are harder to classify, while those with lower scores are straightforward. In contrast, standard linear leverage scores select more random and less insightful samples. Notably, our method identifies mislabeled and noisy samples with high scores, which represent the outliers. In FER, under-trained models highlight blank or extreme valence, whereas trained models detect subtler emotions and facial characteristics such as accessories, tears, and aging. To demonstrate that nonlinear scores also yield robust quantitative evidence for our classification experiments, we provide additional numerical results in Appendix A.6. Additional images supporting Figure 2 for all datasets are given in Appendix A.7.

Figure 1. Comparison of sampling strategies on (a) the California Housing dataset, (b) the Medical Insurance dataset, and (c) the Diamonds dataset. The Y-axis shows log[(L(θ_S) − L(θ∗))/L(θ∗)] against sample size (as a percentage of total data), where θ_S is the optimal parameter from training on sampled data. RN, LS, and UN denote Row Norm, Leverage Scores, and Uniform Sampling, respectively, with L and A indicating linear and adjoint-based nonlinear variants. Nonlinear importance scores consistently outperform all other alternatives.

[Figure 2 panels: (a) SVHN High (1 vs 0); (b) SVHN Low (1 vs 0); (c) SVHN High (1 vs 7); (d) SVHN Low (1 vs 7); (e) NOTMNIST High (A vs B); (f) NOTMNIST Low (A vs B); (g) NOTMNIST High (B vs D); (h) NOTMNIST Low (B vs D); (i) QD High; (k) FER High; (l) FER Low]

Figure 2.
Comparison of high and low linear/nonlinear leverage scores across multiple datasets. "High" and "Low" refer to images with the highest and lowest scores, respectively. In subfigures (a)-(j), the top row shows images selected using nonlinear leverage scores (Definition 3.3), while the bottom row uses linear leverage scores. Samples with higher nonlinear scores contain distinct patterns and are harder to classify, while lower scores correspond to straightforward samples. In contrast, linear scores select less insightful samples. Notably, our method identifies mislabeled and noisy samples (outliers) with high scores. In FER, the top three rows represent the scores calculated at 75% (final), 65%, and 55% (initial) training accuracy, while the bottom row uses linear scores. Under-trained models highlight extreme expressions, while fully trained models detect subtler emotions and facial features such as accessories, tears, and aging.

5. Conclusions and Further Thoughts

We introduced a unifying framework that extends importance sampling from linear to more general nonlinear models through the notion of the adjoint operator of a nonlinear map. This perspective yields sampling schemes with approximation guarantees analogous to linear subspace embeddings, yet it is applicable to a wide range of models. Our theoretical analysis shows that these generalized scores offer strong performance bounds, and our experiments demonstrate concrete benefits in reduced training costs, improved diagnostics, and outlier detection. While we did not explicitly address active learning or transfer learning scenarios in this paper, our nonlinear importance sampling approach can naturally be utilized in such contexts. In particular, the method of subsampling based on nonlinear scores can be viewed as a one-shot active learning strategy, where selecting the most informative samples significantly reduces labeling costs.
Similarly, these nonlinear scores could also be beneficial in analyzing transfer learning, by effectively identifying samples from a source domain that best represent the characteristics of a target domain. Although Appendix A.3 outlines preliminary steps toward extending the theoretical guarantees beyond squared-loss objectives, fully generalizing them remains an avenue for future work. By estimating nonlinear importance scores without the model parameters, we substantially reduce the classic chicken-and-egg coupling between sampling and estimation. The subsequent constrained optimization still introduces a mild dependence on the unknown solution, leaving open challenges for fully parameter-agnostic importance sampling in nonlinear settings.

As shown, in many cases the computational cost of approximating our nonlinear importance scores is comparable to that of existing sampling techniques in the linear regime, as both rely on similar fundamental operations. In such cases, the only additional overhead typically stems from computing the nonlinear dual matrix. Thus, our framework typically does not introduce any fundamentally new computational bottleneck beyond those already present in linear importance sampling methods. As a result, the practical runtime overhead is largely governed by implementation-level factors, such as the choice of optimization algorithm and hyperparameter settings.

Acknowledgments

We sincerely thank Cameron Musco for their valuable feedback and discussions. This research was partially supported by the Australian Research Council through an Industrial Transformation Training Centre for Information Resilience (IC200100022).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Apers, S., Gribling, S., and Sidford, A.
On computing approximate Lewis weights. arXiv preprint arXiv:2404.02881, 2024.

Avron, H., Maymounkov, P., and Toledo, S. Blendenpik: Supercharging LAPACK's least-squares solver. SIAM Journal on Scientific Computing, 32(3):1217-1236, 2010.

Avron, H., Kapralov, M., Musco, C., Musco, C., Velingker, A., and Zandieh, A. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In International Conference on Machine Learning, pp. 253-262. PMLR, 2017.

Avron, H., Kapralov, M., Musco, C., Musco, C., Velingker, A., and Zandieh, A. A universal sampling method for reconstructing signals with simple Fourier transforms. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 1051-1063, 2019.

Bietti, A., Bruna, J., Sanford, C., and Song, M. J. Learning single-index models with shallow neural networks. Advances in Neural Information Processing Systems, 35:9768-9783, 2022.

Bourgain, J., Lindenstrauss, J., and Milman, V. Approximation of zonoids by zonotopes. Acta Mathematica, 162:73-141, 1989.

Bulatov, Y. notMNIST dataset. Technical report, 2011. URL http://yaroslavvb.blogspot.it/2011/09/notmnist-dataset.html.

Burýšková, V. Some properties of nonlinear adjoint operators. Rocky Mountain Journal of Mathematics, 28(1):41-59, 1998.

Canévet, O., Jose, C., and Fleuret, F. Importance sampling tree for large-scale empirical expectation. In International Conference on Machine Learning, pp. 1454-1462. PMLR, 2016.

Chen, X. and Dereziński, M. Query complexity of least absolute deviation regression via robust uniform convergence. In Conference on Learning Theory, pp. 1144-1179. PMLR, 2021.

Chen, X. and Price, E. Active regression via linear-sample sparsification. In Conference on Learning Theory, pp. 663-695. PMLR, 2019.

Clarkson, K. L. and Woodruff, D. P. Low-rank approximation and regression in input sparsity time. Journal of the ACM (JACM), 63(6):1-45, 2017.

Cohen, A.
and DeVore, R. Approximation of high-dimensional parametric PDEs. Acta Numerica, 24:1-159, 2015.

Cohen, M. B. and Peng, R. ℓp row sampling by Lewis weights. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 183-192, 2015.

Cohen, M. B., Musco, C., and Musco, C. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1758-1777. SIAM, 2017.

Dereziński, M. and Mahoney, M. W. Recent and upcoming developments in randomized numerical linear algebra for machine learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 6470-6479, 2024.

Drineas, P. and Mahoney, M. W. Lectures on randomized numerical linear algebra. The Mathematics of Data, 25(1), 2018.

Drineas, P., Kannan, R., and Mahoney, M. W. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132-157, 2006a.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Sampling algorithms for ℓ2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1127-1136, 2006b.

Erdélyi, T., Musco, C., and Musco, C. Fourier sparse leverage scores and approximate kernel learning. Advances in Neural Information Processing Systems, 33:109-122, 2020.

Eshragh, A., Roosta, F., Nazari, A., and Mahoney, M. W. LSAR: Efficient leverage score sampling algorithm for the analysis of big time series data. Journal of Machine Learning Research, 23(22):1-36, 2022.

Fan, J., Liao, Y., and Mincheva, M. High-dimensional covariance matrix estimation in approximate factor models. Annals of Statistics, 2011.

Feldman, D. Core-sets: Updated survey. In Sampling Techniques for Supervised or Unsupervised Tasks, pp. 23-44, 2020.

Gajjar, A. and Musco, C. Subspace embeddings under nonlinear transformations.
In Algorithmic Learning Theory, pp. 656-672. PMLR, 2021.

Gajjar, A., Musco, C., and Hegde, C. Active learning for single neuron models with Lipschitz non-linearities. In International Conference on Artificial Intelligence and Statistics, pp. 4101-4113. PMLR, 2023.

Gajjar, A., Tai, W. M., Xingyu, X., Hegde, C., Musco, C., and Li, Y. Agnostic active learning of single index models with linear sample complexity. In The Thirty-Seventh Annual Conference on Learning Theory, pp. 1715-1754. PMLR, 2024.

Gittens, A. and Mahoney, M. Revisiting the Nyström method for improved large-scale machine learning. In International Conference on Machine Learning, pp. 567-575. PMLR, 2013.

Goel, S., Kanade, V., Klivans, A., and Thaler, J. Reliably learning the ReLU in polynomial time. In Kale, S. and Shamir, O. (eds.), Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pp. 1004-1042. PMLR, 07-10 Jul 2017. URL https://proceedings.mlr.press/v65/goel17a.html.

Goodfellow, I. J., Erhan, D., Carrier, P. L., Courville, A., Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler, D., Lee, D.-H., et al. Challenges in representation learning: A report on three machine learning contests. In Neural Information Processing: 20th International Conference, ICONIP 2013, Daegu, Korea, November 3-7, 2013, Proceedings, Part III, pp. 117-124. Springer, 2013.

Ha, D. and Eck, D. A neural representation of sketch drawings. In International Conference on Learning Representations, 2018.

Har-Peled, S. and Mazumdar, S. On coresets for k-means and k-median clustering. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 291-300, 2004.

Härdle, W., Müller, M., Sperlich, S., Werwatz, A., et al. Nonparametric and Semiparametric Models, volume 1. Springer, 2004.

Hristache, M., Juditsky, A., and Spokoiny, V. Direct estimation of the index coefficient in a single-index model. Annals of Statistics, pp. 595-623, 2001.
Huggins, J., Campbell, T., and Broderick, T. Coresets for scalable Bayesian logistic regression. Advances in Neural Information Processing Systems, 29, 2016.

Iwen, M. A., Needell, D., Rebrova, E., and Zare, A. Lower memory oblivious (tensor) subspace embeddings with fewer random bits: Modewise methods for least squares. SIAM Journal on Matrix Analysis and Applications, 42(1):376-416, 2021.

Johnson, W. B. and Schechtman, G. Finite dimensional subspaces of Lp. Handbook of the Geometry of Banach Spaces, 1:837-870, 2001.

Kakade, S. M., Kanade, V., Shamir, O., and Kalai, A. Efficient learning of generalized linear and single index models with isotonic regression. Advances in Neural Information Processing Systems, 24, 2011.

Katharopoulos, A. and Fleuret, F. Not all samples are created equal: Deep learning with importance sampling. In International Conference on Machine Learning, pp. 2525-2534. PMLR, 2018.

Langberg, M. and Schulman, L. J. Universal ε-approximators for integrals. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 598-607. SIAM, 2010.

Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207-1216, Stanford, CA, 2000. Morgan Kaufmann.

Lantz, B. Machine Learning with R: Expert Techniques for Predictive Modeling. Packt Publishing Ltd, 2019.

Liu, Q. and Lee, J. Black-box importance sampling. In Artificial Intelligence and Statistics, pp. 952-961. PMLR, 2017.

Liu, Z., Wang, G., Zhong, S. H., Xu, Z., Zha, D., Tang, R. R., Jiang, Z. S., Zhou, K., Chaudhary, V., Xu, S., et al. Winner-take-all column row sampling for memory efficient adaptation of language model. Advances in Neural Information Processing Systems, 36, 2024.

Lucic, M., Faulkner, M., Krause, A., and Feldman, D. Training Gaussian mixture models at scale via coresets. Journal of Machine Learning Research, 18(160):1-25, 2018.

Mahoney, M. W. and Drineas, P.
CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697-702, 2009.

Mahoney, M. W. et al. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123-224, 2011.

Mai, T., Musco, C., and Rao, A. Coresets for classification simplified and strengthened. Advances in Neural Information Processing Systems, 34:11643-11654, 2021.

Martinsson, P.-G. and Tropp, J. A. Randomized numerical linear algebra: Foundations and algorithms. Acta Numerica, 29:403-572, 2020.

Meng, C., Liu, E., Neiswanger, W., Song, J., Burke, M., Lobell, D., and Ermon, S. IS-Count: Large-scale object counting from satellite images with covariate-based importance sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 12034-12042, 2022.

Meng, X. and Mahoney, M. W. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pp. 91-100, 2013.

Mirzasoleiman, B., Bilmes, J., and Leskovec, J. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pp. 6950-6960. PMLR, 2020.

Mollahosseini, A., Hasani, B., and Mahoor, M. H. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1):18-31, 2017.

Munteanu, A., Schwiegelshohn, C., Sohler, C., and Woodruff, D. On coresets for logistic regression. Advances in Neural Information Processing Systems, 31, 2018.

Murray, R., Demmel, J., Mahoney, M. W., Erichson, N. B., Melnichenko, M., Malik, O. A., Grigori, L., Luszczek, P., Dereziński, M., Lopes, M. E., et al. Randomized numerical linear algebra: A perspective on the field with an eye to software. arXiv preprint arXiv:2302.11474, 2023.

Musco, C., Musco, C., Woodruff, D. P., and Yasuda, T.
Active linear regression for ℓp norms and beyond. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 744-753. IEEE, 2022.

Nabian, M. A., Gladstone, R. J., and Meidani, H. Efficient training of physics-informed neural networks via importance sampling. Computer-Aided Civil and Infrastructure Engineering, 36(8):962-977, 2021.

Nelson, J. and Nguyên, H. L. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 117-126. IEEE, 2013.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A. Y., et al. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, pp. 4. Granada, 2011. URL http://ufldl.stanford.edu/housenumbers.

O'Leary-Roseberry, T., Villa, U., Chen, P., and Ghattas, O. Derivative-informed projected neural networks for high-dimensional parametric maps governed by PDEs. Computer Methods in Applied Mechanics and Engineering, 388:114199, 2022.

Pace, R. K. and Barry, R. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291-297, 1997.

Paschou, P., Ziv, E., Burchard, E. G., Choudhry, S., Rodriguez-Cintron, W., Mahoney, M. W., and Drineas, P. PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genetics, 3(9):e160, 2007.

Samadian, A., Pruhs, K., Moseley, B., Im, S., and Curtin, R. Unconditional coresets for regularized loss minimization. In International Conference on Artificial Intelligence and Statistics, pp. 482-492. PMLR, 2020.

Sarlos, T. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pp. 143-152. IEEE, 2006.

Schechter, E. Handbook of Analysis and its Foundations. Academic Press, 1996.

Scherpen, J. M. and Gray, W. S.
Nonlinear Hilbert adjoints: Properties and applications to Hankel singular value analysis. Nonlinear Analysis: Theory, Methods & Applications, 51(5):883-901, 2002.

Stich, S. U., Raj, A., and Jaggi, M. Safe adaptive importance sampling. Advances in Neural Information Processing Systems, 30, 2017.

Tolochinsky, E., Jubran, I., and Feldman, D. Generic coreset for scalable learning of monotonic kernels: Logistic regression, sigmoid and more. In International Conference on Machine Learning, pp. 21520-21547. PMLR, 2022.

Tremblay, N., Barthelmé, S., and Amblard, P.-O. Determinantal point processes for coresets. Journal of Machine Learning Research, 20(168):1-70, 2019.

Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Wickham, H. and Sievert, C. ggplot2: Elegant Graphics for Data Analysis, volume 10. Springer New York, 2009.

Woodruff, D. P. et al. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1-157, 2014.

Xu, P., Yang, J., Roosta, F., Ré, C., and Mahoney, M. W. Sub-sampled Newton methods with non-uniform sampling. Advances in Neural Information Processing Systems, 29, 2016.

A. Appendix

A.1. Proof of Proposition 3.1

Proof. We first note that by Euler's homogeneous function theorem, the gradient of h is positively homogeneous of degree α − 1, i.e., ∇θ h(tθ) = t^(α−1) ∇θ h(θ) for all t > 0. Hence,

f∗(θ) = ∫₀¹ ∇θ f(tθ) dt = ∫₀¹ g′(h(tθ)) ∇θ h(tθ) dt = (∫₀¹ t^(α−1) g′(t^α h(θ)) dt) ∇θ h(θ). (13)

Now, substituting s = t^α h(θ), for which t^(α−1) dt = ds/(α h(θ)), it follows that

f∗(θ) = (1/(α h(θ))) (∫₀^{h(θ)} g′(s) ds) ∇θ h(θ) = ((g(h(θ)) − g(0))/(α h(θ))) ∇θ h(θ).

If θ is such that h(θ) = 0, then from (13) we get f∗(θ) = (g′(0)/α) ∇θ h(θ).

A.2.
Norm Score Approximation for Example 3.4

Consider Example 3.2 with ϕ such that c1 ≤ (ϕ(t) − ϕ(0))²/t² ≤ c2 for some 0 < c1 ≤ c2 < ∞ and for all t ∈ T for some set of interest T, e.g., a Swish-type or linear output layer. Recall that θ = [θ1, . . . , θm] where θj = [aj, bj]. Also denote θ∗ = [θ∗1, . . . , θ∗m] where θ∗j = [a∗j, b∗j]. We get

Σ_{j=1}^m ‖r∗(θj)‖² = Σ_{j=1}^m γj² ([max{⟨bj, x⟩, 0}]² + aj² ‖x‖² 1{⟨bj,x⟩>0}), where γj = (ϕ(aj max{⟨bj, x⟩, 0}) − ϕ(0)) / (aj max{⟨bj, x⟩, 0}).

For any xi, denote fi(θ) = Σ_{j=1}^m ri(θj) = Σ_{j=1}^m ϕ(aj max{⟨bj, xi⟩, 0}). Suppose for some j ∈ {1, 2, . . . , m} we have ⟨bj, xi⟩ > 0, as otherwise f∗i(θ) = 0. We have

Σ_{j=1}^m γj² aj² ‖xi‖² 1{⟨bj,xi⟩>0} ≤ Σ_{j=1}^m γj² ([max{⟨bj, xi⟩, 0}]² + aj² ‖xi‖² 1{⟨bj,xi⟩>0}) ≤ Σ_{j=1}^m γj² (‖bj‖² + aj²) ‖xi‖².

By the assumption on ϕ, it follows that

c1 aj² ‖xi‖² ≤ Σ_{j=1}^m γj² ([max{⟨bj, xi⟩, 0}]² + aj² ‖xi‖² 1{⟨bj,xi⟩>0}) ≤ c2 (Σ_{j=1}^m ‖bj‖² + aj²) ‖xi‖².

Assume a∗j ≠ 0 for all j ∈ {1, 2, . . . , m}. Let l > 0 be such that min_j (a∗j)² ≥ l, and let 0 < u < ∞ be such that Σ_{j=1}^m ‖b∗j‖² + (a∗j)² ≤ u. Define the set

C = { [a1, b1, . . . , am, bm] | min_{j=1,...,m} aj² ≥ l, Σ_{j=1}^m ‖bj‖² + aj² ≤ u }.

Clearly, by construction, θ∗ ∈ C. It follows that for any θ ∈ C, we have c1 l ‖xi‖² ≤ ‖f∗i(θ)‖² ≤ c2 u ‖xi‖², which in turn gives

min{c1 l, 1} (‖xi‖² + m²ϕi(0)²) ≤ ‖bf∗i(θ)‖² ≤ max{c2 u, 1} (‖xi‖² + m²ϕi(0)²).

This implies that

‖bf∗i(θ)‖² / ‖bF∗(θ)‖²_F ≤ (max{c2 u, 1}/min{c1 l, 1}) · (‖xi‖² + m²ϕi²(0)) / (‖X‖²_F + m² Σ_{j=1}^n ϕj²(0)) = (max{c2 u, 1}/min{c1 l, 1}) τi,

where τi is the norm score for the ith row of the matrix whose rows are [x1, mϕ1(0)], [x2, mϕ2(0)], . . . , [xn, mϕn(0)]. This allows us to pick β = min{c1 l, 1}/max{c2 u, 1} in Theorem 3.1.

A.3. Beyond Nonlinear Least-squares

Going beyond nonlinear least-squares settings, it turns out that an extension of our adjoint-based approach still applies as long as the map f has suitable homogeneity properties. More precisely, consider (1) and suppose ℓ is a nonnegative loss and f is positively homogeneous² of degree α, e.g., ReLU and variants such as Leaky ReLU.
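The closed form derived in Appendix A.1, f∗(θ) = ((g(h(θ)) − g(0))/(α h(θ))) ∇θ h(θ), which the argument here builds on, admits a quick numerical sanity check. The sketch below is our own illustration with an assumed concrete instance, h(θ) = ‖θ‖² (so α = 2) and g(s) = s²; it compares the defining integral against the closed form.

```python
import numpy as np

# Assumed instance of f = g(h(.)) with h positively homogeneous of degree alpha:
# h(theta) = ||theta||^2 (alpha = 2), g(s) = s^2.  These are illustrative choices.
alpha = 2.0
theta = np.array([0.5, -1.0, 2.0])
h = lambda V: np.sum(V**2, axis=-1)              # h(v) = ||v||^2
grad_h = lambda V: 2.0 * V                       # grad h(v) = 2v
g = lambda s: s**2
g_prime = lambda s: 2.0 * s

# Left-hand side: f*(theta) = int_0^1 grad f(t*theta) dt, computed numerically,
# where grad f(v) = g'(h(v)) grad h(v).
t = np.linspace(0.0, 1.0, 20001)
V = t[:, None] * theta                           # points t*theta along the ray
integrand = g_prime(h(V))[:, None] * grad_h(V)   # grad f at each point
lhs = ((integrand[:-1] + integrand[1:]) / 2.0 * np.diff(t)[:, None]).sum(axis=0)  # trapezoid rule

# Right-hand side: the closed form from the proof of Proposition 3.1.
rhs = (g(h(theta)) - g(0.0)) / (alpha * h(theta)) * grad_h(theta)
```

For this instance both sides reduce to ‖θ‖²θ, so the two computations should agree up to quadrature error.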
Since ℓ is a nonnegative loss function, we define h(θ) = √(ℓ(f(θ))). Analogously to Definition 3.1, for any given x and θ, we can define

h∗(θ) = ∫₀¹ (ℓ′(f(tθ)) / (2√(ℓ(f(tθ))))) ∇θ f(tθ) dt = (∫₀¹ t^(α−1) ℓ′(t^α f(θ)) / (2√(ℓ(t^α f(θ)))) dt) ∇θ f(θ),

where the last equality follows from Euler's homogeneous function theorem. Now, letting s = t^α f(θ), we have ds = α t^(α−1) f(θ) dt, which gives

h∗(θ) = (1/(2α f(θ))) (∫₀^{f(θ)} ℓ′(s)/√(ℓ(s)) ds) ∇θ f(θ) = ((√(ℓ(f(θ))) − √(ℓ(0))) / (α f(θ))) ∇θ f(θ).

Since h(θ) = h(0) + ⟨h∗(θ), θ⟩, we can define bh∗(θ) = [h∗(θ), h(0)], so that

Σ_{i=1}^n ℓ(fi(θ)) = Σ_{i=1}^n (hi(θ))² = Σ_{i=1}^n ⟨bh∗i(θ), bθ⟩² = ‖bH∗(θ) bθ‖²,

where bH∗(θ) is defined analogously to Definition 3.2. Now, similar to the case of nonlinear least-squares, importance sampling according to leverage scores or norms of {bh∗i(θ)} gives sampling approximations of the form (10).

² Recall that a function ψ is positively homogeneous of degree α if ψ(tθ) = t^α ψ(θ) for any t > 0; see (Schechter, 1996) for more details on positive homogeneity.

A.4. Construction of the ε-Net in Section 3.3.2

Let θ∗ denote a solution to (1), and consider a compact ball with radius R, chosen large enough to contain θ∗. Let B_R denote this ball. For any ε ∈ (0, 1), we pick a discrete subset N_ε ⊂ B_R such that, for every θ ∈ B_R, there exists at least one θ′ ∈ N_ε satisfying ‖θ − θ′‖ ≤ εR/2. The construction of an ε-net reduces an uncountable continuous set to a finite covering set; see Figure 3. For completeness, we provide some details here. The reader is encouraged to consult references such as Woodruff et al. (2014); Vershynin (2018) for further discussions and details.

To find an upper bound on the cardinality |N_ε|, one can use standard volume and covering number arguments. Recall that the volume of B_R is Vol(B_R) = π^(p/2) R^p / Γ(p/2 + 1), where Γ is Euler's gamma function. A bound on |N_ε| can be obtained as the number of small balls of radius εR/2 needed to cover the larger ball of radius (1 + ε/2)R, which is given by the ratio of their respective volumes:

|N_ε| ≤ ((1 + ε/2)R / (εR/2))^p = (1 + 2/ε)^p.

Figure 3. Illustration of an ε-net covering B_R.
The larger circle has radius (1 + ε/2)R, while B_R has radius R. Blue dots denote the ε-net, and dashed circles of radius εR/2 cover all points.

A.5. Further Details for Section 4

Classification Experiments. To carry out the experiment in an under-parameterized setting, the dataset was balanced, and the images were resized to 10×10 dimensions with a grayscale background. A fully connected MLP was trained with a linear 100-input layer connected to a hidden layer of 10 neurons with a ReLU activation, followed by a sigmoid output transformation function. The optimal weights were computed using PyTorch, with the Adam optimizer for 1000-5000 epochs (depending on the dataset) and BCEWithLogitsLoss(). These optimal weights were then used to calculate the nonlinear leverage scores in Definition 3.3 for each data point.

Regression Experiments. Here, we train a single-index model using a bounded output transformation function described in Example 3.3, with c1 = 1 and c2 = 2. After training for 30,000 epochs, we obtain the optimal parameters and compute the nonlinear importance scores for each data point, as well as the classical linear leverage/row-norm scores using the original data matrix. We then sample training instances using a stratified strategy, proportional to these scores, and evaluate how well each sampling strategy preserves the training performance. Specifically, we solve the subproblem outlined in (12). The experiment is repeated multiple times, and we measure the median log relative error between the MSE of the parameter θ_S on the full dataset and the optimal MSE obtained by training on the full dataset, i.e., log[(L(θ_S) − L(θ∗))/L(θ∗)], as a function of the number of samples selected (the number of samples was chosen depending on the total size of the dataset).

A.6.
Quantitative Performance for Qualitative Datasets

In this section, we show that the datasets presented in Figure 2 serve not only as qualitative illustrations but also as quantitative evidence that sampling based on nonlinear scores helps reduce the training mean squared error (MSE) loss more effectively than the alternatives. This reduction is more clearly observed when the loss is plotted on a logarithmic scale. We support this result with experiments on two datasets, SVHN (digits 1 vs. 0) and QD, comparing nonlinear leverage score sampling to linear leverage score sampling and uniform sampling.

Figure 4. Illustration of quantitative results on the datasets used in Figure 2: (a) the SVHN (1 vs 0) dataset and (b) the Quick, Draw! dataset. The Y-axis shows log(MSE) on training data against sample size (as a percentage of total data). LS and UN denote Leverage Scores and Uniform Sampling schemes, respectively, with L and A indicating linear and adjoint-based nonlinear variants.

A.7. Additional Images from Section 4

To facilitate further comparisons, we provide an additional 50 images with the highest and lowest leverage scores for each dataset from the classification experiments in Section 4.

[Figure 5 panels: (a) SVHN High (1 vs 0) Nonlinear Leverage Scores; (b) SVHN Low (1 vs 0) Nonlinear Leverage Scores; (c) SVHN High (1 vs 0) Linear Leverage Scores; (d) SVHN Low (1 vs 0) Linear Leverage Scores; (e) SVHN High (1 vs 7) Nonlinear Leverage Scores; (f) SVHN Low (1 vs 7) Nonlinear Leverage Scores; (g) SVHN High (1 vs 7) Linear Leverage Scores; (h) SVHN Low (1 vs 7) Linear Leverage Scores]

Figure 5. Top 50 images with the highest and lowest nonlinear leverage scores in each grouping for the SVHN dataset.
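For concreteness, the ranking by nonlinear leverage scores used in these comparisons can be instantiated in a few lines for a single-index model. The sketch below is our own hedged illustration (ϕ = tanh, and all sizes are arbitrary choices, not the experimental setup above): for fᵢ(θ) = ϕ(⟨xᵢ, θ⟩), the adjoint row f∗ᵢ(θ) = (∫₀¹ ϕ′(t zᵢ) dt) xᵢ = ((ϕ(zᵢ) − ϕ(0))/zᵢ) xᵢ with zᵢ = ⟨xᵢ, θ⟩, and the nonlinear scores are taken as the linear leverage scores of the stacked matrix.

```python
import numpy as np

def nonlinear_leverage_scores(X, theta, phi):
    """Leverage scores of the matrix whose ith row is
    f*_i(theta) = ((phi(z_i) - phi(0)) / z_i) * x_i,  z_i = <x_i, theta>,
    i.e., the closed form of int_0^1 phi'(t z_i) dt times x_i."""
    z = X @ theta
    scale = (phi(z) - phi(0.0)) / z      # assumes z_i != 0 for all i
    F = scale[:, None] * X               # rows f*_i(theta)
    Q, _ = np.linalg.qr(F)               # orthonormal basis of range(F)
    return np.sum(Q**2, axis=1)          # tau_i = ||Q[i, :]||^2

# Illustrative sizes and activation (our own choices).
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 4))
theta = rng.standard_normal(4)
scores = nonlinear_leverage_scores(X, theta, np.tanh)
```

Sorting the data by `scores` (descending) would yield "High" examples in the sense of the figures above; the scores lie in [0, 1] and sum to the rank of the stacked matrix.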
[Figure 6 panels: (a) NOTMNIST High (A vs B) Nonlinear Leverage Scores; (b) NOTMNIST Low (A vs B) Nonlinear Leverage Scores; (c) NOTMNIST High (A vs B) Linear Leverage Scores; (d) NOTMNIST Low (A vs B) Linear Leverage Scores; (e) NOTMNIST High (B vs D) Nonlinear Leverage Scores; (f) NOTMNIST Low (B vs D) Nonlinear Leverage Scores; (g) NOTMNIST High (B vs D) Linear Leverage Scores; (h) NOTMNIST Low (B vs D) Linear Leverage Scores]

Figure 6. Top 50 images with the highest and lowest nonlinear leverage scores in each grouping for the NOTMNIST dataset.

[Figure 7 panels: (a) QD High Nonlinear Leverage Scores; (b) QD Low Nonlinear Leverage Scores; (c) QD High Linear Leverage Scores; (d) QD Low Linear Leverage Scores]

Figure 7. Top 50 images with the highest and lowest nonlinear leverage scores for the QD dataset.

[Figure 8 panels: (a) FER - 75% High Nonlinear Leverage Scores; (b) FER - 75% Low Nonlinear Leverage Scores; (c) FER - 65% High Nonlinear Leverage Scores; (d) FER - 65% Low Nonlinear Leverage Scores; (e) FER - 55% Nonlinear Leverage Scores; (f) FER High Linear Leverage Scores; (g) FER Low Linear Leverage Scores]

Figure 8. Top 50 (trained and linear settings) and 16 (under-trained setting) images with the highest and lowest nonlinear leverage scores for valence expression on the FER dataset. In (e), only 16 images had non-zero scores at initialization.
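Finally, as a supplementary numerical illustration of the covering argument in Appendix A.4: a greedily built, pairwise εR-separated subset of sampled points in the unit ball covers those samples, and the packing argument caps its size at (1 + 2/ε)^p. All constants in the sketch below are our own illustrative choices (p = 2, R = 1, ε = 0.5, and covering radius εR rather than εR/2 for simplicity).

```python
import numpy as np

# Greedy construction of an eps*R-separated subset of sampled points in the
# ball B_R (p = 2, R = 1).  By the packing argument of Appendix A.4, its size
# cannot exceed (1 + 2/eps)^p; by construction it covers every sampled point.
rng = np.random.default_rng(2)
p, R, eps = 2, 1.0, 0.5
raw = rng.standard_normal((5000, p))
radii = rng.random((5000, 1)) ** (1.0 / p)                 # uniform radii in the ball
pts = R * radii * raw / np.linalg.norm(raw, axis=1, keepdims=True)

net = []
for x in pts:                                              # greedy packing:
    if all(np.linalg.norm(x - c) > eps * R for c in net):  # keep x only if far
        net.append(x)                                      # from all kept points
net = np.array(net)

bound = (1.0 + 2.0 / eps) ** p                             # = 25 for eps = 0.5, p = 2
```

Every rejected point lies within εR of some kept point, so the kept points form an εR-net of the samples while their count stays below the volume bound.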