# nearoptimal_active_regression_of_singleindex_models__f3314807.pdf

Published as a conference paper at ICLR 2025

NEAR-OPTIMAL ACTIVE REGRESSION OF SINGLEINDEX MODELS

Yi Li School of Physical and Mathematical Sciences and College of Computing and Data Science Nanyang Technological University yili@ntu.edu.sg

Wai Ming Tai Independent Researcher taiwaiming2003@gmail.com

The active regression problem of the single-index model is to solve minx f(Ax) b p, where A is fully accessible and b can only be accessed via entry queries, with the goal of minimizing the number of queries to the entries of b. When f is Lipschitz, previous results only obtain constant-factor approximations. This work presents the first algorithm that provides a (1+ε)-approximation solution by querying O(d p 2 1/εp 2) entries of b. This query complexity is also shown to be optimal up to logarithmic factors for p [1, 2] and the ε-dependence of 1/εp is shown to be optimal for p > 2.

1 INTRODUCTION

Active regression, an extension of the classical regression model, has gained increasing attention in recent years. In its most basic form, active regression aims to solve minx Rd Ax b p (p 1), where the matrix A Rn d represents n data points in Rd and the vector b Rn represents the corresponding labels. However, since label access can be expensive, the challenge is to minimize the number of entries of b that are accessed while still solving the regression problem approximately. A typical guarantee of the approximate solution is to find ˆx Rd such that

Aˆx b p (1 + ε) Ax b p, (1)

where x is the optimal solution to the regression problem, i.e., x = arg minx Rd Ax b p.

Research on this problem often focuses on randomized algorithms with constant failure probability, i.e., the entries of b are sampled randomly (but typically not uniformly) and the output ˆx satisfies the error guarantee above with probability at least a large constant. When p = 2, Chen & Price (2019) showed sampling O(d/ε) entries of b suffices, and when p = 1, Parulekar et al. (2021) showed an optimal query complexity of Θ(d/ε2). The case of general p was later settled by Musco et al. (2022), who showed a query complexity of O(d/ε) for p (1, 2] and O(dp/2/εp 1) for p > 2. Their proof was later refined by Yasuda (2024).

A more general form of the regression problem is the single-index model, which, in our context, asks to solve min x Rd f(Ax) b p,

where f : R R is a nonlinear function and, for u Rn, we abuse the notation slightly to denote f(u) = (f(u1), . . . , f(un)), i.e., applying f entrywise. This formulation arises naturally in neural networks and has received significant recent attention (see, e.g., (Diakonikolas et al., 2020; Gajjar et al., 2023a; Huang et al., 2024) and the references therein). A neural network can be viewed as the composition of a network backbone and a linear prediction layer x Rd with an activation function f, where a typical choice of f is the Re LU function. The network s prediction is given by f(Ax),

Published as a conference paper at ICLR 2025

where the matrix A is the feature matrix generated by the network backbone from the dataset. The goal is to learn the linear prediction layer x, which corresponds to solving the regression problem.

In this paper, we consider f to be a general L-Lipschitz function. It is tempting to expect an analogous guarantee as (1) for the canonical active regression problem, i.e. to find ˆx Rd such that

f(Aˆx) b p (1 + ε) f(Ax ) b p

where x = arg minx Rd f(Ax) b p. However, Gajjar et al. (2023a) showed that achieving this guarantee with a poly(d) query complexity is impossible even when f is a Re LU function and ε is a constant. Hence, we can only expect a weaker guarantee. The single-index regression problem was studied for p = 2 in the same paper (Gajjar et al., 2023a), which showed that sampling O(d2/ε4) entries suffices to find an ˆx Rd such that

f(Aˆx) b 2 2 C( f(Ax ) b 2 2 + εL2 Ax 2 2), (2)

where x = arg minx Rd f(Ax) b 2 is the minimizer and C is an absolute constant. For general p, Huang et al. (2024) obtained

f(Aˆx) b p p C(p)( f(Ax ) b p p + εLp Ax p p) (3)

for some constant C(p) depending only on p, using O(d/ε4) samples when 1 p 2 and O(dp/2/ε4) samples when p > 2. For p = 2, Gajjar et al. (2023b) also independently obtained O(d/ε4) samples and, with additional co-authors, further improved the query complexity to O(d/ε2) in (Gajjar et al., 2024).

The main drawback of the existing results for the single-index model, compared to the basic form of active regression, is that (2) and (3) only achieve constant-factor approximations, whereas (1) achieves a guarantee of (1 + ε)-approximation. The goal of this paper is to obtain a (1 + ε)- approximation for the single-index model.

Note that all existing results assume access to an oracle solver for the regression problem of the form arg minx Rd f(A x) b p or its regularized variant, which may not be a convex programme and where the objective function may be non-differentiable due to f. In this work, we retain the assumption of having such an oracle solver.

1.1 PROBLEM DEFINITION

Now we define our problem formally. For L 0, let Lip L denote the set of L-Lipschitz functions f such that f(0) = 0, i.e.

Lip L := f : R R f(0) = 0 and |f(x) f(y)| L |x y| for all x, y R .

Suppose we are given a function f Lip L, a matrix A Rn d and a query access to the entries of an unknown n-dimensional vector b Rn. Define

OPT := min x Rd f(Ax) b p p and x := arg min x Rd: f(Ax) b p p=OPT Ax p p.

That is, if there are multiple minimizers x , we choose an arbitrary one that minimizes Ax p. As noted in (Gajjar et al., 2024), there is no loss of generality in assuming f(0) = 0 because one can shift both f(x) and b by f(0). For an error parameter ε > 0, our goal is to find an ˆx Rd such that

f(Aˆx) b p p (1 + ε)OPT + Cε Ax p p,

where C is some constant that depends only on L and p while the number of queries to the entries of b is minimized. Therefore, we would like to ask:

What is the minimum number (in terms of d and ε) of queries to the entries of b needed in order to achieve this goal?

We will solve this problem in this paper.

Published as a conference paper at ICLR 2025

1.2 OUR RESULTS

We first present the main result of this paper that querying O(d/ε2 poly log n) entries of b is sufficient for a (1 + ε)-approximation when 1 p 2 and O(dp/2/εp poly log n) entries when p > 2. Our result achieves the same query complexity (up to logarithmic factors) for the constant-factor approximation algorithm in (Gajjar et al., 2024) when p = 2.

Theorem 1. There is a randomized algorithm, when given A Rn d, b Rn, f Lip L and an arbitrary sufficient small ε > 0, with probability at least 0.9, makes O d1 p

2 /ε2 p poly log(d/ε)

queries to the entries of b and returns an ˆx Rd such that

f(Aˆx) b p p OPT + ε(OPT + Lp Ax p p). (4)

The hidden constant in the bound on number of queries depends on p only.

Recall that in the canonical active regression problem, i.e. f(x) = x, the query complexity is Θ(d/ε) when 1 < p 2 and Θ(dp/2/εp 1) when p > 2 (Musco et al., 2022). We can show that accommodating a general f pushes up the ε-dependence to 1/ε2 and 1/εp, respectively. In particular, for 1 p 2, we show a lower bound of Ω(d/ε2) queries, which suggests that our upper bound is tight up to logarithmic factors. The following are formal statements on our lower bound.

Theorem 2. Suppose that p 1, ε > 0 is sufficiently small and n (d log d)/ε2. Any randomized algorithm that, given A Rn d, a query access to the entries of an unknown b Rn and f Lip1, outputs a d-dimensional vector ˆx such that (4) holds with probability at least 4/5 must make Ω(d/ε2) queries to the entries of b.

Theorem 3. Suppose that p > 2, ε > 0 is sufficiently small and n d/εp. Any randomized algorithm that, given A Rn d, a query access to the entries of an unknown b Rn and f Lip1, outputs a d-dimensional vector ˆx such that (4) holds with probability at least 4/5 must make Ω(d/εp) queries to the entries of b.

2 PROOF OVERVIEW

In this section, we will provide a proof overview of our theorems. The full proof for the upper bound can be found in Appendix B and the full proofs for the lower bounds in Appendix C.

2.1 UPPER BOUND

General Observations We follow the usual sample-and-solve paradigm as in many previous active regression algorithms (Chen & Price, 2019; Musco et al., 2022; Gajjar et al., 2023a; Chen et al., 2022; Huang et al., 2024). Querying entries of b can be viewed as multiplying b with a sampling matrix S, which is a diagonal matrix with sparse diagonal entries; the nonzero entries in Sb correspond to the queried entries of b. Hence, we would like to minimize the number of nonzero diagonal entries in S. The same sampling matrix S is also applied to f(Ax), leading to a natural attempt at solving the optimization problem minx Rd S(f(Ax) b) p p. However, we preview that this optimization problem is not the one we seek and we will provide more explanation on how to modify it.

The natural question is how to design the sampling matrix S. In all previous work, S is a rowsampling matrix with respect to Lewis weights; see the statement of Lemma 6. In this paper, we adopt the same approach. This means that (i) (unbiased estimator) E Sv p p = v p p for each v Rn

and (ii) (subspace embedding) when S has sufficiently many nonzero rows, SAx p Ax p for all x Rd. Note that the query complexity for active regression is usually higher than necessary for subspace embedding alone.

We will present the construction of the sampling matrix and the full algorithm in Section 3.

Formulating the Concentration Bounds Let ˆx = arg minx Rd S(f(Ax) b) p p. Although E S(f(Ax) b) p p = f(Ax) b p p for each x Rd, it is unclear what E S(f(Aˆx) b) p p is since ˆx depends on S. To address this, we instead argue that the sampling error

Published as a conference paper at ICLR 2025

S(f(Ax) b) p p f(Ax) b p p is small for all x T, where T is a small bounded domain containing ˆx. Here, the small size of T is critical for a small error bound since it controls the number of points in T at which the error needs to be small when applying Dudley s integral, a classical extension of the net argument. To further reduce the variance, we choose S(f(Ax ) b) p p f(Ax ) b p p as a reference point and argue that the difference

Err(x) = |( S(f(Ax) b) p p f(Ax) b p p) ( S(f(Ax ) b) p p f(Ax ) b p p)|. (5)

is uniformly small over T. Note that the reference point can be S(f(A x) b) p p f(A x) b p p for any x Rd. This approach has been employed by Yasuda (2024) for the canonical active regression (i.e. f(x) = x). However, for general Lipschitz functions f, a key challenge is to identify an appropriate domain T, which we shall discuss below.

Regularized Regression As previously noted, the optimization problem minx Rd S(f(Ax) b) p p is not the one we are seeking. It turns out that there is a fundamental challenge to argue that ˆx = arg minx Rd S(f(Ax) b) p p satisfies the desired guarantee (4). In the canonical active regression, one can show that

A(ˆx x ) p Ax b p. (6)

This suggests that ˆx T for T = { x Rd | A(x x ) p Ax b p }, which is a small bounded region. Unfortunately, for a Lipschitz function f, it is not clear how to obtain an analogy to (6) and thus a bounded region T containing ˆx. Hence, we still seek a bound on the norm Aˆx p p to keep T bounded and ideally small. For constant-factor approximations, previous work (Gajjar et al., 2023a; Huang et al., 2024) restrict the domain in the regression problem and solve minx T S(f(Ax) b) p p for some small T, but this leads to a poor ε-dependence of 1/ε4 in query complexity.

To improve the ε-dependence, one can consider a regularized regression problem

min x Rd S(f(Ax) b) p p + τ Ax p p,

where τ > 0 is a regularization parameter. This approach, as demonstrated by Gajjar et al. (2024), improves the ε-dependence for constant-factor approximations when p = 2, and will therefore be adopted in this paper. To ease the notation, assume that the Lipschitz constant L = 1 from now on.

An important question is how to choose the regularization parameter τ. Recall that we want the sampling error Err(x) (defined in (5)) to be small, ideally close to 0, when x = ˆx. Then

f(Aˆx) b p p OPT = f(Aˆx) b p p f(Ax ) b p p S(f(Aˆx) b) p p S(f(Ax ) b) p p + error terms

τ Ax p p + error terms,

where the last inequality follows from the optimality of ˆx. The desired guarantee (4) has an additive error ε Ax p p, indicating that τ should be set smaller than ε. On the other hand, the main purpose of regularization is to bound Aˆx p p. By the optimality of ˆx and rearranging the terms, we have

τ S(f(Ax ) b) p p S(f(Aˆx) b) p p

This suggests that τ should be set as large as possible for a better bound on Aˆx p p. Therefore, we set τ = ε and the optimization problem becomes

min x Rd S(f(Ax) b) p p + ε Ax p p. (7)

Bounding the Error To avoid overloading our notation, we focus on 1 p 2. A similar argument works for p 2. Recall that our intention is to bound

sup x T Err(x),

Published as a conference paper at ICLR 2025

where Err(x) is the sampling error defined in (5). The first issue we need to resolve is defining the domain T. By the optimality of ˆx in (7) and rearranging the terms, we have

S(f(Ax ) b) p p S(f(Aˆx) b) p p

+ Ax p p (8)

ε S(f(Ax ) b)) p p + Ax p p

εOPT + Ax p p by Markov inequality. (9)

Hence, we can set

εOPT + Ax p p and T = { x Rd | Ax p p R }.

Then ˆx T. Now, while we omit the details, we obtain the following concentration bound in (10) by following the standard technique of upper bounding the supremum of a stochastic process using Dudley s integral, which has been the central tool in the work on subspace embeddings (Bourgain et al., 1989; Ledoux & Talagrand, 1991; Cohen & Peng, 2015) and previous work on active regression (Musco et al., 2022; Chen et al., 2022; Huang et al., 2024). In short, when the number of queries is m, we can obtain that with probability at least 1 δ,

sup x T Err(x)

d poly(log n, log δ 1)

We preview here that this concentration bound will yield a weaker result, but it serves to illustrate the key idea and will guide us in refining the analysis later.

Suppose that m d ε4 poly log n. By plugging m and R into (10), we have with constant probability,

sup x T Err(x) ε(OPT + ε Ax p p).

Since ˆx T, we have

f(Aˆx) b p p OPT S(f(Aˆx) b) p p S(f(Ax ) b) p p + ε(OPT + ε Ax p p)

ε Ax p p + ε(OPT + ε Ax p p) by the optimality of ˆx in (7)

ε(OPT + Ax p p), (11)

which achieves the desired guarantee (4) by a rescaling of ε.

Attempt to Improve The reason we previously set m d ε4 poly log n is because R = 1

εOPT + Ax p p and we need the square root term in (10) to be ε2 so that the overall error is at most ε(OPT+ Ax p p). Indeed, when we compare to the canonical case of f(x) = x and bound the radius of the domain in (6), this large R is the main reason why extra factors of 1

ε are needed.

Notice that the term S(f(Ax) b) p p S(f(Ax ) b) p p in Err(x) also appears in (8). This means that we can plug the bound on Err(x) into (8) and improve the radius R.

For example, let m d ε3 poly log n. Here, the exponent 3 can be replaced by any value strictly larger than 2 and we simply choose this number for demonstration purposes. By plugging m and R into (10), we have with constant probability

sup x T Err(x) = sup x T |( S(f(Ax) b) p p S(f(Ax ) b) p p) ( f(Ax) b p p f(Ax ) b p p)|

ε 1 2 OPT + ε 3 2 Ax p p. (12)

Since ˆx T, by plugging (12) into (8) and a similar calculation to that in (9), we have

ε 1 2 OPT + Ax p p which is smaller than R.

ε 1 2 OPT + Ax p p and T = { x Rd | Ax p p R },

Published as a conference paper at ICLR 2025

then ˆx T . By plugging m, R and T into (10), we have with constant probability,

sup x T Err(x) εOPT + ε 3 2 Ax p p.

This is an improved error bound compared to (12). It follows from ˆx T and a similar calculation to that in (11) that

f(Aˆx) b p p OPT ε(OPT + Ax p p),

which achieves the guarantee (4). Therefore, we have successfully improved the query complexity from d

ε4 poly log n to d

ε3 poly log n. Although this bootstrapping idea of reusing the error guarantee to shrink T has appeared in previous work (Musco et al., 2022; Yasuda, 2024), we emphasize that there is a fundamental difference in the detailed analysis for general Lipschitz functions f, due to the lack of convexity of f(A( )) b p p.

Further Improvement Recall that we are targeting a query complexity of d

ε2 poly log n. One may immediately check that setting m d ε2 poly log n in the above argument is not helpful. To address this issue, we refine the analysis of (10) and improve the bound as follows. Recall that we set R = 1

εOPT + Ax p p such that Aˆx p p R. If we further restrict the domain T and set it to be

T = x Rd Ax p p R and f(Ax) f(Ax ) p p F for some F OPT

then (10) can be improved to

sup x T Err(x)

d poly(log n, log δ 1)

Note that, by the Lipschitz condition and the fact that Ax p p R, we have

f(Ax) f(Ax ) p p Ax Ax p p R

and hence one can set F = R. That means (13) is always no worse than (10). To apply (13), we need to show that ˆx T, i.e. find a suitable F such that f(Aˆx) f(Ax ) p p F.

In the proof of a constant-factor approximation by Gajjar et al. (2024), a key step is

f(Aˆx) f(Ax ) p p OPT + ε Ax p p

when p = 2. A straightforward modification extends it to general p, which means that we can set F = OPT + ε Ax p p.

Now, we pick m d ε2 poly log n. By plugging m, R and F into (13), we have

sup x T Err(x) ε

(OPT + ε Ax p p) (1

εOPT + Ax p p) ε 1 2 OPT + ε 3 2 Ax p p. (14)

Following the same argument before and plugging it into (8), we have

ε 1 2 OPT + Ax p p

which allows us to refine further the radius R and thus the domain T, leading to a better bound on Aˆx p p. Iterate this process and apply (13) log log 1

ε times, we shall arrive at the bound

Err(ˆx) = |( S(f(Aˆx) b) p p S(f(Ax ) b) p p) ( f(Aˆx) b p p f(Ax ) b p p)|

ε (OPT + Ax p p).

Finally, we follow a calculation similar to that in (11) to achieve the desired guarantee (4). The caveat here is that repeatedly applying the concentration bound (13) in the iterative process causes the failure probability to accumulate. We resolve this by setting δ = 1/ log log(1/ε) in (13), keeping log(1/δ) at most log n. Hence, the query complexity remains (d/ε2) poly log n.

Published as a conference paper at ICLR 2025

Dependence on n Although we have achieved the query complexity of d ε2 poly log n, it may not be desirable when n is large and we seek to further remove the dependence on n. The poly log n factor arises from estimating a covering number when bounding Dudley s integral. Indeed, by using a simple net argument with a sampling matrix of poly(d/ε) non-zero rows, the solution guarantee can still be achieved. While the dependence on d and ε are both worse, the query complexity is independent of n. To take the advantage of this trade-off, a standard approach involves using two query matrices S and S, where S has the suboptimal number of nonzero rows, and then solving the following new regularized problem

ˆx = arg min x Rd SS (f(Ax) b)) p p + ε S Ax p p.

We need to pay close attention to the fact that we are not simply using S A as the input matrix A in the original statement because of the function f.

To bound the error, a natural attempt is to use the concentration bounds and show that

f(Aˆx) b p p f(Ax ) b p p S (f(Aˆx) b) p p S (f(Ax ) b) p p + error terms (15)

SS (f(Aˆx) b) p p SS (f(Ax ) b) p p + error terms (16)

and then we use the optimality of ˆx to complete the proof. However, we remind the readers that, to apply the concentration bounds, it is important to check that the relevant points x , ˆx are in the domain of interest for the corresponding bounds. It turns out that we can obtain (15) but arguing (16) is the main obstacle because of that. While we will omit the detail, we note that we need a proxy point x to circumvent this obstacle when using the concentration bound for S. The proxy point x is defined as

x = arg min x Rd S (f(Ax) b) p p + ε2 Ax p p

and this proxy point allows us to show that ˆx lies within the domain of interest. We can then modify the above argument by continuing from (15)

S (f(Aˆx) b) p p S (f(Ax ) b) p p S (f(Aˆx) b) p p S (f(Ax ) b) p p + ε2 Ax p p SS (f(Aˆx) b) p p SS (f(Ax ) b) p p + error terms

and use the optimality of ˆx to finish the proof.

2.2 LOWER BOUND

General Observations As in the previous results (Musco et al., 2022; Yasuda, 2024), via Yao s minimax theorem, the proof of the lower bounds is reduced to distinguishing between two distributions (which are called hard instance). Specifically, we construct two hard-to-distinguish distributions on the vector b, and it requires a certain number of queries to the entries of b to distinguish between these distributions with constant probability. The reduction is using an approximation solution ˆx to determine from which distribution b is drawn. We construct our hard instances for 1 p 2 and p 2 separately. These instances are inspired by those in (Musco et al., 2022; Yasuda, 2024) but are more complex, as we are showing a higher lower bound. For the purpose of exposition, we assume d = 1, in which case the matrix A degenerates to a vector a Rn. We shall then extend the result to the general d.

Hard Instance for 1 p 2 We pair up the entries (say 2i 1 and 2i). Let

and v = 2 3

Let D0 (resp. D1) be the distribution on b R2n that, for all i = 1, . . . , n, each pair b2i 1 b2i

= u with probability 1

2 + ε (resp. 1

2 ε) v with probability 1

2 ε (resp. 1

Published as a conference paper at ICLR 2025

Figure 1: (left) Plot of the locus f([ x x ]), where the red (resp. blue) part corresponds to x 0 (resp. x 0); (middle) f( 6 6 ) is the point on the red part that is closest to u and v in the ℓp-distance; (right) f( 6 6 ) is the point on the blue part that is closest to u and v in the ℓp-distance

By the standard information-theoretic lower bounds, one needs to query Ω( 1

ε2 ) entries of b to distinguish D0 and D1.

To reduce this problem to our problem, we set

2 if x 6 x 4 if 6 x 4 0 if 4 x 0 1 2x if 0 x

and a2i 1 a2i

for i = 1, . . . , n.

Let k be the number of u s in b. The objective function becomes

f(a x) b p p = k f([ x x ]) u p p + (n k) f([ x x ]) v p p.

The takeaway of this construction is one can view f([ x x ]) as a locus in R2 as x varies, illustrated in Figure 1. Suppose b is drawn from D0. It implies that k n

2 + εn and hence n k < k. One can view each component as the ℓp distance between the locus and u or v. As seen in Figure 1, the locus passes through u and v. When x = 6, we have f([ x x ]) = u and so

OPT k 0 + (n k) u v p p = 2p(n k) 2p 1n(1 2ε).

On the other hand, Figure 1 also suggests that, when x > 0, we have

f(a x) b p p k u v p p + (n k) 0 2pk 2p 1n(1 + 2ε).

Suppose we have a solution ˆx such that

f(a ˆx) b p p (1 + c ε)OPT + c ε a x p p (1 + c ε)2p(n k) + c ε a x p p (1 + c ε) 2p 1n(1 2ε) + c ε 6p 2n

= 2p 1n(1 Ω(ε)) for a sufficiently small c > 0

< 2p 1n(1 + 2ε)

which implies that ˆx < 0. Similarly, suppose b is drawn from D1, one can show the symmetric result. We can declare b is drawn from D0 if ˆx < 0 and D1 otherwise. This concludes our reduction.

Hard Instance for p 2 We start with the all-one vector b R2n. Then, we pick a random index i from {1, . . . , 2n} uniformly and update bi bi + 1/ε. Our question is to determine whether i n or i > n, and it follows from a straightforward probability calculation that Ω(n) queries to the entries of b are required. Recall that we are targeting a query complexity of Ω(1/εp) and hence we set n = 1/εp.

To reduce this problem to our problem, we set

0 if x 0 x if 0 x 1 1 if 1 x. and ai = 1 if i = 1, . . . , n 1 if i = n + 1, . . . , 2n.

Published as a conference paper at ICLR 2025

Algorithm 1 Generating a Sampling Matrix GSM(k1, . . . , kn, α) Input: n integers k1, . . . , kn 0; a sampling rate α < 1

1: S an n n diagonal matrix, initialized to a zero matrix 2: for i = 1, . . . , n do 3: if ki > 0 then 4: Generate a binomial random variable Ni Bin(ki, α)

5: Sii ( Ni

6: Return S

Suppose that i n. If x = 1, we have the following. For i = 1, . . . , n and i = i , we have f(a x)i = f(1) = 1 = bi. Recall that bi = 1 + 1/ε. Hence, we have Pn i=1|f(a x)i bi|p = 1/εp = n. For i = n + 1, . . . , 2n, we have f(a x)i = f( 1) = 0 and therefore we have P2n i=n+1|f(a x)i bi|p = P2n i=n+1 1p = n. Namely, we have

OPT f(a x) b p p = 2n.

On the other hand, it is easy to check that, when x < 0, we have

f(a x) b p p

i=1 |f(a x)i bi|p n 1 + (1

ε + 1)p 2n(1 + ε)

Suppose we have a solution ˆx such that

f(a ˆx) b p p (1 + c ε)OPT + c ε a x p p (1 + c ε) 2n + c ε 1p n < 2n(1 + ε) for a sufficiently small c > 0

which implies ˆx > 0. Similarly, suppose that i n + 1, one can show the symmetric result. We can declare i n if ˆx > 0 and i > n otherwise. This concludes our reduction.

Extension to d > 1 We consider the problem of solving multiple independent copies of hard instances of d = 1 and reduce this new problem to the regression. The formal construction is as follows. Let m = Θ(1/εp 2). We have a dm-dimensional vector b, which can be partitioned into d blocks of m-dimensional vectors, with each block drawn from either D0 or D1 (the hard instances introduced earlier depending on p). By a straightforward probability calculation, it can be shown that Ω(dm) queries to the entries of b are needed to correctly answer, with constant probability, which distribution each block of m-dimensional vector is drawn from, for at least 2d/3 blocks.

To reduce it to our problem, let A be a dm-by-d block-diagonal matrix, partitioned into d2 blocks of m-dimensional vectors. Each diagonal block is the vector a which we constructed earlier. The function f remains the same as before. Suppose we have a solution ˆx satisfying (4). By the independence between blocks in b and the block-diagonal structure of A, we can argue that (4) can be decomposed into the sum of the objective functions for each independent block and declare that each block is drawn from D0 or D1 based on the same criteria as in the case of d = 1. By the standard counting techniques, at least 2d/3 of the d answers are correct and this completes the reduction. Hence, we achieve the query complexity of d/εp 2.

We point out that for the canonical case of f(x) = x and p 2, the previous result of Yasuda (2024) gives a stronger lower bound, in terms of d, of Ω(dp/2/εp 1). Unfortunately, it is still not clear how to apply the techniques in our setting.

3 ALGORITHM

To complement the proof overview in Section 2.1, we present our full algorithm in Algorithm 2 and explain the explicit implementation.

It first constructs a sampling matrix S (line 1 to line 4 of Algorithm 2) and applies it to A, f(A( )) and b. This sampling matrix S is generated using Algorithm 1. When applying S to A, in step 5

Published as a conference paper at ICLR 2025

Algorithm 2 Algorithm for Active Learning without Dependence on n

Input: a matrix A Rn d a query access to the entries of the vector b Rn a function f Lip L an error parameter ε two sampling rates α < α < 1 1: Compute the Lewis weights of A, denoted by w1(A), . . . , wn(A) 2: for i = 1, . . . , n do 3: k i n wi(A)

d 4: S GSM(k 1, . . . , k n, α ) from Algorithm 1 5: m number of nonzero rows in S

6: Compute the Lewis weights of S A, denoted by w1(S A), . . . , wn(S A) 7: for i = 1, . . . , n do 8: ki m wi(S A)

d 9: S GSM(k1, . . . , kn, α) from Algorithm 1 10: Solve the minimization problem ˆx := arg minx Rd SS f(Ax) SS b p p + ε S Ax p p 11: Return the vector ˆx Rd

of Algorithm 1, it is equivalent to splitting the rows of A such that all rows have uniformly bounded Lewis weights of O(d/n). To achieve this, it needs the Lewis weights of A and they can be computed as in (Cohen & Peng, 2015) for p < 4 and as in (Fazel et al., 2022) for p 4. Afterwards, we sample each row with the same probability α . This row-splitting approach has been used in the proofs of Cohen & Peng (2015); Yasuda (2024) and in the algorithms in (Gajjar et al., 2023b; 2024). Details of this row-splitting technique can be found in Appendix B.1.

We set the sampling rate α = poly(d/ε)/n. This effectively reduces the dimension from n, the number of rows of A, to m α n = poly(d/ε), the number of non-zero rows of S A. Therefore, it removes the dependence on n in our bound.

It then constructs the main sampling matrix S (line 6 to line 9 of Algorithm 2) with the sampling rate α = d p 2 1/(εp 2m) poly log(m), whereby avoiding dependence on n as previously discussed, and applies to S A, S f(A( )) and S b. That means that the number of non-zero entries of SS b is, with high probability, at most 2αm d p 2 1/(εp 2) poly log(d/ε), which is the query complexity we are looking for. Note that S is also generated using Algorithm 1 and hence satisfies the property of uniformly bounded Lewis weights through the previously mentioned row-splitting techniques. Finally, the algorithm outputs the optimal solution ˆx of the regularized problem

min x Rd SS f(Ax) SS b p p + ε S Ax p p

and that completes our full algorithm.

4 CONCLUSION

In this paper, we consider the active regression problem of the single-index model, which asks to solve minx f(Ax) b p, with f being a Lipschitz function, A fully accessible and b only accessible via entry queries. The goal is to minimize the number of queries to the entries of b while achieving an accurate solution to the regression problem. Previous work on single-index model has only achieved constant-factor approximations (Gajjar et al., 2023a;b; Huang et al., 2024; Gajjar et al., 2024). In this paper, we achieve a (1 + ε)-approximation with O(d p 2 1/εp 2) queries and we show that this query complexity is tight for 1 p 2 up to logarithmic factors. Furthermore, we prove that the 1/εp dependence is tight for p > 2 and we leave the full tightness of dp/2/εp as an open problem for future work.

ACKNOWLEDGEMENTS

Y. Li was supported in part by Singapore Ministry of Education Ac RF Tier 2 grant MOET2EP20122-0001 and Tier 1 grant RG75/21. W. M. Tai was supported by Singapore Ministry of

Published as a conference paper at ICLR 2025

Education Ac RF Tier 2 grant MOE-T2EP20122-0001 when he was affiliated with Nanyang Technological University, where most part of this work was done.

Shiri Artstein, Vitali Milman, and Stanisław J. Szarek. Duality of metric entropy. Annals of Mathematics, 159(3):1313 1328, 2004.

Jean Bourgain, Joram Lindenstrauss, and Vitali Milman. Approximation of zonoids by zonotopes. Acta Mathematica, 162(1):73 141, 1989.

Cheng Chen, Yi Li, and Yiming Sun. Online active regression. In Proceedings of the 39th International Conference on Machine Learning, pp. 3320 3335. PMLR, 2022.

Xue Chen and Eric Price. Active regression via linear-sample sparsification. In Alina Beygelzimer and Daniel Hsu (eds.), Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 663 695. PMLR, 25 28 Jun 2019. URL https://proceedings.mlr.press/v99/chen19a.html.

Michael B Cohen and Richard Peng. Lp row sampling by Lewis weights. In Proceedings of the 47th annual ACM symposium on Theory of computing, pp. 183 192, 2015.

Ilias Diakonikolas, Surbhi Goel, Sushrut Karmalkar, Adam R. Klivans, and Mahdi Soltanolkotabi. Approximation schemes for relu regression. In Jacob Abernethy and Shivani Agarwal (eds.), Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp. 1452 1485. PMLR, 09 12 Jul 2020. URL https://proceedings. mlr.press/v125/diakonikolas20b.html.

Maryam Fazel, Yin Tat Lee, Swati Padmanabhan, and Aaron Sidford. Computing lewis weights to high precision. In Proceedings of the 33rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2022), pp. 2723 2742, 2022. doi: 10.1137/1.9781611977073.107.

Aarshvi Gajjar, Christopher Musco, and Chinmay Hegde. Active learning for single neuron models with Lipschitz non-linearities. In International Conference on Artificial Intelligence and Statistics, pp. 4101 4113. PMLR, 2023a.

Aarshvi Gajjar, Xingyu Xu, Christopher Musco, and Chinmay Hegde. Improved bounds for agnostic active learning of single index models. In Neur IPS 2023 Workshop on Adaptive Experimental Design and Active Learning in the Real World, 2023b.

Aarshvi Gajjar, Wai Ming Tai, Xu Xingyu, Chinmay Hegde, Christopher Musco, and Yi Li. Agnostic active learning of single index models with linear sample complexity. In Shipra Agrawal and Aaron Roth (eds.), Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pp. 1715 1754. PMLR, 30 Jun 03 Jul 2024.

Sheng-Jun Huang, Yi Li, Yiming Sun, and Ying-Peng Tang. One-shot active learning based on lewis weight sampling for multiple deep models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. Open Review.net, 2024. URL https://openreview.net/forum?id=EDXkk UAIFW.

E. Kiltz and V. Vaikuntanathan. Theory of Cryptography: 20th International Conference, TCC 2022, Chicago, IL, USA, November 7 10, 2022, Proceedings, Part II. Lecture Notes in Computer Science. Springer Nature Switzerland, 2022. ISBN 9783031223655.

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science & Business Media, 1991.

Cameron Musco, Christopher Musco, David P Woodruff, and Taisuke Yasuda. Active linear regression for ℓp norms and beyond. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 744 753. IEEE, 2022.

Published as a conference paper at ICLR 2025

Aditya Parulekar, Advait Parulekar, and Eric Price. L1 regression with lewis weights subsampling. In Mary Wootters and Laura Sanit a (eds.), Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2021), volume 207 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 49:1 49:21, Dagstuhl, Germany, 2021. Schloss Dagstuhl Leibniz-Zentrum f ur Informatik. ISBN 978-3-95977-207-5. doi: 10.4230/ LIPIcs.APPROX/RANDOM.2021.49. URL https://drops.dagstuhl.de/entities/ document/10.4230/LIPIcs.APPROX/RANDOM.2021.49.

Michel Talagrand. Embedding subspaces of L1 into ℓN 1 . Proceedings of the American Mathematical Society, 108(2):363 369, 1990.

Michel Talagrand. Embedding subspaces of Lp in ℓN p . In J. Lindenstrauss and V. Milman (eds.), Geometric Aspects of Functional Analysis, pp. 311 326, Basel, 1995. Birkh auser Basel.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Przemysław Wojtaszczyk. Banach Spaces for Analysts. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 1991.

Taisuke Yasuda. Algorithms for Matrix Approximation: Sketching, Sampling, and Sparse Optimization. Ph D thesis, Carnegie Mellon University, 2024.

A PRELIMINARIES

Notation For a distribution D, we write X D to denote a random variable X drawn from D and β D to denote the distribution of the scaled random variable βX, where X D. For any 0 p 1 and positive integer n, we use Ber(p) to denote the Bernoulli distribution with expected value p and Bin(n, p) to denote the Binomial distribution with n trials and success probability p for each trial. That is, if X Ber(p) then

X = 1 with probability p 0 with probability 1 p

and if X Bin(n, p) then X can be expressed as Pn i=1 Xi where X1, . . . , Xn are i.i.d. Ber(p) variables.

For a matrix A, we use Ai, to denote its i-th row and A ,i its i-th column. For λ1, . . . , λn R, we use diag{λ1, . . . , λn} to denote a diagonal matrix whose diagonal entries are λ1, . . . , λn.

In a normed space (X, ), the unit ball B(X) is defined as B(X) = { x X | x 1 }. When X is clear from the context, we may omit the space and write only B for the unit ball. When X is the column space of a matrix A, we also write the unit ball as B(A). If the norm has a subscript , we shall include the subscript of the norm and denote the associated unit ball by B (or B (A) if X is the column space of A). In Rn, the standard ℓp-norm and the weighted ℓp-norm, denoted by p and w,p, are defined as x p = (Pn i=1 |xi|p)1/p and x w,p = (Pn i=1 wi|xi|p)1/p, respectively, where w Rn and wi > 0 for i [n].

We shall use C, C1, C2, ..., c, c1, c2, ... to denote absolute constants. We also write max{a, b} and min{a, b} as a b and a b, respectively. We use O, Ω, Θ and , , interchangeably.

Lewis Weights We now define an important concept regarding matrices, which have played critical roles in the construction of space-efficient subspace embeddings.

Definition 4 (ℓp-Lewis weights). Let A Rn d and p 1. For each i [n], the ℓp-Lewis weight of A for the i-th row is defined to be wi that satisfies

wi(A) = (a i (A W 1 2

p A) ai) p 2

where ai is the i-th row of A (as a column vector), W = diag{w1, . . . , wn} and denotes the pseudoinverse.

Published as a conference paper at ICLR 2025

When the matrix A is clear in context, we will simply write wi(A) as wi. Adopting that 0 = 0, we have wi(A) = 0 if ai = 0. The following are a few important properties of Lewis weights; see, e.g., (Wojtaszczyk, 1991) for a proof. Lemma 5 (Properties of Lewis weights). Suppose that A Rn d has full column rank and Lewis weights w1, . . . , wn. Let W = diag{w1, . . . , wn}. The following properties hold.

i wi = d; (b) There exists a matrix U Rn d such that

(i) the column space of U is the same as that of A; (ii) wi = Ui, p 2; (iii) W 1 2 1

p U has orthonormal columns; (c) It holds for all vectors u in the column space of A that W 1 2 1

p u 2 d 1 2 1 2 p u p.

(d) It holds for all vectors u in the column space of A that |ui| d 1 2 1 2 p w

Subspace Embeddings Suppose that A Rn d and ε (0, 1). We say a matrix S Rm n is an ℓp-subspace embedding matrix for A with distortion 1 + ε if (1 + ε) 1 Ax p SAx p (1+ε) Ax p. The prevailing method to construct ℓp-subspace embedding matrices is to sample the rows of A according to its Lewis weights. Lemma 6. Suppose that A Rn d has Lewis weights w1, . . . , wn. Let pi [0, 1] satisfy that pi (βwi) 1 and S Rn n be a diagonal matrix with independent diagonal entries Sii p 1/p i Ber(pi). Then with probability at least 0.99, S is an ℓp-subspace embedding matrix for A with distortion 1 + ε if

ε(log log d

ε)2, 1 < p < 2 1 ε2 log d

ε, p = 1, 2

ε2 (log d)2 log d

The results for p [1, 2] are due to Cohen & Peng (2015), based on earlier work of Talagrand (1990; 1995). The result for p > 2 can be found in (Yasuda, 2024; Huang et al., 2024), which improves upon the previous bound β (dp/2 1/ε5)(log d) log(1/ε) in (Bourgain et al., 1989; Cohen & Peng, 2015).

Covering Numbers and Dudley s Integral Suppose that T is a pseudometric space endowed with a pseudometric d. The diameter of T, denoted by Diam(T, d), is defined as Diam(T, d) := supt,s T ρ(t, s).

Given an r > 0, an r-covering of (T, d) is a subset X T such that for every t T, there exists x X such that d(t, x) r. The covering number N(T, d, r) is the minimum number K such that there exists an r-covering of cardinality K.

The covering numbers are intrinsically related to a subgaussian process on the space T that conforms to the pseudometric d. This relationship is captured by the well-known Dudley s integral. Lemma 7 (Dudley s integral Vershynin (2018)). Let Xt be a zero-mean stochastic process that is subgaussian w.r.t. a pseudo-metric d on the indexing set T. Then it holds that

Pr sup t,s T |Xt Xs| > C Z

log N(T, d, ε)dε + u Diam(T) 2 exp( u2),

where C is an absolute constant. As a consequence,

E sup t,s T |Xt Xs| ℓ C Cℓ " Z

log N(T, d, ε)dε ℓ + (

ℓDiam(T))ℓ #

where C and C are absolute constants.

Note that when r > Diam(T, d), the covering number N(T, d, r) = 1 and thus the integrand becomes 0. Hence, Dudley s integral is in fact taken over the finite interval [0, Diam(T, d)].

The following covering numbers related to w,p will be useful our analysis. These are not novel results though we include a proof in Appendix E for completeness.

Published as a conference paper at ICLR 2025

Lemma 8. Suppose that A Rn d has full column rank and W is a diagonal matrix whose diagonal entries are the Lewis weights of A. It holds that

log N(Bw,p(W 1/p A), w,q, t)

t q = p 1 t pq log d 1 p 2 and q > 2 t 2qd1 2

Lower Bound The following two lemmata, Lemma 9 and Lemma 10, are needed in the proof of our lower bounds for p 2 and p 2, respectively. Lemma 9 is a classical result, whose proof can be found, for example, in (Kiltz & Vaikuntanathan, 2022, p711).

Lemma 9. Let m be a positive integer. Suppose we have an m-dimensional vector whose entries are i.i.d. samples all drawn from either Ber( 1

2 + 1 m) or Ber( 1

2 1 m). It requires Ω(m) queries to distinguish Ber( 1

2 + 1 m) and Ber( 1

2 1 m) with probability at least 3/5.

Lemma 10. Let m be a positive integer. Suppose that x R2m is a random vector in which all but one of the entries are the same and the distinct entry xi is located at a uniformly random position i [2m]. Any deterministic algorithm that determines with probability at least 3/5 whether i lies within {1, . . . , m} or {m + 1, . . . , 2m} must read Ω(m) entries of x.

Proof. Let Q be the set of indices the algorithm reads and A(Q, i ) be the output of the algorithm. Note that, if i / Q, then A(Q, i ) does not depend on i and we write it A(Q).

Now, let E be the event of A(Q, i ) being the correct set and I be the event of i being chosen among these q queries. Then, we have

Pr(E) = Pr(E | I) Pr(I) + Pr(E | I) Pr(I).

Pr(E | I) 1, Pr(I) = q 2m and Pr(I) = 1 q

where q = |Q|.

Now, we evaluate Pr(E | I). Let q1 (resp. q2) be the size of the set Q {1, . . . , m} (resp. Q {m + 1, . . . , 2m}), so q1 + q2 = q. Given the event I, recall that we have A(Q, i ) = A(Q). If A(Q) = {1, . . . , m}, the event of E | I is equivalent to the event that i belongs to {1, . . . , m} but is not queried, thus we have

Pr(E | I) = 1 m q1

2m q = m q2

Similarly, if A(Q) = {m + 1, . . . , 2m}, we have

Pr(E | I) = 1 m q2

2m q = m q1

Hence, we have

Pr(E | I) max{ m q2

2m q , m q1

Namely, we have

Pr(E) q 2m + max{ m q2

2m q , m q1

= q 2m + max{1

2 + max{ q1

5 , it implies max{q1, q2} m

5 and hence Pr(E) 3

Published as a conference paper at ICLR 2025

Algorithm 3 Algorithm for Active Learning

Input: a matrix A Rn d a query access to the entries of the vector b Rn a function f Lip L an error parameter ε a sampling rate α < 1 1: Compute the Lewis weights w1, . . . , wn of A 2: for i = 1, . . . , n do 3: ki n wi

d 4: S GSM(k1, . . . , kn, α) from Algorithm 1 5: Solve the minimization problem ˆx := arg minx Rd Sf(Ax) Sb p p + ε Ax p p 6: Return the vector ˆx Rd

The next lemma extends the previous two lemmata to multiple instances of the problem considered therein.

Lemma 11. Let d and m be positive integers. Suppose that D0 and D1 are two distributions in Rm and distinguishing whether a vector is drawn from D0 or D1 with probability at least 3/5 requires querying βm entries of the vector for some constant β > 0. Consider a dm-dimensional random vector consisting of d blocks, each of which is an m-dimensional vector drawn from either D0 or D1. Every deterministic algorithm that correctly distinguishes, with probability at least 2/3, the distributions in 2d/3 instances requires Ω(dm) entry queries to this dm-dimensional random vector.

Proof. Suppose that an algorithm makes fewer than β 10dm queries in total. Then there exist 9 10d blocks, each of which makes fewer than βm queries. Therefore, each of these blocks will make an error in distinguishing the distributions with probability at least 2/5. By a Chernoff bound, with probability at least 1/3, at least d = 2

5 9 10 d Θ(

d) instances make an error. It means that d > d/3 and we arrive at a contradiction against the assumption on the correctness of the algorithm.

B UPPER BOUND

In this section, we first obtain a query complexity with poly log n factors with the algorithm presented in Algorithm 3 (see Theorem 12) and then remove the dependence on n in Appendix B.4.

B.1 EQUIVALENT STATEMENT

We shall first reduce the problem to the case where A has uniformly bounded Lewis weights, before proving in the next section that the output of Algorithm 3 with a suitable α satisfies (20) with probability 0.99.

We start with the following observation. Let ki be n wi

d for i = 1, . . . , n which is the same term also defined in Algorithm 3. Hence, we rewrite

f(Ax) b p p =

i=1 |(f(Ax) b)i|p =

ki |(f(Ax) b)i|p =

1 ki |(f(Ax) b)i|p.

Now, suppose that we duplicate the i-th term, |(f(Ax) b)i|p, ki times and assign a weight of 1/k

1 p i to each duplicate term. Formally, let

n = Pn i=1 ki, A be an n -by-d matrix in which A j, = Ai, if Pi 1 a=1 ka < j Pi a=1 ka, b be an n -dimensional vector in which b j = bi if Pi 1 a=1 ka < j Pi a=1 ka,

Λ be an n -by-n diagonal matrix in which Λjj = ki 1

p if Pi 1 a=1 ka < j Pi a=1 ka,

Published as a conference paper at ICLR 2025

In other words, we have

f(Ax) b p p =

j=1 Λp jj|(f(A x) b )j|p = Λf(A x) Λb p p

Note that we still have

OPT = min x Rd Λf(A x) Λb p p and x = arg min x Rd Λf(A x) Λb p p=OPT

ΛA x p p. (17)

On the other hand, in Algorithm 3, recall that Ni Bin(ki, α) for i = 1, . . . , n, it can be rewritten as

where Ni,1, . . . , Ni,ki are i.i.d. Ber(α) variables. In other words, we have

Sf(Ax) Sb p p =

i=1 Sp ii|(f(Ax) b)i|p =

α 1 ki |(f(Ax) b)i|p.

Let S be an n -by-n diagonal matrix in which S jj = Ni,j

p if j = Pi 1 a=1 ka + j for j = 1, . . . , ki. Then

Sf(Ax) Sb p p =

j=1 Sjj pΛp jj|(f(A x) b )j|p = S Λf(A x) S Λb p p.

Also, it is easy to check that

Ax p p = ΛA x p p.

We still have

ˆx = arg min x Rd S Λf(A x) S Λb p p + ε ΛA x p p. (18)

The advantage of introducing the diagonal matrix Λ is to bound the Lewis weights. Formally, we have the following observation. By the definition of Lewis weights, the j-th Lewis weight of ΛA is wi

ki if j = Pi 1 a=1 ka + 1, . . . , Pi a=1 ka for j = 1, . . . , n . Recall that ki = n wi

d and we have

ki = wi n wi

d + 1) = 2n.

Therefore, we generalize our statement to be the following. Let A be an n -by-d matrix, f Lip L, b be an n -dimensional vector, Λ be an arbitrary n -by-n positive diagonal matrix such that the Lewis weights of ΛA is at most 2d

n . Define OPT and x as in (17). Furthermore, let S be an n -by-n diagonal random matrix in which the diagonal entries are i.i.d. α 1

p Ber(α) variables, i.e.

p with probability α 0 with probability 1 α

and define ˆx as in (18). Our goal is to show, for a suitable α, we have in correspondence to (4)

Λf(A ˆx) Λb p p (1 + ε)OPT + Lpε ΛA x p p.

Published as a conference paper at ICLR 2025

B.2 CORRECTNESS

We would like to prove that the output of Algorithm 3 satisfies (4) with probability 0.99. In view of Appendix B.1, we can replace A with ΛA, where the Lewis weights of ΛA are uniformly bounded by 2d/n. The desired error guarantee is therefore

Λf(Aˆx) Λb p p (1 + ε)OPT + Lpε ΛAx p p. (19)

By replacing f(x) with f(x)/L and b with b/L, we can henceforce assume that L = 1 and the error guarantee (4) becomes

Λf(Aˆx) Λb p p (1 + ε)OPT + ε ΛAx p p. (20)

We shall first prove a weaker version of Theorem 1 with query complexity containing log n factors and then show how to remove the log n factors in Appendix B.4. The weaker version of Theorem 1 is stated formally below. Theorem 12. Let A Rn d, x Rd, b Rn, f Lip1, ε (0, 1) be sufficiently small and Λ be an n n diagonal matrix satisfying that Λii > 0 and wi(ΛA) d/n for all i. There is a randomized algorithm which, with probability at least 0.9, makes O d1 p

2 /ε2 p poly log n queries to the entries of b and returns an ˆx Rd satisfying

Λ(f(Aˆx) b) p p (1 + ε) Λ(f(A x b) p p + ε ΛA x p p.

The hidden constant in the bound on number of queries depends on p only.

Note that we introduce a vector x Rd. If we take x = x , Theorem 12 becomes Theorem 1 except that the query complexity contains log n factors. The reason we introduce x is because when we remove the log n factors in Appendix B.4 we can reuse the theorem using a different x. Now, to prove Theorem 12, we first provide a concentration bound in Lemma 13. Lemma 13. Let A Rn d, f Lip1, ε (0, 1) be sufficiently small and Λ be an n n diagonal matrix satisfying that Λii > 0 and wi(ΛA) d/n for all i [n]. Also, let S be an n-by-n random diagonal matrix in which the diagonal entries are i.i.d. α 1

p Ber(α) variables where

nεp 2 poly log n. If ˆx, x Rd satisfy

ˆx = arg min x Rd SΛ(f(Ax) b) p p + ε ΛAx p p

Λ(f(A x) b) p p Λ(f(Aˆx) b) p p ε( Λ(f(A x) b) p p + ε ΛA x p)

then, with probability at least 0.9,

|( SΛ(f(Aˆx) b) p p SΛ(f(A x) b) p p) ( Λ(f(Aˆx) b) p p Λ(f(A x) b) p p)|

ε ( Λ(f(A x) b) p p + ΛA x p p).

We now show how Lemma 13 can be used to prove Theorem 12. The proof of Lemma 13 will be presented in Appendix B.3.

Proof of Theorem 12. We shall apply Lemma 13 with x = x to prove Theorem 12. First, we verify the conditions in Lemma 13. Clearly, the output ˆx of Algorithm 3 satisfies

ˆx = arg min x Rd SΛ(f(Ax) b) p p + ε ΛAx p p

and, by the optimality of x , we also have

Λ(f(Ax ) b) p p Λ(f(Aˆx) b) p p 0.

Recall that Λ(f(Ax ) b) p p = OPT. By Lemma 13, with probability at least 0.9, we have

|( SΛ(f(Aˆx) b) p p SΛ(f(Ax ) b) p p) ( Λ(f(Aˆx) b) p p OPT)|

ε (OPT + ΛAx p p),

Published as a conference paper at ICLR 2025

which implies that

Λ(f(Aˆx) b) p p OPT

SΛ(f(Aˆx) b) p p SΛ(f(Ax ) b) p p + ε (OPT + ΛAx p p)

ε ΛAx p p + ε (OPT + ΛAx p p) by the optimality of ˆx

ε (OPT + ΛAx p p).

This completes the proof of Theorem 12.

B.3 CONCENTRATION BOUNDS

B.3.1 PROOF OF LEMMA 13

To prove Lemma 13, we rely on the following concentration bound provided in Lemma 14 and provide the proof in Appendix B.3.2. Lemma 14. Let A Rn d, f Lip1, ε (0, 1) be sufficiently small and Λ be an n n diagonal matrix satisfying that Λii > 0 and wi(ΛA) d/n for all i [n]. Additionally, suppose that x Rd and v Rn are fixed vectors, 0 α 1, R is any value that R ΛA x p p, F is any value that F V := Λ(f(A x) v) p p and T is any subset of Rd that { x} T x Rd ΛAx p p R .

Let S be an n-by-n random diagonal matrix in which the diagonal entries are i.i.d. α 1

p Ber(α) variables. When conditioned on the event that

SΛ(f(A x) v) p p V and sup x T SΛ(f(Ax) f(A x)) p p F,

it holds with probability at least 1 δ that

( SΛ(f(Ax) v) p p SΛ(f(A x) v) p p) ( Λ(f(Ax) v) p p Λ(f(A x) v) p p)

C εV + d1 p

log 5 4 d r

where C is an absolute constant and

( (d/(αn)) 1 2 F 1 2 R 1 2 when 1 p 2 (d p 2 /(αn)) 1 p F 1 1

p R 1 p when p > 2. (21)

With Lemma 14, we immediately have the following two corollaries. Corollary 15. Let A Rn d, f Lip1, Λ Rn n, x Rd, α (0, 1), R Rd, T Rd and S Rn n be as defined in Lemma 14 and satisfy the same constraints. Additionally, suppose that α d p 2 1/n. When conditioned on the event that SΛAx p p ΛAx p p for all x Rd, it holds with probability at least 1 δ that

SΛ(f(Ax) f(A x)) p p Λ(f(Ax) f(A x)) p p

(αn) 1 2 p R

log 5 4 d r

where C is an absolute constant.

Proof. In Lemma 14, we take v = f(A x) and ε to be a constant. Note that V = SΛ(f(A x) v) p p = 0. For any x T, if we take F = 2p R then

SΛ(f(Ax) f(A x))) p p SΛ(Ax A x)) p p by the Lipschitz condition

Λ(Ax A x) p p by the assumption of SΛAx p p ΛAx p p 2p R by x, x T

and hence the result follows by Lemma 14.

Published as a conference paper at ICLR 2025

Corollary 16. Let A Rn d, f Lip1, ε (0, 1), Λ Rn n, x Rd, α (0, 1), R R, F R, T Rd and S Rn n be as defined in Lemma 14 and satisfy the same constraints. Additionally,

suppose that α d p 2 1

nε and F εR. When conditioned on the event that

SΛ(f(A x) b) p p Λ(f(A x) b) p p and sup x T SΛ(f(Ax) f(A x)) p p F,

it holds with probability at least 1 δ that

( SΛ(f(Ax) b) p p SΛ(f(A x) b) p p) ( Λ(f(Ax) b) p p Λ(f(A x) b) p p)

log 5 4 d r

where C is an absolute constant and Γ is as defined in (21).

Proof. In Lemma 14, we take v = b. Note that we have V = Λ(f(A x) b) p p and hence the result follows, noticing that the last term in the error bound of Lemma 14 is the dominating term.

Now, we are ready to complete the proof of Lemma 13.

Proof of Lemma 13. Without loss of generality, we can assume that n d p 2 1/εp 2. We rely on Corollary 16 in this proof. To apply the corollary, we need to pick a suitable subset T so that the output ˆx T. The set T will be defined through suitable bounds for R and F and the main part of the proof will focus on obtaining these bounds.

Before doing so, we present some useful inequalities. First, by Markov inequality, with probability at least 0.99, we have

SΛ(f(A x) b) p p 100 Λ(f(A x) b) p p. (22)

We condition on this event in the remainder of the proof. By the optimality of ˆx, we have

SΛ(f(Aˆx) b) p p + ε ΛAˆx p p SΛ(f(A x) b) p p + ε ΛA x p p. (23)

It implies that, by (22) and (23),

ε SΛ(f(A x) b) p p + ΛA x p p 100

ε Λ(f(A x) b) p p + ΛA x p p | {z } :=R0

Throughout the remainder of the proof, we assume that

2 nεp 2 poly log n and δ 1 log log(1/ε)

so that the error term in Corollary 16 can be upper bounded as

Γ (poly log n + p

log(1/δ)) εF θRβ,

where β = 1

p and θ = (1 1

2. Note that β + θ = 1.

Bounding F in Corollary 16 We would like to first use Corollary 15 and let

T 1 = { x Rd | ΛAx p p R0 }.

Now, we check the conditions. Our choice of α satisfies that α d1 p

2 n poly log n, thus, by Lemma 6, S is a constant-distortion subspace embedding for ΛA with probability at least 0.99, i.e. SΛAx p 2 ΛAx p for all x Rd. Recall that

ε Λ(f(A x) b) p p + ΛA x p p.

Published as a conference paper at ICLR 2025

Hence, by Corollary 15 with our choice of α and R = R0, it holds with probability 0.99 that

SΛ(f(Ax) f(A x)) p p Λ(f(Ax) f(A x)) p p C1εR0, (25)

where C1 is a constant that depends only on p. Below we shall use C2, C3, . . . to denote constants that depend only on p. Conditioning on the event in (25), it follows that

Λ(f(Aˆx) f(A x)) p p SΛ(f(Aˆx) f(A x)) p p + C1εR0 2p SΛ(f(Aˆx) b) p p + SΛ(f(A x) b) p p + C1εR0

(A) 2p SΛ(f(A x) b) p p + ε ΛA x p p + SΛ(f(A x) b) p p + C1εR0 = 2p 2 SΛ(f(A x) b) p p + ε ΛA x p p + C1εR0

(B) 2p 2 100 Λ(f(A x) b) p p + ε ΛA x p p + C1ε(100

ε Λ(f(A x) b) p p + ΛA x p p)

C2( Λ(f(A x) b) p p + ε ΛA x p p) for some large constant C2, (26)

where (A) is due to (23), the optimality of ˆx, and (B) to (22), the Markov inequality for Λ(f(A x) b) p p, and the definition of R0.

Define F0 to be the RHS of (26), i.e.

F0 := C2( Λ(f(A x) b) p p + ε ΛA x p p).

We preview that the set T we use in Corollary 16 contains the element x satisfying the inequality

Λ(f(Ax) f(A x)) p p F0

Hence, (26) suggests that ˆx is in the domain of interest and hence F0 is the bound we will use in Corollary 16.

Bounding R in Corollary 16 Now, we would like to use Corollary 16. Recall that

ε Λ(f(A x) b) p p + ΛA x p p.

We can apply Corollary 16 with R0 but it will give a weaker result. However, we shall still use this weaker result and improve the bounds iteratively. Specifically, we shall define Ri based on Ri 1, ensuring that Ri R0 and that each Ri has the form of Xi Λ(f(A x) b) p p +Yi ΛA x p p for some Xi, Yi 1 (for example, X0 = 100

ε and Y0 = 1). Furthermore, let

Ti = { x Rd | ΛAx p p Ri and Λ(f(Ax) f(A x)) p p F0 },

so that Ti T0. More specifically, we shall use Ti to estimate an upper bound of ΛAˆx p p and define Ri+1 based on the upper bound, ensuring that x Ti. This guarantees that Ti satisfies the subset condition in Corollary 16. We shall also verify other conditions of Corollary 16.

It is clear that Ri R0 1

εF0. By (22), we have SΛ(f(A x) b) p p Λ(f(A x) b) p p and, by (25) and the fact that Ti T 1, we have

sup x Ti SΛ(f(Ax) f(A x)) p p sup x Ti Λ(f(Ax) f(A x)) p p + C1εR0 F0.

We invoke Corollary 16 with our choice of α, R = Ri and F = F0. Hence, with probability 1 δ,

( SΛ(f(Ax) b) p p SΛ(f(A x) b) p p) ( Λ(f(Ax) b) p p Λ(f(A x) b) p p)

C3 εRβ i F θ 0 for some constant C3. (27)

To use (27), we would like to argue that the solution ˆx Ti. For T0, we have

ΛAˆx p p R0 by (24) and Λ(f(Aˆx) f(A x)) p p by (26)

and hence ˆx T0. From now on, suppose that ˆx Ti and we will argue that ˆx Ti+1.

Published as a conference paper at ICLR 2025

We continue to bound (27). Assume that KYi/Xi ε for some K 1, then we can upper bound Rβ i F θ 0 as follows.

Rβ i F θ 0 = Xi Λ(f(A x) b) p p + Yi ΛA x p p β Cθ 2 Λ(f(A x) b) p p + ε ΛA x p p θ

Cθ 2 Xi Λ(f(A x) b) p p + Yi ΛA x p p β Λ(f(A x) b) p p + KYi

Xi ΛA x p p

θ (Xi Λ(f(A x) b) p p + KYi ΛA x p p) note that β + θ = 1. (28)

ε SΛ(f(A x) b) p p SΛ(f(Aˆx) b) p p + ΛA x p p

ε ( Λ(f(A x) b) p p Λ(f(Aˆx) b) p p + C3 εRβ i F θ 0 ) + ΛA x p p by (27)

From our assumption, we have Λ(f(A x) b) p p Λ(f(Aˆx) b) p p ε( Λ(f(A x) b) p p + ε ΛA x p)

εRβ i F θ 0 by the definition of F0 and F0 Ri. In other words, we have

ΛAˆx p p C4 (Rβ i F θ 0 ) + ΛA x p p for some constant C4

θ (Xi Λ(f(A x) b) p p + KYi ΛA x p p) + ΛA x p p

Xi+1 Λ(f(A x) b) p p + Yi+1 ΛA x p p, (29) where

Xi+1 = C4Cθ 2X1 θ i and Yi+1 = 1 + C4Cθ 2KYi Xθ i .

Define Ri+1 to be the minimum of R0 and the expression in (29), i.e. Ri+1 := R0 (Xi+1 Λ(f(A x) b) p p + Yi+1 ΛA x p p). We immediately have ˆx Ti+1, which is needed to iterate the argument.

Let X0 = 100/ε and Y0 = 1. By induction, one can show that

for C5 = C4Cθ 2. Then C6 Xi 100C1/θ 5 /ε for some constant C6 for all i r, thus Yi+1 1 + C7Yi C8Yi for some constants C7 and C8. When r p log log(1/ε), we have

Xr C9 and Yr C4(C8)r 1 = poly log 1

We shall also verify that KYi/Xi ε for some ε. Indeed, Yi/Xi 1/(100C1/θ 5 /ε) ε/K for K = 100C1/θ 5 .

Iterating the argument above r times. The total failure probability is at most δr + 0.03 = 0.1 since δ 1/r. It then follows from (27) with i = r 1 that Λ(f(Aˆx) b) p p Λ(f(A x) b) p p

SΛ(f(Aˆx) b) p p SΛ(f(A x) b) p p + C3 εRβ r 1F γ 0 ε ΛA x p p + C3 εRβ r 1F γ 0 by the optimality of ˆx

ε ΛA x p p + ε(Xr Λ(f(A x) b) p p + Yr ΛA x p p) by (28)

ε Λ(f(A x) b) p p + ε poly log 1

Rescaling ε poly log 1

ε to ε proves the claimed result of the theorem.

Published as a conference paper at ICLR 2025

B.3.2 PROOF OF LEMMA 14

In this section, we will prove Lemma 14. Recall that x is an arbitrary fixed vector in Rd, v is an arbitrary fixed vector in Rn, V = Λ(f(A x) v) p p, R ΛA x p p, F V and { x} T { x Rd | ΛAx p p R }. Also, we would like to bound the following expression

( SΛ(f(Ax) v) p p SΛ(f(A x) v) p p) ( Λ(f(Ax) v) p p Λ(f(A x) v) p p) ,

which can be written as

i=1 (Sp ii 1)Λp ii(|(f(Ax) v)i|p |(f(A x) v)i|p)

We shall bound (30) from above by, up to a constant factor,

poly log n +

with probability 1 δ. Recall that, as defined in (21),

( (d/(αn)) 1 2 F 1 2 R 1 2 when 1 p 2 (d p 2 /(αn)) 1 p F 1 1

p R 1 p when p > 2.

We preview that the first term εV comes from Lemma 18, the second term d1 p

2 εp αn R from Lemma 19

and the third term Γ (poly log n + q

δ ) from Dudley s integral (Lemma 7). We first present a useful lemma. Lemma 17. For any R ΛA x p p, let T be a set that { x} T { x Rd | ΛAx p p R }. Also, let w1, . . . , wn be the Lewis weights of ΛA. It holds for all x T and i [n] that

|Λii(f(Ax) f(A x))i| 2d 1 2 1 2 p w

1 p i R 1 p .

Proof. Note that

|Λii(f(Ax) f(A x))i| |Λii(Ax A x)i| by the Lipschitz condition

d 1 2 1 2 p w

1 p i ΛAx ΛA x p by Lemma 5(d)

Since x, x are both in T, we have

ΛAx ΛA x p ΛAx p + ΛA x p 2R 1 p .

The desired result follows.

Define the set G of good indices to be

|Λii(f(A x) v)i| d 1 2 1 2 p w

1 p i R 1 p

We shall first take care of the terms with bad indices in (30), i.e. the indices not in G, and hence obtain the first term εV in (31). We highlight that only the property of the nonnegativity of the diagonal entries of S is used in Lemma 18. Lemma 18. For any R ΛA x p p and ε > 0, let T be a set that { x} T { x Rd | ΛAx p p R } and G be the set defined in (32). Suppose that S is an n-by-n diagonal matrix with nonnegative diagonal entries and SΛ(f(A x) v) p p V,

where V = Λ(f(A x) v) p p. Then, we have

i/ G (Sp ii 1)Λp ii |(f(Ax) v)i|p |(f(A x) v)i|p εV.

Published as a conference paper at ICLR 2025

Proof. To ease the notations, let

ux := f(Ax) for all x Rd and λi := Λii for i [n].

Note that by the triangle inequality,

i/ G (Sp ii 1)λp i |(ux v)i|p |(u x v)i|p X

i/ G (Sp ii + 1)λp i |(ux v)i|p |(u x v)i|p

Furthermore, by the inequality |a|p |b|p p|a b| |a|p 1 + |b|p 1 , we have

λp i |(ux v)i|p |(u x v)i|p p|λi(ux u x)i| |λi(ux v)i|p 1 + |λi(u x v)i|p 1 .

For any i / G and x T, by Lemma 17 and the definition of G, we have

|λi(ux u x)i| 2ε|λi(u x v)i|, |λi(ux v)i| |λi(ux u x)i| + |λi(u x v)i| (1 + 2ε)|λi(u x v)i|.

It follows that

λp i |(ux v)i|p |(u x v)i|p ε|λi(u x v)i|p.

which implies that

i/ G (Sp ii 1)λp i |(ux v)i|p |(u x v)i|p

i/ G (Sp ii + 1)|λi(u x v)i|p ε

i=1 (Sp ii + 1)|λi(u x v)i|p ε V,

where we used the assumption of the lemma in the last step.

Now, we also define the set of indices whose term has a high Lewis weight within G. Let

J := i G wi > εpd

We now take care of the terms with low Lewis weights in (30), i.e. the indices i J, and hence

obtain the second term d1 p

2 εp αn R in (31). We highlight that only the property of the diagonal entries of S being in [0, 1

α] is used in Lemma 19.

Lemma 19. For any R ΛA x p p and ε > 0, let T be a set that { x} T { x Rd | ΛAx p p R } and J be the set defined in (33). Suppose S is an n-by-n diagonal matrix whose entries satisfy 0 Sp ii 1

α for any α > 0. Then, we have

i G\J (Sp ii 1)Λp ii |(f(Ax) v)i|p |(f(A x) v)i|p d1 p

Proof. To ease the notations, let

ux := f(Ax) for all x Rd and λi := Λii for i [n].

Note that by the triangle inequality,

i G\J (Sp ii 1)λp i |(ux v)i|p |(u x v)i|p X

i G\J (Sp ii +1)λp i |(ux v)i|p |(u x v)i|p .

Published as a conference paper at ICLR 2025

Since i G, by Lemma 17, we have

|λi(ux v)i| |λi(ux u x)i| + |λi(u x v)i| (2 + 1

ε)d 1 2 1 2 p w

1 p i R 1 p ,

which, together with i J, implies that

|λi(ux v)i|p d( p

εp wi R d1 p

Recall that we assume Sp ii 1

α. Therefore,

i G\J (Sp ii 1)λp i |(ux v)i|p |(u x v)i|p X

α + 1) d1 p

2 n2 R d1 p

With Lemma 18 and Lemma 19, we only need to take care of the indices in J. Namely the set of indices i such that

|(f(A x) v)i| d 1 2 1 2 p w

1 p i R 1 p

ε and wi > εpd

Now, we would like to bound the following expression

i J (Sp ii 1)Λp ii |(f(Ax) v)i|p |(f(A x) v)i|p .

We consider bounding its ℓ-th moment and then apply Markov s inequality for some ℓto be determined later. To that end, consider

ΘS := sup x T

i J (Sp ii 1)Λp ii |(f(Ax) v)i|p |(f(A x) v)i|p

By the standard symmerization trick, we have

ESΘℓ S 2ℓEξ,S

i J ξi Sp iiΛp ii |(f(Ax) v)i|p |(f(A x) v)i|p

where ξ is a |J|-dimensional vector whose entries are independent Rademacher random variable, i.e. each ξi is uniform on { 1, 1}.

Next, we condition on S. Recall that Sp ii is either 1

α or 0, let I J be the set of indices i such that Sp ii = 1

α. For any x Rd, we define zx to be the n-dimensional vector whose i-th entry is

(zx)i := |Λii(f(Ax) v)i|p |Λii(f(A x) v)i|p (35)

Also, we define a pseudometric ρ to be

ρ(x, x ) := (zx)I (zx )I 2 for any x, x Rd

|Λii(f(Ax) v)i|p |Λii(f(Ax ) v)i|p 2 1/2

Recall that ( )I means we shrink the vector by only retaining the entries whose index is in I. Now, in order to upper bound the right-hand side of (34), we seek to upper bound

sup x T | ξI, (zx)I | ℓ .

Since x T, the ℓ-moment of the supremum can be upper bounded using Dudley s integral (Lemma 7) as

sup x T | ξI, (zx)I | ℓ Cℓ

Z Diam(T,ρ)

log N(T, ρ, r)dr

ℓDiam(T, ρ))ℓ

Published as a conference paper at ICLR 2025

Recall that N(T, ρ, r) is the covering number of T w.r.t. ρ and r. We will prove in Appendix B.3.4 that Z Diam(T,ρ)

log N(T, ρ, r)dr α Γ log 5 4 d r

εd and Diam(T, ρ) α Γ (37)

Γ = d p 2 1RF (p 1) 1

Taking expectation over S, it follows from (34), (36) and (37) that

ESΘℓ S (C Γ)ℓ log 5 4 d r

Take ℓ= log(1/δ). By Markov inequality, it holds with probability 1 δ that

ΘS = sup x T

i J ξi Sp iiΛp ii |(ux v)i|p |(u x v)i|p Γ

log 5 4 d r

Note that it is the third term in (31). Combining Lemmas 18 and 19 proves Lemma 14.

B.3.3 DIAMETER ESTIMATES

In order to bound Dudley s integral in (37), we need to bound the covering number N(T, ρ, r). To this end, we shall bound the metric, ρ, and the diameter Diam(T, ρ). The proof imitates the proofs in earlier works, e.g., Ledoux & Talagrand (1991); Huang et al. (2024); Gajjar et al. (2024), on subspace embeddings and active regression problems. Lemma 20. Let A Rn d, x Rd, v Rn, f Lip1, Λ Rn n, α [0, 1] and S Rn n be as defined in Lemma 14 and satisfy the same constraints. Suppose that 0 α 1, R ΛA x p p and F Λ(f(A x) v) p p. Let T be a set that { x} T { x Rd | ΛAx p p R }. If I is a subset of J such that

(Λ(f(A x) v))I p p α Λ(f(A x) v) p p and sup x T (Λ(f(Ax) f(A x)))I p p α F

then, for any x, x T and q = log( n

εd), we have

ρ(x, x ) K W 1

w,q d 1 2 1 2 p W 1

Diam(T, ρ) d 1 2 1 2 p KR 1 2 1

where W = diag{w1, . . . , wn} and

αd F/n for 1 p 2; (αp 1d F p 1/n)1/p for p > 2.

Proof. As in the proof of Lemma 18, we let ux = f(Ax) and λi = Λii to simplify the notation. We further define semi-norms u I, := maxi I |ui| and u I,p = (P

i I |ui|p)1/p.

For i I and x, x T, we have

|(zx)i (zx )i| ||λi(ux v)i|p |λi(ux v)i|p|

p|λi(ux ux )i| |λi(ux v)i|p 1 + |λi(ux v)i|p 1

where the first inequality is due to the definition in (35) and the second to the fact that ||a|p |b|p| p|a b| ||a|p 1 + |b|p 1|. It follows that

ρ(x, x )2 X

i I p2|λi(ux ux )i|2 |λi(ux v)i|p 1 + |λi(ux v)i|p 1 2

i I |λi(ux ux )i|2 |λi(ux v)i|2p 2 + |λi(ux v)i|2p 2 . (38)

Published as a conference paper at ICLR 2025

When 1 p 2, we further bound (38) by X

i I |λi(ux ux )i|2 |λi(ux v)i|2p 2 + |λi(ux v)i|2p 2

max i I {|λi(ux ux )i|p} X

i I |λi(ux ux )i|2 p |λi(ux v)i|2p 2 + |λi(ux v)i|2p 2 .

We can then proceed as X

i I |λi(ux ux )i|2 p |λi(ux v)i|2p 2 + |λi(ux v)i|2p 2

i I λp i |(ux ux )i|2 p max{|(ux v)i|2p 2, |(ux v)i|2p 2}

i I λp i |(ux ux )i|p ! 2 p

i I λp i max{|(ux v)i|p, |(ux v)i|p}

Λ(ux ux ) 2 p I,p Λ(ux v) p I,p + Λ(ux v) p I,p 2p 2

For the ℓp-norms in the preceding line, we remind the readers that they have been restricted to the indices in I and do not refer to the ℓp-norm of the entire vector.

Since ux v = (u x v) + (ux u x), by our assumptions,

Λ(ux v) p I,p 2p 1( Λ(u x v) p I,p + Λ(ux u x) p I,p)

α Λ(u x v) p p + α F

Similarly, Λ(ux ux ) I,p Λ(ux v) I,p + Λ(ux v) I,p (αF) 1 p . It follows that

ρ(x, x )2 Λ(ux ux ) p I, (αF)

p αF Λ(ux ux ) p I, . (40)

When p > 2, we use the fact that |zi|2p 2 |zi|p z p 2 for a vector z and so we can proceed from (38) as X

i I |λi(ux ux )i|2 |λi(ux v)i|2p 2 + |λi(ux v)i|2p 2

i I Λ(ux ux ) 2 I, (|λi(ux v)i|p Λ(ux v) p 2 I, + |λi(ux v)i|p Λ(ux v) p 2 I, )

= Λ(ux ux ) 2 I, Λ(ux v) p I,p Λ(ux v) p 2 I, + Λ(ux v) p I,p Λ(ux v) p 2 I,

αF Λ(ux ux ) 2 I, Λ(ux v) p 2 I, + Λ(ux v) p 2 I, by (39)

αF Λ(ux ux ) 2 I, Λ(ux v) p 2 I,p + Λ(ux v) p 2 I,p .

Λ(ux v) I,p Λ(ux u x) I,p + Λ(u x v) I,p (αF) 1 p

we have ρ(x, x )2 (αF)2 2

p Λ(ux ux ) 2 I, . (41)

Combining (40) and (41) yields

ρ(x, x ) (αF)(1 1

2 Λ(ux ux )

p 2 1 I, (42)

and our next task to upper bound Λ(ux ux ) I, .

Published as a conference paper at ICLR 2025

Recall that wi 2d/n, we have by the Lipschitz condition and Lemma 5(d),

|λi(ux ux )i| |(Λ(Ax Ax ))i| w

1 p i d 1 2 1 2 p ΛA(x x ) p

p d 1 2 1 2 p ΛA(x x ) p. (43)

Plugging this result into (42) immediately leads to

ρ(x, x ) K d 1 2 1 2 p ΛA(x x )

p 2 1 p . (44)

Alternatively,

|λi(ux ux )i| d

p |Λii(Ax Ax )i|

wi |Λii(Ax Ax )i|

since wi > εpd

p ΛA(x x ) w,q recall that x w,p = (

i=1 wi|xi|q)1/q

ρ(x, x ) K W 1

as claimed.

Finally, we bound Diam(T, ρ). By the definition of T and (44), we have that ΛA(x x ) p p R. The claimed upper bound on Diam(T, ρ) follows immediately.

B.3.4 BOUNDING DUDLEY S INTEGRAL

In this section, we will prove (37), i.e.

Z Diam(T,ρ)

log N(T, ρ, r)dr α Γ poly log n,

where Γ is as defined in (21). To further simplify the notation, let

2 1 p 2, θ = (1 1

log N(T, ρ, r) log N(R 1 p Bp(ΛA), Kdγ W 1

p ΛA( ) ϕ w,p, r) by Lemma 20

= log N(Bp(ΛA), W 1

p ΛA( ) w,p, 1

R 1 p ( r Kdγ ) 1 ϕ )

= log N(Bw,p(E), w,p, 1

R 1 p dγ ( r

K ) 1 ϕ ) since γ/ϕ = γ

where E = colspace(W 1

p ΛA) and is endowed with norms w,p for p 1.

Now, we shall split the integral domain into two parts [0, λ] and [λ, Diam(T, ρ)] for some λ 1 2KRβdγ to be determined later (note that ϕ = pβ). Note that when r λ, we have (r/K)1/ϕ/(R 1 p dγ) ( 1

Published as a conference paper at ICLR 2025

By Lemma 8 Case 1, we have (letting λ = λ/(KRβdγ))

log N(T, ρ, r)dr Z λ

log N(Bw,p(E), w,p, 1

R 1 p dγ ( r

K ) 1 ϕ )dr

log N(Bw,p(E), w,p, s 1 ϕ )KdγRβds

d KdγRβ Z λ

d KdγRβ λ log 1

d log dγRβK

To handle the integral over [λ, Diam(T, ρ)], we bound

log N(T, ρ, r) log N(R 1 p Bp(ΛA), K W 1

p ΛA( ) ϕ w,q, r) by Lemma 20

= log N(Bp(ΛA), W 1

p ΛA( ) w,q, 1

= log N(Bw,p(E), w,q, 1

where E = colspace(W 1

p ΛA) and is endowed with norms w,q for q 1. We further divide the estimates into two cases. For p [1, 2], we invoke Lemma 8 Case 2 and obtain that

Z Diam(T,ρ)

log N(T, ρ, r)dr Z Diam(T,ρ)

log N(Bw,p(E), w,q, 1

K ) 2 p )dr

log d Z Diam(T,ρ)

log d log Diam(T, ρ)

For p > 2, we invoke Lemma 8 Case 3 and obtain that

Z Diam(T,ρ)

log N(T, ρ, r)dr Z Diam(T,ρ)

log N(Bw,p(E), w,q, r

R 1 p K )dr

εd R 1 p K Z Diam(T,ρ)

εd R 1 p K log Diam(T, ρ)

Combining (47) and (48) yields

Z Diam(T,ρ)

log N(T, ρ, r)dr dγRβK r

log d log Diam(T, ρ)

Recall that by Lemma 20,

Diam(T, ρ) dγKRβ, where K = (αF)θ d

Published as a conference paper at ICLR 2025

Combining (46) and (49) and taking λ = (dγ 1

2dγRβK), we have Z Diam(T,ρ)

log N(T, ρ, r)dr λ

d log dγRβK

λ + dγRβK r

log d log dγKRβ

dγRβK log 5 4 d r

= α Γ log 5 4 d r

B.4 REMOVING THE DEPENDENCE ON n

In this section, we reduce the log n factors in Theorem 12 to log(d/ε) factors, thereby proving Theorem 1. This is achieved by first applying a sampling matrix S to reduce the dimension of the regression problem from n to poly(d/ε) before invoking Algorithm 3; see Algorithm 2 for the full algorithm. The sampling matrix S uses a larger sampling rate α , which allows for controlling the error in Lemma 14 via Bernstein s inequality with a simple net argument instead of the chaining argument or Dudley s integral and thus avoiding the log n factor from entropy estimates.

Recall that, in Appendix B.1, we introduce the matrix Λ to ensure that the Lewis weights are bounded uniformly. We will include the matrix Λ in our proof and abuse the notations by dropping the prime mark as indicated in Appendix B.1.

The following is a weaker version of Lemma 14 for reducing n to poly(d/ε).

Lemma 21. Let A Rn d, v Rn, f Lip1, ε > 0, Λ Rn n, α [0, 1], S Rn n and R R be as defined in Lemma 14 and satisfy the same constraints. Suppose that 0 α 1 such that αn (d( p

2 1)+1/εp+2) log(1/ε) and R ΛAx p p Λ(f(Ax ) v) p p. Let T = x Rd ΛAx p p R . When conditioned on

SΛ(f(Ax ) v) p p R,

it holds with probability at least 0.99 that

( SΛ(f(Ax) v) p p SΛ(f(Ax ) v) p p) ( Λ(f(Ax) v) p p Λ(f(Ax ) v) p p)

where C is an absolute constant.

Proof. Recall that the error bound in Lemma 14 consists of three terms. By the same proofs as in Lemmas 18 and 19, the first two terms remain the same, both of which are now bounded by CεRp under our assumptions. The rest of the proof is devoted to deriving a similar bound for the third term. Recall that we need to upper bound

i J Sp iiξi(zx)i

where zx is as defined in (35) and {ξi} are independent Rademacher variables. We shall use a net argument here.

Fix an x T. Let Wi = Sp iiξi(zx)i, then EWi = 0 and

α|(Λ(f(Ax) v)p i (Λ(f(Ax ) v)p i |

α 2p 1(|Λii(f(Ax) f(Ax ))i|p + |Λii(f(Ax ) v)i|p) + |Λii(f(Ax ) v)i|p

α (|Λii(Ax Ax )i|p + |Λii(f(Ax ) v)i|p) .

Published as a conference paper at ICLR 2025

Now we use (43) for the first term in the brackets and the definition of G in (32) for the second term. We proceed as

2 1) 0 ΛA(x x ) p p + 1

n R 1 αnεp d p 2 1R | {z } :=

Next we bound E(P

i J W 2 i ).

i J W 2 i = X

i J ES2p ii (zx)2 i X

1 α max i |(zx)i| |(zx)i| X

i J |(zx)i|.

Note that X

i J |(zx)i| Λ(f(Ax) v) p p + Λ(f(Ax ) v) p p

2p 1( Λ(f(Ax) f(Ax )) p p + Λ(f(Ax ) v) p p) + Λ(f(Ax ) v) p p 2p 1 ΛA(x x ) p p + (2p 1 + 1) Λ(f(Ax ) v) p p ΛAx p p + ΛAx p p + Λ(f(Ax ) v) p p R.

i J W 2 i R.

It follows from Bernstein s inequality that

2 exp c η2R2

2 exp c η2R

exp Cd log 1

provided that

( d p 2 +1 log 1

η R/(αnεp) p > 2 d2R log 1

η/(αnεp) 1 p < 2

To summarize, we have shown that when (51), for each fixed x T that with probability at least 1 δ (where δ = exp( Cd log(1/η))), it holds

i J Sp iiξi(zx)i

To obtain an upper bound for the supremum over x T, we employ a standard net argument. Let N T be an (ηR 1 p )-net of T such that |N| (3/η)d. By a union bound, we have with probability at least 1 |N|δ 0.99 that

i J Sp iiξi(zx)i

For an ΛAx T, there exists ΛAy N such that ΛA(x y) p ηR 1 p . Thus

i J Sp iiξi(zx)i

i J Sp iiξi(zy)i

i J Sp iiξi(zx zy)i

ηR + sup x T

i J Sp iiξi(zx zy)i

Published as a conference paper at ICLR 2025

We bound the error term by H older s inequality as

i J Sp iiξi(zx zy)i

i J Sp ii|(zx zy)i|

i J Sp ii||Λii(ux v)i|p |Λii(uy v)i|p|

i J Sp ii|Λii(ux uy)i| |Λii(ux v)i|p 1 + |Λii(uy v)i|p 1

i J Sp ii|Λii(Ax Ay)i|p ! 1

i J Sp ii|Λii(ux v)i|p !1 1

i J Sp ii|Λii(uy v)i|p !1 1

SΛA(x y) p( SΛ(f(Ax) v) p 1 p + SΛ(f(Ay) v) p 1 p )

Using the the subspace embedding property of S,

SΛA(x y) p 2 ΛA(x y) p 2ηR 1 p

SΛ(f(Ax) v) p 1 p SΛ(f(Ax) f(Ax )) p 1 p + SΛ(f(Ax ) v) p 1 p SΛA(x x ) p 1 p + SΛ(f(Ax ) v) p 1 p ΛA(x x ) p 1 p + SΛ(f(Ax ) v) p 1 p

i J Sp iiξi(zx zy)i

i J Sp iiξi(zx)i

and the claimed result follows from setting η ε.

To prove that the output ˆx of Algorithm 2 satisfies (20), let x Rd be

x := arg min x Rd S Λ(f(Ax) b) p p + ε2 ΛAx p p

where Λ is the matrix ensuring the Lewis weights of ΛA are uniformly bounded. Note that we set the regularized parameter to be ε2 instead of ε. We highlight that this x is for the purpose of analysis and we do not actually compute it in the algorithm. From now on, we set

mεp 2 poly log m and α d( p

n(ε4)p+2 log(1/ε)

where m is the number of nonzero rows in S A which is the same as the m defined in Algorithm 2.

Note that our choice of α implies that m d( p

ε4p+8 log(1/ε) = poly(d, 1

ε). We preview that when we use Lemma 21 we aim for the error of Cε2R in (50). Combining with the regularized parameter ε2, we set ε4 in α . Recall that our goal is to simply reduce n to poly( d

ε) and thus these choices of the exponents of ε may not be optimized. Nonetheless, they are sufficient to achieve our objective.

We begin with using Lemma 13 with x = x and S Λ as the matrix ensuring the Lewis weights of S ΛA are uniformly bounded. We now verify the conditions in Lemma 13. By Appendix D, the

Published as a conference paper at ICLR 2025

Lewis weights of S ΛA are uniformly bounded by O(d/m) with probability at least 0.99. Clearly, the output of Algorithm 2 satisfies

ˆx = arg min x Rd SS Λ(f(Ax) b) p p + ε S ΛAx p p

and, by the optimality of x , we have

S Λ(f(Ax ) b) p p S Λ(f(Aˆx) b) p p ε2 ΛAˆx p p (52)

We need to upper bound ΛAˆx p p. By the optimality of ˆx, we have

SS Λ(f(Aˆx) b) p p + ε S ΛAˆx p p SS Λ(f(Ax ) b) p p + ε S ΛAx p p which implies

S ΛAˆx p p 1

ε SS Λ(f(Ax ) b) p p + S ΛAx p p. (53)

Since S is an ℓp-subspace embedding matrix for ΛA with constant distortion with probability 0.99 because of our choice of α and we condition on it from now on, we have 1 2 ΛAˆx p p S ΛAˆx p p. (54)

Also, by Markov inequality, with probability at least 0.99, we have

SS Λ(f(Ax ) b) p p 100 S Λ(f(Ax ) b) p p. (55)

Plugging (54) and (55) into (53), we have

ε S Λ(f(Ax ) b) p p + S ΛAx p p.

and when we further plug this into (52) we have

S Λ(f(Ax ) b) p p S Λ(f(Aˆx) b) p p

ε S Λ(f(Ax ) b) p p + S ΛAx p p)

= ε ( S Λ(f(Ax ) b) p p + ε S ΛAx p p)

which completes the condition verification for Lemma 13.

By Lemma 13, with probability 0.99, we have

|( SS Λ(f(Aˆx) b) p p SS Λ(f(Ax ) b) p p) ( S Λ(f(Aˆx) b) p p S Λ(f(Ax ) b) p p)|

ε ( S Λ(f(Ax ) b) p p + S ΛAx p).

By rearranging the terms, we have

S Λ(f(Aˆx) b) p p S Λ(f(Ax ) b) p p SS Λ(f(Aˆx) b) p p SS Λ(f(Ax ) b) p p + ε ( S Λ(f(Ax ) b) p p + S ΛAx p)

ε S ΛAx p p + ε ( S Λ(f(Ax ) b) p p + S ΛAx p) by the optimality of ˆx

ε ( S Λ(f(Ax ) b) p p + S ΛAx p) (56)

Now, we would like to use Lemma 21 with R 1

εOPT + ΛAx p p for x = ˆx and hence we need to verify ˆx T. By the optimality of ˆx, we have

SS Λ(f(Aˆx) b) p p + ε S ΛAˆx p p SS Λ(f(Ax ) b) p p + ε S ΛAx p p which implies

S ΛAˆx p p 1

ε SS Λ(f(Ax ) b) p p + S ΛAx p p. (57)

Recall that we condition that S is an ℓp-subspace embedding matrix for ΛA with constant distortion. We have 1 2 ΛAˆx p p S ΛAˆx p p and S ΛAx p p 2 ΛAx p p. (58)

Published as a conference paper at ICLR 2025

Also, by Markov inequality, with probability 0.99, we have SS Λ(f(Ax ) b) p p 100 Λ(f(Ax ) b) p p = 100OPT. (59) Plugging (58) and (59) into (58), we have

εOPT + ΛAx p p which implies ˆx T.

By Lemma 21 with our choice of α , with probability 0.99, we have ( S Λ(f(Aˆx) b) p p S Λ(f(Ax ) b) p p) ( Λ(f(Aˆx) b) p p Λ(f(Ax ) b) p p)

εOPT + ΛAx p p)

ε (OPT + ΛAx p p). (60) By the optimality of x , we have S Λ(f(Ax ) b) p p S Λ(f(Ax ) b) p p + ε2 ΛAx p p which implies S Λ(f(Aˆx) b) p p S Λ(f(Ax ) b) p p S Λ(f(Aˆx) b) p p S Λ(f(Ax ) b) p p + ε2 ΛAx p p ε ( S Λ(f(Ax ) b) p p + S ΛAx p) + ε2 ΛAx p p by (56).

By rearranging the terms in (60), we have Λ(f(Aˆx) b) p p Λ(f(Ax ) b) p p ε ( S Λ(f(Ax ) b) p p + S ΛAx p + OPT + ΛAx p) (61)

It means that we need to upper bound the terms S Λ(f(Ax ) b) p p and S ΛAx p. For S Λ(f(Ax ) b) p p, we have

S Λ(f(Ax ) b) p p S Λ(f(Ax ) b) p p + ε2 ΛAx p p by the optimality of x

100OPT + ε2 ΛAx p p, (62)

where the last inequality holds with probability 0.99 by Markov inequality. For S ΛAx p, we have S ΛAx p ΛAx p recall that S is an ℓp subspace embedding. To further bound the term ΛAx p, we have

ε2 S Λ(f(Ax ) b) p p + ΛAx p p by the optimality of x

ε2 OPT + ΛAx p p by Markov inequality with probability 0.99.

Note that this bound is not enough to finish our proof. However, it implies that x T where T is the set defined in Lemma 21 with R = 100

ε2 OPT + ΛAx p p. By Lemma 21 with our choice of α , with probability 0.99, we have ( S Λ(f(Ax ) v) p p S Λ(f(Ax ) v) p p) ( Λ(f(Ax ) v) p p Λ(f(Ax ) v) p p)

ε2 OPT + ΛAx p p)

ε2 (OPT + ΛAx p p). Then, we have

S Λ(f(Ax ) b) p p S Λ(f(Ax ) b) p p

+ ΛAx p p by the optimality of x

Λ(f(Ax ) v) p p Λ(f(Ax ) v) p p + ε2 (OPT + ΛAx p p) + ΛAx p p

OPT + ΛAx p p by the optimality of x . (63) Plugging (62) and (63) into (61), we have Λ(f(Aˆx) b) p p OPT ε (OPT + ΛAx p). This completes the proof for the query complexity without dependence on n. The overall failure probability is at most 0.1 in the above argument.

Published as a conference paper at ICLR 2025

C LOWER BOUND

C.1 CASE OF p [1, 2]

By Yao s minimax theorem, it suffices to show the following theorem. Theorem 22. Suppose that p 1, ε > 0 is sufficiently small and n p (d log d)/ε2. There exist a deterministic function f Lip1, a deterministic matrix A Rn d and a distribution over b Rn such that the following holds: every deterministic algorithm that outputs ˆx Rd which with probability at least 4/5 over the randomness of b satisfies (20) must make Ω d/ε2 queries to the entries of b.

We remark that the lower bound holds for all p 1, and is tight up to logarithmic factors for p [1, 2]. To prove Theorem 22, we reduce Problem 23 below to our problem. Problem 23. Suppose that 0 < ε < 1, d is a positive integer, m (log d)/ε2 and n = 2md. Let

and v = 2 3

Let D0 be the distribution on the (2m)-dimensional vector b such that, for i = 1, . . . , m, b 2i 1 b 2i

= u with probability 1

2 + ε v with probability 1

and D1 be the distribution on the (2m)-dimensional vector b such that, for i = 1, . . . , m, b 2i 1 b 2i

= u with probability 1

2 ε v with probability 1

Let b be the n-dimensional random vector formed by concatenating d i.i.d. random vectors b(1), . . . , b(d), where each b(i) is drawn from D0 with probability 1/2 and from D1 with probability 1/2.

Given a query access to the entries of b, we would like to, with probability at least 2/3, correctly identify whether b(i) is drawn from D0 or D1 for at least 2d/3 indices i.

By Lemma 9 and Lemma 11, any deterministic algorithm that solves this problem requires Ω d/ε2

queries to the entries of b.

Now, we construct the reduction. Let f be the function

2 if x 6 x 4 if 6 x 4 0 if 4 x 0 1 2x if 0 x

Let a be the (2m)-dimensional vector such that a2i 1 a2i

for i = 1, . . . , m

and A be the n d block-diagonal matrix whose diagonal blocks are the same a R2m 1.

Given a deterministic algorithm A that takes f, A, ε and a query access to the entries of b as inputs and returns ˆx Rd satisfying (20), we claim that ˆx can be used to solve Problem 23. This claim is proved in Lemma 24. Lemma 24. Let f, A, b be as specified above. There exists K = K(p), a constant only depending on p such that given an ˆx Rd satisfying

f(Aˆx) b p p (1 + ε

we can, with probability at least 99/100 (over the randomness of b), identify whether b(i) is drawn from D0 or D1 for at least 2d/3 indices i.

Published as a conference paper at ICLR 2025

We need the next lemma, whose proof is postponed to Appendix C.3, to prove Lemma 24. Lemma 25. Let b be a (2m)-dimensional vector in which b 2i 1 b 2i

= u or v, i = 1, . . . , m.

(a) it holds for x 0 that f(ax) b p p f( 6a) b p p, and

(b) it holds for x 0 that f(ax) b p p f(6a) b p p.

Now we are ready to prove Lemma 24.

Proof of Lemma 24. To prove the statement, we first give a bound for OPT. We have

OPT = min x Rd f(Ax) b p p =

i=1 min xi R f(axi) b(i) p p by the structure of A (64)

and hence we can look at each term minxi R f(axi) b p p individually. By Lemma 25, we have

min xi R f(axi) b(i) p p = min{ f( 6a) b(i) p p, f(6a) b(i) p p}. (65)

For i = 1, . . . , d, let ki be the number of occurrences of u = [ 3 2 ] in b(i). Recall that m (log d)/ε2. By choosing the hidden constant to be large enough, we have by a Chernoff bound that for every i = 1, . . . , d, with probability at least 1 1 100d,

2 + (1 β)εm, m

2 + (1 + β)εm] if b(i) is drawn from D0 [ m

2 (1 + β)εm, m

2 (1 β)εm] if b(i) is drawn from D1 , (66)

where β > 0 is a constant to be determined. Taking a union bound, with probability at least 99/100, every ki satisfies this condition. We condition on this event below.

f(6a) b(i) p p = 2(m ki) and f( 6a) b(i) p p = 2ki

By (66), if b(i) is drawn from D0, we have

m 2(1 + β)εm f(6a) b(i) p p m 2(1 β)εm

f( 6a) b(i) p p m + 2(1 β)εm.

Similarly, if b(i) is drawn from D1, we have

f(6a) b(i) p p m + 2(1 β)εm

m 2(1 + β)εm f( 6a) b(i) p p m 2(1 β)εm.

By plugging them into (65) and (64), we have

OPT d (m 2(1 β)εm) .

Now, suppose that a solution ˆx satisfies

f(Aˆx) b p p (1 + ε

d (m 2(1 β)εm) + ε

md 2(1 β) Cp

md (2 3β)εmd,

Published as a conference paper at ICLR 2025

provided that K Cp/β. Here Cp is a constant that depends only on p.

We declare b(i) is drawn from D0 if ˆxi > 0 and from D1 otherwise. Suppose that our declaration is wrong on ℓindices, then by Lemma 25, f(Aˆx) b p p ℓ (m + 2(1 β)εm) + (d ℓ) (m 2(1 + β)εm)

= md 2(1 + β)εmd + 4εℓm. Therefore, md 2(1 + β)εmd + 4εℓm md (2 3β)εmd which implies that

We can conclude that we have used an approximate solution ˆx to deduce the distribution of b(i) for at least (1 5β/4)d indices of i = 1, . . . , d. Choosing β = 4/15 and K = 4Cp completes the proof of Lemma 24.

To finish the proof of Theorem 22, by Lemma 24, with probability 1 1/100 1/5 > 2/3, we can correctly identify whether b(i) is drawn from D0 or D1 for at least 2d/3 indices i, i.e. we solve Problem 23. Hence, we conclude that A must make Ω d/ε2 queries to the entries of b.

C.2 CASE OF p 2

By Yao s minimax theorem, it suffices to show the following theorem. Theorem 26. Suppose that p 1, ε > 0 is sufficiently small and n p d/εp. There exist a deterministic function f Lip1, a deterministic matrix A Rn d and a distribution over b Rn such that the following holds: every deterministic algorithm that outputs ˆx Rd which with probability at least 2/3 over the randomness of b satisfies (20) must make Ω d/εp queries to the entries of b.

We remark that the lower bound holds for all p 1. To prove Theorem 26, we reduce Problem 27 below to our problem. Problem 27. Suppose that 0 < ε < 1, d is a positive integer, m = 1/εp and n = 2md. Let v be the 2m-dimensional vector whose entries are all 1, i.e. v = [1, . . . , 1] , D0 be the uniform distribution on {v + (1/ε) e1, . . . , v + (1/ε) em} and D1 be the uniform distribution on {v + (1/ε) em+1, . . . , v + (1/ε) e2m} where e1, . . . , e2m are the canonical basis vectors in R2m. Let b be the n-dimensional random vector formed by concatenating d i.i.d. random vectors b(1), . . . , b(d), where each b(i) is drawn from D0 with probability 1/2 and from D1 with probability 1/2, i.e. each b(i) is an all one vector with a value 1/ε planted at a uniformly random entry.

Given a query access to the entries of b, we would like to, with probability at least 2/3, correctly identify whether b(i) is drawn from D0 or D1 for at least 2d/3 indices i.

By Lemma 10 and Lemma 11, any deterministic algorithm that solves this problem requires Ω(d/εp) queries to the entries of b.

Now, we construct the reduction. Let f be the function

0 if x 0 x if 0 x 1 1 if 1 x.

Let a be the 2m-dimensional vector such that

ai = 1 for i = 1, . . . , m 1 for i = m + 1, . . . , 2m

and A be the n d block-diagonal matrix whose diagonal blocks are the same a R2m 1.

Given a deterministic algorithm A that takes f, A, ε and a query access to the entries of b as inputs and returns ˆx Rd satisfying (20), we claim that ˆx can be used to solve Problem 27. This claim is proved in Lemma 28.

Published as a conference paper at ICLR 2025

Lemma 28. Let f, A, b be as specified above. There exists K, a constant, such that given an ˆx Rd satisfying f(Aˆx) b p p (1 + ε

we can identify whether b(i) is drawn from D0 or D1 for at least 2d/3 indices i.

We need the next lemma, whose proof is postponed to Appendix C.3. Lemma 29. Let b be a 2m-dimensional vector in which all entries are 1 except one of them is 1+ 1

(a) it holds for x 0 that f(ax) b p p f( a) b p p, and

(b) it holds for x 0 that f(ax) b p p f(a) b p p.

Now we are ready to prove Lemma 28.

Proof of Lemma 28. To prove the statement, we first give a bound for OPT. We have

OPT = min x Rd f(Ax) b p p =

i=1 min xi R f(axi) b(i) p p (67)

and hence we can look at each term minxi R f(axi) b p p individually. By Lemma 29, we have

min xi R f(axi) b(i) p p = min{ f( a) b(i) p p, f(a) b(i) p p}. (68)

Recall that m = 1 εp . For i = 1, . . . , d, if b(i) is drawn from D0, we have

f(a) b(i) p p = m + 1

εp = 2m and f( a) b(i) p p = (1 + 1

ε)p + m 1 (2 + ε)m

and, if b(i) is drawn from D1, we have

f( a) b(i) p p = m + 1

εp = 2m and f(a) b(i) p p = (1 + 1

ε)p + m 1 (2 + ε)m.

By plugging them into (68) and (67), we have

Now, suppose that a solution ˆx satisfies

f(Aˆx) b p p (1 + ε

K ) (2dm) + ε

We declare that b(i) is drawn from D0 if ˆxi > 0 and from D1 otherwise. Suppose that our declaration is wrong on ℓindices, then by Lemma 28,

f(Aˆx) b p p ℓ(2 + ε)m + (d ℓ)(2m) = 2dm + εℓm.

2dm + εℓm 2dm + 4εdm

which implies, when K = 12, that

completing the proof of Lemma 28.

To finish the proof of Theorem 26, by Lemma 28, with probability 2/3, we can correctly identify whether b(i) is drawn from D0 or D1 for at least 2d/3 indices i, i.e. we solve Problem 27. Hence, we conclude that A must make Ω d/εp queries to the entries of b.

Published as a conference paper at ICLR 2025

C.3 OMITTED PROOFS

Proof of Lemma 25. Suppose that b contains k occurrences of u = [ 3 2 ] and θ = k/m. Then

1 m f(ax) b p p = θ(g(x))p + (1 θ)(h(x))p,

p and h(x) =

It suffices to show that both h(x) and g(x) attain a local minimum at x = 6 when x 0 and at x = 6 when x 0.

Now, we view the 2-dimensional vectors as points in R2. For any x R, let ζ(x) be the point (f(x), f( x)). Also, let γ R2 be the locus of ζ(x) = (f(x), f( x)), i.e. γ := { ζ(x) | x R }. It has a positive branch γ+ := { ζ(x) | x 0 } and a negative branch γ := { ζ(x) | x 0 }.

We first consider x 0. Note that ζ( 6) = (2, 3) is on γ and x = 6 is the only value such that ζ(x) = (2, 3). Hence, we immediately have that h(x) = 0 attains a local minimum at x = 6. For g(x), consider the smallest ℓp-ball centered at (3, 2) that touches (2, 3) on its boundary. It is not difficult to verify that this ℓp-ball does not intersect γ at any other point. (Figure 2 provides a geometric intuition.)

Since γ+ and γ are symmetric about y = x, we can show the symmetric result for x 0.

Figure 2: Illustration of the locus γ (left), the minimizers when x 0 (middle) and the minimizer when x 0 (right)

Proof of Lemma 29. We would like to show that the function f(ax) b p p attains a local minimum at x = 1 when x 0 and at x = 1 when x 0. Note that, by the construction of f, f(x) = f(1) = 1 for all x 1 and f(x) = f( 1) = 0 for all x 1. Therefore, we now restrict our domain to be x [ 1, 1].

Suppose that the index of the entry whose value is 1 + 1

ε is in {1, . . . , m}. Then

f(ax) b p p = (1 + 1

ε f(x))p + (m 1)(1 f(x))p + m(1 f( x))p.

Note that we drop the absolute value sign because f(x) 1 on R. When 1 x 0, we have

f(ax) b p p = (1 + 1

ε)p + (m 1) + m(1 + x)p

ε)p + (m 1).

where the equality holds if x = 1. When 0 x 1, we have

f(ax) b p p = (1 + 1

ε x)p + (m 1)(1 x)p + m

Published as a conference paper at ICLR 2025

where the equality holds if x = 1.

Similarly, we can prove the same result when the index of the entry whose value is 1 + 1

ε is in {m + 1, . . . , 2m}.

D LEWIS WEIGHTS OF ROW-SAMPLED MATRIX

In this section, we shall show the following theorem.

Theorem 30. Suppose that A Rn d has uniformly bounded Lewis weights wi(A) d/n. Let S be an n n diagonal matrix in which the diagonal elements are i.i.d. α 1/p Ber(α) for some sampling rate α (0, 1). When αn p d log d when p < 4 or αn p d2 log d when p 4, it holds with probability at least 0.99 that wi(SA) d/m, where m is the number of nonzero rows of S.

This theorem is similar to (Chen et al., 2022, Lemma A.3), where the sampling rates are proportional to Lewis weights and no assumptions on the bounds of wi(A) were made. Our proof is also similar.

Proof. Let w1, . . . , wn denote the Lewis weights of A and suppose that S = diag{σ1, . . . , σn}. We first show that with probability at least 0.995,

p i aia i X

p (σiai)(σiai) 3

p i aia i .

Here, the sign denotes semi-positive definiteness. We prove this claim by the matrix Bernstein inequality. Notice that X

p (σiai)(σiai) = X

p i aia i ,

where ξi are i.i.d. Ber(α) variables. Let M = P

p i aia i , a i = M 1

2 ai, and Xi =

p i a i(a i) w 1 2

p i a i(a i) , then EXi = 0 and, by the definition of Lewis weights, a i 2 = w2/p i . We bound

p i a i 2 2 = 1

p ) i ai 2 2a i(a i) 2 = 1

p i a i(a i) 2

p i a i(a i) 2

It follows from matrix Bernstein inequality that

2d exp cαnη2

provided that αn η 2d log d. This shows that

p i a i(a i) (1 + η)I,

which is equivalent to our claim.

Published as a conference paper at ICLR 2025

When σi > 0,

p (σjaj)(σjaj)

1 1 η σ2 i a i

= 1 1 η σ2 i w

and, similarly,

p (σjaj)(σjaj)

(σiai) 1 1 + η

We take η to be a constant depending on p for p < 4 and η = 1/(Cp

d) for p 4. It then follows from (Cohen & Peng, 2015, Lemmas 5.3 and 5.4) that wi(SA) wi (A)/α d/(αn), where i is the index of the corresponding row in A.

By a Chernoff bound, with probability at least 0.995, m αn. The result then follows.

E ENTROPY ESTIMATES

In this section we provide a proof of Lemma 8 for completeness. The proof is decomposed into the following three lemmata, Lemma 31, Lemma 32 and Lemma 33.

Lemma 31. Suppose that A Rn d has full column rank and W is a diagonal matrix whose diagonal entries are the Lewis weights of A. It holds for p 1 that

log N(Bw,p(W 1/p A), w,p, t) d log 1

Proof. This is a standard result following from a standard volume argument, which we reproduce below for completeness. Suppose that E is the column space of W 1/p A and is endowed with norms w,p. Using the notation simplification defined in Appendix A, we denote by Bw,p the unit ball of E w.r.t. w,q, i.e. Bw,p = Bw,p(W 1/p A). Consider a maximal t-separation set N Bw,p, then the balls x + t

2Bw,p (x N) are contained in (1 + t

2)Bw,p and are nearly disjoint (intersection has zero volume). Hence P

x N vol(x + t

2Bw,p) vol((1 + t

2)Bw,p), that is, |N| vol( t

2Bw,p) vol((1 + t

2)Bw,p), leading to |N| (1 + 2/t)d. It is easy to check that N is a t-covering of (Bw,p, w,p) and it implies N(Bw,p(W 1/p A), w,p, t) |N| (1 + 2/t)d.

Lemma 32. Suppose that A Rn d has full column rank and W is a diagonal matrix whose diagonal entries are the Lewis weights of A. When 1 p 2 and q > 2, it holds that

log N(Bw,p(W 1/p A), w,q, t) q log d

Proof. Suppose that E is the column space of W 1/p A and is endowed with norms w,p and an inner product , w. We first have

log N(Bw,p, w,q, t) log N(Bw,p, w,2, λ) + log N Bw,2, w,q, t

Published as a conference paper at ICLR 2025

Recall that Bw,p = Bw,p(W 1/p A) and Bw,2 = Bw,2(W 1/p A). For the second term, we can apply Lemma 33 directly and obtain that

log N Bw,2, w,q, t

Next we deal with the first term.

We first consider the case 1 < p 2. Let p be the conjugate index of p and r p to be determined. Define θ [0, 1] by 1 p = 1 θ

For x, y Bw,2, by H older s inequality,

x y w,p x y 1 θ w,2 x y θ w,r 21 θ x y θ w,r.

This implies that

log N Bw,2, w,p , λ log N

Bw,2, w,r, 2 λ

where the last inequality follows from Lemma 33. Since

it follows that

log N Bw,2, w,p , λ 2

Choose r = p (log d),

log N Bw,2, w,p , λ 2

By duality (Artstein et al., 2004),

log N(Bw,p, w,2, λ) 2

log N(Bw,p, w,q, t) 2

Optimizing λ yields that

log N(Bw,p, w,q, t) qp/2r1 p/2

tp 1 p 1 + p

This completes the proof for 1 < p 2.

When p = 1, Maurey s empirical method gives that (using the fact that, by Lemma 5(c), w,2 w,1 in E)

log N(Bw,1, w,2, λ) log d

λ2 and thus

log N(Bw,1, w,q, t) log d

Optimizing λ yields that

log N(Bw,1, w,q, t) q log d d1/q

Published as a conference paper at ICLR 2025

Lemma 33. Suppose that A Rn d has full column rank and W is a diagonal matrix whose diagonal entries are the Lewis weights of A. When p, q 2, it holds that

log N(Bw,p(W 1/p A), w,q, t) d1 2

Proof. Suppose that E is the column space of W 1/p A and is endowed with norms w,p and an inner product , w. By Lemma 5(b), there exist u1, . . . , ud Rn such that

ui, ui w = 0, ui w,p = 1 and

j=1 u2 ij = 1 for any i, i = 1, . . . , n.

Recall that Bw,p = Bw,p(W 1/p A) and Bw,2 = Bw,2(W 1/p A). First, observe, by Lemma 5(c), that Bw,p d1/2 1/p Bw,2, thus

log N(Bw,p, w,q, t) log N Bw,2, w,q, t d1/2 1/p

and it suffices to show that

log N Bw,2, w,q, t qd2/q

Let q be the conjugate index of q, i.e. 1

q = 1. By dual Sudakov minorization,

log N Bw,2, w,q, t 1

Eg N(0,Id) sup x Bw,q

i=1 giui, x

By duality again,

Eg N(0,Id) sup x Bw,q

i=1 giui, x

= Eg N(0,Id)

w,q = Eg N(0,Id)

q Eg N(0,1)|g|q

Eg N(0,1)|g|q 1/q

i=1 u2 ij = 1