# online_active_regression__452a3a29.pdf Online Active Regression Cheng Chen * 1 Yi Li * 1 Yiming Sun * 1 Abstract Active regression considers a linear regression problem where the learner receives a large number of data points but can only observe a small number of labels. Since online algorithms can deal with incremental training data and take advantage of low computational cost, we consider an online extension of the active regression problem: the learner receives data points one by one and immediately decides whether it should collect the corresponding labels. The goal is to efficiently maintain the regression of received data points with a small budget of label queries. We propose novel algorithms for this problem under ℓp loss where p [1, 2]. To achieve a (1 + ϵ)- approximate solution, our proposed algorithms only require O(d/ poly(ϵ) log(nκ)) queries of labels, where n is the number of data points and κ is a quantity, called the condition number, of the data points. The numerical results verify our theoretical results and show that our methods have comparable performance with offline active regression algorithms. 1. Introduction Linear regression is a simple method to model the relationship between the data points in an Euclidean space and their scalar labels. A typical formulation is to solve the minimization problem minx Ax b p for A Rn d and b Rn, where each row Ai is a data point in Rd and bi is its corresponding scalar label. When p = 2, the linear regression is precisely the least-squares regression, which admits a closed-form solution and is thus a classical choice due to its computational simplicity. When p [1, 2), it is more robust than least-squares as the solution is less sensitive to outliers. A popular choice is p = 1 because the regression can be cast as a linear programme though other values of p *Equal contribution 1School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore. Correspondence to: Yi Li . Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s). are recommended depending on the distribution of the noise in the labels. Interested readers may refer to Section 1.3 of (Gonin & Money, 1989) for some discussion. One harder variant of linear regression is active regression (Sabato & Munos, 2014), in which the data points are easy to obtain but the labels are costly. Here one can query the label of any chosen data point and the task is to minimize the number of queries while still being able to solve the linear regression problem approximately. Specifically, one constructs an index set S [n] as small as possible, queries b S (the restriction of b on S) and computes a solution x based on A, S and b S such that A x b p p (1 + ϵ) min x Ax b p p. (1) For p = 2, the classical approach is to sample the rows of A according to the leverage scores. This can achieve (1) with large constant probability using |S| = O(d log d+d/ϵ) queries. Chen & Price (2019) reduced the query complexity to the optimal O(d/ϵ), based on graph sparsifiers. When p = 1, Chen & Derezinski (2021) and Parulekar et al. (2021) showed that O((d log d)/ϵ2) queries suffices with large constant probability, based on sampling according to Lewis weights. More recently, Musco et al. (2021b) solved the problem for all values of p with query complexity O(d/ϵ) for 1 p < 2 and O(dp/2/ϵp) for p > 2, where the dependence on d is optimal up to logarithmic factors. Another common setting of linear regression is the online setting, which considers memory restrictions that prohibit storing the inputs A and b in their entirety. In such a case, each pair of data points and their labels (i.e. each row of [A b]) arrives one by one, and the goal is to use as little space as possible to solve the linear regression problem. Again, the case of p = 2 has the richest research history, with the stateof-the-art results due to Cohen et al. (2020) and Jiang et al. (2022), which retain only O(ϵ 1d log d log(ϵ A 2 2)) rows of A (where A 2 denotes the operator norm of A). The idea of the algorithms is to sample according to the online leverage scores, which was first employed in (Kapralov et al., 2017). The online leverage score of a row is simply the leverage score of the row in the submatrix of A consisting of all the revealed rows so far. The algorithm of Jiang et al. (2022) is based on that of Cohen et al. (2020) with further optimized runtime. The case of p = 1 was solved by (Braverman et al., 2020), who generalized the notion of online leverage score to online Lewis weights and sampled the rows of A according to the online Lewis weights. In this paper, we consider the problem of online active regression, a combination of the two variants above. In a similar vein to (Cohen et al., 2020) and (Jiang et al., 2022), the rows of A arrive one by one, and upon receiving a row, one must decide whether it should be kept or discarded and whether to query the corresponding label, without ever retracting these decisions. The problem was considered by Riquelme et al. (2017), who assumed an underlying distribution of the data points together with a noise model of the labels and only considered ℓ2-regression. Here we do not make such assumptions and need to handle arbitrary input data. To the best of our knowledge, our work is the first to consider the online active regression in the general ℓp-norm. Our approach is largely based upon the existing techniques for online regression and active regression. A technical contribution in our work is to show that one can compress a fraction of rows in a matrix by sampling these rows according to their Lewis weights while preserving the Lewis weights of the uncompressed rows (see Lemma 4.5 for the precise statement), which may be of independent interest. Our Contributions. We show that the online active regression problem can be solved, attaining the error guarantee (1) with constant probability, using m = O(ϵ (2p+5)d log(nκOL(A))) queries for p [1, 2) (where κOL(A) is the online condition number of A, see Definition 2) and m = O(ϵ 9d log(n A 2/σ)) queries for p = 2 (where A 2 is the operator norm of A and σ the smallest singular value of the first d rows of A). Our algorithms are sublinear in space complexity, using m+O(ϵ 2d poly(log n)) words. The query complexity for p [1, 2) depends on log n and log κOL(A)), which are not present in the offline counterpart (Musco et al., 2021b). But this is not unexpected, given that the log n log κOL(A) factor appears in the sketch size for the ℓ1-subspace embedding under the sliding window model (Braverman et al., 2020). We also demonstrate empirically the superior accuracy of our algorithm to online uniform sampling on both synthetic and real-world data. We vary the allotted number of queries and compare the relative error in the objective function of the regression (with respect to the minimum error, namely minx Ax b p). For active ℓ1-regression, our algorithm achieves almost the same relative error as the offline active regression algorithm on both the synthetic and real-world data. For active ℓ2-regression, our algorithm significantly outperforms the online uniform sampling algorithm on both synthetic and real-world data and is comparable with the offline active regression algorithm on the synthetic data. 2. Preliminaries Notation. We use [n] to denote the integer set {1, . . . , n}. For a matrix A, we denote by A its Moore Penrose inverse. For two matrices A and B of the same number of columns, we denote by A B the vertical concatenation of A and B. A matrix S is a called a sampling matrix if each row and each column has at most one nonzero entry. Associated with S are indicator variables {(1S)i}i=1,...,n (where n is the number of columns of S) defined as follows. For each i, we define (1S)i = 1 if the i-th column of S is nonzero, and (1S)i = 0 otherwise. Suppose that A Rn d. We define the operator norm of A, denoted by A 2, to be max x 2=1 Ax 2. We also define an online condition number κOL(A) = A 2 maxi (A(i)) 2, where A(i) is the submatrix consisting of the first i rows of A. Suppose that A Rn d, b Rd and p 1. We define REG(A, b, p) to be an x Rd that minimizes Ax b p. We remark that when p > 1, the minimizer is unique. Lewis weights. A central technique to solve minx Ax b p is to solve a compressed version minx SAx Sb p, where S is a sampling matrix. This sampling is based on Lewis weights (Cohen & Peng, 2015), which are defined below. Definition 2.1 (Lewis weights). Suppose that A Rn d and p 1. The ℓp Lewis weights of A, denoted by w1(A), . . . , wn(A), are the unique real numbers such that wi(A) = (a i (A W 1 2/p A) 1ai)p/2, where W is the diagonal matrix with diagonal elements w1(A), . . . , wn(A) and ai is the i-th row of A. For notational convenience, when A has n rows, we also write wn(A) as wlast(A). The ℓ2 Lewis weight is also called the leverage score. Definition 2.2. Given p1, . . . , pn [0, 1], the rescaled sampling matrix S with respect to p1, . . . , pn is a random n n diagonal matrix in which Si,i = p 1/p i with probability pi and Si,i = 0 with probability 1 pi. Lemma 2.3 (Lewis weights sampling (Cohen & Peng, 2015)). Let A Rn d. Choose β = Θ(log(d/δ)/ϵ2) and p1, . . . , pn such that min{βwi(A), 1} pi 1. Let S be the rescaled sampling matrix with respect to p1, . . . , pn. Then it holds with probability at least 1 δ that (1 ϵ) Ax p SAx p (1 + ϵ) Ax p (i.e., S is an ϵ-subspace embedding for A in the ℓp-norm) and that the number of nonzero rows in S is O(β P i wi(A)) = O(βd). In the light of the preceding lemma, one can choose an ϵ-subspace embedding matrix S for [A b] and retain only the nonzero rows of S so that S has only O(d/ϵ2) rows and minx SAx Sb p = (1 ϵ) min Ax b p. The remaining question is how to compute the Lewis weights of a given matrix. Cohen & Peng (2015) showed that, for a given matrix A Rn d, the following iterations W (j) i,i a i A (W (j 1))1 2/p A 1 ai with the initial point W (0) = In, will converge to some diagonal matrix W, whose diagonal elements are exactly w1(A), . . . , wn(A). Definition 2.4 (Online Lewis Weights). Let p [1, 2) and A Rn d. The online ℓp Lewis weights, denoted by w OL 1 (A), . . . , w OL n (A), are defined to be w OL i (A) = wi(A(i)), where A(i) is the submatrix consisting of the first i rows of A. We shall need the Johnson-Lindenstrauss matrix and an assumption on the input matrix A for the online active ℓ2regression. Definition 2.5 (Johnson-Lindenstrauss Matrix). Let X Rd be a point set. A matrix J is said to be a Johnson Lindenstrauss matrix for X of distortion parameter ϵ (or, an ϵ-JL matrix for X) if (1 ϵ) x 2 2 Jx 2 2 (1 + ϵ) x 2 2 for all x X. It is a classical result (Kane & Nelson, 2014) that when |X| = T, there exists a random matrix J Rm d with m = O(ϵ 2 log(T/δ)) such that (i) J is an ϵ-JL matrix for T with probability at least 1 δ, (ii) each column of J contains O(ϵ 1 log(T/δ)) nonzero entries and (iii) J can be generated using O(log2(|T|/δ) log d) bits. 3. Algorithms and Main Results The high-level approach follows (Musco et al., 2021a) and we give a brief review below. We sample A twice but with different sampling parameters β, getting A of O(d log d) rows and A1 of O(d2 poly(ϵ 1 log d)) rows, respectively. We use A to solve minx Rd Ax b p, obtaining a constantfactor approximation solution xc. The problem is then reduced to solving minx Rd Ax z p with z = b Axc, for which we shall solve minx Rd A1 z1 p instead. Since A1 has Ω(d2) rows, we repeat the idea above and further subsample A1 twice with different sampling parameters, getting A2 of O(d log d) rows and A3 of O(d poly(ϵ 1 log d)) rows. The sampled matrix A2 is used to obtain a constantfactor approximation solution ˆxc to minx Rd A1x z1 p and A3 is used to solve minx Rd A1x ( z1 A1x c) p with a near-optimal solution x . The near-optimal solution to minx Rd A1x z1 p is then x = ˆxc + x . Finally, the solution to the original problem is x = xc + x. Note that in the algorithms, we use A to denote the nonzero rows of SA where S is the rescaled sampling matrix. Hence, the Algorithm 1 Online Active Regression for p (1, 2) Initialize: Let A(d), A(d) 1 , A(d) 2 , A(d) 3 be the first d rows of A and b(d) be the first d rows of b. 1: β Θ(log d) 2: β1 Θ(d/ϵ2+p) 3: β2 Θ(log d) 4: β3 Θ(log2 d log(d/ϵ)/ϵ2p+5) 5: Retain the first d rows of A 6: while there is an additional row at do 7: wt wt(A(t)) 8: pt min{β w(t), 1} 9: ( A(t), b(t)) SAMPLE(at, pt, A(t 1), b(t 1), p) 10: w1,t wt(A(t)) 11: p1,t min{β1 w1,t, 1} 12: Sample at with probability p1,t 13: if at is sampled then 14: A(t) 1 A(t 1) 1 a t p 1 p 1,t 15: w2,t wlast( A(t) 1 ) 16: p2,t min{β2 w2,t, 1} 17: ( A(t) 2 , b(t) 2 ) SAMPLE(atp 1 p 1,t , p2,t, A(t 1) 2 , b(t 1) 2 , p) 18: w3,t wlast( A(t) 1 ) 19: p3,t min{β3 w3,t, 1} 20: ( A(t) 3 , b(t) 3 ) SAMPLE(atp 1 p 1,t , p3,t, A(t 1) 3 , b(t 1) 3 , p) 21: end if 22: end while 23: xc REG( A, b, p) 24: z2 b2 A2xc 25: ˆxc REG( A2, z2, p) 26: z3 b3 A3xc 27: x REG( A3, z3 A3ˆxc, p) 28: x ˆxc + x 29: x xc + x 30: return x Algorithm 2 SAMPLE(at, pt, A(t 1), b(t 1), p) 1: Sample at with probability pt 2: if at is sampled then 3: Query bt 4: ( A(t), b(t)) ( A(t 1) a t p 1 p t , b(t 1) btp 1 p t ) 5: else 6: ( A(t), b(t)) ( A(t 1), b(t 1)) 7: end if sampled matrices A, A1, A2 and A3 are SA, S1A, S2S1A and S3S1A respectively. 3.1. The case p [1, 2) We present our main algorithm for p (1, 2] in Algorithm 1. The following is the guarantee of the algorithm. Theorem 3.1. Let A Rn d and b Rn. Algorithm 1 outputs a solution x which satisfies that A x b p min x Rd Ax b p (3) with probability at least 0.94 and makes O d log2 d ϵ2p+5 log d ϵ log d log n log κOL(A) ϵ log(nκOL(A)) queries overall in total. A major drawback of Algorithm 1 is the cost of calculating the online Lewis weights. Recall that the online Lewis weight of at is defined with respect to the first t rows of A. A na ıve implementation would require storing the entire matrix A, partly defying the purpose of an online algorithm. Furthermore, the iterative procedure described after Definition 2.1 takes O(log t) iterations to reach a constant-factor approximation to the Lewis weights (Cohen & Peng, 2015), where each iteration takes O(td2 + d3) time, which would become intolerable as t becomes large. To address this issue, we adopt the compression idea in (Braverman et al., 2020), which maintains O(log n) rescaled row-sampled submatrices of A, each having a small number of rows. The compression algorithm is presented in Algorithm 3. Algorithm 3 Compression algorithm for calculation of online Lewis weights Initialize: B0 contains the first d rows of A; B1, . . . , Blog n are empty matrices; Q = Θ(ϵ 2d log3 n). 1: β Θ(ϵ 2d log3 n) 2: while there is an additional row at do 3: B0 B0 at 4: if the size of B0 exceeds Q then 5: j the smallest index i such that Bi is empty 6: M Bi 1 Bi 2 B0 7: pi min{βwi(M), 1} for all i 8: S rescaled sampling matrix with respect to probabilities {pi}i 9: Bi SM 10: B0, B1, . . . , Bi 1 empty matrix 11: end if 12: end while With the compression algorithm for A which maintains B0, . . . , Blog n, we can replace Line 7 of Algorithm 1 with wt wlast(Blog n Blog n 1 B0). (4) Similarly, we run an additional compression algorithms for each of A1 and replace Lines 10, 15 and 18 with updates analogous to (4). In addition, we change the value of β and β = Θ(ϵ 2 log d log2 n) and β1 = Θ(ϵ 4 log d log4 n), respectively. By the construction of the blocks Bi, each Bi contains at most R = O(ϵ 2d log3 n) rows with probability at least 1 1/ poly(n), sufficient for taking a union bound over all the blocks throughout the process of reading all n rows of A. Hence we may assume that each block Bi always contains at most R rows. Now, wt is calculated to be the Lewis weight of a matrix of R = O(Q + R log n) = O(R log n) rows, which can be done in O((R d2 + d3) log R ) = O(ϵ 2d3 poly(log(n/ϵ))) time for a constant-factor approximation, where the dependence on n is only polylogarithmic. The remaining question is correctness and the following theorem is the key to proving the correctness. Theorem 3.2. Let A Rn d. With Algorithm 3 maintaining B0, . . . , Blog n, let wt be as in (4) for each t n. Then it holds with probability at least 1 1/ poly(n) that (1 ϵ)wt(A(t)) wt (1 + ϵ)wt(A(t)), t n, where A(t) be the submatrix consisting of the first t rows of A. The weights wt can be calculated in O(ϵ 2 d3 poly(log(n/ϵ))) time and Algorithm 3 needs O(ϵ 2d poly(log n)) words of space overall in total. The proof of Theorem Theorem 3.2 is deferred to Section 4.2. Now we can strengthen Theorem 3.1 as follows. Theorem 3.3. Let A Rn d and b Rn. When implemented using the compression technique as explained above, Algorithm 1 outputs a solution x which satisfies (3) with probability at least 0.94 o(1), making m = ϵ2p+5 log d ϵ log d log n log κOL(A) ϵ log(nκOL(A)) queries. Furthermore, it uses m + O( d ϵ2 poly(log n)) words of space overall in total. 3.2. The case p = 2 In this subsection, we assume that the first d rows of the input matrix A is not singular. Assumption 3.4. The minimal singular value of the first d rows of A is σ > 0. As mentioned in the preceding subsection, it is computationally expensive to compute Lewis weights in general. A special case is p = 2, where the Lewis weights are leverage scores and thus much easier to compute. In this case, wi(A) = a i (A A) 1ai, and correspondingly, the online Lewis weights become online leverage scores, which are w OL i (A) = a i ((A(i)) A(i)) 1ai. It is much easier to compute w OL i (A) in the online setting because one can simply maintain (A(i)) A(i) by adding aia i when reading a new Algorithm 4 Online Active Regression for p = 2 Initialize: Let A(d), A(d) 1 , A(d) 2 , A(d) 3 be the first d rows of A and b(d), b(d) 2 , b(d) 3 be the first d rows of b. Let x(d) c = REG( A(d), b(d), 2), z(d) 2 = z(d) 3 = b(d) A(d)x(d) c , ˆx(d) c = REG( A(d) 2 , z(d) 2 , 2) and x d = REG( A(d) 3 , z(d) 3 A(d) 3 ˆx(d) c , 2). Let G(d) = (( A(d)) A(d)) 1 and H(d) = A(d) G(d). Also let G(d) i = (( A(d) i ) A(d) i ) 1 and H(d) i = A(d) i G(d) i for i = 1, 2, 3. 1: β Θ(log d) 2: β1 Θ(d/ϵ4) 3: β2 Θ(log d) 4: β3 Θ((log2 d) log(d/ϵ)/ϵ9) 5: retain the first d rows of A 6: while there is an additional row at do 7: wt H(t 1)at 2 2 8: (x(t) c , A(t), b(t), G(t), H(t)) SAMPLEQUERY(at, b(t 1), , , A(t 1), β, wt, G(t 1),1) 9: w1,t H(t) 1 at 2 2 10: Sample at with pr. p1,t = min{β1 w1,t, 1} 11: if at is sampled then 12: A(t) 1 A(t 1) 1 a t p1,t 13: ( G(t) 1 , H(t) 1 ) UPDATE( at p1,t , , , , A(t 1) 1 , G(t 1) 1 ) 14: w2,t H(t) 2 at p1,t 2 2 15: (ˆx(t) c , A(t) 2 , b(t) 2 , G(t) 2 , H(t) 2 ) SAMPLEQUERY( at p1,t , b(t 1) 2 , x(t) c , , A(t 1) 2 , β2, w2,t, G(t 1) 2 ,2) 16: w3,t = H(t) 3 at p1,t 2 2 17: ( x (t), A(t) 3 , b(t) 3 , G(t) 3 , H(t) 3 ) SAMPLEQUERY( at p1,t , b(t 1) 3 , x(t) c , ˆx(t) c , A(t 1) 3 , β3, w3,t, G(t 1) 3 ,3) 18: end if 19: x(t) ˆx(t) c + x (t) 20: x(t) x(t) + x(t) c 21: end while 22: return x(t) row ai (viewed as a column vector). A na ıve implementation of this algorithm would require inverting a d d matrix at each step and we can further optimize the running time by noticing that ((A(i)) A(i)) 1 receives a rank-one update at each step. This is the approach taken by (Cohen et al., 2020) and (Jiang et al., 2022) for computing the online leverage scores in the online setting. Adopting this approach, we present our fast algorithm for p = 2 in Algorithm 4 and its guarantee below. Algorithm 5 SAMPLEQUERY(at, b(t 1), x(t) c , ˆx(t) c , A(t 1), β, wt, G(t 1), χ) in Algorithm 4 1: pt min{β(1 + ϵ)2 wt, 1} 2: Sample at with probability pt 3: if at is sampled then 4: A(t) A(t 1) a t pt 5: Query bt 6: if χ = 1 then 7: b(t) b(t 1) bt pt 8: else 9: b(t) b(t 1) bt p1,tpt 10: z(t) b(t) A(t)x(t) c 11: end if 12: (x(t), G(t), H(t)) UPDATE(at, b(t), ˆx(t) c , A(t), G(t 1)) 13: else 14: ( A(t), b(t)) ( A(t 1), b(t 1)) 15: (x(t), G(t), H(t)) (x(t 1), G(t 1), H(t 1)) 16: end if 17: return (x(t), A(t), b(t), G(t), H(t)) Algorithm 6 UPDATE(at, b(t), ˆx(t) c , A(t), G(t 1)) 1: g a t G(t 1)at/pt 2: G(t) G(t 1) 1 1+g G(t 1) ata t pt G(t 1) 3: st the number of rows in A(t) 4: Update the ϵ-JL matrix J(t+1) of size log n ϵ2 st 5: F (t) J(t+1) A(t) 6: H(t) F (t) G(t) 7: if b(t) = then 8: return ( G(t), H(t)) 9: else if ˆx(t) c = then 10: x(t) G(t) A(t) b(t) 11: else 12: x(t) G(t) A(t) ( b(t) A(t)ˆx(t) c ) 13: end if 14: return (x(t), G(t), H(t)) Theorem 3.5. Let A Rn d and b Rn. Suppose that A satisfy Assumption 3.4. With probability at least 0.94, Algorithm 4 makes ϵ9 log2 d log d queries in total and maintains for each T = d + 1, . . . , n a solution x(T ) which satisfies that A(T ) x(T ) b(T ) 2 (1 + ϵ) min x Rd A(T )x b(T ) 2 . With probability at least 0.97, Algorithm 4 runs in a total of ϵ2 nnz(A) log n time for processing the entire matrix A. Remark 3.6. The theoretical guarantees of Theorem 3.3 and Theorem 3.5 can be extended to δ failure probability with an additional log(1/δ) factor in the query complexities, using the same boosting procedure in (Musco et al., 2021a). Remark 3.7. We only consider the case p [1, 2] because adding an extra row to a matrix never increases the Lewis weights of existing rows of that matrix. This property is used in upper bounding the sum of online Lewis weights (see Lemmas 4.1 and 4.2 below). However, this property does not necessarily hold when p > 2 and we leave open the problem of upper bounding the sum of online Lewis weights in this case. 4. Proofs of the Main Results The framework of our Algorithm 4 and Algorithm 1 follows from the algorithm in (Musco et al., 2021a). Hence, in order to prove Theorem 3.5 and 3.3, it suffices to verify all the conditions and lemmata needed by the proof in (Musco et al., 2021a). (Small modifications are needed for p = 2 because the sampling matrices do not have independent rows and the details are postponed in Appendix B.) In particular, it suffices to show that (i) the online ℓp Lewis weights calculated in Algorithms 4 and 1 are within an absolute constant factor of the corresponding true ℓp online Lewis weights, and (ii) the sum of approximate ℓp online Lewis weights are bounded. 4.1. Sum of Online Lewis Weights Suppose that (i) holds, (ii) would follow from that the sum of true ℓp online Lewis weights are bounded, which are exactly the following two lemmas, for p [1, 2) and p = 2, respectively. Lemma 4.1. Let p [1, 2). It holds that Pn i=1 w OL i (A) = O(d log n log κOL(A)). Lemma 4.2 (Lemma 2.2 of (Cohen et al., 2020)). Let p = 2. Suppose that A satisfy Assumption 3.4. It holds that Pn i=1 w OL i (A) = O(d log( A 2/σ)). The case p = 1 of Lemma 4.1 appeared in (Braverman et al., 2020). We generalize the result to p (1, 2), following their approach. The proof can be found in Appendix A.1, where we also note an omission in the proof of Braverman et al. (2020). In the analysis of Algorithm 1, we shall apply Lemma 4.1 to A1 = S1A, where S1 is a sampling matrix w.r.t. the online Lewis weights of A. To upper bound κOL(S1A), we shall need the following auxiliary lemma, whose proof is postponed to Appendix A.2. Lemma 4.3. Let p [1, 2) and S is a rescaled sampling matrix w.r.t. the online Lewis weights of A and the oversampling parameter β. With probability at least 0.99, it holds that log κOL(SA) = O(log(nκOL(A)/β)). 4.2. Approximating Online Lewis Weights Now, it remains to prove (i) in order to prove the guarantee of x in Theorems 3.3 and 3.5. First, the guarantee of approximate ℓ2 online Lewis weights follows from the works of Cohen et al. (2020) or Jiang et al. (2022), which we cite below. Lemma 4.4 (Theorem 2.3 in (Cohen et al., 2020), Lemma 3.4 in (Jiang et al., 2022)). Let { wi}i be the approximate Lewis weights in Algorithm 4 and β = Θ(log n/ϵ2). Let S be the rescaled sampling matrix with respect to { wi}i. It holds with probability at least 0.99 that (1 ϵ)(A(t)) A(t) (SA(t)) (SA(t)) (1+ϵ)(A(t)) A(t) for all t {d + 1, . . . , n} and the number of non-zero rows of S is O(β(Pn i=1 wi)). As a consequence, wt 1 ϵ 1+ϵ a i ((A(t)) A(t)) 1ai (1 2ϵ)w OL t (A) for all t {d + 1, . . . , n}. This establishes (i) when p = 2. The case of general p follows from Theorem 3.2. The following lemma is the key to the proof. Lemma 4.5. Let Ai Rni d (i = 1, . . . , r), B Rk d and M = A1 A2 Ar B. For each i [r], let Si Rmi ni be the rescaled sampling matrix with respect to pi,1, . . . , pi,ni with min{βwj(Ai), 1} pi,j 1 for each j [ni], where β = O(ϵ 2 log(d/δ)). Let M = S1A1 Sr Ar B. Then, with probability at least 1 δ, it holds (1 ϵ)wn1+ +nr+j(M) wm1+ +mr+j(M ) (1 + ϵ)wn1+ +nr+j(M). for all j = 1, . . . , k. A full version of the preceding lemma and its proof are postponed to Lemma A.3. Now we turn to prove Theorem 3.2. Proof of Theorem 3.2. Observe that each block Bi is the compressed version of 2i smaller matrices, say, A1, . . . , A2i, and each smaller matrix is compressed at most i times. The compression scheme inside Bi can be represented by a tree Ti, which satisfies that the root of Ti has i children Ti 1, Ti 2, . . . , T0. Every internal node of the tree represents a compression operation, which subsamples (with rescaling) the vertical concatenation of its children. The following diagram shows an illustration of Ti. Consider a decompression process which begins at the root and goes down the tree level by level. When going down a level, we decompress each internal node on that level into the vertical concatenation of its children. When the decompression process is completed, we will have a vertical concatenation of the leaves, namely, A1 A2 A2i, which is a submatrix of A(t). Let i be the largest i such that Bi is nonempty. Consider the decompression process of all blocks Blog n B0. This process will terminate in i steps, A(t,i ) A(t,i 1) A(t,0), where A(t,i ) = Blog n B0 and A(t,0) = A(t). Let wt,j = wlast((At,j)). Note that wt,0 = wt(A(t)). By Theorem 3.2 and our choices of parameters, it holds that 1 ϵ 2 log n wt,j wt,j+1 1 + ϵ 2 log n with probability at least 1 1/ poly(n). Iterating yields that 1 ϵ 2 log n i wt(A(t)) wt,i 1+ ϵ 2 log n Note that wt,i = wt per (4). Since i log n, we have (1 ϵ)wt(A(t)) wt,i (1 + ϵ)wt(A(t)). Taking a union bound over all t gives the claimed result. 4.3. Time Complexity for p = 2 Lemma 4.6. With probability at least 0.98, the running time of Algorithm 4 over n iterations is O(ϵ 2 log n nnz(A) + ϵ 4d3(ϵ 2 log n + d) log d Proof. We analyze the time complexity following Lemma 3.8 in (Jiang et al., 2022). Note that total running time is dominated by calculating Lewis weights and calls to UPDATE. The approximate Lewis weights are calculated by wt = H(t 1)at 2 2, which takes O(ϵ 2 nnz(A) log n) time over n iterations. Observe that the runtime of each call to UPDATE is dominated by the time calculating F (t) and H(t), which takes O(ϵ 2d log n + d2) time. Calls to UPDATE only happen when there is a new row at is sampled and the number of samples is dominated by the maximum of the number of rows of S and that of S1, which with probability at least 0.98 are O(d log d) and O(ϵ 4d2 log d σ ), respectively. Hence, the total running time is O(ϵ 2 nnz(A) log n + ϵ 4d4 log d σ + ϵ 6d3 log n log d 5. Experiments In this section, we provide empirical results on online active ℓp regression with p = 1, p = 1.5 and p = 2. We compare our methods with online uniform sampling, the offline active regression algorithms (Musco et al., 2021a; Chen & Derezinski, 2021; Parulekar et al., 2021) for all values of p and, additionally, the thresholding algorithm in (Riquelme et al., 2017) for p = 2. The quantity we compare is the relative error, which is defined as (err erropt)/erropt, where err = A x b p is the error of the algorithm s output x and erropt = minx Ax b p is the minimum error of the ℓp regression. Below we explain the online uniform sampling algorithm, the thresholding algorithm and the adaptation of online and offline active regression algorithms to the budgetconstrained setting. All algorithms are prescribed with a budget for querying the labels. Online Uniform Sampling: In the t-th round, we sample the new data point [at, bt] with probability Bt/(n t), where Bt is the remaining budget. Regression via Thresholding (for p = 2 only): We use the Algorithm 1.b in (Riquelme et al., 2017) and assign the weights ξi = 1 for all i [n]. Online Active Regression: We sample each data point with probability proportional to wt, where wt is the approximate online Lewis weight calculated with the compression technique for p = 1 and p = 1.5. Offline Active Regression: For p = 1, the algorithms in (Chen & Derezinski, 2021; Parulekar et al., 2021) are under the budget setting and no modification is needed. For p = 2, the offline algorithm (Musco et al., 2021a) involves parallel sampling. Since it expects to sample O(d) data points for a constant-factor approximation, we allocate a budget of size d to the part of the constant- factor approximation and allocate the remaining budget to the regression on residuals. We perform experiments on both synthetic and real-world data sets to demonstrate the efficacy of our approaches. Synthetic Data: We generate the synthetic data as follows. Each row of A Rn d is a random Gaussian vector, i.e., ai N(0, Id). The label is generated as b = Ax + ξ where x is the ground truth vector and ξ is the Gaussian noise vector, i.e., ξ N(0, 1). To make the rows of A have nonuniform Lewis weights, we enlarge d data points by a factor of n 1 p . We choose n = 10000 and d = 100. Real-world Data: We evaluate our algorithm on a realworld dataset, the gas sensor data (Vergara et al., 2012; Rodriguez-Lujan et al., 2014) from the UCI Machine Learning Repository1. The dataset contains 13910 measurements of chemical gases characterized by 128 features and their concentration levels. We vary the budget sizes for the synthetic data between 800 and 1400 (8% 14% of the data size) and for the realworld data between 1600 and 2500 (12% 18% of the data size). For each budget size, we run 20 independent trials and calculate the mean relative error and standard deviation. All our experiments are conducted in MATLAB on a Macbook Pro with an i5 2.9GHz CPU and 8GB of memory. Below we discuss the experiments results for the online active ℓp regression, p = 1, 1.5, 2. The budget-versus-error plots are shown in Figure 1. p = 1: For the synthetic data, we see that the online regression algorithm achieves a relative error comparable to that of the offline regression algorithm when the budget is at least 1000 and always significantly outperforms the online uniform sampling algorithm. For the real-world data, the online regression algorithm s performance is again significantly better than the online uniform sampling algorithm and comparable to that of the offline active regression algorithm. p = 1.5: The online ℓ1.5 regression algorithm significantly outperforms the online uniform sampling on both data sets. It achieves a relative error comparable to that of the offline active regression algorithm on the real-world data and is only slightly worse than the offline algorithm when the budget size is at least 2300 (14.3% of the data size). 1https://archive.ics.uci.edu/ml/datasets/ Gas+Sensor+Array+Drift+Dataset+at+Differen t+Concentrations 800 900 1000 1100 1200 1300 1400 0 2 4 6 8 Online ℓ1 Offline ℓ1 Online Uni. Relative error (a) Synthetic data 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 0 Online ℓ1 Offline ℓ1 Online Uni. Relative error (b) Gas sensor data 800 900 1000 1100 1200 1300 1400 Online ℓ1.5 Offline ℓ1.5 Online Uni. Relative error (c) Synthetic data 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 0 Online ℓ1.5 Offline ℓ1.5 Online Uni. Relative error (d) Gas Sensor data 800 900 1000 1100 1200 1300 1400 0 0.05 Online ℓ2 Offline ℓ2 Threshold Online Uni. Relative error (e) Synthetic data 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 0 Online ℓ2 Offline ℓ2 Threshold Online Uni. Relative error (f) Gas Sensor data Figure 1. Performance of algorithms for online ℓp active regression on both synthetic data and Gas Sensor data for p = 1, 1.5, 2. p = 2: The online ℓ2 regression algorithm significantly outperforms the online uniform sampling algorithm on both datasets and performs much better than the thresholding algorithm on real-world data. It achieves a relative error comparable to that of the offline active regression algorithm on the synthetic data and is only slightly worse than the offline algorithm on real-world data. 6. Conclusion We provably show an online active regression algorithm which uses sublinear space for the ℓp-norm, p [1, 2]. Our experiments demonstrate the superiority of the algorithm over online uniform sampling on both synthetic and realworld data and a comparable performance with the offline active regression algorithm. Acknowledgements C. Chen was supported by and Y. Li was partially supported by Singapore Ministry of Education (Ac RF) Tier 2 grant MOE2018-T2-1-013 and Singapore Ministry of Education (Ac RF) Tier 1 grant RG75/21. Y. Sun was supported by Singapore Ministry of Education (Ac RF) Tier 2 grant MOE2018-T2-1-013. Bourgain, J., Lindenstrauss, J., and Milman, V. Approximation of zonoids by zonotopes. Acta mathematica, 162(1): 73 141, 1989. Braverman, V., Drineas, P., Musco, C., Musco, C., Upadhyay, J., Woodruff, D. P., and Zhou, S. Near optimal linear algebra in the online and sliding window models. In Irani, S. (ed.), 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, NC, USA, November 16-19, 2020, pp. 517 528. IEEE, 2020. doi: 10.1109/FOCS46700.2020.00055. Chen, X. and Derezinski, M. Query complexity of least absolute deviation regression via robust uniform convergence. In Belkin, M. and Kpotufe, S. (eds.), Conference on Learning Theory, COLT 2021, 15-19 August 2021, Boulder, Colorado, USA, volume 134 of Proceedings of Machine Learning Research, pp. 1144 1179. PMLR, 2021. URL http://proceedings.mlr.press/ v134/chen21d.html. Chen, X. and Price, E. Active regression via linear-sample sparsification. In Beygelzimer, A. and Hsu, D. (eds.), Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, volume 99 of Proceedings of Machine Learning Research, pp. 663 695. PMLR, 2019. URL http://proceedings.mlr.press/v99/ chen19a.html. Cohen, M. B. and Peng, R. lp row sampling by Lewis weights. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pp. 183 192, 2015. Cohen, M. B., Musco, C., and Pachocki, J. Online row sampling. Theory of Computing, 16(15):1 25, 2020. APPROX-RANDOM 2016 Special Issue. Gonin, R. and Money, A. H. Nonlinear Lp-Norm Estimation. CRC Press, 1989. doi: 10.1201/9780203745526. Jiang, S., Peng, B., and Weinstein, O. Dynamic least-squares regression. ar Xiv:2201.00228 [cs.DS], 2022. Kane, D. M. and Nelson, J. Sparser Johnson-Lindenstrauss transforms. J. ACM, 61(1), jan 2014. ISSN 0004-5411. doi: 10.1145/2559902. URL https://doi.org/10 .1145/2559902. Kapralov, M., Lee, Y. T., Musco, C., Musco, C., and Sidford, A. Single pass spectral sparsification in dynamic streams. SIAM J. Comput., 46(1):456 477, 2017. doi: 10.1137/14 1002281. Musco, C., Musco, C., Woodruff, D. P., and Yasuda, T. Active sampling for linear regression beyond the l2 norm. ar Xiv:2111.04888v1 [cs.LG], 2021a. Musco, C., Musco, C., Woodruff, D. P., and Yasuda, T. Active sampling for linear regression beyond the l2 norm. ar Xiv:2111.04888v3 [cs.LG], 2021b. Parulekar, A., Parulekar, A., and Price, E. L1 regression with lewis weights subsampling. In Wootters, M. and Sanit a, L. (eds.), Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2021, August 16-18, 2021, University of Washington, Seattle, Washington, USA (Virtual Conference), volume 207 of LIPIcs, pp. 49:1 49:21. Schloss Dagstuhl - Leibniz-Zentrum f ur Informatik, 2021. doi: 10.4230/LIPIcs.APPROX/RANDOM.2021.49. Riquelme, C., Johari, R., and Zhang, B. Online active linear regression via thresholding. In Singh, S. P. and Markovitch, S. (eds.), Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pp. 2506 2512. AAAI Press, 2017. URL http://aaai.org/ocs/i ndex.php/AAAI/AAAI17/paper/view/14599. Rodriguez-Lujan, I., Fonollosa, J., Vergara, A., Homer, M., and Huerta, R. On the calibration of sensor arrays for pattern recognition using the minimal number of experiments. Chemometrics and Intelligent Laboratory Systems, 130:123 134, 2014. Sabato, S. and Munos, R. Active regression by stratification. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 469 477, 2014. Tropp, J. Freedman s inequality for matrix martingales. Electronic Communications in Probability, 16:262 270, 2011. Vergara, A., Vembu, S., Ayhan, T., Ryan, M. A., Homer, M. L., and Huerta, R. Chemical gas sensor drift compensation using classifier ensembles. Sensors and Actuators B: Chemical, 166:320 329, 2012. A. Some Facts of Lewis Weights Lemma A.1. Given A Rn d with ℓp Lewis weights wi, i [n], let S be the rescaled sampling matrix with respect to p1, . . . , pn satisfying that min{βwi, 1} pi 1, where β = O(ϵ 2 log(d/δ)). With probability at least 1 δ, it holds that p i aia i (1 + ϵ) p i aia i . Proof. We prove the lemma by the matrix Chernoff bound. Without loss of generality, we assume that pi 1/β for all i, otherwise we can restrict the sum to the i s such that pi 1/β. We further assume that A W 1 2 p A = Id, where W = diag{w1, . . . , wn}. Let Xi = (1S)i p i aia i w 1 2 p i aia i , then EXi = 0. By the definition of Lewis weights, we have 2 p i = a i (A W 1 2 p A) 1ai. Hence, we have ai 2 2 = w 2 p i . Next, it holds that Xi 2 w 1 2 pi ai 2 2 = 1 p i ai 2 2 = 1 i=1 E Xi X i 2 1 pi w 2(1 2 p ) i ai 2 2 aia i p i aia i β Applying the matrix Chernoff inequality, we have = 2d exp Ω(βϵ2) δ. Lemma A.2. Suppose that A Rn d and w1, . . . , wn are the Lewis weights of A. Let w1, . . . , wn be weights such that αw2/p i a i i w1 2/p i aia i ai βw2/p i , i = 1, . . . , n, then αwi wi βwi for all i. Proof. Let γ = sup{c > 0 : wi cwi for all i}. It then holds for all i that i w1 2/p i aia i i w1 2/p i aia i i (γwi)1 2/paia i wi 1 2/paia i This implies that γ2/p γ2/p 1 and thus γ 1 that is, wi wi/β for all i. Similarly one can show that wi wi/α. Combining Lemma A.1 and Lemma A.2 we have the following lemma. We assume that we retain only nonzero rows of any sampling matrix S in the following lemma. Lemma A.3. Let Ai Rni d (i = 1, . . . , r), B Rk d and M = A1 A2 Ar B. For each i [r], let Si Rmi ni be the rescaled sampling matrix with respect to pi,1, . . . , pi,ni with min{βwj(Ai), 1} pi,j 1 for each j [ni], where β = O(ϵ 2 log(d/δ)). Let M = S1A1 Sr Ar B. The following statements hold with probability at least 1 δ. 1. For each i [r] and each j [mi], it holds that (1 ϵ)wn1+ +ni 1+si(j)(M) pi,si(j) wm1+ +mi 1+j(M ) (1 + ϵ)wn1+ +ni 1+si(j)(M) where si(j) [ni] is the row index such that (Si)j,si(j) = 0. 2. For each j = 1, . . . , k, it holds that (1 ϵ)wn1+ +nr+j(M) wm1+ +mr+j(M ) (1 + ϵ)wn1+ +nr+j(M). Proof. Define partial sums µi = m1 + + mi with µ0 = 0 and νi = n1 + + ni with ν0 = 0. For each j [µr + k], ( wνi 1+si(j)(M)/pi,si(j), µi 1 < j µi; wj µr+νr(M), j µr. (Si Ai)j(Si Ai) j (w µi 1+j)p/2 1 + bib i (w µr+j)p/2 1 . Then we have (Ai)si(j)(Ai) si(j) pi,si(j)(wsi(j)(Ai))p/2 1 + bib i (wνr+j(M))p/2 1 . Let WM = diag{w1(M), . . . , wνr+k(M)}. Let pi = 1 for i = νr + 1, . . . , νr + k. Also note that pi,j min{βwνi 1+j(M), 1} since wj(Ai) wνi 1+j(M). It follows from Lemma A.1 that (1 ϵ)(M W 1 2/p M M) L (1 + ϵ)(M W 1 2/p M M), with probability at least 1 δ. Next we verify that {w j}j are good weights for M . When µi 1 < j µi, (w j)2/p = (wνi 1+si(j)(M))2/p p2/p i,si(j) = (Ai)si(j)(M W 1 2/p M M) 1(Ai) si(j) p2/p i,s(j) = 1 1 ϵ (Ai)si(j)L 1(Ai) si(j) p2/p i,si(j) = 1 1 ϵ(Si Ai)j L 1(Si Ai) j , where (Si Ai)j denotes the j-th row of Si Ai. Similarly, one can show that for j > µr, (w µr+j)2/p = 1 1 ϵbj µr L 1b j µr. The result follows from Lemma A.2. A.1. Online ℓp Lewis Weights The goal of this section is to show Lemma 4.1, which states that the sum of the online ℓp Lewis weights of a matrix A Rn d is upper bounded by O(d log n log(κOL(A))) for p [1, 2). This is a generalization of Lemma 5.15 of (Braverman et al., 2020) from p = 1 and we follow the same approach in (Braverman et al., 2020). Lemma A.4 (Monotonicity, Lemma 5.5 in (Cohen & Peng, 2015)). For any matrix A Rn d and vector x Rd, for every i [n] we have wi(A) wi(B) where B = [A , x ] . Lemma A.5. If the leverage scores of A are at most C > 0, then the ℓp Lewis weights of A are at most C for p [1, 2]. Proof. This is the generalization of Lemma 5.12 in (Braverman et al., 2020) and we follow the same proof approach. By the assumption, we have a i (A A) 1ai C for i [n]. We prove by induction that for iteration j in the Lewis weight iteration, we have W (j) C1 (1 p/2)j In. For the base case j = 1, we have W (j) i,i = (a i (A A) 1ai)p/2 Cp/2. Thus W (1) Cp/2In as desired. For iteration j, by the induction hypothesis, we have W (j 1) C1 (1 p/2)j 1In, which implies that (W (j 1))1 2/p C(1 (1 p/2)j 1)(1 2/p)In since 1 2/p 0. Thus, A (W (j 1))1 2/p A C(1 (1 p/2)j 1)(1 2/p)A A, and (A (W (j 1))1 2/p A) 1 C(1 (1 p/2)j 1)(2/p 1)(A A) 1. It then follows from (2) that (W (j) i,i )2/p = a i (A (W (j 1))1 2/p A) 1ai C(1 (1 p/2)j 1)(2/p 1)a i (A A) 1ai C(1 (1 p/2)j 1)(2/p 1)+1. Notice that ((1 (1 p/2)j 1)(2/p 1) + 1)p/2 = 1 (1 p/2)j, we have obtained that W (j) i,i C1 (1 p/2)j for all i, i.e., W (j) C1 (1 p/2)j In. The induction step is established. The claim follows the convergence of Lewis weight iteration (Cohen & Peng, 2015). Lemma A.6. Given A = [a1, . . . , an] Rn d, let B R(n+1) d = [a1, . . . , aj 1, bj, aj+1, . . . , an, bn+1] where bj = (1 γ)1/paj and bn+1 = γ1/paj for some γ [0, 1] and j [n]. Then we have wi(A) = wi(B) for i = j, n + 1, wj(B) = (1 γ)wj(A) and wn+1(B) = γwj(A). Proof. Without loss of generality, we suppose j = n. Let W Rn n be the diagonal Lewis weight matrix of A, i.e., Wi,i = wi(A). Let W (n+1) (n+1) be a diagonal matrix where W i,i = wi(A) for i = 1, . . . , n 1, W n,n = (1 γ)wn(A) and W n+1,n+1 = γwn(A). According to the uniqueness of Lewis weights, it suffices to show that τi(W 1/2 1/p B) = W i,i for i [n + 1]. Notice that the first n 1 rows of W 1/2 1/p B are the same as those of W 1/2 1/p A. The last two rows of W 1/2 1/p B are wn(A)1/2 1/p(1 γ)1/2 1/p(1 γ)1/pan = wn(A)1/2 1/p(1 γ)1/2an and wn(A)1/2 1/pγ1/2an, respectively. Thus we have W 1/2 1/p Ay 2 2 = W 1/2 1/p By 2 2 for any vector y, which indicates that the leverage scores of the first n 1 rows of W 1/2 1/p A are the same as those of W 1/2 1/p B, i.e., τi(W 1/2 1/p B) = Wi,i = W i,i for 1 i n 1. For the last two rows, we have τn(W 1/2 1/p B) = (1 γ)τn(W 1/2 1/p A) = W n,n and τn+1(W 1/2 1/p B) = γ τn(W 1/2 1/p A) = W n+1,n+1. Thus we have τi(W 1/2 1/p B) = W i,i for all i [n + 1]. Corollary A.7. For any matrix A Rn d. Let B Rn d have the same rows but with the j-th row reweighted by a factor α [0, 1]. Then for all i = j, wi(B) wi(A). Proof. Let γ = 1 αp and B R(n+1) d = [a1, . . . , aj 1, (1 γ)1/paj, aj+1, . . . , an, γ1/paj] . By Lemma A.6, we have wi( B) = wi(A) for i = j. Then by Lemma A.4 we have wi(B) wi( B) = wi(A). We are now ready to prove Lemma 4.1, which we restate below as Lemma A.8. Lemma A.8. For A Rn d and each i [n], we denote w OL i (A) be the online Lewis weight of ai with respect to A. Then Pn i=1 w OL i (A) = O(d log n log κOL(A)). Proof. The first part of our proof is similar to the proof of Lemma 5.15 in (Braverman et al., 2020). Suppose that λ > 0. Let B0 = λId, B = B0 B0 | {z } n times and X B A. Following the proof of Lemma 5.15 of (Braverman et al., 2020), we have Pn i=1 w OL i (X) = O(d log n log κOL(A)). Now, let WA be the Lewis weight matrix of A and L = A i W 1 2/p A A. Let σ = λmin(L), the smallest eigenvalue of L, and ρ = mini(L 1)ii, the smallest diagonal element of L 1 i . Choose λ σ n 1/p ρ(2 p)/(2p), µ = ( nλ2 σ )p/(2 p), UX = µInd and WX = UX WA . We claim that 1 2µ2/p B j A W 1 2/p A A + B U 1 2/p X B 1 Bj, (5) 1 2(wi(A))2/p a i A W 1 2/p A A + B U 1 2/p X B 1 ai (6) for all j [nd] and all i [n]. Observe that B U 1 2/p X B = nλ2µ1 2/p Id σId L. Thus, a i (L + nλ2µ1 2/p Id) 1ai 1 2a i L 1ai = 1 2(wi(A))2/p, establishing (6). Similarly, since Bj = λei for some i, B j (L + nλ2µ1 2/p Id) 1Bj 1 2λ2(L 1)i,i 1 establishing (5). It then follows from Lemma A.2 that wi(A) 2wnd+i(X). Applying the argument above to the n submatrices which consist of the first i rows of A for each i = 1, . . . , n, we see that we can choose λ to be sufficiently small such that w OL i (A) 2w OL nd+i(X) for all i. Therefore, P i w OL i (A) = O(d log n log κOL(A)). Finally, we note an omission in the proof of Lemma 5.15 of (Braverman et al., 2020). In the ar Xiv version of (Braverman et al., 2020)2, on page 42 it states that U (j 1) X Id and so B (U (j 1) X ) 1B nλ2Id. The first inequality does not seem to imply the second one, as the latter requires a lower bound on U (j 1) X . We have used a different argument in our proof above. A.2. Proof of Lemma 4.3 First, we note the following facts. For any two matrices A and B, AB 2 A 2 B 2, and when A has full row rank and B = 0, σmin(AB) σmin(A)σmin(B), where σmin( ) denotes the smallest nonzero singular value of a matrix. It is clear that S, which is a rescaled sampling matrix, has full row rank. By the definition of the online condition number, κOL(SA) = SA 2 max i 1 σmin(SA(i)) S 2 A 2 max i 1 σmin(SA(i))i S 2 A 2 max i 1 σmin(S)σmin(A)i σmin(S) κOL(A). Now, observe that σmax(S) = maxi p 1/p i = (mini pi) 1/p and σmin(S) = mini p 1/p i = (maxi pi) 1/p, where min{βw OL i (A), 1} pi 1. It is clear that σmin(S) 1. For the upper bound of σmax(S), note that a row i with 2ar Xiv:1805.03765v4 [cs.DS], 19 Apr 2020. w OL i (A) 1/(100n) will be sampled with probability Hence, with probability at least 0.99, none of the rows i with w OL i (A) 1/(100n) is sampled and so mini pi β/(100n) and σmax(S) (100n/β)1/p. Therefore, we conclude that with probability at least 0.99, κOL(SA) 100n 1/p κOL(A). B. Omitted proofs of Theorem 3.5 and 3.1 In this section we highlight the modifications needed to prove Theorem 3.5 and Theorem 3.1, based on (Musco et al., 2021a). When p = 2, our sampling matrices does not have independent rows, since the online leverage scores are calculated with respect to sampled rows instead of all the rows that have been revealed. Hence, we cannot use a Bernstein bound, which is exactly where we need modify in the proof of Theorem 4.1 in (Musco et al., 2021a). This problem does not exist for p [1, 2) and the original proof in (Musco et al., 2021a) applies. Below we shall reprove a key technical lemma in (Musco et al., 2021a) for p = 2 but state the auxiliary lemmas with a general p whenever possible. It was originally proved in the offline setting and we shall need to make small modifications to its proof so that it can be applied to the online setting. Lemma B.1 (Lemma 3.7 in (Musco et al., 2021a)). Consider the same setting in Lemma B.2. With probability at least 0.99, for all x Rd with Ax p = O(OPT), SAz S z p p Ax z p p = O(ϵ) OPTp . Its proof depends on a series of lemmas, namely Lemmas B.2 to B.8. Lemmas B.2 to B.4, B.6 and B.7 are identical to those in (Musco et al., 2021a) and so we only cite the statements. The modification occurs in the proof of Lemma B.8 as well as in the proof of Lemma B.1 when given Lemma B.8. Lemma B.2 (Constant factor approximation, Theorem 3.2 in (Musco et al., 2021a)). Let A Rn d, b Rn, p [1, 2] and OPT = minx Rd Ax b p. If we sample A and obtain xc by Algorithm 4 or Algorithm 1 with β = O(log(d/δ)) then with probability at least 1 δ, Axc b p 21+ 1 p 3/δ 1 p OPT . When δ is constant then Axc p C OPT for constant C. Lemma B.3 (Lemma 3.5 in (Musco et al., 2021a)). Considering the same setting in Lemma B.2, let z = b Axc. Let B be an index set such that B = {i [n] : |zi|p OPTp d 2 p 1wi ϵp }. Let z be equal to z but with all entries in B set to 0. Then for all x Rd with Ax p = O(OPT), Ax z p p Ax z p p z z p p = O(ϵ) OPTp . Lemma B.4 (Lemma 3.6 in (Musco et al., 2021a)). Consider the same setting in Lemma B.2. With probability at least 0.99, Sz p = O(OPT) and for any x Rd with Ax p = O(OPT), SAx S z p p Ax z p p Sz S z p p = O(ϵ) OPTp . Lemma B.5 (Bound over Net). Let Nϵ be an ϵ-net of the lp unit ball {Ax | Ax p 1}. If | SAx S z p Ax z p| = O(ϵ) holds for all x Nϵ. Then for any x Rd with Ax p 1, we have | SAx S z p Ax z p| = O(ϵ). Proof. To simplify the writing, without loss of generality, we assume OPT = 1. For any x in the unit ball, by the definition of Nϵ, there exists a vector y Nϵ such that Ax Ay p ϵ. We have proved that S is a subspace embedding matrix for A, so SAx SAy 2 O(ϵ). Hence, we have | SAx Sz 2 Ax z 2| | SAy Sz 2 Ay z 2| + SAx SAy 2 + Ax Ay 2 O(ϵ) + O(ϵ) + ϵ. Lemma B.6 (Compact rounding, Lemma 3.10 in (Cohen et al., 2020) and Theorem 7.3 in (Bourgain et al., 1989)). Consider A Rn d, v Rn with |v(i)|p d p 2 1wi ϵp . Let l be (1 + ϵ)l = d 1 p and Nϵ be an ϵ-net in Lemma B.5. For any y Nϵ, let r = y v. Then we have r = e + Pk=l k=0 dk such that 1. |r (i) r(i)| ϵ|v(i)|, for any i [n], 2. |dk(i)| 2(1+ϵ)kw 1 p i ϵd , for any i [n] and k {0} [l], 3. d0, . . . , dl, e have mutually disjoint supports, 4. e is a single fixed vector with |e(i)| w1/p i ϵ , for any i [n], 5. Each dk is drawn from a set of vectors Dk with log |Dk| c(p) d log(n) ϵ1+p(1+ϵ)pk , for any k [l]. Lemma B.7. For any y = Ax with Ax p 1, let r = y z and r be the rounding of r shown to exist in in Lemma B.6, with ϵ set to ϵ 1 p . If Sr p p = (1 ϵ) r p p, then Sr p p = (1 ϵ) r p p. The next lemma is where we need to modify for the online setting, for which we shall use Freedman s inequality instead of Bernstein s inequality. Lemma B.8. For all roundings r produced by Lemma B.7, with probability at least 0.99, we have Sr p p = (1 ϵ) r p p Proof. we analyze the sampling process via a martingale. To make modifications more clear, we use the notation in Claim 3.14 of (Musco et al., 2021a). We write r = e+Pl k=0 dk. We have e p = O(1) and dk = O(1) for k {0} [l]. Let S be the rescaled sampling matrix with respect to pi. Let Sdk,(i) and dk,(i) be the first i coordinates of Sdk and dk respectively. Let Yi = |Sdk,(i+1)|p |dk,(i+1)|p, Y0 = 0 and Xi = Yi Yi 1. By the second condition of Lemma B.6, we have |dk(i)| ϵd 1 p . Since S rescales dk by the sampling probability, we have |Sdk(i)|p 1 β wi 2p(1+ϵ)kpwi ϵpd = O (1+ϵ)kp Hence, |Xi| = |Sdk(i + 1)|p |dk(i + 1)|p O( (1+ϵ)kp βϵpd ) and Ei 1X2 i 1 β wi |dk(i + 1)|2p O( (1+ϵ)kp βϵpd )|dk(i + 1)|p. Thus, we have Pn i=1 Ei 1X2 i O( (1+ϵ)kp βϵpd ). By Freedman inequality, Pr( Sdk p p dk p p ϵ/(l + 2)) exp ϵ2/(2(l + 2)2) Therefore, if β = O(log |Dk| l2ϵp+2 d(1+ϵ)kp ) = O( log2 d log n ϵ2p+5 ), we can take a union bound over all k and e. This completes the proof. Now we are ready to prove Lemma B.1. Proof of Lemma B.1. Let p = 2. We prove the lemma by the matrix Freedman inequality. Let Yi = (SA)ix S zi 2 2 Aix zi 2 2, Y0 = 0 and Xi = Yi Yi 1. Then, |Xi| is uniformly bounded. 1i pi (aix zi) 2 aix zi 2 2 pi aix zi 2 2. If i B, zi = 0, then, by Cauchy-Schwarz inequality, we have aix 2 2 w2 i Ax 2 2 = wi O(OPT2). Otherwise, aix zi 2 2 ( 1 ϵ + 1)2w2 i O(OPT)2. Hence, since pi = min(βwi, 1), we have Yi Yi 1 1 βϵ2 O(OPT2). E(X2 i |Yi, . . . , Y1) is denoted by Ei 1X2 i , so we have Ei 1X2 i = E 1 pi (aix zi) 2 2 aix zi 2 2 pi 1)2 aix zi 4 2 pi 1) aix zi 4 2 ϵ + 1)2O(OPT2) aix zi 2 2 1 βϵ2 O(OPT2) aix zi 2 2. Therefore, Pn i=1 Ei 1X2 i 1 βϵ2 O(OPT2) Pn i=1 aix zi 2 2. Since Ax 2 2 = O(OPT2) and z 2 2 = O(OPT2), we can get Pn i=1 Ei 1X2 i 1 βϵ2 O(OPT4). Then, by the matrix Freedman inequality (Tropp, 2011) and β = d log( 1 δ ) ϵ4 , it follows that Pr( Yn Cϵ OPT2) exp 1 βϵ4 O(OPT4) + O(OPT4) for C large enough. This implies that with probability at least 1 δ 2d , S(Ax z) 2 2 Ax z 2 2 O(ϵ) OPT2, for a fixed x Rd. To simplify the writing, without loss of generality, now we assume OPT = 1. We apply a union bound over an ϵ-net N of the ball B = {x Rd| Ax z 2 2 = 1)}. Note that there are at most ( 3 ϵ )d points in the ϵ-net. After applying a union bound over the net, according to Lemma B.5 S(Ax z) 2 2 Ax z 2 2 O(ϵ) holds for each x Rd with Ax = 1 with probability at least 1 δ. For any x Rd with Ax 2 2 = 1, by the definition of ϵ-net, there exists a vector y N such that Ax Ay 2 ϵ. We have proved that S is a subspace embedding for A, so S(Ax y) 2 O(ϵ). Hence, we have | S(Ax z) 2 Ax z 2| | S(Ay z) 2 Ay z 2| + S(Ax Ay) 2 + Ax Ay 2 O(ϵ) + O(ϵ) + ϵ, which completes our proof.