# John Ellipsoids via Lazy Updates

David P. Woodruff, Carnegie Mellon University, dwoodruf@cs.cmu.edu
Taisuke Yasuda, Voleon Group, yasuda.taisuke1@gmail.com

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

We give a faster algorithm for computing an approximate John ellipsoid around $n$ points in $d$ dimensions. The best known prior algorithms are based on repeatedly computing the leverage scores of the points and reweighting them by these scores [CCLY19]. We show that this algorithm can be substantially sped up by delaying the computation of high accuracy leverage scores by using sampling, and then later computing multiple batches of high accuracy leverage scores via fast rectangular matrix multiplication. We also give low-space streaming algorithms for John ellipsoids using similar ideas.

1 Introduction

The John ellipsoid problem is a classic algorithmic problem, which takes as input a set of $n$ points $\{a_1, a_2, \ldots, a_n\}$ in $d$ dimensions, and asks for the minimum volume enclosing ellipsoid (MVEE) of these $n$ points. If $P$ is the convex hull of these $n$ points, then a famous result of Fritz John [Joh48] states that such an ellipsoid $Q$ satisfies $\frac{1}{d} Q \subseteq P \subseteq Q$, and if $P$ is furthermore symmetric, then $\frac{1}{\sqrt{d}} Q \subseteq P \subseteq Q$. Equivalently, we may consider the $n$ input points to be constraints of a polytope $\{x : \langle a_i, x\rangle \le 1, i \in [n]\}$, in which case the problem is to compute a maximal volume inscribed ellipsoid (MVIE). These two problems are related by taking polars, which corresponds to inverting the quadratic form defining the ellipsoid. In this work, we focus on the symmetric case, so that the polytope $P$ may be written as $P = \{x : \|Ax\|_\infty \le 1\}$, where $A$ denotes the $n \times d$ matrix with the $n$ input points $a_i$ in the rows, and our goal is to output an ellipsoid $Q$ which approximately satisfies $\frac{1}{\sqrt{d}} Q \subseteq P \subseteq Q$.

The John ellipsoid problem has far-reaching applications in numerous fields of computer science. In statistics, the John ellipsoid problem is equivalent to the dual of the D-optimal experiment design problem [Atw69, Sil13], in which one seeks weights for selecting a subset of a dataset to observe in an experiment. Other statistical applications of John ellipsoids include outlier detection [ST80] and pattern recognition [Ros65, Gli98]. In the optimization literature, computing John ellipsoids is a fundamental ingredient for the ellipsoid method for linear programming [Kha79], cutting plane methods [TKE88], and convex programming [Sho77, Vai96]. Other applications in theoretical computer science include sampling [Vem05, CDWY18, GN23], bandits [BCK12, HK16], differential privacy [NTZ13], coresets [TMF20, TWZ+22], and randomized numerical linear algebra [Cla05, DDH+09, WY23, BMV23]. We refer to the survey of Todd [Tod16] for an account of algorithms and applications of John ellipsoids.

1.1 John ellipsoids via iterated leverage scores

Following a long line of work on algorithms for computing John ellipsoids [Wol70, Atw73, KT93, NN94, Kha96, STT78, KY05, SF04, TY07, AST08, Yu11, HFR20] based on convex optimization techniques, the work of [CCLY19] developed a new approach towards computing approximate John ellipsoids via a simple fixed point iteration. The approach of [CCLY19] starts with an observation on the optimality condition for the dual problem, given by
$$\min_{w \in \mathbb{R}^n} \ \sum_{i=1}^n w_i - \log\det(A^\top W A) - d \quad \text{subject to} \quad w_i \ge 0, \ i \in [n], \tag{1}$$
where $W = \mathrm{diag}(w)$.¹ The optimality condition requires that the dual weights $w$ satisfy $w_i = \tau_i(\sqrt{W} A)$ for each $i \in [n]$, where $\tau_i(A)$ for a matrix $A$ denotes the $i$-th leverage score of $A$.
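This stationarity condition can be checked numerically. The sketch below is not from the paper; it assumes Python with NumPy and generic random data, and simply compares the analytic gradient of the objective in (1), namely $1 - a_i^\top(A^\top W A)^{-1} a_i$ in coordinate $i$, against a finite-difference estimate. The gradient vanishes exactly when $w_i = \tau_i(\sqrt{W}A)$, with leverage scores as defined in Definition 1.1 below.

```python
import numpy as np

# Numerical sanity check (illustration only, not from the paper): the gradient of
#   f(w) = sum_i w_i - log det(A^T W A) - d
# in coordinate i is 1 - a_i^T (A^T W A)^{-1} a_i, so stationary points of (1)
# with w_i > 0 satisfy w_i = w_i a_i^T (A^T W A)^{-1} a_i = tau_i(sqrt(W) A).
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
w = rng.uniform(0.5, 1.5, size=n)               # arbitrary strictly positive weights

def dual_objective(w):
    return w.sum() - np.linalg.slogdet(A.T @ (w[:, None] * A))[1] - d

M = np.linalg.inv(A.T @ (w[:, None] * A))       # (A^T W A)^{-1}
grad_analytic = 1.0 - np.einsum('ij,jk,ik->i', A, M, A)

eps = 1e-6                                      # central finite differences
grad_fd = np.array([(dual_objective(w + eps * np.eye(n)[i])
                     - dual_objective(w - eps * np.eye(n)[i])) / (2 * eps)
                    for i in range(n)])
print(np.max(np.abs(grad_analytic - grad_fd)))  # should be tiny (finite-difference error)
```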
Definition 1.1 (Leverage score). Let $A \in \mathbb{R}^{n \times d}$. Then, for each $i \in [n]$, we define the $i$-th leverage score as
$$\tau_i(A) := a_i^\top (A^\top A)^{\dagger} a_i = \sup_{Ax \ne 0} \frac{\langle a_i, x\rangle^2}{\|Ax\|_2^2}.$$

The optimality condition of (1) can be viewed as a fixed point condition, which suggests the following iterative algorithm for computing approximate John ellipsoids:
$$w_i^{(t)} = \tau_i\big(\sqrt{W^{(t-1)}}A\big) = w_i^{(t-1)} \cdot a_i^\top (A^\top W^{(t-1)} A)^{-1} a_i, \tag{2}$$
where $W^{(t)} := \mathrm{diag}(w^{(t)})$. After repeating this update for $T = O(\varepsilon^{-1}\log(n/d))$ iterations starting with $w^{(0)} = \frac{d}{n}\mathbf{1}_n$, it can be shown that the ellipsoid $Q$ defined by the quadratic form $\mathbf{Q} = A^\top W^{(T)} A$, i.e. $Q = \{x \in \mathbb{R}^d : x^\top \mathbf{Q} x \le 1\}$, satisfies
$$\frac{1}{\sqrt{1+\varepsilon}}\, Q \subseteq P \subseteq \sqrt{d}\cdot Q. \tag{3}$$

Note that the computation of leverage scores can be done in $O(nd^{\omega-1})$ time, where $\omega \le 2.371552$ is the current exponent of fast matrix multiplication [DWZ23, WXXZ24]. Thus, this gives an algorithm running in time $O(\varepsilon^{-1} n d^{\omega-1} \log(n/d))$ for outputting an ellipsoid with the guarantee of (3). It is known that input matrices with additional structure admit even faster implementations of this algorithm. For instance, [CCLY19, SYYZ22] give faster algorithms for sparse matrices $A$ for which the number of nonzero entries $\mathrm{nnz}(A)$ is much less than $nd$. The work of [SYYZ22] shows that this approach can also be sped up for matrices $A$ with small treewidth.

1.2 Our results

Our first main result is a substantially faster algorithm for computing John ellipsoids. In the typical regime where $n \gg d$, our algorithm runs in $O(\varepsilon^{-1} n d)\log(n/d)$ time to output an ellipsoid $Q$ satisfying (3), and in $O(\varepsilon^{-1} n d^2)\log(n/d)$ time to output an ellipsoid $Q$ which approximates the maximal volume up to a $(1+\varepsilon)$ factor. We discuss our techniques for this result in Sections 1.2.1 and 1.2.2.

Table 1: Running time of John ellipsoid approximation for dense $n \times d$ matrices, for $n \ge d\,\mathrm{poly}(\varepsilon^{-1}\log n)$. There is other prior work on sparse matrices and matrices with low treewidth [CCLY19, SYYZ22].

| Reference | Running time | Guarantee |
| --- | --- | --- |
| [KY05, TY07] | $O(\varepsilon^{-1} n d^{\omega})$ | volume approximation |
| [CCLY19] | $O(\varepsilon^{-1} n d^{\omega-1}) \log(n/d)$ | (3) |
| [CCLY19, SYYZ22] | $O(\varepsilon^{-2} n d)(\log n)\log(n/d)$ | (3) |
| Theorem 1.6 | $O(\varepsilon^{-1} n d)\log(n/d)$ | (3) |

¹ For a weight vector $w \in \mathbb{R}^n$, we will often write the corresponding $n \times n$ diagonal matrix $\mathrm{diag}(w)$ as the capitalized version $W$.

1.2.1 Linear time leverage scores via fast matrix multiplication

We start by showing how to approximate leverage scores up to $(1+\varepsilon)$ factors in $O(nd)$ time, which had not been observed before to the best of our knowledge. Note that if we compute exact leverage scores using fast matrix multiplication, then this takes time $O(nd^{\omega-1})$. Alternatively, sketching-based algorithms for approximate leverage scores are known, which give the following running time for sparse matrices $A$ with $\mathrm{nnz}(A)$ nonzero entries.

Theorem 1.2 ([SS11, DMMW12, CW13]). There is an algorithm which, with probability at least $1 - \delta$, outputs $\tilde\tau_i$ for $i \in [n]$ such that $\tilde\tau_i = (1 \pm \varepsilon)\tau_i(A)$, and runs in time $O(\varepsilon^{-2}\,\mathrm{nnz}(A)\log(n/\delta)) + \mathrm{poly}(d\,\varepsilon^{-1}\log(n/\delta))$.

If the goal is to compute $(1+\varepsilon)$-approximate leverage scores for a dense $n \times d$ matrix, then we are not aware of a previous result which does this in nearly linear $O(nd)$ time, which we now show:

Theorem 1.3. There is an algorithm which, with probability at least $1 - \delta$, outputs $\tilde\tau_i$ for $i \in [n]$ such that $\tilde\tau_i = (1 \pm \varepsilon)\tau_i(A)$ in time $O(nd)\,\mathrm{poly}\log(\varepsilon^{-1}\log(n/\delta)) + O(n)\,\mathrm{poly}(\varepsilon^{-1}\log(n/\delta))$.

Our improvement comes from improving the running time analysis of a sketching-based algorithm of [DMMW12] by using fast rectangular matrix multiplication.
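To make the [DMMW12]-style approach concrete, the following sketch (an illustration under simplifying assumptions, not the algorithm analyzed in Theorem 1.3) computes approximate leverage scores in NumPy: a subspace embedding $SA$ gives an approximate change of basis $R$, and a Johnson-Lindenstrauss matrix $G$ compresses $AR^{-1}$ so that the only $n$-dependent work is one $n \times d$ times $d \times t$ product. Dense Gaussian sketches are used in place of the sparse embeddings and fast rectangular multiplication that the stated running times rely on, and the sketch sizes $m$ and $t$ are heuristic choices.

```python
import numpy as np

def approx_leverage_scores(A, eps=0.5, seed=0):
    """Sketch-based leverage score estimates in the spirit of [DMMW12]:
    tau_i(A) is approximated by the squared row norms of A @ (R^{-1} G),
    where R comes from a QR factorization of the sketch S A and G is a
    Johnson-Lindenstrauss matrix.  Sketch sizes are heuristic."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m = max(2 * d, int(4 * d / eps**2))         # subspace-embedding rows (heuristic)
    t = max(8, int(4 * np.log(n) / eps**2))     # JL columns, t = O(eps^-2 log n)

    S = rng.standard_normal((m, n)) / np.sqrt(m)
    _, R = np.linalg.qr(S @ A)                  # A R^{-1} has well-conditioned columns
    G = rng.standard_normal((d, t)) / np.sqrt(t)
    # The only n-dependent cost: the n x d times d x t product below.
    # (The compression pays off when t << d, i.e. for large d.)
    return np.sum((A @ np.linalg.solve(R, G)) ** 2, axis=1)

# Quick check against exact leverage scores tau_i = a_i^T (A^T A)^{-1} a_i.
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 200))
exact = np.einsum('ij,jk,ik->i', A, np.linalg.inv(A.T @ A), A)
approx = approx_leverage_scores(A)
print(np.max(np.abs(approx / exact - 1)))       # max relative error, on the order of eps
```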
We will need the following result on fast matrix multiplication.

Theorem 1.4 ([Cop82, Wil11, Wil24]). There is a constant $\alpha \ge 0.1$ and an algorithm for multiplying an $m \times m$ matrix and an $m \times m^{\alpha}$ matrix in $O(m^2\,\mathrm{poly}\log m)$ time, under the assumption that field operations can be carried out in $O(1)$ time.

By applying the above result in blocks, we get the following version of this result for rectangular matrix multiplication.

Corollary 1.5. There is a constant $\alpha \ge 0.1$ and an algorithm for multiplying an $n \times d$ matrix, for $n \ge d$, and a $d \times t$ matrix in $O(nd + nt^{1/\alpha + 1})\,\mathrm{poly}\log t$ time, under the assumption that field operations can be carried out in $O(1)$ time.

Proof. Let $m = t^{1/\alpha}$. If $d \le m$, then naive matrix multiplication takes only $O(ndt) = O(nt^{1/\alpha+1})$ time, so assume that $d > m$. We partition the first matrix into an $O(n/m) \times O(d/m)$ block matrix with blocks of size $m \times m$, and the second into an $O(d/m) \times 1$ block matrix with blocks of size $m \times m^{\alpha}$. By Theorem 1.4, each block matrix multiplication requires $O(m^2\,\mathrm{poly}\log m)$ time, and we have $O(nd/m^2)$ of these to do, which gives the desired running time.

That is, the above result shows that when multiplying an $n \times d$ matrix $A$ with a $d \times t$ matrix for a much smaller $t$, this multiplication can be done in roughly $O(nd)\,\mathrm{poly}\log t$ time. The work of [DMMW12] shows that the leverage scores of $A$ can be written as the row norms of $AR$ for a $d \times t$ matrix $R$ with $t = O(\varepsilon^{-2}\log(n/\delta))$, and thus this gives us the result of Theorem 1.3.

1.2.2 John ellipsoids via lazy updates

By using Theorem 1.3, we already obtain a John ellipsoid algorithm which runs in time $O(\varepsilon^{-1} nd)\log(n/d)\,\mathrm{poly}\log(\varepsilon^{-1}\log n)$, which substantially improves upon prior algorithms for dense input matrices $A$. We now show how to obtain further improvements by using the idea of lazy updates. At the heart of our idea is to compute only the quadratic forms for the John ellipsoids for most iterations, and to defer the computation of the weights until we have completed roughly $O(\log n)$ iterations. At the end of this group of iterations, we can then compute the John ellipsoid weights via fast matrix multiplication as used in Theorem 1.3, which allows us to remove the suboptimal $\mathrm{poly}\log\log n$ terms in the dominating term of the running time.

Theorem 1.6. Given $A \in \mathbb{R}^{n \times d}$, let $P$ be the polytope defined by $P = \{x \in \mathbb{R}^d : \|Ax\|_\infty \le 1\}$. For $\varepsilon \in (0, 1)$, there is an algorithm, Algorithm 3, that runs in time
$$O(\varepsilon^{-1} nd)\big(\log(n/d) + \mathrm{poly}\log(\varepsilon^{-1}\log n)\big) + O(n)\,\mathrm{poly}(\varepsilon^{-1}\log n) + O(n^{0.1})\,d^{\omega+1}\varepsilon^{-3}(\log n)^2$$
and returns an ellipsoid $Q$ such that $\frac{1}{\sqrt{1+\varepsilon}}\,Q \subseteq P \subseteq \sqrt{d}\cdot Q$ with probability at least $1 - 1/\mathrm{poly}(n)$.

The full proof of this result is given in Section 2. Let $Q^{(t)} = A^\top W^{(t)} A$, where the weights $w^{(t)}$ are defined as in (2). Note that with this notation, the update rule for the iterative algorithm of [CCLY19] can be written as
$$w_i^{(t)} = w_i^{(t-1)} \cdot a_i^\top (Q^{(t-1)})^{-1} a_i. \tag{4}$$
Thus, given high-accuracy spectral estimates of the quadratics $Q^{(t)}$, we can recover the weights $w_i^{(t)}$ to high accuracy in $O(d^{\omega})$ time per iteration by evaluating the quadratic forms $a_i^\top(Q^{(t)})^{-1} a_i$ and then multiplying them together. This approach is useful for fast algorithms if we only need to do this for a small number of indices $i \in [n]$. This is indeed the case if we only need these weights for a row sample of $\sqrt{W^{(t)}}A$, which is sufficient for computing a spectral approximation to the next quadratic form $Q^{(t)}$.
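As an illustration of this lazy evaluation (a NumPy sketch, not the paper's implementation; the function and variable names are ours, and explicit linear solves stand in for however one stores the quadratics), the following routine recovers high-accuracy weights for a small set of sampled rows directly from stored quadratics, using the product form of the update (4) that also appears in Algorithms 1 and 2 below.

```python
import numpy as np

def lazy_weights_for_rows(A, Q_list, rows):
    """Recover w_i^{(t)} for i in `rows` from stored quadratics
    Q_list = [Q^{(0)}, ..., Q^{(t-1)}], following the product form
        w_i^{(t)} = prod_{t'=1}^{t} a_i^T (Q^{(t'-1)})^{-1} a_i.
    Only |rows| * t quadratic forms are evaluated, so no O(n)-size
    per-iteration state is kept for the unsampled rows."""
    A_S = A[rows]                                # |rows| x d
    w = np.ones(len(rows))
    for Q in Q_list:
        Qinv_AS = np.linalg.solve(Q, A_S.T)      # d x |rows|, equals Q^{-1} A_S^T
        w *= np.einsum('ij,ji->i', A_S, Qinv_AS) # a_i^T Q^{-1} a_i for each sampled row
    return w
```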
Furthermore, we only need low-accuracy leverage scores (up to a factor of, say, $n^{0.1}$) to obtain a good row sample, which can be done quickly for all $n$ rows [LMP13, CLM+15]. Thus, by repeatedly sampling rows of $\sqrt{W^{(t)}}A$, computing high-accuracy weights on the sampled rows, and then building an approximate quadratic, we can iteratively compute high-accuracy approximations $\tilde Q^{(t)}$ to the quadratics $Q^{(t)}$. More formally, our algorithm takes the following steps:

- We first compute low-accuracy leverage scores of $\sqrt{W^{(t-1)}}A$, which can be done in $O(nd)$ time. This gives us the weights $w^{(t)}$ to low accuracy, say $u^{(t)}$, for all $n$ rows.
- We use the low-accuracy weights $u^{(t)}$ to obtain a weighted subset of rows of $\sqrt{U^{(t)}}A$ which spectrally approximates $Q^{(t)}$. Note, however, that we do not yet have the sampled rows of $\sqrt{W^{(t)}}A$ to high accuracy, since we do not know the weights $w^{(t)}$ to high accuracy.
- If we only need the weights $w^{(t)}$ for a small number of sampled rows $S \subseteq [n]$, then we can explicitly compute these using (4), since we inductively have access to high-accuracy quadratics $\tilde Q^{(t')}$ for $t' = 0, 1, 2, \ldots, t-1$. These can then be used to build $\tilde Q^{(t)}$.

While this algorithm allows us to quickly compute high-accuracy approximate quadratics $\tilde Q^{(t)}$, it cannot be run for too many iterations, as the error in the low-accuracy leverage scores $u^{(t)}$ grows to $\mathrm{poly}(n)$ factors in $O(\log n)$ rounds. This is a problem, as this error factor directly influences the number of leverage score samples needed to approximate $Q^{(t)}$. We will now use the fast matrix multiplication trick from the previous Section 1.2.1 to fix this problem. Indeed, after $O(\log n)$ iterations, we will now have approximate quadratics $\tilde Q^{(1)}, \tilde Q^{(2)}, \ldots, \tilde Q^{(t)}$ for $t = O(\log n)$. Now we just need to compute the $n$ John ellipsoid weights, which are given by
$$\prod_{t'=1}^{t} \big\|e_i^\top A (\tilde Q^{(t'-1)})^{-1/2}\big\|_2^2.$$
To approximate this quickly, we can approximate each term $\|e_i^\top A(\tilde Q^{(t'-1)})^{-1/2}\|_2^2$ by $\|e_i^\top A(\tilde Q^{(t'-1)})^{-1/2} G^{(t')}\|_2^2$ for a random Gaussian matrix $G^{(t')}$, by the Johnson-Lindenstrauss lemma [JL84]. Here, the number of columns of each Gaussian matrix can be taken to be $\mathrm{poly}(\varepsilon^{-1}\log n)$, so now all we need to compute is the matrix product
$$A \cdot \big[(\tilde Q^{(0)})^{-1/2} G^{(0)},\ (\tilde Q^{(1)})^{-1/2} G^{(1)},\ \ldots,\ (\tilde Q^{(t)})^{-1/2} G^{(t)}\big],$$
which is the product of an $n \times d$ matrix and a $d \times m$ matrix for $m = \mathrm{poly}(\varepsilon^{-1}\log n)$. By Theorem 1.4, this can be computed in $O(nd\,\mathrm{poly}\log m)$ time. However, this resetting procedure is only run $O(\varepsilon^{-1})$ times across the $T = O(\varepsilon^{-1}\log n)$ iterations, so the running time contribution from the resetting is just $O(\varepsilon^{-1} nd)\,\mathrm{poly}\log(\varepsilon^{-1}\log n) + O(n)\,\mathrm{poly}(\varepsilon^{-1}\log n)$. Overall, the total running time of our algorithm is
$$O(\varepsilon^{-1} nd)\big(\log(n/d) + \mathrm{poly}\log(\varepsilon^{-1}\log n)\big) + O(n)\,\mathrm{poly}(\varepsilon^{-1}\log n).$$

Remark 1.7. In general, our techniques can be viewed as a way of exploiting the increased efficiency of matrix multiplication when performed on a larger instance, by delaying large matrix multiplication operations, so that the running time is $O(\varepsilon^{-1} nd\log(n/d)) + O(\varepsilon^{-1})\,T_r$, where $T_r$ is the time that it takes to multiply $A$ by a $d \times r$ matrix for $r = \mathrm{poly}(\varepsilon^{-1}\log n)$. While we have instantiated this general theme by obtaining faster running times via fast matrix multiplication, one can expect similar improvements from other ways of exploiting the economies of scale of matrix multiplication, such as parallelization. For instance, we recover the same running time if we can multiply $r$ vectors in parallel so that $T_r = O(nd)$.
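The resetting step described above amounts to a single tall-and-skinny matrix product followed by blockwise row norms. A minimal NumPy sketch is given below (illustrative only: it uses a plain dense multiply where Theorem 1.4's fast rectangular multiplication would be used, computes inverse square roots by eigendecomposition, and takes the Gaussian dimension $m$ as a parameter rather than a prescribed value).

```python
import numpy as np

def inv_sqrt(Q):
    """(Q)^{-1/2} for a symmetric positive definite matrix, via eigendecomposition."""
    vals, vecs = np.linalg.eigh(Q)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def reset_weights(A, Q_tilde_list, m, rng):
    """Approximate, for every row i, the product over the stored quadratics
        prod_{t'} || e_i^T A (Qtilde^{(t'-1)})^{-1/2} ||_2^2
    using one n x d times d x (m * len(Q_tilde_list)) multiplication and the
    Johnson-Lindenstrauss estimates (1/m) || e_i^T A (Qtilde)^{-1/2} G ||_2^2.
    The caller passes Q_tilde_list = [Qtilde^{(0)}, ..., Qtilde^{(t-1)}]."""
    n, d = A.shape
    # Stack (Qtilde^{(t'-1)})^{-1/2} G^{(t'-1)} side by side into one d x (t*m) matrix.
    right = np.hstack([inv_sqrt(Q) @ rng.standard_normal((d, m)) for Q in Q_tilde_list])
    P = A @ right                               # the single large rectangular product
    w = np.ones(n)
    for j in range(len(Q_tilde_list)):
        block = P[:, j * m:(j + 1) * m]         # rows are e_i^T A (Qtilde^{(j)})^{-1/2} G^{(j)}
        w *= np.sum(block ** 2, axis=1) / m     # JL estimate of the squared row norm
    return w
```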
1.2.3 Streaming algorithms

The problem of computing John ellipsoids is also well-studied in the streaming model, where the input points $a_i$ arrive one at a time in a stream [MSS10, AS15, WY22, MMO22, MMO23]. The streaming model is often considered when the number of points $n$ is so large that we cannot fit all of the points in memory at once, and the focus is primarily on designing algorithms with low space complexity. Our second result in this work is that approaches similar to the one we take to prove Theorem 1.6 in fact also give a low-space implementation of the iterative John ellipsoid algorithm of [CCLY19].

Theorem 1.8 (Streaming algorithms). Given $A \in \mathbb{R}^{n \times d}$, let $P$ be the polytope defined by $P = \{x \in \mathbb{R}^d : \|Ax\|_\infty \le 1\}$. Furthermore, suppose that $A$ is presented in a stream where the rows $a_i \in \mathbb{R}^d$ arrive one by one. For $\varepsilon \in (0, 1)$, there is an algorithm, Algorithm 1, that makes $T = O(\varepsilon^{-1}\log(n/d))$ passes over the stream, takes $O(d^2 T)$ time to update after each new row, and returns an ellipsoid $Q$ such that $\frac{1}{\sqrt{1+\varepsilon}}\,Q \subseteq P \subseteq \sqrt{d}\cdot Q$. Furthermore, the algorithm uses at most $O(d^2 T)$ words of space.

In Section 1.2.2, we showed that by storing only the quadratics $Q^{(t)}$ and computing the weights $w_i^{(t)} = \prod_{t'=1}^{t} a_i^\top (Q^{(t'-1)})^{\dagger} a_i$ only as needed, we could design fast algorithms for John ellipsoids. In fact, this idea is also useful in the streaming setting, since storing all of the weights $w_i^{(t)}$ requires $O(n)$ space per iteration, whereas storing the quadratics $Q^{(t)}$ requires only $O(d^2)$ space per iteration. Furthermore, in the streaming setting, we may optimize the update running time by instead storing the pseudo-inverses $(Q^{(t)})^{\dagger}$ of the quadratics and then updating them by using the Sherman-Morrison low rank update formula.² This gives the result of Theorem 1.8.

² We note that storing the pseudo-inverse may increase the space complexity by polynomial factors in the bit complexity model.

Algorithm 1 Streaming John ellipsoids via lazy updates
1: function STREAMINGJOHNELLIPSOID(input matrix $A$)
2:   for $t = 0$ to $T$ do
3:     Let $Q^{(t)} = 0$
4:     for $i = 1$ to $n$ do
5:       if $t = 0$ then
6:         Let $Q^{(t)} \leftarrow Q^{(t)} + a_i a_i^\top$
7:       else
8:         Let $w_i^{(t)} = \prod_{t'=1}^{t} a_i^\top (Q^{(t'-1)})^{\dagger} a_i$
9:         Let $Q^{(t)} \leftarrow Q^{(t)} + w_i^{(t)} a_i a_i^\top$
10:  return $\frac{1}{T+1}\sum_{t=0}^{T} Q^{(t)}$
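A compact Python sketch of Algorithm 1 follows (illustrative, not the paper's implementation): it keeps every quadratic invertible with a tiny ridge term, a simplification of the pseudo-inverses used in the text, maintains the running inverse within a pass by Sherman-Morrison rank-one updates, and reads rows from an in-memory array in place of a true stream.

```python
import numpy as np

def streaming_john_ellipsoid(A, T, ridge=1e-12):
    """Sketch of Algorithm 1: T+1 passes over the rows of A, storing just the
    inverses of the finished quadratics (O(d^2 T) words).  The running inverse
    within a pass is maintained by Sherman-Morrison rank-one updates; the tiny
    ridge term keeps it defined before d rows have arrived."""
    d = A.shape[1]
    Q_invs = []                                   # inverses of Q^{(0)}, ..., Q^{(t-1)}
    Q_sum = np.zeros((d, d))
    for t in range(T + 1):
        Q = ridge * np.eye(d)
        M = np.eye(d) / ridge                     # M maintained so that M = Q^{-1}
        for a in A:                               # one pass over the stream
            if t == 0:
                w = 1.0                           # lines 5-6 of Algorithm 1
            else:                                 # line 8: product of quadratic forms
                w = np.prod([a @ Qi @ a for Qi in Q_invs])
            u = np.sqrt(w) * a                    # line 9: Q <- Q + w a a^T = Q + u u^T
            Q += np.outer(u, u)
            Mu = M @ u                            # Sherman-Morrison update of M
            M -= np.outer(Mu, Mu) / (1.0 + u @ Mu)
        Q_invs.append(M)
        Q_sum += Q
    return Q_sum / (T + 1)                        # line 10: averaged quadratic form
```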
1.3 Notation

Throughout this paper, $A$ will denote an $n \times d$ matrix whose $n$ rows are given by vectors $a_i \in \mathbb{R}^d$. For positive numbers $a, b > 0$, we write $a = (1 \pm \varepsilon)b$ to mean that $(1-\varepsilon)b \le a \le (1+\varepsilon)b$. For symmetric positive semidefinite matrices $Q, R$, we write $Q = (1 \pm \varepsilon)R$ to mean that $(1-\varepsilon)R \preceq Q \preceq (1+\varepsilon)R$, where $\preceq$ denotes the Löwner order on PSD matrices.

2 Fast algorithms

2.1 Approximating the quadratics

We will analyze the following algorithm, Algorithm 2, for approximating the quadratics $Q^{(t)}$ of the iterative John ellipsoid algorithm.

Algorithm 2 Approximating the quadratics
1: function APPROXQUADRATIC(input matrix $A$, initial weights $\tilde w^{(0)}$, number of rounds $T$)
2:   Let $\tilde Q^{(0)} = A^\top \tilde W^{(0)} A$ for $\tilde W^{(0)} = \mathrm{diag}(\tilde w^{(0)})$.
3:   for $t = 1$ to $T$ do
4:     Compute $A(\tilde Q^{(t-1)})^{-1/2} G$ for a $d \times k$ Gaussian matrix $G$.
5:     Let $u_i^{(t)} = u_i^{(t-1)} \cdot \|e_i^\top A(\tilde Q^{(t-1)})^{-1/2} G\|_2^2$ for each $i \in [n]$, where $u^{(0)} = \tilde w^{(0)}$.
6:     Let $S^{(t)}$ be a $(1+\varepsilon)$-approximate leverage score sample of $\sqrt{U^{(t)}}A$ (Theorem 2.3).
7:     For rows $i$ sampled by $S^{(t)}$, set $v_i^{(t)} = \prod_{t'=1}^{t} a_i^\top (\tilde Q^{(t'-1)})^{-1} a_i$.
8:     Let $\tilde Q^{(t)} = (S^{(t)}\sqrt{V^{(t)}}A)^\top S^{(t)}\sqrt{V^{(t)}}A$.
9:   return $\{\tilde Q^{(t)}\}_{t=0}^{T}$

Fix a row $i$. Note then that for each iteration $t$, $\|e_i^\top A(\tilde Q^{(t-1)})^{-1/2} G\|_2^2$ is distributed as an independent $\chi^2$ variable with $k$ degrees of freedom, scaled by $\|e_i^\top A(\tilde Q^{(t-1)})^{-1/2}\|_2^2$, say $X_t$. Note then that after $T$ iterations, the accumulated product $\prod_{t'=1}^{T}\|e_i^\top A(\tilde Q^{(t'-1)})^{-1/2} G\|_2^2$ is distributed as
$$\prod_{t'=1}^{T}\|e_i^\top A(\tilde Q^{(t'-1)})^{-1/2}\|_2^2 \cdot X_{t'} = \Big(\prod_{t'=1}^{T}\big\|e_i^\top A(\tilde Q^{(t'-1)})^{-1/2}\big\|_2^2\Big)\cdot\prod_{t'=1}^{T} X_{t'} \tag{5}$$
for i.i.d. $\chi^2$ variables $X_{t'}$ with $k$ degrees of freedom. We will now bound each of the terms in this product.

2.1.1 Bounds on products of $\chi^2$ variables

We need the following bound on products of $\chi^2$ variables, which generalizes [SSK17, Proposition 1].

Lemma 2.1. Let $X_1, X_2, \ldots, X_t$ be $t$ i.i.d. $\chi^2$ variables with $k$ degrees of freedom. Then,
$$\Pr\Big\{\prod_{i=1}^{t} X_i \le \frac{1}{R}\Big\} \le \inf_{s \in (0, k/2)} C_{-s,k}^{t}\, R^{-s} \quad\text{and}\quad \Pr\Big\{\prod_{i=1}^{t} X_i \ge R\Big\} \le \inf_{s > 0} C_{s,k}^{t}\, R^{-s}, \qquad\text{where}\quad C_{s,k} = \frac{2^s\,\Gamma(s + k/2)}{\Gamma(k/2)}.$$

Proof. For $s > -k/2$, the moment generating function of $\log X_i$ is given by
$$\mathbb{E}\, e^{s\log X_i} = \mathbb{E}\, X_i^s = \frac{1}{2^{k/2}\Gamma(k/2)}\int_0^\infty x^s\, x^{k/2-1}e^{-x/2}\,dx = \frac{2^s\,\Gamma(s+k/2)}{\Gamma(k/2)} = C_{s,k}.$$
Then by Chernoff bounds,
$$\Pr\Big\{\prod_{i=1}^{t} X_i \le \frac{1}{R}\Big\} = \Pr\Big\{\sum_{i=1}^{t} -s\log X_i \ge s\log R\Big\} \le \mathbb{E}\big[e^{-s\log X_i}\big]^t R^{-s} = C_{-s,k}^{t} R^{-s}$$
and
$$\Pr\Big\{\prod_{i=1}^{t} X_i \ge R\Big\} = \Pr\Big\{\sum_{i=1}^{t} s\log X_i \ge s\log R\Big\} \le \mathbb{E}\big[e^{s\log X_i}\big]^t R^{-s} = C_{s,k}^{t} R^{-s}.$$

Using the above result, we can show that if the $\chi^2$ variables have $k = O(1/\theta)$ degrees of freedom and the number of rounds $T$ is $c\log n$ for a small enough constant $c$, then the product of the $\chi^2$ variables $\prod_{t=1}^{T} X_t$ will be within an $n^{\theta}$ factor of $1$ for some small constant $\theta > 0$.

Lemma 2.2. Fix a small constant $\theta > 0$ and let $k = O(1/\theta)$. Let $T = c\log n$ for a sufficiently small constant $c > 0$. Let $X_1, X_2, \ldots, X_T$ be $T$ i.i.d. $\chi^2$ variables with $k$ degrees of freedom. Then,
$$\Pr\Big\{ n^{-\theta} \le \prod_{i=1}^{T} X_i \le n^{\theta} \Big\} \ge 1 - \frac{1}{\mathrm{poly}(n)}.$$

Proof. Set $s = k/4$. Then, $C_{-s,k}$ and $C_{s,k}$ are absolute constants. Now choose $c$ to be small enough (depending on $s$ and $k$) such that for $T = c\log n$, we have $C_{-s,k}^{T}, C_{s,k}^{T} \le n$. Furthermore, set $R = n^{\theta}$. Then, by Lemma 2.1, both $\Pr\{\prod_{i=1}^{T} X_i \le \frac{1}{R}\}$ and $\Pr\{\prod_{i=1}^{T} X_i \ge R\}$ are bounded by $n\cdot R^{-s} = n\cdot n^{-\theta s} = 1/\mathrm{poly}(n)$, as desired.

It follows that with probability at least $1 - 1/\mathrm{poly}(n)$, the products of $\chi^2$ variables appearing in (5) are bounded by $n^{\theta}$ for every row $i \in [n]$ and for every iteration $t \in [T]$. We will condition on this event in the following discussion.

2.1.2 Bounds on the quadratic $\tilde Q^{(t)}$

In the previous section, we established that the products of $\chi^2$ variables in (5) are bounded by $n^{\theta}$. We will now use this fact to show that the quadratics $\tilde Q^{(t)}$ in Algorithm 2 are good approximate John ellipsoid quadratics. We will use the following leverage score sampling theorem for this.

Theorem 2.3 ([DMM06, SS11]). Let $A \in \mathbb{R}^{n\times d}$. Let $\tilde\tau_i \ge \tau_i(A)$ and let $p_i = \min\{1, \tilde\tau_i/\alpha\}$ for $\alpha = \Theta(\varepsilon^2)/\log(d/\delta)$. If $S$ is a diagonal matrix with $S_{i,i} = 1/\sqrt{p_i}$ with probability $p_i$ and $0$ otherwise for each $i \in [n]$, then with probability at least $1 - \delta$, for all $x \in \mathbb{R}^d$, $\|SAx\|_2^2 = (1 \pm \varepsilon)\|Ax\|_2^2$.
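A minimal NumPy sketch of this sampling scheme follows; the constant inside $\alpha$ is a heuristic choice rather than the constant from the theorem, and exact leverage scores are used as the (valid) overestimates in the usage example.

```python
import numpy as np

def leverage_score_sample(A, tau, eps, delta, rng):
    """Row sampling as in Theorem 2.3: given overestimates tau_i >= tau_i(A),
    keep row i with probability p_i = min(1, tau_i / alpha) and rescale it by
    1/sqrt(p_i), so that (SA)^T (SA) approximates A^T A up to (1 +- eps)
    with probability 1 - delta.  The constant in alpha is heuristic."""
    d = A.shape[1]
    alpha = eps**2 / (4.0 * np.log(d / delta))
    p = np.minimum(1.0, tau / alpha)
    keep = rng.random(len(p)) < p
    return A[keep] / np.sqrt(p[keep])[:, None], keep

# Usage sketch: exact leverage scores serve as valid overestimates.
rng = np.random.default_rng(0)
A = rng.standard_normal((50000, 10))
tau = np.einsum('ij,jk,ik->i', A, np.linalg.inv(A.T @ A), A)
SA, kept = leverage_score_sample(A, tau, eps=0.3, delta=0.01, rng=rng)
L = np.linalg.cholesky(A.T @ A)
M = np.linalg.solve(L, np.linalg.solve(L, SA.T @ SA).T).T   # L^{-1} (SA^T SA) L^{-T}
vals = np.linalg.eigvalsh(M)
print(kept.sum(), vals.min(), vals.max())   # extreme eigenvalues near [1 - eps, 1 + eps]
```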
This theorem, combined with the bounds on $\chi^2$ products, gives the following guarantee for the approximate quadratics $\tilde Q^{(t)}$.

Lemma 2.4. Fix a small constant $\theta > 0$ and let $k = O(1/\theta)$. Suppose that the leverage score sample in Line 6 oversamples by a factor of $O(n^{2\theta})$, that is, uses leverage score estimates $\tilde\tau_i$ such that $\tilde\tau_i = O(n^{2\theta})\,\tau_i(\sqrt{U^{(t)}}A)$. Then, with probability at least $1 - 1/\mathrm{poly}(n)$, we have for every $t \in [T]$ that
$$\tilde Q^{(t)} = (1 \pm \varepsilon)\,A^\top V^{(t)} A \tag{6}$$
where $v_i^{(t)} = \prod_{t'=1}^{t} a_i^\top (\tilde Q^{(t'-1)})^{-1} a_i$ for $i \in [n]$. Furthermore, $\tilde Q^{(t)}$ can be computed in $O(n^{2\theta})\,T\varepsilon^{-2} d^{\omega+1}\log n$ time in each iteration.

Proof. We first condition on the success of the event of Lemma 2.2, so that the $\chi^2$ products in (5) are bounded by $n^{\theta}$ factors for every row $i \in [n]$ and every iteration $t \in [T]$. We also condition on the success of the leverage score sampling for all $T$ iterations. Note that by (5) and the bound on the $\chi^2$ products, $u_i^{(t)}$ is within an $O(n^{2\theta})$ factor of $v_i^{(t)}$, and thus $S^{(t)}$ is a correct leverage score sample for $\sqrt{V^{(t)}}A$. We thus have that
$$\tilde Q^{(t)} = (S^{(t)}\sqrt{V^{(t)}}A)^\top S^{(t)}\sqrt{V^{(t)}}A = (1 \pm \varepsilon)\,A^\top V^{(t)}A.$$
For the running time, note that $S^{(t)}$ samples at most $O(\varepsilon^{-2} n^{2\theta} d\log n)$ rows in each iteration, and each sampled row $i$ requires $O(d^{\omega} T)$ time to compute $v_i^{(t)}$. This gives the running time claim.

Algorithm 3 John ellipsoids via lazy updates
1: function JOHNELLIPSOID(input matrix $A$)
2:   Let $B = O(c^{-1}\varepsilon^{-1})$ and $T = O(c\log(n/d))$ for a sufficiently small constant $c$.
3:   Let $\tilde w_i^{(0)} = d/n$ for $i \in [n]$.
4:   for $b = 1$ to $B$ do
5:     Let $\{\tilde Q^{(t)}\}_{t=0}^{T}$ be given by APPROXQUADRATIC($A$, $\tilde w^{(0)}$) (Algorithm 2).
6:     Let $G^{(t-1)}$ for $t \in [T]$ be a random $d \times m$ Gaussian matrix for $m = O(\varepsilon^{-2}(BT)^2\log n)$.
7:     Compute $A\cdot\big[(\tilde Q^{(0)})^{-1/2}G^{(0)}, (\tilde Q^{(1)})^{-1/2}G^{(1)}, \ldots, (\tilde Q^{(T)})^{-1/2}G^{(T)}\big]$.
8:     Let $w_i^{(b,t)} = \prod_{t'=1}^{t}\frac{1}{m}\|e_i^\top A(\tilde Q^{(t'-1)})^{-1/2}G^{(t'-1)}\|_2^2$ for each $i \in [n]$.
9:     Let $\tilde w^{(0)} = w^{(b,T)}$.
10:  Let $\bar w = \frac{1}{BT}\sum_{b,t} w^{(b,t)}$.
11: return $A^\top \bar W A$

2.2 Full algorithm

We now combine the subroutine for approximating quadratics from Section 2.1 with a resetting procedure using fast matrix multiplication to obtain our full algorithm for quickly computing John ellipsoids. The full algorithm is presented in Algorithm 3. We will need the following theorem, which summarizes results of [CCLY19] on guarantees of the fixed point iteration algorithm (2) under approximate leverage score computations.

Theorem 2.5 (Lemma 2.3 and Lemma C.4 of [CCLY19]). Let $A \in \mathbb{R}^{n \times d}$ and let $P = \{x : \|Ax\|_\infty \le 1\}$. Let $T = O(\varepsilon^{-1}\log(n/d))$. Suppose that $w_i^{(t)} = (1 \pm \varepsilon)\,\tau_i(\sqrt{W^{(t-1)}}A)$ for all $t \in [T]$ and $i \in [n]$. Then, if $Q$ is the ellipsoid given by the quadratic form $A^\top W^{(T)}A$, then $\frac{1}{\sqrt{1+\varepsilon}}\,Q \subseteq P \subseteq \sqrt{d}\cdot Q$.

We also use the Johnson-Lindenstrauss lemma, which is a standard tool in randomized numerical linear algebra, especially in the context of approximating leverage scores [SS11, DMMW12, LMP13, CLM+15].

Theorem 2.6 ([JL84, DG03]). Let $m = O(\varepsilon^{-2}\log(n/\delta))$ and let $G$ be a random $m \times d$ Gaussian matrix. Then, for any $n$ points $a_1, a_2, \ldots, a_n \in \mathbb{R}^d$ in $d$ dimensions, we have with probability at least $1 - \delta$ that $\frac{1}{m}\|G a_i\|_2^2 = (1 \pm \varepsilon)\|a_i\|_2^2$ simultaneously for every $i \in [n]$.
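Before giving the proof, the outer structure of Algorithm 3 can be summarized in code. The sketch below is illustrative only: `approx_quadratics` is a hypothetical stand-in for APPROXQUADRATIC (Algorithm 2), the Gaussian dimension $m$ is a parameter rather than the prescribed $O(\varepsilon^{-2}(BT)^2\log n)$, and a plain dense multiply replaces the fast rectangular multiplication used in Line 7.

```python
import numpy as np

def john_ellipsoid(A, B, T, m, approx_quadratics, rng):
    """Outer loop of Algorithm 3 (sketch).  `approx_quadratics(A, w0, T)` is a
    hypothetical stand-in for APPROXQUADRATIC (Algorithm 2) returning the list
    [Qtilde^{(0)}, ..., Qtilde^{(T)}] of approximate quadratics."""
    n, d = A.shape
    w0 = np.full(n, d / n)                        # line 3
    w_sum = np.zeros(n)
    for b in range(B):                            # line 4
        Q_tilde = approx_quadratics(A, w0, T)     # line 5
        # Lines 6-7: one n x d times d x (T*m) product.  Columns of L^{-T} Z are
        # N(0, Qtilde^{-1}), i.e. distributed as Qtilde^{-1/2} G for Gaussian G.
        right = np.hstack([np.linalg.solve(np.linalg.cholesky(Q).T,
                                           rng.standard_normal((d, m)))
                           for Q in Q_tilde[:T]])
        P = A @ right
        w_bt = np.ones(n)
        for t in range(1, T + 1):                 # line 8: running product of JL estimates
            block = P[:, (t - 1) * m : t * m]
            w_bt = w_bt * (np.sum(block ** 2, axis=1) / m)
            w_sum += w_bt                         # accumulated for line 10
        w0 = w_bt.copy()                          # line 9: reset the initial weights
    w_bar = w_sum / (B * T)                       # line 10
    return A.T @ (w_bar[:, None] * A)             # line 11: quadratic form of the output
```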
We will now give a proof of Theorem 1.6.

Proof of Theorem 1.6. We first argue the correctness of the algorithm. By Theorem 2.5, it suffices to verify that the $w_i^{(b,t)}$ compute $(1+\varepsilon)$-approximate leverage scores in order to prove the claimed guarantees of Theorem 1.6. For the updates within APPROXQUADRATIC (Algorithm 2), we have by Lemma 2.4 that $\tilde Q^{(t)} = (1\pm\varepsilon)A^\top V^{(t)}A$. Thus,
$$v_i^{(t-1)}\,\big\|e_i^\top A(\tilde Q^{(t-1)})^{-1/2}\big\|_2^2 = (1\pm\varepsilon)\,v_i^{(t-1)}\,\big\|e_i^\top A(A^\top V^{(t-1)}A)^{-1/2}\big\|_2^2,$$
and thus the weights $v_i^{(t)}$ output by APPROXQUADRATIC (Algorithm 2) indeed satisfy the requirements of Theorem 2.5. For the weights computed in Line 8 of Algorithm 3, note that we also compute these weights, but this time with approximation error from the application of the scaled Gaussian matrix $\frac{1}{\sqrt m}G^{(t)}$ to speed up the computation. Applying Theorem 2.6 with $\varepsilon$ set to $\varepsilon/BT$ and failure probability $\delta = 1/\mathrm{poly}(n)$, we have that
$$\frac{1}{m}\big\|e_i^\top A(\tilde Q^{(t-1)})^{-1/2}G^{(t-1)}\big\|_2^2 = (1 \pm \varepsilon/BT)\,\big\|e_i^\top A(\tilde Q^{(t-1)})^{-1/2}\big\|_2^2.$$
Note then that Line 8 takes a product of at most $BT$ of these approximations, so the total error in the approximation is at most $(1 \pm \varepsilon/BT)^{BT} = (1 \pm O(\varepsilon))$.

The running time is given by $BT$ iterations of the inner loop of APPROXQUADRATIC and $B$ iterations of the fast matrix multiplication procedure in Line 7 of Algorithm 3. The inner loop of APPROXQUADRATIC requires $O(nd)$ time to compute the product with the $d \times k$ Gaussian matrix, as well as the time to compute the approximate quadratic, which is bounded in Lemma 2.4. Altogether, this gives the claimed running time bound.

3 Future directions

In this work, we developed fast algorithms and low-space streaming algorithms for the problem of computing John ellipsoids. Our fast algorithms use a combination of lazy updates and fast matrix multiplication to substantially improve the running time for computing John ellipsoids, and we apply similar ideas to obtain a low-space streaming implementation of the John ellipsoid algorithm. Our results have several limitations that we discuss here, which we leave for future work to resolve.

First, our algorithm makes crucial use of fast matrix multiplication in order to obtain its running time improvements. However, this makes it a less attractive option for practical implementations, and also makes the polynomial dependence on $\varepsilon$ in the $O(n)\,\mathrm{poly}(\varepsilon^{-1})$ term rather large. Thus, it is an interesting question whether the running time that we obtain in Theorem 1.6 is achievable without fast matrix multiplication.

Question 3.1. Is there an algorithm with the guarantee of Theorem 1.6 that avoids fast matrix multiplication?

More generally, it is an interesting question to design algorithms for approximating John ellipsoids with optimal running time. This is an old question which has been studied in a long line of work [Wol70, Atw73, KT93, NN94, Kha96, STT78, KY05, SF04, TY07, AST08, Yu11, HFR20], and we believe that the investigation of this question will lead to further interesting developments in algorithms research.

Question 3.2. What is the optimal running time of approximating John ellipsoids?

For instance, one interesting question is whether it is possible to obtain nearly linear time algorithms that run in time $O(nd) + O(n)\,\mathrm{poly}(\varepsilon^{-1})$, or even input sparsity time algorithms that run in time $O(\mathrm{nnz}(A)) + O(n)\,\mathrm{poly}(\varepsilon^{-1})$. The resolution of such questions for the least squares linear regression problem has led to great progress in algorithms, and studying these questions for the John ellipsoid problem may have interesting consequences as well.

John ellipsoids are closely related to $\ell_p$ Lewis weights, which give a natural $\ell_p$ generalization of John ellipsoids and leverage scores and have been a valuable tool in randomized numerical linear algebra. There has been a recent focus on fast algorithms for computing $(1+\varepsilon)$-approximate $\ell_p$ Lewis weights [FLPS22, AGS24], and thus it is natural to ask whether developments in algorithms for John ellipsoids would carry over to algorithms for $\ell_p$ Lewis weights as well.

Question 3.3. What is the optimal running time of approximating $\ell_p$ Lewis weights?

Finally, we raise questions concerning streaming algorithms for approximating John ellipsoids. In Theorem 1.8, we gave a multi-pass streaming algorithm which obtains a $(1+\varepsilon)$-optimal John ellipsoid. A natural question is whether a similar guarantee can be achieved in fewer passes, or if there are pass lower bounds for computing $(1+\varepsilon)$-optimal John ellipsoids.

Question 3.4. Are there small space streaming algorithms for approximating John ellipsoids up to a $(1+\varepsilon)$ factor which make fewer than $O(\varepsilon^{-1}\log(n/d))$ passes, or do small space streaming algorithms necessarily require many passes?
We note that there are one-pass small space streaming algorithms if we allow for approximation factors that scale as $O(\sqrt{\log n})$ rather than $(1+\varepsilon)$ [WY22, MMO22, MMO23].

Acknowledgments and Disclosure of Funding

The authors were supported in part by a Simons Investigator Award and NSF CCF-2335412.

Bibliography

[AGS24] Simon Apers, Sander Gribling, and Aaron Sidford. On computing approximate Lewis weights. CoRR, abs/2404.02881, 2024.
[AS15] Pankaj K. Agarwal and R. Sharathkumar. Streaming algorithms for extent problems in high dimensions. Algorithmica, 72(1):83-98, 2015.
[AST08] Selin Damla Ahipasaoglu, Peng Sun, and Michael J. Todd. Linear convergence of a modified Frank-Wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optim. Methods Softw., 23(1):5-19, 2008.
[Atw69] Corwin L. Atwood. Optimal and efficient designs of experiments. Ann. Math. Statist., 40:1570-1602, 1969.
[Atw73] Corwin L. Atwood. Sequences converging to D-optimal designs of experiments. Ann. Statist., 1:342-352, 1973.
[BCK12] Sébastien Bubeck, Nicolò Cesa-Bianchi, and Sham M. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Shie Mannor, Nathan Srebro, and Robert C. Williamson, editors, COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, volume 23 of JMLR Proceedings, pages 41.1-41.14. JMLR.org, 2012.
[BMV23] Aditya Bhaskara, Sepideh Mahabadi, and Ali Vakilian. Tight bounds for volumetric spanners and applications. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023.
[CCLY19] Michael B. Cohen, Ben Cousins, Yin Tat Lee, and Xin Yang. A near-optimal algorithm for approximating the John ellipsoid. In Alina Beygelzimer and Daniel Hsu, editors, Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, volume 99 of Proceedings of Machine Learning Research, pages 849-873. PMLR, 2019.
[CDWY18] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. Fast MCMC sampling algorithms on polytopes. J. Mach. Learn. Res., 19:55:1-55:86, 2018.
[Cla05] Kenneth L. Clarkson. Subgradient and sampling algorithms for ℓ1 regression. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '05, pages 257-266, USA, 2005. Society for Industrial and Applied Mathematics.
[CLM+15] Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Tim Roughgarden, editor, Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ITCS 2015, Rehovot, Israel, January 11-13, 2015, pages 181-190. ACM, 2015.
[Cop82] Don Coppersmith. Rapid multiplication of rectangular matrices. SIAM J. Comput., 11(3):467-471, 1982.
[CW13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Dan Boneh, Tim Roughgarden, and Joan Feigenbaum, editors, Symposium on Theory of Computing Conference, STOC '13, Palo Alto, CA, USA, June 1-4, 2013, pages 81-90. ACM, 2013.
[DDH+09] Anirban Dasgupta, Petros Drineas, Boulos Harb, Ravi Kumar, and Michael W. Mahoney. Sampling algorithms and coresets for ℓp regression. SIAM J. Comput., 38(5):2060-2078, 2009.
[DG03] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60-65, 2003.
[DMM06] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for ℓ2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22-26, 2006, pages 1127-1136. ACM Press, 2006.
[DMMW12] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res., 13:3475-3506, 2012.
[DWZ23] Ran Duan, Hongxun Wu, and Renfei Zhou. Faster matrix multiplication via asymmetric hashing. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 2129-2138. IEEE, 2023.
[FLPS22] Maryam Fazel, Yin Tat Lee, Swati Padmanabhan, and Aaron Sidford. Computing Lewis weights to high precision. In Joseph (Seffi) Naor and Niv Buchbinder, editors, Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA 2022, Virtual Conference / Alexandria, VA, USA, January 9-12, 2022, pages 2723-2742. SIAM, 2022.
[Gli98] François Glineur. Pattern separation via ellipsoids and conic programming. Mémoire de DEA, Faculté Polytechnique de Mons, Mons, Belgium, 1998.
[GN23] Adam Gustafson and Hariharan Narayanan. John's walk. Adv. in Appl. Probab., 55(2):473-491, 2023.
[HFR20] Radoslav Harman, Lenka Filová, and Peter Richtárik. A randomized exchange algorithm for computing optimal approximate designs of experiments. J. Amer. Statist. Assoc., 115(529):348-361, 2020.
[HK16] Elad Hazan and Zohar S. Karnin. Volumetric spanners: An efficient exploration basis for learning. J. Mach. Learn. Res., 17:119:1-119:34, 2016.
[JL84] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26 of Contemp. Math., pages 189-206. Amer. Math. Soc., Providence, RI, 1984.
[Joh48] Fritz John. Extremum problems with inequalities as subsidiary conditions. In Studies and Essays Presented to R. Courant on his 60th Birthday, January 8, 1948, pages 187-204. Interscience Publishers, Inc., New York, N.Y., 1948.
[Kha79] Leonid G. Khachiyan. A polynomial algorithm in linear programming. Dokl. Akad. Nauk SSSR, 244(5):1093-1096, 1979.
[Kha96] Leonid G. Khachiyan. Rounding of polytopes in the real number model of computation. Math. Oper. Res., 21(2):307-320, 1996.
[KT93] Leonid G. Khachiyan and Michael J. Todd. On the complexity of approximating the maximal inscribed ellipsoid for a polytope. Math. Programming, 61(2, Ser. A):137-159, 1993.
[KY05] Piyush Kumar and E. Alper Yildirim. Minimum-volume enclosing ellipsoids and core sets. J. Optim. Theory Appl., 126(1):1-21, 2005.
[LMP13] Mu Li, Gary L. Miller, and Richard Peng. Iterative row sampling. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pages 127-136. IEEE Computer Society, 2013.
[MMO22] Yury Makarychev, Naren Sarayu Manoj, and Max Ovsiankin. Streaming algorithms for ellipsoidal approximation of convex polytopes. In Po-Ling Loh and Maxim Raginsky, editors, Conference on Learning Theory, 2-5 July 2022, London, UK, volume 178 of Proceedings of Machine Learning Research, pages 3070-3093.
PMLR, 2022.
[MMO23] Yury Makarychev, Naren Sarayu Manoj, and Max Ovsiankin. Near-optimal streaming ellipsoidal rounding for general convex polytopes. CoRR, abs/2311.09460, 2023.
[MSS10] Asish Mukhopadhyay, Animesh Sarker, and Tom Switzer. Approximate ellipsoid in the streaming model. In Weili Wu and Ovidiu Daescu, editors, Combinatorial Optimization and Applications - 4th International Conference, COCOA 2010, Kailua Kona, HI, USA, December 18-20, 2010, Proceedings, Part II, volume 6509 of Lecture Notes in Computer Science, pages 401-413. Springer, 2010.
[NN94] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming, volume 13 of SIAM Studies in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1994.
[NTZ13] Aleksandar Nikolov, Kunal Talwar, and Li Zhang. The geometry of differential privacy: the sparse and approximate cases. In Dan Boneh, Tim Roughgarden, and Joan Feigenbaum, editors, Symposium on Theory of Computing Conference, STOC '13, Palo Alto, CA, USA, June 1-4, 2013, pages 351-360. ACM, 2013.
[Ros65] Judah Ben Rosen. Pattern separation by convex programming. Journal of Mathematical Analysis and Applications, 10(1):123-134, 1965.
[SF04] Peng Sun and Robert M. Freund. Computation of minimum-volume covering ellipsoids. Oper. Res., 52(5):690-706, 2004.
[Sho77] Naum Z. Shor. Cut-off method with space extension in convex programming problems. Cybernetics, 13(1):94-96, 1977.
[Sil13] Samuel Silvey. Optimal design: an introduction to the theory for parameter estimation, volume 1. Springer Science & Business Media, 2013.
[SS11] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM J. Comput., 40(6):1913-1926, 2011.
[SSK17] Željka Stojanac, Daniel Suess, and Martin Kliesch. On products of Gaussian random variables. arXiv preprint arXiv:1711.10516, 2017.
[ST80] B. W. Silverman and D. M. Titterington. Minimum covering ellipses. SIAM Journal on Scientific and Statistical Computing, 1(4):401-409, 1980.
[STT78] Samuel D. Silvey, D. H. Titterington, and Ben Torsney. An algorithm for optimal designs on a design space. Communications in Statistics - Theory and Methods, 7(14):1379-1389, 1978.
[SYYZ22] Zhao Song, Xin Yang, Yuanyuan Yang, and Tianyi Zhou. Faster algorithm for structured John ellipsoid computation. CoRR, abs/2211.14407, 2022.
[TKE88] S. P. Tarasov, L. G. Khachiyan, and I. I. Èrlikh. The method of inscribed ellipsoids. Dokl. Akad. Nauk SSSR, 298(5):1081-1085, 1988.
[TMF20] Murad Tukan, Alaa Maalouf, and Dan Feldman. Coresets for near-convex functions. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[Tod16] Michael J. Todd. Minimum volume ellipsoids - theory and algorithms, volume 23 of MOS-SIAM Series on Optimization. SIAM, 2016.
[TWZ+22] Murad Tukan, Xuan Wu, Samson Zhou, Vladimir Braverman, and Dan Feldman. New coresets for projective clustering and applications. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, volume 151 of Proceedings of Machine Learning Research, pages 5391-5415. PMLR, 2022.
[TY07] Michael J. Todd and E.
Alper Yildirim. On Khachiyan's algorithm for the computation of minimum-volume enclosing ellipsoids. Discrete Appl. Math., 155(13):1731-1744, 2007.
[Vai96] Pravin M. Vaidya. A new algorithm for minimizing convex functions over convex sets. Math. Programming, 73(3, Ser. A):291-341, 1996.
[Vem05] Santosh Vempala. Geometric random walks: a survey. Combinatorial and computational geometry, 52(573-612):2, 2005.
[Wil11] Ryan Williams. Non-uniform ACC circuit lower bounds. In Proceedings of the 26th Annual IEEE Conference on Computational Complexity, CCC 2011, San Jose, California, USA, June 8-10, 2011, pages 115-125. IEEE Computer Society, 2011.
[Wil24] Ryan Williams. Personal communication, 2024.
[Wol70] P. Wolfe. Convergence theory in nonlinear programming. In Integer and nonlinear programming, pages 1-36. North-Holland, Amsterdam-London, 1970.
[WXXZ24] Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. New bounds for matrix multiplication: from alpha to omega. In David P. Woodruff, editor, Proceedings of the 2024 ACM-SIAM Symposium on Discrete Algorithms, SODA 2024, Alexandria, VA, USA, January 7-10, 2024, pages 3792-3835. SIAM, 2024.
[WY22] David P. Woodruff and Taisuke Yasuda. High-dimensional geometric streaming in polynomial space. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2022, Denver, CO, USA, October 31 - November 3, 2022, pages 732-743. IEEE, 2022.
[WY23] David P. Woodruff and Taisuke Yasuda. New subset selection algorithms for low rank approximation: Offline and online. In Barna Saha and Rocco A. Servedio, editors, Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, Orlando, FL, USA, June 20-23, 2023, pages 1802-1813. ACM, 2023.
[Yu11] Yaming Yu. D-optimal designs via a cocktail algorithm. Stat. Comput., 21(4):475-481, 2011.