Discrepancy Minimization in Input-Sparsity Time

Yichuan Deng¹  Xiaoyu Li²  Zhao Song³  Omri Weinstein⁴

¹The University of Washington  ²University of New South Wales  ³University of California, Berkeley  ⁴Columbia University. Correspondence to: Xiaoyu Li <7.xiaoyu.li@gmail.com>, Zhao Song. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

A recent work by [Larsen, SODA 2023] introduced a faster combinatorial alternative to Bansal's SDP algorithm for finding a coloring $x \in \{-1, 1\}^n$ that approximately minimizes the discrepancy $\mathrm{disc}(A, x) := \|Ax\|_\infty$ of a real-valued $m \times n$ matrix $A$. Larsen's algorithm runs in $\widetilde{O}(mn^2)$ time, compared to Bansal's $\widetilde{O}(mn^{4.5})$-time algorithm, with a slightly weaker logarithmic approximation ratio in terms of the hereditary discrepancy of $A$ [Bansal, FOCS 2010]. We present a combinatorial $\widetilde{O}(\mathrm{nnz}(A) + n^3)$-time algorithm with the same approximation guarantee as Larsen's, which is optimal for tall matrices where $m = \mathrm{poly}(n)$. Using a more intricate analysis and fast matrix multiplication, we further achieve a runtime of $\widetilde{O}(\mathrm{nnz}(A) + n^{2.53})$, breaking the cubic barrier for square matrices and surpassing the limitations of linear-programming approaches [Eldan and Singh, RS&A 2018]. Our algorithm relies on two key ideas: (i) a new sketching technique for finding a projection matrix with a short $\ell_2$-basis using implicit leverage-score sampling, and (ii) a data structure for efficiently implementing the iterative Edge-Walk partial-coloring algorithm [Lovett and Meka, SICOMP 2015], using an alternative analysis that enables lazy batch updates with low-rank corrections. Our results nearly close the computational gap between real-valued and binary matrices, for which input-sparsity time coloring was recently obtained by [Jain, Sah and Sawhney, SODA 2023].

1. Introduction

Discrepancy theory is a fundamental subject in combinatorics and theoretical computer science, studying how to color elements of a finite set-system $S_1, \ldots, S_m \subseteq \{1, \ldots, n\}$ with two colors (e.g., red and blue) so as to minimize the maximum imbalance in color distribution across all sets. It finds diverse applications in fields such as computational geometry (Matousek, 1999b; De Berg, 2000), probabilistic algorithms (Spencer, 1985; Chazelle, 2000), machine learning (Vapnik & Chervonenkis, 1971; Talagrand, 1995; Karnin & Liberty, 2019; Bechavod et al., 2022; Han et al., 2025), differential privacy (Muthukrishnan & Nikolov, 2012; Nikolov et al., 2013), and optimization (Beck & Fiala, 1981; Bansal, 2010; 2012; Nikolov, 2015).

Definition 1.1 (Discrepancy). The discrepancy of a real matrix $A \in \mathbb{R}^{m \times n}$ with respect to a coloring vector $x \in \{-1, 1\}^n$ is defined as
$$\mathrm{disc}(A, x) := \|Ax\|_\infty = \max_{j \in [m]} |(Ax)_j|.$$
The discrepancy of a real matrix $A \in \mathbb{R}^{m \times n}$ is defined as
$$\mathrm{disc}(A) := \min_{x \in \{-1, 1\}^n} \mathrm{disc}(A, x).$$

This is a natural generalization of the classic combinatorial notion of discrepancy of set systems, corresponding to binary matrices $A \in \{0, 1\}^{m \times n}$ whose rows are the indicator vectors of $m$ sets over a ground set of $[n]$ elements; the goal is to find a coloring $x \in \{-1, 1\}^n$ that is as balanced as possible simultaneously on all sets. Most of the history of this problem focused on the existential question of understanding the minimum possible discrepancy achievable for various classes of matrices.
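To make Definition 1.1 concrete, here is a minimal brute-force sketch in Python (assuming numpy; the helper names are ours, not from the paper):

```python
import itertools
import numpy as np

def disc_Ax(A: np.ndarray, x: np.ndarray) -> float:
    """disc(A, x) = ||Ax||_inf = max_j |(Ax)_j|."""
    return float(np.max(np.abs(A @ x)))

def disc_bruteforce(A: np.ndarray) -> float:
    """disc(A) = min over all x in {-1, +1}^n; exponential, for tiny n only."""
    n = A.shape[1]
    return min(disc_Ax(A, np.array(s)) for s in itertools.product((-1.0, 1.0), repeat=n))

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
print(disc_bruteforce(A))  # 0.0, attained e.g. by x = (1, -1, 1)
```

Efficient algorithms, of course, cannot afford this exhaustive search; the point of this paper is to approximate the minimizer quickly.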
For general set systems of size $n$ (arbitrary $n \times n$ binary matrices $A$), the classic result of Spencer (1985) states that $\mathrm{disc}(A) \leq 6\sqrt{n}$, which is asymptotically better than a random coloring ($\Theta(\sqrt{n \log n})$). More recent works have focused on restricted matrix families, such as sparse matrices (Banaszczyk, 1998; Beck & Fiala, 1981), showing that it is possible to achieve $o(\sqrt{n})$ discrepancy in these restricted cases. For example, the minimum-discrepancy coloring of $k$-sparse binary matrices (sparse set systems) turns out to have discrepancy at most $O(\sqrt{k \log n})$ (Banaszczyk, 1998). All of these results, however, do not provide polynomial-time algorithms, since they are non-constructive: they only argue about the existence of low-discrepancy colorings, which prevents their use in algorithmic applications that rely on low-discrepancy (partial) colorings, such as bin-packing problems (Rothvoß, 2013; Hoberg & Rothvoss, 2017).¹ Indeed, the question of efficiently finding a low-discrepancy coloring, i.e., computing $\mathrm{disc}(A)$, was less understood until recently and is more nuanced: Charikar et al. (2011) showed that it is NP-hard to distinguish whether a matrix has $\mathrm{disc}(A) = 0$ or $\mathrm{disc}(A) = \Omega(\sqrt{n})$, suggesting that an $\omega(1)$ approximation is inevitable to achieve fast runtimes. The celebrated work of Bansal (2010) gave the first polynomial-time algorithm achieving an additive $O(\sqrt{n})$-approximation to the optimal coloring of general $n \times n$ matrices, matching Spencer's non-constructive result. Bansal's algorithm has approximation guarantees in terms of the hereditary discrepancy (Lovász et al., 1986).

Definition 1.2 (Hereditary discrepancy). Given a matrix $A \in \mathbb{R}^{m \times n}$, its hereditary discrepancy is defined as
$$\mathrm{herdisc}(A) := \max_{B \in \mathcal{A}} \mathrm{disc}(B),$$
where $\mathcal{A}$ is the set of all matrices obtained from $A$ by deleting some of its columns.

Bansal (2010) gave an SDP-based algorithm that finds a coloring $\tilde{x}$ satisfying $\mathrm{disc}(A, \tilde{x}) = O(\log(mn) \cdot \mathrm{herdisc}(A))$ for any real matrix $A$, in time $\widetilde{O}(mn^{4.5})$ assuming state-of-the-art SDP solvers (Jiang et al., 2020b; Huang et al., 2022b). In other words, if all submatrices of $A$ have low-discrepancy colorings, then it is in fact possible (yet still quite expensive) to find an almost-matching overall coloring. Building on Bansal's work, Lovett & Meka (2015) designed a simpler algorithm for set systems (binary matrices $A$), running in $\widetilde{O}((n + m)^3)$ time. The main new idea of their algorithm, which is also central to our work, was a subroutine for repeatedly finding a partial coloring via random walks (a.k.a. EDGE-WALK), which in every iteration rounds a constant fraction of the coordinates of a fractional coloring to integral values in $\{-1, 1\}$ (more on this in the next section). Follow-up works by Rothvoss (2017) and Eldan & Singh (2018) extended these ideas to the real-valued case and developed faster convex-optimization frameworks for obtaining low-discrepancy colorings, the latter requiring $O(\log n)$ linear programs instead of a semidefinite program (Bansal, 2010), assuming a value oracle for $\mathrm{herdisc}(A)$.²

¹If we do not have a constructive polynomial-time algorithm, then we cannot compute an approximate solution for bin packing in polynomial time. This is common in integer programming, where one first relaxes to a linear program and then rounds the solution back to an integer one.

²Eldan & Singh (2018)'s LP requires an upper-bound estimate on $\mathrm{herdisc}(A)$ for each of the $O(\log n)$ sequential LPs; hence standard exponential guessing seems too expensive, as the number of possible branches is $\geq 2^{O(\log n)}$.
In this model, Eldan & Singh (2018) yields an $O(\max\{mn + n^3, (m + n)^{\omega}\})$-time approximate discrepancy algorithm via state-of-the-art (tall) LP solvers (Brand et al., 2020). This line of work, however, has a fundamental setback for achieving input-sparsity time, which is a major open problem for (high-accuracy) LP solvers (Bubeck et al., 2018). Sparse matrices are often the realistic case for discrepancy problems, and have been widely studied in this context, as discussed earlier in the introduction. Another drawback of convex-optimization-based algorithms is that they are far from practical, due to the complicated nature of fast LP solvers. Interestingly, in the binary (set-system) case, these limitations have been very recently overcome in the breakthrough work of Jain et al. (2023), who gave an $\widetilde{O}(n + \mathrm{nnz}(A))$-time coloring algorithm for binary matrices $A \in \{-1, 1\}^{m \times n}$, with near-optimal ($O(\sqrt{n \log(m/n + 2)})$) discrepancy. While their approach, too, was based on convex optimization, their main observation was that an approximate LP solver, using first-order methods, in fact suffices for a logarithmic approximation. Unfortunately, the input-sparsity running time of the algorithm in Jain et al. (2023) does not extend to real-valued matrices, as their LP is based on a heavy-light decomposition of the rows of binary matrices according to their support size. More precisely, generalizing the argument of Jain et al. (2023) to matrices with entries in a range, say, $[-R, R]$, only guarantees that a uniformly random vector would have discrepancy $\widetilde{O}(\mathrm{poly}(R)\sqrt{n})$ on the heavy rows, and this term would also govern the approximation ratio achieved by their algorithm.

By contrast, a concurrent result of Larsen (2023) gave a purely combinatorial (randomized) algorithm, which is not as fast, but handles general real matrices and makes a step toward practical coloring algorithms. Larsen's algorithm improves Bansal's SDP from $O(mn^{4.5})$ to $\widetilde{O}(mn^2 + n^3)$ time, at the price of a slightly weaker approximation guarantee: for any $A \in \mathbb{R}^{m \times n}$, Larsen (2023) finds a coloring $\tilde{x} \in \{-1, +1\}^n$ such that $\mathrm{disc}(A, \tilde{x}) = O(\mathrm{herdisc}(A) \log n \log^{1.5} m)$. Recent work by Jambulapati et al. (2024) introduced a new framework for efficient computation of the discrepancy, with which they successfully recovered Spencer's result in time $\widetilde{O}(\mathrm{nnz}(A) \log^5 n)$ for matrices $A$ with entries bounded by 1 in absolute value. These recent exciting developments naturally raise the following question:

Is it possible to achieve input-sparsity time for discrepancy minimization with general (real-valued) matrices?

In fact, this was one of the main open questions raised in the 2018 workshop on Discrepancy Theory and Integer Programming (Dadush, 2018).

Table 1. Progress on approximate discrepancy-minimization algorithms for real-valued $m \times n$ matrices ($m \geq n$). For simplicity, we ignore $n^{o(1)}$ and $\mathrm{poly}(\log n)$ factors in the table. (Eldan & Singh, 2018)* refers to a (black-box) combination of our Theorem 1.5 with (Eldan & Singh, 2018), using state-of-the-art LP solvers for square and tall matrices (Lee & Sidford, 2014; Brand et al., 2020; Jiang et al., 2021).
| Reference | Method | Running Time |
|---|---|---|
| Bansal (2010) | SDP (Jiang et al., 2020c; Huang et al., 2022b) | $mn^{4.5}$ |
| Eldan & Singh (2018)* | (Step 1: Theorem 1.5) + (Step 2: LP (Jiang et al., 2021)) | $m^{\omega} + m^{2+1/18}$ |
| Eldan & Singh (2018)* | (Step 1: Theorem 1.5) + (Step 2: LP (Brand et al., 2020)) | $mn + n^3$ |
| Eldan & Singh (2018)* | (Step 1: Theorem 1.5) + (Step 2: LP (Lee & Sidford, 2014)) | $\mathrm{nnz}(A)\sqrt{n} + n^{2.5}$ |
| Larsen (2023) | Combinatorial | $mn^2 + n^3$ |
| Ours (Theorem 1.3) | Combinatorial | $\mathrm{nnz}(A) + n^3$ |
| Ours (Theorem 1.3) | (Step 1: Theorem 1.5) + (Step 2: Lemma E.12) | $\mathrm{nnz}(A) + n^{2.53}$ |

1.1. Our Results

We answer this question in the affirmative. To this end, we develop an algorithm that achieves a near-optimal runtime for tall matrices with $m = \mathrm{poly}(n)$ and a subcubic runtime for square matrices, while maintaining the same approximation guarantees as Larsen's algorithm, up to constant factors. We state our main result in the following theorem.

Theorem 1.3 (Main result, informal version of Theorem D.1 and Theorem D.2). For any parameter $a \in [0, 1]$, there is a randomized algorithm that, given a real matrix $A \in \mathbb{R}^{m \times n}$, finds a coloring $x \in \{-1, +1\}^n$ such that $\mathrm{disc}(A, x) = O(\mathrm{herdisc}(A) \log n \log^{1.5} m)$. Moreover, it runs in time $\widetilde{O}(\mathrm{nnz}(A) + n^{\omega} + n^{2+a} + n^{1+\omega(1,1,a)-a})$.

Here, $\omega(a, b, c)$ denotes the exponent for multiplying an $n^a \times n^b$ matrix with an $n^b \times n^c$ matrix, and $\omega := \omega(1, 1, 1)$ denotes the exponent of fast matrix multiplication (FMM).

Remark 1.4. With the current value of $\omega \approx 2.371$, according to Table 1 in Alman et al. (2025), we choose the parameter $a \approx 0.53$ to balance the terms $n^{2+a}$ and $n^{1+\omega(1,1,a)-a}$. Consequently, the running time in Theorem 1.3 simplifies to $\widetilde{O}(\mathrm{nnz}(A) + n^{2.53})$. Without FMM, our algorithm runs in $\widetilde{O}(\mathrm{nnz}(A) + n^3)$ time and is purely combinatorial.

Let $\alpha$ denote the dual exponent of matrix multiplication, i.e., the largest value with $\omega(1, 1, \alpha) = 2$, and let $a \in [0, 1]$ be the tunable parameter of Theorem 1.3. Currently, $\alpha \approx 0.32$ (Williams et al., 2024). Note that the running time of our algorithm (when using FMM) has a tradeoff between the additive terms $n^{2+a}$ and $n^{1+\omega(1,1,a)-a}$, so it is never better than $n^{2.5}$, as it is never beneficial to set $a < 1/2$. This tradeoff also means that our runtime is barely sensitive to future improvements in the value of the dual exponent (even $\alpha \to 1$ would improve the exponent of the additive term by merely 0.03). Curiously, a similar phenomenon occurs in recent FMM-based LP solvers (Cohen et al., 2019; Jiang et al., 2021), dynamic attention (Brand et al., 2024), and weight pruning (Li et al., 2024) in large language models.

A central technical component of Theorem 1.3 is the following theorem, which allows us to quickly find a hereditary projection matrix (i.e., a subspace such that the projection of a constant fraction of the rows of $A$ onto its orthogonal complement has small $\ell_2$-norm), and is of independent interest in randomized linear algebra. We state it as follows.

Theorem 1.5 (Fast hereditary projection, informal version of Theorem C.2 and Theorem C.3). Let $A \in \mathbb{R}^{m \times n}$ with $m \geq n$, and let $d = n/4$. There is a randomized algorithm that outputs a $d \times n$ matrix $V$ such that with probability $1 - \delta$, the following hold simultaneously:
- For all $l \in [d]$, we have $\|V_{l,*}\|_2 = 1$;
- It holds that $\max_{j \in [m]} \|(A(I - V^{\top}V))_{j,*}\|_2 = O(\mathrm{herdisc}(A)\log(m/n))$,
where $V_{l,*}$ denotes the $l$-th row of $V$ for any $l \in [d]$. Moreover, it runs in time $\widetilde{O}(\mathrm{nnz}(A) + n^{\omega})$, where the $\widetilde{O}$-notation hides a $\log(m/\delta)$ factor.

Theorem 1.5 directly improves Theorem 3 in Larsen (2023), which runs in $\widetilde{O}(mn^2)$ time without using FMM, and in $O(mn^{\omega-1})$ time using FMM.
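As a sanity check on the guarantee of Theorem 1.5, the quantity it controls can be evaluated directly; below is a minimal numpy sketch (the function name is ours) computing $\max_{j \in [m]} \|(A(I - V^{\top}V))_{j,*}\|_2$ for a row-orthonormal $V$, using the factored form rather than materializing $I - V^{\top}V$:

```python
import numpy as np

def max_residual_row_norm(A: np.ndarray, V: np.ndarray) -> float:
    """Max l2-norm of the rows of A(I - V^T V), for row-orthonormal V (d x n)."""
    # Compute A - (A V^T) V instead of forming I - V^T V explicitly:
    # this costs O(nnz(A)*d + m*d*n) memory-light work, cheaper for small d.
    residual = A - (A @ V.T) @ V
    return float(np.max(np.linalg.norm(residual, axis=1)))
```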
In either case, Larsen's algorithm pays at least $mn^{\omega-1}$ time, even when $A$ is sparse (see more discussion in Section 2.1). Theorem 1.5 is in some sense best possible: reading the input matrix $A$ requires $O(\mathrm{nnz}(A))$ time, and explicitly computing the projection matrix $P = V^{\top}V$ in Theorem 1.5 requires $n^{\omega}$ time. We note that, in the case of binary matrices, the projection-free algorithm of Jain et al. (2023) avoids this bottleneck (and hence the $n^{\omega}$ term), but for real-valued matrices, all known discrepancy algorithms involve projections (Bansal, 2010; Lovett & Meka, 2015; Larsen, 2023; Rothvoss, 2017).

1.2. Related Work

Algorithmic discrepancy theory. Constructive discrepancy theory has become a pivotal area of research, focusing on efficiently finding low-discrepancy solutions. Bansal's seminal work introduced a semidefinite programming (SDP)-based algorithm that achieves additive $O(\sqrt{n})$ discrepancy (Bansal, 2010), matching the non-constructive bounds but incurring a significant computational cost of $O(mn^{4.5})$. Subsequent contributions, such as Lovett & Meka (2015) and Eldan & Singh (2018), developed faster algorithms leveraging random walks and convex-optimization techniques, respectively. Recent advances include Larsen's combinatorial approach (Larsen, 2023), which attains a runtime of $O(mn^2 + n^3)$, and the near-input-sparsity algorithms for binary matrices proposed by Jain et al. (2023). Furthermore, Jambulapati et al. (2024) introduced an efficient framework for computing discrepancy, recovering Spencer's result in $\widetilde{O}(\mathrm{nnz}(A)\log^5 n)$ time for matrices $A$ with entries bounded by 1. These innovations have substantially narrowed the computational gap between theoretical results and practical applications, advancing the field's capabilities within polynomial time. For further details, we refer readers to related works (Cohen, 2016b; Bansal et al., 2018; 2019; Dadush et al., 2018; Alweiss et al., 2021; Bansal et al., 2020; 2022; Pesenti & Vladu, 2023; Jambulapati et al., 2024). Most recently, Han et al. (2025) used discrepancy theory to approximate attention computation in the streaming model.

Sketching and leverage-score sampling. Sketching is a versatile technique employed across numerous fundamental problems, including linear programming (Jiang et al., 2021; Brand et al., 2020; Song & Yu, 2021; Liu et al., 2023), empirical risk minimization (Lee et al., 2019; Qin et al., 2023; Gu et al., 2025), and semidefinite programming (Jiang et al., 2020a; Huang et al., 2022a; Song et al., 2023b). It is particularly prominent in randomized linear algebra, where it has been applied to a wide range of tasks (Clarkson & Woodruff, 2013; Nelson & Nguyễn, 2013; Razenshteyn et al., 2016; Boutsidis et al., 2016; Song et al., 2017; Xiao et al., 2018; Song et al., 2019b; Lee et al., 2019; Jiang et al., 2021; Song & Yu, 2021; Brand et al., 2021; Song et al., 2022a; Hu et al., 2022; Gu & Song, 2022). Sketching frequently serves as an effective tool for oblivious dimension reduction (Clarkson & Woodruff, 2013; Nelson & Nguyễn, 2013). The use of sampling matrices to enhance computational efficiency is a well-established approach in numerical linear algebra (see Clarkson & Woodruff (2013); Razenshteyn et al. (2016); Boutsidis et al. (2016); Song et al. (2017); Cohen et al. (2019); Song et al. (2019b); Brand et al. (2020); Li et al. (2023); Gao et al. (2023); Deng et al. (2023)).
In this paper, we employ leverage-score sampling as a non-oblivious dimension reduction method, in line with previous works such as Spielman & Srivastava (2011); Batson et al. (2012); Zhang (2022); Song et al. (2022b).

2. Technical Overview

In this section, we give an overview of the techniques used to prove the main results; the formal proofs are in the Appendix. In Section 2.1, we first give an overview of Larsen's algorithm and discuss its barriers. In Section 2.2, we introduce the techniques we use to improve Larsen's algorithm and overcome these barriers.

2.1. Overview and Barriers of Larsen's Algorithm

Larsen's algorithm (Larsen, 2023) is a clever re-implementation of the iterated partial-coloring subroutine of Lovett & Meka (2015): in each iteration, with constant probability, this subroutine rounds at least half of the coordinates of a fractional coloring $x \in \mathbb{R}^n$ to $\{-1, 1\}$. As in Lovett & Meka (2015), this is done by performing a random walk in the orthogonal complement of a subspace $V$ spanned by a set of rows from $A$ (a.k.a. Edge-Walk (Lovett & Meka, 2015)). The key idea in Larsen (2023) lies in a clever choice of $V$: using a connection between the eigenvalues of $A^{\top}A$ and $\mathrm{herdisc}(A)$ (Larsen, 2017), Larsen shows that there is a subspace $V$ spanned by $n/4$ rows of $A$ such that the projection of $A$ onto the complement $V^{\perp}$ has small $\ell_2$ norm, i.e., every row of $A(I - V^{\top}V)$ has norm less than $O(\mathrm{herdisc}(A)\log(m/n))$. Larsen shows that computing this projection operator (henceforth $B$) can be done in $O(mn^2)$ time combinatorially, or in $\mathcal{T}_{\mathrm{mat}}(m, n, n) = \widetilde{O}(mn^{\omega-1})$ time³ using FMM.

³$\mathcal{T}_{\mathrm{mat}}(m, n, k)$ represents the time required to multiply an $m \times n$ matrix by an $n \times k$ matrix.

The second part (PARTIALCOLORING) of Larsen's algorithm repeatedly applies the above subroutine (PROJECTTOSMALLROWS) to implement the Edge-Walk of (Lovett & Meka, 2015): starting from a partial coloring $x \in \mathbb{R}^n$, the algorithm first generates a subspace $V_1$ by calling the aforementioned PROJECTTOSMALLROWS subroutine. It then samples a fresh random Gaussian vector $g_t$ in each iteration $t \in [N]$, $N = O(n)$, and projects it to obtain $\tilde{g}_t = (I - V_t^{\top}V_t)g_t$. The algorithm then decides whether or not to update $V$; more precisely, it maintains a vector $u_t$ and gradually updates it to account for large entries of $x + u_{t+1}$ that have reached 1 in absolute value (as in Lovett & Meka (2015)). When a coordinate $i \in [n]$ reaches this threshold at some iteration, the corresponding unit vector $e_i$ is added to $V_t$, and $V_t$ is updated to $V_{t+1}$. This update is necessary to ensure that in future iterations, no amount will be added to the $i$-th entry of $u_t$. At the end of the loop, the algorithm outputs a vector $x_{\mathrm{new}} = x + u_{N+1}$ with the property that for each row $a_i$ of $A$, the difference between $\langle a_i, x \rangle$ and $\langle a_i, x_{\mathrm{new}} \rangle$ is less than $O(\mathrm{herdisc}(A)\log^{1.5} m)$. Since $V_t$ is constantly changing throughout the iterations, each iteration requires an online matrix-vector multiplication $(I - V_t^{\top}V_t)g_t$. Hence, since there are $N = O(n)$ iterations, $O(mn^2 + n^3)$ time is required to implement this part, even if fast matrix multiplication is allowed. Indeed, the Online Matrix-Vector (OMv) conjecture (Henzinger et al., 2015) postulates that $mn^2$ is essentially best possible for such an online problem. Beating this barrier for the PARTIALCOLORING subroutine therefore requires somehow avoiding this online problem, as we discuss in the next subsection.
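For intuition, the iterated partial-coloring loop described above can be sketched as follows (a simplified illustration in numpy; the step-size schedule, the row constraints $a_i$, and Larsen's failure checks are omitted, so this is not the paper's algorithm verbatim):

```python
import numpy as np

def partial_coloring_sketch(V: np.ndarray, x: np.ndarray, n_iters: int,
                            step: float, rng: np.random.Generator) -> np.ndarray:
    """Edge-Walk style random walk in the complement of the row span of V."""
    n = x.shape[0]
    u = np.zeros(n)
    for _ in range(n_iters):
        g = rng.standard_normal(n)
        g_proj = g - V.T @ (V @ g)          # project onto Span(V)^perp
        u_next = u + step * g_proj
        # Freeze coordinates that just reached +-1 by adding e_i to V.
        for i in np.flatnonzero((np.abs(x + u_next) >= 1) & (np.abs(x + u) < 1)):
            e_i = np.zeros(n)
            e_i[i] = 1.0
            w = e_i - V.T @ (V @ e_i)       # orthogonalize e_i against V
            if np.linalg.norm(w) > 1e-12:
                V = np.vstack([V, w / np.linalg.norm(w)])
        u = u_next
    return np.clip(x + u, -1.0, 1.0)
```

Once a coordinate is frozen, future projected Gaussians are orthogonal to $e_i$, so that coordinate no longer moves; this is the mechanism the data structures below must support efficiently.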
Below we summarize the computational bottlenecks in Larsen's algorithm, and then explain the new ideas required to overcome them and achieve the claimed overall runtime of Theorem 1.3. Implementing the first part of Larsen's algorithm (PROJECTTOSMALLROWS) incurs the following computational bottlenecks, which we overcome in Theorem 1.5:

Barrier 1. Computing the projection matrix $B_t = A(I - V_t^{\top}V_t) \in \mathbb{R}^{m \times n}$ explicitly (exactly) already takes $\mathcal{T}_{\mathrm{mat}}(m, n, n)$ time.⁴

Barrier 2. Computing the $j$-th row's norm $\|e_j^{\top}B_t\|_2$ requires $\mathcal{T}_{\mathrm{mat}}(m, n, n)$ time.

Barrier 3. Computing $B_t^{\top}B_t \in \mathbb{R}^{n \times n}$ requires multiplying an $n \times m$ matrix with an $m \times n$ matrix, which also takes $\mathcal{T}_{\mathrm{mat}}(n, m, n)$ time.

Implementing the second part of Larsen's algorithm (PARTIALCOLORING) incurs the following two (main) computational bottlenecks:

Barrier 4. The coloring algorithm requires computing $\eta = \max_{j \in [m]} \|e_j^{\top}A(I - V^{\top}V)\|_2$, which takes $\mathcal{T}_{\mathrm{mat}}(m, n, n)$ time.

Barrier 5. In each of the $N = O(n)$ iterations, the algorithm first chooses a Gaussian vector $g \sim \mathcal{N}(0, I_d)$ and projects it onto the orthogonal complement of the span of $V_t$, i.e., $g^{\perp V_t} = (I - V_t^{\top}V_t)g$. Next, it finds a rescaling factor for $g$ by solving a single-variable maximization problem.⁵ Finally, the algorithm checks whether $|\langle a_j, v + g \rangle| \geq \tau$ while $|\langle a_j, v \rangle| < \tau$ for each $j \in [m]$. The overall runtime is therefore $mn^2$.

⁴We remark that $\mathcal{T}_{\mathrm{mat}}(m, n, n) = O(mn^2)$ without using FMM, and $\mathcal{T}_{\mathrm{mat}}(m, n, n) = O(mn^{\omega-1})$ with FMM.

⁵The naive computation here would take $O(n^2)$ time. Later we show how to do it in $O(n)$ time.

As mentioned above, the last step is conceptually challenging, as it must be done adaptively ($V_t$ is being updated throughout the iterations), and cannot be batched via FMM (assuming the OMv Conjecture (Henzinger et al., 2015)⁶). We circumvent this step by slightly modifying the algorithm and analysis of Larsen, and using the fact that, while the subspace $V_t$ is changing throughout the iterations, the projected random Gaussian vectors $\tilde{g}_t$ themselves are independent. We show this enables batching the projections and then performing low-rank corrections as needed in the last step. We now turn to explain our technical approach and the main ideas for overcoming these computational bottlenecks.

⁶The study of OMv originates from (Henzinger et al., 2015); (Larsen & Williams, 2017; Chakraborty et al., 2018) provide surprising upper bounds for the problem.

2.2. Our Techniques

A natural approach to accelerate the first part of the above algorithm is to use linear sketching techniques (Clarkson & Woodruff, 2013), as they enable working with much smaller matrices in the aforementioned steps. However, sketching techniques naturally introduce (spectral) error into the algorithm, which is exacerbated in iterative algorithms. Indeed, a nontrivial challenge is showing that Larsen's algorithm can be made robust to noise, which requires modifying his analysis in several parts of the algorithm. The main technical obstacle (not present in vanilla sketch-and-solve problems such as linear-regression, low-rank, tensor, and inverse problems (Clarkson & Woodruff, 2013; Nelson & Nguyễn, 2013; Razenshteyn et al., 2016; Song et al., 2017; 2019a; Lee et al., 2019; Jiang et al., 2021)) is that we can never afford to explicitly store the projection matrix $B_t$, as even writing it down would already require $O(mn)$ time. This constraint makes it more challenging to apply non-oblivious sketching tools, in particular approximate leverage-score sampling (LSS, (Spielman & Srivastava, 2011)), which is key to our algorithm. Breaking the $n^3$ runtime of Larsen's partial-coloring algorithm (which is important for near-square matrices, $m \approx n$) requires a different idea, as this bottleneck stems from the Online Matrix-Vector Conjecture (Henzinger et al., 2015).
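To illustrate the storage constraint just described: $B_t$ never needs to be formed, because it can always be applied in factored form; a short numpy sketch (names ours):

```python
import numpy as np

def apply_Bt(A: np.ndarray, V: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Apply B_t = A(I - V^T V) to a vector y without forming B_t.

    Forming B_t explicitly costs Theta(mn) memory and Tmat(m, n, n) time;
    the factored form costs O(nnz(A) + dn) per vector, where d = #rows of V.
    """
    z = y - V.T @ (V @ y)   # (I - V^T V) y
    return A @ z
```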
To circumvent this bottleneck, we modify the analysis and implementation of the Edge-Walk subroutine (dropping certain verification steps via concentration arguments), and then design a guess-and-correct data structure (inspired by lookahead algorithms (Brand et al., 2019)) that batches the prescribed Gaussian projections, with low-rank amortized corrections. We now turn to formalize these three main ideas.

2.2.1. ROBUST ANALYSIS OF LARSEN'S ALGORITHM

Approximate norm-estimation suffices. The original algorithm (Larsen, 2023) explicitly calculates the exact norm of each row of $B_t = A(I - V_t^{\top}V_t)$. Since the JL lemma guarantees $\|e_j^{\top}\widehat{B}_t\|_2 = (1 \pm \epsilon_0)\|e_j^{\top}B_t\|_2$ for all $j \in [m]$, except with polynomially small probability $\delta_0$ (as the sketch dimension is logarithmic in $1/\delta_0$), it is not hard to show that, even though our approximation could potentially miss some heavy rows, this has a minor effect on the correctness of the algorithm: the rows our algorithm selects will be larger than $(1 - \epsilon_0)$ times a pre-specified threshold, and the ones that are not chosen will be smaller than $(1 + \epsilon_0)$ times the threshold. This allows us to set $\epsilon_0 = \Omega(1)$.

Approximate SVD suffices. Recall that in the PROJECTTOSMALLROWS algorithm, expanding the subspace $V$ iteratively requires computing $\mathrm{SVD}(B_t^{\top}B_t)$. Since an exact SVD is too costly for us, we wish to maintain an approximate SVD instead. Even though this may result in a substantially different eigen-spectrum, we observe that a spectral approximation of $\mathrm{SVD}(B_t^{\top}B_t)$ suffices for this subroutine, since we only need the eigenvectors to be: (i) orthogonal to the row space of $A(I - V_t^{\top}V_t)$; (ii) orthonormal to each other. Our correctness analysis shows that an $\epsilon_B = \Theta(1)$ spectral approximation will preserve (up to a constant factor) the row-norm guarantee of PROJECTTOSMALLROWS.

2.2.2. OVERCOMING THE BARRIERS

Speeding up the hereditary-projection step. In the PROJECTTOSMALLROWS subroutine in Larsen (2023), the matrix $B_t$ is used for (i) detecting rows with the largest norms; (ii) extracting the largest rows to generate a matrix $B$; and (iii) computing the eigenvectors of $B^{\top}B$. To optimize this process and reduce computational overhead, we avoid the explicit representation of the matrix $B_t$ by substituting it with a product of appropriately chosen sketching matrices. Given the aforementioned robustness guarantees, the barriers in the first step can be bypassed straightforwardly by using a JL sketch. Specifically, we utilize a random matrix $R \in \mathbb{R}^{n \times O(\epsilon_0^{-2}\log(1/\delta_0))}$ to obtain the compressed matrix $\widehat{B}_t := A(I - V_t^{\top}V_t)R$. Similarly, detecting rows with large $\ell_2$ norm can be done in the sketched subspace, since we can preserve norms up to constants by choosing $\epsilon_0 = \Theta(1)$ and $\delta_0 = \delta/\mathrm{poly}(m, n)$, which reduces the time for querying row norms from $O(mn^2)$ to $\widetilde{O}(\mathrm{nnz}(A) + n^{\omega})$.

Implicit leverage-score sampling. To address the hardness of computing $B_t^{\top}B_t$ (overcoming Barrier 3), we use a robust analysis to ensure that approximating the top-$k$ SVD of $B_t^{\top}B_t$ (where $k \approx n/\log(m/n)$) using leverage-score sampling (Drineas et al., 2012; Clarkson & Woodruff, 2013; Nelson & Nguyễn, 2013) preserves the algorithm's correctness. However, we lack explicit access to the input matrix $B_t$, so we must perform implicit leverage-score sampling w.r.t. $B_t$ in $\mathrm{nnz}(A)$ time.
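Concretely, the sketched row-norm query described above looks as follows (a minimal numpy sketch with a dense Gaussian JL matrix standing in for the paper's sketching matrix; names and the constant in the sketch dimension are illustrative):

```python
import numpy as np

def sketched_row_norms(A, V, eps: float, delta: float,
                       rng: np.random.Generator) -> np.ndarray:
    """Approximate row norms of B = A(I - V^T V) via a JL sketch, B never formed."""
    n = A.shape[1]
    k = int(np.ceil(eps ** -2 * np.log(1.0 / delta)))
    R = rng.standard_normal((n, k)) / np.sqrt(k)   # JL matrix, n x k
    PR = R - V.T @ (V @ R)                         # (I - V^T V) R, n x k
    B_hat = A @ PR                                 # ~nnz(A) * k time for sparse A
    return np.linalg.norm(B_hat, axis=1)           # ~= (1 +- eps) * ||B_j||_2
```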
We propose IMPLICITLEVERAGESCORE (Algorithm 9), which takes $A$ and an orthonormal basis $V$, and generates a sparse embedding matrix $S_1$ to produce a compressed matrix $M$, whose QR factorization gives $R$. Using another sparse embedding matrix $S_2$, we calculate the compressed matrix $N$, which is used to compute the approximate leverage scores. This allows us to carry out the LSS lemma without computing $B_t$. With IMPLICITLEVERAGESCORE, we can generate a diagonal sampling matrix $\widetilde{D}$ in $\widetilde{O}(\mathrm{nnz}(A) + n^{\omega})$ time. Using this subroutine to calculate $\widetilde{B}_t = \widetilde{D}_t B_t$ yields an approximation of the SVD of $B_t^{\top}B_t$. Finally, we prove that adding the eigenvectors of $\widetilde{B}_t^{\top}\widetilde{B}_t$, instead of those of $B_t^{\top}B_t$, to $V$ will still satisfy the required oversampling prerequisite for each row in $B$.

[Figure 1: the sketched matrix $\widehat{B} = A(I - V^{\top}V)R$.] Figure 1. We use this sketching technique to reduce the time cost of computing the row norms. $\widehat{B} \in \mathbb{R}^{m \times \widetilde{O}(\epsilon_0^{-2})}$ is our sketched matrix, $A \in \mathbb{R}^{m \times n}$ is the original data matrix, $I \in \mathbb{R}^{n \times n}$ is the identity matrix, $V^{\top}V \in \mathbb{R}^{n \times n}$ is the projection matrix onto the row span of $V$, and $R \in \mathbb{R}^{n \times \widetilde{O}(\epsilon_0^{-2})}$ is our JL sketching matrix. After sketching, we can quickly query the row norms of $\widehat{B}$, which are close to the row norms of $B$ up to accuracy $\epsilon_0$; we select the rows of $B$ using these approximate norms. For details of the selection operation, see Figure 2 and Figure 3.

2.2.3. BEATING THE CUBIC BARRIER

Where does the $n^3$ barrier arise from? Recall that, in order to simulate the Edge-Walk process, the PARTIALCOLORING algorithm generates, in each iteration, a Gaussian vector $g$ and projects it onto the complement of $\mathrm{Span}(V)$. This requires a generic matrix-vector product $(I - V_t^{\top}V_t)g$, which takes $n^2$ time. Since $V_t$ is dynamically changing throughout the iterations (depending on whether $|x_i + v_i + g^{\perp V}_i| \geq 1$ and $|x_i + v_i| < 1$ are satisfied or not for each $i$), and there are $N = O(n)$ iterations, the OMv Conjecture (Henzinger et al., 2015) generally implies an $\Omega(n^3)$ runtime for implementing this iterative loop (note that each $e_i$ can be added to $V$ at most once, so $O(n)$ iterations indeed suffice). Nonetheless, we show how to re-implement PARTIALCOLORING using a lookahead (guess-and-correct) data structure, which combines FMM with low-rank corrections. We now describe the main ingredients of this data structure.

[Figure 2: (a) Larsen's selection matrix $D_t$ with $\|D_t\|_0 = m_t$; (b) our subsample matrix $\widetilde{D}_t$ with $\|\widetilde{D}_t\|_0 = \widetilde{O}(\epsilon_B^{-2} n)$.] Figure 2. This figure shows the difference in row selection. Figure (a): Larsen's algorithm explicitly selects the $m_t$ largest rows from $B_t = A(I - V_t^{\top}V_t)$ and computes the eigenvectors of $B^{\top}B$, whose time cost is expensive. Figure (b): in our design, we use a subsample matrix $\widetilde{D}_t$ to generate the matrix $\widetilde{B}_t$, which has eigenvalues close to those of $B_t$. We use this subsampling technique to reduce the time cost of computing the SVD; with it, we can compute the eigenvectors fast.

Precomputing Gaussian projections. To overcome the $n^2$ running time for computing the Gaussian projection of $g$ in each iteration, we add a preprocessing phase (the INIT subroutine) to the PARTIALCOLORING iterative algorithm, which generates in advance a Gaussian matrix $G \in \mathbb{R}^{n \times N}$ and stores the projection of every column of $G$ with respect to the row space of $V$, from which the projections $\{g^{\perp V_t}\}_{g \in G}$ can be read off. We also design a QUERY procedure, which outputs any desired projected vector $\tilde{g} = (I - V_t^{\top}V_t)g$ on demand.
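The following minimal numpy sketch mirrors the structure of IMPLICITLEVERAGESCORE as described above, with dense Gaussian sketches standing in for the sparse embeddings $S_1, S_2$ (the sketch sizes and names here are illustrative assumptions, not the paper's parameters):

```python
import numpy as np

def implicit_leverage_scores(A, V, rng: np.random.Generator,
                             s1: int = None, s2: int = None) -> np.ndarray:
    """Approximate leverage scores of B = A(I - V^T V) without forming B.

    S1 @ B yields (via QR) a change of basis R; row norms of B @ R^{-1} @ S2
    approximate the leverage scores, which define the sampling matrix D.
    """
    m, n = A.shape
    s1 = s1 or 2 * n                                # illustrative sketch sizes
    s2 = s2 or max(8, int(20 * np.log(m + 1)))
    S1 = rng.standard_normal((s1, m)) / np.sqrt(s1)
    S1A = S1 @ A
    SB = S1A - (S1A @ V.T) @ V                      # S1 @ B, never forming B
    _, R = np.linalg.qr(SB)                         # assumes B has full column rank
    S2 = rng.standard_normal((n, s2)) / np.sqrt(s2)
    X = np.linalg.solve(R, S2)                      # R^{-1} S2, n x s2
    N = A @ (X - V.T @ (V @ X))                     # B R^{-1} S2, m x s2
    scores = np.sum(N * N, axis=1)                  # ~ leverage scores of B's rows
    return scores / scores.sum()                    # sampling probabilities for D
```

Rows are then sampled with probabilities proportional to these scores (with appropriate oversampling and rescaling) to build the diagonal matrix $\widetilde{D}$.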
Since we do not know a priori whether the subspace $V_t$ will change (and which coordinates $e_i$ will be added to $V_t$), we use a guess-and-correct approach: we guess the batch of Gaussian projections, and then, in iteration $t$, we perform low-rank corrections via the UPDATE and QUERY procedures of our data structure. This idea is elaborated in more detail below.

Lazy updates for the past rank-1 sequence. Updating all the projections $\tilde{g}$ stored in the data structure would result in a prohibitively expensive runtime. Instead, we use the idea of lazy updates: we divide the columns of $G$ into batches, each of size $K$. We also use a counter $\tau_u$ to denote which batch is currently in use, and initialize two counters $k_q$ and $k_u$ to record the number of times QUERY and UPDATE were called, respectively. Every time we call QUERY or UPDATE, we increment the corresponding counter by 1. When either $k_q$ or $k_u$ reaches the threshold $K$, we RESTART the process: we fold the accumulated updates into the subspace, perform some clean-up operations for future iterations (adding in the vectors that are not present yet), and compute a new batch of $\tilde{g}$'s for future use. This procedure runs in $O(\mathcal{T}_{\mathrm{mat}}(n, K, n))$ time, which it contributes to the RESTART procedure.

Now, recall that at the beginning of the PARTIALCOLORING algorithm, when we initialize the data structure, we compute the first $K$ projections $\tilde{g}_1, \ldots, \tilde{g}_K$. Then we enter the iteration. Recall that in INIT, we generate a matrix $P$ defined to be $V^{\top}V$ for the input matrix $V$, and we precompute a batch of projections with respect to the row space of $V$ (i.e., $\widetilde{G} = P G_{*,S}$, where $S = [K]$ at the beginning). Moreover, if a row is added to $V$ while the algorithm is running, we first compute its component orthogonal to the row space of $V$, rescale this vector to unit norm, and call it $w$; we maintain at most $K$ such $w$'s. Then, during the run of the algorithm, when we call QUERY, we simply select the corresponding column of $\widetilde{G}$, denoted $\tilde{g}$, and output the vector $g - \tilde{g} - \sum_{i=1}^{k_u} w_i w_i^{\top} g$. Recall that the number of $g$'s we precompute and the number of $w$'s we maintain are both limited to be at most $K$; when one of them reaches the limit, we call the RESTART procedure. When this happens, the QUERY or UPDATE call takes $\mathcal{T}_{\mathrm{mat}}(K, n, n)$ time. Thus, by applying this lazy-update idea, we finally reach a subcubic running time. The final runtime of PARTIALCOLORING using our data structure is therefore a tradeoff based on the choice of the batch size $K$.

[Figure 3: (a) Larsen's matrix row selection; (b) our matrix row selection.] Figure 3. This figure shows the operation at the $t$-th iteration of the projection algorithm; (a) shows Larsen's idea and (b) shows ours. Figure (a): Larsen's algorithm selects the $m_t$ largest rows from the matrix $B_t = A(I - V_{t-1}^{\top}V_{t-1})$; the rows not selected then have norm less than the threshold $C_0 T \cdot \mathrm{herdisc}(A)$, but the time cost of the row-norm computation is expensive. Figure (b): we compute the sketched matrix $\widehat{B}_t = A(I - V_{t-1}^{\top}V_{t-1})R$, where $R$ is a JL matrix, and select the rows based on the approximate norms, significantly reducing the time cost of the norm computation. We show that, under our setting, the rows not selected satisfy the slightly weaker guarantee $(1 + \epsilon_0) C_0 T \cdot \mathrm{herdisc}(A)$; this constant-factor loss still keeps our algorithm correct. (Since our row-norm computation is approximate, there is a constant-factor loss affecting the selection; in the figure, darker red marks the truly largest rows that were not selected.)
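A simplified, self-contained variant of this data structure is sketched below (numpy; class and method names are ours, and the batch here stores $(I - V^{\top}V)G$ directly rather than $V^{\top}VG$, which is equivalent up to the output formula):

```python
import numpy as np

class LazyProjector:
    """Batched Gaussian projections with lazy rank-1 corrections (simplified)."""

    def __init__(self, V: np.ndarray, K: int, n: int, rng: np.random.Generator):
        self.V, self.K, self.n, self.rng = V.copy(), K, n, rng
        self.W = []                 # pending unit directions w, orthonormal to V
        self._restart()

    def _restart(self):             # Tmat(n, K, n) work, amortized over K calls
        for w in self.W:            # fold pending directions into the subspace
            self.V = np.vstack([self.V, w])
        self.W = []
        G = self.rng.standard_normal((self.n, self.K))
        self.G_tilde = G - self.V.T @ (self.V @ G)   # fresh batch (I - V^T V) G
        self.kq = 0

    def update(self, i: int):       # coordinate i hit +-1: record e_i's residual
        w = np.zeros(self.n)
        w[i] = 1.0
        w -= self.V.T @ (self.V @ w)
        for u in self.W:
            w -= (u @ w) * u
        nw = np.linalg.norm(w)
        if nw > 1e-12:
            self.W.append(w / nw)
        if len(self.W) >= self.K:
            self._restart()

    def query(self) -> np.ndarray:  # next projected Gaussian, rank-1 corrected
        g = self.G_tilde[:, self.kq].copy()
        for w in self.W:            # since w is orthogonal to V's rows,
            g -= (w @ g) * w        # w^T g_tilde equals w^T g
        self.kq += 1
        if self.kq >= self.K:
            self._restart()
        return g
```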
Denoting $K = n^a$, we get a total runtime of $\widetilde{O}(\mathrm{nnz}(A) + n^{\omega} + n^{2+a} + n^{1+\omega(1,1,a)-a})$. For the current upper bounds on the fast rectangular matrix multiplication exponent $\omega(\cdot, \cdot, \cdot)$ (Alman et al., 2025), setting $a = 0.53$ yields an optimal overall runtime of $\widetilde{O}(\mathrm{nnz}(A) + n^{2.53})$. We note that even an ideal value of $\omega$ would not improve this result by much, as it would merely enable setting $a = 0.5$, which in turn would translate into an $n^{2.5}$-time algorithm; this is the limit of our approach.

Faster iterative coloring. Recall that, with the approximate small-projection matrix in hand, we still need to overcome Barriers 4 and 5 above. Computing the heaviest row $\eta := \max_{j \in [m]} \|a_j^{\top}(I - V^{\top}V)\|_2$ (Barrier 4) can again be done (approximately) via the JL sketch (as in the first step), by computing
$$\hat{\eta} := \max_{j \in [m]} \|a_j^{\top}(I - V^{\top}V)R\|_2$$
using the JL matrix $R$. Since $R$ has only $O(\epsilon_1^{-2}\log(1/\delta_1))$ columns (where choosing $\epsilon_1 = \Theta(1)$ and $\delta_1 = \delta/\mathrm{poly}(mn)$ is sufficient), this step reduces from $\mathcal{T}_{\mathrm{mat}}(m, n, n)$ time to $\widetilde{O}(n^2)$.

[Figure 4: flowcharts of (a) Larsen's partial-coloring algorithm and (b) our fast partial-coloring algorithm.] Figure 4. The flowcharts show the different designs of the partial-coloring algorithm of Larsen (2023) and ours. In figure (a), the blue blocks are the steps causing the main time cost. In figure (b), the red blocks are the steps we design to overcome those time costs. Here $g \in \mathbb{R}^n$ is the newly generated Gaussian vector and $\tilde{g} := (I - V^{\top}V)g$ is the projected vector.

As discussed earlier, Barrier 5 is a different ballgame and the major obstruction to speeding up Larsen's algorithm, since the verification step
$$\bigwedge_{j=1}^{m} \Big( |\langle a_j, v + g \rangle| \leq \tau \ \text{ whenever } \ |\langle a_j, v \rangle| < \tau \Big) \qquad (1)$$
needs to be performed adaptively in each of the $n$ iterations, which appears impossible to perform in less than $mn^2$ total time given the OMv Conjecture (Henzinger et al., 2015). Fortunately, it turns out we can avoid this verification step using a slight change in the analysis of Larsen (2023): Larsen's analysis shows that the event $E_{\tau}$ in Eq. (1) happens with constant probability. By slightly increasing the threshold to $\tau = \tau(\delta)$, we can ensure the event $E_{\tau}$ happens with probability $1 - \delta$. Setting $\delta$ to a small enough constant so as not to affect the other parts of the algorithm, we can avoid this verification step altogether. At this point, the entire algorithm can be boosted to ensure a high probability of success. Combining these ideas yields a (combinatorial) coloring algorithm that runs in $\widetilde{O}(\mathrm{nnz}(A) + n^3)$; overcoming the remaining cubic term $n^3$, which is important for the near-square case ($m \approx n$), is precisely the role of the lookahead data structure described in Section 2.2.3.
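In summary, with batch size $K = n^a$, the runtime of Theorem 1.3 is governed by the following balancing condition (restating Remark 1.4):

```latex
\underbrace{\widetilde{O}\big(\mathrm{nnz}(A) + n^{\omega} + n^{2+a} + n^{1+\omega(1,1,a)-a}\big)}_{\text{total runtime}},
\qquad \text{balanced when } \; 2 + a \;=\; 1 + \omega(1,1,a) - a .
% With the rectangular-FMM bounds of Alman et al. (2025), this gives a ~ 0.53,
% i.e., n^{2+a} ~ n^{1+\omega(1,1,a)-a} ~ n^{2.53}.
```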
3. Conclusion

In this work, we address the longstanding challenge of discrepancy minimization for real-valued matrices, with a focus on achieving input-sparsity-time algorithms. By introducing novel algorithmic components such as implicit leverage-score sampling and lazy updates, our combinatorial algorithm achieves a runtime of $\widetilde{O}(\mathrm{nnz}(A) + n^3)$, and an FMM-based variant breaks the cubic barrier to reach $\widetilde{O}(\mathrm{nnz}(A) + n^{2.53})$. We significantly improve the computational efficiency while matching the approximation guarantees of existing methods. Our approach not only introduces novel algorithmic components but also demonstrates the potential of combining sketching and fast matrix multiplication in advancing computational geometry and optimization. We believe these techniques are of independent interest. Future work could extend these ideas to reduce the runtime of other problems in combinatorial optimization, or adapt the techniques to tackle problems in streaming and distributed models.

Acknowledgements

The authors would like to thank the anonymous reviewers of ICML 2025 for their highly insightful suggestions.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning and Optimization. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Allen-Zhu, Z. and Li, Y. Follow the compressed leader: faster online learning of eigenvectors and faster MMWU. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

Allen-Zhu, Z., Lee, Y. T., and Orecchia, L. Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2016.

Alman, J. and Williams, V. V. A refined laser method and faster matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 522–539. SIAM, 2021.

Alman, J., Duan, R., Williams, V. V., Xu, Y., Xu, Z., and Zhou, R. More asymmetry yields faster matrix multiplication. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM, 2025.

Alon, N., Matias, Y., and Szegedy, M. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing (STOC), pp. 20–29, 1996.

Alweiss, R., Liu, Y. P., and Sawhney, M. Discrepancy minimization via a self-balancing walk. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 14–20, 2021.

Anstreicher, K. M. The volumetric barrier for semidefinite programming. Mathematics of Operations Research, 2000.

Arora, S. and Kale, S. A combinatorial, primal-dual approach to semidefinite programs. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing (STOC), 2007.

Banaszczyk, W. Balancing vectors and Gaussian measures of n-dimensional convex bodies. Random Structures & Algorithms, 12(4):351–360, 1998.

Bansal, N. Constructive algorithms for discrepancy minimization. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science (FOCS), pp. 3–10. IEEE, 2010.

Bansal, N. Semidefinite optimization in discrepancy theory. Mathematical Programming, 134:5–22, 2012.

Bansal, N., Dadush, D., Garg, S., and Lovett, S. The Gram-Schmidt walk: a cure for the Banaszczyk blues. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 587–597, 2018.
Bansal, N., Dadush, D., and Garg, S. An algorithm for Komlós conjecture matching Banaszczyk's bound. SIAM Journal on Computing, 48(2):534–553, 2019.

Bansal, N., Jiang, H., Singla, S., and Sinha, M. Online vector balancing and geometric discrepancy. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 1139–1152, 2020.

Bansal, N., Laddha, A., and Vempala, S. S. A unified approach to discrepancy minimization. arXiv preprint arXiv:2205.01023, 2022.

Batson, J., Spielman, D. A., and Srivastava, N. Twice-Ramanujan sparsifiers. SIAM Journal on Computing, 41(6):1704–1721, 2012.

Bechavod, Y., Podimata, C., Wu, S., and Ziani, J. Information discrepancy in strategic learning. In International Conference on Machine Learning (ICML), pp. 1691–1715. PMLR, 2022.

Beck, J. and Fiala, T. "Integer-making" theorems. Discrete Applied Mathematics, 3(1):1–8, 1981.

Bläser, M. Fast matrix multiplication. Theory of Computing, pp. 1–60, 2013.

Boutsidis, C., Woodruff, D. P., and Zhong, P. Optimal principal component analysis in distributed and streaming models. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing (STOC), pp. 236–249, 2016.

Brand, J. v. d., Nanongkai, D., and Saranurak, T. Dynamic matrix inverse: Improved algorithms and matching conditional lower bounds. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pp. 456–480. IEEE, 2019.

Brand, J. v. d., Lee, Y. T., Sidford, A., and Song, Z. Solving tall dense linear programs in nearly linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 775–788, 2020.

Brand, J. v. d., Peng, B., Song, Z., and Weinstein, O. Training (overparametrized) neural networks in near-linear time. In ITCS, 2021.

Brand, J. v. d., Song, Z., and Zhou, T. Algorithm and hardness for dynamic attention maintenance in large language models. In Forty-first International Conference on Machine Learning (ICML), 2024.

Brand, J. v. d., Song, Z., and Weng, A. Subquadratic space representation of flows. arXiv preprint, 2025.

Bubeck, S., Cohen, M. B., Lee, Y. T., and Li, Y. An homotopy method for $\ell_p$ regression provably beyond self-concordance and in input-sparsity time. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 1130–1137, 2018.

Bürgisser, P., Clausen, M., and Shokrollahi, M. A. Algebraic Complexity Theory, volume 315. Springer Science & Business Media, 2013.

Carmon, Y., Duchi, J. C., Sidford, A., and Tian, K. A rank-1 sketch for matrix multiplicative weights. In Conference on Learning Theory (COLT), pp. 589–623, 2019.

Chakraborty, D., Kamma, L., and Larsen, K. G. Tight cell probe bounds for succinct boolean matrix-vector multiplication. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 1297–1306, 2018.

Charikar, M., Chen, K., and Farach-Colton, M. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming (ICALP), pp. 693–703. Springer, 2002.

Charikar, M., Newman, A., and Nikolov, A. Tight hardness results for minimizing discrepancy. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1607–1614. SIAM, 2011.

Chazelle, B. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, 2000.

Chen, L., Kol, G., Paramonov, D., Saxena, R. R., Song, Z., and Yu, H. Towards multi-pass streaming lower bounds for optimal approximation of max-cut.
In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 878–924. SIAM, 2023.

Chen, Z., Yuan, X., Lu, J., Tian, Q., and Zhou, J. Deep hashing via discrepancy minimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6838–6847, 2018.

Clarkson, K. L. and Woodruff, D. P. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing (STOC), pp. 81–90, 2013.

Cohen, M. B. Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 278–287. SIAM, 2016a.

Cohen, M. B. Ramanujan graphs in polynomial time. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 276–281. IEEE, 2016b.

Cohen, M. B., Lee, Y. T., and Song, Z. Solving linear programs in the current matrix multiplication time. In STOC, 2019.

Dadush, D. Discrepancy open problems. https://homepages.cwi.nl/~dadush/workshop/discrepancy-ip/open-problems.html, 2018.

Dadush, D., Nikolov, A., Talwar, K., and Tomczak-Jaegermann, N. Balancing vectors in any norm. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 1–10. IEEE, 2018.

Dantzig, G. B. Maximization of a linear function of variables subject to linear inequalities. Activity Analysis of Production and Allocation, 13:339–347, 1947.

De Berg, M. Computational Geometry: Algorithms and Applications. Springer Science & Business Media, 2000.

Deng, Y., Li, Z., and Song, Z. Attention scheme inspired softmax regression. arXiv preprint arXiv:2304.10411, 2023.

Drineas, P., Magdon-Ismail, M., Mahoney, M. W., and Woodruff, D. P. Fast approximation of matrix coherence and statistical leverage. The Journal of Machine Learning Research, 13(1):3475–3506, 2012.

Eldan, R. and Singh, M. Efficient algorithms for discrepancy minimization in convex sets. Random Structures & Algorithms, 53(2):289–307, 2018.

Gall, F. L. and Urrutia, F. Improved rectangular matrix multiplication using powers of the Coppersmith-Winograd tensor. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1029–1046. SIAM, 2018.

Gao, Y., Song, Z., and Yin, J. An iterative algorithm for rescaled hyperbolic functions regression. arXiv preprint arXiv:2305.00660, 2023.

Garber, D. and Hazan, E. Sublinear time algorithms for approximate semidefinite programming. Mathematical Programming, 158(1-2):329–361, 2016.

Grigorescu, E., Lin, Y.-S., Silwal, S., Song, M., and Zhou, S. Learning-augmented algorithms for online linear and semidefinite programming. Advances in Neural Information Processing Systems (NeurIPS), 35:38643–38654, 2022.

Gu, Y. and Song, Z. A faster small treewidth SDP solver. arXiv preprint arXiv:2211.06033, 2022.

Gu, Y., Kuang, N. L., Ma, Y.-A., Song, Z., and Zhang, L. Log-concave sampling over a convex body with a barrier: a robust and unified Dikin walk. In Advances in Neural Information Processing Systems (NeurIPS), 2024a.

Gu, Y., Song, Z., Yin, J., and Zhang, L. Low rank matrix completion via robust alternating minimization in nearly linear time. In The Twelfth International Conference on Learning Representations (ICLR), 2024b.

Gu, Y., Song, Z., and Zhang, L. A nearly-linear time algorithm for structured support vector machines. In ICLR, 2025.

Han, I., Kapralov, M., Kochetkova, E., Sheth, K., and Zandieh, A.
BalanceKV: KV cache compression through discrepancy theory. arXiv preprint arXiv:2502.07861, 2025.

Henzinger, M., Krinninger, S., Nanongkai, D., and Saranurak, T. Unifying and strengthening hardness for dynamic problems via the online matrix-vector multiplication conjecture. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (STOC), pp. 21–30, 2015.

Hoberg, R. and Rothvoss, T. A logarithmic additive integrality gap for bin packing. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 2616–2625. SIAM, 2017.

Hu, H., Song, Z., Weinstein, O., and Zhuo, D. Training overparametrized neural networks in sublinear time. arXiv preprint arXiv:2208.04508, 2022.

Huang, B., Jiang, S., Song, Z., Tao, R., and Zhang, R. Solving SDP faster: A robust IPM framework and efficient implementation. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 233–244. IEEE, 2022a.

Huang, B., Jiang, S., Song, Z., Tao, R., and Zhang, R. Solving SDP faster: A robust IPM framework and efficient implementation. In FOCS, 2022b.

Huang, B., Jiang, S., Song, Z., Tao, R., and Zhang, R. Solving SDP faster: A robust IPM framework and efficient implementation. In FOCS, 2022c.

Jain, R. and Yao, P. A parallel approximation algorithm for positive semidefinite programming. In Proceedings of the 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science (FOCS), 2011.

Jain, V., Sah, A., and Sawhney, M. Spencer's theorem in nearly input-sparsity time. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 3946–3958. SIAM, 2023.

Jambulapati, A., Reis, V., and Tian, K. Linear-sized sparsifiers via near-linear time discrepancy theory. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 5169–5208. SIAM, 2024.

Jiang, H., Kathuria, T., Lee, Y. T., Padmanabhan, S., and Song, Z. A faster interior point method for semidefinite programming. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pp. 910–918. IEEE, 2020a.

Jiang, H., Kathuria, T., Lee, Y. T., Padmanabhan, S., and Song, Z. A faster interior point method for semidefinite programming. In FOCS, 2020b.

Jiang, H., Lee, Y. T., Song, Z., and Wong, S. C.-w. An improved cutting plane method for convex optimization, convex-concave games, and its applications. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 944–953, 2020c.

Jiang, H., Lee, Y. T., Song, Z., and Zhang, L. Convex minimization with integer minima in $\widetilde{O}(n^4)$ time. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 3659–3684. SIAM, 2024.

Jiang, S., Song, Z., Weinstein, O., and Zhang, H. Faster dynamic matrix inverse for faster LPs. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2021.

Johnson, W. B. and Lindenstrauss, J. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

Karmarkar, N. A new polynomial-time algorithm for linear programming. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing (STOC), 1984.

Karnin, Z. and Liberty, E. Discrepancy, coresets, and sketches in machine learning. In Conference on Learning Theory (COLT), pp. 1975–1993. PMLR, 2019.

Khachiyan, L. G. Polynomial algorithms in linear programming. USSR Computational Mathematics and Mathematical Physics, 20(1):53–72, 1980.

Larsen, K. G.
Constructive discrepancy minimization with hereditary $\ell_2$ guarantees. arXiv preprint arXiv:1711.02860, 2017.

Larsen, K. G. Fast discrepancy minimization with hereditary guarantees. In SODA. arXiv preprint arXiv:2207.03268, 2023.

Larsen, K. G. and Nelson, J. Optimality of the Johnson-Lindenstrauss lemma. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 633–638. IEEE, 2017.

Larsen, K. G. and Williams, R. Faster online matrix-vector multiplication. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 2182–2189, 2017.

Lee, Y. T. and Padmanabhan, S. An $\widetilde{O}(m/\epsilon^{3.5})$-cost algorithm for semidefinite programs with diagonal constraints. In Conference on Learning Theory (COLT), Proceedings of Machine Learning Research. PMLR, 2020.

Lee, Y. T. and Sidford, A. Path finding methods for linear programming: Solving linear programs in $\widetilde{O}(\sqrt{\mathrm{rank}})$ iterations and faster algorithms for maximum flow. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS), pp. 424–433. IEEE, 2014.

Lee, Y. T., Sidford, A., and Wong, S. C. A faster cutting plane method and its implications for combinatorial and convex optimization. In 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2015.

Lee, Y. T., Song, Z., and Zhang, Q. Solving empirical risk minimization in the current matrix multiplication time. In Conference on Learning Theory (COLT), pp. 2140–2157. PMLR, 2019.

Li, X., Liang, Y., Shi, Z., and Song, Z. A tighter complexity analysis of SparseGPT. In Workshop on Machine Learning and Compression, NeurIPS 2024, 2024.

Li, Z., Song, Z., and Zhou, T. Solving regularized exp, cosh and sinh regression problems. arXiv preprint arXiv:2303.15725, 2023.

Liu, S. C., Song, Z., Zhang, H., Zhang, L., and Zhou, T. Space-efficient interior point method, with applications to linear programming and maximum weight bipartite matching. In ICALP, 2023.

Lovász, L., Spencer, J., and Vesztergombi, K. Discrepancy of set-systems and matrices. European Journal of Combinatorics, 7(2):151–160, 1986.

Lovett, S. and Meka, R. Constructive discrepancy minimization by walking on the edges. SIAM Journal on Computing, 44(5):1573–1582, 2015.

Lyu, Y. Fast rank-1 lattice targeted sampling for black-box optimization. Advances in Neural Information Processing Systems, 36:10494–10515, 2023.

Lyu, Y., Yuan, Y., and Tsang, I. Subgroup-based rank-1 lattice quasi-Monte Carlo. Advances in Neural Information Processing Systems, 33:6269–6280, 2020.

Matousek, J. Geometric Discrepancy: An Illustrated Guide, volume 18. Springer Science & Business Media, 1999a.

Matousek, J. Geometric Discrepancy: An Illustrated Guide, volume 18. Springer Science & Business Media, 1999b.

Muthukrishnan, S. and Nikolov, A. Optimal private halfspace counting via discrepancy. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing (STOC), pp. 1285–1292, 2012.

Nelson, J. and Nguyễn, H. L. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 117–126. IEEE, 2013.

Nesterov, Y. and Nemirovski, A. Conic formulation of a convex programming problem and duality. Optimization Methods and Software, 1(2):95–115, 1992.

Nikolov, A. Randomized rounding for the largest simplex problem. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (STOC), pp. 861–870, 2015.

Nikolov, A., Talwar, K., and Zhang, L.
The geometry of differential privacy: the sparse and approximate cases. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing (STOC), pp. 351–360, 2013.

Pesenti, L. and Vladu, A. Discrepancy minimization via regularization. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1734–1758. SIAM, 2023.

Qin, L., Song, Z., Zhang, L., and Zhuo, D. An online and unified algorithm for projection matrix vector multiplication with application to empirical risk minimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 101–156. PMLR, 2023.

Razenshteyn, I., Song, Z., and Woodruff, D. P. Weighted low rank approximations with provable guarantees. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing (STOC), pp. 250–263, 2016.

Renegar, J. A polynomial-time algorithm, based on Newton's method, for linear programming. Mathematical Programming, 40(1-3):59–93, 1988.

Rothvoß, T. Approximating bin packing within $O(\log \mathrm{OPT} \cdot \log\log \mathrm{OPT})$ bins. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 20–29. IEEE, 2013.

Rothvoss, T. Constructive discrepancy minimization for convex sets. SIAM Journal on Computing, 46(1):224–234, 2017.

Sarlos, T. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pp. 143–152. IEEE, 2006.

Shamir, O. A variant of Azuma's inequality for martingales with subgaussian tails. CoRR, abs/1110.2392, 2011. URL http://arxiv.org/abs/1110.2392.

Song, Z. Matrix Theory: Optimization, Concentration and Algorithms. PhD thesis, The University of Texas at Austin, 2019.

Song, Z. and Yu, Z. Oblivious sketching-based central path method for linear programming. In International Conference on Machine Learning (ICML), pp. 9835–9847. PMLR, 2021.

Song, Z., Woodruff, D. P., and Zhong, P. Low rank approximation with entrywise $\ell_1$-norm error. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 688–701, 2017.

Song, Z., Woodruff, D. P., and Zhong, P. Relative error tensor low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 2772–2789. SIAM, 2019a.

Song, Z., Woodruff, D. P., and Zhong, P. Relative error tensor low rank approximation. In SODA. arXiv preprint arXiv:1704.08246, 2019b.

Song, Z., Xu, Z., Yang, Y., and Zhang, L. Accelerating Frank-Wolfe algorithm using low-dimensional and adaptive data structures. arXiv preprint arXiv:2207.09002, 2022a.

Song, Z., Xu, Z., and Zhang, L. Speeding up sparsification using inner product search data structures. arXiv preprint arXiv:2204.03209, 2022b.

Song, Z., Wang, Y., Yu, Z., and Zhang, L. Sketching for first order method: efficient algorithm for low-bandwidth channel and vulnerability. In International Conference on Machine Learning (ICML), pp. 32365–32417. PMLR, 2023a.

Song, Z., Yang, X., Yang, Y., and Zhang, L. Sketching meets differential privacy: Fast algorithm for dynamic Kronecker projection maintenance. In ICML, 2023b.

Song, Z., Ye, M., and Zhang, L. Streaming semidefinite programs: $O(\sqrt{n})$ passes, small space and fast runtime. arXiv preprint arXiv:2309.05135, 2023c.

Spencer, J. Six standard deviations suffice. Transactions of the American Mathematical Society, 289(2):679–706, 1985.

Spielman, D. A. and Srivastava, N. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.
Talagrand, M. Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'Institut des Hautes Études Scientifiques, 81:73–205, 1995.

Tropp, J. A. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, 3(1-2):115–126, 2011.

Tropp, J. A. et al. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

Vaidya, P. M. An algorithm for linear programming which requires O(((m+n)n² + (m+n)^{1.5}n)L) arithmetic operations. In 28th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 1987.

Vaidya, P. M. A new algorithm for minimizing convex functions over convex sets. In 30th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 338–343, 1989.

Vapnik, V. and Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264, 1971.

Wang, R., Yan, J., and Yang, X. Unsupervised learning of graph matching with mixture of modes via discrepancy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 45(8):10500–10518, 2023.

Williams, V. V. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing (STOC), pp. 887–898. ACM, 2012.

Williams, V. V., Xu, Y., Xu, Z., and Zhou, R. New bounds for matrix multiplication: from alpha to omega. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 3792–3835. SIAM, 2024.

Woodruff, D. P. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.

Xiao, C., Zhong, P., and Zheng, C. BourGAN: Generative networks with metric embeddings. Advances in Neural Information Processing Systems, 31, 2018.

Yurtsever, A., Tropp, J. A., Fercoq, O., Udell, M., and Cevher, V. Scalable semidefinite programming. arXiv preprint arXiv:1912.02949, 2019.

Zhang, L. Speeding up optimizations via data structures: Faster search, sample and maintenance. Master's thesis, Carnegie Mellon University, 2022.

Roadmap. We organize the appendix as follows. In Section A, we give the preliminaries of the paper, including the notation we use, the definitions, and some useful results from prior works. In Section B, we provide previous tools and results on sketching. In Section C, we present our first result, a fast algorithm for projecting a matrix onto small rows. In Section D, we present the final hereditary-discrepancy minimization algorithm. In Section E, we present the second main result, the fast partial coloring algorithm. In Section F, we discuss the fast maintenance data structure we use to improve the running time. In Section G, we provide our implicit leverage score sampling algorithm. In Section H, we give a sparsification tool for matrices of the form A⊤A. In Section I, we provide the algorithm for finding a proper step size in the partial coloring algorithm. In Section J, we provide empirical validation of the efficiency of our algorithm. In Section K, we discuss more related work on linear programming, semidefinite programming, and applications of discrepancy theory in machine learning.

A. Preliminaries

A.1. Notations

For a vector x ∈ R^n, we use ∥x∥₁ to denote its entrywise ℓ₁ norm, and ∥x∥_∞ to denote its entrywise ℓ_∞ norm.
For a matrix X, we use ∥X∥ to denote its spectral norm. We use e_i to denote the vector whose i-th coordinate is 1 and all other coordinates are 0. We say a matrix A is positive semi-definite, written A ⪰ 0, if x⊤Ax ≥ 0 for all vectors x. We write A ⪯ B if x⊤Ax ≤ x⊤Bx for all x ∈ R^d. For a positive semi-definite matrix A with eigendecomposition A = UΣU⊤, we use A^{1/2} to denote UΣ^{1/2}U⊤, and A^{−1/2} to denote UΣ^{−1/2}U⊤, where Σ^{−1/2} is the diagonal matrix whose (i,i)-th entry is Σ_{i,i}^{−1/2} if Σ_{i,i} ≠ 0, and 0 if Σ_{i,i} = 0. For a matrix A, we use nnz(A) to denote the number of non-zero entries of A.

A.2. Definitions

Here we define disc and herdisc.

Definition A.1 (Discrepancy). For a real matrix A ∈ R^{m×n} and a vector x ∈ {−1,+1}^n, we define the discrepancy of A with respect to x as disc(A, x) := ∥Ax∥_∞ = max_{i∈[m]} |(Ax)_i|. We define the discrepancy of the matrix A as disc(A) := min_{x∈{−1,+1}^n} disc(A, x).

Definition A.2 (Hereditary discrepancy). For a real matrix A ∈ R^{m×n}, let 𝒜 be the set of matrices obtained from A by deleting some columns of A. We define the hereditary discrepancy of A as herdisc(A) := max_{B∈𝒜} disc(B), where disc is as in Definition A.1.

A.3. Basic Linear Algebra

Lemma A.3. Given two PSD matrices A ∈ R^{n×n} and B ∈ R^{n×n}, if (1−ϵ)·A ⪯ B ⪯ (1+ϵ)·A, then (1−ϵ)·λ_i(A) ≤ λ_i(B) ≤ (1+ϵ)·λ_i(A) for all i ∈ [n]. Here λ_i(A) denotes the i-th largest eigenvalue of A.

A.4. Prior Results on Herdisc

We state a tool from previous work.

Theorem A.4 (Larsen (2017)). Given a real matrix A ∈ R^{m×n}, let λ₁ ≥ λ₂ ≥ ... ≥ λ_n ≥ 0 be the eigenvalues of A⊤A. For all positive integers k ≤ min{m, n}, we have
herdisc(A) ≥ (1/4)·√(k·λ_k/n).

A.5. Properties of Gaussians

We state two facts about Gaussian random variables.

Claim A.5. Let t be a random variable with t ∼ N(0,1). Then for any λ > 0, we have Pr[|t| > λ] ≤ 2·exp(−λ²/2).

Claim A.6 (Claim 2 in Larsen (2023)). Let V ∈ R^{l×n} with l ≤ n be a matrix whose rows form an orthonormal basis. Let g ∈ R^n be a vector with n i.i.d. N(0,1) coordinates. Then for any vector a ∈ R^n, we have ⟨a, (I − V⊤V)g⟩ ∼ N(0, ∥a⊤(I − V⊤V)∥₂²).

A.6. Azuma for Martingales with Subgaussian Tails

For martingales with subgaussian tails, we introduce the following variant of Azuma's inequality. We first define a martingale difference sequence.

Definition A.7 (Martingale difference sequence). Let A = {A₁, A₂, ...} and B = {B₁, B₂, ...} be two sequences of random variables. If for any integer t ∈ Z₊, A_{t+1} is a function of {B₁, ..., B_t} and
Pr[E[A_{t+1} | B₁, ..., B_t] = 0] = 1,
then we say A is a martingale difference sequence with respect to B.

We now state Azuma's inequality for martingales with subgaussian tails.

Theorem A.8 (Shamir (2011)). Let δ ∈ (0,1) be an arbitrary failure probability. Let A = {A₁, ..., A_T} be a martingale difference sequence with respect to a sequence B = {B₁, ..., B_T}. Let b ≥ 1 and c > 0 be two constants. If for any integer t and all a ∈ R₊,
max{Pr[A_t > a | B₁, ..., B_{t−1}], Pr[A_t < −a | B₁, ..., B_{t−1}]} ≤ b·exp(−c·a²),
then with probability at least 1−δ,
|Σ_{t=1}^T A_t| ≤ √(28·b·c^{−1}·T·lg(1/δ)).
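To see how Theorem A.8 is applied later (in the proof of Lemma E.7 in Section E.4), here is the instantiation as a worked restatement; the symbols ϵ, η, N, β are the parameters of Algorithm 4, and nothing new is claimed:

$$\max\{\Pr[A_t > a \mid B_1,\dots,B_{t-1}],\ \Pr[A_t < -a \mid B_1,\dots,B_{t-1}]\} \le 2\exp\!\Big(-\frac{a^2}{2\epsilon^2\eta^2}\Big) \quad\Longrightarrow\quad b = 2,\ \ c = \frac{1}{2\epsilon^2\eta^2},$$

so Theorem A.8 yields, with probability at least $1-\delta$,

$$\Big|\sum_{t=1}^{N} A_t\Big| \le \sqrt{28\, b\, c^{-1}\, N \lg(1/\delta)} = \epsilon\eta\sqrt{112\, N \lg(1/\delta)},$$

which is dominated by the choice $\beta = \Theta\big(\epsilon\eta\sqrt{N\log(m/\delta)}\big)$ made in Algorithm 4.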
A.7. Concentration Inequality

We state the matrix Bernstein bound as follows:

Lemma A.9 (Matrix Bernstein bound (Tropp, 2011; Tropp et al., 2015)). Let X₁, ..., X_s be independent copies of a symmetric random matrix X ∈ R^{n×n} with E[X] = 0, ∥X∥ ≤ γ almost surely, and ∥E[X⊤X]∥ ≤ σ². Let W = (1/s)·Σ_{i∈[s]} X_i. Then for any ϵ ∈ (0,1),
Pr[∥W∥ ≥ ϵ] ≤ 2n·exp(−s·ϵ²/(σ² + γϵ/3)).
Here ∥W∥ denotes the spectral norm of the matrix W.

A.8. Fast Matrix Multiplication

We describe some basic background on fast matrix multiplication.

Definition A.10. Given three integers n₁, n₂, n₃, we use Tmat(n₁, n₂, n₃) to denote the time of multiplying an n₁×n₂ matrix by an n₂×n₃ matrix. It is known that the ordering of n₁, n₂, n₃ only affects the constant factor in the time complexity of matrix multiplication.

Fact A.11 (Bürgisser et al. (2013); Bläser (2013)). Tmat(n₁, n₂, n₃) = O(Tmat(n₁, n₃, n₂)) = O(Tmat(n₂, n₁, n₃)).

For convenience, we also define the ω(·,·,·) function and the exponent ω of matrix multiplication (Williams, 2012; Gall & Urrutia, 2018; Alman & Williams, 2021; Williams et al., 2024; Alman et al., 2025).

Definition A.12. For a, b, c > 0, we use n^{ω(a,b,c)} to denote the time of multiplying an n^a×n^b matrix by an n^b×n^c matrix. We denote by ω := ω(1,1,1) the exponent of matrix multiplication, and by α the dual exponent of matrix multiplication, i.e., the largest value with ω(1,1,α) = 2.

By the state-of-the-art fast matrix multiplication result in Alman et al. (2025), we have:

Lemma A.13 (Alman et al. (2025)). We have ω = ω(1,1,1) ≤ 2.371 and ω(1,1,0.5275) ≤ 2.055.

B. Previous Tools and Results on Sketching

In this section, we provide tools and previous results on sketching matrices. In Section B.1 we introduce several sketching matrices. In Section B.2 we present the classical JL transform. In Section B.3 we give previous results on subspace embeddings.

B.1. Sketching Matrices

Definition B.1 (Random Gaussian matrix or Gaussian transform, folklore). Let S = σ·G ∈ R^{s×m}, where σ is a scalar and each entry of G ∈ R^{s×m} is chosen independently from the standard Gaussian distribution. For any matrix A ∈ R^{m×n}, SA can be computed in O(s·nnz(A)) time.

Definition B.2 (AMS (Alon et al., 1996)). Let h₁, h₂, ..., h_b be b random hash functions picked from a 4-wise independent hash family H = {h : [n] → {−1/√b, +1/√b}}. Then R ∈ R^{b×n} is an AMS sketch matrix if we set R_{i,j} = h_i(j).

Definition B.3 (CountSketch (Charikar et al., 2002)). Let h : [n] → [b] be a random 2-wise independent hash function and σ : [n] → {−1, +1} a random 4-wise independent hash function. Then R ∈ R^{b×n} is a CountSketch matrix if we set R_{h(i),i} = σ(i) for all i ∈ [n] and all other entries to zero.

Definition B.4 (Sparse embedding matrix I (Nelson & Nguyễn, 2013)). We say R ∈ R^{b×n} is a sparse embedding matrix with parameter s if each column has exactly s non-zero entries, each equal to ±1/√s uniformly at random, whose locations are picked uniformly at random without replacement (and independently across columns).

Definition B.5 (Sparse embedding matrix II (Nelson & Nguyễn, 2013)). Let h : [n]×[s] → [b/s] be a random 2-wise independent hash function and σ : [n]×[s] → {−1, +1} a 4-wise independent hash function. Then R ∈ R^{b×n} is a sparse embedding matrix II with parameter s if we set R_{(j−1)b/s + h(i,j), i} = σ(i,j)/√s for all (i,j) ∈ [n]×[s] and all other entries to zero.

Definition B.6 (CountSketch + Gaussian transform, Definition B.18 in Song et al. (2019a)). Let S′ = SΠ, where Π ∈ R^{t×m} is the COUNTSKETCH transform (Definition B.3) and S ∈ R^{s×t} is the Gaussian transform (Definition B.1). For any matrix A ∈ R^{m×n}, S′A can be computed in O(nnz(A) + n·t·s^{ω−2}) time, where ω is the matrix multiplication exponent.
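To make Definition B.3 concrete, here is a minimal numpy sketch (an illustrative toy, not the paper's implementation: it uses fully random hashes in place of the 2-wise/4-wise independent families, and the sizes are arbitrary demo choices):

import numpy as np

def countsketch_apply(A, b, rng):
    # Apply a CountSketch matrix R in {-1,0,+1}^{b x m} to A in O(nnz(A)) time:
    # row i of A is added to bucket h(i) with sign sigma(i), without forming R.
    m = A.shape[0]
    h = rng.integers(0, b, size=m)            # stands in for a 2-wise independent hash
    sigma = rng.choice([-1.0, 1.0], size=m)   # stands in for a 4-wise independent hash
    RA = np.zeros((b, A.shape[1]))
    np.add.at(RA, h, sigma[:, None] * A)      # scatter-add rows into buckets
    return RA

rng = np.random.default_rng(0)
A = rng.standard_normal((10_000, 20))
RA = countsketch_apply(A, b=4_000, rng=rng)
x = rng.standard_normal(20)
# For b = Omega(n^2 / eps^2) buckets, ||R A x||_2 = (1 +/- eps) ||A x||_2 w.h.p.
print(np.linalg.norm(RA @ x) / np.linalg.norm(A @ x))  # close to 1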
B.2. JL Transform

We define the Johnson-Lindenstrauss transform (Johnson & Lindenstrauss, 1984) as follows:

Definition B.7 (Johnson & Lindenstrauss (1984)). Let ϵ ∈ (0,1) be the precision parameter and δ ∈ (0,1) the failure probability. Let A ∈ R^{m×n} be a fixed matrix, and let a_i⊤ denote the i-th row of A for all i ∈ [m]. We say R is an (ϵ, δ)-JL transform if, with probability at least 1−δ,
(1−ϵ)·∥a_i∥₂ ≤ ∥R·a_i∥₂ ≤ (1+ϵ)·∥a_i∥₂ for all i ∈ [m].

It is well known that random Gaussian matrices and AMS matrices have the JL-transform property.

Lemma B.8 (Johnson-Lindenstrauss transform (Johnson & Lindenstrauss, 1984)). Let ϵ ∈ (0,1) be the precision parameter and δ ∈ (0,1) the failure probability. Let A ∈ R^{m×n} be a real matrix. Then there exists a sketching matrix R ∈ R^{O(ϵ^{−2}·log(mn/δ))×n} (as in Definition B.1 or Definition B.2) such that with probability at least 1−δ,
(1−ϵ)·∥a_i∥₂ ≤ ∥R·a_i∥₂ ≤ (1+ϵ)·∥a_i∥₂ for all i ∈ [m],
where a_i⊤ denotes the i-th row of A ∈ R^{m×n}. The JL lemma is known to be tight due to Larsen & Nelson (2017).

B.3. Subspace Embedding

We first define the well-known subspace embedding property.

Definition B.9 (Subspace embedding (Sarlos, 2006)). A (1±ϵ) ℓ₂-subspace embedding for the column space of an m×n matrix A is a matrix S for which, for all x ∈ R^n, ∥SAx∥₂² = (1±ϵ)·∥Ax∥₂². Letting U denote an orthonormal basis of the column space of A, this is equivalent to all singular values of SU lying in [1−ϵ, 1+ϵ] with probability 1−δ.

Nelson & Nguyễn (2013) show that r = O(ϵ^{−2}·n·log⁸(n/δ)) rows and column sparsity O(ϵ^{−1}·log³(n/δ)) suffice for a subspace embedding (see Theorem 9 in Nelson & Nguyễn (2013)). Later, Cohen (2016a) improved this to r = O(ϵ^{−2}·n·log(n/δ)) rows with column sparsity O(ϵ^{−1}·log(n/δ)) (see Theorem 4.2 in Cohen (2016a)).

Lemma B.10 (Nelson & Nguyễn (2013); Cohen (2016a)). Let A ∈ R^{m×n} be a matrix, and let S ∈ R^{r×m} be a sketching matrix as in Definition B.4. If r = O(ϵ^{−2}·n·log(n/δ)) and the column sparsity of S is s = O(ϵ^{−1}·log(n/δ)), then with probability 1−δ, S satisfies ∥SAx∥₂² = (1±ϵ)·∥Ax∥₂² for all x. Further, SA can be computed in O(s·nnz(A)) time.

C. Small Row Projection via Implicit Leverage-Score Sampling

In this section, we build our algorithm for projecting a matrix onto small rows. In Section C.1 we present the orthogonalize subroutine, which is used in later algorithms. In Section C.2 we present our projection algorithm together with its analysis.

C.1. Orthogonalize Subroutine

Here we present the algorithm ORTHOGONALIZE, which is a single step of the Gram-Schmidt process. This subroutine is used repeatedly in the construction of the algorithms that follow. The following lemma gives its running time, which scales with the sparsity of the input vector.

Lemma C.1. The algorithm ORTHOGONALIZE (Algorithm 1) runs in O(∥s∥₀·n) time.

Algorithm 1 Algorithm 1 in Larsen (2023)
1: procedure ORTHOGONALIZE(s ∈ R^n, V ∈ R^{l×n}) ▷ Lemma C.1
2:   s′ ← (I − V⊤V)·s
3:   if s′ ≠ 0 then
4:     add s′/∥s′∥₂ as a row of V
5:   end if
6:   return V
7: end procedure

Proof. The running time of ORTHOGONALIZE can be divided as follows: Line 2 takes O(∥s∥₀·n) time to compute s′, where for a vector x ∈ R^d, ∥x∥₀ denotes the number of non-zero entries of x. Line 4 takes O(n) time to compute s′/∥s′∥₂. Thus the total running time is O(∥s∥₀·n).
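In Python, one Gram-Schmidt step of Algorithm 1 looks as follows (a minimal dense sketch for intuition only; the actual algorithm maintains V implicitly, and the tolerance 1e-12 is an arbitrary demo choice):

import numpy as np

def orthogonalize(s, V, tol=1e-12):
    # One step of Gram-Schmidt, as in Algorithm 1: project s off the row span
    # of V and, if a non-zero residual remains, append it as a new unit row.
    residual = s - V.T @ (V @ s)    # (I - V^T V) s
    norm = np.linalg.norm(residual)
    if norm > tol:
        V = np.vstack([V, residual / norm])
    return V

rng = np.random.default_rng(1)
V = np.zeros((0, 5))                # start from an empty basis
for _ in range(3):
    V = orthogonalize(rng.standard_normal(5), V)
print(np.allclose(V @ V.T, np.eye(V.shape[0])))  # rows are orthonormal: True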
C.2. Fast Projecting to Small Rows

Here we present our first main result, the faster algorithm for projecting a matrix onto small rows. We first state the algorithm, then analyze its running time and its correctness.

C.2.1. Running Time

The goal of this section is to prove the running time of Algorithm 2. We have the following theorem.

Theorem C.2 (Running time of Algorithm 2). The algorithm FASTPROJECTTOSMALLROWS (Algorithm 2) runs in Õ(nnz(A) + n^ω) time, where ω denotes the exponent of matrix multiplication (currently ω ≈ 2.373 (Williams, 2012; Alman & Williams, 2021)). The success probability is 1−δ. Note that Õ hides log(n/δ) factors.

Proof. The running time of FASTPROJECTTOSMALLROWS can be divided as follows. The algorithm runs the following lines T = O(log(m/n)) times:
• Line 14 computes B̂_t, which takes O(nnz(A)·r + n^ω + n²·r) = Õ(nnz(A) + n^ω) time.
• Lines 16 to 18 take Õ(m) time to construct D_t: computing the sketched row norms takes Õ(m) time, and finding the m_{t−1} largest of them takes Õ(m) time.
• Using Lemma G.4, we can compute D̃_t in Õ(nnz(A) + n^ω) time.
• Line 31 runs ORTHOGONALIZE O(n) times, each call taking O(n) time.
Adding these together, the total running time is Õ(nnz(A) + n^ω).

C.2.2. Correctness

Here we present the correctness theorem for Algorithm 2, together with its proof.

Theorem C.3 (Correctness of Algorithm 2). For any given m×n matrix A, there is an algorithm that takes A as input and outputs a matrix V ∈ R^{ℓ×n} with ℓ ≤ n/4 such that

Algorithm 2 Our input sparsity projection algorithm (this improves Algorithm 2 in (Larsen, 2023))
1: procedure FASTPROJECTTOSMALLROWS(A ∈ R^{m×n}, δ ∈ (0,1)) ▷ Theorem C.3
2:   V₁ ← 0^{0×n} ▷ Initialize an empty matrix
3:   T ← log(8m/n)
4:   δ₀ ← Θ(δ/(m·T))
5:   ϵ₀ ← 0.01
6:   r ← ϵ₀^{−2}·log(1/δ₀)
7:   δ_B ← Θ(δ/T)
8:   ϵ_B ← 0.1
9:   for t = 1 → T do
10:    l_t ← t·n/(8T)
11:    m_t ← m/2^t
12:    /* Ideally, we would compute B_t ← A(I − V_t⊤V_t) ∈ R^{m×n}, but we cannot afford to. */
13:    Let R ∈ R^{n×r} denote a JL sketching matrix (either Definition B.2 or Definition B.1)
14:    Compute B̂_t ← A(I − V_t⊤V_t)·R ▷ ∥B̂_{t,j}∥₂ = (1±ϵ₀)·∥B_{t,j}∥₂ for all j ∈ [m], Lemma B.8
15:    /* Ideally, we would compute ∥B_{t,j}∥₂, but we cannot afford to. */
16:    Compute ∥B̂_{t,j}∥₂ for all j ∈ [m]
17:    Let S_t ⊆ [m] denote the indices of the m_{t−1} rows of B̂_t of largest norm
18:    Let D_t denote the sparse diagonal matrix whose (i,i)-th entry is 1 if i ∈ S_t and 0 otherwise
19:    Let B_t = D_t·A(I − V_t⊤V_t) ∈ R^{m_{t−1}×n} be the submatrix obtained from B_t
20:    /* Ideally, we would compute B_t⊤B_t, but we cannot afford to. */
21:    D̃_t ← SUBSAMPLE(D_t·A, V_t, ϵ_B, δ_B) ▷ Algorithm 10
22:    B̃_t ← D̃_t·A(I − V_t⊤V_t)
23:    Note that σ_i = λ_i(B_t⊤B_t) and σ̃_i = λ_i(B̃_t⊤B̃_t)
24:    /* Ideally, we would compute the eigenvalues σ₁ ≥ ... ≥ σ_n ≥ 0 and corresponding eigenvectors u₁, ..., u_n of B_t⊤B_t, but we cannot afford to. */
25:    Let ũ₁, ..., ũ_n ∈ R^n denote the eigenvectors of the SVD decomposition B̃_t⊤B̃_t = Ũ·Σ̃·Ũ⊤
26:    Construct the matrix V_{t+1} ∈ R^{l_{t+1}×n} by adding the row vectors {ũ₁, ..., ũ_{n/(8T)}} to the bottom of V_t ∈ R^{l_t×n}
27:  end for
28:  B_T ← A(I − V_{T+1}⊤V_{T+1})
29:  Let r₁, ..., r_{n/8} be the n/8 rows of B_T with largest norm
30:  for j = 1 → n/8 do
31:    V ← ORTHOGONALIZE(r_j, V) ▷ Algorithm 1
32:  end for
33:  return V ▷ V ∈ R^{ℓ×n} with ℓ ≤ n/4
34: end procedure

• V has unit-length, pairwise-orthogonal rows; and
• all rows of A(I − V⊤V) have norm at most O(herdisc(A)·log(m/n)),
and both properties hold with probability at least 1−δ.

Proof. We define T := log(8m/n), the number of iterations of the main loop of Algorithm 2; we let C₀ be a sufficiently large constant to be used later; and we define m_t := m/2^t, the number of rows kept in B_t at each iteration. We then continue the proof in the following paragraphs.

The rows of V form an orthonormal basis.
To start with, let us show that the rows of the output matrix V form an orthonormal basis. Since V₁ has no rows, the claim holds initially. Fix t ≤ T and consider the t-th iteration, after constructing V_{t+1} by adding the row vectors {ũ₁, ..., ũ_{n/(8T)}} to the bottom of V_t in Line 26. These vectors are eigenvectors of a symmetric matrix, so they are pairwise orthogonal, and each satisfies ∥ũ∥₂ = 1. Moreover, {ũ₁, ..., ũ_{n/(8T)}} ⊆ span(B̃_t). Since B̃_t = D̃_t·A(I − V_t⊤V_t), all rows of B̃_t must be orthogonal to all rows of V_t ∈ R^{l_t×n}, and it follows that adding the eigenvectors {ũ₁, ..., ũ_{n/(8T)}} as rows of V_t keeps the rows of V_{t+1} ∈ R^{l_{t+1}×n} an orthonormal basis. After the first for-loop, the last for-loop (Line 31) clearly preserves this property of V, since ORTHOGONALIZE only ever appends unit vectors orthogonal to the current rows of V.

Row norm guarantee, proved by induction. We now claim that after the t-th iteration of the first for-loop (Line 9 to Line 27),
|{i ∈ [m] : ∥b_i∥₂ ≥ (1+ϵ₀)·C₀·T·herdisc(A)}| ≤ m_t,
i.e., there are at most m_t rows of B_t = A(I − V_t⊤V_t) having norm larger than (1+ϵ₀)·C₀·T·herdisc(A). For t = 0 this holds trivially. We make the following inductive assumption: after iteration t−1, there are at most m_{t−1} rows with norm larger than (1−ϵ₀)·C₀·T·herdisc(A). We split the inductive step into the following two parts.

Norms of rows not in B_t do not increase. Fix t ∈ [T]. By the inductive assumption and our approximate norm computation, at the t-th iteration all rows not selected into B_t have norm at most (1+ϵ₀)·C₀·T·herdisc(A). Since span(A·V_t⊤V_t) ⊆ span(V), the norm of each row of A(I − V_t⊤V_t) never increases after we add new rows to V. This proves the claim.

At most m_t rows of B_t reach the threshold after adding rows. We now prove that after adding {ũ₁, ..., ũ_{n/(8T)}} to the bottom of V_t (and naming the new matrix V_{t+1}), there are at most m_t rows of B_t(I − V_{t+1}⊤V_{t+1}) ∈ R^{m_{t−1}×n} with norm larger than (1−ϵ₀)·C₀·T·herdisc(A). Note that before adding the new rows to V_t, all rows of B_t are already orthogonal to all rows of V_t, so B_t = B_t(I − V_t⊤V_t).

We prove the claim by contradiction. For the sake of contradiction, assume that more than m_t rows of B_t(I − V_{t+1}⊤V_{t+1}) have norm larger than (1−ϵ₀)·C₀·T·herdisc(A) after adding {ũ₁, ..., ũ_{n/(8T)}} to the bottom of V_t. Suppose V_{t+1} has l_{t+1} rows. Select an arbitrary orthonormal set of unit vectors {w₁, ..., w_{n−l_{t+1}}} ⊆ R^n whose span is the orthogonal complement of span(V_{t+1}) in R^n.

Using the approximate row norms, we generate B_t = D_t·A(I − V_t⊤V_t). Using SUBSAMPLE (Algorithm 10), we generate the matrix B̃_t = D̃_t·A(I − V_t⊤V_t). By Lemma H.2, B̃_t has only ∥D̃_t∥₀ = O(ϵ_B^{−2}·n·log(n/δ_B)) rows, and
Pr[(1−ϵ_B)·B_t⊤B_t ⪯ B̃_t⊤B̃_t ⪯ (1+ϵ_B)·B_t⊤B_t] ≥ 1−δ_B.
Using Lemma A.3, the above approximation implies that
Pr[(1−ϵ_B)·λ_i(B_t⊤B_t) ≤ λ_i(B̃_t⊤B̃_t) ≤ (1+ϵ_B)·λ_i(B_t⊤B_t) for all i ∈ [n]] ≥ 1−δ_B.
Define Q := B_t(I − V_{t+1}⊤V_{t+1}), and for the approximate counterpart define Q̃ := B̃_t(I − V_{t+1}⊤V_{t+1}). We first note that all rows of Q lie in span(w₁, ..., w_{n−l_{t+1}}); by the construction of B̃_t, all rows of Q̃ also lie in span(w₁, ..., w_{n−l_{t+1}}). Letting q_k denote the k-th row of Q, it follows that
Σ_{k∈[m_{t−1}]} Σ_{j∈[n−l_{t+1}]} ⟨q_k, w_j⟩² = Σ_{k∈[m_{t−1}]} ∥q_k∥₂² > m_t·((1−ϵ₀)·C₀·T·herdisc(A))²,
where the second step follows from the contradiction assumption.
Taking the average over the vectors w_j ∈ R^n, there has to be a j ∈ [n−l_{t+1}] such that
Σ_{k∈[m_{t−1}]} ⟨q_k, w_j⟩² > m_t·((1−ϵ₀)·C₀·T·herdisc(A))²/n.
We denote v = w_j. Thus we have
∥B̃_t·v∥₂² ≥ (1−ϵ_B)·∥B_t·v∥₂² > (1−ϵ_B)·m_t·((1−ϵ₀)·C₀·T·herdisc(A))²/n,  (2)
where the first step follows from Lemma H.2, and the second step follows from the contradiction assumption that there are more than m_t rows of B_t(I − V_{t+1}⊤V_{t+1}) with norm more than (1−ϵ₀)·C₀·T·herdisc(A), together with v being a unit vector.

We can also upper bound ∥B̃_t·v∥₂²:
∥B̃_t·v∥₂² = v⊤B̃_t⊤B̃_t·v ≤ σ̃_{n/(8T)} ≤ (1+ϵ_B)·σ_{n/(8T)},  (3)
where the first step follows from the definition of the ℓ₂ norm, the second step follows from v being orthogonal to ũ₁, ..., ũ_{n/(8T)} ∈ R^n, and the third step follows from the way we generate B̃_t.

Thus we have
σ_{n/(8T)} > ((1−ϵ_B)/(1+ϵ_B))·m_t·((1−ϵ₀)·C₀·T·herdisc(A))²/n ≥ (1/4)·m_t·(C₀·T·herdisc(A))²/n,
where the first step follows from Eq. (2) and Eq. (3), and the second step follows from the settings ϵ_B ∈ (0, 0.5) and ϵ₀ = 0.01.

But at the same time, we have
herdisc(A) ≥ herdisc(B_t) ≥ (1/4)·√((n/(8T))·σ_{n/(8T)}/n) > (1/4)·√(m_t·(C₀·T·herdisc(A))²/(32·n)) ≥ 2·herdisc(A),
where the first step follows from B_t being a matrix obtained from A ∈ R^{m×n} by deleting a subset of its rows, the second step follows from Theorem A.4 with k = n/(8T), and the last step follows from simplifying, using m_t ≥ n/8 and taking the constant C₀ large enough. Thus we get a contradiction.

By the argument above, after the first for-loop (Line 9 to Line 27 in Algorithm 2), there are at most m_T = n/8 rows of B_T = A(I − V_{T+1}⊤V_{T+1}) with norm larger than C₀·T·herdisc(A). Then we run the second for-loop (Line 31 in Algorithm 2); after it, all of these large rows lie in span(V), so they are zeroed out in A(I − V⊤V). Thus we have the desired guarantee of the algorithm, and by a union bound over the above steps, the result holds with probability at least 1−δ. This completes the proof.

D. Discrepancy-Minimization Algorithm

D.1. Correctness

The goal of this section is to prove Theorem D.1.

Theorem D.1. The algorithm FASTHEREDITARYMINIMIZE (Algorithm 3) takes a matrix A ∈ R^{m×n} as input and outputs a coloring x ∈ {−1,+1}^n such that
disc(A, x) = O(herdisc(A)·log n·log^{1.5} m)
holds with probability 1−δ_final.

Proof. We divide the proof into the following two parts.

Quality of the answer. By the stopping condition of the algorithm, there is no i ∈ [n] with |x_i| < 1. By Lemma E.2, each call to FASTPARTIALCOLORING (Algorithm 4) changes disc(A, x) by at most O(herdisc(A)·log^{1.5} m). Since FASTPARTIALCOLORING is called O(log n) times, the guarantee follows. This completes the proof of this part.

Success probability. Each time we call the FASTPARTIALCOLORING algorithm:
• with probability at most 19/20, it outputs fail;
• with probability at least 1/20 − δ_final³/n³, it outputs a good x_new;
• with probability at most δ_final³/n³, it outputs an x_new with no guarantee.
As long as we repeat the second while loop k = Θ(log(n/δ_final)) times, we can guarantee that with probability 1 − δ_final²/n², we obtain a non-fail x_new. Then by a union bound, the final success probability is
1 − (δ_final²/n²)·log n [all log n iterations get a non-fail output] − (δ_final³/n³)·k·log n [all V and η are good] ≥ 1 − δ_final.

D.2. Running Time

The goal of this section is to prove Theorem D.2.

Theorem D.2. For any parameter a ∈ [0,1], the expected running time of the algorithm FASTHEREDITARYMINIMIZE (Algorithm 3) is
Õ(nnz(A) + n^ω + n^{2+a} + n^{1+ω(1,1,a)−a}).
Note that the ω(·,·,·) function is defined in Definition A.12, and ω is the exponent of matrix multiplication (currently ω ≈ 2.373 (Williams, 2012; Alman & Williams, 2021)).

Proof. We analyze the running time of FASTHEREDITARYMINIMIZE as follows. First, notice that |S| halves with each iteration of the outer while loop. Thus, in each iteration, we only need to examine the x_i for i ∈ S, where S comes from the previous iteration; hence we can maintain S in O(n) time. By the same argument, Line 6 takes a total of O(nnz(A)) time to generate the submatrices A′. Since |S| halves each iteration, |S| ≤ n/2^i at the i-th iteration. Thus, by Lemma E.12, Line 12 takes
T_FPC = Õ(nnz(A) + n^ω + n^{2+a} + n^{1+ω(1,1,a)−a})
time to call FASTPARTIALCOLORING.

Let p denote the success probability of each call to FASTPARTIALCOLORING. Each call takes T_FPC time, so in total,
Expected time = p·T_FPC + (1−p)·p·2T_FPC + (1−p)²·p·3T_FPC + ··· = Σ_{i≥1} (1−p)^{i−1}·p·i·T_FPC.
Let f(p) = Σ_{i≥1} (1−p)^{i−1}·p·i. Then
f(p) − (1−p)·f(p) = Σ_{i≥1} (1−p)^{i−1}·p·i − Σ_{i≥1} (1−p)^i·p·i = Σ_{i≥0} (1−p)^i·p·(i+1) − Σ_{i≥1} (1−p)^i·p·i = Σ_{i≥0} (1−p)^i·p = 1.
Thus f(p) = 1/p, and the expected time per outer iteration is O(T_FPC/p). Since there are at most log n iterations, the total expected time is O(p^{−1}·T_FPC·log n).

Algorithm 3
1: procedure FASTHEREDITARYMINIMIZE(A ∈ R^{m×n}, δ_final ∈ (0, 0.001)) ▷ Theorem D.1, Theorem D.2
2:   x ← 0_n
3:   while x ∉ {−1,+1}^n do
4:     S ← {i ∈ [n] : |x_i| < 1}
5:     Let x′ ← x_S be the coordinates of x indexed by S
6:     Let A′ ← A_{·,S} ∈ R^{m×|S|} be the columns of A indexed by S
7:     finished ← false
8:     counter ← 0
9:     k ← Θ(log(n/δ_final))
10:    while finished = false and counter < k do
11:      counter ← counter + 1
12:      x_new ← FASTPARTIALCOLORING(A′, x′, δ_final³/n³) ▷ Algorithm 4
13:      if x_new ≠ fail then
14:        x_S ← x_new
15:        finished ← true
16:      end if
17:    end while
18:    if finished = false then
19:      return fail
20:    end if
21:  end while
22:  return x
23: end procedure

Algorithm 4 Input sparsity time partial coloring algorithm
1: procedure FASTPARTIALCOLORING(A ∈ R^{m×n}, x ∈ (−1,+1)^n, a ∈ [0,1], δ_final ∈ (0,0.01), δ ∈ (0,0.01)) ▷ Lemma E.2, Lemma E.12
2:   u₁ ← 0_n
3:   δ₁ ← δ_final²/n²
4:   δ₂ ← δ_final²/n²
5:   V ← FASTPROJECTTOSMALLROWS(A, δ₁) ▷ Algorithm 2
6:   η ← FASTAPPROXMAXNORM(A, V, δ₂) ▷ Algorithm 5
7:   ▷ η = O(herdisc(A)·log(m/n))
8:   ϵ ← Θ((log(mn) + n)^{−1/2})
9:   N ← Θ(ϵ^{−2} + n)
10:  β ← Θ(ϵη·√(N·log(m/δ))) ▷ β = O(herdisc(A)·log^{1.5} m)
11:  Generate a Gaussian matrix G ∈ R^{n×N}, where each column is sampled from N(0, I_n)
12:  MAINTAIN ds ▷ Algorithm 6
13:  ds.INIT(V, G, n, n^a) ▷ Algorithm 6
14:  for t = 1 → N do
15:    g^{V_t} ← ds.QUERY() ▷ Algorithm 6; g^{V_t} = (I − V_t⊤V_t)·G_{·,t}
16:    Let µ > 0 be maximal such that max{∥x + u_t + µ·g^{V_t}∥_∞, ∥x + u_t − µ·g^{V_t}∥_∞} = 1
17:    µ ← FINDBOUNDARY(x + u_t, g^{V_t}) ▷ Algorithm 11
18:    ĝ_t ← min{ϵ, µ}·g^{V_t} ▷ ĝ_t ∈ R^n
19:    for i = 1 → n do
20:      if |x_i + u_{t,i} + ĝ_{t,i}| = 1 and |x_i + u_{t,i}| < 1 then
21:        ▷ We add e_i to V implicitly via the data structure ds; the explicit version is Algorithm 1
22:        ds.UPDATE(e_i) ▷ Algorithm 7
23:      end if
24:    end for
25:    u_{t+1} ← u_t + ĝ_t
26:    if |{i ∈ [n] : |x_i + u_{t+1,i}| = 1}| ≥ n/2 then
27:      if ∥A·u_{t+1}∥_∞ > β then
28:        return fail ▷ Case 2 of Lemma E.2
29:      else
30:        return x_new ← x + u_{t+1} ▷ Case 1 of Lemma E.2
31:      end if
32:    end if
33:  end for
34:  return fail ▷ Case 3 of Lemma E.2
35: end procedure
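For intuition, here is a tiny dense simulation of the Edge-Walk loop at the heart of Algorithm 4 (a self-contained toy, not our implementation: it starts from an empty V rather than the output of FASTPROJECTTOSMALLROWS, recomputes projections explicitly instead of using the MAINTAIN data structure, and the step bound eps, tolerances, and iteration count are ad hoc demo choices):

import numpy as np

def partial_coloring_toy(x, n_steps=4000, eps=0.05, seed=0):
    # Random-walk partial coloring: move x inside [-1,1]^n along Gaussian
    # directions projected off the frozen coordinates, freezing each
    # coordinate the moment it reaches +1 or -1.
    rng = np.random.default_rng(seed)
    n = x.size
    x = x.astype(float).copy()
    V = np.zeros((0, n))                     # here: rows e_i of frozen coordinates
    for _ in range(n_steps):
        g = rng.standard_normal(n)
        g -= V.T @ (V @ g)                   # (I - V^T V) g
        if np.linalg.norm(g) < 1e-9:
            break
        # largest step mu keeping x + mu*g inside the cube (FINDBOUNDARY's job)
        with np.errstate(divide="ignore", invalid="ignore"):
            dist = np.where(g > 0, (1 - x) / g, (-1 - x) / g)
        mu = np.nanmin(dist[np.abs(g) > 1e-12])
        x += min(eps, mu) * g
        for i in np.flatnonzero(np.abs(np.abs(x) - 1) < 1e-9):
            if not V[:, i].any():            # freeze coordinate i: add e_i to V
                e = np.zeros(n); e[i] = 1.0
                V = np.vstack([V, e])
            x[i] = np.sign(x[i])             # snap to +/-1 against rounding drift
        if (np.abs(x) == 1).sum() >= n // 2:
            break
    return x

x = partial_coloring_toy(np.zeros(64))
print(int((np.abs(x) == 1).sum()), "of 64 coordinates frozen at +/-1")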
E. Partial Coloring

E.1. Fast Algorithm for Approximating Norms

Algorithm 5
1: procedure FASTAPPROXMAXNORM(A ∈ R^{m×n}, V ∈ R^{ℓ×n}, δ₂ ∈ (0, 0.01)) ▷ Lemma E.1
2:   Let R ∈ R^{n×r} denote a random JL matrix with r = Θ(ϵ₁^{−2}·log(m/δ₂)) ▷ Either Definition B.2 or Definition B.1
3:   Compute (I − V⊤V)·R ▷ This takes O(n²·r) time
4:   η ← max_{j∈[m]} ∥a_j⊤(I − V⊤V)·R∥₂ ▷ This takes O(nnz(A)·r) time
5:   return η
6: end procedure

Here in this section, we present a subroutine used as a tool to obtain a fast approximation of the row norms of a matrix, based on the JL lemma. The goal of this section is to prove Lemma E.1.

Lemma E.1. Let V ∈ R^{ℓ×n} with ℓ ≤ n and A ∈ R^{m×n}. For any accuracy parameter ϵ₁ ∈ (0, 0.1) and failure probability δ₁ ∈ (0, 0.1), let r = O(ϵ₁^{−2}·log(m/δ₁)). There is an algorithm (procedure FASTAPPROXMAXNORM in Algorithm 5) that runs in O((nnz(A) + n²)·r) time and outputs a number η ∈ R such that
η = (1±ϵ₁)·max_{j∈[m]} ∥a_j⊤(I − V⊤V)∥₂
holds with probability 1−δ₁.

Proof. Let R ∈ R^{n×r} denote a random JL matrix. The running time comes from the following computations:
• Computing V·R takes O(n²·r) time.
• Computing V⊤·(V·R) takes O(n²·r) time.
• Computing A·R takes O(nnz(A)·r) time.
• Computing A·(V⊤(V·R)) takes O(nnz(A)·r) time.
• Computing ∥(A·R − A·(V⊤(V·R)))_{j,·}∥₂ for all j ∈ [m] takes O(m·r) time.
• Taking the max over all j ∈ [m] takes O(m) time.
The correctness follows directly from the JL lemma (Lemma B.8).
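A minimal numpy rendering of Algorithm 5's idea (illustration only: we use a Gaussian JL matrix as in Definition B.1, and the matrix sizes and sketch width r are arbitrary demo choices):

import numpy as np

def fast_approx_max_norm(A, V, r, rng):
    # Estimate max_j ||a_j^T (I - V^T V)||_2 without forming A(I - V^T V):
    # sketch with a Gaussian JL matrix R in R^{n x r} and take row norms of
    # A R - (A V^T)(V R), costing O(n^2 r + nnz(A) r).
    n = A.shape[1]
    R = rng.standard_normal((n, r)) / np.sqrt(r)
    PR = R - V.T @ (V @ R)            # (I - V^T V) R
    return np.linalg.norm(A @ PR, axis=1).max()

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 40))
Q, _ = np.linalg.qr(rng.standard_normal((40, 10)))
V = Q.T                               # 10 orthonormal rows
exact = np.linalg.norm(A - (A @ V.T) @ V, axis=1).max()
print(fast_approx_max_norm(A, V, r=400, rng=rng) / exact)  # close to 1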
E.2. Correctness

The goal of this section is to prove Lemma E.2, which gives the correctness guarantee of Algorithm 4.

Lemma E.2 (Output guarantees of Algorithm 4). Let A ∈ R^{m×n} be an input matrix and x ∈ (−1,1)^n an input partial coloring. Let δ ∈ (0, 0.01) denote a parameter. Condition on the event that V and η are good (i.e., they satisfy the guarantees of Theorem C.3 and Lemma E.1, respectively). Then:

Case 1. With probability at least 1/10 − δ, Algorithm 4 outputs a vector x_new ∈ R^n such that: ∥x_new∥_∞ = 1; |{i ∈ [n] : |x_new,i| = 1}| ≥ n/2; and, letting a_j⊤ denote the j-th row of A ∈ R^{m×n}, for all j ∈ [m],
|⟨a_j, x_new⟩ − ⟨a_j, x⟩| ≤ O(herdisc(A)·log(m)·log^{1/2}(m/δ)).

Case 2. With probability at most δ, Algorithm 4 outputs fail at Line 28.

Case 3. With probability at most 9/10, Algorithm 4 outputs fail at Line 34.

Proof. We start by choosing the parameters:
N := Θ(ϵ^{−2} + n) = O(n + log(mn)), ϵ := Θ((log(mn) + n)^{−1/2}) = O(min{log^{−1/2}(mn), n^{−1/2}}).

Proof of Case 1. We first show that at any iteration t ∈ [N], no entry of x + u_t + ĝ_t has absolute value larger than 1. Note that in the t-th iteration, if the i-th entry of x + u_t + ĝ_t reaches absolute value 1, we add e_i to the matrix V_t (Line 22). In all following iterations, after we obtain the query g (we omit the subscript t for g and V here), we have by Lemma F.4 that g is the projection of the original Gaussian vector onto the orthogonal complement of the row space of V; thus the i-th entry of g is 0. Hence, in the following iterations, the amount added to the i-th entry of u stays 0. This proves the claim.

By the algorithm design, the number of entries reaching absolute value 1 only increases, and by the condition check at Line 26, we have |{i ∈ [n] : |x_new,i| = 1}| ≥ n/2. Then, by Lemma E.7, with probability at least 1/10 − δ, for all j ∈ [m],
|⟨a_j, x_new⟩ − ⟨a_j, x⟩| = |⟨a_j, u_{N+1}⟩| ≤ β,
where we note that
β = O(η·min{log^{−1/2}(mn), n^{−1/2}}·√(n + log m)·√(log(m/δ))) = O(η·log^{1/2}(m/δ)).
Combined with η = O(herdisc(A)·log(m/n)), this gives the stated bound.

Proof of Case 2. From Lemma E.7 we know that with failure probability at most δ there exists j ∈ [m] with |⟨a_j, u_{N+1}⟩| > β. This proves the claim.

Proof of Case 3. By Lemma E.8, Algorithm 4 returns fail at Line 34 with probability at most 4/5. This completes the proof.

Now, accounting for the probability that the returned V and η are not good, we have the following corollary.

Corollary E.3. Let A ∈ R^{m×n} be an input matrix and x ∈ (−1,1)^n an input partial coloring. Let δ_final ∈ (0, 0.01) denote a parameter, and assume δ = 1/100 in the algorithm. Then:

Case 1. With probability at least 1/10 − δ − δ_final, Algorithm 4 outputs a vector x_new ∈ R^n such that: ∥x_new∥_∞ = 1; |{i ∈ [n] : |x_new,i| = 1}| ≥ n/2; and, letting a_j⊤ denote the j-th row of A ∈ R^{m×n}, for all j ∈ [m],
|⟨a_j, x_new⟩ − ⟨a_j, x⟩| ≤ O(herdisc(A)·log(m)·log^{1/2}(m)).

Case 2. With probability at most δ + δ_final, Algorithm 4 outputs fail at Line 28.

Case 3. With probability at most 9/10 + δ_final, Algorithm 4 outputs fail at Line 34.

Case 4. With probability at most δ_final, Algorithm 4 outputs an x_new with no guarantee.

Proof. Recall that Lemma E.2 conditions on V and η being good; here we account for the failure probability of these two operations. By the construction of Algorithm 4, when we call FASTPROJECTTOSMALLROWS(A, δ₁) at Line 5 and FASTAPPROXMAXNORM(A, V, δ₂) at Line 6, we set δ₁ = δ₂ = δ_final²/n², so by a union bound,
Pr[V, η good] ≥ 1 − 2δ_final²/n² ≥ 1 − δ_final,  (4)
and writing "V, η not good" for the complement event,
Pr[V, η not good] ≤ δ_final.  (5)

Proof of Case 1. By Lemma E.2,
Pr[case 1 happens | V, η good] ≥ 1/10 − δ.  (6)
By basic probability rules,
Pr[case 1 happens] = Pr[case 1 happens | V, η good]·Pr[V, η good] + Pr[case 1 happens | V, η not good]·Pr[V, η not good]
≥ Pr[case 1 happens | V, η good]·Pr[V, η good]
≥ (1/10 − δ)·Pr[V, η good]
≥ (1/10 − δ)·(1 − δ_final)
≥ 1/10 − δ − δ_final,
where the second step follows from Pr[·] ≥ 0, the third step from Eq. (6), and the fourth step from Eq. (4).

Proof of Case 2. By Lemma E.2, conditioned on V and η being good, this case happens with probability at most δ:
Pr[case 2 happens | V, η good] ≤ δ.  (7)
By basic probability rules,
Pr[case 2 happens] = Pr[case 2 happens | V, η good]·Pr[V, η good] + Pr[case 2 happens | V, η not good]·Pr[V, η not good]
≤ Pr[case 2 happens | V, η good] + Pr[V, η not good]
≤ δ + δ_final,
where the second step follows from Pr[·] ≤ 1, and the last step from Eq. (7) and Eq. (5).

Proof of Case 3. By Lemma E.2, conditioned on V and η being good, this case happens with probability at most 9/10:
Pr[case 3 happens | V, η good] ≤ 9/10.  (8)
By the same argument as Case 2,
Pr[case 3 happens] ≤ Pr[case 3 happens | V, η good] + Pr[V, η not good] ≤ 9/10 + δ_final,
using Pr[·] ≤ 1, Eq. (8), and Eq. (5).

Proof of Case 4. This case happens only when at least one of the returned V and η is not good and the algorithm returns x_new at Line 30. By Eq. (5), Pr[case 4 happens] ≤ δ_final.

E.3. Iterative Notations

Here in this section, we introduce notation used in the later proofs. We first define the matrix V_t that we maintain.

Definition E.4 (Matrix V of each iteration). For all t ∈ [N], we define V_t to be the matrix with orthonormal rows implicitly maintained in the MAINTAIN data structure at the start of the t-th iteration. For t = 1, we define V₁ to be the matrix generated by FASTPROJECTTOSMALLROWS at Line 5 of Algorithm 4.
We then define the Gaussian random vectors related to the above V_t.

Definition E.5 (Random Gaussian vectors). For all t ∈ [N], in the t-th iteration we define g̃_t ∈ R^n to be a random Gaussian vector with i.i.d. N(0,1) coordinates. We define the vector g^{V_t} to be the one obtained by querying the MAINTAIN data structure at Line 15 of Algorithm 4; by Lemma F.4, g^{V_t} = (I − V_t⊤V_t)·g̃_t. And we define ĝ_t to be the vector rescaled from g^{V_t} at Line 18 of Algorithm 4, i.e., ĝ_t := min{ϵ, µ}·g^{V_t}.

Throughout Algorithm 4, we iteratively maintain a vector u ∈ R^n that accumulates the rescaled Gaussian steps of each iteration. We formally define it as follows.

Definition E.6 (Accumulated maintained vector). For all t ∈ [N], we define the accumulated maintained vector u_t ∈ R^n by u_{t+1} := u_t + ĝ_t, with u₁ = 0_n.

E.4. Upper Bound for the Inner Products

The goal of this section is to prove Lemma E.7, which upper bounds the inner products of the output vector with the rows of the input matrix.

Lemma E.7 (A variation of Lemma 4 in (Larsen, 2023)). Let β = Θ(ϵη·√(N·log(m/δ))). For any input matrix A ∈ R^{m×n} and any vector x ∈ (−1,1)^n, let a_i⊤ denote the i-th row of A, and let u_{N+1} ∈ R^n denote the accumulated vector (Definition E.6) at the end of the last iteration of Algorithm 4. Then
Pr[for all i ∈ [m] : |⟨a_i, u_{N+1}⟩| < β] ≥ 1−δ.

Proof. To prove the lemma, we first fix i ∈ [m] and consider the t-th iteration of the outer for-loop (Line 14 in Algorithm 4). Let g^{V_t} be as in Definition E.5. For those t at which Algorithm 4 has already returned fail before the t-th iteration, we define ĝ_t = 0 ∈ R^n. With the cumulative vector u_{t+1} = Σ_{j=1}^t ĝ_j, we have
⟨a_i, u_{N+1}⟩ = Σ_{j=1}^N ⟨a_i, ĝ_j⟩.

Now condition on the previous t−1 iterations and consider the t-th iteration, in which we generate a vector g̃_t ∈ R^n whose entries are drawn i.i.d. from N(0,1). For any ξ > 0,
Pr[|⟨a_i, ĝ_t⟩| > ξ] = Pr[|min{ϵ, µ}·⟨a_i, (I − V_t⊤V_t)·g̃_t⟩| > ξ] ≤ Pr[|ϵ·⟨a_i, (I − V_t⊤V_t)·g̃_t⟩| > ξ],
where the first step follows from Lemma F.4, and the second step from min{ϵ, µ} ≤ ϵ. By linearity of the Gaussian distribution (Claim A.6),
ϵ·⟨a_i, (I − V_t⊤V_t)·g̃_t⟩ ∼ N(0, ϵ²·∥a_i⊤(I − V_t⊤V_t)∥₂²),
and we can bound the variance by ϵ²·∥a_i⊤(I − V_t⊤V_t)∥₂² ≤ ϵ²·η², by the definition of η. Thus, using Claim A.5, for any a > 0,
Pr[|⟨a_i, ĝ_t⟩| > a] ≤ Pr[|ϵ·⟨a_i, (I − V_t⊤V_t)·g̃_t⟩| > a] ≤ 2·exp(−a²/(2ϵ²η²)).

Finally, since g̃_t is independent of the u_i for i ∈ [t−1], we have E[⟨a_i, ĝ_t⟩ | u_{t−1}, ..., u₁] = 0. This follows from the fact that µ is defined at Line 17 as
µ := arg max_{µ>0} {max{∥x + u_t + µ·g^{V_t}∥_∞, ∥x + u_t − µ·g^{V_t}∥_∞} = 1},
which is symmetric in ±g^{V_t}. Thus the sequence ⟨a_i, ĝ₁⟩, ..., ⟨a_i, ĝ_N⟩ is a martingale difference sequence (Definition A.7) with respect to u₁, ..., u_N. Then by Theorem A.8, with b = 2 and c = (ϵη)^{−2}/2, with probability at least 1−δ′,
|⟨a_i, u_{N+1}⟩| = |Σ_{t=1}^N ⟨a_i, ĝ_t⟩| ≤ ϵη·√(112·N·lg(1/δ′)).
Setting δ′ = δ/m and β = 100·ϵη·√(N·log(m/δ)), we have
Pr[|⟨a_i, u_{N+1}⟩| ≥ β] ≤ δ/m.
Applying a union bound over all i ∈ [m] completes the proof.

E.5. Upper Bound for the Failure Probability

Here in this section, the goal is to prove Lemma E.8, which upper bounds the probability that the algorithm returns fail at Line 34.
Lemma E.8 (A variation of Lemma 5 in (Larsen, 2023)). For any input matrix A ∈ R^{m×n} and any vector x ∈ (−1,1)^n, the probability that Algorithm 4 returns fail at Line 34 is at most 4/5.

Proof. The basic idea of the proof is that after many iterations (N in Algorithm 4), the expectation of ∥x + u∥₂² is large enough; hence many entries of x + u will have reached absolute value 1. We denote by R the number of e_i's added to V_t at Line 22 over the run of Algorithm 4. Recall that g^{V_t} is the vector obtained at Line 15 of Algorithm 4 (Definition E.5), and that we add either ϵ·g^{V_t} or µ·g^{V_t} to u_t at Line 25.

In this paragraph, we first consider the case where we add µ·g^{V_t}, i.e., µ < ϵ. Recall that µ is chosen at Line 17 of Algorithm 4 to be
µ := arg max_{µ>0} {max{∥x + u_t + µ·g^{V_t}∥_∞, ∥x + u_t − µ·g^{V_t}∥_∞} = 1}.
Thus, at Line 20 of Algorithm 4, there must be at least one entry i ∈ [n] satisfying |x_i + u_{t,i} + ĝ_{t,i}| = 1 and |x_i + u_{t,i}| < 1.

We define ρ_t as the probability that the algorithm has not returned fail before reaching the t-th iteration and µ < ϵ in this iteration. For any i ∈ [n] and t ∈ [N], we define the event E_{i,t}: the index i satisfies |x_i + u_{t,i} + µ·g^{V_t}_i| = 1 and |x_i + u_{t,i}| < 1 at Line 20 of the t-th iteration. By the symmetry of the vector g^{V_t} (since g̃_t is Gaussian, g^{V_t} and −g^{V_t} are equally likely conditioned on V_t), we have
E[|{i ∈ [n] : E_{i,t} holds}|] ≥ ρ_t/2.
Denoting by r_t the number of rows added to V_t in the t-th iteration, the analysis above gives E[r_t] ≥ ρ_t/2. By the construction of our algorithm, no more than n e_i's can be added to V over the whole run, thus
Σ_{t=1}^N ρ_t/2 ≤ E[R] ≤ n, i.e., Σ_{t=1}^N ρ_t ≤ 2n.

Now, for each iteration t ∈ [N], recall that ĝ_t is the vector added to u_t at Line 25 of Algorithm 4. After N iterations, x + u_{N+1} is the final vector output by the algorithm (if it does not return fail). We have
E[∥x + u_{N+1}∥₂²] = E[∥x + Σ_{t=1}^N ĝ_t∥₂²] = ∥x∥₂² + Σ_{t=1}^N E[∥ĝ_t∥₂²],  (9)
where the second step follows from E[ĝ_t | ĝ₁, ..., ĝ_{t−1}] = 0.

Recall that V_t is the matrix V at the start of the t-th iteration (Definition E.4; in our algorithm this matrix is maintained implicitly in the MAINTAIN data structure), and that g̃_t denotes the original Gaussian vector before projection (Definition E.5). By Claim E.9,
E[∥ĝ_t∥₂²] ≥ (1−δ)·ϵ²·(n − n/4 − E[R]) − ϵ²·√(2ρ_t)·n.
By Claim E.10 we have the lower bound E[R] ≥ 0.63n, and using Claim E.11, we obtain Pr[R ≥ n/2] > 0.2. When R ≥ n/2, Algorithm 4 must terminate and return x_new at Line 30. Thus the probability of terminating and returning fail at Line 34 is at most 4/5. This completes the proof.

E.6. Lower Bound for the Vector after Projection

Here in this section, we prove the following claim, which lower bounds the expected squared norm of the projected vector ĝ_t.

Claim E.9 (Lower bound for the projected vector ĝ_t). Let R be the total number of e_i's added to V_t at Line 22 over the run of Algorithm 4, and let ρ_t be the probability that the algorithm has not returned fail before reaching the t-th iteration and µ < ϵ in this iteration. Then
E[∥ĝ_t∥₂²] ≥ (1−δ)·ϵ²·(n − n/4 − E[R]) − ϵ²·√(2ρ_t)·n.

Proof. For any t ∈ [N], define the indicators
F_t := 1{for all i ∈ [m] : |⟨a_i, u_N⟩| < β} and Y_t := 1{µ < ϵ}.
Then we have
E[∥ĝ_t∥₂²] = E[F_t·Y_t·µ²·∥(I − V_t⊤V_t)·g̃_t∥₂² + F_t·(1−Y_t)·ϵ²·∥(I − V_t⊤V_t)·g̃_t∥₂²]
≥ E[F_t·(1−Y_t)·ϵ²·∥(I − V_t⊤V_t)·g̃_t∥₂²]
≥ ϵ²·E[F_t·∥(I − V_t⊤V_t)·g̃_t∥₂²] − ϵ²·E[Y_t·∥(I − V_t⊤V_t)·g̃_t∥₂²],
where the first step follows from the definitions of F_t and Y_t, the second step follows from F_t·Y_t·µ²·∥(I − V_t⊤V_t)·g̃_t∥₂² ≥ 0, and the last step follows from splitting the terms (using F_t ≤ 1).

Using the Cauchy-Schwarz inequality, we have
E[Y_t·∥(I − V_t⊤V_t)·g̃_t∥₂²] ≤ √(E[Y_t²]·E[∥(I − V_t⊤V_t)·g̃_t∥₂⁴]).
Since Y_t is an indicator, E[Y_t²] = E[Y_t]. Since (I − V_t⊤V_t) is a projection matrix, we have
∥(I − V_t⊤V_t)·g̃_t∥₂ ≤ ∥g̃_t∥₂.  (10)
Thus the following holds:
E[Y_t·∥(I − V_t⊤V_t)·g̃_t∥₂²] ≤ √(E[Y_t]·E[∥g̃_t∥₂⁴]) ≤ √(ρ_t·E[∥g̃_t∥₂⁴]),
where the first step follows from Eq. (10) and the second step from the definition of ρ_t.

For any i ∈ [n], denote the i-th entry of g̃_t by g̃_{t,i}. Note that g̃_{t,i} ∼ N(0,1) for all i ∈ [n] and t ∈ [N], and all the g̃_{t,i} are i.i.d. Thus
E[∥g̃_t∥₂⁴] = E[(Σ_{i=1}^n g̃_{t,i}²)²] = Σ_{i=1}^n Σ_{j=1}^n E[g̃_{t,i}²·g̃_{t,j}²].
For the terms with i ≠ j, by independence, E[g̃_{t,i}²·g̃_{t,j}²] = E[g̃_{t,i}²]·E[g̃_{t,j}²] = 1. For the terms with i = j, by the fourth moment of the normal distribution, E[g̃_{t,i}⁴] = 3. Thus
E[∥g̃_t∥₂⁴] ≤ n(n−1) + 3n ≤ 2n²,
and hence E[Y_t·∥(I − V_t⊤V_t)·g̃_t∥₂²] ≤ √(2ρ_t)·n.

Also, we have
E[F_t·∥(I − V_t⊤V_t)·g̃_t∥₂²] = Pr[F_t = 1]·E[∥(I − V_t⊤V_t)·g̃_t∥₂²] = Pr[F_t = 1]·E[n − dim(V_t)] = Pr[F_t = 1]·(n − E[dim(V_t)]),
where the first step follows from the definition of expectation, the second step follows from the property of the vector (I − V_t⊤V_t)·g̃_t (since g̃_t is Gaussian), and the last step again follows from the definition of expectation.

Note that for any t ∈ [N] we have dim(V_t) ≤ R + n/4, so E[dim(V_t)] ≤ E[R] + n/4. Hence we conclude
E[∥ĝ_t∥₂²] ≥ Pr[F_t = 1]·ϵ²·(n − n/4 − E[R]) − ϵ²·√(2ρ_t)·n.
By Lemma E.7, Pr[F_t = 1] ≥ 1−δ. Hence
E[∥ĝ_t∥₂²] ≥ (1−δ)·ϵ²·(n − n/4 − E[R]) − ϵ²·√(2ρ_t)·n.
This completes the proof.
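The fourth-moment computation above is easy to sanity check numerically (a throwaway numpy experiment; the dimension and sample count are arbitrary demo choices):

import numpy as np

# Monte Carlo check of E[||g||_2^4] = n(n-1) + 3n = n^2 + 2n for g ~ N(0, I_n).
rng = np.random.default_rng(3)
n, trials = 50, 200_000
g = rng.standard_normal((trials, n))
print((np.sum(g * g, axis=1) ** 2).mean(), n * n + 2 * n)  # both close to 2600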
E.7. Lower Bound on Expectation

Here we prove the following claim, which lower bounds the expected number of unit vectors added to V over the run of the algorithm.

Claim E.10. Let R be the number of e_i's added to V_t at Line 22 over the run of Algorithm 4, and let ρ_t be the probability that Algorithm 4 has not returned fail before the t-th iteration and µ < ϵ in this iteration. Assume ρ_t ≤ 0.005 and δ ≤ 0.01, and assume
E[∥ĝ_t∥₂²] ≥ (1−δ)·ϵ²·(n − n/4 − E[R]) − ϵ²·√(2ρ_t)·n.
Then E[R] ≥ 0.63n.

Proof. By the assumption of this claim,
E[∥ĝ_t∥₂²] ≥ (1−δ)·ϵ²·(n − n/4 − E[R]) − ϵ²·√(2ρ_t)·n ≥ 0.99·ϵ²·(n − n/4 − E[R] − 0.102n) = 0.99·ϵ²·(0.698n − E[R]),
using ρ_t ≤ 0.005 and δ ≤ 0.01. Note that if ρ_t > 0.005, we lower bound E[∥ĝ_t∥₂²] by 0. Recall that Σ_{t=1}^N ρ_t ≤ 2n, so the number of iterations with ρ_t > 0.005 is at most 400n. Considering the remaining N − 400n iterations, it holds that
E[∥ĝ_t∥₂²] ≥ 0.99·ϵ²·(0.698n − E[R]).  (11)
Hence
E[∥x + u_{N+1}∥₂²] ≥ ∥x∥₂² + (N − 400n)·0.99·ϵ²·(0.698n − E[R]) ≥ (N − 400n)·0.99·ϵ²·(0.698n − E[R]) = 16·0.99·(0.698n − E[R]),
where the first step follows from Eq. (9) and Eq. (11), the second step follows from ∥x∥₂² ≥ 0, and the last step follows by setting N = 16ϵ^{−2} + 400n.

Notice that Algorithm 4 only ever produces an output with ∥x + u_{N+1}∥_∞ ≤ 1. Thus E[∥x + u_{N+1}∥₂²] ≤ n, and so
n ≥ 16·0.99·(0.698n − E[R]).
By the above inequality, it holds that E[R] ≥ (0.698 − 1.02·(1/16))·n > 0.634n ≥ 0.63n.

E.8. From Expectation to Probability

Claim E.11. If E[R] ≥ 0.62n, then Pr[R ≥ n/2] > 0.2.

Proof. Define Z := n − R. Then
E[Z] = n − E[R] ≤ 0.38n,
where the first step follows from the definition of Z and the second from E[R] ≥ 0.62n. Then by Markov's inequality, Pr[Z > a] ≤ E[Z]/a. Thus
Pr[Z > n/2] ≤ 2·0.38 = 0.76,
which implies Pr[Z ≤ n/2] > 0.2. Finally, since Z = n − R, we get Pr[R ≥ n/2] > 0.2.

E.9. Running Time

The goal of this section is to prove Lemma E.12.

Lemma E.12. For any parameter a ∈ [0,1], the algorithm FASTPARTIALCOLORING (Algorithm 4) runs in Õ(nnz(A) + n^ω + n^{2+a} + n^{1+ω(1,1,a)−a}) time, where ω is the exponent of matrix multiplication.

Proof. The running time of FASTPARTIALCOLORING can be divided as follows:
• Line 5 takes Õ(nnz(A) + n^ω) time, by Theorem C.2.
• Line 6 takes Õ(nnz(A) + n²) time, by Lemma E.1.
• Line 13 takes O(n^ω + n²) time to initialize the data structure, by Lemma F.7.
The following run N = O(ϵ^{−2} + n) = O(n + log m) times:
• Line 15 takes O(n^{1+a}) time to query g, by Lemma F.7.
• Line 17 takes O(n) time to find µ, by Lemma I.1 (via Algorithm 11).
• Line 22 takes O(n^{1+a}) time to update the data structure, by Lemma F.7.
Over the entire algorithm, we enter Line 28 at most once, and checking its condition takes only nnz(A) time. Adding these together, and using the tighter time bounds for the operations of the FASTMAINTAIN data structure (Lemma F.10), the total running time of FASTPARTIALCOLORING is
T_FASTPARTIALCOLORING = Õ(nnz(A) + n^ω + n^{2+a} + n^{1+ω(1,1,a)−a}).
Therefore, we complete the proof.

F. Fast Maintaining Lazy Data Structure

In this section, we provide a faster maintenance data structure, based on the idea of lazy updates, for the partial coloring algorithm. In Section F.1 we state the main theorem of the data structure. In Sections F.2, F.3, F.4, and F.5 we give the analyses of the INIT, QUERY, UPDATE, and RESTART procedures, respectively. In Section F.6 we provide the running time analysis. In Section F.7 we give the correctness analysis. In Section F.8 we give a tighter running time analysis.

Algorithm 6
1: data structure MAINTAIN ▷ Theorem F.1
2: members
3:   P ∈ R^{n×n} ▷ projection matrix
4:   G ∈ R^{n×N} ▷ the stored Gaussian matrix
5:   K ∈ N₊ ▷ batch size
6:   ℓ
7:   τ_q ▷ total counter for queries
8:   k_u ▷ counter for the update process
9:   k_q ▷ counter for the query process
10:  w₁, ..., w_K ∈ R^n ▷ This is a list
11: end members
12:
13: public:
14: procedure INIT(V ∈ R^{ℓ×n}, G ∈ R^{n×N}, K) ▷ Lemma F.2
15:   k_u ← 0, k_q ← 0
16:   τ_q ← 0
17:   P ← V⊤V ▷ This takes Tmat(n, n, n) time
18:   Let S = {1, 2, ..., K}
19:   Let Ḡ ∈ R^{n×K} denote G_{·,S}
20:   Compute G̃ = P·Ḡ and let g̃₁, ..., g̃_K denote these K new vectors ▷ This takes Tmat(K, n, n) time
21:   for i = 1 → K do
22:     w_i ← 0_n
23:   end for
24: end procedure
25:
26: public:
27: procedure QUERY() ▷ Lemma F.3, Lemma F.4
28:   k_q ← k_q + 1
29:   τ_q ← τ_q + 1
30:   if k_q = K then
31:     RESTART()
32:   end if
33:   return g_{k_q} − g̃_{k_q} − Σ_{i=1}^{k_u} w_i·w_i⊤·g_{k_q} ▷ This takes O(nK) time
34: end procedure
35: end data structure

F.1. Main Data Structure

The goal of this section is to prove Theorem F.1.

Theorem F.1. For an input matrix V ∈ R^{ℓ×n} with ℓ ≤ n, a matrix G ∈ R^{n×N}, and an integer K > 0 with N = O(n), there exists a data structure MAINTAIN (Algorithms 6 and 7) that uses O(n²) space and supports the following operations:
• INIT(V ∈ R^{ℓ×n}, G ∈ R^{n×N}, K ∈ N₊): It initializes the projection matrix P = V⊤V ∈ R^{n×n} and stores the matrix G. For the first K column vectors of G, it computes their projections under P. It sets the counters k_q, k_u, and τ_q to zero. This procedure runs in time Tmat(n, n, n) + Tmat(K, n, n).
Algorithm 7 Update
1: data structure MAINTAIN ▷ Theorem F.1
2: public:
3: procedure UPDATE(u ∈ R^n) ▷ Lemma F.5
4:   ▷ The input vector u is 1-sparse and its nonzero entry equals 1
5:   k_u ← k_u + 1
6:   w̃ ← ((I − P) − Σ_{i=1}^{k_u−1} w_i·w_i⊤)·u ▷ This takes O(nK) time
7:   w_{k_u} ← w̃/∥w̃∥₂
8:   if k_u = K then
9:     RESTART()
10:  end if
11: end procedure
12:
13: private:
14: procedure RESTART() ▷ Lemma F.6
15:   P ← P + Σ_{i=1}^K w_i·w_i⊤ ▷ This takes Tmat(n, K, n) time
16:   Let S = {1+τ_q, 2+τ_q, ..., K+τ_q}
17:   Let Ḡ ∈ R^{n×K} denote G_{·,S}
18:   Compute G̃ = P·Ḡ and let g̃₁, ..., g̃_K denote these K new vectors ▷ This takes Tmat(K, n, n) time
19:   k_u ← 0, k_q ← 0
20:   for i = 1 → K do
21:     w_i ← 0_n ▷ Reset w_i to an all-zero vector
22:   end for
23: end procedure
24: end data structure

• QUERY(): It outputs (I − V⊤V)·g, where g ∈ R^n is the next column vector of G to be used. The running time of this procedure is O(nK) if k_q ∈ [K−1], and Tmat(n, K, n) if k_q = K.
• UPDATE(u ∈ R^n): It takes a 1-sparse vector u ∈ R^n as input and maintains V by adding w as the next row of V (following Algorithm 1: w = w̃/∥w̃∥₂ where w̃ = (I − V⊤V)·u; clearly w ⊥ V). The running time of this procedure is O(nK) if k_u ∈ [K−1], and Tmat(n, K, n) if k_u = K.
• RESTART(): It updates the projection P = V⊤V, obtains K fresh Gaussian vectors by advancing the batch of Gaussian vectors by τ_q, and computes their projections under P. It resets the counters k_q and k_u to zero. The running time of this procedure is Tmat(n, K, n).

Proof. It follows from combining Lemma F.2, Lemma F.3, Lemma F.4, Lemma F.5, and Lemma F.6.

F.2. Initialization

The goal of this section is to prove Lemma F.2.

Lemma F.2 (Running time of INIT). The procedure INIT(V ∈ R^{ℓ×n}, G ∈ R^{n×N}, K ∈ N₊) initializes the projection matrix P = V⊤V ∈ R^{n×n}, stores the matrix G, computes the projections of the first K column vectors of G under P, and sets the counters k_q, k_u, and τ_q to zero. This procedure runs in time Tmat(n, n, n) + Tmat(K, n, n).

Proof. The running time of INIT can be divided across the following lines:
• Line 17 takes Tmat(n, n, n) time to compute the matrix P.
• Line 19 takes O(nK) time to form the matrix Ḡ.
• Line 20 takes Tmat(K, n, n) time to compute the matrix G̃.
Taking these together, the total running time is Tmat(n, n, n) + Tmat(K, n, n). Thus, we complete the proof.

F.3. Query

The goal of this section is to prove Lemma F.3 and Lemma F.4.

Lemma F.3 (Running time of QUERY). The running time of procedure QUERY is O(nK) if k_q ∈ [K−1], and Tmat(n, K, n) if k_q = K.

Proof. The proof splits into two cases. For the case k_q ∈ [K−1], Line 33 clearly takes O(n·k_u) = O(nK) time. For the case k_q = K, Line 31 runs in Tmat(n, K, n) time. Thus we complete the proof.

Lemma F.4 (Correctness of QUERY). The procedure QUERY outputs (I − V⊤V)·g, where g ∈ R^n is the next vector of G to be used.

[Figure 5. The decomposition of the output vector of QUERY: the pre-computed factor g̃_{k_q} = P·g_{k_q} and the newly added rank-1 corrections Σ_{i=1}^{k_u} w_i·w_i⊤·g_{k_q} together form the projection of g onto the row span of V.]

[Figure 6. Visualization of the decomposition. We denote by V₁ the matrix of rows already encoded into P and by V₂ the matrix of rows not yet encoded, so P = V₁⊤V₁. The projection onto span(V) splits into the projection onto span(V₁) and the projection onto span(V₂): V⊤V·g = V₁⊤V₁·g + V₂⊤V₂·g = g̃ + Σ_{i=1}^{k_u} w_i·w_i⊤·g.]
Proof. We divide the rows of V ∈ R^{ℓ×n} into two parts: the rows that have been encoded into P, and the ones that have not. Without loss of generality, the first ℓ − k_u rows are encoded and the last k_u rows are not. Thus we have
(I − V⊤V)·g = g − Σ_{i=1}^{ℓ} v_i·v_i⊤·g = g − Σ_{i=1}^{ℓ−k_u} v_i·v_i⊤·g − Σ_{i=ℓ−k_u+1}^{ℓ} v_i·v_i⊤·g = g − g̃ − Σ_{i=1}^{k_u} w_i·w_i⊤·g,
where the first step splits the multiplication, the second step splits the summation, and the last step follows from the definition of g̃ and the construction of P. Thus we complete the proof.

F.4. Update

The goal of this section is to prove Lemma F.5.

Lemma F.5 (Running time of UPDATE). The running time of procedure UPDATE is O(nK) if k_u ∈ [K−1], and Tmat(n, K, n) if k_u = K.

Proof. For the UPDATE procedure, the running time can be divided across the following lines:
• Line 6 takes O(nK) time.
• Line 7 takes O(n) time.
• If k_u reaches K, Line 9 takes Tmat(n, K, n) + Tmat(K, n, n) time.
Thus, if k_u does not reach K, the running time is O(nK); if k_u reaches K, the running time is Tmat(n, K, n) + Tmat(K, n, n).

F.5. Restart

The goal of this section is to prove Lemma F.6.

Lemma F.6 (Running time of RESTART). The running time of procedure RESTART is O(Tmat(K, n, n)).

Proof. For the RESTART procedure, the running time can be divided across the following lines:
• Line 15 takes Tmat(n, K, n) time.
• Line 17 takes O(nK) time.
• Line 18 takes Tmat(K, n, n) time.
Adding them together, the running time is Tmat(n, K, n) + Tmat(K, n, n) = O(Tmat(K, n, n)).

F.6. Running Time of the Maintain Algorithm

The goal of this section is to prove Lemma F.7.

Lemma F.7 (Running time of FASTMAINTAIN). Let N = O(n). For any parameter a ∈ [0, α], let K = n^a, where α is the dual exponent of matrix multiplication. For any input G ∈ R^{n×N} and input b ∈ {0,1}^{n×N} such that only one entry of each row of b is 1, the procedure FASTMAINTAIN (Algorithm 8) runs in time
O(n^ω + n^{2+a} + n^{3−a}),
where currently ω ≈ 2.373 and α ≈ 0.31. Consequently, if we choose a = min{α, 1/2}, the running time simplifies to O(n^{2.5} + n^{3−α}).

Remark F.8. Instead of using α, we can use the ω(·,·,·) function from Table 3 in (Gall & Urrutia, 2018). In Lemma F.10, we provide a tighter analysis than Lemma F.7.

Proof. The running time can be divided as follows:
• Line 2 takes Tmat(n, n, n) + Tmat(K, n, n) = O(n^ω + n²) time to initialize the data structure.
The following lines run N = O(n) times (we ignore the restart condition here):
• Line 4 takes O(nK) = O(n^{1+a}) time to query g.
• Line 7 takes O(nK) = O(n^{1+a}) time to update the data structure.
The above runs in O(n^{2+a}) time in total. For the restart condition, note that UPDATE is called O(n) times, so RESTART is called O(n/K) = O(n^{1−a}) times, each taking O(n²) time; the same holds for the RESTART calls triggered by QUERY. Thus the total restart time is bounded by O(n^{3−a}).
Combining the above, the running time of FASTMAINTAIN is O(n^ω + n^{2+a} + n^{3−a}).
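A compact Python rendering of the MAINTAIN idea follows (a sketch only, not the paper's implementation: it stores G explicitly, uses dense numpy products in place of fast matrix multiplication, and folds the τ_q bookkeeping of Algorithm 6 into a single query counter):

import numpy as np

class Maintain:
    # Lazy projection maintenance in the spirit of Algorithms 6-7: P encodes
    # V^T V for the rows folded in so far; up to K newer rows are kept as
    # explicit rank-1 terms w_i w_i^T and folded into P in one batched product
    # at restart; projections P g of the next K Gaussian columns are also
    # precomputed in one batch.
    def __init__(self, V, G, K):
        self.G, self.K, self.n = G, K, G.shape[0]
        self.P = V.T @ V                   # Tmat(n, n, n)
        self.tau = 0                       # total queries served
        self.W = np.zeros((0, self.n))     # pending rows, at most K
        self._precompute()

    def _precompute(self):                 # one Tmat(n, n, K) product
        self.Gt = self.P @ self.G[:, self.tau:self.tau + self.K]
        self.kq = 0

    def query(self):                       # O(nK) per call
        g = self.G[:, self.tau]
        out = g - self.Gt[:, self.kq] - self.W.T @ (self.W @ g)
        self.tau += 1
        self.kq += 1
        if self.kq == self.K:
            self._restart()
        return out

    def update(self, i):                   # add e_i to V lazily, O(nK)
        u = np.zeros(self.n)
        u[i] = 1.0
        w = u - self.P[:, i] - self.W.T @ self.W[:, i]  # ((I-P) - sum w_j w_j^T) e_i
        self.W = np.vstack([self.W, w / np.linalg.norm(w)])
        if self.W.shape[0] == self.K:
            self._restart()

    def _restart(self):                    # fold rank-1 terms: Tmat(n, K, n)
        self.P = self.P + self.W.T @ self.W
        self.W = np.zeros((0, self.n))
        self._precompute()

rng = np.random.default_rng(4)
n = 30
Q, _ = np.linalg.qr(rng.standard_normal((n, 5)))
V = Q.T                                    # 5 orthonormal rows
G = rng.standard_normal((n, 16))
ds = Maintain(V, G, K=4)
print(np.allclose(ds.query(), G[:, 0] - V.T @ (V @ G[:, 0])))  # True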
F.7. Correctness

Algorithm 8 The purpose of writing down SLOWMAINTAIN is to prove the correctness of FASTMAINTAIN; see Lemma F.9.
1: procedure FASTMAINTAIN(b ∈ {0,1}^{n×N}, G ∈ R^{n×N}, n, N)
2:   ds.INIT(V, G, K) ▷ Algorithm 6
3:   for t = 1 → N do
4:     g_t ← ds.QUERY() ▷ Algorithm 6
5:     for i = 1 → n do
6:       if b_{i,t} = 1 then
7:         ds.UPDATE(e_i) ▷ Algorithm 7
8:       end if
9:     end for
10:  end for
11:  return g₁, ..., g_N
12: end procedure
13:
14: procedure SLOWMAINTAIN(b ∈ {0,1}^{n×N}, G ∈ R^{n×N}, n, N)
15:  for t = 1 → N do
16:    g_t ← (I − V⊤V)·G_{·,t} ▷ G_{·,t} is the t-th column of G
17:    for i = 1 → n do
18:      if b_{i,t} = 1 then
19:        Add e_i to V according to Algorithm 1 and update V
20:      end if
21:    end for
22:  end for
23:  return g₁, ..., g_N
24: end procedure

The goal of this section is to prove Lemma F.9.

Lemma F.9. For any input b ∈ {0,1}^{n×N} and G ∈ R^{n×N}, the N vectors output by procedure FASTMAINTAIN (Algorithm 8) are exactly the same as the N vectors output by procedure SLOWMAINTAIN (Algorithm 8).

Proof. For all i ∈ [N], we denote the vectors output by the two algorithms by g_{i,F} for FASTMAINTAIN and g_{i,S} for SLOWMAINTAIN.

Induction hypothesis. For any t ∈ [N], we have g_{i,F} = g_{i,S} for all i ∈ [t−1].

Base case. For i = 1, we have by Lemma F.4 that g_{1,F} = (I − V⊤V)·G_{·,1} = g_{1,S}. Thus the base case holds.

Inductive case. With the binary matrix b, some vectors have been added to V by the time the procedure reaches the t-th iteration. We distinguish the following two circumstances:

No vectors were added to V after the last query of g_{t−1}. Since no vectors were added to V after the query of g_{t−1}, V stays the same when querying g_t. Thus by Lemma F.4, g_{t,F} = (I − V⊤V)·G_{·,t} = g_{t,S}. Hence we have correctness.

Some vectors were added to V after the last query of g_{t−1}. In this case, FASTMAINTAIN called UPDATE some number of times, adding vectors to ds. Without loss of generality, assume UPDATE never triggered RESTART. The new vectors are then encoded as w's in ds. We divide the rows of V into: rows added before querying g_{t−1}, the first W₀ of them; and rows added after querying g_{t−1} and before querying g_t, the last W₁ of them. Denoting the Gaussian vector by g̃_t, the slow algorithm queries
g_{t,S} = (I − V⊤V)·g̃_t = g̃_t − Σ_{i∈[W₀]} w_{i,S}·w_{i,S}⊤·g̃_t − Σ_{i=W₀+1}^{W₀+W₁} w_{i,S}·w_{i,S}⊤·g̃_t.
In FASTMAINTAIN (Algorithm 8), when we call QUERY, we divide the rows into those encoded into P and those not:
g_{t,F} = g̃_t − g̃_{t,F} − Σ_{i∈[k_u]} w_{i,F}·w_{i,F}⊤·g̃_t.
If we denote the number of rows encoded into P by ℓ, then g̃_{t,F} = Σ_{i∈[ℓ]} w_{i,F}·w_{i,F}⊤·g̃_t. By the construction of the procedures, ℓ + k_u = W₀ + W₁, and w_{i,F} = w_{i,S} for each i ∈ [ℓ + k_u]. And by the use of the counter τ_q in ds, both procedures consume the same Gaussian column, so g̃_{t,F} = g̃_{t,S}. Then by Lemma F.4, the t-th query satisfies ds.QUERY() = (I − V⊤V)·G_{·,t}. Thus we complete the proof.
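The invariant underlying Lemma F.9 — that appending the unit residual w of a vector u against the row span of V updates the projector by exactly w·w⊤ — is easy to verify numerically (a self-contained numpy check with arbitrary demo sizes):

import numpy as np

# Check: if w is the unit residual of u against the row span of V, then the
# projector onto span(V and w) equals V^T V + w w^T.
rng = np.random.default_rng(5)
n = 20
Q, _ = np.linalg.qr(rng.standard_normal((n, 6)))
V = Q.T                                   # 6 orthonormal rows
u = rng.standard_normal(n)
w = u - V.T @ (V @ u)
w /= np.linalg.norm(w)
V_new = np.vstack([V, w])
print(np.allclose(V_new.T @ V_new, V.T @ V + np.outer(w, w)))  # True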
F.8. Tighter Running Time Analysis

The time analysis of Lemma F.7 is not tight. We can further improve it by using the function $\omega(\cdot,\cdot,\cdot)$.

Lemma F.10 (A tighter analysis of the running time of FASTMAINTAIN). Let $N = O(n)$. For any parameter $a \in [0, 1]$, let $K = n^a$. For any input $G \in \mathbb{R}^{n \times N}$ and an input $b \in \{0,1\}^{n \times N}$ such that only one entry of each row of $b$ is $1$, the procedure FASTMAINTAIN (Algorithm 8) runs in time
$$O(n^{\omega(1,1,1)} + n^{2+a} + n^{1+\omega(1,1,a)-a}).$$

For the current $\omega(\cdot,\cdot,\cdot)$ function (see Table 3 in (Gall & Urrutia, 2018)), we can choose $a = 0.529$, and the running time becomes $O(n^{2.53})$. For the ideal $\omega$, we can choose $a = 0.5$, and the running time becomes $O(n^{2.5})$.

G. Implicit Leverage Score Sampling Algorithm

In this section, we present two crucial subroutines: leverage score approximation, and a fast sampling algorithm based on it.

G.1. Leverage Score Approximation

We present the following algorithm, which provides a fast leverage score approximation.

Algorithm 9 Implicit Leverage Score Computation
1: procedure IMPLICITLEVERAGESCORE($A \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$, $\epsilon_\sigma = \Theta(1)$, $\delta_\sigma$)  ▷ Lemma G.2
2:   ▷ The goal of this procedure is to compute a constant approximation
3:   $s_1 \leftarrow \widetilde{O}(\epsilon_\sigma^{-2} n)$
4:   Choose $S_1 \in \mathbb{R}^{s_1 \times m}$ to be a sparse embedding matrix  ▷ Definition B.4
5:   Compute $S_1 A$  ▷ It takes $\widetilde{O}(\epsilon_\sigma^{-1}\,\mathrm{nnz}(A))$ time
6:   Compute $M \leftarrow (S_1 A)(I - V^\top V)$  ▷ It takes $\widetilde{O}(\epsilon_\sigma^{-2} n^{\omega})$ time
7:   Let $R \in \mathbb{R}^{n \times n}$ denote the $R$ of the QR factorization of $M$
8:   $s_2 \leftarrow \Theta(\log(m/\delta_\sigma))$
9:   Choose $S_2 \in \mathbb{R}^{n \times s_2}$ to be a JL matrix  ▷ Either Definition B.1 or Definition B.2
10:  Compute $N \leftarrow (I - V^\top V) R^{-1} S_2$  ▷ $N \in \mathbb{R}^{n \times s_2}$
11:  for $j = 1 \to m$ do  ▷ It takes $\widetilde{O}(\mathrm{nnz}(A))$ time
12:    Compute $\widetilde{\sigma}_j \leftarrow \|(e_j^\top A) N\|_2^2$
13:  end for
14:  return $\widetilde{\sigma}$  ▷ $\widetilde{\sigma}_j = \Theta(\sigma_j)$, $\forall j \in [m]$
15: end procedure

Definition G.1 (Leverage score). Let $A \in \mathbb{R}^{m \times n}$. The leverage score of $A$ is a vector $\sigma(A) \in \mathbb{R}^m$ satisfying $\sigma(A)_i = a_i^\top (A^\top A)^{-1} a_i$, where $a_i$ denotes the $i$-th row of $A$.

The above algorithm provides an approximation to all the leverage scores of the matrix $A(I - V^\top V)$; see the following lemma.

Lemma G.2 (Implicit leverage score). Given a matrix $A \in \mathbb{R}^{m \times n}$ and a matrix $V \in \mathbb{R}^{n \times n}$, there is an algorithm (procedure IMPLICITLEVERAGESCORE, Algorithm 9) that runs in $\widetilde{O}(\epsilon_\sigma^{-2}(\mathrm{nnz}(A) + n^{\omega}))$ time and outputs a vector $\widetilde{\sigma} \in \mathbb{R}^m$ such that $\widetilde{\sigma}$ approximates the leverage scores of the matrix $A(I - V^\top V)$, i.e., $\widetilde{\sigma} \in (1 \pm \epsilon_\sigma) \cdot \sigma(A(I - V^\top V))$, with probability at least $1 - \delta_\sigma$. Here $\widetilde{O}$ hides $\log(n/\delta_\sigma)$ factors.

Remark G.3. The only difference from the classical statement is that, in the classical setting, there is a single input matrix $A$. In our case, the target matrix is given implicitly in the form $A(I - V^\top V)$, and we are not allowed to explicitly write down $A(I - V^\top V)$. The correctness proof is similar to the literature (Clarkson & Woodruff, 2013; Nelson & Nguyễn, 2013).

For our task, we only need an overestimate of the leverage scores. We believe that the two-sided error bound might have future applications; therefore, we provide a two-sided error proof.
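Before the proof, the following NumPy sketch (ours) mirrors the flow of Algorithm 9 under simplifying assumptions: dense Gaussian sketches stand in for the sparse embedding and JL matrices (so the input-sparsity running time is not preserved), $V$ is assumed to have orthonormal rows, and $A(I - V^\top V)$ is assumed to have full column rank so that $R$ is invertible. The sketch sizes `s1` and `s2` are illustrative defaults, not the paper's theoretical choices.

```python
import numpy as np

def implicit_leverage_scores(A, V, s1=None, s2=64, seed=0):
    """Approximate the leverage scores of B = A (I - V^T V) without
    ever materializing B: only sketched n-column products are formed."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    s1 = s1 if s1 is not None else 4 * n
    S1 = rng.standard_normal((s1, m)) / np.sqrt(s1)  # subspace embedding
    SA = S1 @ A                                      # s1 x n
    M = SA - (SA @ V.T) @ V                          # (S1 A)(I - V^T V)
    R = np.linalg.qr(M, mode='r')                    # M = QR, R is n x n
    S2 = rng.standard_normal((n, s2)) / np.sqrt(s2)  # JL matrix
    X = np.linalg.solve(R, S2)                       # R^{-1} S2
    N = X - V.T @ (V @ X)                            # (I - V^T V) R^{-1} S2
    return np.sum((A @ N) ** 2, axis=1)              # sigma~_j = ||e_j^T A N||_2^2
```

Note that the expensive objects ($S_1 A$, $R$, $N$) all have only $n$ or $s_2$ columns; the $m \times n$ matrix $A(I - V^\top V)$ itself is never formed.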
Proof. We follow an approach of (Woodruff, 2014), but instead we use a high-precision sparse embedding matrix (Nelson & Nguyễn, 2013) and prove a two-sided bound on the leverage scores.

Correctness. Let $S_1$ be a sparse embedding matrix with $s_1 = O(\epsilon_\sigma^{-2} n \,\mathrm{poly}\log(n/(\epsilon_\sigma \delta_\sigma)))$ rows and column sparsity $O(\epsilon_\sigma^{-1} \log(n/\delta_\sigma))$. We first compute $M := (S_1 A)(I - V^\top V)$, then compute the QR decomposition $M = QR$. Now let $S_2 \in \mathbb{R}^{n \times s_2}$ with $s_2 = O(\epsilon_\sigma^{-2} \log(m/\delta_\sigma))$, where each entry of $S_2$ is an i.i.d. $N(0, 1/s_2)$ random variable. Then we generate the sketched matrix $N := (I - V^\top V) R^{-1} S_2$, and set $\widetilde{\sigma}_j = \|(e_j^\top A) N\|_2^2$ for all $j \in [m]$.

We argue that $\widetilde{\sigma}_j$ is a good approximation to $\sigma_j$. First, with failure probability at most $\delta_\sigma/m$, we have
$$\widetilde{\sigma}_j \in (1 \pm \epsilon_\sigma) \cdot \|e_j^\top A (I - V^\top V) R^{-1}\|_2^2$$
via Lemma B.8. Now it suffices to argue that $\|e_j^\top A(I - V^\top V) R^{-1}\|_2^2$ approximates $\|e_j^\top U\|_2^2$ well, where $U \in \mathbb{R}^{m \times n}$ is the matrix of left singular vectors of $A(I - V^\top V)$. To see this, first observe that for any $x \in \mathbb{R}^n$,
$$\|A(I - V^\top V) R^{-1} x\|_2^2 = (1 \pm \epsilon_\sigma)\,\|S_1 A (I - V^\top V) R^{-1} x\|_2^2 = (1 \pm \epsilon_\sigma)\,\|Q x\|_2^2 = (1 \pm \epsilon_\sigma)\,\|x\|_2^2,$$
where the first step follows from Lemma B.10, and the last step follows since $Q$ has orthonormal columns. This means that all singular values of $A(I - V^\top V) R^{-1}$ are in the range $[1 - \epsilon_\sigma, 1 + \epsilon_\sigma]$.

Now, since $U$ is an orthonormal basis for the column space of $A(I - V^\top V)$, the matrices $A(I - V^\top V) R^{-1}$ and $U$ have the same column space (since $R$ is full rank). This means there exists a change-of-basis matrix $T \in \mathbb{R}^{n \times n}$ with $A(I - V^\top V) R^{-1} T = U$. Our goal is to bound all singular values of $T$. For the upper bound, we claim the largest singular value is at most $1 + 2\epsilon_\sigma$. To see this, suppose for contradiction that the largest singular value is larger than $1 + 2\epsilon_\sigma$, and let $v$ be its corresponding (unit) singular vector. Since the smallest singular value of $A(I - V^\top V)R^{-1}$ is at least $1 - \epsilon_\sigma$, we have
$$\|A(I - V^\top V) R^{-1} T v\|_2^2 \ge (1 - \epsilon_\sigma)\,\|T v\|_2^2 > (1 - \epsilon_\sigma)(1 + 2\epsilon_\sigma) > 1;$$
however, recall $A(I - V^\top V) R^{-1} T = U$, therefore $\|A(I - V^\top V) R^{-1} T v\|_2^2 = \|U v\|_2^2 = \|v\|_2^2 = 1$, a contradiction. One can similarly establish a lower bound of $1 - 2\epsilon_\sigma$. Hence, the singular values of $T$ are in the range $[1 - 2\epsilon_\sigma, 1 + 2\epsilon_\sigma]$. This means that
$$\|e_j^\top A (I - V^\top V) R^{-1}\|_2^2 = \|e_j^\top U T^{-1}\|_2^2 = (1 \pm 2\epsilon_\sigma)\,\|e_j^\top U\|_2^2 = (1 \pm 2\epsilon_\sigma) \cdot \sigma(A(I - V^\top V))_j,$$
as desired. Rescaling $\epsilon_\sigma$ to $\epsilon_\sigma/2$ yields the approximation result.

Running Time. We divide the running time of the algorithm into the following lines.
- Line 6 takes time $\widetilde{O}(\epsilon_\sigma^{-1}\,\mathrm{nnz}(A))$ to compute the matrix $M$.
- Line 7 takes time $\widetilde{O}(\epsilon_\sigma^{-2} n^{\omega})$ to compute the QR decomposition of $M$.
- In Line 10, we can first multiply $R^{-1}$ with $S_2$ in time $\widetilde{O}(\epsilon_\sigma^{-2} n^2)$, which gives a matrix of size $n \times s_2$; multiplying this matrix with $(I - V^\top V)$ takes $\widetilde{O}(n^{\omega})$ time. Note that $R \in \mathbb{R}^{n \times n}$, hence $R^{-1}$ can be computed in $O(n^{\omega})$ time.
- The loop at Line 11 takes $\widetilde{O}(\mathrm{nnz}(A))$ time in total to compute the output vector $\widetilde{\sigma}$.

Hence, the overall time for computing $\widetilde{\sigma}(A(I - V^\top V)) \in \mathbb{R}^m$ is $\widetilde{O}(\epsilon_\sigma^{-2}(\mathrm{nnz}(A) + n^{\omega}))$.

G.2. Fast Subsampling

We now present the fast subsampling algorithm based on the leverage score approximation.

Algorithm 10 Fast Subsample Algorithm
1: procedure SUBSAMPLE($B \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$, $\epsilon_B$, $\delta_B$)  ▷ Lemma G.4
2:   $\widetilde{\sigma} \leftarrow$ IMPLICITLEVERAGESCORE($B$, $V$, $\delta_B$)  ▷ Compute an $O(1)$-approximation to the leverage scores of $B(I - V^\top V)$, $\widetilde{\sigma} \in \mathbb{R}^m$
3:   Sample a subset of rows of $\overline{B} = B(I - V^\top V)$ with proper re-weighting according to the approximate leverage scores $\widetilde{\sigma}$
4:   Let $\widetilde{D}$ denote the diagonal sampling matrix such that $\Pr[(1 - \epsilon_B)\,\overline{B}^\top \overline{B} \preceq \overline{B}^\top \widetilde{D}\widetilde{D}\,\overline{B} \preceq (1 + \epsilon_B)\,\overline{B}^\top \overline{B}] \ge 1 - \delta_B$
5:   return $\widetilde{D}$  ▷ $\widetilde{D} \in \mathbb{R}^{m \times m}$
6: end procedure

Lemma G.4. There is an algorithm (Algorithm 10) that takes $B \in \mathbb{R}^{m \times n}$ and $V \in \mathbb{R}^{n \times n}$ as inputs and outputs a diagonal sampling matrix $\widetilde{D}$ with $\|\widetilde{D}\|_0 = O(\epsilon_B^{-2} n \log(n/\delta_B))$, running in time $\widetilde{O}(\mathrm{nnz}(B) + n^{\omega})$. Here $\widetilde{O}$ hides the $\log(n/\delta_B)$ factors, and $\omega$ is the exponent of matrix multiplication; currently $\omega \approx 2.373$.

Proof. We first approximately compute the leverage scores, i.e., obtain an $O(1)$-approximation to all leverage scores via Lemma G.2. Then we sample a number of rows according to the leverage scores; using Lemma H.2, we can show correctness.

H. Sampling a Batch of Rank-1 Terms

In this section, we give a sparsification tool for matrices of the form $A^\top A$. We define our sampling process as follows:

Definition H.1 (Sampling process). Let $H = A^\top A$. Let $p_j \ge \beta \cdot \sigma_j(A)/n$ for some $\beta \le 1$, and suppose we sample $T$ rows of the matrix $A$ with replacement, independently, with probability $p_j$ of sampling row $j$ in each trial. Let $j_t$ denote the index of the row sampled in the $t$-th trial. Define the generated sampling matrix as
$$\widetilde{H} := \frac{1}{T}\sum_{t=1}^{T} \frac{1}{p_{j_t}}\, a_{j_t} a_{j_t}^\top,$$
where $T$ denotes the number of trials.
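The following NumPy sketch (ours) implements this sampling process literally: given row-sampling probabilities $p$, it draws $T$ rows with replacement and returns the importance-weighted estimator $\widetilde{H}$. The usage lines instantiate $p$ with exact leverage scores, which is an illustrative choice only; Algorithm 10 would use the approximations $\widetilde{\sigma}$ instead.

```python
import numpy as np

def sample_gram(A, p, T, seed=0):
    """Estimator of H = A^T A from Definition H.1: sample row j with
    probability p[j], T times with replacement, reweighting by 1/p[j]."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    idx = rng.choice(m, size=T, p=p)
    H_tilde = np.zeros((n, n))
    for j in idx:
        H_tilde += np.outer(A[j], A[j]) / p[j]
    return H_tilde / T

# illustrative usage: p_j proportional to sigma_j(A) = a_j^T (A^T A)^{-1} a_j
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 20))
sigma = np.einsum('ij,ij->i', A @ np.linalg.inv(A.T @ A), A)
p = sigma / sigma.sum()
H, H_tilde = A.T @ A, sample_gram(A, p, T=5000)
# spectral check: eigenvalues of H^{-1/2} H~ H^{-1/2} should lie in 1 +- eps
```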
For the sampling process of Definition H.1, we have the following guarantee:

Lemma H.2 (Sample using matrix Chernoff). Let $\epsilon_0, \delta_0 \in (0, 1)$ be precision and failure probability parameters, respectively. Suppose $\widetilde{H}$ is generated as in Definition H.1; then with probability at least $1 - \delta_0$, we have
$$(1 - \epsilon_0) H \preceq \widetilde{H} \preceq (1 + \epsilon_0) H.$$
Moreover, the number of rows sampled is $T = \Theta(\beta^{-1} \epsilon_0^{-2}\, n \log(n/\delta_0))$.

Proof. The proof of this lemma follows by designing a sequence of random matrices and then applying the matrix Chernoff bound to get the desired guarantee.

We first define the following vectors, generated by rescaling the rows of $A$:
$$y_j = (A^\top A)^{-1/2} a_j \quad \text{for all } j \in [m].$$
For $j \in [m]$, we define the matrix $Y_j := \frac{1}{p_j} y_j y_j^\top$ generated by $y_j$, and we define $X_j := Y_j - I_n$, where $I_n$ is the $n \times n$ identity matrix. Based on the above vector and matrix definitions, we define the following distributions:

- A distribution $\mathcal{y}$ for a random vector: for $y \in \mathbb{R}^n$, if $y \sim \mathcal{y}$, then for each $j \in [m]$, $y = y_j$ with probability $p_j$.
- A distribution $\mathcal{Y}$ such that, for $Y \in \mathbb{R}^{n \times n}$, if $Y \sim \mathcal{Y}$, then for each $j \in [m]$, $Y = Y_j$ with probability $p_j$.
- A distribution $\mathcal{X}$ such that, for $X \in \mathbb{R}^{n \times n}$, if $X \sim \mathcal{X}$, then for each $j \in [m]$, $X = X_j = Y_j - I_n$ with probability $p_j$.

Using $H = A^\top A$, we write $y_j = H^{-1/2} a_j$. We notice that
$$\sum_{j=1}^{m} y_j y_j^\top = \sum_{j=1}^{m} H^{-1/2} a_j a_j^\top H^{-1/2} = H^{-1/2} \Big(\sum_{j=1}^{m} a_j a_j^\top\Big) H^{-1/2} = H^{-1/2} (A^\top A) H^{-1/2} = I_n, \tag{12}$$
where the first step follows from the definition of $y_j$, the second step follows from reorganization, the third step follows from the definition of $a_j$, and the last step follows from the definition of $H$. Besides, we notice the connection between the norm of $y_j$ and the leverage score of $A$:
$$\|y_j\|_2^2 = a_j^\top (A^\top A)^{-1} a_j = \sigma_j(A). \tag{13}$$
We denote the index of the row sampled in the $t$-th trial by $j_t$; note that $j_t \in [m]$ for all $t \in [T]$.

Unbiased Estimator. For a matrix $X \sim \mathcal{X}$, we show that $X$ has expectation $0$:
$$\mathbb{E}_{X \sim \mathcal{X}}[X] = \mathbb{E}_{Y \sim \mathcal{Y}}[Y] - I_n = \Big(\sum_{j=1}^{m} p_j \cdot \frac{1}{p_j}\, y_j y_j^\top\Big) - I_n = 0,$$
where the first step follows from the definition of $X$, the second step follows from the definition of $Y$ and the definition of expectation, and the last step follows from Eq. (12).

Upper Bound on $\|X\|$. To give an upper bound on $\|X\|$, we first provide an upper bound for any $\|X_j\|$; the upper bound on $\|X\|$ then follows immediately. We note that
$$\|X_j\| = \|Y_j - I_n\| \le 1 + \frac{\|y_j y_j^\top\|}{p_j} \le 1 + \frac{n\,\|y_j\|_2^2}{\beta \cdot \sigma_j(A)} = 1 + \frac{n}{\beta},$$
where the first step follows from the definition of $X_j$, the second step follows from the triangle inequality, the definition of $I_n$, and the definition of $Y_j$, the third step follows from $p_j \ge \beta \cdot \sigma_j(A)/n$ and the definition of the $\ell_2$ norm, and the last step follows from Eq. (13).

Bound on $\|\mathbb{E}[X^\top X]\|$. To upper bound the spectral norm of the matrix, we first provide an upper bound on the expectation of the matrix $X^\top X$. We compute the matrix expectation:
$$\mathbb{E}_{X_{j_t} \sim \mathcal{X}}[X_{j_t}^\top X_{j_t}] = \mathbb{E}_{y_{j_t} \sim \mathcal{y}}\Big[\frac{y_{j_t} y_{j_t}^\top y_{j_t} y_{j_t}^\top}{p_{j_t}^2}\Big] - 2\,\mathbb{E}_{y_{j_t} \sim \mathcal{y}}\Big[\frac{y_{j_t} y_{j_t}^\top}{p_{j_t}}\Big] + I_n = \sum_{j=1}^{m} \frac{\|y_j\|_2^2}{p_j}\, y_j y_j^\top - 2 I_n + I_n \preceq \frac{n}{\beta} \sum_{j=1}^{m} y_j y_j^\top - I_n \preceq \frac{n}{\beta}\, I_n,$$
where the first step follows from the definition of $X$, the second step follows from Eq. (12), Eq. (13), and the definition of expectation, the third step follows from $p_j \ge \beta \cdot \sigma_j(A)/n$, and the last step follows from Eq. (12). Thus we upper bound the spectral norm as
$$\big\|\mathbb{E}_{y_{j_t} \sim \mathcal{y}}[X_{j_t}^\top X_{j_t}]\big\| \le \frac{n}{\beta}.$$

Putting things together. Define $W := \sum_{t=1}^{T} X_{j_t}$ and choose $T = \Theta(\beta^{-1} \epsilon_0^{-2}\, n \log(n/\delta_0))$. Applying the matrix Chernoff bound as in Lemma A.9,
$$\Pr\Big[\big\|\tfrac{1}{T} W\big\| \ge \epsilon_0\Big] \le 2n \cdot \exp\Big(-\frac{T \epsilon_0^2}{n/\beta + (1 + n/\beta)\,\epsilon_0/3}\Big) = 2n \cdot \exp\big(-T \epsilon_0^2 \cdot \Theta(\beta/n)\big) \le \delta_0,$$
where the last step follows from our choice of $T$. Finally, we notice that
$$\frac{1}{T} W = \frac{1}{T}\sum_{t=1}^{T}\Big(\frac{1}{p_{j_t}}\, y_{j_t} y_{j_t}^\top - I_n\Big) = H^{-1/2}\Big(\frac{1}{T}\sum_{t=1}^{T}\frac{1}{p_{j_t}}\, a_{j_t} a_{j_t}^\top\Big) H^{-1/2} - I_n = H^{-1/2}\, \widetilde{H}\, H^{-1/2} - I_n,$$
where the first step follows from the definition of $W$, the second step follows from $y_{j_t} = H^{-1/2} a_{j_t}$, and the last step follows from the definition of $\widetilde{H}$. Hence, with probability at least $1 - \delta_0$,
$$\|H^{-1/2}\, \widetilde{H}\, H^{-1/2} - I_n\| \le \epsilon_0.$$
Thus, we know that
$$(1 - \epsilon_0) I_n \preceq H^{-1/2}\, \widetilde{H}\, H^{-1/2} \preceq (1 + \epsilon_0) I_n,$$
which implies that
$$(1 - \epsilon_0) H \preceq \widetilde{H} \preceq (1 + \epsilon_0) H.$$
Thus, we complete the proof.

I. Boundary-Seeking Subroutine for Partial Coloring

In order to find the proper step size in the partial coloring algorithm, we propose the following subroutine.

Algorithm 11
1: procedure FINDBOUNDARY($a \in \mathbb{R}^n$, $b \in \mathbb{R}^n$)  ▷ Lemma I.1
2:   Compute $n$ intervals $[x_i, y_i]$ such that $x_i, y_i$ are the two boundaries of $-1 \le a_i \mu - b_i \le 1$, i.e., $x_i, y_i \in \{\frac{b_i - 1}{a_i}, \frac{b_i + 1}{a_i}\}$ and $x_i \le y_i$, for all $i \in [n]$
3:   Compute $n$ intervals $[x_{n+i}, y_{n+i}]$ such that $x_{n+i}, y_{n+i}$ are the two boundaries of $-1 \le a_i \mu + b_i \le 1$, i.e., $x_{n+i}, y_{n+i} \in \{\frac{-b_i - 1}{a_i}, \frac{-b_i + 1}{a_i}\}$ and $x_{n+i} \le y_{n+i}$, for all $i \in [n]$
4:   Linearly scan $x_1, \dots, x_{2n}$ and find $u \leftarrow \max\{x_i\}_{i \in [2n]}$
5:   Linearly scan $y_1, \dots, y_{2n}$ and find $v \leftarrow \min\{y_i\}_{i \in [2n]}$
6:   if $u \le v$ then
7:     if $|u| \le |v|$ then
8:       return $v$
9:     else
10:      return $u$
11:    end if
12:  else
13:    return fail
14:  end if
15: end procedure

We illustrate the idea of this algorithm with an example in Figure 7.

[Figure 7: panel (a) plots the lines $s = \mu a_1 - b_1$, $s = \mu a_1 + b_1$, $s = \mu a_2 - b_2$, $s = \mu a_2 + b_2$ of the constraints in our example; panel (b) shows the feasible intervals.]

Figure 7. An example of Algorithm 11. The algorithm looks for the intervals that make every constraint ($|\mu a_i + b_i| \le 1$ or $|\mu a_i - b_i| \le 1$) hold, so we have $2n$ lines in total ($n$ is the dimension). In this example $n = 2$, so we have $2n = 4$ lines and thus $4$ intervals; taking the intersection over them, we get the interval $[u, v]$. Panel (a) shows the lines of the constraints, with the green area showing the feasible region. Panel (b) shows the idea of taking the intersection over the $2n = 4$ constraints. (Here we set $a = (1, \frac{15}{7})$ and $b = (0.7, 3\dots)$.)

Lemma I.1. There is an algorithm (Algorithm 11) that takes two vectors $a, b \in \mathbb{R}^n$ and outputs a number $\mu \in \mathbb{R}$ such that $\mu$ is the number with the largest absolute value satisfying
$$\max\{\|\mu \cdot a + b\|_{\infty}, \|\mu \cdot a - b\|_{\infty}\} = 1,$$
and it runs in time $O(n)$.

Proof. Running Time. The running time of Algorithm 11 can be divided into the following lines:
- Line 2 takes time $O(n)$ to compute the $x_i, y_i$'s for $i \in [n]$.
- Line 3 takes time $O(n)$ to compute the $x_i, y_i$'s for $i \in \{n+1, \dots, 2n\}$.
- Line 4 takes time $O(n)$ to linearly scan the $x_i$'s for $i \in [2n]$.
- Line 5 takes time $O(n)$ to linearly scan the $y_i$'s for $i \in [2n]$.
The rest of the algorithm runs in constant time. Thus the total running time is $O(n)$.

Correctness. We first prove that, if the algorithm outputs a number $\mu$, then for every $i \in [n]$ it holds that $-1 \le \mu a_i - b_i \le 1$ and $-1 \le \mu a_i + b_i \le 1$. We call the feasible region of the above system $U$. By the construction of our algorithm, we have
$$[x_i, y_i] = \{\mu \in \mathbb{R} \mid -1 \le \mu a_i - b_i \le 1\}, \quad i \in [n],$$
and
$$[x_i, y_i] = \{\mu \in \mathbb{R} \mid -1 \le \mu a_{i-n} + b_{i-n} \le 1\}, \quad i \in \{n+1, \dots, 2n\}.$$
Thus we have $U = \bigcap_{i \in [2n]} [x_i, y_i]$. Note that if the algorithm outputs a number $\mu$, then it must hold that $\mu \in \{u, v\}$. Recall that we define $u$ and $v$ as
$$u := \max\{x_i\}_{i \in [2n]}, \quad v := \min\{y_i\}_{i \in [2n]},$$
thus $[u, v] = U$. Since the output $\mu$ equals either $u$ or $v$, it must hold that $\mu \in U$, so the constraints hold for $\mu$.

Now we prove that $\max\{\|\mu a + b\|_{\infty}, \|\mu a - b\|_{\infty}\} = 1$. Without loss of generality, assume $|u| > |v|$, so that $\mu = u$ (the case $|u| \le |v|$ is handled in the same way). We define $j \in [2n]$ to be an index with $u = x_j$.
First assume $j \in [n]$, so that $|\mu a_j - b_j| = 1$. (For the case $j \in \{n+1, \dots, 2n\}$, we have $|\mu a_{j-n} + b_{j-n}| = 1$, which is analyzed in the same way.) Note that $\mu \in U$, so we have
$$|\mu a_i - b_i| \le 1, \; \forall i \in [n] \quad \text{and} \quad |\mu a_i + b_i| \le 1, \; \forall i \in [n].$$
Thus we have $\|\mu a - b\|_{\infty} = 1$ and $\|\mu a + b\|_{\infty} \le 1$, hence
$$\max\{\|\mu a + b\|_{\infty}, \|\mu a - b\|_{\infty}\} = 1.$$
For the case $j \in \{n+1, \dots, 2n\}$, a similar analysis gives $\|\mu a + b\|_{\infty} = 1$ and $\|\mu a - b\|_{\infty} \le 1$, which also implies the result that $\max\{\|\mu a + b\|_{\infty}, \|\mu a - b\|_{\infty}\} = 1$. The same analysis yields the result when $\mu = v$.

Now we prove that $\mu = u$ and $\mu = v$ are the only two cases for which $\max\{\|\mu a + b\|_{\infty}, \|\mu a - b\|_{\infty}\} = 1$. Suppose for contradiction that there exists $t \in \mathbb{R}$ with $t \ne u$ and $t \ne v$ such that
$$\max\{\|t \cdot a + b\|_{\infty}, \|t \cdot a - b\|_{\infty}\} = 1. \tag{14}$$
If $t \notin U$, then there must exist an $i \in [n]$ such that $|t a_i + b_i| > 1$ or $|t a_i - b_i| > 1$, which violates the hypothesis (Eq. (14)). For the case $t \in U$, since $t \ne u$ and $t \ne v$, it must hold that $t \in (u, v)$. Recalling the definitions
$$u := \max\{x_i\}_{i \in [2n]}, \quad v := \min\{y_i\}_{i \in [2n]},$$
it must hold that $t > x_i$ for all $i \in [2n]$ and $t < y_i$ for all $i \in [2n]$. Then we have $|t a_i + b_i| < 1$ and $|t a_i - b_i| < 1$ for all $i \in [n]$, which implies
$$\max\{\|t \cdot a + b\|_{\infty}, \|t \cdot a - b\|_{\infty}\} < 1,$$
again violating the hypothesis (Eq. (14)). Thus we conclude there is no such $t$.

Hence, if we define $\mu$ to be whichever of $u$ or $v$ has the larger absolute value, then $\mu$ is the number with the largest absolute value satisfying $\max\{\|\mu a + b\|_{\infty}, \|\mu a - b\|_{\infty}\} = 1$. Thus we complete the proof.

Corollary I.2. For the case that $\|b\|_{\infty} \le 1$, Algorithm 11 always returns a number $\mu$.

Proof. If $\|b\|_{\infty} \le 1$, then $|b_i| \le 1$ for all $i \in [n]$. Without loss of generality, assume $b_i \ge 0$; then $b_i \in [0, 1]$, which implies that $0$ lies between $\frac{b_i - 1}{a_i}$ and $\frac{b_i + 1}{a_i}$. Thus we have $0 \in [x_i, y_i]$ for all $i \in [n]$. By a similar analysis, $0 \in [x_i, y_i]$ for all $i \in \{n+1, \dots, 2n\}$. Thus
$$0 \in \bigcap_{i \in [2n]} [x_i, y_i],$$
which means $U \ne \emptyset$. So it must hold that $u \le 0 \le v$; in particular $u \le v$, and the algorithm returns a number. Thus we complete the proof.

A minimal implementation sketch of Algorithm 11 follows.
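The following Python sketch (ours) implements the linear scan of Algorithm 11. It assumes every $a_i \ne 0$ and returns `None` in place of the algorithm's fail.

```python
import numpy as np

def find_boundary(a, b):
    """O(n) boundary search of Algorithm 11: the mu of largest absolute
    value with max(||mu*a + b||_inf, ||mu*a - b||_inf) = 1 (all a_i != 0)."""
    lo, hi = [], []
    for s in (-1.0, 1.0):              # constraints |mu * a_i + s * b_i| <= 1
        e1, e2 = (-s * b - 1) / a, (-s * b + 1) / a
        lo.append(np.minimum(e1, e2))  # interval endpoints with x_i <= y_i
        hi.append(np.maximum(e1, e2))
    u = np.max(np.concatenate(lo))     # u = max_i x_i
    v = np.min(np.concatenate(hi))     # v = min_i y_i
    if u > v:
        return None                    # empty intersection: fail
    return v if abs(v) >= abs(u) else u
```

Consistent with Corollary I.2, whenever $\|b\|_{\infty} \le 1$ the intersection contains $0$, so the function never returns `None` in that regime.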
J. Experimental Results

We conduct experiments to empirically validate the efficiency of our algorithm. We focus our experimental evaluation on demonstrating the improvements achieved through the fast hereditary projection. This is because the lazy update scheme is primarily designed to leverage fast matrix multiplication: although fast matrix multiplication theoretically has lower asymptotic complexity, algorithms based on FMM often suffer from significantly large constant factors, adversely affecting their practical runtime performance. Additionally, for practical considerations, the parameters used in the experiments do not strictly follow the theoretical suggestions in the algorithm.

Our experimental setup follows the matrix configurations used in Larsen (2023):

- Uniform matrices: Every entry is chosen independently and uniformly from the set $\{-1, +1\}$.
- 2D corner matrices: Sample two point sets $P = \{p_1, \dots, p_n\}$ and $Q = \{q_1, \dots, q_m\}$ independently and uniformly from the unit square $[0,1] \times [0,1]$. Construct a matrix with one column per $p_j \in P$ and one row per $q_i \in Q$; set entry $(i, j)$ to $1$ if $p_j$ is dominated by $q_i$, i.e., $q_{i,x} > p_{j,x}$ and $q_{i,y} > p_{j,y}$, and to $0$ otherwise.
- 2D halfspace matrices: Draw $P = \{p_1, \dots, p_n\}$ uniformly from $[0,1] \times [0,1]$ and construct a set $Q$ of $m$ half-spaces as follows: pick a point $a$ uniformly on either the left or the top boundary of the unit square, a point $b$ uniformly on either the right or the bottom boundary, and then choose uniformly whether the half-space consists of all points above the line through $a$ and $b$ or all points below it. Form a matrix with one column per $p_j \in P$ and one row per half-space $h_i \in Q$; set entry $(i, j)$ to $1$ if $p_j \in h_i$ and to $0$ otherwise.

Our algorithm achieves substantial speedups with only minor sacrifices in the approximation guarantees. The speedup is more significant once $m$ is much bigger than $n$, as sketching is known to work well in the regime $m \gg n$. Our code is available at https://github.com/magiclinux/input_sparsity_discrepancy_icml_2025.

Table 2. Results of experiments on uniform matrices. Here Larsen's algorithm is Larsen (2023). Runtimes are in seconds.

Matrix Size | Sparsity | Larsen's Obj. Val. | Our Obj. Val. | Larsen's Runtime | Our Runtime
400 × 400 | 1.0 | 54 | 56 | 2.98 | 2.16
400 × 400 | 0.5 | 38 | 42 | 2.90 | 1.95
400 × 400 | 0.1 | 14 | 20 | 2.91 | 1.90
2000 × 2000 | 1.0 | 140 | 148 | 345 | 164
2000 × 2000 | 0.5 | 96 | 99 | 334 | 156
2000 × 2000 | 0.1 | 46 | 47 | 331 | 152
10000 × 1000 | 1.0 | 132 | 140 | 378 | 63
10000 × 1000 | 0.5 | 92 | 97 | 374 | 62
10000 × 1000 | 0.1 | 42 | 44 | 375 | 62

Table 3. Results of experiments on 2D corner matrices. Here Larsen's algorithm is Larsen (2023). Runtimes are in seconds.

Matrix Size | Sparsity | Larsen's Obj. Val. | Our Obj. Val. | Larsen's Runtime | Our Runtime
400 × 400 | 1.0 | 24 | 28 | 2.80 | 2.15
400 × 400 | 0.5 | 30 | 40 | 2.79 | 2.18
400 × 400 | 0.1 | 17 | 18 | 2.75 | 1.93
2000 × 2000 | 1.0 | 52 | 60 | 347 | 170
2000 × 2000 | 0.5 | 83 | 92 | 352 | 169
2000 × 2000 | 0.1 | 45 | 44 | 350 | 181
10000 × 1000 | 1.0 | 46 | 52 | 386 | 65
10000 × 1000 | 0.5 | 76 | 77 | 374 | 60
10000 × 1000 | 0.1 | 37 | 40 | 375 | 62

Table 4. Results of experiments on 2D halfspace matrices. Here Larsen's algorithm is Larsen (2023). Runtimes are in seconds.

Matrix Size | Sparsity | Larsen's Obj. Val. | Our Obj. Val. | Larsen's Runtime | Our Runtime
400 × 400 | 1.0 | 38 | 42 | 2.67 | 2.08
400 × 400 | 0.5 | 30 | 40 | 2.70 | 2.12
400 × 400 | 0.1 | 17 | 18 | 2.62 | 1.89
2000 × 2000 | 1.0 | 50 | 56 | 352 | 174
2000 × 2000 | 0.5 | 75 | 76 | 345 | 171
2000 × 2000 | 0.1 | 38 | 40 | 345 | 169
10000 × 1000 | 1.0 | 62 | 66 | 390 | 67
10000 × 1000 | 0.5 | 71 | 75 | 389 | 68
10000 × 1000 | 0.1 | 33 | 36 | 382 | 64

K. More Related Work

Because sketching techniques are pivotal in both linear and semidefinite programming, we first survey this body of work. We then discuss how discrepancy theory has been leveraged in modern machine-learning applications.

Linear programming. Linear programming has a long and rich history in both mathematics and computer science. The simplex algorithm (Dantzig, 1947), while foundational, admits exponential worst-case running time. The ellipsoid method brought the first polynomial guarantee (Khachiyan, 1980), but was slower than simplex in practice. Karmarkar's interior-point method (Karmarkar, 1984) was a major advance, offering polynomial complexity together with strong empirical performance. When the number of constraints $d$ satisfies $d = \Omega(n)$ (where $n$ is the number of variables), Karmarkar's algorithm runs in $O(n^{3.5})$ time, later improved to $O(n^3)$ (Vaidya, 1987; Renegar, 1988) and then to $O(n^{2.5})$ (Vaidya, 1989). Cohen et al. (2019) achieved a time bound of $O(n^{\omega} + n^{2.5 - \alpha/2} + n^{2 + 1/6})$, where $\omega$ is the matrix-multiplication exponent and $\alpha$ its dual.
Their techniques were subsequently generalized to empirical risk minimization (Lee et al., 2019; Song, 2019; Qin et al., 2023; Gu et al., 2025), while Song & Yu (2021) reproduced the results of Cohen et al. (2019) using oblivious sketching matrices instead of non-oblivious sampling. For tall, dense constraint matrices, Brand et al. (2020) obtained a running time of $\widetilde{O}(nd) + \mathrm{poly}(d)$. There are also works studying linear programming in the streaming setting (Chen et al., 2023; Song et al., 2023c; Brand et al., 2025).

Semidefinite programming. Two principal families of algorithms address semidefinite programming (SDP): second-order and first-order methods. Second-order methods enjoy logarithmic dependence on the accuracy parameter $\epsilon$, whereas first-order methods typically have polynomial dependence on $1/\epsilon$. Among second-order techniques, cutting-plane approaches such as Lee et al. (2015); Jiang et al. (2020c; 2024) solve SDPs in $O(m(mn^2 + m^2 + n^{\omega}))$ time, where $m$ is the number of constraints. Interior-point methods based on self-concordant barriers constitute another major line: log-barrier algorithms (Nesterov & Nemirovski, 1992; Jiang et al., 2020b; Huang et al., 2022c; Liu et al., 2023), hybrid-barrier variants (Anstreicher, 2000; Huang et al., 2022c; Liu et al., 2023), and those using the Lee-Sidford barrier (Lee & Sidford, 2014; Liu et al., 2023; Gu et al., 2024a). Notably, Jiang et al. (2020b) obtain a running time of $O(\sqrt{n}(mn^2 + n^{\omega}))$, and Huang et al. (2022c) show that when $m = \Omega(n^2)$, an SDP can be solved in $O(m^{\omega} + m^{2+1/4})$ time. First-order algorithms avoid second-order information but pay a higher cost in their dependence on $1/\epsilon$. Representative results include Arora & Kale (2007); Jain & Yao (2011); Allen-Zhu et al. (2016); Garber & Hazan (2016); Allen-Zhu & Li (2017); Carmon et al. (2019); Lee & Padmanabhan (2020); Yurtsever et al. (2019); Grigorescu et al. (2022); Song et al. (2023a;c); Chen et al. (2023); Gu et al. (2024b).

Applications of discrepancy theory in machine learning. Matousek (1999a) and Karnin & Liberty (2019) introduce the relationship between discrepancy and concepts in learning theory such as VC dimension and PAC learning. Chen et al. (2018) introduce a method to learn hash functions via discrepancy minimization; learned hash functions in content-based image retrieval optimize binary codes to preserve similarity in high-dimensional feature spaces, enabling faster and more efficient image search. Wang et al. (2023) leverage discrepancy minimization for unsupervised graph matching by aligning predictions from classical solvers and neural models. Han et al. (2025) propose an algorithm for compressing the KV cache recursively using a geometrically correlated sampling process based on discrepancy theory. Nikolov et al. (2013) investigate the relationship between discrepancy minimization and differential privacy in the context of linear queries over histograms. Quasi-Monte Carlo methods (Lyu et al., 2020; Lyu, 2023) leverage concepts from discrepancy theory by employing low-discrepancy sequences to efficiently approximate high-dimensional integrals and expectations.