APPROXIMATE VANISHING IDEAL COMPUTATIONS AT SCALE

Published as a conference paper at ICLR 2023.

Elias Wirth, Institute of Mathematics, Berlin Institute of Technology, Berlin, Germany, wirth@math.tu-berlin.de
Hiroshi Kera, Graduate School of Engineering, Chiba University, Chiba, Japan, kera.hiroshi@gmail.com
Sebastian Pokutta, Institute of Mathematics & AI in Society, Science, and Technology, Berlin Institute of Technology & Zuse Institute Berlin, Berlin, Germany, pokutta@zib.de

ABSTRACT

The vanishing ideal of a set of points X = {x_1, ..., x_m} ⊆ R^n is the set of polynomials that evaluate to 0 over all points x ∈ X and admits an efficient representation by a finite subset of generators. In practice, to accommodate noise in the data, algorithms that construct generators of the approximate vanishing ideal are widely studied, but their computational complexities remain expensive. In this paper, we scale up the oracle approximate vanishing ideal algorithm (OAVI), the only generator-constructing algorithm with known learning guarantees. We prove that the computational complexity of OAVI is not superlinear, as previously claimed, but linear in the number of samples m. In addition, we propose two modifications that accelerate OAVI's training time: Our analysis reveals that replacing the pairwise conditional gradients algorithm, one of the solvers used in OAVI, with the faster blended pairwise conditional gradients algorithm leads to an exponential speed-up in the number of features n. Finally, using a new inverse Hessian boosting approach, intermediate convex optimization problems can be solved almost instantly, improving OAVI's training time by multiple orders of magnitude in a variety of numerical experiments.

1 INTRODUCTION

High-quality features are essential for the success of machine-learning algorithms (Guyon & Elisseeff, 2003) and, as a consequence, feature transformation and selection algorithms are an important area of research (Kusiak, 2001; Van Der Maaten et al., 2009; Abdi & Williams, 2010; Paul et al., 2021; Manikandan & Abirami, 2021; Carderera et al., 2021). A recently popularized technique for extracting nonlinear features from data is the concept of the vanishing ideal (Heldt et al., 2009; Livni et al., 2013), which lies at the intersection of machine learning and computer algebra. Unlike conventional machine learning, which relies on a manifold assumption, vanishing ideal computations are based on an algebraic set assumption (a set X ⊆ R^n is algebraic if it is the set of common roots of a finite set of polynomials), for which powerful theoretical guarantees are known (Vidal et al., 2005; Livni et al., 2013; Globerson et al., 2017).

The core concept of vanishing ideal computations is that any data set X = {x_1, ..., x_m} ⊆ R^n can be described by its vanishing ideal, I_X = {g ∈ P | g(x) = 0 for all x ∈ X}, where P is the polynomial ring over R in n variables. Despite I_X containing infinitely many polynomials, there exists a finite number of generators of I_X, g_1, ..., g_k ∈ I_X with k ∈ N, such that any polynomial h ∈ I_X can be written as h = sum_{i=1}^k g_i h_i, where h_i ∈ P for all i ∈ {1, ..., k} (Cox et al., 2013). Thus, the generators share any sample x ∈ X as a common root, capture the nonlinear structure of the data, and succinctly represent the vanishing ideal I_X. Due to noise in empirical data, we are interested in constructing generators of the approximate vanishing ideal, the ideal generated by the set of polynomials that approximately evaluate to 0 for all x ∈ X and whose leading term coefficient is 1, see Definition 2.2. For classification tasks, constructed generators can, for example, be used to transform the features of the data set X ⊆ R^n such that the data becomes linearly separable (Livni et al., 2013), and training a linear kernel support vector machine (SVM) (Suykens & Vandewalle, 1999) on the feature-transformed data results in excellent classification accuracy.

Various algorithms for the construction of generators of the approximate vanishing ideal exist (Heldt et al., 2009; Fassino, 2010; Limbeck, 2013; Livni et al., 2013; Iraji & Chitsaz, 2017; Kera & Hasegawa, 2020; Kera, 2022), but among them, the oracle approximate vanishing ideal algorithm (OAVI) (Wirth & Pokutta, 2022) is the only one capable of constructing sparse generators and admitting learning guarantees. More specifically, CGAVI (OAVI with Frank-Wolfe algorithms (Frank & Wolfe, 1956), a.k.a. conditional gradients algorithms (CG) (Levitin & Polyak, 1966), as a solver) exploits the sparsity-inducing properties of CG to construct sparse generators and, thus, a robust and interpretable corresponding feature transformation. Furthermore, generators constructed with CGAVI vanish on out-sample data, and the combined approach of transforming features with CGAVI for a subsequently applied linear kernel SVM inherits the margin bound of the SVM (Wirth & Pokutta, 2022). Despite OAVI's various appealing properties, the computational complexities of vanishing ideal algorithms for the construction of generators of the approximate vanishing ideal are superlinear in the number of samples m. With training times that increase at least cubically with m, vanishing ideal algorithms have yet to be applied to large-scale machine-learning problems.

1.1 CONTRIBUTIONS

In this paper, we improve and study the scalability of OAVI.

Linear computational complexity in m. Up until now, the analysis of the computational complexities of approximate vanishing ideal algorithms assumed that generators need to vanish exactly, which gave an overly pessimistic estimate of the computational cost. For OAVI, we exploit that generators only have to vanish approximately and prove that the computational complexity of OAVI is not superlinear but linear in the number of samples m and polynomial in the number of features n.

Solver improvements. OAVI repeatedly calls a solver of quadratic convex optimization problems to construct generators. By replacing the pairwise conditional gradients algorithm (PCG) (Lacoste-Julien & Jaggi, 2015) with the faster blended pairwise conditional gradients algorithm (BPCG) (Tsuji et al., 2022), we improve the dependence of the time complexity of OAVI on the number of features n in the data set by an exponential factor.

Inverse Hessian boosting (IHB). OAVI solves a series of quadratic convex optimization problems that differ only slightly, and we can efficiently maintain and update the inverse of the corresponding Hessians. Inverse Hessian boosting (IHB) then refers to the procedure of passing a starting vector, computed with inverse Hessian information, close to the optimal solution to the convex solver used in OAVI. Empirically, IHB speeds up the training time of OAVI by multiple orders of magnitude.

Large-scale numerical experiments.
We perform numerical experiments on data sets of up to two million samples, highlighting that OAVI is an excellent large-scale feature transformation method. 1.2 RELATED WORK The Buchberger-M oller algorithm was the first method for constructing generators of the vanishing ideal (M oller & Buchberger, 1982). Its high susceptibility to noise was addressed by Heldt et al. (2009) with the approximate vanishing ideal algorithm (AVI), see also Fassino (2010); Limbeck (2013). The latter introduced two algorithms that construct generators term by term instead of degree-wise such as AVI, the approximate Buchberger-M oller algorithm (ABM) and the border bases approximate Buchberger-M oller algorithm. The aforementioned algorithms are monomialaware, that is, they require an explicit ordering of terms and construct generators as linear combinations of monomials. However, monomial-awareness is an unattractive property: Changing the order of the features changes the outputs of the algorithms. Monomial-agnostic approaches such as vanishing component analysis (VCA) (Livni et al., 2013) do not suffer from this shortcoming, as they construct generators as linear combinations of polynomials. VCA found success in hand posture recognition, solution selection using genetic programming, principal variety analysis for nonlinear data modeling, and independent signal estimation for blind source separation (Zhao & Song, 2014; Kera & Iba, 2016; Iraji & Chitsaz, 2017; Wang & Ohtsuki, 2018). The disadvantage of foregoing the term ordering is that VCA sometimes constructs multiple orders of magnitude more generators Published as a conference paper at ICLR 2023 than monomial-aware algorithms (Wirth & Pokutta, 2022). Furthermore, VCA is susceptible to the spurious vanishing problem: Polynomials that do not capture the nonlinear structure of the data but whose coefficient vector entries are small become generators, and, conversely, polynomials that capture the data well but whose coefficient vector entries are large get treated as non-vanishing. The problem was partially addressed by Kera & Hasegawa (2019; 2020; 2021). 2 PRELIMINARIES Throughout, let ℓ, k, m, n N. We denote vectors in bold and let 0 Rn denote the 0-vector. Sets of polynomials are denoted by capital calligraphic letters. We denote the set of terms (or monomials) and the polynomial ring over R in n variables by T and P, respectively. For τ 0, a polynomial g = Pk i=1 citi P with c = (c1, . . . , ck) is said to be τ-bounded in the ℓ1-norm if the ℓ1-norm of its coefficient vector is bounded by τ, that is, if g 1 := c 1 τ. Given a polynomial g P, let deg(g) denote its degree. The sets of polynomials in n variables of and up to degree d N are denoted by Pd and P d, respectively. Similarly, for a set of polynomials G P, let Gd = G Pd and G d = G P d. We often assume that X = {x1, . . . , xm} [0, 1]n, a form that can be realized, for example, via min-max feature scaling. Given a polynomial g P and a set of polynomials G = {g1, . . . , gk} P, define the evaluation vector of g and evaluation matrix of G over X as g(X) = (g(x1), . . . , g(xm)) Rm and G(X) = (g1(X), . . . , gk(X)) Rm k, respectively. Further, define the mean squared error of g over X as mse(g, X) = 1 |X| g(X) 2 2 = 1 m g(X) 2 2 . OAVI sequentially processes terms according to a so-called term ordering, as is necessary for any monomial-aware algorithm. For ease of presentation, we restrict our analysis to the degreelexicographical ordering of terms (Deg Lex) (Cox et al., 2013), denoted by <σ. 
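To fix these objects computationally, the following is a minimal Python sketch of evaluation vectors, the mean squared error mse(g, X), and the Deg Lex ordering. Representing monomials as exponent tuples and the helper names (evaluate_monomial, mse, deglex_key) are illustrative assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

# A monomial in n variables is encoded by its exponent tuple alpha; e.g., with
# variables (t, u, v), alpha = (1, 0, 2) encodes t * v^2 (illustrative convention).

def evaluate_monomial(alpha, X):
    """Evaluation vector t(X) in R^m of the monomial with exponents alpha."""
    return np.prod(X ** np.asarray(alpha), axis=1)

def mse(poly_eval):
    """mse(g, X) = (1/m) * ||g(X)||_2^2, given the evaluation vector g(X)."""
    return float(poly_eval @ poly_eval) / len(poly_eval)

def deglex_key(alpha):
    """Sort key realizing <_sigma: total degree first, then lexicographically
    with earlier variables smaller, so that t^2 < t*u < t*v < u^2 < u*v < v^2."""
    return (sum(alpha), tuple(-a for a in alpha))

# Degree-2 monomials in (t, u, v), sorted into Deg Lex order:
deg2 = [(0, 0, 2), (0, 1, 1), (1, 1, 0), (2, 0, 0), (1, 0, 1), (0, 2, 0)]
print(sorted(deg2, key=deglex_key))  # t^2, t*u, t*v, u^2, u*v, v^2

X = np.array([[0.1, 0.2, 0.3], [0.5, 0.5, 0.5], [0.9, 0.1, 0.4]])  # X in [0, 1]^3
print(mse(evaluate_monomial((1, 0, 2), X)))
```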
For example, given the terms t, u, v T1, Deg Lex works as follows: 1 <σ t <σ u <σ v <σ t2 <σ t u <σ t v <σ u2 <σ u v <σ v2 <σ t3 <σ . . . , where 1 denotes the constant-1 term. Given a set of terms O = {t1, . . . , tk}σ T , the subscript σ indicates that t1 <σ . . . <σ tk. Definition 2.1 (Leading term (coefficient)). Let g = Pk i=1 citi P with ci R and ti T for all i {1, . . . , k} and let j {1, . . . , k} such that tj >σ ti for all i {1, . . . , k} \ {j}. Then, tj and cj are called leading term and leading term coefficient of g, denoted by lt(g) = tj and ltc(g) = cj, respectively. We thus define approximately vanishing polynomials via the mean squared error as follows. Definition 2.2 (Approximately vanishing polynomial). Let X = {x1, . . . , xm} Rn, ψ 0, and τ 2. A polynomial g P is ψ-approximately vanishing (over X) if mse(g, X) ψ. If also ltc(g) = 1 and g 1 τ, then g is called (ψ, 1, τ)-approximately vanishing (over X). In the definition above, we fix the leading term coefficient of polynomials to address the spurious vanishing problem, and the requirement that polynomials are τ-bounded in the ℓ1-norm is necessary for the learning guarantees of OAVI to hold. Definition 2.3 (Approximate vanishing ideal). Let X = {x1, . . . , xm} Rn, ψ 0, and τ 2. The (ψ, τ)-approximate vanishing ideal (over X), Iψ,τ X , is the ideal generated by all (ψ, 1, τ)- approximately vanishing polynomials over X. For ψ = 0 and τ = , it holds that I0, X = IX, that is, the approximate vanishing ideal becomes the vanishing ideal. Finally, we introduce the generator-construction problem addressed by OAVI. Problem 2.4 (Setting). Let X = {x1, . . . , xm} Rn, ψ 0, and τ 2. Construct a set of (ψ, 1, τ)-approximately vanishing generators of Iψ,τ X . Recall that for t, u T , t divides (or is a divisor of) u, denoted by t | u, if there exists v T such that t v = u. If t does not divide u, we write t u. OAVI constructs generators of the approximate vanishing ideal of degree d N by checking whether terms of degree d are leading terms of an approximately vanishing generator. As explained in Wirth & Pokutta (2022), OAVI does not have to consider all terms of degree d but only those contained in the subset defined below. Definition 2.5 (Border). Let O T . The (degree-d) border of O is defined as d O = {u Td : t O d 1 for all t T d 1 such that t | u}. Published as a conference paper at ICLR 2023 Algorithm 1: Oracle approximate vanishing ideal algorithm (OAVI) Input : A data set X = {x1, . . . , xm} Rn and parameters ψ 0 and τ 2. Output: A set of polynomials G P and a set of monomials O T . 1 d 1, O = {t1}σ {1}σ, G 2 while d O = {u1, . . . , uk}σ = do 3 for i = 1, . . . , k do 4 ℓ |O| N, A O(X) Rm ℓ, b ui(X) Rm, P = {y Rℓ| y 1 τ 1} 5 c argminy P 1 m Ay + b 2 2 call to convex optimization oracle 6 g Pℓ j=1 cjtj + ui 7 if mse(g, X) ψ then check whether g vanishes 10 O = {t1, . . . , tℓ+1}σ (O {ui})σ 3 ORACLE APPROXIMATE VANISHING IDEAL ALGORITHM (OAVI) In this section, we recall the oracle approximate vanishing ideal algorithm (OAVI) (Wirth & Pokutta, 2022) in Algorithm 1, a method for solving Problem 2.4. 3.1 ALGORITHM OVERVIEW OAVI takes as input a data set set X = {x1, . . . , xm} Rn, a vanishing parameter ψ 0, and a tolerance τ 2 such that the constructed generators are τ-bounded in the ℓ1-norm. From a highlevel perspective, OAVI constructs a finite set of (ψ, 1, τ)-approximately vanishing generators of the (ψ, τ)-approximate vanishing ideal Iψ,τ X by solving a series of constrained convex optimization problems. 
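To make the high-level description concrete, the following is a simplified, self-contained Python sketch of Algorithm 1 in which monomials are exponent tuples and the oracle in Line 5 is an unconstrained least-squares solve (corresponding to tau = infinity); the leading term coefficient 1 is enforced by the "+ u" term. The function names and the circle example are illustrative assumptions; the l1-constraint and the CG variants used in the paper are omitted here.

```python
import itertools
import numpy as np

def deglex_key(alpha):
    # Deg Lex sort key: total degree first, then lexicographically with earlier
    # variables smaller, matching 1 < t < u < v < t^2 < t*u < ...
    return (sum(alpha), tuple(-a for a in alpha))

def evaluate(alpha, X):
    # Evaluation vector of the monomial with exponent tuple alpha over the rows of X.
    return np.prod(X ** np.asarray(alpha), axis=1)

def border(O, n, d):
    # Degree-d border of O (Definition 2.5): degree-d terms all of whose proper
    # divisors (which have degree <= d - 1) lie in O.
    O_set, out = set(O), []
    for alpha in itertools.product(range(d + 1), repeat=n):
        if sum(alpha) != d:
            continue
        divisors = [beta for beta in itertools.product(*(range(a + 1) for a in alpha))
                    if sum(beta) < d]
        if all(beta in O_set for beta in divisors):
            out.append(alpha)
    return sorted(out, key=deglex_key)

def oavi_sketch(X, psi, max_degree=10):
    # Simplified OAVI with an unconstrained least-squares oracle; max_degree is only
    # a safeguard for this sketch, Algorithm 1 stops once the border is empty.
    # Returns G as (coefficient vector over O, leading term) pairs, and O.
    m, n = X.shape
    O = [(0,) * n]                   # O = {1}
    A = np.ones((m, 1))              # evaluation matrix O(X)
    G = []
    for d in range(1, max_degree + 1):
        border_d = border(O, n, d)
        if not border_d:
            break
        for u in border_d:
            b = evaluate(u, X)
            c, *_ = np.linalg.lstsq(A, -b, rcond=None)   # oracle call (Line 5)
            g_eval = A @ c + b                           # evaluation of g = sum_j c_j t_j + u
            if g_eval @ g_eval / m <= psi:               # mse(g, X) <= psi: g vanishes
                G.append((c, u))
            else:                                        # otherwise u is appended to O
                O.append(u)
                A = np.hstack([A, b[:, None]])
    return G, O

# Example: points on a circle scaled into [0, 1]^2; a generator with leading
# term x2^2, i.e., exponent tuple (0, 2), appears among the output.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
X = 0.5 + 0.5 * np.column_stack([np.cos(theta), np.sin(theta)])
G, O = oavi_sketch(X, psi=1e-3)
print([u for _, u in G], O)
```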
OAVI tracks the set of terms O T such that there does not exist a (ψ, 1, τ)- approximately vanishing generator of Iψ,τ X with terms only in O and the set of generators G P of Iψ,τ X . For every degree d 1, OAVI computes the border d O in Line 2. Then, in Lines 3 12, for every term u d O, OAVI determines whether there exists a (ψ, 1, τ)-approximately vanishing generator g of Iψ,τ X with lt(g) = u and other terms only in O via oracle access to a solver of the constrained convex optimization problem in Line 5. If such a g exists, it gets appended to G in Line 8. Otherwise, the term u gets appended to O in Line 10. OAVI terminates when a degree d N is reached such that d O = . For ease of exposition, unless noted otherwise, we ignore that the optimization problem in Line 5 of OAVI is addressed only up to a certain accuracy ϵ > 0. Throughout, we denote the output of OAVI by G and O, that is, (G, O) = OAVI(X, ψ, τ). We replace the first letter in OAVI with the abbreviation corresponding to the solver used to solve the optimization problem in Line 5. For example, OAVI with solver CG is referred to as CGAVI. 3.2 PREPROCESSING WITH OAVI IN A CLASSIFICATION PIPELINE OAVI can be used as a feature transformation method for a subsequently applied linear kernel support vector machine (SVM). Let X Rn and Y = {1, . . . , k} be an input and output space, respectively. We are given a training sample S = {(x1, y1), . . . , (xm, ym)} (X Y)m drawn i.i.d. from some unknown distribution D. Our task is to determine a hypothesis h: X Y with small generalization error P(x,y) D[h(x) = y]. For each class i {1, . . . , k}, let Xi = {xj X | yj = i} X = {x1, . . . , xm} denote the subset of samples corresponding to class i and construct a set of (ψ, 1, τ)-approximately vanishing generators Gi of the (ψ, τ)-approximate vanishing ideal Iψ,τ Xi via OAVI, that is, (Gi, Oi) = OAVI(Xi, ψ, τ). We combine the sets of generators corresponding to the different classes to obtain G = {g1, . . . , g|G|} = Sk i=1 Gi, which encapsulates the feature Published as a conference paper at ICLR 2023 (d) Synthetic. Figure 1: Training time comparisons with fixed ψ = 0.005, averaged over ten random runs with shaded standard deviations. On all data sets except skin, BPCGAVI is faster than PCGAVI. transformation we are about to apply to data set X. As proposed in Livni et al. (2013), we transform the training sample X via the feature transformation xj 7 xj = |g1(xj)|, . . . , |g|G|(xj)| R|G| (FT) for j = 1, . . . , m. The motivation behind this feature transformation is that a polynomial g Gi vanishes approximately over all x Xi and (hopefully) attains values that are far from zero over points x X \ Xi. We then train a linear kernel SVM on the feature-transformed data S = {( x1, y1), . . . , ( xm, ym)} with ℓ1-regularization to promote sparsity. If ψ = 0, τ = , and the underlying classes of S belong to disjoint algebraic sets, then the different classes become linearly separable in the feature space corresponding to transformation (FT) and perfect classification accuracy is achieved with the linear kernel SVM (Livni et al., 2013). 3.3 SOLVING THE OPTIMIZATION PROBLEM IN LINE 5 OF OAVI Setting τ = , the optimization problem in Line 5 becomes unconstrained and can, for example, be addressed with accelerated gradient descent (AGD) (Nesterov, 1983). If, however, τ < , OAVI satisfies two generalization bounds. Indeed, assuming that the data is of the form X [0, 1]n, the generators in G are guaranteed to also vanish on out-sample data. 
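As a concrete illustration of the preprocessing pipeline of Section 3.2, the following sketch applies the feature transformation (FT) and trains an l1-regularized linear kernel SVM with scikit-learn. The per-class generators below are hand-coded stand-ins (in the paper they are constructed by running OAVI once per class), and the two-circle data set is a toy example.

```python
import numpy as np
from sklearn.svm import LinearSVC

def feature_transform(X, generators):
    """(FT): map each sample x to (|g_1(x)|, ..., |g_|G|(x)|)."""
    return np.abs(np.column_stack([g(X) for g in generators]))

# Hand-coded stand-ins for the per-class generators G = G_1 u ... u G_k.
generators = [
    lambda X: X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0,   # approximately vanishes on class 0
    lambda X: X[:, 0] ** 2 + X[:, 1] ** 2 - 1.3,   # approximately vanishes on class 1
]

# Toy two-circle data set (class 0 on radius 1, class 1 on radius sqrt(1.3)).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 400)
radius = np.where(y == 0, 1.0, np.sqrt(1.3))
theta = rng.uniform(0.0, 2.0 * np.pi, 400)
X = radius[:, None] * np.column_stack([np.cos(theta), np.sin(theta)])
X += 0.05 * rng.standard_normal(X.shape)

# l1-regularized linear kernel SVM on the transformed features, as in Section 3.2.
X_tilde = feature_transform(X, generators)
svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=1.0, max_iter=10000)
svm.fit(X_tilde, y)
print("training accuracy:", svm.score(X_tilde, y))
```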
Furthermore, the combined approach of using generators obtained via OAVI to preprocess the data for a subsequently applied linear kernel SVM inherits the margin bound of the SVM (Wirth & Pokutta, 2022). 4 OAVI AT SCALE Here, we address the main shortcoming of vanishing ideal algorithms: time, space, and evaluation complexities that depend superlinearly on m, where m is the number of samples in the data set X = {x1, . . . , xm} Rn. Recall the computational complexity of OAVI for τ = . Theorem 4.1 (Complexity (Wirth & Pokutta, 2022)). Let X = {x1, . . . , xm} Rn, ψ 0, τ = , and (G, O) = OAVI(X, ψ, τ). Let TORACLE and SORACLE be the time and space complexities required to solve the convex optimization problem in Line 5 of OAVI, respectively. In the real number model, the time and space complexities of OAVI are O((|G| + |O|)2 + (|G| + |O|)TORACLE) and O((|G| + |O|)m + SORACLE), respectively. The evaluation vectors of all polynomials in G over a set Z = {z1, . . . , zq} Rn can be computed in time O((|G| + |O|)2q). Under mild assumptions, Wirth & Pokutta (2022) proved that O(|G| + |O|) = O(mn), implying that OAVI s computational complexity is superlinear in m. In this work, we improve OAVI s computational complexity both with a tighter theoretical analysis and algorithmic modifications. 4.1 NUMBER-OF-SAMPLES-AGNOSTIC BOUND ON |G| + |O| For ψ > 0, OAVI, ABM, AVI, and VCA construct approximately vanishing generators of the (ψ, τ)- approximate vanishing ideal. Despite being designed for the approximate setting (ψ > 0), so far, the analysis of these algorithms was only conducted for the exact setting (ψ = 0). Below, we exploit that ψ > 0 in OAVI and obtain the first number-of-samples-agnostic bound on the size of a Published as a conference paper at ICLR 2023 (d) Synthetic. Figure 2: Training time comparisons with fixed ψ = 0.005, averaged over ten random runs with shaded standard deviations. CGAVI-IHB is faster than BPCGAVI-WIHB, which is faster than BPCGAVI. generator-constructing algorithm s output. The result below also highlights the connection between the size of OAVI s output |G| + |O| and the extent of vanishing ψ: Large ψ implies small |G| + |O| and small ψ implies large |G| + |O|. Theorem 4.2 (Bound on |G| + |O|). Let X = {x1, . . . , xm} [0, 1]n, ψ ]0, 1[, D = log(ψ)/ log(4) , τ (3/2)D, and (G, O) = OAVI(X, ψ, τ). Then, OAVI terminates after having constructed generators of degree D. Thus, |G| + |O| D+n D . The proof of Theorem 4.2 is presented in Appendix A. When TORACLE and SORACLE are linear in m and polynomial in |G|+|O|, the time and space complexities of OAVI are linear in m and polynomial in n. Furthermore, the evaluation complexity of OAVI is independent of m and polynomial in n. 4.2 BLENDED PAIRWISE CONDITIONAL GRADIENTS ALGORITHM (BPCG) Wirth & Pokutta (2022) solved the optimization problem in Line 5 of OAVI with the pairwise conditional gradients algorithm (PCG). In this section, we replace PCG with the blended pairwise conditional gradients algorithm (BPCG) (Tsuji et al., 2022), yielding an exponential improvement in time complexity of OAVI in the number of features n in the data set X = {x1, . . . , xm} [0, 1]n. Let P Rℓ, ℓ N, be a polytope of diameter δ > 0 and pyramidal width ω > 0, see Appendix D for details, and let f : Rℓ R be a µ-strongly convex and L-smooth function. PCG and BPCG address constrained convex optimization problems of the form y argminy P f(y). 
(4.1) With vert(P) denoting the set of vertices of P, the convergence rate of PCG for solving (4.1) is f(y T ) f(y ) (f(y0) f(y ))e ρPCGT , where y0 P is the starting point, y is the optimal solution to (4.1), and ρPCG = µω2/(4Lδ2(3| vert(P)|! + 1)) (Lacoste-Julien & Jaggi, 2015). The dependence of PCG s time complexity on | vert(P)|! is due to swap steps, which shift weight from the away to the Frank-Wolfe vertex but do not lead to a lot of progress. Since the problem dimension is |O|, which can be of a similar order of magnitude as |G| + |O|, PCG can require a number of iterations that is exponential in |G| + |O| to reach ϵ-accuracy. BPCG does not perform swap steps and its convergence rate is f(y T ) f(y ) (f(y0) f(y ))e ρBPCGT , where ρBPCG = 1/2 min 1/2, µω2/(4Lδ2) (Tsuji et al., 2022). Thus, to reach ϵ-accuracy, BPCG requires a number of iterations that is only polynomial in |G| + |O|. For the optimization problem in Line 5 of OAVI, A = O(X) Rm ℓand µ and L are the smallest and largest eigenvalues of matrix 2 m A A Rℓ ℓ, respectively. Since the pyramidal width ω of the ℓ1-ball of radius τ 1 is at least (τ 1)/ ℓ(Lemma D.1), ignoring dependencies on µ, L, ϵ, and τ, and updating A A and A b as explained in the proof of Theorem C.1, the computational complexity of BPCGAVI (OAVI with solver BPCG) is as summarized below. Corollary 4.3 (Complexity). Let X = {x1, . . . , xm} Rn, ψ > 0, τ 2, and (G, O) = BPCGAVI(X, ψ, τ). In the real number model, the time and space complexities of BPCGAVI are O((|G| + |O|)2m + (|G| + |O|)4) and O((|G| + |O|)m + (|G| + |O|)2), respectively. Published as a conference paper at ICLR 2023 (d) Synthetic. Figure 3: Training time comparisons, averaged over ten random runs with shaded standard deviations. For small data sets, ABM and VCA are faster than OAVI, but for synthetic, the training times of ABM and VCA scale worse than OAVI s. See Appendix F.2 for details. The results in Figure 1, see Appendix F.1 for details, show that replacing PCG with BPCG in OAVI often speeds up the training time of OAVI. For the data set skin in Figure 1c, see Table 1 for an overview of the data sets, it is not fully understood why BPCGAVI is slower than PCGAVI. 4.3 INVERSE HESSIAN BOOSTING (IHB) We introduce inverse Hessian boosting (IHB) to speed up the training time of OAVI by multiple orders of magnitudes by exploiting the structure of the optimization problems solved in OAVI. For ease of exposition, assume for now that τ = , in which case we would use AGD to solve the problem in Line 5 of OAVI. Letting ℓ= |O|, A = O(X) Rm ℓ, b = ui(X) Rm, and f(y) = 1 m Ay + b 2 2, the optimization problem in Line 5 of OAVI takes the form c argminy Rℓf(y). Then, the gradient and Hessian of f at y Rℓare f(y) = 2 m A (Ay + b) Rℓand 2f(y) = 2 m A A Rℓ ℓ, respectively. By construction, the columns of A = O(X) are linearly independent. Hence, A A Rℓ ℓis positive definite and invertible, f is strongly convex, and the optimal solution c to the optimization problem in Line 5 of OAVI is unique. Further, for y Rℓ, we have f(y) = 0 if and only if y = c. Instead of using AGD to construct an ϵ-accurate solution to the optimization problem in Line 5 of OAVI, we could compute the optimal solution c = (A A) 1A b Rℓ. Since matrix inversions are numerically unstable, relying on them directly would make OAVI less robust, and approximately vanishing polynomials might not be correctly detected. 
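The closed-form solution can nevertheless be obtained stably by solving the normal equations rather than forming the inverse explicitly. The following sketch computes such a point and confirms that the gradient nearly vanishes there; it is an illustration only and not the paper's incremental update scheme (Appendix C.1). The minimizer of f(y) = (1/m)||Ay + b||_2^2 is y = -(A^T A)^{-1} A^T b, the sign following from the optimality condition grad f(y) = (2/m) A^T (Ay + b) = 0.

```python
import numpy as np

def ihb_starting_vector(A, b):
    """Starting vector y0 = -(A^T A)^{-1} A^T b for min_y (1/m)||A y + b||_2^2.

    A = O(X) has linearly independent columns, so A^T A is positive definite;
    solving the normal equations avoids forming the inverse explicitly."""
    return -np.linalg.solve(A.T @ A, A.T @ b)

def objective_and_gradient(A, b, y):
    m = A.shape[0]
    r = A @ y + b
    return (r @ r) / m, (2.0 / m) * (A.T @ r)

# Toy instance: A plays the role of O(X), b the evaluation vector of the border term.
rng = np.random.default_rng(0)
A = rng.uniform(0.0, 1.0, size=(1000, 5))
b = A @ rng.standard_normal(5) + 0.01 * rng.standard_normal(1000)

y0 = ihb_starting_vector(A, b)
f0, g0 = objective_and_gradient(A, b, y0)
print(f"f(y0) = {f0:.2e}, ||grad f(y0)|| = {np.linalg.norm(g0):.2e}")
# A first-order method (AGD or a CG variant) started at y0 typically needs only a
# few iterations; if ||y0||_1 exceeds tau - 1, IHB is switched off, see (INF) below.
```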
Instead, we capitalize on the fact that the number of iterations of AGD to reach an ϵ-minimizer depends on the Euclidean distance between the starting vector and the optimal solution. IHB refers to the procedure of passing y0 = (A A) 1A b Rℓto AGD as a starting vector. Then, AGD often reaches ϵ-accuracy in one iteration. In case (A A) 1 is not computed correctly due to floating-point errors, AGD still guarantees an ϵ-accurate solution to the optimization problem. Thus, IHB can also be thought of as performing one iteration of Newton s method starting with iterate 0, see, for example, Gal antai (2000), and passing the resulting vector as a starting iterate to AGD. We stress that the dimension of A A Rℓ ℓis number-of-samples-agnostic, see Theorem 4.2. IHB also works for τ < , in which case we use CG variants to address the optimization problem in Line 5 of OAVI that takes the form c argminy P f(y), where P = {y Rℓ| y 1 τ 1}. In the problematic case y0 1 = (A A) 1A b 1 > τ 1, (INF) polynomial g = Pℓ j=1 cjtj +ui constructed in Line 6 of OAVI might not vanish approximately even though there exists h P with lt(h) = ui, ltc(h) = 1, non-leading terms only in O, and h 1 > τ that vanishes exactly over X. Thus, mse(h, X) = 0 ψ < mse(g, X) and OAVI does not append g to G and instead updates O (O {ui})σ and A (A, b) Rm (ℓ+1). Since mse(h, X) = 0, we just appended a linearly dependent column to A, making A rank-deficient, (A A) 1 ill-defined, and IHB no longer applicable. To address this problem, we could select τ adaptively, but this would invalidate the learning guarantees of OAVI that rely on τ being a constant. In practice, we fix τ 2 and stop using IHB as soon as (INF) holds for the first time, preserving the generalization bounds of OAVI. Through careful updates of (A A) 1, the discussion in Appendix C.1 implies the following complexity result for CGAVI-IHB (CGAVI with IHB). Published as a conference paper at ICLR 2023 Table 1: Overview of data sets. All data sets are binary classification data sets and are retrieved from the UCI Machine Learning Repository (Dua & Graff, 2017) and additional references are provided. Data set Full name # of samples # of features bank banknote authentication 1,372 4 credit default of credit cards (Yeh & Lien, 2009) 30,000 22 htru HTRU2 (Lyon et al., 2016) 17,898 8 skin skin (Bhatt & Dhall, 2010) 245,057 3 spam spambase 4,601 57 synthetic synthetic, see Appendix E 2,000,000 3 Corollary 4.4 (Complexity). Let X = {x1, . . . , xm} Rn, ψ 0, τ 2 large enough such that (INF) never holds, and (G, O) = CGAVI-IHB(X, ψ, τ). Assuming CG terminates after a constant number of iterations due to IHB, the time and space complexities of CGAVI-IHB are O((|G| + |O|)2m + (|G| + |O|)3) and O((|G| + |O|)m + (|G| + |O|)2), respectively. In Appendix C.2, we introduce weak inverse Hessian boosting (WIHB), a variant of IHB that speeds up CGAVI variants while preserving sparsity-inducing properties. In Figure 2, see Appendix F.1 for details, we observe that CGAVI-IHB is faster than BPCGAVI-WIHB, which is faster than BPCGAVI. In Figure 3, see Appendix F.2 for details, we compare the training times of OAVI-IHB to ABM and VCA. In Figure 3d, we observe that OAVI-IHB s training time scales better than ABM s and VCA s. 5 NUMERICAL EXPERIMENTS Unless noted otherwise, the setup for the numerical experiments applies to all experiments in the paper. 
Experiments are implemented in PYTHON and performed on an Nvidia Ge Force RTX 3080 GPU with 10GB RAM and an Intel Core i7 11700K 8x CPU at 3.60GHz with 64 GB RAM. Our code is publicly available on Git Hub. We implement OAVI as in Algorithm 1 with convex solvers CG, PCG, BPCG, and AGD and refer to the resulting algorithms as CGAVI, PCGAVI, BPCGAVI, and AGDAVI, respectively. Solvers are run for up to 10,000 iterations. For the CG variants, we set τ = 1, 000. The CG variants are run up to accuracy ϵ = 0.01 ψ and terminated early when less than 0.0001 ψ progress is made in the difference between function values, when the coefficient vector of a generator is constructed, or if we have a guarantee that no coefficient vector of a generator can be constructed. AGD is terminated early if less than 0.0001 ψ progress is made in the difference between function values for 20 iterations in a row or the coefficient vector of a generator is constructed. OAVI implemented with WIHB or IHB is referred to as OAVI-WIHB or OAVI-IHB, respectively. We implement ABM as in Limbeck (2013) but instead of applying the singular value decomposition (SVD) to the matrix corresponding to A = O(X) in OAVI, we apply the SVD to A A in case this leads to a faster training time and we employ the border as in Definition 2.5. We implement VCA as in Livni et al. (2013) but instead of applying the SVD to the matrix corresponding to A = O(X) in OAVI, we apply the SVD to A A in case this leads to a faster training time. We implement a polynomial kernel SVM with a oneversus-rest approach using the SCIKIT-LEARN software package (Pedregosa et al., 2011) and run the polynomial kernel SVM with ℓ2-regularization up to tolerance 10 3 or for up to 10,000 iterations. We preprocess with OAVI, ABM, and VCA for a linear kernel SVM as in Section 3.2 and refer to the combined approaches as OAVI , ABM , and VCA , respectively. The linear kernel SVM is implemented using the SCIKIT-LEARN software package and run with ℓ1-penalized squared hinge loss up to tolerance 10 4 or for up to 10,000 iterations. For OAVI , ABM , and VCA , the hyperparameters are the vanishing tolerance ψ and the ℓ1-regularization coefficient of the linear kernel SVM. For the polynomial kernel SVM, the hyperparameters are the degree and the ℓ2-regularization coefficient. Table 1 and 3 contain overviews of the data sets and hyperparameter values, respectively. 5.1 EXPERIMENT: PERFORMANCE We compare the performance of CGAVI-IHB , BPCGAVI-WIHB , AGDAVI-IHB , ABM , VCA , and polynomial kernel SVM on the data sets credit, htru, skin, and spam. Published as a conference paper at ICLR 2023 Table 2: Numerical results averaged over ten random 60%/40% train/test partitions with best results in bold. For approaches other than BPCGAVI-WIHB , spar (G) < 0.01 and we omit the results. Algorithms Data sets credit htru skin spam CGAVI-IHB 18.08 2.09 0.23 6.69 AGDAVI-IHB 18.08 2.09 0.23 6.69 BPCGAVI-WIHB 17.98 2.05 0.22 6.71 ABM 18.36 2.09 0.43 6.67 VCA 19.85 2.11 0.24 7.15 SVM 18.34 2.08 2.25 7.13 CGAVI-IHB 1.3 102 2.3 101 1.0 102 8.3 101 AGDAVI-IHB 1.9 102 2.8 101 1.1 102 3.1 102 BPCGAVI-WIHB 3.7 103 8.0 102 5.6 102 4.2 102 ABM 1.2 102 2.4 101 6. 101 1.7 102 VCA 2.4 101 6.2 100 1.4 101 6.5 101 SVM 8.9 101 4.1 100 7.1 102 2.2 100 CGAVI-IHB 82.40 27.90 39.00 786.60 AGDAVI-IHB 82.40 27.90 39.00 786.60 BPCGAVI-WIHB 106.40 55.20 39.00 653.00 ABM 51.40 28.70 19.30 261.00 VCA 49.80 19.00 12.40 1766.40 BPCGAVI-WIHB 0.67 0.52 0.03 0.71 Setup. 
We tune the hyperparameters on the training data using threefold cross-validation. We retrain on the entire training data set using the best combination of hyperparameters and evaluate the classification error on the test set and the hyperparameter optimization time. For the generatorconstructing methods, we also compare |G| + |O|, where |G| = P i |Gi|, |O| = P i |O|i, and (Gi, Oi) is the output of applying a generator-constructing algorithm to samples belonging to class i and the sparsity of the feature transformation G = S i Gi, which is defined as spar(G) = ( X g G gz)/( X g G ge) [0, 1], (SPAR) where for a polynomial g = Pk j=1 cjtj + t with lt(g) = t, ge = k and gz = |{cj = 0 | j {1, . . . , k}}|, that is, ge and gz are the number of non-leading and the number of zero coefficient vector entries of g, respectively. Results are averaged over ten random 60%/40% train/test partitions. Results. The results are presented in Table 2. OAVI admits excellent test-set classification accuracy. BPCGAVI-WIHB , in particular, admits the best test-set classification error on all data sets but one. Hyperparameter tuning for OAVI is often slightly slower than for ABM and VCA . Since BPCGAVI-WIHB does not employ IHB, the approach is always slower than CGAVI-IHB and AGDAVI-IHB . For all data sets but credit and skin, the hyperparameter optimization time for the SVM is shorter than for the other approaches. On skin, since the training time of the polynomial kernel SVM is superlinear in m, the hyperparameter training for the polynomial kernel SVM is slower than for the other approaches. For data sets with few features n, the magnitude of |G| + |O| is often the smallest for VCA . However, for spam, a data set with n = 57, as already pointed out by Kera & Hasegawa (2019) as the spurious vanishing problem, we observe VCA s tendency to create unnecessary generators. Finally, only BPCGAVI-WIHB constructs sparse feature transformations, potentially explaining the excellent test-set classification accuracy of the approach. In conclusion, the numerical results justify considering OAVI as the generator-constructing algorithm of choice, as it is the only approximate vanishing ideal algorithm with known learning guarantees and, in practice, performs better than or similar to related approaches. Published as a conference paper at ICLR 2023 ACKNOWLEDGEMENTS We would like to thank Gabor Braun for providing us with the main arguments for the proof of the lower bound on the pyramidal width of the ℓ1-ball. This research was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany s Excellence Strategy The Berlin Mathematics Research Center MATH+ (EXC-2046/1, project ID 390685689, BMS Stipend) and JST, ACT-X Grant Number JPMJAX200F, Japan. Herv e Abdi and Lynne J Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433 459, 2010. Maurice S Bartlett. An inverse matrix adjustment arising in discriminant analysis. The Annals of Mathematical Statistics, 22(1):107 111, 1951. Rajen Bhatt and Abhinav Dhall. Skin segmentation dataset. UCI Machine Learning Repository, 2010. Alejandro Carderera, Sebastian Pokutta, Christof Sch utte, and Martin Weiser. Cindy: Conditional gradient-based identification of non-linear dynamics noise-robust recovery. ar Xiv preprint ar Xiv:2101.02630, 2021. David Cox, John Little, and Donal O Shea. Ideals, varieties, and algorithms: an introduction to computational algebraic geometry and commutative algebra. 
Springer Science & Business Media, 2013. Dheeru Dua and Casey Graff. UCI machine learning repository. 2017. Claudia Fassino. Almost vanishing polynomials for sets of limited precision points. Journal of Symbolic Computation, 45(1):19 37, 2010. Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95 110, 1956. Aurel Gal antai. The theory of newton s method. Journal of Computational and Applied Mathematics, 124(1-2):25 44, 2000. Amir Globerson, Roi Livni, and Shai Shalev-Shwartz. Effective semisupervised learning on manifolds. In Proceedings of the Conference on Learning Theory, pp. 978 1003, 2017. Isabelle Guyon and Andr e Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157 1182, 2003. Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169 192, 2007. Daniel Heldt, Martin Kreuzer, Sebastian Pokutta, and Hennie Poulisse. Approximate computation of zero-dimensional polynomial ideals. Journal of Symbolic Computation, 44(11):1566 1591, 2009. Reza Iraji and Hamidreza Chitsaz. Principal variety analysis. In Proceedings of the Conference on Robot Learning, pp. 97 108, 2017. Hiroshi Kera. Border basis computation with gradient-weighted normalization. In Proceedings of the International Symposium on Symbolic and Algebraic Computation, ISSAC 22, pp. 225 234, 2022. Hiroshi Kera and Yoshihiko Hasegawa. Spurious vanishing problem in approximate vanishing ideal. IEEE Access, 7:178961 178976, 2019. Hiroshi Kera and Yoshihiko Hasegawa. Gradient boosts the approximate vanishing ideal. In Proceedings of the Conference on Artificial Intelligence, number 04, pp. 4428 4435, 2020. Published as a conference paper at ICLR 2023 Hiroshi Kera and Yoshihiko Hasegawa. Monomial-agnostic computation of vanishing ideals. ar Xiv preprint ar Xiv:2101.00243, 2021. Hiroshi Kera and Hitoshi Iba. Vanishing ideal genetic programming. In IEEE Congress on Evolutionary Computation, pp. 5018 5025, 2016. Andrew Kusiak. Feature transformation methods in data mining. IEEE Transactions on Electronics Packaging Manufacturing, 24(3):214 221, 2001. Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Proceedings of Advances in Neural Information Processing Systems, pp. 496 504, 2015. Evgeny S Levitin and Boris T Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1 50, 1966. Jan Limbeck. Computation of approximate border bases and applications. 2013. Roi Livni, David Lehavi, Sagi Schein, Hila Nachliely, Shai Shalev-Shwartz, and Amir Globerson. Vanishing component analysis. In Proceedings of the International Conference on Machine Learning, pp. 597 605. PMLR, 2013. Robert J Lyon, BW Stappers, Sally Cooper, John Martin Brooke, and Joshua D Knowles. Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach. Monthly Notices of the Royal Astronomical Society, 459(1):1104 1123, 2016. G Manikandan and S Abirami. Feature selection is important: state-of-the-art methods and application domains of feature selection on high-dimensional data. In Applications in Ubiquitous Computing, pp. 177 196. Springer, 2021. H Michael M oller and Bruno Buchberger. The construction of multivariate polynomials with preassigned zeros. In European Computer Algebra Conference, pp. 24 31. Springer, 1982. 
Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k2). In Doklady an USSR, volume 269, pp. 543 547, 1983. Dipanjyoti Paul, Anushree Jain, Sriparna Saha, and Jimson Mathew. Multi-objective pso based online feature selection for multi-label classification. Knowledge-Based Systems, 222:106966, 2021. Fabian Pedregosa, Ga el Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12:2825 2830, 2011. Javier Pena and Daniel Rodriguez. Polytope conditioning and linear convergence of the frank wolfe algorithm. Mathematics of Operations Research, 44(1):1 18, 2019. Jack Sherman and Winifred J Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124 127, 1950. Johan AK Suykens and Joos Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293 300, 1999. Kazuma K Tsuji, Ken ichiro Tanaka, and Sebastian Pokutta. Pairwise conditional gradients without swap steps and sparser kernel herding. In Proceedings of the International Conference on Machine Learning, pp. 21864 21883. PMLR, 2022. Laurens Van Der Maaten, Eric Postma, Jaap Van den Herik, et al. Dimensionality reduction: a comparative. Journal of Machine Learning Research, 10(66-71):13, 2009. Rene Vidal, Yi Ma, and Shankar Sastry. Generalized principal component analysis (gpca). IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1945 1959, 2005. Lu Wang and Tomoaki Ohtsuki. Nonlinear blind source separation unifying vanishing component analysis and temporal structure. IEEE Access, 6:42837 42850, 2018. Published as a conference paper at ICLR 2023 Elias Wirth and Sebastian Pokutta. Conditional gradients for the approximate vanishing ideal. ar Xiv preprint ar Xiv:2202.03349, 2022. I-Cheng Yeh and Che-Hui Lien. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2): 2473 2480, 2009. Yan-Guo Zhao and Zhan Song. Hand posture recognition using approximate vanishing ideal generators. In Proceedings of the International Conference on Image Processing, pp. 1525 1529. IEEE, 2014. Published as a conference paper at ICLR 2023 A MISSING PROOFS Proof of Theorem 4.2. Let X = {x1, . . . , xm} [0, 1]n and let t1, . . . , tn T1 be the degree-1 monomials. Suppose that during OAVI s execution, for some degree d N, OAVI checks whether the term u = Qn j=1 tαj j d O, where Pn j=1 αj = d and αj N for all j {1, . . . , n}, is the leading term of a (ψ, 1, τ)-approximately vanishing generator with non-leading terms only in O. Let 21 αj , (A.1) where 1 denotes the constant-1 monomial, that is, 1(x) = 1 for all x Rn. Note that h 1 3 2 d τ, guaranteeing that h can be constructed in Lines 5 6 of OAVI. Since u d O, for any term t T such that t | u and t = u, it holds that t O and thus, h is a polynomial with ltc(h) = 1, lt(h) = u, and other terms only in O. Under the assumption that the convex optimization oracle in OAVI is accurate, mse(h, X) mse(g, X), where g is the polynomial constructed during Lines 5 and 6 of OAVI. Hence, proving that h vanishes approximately implies that g vanishes approximately. Note that for all x X and t T1, it holds that t(x) [0, 1], and, thus, |t(x) 1 2. 
We obtain mse(g, X) mse(h, X) Since mse(g, X) 4 d ψ is satisfied for d D := l log(ψ) log(4) m , OAVI terminates after reaching degree D. Thus, at the end of OAVI s execution, |G| + |O| D+n D (D + n)D. B BLENDED PAIRWISE CONDITIONAL GRADIENTS ALGORITHM (BPCG) The blended pairwise conditional gradients algorithm (BPCG) (Tsuji et al., 2022) is presented in Algorithm 2. C ADDITIONAL INFORMATION ON IHB We discuss the computational complexity of IHB in Appendix C.1 and WIHB in Appendix C.2. C.1 COMPUTATIONAL COMPLEXITY OF IHB The main cost of IHB is the inversion of the matrix A A Rℓ ℓ, which generally requires O(ℓ3) elementary operations. Since OAVI solves a series of quadratic convex optimization problems that differ from each other only slightly, we can maintain and update (A A) 1 using O(ℓ2) instead of O(ℓ3) elementary operations. Theorem C.1 (IHB update cost). Let A Rm ℓ, A A Rℓ ℓ, (A A) 1, and b Rℓbe given. In case b 2 > 0 and b A(A A) 1A b = b 2 2, A = (A, b) Rm (ℓ+1), A A R(ℓ+1) (ℓ+1), ( A A) 1 R(ℓ+1) (ℓ+1) (C.1) can be constructed in O(ℓm + ℓ2) elementary operations. Published as a conference paper at ICLR 2023 Algorithm 2: Blended pairwise conditional gradients algorithm (BPCG) Input : A smooth and convex function f, a starting vertex y0 vert(P). Output: A point y T P. 1 S(0) {y0} 2 λ(0) y0 1 3 λ(0) y0 0 for v vert(P) \ {y0} 4 for t = 0, . . . , T 1 do 5 at argmaxv S(t) f(yt), v away vertex 6 qt argminv S(t) f(yt), v local FW vertex 7 wt argminv vert(P ) f(yt), v FW vertex 8 if f(yt), wt yt f(yt), qt at then 10 γt argminγ [0,λ(t) at ] f(yt + γdt) 11 λ(t+1) v λ(t) v for v vert(P) \ {at, qt} 12 λ(t+1) at λ(t) at γt 13 λ(t+1) qt λ(t) qt + γt 15 dt wt yt 16 γt argminγ [0,1] f(yt + γdt) 17 λ(t+1) v (1 γt)λ(t) v for v vert(P) \ {wt} 18 λ(t+1) wt (1 γt)λ(t) wt + γt 20 S(t+1) {v vert(P) | λ(t+1) v > 0} 21 yt+1 yt + γtdt Proof. Let B = A A Rℓ ℓ, B = A A R(ℓ+1) (ℓ+1), N = B 1 Rℓ ℓ, and N = B 1 = ( A A) 1 R(ℓ+1) (ℓ+1). In O(mℓ) elementary operations, we can construct A b Rℓ, b b = b 2 2 R, and A = (A, b) Rm (ℓ+1) and in additional O(ℓ2) elementary operations, we can construct B = A A = B A b b A b 2 2 R(ℓ+1) (ℓ+1). We can then compute N = B 1 = ( A A) 1 R(ℓ+1) (ℓ+1) in additional O(ℓ2) elementary operations. We write N = N1 n2 n 2 n3 R(ℓ+1) (ℓ+1), where N1 Rℓ ℓ, n2 Rℓ, and n3 R. Then, it has to hold that B N = B A b b A b 2 2 N1 n2 n 2 n3 = I R(ℓ+1) (ℓ+1), where I is the identity matrix. Note that b A n2 + b 2 2 n3 = 1. Thus, n3 = 1 b A n2 b 2 2 R, (C.2) which is well-defined due to the assumption b 2 > 0. Since b A is already computed, once n2 is computed, the computation of n3 requires only additional O(ℓ) elementary operations. Similarly, we have that B n2 + A b n3 = 0 Rℓ. (C.3) Published as a conference paper at ICLR 2023 Plugging (C.2) into (C.3), we obtain (B A bb A b 2 2 ) n2 = A b b 2 2 . Thus, n2 = B A bb A The existence of (B A bb A b 2 2 ) 1 follows from the Sherman-Morrison formula (Sherman & Mor- rison, 1950; Bartlett, 1951) and the assumption that b A(A A) 1A b = b 2 2. Then, again using the Sherman-Morrison formula, n2 = B 1 + B 1A bb AB 1 b 2 2 b AB 1A b which can be computed using additional O(ℓ2) elementary operations. Finally, we construct N1, which is determined by B N1 + A b n 2 = I Rℓ ℓ. Thus, N1 = B 1 B 1A b n 2 Rℓ ℓ, which can be computed in additional O(ℓ2) elementary operations since B 1A b Rℓis already computed. In summary, we require O(ℓm + ℓ2) elementary operations. Note that even if we do not compute ( A A) 1, we still require O(ℓm + ℓ2) elementary operations. 
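For concreteness, the column-append update analyzed in Theorem C.1 can be realized, for example, via the equivalent block-inverse (Schur complement) identity. The sketch below is illustrative rather than the authors' implementation; the function name and the random test instance are assumptions.

```python
import numpy as np

def append_column_inverse(A, N, b):
    """Given A (m x l), N = (A^T A)^{-1}, and a new column b, return
    A_bar = (A, b) and (A_bar^T A_bar)^{-1} in O(l*m + l^2) operations.

    Uses the block-inverse (Schur complement) identity, equivalent to the
    Sherman-Morrison-based derivation in Theorem C.1; valid as long as
    b^T b > 0 and b^T A (A^T A)^{-1} A^T b != b^T b, i.e., b is not in the
    column span of A."""
    u = A.T @ b                    # O(l*m)
    Nu = N @ u                     # O(l^2)
    s = float(b @ b - u @ Nu)      # Schur complement of A^T A in A_bar^T A_bar
    top_left = N + np.outer(Nu, Nu) / s
    top_right = -Nu[:, None] / s
    N_bar = np.block([[top_left, top_right],
                      [top_right.T, np.array([[1.0 / s]])]])
    return np.column_stack([A, b]), N_bar

# Sanity check against direct inversion.
rng = np.random.default_rng(0)
A = rng.uniform(size=(500, 6))
b = rng.uniform(size=500)
N = np.linalg.inv(A.T @ A)
A_bar, N_bar = append_column_inverse(A, N, b)
print(np.allclose(N_bar, np.linalg.inv(A_bar.T @ A_bar)))   # True
```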
In the remark below, we discuss the literature related to IHB. Remark C.2 (Work related to IHB). The proof of Theorem C.1 is similar to the proof that the inverse of the Hessian can be updated efficiently in the online Newton algorithm (Hazan et al., 2007). Both proofs rely on the Sherman-Morrison formula (Sherman & Morrison, 1950; Bartlett, 1951). However, in our setting, the updates occur column-wise instead of row-wise. Updating (A A) 1 column-wise for generator-constructing algorithms was already discussed in Limbeck (2013) using QR decompositions without addressing the numerical instability of inverse matrix updates. Next, we prove that Theorem C.1 implies the improved computational complexity of CGAVI-IHB in Corollary 4.4. Proof of Corollary 4.4. Suppose that CGAVI-IHB is currently executing Lines 5 11 of Algorithm 1 for a particular term ui d O and recall the associated notations ℓ= |O|, A = O(X) Rm ℓ, b = ui(X), f(y) = 1 m Ay + b 2 2, and P = {y Rℓ| y 1 τ 1}. Then, the optimization problem in Line 5 of Algorithm 1 takes the form c argminy P f(y). We first prove that the violation of any of the two assumptions of Theorem C.1 implies the existence of a (ψ, 1, τ)-approximately vanishing generator g. We prove this claim by treating the two assumptions separately: 1. If b 2 = 0, it holds that mse(ui, X) = 0. By the assumption that τ 2 is large enough to guarantee that (INF) never holds, CGAVI-IHB constructs a (ψ, 1, τ)-approximately vanishing generator in Line 6. 2. If b A(A A) 1A b = b 2 2, it holds that f( (A A) 1A b) = 0. Thus, by the assumption that τ 2 is large enough to guarantee that (INF) never holds, CGAVI-IHB constructs a (ψ, 1, τ)-approximately vanishing generator in Line 6. Since an update of (A A) 1 is necessary only if Line 10 is executed, that is, when there does not exist a (ψ, 1, τ)-approximately vanishing generator g, we never have to update (A A) 1 when the assumptions of Theorem C.1 are violated. Thus, by Theorem C.1, we can always update A, A A, and (A A) 1 in O(ℓm + ℓ2) O((|G| + |O|)m + (|G| + |O|)2) elementary operations using space O(ℓm + ℓ2) O((|G| + |O|)m + (|G| + |O|)2). Then, since we run CG for a constant number of iterations, the time and space complexities of Lines 5 11 are TORACLE = O((|G| + |O|)m + (|G| + |O|)2) and SORACLE = O((|G| + |O|)m + (|G| + |O|)2), respectively. Note that the time and space complexities in Corollary 4.4 also hold for AGDAVI-IHB with τ = . Published as a conference paper at ICLR 2023 Algorithm 3: Blended pairwise conditional gradients approximate vanishing ideal algorithm with weak inverse Hessian boosting (BPCGAVI-WIHB) Input : A data set X = {x1, . . . , xm} Rn and parameters ψ ϵ 0 and τ 2. Output: A set of polynomials G P and a set of monomials O T . 1 d 1, O = {t1}σ {1}σ, G 2 while d O = {u1, . . . , uk}σ = do 3 for i = 1, . . . , k do 4 ℓ |O| N, A O(X) Rm ℓ, b ui(X) Rm, P = {y Rℓ| y 1 τ 1} 5 y0 (A A) 1A b Rℓ 6 Solve c argminy P 1 m Ay + b 2 2 up to tolerance ϵ using CG with starting vector y0. 7 g Pℓ j=1 cjtj + ui a non-sparse polynomial 8 if mse(g, X) ψ then check whether g vanishes 9 Solve d argminy P 1 m Ay + b 2 2 up to tolerance ϵ using BPCG and a vertex of P as starting vector. 10 h Pℓ j=1 djtj + ui a sparse polynomial 11 if mse(h, X) ψ then check whether h vanishes 17 O = {t1, . . . , tℓ+1}σ (O {ui})σ C.2 WEAK INVERSE HESSIAN BOOSTING (WIHB) So far, we have introduced IHB to drastically speed-up training of AGDAVI and CGAVI. 
A drawback of using CGAVI-IHB is that the initialization of CG variants with a non-sparse initial vector such as y0 = (A A) 1A b leads to the construction of generally non-sparse generators. In this section, we explain how to combine the speed-up of IHB with the sparsity-inducing properties of CG variants, referring to the resulting technique as weak inverse Hessian boosting (WIHB). Specifically, we present WIHB with BPCGAVI (OAVI with solver BPCG), referred to as BPCGAVI-WIHB in Algorithm 3. The high-level idea of BPCGAVI-WIHB is to use IHB and vanilla CG to quickly check whether a (ψ, 1, τ)-approximately vanishing generator exists. If it does, we then use BPCG to try and construct a (ψ, 1, τ)-approximately vanishing generator that is also sparse. We proceed by giving a detailed overview of BPCGAVI-WIHB. In Line 5 of BPCGAVI-WIHB, we construct y0. Then, in Line 6, we solve c argminy P 1 m Ay + b 2 2 up to tolerance ϵ using CG with starting vector y0 = (A A) 1A b. Since y0 is a well-educated guess for the solution to the constrained optimization problem argminy P 1 m Ay + b 2 2 in Line 6, CG often runs for only very few iterations. The drawback of using the non-sparse y0 as a starting vector is that c constructed in Line 6 and g constructed in Line 7 are generally non-sparse. We alleviate the issue of non-sparsity of g in Lines 8 18. In Line 8, we first check whether g is a (ψ, 1, τ)-approximately vanishing generator of X. If g does not vanish approximately, we know that there does not exist an approximately vanishing generator with leading term ui and we append ui to O in Line 17. If, however, mse(g, X) ψ, we solve the constrained convex optimization problem in Line 6 again in Line 9 up to tolerance ϵ using BPCG and a vertex of the ℓ1-ball P as starting vector. This has two consequences: 1. The vector d constructed in Line 9 tends to be sparse, as corroborated by the results in Table 2. Published as a conference paper at ICLR 2023 2. The execution of Line 9 tends to take longer than the execution of Line 6 since BPCG s starting vector is not necessarily close in Euclidean distance to the optimal solution. Then, in Line 10, we construct the polynomial h, which tends to be sparse. If h is a (ψ, 1, τ)- approximately vanishing generator, we append the sparse h to G in Line 12. If it happens that mse(g, X) ψ < mse(h, X), we append the non-sparse g to G in Line 14. Following the discussion of Section 4.3, should (INF) ever hold, that is, y0 1 > τ 1, we can no longer proceed with BPCGAVI-WIHB as we can no longer guarantee that the inverse of A A exists. In that case, we proceed with vanilla BPCGAVI for all terms remaining in the current and upcoming borders. The discussion above illustrates that BPCGAVI-WIHB solves argminy P 1 m Ay + b 2 2 with BPCG |G| times as opposed to BPCGAVI, which solves argminy P 1 m Ay+b 2 2 with BPCG |G|+|O| 1 times. Thus, BPCGAVI-WIHB combines the sparsity-inducing properties of BPCG with the speed-up of IHB without any of IHB s drawbacks. The speed-up of BPCGAVI-WIHB compared to BPCGAVI is evident from the numerical experiments in Figure 2 and the results of Table 2 indicate that the sparsity-inducing properties of BPCG are successfully exploited. 
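The two-stage logic of Algorithm 3 can be sketched as follows. For brevity, vanilla Frank-Wolfe over the l1-ball stands in for the CG and BPCG solvers, the instance is a toy example, and all function names are assumptions of this sketch; it illustrates the control flow (warm-started feasibility check, then a sparsity-seeking solve from a vertex), not the authors' implementation.

```python
import numpy as np

def lmo_l1(grad, radius):
    """Linear minimization oracle over the l1-ball of the given radius."""
    i = int(np.argmax(np.abs(grad)))
    v = np.zeros_like(grad)
    v[i] = -radius * np.sign(grad[i])
    return v

def frank_wolfe(A, b, radius, y0, max_iter=10000, tol=1e-10):
    """Vanilla Frank-Wolfe for min_{||y||_1 <= radius} (1/m)||A y + b||_2^2."""
    m = A.shape[0]
    y = y0.copy()
    for _ in range(max_iter):
        grad = (2.0 / m) * (A.T @ (A @ y + b))
        v = lmo_l1(grad, radius)
        d = v - y
        if -grad @ d <= tol:              # Frank-Wolfe gap small enough
            break
        Ad = A @ d
        gamma = np.clip(-(grad @ d) / ((2.0 / m) * (Ad @ Ad) + 1e-18), 0.0, 1.0)
        y = y + gamma * d                 # exact line search for the quadratic
    return y

def wihb_step(A, b, psi, tau):
    """Sketch of the two-stage WIHB logic (Algorithm 3, Lines 5-14)."""
    m, ell = A.shape
    radius = tau - 1.0
    y0 = -np.linalg.solve(A.T @ A, A.T @ b)        # IHB starting vector
    c = frank_wolfe(A, b, radius, y0)              # warm-started check
    if np.sum((A @ c + b) ** 2) / m > psi:
        return None                                # no generator with this leading term
    vertex = np.zeros(ell)
    vertex[0] = radius                             # cold start at a vertex of P
    d = frank_wolfe(A, b, radius, vertex)          # sparsity-seeking solve
    return d if np.sum((A @ d + b) ** 2) / m <= psi else c

# Toy instance in which a vanishing coefficient vector exists.
rng = np.random.default_rng(0)
A = rng.standard_normal((300, 4))
b = -A @ np.array([0.5, 0.0, 0.0, -0.25])          # exact, sparse coefficient vector
print(wihb_step(A, b, psi=1e-3, tau=2.0))
```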
D PYRAMIDAL WIDTH OF THE ℓ1-BALL Pena & Rodriguez (2019) showed that the pyramidal width ω of the ℓ1-ball of radius τ, that is, P = {x Rℓ| x 1 τ} Rℓis given by ω = min F faces (P ), F P dist (F, conv(vert(P) \ F)), where, for a polytope P Rℓ, faces (P) denotes the set of faces of P and vert(P) denotes the set of vertices of P, for a set F Rℓ, conv(F) is the convex hull of F, and for two disjoint sets F, G Rℓ, dist (F, G) is the Euclidean distance between F and G. Lemma D.1 (Pyramidal width of the ℓ1-ball). The pyramidal width of the ℓ1-ball of radius τ, that is, P = {x Rℓ| x 1 τ} Rℓ, is lower bounded by ω τ ℓ 1. Proof. Let e(i) Rℓdenote the ith unit vector. Note that a non-trivial face of P = conv{ τe(1), . . . , τe(n)} Rℓthat is not P itself cannot contain both τe(i) and τe(i) for any i {1, . . . , ℓ}. Thus, due to symmetry, we can assume that any non-trivial face of P that is not P itself is of the form F = conv({τe(1), . . . , τe(k)}) for some k {1, . . . , ℓ 1}. Then, G := conv(vert(P) \ F) = conv({ τe(1), . . . , τe(k), τe(k+1), . . . , τe(ℓ)}). We have that dist (F, G) = min u F,v G u v 2 = min u F,v G i=1 (ui vi)2 + min u F,v G i=1 (ui vi)2 since v2 j 0 for all j {k + 1, . . . , ℓ} i=1 u2 i since vi 0 ui for all i {1, . . . , k} k is minimized for k {1, . . . , ℓ 1} as large as possible, it follows that ω τ ℓ 1. Published as a conference paper at ICLR 2023 Table 3: Hyperparameter ranges for numerical experiments. Hyperparameters Values ψ 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001 Regularization coefficient for SVMs 0.1, 1, 10 Degree for polynomial kernel SVM 1, 2, 3, 4 E THE SYNTHETIC DATA SET The synthetic data set consists of three features and contains two different classes. The samples x that belong to the first class are generated such that they satisfy the equation x2 1 + 0.01x2 + x2 3 1 = 0 and samples that belong to the second class are generated such that they satisfy the equation x2 1 + x2 3 1.3 = 0. The samples are perturbed with additive Gaussian noise with mean 0 and standard deviation 0.05. F ADDITIONAL NUMERICAL EXPERIMENTS In this section, we provide details for the numerical experiments conducted in Figures 1, 2, and 3. F.1 EXPERIMENT: SPEEDING UP CGAVI We compare the training times of PCGAVI, BPCGAVI, BPCGAVI-WIHB, and CGAVI-IHB on the data sets bank, htru, skin, and synthetic. Setup. For a single run, we randomly split the data set into subsets of varying sizes. Then, for fixed ψ = 0.005, we run the generator-constructing algorithms on subsets of the full data set of varying sizes and plot the training times, which are the times required to run PCGAVI, BPCGAVI, BPCGAVI-WIHB, and CGAVI-IHB once for each class. The results are averaged over ten random runs and standard deviations are shaded in Figures 1 and 2. Results. The results are presented in Figures 1 and 2. In Figure 1, we observe that the training times for BPCGAVI are often shorter than for PCGAVI, except for the skin data set. In Figure 2, we observe that IHB not only leads to theoretically better training times but also speeds up training in practice: CGAVI-IHB is always faster than BPCGAVI. Furthermore, WIHB is indeed the best of both worlds: BPCGAVI-WIHB is always faster than BPCGAVI but preserves the sparsity-inducing properties of BPCG. The latter can also be seen in Table 2, where BPCGAVI-WIHB is the only algorithmic approach that constructs sparse generators. 
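For reference, data of the kind described in Appendix E can be generated, for example, as follows. The paper specifies only the two defining equations and the noise level; how the free coordinates are sampled (uniform angles, uniform x2) is an assumption of this sketch.

```python
import numpy as np

def make_synthetic(m, seed=0, noise=0.05):
    """Toy generator for data of the kind described in Appendix E.

    Before noise is added, class 0 points satisfy x1^2 + 0.01*x2 + x3^2 - 1 = 0
    and class 1 points satisfy x1^2 + x3^2 - 1.3 = 0."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, m)
    theta = rng.uniform(0.0, 2.0 * np.pi, m)
    x2 = rng.uniform(0.0, 1.0, m)
    r = np.where(y == 0, np.sqrt(1.0 - 0.01 * x2), np.sqrt(1.3))
    X = np.column_stack([r * np.cos(theta), x2, r * np.sin(theta)])
    X += rng.normal(0.0, noise, X.shape)          # additive Gaussian noise
    return X, y

X, y = make_synthetic(2_000_000)
print(X.shape, np.bincount(y))
```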
F.2 EXPERIMENT: SCALABILITY COMPARISON

We compare the training times of CGAVI-IHB, AGDAVI-IHB, ABM, and VCA on the data sets bank, htru, skin, and synthetic.

Setup. For a single run, on at most 10,000 samples of the data set, we tune the hyperparameters of the generator-constructing algorithms OAVI, ABM, and VCA and a subsequently applied linear kernel SVM using threefold cross-validation. Then, using the determined hyperparameters, we run only the generator-constructing algorithms on subsets of the full data set of varying sizes and plot the training times, which are the times required to run CGAVI-IHB, AGDAVI-IHB, ABM, and VCA once for each class. The results are averaged over ten random runs and standard deviations are shaded in Figure 3.

Results. The results are presented in Figure 3. For small data sets, we observe that ABM and VCA are faster than OAVI. However, when the number of samples in the data set is large, as in the synthetic data set, OAVI can be trained faster than ABM and VCA. AGDAVI-IHB is slower than CGAVI-IHB because AGD cannot use the Frank-Wolfe gap as an early termination criterion to quickly decide that a solution close enough to the optimum has been reached.
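For reference, the number-of-samples-agnostic bound of Theorem 4.2 that underlies this scaling behavior is easy to evaluate: for vanishing parameter psi in (0, 1), n features, and tau >= (3/2)^D, OAVI constructs generators of degree at most D = ceil(-log(psi)/log(4)) and |G| + |O| <= binom(D + n, D), independently of m. The following helper is an illustrative sketch, not part of the paper's code.

```python
from math import ceil, comb, log

def oavi_output_bound(psi, n):
    """Bound from Theorem 4.2: maximal degree D and binomial bound on |G| + |O|,
    both independent of the number of samples m."""
    D = ceil(-log(psi) / log(4))
    return D, comb(D + n, D)

for psi in (0.1, 0.005, 0.0001):
    for n in (3, 8, 57):
        D, bound = oavi_output_bound(psi, n)
        print(f"psi={psi:<7} n={n:<3} D={D}  |G|+|O| <= {bound}")
```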