Learning to compute Gröbner bases

Hiroshi Kera (Chiba University, Zuse Institute Berlin) kera@chiba-u.jp; Yuki Ishihara (Nihon University) ishihara.yuki@nihon-u.ac.jp; Yuta Kambe (Mitsubishi Electric) kambe.yuta@bx.mitsubishielectric.co.jp; Tristan Vaccon (Limoges University) tristan.vaccon@unilim.fr; Kazuhiro Yokoyama (Rikkyo University) kazuhiro@rikkyo.ac.jp

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Solving a polynomial system, or computing an associated Gröbner basis, has been a fundamental task in computational algebra. However, it is also known for its notorious doubly exponential time complexity in the number of variables in the worst case. This paper is the first to address the learning of Gröbner basis computation with Transformers. The training requires many pairs of a polynomial system and the associated Gröbner basis, raising two novel algebraic problems: the random generation of Gröbner bases and the transformation of them into non-Gröbner ones, termed the backward Gröbner problem. We resolve these problems with 0-dimensional radical ideals, the ideals appearing in various applications. Further, we propose a hybrid input embedding to handle coefficient tokens with a continuity bias and avoid the growth of the vocabulary set. The experiments show that our dataset generation method is a few orders of magnitude faster than a naive approach, overcoming a crucial challenge in learning to compute Gröbner bases, and that Gröbner basis computation is learnable in a particular class.

1 Introduction

Understanding the properties of polynomial systems and solving them have been a fundamental problem in computational algebra and algebraic geometry, with vast applications in cryptography [8, 93], control theory [71], statistics [27, 41], computer vision [78], systems biology [60], and so forth. Special sets of polynomials called Gröbner bases [16] play a key role to this end. In linear algebra, Gaussian elimination simplifies or solves a system of linear equations by transforming its coefficient matrix into the reduced row echelon form. Similarly, a Gröbner basis can be regarded as a reduced form of a given polynomial system, and its computation is a generalization of Gaussian elimination to general polynomial systems. However, computing a Gröbner basis is known for its notoriously bad computational cost in theory and practice. It is an NP-hard problem with doubly exponential worst-case time complexity in the number of variables [29, 67]. Nevertheless, because of its importance, various algorithms have been proposed in computational algebra to obtain Gröbner bases in better runtime. Examples include Faugère's F4/F5 algorithms [33, 34] and M4GB [66].

In this study, we investigate Gröbner basis computation from a learning perspective, envisioning it as a practical compromise for large-scale polynomial system solving and understanding when mathematical algorithms are computationally intractable. The learning approach does not require explicit design of computational procedures, and we only need to train a model using a large number of (non-Gröbner set, Gröbner basis) pairs. Further, if we restrict ourselves to a particular class of Gröbner bases (or associated ideals), the model may internally find some patterns useful for prediction. The success of learning indicates the existence of such patterns, which encourages the improvement of mathematical algorithms and heuristics.
Several recent studies have already addressed mathematical tasks via learning, particularly using Transformers [14, 19, 58]. For example, [58] showed that Transformers can learn symbolic integration simply by observing many (df/dx, f) pairs in training. The training samples are generated by first randomly generating f and computing its derivative df/dx and/or by the reverse process. However, a crucial challenge in the learning of Gröbner basis computation is that it is mathematically unknown how to efficiently generate many (non-Gröbner set, Gröbner basis) pairs. We need an efficient backward approach (i.e., solution-to-problem computation) because, as discussed above, the forward approach (i.e., problem-to-solution computation) is prohibitively expensive. To this end, we frame two problems: (i) the random generation of Gröbner bases and (ii) the backward transformation from a Gröbner basis to an associated non-Gröbner set. To our knowledge, neither of them has been addressed in the study of Gröbner bases because of the lack of motivation; all efforts have been dedicated to the forward computation from a non-Gröbner set to a Gröbner basis.

Another challenge in the learning approach using Transformers lies in the tokenization of polynomials over infinite fields, such as R and Q. To cover a wide range of coefficients, one has to either prepare numerous number tokens or split a number into digits. The former requires a large embedding matrix, and the latter incurs large attention matrices due to the lengthy sequences. To resolve this, we introduce a continuous embedding scheme, which embeds coefficient tokens by a small network and avoids the tradeoff between vocabulary size and sequence length. The continuity of the function realized by the network naturally implements the continuity of the numbers in the embedding.

We summarize the contributions as follows.
- We investigate the first learning approach to Gröbner basis computation using Transformers and experimentally show its learnability. Unlike most prior studies, our results indicate that training a Transformer may be a compromise for NP-hard problems for which no efficient (even approximate or probabilistic) algorithms have been designed.
- We uncover two unexplored algebraic problems, the random generation of Gröbner bases and the backward Gröbner problem, and propose efficient methods to address them in the 0-dimensional case. The problems are essential to the learning approach but are also algebraically interesting and require interaction between computational algebra and machine learning.
- We propose a new input embedding to efficiently handle a large range of coefficients without the tradeoff between the size of the embedding matrix and that of the attention maps.
- Our experiments show that the proposed dataset generation is highly efficient and faster than a baseline method by a few orders of magnitude. Further, we observe a learnability gap between polynomials over finite fields and infinite fields, while predicting polynomial supports is more tractable.

2 Related Work

Gröbner basis computation. Gröbner bases are one of the fundamental concepts in algebraic geometry and commutative ring theory [24, 39]. Through its computational aspect, a Gröbner basis is a very useful tool for analyzing the mathematical structure of the solutions of algebraic constraints. Notably, the form of Gröbner bases is suited for finding solutions and allows parametric coefficients, and thus, it is vital to make Gröbner basis computation efficient and practical in applications.
Following the definition of Gröbner bases in [16], the original algorithm to compute them can be presented as follows: (i) create potential new leading terms by constructing S-polynomials, (ii) reduce them either to zero or to new polynomials for the Gröbner basis, and (iii) repeat until no new S-polynomials can be constructed. Plenty of work has been developed to surpass this algorithm. There are four main strategies: (a) avoiding unnecessary S-polynomials based on the F5 algorithm and the more general signature-based algorithms [9, 34]; machine learning appeared for this task in [73]. (b) More efficient reduction using linear-algebraic computations [33] and, very recently, GPUs [12]. (c) Performing modular computations, following [6, 70], to prevent coefficient growth during the computation. (d) Using the structure of the ideal, e.g., [13, 35] for the change of term ordering for 0-dimensional ideals or [83] when the Hilbert function is known. In this study, we present a fifth strategy: (e) Gröbner basis computation fully via learning, without specifying any mathematical procedures.

Transformers for mathematics. Recent studies have revealed that Transformers can be used for mathematical reasoning and symbolic computation. The training only requires samples (i.e., problem-solution pairs), and no explicit mathematical procedures need to be specified. The first study that uses Transformers for mathematical problems is presented in [58]. It showed that Transformers can learn symbolic integration and differential equation solving when trained with sufficiently many and diverse samples. Since then, Transformers have been applied to checking local stability and controllability of differential equations [22], polynomial simplification [3], linear algebra [19, 20], symbolic regression [14, 26, 44, 45], Lyapunov function design [4], and attacking LWE cryptography [61, 89]. In [75], comprehensive experiments over various mathematical tasks are provided. In contrast to the aforementioned studies, we found that Gröbner basis computation, an NP-hard problem, even has an algebraic challenge in the dataset generation. This paper introduces unexplored algebraic problems and provides an efficient algorithm for a special but important case (i.e., 0-dimensional ideals), thereby realizing an experimental validation of the learning of Gröbner basis computation.

3 Notations and Definitions

We introduce the necessary notations and definitions in this section. The reader interested in a gentle introduction to Gröbner basis theory can refer to the classical book [25]. For convenience, we have moreover compiled the most elementary additional definitions and notations in App. A.

We consider a polynomial ring k[x1, . . . , xn] with a field k and variables x1, . . . , xn. For a set F ⊆ k[x1, . . . , xn], the ideal generated by F is denoted by ⟨F⟩. Once a term order ≻ on the terms of k[x1, . . . , xn] is fixed, one can define leading terms and Gröbner bases.

Definition 3.1 (Leading term). Let F = {f1, . . . , fs} ⊆ k[x1, . . . , xn] and let ≻ be a term order. The leading term LT(fi) of fi is the largest term in fi with respect to ≻. The leading term set of F is LT(F) = {LT(f1), . . . , LT(fs)}.

Definition 3.2 (Gröbner basis). Fix a term order ≻. A finite subset G of an ideal I is said to be a ≻-Gröbner basis of I if ⟨LT(G)⟩ = ⟨LT(I)⟩.

The condition ⟨LT(G)⟩ = ⟨LT(I)⟩ means that for any element h ∈ I, the leading term LT(h) is divisible by the leading term LT(g) of some element g ∈ G.
This gives a complete test of whether a given polynomial h is in I or not by polynomial division by G, similar to Gaussian elimination with a basis of a vector space. A remainder of 0 means that h ∈ I, and a non-zero remainder means that h ∉ I. This is related to finding solutions. Roughly speaking, if h ∈ I = ⟨f1, . . . , fs⟩, then we have a form h = Σ_{i=1}^s hi·fi, and every solution of the system f1(x1, . . . , xn) = · · · = fs(x1, . . . , xn) = 0 is also a solution of the equation h(x1, . . . , xn) = 0. Note that ⟨LT(G)⟩ ⊆ ⟨LT(I)⟩ is trivial from G ⊆ I. The nontriviality of a Gröbner basis lies in ⟨LT(G)⟩ ⊇ ⟨LT(I)⟩; that is, a finite number of leading terms can generate the leading term of any polynomial in the infinite set I. The Hilbert basis theorem [25] guarantees that every ideal I ≠ {0} has a Gröbner basis. Moreover, using the multivariate division algorithm, one gets that any Gröbner basis G of an ideal I generates I. We are particularly interested in the reduced Gröbner basis G of I = ⟨F⟩, which is unique once the term order ≻ is fixed.

Intuition of Gröbner bases and system solving. Let G = {g1, . . . , gt} be a Gröbner basis of an ideal ⟨F⟩ = ⟨f1, . . . , fs⟩. The polynomial system g1(x1, . . . , xn) = · · · = gt(x1, . . . , xn) = 0 is a simplified form of f1(x1, . . . , xn) = · · · = fs(x1, . . . , xn) = 0 with the same solution set. With the term order lex, G has the form g1 ∈ k[x_{n1}, . . . , xn], g2 ∈ k[x_{n2}, . . . , xn], . . . , gt ∈ k[x_{nt}, . . . , xn] with n1 ≤ n2 ≤ · · · ≤ nt, which may be regarded as the reduced row echelon form of a polynomial system. In our particular case (i.e., 0-dimensional ideals in shape position; cf. Sec. 4.2), we have (n1, n2, . . . , nt) = (1, 2, . . . , n). Thus, one can obtain the solutions of the polynomial system using backward substitution, i.e., by first solving the univariate polynomial gt, next solving the bivariate polynomial gt−1, which becomes univariate after substituting the solutions of gt, and so forth.

Other notations. The subset k[x1, . . . , xn]_{≤d} ⊆ k[x1, . . . , xn] denotes the set of all polynomials of total degree at most d. For a polynomial matrix A ∈ k[x1, . . . , xn]^{s×s}, its determinant is given by det(A) ∈ k[x1, . . . , xn]. The set Fp with a prime number p denotes the finite field of order p. The set ST(n, k[x1, . . . , xn]) denotes the set of upper-triangular matrices with all-one diagonal entries (i.e., unimodular upper-triangular matrices) with entries in k[x1, . . . , xn]. The total degree of f ∈ k[x1, . . . , xn] is denoted by deg(f).

4 New Algebraic Problems

Our goal is to realize Gröbner basis computation through a machine learning model. To this end, we need a large training set {(Fi, Gi)}_{i=1}^m with finite polynomial sets Fi ⊆ k[x1, . . . , xn] and Gröbner bases Gi of ⟨Fi⟩. As the computation from Fi to Gi is computationally expensive in general, we instead resort to backward generation (i.e., a solution-to-problem process); that is, we generate a Gröbner basis Gi randomly and transform it into a non-Gröbner set Fi. What makes the learning of Gröbner basis computation hard is that, to our knowledge, neither (i) the random generation of Gröbner bases nor (ii) the backward transform from a Gröbner basis to a non-Gröbner set has been considered in computational algebra. The primary interest has instead been placed on Gröbner basis computation (i.e., forward generation), and nothing has motivated the random generation of Gröbner bases or the backward transform. Interestingly, machine learning now sheds light on them. Formally, we address the following problems for dataset generation.

Problem 4.1 (Random generation of Gröbner bases).
Find a collection G = {Gi}_{i=1}^m with the reduced Gröbner basis Gi ⊆ k[x1, . . . , xn] of ⟨Gi⟩, i = 1, . . . , m. The collection should contain diverse bases, and we need an efficient algorithm for constructing them.

Problem 4.2 (Backward Gröbner problem). Given a Gröbner basis G ⊆ k[x1, . . . , xn], find a collection F = {Fi}_{i=1}^µ of polynomial sets that are not Gröbner bases but satisfy ⟨Fi⟩ = ⟨G⟩ for i = 1, . . . , µ. The collection should contain diverse sets, and we need an efficient algorithm for constructing them.

Problems 4.1 and 4.2 require the collections G, F to contain diverse polynomial sets. Thus, the algorithms for these problems should not be deterministic but should have some controllable randomness. Several studies have reported that the distribution of samples in a training set determines the generalization ability of models trained on it [19, 58]. However, the distribution of non-Gröbner sets and Gröbner bases is an unexplored and challenging object of study; it goes beyond the scope of the present study.

4.1 Scope of this study

Non-Gröbner sets have various forms across applications. For example, in cryptography (particularly post-quantum cryptosystems), polynomials are restricted to dense degree-2 polynomials and generated by an encryption scheme [93]. On the other hand, in systems biology (particularly, the reconstruction of gene regulatory networks), they are typically assumed to be sparse [59]. In statistics (particularly algebraic statistics), they are restricted to binomials, i.e., polynomials with two monomials [41, 79]. As the first study of Gröbner basis computation using Transformers, we do not focus on a particular application and instead address a generic case reflecting a motivation shared by various applications of computing Gröbner bases: solving polynomial systems or understanding ideals associated with polynomial systems having solutions. Particularly, we focus on 0-dimensional radical ideals, a special but fundamental class of ideals.

Definition 4.3 (0-dimensional ideal). Let F be a set of polynomials in k[x1, . . . , xn]. The ideal ⟨F⟩ is called a 0-dimensional ideal if all but a finite number of terms belong to ⟨LT(⟨F⟩)⟩.

In fact, the number of terms not belonging to ⟨LT(⟨f1, . . . , fs⟩)⟩ is an upper bound on the number of solutions of the system f1(x1, . . . , xn) = · · · = fs(x1, . . . , xn) = 0. In particular, the finiteness of the number of such terms implies the finiteness of the number of solutions. This is the reason why we call such ideals 0-dimensional ideals in Def. 4.3. 0-dimensional ideals are the fundamental ideals in the study of pure algebra. This is partly because of the ease of analysis. As Def. A.5 shows, 0-dimensional ideals relate to finite-dimensional vector spaces, and thus, analysis and algorithm design can essentially be addressed by matrices and linear algebra. Also, the ideals in most practical scenarios are known to be 0-dimensional. For example, a multivariate public-key encrypted communication (a candidate for post-quantum cryptosystems) with a public polynomial system F over a finite field Fp will be broken if one finds any root of the system F ∪ {x1^p − x1, . . . , xn^p − xn}. One should note that the ideal ⟨F ∪ {x1^p − x1, . . . , xn^p − xn}⟩ is 0-dimensional [84, Sec. 2.2]. Generically, 0-dimensional ideals defined from polynomial systems having solutions are radical² (i.e., non-radical ideals are in a zero-measure set in the Zariski topology).
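To make the objects above concrete, the following small SymPy computation (our own illustration, not part of the paper's pipeline) shows a 0-dimensional ideal, its reduced lex-Gröbner basis in the triangular form described in Sec. 3, and the back-substitution that recovers its finitely many solutions.

```python
# Toy illustration (SymPy): a 0-dimensional ideal, its reduced lex Groebner basis,
# and back-substitution over the finitely many solutions.
from sympy import symbols, groebner, roots, Poly

x, y = symbols("x y")
F = [x**2 + y**2 - 5, x*y - 2]          # a non-Groebner generating set

G = groebner(F, x, y, order="lex")      # reduced lex basis of <F>
print(list(G))                          # [x + y**3/2 - 5*y/2, y**4 - 5*y**2 + 4]

# The basis is triangular: one generator is univariate in y, the other is
# x minus a univariate polynomial in y, so solutions follow by back-substitution.
gens = list(G)
h   = next(g for g in gens if not g.has(x))   # univariate eliminant in y
g_x = next(g for g in gens if g.has(x))       # x minus a polynomial in y
for yv in roots(Poly(h, y)):                  # y in {1, -1, 2, -2}
    xv = list(roots(Poly(g_x.subs(y, yv), x)))[0]
    print((xv, yv))                           # (2, 1), (-2, -1), (1, 2), (-1, -2), in some order
```

Here the reduced basis consists of a univariate polynomial in y and a polynomial of the form x minus a univariate polynomial in y, i.e., the ideal is in shape position in the sense of Def. 4.4 below.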
The proofs of the results in the following sections can be found in App. C. Hereinafter, the sampling of polynomials is done by a uniform sampling of coefficients from a prescribed range. It is also worth noting that Transformers cannot be an efficient tool for general Gröbner basis computation, and thus, we should focus on a particular class of ideals and pursue in-distribution accuracy. This is evident from the facts that Gröbner basis computation is NP-hard and that machine learning models perform best on in-distribution samples and do not generalize perfectly. Fortunately, unlike standard machine learning tasks (e.g., image classification), users can frame their problems beforehand (i.e., they know what types of polynomials they want to handle), and they can collect as many training samples as they want if an efficient algorithm exists. As mentioned above, the form of non-Gröbner sets varies across applications, and thus, we focus on the generic case and leave the specialization to future work.

4.2 Random generation of Gröbner bases

We address Prob. 4.1 using the fact that 0-dimensional radical ideals are generally in shape position.

Definition 4.4 (Shape position). An ideal I ⊆ k[x1, . . . , xn] is said to be in shape position if some univariate polynomials h, g1, . . . , gn−1 ∈ k[xn] form the reduced lex-Gröbner basis of I as follows:

G = {h, x1 − g1, . . . , xn−1 − gn−1}.  (4.1)

As can be seen, the lex-Gröbner basis consists of a univariate polynomial h in xn and, for each i < n, the difference of the leading term xi and a univariate polynomial gi in xn. While not all ideals are in shape position, 0-dimensional radical ideals are almost always in shape position: if ⟨f1, . . . , fs⟩ ⊆ k[x1, . . . , xn] is a 0-dimensional radical ideal, a random coordinate change (y1, . . . , yn) = (x1, . . . , xn)R with a regular (i.e., invertible) matrix R ∈ k^{n×n} yields polynomials in k[y1, . . . , yn], and the resulting ideal generally has the reduced lex-Gröbner basis in the form of Eq. (4.1) (cf. Prop. A.14). With this fact, an efficient sampling of Gröbner bases of 0-dimensional radical ideals can be realized by sampling n polynomials in k[xn], i.e., h, g1, . . . , gn−1 with h ≠ 0. We have to make sure that the degree of h is always greater than that of g1, . . . , gn−1, which is necessary and sufficient for G to be a reduced Gröbner basis. This approach achieves both efficiency and randomness, thus resolving Prob. 4.1. Note that while our approach assumes the term order lex, one can, if necessary, use an efficient change-of-ordering algorithm, e.g., the FGLM algorithm [35]. The cost of the FGLM algorithm is O(n·deg(h)³) in terms of the number of arithmetic operations over k. Besides the ideals in shape position, we also consider the Cauchy module in App. B, which defines another class of 0-dimensional ideals.

4.3 Backward Gröbner problem

To address Prob. 4.2, we consider the following problem.

Problem 4.5. Let I ⊆ k[x1, . . . , xn] be a 0-dimensional ideal, and let G = (g1, . . . , gt) ∈ k[x1, . . . , xn]^t be its ≻-Gröbner basis with respect to a term order ≻.³ Find a polynomial matrix A ∈ k[x1, . . . , xn]^{s×t} giving a non-Gröbner set F = (f1, . . . , fs) = AG such that ⟨F⟩ = ⟨G⟩.

Namely, we generate a set of polynomials F = (f1, . . . , fs) from G = (g1, . . . , gt) by fi = Σ_{j=1}^t aij·gj for i = 1, . . . , s, where aij ∈ k[x1, . . . , xn] denotes the (i, j)-th entry of A.

²See App. A for the definition. ³We surcharge notations to mean that the set {g1, . . . , gt} defined by the vector G is a ≻-Gröbner basis.
Note that F and G are generally not identical as sets, and the design of A such that ⟨F⟩ = ⟨G⟩ is the question at hand. A similar question was studied without the Gröbner condition in [17, 18]. They provided an algebraic necessary and sufficient condition for the polynomial system of F to have a solution outside the variety defined by G. This condition is expressed explicitly by multivariate resultants. However, strong additional assumptions are required: A, F, G are homogeneous, G is a regular sequence, and, in the end, ⟨F⟩ = ⟨G⟩ is only satisfied up to saturation. Thus, they are not compatible with our setting and method for Prob. 4.1. Our analysis gives the following results on the design of A to achieve ⟨F⟩ = ⟨G⟩ in the 0-dimensional case (without the radicality or shape position assumption).

Theorem 4.6. Let G = (g1, . . . , gt) be a Gröbner basis of a 0-dimensional ideal in k[x1, . . . , xn]. Let F = (f1, . . . , fs) = AG with A ∈ k[x1, . . . , xn]^{s×t}.
1. If ⟨F⟩ = ⟨G⟩, then s ≥ n.
2. If A has a left-inverse in k[x1, . . . , xn]^{t×s}, then ⟨F⟩ = ⟨G⟩ holds.
3. The equality ⟨F⟩ = ⟨G⟩ holds if and only if there exists a matrix B ∈ k[x1, . . . , xn]^{t×s} such that each row of BA − Et is a syzygy⁴ of G, where Et is the identity matrix of size t.

The first statement of Thm. 4.6 says that the polynomial matrix A should have at least n rows. For an ideal in shape position, we have a lex-Gröbner basis G of size n, and thus, A is a square or tall matrix. The second statement gives a sufficient condition. The third statement provides a necessary and sufficient condition. Using the second statement, we design a simple random transform of a Gröbner basis into a non-Gröbner set without changing the ideal. We now assume ≻ = lex and 0-dimensional ideals in shape position. Then, G has exactly n generators. When s = n, we have the following.

Proposition 4.7. For any A ∈ k[x1, . . . , xn]^{n×n} with det(A) ∈ k \ {0}, we have ⟨F⟩ = ⟨G⟩.

As non-zero constant scaling does not change the ideal, we focus on A with det(A) = 1 without loss of generality. Such A can be constructed using the Bruhat decomposition:

A = U1PU2,  (4.2)

where U1, U2 ∈ ST(n, k[x1, . . . , xn]) are upper-triangular matrices with all-one diagonal entries (i.e., unimodular upper-triangular matrices) and P ∈ {0, 1}^{n×n} denotes a permutation matrix. Noting that A⁻¹ satisfies A⁻¹A = En, we have ⟨AG⟩ = ⟨G⟩ from Thm. 4.6. Therefore, random sampling (U1, U2, P) of unimodular upper-triangular matrices U1, U2 and a permutation matrix P resolves the backward Gröbner problem for s = n. We extend this idea to the case of s > n using a rectangular unimodular upper-triangular matrix:

U2 = [U′2 ; O_{s−n,n}] ∈ k[x1, . . . , xn]^{s×n},  (4.3)

where U′2 ∈ ST(n, k[x1, . . . , xn]), O_{s−n,n} ∈ k[x1, . . . , xn]^{(s−n)×n} is the zero matrix, and the semicolon denotes vertical stacking. The permutation matrix is now P ∈ {0, 1}^{s×s}. Note that U2G already gives a non-Gröbner set such that ⟨U2G⟩ = ⟨G⟩; however, the polynomials in the last s − n entries of U2G are all zero by construction. To avoid this, the permutation matrix P shuffles the rows, and U1 then mixes them, so as to exclude the zero polynomial from the final polynomial set. To summarize, our strategy is to compute F = U1PU2G, which only requires the sampling of O(s²) polynomials in k[x1, . . . , xn] and O(n² + s²) multiplications of polynomials. Note that even in the large polynomial systems in the MQ challenge, a post-quantum cryptography challenge, we have n < 100 and s < 200 [93].

⁴Refer to App. A for the definition.
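The construction above is straightforward to prototype. The following sketch (our own SymPy illustration with arbitrary small parameter choices, not the paper's Algorithm 1) samples a shape-position basis G as in Sec. 4.2 and hides it as F = U1PU2G; in practice one would also resample or filter out occasional zero or constant entries.

```python
# Illustrative backward generation: sample a shape-position lex Groebner basis G,
# then produce a non-Groebner set F = U1 * P * U2 * G with <F> = <G>.
import random
from sympy import symbols, Matrix, eye, zeros, expand

n, s, d, d2 = 3, 4, 5, 2                 # #variables, |F|, deg bound for G, deg bound for U entries
xs = symbols(f"x1:{n + 1}")              # (x1, x2, x3)

def rand_univariate(var, deg, monic=False):
    """Random univariate polynomial in `var` of degree <= deg (exactly deg if monic)."""
    lead = 1 if monic else random.randint(-5, 5)
    return lead * var**deg + sum(random.randint(-5, 5) * var**i for i in range(deg))

# G = (x1 - g1, ..., x_{n-1} - g_{n-1}, h) with g_i, h in k[xn] and deg(g_i) < deg(h).
h = rand_univariate(xs[-1], d, monic=True)
G = Matrix([xs[i] - rand_univariate(xs[-1], d - 1) for i in range(n - 1)] + [h])

def rand_poly(deg):
    """A crude random sparse polynomial of total degree <= deg (good enough for a demo)."""
    return sum(random.randint(-2, 2) * random.choice(xs) ** random.randint(0, deg) for _ in range(2))

def unimodular_upper(k):
    """Upper-triangular matrix with all-one diagonal and random polynomial entries above it."""
    U = eye(k)
    for i in range(k):
        for j in range(i + 1, k):
            U[i, j] = rand_poly(d2)
    return U

U1 = unimodular_upper(s)                                  # s x s
U2 = unimodular_upper(n).col_join(zeros(s - n, n))        # s x n, zero block stacked below U2'
P = zeros(s, s)                                           # s x s permutation matrix
for i, j in enumerate(random.sample(range(s), s)):
    P[i, j] = 1

F = (U1 * P * U2 * G).applyfunc(expand)                   # s polynomials with <F> = <G>
for f in F:
    print(f)
```

One can check ⟨F⟩ = ⟨G⟩ by recomputing the reduced lex basis of F (e.g., with sympy.groebner), which should return the generators of G up to ordering; this forward computation is, of course, exactly the expensive step that the backward construction avoids at dataset-generation time.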
4.4 Dataset generation algorithm

The combination of the discussions in the previous sections gives an efficient dataset generation algorithm (see Alg. 1 for a pseudocode).

Theorem 4.8. Consider a polynomial ring k[x1, . . . , xn]. Given the dataset size m, maximum degrees d, d′ > 0, a maximum size of the non-Gröbner sets smax ≥ n, and a term order ≻, Alg. 1 returns a collection D = {(Fi, Gi)}_{i=1}^m with the following properties: for all i = 1, . . . , m,
1. |Gi| = n and |Fi| ≤ smax.
2. The set Gi is the reduced ≻-Gröbner basis of ⟨Fi⟩. The set Fi is not, unless Gi, U1, U′2, P are all sampled in a non-trivial Zariski closed subset.⁵
3. The ideal ⟨Fi⟩ = ⟨Gi⟩ is a 0-dimensional ideal in shape position.
The time complexity is O(m(n·S_{1,d} + s²·S_{n,d′} + (n² + s²)·M_{n,2d′+d})) when ≻ = lex, where S_{n,d} denotes the complexity of sampling an n-variate polynomial with total degree at most d, and M_{n,d} denotes that of multiplying two n-variate polynomials with total degree at most d. If ≻ ≠ lex, an additional O(m·n·d³) is needed.

The proposed dataset generation method is a backward approach, which first generates solutions and then transforms them into problems. In this case, we have control over the complexity of the Gröbner bases and can add some intrinsic structure if any prior information is available. For example, a multivariate encryption scheme encapsulates the secret-key information in the solution of a polynomial system with a single solution in a base field [93]. The ideal associated with such a system is 0-dimensional and in shape position; namely, its Gröbner basis has the following form: G = {xn − an, x1 − a1, . . . , xn−1 − an−1}, where a1, a2, . . . , an are constants [84]. The backward approach allows one to restrict the Gröbner bases in a dataset to such a class. This is not the case with forward approaches. While they may include prior information in the non-Gröbner sets, it is computationally expensive to obtain the corresponding Gröbner bases. It is also worth noting that a naive forward approach, which randomly generates non-Gröbner sets and computes their Gröbner bases, should be avoided even if it were computationally tractable because, for example, the Gröbner basis of such an F is generally {1} when |F| > n.

5 Hybrid Input Embedding

Transformers are efficient learners of sequence-to-sequence functions. They receive and generate a sequence of tokens. In our context, for example, {x² − 10, y} can be tokenized as [x, ^, 2, +, -10, <sep>, y], a sequence of tokens. The vocabulary set V is the collection of all possible tokens. Each token s ∈ V has a predesignated token ID i(s) ∈ N, and the input embedding layer of a Transformer associates it with the i(s)-th row of the embedding matrix WE ∈ R^{|V|×D}, where D is the embedding dimension. The embedding matrix is updated during training, and eventually, we have a nice vector representation for each token. However, this approach requires a large embedding matrix to handle a wide range of number tokens.⁶ The number tokens dominate the vocabulary set, and Transformers have to learn the relationships between the number tokens from scratch during the training. At inference, if the given numbers are out of range, the model cannot work. In our case, this is inconvenient, particularly when the coefficient field is k = Q. For example, let F = {−5/3·y³ − y − 1/2, 7/2·x·y² − 5/3·x − 2}, which we obtained from a random sampling with some upper bounds on the degree, the number of terms, and the integers appearing in numerators and denominators.
Despite the simplicity of F, its reduced lex-Gröbner basis contains the polynomial g1 = x + 569520/427411·y² − 158760/427411·y + 612912/427411. If we tokenize a/b as [a, /, b], we need to prepare more than a million integer tokens. Noting that g1 ≈ x + 1.33y² − 0.37y + 1.43, it is more reasonable to handle coefficients as real values and also to implement the inductive bias on the continuity of numbers in the embedding.

⁵This can happen with probability zero if k is infinite and with very low probability over a large finite field. ⁶We may tokenize a number by its digits (e.g., 123 by [1, 2, 3]), but this makes input sequences long and affects the quadratic memory cost of the attention mechanism. See [19] for the effect of number tokenization.

Table 1: Runtime comparison (in seconds) of forward generation (F.) and backward generation (B.) of the dataset Dn(Q) of size 1,000. The forward generation used one of the three algorithms provided in SageMath with the libSingular backend. We set a timeout limit of five seconds (added to the total runtime at every occurrence) for each Gröbner basis computation. For some entries, the timeout was triggered in more than 13% or even 25% of the runs (cf. Tab. 5 for the success rate).

| Method | n = 2 | n = 3 | n = 4 | n = 5 |
|---|---|---|---|---|
| F. (STD) | 4.20 | 216.3 | 740.1 | 1411.1 |
| F. (SLIMGB) | 4.29 | 183.4 | 697.5 | 1322.7 |
| F. (STDFGLM) | 7.22 | 8.29 | 21.0 | 164.3 |
| B. (ours) | 5.23 | 5.46 | 7.05 | 7.91 |

We propose a hybrid input embedding that accepts both discrete token IDs and continuous values. Let s = [s1, . . . , sL] be a sequence of tokens. Some of these tokens are in V, and the others are in R. For those in V, the standard input embedding based on the embedding matrix is applied. For the others, a small feed-forward network fE : R → R^D is applied. A Transformer with the proposed embedding should be equipped with a regression head for these continuous tokens. This allows us to handle any number as a single token without the explosion of the vocabulary set (i.e., the embedding matrix). As feed-forward networks are continuous functions, they naturally implement the continuity of numbers; two close values s1, s2 ∈ R are expected to be embedded into similar vectors. The hybrid input embedding has two advantages. First, as claimed above, we no longer suffer from the large embedding matrix for registering many number tokens and can naturally implement the continuity bias. Second, we do not have the out-of-range issue. Further, we can scale the coefficients of given polynomials globally so that they match our training coefficient range.⁷ Refer to App. D for the details.

6 Experiments

We now present the efficiency of the proposed dataset generation method and the learnability of Gröbner basis computation.⁸ All the experiments were conducted with 48-core CPUs, 768GB RAM, and NVIDIA RTX A6000ada GPUs. The training of a model takes less than a day on a single GPU. More information on the profile of the generated datasets, the training setup, and additional experimental results is given in Apps. E and F.

6.1 Dataset generation

First, we demonstrate the efficiency of the proposed dataset generation framework. We constructed 16 datasets Dn(k) for n ∈ {2, 3, 4, 5} and k ∈ {F7, F31, Q, R} and measured the runtime of the forward generation and our backward generation. The dataset Dn(k) consists of 1,000 pairs of a non-Gröbner set and a Gröbner basis in k[x1, . . . , xn] of ideals in shape position. Each sample (F, G) ∈ Dn(k) was prepared using Alg. 1 with (d, d′, smax, ≻) = (5, 3, n + 2, lex).
The number of terms of the univariate polynomials and the n-variate polynomials is uniformly determined from [1, 5] and [1, 2], respectively. When k = Q, the coefficients a/b are restricted to those with a, b ∈ {−5, . . . , 5} for random polynomials and a, b ∈ {−100, . . . , 100} for polynomials in F. In the forward generation, one may first generate random polynomial sets and then compute their Gröbner bases. However, this leads to a dataset with a totally different complexity from that constructed by the backward generation, leading to an unfair runtime comparison between the two generation processes. As such, the forward generation instead computes Gröbner bases of the non-Gröbner sets given by the backward generation, leading to the identical dataset. We used SageMath [82] with the libSingular backend. As Tab. 1 shows, our backward generation is a few orders of magnitude faster than the forward generation. A sharp runtime growth is observed in the forward generation as the number of variables increases. Note that these numbers only show the runtime on 1,000 samples, while training typically requires millions of samples. Therefore, the forward generation is almost infeasible, and the proposed method resolves a bottleneck in the learning of Gröbner basis computation.

⁷In a follow-up survey, we found that a very similar idea was proposed in [37] in a broader context, which shows that continuous embeddings of number tokens perform better than discrete embeddings in various tasks. ⁸The code is available at https://github.com/HiroshiKERA/transformer-groebner.

Table 2: Accuracy [%] / support accuracy [%] of Gröbner basis computation by Transformers on D′n(k). In the support accuracy, two polynomials are considered identical if they consist of an identical set of terms (i.e., identical support). Transformers are trained with either the discrete input embedding (disc.) or the hybrid embedding (hyb.). Note that the datasets for n = 3, 4, 5 are here constructed using U1, U′2 (cf. Alg. 1) with density σ = 0.6, 0.3, 0.2, respectively.

| Coeff. | Emb. | Shape position, n = 2 | n = 3 | n = 4 | n = 5 | Cauchy module, n = 2 | n = 3 |
|---|---|---|---|---|---|---|---|
| Q | disc. | 93.7 / 95.4 | 88.7 / 92.0 | 90.8 / 94.0 | 86.5 / 90.6 | 99.7 / 99.8 | 97.2 / 97.6 |
| Q | hyb. | 66.8 / 87.3 | 69.0 / 89.8 | 62.7 / 86.8 | 0.0 / 84.9 | 98.3 / 99.7 | 80.1 / 89.2 |
| F7 | disc. | 72.3 / 79.1 | 78.1 / 83.2 | 71.3 / 84.6 | 84.3 / 88.5 | 98.7 / 99.8 | 98.1 / 98.7 |
| F7 | hyb. | 54.1 / 78.7 | 55.8 / 84.3 | 46.1 / 81.8 | 54.4 / 81.5 | 95.8 / 99.7 | 80.8 / 91.2 |
| F31 | disc. | 46.8 / 77.3 | 50.2 / 80.9 | 51.1 / 83.7 | 28.6 / 77.9 | 93.8 / 99.7 | 94.7 / 99.6 |
| F31 | hyb. | 6.1 / 75.3 | 5.8 / 80.4 | 0.1 / 73.0 | 0.1 / 76.9 | 15.1 / 99.5 | 10.9 / 98.4 |
| R | hyb. | 57.2 / 85.0 | 61.0 / 88.0 | 61.7 / 87.5 | 45.6 / 82.9 | 28.3 / 100 | 4.3 / 100 |

6.2 Learnability of Gröbner basis computation

We now demonstrate that Transformers can learn to compute Gröbner bases. To examine the ability of a general Transformer, we focus on a standard architecture (e.g., 6 encoder/decoder layers and 8 attention heads) and a standard training setup (e.g., the AdamW optimizer [65] with (β1, β2) = (0.9, 0.999) and a linear decay of the learning rate from 10⁻⁴). The batch size was set to 16, and models were trained for 8 epochs. We also tested the hybrid input embedding. Refer to App. E for the complete information. Each polynomial set in the datasets is converted into a sequence using the prefix representation and separator tokens; a small illustrative serialization is sketched below. Unlike natural language processing, our task does not allow the truncation of an input sequence because the first term of the first polynomial in F certainly relates to the last term of the last polynomial.
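For concreteness, the toy serializer below (our own illustration; the released code uses a prefix representation and its own token inventory, which may differ in detail) turns a polynomial set into a flat token sequence in the style of the example in Sec. 5.

```python
# Toy serializer (not the released implementation): one polynomial set -> one token
# sequence, in the spirit of the Sec. 5 example; "<sep>" marks polynomial boundaries.
from sympy import symbols, Poly

xs = symbols("x y")

def poly_to_tokens(f):
    """Serialize one polynomial term by term, e.g. x**2 - 10 -> ['x', '^', '2', '+', '-10']."""
    tokens, p = [], Poly(f, *xs)
    for k, (exps, coeff) in enumerate(p.terms()):      # terms in the polynomial's term order
        if k > 0:
            tokens.append("+")
        if coeff != 1 or all(e == 0 for e in exps):    # emit the coefficient token when needed
            tokens.append(str(coeff))
        for var, e in zip(xs, exps):
            if e > 0:
                tokens.append(str(var))
                if e > 1:
                    tokens += ["^", str(e)]
    return tokens

def system_to_tokens(polys):
    seq = []
    for k, f in enumerate(polys):
        if k > 0:
            seq.append("<sep>")
        seq += poly_to_tokens(f)
    return seq

x, y = xs
print(system_to_tokens([x**2 - 10, y]))   # ['x', '^', '2', '+', '-10', '<sep>', 'y']
```

The separator token marks polynomial boundaries, so the whole set is presented to the model as a single sequence.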
To make the input sequence length manageable for vanilla Transformers, we used simpler datasets D′n(k) using U1, U′2 in Alg. 1 with a moderate density σ ∈ (0, 1]. This makes the maximum sequence length less than 5,000. Specifically, we used σ = 1.0, 0.6, 0.3, 0.2 for n = 2, 3, 4, 5, respectively. The training set has one million samples, and the test set has one thousand samples. With the hybrid input embedding, coefficients are predicted by regression; we quantized them for Fp and otherwise regarded them as correct when the mean squared error is less than 0.1.

Table 2 shows that trained Transformers successfully compute Gröbner bases with moderate/high accuracy. Several intriguing observations, discussed below, are obtained. See App. F for more results. Particularly, App. F.3 presents several examples found in the datasets for which the Transformer successfully computed Gröbner bases significantly faster than mathematical algorithms. Table 2 also includes the results on the Cauchy module datasets, on which Transformers are trained and tested. Their dataset generation starts with sampling the roots in k^n, and the other parts follow the generation of D′n(k). The results on (Q, n = 3) with the standard embedding are not shown as they require too many number tokens.

The performance gap across the rings. The accuracy shows that the learning is more successful with infinite-field coefficients k ∈ {Q, R} than with finite-field ones k = Fp. This may be a counter-intuitive observation because there are more possible coefficients in G and F for Q than for Fp. Specifically, for G, the coefficient a/b ∈ Q is restricted to those with a, b ∈ {−5, . . . , 5} (i.e., roughly 50 choices), and a, b ∈ {−100, . . . , 100} (i.e., roughly 20,000 choices) for F. In contrast, there are only p choices for Fp. The performance even degrades for the larger order p = 31. Interestingly, the support accuracy shows that the terms forming a polynomial (i.e., the support of the polynomial) are identified correctly quite well. Thus, Transformers have difficulty determining the coefficients in finite fields. Several studies have also reported that learning to solve a problem involving modular arithmetic may encounter some difficulties [21, 38, 74], but no critical workaround is known.

Incorrect yet reasonable failures. We observed that the predictions by a Transformer are mostly reasonable even when they are incorrect. For example, only several coefficients may be incorrect while the support is correct, as suggested by the relatively high support accuracy in Tab. 2. In such a case, one can use a Gröbner basis computation algorithm that works efficiently given the leading terms of the target unknown Gröbner basis [83]. Refer to App. F.2 for extensive lists of examples.

Figure 1: Visual analysis of the embedding vectors of numbers given by the proposed embedding. Embedding c ∈ R into fE(c) ∈ R^D for c ranging from cmin to cmax over B bins to obtain M ∈ R^{B×D}, the six panels show, from the left, (i) the Euclidean distance matrix of M, (ii) its slice at 0, (iii) the norm of the embedding vectors, (iv) the dot product M̄M̄ᵀ with the row-normalized M̄, (v) fE(0)·M̄ᵀ, and (vi) fE(c0)·M̄ᵀ. (a) Trained on R[x1, x2]; (cmin, cmax) = (−100, 100). (b) Trained on F31[x1, x2]; (cmin, cmax) = (0, 31).
The embedding layer f E has one/two hidden layers (top/bottom rows). As can be seen, the relationship between embedding vectors in terms of distance and dot product is aligned well in the infinite field and not in the finite field. Hybrid embedding. Table 2 shows that determining coefficients by regression is less successful than classifications. For infinite field k, this may be because of the accumulation of coefficient errors during the auto-regressive generation. Thus, the current best practice would be to prepare many number tokens in the vocabulary set, or a sophisticated regression-by-classification approach may be helpful [76]. Note that the results for k = Fp are shown for reference as the finite field elements do not have ordering. Figure 1 shows a contrast between the embedding functions learned in infinite field and finite field. Particularly, the slice of distance matrix (ii) and that of the dot-product matrix (vi) show that these metrics align well with the difference between numbers in R. However, we cannot observe convincing patterns in the embedding in F31. For the two-layer case in Fig. 1(a), we observe sharp changes around 5 of the horizontal axis. This may be because of the gap in the coefficient range in the input and output space. The coefficients of F ranges between [ 100, 100], while that of G does between [ 5, 5]. In Tab. 3, we show that the increase of hidden layers of f E does not lead to improvement. 7 Conclusion This study proposed the first learning approach to a fundamental algebraic task, the Gröbner basis computation. While various recent studies have reported the learnability of mathematical problems by Transformers, we addressed the first problem with nontriviality in the dataset generation. Ultimately, the learning approach may be useful to address large-scale problems that cannot be approached by Gröbner basis computation algorithms because of their computational complexity. Transformers can output predictions in moderate runtime. The outputs may be incorrect, but there is a chance of obtaining a hint of a solution, as shown in our experiments. We believe that our study reveals many interesting open questions to achieve Gröbner basis computation learning. Some are algebraic problems, and others are machine learning challenges, further discussed in Sec. H. Acknowledgement. We would like to thank Masayuki Noro (Rikkyo University) for his fruitful comments on our dataset construction algorithm and Noriki Nishida (RIKEN Center for Advanced Intelligence Project) for his help in the implementation. Hiroshi Kera was supported by JST PRESTO Grant Number JPMJPR24K4, JST ACT-X Grant Number JPMJAX23C8, Mitsubishi Electric Information Technology R&D Center, and the Chiba University IAAR Research Support Program and the Program for Forming Japan s Peak Research Universities (J-PEAKS). Yuki Ishihara was supported by JSPS KAKENHI Grant Number JP22K13901 and Institute of Mathematics for Industry, Joint Usage/Research Center in Kyushu University (FY2023 Short-term Joint Research Speeding up of symbolic computation and its application to solving industrial problems (2023a006)). Yuta Kambe was supported by Mitsubishi Electric Research Associate Program. [1] J. Abbott, C. Fassino, and M.-L. Torrente. Stable border bases for ideals of points. Journal of Symbolic Computation, 43(12):883 894, 2008. [2] J. Abbott, M. Kreuzer, and L. Robbiano. Computing zero-dimensional schemes. Journal of Symbolic Computation, 39(1):31 49, 2005. [3] V. Agarwal, S. Aditya, and N. Goyal. 
Analyzing the nuances of transformers polynomial simplification abilities. In The First Mathematical Reasoning in General Artificial Intelligence Workshop at International Conference on Learning Representations, 2021. [4] A. Alfarano, F. Charton, and A. Hayat. Global Lyapunov functions: a long-standing open problem in mathematics, with symbolic transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [5] R. Antonova, M. Maydanskiy, D. Kragic, S. Devlin, and K. Hofmann. Analytic manifold learning: Unifying and evaluating representations for continuous control. ar Xiv, abs/2006.08718, 2020. [6] E. Arnold. Modular algorithms for computing Gröbner bases. Journal of Symbolic Computation, 35:403 419, 2003. [7] M. F. Atiyah and I. G. Mac Donald. Introduction To Commutative Algebra. Addison-Wesley series in mathematics. Avalon Publishing, 1994. [8] G. V. Bard. Algorithms for Solving Polynomial Systems. Springer US, 2009. [9] M. Bardet, J.-C. Faugère, and B. Salvy. On the complexity of the F5 Gröbner basis algorithm. Journal of Symbolic Computation, 70:49 70, September 2015. [10] T. Becker, V. Weispfenning, and H. Kredel. Gröbner Bases: A Computational Approach to Commutative Algebra. Graduate texts in mathematics. Springer-Verlag, 1993. [11] I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. ar Xiv, 2020. [12] J. Berthomieu, S. Graillat, D. Lesnoff, and T. Mary. Modular matrix multiplication on GPU for polynomial system solving. ACM Commun. Comput. Algebra, 57(2):35 38, August 2023. [13] J. Berthomieu, V. Neiger, and M. Safey El Din. Faster change of order algorithm for Gröbner bases under shape and stability assumptions. In Proceedings of the 2022 International Symposium on Symbolic and Algebraic Computation, pages 409 418, 2022. [14] L. Biggio, T. Bendinelli, A. Neitz, A. Lucchi, and G. Parascandolo. Neural symbolic regression that scales. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 936 945, 18 24 Jul 2021. [15] M. Brickenstein. Slimgb: Gröbner bases with slim polynomials. Revista Matemática Complutense, 23:453 466, 2010. [16] B. Buchberger. Ein Algorithmus zum Auffinden der Basiselemente des Restklassenringes nach einem nulldimensionalen Polynomideal (An Algorithm for Finding the Basis Elements in the Residue Class Ring Modulo a Zero Dimensional Polynomial Ideal). Ph D thesis, Mathematical Institute, University of Innsbruck, Austria, 1965. English translation in J. of Symbolic Computation, Special Issue on Logic, Mathematics, and Computer Science: Interactions. Vol. 41, Number 3-4, Pages 475 511, 2006. [17] L. Busé. Étude du résultant sur une variété algébrique. Theses, Université Nice Sophia Antipolis, 2001. [18] L. Busé, M. Elkadi, and B. Mourrain. Resultant over the residual of a complete intersection. Journal of Pure and Applied Algebra, 164(1):35 57, 2001. [19] F. Charton. Linear algebra with transformers. Transactions on Machine Learning Research, 2022. [20] F. Charton. What is my math transformer doing? - three results on interpretability and generalization. ar Xiv, abs/2211.00170, 2022. [21] F. Charton. Learning the greatest common divisor: explaining transformer predictions. In The Twelfth International Conference on Learning Representations, 2024. [22] F. Charton, A. Hayat, and G. Lample. Learning advanced mathematical computations from examples. In International Conference on Learning Representations, 2021. [23] Y. Chen, Q. Tao, F. Tonin, and J. A. Suykens. 
Primal-Attention: Self-attention through asymmetric kernel SVD in primal representation. In Advances in Neural Information Processing Systems, 2023. [24] D. A. Cox. Solving equations via algebras. In Solving Polynomial Equations, pages 63 124. Springer-Verlag, Berlin, 2005. [25] D. A. Cox, J. Little, and D. O Shea. Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra. Undergraduate Texts in Mathematics. Springer International Publishing, 2015. [26] S. D Ascoli, P.-A. Kamienny, G. Lample, and F. Charton. Deep symbolic regression for recurrence prediction. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 4520 4536, 17 23 Jul 2022. [27] P. Diaconis and B. Sturmfels. Algebraic algorithms for sampling from conditional distributions. The Annals of Statistics, 26(1):363 397, 1998. [28] J. Ding, S. Ma, L. Dong, X. Zhang, S. Huang, W. Wang, N. Zheng, and F. Wei. Long Net: Scaling Transformers to 1,000,000,000 tokens. ar Xiv, 2307.02486, 2023. [29] T. W. Dubé. The structure of polynomial ideals and Gröbner bases. SIAM Journal on Computing, 19(4):750 773, 1990. [30] N. Dziri, X. Lu, M. Sclar, X. L. Li, L. Jiang, B. Y. Lin, S. Welleck, P. West, C. Bhagavatula, R. L. Bras, J. D. Hwang, S. Sanyal, X. Ren, A. Ettinger, Z. Harchaoui, and Y. Choi. Faith and fate: Limits of transformers on compositionality. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. [31] D. Eisenbud. Commutative algebra: with a view toward algebraic geometry. Springer Science & Business Media, 2013. [32] C. Fassino. Almost vanishing polynomials for sets of limited precision points. Journal of Symbolic Computation, 45(1):19 37, 2010. [33] J.-C. Faugère. A new efficient algorithm for computing Gröbner bases (F4). Journal of Pure and Applied Algebra, 139(1):61 88, 1999. [34] J.-C. Faugère. A new efficient algorithm for computing Gröbner bases without reduction to zero (F5). In Proceedings of the 2002 International Symposium on Symbolic and Algebraic Computation, ISSAC 02, page 75 83, New York, NY, USA, 2002. Association for Computing Machinery. [35] J.-C. Faugère, P. M. Gianni, D. Lazard, and T. Mora. Efficient computation of zero-dimensional Gröbner bases by change of ordering. Journal of Symbolic Computation, 16(4):329 344, 1993. [36] P. Gianni and T. Mora. Algebraic solution of systems of polynomial equations using Groebner bases. In Applied Algebra, Algebraic Algorithms and Error-Correcting Codes, pages 247 257, Berlin, Heidelberg, 1989. Springer Berlin Heidelberg. [37] S. Golkar, M. Pettee, M. Eickenberg, A. Bietti, M. Cranmer, G. Krawezik, F. Lanusse, M. Mc Cabe, R. Ohana, L. Parker, B. R.-S. Blancard, T. Tesileanu, K. Cho, and S. Ho. x Val: A continuous number encoding for large language models, 2023. [38] A. Gromov. Grokking modular arithmetic. ar Xiv, abs/2301.02679, 2023. [39] G.-M. Gruel and G. Pfister. A Singular Introduction to Commutative Algebra, 2nd Edition. Sringer Verlag, 2008. [40] D. Heldt, M. Kreuzer, S. Pokutta, and H. Poulisse. Approximate computation of zerodimensional polynomial ideals. Journal of Symbolic Computation, 44(11):1566 1591, 2009. [41] T. Hibi. Gröbner bases. Statistics and software systems. Springer Tokyo, March 2014. [42] C. Hou, F. Nie, and D. Tao. Discriminative vanishing component analysis. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 1666 1672, Palo Alto, California, 2016. AAAI Press. [43] R. Iraji and H. Chitsaz. 
Principal variety analysis. In Proceedings of the 1st Annual Conference on Robot Learning (ACRL), pages 97 108. PMLR, 2017. [44] P.-A. Kamienny, S. d Ascoli, G. Lample, and F. Charton. End-to-end symbolic regression with transformers. In Advances in Neural Information Processing Systems, 2022. [45] P.-A. Kamienny, G. Lample, S. Lamprier, and M. Virgolin. Deep generative symbolic regression with Monte-Carlo-Tree-Search. ar Xiv, abs/2302.11223, 2023. [46] A. Karimov, E. G. Nepomuceno, A. Tutueva, and D. Butusov. Algebraic method for the reconstruction of partially observed nonlinear systems using differential and integral embedding. Mathematics, 8(2):300 321, February 2020. [47] A. Karimov, V. Rybin, E. Kopets, T. Karimov, E. Nepomuceno, and D. Butusov. Identifying empirical equations of chaotic circuit from data. Nonlinear Dynamics, 111(1):871 886, 2023. [48] A. Kehrein and M. Kreuzer. Computing border bases. Journal of Pure and Applied Algebra, 205(2):279 295, 2006. [49] H. Kera. Border basis computation with gradient-weighted normalization. In Proceedings of the 2022 International Symposium on Symbolic and Algebraic Computation, pages 225 234, New York, 2022. Association for Computing Machinery. [50] H. Kera and Y. Hasegawa. Noise-tolerant algebraic method for reconstruction of nonlinear dynamical systems. Nonlinear Dynamics, 85(1):675 692, 2016. [51] H. Kera and Y. Hasegawa. Approximate vanishing ideal via data knotting. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 3399 3406, Palo Alto, California, 2018. AAAI Press. [52] H. Kera and Y. Hasegawa. Spurious vanishing problem in approximate vanishing ideal. IEEE Access, 7:178961 178976, 2019. [53] H. Kera and Y. Hasegawa. Gradient boosts the approximate vanishing ideal. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 4428 4425, Palo Alto, California, 2020. AAAI Press. [54] H. Kera and Y. Hasegawa. Monomial-agnostic computation of vanishing ideals. Journal of Computational Algebra, 11:100022, 2024. [55] H. Kera and H. Iba. Vanishing ideal genetic programming. In Proceedings of the 2016 IEEE Congress on Evolutionary Computation (CEC), pages 5018 5025, Piscataway, NJ, 2016. IEEE. [56] F. J. Király, M. Kreuzer, and L. Theran. Dual-to-kernel learning with ideals. ar Xiv, abs/1402.0099, 2014. [57] N. Kitaev, L. Kaiser, and A. Levskaya. Reformer: The efficient Transformer. In International Conference on Learning Representations, 2020. [58] G. Lample and F. Charton. Deep learning for symbolic mathematics. In International Conference on Learning Representations, 2020. [59] R. Laubenbacher and B. Stigler. A computational algebra approach to the reverse engineering of gene regulatory networks. Journal of Theoretical Biology, 229(4):523 537, 2004. [60] R. Laubenbacher and B. Sturmfels. Computer algebra in systems biology. American Mathematical Monthly, 116(10):882 891, 2009. [61] C. Y. Li, E. Wenger, Z. Allen-Zhu, F. Charton, and K. E. Lauter. SALSA VERDE: a machine learning attack on LWE with sparse small secrets. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. [62] J. Limbeck. Computation of approximate border bases and applications. Ph D thesis, Passau, Universität Passau, 2013. [63] R. Livni, D. Lehavi, S. Schein, H. Nachliely, S. Shalev-Shwartz, and A. Globerson. Vanishing component analysis. 
In Proceedings of the 30th International Conference on Machine Learning, volume 28(1) of Proceedings of Machine Learning Research, pages 597 605, Atlanta, Georgia, USA, June 2013. PMLR. [64] H. Lombardi and I. Yengui. Suslin s algorithms for reduction of unimodular rows. Journal of Symbolic Computation, 39(6):707 717, 2005. [65] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. [66] R. H. Makarim and M. Stevens. M4GB: An efficient Gröbner-basis algorithm. In Proceedings of the 2017 ACM on International Symposium on Symbolic and Algebraic Computation, ISSAC 17, pages 293 300, New York, NY, USA, 2017. Association for Computing Machinery. [67] E. W. Mayr and A. R. Meyer. The complexity of the word problems for commutative semigroups and polynomial ideals. Advances in Mathematics, 46(3):305 329, 1982. [68] H. M. Möller and B. Buchberger. The construction of multivariate polynomials with preassigned zeros. In Computer Algebra. EUROCAM 1982. Lecture Notes in Computer Science, pages 24 31. Springer Berlin Heidelberg, 1982. [69] M. Noro and K. Yokoyama. A modular method to compute the rational univariate representation of zero-dimensional ideals. Journal of Symbolic Computation, 28(1):243 263, 1999. [70] M. Noro and K. Yokoyama. Usage of modular techniques for efficient computation of ideal operations. Math. Comput. Sci., 12(1):1 32, 2018. [71] H. Park and G. Regensburger, editors. Gröbner Bases in Control Theory and Signal Processing. De Gruyter, 2007. [72] H. Park and C. Woodburn. An algorithmic proof of Suslin s stability theorem for polynomial rings. Journal of Algebra, 178(1):277 298, 1995. [73] D. Peifer, M. Stillman, and D. Halpern-Leistner. Learning selection strategies in buchberger s algorithm. In Proceedings of the 37th International Conference on Machine Learning, ICML 20. JMLR.org, 2020. [74] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. ar Xiv, abs/2201.02177, 2022. [75] D. Saxton, E. Grefenstette, F. Hill, and P. Kohli. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations, 2019. [76] D. Shah and T. M. Aamodt. Learning label encodings for deep regression. In The Eleventh International Conference on Learning Representations, 2023. [77] Y. Shao, G. Gao, and C. Wang. Nonlinear discriminant analysis based on vanishing component analysis. Neurocomputing, 218:172 184, 2016. [78] H. Stewenius. Gröbner Basis Methods for Minimal Problems in Computer Vision. Ph D thesis, Mathematics (Faculty of Engineering), 2005. [79] B. Sturmfels. Solving systems of polynomial equations. Number 97. American Mathematical Soc., 2002. [80] Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei. Retentive network: A successor to Transformer for large language models. ar Xiv, 2023. [81] A. Suslin. On the structure of the special linear group over polynomial rings. Mathematics of the USSR - Izvestija, 11(2):221 238, April 1977. [82] The Sage Developers. Sage Math, the Sage Mathematics Software System (Version 10.0), 2023. https://www.sagemath.org. [83] C. Traverso. Hilbert functions and the Buchberger algorithm. Journal of Symbolic Computation, 6:287 304, 1997. [84] E. Ullah. New Techniques for Polynomial System Solving. Ph D thesis, Universität Passau, 2012. [85] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin. 
Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. [86] L. Wang and T. Ohtsuki. Nonlinear blind source separation unifying vanishing component analysis and temporal structure. IEEE Access, 6:42837 42850, 2018. [87] M. Wang and J. Deng. Learning to prove theorems by learning to generate theorems. In Advances in Neural Information Processing Systems, volume 33, pages 18146 18157. Curran Associates, Inc., 2020. [88] Z. Wang, Q. Li, G. Li, and G. Xu. Polynomial representation for persistence diagram. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6116 6125, 2019. [89] E. Wenger, M. Chen, F. Charton, and K. E. Lauter. SALSA: Attacking lattice cryptography with Transformers. In Advances in Neural Information Processing Systems, volume 35, pages 34981 34994, 2022. [90] E. S. Wirth, H. Kera, and S. Pokutta. Approximate vanishing ideal computations at scale. In International Conference on Learning Representations, 2023. [91] E. S. Wirth and S. Pokutta. Conditional gradients for the approximately vanishing ideal. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151, pages 2191 2209, 28 30 Mar 2022. [92] H. Yan, Z. Yan, G. Xiao, W. Wang, and W. Zuo. Deep vanishing component analysis network for pattern classification. Neurocomputing, 316:240 250, 2018. [93] T. Yasuda, X. Dahan, Y.-J. Huang, T. Takagi, and K. Sakurai. MQ challenge: hardness evaluation of solving multivariate quadratic problems. Cryptology e Print Archive, 2015. A Basic Definitions in Algebra Definition A.1 (Ring, Field ([7], Chap. 1 1)). A set R with an additive operation + and a multiplicative operation is called a (commutative) ring if it satisfies the following conditions: 1. a + (b + c) = (a + b) + c for any a, b, c R, 2. there exists 0 R such that a + 0 = 0 + a = a for any a R, 3. for any a R, there exists a such that a + ( a) = ( a) + a = 0, 4. a + b = b + a for any a, b R, 5. a (b c) = (a b) c for any a, b, c R, 6. there exists 1 R such that a 1 = 1 a = a for any a R, 7. a (b + c) = a b + a c for any a, b, c R, 8. (a + b) c = a c + b c for any a, b, c R, 9. a b = b a for any a, b R. A commutative ring R is called a field if it satisfies the following condition 10. for any a R \ {0}, there exists a 1 such that a a 1 = a 1 a = 1. Definition A.2 (Polynomial Ring ([7], Chap. 1 1)). In Definition A.1, k[x1, . . . , xn], the set of all n-variate polynomials with coefficients in k, satisfies all conditions (1)-(9). Thus, k[x1, . . . , xn] is called a polynomial ring. Definition A.3 (Quotient Ring ([7], Chap. 1 1)). Let R be a ring and I an ideal of R. For each f R, we set [f] = {g R | f g I}. Then, the set {[f] | f R} is called the quotient ring of R modulo I and denoted by R/I. Indeed, R/I is a ring with an additive operation + and a multiplicative operation , where [f] + [g] = [f + g] and [f] [g] = [f g] for f, g R respectively. Definition A.4 (Generators). For F = {f1, . . . , fs} k[x1, . . . , xn], the following set i=1 hifi | h1, . . . , hs k[x1, . . . , xn] is an ideal and said to be generated by F, and f1, . . . , fs are called generators. Definition A.5 (0-dimensional ideal ([25], Chap. 5 3, Thm. 6)). Let F be a set of polynomials in k[x1, . . . , xn]. An ideal F is called a 0-dimensional ideal if the k-linear space k[x1, . . . , xn]/ F is finite-dimensional, where k[x1, . . . , xn]/ F is the quotient ring of k[x1, . . . , xn] modulo F . 
Definition A.6 (Radical ideal ([7], Chap. 1, §1)). For an ideal $I$ of $k[x_1, \dots, x_n]$, the set $\{f \in k[x_1, \dots, x_n] \mid f^m \in I \text{ for a positive integer } m\}$ is called the radical of $I$ and denoted by $\sqrt{I}$. Also, $I$ is called a radical ideal if $\sqrt{I} = I$.
Definition A.7 (Syzygy ([10], Chap. 3, §3)). Let $F = \{f_1, \dots, f_s\} \subset k[x_1, \dots, x_n]$. A syzygy of $F$ is an $s$-tuple of polynomials $(q_1, \dots, q_s) \in k[x_1, \dots, x_n]^s$ such that $q_1 f_1 + \cdots + q_s f_s = 0$.
Definition A.8 (Term ([10], Chap. 2, §1)). For a polynomial $f = \sum_{\alpha_1, \dots, \alpha_n} c_{\alpha_1, \dots, \alpha_n} x_1^{\alpha_1} \cdots x_n^{\alpha_n}$ with $c_{\alpha_1, \dots, \alpha_n} \in k$ and $\alpha_1, \dots, \alpha_n \in \mathbb{Z}_{\geq 0}$, each $x_1^{\alpha_1} \cdots x_n^{\alpha_n}$ is called a term in $f$.
Definition A.9 (Total Degree ([25], Chap. 1, §1, Def. 3)). For a term $x_1^{\alpha_1} \cdots x_n^{\alpha_n}$, its total degree is the sum of the exponents $\alpha_1 + \cdots + \alpha_n$. For a polynomial $f$, the total degree of $f$ is the maximal total degree of the terms in $f$.
Definition A.10 (Term order ([10], Definition 5.3)). A term order $\succ$ is a relation between terms such that
1. (comparability) for different terms $x_1^{\alpha_1} \cdots x_n^{\alpha_n}$ and $x_1^{\beta_1} \cdots x_n^{\beta_n}$, either $x_1^{\alpha_1} \cdots x_n^{\alpha_n} \succ x_1^{\beta_1} \cdots x_n^{\beta_n}$ or $x_1^{\beta_1} \cdots x_n^{\beta_n} \succ x_1^{\alpha_1} \cdots x_n^{\alpha_n}$ holds,
2. (order-preserving) for terms $x_1^{\alpha_1} \cdots x_n^{\alpha_n}$, $x_1^{\beta_1} \cdots x_n^{\beta_n}$, and $x_1^{\gamma_1} \cdots x_n^{\gamma_n} \neq 1$, if $x_1^{\alpha_1} \cdots x_n^{\alpha_n} \succ x_1^{\beta_1} \cdots x_n^{\beta_n}$, then $x_1^{\alpha_1+\gamma_1} \cdots x_n^{\alpha_n+\gamma_n} \succ x_1^{\beta_1+\gamma_1} \cdots x_n^{\beta_n+\gamma_n}$ holds,
3. (minimality of 1) the term 1 is the smallest term, i.e., $1 \prec x_1^{\alpha_1} \cdots x_n^{\alpha_n}$ for any term $x_1^{\alpha_1} \cdots x_n^{\alpha_n} \neq 1$.
Example A.11. The lexicographic order $\succ_{\mathrm{lex}}$ prioritizes terms with larger exponents for the variables of small indices, e.g., $x_2 \succ_{\mathrm{lex}} x_3^2$ and $x_1 x_2 x_3^2 \prec_{\mathrm{lex}} x_1 x_2^2 x_3$. (A.2) Two terms are first compared in terms of the exponent of $x_1$ (the larger one is prioritized); if a tie-break is needed, the next variable $x_2$ is considered, and so forth.
Example A.12. The graded lexicographic order $\succ_{\mathrm{grlex}}$ prioritizes terms with higher total degree.9 For tie-breaks, the lexicographic order is used, e.g., $1 \prec_{\mathrm{grlex}} x_n$, $x_2 \prec_{\mathrm{grlex}} x_3^2$, and $x_1 x_2 x_3^2 \prec_{\mathrm{grlex}} x_1 x_2^2 x_3$. (A.3) Term orders that compare total degrees first, such as $\succ_{\mathrm{grlex}}$, are called graded term orders.
Definition A.13 (Reduced Gröbner basis). A $\succ$-Gröbner basis $G = \{g_1, \dots, g_t\}$ of $I$ is called the reduced Gröbner basis of $I$ if
1. the leading coefficient of $g_i$ with respect to $\succ$ is 1 for all $i = 1, \dots, t$,
2. no term of $g_i$ lies in $\langle \mathrm{LT}(G \setminus \{g_i\}) \rangle$ for any $i = 1, \dots, t$.
Proposition A.14 ([36], Prop. 1.6; [69], Lem. 4.4). Let $I$ be a 0-dimensional radical ideal. If $k$ is of characteristic 0 or a finite field of large enough order, then a random linear coordinate change puts $I$ in shape position.

B Cauchy module

We here provide the definition of the Cauchy module, which defines another class of 0-dimensional ideals.
Definition B.1 (Elementary symmetric polynomials). The elementary symmetric polynomials $s_1, \dots, s_n$ in $n$ variables $x_1, \dots, x_n$ are $s_k = \sum_{i_1 < \cdots < i_k} x_{i_1} \cdots x_{i_k}$ for $k = 1, \dots, n$.

Proof (Theorem 4.6). (1) … $> 0$ (Krull's principal ideal theorem [31, 10]). Since the Krull dimension of $k[x_1, \dots, x_n]/\langle G \rangle$ is 0, we have $s \geq n$.
(2) From $F = AG$, we have $\langle F \rangle \subseteq \langle G \rangle$. If $A$ has a left-inverse $B \in k[x_1, \dots, x_n]^{t \times s}$, we have $BF = BAG = G$, indicating $\langle F \rangle \supseteq \langle G \rangle$. Therefore, we have $\langle F \rangle = \langle G \rangle$.
(3) If the equality $\langle F \rangle = \langle G \rangle$ holds, then there exists a $t \times s$ matrix $B \in k[x_1, \dots, x_n]^{t \times s}$ such that $G = BF$. Since $F$ is defined as $F = AG$, we have $G = BF = BAG$ and $G = E_t G$ in $k[x_1, \dots, x_n]^t$. Therefore, we obtain $(BA - E_t)G = 0$. In particular, each row of $BA - E_t$ is a syzygy of $G$. Conversely, if there exists a $t \times s$ matrix $B \in k[x_1, \dots, x_n]^{t \times s}$ such that each row of $BA - E_t$ is a syzygy of $G$, then we have $(BA - E_t)G = 0$ in $k[x_1, \dots, x_n]^t$; therefore, the equality $\langle F \rangle = \langle G \rangle$ holds since we have $G = E_t G = BAG = BF$.

Proposition 4.7. For any $A \in k[x_1, \dots, x_n]^{n \times n}$ with $\det(A) \in k \setminus \{0\}$, we have $\langle F \rangle = \langle G \rangle$.
10 Refer to App. A for the definition.
Proof. By Cramer's rule, there exists $B \in k[x_1, \dots, x_n]^{n \times n}$ such that $BA = \det(A) E_n$, where $E_n$ denotes the $n \times n$ identity matrix. Indeed, the $i$-th row $B_i$ of $B$ satisfies, for $i = 1, \dots, n$,
$$B_i = \frac{1}{\det(A)} \left( \det A^{(i)}_1, \dots, \det A^{(i)}_n \right), \qquad \text{(C.1)}$$
where $A^{(i)}_j$ is the matrix $A$ with the $j$-th column replaced by the $i$-th canonical basis vector $(0, \dots, 1, \dots, 0)^{\top}$. Since $\det(A)$ is a non-zero constant, $A$ has the left-inverse $B$ in $k[x_1, \dots, x_n]^{n \times n}$. Thus $\langle F \rangle = \langle G \rangle$ from Thm. 4.6.
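To make Prop. 4.7 concrete, the following is a minimal SymPy sketch (our own illustration: the Gröbner basis G and the unimodular matrix A below are hand-picked toy choices, not outputs of the paper's Alg. 1, which samples U1, P, U2 at random). Multiplying a Gröbner basis by a polynomial matrix with a constant non-zero determinant yields a generating set of the same ideal that is generally no longer a Gröbner basis.

import sympy as sp

x, y = sp.symbols('x y')

# A reduced lex Gröbner basis in shape position (x > y), chosen by hand:
g1 = x - y**2 + 2
g2 = y**3 - y - 1

# A 2x2 polynomial matrix A = [[1, y], [x, x*y + 1]] with det(A) = 1,
# so Prop. 4.7 applies.
f1 = sp.expand(1 * g1 + y * g2)           # first row of A times (g1, g2)
f2 = sp.expand(x * g1 + (x*y + 1) * g2)   # second row of A times (g1, g2)

# F = (f1, f2) generates the same ideal; recomputing its reduced lex
# Gröbner basis recovers g1 and g2.
GB = sp.groebner([f1, f2], x, y, order='lex')
print(GB)  # GroebnerBasis([x - y**2 + 2, y**3 - y - 1], ...)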
Theorem 4.8. Consider a polynomial ring $k[x_1, \dots, x_n]$. Given the dataset size $m$, maximum degrees $d, d' > 0$, maximum size of the non-Gröbner sets $s_{\max} \geq n$, and a term order $\succ$, Alg. 1 returns a collection $D = \{(F_i, G_i)\}_{i=1}^{m}$ with the following properties: for all $i = 1, \dots, m$,
1. $|G_i| = n$ and $|F_i| \leq s_{\max}$.
2. The set $G_i$ is the reduced $\succ$-Gröbner basis of $\langle F_i \rangle$. The set $F_i$ is not, unless $G_i, U_1, U_2, P$ are all sampled from a non-trivial Zariski closed subset.11
3. The ideal $\langle F_i \rangle = \langle G_i \rangle$ is a 0-dimensional ideal in shape position.
The time complexity is $O(m(n S_{1,d} + s^2 S_{n,d'} + (n^2 + s^2) M_{n,2d'+d}))$ when $\succ$ is the lex order, where $S_{n,d}$ denotes the complexity of sampling an $n$-variate polynomial with total degree at most $d$, and $M_{n,d}$ denotes that of multiplying two $n$-variate polynomials with total degree at most $d$. If $\succ$ is not the lex order, an additional $O(mnd^3)$ is needed.
11 This can happen with probability zero if $k$ is infinite, and with very low probability over a large finite field.
Proof. Outside of the Zariski-subset part, statements 1–3 are immediate from Alg. 1 and the discussion in Secs. 4.2 and 4.3. To obtain the desired Zariski subsets, we consider the vector space of polynomials of degree $d + 2d'$ or less. We remark that if $F_i$ is a $\succ$-Gröbner basis, its leading terms must belong to a finite set of possibilities. For a polynomial to have a given term as its leading term, zero conditions on the terms greater than this term are needed, defining a Zariski closed condition. By taking the finite union of all these conditions, we obtain the desired result. To obtain one pair $(F, G)$, the random generation of $G$ needs $O(n S_{1,d})$, and the backward transform from $G$ to $F$ needs $O(s^2 S_{n,d'})$ to get $U_1, U_2$ and $O((n^2 + s^2) M_{n,2d'+d})$ for the multiplication $F = U_1 P U_2 G$. Note that the maximum total degree of polynomials in $F$ is $2d' + d$.

D Hybrid Input Embedding

We here present supplemental information on Transformers with the hybrid input embedding. Let $s = [s_1, \dots, s_L]$ be a sequence of tokens. Each token either belongs to the vocabulary $V$ or is a number in $\mathbb{R}$; we call the former discrete tokens and the latter continuous tokens. For discrete tokens, the standard input embedding based on the embedding matrix is applied. For continuous tokens, a small feed-forward network $f_E : \mathbb{R} \to \mathbb{R}^D$ is applied. Unlike discrete tokens, continuous tokens are predicted by regression. To this end, the Transformer is equipped with a regression head, and it solves a classification task and a regression task simultaneously. In the classification task, continuous tokens are all replaced with a single coefficient token; in other words, the classification head predicts the support of the polynomials, while the regression head predicts the coefficients to be filled into the coefficient tokens. The auto-regressive generation process is naturally induced by a standard method. In our experiments, we implemented $f_E$ by a one-hidden-layer ReLU network, i.e.,
$$f_E(x) = W_2 \varphi(w_1 x + b_1) + b_2, \qquad \text{(D.1)}$$
where $w_1, b_1 \in \mathbb{R}^D$, $W_2 \in \mathbb{R}^{D \times D}$, $b_2 \in \mathbb{R}^D$, and $\varphi$ is the ReLU function applied entry-wise.
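As a minimal PyTorch-style sketch of this hybrid embedding (our own illustration; the class and argument names are hypothetical, and the paper's actual implementation is not reproduced here), discrete tokens pass through an ordinary embedding matrix, while coefficient tokens are mapped to $\mathbb{R}^D$ by the one-hidden-layer ReLU network of Eq. (D.1):

import torch
import torch.nn as nn

class HybridEmbedding(nn.Module):
    """Embed a mixed sequence of discrete (vocabulary) and continuous (coefficient) tokens."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.discrete = nn.Embedding(vocab_size, dim)
        # f_E(x) = W2 * relu(w1 * x + b1) + b2, cf. Eq. (D.1)
        self.f_E = nn.Sequential(nn.Linear(1, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, token_ids, coeff_values, is_continuous):
        # token_ids: (B, L) int64; coeff_values: (B, L) float; is_continuous: (B, L) bool
        disc = self.discrete(token_ids)
        cont = self.f_E(coeff_values.unsqueeze(-1))
        return torch.where(is_continuous.unsqueeze(-1), cont, disc)

A model using this embedding additionally needs a regression head next to the usual classification head, trained with a mean-squared loss on the coefficient positions (cf. Sec. E.2).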
We also tried $f_E$ with one more hidden layer; however, this brings only a minor improvement in the $\mathbb{R}$ case (see Tab. 3).

Table 3: Comparison of implementations of the continuous input embedding $f_E$ on D 2 (k).
Coefficient          Q             F7            F31           R
one hidden layer     66.8 / 87.3   54.1 / 78.7   6.1 / 75.3    57.2 / 85.0
two hidden layers    67.2 / 87.7   54.3 / 78.8   5.6 / 75.7    56.2 / 83.6

E Training Setup

This section provides supplemental information on our experiments presented in Sec. 6.

E.1 Gröbner basis computation algorithms
In Tab. 1, we tested three algorithms provided in SageMath with the libSingular backend for forward generation.
STD (libsingular:std): the standard Buchberger algorithm.
SLIMGB (libsingular:slimgb): a variant of Faugère's F4 algorithm; refer to [15].
STDFGLM (libsingular:stdfglm): fast computation using STD with the graded reverse lexicographic order, followed by FGLM for the change of term orders. Only applicable to 0-dimensional cases.

E.2 Training setup
Dataset. Both training and test samples are generated using our method. It involves sampling random polynomials; their degree, support size (the number of terms), and coefficients are restricted by user-defined bounds (see Sec. 6.1). Let $d_{\max}$ and $\mu_{\max}$ be the maximum degree and the maximum support size, respectively. A random polynomial is obtained as the sum of $\mu \sim U[1, \mu_{\max}]$ monomials sampled uniformly at random from $k[x_1, \dots, x_n]_{\leq d_{\max}}$. When the samples are fed to a Transformer, polynomials are tokenized into an infix representation. For example, $\{x^2 - \frac{1}{2}y,\ y\} \subset \mathbb{Q}[x, y]$ is tokenized to [C1, E2, E0, +, C-1, /, C2, E0, E1, , C1, E0, E1].
Training. We used a Transformer model [85] with a standard architecture: 6 encoder/decoder layers, 8 attention heads, a token embedding dimension of 512, and feed-forward networks with 2048 inner dimensions. The absolute positional embedding is learned from scratch. The dropout rate was set to 0.1. We used the AdamW optimizer [65] with $(\beta_1, \beta_2) = (0.9, 0.999)$ and no weight decay. The learning rate was initially set to $10^{-4}$ and then linearly decayed over the training steps. All training samples are visited in a single epoch, and the total number of epochs was set to 8. The batch size was set to 16. At inference time, output sequences are generated using a beam search with width 1. For the hybrid input embedding, we used a ReLU network with one hidden layer (cf. Sec. D). A model with this embedding predicts coefficients as continuous values, and a mean-squared loss with weight 0.01 is additionally used for training. Note that while the exponents are also numbers, we treat them as discrete tokens because they are always discrete and their range is moderate.

F Additional Experimental Results

We provide additional experimental results for Transformers with the standard input embedding.

F.1 Dataset generation
The runtime comparison for datasets generated with and without density control is given in Tab. 4, and the success rate (i.e., the rate of not encountering the timeout) is given in Tab. 5. The generation of density-controlled datasets D n (k) (1,000 samples) requires less runtime, and the proposed method is not always the fastest there. However, it is important to remember that the forward method is still not feasible because of the difficulty of sampling overdetermined non-Gröbner sets (i.e., the F's). Generally, such an F only leads to the trivial ideal with Gröbner basis {1}. The profile of the datasets is given in Tab. 6.
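For reference, the forward baselines listed in Sec. E.1 can be invoked from SageMath's Python interface roughly as follows (a sketch assuming a SageMath installation; the small system F below is a stand-in rather than an actual dataset sample):

import time
from sage.all import PolynomialRing, QQ, Ideal  # requires SageMath

# A tiny 0-dimensional stand-in system; dataset samples would be loaded instead.
R = PolynomialRing(QQ, ['x0', 'x1'], order='lex')
x0, x1 = R.gens()
F = [x0**2 + x1**2 - 1, x0*x1 - 1]

for alg in ['libsingular:std', 'libsingular:slimgb', 'libsingular:stdfglm']:
    start = time.time()
    G = Ideal(F).groebner_basis(algorithm=alg)
    print(alg, round(time.time() - start, 4), list(G))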
Table 4: Runtime comparison (in seconds) of forward generation (F.) and backward generation (B.) of datasets Dn(k) of size 1,000. The forward generation used one of the three algorithms provided in SageMath with the libSingular backend. For D n (k), the proposed method is not necessarily the fastest. Note that the runtime of the forward methods does not include the sampling of F; the F's are taken from the datasets constructed by the backward method. The sampling step accounts for roughly 30% of the runtime of the backward method. It is also worth noting that sampling an overdetermined F generally leads to the trivial ideal with Gröbner basis {1}.

Method         Dn(Q): n = 2 / 3 / 4 / 5          D n (Q) (σ = 1.0, σ = 0.2): n = 2 / 3 / 4 / 5
F. (STD)       4.20 / 216.3 / 740.1 / 1411.1     4.20 / 104.3 / 101.0 / 117.4
F. (SLIMGB)    4.29 / 183.4 / 697.5 / 1322.7     4.29 / 77.1 / 98.9 / 134.5
F. (STDFGLM)   7.22 / 8.29 / 21.0 / 164.3        7.22 / 12.1 / 9.75 / 14.9
B. (ours)      5.23 / 5.46 / 7.05 / 7.91         5.23 / 11.2 / 7.85 / 13.7

Method         Dn(F7): n = 2 / 3 / 4 / 5         D n (F7) (σ = 1.0, σ = 0.2): n = 2 / 3 / 4 / 5
F. (STD)       4.93 / 4.57 / 818.9 / 2123.3      4.93 / 4.30 / 48.8 / 91.5
F. (SLIMGB)    4.92 / 5.57 / 561.0 / 1981.2      4.92 / 4.65 / 32.9 / 81.8
F. (STDFGLM)   8.02 / 6.33 / 9.20 / 62.6         8.02 / 7.50 / 7.25 / 7.46
B. (ours)      6.79 / 8.36 / 10.5 / 14.2         6.79 / 8.72 / 10.5 / 14.5

Method         Dn(F31): n = 2 / 3 / 4 / 5        D n (F31) (σ = 1.0, σ = 0.2): n = 2 / 3 / 4 / 5
F. (STD)       5.08 / 5.04 / 777.6 / 2110.4      5.08 / 4.39 / 20.6 / 114.0
F. (SLIMGB)    5.07 / 6.91 / 664.2 / 2026.0      5.07 / 4.98 / 22.2 / 103.2
F. (STDFGLM)   8.10 / 6.73 / 9.14 / 80.2         8.10 / 6.95 / 7.23 / 8.58
B. (ours)      7.40 / 8.37 / 10.5 / 14.7         7.40 / 18.0 / 9.91 / 15.3

F.2 Success and failure cases
Tables 7–18 show examples of success cases. One can see that Transformers can solve many nontrivial instances. Tables 19–21 show examples of failure cases. One can see that, interestingly, the incorrect predictions still appear reasonable. Examples are all taken in their order of appearance in each dataset (i.e., no cherry-picking).

F.3 Superiority of Transformers in several cases
Approaching Gröbner basis computation with a Transformer has a potential runtime advantage because its computational cost depends less on the problem difficulty than that of mathematical algorithms. However, currently, mathematical algorithms run faster than Transformers because of our naive input scheme. Nevertheless, we observed several examples in our D n (k) datasets for which Transformers generate the solutions efficiently, while mathematical algorithms take a significantly longer time or encounter a timeout. In particular, we examined the examples in D n (k) where several forward methods encounter a timeout under the five-second budget (see Tab. 5). We fed these examples to the Transformer and to the three forward algorithms again, now with a 100-second budget. For such examples, as shown in Tab. 22, Transformers completed the computation in less than a second, while the two forward algorithms STD and SLIMGB often used a much longer computation time or encountered a timeout. It is worth noting that STD and SLIMGB are general-purpose algorithms, while STDFGLM is specially designed for zero-dimensional ideals. To summarize, Transformers successfully computed Gröbner bases with much less runtime than general-purpose algorithms for several examples. This shows a potential advantage of using a Transformer in Gröbner basis computation, particularly for large-scale problems.

Table 5: Success rate [%] of forward generation with the five-second timeout limit. Method Dn(Q) D n (Q) n = 2 n = 3 n = 4 n = 5 n = 2 n = 3 n = 4 n = 5 σ = 1.0 σ = 0.2 F. (STD) 100.0 96.0 85.5 72.2 100.0 98.3 98.2 97.9 F. (SLIMGB) 100.0 96.6 86.4 74.4 100.0 98.7 98.3 97.6 F.
(STDFGLM) 100.0 100.0 99.9 98.4 100.0 100.0 100.0 100.0 Method Dn(F7) D n (F7) n = 2 n = 3 n = 4 n = 5 n = 2 n = 3 n = 4 n = 5 σ = 1.0 σ = 0.2 F. (STD) 100.0 100.0 84.8 58.7 100.0 100.0 99.3 98.4 F. (SLIMGB) 100.0 100.0 91.6 62.9 100.0 100.0 99.7 98.7 F. (STDFGLM) 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 Method Dn(F31) D n (F31) n = 2 n = 3 n = 4 n = 5 n = 2 n = 3 n = 4 n = 5 σ = 1.0 σ = 0.2 F. (STD) 100.0 100.0 86.0 59.2 100.0 100.0 99.7 98.0 F. (SLIMGB) 100.0 100.0 89.6 62.3 100.0 100.0 99.8 98.5 F. (STDFGLM) 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 Table 6: Dataset profiles. The standard deviation is shown in the superscript. Metric Dn(Q) D n (Q) n = 2 n = 3 n = 4 n = 5 n = 2 n = 3 n = 4 n = 5 σ = 1.0 σ = 0.6 σ = 0.3 σ = 0.2 Size of F 2.57( 0.71) 3.46( 0.66) 4.40( 0.62) 5.37( 0.60) 2.57( 0.71) 3.71( 0.77) 4.86( 0.80) 5.90( 0.81) Max degree in F 7.31( 1.91) 8.54( 1.44) 9.02( 1.26) 9.17( 1.22) 7.31( 1.91) 8.20( 1.62) 8.34( 1.53) 8.46( 1.46) Min degree in F 4.09( 1.93) 4.45( 1.92) 4.75( 1.89) 4.96( 1.85) 4.09( 1.93) 3.96( 2.00) 3.54( 2.09) 3.38( 2.08) # of terms in F 15.46( 7.67) 23.86( 7.97) 33.18( 8.25) 42.70( 8.74) 15.46( 7.67) 24.36( 9.15) 32.36( 10.39) 40.32( 11.38) Gröbner ratio 0.001( 0.026) 0( 0) 0( 0) 0( 0) 0.001( 0.026) 0( 0.012) 0( 0) 0( 0) Size of G 2.00( 0) 3.00( 0) 4.00( 0) 5.00( 0) 2.00( 0) 3.00( 0) 4.00( 0) 5.00( 0) Max degree in G 4.00( 1.32) 4.00( 1.32) 4.00( 1.32) 4.00( 1.32) 4.00( 1.32) 4.00( 1.32) 4.00( 1.32) 4.00( 1.32) Min degree in G 2.47( 1.23) 2.07( 1.14) 1.79( 1.02) 1.60( 0.90) 2.47( 1.23) 2.07( 1.14) 1.79( 1.02) 1.60( 0.90) # of terms in G 6.46( 2.33) 8.93( 3.25) 11.40( 4.13) 13.86( 4.99) 6.46( 2.33) 8.93( 3.24) 11.39( 4.13) 13.87( 4.99) Gröbner ratio 1( 0) 1( 0) 1( 0) 1( 0) 1( 0) 1( 0) 1( 0) 1( 0) Metric Dn(F7) D n (F7) n = 2 n = 3 n = 4 n = 5 n = 2 n = 3 n = 4 n = 5 σ = 1.0 σ = 0.6 σ = 0.3 σ = 0.2 Size of F 3.00( 0.82) 4.00( 0.82) 5.00( 0.82) 6.00( 0.82) 3.00( 0.82) 4.00( 0.82) 5.00( 0.82) 6.00( 0.82) Max degree in F 7.91( 2.04) 8.45( 1.67) 8.43( 1.55) 8.51( 1.47) 7.91( 2.04) 8.45( 1.67) 8.43( 1.55) 8.51( 1.47) Min degree in F 4.37( 2.06) 4.15( 2.07) 3.64( 2.13) 3.44( 2.11) 4.37( 2.06) 4.15( 2.07) 3.64( 2.13) 3.44( 2.11) # of terms in F 19.88( 9.62) 27.56( 10.42) 34.02( 11.07) 41.50( 11.90) 19.88( 9.62) 27.56( 10.42) 34.02( 11.07) 41.50( 11.90) Gröbner ratio 0.002( 0.045) 0( 0.011) 0( 0) 0( 0) 0.002( 0.045) 0( 0.011) 0( 0) 0( 0) Size of G 2.00( 0) 3.00( 0) 4.00( 0) 5.00( 0) 2.00( 0) 3.00( 0) 4.00( 0) 5.00( 0) Max degree in G 3.94( 1.34) 3.93( 1.34) 3.93( 1.34) 3.94( 1.34) 3.94( 1.34) 3.93( 1.34) 3.93( 1.34) 3.94( 1.34) Min degree in G 2.39( 1.22) 1.98( 1.11) 1.72( 0.97) 1.53( 0.84) 2.39( 1.22) 1.98( 1.11) 1.72( 0.97) 1.53( 0.84) # of terms in G 6.32( 2.33) 8.70( 3.23) 11.08( 4.10) 13.47( 4.94) 6.32( 2.33) 8.70( 3.23) 11.08( 4.10) 13.47( 4.94) Gröbner ratio 1( 0) 1( 0) 1( 0) 1( 0) 1( 0) 1( 0) 1( 0) 1( 0) Metric Dn(F31) D n (F31) n = 2 n = 3 n = 4 n = 5 n = 2 n = 3 n = 4 n = 5 σ = 1.0 σ = 0.6 σ = 0.3 σ = 0.2 Size of F 3.00( 0.82) 4.00( 0.82) 5.00( 0.82) 6.00( 0.82) 3.00( 0.82) 4.00( 0.82) 5.00( 0.82) 6.00( 0.82) Max degree in F 8.11( 2.02) 8.65( 1.65) 8.62( 1.54) 8.69( 1.46) 8.11( 2.02) 8.65( 1.65) 8.62( 1.54) 8.69( 1.46) Min degree in F 4.55( 2.06) 4.33( 2.08) 3.81( 2.16) 3.61( 2.15) 4.55( 2.06) 4.33( 2.08) 3.81( 2.16) 3.61( 2.15) # of terms in F 20.46( 9.74) 28.36( 10.52) 35.00( 11.19) 42.69( 12.00) 20.46( 9.74) 28.36( 10.52) 35.00( 11.19) 42.69( 12.00) Gröbner ratio 0( 0.017) 0( 0.009) 0( 0.001) 0( 0) 0( 0.017) 0( 0.009) 0( 0.001) 0( 0) Size of 
G 2.00( 0) 3.00( 0) 4.00( 0) 5.00( 0) 2.00( 0) 3.00( 0) 4.00( 0) 5.00( 0) Max degree in G 4.07( 1.31) 4.07( 1.31) 4.06( 1.31) 4.07( 1.31) 4.07( 1.31) 4.07( 1.31) 4.06( 1.31) 4.07( 1.31) Min degree in G 2.56( 1.24) 2.16( 1.18) 1.88( 1.07) 1.68( 0.95) 2.56( 1.24) 2.16( 1.18) 1.88( 1.07) 1.68( 0.95) # of terms in G 6.63( 2.33) 9.18( 3.26) 11.74( 4.15) 14.30( 5.03) 6.63( 2.33) 9.18( 3.26) 11.74( 4.15) 14.30( 5.03) Gröbner ratio 1( 0) 1( 0) 1( 0) 1( 0) 1( 0) 1( 0) 1( 0) 1( 0)

F.4 Generalization to out-distribution samples
Handling out-distribution samples is beyond the scope of the current work. Several studies using Transformers for math problems (e.g., integer multiplication [30] and linear algebra [19]) addressed out-distribution generalization by controlling the training sample distribution. This is because these problems are of moderate difficulty and naive training sample generation methods exist. However, this is not the case for our task, and we thus focused on the problem of efficient generation of training samples in this paper. Nevertheless, we consider that demonstrating the limitations of our work experimentally is helpful for future work. Thus, we show here, through several cases, that Transformers trained on our datasets indeed fail on out-distribution samples.
Out-distribution samples. We generate additional datasets Du n(k) for k ∈ {Q, F7, F31}. These datasets are generated in the same way as Dn(k), with a slight difference in the sampling of random polynomials. Originally, a (degree-bounded) random polynomial is obtained using monomials randomly sampled from $k[x_1, \dots, x_n]_{\leq d_{\max}}$. Since there are more high-degree terms than low-degree ones, random polynomials are more likely to be of high degree. In Du n(k), we instead uniformly sample the degree bound d from U[1, dmax] and then conduct the sampling of monomials, as sketched below. As Tab. 23 shows, this change in the sampling strategy causes some distribution shifts. Table 24 shows the prediction accuracy and support accuracy on the new datasets. As can be seen, the accuracy drops sharply when the base field is a finite field Fp; when it is Q, the accuracy drop is moderate.
Katsura-n. Katsura-n is a typical benchmarking example for Gröbner basis computation algorithms (here, n denotes the number of variables). Table 25 shows a list of examples for different n ∈ {2, 3, 4} and k ∈ {Q, F7, F31}. While the Gröbner bases of Katsura-n have the form of Eq. (4.1), one can readily see that its non-Gröbner sets differ qualitatively from those in our Dn(k) datasets (cf. Tables 7–18). For example, non-Gröbner sets in Katsura-n consist of low-degree sparse polynomials, whereas those in Dn(k) do not, because of the generation process (i.e., products and sums of polynomials arising from the multiplication by the polynomial matrices U1, U2). Therefore, Katsura-n examples are far out-of-distribution for our Transformers. We fed these samples to the trained Transformers, but only obtained output sequences that cannot be transformed back into polynomials because of their invalid prefix representation. As Gröbner basis computation is an NP-hard problem, it may not be a good idea to pursue a general solver via learning. Instead, we should ultimately aim at a solver for large-scale but specialized cases. Thus, the generation of application-specific training samples and the pursuit of in-distribution accuracy will be a future direction.
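To make the distribution shift concrete, the following is a minimal Python sketch of the two support-sampling strategies contrasted above (our own illustration; function names are hypothetical, and coefficients as well as the merging of duplicate monomials are omitted):

import itertools
import random

def monomials_up_to_degree(n_vars, d):
    # exponent vectors of all monomials in k[x_1, ..., x_n] of total degree <= d
    return [e for e in itertools.product(range(d + 1), repeat=n_vars) if sum(e) <= d]

def sample_support(n_vars, d_max, mu_max, rng=random):
    # default strategy: high-degree monomials dominate simply because there are more of them
    mu = rng.randint(1, mu_max)
    return rng.choices(monomials_up_to_degree(n_vars, d_max), k=mu)

def sample_support_uniform_degree(n_vars, d_max, mu_max, rng=random):
    # out-distribution variant: first draw the degree bound uniformly, then sample monomials
    d = rng.randint(1, d_max)
    mu = rng.randint(1, mu_max)
    return rng.choices(monomials_up_to_degree(n_vars, d), k=mu)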
From this perspective, existing mathematical benchmark examples may not be practical because they are mostly artificial, empirically found difficult, and/or designed for easily generating variations in the number of variables n.12 They are useful for mathematical algorithms, i.e., algorithms proved to work for all cases (i.e., 100% of in/out-distribution samples), but not for our current work because they are out-distribution samples.
12 For example, refer to https://www-sop.inria.fr/coprin/logiciels/ALIAS/Benches/node1.html.

G Buchberger–Möller Algorithm for Prob. 4.1

Here, we discuss another approach to Prob. 4.1 using the Buchberger–Möller (BM) algorithm [2, 68]. Although we did not adopt this approach, we include it for completeness, as many variants have been recently developed and applied extensively in machine learning and other data-centric applications. Given a set of points X ⊂ k^n and a graded term order, the BM algorithm computes a Gröbner basis of its vanishing ideal I(X) = {g ∈ k[x1, . . . , xn] | g(p) = 0 for all p ∈ X}. While several variants have followed in computational algebra [1, 32, 40, 48, 49, 54, 62], interestingly, it has also recently been tailored for machine learning [42, 51–53, 56, 63, 90, 91]. Various applications have followed, such as machine learning [77, 92], signal processing [86–88], nonlinear dynamics [46, 47, 50], and more [5, 43, 55]. These broad applications derive from the distinguishing design of the BM algorithm: unlike most computer-algebraic algorithms, it takes a set of points (i.e., a dataset) as input, not a set of polynomials. Therefore, to address Prob. 4.1, one may consider using the BM algorithm or its variants, e.g., by running the BM algorithm m times while sampling diverse sets of points. An important caveat is that the Gröbner bases produced by the BM algorithm may be more restrictive than those considered in the main text (i.e., the Gröbner bases of ideals in shape position). For example, the former generates the largest ideals that have the given k-rational points as their roots, whereas this is not the case for the latter. Another drawback of using the BM algorithm is its large computational cost. The time complexity of the BM algorithm is O(n|X|^3). Furthermore, we need O(n^d) points on average to obtain a Gröbner basis that includes a polynomial of degree d. Therefore, the BM algorithm does not fit our setting, where a large number of Gröbner bases is needed (i.e., m ≈ 10^6). Accelerating the BM algorithm by reusing the results of previous runs instead of independently running the algorithm many times would be interesting future work.
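For intuition, here is a small example we add for illustration (it is not part of the original App. G): for the two points $X = \{(0,0), (1,1)\} \subset \mathbb{Q}^2$, the vanishing ideal is $I(X) = \langle x - y,\ y^2 - y \rangle$, and $\{x - y,\ y^2 - y\}$ is its reduced Gröbner basis under grlex with $x \succ y$; as expected for the vanishing ideal of a finite point set, $\dim_{\mathbb{Q}} \mathbb{Q}[x, y]/I(X) = |X| = 2$. The BM algorithm produces exactly such bases directly from the point coordinates.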
H Open Questions

Random generation of Gröbner bases. To our knowledge, no prior studies have addressed this problem. Our study focuses on generic 0-dimensional ideals. These ideals are generally in shape position, and a simple sampling strategy is available. However, some applications may be interested in other classes of ideals (e.g., positive-dimensional or binomial ones) or a particular subset of 0-dimensional ideals (e.g., those associated with a single solution in the coefficient field). The former case is an open problem. The latter case may be addressed by the Buchberger–Möller algorithm [68] (cf. App. G).
Backward Gröbner problem. Machine learning models perform better on in-distribution samples than on out-distribution samples. Thus, it is essential to design a training sample distribution that is close to one's use case. As noted in Sec. 4.1, polynomial systems (whether Gröbner bases or not) take a domain-specific form. Backward generation gives us control over the Gröbner bases but not over the non-Gröbner sets. Hence, we need a backward generation method tailored to an application, just as specialized Gröbner basis computation algorithms exist in computational algebra. This paper addressed the generic case. Prop. 4.7 states that any matrix A ∈ SLn(k[x1, . . . , xn]) satisfies the requirement of Prob. 4.5. This raises two sets of open questions: (i) are there matrices outside SLn(k[x1, . . . , xn]) satisfying Prob. 4.5, and can we sample them? (ii) is it possible to efficiently sample matrices of SLn(k[x1, . . . , xn])? To efficiently generate our dataset, we restricted ourselves to sampling matrices having a Bruhat decomposition (see Eq. (4.2)), which form a strict subset of SLn(k[x1, . . . , xn]). Sampling matrices in SLn(k[x1, . . . , xn]) remains an open question. Thanks to Suslin's stability theorem and its algorithmic proofs [64, 72, 81], SLn(k[x1, . . . , xn]) is generated by elementary matrices, and a decomposition into a product of elementary matrices can be computed algorithmically. One may hope to sample elementary matrices in order to sample matrices of SLn(k[x1, . . . , xn]); it is unclear whether this can be efficient, as many elementary matrices are needed [64].
Distribution analysis. Several studies have reported that careful design of the training set leads to better generalization of Transformers [19, 21]. Algebraically, analyzing the distribution of Gröbner bases and non-Gröbner sets is challenging, particularly when additional structures (e.g., sparsity) are injected. Thus, a first step may be to investigate the generic case (i.e., dense polynomials). In this case, Thm. 4.6(3) is helpful for designing an algorithm certified to yield all possible G and F almost uniformly. While dataset generation algorithms should run efficiently for practicality, a solid analysis may be of independent interest in computational algebra.
Long mathematical expressions. As in most prior studies, we used a Transformer with a standard architecture to see whether a specialized model is necessary. From the accuracy perspective, such a vanilla model may be sufficient for Q[x1, . . . , xn] but not for Fp[x1, . . . , xn]. From the efficiency perspective, the quadratic cost of the attention mechanism prevents us from scaling up the problem size. For large n, both non-Gröbner sets and Gröbner bases generally consist of many dense polynomials, leading to long input and output sequences. Unlike in natural language processing tasks, the input sequence cannot be split, as all the symbols are related in mathematical tasks. It is worth studying whether attention mechanisms with sub-quadratic memory cost [11, 23, 28, 57, 80] can be introduced with only a small degradation of performance even for mathematical sequences, which have a different nature from natural language (e.g., a single modification of the input sequence can completely change the solution of the problem).
Table 7: Success examples from the D 2 (Q) test set.
0 f1 = x0 +5/2x4 1 2x3 1 +5x2 1 2/3x1 +1/5 g1 = x0 +5/2x4 1 2x3 1 +5x2 1 2/3x1 +1/5 f2 = 3/5x2 0x2 1 + 1/5x2 0x1 + 3/2x0x6 1 7/10x0x5 1 + 13/5x0x4 1 + 3/5x0x3 1 1/75x0x2 1 + 1/25x0x1 + x5 1 5/4x3 1 + 5x1 g2 = x5 1 5/4x3 1 + 5x1 f1 = x0x3 1 + x5 1 + 3x4 1 x1 + 2/3 g1 = x0 + x1 f2 = 1/3x2 0x2 1 + 4x0x7 1 + 8x0x6 1 2x0x5 1 11/3x0x3 1 + 8/3x0x2 1 2x7 1 6x6 1 + 2x3 1 4/3x2 1 g2 = x5 1 + 2x4 1 x1 + 2/3 f3 = 4/3x4 0x2 1 + 16x3 0x7 1 + 32x3 0x6 1 44/3x3 0x3 1 + 32/3x3 0x2 1 + 3/5x3 0x1 + 13/4x2 0x5 1 + 5/2x2 0x4 1 33/20x2 0x2 1 5/4x2 0x1 + 5/6x2 0 x0x7 1 + 3/5x0x6 1 + 6/5x0x5 1 5/4x0x3 1 19/15x0x2 1+2/5x0x1 f4 = 3/25x5 0x1 1/4x4 0x5 1 1/2x4 0x4 1 + 7/25x4 0x2 1+1/4x4 0x1 1/6x4 0 3/25x3 0x6 1 6/25x3 0x5 1 + 13/20x3 0x3 1 22/25x3 0x2 1 2/25x3 0x1 12x2 0x7 1 24x2 0x6 1 +1/4x2 0x5 1 + 3/4x2 0x4 1 + 11x2 0x3 1 8x2 0x2 1 1/4x2 0x1 + 1/6x2 0 + x0 + x1 f1 = 3/4x2 0x3 1 x0x5 1 +2x0x2 1 1/5x0x1 2x3 1 + 1/5x2 1 f2 = 5/3x3 0 3/2x2 0x4 1 + 5/3x2 0x1 + 2x0x6 1 4x0x3 1 + 2/5x0x2 1 + 4x4 1 + 3/5x3 1 f3 = 5/6x5 0 + 5/6x4 0x1 5/2x3 0x3 1 + 13/4x2 0x4 1 + 1/2x2 0x3 1 x0x6 1 + 2x0x3 1 1/5x0x2 1 + x0 + 3/2x6 1 2x4 1 + 1/5x3 1 x1 f1 = 5x3 0x1 + 4x2 0x1 + x1 g1 = x0 4/5 f2 = 25/3x4 0x3 1 + 20/3x3 0x3 1 4x2 0x2 1 5/3x2 0x1 + 5/3x0x3 1 + 4/3x0x1 f3 = 5/2x4 0x3 1 + 2x3 0x3 1 2x2 0x5 1 5/6x2 0x4 1 + 2/3x0x4 1 + 1/2x0x3 1 + x0 4/5 f1 = x0 + 5/3x4 1 + x2 1 + 5/4x1 5/2 g1 = x0 + 5/3x4 1 + x2 1 + 5/4x1 5/2 f2 = x3 0 + 5/3x2 0x4 1 + x2 0x2 1 + 5/4x2 0x1 1/2x2 0 + 5x0x6 1 25x0x5 1 + 10/3x0x4 1 + 2x0x2 1 + 2x0x1 5x0 + 5/6x5 1 + 1/2x3 1 + 5/8x2 1 5/4x1 g2 = x5 1 5x4 1 1/5 f3 = x5 0 + 5/3x4 0x4 1 + x4 0x2 1 + 5/4x4 0x1 1/2x4 0 + 5x3 0x6 1 25x3 0x5 1 + 10/3x3 0x4 1 + 2x3 0x2 1+5/2x3 0x1 5x3 0+5/3x2 0x5 1+x2 0x3 1+ 5/4x2 0x2 1 5/2x2 0x1+x0x1+8/3x5 1 5x4 1+ x3 1 + 5/4x2 1 5/2x1 1/5 5 f1 = x0 + 2/5x1 g1 = x0 + 2/5x1 f2 = 1/3x2 0x2 1 3/4x2 0 2/15x0x3 1 1/20x0x1 + 11/10x2 1 4 g2 = x2 1 4 6 f1 = 3/5x2 0x2 1 x0x6 1 + 3/25x0x4 1 12/25x0x3 1 + 3/4x0 + x5 1 29/20x4 1 + 19/20x2 1 3/5x1 1/2 g1 = x0 5/3x4 1 + 1/5x2 1 4/5x1 f2 = 3/10x2 0x3 1 1/2x0x7 1 + 3/50x0x5 1 6/25x0x4 1 + 3/8x0x1 + x0 + 1/2x6 1 29/40x5 1 5/3x4 1 + 19/40x3 1 1/10x2 1 21/20x1 g2 = x5 1 1/5x4 1 + 4/5x2 1 1/2 7 f1 = x0 + 5/4x3 1 5/4 g1 = x0 + 5/4x3 1 5/4 f2 = x3 0 5/4x2 0x3 1 + 4x2 0x1 + 5/4x2 0 + 5x0x4 1 5x0x1 + x5 1 + 1/2x3 1 + 2/5x1 g2 = x5 1 + 1/2x3 1 + 2/5x1 Table 8: Success examples from the D 3 (Q) test set. 
f1 = 4x0x2 1x2 + 4/5x0x1x2 + 4/5x0x2 2 x2 1x2 2 4x2 1x2 + x2 2 1 g1 = x0 1/4x2 1 f2 = 20x0x3 1x2 2+4x0x2 1x2 2+4x0x1x3 2+x0 5x3 1x3 2 20x3 1x2 2+5x1x3 2 5x1x2 1/4x2 1 g2 = x1 + x2 f3 = 4/5x3 0x2 1x2 2 + 4/25x3 0x1x2 2 + 4/25x3 0x3 2 1/5x2 0x2 1x3 2 4/5x2 0x2 1x2 2 + 1/5x2 0x3 2 3/5x2 0x2 + 16/3x0x3 1x3 2 + 16/15x0x2 1x3 2+16/15x0x1x4 2+1/10x0x2 2+ 2/5x0x2 4/3x3 1x4 2 16/3x3 1x3 2 + 4/3x1x4 2 4/3x1x2 2 + x1 + x2 g3 = x2 2 1 f4 = 1/2x2 0x1x2 + 1/2x2 0x2 2 + 20x0x3 1x2 2 + 4x0x2 1x2 2 + 4x0x1x3 2 + 3/5x0x1x2 + 4/3x0x1 + 3/5x0x2 2 5x3 1x3 2 20x3 1x2 2 5/4x3 1x2 5/4x2 1x2 2 +5x1x3 2 16/3x1x2 4/3x1 f5 = 1/3x4 0x1x2 1/3x4 0x2 2 2/5x3 0x1x2 2/5x3 0x2 2 + 16x2 0x4 1x2 + 16/5x2 0x3 1x2 +16/5x2 0x2 1x2 2 4/3x2 0x1x2 4x0x4 1x2 2 16x0x4 1x2 +4x0x2 1x2 2 4x0x2 1 + 5x0x1x3 2 + 1/3x0x1x2 2 11/3x0x1x2 + 2/15x0x3 2 + 1/5x0x2 x2 1 x1x2 1/12x4 2 1/3x3 2 f1 = x1 x2 2 4x2 g1 = x0 + 1/5x2 2 + 4/3x2 + 5/3 f2 = 2/3x2 0x1x2 + 2/3x2 0x3 2 + 8/3x2 0x2 2 1/5x0x1x2 2 2/3x0x1 2/5x0x4 2 8/5x0x3 2+2/3x0x2 2+8/3x0x2 1/5x2 1x2 3/25x1x4 2 3/5x1x3 2 1/5x1x2 2 g2 = x1 x2 2 4x2 f3 = 1/2x2 0x3 1x2 + 1/2x2 0x2 1x3 2 + 2x2 0x2 1x2 2 2/3x2 0x1x2 + 2/3x2 0x3 2 + 8/3x2 0x2 2 3/20x0x3 1x2 2 3/10x0x2 1x4 2 6/5x0x2 1x3 2 1/5x0x1x2 2 2/5x0x4 2 8/5x0x3 2 + x0 9/100x3 1x4 2 3/5x3 1x3 2 3/4x3 1x2 2 3/25x1x4 2 + 1/5x1x3 2 9/5x1x2 2 x5 2 16/5x4 2 + 16/5x3 2 + 1/5x2 2 + 4/3x2 + 5/3 g3 = x3 2 + 3/5x2 2 3/2 f4 = 1/2x3 0x2 1x2 + 1/2x3 0x1x3 2 + 2x3 0x1x2 2 x3 0x1 3/20x2 0x2 1x2 2 3/10x2 0x1x4 2 6/5x2 0x1x3 2 1/5x2 0x1x2 2 4/3x2 0x1x2 5/3x2 0x1 9/100x0x2 1x4 2 3/5x0x2 1x3 2 3/4x0x2 1x2 2 1/4x0x2 1 + 1/2x0x1x2 2 + 3/2x0x1x2 1/2x0x4 2 7/2x0x3 2 6x0x2 2 1/20x2 1x2 2 1/3x2 1x2 5/12x2 1 x1x2 + 2x3 2 + 23/5x2 2 3/2 f5 = 4/9x3 0x3 1x2 + 3x3 0x3 1 4/9x3 0x2 1x3 2 16/9x3 0x2 1x2 2 + 11/15x2 0x3 1x2 2 + 4x2 0x3 1x2 + 5x2 0x3 1 + 4/15x2 0x2 1x4 2 + 16/15x2 0x2 1x3 2 1/2x2 0x2 1 + 2/25x0x3 1x4 2 + 8/15x0x3 1x3 2 5/6x0x3 1x2 2 + 3/2x0x2 1x4 2 + 6x0x2 1x3 2 1/10x0x2 1x2 2 2/3x0x2 1x2 + 2/3x0x2 1 3/2x0x1x2 2 23/3x0x1x2 + 5/3x0x3 2 + 20/3x0x2 2 x0x2 + 3x3 1x2 6x2 1x3 2 69/5x2 1x2 2 + 9/2x2 1 + 1/3x1x4 2 + 17/10x1x3 2 3/4x1x2 3/2x5 2 6x4 2 + 1/20x3 2 1/3x2 2 5/3x2 Table 9: Success examples from the D 4 (Q) test set. 
f1 = x0 + 3x4 3 x3 3 3/2x3 g1 = x0 + 3x4 3 x3 3 3/2x3 f2 = 3/5x2 0x2 3 9/5x0x6 3 + 3/5x0x5 3 + 9/10x0x3 3 + x1 + x3 g2 = x1 + x3 f3 = 1/2x2 0x2 1 3/2x0x2 1x4 3+1/2x0x2 1x3 3+ 3/4x0x2 1x3 x4 1 x3 1x3 4/5x1x2 2x3 4/5x2 2x2 3 + x5 3 x4 3 + 4/3x2 3 g3 = x2 1/5x4 3 + 1/4x3 + 1 f4 = x2 0x1 + x2 0x3 1/4x0x1x2 3 3/2x3 1 3/2x2 1x3 3/4x1x6 3+1/4x1x5 3+3/8x1x3 3+ 1/5x2x5 3 1/5x2x4 3 + 4/15x2x2 3 + x2 1/5x4 3 + 1/4x3 + 1 g4 = x5 3 x4 3 + 4/3x2 3 f1 = x1 2/5x3 1/2 g1 = x0 + 1/2x3 f2 = x0x2 1 2/5x0x1x3 1/2x0x1 2/3x3 1x3 1/3x2 1x2x3 + 4/15x2 1x2 3 + 1/3x2 1x3+2/15x1x2x2 3+1/6x1x2x3+x5 3+ 1/2x4 3 5/2 g2 = x1 2/5x3 1/2 f3 = 4/3x0x3 1 +8/15x0x2 1x3 +2/3x0x2 1 + x0 + 5/2x2 1x2 x1x2x3 5/4x1x2 4/3x1x5 3 2/3x1x4 3 +1/4x1x3 +10/3x1 1/10x2 3 + 3/8x3 g3 = x2 + 4/3x4 3 + x3 3 + 4/5x2 3 f4 = 4x3 0x3 + 2x2 0x2 3 + x0x2 1x2 2 2/5x0x1x2 2x3 1/2x0x1x2 2 + 1/3x3 1x2 2/15x2 1x2x3 1/6x2 1x2 1/5x2 1x2 3 + 2/25x1x3 3 +1/10x1x2 3 +x2 2x5 3 +1/2x2 2x4 3 5/2x2 2 + x2 + 4/3x4 3 + x3 3 + 4/5x2 3 g4 = x5 3 + 1/2x4 3 5/2 f1 = x2 + x3 g1 = x0 + 1/3x3 2/3 f2 = 3/2x2 0x2 3/2x2 0x3 5/4x2 0 5/3x0x1x3 + 2/3x0x3 2 + 2/3x0x2 2x3 5/12x0x3+5/6x0 5/9x1x2 3+10/9x1x3+ x1 + 3/2x3 g2 = x1 + 3/2x3 f3 = 15/8x3 0x2 3 5/2x2 0x1x3 3 5/8x2 0x3 3 + 5/4x2 0x2 3 5/6x0x1x4 3 + 5/3x0x1x3 3 + 3/2x0x1x2 3 + 9/4x0x3 3 + 1/2x2 1x2 + 1/2x2 1x3 + 1/2x2 + 1/2x3 g3 = x2 + x3 f4 = 5/8x2 0x2 2 5/6x0x1x2 2x3 5/24x0x2 2x3 31/12x0x2 2 3x0x2x3 5/18x1x2 2x2 3 + 5/9x1x2 2x3 + 1/2x1x2 2 + 3/4x2 2x3 + 1/2x2 2 + 1/2x2x3 + x2 3 f5 = 5x2 0x3 3 3/4x0x1x2 2 3/4x0x1x2x3 + 20/3x0x1x4 3 + 9/4x0x3 2x3 + 9/4x0x2 2x2 3 + 5/3x0x4 3 10/3x0x3 3 + x0 + 20/9x1x5 3 40/9x1x4 3 4x1x3 3 +1/4x1x2 3 3/4x2x3 3 + 2x2x2 3 6x4 3 + 2x3 3 + 1/3x3 2/3 f1 = x0 1/3x3 g1 = x0 1/3x3 f2 = 1/3x0x2 1 + 1/2x0x3 3 + 1/9x2 1x3 1/6x4 3 + x2 3 + 3/5x3 g2 = x1 + 1/2x3 f3 = x2 0x2x2 3 3/5x2 0x2x3 x0x2 1x2 + 5x0 + 1/3x2 1x2x3 + x1 7/6x3 g3 = x2 + x3 f4 = 1/2x0x1x2 + 4/5x0x3 2 + x0x3 3 + 3/5x0x2 3 + 4/5x1x2 2 1/6x1x2x3 4/15x3 2x3 + 2/5x2 2x3 + x2 + x3 g4 = x2 3 + 3/5x3 f5 = 5x2 0x2x3 3/10x0x2 1x2 2 2x0x1x2x3 3/2x0x2 2x3 + 5/12x0x2x2 3 3/4x0x2x3 + 1/2x0x4 3 + 3/10x0x3 3 + 1/10x2 1x2 2x3 3/5x1x2 2 + x1x2x2 3 3/5x1x2x3 + 1/2x2 2x2 3 + 1/6x2x3 3 5x5 3 3x4 3 Table 10: Success examples from the D 5 (Q) test set. 
f1 = x0 x4 4 x3 4 2x2 4 + 1/4 g1 = x0 x4 4 x3 4 2x2 4 + 1/4 f2 = 5/4x0x3 1 + 1/2x0x2 1 5/4x3 1x4 4 5/4x3 1x3 4 5/2x3 1x2 4 +5/16x3 1 1/2x2 1x4 4 1/2x2 1x3 4 x2 1x2 4 + 1/8x2 1 1/3x1x2x2 3 + 3x1x3x4 + 5/3x2x2 3x4 + x2 15x3x2 4 5x4 4 + x3 4 + 2/5x2 4 + 2x4 5/2 g2 = x1 5x4 f3 = x3 0x4 x2 0x5 4 x2 0x4 4 2x2 0x3 4 + 1/4x2 0x4 + x0x2 1x4 + 2x0x1x2 4 + x0x3 3 x2 1x5 4 x2 1x4 4 2x2 1x3 4 + 1/4x2 1x4 4/3x1x3 2x2 3x4 + 1/15x1x2 2x3 3 + 12x1x2 2x3x2 4 3/5x1x2x2 3x4 2x1x6 4 2x1x5 4 4x1x4 4 +1/2x1x2 4 +20/3x3 2x2 3x2 4 + 4x3 2x4 1/3x2 2x3 3x4 60x2 2x3x3 4 1/5x2 2x3 20x2 2x5 4 + 4x2 2x4 4 + 8/5x2 2x3 4 + 8x2 2x2 4 10x2 2x4 + 3x2x2 3x2 4 + x2x3x4 4 1/5x2x3x3 4 2/25x2x3x2 4 2/5x2x3x4 + 1/2x2x3 x3 3x4 4 x3 3x3 4 2x3 3x2 4 +1/4x3 3 + x5 4 + 3/2x4 4 1/2x3 4 5/3x2 4 1 g3 = x2 5x4 4 + x3 4 + 2/5x2 4 + 2x4 5/2 f4 = 1/2x2 0x1x2 + 1/2x0x2 1x2 3x4 + 4/3x0x2 1x3x2 4 5/3x0x2 1x3 5/6x0x1x2x4 3 1/2x0x1x2x4 4 1/2x0x1x2x3 4 x0x1x2x2 4 + 1/8x0x1x2 + 15/2x0x1x3 3x4 + 25/6x0x2x4 3x4 + 5/2x0x2x2 3 + 1/2x0x5 3 + 4/3x0x4 3x4 75/2x0x3 3x2 4 25/2x0x2 3x4 4 +5/2x0x2 3x3 4 + x0x2 3x2 4 + 5x0x2 3x4 25/4x0x2 3 1/2x2 1x2 3x5 4 1/2x2 1x2 3x4 4 x2 1x2 3x3 4 + 1/8x2 1x2 3x4 4/3x2 1x3x6 4 4/3x2 1x3x5 4 x2 1x3x4 4 + 5/3x2 1x3x3 4 + 11/3x2 1x3x2 4 5/12x2 1x3 1/2x5 3x4 4 1/2x5 3x3 4 x5 3x2 4 + 1/8x5 3 4/3x4 3x5 4 4/3x4 3x4 4 8/3x4 3x3 4 + 1/3x4 3x4+1/2x2 3x5 4+3/4x2 3x4 4 1/4x2 3x3 4 5/6x2 3x2 4 1/2x2 3 + 4/3x3x6 4 + 2x3x5 4 2/3x3x4 4 20/9x3x3 4 4/3x3x4 + x3 + 5x4 4 + 1/5x2 4 + 1/2x4 + 2/5 g4 = x3 + 5x4 4 + 1/5x2 4 + 1/2x4 + 2/5 f5 = x2 0x2 3 + 3/2x0x2 1x2x2 3x4 + 3/2x0x2x5 3+x0x2 3x4 4+x0x2 3x3 4+2x0x2 3x2 4 1/4x0x2 3 3/2x2 1x2x2 3x5 4 3/2x2 1x2x2 3x4 4 3x2 1x2x2 3x3 4+3/8x2 1x2x2 3x4+1/3x1x2 2x4 3 3x1x2x3 3x4 + x1 5/3x2 2x4 3x4 x2 2x2 3 3/2x2x5 3x4 4 3/2x2x5 3x3 4 3x2x5 3x2 4 + 3/8x2x5 3 + 15x2x3 3x2 4 + 3/2x2x2 3x5 4 + 29/4x2x2 3x4 4 7/4x2x2 3x3 4 29/10x2x2 3x2 4 2x2x2 3x4 + x2x2 3 3x2 3x2 4 15x3x6 4 3/5x3x4 4 3/2x3x3 4 6/5x3x2 4 5x4 g5 = x5 4 + 3/2x4 4 1/2x3 4 5/3x2 4 1 Table 11: Success examples from the D 2 (F7) test set. 
0 f1 = x0 + 3x1 g1 = x0 + 3x1 f2 = x3 0 + x2 0x1 + 3x0x2 1 + x4 1 x3 1 3 g2 = x4 1 3 f1 = 3x0x3 1 + 2x3 1 g1 = x0 2 f2 = x2 0x1+2x0x4 1+x0x3 1+2x0x1+x4 1 3x3 1 + x1 f3 = x3 0x3 1 2x2 0x3 1 + x0x4 1 x0x3 1 + x0 3x4 1 2 f1 = 2x3 0 x2 0x4 1+3x2 0x2 1+x2 0 2x6 1+3x3 1+ 2x2 1 + 3x1 g1 = x0 + 3x4 1 2x2 1 3 f2 = 3x4 0x2 1 + 3x4 0x1 2x3 0x6 1 + 2x3 0x5 1 x3 0x4 1 + x3 0x3 1 + 2x3 0x2 1 2x3 0x1 + 3x0x8 1 3x0x7 1 x0x5 1 2x0x4 1 + 2x0x3 1 + x0x2 1 + x0 + 3x4 1 2x2 1 3 g2 = x5 1 + 2x2 1 x1 + 2 f3 = 2x5 0 +x4 0x4 1 3x4 0x2 1 x4 0 +2x2 0x6 1 3x2 0x3 1 2x2 0x2 1 3x2 0x1 + x2 0 + 3x0x4 1 3x0 x6 1 + x5 1 + 3x4 1 + 3x2 1 x1 + 2 f1 = 2x2 0x2 1 2x0x3 1 2x0x2 1 + 2x0x1 + x4 1 + 2x2 1 + 2x1 g1 = x0 + x1 + 1 f2 = x3 0x2 1 + x2 0x3 1 + x2 0x2 1 x2 0x1 + x2 0 + 3x0x4 1 2x0x2 1 + x0x1 + x0 + x2 1 + x1 f3 = x4 0x2 1 x3 0x3 1 + x3 0x1 3x2 0x4 1 + 2x2 0x3 1 + 2x2 0x2 1 + x2 0x1 + x0x5 1 + 2x0x4 1 + x0 + 3x6 1 x4 1 x3 1 + x1 + 1 f4 = 3x4 0x1+2x3 0x3 1+x3 0x2 1 3x3 0x1+x3 0 x2 0x4 1+3x2 0x3 1 2x2 0x2 1+x2 0x1+x2 0 3x0x5 1+ x0x3 1 +3x0x2 1 2x0 +2x3 1 +3x2 1 2x1 2 5 f1 = x2 0x1 + 3x0x1 + x3 1 g1 = x0 3 f2 = 3x2 0x4 1 3x2 0x3 1 2x0x4 1+2x0x3 1+x0 3x6 1 + 3x5 1 3 f1 = 3x2 0x1 x0x4 1 x0x2 1 x7 1 3x5 1 + 2x3 1 x1 g1 = x0 + 2x3 1 + 2x1 f2 = 2x2 0x3 1 3x0x6 1 3x0x4 1 + x0 3x9 1 2x7 1 x5 1 x3 1 + 2x1 g2 = x4 1 2 f3 = 2x3 0x2 1 + x3 0x1 + 3x2 0x5 1 + 2x2 0x4 1 + 3x2 0x3 1 + 2x2 0x2 1 + 3x0x8 1 + 2x0x6 1 + x0x4 1 2x0x2 1 x0x1 3x5 1 x4 1 3x3 1 2x2 1 2 f4 = x3 0x2 1 + 2x3 0x1 + x3 0 + 2x2 0x5 1 + x2 0x4 1 3x2 0x3 1 3x2 0x2 1 + 2x2 0 2x0x7 1 + x0x5 1 2x0x4 1 +3x0x3 1 +2x0x1 +3x0 +x5 1 2x4 1 + x3 1 + 3x2 1 + 3 f1 = x0 1 g1 = x0 1 f2 = 3x2 0x2 1 + 3x2 0x1 + 2x2 0 2x0x3 1 + 3x0x2 1 3x0x1 2x0 f3 = x4 0x3 1 3x3 0x4 1 + x3 0x3 1 + x3 0x2 1 + x3 0 + 3x2 0x3 1 x2 0x2 1 x2 0 + 3x0x2 1 3x2 1 + x1 f1 = x0 + x4 1 + 2x2 1 g1 = x0 + x4 1 + 2x2 1 f2 = 3x2 0x1+3x0x5 1 x0x3 1 2x0x2 1 2x6 1+ x5 1 + 3x4 1 x3 1 x2 1 x1 g2 = x5 1 x3 1 x2 1 x1 f3 = x3 0x2 1 + 2x2 0x6 1 3x2 0x5 1 + 2x2 0x4 1 + 3x2 0x1 2x2 0 2x0x4 1 + 3x0x2 1 2x0x1 + 3x0 + x6 1 2x5 1 + 2x4 1 + 2x3 1 2x2 1 f4 = 2x3 0x8 1 2x3 0x6 1 2x3 0x5 1 2x3 0x4 1 3x3 0x2 1 3x2 0x6 1 + x2 0x3 1 + 3x2 0x2 1 + 2x2 0 + 2x0x8 1 2x0x7 1 + x0x6 1 + 2x0x5 1 2x0x4 1 + x0x2 1 Table 12: Success examples from the D 3 (F7) test set. 
f1 = 2x2 0x2 1 x2 0x1x2 2 x2 0x1 2x0x2 1 + x2 1x2 2 g1 = x0 + 3x2 2 f2 = 2x3 0x1 +3x2 0x4 1 2x2 0x3 1x2 2 2x2 0x3 1 + x2 0x1x2 2+3x0x4 1 3x0x1+2x4 1x2 2 2x1x2 2+ x1 3x2 2 3 g2 = x1 3x2 2 3 f3 = 2x3 0x2 1x2 2 + x3 0x1x4 2 3x3 0x1x3 2 + x3 0x1x2 2 + 2x2 0x2 1x2 2 2x2 0x1x5 2 x0x2 1x4 2 x0x1x3 2+x0 3x1x5 2 2x1x3 2 x5 2 x3 2+3x2 2 g3 = x5 2 3x4 2 2x3 2 3x2 2 1 f4 = 3x4 0x2 1+3x4 0x1x2 2+3x4 0x1 2x3 0x2 1x2 2+ 3x3 0x2 1 +3x3 0x1 +2x2 0x5 1 +x2 0x4 1x2 2 +x2 0x4 1 + 2x2 0x2 1x2 2 x2 0x2 1 + 2x2 0x1x2 2 + 2x2 0x1 + 2x0x5 1 3x0x2 1x2 2 3x0x2 1 x5 1x2 2 2x3 1x2 x2 1x3 2 3x2 1x2 2 x2 1x2 2x2 1 x1x4 2+2x1x2 2 x1 + x5 2 3x4 2 2x3 2 3x2 2 1 f1 = x0 + 3x2 g1 = x0 + 3x2 f2 = 3x2 0x2 2 + 2x0x3 2 + x1 + x2 + 3 g2 = x1 + x2 + 3 f3 = 3x3 0x3 1 x3 0x1 2x2 0x3 1x2 + 2x2 0x2 1 + x2 0x1x2 2x2 0x1 2x2 0x2 + x2 0 x0x5 1x2 x0x2 1x2 2x0x1x2 2 + x0x1x2 + x0x2 2 + 3x0x2 3x5 1x2 2 +3x3 1x2 +3x2 1x2 2 +2x2 1x2 + x2 2 f4 = 3x4 0x3 1+3x3 0x3 1x2+2x3 0x3 1+x2 0x3 1x2 2+ 2x2 0x3 1x2+3x2 0x2 1x2 2+2x2 0x2 1x2+3x2 0x1x2 2 x2 0x1 x2 0x2 3x2 0 3x0x5 1x2 3x0x4 1x2 2 + 3x0x2 1x2 2 + 3x0x2 1x2 + 2x0x2 1 x0x1x2 2 + 2x0x1x2 x0x1 + x0x4 2 x4 1x2 2 + 2x2 1x2 2 x2 1x2 3x1x3 2 x1x2 2 3x1x2 f1 = 2x0x2+2x2 1x2 2+2x1x2 2 2x1 3x2 2+3 g1 = x0 2 f2 = 3x2 0x2 3x0x2 1x2 2 + 3x0x1x2 + 3x0x1 + x0x2 2 x0 + 3x3 1x3 2 + 3x2 1x3 2 3x2 1x2 x1x3 2 + 2x1x2 + x1 + 2 g2 = x1 + 2 f3 = x3 0x1 2x2 0x1 x0x1x3 2 + 2x0x3 2 x0x2 + 2x2 1x4 2 x2 1x2 2 + 2x1x4 2 + 2x1x3 2 x1x2 2 + x1 3x4 2 2x2 2 + x2 + 2 f4 = 2x4 0x1 + 2x3 0x3 1 3x3 0x1 + 3x2 0x3 1 + 2x0x1x2 2+2x0x4 2+2x0x3 2 2x0x2+2x2 1x5 2+ 2x2 1x2 +2x1x5 2 2x1x3 2 +3x1x2 2 +3x1x2 3x5 2 + 3x3 2 x2 f5 = 2x5 0x1 + 3x4 0x1 + x3 0x1x3 2 x2 0x2 1x2 + 2x2 0x2 x0x2 1x2 2 + 2x0x2 1x2 + 2x0x2 1 + 2x0x1x2 3x0x1 + 3x0x3 2 + x0 + 2x3 1x2 2 3x2 1x2 2 + 2x2 1x2 2x2 1 3x1x2 2 3x1x2 + 3x1 2 Table 13: Success examples from the D 4 (F7) test set. 
f1 = x2 3x3 3 + 2x2 3 + 1 g1 = x0 x3 f2 = x1x2 3x1x3 3 + 2x1x2 3 + 2x1 3x3 3 + x2 3 2x3 + 2 g2 = x1 3x3 3 + x2 3 2x3 + 2 f3 = 2x2 0x2 1 x2 0x1x3 3 2x2 0x1x2 3 3x2 0x1x3 + 3x2 0x1 x0x1x2 2 x0x1x2x3 + 3x0x2 2x3 3 x0x2 2x2 3 + 2x0x2 2x3 2x0x2 2 + 3x0x2x4 3 x0x2x3 3+2x0x2x2 3 2x0x2x3 x2 1x2 + 3x2 1x3 3 2x2 1x2 3 x2 1 g3 = x2 3x3 3 + 2x2 3 + 1 f4 = 2x3 0x3 1x3 + x3 0x2 1x4 3 + 2x3 0x2 1x3 3 + 3x3 0x2 1x2 3 3x3 0x2 1x3 + x2 0x2 1x2x2 3 3x2 0x2 1x3 3 3x2 0x1x2x5 3 + x2 0x1x2x4 3 2x2 0x1x2x3 3 + 2x2 0x1x2x2 3 + 3x2 0x1x2 + 2x2 0x1x6 3 3x2 0x1x5 3 x2 0x1x4 3 x2 0x1x3 3 x2 0x1x2 3 + 3x2 0x1 + 3x0x2 1x2 2x0x2 1x3 3 x0x2 1x2 3 + x0x2 1 x0x1x2 2 + 2x0x1x2x4 3 x0x1x3 3 2x0x1x2 3 3x0x1x3 + 3x0x1 + 3x0x2 2x3 3 x0x2 2x2 3 + 2x0x2 2x3 2x0x2 2 + x0x2x7 3 +2x0x2x6 3 +3x0x2x5 3 3x0x2x4 3 + 3x3 1x2 2x2 1x2x3 3 + 3x2 1x2x2 3 + x2 1x2x3 x2 1x2 + x4 3 3x3 1 g4 = x4 3 3x3 1 f5 = x3 0x3 1x3 + 3x3 0x2 1x4 3 x3 0x2 1x3 3 + 2x3 0x2 1x2 3 2x3 0x2 1x3 + 3x2 0x2 1x2x2 3 + 3x2 0x2 1x3 3x2 0x2 1 2x2 0x1x2x5 3 + 3x2 0x1x2x4 3 + x2 0x1x2x3 3 x2 0x1x2x2 3 2x2 0x1x4 3 2x2 0x1x3 3 2x2 0x1x2 3 2x2 0x1x3+ x2 0x1 + 2x0x3 1 2x0x1x2x2 3 x0x2x5 3 2x0x2x4 3 3x0x2x3 3 + 3x0x2x2 3 3x0x2 2x0x4 3+2x0x3 3+x0x2 3 x0x3 x0 2x3 1x3+ x1x2 2x3 2x3 2 3x2 2x4 3+x2 2x2 3+2x2 2x3 2x2 2 f6 = 3x2 0x3 1x2 2x2 0x3 1 + 2x2 0x2 1x2x3 3 3x2 0x2 1x2x2 3 x2 0x2 1x2x3 + x2 0x2 1x2 x2 0x2 1x3 3 2x2 0x2 1x2 3 3x2 0x2 1x3 + 2x2 0x2 1 + 3x2 0x1x3 3 x2 0x1x2 3 + 2x2 0x1x3 2x2 0x1 + 3x2 0x2 2x2 0x3 3 x2 0x2 3 + 3x2 0 + x0x5 1x3 + 2x0x2 1x2 2x3 x0x2 1x2x3 + x0x1x2 2x4 3 + 2x0x1x2 2x3 3 + 3x0x1x2 2x2 3 3x0x1x2 2x3 + 3x0x1x2x4 3 x0x1x2x3 3 + 2x0x1x2x2 3 2x0x1x2x3 3x0x4 3+2x0x3 3x0 x5 1x2 3 x2 1x2 2 + 3x1x2 2x3 3 x1x2 2x2 3 + 2x1x2 2x3 2x1x2 2 x1x3 3 +3x6 3 x5 3 +2x4 3 2x3 3 x3 f1 = 3x0x1x3 x1x2 3 + x1x3 + x2 3 g1 = x0 + 2x3 2 f2 = 2x3 0x1x3 + 3x2 0x1x2 3 3x2 0x1x3 + 2x2 0x1 3x2 0x2 3 + 3x2 0x3 3x0x2 1x2x3 + x2 1x2x2 3 x2 1x2x3 + 2x2 1x2 3 x1x2x2 3 + 3x1x3 3 + x2 2x3 3 g2 = x1 2x3 f3 = 2x2 0x4 1 3x2 0x3 1x3 + 3x2 0x1x2x3 + x0x1x2 2x3 x0x1x2x2 3 + x0x1x2x3 + x0x2x2 3 + x0 2x5 1x2 3 3x4 1x3 3 x3 1x2 + 2x3 1x3+3x3 1+2x1x2 2x2 3 2x1x2 2x3 2x2 2x2 3+ 2x3 2 g3 = x2 2x3 3 f4 = x2 0x2 1x2x3 +2x2 0x1x2x2 3 +x2 0x1x2 3x2 0x1x3 3 + x2 0x1 x2 0x4 3 + 2x0x3 1x2x3 + 2x0x1x2x3 2x0x1x2+2x0x1x3 x0x1 x3 1x2x3 3 3x3 1x2x2 3 +3x3 1x2x3 +2x2 1x2x4 3 + 3x2 1x2x2 3 3x2 1x5 3 + 3x1x2 2x3 + x1x2x2 3 2x1x2x3 x1x6 3 + 2x1x3 x1 + 2x2x3 3 + 3x4 3 + x3 3 2x3 Table 14: Success examples from the D 5 (F7) test set. 
f1 = x0x2 1+2x2 1x2 4+x2 1x4+3x2 1+x1 3x4 g1 = x0 2x2 4 x4 3 f2 = x0x2 1x3 3 + 2x2 1x3 3x2 4 + x2 1x3 3x4 + 3x2 1x3 3+3x1x2 2x3+x1x3 3 2x2 2x3x4 3x3 3x4 g2 = x1 3x4 f3 = 3x2 0x2 1x3 3x2 0x1x2 2x2 3 + 2x2 0x2 2x2 3x4 x0x2 1x3x2 4 + 3x0x2 1x3x4 + 2x0x2 1x3 + 3x0x1x3 2x0x3x4 + x0 + x1x3 2x3x2 4 3x3 2x3x3 4 2x2 4 x4 3 f4 = 2x2 0x1x2 2x3 x2 0x2 2x3x4 + x2 0x3 + 3x2 0x4 3x0x3 1x2x4 3x0x2 1x3 2x3 + 2x0x1x3 2x3x4 + 3x0x2 2x4 x3 1x2x3 4 + 3x3 1x2x2 4+2x3 1x2x4+3x2 1x2x4 2x1x2x2 4+ x2 2x3 4 3x2 2x2 4 2x2 2x4 + x3 4 x2 4 + 3x4 g4 = x3 + 3x4 f5 = 2x3 0x1x2x3 x3 0x1x2x4 + x2 0x3 3 + 3x2 0x2 3x4 + 3x0x2 1x3x4 + 2x0x1x2x3 + 2x0x1x2x3 4 2x0x1x2x2 4 x0x1x2x4 + x2 1x3 2x2 3 3x2 1x2 2x3x4+x2 1x3x3 4 3x2 1x3x2 4 2x2 1x3x4 3x1x3 2x2 3x4 + 2x1x2 2x3x2 4 + 3x1x2x3x2 4 2x1x2x3x4 + x1x2x3 3x1x3x4 +x2 3x3 4 x2 3x2 4 +3x2 3x4 +2x3x2 4 + x3 + 3x4 g5 = x3 4 x2 4 + 3x4 f6 = x2 0x2 1x3 2x2 0x2 1x2 4 3x2 0x2 1x4 3x0x2 1x2+3x0x2 1x2 3 3x0x2 1x4 4+2x0x2 1x3 4 x0x2 1x2 4 x0x1x2x4+2x0x1x2 4+3x0x2x2 4 x0x3 4 + x2 1x3 2x3 + x2 1x2 3x2 4 3x2 1x2 3x4 2x2 1x2 3 x2 1x3 x2 1x3 4 + x2 1x2 4 + x2 1x4 3x1x3 2x3x4 3x1x2 3 2x2x4 4 x2x3 4 3x2x2 4 +x2 +2x2 3x4 3x5 4 +2x4 4 x3 4 x4 f7 = 2x4 0x2x3 + x4 0x2x4 + 2x3 0x2 1x3 x3 0x1x3x4 3x3 0x1x2 4 + 3x2 0x2 1x3x2 4 2x2 0x2 1x3x4 + x2 0x2 1x3 2x2 0x1x3 2x2 0x2x3 4 + 2x2 0x2x2 4 + x2 0x2x4 x2 0x3x4 2x0x3 1x2x2 3 3x0x2 1x2x2 3x4 x0x1x2x2 3x2 4 + x0x1x2x3 + 3x0x1x2x4 2x0x1x2 3 x0x1x4 4 + x0x1x3 4 3x0x1x2 4 x0x1x4 2x0x3x3 4 + 2x0x3x2 4 + x0x3x4 3x3 1x2 2x3x4 + 2x2 1x2 2x3x2 4 + 3x1x2x2 3 3x1x2 3x2 4 x1x2 3x4 x1x2 3 + 2x1x3 4 + x1x2 4 + 3x1x4 f1 = x2 4 g1 = x0 2x4 f2 = x0x2x3 4 + x0 2x4 g2 = x1 + x4 3 f3 = 3x2 0x2 3+3x2 0 x0x1x2x4 x0x2 3x4+ x0x4 + 2x1x2x2 4 + x3 x5 4 x4 + 1 g3 = x2 + x4 f4 = 2x3 0x2 2x2 3 + 2x3 0x2 2 3x2 0x2 1x3 3 + 3x2 0x2 1x3 3x2 0x2 2x2 3x4 + 3x2 0x2 2x4 x0x2 1x3 3x4 + x0x2 1x3x4 + 3x0x2 2x3 3x0x2 2x4 + 3x0x2 2 + 3x0x2x2 4 + x2 1x2 3 x2 1x3x4 + x2 1x3 3x1x2 2x2 4 + x1 + x2x3 4 + x4 3 g4 = x3 x4 + 1 f5 = 2x3 0x2x2 3+2x3 0x2 x3 0 3x2 0x2x2 3x4+ 3x2 0x2x4 + 2x2 0x2 4 + 2x2 0x4 + 3x0x1x2 3 + 3x0x2x3 3x0x2x4 + 3x0x2 + 3x0x2 3x4 2x0x2 3 + 3x0x3 4 3x1x2 4 2x2x3 4 + x2 3x3 4 + 2x2 4 + x4 Table 15: Success examples from the D 2 (F31) test set. 
f1 = x0 + 8x1 g1 = x0 + 8x1 f2 = 8x3 0x1 2x2 0x2 1 12x0x5 1 x0x3 1 + 12x0x2 1 x4 1 x3 1 + x1 g2 = x3 1 8x1 f3 = 7x4 0x3 1 + 6x3 0x4 1 + 6x3 0x2 1 15x3 0x1 + 5x3 0 + 5x2 0x7 1 9x2 0x5 1 5x2 0x4 1 14x2 0x3 1 + 4x2 0x2 1 + 9x2 0x1 10x2 0 + 9x0x6 1 + 3x0x5 1 10x0x4 1 12x0x3 1 + 13x0x1 7x4 1 + x3 1 + 7x2 1 8x1 f1 = 6x3 0x1 + 11x2 0x4 1 + 6x2 0x2 1 2x0x3 1 11x6 1 2x4 1 g1 = x0 10x3 1 + x1 f2 = 3x3 0x4 1 10x2 0x7 1 + 3x2 0x5 1 x0x6 1 + 3x0x5 1 11x0x3 1 + 15x0x2 1 + 10x9 1 x7 1 14x6 1 + 5x5 1 11x4 1 + 15x3 1 f3 = 13x4 0x1 2x3 0x4 1 + 13x3 0x2 1 9x2 0x5 1 + 8x2 0x3 1 14x2 0x2 1 11x2 0x1 6x0x6 1 15x0x5 1 + 12x0x4 1 + 15x0x3 1 11x0x2 1 + 6x7 1 11x6 1 13x5 1 x4 1 f4 = 7x4 0x2 1 10x4 0x1+13x3 0x5 1+6x3 0x4 1 7x3 0x3 1 10x3 0x2 1 3x3 0x1 2x2 0x7 1 3x2 0x5 1 5x2 0x4 1 +14x2 0x3 1 3x2 0x2 1 x0x8 1 6x0x7 1 + 12x0x6 1+13x0x5 1+3x0x4 1+x0 10x3 1+x1 f1 = x0 15x1 g1 = x0 15x1 f2 = 4x3 0 + 2x2 0x1 10x0x2 1 5x3 1 + x2 1 g2 = x2 1 f3 = 3x3 0x3 1 14x2 0x4 1 + 9x2 0x3 1 + 3x2 0x2 1 12x2 0x1 7x0x3 1 6x0x2 1 + 4x0 12x4 1 6x3 1 + 2x1 f1 = 6x3 0 10x2 0 + 4x0x3 1 + 5x2 1 g1 = x0 + 12 f2 = x3 0x3 1 12x2 0x3 1+11x0x6 1+x0+6x5 1+ 12 f3 = 9x5 0 15x4 0 + 6x3 0x3 1 + 13x2 0x2 1 11x2 0x1 + 4x0x2 1 + 3x0x1 + 9x1 14 f1 = 3x3 0 x2 0x1 12x2 0 + x2 1 g1 = x0 + 10x1 4 f2 = 14x3 0x1 15x2 0x2 1+6x2 0x1+x0+15x3 1+ 10x1 4 f1 = 4x2 0+7x0x7 1+4x0x6 1+2x0x5 1+8x0x4 1 x0x3 1 3x0x2 1 11x0x1 + 9x0 g1 = x0 + 10 f2 = 6x4 0x1 5x3 0x8 1 + 6x3 0x7 1 + 3x3 0x6 1 + 12x3 0x5 1 + 14x3 0x4 1 + 11x3 0x3 1 x3 0x2 1 + 14x3 0x1 9x2 0x2 1 + 5x2 0x1 + 13x0x9 1 + 3x0x8 1 14x0x7 1+6x0x6 1+7x0x5 1 10x0x4 1+ 15x0x3 1+3x0x2 1+x5 1+5x4 1 9x3 1 x2 1 14x1 g2 = x5 1 + 5x4 1 9x3 1 x2 1 14x1 f3 = 15x4 0x3 1 12x3 0x4 1 +5x3 0x3 1 7x3 0x2 1 + 11x2 0x9 1 7x2 0x8 1 + 12x2 0x7 1 14x2 0x6 1 6x2 0x5 1 14x2 0x4 1 4x2 0x3 1 8x2 0x2 1 +10x2 0 + 3x0x7 1+15x0x6 1 4x0x5 1 12x0x4 1 x0x3 1+ 8x0x2 1 12x0x1 + 8x0 + 10 f1 = x0 + 14x1 + 5 g1 = x0 + 14x1 + 5 f2 = 11x3 0 8x2 0x3 1 + x2 0x1 + 13x2 0 2x0x2 1 9x0x1 x0 + 7x4 1 + 3x3 1 10x2 1 f3 = x4 0x3 1 7x4 0 8x3 0x2 1 5x3 0x1 4x3 0 3x2 0x4 1 9x2 0x2 1 + 8x2 0 13x0x3 1 3x0x2 1 + 2x0x1 + 9x0 x4 1 14x3 1 4x2 1 + 8x1 Table 16: Success examples from the D 3 (F31) test set. 
f1 = 11x2 0x2 + 8x0x1 x0x5 2 + 6x0x4 2 + 15x0x2 2 + 10x0x2 + 12x2 1x2 2 + 15x2 1x2 15x1x6 2 + 3x1x5 2 10x1x4 2 + 10x1x3 2 8x1x2 2 + 15x1x2 4x1 + x5 2 + 9x3 2 + 6x2 2 12x2 g1 = x0 + 14x4 2 + 9x3 2 + 7x2 + 15 f2 = 5x3 0x1x2 4x3 0x1 2x2 0x2 1+8x2 0x1x5 2 11x2 0x1x4 2 5x2 0x1x3 2 + 4x2 0x1x2 2 15x2 0x1x2 +2x2 0x1 3x0x3 1x2 2 +4x0x3 1x2 4x0x2 1x6 2 + 7x0x2 1x5 2 13x0x2 1x4 2 + 13x0x2 1x3 2 + 2x0x2 1x2 2 + 4x0x2 1x2 + x0x2 1 8x0x1x5 2 10x0x1x3 2 + 14x0x1x2 2 + 3x0x1x2 3x0x2 3x1x5 2 + 4x1x3 2 + 13x1x2 2+5x1x2 11x5 2+4x4 2+10x2 2 14x2 g2 = x1 9x4 2 4x3 2 11 f3 = 12x5 0x1 + 13x4 0x1x4 2 + 15x4 0x1x3 2 9x4 0x1x2 + 7x4 0x1 4x3 0x1x4 2 7x3 0x1x3 2 2x3 0x1x2 + 9x3 0x1 x3 0x3 2 + 9x3 0x2 + 9x2 0x1x5 2 12x2 0x1x3 2 + 11x2 0x1x2 2 15x2 0x1x2 14x2 0x7 2 9x2 0x6 2 + 2x2 0x5 2 + 12x2 0x4 2 15x2 0x3 2 + x2 0x2 2 + 13x2 0x2 + 13x0x2 1x4 2 7x0x2 1x3 2 + 7x0x1x8 2 + 11x0x1x7 2 + 15x0x1x6 2 13x0x1x5 2 + 12x0x1x4 2 + 11x0x1x3 2 13x0x1x2 2 + 7x0x1x2 + 14x0x7 2 x0x5 2 + 9x0x4 2 13x0x3 2 + 14x0x2 2 x0x2 + x1 9x4 2 4x3 2 11 g3 = x5 2 + 9x3 2 + 6x2 2 12x2 f4 = 14x3 0x3 1x2 + 10x2 0x3 1x5 2 + 2x2 0x3 1x4 2 + 5x2 0x3 1x2 2 7x2 0x3 1x2 3x2 0x2 1x2 2 5x0x3 1x2 11x0x2 1x6 2 + 4x0x2 1x5 2 + 10x0x2 1x3 2 + 12x0x2 1x2 2 + x0 + 8x4 1x3 2 + 10x4 1x2 2 10x3 1x7 2 3x3 1x6 2 + 14x3 1x5 2 + 3x3 1x4 2 + 6x3 1x3 2 + 8x3 1x2 2 + 13x3 1x2 + x3 1 + 3x2 1x6 2 14x2 1x4 2 4x2 1x3 2 + 10x2 1x2 2 7x2 1x2 11x2 1 + 14x4 2 + 9x3 2 + 7x2 + 15 f1 = 7x0x1 8x0 6x2 1x2 g1 = x0 + 10 f2 = 5x2 0x1 12x2 0 + 12x0x2 1x2 2 9x0x2 1x2 + 4x0x1x2 2 + x0 + 3x3 1x3 2 + 10 f3 = 12x0x1x2 2 x0x1x2 4x0x2 2 3x2 1x3 2 10x1x2 + x1 10 f4 = 15x2 0x1 + 10x0x2 1x2 + 12x0x2 1 + 11x0x1x2 2 9x0x1x2 + 5x0x1 13x3 1x2 2 3x2 1x2 2 4x2 1 15x1x2 2 + 11x1x2 + x2 f1 = 0 g1 = x0 + 12x2 2 + 13x2 2 f2 = 12x0x1x2 2 14x0x1x2 + 11x1x4 2 12x1x3 2 + 11x1x2 2 3x1x2 g2 = x1 + 5x2 2 + 14x2 + 9 f3 = 9x2 0x1x3 2 5x2 0x1x2 2 + 15x0x1x5 2 + 9x0x1x4 2 + 15x0x1x3 2 + 10x0x1x2 2 + x0 + 12x2 2 + 13x2 2 f4 = 12x3 0x1 2x3 0x2 14x2 0x2 1x2 2 + 6x2 0x2 1x2 11x2 0x1x2 2 + x2 0x1x2 + 7x2 0x1 + 7x2 0x3 2 + 5x2 0x2 2 + 4x2 0x2 + 13x0x2 1x4 2 + 14x0x2 1x3 2 + 13x0x2 1x2 2 12x0x2 1x2 + x3 2 f5 = 8x2 0x2 1x3 2+x2 0x2 1x2 2+7x2 0x2 1 10x2 0x3 2 3x0x2 1x5 2 8x0x2 1x4 2 3x0x2 1x3 2 11x0x2 1x2 2 2x0x2 1x2 14x0x2 1 13x0x1x3 2 + 10x0x1x2 2 + 11x0x3 2 + x1x5 2 + 13x1x4 2 + x1x3 2 + 11x1x2 2 + x1 + 8x5 2 12x4 2 + 9x3 2 + 5x2 2 + 14x2 + 9 Table 17: Success examples from the D 4 (F31) test set. 
f1 = 12x0x3 g1 = x0 + 8 f2 = 5x3 0x3 + 3x0x3 + 6x2 1x2 3 x1x2 3 + x2 7x3 + 7 g2 = x1 + 5 f3 = 4x2 0x2x3 5x0x1x2 2x3 +7x0x2 2x3 10x3 1x2 2x2 3 5x2 1x2 2x2 3 12x1x3 2+8x1x2 2x2 3 9x1x2 2x3+9x1x2 2+x1 8x3 2 6x2 2x3+6x2 2+ 5 g3 = x2 + 7 f4 = 15x3 0 4x2 0+5x0x2 1x2x3 14x0x1x3 10x0x2x2 3 + 10x4 1x2x2 3 12x3 1x2x2 3 + 12x2 1x2 2 + 9x2 1x2x3 9x2 1x2 2x1x2 + 12x1x3 10x2 + x3 f5 = 6x5 0x1 14x4 0x1 + 13x3 0x2 1x3 + 11x2 0x2 1x3 12x2 0x1x3 + 13x0x1x2 3 + 5x0x2 2x2 3 2x0x2 3 + x0 + 8x4 1 5x3 1x3 3 8x3 1x3 +9x3 1 +2x2 1x3 3 9x2 1x3 6x1x2x3 + 11x1x3 3 + 11x1x2 3 11x1x3 11x2x3 + 15x2 3 15x3 + 8 f1 = 4x0x1x3+11x0x4 3+3x0x3 3+10x0x2 3+ 10x0x3 + x4 3 g1 = x0 + 9x3 3 15 f2 = 4x2 0x2 1x3 11x2 0x1x4 3 3x2 0x1x3 3 10x2 0x1x2 3 10x2 0x1x3 x0x1x4 3 + x0 + 9x3 3 15 g2 = x1 5x3 3 7x2 3 13x3 13 f3 = 11x3 0x1x3 7x3 0x4 3 + 15x3 0x3 3 12x3 0x2 3 12x3 0x3 9x2 0x1x2 2x3 + 11x2 0x1x2 + 14x2 0x2 2x4 3 + x2 0x2 2x3 3 7x2 0x2 2x2 3 7x2 0x2 2x3+5x2 0x4 3+6x0x1x2x3 3 10x0x1x2 10x0x2 2x4 3 13x0x2 2x3 + x1 + 7x2 2x4 3 + 9x2 2x3 5x3 3 7x2 3 13x3 13 g3 = x2 8x3 3 + 3x3 9 f4 = 4x2 0x1x2 11x2 0x2x3 3 3x2 0x2x2 3 10x2 0x2x3 10x2 0x2 +6x0x3 1 14x0x2 1x2 3 + 8x0x1x5 3+5x0x1x4 3 4x0x1x3 3 4x0x1x2 3 8x3 1x3 3 + 3x3 1 + 12x1x5 3 f5 = 2x3 0x3 13x2 0x4 3+x2 0x3+5x0x1x2x3 3 15x0x1x4 3 + 6x0x2x6 3 4x0x2x5 3 3x0x2x4 3 3x0x2x3 3 + 13x0x7 3 + 12x0x6 3 + 9x0x5 3 + 9x0x4 3 + 8x3 1x2 9x2 1x2x3 3 + 6x2 1x2x2 3 11x2 1x2x3 11x2 1x2 + 9x2x6 3 + x2 + 4x7 3 8x3 3 + 3x3 9 f6 = x2 0x2 1x2x3+5x2 0x1x2 2x3 5x2 0x1x2x4 3 7x2 0x1x2x3 3 13x2 0x1x2x2 3 13x2 0x1x2x3 + 6x2 0x2 2x4 3 4x2 0x2 2x3 3 3x2 0x2 2x2 3 3x2 0x2 2x3 12x2 0x2 2 + 3x2 0x2x3 3 5x2 0x2x3 + 15x2 0x2 + 5x0x1x3 2x3 + 10x0x1x2 + 13x0x1x3 3 13x0x1x2 3 15x0x1x3+3x0x1+6x0x3 2x4 3 4x0x3 2x3 3 3x0x3 2x2 3 3x0x3 2x3+9x0x2 2x4 3 7x0x2 2 + 3x0x5 3 2x0x4 3 + 14x0x3 3 + 14x0x2 3 +11x3 1x2 +5x3 1x3 3 +2x3 1x3 6x3 1 15x1x2 2x3 2x1x4 3 + 15x1x3 3 6x1x2 3 7x1x3 12x3 2+13x2 2x4 3+14x2 2x3 3+9x2 2x2 3+ 4x2 2x3 4x2 2 + 6x6 3 13x5 3 15x4 3 + 7x3 3 15x2 3 Table 18: Success examples from the D 5 (F31) test set. 
f1 = 13x0x3 1 + 4x3 1x4 + 9x3 1 + x2 4 g1 = x0 14x4 + 15 f2 = 12x0x3 1x3 2 +5x0x3 1x2 3x4 13x3 1x3 2x4 6x3 1x3 2 8x3 1x2 3x2 4 + 13x3 1x2 3x4 11x3 2x2 4 2x2 3x3 4 + x3 + 8x4 4 g2 = x1 + 8x4 f3 = 6x0x3 1x2 11x0x2x2 3+5x0x2x3x4+ 13x0x2x3 9x3 1x2x4+3x3 1x2+11x1x2x2 3 5x1x2x3x4 13x1x2x3+x1 10x2x2 4+8x4 g3 = x2 14x4 f4 = 6x2 0x3 1x3x4 9x0x3 1x3x2 4 + 3x0x3 1x3x4 + 15x0x1x3 + 6x0x2 3 10x0x3x3 4 + 13x0x3x4 + 7x0x3 + x0 x1x2 + 5x1x2 3 + 9x1x3x4 + 11x1x3 8x2x4 14x4 + 15 g4 = x3 + 8x4 4 f5 = 10x3 0x2 7x2 0x3 1x3x4 + 15x2 0x2x4 5x2 0x2 3x0x3 1x2x4 + 5x0x3 1x3x2 4 12x0x3 1x3x4+x0x2x2 3+9x0x3x3 4+7x0x3 6x0x4 + 3x0 + 11x3 1x2x2 4 14x3 1x2x4 3x2 1 + 7x1x4 + 12x3 2x3 + 3x3 2x4 + 14x3 2 14x2x2 3x4 + 15x2x2 3 5x2x3 4 + x2 14x4 f1 = 7x2 1x3 12x2 1 12x1x2 3 6x1x3 3x3 2x4 + 8x2 2x4 f2 = 13x3 1x2x2 3 9x3 1x2x3 9x2 1x2x3 3 + 11x2 1x2x2 3 + 8x2 1x2x3x2 4 + 4x2 1x2x2 4 10x1x4 2x3x4 + 6x1x3 2x3x4 + 4x1x2x2 3x2 4 + 2x1x2x3x2 4 + x1 + x4 2x3 4 13x3 2x3 4 + 11 g2 = x1 + 11 f3 = 4x2 0x2 1 13x2 0x1 + 8x2 1x4 3 + 4x2 1x3 3 + 4x1x5 3 + 2x1x4 3 + 12x1x3x2 4 + x3 2x3 3x4 13x2 2x3 3x4 + 8x3x2 4 + x3 15 f4 = 10x0x2 1 +5x0x2 3 13x0x3 +7x2 1x3 3 + 2x2 1x2 3x2 4 12x2 1x2 3 + x2 1x3x2 4 4x2 1 12x1x4 3 + x1x3 3x2 4 6x1x3 3 15x1x2 3x2 4 + 14x1x2 3 12x1x3x4 6x1x4 3x3 2x2 3x4 + 8x3 2x3x3 4+8x2 2x2 3x4 11x2 2x3x3 4+x2 x2 3 13 f5 = 3x0x2 1x2x4 + 5x0x2 1x3 4x0x2 1x3 4 13x0x2 1+7x0x2x2 3 12x0x2x3+8x4 1x2x3+ 4x4 1x2 + 4x3 1x2x2 3 + 2x3 1x2x3 + x2 1x4 2x4 13x2 1x3 2x4 5x2 1x2x4 14x2 1x3 4 13x1x3 4 + 9x2 2x4 12x2x3 4 + 7x2x4 + 13x3 4 + x4 f6 = 11x2 0x1x3 3x2 0x3 + 8x0x5 1 + 10x0x2 1x4 10x0x1x4 3x5 1 7x3 1x2 2x3 1 + 7x2 1x3 2x3 12x2 1x3 2 12x1x3 2x2 3 6x1x3 2x3 +15x1x2x3x4 3x6 2x4 +8x5 2x4 + x2 2x3x4 15x2 2x4 f7 = 10x2 0x1x2x3 14x2 0x2x3 2x0x3 1x3x4 + 2x0x2 1x2 3 + x0x2 1x3 + 3x0x1x2x4 + x0x1x3 3 15x0x1x2 3 11x0x1x4 + 8x0x3 2x3x4 11x0x2 2x3x4 + x0 +11x4 1x2 3 10x4 1x3 10x3 1x3 3 5x3 1x2 3 7x3 1x3x4 + 13x2 1x3 2x3x4 14x2 1x2 2x3x4 + 14x2 1x3x4 + 11x1x2 2x3x4 + 2x1x2 2x3 + x1x2 2 6x1x2x3x4+13x1x2x2 4 9x1x2x4+ 15x1x3x4 12x2x2 4 12 Table 19: Failure examples from the D n (Q) test sets for n = 2, 3, 4, 5 (ground truth v.s. prediction). 
ID G (Ground Truth) G (Transformer) 15 g1 = x0 1/4x4 1 5/4x3 1 + 2/3x1 5/4 g 1 = x0 + 1/4x4 1 + 2/3x1 5/4 g2 = x5 1 2/3x4 1 4/3x3 1 + 2 g 2 = x5 1 2/3x4 1 x3 1 x2 1 + 2 26 g1 = x0 1/2x4 1 1/5x3 1 1/2x2 1 x1+2/3 g 1 = x0 1/2x4 1 1/5x3 1+1/2x2 1 x1+2/3 g2 = x5 1 3/5x1 4/3 g 2 = x5 1 1/2x4 1 3/5x1 4/3 36 g1 = x0 2x1 g 1 = x0 2x1 g2 = x5 1 + 5/2x4 1 1/4x3 1 + 1/2x2 1 + 1/3 g 2 = x5 1 + 5/2x4 1 5/4x3 1 + 1/2x2 1 + 1/3 g1 = x0 + 4x4 2 1/5x2 2 + 5/4x2 + 1/2 g 1 = x0 + 4x4 2 1/5x2 2 + 5/4x2 + 1/2 g2 = x1 5/2x4 2+1/5x3 2 2/3x2 2 1/3x2+ 3/2 g 2 = x1 5/2x4 2+1/5x3 2 2/3x2 2 1/3x2+ 3/2 g3 = x5 2 1/4x3 2 3/2x2 2 1/5x2 5/3 g 3 = x5 2 2/3x3 2 3/2x2 2 1/5x2 5/3 g1 = x0 + 2/3x4 2 + x3 2 + 4x2 2 5/3x2 + 4 g 1 = x0 + 2/3x4 2 + x3 2 + 4x2 2 5/3x2 + 4 g2 = x1 + 2x3 2 + 5/3x2 + 2 g 2 = x1 + 5/3x2 + 2 g3 = x5 2 + 4x4 2 x3 2 + x2 2 g 3 = x5 2 + 4x4 2 x3 2 + x2 2 g1 = x0 + 2x2 g 1 = x0 + 2x2 g2 = x1 1/5 g 2 = x1 1/5 g3 = x3 2 2x2 2 + 1/5x2 + 1 g 3 = x3 2 + 1/5x2 + 1 g1 = x0 2/3x3 3 g 1 = x0 2/3x3 3 g2 = x1 4x2 3 + 1/4x3 g 2 = x1 + 1/4x3 g3 = x2 x4 3 + x2 3 3/5 g 3 = x2 x4 3 + x2 3 3/5 g4 = x5 3 5x4 3 + 4x2 3 + 4x3 g 4 = x5 3 5x4 3 + 4x2 3 + 4x3 g1 = x0 3 g 1 = x0 3 g2 = x1 1/5x2 3 g 2 = x1 1/5x2 3 g3 = x2 + 2/5x4 3 + 5/2x2 3 2/3x3 3/5 g 3 = x2 + 2/5x4 3 + 5/2x2 3 2/3x3 3/5 g4 = x5 3 4/5x4 3 x3 1 g 4 = x5 3 4/5x4 3 3x3 3 x3 1 g1 = x0 + 1/4x2 3 2 g 1 = x0 + 1/4x2 3 2 g2 = x1 + 2x3 3 g 2 = x1 + 2x3 3 g3 = x2 2x3 3 + 3/2 g 3 = x2 2x3 3 + 3/2 g4 = x5 3 x4 3 5x3 3 1/2 g 4 = x5 3 x4 3 5x3 3 1/4x2 3 + 3/2 g1 = x0 + 5x4 4 3/5x3 4 + 4x2 4 + 2 g 1 = x0 + 5x4 4 3/5x3 4 + 4x2 4 + 2 g2 = x1 2/5x4 4+1/3x3 4+x2 4 1/2x4+3/5 g 2 = x1 2/5x4 4+1/3x3 4+x2 4 1/2x4+3/5 g3 = x2 3/5x4 g 3 = x2 2/5x4 g4 = x3 + 4/5x4 4 + x3 4 3/5x4 g 4 = x3 + 4/5x4 4 + x3 4 3/5x4 g5 = x5 4 5x4 4 + 3/2x3 4 x4 5 g 5 = x5 4 + 3/2x3 4 x4 5 g1 = x0 x2 4 + 4x4 5/3 g 1 = x0 x2 4 + 4x4 5/3 g2 = x1 x2 4 + 1 g 2 = x1 x2 4 + 1 g3 = x2 1 g 3 = x2 1 g4 = x3 + 2x2 4 + 4/3x4 g 4 = x3 + 2x2 4 + 4/3x4 g5 = x3 4 g 5 = x3 4 + 2/3x2 4 g1 = x0 + x2 4 5 g 1 = x0 + x2 4 5 g2 = x1 + 1/3x4 4 + 4/3x3 4 + 5x4 + 3/4 g 2 = x1 + 1/3x4 4 + 4/3x3 4 + 5x4 + 3/4 g3 = x2 + 1/2x2 4 x4 g 3 = x2 x4 g4 = x3 5/4x4 4 x3 4 + x2 4 g 4 = x3 5/4x4 4 x3 4 + x2 4 g5 = x5 4 4/3x3 4 + x4 g 5 = x5 4 4/3x3 4 + x4 Table 20: Failure examples from the D n (F7) test sets for n = 2, 3, 4, 5 (ground truth v.s. prediction). 
ID G (Ground Truth) G (Transformer) 2 g1 = x0 3x2 1 + 1 g 1 = x0 3x2 1 + 1 g2 = x3 1 2 g 2 = x3 1 2x2 1 2 6 g1 = x0 x4 1 + 3x3 1 + 3x1 g 1 = x0 x4 1 + 3x3 1 + 3x1 g2 = x5 1 2x4 1 x3 1 g 2 = x5 1 x4 1 x3 1 8 g1 = x0 + 3x4 1 + 2x3 1 + 2x1 g 1 = x0 + 3x4 1 + 2x3 1 + 2x1 g2 = x5 1 3x3 1 + 2x2 1 g 2 = x5 1 3x3 1 x2 1 g1 = x0 x2 2 g 1 = x0 x2 2 g2 = x1 1 g 2 = x1 1 g3 = x5 2 x4 2 + 2x3 2 3x2 g 3 = x5 2 x4 2 + 2x3 2 x2 2 g1 = x0 + 1 g 1 = x0 + 1 g2 = x1 3x2 2 2 g 2 = x1 3x2 2 + x2 + 3 g3 = x3 2 1 g 3 = x3 2 + 1 g1 = x0 3x4 2 2x3 2 x2 2 + 1 g 1 = x0 3x4 2 2x3 2 x2 2 + 1 g2 = x1 x4 2 + 3x2 2 g 2 = x1 x4 2 + 3x2 2 g3 = x5 2 3x4 2 + 3x3 2 + x2 2 g 3 = x5 2 3x4 2 + x2 2 g1 = x0 + 3x3 3 3x2 3 g 1 = x0 + 3x3 3 3x2 3 g2 = x1 + 2x3 3 + x2 3 3x3 g 2 = x1 + 2x3 3 + x2 3 3x3 g3 = x2 + 2x3 3 2 g 3 = x2 + 2x3 3 2 g4 = x4 3 3x3 g 4 = x4 3 x3 g1 = x0 3 g 1 = x0 3 g2 = x1 + 3x4 3 + x3 3 3x2 3 + 3x3 3 g 2 = x1 + 3x4 3 + x3 3 + 3x2 3 + 3x3 3 g3 = x2 + 3x4 3 2 g 3 = x2 + 3x4 3 2 g4 = x5 3 + 2x3 3 3x2 3 2x3 g 4 = x5 3 + 2x3 3 3x2 3 2x3 g1 = x0 + 1 g 1 = x0 1 g2 = x1 + 1 g 2 = x1 + 1 g3 = x2 3 g 3 = x2 3 g4 = x3 3 g 4 = x3 3 g1 = x0 3x4 1 g 1 = x0 3x4 1 g2 = x1 + 1 g 2 = x1 + 1 g3 = x2 3 g 3 = x2 + 2 g4 = x3 + 3 g 4 = x3 + 3 g5 = x4 4 g 5 = x4 4 g1 = x0 3x4 g 1 = x0 x4 g2 = x1 + 3x2 4 2x4 + 3 g 2 = x1 + 3x2 4 2x4 + 3 g3 = x2 3x2 4 3x4 g 3 = x2 3x2 4 3x4 g4 = x3 2 g 4 = x3 2 g5 = x3 4 3x4 g 5 = x3 4 3x4 g1 = x0 + 2 g 1 = x0 + 2 g2 = x1 + 2 g 2 = x1 + 2 g3 = x2 + x3 4 2x2 4 + 3x4 2 g 3 = x2 + x3 4 2x2 4 + 3x4 2 g4 = x3 3 g 4 = x3 3 g5 = x4 4 x4 g 5 = x4 4 + x4 Table 21: Failure examples from the Dn(F31) test sets for n = 2, 3, 4, 5 (ground truth v.s. prediction). ID G (Ground Truth) G (Transformer) 0 g1 = x0 14x4 1 + 9x3 1 + 12x2 1 + 14x1 + 15 g 1 = x0 14x4 1 + 9x3 1 + 12x2 1 + 14x1 + 15 g2 = x5 1 + 15x4 1 5x3 1 7x2 1 g 2 = x5 1 11x4 1 11x3 1 1 g1 = x0 9 g 1 = x0 9 g2 = x1 g 2 = x1 7 3 g1 = x0 8x2 1 + 8x1 g 1 = x0 8x2 1 + 8x1 g2 = x5 1 14x3 1 + 10x2 1 + 2x1 + 5 g 2 = x5 1 3x4 1 + 10x2 1 + 2x1 + 5 g1 = x0 + 5 g 1 = x0 + 5 g2 = x1 8x4 2 + 15x2 2 5x2 + 11 g 2 = x1 8x4 2 12x2 2 8x2 13 g3 = x5 2 2x2 2 + 15x2 + 4 g 3 = x5 2 2x2 2 + 15x2 + 4 g1 = x0 14x4 2 + 13x3 2 10x2 2 6x2 10 g 1 = x0 2x4 2 + x3 2 10x2 2 6x2 10 g2 = x1 14x4 2 4x3 2 5x2 2 + 8x2 + 3 g 2 = x1 14x4 2 4x3 2 5x2 2 + 8x2 + 3 g3 = x5 2 + 7x4 2 + 7x2 2 8 g 3 = x5 2 2x4 2 2x3 2 2x2 2 2 g1 = x0 + 15x4 2 + 12x3 2 13x2 2 10x2 7 g 1 = x0 + 15x4 2 + 12x3 2 13x2 2 10x2 7 g2 = x1 + 9x2 2 g 2 = x1 + 9x2 2 g3 = x5 2 15x3 2 + 5x2 2 g 3 = x5 2 + 14x4 2 + 14x3 2 + 14x2 2 + 6 g1 = x0 + 11x4 3 4x3 3 8x2 3 g 1 = x0 + 11x4 3 4x3 3 8x2 3 g2 = x1 7 g 2 = x1 7 g3 = x2 + 2 g 3 = x2 + 2 g4 = x5 3 + 5x4 3 6x2 3 g 4 = x5 3 + 5x4 3 11x2 3 g1 = x0 6x2 3 + 13x3 g 1 = x0 6x2 3 + 13x3 g2 = x1 +12x4 3 11x3 3 +10x2 3 15x3 +10 g 2 = x1 +12x4 3 11x3 3 +10x2 3 15x3 +10 g3 = x2 8x2 3 15x3 g 3 = x2 8x4 3 8x3 3 8x2 3 15x3 g4 = x5 3 2x4 3 2x2 3 5x3 g 4 = x5 3 2x4 3 2x2 3 5x3 g1 = x0 5x3 + 13 g 1 = x0 5x3 + 13 g2 = x1 + 11x3 3 13 g 2 = x1 + 11x3 3 13 g3 = x2 5x3 9 g 3 = x2 5x3 9 g4 = x5 3 2x4 3 11x3 3 6x3 4 g 4 = x5 3 x4 3 11x3 3 10x3 4 g1 = x0 10x4 4 12x2 4 11x4 13 g 1 = x0 10x4 4 12x2 4 11x4 13 g2 = x1 + 2x4 g 2 = x1 + 2x4 g3 = x2 + 7x4 4 11x3 4 12x2 4 + 7x4 13 g 3 = x2 + 7x4 4 11x3 4 12x2 4 + 7x4 13 g4 = x3 + 5x4 4 x3 4 15x2 4 + 4x4 5 g 4 = x3 + 5x4 4 x3 4 15x2 4 2x4 5 g5 = x5 4 + 13x4 4 + 8 g 5 = x5 4 + 13x4 4 + 8 g1 = x0 + 7x4 4 + 3x3 4 + 15x2 4 + 3x4 + 9 g 1 = x0 2x4 4 + 3x3 4 + 15x2 4 + 3x4 + 9 g2 = x1 + 3x4 4 12x3 4 6x2 4 x4 15 g 2 = x1 + 6x4 4 + 11x3 4 + 12x2 4 5x4 + 11 g3 = x2 + 6 g 3 = x2 + 6 
Table 22: Runtime comparison (in seconds) of forward generation (F.) and the Transformer on timeout examples. Timeout limit: 100 seconds; entries of 100.0 indicate a timeout. For each configuration, up to the first 10 unique samples are shown. The Transformer successfully predicted the correct Gröbner bases for these examples.

D_n(Q), n = 4 (σ = 0.3)
Sample ID | F. (STD) | F. (SLIMGB) | F. (STDFGLM) | Transformer
23  | 0.005 | 100.0 | 0.008 | 0.110
49  | 100.0 | 0.018 | 0.008 | 0.253
64  | 0.006 | 100.0 | 0.008 | 0.147
171 | 100.0 | 100.0 | 0.015 | 0.238
341 | 100.0 | 0.011 | 0.008 | 0.212
384 | 100.0 | 100.0 | 0.009 | 0.142
423 | 0.123 | 100.0 | 0.008 | 0.167
530 | 100.0 | 0.005 | 0.010 | 0.197
542 | 0.005 | 100.0 | 0.008 | 0.212
552 | 100.0 | 100.0 | 0.008 | 0.185

D_n(Q), n = 5 (σ = 0.2)
Sample ID | F. (STD) | F. (SLIMGB) | F. (STDFGLM) | Transformer
61  | 0.005 | 100.0 | 0.009 | 0.170
76  | 0.004 | 100.0 | 0.008 | 0.292
130 | 100.0 | 0.014 | 0.008 | 0.227
153 | 100.0 | 19.384 | 0.009 | 0.220
175 | 100.0 | 34.825 | 0.192 | 0.390
208 | 100.0 | 0.013 | 0.008 | 0.223
269 | 100.0 | 0.006 | 0.069 | 0.320
295 | 0.005 | 100.0 | 0.008 | 0.282
320 | 0.005 | 100.0 | 0.008 | 0.309
333 | 0.135 | 100.0 | 0.008 | 0.296

D_n(F7), n = 4 (σ = 0.3)
Sample ID | F. (STD) | F. (SLIMGB) | F. (STDFGLM) | Transformer
407 | 100.0 | 4.788 | 0.013 | 0.189
717 | 100.0 | 0.004 | 0.007 | 0.135
765 | 100.0 | 0.004 | 0.007 | 0.122
915 | 100.0 | 5.832 | 0.009 | 0.136
916 | 100.0 | 0.022 | 0.008 | 0.266

D_n(F7), n = 5 (σ = 0.2)
Sample ID | F. (STD) | F. (SLIMGB) | F. (STDFGLM) | Transformer
74  | 100.0 | 100.0 | 0.045 | 0.314
121 | 100.0 | 100.0 | 0.011 | 0.285
256 | 100.0 | 100.0 | 0.008 | 0.144
267 | 100.0 | 0.005 | 0.008 | 0.284
274 | 100.0 | 0.006 | 0.007 | 0.209
433 | 100.0 | 38.895 | 0.013 | 0.285
438 | 0.432 | 100.0 | 0.011 | 0.270
492 | 100.0 | 0.020 | 0.007 | 0.144
568 | 100.0 | 0.008 | 0.007 | 0.223
766 | 100.0 | 100.0 | 0.009 | 0.144

D_n(F31), n = 4 (σ = 0.3)
Sample ID | F. (STD) | F. (SLIMGB) | F. (STDFGLM) | Transformer
954 | 100.0 | 0.157 | 0.008 | 0.123

D_n(F31), n = 5 (σ = 0.2)
Sample ID | F. (STD) | F. (SLIMGB) | F. (STDFGLM) | Transformer
196 | 100.0 | 0.051 | 0.007 | 0.267
715 | 100.0 | 26.161 | 0.008 | 0.160

Table 23: Dataset profile comparison between Du(k) and D_n(k). The standard deviation is shown in parentheses.
Metric            | Du(Q)        | D_n(Q)       | Du(F7)        | D_n(F7)       | Du(F31)      | D_n(F31)
Size of F         | 2.64 (0.73)  | 2.56 (0.71)  | 3.06 (0.82)   | 3.03 (0.81)   | 3.04 (0.79)  | 3.01 (0.81)
Max degree in F   | 4.97 (1.65)  | 7.44 (1.91)  | 5.46 (1.77)   | 7.91 (2.01)   | 5.56 (1.77)  | 8.18 (1.98)
Min degree in F   | 2.50 (1.50)  | 4.19 (1.91)  | 2.70 (1.61)   | 4.33 (2.04)   | 2.85 (1.63)  | 4.64 (2.06)
# of terms in F   | 10.56 (6.02) | 15.64 (7.69) | 13.46 (7.14)  | 20.02 (9.60)  | 13.73 (7.10) | 20.79 (9.68)
Gröbner ratio (F) | 0 (0)        | 0 (0)        | 0.002 (0.045) | 0.002 (0.045) | 0 (0)        | 0 (0)
Size of G         | 2 (0)        | 2 (0)        | 2 (0)         | 2 (0)         | 2 (0)        | 2 (0)
Max degree in G   | 2.48 (1.32)  | 4.09 (1.31)  | 2.53 (1.33)   | 3.94 (1.30)   | 2.53 (1.38)  | 4.09 (1.30)
Min degree in G   | 1.21 (0.54)  | 2.56 (1.24)  | 1.20 (0.53)   | 2.37 (1.21)   | 1.21 (0.57)  | 2.60 (1.24)
# of terms in G   | 3.69 (1.65)  | 6.65 (2.32)  | 3.72 (1.64)   | 6.31 (2.27)   | 3.74 (1.73)  | 6.70 (2.32)
Gröbner ratio (G) | 1 (0)        | 1 (0)        | 1 (0)         | 1 (0)         | 1 (0)        | 1 (0)

Table 24: Accuracy and support accuracy of Transformers on the out-distribution evaluation set D^+_2(k).
Metric       | D^+_2(Q) | D^+_2(F7) | D^+_2(F31)
accuracy     | 83.8     | 52.9      | 36.1
support acc. | 86.7     | 62.0      | 54.4
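The forward-generation baselines in Table 22 (STD, SLIMGB, STDFGLM) are Gröbner basis engines of the Singular computer algebra system, all reachable from SageMath. The following is a minimal sketch, assuming a SageMath environment, of how a per-algorithm timing with a 100-second cutoff could be set up; the polynomial system is an illustrative 0-dimensional example rather than a dataset sample, and this is not the authors' benchmarking code.

```python
# Minimal SageMath sketch (not the authors' benchmarking code): time the
# Singular backends corresponding to the forward baselines in Table 22,
# with a 100-second cutoff per call.
import time
from sage.all import PolynomialRing, QQ, ideal
from cysignals.alarm import alarm, cancel_alarm
from cysignals.signals import AlarmInterrupt

R = PolynomialRing(QQ, ['x0', 'x1', 'x2'], order='lex')
x0, x1, x2 = R.gens()
# Illustrative 0-dimensional system (STDFGLM requires 0-dimensionality).
F = [x0**2 + x1 - 1, x1**2 + x2, x2**2 - x0 + 2]

for alg in ['singular:std', 'singular:slimgb', 'singular:stdfglm']:
    I = ideal(F)             # fresh ideal: avoids Sage's Groebner basis cache
    start = time.time()
    try:
        alarm(100)           # raises AlarmInterrupt after 100 seconds
        I.groebner_basis(algorithm=alg)
        cancel_alarm()
        print(alg, round(time.time() - start, 3), 'seconds')
    except AlarmInterrupt:
        print(alg, 'timeout (> 100 seconds)')
```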
Table 25: Katsura-n on Q[x_1, . . . , x_n]. Katsura-5 is not presented because of its complexity.
[For each n = 2, 3, 4, the table lists the Katsura-n input system F and its lex Gröbner basis G. For instance, for n = 4, F = (x_0 + 2x_1 + 2x_2 + 2x_3 - 1, x_0^2 - x_0 + 2x_1^2 + 2x_2^2 + 2x_3^2, 2x_0x_1 + 2x_1x_2 - x_1 + 2x_2x_3, 2x_0x_2 + x_1^2 + 2x_1x_3 - x_2). While G remains simple for n = 2, its coefficients grow rapidly with n: for n = 4 they are ratios such as 53230079232/1971025, and the last generator is a univariate polynomial of degree 8 in x_3.]

NeurIPS Paper Checklist
1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The abstract and introduction present the scope of the study and its contributions.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We clarify the problem setting in this paper and potential future work in Section 4.1. Besides, several open questions are discussed in detail in Appendix H.
Guidelines:
- The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All the assumptions and complete proofs are provided in Sections 4 and C.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in the appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We provide a detailed description of the experimental setups.
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
  (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: The code to reproduce our experiments is released.
Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: All the training and test details needed to follow the results are provided in Sections 6 and E.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in the appendix, or as supplemental material.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [No]
Justification: Error bars were not measured due to computational costs. However, we conducted experiments on various setups and provide their details in Sections 6 and F.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed-form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: The information is given at the beginning of Section 6.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: We have confirmed that the present study meets the NeurIPS Code of Ethics.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: This paper focuses on the learnability of a mathematical task, and we do not see any urgent societal impact.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: This paper does not have an evident risk of misuse.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: This paper does not use existing assets.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets.
- Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.