# GloptiNets: Scalable Non-Convex Optimization with Certificates

Gaspard Beugnot (gaspard.beugnot@inria.fr), Inria, École normale supérieure, CNRS, PSL Research University, 75005 Paris, France
Julien Mairal (julien.mairal@inria.fr), Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
Alessandro Rudi (alessandro.rudi@inria.fr), Inria, École normale supérieure, CNRS, PSL Research University, 75005 Paris, France

We present a novel approach to non-convex optimization with certificates, which handles smooth functions on the hypercube or on the torus. Unlike traditional methods that rely on algebraic properties, our algorithm exploits the regularity of the target function intrinsic in the decay of its Fourier spectrum. By defining a tractable family of models, we allow at the same time to obtain precise certificates and to leverage the advanced and powerful computational techniques developed to optimize neural networks. In this way the scalability of our approach is naturally enhanced by parallel computing with GPUs. Our approach, when applied to the case of polynomials of moderate dimensions but with thousands of coefficients, outperforms the state-of-the-art optimization methods with certificates, as the ones based on Lasserre's hierarchy, addressing problems intractable for the competitors.

1 Introduction

Non-convex optimization is a difficult and crucial task. In this paper, we aim at optimizing globally a non-convex function defined on the hypercube, by providing a certificate of optimality on the resulting solution. Let $h$ be a smooth function on $[-1, 1]^d$. Here we provide an algorithm that, given $\hat{x}$, an estimate of the minimizer $x^\star$ of $h$,
$$x^\star = \arg\min_{x \in [-1,1]^d} h(x),$$
produces an $\epsilon_\delta$ that constitutes an explicit certificate for the quality of $\hat{x}$, of the form
$$|h(x^\star) - h(\hat{x})| \leq \epsilon_\delta, \quad \text{with probability } 1 - \delta.$$

The literature abounds with algorithms to optimize non-convex functions. Typically they are either (a) heuristics, very smart, but with no guarantees of global convergence Moscato et al. [1989], Horst and Pardalos [2013]; (b) variations of algorithms used in convex optimization, which can guarantee convergence only to local minima Boyd and Vandenberghe [2004]; (c) algorithms with only asymptotic guarantees of convergence to a global minimum, but no explicit certificates Van Laarhoven et al. [1987]. In general, the methods recalled above are quite fast to produce some solution, but don't provide guarantees on its quality, with the result that the produced point can be arbitrarily far from the optimum, so they are typically used where non-reliable results can be accepted. On the contrary, there are contexts where an explicit quantification of the incurred error is crucial for the task at hand (finance, engineering, scientific validation, safety-critical scenarios Lasserre [2009]). In these cases, more expensive methods that provide certificates are used, such as polynomial sum-of-squares (poly-SoS) Lasserre [2001, 2009]. These kinds of techniques are quite powerful since they provide certificates in the form above, often with machine-precision error.
However, (a) they have reduced applicability, since $h$ must be a multivariate polynomial (possibly sparse, low-degree) and must be known in its analytical form, and (b) the resulting algorithm is a semi-definite program over matrices whose size grows very fast with the number of variables and the degree of the polynomial, becoming intractable already at moderate dimensions and degrees.

Our approach builds on and extends the more recent line of works on kernel sums-of-squares, and in particular the work of Woodworth et al. [2022] based on Fourier analysis. It mitigates the limitations of poly-SoS methods in both aspects: (a) we can deal with any function $h$ (not necessarily a polynomial) for which the Fourier transform is known, and (b) the resulting algorithm leverages the smoothness properties of the objective function as Woodworth et al. [2022], rather than relying on its algebraic structure, leading to far more compact representations than poly-SoS. Contrary to Woodworth et al. [2022], we fully leverage the power of the certificate, allowing for a drastic reduction of the computational cost of the method. Indeed, we cast the minimization in terms of a much smaller problem, similar to the optimization of a small neural network, which, albeit again non-convex, efficiently produces a good candidate on which we then compute the certificate. Notably, our focus lies on a posteriori guarantees: we define a family of models that allow for efficient computation of certificates. Once the model structure is established, we have ample flexibility in training the model, offering various possibilities to achieve good certificates in practical scenarios, while still using well-established and effective techniques from the field of deep neural networks (DNN) Goodfellow et al. [2016] to reduce the computational burden of the approach.

Our contributions can be summarized as follows:
- We propose a new approach to global optimization with certificates which drastically extends the applicability domain allowed by the state of the art, since it can be applied to any function for which we can compute the Fourier transform (not just polynomials).
- The proposed approach is naturally tailored for GPU computations and provides a refined control of the time and memory requirements of the algorithm, contrary to poly-SoS methods (whose complexity scales dramatically and in a rigid way with the dimension and degree of the polynomial).
- From a technical viewpoint, we improve the results in Woodworth et al. [2022], by developing a fast stochastic approach to recover the certificate with high probability (theorem 3), and we generalize the formulation of the problem to allow the use of powerful techniques from DNNs, while still providing a certificate on the result (section 3, in particular alg. 1).
- In practical applications, we are able to provide certificates for functions in moderate dimensions, which surpasses the capabilities of current state-of-the-art techniques. Specifically, as shown in the experiments, we can handle polynomials with thousands of coefficients. This achievement marks an important milestone towards utilizing these models to provide certificates for more challenging real-life problems.

1.1 Previous work

Polynomial SoS. In the field of certificate-based polynomial optimization, Lasserre's hierarchy plays a pivotal role Lasserre [2001, 2009].
This hierarchy employs a sequence of SDP relaxations of increasing size proportional to $O(r^d)$ (where $d$ is the dimension of the space and $r$ is a parameter that upper bounds the degree of the polynomial), which ultimately converges to the optimal solution when $r \to \infty$. While Lasserre's hierarchy is primarily associated with polynomial optimization, its applicability extends beyond this domain. It offers a specific formulation for the more general moment problem, enabling a wide range of applications; see Henrion et al. [2020] for an introduction. For polynomial optimization problems such as in eq. (1), a significant amount of research has been dedicated to leveraging problem structure to improve the scalability of the hierarchy. This research has predominantly focused on exploiting very specific sparsity patterns among the variables of the polynomial, enabling the handling, in these restricted scenarios, of instances ranging from a few variables to even thousands of variables Waki et al. [2006], Wang et al. [2021b,a]. There have been theoretical results regarding optimization on the hypercube Bach and Rudi [2023], Laurent and Slot [2022], but there are no algorithms handling them natively. Furthermore, alternative approaches exist that exploit different types of structure, such as the constant trace property Mai et al. [2022].

Kernel SoS. Kernel sums-of-squares (K-SoS) is an emerging research field that originated from the introduction of a novel parametrization for positive functions in Marteau-Ferey et al. [2020]. This approach has found application in various domains, including optimal control Berthier et al. [2022], optimal transport Muzellec et al. [2021] and modeling probability distributions Rudi and Ciliberto [2021]. In the context of function optimization, two types of theoretical results have been explored: a priori guarantees Rudi et al. [2020] and a posteriori guarantees Woodworth et al. [2022]. A priori guarantees offer insights into the convergence rate towards a global optimum of the function, giving a rate on the number of parameters and the complexity necessary to optimize a function up to a given error. For example, Rudi et al. [2020] proposes a general technique to reach the global optimum with error $\epsilon$ of a function that is $s$-times differentiable, by requiring a number of parameters essentially of the order of $O(\epsilon^{-d/s})$, thus avoiding the curse of dimensionality in the rate when the function is very regular, i.e., $s \gg d$, while typical black-box optimization algorithms have a complexity that scales as $\epsilon^{-d}$. A posteriori guarantees focus on providing a certificate for the minimum found by the algorithm. In particular, Woodworth et al. [2022] provides both a priori guarantees and a posteriori certificates; however, the model considered makes it computationally infeasible to provide certificates in dimension larger than 2. To conclude, approaches based on kernel SoS extend the applicability of global optimization with certificates to a wider family of functions, and exploit finer regularity properties beyond just the number of variables and the degree of a polynomial. By comparison, we focus on making the optimization amenable to high-performance GPU computation while retaining an a posteriori certificate of optimality.
2 Computing certificates with extended k-SoS

Without loss of generality (see the next remark), with the goal of simplifying the analysis and using powerful tools from harmonic analysis, we cast the problem in terms of the minimization of a periodic function $f$ over the torus $[0, 1]^d$ (which we will also denote by $\mathbb{T}^d$). In particular, we are interested in minimizing periodic functions for which we know (or can easily compute) the coefficients of the Fourier representation, i.e.,
$$f^\star = \min_{z \in \mathbb{T}^d} f(z), \qquad f(z) = \sum_{\omega \in \mathbb{Z}^d} \hat{f}_\omega e^{2\pi i \omega \cdot z}, \quad z \in \mathbb{T}^d, \tag{1}$$
where $\mathbb{Z}$ is the set of integers. This setting is already interesting on its own, as it encompasses a large class of smooth functions. It includes notably trigonometric polynomials, i.e., functions which have only a finite number of non-zero Fourier coefficients $\hat{f}_\omega$. Optimization of trigonometric polynomials arises in multiple research areas, such as optimal power flow Van Hentenryck [2018] or quantum mechanics Hilling and Sudbery [2010]. Note that this problem is already NP-hard, as it encompasses for instance the Max-Cut problem Waldspurger et al. [2013]. Even so, we will consider the more general case where we can evaluate function values of $f$, along with its Fourier coefficients $\hat{f}_\omega$, and we have access to its norm in a certain Hilbert space. This norm can be computed numerically for trigonometric polynomials, and more generally reflects the regularity (degree of differentiability) of the function, and thus the difficulty of the problem.

Remark 1 (No loss of generality in working on the torus). Given a (non-periodic) function $h : [-1, 1]^d \to \mathbb{R}$ we can obtain a periodic function whose minimum is exactly $h^\star$ and from which we can recover $x^\star$. Indeed, following the classical Chebyshev construction, define $\cos(2\pi z)$ as the componentwise application of $\cos$ to the elements of $2\pi z$, i.e., $\cos(2\pi z) := (\cos(2\pi z_1), \ldots, \cos(2\pi z_d))$, and define $f$ as $f(z) := h(\cos(2\pi z))$ for $z \in [0, 1]^d$. It is immediate to see that (a) $f$ is periodic, and (b) since $\cos(2\pi z)$ is invertible on $[0, 1/2]^d$ and its image is exactly $[-1, 1]^d$, we have $h^\star = h(x^\star) = f(z^\star)$, where $x^\star = \cos(2\pi z^\star)$ and $z^\star = \arg\min_{z \in \mathbb{T}^d} f(z)$. We discuss an efficient representation of these problems in section 3.3.

2.1 Certificates for global optimization and k-SoS

A general recipe for obtaining certificates was developed in Woodworth et al. [2022], where in particular the following bound was derived [Woodworth et al., 2022, see Thm. 2]:
$$f^\star \geq \sup_{c \in \mathbb{R},\, g \in \mathcal{G}_+} c - \|f - c - g\|_F, \tag{2}$$
where $\|u\|_F$ is the $\ell_1$ norm of the Fourier coefficients of a periodic function $u$, i.e.,
$$\|u\|_F = \sum_{\omega \in \mathbb{Z}^d} |\hat{u}_\omega|, \tag{3}$$
and the sup is taken over a class $\mathcal{G}_+$ of non-negative functions. The paper Woodworth et al. [2022] then chooses $\mathcal{G}_+$ to be the set of positive semidefinite models, leading to a possibly expensive convex SDP problem. Our approach instead starts from the following two observations: (a) the lower bound in eq. (2) holds for any set $\mathcal{G}_+$ of non-negative functions, not necessarily convex; moreover, (b) any candidate solution $(g, c)$ of the supremum in eq. (2) constitutes a lower bound on $f^\star$, so there is no need to solve eq. (2) exactly. This yields the following theorem.

Theorem 1. Given a point $\hat{x} \in \mathbb{T}^d$ and a non-negative, periodic function $g_0 : \mathbb{T}^d \to \mathbb{R}_+$, we have
$$|f(\hat{x}) - f(x^\star)| \leq \|f - f(\hat{x}) - g_0\|_F. \tag{4}$$

Proof. Since $x^\star$ is the minimizer of $f$, we have $f(x^\star) \leq f(\hat{x})$. Moreover, since $c_0 = f(\hat{x})$ and $g_0$ are feasible for the r.h.s. of eq. (2), we have
$$f(\hat{x}) \geq f(x^\star) \geq \sup_{c \in \mathbb{R},\, g \in \mathcal{G}_+} c - \|f - c - g\|_F \geq c_0 - \|f - c_0 - g_0\|_F,$$
from which we derive that $0 \leq f(\hat{x}) - f(x^\star) \leq \|f - f(\hat{x}) - g_0\|_F$.
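To make theorem 1 concrete, the following minimal Python sketch computes the certificate $\|f - f(\hat{x}) - g_0\|_F$ for a small trigonometric polynomial and a hand-picked non-negative candidate $g_0$. The coefficients, the candidate point and the helper names are toy illustrations added by the editor, not taken from the paper's code.

```python
# Illustrative sketch of theorem 1: given the Fourier coefficients of a trigonometric
# polynomial f and of a non-negative candidate g0, the quantity ||f - f(x_hat) - g0||_F
# bounds |f(x_hat) - f(x*)|. All coefficients and the point x_hat below are toy data.
import numpy as np

def f_norm_of_difference(f_coeffs, g_coeffs, c):
    """Sum over omega of |f_hat_omega - c*1{omega=0} - g_hat_omega|."""
    support = set(f_coeffs) | set(g_coeffs)
    total = 0.0
    for omega in support:
        u = f_coeffs.get(omega, 0.0) - g_coeffs.get(omega, 0.0)
        if omega == tuple(0 for _ in omega):
            u -= c
        total += abs(u)
    return total

def eval_trig_poly(coeffs, x):
    """Evaluate f(x) = sum_omega f_hat_omega * exp(2*pi*i*<omega, x>); real part for real f."""
    return sum(c * np.exp(2j * np.pi * np.dot(omega, x)) for omega, c in coeffs.items()).real

# Toy example in d = 2. g0(x) = 0.6 + 0.4*cos(2*pi*x1) >= 0.2, so it is non-negative.
f_coeffs = {(0, 0): 0.5, (1, 0): 0.2, (-1, 0): 0.2, (0, 2): 0.1, (0, -2): 0.1}
g_coeffs = {(0, 0): 0.6, (1, 0): 0.2, (-1, 0): 0.2}
x_hat = np.array([0.48, 0.27])   # candidate minimizer (toy value)
eps = f_norm_of_difference(f_coeffs, g_coeffs, eval_trig_poly(f_coeffs, x_hat))
print(f"certificate: |f(x_hat) - f*| <= {eps:.4f}")
```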
In particular, since any good candidate $g_0$ is enough to produce a certificate, we consider the following class of non-negative functions, which can be seen as a two-layer neural network.

Definition 1 (extended k-SoS model on the torus). Let $K : \mathbb{T}^d \times \mathbb{T}^d \to \mathbb{R}$ be a function which is periodic in its first variable and let $m, r \in \mathbb{N}$. Given a set of anchors $Z = (z_1, \ldots, z_m) \in \mathbb{T}^{d \times m}$ and a matrix $R \in \mathbb{R}^{r \times m}$, we define the k-SoS model $g$ with
$$g(x) = \|R K_Z(x)\|_2^2, \qquad K_Z(x) = (K(x, z_1), \ldots, K(x, z_m)) \in \mathbb{R}^m, \quad x \in \mathbb{T}^d. \tag{5}$$

The functions represented by the model above are non-negative and periodic. The model is an extension of the k-SoS model presented in Marteau-Ferey et al. [2020], where the points $(z_1, \ldots, z_m)$ cannot be optimized. Moreover, it has the following benefits, at the expense of convexity in the parameters:
1. The extended k-SoS models benefit from the good approximation properties of k-SoS models described in Marteau-Ferey et al. [2020] and especially Rudi and Ciliberto [2021], since they are a super-set of the k-SoS models, which have optimal approximation properties for non-negative functions.
2. The extended model can have a reduced number of parameters, by choosing a matrix $R$ with $r = 1$ or $r \ll m$. This drastically improves the cost of the optimization, while not impacting the approximation properties of the model, since a good approximation is still possible with $r$ already proportional to $d$ [Rudi et al., 2020, see Thm. 3].
3. The extended model does not require any positive semidefinite constraint on the matrix (contrary to the base model), which is a well-known bottleneck when scaling up the optimization in the number of parameters Marteau-Ferey et al. [2020].

In the extended model we trade the positive semidefinite constraint for non-convexity. However, this allows us to use all the advanced and effective techniques known for unconstrained (or box-constrained) non-convex optimization of (two-layer) neural networks Goodfellow et al. [2016]. To conclude the picture on k-SoS models, a critical aspect is the choice of $K$: it must guarantee good approximation properties and, at the same time, we need to compute its Fourier coefficients easily, since we need to evaluate $\|f - c - g\|_F$. To this aim, good candidates for $K$ are the reproducing kernels defined on the torus Steinwart and Christmann [2008]. We use shift-invariant kernels, enabling a convenient analysis of the associated RKHS through their Fourier transform.

Definition 2 (Reproducing kernel on the torus). Let $q$ be a real function on $\mathbb{T}^d$, with positive Fourier transform and $q(0) = 1$. Let $K$ be the kernel defined with
$$K(x, y) = q(x - y) = \sum_{\omega \in \mathbb{Z}^d} \hat{q}_\omega e^{2\pi i \omega \cdot (x - y)}, \qquad x, y \in \mathbb{T}^d. \tag{6}$$
Then $K$ is a reproducing kernel bounded by 1. We denote by $\mathcal{H}$ its reproducing kernel Hilbert space (RKHS) and by $\|f\|_{\mathcal{H}}^2 = \sum_{\omega \in \mathbb{Z}^d} |\hat{f}_\omega|^2 / \hat{q}_\omega$ the associated squared RKHS norm. Define $\lambda(x) = q(x)^2$. We assume that we can compute (and sample from, see next section) $\hat{\lambda}_\omega$, i.e., the Fourier transform of $\lambda$, corresponding to $(\hat{q} * \hat{q})_\omega$, for all $\omega \in \mathbb{Z}^d$.

By choosing such a $K$, the models inherit the good approximation properties derived in Marteau-Ferey et al. [2020], Rudi and Ciliberto [2021]. We recall that shift-invariant reproducing kernels have a positive Fourier transform due to Bochner's theorem Rudin [1990]. The fact that $K$ is bounded by 1 can be seen from $|K(x, x)| = |q(0)| = \sum_\omega \hat{q}_\omega = 1$. Finally, note that the Fourier coefficients of an extended k-SoS model can be computed exactly, as shown later in lemma 1.
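As an illustration of definition 1, the short sketch below evaluates an extended k-SoS model with a shift-invariant kernel of the form $q_s(x) = e^{s(\cos(2\pi x) - 1)}$ (the kernel used later in eq. (14)). The shapes, names and values are assumptions made for the illustration, not the released implementation.

```python
# Minimal sketch of the extended k-SoS model of definition 1: g(x) = ||R K_Z(x)||_2^2,
# with a Bessel-type shift-invariant kernel. R has shape (r, m), Z stacks m anchors in [0,1)^d.
import numpy as np

def kernel(x, Z, s):
    """Product over dimensions of q_{s_l}(x_l - z_l); x: (d,), Z: (m, d), s: (d,). Returns (m,)."""
    return np.exp(np.sum(s * (np.cos(2 * np.pi * (x - Z)) - 1.0), axis=-1))

def ksos_eval(x, Z, R, s):
    """Non-negative by construction: squared Euclidean norm of R @ K_Z(x)."""
    return float(np.sum((R @ kernel(x, Z, s)) ** 2))

rng = np.random.default_rng(0)
d, m, r = 3, 32, 4
Z = rng.random((m, d))                          # anchors on the torus
R = rng.standard_normal((r, m)) / np.sqrt(m)    # weight matrix
s = 2.0 * np.ones(d)
print(ksos_eval(rng.random(d), Z, R, s))        # always >= 0
```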
2.2 Providing certificates with the F-norm

[Figure 1: $f - f^\star$ is a trigonometric polynomial approximated by an extended k-SoS model $g$. The $L^\infty$ norm of the difference (blue) is upper-bounded by the F-norm (red), which is itself upper bounded by the MoM inequality in theorem 3 (our bound), with probability 98%, shown here with respect to the number $N$ of sampled frequencies. The shaded area shows min/max values across 10 estimations.]

As discussed in the previous section, our approach for providing a certificate on $f^\star$ relies on first obtaining $\hat{x}$ using a fast algorithm without guarantees, and then solving eq. (2) approximately to obtain the certificate (see theorem 1). With this aim, we now need an efficient way to compute the norm $\|\cdot\|_F$. We use here a stochastic approach. Introducing a probability distribution $(\hat{\lambda}_\omega)_\omega$ on $\mathbb{Z}^d$ (which will later be chosen as a rescaled version of $\hat{\lambda}_\omega$ in definition 2), we rewrite the F-norm as
$$\|u\|_F = \sum_{\omega \in \mathbb{Z}^d} \hat{\lambda}_\omega \frac{|\hat{u}_\omega|}{\hat{\lambda}_\omega} = \mathbb{E}_{\omega \sim \hat{\lambda}}\left[\frac{|\hat{u}_\omega|}{\hat{\lambda}_\omega}\right], \tag{7}$$
which yields an objective that is amenable to stochastic optimization. From there, Woodworth et al. [2022] computes a certificate by truncating the sum to a hypercube $\{\omega : \|\omega\|_\infty \leq N\}$ of size $N^d$ and bounding the remaining terms with a smoothness assumption on $u = f - c - g$, which enables control of the decay of $\hat{u}_\omega$. We want to avoid this cost, exponential in the dimension, so we proceed differently.

Probabilistic estimates with the $\mathcal{H}$ norm. Given that the F-norm can be written as an expectation in eq. (7), we approximate it with an empirical mean $\hat{S}$ computed from $N$ i.i.d. samples $(\omega_i)_{1 \leq i \leq N} \sim \hat{\lambda}_\omega$. Now, note that the variance of $\hat{S}$ can be upper bounded by a Hilbert norm, since $\mathbb{E}_{\omega \sim \hat{\lambda}}\big[|\hat{u}_\omega|^2/\hat{\lambda}_\omega^2\big] = \sum_{\omega} |\hat{u}_\omega|^2/\hat{\lambda}_\omega$, so that
$$\mathrm{Var}\,\hat{S} \leq \frac{1}{N}\,\|u\|_{\mathcal{H}_\lambda}^2, \tag{8}$$
with $\mathcal{H}_\lambda$ the RKHS from definition 2 with kernel $K(x, x') = \sum_{\omega \in \mathbb{Z}^d} \hat{\lambda}_\omega e^{2\pi i \omega \cdot (x - x')}$. This allows us to quantify the deviation of $\hat{S}$ from $\mathbb{E}[\hat{S}] = \|u\|_F$ with, e.g., Chebyshev's inequality, as shown in the next theorem.

Theorem 2 (Certificate with Chebyshev's inequality). Let $(\hat{\lambda}_\omega)_\omega$ be a probability distribution on $\mathbb{Z}^d$, $\delta \in (0, 1)$ and $g$ a positive function. Let $N > 0$ and $\hat{S}$ be the empirical mean of $|\hat{f}_\omega - c\,1_{\omega=0} - \hat{g}_\omega|/\hat{\lambda}_\omega$ obtained with $N$ samples $\omega_i \sim \hat{\lambda}_\omega$. Then, a certificate holding with probability $1 - \delta$ is given by
$$f^\star \geq c - \hat{S} - \frac{\|f - c - g\|_{\mathcal{H}_\lambda}}{\sqrt{N\delta}}, \qquad \hat{S} = \frac{1}{N}\sum_{i=1}^{N} \frac{|\hat{f}_{\omega_i} - c\,1_{\omega_i=0} - \hat{g}_{\omega_i}|}{\hat{\lambda}_{\omega_i}}. \tag{9}$$

Proof. From its definition in eq. (7), we see that $\hat{S}$ is an unbiased estimator of the F-norm. Then, Chebyshev's inequality states that $|\hat{S} - \|u\|_F|^2 \leq \mathrm{Var}\,\hat{S}/\delta$ with probability at least $1 - \delta$. Using the bound on the variance in eq. (8), it follows that $\|u\|_F \leq \hat{S} + \|f - c - g\|_{\mathcal{H}_\lambda}/\sqrt{N\delta}$ with probability at least $1 - \delta$. Plugging this expression into eq. (2), we obtain the result.

Note that the norm in $\mathcal{H}_\lambda$ can be developed as (assuming for conciseness that $(-c)$ is absorbed into the 0-frequency of $f$)
$$\|f - c - g\|_{\mathcal{H}_\lambda}^2 = \sum_{\omega \in \mathbb{Z}^d} \frac{\hat{f}_\omega^*(\hat{f}_\omega - 2\hat{g}_\omega)}{\hat{\lambda}_\omega} + \|g\|_{\mathcal{H}_\lambda}^2 \leq \big(\|f\|_{\mathcal{H}_\lambda} + \|g\|_{\mathcal{H}_\lambda}\big)^2. \tag{10}$$
Thus, theorem 2 provides a certificate on $f^\star$ as long as we can (i) evaluate the Fourier transform $\hat{g}_\omega$ of $g$ and (ii) compute its Hilbert norm in some RKHS $\mathcal{H}_\lambda$ induced by $\hat{\lambda}_\omega$. In the next section, we detail the choices we make to achieve this efficiently, with kernels amenable to GPU computation, scaling to thousands of coefficients.

Remark 2 (Using a RKHS norm instead of the F-norm). Note that since $(\hat{\lambda}_\omega)_\omega$ sums to 1, the associated kernel is bounded by 1. Hence $\|u\|_{L^\infty} \leq \|u\|_{\mathcal{H}_\lambda}$, and the latter could be used instead of the F-norm in eq. (2). There are two reasons for taking our approach instead. Firstly, the F-norm is always tighter than a RKHS norm (see e.g. [Woodworth et al., 2022, Lem. 4]); secondly, we cannot compute $\|u\|_{\mathcal{H}_\lambda}$ efficiently and have to rely instead on another upper bound. However, taking the number of samples $N = O(\|u\|_{\mathcal{H}_\lambda}^2)$ alleviates this issue.
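The structure of the estimator in theorem 2 can be illustrated with a toy one-dimensional example. Everything in the sketch below (the truncated distribution, the coefficients of $u$) is made up for the illustration and merely stands in for the Bessel weights used in the paper.

```python
# Sketch of the certificate estimator of theorem 2 with a toy probability on Z standing
# in for hat_lambda, and toy Fourier coefficients u_hat for u = f - c - g.
import numpy as np

rng = np.random.default_rng(0)
W = 50                                                 # truncation of the toy distribution
omegas = np.arange(-W, W + 1)
lam = 2.0 ** (-np.abs(omegas))
lam /= lam.sum()                                       # toy hat_lambda on {-W, ..., W}
u_hat = 0.1 * 3.0 ** (-np.abs(omegas))                 # toy coefficients of u

F_norm = np.sum(np.abs(u_hat))                         # exact ||u||_F, for reference
H_norm = np.sqrt(np.sum(np.abs(u_hat) ** 2 / lam))     # ||u||_{H_lambda}, bounds the variance

N, delta = 10_000, 0.02
idx = rng.choice(len(omegas), size=N, p=lam)
S_hat = np.mean(np.abs(u_hat[idx]) / lam[idx])         # unbiased estimate of ||u||_F
bound = S_hat + H_norm / np.sqrt(N * delta)            # upper bound w.p. >= 1 - delta (eq. 9)
print(F_norm, S_hat, bound)
```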
Exponential concentration bounds with MoM. The scaling in $1/\sqrt{\delta}$ in theorem 2 can be prohibitive if one requires a high probability on the result ($\delta \ll 1$). Fortunately, alternative estimators exist for those cases. The median-of-means (MoM) estimator is an example, used in theorem 3.

Theorem 3 (Certificate with MoM estimator). Let $(\hat{\lambda}_\omega)_\omega$ be a probability distribution on $\mathbb{Z}^d$, and $\delta \in (0, 1)$. Draw $N > 0$ frequencies $\omega_i \sim \hat{\lambda}_\omega$. Define the MoM estimator as follows: for $K \in \mathbb{N}$ s.t. $\delta = e^{-K/8}$ and $N = Kb$, $b \geq 1$, write $B_1, \ldots, B_K$ a partition of $[N]$; then
$$\mathrm{MoM}_\delta\left(\frac{|\hat{u}_{\omega_i}|}{\hat{\lambda}_{\omega_i}}\right) = \operatorname{median}_{1 \leq k \leq K}\left(\frac{1}{b}\sum_{i \in B_k} \frac{|\hat{f}_{\omega_i} - c\,1_{\omega_i=0} - \hat{g}_{\omega_i}|}{\hat{\lambda}_{\omega_i}}\right). \tag{11}$$
A certificate on $f^\star$ holding with probability $1 - \delta$ follows, with
$$f^\star \geq c - \mathrm{MoM}_\delta\left(\frac{|\hat{u}_{\omega_i}|}{\hat{\lambda}_{\omega_i}}\right) - 4\,\|f - c - g\|_{\mathcal{H}_\lambda}\sqrt{\frac{\log(1/\delta)}{N}}. \tag{12}$$

Proof. Using e.g. Theorem 4.1 from Devroye et al. [2016], we get that the deviation of the MoM estimator from the expectation is bounded as
$$\left|\,\|u\|_F - \mathrm{MoM}_\delta\left(\frac{|\hat{u}_{\omega_i}|}{\hat{\lambda}_{\omega_i}}\right)\right| \leq 4\sqrt{\frac{\mathrm{Var}(|\hat{u}_\omega|/\hat{\lambda}_\omega)\log(1/\delta)}{N}} \qquad \text{with probability } 1 - \delta. \tag{13}$$
Using the upper bound on the variance with the $\mathcal{H}_\lambda$ norm given in eq. (8) and plugging the resulting expression into eq. (2), we obtain the result.

To conclude this section, bounding the $L^\infty$ norm from above with the F-norm in eq. (3) enables us to obtain a certificate on $f^\star$, as shown in theorem 1. The F-norm requires an infinite number of computations in the general case, but can be bounded efficiently with a probabilistic estimate, given by theorem 2 or theorem 3. This is summed up in fig. 1. Note that the gap between the F-norm and the $L^\infty$ norm is a source of conservatism in the certificate which we do not quantify; yet, the F-norm is optimal among a class of norms, see [Woodworth et al., 2022, Lemma 3].

3 Algorithm and implementation

3.1 Bessel kernel

We now detail the specific choice of kernel we make in order to compute the certificate of theorem 2 or theorem 3 efficiently. Our first observation is to use a kernel stable by product, so that we can easily characterize a Hilbert space the model $g$ belongs to. This restricts the choice to the exponential family. That is why we define, for a parameter $s > 0$,
$$q_s(x) = e^{s(\cos(2\pi x) - 1)} = \sum_{\omega \in \mathbb{Z}} e^{-s} I_{|\omega|}(s)\, e^{2\pi i \omega x}, \qquad x \in \mathbb{T}, \tag{14}$$
with $I_{|\omega|}(\cdot)$ the modified Bessel function of the first kind [Watson, 1922, p.181]. Then, define $K_s(x, y) = q_s(x - y)$ as in definition 2, and take a tensor product to extend the definition of $K$ to multiple dimensions, i.e., $K_s(x, y) = \prod_{\ell=1}^{d} K_{s_\ell}(x_\ell, y_\ell)$ for any $x, y \in \mathbb{T}^d$ and $s \in \mathbb{R}_+^d$. We refer to this kernel as the Bessel kernel, and to the associated RKHS as $\mathcal{H}_s$. It is stable by product, as $K_s(x, y) = K_{s/2}(x, y)^2$. This is key to computing the Fourier transform of the model $g$, and contrasts with previous approaches, which used the exponential kernel with $\hat{q}_\omega \propto e^{-s|\omega|}$ Woodworth et al. [2022], Bach and Rudi [2023]. In the following, $g$ is a k-SoS model defined as in definition 1, with the Bessel kernel of parameter $s \in \mathbb{R}_+^d$ defined in eq. (14).

Lemma 1 (Fourier coefficients of the Bessel kernel). For $\omega \in \mathbb{Z}^d$, the Fourier coefficient of $g$ at $\omega$ can be computed in $O(drm^2)$ time with
$$\hat{g}_\omega = \sum_{i,j=1}^{m} R_i^\top R_j \prod_{\ell=1}^{d} e^{-2s_\ell}\, I_{|\omega_\ell|}\big(2s_\ell \cos \pi(z_{i\ell} - z_{j\ell})\big)\, e^{-i\pi\omega_\ell(z_{i\ell} + z_{j\ell})}, \tag{15}$$
where $R_i \in \mathbb{R}^r$ denotes the $i$-th column of $R$.

Proof. From its definition in eq. (5), we rewrite $g$ as
$$g(x) = \sum_{i,j=1}^{m} R_i^\top R_j \prod_{\ell=1}^{d} K_{s_\ell}(x_\ell, z_{i\ell}) K_{s_\ell}(x_\ell, z_{j\ell}). \tag{16}$$
Now, from the definition of the Bessel kernel in eq. (14), we have that for any $(x, y, z) \in \mathbb{T}^3$,
$$K(x, y)K(x, z) = e^{-2s}\, e^{2s \cos(2\pi(y - z)/2)\, \cos 2\pi(x - (y+z)/2)}.$$
By definition of the modified Bessel function, the Fourier coefficients (in $x$) of this expression are given by $e^{-2s} I_{|\omega|}\big(2s \cos(2\pi(y - z)/2)\big) e^{-i\pi\omega(y + z)}$. Using this in eq. (16), we get the result.

The second necessary ingredient for using the certificate of theorem 2 is computing an RKHS norm for $g$; before turning to it, the short numerical check below illustrates the expansion in eq. (14).
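This is an editorial sanity check, relying on the standard generating-function identity for modified Bessel functions; it is not part of the paper's code.

```python
# Quick numerical check of the expansion in eq. (14):
# q_s(x) = exp(s*(cos(2*pi*x) - 1)) = sum_omega exp(-s) * I_|omega|(s) * exp(2*pi*i*omega*x).
import numpy as np
from scipy.special import iv   # modified Bessel function of the first kind

s, x = 2.0, 0.37
lhs = np.exp(s * (np.cos(2 * np.pi * x) - 1.0))
omegas = np.arange(-30, 31)                      # coefficients decay fast; 30 terms suffice
rhs = np.sum(np.exp(-s) * iv(np.abs(omegas), s) * np.exp(2j * np.pi * omegas * x)).real
print(abs(lhs - rhs))                            # ~ machine precision
```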
Computing this norm relies on the inclusion of $\mathcal{H}_{2s}$ into the bigger space of symmetric operators $S(\mathcal{H}_s)$.

Lemma 2 (Bound on the RKHS norm of $g$). $g$ belongs to $\mathcal{H}_{2s}$, and $\|g\|_{\mathcal{H}_{2s}}$ is bounded by the Hilbert-Schmidt norm of an associated operator in $S(\mathcal{H}_s)$, which can be computed in $O(dm^2 + mr^2)$ time, with
$$\|g\|_{\mathcal{H}_{2s}}^2 \leq \|g\|_{S(\mathcal{H}_s)}^2 = \mathrm{Tr}\,(R K_{s,Z} R^\top)^2. \tag{17}$$

Proof. Assume that $d = 1$; the reasoning extends to multiple dimensions with the tensor product. From the computation of the Fourier coefficients in lemma 1 and the fact that $I_{|\omega|}(2s\cos(2\pi t)) \leq I_{|\omega|}(2s)$, we have $\hat{g}_\omega = O(I_{|\omega|}(2s))$, hence $g \in \mathcal{H}_{2s}$. Finally, since the kernel is stable by product, $\mathcal{H}_{2s} = \mathcal{H}_s \cdot \mathcal{H}_s$, so we can use e.g. [Paulsen and Raghupathi, 2016, Thm. 5.16], with $H_1 = H_2 = \mathcal{H}_s$ and $S(\mathcal{H}_s) = \mathcal{H}_s \otimes \mathcal{H}_s$, with the operator $A = (\phi(z_1), \ldots, \phi(z_m))\, R^\top R\, (\phi(z_1), \ldots, \phi(z_m))^* \in S(\mathcal{H}_s)$.

With lemma 2, we have that the model $g$ belongs to $\mathcal{H}_{2s}$, so we will naturally use $\hat{\lambda}_\omega = \prod_{\ell=1}^{d} e^{-2s_\ell} I_{|\omega_\ell|}(2s_\ell)$ in theorem 2; said differently, the space $\mathcal{H}_\lambda$ introduced in eq. (8) is simply $\mathcal{H}_{2s}$ defined through eq. (14).

3.2 The algorithm: GloptiNets

We can now describe how GloptiNets yields a certificate on $f^\star$. The key observation is that no matter how our model $g(R, z)$ from definition 1 is obtained, we will always be able to compute a certificate with theorems 2 and 3. Thus, even though optimizing eq. (9) w.r.t. $(c, R, z)$ is highly non-convex, we can use any optimization routine and check its efficiency empirically by looking at the certificate. Moreover, thanks to its low-rank structure, it is cheaper to evaluate $g$ than to evaluate its Fourier coefficients; this is formally shown in proposition 2 in appendix A, where a block-diagonal structure for the model is also introduced. That is why we first optimize $\sup_{c, g}\, c - \|f - c - g\|_\bullet$, where $\|\cdot\|_\bullet$ is a proxy for the $L^\infty$ norm, e.g. the log-sum-exp (LSE) on a random batch of $N$ points:¹
$$\|f - c - g\|_{L^\infty} \approx \max_{i \in [N]} |f(x_i) - c - g(x_i)| \approx \mathrm{LSE}\big(f(x_i) - c - g(x_i)\big)_{i \in [N]}. \tag{18}$$
This optimization can be carried out by any deep learning library with automatic differentiation and any flavour of gradient ascent. Only afterwards do we compute the certificate with theorems 2 and 3. This procedure is summed up in alg. 1.

¹ Another detail of practical importance is that this loss can be efficiently backpropagated through; on the other hand, the certificate is not easily vectorized, and the Bessel functions involved would require specific approximations to be efficiently backpropagated through.

Algorithm 1: GloptiNets
Data: A trigonometric polynomial $f$, a candidate $z$ s.t. $c = f(z)$, a model $g$, and a probability $\delta$.
Result: A certificate $|f^\star - f(z)| \leq \epsilon_\delta$ with probability $1 - \delta$.
/* Optimize g with function values */
for epoch = 1:nepochs do
    Sample $x_1, \ldots, x_N \in \mathbb{T}^d$;
    $L, \nabla L \leftarrow \mathrm{autodiff}\big(\mathrm{LSE}(f(x_i) - c - g(x_i))_{i \in [N]}\big)$;
    $z, R \leftarrow \mathrm{optimizer}(\nabla L)$;
/* Compute a certificate */
Let $\hat{\lambda}_\omega$ be the probability distribution on $\mathbb{Z}^d$ with $\hat{\lambda}_\omega = \prod_{\ell=1}^{d} e^{-2s_\ell} I_{|\omega_\ell|}(2s_\ell)$;
Sample $\Omega = (\omega_1, \ldots, \omega_N) \sim \hat{\lambda}_\omega$;
Compute $M = \mathrm{MoM}_\delta\big(|\hat{f}_{\omega_i} - c\,1_{\omega_i=0} - \hat{g}_{\omega_i}|/\hat{\lambda}_{\omega_i}\big)_{i \in [N]}$ and $\sigma = \|g\|_{S(\mathcal{H}_s)}$;
Return $\epsilon_\delta = M + 4\sigma\sqrt{\log(1/\delta)/N}$;

Remark 3 (Providing a candidate). In alg. 1, a candidate estimate $c$ for the minimum value $f(x^\star)$ is necessary. However, it is possible to overcome this requirement by incorporating $c$ as a learnable parameter within the training loop. Moreover, $x^\star$ can be learned using techniques similar to those in Rudi et al. [2020]: by replacing the lower bound $c$ with a parabola centered at $z$, $z$ becomes a candidate for $x^\star$ with precision corresponding to the tightness of the certificate. Note however that this method introduces additional hyperparameters.

3.3 Specific implementation for the Chebyshev basis

As already observed in Bach and Rudi [2023], a result on trigonometric polynomials on $\mathbb{T}^d$ directly extends to real polynomials on $[-1, 1]^d$. The reason is that minimizing $h$ on $[-1, 1]^d$ amounts to minimizing the trigonometric polynomial $f = h((\cos 2\pi x_1, \ldots, \cos 2\pi x_d))$ on $\mathbb{T}^d$ (a short sketch of this lift is given below).
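A minimal sketch of this lift, with a toy Chebyshev polynomial; the coefficients and helper names are illustrative assumptions, not the paper's code.

```python
# Sketch of the lift used in remark 1 / section 3.3: a polynomial h on [-1, 1]^d written in
# the Chebyshev basis becomes the trigonometric polynomial f(z) = h(cos(2*pi*z)) on the
# torus, since T_k(cos(2*pi*z)) = cos(2*pi*k*z).
import numpy as np

cheb_coeffs = {(2, 0): 1.0, (1, 1): -0.5, (0, 3): 0.25}   # toy h in d = 2

def h(x):   # h(x) = sum_omega c_omega * prod_l T_{omega_l}(x_l), with T_k(x) = cos(k*arccos(x))
    return sum(c * np.prod(np.cos(np.asarray(k) * np.arccos(x)))
               for k, c in cheb_coeffs.items())

def f(z):   # periodic lift: f(z) = h(cos(2*pi*z))
    return sum(c * np.prod(np.cos(2 * np.pi * np.asarray(k) * z))
               for k, c in cheb_coeffs.items())

z = np.array([0.13, 0.71])
print(np.isclose(h(np.cos(2 * np.pi * z)), f(z)))   # True: the two parametrizations agree
```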
Note however that $f$ is an even function in every coordinate: for any $x \in \mathbb{T}^d$ and any $i$, $f(x_1, \ldots, x_i, \ldots, x_d) = f(x_1, \ldots, -x_i, \ldots, x_d)$. Thus, approximating $f - f^\star$ with a k-SoS model of definition 1 is suboptimal, in the sense that we could approximate $f$ only on $[0, 1/2]^d$, a domain $2^d$ times smaller. Put differently, the Fourier coefficients of $f$ are real by design: it would be convenient to enforce this structure in the model $g$. This is achieved with proposition 1.

Proposition 1 (Kernel defined on the Chebyshev basis). Let $q$ be a real, even function on the torus, bounded by 1, as in eq. (6). Let $K$ be the kernel defined on $[-1, 1]$ by
$$K(\cos 2\pi u, \cos 2\pi v) = \tfrac{1}{2}\big(q(u + v) + q(u - v)\big), \qquad (u, v) \in (0, 1/2)^2. \tag{19}$$
Then $K$ is a symmetric, p.d., hence reproducing kernel, bounded by 1, with explicit feature map given by
$$K(x, y) = \hat{q}_0 + \sum_{\omega \geq 1} 2\hat{q}_\omega H_\omega(x) H_\omega(y), \qquad x, y \in [-1, 1], \tag{20}$$
where $H_\omega$ denotes the Chebyshev polynomial of degree $\omega$.

The proof is available in appendix B. It simply relies on expanding the definition of $K$ in eq. (19). The resulting expression in eq. (20) exhibits only cosine terms (in the decomposition of $x \mapsto K(\cos 2\pi x, y)$). This enables us to directly extend the PSD models from definition 1 with such kernels. Finally, when used with the Bessel kernel of eq. (14), we recover an easy computation of the Chebyshev coefficients, as shown in lemma 3, in $O(drm^2)$ time. This enables us to approximate any function expressed in the Chebyshev basis. Note that polynomials expressed in other bases can be certified too, by first operating a change of basis.

4 Experiments

[Figure 2: Certificate vs. number of parameters in $g$, for a given function $h$ (curves: random polynomials and kernel mixtures with $\|h\|_{\mathcal{H}} \in \{1, 2\}$). The higher the RKHS norm of $h$, the more difficult it is to approximate uniformly and the looser the certificate, independently of the function type. The more parameters in the k-SoS model, the tighter the certificates obtained with theorem 3.]

The code to reproduce these experiments is available at github.com/gaspardbb/GloptiNets.jl.

Settings. Given a function $h$, we compute a candidate $\hat{x}$ with gradient descent and multiple initializations. The goal is then to certify that $\hat{x}$ is indeed a global minimizer of $h$. This is a common setup in the polynomial-SoS literature Wang and Magron [2022]. To illustrate the influence of the number of parameters, the positive model $g$ defined in definition 1 for GloptiNets designates either a small model GN-small with 1792 parameters, or a bigger model GN-big with 22528 parameters. The latter should have higher expressivity and better interpolate positive functions, leading to tighter certificates. All results for GloptiNets are obtained with confidence $1 - \delta = 1 - e^{-4} \approx 98\%$. All other details regarding the experiments are reported in appendix C.

Polynomials. We first consider the case where $h$ is a random trigonometric polynomial. Note that this is a restrictive analysis, as GloptiNets can handle any smooth function (i.e., with infinitely many non-zero Fourier coefficients). The polynomials have various dimensions $d$, degrees $p$ and numbers of coefficients $n$, but a constant RKHS norm in $\mathcal{H}_{2\cdot\mathbf{1}_d}$. We compare the performance of GloptiNets to TSSOS, in its complex polynomial variant Wang and Magron [2022].
The latter is used with parameters such that it executes the fastest, but without guarantees of convergence to the global minimum $f^\star$. Table 1 shows the certificates on $h(\hat{x}) - h(x^\star)$ and the execution times (lower is better, $t$ in seconds) for TSSOS, GN-small and GN-big. Figure 2 provides certificates on a random polynomial, as a function of the number of parameters in $g$.

Kernel mixtures. While polynomials provide ground for comparison with existing work, GloptiNets is not confined to this function class. This is evidenced by experiments on kernel mixtures, where our approach stands as the only viable alternative we are aware of. The functions we certify are of the form $h(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$, where $K$ is the Bessel kernel of eq. (14). Kernel mixtures are ubiquitous in machine learning and arise e.g. when performing kernel ridge regression. Certificates obtained on mixtures are compared with those obtained on polynomials in fig. 2, as a function of the model size of $g$.

Table 1: GloptiNets and TSSOS on random trigonometric polynomials. While TSSOS provides machine-precision certificates, its running time grows exponentially with the problem size, and it eventually fails on problems 3 and 6. On the other hand, GloptiNets has constant running time no matter the problem size, and its certificates can be tightened by increasing the model size.

| d | p | n   | TSSOS Certif.        | TSSOS t | GN-small Certif.     | GN-small t     | GN-big Certif.       | GN-big t       |
|---|---|-----|----------------------|---------|----------------------|----------------|----------------------|----------------|
| 3 | 5 | 85  | $5.3 \cdot 10^{-11}$ | 3       | $8.35 \cdot 10^{-4}$ | $6 \cdot 10^3$ | $2.64 \cdot 10^{-4}$ | $9 \cdot 10^3$ |
| 3 | 7 | 231 | $4.7 \cdot 10^{-13}$ | 120     | $9.51 \cdot 10^{-4}$ | $6 \cdot 10^3$ | $2.90 \cdot 10^{-4}$ | $9 \cdot 10^3$ |
| 3 | 9 | 489 | out of memory        | -       | $1.18 \cdot 10^{-3}$ | $6 \cdot 10^3$ | $3.34 \cdot 10^{-4}$ | $9 \cdot 10^3$ |
| 4 | 3 | 33  | $3.1 \cdot 10^{-10}$ | 0.1     | $2.46 \cdot 10^{-2}$ | $1 \cdot 10^4$ | $3.45 \cdot 10^{-3}$ | $2 \cdot 10^4$ |
| 4 | 5 | 225 | $4.8 \cdot 10^{-12}$ | 53      | $3.71 \cdot 10^{-2}$ | $1 \cdot 10^4$ | $3.59 \cdot 10^{-3}$ | $2 \cdot 10^4$ |
| 4 | 7 | 833 | out of memory        | -       | $4.76 \cdot 10^{-2}$ | $1 \cdot 10^4$ | $4.85 \cdot 10^{-3}$ | $2 \cdot 10^4$ |

Results. There are two key insights about the performance of GloptiNets. Firstly, its certificate does not depend on the structure of the function to optimize. Thus, although GloptiNets does not match the performance of TSSOS on small polynomials, it can tackle polynomials which cannot be handled by competitors, with arbitrarily many coefficients. For instance, TSSOS cannot handle the problems with $n \in \{489, 833\}$ in table 1. More importantly, GloptiNets can certify a richer class of functions than polynomials, among which kernel mixtures. The performance of GloptiNets mostly depends on the complexity of the function to certify, as measured by its RKHS norm. Secondly, note that a bigger model yields tighter certificates. This is detailed in fig. 2, where the same function $f$ is optimized with various models. The dependency of the certificate on the norm of $f$ is shown in fig. 3 in appendix C, along with experiments with Chebyshev polynomials.

5 Limitations

One limitation of GloptiNets is the trade-off resulting from its high flexibility for obtaining a certificate as in alg. 1. While this flexibility offers numerous advantages, it also introduces the need for an extensive hyperparameter search. Although we have identified a set of hyperparameters that align with deep learning practices (a momentum optimizer with cosine decay and a large initial learning rate), the optimal settings may vary depending on the specific problem at hand. In the same vein, the certificates given by GloptiNets are of moderate accuracy. While adding more parameters to the k-SoS model certainly helps (as shown in fig. 2), alternative optimization schemes to interpolate $h - h(\hat{x})$ with $g$ might provide easier improvements. For instance, we found that using an approximate second-order scheme in alg. 1 is key to obtaining good certificates.
In the specific setting of polynomial optimization, we highlight that our model is not competitive on problems which exhibit some algebraic structure, such as term sparsity or the constant trace property. Typically, problems with coefficients of low degree (less than or equal to 2), which notably encompass the OPF problem, are very well handled by the family of solvers TSSOS belongs to. Finally, GloptiNets does not handle constraints yet.

6 Conclusion

The GloptiNets algorithm presented in this work lays the foundation for a new family of solvers which provide certificates for non-convex problems. While our approach does not aim to replace the well-established Lasserre's hierarchy for sparse polynomials, it offers a fresh perspective on tackling a new set of problems at scale. Through demonstrations on synthetic examples, we have showcased the potential of our approach. Further research directions include extensive parameter tuning to obtain tighter certificates, with the possibility of leveraging second-order optimization schemes, along with warm-restart schemes for applications which require solving multiple similar problems sequentially.

Acknowledgments. AR acknowledges support of the French government under the management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). AR acknowledges support of the European Research Council (grant REAL 947908). JM was supported by the ERC grant number 101087696 (APHELAIA project) and by ANR 3IA MIAI@Grenoble Alpes (ANR-19-P3IA-0003).

References

Francis Bach and Alessandro Rudi. Exponential convergence of sum-of-squares hierarchies for trigonometric polynomials. SIAM Journal on Optimization, 33(3):2137-2159, 2023.

Eloïse Berthier, Justin Carpentier, Alessandro Rudi, and Francis Bach. Infinite-dimensional sums-of-squares for optimal control. In 2022 IEEE 61st Conference on Decision and Control (CDC), pages 577-582, December 2022. doi: 10.1109/CDC51059.2022.9992396.

Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Luc Devroye, Matthieu Lerasle, Gabor Lugosi, and Roberto I. Oliveira. Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695-2725, 2016.

Dinh Dũng, Vladimir N. Temlyakov, and Tino Ullrich. Hyperbolic Cross Approximation. arXiv:2211.04889, April 2017.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Didier Henrion, Milan Korda, and Jean-Bernard Lasserre. The Moment-SOS Hierarchy, volume 4 of Optimization and Its Applications. World Scientific Publishing Europe Ltd., December 2020. doi: 10.1142/q0252.

Joseph J. Hilling and Anthony Sudbery. The geometric measure of multipartite entanglement and the singular values of a hypermatrix. Journal of Mathematical Physics, 51(7):072102, July 2010.

Reiner Horst and Panos M. Pardalos. Handbook of Global Optimization, volume 2. Springer Science & Business Media, 2013.

Cédric Josz and Daniel K. Molzahn. Lasserre hierarchy for large scale polynomial optimization in real and complex variables. SIAM Journal on Optimization, 28(2):1017-1048, January 2018.

Jean B. Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796-817, January 2001. doi: 10.1137/S1052623400366802.

Jean Bernard Lasserre. Moments, Positive Polynomials and Their Applications, volume 1 of Series on Optimization and Its Applications. October 2009. doi: 10.1142/p665.
Monique Laurent and Lucas Slot. An effective version of Schmüdgen's Positivstellensatz for the hypercube. Optimization Letters, September 2022. doi: 10.1007/s11590-022-01922-5.

Ngoc Hoang Anh Mai, Jean-Bernard Lasserre, Victor Magron, and Jie Wang. Exploiting constant trace property in large-scale polynomial optimization. ACM Transactions on Mathematical Software, 48(4):40:1-40:39, December 2022.

Ulysse Marteau-Ferey, Francis Bach, and Alessandro Rudi. Non-parametric models for non-negative functions. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12816-12826. Curran Associates, Inc., 2020.

Pablo Moscato et al. On evolution, search, optimization, genetic algorithms and martial arts: Towards memetic algorithms. Caltech Concurrent Computation Program, C3P Report, 826(1989):37, 1989.

Boris Muzellec, Adrien Vacher, Francis Bach, François-Xavier Vialard, and Alessandro Rudi. Near-optimal estimation of smooth transport maps with kernel sums-of-squares. arXiv:2112.01907, December 2021.

Vern I. Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces. Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2016. doi: 10.1017/CBO9781316219232.

Alessandro Rudi and Carlo Ciliberto. PSD representations for effective probability models. In Advances in Neural Information Processing Systems, volume 34, pages 19411-19422. Curran Associates, Inc., 2021.

Alessandro Rudi, Ulysse Marteau-Ferey, and Francis Bach. Finding global minima via kernel approximations. arXiv:2012.11978, December 2020.

Walter Rudin. The basic theorems of Fourier analysis. In Fourier Analysis on Groups, chapter 1, pages 1-34. John Wiley & Sons, Ltd, 1990. doi: 10.1002/9781118165621.ch1.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

Pascal Van Hentenryck. Machine learning for optimal power flows. INFORMS Tutorials in Operations Research, October 2018.

Peter J. M. van Laarhoven and Emile H. L. Aarts. Simulated Annealing. Springer, 1987.

Hayato Waki, Sunyoung Kim, Masakazu Kojima, and Masakazu Muramatsu. Sums of squares and semidefinite program relaxations for polynomial optimization problems with structured sparsity. SIAM Journal on Optimization, 17(1):218-242, January 2006. doi: 10.1137/050623802.

Irène Waldspurger, Alexandre d'Aspremont, and Stéphane Mallat. Phase recovery, MaxCut and complex semidefinite programming, July 2013.

Jie Wang and Victor Magron. Exploiting sparsity in complex polynomial optimization. Journal of Optimization Theory and Applications, 192(1):335-359, January 2022.

Jie Wang, Victor Magron, and Jean-Bernard Lasserre. Chordal-TSSOS: A moment-SOS hierarchy that exploits term sparsity with chordal extension. SIAM Journal on Optimization, 31(1):114-141, January 2021a. doi: 10.1137/20M1323564.

Jie Wang, Victor Magron, and Jean-Bernard Lasserre. TSSOS: A moment-SOS hierarchy that exploits term sparsity. SIAM Journal on Optimization, 31(1):30-58, January 2021b. doi: 10.1137/19M1307871.

G. N. Watson. A Treatise on the Theory of Bessel Functions. Cambridge University Press, 1922.

Blake Woodworth, Francis Bach, and Alessandro Rudi. Non-convex optimization with certificates and fast rates through kernel sums of squares. In Proceedings of the Thirty Fifth Conference on Learning Theory, pages 4620-4642. PMLR, June 2022.
A Extensions

We explore additional extensions of GloptiNets that further enhance its appeal. We first describe a block-diagonal structure for the model for faster evaluation, then a theoretical splitting scheme for optimization, and finally a warm-start scheme.

A.1 Block-diagonal structure for efficient computation

Without any further assumption, we see that a model from definition 1 can be evaluated in $O(drm)$ time; its Fourier coefficients, given by lemma 1, in $O(dm^2 r)$ time; and the bound on the RKHS norm in $O(dm^2 + mr^2)$ time thanks to lemma 2. All this enables computing a certificate, as stated in theorem 2, in $O(Ndm^2 r + mr^2)$ time, where $N$ is the number of frequencies sampled. If the function $f$ to be minimized has a big $\mathcal{H}_s$ norm, we might need a large model size $m$ to have $f - f^\star \approx g$. Hence, we introduce a specific structure on $G$ which makes it block-diagonal and better conditioned.

Proposition 2 (Block-diagonal PSD model). Let $g$ be a PSD model as in definition 1, with $m = bs$ anchors. Split them into $b$ groups, denoting them $z_{ij}$, $i \in [b]$ and $j \in [s]$. Compute the Cholesky factorization of each kernel matrix, $T_i^\top T_i = K_{z_i} \in \mathbb{R}^{s \times s}$. Then, define $G$ as a block-diagonal matrix, with $b$ blocks defined as $G_i = \tilde{R}_i \tilde{R}_i^\top$, $\tilde{R}_i = T_i^{-1} R_i$, and $R_i \in \mathbb{R}^{s \times r}$. Equivalently,
$$G = \begin{pmatrix} \tilde{R}_1 \tilde{R}_1^\top & & \\ & \ddots & \\ & & \tilde{R}_b \tilde{R}_b^\top \end{pmatrix}, \quad \text{s.t.} \quad g(x) = \sum_{i=1}^{b} \big\|\tilde{R}_i^\top K_{z_i}(x)\big\|^2, \qquad K_{z_i}(x) = \big(K(z_{ij}, x)\big)_{1 \leq j \leq s}. \tag{21}$$
Then $g$ can be evaluated in $O(rbs^3 d)$ time, $\hat{g}_\omega$ in $O(bs^2(dr + s))$ time, and $\|g\|_{S(\mathcal{H}_s)}^2$ in $O(b^2(rs^2 + r^2 s) + bs^3)$ time. The model has $(r + d)bs$ real parameters.

Proof. With $G$ defined as such, it is PSD, of rank at most $rb \leq sb = m$. Writing $g(x) = \sum_{i=1}^{b} \|\tilde{R}_i^\top K_{z_i}(x)\|^2$, we can compute the Fourier coefficients by applying lemma 1 to each of the $b$ components. Adding the cost of computing $G_i = \tilde{R}_i \tilde{R}_i^\top$ results in a complexity of $O(bs^2(dr + s))$. Finally, note that $\|g\|_{S(\mathcal{H}_s)}^2 = \|A\|_{S(\mathcal{H}_s)}^2$, where
$$A = \big((\phi(z_{1j}))_{j \in [s]}, \ldots, (\phi(z_{bj}))_{j \in [s]}\big)\,(\mathrm{Diag}\, G_i)_{i \in [b]}\,\big((\phi(z_{1j}))_{j \in [s]}, \ldots, (\phi(z_{bj}))_{j \in [s]}\big)^*.$$
Then, defining $Q$ as the matrix of $b \times b$ blocks of size $s \times s$ s.t. for $j, k \in [b]$, $Q_{jk} = K(z_j, z_k) \in \mathbb{R}^{s \times s}$, we have
$$\|A\|_{S(\mathcal{H}_s)}^2 = \mathrm{Tr}\, Q\,(\mathrm{Diag}\, G_i)_{i \in [b]}\, Q\,(\mathrm{Diag}\, G_i)_{i \in [b]} = \sum_{j,k=1}^{b} \mathrm{Tr}\, G_j Q_{jk} G_k Q_{kj}, \tag{22}$$
and each term in the sum can be written $\mathrm{Tr}\,\big(\tilde{R}_j^\top Q_{jk} \tilde{R}_k\big)\big(\tilde{R}_k^\top Q_{kj} \tilde{R}_j\big) = \|\tilde{R}_j^\top Q_{jk} \tilde{R}_k\|_{HS}^2$, which is computed in $O(rs^2 + r^2 s)$ time, plus $O(bs^3)$ to compute the Cholesky factors. Denoting $\phi_{z_i} = (\phi(z_{ij}))_{1 \leq j \leq s}$, note that
$$\phi_{z_i} G_i \phi_{z_i}^* = \phi_{z_i} T_i^{-1} R_i R_i^\top (\phi_{z_i} T_i^{-1})^* = E_i R_i R_i^\top E_i^*, \tag{23}$$
with $E_i = \phi_{z_i} T_i^{-1}$ an orthonormal basis of $\mathrm{Span}(\phi(z_{ij}))_{1 \leq j \leq s}$, since $E_i^* E_i = I_s$. Thus, each model's coefficients are expressed in an orthonormal basis, which makes the optimization easier. Of course, this comes at an added $s^3$ complexity, which could be alleviated by using e.g. an incomplete Cholesky factorization instead.

Remark 4 (Relation to term sparsity in POP). The successful application of polynomial hierarchies to problems with thousands of variables relies on giving the moment matrix $M$ a block structure Wang et al. [2021b,a]. If the monomial basis has size $m$, the constraint $M \succeq 0$ is replaced with $M = (\mathrm{Diag}\, M_i)_{i \in [b]}$ and $M_i \succeq 0$. This enables solving $b$ SDPs of size at most $s$ instead of one of size $m$. Our model in proposition 2 follows a similar route to lower the computational budget.
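An illustrative sketch of the block-wise evaluation of proposition 2 follows; the Cholesky convention, shapes and helper names are assumptions of this sketch rather than the released implementation.

```python
# Sketch of a block-diagonal extended k-SoS model: the m anchors are split into b blocks;
# each block is whitened with the Cholesky factor of its kernel matrix, and g(x) is a
# sum of per-block squared norms, hence non-negative.
import numpy as np

def kernel_matrix(Z1, Z2, s):
    """Bessel-type kernel K_s between two sets of points on the torus."""
    diff = Z1[:, None, :] - Z2[None, :, :]
    return np.exp(np.sum(s * (np.cos(2 * np.pi * diff) - 1.0), axis=-1))

def block_ksos_eval(x, blocks, weights, s, jitter=1e-10):
    """g(x) = sum_i || R_i @ (L_i^{-1} K_{z_i}(x)) ||^2, with K_{z_i} = L_i @ L_i.T."""
    total = 0.0
    for Z_i, R_i in zip(blocks, weights):
        K_i = kernel_matrix(Z_i, Z_i, s) + jitter * np.eye(len(Z_i))
        L_i = np.linalg.cholesky(K_i)
        k_x = kernel_matrix(x[None, :], Z_i, s)[0]          # K_{z_i}(x), shape (s_blk,)
        total += np.sum((R_i @ np.linalg.solve(L_i, k_x)) ** 2)
    return total

rng = np.random.default_rng(0)
d, b, s_blk, r = 3, 4, 8, 2
blocks = [rng.random((s_blk, d)) for _ in range(b)]
weights = [rng.standard_normal((r, s_blk)) / np.sqrt(s_blk) for _ in range(b)]
print(block_ksos_eval(rng.random(d), blocks, weights, 2.0 * np.ones(d)))
```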
A.2 Global optimization with a splitting scheme

While GloptiNets can provide certificates for functions, it falls behind local solvers in terms of competitiveness. The challenge lies in the fact that finding a certificate is considerably more difficult than finding a local minimum, as it necessitates the uniform approximation of the entire function. However, we present a novel algorithmic framework that has the potential to enhance the competitiveness of GloptiNets with local solvers while simultaneously delivering certificates. Our approach involves partitioning the search domain into multiple regions and computing lower bounds for each partition. By discarding portions of the domain where we can certify that the function exceeds a certain threshold, the algorithm progressively simplifies the optimization problem and removes areas from consideration. Moreover, such an approach is naturally well suited to parallel computation.

The algorithm relies on a divide-and-conquer mechanism. First, we split the hypercube $(-1, 1)^d$ into $N$ regions, where $N$ is the number of cores available. We compute an upper bound with a local solver. For each region, we run GloptiNets in parallel, computing a certificate at regular intervals. As soon as the certificate is bigger than the upper bound, we stop the process: we know that the global minimum is not in the associated region. We can then reallocate the freed computing power by splitting the biggest current region, which yields an easier problem. We stop as soon as the regions considered are small enough. This is summarized in alg. 2, where (P) indicates a loop run in parallel. Note that minimizing $f$ on a hypercube of center $\mu$ and size $\sigma$ amounts to minimizing $x \mapsto f(\mu + \sigma x)$ on $[-1, 1]^d$, which is another Chebyshev polynomial whose coefficients can be evaluated efficiently thanks to the order-2 recurrence relation every orthogonal polynomial family satisfies. For Chebyshev polynomials, that is $H_{\omega+1}(x) = 2x H_\omega(x) - H_{\omega-1}(x)$.

Algorithm 2: Splitting scheme with GloptiNets
Data: A Chebyshev polynomial $f$ with a unique global optimum, a probability $\delta$, a number of cores $N$ and a volume $\rho < 1/N$.
Result: A certificate on $f^\star$: $f^\star \geq C_{\delta'}(f)$ with probability $1 - \delta'$.
/* Initialization: upper bound and partition */
$\Pi = \mathrm{partition}([-1, 1]^d, N)$, $\delta' = 0$;
(P) $\mathrm{ub} = \min_{\pi \in \Pi} \{\mathrm{localsolver}_{x \in \pi} f(x)\}$;
/* Iterate over the partition */
(P) for $\pi \in \Pi$, while $\mathrm{length}(\Pi) > 1$ do
    while $C_\delta(f_\pi) < \mathrm{ub}$ do continue optimization;
    Split the biggest part: $\pi_0 = \arg\max_{\pi \in \Pi} \mathrm{Vol}(\pi)$; $(\pi_1, \pi_2) = \mathrm{partition}(\pi_0, 2)$;
    If $\mathrm{Vol}(\pi_{1,2}) < \rho$: end this process;
    Update the upper bound: $\mathrm{ub} = \min\big(\mathrm{ub},\ \mathrm{localsolver}_{x \in \pi_{1,2}} f(x)\big)$;
    Update the search space and $\delta'$: $\Pi = \Pi \setminus \{\pi, \pi_0\} \cup \{\pi_1, \pi_2\}$, $\delta' = 1 - (1 - \delta')(1 - \delta)$;
/* A single region in $\Pi$ remains */
Return $\Pi = \{\pi\}$, $C_{\delta'}(f_\pi)$, $\delta'$;

A.3 Warm restarts

Our model distinguishes itself by leveraging the analytical properties of the objective function, rather than relying solely on algebraic characteristics. This approach offers a notable advantage, as closely related functions can naturally benefit from a warm restart. For example, if we already have a certificate for a function $f$ using a PSD model $g$, and we seek to compute a certificate for a similar function $f' \approx f$, we can readily employ GloptiNets by initializing the PSD model with $g$. Indeed, if $f - f^\star \approx g$, we can expect $f' - f'^\star \approx g$, so we can expect the optimization to be faster. In contrast, poly-SoS methods, which rely on SDP programs, cannot directly adapt to new problems without significant effort. For instance, if a new component is introduced, an entirely new SDP must be solved. Our model's ability to accommodate related yet distinct problems could prove highly valuable in domains with a frequent need to certify different but closely related problems. In industry, the Optimal Power Flow (OPF) problem requires periodic solves every 5 minutes Van Hentenryck [2018].
With GloptiNets, once the initial challenging solve is performed, subsequent solves become easier, assuming minimal changes in supply and demand conditions.

A.4 Optimizing the certificate directly

As explained in section 3.2 where GloptiNets is introduced, we optimize a proxy of the $L^\infty$ norm rather than the certificate of theorems 2 and 3. This proxy is the log-sum-exp on a random batch of $N$ points. The reason for this is that evaluating an extended k-SoS model $g(x)$ at $x \in \mathbb{T}^d$ requires $O(drs)$ time, while evaluating $\hat{g}_\omega$ at $\omega \in \mathbb{Z}^d$ requires $O(drs^2)$ time. Yet, optimizing the certificate directly could probably help in obtaining higher-precision certificates. Lemma 4 in appendix D sketches a method to reduce the computational cost of the Fourier computation from $O(s^2)$ to $O(s)$.

B Kernel defined on the Chebyshev basis

In this section we describe the approach we take to model functions written in the Chebyshev basis. For $h$ such a polynomial, a naive approach would simply model $f = h(\cos(2\pi\,\cdot))$ as a trigonometric polynomial. However, note that the decomposition of $f$ only has cosine terms. Thus, approximating $f - f^\star$ efficiently requires a PSD model which has only cosine terms in its Fourier decomposition. This is achieved by using a kernel written in the Chebyshev basis, as introduced in proposition 1, for which we now provide a proof.

Proof of proposition 1. Let $x, y \in [-1, 1]$ and $u, v \in [0, 1/2]$ s.t. $x, y = \cos(2\pi u), \cos(2\pi v)$, by bijectivity of the cosine function on $[0, \pi]$. From the definition of $K$ in eq. (19) and the definition of $q$ in eq. (6), we have that
$$K(x, y) = \frac{1}{2}\sum_{\omega \in \mathbb{Z}} \hat{q}_\omega \big(e^{2\pi i\omega(u+v)} + e^{2\pi i\omega(u-v)}\big) = \sum_{\omega \in \mathbb{Z}} \hat{q}_\omega e^{2\pi i\omega u}\cos(2\pi\omega v) = \hat{q}_0 + 2\sum_{\omega \geq 1} \hat{q}_\omega \cos(2\pi\omega u)\cos(2\pi\omega v) = \hat{q}_0 + 2\sum_{\omega \geq 1} \hat{q}_\omega H_\omega(x) H_\omega(y).$$
Since $q$ has a positive Fourier transform, this makes the feature map of $K$ explicit, with $K(x, y) = \phi(x)^\top \phi(y)$ and $\phi(x)_\omega = \sqrt{(1 + 1_{\omega \neq 0})\,\hat{q}_\omega}\, H_\omega(x)$ for $\omega \in \mathbb{N}$. Hence the kernel is a reproducing kernel.

We now use this kernel with the Bessel function $x \mapsto e^{s(\cos(2\pi x) - 1)}$, i.e., we define the kernel $K$ on $[-1, 1]$ to satisfy
$$K(\cos(2\pi u), \cos(2\pi v)) = \frac{1}{2}\Big(e^{s(\cos(2\pi(u+v)) - 1)} + e^{s(\cos(2\pi(u-v)) - 1)}\Big), \qquad u, v \in (0, 1/2). \tag{24}$$
As was the case for the torus, this kernel enables an easy characterization of an RKHS in which an associated PSD model $g$ lives.

Lemma 3 (Chebyshev coefficients of the Bessel kernel). Let $g$ be a PSD model as in definition 1, with the kernel $K$ of eq. (24). Then, the Chebyshev coefficient $\hat{g}_\omega$, $\omega \in \mathbb{N}^d$, of $g$ can be computed in $O(rdm^2)$ time with
$$\hat{g}_\omega = \sum_{i,j=1}^{m} R_i^\top R_j \prod_{\ell=1}^{d} \frac{(1 + 1_{\omega_\ell \neq 0})\, e^{-2s_\ell}}{2}\Big( I_{\omega_\ell}(2s_\ell\sigma^-_{\ell ij})\, H_{\omega_\ell}(\sigma^+_{\ell ij}) + I_{\omega_\ell}(2s_\ell\sigma^+_{\ell ij})\, H_{\omega_\ell}(\sigma^-_{\ell ij}) \Big), \tag{25}$$
where $\sigma^\pm_{\ell ij} = \cos(2\pi m^\pm_{\ell ij})$, $m^\pm_{\ell ij} = (u_{\ell i} \pm u_{\ell j})/2$, and $\cos(2\pi u_{\ell i}) = z_{\ell i}$.

Proof. Expanding $g$ and the definition of the Chebyshev coefficients. From the definition of $g$ in eq. (5), we have
$$g(x) = \sum_{i,j=1}^{m} R_i^\top R_j \prod_{\ell=1}^{d} K_{s_\ell}(x_\ell, z_{\ell i}) K_{s_\ell}(x_\ell, z_{\ell j}). \tag{26}$$
We consider $x, y, z \in (-1, 1)$ and $s > 0$. We denote $u, v, w \in (0, 1/2)$ s.t. $x, y, z = \cos 2\pi u, \cos 2\pi v, \cos 2\pi w$, by bijectivity of $t \mapsto \cos(2\pi t)$ on $(0, 1/2)$. We now compute the Chebyshev coefficients of $x \mapsto K_s(x, y) K_s(x, z)$. Denoted $p_\omega$, they are given, for $\omega \in \mathbb{N}$, by
$$p_\omega = (1 + 1_{\omega \neq 0}) \int_{-1}^{1} K_s(x, y) K_s(x, z)\, T_\omega(x)\, \frac{dx}{\pi\sqrt{1 - x^2}},$$
or equivalently
$$p_\omega = (1 + 1_{\omega \neq 0}) \int_0^1 K_s(\cos 2\pi u, \cos 2\pi v)\, K_s(\cos 2\pi u, \cos 2\pi w)\, \cos(2\pi\omega u)\, du. \tag{27}$$

Chebyshev coefficients of the kernel product. With the definition of the kernel in proposition 1, eq. (19), we have
$$K_s(x, y) K_s(x, z) = \frac{1}{4}\big(q(u + v) + q(u - v)\big)\big(q(u + w) + q(u - w)\big) = \frac{e^{-2s}}{4}\big(e^{s\cos 2\pi(u+v)} + e^{s\cos 2\pi(u-v)}\big)\big(e^{s\cos 2\pi(u+w)} + e^{s\cos 2\pi(u-w)}\big).$$
Now use the sum-to-product formula on the cosines to obtain
$$K_s(x, y) K_s(x, z) = \frac{e^{-2s}}{4}\Big(e^{2s\cos 2\pi(\frac{v-w}{2})\cos 2\pi(u + \frac{v+w}{2})} + e^{2s\cos 2\pi(\frac{v-w}{2})\cos 2\pi(u - \frac{v+w}{2})} + e^{2s\cos 2\pi(\frac{v+w}{2})\cos 2\pi(u + \frac{v-w}{2})} + e^{2s\cos 2\pi(\frac{v+w}{2})\cos 2\pi(u - \frac{v-w}{2})}\Big). \tag{28}$$
We simplify this expression by introducing
$$m^\pm = \frac{v \pm w}{2} \qquad \text{and} \qquad \sigma^\pm = \cos 2\pi m^\pm. \tag{29}$$
Then, eq. (28) becomes
$$K_s(x, y) K_s(x, z) = \frac{e^{-2s}}{4}\Big(e^{2s\sigma^-\cos 2\pi(u + m^+)} + e^{2s\sigma^-\cos 2\pi(u - m^+)} + e^{2s\sigma^+\cos 2\pi(u + m^-)} + e^{2s\sigma^+\cos 2\pi(u - m^-)}\Big). \tag{30}$$
We recognize the definition of the kernel (which is not a surprise, as we chose the kernel to be stable by product). However, we would need variables in $(0, 1/2)$ to retrieve the proper definition of the kernel. Instead, we use lemma 5 on eq. (30), combined with eq. (27), to obtain
$$p_\omega = \frac{(1 + 1_{\omega \neq 0})\, e^{-2s}}{4}\Big(\cos(2\pi\omega m^+) I_\omega(2s\sigma^-) + \cos(2\pi\omega m^+) I_\omega(2s\sigma^-) + \cos(2\pi\omega m^-) I_\omega(2s\sigma^+) + \cos(2\pi\omega m^-) I_\omega(2s\sigma^+)\Big),$$
which gives
$$p_\omega = \frac{(1 + 1_{\omega \neq 0})\, e^{-2s}}{2}\Big(\cos(2\pi\omega m^+) I_\omega(2s\sigma^-) + \cos(2\pi\omega m^-) I_\omega(2s\sigma^+)\Big). \tag{31}$$
Equation (31) contains the Chebyshev coefficients of the product of two kernel functions as defined in eq. (27). Plugging this result into the definition of $g$ in eq. (26), and noting that $\cos(2\pi\omega m^\pm) = H_\omega(\cos 2\pi m^\pm) = H_\omega(\sigma^\pm)$, we obtain the result.

[Figure 3: Certificate vs. RKHS norm of $f$, for a given model $g$ with a fixed number of parameters ($f$ has 1146 coefficients and $g$ has 22528 parameters). The best certificate is kept among a set of optimization hyperparameters. As the norm of $f$ decreases, fitting $f - f^\star$ with $g$ is easier and the certificate becomes tighter.]

Thanks to lemma 3, we see that a model $g$ defined as in definition 1 with the Bessel kernel $K_s$ of eq. (24) has its Chebyshev coefficients decaying as $O(I_\omega(2s))$. Hence, it belongs to $\mathcal{H}_{2s}$, the RKHS associated with $K_{2s}$.

C Additional details on the experiments

Tuning the hyperparameters. The times reported in section 4 do not take into account the experiments needed to find a good set of hyperparameters. The parameters tuned were the type of optimizer, the decay of the learning rate, and the regularization on the Frobenius norm of $G$.

Regularization. Regularization is performed by approximating the HS norm with a proxy which is faster to compute: we use $\|\tilde{R}_j^\top \tilde{R}_k\|_{HS}^2$ instead of $\|\tilde{R}_j^\top Q_{jk} \tilde{R}_k\|_{HS}^2$ in eq. (22).

Hardware. GloptiNets was run on NVIDIA V100 GPUs for the interpolation part, and on an Intel Xeon CPU E5-2698 v4 @ 2.20GHz for computing the certificate. TSSOS was run on an Apple M1 chip with the Mosek solver.

Configuration of TSSOS. We use the lowest possible relaxation order (i.e., $\lceil \deg f / 2 \rceil$), along with chordal sparsity. We use the first relaxation step of the hierarchy. In these settings, TSSOS is not guaranteed to converge to $f^\star$ but executes the fastest.

Certificate vs. number of parameters for a given function. In fig. 2, the target function is a random polynomial of norm 1 or 2, or a kernel mixture with 10 coefficients of norm 1 or 2. The models forming the blue line are defined as in proposition 2, with rank, block size and number of blocks equal to $(1, b_s, 1)$ respectively, with $b_s$ the block size we vary. The number of frequencies sampled to compute the certificate is $1.6 \cdot 10^7$, and accounts for the fact that the bound on the variance becomes larger than the MoM estimator for large models.

Certificate vs. problem difficulty for a given model.
We have three related parameters: the quality of the optimization (given by the certificate), the expressivity of the model (given by its number of parameters), and the difficulty of the optimization (given by the norm of the function). In fig. 3, we fix the expressivity and plot the relation between the other two. Here, we fix the model with parameters $(8, 16, 128)$, and we optimize a polynomial in dimension 3 of degree 12, with RKHS norm ranging from 1 to 20. The certificates obtained are given in fig. 3. The resulting plot exhibits a clear polynomial relation between the certificate and the norm of the function, with a slope of 0.88. This suggests that the certificate behaves as $O(\|f\|_{\mathcal{H}_{2s}}^{1/2})$.

Comparison with TSSOS on the Fourier basis. In table 1, the polynomials $f$ all have an RKHS norm of 1. The small model is defined as in proposition 2, with rank, block size and number of blocks equal to 4, 32, 8 respectively. For the big models, those values are 8, 128, 16. The certificate is the maximum of the Chebyshev bound of theorem 2 and the MoM bound of theorem 3. The number of frequencies sampled is $3.2 \cdot 10^7$.

Table 2: GloptiNets and TSSOS on random Chebyshev polynomials. The same conclusion as in table 1 applies. While TSSOS is very efficient on small problems, its memory requirements grow exponentially with the problem size. GloptiNets has less accuracy, but a computational burden which does not increase with the problem size.

| d | p | n    | TSSOS Certif.       | TSSOS t | GN-small Certif.    | GN-small t     | GN-big Certif.      | GN-big t       |
|---|---|------|---------------------|---------|---------------------|----------------|---------------------|----------------|
| 4 | 3 | 255  | $3.4 \cdot 10^{-7}$ | 6       | $1.1 \cdot 10^{-2}$ | $2 \cdot 10^2$ | $4.1 \cdot 10^{-3}$ | $1 \cdot 10^3$ |
| 4 | 4 | 624  | $2.1 \cdot 10^{-9}$ | 153     | $2.5 \cdot 10^{-2}$ | $2 \cdot 10^2$ | $3.6 \cdot 10^{-3}$ | $1 \cdot 10^3$ |
| 4 | 5 | 1295 | out of memory       | -       | $1.8 \cdot 10^{-2}$ | $2 \cdot 10^2$ | $4.2 \cdot 10^{-3}$ | $2 \cdot 10^3$ |

Comparison with TSSOS on the Chebyshev basis. We compare GloptiNets with TSSOS on random Chebyshev polynomials in table 2, similarly to the comparison with trigonometric polynomials in table 1. Minimizing polynomials defined on the canonical basis is easier: contrary to trigonometric polynomials, there is no need to account for the imaginary part of the variable. If $d$ is the dimension, complex polynomials are encoded in a variable of dimension $2d$ in TSSOS, following the definition of Hermitian sums-of-squares introduced in Josz and Molzahn [2018]. Hence, the random polynomials we consider are characterized by the dimension $d$ and their number of coefficients $n$; instead of bounding the degree, we use all the basis elements $H_\omega(x) = \prod_{\ell=1}^{d} H_{\omega_\ell}(x_\ell)$ for which $\|\omega\|_\infty \leq p$. The maximum degree is then $dp$. The RKHS norm of $f$ is fixed to 1. As with the comparison on trigonometric polynomials in table 1, we see that GloptiNets provides similar certificates no matter the number of coefficients in $f$. Even though it lags behind TSSOS for small polynomials, it handles large polynomials which are intractable for TSSOS. The small and big models have the same structure as for the trigonometric polynomial experiments.

Sampling from the Bessel distribution. The function $\omega \mapsto e^{-s} I_\omega(s)$ decays rapidly. In fact, with $s = 2$, which is the value used to generate the random polynomials, it falls under machine precision as soon as $\omega > 14$. Thus, we approximate the distribution with a discrete one with weights $e^{-s} I_\omega(s)$ for the $\omega$ such that the result is above machine precision. We then extend it to multiple dimensions with a tensor product. Finally, we use a hash table to store the already sampled frequencies, to make the evaluation of millions of frequencies much faster: for instance, in dimension 5, sampling $10^6$ frequencies from the Bessel distribution of parameter $s = 2$ on $\mathbb{N}^5$ yields only $10^4$ unique frequencies (a short illustrative sketch follows).
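The sketch below illustrates this sampling-and-deduplication step; the truncation tolerance and helper names are choices made for the illustration, not the released code.

```python
# Sketch of sampling frequencies from the truncated Bessel distribution and deduplicating
# them with a hash table, so that each unique omega is evaluated only once.
import numpy as np
from collections import Counter
from scipy.special import iv

def bessel_weights(s, tol=1e-16):
    """Truncate the weights exp(-s)*I_w(s) once they fall below tol; return a symmetric pmf."""
    w, vals = 0, []
    while True:
        p = np.exp(-s) * iv(w, s)
        if p < tol and w > 0:
            break
        vals.append(p)
        w += 1
    vals = np.array(vals)
    full = np.concatenate([vals[:0:-1], vals])             # symmetric support {-W, ..., W}
    return np.arange(-(len(vals) - 1), len(vals)), full / full.sum()

def sample_frequencies(d, s, n, rng):
    support, probs = bessel_weights(s)
    draws = rng.choice(support, size=(n, d), p=probs)       # coordinate-wise tensor product
    return Counter(map(tuple, draws))                        # unique omegas with multiplicities

rng = np.random.default_rng(0)
counts = sample_frequencies(d=5, s=2.0, n=100_000, rng=rng)
print(len(counts), "unique frequencies out of", sum(counts.values()), "draws")
```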
This deduplication allows for tighter certificates, as it makes the deviation term in eq. (9), which decreases as $1/\sqrt{N}$, much smaller. Note that the time needed to generate this hash table is not reported in tables 1 and 2; it is of the order of a few seconds.

Optimizing a kernel mixture. As is the case with polynomials, when optimizing a function of the form $h(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$, the certificate provided by GloptiNets only depends on the function norm $\|h\|_{\mathcal{H}}^2$ and not on, e.g., the number of coefficients $n$. This is illustrated in fig. 4.

[Figure 4: Certificate vs. number of parameters in $g$ when certifying mixtures of Bessel functions, characterized by their RKHS norm (1 in blue, 2 in red) and their number of coefficients (10 in circles, 100 in rectangles). As with polynomials, this shows that GloptiNets is only sensitive to the former, and not to the way the function is represented. We are not aware of other algorithms able to certify this class of functions.]

D Fourier coefficients in linear time

Lemma 4 (Fourier coefficients of the Bessel kernel in linear time). Let $g$ be an extended k-SoS model as in definition 1. Then, its Fourier transform can be evaluated in time linear in $m$ with
$$\hat{g}_\omega = \sum_{k=1}^{r} \Big\langle \sum_{i=1}^{m} R_{ki}\, \phi_{1,-}(z_{i1}) \otimes \cdots \otimes \phi_{d,-}(z_{id}),\ \sum_{i=1}^{m} R_{ki}\, \phi_{1,+}(z_{i1}) \otimes \cdots \otimes \phi_{d,+}(z_{id}) \Big\rangle, \tag{32}$$
where, for $n \in \mathbb{Z}$, $z \in \mathbb{T}$ and $\ell \in [d]$,
$$\phi_{\ell,-}(z)_n = \sqrt{q_{\ell,n}}\, e^{i\pi(n - \omega_\ell)z}, \qquad \phi_{\ell,+}(z)_n = \sqrt{q_{\ell,n}}\, e^{-i\pi(n + \omega_\ell)z},$$
the pairing $\langle \cdot, \cdot \rangle$ is the (bilinear) sum over $n \in \mathbb{Z}^d$ of the products of coordinates, and $q_{\ell,n} = q_{\omega_\ell,n}(s_\ell)$ is defined in lemma 6.

Lemma 4 provides a formula for computing $\hat{g}_\omega$ which is linear in $m$, but which still requires a numerical approximation to compute the sum over $n \in \mathbb{Z}^d$. For instance, restricting the sum to the hyperbolic cross Dũng et al. [2017], $\big\{n \in \mathbb{Z}^d : \prod_{\ell=1}^{d} \max\{1, |n_\ell|\} \leq n\big\}$, would result in a complexity of $O(n (\log n)^{d-1} m r)$ and should produce reasonably accurate estimates of $\hat{g}_\omega$ for low $n$. Furthermore, since $q$ is real and even w.r.t. $n$, the inner product in eq. (36) can be simplified by computing only half of the terms.

Proof. From lemma 1, we have that
$$\hat{g}_\omega = \sum_{i,j=1}^{m} R_i^\top R_j \prod_{\ell=1}^{d} e^{-2s_\ell}\, I_{|\omega_\ell|}\big(2s_\ell\cos\pi(z_{i\ell} - z_{j\ell})\big)\, e^{-i\pi\omega_\ell(z_{i\ell} + z_{j\ell})}. \tag{33}$$
Introducing
$$f_\ell(x, y) = e^{-2s_\ell}\, I_{|\omega_\ell|}\big(2s_\ell\cos\pi(x - y)\big)\, e^{-i\pi\omega_\ell(x + y)}, \tag{34}$$
eq. (33) simplifies to
$$\hat{g}_\omega = \sum_{i,j=1}^{m} R_i^\top R_j \prod_{\ell=1}^{d} f_\ell(z_{i\ell}, z_{j\ell}). \tag{35}$$
Using lemma 6, for any $x, y \in \mathbb{T}$,
$$e^{-2s_\ell}\, I_{|\omega_\ell|}\big(2s_\ell\cos\pi(x - y)\big) = \sum_{n \in \mathbb{Z}} q_{\ell,n}\, e^{i\pi n(x - y)} \qquad (q_{\ell,n} \text{ depends on } \omega_\ell),$$
so that $f_\ell$, defined in eq. (34), now writes
$$f_\ell(x, y) = \sum_{n \in \mathbb{Z}} q_{\ell,n}\, e^{i\pi n(x - y)}\, e^{-i\pi\omega_\ell(x + y)} = \sum_{n \in \mathbb{Z}} q_{\ell,n}\, e^{i\pi(n - \omega_\ell)x}\, e^{-i\pi(n + \omega_\ell)y} = \phi_{\ell,-}(x)^\top \phi_{\ell,+}(y), \tag{36}$$
where, for any $\ell \in \{1, \ldots, d\}$ and $z \in \mathbb{T}$, the $\phi_{\ell,\mp}(z)$ are defined as in the statement of the lemma. We then define the embeddings $\phi_\mp : \mathbb{T}^d \to (\mathbb{Z}^d \to \mathbb{C})$ as the tensor products of the $\phi_{\ell,\mp}$. Then, eq. (36) enables us to write $\hat{g}_\omega$ in eq. (35) as
$$\hat{g}_\omega = \sum_{i,j=1}^{m} \sum_{k=1}^{r} R_{ki} R_{kj}\, \phi_-(z_i)^\top \phi_+(z_j) = \sum_{k=1}^{r} \Big(\sum_{i=1}^{m} R_{ki}\phi_-(z_i)\Big)^\top \Big(\sum_{j=1}^{m} R_{kj}\phi_+(z_j)\Big), \tag{37}$$
which is the desired result.

E Other computations

Lemma 5. Let $f$ be the function defined on $(-1, 1)$ by
$$f(\cos 2\pi u) = e^{s\cos 2\pi(u - v)}, \qquad u \in (0, 1/2). \tag{38}$$
Then, its Chebyshev coefficients are given by
$$f_\omega = (1 + 1_{\omega \neq 0})\cos(2\pi\omega v)\, I_\omega(s). \tag{39}$$

Proof. Let $\omega \in \mathbb{N}$, $\omega \geq 1$ (the case $\omega = 0$ is handled at the end). The component $\omega$ of a function $f$ on the Chebyshev basis is given by $\frac{2}{\pi}\int_{-1}^{1} f(x) T_\omega(x) \frac{dx}{\sqrt{1 - x^2}}$, which we conveniently rewrite, with the classical change of variable $x = \cos 2\pi u$, as
$$f_\omega = 2\int_{I_1} f(\cos 2\pi u)\cos(2\pi\omega u)\, du, \tag{40}$$
which is valid for any interval $I_1 \subset \mathbb{R}$ of length 1. Now, for $s > 0$, consider the function $f$ defined on $(-1, 1)$ by $x \mapsto e^{s\cos(\arccos(x) - 2\pi v)}$, or equivalently
$$f(\cos 2\pi u) = e^{s\cos 2\pi(u - v)}, \qquad u \in (0, 1/2). \tag{41}$$
Putting eq. (41) into eq. (40), we obtain
$$f_\omega = 2\int_{I_1} e^{s\cos 2\pi(u - v)}\cos(2\pi\omega u)\, du = 2\int_{I_1} e^{s\cos 2\pi u}\cos(2\pi\omega(u + v))\, du = 2\cos(2\pi\omega v)\int_{I_1} e^{s\cos 2\pi u}\cos(2\pi\omega u)\, du - 2\sin(2\pi\omega v)\int_{I_1} e^{s\cos 2\pi u}\sin(2\pi\omega u)\, du.$$
The last term is odd, and hence integrates to 0 on an interval centered around 0.
Hence,
$$f_\omega = 2\cos(2\pi\omega v)\int_{I_1} e^{s\cos 2\pi u}\cos(2\pi\omega u)\, du. \tag{42}$$
We recognize the definition of the modified Bessel function of the first kind, as in eq. (14). Plugging this into eq. (42), we obtain
$$f_\omega = 2\cos(2\pi\omega v)\, I_\omega(s) = 2 I_\omega(s)\, H_\omega(\cos(2\pi v)). \tag{43}$$
If $\omega = 0$, we add a factor $1/2$ to the definition in eq. (40), which yields
$$f_0 = I_0(s). \tag{44}$$

Lemma 6 (Fourier decomposition of a Bessel function composed with a cosine). Let $s > 0$, $\omega \in \mathbb{N}$ and $z \in \mathbb{T}$. Then,
$$e^{-2s}\, I_\omega(2s\cos 2\pi z) = \sum_{n \in \mathbb{Z}} q_{\omega,n}\, e^{2\pi i n z}, \qquad \text{where, for } n \geq 0, \quad q_{\omega,n} = \begin{cases} e^{-2s}\displaystyle\sum_{p \geq \left(\frac{n-\omega}{2}\right)_+} \frac{(s/2)^{2p+\omega}}{p!\,(p+\omega)!}\binom{2p+\omega}{p - \frac{n-\omega}{2}} & \text{if } n \equiv \omega \pmod 2, \\ 0 & \text{otherwise,} \end{cases} \tag{45}$$
and $q_{\omega,-n} = q_{\omega,n}$ by evenness of the coefficients.

Proof. From the definition of the modified Bessel function of the first kind [Watson, 1922, p.77, Eq. 2], we have $I_\omega(x) = \sum_{p \geq 0} \frac{(x/2)^{2p+\omega}}{p!\,(p+\omega)!}$, so that
$$I_\omega(2s\cos 2\pi z) = \sum_{p \geq 0} \frac{s^{2p+\omega}}{p!\,(p+\omega)!}\cos(2\pi z)^{2p+\omega} = \sum_{p \geq 0} \frac{s^{2p+\omega}}{p!\,(p+\omega)!}\left(\frac{e^{2\pi i z} + e^{-2\pi i z}}{2}\right)^{2p+\omega} = \sum_{p \geq 0} \frac{(s/2)^{2p+\omega}}{p!\,(p+\omega)!}\sum_{k=0}^{2p+\omega}\binom{2p+\omega}{k}\, e^{2\pi i(2(p-k)+\omega)z}. \tag{46}$$
Using the change of variable $n = 2(p - k) + \omega$ in eq. (46), we see that $n$ has the same parity as $\omega$ and
$$I_\omega(2s\cos 2\pi z) = \sum_{p \geq 0} \frac{(s/2)^{2p+\omega}}{p!\,(p+\omega)!}\sum_{\substack{n = -(2p+\omega) \\ n \equiv \omega\ (2)}}^{2p+\omega}\binom{2p+\omega}{p - \frac{n-\omega}{2}}\, e^{2\pi i n z}. \tag{47}$$
Equation (47) can be rewritten as
$$I_\omega(2s\cos 2\pi z) = \sum_{\substack{n \in \mathbb{Z} \\ n \equiv \omega\ (2)}} \left(\sum_{p \geq 0} \frac{(s/2)^{2p+\omega}}{p!\,(p+\omega)!}\binom{2p+\omega}{p - \frac{n-\omega}{2}}\, 1_{-(2p+\omega) \leq n \leq 2p+\omega}\right) e^{2\pi i n z},$$
for which eq. (45) is a concise rewriting (up to the factor $e^{-2s}$).
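As a closing sanity check, the sketch below compares the closed form of lemma 6, as reconstructed in eq. (45), with Fourier coefficients computed numerically by FFT; treat the closed form in the code as mirroring that reconstruction, not as the released implementation.

```python
# Numerical check of lemma 6: the Fourier coefficients of z -> exp(-2s)*I_omega(2s*cos(2*pi*z))
# match the combinatorial expression q_{omega, n} of eq. (45).
import numpy as np
from math import comb, factorial
from scipy.special import iv

def q(omega, n, s, terms=40):
    n = abs(n)
    if (n - omega) % 2 != 0:
        return 0.0
    p0 = max(0, (n - omega) // 2)
    return np.exp(-2 * s) * sum(
        (s / 2) ** (2 * p + omega) / (factorial(p) * factorial(p + omega))
        * comb(2 * p + omega, p - (n - omega) // 2)
        for p in range(p0, p0 + terms))

s, omega, M = 2.0, 3, 256
z = np.arange(M) / M
coeffs = np.fft.fft(np.exp(-2 * s) * iv(omega, 2 * s * np.cos(2 * np.pi * z))) / M
for n in range(6):
    print(n, coeffs[n].real, q(omega, n, s))     # the two columns agree
```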