# GloptiNets: Scalable Non-Convex Optimization with Certificates

Gaspard Beugnot (gaspard.beugnot@inria.fr), Inria, École normale supérieure, CNRS, PSL Research University, 75005 Paris, France
Julien Mairal (julien.mairal@inria.fr), Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
Alessandro Rudi (alessandro.rudi@inria.fr), Inria, École normale supérieure, CNRS, PSL Research University, 75005 Paris, France

We present a novel approach to non-convex optimization with certificates, which handles smooth functions on the hypercube or on the torus. Unlike traditional methods that rely on algebraic properties, our algorithm exploits the regularity of the target function intrinsic in the decay of its Fourier spectrum. By defining a tractable family of models, we allow at the same time to obtain precise certificates and to leverage the advanced and powerful computational techniques developed to optimize neural networks. In this way the scalability of our approach is naturally enhanced by parallel computing with GPUs. Our approach, when applied to the case of polynomials of moderate dimensions but with thousands of coefficients, outperforms the state-of-the-art optimization methods with certificates, as the ones based on Lasserre's hierarchy, addressing problems intractable for the competitors.

1 Introduction

Non-convex optimization is a difficult and crucial task. In this paper, we aim at optimizing globally a non-convex function defined on the hypercube, by providing a certificate of optimality on the resulting solution. Let $h$ be a smooth function on $[-1, 1]^d$. Here we provide an algorithm that, given $\hat{x}$, an estimate of the minimizer $x^\star$ of $h$,
$$x^\star = \arg\min_{x \in [-1,1]^d} h(x),$$
produces an $\epsilon_\delta$ that constitutes an explicit certificate for the quality of $\hat{x}$, of the form
$$|h(x^\star) - h(\hat{x})| \leq \epsilon_\delta, \quad \text{with probability } 1 - \delta.$$

The literature abounds with algorithms to optimize non-convex functions. Typically they are either (a) heuristics, very smart, but with no guarantees of global convergence Moscato et al. [1989], Horst and Pardalos [2013]; (b) variations of algorithms used in convex optimization, which can guarantee convergence only to local minima Boyd and Vandenberghe [2004]; (c) algorithms with only asymptotic guarantees of convergence to a global minimum, but no explicit certificates Van Laarhoven et al. [1987]. In general, the methods recalled above are quite fast to produce some solution, but don't provide guarantees on its quality, with the result that the produced point can be arbitrarily far from the optimum, so they are typically used where non-reliable results can be accepted. On the contrary, there are contexts where an explicit quantification of the incurred error is crucial for the task at hand (finance, engineering, scientific validation, safety-critical scenarios Lasserre [2009]). In these cases, more expensive methods that provide certificates are used, such as polynomial sum-of-squares (poly-SoS) Lasserre [2001, 2009]. These kinds of techniques are quite powerful since they provide certificates in the form above, often with machine-precision error.
However, (a) they have reduced applicability, since $h$ must be a multivariate polynomial (possibly sparse, low-degree) and must be known in its analytical form, and (b) the resulting algorithm is a semi-definite program over matrices whose size grows very fast with the number of variables and the degree of the polynomial, becoming intractable already at moderate dimensions and degrees.

Our approach builds on and extends the more recent line of works on kernel sums-of-squares, and in particular the work of Woodworth et al. [2022] based on Fourier analysis. It mitigates the limitations of poly-SoS methods in both aspects: (a) we can deal with any function $h$ (not necessarily a polynomial) for which the Fourier transform is known, and (b) the resulting algorithm leverages the smoothness properties of the objective function as Woodworth et al. [2022], rather than relying on its algebraic structure, leading to far more compact representations than poly-SoS. Contrary to Woodworth et al. [2022], we fully leverage the power of the certificate, allowing for a drastic reduction of the computational cost of the method. Indeed, we cast the minimization in terms of a much smaller problem, similar to the optimization of a small neural network, which, albeit again non-convex, efficiently produces a good candidate on which we then compute the certificate. Notably, our focus lies on a posteriori guarantees: we define a family of models that allow for efficient computation of certificates. Once the model structure is established, we have ample flexibility in training the model, offering various possibilities to achieve good certificates in practical scenarios, while still using well-established and effective techniques from the field of deep neural networks (DNN) Goodfellow et al. [2016] to reduce the computational burden of the approach.

Our contributions can be summarized as follows:
- We propose a new approach to global optimization with certificates which drastically extends the applicability domain allowed by the state of the art, since it can be applied to any function for which we can compute the Fourier transform (not just polynomials).
- The proposed approach is naturally tailored for GPU computations and provides a refined control of the time and memory requirements of the algorithm, contrary to poly-SoS methods (whose complexity scales dramatically and in a rigid way with the dimension and degree of the polynomial).
- From a technical viewpoint, we improve the results in Woodworth et al. [2022], by developing a fast stochastic approach to recover the certificate with high probability (theorem 3), and we generalize the formulation of the problem to allow the use of powerful techniques from DNNs, while still providing a certificate on the result (section 3, in particular alg. 1).
- In practical applications, we are able to provide certificates for functions in moderate dimensions, which surpasses the capabilities of current state-of-the-art techniques. Specifically, as shown in the experiments, we can handle polynomials with thousands of coefficients. This achievement marks an important milestone towards utilizing these models to provide certificates for more challenging real-life problems.

1.1 Previous work

Polynomial SoS. In the field of certificate-based polynomial optimization, Lasserre's hierarchy plays a pivotal role Lasserre [2001, 2009].
This hierarchy employs a sequence of SDP relaxations of increasing size proportional to $O(r^d)$ (where $d$ is the dimension of the space and $r$ is a parameter that upper bounds the degree of the polynomial), which ultimately converges to the optimal solution when $r \to \infty$. While Lasserre's hierarchy is primarily associated with polynomial optimization, its applicability extends beyond this domain. It offers a specific formulation for the more general moment problem, enabling a wide range of applications; see Henrion et al. [2020] for an introduction. For polynomial optimization problems such as in eq. (1), a significant amount of research has been dedicated to leveraging problem structure to improve the scalability of the hierarchy. This research has predominantly focused on exploiting very specific sparsity patterns among the variables of the polynomial, enabling the handling, in these restricted scenarios, of instances ranging from a few variables to even thousands of variables Waki et al. [2006], Wang et al. [2021b,a]. There have been theoretical results regarding optimization on the hypercube Bach and Rudi [2023], Laurent and Slot [2022], but there are no algorithms handling them natively. Furthermore, alternative approaches exist that exploit different types of structure, such as the constant trace property Mai et al. [2022].

Kernel SoS. Kernel sums-of-squares (K-SoS) is an emerging research field that originated from the introduction of a novel parametrization for positive functions in Marteau-Ferey et al. [2020]. This approach has found application in various domains, including optimal control Berthier et al. [2022], optimal transport Muzellec et al. [2021] and modeling probability distributions Rudi and Ciliberto [2021]. In the context of function optimization, two types of theoretical results have been explored: a priori guarantees Rudi et al. [2020] and a posteriori guarantees Woodworth et al. [2022]. A priori guarantees offer insights into the convergence rate towards a global optimum of the function, giving a rate on the number of parameters and the complexity necessary to optimize a function up to a given error. For example, Rudi et al. [2020] proposes a general technique to reach the global optimum with error $\epsilon$ of a function that is $s$-times differentiable, by requiring a number of parameters essentially of the order of $O(\epsilon^{-d/s})$, thus avoiding the curse of dimensionality in the rate when the function is very regular, i.e., $s \gg d$, while typical black-box optimization algorithms have a complexity that scales as $\epsilon^{-d}$. A posteriori guarantees focus on providing a certificate for the minimum found by the algorithm. In particular, Woodworth et al. [2022] provides both a priori guarantees and a posteriori certificates; however, the model considered makes it computationally infeasible to provide certificates in dimension larger than 2. To conclude, approaches based on kernel SoS extend the applicability of global optimization with certificates to a wider family of functions, and exploit finer regularity properties beyond just the number of variables and the degree of a polynomial. By comparison, we focus on making the optimization amenable to high-performance GPU computation while retaining an a posteriori certificate of optimality.
2 Computing certificates with extended k-SoS

Without loss of generality (see the next remark), with the goal of simplifying the analysis and using powerful tools from harmonic analysis, we cast the problem in terms of the minimization of a periodic function $f$ over the torus $[0, 1]^d$ (which we will also denote by $\mathbb{T}^d$). In particular, we are interested in minimizing periodic functions for which we know (or can easily compute) the coefficients of the Fourier representation, i.e.,
$$f^\star = \min_{z \in \mathbb{T}^d} f(z), \qquad f(z) = \sum_{\omega \in \mathbb{Z}^d} \hat{f}_\omega e^{2\pi i \omega \cdot z}, \quad z \in \mathbb{T}^d, \tag{1}$$
where $\mathbb{Z}$ is the set of integers. This setting is already interesting on its own, as it encompasses a large class of smooth functions. It includes notably trigonometric polynomials, i.e., functions which have only a finite number of non-zero Fourier coefficients $\hat{f}_\omega$. Optimization of trigonometric polynomials arises in multiple research areas, such as optimal power flow Van Hentenryck [2018] or quantum mechanics Hilling and Sudbery [2010]. Note that this problem is already NP-hard, as it encompasses for instance the Max-Cut problem Waldspurger et al. [2013]. Even so, we will consider the more general case where we can evaluate function values of $f$, along with its Fourier coefficients $\hat{f}_\omega$, and we have access to its norm in a certain Hilbert space. This norm can be computed numerically for trigonometric polynomials, and more generally reflects the regularity (degree of differentiability) of the function, and thus the difficulty of the problem.

Remark 1 (No loss of generality in working on the torus). Given a (non-periodic) function $h : [-1, 1]^d \to \mathbb{R}$ we can obtain a periodic function whose minimum is exactly $h^\star$ and from which we can recover $x^\star$. Indeed, following the classical Chebyshev construction, define $\cos(2\pi z)$ as the componentwise application of $\cos$ to the elements of $2\pi z$, i.e., $\cos(2\pi z) := (\cos(2\pi z_1), \ldots, \cos(2\pi z_d))$, and define $f$ as $f(z) := h(\cos(2\pi z))$ for $z \in [0, 1]^d$. It is immediate to see that (a) $f$ is periodic, and (b) since $\cos(2\pi z)$ is invertible on $[0, 1/2]^d$ and its image is exactly $[-1, 1]^d$, we have $h^\star = h(x^\star) = f(z^\star)$, where $x^\star = \cos(2\pi z^\star)$ and $z^\star = \arg\min_{z \in \mathbb{T}^d} f(z)$. We discuss an efficient representation of these problems in section 3.3.

2.1 Certificates for global optimization and k-SoS

A general recipe for obtaining certificates was developed in Woodworth et al. [2022], where in particular the following bound was derived [Woodworth et al., 2022, see Thm. 2]:
$$f^\star \geq \sup_{c \in \mathbb{R},\, g \in \mathcal{G}_+} c - \|f - c - g\|_F, \tag{2}$$
where $\|u\|_F$ is the $\ell_1$ norm of the Fourier coefficients of a periodic function $u$, i.e.,
$$\|u\|_F = \sum_{\omega \in \mathbb{Z}^d} |\hat{u}_\omega|, \tag{3}$$
and the sup is taken over a class $\mathcal{G}_+$ of non-negative functions. The paper Woodworth et al. [2022] then chooses $\mathcal{G}_+$ to be the set of positive semidefinite models, leading to a possibly expensive convex SDP problem. Our approach instead starts from the following two observations: (a) the lower bound in eq. (2) holds for any set $\mathcal{G}_+$ of non-negative functions, not necessarily convex; moreover, (b) any candidate solution $(g, c)$ of the supremum in eq. (2) constitutes a lower bound on $f^\star$, so there is no need to solve eq. (2) exactly. This yields the following theorem.

Theorem 1. Given a point $\hat{x} \in \mathbb{T}^d$ and a non-negative, periodic function $g_0 : \mathbb{T}^d \to \mathbb{R}_+$, we have
$$|f(\hat{x}) - f(x^\star)| \leq \|f - f(\hat{x}) - g_0\|_F. \tag{4}$$

Proof. Since $x^\star$ is the minimizer of $f$, we have $f(x^\star) \leq f(\hat{x})$. Moreover, since $c_0 = f(\hat{x})$ and $g_0$ are feasible for the r.h.s. of eq. (2), we have
$$f(\hat{x}) \geq f(x^\star) \geq \sup_{c \in \mathbb{R},\, g \in \mathcal{G}_+} c - \|f - c - g\|_F \geq c_0 - \|f - c_0 - g_0\|_F,$$
from which we derive that $0 \leq f(\hat{x}) - f(x^\star) \leq \|f - f(\hat{x}) - g_0\|_F$.
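To make theorem 1 concrete, the following minimal Python sketch computes the certificate $\|f - f(\hat{x}) - g_0\|_F$ for a small trigonometric polynomial and a hand-picked non-negative candidate $g_0$. The coefficients, the candidate point and the helper names are toy illustrations added by the editor, not taken from the paper's code.

```python
# Illustrative sketch of theorem 1: given the Fourier coefficients of a trigonometric
# polynomial f and of a non-negative candidate g0, the quantity ||f - f(x_hat) - g0||_F
# bounds |f(x_hat) - f(x*)|. All coefficients and the point x_hat below are toy data.
import numpy as np

def f_norm_of_difference(f_coeffs, g_coeffs, c):
    """Sum over omega of |f_hat_omega - c*1{omega=0} - g_hat_omega|."""
    support = set(f_coeffs) | set(g_coeffs)
    total = 0.0
    for omega in support:
        u = f_coeffs.get(omega, 0.0) - g_coeffs.get(omega, 0.0)
        if omega == tuple(0 for _ in omega):
            u -= c
        total += abs(u)
    return total

def eval_trig_poly(coeffs, x):
    """Evaluate f(x) = sum_omega f_hat_omega * exp(2*pi*i*<omega, x>); real part for real f."""
    return sum(c * np.exp(2j * np.pi * np.dot(omega, x)) for omega, c in coeffs.items()).real

# Toy example in d = 2. g0(x) = 0.6 + 0.4*cos(2*pi*x1) >= 0.2, so it is non-negative.
f_coeffs = {(0, 0): 0.5, (1, 0): 0.2, (-1, 0): 0.2, (0, 2): 0.1, (0, -2): 0.1}
g_coeffs = {(0, 0): 0.6, (1, 0): 0.2, (-1, 0): 0.2}
x_hat = np.array([0.48, 0.27])   # candidate minimizer (toy value)
eps = f_norm_of_difference(f_coeffs, g_coeffs, eval_trig_poly(f_coeffs, x_hat))
print(f"certificate: |f(x_hat) - f*| <= {eps:.4f}")
```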
In particular, since any good candidate $g_0$ is enough to produce a certificate, we consider the following class of non-negative functions, which can be seen as a two-layer neural network.

Definition 1 (extended k-SoS model on the torus). Let $K : \mathbb{T}^d \times \mathbb{T}^d \to \mathbb{R}$ be a function which is periodic in its first variable and let $m, r \in \mathbb{N}$. Given a set of anchors $Z = (z_1, \ldots, z_m) \in \mathbb{T}^{d \times m}$ and a matrix $R \in \mathbb{R}^{r \times m}$, we define the k-SoS model $g$ with
$$g(x) = \|R K_Z(x)\|_2^2, \qquad K_Z(x) = (K(x, z_1), \ldots, K(x, z_m)) \in \mathbb{R}^m, \quad x \in \mathbb{T}^d. \tag{5}$$

The functions represented by the model above are non-negative and periodic. The model is an extension of the k-SoS model presented in Marteau-Ferey et al. [2020], where the points $(z_1, \ldots, z_m)$ cannot be optimized. Moreover, it has the following benefits, at the expense of convexity in the parameters:
1. The extended k-SoS models benefit from the good approximation properties of k-SoS models described in Marteau-Ferey et al. [2020] and especially Rudi and Ciliberto [2021], since they are a super-set of the k-SoS models, which have optimal approximation properties for non-negative functions.
2. The extended model can have a reduced number of parameters, by choosing a matrix $R$ with $r = 1$ or $r \ll m$. This drastically improves the cost of the optimization, while not impacting the approximation properties of the model, since a good approximation is still possible with $r$ already proportional to $d$ [Rudi et al., 2020, see Thm. 3].
3. The extended model does not require any positive semidefinite constraint on the matrix (contrary to the base model), which is a well-known bottleneck when scaling up the optimization in the number of parameters Marteau-Ferey et al. [2020].

In the extended model we trade the positive semidefinite constraint for non-convexity. However, this allows us to use all the advanced and effective techniques known for unconstrained (or box-constrained) non-convex optimization of (two-layer) neural networks Goodfellow et al. [2016]. To conclude the picture on k-SoS models, a critical aspect is the choice of $K$: it must guarantee good approximation properties and, at the same time, we need to compute its Fourier coefficients easily, since we need to evaluate $\|f - c - g\|_F$. To this aim, good candidates for $K$ are the reproducing kernels defined on the torus Steinwart and Christmann [2008]. We use shift-invariant kernels, enabling a convenient analysis of the associated RKHS through their Fourier transform.

Definition 2 (Reproducing kernel on the torus). Let $q$ be a real function on $\mathbb{T}^d$, with positive Fourier transform and $q(0) = 1$. Let $K$ be the kernel defined with
$$K(x, y) = q(x - y) = \sum_{\omega \in \mathbb{Z}^d} \hat{q}_\omega e^{2\pi i \omega \cdot (x - y)}, \qquad x, y \in \mathbb{T}^d. \tag{6}$$
Then $K$ is a reproducing kernel bounded by 1. We denote by $\mathcal{H}$ its reproducing kernel Hilbert space (RKHS) and by $\|f\|_{\mathcal{H}}^2 = \sum_{\omega \in \mathbb{Z}^d} |\hat{f}_\omega|^2 / \hat{q}_\omega$ the associated squared RKHS norm. Define $\lambda(x) = q(x)^2$. We assume that we can compute (and sample from, see next section) $\hat{\lambda}_\omega$, i.e., the Fourier transform of $\lambda$, corresponding to $(\hat{q} * \hat{q})_\omega$, for all $\omega \in \mathbb{Z}^d$.

By choosing such a $K$, the models inherit the good approximation properties derived in Marteau-Ferey et al. [2020], Rudi and Ciliberto [2021]. We recall that shift-invariant reproducing kernels have a positive Fourier transform due to Bochner's theorem Rudin [1990]. The fact that $K$ is bounded by 1 can be seen from $|K(x, x)| = |q(0)| = \sum_\omega \hat{q}_\omega = 1$. Finally, note that the Fourier coefficients of an extended k-SoS model can be computed exactly, as shown later in lemma 1.
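As an illustration of definition 1, the short sketch below evaluates an extended k-SoS model with a shift-invariant kernel of the form $q_s(x) = e^{s(\cos(2\pi x) - 1)}$ (the kernel used later in eq. (14)). The shapes, names and values are assumptions made for the illustration, not the released implementation.

```python
# Minimal sketch of the extended k-SoS model of definition 1: g(x) = ||R K_Z(x)||_2^2,
# with a Bessel-type shift-invariant kernel. R has shape (r, m), Z stacks m anchors in [0,1)^d.
import numpy as np

def kernel(x, Z, s):
    """Product over dimensions of q_{s_l}(x_l - z_l); x: (d,), Z: (m, d), s: (d,). Returns (m,)."""
    return np.exp(np.sum(s * (np.cos(2 * np.pi * (x - Z)) - 1.0), axis=-1))

def ksos_eval(x, Z, R, s):
    """Non-negative by construction: squared Euclidean norm of R @ K_Z(x)."""
    return float(np.sum((R @ kernel(x, Z, s)) ** 2))

rng = np.random.default_rng(0)
d, m, r = 3, 32, 4
Z = rng.random((m, d))                          # anchors on the torus
R = rng.standard_normal((r, m)) / np.sqrt(m)    # weight matrix
s = 2.0 * np.ones(d)
print(ksos_eval(rng.random(d), Z, R, s))        # always >= 0
```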
2.2 Providing certificates with the F-norm

[Figure 1: $f - f^\star$ is a trigonometric polynomial approximated by an extended k-SoS model $g$. The $L^\infty$ norm of the difference (blue) is upper-bounded by the F-norm (red), which is itself upper bounded by the MoM inequality in theorem 3 (our bound), with probability 98%, shown here with respect to the number $N$ of sampled frequencies. The shaded area shows min/max values across 10 estimations.]

As discussed in the previous section, our approach for providing a certificate on $f^\star$ relies on first obtaining $\hat{x}$ using a fast algorithm without guarantees, and then solving eq. (2) approximately to obtain the certificate (see theorem 1). With this aim, we now need an efficient way to compute the norm $\|\cdot\|_F$. We use here a stochastic approach. Introducing a probability distribution $(\hat{\lambda}_\omega)_\omega$ on $\mathbb{Z}^d$ (which will later be chosen as a rescaled version of $\hat{\lambda}_\omega$ in definition 2), we rewrite the F-norm as
$$\|u\|_F = \sum_{\omega \in \mathbb{Z}^d} \hat{\lambda}_\omega \frac{|\hat{u}_\omega|}{\hat{\lambda}_\omega} = \mathbb{E}_{\omega \sim \hat{\lambda}}\left[\frac{|\hat{u}_\omega|}{\hat{\lambda}_\omega}\right], \tag{7}$$
which yields an objective that is amenable to stochastic optimization. From there, Woodworth et al. [2022] computes a certificate by truncating the sum to a hypercube $\{\omega : \|\omega\|_\infty \leq N\}$ of size $N^d$ and bounding the remaining terms with a smoothness assumption on $u = f - c - g$, which enables control of the decay of $\hat{u}_\omega$. We want to avoid this cost, exponential in the dimension, so we proceed differently.

Probabilistic estimates with the $\mathcal{H}$ norm. Given that the F-norm can be written as an expectation in eq. (7), we approximate it with an empirical mean $\hat{S}$ computed from $N$ i.i.d. samples $(\omega_i)_{1 \leq i \leq N} \sim \hat{\lambda}_\omega$. Now, note that the variance of $\hat{S}$ can be upper bounded by a Hilbert norm, since $\mathbb{E}_{\omega \sim \hat{\lambda}}\big[|\hat{u}_\omega|^2/\hat{\lambda}_\omega^2\big] = \sum_{\omega} |\hat{u}_\omega|^2/\hat{\lambda}_\omega$, so that
$$\mathrm{Var}\,\hat{S} \leq \frac{1}{N}\,\|u\|_{\mathcal{H}_\lambda}^2, \tag{8}$$
with $\mathcal{H}_\lambda$ the RKHS from definition 2 with kernel $K(x, x') = \sum_{\omega \in \mathbb{Z}^d} \hat{\lambda}_\omega e^{2\pi i \omega \cdot (x - x')}$. This allows us to quantify the deviation of $\hat{S}$ from $\mathbb{E}[\hat{S}] = \|u\|_F$ with, e.g., Chebyshev's inequality, as shown in the next theorem.

Theorem 2 (Certificate with Chebyshev's inequality). Let $(\hat{\lambda}_\omega)_\omega$ be a probability distribution on $\mathbb{Z}^d$, $\delta \in (0, 1)$ and $g$ a positive function. Let $N > 0$ and $\hat{S}$ be the empirical mean of $|\hat{f}_\omega - c\,1_{\omega=0} - \hat{g}_\omega|/\hat{\lambda}_\omega$ obtained with $N$ samples $\omega_i \sim \hat{\lambda}_\omega$. Then, a certificate holding with probability $1 - \delta$ is given by
$$f^\star \geq c - \hat{S} - \frac{\|f - c - g\|_{\mathcal{H}_\lambda}}{\sqrt{N\delta}}, \qquad \hat{S} = \frac{1}{N}\sum_{i=1}^{N} \frac{|\hat{f}_{\omega_i} - c\,1_{\omega_i=0} - \hat{g}_{\omega_i}|}{\hat{\lambda}_{\omega_i}}. \tag{9}$$

Proof. From its definition in eq. (7), we see that $\hat{S}$ is an unbiased estimator of the F-norm. Then, Chebyshev's inequality states that $|\hat{S} - \|u\|_F|^2 \leq \mathrm{Var}\,\hat{S}/\delta$ with probability at least $1 - \delta$. Using the bound on the variance in eq. (8), it follows that $\|u\|_F \leq \hat{S} + \|f - c - g\|_{\mathcal{H}_\lambda}/\sqrt{N\delta}$ with probability at least $1 - \delta$. Plugging this expression into eq. (2), we obtain the result.

Note that the norm in $\mathcal{H}_\lambda$ can be developed as (assuming for conciseness that $(-c)$ is absorbed into the 0-frequency of $f$)
$$\|f - c - g\|_{\mathcal{H}_\lambda}^2 = \sum_{\omega \in \mathbb{Z}^d} \frac{\hat{f}_\omega^*(\hat{f}_\omega - 2\hat{g}_\omega)}{\hat{\lambda}_\omega} + \|g\|_{\mathcal{H}_\lambda}^2 \leq \big(\|f\|_{\mathcal{H}_\lambda} + \|g\|_{\mathcal{H}_\lambda}\big)^2. \tag{10}$$
Thus, theorem 2 provides a certificate on $f^\star$ as long as we can (i) evaluate the Fourier transform $\hat{g}_\omega$ of $g$ and (ii) compute its Hilbert norm in some RKHS $\mathcal{H}_\lambda$ induced by $\hat{\lambda}_\omega$. In the next section, we detail the choices we make to achieve this efficiently, with kernels amenable to GPU computation, scaling to thousands of coefficients.

Remark 2 (Using a RKHS norm instead of the F-norm). Note that since $(\hat{\lambda}_\omega)_\omega$ sums to 1, the associated kernel is bounded by 1. Hence $\|u\|_{L^\infty} \leq \|u\|_{\mathcal{H}_\lambda}$, and the latter could be used instead of the F-norm in eq. (2). There are two reasons for taking our approach instead. Firstly, the F-norm is always tighter than a RKHS norm (see e.g. [Woodworth et al., 2022, Lem. 4]); secondly, we cannot compute $\|u\|_{\mathcal{H}_\lambda}$ efficiently and have to rely instead on another upper bound. However, taking the number of samples $N = O(\|u\|_{\mathcal{H}_\lambda}^2)$ alleviates this issue.
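The structure of the estimator in theorem 2 can be illustrated with a toy one-dimensional example. Everything in the sketch below (the truncated distribution, the coefficients of $u$) is made up for the illustration and merely stands in for the Bessel weights used in the paper.

```python
# Sketch of the certificate estimator of theorem 2 with a toy probability on Z standing
# in for hat_lambda, and toy Fourier coefficients u_hat for u = f - c - g.
import numpy as np

rng = np.random.default_rng(0)
W = 50                                                 # truncation of the toy distribution
omegas = np.arange(-W, W + 1)
lam = 2.0 ** (-np.abs(omegas))
lam /= lam.sum()                                       # toy hat_lambda on {-W, ..., W}
u_hat = 0.1 * 3.0 ** (-np.abs(omegas))                 # toy coefficients of u

F_norm = np.sum(np.abs(u_hat))                         # exact ||u||_F, for reference
H_norm = np.sqrt(np.sum(np.abs(u_hat) ** 2 / lam))     # ||u||_{H_lambda}, bounds the variance

N, delta = 10_000, 0.02
idx = rng.choice(len(omegas), size=N, p=lam)
S_hat = np.mean(np.abs(u_hat[idx]) / lam[idx])         # unbiased estimate of ||u||_F
bound = S_hat + H_norm / np.sqrt(N * delta)            # upper bound w.p. >= 1 - delta (eq. 9)
print(F_norm, S_hat, bound)
```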
Exponential concentration bounds with MoM. The scaling in $1/\sqrt{\delta}$ in theorem 2 can be prohibitive if one requires a high probability on the result ($\delta \ll 1$). Fortunately, alternative estimators exist for those cases. The median-of-means (MoM) estimator is an example, used in theorem 3.

Theorem 3 (Certificate with MoM estimator). Let $(\hat{\lambda}_\omega)_\omega$ be a probability distribution on $\mathbb{Z}^d$, and $\delta \in (0, 1)$. Draw $N > 0$ frequencies $\omega_i \sim \hat{\lambda}_\omega$. Define the MoM estimator as follows: for $K \in \mathbb{N}$ s.t. $\delta = e^{-K/8}$ and $N = Kb$, $b \geq 1$, write $B_1, \ldots, B_K$ a partition of $[N]$; then
$$\mathrm{MoM}_\delta\left(\frac{|\hat{u}_{\omega_i}|}{\hat{\lambda}_{\omega_i}}\right) = \operatorname{median}_{1 \leq k \leq K}\left(\frac{1}{b}\sum_{i \in B_k} \frac{|\hat{f}_{\omega_i} - c\,1_{\omega_i=0} - \hat{g}_{\omega_i}|}{\hat{\lambda}_{\omega_i}}\right). \tag{11}$$
A certificate on $f^\star$ holding with probability $1 - \delta$ follows, with
$$f^\star \geq c - \mathrm{MoM}_\delta\left(\frac{|\hat{u}_{\omega_i}|}{\hat{\lambda}_{\omega_i}}\right) - 4\,\|f - c - g\|_{\mathcal{H}_\lambda}\sqrt{\frac{\log(1/\delta)}{N}}. \tag{12}$$

Proof. Using e.g. Theorem 4.1 from Devroye et al. [2016], we get that the deviation of the MoM estimator from the expectation is bounded as
$$\left|\,\|u\|_F - \mathrm{MoM}_\delta\left(\frac{|\hat{u}_{\omega_i}|}{\hat{\lambda}_{\omega_i}}\right)\right| \leq 4\sqrt{\frac{\mathrm{Var}(|\hat{u}_\omega|/\hat{\lambda}_\omega)\log(1/\delta)}{N}} \qquad \text{with probability } 1 - \delta. \tag{13}$$
Using the upper bound on the variance with the $\mathcal{H}_\lambda$ norm given in eq. (8) and plugging the resulting expression into eq. (2), we obtain the result.

To conclude this section, bounding the $L^\infty$ norm from above with the F-norm in eq. (3) enables us to obtain a certificate on $f^\star$, as shown in theorem 1. The F-norm requires an infinite number of computations in the general case, but can be bounded efficiently with a probabilistic estimate, given by theorem 2 or theorem 3. This is summed up in fig. 1. Note that the gap between the F-norm and the $L^\infty$ norm is a source of conservatism in the certificate which we do not quantify; yet, the F-norm is optimal among a class of norms, see [Woodworth et al., 2022, Lemma 3].

3 Algorithm and implementation

3.1 Bessel kernel

We now detail the specific choice of kernel we make in order to compute the certificate of theorem 2 or theorem 3 efficiently. Our first observation is to use a kernel stable by product, so that we can easily characterize a Hilbert space the model $g$ belongs to. This restricts the choice to the exponential family. That is why we define, for a parameter $s > 0$,
$$q_s(x) = e^{s(\cos(2\pi x) - 1)} = \sum_{\omega \in \mathbb{Z}} e^{-s} I_{|\omega|}(s)\, e^{2\pi i \omega x}, \qquad x \in \mathbb{T}, \tag{14}$$
with $I_{|\omega|}(\cdot)$ the modified Bessel function of the first kind [Watson, 1922, p.181]. Then, define $K_s(x, y) = q_s(x - y)$ as in definition 2, and take a tensor product to extend the definition of $K$ to multiple dimensions, i.e., $K_s(x, y) = \prod_{\ell=1}^{d} K_{s_\ell}(x_\ell, y_\ell)$ for any $x, y \in \mathbb{T}^d$ and $s \in \mathbb{R}_+^d$. We refer to this kernel as the Bessel kernel, and to the associated RKHS as $\mathcal{H}_s$. It is stable by product, as $K_s(x, y) = K_{s/2}(x, y)^2$. This is key to computing the Fourier transform of the model $g$, and contrasts with previous approaches, which used the exponential kernel with $\hat{q}_\omega \propto e^{-s|\omega|}$ Woodworth et al. [2022], Bach and Rudi [2023]. In the following, $g$ is a k-SoS model defined as in definition 1, with the Bessel kernel of parameter $s \in \mathbb{R}_+^d$ defined in eq. (14).

Lemma 1 (Fourier coefficients of the Bessel kernel). For $\omega \in \mathbb{Z}^d$, the Fourier coefficient of $g$ at $\omega$ can be computed in $O(drm^2)$ time with
$$\hat{g}_\omega = \sum_{i,j=1}^{m} R_i^\top R_j \prod_{\ell=1}^{d} e^{-2s_\ell}\, I_{|\omega_\ell|}\big(2s_\ell \cos \pi(z_{i\ell} - z_{j\ell})\big)\, e^{-i\pi\omega_\ell(z_{i\ell} + z_{j\ell})}, \tag{15}$$
where $R_i \in \mathbb{R}^r$ denotes the $i$-th column of $R$.

Proof. From its definition in eq. (5), we rewrite $g$ as
$$g(x) = \sum_{i,j=1}^{m} R_i^\top R_j \prod_{\ell=1}^{d} K_{s_\ell}(x_\ell, z_{i\ell}) K_{s_\ell}(x_\ell, z_{j\ell}). \tag{16}$$
Now, from the definition of the Bessel kernel in eq. (14), we have that for any $(x, y, z) \in \mathbb{T}^3$,
$$K(x, y)K(x, z) = e^{-2s}\, e^{2s \cos(2\pi(y - z)/2)\, \cos 2\pi(x - (y+z)/2)}.$$
By definition of the modified Bessel function, the Fourier coefficients (in $x$) of this expression are given by $e^{-2s} I_{|\omega|}\big(2s \cos(2\pi(y - z)/2)\big) e^{-i\pi\omega(y + z)}$. Using this in eq. (16), we get the result.

The second necessary ingredient for using the certificate of theorem 2 is computing an RKHS norm for $g$; before turning to it, the short numerical check below illustrates the expansion in eq. (14).
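This is an editorial sanity check, relying on the standard generating-function identity for modified Bessel functions; it is not part of the paper's code.

```python
# Quick numerical check of the expansion in eq. (14):
# q_s(x) = exp(s*(cos(2*pi*x) - 1)) = sum_omega exp(-s) * I_|omega|(s) * exp(2*pi*i*omega*x).
import numpy as np
from scipy.special import iv   # modified Bessel function of the first kind

s, x = 2.0, 0.37
lhs = np.exp(s * (np.cos(2 * np.pi * x) - 1.0))
omegas = np.arange(-30, 31)                      # coefficients decay fast; 30 terms suffice
rhs = np.sum(np.exp(-s) * iv(np.abs(omegas), s) * np.exp(2j * np.pi * omegas * x)).real
print(abs(lhs - rhs))                            # ~ machine precision
```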
Computing this norm relies on the inclusion of $\mathcal{H}_{2s}$ into the bigger space of symmetric operators $S(\mathcal{H}_s)$.

Lemma 2 (Bound on the RKHS norm of $g$). $g$ belongs to $\mathcal{H}_{2s}$, and $\|g\|_{\mathcal{H}_{2s}}$ is bounded by the Hilbert-Schmidt norm of an associated operator in $S(\mathcal{H}_s)$, which can be computed in $O(dm^2 + mr^2)$ time, with
$$\|g\|_{\mathcal{H}_{2s}}^2 \leq \|g\|_{S(\mathcal{H}_s)}^2 = \mathrm{Tr}\,(R K_{s,Z} R^\top)^2. \tag{17}$$

Proof. Assume that $d = 1$; the reasoning extends to multiple dimensions with the tensor product. From the computation of the Fourier coefficients in lemma 1 and the fact that $I_{|\omega|}(2s\cos(2\pi t)) \leq I_{|\omega|}(2s)$, we have $\hat{g}_\omega = O(I_{|\omega|}(2s))$, hence $g \in \mathcal{H}_{2s}$. Finally, since the kernel is stable by product, $\mathcal{H}_{2s} = \mathcal{H}_s \cdot \mathcal{H}_s$, so we can use e.g. [Paulsen and Raghupathi, 2016, Thm. 5.16], with $H_1 = H_2 = \mathcal{H}_s$ and $S(\mathcal{H}_s) = \mathcal{H}_s \otimes \mathcal{H}_s$, with the operator $A = (\phi(z_1), \ldots, \phi(z_m))\, R^\top R\, (\phi(z_1), \ldots, \phi(z_m))^* \in S(\mathcal{H}_s)$.

With lemma 2, we have that the model $g$ belongs to $\mathcal{H}_{2s}$, so we will naturally use $\hat{\lambda}_\omega = \prod_{\ell=1}^{d} e^{-2s_\ell} I_{|\omega_\ell|}(2s_\ell)$ in theorem 2; said differently, the space $\mathcal{H}_\lambda$ introduced in eq. (8) is simply $\mathcal{H}_{2s}$ defined through eq. (14).

3.2 The algorithm: GloptiNets

We can now describe how GloptiNets yields a certificate on $f^\star$. The key observation is that no matter how our model $g(R, z)$ from definition 1 is obtained, we will always be able to compute a certificate with theorems 2 and 3. Thus, even though optimizing eq. (9) w.r.t. $(c, R, z)$ is highly non-convex, we can use any optimization routine and check its efficiency empirically by looking at the certificate. Moreover, thanks to its low-rank structure, it is cheaper to evaluate $g$ than to evaluate its Fourier coefficients; this is formally shown in proposition 2 in appendix A, where a block-diagonal structure for the model is also introduced. That is why we first optimize $\sup_{c, g}\, c - \|f - c - g\|_\bullet$, where $\|\cdot\|_\bullet$ is a proxy for the $L^\infty$ norm, e.g. the log-sum-exp (LSE) on a random batch of $N$ points:¹
$$\|f - c - g\|_{L^\infty} \approx \max_{i \in [N]} |f(x_i) - c - g(x_i)| \approx \mathrm{LSE}\big(f(x_i) - c - g(x_i)\big)_{i \in [N]}. \tag{18}$$
This optimization can be carried out by any deep learning library with automatic differentiation and any flavour of gradient ascent. Only afterwards do we compute the certificate with theorems 2 and 3. This procedure is summed up in alg. 1.

¹ Another detail of practical importance is that this loss can be efficiently backpropagated through; on the other hand, the certificate is not easily vectorized, and the Bessel functions involved would require specific approximations to be efficiently backpropagated through.

Algorithm 1: GloptiNets
Data: A trigonometric polynomial $f$, a candidate $z$ s.t. $c = f(z)$, a model $g$, and a probability $\delta$.
Result: A certificate $|f^\star - f(z)| \leq \epsilon_\delta$ with probability $1 - \delta$.
/* Optimize g with function values */
for epoch = 1:nepochs do
    Sample $x_1, \ldots, x_N \in \mathbb{T}^d$;
    $L, \nabla L \leftarrow \mathrm{autodiff}\big(\mathrm{LSE}(f(x_i) - c - g(x_i))_{i \in [N]}\big)$;
    $z, R \leftarrow \mathrm{optimizer}(\nabla L)$;
/* Compute a certificate */
Let $\hat{\lambda}_\omega$ be the probability distribution on $\mathbb{Z}^d$ with $\hat{\lambda}_\omega = \prod_{\ell=1}^{d} e^{-2s_\ell} I_{|\omega_\ell|}(2s_\ell)$;
Sample $\Omega = (\omega_1, \ldots, \omega_N) \sim \hat{\lambda}_\omega$;
Compute $M = \mathrm{MoM}_\delta\big(|\hat{f}_{\omega_i} - c\,1_{\omega_i=0} - \hat{g}_{\omega_i}|/\hat{\lambda}_{\omega_i}\big)_{i \in [N]}$ and $\sigma = \|g\|_{S(\mathcal{H}_s)}$;
Return $\epsilon_\delta = M + 4\sigma\sqrt{\log(1/\delta)/N}$;

Remark 3 (Providing a candidate). In alg. 1, a candidate estimate $c$ for the minimum value $f(x^\star)$ is necessary. However, it is possible to overcome this requirement by incorporating $c$ as a learnable parameter within the training loop. Moreover, $x^\star$ can be learned using techniques similar to those in Rudi et al. [2020]: by replacing the lower bound $c$ with a parabola centered at $z$, $z$ becomes a candidate for $x^\star$ with precision corresponding to the tightness of the certificate. Note however that this method introduces additional hyperparameters.

3.3 Specific implementation for the Chebyshev basis

As already observed in Bach and Rudi [2023], a result on trigonometric polynomials on $\mathbb{T}^d$ directly extends to real polynomials on $[-1, 1]^d$. The reason is that minimizing $h$ on $[-1, 1]^d$ amounts to minimizing the trigonometric polynomial $f = h((\cos 2\pi x_1, \ldots, \cos 2\pi x_d))$ on $\mathbb{T}^d$ (a short sketch of this lift is given below).
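A minimal sketch of this lift, with a toy Chebyshev polynomial; the coefficients and helper names are illustrative assumptions, not the paper's code.

```python
# Sketch of the lift used in remark 1 / section 3.3: a polynomial h on [-1, 1]^d written in
# the Chebyshev basis becomes the trigonometric polynomial f(z) = h(cos(2*pi*z)) on the
# torus, since T_k(cos(2*pi*z)) = cos(2*pi*k*z).
import numpy as np

cheb_coeffs = {(2, 0): 1.0, (1, 1): -0.5, (0, 3): 0.25}   # toy h in d = 2

def h(x):   # h(x) = sum_omega c_omega * prod_l T_{omega_l}(x_l), with T_k(x) = cos(k*arccos(x))
    return sum(c * np.prod(np.cos(np.asarray(k) * np.arccos(x)))
               for k, c in cheb_coeffs.items())

def f(z):   # periodic lift: f(z) = h(cos(2*pi*z))
    return sum(c * np.prod(np.cos(2 * np.pi * np.asarray(k) * z))
               for k, c in cheb_coeffs.items())

z = np.array([0.13, 0.71])
print(np.isclose(h(np.cos(2 * np.pi * z)), f(z)))   # True: the two parametrizations agree
```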
Note however that $f$ is an even function in every coordinate: for any $x \in \mathbb{T}^d$ and any $i$, $f(x_1, \ldots, x_i, \ldots, x_d) = f(x_1, \ldots, -x_i, \ldots, x_d)$. Thus, approximating $f - f^\star$ with a k-SoS model of definition 1 is suboptimal, in the sense that we could approximate $f$ only on $[0, 1/2]^d$, a domain $2^d$ times smaller. Put differently, the Fourier coefficients of $f$ are real by design: it would be convenient to enforce this structure in the model $g$. This is achieved with proposition 1.

Proposition 1 (Kernel defined on the Chebyshev basis). Let $q$ be a real, even function on the torus, bounded by 1, as in eq. (6). Let $K$ be the kernel defined on $[-1, 1]$ by
$$K(\cos 2\pi u, \cos 2\pi v) = \tfrac{1}{2}\big(q(u + v) + q(u - v)\big), \qquad (u, v) \in (0, 1/2)^2. \tag{19}$$
Then $K$ is a symmetric, p.d., hence reproducing kernel, bounded by 1, with explicit feature map given by
$$K(x, y) = \hat{q}_0 + \sum_{\omega \geq 1} 2\hat{q}_\omega H_\omega(x) H_\omega(y), \qquad x, y \in [-1, 1], \tag{20}$$
where $H_\omega$ denotes the Chebyshev polynomial of degree $\omega$.

The proof is available in appendix B. It simply relies on expanding the definition of $K$ in eq. (19). The resulting expression in eq. (20) exhibits only cosine terms (in the decomposition of $x \mapsto K(\cos 2\pi x, y)$). This enables us to directly extend the PSD models from definition 1 with such kernels. Finally, when used with the Bessel kernel of eq. (14), we recover an easy computation of the Chebyshev coefficients, as shown in lemma 3, in $O(drm^2)$ time. This enables us to approximate any function expressed in the Chebyshev basis. Note that polynomials expressed in other bases can be certified too, by first operating a change of basis.

4 Experiments

[Figure 2: Certificate vs. number of parameters in $g$, for a given function $h$ (curves: random polynomials and kernel mixtures with $\|h\|_{\mathcal{H}} \in \{1, 2\}$). The higher the RKHS norm of $h$, the more difficult it is to approximate uniformly and the looser the certificate, independently of the function type. The more parameters in the k-SoS model, the tighter the certificates obtained with theorem 3.]

The code to reproduce these experiments is available at github.com/gaspardbb/GloptiNets.jl.

Settings. Given a function $h$, we compute a candidate $\hat{x}$ with gradient descent and multiple initializations. The goal is then to certify that $\hat{x}$ is indeed a global minimizer of $h$. This is a common setup in the polynomial-SoS literature Wang and Magron [2022]. To illustrate the influence of the number of parameters, the positive model $g$ defined in definition 1 for GloptiNets designates either a small model GN-small with 1792 parameters, or a bigger model GN-big with 22528 parameters. The latter should have higher expressivity and better interpolate positive functions, leading to tighter certificates. All results for GloptiNets are obtained with confidence $1 - \delta = 1 - e^{-4} \approx 98\%$. All other details regarding the experiments are reported in appendix C.

Polynomials. We first consider the case where $h$ is a random trigonometric polynomial. Note that this is a restrictive analysis, as GloptiNets can handle any smooth function (i.e., with infinitely many non-zero Fourier coefficients). The polynomials have various dimensions $d$, degrees $p$ and numbers of coefficients $n$, but a constant RKHS norm in $\mathcal{H}_{2\cdot\mathbf{1}_d}$. We compare the performance of GloptiNets to TSSOS, in its complex polynomial variant Wang and Magron [2022].
The latter is used with parameters such that it executes the fastest, but without guarantees of convergence to the global minimum $f^\star$. Table 1 shows the certificates on $h(\hat{x}) - h(x^\star)$ and the execution times (lower is better, $t$ in seconds) for TSSOS, GN-small and GN-big. Figure 2 provides certificates on a random polynomial, as a function of the number of parameters in $g$.

Kernel mixtures. While polynomials provide ground for comparison with existing work, GloptiNets is not confined to this function class. This is evidenced by experiments on kernel mixtures, where our approach stands as the only viable alternative we are aware of. The functions we certify are of the form $h(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$, where $K$ is the Bessel kernel of eq. (14). Kernel mixtures are ubiquitous in machine learning and arise e.g. when performing kernel ridge regression. Certificates obtained on mixtures are compared with those obtained on polynomials in fig. 2, as a function of the model size of $g$.

Table 1: GloptiNets and TSSOS on random trigonometric polynomials. While TSSOS provides machine-precision certificates, its running time grows exponentially with the problem size, and it eventually fails on problems 3 and 6. On the other hand, GloptiNets has constant running time no matter the problem size, and its certificates can be tightened by increasing the model size.

| d | p | n   | TSSOS Certif.        | TSSOS t | GN-small Certif.     | GN-small t     | GN-big Certif.       | GN-big t       |
|---|---|-----|----------------------|---------|----------------------|----------------|----------------------|----------------|
| 3 | 5 | 85  | $5.3 \cdot 10^{-11}$ | 3       | $8.35 \cdot 10^{-4}$ | $6 \cdot 10^3$ | $2.64 \cdot 10^{-4}$ | $9 \cdot 10^3$ |
| 3 | 7 | 231 | $4.7 \cdot 10^{-13}$ | 120     | $9.51 \cdot 10^{-4}$ | $6 \cdot 10^3$ | $2.90 \cdot 10^{-4}$ | $9 \cdot 10^3$ |
| 3 | 9 | 489 | out of memory        | -       | $1.18 \cdot 10^{-3}$ | $6 \cdot 10^3$ | $3.34 \cdot 10^{-4}$ | $9 \cdot 10^3$ |
| 4 | 3 | 33  | $3.1 \cdot 10^{-10}$ | 0.1     | $2.46 \cdot 10^{-2}$ | $1 \cdot 10^4$ | $3.45 \cdot 10^{-3}$ | $2 \cdot 10^4$ |
| 4 | 5 | 225 | $4.8 \cdot 10^{-12}$ | 53      | $3.71 \cdot 10^{-2}$ | $1 \cdot 10^4$ | $3.59 \cdot 10^{-3}$ | $2 \cdot 10^4$ |
| 4 | 7 | 833 | out of memory        | -       | $4.76 \cdot 10^{-2}$ | $1 \cdot 10^4$ | $4.85 \cdot 10^{-3}$ | $2 \cdot 10^4$ |

Results. There are two key insights about the performance of GloptiNets. Firstly, its certificate does not depend on the structure of the function to optimize. Thus, although GloptiNets does not match the performance of TSSOS on small polynomials, it can tackle polynomials which cannot be handled by competitors, with arbitrarily many coefficients. For instance, TSSOS cannot handle the problems with $n \in \{489, 833\}$ in table 1. More importantly, GloptiNets can certify a richer class of functions than polynomials, among which kernel mixtures. The performance of GloptiNets mostly depends on the complexity of the function to certify, as measured by its RKHS norm. Secondly, note that a bigger model yields tighter certificates. This is detailed in fig. 2, where the same function $f$ is optimized with various models. The dependency of the certificate on the norm of $f$ is shown in fig. 3 in appendix C, along with experiments with Chebyshev polynomials.

5 Limitations

One limitation of GloptiNets is the trade-off resulting from its high flexibility for obtaining a certificate as in alg. 1. While this flexibility offers numerous advantages, it also introduces the need for an extensive hyperparameter search. Although we have identified a set of hyperparameters that align with deep learning practices (a momentum optimizer with cosine decay and a large initial learning rate), the optimal settings may vary depending on the specific problem at hand. In the same vein, the certificates given by GloptiNets are of moderate accuracy. While adding more parameters to the k-SoS model certainly helps (as shown in fig. 2), alternative optimization schemes to interpolate $h - h(\hat{x})$ with $g$ might provide easier improvements. For instance, we found that using an approximate second-order scheme in alg. 1 is key to obtaining good certificates.
In the specific setting of polynomial optimization, we highlight that our model is not competitive on problems which exhibit some algebraic structure, such as term sparsity or the constant trace property. Typically, problems with coefficients of low degree (less than or equal to 2), which notably encompass the OPF problem, are very well handled by the family of solvers TSSOS belongs to. Finally, GloptiNets does not handle constraints yet.

6 Conclusion

The GloptiNets algorithm presented in this work lays the foundation for a new family of solvers which provide certificates for non-convex problems. While our approach does not aim to replace the well-established Lasserre's hierarchy for sparse polynomials, it offers a fresh perspective on tackling a new set of problems at scale. Through demonstrations on synthetic examples, we have showcased the potential of our approach. Further research directions include extensive parameter tuning to obtain tighter certificates, with the possibility of leveraging second-order optimization schemes, along with warm-restart schemes for applications which require solving multiple similar problems sequentially.

Acknowledgments. AR acknowledges support of the French government under the management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). AR acknowledges support of the European Research Council (grant REAL 947908). JM was supported by the ERC grant number 101087696 (APHELAIA project) and by ANR 3IA MIAI@Grenoble Alpes (ANR-19-P3IA-0003).

References

Francis Bach and Alessandro Rudi. Exponential convergence of sum-of-squares hierarchies for trigonometric polynomials. SIAM Journal on Optimization, 33(3):2137-2159, 2023.

Eloïse Berthier, Justin Carpentier, Alessandro Rudi, and Francis Bach. Infinite-dimensional sums-of-squares for optimal control. In 2022 IEEE 61st Conference on Decision and Control (CDC), pages 577-582, December 2022. doi: 10.1109/CDC51059.2022.9992396.

Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Luc Devroye, Matthieu Lerasle, Gabor Lugosi, and Roberto I. Oliveira. Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695-2725, 2016.

Dinh Dũng, Vladimir N. Temlyakov, and Tino Ullrich. Hyperbolic Cross Approximation. arXiv:2211.04889, April 2017.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Didier Henrion, Milan Korda, and Jean-Bernard Lasserre. The Moment-SOS Hierarchy, volume 4 of Optimization and Its Applications. World Scientific Publishing Europe Ltd., December 2020. doi: 10.1142/q0252.

Joseph J. Hilling and Anthony Sudbery. The geometric measure of multipartite entanglement and the singular values of a hypermatrix. Journal of Mathematical Physics, 51(7):072102, July 2010.

Reiner Horst and Panos M. Pardalos. Handbook of Global Optimization, volume 2. Springer Science & Business Media, 2013.

Cédric Josz and Daniel K. Molzahn. Lasserre hierarchy for large scale polynomial optimization in real and complex variables. SIAM Journal on Optimization, 28(2):1017-1048, January 2018.

Jean B. Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796-817, January 2001. doi: 10.1137/S1052623400366802.

Jean Bernard Lasserre. Moments, Positive Polynomials and Their Applications, volume 1 of Series on Optimization and Its Applications. October 2009. doi: 10.1142/p665.
Monique Laurent and Lucas Slot. An effective version of Schmüdgen's Positivstellensatz for the hypercube. Optimization Letters, September 2022. doi: 10.1007/s11590-022-01922-5.

Ngoc Hoang Anh Mai, Jean-Bernard Lasserre, Victor Magron, and Jie Wang. Exploiting constant trace property in large-scale polynomial optimization. ACM Transactions on Mathematical Software, 48(4):40:1-40:39, December 2022.

Ulysse Marteau-Ferey, Francis Bach, and Alessandro Rudi. Non-parametric models for non-negative functions. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12816-12826. Curran Associates, Inc., 2020.

Pablo Moscato et al. On evolution, search, optimization, genetic algorithms and martial arts: Towards memetic algorithms. Caltech Concurrent Computation Program, C3P Report, 826(1989):37, 1989.

Boris Muzellec, Adrien Vacher, Francis Bach, François-Xavier Vialard, and Alessandro Rudi. Near-optimal estimation of smooth transport maps with kernel sums-of-squares. arXiv:2112.01907, December 2021.

Vern I. Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces. Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2016. doi: 10.1017/CBO9781316219232.

Alessandro Rudi and Carlo Ciliberto. PSD representations for effective probability models. In Advances in Neural Information Processing Systems, volume 34, pages 19411-19422. Curran Associates, Inc., 2021.

Alessandro Rudi, Ulysse Marteau-Ferey, and Francis Bach. Finding global minima via kernel approximations. arXiv:2012.11978, December 2020.

Walter Rudin. The basic theorems of Fourier analysis. In Fourier Analysis on Groups, chapter 1, pages 1-34. John Wiley & Sons, Ltd, 1990. doi: 10.1002/9781118165621.ch1.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

Pascal Van Hentenryck. Machine learning for optimal power flows. INFORMS Tutorials in Operations Research, October 2018.

Peter J. M. van Laarhoven and Emile H. L. Aarts. Simulated Annealing. Springer, 1987.

Hayato Waki, Sunyoung Kim, Masakazu Kojima, and Masakazu Muramatsu. Sums of squares and semidefinite program relaxations for polynomial optimization problems with structured sparsity. SIAM Journal on Optimization, 17(1):218-242, January 2006. doi: 10.1137/050623802.

Irène Waldspurger, Alexandre d'Aspremont, and Stéphane Mallat. Phase recovery, MaxCut and complex semidefinite programming, July 2013.

Jie Wang and Victor Magron. Exploiting sparsity in complex polynomial optimization. Journal of Optimization Theory and Applications, 192(1):335-359, January 2022.

Jie Wang, Victor Magron, and Jean-Bernard Lasserre. Chordal-TSSOS: A moment-SOS hierarchy that exploits term sparsity with chordal extension. SIAM Journal on Optimization, 31(1):114-141, January 2021a. doi: 10.1137/20M1323564.

Jie Wang, Victor Magron, and Jean-Bernard Lasserre. TSSOS: A moment-SOS hierarchy that exploits term sparsity. SIAM Journal on Optimization, 31(1):30-58, January 2021b. doi: 10.1137/19M1307871.

G. N. Watson. A Treatise on the Theory of Bessel Functions. Cambridge University Press, 1922.

Blake Woodworth, Francis Bach, and Alessandro Rudi. Non-convex optimization with certificates and fast rates through kernel sums of squares. In Proceedings of the Thirty Fifth Conference on Learning Theory, pages 4620-4642. PMLR, June 2022.
A Extensions

We explore additional extensions of GloptiNets that further enhance its appeal. We first describe a block-diagonal structure for the model for faster evaluation, then a theoretical splitting scheme for optimization, and finally a warm-start scheme.

A.1 Block-diagonal structure for efficient computation

Without any further assumption, we see that a model from definition 1 can be evaluated in $O(drm)$ time; its Fourier coefficients, given by lemma 1, in $O(dm^2 r)$ time; and the bound on the RKHS norm in $O(dm^2 + mr^2)$ time thanks to lemma 2. All this enables computing a certificate, as stated in theorem 2, in $O(Ndm^2 r + mr^2)$ time, where $N$ is the number of frequencies sampled. If the function $f$ to be minimized has a big $\mathcal{H}_s$ norm, we might need a large model size $m$ to have $f - f^\star \approx g$. Hence, we introduce a specific structure on $G$ which makes it block-diagonal and better conditioned.

Proposition 2 (Block-diagonal PSD model). Let $g$ be a PSD model as in definition 1, with $m = bs$ anchors. Split them into $b$ groups, denoting them $z_{ij}$, $i \in [b]$ and $j \in [s]$. Compute the Cholesky factorization of each kernel matrix, $T_i^\top T_i = K_{z_i} \in \mathbb{R}^{s \times s}$. Then, define $G$ as a block-diagonal matrix, with $b$ blocks defined as $G_i = \tilde{R}_i \tilde{R}_i^\top$, $\tilde{R}_i = T_i^{-1} R_i$, and $R_i \in \mathbb{R}^{s \times r}$. Equivalently,
$$G = \begin{pmatrix} \tilde{R}_1 \tilde{R}_1^\top & & \\ & \ddots & \\ & & \tilde{R}_b \tilde{R}_b^\top \end{pmatrix}, \quad \text{s.t.} \quad g(x) = \sum_{i=1}^{b} \big\|\tilde{R}_i^\top K_{z_i}(x)\big\|^2, \qquad K_{z_i}(x) = \big(K(z_{ij}, x)\big)_{1 \leq j \leq s}. \tag{21}$$
Then $g$ can be evaluated in $O(rbs^3 d)$ time, $\hat{g}_\omega$ in $O(bs^2(dr + s))$ time, and $\|g\|_{S(\mathcal{H}_s)}^2$ in $O(b^2(rs^2 + r^2 s) + bs^3)$ time. The model has $(r + d)bs$ real parameters.

Proof. With $G$ defined as such, it is PSD, of rank at most $rb \leq sb = m$. Writing $g(x) = \sum_{i=1}^{b} \|\tilde{R}_i^\top K_{z_i}(x)\|^2$, we can compute the Fourier coefficients by applying lemma 1 to each of the $b$ components. Adding the cost of computing $G_i = \tilde{R}_i \tilde{R}_i^\top$ results in a complexity of $O(bs^2(dr + s))$. Finally, note that $\|g\|_{S(\mathcal{H}_s)}^2 = \|A\|_{S(\mathcal{H}_s)}^2$, where
$$A = \big((\phi(z_{1j}))_{j \in [s]}, \ldots, (\phi(z_{bj}))_{j \in [s]}\big)\,(\mathrm{Diag}\, G_i)_{i \in [b]}\,\big((\phi(z_{1j}))_{j \in [s]}, \ldots, (\phi(z_{bj}))_{j \in [s]}\big)^*.$$
Then, defining $Q$ as the matrix of $b \times b$ blocks of size $s \times s$ s.t. for $j, k \in [b]$, $Q_{jk} = K(z_j, z_k) \in \mathbb{R}^{s \times s}$, we have
$$\|A\|_{S(\mathcal{H}_s)}^2 = \mathrm{Tr}\, Q\,(\mathrm{Diag}\, G_i)_{i \in [b]}\, Q\,(\mathrm{Diag}\, G_i)_{i \in [b]} = \sum_{j,k=1}^{b} \mathrm{Tr}\, G_j Q_{jk} G_k Q_{kj}, \tag{22}$$
and each term in the sum can be written $\mathrm{Tr}\,\big(\tilde{R}_j^\top Q_{jk} \tilde{R}_k\big)\big(\tilde{R}_k^\top Q_{kj} \tilde{R}_j\big) = \|\tilde{R}_j^\top Q_{jk} \tilde{R}_k\|_{HS}^2$, which is computed in $O(rs^2 + r^2 s)$ time, plus $O(bs^3)$ to compute the Cholesky factors. Denoting $\phi_{z_i} = (\phi(z_{ij}))_{1 \leq j \leq s}$, note that
$$\phi_{z_i} G_i \phi_{z_i}^* = \phi_{z_i} T_i^{-1} R_i R_i^\top (\phi_{z_i} T_i^{-1})^* = E_i R_i R_i^\top E_i^*, \tag{23}$$
with $E_i = \phi_{z_i} T_i^{-1}$ an orthonormal basis of $\mathrm{Span}(\phi(z_{ij}))_{1 \leq j \leq s}$, since $E_i^* E_i = I_s$. Thus, each model's coefficients are expressed in an orthonormal basis, which makes the optimization easier. Of course, this comes at an added $s^3$ complexity, which could be alleviated by using e.g. an incomplete Cholesky factorization instead.

Remark 4 (Relation to term sparsity in POP). The successful application of polynomial hierarchies to problems with thousands of variables relies on giving the moment matrix $M$ a block structure Wang et al. [2021b,a]. If the monomial basis has size $m$, the constraint $M \succeq 0$ is replaced with $M = (\mathrm{Diag}\, M_i)_{i \in [b]}$ and $M_i \succeq 0$. This enables solving $b$ SDPs of size at most $s$ instead of one of size $m$. Our model in proposition 2 follows a similar route to lower the computational budget.
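An illustrative sketch of the block-wise evaluation of proposition 2 follows; the Cholesky convention, shapes and helper names are assumptions of this sketch rather than the released implementation.

```python
# Sketch of a block-diagonal extended k-SoS model: the m anchors are split into b blocks;
# each block is whitened with the Cholesky factor of its kernel matrix, and g(x) is a
# sum of per-block squared norms, hence non-negative.
import numpy as np

def kernel_matrix(Z1, Z2, s):
    """Bessel-type kernel K_s between two sets of points on the torus."""
    diff = Z1[:, None, :] - Z2[None, :, :]
    return np.exp(np.sum(s * (np.cos(2 * np.pi * diff) - 1.0), axis=-1))

def block_ksos_eval(x, blocks, weights, s, jitter=1e-10):
    """g(x) = sum_i || R_i @ (L_i^{-1} K_{z_i}(x)) ||^2, with K_{z_i} = L_i @ L_i.T."""
    total = 0.0
    for Z_i, R_i in zip(blocks, weights):
        K_i = kernel_matrix(Z_i, Z_i, s) + jitter * np.eye(len(Z_i))
        L_i = np.linalg.cholesky(K_i)
        k_x = kernel_matrix(x[None, :], Z_i, s)[0]          # K_{z_i}(x), shape (s_blk,)
        total += np.sum((R_i @ np.linalg.solve(L_i, k_x)) ** 2)
    return total

rng = np.random.default_rng(0)
d, b, s_blk, r = 3, 4, 8, 2
blocks = [rng.random((s_blk, d)) for _ in range(b)]
weights = [rng.standard_normal((r, s_blk)) / np.sqrt(s_blk) for _ in range(b)]
print(block_ksos_eval(rng.random(d), blocks, weights, 2.0 * np.ones(d)))
```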
A.2 Global optimization with a splitting scheme

While GloptiNets can provide certificates for functions, it falls behind local solvers in terms of competitiveness. The challenge lies in the fact that finding a certificate is considerably more difficult than finding a local minimum, as it necessitates the uniform approximation of the entire function. However, we present a novel algorithmic framework that has the potential to enhance the competitiveness of GloptiNets with local solvers while simultaneously delivering certificates. Our approach involves partitioning the search domain into multiple regions and computing lower bounds for each partition. By discarding portions of the domain where we can certify that the function exceeds a certain threshold, the algorithm progressively simplifies the optimization problem and removes areas from consideration. Moreover, such an approach is naturally well suited to parallel computation.

The algorithm relies on a divide-and-conquer mechanism. First, we split the hypercube $(-1, 1)^d$ into $N$ regions, where $N$ is the number of cores available. We compute an upper bound with a local solver. For each region, we run GloptiNets in parallel, computing a certificate at regular intervals. As soon as the certificate is bigger than the upper bound, we stop the process: we know that the global minimum is not in the associated region. We can then reallocate the freed computing power by splitting the biggest current region, which yields an easier problem. We stop as soon as the regions considered are small enough. This is summarized in alg. 2, where (P) indicates a loop run in parallel. Note that minimizing $f$ on a hypercube of center $\mu$ and size $\sigma$ amounts to minimizing $x \mapsto f(\mu + \sigma x)$ on $[-1, 1]^d$, which is another Chebyshev polynomial whose coefficients can be evaluated efficiently thanks to the order-2 recurrence relation every orthogonal polynomial family satisfies. For Chebyshev polynomials, that is $H_{\omega+1}(x) = 2x H_\omega(x) - H_{\omega-1}(x)$.

Algorithm 2: Splitting scheme with GloptiNets
Data: A Chebyshev polynomial $f$ with a unique global optimum, a probability $\delta$, a number of cores $N$ and a volume $\rho < 1/N$.
Result: A certificate on $f^\star$: $f^\star \geq C_{\delta'}(f)$ with probability $1 - \delta'$.
/* Initialization: upper bound and partition */
$\Pi = \mathrm{partition}([-1, 1]^d, N)$, $\delta' = 0$;
(P) $\mathrm{ub} = \min_{\pi \in \Pi} \{\mathrm{localsolver}_{x \in \pi} f(x)\}$;
/* Iterate over the partition */
(P) for $\pi \in \Pi$, while $\mathrm{length}(\Pi) > 1$ do
    while $C_\delta(f_\pi) < \mathrm{ub}$ do continue optimization;
    Split the biggest part: $\pi_0 = \arg\max_{\pi \in \Pi} \mathrm{Vol}(\pi)$; $(\pi_1, \pi_2) = \mathrm{partition}(\pi_0, 2)$;
    If $\mathrm{Vol}(\pi_{1,2}) < \rho$: end this process;
    Update the upper bound: $\mathrm{ub} = \min\big(\mathrm{ub},\ \mathrm{localsolver}_{x \in \pi_{1,2}} f(x)\big)$;
    Update the search space and $\delta'$: $\Pi = \Pi \setminus \{\pi, \pi_0\} \cup \{\pi_1, \pi_2\}$, $\delta' = 1 - (1 - \delta')(1 - \delta)$;
/* A single region in $\Pi$ remains */
Return $\Pi = \{\pi\}$, $C_{\delta'}(f_\pi)$, $\delta'$;

A.3 Warm restarts

Our model distinguishes itself by leveraging the analytical properties of the objective function, rather than relying solely on algebraic characteristics. This approach offers a notable advantage, as closely related functions can naturally benefit from a warm restart. For example, if we already have a certificate for a function $f$ using a PSD model $g$, and we seek to compute a certificate for a similar function $f' \approx f$, we can readily employ GloptiNets by initializing the PSD model with $g$. Indeed, if $f - f^\star \approx g$, we can expect $f' - f'^\star \approx g$, so we can expect the optimization to be faster. In contrast, poly-SoS methods, which rely on SDP programs, cannot directly adapt to new problems without significant effort. For instance, if a new component is introduced, an entirely new SDP must be solved. Our model's ability to accommodate related yet distinct problems could prove highly valuable in domains with a frequent need to certify different but closely related problems. In industry, the Optimal Power Flow (OPF) problem requires periodic solves every 5 minutes Van Hentenryck [2018].
With GloptiNets, once the initial challenging solve is performed, subsequent solves become easier, assuming minimal changes in supply and demand conditions.

A.4 Optimizing the certificate directly

As explained in section 3.2 where GloptiNets is introduced, we optimize a proxy of the $L^\infty$ norm rather than the certificate of theorems 2 and 3. This proxy is the log-sum-exp on a random batch of $N$ points. The reason for this is that evaluating an extended k-SoS model $g(x)$ at $x \in \mathbb{T}^d$ requires $O(drs)$ time, while evaluating $\hat{g}_\omega$ at $\omega \in \mathbb{Z}^d$ requires $O(drs^2)$ time. Yet, optimizing the certificate directly could probably help in obtaining higher-precision certificates. Lemma 4 in appendix D sketches a method to reduce the computational cost of the Fourier computation from $O(s^2)$ to $O(s)$.

B Kernel defined on the Chebyshev basis

In this section we describe the approach we take to model functions written in the Chebyshev basis. For $h$ such a polynomial, a naive approach would simply model $f = h(\cos(2\pi\,\cdot))$ as a trigonometric polynomial. However, note that the decomposition of $f$ only has cosine terms. Thus, approximating $f - f^\star$ efficiently requires a PSD model which has only cosine terms in its Fourier decomposition. This is achieved by using a kernel written in the Chebyshev basis, as introduced in proposition 1, for which we now provide a proof.

Proof of proposition 1. Let $x, y \in [-1, 1]$ and $u, v \in [0, 1/2]$ s.t. $x, y = \cos(2\pi u), \cos(2\pi v)$, by bijectivity of the cosine function on $[0, \pi]$. From the definition of $K$ in eq. (19) and the definition of $q$ in eq. (6), we have that
$$K(x, y) = \frac{1}{2}\sum_{\omega \in \mathbb{Z}} \hat{q}_\omega \big(e^{2\pi i\omega(u+v)} + e^{2\pi i\omega(u-v)}\big) = \sum_{\omega \in \mathbb{Z}} \hat{q}_\omega e^{2\pi i\omega u}\cos(2\pi\omega v) = \hat{q}_0 + 2\sum_{\omega \geq 1} \hat{q}_\omega \cos(2\pi\omega u)\cos(2\pi\omega v) = \hat{q}_0 + 2\sum_{\omega \geq 1} \hat{q}_\omega H_\omega(x) H_\omega(y).$$
Since $q$ has a positive Fourier transform, this makes the feature map of $K$ explicit, with $K(x, y) = \phi(x)^\top \phi(y)$ and $\phi(x)_\omega = \sqrt{(1 + 1_{\omega \neq 0})\,\hat{q}_\omega}\, H_\omega(x)$ for $\omega \in \mathbb{N}$. Hence the kernel is a reproducing kernel.

We now use this kernel with the Bessel function $x \mapsto e^{s(\cos(2\pi x) - 1)}$, i.e., we define the kernel $K$ on $[-1, 1]$ to satisfy
$$K(\cos(2\pi u), \cos(2\pi v)) = \frac{1}{2}\Big(e^{s(\cos(2\pi(u+v)) - 1)} + e^{s(\cos(2\pi(u-v)) - 1)}\Big), \qquad u, v \in (0, 1/2). \tag{24}$$
As was the case for the torus, this kernel enables an easy characterization of an RKHS in which an associated PSD model $g$ lives.

Lemma 3 (Chebyshev coefficients of the Bessel kernel). Let $g$ be a PSD model as in definition 1, with the kernel $K$ of eq. (24). Then, the Chebyshev coefficient $\hat{g}_\omega$, $\omega \in \mathbb{N}^d$, of $g$ can be computed in $O(rdm^2)$ time with
$$\hat{g}_\omega = \sum_{i,j=1}^{m} R_i^\top R_j \prod_{\ell=1}^{d} \frac{(1 + 1_{\omega_\ell \neq 0})\, e^{-2s_\ell}}{2}\Big( I_{\omega_\ell}(2s_\ell\sigma^-_{\ell ij})\, H_{\omega_\ell}(\sigma^+_{\ell ij}) + I_{\omega_\ell}(2s_\ell\sigma^+_{\ell ij})\, H_{\omega_\ell}(\sigma^-_{\ell ij}) \Big), \tag{25}$$
where $\sigma^\pm_{\ell ij} = \cos(2\pi m^\pm_{\ell ij})$, $m^\pm_{\ell ij} = (u_{\ell i} \pm u_{\ell j})/2$, and $\cos(2\pi u_{\ell i}) = z_{\ell i}$.

Proof. Expanding $g$ and the definition of the Chebyshev coefficients. From the definition of $g$ in eq. (5), we have
$$g(x) = \sum_{i,j=1}^{m} R_i^\top R_j \prod_{\ell=1}^{d} K_{s_\ell}(x_\ell, z_{\ell i}) K_{s_\ell}(x_\ell, z_{\ell j}). \tag{26}$$
We consider $x, y, z \in (-1, 1)$ and $s > 0$. We denote $u, v, w \in (0, 1/2)$ s.t. $x, y, z = \cos 2\pi u, \cos 2\pi v, \cos 2\pi w$, by bijectivity of $t \mapsto \cos(2\pi t)$ on $(0, 1/2)$. We now compute the Chebyshev coefficients of $x \mapsto K_s(x, y) K_s(x, z)$. Denoted $p_\omega$, they are given, for $\omega \in \mathbb{N}$, by
$$p_\omega = (1 + 1_{\omega \neq 0}) \int_{-1}^{1} K_s(x, y) K_s(x, z)\, T_\omega(x)\, \frac{dx}{\pi\sqrt{1 - x^2}},$$
or equivalently
$$p_\omega = (1 + 1_{\omega \neq 0}) \int_0^1 K_s(\cos 2\pi u, \cos 2\pi v)\, K_s(\cos 2\pi u, \cos 2\pi w)\, \cos(2\pi\omega u)\, du. \tag{27}$$

Chebyshev coefficients of the kernel product. With the definition of the kernel in proposition 1, eq. (19), we have
$$K_s(x, y) K_s(x, z) = \frac{1}{4}\big(q(u + v) + q(u - v)\big)\big(q(u + w) + q(u - w)\big) = \frac{e^{-2s}}{4}\big(e^{s\cos 2\pi(u+v)} + e^{s\cos 2\pi(u-v)}\big)\big(e^{s\cos 2\pi(u+w)} + e^{s\cos 2\pi(u-w)}\big).$$
Now use the sum-to-product formula on the cosines to obtain
$$K_s(x, y) K_s(x, z) = \frac{e^{-2s}}{4}\Big(e^{2s\cos 2\pi(\frac{v-w}{2})\cos 2\pi(u + \frac{v+w}{2})} + e^{2s\cos 2\pi(\frac{v-w}{2})\cos 2\pi(u - \frac{v+w}{2})} + e^{2s\cos 2\pi(\frac{v+w}{2})\cos 2\pi(u + \frac{v-w}{2})} + e^{2s\cos 2\pi(\frac{v+w}{2})\cos 2\pi(u - \frac{v-w}{2})}\Big). \tag{28}$$
We simplify this expression by introducing
$$m^\pm = \frac{v \pm w}{2} \qquad \text{and} \qquad \sigma^\pm = \cos 2\pi m^\pm. \tag{29}$$
Then, eq. (28) becomes
$$K_s(x, y) K_s(x, z) = \frac{e^{-2s}}{4}\Big(e^{2s\sigma^-\cos 2\pi(u + m^+)} + e^{2s\sigma^-\cos 2\pi(u - m^+)} + e^{2s\sigma^+\cos 2\pi(u + m^-)} + e^{2s\sigma^+\cos 2\pi(u - m^-)}\Big). \tag{30}$$
We recognize the definition of the kernel (which is not a surprise, as we chose the kernel to be stable by product). However, we would need variables in $(0, 1/2)$ to retrieve the proper definition of the kernel. Instead, we use lemma 5 on eq. (30), combined with eq. (27), to obtain
$$p_\omega = \frac{(1 + 1_{\omega \neq 0})\, e^{-2s}}{4}\Big(\cos(2\pi\omega m^+) I_\omega(2s\sigma^-) + \cos(2\pi\omega m^+) I_\omega(2s\sigma^-) + \cos(2\pi\omega m^-) I_\omega(2s\sigma^+) + \cos(2\pi\omega m^-) I_\omega(2s\sigma^+)\Big),$$
which gives
$$p_\omega = \frac{(1 + 1_{\omega \neq 0})\, e^{-2s}}{2}\Big(\cos(2\pi\omega m^+) I_\omega(2s\sigma^-) + \cos(2\pi\omega m^-) I_\omega(2s\sigma^+)\Big). \tag{31}$$
Equation (31) contains the Chebyshev coefficients of the product of two kernel functions as defined in eq. (27). Plugging this result into the definition of $g$ in eq. (26), and noting that $\cos(2\pi\omega m^\pm) = H_\omega(\cos 2\pi m^\pm) = H_\omega(\sigma^\pm)$, we obtain the result.

[Figure 3: Certificate vs. RKHS norm of $f$, for a given model $g$ with a fixed number of parameters ($f$ has 1146 coefficients and $g$ has 22528 parameters). The best certificate is kept among a set of optimization hyperparameters. As the norm of $f$ decreases, fitting $f - f^\star$ with $g$ is easier and the certificate becomes tighter.]

Thanks to lemma 3, we see that a model $g$ defined as in definition 1 with the Bessel kernel $K_s$ of eq. (24) has its Chebyshev coefficients decaying as $O(I_\omega(2s))$. Hence, it belongs to $\mathcal{H}_{2s}$, the RKHS associated with $K_{2s}$.

C Additional details on the experiments

Tuning the hyperparameters. The times reported in section 4 do not take into account the experiments needed to find a good set of hyperparameters. The parameters tuned were the type of optimizer, the decay of the learning rate, and the regularization on the Frobenius norm of $G$.

Regularization. Regularization is performed by approximating the HS norm with a proxy which is faster to compute: we use $\|\tilde{R}_j^\top \tilde{R}_k\|_{HS}^2$ instead of $\|\tilde{R}_j^\top Q_{jk} \tilde{R}_k\|_{HS}^2$ in eq. (22).

Hardware. GloptiNets was run on NVIDIA V100 GPUs for the interpolation part, and on an Intel Xeon CPU E5-2698 v4 @ 2.20GHz for computing the certificate. TSSOS was run on an Apple M1 chip with the Mosek solver.

Configuration of TSSOS. We use the lowest possible relaxation order (i.e., $\lceil \deg f / 2 \rceil$), along with chordal sparsity. We use the first relaxation step of the hierarchy. In these settings, TSSOS is not guaranteed to converge to $f^\star$ but executes the fastest.

Certificate vs. number of parameters for a given function. In fig. 2, the target function is a random polynomial of norm 1 or 2, or a kernel mixture with 10 coefficients of norm 1 or 2. The models forming the blue line are defined as in proposition 2, with rank, block size and number of blocks equal to $(1, b_s, 1)$ respectively, with $b_s$ the block size we vary. The number of frequencies sampled to compute the certificate is $1.6 \cdot 10^7$, and accounts for the fact that the bound on the variance becomes larger than the MoM estimator for large models.

Certificate vs. problem difficulty for a given model.
We have three related parameters: the quality of the optimization (given by the certificate), the expressivity of the model (given by its number of parameters), and the difficulty of the optimization (given by the norm of the function). In fig. 3, we fix the expressivity and plot the relation between the other two. Here, we fix the model with parameters $(8, 16, 128)$, and we optimize a polynomial in dimension 3 of degree 12, with RKHS norm ranging from 1 to 20. The certificates obtained are given in fig. 3. The resulting plot exhibits a clear polynomial relation between the certificate and the norm of the function, with a slope of 0.88. This suggests that the certificate behaves as $O(\|f\|_{\mathcal{H}_{2s}}^{1/2})$.

Comparison with TSSOS on the Fourier basis. In table 1, the polynomials $f$ all have an RKHS norm of 1. The small model is defined as in proposition 2, with rank, block size and number of blocks equal to 4, 32, 8 respectively. For the big models, those values are 8, 128, 16. The certificate is the maximum of the Chebyshev bound of theorem 2 and the MoM bound of theorem 3. The number of frequencies sampled is $3.2 \cdot 10^7$.

Table 2: GloptiNets and TSSOS on random Chebyshev polynomials. The same conclusion as in table 1 applies. While TSSOS is very efficient on small problems, its memory requirements grow exponentially with the problem size. GloptiNets has less accuracy, but a computational burden which does not increase with the problem size.

| d | p | n    | TSSOS Certif.       | TSSOS t | GN-small Certif.    | GN-small t     | GN-big Certif.      | GN-big t       |
|---|---|------|---------------------|---------|---------------------|----------------|---------------------|----------------|
| 4 | 3 | 255  | $3.4 \cdot 10^{-7}$ | 6       | $1.1 \cdot 10^{-2}$ | $2 \cdot 10^2$ | $4.1 \cdot 10^{-3}$ | $1 \cdot 10^3$ |
| 4 | 4 | 624  | $2.1 \cdot 10^{-9}$ | 153     | $2.5 \cdot 10^{-2}$ | $2 \cdot 10^2$ | $3.6 \cdot 10^{-3}$ | $1 \cdot 10^3$ |
| 4 | 5 | 1295 | out of memory       | -       | $1.8 \cdot 10^{-2}$ | $2 \cdot 10^2$ | $4.2 \cdot 10^{-3}$ | $2 \cdot 10^3$ |

Comparison with TSSOS on the Chebyshev basis. We compare GloptiNets with TSSOS on random Chebyshev polynomials in table 2, similarly to the comparison with trigonometric polynomials in table 1. Minimizing polynomials defined on the canonical basis is easier: contrary to trigonometric polynomials, there is no need to account for the imaginary part of the variable. If $d$ is the dimension, complex polynomials are encoded in a variable of dimension $2d$ in TSSOS, following the definition of Hermitian sums-of-squares introduced in Josz and Molzahn [2018]. Hence, the random polynomials we consider are characterized by the dimension $d$ and their number of coefficients $n$; instead of bounding the degree, we use all the basis elements $H_\omega(x) = \prod_{\ell=1}^{d} H_{\omega_\ell}(x_\ell)$ for which $\|\omega\|_\infty \leq p$. The maximum degree is then $dp$. The RKHS norm of $f$ is fixed to 1. As with the comparison on trigonometric polynomials in table 1, we see that GloptiNets provides similar certificates no matter the number of coefficients in $f$. Even though it lags behind TSSOS for small polynomials, it handles large polynomials which are intractable for TSSOS. The small and big models have the same structure as for the trigonometric polynomial experiments.

Sampling from the Bessel distribution. The function $\omega \mapsto e^{-s} I_\omega(s)$ decays rapidly. In fact, with $s = 2$, which is the value used to generate the random polynomials, it falls under machine precision as soon as $\omega > 14$. Thus, we approximate the distribution with a discrete one with weights $e^{-s} I_\omega(s)$ for the $\omega$ such that the result is above machine precision. We then extend it to multiple dimensions with a tensor product. Finally, we use a hash table to store the already sampled frequencies, to make the evaluation of millions of frequencies much faster: for instance, in dimension 5, sampling $10^6$ frequencies from the Bessel distribution of parameter $s = 2$ on $\mathbb{N}^5$ yields only $10^4$ unique frequencies (a short illustrative sketch follows).
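The sketch below illustrates this sampling-and-deduplication step; the truncation tolerance and helper names are choices made for the illustration, not the released code.

```python
# Sketch of sampling frequencies from the truncated Bessel distribution and deduplicating
# them with a hash table, so that each unique omega is evaluated only once.
import numpy as np
from collections import Counter
from scipy.special import iv

def bessel_weights(s, tol=1e-16):
    """Truncate the weights exp(-s)*I_w(s) once they fall below tol; return a symmetric pmf."""
    w, vals = 0, []
    while True:
        p = np.exp(-s) * iv(w, s)
        if p < tol and w > 0:
            break
        vals.append(p)
        w += 1
    vals = np.array(vals)
    full = np.concatenate([vals[:0:-1], vals])             # symmetric support {-W, ..., W}
    return np.arange(-(len(vals) - 1), len(vals)), full / full.sum()

def sample_frequencies(d, s, n, rng):
    support, probs = bessel_weights(s)
    draws = rng.choice(support, size=(n, d), p=probs)       # coordinate-wise tensor product
    return Counter(map(tuple, draws))                        # unique omegas with multiplicities

rng = np.random.default_rng(0)
counts = sample_frequencies(d=5, s=2.0, n=100_000, rng=rng)
print(len(counts), "unique frequencies out of", sum(counts.values()), "draws")
```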
This deduplication allows for tighter certificates, as it makes the deviation term in eq. (9), which decreases as $1/\sqrt{N}$, much smaller. Note that the time needed to generate this hash table is not reported in tables 1 and 2; it is of the order of a few seconds.

Optimizing a kernel mixture. As is the case with polynomials, when optimizing a function of the form $h(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$, the certificate provided by GloptiNets only depends on the function norm $\|h\|_{\mathcal{H}}^2$ and not on, e.g., the number of coefficients $n$. This is illustrated in fig. 4.

[Figure 4: Certificate vs. number of parameters in $g$ when certifying mixtures of Bessel functions, characterized by their RKHS norm (1 in blue, 2 in red) and their number of coefficients (10 in circles, 100 in rectangles). As with polynomials, this shows that GloptiNets is only sensitive to the former, and not to the way the function is represented. We are not aware of other algorithms able to certify this class of functions.]

D Fourier coefficients in linear time

Lemma 4 (Fourier coefficients of the Bessel kernel in linear time). Let $g$ be an extended k-SoS model as in definition 1. Then, its Fourier transform can be evaluated in time linear in $m$ with
$$\hat{g}_\omega = \sum_{k=1}^{r} \Big\langle \sum_{i=1}^{m} R_{ki}\, \phi_{1,-}(z_{i1}) \otimes \cdots \otimes \phi_{d,-}(z_{id}),\ \sum_{i=1}^{m} R_{ki}\, \phi_{1,+}(z_{i1}) \otimes \cdots \otimes \phi_{d,+}(z_{id}) \Big\rangle, \tag{32}$$
where, for $n \in \mathbb{Z}$, $z \in \mathbb{T}$ and $\ell \in [d]$,
$$\phi_{\ell,-}(z)_n = \sqrt{q_{\ell,n}}\, e^{i\pi(n - \omega_\ell)z}, \qquad \phi_{\ell,+}(z)_n = \sqrt{q_{\ell,n}}\, e^{-i\pi(n + \omega_\ell)z},$$
the pairing $\langle \cdot, \cdot \rangle$ is the (bilinear) sum over $n \in \mathbb{Z}^d$ of the products of coordinates, and $q_{\ell,n} = q_{\omega_\ell,n}(s_\ell)$ is defined in lemma 6.

Lemma 4 provides a formula for computing $\hat{g}_\omega$ which is linear in $m$, but which still requires a numerical approximation to compute the sum over $n \in \mathbb{Z}^d$. For instance, restricting the sum to the hyperbolic cross Dũng et al. [2017], $\big\{n \in \mathbb{Z}^d : \prod_{\ell=1}^{d} \max\{1, |n_\ell|\} \leq n\big\}$, would result in a complexity of $O(n (\log n)^{d-1} m r)$ and should produce reasonably accurate estimates of $\hat{g}_\omega$ for low $n$. Furthermore, since $q$ is real and even w.r.t. $n$, the inner product in eq. (36) can be simplified by computing only half of the terms.

Proof. From lemma 1, we have that
$$\hat{g}_\omega = \sum_{i,j=1}^{m} R_i^\top R_j \prod_{\ell=1}^{d} e^{-2s_\ell}\, I_{|\omega_\ell|}\big(2s_\ell\cos\pi(z_{i\ell} - z_{j\ell})\big)\, e^{-i\pi\omega_\ell(z_{i\ell} + z_{j\ell})}. \tag{33}$$
Introducing
$$f_\ell(x, y) = e^{-2s_\ell}\, I_{|\omega_\ell|}\big(2s_\ell\cos\pi(x - y)\big)\, e^{-i\pi\omega_\ell(x + y)}, \tag{34}$$
eq. (33) simplifies to
$$\hat{g}_\omega = \sum_{i,j=1}^{m} R_i^\top R_j \prod_{\ell=1}^{d} f_\ell(z_{i\ell}, z_{j\ell}). \tag{35}$$
Using lemma 6, for any $x, y \in \mathbb{T}$,
$$e^{-2s_\ell}\, I_{|\omega_\ell|}\big(2s_\ell\cos\pi(x - y)\big) = \sum_{n \in \mathbb{Z}} q_{\ell,n}\, e^{i\pi n(x - y)} \qquad (q_{\ell,n} \text{ depends on } \omega_\ell),$$
so that $f_\ell$, defined in eq. (34), now writes
$$f_\ell(x, y) = \sum_{n \in \mathbb{Z}} q_{\ell,n}\, e^{i\pi n(x - y)}\, e^{-i\pi\omega_\ell(x + y)} = \sum_{n \in \mathbb{Z}} q_{\ell,n}\, e^{i\pi(n - \omega_\ell)x}\, e^{-i\pi(n + \omega_\ell)y} = \phi_{\ell,-}(x)^\top \phi_{\ell,+}(y), \tag{36}$$
where, for any $\ell \in \{1, \ldots, d\}$ and $z \in \mathbb{T}$, the $\phi_{\ell,\mp}(z)$ are defined as in the statement of the lemma. We then define the embeddings $\phi_\mp : \mathbb{T}^d \to (\mathbb{Z}^d \to \mathbb{C})$ as the tensor products of the $\phi_{\ell,\mp}$. Then, eq. (36) enables us to write $\hat{g}_\omega$ in eq. (35) as
$$\hat{g}_\omega = \sum_{i,j=1}^{m} \sum_{k=1}^{r} R_{ki} R_{kj}\, \phi_-(z_i)^\top \phi_+(z_j) = \sum_{k=1}^{r} \Big(\sum_{i=1}^{m} R_{ki}\phi_-(z_i)\Big)^\top \Big(\sum_{j=1}^{m} R_{kj}\phi_+(z_j)\Big), \tag{37}$$
which is the desired result.

E Other computations

Lemma 5. Let $f$ be the function defined on $(-1, 1)$ by
$$f(\cos 2\pi u) = e^{s\cos 2\pi(u - v)}, \qquad u \in (0, 1/2). \tag{38}$$
Then, its Chebyshev coefficients are given by
$$f_\omega = (1 + 1_{\omega \neq 0})\cos(2\pi\omega v)\, I_\omega(s). \tag{39}$$

Proof. Let $\omega \in \mathbb{N}$, $\omega \geq 1$ (the case $\omega = 0$ is handled at the end). The component $\omega$ of a function $f$ on the Chebyshev basis is given by $\frac{2}{\pi}\int_{-1}^{1} f(x) T_\omega(x) \frac{dx}{\sqrt{1 - x^2}}$, which we conveniently rewrite, with the classical change of variable $x = \cos 2\pi u$, as
$$f_\omega = 2\int_{I_1} f(\cos 2\pi u)\cos(2\pi\omega u)\, du, \tag{40}$$
which is valid for any interval $I_1 \subset \mathbb{R}$ of length 1. Now, for $s > 0$, consider the function $f$ defined on $(-1, 1)$ by $x \mapsto e^{s\cos(\arccos(x) - 2\pi v)}$, or equivalently
$$f(\cos 2\pi u) = e^{s\cos 2\pi(u - v)}, \qquad u \in (0, 1/2). \tag{41}$$
Putting eq. (41) into eq. (40), we obtain
$$f_\omega = 2\int_{I_1} e^{s\cos 2\pi(u - v)}\cos(2\pi\omega u)\, du = 2\int_{I_1} e^{s\cos 2\pi u}\cos(2\pi\omega(u + v))\, du = 2\cos(2\pi\omega v)\int_{I_1} e^{s\cos 2\pi u}\cos(2\pi\omega u)\, du - 2\sin(2\pi\omega v)\int_{I_1} e^{s\cos 2\pi u}\sin(2\pi\omega u)\, du.$$
The last term is odd, and hence integrates to 0 on an interval centered around 0.
Hence,
$$f_\omega = 2\cos(2\pi\omega v)\int_{I_1} e^{s\cos 2\pi u}\cos(2\pi\omega u)\, du. \tag{42}$$
We recognize the definition of the modified Bessel function of the first kind, as in eq. (14). Plugging this into eq. (42), we obtain
$$f_\omega = 2\cos(2\pi\omega v)\, I_\omega(s) = 2 I_\omega(s)\, H_\omega(\cos(2\pi v)). \tag{43}$$
If $\omega = 0$, we add a factor $1/2$ to the definition in eq. (40), which yields
$$f_0 = I_0(s). \tag{44}$$

Lemma 6 (Fourier decomposition of a Bessel function composed with a cosine). Let $s > 0$, $\omega \in \mathbb{N}$ and $z \in \mathbb{T}$. Then,
$$e^{-2s}\, I_\omega(2s\cos 2\pi z) = \sum_{n \in \mathbb{Z}} q_{\omega,n}\, e^{2\pi i n z}, \qquad \text{where, for } n \geq 0, \quad q_{\omega,n} = \begin{cases} e^{-2s}\displaystyle\sum_{p \geq \left(\frac{n-\omega}{2}\right)_+} \frac{(s/2)^{2p+\omega}}{p!\,(p+\omega)!}\binom{2p+\omega}{p - \frac{n-\omega}{2}} & \text{if } n \equiv \omega \pmod 2, \\ 0 & \text{otherwise,} \end{cases} \tag{45}$$
and $q_{\omega,-n} = q_{\omega,n}$ by evenness of the coefficients.

Proof. From the definition of the modified Bessel function of the first kind [Watson, 1922, p.77, Eq. 2], we have $I_\omega(x) = \sum_{p \geq 0} \frac{(x/2)^{2p+\omega}}{p!\,(p+\omega)!}$, so that
$$I_\omega(2s\cos 2\pi z) = \sum_{p \geq 0} \frac{s^{2p+\omega}}{p!\,(p+\omega)!}\cos(2\pi z)^{2p+\omega} = \sum_{p \geq 0} \frac{s^{2p+\omega}}{p!\,(p+\omega)!}\left(\frac{e^{2\pi i z} + e^{-2\pi i z}}{2}\right)^{2p+\omega} = \sum_{p \geq 0} \frac{(s/2)^{2p+\omega}}{p!\,(p+\omega)!}\sum_{k=0}^{2p+\omega}\binom{2p+\omega}{k}\, e^{2\pi i(2(p-k)+\omega)z}. \tag{46}$$
Using the change of variable $n = 2(p - k) + \omega$ in eq. (46), we see that $n$ has the same parity as $\omega$ and
$$I_\omega(2s\cos 2\pi z) = \sum_{p \geq 0} \frac{(s/2)^{2p+\omega}}{p!\,(p+\omega)!}\sum_{\substack{n = -(2p+\omega) \\ n \equiv \omega\ (2)}}^{2p+\omega}\binom{2p+\omega}{p - \frac{n-\omega}{2}}\, e^{2\pi i n z}. \tag{47}$$
Equation (47) can be rewritten as
$$I_\omega(2s\cos 2\pi z) = \sum_{\substack{n \in \mathbb{Z} \\ n \equiv \omega\ (2)}} \left(\sum_{p \geq 0} \frac{(s/2)^{2p+\omega}}{p!\,(p+\omega)!}\binom{2p+\omega}{p - \frac{n-\omega}{2}}\, 1_{-(2p+\omega) \leq n \leq 2p+\omega}\right) e^{2\pi i n z},$$
for which eq. (45) is a concise rewriting (up to the factor $e^{-2s}$).
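As a closing sanity check, the sketch below compares the closed form of lemma 6, as reconstructed in eq. (45), with Fourier coefficients computed numerically by FFT; treat the closed form in the code as mirroring that reconstruction, not as the released implementation.

```python
# Numerical check of lemma 6: the Fourier coefficients of z -> exp(-2s)*I_omega(2s*cos(2*pi*z))
# match the combinatorial expression q_{omega, n} of eq. (45).
import numpy as np
from math import comb, factorial
from scipy.special import iv

def q(omega, n, s, terms=40):
    n = abs(n)
    if (n - omega) % 2 != 0:
        return 0.0
    p0 = max(0, (n - omega) // 2)
    return np.exp(-2 * s) * sum(
        (s / 2) ** (2 * p + omega) / (factorial(p) * factorial(p + omega))
        * comb(2 * p + omega, p - (n - omega) // 2)
        for p in range(p0, p0 + terms))

s, omega, M = 2.0, 3, 256
z = np.arange(M) / M
coeffs = np.fft.fft(np.exp(-2 * s) * iv(omega, 2 * s * np.cos(2 * np.pi * z))) / M
for n in range(6):
    print(n, coeffs[n].real, q(omega, n, s))     # the two columns agree
```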