# Continuation Path Learning for Homotopy Optimization

Xi Lin¹, Zhiyuan Yang¹, Xiaoyuan Zhang¹, Qingfu Zhang¹

¹Department of Computer Science, City University of Hong Kong. Correspondence to: Xi Lin.

## Abstract

Homotopy optimization is a traditional method for solving a complicated optimization problem by solving a sequence of easy-to-hard surrogate subproblems. However, this method can be very sensitive to the design of the continuation schedule, and may lead to a suboptimal solution to the original problem. In addition, the intermediate solutions, often ignored by classic homotopy optimization, could be useful for many real-world applications. In this work, we propose a novel model-based approach to learn the whole continuation path for homotopy optimization, which contains the infinitely many intermediate solutions of all surrogate subproblems. Rather than the classic unidirectional easy-to-hard optimization, our method can simultaneously optimize the original problem and all surrogate subproblems in a collaborative manner. The proposed model also supports the real-time generation of any intermediate solution, which could be desirable for many applications. Experimental studies on different problems show that our proposed method can significantly improve the performance of homotopy optimization and provide extra helpful information to support better decision-making.

## 1. Introduction

Homotopy optimization (Blake & Zisserman, 1987; Yuille, 1989; Allgower & Georg, 1990), also called continuation optimization, is a general optimization strategy for solving complicated and highly non-convex optimization problems, which arise in many machine learning applications (Jain & Kar, 2017). This method first constructs a simple surrogate of the original optimization problem, and then gradually solves a sequence of easy-to-hard surrogates to approach the optimal solution of the original complicated
problem.

*Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).*

The simplest surrogate subproblem can be easily solved, and its solution serves as a good initial point for the next subproblem. In this way, we can eventually find a good initial solution and then a (nearly) optimal solution for the original hard-to-solve optimization problem.

The idea of homotopy optimization is straightforward, but it also suffers from several drawbacks. First, the optimization performance can heavily depend on the continuation schedule of the surrogate subproblems, which is not easy to design for a new problem (Dunlavy & O'Leary, 2005). The method also needs to solve each subproblem iteratively in sequence, which can lead to an undesirably long run time in practice (Iwakiri et al., 2022). In addition, existing homotopy optimization methods focus only on finding the final solution to the original problem. However, the intermediate solutions of the homotopy surrogate subproblems could be useful for many real-world applications.

In this work, to tackle the drawbacks mentioned above, we propose a novel continuation path learning (CPL) method to find the whole solution path for homotopy optimization, which contains the solutions of all (infinitely many) intermediate subproblems. The key idea is to build a learnable model that maps any valid continuation level to its corresponding solution, and then to optimize all of them simultaneously in a collaborative manner. In this way, the whole path model can be learned in a single run, without a sensitive schedule design or iterative optimization. With the learned path model, we can easily generate the solution for any intermediate homotopy level, which could be useful for many applications.
Our main contributions can be summarized as follows:

- We propose a novel model-based approach to learn the whole continuation path for homotopy optimization, which is significantly different from existing methods that iteratively solve a sequence of finitely many subproblems.
- We develop an efficient learning method to train the path model with respect to all homotopy levels simultaneously. The proposed model can generate the solution of any intermediate subproblem in real time, which is desirable for many real-world applications.
- We empirically demonstrate that our proposed CPL method achieves promising performance on various problems, including non-convex optimization, noisy regression, and neural combinatorial optimization.¹

Figure 1. Continuation Path Learning: a) The classical homotopy optimization method sequentially solves a set of easy-to-hard smoothed subproblems, which helps find the optimal solution of the original problem. b) We propose to simultaneously learn the whole continuation path, which contains the intermediate solutions of all homotopy subproblems. c) The solutions of smoothed homotopy subproblems could have better generalization performance on learning-based problems.

## 2. Related Work

**Homotopy Optimization** Homotopy optimization, also called the continuation or graduated optimization method (Blake & Zisserman, 1987; Yuille, 1989; Allgower & Georg, 1990), is a general optimization strategy for solving non-convex optimization problems. This method sequentially constructs and solves a set of smoothed subproblems that gradually deform from an easy-to-solve problem to the original complicated problem, as shown in Figure 1(a). It helps to find a better solution for the original non-convex problem (Wu, 1996; Dunlavy & O'Leary, 2005).
This method has also been widely used for solving nonlinear equations (Eaves, 1972; Wasserstrom, 1973; Allgower & Georg, 1990), and it is closely related to simulated annealing (Kirkpatrick et al., 1983; Van Laarhoven & Aarts, 1987; Ingber, 1993).

**Homotopy Optimization in Machine Learning** Homotopy optimization methods have been widely used in machine learning applications over the past three decades, for example in computer vision (Terzopoulos, 1988; Gold et al., 1994; Brox & Malik, 2010; Hruby et al., 2022), statistical learning (Chapelle et al., 2006; Kim & De la Torre, 2010), curriculum learning (Bengio, 2009; Bengio et al., 2009; Kumar et al., 2010; Graves et al., 2017), and efficient model training (Chaudhari et al., 2016; Gargiani et al., 2020; Guo et al., 2020). A few works have studied their theoretical properties in different settings (Mobahi & Fisher III, 2015b;a; Hazan et al., 2016; Anandkumar et al., 2017; Iwakiri et al., 2022). Although homotopy optimization can usually help to find a better solution, it might suffer from a long run time due to the iterative optimization structure. Recently, Iwakiri et al. (2022) proposed a novel single-loop algorithm for fast Gaussian homotopy optimization.

Most homotopy optimization methods only care about the final solution of the original problem, but the intermediate optimal solutions could also be useful for flexible decision-making. A few works have been proposed to find a single solution of a specific smoothed intermediate subproblem for better generalization performance (Chaudhari et al., 2016; Gulcehre et al., 2017).

¹The source code can be found at https://github.com/Xi-L/CPL.
In statistical learning, efficient methods have been proposed to find the whole set of finitely many turning points that fully characterize the homotopy path for LASSO (Efron et al., 2004; Rosset & Zhu, 2007; Tibshirani & Taylor, 2011) and SVM (Hastie et al., 2004), by leveraging their specific piecewise-linear structure. However, these methods do not work for general homotopy optimization problems.

**Model-based Optimization** Many model-based methods have been proposed to improve the performance of different optimization algorithms. Bayesian optimization builds a surrogate model to approximate an unknown black-box optimization problem and uses it to guide the optimization process (Shahriari et al., 2016; Garnett, 2022). Latent space modeling (Gómez-Bombarelli et al., 2018; Tripp et al., 2020) is another powerful approach that reconstructs the original complicated optimization problem into a much easier-to-solve form. It is also possible to accelerate an optimization algorithm by learning the problem structure (Sener & Koltun, 2020) or dividing the search space (Wang et al., 2020; Eriksson et al., 2019). Recently, a few model-based approaches have been proposed to learn the Pareto set of multi-objective optimization problems (Yang et al., 2019; Dosovitskiy & Djolonga, 2020; Lin et al., 2020; Navon et al., 2021; Lin et al., 2022a;b). In this work, we propose a novel method to learn the whole continuation path for homotopy optimization, as shown in Figure 1(b), and use it to improve the optimization performance in different applications.

## 3. Homotopy Optimization

In this section, we introduce the classical homotopy optimization method and a recently proposed single-loop Gaussian homotopy algorithm.

### 3.1. Classical Homotopy Optimization

We are interested in the following minimization problem:

$$\min_{x \in \mathcal{X}} f(x), \tag{1}$$

where $x \in \mathcal{X} \subseteq \mathbb{R}^d$ is the decision variable, and $f: \mathcal{X} \rightarrow \mathbb{R}$ is the objective function to minimize.
The objective function $f(x)$ could be highly non-convex and have a complicated optimization landscape. Therefore, it cannot be easily optimized with a simple local minimization method such as gradient descent. To tackle this problem, the homotopy optimization method (Allgower & Georg, 1990; Dunlavy & O'Leary, 2005) considers a family of functions $H: \mathcal{X} \times \mathcal{T} \rightarrow \mathbb{R}$ parameterized by the continuation level $t \in \mathcal{T} = [0, 1]$ such that:

$$H(x, t=0) = g(x), \quad H(x, t=1) = f(x), \quad \forall x \in \mathcal{X}, \tag{2}$$

where $g: \mathcal{X} \rightarrow \mathbb{R}$ is another, easy-to-optimize objective function on the same decision space $\mathcal{X}$. The function $H(x, t)$ is also called a homotopy that gradually transforms $g(x)$ into $f(x)$ as $t$ increases from $0$ to $1$. An illustration of the continuation function can be found in Figure 1(a).

The key idea of homotopy optimization is to define a suitable continuation function $H(x, t)$ such that the minimizer of $H(x, 0) = g(x)$ is already known or easy to find, and $H(x, t)$ with $t$ going from $0$ to $1$ gives a sequence of smoothed functions transforming from $g(x)$ to the target objective function $f(x)$. Rather than directly optimizing the complicated target function $f(x)$, we can progressively solve a sequence of coarse-to-fine smoothed subproblems from $H(x, 0)$ to $H(x, 1)$, each warm-started from the previously obtained solution, as shown in Algorithm 1. In this way, we can find a better solution for the target objective function $H(x, 1) = f(x)$.

**Algorithm 1** Classical Homotopy Optimization
1: **Input:** continuation function $H(x, t)$, a predefined sequence $0 = t_0 < t_1 < \ldots < t_K = 1$
2: $x_0$ = a minimizer of $H(x, t_0)$
3: **for** $k = 1$ **to** $K$ **do**
4: $\quad x_k$ = local minimizer of $H(x, t_k)$, initialized at $x_{k-1}$
5: **end for**
6: **Output:** $x_K$

In practice, the performance of homotopy optimization heavily depends on two crucial components:

- a proper design of the initial and continuation functions for the given problem;
- a suitable schedule for progressively solving the sequence of easy-to-hard subproblems.

However, there is no clear and principled guideline for the continuation construction (Mobahi & Fisher III, 2015a). The homotopy optimization process can also be time-consuming, since we have to run a local search algorithm to find $x_k$ for each subproblem (Iwakiri et al., 2022). The continuation and schedule design become much more complicated if we want to find a solution for a specific homotopy level that is unknown in advance, e.g., for better generalization (Chaudhari et al., 2016; Gulcehre et al., 2017).

### 3.2. Single-Loop Gaussian Homotopy Algorithm

**Algorithm 2** Single-Loop Gaussian Homotopy (SLGH)
1: **Input:** Gaussian homotopy function $GH(x, t)$, initial solution $x_0$, initial homotopy level $t_0$
2: **for** $k = 1$ **to** $K$ **do**
3: $\quad x_k = x_{k-1} - \eta_1 \nabla_x GH(x_{k-1}, t_{k-1})$
4: $\quad$ (optional) query $G_t = \partial GH(x_{k-1}, t_{k-1}) / \partial t$
5: $\quad$ update $t_k = \gamma t_{k-1}$ (SLGHr) or $t_k = \max\{0, \min\{t_{k-1} - \eta_2 G_t, \gamma t_{k-1}\}\}$ (SLGHd)
6: **end for**
7: **Output:** $x_K$

To alleviate the time-consuming local optimization at each homotopy iteration, Iwakiri et al. (2022) recently proposed a novel single-loop algorithm for the popular Gaussian homotopy method (Blake & Zisserman, 1987). The Gaussian homotopy function $GH(x, t)$ with $t \in [0, 1]$ for $f(x)$ is defined as:

$$GH(x, t) = \mathbb{E}_{u \sim \mathcal{N}(0, I_d)}[f(x + \beta(1-t)u)] = \int f(x + \beta(1-t)y)\, k(y)\, dy, \tag{3}$$

where the parameter $\beta > 0$ controls the maximum range of the homotopy effect, $\mathcal{N}(0, I_d)$ is the $d$-dimensional standard Gaussian distribution, and $k(y) = (2\pi)^{-d/2} \exp(-\|y\|^2/2)$ is the Gaussian kernel. Instead of iteratively optimizing $GH(x, t)$ with a decreasing sequence of smoothing levels as in previous works (Wu, 1996; Mobahi & Fisher III, 2015b;a; Hazan et al., 2016), Iwakiri et al. (2022) proposed to directly optimize

$$\min_{x \in \mathcal{X},\, t \in \mathcal{T}} GH(x, t) \tag{4}$$

with respect to both $x$ and $t$ at the same time.
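To make the single-loop scheme concrete, here is a minimal NumPy sketch of an SLGHr-style update on a toy quartic whose Gaussian homotopy gradient has a closed form. The objective, step sizes, and schedule are our own illustrative choices, and the fixed-ratio rule is applied to the smoothing width $\beta(1-t)$; for a general $f$, $\nabla_x GH$ would instead be estimated by sampling:

```python
import numpy as np

def f(x):
    # Toy non-convex objective: minima at x = +/-1, local maximum at x = 0.
    return (x**2 - 1)**2

def gh_grad(x, sigma):
    # Closed-form gradient of GH(x, sigma) = E_u[f(x + sigma*u)] for this f:
    # GH = x^4 + (6*sigma^2 - 2)*x^2 + const, so d/dx = 4x(x^2 + 3*sigma^2 - 1).
    return 4.0 * x * (x**2 + 3.0 * sigma**2 - 1.0)

def slgh_r(x0, beta=1.0, gamma=0.995, eta=0.01, iters=2000):
    # Single loop: one gradient step on the smoothed objective per iteration,
    # while the smoothing width sigma = beta*(1 - t) shrinks by a fixed ratio.
    x, sigma = x0, beta
    for _ in range(iters):
        x = x - eta * gh_grad(x, sigma)
        sigma = gamma * sigma
    return x

x_final = slgh_r(x0=0.1)  # ends near a true minimizer of f
```

With heavy smoothing the iterate is first pulled toward the smoothed minimizer near $0$; as the smoothing shrinks, the gradient sign flips and the iterate settles into the nearby minimizer of the original objective.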
By leveraging the theoretical properties of the heat equation (Widder, 1976) and partial differential equations (Evans, 2010), they also showed that the Gaussian homotopy function $GH(x, t)$ is always optimized at $(x^*, 1)$, where $x^*$ is the optimal solution of $f(x)$. This property validates the approach of optimizing $GH(x, t)$ with respect to $t$. The single-loop Gaussian homotopy (SLGH) algorithm is shown in Algorithm 2. At each iteration, the decision variable $x_k$ is updated by gradient descent, and the homotopy level $t_k$ is updated by either a fixed-ratio decreasing rule (SLGHr, e.g., with $\gamma = 0.999$) or a derivative-based update rule (SLGHd). This algorithm can be faster than classical homotopy optimization algorithms with a double-loop structure (Iwakiri et al., 2022). However, the performance of SLGH is still sensitive to the homotopy schedule (e.g., the setting of $\gamma$). It might not work for general homotopy optimization problems other than Gaussian homotopy, and it cannot easily obtain the solutions of arbitrary intermediate subproblems or the whole homotopy path.

## 4. Continuation Path Learning

### 4.1. Why Continuation Path Learning

Both classical homotopy optimization algorithms and the recently proposed SLGH method focus only on finding a single solution of the original optimization problem. In many real-world applications, however, an intermediate solution $x_{t'}$ of a smoothed homotopy subproblem $H(x_{t'}, t')$ could be desirable, for example for robust generalization performance (Chaudhari et al., 2016; Gulcehre et al., 2017). Finding a solution with a proper regularization vs. performance trade-off is also an important issue in machine learning (Efron et al., 2004; Hastie et al., 2004). However, the optimal homotopy level $t'$ is usually unknown in advance. In this work, we propose a novel model-based approach to learn the whole continuation path, which contains the solutions of all gradually smoothed homotopy subproblems.
With the learned path model, decision-makers can easily select their preferred solution(s) on the solution path, as shown in Figure 2. In addition, even if the goal is to find a single solution at a given homotopy level (e.g., $t = 1$ for the original problem), CPL can still efficiently generate a good initial solution by collaboratively learning all subproblems together. It is a strong alternative to the sequential and unidirectional information passing of classical homotopy optimization.

Figure 2. The continuation path model takes any valid homotopy level $t$ as input, and generates its corresponding solution $x_\phi(t)$ on the continuation path. Decision-makers can easily obtain any predicted homotopy solution by adjusting the input level $t$.

### 4.2. Continuation Path Model

For the homotopy function $H(x, t)$, the continuation levels $t \in \mathcal{T}$ and their corresponding subproblems are infinitely many. Let $x^*_0$ be the minimizer of $H(x, 0) = g(x)$; we can define a path $x^*(t)$, continuous in $t$, such that $x^*(0) = x^*_0$ and $\nabla_x H(x^*(t), t) = 0$ for all $t \in \mathcal{T} = [0, 1]$ (Wu, 1996). This path simply goes through a set of stationary points of $H(x, t)$, starting from $x^*_0$ with gradually increasing $t$. Homotopy optimization algorithms trace the solutions from $x^*(0)$ to $x^*(1)$. In practice, the solution $x^*(1)$ could be a good local minimizer of $H(x, 1) = f(x)$, if not the global one (Mobahi & Fisher III, 2015a;b). The existence and uniqueness of this path can be guaranteed for the Gaussian homotopy function (3) under mild conditions:

**Theorem 4.1** (Existence of the Continuation Path (Wu, 1996)). *Let $f$ be a well-behaved function and $H(x, t)$ be its Gaussian homotopy function (3). Then for any stationary point $x_0$ of $H(x, 0)$, there is a continuous and differentiable curve $x^*(t)$ on $t \in \mathcal{T} = [0, 1]$ such that $x^*(0) = x_0$ and $x^*(t)$ is a stationary point of $H(x, t)$ for all $t \in \mathcal{T}$.*
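For intuition, such a path can be traced numerically with the warm-started scheme of Algorithm 1. The sketch below uses a toy quartic $f(x) = (x^2 - 1)^2$, whose Gaussian homotopy gradient has a closed form, so the analytic stationary path $x^*(t) = \sqrt{1 - 3\beta^2(1-t)^2}$ (here with $\beta = 0.5$) is available for comparison; all concrete choices are our own illustration:

```python
import numpy as np

def gh_grad(x, sigma):
    # d/dx of GH(x, sigma) = E_u[((x + sigma*u)^2 - 1)^2] = 4x(x^2 + 3*sigma^2 - 1)
    return 4.0 * x * (x**2 + 3.0 * sigma**2 - 1.0)

def local_min(x, sigma, eta=0.01, iters=1000):
    # Plain gradient descent on the smoothed subproblem.
    for _ in range(iters):
        x = x - eta * gh_grad(x, sigma)
    return x

beta = 0.5
ts = np.linspace(0.0, 1.0, 11)
x = 0.2                      # rough starting point for the easiest subproblem
path = []
for t in ts:                 # Algorithm 1: warm-start each subproblem at the last solution
    x = local_min(x, beta * (1.0 - t))
    path.append(x)

analytic = np.sqrt(1.0 - 3.0 * (beta * (1.0 - ts))**2)
```

The discrete solutions `path` land on the continuous curve `analytic`, illustrating that classical homotopy optimization visits a finite subset of the full path $\{x^*(t) \mid t \in \mathcal{T}\}$.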
To be well-behaved, a sufficient condition is that $f$ is twice continuously differentiable, and that $f$ and its derivatives are all integrable so that the Gaussian homotopy function (3) is well defined (Wu, 1996). We assume that such a continuation path $x^*(t)$ always exists in this work. By definition, $x^*(t)$ is a continuous curve that contains solutions for all (infinitely many) homotopy levels $t \in \mathcal{T}$. The discrete set of stationary solutions obtained by a classical homotopy optimization method, $\{x_1, x_2, \ldots, x_K\} = \{x^*(t_1), x^*(t_2), \ldots, x^*(t_K)\}$, is a finite subset of the solution path $\{x^*(t) \mid t \in \mathcal{T}\}$.

In this work, we propose to build a model $x_\phi(t)$ with learnable parameters $\phi$ to approximate the whole continuation path $x^*(t)$. Our goal is to find the optimal $\phi^*$ such that:

$$x_{\phi^*}(t) = x^*(t) = \arg\min_x H(x, t), \quad \forall t \in \mathcal{T}. \tag{5}$$

As shown in Figure 2, the continuation path model maps any valid homotopy level $t$ to its corresponding solution $x_\phi(t)$. With the ideal model parameters $\phi^*$, the output $x_{\phi^*}(t)$ should be the optimal solution of each intermediate subproblem $H(x, t)$, and hence $x_{\phi^*}(t)$ well approximates the continuation path $x^*(t)$. Once such a model is obtained, we can easily get the corresponding solution $x_{\phi^*}(t') = \arg\min_x H(x, t')$ for any specific continuation level $t'$. In this work, we let $x_\phi(t)$ be a neural network model with parameters $\phi$.

### 4.3. Learning the Continuation Path

Once we have the continuation path model $x_\phi(t)$, the next step is to find the optimal parameters $\phi$ with respect to all homotopy levels $t \in \mathcal{T}$. Since the number of homotopy levels is infinite, our goal is to optimize the following expectation:

$$\min_\phi \mathbb{E}_t\, H(x_\phi(t), t), \tag{6}$$

where each term $H(x_\phi(t), t)$ is a composition of the continuation path model $x_\phi(t)$ and the homotopy function $H(x, t)$.
In this way, we reformulate the classic unidirectional homotopy optimization problem (e.g., Algorithm 1) into the single-loop model training problem (6), which learns the whole continuation path simultaneously. Note that our method changes the optimization variables from the original decision variable $x$ to the model parameters $\phi$.

It is difficult to optimize problem (6) directly, since the expectation term is hard to compute in most cases. In this work, we propose to learn the model parameters with stochastic gradient descent, as shown in Algorithm 3. At each step, we solve the following stochastic optimization problem with Monte Carlo sampling:

$$\min_\phi \frac{1}{M} \sum_{m=1}^{M} H(x_\phi(t_m), t_m), \quad \{t_m\}_{m=1}^{M} \sim P_\mathcal{T}, \tag{7}$$

where $\{t_1, \ldots, t_M\}$ are $M$ independent and identically distributed (i.i.d.) samples from the distribution $P_\mathcal{T}$. Without any prior knowledge, we can simply set $P_\mathcal{T}$ to be the uniform distribution on $\mathcal{T}$. It is also possible to use other distributions, or to adaptively adjust the distribution to incorporate extra information along the optimization process.

**Algorithm 3** Gradient-Based Continuation Path Learning
1: **Input:** continuation function $H(x, t)$, a path model $x_\phi(t)$ with learnable parameters $\phi$
2: **for** $i = 1$ **to** $I$ **do**
3: $\quad$ randomly sample $\{t_m\}_{m=1}^{M} \sim P_\mathcal{T}$
4: $\quad \phi \leftarrow \phi - \frac{\eta}{M} \sum_{m=1}^{M} \nabla_\phi H(x_\phi(t_m), t_m)$
5: **end for**
6: (optional) $x_{t'}$ = local minimizer of $H(x, t')$, initialized at $x_\phi(t')$ for a chosen homotopy level $t'$
7: **Output:** path model $x_\phi(t)$

A crucial step of the proposed method is to find a valid gradient direction for updating the model parameters at each iteration. We can decompose the gradient $\nabla_\phi H(x_\phi(t), t)$ with the chain rule:

$$\nabla_\phi H(x_\phi(t), t) = \frac{\partial x_\phi(t)}{\partial \phi}^\top \nabla_x H(x, t)\Big|_{x = x_\phi(t)}, \tag{8}$$

where $\partial x_\phi(t) / \partial \phi$ is the Jacobian matrix of the path model with output vector $x_\phi(t)$, and $\nabla_x H(x, t)$ is the gradient of the homotopy function with respect to the decision variables $x$. In this work, since the path model is a neural network, the Jacobian matrix $\partial x_\phi(t) / \partial \phi$ can be easily calculated with backpropagation.
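A minimal PyTorch sketch of this training loop might look as follows; the homotopy function, network size, and optimizer settings are illustrative assumptions of ours, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def H(x, t):
    # Illustrative homotopy: interpolate an easy quadratic g(x) = ||x||^2
    # into a non-convex target f(x) whose minima lie on the unit circle.
    g = x.pow(2).sum(-1)
    f = (x.pow(2).sum(-1) - 1.0).pow(2)
    return (1.0 - t) * g + t * f

# Path model x_phi: homotopy level t in [0, 1] -> solution in R^2.
path_model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(path_model.parameters(), lr=1e-3)

for _ in range(3000):
    t = torch.rand(128, 1)                 # Monte Carlo samples t_m ~ Uniform(0, 1)
    x = path_model(t)
    loss = H(x, t.squeeze(-1)).mean()      # stochastic estimate of E_t[H(x_phi(t), t)]
    opt.zero_grad()
    loss.backward()                        # chain rule of Eq. (8) via backpropagation
    opt.step()
```

After training, querying the model at $t = 0$ should return a point near the origin (the minimizer of the easy surrogate), while $t = 1$ should return a point near the unit circle (a minimizer of the target).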
If the homotopy function is differentiable with a known gradient $\nabla_x H(x, t)$, we can use standard gradient descent to optimize the model parameters. In many real-world applications, however, the gradient of the homotopy function could be unknown or hard to compute (Iwakiri et al., 2022). In these cases, we can use a zeroth-order optimization (also called derivative-free optimization) method (Duchi et al., 2015; Nesterov & Spokoiny, 2017) with approximate gradients for model training. For a general homotopy function, we can adopt a simple evolution strategy (ES) (Hansen & Ostermeier, 2001; Beyer & Schwefel, 2002) to approximate the gradient:

$$\nabla_x H(x, t) \approx \frac{1}{\sigma K} \sum_{k=1}^{K} \left( H(x + \sigma u^{(k)}, t) - H(x, t) \right) u^{(k)}, \tag{9}$$

where $\{u^{(1)}, \ldots, u^{(K)}\}$ are $K$ i.i.d. $d$-dimensional Gaussian vectors sampled from $\mathcal{N}(0, I_d)$, and $\sigma$ is a fixed control parameter. This simple gradient estimator is also closely related to Gaussian smoothing (Nesterov & Spokoiny, 2017; Gao & Sener, 2022). For the Gaussian homotopy function (3), according to Nesterov & Spokoiny (2017), the gradient can be written as:

$$\nabla_x GH(x, t) = \frac{1}{\beta(1-t)} \mathbb{E}_{u \sim \mathcal{N}(0, I_d)}\left[ \left( f(x + \beta(1-t)u) - f(x) \right) u \right]. \tag{10}$$

Therefore, it can be approximated by:

$$\nabla_x GH(x, t) \approx \frac{1}{\beta(1-t) K} \sum_{k=1}^{K} \left( f(x + \beta(1-t)u^{(k)}) - f(x) \right) u^{(k)}, \tag{11}$$

where $\{u^{(1)}, \ldots, u^{(K)}\} \sim \mathcal{N}(0, I_d)$ are $K$ i.i.d. Gaussian vectors as in the ES approximate gradient (9), but now we only query the values of the original function $f$ rather than the homotopy function $H(x, t)$. A similar zeroth-order approximation with batch size $K = 1$ was used and analyzed by Iwakiri et al. (2022) for the SLGH algorithm.

### 4.4. Optional Local Search

Figure 3. Optional Local Search: Decision-makers can easily pick their favorite solutions from the learned continuation path. An optional local search can help to find a better solution if needed.

The previous subsections mainly focus on learning the whole continuation path.
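As an aside, the ES estimator in Eq. (9) takes only a few lines of NumPy; the quadratic sanity check below (with known gradient $2x$) is our own illustration:

```python
import numpy as np

def es_grad(H, x, t, sigma=0.01, K=1000, rng=None):
    # Eq. (9): (1 / (sigma * K)) * sum_k (H(x + sigma*u_k, t) - H(x, t)) * u_k
    rng = rng or np.random.default_rng(0)
    u = rng.standard_normal((K, x.shape[0]))
    diffs = np.array([H(x + sigma * uk, t) - H(x, t) for uk in u])
    return (diffs[:, None] * u).sum(axis=0) / (sigma * K)

# Sanity check on a smooth function with known gradient 2x (t is unused here).
H = lambda x, t: float(np.sum(x**2))
x = np.array([1.0, -2.0])
g = es_grad(H, x, t=1.0, sigma=1e-3, K=20000)   # approx. [2.0, -4.0]
```

With a small $\sigma$ and a large batch $K$, the estimate concentrates around the true gradient; in Algorithm 3 it would replace $\nabla_x H$ in the chain rule of Eq. (8).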
In some applications, the decision-maker might only be interested in a single solution, such as $x_\phi(t=1)$ for the original optimization problem. By learning the whole continuation path $x_\phi(t)$ for all homotopy levels together, CPL exchanges information among the different homotopy subproblems simultaneously via the path model. In this case, the learned continuation path model can act as a warm start for any homotopy subproblem, which is more flexible than the gradual unidirectional information passing of classical homotopy optimization. An optional fast local search can then help to find a better solution, as in Algorithm 3 and Figure 3. This step is equivalent to the final iteration of the classical homotopy optimization algorithm. In the experiment section, we empirically show that CPL can indeed find better initial solutions.

## 5. Bridging Homotopy Optimization and Parametric Optimization

Our proposed continuation path learning approach provides a novel view that bridges homotopy optimization and parametric optimization. We discuss several interesting connections in this section.

### 5.1. CPL as Parametric Optimization

A general parametric optimization problem is defined as:

$$\min_{x \in \mathcal{X}} f(x, \beta), \tag{12}$$

where $x \in \mathcal{X}$ is the decision variable, $\beta \in \mathcal{B}$ is the problem parameter (also called the context), and $f: \mathcal{X} \times \mathcal{B} \rightarrow \mathbb{R}$. Classical works on parametric optimization mainly focus on the sensitivity of the objective value to the problem parameter (Bank et al., 1983; Bonnans & Shapiro, 2013; Still, 2018). A typical tool is the value function

$$v(\beta) = \min_x f(x, \beta) \tag{13}$$

that describes the change of the optimal value with respect to the problem parameter $\beta$. In our work, we propose to build a model to approximate the whole continuation path $x_{\phi^*}(t) = x^*(t) = \arg\min_x H(x, t)$ for every homotopy level. If we treat the homotopy level $t$ as the problem parameter, we can define the value function for homotopy optimization

$$v(t) = \min_x H(x, t) = H(x_{\phi^*}(t), t) \tag{14}$$

for all valid $t$.
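As a toy illustration of this point (our own example, not from the paper): for $H(x, t) = (1-t)x^2 + t(x-2)^2$, the continuation path is $x^*(t) = 2t$ and the value function is $v(t) = 4t(1-t)$. Differentiating $v(t) = H(x^*(t), t)$ with autograd automatically includes the envelope term $\partial H / \partial x \cdot dx^*/dt$, which vanishes at the minimizer:

```python
import torch

def H(x, t):
    # Simple parametric objective: minimized at x*(t) = 2t.
    return (1 - t) * x**2 + t * (x - 2) ** 2

t = torch.tensor(0.5, requires_grad=True)
x_star = 2 * t               # stand-in for a learned path model x_phi(t)
v = H(x_star, t)             # value function v(t) = H(x*(t), t) = 4t(1 - t)
v.backward()                 # dv/dt = 4 - 8t, which is 0 at t = 0.5
```

Here `t.grad` is exactly $v'(0.5) = 0$, showing how a (learned) solution model makes the value function directly differentiable.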
With CPL's model-based reformulation, we can now use well-studied parametric optimization approaches as a novel lens to analyze homotopy optimization methods. In addition, the classic value function approach mainly addresses the change of the optimal value $v(\beta)$ with respect to the parameter $\beta$. Direct differentiation of $v(\beta)$ can be difficult, since the optimal solution $x^*$ is often unavailable (Mehmood & Ochs, 2020; 2021). By modeling $x^*(t) = x_{\phi^*}(t)$, the homotopy objective $v(t) = H(x_{\phi^*}(t), t)$ can be easily differentiated and optimized by gradient-based methods. The solution model $x_\phi(t)$ (in addition to the objective value) may also provide useful information to support decision-making. It could be interesting to extend this solution-model method to the general value function approach.

### 5.2. Connection to Amortized Optimization

There is an exciting and important research direction on learning to optimize (also called amortized optimization) (Amos, 2022; Chen et al., 2022), which focuses on making implicit or explicit inferences from a given problem context to its solution. Some recent works directly predict the optimal solution from problem parameters with specific structures (Liu et al., 2022). Conceptually, they are similar to the value function approach but additionally include a solution model. A recent work (Li et al., 2023) proposes a classic iterative homotopy optimization method to accelerate the learning-to-optimize approach. Our proposed CPL method could be useful for further improving its performance. If we treat the homotopy level as an additional problem parameter, it is possible to learn the continuation paths of a set of problems via a single model. The viewpoint of the value function approach may also provide useful insight for designing better methods.

### 5.3. Problem Reformulation and Difficulty

Our proposed CPL approach reformulates the original optimization problem into continuation model training, which might have more parameters to optimize, but we believe it handles homotopy optimization more easily. First of all, there could be infinitely many homotopy levels and corresponding intermediate solutions for the original problem, which is hard to handle with classic iterative optimization approaches. Our proposed CPL method can learn the whole continuation path via a single model, which is a novel and principled way to deal with this problem.

Figure 4. The original and Gaussian homotopy (GH) versions of the Ackley (a,b), Rosenbrock (c,d), and Himmelblau (e,f) optimization problems. The original optimization problems are all non-convex, and the GH surrogate problems are much smoother and easier to solve.

Table 1. Results of the gradient descent algorithm, Gaussian homotopy optimization algorithms, and our proposed continuation path learning (CPL) algorithm on three widely-used optimization problems. The optimal value is 0 for all problems. CPL obtains the best solutions on all problems with the same number of function evaluations.

| Algorithm | Ackley | Rosenbrock | Himmelblau |
| --- | --- | --- | --- |
| GD | 12.63 | 0.2840 | $1.6 \times 10^{-4}$ |
| Grad Opt ($\gamma = 0.5$) | 0.014 | 0.0336 | 14.14 |
| Grad Opt ($\gamma = 0.8$) | 0.081 | 0.0370 | 80.51 |
| SLGHr ($\gamma = 0.995$) | 6.650 | 0.0327 | $6.9 \times 10^{-5}$ |
| SLGHr ($\gamma = 0.999$) | 0.017 | 0.0419 | 0.21 |
| CPL (95% path model training) | 0.022 | 0.0421 | $1.7 \times 10^{-3}$ |
| CPL (path model + local search) | 0.006 | 0.0018 | $2.3 \times 10^{-6}$ |
In addition, existing homotopy optimization methods can be sensitive to the progressive schedule design, and also suffer from high computational overheads due to the iterative optimization structure (Iwakiri et al., 2022). In contrast, our model-based CPL approach can be easily trained by a gradient-based optimization algorithm and obtains promising results. Why a large deep neural network can be easily trained by stochastic gradient descent is still an open research question. For the problem transformation in CPL, findings from work on smooth parametrization, such as Levin et al. (2022), would be useful for further studies.

## 6. Experimental Studies

In this section, we empirically evaluate different aspects of our proposed continuation path learning (CPL) method on non-convex optimization, noisy regression, and neural combinatorial optimization problems. Due to the page limit, we only report the main results in this section. Detailed experimental settings, extra experimental results, and more discussions of each problem can be found in Appendices A, B, and C, respectively.

### 6.1. Non-convex Optimization

We first test CPL's performance on three widely-used synthetic benchmark problems, namely the Ackley function (Ackley, 1987), the Rosenbrock function (Rosenbrock, 1960), and the Himmelblau function (Himmelblau et al., 1972). The original and Gaussian homotopy versions of these functions are shown in Figure 4. The original functions are non-convex and hence hard to optimize directly with simple gradient descent. In contrast, their Gaussian homotopy versions are much smoother and easier to optimize. We mainly follow the experimental setting of Iwakiri et al.
(2022), and compare CPL with a simple gradient descent algorithm (GD), a classical graduated optimization algorithm (Grad Opt) (Hazan et al., 2016) with two smoothing parameters ($\gamma = 0.5$ and $0.8$), and the recently proposed single-loop Gaussian homotopy algorithm with fixed-ratio update (SLGHr) (Iwakiri et al., 2022) with $\gamma = 0.995$ and $0.999$. The total numbers of function evaluations are 1,000, 20,000, and 2,000 for the Ackley, Rosenbrock, and Himmelblau problems, respectively. For CPL, since the goal is to optimize the original problem, we use 95% of the function evaluations for path model training and the remaining 5% for gradient-based local search initialized at $x_\phi(t=1)$. In other words, CPL uses the same number of function evaluations as the other methods.

Figure 5. Prediction loss vs. homotopy level $t$ on four different noisy nonlinear regression problems. CPL successfully learns the whole continuation path for all problems. With the learned path, the decision-maker can easily locate the optimal homotopy level $t^*$ with the minimum prediction loss for each problem.

According to the results reported in Table 1, the classic homotopy optimization methods are sensitive to the schedule design and hyperparameters, which are hard to tune for a new problem. Indeed, although all the homotopy optimization methods are carefully fine-tuned as in Iwakiri et al. (2022), no single method consistently achieves good performance on all problems. In contrast, our proposed CPL method with only the model training step (95% of the evaluations) can already generate promising $x_\phi(t=1)$ solutions with performance similar to the other homotopy optimization algorithms.
With extra gradient-based local search (the remaining 5% of evaluations), CPL achieves significantly better solutions on all test problems. These results confirm that learning the whole continuation path in a collaborative manner with knowledge transfer can help to solve the original complicated problem, and that CPL is a strong alternative to classical homotopy optimization methods.

### 6.2. Noisy Regression

In this subsection, we consider the following noisy nonlinear regression problem:

$$\min_{\alpha \in \mathbb{R}^d} \; t \|\hat{y} - \psi(\hat{X})\alpha\|_2^2 + (1 - t) \|\alpha\|^2, \tag{15}$$

where $\hat{X} \in \mathbb{R}^{n \times p}$ is a matrix of predictors, $\hat{y} \in \mathbb{R}^n$ is a noisy response vector, and $\psi: \mathbb{R}^p \rightarrow \mathbb{R}^d$ is a nonlinear mapping to the feature space. Given a noisy data set with $n$ data points $\{\hat{X}, \hat{y}\}$, our goal is to find the optimal parameters $\alpha^* = \arg\min_\alpha \|y - \psi(X)\alpha\|_2^2$ for the noiseless data $\{X, y\}$.

The noisy regression problem (15) is a proper homotopy surrogate controlled by the continuation level $t$. When $t = 0$, the problem reduces to $\min_{\alpha \in \mathbb{R}^d} \|\alpha\|^2$, which has the trivial solution $\alpha = \mathbf{0}_d$. When $t = 1$, it is the standard regression problem $\min_{\alpha \in \mathbb{R}^d} \|\hat{y} - \psi(\hat{X})\alpha\|_2^2$ without the regularization term $\|\alpha\|^2$, which could overfit the noisy data. To obtain the best prediction performance on the noiseless data $\{X, y\}$, we need to find the solution of the homotopy optimization problem (15) at a proper but unknown level $t^* \in \mathcal{T} = [0, 1]$.

Our proposed CPL method can learn the whole continuation path for this problem. We build a simple fully connected neural network $\alpha(t) = h_\phi(t)$ as the path model, which maps any valid homotopy level $t$ to its solution $\alpha(t)$, and reformulate the noisy regression problem into:

$$\min_\phi \; t \|\hat{y} - \psi(\hat{X}) h_\phi(t)\|_2^2 + (1 - t) \|h_\phi(t)\|^2. \tag{16}$$

The path model can then be trained by simple gradient descent to obtain the optimal model parameters $\phi$. We learn the continuation path for four different noisy regression problems and report the results in Figure 5.
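A minimal sketch of this reformulation with synthetic data and linear features $\psi(\hat{X}) = \hat{X}$ might look as follows; all sizes and hyperparameters here are our own illustrative choices:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic noisy data: Psi = psi(X_hat) features, y_hat = noisy responses.
n, d = 200, 10
Psi = torch.randn(n, d)
alpha_true = torch.randn(d)
y_hat = Psi @ alpha_true + 0.5 * torch.randn(n)

# Path model h_phi: homotopy level t -> regression weights alpha(t).
h = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, d))
opt = torch.optim.Adam(h.parameters(), lr=1e-3)

for _ in range(3000):
    t = torch.rand(64, 1)                       # t_m ~ Uniform(0, 1)
    alpha = h(t)                                # (batch, d)
    resid = y_hat.unsqueeze(1) - Psi @ alpha.T  # (n, batch)
    loss = (t.squeeze(1) * resid.pow(2).sum(0)
            + (1 - t.squeeze(1)) * alpha.pow(2).sum(1)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, `h(t)` can be swept over a grid of $t$ values to plot the prediction loss against the homotopy level, as in Figure 5, and to locate the best-generalizing $t^*$.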
CPL successfully learns the continuation path for all problems, which can be used to directly locate the optimal homotopy level t. The problem details can be found in Appendix B.

6.3. Neural Combinatorial Optimization

The CPL method can also improve the generalization performance of a neural combinatorial optimization (NCO) solver (Vinyals et al., 2015; Kool et al., 2019), which learns to directly predict the solution of a combinatorial optimization problem. We use the popular traveling salesman problem (TSP) to motivate our approach. A Euclidean TSP instance s is a fully connected graph with n nodes (cities), where each city i has its own two-dimensional location y_i. The traveling cost between two cities i and j is defined as the distance c_ij = ||y_i − y_j||₂. The goal of TSP is to find a valid tour with the shortest cost that visits all cities exactly once and then returns to the starting city. We can represent a valid tour as a permutation of all cities π = (π₁, …, π_i, …, π_n), π_i ∈ {1, …, n}, and the objective is to find the optimal tour that minimizes:

l(π|s) = c_{π_n π_1} + Σ_{i=1}^{n−1} c_{π_i π_{i+1}}. (17)

We can construct a smoother homotopy subproblem by gradually changing the cost between city pairs (Coy et al., 2000):

ĉ_ij(t) = c_ij^t, t ∈ T = [0, 1]. (18)

Figure 6. Homotopy Subproblems for TSP: (a) When t = 0, we have the smoothest subproblem where all the distances are the same. In this case, all valid tours will have the same length and be equally optimal, so solving TSP could be trivial. (b)-(d) When t increases, we have more rugged subproblems that gradually transform back to the original problem. (e) When t = 1, we have the original problem.

Table 2. Results of generalization performance on the realistic TSPLIB instances (with 51 to 200 cities).
All learning-based methods are trained on synthetic and uniformly distributed TSP instances with 100 cities. The full table can be found in Table 6.

Method          OR-Tools  Wu et al. (2021)  DACT   DACT (long)  AM-S    POMO   CPL
Optimality Gap  3.34%     4.17%             3.90%  2.07%        22.83%  2.15%  1.72%

We always normalize the original cost matrix such that c_ij ∈ [0, 1] and hence all ĉ_ij(t) ∈ [0, 1], and we also normalize the smoothed cost matrix to have the same mean as the original cost matrix, i.e., Σ_{i,j} ĉ_ij(t) = Σ_{i,j} c_ij, where we denote the mean cost by c̄. The cost matrices with different values of t are shown in Figure 6. With the smoothed costs, we can define the continuation function of the tour π for problem instance s as:

H(π, t|s) = ĉ_{π_n π_1}(t) + Σ_{i=1}^{n−1} ĉ_{π_i π_{i+1}}(t), t ∈ T = [0, 1]. (19)

For t = 0, all valid tours have the same total cost H(π, t = 0|s) = n c̄, and the optimization problem becomes trivial to solve. For t = 1, we recover the original objective function H(π, t = 1|s) = l(π|s).

With the above homotopy idea, we can build a CPL-enhanced solver for neural combinatorial optimization. As shown in Figure 7, our model can easily generate multiple solutions on the continuation path for the smoothed subproblems (e.g., t_m ∈ [0, 1]) to make a multi-shot prediction that might contain solutions with better generalization performance. By leveraging this property, CPL obtains robust generalization performance on unseen TSPLIB instances with unseen sizes and distributions, as shown in Table 2. Model details and more results can be found in Appendix C.

7. Conclusion and Limitation

Conclusion We have proposed a novel continuation path learning (CPL) method to approximate the whole continuation path for homotopy optimization. The experimental results have shown that CPL can successfully learn the solution path for different applications.
In addition, compared with the classical homotopy optimization method, CPL can achieve similar or even better performance on the original complicated problem. We believe CPL could be a novel and promising method for homotopy optimization.

Figure 7. CPL inference and counts of optimal solutions. For 10,000 random 100-city TSP instances, the model with t = 1 can only generate the best solutions for roughly 25% of the instances. We leverage solutions on the continuation path (t ∈ [0, 1)) to achieve better overall performance.

Limitation A limitation of CPL is that we need to build and train a model for learning the continuation path. The suitable model design mainly depends on the given problem, and some domain knowledge might be required for efficient model building. Additional theoretical analyses, such as the problem transformation and the relation to the value function approach, are important future work.

Acknowledgements

This work was supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 11215622).

References

Accorsi, L., Lodi, A., and Vigo, D. Guidelines for the computational testing of machine learning approaches to vehicle routing problems. Operations Research Letters, 50(2):229–234, 2022.
Ackley, D. A connectionist machine for genetic hillclimbing. Springer Science & Business Media, 1987.
Allgower, E. L. and Georg, K. Numerical continuation methods: an introduction, volume 13. Springer Science & Business Media, 1990.
Amos, B. Tutorial on amortized optimization for learning to optimize over continuous domains. arXiv preprint arXiv:2202.00665, 2022.
Anandkumar, A., Deng, Y., Ge, R., and Mobahi, H. Homotopy analysis for tensor PCA. In Conference on Learning Theory, pp. 79–104. PMLR, 2017.
Applegate, D. L., Bixby, R. E., Chvatal, V., and Cook, W. J.
The traveling salesman problem: A computational study, 2006.
Bank, B., Guddat, J., Klatte, D., Kummer, B., and Tammer, K. Non-linear parametric optimization. Springer, 1983.
Bello, I., Pham, H., Le, Q. V., Norouzi, M., and Bengio, S. Neural combinatorial optimization with reinforcement learning. In International Conference on Learning Representations (ICLR) Workshops, 2017.
Bengio, Y. Learning deep architectures for AI. Now Publishers Inc, 2009.
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In International Conference on Machine Learning (ICML), 2009.
Beyer, H.-G. and Schwefel, H.-P. Evolution strategies: a comprehensive introduction. Natural Computing, 1(1):3–52, 2002.
Blake, A. and Zisserman, A. Visual reconstruction. MIT Press, 1987.
Bonnans, J. F. and Shapiro, A. Perturbation analysis of optimization problems. Springer Science & Business Media, 2013.
Brox, T. and Malik, J. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):500–513, 2010.
Chapelle, O., Chi, M., and Zien, A. A continuation method for semi-supervised SVMs. In International Conference on Machine Learning (ICML), 2006.
Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. Entropy-SGD: Biasing gradient descent into wide valleys. In International Conference on Learning Representations (ICLR), 2016.
Chen, T., Chen, X., Chen, W., Wang, Z., Heaton, H., Liu, J., and Yin, W. Learning to optimize: A primer and a benchmark. The Journal of Machine Learning Research, 23(1):8562–8620, 2022.
Chen, X. and Tian, Y. Learning to perform local rewriting for combinatorial optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Coy, S. P., Golden, B. L., and Wasil, E. A. A computational study of smoothing heuristics for the traveling salesman problem.
European Journal of Operational Research, 124(1):15–27, 2000.
d O Costa, P. R., Rhuggenaath, J., Zhang, Y., and Akcay, A. Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. In Asian Conference on Machine Learning, pp. 465–480. PMLR, 2020.
Dosovitskiy, A. and Djolonga, J. You only train once: Loss-conditional training of deep networks. In International Conference on Learning Representations (ICLR), 2020.
Duchi, J. C., Jordan, M. I., Wainwright, M. J., and Wibisono, A. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.
Dunlavy, D. M. and O'Leary, D. P. Homotopy optimization methods for global optimization. Technical report, 2005.
Eaves, B. C. Homotopies for computation of fixed points. Mathematical Programming, 3(1):1–22, 1972.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
Eriksson, D., Pearce, M., Gardner, J., Turner, R. D., and Poloczek, M. Scalable global optimization via local Bayesian optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Evans, L. C. Partial differential equations, volume 19. American Mathematical Society, 2010.
Gao, K. and Sener, O. Generalizing Gaussian smoothing for random search. In International Conference on Machine Learning (ICML), 2022.
Gargiani, M., Zanelli, A., Dinh, Q. T., Diehl, M., and Hutter, F. Transferring optimality across data distributions via homotopy methods. In International Conference on Learning Representations (ICLR), 2020.
Garnett, R. Bayesian Optimization. Cambridge University Press, 2022. In preparation.
Gold, S., Rangarajan, A., and Mjolsness, E. Learning with preknowledge: clustering with point and graph matching distance measures. In Advances in Neural Information Processing Systems (NeurIPS), 1994.
Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.
Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. Automated curriculum learning for neural networks. In International Conference on Machine Learning (ICML), 2017.
Gulcehre, C., Moczulski, M., Visin, F., and Bengio, Y. Mollifying networks. In International Conference on Learning Representations (ICLR), 2017.
Guo, Y., Feng, S., Le Roux, N., Chi, E., Lee, H., and Chen, M. Batch reinforcement learning through continuation method. In International Conference on Learning Representations (ICLR), 2020.
Ha, D., Dai, A. M., and Le, Q. V. Hypernetworks. In International Conference on Learning Representations (ICLR), 2017.
Hansen, N. and Ostermeier, A. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5(Oct):1391–1415, 2004.
Hazan, E., Levy, K. Y., and Shalev-Shwartz, S. On graduated optimization for stochastic non-convex problems. In International Conference on Machine Learning (ICML), 2016.
Helsgaun, K. An effective implementation of the Lin-Kernighan traveling salesman heuristic. European Journal of Operational Research, 126(1):106–130, 2000.
Helsgaun, K. An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems. Roskilde: Roskilde University, pp. 24–50, 2017.
Himmelblau, D. M. et al. Applied nonlinear programming. McGraw-Hill, 1972.
Hottung, A. and Tierney, K. Neural large neighborhood search for the capacitated vehicle routing problem.
In European Conference on Artificial Intelligence (ECAI), 2020.
Hottung, A., Kwon, Y.-D., and Tierney, K. Efficient active search for combinatorial optimization problems. In International Conference on Learning Representations (ICLR), 2022.
Hruby, P., Duff, T., Leykin, A., and Pajdla, T. Learning to solve hard minimal problems. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Ingber, L. Simulated annealing: Practice versus theory. Mathematical and Computer Modelling, 18(11):29–57, 1993.
Iwakiri, H., Wang, Y., Ito, S., and Takeda, A. Single loop Gaussian homotopy method for non-convex optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
Jain, P. and Kar, P. Non-convex optimization for machine learning. Foundations and Trends in Machine Learning, 10(3-4):142–363, 2017.
Jayakumar, S. M., Menick, J., Czarnecki, W. M., Schwarz, J., Rae, J., Osindero, S., Teh, Y. W., Harley, T., and Pascanu, R. Multiplicative interactions and where to find them. In International Conference on Learning Representations (ICLR), 2020.
Joshi, C. K., Laurent, T., and Bresson, X. An efficient graph convolutional network technique for the travelling salesman problem. arXiv preprint arXiv:1906.01227, 2019.
Kim, M. and De la Torre, F. Gaussian processes multiple instance learning. In International Conference on Machine Learning (ICML), 2010.
Kirkpatrick, S., Gelatt Jr, C. D., and Vecchi, M. P. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
Kool, W., van Hoof, H., and Welling, M. Attention, learn to solve routing problems! In International Conference on Learning Representations (ICLR), 2019.
Kumar, M., Packer, B., and Koller, D. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems (NeurIPS), 2010.
Kwon, Y.-D., Choo, J., Kim, B., Yoon, I., Gwon, Y., and Min, S.
POMO: Policy optimization with multiple optima for reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Levin, E., Kileel, J., and Boumal, N. The effect of smooth parametrizations on nonconvex optimization landscapes. arXiv preprint arXiv:2207.03512, 2022.
Li, S., Drgona, J., Tuor, A. R., Pileggi, L., and Vrabie, D. L. Homotopy learning of parametric solutions to constrained optimization problems. https://openreview.net/forum?id=GdimRqV_S7, 2023.
Lin, X., Yang, Z., Zhang, Q., and Kwong, S. Controllable Pareto multi-task learning. arXiv preprint arXiv:2010.06313, 2020.
Lin, X., Yang, Z., and Zhang, Q. Pareto set learning for neural multi-objective combinatorial optimization. In International Conference on Learning Representations (ICLR), 2022a.
Lin, X., Yang, Z., Zhang, X., and Zhang, Q. Pareto set learning for expensive multiobjective optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2022b.
Liu, X., Lu, Y., Abbasi, A., Li, M., Mohammadi, J., and Kolouri, S. Teaching networks to solve optimization problems. arXiv preprint arXiv:2202.04104, 2022.
Ma, Y., Li, J., Cao, Z., Song, W., Zhang, L., Chen, Z., and Tang, J. Learning to iteratively solve routing problems with dual-aspect collaborative transformer. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
Mehmood, S. and Ochs, P. Automatic differentiation of some first-order methods in parametric optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
Mehmood, S. and Ochs, P. Differentiating the value function by using convex duality. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2021.
Mobahi, H. and Fisher III, J. W. On the link between Gaussian homotopy continuation and convex envelopes. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 43–56. Springer, 2015a.
Mobahi, H.
and Fisher III, J. W. A theoretical analysis of optimization by Gaussian continuation. In AAAI Conference on Artificial Intelligence (AAAI), 2015b.
Mobahi, H. and Ma, Y. Gaussian smoothing and asymptotic convexity. Coordinated Science Laboratory Report no. UILU-ENG-12-2201, DC-254, 2012.
Navon, A., Shamsian, A., Fetaya, E., and Chechik, G. Learning the Pareto front with hypernetworks. In International Conference on Learning Representations (ICLR), 2021.
Nesterov, Y. and Spokoiny, V. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
Perron, L. and Furnon, V. OR-Tools, 2019. URL https://developers.google.com/optimization/.
Rosenbrock, H. An automatic method for finding the greatest or least value of a function. The Computer Journal, 3(3):175–184, 1960.
Rosset, S. and Zhu, J. Piecewise linear regularized solution paths. The Annals of Statistics, pp. 1012–1030, 2007.
Schmidhuber, J. Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
Sener, O. and Koltun, V. Learning to guide random search. In International Conference on Learning Representations (ICLR), 2020.
Shahriari, B., Swersky, K., Wang, Z., Adams, R., and De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
Still, G. Lectures on parametric optimization: An introduction. Optimization Online, 2018.
Terzopoulos, D. The computation of visible-surface representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(4):417–438, 1988.
Tibshirani, R. J. and Taylor, J. The solution path of the generalized lasso. The Annals of Statistics, 39(3):1335–1371, 2011.
Tripp, A., Daxberger, E., and Hernández-Lobato, J. M. Sample-efficient optimization in the latent space of deep generative models via weighted retraining. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Uchoa, E., Pecin, D., Pessoa, A., Poggi, M., Vidal, T., and Subramanian, A. New benchmark instances for the capacitated vehicle routing problem. European Journal of Operational Research, 257(3):845–858, 2017.
Van Laarhoven, P. J. and Aarts, E. H. Simulated annealing. In Simulated annealing: Theory and applications, pp. 7–15. Springer, 1987.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
Wang, L., Fonseca, R., and Tian, Y. Learning search space partition for black-box optimization using Monte Carlo tree search. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Wasserstrom, E. Numerical solutions by the continuation method. SIAM Review, 15(1):89–119, 1973.
Widder, D. V. The heat equation, volume 67. Academic Press, 1976.
Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
Wu, Y., Song, W., Cao, Z., Zhang, J., and Lim, A. Learning improvement heuristics for solving routing problems. IEEE Transactions on Neural Networks and Learning Systems, 2021.
Wu, Z. The effective energy transformation scheme as a special continuation approach to global optimization with application to molecular conformation. SIAM Journal on Optimization, 6(3):748–768, 1996.
Xin, L., Song, W., Cao, Z., and Zhang, J. Multi-decoder attention model with embedding glimpse for solving vehicle routing problems. In AAAI Conference on Artificial Intelligence (AAAI), 2021.
Yang, R., Sun, X., and Narasimhan, K. A generalized algorithm for multi-objective reinforcement learning and policy adaptation.
In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Yuille, A. Energy functions for early vision and analog networks. Biological Cybernetics, 61(2):115–123, 1989.

We provide detailed experimental settings, extra experimental results, and more discussions in this Appendix.

A. Synthetic Test Benchmarks

A.1. Problem Definition

In this experiment, we evaluate the proposed continuation path learning (CPL) method against other homotopy optimization algorithms on the following three widely used non-convex optimization problems.

Ackley Optimization Problem (Ackley, 1987):

f(x, y) = −20 e^{−0.2√(0.5(x² + y²))} − e^{0.5(cos 2πx + cos 2πy)} + e + 20. (20)

Rosenbrock Optimization Problem (Rosenbrock, 1960):

f(x, y) = 100(y − x²)² + (1 − x)². (21)

Himmelblau Optimization Problem (Himmelblau et al., 1972):

f(x, y) = (x² + y − 11)² + (x + y² − 7)². (22)

A.2. Experimental Setting

We compare CPL with a simple gradient descent algorithm (GD), the classical graduated optimization algorithm (GradOpt) (Hazan et al., 2016) with two smoothing parameters (γ = 0.5 or 0.8), and the recently proposed single-loop Gaussian homotopy algorithm with fixed ratio update (SLGHr) (Iwakiri et al., 2022) with γ = 0.995 or 0.999. We report the results of these algorithms with the fine-tuned hyperparameters from Iwakiri et al. (2022). For our proposed CPL method, we build a simple fully connected (FC) neural network as the continuation path model. It has two hidden layers, each with 128 hidden nodes. Since the model's gradient can be decomposed by the simple chain rule as in Section 4.3, it can be easily optimized by the (zeroth-order) gradient descent algorithm, similar to other homotopy optimization algorithms. CPL learns to approximate the whole continuation path simultaneously, and hence does not require any predefined continuation schedule or smoothing parameter.
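The three benchmark objectives in Eqs. (20)-(22) can be transcribed directly; the small sketch below implements them and is handy for sanity-checking their known global minima (f(0, 0) = 0 for Ackley, f(1, 1) = 0 for Rosenbrock, and f(3, 2) = 0 for Himmelblau).

```python
import math

# Eqs. (20)-(22): the three synthetic non-convex benchmarks.

def ackley(x, y):
    """Ackley function (Eq. 20); global minimum f(0, 0) = 0."""
    return (-20.0 * math.exp(-0.2 * math.sqrt(0.5 * (x ** 2 + y ** 2)))
            - math.exp(0.5 * (math.cos(2 * math.pi * x) + math.cos(2 * math.pi * y)))
            + math.e + 20.0)

def rosenbrock(x, y):
    """Rosenbrock function (Eq. 21); global minimum f(1, 1) = 0."""
    return 100.0 * (y - x ** 2) ** 2 + (1.0 - x) ** 2

def himmelblau(x, y):
    """Himmelblau function (Eq. 22); f(3, 2) = 0 is one of four global minima."""
    return (x ** 2 + y - 11.0) ** 2 + (x + y ** 2 - 7.0) ** 2
```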
We use Gaussian homotopy (GH) as the homotopy method and mainly follow the experimental settings from Iwakiri et al. (2022) for each optimization problem.

Ackley The Ackley optimization problem does not have an analytical form for its Gaussian homotopy function, and hence we use the zeroth-order method to approximate its Gaussian homotopy gradient (11). The total number of function evaluations is 1,000 for all algorithms. CPL uses 950 evaluations for path model training and 50 evaluations for the final local search with homotopy level t = 1.

Rosenbrock The Gaussian homotopy function of the Rosenbrock optimization problem has the following analytical form (Mobahi & Ma, 2012; Iwakiri et al., 2022):

GH(x, y, t) = E_{u_x, u_y}[f(x + β(1 − t)u_x, y + β(1 − t)u_y)] (23)
            = 100x⁴ + [−200y + 600β²(1 − t)² + 1]x² − 2x + 100y² − 200β²(1 − t)²y + 300β⁴(1 − t)⁴ + 101β²(1 − t)² + 1.

Therefore, we can use the simple first-order gradient method for CPL and all the other homotopy optimization algorithms. The total number of function evaluations is 20,000. For CPL, 19,000 evaluations are used for path model training, and the remaining 1,000 are for the final local search with homotopy level t = 1. We set β = 1.5 according to Iwakiri et al. (2022).

Himmelblau Similar to the Rosenbrock problem, since the Himmelblau optimization problem is polynomial, it has an analytical Gaussian homotopy function:

GH(x, y, t) = E_{u_x, u_y}[f(x + β(1 − t)u_x, y + β(1 − t)u_y)] (24)
            = x⁴ + (2y + 6β²(1 − t)² − 21)x² + (2y² + 2β²(1 − t)² − 14)x + y⁴ + (6β²(1 − t)² − 13)y² + (2β²(1 − t)² − 22)y + 6β⁴(1 − t)⁴ − 34β²(1 − t)² + 170.

We also use the first-order gradient method to optimize this problem for all algorithms. The total number of function evaluations is 2,000. CPL uses 1,900 evaluations for model training and 100 for the final local search with t = 1. The parameter β is set to 2 as in Iwakiri et al. (2022).

A.3. Computational Cost

Table 3. Runtime of gradient descent, SLGH, and CPL with CPU or GPU.
Problem           Ackley  Himmelblau  Rosenbrock
# Iterations      1,000   2,000       20,000
Gradient Descent  0.9s    1.6s        11.2s
SLGH              0.9s    1.7s        12.6s
CPL (CPU)         0.9s    1.8s        13.2s
CPL (GPU)         1.6s    2.8s        15.4s

The CPL model training can be easily done with highly efficient deep learning frameworks such as PyTorch. For these non-convex optimization benchmarks, depending on the number of iterations, CPL typically needs 1 to 15 seconds to train the model, while the run times of the other model-free methods are 1 to 13 seconds. It should be pointed out that CPL learns the whole continuation path, while the other methods only find a single final solution. In addition, CPL training on a GPU (RTX-3080) is actually slower than its counterpart on a CPU. For such a small model, the reason could be that the cost of transferring data from RAM to the GPU outweighs the speedup of the GPU over the CPU.

A.4. Effect of the Model Size

Table 4. CPL performance on the original problem with different model sizes.

Model Size   SLGH     16      128     1024     16-16   128-128 (this paper)  1024-1024  16-16-16  128-128-128  1024-1024-1024
Ackley       6.650    2.605   0.478   0.261    0.089   0.006                 0.005      0.024     0.007        0.006
Himmelblau   6.9e-5   84.25   11.92   3.8e-05  19.46   2.3e-6                2.8e-6     8.74      2.1e-6       2.6e-6
Rosenbrock   0.0327   0.0548  0.0409  0.0282   0.0371  0.0018                0.0023     0.0220    0.0016       0.0019

The performance of CPL depends on the neural network architecture. Indeed, different optimization problems and applications could require different CPL models. A general guideline is that the model should be large enough to learn the continuation path for the given problem. Therefore, we should consider both the model size and the problem-specific structure. We report the results for models of various sizes in Table 4. Based on the results, a very small model (such as those with a single hidden layer) is not able to learn the whole continuation path, and hence has poor performance.
On the other hand, once the model has sufficient capacity (such as the three-layer models with more than 128 hidden units), further increasing the model size will not lead to significantly better performance. Therefore, it is important to choose a suitable model size for a given problem.

B. Noisy Regression

In this experiment, we test our proposed CPL method on the following four different noisy regression problems.

F1 This problem has the following ground-truth functional relation between x and y:

y = 0.5 sin(x) + 0.3 cos(2x) + 2 cos(3x), (25)

where x ∈ [−5, 5]. For the noisy regression, we use ŷ = y + 0.1ε with ε ∼ N(0, 1) as the noisy response value, and set ψ(x) = [sin(x), cos(2x), cos(3x)]. Therefore, the optimal α* = [0.5, 0.3, 2].

F2 This problem has the following ground-truth functional relation between x and y:

y = cos(x) + 0.2 sin(2x) + 0.5 sin(3x), (26)

where x ∈ [−5, 5]. For the noisy regression, we use ŷ = y + 0.1ε with ε ∼ N(0, 1) as the noisy response value, and set ψ(x) = [cos(x), sin(2x), sin(3x)]. Therefore, the optimal α* = [1, 0.2, 0.5].

F3 This problem has the following ground-truth functional relation between x and y:

y = e^{0.25x} − 0.2 cos(x) + 0.5 sin(4x), (27)

where x ∈ [−5, 5]. For the noisy regression, we use ŷ = y + 0.1ε with ε ∼ N(0, 1) as the noisy response value, and set ψ(x) = [e^{0.25x}, cos(x), sin(4x)]. Therefore, the optimal α* = [1, −0.2, 0.5].

F4 This problem has the following ground-truth functional relation between x and y:

y = 2 ln(0.25|x|) + 3 sin(6x) + 4 cos(0.5x), (28)

where x ∈ [−5, 5]. For the noisy regression, we use ŷ = y + 0.1ε with ε ∼ N(0, 1) as the noisy response value, and set ψ(x) = [ln(0.25|x|), sin(6x), cos(0.5x)]. Therefore, the optimal α* = [2, 3, 4].

Figure 8.
Continuation Path Learning Model for Neural Combinatorial Optimization: Our proposed CPL model learns to construct different solutions for the smoothed subproblems on the continuation path of a given problem instance. At inference time, it leverages these solutions to make a multi-shot prediction for better performance.

C. Continuation Path Learning for Neural Combinatorial Optimization

C.1. Continuation-based Constructive NCO Model

In this experiment, we focus on learning the continuation path for constructive neural combinatorial optimization (NCO) (Vinyals et al., 2015; Bello et al., 2017). This approach learns a policy model with parameters θ that constructs a solution (e.g., a tour π = (π₁, …, π_i, …, π_n), π_i ∈ {1, …, n}) for a combinatorial optimization problem instance s (e.g., a TSP instance) in an auto-regressive manner:

p_θ(π|s) = Π_{i=1}^n p_θ(π_i | s, π_{1:i−1}). (29)

In contrast to a policy with constant parameters θ, we propose to build a policy model with dynamic parameters θ(t) conditioned on the continuation level t:

p_{θ(t)}(π|s) = Π_{i=1}^n p_{θ(t)}(π_i | s, π_{1:i−1}), (30)

such that it can generate different tours for different smoothed subproblems, as in Figure 8. By assigning different values of t, we can easily obtain solutions of different smoothed subproblems on the continuation path for a given instance. The proposed continuation path learning idea and the policy model framework are general and can be used with any constructive NCO model. We use the seminal Attention Model (AM) (Kool et al., 2019) as our constructive model. Recent work shows that changing only (part of) the AM decoder parameters is sufficient to construct solutions with significantly better performance (Hottung et al., 2022) or with very different trade-offs among different objectives (Lin et al., 2022a).
Therefore, we also propose to let only part of the AM decoder parameters depend on the continuation level t:

[W^Q(t), W^{proj}(t)] = MLP(t), (31)

where W^Q(t) and W^{proj}(t) are the query and projection parameters of the Multi-Head Attention (MHA) (Vaswani et al., 2017) layer in the AM decoder. More advanced model structures such as multiplicative interactions (Jayakumar et al., 2020) and hypernetworks (Schmidhuber, 1992; Ha et al., 2017) could be used to learn the conditioned parameters, but we find a simple MLP model is good enough for our purpose. To construct a valid tour, our proposed model first takes the problem instance s (e.g., the fully connected graph with n nodes for TSP) as input to the AM encoder and obtains the n d-dimensional node embeddings [h₁, …, h_n] for all cities. For selecting the i-th city of the tour, the AM decoder combines the embedding of the first selected node h_{π_1} and the most recently selected node h_{π_{i−1}} to obtain the query embedding h_Q(t) = W^Q(t)[h_{π_1}, h_{π_{i−1}}]. Following the setting of POMO (Kwon et al., 2020), we do not include an extra graph embedding. The query embedding is further updated by multi-head attention over all node embeddings:

ĥ_Q = MHA(Q = h_Q(t), K = W^K[h₁, …, h_n], V = W^V[h₁, …, h_n]) W^{proj}(t), (32)

where W^{proj}(t) projects the multi-head output of MHA into a d-dimensional embedding. With the query embedding ĥ_Q and the embedding h_j for each city, we can calculate the logit for each city:

logit_j = −∞ if j = π_p for some p < i; logit_j = C tanh(ĥ_Q^T h_j / √d) otherwise.

The logits of all non-selected nodes are clipped into [−C, C] with C = 10 as in Kool et al. (2019), and all already selected nodes are masked with logit −∞. The policy model can then auto-regressively construct a valid tour by selecting cities following the probability p_{θ(t)}(π_i = j | s, π_{1:i−1}) = e^{logit_j} / Σ_k e^{logit_k}, where the sum runs over all cities k that have not yet been selected.
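As an illustration of the t-conditioned decoding step, the sketch below wires Eqs. (31)-(32) together with the clipped, masked logits. It is a deliberately simplified stand-in, not the paper's implementation: single-head attention instead of MHA, random embeddings in place of a trained AM encoder, toy sizes d = 16 and n = 10, and hypothetical names (`hypernet`, `decoder_logits`).

```python
import math
import torch

torch.manual_seed(0)
d, n = 16, 10  # toy embedding size and number of cities

# Stand-in node embeddings [h_1, ..., h_n]; a trained AM encoder would produce these.
H = torch.randn(n, d)

# Eq. (31): an MLP maps the continuation level t to the conditioned decoder
# parameters W_Q(t) (d x 2d) and W_proj(t) (d x d), flattened into one vector.
hypernet = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, d * 2 * d + d * d),
)

def decoder_logits(t, first, last, visited, C=10.0):
    """One t-conditioned decoding step (single-head sketch of Eq. 32)."""
    flat = hypernet(torch.tensor([[float(t)]])).squeeze(0)
    W_Q = flat[: d * 2 * d].view(d, 2 * d)
    W_proj = flat[d * 2 * d:].view(d, d)
    h_Q = W_Q @ torch.cat([H[first], H[last]])           # query from h_{π1}, h_{πi-1}
    attn = torch.softmax(H @ h_Q / math.sqrt(d), dim=0)  # attention over all cities
    h_hat = W_proj @ (attn @ H)                          # projected glimpse ĥ_Q
    logits = C * torch.tanh(H @ h_hat / math.sqrt(d))    # clipped into [-C, C]
    logits[list(visited)] = float("-inf")                # mask already selected cities
    return logits

# Selection probabilities for the next city at homotopy level t = 0.5.
probs = torch.softmax(decoder_logits(0.5, first=0, last=3, visited={0, 3}), dim=0)
```

Calling `decoder_logits` with different t values yields different selection distributions from the same instance, which is exactly what the multi-shot inference in Figure 7 exploits.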