# Global Optimality in Bivariate Gradient-based DAG Learning

Chang Deng, Kevin Bello, Pradeep Ravikumar, Bryon Aragam
Booth School of Business, University of Chicago, Chicago, IL 60637
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213
changdeng@chicagobooth.edu

Abstract: Recently, a new class of non-convex optimization problems motivated by the statistical problem of learning an acyclic directed graphical model from data has attracted significant interest. While existing work uses standard first-order optimization schemes to solve this problem, proving the global optimality of such approaches has proven elusive. The difficulty lies in the fact that, unlike other non-convex problems in the literature, this problem is not "benign", and possesses multiple spurious solutions that standard approaches can easily get trapped in. In this paper, we prove that a simple path-following optimization scheme globally converges to the global minimum of the population loss in the bivariate setting.

1 Introduction

Over the past decade, non-convex optimization has become a major topic of research within the machine learning community, in part due to the successes of training large-scale models with simple first-order methods such as gradient descent, along with their stochastic and accelerated variants, in spite of the non-convexity of the loss function. A large part of this research has focused on characterizing which problems have benign loss landscapes that are amenable to the use of gradient-based methods, i.e., landscapes with no spurious local minima, or whose spurious local minima can be easily avoided. By now, several theoretical results have established this property for different non-convex problems, such as: learning a two-hidden-unit ReLU network [48], learning (deep) over-parameterized quadratic neural networks [43, 27], low-rank matrix recovery [19, 13, 3], learning a two-layer ReLU network with a single non-overlapping convolutional filter [6], semidefinite matrix completion [4, 20], learning neural networks for binary classification with the addition of a single special neuron [30], and learning deep networks with independent ReLU activations [26, 11], to name a few.

Recently, a new class of non-convex optimization problems due to Zheng et al. [51] has emerged in the context of learning the underlying structure of a structural equation model (SEM) or Bayesian network. This underlying structure is typically represented by a directed acyclic graph (DAG), which makes the learning task highly complex due to its combinatorial nature. In general, learning DAGs is well-known to be NP-complete [8, 10]. The key innovation in Zheng et al. [51] was the introduction of a differentiable function $h$ whose level set at zero exactly characterizes DAGs, thus replacing the challenges of combinatorial optimization with those of non-convex optimization. Mathematically, this class of non-convex problems takes the following general form:

$\min_{\Theta} f(\Theta)$ subject to $h(W(\Theta)) = 0$,  (1)

where $\Theta \in \mathbb{R}^l$ represents the model parameters, $f : \mathbb{R}^l \to \mathbb{R}$ is a (possibly non-convex) smooth loss function (sometimes called a score function) that measures the fitness of $\Theta$, and $h : \mathbb{R}^{d \times d} \to [0, \infty)$ is a smooth non-convex function that takes the value of zero if and only if the induced weighted adjacency matrix of $d$ nodes, $W(\Theta)$, corresponds to a DAG.
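To make the role of $h$ concrete, the following minimal sketch evaluates the polynomial acyclicity characterization of Yu et al. [49], $h(W) = \operatorname{Tr}((I + \tfrac{1}{d} W \circ W)^d) - d$, which this paper adopts in Section 2; the two test matrices are illustrative choices, not taken from the paper.

```python
import numpy as np

def h_poly(W: np.ndarray) -> float:
    """Polynomial acyclicity function of Yu et al. [49]:
    h(W) = tr[(I + (1/d) W∘W)^d] - d, which is zero iff W is a DAG."""
    d = W.shape[0]
    M = np.eye(d) + (W * W) / d          # Hadamard square keeps entries nonnegative
    return np.trace(np.linalg.matrix_power(M, d)) - d

W_dag = np.array([[0.0, 0.7], [0.0, 0.0]])   # edge 1 -> 2 only: acyclic
W_cyc = np.array([[0.0, 0.7], [0.3, 0.0]])   # edges 1 -> 2 and 2 -> 1: a cycle
print(h_poly(W_dag))  # 0.0
print(h_poly(W_cyc))  # strictly positive
```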
Given the smoothness of $f$ and $h$, problem (1) can be solved using off-the-shelf nonlinear solvers, which has driven a series of remarkable developments in structure learning for DAGs. Multiple empirical studies have demonstrated that global or near-global minimizers for (1) can often be found in a variety of settings, such as linear models with Gaussian and non-Gaussian noises [e.g., 51, 34, 1], and non-linear models, represented by neural networks, with additive Gaussian noises [e.g., 29, 52, 49, 1]. The empirical success of learning DAGs via (1), which started with the NOTEARS method of Zheng et al. [51], bears a resemblance to the advancements in deep learning, where breakthroughs like AlexNet significantly boosted the field's recognition, even though there were notable successes before it.

Importantly, the reader should note that the majority of applications in ML consist of solving a single unconstrained non-convex problem. In contrast, the class of problems (1) contains a non-convex constraint. Thus, researchers have considered penalty methods such as the augmented Lagrangian [51, 52], the quadratic penalty [35], and a log-barrier [1]. In all cases, the penalty approach consists of solving a sequence of unconstrained non-convex problems, where the constraint is enforced progressively [see e.g. 2, for background]. In this work, we consider the following form of penalty:

$\min_{\Theta} g_{\mu_k}(\Theta) := \mu_k f(\Theta) + h(W(\Theta))$.  (2)

It was shown by Bello et al. [1] that, due to the invexity property of $h$,² solutions to (2) converge to a DAG as $\mu_k \to 0$. However, no guarantees on local/global optimality were given. With the above considerations in hand, one is inevitably led to ask the following questions:

(i) Are the loss landscapes $g_{\mu_k}(\Theta)$ benign for different $\mu_k$?
(ii) Is there a (tractable) solution path $\{\Theta_k\}$ that converges to a global minimum of (1)?

Due to the NP-completeness of learning DAGs, one would expect the answer to (i) to be negative in its most general form. Moreover, it is known from the classical theory of constrained optimization [e.g. 2] that if we could exactly and globally optimize (2) for each $\mu_k$, then the answer to (ii) would be affirmative. This is not a practical algorithm, however, since each such subproblem is non-convex. Thus we seek a solution path that can be tractably computed in practice, e.g., by gradient descent.

In this work, we focus on perhaps the simplest setting where interesting phenomena take place: a linear SEM with two nodes (i.e., $d = 2$), $f$ the population least squares loss (hence $f$ is convex), and $\Theta_k$ defined via gradient flow with warm starts. More specifically, we consider the case where $\Theta_k$ is obtained by following the gradient flow of $g_{\mu_k}$ with initial condition $\Theta_{k-1}$. Under this setting, to answer (i), it is easy to see that for a large enough $\mu_k$ the convex function $f$ dominates, and we can expect a benign, i.e., (almost) convex, landscape. Similarly, when $\mu_k$ approaches zero, the invexity of $h$ kicks in, and we could expect all stationary points to be (near) global minimizers.³ That is, at the extremes $\mu_k \to \infty$ and $\mu_k \to 0$, the landscapes seem well-behaved, and the reader might wonder whether it follows that the landscape is well-behaved for every $\mu_k \in [0, \infty)$. We answer the latter in the negative and show that there always exists a $\tau > 0$ such that the landscape of $g_{\mu_k}$ is non-benign for any $\mu_k < \tau$; namely, there exist three stationary points: i) a saddle point, ii) a spurious local minimum, and iii) the global minimum.
In addition, each of these stationary points has a wide basin of attraction, thus making the initialization of the gradient flow for $g_{\mu_k}$ crucial. Finally, we answer (ii) in the affirmative and provide an explicit scheduling for $\mu_k$ that guarantees the asymptotic convergence of $\Theta_k$ to the global minimum of (1). Moreover, we show that this scheduling cannot be arbitrary, as there exist sequences $\{\mu_k\}$ that lead $\{\Theta_k\}$ to a spurious local minimum. Overall, we establish the first set of results that study the optimization landscape and global optimality for the class of problems (1). We believe that this comprehensive analysis of the bivariate case provides a valuable starting point for future research in more complex settings.

²An invex function is any function all of whose stationary points are global minima. It is worth noting that the composite objective in (2) is not necessarily invex, even when $f$ is convex.
³This transition or path, from an optimizer of a simple function to an optimizer of a function that closely resembles the original constrained formulation, is also known as a homotopy.

Figure 1: Visualizing the non-convex landscape. (a) A contour plot of $g_\mu$ for $a = 0.5$ and $\mu = 0.005$ (see Section 2 for definitions); only a section of the landscape is shown for better visualization. The solid lines represent the contours, while the dashed lines represent the vector field $-\nabla g_\mu$. (b) Stationary points of $g_\mu$, given by $r(y; \mu) = 0$ and $t(x; \mu) = 0$ (see Section 4 for definitions), labeled $(x^*, y^*)$, $(x^{**}, y^{**})$, and $(x^{***}, y^{***})$.

Remark 1. We emphasize that solving (1) in the bivariate case is not an inherently difficult problem. Indeed, when there are only two nodes, there are only two DAGs to distinguish, and one can simply fit $f$ under each of the two possible DAGs and select the model with the lower value of $f$. However, evaluating $f$ for each possible DAG structure clearly cannot scale beyond 10 or 20 nodes, and is not a standard algorithm for solving (1). Instead, our focus here is on studying how (1) is actually solved in practice, namely, by solving unconstrained non-convex problems of the form (2). Previous work suggests that such gradient-based approaches indeed scale well to hundreds and even thousands of nodes [e.g. 51, 34, 1].

1.1 Our Contributions

More specifically, we make the following contributions:

1. We present a homotopy-based optimization scheme (Algorithm 2) to find global minimizers of the program (1) by iteratively decreasing the penalty coefficient according to a given schedule. Gradient flow is used to find the stationary points of (2) at each step, starting from the previous solution.
2. We prove that Algorithm 2 converges globally (i.e., regardless of the initialization of $W$) to the global minimum (Theorem 1).
3. We show that the non-convex program (1) is indeed non-benign, and that naïve implementations of black-box solvers are likely to get trapped in a bad local minimum. See Figure 1(a).
4. Experimental results verify our theory, consistently recovering the global minimum of (1), regardless of initialization or initial penalty value. We show that our algorithm converges to the global minimum while naïve approaches can get stuck.

The analysis consists of three main parts: First, we explicitly characterize the trajectory of the stationary points of (2). Second, we classify the number and type of all stationary points (Lemma 1) and use this to isolate the desired global minimum.
Finally, we apply Lyapunov analysis to identify the basin of attraction of each stationary point, which suggests a schedule for the penalty coefficient that ensures that the gradient flow is initialized within the right basin at the previous solution.

1.2 Related Work

The class of problems (1) falls under the umbrella of score-based methods, where, given a score function $f$, the goal is to identify the DAG structure with the lowest possible score [9, 22]. We note that DAG learning is popular in a wide range of domains, such as biology [40], genetics [50], and causal inference [44, 39], to name a few.

Score-based methods that consider the combinatorial constraint. Given the ample set of score-based methods in the literature, we briefly mention some classical works that attempt to optimize $f$ under the combinatorial DAG constraint. In particular, there are approximate algorithms such as the greedy search method of Chickering et al. [10], order search methods [45, 41, 38], the LP-relaxation method of Jaakkola et al. [24], and the dynamic programming approach of Loh and Bühlmann [31]. There are also exact methods such as GOBNILP [12] and Bene [42]; however, these algorithms only scale up to around 30 nodes.

Score-based methods that consider the continuous non-convex constraint $h$. The following works are the closest to ours, since they attempt to solve a problem of the form (1). Most of these developments either consider optimizing different score functions $f$, such as ordinary least squares [51, 52], the log-likelihood [29, 34], the evidence lower bound [49], or a regret function [53]; or consider different differentiable characterizations of acyclicity $h$ [49, 1]. However, none of the aforementioned works provide any type of optimality guarantee. Few studies have examined the optimization intricacies of problem (1). Wei et al. [47] investigated the optimality issues and provided local optimality guarantees under the assumption of convexity of the score $f$ and linear models. On the other hand, Ng et al. [35] analyzed the convergence to (local) DAGs of generic methods for solving nonlinear constrained problems, such as the augmented Lagrangian and quadratic penalty methods. In contrast to both, our work is the first to study global optimality and the loss landscapes of actual methods used in practice for solving (1).

Bivariate causal discovery. Even though the discrete DAG constraint does not pose a major challenge in a two-node model, the bivariate setting has been the subject of major research in the area of causal discovery. See for instance [36, 16, 32, 25] and references therein.

Penalty and homotopy methods. Classical global optimality guarantees exist for the penalty method when $f$ and $h$ are convex functions; see for instance [2, 5, 37]. However, to our knowledge, there are no global optimality guarantees for general classes of non-convex constrained problems, let alone for the specific type of non-convex functions $h$ considered in this work. On the other hand, homotopy methods (also referred to as continuation or embedding methods) are in many cases capable of finding better solutions than standard first-order methods for non-convex problems, although they typically do not come with global optimality guarantees either. When homotopy methods do come with global optimality guarantees, they are commonly more computationally intensive, as they involve discarding solutions, thus closely resembling simulated annealing methods; see for instance [15].
The authors of [21] characterize a family of non-convex functions for which a homotopy algorithm provably converges to a global optimum. However, the conditions defining this family are difficult to verify and very restrictive; moreover, their homotopy algorithm involves Gaussian smoothing, making it computationally more intensive than the procedure we study here. Other examples of homotopy methods in machine learning include [7, 18, 46, 17, 23]; in all these cases, no global optimality guarantees are given.

2 Preliminaries

The objective $f$ we consider can be written down explicitly as

$f(W) = \frac{1}{2}\,\mathbb{E}_X \|X - W^\top X\|_2^2$,  (3)

where $X \in \mathbb{R}^2$ is a random vector and $W \in \mathbb{R}^{2\times 2}$. Although not strictly necessary for the developments that follow, we begin by introducing the necessary background on linear SEMs that leads to this objective and the resulting optimization problem of interest.

The bivariate model. Let $X = (X_1, X_2) \in \mathbb{R}^2$ denote the random variables in the model, and let $N = (N_1, N_2) \in \mathbb{R}^2$ denote a vector of independent errors. A linear SEM over $X$ is then defined as $X = W^\top X + N$, where $W \in \mathbb{R}^{2\times 2}$ is a weighted adjacency matrix encoding the coefficients in the linear model. In order to represent a valid Bayesian network for $X$ [see e.g. 39, 44, for details], the matrix $W$ must be acyclic: more formally, the weighted graph induced by the adjacency matrix $W$ must be a DAG. This (non-convex) acyclicity constraint is the major computational hurdle that must be overcome in practice (cf. Remark 1). The goal is to recover the matrix $W$ from the random vector $X$. Since $W$ is acyclic, we can assume the diagonal of $W$ is zero (i.e., no self-loops). Thus, under the bivariate linear model, it suffices to consider two parameters $x$ and $y$ that define the matrix of parameters⁴

$W = W(x, y) = \begin{pmatrix} 0 & x \\ y & 0 \end{pmatrix}$.

For notational simplicity, we will use $f(W)$ and $f(x, y)$ interchangeably, and similarly for $h(W)$ and $h(x, y)$. Without loss of generality, we write the underlying parameter as

$W^* = \begin{pmatrix} 0 & a \\ 0 & 0 \end{pmatrix}$,

so that $X = (W^*)^\top X + N$ implies $X_1 = N_1$ and $X_2 = aX_1 + N_2$. In general, we only require the $N_i$ to have finite mean and variance; hence we do not assume Gaussianity. We assume that $\mathrm{Var}[N_1] = \mathrm{Var}[N_2]$, and for simplicity we consider $\mathbb{E}[N] = 0$ and $\mathrm{Cov}[N] = I$, where $I$ denotes the identity matrix. Finally, in the sequel we assume w.l.o.g. that $a > 0$.

The population least squares. In this work, we consider the population squared loss defined by (3). Writing $f$ in terms of $x$ and $y$, we have

$f(W) = \big((1 - ay)^2 + y^2 + (a - x)^2 + 1\big)/2$.

In fact, the population loss can be replaced with the empirical loss. In that case, our algorithm still attains the global minimum, $W_G$, of problem (6); however, the output $W_G$ then serves as an empirical estimate of $W^*$. An in-depth discussion of this topic can be found in Appendix B.

The non-convex function $h$. We use the continuous acyclicity characterization of Yu et al. [49], i.e., $h(W) = \operatorname{Tr}\big((I + \tfrac{1}{d} W \circ W)^d\big) - d$, where $\circ$ denotes the Hadamard product. For the bivariate case, this gives $h(W) = x^2y^2/2$. We note that the analysis presented in this work is not tailored to this version of $h$; that is, the same techniques used throughout this work apply to other existing formulations of $h$, such as the trace of the matrix exponential [51] and the log-det formulation [1]. Nonetheless, we consider the polynomial formulation of Yu et al. [49] to be the most amenable to analysis.
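As a quick sanity check on the closed form of $f$ derived above, the following minimal sketch compares it against a Monte Carlo estimate of $\tfrac{1}{2}\mathbb{E}\|X - W^\top X\|_2^2$ on simulated SEM data; the parameter values and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
a, x, y = 0.8, 0.3, 0.2
n = 1_000_000

# simulate the bivariate SEM X = (W*)^T X + N with W* = [[0, a], [0, 0]]
N = rng.normal(size=(n, 2))           # E[N] = 0, Cov[N] = I (Gaussianity optional)
X1 = N[:, 0]
X2 = a * X1 + N[:, 1]
X = np.stack([X1, X2], axis=1)

# empirical value of f(W) = (1/2) E ||X - W^T X||^2 for W = [[0, x], [y, 0]]
W = np.array([[0.0, x], [y, 0.0]])
resid = X - X @ W                     # row i equals X_i - W^T X_i
f_mc = 0.5 * np.mean(np.sum(resid**2, axis=1))

# closed form f(x, y) = ((1 - a y)^2 + y^2 + (a - x)^2 + 1) / 2
f_pop = ((1 - a * y) ** 2 + y**2 + (a - x) ** 2 + 1) / 2
print(f_mc, f_pop)                    # agree up to Monte Carlo error
```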
Remark 2. Our restriction to the bivariate case highlights the simplest setting in which this problem exhibits nontrivial behavior. Extending our analysis to higher dimensions remains a challenging future direction; however, we emphasize that even in two dimensions this problem is nontrivial. Our approach is similar to that taken in other parts of the literature that started with simple cases (e.g., single-neuron models in deep learning).

Remark 3. It is worth noting that our choice of the population least squares is not arbitrary. Indeed, for linear models with identity error covariance, such as the model considered in this work, it is known that the global minimizer of the population squared loss is unique and corresponds to the underlying matrix $W^*$. See Theorem 7 in [31].

Gluing all the pieces together, we arrive at the following version of (1) for the bivariate case:

$\min_{x,y} f(x, y) := \tfrac{1}{2}\big((1 - ay)^2 + y^2 + (a - x)^2 + 1\big)$ subject to $h(x, y) := x^2y^2/2 = 0$.  (6)

Moreover, for any $\mu \ge 0$, we have the corresponding version of (2):

$\min_{x,y} g_\mu(x, y) := \mu f(x, y) + h(x, y) = \tfrac{\mu}{2}\big((1 - ay)^2 + y^2 + (a - x)^2 + 1\big) + \tfrac{x^2y^2}{2}$.  (7)

To conclude this section, we present a visualization of the landscape of $g_\mu(x, y)$ in Figure 1(a), for $a = 0.5$ and $\mu = 0.005$. We can clearly observe the non-benign landscape of $g_\mu$: there exists a spurious local minimum, a saddle point, and the global minimum. In particular, the basin of attraction of the spurious local minimum is comparable to that of the global minimum, which is problematic for a local algorithm such as gradient flow (or gradient descent), as it can easily get trapped in the wrong basin if initialized there.

⁴Following the notation in (1), for the bivariate model we simply have $\Theta = (x, y)$ and $W(\Theta) = \begin{pmatrix} 0 & x \\ y & 0 \end{pmatrix}$. Although $x$ is used here to represent an edge weight and not data as in (3), this distinction should not lead to confusion.

Algorithm 1: Gradient Flow($f$, $z_0$)
1: set $z(0) = z_0$
2: solve $\frac{d}{dt}z(t) = -\nabla f(z(t))$
3: return $\lim_{t\to\infty} z(t)$

Algorithm 2: Homotopy algorithm for solving (1).
Input: Initial $W_0 = W(x_0, y_0)$, $\mu_0 \in \big[\frac{a^2}{4(a^2+1)^3}, \frac{a^2}{4}\big)$
Output: $\{W_{\mu_k}\}_{k=0}^{\infty}$
1: $W_{\mu_0} \leftarrow$ GradientFlow($g_{\mu_0}$, $W_0$)
2: for $k = 1, 2, \ldots$ do
3:   Let $\mu_k = (2/a)^{2/3}\,\mu_{k-1}^{4/3}$
4:   $W_{\mu_k} \leftarrow$ GradientFlow($g_{\mu_k}$, $W_{\mu_{k-1}}$)

3 A Homotopy-Based Approach and Its Convergence to the Global Optimum

To fix notation, let us write $W_k := W_{\mu_k} := \begin{pmatrix} 0 & x_{\mu_k} \\ y_{\mu_k} & 0 \end{pmatrix}$, and let $W_G$ denote the global minimizer of (6). In this section, we present our main result, which provides conditions under which solving a series of unconstrained problems (7) with first-order methods converges to the global optimum $W_G$ of (6), in spite of facing non-benign landscapes. Recall from Remark 3 that $W_G = \begin{pmatrix} 0 & a \\ 0 & 0 \end{pmatrix}$. Since we use a gradient flow path to connect $W_{\mu_k}$ and $W_{\mu_{k+1}}$, we specify this path in Algorithm 1 for clarity. Although the theory here assumes continuous-time gradient flow with $t \to \infty$, see Section 5 for an iteration complexity analysis for (discrete-time) gradient descent, which is a straightforward consequence of the continuous-time theory.

In Algorithm 2, we provide an explicit regime of initialization for the homotopy parameter $\mu_0$ and a specific scheduling for $\mu_k$ such that the solution path found by Algorithm 2 converges to the global optimum of (6). This is formally stated in Theorem 1, whose proof is given in Section 5.

Theorem 1. For any initialization $W_0$ and any $a \in \mathbb{R}$, the solution path provided by Algorithm 2 converges to the global optimum of (6), i.e., $\lim_{k\to\infty} W_{\mu_k} = W_G$.
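For concreteness, here is a minimal numerical sketch of Algorithm 2 for the bivariate objective (7). Plain gradient descent with a conservative step size stands in for the continuous-time gradient flow of Algorithm 1; the step size and iteration caps are illustrative choices, not the quantities analyzed in Section 5.

```python
import numpy as np

a = 0.5  # true edge weight (assumed known here, as in Algorithm 2)

def grad_g(w, mu):
    """Gradient of g_mu(x, y) = mu*f(x, y) + x^2 y^2 / 2, cf. (7)."""
    x, y = w
    gx = mu * (x - a) + x * y**2
    gy = mu * ((a**2 + 1) * y - a) + y * x**2
    return np.array([gx, gy])

def gradient_flow(w0, mu, lr=1e-2, steps=100_000, tol=1e-10):
    """Discrete-time stand-in for Algorithm 1 (gradient flow)."""
    w = w0.copy()
    for _ in range(steps):
        g = grad_g(w, mu)
        if np.linalg.norm(g) < tol:
            break
        w -= lr * g
    return w

# Algorithm 2: homotopy with the schedule mu_k = (2/a)^(2/3) * mu_{k-1}^(4/3)
mu = a**2 / (4 * (a**2 + 1) ** 3)  # valid initial mu_0 (left end of the regime)
w = np.array([0.0, 0.0])           # arbitrary initialization W_0
for k in range(30):
    w = gradient_flow(w, mu)
    mu = (2 / a) ** (2 / 3) * mu ** (4 / 3)
print(w)  # approaches (a, 0), i.e., W_G
```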
A few observations regarding Algorithm 2 are in order. First, when the underlying model parameter $a$ is bounded away from zero, the regime of initialization for $\mu_0$ is wide; on the other hand, if $a$ is close to zero, the interval for $\mu_0$ is much narrower. As a concrete example, if $a = 2$ then it suffices to have $\mu_0 \in [0.008, 1)$; whereas if $a = 0.2$ then the regime is about $\mu_0 \in [0.0089, 0.01)$. This matches the intuition that for a stronger signal $a$ it should be easier to detect the right direction of the underlying model. Second, although in Line 3 we set $\mu_k$ in a specific manner, it actually suffices to have

$\mu_k \in \Big[\big(\tfrac{\mu_{k-1}}{2}\big)^{2/3}\big(a^{1/3} - \sqrt{a^{2/3} - (4\mu_{k-1})^{1/3}}\big)^2,\ \mu_{k-1}\Big)$.

We simply chose a particular expression from this interval for clarity of presentation; see the proof in Section 5 for details.

As presented, Algorithm 2 is theoretical in nature, in the sense that the initialization of $\mu_0$ and the decay rate of $\mu_k$ in Line 3 depend on the underlying parameter $a$, which is unknown in practice. In Algorithm 3, we present a modification that is independent of $a$ and $W^*$. By instead assuming a lower bound on $a$, which is a standard assumption in the literature, we can prove that Algorithm 3 also converges to the global minimum:

Corollary 1. Initialize $\mu_0 = \tfrac{1}{27}$. If $a > \sqrt{5/27}$, then for any initialization $W_0$, Algorithm 3 outputs the global optimal solution to (6), i.e., $\lim_{k\to\infty} W_{\mu_k} = W_G$.

For more details on this modification, see Appendix A.

Algorithm 3: Practical (i.e., independent of $a$ and $W^*$) homotopy algorithm for solving (1).
Input: Initial $W_0 = W(x_0, y_0)$
Output: $\{W_{\mu_k}\}_{k=0}^{\infty}$
1: $\mu_0 \leftarrow 1/27$
2: $W_{\mu_0} \leftarrow$ GradientFlow($g_{\mu_0}$, $W_0$)
3: for $k = 1, 2, \ldots$ do
4:   Let $\mu_k = \big(2/\sqrt{5\mu_0}\big)^{2/3}\,\mu_{k-1}^{4/3}$
5:   $W_{\mu_k} \leftarrow$ GradientFlow($g_{\mu_k}$, $W_{\mu_{k-1}}$)

Figure 2: The behavior of $r(y; \mu)$ for different $\mu$. For $\mu > \tau$, there exists a single solution to $r(y; \mu) = 0$, which implies there is one stationary point of (7). When $\mu = \tau$, two solutions are found for $r(y; \mu) = 0$, i.e., two stationary points of (7). Conversely, when $\mu < \tau$, we observe three solutions to $r(y; \mu) = 0$, indicating three stationary points of (7): a local optimum, a saddle point, and a global optimum.

4 A Detailed Analysis of the Evolution of the Stationary Points

The homotopy approach in Algorithm 2 relies heavily on how the stationary points of (7) behave as a function of $\mu_k$. In this section, we dive deep into the properties of these critical points. By analyzing the first-order conditions for $g_\mu$, we first narrow our attention to the region $A = \{0 \le x \le a,\ 0 \le y \le \tfrac{a}{a^2+1}\}$. Solving the resulting equations yields an equation involving only the variable $y$:

$r(y; \mu) = \frac{a}{y} - \frac{\mu a^2}{(y^2 + \mu)^2} - (a^2 + 1)$.  (8)

Likewise, we can find an equation involving only the variable $x$:

$t(x; \mu) = \frac{a}{x} - \frac{\mu a^2}{(\mu(a^2 + 1) + x^2)^2} - 1$.  (9)

To understand the behavior of the stationary points of $g_\mu(W)$, we can examine the characteristics of $t(x; \mu)$ on the range $x \in [0, a]$ and the properties of $r(y; \mu)$ on the interval $y \in [0, \tfrac{a}{a^2+1}]$. In Figures 2 and 3, we show the behavior of $r(y; \mu)$ and $t(x; \mu)$ for $a = 1$. Theorems 5 and 6 in the appendix establish the existence of a $\tau > 0$ with the following useful property:

Corollary 2. For all $\mu < \tau$, the equation $\nabla g_\mu(W) = 0$ has three different solutions, denoted $W^*_\mu$, $W^{**}_\mu$, $W^{***}_\mu$, and

$\lim_{\mu\to 0} W^*_\mu = \begin{pmatrix} 0 & a \\ 0 & 0 \end{pmatrix}$, $\lim_{\mu\to 0} W^{**}_\mu = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$, $\lim_{\mu\to 0} W^{***}_\mu = \begin{pmatrix} 0 & 0 \\ \frac{a}{a^2+1} & 0 \end{pmatrix}$.

Note that the interesting regime takes place when $\mu < \tau$.
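To make Corollary 2 concrete, the following sketch numerically brackets the three roots of $r(y; \mu) = 0$ for a value $\mu < \tau$ and recovers each $x$-coordinate via the first-order condition $x = \mu a/(\mu + y^2)$; SciPy is assumed available, and the grid resolution is an illustrative choice.

```python
import numpy as np
from scipy.optimize import brentq

a, mu = 1.0, 0.01  # example parameters with mu < tau

def r(y):
    # r(y; mu) = a/y - mu*a^2/(y^2 + mu)^2 - (a^2 + 1), cf. (8)
    return a / y - mu * a**2 / (y**2 + mu) ** 2 - (a**2 + 1)

# scan for sign changes of r on (0, a/(a^2+1)]
ys = np.linspace(1e-6, a / (a**2 + 1), 100_000)
vals = r(ys)
roots = [brentq(r, ys[i], ys[i + 1])
         for i in range(len(ys) - 1) if vals[i] * vals[i + 1] < 0]
for y in roots:
    x = mu * a / (mu + y**2)   # first-order condition for x
    print(f"stationary point: x={x:.4f}, y={y:.4f}")
# three roots when mu < tau: the global branch (x near a, small y),
# the saddle, and the spurious branch (x near 0, y near a/(a^2+1))
```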
Then, we characterize the stationary points as either local minima or saddle points.

Figure 3: The behavior of $t(x; \mu)$ for different $\mu$. For $\mu > \tau$, there exists a single solution to $t(x; \mu) = 0$, which implies there is one stationary point of (7). When $\mu = \tau$, two solutions are found for $t(x; \mu) = 0$, i.e., two stationary points of (7). Conversely, when $\mu < \tau$, we observe three solutions to $t(x; \mu) = 0$, indicating three stationary points of (7): a local optimum, a saddle point, and a global optimum.

Lemma 1. Let $\mu < \tau$. Then $g_\mu(W)$ has two local minima, at $W^*_\mu$ and $W^{***}_\mu$, and a saddle point at $W^{**}_\mu$.

With the above results, it is established that $W^*_\mu$ converges to the global minimum $W_G$ as $\mu \to 0$. In the following section, in the proof of Theorem 1, we analyze in detail how to track $W^*_\mu$ and avoid the local minimum at $W^{***}_\mu$ by carefully designing the scheduling of $\mu_k$.

5 Convergence Analysis: From Continuous to Discrete

We now discuss the iteration complexity of our method when gradient descent is used in place of gradient flow. We begin with some preliminaries regarding the continuous-time analysis.

5.1 Continuous case: gradient flow

The key to ensuring the convergence of the gradient flow to $W^*_\mu$ is to accurately identify the basin of attraction of $W^*_\mu$. The following lemma provides a region that lies within this basin of attraction.

Lemma 2. Define $B_\mu = \{(x, y) \mid x^{**}_\mu < x \le a,\ 0 \le y < y^{**}_\mu\}$. Run Algorithm 1 with input $f = g_\mu(x, y)$ and $z_0 = (x(0), y(0))$, where $(x(0), y(0)) \in B_\mu$. Then for all $t \ge 0$ we have $(x(t), y(t)) \in B_\mu$, and $\lim_{t\to\infty}(x(t), y(t)) = (x^*_\mu, y^*_\mu)$.

In Figure 1(b), the lower-right rectangle corresponds to $B_\mu$. Lemma 2 implies that the gradient flow with any initialization inside $B_{\mu_{k+1}}$ converges to $W^*_{\mu_{k+1}}$. Then, by using the previous solution $W^*_{\mu_k}$ as the initial point, as long as it lies within the region $B_{\mu_{k+1}}$, the gradient flow converges to $W^*_{\mu_{k+1}}$, thereby achieving the goal of tracking $W^*_{\mu_{k+1}}$. The scheduling of $\mu_k$ prescribed in Algorithm 2 provides a sufficient condition to ensure this happens. The following lemma, proved in the appendix, is used in the proof of Theorem 1; it provides a lower bound for $y^{**}_\mu$ and an upper bound for $y^*_\mu$.

Lemma 3. If $\mu < \tau$, then $y^{**}_\mu > \sqrt{\mu}$, and

$\frac{(4\mu)^{1/3}}{2}\Big(a^{1/3} - \sqrt{a^{2/3} - (4\mu)^{1/3}}\Big) > y^*_\mu$.

Proof of Theorem 1. Suppose we are at iteration $k+1$ of Algorithm 2, so that $\mu_{k+1} < \mu_k$. If $\mu_k > \tau$ and $\mu_{k+1} > \tau$, then $g_{\mu_k}(x, y)$ and $g_{\mu_{k+1}}(x, y)$ each have a single stationary point, and Algorithm 1 converges to it. Hence, let us assume $\mu_{k+1} \le \tau$. From Theorem 6 in the appendix, we know that $x^{**}_{\mu_{k+1}} < x^{**}_{\mu_k}$. Then, the following relations hold:

$y^{**}_{\mu_{k+1}} \overset{(1)}{>} \sqrt{\mu_{k+1}} = \Big(\frac{2\mu_k^2}{a}\Big)^{1/3} \overset{(2)}{\ge} \frac{(4\mu_k)^{1/3}}{2}\Big(a^{1/3} - \sqrt{a^{2/3} - (4\mu_k)^{1/3}}\Big) \overset{(3)}{>} y^*_{\mu_k}$.

Here (1) and (3) are due to Lemma 3, and (2) follows from $1 - \sqrt{1-u} \le u$ for $0 \le u \le 1$. It follows that $(x^*_{\mu_k}, y^*_{\mu_k})$ lies in the region $\{(x, y) \mid x^{**}_{\mu_{k+1}} < x \le a,\ 0 \le y < y^{**}_{\mu_{k+1}}\} = B_{\mu_{k+1}}$. By Lemma 2, the gradient flow (Algorithm 1) then converges to $(x^*_{\mu_{k+1}}, y^*_{\mu_{k+1}})$. Finally, from Theorems 5 and 6, if $\lim_{k\to\infty}\mu_k = 0$, then $\lim_{k\to\infty} x^*_{\mu_k} = a$ and $\lim_{k\to\infty} y^*_{\mu_k} = 0$; thus the path converges to the global optimum, i.e., $\lim_{k\to\infty} W_{\mu_k} = W_G$. □

5.2 Discrete case: gradient descent

In Algorithms 2 and 4, gradient flow is employed to locate the next stationary point, which is not practically feasible. A viable alternative is to execute Algorithm 2 with the gradient flow replaced by gradient descent.
Now, at every iteration $k$, Algorithm 6 uses gradient descent, initialized at $W_{\mu_{k-1},\epsilon_{k-1}}$ and with step size $\eta_k = 1/(\mu_k(a^2+1) + 3a^2)$, to output $W_{\mu_k,\epsilon_k}$, an $\epsilon_k$-stationary point of $g_{\mu_k}$. The tolerance parameter $\epsilon_k$ can significantly influence the behavior of the algorithm and must be controlled across iterations. A convergence guarantee is established via the simplified theorem presented here; a more formal version of the theorem and a comprehensive description of the algorithm (i.e., Algorithm 6) can be found in Appendix C.

Theorem 2 (Informal). For any $\varepsilon_{\mathrm{dist}} > 0$, let $\mu_0$ satisfy a mild condition, use the updating rules

$\epsilon_k = \min\{\beta a \mu_k,\ \mu_k^{3/2}\}$, $\mu_{k+1} = (2\mu_k^2)^{2/3}\,\frac{(a + \epsilon_k/\mu_k)^{2/3}}{(a - \epsilon_k/\mu_k)^{4/3}}$,

and let $K \ge K(\mu_0, a, \varepsilon_{\mathrm{dist}}) = O\big(\ln\frac{\mu_0}{a\,\varepsilon_{\mathrm{dist}}}\big)$. Then, for any initialization $W_0$, following the update procedure above for $k = 0, \ldots, K$, we have

$\|W_{\mu_K,\epsilon_K} - W_G\|_2 \le \varepsilon_{\mathrm{dist}}$,

that is, $W_{\mu_K,\epsilon_K}$ is $\varepsilon_{\mathrm{dist}}$-close in Frobenius norm to the global optimum $W_G$. Moreover, the total number of gradient descent steps is upper bounded by

$O\Big((\mu_0 a^2 + a^2 + \mu_0)\Big(\frac{1}{a^6} + \frac{1}{\varepsilon_{\mathrm{dist}}^6}\Big)\Big)$.

6 Experiments

We conducted experiments to verify that Algorithms 2 and 4 both converge to the global minimum of (7). Our purpose is to illustrate two main points. First, we compare our updating scheme, as given in Line 3 of Algorithm 2, against a faster-decreasing updating scheme for $\mu_k$; in Figure 4 we illustrate how a naively faster decrease of $\mu$ can lead to a spurious local minimum. Second, in Figure 5, we show that regardless of the initialization, Algorithms 2 and 4 always return the global minimum. In the supplementary material, we provide additional experiments where the gradient flow is replaced with gradient descent. For more details, please refer to Appendix F.

Figure 4: Trajectory of the gradient flow path for two different update rules for $\mu_k$ with the same initialization and $\mu_0$. Here, "good scheduling" uses Line 3 of Algorithm 2, while "bad scheduling" uses a faster-decreasing scheme for $\mu_k$, which leads the path to a spurious local minimum.

7 Acknowledgments and Disclosure of Funding

K.B. was supported by NSF under Grant # 2127309 to the Computing Research Association for the CIFellows 2021 Project. B.A. was supported by NSF IIS-1956330, NIH R01GM140467, and the Robert H. Topel Faculty Research Fund at the University of Chicago Booth School of Business. This work was done in part while B.A. was visiting the Simons Institute for the Theory of Computing. P.R. was supported by ONR via N000141812861, and NSF via IIS-1909816, IIS-1955532, IIS-2211907. We are also grateful for the support of the University of Chicago Research Computing Center for assistance with the calculations carried out in this work.

Figure 5: Trajectories of the gradient flow path under three different random initializations (panels (a)–(c)). We observe that under a proper scheduling for $\mu_k$, they all converge to the global minimum.

References

[1] Bello, K., Aragam, B. and Ravikumar, P. [2022], 'DAGMA: Learning DAGs via M-matrices and a log-determinant acyclicity characterization', in Advances in Neural Information Processing Systems.
[2] Bertsekas, D. P. [1997], 'Nonlinear programming', Journal of the Operational Research Society 48(3), 334.
[3] Bhojanapalli, S., Neyshabur, B. and Srebro, N. [2016], 'Global optimality of local search for low rank matrix recovery', Advances in Neural Information Processing Systems 29.
[4] Boumal, N., Voroninski, V. and Bandeira, A.
[2016], 'The non-convex Burer-Monteiro approach works on smooth semidefinite programs', Advances in Neural Information Processing Systems 29.
[5] Boyd, S., Boyd, S. P. and Vandenberghe, L. [2004], Convex Optimization, Cambridge University Press.
[6] Brutzkus, A. and Globerson, A. [2017], 'Globally optimal gradient descent for a ConvNet with Gaussian inputs', in International Conference on Machine Learning, PMLR, pp. 605–614.
[7] Chen, W., Drton, M. and Wang, Y. S. [2019], 'On causal discovery with an equal-variance assumption', Biometrika 106(4), 973–980.
[8] Chickering, D. M. [1996], 'Learning Bayesian networks is NP-complete', in Learning from Data, Springer, pp. 121–130.
[9] Chickering, D. M. [2003], 'Optimal structure identification with greedy search', JMLR 3, 507–554.
[10] Chickering, D. M., Heckerman, D. and Meek, C. [2004], 'Large-sample learning of Bayesian networks is NP-hard', Journal of Machine Learning Research 5, 1287–1330.
[11] Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B. and LeCun, Y. [2015], 'The loss surfaces of multilayer networks', in Artificial Intelligence and Statistics, PMLR, pp. 192–204.
[12] Cussens, J. [2012], 'Bayesian network learning with cutting planes', arXiv preprint arXiv:1202.3713.
[13] De Sa, C., Re, C. and Olukotun, K. [2015], 'Global convergence of stochastic gradient descent for some non-convex matrix problems', in International Conference on Machine Learning, PMLR, pp. 2332–2341.
[14] Dontchev, A. L. and Rockafellar, R. T. [2009], Implicit Functions and Solution Mappings: A View from Variational Analysis, Vol. 11, Springer.
[15] Dunlavy, D. M. and O'Leary, D. P. [2005], 'Homotopy optimization methods for global optimization', Technical report, Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA.
[16] Duong, B. and Nguyen, T. [2022], 'Bivariate causal discovery via conditional divergence', in Conference on Causal Learning and Reasoning.
[17] Gargiani, M., Zanelli, A., Tran-Dinh, Q., Diehl, M. and Hutter, F. [2020], 'Convergence analysis of homotopy-SGD for non-convex optimization', arXiv preprint arXiv:2011.10298.
[18] Garrigues, P. and Ghaoui, L. [2008], 'An homotopy algorithm for the lasso with online observations', Advances in Neural Information Processing Systems 21.
[19] Ge, R., Jin, C. and Zheng, Y. [2017], 'No spurious local minima in nonconvex low rank problems: A unified geometric analysis', in International Conference on Machine Learning, PMLR, pp. 1233–1242.
[20] Ge, R., Lee, J. D. and Ma, T. [2016], 'Matrix completion has no spurious local minimum', Advances in Neural Information Processing Systems 29.
[21] Hazan, E., Levy, K. Y. and Shalev-Shwartz, S. [2016], 'On graduated optimization for stochastic non-convex problems', in International Conference on Machine Learning, PMLR, pp. 1833–1841.
[22] Heckerman, D., Geiger, D. and Chickering, D. M. [1995], 'Learning Bayesian networks: The combination of knowledge and statistical data', Machine Learning 20(3), 197–243.
[23] Iwakiri, H., Wang, Y., Ito, S. and Takeda, A. [2022], 'Single loop Gaussian homotopy method for non-convex optimization', arXiv preprint arXiv:2203.05717.
[24] Jaakkola, T., Sontag, D., Globerson, A. and Meila, M. [2010], 'Learning Bayesian network structure using LP relaxations', in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Vol. 9 of Proceedings of Machine Learning Research, pp. 358–365.
[25] Jiao, R., Lin, N., Hu, Z., Bennett, D. A., Jin, L. and Xiong, M.
[2018], 'Bivariate causal discovery and its applications to gene expression and imaging data analysis', Frontiers in Genetics 9, 347.
[26] Kawaguchi, K. [2016], 'Deep learning without poor local minima', Advances in Neural Information Processing Systems 29.
[27] Kazemipour, A., Larsen, B. W. and Druckmann, S. [2019], 'Avoiding spurious local minima in deep quadratic networks', arXiv preprint arXiv:2001.00098.
[28] Khalil, H. K. [2002], Nonlinear Systems, 3rd edn, Prentice Hall.
[29] Lachapelle, S., Brouillard, P., Deleu, T. and Lacoste-Julien, S. [2020], 'Gradient-based neural DAG learning', in International Conference on Learning Representations.
[30] Liang, S., Sun, R., Lee, J. D. and Srikant, R. [2018], 'Adding one neuron can eliminate all bad local minima', Advances in Neural Information Processing Systems 31.
[31] Loh, P.-L. and Bühlmann, P. [2014], 'High-dimensional learning of linear causal networks via inverse covariance estimation', The Journal of Machine Learning Research 15(1), 3065–3105.
[32] Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J. and Schölkopf, B. [2014], 'Distinguishing cause from effect using observational data: Methods and benchmarks', arXiv preprint.
[33] Nesterov, Y. [2018], Lectures on Convex Optimization, Vol. 137, Springer.
[34] Ng, I., Ghassami, A. and Zhang, K. [2020], 'On the role of sparsity and DAG constraints for learning linear DAGs', in Advances in Neural Information Processing Systems.
[35] Ng, I., Lachapelle, S., Ke, N. R., Lacoste-Julien, S. and Zhang, K. [2022], 'On the convergence of continuous constrained optimization for structure learning', in International Conference on Artificial Intelligence and Statistics, PMLR, pp. 8176–8198.
[36] Ni, Y. [2022], 'Bivariate causal discovery for categorical data via classification with optimal label permutation', arXiv preprint.
[37] Nocedal, J. and Wright, S. J. [1999], Numerical Optimization, Springer.
[38] Park, Y. W. and Klabjan, D. [2017], 'Bayesian network learning via topological order', The Journal of Machine Learning Research 18(1), 3451–3482.
[39] Pearl, J. [2009], Causality: Models, Reasoning, and Inference, 2nd edn, Cambridge University Press.
[40] Sachs, K., Perez, O., Pe'er, D., Lauffenburger, D. A. and Nolan, G. P. [2005], 'Causal protein-signaling networks derived from multiparameter single-cell data', Science 308(5721), 523–529.
[41] Scanagatta, M., de Campos, C. P., Corani, G. and Zaffalon, M. [2015], 'Learning Bayesian networks with thousands of variables', in NIPS, pp. 1864–1872.
[42] Silander, T. and Myllymaki, P. [2006], 'A simple approach for finding the globally optimal Bayesian network structure', in Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence.
[43] Soltanolkotabi, M., Javanmard, A. and Lee, J. D. [2018], 'Theoretical insights into the optimization landscape of over-parameterized shallow neural networks', IEEE Transactions on Information Theory 65(2), 742–769.
[44] Spirtes, P., Glymour, C. N., Scheines, R. and Heckerman, D. [2000], Causation, Prediction, and Search, MIT Press.
[45] Teyssier, M. and Koller, D. [2005], 'Ordering-based search: A simple and effective algorithm for learning Bayesian networks', in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence.
[46] Wauthier, F. and Donnelly, P. [2015], 'A greedy homotopy method for regression with nonconvex constraints', in Artificial Intelligence and Statistics, PMLR, pp. 1051–1060.
[47] Wei, D., Gao, T. and Yu, Y.
[2020], 'DAGs with no fears: A closer look at continuous optimization for learning Bayesian networks', in Advances in Neural Information Processing Systems.
[48] Wu, C., Luo, J. and Lee, J. D. [2018], 'No spurious local minima in a two hidden unit ReLU network', in International Conference on Learning Representations.
[49] Yu, Y., Chen, J., Gao, T. and Yu, M. [2019], 'DAG-GNN: DAG structure learning with graph neural networks', in International Conference on Machine Learning, PMLR, pp. 7154–7163.
[50] Zhang, B., Gaiteri, C., Bodea, L.-G., Wang, Z., McElwee, J., Podtelezhnikov, A. A., Zhang, C., Xie, T., Tran, L., Dobrin, R. et al. [2013], 'Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer's disease', Cell 153(3), 707–720.
[51] Zheng, X., Aragam, B., Ravikumar, P. K. and Xing, E. P. [2018], 'DAGs with NO TEARS: Continuous optimization for structure learning', in Advances in Neural Information Processing Systems.
[52] Zheng, X., Dan, C., Aragam, B., Ravikumar, P. and Xing, E. [2020], 'Learning sparse nonparametric DAGs', in International Conference on Artificial Intelligence and Statistics, PMLR, pp. 3414–3425.
[53] Zhu, S., Ng, I. and Chen, Z. [2020], 'Causal discovery with reinforcement learning', in International Conference on Learning Representations.

Algorithm 4: Find a path $\{W_{\mu_k}\}$ via a particular scheduling for $\mu_k$ when $a$ is unknown.
Input: $\mu_0 \in \big[\frac{a^2}{4(a^2+1)^3}, \frac{a^2}{4}\big)$
Output: $\{W_{\mu_k}\}_{k=0}^{\infty}$
1: $\hat{a} \leftarrow \sqrt{4(\mu_0 + \varepsilon)}$  // $\varepsilon \ge 0$ s.t. $\hat{a} < a$
2: $W_{\mu_0} \leftarrow$ GradientFlow($g_{\mu_0}$, 0)
3: for $k = 1, 2, \ldots$ do
4:   Let $\mu_{k+1} \in \big[(2/\hat{a})^{2/3}\mu_k^{4/3},\ \mu_k\big)$
5:   $W_{\mu_{k+1}} \leftarrow$ GradientFlow($g_{\mu_{k+1}}$, $W_{\mu_k}$)
6: return $\{W_{\mu_k}\}_{k=0}^{\infty}$

A Practical Implementation of Algorithm 2

We present a practical implementation of our homotopy algorithm in Algorithm 4. The updating scheme for $\mu_k$ is now independent of the parameter $a$, but, as presented, the initialization of $\mu_0$ still depends on $a$. This is for the following reason: it is possible to make the updating scheme independent of $a$ without imposing any additional assumptions on $a$, as evidenced by Lemma 4 below. The initialization of $\mu_0$, however, is trickier, and we must consider two separate cases:

1. No assumptions on $a$. In this case, if $a$ is too small, then the problem becomes harder and the initial choice of $\mu_0$ matters.
2. Lower bound on $a$. If we are willing to accept a lower bound on $a$, then there is an initialization of $\mu_0$ that does not depend on $a$.

In Corollary 1, we illustrate the latter point with the additional condition that $a > \sqrt{5/27}$. This essentially amounts to an assumption on the minimum signal, and is quite standard in the literature on learning SEMs.

Lemma 4. Under the assumption $\frac{a^2}{4(a^2+1)^3} \le \mu_0 < \frac{a^2}{4}$, Algorithm 4 outputs the global optimal solution to (6), i.e., $\lim_{k\to\infty} W_{\mu_k} = W_G$.

It turns out that the assumption in Lemma 4 is not overly restrictive, as there exist pre-determined sequences $\{\mu_k\}_{k=0}^{\infty}$ that ensure the effectiveness of Algorithm 4 for any value of $a$ greater than a certain threshold.

B From Population Loss to Empirical Loss

The transition from the population loss to the empirical loss involves two components. First, given the empirical loss, Algorithms 2 and 3 still attain the global minimum, $W_G$, of problem (6); the output now serves as an empirical estimator $\hat{a}$ rather than the ground truth $a$, and Theorem 1 and Corollary 1 continue to be valid. Second, the global optimum $W_G$ of the empirical loss possesses the same DAG structure as the underlying $W^*$: this follows from the finite-sample findings in Section 5 (specifically, Lemmas 18 and 19) of Loh and Bühlmann [31], which offer sufficient conditions on the sample size to ensure that the DAG structures of $W_G$ and $W^*$ are identical.
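Appendix B states that the homotopy scheme can be run on the empirical loss directly. The following is a minimal sketch of that variant under the Algorithm 3 schedule (which applies here since $a > \sqrt{5/27}$); gradient descent stands in for gradient flow, and the sample size, step size, and iteration counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
a, n = 0.8, 5_000                     # a > sqrt(5/27) ~ 0.43, so Corollary 1 applies

# draw n samples from the SEM X1 = N1, X2 = a*X1 + N2
N = rng.normal(size=(n, 2))
X = np.stack([N[:, 0], a * N[:, 0] + N[:, 1]], axis=1)

def grad_emp(w, mu):
    """Gradient of mu*f_hat + h, where f_hat(W) = (1/2n) sum_i ||X_i - W^T X_i||^2
    is the empirical version of (3) and h(x, y) = x^2 y^2 / 2."""
    x, y = w
    R = X - X @ np.array([[0.0, x], [y, 0.0]])    # rows are X_i - W^T X_i
    dfdx = -np.mean(R[:, 1] * X[:, 0])
    dfdy = -np.mean(R[:, 0] * X[:, 1])
    return np.array([mu * dfdx + x * y**2, mu * dfdy + y * x**2])

def descend(w, mu, lr=1e-2, steps=100_000, tol=1e-8):
    for _ in range(steps):
        g = grad_emp(w, mu)
        if np.linalg.norm(g) < tol:
            break
        w = w - lr * g
    return w

mu0 = 1 / 27                          # Algorithm 3 initialization
rate = (2 / np.sqrt(5 * mu0)) ** (2 / 3)
mu, w = mu0, np.array([0.0, 0.0])
for k in range(25):
    w = descend(w, mu)
    mu = rate * mu ** (4 / 3)
print(w)                              # x approaches an estimate of a, y approaches 0
```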
C From Continuous to Discrete: Gradient Descent

Previously, gradient flow was employed to address the intermediate problem (7), a method that poses implementation challenges in a computational setting. In this section, we introduce Algorithm 6, which leverages gradient descent to solve (7) in each iteration. This adjustment serves practical considerations. We start with the convergence results of gradient descent.

Definition 1. $f$ is $L$-smooth if $f$ is differentiable and for all $x, y \in \mathrm{dom}(f)$, $\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$.

Algorithm 5: Gradient Descent($f$, $\eta$, $W_0$, $\epsilon$)
Input: function $f$, step size $\eta$, initial point $W_0$, tolerance $\epsilon$
Output: $W_t$
1: $t \leftarrow 0$
2: while $\|\nabla f(W_t)\|_2 > \epsilon$ do
3:   $W_{t+1} \leftarrow W_t - \eta\nabla f(W_t)$; $t \leftarrow t + 1$

Algorithm 6: Homotopy algorithm using gradient descent for solving (1).
Input: Initial $W_{-1} = W(x_{-1}, y_{-1})$, $\mu_0 \in \big[\frac{a^2}{4(a^2+1)^3}\frac{(1+\beta)^4}{(1-\beta)^2},\ \frac{a^2}{4}\frac{(1-\delta)^3(1-\beta)^4}{(1+\beta)^2}\big]$, $\eta_0 = \frac{1}{\mu_0(a^2+1)+3a^2}$, $\epsilon_0 = \min\{\beta a\mu_0, \mu_0^{3/2}\}$
Output: $\{W_{\mu_k,\epsilon_k}\}_{k=0}^{\infty}$
1: $W_{\mu_0,\epsilon_0} \leftarrow$ GradientDescent($g_{\mu_0}$, $\eta_0$, $W_{-1}$, $\epsilon_0$)
2: for $k = 1, 2, \ldots$ do
3:   Let $\mu_k = (2\mu_{k-1}^2)^{2/3}\frac{(a+\epsilon_{k-1}/\mu_{k-1})^{2/3}}{(a-\epsilon_{k-1}/\mu_{k-1})^{4/3}}$
4:   Let $\eta_k = \frac{1}{\mu_k(a^2+1)+3a^2}$
5:   Let $\epsilon_k = \min\{\beta a\mu_k, \mu_k^{3/2}\}$
6:   $W_{\mu_k,\epsilon_k} \leftarrow$ GradientDescent($g_{\mu_k}$, $\eta_k$, $W_{\mu_{k-1},\epsilon_{k-1}}$, $\epsilon_k$)

Theorem 3 (Nesterov [33]). If $f$ is $L$-smooth, then gradient descent (Algorithm 5) with step size $\eta = 1/L$ finds an $\epsilon$-first-order stationary point (i.e., $\|\nabla f(x)\|_2 \le \epsilon$) in at most $2L(f(x_0) - f^*)/\epsilon^2$ iterations.

One of the pivotal factors influencing the convergence of gradient descent is the selection of the step size; Theorem 3 selects $\eta = 1/L$. Therefore, our first step is to determine the smoothness of $g_\mu(W)$ within our region of interest, $A = \{0 \le x \le a,\ 0 \le y \le \frac{a}{a^2+1}\}$.

Lemma 5. Consider the function $g_\mu(W)$ defined in (7) on the region $A = \{0 \le x \le a,\ 0 \le y \le \frac{a}{a^2+1}\}$. Then for all $\mu \ge 0$, the function $g_\mu(W)$ is $(\mu(a^2+1) + 3a^2)$-smooth.

Since gradient descent is limited to identifying $\epsilon$-stationary points, we study the gradient of $g_\mu(W) = \mu f(W) + h(W)$, which takes the form

$\nabla g_\mu(W) = \big(\mu(x - a) + y^2x,\ \mu(a^2+1)y - a\mu + yx^2\big)$.

We therefore focus on the set where $\|\nabla g_\mu(W)\|_2 \le \epsilon$, which implies

$-\epsilon \le \mu(x - a) + y^2x \le \epsilon$ and $-\epsilon \le \mu(a^2+1)y - a\mu + yx^2 \le \epsilon$.

As a result,

$\{(x, y) \mid \|\nabla g_\mu(W)\|_2 \le \epsilon\} \subseteq \Big\{(x, y)\ \Big|\ \frac{\mu a - \epsilon}{\mu + y^2} \le x \le \frac{\mu a + \epsilon}{\mu + y^2},\ \frac{\mu a - \epsilon}{x^2 + \mu(a^2+1)} \le y \le \frac{\mu a + \epsilon}{x^2 + \mu(a^2+1)}\Big\}$.

We denote this region by $A_{\mu,\epsilon}$:

$A_{\mu,\epsilon} = \Big\{(x, y)\ \Big|\ \frac{\mu a - \epsilon}{\mu + y^2} \le x \le \frac{\mu a + \epsilon}{\mu + y^2},\ \frac{\mu a - \epsilon}{x^2 + \mu(a^2+1)} \le y \le \frac{\mu a + \epsilon}{x^2 + \mu(a^2+1)}\Big\}$.  (10)

Figures 6 and 7 illustrate the region $A_{\mu,\epsilon}$.

Figure 6: An example of $A_{\mu,\epsilon}$ for $a = 0.6$, $\mu = 0.009$, and $\epsilon = 0.00055$. The shaded region contains the $\epsilon$-stationary points, denoted $A_{\mu,\epsilon}$ and defined in (10). $A_{\mu,\epsilon}$ is the disjoint union of $A^1_{\mu,\epsilon}$ and $A^2_{\mu,\epsilon}$, which are defined in (21) and (22), respectively.

Figure 7: A localized illustration of $A_{\mu,\epsilon}$ that includes the point $(x^*_\mu, y^*_\mu)$. This region, referred to as $A^1_{\mu,\epsilon}$, is defined in (21).
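A minimal sketch of Algorithm 5 applied to $g_\mu$, using the step size $1/L$ with the smoothness constant $L = \mu(a^2+1) + 3a^2$ from Lemma 5 and the tolerance rule of Algorithm 6; the parameter values echo Figure 6 and are otherwise illustrative.

```python
import numpy as np

def grad_g(w, mu, a):
    """Gradient of g_mu: (mu*(x-a) + y^2 x, mu*((a^2+1)y - a) + y x^2)."""
    x, y = w
    return np.array([mu * (x - a) + y**2 * x,
                     mu * ((a**2 + 1) * y - a) + y * x**2])

def gradient_descent(mu, a, w0, eps, max_iter=10_000_000):
    """Algorithm 5: descend until ||grad g_mu||_2 <= eps, with step size 1/L
    for L = mu*(a^2+1) + 3a^2 (the smoothness constant of Lemma 5)."""
    eta = 1.0 / (mu * (a**2 + 1) + 3 * a**2)
    w = w0.copy()
    for _ in range(max_iter):
        g = grad_g(w, mu, a)
        if np.linalg.norm(g) <= eps:   # w is an eps-stationary point, in A_{mu,eps}
            break
        w = w - eta * g
    return w

a, beta = 0.6, 0.1
mu = 0.009
eps = min(beta * a * mu, mu**1.5)      # tolerance rule of Algorithm 6 (~0.00055)
w = gradient_descent(mu, a, np.array([a, 0.0]), eps)
print(w, np.linalg.norm(grad_g(w, mu, a)))  # close to (x*_mu, y*_mu)
```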
Given that gradient descent can only locate $\epsilon$-stationary points within the region $A_{\mu,\epsilon}$ at each iteration, the boundary of $A_{\mu,\epsilon}$ becomes a critical component of our analysis. To facilitate a clear presentation, we first establish some notation. Consider the system

$x = \frac{\mu a}{\mu + y^2}$,  (11a)
$y = \frac{\mu a}{\mu(a^2+1) + x^2}$.  (11b)

If this system has a single solution, we denote it by $(x^*_\mu, y^*_\mu)$. If it has two solutions, we denote them by $(x^*_\mu, y^*_\mu)$ and $(x^{**}_\mu, y^{**}_\mu)$, with $x^{**}_\mu < x^*_\mu$. If it has three distinct solutions, we denote them by $(x^*_\mu, y^*_\mu)$, $(x^{**}_\mu, y^{**}_\mu)$, $(x^{***}_\mu, y^{***}_\mu)$, where $x^{***}_\mu < x^{**}_\mu < x^*_\mu$. Next, consider the perturbed system

$x = \frac{\mu a - \epsilon}{\mu + y^2}$,  (12a)
$y = \frac{\mu a + \epsilon}{\mu(a^2+1) + x^2}$,  (12b)

whose solutions we denote in the same way with the subscript $(\mu, \epsilon)$: a single solution is $(x^*_{\mu,\epsilon}, y^*_{\mu,\epsilon})$; two solutions are $(x^*_{\mu,\epsilon}, y^*_{\mu,\epsilon})$ and $(x^{**}_{\mu,\epsilon}, y^{**}_{\mu,\epsilon})$ with $x^{**}_{\mu,\epsilon} < x^*_{\mu,\epsilon}$; three solutions additionally include $(x^{***}_{\mu,\epsilon}, y^{***}_{\mu,\epsilon})$ with $x^{***}_{\mu,\epsilon} < x^{**}_{\mu,\epsilon} < x^*_{\mu,\epsilon}$. Finally, consider

$x = \frac{\mu a + \epsilon}{\mu + y^2}$,  (13a)
$y = \frac{\mu a - \epsilon}{\mu(a^2+1) + x^2}$,  (13b)

whose solutions we denote analogously with the subscript $(\mu, \epsilon^-)$, e.g., $(x^*_{\mu,\epsilon^-}, y^*_{\mu,\epsilon^-})$, ordered in the same manner.

Remark 4. There always exists at least one solution to each of the above systems. When $\mu$ is sufficiently small, each system has three solutions, as demonstrated in Theorems 5 and 9.

The parameter $\epsilon$ can substantially influence the behavior of the systems (12a)–(12b) and (13a)–(13b). A crucial consideration is to ensure that $\epsilon$ remains adequately small. To facilitate this, we introduce a new parameter $\beta \in (0, 1)$, whose specific value will be determined later, and we constrain $\epsilon$ to satisfy

$\epsilon \le \beta a \mu$.  (14)

Following the same procedure as in the case $\epsilon = 0$, substituting (12a) into (12b) yields an equation involving only the variable $y$:

$r_\epsilon(y; \mu) = \frac{a + \epsilon/\mu}{y} - (a^2+1) - \frac{(\mu a - \epsilon)^2/\mu}{(y^2 + \mu)^2}$.  (15)

Substituting (12b) into (12a) yields an equation involving only the variable $x$:

$t_\epsilon(x; \mu) = \frac{a - \epsilon/\mu}{x} - 1 - \frac{(\mu a + \epsilon)^2/\mu}{(\mu(a^2+1) + x^2)^2}$.  (16)

Proceeding similarly for (13a) and (13b),

$r_{\epsilon^-}(y; \mu) = \frac{a - \epsilon/\mu}{y} - (a^2+1) - \frac{(\mu a + \epsilon)^2/\mu}{(y^2 + \mu)^2}$,  (17)
$t_{\epsilon^-}(x; \mu) = \frac{a + \epsilon/\mu}{x} - 1 - \frac{(\mu a - \epsilon)^2/\mu}{(\mu(a^2+1) + x^2)^2}$.  (18)

Given the substantial role the system (12a)–(12b) plays in our analysis, the presence of $\epsilon$ in these equations complicates matters; this can be avoided by considering the worst-case scenario, i.e., $\epsilon = \beta a\mu$. With this particular choice of $\epsilon$, we can rewrite (15) and (16) as follows, denoting them $r_\beta(y; \mu)$ and $t_\beta(x; \mu)$, respectively:

$r_\beta(y; \mu) = \frac{a(1+\beta)}{y} - (a^2+1) - \frac{\mu a^2(1-\beta)^2}{(y^2 + \mu)^2}$,  (19)
$t_\beta(x; \mu) = \frac{a(1-\beta)}{x} - 1 - \frac{\mu a^2(1+\beta)^2}{(\mu(a^2+1) + x^2)^2}$.  (20)

The functions $r_\epsilon(y; \mu)$, $r_{\epsilon^-}(y; \mu)$, and $r_\beta(y; \mu)$ possess properties similar to those of $r(y; \mu)$ defined in (8); see Theorems 7 and 8 for details.
Likewise, the functions $t_\epsilon(x; \mu)$, $t_{\epsilon^-}(x; \mu)$, and $t_\beta(x; \mu)$ share the characteristics of $t(x; \mu)$ defined in (9); see Theorem 9 for details.

As illustrated in Figure 6, the $\epsilon$-stationary region $A_{\mu,\epsilon}$ can be partitioned into two distinct areas, of which only the lower-right one contains $(x^*_\mu, y^*_\mu)$ and is of interest to our analysis. Moreover, $(x^*_{\mu,\epsilon}, y^*_{\mu,\epsilon})$ and $(x^{**}_{\mu,\epsilon}, y^{**}_{\mu,\epsilon})$ are extremal points of the two regions. The upcoming corollary substantiates this intuition.

Corollary 3. If $\mu < \tau$ ($\tau$ is defined in Theorem 5(v)), $\epsilon$ satisfies (14), and $\beta$ satisfies $\big(\frac{1+\beta}{1-\beta}\big)^2 \le a^2 + 1$, then the system (12a)–(12b) has at least two solutions. Moreover, $A_{\mu,\epsilon} = A^1_{\mu,\epsilon} \cup A^2_{\mu,\epsilon}$, where

$A^1_{\mu,\epsilon} = A_{\mu,\epsilon} \cap \{(x, y) \mid x \ge x^*_{\mu,\epsilon},\ y \le y^*_{\mu,\epsilon}\}$,  (21)
$A^2_{\mu,\epsilon} = A_{\mu,\epsilon} \cap \{(x, y) \mid x \le x^{**}_{\mu,\epsilon},\ y \ge y^{**}_{\mu,\epsilon}\}$.  (22)

Corollary 3 states that $A_{\mu,\epsilon}$ can be partitioned into two distinct regions, $A^1_{\mu,\epsilon}$ and $A^2_{\mu,\epsilon}$. Furthermore, for every $(x, y) \in A^1_{\mu,\epsilon}$ we have $x \ge x^*_{\mu,\epsilon}$ and $y \le y^*_{\mu,\epsilon}$; similarly, for every $(x, y) \in A^2_{\mu,\epsilon}$ we have $x \le x^{**}_{\mu,\epsilon}$ and $y \ge y^{**}_{\mu,\epsilon}$. The region $A^1_{\mu,\epsilon}$ is the "correct" region that gradient descent should identify. In this context, identifying the region amounts to pinpointing its extremal points; our focus is therefore on the extremal points of $A^1_{\mu,\epsilon}$ and $A^2_{\mu,\epsilon}$, namely $(x^*_{\mu,\epsilon}, y^*_{\mu,\epsilon})$ and $(x^{**}_{\mu,\epsilon}, y^{**}_{\mu,\epsilon})$. Furthermore, the key to ensuring convergence of gradient descent into $A^1_{\mu,\epsilon}$ is to accurately identify its basin of attraction. The following lemma provides a region within which, regardless of the initialization of gradient descent, the iterates converge into $A^1_{\mu,\epsilon}$.

Lemma 6. Assume $\mu < \tau$ ($\tau$ defined in Theorem 5(v)) and $\big(\frac{1+\beta}{1-\beta}\big)^2 \le a^2 + 1$. Define $B_{\mu,\epsilon} = \{(x, y) \mid x^{**}_{\mu,\epsilon} < x \le a,\ 0 \le y < y^{**}_{\mu,\epsilon}\}$. Run Algorithm 5 with input $f = g_\mu(x, y)$, $\eta = \frac{1}{\mu(a^2+1)+3a^2}$, and $W_0 = (x(0), y(0))$, where $(x(0), y(0)) \in B_{\mu,\epsilon}$. Then after at most

$\frac{2(\mu(a^2+1)+3a^2)\big(g_\mu(x(0), y(0)) - g_\mu(x^*_\mu, y^*_\mu)\big)}{\epsilon^2}$

iterations, $(x_t, y_t) \in A^1_{\mu,\epsilon}$.

Lemma 6 can be considered the gradient descent analogue of Lemma 2, and it plays a pivotal role in the proof of Theorem 4. In Figure 6, the lower-right rectangle corresponds to $B_{\mu,\epsilon}$. Lemma 6 implies that gradient descent with any initialization inside $B_{\mu_{k+1},\epsilon_{k+1}}$ eventually enters $A^1_{\mu_{k+1},\epsilon_{k+1}}$. Then, by using the previous solution $W_{\mu_k,\epsilon_k}$ as the initial point, as long as it lies within $B_{\mu_{k+1},\epsilon_{k+1}}$, gradient descent converges into $A^1_{\mu_{k+1},\epsilon_{k+1}}$, the $\epsilon$-stationary region that contains $W^*_{\mu_{k+1}}$, thereby achieving the goal of tracking $W^*_{\mu_{k+1}}$. The scheduling of $\mu_k$ prescribed in Algorithm 6 provides a sufficient condition for this to happen. We now present the theorem that guarantees the global convergence of Algorithm 6.

Theorem 4. Let $\delta \in (0, 1)$, $\beta \in (0, 1)$, $\big(\frac{1+\beta}{1-\beta}\big)^2 \le (1-\delta)(a^2+1)$, and let $\mu_0$ satisfy

$\frac{a^2}{4(a^2+1)^3}\frac{(1+\beta)^4}{(1-\beta)^2} \le \mu_0 \le \frac{a^2}{4}\frac{(1-\delta)^3(1-\beta)^4}{(1+\beta)^2}$.

Set the updating rules

$\epsilon_k = \min\{\beta a\mu_k,\ \mu_k^{3/2}\}$, $\mu_{k+1} = (2\mu_k^2)^{2/3}\frac{(a+\epsilon_k/\mu_k)^{2/3}}{(a-\epsilon_k/\mu_k)^{4/3}}$.

Then $\mu_{k+1} \le (1-\delta)\mu_k$.
Moreover, for any $\varepsilon_{\mathrm{dist}} > 0$, running Algorithm 6 for $K(\mu_0, a, \delta, \varepsilon_{\mathrm{dist}})$ outer iterations yields

$\|W_{\mu_K,\epsilon_K} - W_G\|_2 \le \varepsilon_{\mathrm{dist}}$,  (23)

where

$K(\mu_0, a, \delta, \varepsilon_{\mathrm{dist}}) \le \frac{1}{\ln(1/(1-\delta))}\max\Big\{\ln\frac{\mu_0}{\beta^2 a^2},\ \ln\frac{72\mu_0}{a^2(1-(1/2)^{1/4})},\ \ln\frac{3(4-\delta)\mu_0}{\varepsilon_{\mathrm{dist}}^2},\ \frac{1}{2}\ln\frac{46656\mu_0^2}{a^2\varepsilon_{\mathrm{dist}}^2},\ \frac{1}{3}\ln\frac{46656\mu_0^3}{a^4\varepsilon_{\mathrm{dist}}^2}\Big\}$.

The total number of gradient descent steps is

$\sum_{k=0}^{K}\frac{2(\mu_k(a^2+1)+3a^2)\big(g_{\mu_{k+1}}(W_{\mu_k,\epsilon_k}) - g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}})\big)}{\epsilon_{k+1}^2} \le 2(\mu_0(a^2+1)+3a^2)\max\Big\{\frac{3(4-\delta)}{\varepsilon_{\mathrm{dist}}^2},\ \frac{216}{a\varepsilon_{\mathrm{dist}}},\ \Big(\frac{216}{a\varepsilon_{\mathrm{dist}}}\Big)^{2/3},\ \frac{1}{\beta^2 a^2},\ \frac{72}{(1-(1/2)^{1/4})a^2}\Big\}^3 g_{\mu_0}(W_{\mu_0,\epsilon_0}) = O\Big((\mu_0 a^2 + a^2 + \mu_0)\Big(\frac{1}{\beta^6 a^6} + \frac{1}{\varepsilon_{\mathrm{dist}}^6} + \frac{1}{a^3\varepsilon_{\mathrm{dist}}^3} + \frac{1}{a^2\varepsilon_{\mathrm{dist}}^2} + 1\Big)\Big)$.

Proof. Upon substituting gradient flow with gradient descent, it is only possible to identify an $\epsilon$-stationary point of $g_\mu(W)$. This modification necessitates specifying the step size $\eta$ for gradient descent, as well as an updating rule for $\mu$. The adjustment procedure can substantially influence the result of Algorithm 6. In this proof, we derive the requirements on the update scheme for $\mu_k$, the step size $\eta_k$, and the tolerance $\epsilon_k$ that ensure their effective operation within Algorithm 6; the approach closely mirrors the proof of Theorem 1, albeit with more careful scrutiny. We will work out all the requirements on $\mu$, $\epsilon$, $\eta$, and subsequently verify that the choices in Theorem 4 conform to them. In the proof we occasionally use $\mu, \epsilon$ or $\mu_k, \epsilon_k$: when we employ $\mu, \epsilon$, the given (in)equality holds for any $\mu, \epsilon$; when we use $\mu_k, \epsilon_k$, we are examining how to set these parameters at distinct iterations.

Establishing the bound $y^{**}_{\mu,\epsilon} \ge \sqrt{\mu}$. First, consider the requirement $r_\epsilon(\sqrt{\mu}; \mu) \le 0$, i.e.,

$r_\epsilon(\sqrt{\mu}; \mu) = \frac{a + \epsilon/\mu}{\sqrt{\mu}} - (a^2+1) - \frac{(\mu a - \epsilon)^2}{4\mu^3} \le 0$.

This is always true when $\mu > 4/a^2$; when $\mu \le 4/a^2$ we require $\epsilon \le 2\mu^{3/2} + a\mu - 2\sqrt{2a\mu^{5/2} - \mu^3 a^2}$. We record this as Condition 1.

Condition 1. $\epsilon \le 2\mu^{3/2} + a\mu - 2\sqrt{2a\mu^{5/2} - \mu^3 a^2}$ when $\mu \le 4/a^2$.

Assume Condition 1 is satisfied. Since $r_\epsilon(y; \mu)$ is increasing on the interval $y \in [y_{lb,\mu,\epsilon}, y_{ub,\mu,\epsilon}]$, we know $y_{lb,\mu,\epsilon} \le \sqrt{\mu} \le y_{ub,\mu,\epsilon}$, and, based on Theorem 7(ii), $y_{lb,\mu,\epsilon} \le y^{**}_{\mu,\epsilon} \le y_{ub,\mu,\epsilon}$ with $r_\epsilon(\sqrt{\mu}; \mu) \le r_\epsilon(y^{**}_{\mu,\epsilon}; \mu) = 0$. Therefore $y^{**}_{\mu,\epsilon} \ge \sqrt{\mu}$.

Ensuring the correct solution path via gradient descent. Following the argument used to prove Theorem 1, we strive to ensure that gradient descent, when initialized at $W_{\mu_k,\epsilon_k} = (x_{\mu_k,\epsilon_k}, y_{\mu_k,\epsilon_k})$, converges into the "correct" $\epsilon_{k+1}$-stationary region (namely, the component of $\{\|\nabla g_{\mu_{k+1}}(W)\|_2 \le \epsilon_{k+1}\}$ that contains $(x^*_{\mu_{k+1}}, y^*_{\mu_{k+1}})$). For this to occur, we need, for every $(x, y) \in A^2_{\mu_{k+1},\epsilon_{k+1}}$,

$y \overset{(1)}{\ge} y^{**}_{\mu_{k+1},\epsilon_{k+1}} \overset{(2)}{>} \sqrt{\mu_{k+1}} \overset{(3)}{\ge} (2\mu_k^2)^{1/3}\frac{(a+\epsilon_k/\mu_k)^{1/3}}{(a-\epsilon_k/\mu_k)^{2/3}} \overset{(4)}{>} y^*_{\mu_k,\epsilon_k} \overset{(5)}{\ge} y_{\mu_k,\epsilon_k}$.  (24)

Here (1) and (5) are due to Corollary 3; (2) comes from the bound established above; (3) is based on the constraints we place on $\mu_k$ and $\mu_{k+1}$, which we present as Condition 2 below; and (4) follows from Theorem 7(ii) and the relation $y^*_{\mu_k,\epsilon_k} < y_{lb,\mu_k,\epsilon_k}$. Also, from Lemma 9, $\max_{\mu\le\tau} x^{**}_{\mu,\epsilon} \le \min_{\mu>0} x^*_{\mu,\epsilon}$. Hence, by invoking Lemma 6, we can affirm that gradient descent consistently traces the correct stationary point. We now state the conditions that make this happen.

Condition 2. $(1-\delta)\mu_k \ge \mu_{k+1} \ge (2\mu_k^2)^{2/3}\frac{(a+\epsilon_k/\mu_k)^{2/3}}{(a-\epsilon_k/\mu_k)^{4/3}}$.

In this context, our requirement extends beyond merely ensuring that $\mu_k$ decreases; we further stipulate that it decreases by a factor of $1-\delta$. Next, we impose one more constraint.

Condition 3. $\epsilon_k \le \beta a \mu_k$, i.e., (14).

Updating rules. We are now ready to check that the updating rules

$\epsilon_k = \min\{\beta a\mu_k,\ \mu_k^{3/2}\}$, $\mu_{k+1} = (2\mu_k^2)^{2/3}\frac{(a+\epsilon_k/\mu_k)^{2/3}}{(a-\epsilon_k/\mu_k)^{4/3}}$

satisfy the conditions above.

Checking the conditions. First, we check Condition 2.
Condition 2 requires

$(1-\delta)\mu_k \ge (2\mu_k^2)^{2/3}\frac{(a+\epsilon_k/\mu_k)^{2/3}}{(a-\epsilon_k/\mu_k)^{4/3}} \iff \mu_k\frac{(a+\epsilon_k/\mu_k)^2}{(a-\epsilon_k/\mu_k)^4} \le \frac{(1-\delta)^3}{4}$.

Note that $\epsilon_k \le \beta a\mu_k < a\mu_k$, so

$\mu_k\frac{(a+\epsilon_k/\mu_k)^2}{(a-\epsilon_k/\mu_k)^4} \le \mu_k\frac{(1+\beta)^2}{(1-\beta)^4}\frac{1}{a^2}$.

Therefore, Condition 2 is satisfied once

$\mu_k\frac{(1+\beta)^2}{(1-\beta)^4}\frac{1}{a^2} \le \frac{(1-\delta)^3}{4} \iff \mu_k \le \frac{a^2}{4}\frac{(1-\delta)^3(1-\beta)^4}{(1+\beta)^2}$,

and this holds because $\mu_k \le \mu_0 \le \frac{a^2}{4}\frac{(1-\delta)^3(1-\beta)^4}{(1+\beta)^2}$ by the condition imposed on $\mu_0$. Consequently, Condition 2 is satisfied under our choice of $\epsilon_k$.

Now we focus on Condition 1. Because $\epsilon_k \le a\beta\mu_k$, it suffices to show $a\beta\mu_k \le 2\mu_k^{3/2} + a\mu_k - 2\sqrt{2a\mu_k^{5/2} - \mu_k^3 a^2}$:

$a\beta\mu_k \le 2\mu_k^{3/2} + a\mu_k - 2\sqrt{2a\mu_k^{5/2} - \mu_k^3 a^2}$
$\iff 2\sqrt{2a\mu_k^{5/2} - \mu_k^3 a^2} \le 2\mu_k^{3/2} + (1-\beta)a\mu_k$
$\iff 4(2a\mu_k^{5/2} - \mu_k^3 a^2) \le 4\mu_k^3 + (1-\beta)^2a^2\mu_k^2 + 4(1-\beta)a\mu_k^{5/2}$
$\iff 0 \le 4(a^2+1)\mu_k^3 + (1-\beta)^2a^2\mu_k^2 - 4(1+\beta)a\mu_k^{5/2}$
$\iff 0 \le 4(a^2+1)\mu_k - 4(1+\beta)a\mu_k^{1/2} + (1-\beta)^2a^2$ (when $0 \le \mu_k \le 4/a^2$).

Viewing the right-hand side as a quadratic in $\mu_k^{1/2}$, its discriminant is non-positive whenever $\big(\frac{1+\beta}{1-\beta}\big)^2 \le a^2+1$. Because $\big(\frac{1+\beta}{1-\beta}\big)^2 \le (1-\delta)(a^2+1)$, this always holds, which implies that for any $\mu_k \ge 0$,

$0 \le 4(a^2+1)\mu_k - 4(1+\beta)a\mu_k^{1/2} + (1-\beta)^2a^2$.

Therefore Condition 1 holds. Condition 3 also holds by the choice of $\epsilon_k$.

Bounding the distance. Let $c = 72/a^2$, and assume that $\mu$ satisfies

$\mu \le \min\Big\{\frac{a^2}{72}\big(1 - (1/2)^{1/4}\big),\ \beta^2 a^2\Big\}$.  (25)

Note that when $\mu$ satisfies (25), then $\mu^{3/2} \le \beta a\mu$, so $\epsilon = \mu^{3/2}$; moreover $c\mu \le 1 - (1/2)^{1/4}$, i.e., $(1-c\mu)^4 \ge 1/2$. Then, using $\epsilon/\mu = \sqrt{\mu} \le a/2$ (which follows from (25)),

$t_\epsilon\big((a - \epsilon/\mu)(1 - c\mu); \mu\big) = \frac{1}{1-c\mu} - 1 - \frac{\mu(a+\epsilon/\mu)^2}{(\mu(a^2+1) + (a-\epsilon/\mu)^2(1-c\mu)^2)^2}$
$= \frac{c\mu}{1-c\mu} - \frac{\mu(a+\epsilon/\mu)^2}{(\mu(a^2+1) + (a-\epsilon/\mu)^2(1-c\mu)^2)^2}$
$\ge c\mu - \mu\frac{(a+\epsilon/\mu)^2}{(a-\epsilon/\mu)^4(1-c\mu)^4}$
$\ge c\mu - \mu\frac{(a+a/2)^2}{(a-a/2)^4(1-c\mu)^4}$
$= \mu\Big(c - \frac{36}{a^2(1-c\mu)^4}\Big) \ge 0$.

Then we know $(a - \epsilon/\mu)(1 - c\mu) < x^*_{\mu,\epsilon}$. Now we can bound the distance $\|W_{\mu_k,\epsilon_k} - W_G\|$. It is important to note that

$\|W_{\mu_k,\epsilon_k} - W_G\| = \sqrt{(x_{\mu_k,\epsilon_k} - a)^2 + (y_{\mu_k,\epsilon_k})^2} \le \max\Big\{\sqrt{(x^*_{\mu_k,\epsilon_k} - a)^2 + (y^*_{\mu_k,\epsilon_k})^2},\ \sqrt{(x^*_{\mu_k,\epsilon_k^-} - a)^2 + (y^*_{\mu_k,\epsilon_k})^2}\Big\}$,

where we use the facts that $x^*_{\mu_k,\epsilon_k} < x_{\mu_k,\epsilon_k}$, $x_{\mu_k,\epsilon_k} < x^*_{\mu_k,\epsilon_k^-}$, and $y_{\mu_k,\epsilon_k} < y^*_{\mu_k,\epsilon_k}$. Next, we bound these two terms separately. Due to (24), $y^*_{\mu_k,\epsilon_k} < (2\mu_k^2)^{1/3}\frac{(a+\epsilon_k/\mu_k)^{1/3}}{(a-\epsilon_k/\mu_k)^{2/3}} = \sqrt{\mu_{k+1}}$, and $(a - \epsilon_k/\mu_k)(1 - c\mu_k) < x^*_{\mu_k,\epsilon_k}$, so

$\sqrt{(x^*_{\mu_k,\epsilon_k} - a)^2 + (y^*_{\mu_k,\epsilon_k})^2} \le \sqrt{\mu_{k+1} + \big(a - (a - \epsilon_k/\mu_k)(1 - c\mu_k)\big)^2}$.

If $x^*_{\mu_k,\epsilon_k^-} \le a$, the second term is bounded by the first; otherwise we use $x^*_{\mu_k,\epsilon_k^-} \le a + \epsilon_k/\mu_k$, in which case

$\sqrt{(x^*_{\mu_k,\epsilon_k^-} - a)^2 + (y^*_{\mu_k,\epsilon_k})^2} \le \sqrt{\mu_{k+1} + (\epsilon_k/\mu_k)^2} = \sqrt{\mu_{k+1} + \mu_k}$.

As a result, using $\epsilon_k/\mu_k = \sqrt{\mu_k}$ and $\mu_{k+1} \le (1-\delta)\mu_k$,

$\|W_{\mu_k,\epsilon_k} - W_G\|^2 \le \mu_{k+1} + (ac\mu_k + \sqrt{\mu_k} - c\mu_k^{3/2})^2 \le (1-\delta)\mu_k + 3(a^2c^2\mu_k^2 + \mu_k + c^2\mu_k^3) = (4-\delta)\mu_k + 3a^2c^2\mu_k^2 + 3c^2\mu_k^3$.  (26)

It therefore suffices to make each of the three terms at most $\varepsilon_{\mathrm{dist}}^2/3$:

$(4-\delta)\mu_k \le (4-\delta)(1-\delta)^k\mu_0 \le \frac{\varepsilon_{\mathrm{dist}}^2}{3} \impliedby k \ge \frac{\ln(3(4-\delta)\mu_0/\varepsilon_{\mathrm{dist}}^2)}{\ln(1/(1-\delta))}$,  (27)
$3a^2c^2\mu_k^2 \le 3a^2c^2(1-\delta)^{2k}\mu_0^2 \le \frac{\varepsilon_{\mathrm{dist}}^2}{3} \impliedby k \ge \frac{\ln(46656\mu_0^2/(a^2\varepsilon_{\mathrm{dist}}^2))}{2\ln(1/(1-\delta))}$,  (28)
$3c^2\mu_k^3 \le 3c^2(1-\delta)^{3k}\mu_0^3 \le \frac{\varepsilon_{\mathrm{dist}}^2}{3} \impliedby k \ge \frac{\ln(46656\mu_0^3/(a^4\varepsilon_{\mathrm{dist}}^2))}{3\ln(1/(1-\delta))}$,  (29)

where we use $\mu_k \le (1-\delta)^k\mu_0$. In order to satisfy (25),

$\mu_k \le \mu_0(1-\delta)^k \le \frac{a^2}{72}\big(1-(1/2)^{1/4}\big) \impliedby k \ge \frac{\ln\big(72\mu_0/(a^2(1-(1/2)^{1/4}))\big)}{\ln(1/(1-\delta))}$,  (30)
$\mu_k \le \mu_0(1-\delta)^k \le \beta^2 a^2 \impliedby k \ge \frac{\ln(\mu_0/(\beta^2a^2))}{\ln(1/(1-\delta))}$.  (31)

Consequently, running Algorithm 6 for $K(\mu_0, a, \delta, \varepsilon_{\mathrm{dist}})$ outer iterations gives $\|W_{\mu_K,\epsilon_K} - W_G\|_2 \le \varepsilon_{\mathrm{dist}}$, with $K(\mu_0, a, \delta, \varepsilon_{\mathrm{dist}})$ bounded as in the statement of the theorem. By Lemma 6, iteration $k$ of Algorithm 6 requires at most

$\frac{2(\mu_k(a^2+1)+3a^2)\big(g_{\mu_{k+1}}(W_{\mu_k,\epsilon_k}) - g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}})\big)}{\epsilon_{k+1}^2}$

gradient descent steps. Let $\hat{K}(\mu_0, a, \delta, \varepsilon_{\mathrm{dist}})$ satisfy $\mu_{\hat{K}} \le \beta^2a^2 < \mu_{\hat{K}-1}$.
Hence, writing $\Delta_k := g_{\mu_{k+1}}(W_{\mu_k,\epsilon_k}) - g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}})$ and splitting the sum at $\widehat{K} = \widehat{K}(\mu_0,a,\delta,\varepsilon_{\mathrm{dist}})$ (before $\widehat{K}$ we have $\epsilon_{k+1} = \beta a\mu_{k+1}$, afterwards $\epsilon_{k+1} = \mu_{k+1}^{3/2}$; in both regimes $\epsilon_{k+1}^2 \ge \mu_K^3$), the total number of gradient steps required by Algorithm 6 can be bounded as follows:

$$\sum_{k=0}^{K} \frac{2\big(\mu_k(a^2+1)+3a^2\big)\Delta_k}{\epsilon_{k+1}^2} \le 2\big(\mu_0(a^2+1)+3a^2\big)\Big( \sum_{k=0}^{\widehat{K}-1} \frac{\Delta_k}{\epsilon_{k+1}^2} + \sum_{k=\widehat{K}}^{K} \frac{\Delta_k}{\epsilon_{k+1}^2} \Big) \le \frac{2\big(\mu_0(a^2+1)+3a^2\big)}{\mu_K^3}\sum_{k=0}^{K} \Delta_k$$
$$\le \frac{2\big(\mu_0(a^2+1)+3a^2\big)}{\mu_K^3}\sum_{k=0}^{K} \big( g_{\mu_k}(W_{\mu_k,\epsilon_k}) - g_{\mu_{k+1}}(W_{\mu_{k+1},\epsilon_{k+1}}) \big) = \frac{2\big(\mu_0(a^2+1)+3a^2\big)}{\mu_K^3}\big( g_{\mu_0}(W_{\mu_0,\epsilon_0}) - g_{\mu_{K+1}}(W_{\mu_{K+1},\epsilon_{K+1}}) \big)$$
$$\le \frac{2\big(\mu_0(a^2+1)+3a^2\big)}{\mu_K^3}\, g_{\mu_0}(W_{\mu_0,\epsilon_0}),$$

where the second-to-last inequality uses $g_{\mu_{k+1}}(W) \le g_{\mu_k}(W)$, which holds pointwise since $\mu_{k+1} < \mu_k$ and $f \ge 0$. Note from (27)–(31) that

$$\mu_{K(\mu_0,a,\delta,\varepsilon_{\mathrm{dist}})} = \min\Big\{ \frac{\varepsilon_{\mathrm{dist}}^2}{3(4-\delta)},\ \frac{a\varepsilon_{\mathrm{dist}}}{216},\ \Big(\frac{a^2\varepsilon_{\mathrm{dist}}}{216}\Big)^{2/3},\ \beta^2a^2,\ \frac{a^2\big(1-(1/2)^{1/4}\big)}{72} \Big\},$$

so that

$$\sum_{k=0}^{K} \frac{2\big(\mu_k(a^2+1)+3a^2\big)\Delta_k}{\epsilon_{k+1}^2} \le 2\big(\mu_0(a^2+1)+3a^2\big)\max\Big\{ \frac{3(4-\delta)}{\varepsilon_{\mathrm{dist}}^2},\ \frac{216}{a\varepsilon_{\mathrm{dist}}},\ \Big(\frac{216}{a^2\varepsilon_{\mathrm{dist}}}\Big)^{2/3},\ \frac{1}{\beta^2a^2},\ \frac{72}{(1-(1/2)^{1/4})a^2} \Big\}^3 g_{\mu_0}(W_{\mu_0,\epsilon_0}).$$

D Additional Theorems and Lemmas

Theorem 5 (Detailed properties of $r(y;\mu)$). For $r(y;\mu)$ in (8):
(i) For $\mu > 0$: $\lim_{y\to 0^+} r(y;\mu) = \infty$ and $r\big(\frac{a}{a^2+1};\mu\big) < 0$.
(ii) For $\mu > 0$: $r(\sqrt{\mu};\mu) < 0$.
(iii) For $\mu > \frac{a^2}{4}$: $\frac{dr(y;\mu)}{dy} < 0$ for all $y > 0$. For $0 < \mu \le \frac{a^2}{4}$:
$$\frac{dr(y;\mu)}{dy} > 0 \quad \text{for } y_{lb} < y < y_{ub}, \tag{32a}$$
$$\frac{dr(y;\mu)}{dy} \le 0 \quad \text{otherwise}, \tag{32b}$$
where
$$y_{lb} = \frac{(4\mu)^{1/3}}{2}\Big(a^{1/3} - \sqrt{a^{2/3} - (4\mu)^{1/3}}\Big), \qquad y_{ub} = \frac{(4\mu)^{1/3}}{2}\Big(a^{1/3} + \sqrt{a^{2/3} - (4\mu)^{1/3}}\Big).$$
(iv) For $0 < \mu < \frac{a^2}{4}$, let $p(\mu) = r(y_{ub};\mu)$. Then $p'(\mu) < 0$ and there exists a unique solution to $p(\mu) = 0$, denoted $\tau$. Additionally, $\tau < \frac{a^2}{4}$.
(v) There exists $\tau > 0$ such that, for $\mu > \tau$, the equation $r(y;\mu)=0$ has only one solution; at $\mu = \tau$, the equation has two solutions; and for $\mu < \tau$, it has three solutions. Moreover, $\tau < \frac{a^2}{4}$.
(vi) For $\mu < \tau$, the equation $r(y;\mu)=0$ has three solutions $y^{*}_\mu < y^{**}_\mu < y^{***}_\mu$, with
$$\frac{dy^{*}_\mu}{d\mu} > 0, \qquad \frac{dy^{**}_\mu}{d\mu} > 0, \qquad \frac{dy^{***}_\mu}{d\mu} < 0,$$
and
$$\lim_{\mu\to 0} y^{*}_\mu = 0, \qquad \lim_{\mu\to 0} y^{**}_\mu = 0, \qquad \lim_{\mu\to 0} y^{***}_\mu = \frac{a}{a^2+1}.$$
Moreover, $y^{*}_\mu < y_{lb} < \sqrt{\mu} < y^{**}_\mu < y_{ub} < y^{***}_\mu$.

Theorem 6 (Detailed properties of $t(x;\mu)$). For $t(x;\mu)$ in (9):
(i) For $\mu > 0$: $\lim_{x\to 0^+} t(x;\mu) = \infty$ and $t(a;\mu) < 0$.
(ii) If $\mu < \Big(\frac{a(\sqrt{a^2+1}-a)}{2(a^2+1)}\Big)^2$ or $\mu > \Big(\frac{a(\sqrt{a^2+1}+a)}{2(a^2+1)}\Big)^2$, then $t\big(\sqrt{\mu(a^2+1)};\mu\big) < 0$.
(iii) For $\mu > \frac{a^2}{4(a^2+1)^3}$: $\frac{dt(x;\mu)}{dx} < 0$ for all $x > 0$. For $0 < \mu \le \frac{a^2}{4(a^2+1)^3}$:
$$\frac{dt(x;\mu)}{dx} > 0 \quad \text{for } x_{lb} < x < x_{ub}, \tag{33a}$$
$$\frac{dt(x;\mu)}{dx} \le 0 \quad \text{otherwise}, \tag{33b}$$
where
$$x_{lb} = \frac{(4\mu a)^{1/3}}{2}\bigg(1 - \sqrt{1 - \frac{(4\mu)^{1/3}(a^2+1)}{a^{2/3}}}\bigg), \qquad x_{ub} = \frac{(4\mu a)^{1/3}}{2}\bigg(1 + \sqrt{1 - \frac{(4\mu)^{1/3}(a^2+1)}{a^{2/3}}}\bigg).$$
Moreover, $x_{lb} \le \sqrt{\mu(a^2+1)} \le x_{ub}$.
(iv) For $0 < \mu < \frac{a^2}{4(a^2+1)^3}$, let $q(\mu) = t(x_{lb};\mu)$. Then $q'(\mu) > 0$ and there exists a unique solution to $q(\mu)=0$, denoted $\tau$, with $\tau < \frac{a^2}{4(a^2+1)^3} \le \frac{1}{27}$.
(v) There exists $\tau > 0$ such that, for $\mu > \tau$, the equation $t(x;\mu)=0$ has only one solution; at $\mu = \tau$, it has two solutions; and for $\mu < \tau$, it has three solutions. Moreover, $\tau < \frac{a^2}{4(a^2+1)^3} \le \frac{1}{27}$.
(vi) For $\mu < \tau$, the equation $t(x;\mu)=0$ has three solutions $x^{***}_\mu < x^{**}_\mu < x^{*}_\mu$, with
$$\frac{dx^{*}_\mu}{d\mu} < 0, \qquad \frac{dx^{***}_\mu}{d\mu} > 0,$$
and
$$\lim_{\mu\to 0} x^{*}_\mu = a, \qquad \lim_{\mu\to 0} x^{**}_\mu = 0, \qquad \lim_{\mu\to 0} x^{***}_\mu = 0.$$
Moreover,
$$\max_{\mu\le\tau} x^{**}_\mu \le \frac{a\big(\sqrt{a^2+1}-a\big)}{2\sqrt{a^2+1}} \qquad \text{and} \qquad \frac{a\big(\sqrt{a^2+1}+a\big)}{2\sqrt{a^2+1}} \le \min_{\mu>0} x^{*}_\mu.$$
It also implies that $t\big(\frac{a(\sqrt{a^2+1}-a)}{2\sqrt{a^2+1}};\mu\big) \le 0$ and $\max_{\mu\le\mu_0} x^{**}_\mu < \min_{\mu>0} x^{*}_\mu$.
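To see Theorems 5 and 6 in action, the following sketch (ours; the parameter values are illustrative assumptions) brackets the three roots $y^{*}_\mu < y^{**}_\mu < y^{***}_\mu$ of $r(y;\mu) = 0$ by bisection, using the monotonicity interval $(y_{lb}, y_{ub})$ from Theorem 5(iii):

```python
import math

def r(y, mu, a):
    # r(y; mu) = a/y - mu a^2 / (y^2 + mu)^2 - (a^2 + 1), as in (8).
    return a / y - mu * a ** 2 / (y ** 2 + mu) ** 2 - (a ** 2 + 1)

def stationary_ys(mu, a):
    """Bisection for the three roots of r(.; mu) when mu < tau (Theorem 5(v)-(vi)).
    Requires mu < a^2/4 so that y_lb, y_ub are real."""
    s = (4 * mu) ** (1 / 3)
    root = math.sqrt(a ** (2 / 3) - s)
    y_lb = s / 2 * (a ** (1 / 3) - root)
    y_ub = s / 2 * (a ** (1 / 3) + root)

    def bisect(lo, hi):
        for _ in range(100):
            mid = (lo + hi) / 2
            if r(lo, mu, a) * r(mid, mu, a) <= 0:
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2

    # Ordering from Theorem 5(vi): y* < y_lb < y** < y_ub < y*** < a/(a^2+1).
    return bisect(1e-12, y_lb), bisect(y_lb, y_ub), bisect(y_ub, a / (a ** 2 + 1))

print(stationary_ys(mu=0.001, a=0.5))
```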
Lemma 7. Run Algorithm 1 with input $f = g_\mu(x,y)$ and $z_0 = (x(0), y(0))$, where $(x(0), y(0)) \in C_{\mu 3}$ as defined in (41). Then $(x(t), y(t)) \in C_{\mu 3}$ for all $t \ge 0$. Moreover, $\lim_{t\to\infty}(x(t), y(t)) = (x^{*}_\mu, y^{*}_\mu)$.

Lemma 8. For any $(x,y) \in C_{\mu 3}$ in (41) with $(x,y) \ne (x^{*}_\mu, y^{*}_\mu)$, we have $g_\mu(x,y) > g_\mu(x^{*}_\mu, y^{*}_\mu)$.

Theorem 7 (Detailed properties of $r_\epsilon(y;\mu)$). For $r_\epsilon(y;\mu)$ in (15):
(i) For $\mu > 0$, $\epsilon > 0$: $\lim_{y\to 0^+} r_\epsilon(y;\mu) = \infty$ and $r_\epsilon\big(\frac{a}{a^2+1};\mu\big) < 0$.
(ii) For $\mu > \frac{(a-\epsilon/\mu)^4}{4(a+\epsilon/\mu)^2}$: $\frac{dr_\epsilon(y;\mu)}{dy} < 0$. For $0 < \mu \le \frac{(a-\epsilon/\mu)^4}{4(a+\epsilon/\mu)^2}$:
$$\frac{dr_\epsilon(y;\mu)}{dy} > 0 \quad \text{for } y_{lb,\mu,\epsilon} < y < y_{ub,\mu,\epsilon}, \tag{34a}$$
$$\frac{dr_\epsilon(y;\mu)}{dy} \le 0 \quad \text{otherwise}, \tag{34b}$$
where $y_{lb,\mu,\epsilon} < y_{ub,\mu,\epsilon}$ are defined analogously to $y_{lb}, y_{ub}$ in Theorem 5(iii). Moreover,
$$y_{lb,\mu,\epsilon} \le \frac{(2\mu^2)^{1/3}(a+\epsilon/\mu)^{1/3}}{(a-\epsilon/\mu)^{2/3}} \qquad \text{and} \qquad y_{lb,\mu,\epsilon} \le \sqrt{\mu} \le y_{ub,\mu,\epsilon}.$$

Theorem 8 (Detailed properties of $r_\beta(y;\mu)$). For $r_\beta(y;\mu)$ in (19):
(i) For $\mu > 0$: $\lim_{y\to 0^+} r_\beta(y;\mu) = \infty$.
(ii) For $\mu > \frac{a^2(1-\beta)^4}{4(1+\beta)^2}$: $\frac{dr_\beta(y;\mu)}{dy} < 0$. For $0 < \mu \le \frac{a^2(1-\beta)^4}{4(1+\beta)^2}$:
$$\frac{dr_\beta(y;\mu)}{dy} > 0 \quad \text{for } y_{lb,\mu,\beta} < y < y_{ub,\mu,\beta}, \tag{35a}$$
$$\frac{dr_\beta(y;\mu)}{dy} \le 0 \quad \text{otherwise}, \tag{35b}$$
where $y_{lb,\mu,\beta} < y_{ub,\mu,\beta}$ are defined analogously to $y_{lb}, y_{ub}$ in Theorem 5(iii). Moreover,
$$y_{lb,\mu,\beta} \le \frac{(4\mu)^{2/3}(1+\beta)^{1/3}}{2a^{1/3}(1-\beta)^{2/3}} \qquad \text{and} \qquad y_{lb,\mu,\beta} \le \sqrt{\mu} \le y_{ub,\mu,\beta}.$$

Theorem 9 (Detailed properties of $t_\beta(x;\mu)$). For $t_\beta(x;\mu)$ in (20):
(i) For $\mu > 0$: $\lim_{x\to 0^+} t_\beta(x;\mu) = \infty$ and $t_\beta(a;\mu) < 0$.
(ii) For $\mu > \frac{a^2}{4(a^2+1)^3}\frac{(1+\beta)^4}{(1-\beta)^2}$: $\frac{dt_\beta(x;\mu)}{dx} < 0$. For $0 < \mu \le \frac{a^2}{4(a^2+1)^3}\frac{(1+\beta)^4}{(1-\beta)^2}$:
$$\frac{dt_\beta(x;\mu)}{dx} > 0 \quad \text{for } x_{lb,\mu,\beta} < x < x_{ub,\mu,\beta}, \tag{36a}$$
$$\frac{dt_\beta(x;\mu)}{dx} \le 0 \quad \text{otherwise}, \tag{36b}$$
where $x_{lb,\mu,\beta} < x_{ub,\mu,\beta}$ are defined analogously to $x_{lb}, x_{ub}$ in Theorem 6(iii).
(iii) If $0 < \beta < \frac{\sqrt{a^2+1}-1}{\sqrt{a^2+1}+1}$, then there exists $\tau_\beta > 0$ such that, for $\mu > \tau_\beta$, the equation $t_\beta(x;\mu)=0$ has only one solution; at $\mu = \tau_\beta$, it has two solutions; and for $\mu < \tau_\beta$, it has three solutions. Moreover, $\tau_\beta < \frac{a^2}{4(a^2+1)^3}\frac{(1+\beta)^4}{(1-\beta)^2}$.
(iv) If $0 < \beta < \frac{\sqrt{a^2+1}-1}{\sqrt{a^2+1}+1}$, then for $\mu < \tau_\beta$, $t_\beta(x;\mu)=0$ has three solutions $x^{***}_{\mu,\beta} < x^{**}_{\mu,\beta} < x^{*}_{\mu,\beta}$. Besides,
$$\max_{\mu\le\tau_\beta} x^{**}_{\mu,\beta} \le \frac{a\big((1-\beta)\sqrt{a^2+1} - \sqrt{(1-\beta)^2(a^2+1) - (\beta+1)^2}\big)}{2\sqrt{a^2+1}},$$
$$\frac{a\big((1-\beta)\sqrt{a^2+1} + \sqrt{(1-\beta)^2(a^2+1) - (\beta+1)^2}\big)}{2\sqrt{a^2+1}} \le \min_{\mu>0} x^{*}_{\mu,\beta}.$$
It implies that $\max_{\mu\le\tau_\beta} x^{**}_{\mu,\beta} < \min_{\mu>0} x^{*}_{\mu,\beta}$.

Lemma 9. Under the same setting as Corollary 3, $\max_{\mu\le\tau} x^{**}_{\mu,\epsilon} < \min_{\mu>0} x^{*}_{\mu,\epsilon}$.

E Technical Proofs

E.1 Proof of Theorem 3

Proof. For the sake of completeness, we include the proof here; it can also be found in [33]. We use the fact that $f$ is $L$-smooth if and only if, for any $W, Y \in \mathrm{dom}(f)$,

$$f(W) \le f(Y) + \langle \nabla f(Y), W - Y \rangle + \frac{L}{2}\|W - Y\|_2^2.$$

Let $W = W^{t+1}$ and $Y = W^t$; then, using the updating rule $W^{t+1} = W^t - \frac{1}{L}\nabla f(W^t)$,

$$f(W^{t+1}) \le f(W^t) + \langle \nabla f(W^t), W^{t+1} - W^t \rangle + \frac{L}{2}\|W^{t+1} - W^t\|_2^2 = f(W^t) - \frac{1}{L}\|\nabla f(W^t)\|_2^2 + \frac{1}{2L}\|\nabla f(W^t)\|_2^2 = f(W^t) - \frac{1}{2L}\|\nabla f(W^t)\|_2^2.$$

Summing over $t = 0, \dots, n-1$ and telescoping,

$$\min_{0\le t\le n-1}\|\nabla f(W^t)\|_2^2 \le \frac{1}{n}\sum_{t=0}^{n-1}\|\nabla f(W^t)\|_2^2 \le \frac{2L\big(f(W^0) - f(W^n)\big)}{n} \le \frac{2L\big(f(W^0) - f(W^\star)\big)}{n}.$$

Hence $\min_{0\le t\le n-1}\|\nabla f(W^t)\|_2^2 \le \epsilon^2$ whenever $n \ge \frac{2L(f(W^0) - f(W^\star))}{\epsilon^2}$.
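A minimal implementation of the gradient descent analyzed in Theorem 3 follows; this is our sketch, not the authors' code. The gradient of $g_\mu$ is written out from the stationarity conditions (the nullclines in (37)–(38)), and $L = \mu(a^2+1) + 3a^2$ is taken from Lemma 5 (valid on the region $|x|, |y| \le a$):

```python
import numpy as np

def gd_to_stationarity(grad, w0, L, eps, max_steps=10 ** 6):
    """Gradient descent with stepsize 1/L, stopped at the first eps-stationary
    point; by Theorem 3 this takes at most 2 L (f(w0) - f*) / eps^2 steps."""
    w = np.asarray(w0, dtype=float)
    for step in range(max_steps):
        g = grad(w)
        if np.linalg.norm(g) <= eps:
            return w, step
        w = w - g / L
    return w, max_steps

# Example on g_mu; the gradient is our reconstruction from (37)-(38).
a, mu = 0.5, 0.01
grad_g = lambda w: np.array([
    w[0] * (mu + w[1] ** 2) - mu * a,
    w[1] * (mu * (a ** 2 + 1) + w[0] ** 2) - mu * a,
])
w_hat, steps = gd_to_stationarity(grad_g, [a, 0.0],
                                  L=mu * (a ** 2 + 1) + 3 * a ** 2, eps=1e-6)
```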
E.2 Proof of Theorem 5

Proof. (i) For any $\mu > 0$,
$$\lim_{y\to 0^+} r(y;\mu) = \lim_{y\to 0^+}\ \frac{a}{y} - \frac{\mu a^2}{(y^2+\mu)^2} - (a^2+1) = \infty, \qquad r\Big(\frac{a}{a^2+1};\mu\Big) = -\frac{\mu a^2}{\big((\frac{a}{a^2+1})^2+\mu\big)^2} < 0.$$

(ii) $r(\sqrt{\mu};\mu) = \frac{a}{\sqrt{\mu}} - \frac{a^2}{4\mu} - (a^2+1)$. Since $\frac{a}{\sqrt{\mu}} - \frac{a^2}{4\mu} = 1 - \big(1 - \frac{a}{2\sqrt{\mu}}\big)^2 \le 1 < a^2+1$, we get $r(\sqrt{\mu};\mu) < 0$.

(iii)
$$\frac{dr(y;\mu)}{dy} = -\frac{a}{y^2} + \frac{4a^2\mu y}{(y^2+\mu)^3} = \frac{4a^2\mu y^3 - a(y^2+\mu)^3}{y^2(y^2+\mu)^3} = \frac{a\big((4a\mu)^{2/3}y^2 + (4a\mu)^{1/3}y(y^2+\mu) + (y^2+\mu)^2\big)\big((4a\mu)^{1/3}y - y^2 - \mu\big)}{y^2(y^2+\mu)^3}.$$
For $\mu > \frac{a^2}{4}$: $(4a\mu)^{1/3}y - y^2 - \mu < 0$ for all $y$, hence $\frac{dr(y;\mu)}{dy} < 0$. For $\mu \le \frac{a^2}{4}$ and $y_{lb} < y < y_{ub}$: $(4a\mu)^{1/3}y - y^2 - \mu > 0$, hence $\frac{dr(y;\mu)}{dy} > 0$; for $y \le y_{lb}$ or $y \ge y_{ub}$: $(4a\mu)^{1/3}y - y^2 - \mu \le 0$, hence $\frac{dr(y;\mu)}{dy} \le 0$. Note that
$$\frac{dr(y;\mu)}{dy} = 0 \iff (4a\mu)^{1/3}y - y^2 - \mu = 0 \iff (4a\mu)^{1/3} = y + \frac{\mu}{y};$$
the intersections of the constant $(4a\mu)^{1/3}$ with the function $y + \mu/y$ are exactly $y_{lb}$ and $y_{ub}$, and $y_{lb} < \sqrt{\mu} < y_{ub}$.

(iv) Note that for $0 < \mu < \frac{a^2}{4}$,
$$\frac{\partial r}{\partial\mu} = \frac{a^2(\mu - y^2)}{(\mu+y^2)^3},$$
and $y_{lb} < \sqrt{\mu} < y_{ub}$ implies $\frac{\partial r}{\partial\mu}\big|_{y=y_{ub}} < 0$. Let $p(\mu) = r(y_{ub};\mu)$; because $\frac{\partial r}{\partial y}\big|_{y=y_{ub}} = 0$,
$$\frac{dp}{d\mu} = \frac{dr(y_{ub};\mu)}{d\mu} = \frac{\partial r}{\partial\mu}\Big|_{y=y_{ub}} < 0.$$
Also note that when $\mu = \frac{a^2}{4}$ we have $y_{ub} = \sqrt{\mu}$, so $p(\mu) = r(y_{ub};\mu) = r(\sqrt{\mu};\mu) < 0$; and if $\mu < \frac{a^2}{4}$,
$$y_{ub} < \frac{(4\mu)^{1/3}}{2}\cdot 2a^{1/3} = (4\mu a)^{1/3}.$$
Moreover,
$$r\big((4\mu a)^{1/3};\mu\big) = \frac{a}{(4\mu a)^{1/3}} - \frac{\mu a^2}{\big((4\mu a)^{2/3}+\mu\big)^2} - (a^2+1) = \frac{a}{(4\mu a)^{1/3}} - \frac{a^2}{\mu^{1/3}\big((4a)^{2/3}+\mu^{1/3}\big)^2} - (a^2+1) \ge \mu^{-1/3}\Big(\frac{a}{(4a)^{1/3}} - \frac{a^2}{(4a)^{4/3}}\Big) - (a^2+1).$$
Because $\frac{a}{(4a)^{1/3}} > \frac{a^2}{(4a)^{4/3}}$, it is easy to see that $r\big((4\mu a)^{1/3};\mu\big) \to \infty$ as $\mu \to 0$. Since $y_{ub} < (4\mu a)^{1/3}$ and $r(\cdot;\mu)$ is decreasing beyond $y_{ub}$ (Theorem 5(iii)), we know $p(\mu) = r(y_{ub};\mu) > r\big((4\mu a)^{1/3};\mu\big) \to \infty$ as $\mu \to 0$. Combining all of these, i.e.
$$\frac{dp}{d\mu} < 0, \qquad \lim_{\mu\to 0^+} p(\mu) = \infty, \qquad p\Big(\frac{a^2}{4}\Big) < 0,$$
there exists a unique $\tau < \frac{a^2}{4}$ such that $p(\tau) = 0$.

(v) From Theorem 5(iv): for $\mu < \tau$, $p(\mu) = r(y_{ub};\mu) > 0$; at $\mu = \tau$, $p(\mu) = 0$; and for $\mu > \tau$, $p(\mu) < 0$. Moreover, for $\mu < \tau$ we have $r(y_{lb};\mu) \le r(\sqrt{\mu};\mu) < 0$ by (ii) and (iii). Combining these with Theorem 5(i) and (iii), we get the stated solution counts.

(vi) By Theorem 5(v), for $\mu < \tau$ there exist three stationary values such that $0 < y^{*}_\mu < y_{lb} < \sqrt{\mu} < y^{**}_\mu < y_{ub} < y^{***}_\mu$. Because $\frac{dr}{dy}\big|_{y=y_{lb}} = \frac{dr}{dy}\big|_{y=y_{ub}} = 0$ while
$$\frac{dr(y;\mu)}{dy}\Big|_{y=y^{*}_\mu} \ne 0, \qquad \frac{dr(y;\mu)}{dy}\Big|_{y=y^{**}_\mu} \ne 0, \qquad \frac{dr(y;\mu)}{dy}\Big|_{y=y^{***}_\mu} \ne 0,$$
the implicit function theorem [14] applies to the equation $r(y;\mu) = 0$: around each solution there exists a unique continuously differentiable function $y = y(\mu)$ satisfying $r(y(\mu),\mu) = 0$, with
$$\frac{dy(\mu)}{d\mu} = -\frac{\partial r/\partial\mu}{\partial r/\partial y}, \qquad \frac{\partial r}{\partial\mu} = \frac{a^2(\mu - y^2)}{(\mu+y^2)^3}, \qquad \frac{\partial r}{\partial y} = -\frac{a}{y^2} + \frac{4a^2\mu y}{(y^2+\mu)^3}.$$
Therefore, by Theorem 5(iii) and the ordering above,
$$\frac{dy}{d\mu}\Big|_{y=y^{*}_\mu} > 0, \qquad \frac{dy}{d\mu}\Big|_{y=y^{**}_\mu} > 0, \qquad \frac{dy}{d\mu}\Big|_{y=y^{***}_\mu} < 0.$$
Because $\lim_{\mu\to 0^+} y_{lb} = \lim_{\mu\to 0^+} y_{ub} = 0$, we get $\lim_{\mu\to 0^+} y^{*}_\mu = \lim_{\mu\to 0^+} y^{**}_\mu = 0$. For $y^{***}_\mu$, let us consider $r\big(\frac{a}{a^2+1}(1-c\mu);\mu\big)$ where $c = \frac{32(a^2+1)^3}{a^2}$ and $\mu < \frac{1}{2c}$:
$$r\Big(\frac{a}{a^2+1}(1-c\mu);\mu\Big) = \frac{a^2+1}{1-c\mu} - \frac{\mu a^2}{\big(\frac{a^2}{(a^2+1)^2}(1-c\mu)^2+\mu\big)^2} - (a^2+1) = (a^2+1)\frac{c\mu}{1-c\mu} - \frac{\mu a^2}{\big(\frac{a^2}{(a^2+1)^2}(1-c\mu)^2+\mu\big)^2}$$
$$\ge c(a^2+1)\mu - \frac{\mu a^2}{\big(\frac{a^2}{(a^2+1)^2}(1-c\mu)^2\big)^2} = c(a^2+1)\mu - \frac{\mu(a^2+1)^4}{a^2(1-c\mu)^4} \ge c(a^2+1)\mu - \frac{16\mu(a^2+1)^4}{a^2} = \frac{16\mu(a^2+1)^4}{a^2} \ge 0,$$
using $(1-c\mu)^4 \ge 1/16$. By Theorem 5(iii), this implies $\frac{a}{a^2+1}(1-c\mu) < y^{***}_\mu$, so
$$\frac{a}{a^2+1} = \lim_{\mu\to 0^+}\frac{a}{a^2+1}(1-c\mu) \le \lim_{\mu\to 0^+} y^{***}_\mu \le \frac{a}{a^2+1}.$$
Consequently, $\lim_{\mu\to 0^+} y^{***}_\mu = \frac{a}{a^2+1}$.
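Numerically, the threshold $\tau$ of Theorem 5(iv) can be located by bisection on $p(\mu) = r(y_{ub};\mu)$, using that $p$ is strictly decreasing with $p(0^+) = \infty$ and $p(a^2/4) < 0$. A sketch (ours, with an illustrative value of $a$) follows:

```python
import math

def p(mu, a):
    # p(mu) = r(y_ub(mu); mu): positive iff r(.; mu) = 0 has three roots.
    s = (4 * mu) ** (1 / 3)
    y_ub = s / 2 * (a ** (1 / 3) + math.sqrt(a ** (2 / 3) - s))
    return a / y_ub - mu * a ** 2 / (y_ub ** 2 + mu) ** 2 - (a ** 2 + 1)

def tau(a, iters=200):
    """Bisection for the unique root of p on (0, a^2/4), cf. Theorem 5(iv)."""
    lo, hi = 1e-15, a ** 2 / 4
    for _ in range(iters):
        mid = (lo + hi) / 2
        if p(mid, a) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(tau(0.5))  # tau < a^2/4 = 0.0625
```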
E.3 Proof of Theorem 6

Proof. (i) For $\mu > 0$,
$$\lim_{x\to 0^+} t(x;\mu) = \lim_{x\to 0^+}\ \frac{a}{x} - \frac{\mu a^2}{(\mu(a^2+1)+x^2)^2} - 1 = \infty, \qquad t(a;\mu) = -\frac{\mu a^2}{(\mu(a^2+1)+a^2)^2} < 0.$$

(ii)
$$t\big(\sqrt{\mu(a^2+1)};\mu\big) = \frac{a}{\sqrt{\mu(a^2+1)}} - \frac{a^2}{4\mu(a^2+1)^2} - 1.$$
Solving $t\big(\sqrt{\mu(a^2+1)};\mu\big) = 0$ as a quadratic in $1/\sqrt{\mu}$ gives
$$\frac{1}{\sqrt{\mu}} = \frac{2(a^2+1)\big(\sqrt{a^2+1} \pm a\big)}{a}, \qquad \text{i.e.} \qquad \sqrt{\mu} = \frac{a\big(\sqrt{a^2+1} \mp a\big)}{2(a^2+1)};$$
hence when $\mu < \big(\frac{a(\sqrt{a^2+1}-a)}{2(a^2+1)}\big)^2$ or $\mu > \big(\frac{a(\sqrt{a^2+1}+a)}{2(a^2+1)}\big)^2$, we have $t\big(\sqrt{\mu(a^2+1)};\mu\big) < 0$.

(iii)
$$\frac{dt(x;\mu)}{dx} = -\frac{a}{x^2} + \frac{4\mu a^2 x}{(\mu(a^2+1)+x^2)^3} = \frac{4\mu a^2 x^3 - a(\mu(a^2+1)+x^2)^3}{x^2(\mu(a^2+1)+x^2)^3}$$
$$= \frac{a\big((\mu(a^2+1)+x^2)^2 + (\mu(a^2+1)+x^2)(4\mu a)^{1/3}x + (4\mu a)^{2/3}x^2\big)\big((4\mu a)^{1/3}x - \mu(a^2+1) - x^2\big)}{x^2(\mu(a^2+1)+x^2)^3}.$$
For $\mu > \frac{a^2}{4(a^2+1)^3}$: $(4\mu a)^{1/3}x - \mu(a^2+1) - x^2 < 0$, so $\frac{dt(x;\mu)}{dx} < 0$. For $\mu \le \frac{a^2}{4(a^2+1)^3}$ and $x_{lb} < x < x_{ub}$: $(4\mu a)^{1/3}x - \mu(a^2+1) - x^2 > 0$, so $\frac{dt(x;\mu)}{dx} > 0$; for $x \le x_{lb}$ or $x \ge x_{ub}$, $\frac{dt(x;\mu)}{dx} \le 0$. The same argument as in the proof of Theorem 5(iii), with $y + \mu/y$ replaced by $x + \mu(a^2+1)/x$, shows $x_{lb} < \sqrt{\mu(a^2+1)} < x_{ub}$.

(iv) Note that for $0 < \mu < \frac{a^2}{4(a^2+1)^3}$,
$$\frac{\partial t}{\partial\mu} = \frac{a^2\big(\mu(a^2+1)-x^2\big)}{(\mu(a^2+1)+x^2)^3},$$
and $x_{lb} < \sqrt{\mu(a^2+1)} < x_{ub}$ implies $\frac{\partial t}{\partial\mu}\big|_{x=x_{lb}} > 0$. Let $q(\mu) = t(x_{lb};\mu)$; because $\frac{\partial t}{\partial x}\big|_{x=x_{lb}} = 0$,
$$\frac{dq}{d\mu} = \frac{dt(x_{lb};\mu)}{d\mu} = \frac{\partial t}{\partial\mu}\Big|_{x=x_{lb}} > 0.$$
Note that at $\mu = \frac{a^2}{4(a^2+1)^3}$ we have $x_{ub} = x_{lb} = \frac{(4\mu a)^{1/3}}{2}$ and
$$t\Big(\frac{(4\mu a)^{1/3}}{2};\frac{a^2}{4(a^2+1)^3}\Big) = \frac{a}{(4\mu a)^{1/3}} - 1 > 0.$$
When $\mu < \big(\frac{a(\sqrt{a^2+1}-a)}{2(a^2+1)}\big)^2$, Theorem 6(ii) gives $t\big(\sqrt{\mu(a^2+1)};\mu\big) < 0$; by Theorem 6(iii) ($t$ increasing on $[x_{lb},x_{ub}]$ and $x_{lb} < \sqrt{\mu(a^2+1)}$), this implies $q(\mu) = t(x_{lb};\mu) < t\big(\sqrt{\mu(a^2+1)};\mu\big) < 0$ as $\mu \to 0^+$. Combining all of these, i.e.
$$\frac{dq}{d\mu} > 0, \qquad \lim_{\mu\to 0^+} q(\mu) < 0, \qquad q\Big(\frac{a^2}{4(a^2+1)^3}\Big) > 0,$$
there exists a unique $\tau < \frac{a^2}{4(a^2+1)^3}$ with $q(\tau) = 0$. Such $\tau$ is the same as in Theorem 5(iv).

(v) We follow the same argument as in the proof of Theorem 5(v).

(vi) By Theorem 6(v), for $\mu < \tau$ there exist three stationary values such that $0 < x^{***}_\mu < x_{lb} < x^{**}_\mu < x_{ub} < x^{*}_\mu < a$. Because $\frac{dt}{dx}\big|_{x=x_{lb}} = \frac{dt}{dx}\big|_{x=x_{ub}} = 0$ while $\frac{dt}{dx} \ne 0$ at $x^{*}_\mu$, $x^{**}_\mu$, $x^{***}_\mu$, the implicit function theorem [14] gives, around each solution of $t(x;\mu)=0$, a unique continuously differentiable function $x = x(\mu)$ with $t(x(\mu),\mu) = 0$ and
$$\frac{dx}{d\mu} = -\frac{\partial t/\partial\mu}{\partial t/\partial x} = -\frac{\dfrac{a^2(\mu(a^2+1)-x^2)}{(\mu(a^2+1)+x^2)^3}}{-\dfrac{a}{x^2} + \dfrac{4\mu a^2 x}{(\mu(a^2+1)+x^2)^3}}.$$
Therefore, by Theorem 6(iii),
$$\frac{dx}{d\mu}\Big|_{x=x^{*}_\mu} < 0, \qquad \frac{dx}{d\mu}\Big|_{x=x^{***}_\mu} > 0.$$
Because $0 < x^{***}_\mu < x_{lb} < x^{**}_\mu < x_{ub}$ and $\lim_{\mu\to 0^+} x_{lb} = \lim_{\mu\to 0^+} x_{ub} = 0$,
$$\lim_{\mu\to 0} x^{**}_\mu = \lim_{\mu\to 0} x^{***}_\mu = 0.$$
Let us consider $t(a(1-c\mu);\mu)$ where $c = \frac{32}{a^2}$ and $\mu < \frac{1}{2c}$:
$$t(a(1-c\mu);\mu) = \frac{1}{1-c\mu} - \frac{\mu a^2}{(\mu(a^2+1)+a^2(1-c\mu)^2)^2} - 1 = \frac{c\mu}{1-c\mu} - \frac{\mu a^2}{(\mu(a^2+1)+a^2(1-c\mu)^2)^2}$$
$$\ge c\mu - \frac{\mu a^2}{(a^2(1-c\mu)^2)^2} = c\mu - \frac{\mu}{a^2(1-c\mu)^4} \ge \mu\Big(\frac{32}{a^2} - \frac{16}{a^2}\Big) > 0.$$
By Theorem 6(iii), this implies $a(1-c\mu) \le x^{*}_\mu$; taking $\mu \to 0^+$ on both sides,
$$a = \lim_{\mu\to 0^+} a(1-c\mu) \le \lim_{\mu\to 0^+} x^{*}_\mu \le a,$$
hence $\lim_{\mu\to 0} x^{*}_\mu = a$.

For the remaining claims: when $\mu = \tau$, $t(x_{lb};\mu) = 0$ and $x_{ub} > \sqrt{\mu(a^2+1)} > x_{lb}$; since $t(\cdot;\mu)$ is increasing on $[x_{lb},x_{ub}]$, $t\big(\sqrt{\mu(a^2+1)};\mu\big) > t(x_{lb};\mu) = 0$. Moreover, $t\big(\sqrt{\mu(a^2+1)};\mu\big)$, $x_{lb}$ and $x^{**}_\mu$ are continuous functions of $\mu$, so for sufficiently small $\delta > 0$ and $\mu = \tau - \delta$ we still have $t\big(\sqrt{\mu(a^2+1)};\mu\big) > 0$ and $t(x_{lb};\mu) < 0$ (by Theorem 6(iv)), whence $x_{lb} < x^{**}_\mu < \sqrt{\mu(a^2+1)}$ and $\frac{dx}{d\mu}\big|_{x=x^{**}_\mu} < 0$. It implies that as $\mu$ decreases, $x^{**}_\mu$ increases. This relation holds until $x^{**}_\mu = \sqrt{\mu(a^2+1)}$, at which point
$$t(x^{**}_\mu;\mu) = t\big(\sqrt{\mu(a^2+1)};\mu\big) = 0 \implies \sqrt{\mu} = \frac{a(\sqrt{a^2+1}-a)}{2(a^2+1)}.$$
Note that when $\mu < \big(\frac{a(\sqrt{a^2+1}-a)}{2(a^2+1)}\big)^2$, $t\big(\sqrt{\mu(a^2+1)};\mu\big) < 0$, which implies $x^{**}_\mu > \sqrt{\mu(a^2+1)}$ and $\frac{dx}{d\mu}\big|_{x=x^{**}_\mu} > 0$; thus decreasing $\mu$ leads to decreasing $x^{**}_\mu$. We can conclude
$$\max_{\mu\le\tau} x^{**}_\mu \le \frac{a(\sqrt{a^2+1}-a)}{2\sqrt{a^2+1}}.$$
The same analysis can be applied to $x^{*}_\mu$: when $(x^{*}_\mu)^2 \le \mu(a^2+1)$ we have $\frac{dx}{d\mu}\big|_{x=x^{*}_\mu} > 0$, so decreasing $\mu$ decreases $x^{*}_\mu$; this holds until $x^{*}_\mu = \sqrt{\mu(a^2+1)}$, after which $\frac{dx}{d\mu}\big|_{x=x^{*}_\mu} < 0$ and decreasing $\mu$ increases $x^{*}_\mu$. We can conclude that
$$\min_{\mu>0} x^{*}_\mu = \frac{a(\sqrt{a^2+1}+a)}{2\sqrt{a^2+1}}.$$
Hence
$$\max_{\mu\le\tau} x^{**}_\mu \le \frac{a(\sqrt{a^2+1}-a)}{2\sqrt{a^2+1}} < \frac{a(\sqrt{a^2+1}+a)}{2\sqrt{a^2+1}} \le \min_{\mu>0} x^{*}_\mu.$$

E.4 Proof of Theorems 7, 8 and 9

Proof. The proofs are analogous to those of Theorem 5 and Theorem 6.

E.5 Proof of Lemma 1

Proof.
$$\nabla^2 g_\mu(x,y) = \begin{pmatrix} \mu + y^2 & 2xy \\ 2xy & \mu(a^2+1) + x^2 \end{pmatrix}.$$
Let $\lambda_1, \lambda_2$ be the eigenvalues of $\nabla^2 g_\mu(x,y)$. Then
$$\lambda_1 + \lambda_2 = \mathrm{Tr}\big(\nabla^2 g_\mu(x,y)\big) = \mu + y^2 + \mu(a^2+1) + x^2 > 0.$$
Now we calculate the product of the eigenvalues:
$$\lambda_1\lambda_2 = \det\big(\nabla^2 g_\mu(x,y)\big) = (\mu+y^2)\big(\mu(a^2+1)+x^2\big) - 4x^2y^2.$$
Note that $(x^{*}_\mu, y^{*}_\mu)$, $(x^{**}_\mu, y^{**}_\mu)$, $(x^{***}_\mu, y^{***}_\mu)$ satisfy (11a) and (11b), which give $\mu + y^2 = \mu a/x$ and $\mu(a^2+1) + x^2 = \mu a/y$; hence
$$\lambda_1\lambda_2 > 0 \iff \frac{\mu^2 a^2}{xy} > 4x^2y^2 \iff \Big(\frac{a\mu}{2}\Big)^{2/3} > xy \iff \Big(\frac{a\mu}{2}\Big)^{2/3} > \frac{a\mu y}{y^2+\mu} \iff y^2 - (4a\mu)^{1/3}y + \mu > 0,$$
where the third equivalence substitutes $x = \mu a/(y^2+\mu)$ from (11a). By (32b), the last inequality holds iff $y < y_{lb}$ or $y > y_{ub}$, so $\lambda_1\lambda_2 > 0$ at $(x^{*}_\mu, y^{*}_\mu)$ and $(x^{***}_\mu, y^{***}_\mu)$, while $\lambda_1\lambda_2 < 0$ at $(x^{**}_\mu, y^{**}_\mu)$. Together with the positive trace,
$$\lambda_1 > 0,\ \lambda_2 > 0 \ \text{at } (x^{*}_\mu, y^{*}_\mu) \text{ and } (x^{***}_\mu, y^{***}_\mu); \qquad \lambda_1 < 0 \text{ or } \lambda_2 < 0 \ \text{at } (x^{**}_\mu, y^{**}_\mu),$$
and since $\nabla g_\mu(x,y) = 0$ at all three points, $(x^{*}_\mu, y^{*}_\mu)$ and $(x^{***}_\mu, y^{***}_\mu)$ are local minima and $(x^{**}_\mu, y^{**}_\mu)$ is a saddle point of $g_\mu(W)$.
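The classification in Lemma 1 is easy to check numerically: build the Hessian and inspect the sign of its determinant (the trace is always positive). A sketch under our naming assumptions:

```python
import numpy as np

def hessian_g(w, mu, a):
    # Hessian of g_mu from the proof of Lemma 1.
    x, y = w
    return np.array([[mu + y ** 2, 2 * x * y],
                     [2 * x * y, mu * (a ** 2 + 1) + x ** 2]])

def classify(w, mu, a):
    """Local minimum iff both eigenvalues are positive; saddle otherwise."""
    eigs = np.linalg.eigvalsh(hessian_g(w, mu, a))
    return "local min" if eigs.min() > 0 else "saddle"

# Example: a point near (x*, y*) for mu = 0.01, a = 0.5 (values illustrative).
print(classify((0.479, 0.0206), mu=0.01, a=0.5))  # -> "local min"
```

Pairing each root $y$ found by the earlier root-finder with $x = \mu a/(\mu + y^2)$ (eq. (38a)) recovers the three stationary points; the outer two classify as local minima and the middle one as a saddle.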
E.6 Proof of Lemma 2

[Figure 8: The three stationary points $(x^{*}, y^{*})$, $(x^{**}, y^{**})$, $(x^{***}, y^{***})$ when $\mu < \tau$.]

Proof. Let us define the functions
$$y_{\mu 1}(x) = \sqrt{\mu\Big(\frac{a}{x} - 1\Big)}, \quad 0 < x \le a, \tag{37a}$$
$$y_{\mu 2}(x) = \frac{\mu a}{\mu(a^2+1)+x^2}, \quad 0 < x \le a, \tag{37b}$$
$$x_{\mu 1}(y) = \frac{\mu a}{y^2+\mu}, \quad 0 < y < \frac{a}{a^2+1}, \tag{38a}$$
$$x_{\mu 2}(y) = \sqrt{\mu\Big(\frac{a}{y} - (a^2+1)\Big)}, \quad 0 < y < \frac{a}{a^2+1}. \tag{38b}$$
With simple calculations ($y_{\mu 1}^2 - y_{\mu 2}^2 = \mu\,t(x;\mu)$ and $x_{\mu 1}^2 - x_{\mu 2}^2 = -\mu\,r(y;\mu)$),
$$y_{\mu 1} \ge y_{\mu 2} \iff t(x;\mu) \ge 0 \iff x \in (0, x^{***}_\mu] \cup [x^{**}_\mu, x^{*}_\mu],$$
$$x_{\mu 1} \ge x_{\mu 2} \iff r(y;\mu) \le 0 \iff y \in [y^{*}_\mu, y^{**}_\mu] \cup \Big[y^{***}_\mu, \frac{a}{a^2+1}\Big).$$
Here we divide $B_\mu$ into three parts $C_{\mu 1}, C_{\mu 2}, C_{\mu 3}$:
$$C_{\mu 1} = \{(x,y) \mid x^{**}_\mu < x \le x^{*}_\mu,\ y_{\mu 1}(x) < y < y^{**}_\mu\} \cup \{(x,y) \mid x^{*}_\mu < x \le a,\ y_{\mu 2}(x) < y < y^{**}_\mu\}, \tag{39}$$
$$C_{\mu 2} = \{(x,y) \mid x^{**}_\mu < x \le x^{*}_\mu,\ 0 \le y < y_{\mu 2}(x)\} \cup \{(x,y) \mid x^{*}_\mu < x \le a,\ 0 \le y < y_{\mu 1}(x)\}, \tag{40}$$
$$C_{\mu 3} = \{(x,y) \mid x^{**}_\mu < x \le x^{*}_\mu,\ y_{\mu 2}(x) \le y \le y_{\mu 1}(x)\} \cup \{(x,y) \mid x^{*}_\mu < x \le a,\ y_{\mu 1}(x) \le y \le y_{\mu 2}(x)\}. \tag{41}$$
Also note that, since $\partial g_\mu/\partial x > 0 \iff y > y_{\mu 1}(x)$ and $\partial g_\mu/\partial y > 0 \iff y > y_{\mu 2}(x)$,
$$(x,y) \in C_{\mu 1} \implies \frac{\partial g_\mu(x,y)}{\partial x} > 0,\ \frac{\partial g_\mu(x,y)}{\partial y} > 0, \qquad (x,y) \in C_{\mu 2} \implies \frac{\partial g_\mu(x,y)}{\partial x} < 0,\ \frac{\partial g_\mu(x,y)}{\partial y} < 0.$$
The gradient flow follows
$$\begin{pmatrix} x'(t) \\ y'(t) \end{pmatrix} = -\begin{pmatrix} \partial g_\mu(x(t),y(t))/\partial x \\ \partial g_\mu(x(t),y(t))/\partial y \end{pmatrix} = -\nabla g_\mu(x(t),y(t)),$$
so that
$$(x,y) \in C_{\mu 1} \implies x'(t) < 0,\ y'(t) < 0,\ \|\nabla g_\mu\| > 0, \tag{42}$$
$$(x,y) \in C_{\mu 2} \implies x'(t) > 0,\ y'(t) > 0,\ \|\nabla g_\mu\| > 0. \tag{43}$$
Note that $\|\nabla g_\mu\|$ is non-vanishing and bounded away from $0$ on these sets. Consider $(x(0), y(0)) \in C_{\mu 1}$: since $\nabla g_\mu(x,y) \ne 0$ on $C_{\mu 1}$, the derivatives in (42) are strictly negative, and $C_{\mu 1}$ is bounded, there exists a finite $t_0 > 0$ such that $(x(t_0), y(t_0)) \in \partial C_{\mu 1}$ and $(x(t), y(t)) \in C_{\mu 1}$ for $0 \le t < t_0$, where $\partial C_{\mu 1}$ is defined as
$$\partial C_{\mu 1} = \{(x,y) \mid x^{**}_\mu < x \le x^{*}_\mu,\ y = y_{\mu 1}(x)\} \cup \{(x,y) \mid x^{*}_\mu < x \le a,\ y = y_{\mu 2}(x)\} \subset C_{\mu 3}.$$
For the same reason, if $(x(0), y(0)) \in C_{\mu 2}$, there exists a finite $t_1 > 0$ such that $(x(t_1), y(t_1)) \in \partial C_{\mu 2}$ and $(x(t), y(t)) \in C_{\mu 2}$ for $0 \le t < t_1$, where $\partial C_{\mu 2}$ is defined as
$$\partial C_{\mu 2} = \{(x,y) \mid x^{**}_\mu < x \le x^{*}_\mu,\ y = y_{\mu 2}(x)\} \cup \{(x,y) \mid x^{*}_\mu < x \le a,\ y = y_{\mu 1}(x)\} \subset C_{\mu 3}.$$
Then, by Lemma 7, $\lim_{t\to\infty}(x(t), y(t)) = (x^{*}_\mu, y^{*}_\mu)$.

E.7 Proof of Lemma 3

Proof. This is a direct consequence of Theorem 5.

E.8 Proof of Lemma 5

Proof. Note that
$$\nabla^2 g_\mu(W) = \begin{pmatrix} \mu+y^2 & 2xy \\ 2xy & \mu(a^2+1)+x^2 \end{pmatrix} = \mu\begin{pmatrix} 1 & 0 \\ 0 & a^2+1 \end{pmatrix} + \begin{pmatrix} y^2 & 2xy \\ 2xy & x^2 \end{pmatrix}.$$
Let $\|\cdot\|_{op}$ denote the spectral norm; it satisfies the triangle inequality, so
$$\|\nabla^2 g_\mu(W)\|_{op} \le \mu(a^2+1) + \Big\|\begin{pmatrix} y^2 & 2xy \\ 2xy & x^2 \end{pmatrix}\Big\|_{op}.$$
The spectral norm of the second term on the region $A$ is bounded by
$$\max_{(x,y)\in A}\ \frac{(x^2+y^2) + \sqrt{(x^2+y^2)^2 + 12x^2y^2}}{2} \le \frac{2a^2 + \sqrt{4a^4+12a^4}}{2} = 3a^2,$$
where we use $x^2 \le a^2$ and $y^2 \le a^2$. Therefore, $\|\nabla^2 g_\mu(W)\|_{op} \le 3a^2 + \mu(a^2+1)$. Also, according to [5, 33], for any $f$ such that $\nabla^2 f$ exists, $f$ is $L$-smooth if and only if $\|\nabla^2 f\|_{op} \le L$. With this, we conclude the proof.

E.9 Proof of Lemma 7

Proof. First we prove that $(x(t), y(t)) \in C_{\mu 3}$ for all $t \ge 0$: if this failed, there would exist a finite $t$ such that $(x(t), y(t)) \in \partial C_{\mu 3}$, where $\partial C_{\mu 3}$ is the boundary of $C_{\mu 3}$, defined as
$$\partial C_{\mu 3} = \{(x,y) \mid y = y_{\mu 1}(x) \text{ or } y = y_{\mu 2}(x),\ x^{**}_\mu < x \le a\}.$$
W.l.o.g., assume $(x(0), y(0)) \in \partial C_{\mu 3}$ and $(x(0), y(0)) \ne (x^{*}_\mu, y^{*}_\mu)$. There are four cases:
$$\nabla g_\mu(x(0), y(0)): \begin{cases} \partial_x g_\mu = 0,\ \partial_y g_\mu > 0 & \text{if } y(0) = y_{\mu 1}(x(0)),\ x^{**}_\mu < x(0) < x^{*}_\mu, \\ \partial_x g_\mu = 0,\ \partial_y g_\mu < 0 & \text{if } y(0) = y_{\mu 1}(x(0)),\ x^{*}_\mu < x(0) \le a, \\ \partial_x g_\mu < 0,\ \partial_y g_\mu = 0 & \text{if } y(0) = y_{\mu 2}(x(0)),\ x^{**}_\mu < x(0) < x^{*}_\mu, \\ \partial_x g_\mu > 0,\ \partial_y g_\mu = 0 & \text{if } y(0) = y_{\mu 2}(x(0)),\ x^{*}_\mu < x(0) \le a. \end{cases}$$
This indicates that $-\nabla g_\mu(x(t), y(t))$ points into the interior of $C_{\mu 3}$, so $(x(t), y(t))$ cannot escape $C_{\mu 3}$. We can therefore focus our attention on $C_{\mu 3}$. For Algorithm 1,
$$\frac{df(z_t)}{dt} = \Big\langle \nabla f(z_t), \frac{dz_t}{dt} \Big\rangle = -\|\nabla f(z_t)\|_2^2.$$
In our setting, for $(x,y) \in C_{\mu 3}$ we have $\nabla g_\mu(x,y) \ne 0$ iff $(x,y) \ne (x^{*}_\mu, y^{*}_\mu)$, so
$$\frac{dg_\mu(x(t), y(t))}{dt} = -\|\nabla g_\mu\|_2^2 \begin{cases} < 0, & (x,y) \ne (x^{*}_\mu, y^{*}_\mu), \\ = 0, & (x,y) = (x^{*}_\mu, y^{*}_\mu). \end{cases}$$
Moreover, $(x^{*}_\mu, y^{*}_\mu)$ is the unique stationary point of $g_\mu(W)$ in $C_{\mu 3}$, and by Lemma 8,
$$g_\mu(x,y) > g_\mu(x^{*}_\mu, y^{*}_\mu) \quad \text{for all } (x,y) \in C_{\mu 3},\ (x,y) \ne (x^{*}_\mu, y^{*}_\mu).$$
By the Lyapunov asymptotic stability theorem [28], applied to the gradient flow of $g_\mu(x,y)$ on $C_{\mu 3}$, we conclude $\lim_{t\to\infty}(x(t), y(t)) = (x^{*}_\mu, y^{*}_\mu)$.
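For illustration, the gradient flow of Lemma 7 (Algorithm 1) can be integrated by forward Euler; started inside $C_{\mu 3}$, the trajectory should remain there and approach $(x^{*}_\mu, y^{*}_\mu)$. The step size, iteration count, and starting point below are our assumptions, and the gradient is our reconstruction from the nullclines (37)–(38):

```python
import numpy as np

def gradient_flow(w0, mu, a, dt=1e-3, steps=500_000):
    """Forward-Euler integration of w'(t) = -grad g_mu(w(t)) (Algorithm 1)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        g = np.array([w[0] * (mu + w[1] ** 2) - mu * a,
                      w[1] * (mu * (a ** 2 + 1) + w[0] ** 2) - mu * a])
        w = w - dt * g
    return w

# (0.45, 0.03) lies between the two nullclines for mu = 0.01, a = 0.5,
# i.e. inside C_mu3; the flow converges to roughly (0.479, 0.021).
print(gradient_flow([0.45, 0.03], mu=0.01, a=0.5))
```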
E.10 Proof of Lemma 8

Proof. Fix any $(x,y) \in C_{\mu 3}$ in (41) with $(x,y) \ne (x^{*}_\mu, y^{*}_\mu)$, and run Algorithm 7 from it. W.l.o.g. we may assume $x \in (x^{**}_\mu, x^{*}_\mu)$; the same analysis applies to $x \in (x^{*}_\mu, a]$. It is clear that $\tilde{x}_j < \tilde{x}_{j+1}$ and $\tilde{y}_{j+1} < \tilde{y}_j$, and that $\lim_{j\to\infty}(\tilde{x}_j, \tilde{y}_j) = (x^{*}_\mu, y^{*}_\mu)$: otherwise, Algorithm 7 continues until $\lim_j (\tilde{x}_j, \tilde{y}_j) = \lim_j \big(x_{\mu 1}(\tilde{y}_j), y_{\mu 2}(\tilde{x}_j)\big)$, i.e. $(\tilde{x}_j, \tilde{y}_j)$ converges to a fixed point of the pair $(x_{\mu 1}, y_{\mu 2})$, which in $C_{\mu 3}$ is exactly $(x^{*}_\mu, y^{*}_\mu)$. Moreover, for any $j = 1, 2, \dots$,
$$g_\mu(\tilde{x}_{j-1}, \tilde{y}_{j-1}) > g_\mu(\tilde{x}_{j-1}, \tilde{y}_j) > g_\mu(\tilde{x}_j, \tilde{y}_j).$$
Indeed, by the mean value theorem there exists $\hat{y} \in (\tilde{y}_j, \tilde{y}_{j-1})$ such that
$$g_\mu(\tilde{x}_{j-1}, \tilde{y}_{j-1}) - g_\mu(\tilde{x}_{j-1}, \tilde{y}_j) = \frac{\partial g_\mu(\tilde{x}_{j-1}, \hat{y})}{\partial y}\,(\tilde{y}_{j-1} - \tilde{y}_j),$$
and $\frac{\partial g_\mu(\tilde{x}_{j-1}, \hat{y})}{\partial y} > 0$, so $g_\mu(\tilde{x}_{j-1}, \tilde{y}_{j-1}) > g_\mu(\tilde{x}_{j-1}, \tilde{y}_j)$; by the same reasoning, $g_\mu(\tilde{x}_{j-1}, \tilde{y}_j) > g_\mu(\tilde{x}_j, \tilde{y}_j)$. By Lemma 1, $(x^{*}_\mu, y^{*}_\mu)$ is a local minimum, so there exists $r_\mu > 0$ such that $g_\mu(x', y') > g_\mu(x^{*}_\mu, y^{*}_\mu)$ for all $(x', y')$ with $0 < \|(x', y') - (x^{*}_\mu, y^{*}_\mu)\|_2 \le r_\mu$. Since $\lim_j (\tilde{x}_j, \tilde{y}_j) = (x^{*}_\mu, y^{*}_\mu)$, there exists $J > 0$ such that $\|(\tilde{x}_j, \tilde{y}_j) - (x^{*}_\mu, y^{*}_\mu)\|_2 \le r_\mu$ for all $j > J$. Combining them all,
$$g_\mu(x,y) > g_\mu(\tilde{x}_j, \tilde{y}_j) > g_\mu(x^{*}_\mu, y^{*}_\mu).$$

Algorithm 7: Path to $(x^{*}_\mu, y^{*}_\mu)$
Input: $(x,y) \in C_{\mu 3}$; $x_{\mu 1}(y)$, $y_{\mu 2}(x)$ as in (38a), (37b)
Output: $\{(\tilde{x}_j, \tilde{y}_j)\}_{j=0}^{\infty}$
1: $(\tilde{x}_0, \tilde{y}_0) \leftarrow (x, y)$
2: for $j = 1, 2, \dots$ do
3:  $\tilde{y}_j \leftarrow y_{\mu 2}(\tilde{x}_{j-1})$
4:  $\tilde{x}_j \leftarrow x_{\mu 1}(\tilde{y}_j)$

E.11 Proof of Lemma 4

Proof. From the proof of Theorem 1, any scheduling for $\mu_k$ satisfying
$$(2/a)^{2/3}\mu_{k-1}^{4/3} \le \mu_k < \mu_{k-1}$$
will do the job. Note that in Algorithm 4 we have $\hat{a} = \sqrt{4(\mu_0+\varepsilon)} < a$, so it is obvious that
$$(2/a)^{2/3}\mu_{k-1}^{4/3} < (2/\hat{a})^{2/3}\mu_{k-1}^{4/3}.$$
The same analysis as for Theorem 1 can be applied here.

E.12 Proof of Lemma 6

Proof. By Theorem 3, Lemma 5, and the fact that $A^1_{\mu,\epsilon}$ is an $\epsilon$-stationary-point region, we use the same argument as in the proof of Lemma 7 to demonstrate that gradient descent never enters $A^2_{\mu,\epsilon}$.

E.13 Proof of Lemma 9

Proof. By Theorem 9(iv),
$$\max_{\mu\le\tau_\beta} x^{**}_{\mu,\beta} \le \min_{\mu>0} x^{*}_{\mu,\beta}.$$
We also know from the proof of Corollary 3 that $x^{**}_{\mu,\epsilon} \le x^{**}_{\mu,\beta}$ and $x^{*}_{\mu,\beta} \le x^{*}_{\mu,\epsilon}$. Consequently,
$$\max_{\mu\le\tau_\beta} x^{**}_{\mu,\epsilon} \le \min_{\mu>0} x^{*}_{\mu,\epsilon}.$$
Because $\tau_\beta > \tau$,
$$\max_{\mu\le\tau} x^{**}_{\mu,\epsilon} \le \max_{\mu\le\tau_\beta} x^{**}_{\mu,\epsilon} \le \min_{\mu>0} x^{*}_{\mu,\epsilon}.$$

E.14 Proof of Corollary 1

Proof. Note that $\frac{a^2}{4(a^2+1)^3} \le \frac{1}{27}$; hence the stated choice of $\mu_0$ satisfies $\frac{a^2}{4} > \mu_0 \ge \frac{a^2}{4(a^2+1)^3}$, which satisfies the condition in Lemma 4, and we obtain the same result.

E.15 Proof of Corollary 2

Proof. Use Theorem 5(vi) and Theorem 6(vi).

E.16 Proof of Corollary 3

Proof. It is easy to see that
$$r_\beta(y;\mu) \ge r_\epsilon(y;\mu) \ge r(y;\mu), \qquad t_\beta(x;\mu) \le t_\epsilon(x;\mu) \le t(x;\mu),$$
and when $\mu < \tau$ there are three solutions to $r(y;\mu) = 0$ by Theorem 5. Also, we know from Theorems 7 and 8 that
$$\lim_{y\to 0^+} r_\epsilon(y;\mu) = \lim_{y\to 0^+} r_\beta(y;\mu) = \infty.$$
Note that when $\big(\frac{1+\beta}{1-\beta}\big)^2 \le a^2+1$,
$$r_\beta(\sqrt{\mu};\mu) = \frac{a(1+\beta)}{\sqrt{\mu}} - (a^2+1) - \frac{a^2(1-\beta)^2}{4\mu} \le 0,$$
and since $r_\beta(\sqrt{\mu};\mu) \ge r_\epsilon(\sqrt{\mu};\mu) \ge r(\sqrt{\mu};\mu)$, all three are non-positive at $y = \sqrt{\mu}$. Also, for $y_{ub}$ defined in Theorem 5(iii), we know $r(y_{ub};\mu) > 0$ from Theorem 5(iv); therefore,
$$r_\beta(y_{ub};\mu) \ge r_\epsilon(y_{ub};\mu) \ge r(y_{ub};\mu) > 0.$$
Besides, $\sqrt{\mu} < y_{ub}$. By the monotonicity of $r_\beta(y;\mu)$ and $r_\epsilon(y;\mu)$ from Theorem 7(ii) and Theorem 8(ii), this implies that $r_\beta(y;\mu) = 0$ and $r_\epsilon(y;\mu) = 0$ each have at least two solutions. From the geometry of $r_\beta, r_\epsilon, r$ and $t_\beta, t_\epsilon, t$, it follows that
$$x^{*}_{\mu,\epsilon} \le x^{*}_\mu, \qquad y^{*}_{\mu,\epsilon} \ge y^{*}_\mu, \qquad x^{**}_{\mu,\epsilon} \ge x^{**}_\mu, \qquad y^{**}_{\mu,\epsilon} \le y^{**}_\mu.$$
Finally, for every point $(x,y) \in A^1_{\mu,\epsilon}$ there exists a pair $\epsilon_1, \epsilon_2$, each satisfying $|\epsilon_1| \le \epsilon$ and $|\epsilon_2| \le \epsilon$, such that $(x,y)$ is the solution to
$$x = \frac{\mu a + \epsilon_1}{\mu + y^2}, \qquad y = \frac{\mu a + \epsilon_2}{x^2 + \mu(a^2+1)}.$$
We can repeat the same analysis above to show that $x \le x^{*}_\mu$ and $y \ge y^{*}_\mu$. Applying the same logic to $(x,y) \in A^2_{\mu,\epsilon}$, we find $x \ge x^{**}_\mu$ and $y \le y^{**}_\mu$. Thus, $(x^{*}_\mu, y^{*}_\mu)$ is the extreme point of $A^1_{\mu,\epsilon}$ and $(x^{**}_\mu, y^{**}_\mu)$ is the extreme point of $A^2_{\mu,\epsilon}$, and we get the results.

F Experiments Details

In this section, we present experiments to validate the global convergence of Algorithm 6. Our goal is twofold. First, we aim to demonstrate that, irrespective of the starting point, Algorithm 6 using gradient descent consistently returns the global minimum. Second, we contrast the updating scheme for $\mu_k, \epsilon_k$ prescribed in Algorithm 6 with an arbitrary updating scheme; this comparison illustrates how an inappropriate setting of the parameters in gradient descent can lead to incorrect solutions.
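The experiments in this section can be reproduced in a few lines. The sketch below (ours, with illustrative parameter choices; the gradient of $g_\mu$ is our reconstruction from (37)–(38)) composes the $(\mu_k, \epsilon_k)$ schedule of Algorithm 6 with warm-started gradient descent:

```python
import numpy as np

def run_path_following(a, beta, delta, w0, mu_min=1e-3):
    """Outer loop: the (mu_k, eps_k) schedule of Algorithm 6, stopped once
    mu_k < mu_min; inner loop: gradient descent with step 1/L (Lemma 5),
    warm-started at the previous iterate."""
    mu = (a ** 2 / 4) * (1 - delta) ** 3 * (1 - beta) ** 4 / (1 + beta) ** 2
    w = np.asarray(w0, dtype=float)

    def grad(z):  # gradient of g_mu, reconstructed from (37)-(38)
        return np.array([z[0] * (mu + z[1] ** 2) - mu * a,
                         z[1] * (mu * (a ** 2 + 1) + z[0] ** 2) - mu * a])

    while mu > mu_min:
        eps = min(beta * a * mu, mu ** 1.5)
        L = mu * (a ** 2 + 1) + 3 * a ** 2
        for _ in range(500_000):  # capped inner loop
            g = grad(w)
            if np.linalg.norm(g) <= eps:
                break
            w -= g / L
        ratio = eps / mu
        mu = (2 * mu ** 2) ** (2 / 3) * (a + ratio) ** (2 / 3) / (a - ratio) ** (4 / 3)
    return w  # approaches the global minimum W_G = (a, 0) as mu_min -> 0

rng = np.random.default_rng(0)
print(run_path_following(a=2.0, beta=0.01, delta=0.4, w0=rng.standard_normal(2)))
```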
F.1 Random Initialization Converges to Global Optimum

Figure 9: Trajectories of the gradient descent path for four random initializations with $a = 2$, $\beta = 0.01$, $\delta = 0.4$, $\mu_0 = 0.2034$. We observe that regardless of the initialization, Algorithm 6 always converges to the global minimum. Initial $\mu_0 = \frac{a^2}{4}\frac{(1-\delta)^3(1-\beta)^4}{(1+\beta)^2}$.

Figure 10: Trajectories of the gradient descent path for four random initializations with $a = 0.5$, $\beta = 0.01$, $\delta = 0.1$, $\mu_0 = 0.0429$. We observe that regardless of the initialization, Algorithm 6 always converges to the global minimum. Initial $\mu_0 = \frac{a^2}{4}\frac{(1-\delta)^3(1-\beta)^4}{(1+\beta)^2}$.

F.2 Wrong Specification of δ Leads to a Spurious Local Optimum

Figure 11: Trajectory of the gradient descent path for two different $\delta$ ($a = 0.5$, $\beta = 0.01$). Left ($\delta = 0.4$, $\mu_0 = 0.0127$): $\delta$ violates the requirement $\big(\frac{1+\beta}{1-\beta}\big)^2 \le (1-\delta)(a^2+1)$ in Theorem 4, leading to a spurious local minimum. Right ($\delta = 0.1$, $\mu_0 = 0.0429$): $\delta$ satisfies the requirement, leading to the global minimum. Initial $\mu_0 = \frac{a^2}{4}\frac{(1-\delta)^3(1-\beta)^4}{(1+\beta)^2}$.

F.3 Wrong Specification of β Leads to an Incorrect Solution

Figure 12: Trajectory of the gradient descent path for two different $\beta$ ($a = 2$, $\delta = 0.4$). Left ($\beta = 0.6$, $\mu_0 = 0.0022$): $\beta$ violates the requirement $\big(\frac{1+\beta}{1-\beta}\big)^2 \le (1-\delta)(a^2+1)$ in Theorem 4, leading to an incorrect solution. Right ($\beta = 0.01$, $\mu_0 = 0.2034$): $\beta$ satisfies the requirement, leading to the global minimum. Initial $\mu_0 = \frac{a^2}{4}\frac{(1-\delta)^3(1-\beta)^4}{(1+\beta)^2}$.

F.4 Faster Decrease of µk Leads to an Incorrect Solution

Figure 13: Trajectory of the gradient descent path for two different update rules for $\mu_k$ with the same initialization ($a = 0.5$, $\beta = 0.01$, $\delta = 0.15$, $\mu_0 = 0.0361$). Left: bad scheduling uses a faster-decreasing scheme for $\mu_k$, leading to an incorrect solution, in fact not even a local optimum. Right: good scheduling follows the update rule for $\mu_k$ in Algorithm 6, leading to the global minimum. Initial $\mu_0 = \frac{a^2}{4}\frac{(1-\delta)^3(1-\beta)^4}{(1+\beta)^2}$.