# nonergodic_convergence_analysis_of_heavyball_algorithms__c2bdd677.pdf

The Thirty-Third AAAI Conference on Artiﬁcial Intelligence (AAAI-19)

Non-Ergodic Convergence Analysis of Heavy-Ball Algorithms

Tao Sun,1 Penghang Yin,2 Dongsheng Li,1 Chun Huang,1 Lei Guan,1 Hao Jiang1

1College of Computer, National University of Defense Technology, Changsha, Hunan, China. 2Department of Mathematics, University of California, Los Angeles, USA. nudtsuntao@163.com, yph@ucla.edu, {dsli,chunhuang}@nudt.edu.cn, guanleics@gmail.com, haojiang@nudt.edu.cn

In this paper, we revisit the convergence of the Heavy-ball method, and present improved convergence complexity results in the convex setting. We provide the ﬁrst non-ergodic O(1/k) rate result of the Heavy-ball algorithm with constant step size for coercive objective functions. For objective functions satisfying a relaxed strongly convex condition, the linear convergence is established under weaker assumptions on the step size and inertial parameter than made in the existing literature. We extend our results to multi-block version of the algorithm with both the cyclic and stochastic update rules. In addition, our results can also be extended to decentralized optimization, where the ergodic analysis is not applicable.

Introduction In this paper, we study the Heavy-ball algorithm ﬁrst proposed by (Polyak 1964), for solving the following unconstrained minimization problem

min x Rm f(x), (1)

where f is convex and differentiable, and f is Lipschitz continuous with constant L. The Heavy-ball method iterates

xk+1 = xk γk f(xk) + βk(xk xk 1), (2)

where γk is the step size and βk is the inertial parameter. Different from the gradient descent algorithm, the sequence generated by Heavy-ball method is not Fej er monotone due to the inertial term βk(xk xk 1). This poses a challenge in proving the convergence rate of {f(xk)}k 0 in the convex case. In the existing literature, the sublinear convergence rate of the Heavy-ball has been proved only in the sense of ergodicity. When the objective function is twice continuously differentiable and strongly convex (i.e., almost quadratic), the Heavy-ball method provably converges linearly. Under a weaker assumption that the objective function is nonconvex but Lipschitz differentiable, (Zavriev and Kostyuk 1993) proved that the sequence generated by the Heavyball method will converge to a critical point, yet without specifying the convergence rate. The smoothness of objective function is crucial for convergence of the Heavyball. Indeed, it can be divergent for a strongly convex but

Copyright c 2019, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

nonsmooth function as suggested by (Lessard, Recht, and Packard 2016). Different from the classical gradient descent methods, the Heavy-ball algorithm fails to generate a Fej er monotone sequence. In the convex and smooth case, the only result about convergence rate, to our knowledge, is the ergodic O(1/k) rate in terms of the objective value (Ghadimi, Feyzmahdavian, and Johansson 2015), i.e.,

f Pk i=1 xi

k min f O 1

k . The linear convergence of Heavy-ball algorithm was proved under the strongly convexity assumption by (Ghadimi, Feyzmahdavian, and Johansson 2015). But the authors imposed a restrictive assumption on the inertial parameter βk. Speciﬁcally, when the strongly convex constant is tiny, the convergence result holds only for a small range of βk values. By incorporating the idea of proximal mapping, the inertial proximal gradient algorithm (i Piano) was proposed in (Ochs et al. 2014), whose convergence in nonconvex case was thoroughly discussed. Locally linear convergence of i Piano and Heavy-ball method was later proved in (Ochs 2016). In the strongly convex case, the linear convergence was proved for i Piano with ﬁxed βk (Ochs, Brox, and Pock 2015). In the paper (Pock and Sabach 2016), inertial Proximal Alternating Linearized Minimization (i PALM) was introduced as a variant of i Piano for solving the two-block regularized problem. (Xu and Yin 2013) analyzed the Heavy-ball algorithm in tensor minimization problems. Stochastic versions of heavyball have also been introduced (Loizou and Richt arik 2017b; 2017a). A multi-step heavy-ball algorithm was analyzed in (Liang, Fadili, and Peyr e 2016). The inertial methods are also developed and studied in the operator research by (Combettes and Glaudin 2017). None of the aforementioned Heavy-ball based algorithms, however, provides a non-ergodic convergence rate.

Contributions

In this paper, we establish the ﬁrst non-ergodic O(1/k) convergence result in general convex case. More precisely, we prove that f(xk) min f O( 1

k) for convex and coercive f 1. Compared with existing result in (Ghadimi, Feyzmahdavian, and Johansson 2015), ours allows a larger step size γk. We also prove a linear convergence result under a restricted

1We say f is coercive, if f(x) + as x + .

strongly convex condition, weaker than the strong convexity assumption. In short, we make weaker assumptions on the step size, on the inertial parameter, as well as on the convexity of the objective function. The convergence of multi-block extensions of Heavy-ball method is studied. The sublinear and linear convergence rates are proved for the cyclic and stochastic update rules, respectively. In addition, we extend our analysis to the decentralized Heavy-ball method, where the ergodic analysis is not applicable. Our theoretical results are based on a novel Lyapunov function, which is motivated by a modiﬁed dynamical system.

A Dynamical System Interpretation It has been long known that the Heavy-ball method is equivalent to the discretization of the following second-order ODE (Alvarez 2000):

x(t) + α x(t) + f(x(t)) = 0, t 0, (3)

for some α > 0. In the case βk 0, the Heavy-ball method boils down to the standard gradient descent, which is known to be the discretization of the following ﬁrst-order ODE

α x(t) + f(x(t)) = 0, t 0. (4)

The dynamical system (3), however, misses essential information about relation between x(t) and x(t). Speciﬁcally, if we replace x(t) by xk+1 2xk+xk 1

h2 with h being the discretization step size, then it holds that xk+1 2xk + xk 1

Since both xk+1 xk

h and xk xk 1

h can be viewed as the discretization of x(t), we propose to modify (3) by adding the following constraint

x(t) θ x(t) , (5)

where θ > 0. In next section, we will devise a useful Lyapunov function by exploiting the additive constraint (5) and establish the asymptotic non-ergodic sublinear convergence rate in the continuous setting. Finally, we will translate this analysis into that in discretized setting.

Analysis of the Dynamical System We analyze the modiﬁed dynamical system (3) + (5). The existence of the solution is beyond the scope of this paper and will not be discussed here. Let us assume that f is coercive, α > θ, and f(x(0)) min f > 0. We consider the Lyapunov function

ξ(t) := f(x(t)) + 1

2 x(t) 2 min f 0, (6)

and refer the readers to the relevant equations (3)-(6). A direct calculation gives

ξ(t) = f(x(t)), x(t) + x(t), x(t)

= α x(t) 2, (7)

which means {ξ(t)}t 0 is non-increasing. As a result,

sup t {f(x(t)) min f} sup t ξ(t) ξ(0).

By the coercivity of f, {x(t)}t 0 is bounded. Then by the continuity of f, { f(x(t))}t 0 is bounded; using (3), { x(t) + α x(t)}t 0 is also bounded. By the triangle inequality, we have

x(t) + α x(t) α x(t) x(t) (α θ) x(t) . (8)

Since α > θ, we obtain the boundedness of { x(t)}t 0; by (5), { x(t)}t 0 is also bounded. Let x arg min f, we have

0 f(x(t)) f(x ) a) f(x(t)), x(t) x

b) f(x(t)) x(t) x

c)= x(t) + α x(t) x(t) x

d) ( x(t) + α x(t) ) x(t) x

e) (α + θ) x(t) x(t) x , (9)

where a) is due to the convexity of f; b) is due to the Young s inequality; c) is due to (3); d) is due to the triangle inequality; e) is because of (5). Denote

r := sup t 0

(α + θ) x(t) x + x(t)

Since {x(t)}t 0 and { x(t)}t 0 are both bounded, we have r < + . Using (9), we have

ξ(t)2 = f(x(t)) f(x ) + 1

(α + θ) x(t) x x(t) + 1

(α + θ) x(t) x + 1

2 x(t) x(t) 2

r2 x(t) 2. (10)

Combining (7) and (10), we have ξ(t)2 r2

α ξ(t), or equivalently,

Taking the integral of both sides from 0 to t and noting that ξ(0) f(x(0)) min f > 0 (we have assumed that f(x(0)) min f > 0), we get

r2 t 1 ξ(0) 1 ξ(t),

and thus ξ(t) 1 α r2 t+ 1 ξ(0) . Since f(x(t)) min f ξ(t),

we have f(x(t)) min f 1 α r2 t + 1 ξ(0) Thus we have derived the asymptotic sublinear convergence rate for f(x(t)) min f.

Convergence Analysis of Heavy-ball In this section, we prove convergence rates of Heavy-ball method. The core of the proof is to construct a proper Lyapunov function. The expression of ξ(t) in (6) suggests the Lyapunov function be of the form f(xk) + C xk+1 xk 2 for some C > 0. In fact, we have the following sufﬁcient descent lemma. All the technical proofs for the rest of the paper, will be provided in the supplementary materials.

Lemma 1 Suppose f is convex with L-Lipschitz gradient and min f > . Let (xk)k 0 be generated by the Heavyball method with non-increasing (βk)k 0 [0, 1). By choosing the step size

γk = 2(1 βk)c

L with ﬁxed 0 < c < 1, we have f(xk) + βk

2γk xk xk 1 2

f(xk+1) + βk+1

2γk+1 xk+1 xk 2

2c xk+1 xk 2. (12)

According to Lemma 1, a potentially useful Lyapunov function is h f(xk) + βk

2γk xk xk 1 2i

k 0, as it has the de-

scent property shown in (12). However, it does not fulﬁll the relation in (10) 2. Therefore, we rewrite (12), so that the new right-hand-side contains something like xk+1 xk 2+ xk xk 1 2. It turns out that a better Lyapunov function reads

ξk := f(xk) + δk xk xk 1 2 min f, (13)

We can see ξk is in line with the discretization of (6). Given the Lyapunov function in (13), we present a key technical lemma.

Lemma 2 Suppose the assumptions of Lemma 1 hold. Let xk denote the projection of xk onto arg min f, assumed to exist, and deﬁne

εk := 4cδ2 k (1 c)L + 4c (1 c)Lγ2 k . (15)

Then it holds that

(ξk)2 εk (ξk ξk+1)

(2 xk xk 2 + xk xk 1 2). (16)

We see that (16) is the discretization of (11) if supk{εk (2 xk xk 2 + xk xk 1 2)} < + .

2That is, we are not able build a useful error relation for f(xk)+ βk 2γk xk xk 1 2.

Sublinear Convergence We present the non-ergodic O( 1

k) convergence rate of the function value. This rate holds when (βk)k 0 (0, 1). We deﬁne

R := sup k 0 sup x argmin f { xk x 2}. (17)

In our following settings, we can see it actually holds that R < + .

Theorem 1 Under the assumptions of Lemma 1 and assumptions that 0 < infk βk βk β0 < 1 and f is coercive. We have

f(xk) min f 4R supk{εk}

To our best knowledge, this is the ﬁrst non-ergodic result established for Heavy-ball algorithm in convex case. The deﬁnition of εk implies supk{εk} = O(L), so it holds that

f(xk) min f = O R L

which is on the same order of complexity as that in gradient descent. The coercivity assumption on f is crucial for Theorem 1. When the function f fails to be coercive, we need to assume summable (βk)k 0 instead.

Corollary 1 Suppose the assumptions of Lemma 1 hold, and P

k βk < + .3 Let (xk)k 0 be generated by the Heavy-ball algorithm and f be coercive. Then we have

f(xk) min f 4R supk{εk}

Linear Convergence with Restricted Strong Convexity We say the function f satisﬁes a restricted strongly convex condition (Lai and Yin 2013), if

f(x) min f ν x x 2, (19)

where x is the projection of x onto the set arg min f, and ν > 0. Restricted strong convexity is weaker than the strong convexity. For example, let us consider the function 1

2 Ax b 2 with b range(A). When A fails to be full row-rank, 1 2 Ax b 2 is not strongly convex but restricted strongly convex.

Theorem 2 Suppose the assumptions of Theorem 1 hold, and f satisﬁes condition (19). Then we have

f(xk) min f ωk,

for ωk := ℓ 1+ℓ (0, 1) and ℓ:= supk n εk( 1

Our result improves the linear convergence established by (Ghadimi, Feyzmahdavian, and Johansson 2015) in two aspects: Firstly, The strongly convex assumption is weakened to (19). Secondly, The step size and inertial parameter are chosen independent of the strongly convex constants.

3A classical example is βk = 1 kθ , where θ > 1.

Cyclic Coordinate Descent Heavy-ball Algorithm In this section, we consider the multi-block version of Heavy-ball algorithm and prove its convergence rates under convexity assumption. The minimization problem reads

min x1,x2,...,xm f(x1, x2, . . . , xm). (20)

The function f is assumed to satisfy

if(x) if(y) Li x y . (21)

With (21), we can easily obtain

f(x1, x2, . . . , x1 i , . . . , xm) f(x1, x2, . . . , x2 i , . . . , xm)

+ if(x1, x2, . . . , x2 i , . . . , xm), x1 i x2 i

2 x1 i x2 i 2. (22)

The proof is similar to [Lemma 1.2.3,(Nesterov 2013)], and we shall skip it here. We denote

k i f := if(xk+1 1 , . . . , xk+1 i 1 , xk i . . . , xk m),

xk := (xk 1, xk 2, . . . , xk m), L :=

with the convention xk+1 0 = xk 1. The cyclic coordinate descent inertial algorithm iterates: for i from 1 to m,

xk+1 i = xk i γk,i k i f + βk,i(xk i xk 1 i ), (23)

where γk,i, βk,i > 0. Our analysis relies on the following assumption: A1: for any i [1, 2, . . . , m], the parameters (βk,i)k 0 [0, 1) is non-increasing.

Lemma 3 Let f be a convex function satisfying (21), and ﬁnite min f. Let (xk)k 0 be generated by scheme (23) and assumption A1 hold. Choosing the step size

γk,i = 2(1 βk,i)c

Li , i [1, 2, . . . , m]

for arbitrary ﬁxed 0 < c < 1, we have "

βk,i 2γk,i xk i xk 1 i 2 #

βk+1,i 2γk+1,i xk+1 i xk i 2 #

2c xk+1 xk 2, (24)

where L = mini [1,2,...,m]{Li}.

We consider the following similar Lyapunov function in the analysis of cyclic coordinate descent Heavy-ball algorithm

ˆξk := f(xk) +

i=1 δk,i xk i xk 1 i 2 min f, (25)

δk,i := βk,i

Then we have the following lemma.

Lemma 4 Suppose the conditions Lemma 3 hold. Let xk denote the projection of xk onto arg min f, assumed to exist, and deﬁne

4c Pm i=1 δ2 k+1,i + 1 γ2 k,i

(1 c)L , 4c m L

It holds that

(ˆξk)2 ˆεk(ˆξk ˆξk+1)

(2 xk xk 2 + xk xk 1 2). (28)

Sublinear Convergence of Cyclic Coordinate Descent Heavy-ball Algorithm We show the O(1/k) convergence rate of cyclic coordinate descent Heavy-ball algorithm for coercive f.

Theorem 3 Suppose the conditions of Lemma 3 hold, f is coercive, and

0 < inf k βk,i βk,i β0 < 1, i [1, 2, . . . , m].

Then we have

f(xk) min f = O 4R supk{ˆεk}

where R is given by (17).

We readily check that supk{ˆεk} = O(m L), where m is number of the block. Therefore, the cyclic inertial algorithm converges with the rate O m R L

k . Compared with the results in (Sun and Hong 2015), this rate is on the same order as that of cyclic block coordinate descent in general convex setting.

Linear Convergence of Cyclic Coordinate Descent Heavy-ball Algorithm Under the same assumption of restricted strong convexity, we derive the linear convergence rate for cyclic coordinate descent Heavy-ball algorithm.

Theorem 4 Suppose the conditions of Lemma 3 hold, f satisﬁes (19), and

0 < inf k βk,i βk,i β0 < 1, i [1, 2, . . . , m].

Then we have

f(xk) min f (ˆω)k (30)

for some ˆω = ˆℓ 1+ˆℓ (0, 1), and ˆℓ:= supk ˆεk + 2

1 mini{δk,i} .

This result can be extended to the essentially cyclic Heavy-ball algorithm. The essentially cyclic index selection strategy (Sun, Hannah, and Yin 2017), which generalizes the cyclic update rule, is deﬁned as follows: there is an M N, M m, such that each block i {1, 2, . . . , m} is updated at least once in a window of M.

Stochastic Coordinate Descent Heavy-ball Algorithm For the stochastic index selection strategy, in the k-th iteration, we pick ik uniformly from [1, 2, . . . , m] and iterate xk+1 ik = xk ik γk ikf(xk) + βk(xk ik xk 1 ik ), xk+1 i = xk i , if i = ik. (31)

In this section, we make the following assumption A2: the parameters (βk)k 0 [0, m) is non-increasing. Assumption A2 is quite different from previous requirement that (βk)k 0 which are constrained on [0, 1). This difference comes from the uniformly stochastic selection of the index.

Lemma 5 Let f be a convex function whose gradient is Lipschitz continuous with L, and ﬁnite min f. Let (xk)k 0 be generated by scheme (31) and assumption A2 be satisﬁed. Choose the step size

γk = 2(1 βk/ m)c

L for arbitrary ﬁxed 0 < c < 1. Then we can obtain Ef(xk) + βk 2 mγk E xk xk 1 2

Ef(xk+1) + βk+1 2 mγk+1 E xk+1 xk 2

2c E xk+1 xk 2. (32)

Similarly, we consider the following function ξk := f(xk) + δk xk xk 1 2 min f, (33)

δk := βk 2 mγk + 1

Different from the previous analyses, the Lyapunov function considered here is E ξk instead of ξk. Naturally, the sufﬁcient descent property is established in the sense of expectation.

Lemma 6 Suppose the conditions of Lemma 5 hold. Let xk denote the projection of xk onto arg min f, assumed to exist, and deﬁne

εk := 4cδ2 k (1 c)L + 8cm (1 c)Lγ2 k . (35)

Then it holds

(E ξk)2 εk (E ξk E ξk+1)

(E xk xk 2 + E xk xk 1 2). (36)

Sublinear Convergence of Stochastic Coordinate Descent Heavy-ball Algorithm Due to that the sufﬁcient descent condition involves expectations, even using the coercivity of f, we cannot obtain the boundedness of the generated points. Therefore, we ﬁrst present a result by assuming the smoothness of f only.

Theorem 5 Suppose that the assumptions of Lemma 5 hold. Then we have

min 0 i k E f(xk) = o 1

We remark that Theorem (5) also holds for nonconvex functions. To obtain the sublinear convergence rate on the function values, we need a boundedness assumption. Precisely, the assumption is A3: the sequence (xk)k 0 satisﬁes

n E xk xk 2o < + .

Under assumption A3, we are able to show the nonergodic convergence sublinear convergence rates of the expected objective values.

Theorem 6 Suppose that the assumptions of Lemma 5 and A3 hold. Then we have

Ef(xk) min f = O 4 R supk{ εk}

Linear Convergence of Stochastic Coordinate Descent Heavy-ball Algorithm The linear convergence rate of stochastic coordinate descent Heavy-ball algorithm is similar to previous ones. By assuming the restricted strongly convex condition, the linear convergence rate of the expected objective values can be proved.

Theorem 7 Suppose that the assumptions in Lemma 5 hold, and the function satisﬁes the restricted strongly convex condition (19). Let (xk)k 0 be generated by the scheme (31). Then we have

Ef(xk) min f ( w)k, (39)

where w := ℓ 1+ ℓ (0, 1), and ℓ:= supk{ εk + 1

While we only consider the uniform probability selection strategy here, the same convergence results can be easily extended to the non-uniform probability selection strategy.

Applications to Decentralized Optimization We apply the analysis to the following decentralized optimization problem

0 100 200 300 400 500 600 700 800 900 1000 -0.5

gradient descent beta=0.1 beta=0.2 beta=0.3 beta=0.4

0 100 200 300 400 500 600 700 800 900 1000 -1

gradient descent beta=0.1 beta=0.2 beta=0.3 beta=0.4

0 100 200 300 400 500 600 700 800 900 1000 1.2

gradient descent beta=0.1 beta=0.2 beta=0.3 beta=0.4

0 100 200 300 400 500 600 700 800 900 1000 1

gradient descent beta=0.1 beta=0.2 beta=0.3 beta=0.4

0 100 200 300 400 500 600 700 800 900 1000 -1

cyclic coordinate gradient descent beta=0.1 beta=0.2 beta=0.3 beta=0.4

0 100 200 300 400 500 600 700 800 900 1000 -0.2

cyclic coordinate gradient descent beta=0.1 beta=0.2 beta=0.3 beta=0.4

0 100 200 300 400 500 600 700 800 900 1000 1.2

cyclic coordinate gradient descent beta=0.1 beta=0.2 beta=0.3 beta=0.4

0 100 200 300 400 500 600 700 800 900 1000 0.6

cyclic coordinate gradient descent beta=0.1 beta=0.2 beta=0.3 beta=0.4

0 100 200 300 400 500 600 700 800 900 1000 1.52

stochastic coordinate gradient descent beta=0.1 beta=0.2 beta=0.3 beta=0.4

0 100 200 300 400 500 600 700 800 900 1000 1.61

stochastic coordinate gradient descent beta=0.1 beta=0.2 beta=0.3 beta=0.4

0 100 200 300 400 500 600 700 800 900 1000 1.82

stochastic coordinate gradient descent beta=0.1 beta=0.2 beta=0.3 beta=0.4

0 100 200 300 400 500 600 700 800 900 1000 1.82

stochastic coordinate gradient descent beta=0.1 beta=0.2 beta=0.3 beta=0.4

Figure 1: Objective values of inertial algorithms v.s. number of epochs for linear regression and logistic regression tasks. Panels (a)-(d): Heavy-ball algorithms comparisons for linear regression with Gaussian data (a); for linear regression with Bernoulli data (b); for logistic regression with Gaussian data (c); for logistic regression with Bernoulli data (d). Panels (e)-(h): Cyclic coordinate descent Heavy-ball algorithms comparisons for linear regression with Gaussian data (e); for linear regression with Bernoulli data (f); for logistic regression with Gaussian data (g); for logistic regression with Bernoulli data (h). Panels (i)- (l): Stochastic coordinate descent Heavy-ball algorithms comparisons for linear regression with Gaussian data (i); for linear regression with Bernoulli data (j); for logistic regression with Gaussian data (k); for logistic regression with Bernoulli data (l).

where fi is differentiable and fi is Li-Lipschitz. Denote by x(i) Rn the local copy of x at node i and X := (x(1), x(2), . . . , x(m)) . In the community of decentralized algorithms, rather than directly solving the problem, following penalty formulation instead has been proposed

F(X) = f(X) + X (I W)X

where W = (wi,j) Rm m is the mixing matrix, and f(X) := Pm i=1 fi(x(i)), and I is the unit matrix. It is easy to see that F is Lipschitz with the constant LF := maxi{Li} + 1 λmin(W )

α , here λmin(W) is minimum eigenvalue of W. Researchers consider the decentralized gradient descent (DGD) (Nedic and Ozdaglar 2009), which is essentially the gradient descent applied to (40) with stepsize being equal to α. This algorithm can be implemented over a connected network, in which the agents communicate with their neighbors and make full use of the computing resources of

all nodes. Alternatively, we can use the Heavy-ball method by choosing the stepsize α, that is,

Xk+1 = Xk α F(Xk) + β(Xk Xk 1)

= WXk α f(Xk) + β(Xk Xk 1). For node i, the local scheme is then

xk+1(i) = X

j N(i) wi,jxk(j) α fi(xk(i))

+ β(xk(i) xk 1(i)),

where xk(i) is the copy of the variable xk in node i in the kth iteration and N(i) denotes the neighbors of node i. In the global scheme, it is basically Heavy-ball algorithm. Thus, we can apply our theoretical ﬁndings to this algorithm. To guarantee the convergence, we just need

0 < α < 2(1 β)

LF = 2(1 β)

maxi{Li} + 1 λmin(W )

After simpliﬁcation, we then get

α max i {Li} < 1 + λmin(W) 2β.

In a word, we need the requirements

0 β < 1 + λmin(W)

2 , 0 < α < 1 2β + λmin(W)

The convergence result for decentralized Heavy-ball method directly follows from our previous theoretical ﬁndings and can be summarized as below.

Corollary 2 Assume that fi is convex and differentiable, and fi is Lipschitz with Li. Let 0 β < 1+λmin(W )

2 , and the sequence (Xk)k 0 be generated by the decentralized Heavy-ball method. For any ﬁxed stepsize 0 < α < 1 2β+λmin(W )

maxi{Li} , we have

F(Xk) min F = O 1

This justiﬁes the superiority of our non-ergodic analysis. As aforementioned in the introduction, all the existing convergence results are about the sequence n F Pk i=1 Xi

k min F o . However, for decentralized Heavy-ball algorithm, it is meaningless to discuss the ergodic rates, because the nodes only communicate with their neighbors. However, our results, in this case, still hold.

Experimental Results We report the numerical simulations of Heavy-ball method applied to the linear regression problem

i=1 (yi A i x)2 )

and the logistic regression problem

i=1 log 1 + exp( yi A i x) + λ

where (Ai, yi) Rn R, i = 1, 2, . . . , m. All experiments were performed using MATLAB on an desktop with an Intel 3.4 GHz CPU. We tested the three Heavy-ball algorithms with different inertial parameters. We ﬁxed the stepsize as γ = 1 L in all numerical tests. For the stepsize, we need 2(1 βk) > 1, i.e., 0 βk < 0.5. Therefore, inertial parameters are set to βk β = 0, 0.1, 0.2, 0.3, 0.4. For linear regression problem, L = λmax(Pm i=1 A i Ai), where λmax( ) denotes the largest eigenvalue of a matrix; whereas for logistic regression, we have L = λmax(Pm i=1 A i Ai) + λ. With schemes of the algorithms, for cyclic coordinate gradient descent, the function values are recorded after the whole epoch is updated; while for stochastic coordinate gradient descent, functions values are updated after per iteration. The special case β = 0 corresponds to the gradient descent, or cyclic coordinate gradient descent, or stochastic coordinate gradient descent. And we set n = 100 and m = 150. The

data Ai and yi were generated by the Gaussian random and Bernoulli random distributions, respectively. The maximum number of iterations was set to 1000. For logistic regression, we set λ = 10 3. We tested the three Heavy-ball algorithms for both two regression tasks with Gaussian and Bernoulli data. As illustrated by Figure 1, larger β leads to faster convergence for both Heavy-ball algorithm and cyclic coordinate descent algorithm when β [0, 0.4]. However, for the stochastic block coordinate descent scheme, the inertial method helps insigniﬁcally. This is because for the stochastic case, in the kth iteration, the inertial terms contribute only when ik = ik 1. This case, however, happens with probability N 1 N2 = 1

N and N is the number of the blocks; as N is large, ik = ik 1 happens at low probability for just one iteration, let alone the whole iterations. Therefore, the inertial method is actually inactive at most iterations for stochastic block coordinate descent scheme. To improve the practical performance of stochastic block coordinate descent Heavy-ball algorithm, another inertial scheme proposed in (Xu and Yin 2013) can be recruited, in which, a new storage yk is used. In each iteration, the algorithm employs γk(xk ik yk ik) to replace γk(xk ik xk ik) in scheme (31) and then updates yk ik = xk ik with keeping other coordinates of yk. In this scheme, the inertial term can be active for all iterations. However, the convergence of such algorithm is beyond the proof techniques proposed in this paper, and of course, deserves further study.

In this paper, we studied the non-ergodic computational complexity of the Heavy-ball methods in the convex setting. Under different assumptions, we proved the non-ergodic sublinear and linear convergence rates for the algorithm, respectively. In both cases, we made much more relaxed assumptions than appeared in the existing literatures. Our proof was motivated by the analysis on a novel dynamical system. We extended our results to the multi-block coordinate descent Heavy-ball algorithm for both cyclic and stochastic update rules. The application to decentralized optimization demonstrated the advantage of our analysis techniques.

Acknowledgments: The authors are indebted to anonymous referees for their useful suggestions. We are grateful for the support from the the Major State Research Development Program (2016YFB0201305), and National Natural Science Foundation of Hunan (2018JJ3616).

Alvarez, F. 2000. On the minimizing property of a second order dissipative system in hilbert spaces. SIAM Journal on Control and Optimization 38(4):1102 1119. Combettes, P. L., and Glaudin, L. E. 2017. Quasinonexpansive iterations on the afﬁne hull of orbits: From mann s mean value algorithm to inertial methods. Siam Journal on Optimization 27(4).

Ghadimi, E.; Feyzmahdavian, H. R.; and Johansson, M. 2015. Global convergence of the heavy-ball method for convex optimization. In Control Conference (ECC), 2015 European, 310 315. IEEE. Lai, M.-J., and Yin, W. 2013. Augmented ℓ1 and nuclearnorm models with a globally linearly convergent algorithm. SIAM Journal on Imaging Sciences 6(2):1059 1091. Lessard, L.; Recht, B.; and Packard, A. 2016. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization 26(1):57 95. Liang, J.; Fadili, J.; and Peyr e, G. 2016. A multi-step inertial forward-backward splitting method for non-convex optimization. In Advances in Neural Information Processing Systems, 4035 4043. Loizou, N., and Richt arik, P. 2017a. Linearly convergent stochastic heavy ball method for minimizing generalization error. ar Xiv preprint ar Xiv:1710.10737. Loizou, N., and Richt arik, P. 2017b. Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. ar Xiv preprint ar Xiv:1712.09677. Nedic, A., and Ozdaglar, A. 2009. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control 54(1):48 61. Nesterov, Y. 2013. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media. Ochs, P.; Chen, Y.; Brox, T.; and Pock, T. 2014. ipiano: Inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences 7(2):1388 1419. Ochs, P.; Brox, T.; and Pock, T. 2015. ipiasco: Inertial proximal algorithm for strongly convex optimization. Journal of Mathematical Imaging and Vision 53(2):171 181. Ochs, P. 2018. Local convergence of the heavy-ball method and ipiano for non-convex optimization. Journal of Optimization Theory and Applications 177(1): 153 180. Pock, T., and Sabach, S. 2016. Inertial proximal alternating linearized minimization (ipalm) for nonconvex and nonsmooth problems. SIAM Journal on Imaging Sciences 9(4):1756 1787. Polyak, B. T. 1964. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5):1 17. Sun, R., and Hong, M. 2015. Improved iteration complexity bounds of cyclic block coordinate descent for convex problems. In Advances in Neural Information Processing Systems, 1306 1314. Sun, T.; Hannah, R.; and Yin, W. 2017. Asynchronous coordinate descent under more realistic assumptions. In Advances in Neural Information Processing Systems, 6182 6190. Xu, Y., and Yin, W. 2013. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on imaging sciences 6(3):1758 1789.

Zavriev, S., and Kostyuk, F. 1993. Heavy-ball method in nonconvex optimization problems. Computational Mathematics and Modeling 4(4):336 341.