# Continuous-time Analysis of Anchor Acceleration

Jaewook J. Suh (Seoul National University, jacksuhkr@snu.ac.kr), Jisun Park (Seoul National University, colleenp0515@snu.ac.kr), Ernest K. Ryu (Seoul National University, ernestryu@snu.ac.kr)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Recently, the anchor acceleration, an acceleration mechanism distinct from Nesterov's, has been discovered for minimax optimization and fixed-point problems, but its mechanism is not well understood, much less so than Nesterov acceleration. In this work, we analyze continuous-time models of anchor acceleration. We provide tight, unified analyses characterizing the convergence rate as a function of the anchor coefficient $\beta(t)$, thereby providing insight into the anchor acceleration mechanism and its accelerated $\mathcal{O}(1/k^2)$-convergence rate. Finally, we present an adaptive method inspired by the continuous-time analyses and establish its effectiveness through theoretical analyses and experiments.

1 Introduction

Nesterov acceleration [51] is foundational to first-order optimization theory, but its mechanism and convergence proof are not transparent. One approach to better understanding the mechanism is continuous-time analysis: derive an ODE model of the discrete-time algorithm and analyze the continuous-time dynamics [65, 66]. This approach provides insight into the accelerated dynamics and has led to a series of follow-up works [71, 62, 29].

Recently, a new acceleration mechanism, distinct from Nesterov's, has been discovered. This anchor acceleration for minimax optimization and fixed-point problems [35, 75, 54] has been an intense subject of study, but its mechanism is understood much less than Nesterov acceleration. The various analytic techniques developed to understand Nesterov acceleration, including continuous-time analyses, have been applied to it only in a very limited manner [59].

Contribution. In this work, we present continuous-time analyses of anchor acceleration. The continuous-time model is the differential inclusion
$$\dot{X} \in -\mathbb{A}(X) - \beta(t)(X - X_0)$$
with initial condition $X(0) = X_0 \in \operatorname{dom} \mathbb{A}$, maximal monotone operator $\mathbb{A}$, and scalar-valued function $\beta(t)$. The case $\beta(t) = \frac{1}{t}$ corresponds to the prior anchor-accelerated methods APPM [35], EAG [75], and FEG [42]. We first establish that the differential inclusion is well-posed, despite the anchor coefficient $\beta(t)$ blowing up at $t = 0$. We then provide tight, unified analyses characterizing the convergence rate as a function of the anchor coefficient $\beta(t)$. This is the first formal and rigorous treatment of this anchored dynamics, and it provides insight into the anchor acceleration mechanism and its accelerated $\mathcal{O}(1/k^2)$-convergence rate. Finally, we present an adaptive method inspired by the continuous-time analyses and establish its effectiveness through theoretical analyses and experiments.

1.1 Preliminaries and notation

We provide an organization of prior work in Appendix A. Here, we review standard definitions and set up the notation.

Monotone and set-valued operators. We follow the standard definitions of Bauschke and Combettes [15] and Ryu and Yin [58]. For the underlying space, consider $\mathbb{R}^n$ with the standard inner product $\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|$. Define the domain of $\mathbb{A}$ as $\operatorname{dom} \mathbb{A} = \{x \in \mathbb{R}^n \mid \mathbb{A}x \neq \emptyset\}$. We say $\mathbb{A}$ is an operator on $\mathbb{R}^n$ and write $\mathbb{A}\colon \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ if $\mathbb{A}$ maps a point in $\mathbb{R}^n$ to a subset of $\mathbb{R}^n$. We say $\mathbb{A}\colon \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is monotone if
$$\langle \mathbb{A}x - \mathbb{A}y,\, x - y \rangle \ge 0, \qquad \forall\, x, y \in \mathbb{R}^n,$$
where the notation means that $\langle u - v, x - y \rangle \ge 0$ for all $u \in \mathbb{A}x$ and $v \in \mathbb{A}y$.
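As a quick numerical illustration of this definition, the following is a minimal sketch (not from the paper) that checks monotonicity of the skew-symmetric linear operator used as the running example later in Section 4; for a skew operator the inner products below are exactly zero.

```python
import numpy as np

# Monotonicity check for the skew-symmetric operator A(x) = [[0, 1], [-1, 0]] x,
# the running example of Section 4.  Monotonicity means
# <Ax - Ay, x - y> >= 0 for all x, y; for a skew-symmetric A this inner
# product is exactly 0, since the symmetric part of A vanishes.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.standard_normal(2)
    y = rng.standard_normal(2)
    gap = np.dot(A @ x - A @ y, x - y)
    assert gap >= -1e-12, "monotonicity violated"

print("monotonicity holds on all sampled pairs")
```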
For $\mu \in (0, \infty)$, we say $\mathbb{A}\colon \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is $\mu$-strongly monotone if
$$\langle \mathbb{A}x - \mathbb{A}y,\, x - y \rangle \ge \mu \|x - y\|^2, \qquad \forall\, x, y \in \mathbb{R}^n.$$
Write $\operatorname{Gra} \mathbb{A} = \{(x, u) \mid u \in \mathbb{A}x\}$ for the graph of $\mathbb{A}$. An operator $\mathbb{A}$ is maximally monotone if there is no other monotone operator $\mathbb{B}$ such that $\operatorname{Gra} \mathbb{A} \subset \operatorname{Gra} \mathbb{B}$ properly, and it is maximally $\mu$-strongly monotone if there is no other $\mu$-strongly monotone operator $\mathbb{B}$ such that $\operatorname{Gra} \mathbb{A} \subset \operatorname{Gra} \mathbb{B}$ properly. For $L \in (0, \infty)$, a single-valued operator $\mathbb{T}\colon \mathbb{R}^n \to \mathbb{R}^n$ is $L$-Lipschitz if $\|\mathbb{T}x - \mathbb{T}y\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^n$. Write $\mathbb{J}_{\mathbb{A}} = (\mathbb{I} + \mathbb{A})^{-1}$ for the resolvent of $\mathbb{A}$, where $\mathbb{I}\colon \mathbb{R}^n \to \mathbb{R}^n$ is the identity operator. When $\mathbb{A}$ is maximally monotone, it is well known that $\mathbb{J}_{\mathbb{A}}$ is single-valued with $\operatorname{dom} \mathbb{J}_{\mathbb{A}} = \mathbb{R}^n$. We say $x_\star \in \mathbb{R}^n$ is a zero of $\mathbb{A}$ if $0 \in \mathbb{A}x_\star$. We say $y_\star$ is a fixed point of $\mathbb{T}$ if $\mathbb{T}y_\star = y_\star$. Write $\operatorname{Zer} \mathbb{A}$ for the set of all zeros of $\mathbb{A}$ and $\operatorname{Fix} \mathbb{T}$ for the set of all fixed points of $\mathbb{T}$.

Monotonicity with continuous curves. We say an operator is differentiable if it is single-valued, continuous, and differentiable as a function. If a differentiable operator $\mathbb{A}$ is monotone and $X\colon [0, \infty) \to \mathbb{R}^n$ is a differentiable curve, then taking the limit $h \to 0$ of $\frac{1}{h^2}\langle \mathbb{A}(X(t + h)) - \mathbb{A}(X(t)),\, X(t + h) - X(t)\rangle \ge 0$ yields
$$\left\langle \frac{d}{dt}\mathbb{A}(X(t)),\, \dot{X}(t) \right\rangle \ge 0. \qquad (1)$$
Similarly, if $\mathbb{A}$ is furthermore $\mu$-strongly monotone, then
$$\left\langle \frac{d}{dt}\mathbb{A}(X(t)),\, \dot{X}(t) \right\rangle \ge \mu \|\dot{X}(t)\|^2. \qquad (2)$$

2 Derivation of differential inclusion model of anchor acceleration

2.1 Anchor ODE

Suppose $\mathbb{A}\colon \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is a maximal monotone operator and $\beta\colon (0, \infty) \to [0, \infty)$ is a twice differentiable function. Consider the differential inclusion
$$\dot{X}(t) \in -\mathbb{A}(X(t)) - \beta(t)(X(t) - X_0) \qquad (3)$$
with initial condition $X(0) = X_0 \in \operatorname{dom}(\mathbb{A})$. We refer to this as the anchor ODE.¹ We say $X\colon [0, \infty) \to \mathbb{R}^n$ is a solution if it is absolutely continuous and satisfies (3) for almost every $t \in (0, \infty)$. Denote by $S$ the subset of $[0, \infty)$ on which $X$ satisfies the differential inclusion. Define $\tilde{\mathbb{A}}(X(t)) = -\dot{X}(t) - \beta(t)(X(t) - X_0)$ for $t \in S$. Since $\tilde{\mathbb{A}}(X(t)) \in \mathbb{A}(X(t))$ for $t \in S$, we say $\tilde{\mathbb{A}}$ is a selection of $\mathbb{A}$ for $t \in S$. If $\tilde{\mathbb{A}}(X(t))$ is bounded on all bounded subsets of $S$, then we can extend $\tilde{\mathbb{A}}$ to $[0, \infty)$ while retaining certain favorable properties. We discuss the technical details of this extension in Appendix E.1. The statements of Section 3 are stated with this extension.

¹ Strictly speaking, this is a differential inclusion, not a differential equation, but we nevertheless refer to it as an ODE.

2.2 Derivation from discrete methods

We now show that the following instance of the anchor ODE,
$$\dot{X}(t) = -\mathbb{A}(X(t)) - \frac{1}{t}(X(t) - X_0), \qquad (4)$$
where $X(0) = X_0$ is the initial condition and $\mathbb{A}\colon \mathbb{R}^n \to \mathbb{R}^n$ is a continuous operator, is a continuous-time model of APPM [35], EAG [75], and FEG [42], which are accelerated methods for monotone inclusion and minimax problems.

Consider APPM with operator $h\mathbb{A}$,
$$x_k = \mathbb{J}_{h\mathbb{A}}(y_{k-1}), \qquad y_k = \frac{k}{k+1}(2x_k - y_{k-1}) + \frac{1}{k+1}y_0, \qquad (5)$$
with initial condition $y_0 = x_0$. Assume $h > 0$ and $\mathbb{A}\colon \mathbb{R}^n \to \mathbb{R}^n$ is a continuous monotone operator. Using $y_{k-1} = x_k + h\mathbb{A}x_k$, obtained from the first line, and substituting $y_k$ and $y_{k-1}$ in the second line, we get
$$x_{k+1} + h\mathbb{A}x_{k+1} = \frac{k}{k+1}\left(x_k - h\mathbb{A}x_k\right) + \frac{1}{k+1}x_0.$$
Then, reorganizing and dividing both sides by $h$, we have
$$\frac{x_{k+1} - x_k}{h} = -\mathbb{A}x_{k+1} - \frac{k}{k+1}\mathbb{A}x_k - \frac{1}{h(k+1)}(x_k - x_0).$$
Identifying $x_0 = X_0$, $2hk = t$, and $x_k = X(t)$, we have $\frac{k}{k+1} = 1 - \frac{h}{hk + h} = 1 + \mathcal{O}(h)$ and $\frac{1}{h(k+1)} = \frac{2}{t} + \mathcal{O}(h)$, and so
$$2\dot{X}(t) + \mathcal{O}(h) = -\mathbb{A}(X(t + 2h)) - (1 + \mathcal{O}(h))\,\mathbb{A}(X(t)) - \left(\frac{2}{t} + \mathcal{O}(h)\right)(X(t) - X_0).$$
Taking the limit $h \to 0^+$ and dividing both sides by 2, we get the anchor ODE (4). The correspondence with EAG and FEG is provided in Appendix D.4. The following theorem establishes a rigorous correspondence between APPM and the anchor ODE for general maximal monotone operators.

Theorem 2.1. Let $\mathbb{A}$ be a (possibly set-valued) maximal monotone operator and assume $\operatorname{Zer}\mathbb{A} \neq \emptyset$.
Let $x_k$ be the sequence generated by APPM (5) and let $X$ be the solution of the differential inclusion (3) with $\beta(t) = \frac{1}{t}$. For all fixed $T > 0$,
$$\lim_{h \to 0^+} \max_{0 \le 2kh \le T} \|x_k - X(2kh)\| = 0.$$

We provide the proof in Appendix D.2.

2.3 Existence of the solution for $\beta(t) = \gamma/t^p$

To get further insight into the anchor acceleration, we generalize the anchor coefficient to $\beta(t) = \frac{\gamma}{t^p}$ for $p, \gamma > 0$. We first establish the existence and uniqueness of the solution.

Theorem 2.2. Consider (3) with $\beta(t) = \frac{\gamma}{t^p}$, i.e.,
$$\dot{X}(t) \in -\mathbb{A}(X(t)) - \frac{\gamma}{t^p}(X(t) - X_0) \qquad (6)$$
for $p, \gamma > 0$. Then the solution of (6) exists and is unique.

We provide the proof in Appendix B.

2.4 Additional properties of anchor ODE

We state a regularity lemma for the differential inclusion (3), which we believe may be of independent interest. In particular, we use this result several times throughout our various proofs.

Lemma 2.3. Let $X(\cdot)$ and $Y(\cdot)$ be solutions of the differential inclusion (3) with initial values and anchors $X_0$ and $Y_0$, respectively. Then, for all $t \in [0, \infty)$, $\|X(t) - Y(t)\| \le \|X_0 - Y_0\|$.

We provide the proof in Appendix B.1. Boundedness of trajectories is an immediate corollary of Lemma 2.3. Specifically, suppose $X(\cdot)$ is the solution of the differential inclusion (3) with initial value $X_0$. Then, for all $X_\star \in \operatorname{Zer}\mathbb{A}$ and $t \in [0, \infty)$, $\|X(t) - X_\star\| \le \|X_0 - X_\star\|$. This follows from setting $Y_0 = X_\star$ in Lemma 2.3.

3 Convergence analysis

We now analyze the convergence rate of $\|\tilde{\mathbb{A}}(X(t))\|^2$ for the anchor ODE (3) with $\beta(t) = \frac{\gamma}{t^p}$ and $\gamma, p > 0$. The results are organized in Table 1.

| Case | $p = 1,\ \gamma \ge 1$ | $p = 1,\ \gamma < 1$ | $p < 1$ | $p > 1$ |
|---|---|---|---|---|
| $\Vert\tilde{\mathbb{A}}(X(t))\Vert^2$ | $\mathcal{O}(1/t^2)$ | $\mathcal{O}(1/t^{2\gamma})$ | $\mathcal{O}(1/t^{2p})$ | $\mathcal{O}(1)$ |

Table 1: Convergence rates of Theorem 3.1.

Let $\beta$ be the anchor coefficient function of (3). Define $C\colon [0, \infty) \to \mathbb{R}$ as $C(t) = e^{\int_v^t \beta(s)\,ds}$ for some $v \in [0, \infty]$. Note that $\dot{C} = C\beta$ and $C$ is unique up to a scalar multiple. We call $\mathcal{O}(\beta(t))$ the vanishing speed and $\mathcal{O}\left(\frac{1}{C(t)}\right)$ the contracting speed, and we describe their trade-off in the following.

Figure 1: Flows of the solution of (6) with $\mathbb{A} = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}$, $\gamma = 1$, $X_0 = (1, 0)$, and various $p$ (panels: $p = 3$, $p = 1$, $p = 1/2$, all with $\gamma = 1$). The flow is from $t = 0$ to $t = 100$, with a marker plotted every 0.8 units of time until $t = 9.6$. Note that the last marker of the flow for $p = 1/2$ is farther from the optimal point than that of the flow for $p = 1$.

Loosely speaking, the contracting speed describes how fast the anchor term alone contracts the dynamical system. Consider $\dot{X}(t) = -\beta(t)(X(t) - a)$ for $a \in \mathbb{R}^n$, a system with only the anchor. Then $X(t) = \frac{C(0)}{C(t)}(X(0) - a) + a$ is the solution, so the flow contracts towards the anchor $a$ with rate $\frac{1}{C(t)}$. Intuitively speaking, this contracting behavior leads to stability and convergence. On the other hand, the anchor must eventually vanish, since our goal is to converge to an element of $\operatorname{Zer}\mathbb{A}$, not to the anchor. Thus the vanishing speed must be fast enough not to slow down the convergence of the flow to $\operatorname{Zer}\mathbb{A}$.

This observation is captured in Figure 1. Consider the monotone linear operator $\mathbb{A} = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}$ on $\mathbb{R}^2$ and $\beta(t) = \frac{\gamma}{t^p}$ with $\gamma = 1$ and $p > 0$. Note that if there is no anchor, the ODE reduces to $\dot{X} = -\mathbb{A}(X)$, which does not converge [31, Chapter 8.2]. Figure 1 shows that with $p > 1$, the anchor vanishes too early, before the flow has contracted enough to produce a converging flow. With $p < 1$, the flow does converge, but the anchor vanishes too late, slowing down the convergence. With $p = 1$, the convergence is fastest. The following theorem formalizes this insight and produces the results of Table 1.

Theorem 3.1. Suppose $\mathbb{A}$ is a maximal monotone operator with $\operatorname{Zer}\mathbb{A} \neq \emptyset$. Consider (3) with $\beta(t) = \frac{\gamma}{t^p}$. Let $\tilde{\mathbb{A}}(X(t))$ be the selection of $\mathbb{A}(X(t))$ as in Section 2.1.
Then, 픸(X(t)) 2 = O 1 C(t)2 + O β(t)2 + O β(t) . ( tγ p = 1 e γ 1 p t1 p p = 1. We expect the convergence rate of Theorem 3.1 to be optimized when the terms are balanced. When β(t) = 1 t , 1 C(t)2 = 1 (e R t 1 1 s ds)2 = 1 t2 = β(t)2 = β(t) and all three terms are balanced. Indeed, the choice β(t) = 1 t corresponds to the optimal discrete-time choice 1 k+2 of APPM or other accelerated methods. 3.1 Proof outline of Theorem 3.1 The proof of Theorem 3.1 follows from Lemma 3.4, which we will introduce later in this section. To derive Lemma 3.4, we introduce a conservation law. Proposition 3.2. Suppose 픸is Lipschitz continuous and monotone. For t0 > 0, define E : (0, ) R as 픸(X(t)) 2 + 2β(t) 픸(X(t)) , X(t) X0 + β(t)2 + β(t) X(t) X0 2 X(s) X0 2 ds + Z t ds 픸(X(s)), X(s) ds. Then E is a constant function. The proof of Proposition 3.2 uses dilated coordinate W(t) = C(t)(X(t) X0) to derive its conservation law in the style of Suh et al. [67]. We provide the details in Appendix E.2. Recall from (1) that D d ds 픸(X(s)), X(s) E 0, the integrand of the last term of E is nonnegative. This motivates us to define V (t) = E Z t ds 픸(X(s)), X(s) ds as our Lyapunov function. Corollary 3.3. Let 픸be maximal monotone and β(t) = γ tp with p > 0, γ > 0. Let 픸(X(t)) be the selection of 픸(X(t)) as in Section 2.1. For t0 0, define V : [0, ) R as V (t) = C(t)2 픸(X(t)) 2 + 2β(t) 픸(X(t)), X(t) X0 + β(t)2 + β(t) X(t) X0 2 X(s) X0 2 ds. for t > 0 and V (0) = limt 0+ V (t). Then V (t) V (0) holds for t 0. A technical detail is that all terms involving d ds 픸(X(s)) have been excluded in the definition of V and this is what allows 픸to not be Lipschitz continuous. We provide the details in Appendix E.3. Lemma 3.4. Consider the setup of Corollary 3.3. Assume Zer픸 = . Then for t > 0 and X Zer픸, 픸(X(t)) 2 4β(t)2 X0 X 2 + 4V (0) C(t)2 2 β(t)2 + β(t) X(t) X0 2 C(s)2 β(s) X(s) X0 2 ds. (7) Proof outline of Lemma 3.4. Define Φ(t) = 픸(X(t)) 2 + 2β(t) 픸(X(t)), X(t) X0 . Then, from monotonicity of 픸and Young s inequality, Φ(t) 픸(X(t)) 2 + 2β(t) 픸(X(t)), X X0 픸(X(t)) 2 2 1 2 픸(X(t)) 2 + β(t) (X X0) 2 픸(X(t)) 2 2β(t)2 X0 X 2 . (8) By Corollary 3.3, 2V (0) C(t)2 2V (t) C(t)2 + Φ(t) Φ(t) for t > 0. Applying (8) and organizing, we can get the desired result. The details are provided in Appendix E.4. Proof outline of Theorem 3.1. It remains to show that last integral term of Lemma 3.4 is O 1 C(t)2 + O β(t)2 + O β(t) . The details are provided in Appendix E.5. Before we end this section, we observe how our analysis simplifies in the special case β(t) = 1 t . In this case, 픸(X(t)) 2 + t 픸(X(t)), X(t) X0 , and this corresponds to the Lyapunov function of [59, Section 4] for the case γ = 1. As V (0) = 0, the conclusion of Lemma 3.4 becomes 픸(X(t)) 2 4 t2 X0 X 2 = O 1 which to the best rate in Table 1. 3.2 Point convergence APPM is an instance of the Halpern method [54, Lemma 3.1], which iterates converge to the element in Zer픸closest to X0 [34, 73]. The anchor ODE also exhibits this behavior. Theorem 3.5. Let 픸be a maximal monotone operator with Zer픸 = and X be the solution of (3). If limt 픸(X(t)) = 0 and limt 1/C(t) = 0, then, as t , X(t) argmin z Zer픸 z X0 . We provide the proof in Appendix E.6. 4 Tightness of analysis In this section, we show that the convergence rates of Table 1 are actually tight by considering the dynamics under the explicit example 픸= 0 1 1 0 . Throughout this section, we denote 픸as A when when the operator is linear. 4.1 Explicit solution for linear A Lemma 4.1. Let A: Rn Rn be a linear operator and let β(t) = γ t . 
The series Γ(n + γ + 1)Γ(γ + 1)X0, where Γ denotes the gamma function, is the solution for (3) with 픸= A. Note that when γ = 0, this is the series definition of the matrix exponential and X(t) = e t A. The solution also has an integral form, which extends to general β(t). Lemma 4.2. Suppose A: Rn Rn is a monotone linear operator. Then X(t) = e t A 0 es AC(s)β(s)ds + C(0)I X0 (9) is the solution for (3) with 픸= A. See Appendix F.1.1 and Appendix F.1.2 for details. 4.2 The rates in Table 1 are tight First, we consider p > 1 for β(t) = γ Theorem 4.3. Suppose limt 1 C(t) = 0, i.e., suppose β(t) L1[t0, ) for some t0 > 0. Then there exists an operator 픸such that lim t 픸(X(t)) = 0, where X is the solution of (3). Note that γ tp L1[t0, ) when p > 1. The proof of Theorem 4.3 considers 픸= 2πξ 0 1 1 0 for ξ R and uses the Fourier inversion formula. See Appendix F.2 for details. Next, we consider β(t) = γ tp for cases other than p > 1. Theorem 4.4. Let A = 0 1 1 0 , β(t) = γ tp , 0 < p 1, and γ > 0. Let X be the solution given by (9) and X0 = 0. Let t2 for p = 1, γ 1 t2γ for p = 1, γ < 1 t2p for 0 < p < 1. , Then, lim t r(t) A(X(t)) 2 = 0. We provide the proof in Appendix F.3. 5 Discretized algorithms In this section, we provide discrete-time convergence results that match the continuous-time rate of Section 3. Theorem 5.1. Suppose 픸be a maximal monotone operator, p > 0, and γ > 0. Consider xk = 핁픸yk 1 kp + γ (2xk yk 1) + γ kp + γ x0 for k = 1, 2, . . . , with initial condition y0 = x0 Rn. Let 픸xk = yk 1 xk for k = 1, 2, . . . . Then this method exhibits the rates of convergence in Table 2. Case p = 1, γ 1 p = 1, γ < 1 p < 1 p > 1 픸(xk) 2 O 1 Table 2: Rates for the discrete-time method of Theorem 5.1. Note that the method of Theorem 5.1 reduces to APPM when γ = 1, p = 1. Proof outline of Theorem 5.1. The general strategy is to find discretized counterparts of corresponding continuous-time analyses. However, directly discretizing the conservation law of Proposition 3.2 was difficult due to technical reasons. Instead, we obtain differently scaled but equivalent conservation laws using dilated coordinates and then performed the discretization. The specific dilated coordinates, inspired by [67], are W1(t) = X(t) X0 for p > 1, W2(t) = tp (X(t) X0) for 0 < p < 1, W3(t) = t (X(t) X0) for p = 1, γ 1 and W4(t) = tγ (X(t) X0) for p = 1, 0 < γ < 1. In the discrete-time analyses, the behavior of the leading-order terms is predictable as they match the continuous-time counterpart. The difficult part is, however, controlling the higher-order terms that were not present in the continuous-time analyses. Through our detailed analyses, we bound such higher-order terms and show that they do not affect the convergence rate in the end. We provide the details in Appendix G.3. 6 Convergence analysis under strong monotonicity In this section, we analyze the dynamics of the anchor ODE (3) for µ-strongly monotone 픸. When β(t) = 1 t and 픸= µ 0 0 µ , Lemma 4.1 tells us that 픸(X(t)) = 1 t I e t A X0 and therefore that 픸(X(t)) 2 = Θ 1 t2 , which is a slow rate for the strongly monotone setup. On the other hand, we will see that β(t) = 2µ e2µt 1 is a better choice leading to a faster rate in this setup. Our analysis of this section is also based on a conservation law, but we use a slightly modified version to exploit strong monotonicity. Proposition 6.1. Suppose 픸is monotone and Lipschitz continuous. Let X be the solution of (3) and let R : [0, ) (0, ) be a differentiable function. 
For t0 > 0, define E : (0, ) R as E = C(t)2R(t)2 픸(X(t)) 2+ 2β(t) 픸(X(t)), X(t) X0 + β(t)2 + β(t) X(t) X0 2 C(s)2R(s)2 β(s) X(t) X0 2 ds + Z t t0 C(s)2R(s)2 d ds 픸(X(s)), X(s) R(s) R(s) Then E is a constant function for t [0, ). Proposition 6.1 generalizes Proposition 3.2, since it corresponds to the special case with R(t) 1. Recall from (2), when 픸is µ-strongly monotone we have d ds픸(X(t)), X(t) µ X(t) 2 0. This motivates the choice R(t) = eµt, since R(s) R(s) = µ. From calculation provided in Appendix H.2, the choice β(t) = 2µ e2µt 1 makes d ds C(s)2R(s)2 β(s) 2 = 0. Plugging these choices into Proposition 6.1 and following arguments of Section 3, we arrive at the following theorem. Theorem 6.2. Let 픸be a µ-strongly maximal monotone operator with µ > 0 and assume Zer픸 = . Let X be a solution of the differential inclusion (3) with β(t) = 2µ e2µt 1, i.e. for almost all t, X 픸(X) 2µ e2µt 1(X X0). (10) Let 픸(X(t)) be the selection of 픸(X(t)) as in Section 2.1. Define V : [0, ) R as V (t) = (eµt e µt)2 픸(X(t)) 2 + 2µ 1 e 2µt 픸(X(t)), X(t) X0 µ X(t) X0 2 . Then V (t) V (0) holds for t 0. Furthermore for X Zer픸, 픸(X(t)) 2 2µ eµt 1 2 X0 X 2 = O 1 In Appendix H.2.3, we show that (10) is a continuous-time model for OS-PPM of Park and Ryu [54]. In Appendix C, we show the existence and uniqueness of the solution. Since β(t) = 2µ e2µt 1 L1[t0, ) for any t0 > 0, Theorem 4.3 implies that 픸(X(t)) 0 when 픸is merely monotone. This tells us that the optimal choice of β(t) for should depend on the properties of 픸. In the following section, we describe how β(t) can be chosen to adapt to the operator s properties. 7 Adaptive anchor acceleration and experiments In this section, we present an adaptive method for choosing the anchor coefficient β, and we theoretically and experimentally show that this choice allows the dynamics to adapt to the operator s properties. It achieves the optimal O(1/k2)-convergence rate when 픸is monotone and an exponential convergence rate when 픸is furthermore µ-strongly monotone and Lipschitz continuous. Theorem 7.1. Suppose 픸is Lipschitz continuous and monotone. Consider the anchor ODE 2 픸(X), X X0 | {z } = β(t) (X X0) (11) with initial condition X(0) = X0 and 픸(X0) = 0. Suppose the solution exists and X is continuous at t = 0. Moreover, suppose β : (0, ) R is well-defined, i.e., no division by zero occurs in the definition of β(t). Then for t > 0 and X Zer 픸, we have β(t) > 0 and 픸(X(t)) 2 4β(t)2 X0 X 2 If 픸is furthermore µ-strongly monotone, then for t > 0, β(t)2 µ/2 eµt/2 1 We provide the proof in Appendix I.1. Note that anchor coefficient (11) is chosen so that Φ(t) = 픸(X(t)) 2 + 2β(t) 픸(X(t)), X(t) X0 = 0. So left-hand side of (8) is zero and a O β(t)2 convergence rate is immediate. An analogous discrete-time result is shown in the following theorem. Theorem 7.2. Let 픸be a maximal monotone operator. Let x0 = y0 Rn. Consider xk = 핁픸yk 1 yk = (1 βk)(2xk yk 1) + βkx0 픸xk, xk x0 + 픸xk 2 if 픸xk 2 = 0 0 if 픸xk 2 = 0, for k = 1, 2, . . . , where 픸xk = yk 1 xk. Then βk 0 and 픸xk+1 2 β2 k x0 x 2 β2 k 1 (k + 1)2 for k = 1, 2, . . . and x Zer픸. If 픸is furthermore µ-strongly monotone and L-Lipschitz continuous, then for k = 1, 2, . . . , µ/(1 + L2) 1 + µ/(1 + L2) k 1 + µ/(1 + L2) 0 2000 4000 6000 8000 10000 Iteration count k PG-EXTRA with APPM with Halpern, γ k p + γ with Adaptive Figure 2: (Left) Network graph. (Right) Squared M-norm 픸xk 2 M vs. k. Halpern corresponds to the method in Theorem 5.1, we use p = 1.5 and γ = 2.0. 
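To make the anchored iterations concrete, here is a small self-contained sketch (illustrative only, not the paper's experiment code, and run on the 2D skew operator of Section 4 rather than on PG-EXTRA): the generalized anchored proximal-point method of Theorem 5.1, which reduces to APPM for p = γ = 1, together with an adaptive variant in the spirit of Theorem 7.2. The exact adaptive coefficient `beta_k` below is an assumption reconstructed from the discussion following Theorem 7.2 (it keeps β_k in [0, 1) and the discrete counterpart of Φ nonpositive); the theorem statement is the authoritative formula.

```python
import numpy as np

# Skew operator A(x) = [[0, 1], [-1, 0]] x from Section 4; its resolvent
# J_A = (I + A)^{-1} is a single 2x2 linear solve.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
J = np.linalg.inv(np.eye(2) + A)            # resolvent of A

def anchored_ppm(x0, n_iter, p, gamma):
    """Anchored (Halpern-type) PPM of Theorem 5.1; APPM when p = gamma = 1."""
    y, res = x0.copy(), []
    for k in range(1, n_iter + 1):
        x = J @ y                           # x_k = J_A(y_{k-1})
        a_tilde = y - x                     # tilde{A} x_k = y_{k-1} - x_k
        beta = gamma / (k**p + gamma)
        y = (1 - beta) * (2 * x - y) + beta * x0
        res.append(a_tilde @ a_tilde)       # squared residual ||tilde{A} x_k||^2
    return np.array(res)

def adaptive_anchored_ppm(x0, n_iter):
    """Adaptive anchor in the spirit of Theorem 7.2 (beta_k formula is assumed)."""
    y, res = x0.copy(), []
    for _ in range(n_iter):
        x = J @ y
        a_tilde = y - x
        nrm2 = a_tilde @ a_tilde
        if nrm2 == 0.0:
            beta = 0.0
        else:
            # Assumed form: the denominator carries the extra ||tilde{A} x_k||^2
            # term discussed after Theorem 7.2, which ensures beta < 1.
            beta = nrm2 / (nrm2 - 2.0 * (a_tilde @ (x - x0)))
        y = (1 - beta) * (2 * x - y) + beta * x0
        res.append(nrm2)
    return np.array(res)

x0 = np.array([1.0, 0.0])
for label, r in [("APPM (p=1, gamma=1)", anchored_ppm(x0, 200, 1.0, 1.0)),
                 ("Halpern (p=1.5, gamma=2)", anchored_ppm(x0, 200, 1.5, 2.0)),
                 ("adaptive", adaptive_anchored_ppm(x0, 200))]:
    print(f"{label:26s} final squared residual = {r[-1]:.3e}")
```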
For the monotone setup, the rate 픸xk+1 2 1 (k+1)2 x0 x 2 matches the exact optimal rate of APPM [54]. In the limit µ 0, the result for the µ-strongly monotone case reduces to the result for the monotone case. We provide the proof and details of Theorem 7.2 in Appendix I.3. The method of Theorem 7.2 is a discrete-time counterpart of the ODE of (11). The extra term 픸xk 2 of the denominator in the definition of βk vanishes in the continuous-time limit. We provide further details in Appendix I.2. Analogous to the continuous-time case, a key property of the discrete-time adaptive method is that the counterpart of Φ(t) is kept nonpositive. In the proof of Lemma I.3, the fact βk < 1 plays the key role in proving this property. The extra term 픸xk 2 in the denominator and the fact that 픸xk, xk x0 < 0 when 픸xk 2 = 0 leads to βk < 1. 7.1 Experiment details We now show an experiment with the method of Theorem 7.2 applied to a decentralized compressed sensing problem Shi et al. [63]. We assume that we have the measurement bi = A(i)x + ei, where A(i) is a measurement matrix available for each local agent i, x is an unknown shared signal we hope to recover, and ei is an error in measurement. We solve this problem in a decentralized manner in which the local agents keep their measurements private and only communicate with their neighbors. As in Shi et al. [63], we formulate the problem into an unconstrained ℓ1-regularized least squares problem minimize x Rd 1 n Pn i=1 1 2 A(i)x bi 2 + ρ x 1 , and apply PG-EXTRA. We compare vanilla PG-EXTRA with the various anchored versions of PG-EXTRA with βk as in Theorem 5.1 and Theorem 7.2. We show the results in Figure 2. Further details of the experiment are provided in Appendix J. 8 Conclusion This work introduces a continuous-time model of anchor acceleration, the anchor ODE X 픸(X) β(t)(X X0). We characterize the convergence rate as a function of β(t) and thereby obtain insight into the anchor acceleration mechanism. Finally, inspired by the continuous-time analyses, we present an adaptive method and establish its effectiveness through theoretical analyses and experiments. Prior work analyzing continuous-time models of Nesterov acceleration had inspired various follow-up research, such as analyses based on Lagrangian and Hamiltonian mechanics [71, 72, 26], highresolution ODE model [62], and continuized framework [29]. Carrying out similar analyses for the anchor ODE are interesting directions of future work. Acknowledgments and Disclosure of Funding This work was supported by the Samsung Science and Technology Foundation (Project Number SSTF-BA2101-02). We thank Jaeyeon Kim for providing valuable feedback. We thank Hangjun Cho for the helpful discussions on well-posedness of ODEs. We also thank anonymous reviewers for the constructive comments. [1] J. K. Alcala, Y. T. Chow, and M. Sunkula. Moving anchor extragradient methods for smooth structured minimax problems. ar Xiv:2308.12359, 2023. [2] V. Apidopoulos, J.-F. Aujol, and C. Dossal. The differential inclusion modeling FISTA algorithm and optimality of convergence rate in the case b 3. SIAM Journal on Optimization, 28(1): 551 574, 2018. [3] H. Attouch and A. Cabot. Asymptotic stabilization of inertial gradient dynamics with timedependent viscosity. Journal of Differential Equations, 263(9):5412 5458, 2017. [4] H. Attouch and S. C. László. Newton-like inertial dynamics and proximal algorithms governed by maximally monotone operators. SIAM Journal on Optimization, 30(4):3252 3283, 2020. [5] H. Attouch and S. C. 
László. Convex optimization via inertial algorithms with vanishing Tikhonov regularization: Fast convergence to the minimum norm solution. ar Xiv:2104.11987, 2021. [6] H. Attouch and J. Peypouquet. Convergence of inertial dynamics and proximal algorithms governed by maximally monotone operators. Mathematical Programming, 174(1):391 432, 2019. [7] H. Attouch, Z. Chbani, J. Peypouquet, and P. Redont. Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Mathematical Programming, 168(1):123 175, 2018. [8] H. Attouch, Z. Chbani, and H. Riahi. Combining fast inertial dynamics for convex optimization with Tikhonov regularization. Journal of Mathematical Analysis and Applications, 457(2): 1065 1094, 2018. [9] H. Attouch, Z. Chbani, and H. Riahi. Rate of convergence of the Nesterov accelerated gradient method in the subcritical case α 3. ESAIM: Control, Optimisation and Calculus of Variations, 25:2, 2019. [10] H. Attouch, Z. Chbani, J. Fadili, and H. Riahi. Convergence of iterates for first-order optimization algorithms with inertia and Hessian driven damping. Optimization, pages 1 40, 2021. [11] J.-P. Aubin and A. Cellina. Differential Inclusions. Springer, 1984. [12] J.-F. Aujol, C. Dossal, and A. Rondepierre. Optimal convergence rates for Nesterov acceleration. SIAM Journal on Optimization, 29(4):3131 3153, 2019. [13] J.-B. Baillon and R. E. Bruck. Optimal rates of asymptotic regularity for averaged nonexpansive mappings. Fixed Point Theory and Applications, 128:27 66, 1992. [14] S. Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133 181, 1922. [15] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer International Publishing, second edition, 2017. [16] R. I. Bo t and E. R. Csetnek. A second-order dynamical system with Hessian-driven damping and penalty term associated to variational inequalities. Optimization, 68(7):1265 1277, 2019. [17] R. I. Bo t and D. A. Hulett. Second order splitting dynamics with vanishing damping for additively structured monotone inclusions. Journal of Dynamics and Differential Equations, 2022. [18] R. I. Bo t, E. R. Csetnek, and S. C. László. A second-order dynamical approach with variable damping to nonconvex smooth minimization. Applicable Analysis, 99(3):361 378, 2020. [19] R. I. Bo t, E. R. Csetnek, and S. C. László. Tikhonov regularization of a second order dynamical system with Hessian driven damping. Mathematical Programming, 189(1):151 186, 2021. [20] R. I. Bot, E. R. Csetnek, and D.-K. Nguyen. Fast OGDA in continuous and discrete time. ar Xiv:2203.10947, 2022. [21] M. Bravo and R. Cominetti. Sharp convergence rates for averaged nonexpansive maps. Israel Journal of Mathematics, 227:163 188, 2018. [22] A.-L. Cauchy. Méthode générale pour la résolution des systémes d équations simultanées. Comptes Rendus Hebdomadaires des Séances de l Académie des Sciences, 25:536 538, 1847. [23] R. Cominetti, J. A. Soto, and J. Vaisman. On the rate of convergence of Krasnosel ski ı-Mann iterations and their connection with sums of Bernoullis. Israel Journal of Mathematics, 199(2): 757 772, 2014. [24] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. International Conference on Learning Representations, 2018. [25] J. Diakonikolas. Halpern iteration for near-optimal and parameter-free monotone inclusion and strong solutions to variational inequalities. 
Conference on Learning Theory, 2020. [26] J. Diakonikolas and M. I. Jordan. Generalized momentum-based methods: A Hamiltonian perspective. SIAM Journal on Optimization, 31(1):915 944, 2021. [27] Y. Drori. The exact information-based complexity of smooth convex minimization. Journal of Complexity, 39:1 16, 2017. [28] J. Eckstein and D. P. Bertsekas. On the Douglas Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1):293 318, 1992. [29] M. Even, R. Berthier, F. Bach, N. Flammarion, H. Hendrikx, P. Gaillard, L. Massoulié, and A. Taylor. Continuized accelerations of deterministic and stochastic gradient descents, and of gossip algorithms. Neural Information Processing Systems, 2021. [30] George J. Minty. Monotone (nonlinear) operators in Hilbert space. Duke Mathematical Journal, 29(3):341 346, 1962. [31] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. ar Xiv:1701.00160, 2016. [32] E. Gorbunov, N. Loizou, and G. Gidel. Extragradient method: O(1/K) last-iterate convergence for monotone variational inequalities and connections with cocoercivity. International Conference on Artificial Intelligence and Statistics, 2022. [33] G. Gu and J. Yang. Tight sublinear convergence rate of the proximal point algorithm for maximal monotone inclusion problems. SIAM Journal on Optimization, 30(3):1905 1921, 2020. [34] B. Halpern. Fixed points of nonexpanding maps. Bulletin of the American Mathematical Society, 73(6):957 961, 1967. [35] D. Kim. Accelerated proximal point method for maximally monotone operators. Mathematical Programming, 190(1 2):57 87, 2021. [36] D. Kim and J. A. Fessler. Optimized first-order methods for smooth convex minimization. Mathematical Programming, 159(1-2):81 107, 2016. [37] D. Kim and J. A. Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. Journal of Optimization Theory and Applications, 188(1): 192 219, 2021. [38] J. Kim and I. Yang. Unifying Nesterov s accelerated gradient methods for convex and strongly convex objective functions. International Conference on Machine Learning, 202, 2023. [39] U. Kohlenbach. On quantitative versions of theorems due to F. E. Browder and R. Wittmann. Advances in Mathematics, 226(3):2764 2795, 2011. [40] M. A. Krasnosel skii. Two remarks on the method of successive approximations. Uspekhi Matematicheskikh Nauk, 10(1):123 127, 1955. [41] J. Lee and E. K. Ryu. Accelerating value iteration with anchoring. Neural Information Processing Systems, 2023. [42] S. Lee and D. Kim. Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems. Neural Information Processing Systems, 2021. [43] L. Leustean. Rates of asymptotic regularity for Halpern iterations of nonexpansive mappings. Journal of Universal Computer Science, 13(11):1680 1691, 2007. [44] J. Liang, J. Fadili, and G. Peyré. Convergence rates with inexact non-expansive operators. Mathematical Programming, 159(1):403 434, 2016. [45] F. Lieder. On the convergence rate of the Halpern-iteration. Optimization Letters, 15(2): 405 418, 2021. [46] T. Lin and M. I. Jordan. Monotone Inclusions, Acceleration, and Closed-Loop Control. Mathematics of Operations Research, 2023. [47] W. R. Mann. Mean value methods in iteration. Proceedings of the American Mathematical Society, 4(3):506 510, 1953. [48] B. Martinet. Régularisation d inéquations variationnelles par approximations successives. 
Revue Française d Informatique et de Recherche Opérationnelle, Série Rouge, 4(3):154 158, 1970. [49] B. Martinet. Algorithmes Pour La Résolution de Problèmes d optimisation et de Minimax. Ph D thesis, Université Scientifique et Médicale de Grenoble, 1972. [50] S.-Y. Matsushita. On the convergence rate of the Krasnosel ski ı Mann iteration. Bulletin of the Australian Mathematical Society, 96(1):162 170, 2017. [51] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k2). Doklady Akademii Nauk SSSR, 269(3):543 547, 1983. [52] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004. [53] C. Park, J. Park, and E. K. Ryu. Factor- 2 acceleration of accelerated gradient methods. Applied Mathematics and Optimization, 2022. [54] J. Park and E. K. Ryu. Exact optimal accelerated complexity for fixed-point iterations. International Conference on Machine Learning, 2022. [55] J. Park and E. K. Ryu. Accelerated infeasibility detection of constrained optimization and fixed-point iterations. International Conference on Machine Learning, 2023. [56] L. D. Popov. A modification of the Arrow Hurwicz method for search of saddle points. Mathematical Notes of the Academy of Sciences of the USSR, 28(5):845 848, 1980. [57] A. Rakhlin and K. Sridharan. Online learning with predictable sequences. Conference on Learning Theory, 2013. [58] E. K. Ryu and W. Yin. Large-Scale Convex Optimization via Monotone Operators. Cambridge University Press, 2022. [59] E. K. Ryu, K. Yuan, and W. Yin. ODE analysis of stochastic gradient methods with optimism and anchoring for minimax problems and GANs. ar Xiv:1905.10899, 2019. [60] S. Sabach and S. Shtern. A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization, 27(2):640 660, 2017. [61] A. Salim, L. Condat, D. Kovalev, and P. Richtárik. An optimal algorithm for strongly convex minimization under affine constraints. AISTATS, 2022. [62] B. Shi, S. S. Du, M. I. Jordan, and W. J. Su. Understanding the acceleration phenomenon via high-resolution differential equations. Mathematical Programming, 2021. [63] W. Shi, Q. Ling, G. Wu, and W. Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013 6023, 2015. [64] M. V. Solodov and B. F. Svaiter. A hybrid approximate extragradient proximal point algorithm using the enlargement of a maximal monotone operator. Set-Valued Analysis, 7(4):323 345, 1999. [65] W. Su, S. Boyd, and E. J. Candès. A differential equation for modeling Nesterov s accelerated gradient method: Theory and insights. Neural Information Processing Systems, 2014. [66] W. Su, S. Boyd, and E. J. Candès. A differential equation for modeling Nesterov s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1 43, 2016. [67] J. J. Suh, G. Roh, and E. K. Ryu. Continuous-time analysis of AGM via conservation laws in dilated coordinate systems. International Conference on Machine Learning, 2022. [68] A. Taylor and Y. Drori. An optimal gradient method for smooth strongly convex minimization. Mathematical Programming, 2022. [69] Q. Tran-Dinh and Y. Luo. Halpern-type accelerated and splitting algorithms for monotone inclusions. ar Xiv:2110.08150, 2021. [70] B. Van Scoy, R. A. Freeman, and K. M. Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49 54, 2018. [71] A. Wibisono, A. C. 
Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351 E7358, 2016. [72] A. C. Wilson, B. Recht, and M. I. Jordan. A Lyapunov analysis of accelerated methods in optimization. Journal of Machine Learning Research, 22(113):1 34, 2021. [73] R. Wittmann. Approximation of fixed points of nonexpansive mappings. Archiv der Mathematik, 58(5):486 491, 1992. [74] T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed. Decentralized consensus optimization with asynchrony and delays. IEEE Transactions on Signal and Information Processing over Networks, 4(2):293 307, 2018. [75] T. Yoon and E. K. Ryu. Accelerated algorithms for smooth convex-concave minimax problems with O(1/k2) rate on squared gradient norm. International Conference on Machine Learning, 2021. [76] T. Yoon and E. K. Ryu. Accelerated minimax algorithms flock together. ar Xiv:2205.11093, 2022. A Prior work Acceleration for smooth convex function in discrete setting. There had been rich amount of research on acceleration about smooth convex functions. Nesterov [51] introduced accelerated gradient method (AGM), which has a faster O(1/k2) rate than O(1/k) rate of gradient descent [22] in reducing the function value. Optimized gradient method (OGM) [36] improved AGM s rate by a constant factor, and is proven to be optimal [27]. For smooth strongly convex setup, strongly convex AGM [52] achieves an accelerated rate, and further improvements were studied [70, 53, 68, 61]. Recently, OGM-G [37] was introduced as an accelerated method reducing squared gradient magnitude for smooth convex minimization. Acceleration for smooth convex function in continuous setting. Continuous-time analysis of Nesterov acceleration has been thoroughly studied as well. Su et al. [65, 66] introduced an ODE model of AGM X(t) + r t X + f(X) = 0, providing f(X(t)) f O 1/t2 rate for r 3. Attouch et al. [7] improved the constant of bound for r > 3 and proved convergence of the trajectories. Attouch et al. [9] achieved O 1/t 2r/3 rate for 0 < r < 3. Apidopoulos et al. [2] generalized their results to differential inclusion with non-differentiable convex function. Furthermore, wide range of variations of AGM ODE has been studied [3, 8, 12, 16, 18, 5, 10, 19]. Also, applications to monotone inclusion problem were studied by Attouch and Peypouquet [6], Attouch and László [4], Bo t and Hulett [17]. Motivated from above continuous-time analysis for accelerated methods, tools analyzing ODEs have further developed. Wibisono et al. [71], Wilson et al. [72] and Kim and Yang [38] adopted Lagrangian mechanics and introduced first, second, unified Bregman Lagrangian to provide unified analysis for generalized family of ODE, where the latter provided analysis for strongly convex AGM. Systemical approach to obtain Lyapunov functions exploiting Hamiltonian mechanics [26] and dilated coordinate system [67] were proposed, and analysis of OGM-G was provided by dilated coordinate framework. Different forms of continuous-time models such as high-resolution ODE [62] and continuized framework [29] were developed. On the other hand, another type of acceleration called anchor acceleration recently gained attention. As Yoon and Ryu [76] focused, many recently discovered accelerated methods for both minimax optimization and fixed-point problems are based on anchor acceleration. 
More recently, practical applications of anchor acceleration to detecting infeasibility for constrained optimization problems [55], and accelerating value iteration for dynamic programming and reinforcement learning [41] were introduced as well. Fixed-point problem. The history of studies on fixed-point problem dates back to the work of Banach [14], which established that the Picard iteration with contractive operator is convergent. Kransnosel skii-Mann iteration (KM) [40, 47] was introduced, which is a generalization of Picard iteration. Convergence of KM iteration with general nonexpansive operators was proven by Martinet [49]. For iteration of Halpern [34], convergence with wide choice of parameter were shown by Wittmann [73]. The squared norm yk 핋yk 2 of fixed-point residual is a common error measure for fixed-point problems. KM iteration was shown to exhibit O(1/k) rate [23, 44, 21] and o(1/k)-rate [13, 50]. For Halpern iteration, O(1/(log k)2)-rate was established by Leustean [43], then improve to O(1/k) rate by Kohlenbach [39]. First accelerated O(1/k2) rate was achieved by Sabach and Shtern [60] and the constant was improved by Lieder [45] by a factor of 16. It is known that there is an equivalence between solving fixed-point problem and solving monotone inclusion problem [30, 28, 54]. Proximal point method (PPM) [48] achieves O (1/k)-rate in terms of 픸xk 2 [33]. Accelerated proximal point method (APPM) [35] improved the rate to accelerated O 1/k2 -rate. Park and Ryu [54] showed APPM is exactly optimal method for this problem and provided exactly optimal method for µ-strongly monotone operator named OS-PPM, which achieved O 1/e4µk rate. The optimal methods APPM and OS-PPM are based on anchor acceleration [76]. Minimax problems. Minimax optimization problem of the form minx maxy L(x, y) have recently gained attention in machine learning society. One of the commonly considered theoretical setting is smooth convex-concave setup, with squared gradient norm as error measure. In terms of L(x, y) 2, classical EG [64] and OG [56, 57, 24] was shown to achieved O (1/k)-rate [32]. SGDA [59] achieved O 1/k2 2p rate for p > 1/2 with introducing the term anchor. With introducing a parameter-free Halpern type method, Diakonikolas [25] achieved O(log k/k2). Recently, EAG [75] first achieved accelerated rate O(1/k2) with anchor acceleration, followed by FEG [42], anchored Popov s scheme [69] and moving anchor methods [1]. Fast ODGA [20] also achieved accelerated o(1/k2) rate. For L is furthermore strongly monotone with condition number κ, SM-EAG+ [76] achieved accelerated O 1/e2kκ rate. However, continuous-time analysis for anchor acceleration is, to the best of our knowledge, insufficient. Continuous-time analyses of acceleration for monotone inclusion problem were studied by Bot et al. [20], Lin and Jordan [46], but they did not consider anchor acceleration. Ryu et al. [59] considered continuous-time analysis of anchor acceleration, but only with limited cases X(t) = 픸(X(t)) γ t (X X0) for γ 1. In this paper, we provide a unified continuous-time analysis for anchor acceleration with generalized anchor coefficient. B Proof of Theorem 2.2 B.1 Proof of uniqueness Proof of uniqueness is immediate from Lemma 2.3, we first prove the lemma. Proof of Lemma 2.3. It is trivial for t = 0, so we may assume t > 0. By monotonicity of 픸and Young s inequality, we get the following inequality. 
d dt X(t) Y (t) 2 = 2 D X(t) Y (t), X(t) Y (t) E = 2 픸(X(t)) 픸(Y (t)) + β(t)(X(t) Y (t)) β(t)(X0 Y0), X(t) Y (t) 2β(t) X(t) Y (t) 2 + 2β(t) X0 Y0, X(t) Y (t) 2β(t) X(t) Y (t) 2 + β(t) X0 Y0 2 + X(t) Y (t) 2 = β(t) X(t) Y (t) 2 + β(t) X0 Y0 2 . Now define C(t) = e R t v β(s)ds for some v > 0, then we see C(t) = C(t)β(t). Moving β(t) X(t) Y (t) 2 to the left hand side and multiplying both sides by C(t), we have d dt C(t) X(t) Y (t) 2 = C(t) d dt X(t) Y (t) 2 + C(t)β(t) X(t) Y (t) 2 C(t)β(t) X0 Y0 2 = d dt C(t) X0 Y0 2 Integrating both sides from ϵ > 0 to t we have C(t) X(t) Y (t) 2 C(ϵ) X(ϵ) Y (ϵ) 2 C(t) X0 Y0 2 C(ϵ) X0 Y0 2 . As C is nonnegative and nondecreasing, limϵ 0+ C(ϵ) exists. Taking limit ϵ 0+ both sides we have C(t) X(t) Y (t) 2 C(t) X0 Y0 2 . Finally, dividing both sides by C(t) we conclude X(t) Y (t) 2 X0 Y0 2 . Proof for uniqueness can be done to generalized case (3). Theorem B.1 (Uniqueness of solutions). If the solution for (3) exists, it is unique. Proof. Suppose X1, X2 are solutions of (3) with same initial value X0. By Lemma 2.3, we have X1(t) X2(t) X0 X0 = 0 Therefore X1(t) = X2(t) for all t [0, ), we get the desired result. As (6) is special case of (3), uniqueness proof for Theorem 2.2 follows from Theorem B.1. B.2 Proof of existence The proof of existence needs tedious work due to the singularity of γ tp at t = 0, we provide our proof through subsections. Before we start, we provide a short outline of the proof. The proof is basically based on the proof provided in [65] and [11]. We first prove the case 픸is Lipschitz. The differential inclusion becomes ODE when 픸is Lipschitz, we can adopt similar argument done in [65]. We consider series of ODEs Lipschitz with respect to X, approximating X(t) = 픸(X(t)) γ tp (X0 X(t)). The approximated ODEs have solutions by classical theory of ODE, we obtain the true solution by considering proper subsequence of solutions. As the solution for Lipschitz 픸is obtained, we can adopt similar argument done in [11]. We first consider solution Xλ with Yosida approximation 픸λ = 1 λ(핀 (핀+ λ픸) 1), which is an approximation of 픸that is Lipschitz continuous. Then we can obtain a subsequence Xλn converging to original differential inclusion. B.2.1 Existence proof for Lipschitz 픸 Since we will approximate 픸with Lipschitz continuous functions, we first consider the ODE with Lipschitz continuous 픸. Theorem B.2. (Existence of solution for Lipschitz 픸) Suppose A : Rn Rn is L-Lipschitz continuous function. Consider the differential equation, with initial value condition X(0) = X0 dom( A), X(t) = A( X(t)) γ tp ( X(t) X0). (12) where γ, p > 0. Then there exists a unique solution X C1([0, ), Rn) that satisfies (12) for all t (0, ). Moreover, for X(0) defined by X(0) = limt 0+ X(t) X0 t , following is true A(X0) if 0 < p < 1 1 γ+1 A(X0) if p = 1 0 if p > 1. The proof need some preparation. We will think of approximated solutions Xδ, obtain a sequence that converges to solution X. Thus we first define Xδ. From now, we will denote A as a L-Lipschitz monotone function. Definition B.3. Let 0 < δ < 1. Consider Xδ(t) = A( Xδ(t)) γ δp ( Xδ(t) X0) 0 t < δ A( Xδ(t)) γ tp ( Xδ(t) X0) t > δ (13) Since right hand side above is L + γ δp -Lipschitz with respect to Xδ, the solution uniquely exists by classical ODE theory. Define the solution as Xδ. Then for positive sequence {δm}m N such that δm < 1 and limm δm = 0, consider sequence n Xδm o Before we start, we prove a useful lemma we will widely use for the cases that operator is Lipschitz continuous. Lemma B.4. 
Let A: Rn Rn a Lipschitz continuous function. Suppose β : D [0, ) be a continuous function with D [0, ). Consider differential equation X = A( X) β(t)( X X0), with X0 Rn. Let X : D Rn be a differentiable curve that satisfies above equation for t D. Then for all 0 a < b such that [a, b] D, X and the composition A X : [a, b] Rn are is Lipschitz continuous. Moreover if β(t) is well-defined and bounded for almost all t [a, b], then X is Lipschitz continuous in [a, b]. Thus if β is twice differentiable, X is Lipschitz continuous. As Lipschitz continuous functions, X, A X and X are absolutely continuous functions. Proof. We first prove X is Lipschitz continuous. As A, X, β are continuous in [a, b] we see A( X(t)) + β(t)( X(t) X0) is continuous in [a, b], thus M1 = max t [a,b] X(t) = max t [a,b] A( X(t)) + β(t)( X(t) X0) exists. Since its derivative is bounded by M1 < , we have X is M1-Lipshitz continuous. As composition of two Lipschitz continuous functions, A X is Lipschitz continuous. First observe if β is twice differentiable, β is bounded in [a, b] as it is continuous. Now if β(t) is bounded for t [a, b], i.e. β(t) M2 for some M2 > 0, for M3 = maxt [a,b] |β(t)| we have d dt β(t)( X(t) X0) = β(t)( X(t) X0) + β(t) X(t) M2M1 |b a| + M3M1 Therefore β(t)( X(t) X0) is (M2M1 |b a| + M3M1)-Lipschitz continuous. Thus as a sum of Lipschitz continuous functions, X is Lispchitz continuous. We will show for every T > 0, the set of derivatives n Xδ | δ (0, 1) o is uniformly bounded on [0, T] and n Xδm o m N converges uniformly on [0, T]. We first prove the boundedness of derivatives. To do so, we first prove a useful lemma. Lemma B.5. For 0 < a < b, suppose X : [a, b] Rn satisfies ODE (12) for t [a, b]. Define U : [a, b] R as U1(t) = X(t) 2 + γp U2(t) = 1 tp 1 X(t) 2 + γp Then U1 is nonincreasing if for 0 < p 1, and U2 is nonincreasing for p > 1. Proof. From Lemma B.4 we can check U1 and U2 are absolutely continuous in [a, b]. Therefore it is enough to check the derivative is nonpositive for almost all t. From Lemma B.4, we can differentiate (12) both sides. Thus for almost all t we have dt A( X(t)) + γp tp+1 ( X(t) X0) γ Recall from monotonicity of A and (1), we know D X(t), d dt A( X(t)) E 0 for almost all t . Therefore for almost all t, (i) 0 < p 1 U1(t) = 2 D X(t), X(t) E + 2γp D X(t), X(t) X0 E γp(p + 1) = 2 X(t), d dt A( X(t)) 2γ X(t) 2 + 2γp D X(t), X(t) X0 E D X(t), X(t) X0 E 2γp2 X(t) X0 2 γp(1 p) = 2 X(t), d dt A( X(t)) 2γ t ( X X0) 2 γp(1 p) U2(t) = p 1 X(t) 2 + 2 tp 1 2 D X(t), X(t) E + 2γp D X(t), X X0 E 2γp X(t) 2 2 tp 1 dt A( X(t)) X(t) 2 + 2γp D X(t), X X0 E + 2γp D X(t), X X0 E 2γp X(t) 2 2 tp 1 dt A( X(t)) 2γ t2p 1 t ( X X0) 2 We now are ready to prove uniform boundedness of derivatives Xδ. Lemma B.6. Take T > 0. For all t [0, T] and δ (0, 1), below inequality holds. Xδ(t) Mdot(T) = 1 + γp A(X0) 0 < p 1 r T p 1 1 2γ + 2p (2L2 + 2γ + 1) A(X0) p 1. Proof. We prove first statement by considering two cases. (1) t [0, δ] From Lemma B.4, we know Xδ is absolutely continuous and so Xδ 2 is absolutely continuous as well. Differentiating (13), for almost all t (0, δ) we have Xδ = d dt A( Xδ(t)) γ Now for almost all t (0, δ), Xδ 2 = 2 D Xδ, Xδ E = 2 Xδ, d dt A( Xδ(t)) 2γ (i) 0 < p 1 From above we know Xδ is nonincreasing in t [0, δ], we have Xδ(t) Xδ(0) = A(X0) . Integrating d δp Xδ 2 we have Xδ(t) 2 Xδ(0) 2 2γ Xδ(t) 2 ds = 2γt Organizing with respect to Xδ(t) we have (2) t δ Note for t > δ (12) holds, we can apply Lemma B.5. (i) 0 < p 1 From (i) we know Xδ(t) A(X0) for t [0, δ]. 
Therefore Xδ(δ) A(X0) and we see Xδ(s) ds δ A(X0) . From Lemma B.5 we know U1(t) = Xδ(t) 2 + γp tp+1 Xδ(t) X0 2 is nonincreasing. Therefore Xδ(t) 2 U1(δ) for t δ. Since δ < 1 we have r Xδ(δ) 2 + γp r A(X0) 2 + γpδ1 p A(X0) 2 p 1 + γp A(X0) . (ii) p > 1 From (i) we know Xδ(t) q δp 2γt A(X0) for t [0, δ]. Therefore Xδ(δ) q 2γ A(X0) and we see Xδ(δ) X0 Z δ Xδ(s) ds Z δ Applying (13) and recalling the fact A is L-Lipschitz, we have δ2p Xδ(δ) X0 2 = A(Xδ(δ)) + Xδ(δ) 2 = A(Xδ(δ)) A(X0) + A(X0) + Xδ(δ) 2 2 A(Xδ(δ)) A(X0) 2 + 2 A(X0) + Xδ(δ) 2 2L2 Xδ(δ) X0 2 + 4 A(X0) 2 + 4 Xδ(δ) 2 γ L2δp+1 + 4 + 2δp 1 From Lemma B.5 we know U2(t) = 1 tp 1 X(t) 2 + γp t2p X(t) X0 2 is nonincreasing. Therefore tp 1 U2(δ) for t δ. Since δ < 1 we have 2γ + p (4L2δp+1 + 4γ + 2δp 1) A(X0) 2 r 1 2γ + 2p (2L2 + 2γ + 1) A(X0) . Therefore for t [0, T] we have 2γ + 2p (2L2 + 2γ + 1) A(X0) From (1) and (2), we get the desired result. We now show the sequence n Xδm o m N converges uniformly on [0, T] for every T > 0. It is suffices to prove following proposition. Proposition B.7. For T > 0, the sequence n Xδm o m N is a Cauchy sequence with respect to supremum norm on [0, T]. Proof. Take ϵ > 0. We want to show, there is N N such that if n, m > N then Xδn(t) Xδm(t) ϵ for all t [0, ). Define dδ,ν(t) = 1 2 Xδ(t) Xν(t) 2 . Without loss of generality, we may assume δ ν. With Mdot(T) defined in Lemma B.6, we will show dδ,ν(t) 2δ2 Mdot(T)2. First consider the case 0 t δ. By Lemma B.6, we have Xδ(t) Xν(t) Z t Xδ(s) Xν(s) ds Z t Xδ(s) + Xν(s) ds 0 2 Mdot(T)ds = 2t Mdot(T) Thus for t [0, δ] we have dδ,ν(t) 2t2 Mdot(T)2 2δ2 Mdot(T)2. Now we consider the case t δ. By monotonicity of A, we have Xδ(t) Xν(t) 2 = D Xδ(t) Xν(t), Xδ(t) Xν(t) E = D A( Xδ(t)) A( Xν(t)) γ tp (Xδ(t) X0) + γ tp (Xν(t) X0), Xδ(t) Xν(t) E tp (Xδ(t) X0) + γ tp (Xν(t) X0), Xδ(t) Xν(t) E tp Xδ(t) Xν(t) 2 Thus we have dδ,ν(t) dδ,ν(δ) for t δ. Now combining two cases, we have dδ,ν(t) 2δ2 Mdot(T)2 (14) Since limk δn = 0, there is N N such that m > n > N implies dδn,δm ϵ2 2 , we re done. We are now ready to prove Theorem B.2. Proof. (1) Existence of solution. From Proposition B.7, we know n Xδm o m N converging uniformly on [0, T] for every T >. Denote the limit as X, i.e. define X : [0, ) Rn as lim m Xδm(t) = X(t). We can check X(0) = X0 easily since Xδm(0) = X0 for all m N. It remains to show X satisfies (12). Take t > 0. We wish to show X(t + h) X(t) h + A( X(t)) + γ X(t) X0 = 0. Consider h, δ, T such that 0 < |h| < t, 0 < δ < min {1, t |h|} and T > t + |h|. Then (t |h|, t + |h|) [0, T] and Xδ(t) = A( Xδ(t)) γ tp Xδ(t) X0 . Consider inequality X(t + h) X(t) h + A( X(t)) + γ X(t + h) X(t) h Xδ(t + h) Xδ(t) Xδ(t + h) Xδ(t) + Xδ(t) + A( X(t)) + γ We now show right hand side goes to zero as h 0. The point of the proof is, Xδ converges uniformly to X and Xδ is uniformly bounded on [t |h|, T]. From (14) we have X(t + h) X(t) h Xδ(t + h) Xδ(t) X(t + h) Xδ(t + h) + X(t) Xδ(t) 4 δ |h| Mdot(T). Also from (14) and since A is L-Lipschitz continuous, Xδ(t) + A( X(t)) + γ X(t) X0 = A( Xδ(t)) + γ Xδ(t) X0 + A( X(t)) + γ L Xδ(t) X(t) + γ Xδ(t) X(t) 2δ L + γ Now from Lemma B.4 we have Xδ(t) is absolutely continuous, thus Xδ(t + h) Xδ(t) R t+h t Xδ(s) Xδ(t) ds t Xδ(u) du ds Xδ(u) du ds Observe, for almost every u [t |h| , t + |h|] [0, T] by Lemma B.6 we have Xδ(u) = d du A( Xδ(u)) + pγ up+1 = d du A( Xδ(u)) + pγ up+1 0 Xδ(v)dv + γ L Mdot(T) + pγ 0 Mdot(T)dv + γ (t |h|)p Mdot(T) (t |h|)p+1 + γ (t |h|)p ! Mdot(T) =: M. Note M is independent of h or δ. 
While obtaining the inequality, we used the fact that A( Xδ( )) is L Mdot(T)-Lipschitz continuous in [0, T]. Now Xδ(u) du ds Now consider δ = h2 with h small enough that satisfies |h| < 1 and |h| + h2 < t. Then the conditions |h| < t, 0 < δ < min {1, t |h|} hold, above arguments are valid. Gathering above results, we have X(t + h) X(t) h + A( X(t)) + γ 2|h| Mdot(T) + 2|h|2 L + γ Mdot(T) + |h| 2 M = O (|h|) . which implies the desired result. (2) The value and continuity of X(t) at t = 0. Define C(t) as ( tγ p = 1 e γ 1 p t1 p p > 0, p = 1. Then we see for t > 0 C(t) X(t) X0 = C(t) X(t) + γ tp ( X(t) X0) = C(t) A( X(t)). Integrating both sides from ϵ > 0 to t we have C(t)( X(t) X0) C(ϵ)( X(ϵ) X0) = Z t ϵ C(s) A( X(s)) ds. As C is a nondecreasing function and bounded below by 0, limϵ 0+ C(ϵ) exists, taking limit ϵ 0+ we have C(t)( X(t) X0) = Z t 0 C(s) A( X(s)) ds. Dividing both sides by t C(t), with change of variable v = s/t, we have X(t) X0 C(t) A( X(s)) ds C(t) A( X(tv)) dv. γ 1 p t1 p(v1 p 1) p = 1, p > 0. Note γ 1 pt1 p v1 p 1 0 for v [0, 1], since γ 1 p and v1 p 1 has opposite sign either 0 < p < 1 or p > 1. Therefore e γ 1 p t1 p(v1 p 1) 1. Also, A( X(tv)) is bounded for v [0, 1] since A( X( )) is continuous by Lemma B.4. So we can apply dominated convergence theorem and take limit t 0+. Since lim t 0+ C(tv) 1 0 < p < 1 vγ p = 1 0 p > 1 for v = 0, we have X(0) = lim t 0+ C(t) A( X(tv)) dv = A(X0) 0 < p < 1 1 γ+1 A(X0) p = 1 0 p > 1. (15) We now check X(t) is continuous at t = 0. (i) 0 < p 1 For t > 0, we know X(t) = A( X(t)) γ tp ( X(t) X0). lim t 0+ γ tp ( X(t) X0) = lim t 0+ γt1 p X(t) X0 ( 0 0 < p < 1 γ γ+1 A(X0) p = 1. (16) Now from ODE X(t) = A( X(t)) γ tp ( X(t) X0), by taking limit t 0+ we have lim t 0+ X(t) = A(X0) lim t 0+ tp ( X(t) X0) = ( A(X0) 0 < p < 1 1 γ+1 A(X0) p = 1. Therefore X(t) is continuous at t = 0. (ii) p > 1 For p > 1, we don t know the value of limt 0 γ tp ( X(t) X0). Thus we first find the limit. Let s go back to C(t)( X(t) X0) = Z t 0 C(s) A( X(s)) ds. Recall C(t) = γ tp C(t). By taking integration by parts for the right hand side, we have Z t 0 C(s) A( X(s)) ds = tp γ C(t) A( X(t)) Z t γ C(s) A( X(s)) ds Z t ds A( X(s)) ds. Where we know A( X(s)) is differentiable almost everywhere from Lemma B.4. Now, divide both sides by tp C(t) γ . Then for s = tv we have γ tp ( X(t) X0) = A( X(t)) + Z t C(t) A( X(s)) ds + Z t ds A( X(s)) ds = A( X(t)) + Z 1 0 pvp 1 C(tv) C(t) A( X(s)) dv + Z 1 ds A( X(s)) t dv. From Proposition B.7 and since X satisfies (12) for t > 0, for s (0, t] we have X(s) = A( X(s)) + γ X(s) X0 = lim m A( Xδm(s)) + γ Xδm(s) X0 = lim m Xδm(s). From Lemma B.6 we know Xδm(s) Mdot(t) for s [0, t], taking limit m we have X(s) Mdot(t) for s [0, t]. Thus X(s) becomes Mdot(t)-Lipschitz continuous in s [0, t]. And so ( A X)(s) becomes L Mdot(t)-Lipschitz continuous in s [0, t], we have d ds A( X(s)) L Mdot(t) for almost all s [0, t]. Moreover C(tv)/C(t) is bounded for v [0, 1] since C is a nonnegative nondecreasing function. Therefore we can again apply dominated convergence theorem. Reminding limt 0+ C(tv) C(t) = 0 for p > 1, taking limit t 0+ we have lim t 0+ γ tp ( X(t) X0) = A(X0) + 0 = A(X0). (17) Finally we have lim t 0+ X(t) = A(X0) lim t 0+ tp ( X(t) X0) = A(X0) ( A(X0)) = 0 = X(0). Before we move on to original inclusion, we prove important corollaries that bound X(t) and A( X(t)) uniformly on [0, T]. 
Note the main difference between following corollary and Lemma B.6 is that the dependency on Lipschitz constant L is dropped for the case p > 1, which will be crucial in the next section. Corollary B.8. Denote X as the solution of (12). Then for T > 0, following inequality is true for t [0, T]. A(X0) 0 < p < 1 A(X0) p = 1 q p γ T p 1 A(X0) p > 1. Proof. (i) 0 < p 1 As X is the solution for (12), from Lemma B.5 we know U1(t) = X(t) 2 + γp is a nonincreasing function. From (15), (16) and continuity of X(t) at t = 0, we have lim ϵ 0+ U1(ϵ) = A(X0) 2 0 < p < 1 1 γ+1 A(X0) 2 p = 1. (18) Therefore X(t) 2 U1(t) limϵ 0+ U1(ϵ) for t > 0, we get the desired result. (ii) p > 1 As X is the solution for (12), from Lemma B.5 we know U2(t) = 1 tp 1 X(t) 2 + γp is a nonincreasing function. However, as we don t know the value of limt 0+ 1 tp 1 X(t) 2 , we first calculate it. To do so, we consider U1(t) = X(t) 2 + γp X(t) X0 2 . In the proof of Lemma B.5, we have observed its derivative becomes U1(t) = 2 X(t), d dt A( X(t)) 2γ t ( X(t) X0) 2 γp(1 p) X(t) X0 2 . For ϵ > 0, integrating from ϵ to t we have U1(t) U1(ϵ) + Z t X(s) X0 2 ds. We consider taking limit ϵ 0+. Observe from (17) and (15) we know limt 0+ X X0 γ and X(0) = 0, and therefore we have lim ϵ 0+ U1(ϵ) = 0 + lim ϵ 0+ γp ϵ2p X(ϵ) X0 2 ϵp 1 = p A(X0) 2 lim ϵ 0+ ϵp 1 = 0. Moreover as limt 0+ X(t) X0 2 t2p = A(X0) 2 γ2 , there is δ > 0 such that 0 < s < δ implies X(t) X0 2 γ2 . Recalling p > 1, for 0 < ϵ < δ we have X(s) X0 2 ds Z ϵ s2 p 2 A(X0) 2 = 2p(p 1) A(X0) 2 1 p 1sp 1 ϵ 0 = 2p A(X0) 2 X(s) X0 2 ds 2p A(X0) 2 γ ϵp 1 + Z t X(s) X0 2 ds < . Thus the integral is well-defined when ϵ 0+. Taking limit ϵ 0+ we have X(s) X0 2 ds. Moving γp tp+1 X(t) X0 2 to the right hand side, we get a inequality for X(t) 2 , X(t) 2 = U1(t) γp X(t) X0 2 Z t X(s) X0 2 ds γp X(t) X0 2 . Observe, by L Hôpital s rule we have R t 0 γp(p 1) sp+2 X(s) X0 2 ds tp 1 = lim t 0+ tp+2 X(t) X0 2 (p 1)tp 2 = lim t 0+ γp Now dividing tp 1 to previous inequality and taking limit, we conclude lim t 0+ 1 tp 1 X 2 lim t 0+ 1 tp 1 X X0 2 ds lim t 0+ γp t2p Thus we have limt 0+ 1 tp 1 X 2 = 0. Therefore, lim ϵ 0+ U2(ϵ) = lim ϵ 0+ X(ϵ) 2 + γp X(ϵ) X0 2 = p tp 1 limϵ 0+ U2(ϵ) = p γ A(X0) 2 , we conclude the desired result. Corollary B.9. Denote X as the solution of (12). Then for T > 0, following inequality is true for t [0, T]. (γ + 1) 1 + T 1 p p A(X0) 0 < p < 1 A(X0) p = 1 r γ T p 1 + 1 p A(X0) p > 1. Proof. First observe, A( X(t)) 2 = X(t) + γ = X(t) 2 + 2γ * X(t), X(t) X0 = (γ + 1) X(t) 2 + γ The inequality comes from Young s inequality. Now we consider each case. (i) p = 1 For this case, the terms on the right hand side exactly become U1(t). Therefore from (18) A( X(t)) 2 (γ + 1) U1(t) (γ + 1) lim ϵ 0+ U1(ϵ) = A(X0) 2 . (ii) 0 < p < 1 Recall from the proof of Corollary B.8 we know U1(t) limϵ 0+ U1(ϵ) = A(X0) 2 . Therefore we have γp tp+1 X(t) X0 2 U1(t) A(X0) 2 , applying Corollary B.8 we get A( X(t)) 2 (γ + 1) X(t) 2 + γp X(t) X0 2 t1 p (γ + 1) A(X0) 2 1 + T 1 p (iii) p > 1 Recall from the proof of Corollary B.8 we know U2(t) limϵ 0+ U2(ϵ) = p γ A(X0) 2 . Therefore we have γp t2p X(t) X0 2 U2(t) p γ A(X0) 2 , applying Corollary B.8 we get A( X(t)) 2 (γ + 1) X(t) 2 + γp X(t) X0 2 1 B.2.2 Existence proof for general 픸 Now we move on to the original inclusion. As noticed before, we will approximate 픸with a Liptschitz function 픸λ called Yosida approximation. We define and state some facts about 픸λ as a lemma, and use it without proof. 
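To make the approximation tangible, here is a small numerical sketch (purely illustrative, not part of the proof) of the Yosida approximation $\mathbb{A}_\lambda = \frac{1}{\lambda}(\mathbb{I} - (\mathbb{I} + \lambda\mathbb{A})^{-1})$ defined in Lemma B.10 below, applied to the skew operator of Section 4; it numerically exhibits item (ii) of the lemma, $\mathbb{A}_\lambda x \to \mathbb{A}x$ as $\lambda \to 0^+$ for single-valued $\mathbb{A}$.

```python
import numpy as np

# Yosida approximation A_lam = (1/lam) * (I - (I + lam*A)^{-1}) of the skew
# operator A from Section 4.  Since A is linear and single-valued,
# A_lam(x) -> A(x) as lam -> 0+.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
I2 = np.eye(2)

def yosida(lam):
    """Matrix representing A_lam (A is linear, so A_lam is linear too)."""
    J_lam = np.linalg.inv(I2 + lam * A)     # resolvent J_{lam A}
    return (I2 - J_lam) / lam

x = np.array([1.0, 2.0])
for lam in [1.0, 0.1, 0.01, 0.001]:
    err = np.linalg.norm(yosida(lam) @ x - A @ x)
    print(f"lam = {lam:6.3f}   ||A_lam(x) - A(x)|| = {err:.2e}")
```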
For the ones who are interested in proofs, see [11, Chpater 3.1, Theorem 2]. Lemma B.10. Define 픸λ : Rn Rn as λ (핀 핁λ픸) = 1 핀 (핀+ λ픸) 1 This is so called Yosida approximation of 픸. Followings are true. (i) 픸λ is 1 λ-Lipschitz continuous and maximal monotone. (ii) limλ 0+ 픸λx = m(픸(x)). Here m(픸(x)) is defined as m(픸(x)) = Π픸(x)(0), the element of 픸(x) with minimal norm. (iii) 픸λ(x) m(픸(x)) . (iv) x Rn, 픸λ(x) 픸(핁λ픸x). Now we can state the proposition that proves existence of Theorem 2.2 Proposition B.11. For Yosida approximation 픸λ, consider the ODE Xλ(t) = 픸λ(Xλ(t)) γ tp (Xλ(t) X0) (19) with initial value condition Xλ(0) = X0 dom(픸). The solution uniquely exists by Theorem B.2, denote the solution as Xλ. Now for T > 0 and a positive sequence {λn}n N such that limn λn = 0, define a sequence of solutions as FT = {Xλn : [0, T] Rn | m N}. Then there is a subsequence {λnk}k N such that Xλnk converges to the solution of (6) uniformly on [0, T]. From Corollary B.8, Corollary B.9 and Lemma B.10 (iii), following lemma is immediate. Lemma B.12. Let Xλ be the solution of (19). Then following is true for t [0, T] for all T > 0. Xλ(t) Mdot(T) = m(픸(X0)) 0 < p < 1 1 γ+1 m(픸(X0)) p = 1 q p γ T p 1 m(픸(X0)) p > 1 픸λ(Xλ(t)) M픸(T) = (γ + 1) 1 + T 1 p p m(픸(X0)) 0 < p < 1 m(픸(X0)) p = 1 r γ T p 1 + 1 p m(픸(X0)) p > 1. Proof. Replace A with 픸λ and X with Xλ in Corollary B.8 and Corollary B.9. Applying 픸λ(X0) m(픸(X0)) from Lemma B.10 (iii), we re done. While we re concluding the existence of converging sequence, we will exploit following lemma from [11, Chapter 0.3, Theorem 4]. For convenience, we restate the lemma here. Lemma B.13. Let us consider a sequence of absolutely continuous functions xk( ) from an interval I of R to a Banach space X satisfying (i) t I, {xk(t)}k N is a relatively compact subset of X (ii) there exists a positive function c( ) L1(I, [0, )) such that, for almost all t I, xk(t) c(t) Then there exists a subsequence (again denoted by) xk( ) converging to an absolutely continuous function x( ) from I to X in the sense that (i) xk( ) converges to x( ) over compact subsets of I (ii) xk( ) converges weakly to x( ) in L1(I, X) Proof. See [11, Chapter 0.3, Theorem 4]. From Lemma B.12 we can immediately check norm of all derivatives Xλm are bounded by Mdot(T). So condition (ii) holds with Mdot(T). For condition (i), we prove FT is convergent in C([0, T], Rn). Lemma B.14. FT is convergent sequence in C([0, T], Rn). In other words, ϵ > 0, there is N > 0 such that n, m > N implies supt [0,T ] Xλn(t) Xλm(t) < ϵ Proof. We will show Xλn is Cauchy sequence in C([0, T], Rn). Let ν, λ > 0. From (19), we see for t (0, T] d dt 1 2 Xν(t) Xλ(t) 2 = D Xν(t) Xλ(t), Xν(t) Xλ(t) E = 픸ν(Xν(t)) 픸λ(Xλ(t)), Xν(t) Xλ(t) γ tp Xν(t) Xλ(t) 2 픸ν(Xν(t)) 픸λ(Xλ(t)), Xν(t) Xλ(t) . From definition of resolvent we know 핀 핁λ픸= λ픸λ. And from Lemma B.10 (iv) we know 픸λ(Xλ(t)) 픸(핁λ픸(Xλ(t))). Thus from monotone inequality we see 픸ν(Xν(t)) 픸λ(Xλ(t)), Xν(t) Xλ(t) = 픸ν(Xν(t)) 픸λ(Xλ(t)), ν픸ν(Xν(t)) λ픸λ(Xλ(t)) 픸ν(Xν(t)) 픸λ(Xλ(t)), 핁ν픸(Xν(t)) 핁λ픸(Xλ(t)) 픸ν(Xν(t)) 픸λ(Xλ(t)), ν픸νXν(t) λ픸λ(Xλ(t)) = (ν + λ) 픸ν(Xν(t)), 픸λ(Xλ(t)) ν 픸νXν(t) 2 + λ 픸λXλ(t) 2 . By Young s inequality (ν + λ) 픸ν(Xν(t)), 픸λ(Xλ(t)) ν 픸νXν(t) 2 + λ 픸λXλ(t) 2 ν 픸ν(Xν(t)) 2 + 1 4 픸λ(Xλ(t)) 2 + λ 픸λ(Xλ(t)) 2 + 1 4 픸ν(Xν(t)) 2 ν 픸νXν(t) 2 + λ 픸λXλ(t) 2 ν 픸λ(Xλ(t)) 2 + λ 픸ν(Xν(t)) 2 . Now applying Lemma B.12 we have d dt 1 2 Xν(t) Xλ(t) 2 1 4(ν + λ)M픸(T)2. Then integrating both sides from 0 to t we have Xν(t) Xλ(t) 2 t 2(ν + λ)M픸(T)2. (20) Now take ϵ > 0. 
Then there is N > 0 such that for n > N, λn < ϵ T M픸(T )2 holds. Then for t [0, T], n, m > N, we have Xλn(t) Xλm(t) 2 t 2(λn + λm)M픸(T)2 < ϵ. Therefore we get the desired result. Finally, we are ready to prove Proposition B.11, which implies the main theorem. Proof of Proposition B.11. Take T > 0. We know Xλn uniformly converges on [0, T] by Lemma B.14. Name the limit as X, i.e. define X : [0, T] Rn as X(t) = limn Xλn(t). Then as Xλn(0) = X0 for all n N, we see X satisfies the initial condition. It remains to show X satisfies (6) almost everywhere. Recall { Xλn} is bounded in L ([0, T], Rn) by Lemma B.12. Thus we can apply Lemma B.13, there is a subsequence { Xλnk } converges weakly to X in L1([0, T], Rn). Furtheremore we have X L ([0, T], Rn) and so { Xλnk } also converges weakly to X in L2([0, T], Rn) as well. For λ > 0, define fλ : [0, T] Rn as tp (Xλ(t) X0) if t > 0 0 if t = 0. Then for f : [0, T] Rn defined as f(t) = γ tp (X(t) X0) for t > 0 and f(0) = 0, we have limk fλnk (t) = f(t). As fλ(t) = Xλ(t) + 픸λn(Xλ)(t) Mdot(T) + M픸(T) for t (0, T] by Lemma B.12, we have fλ(t) f(t) 2 ( fλ(t) + f(t) )2 4(Mdot(T) + M픸(T))2. Therefore by dominated convergence theorem we have fλnk (t) f(t) 2 dt = Z T fλnk (t) f(t) 2 dt = 0, we conclude fλnk strongly converges to f in L2([0, T], Rn). Now consider Fλ : [0, T] Rn defined as Fλ(t) = Xλ(t) + γ tp (Xλ(t) X0) if t > 0 픸λ(X0) if t = 0 Note since Xλ are solution to ODE (19), we have Fλ(t) = 픸λ(Xλ(t)). Then for F : [0, T] Rn defined as F(t) = X(t) + γ tp (X(t) X0) if t > 0 m(픸(X0)) if t = 0, we see {Fλnk }k N converges weakly to F in L2([0, T], Rn). On the other hand, by Lemma B.10 (iv) and the fact Fλnk (t) = 픸λnk (Xλnk (t)), we see Fλnk (t) 픸(핁λnk 픸(Xλnk (t))). Observe, from the definition of 픸λn and Lemma B.12, we see Xλn(t) 핁λn픸(Xλn(t)) = λn 픸λn(Xλn(t)) λn M픸(T). Since Xλn converges to X in C([0, T], Rn), by taking n above inequality we see 핁λn픸(Xλn) also converges to X in C([0, T], Rn). Now for A : L2([0, T], Rn) L2([0, T], Rn) defined as (A(x))(t) = 픸(x(t)) almost everywhere, by [11, Chapter 3.1, Proposition 4], A is maximal monotone since 픸is maximal monotone. Since Fλk weakly converges to F in L2([0, T], Rn) and 핁λk픸(Xλk) strongly converges to X in L2([0, T], Rn), by [11, Chapter 3.1, Proposition 2] we have F A(X) in L2([0, T], Rn). Therefore for almost all t (0, T] we have tp (X(t) X0) 픸(X(t)). Reorganizing the result with respect to X, we have following is true for almost all t (0, T] X(t) 픸(X(t)) γ tp (X(t) X0). Since T > 0 was arbitrary, we conclude X satisfies above inclusion for almost all t (0, ). C Proof of existence and uniquencess of the solution of (10) As uniqueness comes from Theorem B.1, we only need to show the existence. What we need for the existence proof are (i) Nonincreasing function U(t) which contains X 2 as in Lemma B.5. (ii) Uniform boundedness of Xδ(t) for t [0, T] as shown in Lemma B.6. (iii) X(0) = limt 0+ X(t) X0 t = limt 0+ X(t) as shown in the existence proof of Lipschitz case. (iv) Uniform boundedness of Xλ(t) and 픸λ(Xλ(t)) for t [0, T] as shown in Lemma B.12 We now show these steps can be also done to the β(t) = 2µ e2µt 1. 
(i) Nonincreasing function U(t) which contains X 2 From X(t) = A( X(t)) 2µ e2µt 1( X(t) X0) for almost all t we have dt A( X(t)) + 2µ e2µt 1 2 e2µt( X(t) X0) 2µ e2µt 1 X(t) U(t) = e 2µt X(t) 2 + 2µ e2µt 1 2 X(t) X0 2 Therefore for almost all t > 0, U(t) = 2e 2µt D X(t), X(t) E 2µe 2µt X(t) 2 + 2 2µ e2µt 1 2 D X(t), X(t) X0 E 2µ e2µt 1 3 2e2µt X(t) X0 2 = 2e 2µt X(t), d dt A( X(t)) + 2 2µ e2µt 1 2 D X(t), X(t) X0 E 2e 2µt 2µ e2µt 1 2µe 2µt X(t) 2 + 2 2µ e2µt 1 2 D X(t), X(t) X0 E 2µ e2µt 1 3 2e2µt X(t) X0 2 = 2µe 2µt X(t) 2 2e 2µt X(t), d dt A( X(t)) 2µe 2µt X(t) 2µe2µt (ii) Uniform boundedness of Xδ(t) As (13), we define Xδ(t) as the solution of ( A( Xδ(t)) 2µ e2µδ 1( Xδ(t) X0) 0 t δ A( Xδ(t)) 2µ e2µt 1( Xδ(t) X0) t δ Again with same arguments of Lemma B.6 we have Xδ(δ) X0 δ A(X0) and Xδ(δ) A(X0) . Now for e2µt U(t) q e2µt U(δ) = eµt s e 2µδ Xδ(δ) 2 + 2µ e2µδ 1 2 Xδ(δ) X0 2 e 2µδ A(X0) 2 + 2µδ e2µδ 1 (iii) X(0) = limt 0+ X(t) X0 t = limt 0+ X(t) Define C(t) := 1 e 2µt. Then C(t) = C(t)β(t) for β(t) = 2µ e2µt 1. And since lim t 0+ C(tv) C(t) = lim t 0+ 1 e 2µtv 1 e 2µt = v, with same argument done to arrive (15), we have X(0) = lim t 0+ C(t) A( X(tv)) dv = 1 Now from ODE X(t) = A( X(t)) 2µ e2µt 1( X(t) X0), by taking limit both sides by t 0+ we have lim t 0+ X(t) = A( X(0)) lim t 0+ 2µt e2µt 1 Therefore, limt 0+ X(t) X0 t = limt 0+ X(t). (iv) Uniform boundedness of Xλ(t) and 픸λ(t) for t [0, T] Recall U(t) = e 2µt Xλ(t) 2 + 2µ e2µt 1 2 Xλ(t) X0 2 is nonincreasing. So from (iii), we have e 2µt Xλ(t) 2 + 2µ e2µt 1 2 Xλ(t) X0 2 lim t 0+ e 2µt Xλ(t) 2 + 2µ e2µt 1 2 Xλ(t) X0 2 ! 4 픸λ(X0) 2 + 1 4 픸λ(X0) 2 = 1 Therefore we have from Lemma B.10 (iii) e 2µt Xλ(t) 2 1 2 픸λ(X0) 2 1 2 m(픸(X0)) 2 = Xλ(t) eµT 2 m(픸(X0)) , (21) and by Young s inequality 픸λ(Xλ(t)) 2 Xλ(t) + 2µ e2µt 1(Xλ(t) X0) Xλ(t) 2 + 2µ e2µt 1 2 Xλ(t) X0 2 ! 2 픸λ(X0) 2 + 1 2 픸λ(X0) 2 = (e2µT + 1) 픸λ(X0) 2 (e2µT + 1) m(픸(X0)) 2 . 픸λ(Xλ(t)) p e2µT + 1 m(픸(X0)) . (22) D Omitted proofs for derivation of anchor ODE (4) D.1 Preparation for the proof of Theorem 2.1 We first provide the boundedness of trajectories as a lemma. As mentioned in the discussion after Lemma 2.3, boundedness of trajectories is an immediate corollary of Lemma 2.3. However, to address cases that are slightly more generalized, we present a proof using an argument similar to the one used in the proof of Lemma 2.3. Note the proof argument of following lemma is valid for the solution of differential equation (11) with satisfying the assumptions in Theorem 7.1 as well. Lemma D.1. (Boundedness of solutions) Suppose X( ) is the solution of the differential inclusion (3). Then for all X Zer픸, t [0, ) , following holds. X(t) X X0 X . And so, X(t) X0 2 X0 X . Proof. It is trivial for t = 0, so we may assume t > 0. Take X Zer픸. By monotonicity of 픸and Young s inequality, we get the following inequality. d dt X(t) X 2 = 2 D X(t), X(t) X E = 2 픸(X(t)) + β(t)(X(t) X0), X(t) X = 2 픸(X(t)) + β(t)(X(t) X ) β(t)(X0 X ), X(t) X 2β(t) X(t) X 2 + 2β(t) X0 X , X(t) X 2β(t) X(t) X 2 + β(t) X0 X 2 + X(t) X 2 = β(t) X(t) X 2 + β(t) X0 X 2 . Now again define C(t) = e R t v β(s)ds for some v > 0, then we see C(t) = C(t)β(t). Moving β(t) X(t) X 2 to the left hand side and multiplying both sides by C(t), we have C(t) X(t) X 2 = C(t) d dt X(t) X 2 + C(t)β(t) X(t) X 2 C(t)β(t) X0 X 2 = d dt C(t) X0 X 2 . Integrating both sides from ϵ > 0 to t we have C(t) X(t) X 2 C(ϵ) X(ϵ) X 2 C(t) X0 X 2 C(ϵ) X0 X 2 . As β > 0, we have C is nonnegative and nondecreasing, limϵ 0+ C(ϵ) exists. 
Taking limit ϵ 0+ both sides we have and dividing both sides by C(t) we conclude X(t) X 2 X0 X 2 . The latter statement holds directly from triangular inequality, X(t) X0 X(t) X + X X0 2 X0 X . Following lemma shows APPM is an instance of Halpern method. It is immediate from induction, but we state it as a lemma due as its importance. Lemma D.2. Consider a method defined as xk+1 = 핁픸yk yk+1 = (1 βk) (2xk+1 yk) + βkx0, for k = 0, 1, . . . , with initial condition y0 = x0. Then for reflected resolvent ℝ픸defined as ℝ픸= 2핁픸 핀, above method is equivalent to yk+1 = βk y0 + (1 βk) 핋 yk when 핋= ℝ픸, y0 = y0. Here equivalence means yk = yk holds for k = 0, 1, . . . . Proof. Proof by induction. As y0 = y0 by assumption, the statement is true for k = 0. Now suppose yk = yk holds for k N, then yk+1 = (1 βk)ℝ픸 yk + β0 y0 = (1 βk) (2핁픸 핀) yk + β0 y0 = (1 βk) (2핁픸 핀) yk + β0y0 = (1 βk) 2xk+1 yk + β0y0 = yk+1. Thus yk+1 = yk+1, the statement is true for k + 1. By induction, we get the desired result. D.2 Proof of Theorem 2.1 Let 픸: Rn Rn be a maximal monotone operator, and h, λ, δ > 0. Again, denote 픸λ = 1 λ(핀 픸λ픸) = 1 λ(핀 (핀+ λ픸) 1). Since various kind of terms appear in the proof, we first organize the terms and notations. X : Solution of differential inclusion, Xλ : Solution of differential equation, Xλ = 픸λ(Xλ) 1 Xλ,δ : Solution of approximated differential equation, Xλ,δ(t) = 픸λ(Xλ,δ)(t) 1 δ (Xλ,δ(t) X0) 0 t < δ 픸λ(Xλ,δ)(t) 1 t (Xλ,δ(t) X0) t δ. (23) Xk λ,δ : Sequence obtained by taking Euler discretization of ODE (23), Xk δ 2h픸λ(Xk λ,δ) + 2h δ (Xk λ,δ X0) 0 k < δ 2h Xk δ 2h픸λ(Xk λ,δ) + 1 k(Xk λ,δ X0) k δ 2h. xk h,λ : Sequence obtained from APPM with operator h픸λ, i.e. xk h,λ = 핁h픸λyk 1 h,λ yk h,λ = k k + 1(2xk h,λ yk 1 h,λ ) + 1 k + 1X0. xk h : Sequence obtained from APPM with operator h픸, i.e. xk h = 핁h픸yk 1 h yk h = k k + 1(2xk h yk 1 h ) + 1 k + 1X0. We want to show for fixed T > 0, lim h 0+ sup 0 k< T xk h X(2hk) = 0. Equivalently we may show for fixed T > 0, for every {hn}n N such that hn > 0 and converges to 0, lim n sup 0 k< T 2hn xk hn X(2hnk) = 0. We will show this by considering inequality xk h X(2hk) X(2hk) Xλ(2hk) + Xλ(2hk) Xλ,δ(2hk) (24) + Xλ,δ(2hk) Xk λ,δ + Xk λ,δ xk λ + xk h,λ xk h =: S (h, λ, δ, k) . Our goal is to show, for every {hn}n N such that hn > 0 and converges to 0, there is a sequence {(δn, λn)}n N such that limn sup0 k< T 2hn S (hn, λn, δn, k) = 0, and thus lim n sup 0 k< T 2hn xk hn X(2hnk) lim n sup 0 k< T 2hn S (hn, λn, δn, k) = 0. To clarify our goal, we need to find proper {(δn, λn)}n N in terms of {hn}n N. As {hn}n N is determined, xk hn X(2hnk) is fixed and doesn t change by the choice of {(δn, λn)}n N. But if we find {(δn, λn)}n N that makes S (hn, λn, δn, k) small, since (24) holds for any choice of δ, λ, right choice of {(δn, λn)}n N can gaurantee xk hn X(2hnk) is small. Thus to find such {(δn, λn)}n N, we will observe each terms in S to find the required conditions. Lemma D.3. For h, λ, δ > 0 following is true. (i) X(2hk) Xλ(2hk) = O (ii) Xλ(2hk) Xλ,δ(2hk) = O (δ) xk h,λ xk h = O(λ) . For Lλ,δ = max n 1 λ, 2 λ m(픸(X0)) , 1 2 δ m(픸(X0)) o , Xλ,δ (2hk) Xk λ,δ = O he2Lλ,δT . Further more if 0 < h Xk λ,δ xk λ = O (h) + O h2Lλ,δe2Lλ,δT λe2Lλ,δT + O(h) + O h2 λ e2Lλ,δT + O e2δLλ,δ We prove this lemma in next subsection, here we assume the lemma is true and prove Theorem 2.1. Suppose above lemma is true. The calculations are messy but the strategy is simple; balancing the speed of the terms h, δ, λ going zero to make above terms reach to zero. 
Above lemma motivate to take sequences as 2 m(픸(X0)) δn where M = max 1, 4 2 m(픸(X0)) . When limn hn 0 with hn > 0, we can easily check limn δn = 0 and limn λn = 0. So the cases (i), (ii), (iii) go to zero. Now observe from the definition of λn we have, 2 m(픸(X0)) δn = Lλn,δn = M e2Lλ,δT e 2MT δn 3 1 4 log3( 1 hn ) = 1 3 T λn 3 MT δn 3MT log3( 1 hn ) 8MT = 1 h1/8 n e2δn Lλn,δn Lλn,δn = e2M δn M = O (δn) . Therefore when limn hn 0, hne2Lλn,δn T h3/4 n 0 h2 n Lλn,δne2Lλn,δn T = h2 n 2T (2Lλn,δn T)e2Lλn,δn T h2 n 2T e4Lλn,δn T h3/2 n 2T 0 3 T λn hn = h7/8 n 0 λn = 3 T λn T T 3 2T λn hn T h3/4 n T 0 λn e2Lλn,δn T h3/4 n T e2Lλn,δn T h1/2 n T 0 3 T λn h2 n λn e2Lλn,δn T h3/2 n T 0 3 T λn δn 3T log3(δn) 2T δn = δ1/2 n 0. As limn hn = 0, without loss of generality we may assume hn < 1. Since hn > 0 we have λn is well-defined and satisfies the condition for Lemma D.3. Thus terms for the case (iv) and (v) go to zero as well. Therefore we have limn sup0 k< T 2hn S (hn, λn, δn, k) = 0, as {hn}n N is arbitrary, we get the desired result. D.2.1 Proof for case (i), (ii), (iii), (iv) of Lemma D.3 As case proof for (v) need lot of work, we provide it in a different subsection and here we provide the proofs for the cases from (i) to (iv). (i) X(2hk) Xλ(2hk) = O This is result of Lemma B.14. Considering (20) with p = 1, taking limit ν 0 we know sup t [0,T ] X(t) Xλ(t) 2 m(픸(X0)) = O (ii) Xλ(2hk) Xλ,δ(2hk) = O (δ) This is result of Proposition B.7. Consider (14) with γ = p = 1 for Lemma B.6 and taking limit ν 0. Then applying Lemma B.10 (iii) we have sup t [0,T ] Xλ(t) Xλ,δ(t) 2 2δ 픸λ(X0) 2 2δ m(픸(X0)) = O (δ) . xk h,λ xk h = O(λ) We first show show a general fact about Yosida approximation and resolvent. From [15, Proposition 23.7 (iv)] we have 핁h픸(x) 핁h픸λ(x) = 핁h픸(x) 핀+ 1 1 + λ 핁(1+λ)h픸 핀 (x) = 핁h픸(x) λ 1 + λx + 1 1 + λ핁(1+λ)h픸(x) 핁h픸(x) 핁(1+λ)h픸(x) + λ 1 + λ x 핁h픸(x) From [15, Proposition 23.31 (iii)], we have 핁h픸(x) 핁(1+λ)h픸(x) λ 핁h픸(x) x . Combining two facts we get 핁h픸(x) 핁h픸λ(x) 2λ 1 + λ x 핁h픸(x) . From Lemma D.2, we know the iteration of yk sequence in (5) is equivalent to below sequence yk+1 = 1 k + 1X0 + k k + 1 (2핁h픸 핀) (yk). Using this alternating form we have yk+1 h yk+1 h,λ = k k + 1 (2핁h픸 핀)(yk h) (2핁h픸λ 핀)(yk h,λ) k k + 1 (2핁h픸 핀)(yk h) (2핁h픸λ 핀)(yk h) + ℝh픸λ(yk h) ℝh픸λ(yk h,λ) k k + 1 2 핁h픸(yk h) 핁h픸λ(yk h) + yk h yk h,λ yk h 핁h픸(yk h) + yk h yk h,λ k + yk h yk h,λ . The first inequality comes from triangular inequality. The second inequality is from nonexpansiveness of reflected resolvent ℝ픸= 2핁픸 핀, [15, Corollary 23.11]. The third inequality is from the inequality shown previously. The last inequality comes from the convergence rate of APPM[35, Theorem 4.1], yk h 핁h픸(yk h) X0 X k . Now multiplying both sides by k + 1 and summing up from 0 to k we get (k + 1) yk+1 h yk+1 h,λ 4λ 1 + λ X0 X = (k + 1) 4λ 1 + λ X0 X . Finally, from the relation between xk and yk in APPM we have xk+1 h xk+1 h,λ = 핁h픸(yk+1 h ) 핁h픸λ(yk+1 h,λ ) 핁h픸λ(yk+1 h ) 핁h픸λ(yk+1 h,λ ) + 핁h픸(yk+1 h ) 핁h픸λ(yk+1 h ) yk+1 h yk+1 h,λ + 2λ 1 + λ yk+1 h 핁h픸(yk+1 h ) 4λ 1 + λ X0 X + 2λ 1 + λ X0 X = 2 + 1 k + 1 2λ 1 + λ X0 X = O(λ). Xλ,δ (2hk) Xk λ,δ = O he2Lλ,δT From (23), we can consider Xλ,δ as of function F : Rn [0, ) Rn defined as below F(X, t) = 픸λ(X) 1 δ (X X0) 0 t < δ 픸λ(X) 1 t (X X0) t > δ. (25) Note F is 2 max 1 δ -Lipschitz with respect to X. For convenience name α = 2h. Define ϵk := Xλ,δ (αk) Xk λ,δ. 
By definition of Euler discretization and from fundamental theorem of calculus, we have the following Xk+1 λ,δ = Xk λ,δ + αF(Xk λ,δ, αk) Xλ,δ(α(k + 1)) = Xλ,δ(αk) + Z α(k+1) αk Xλ,δ(t)dt = Xλ,δ(αk) + Z α(k+1) Xλ,δ(αk)) + Z t αk Xλ,δ(s)ds dt = Xλ,δ(αk) + αF(Xλ,δ(αk), αk) + Z α(k+1) αk Xλ,δ(s) ds dt. From Lemma B.4 we have F(Xλ,δ, t) = Xλ,δ(t) is Lipschitz continuous respect to t, so Xλ,δ is defined almost everywhere and fundamental theorem of calculus is valid. Therefore we have ϵk+1 = Xλ,δ(α(k + 1)) Xk+1 λ,δ = Xλ,δ(αk) Xk λ,δ + α F(Xλ,δ(αk), αk) F(Xk λ,δ, αk) + Z α(k+1) αk Xλ,δ(s) ds dt As F is 2max 1 δ -Lipschitz with respect to first variable, we have ϵk+1 Xλ,δ(αk) Xk λ,δ + α F(Xλ,δ(αk), αk) F(Xk λ,δ, αk) + Z α(k+1) Xλ,δ(s) ds dt 1 + 2α max 1 ϵk + Z α(k+1) Xλ,δ(s) ds dt. Now we observe Xλ,δ is bounded. By differentiating Xλ,δ, as dt픸λ(Xλ,δ) + 1 dt픸λ(Xλ,δ) + 1 픸λ(Xλ,δ) Xλ,δ 1 for t > δ, we have for almost every t dt픸λ(Xλ,δ) + 1 δ Xλ,δ 0 t < δ d dt픸λ(Xλ,δ) 1 t 픸λ(Xλ,δ) 2 t Xλ,δ t δ . Considering γ = 1, p = 1 to Lemma B.6 and from Lemma B.10 (iv) we have Xλ,δ(t) 2 m(픸(X0)) , 픸λ(Xλ,δ(t)) = Xλ,δ(t) + 1 max {δ, t}(Xλ,δ(t) X0) Xλ,δ(t) + 1 max {δ, t} Xλ,δ(t) X0 Xλ,δ(t) + 1 max {δ, t} Xλ,δ(s) ds 2 2 m(픸(X0)) . And since 픸λ is 1 λ-Lipschitz, we know 픸λ Xλ,δ is 1 2 m(픸(X0)) -Lispchitz, thus we have for almost all t, d dt픸λ(Xλ,δ(t)) Applying these facts we have Xλ,δ(t) d dt픸λ(Xλ,δ(t)) + 1 δ 픸λ(Xλ,δ(t)) + 2 2 δ m(픸(X0)) 2 max 2 δ m(픸(X0)) ϵk+1 1 + 2α max 1 2 δ m(픸(X0)) Now for Lδ,λ = max n 1 λ, 2 δ m(픸(X0)) o , we show ϵk he2Lλ,δT . Multiplying (1 + 2αLλ,δ) (k+1) to (26) we have (1 + 2αLλ,δ) (k+1) ϵk+1 (1 + 2αLλ,δ) k ϵk + (1 + 2αLλ,δ) (k+1) Lλ,δα2 As ϵ0 = X0 X(0) = 0, summing up from 0 to k 1 we have (1 + 2αLλ,δ) k ϵk i=1 (1 + 2αLλ,δ) i Lλ,δα2 = (1 + 2αLλ,δ) 1 1 (1 + 2αLλ,δ) k 1 (1 + 2αLλ,δ) 1 Lλ,δα2 1 (1 + 2αLλ,δ) k α. Multiplying (1 + 2αLλ,δ)k to both sides and applying α = 2h we have (1 + 2αLλ,δ)k 1 = h (1 + 4h Lλ,δ)k 1 . (27) (1 + 4h Lλ,δ)k (1 + 4h Lλ,δ) 1 4h Lλ,δ 4h Lλ,δk e4h Lλ,δk, applying k T ϵk h e4h Lλ,δk 1 he2Lλ,δT . Therefore Xλ,δ (2hk) Xk λ,δ he2Lλ,δT = O he2Lλ,δT . (28) D.3 Proof for case (v) of Lemma D.3 As APPM has coefficient 1 k+1, we consider Xk+1 λ,δ instead of Xk λ,δ due to calculation simplicity. From triangular inequality, we have Xk λ,δ xk λ Xk λ,δ Xk+1 λ,δ + Xk+1 λ,δ xk λ . We will show Xk λ,δ Xk+1 λ,δ = O (h) + O h2Lλ,δe2Lλ,δT Xk+1 λ,δ xk λ = 3 T λ O h λe2Lλ,δT + O(h) + O h2 λ e2Lλ,δT + O e2δLλ,δ First one is simple. Since Xk+1 λ,δ = Xk λ,δ + 2h F(Xλ,δ, 2hk) and F is 2 max 1 δ -Lipschitz with respect to the first variable, we have Xk λ,δ Xk+1 λ,δ = 2h F(Xk λ,δ, 2hk) 2h F(Xλ,δ(2hk), 2hk) + 2h F(Xλ,δ(2hk), 2hk) F(Xk λ,δ, 2hk) 2h Xλ,δ(2hk) + 4h max 1 Xλ,δ(2hk) Xk λ,δ 2h m(픸λ(X0)) + 4h2Lλ,δe2Lλ,δT = O (h) + O h2Lλ,δe2Lλ,δT . Second one is complicated, we present our proof with dividing steps to subsections. D.3.1 Recursive inequality for ϵk = Xk+1 λ,δ xk h,λ Define ϵk = Xk+1 λ,δ xk h,λ . Recall, Xλ,δ was solution of approximated ODE Xλ,δ(t) = F(Xλ,δ, t) = 픸λ(Xλ,δ)(t) 1 δ (X(t) X0) 0 t < δ 픸λ(Xλ,δ)(t) 1 t (X(t) X0) t δ . We now wish to write ϵk+1 in terms of ϵk. As ϵk+1 involves Xk+2 λ,δ , we first write it explicitly. Xk+2 λ,δ = Xk+1 λ,δ + 2h F Xk+1 λ,δ , 2h(k + 1) Xk+1 λ,δ 2h픸λ(Xk+1 λ,δ ) + 2h δ (Xk+1 λ,δ X0) 0 k + 1 < δ 2h Xk+1 λ,δ 2h픸λ(Xk+1 λ,δ ) + 1 k+1(Xk+1 λ,δ X0) k + 1 δ 2h. Now we find recursive inequality considering two cases. 
(i) k + 1 δ 2h Recall APPM (5) was defined as xk h,λ = 핁h픸λyk 1 h,λ yk h,λ = k k + 1(2xk h,λ yk 1 h,λ ) + 1 k + 1X0, and substituting yk 1 h,λ = xk h,λ + h픸λ(xk h,λ), we get a one line expression xk+1 h,λ = k k + 1xk h,λ h 픸λ(xk+1 h,λ ) + k k + 1픸λ(xk h,λ) + 1 k + 1X0. Rewriting Xk+2 λ,δ to make easier to compare with above, Xk+2 λ,δ = k k + 1Xk+1 λ,δ h 픸λ(Xk+2 λ,δ ) + k k + 1픸λ(Xk+1 λ,δ ) + 1 k + 1X0 + h 픸λ(Xk+2 λ,δ ) 픸λ(Xk+1 λ,δ ) h k + 1픸λ(Xk+1 λ,δ ). λ-Lipschitz ϵk+1 k k + 1ϵk + h ϵk+1 + k k + 1ϵk Xk+2 λ,δ Xk+1 λ,δ + h k + 1 픸λ(Xk+1 λ,δ ) . Organizing, Xk+2 λ,δ Xk+1 λ,δ + 1 k + 1 픸λ(Xk+1 λ,δ ) | {z } :=ek (ii) k + 1 < δ 2h Rewriting Xk+2 λ,δ we have Xk+2 λ,δ = k k + 1Xk+1 λ,δ h 픸λ(Xk+2 λ,δ ) + k k + 1픸λ(Xk+1 λ,δ ) + 1 k + 1X0 + h 픸λ(Xk+2 λ,δ ) 픸λ(Xk+1 λ,δ ) + h k + 1픸λ(Xk+1 λ,δ ) + 1 k + 1 2h (Xk+1 λ,δ X0). The only difference between the case (i) is the last term. Therefore with same calculation, we get below inequality. Xk+2 λ,δ Xk+1 λ,δ + 1 k + 1 픸λ(Xk+1 λ,δ ) + 1 k + 1 2h Xk+1 λ,δ X0 | {z } :=ek From case (i) and (ii), by defining h 1 λ Xk+2 λ,δ Xk+1 λ,δ + 1 k+1 픸λ(Xk+1 λ,δ ) + 1 k+1 2h δ Xk+1 λ,δ X0 0 k + 1 < δ 2h h 1 λ Xk+2 λ,δ Xk+1 λ,δ + 1 k+1 픸λ(Xk+1 λ,δ ) k + 1 δ 2h, we can write a recursive inequality for all k as below D.3.2 sup0 i T 2h ϵi = 3 T λ O h λe2Lλ,δT + O(h) + O h2 λ e2Lλ,δT + O e2δLλ,δ We will now sum up above inequality. Multiplying (k + 1) 1 h k+1 both sides we have 1 h (k + 1)ϵk+1 λ k (k + 1)ek. By summing up above inequality from 0 to k 1, we have 1 h λ i (i + 1)ei. Note ϵ0 vanished as it is multiplied with 0. Reorganizing with respect to ϵk we have λ i (i + 1)ei i=0 (i + 1)ei = i=0 (i + 1)ei For second inequality follows from the fact 0 < h λ < 1 2, which implies 0 < (1 h λ) i 1 and 1 < 1+ h λ . Observe f(x) = 1+x 1 x 1 x is nondecreasing in x (0, 1) since x x2 1 log 1+x 1 x + 2x x2 (x2 1) . Therefore, from f 1 2 = 9 we have i=0 (i + 1)ei = 3 T λ 1 i=0 (i + 1)ei. Now we show i=0 (i + 1)ei = O h λe2Lλ,δT + O(h) + O h2 λ e2Lλ,δT + O (δ) + O e2δLλ,δ for 0 < T < , 0 k < T 2h. Name Nk = min k, δ 2h . From the definition of ei we have i=0 (i + 1)ei i=0 h(i + 1) 1 Xi+2 λ,δ Xi+1 λ,δ + 1 i + 1 픸λ(Xi+1 λ,δ ) + 1 i=0 (i + 1) 1 i + 1 2h Xi+1 λ,δ X0 Xi+2 λ,δ Xi+1 λ,δ + 1 픸λ(Xi+1 λ,δ ) + 1 Xi+1 λ,δ X0 , where inequality follows from the fact i+1 k 1 for 0 i k 1 and 1 i+1 2h δ for 0 i Nk 1. Now let s observe each term. From Lemma B.12 and Lemma D.1 we know Xλ,δ(t) 픸λ(X(t)) 픸λ(X0) m(픸(X0)) Xλ,δ(t) X0 2 X0 X . Name M = max 2 m(픸(X0)) , 2 X0 X . (i) h λ Xi+2 λ,δ Xi+1 λ,δ First observe 1 λ Xλ,δ(2h(i + 2)) Xλ,δ(2h(i + 1)) 2h(i+1) Xλ,δ(t)dt Xλ,δ(t) dt h Xi+2 λ,δ Xi+1 λ,δ Xi+2 λ,δ Xλ,δ(2h(i + 2)) + Xλ,δ(2h(i + 2)) Xλ,δ(2h(i + 1)) + Xλ,δ(2h(i + 1)) Xi+1 λ,δ i=0 he2Lλ,δT ! λ TM + Te2Lλ,δT . Xi+2 λ,δ Xi+1 λ,δ = O h (ii) h k 픸λ(Xi+1 λ,δ ) Since 픸λ is 1 λ-Lipschitz continuous and from (28) 픸λ(Xi+1 λ,δ ) h 픸λ(Xi λ,δ) 픸λ(Xλ,δ(2hi)) + 픸λ(Xλ,δ(2hi)) λe2Lλ,δT + M = h h λe2Lλ,δT + M . 픸λ(Xi+1 λ,δ ) = O(h) + O h2 λ e2Lλ,δT . (iii) 1 k PNk 1 i=0 Xi+1 λ,δ X0 for Nk = min k, δ 2h . First, observe X0 Xλ,δ(t) = 0 Xλ,δ(s)ds Z t Xλ,δ(s) ds t M. From (27) Xi+1 λ,δ X0 Xi+1 λ,δ Xλ,δ(2h(i + 1)) + Xλ,δ(2h(i + 1)) X0 h (1 + 4h Lλ,δ)i+1 + 2h(i + 1)M. 
Now as i + 1 Nk = min k, δ Xi+1 λ,δ X0 1 i=0 h (1 + 4h Lλ,δ)i+1 + 2(i + 1)M k (1 + 4h Lλ,δ)Nk 1 4h Lλ,δ + 2h M k Lλ,δ + 2h Nk M Lλ,δ + 2δM = O e2δLλ,δ From (i), (ii), (iii) we have i=0 (i + 1)ei = O h λe2Lλ,δT + O(h) + O h2 λ e2Lλ,δT + O e2δLλ,δ ϵk = 3 T λ O h λe2Lλ,δT + O(h) + O h2 λ e2Lλ,δT + O e2δLλ,δ D.4 Derivation of ODE (4) from EAG and FEG D.4.1 Derivation from EAG For L-Lipschitz continuous monotone operator 픸and stepsize h > 0, EAG-C [75] is defined as 2 = zk 1 k + 2 zk z0 h픸(zk) zk+1 = zk 1 k + 2 zk z0 h픸(zk+ 1 Dividing the second line by h and reorganizing we have h = 픸(zk+ 1 2 ) 1 h(k + 2) zk z0 = 픸(zk) 1 h(k + 2) zk z0 픸(zk+ 1 2 ) 픸(zk) . Identify hk = t, zk = X(t), z0 = X0. As zk is a converging sequence [76], it is bounded. Thus we see 픸(zk+ 1 2 ) 픸(zk) L zk+ 1 = L h h(k + 2)(zk z0) + h픸(zk) = Lh 1 t + 2h(X(t) X0) + 픸(X(t)) = O (h) . Therefore taking limit h 0+ we have X(t) = 픸(X(t)) 1 t (X(t) X0). D.4.2 Derivation from FEG For L-Lipschitz continuous monotone operator 픸and stepsize h > 0, FEG [42] is defined as 2 = zk 1 k + 1 zk z0 k k + 1h픸(zk) zk+1 = zk 1 k + 1 zk z0 h픸(zk+ 1 Dividing the second line by h and reorganizing we have h = 픸(zk+ 1 2 ) 1 h(k + 1) zk z0 = 픸(zk) 1 h(k + 1) zk z0 픸(zk+ 1 2 ) 픸(zk) . Identify hk = t, zk = X(t), z0 = X0. As zk is a converging sequence [76], it is bounded. Thus we see 픸(zk+ 1 2 ) 픸(zk) L zk+ 1 = L h h(k + 1)(zk z0) + h hk h(k + 1)픸(zk) = Lh 1 t + h(X(t) X0) + t t + h픸(X(t)) = O (h) . Therefore taking limit h 0+ we have X(t) = 픸(X(t)) 1 t (X(t) X0). E Proof of convergence analysis for monotone 픸 E.1 Extending 픸to [0, ) Lemma E.1. Suppose 픸is a maximal monotone operator. Let X be the solution for (3). Define S as S = n t [0, ) | X(t) 픸(X(t)) β(t)(X(t) X0) is true o . Then as X is the solution for (3), we know [0, )\S is of measure zero. Define 픸(X): S Rn as 픸(X)(t) = X(t) β(t)(X(t) X0) (29) and denote 픸(X(t)) = 픸(X)(t). Assume for every T > 0, there is M > 0 such that for all t S [0, T] 픸(X(t)) M. Then 픸(X) can be extended to t [0, ) with satisfying following properties. (i) 픸(X(t)) 픸(X(t)) for all t [0, ). (ii) For t [0, )\S, there is a sequence {tk}k N such that tk S, limk tk = t and 픸(X(tk)) converges to 픸(X(t)). Proof. Take t [0, ), and take a sequence {tn}n N such that tn S and limn tn = t. As a converging sequence tn is bounded, there is T > 0 such that tn [0, T]. For that T, we have 픸(X(tn)) M by the assumption. As n 7 픸(X(tn)) is a bounded sequence, there is a subsequence {tnk}k N such that k 7 픸(X(tnk)) converges. Name the limit as u = limk 픸(X(tnk)). On the other hand, as X is a continuous curve we have limk X(tnk) = X(t). Then since 픸 is maximally monotone, from [15, Proposition 20.38] we have (X(t), u) Gra 픸. Defining 픸(X(t)) as u, we get the desired result. Corollary E.2. Let X be the solution for (6). Then 픸(X) defined as (29) has extension with properties stated in Lemma E.1. Proof. First consider the case β(t) = γ tp , p > 0, γ > 0. Take T > 0. It is enough to show there is M > 0 such that for t S [0, T] 픸(X(t)) M. Recall from Proposition B.11 for Yosida approximation 픸λ, we denoted Xλ as the solution of Xλ = 픸λ(Xλ) γ tp (Xλ X0), and we have shown there is a sequence {λn}n N such that Xλn uniformly converges to X on [0, T]. 
From Lemma B.12, we see for h = 0 Xλ(t + h) Xλ(t) R t+h t Mdot(T)ds h = Mdot(T), thus X(t + h) X(t) Xλ(t + h) Xλ(t) Therefore for t S [0, T], X(t) = limh 0 X(t+h) X(t) h holds, we conclude X(t) = lim h 0 X(t + h) X(t) And also from Lemma B.12, for t [0, T] we have γ tp (Xλ(t) X0) = Xλ(t) + 픸λ(t) Xλ(t) + 픸λ(t) Mdot(T) + M픸(T), therefore γ tp (X(t) X0) = lim λ 0+ tp (Xλ(t) X0) Mdot(T) + M픸(T). Gathering the result, for t S [0, T] we have 픸(X(t)) = X(t) + γ tp (X(t) X0) X(t) + γ tp (X(t) X0) 2Mdot(T) + M픸(T). E.2 Proof of Proposition 3.2 We prove this theorem by deriving the energy with dilated coordinate W(t) = C(t)(X(t) X0) and conservation law from [67]. Think of dilated coordinate W(t) = C(t)(X(t) X0). As we re considering the case 픸is Lipschitz continuous, the differential inclusion (3) becomes ODE X(t) = 픸(X(t)) β(t)(X(t) X0). Rewriting the ODE in terms of W, we have W(t) β(t)W(t) = 픸(X(W(t), t)) β(t) where X(W(t), t) = X(t) = W (t) C(t) + X0. Organizing, we have 0 = W(t) + C(t) 픸(X(W(t), t)). From Lemma B.4 we know 픸(X(W, t)) is differentiable almost everywhere, by differentiating we obtain second order ODE which holds almost everywhere 0 = W + C(t)β(t) 픸(X(W(t), t)) + C(t) d dt 픸(X(W(t), t)) = W β(t) W + C(t) d dt 픸(X(W(t), t)). (30) Now by taking inner product with W and integrating, we get equality which was refered as conservation law in [67] t0 β(s) W(s) 2 ds + Z t ds 픸(X(s)), W(s) ds. Since W(t) = C(t) X(t) + β(t)(X(t) X0) , we can rewrite the integrand in the last term as Z t ds 픸(X(s)), W(s) ds = Z t ds 픸(X(s)), X(s) ds + Z t ds 픸(X(s)), C(s)β(s)W(s) ds. Note the purpose was to obtain the first term, which is nonnegative due to monotonicity of 픸. Now taking integration by parts to the second term we have Z t ds 픸(X(s)), C(s)β(s)W(s) ds = 픸(X(W, s)), C(s)β(s)W(s) t D 픸(X(W, s)), C(s)β(s)2 + C(s) β(s) W(s) + C(s)β(s) W(s) E ds = 픸(X(W, s)), C(s)β(s)W(s) t t0 β(s) W(s) 2 ds + Z t β(s)2 + β(s) D W(s), W(s) E ds = 픸(X(W, s)), C(s)β(s)W(s) t t0 β(s) W(s) 2 ds + β(s)2 + β(s) 1 2β(s) β(s) + β(s) W(s) 2 ds On the second equality, we used the fact W(t) = C(t) 픸(X(W(t), t)). Note the fundamental theorem of calculus for 픸(X(W, s)), C(s)β(s)W(s) is valid since 픸(X(W, s)) is Lipschitz continuous and C(s)β(s)W(s) is continuously differentiable in [t0, t], so their inner product is absolutely continuous in [t0, t]. Observe the integrand in the last term can be rewritten as 2β(s) β(s) + β(s) W(s) 2 = C(s)2 2β(s) β(s) + β(s) X(s) X0 2 = d C(s)2 β(s) X(s) X0 2 . Now gathering the results, we conclude t0 β(s) W(s) 2 ds + Z t ds 픸(X(s)), W(s) ds 픸(X(t)) 2 + 픸(X(W(s), s)), C(s)β(s)W(s) t t0 + β(s)2 + β(s) 1 ds 픸(X(s)), X(s) ds 1 C(s)2 β(s) X(s) X0 2 ds 픸(X(t)) 2 + 2β(t) 픸(X(t)), X(t) X0 + β(t)2 + β(t) X(t) X0 2 ds 픸(X(s)), X(s) ds 1 C(s)2 β(s) X(s) X0 2 ds 2β(t0) 픸(X(t0)), X(t0) X0 + β(t0)2 + β(t0) X(t0) X0 2 | {z } =constant Moving the constant terms to left hand side and naming E = E1 constant, we get the desired result. E.3 Proof of Corollary 3.3 E.3.1 V is nonincreasing when 픸is Lipshitz continuous monotone We first check it is true for the case 픸is Lipschitz continuous. Then from Proposition 3.2 we can write V (t) as V (t) = E 2 Z t ds 픸(X(s)), X(s) ds with E in Proposition 3.2. As 픸is monotone, from (1) we know C(s)2 D d ds 픸(X(s)), X(s) E 0 holds almost everywhere. Therefore for h > 0 V (t + h) V (t) = 2 Z t+h ds 픸(X(s)), X(s) ds 0, we see V is a nonincreasing function. Therefore for all t > 0, V (t) limϵ 0+ V (ϵ). It remains to show the limit limϵ 0+ V (ϵ) exists. 
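As an informal numerical illustration of this monotonicity (a sketch of our own, not used in the proofs): for β(t) = 1/t, i.e. γ = p = 1, we have C(t) = t, β(t)² + β̇(t) = 0, and C(t)²β̇(t) ≡ −1, so the integral term drops and V reduces to V(t) = t²‖픸(X(t))‖² + 2t⟨픸(X(t)), X(t) − X0⟩. The snippet below integrates the anchor ODE for a Lipschitz monotone linear operator of our choosing and checks that this quantity does not increase along the trajectory; the operator, initial point, and step sizes are illustrative assumptions.

```python
# A minimal sketch of our own: check numerically that, for beta(t) = 1/t,
#   V(t) = t^2 ||A(X)||^2 + 2 t <A(X), X - X0>
# is nonincreasing along  X'(t) = -A(X(t)) - (1/t)(X(t) - X0)
# for the monotone, Lipschitz linear operator A(x) = M x below.
import numpy as np

M = np.array([[0.1, 1.0], [-1.0, 0.1]])   # skew part + small symmetric part => monotone
A = lambda x: M @ x
X0 = np.array([1.0, 0.0])

def V(t, x):
    a = A(x)
    return t * t * (a @ a) + 2.0 * t * (a @ (x - X0))

f = lambda t, x: -A(x) - (x - X0) / t      # anchor ODE vector field
t, dt, T = 1e-3, 1e-4, 10.0
x = X0 - 0.5 * t * A(X0)                   # first-order behavior near t = 0 (cf. (15)-(16))

vals = []
while t < T:
    k1 = f(t, x)                            # midpoint (RK2) step
    k2 = f(t + 0.5 * dt, x + 0.5 * dt * k1)
    x, t = x + dt * k2, t + dt
    vals.append(V(t, x))

print("max increase of V:", np.diff(vals).max())   # ~ 0 up to discretization error
```

The same experiment also reflects the resulting rate: since V(t) ≤ V(0) = 0 here, one recovers ‖픸(X(t))‖ ≤ (2/t)‖X(t) − X0‖ along the computed trajectory.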
E.3.2 Calculation of V (0) = limt 0+ V (t) for Lipshitz continuous monotone 픸 In this section, we calculate limt 0+ V (t) when 픸is Lipshitz continuous. Recall V was defined as V (t) = C(t)2 픸(X(t)) 2 + 2β(t) 픸(X(t)), X(t) X0 + β(t)2 + β(t) X(t) X0 2 X(s) X0 2 ds. From now we denote V for the case t0 = 0 as V 0. We first check ( tγ p = 1 e γ 1 p t1 p p > 0, p = 1. (31) Observing Z t s ds = γ log t Z t γ sp ds = γ 1 pt1 p for 0 < p < 1 Z t γ sp ds = γ 1 pt1 p for p > 1, we see C(t) defined above agrees with the definition of C(t) for each case. Note lim t 0+ C(t) = 0 p 1 1 0 < p < 1. Now we will show lim t 0+ V 0(t) = lim t 0+ C(t)2 픸(X(t)) 2 = ( 0 if p 1 픸(X0) 2 2 if 0 < p < 1. (32) To do so, we first show for t > 0 X(s) X0 2 ds < . We first provide an elementary fact as a lemma. Lemma E.3. Let f : (0, ) R is a continuous function. Suppose there is q < 1, 0 < l < such that lim sup t 0+ |f(t)tq| < l. Then for t > 0, f L1([0, t], R). Proof. Since lim sups 0+ f(s)sq = l, there is ϵ (0, t) such that 0 < s < ϵ = |f(s)sq| < 2|l|. Then since q < 1 Z ϵ 0 |f(s)| ds = Z ϵ 0 |f(s)sq| 1 sq ds = 2|l| 0 = 2|l| 1 q ϵ1 q < . By the way as f(t) is continuous on [ϵ, t], M = maxs [ϵ,t] |f(s)| exists. Therefore, Z t 0 |f(s)|ds = Z ϵ 0 |f(s)|ds + Z t ϵ |f(s)|ds 2|l| 1 q ϵ1 q + M(t ϵ) < . Applying Lemma E.3, we will show for t > 0 s2 d C(s)2 β(s) L1([0, t], R). (33) C(s)2 β(s) = 2C(s)2β(s) β(s) + C(s)2 β(s) = C(s)2 2pγ2 s2p+1 + p(1 + p)γ (i) p > 1 We first show lims 0+ C(s)2 1 sn = 0 for all n > 0. Take n > 0. Then there is some k N such that k(p 1) > n. With change of variable u = 1 s and L Hóptial s rule we see lim s 0+ C(s)2 1 sn = lim u un 2γ p 1 up 1 = lim u nun 1 2γ p 1 up 1 = n 2γ lim u un+1 p 2γ p 1 up 1 = = Qk 1 m=0(n m(p 1)) (2γ)k lim u un k(p 1) 2γ p 1 up 1 = 0. ! = lim s 0+ s2p 1 + p(1 + p)γ By Lemma E.3, we conclude R t 0 ds C(s)2 β(s) (ii) p = 1 Since C(s) = sγ, we see s1 γ = lim s 0+ s2 γ(γ 1)s2γ 3 s1 γ = γ |γ 1| lim s 0+ sγ = 0. Since 1 γ < 1, by Lemma E.3 we conclude R t 0 ds C(s)2 β(s) (iii) 0 < p < 1 Since lims 0+ C(s) = lims 0+ e γ 1 p s1 p = 1, we see lim sup s 0+ sp = lim sup s 0+ C(s)2 2pγ2s1 p + p(1 + p)γ = p(1 + p)γ. Since p < 1, by Lemma E.3 we conclude R t 0 ds C(s)2 β(s) Naming the bound in Corollary B.8 as M(T), we know for 0 < s < T R s 0 M(T)du Therefore applying (33), for 0 < t < T we have X(s) X0 2 ds M(T)2 Z t Hence we know V 0(t) is well defined and since limt 0+ R t 0 d ds C(s)2 β(s) 2 X(s) X0 2 ds = 0, we have lim t 0+ V 0(t) = lim t 0+ C(t)2 픸(X(t)) 2 + 2β(t) 픸(X(t)), X(t) X0 + β(t)2 + β(t) X(t) X0 2 . Now we are ready to show the desired result. (i) p > 1 As we know X(t) X0 and 픸(X(t)) are bounded from Lemma D.1 and Corollary B.9. Therefore from (35) we have 0 = lim t 0+ C(t)2 = lim t 0+ C(t)2β(t) = lim t 0+ C(t)2 β(t)2 + β(t) , therefore limt 0+ V 0(t) = 0. (ii) 0 < p 1 As X(t) X0 t M(T) < for 0 < t < T, we see lim sup t 0+ C(t)2β(t) 픸(X(t)), X(t) X0 γ lim sup t 0+ C(t)2t1 p 픸(X(t)) M(T) = 0 lim sup t 0+ C(t)2 β(t)2 + β(t) X(t) X0 2 γ lim sup t 0+ C(t)2 γt2 2p t1 p M(T)2 = 0. lim t 0+ V 0(t) = lim t 0+ C(t)2 픸(X(t)) 2 = ( 0 if p = 1 픸(X0) 2 2 if 0 < p < 1. From (i) and (ii) we get the desired conclusion V 0(0) = lim t 0+ V 0(t) = ( 0 if p 1 픸(X0) 2 2 if 0 < p < 1, and therefore for general t0 0, V (0) = lim t 0+ V (t) = lim t 0+ V 0(t) Z t0 X(s) X0 2 ds is well-defined. E.3.3 V (t) limn Vλn(0) holds for t S and general maximal monotone 픸 Define S as defined in Lemma E.1. Take t S, let T > t. 
Let {λn}n N be a positive sequence λn that limn λn = 0, Xλn converges to X uniformly on [0, T] and Xλn converges weakly to X in L2([0, T], Rn). Recall existence of such sequence was gauranteed by Proposition B.11. Recall we denoted Xλ as the solution of the ODE (19). Denote Vλ as V for the case 픸= 픸λ, i.e. Vλ(t) = C(t)2 픸λ(Xλ(t)) 2 + 2β(t) 픸λ (Xλ(t)) , Xλ(t) X0 + β(t)2 + β(t) Xλ(t) X0 2 Xλ(t) X0 2 ds. Note equality X(t) = 픸(X(t)) β(t)(X(t) X0) holds since t S, therefore we have V (t) = C(t)2 X(t) 2 + β(t) X(t) X0 2 Z t X(s) X0 2 ds. The goal of this section is to show V (t) lim sup n Vλn(t) lim sup n Vλn(0) = lim n Vλn(0). (1) V (t) lim supn Vλn(t) First observe, from Lemma B.12 we know R s 0 Mdot(T)ds s = Mdot(T) holds for s T. Thus from (33) we have d ds ! Mdot(T)2 L1([0, T], R). Therefore applying dominated convergence theorem, we have for t0 [0, T] Xλn(s) X0 2 ds = Z t X(s) X0 2 ds. (36) From elementary analysis, we can easily check lim supn (an + bn) = lim supn an + limn bn holds when limn bn exists. Thus we have lim sup n Vλn(t) V (t) = C(t)2 Xλn(t) 2 X(t) 2 . Therefore it is suffices to show following lemma. Lemma E.4. Suppose for T > 0 and sequence {λn}n N, Xλn converges to X uniformly on [0, T] and Xλn converges weakly to X in L2([0, T], Rn). Let t S [0, T]. Then following inequality is true X(t) lim sup n X(s) lim supn Xλn(s) holds for almost every s [0, T]. Let D be a measurable subset D [0, T]. Since Xλn X in L2([0, T], Rn) and χD X L2([0, T], Rn), we have Z X(s) 2 ds = Z T D X(s), χD(s) X(s) E ds = lim n D Xλn(s), χD(s) X(s) E ds = lim sup n D Xλn(s), X(s) E ds lim sup n Xλn(s) X(s) ds. The inequality comes from the Cauchy Schwarz inequality. Now from Reverse Fatou s Lemma we have Xλn(s) X(s) ds Z D lim sup n Xλn(s) X(s) ds. Therefore combining two inequalities we have Z Xλn(s) X(s) X(s) ds 0. As D was arbitrary measurable subset of [0, T], we conclude for almost every s [0, T] lim sup n Xλn(s) X(s) X(s) 0 = X(s) lim sup n X(t) lim supn Xλn(t) for t S [0, T]. Let t S [0, T]. Then for h > 0 such that t + h < T, since S is measure zero, from (i) we have Z t+h X(s) ds Z t+h t lim sup n Now, for some a > 0 consider Uλn(s) = Xλn(s) 2 + γp sp+1 Xλn(s) X0 2 + Z s up+2 Xλn(u) X0 2 du | {z } =fn(s) Then from the proof Lemma B.5, we know Uλn(s) = 2 Xλn(s), d ds픸λn(X(s)) 2γ s(Xλn(s) X0) 2 0 holds for almost every s, thus Uλn is nonincreasing. By the way, as Xλn converges to X uniformly on [0, T], using dominated convergence theorem we have lim n fn(s) = γp sp+1 X(s) X0 2 + Z s up+2 X(u) X0 2 du. Denote f(s) = limn fn(s). Using above facts, for s [t, t + h] we have Xλn(s) = lim sup n Uλn(s) fn(s) Uλn(t) fn(s) = r lim sup n Uλn(t) f(s). (37) X(t + h) X(t) Z t+h X(s) ds Z t+h t lim sup n Xλn(s) ds Z t+h lim sup n Uλn(t) f(s)ds. lim supn Uλn(t) f(s) is a continuous function with respect to s. Thus we have lim h 0 1 h lim sup n Uλn(t) f(s)ds = r lim sup n Uλn(t) f(t) Xλn(t) 2 + lim n fn(t) f(t) = lim sup n Finally as t S, X(t) = limh 0 X(t+h) X(t) h exists. Therefore X(t) = lim h 0+ X(t + h) X(t) lim h 0+ 1 h lim sup n Uλn(t) f(s)ds = lim sup n we conclude the desired result. (2) lim supn Vλn(t) lim supn Vλn(0) As 픸λn is Lipschitz continuous, from Appendix E.3.1 we know Vλn is nonincreasing, and thus Vλn(t) lim ϵ 0+ Vλn(ϵ) = Vλn(0). Taking limsup both sides we get the desired result. (3) lim supn Vλn(0) = limn Vλn(0) From (32) and (ii) of Lemma B.10, we know lim n V 0 λn(0) = ( 0 if p 1 m(픸(X0)) 2 2 if 0 < p < 1. And applying (36) we have lim n Vλn(0) = lim n V 0 λn(0) Z t0 X(s) X0 2 ds. 
As the limit limn Vλn(0) exists, the limsup concides with the limit. E.3.4 Proof for general maximal monotone 픸 First, we show V (t) limn Vλn(0) holds for t [0, ). Next, we show limn Vλn(0) = limt 0+ V (t). Then we have V (t) lim n Vλn(0) = lim t 0+ V (t) = V (0), which is our desired result. (i) V (t) lim n Vλn(0) holds for t [0, ) Take t [0, ). We know the inequality is true when t S, thus assume t / S. Then from Lemma E.1, we know there is a sequence {tk}k N such that tk S, limk tk = t and 픸(X(tk)) converges to 픸(X(t)). As tk S, V (tk) limn Vλn(0) holds. Since limk V (tk) = V (t), taking limit k to the inequality we get the desired result. (ii) limn Vλn(0) = limt 0+ V (t) Take a sequence {tk}k N such that tk > 0 and limk tk = 0. We wish to show limk V (tk) = limn Vλn(0). Note the arguments in Appendix E.3.2 are valid when 픸(X(tk)) and X(tk) X0 tk are bounded for all k N. The boundedness of 픸(X(tk)) comes from Corollary E.2. And from Lemma B.12 we have tk = lim λ 0+ Xλ(tk) X0 tk lim λ 0+ tk Mdot(tk) sup k N Mdot(tk). Therefore applying the arguments in Appendix E.3.2 we have lim k V 0(tk) = lim k C(tk)2 픸(X(tk)) 2 . Thus it remains to show the limit on the right hand side exists and is equal to limn V 0 λn(0). For p 1, we know limk C(tk)2 = 0, thus limk C(tk)2 2 픸(X(tk)) 2 = 0 since 픸(X(tk)) is bounded. As limn V 0 λn(0) = 0 from (32), we re done. Now consider the case 0 < p < 1. Suppose 픸(X(tkl)) is a convergent subsequence of 픸(X(tk)). First observe from (i) we have V 0(tkl) lim n V 0 λn(0) = m(픸(X0)) 2 From above inequality, recalling liml C(tkl)2 = 1 we have lim l 픸(X(tkl)) 2 = lim l V 0(tkl) m(픸(X0)) 2 By the way as liml X(tkl) = X0 and 픸(X(tkl)) 픸(X(tkl)), by closed graph theorem we have liml 픸(X(tkl)) 픸(X0). As of m(픸(X0)) is the element in 픸(X0) with smallest norm, we have liml 픸(X(tkl)) = m(픸(X0)). As all convergent subsequence converges to the same limit m(픸(X0)), we conclude limk 픸(X(tk)) = m(픸(X0)). Therefore lim k V 0(tk) = lim k C(tk)2 픸(X(tk)) 2 = m(픸(X0)) 2 2 = lim n V 0 λn(0). As tk was arbitrary positive sequence converges to 0, we conclude limn V 0 λn(0) = limt 0+ V 0(t) = V 0(0). Therefore lim n Vλn(0) = V 0(0) Z t0 X(s) X0 2 ds = lim t 0+ V (t). E.4 Proof of Lemma 3.4 Most of the proof is done in the main text. Recall Φ(t) is defined as Φ(t) = 픸(X(t)) 2 + 2β(t) 픸(X(t)), X(t) X0 , and from (8) we have 픸(X(t)) 2 2β(t)2 X0 X 2 . On the other hand, 2V (t) C(t)2 Φ(t) = β(t)2 + β(t) X(t) X0 2 2 C(t)2 X(s) X0 2 ds. Note C(t) = 0 if t > 0, we can divide with C(t)2. Now from Corollary 3.3 we have we have V (t) V (0) for t > 0 , and so V (t) C(t)2 V (0) C(t)2 0. Thus for t > ϵ C(t)2 β(t)2 + β(t) X(t) X0 2 + 2 C(t)2 X(s) X0 2 ds C(t)2 2V (t) C(t)2 2V (t) + Φ(t) Φ(t) 1 픸(X(t)) 2 2β(t)2 X0 X 2 . Moving 2β(t)2 X0 X 2 to the left hand side and multiplying with 2 we get the desired result 픸(X(t)) 2 4β(t)2 X0 X 2 + 4V (0) C(t)2 2 β(t)2 + β(t) X(t) X0 2 C(s)2 β(s) X(s) X0 2 ds. E.5 Proof of Theorem 3.1 Restating Lemma 3.4, we know for t > 0, 픸(X(t)) 2 4β(t)2 X0 X 2 + 2V (0) C(t)2 2 β(t)2 + β(t) X(t) X0 2 (7) C(s)2 β(s) X(s) X0 2 ds. As we know X(t) X0 2 X0 X from Lemma D.1, it is clear that 4β(t)2 X0 X 2 = O β(t)2 C(t)2 = O 1 C(t)2 2 β(t)2 + β(t) X(t) X0 2 = O β(t)2 + O β(t) . Therefore it remains to show 2 C(t)2 C(s)2 β(s) X(s) X0 2 ds = O β(t)2 + O 1 C(t)2 We can check there is T > 0 such that for s > T the sign of d ds C(s)2 β(s) does not change, for β(t) = γ tp with γ > 0, p > 0. We first proceed our proof with assuming this condition. 
The point for this condition is that following equality holds for t > T. C(s)2 β(s) ds = Z t C(s)2 β(s) ds. Applying above equality and using X(t) X0 2 X0 X , we see 2 C(t)2 C(s)2 β(s) X(s) X0 2 ds C(s)2 β(s) X(s) X0 2 ds C(s)2 β(s) X(s) X0 2 ds C(t)2 + 2 C(t)2 C(s)2 β(s) 4 X0 X 2 ds C(t)2 + 8 X0 X 2 C(t)2 β(t) C(T)2 β(T) M + 4 X0 X 2 C(T)2 β(T) + 8 X0 X 2 β(t) = O 1 C(t)2 Therefore from (7) and (38), we conclude 픸(X(t)) 2 = O β(t)2 + O 1 C(t)2 It remains to show there is T > 0 such that for s > T the sign of d ds C(s)2 β(s) does not change. It can be shown by easy, but a little complicated calculations. Since C(t) = C(t)β(t), we see d ds C(s)2 β(s) = 2C(s)2β(s) β(s) + C(s)2 β(s) = C(s)2 2β(s) β(s) + β(s) . Therefore it is enough to check the sign of 2β(s) β(s) + β(s). Observe 2β(s) β(s) + β(s) = 2pγ2 s2p+1 + p(1 + p)γ s3 γ(1 γ) if p = 1 s2p+1 sp 1 2γ p+1 if p = 1. Thus when p = 1, for all s > 0 it is nonpositive if γ 1 and nonnegative if 0 < γ < 1. When p = 1, we see lim s 0+ sp 1 = if 0 < p < 1 0 if p > 1 , lim s sp 1 = 0 if 0 < p < 1 if p > 1 Therefore by intermediate value theorem there is T > 0 such that T p 1 2γ p+1 = 0, and for that T we have sp 1 2γ p+1 is nonpositive for s > T if 0 < p < 1 and nonnegative for s > T if p > 1. This concludes the proof. E.5.1 Proof for the convergence rate in Table 1 Recall, from (31) we have ( tγ p = 1 e γ 1 p t1 p p > 0, p = 1. We now observe β(t) does not effect to convergence rate. In other words, we show β(t) is not the slowest one that goes to zero compared to β(t)2 and 1 C(t)2 for every case. (i) 0 < p 1 Comparing β(t) = pγ tp+1 with β(t)2 = γ2 t2p , we see β(t) = O β(t)2 when 0 < p 1. (ii) p > 1 For p > 1, we see limt 1 C(t)2 = 1 = 0. As limt β(t) = limt pγ tp+1 = 0, we have β(t) = O 1 C(t)2 . From (i) and (ii), we conclude 픸(X(t)) 2 = O β(t)2 + O 1 C(t)2 Now the results in Table 1 is straightfoward. p = 1 When p = 1, we have C(t) = tγ. Comparing 1 C(t)2 = 1 t2γ and β(t)2 = γ2 픸(X(t)) 2 = O β(t)2 = O 1 픸(X(t)) 2 = O 1 C(t)2 = O 1 t2γ if 0 < γ < 1. 0 < p < 1 When 0 < p < 1, we have lim t 1 β(t)2 1 C(t)2 = 1 γ2 lim t t2 2γ 1 p t1 p = 0. Therefore, 픸(X(t)) 2 = O β(t)2 = O 1 t2p . p > 1 We observed previously that limt 1 C(t)2 = 1 = 0 when p > 1. Therefore 픸(X(t)) 2 = O 1 C(t)2 = O (1). E.6 Proof of Theorem 3.5 Name X = ΠZer 픸(X0) = argmin z Zer픸 z X0 . We first show X(t) X , X0 X 0. Proof by contradiction. Suppose not. Then there is ϵ > 0 such that for every k N, there is nk [0, ) such that X(nk) X , X0 X > ϵ By the way, from Lemma D.1 we know X(nk) B X0 X ( X ). Thus by Bolzano-Weierstrass theorem, there is a converging subseqeunce {X nk(l) }l (N). Name the limit as X . Since limt 픸(X(t)) = 0, { 픸(X(nk(l))}l (N) converges to 0. Then from [15, Proposition 20.38], (X , 0) Gra 픸, i.e. X Zer픸. On the other hand, as 픸is maximal monotone, by [15, Proposition 23.39] Zer 픸is closed and convex. Since X = ΠZer 픸(X0), by [15, Theorem 3.16] we have X X , X0 X 0. Thus 0 < ϵ lim l X nk(l) X , X0 X = X X , X0 X 0, this is a contradiction, therefore we get the desired result. Now we show limt X(t) X = 0. Recall, for t > 0 X 픸(X(t)) β(t)(X X0). From C(t) = C(t)β(t), and monotonicity of 픸, we observe C(t)2 X(t) X 2 = 2C(t)2 D X(t), X(t) X E + 2C(t)2β(t) X(t) X 2 = 2C(t)2 픸(X(t)) β(t)(X(t) X0), X(t) X + 2C(t)2β(t) X(t) X 2 2C(t)2β(t) X(t) X0, X(t) X + 2C(t)2β(t) X(t) X 2 = 2C(t)2β(t) X0 X , X(t) X . Now take ϵ. From lim supt X(t) X , X0 X 0, there is M > 0 such that t > M = X0 X , X(t) X < ϵ. 
For that M and t > M, integrating from M to t we get h C(s)2 X(s) X 2it M = C(t)2 X(t) X 2 C(M)2 X(M) X 2 M 2C(s)2β(s) X0 X , X(s) X ds 2C(s)2β(s) X0 X , X(s) X ds M 2C(s)2β(s)ϵ ds = ϵ C(s)2 t M = ϵ C(t)2 C(M)2 . By dividing C(t)2 and organizing, 2 X(M) X 2 . By the way from assumption, we have limt 1 C(t) = 0. Therefore we conclude lim t X(t) X 2 ϵ. Since ϵ > 0 was arbitrary, we get the desired result. F Proof for worst case examples The explicit solution for linear A is crucially used in worst case examples. Therefore, we first provide proof for it. F.1 Proof for explicit solution for linear A F.1.1 Proof of Lemma 4.1 Γ(n + γ + 1)Γ(γ + 1)X0 = X0 + Γ(n + γ + 1)Γ(γ + 1)X0, we can check X(0) = X0. Now by using the property of Gamma function Γ(x + 1) = xΓ(x), and paying attention to the lower bound of the summation index we have ( n A)( t A)n 1 Γ(n + γ + 1) Γ(γ + 1)X0 = A ( n γ + γ)( t A)n 1 Γ(n + γ + 1) Γ(γ + 1)X0 Γ(n + γ) Γ(γ + 1)X0 + γ Γ(n + γ + 1)Γ(γ + 1)X0 Γ(n + γ + 1)Γ(γ + 1)X0 + γ ( t) Γ(n + γ + 1)Γ(γ + 1)X0 = AX(t) γ t (X(t) X0) . The solution can be written in another form X(t) = γe t At γ Z t 0 uγ 1eu A du X0. (39) As this is a special case of (9), here we just briefly check this satisfies the ODE and check other details in the proof of Lemma 4.2. By product rule of differentiation, dte t A t γ Z t 0 uγ 1eu A du X0 + γe t A d 0 uγ 1eu A du X0 + γe t At γ d 0 uγ 1eu A du X0 = A(X(t)) γ t X(t) + γe t At γ tγ 1et A X0 = A(X(t)) γ t (X(t) X0) . F.1.2 Proof of Lemma 4.2 X (t) = I Ae t A 0 es AC(s)ds X0. We show X is a well-defined solution for (3) with linear 픸, then show X is equal to X(t) defined in (9). We first check well-definedness. By definition, C(t) = e R t t0 β(s)ds is nondecreasing. Also, et A is also nondecreasing since from monotonicity of A we have for all x Rn et Ax 2 = 2 A(et Ax), et Ax 0. Now we see X(t) is well-defined since e t A 1 C(t) 0 es AC(s)ds = 0 e(s t)A C(s) 0 (1 1)ds = t < . Also above inequality implies second term reaches to zero as t 0+, we have limt 0+ X (t) = X0. Thus defining X (0) = X0 if necessary, we see X satisfies the initial condition with X C([0, ), Rn). We then check X (t) becomes the solution. This is immediate from product rule for differentiation. Since d dt 1 C(t) = 1 C(t)2 C(t)β(t) = 1 C(t)β(t), we have dte t A 1 C(t) 0 es AC(s)ds X0 0 es AC(s)ds X0 Ae t A 1 C(t) d dt 0 es AC(s)ds X0 = A(X (t) X0) β(t)(X (t) X0) A(X0) = A(X(t)) β(t)(X (t) X0). Finally we now show X(t) = X (t), where X(t) is defined as (9). From integral by parts we have X(t) = e t A 1 C(t) 0 es AC(s)β(s)ds + C(0)I X0 = e t A 1 C(t) es AC(s) t 0 Z t 0 Aes AC(s)ds + C(0)I X0 = I Ae t A 1 C(t) 0 es AC(s)ds X0 = X (t). F.2 Proof of Theorem 4.3 If β(t) 0, for A := 0 1 1 0 , we have X(t) = e t A, so lim t A(X(t)) = 1 = 0. So let s consider the case β is not β(t) 0 and β(t) 0. For Aξ = 2πξ 0 1 1 0 name the solution for ODE Xξ = Aξ(Xξ) β(t)(Xξ X0) as Xξ. By Lemma 4.2 we have Xξ(t) = e t Aξ 0 es AξC(s)β(s)ds + C(0)I X0. We want to show, ξ R, lim t Aξ(Xξ(t)) = 0 From limt 1 C(t) = 0, we see limt e t Aξ C(t) = 1 1 C(t) = 0. And since Aξ is invertible except the case ξ = 0, we have lim t Aξ(Xξ(t)) = 0 lim t Xξ(t) = 0 0 es AξC(s)β(s)ds + C(0)I = 0. Now let assume ξ R, limt Aξ(Xξ(t)) = 0 and lead to contradiction. Define f : R R as f(s) = C(s)β(s) s > 0 0 s 0. Then R f(s)ds = [C(s)] 0 = limt C(t) C(0) < we have f L1(R, R). Note limt C(t) exists, since C is nondecreasing and by the assumption is bounded. 
Now setting X0 = (1, 0)T and from es Aξ = cos 2πξs sin 2πξs sin 2πξs cos 2πξs 0 es AξC(s)β(s)ds X0 = Z 0 cos(2πsξ)f(s)ds , Z 0 sin( 2πsξ)f(s)ds T = Re( ˆf(ξ)) , Im( ˆf(ξ)) T where ˆf(ξ) is Fourier transform of f. From assumption we have ξ R, R 0 es AξC(s)β(s)ds X0 + C(0)X0 = 0, we have Re( ˆf(ξ)) , Im( ˆf(ξ)) T C(0)X0 By the way, from Fourier theory we know ˆf vanishes at infinity , thus C(0)X0 = lim ξ Therefore we have ˆf(ξ) = 0 for all ξ R. Now ˆf 0 is clearly L1(R, R), we can apply Fourier inversion formula and conclude f 0 almost everywhere. However, f(s) = C(s)β(s) is not zero almost everywhere since β(s) is not constantly zero. Thus a contradiction, we get the desired result. F.3 Proof of Theorem 4.4 F.3.1 Proof for the case p = 1, γ 1 Recall from (39), we know X(t) = γe t At γ Z t 0 es Asγ 1ds X0. Without loss of generality, we consider the case X0 = (1, 0)T . (i) γ = 1 Plugging γ = 1 gives X(t) = e t A t A 1 es A t 0 X0 = 1 t A 1 I e t A (1, 0)T = 1 t A 1 (1 cos t, sin t)T Therefore lim t t A(X(t)) = lim t (1 cos t, sin t)T = 0. (ii) γ > 1 We will show limt t A(X(t)) = γ. With change of variable s = tv and integration by parts we have 1 t A(X(t)) = e t AAt (γ 1) Z t 0 es Asγ 1ds X0 = e t At A Z 1 0 etv Avγ 1dv X0 = e t A etv Avγ 1 1 0 X0 (γ 1)e t A Z 1 0 etv Avγ 2dv X0 = X0 (γ 1)e t A Z 1 0 cos(tv)vγ 2dv , Z 1 0 sin(tv)vγ 2dv T By the way, from γ > 1 we have vγ 2 L1[0, 1], by Riemann-Lebesgue lemma we have 0 cos(tv)vγ 2dv = lim t 0 sin(tv)vγ 2dv = 0. Observe as e t A is a rotation, we know e t A Z 1 0 cos(tv)vγ 2dv , Z 1 0 sin(tv)vγ 2dv = 0 cos(tv)vγ 2dv , Z 1 0 sin(tv)vγ 2dv . Taking limit t we know right hand side converges to zero, we conclude lim t e t A Z 1 0 cos(tv)vγ 2dv , Z 1 0 sin(tv)vγ 2dv = 0. lim t t A(X(t)) = γX0 γ(γ 1) lim t e t A Z 1 0 cos(tv)vγ 2dv , Z 1 0 sin(tv)vγ 2dv T F.3.2 Proof for the case p = 1, γ < 1 Since A, et A are invertible, we see lim t t2γ A(X(t)) 2 = 0 lim t tγ A(X(t)) = 0 lim t tγ 1 γ et AX(t) = 0. Therefore it is enough to observe 1 γ et AtγX(t). Again for X0 = (1, 0)T 1 γ et AtγX(t) = Z t 0 es Asγ 1ds X0 0 cos(s)sγ 1ds , Z t 0 sin(s)sγ 1ds T . Since sγ 1 decrease monotonically to zero, we may apply similar argument with alternative series test. Define an = R nπ (n 1)π sin(s)sγ 1ds and name Sn = Pn k=1 an. Then for m N, t > 2mπ we see 0 sin(s)sγ 1ds S2m 1. By the way S = limn Sn exists by alternative series test, thus we have 0 sin(s)sγ 1ds = S by squeeze theorem. Note S S2 > 0. With similar argument, we can conclude limt R t 0 cos(s)sγ 1ds also exists. Therefore we conclude limt tγX(t) = 0 since lim t tγX(t) = lim t γ 0 cos(s)sγ 1ds , Z t 0 sin(s)sγ 1ds T γS > 0. F.3.3 Proof for the case 0 < p < 1 Recall from (9), we have X(t) = e t A 0 es AC(s)β(s)ds + C(0)I X0. Our goal is to show lim t t2p A(X(t)) 2 = 0. We first observe, it is enough to show lim t 1 β(t)C(t) 0 C(s)β(s) sin s ds = 0. This follows from below facts. (1) Since β(t) = γ tp and A is linear and invertible, lim t t2p A(X(t)) 2 = lim t γ2 β(t)2 A(X(t)) 2 = 0 lim t 1 β(t)X(t) = 0. (2) Recall from (31), we have C(t) = e γ 1 p t1 p for β(t) = γ tp , so C(0) = limt 0+ C(t) = 1 and limt 1 β(t)C(t) = 0. Thus limt 1 β(t)C(t) C(0)I = 0, second term is ignorable. Therefore the problem reduces to 0 C(s)β(s)e s Ads X0 (3) Since et A is linear and invertible, problem reduces to 0 C(s)β(s)e s Ads X0 (4) Again without loss of generality, let X0 = (1, 0)T . Recalling e s A = cos s sin s sin s cos s , focusing on one component we see it is sufficient to show lim t 1 β(t)C(t) 0 C(s)β(s) sin s ds = 0. 
Now we prove the statement. For convenience, name D(t) = β(t)C(t) . We want to show D(t) sin s ds | {z } =S(t) We first observe lim t sup δ [0,δ] D(t + δ) D(t + π) 1 = 0. To do so, we first show limt D(t+δ) D(t+π) = 1 for δ [0, π]. Take δ [0, π]. Observe D(t + δ) D(t + π) = β(t + δ)C(t + δ) β(t + π)C(t + π) = t + π γ 1 p (t+δ)1 p γ 1 p (t+π)1 p = t + π γ 1 p (t+δ)1 p 1 ( t+π t+δ ) 1 p . Considering L ôspital s rule for the exponent, we see lim t (t + δ)1 p (t + δ)p 1 = lim t π(p 1)( t+π (p 1) (t + δ)p 2 = lim t π (t + π)p = 0. As the exponent reaches to zero, we conclude limt D(t+δ) D(t+π) = 1. D(t) = C(t) β(t)2 + β(t) = e γ 1 p t1 pγt 2p 1 (γt ptp) , we see D(t) is nondecreasing for t γ p 1 p 1 since 0 < p < 1. Therefore for t γ p 1 p 1 we have supδ [0,δ] D(t+δ) max n D(t) D(t+π) 1 , D(t+δ) D(t+π) 1 o , so it reaches to zero as t . Now we prove desired statement. Proof by contradiction. Suppose limt S(t) = 0. Then for 0 < ϵ < 1 2, there is T1 > 0 such that t > T1 implies |S(t)| < ϵ. Now for t > T1, observe 2ϵ > |S(t + π) S(t)| = D(s) D(t + π) sin s ds Z t D(t) sin s ds = 1 D(t + π) 0 D(s) sin s ds Z t 0 D(s) sin s ds + 1 D(t + π) 1 D(t) 0 D(s) sin s ds D(s) D(t + π) sin s ds D(t) D(t + π) 1 Z t D(t) sin s ds . On the other hand, there is T2 such that t > T2 implies supδ [0,π] D(t+δ) D(t+π) 1 < ϵ 2. Now for t = 2nπ > max{T1, T2}, D(s) D(t + π) sin s ds D(t) D(t + π) 1 |S(t)| > Z (2n+1)π 2nπ (1 ϵ) |sin s| ds ϵ = 2 2ϵ. This contradicts the fact ϵ < 1 2, we prove the assumption limt R t 0 S(t) = 0 is not true. Therefore limt R t 0 S(t) = 0. G Proof of convergence analysis for discrete counterpart G.1 Correspondence between discrete method in Theorem 5.1 and continuous model (6) To check the correspondence of the method with (6), we provide a informal derivation. Assume operator 픸is continuous. Then we have yk = xk+1 + 픸xk+1, by substituting yk and yk 1, the method can be equivalently written as xk+1 + 픸xk+1 = kp kp + γ (xk 픸xk) + γ kp + γ x0. (40) This can be considered as a special case of below method, with h = 1. xk+1 + h픸xk+1 = kp kp + h1 pγ (xk h픸xk) + h1 pγ kp + h1 pγ x0. Dividing by h both sides and reorganizing, we have h = 픸xk+1 hpkp hpkp + hγ 픸xk γ hpkp + hγ xk x0 . Identifying hk = t, xk = X(t) and taking h 0, we have X(t) = 2픸(X(t)) γ tp (X X0) . As monotonicity is preserved for scalar multiple, rescaling 2픸to 픸does not change the class of operators that the ODE covers. For notation simplicity, replacing 2픸to 픸we have X(t) = 픸(X(t)) γ tp (X X0) . G.2 Proof of boundedness of xk While proving Theorem 5.1, we need a upper bound for xk . Therefore we first prove following lemma. Lemma G.1. Let 픸be a maximal monotone operator. Consider a method xk = 핁픸yk 1 yk = (1 βk) (2xk yk 1) + βkx0 for k = 1, 2, . . . , with sequence {βk}k N, 0 βk 1 and initial condition y0 = x0 Rn. Then following holds xk x x0 x . for x Zer픸. And so, xk x0 2 x0 x . Proof. Recall from Lemma D.2, we know for 핋= ℝ픸= 2핁픸 핀given method is equivalent to below method yk+1 = βky0 + (1 βk) 핋yk. Note that x Zer 픸 0 픸x x = 핁픸x x = 핋x x Fix 핋. We first prove yk x y0 x by induction. The statement is trivially true when k = 0. Now suppose the statement is true for k N. Then yk+1 x = βk y0 x + (1 βk) 핋yk x βk y0 x + (1 βk) 핋yk x βk y0 x + (1 βk) yk x βk y0 x + (1 βk) y0 x = y0 x . The first inequality is just triangular inequality, the second inequality comes from the fact x Fix 핋and 핋is nonexpansive, and the last inequality is from induction hypothesis. By induction, we have yk x y0 x for k = 0, 1, . . . . 
Define 픸xk = yk 1 xk, then 픸xk 픸xk since 핁픸yk 1 = xk. Observe for k = 1, 2, . . . , from monotone inequality we have yk 1 x 2 = 픸xk + (xk x ) 2 = 픸xk 2 + 2 픸xk, xk x + xk x 2 xk x 2 . Therefore we conclude xk x yk 1 x y0 x = x0 x . The latter statement holds directly from triangular inequality, xk x0 xk x + x x0 2 x0 x . G.3 Proof of Theorem 5.1 The outline of the proofs originate from continuous proofs. To simplify calculations, instead of directly deriving discrete counterpart of Proposition 3.2, we consider rescaled conservation law for each cases. By considering dilated coordinate W1 = X X0 for p > 1, W2 = tp (X X0) for 0 < p < 1, W3 = t (X X0) for p = 1, γ 1 and W4 = tγ (X X0) for p = 1, 0 < γ < 1, we obtain below conservation laws. E1 픸(X(t)) 2 + 2γ tp 픸(X(t)), X(t) X0 + γ2 ds 픸(X(s)), X(s) ds + Z t s(X(s) X0) 2 ds + 1 sp+2 X(s) X0 2 ds E2 t2p 픸(X(t)) 2 + 2γtp 픸(X(t)), X(t) X0 + γ2 γptp 1 X(t) X0 2 ds 픸(X(s)), X(s) ds + Z t 0 2sp γ psp 1 X(s) 2 ds + Z t 0 γp(p 1)sp 2 X(s) X0 2ds E3 t2 픸(X(t)) 2 + 2γt 픸(X(t)), X(t) X0 + γ (γ 1) X(t) X0 2 ds 픸(X(s)), X(s) ds + Z t t0 2s (γ 1) X(s) 2 ds E4 t2γ 픸(X(t)) 2 + 2γt2γ 1 픸(X(t)), X(t) X0 + γ(γ 1)t2γ 2 X(t) X0 2 ds 픸(X(s)), X(s) ds + Z t 0 2γ(γ 1)s2γ 3 X(s) X0 2 ds. Lyapunov style proof can be obtained by considering below functions U1(t) = 픸(X(t)) 2 + 2γ tp 픸(X(t)), X(t) X0 + γ2 U2(t) = t2p 픸(X(t)) 2 + 2γtp 픸(X(t)), X(t) X0 + γ2 γptp 1 X(t) X0 2 U3(t) = t2 픸(X(t)) 2 + 2γt 픸(X(t)), X(t) X0 + γ (γ 1) X(t) X0 2 U4(t) = t2γ 픸(X(t)) 2 + γt2γ 1 픸(X(t)), X(t) X0 + γ(γ 1)t2γ 2 X(t) X0 2 . The main blocks of the calculations corresponds to continuous cases, but there are some errors in terms of 픸xk and xk x0 occur due to discretization. The proofs are done by showing these errors don t effect to the conclusions. (0) Preparation. For all cases, we will consider functions of the form U k = ak 픸xk 2 + bk 픸xk, xk x0 + ck xk x0 2 , (41) and consider U k+1 U k. As similar calculations will be repeated, we first organize the repeating calculations. That is, we prove following equality is true. U k+1 U k + λk 픸xk+1 픸xk, xk+1 xk τk 픸xk+1 픸xk, xk+1 xk (42) = ak+1 λk kp + γ 픸xk+1 2 + λk kp + bk+1 λk γ kp τk ck+1 kp + γ 픸xk+1, xk+1 x0 + (τk ck+1) 픸xk+1, xk x0 + (τk ck+1) 픸xk, xk+1 x0 + λk γ kp + γ bk τk ck+1 kp ck+1 γ kp xk+1 x0 2 ck+1 γ kp + γ xk x0 2 + (ck+1 ck) xk x0 2 . Observe, this can be shown by checking below equalities are true. 픸xk+1 픸xk, xk+1 xk (43) kp 픸xk+1 2 γ kp 픸xk+1, xk+1 x0 + kp kp + γ 픸xk 2 + γ kp + γ 픸xk, xk x0 픸xk+1 픸xk, xk+1 xk (44) = 픸xk+1, xk+1 x0 + 픸xk, xk x0 픸xk+1, xk x0 + 픸xk, xk+1 x0 ck+1 xk+1 x0 2 ck xk x0 2 (45) = ck+1 kp + γ kp 픸xk+1, xk+1 x0 ck+1 픸xk+1, xk x0 ck+1 픸xk, xk+1 x0 ck+1 kp kp + γ 픸xk, xk x0 ck+1 γ kp xk+1 x0 2 ck+1 γ kp + γ xk x0 2 + (ck+1 ck) xk x0 2 . Proof for (43) Recall, the method was defined as kp + γ (2xk yk 1) + γ kp + γ x0 xk+1 = 핁픸yk. By considering xk+1 + 픸xk+1 = yk, substituting yk and yk 1 we can rewrite the method as xk+1 + 픸xk+1 = 1 γ kp + γ (xk 픸xk) + γ kp + γ x0. (46) By multiplying kp + γ to both sides and reorganizing we have, (kp + γ) {(xk+1 x0) + 픸xk+1} = kp{(xk x0) 픸xk}. By subtracting both sides by kp(xk+1 x0) and (kp + γ)(xk x0) respectively, above equation can be rewritten as kp( 픸xk + (xk+1 xk)) = (kp + γ) 픸xk+1 + γ(xk+1 x0) (kp + γ)( 픸xk+1 + (xk+1 xk)) = kp 픸xk + γ(xk x0). 
From above, we get the following (kp + γ)kp 픸xk+1 픸xk, xk+1 xk = (kp + γ)kp 픸xk+1, xk+1 xk + (kp + γ)kp 픸xk+1, 픸xk (kp + γ)kp 픸xk, xk+1 xk (kp + γ)kp 픸xk+1, 픸xk = (kp + γ) 픸xk+1, kp( 픸xk + (xk+1 xk)) + kp 픸xk, (kp + γ)( 픸xk+1 + (xk+1 xk)) = (kp + γ) 픸xk+1, (kp + γ) 픸xk+1 + γ(xk+1 x0) + kp 픸xk, kp 픸xk + γ(xk x0) = (kp + γ) (kp + γ) 픸xk+1 2 + γ 픸xk+1, xk+1 x0 + kp 픸xk 2 + γ 픸xk, xk x0 . Now dividing both sides by kp(kp + γ), we get the desired result. Proof for (44) This can be checked by just expanding the inner product of left hand side. Proof for (45) First, observe ck+1 xk+1 x0 2 ck xk x0 2 = ck+1 xk+1 x0 2 xk x0 2 + (ck+1 ck) xk x0 2 = ck+1 xk+1 xk, xk+1 x0 + xk x0 + (ck+1 ck) xk x0 2 = ck+1 xk+1 xk, xk+1 x0 + xk+1 xk, xk x0 + (ck+1 ck) xk x0 2 . Reorganizing (40), we get two different expressions for xk+1 xk. xk+1 xk = 픸xk+1 + kp kp + γ 픸xk γ kp + γ xk x0 xk+1 xk = kp + γ kp 픸xk+1 + 픸xk γ kp xk+1 x0 . Now plugging these to previous equality, be get the desired result. ck+1 xk+1 x0 2 ck xk x0 2 kp 픸xk+1 + 픸xk + γ kp xk+1 x0 , xk+1 x0 + 픸xk+1 + kp kp + γ 픸xk + γ kp + γ xk x0 , xk x0 + (ck+1 ck) xk x0 2 = ck+1 kp + γ kp 픸xk+1, xk+1 x0 ck+1 픸xk+1, xk x0 ck+1 픸xk, xk+1 x0 ck+1 kp kp + γ 픸xk, xk x0 ck+1 γ kp xk+1 x0 2 ck+1 γ kp + γ xk x0 2 + (ck+1 ck) xk x0 2 픸(xk) 2 = O(1) for p > 0, γ > 0. Plugging ak = 1 + γ 2kp , bk = γ kp , ck+1 = γkp kp 1 (k + 1)p λk = 1 + γ 2kp , τk = ck+1 to (41) and (42), we obtain U k+1 U k + 1 + γ 2kp γkp kp 1 (k + 1)p ! 픸xk+1 픸xk, xk+1 xk = γ (2kp + γ) kp 1 (k + 1)p 픸xk+1 2 γ (2kp + γ) 2kp (kp + γ) 픸xk 2 kp 1 (k + 1)p 픸xk+1, xk+1 x0 kp (kp + γ) + γkp kp 1 (k + 1)p 2 (2kp + γ) kp 1 (k + 1)p 2 (kp + γ) (2kp + γ) kp 1 (k + 1)p xk x0 2 + (ck+1 ck) xk x0 2 = γ (2kp + γ) 픸xk+1 + 1 2kp + γ kp 1 (k + 1)p γ (2kp + γ) 2kp (kp + γ) 픸xk + 1 2kp + γ kp 1 (k + 1)p kp γ (k + 1)p γ 2 (2kp + γ) 2γ 1 (k + 1)p 1 kp 1 (k + 1)p | {z } sk,1 2 (kp + γ) (2kp + γ) 2γ 1 (k + 1)p 1 2 1 (k + 1)p | {z } sk,0 + (ck+1 ck) xk x0 2 . The continuous counterpart of above equality is dt 픸(X(t)), X(t) 픸(X(t)) + γ (X(t) X0) 2 1 tp+2 X(t) X0 2 , which can be obtained by differentiating and reorganizing the conservation law for E1. Note, as k2p 2 1 kp 1 (k+1)p = O k2p O 1 kp+1 = O 1 k1 p , the order of the term 1 2kp+γ γ + k2p 2 1 kp 1 (k+1)p corresponds to γ t . Thus we can see the sum of first two terms on the right hand side of the equality for discrete setting corresponds to the first term of the right hand side of the equality for continuous setting. We can observe that sk,1, sk,0 = O 1 k2p+1 + O 1 kp+2 . And since ck = O 1 k2p + O 1 kp+1 , we have ck+1 ck = O 1 k2p+1 + O 1 kp+2 as well. Therefore the coefficients of xk+1 x0 2 and xk x0 2 are summable for p > 0. From Lemma G.1, we know xk+1 x0 2 and xk x0 2 are bounded by 4 x0 x 2. The term 픸xk+1 픸xk, xk+1 xk on the left hand side is nonnegative from monotonicity of 픸, and the coefficient is nonnegative as well. The coefficient of 픸xk+1 2 in the right hand side, 1 2 γ kp γ (k+1)p , is nonpositive. As a result, we get below inequality. U k+1 U k + 1 kp γ (k + 1)p xk+1 x0 2 + (ck+1 ck) xk x0 2 U 1 + 2 x0 x 2 X kp γ (k + 1)p + 2 (ck+1 ck) = M. As done in (8), by monotonicity of 픸and Young s inequality M U k = ak 픸xk 2 + bk 픸xk, xk x0 + ck xk x0 2 ak 픸xk 2 + bk 픸xk, x x0 픸xk 2 + b2 k x x0 2 = 1 2k2p x x0 2 . Reorganizing, we get the desired result 픸xk 2 2 1 γ 2k2p x x0 2 = 2 1 + γ kp γ 2k2p x x0 2 = O (1) . 픸(xk) 2 = O 1 k2p for 0 < p < 1, γ > 0. 
Plugging ak+1 = kp kp + γ , bk+1 = γkp, ck+1 = kpγ2 λk = ak+1, τk = ck+1 to (41) and (42) with p = 1, we obtain U k+1 U k + ! 픸xk+1 픸xk, xk+1 xk 픸xk+1 2 γ2 픸xk+1, xk+1 x0 γ3 2 xk+1 x0 2 kp + γ (k 1)p (k 1)p + γ 픸xk 2 + γ (k 1)pkp + k2p γ(k 1)p kp + γ | {z } =rk 4 (kp + γ) kp + γ 2 xk x0 2 + γ3 (kp (k 1)p) 8 (k 1)p + γ 픸xk+1 + γ 2kp + γ xk+1 x0 r2 k 4qk + γ3kp 4 (kp + γ) kp + γ 2 + γ3 (kp (k 1)p) 8 (k 1)p + γ The continuous counterpart of this equality is U2(t) + 2t2p d dt 픸(X(t)), X(t) = 2tp γ ptp 1 픸(X(t)) + γ tp (X(t) X0) 2 γp(p 1)tp 2 X(t) X0 2 which can be obtained by differentiating the conservation law for E2. The first term on the right hand side correspond to the sum of first two terms in the right hand side of discrete equality. Thus we may expect the order of the coefficients for the terms would match, and we will check the expectation is indeed true. With some calculation, we can observe sk = 2γ2(k 1)pk2p ((k 1)p kp)2 4 (k 1)p + γ dk = 2kp(k 1)p (k 1)p + γ 2k3p γk2p + 2γ(k 1)p (k 1)p + γ By considering Newton expansion and from 0 < p < 1, we see dk = 2k3p + γk2p + O k3p 1 2k3p γk2p + 2γk2p + O (kp) = 2γk2p + O k3p 1 + O (kp) = O k2p . Thus we can check the leading order of numerator is p+2p+(2p 2) = 5p 2, and leading order of the denominator is p + p + 2p = 4p. As 5p 2 4p = p 2, we have sk O kp 2 , which matches with the continuous counterpart. Therefore P k=1 sk < . On the other hand, we see qk = k2p kp + γ kp + γ (k 1)p (k 1)p + γ k2p kp pkp 1 + O kp 2 kp pkp 1 + O kp 2 + γ 2 kp + 2pk2p 1 + O k2p 2 . Since p > 2p 1, we have limk qk = . Note this matches with the continuous counterpart as well. Therefore there is N > 0 such that for k > N, qk < 0. Now for k > N we have Uk+1 Uk + sk xk x0 2 UN + 4 ! x0 x 2 = M. Thus for k > N, by monotonicity of 픸and Young s inequality M U k+1 = kp kp + γ 픸xk+1 2 + γkp 픸xk+1, xk+1 x0 + kpγ2 2 xk+1 x0 2 픸xk+1 2 + γkp 픸xk+1, x x0 픸xk+1 2 + γ2 = kp kp + γ 픸xk+1 2 kpγ2 Reorganizing, we get the desired result. 픸xk+1 2 2 kp kp + γ x0 x 2 = O 1 픸(xk) 2 = O 1 k2 for p = 1, γ 1. Plugging ak+1 = k2, bk+1 = γ k 1 2 (γ 1) , ck+1 = 1 λk = k(k + 1), τk = ck+1 to (41) and (42), we obtain U k+1 U k + k(k + 1) 1 4γ(γ 1) 픸xk+1 픸xk, xk+1 xk = (γ 1) (k + 1) 픸xk+1 2 k2(γ 1) k + γ 픸xk 2 γ(γ 1) k + γ k 픸xk+1, xk+1 x0 γ(γ 1) k γ k + γ 픸xk, xk x0 xk+1 x0 2 γ2(γ 1) = (γ 1) (k + 1) 픸xk+1 + γ 2(k + 1) xk+1 x0 2 γ3(γ 1) The continuous counterpart of above equality is U3(t) + 2t2 d dt 픸(X(t)), X(t) = 2t (γ 1) 픸(X(t)) + γ t (X(t) X0) 2 which can be obtained by differentiating the conservation law for E3. Note the terms match with same order of coefficients, except two xk+1 x0 2 and xk x0 2, while these terms are summable as sk,1, sk,0 = O 1 k2 . Therefore we have U k+1 U k sk,1 xk+1 x0 2 sk,0 xk x0 2 U 1 + 4 x0 x 2 X m=1 (sm,1 + sm,0) = M. Thus by monotonicity of 픸and Young s inequality M U k = k2 픸xk+1 2 + γ k 1 2 (γ 1) 픸xk, xk x0 + 1 4γ(γ 1) xk x0 2 = k2 픸xk+1 2 + γ k 1 2 (γ 1) 픸xk, x x0 2 (γ 1) 2 픸xk 2 + γ2 x x0 2 ! Reorganizing, we get the desired result. 픸xk 2 2 x0 x 2 = O 1 픸(xk) 2 = O 1 k2γ for p = 1, 0 < γ < 1. Plugging ak = k2γ, bk = γk k 1 2γ 2 , ck+1 = 1 4γ(γ 1)k2γ 2 λk = k2γ 1(k + γ), τk = ck+1 to (41) and (42) with p = 1, we obtain U k+1 U k + k2γ 1(k + γ) + 1 4γ(1 γ)k2γ 2 픸xk+1 픸xk, xk+1 xk = (k + 1)2γ k2γ 2(k + γ)2 γ(k + 1) k + 3 2γ 2 γk2γ 2(k + γ) 1 + k + γ 4γ(γ 1)k2γ 2 ! | {z } =sk,1 픸xk+1, xk+1 x0 γk2γ 1 γk k 1 2γ 2 1 + k k + γ 4γ(γ 1)k2γ 2 ! 
| {z } =sk,0 4γ2(γ 1)k2γ 3 xk+1 x0 2 1 4γ2(γ 1)k2γ 3 1 1 + γ 4γ(γ 1) k2γ 2 (k 1)2γ 2 xk x0 2 The continuous counterpart of this equality is U4(t) + 2t2γ d dt 픸(X(t)), X(t) ds = 2γ(γ 1)t2γ 3 X(t) X0 2 which can be obtained by differentiating the conservation law for E4. Thus we may expect the order of the matching terms are equal, and the terms do not occur in the continuous version do not bother our desired conclusion. We check our expectation is true. Terms xk+1 x0 2 and xk x0 2 correspond to X X0 2. The coefficients for xk+1 x0 2 and xk x0 2 are clearly O k2γ 3 , which equals the order of continuous counterpart. Since γ < 1, we know these terms are summable. Next we observe qk 0. Observe qk 0 (k + 1)2γ k2γ (k + γ)2 To check the last inequality is true, consider f(x) = xγ. Since this function is concave for 0 < γ < 1, we see 1 + 1 γ = f 1 + 1 k f (1) = 1 + γ Finally we focus on sk,0 and sk,1. As cross terms don t appear in continuous version, we may expect these terms are small , or in mathematical words, summable. From Cauchy-Schwarz inequality we know 픸xk+1, xk+1 x0 픸xk+1 xk+1 x0 픸xk, xk x0 픸xk xk x0 . Since 픸xk+1 and 픸xk are bounded from the proof for case (i), we know two innerproduct terms are bounded. Thus if we show P k=1 |sk,0| , P k=1 |sk,1| < , we can conclude U k is bounded. Considering Newton expansion, we see γ(k + 1) k + 3 2γ 2 γk2γ 2(k + γ) = γk2γ 1 + γ 1 + 3 2 (γ 1) k2γ 2 + O k2γ 3 γk2γ 1 γ2k2γ 2 2γ(γ 1)k2γ 2 + O k2γ 3 2γ(γ 1)k2γ 2 + O k2γ 3 2 + γ 4γ(γ 1)k2γ 2 = O k2γ 3 . With similar argument sk,0 = γk2γ 1 γk k 1 2γ 2 1 + k k + γ 4γ(γ 1)k2γ 2 = γk2γ 1 γk k2γ 2 (2γ 2) 1 4k2γ 3 + O k2γ 4 2 γ k + γ 4γ(γ 1)k2γ 2 2γ(γ 1)k2γ 2 + O k2γ 3 1 2γ(γ 1)k2γ 2 + O k2γ 3 = O k2γ 3 Therefore, we have U k+1 U k + sk,1 픸xk+1 xk+1 x0 + sk,0 픸xk xk x0 xk+1 x0 2 ck+1 γ k + γ xk x0 2 + (ck+1 ck) xk x0 2 |sm,1| 픸xk+1 xk+1 x0 + |sm,0| 픸xk+1 xk+1 x0 xk+1 x0 2 + cm+1 γ m + γ + |ck+1 ck| xk x0 2 = M1. By Young s inequality 4γ(1 γ)k2γ 2 xk x0 2 k2γ 픸xk 2 + γk k 1 2γ 2 픸xk, xk x0 k2γ 픸xk 2 + γk k 1 2γ 2 픸xk, x x0 = k2γ 픸xk 2 + kγ 픸xk, γk1 γ k 1 2γ 2 x x0 + k2γ 픸xk 2 1 2k2γ 픸xk 2 γ2 2γ 2!2 x x0 2 . Since k1 γ k 1 4 2γ 2 = O kγ 1 and γ < 1, this terms goes to zero as k thus there is some M2 > 0 such 2 γk1 γ k 1 4 2γ 2 2 xk x 2 M2 for all k 1. Reorganizng terms, we obtain the desired result 픸xk 2 2 k2γ M1 + M2 + γ(1 γ)k2γ 2 x x0 2 = O 1 H Proof of convergence analysis for strongly monotone 픸 H.1 Proof of Proposition 6.1 We take dilated coordinate W(t) = C(t)(X(t) X0) as did in the proof of Proposition 3.2. Recall from (30), the second order version of the ODE was written as 0 = W β(t) W + C(t) d dt 픸(X(t)). Now for we multiply R(t)2 in the ODE and obtain 0 = R(t)2 W β(t) W + C(t) d Now taking inner product with W and integrating we have t0 R(s)2 R(s) t0 β(s)R(s)2 W(s) 2 ds t0 C(s)R(s)2 d ds 픸(X(s)), W(s) ds. Note the second term which is obtained from integration by parts, would not appear if R(t) = 1 as in Proposition 3.2. This is key term to exploit the condition 픸is strongly monotone. Now again with W(t) = C(t) X(t) + β(s)(X(t) X0) , we rewrite the last term as Z t t0 C(s)R(s)2 d ds 픸(X(s)), W(s) ds t0 C(s)2R(s)2 d ds 픸(X(s)), X(s) ds + Z t ds 픸(X(s)), C(s)R(s)2β(s)W(s) ds. Taking integration by parts to the second term we have Z t ds 픸(X(s)), C(s)R(s)2β(s)W(s) ds 픸(X(W(s), s)), C(s)R(s)2β(s)W(s) t * 픸(X(W, t)), β(s)2 + 2 R(s) R(s)β(s) + β(s) C(s)R(s)2W(s) + C(s)R(s)2β(s) W(s) t0 β(s)R(s)2 W(s) 2 ds + Z t β(s)2 + 2 R(s) R(s)β(s) | {z } R(s)2 D W(s), W(s) E ds. 
The fact C(t) 픸(X(W, t)) = W(t) is applied to the second equality. Note the fundamental theorem of calculus is valid since C(s)R(s)2β(s)W(s) is differentiable, 픸(X(W(s), t)) is Lipschitz continuous in [t0, t] by Lemma B.4, and so their inner product is absolutely continuous in [t0, t]. Now consider the second integrand except the term marked with *. From integration by parts we have Z t β(s)2 + β(s) R(s)2 D W(s), W(s) E ds β(s)2 + β(s) R(s)2 1 2β(s) β(s) + β(s) R(s)2 + β(s)2 | {z } + β(s) The integrand except * marked term can be rewritten as 1 2 2β(s) β(s)R(s) + β(s)R(s) + 2 β(s) R(s) R(s) W(s) 2 ds t0 C(s)2 2β(s) β(s)R(s) + β(s)R(s) + 2 β(s) R(s) R(s) X(s) X0 2 ds C(s)2R(s)2 β(s) X(s) X0 2 ds. Now collecting the terms marked with *, we have t0 R(s)2 R(s) t0 2 R(s) R(s)β(s)R(s)2 D W(s), W(s) E ds Z t t0 β(s)2R(s) R(s) W(s) 2 ds t0 R(s)2 R(s) W(s) β(s)W(s) 2 ds = Z t t0 R(s)2C(s)2 R(s) Collecting all results we have W(t) 2 + 픸(X(W, s)), C(s)R(s)2β(s)W(s) t t0 + β(s)2 + β(s) R(s)2 1 t0 C(s)2R(s)2 d ds 픸(X(s)), X(s) ds Z t t0 R(s)2C(s)2 R(s) C(s)2R(s)2 β(s) X(s) X0 2 ds = C(t)2R(t)2 픸(X(t)) 2 + 2β(t) 픸(X(t)), X(t) X0 + β(t)2 + β(t) X(t) X0 2 t0 C(s)2R(s)2 d ds 픸(X(s)), X(s) R(s) R(s) C(s)2R(s)2 β(s) X(s) X0 2 ds C(t0)2R(t0)2 2β(t0) 픸(X(t0)), X(t0) X0 + β(t0)2 + β(t0) X(t0) X0 2 | {z } constant Renaming E = E1 constant, we get the desired result. H.2 Proof of Theorem 6.2 H.2.1 Proof of the inequality V (t) V (0) The basic structure of the proof is same as Appendix E.3. We do not repeat the whole proof here, instead we check the steps done in Appendix E.3 are also valid for the setup in Theorem 6.2. (i) V is nonincreasing for Lipschitz continuous µ-strongly monotone 픸. Recall V in Theorem 6.2 was defined as V (t) = (eµt e µt)2 픸(X(t)) 2 + 2µ(1 e 2µt) 픸(X(t)), X(t) X0 2µ2 1 e 2µt X(t) X0 2 . We first check following equality is true. V (t) = C(t)2R(t)2 픸(X(t)) 2 + 2β(t) 픸(X(t)), X(t) X0 + β(t)2 + β(t) X(t) X0 2 . (48) Recall we re considering (10), β(t) = 2µ e2µt 1. As d dt log 1 e 2µt = 2µe 2µt 1 e 2µt = 2µ e2µt 1 = β(t), we have R t 2µ e2µs 1 ds = elog(1 e 2µt) = 1 e 2µt. As β(t) = 4µ2e2µt (e2µt 1)2 and R(t) = eµt, 2 1 e 2µt 2 e2µt = (eµt e µt)2 C(t)2R(t)2β(t) = e 2µt e2µt 1 2 2µ e2µt 1 = 2µ 1 e 2µt β(t)2 + β(t) = (eµt e µt)2 = 2µ2 e 2µt 1 . This proves the desired equality. Now we show V (t) = E Z t t0 C(s)2R(s)2 d ds 픸(X(s)), X(s) R(s) R(s) where E is from Proposition 6.1 which was defined as E = C(t)2R(t)2 픸(X(t)) 2 + 2β(t) 픸(X(t)), X(t) X0 + β(t)2 + β(t) X(t) X0 2 t0 C(s)2R(s)2 d ds 픸(X(s)), X(s) R(s) R(s) C(s)2R(s)2 β(s) X(t) X0 2 ds. From (48) and the definition of E, it is enough to show d ds C(s)2R(s)2 β(s) 2 = 0. Since C2(t)R2(t) β(t) 2 = 1 e 2µt 2 e2µt ds C(s)2R(s)2 β(s) Now since 픸is Lipschitz continuous, we know E is constant from Proposition 6.1. Therefore from (49), for t > 0, |h| < t we have V (t + h) V (t) = Z t+h t C(s)2R(s)2 d ds 픸(X(s)), X(s) R(s) R(s) As R(s) R(s) = µ and 픸is µ-strongly monotone, from (2) we see d ds 픸(X(s)), X(s) R(s) R(s) Therefore V (t + h) V (t) 0 for h > 0, we get the desired result. (ii) Calculation of V (0) for Lipshitz continuous monotone 픸 Plugging t = 0 to (47) we immediately obtain V (0) = 0. (iii) V (t) 0 holds for all t [0, ) and general maximal µ-strongly monotone 픸 We check the arguments in Appendix E.3.3 and Appendix E.3.4 are also valid here. Define S as defined in Lemma E.1. Take t S, let T > t. As checked in Appendix C, the arguments used in the proof Proposition B.11 is also valid for the case β(t) = 2µ e2µt 1. 
This fact provides the required sequence {Xλn}n N, where Xλn converges to X uniformly on [0, T], and Xλn converges weakly to X in L2([0, T], Rn). As in Appendix E.3.3, denote Vλ as V for the solution with 픸λ. Then from (i) we know Vλn is nonincreasing for all n N, we have lim supn Vλn(t) lim supn Vλn(0) = 0. Moreover, we can check the extension of 픸defined in Lemma E.1 is also valid. From (21) and (22), we know Xλ(t) eµT 2 m(픸(X0)) and 픸λ(Xλ(t)) e2µT + 1 m(픸(X0)) for all λ > 0, t [0, T]. Therefore we can prove Corollary E.2 for the case β(t) = 2µ e2µt 1 with the same proof, replacing M픸(T) by e2µT + 1 m(픸(X0)) and Mdot(T) by eµT 2 m(픸(X0)) . Thus 픸(X(t)) is well-defined for t = 0, plugging t = 0 to (47) we obtain V (0) = 0. Therefore we have lim sup n Vλn(t) lim sup n Vλn(0) = 0 = V (0), it remains to show V (t) lim supn Vλn(t). Observe, as equality X(t) = 픸(X(t)) β(t)(X(t) X0) holds since t S, from (48) we have V (t) = C(t)2R(t)2 X(t) 2 + β(t) X(t) X0 2 . Therefore if is suffices to check Lemma E.4 is valid here with some Uλn. For some a > 0, Define Uλn : [0, ) R as Uλn(t) = Xλn(t) 2 β(t) Xλn(t) X0 2 + Z t β(s) 2 β(s)2 Xλn(s) X0 2 ds | {z } fn(t) We proceed similar argument with Lemma B.5. Differentiating Xλn(t) = 픸λn(X(t)) β(t)(Xλn(t) X0), we have for almost all t > 0 dt픸λn(X(t)) β(t)(Xλn(t) X0) β(t) Xλn(t). Therefore for almost all t > 0, Uλn(t) = 2 D Xλn(t), Xλn(t) E β(t) Xλn(t) X0 2 2 β(t) D Xλn(t), Xλn(t) X0 E + β(t) 2 β(t)2 Xλn(t) X0 2 = 2 Xλn(t), d dt픸λn(X(t)) β(t)(Xλn(t) X0) β(t) X(t) 2 β(t) D Xλn(t), Xλn(t) X0 E 2 β(t)2 β(t) Xλn(t) X0 2 = 2 Xλn(t), d dt픸λn(X(t)) 2β(t) Xλn(t) + β(t) β(t) (Xλn(t) X0) Therefore Uλn is nonincreasing, we can prove X(t) lim supn Xλn(t) 2 with same argument in Lemma E.4. Therefore we have V (t) V (0) = 0 for t S. Extending the result to t [0, ) can be done with the same argument done in Appendix E.3.4. H.2.2 Proof for convergence rate V (t) = (eµt e µt)2 픸(X(t)) 2 + 2µ(1 e 2µt) 픸(X(t)), X(t) X0 2µ2 1 e 2µt X(t) X0 2 . Observe 2V (t) 1 e 2µt = (e2µt 1) 픸(X(t)) 2 + 4µ 픸(X(t)), X(t) X0 4µ2 X(t) X0 2 = (eµt 1) 픸(X(t)) 2 4µ 픸(X(t)), X(t) X0 + 4µ2 X(t) X0 2 = 픸(X(t)) 2µ(X(t) X0) 2 + eµt(eµt 1) 픸(X(t)) 2 + eµt 4µ 픸(X(t)), X(t) X0 4µ2 X(t) X0 2 | {z } =p(t) From the law of cosines, we have X(t) X0 2 = X(t) X 2 2 X(t) X0, X0 X X X0 2. Applying this to p(t) we have p(t) = 4µeµt 픸(X(t)), X(t) X 픸(X(t)), X0 X µ X(t) X0 2) = 4µeµt 픸(X(t)) + µ(X(t) X ), X(t) X 픸(X(t)) 2µ(X(t) X0), X0 X µ X0 X 2. Thus 2V (t) 1 e 2µt = (eµt 1) 픸(X(t)) 2µ(X(t) X0) 2 + eµt(eµt 1) 픸(X(t)) 2 + p(t) = (eµt 1) 픸(X(t)) 2µ(X(t) X0) 2 4µeµt 픸(X(t)) 2µ(X(t) X0), X0 X + eµt(eµt 1) 픸(X(t)) 2 + 4µeµt 픸(X(t)) µ(X(t) X ), X(t) X + 4µ2eµt X0 X 2 = (eµt 1) 픸(X(t)) 2µ(X(t) X0) 2µeµt eµt 1(X0 X ) eµt 1 X0 X 2 + eµt(eµt 1) 픸(X(t)) 2 + 4µeµt 픸(X(t)), X(t) X µ X(t) X 2 + 4µ2eµt X0 X 2. Now since 픸is µ-strongly monotone, 픸(X(t)), X(t) X µ X(t) X 2 0. From previous section we know 0 = V (0) V (t) for all t > 0. Therefore for all t > 0 0 2V (t) 1 e 2µt eµt(eµt 1) 픸(X(t)) 2 + 4µ2eµt 4µ2e2µt = eµt(eµt 1) 픸(X(t)) 2 4µ2eµt eµt 1 X0 X 2. Organizing, we conclude 픸(X(t)) 2 4 µ eµt 1 H.2.3 Informal derivation of ODE from the method Assume 픸: Rn Rn be a continuous monotone operator. In [54], the method OS-PPM is presented as xk = 핁h픸yk 1 ν (yk 1 xk) + 1 where y0 = x0, ν = 1 + 2hµ and sk = 1 + ν2 + + ν2k = ν2k+2 1 ν2 1 . Using yk 1 = xk + h픸xk, substituting yk and yk 1 this method can be expressed in a single line, xk+1 + h픸xk+1 = 1 1 Reorganizing and dividing both sides by h, we have h = 픸xk+1 1 픸xk 1 hsk (xk x0). 
Identifying x0 = X0, 2hk = t, xk = X(t), X(t + 2h) X(t) h = 픸(X(t + 2h)) 1 픸(X(t)) 1 hsk (X(t) X0). Now observe lim h 0+ hsk = lim h 0+ hν2k+2 1 ν2 1 = lim h 0+ h(1 + 2hµ)2k+2 1 (1 + 2hµ)2 1 = lim h 0+ (1 + 2hµ)2(1 + 2hµ)t/h 1 4µ(1 + hµ) = e2µt 1 lim h 0+ 1 ν = lim h 0+ 1 1 + 2hµ = 1 lim h 0+ 1 νsk = lim h 0+ 1 ν 1 hsk h = 1 4µ e2µt 1 0 = 0. Taking limit h 0+ and organizing, 2 X(t) = 픸(X(t)) (1 + 0)픸(X(t)) 4µ e2µt 1(X(t) X0). By diving both sides by 2, we get the desired anchor ODE X(t) = 픸(X(t)) 2µ e2µt 1(X(t) X0). I Proof omitted in Section 7 I.1 Proof for Theorem 7.1 We first check 픸(X(t)) 2 4β(t)2 X0 X 2 = O β(t)2 , X(t) = 픸(X(t)) β(t)(X(t) X0) 2 픸(X(t)), X(t) X0 . By definition of β(t), Φ(t) defined in (8) becomes zero, i.e. 0 Φ(t) = 픸(X(t)) 2 + 2β(t) 픸(X(t)), X X0 . Plugging Φ(t) = 0 to inequality (8) we have 픸(X(t)) 2 2β(t)2 X0 X 2 . Reorganizing, we get the desired result. Other results need some works, we provide the proof with steps into subsections. I.1.1 β(t) > 0 for t > 0 Taking inner product with 픸(X) to the ODE and applying 1 2Φ(t) = 0 we have D X(t), 픸(X(t)) E = 픸(X(t)) 2 β(t) 픸(X(t)), X(t) X0 = 1 픸(X(t)) 2 . As 픸is assumed to be continuous, taking limit t 0+ we have D X(t), 픸(X(t)) E = lim t 0+ X(t), 픸(X0) = 1 2 lim t 0+ 픸(X(t)) 2 = 1 On the other hand, by assumption we have limt 0+ X(t) = limt 0+ X(t) X0 t and 픸(X0) = 0, thus t 픸(X(t)), X(t) X0 = 픸(X0), lim t 0+ X(t) = 1 픸(X0) 2 < 0. (50) Therefore there is ϵ > 0 such that 픸(X(t)), X(t) X0 < 0 for t (0, ϵ), thus for t (0, ϵ) we have 픸(X(t)), X(t) X0 = 0 so β(t) is well-defined and satisfies 2 픸(X(t)), X(t) X0 > 0. Observe the denominator of β(t) is zero when 픸(X(t)) = 0. As β is assumed to be well-defined, we have 픸(X(t)) = 0 for all t > 0. Since β is continuous as 픸and X are continuous, by intermediate value theorem we have 픸(X(t)) > 0 for all t > 0. I.1.2 Proof for the main statements We first show following lemma. Lemma I.1. Following equality holds for almost every t > 0. β(t) + β(t)2 픸(X(t)), X(t) X0 = d dt 픸(X(t)), X(t) . (51) Proof. Since 픸is Lipschitz continuous by assumption, by Lemma B.4 we know 픸(X(t)) is differentiable almost everywhere. Differentiating Φ(t) we have dt 픸(X(t)), 픸(X(t)) + 2 β(t) 픸(X(t)), X(t) X0 + 2β(t) d dt 픸(X(t)), X(t) X0 + 2β(t) D 픸(X(t)), X(t) E dt 픸(X(t)), 픸(X(t)) + β(t)(X(t) X0) + 2 β(t) 픸(X(t)), X(t) X0 + 2β(t) 픸(X(t)), 픸(X(t)) β(t)(X(t) X0) dt 픸(X(t)), X(t) 2β(t) 픸(X(t)) 2 + 2 β(t) β(t)2 픸(X(t)), X(t) X0 dt 픸(X(t)), X(t) + 4β(t)2 픸(X(t)), X(t) X0 + 2 β(t) β(t)2 픸(X(t)), X(t) X0 dt 픸(X(t)), X(t) + 2 β(t) + β(t)2 픸(X(t)), X(t) X0 . for t (0, ) almost everywhere. The equalities come from the ODE X(t) = 픸(X(t)) β(t)(X(t) X0) and the fact Φ(t) = 0. Reorganizing, we have the desired equation (51). Now we show the upper bounds of β(t) for each monotone and strongly monotone case. Observe from (51) and the definition of β(t), we have for almost all t (0, ) β(t) + β(t)2 픸(X(t)) 2 = d dt 픸(X), X . (52) (i) When 픸is monotone. From (52) and (1) we have for almost all t (0, ) β(t) + β(t)2 픸(X(t)) 2 = d dt 픸(X), X 0. Since β(t) > 0 we have β(t) + β(t)2 0. almost everywhere. Now dividing both sides by β(t)2 we have 1 β(t) β(t)2 holds almost everywhere. Since β(t) β(t)2 = d dt 1 β(t) , integrating above inequality both side from δ to t we have t δ 1 β(t) 1 β(δ) = β(t) 1 t δ + 1 β(δ) . By the way, from (50) we have lim t 0+ tβ(t) = lim t 0+ 2 D 픸(X(t)), X(t) X0 thus limδ 0+ β(δ) = , so limδ 0+ 1 β(δ) = 0. 
Therefore taking limit δ 0+ we have β(t) lim δ 0+ 1 t δ + 1 β(δ) = 1 (ii) When 픸is µ-strongly monotone. From (52) and (2) for almost all t (0, ) we have β(t) + β(t)2 픸(X(t)) 2 = d dt 픸(X), X µ X(t) 2 . (53) On the other hand, observe X(t) 2 = 픸(X(t)) + β(t)(X(t) X0) 2 = 픸(X(t)) 2 + 2β(t) 픸(X(t)), X(t) X0 | {z } =Φ(t)=0 +β(t)2 X(t) X0 2 = β(t)2 X(t) X0 2 . From Cauchy-Schwarz inequality we see 픸(X(t)) 2 = 2β(t) 픸(X(t)), X(t) X0 2β(t) 픸(X(t)) X(t) X0 , therefore 픸(X(t)) 2β(t) X(t) X0 . Combining above observations, we have for almost all t > 0 픸(X(t)) 2 = 4β(t)2 X(t) X0 2 픸(X(t)) 2 1. (54) From (53) and (54), we get an inequality for β(t) and β(t) β(t) + β(t)2 2β(t) = β(t) 2β(t) + β(t) Moving β(t) 2 to the right-hand side and reorganizing, we have for almost all t > 0 1 β(t) β(t)( µ 2 + β(t)) = β(t) β(t) 1 µ 2 + β(t) 2 β(t) β(t) + β(t) µ 2 + β(t). Integrating above inequality both sides from δ to t we have 2 (t δ) h log β(t) + log µ 2 + β(t) it As observed in case (i) we know limδ 0+ β(δ) = , taking limit δ 0+ both sides we have β(t) = log 1 + µ/2 = eµt/2 1 + µ/2 Organizing, we get the desired result. β(t) µ/2 eµt/2 1. I.2 Correspondence between the ODE (11) and the discrete method in Theorem 7.2 Observe, the case h = 1 for below method corresponds to the method provided in Theorem 7.2. xk = 핁h픸yk 1 h 픸xk, xk x0 + h 픸xk 2 if 픸xk 2 = 0 0 if 픸xk 2 = 0 yk = (1 βk)(2xk yk 1) + βkx0. We now show when 픸is continuous and 픸xk 2 = 0 for all k 0, we obtain the ODE (11) when we take limit h 0+. Note when 픸is continuous 픸equals to 픸. From h 픸xk+1 + xk+1 = yk, substituting yk and yk 1 we get a single line expression h 픸xk+1 + xk+1 = (1 βk)(xk h 픸xk) + βkx0. Reorganizing and dividing both sides by h, we have h = 픸xk+1 (1 βk) 픸xk βk Identify x0 = X0, 2hk = t, and xk = X(t). Then we see h = h2 픸xk 2 h2 픸xk, xk x0 + h3 픸xk 2 = 픸(X(t)) 2 픸(X(t)), X(t) X0 + h 픸(X(t)) 2 . Thus βk = O (h), we have limh 0+ βk = 0. Now taking limit h 0+ we have 2 X(t) = 2 픸(X(t)) 픸(X(t)) 2 픸(X(t)), X(t) X0 (X(t) X0). Dividing both sides by 2, we obtain (11). I.2.1 Correspondence between convergence rates in Theorem 7.1 and Theorem 7.2 From identification above, we see β(t) corresponds to βk 2h. Therefore we see with identification x0 = X0, x = X , 2hk = t, and xk = X(t) we have h 픸(xk+1) 2 β2 k x0 x 2 = 4h2 β2 k (2h)2 x0 x 2 divide by h2, h 0+ 픸(X(t)) 2 4β(t)2 X0 X 2 . And we can also check that the bound βk 1 k+1 for the monotone case, is equivalent to βk 2h 1 2hk+O(h) and corresponds to β(t) 1 Now suppose 픸is µ-strongly monotone and L-Lipschitz continuous. Then h픸is hµ-strongly monotone and h L-Lipschitz continuous, we have the following inequality from Theorem 7.2 µ 2(1+(h L)2) 1 + hµ 1+(h L)2 k 1 + hµ 1+(h L)2 . (55) Recalling the identification 2hk = t, we see µt 2(1+(h L)2) h 0+ eµt/2. Therefore identifying β(t) = βk 2h and taking limit h 0+ to the inequality (55), we have β(t) µ/2 eµt/2 1. I.3 Proof of Theorem 7.2 Recall, the method was defined as xk = 핁픸yk 1 픸xk, xk x0 + 픸xk 2 if 픸xk 2 = 0 0 if 픸xk 2 = 0 yk = (1 βk)(2xk yk 1) + βkx0. First, we assume 픸xk = 0 for all k 0. Define Φk = (1 βk) 픸xk 2 + βk 픸xk, xk x0 for k = 1, 2, . . . , and Φk = 픸xk 2 + βk 1 픸xk, xk x0 for k = 2, 3, . . . . Note by definition of βk, we have Φk = 0. Our goal is to prove Then with the same argument with (8) we can conclude 픸xk+1 2 β2 k x0 x 2. To do so, we first show following lemma. Lemma I.2. For k 1, following is true. (1 βk) Φk Φk+1 = (1 βk) 픸xk+1 픸xk, xk+1 xk . (56) Proof. Recall yk = xk+1 + 픸xk+1. 
Substituting yk and yk 1, the method is equivalent to xk+1 + 픸xk+1 = (1 βk)(xk 픸xk) + βkx0. From above we can get two different expression of (1 βk)(xk+1 xk). (1 βk)(xk+1 xk) = (1 βk)( 픸xk+1 + 픸xk) βk 픸xk+1 + (xk+1 x0) = (1 βk) ( 픸xk+1 + 픸xk) βk 픸xk (xk x0) . With reorganizing, first equality can be obtained by subtracting both sides by βk(xk+1 x0) and the second equality can be obtained by multiplying both sides by (1 βk). From this we have (1 βk) 픸xk+1 픸xk, xk+1 xk = 픸xk+1, (1 βk)(xk+1 xk) 픸xk, (1 βk)(xk+1 xk) = 픸xk+1, (1 βk)( 픸xk+1 + 픸xk) βk{ 픸xk+1 + (xk+1 x0)} (1 βk) 픸xk, ( 픸xk+1 + 픸xk) + βk{ 픸xk (xk x0)} = (1 βk) 픸xk+1 2 βk 픸xk+1, 픸xk+1 + (xk+1 x0) + (1 βk) 픸xk 2 βk 픸xk, 픸xk (xk x0) = (1 βk) (1 βk) 픸xk 2 + βk 픸xk, xk x0 픸xk+1 2 βk 픸xk+1, xk+1 x0 = (1 βk) Φk Φk+1. Since 픸is monotone we have 픸xk+1 픸xk, xk+1 xk 0, we see that the right-hand side of (56) is greater or equal to 0 if 1 βk 0. Thus it remains to show 1 βk 0. As the index is quite confusing, we provide it as a lemma to avoid confusion while proceeding the proof. Lemma I.3. If 픸xk = 0 for all k 1, then 픸xk, xk x0 < 0, βk (0, 1), Φk+1 0 for all k 1. Proof. Proof by induction. From x1 = 핁픸y0 = 핁픸x0, we have x1 + 픸x1 = x0 and so x1 x0 = 픸x1. Applying these facts we have 픸x1, x1 x0 = 픸x1 2 < 0, 픸x1, x1 x0 + 픸x1 2 = 1 As 1 β1 > 0, from (56) we have (1 β1) Φ1 Φ2 = (1 β1) 픸x2 픸x1, x2 x1 0. As Φ1 = 0 by definition, we have Φ2 0. Therefore the statement is true for k = 1. Now suppose the statements are true for k. By induction hypothesis, we know 0 Φk+1 = 픸xk+1 2 + βk 픸xk+1, xk+1 x0 . As βk > 0 from induction hypothesis and 픸xk+1 = 0 by assumption, reorganizing Φk 0 we have 픸xk+1, xk+1 x0 픸xk+1 2 And therefore βk+1 = 픸xk+1 2 픸xk+1, xk+1 x0 | {z } >0 + 픸xk+1 2 (0, 1). Since 1 βk+1 > 0, by (56) we have (1 βk+1) Φk+1 Φk+2 = (1 βk+1) 픸xk+2 픸xk+1, xk+2 xk+1 0. Since Φk+1 = 0 by definition, we have Φk+2 0. Therefore the statements are true for k + 1, by induction, we get the desired result. Suppose 픸xk 2 = 0 for all k 1. From the lemma we know Φk+1 0 for k 1, so with the same argument of (8) we have for x Zer픸 0 Φk+1 = 픸xk+1 2 + βk 픸xk+1, xk+1 x βk 픸xk+1, x0 x 픸xk+1 2 βk 픸xk+1, x0 x | 픸xk+1 2 1 2 픸xk+1 2 + β2 k 2 x0 x 2 = 1 2 픸xk+1 2 β2 k 2 x0 x 2. Organizing, we get 픸xk+1 2 β2 k x0 x 2. Now we show the upper bound of βk. Observe from (56) and the fact Φ = 0, we have (1 βk) 픸xk+1 픸xk, xk+1 xk = (1 βk) Φk Φk+1 = βk βk+1 Φk+1 Φk+1 = βk βk+1 βk 1 픸xk+1 2 . As 1 βk = 0 for k 1 by Lemma I.3, dividing both sides by 1 βk, we get the discrete counterpart of (52), 픸xk+1 픸xk, xk+1 xk = βk βk+1 βk 1 픸xk+1 2 . (57) (i) When 픸is monotone. From (57) and monotonicity of 픸we have 0 픸xk+1 픸xk, xk+1 xk = βk βk+1 βk 1 From Lemma I.3 we have 1 βk > 0, therefore βk βk+1 βk 1 0 = 1 βk + 1 1 βk+1 . Summing up, as β1 = 1 2 and 1 βk > 0 we have 1 β1 + (k 1) = k + 1 1 βk = βk 1 k + 1. (ii) When 픸is µ-strongly monotone. Since 픸is µ-strongly monotone, from (57) we have µ xk+1 xk 2 픸xk+1 픸xk, xk+1 xk = βk βk+1 βk 1 Define rk = xk+1 xk 2 픸xk+1 2 for k = 0, 1, . . . . Dividing both sides by 픸xk+1 2 and organizing, we have βk βk+1 βk 1 1 βk = rkµ(1 βk) βk βk+1 βk 1 = βk 1 βk+1 1 βk (1 βk). Dividing both sides by βk and reorganizing, we get a recursive inequality for 1 βk 1 (1 + rkµ) 1 βk 1 + 1 1 βk+1 1. (58) We now prove an upper bound of βk from above inequality. Lemma I.4. Suppose 픸be a µ-strongly monotone operator. Let βk be a sequence defined as Theorem 7.2 and let rk = xk+1 xk 2 픸xk+1 2 for k = 0, 1, . . . . Then following holds for k = 1, 2, . . . . 
i=j (1 + riµ) + 2 Proof. First observe, the statement is equivalent to i=j (1 + riµ) + 1. The proof can be done by induction with (58). When k = 1, recalling β1 = 1 2 from the proof of Lemma I.3, we can check the inequality is true. Now suppose the inequality is true for k = m. Then from (58) we have 1 βm+1 1 (1 + rmµ) 1 i=j (1 + riµ) + 1 i=j (1 + riµ) + (1 + rmµ) i=j (1 + riµ) + 1. Therefore, we get the desired result. Under the identification considered in Appendix I.2, the continuous counterpart of rk is h 픸xk+1 2 = 4 (2h)2 1 픸xk+1 2 h 0+ 4 X(t) 2 픸(X(t)) 2 , and is greater or equal to 1 by (54). We obtained an exponential convergence rate in continuous setup from this fact. In the same spirit, we can get an exponential convergence rate for discrete setup if there is a positive lower bound for rk, we provide it as a corollary of Lemma I.4. Corollary I.5. Consider the setup of Lemma I.4. Suppose there is r 0 such that rk r for k = 0, 1, . . . . Then following is true for k = 1, 2, . . . . (1 + rµ)k 1 + rµ . Proof. From rk r we have i=j (1 + riµ) + 2 i=j (1 + rµ) + 2 j=1 (1 + rµ)k j + 2 = l=0 (1 + rµ)l + 1 = (1 + rµ)k 1 + rµ Applying Lemma I.4, we get the desired result. We now show rk 1 1+L2 holds when 픸is furthermore L-Lipschitz continuous. (iii) When 픸is µ-strongly monotone and L-Lipschitz continuous. Recall from the proof of Lemma I.2, we know xk+1 xk = 픸xk+1 (1 βk) 픸xk βk(xk x0). Taking inner product with 픸xk both sides we have 픸xk, xk+1 xk = 픸xk+1, 픸xk Φk = 픸xk+1, 픸xk . From above equality we can check xk+1 xk 2 + 픸xk+1 픸xk 2 = 픸xk+1 2 + xk+1 xk + 픸xk 2 픸xk+1 2 . As 픸is L-Lipschitz continuous, we have 1 + L2 xk+1 xk 2 xk+1 xk 2 + 픸xk+1 픸xk 2 픸xk+1 2 . Dividing both sides by 1 + L2 픸xk+1 2 we get a lowerbound for rk 픸xk+1 2 1 1 + L2 . Applying Corollary I.5 we get the desired result βk µ/(1 + L2) 1 + µ/(1 + L2) k 1 + µ/(1 + L2) . Note if we take limit µ 0+, with substitution α = µ 1+L2 we have lim µ 0+ µ/(1 + L2) 1 + µ/(1 + L2) k 1 + µ/(1 + L2) = 1 α (1 + α)k 1 + 1 = 1 k + 1, which is the bound for the monotone case. I.3.1 If there is k 0 such that 픸xk = 0 Suppose 픸xk = 0 for some k. Let x N be the very first iterate such that 픸x N = 0. Then from previous argument we know Theorem 7.2 is true for k < N. Thus it remains to show the statements are true for k N. From 픸x N = 0 we know y N 1 = x N + 픸x N = x N. And since 픸x N = 0 implies βN = 0, from the definition of the method we have y N = 2x N y N 1 = x N. Therefore x N+1 = 핁픸y N = 핁픸x N = x N, we conclude xk = x N Zer픸for all k N. Thus βk = 0, 픸xk = 0 for all k N, the Theorem 7.2 is trivially true for k N. J Details of experiment in Section 7.1 We solve a compressed sensing problem of Shi et al. [63] which is formulated as an ℓ1-regularized least-squared problem minimize x Rd 1 n Pn i=1 1 2 A(i)x bi 2 + ρ x 1 . We solve this problem in decentralized manner due to the problem setup where the network of local agents are as Figure 2 and each agents communicate only with their neighbors, the nodes connected to each agents by edge. We use Metropolis-Hastings matrix as our mixing matrix W Rn n and apply PG-EXTRA. Let Wi,j denote (i, j)-th entry of W and Ni {1, 2, . . . , n} denote the index of the agents in the neighborhood of agent i. Consider xk+1 i = Proxαρ 1 j Ni Wi,jxk j αA (i) A(i)xk i b(i) wk i wk+1 i = wk i + 1 j Ni Wi,jxk j , k = 0, 1, . . . (PG-EXTRA) to the problem above. 
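For concreteness, the following is a rough sketch of the PG-EXTRA update written above on a synthetic instance of the ℓ1-regularized least-squares problem; the problem sizes and parameters follow the values reported at the end of this section. The ring topology (used here in place of the network of Figure 2), the Metropolis-Hastings weights computed for that ring, the 1/2 factor in the w-update, and the random Gaussian data are assumptions made for illustration, not the exact configuration of the experiment.

```python
import numpy as np

# Rough sketch of the PG-EXTRA iteration described above (assumptions: ring network,
# Metropolis-Hastings mixing weights for that ring, 1/2 factor in the w-update,
# random Gaussian data).
rng = np.random.default_rng(0)
n, d, m = 20, 100, 4                      # agents, signal dimension, measurements per agent
rho, alpha = 0.01, 0.01                   # l1 parameter and step size from the experiment
A = rng.standard_normal((n, m, d))        # A_(i) for each agent i
b = rng.standard_normal((n, m))           # b_(i) for each agent i

# Metropolis-Hastings mixing matrix on a ring: each agent mixes only with its two neighbors.
W = np.zeros((n, n))
for i in range(n):
    for j in ((i - 1) % n, (i + 1) % n):
        W[i, j] = 1.0 / 3.0
    W[i, i] = 1.0 - W[i].sum()

def prox_l1(v, t):                        # prox of t * ||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros((n, d))                      # x_i stacked row-wise
w = np.zeros((n, d))
for k in range(500):
    mixed = W @ x                                            # row i: sum_{j in N_i} W_ij x_j
    resid = np.einsum('imd,id->im', A, x) - b                # A_(i) x_i - b_(i)
    grad = np.einsum('imd,im->id', A, resid)                 # A_(i)^T (A_(i) x_i - b_(i))
    x_new = prox_l1(mixed - alpha * grad - w, alpha * rho)
    w = w + 0.5 * (x - mixed)                                # 1/2 factor assumed
    x = x_new
```

Stacking (x, w) and viewing one pass of this loop as 핋(x, w) gives the fixed-point iteration to which the anchored (Halpern) methods below are applied.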
Under a suitable choice of parameters, PG-EXTRA can be seen as a fixed-point iteration of an averaged operator with respect to M [74, Theorem 2], where the metric matrix is the block matrix
M = \begin{bmatrix} (1/α)I & U \\ U & αI \end{bmatrix}
and U is a symmetric positive semidefinite matrix with U² = (1/2)(I − W). That is, denoting by xk, wk ∈ Rd×n the vertical stacks of the xk_i and wk_i, respectively [58, Chapter 11.3], PG-EXTRA can be rewritten as (xk+1, wk+1) = 핋(xk, wk). Using this 핋, we run the experiment with the Halpern method
(xk+1, wk+1) = βk(x0, w0) + (1 − βk)핋(xk, wk).
Since 핋 is an averaged operator, it is nonexpansive, so 핋 = ℝ픸 = 2핁픸 − 핀 for some maximal monotone operator 픸 [54, Lemma 2.1]. By the equivalence discussed in Lemma D.2, the Halpern method above is equivalent to our presented algorithms of the form
xk+1 = 핁픸yk
yk+1 = (1 − βk)(2xk+1 − yk) + βkx0,
with the correspondence yk = (xk+1, wk+1). Note that ∥픸̃xk∥²_M, where 픸̃xk = yk−1 − xk ∈ 픸xk, can be computed from the identity
(1/2)(핋yk−1 − yk−1) = 핁픸yk−1 − yk−1 = xk − (픸̃xk + xk) = −픸̃xk.
We use the anchor coefficients βk = 1/(k+1), βk = γ/(kp + γ) with p = 1.5 and γ = 2.0, and the adaptive choice of βk in Theorem 7.2 with the M-norm. In the experiment, the adaptive coefficient is computed from 핋yk−1 − yk−1 through the identity above, as
βk = ∥픸̃xk∥²_M / ( −⟨픸̃xk, xk − x0⟩_M + ∥픸̃xk∥²_M ).
We choose the signal dimension d = 100, the number of agents n = 20, the number of measurements per agent mi = 4, the ℓ1-regularization parameter ρ = 0.01, and the algorithm parameter α = 0.01.
K Broader Impacts
Our work focuses on the theoretical aspects of convex optimization algorithms. We do not anticipate any negative societal impacts from our theoretical results.
L Limitations
Our analysis concerns convex optimization. Although this assumption is standard in optimization theory, many functions that arise in machine learning practice are not convex.