# convergence_analysis_of_fractional_gradient_descent__dbc9a721.pdf Published in Transactions on Machine Learning Research (03/2024) Convergence Analysis of Fractional Gradient Descent Ashwani Aggarwal Department of Computer Science University of California, Los Angeles ashraggarwal@ucla.edu Reviewed on Open Review: https://openreview.net/forum?id=Oycf V3Mhfq Fractional derivatives are a well-studied generalization of integer order derivatives. Naturally, for optimization, it is of interest to understand the convergence properties of gradient descent using fractional derivatives. Convergence analysis of fractional gradient descent is currently limited both in the methods analyzed and the settings analyzed. This paper aims to fill in these gaps by analyzing variations of fractional gradient descent in smooth and convex, smooth and strongly convex, and smooth and non-convex settings. First, novel bounds will be established bridging fractional and integer derivatives. Then, these bounds will be applied to the aforementioned settings to prove linear convergence for smooth and strongly convex functions and O(1/T) convergence for smooth and convex functions. Additionally, we prove O(1/T) convergence for smooth and non-convex functions using an extended notion of smoothness - Hölder smoothness - that is more natural for fractional derivatives. Finally, empirical results will be presented on the potential speed up of fractional gradient descent over standard gradient descent as well as some preliminary theoretical results explaining this speed up. 1 Introduction Fractional derivatives (David et al., 2011), Oldham & Spanier (1974), (Luchko, 2023) as a generalization of integer order derivatives are a much studied classical field with many different variations. One natural question to ask is if they can be utilized in optimization similar to gradient descent which utilizes integer order derivatives. To motivate the usefulness of fractional gradient descent, we can observe from experiments in Shin et al. (2021) that their Adaptive Terminal Caputo Fractional Gradient Descent (AT-CFGD) method is capable of empirically outperforming standard gradient descent in convergence rate. In addition, they showed that training neural networks based on their AT-CFGD method can give faster convergence of training loss and lower testing error. Figure 1 depicts convergence on a quadratic function for standard gradient descent as well as AT-CFGD and the method in Corollary 15 labeled Fractional Descent guided by Gradient. For specifically picked hyperparameters, both of these fractional methods can significantly outperform standard gradient descent. This suggests that study on the application of fractional derivatives to optimization has a lot of potential. The basic concept of a fractional derivative is a combination of integer-order derivatives and fractional integrals (since there is an easy generalization for integrals through Cauchy repeated integral formula). The fractional derivative that will be studied here is the Caputo Derivative since it has nice analytic properties. The definition from Shin et al. (2021) is as follows where Γ is the gamma function generalizing the factorial (many texts give the definition only for x > c, but this will be extended later on). Published in Transactions on Machine Learning Research (03/2024) 0 20 40 60 80 100 Number of iterations (t) AT-CFGD Fractional Descent guided by Gradient Standard Gradient Descent Figure 1: Convergence of descent methods on function f(x, y) = 10x2 + y2 beginning at x = 1, y = 10. In all cases, the optimal (not theoretical) step size is used. AT-CFGD is as described in Shin et al. (2021) with x( 1) = 1.5, y( 1) = 10.5, α = 1/2, β = 4/10. Fractional Descent guided by Gradient is the method discussed in Corollary 15 with α = 1/2, β = 4/10, λt = 0.0675 (t+1)0.2 in xt ct = λt f(xt). Definition 1 (Left Caputo Derivative). For x > c, the (left) Caputo Derivative of f : R R of order α is (n = α ): CDα c +f(x) = 1 Γ(n a) f n(t) (x t)α n+1 dt. The main contributions of this paper are highlighted as follows: 1. First, we establish novel inequalities that connect fractional derivatives to integer derivatives. This is important since properties like smoothness, convexity, and strong convexity are expressed in terms of gradients (first derivatives). Without these inequalities, assuming these properties would be meaningless from the perspective of a fractional derivative. 2. Next, we apply these inequalities to derive convergence results for smooth and strongly convex functions. In particular, the fractional gradient descent method we examine is a variation of the AT-CFGD method of Shin et al. (2021). Theorem 14 gives a linear convergence rate for this method on different hyperparameter domains for single dimensional functions. Corollary 15 extends these results to higher dimensional functions that are sums of single dimensional functions. Lastly, Theorem 16 gives linear convergence results for general higher dimensional functions. 3. Continuing onwards, we examine smooth and convex functions. In particular, if hyperparameters satisfy certain assumptions, Theorem 17 and Theorem 18 give a O(1/T) convergence rate for the fractional gradient method similar to standard gradient descent. 4. The last setting we examine is smooth, but non-convex functions. For this setting, we use an extension of standard smoothness - Hölder smoothness - in which standard gradient descent needs varying learning rate to converge. We establish a general O(1/T) convergence result to a stationary point in Theorem 20 and show that fractional gradient descent with well chosen hyperparameters is more natural for optimizing Hölder smooth functions. 5. Finally, we present results which show potential for fractional gradient descent speeding up convergence compared to standard gradient descent. One main point of inquiry is how fractional gradient descent, Published in Transactions on Machine Learning Research (03/2024) in some cases, is able to significantly beat the theoretical worse case rates derived (which are at best as good as gradient descent). Empirically, it seems that this can be explained by the optimal learning rate far exceeding the theoretical learning rate. Another interesting example that is explored is how functions with the same amount of smoothness and strong convexity might have different preferences between fractional and standard gradient descent. Lastly, some basic theoretical results on this speed up are given for quadratic functions. 2 Related Work Fractional gradient descent is an extension of gradient descent so its natural context is in the literature surrounding gradient descent. Gradient descent as an idea is classical, however, there are a number of variations some of which are very recent (Ruder, 2016). One variation is acceleration algorithms including momentum which incorporates past history into the update rule and Nesterov s accelerated gradient which improves on this by computing the gradient with look-ahead based on history. Another line of variation building on this is adaptive learning rates with algorithms including Adagrad, Adadelta, and the widely popular Adam (Kingma & Ba, 2017). There is also a descent method combining the ideas of Nesterov s accelerated gradient and Adam called Nadam. Moving to fractional gradient descent, it is not possible to simply replace the derivative in gradient descent with a fractional derivative and expect convergence to the optimum. This is because, as discussed in Wei et al. (2020) and Wei et al. (2017), the point at which fractional gradient descent converges is highly dependent on the choice of terminal, c, and may not have zero gradient if c is fixed. This leads to a variety of methods discussed in Wei et al. (2020) and Shin et al. (2021) to vary the terminal or order of the derivative to achieve convergence to the optimum point. Later on, the former will be done in order to guarantee convergence. Other papers take a completely different approach like Hai & Rosenfeld (2021) which opts to generalize gradient flow by changing the time derivative to a fractional derivative thus bypassing these problems. One reason why there are so many different approaches across the literature is that fractional derivatives can be defined in many different ways (David et al., 2011) (the most commonly talked about include the Caputo derivative used in this paper as well as the Riemann-Liouville derivative). Some papers like Sheng et al. (2020) also choose simply to take a 1st degree approximation of the fractional derivative which can be expressed directly in terms of the 1st derivative. There are also further variations such as taking convex combinations of fractional and integer derivatives for the descent method like in Khan et al. (2018). Finally, there are extensions combining fractional gradient descent with one of the aforementioned variations on gradient descent. One example of this is in Shin et al. (2023) which extends the results of Shin et al. (2021) to propose a fractional Adam optimizer. To provide further motivation for the usefulness of this field, there are many papers studying the application of fractional gradient descent methods on neural networks and other machine learning problems. For example, Han & Dong (2023) and Wang et al. (2017) have shown improved performance when training back propagation neural networks with fractional gradient descent. In addition, other papers like Wang et al. (2022) and Sheng et al. (2020) have trained convolutional neural networks and shown promising performance on the MINST dataset. Applications to further models have also been studied in works like Khan et al. (2018) which studied RBF neural networks and Tang (2023) which looked at optimizing FIR models with missing data. In general, when reading through the literature, many fractional derivative methods have only been studied theoretically for a specific class of functions like quadratic functions in Shin et al. (2021) or lack strong convergence guarantees like in Wei et al. (2020). Detailed theoretical results like those of Hai & Rosenfeld (2021) and Wang et al. (2017) are fairly rare or limited. Thus, one main goal of this paper is to develop methodology for proving theoretical convergence results in more general smooth, convex, or strongly convex settings. As an interesting aside, fractional derivatives are generally defined by integration which means they fall under the field of optimization called nonlocal calculus which has been studied in general by Nagaraj (2021). Published in Transactions on Machine Learning Research (03/2024) 3 Relating Fractional Derivative and Integer Derivative Before beginning the theoretical discussion, it is important to note that for the most part, all of the setup will be done in terms of single-variable functions. Although this might appear odd, due to how the fractional gradient in higher dimensions is defined, when generalizing to higher dimensional functions much of the math ends up being coordinate-wise. In fact, all of the later results will generalize to higher dimensions following very similar logic as the single dimension case. Before presenting bounds relating fractional and integer derivatives, we need to extend the definition of the fractional derivative to x < c. For the extension, the definition of the right Caputo Derivative from Shin et al. (2021) is used: Definition 2 (Right Caputo Derivative). For x < c, the right Caputo Derivative of f : R R of order α is (n = α ): CDα c f(x) = ( 1)n f n(t) (t x)α n+1 dt. From this, we can unify both the left and right Caputo Derivatives into one definition. Definition 3 (Caputo Derivative). The Caputo Derivative of f : R R of order α is (n = α ): CDα c f(x) = (sgn(x c))n 1 f n(t) |x t|α n+1 dt where sgn is the sign function. In order to motivate calling this a fractional derivative, we can compute limits as the order of the derivative tends to an integer following the logic of 5.3.1 in Atangana (2018). Theorem 4. Choose some α R and let n = α . Suppose f : R R is n times differentiable and f n(t) is absolutely continuous throughout the interval [min(x, c), max(x, c)]. Then, limα n CDα c f(x) = sgn(x c)nf n(x), limα n 1 CDα c f(x) = sgn(x c)n 1(f n 1(x) f n 1(c)). Proof. Proof is via integration by parts, see Appendix A.1 for details. One interesting point of this theorem is that for odd n, a extra sgn(x c) term appears in the upper limit in addition to the ordinary nth derivative. This will end up motivating coefficients that are proportional to sgn(x c) to cancel this term out. Next, we present a key theorem relating the first derivative with the fractional derivative. To do so, we modify Proposition 3.1 of Hai & Rosenfeld (2021) with the extended domain of x < c. Theorem 5 (Relation between First Derivative and Fractional Derivative). Suppose f : R R is continuously differentiable. Let α (0, 1]. Define ζx(t) as: ζx(t) = f(t) f(x) f (x)(t x). Then, we have: CDα c f(x) f (x)(x c) Γ(2 α)|x c|α = ζx(c) Γ(1 α)|x c|α α sgn(x c) ζx(t) |x t|α+1 dt. Proof. Proof is via integration by parts following the logic of Proposition 3.1 of Hai & Rosenfeld (2021), see Appendix A.2 for details. Published in Transactions on Machine Learning Research (03/2024) Corollary 6. Suppose f : R R is continuously differentiable. Let α (0, 1]. If f is convex, CDα c f(x) f (x)(x c) Γ(2 α)|x c|α . Proof. Start with Theorem 5 s conclusion. Convexity implies ζx(t) 0. Thus, the first term on the RHS is immediately 0. In the second term on the RHS, sgn(x c) fixes the integral to be in the positive direction. Therefore, the second term is also 0 since the interior of the integral is positive. In the interest of getting the most general results possible, we need to extend the notion of L-smooth and µ-strongly convex as will be defined here following Nesterov (2015). As will be shown in later results, this extended notion could be more natural for fractional gradient descent. Definition 7. f : Rk R is (L, p)-Hölder smooth for p > 0 if: |f(y) f(x) f(x), y x | L 1 + p y x 1+p 1+p. If p = 1, f is L-smooth. Note that when convexity is assumed, the absolute value signs on the LHS do not matter since convexity means the LHS is non-negative. Definition 8. f : Rk R is (µ, p)-uniformly convex for p > 0 if: f(y) f(x) f(x), y x µ 1 + p y x 1+p 1+p. If p = 1, f is µ-strongly convex. Note that here the norm changes with p as well. Although this may be somewhat non-standard, the linearity over dimension is useful in proving later results. In later sections, both p = 1 and p = 1 cases will be studied so for this section we derive bounds in the most general p = 1 case. For p = 1, Zhou (2018) provides many useful properties that will be leveraged in proving convergence rates later on. These smoothness and convexity definitions are now combined with Theorem 5 to get two more useful inequalities. Corollary 9. Suppose f : R R is continuously differentiable. Let α (0, 1]. If f is (L, p)-Hölder smooth, f (x)(x c) Γ(2 α)|x c|α CDα c f(x) L Γ(1 α)(1 + p α)|x c|1+p α. Proof. Proof is a straightforward application of Theorem 5 and the definition of (L, p)-Hölder smooth, see Appendix A.3 for details. Corollary 10. Suppose f : R R is continuously differentiable. Let α (0, 1]. If f is (µ, p)-uniformly convex, f (x)(x c) Γ(2 α)|x c|α CDα c f(x) µ Γ(1 α)(1 + p α)|x c|1+p α. Proof. The proof is identical to that of Corollary 9 simply replacing L with µ and using instead of . We now will make use of these bounds to derive convergence rates for several different settings. Published in Transactions on Machine Learning Research (03/2024) 4 Smooth and Strongly Convex Optimization 4.1 Fractional Gradient Descent Method This section will focus on a modification of the AT-CFGD method from Shin et al. (2021) from the perspective of smooth and strongly convex twice differentiable functions. This study is a natural extension of prior work since they focused primarily on quadratic functions. For this section, we also assume that p = 1 since p = 1 introduces many complications stemming from the non-existence of inner products corresponding to L1+p norms if p = 1. The fractional gradient descent method is defined for f : R R with α (0, 1), β R as: xt+1 = xt ηt Cδα,β ct f(xt) Cδα,β c f(x) = 1 CDαc x(CDα c f(x) + β|x c|CD1+α c f(x)) = CDα c f(x)Γ(2 α) x c |x c|α + β|x c| CD1+α c f(x)Γ(2 α) x c |x c|α. The intuition for this method is given by Theorem 2.3 in Shin et al. (2021). It states that given a Taylor expansion of f around c, Cδα,β c f(x) is the derivative of a smoothed function where the kth term for k 2 is scaled by Ck,α,β = Γ(2 α)Γ(k) Γ(k+1 α) + β Γ(2 α)Γ(k) Γ(k α) . The sign of β therefore determines the asymptotic behavior of these coefficients with respect to k since the first term goes to 0 as k and the second has asymptotic rate β(k α)α due to Wendel s double inequality (Qi & Luo, 2013). For this method to be complete, we need to define how to choose ct. We will see that a convenient choice is xt ct = λt f(xt) for well chosen λt. In later experiments, this method will thus be labelled as Fractional Descent guided by Gradient. This choice of ct differs from Shin et al. (2021) whose AT-CFGD method used ct = xt m for some positive integer m. In practice this fractional gradient descent method can be computed with Gauss-Jacobi quadrature as described in detail in Shin et al. (2021). In order to better understand this fractional gradient descent method, we can take the limit as α 1. Calculating, we arrive at Cδα,β ct f(xt) = f (xt) λtβf (xt)f (xt) = (1 λtβf (xt))f (xt). Intuitively, if λtβ is chosen properly, this approximates f (xt) 1f (xt) up to proportionality which corresponds to a 2nd order gradient method update. We will see in Section 7 that we can actually accomplish similar behavior on quadratic functions even with β = 0 as long as α (0, 1). 4.2 Single Dimensional Results Before we can prove convergence results, we need one more inequality for bounding the 1 + α derivative. Lemma 11. Suppose f : R R is twice differentiable. If f is L-smooth and α (1, 2]. Then, CDα c f(x) L Γ(3 α)|x c|2 α. If f is µ-strongly convex and α (1, 2]. Then, CDα c f(x) µ Γ(3 α)|x c|2 α. Proof. This is a direct result of the fact that L-smooth implies that f (x) L and µ-strongly convex implies that f (x) µ. See Appendix B.1 for details. The next theorems are the primary tool that will be used for convergence results of this method. Published in Transactions on Machine Learning Research (03/2024) Theorem 12. Suppose f : R R is twice differentiable. If f is L-smooth and µ-strongly convex, α (0, 1], β 0. Then, |Cδα,β c f(x) f (x) K1(x c)| K2|x c|. where K1 = ( L+µ 2 )(β γ) and K2 = ( L µ 2 )(β + γ) with γ = 1 α 2 α. Note that the above also holds if f is L-smooth and convex if µ is set to 0. Proof. Holds by applying Corollary 9, Corollary 10, and Lemma 11. See Appendix B.2 for details. Theorem 13. Suppose f : R R is twice differentiable. If f is L-smooth and µ-strongly convex, α (0, 1], β 0. Then, |Cδα,β c f(x) f (x) K1(x c)| K2|x c|. where K1 = ( L+µ 2 )(γα,β) and K2 = ( µ L 2 )(γα,β) with γα,β = β 1 α 2 α. Note that the above also holds if f is L-smooth and convex if µ is set to 0. Proof. Holds by applying Corollary 9, Corollary 10, and Lemma 11. See Appendix B.3 for details. Everything is now ready for beginning discussion of convergence results. We have two cases: β 0 and β 0. We can treat these cases simultaneously by defining K1, K2 as in Theorem 12 for the former and Theorem 13 for the latter. In both these cases, K2 0, however, K1 can be positive or negative. For single dimensional f, we get the following convergence analysis theorem. Theorem 14. Suppose f : R R is twice differentiable, L-smooth, and µ-strongly convex. Set 0 < α < 1 and β R. If β 0, define K1, K2 as in Theorem 12; if β 0, define K1, K2 as in Theorem 13. Let the fractional gradient descent method be defined as follows. xt+1 = xt ηt Cδα,β ct f(xt) ηt = (1 K1λt K2|λt|)ϕ (1 K1λt+K2|λt|)2L for 0 < ϕ < 2 xt ct = λtf (xt) with 1 K1λt K2|λt| > ϵ > 0 Then, this method achieves the following linear convergence rate: 1 K1λt K2|λt| 1 K1λt + K2|λt| In particular, however, this rate is at best the same as gradient descent. Proof. Follows by applying Theorem 12 and Theorem 13 to bound the fractional gradient descent operator with an approximation in terms of the first derivative. Then, the proof of standard gradient descent rate for smooth and strongly convex functions can be followed with additional error terms from the approximation depending on xt ct. This rate is at best same as gradient descent where we would expect to see a coefficient of 1 µ L since K2|λt| 0. See Appendix B.4 for details. As a remark on the condition 1 K1λt K2|λt| > 0, consider the special case λt 0, K1 0 then this condition reduces to λt < 1 K1+K2 . Similarly, for λt 0, K1 0, this condition reduces to λt > 1 K2 K1 . Ultimately, this proof does not give a better rate than gradient descent. In some sense, this is a limitation of the assumptions in that everything is expressed in terms of integer derivatives making it necessary to connect them with fractional derivatives. This in turn makes the bound weaker due to additional error terms scaling on xt ct. However, this result is still useful for providing linear convergence results on a wider class of functions since Shin et al. (2021) only studied this method applied to quadratic functions. Published in Transactions on Machine Learning Research (03/2024) 4.3 Higher Dimensional Results We can also consider f : Rk R by doing all operations coordinate-wise. Following Shin et al. (2021), the natural extension for the fractional gradient descent operator for f is: Cδα,β c f(x) = h Cδα,β c(1)f1,x(x(1)), ..., Cδα,β c(k)fk,x(x(k)) i . Here fj,x(y) = f(x + (y x(j))e(j)) with e(j) the unit vector in the jth coordinate. Generalizations of Theorem 14 can now be proven. We first assume that f has a special form, namely that it is a sum of single dimensional functions. Corollary 15. Suppose f : Rk R is twice differentiable, L-smooth, and µ-strongly convex. Assume f is of form f(x) = Pk i=1 fi(x(i)) where fi : R R. Set 0 < α < 1 and β R. If β 0, define K1, K2 as in Theorem 12; if β 0, define K1, K2 as in Theorem 13. Let the fractional gradient descent method be defined as follows. xt+1 = xt ηt Cδα,β ct f(xt) ηt = (1 K1λt K2|λt|)ϕ (1 K1λt+K2|λt|)2L for 0 < ϕ < 2 xt ct = λt f(xt) with 1 K1λt K2|λt| > ϵ > 0 Then, this method achieves the following linear convergence rate: 1 K1λt K2|λt| 1 K1λt + K2|λt| In particular, however, this rate is at best the same as gradient descent. Proof. If f is L-smooth, each fj,x(y) is L-smooth according to the single dimensional definition and all relevant results hold for it. The same goes for µ-strong convexity. Note that taking the derivative of each fj,x(y) is the same as taking the jth partial derivative of f at x with the jth coordinate replaced by y. Thus, Cδα,β c f(x) = [Cδα,β c(1)f1(x(1)), ..., Cδα,β c(k)fk(x(k))]. Additionally, f(x) = [f 1(x(1)), ..., f k(x(k))]. As such, the optimal point of f in the ith coordinate is the optimal point of fi. Therefore, Theorem 14 holds coordinate wise. This immediately gives the result. In the more general case, there is a single term that does not immediately generalize to higher dimensions proportional to | f(xt)|, |xt x | if the absolute value is taken element wise. For the single dimension case, convexity allowed us to bypass this issue, but for higher dimensions, we have to use Cauchy Schwarz. Theorem 16. Suppose f : Rk R is twice differentiable, L-smooth, and µ-strongly convex. Set 0 < α < 1 and β R. If β 0, define K1, K2 as in Theorem 12; if β 0, define K1, K2 as in Theorem 13. Let the fractional gradient descent method be defined as follows. xt+1 = xt ηt Cδα,β ct f(xt) ϕ L (1 K1λt) 2K2|λt| µ (1 K1λt+K2|λt|)2 for 0 < ϕ < 2 xt ct = λt f(xt) with ϕ(1 K1λt) Then, this method achieves the following linear convergence rate: 1 (2 ϕ)µ(1 K1λt) h ϕ L(1 K1λt) 2K2|λt| (1 K1λt + K2|λt|)2 In particular, however, this rate is at best the same as gradient descent. Published in Transactions on Machine Learning Research (03/2024) Proof. Follows by very similar logic to Theorem 14 except generalized to higher dimensions by taking operations coordinate-wise. The key difference as mentioned above is a single term whose bound requires more care. Similarly, the rate is at best the same as gradient descent where we expect a 1 µ L coefficient since K2|λt| 0 See Appendix B.5 for details. As a remark on the condition on λt, in the special case of λt 0, K1 0, we have: = λt < 1 K1 + 2k K2L For implementing the learning rate for the method in this section, we note that as L , ηt 0. Since L can always be increased while still satisfying smoothness, in principle the learning rate just needs to be chosen small enough same as in standard gradient descent. Similarly, for sufficiently small λt, the condition will always be satisfied. Of course, finding optimal hyper-parameters is important and techniques like line search can be employed to improve convergence rate. 5 Smooth and Convex Optimization This section considers optimizing a L-smooth and convex function, f : R R. For similar reasons as the previous section, the case p = 1 is deferred for future work. The fractional gradient descent method for this section will be identical to the previous section. Similarly to the prior section, we split into two cases: 1) that f is a sum of single dimensional functions and 2) that f is a general higher dimensional function. For the first case, we have the following. Theorem 17. Suppose f : Rk R is twice differentiable, L-smooth, and convex. Assume f is of form f(x) = Pk i=1 fi(x(i)) where fi : R R. Set 0 < α < 1 and β R. If β 0, define K1, K2 as in Theorem 12; if β 0, define K1, K2 as in Theorem 13. Let the fractional gradient descent method be defined as follows. xt+1 = xt ηCδα,β ct f(xt) L h 2(1 λK1 |λ|K2) 1 λK1+|λ|K2)2 1 1 λK1 |λ|K2 xt ct = λ f(xt) with 1 λK1 > Then, this method achieves the following O(1/T) convergence rate with x T as the average of all xt for 1 t T: f( x T ) f(x ) L x0 x 2 2 4 1 λK1 |λ|K2 1 λK1+|λ|K2 In particular, however, this rate is at best the same as gradient descent. Proof. By similar reasoning as Corollary 15, we can reduce to the single dimensional case. This case follows by applying Theorem 12 and Theorem 13 to bound the fractional gradient descent operator with an approximation in terms of the first derivative. Then, the proof of standard gradient descent rate for smooth and convex functions can be followed with additional error terms from the approximation depending on xt ct. This rate is at best the same as gradient descent (which has coefficient L 2T ) since |λ|K2 0. See Appendix C.1 for details. For the general case, just as in the previous section, there is a single term proportional to | f(xt)|, |xt x | if the absolute value is taken element wise that must be dealt with more carefully. This term ends up making the method somewhat more complicated. Published in Transactions on Machine Learning Research (03/2024) Theorem 18. Suppose f : Rk R is twice differentiable, L-smooth, and convex. Set 0 < α < 1 and β R. If β 0, define K1, K2 as in Theorem 12; if β 0, define K1, K2 as in Theorem 13. Let the fractional gradient descent method be defined as follows. xt+1 = xt ηt Cδα,β ct f(xt) L h 2(1 λt K1 |λt|K2) (1 λt K1+|λt|K2)2 1 1 λt K1 i = 1 L(1 λt K1) h s2 t 4st 1 (st+1)2 i xt ct = λt f(xt) with 1 λt K1 = st|λt|K2 s2 t+1 4st+1 1 + 2 st+1 (st+1)2 s2 t 4st 1 with s0 > Then, this method achieves the following O(1/T) convergence rate with x T as the average of all xt for 1 t T: f( x T ) f(x ) L s2 0 4s0 1 + 2 x0 x 2 2 T . In particular, however, this rate is at best the same as gradient descent. Proof. Follows by very similar logic to Theorem 17. The key difference as mentioned above is a single term whose bound requires more care. This term ends up causing difficulties when passing to telescope sum. This motivates step-dependent ηt and λt which are determined by an underlying increasing sequence st. This rate is at best the same as standard gradient descent since h (s0+1)2 s2 0 4s0 1 + 2 i 1 and gradient descent s rate has coefficient L 2T . See Appendix C.2 for details. For implementation, this theorem requires first choosing the s sequence satisfying the algebraic condition. Then, with an approximation of L (which gradient descent also requires), λt and ηt can be computed following the formulas. In general, the condition on λt has two solutions: one positive and one negative. 6 Smooth and Non-Convex Optimization 6.1 Fractional Gradient Descent Method This section will focus on fractional gradient descent in a smooth and non-convex setting. This setting turns out to be the most straightforward to generalize to p = 1 and demonstrates potential for fractional gradient descent to be more natural. Berger et al. (2020) approaches this setting by varying the learning rate in gradient descent. For this section, we will adapt their proof in 3.1 to fractional gradient descent. We use a similar fractional gradient descent method as the previous sections defined as: xt+1 = xt ηt C p δα ctf(xt) where (for f : R R): C p δα c f(x) = CDα c f(x)Γ(2 α) x c |x c|α p+1. The difference from previous sections is that the second term is dropped since Lemma 11 no longer holds and the exponent of |x c| now depends on p. For higher dimensional f, we use the same extension as in the previous sections where each component follows this definition. Namely, for f : Rk R, the definition is C p δα c f(x) = h C p δα c(1)f1,x(x(1)), ..., C p δα c(k)fk,x(x(k)) i . Published in Transactions on Machine Learning Research (03/2024) 6.2 Convergence Results For easing convergence analysis computation, we leverage the following Lemma. Lemma 19. Suppose f : R R is continuously differentiable. If f is (L, p)-Hölder smooth, α (0, 1], then f (x)|x c|1 p C p δα c f(x) K|x c| where K = L(1 α) (1+p α). Proof. Using Corollary 9, rearranging terms gives this bound directly. Now, we present a O(1/T) convergence result to a stationary point as follows. Theorem 20. Suppose f : Rk R is continuously differentiable, (L, p)-Hölder smooth, and x, f(x) f . Set 0 < α < 1. Define K as in Lemma 19. Let the fractional gradient descent method be defined as follows. xt+1 = xt ηC p δα ctf(xt) x(i) t c(i) t = λ p r f x(i) (xt) with 0 < λ < pq (1+p)(λ1 p Kλ) L(λ1 p+Kλ)1+p Then, this method achieves the following convergence rate: min 0 t T f(xt) 1+1/p 1+1/p f(x0) f ψ = η λ1 p Kλ L 1 + pηp(λ1 p + Kλ)1+p . Proof. Follows by applying Corollary 9 to bound the fractional derivative with an approximation in terms of the first derivative. Then, the proof as aforementioned from 3.1 of Berger et al. (2020) can be followed with additional error terms from the approximation. Careful choice of ct is required in order for the degrees of various terms to allow simplification. See Appendix D.1 for details. The key difference in the fractional gradient descent operator from previous sections is the exponent is now α p + 1 instead of α. If we choose α = p (assuming 0 < p < 1), the total order of |x c| terms becomes 0 with a remaining sgn(x c). In this case, the fractional gradient descent operator is (up to proportionality and sign correction) just a fractional derivative. Theorem 20 tells us that convergence can be achieved in this case with constant learning rate with proper choice of ct which means that our fractional gradient descent step at any t is directly proportional to the fractional derivative s value. In a sense, this means that the fractional derivative is natural for optimizing f when setting α = p. 7 Finding the Advantage of Fractional Gradient Descent 7.1 Experiments We can see that there is an obvious gap between the motivation of doing better than standard gradient descent and the theoretical results. While the theoretical results are crucial guarantees on worst case rates, they currently cannot explain how fractional gradient descent can do better. Thus, this section is dedicated to experiments trying to explain this gap. Published in Transactions on Machine Learning Research (03/2024) 0 20 40 60 80 100 Number of iterations Learning Rate AT-CFGD Optimal Fractional Descent guided by Gradient Optimal Standard Gradient Descent Optimal Fractional Descent guided by Gradient Theoretical Figure 2: Learning rates used by different methods in Figure 1 with the theoretical learning rate given by Corollary 15 added. The first thing to note is that in the experiments recorded in Figure 1, the learning rate used is not that of Corollary 15, rather it is the optimal learning rate for quadratic functions from 3.3 in Shin et al. (2021) given by: η t = Axt + b, dt for the function 1 2x T Ax + b T x + c with descent rule xt+1 = xt ηtdt. We plot in Figure 2 exactly what the optimal learning rates used in Figure 1 are and how they can compare to the theoretical learning rate given by Corollary 15. The optimal learning rate for gradient descent and the theoretical learning rate from Corollary 15 tend to be significantly smaller than the optimal learning rate for fractional methods. It should be noted that the actual gradient norms may differ so the fairest comparison is between the optimal and theoretical learning rate of our method (Fractional Descent guided by Gradient). From the equation in the discussion deriving Theorem 14: (xt+1 x )2 [1 (2 ϕ)ηtµ(1 K1λt K2|λt|)](xt x )2, we see that a larger learning rate directly improves the rate of convergence (assuming the larger learning rate is still valid with respect to prior assumptions). Thus, it becomes apparent that in some cases, the theoretical learning rate being much lower than necessary explains why the theoretical convergence rate is no better than that of gradient descent. One question that could then be raised is if the current data of the assumptions is enough to be able to prove a better bound that perhaps involves a speed-up over standard gradient descent. The data with respect to a smooth and strongly convex function is two numbers - L, µ. Other than this, the function is a black box and we would expect any bound based on these assumptions to be equivalent for any functions with the same L, µ assuming same hyper-parameter choices. Looking at Figure 3 and Figure 4, we observe that despite the L, µ data being identical, the fractional gradient descent convergence rates are vastly different and in particular, there is no agreement over beating standard gradient descent. This suggests that to prove whether this fractional gradient descent method is better/worse than gradient descent requires more data than just µ-strong convexity and L-smoothness about the function. Note that in order to construct this example, λt had to be chosen very close to the constraint which meant that the theoretical learning rate and convergence rate are both extremely slow. This means that it may be possible to prove advantage of fractional gradient descent with tighter constraints on λt. Published in Transactions on Machine Learning Research (03/2024) 0 20 40 60 80 100 Number of iterations (t) Fractional Descent guided by Gradient Standard Gradient Descent 0 20 40 60 80 100 Number of iterations Learning Rate Fractional Descent guided by Gradient Optimal Standard Gradient Descent Optimal Fractional Descent guided by Gradient Theoretical Figure 3: Comparison of fractional and standard gradient descent methods for f(x) = x T diag([10, 1, 1, 1, 1])x with x0 = (1, 10, 5, 8, 6). Hyper-parameters as in Corollary 15 are α = 1/2, β = 4/10, λt = 0.0675 0 20 40 60 80 100 Number of iterations (t) Fractional Descent guided by Gradient Standard Gradient Descent 0 20 40 60 80 100 Number of iterations Learning Rate Fractional Descent guided by Gradient Optimal Standard Gradient Descent Optimal Fractional Descent guided by Gradient Theoretical Figure 4: Comparison of fractional and standard gradient descent methods for f(x) = x T diag([10, 1, 7, 9, 4])x with x0 = (1, 10, 5, 8, 6). Hyper-parameters as in Corollary 15 are α = 1/2, β = 4/10, λt = 0.0675 7.2 Quadratic Function Analysis Motivated by these empirical findings, we present one final result applying the fractional gradient descent method from Section 4.1 to quadratic functions. This result is important in that it shows that with more information than just L, µ, we are able to understand when the fractional gradient descent method will outperform gradient descent (at least in this limited case). Theorem 21. Suppose f(x) = 1 2x T Ax + b T x + y0 for some positive definite k k A with elements aij. Let µ be its smallest eigenvalue and L be its largest eigenvalue. Choose β 0, α (0, 1), and R. Choose c from x so that x c = λ f(x) with λ = (β γ)L where γ = 1 α 2 α. Then, we have the following: 1. Let D = I L diag(A), A = DA, and b = Db. Then, Cδα,β c f(x) = A x b . In particular, if A is non-symmetric, this fractional gradient operator is not given by the gradient of any function. 2. Suppose that y Rk, we have that µ y 2 2 y T A y L y 2 2 and suppose that is chosen so that all elements of D are positive. Then, we have the following linear convergence rate for fractional gradient descent: xt x 2 2 1 µ t x0 x 2 2. Published in Transactions on Machine Learning Research (03/2024) In particular, if the condition number of A , κ(A ) = L µ , is smaller than κ(A) = L µ , then fractional gradient descent achieves a faster convergence rate (in number of iterations). Proof. The first point follows from a straightforward calculation of the fractional gradient descent operator on quadratic functions. The second point follows from a proof almost identical to that of gradient descent on quadratic functions. The linear rate of convergence for gradient descent is 1 µ L so if κ(A ) < κ(A), the convergence will be faster. See Appendix E.1 for more details. For one simple application of this theorem, consider the case where A is diagonal. Then A is also diagonal with elements of the form di = aii(1 aii L ) on the diagonal. This case corresponds to all of the above experiments. We note that µ and L in this case are simply the smallest and largest eigenvalues of A since A is positive definite. Some µ = aii will be the smallest eigenvalue of A and some L = ajj will be the largest eigenvalue. In A , these elements become µ(1 µ L ), L(1 ) with 0 < < 1. We then have the following: L(1 ) µ(1 µ µ L L L µ L This appears to contradict Figure 4 since this seems to imply that fractional gradient descent should always outperform gradient descent on these simple quadratic functions. However, the key is that the LHS is not necessarily equal to κ(A ). In particular, di can also amplify µ < aii < L such that these components become the maximum eigenvalues of A and can shrink components to become minimum eigenvalues. To avoid this, observe that if 1 2, the maximum of the di will always be where aii = L and the minimum will always be where aii = µ. Thus, if 1 2, for these simple functions, fractional gradient descent always does at least as good as gradient descent. Figure 4 corresponds to the case of = 0.99 which does not satisfy this condition. One more point to note is that in the limit as α 1, the theorem still holds as long as β = 0. In particular, as long as β γ = 0 the theorem s convergence rate has no dependence on β and α since λ inversely scaling on β γ is effectively cancelling it out. This means that in some sense if we set β = 0, we are getting the benefits of a second order method where α = 1, β = 0 (see Section 4.1 for details on this limit) without actually using more than the first derivative. For more complicated A, applying this theorem requires comparing the condition number of A against that of A with no easy relationship like in the previous simple case. While this theorem is limited to quadratic functions, it is promising in providing a theoretical explanation of how fractional gradient descent can outperform gradient descent - a question that prior work left open. 8 Future Directions Going off of the preceding discussion, one important future direction is to search for additional assumptions to classify when this fractional gradient descent method will outperform gradient descent on general classes of functions. It is also important to develop lower bounds to prove when general fractional gradient descent methods are capable of achieving superior performance to gradient descent and to verify whether the convergence rates developed in the paper are tight. With respect to the tightness of the bounds, one potential area for further improvement is to better handle the second term of Cδα,β c f(x) involving a fractional derivative of order between 1 and 2 with coefficient scaling with β. Currently Lemma 11 handles bounding this term, however, it is rather weak in that it loses any information that would make this term useful for convergence. This is reflected in how sending β 0 tends to improve the theoretical rate. Even in Theorem 21, β could be set to 0 without any loss in the results. If we consider Theorem 4, we have for 1 < α < 2 that limα 1 CDα c f(x) = sgn(x c)(f (x) f (c)). If c is chosen properly, this could potentially work similarly to acceleration algorithms in gradient descent. More work is needed to understand this term outside the limit case since attempting to use integration by parts to express it in terms of the gradient fails. It is important in future work to understand if this term can be useful for proving better rates. Published in Transactions on Machine Learning Research (03/2024) Another interesting direction would be to investigate the effects of changing ct. Both Shin et al. (2021) and this paper use different methods of choosing ct and it is not clear which is better since it is difficult to directly compare without more theoretical results. In addition, future work could look at applying similar strategies of relating fractional and integer derivatives to different underlying fractional derivatives such as the Reimann-Liouville derivative. One important future direction is to show convergence results for p = 1 for more settings. In this paper, we only discuss p = 1 for smooth and non-convex functions while leaving the other two settings for future work. As aforementioned the difficulties of this setting stem from the norm not being induced by an inner product on L1+p spaces making it difficult to expand terms of the form x y 1+p 1+p or even just |x y|1+p for some arbitrary x, y. Another line of thought is to bypass the need for inequalities relating fractional and integer derivatives by using convexity and smoothness definitions that only involve fractional derivatives. This would be ideal since the current proof methodology loses a lot of information about the fractional derivative in connecting it to the gradient causing the convergence rates to be bounded by those of gradient descent. One direction that may be promising is using fractional Taylor series like in Usero (2008) to construct these definitions. However, these series for Caputo Derivatives are somewhat limited in how they need to be centered at the terminal point of the derivative. If we ignore the terminal point and try to use simple assumptions like Lipschitz continuity of the fractional derivative, bad behavior around the terminal point (where the fractional derivative rapidly tends to 0) prevents it from being applicable even to simple test functions like |x|1+α. One potential solution would be to set the terminal point depending on x only and not the iteration number and consider Lipschitz continuity properties after integrating this dependency. Following this line of thought, we could take the conditions from point 2 of Theorem 21 and replace A with the total derivative of the fractional gradient descent operator. This could serve as a natural generalization of smoothness and strong convexity. In conclusion, this paper proves convergence results for fractional gradient descent in smooth and strongly convex, smooth and convex, smooth and non-convex, and quadratic settings. Future work is needed in extending these results to other classes of functions and other methods to show a general guaranteed benefit over gradient descent. Acknowledgments I thank Prof. Quanquan Gu for his constructive comments on this manuscript. Abdon Atangana. Chapter 5 - fractional operators and their applications. In Abdon Atangana (ed.), Fractional Operators with Constant and Variable Order with Application to Geo-Hydrology, pp. 79 112. Academic Press, 2018. ISBN 978-0-12-809670-3. doi: https://doi.org/10.1016/B978-0-12-809670-3.00005-9. URL https://www.sciencedirect.com/science/article/pii/B9780128096703000059. Guillaume O. Berger, P.-A. Absil, Raphaël M. Jungers, and Yurii Nesterov. On the quality of firstorder approximation of functions with hölder continuous gradient. Journal of Optimization Theory and Applications, 185(1):17 33, Apr 2020. ISSN 1573-2878. doi: 10.1007/s10957-020-01632-x. URL https://doi.org/10.1007/s10957-020-01632-x. S. David, Juan López Linares, and Eliria Pallone. Fractional order calculus: Historical apologia, basic concepts and some applications. Revista Brasileira de Ensino de Física, 33:4302 4302, 12 2011. doi: 10.1590/S1806-11172011000400002. Pham Viet Hai and Joel A. Rosenfeld. The gradient descent method from the perspective of fractional calculus. Mathematical Methods in the Applied Sciences, 44(7):5520 5547, 2021. doi: https://doi.org/10. 1002/mma.7127. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/mma.7127. Xiaohui Han and Jianping Dong. Applications of fractional gradient descent method with adaptive momentum in bp neural networks. Applied Mathematics and Computation, 448:127944, 2023. ISSN 0096-3003. doi: Published in Transactions on Machine Learning Research (03/2024) https://doi.org/10.1016/j.amc.2023.127944. URL https://www.sciencedirect.com/science/article/ pii/S0096300323001133. Shujaat Khan, Imran Naseem, Muhammad Ammar Malik, Roberto Togneri, and Mohammed Bennamoun. A fractional gradient descent-based rbf neural network. Circuits, Systems, and Signal Processing, 37(12): 5311 5332, Dec 2018. ISSN 1531-5878. doi: 10.1007/s00034-018-0835-3. URL https://doi.org/10.1007/ s00034-018-0835-3. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. Yuri Luchko. General fractional integrals and derivatives and their applications. Physica D: Nonlinear Phenomena, 455:133906, 2023. ISSN 0167-2789. doi: https://doi.org/10.1016/j.physd.2023.133906. URL https://www.sciencedirect.com/science/article/pii/S0167278923002609. Sriram Nagaraj. Optimization and learning with nonlocal calculus, 2021. Yu Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1):381 404, Aug 2015. ISSN 1436-4646. doi: 10.1007/s10107-014-0790-0. URL https://doi.org/10. 1007/s10107-014-0790-0. Keith Oldham and Jerome Spanier. The fractional calculus theory and applications of differentiation and integration to arbitrary order. Elsevier, 1974. Feng Qi and Qiu-Ming Luo. Bounds for the ratio of two gamma functions: from wendel s asymptotic relation to elezović-giordano-pečarić s theorem. Journal of Inequalities and Applications, 2013(1), November 2013. ISSN 1029-242X. doi: 10.1186/1029-242x-2013-542. URL http://dx.doi.org/10.1186/1029-242X-2013-542. Sebastian Ruder. An overview of gradient descent optimization algorithms. Co RR, abs/1609.04747, 2016. URL http://arxiv.org/abs/1609.04747. Dian Sheng, Yiheng Wei, Yuquan Chen, and Yong Wang. Convolutional neural networks with fractional order gradient method. Neurocomputing, 408:42 50, 2020. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom. 2019.10.017. URL https://www.sciencedirect.com/science/article/pii/S0925231219313918. Yeonjong Shin, Jérôme Darbon, and George Em Karniadakis. A caputo fractional derivative-based algorithm for optimization, 2021. Yeonjong Shin, Jerome Darbon, and George Karniadakis. Accelerating gradient descent and adam via fractional gradients. Neural Networks, 161, 04 2023. doi: 10.1016/j.neunet.2023.01.002. Jia Tang. Fractional gradient descent-based auxiliary model algorithm for fir models with missing data. Complexity, 2023:7527478, Feb 2023. ISSN 1076-2787. doi: 10.1155/2023/7527478. URL https://doi. org/10.1155/2023/7527478. D. Domínguez Usero. Fractional taylor series for caputo fractional derivatives. construction of numerical schemes. 2008. URL https://api.semanticscholar.org/Corpus ID:54594739. Jian Wang, Yanqing Wen, Yida Gou, Zhenyun Ye, and Hua Chen. Fractional-order gradient descent learning of bp neural networks with caputo derivative. Neural Netw., 89(C):19 30, may 2017. ISSN 0893-6080. doi: 10.1016/j.neunet.2017.02.007. URL https://doi.org/10.1016/j.neunet.2017.02.007. Yong Wang, Yuli He, and Zhiguang Zhu. Study on fast speed fractional order gradient descent method and its application in neural networks. Neurocomputing, 489:366 376, 2022. ISSN 0925-2312. doi: https:// doi.org/10.1016/j.neucom.2022.02.034. URL https://www.sciencedirect.com/science/article/pii/ S0925231222001904. Yiheng Wei, Yuquan Chen, Songsong Cheng, and Yong Wang. A note on short memory principle of fractional calculus. Fractional Calculus and Applied Analysis, 20, 12 2017. doi: 10.1515/fca-2017-0073. Published in Transactions on Machine Learning Research (03/2024) Yiheng Wei, Yu Kang, Weidi Yin, and Yong Wang. Generalization of the gradient method with fractional order gradient direction. Journal of the Franklin Institute, 357(4):2514 2532, 2020. ISSN 0016-0032. doi: https://doi.org/10.1016/j.jfranklin.2020.01.008. URL https://www.sciencedirect.com/science/ article/pii/S0016003220300235. Xingyu Zhou. On the fenchel duality between strong convexity and lipschitz continuous gradient, 2018. A Missing Proofs in Section 3 Relating Fractional Derivative and Integer Derivative A.1 Proof of Theorem 4 CDα c f(x) = sgn(x c)n 1 f n(t) |x t|α n+1 dt = sgn(x c)n 1 n α|x t|n α sgn(x c) f n+1(t)|x t|n α n α sgn(x c)dt = sgn(x c)n f n(c)|x c|n α + Z x c f n+1(t)|x t|n αdt . As α n, this simplifies to: CDα c f(x) = sgn(x c)n(f n(c) + f n(x) f n(c) = sgn(x c)nf n(x). As α n 1, directly from the definition, CDα c f(x) = sgn(x c)n 1 Z x c f n(t)dt = sgn(x c)n 1(f n 1(x) f n 1(c)). A.2 Proof of Theorem 5 Proof. First, note that for α (0, 1], CDα c x = x c Γ(2 α)|x c|α . One interesting thing here is that this fractional derivative can be both positive and negative unlike the first derivative of a line. Also note that dζx(t) = (f (t) f (x))dt. Therefore, we begin with the following expression: CDα c f(x) f (x)CDα c (x) = 1 Γ(1 α) c |x t| α(f (t) f (x))dt c |x t| αdζx(t) = |x t| αζx(t) t=c α Γ(1 α) c |x t| α 1 sgn(x t)ζx(t)dt = ζx(t) Γ(1 α)|x t|α t=c α sgn(x c) ζx(t) |x t|α+1 dt. It remains to show that in the first term vanishes as t x which is done using L Hopital s Rule: lim t x ζx(t) Γ(1 α)|x t|α = lim t x f (t) f (x) αΓ(1 α)|x t|α 1 sgn(x t) = 0. The last equality is since α (0, 1] so α 1 0. Published in Transactions on Machine Learning Research (03/2024) A.3 Proof of Corollary 9 Proof. Note that (L, p)-Hölder smooth implies that ζx(t) L 1+p|x t|1+p. Also 1 + p α > 0 since p > 0, α (0, 1]. Thus, f (x)(x c) Γ(2 α)|x c|α CDα c f(x) = ζx(c) Γ(1 α)|x c|α + α sgn(x c) ζx(t) |x t|α+1 dt L|x c|1+p α (1 + p)Γ(1 α) + αL sgn(x c) (1 + p)Γ(1 α) c |x t|p αdt L|x c|1+p α (1 + p)Γ(1 α) + αL sgn(x c) (1 + p)Γ(1 α) sgn(x c)|x c|1+p α = L (1 + p)Γ(1 α)|x c|1+p α(1 + α 1 + p α) = L Γ(1 α)(1 + p α)|x c|1+p α. The other direction of the inequality follows by the same logic using instead ζx(t) L 1+p|x t|1+p and using instead of . B Missing Proofs in Section 4 Smooth and Strongly Convex Optimization B.1 Proof of Lemma 11 Proof. L-smooth implies that f (x) L. Since α (1, 2], CDα c f(x) = sgn(x c) f (t) |x t|α 1 dt L |x t|α 1 dt Γ(2 α) |x c|2 α 2 α sgn(x c)L = L Γ(3 α)|x c|2 α. The bound holds since the integral is in the positive direction due to sgn(x c). The proof for µ-strongly convex is identical except using f (x) µ. B.2 Proof of Theorem 12 Proof. We begin by upper bounding Cδα,β c f(x). Note that since both terms in it have an (x c) in the denominator, sgn(x c) determines which inequality must be used. Let R denote Re LU. Then, Cδα,β c f(x) f (x) µ1 α 2 αR(x c) + L1 α 2 αR(c x) + LβR(x c) µβR(c x) = f (x) µγR(x c) + LγR(c x) + LβR(x c) µβR(c x) = f (x) + (Lβ µγ)R(x c) + (Lγ µβ)R(c x). The first 3 terms on the RHS come from bounding the first term of Cδα,β c f(x) and the latter two terms come from bounding the secon term of Cδα,β c f(x). Similarly, we find that Cδα,β c f(x) f (x) + (µβ Lγ)R(x c) + (µγ Lβ)R(c x). Published in Transactions on Machine Learning Research (03/2024) Observe that (Lβ µγ) (µβ Lγ) 2 = (L µ)β + (L µ)γ (Lγ µβ) (µγ Lβ) (Lβ µγ) + (µβ Lγ) 2 = (L + µ)β (L + µ)γ 2 = (L + µ) (Lγ µβ) + (µγ Lβ) 2 = (L + µ) Using these equations gives Cδα,β c f(x) f (x) + (L µ) 2 (β + γ)R(x c) + (L + µ) 2 (β γ)R(x c) 2 (β + γ)R(c x) (L + µ) 2 (β γ)R(c x) = f (x) + (L + µ) 2 (β γ)(x c) + (L µ) 2 (β + γ)|x c| = f (x) + K1(x c) + K2|x c|. Similarly, the lower bound is Cδα,β c f(x) f (x) + K1(x c) K2|x c|. Putting both of these bounds together gives the desired result: K2|x c| Cδα,β c f(x) f (x) K1(x c) K2|x c|. B.3 Proof of Theorem 13 Proof. The proof begins similarly as in Theorem 12, except β determines the sign as well. Cδα,β c f(x) f (x) µ1 α 2 αR(x c) + L1 α 2 αR(c x) + LR(β(x c)) µR(β(c x)) = f (x) µ1 α 2 αR(x c) + L1 α 2 αR(c x) LβR(c x) + µβR(x c) = f (x) + (µγα,β)R(x c) (Lγα,β)R(c x) = f (x) + K1(x c) + K2|x c|. Similarly, we find that Cδα,β c f(x) f (x) + (Lγα,β)R(x c) (µγα,β)R(c x) = f (x) + K1(x c) K2|x c|. B.4 Proof of Theorem 14 Proof. We start with 3 point identity. Note this is where the case p = 1 breaks down since the L1+p norm is not induced by an inner product if p = 1. (xt+1 x )2 = (xt+1 xt)2 + 2(xt+1 xt)(xt x ) + (xt x )2 = (ηt Cδα,β ct f(xt))2 2ηt Cδα,β ct f(xt)(xt x ) + (xt x )2. Published in Transactions on Machine Learning Research (03/2024) We begin by bounding the first term: (Cδα,β ct f(xt))2 = ((Cδα,β ct f(xt) f (xt) K1(xt ct)) + (f (xt) + K1(xt ct)))2 K2 2(xt ct)2 + 2K2|xt ct||f (xt) + K1(xt ct)| + (f (xt) + K1(xt ct))2. One observation here is that we would like everything to be in terms of f (xt)2 to make canceling more convenient later. For this purpose, choose xt ct = λtf (xt). Thus, we get (Cδα,β ct f(xt))2 K2 2(λt)2(f (xt))2 + 2K2|λt||1 K1λt|(f (xt))2 + (1 K1λt)2(f (xt))2 = (K2|λt| + |1 λt K1|)2(f (xt))2. Now, choose some ϕ (0, 2). We now bound the second term as follows (note we assume here ηt 0 since unlike the past section this makes sense): 2ηt Cδα,β ct f(xt)(xt x ) 2ηt K2|xt ct||xt x | 2ηtf (xt)(xt x ) 2ηt K1(xt ct)(xt x ) = 2ηt K2|λt||f (xt)||xt x | 2ηtf (xt)(1 λt K1)(xt x ) = 2ηt K2|λt|(f (xt))(xt x ) 2ηtf (xt)(1 λt K1)(xt x ) = 2ηtf (xt)(xt x )(1 K1λt K2|λt|) L (1 K1λt K2|λt|)f (xt)2 (2 ϕ)ηtµ(1 K1λt K2|λt|)(xt x )2. We can drop the absolute value signs due to the convexity assumption (note this works specifically for single dimension f). We need both terms of the RHS to be negative for proving convergence which puts a condition on λt: 1 K1λt K2|λt| > 0. This condition gives that 1 K1λt > 0. Putting everything together gives: (xt+1 x )2 η2 t ((K2|λt| + 1 λt K1)2(f (xt))2 ϕηt L (1 K1λt K2|λt|)f (xt)2 (2 ϕ)ηtµ(1 K1λt K2|λt|)(xt x )2 + (xt x )2. Now, we can figure out the learning rate since we want the first term on the RHS to be dominated by the second term. η2 t (K2|λt| + 1 λt K1)2 ϕηt L (1 K1λt K2|λt|) = ηt = (1 K1λt K2|λt|)ϕ (1 K1λt + K2|λt|)2L. Finally, this leads to a convergence rate as follows: (xt+1 x )2 [1 (2 ϕ)ηtµ(1 K1λt K2|λt|)] (x0 x )2 1 K1λt K2|λt| 1 K1λt + K2|λt| This is a linear rate of convergence since ϵ guarantees that this equation is a contraction as t . The rate is at best as good as gradient descent since K2|λt| 0. Published in Transactions on Machine Learning Research (03/2024) B.5 Proof of Theorem 16 Proof. We follow the discussion deriving Theorem 14. We start with 3 point identity: xt+1 x 2 2 = xt+1 xt 2 2 + 2 xt+1 xt, xt x + xt x 2 2 = η2 t Cδα,β ct f(xt) 2 2 2ηt Cδα,β ct f(xt), xt x + xt x 2 2. For bounding the first term, this can be done coordinate wise. Cδα,β ct f(xt) 2 2 = i=1 (Cδα,β c(i) t fi,xt(x(i) t ))2 (K2|λt| + |1 λt K1|)2 f(xt) 2 2. For bounding the second term (note | | is taken element-wise, ηt 0 is assumed): 2ηt Cδα,β ct f(xt), (xt x ) 2ηt K2|λt| | f(xt)|, |xt x | 2ηt(1 λt K1) f(xt), xt x 2ηt K2|λt|[ f(xt) 2 xt x 2] 2ηt(1 λt K1) f(xt), xt x µ |λt| f(xt) 2 2 2ηt(1 λt K1) f(xt), xt x . We see that 1 λt K1 > 0 is necessary for proving convergence since we need the latter term to be negative to prove convergence. We bound the second term like in the single dimensional case as: 2ηt(1 λt K1) f(xt), xt x ϕηt L (1 λt K1) f(xt) 2 2 (2 ϕ)ηtµ(1 λt K1) xt x 2 2. Gathering all terms yields: xt+1 x 2 2 η2 t (K2|λt| + 1 λt K1)2 f(xt) 2 2 + 2ηt K2 µ |λt| f(xt) 2 2 L (1 λt K1) f(xt) 2 2 (2 ϕ)ηtµ(1 λt K1) xt x 2 2 + xt x 2 2. For the 3rd term on the RHS to dominate the first two terms, the learning rate is: η2 t (K2|λt| + 1 λt K1)2 + 2ηt K2 L (1 λt K1) = ηt(K2|λt| + 1 λt K1)2 ϕ L(1 λt K1) 2K2|λt| ϕ L(1 K1λt) 2K2|λt| µ (1 K1λt + K2|λt|)2 . This in turn gives a condition on λt since the numerator needs to be strictly greater than 0 for this to make sense: ϕ(1 K1λt) We see that this new condition is consistent with the past condition that (1 λt K1) > 0. Finally, we can write down the rate of linear convergence: xt+1 x 2 2 (1 (2 ϕ)µ(1 K1λt)ηt) xt x 2 2 1 (2 ϕ)µ(1 K1λt) h ϕ L(1 K1λt) 2K2|λt| (1 K1λt + K2|λt|)2 Published in Transactions on Machine Learning Research (03/2024) Similar to the proof of Theorem 14 in Appendix B.4, this is a linear rate of convergence due to ϵ and this rate is at best as good as gradient descent since K2|λt| 0. C Missing Proofs in Section 5 Smooth and Convex Optimization C.1 Proof of Theorem 17 Proof. By similar reasoning as Corollary 15, we can reduce to the single dimensional case. We start by applying L-smoothness. f(xt+1) f(xt) f (xt)(xt+1 xt) + L 2 (xt+1 xt)2 = ηf (xt)Cδα,β ct f(xt) + Lη2 2 (Cδα,β ct f(xt))2. We bound the first term as: ηf (xt)Cδα,β ct f(xt) ηf (xt)2 + ηλK1f (xt)2 + η|λ|K2f (xt)2 = η(1 λK1 |λ|K2)f (xt)2. We bound the second term as: (Cδα,β ct f(xt))2 (|1 λK1| + |λ|K2)2f (xt)2. Putting everything together yields: f(xt+1) f(xt) ( η(1 λK1 |λ|K2) + Lη2 2 (|1 λK1| + |λ|K2)2)f (xt)2. For this to converge, we need the RHS to be negative which means that 1 λK1 |λ|K2 > 0. Therefore, 1 λK1 > |λ|K2 > 0. Now, we use 3 point identity to proceed. Note this is where the case p = 1 breaks down since the L1+p norm is not induced by an inner product if p = 1. (xt+1 x )2 = (xt+1 xt)2 + 2(xt+1 xt)(xt x ) + (xt x )2 = η2(Cδα,β ct f(xt))2 2ηCδα,β ct f(xt)(xt x ) + (xt x )2. We now bound the middle term on the RHS. Note that the following only works due to f being convex and single dimensional. 2ηCδα,β ct f(xt)(xt x ) 2η|λ|K2|f (xt)||xt x | 2ηf (xt)(1 λK1)(xt x ) = 2η|λ|K2(f (xt))(xt x ) 2ηf (xt)(1 λK1)(xt x ) = 2η(1 λK1 |λ|K2)f (xt)(xt x ) 2η(1 λK1 |λ|K2)(f(xt) f(x )). Putting everything together gives a bound on f(xt) f(x ). f(xt) f(x ) 1 2η(1 λK1 |λ|K2)((xt x )2 (xt+1 x )2) + η(1 λK1 + |λ|K2)2 2(1 λK1 |λ|K2) (f (xt))2. Combining this with the bound on f(xt+1) f(xt) yields: f(xt+1) f(x ) 1 2η(1 λK1 |λ|K2)((xt x )2 (xt+1 x )2) + η (1 λK1 + |λ|K2)2 2(1 λK1 |λ|K2) (1 λK1 |λ|K2) + Lη 2 (1 λK1 + |λ|K2)2 (f (xt))2. Published in Transactions on Machine Learning Research (03/2024) Now, we derive the learning rate η as follows so that the f (xt)2 terms vanish. 2 (1 λK1 + |λ|K2)2 = (1 λK1 |λ|K2) (1 λK1 + |λ|K2)2 2(1 λK1 |λ|K2) 2(1 λK1 |λ|K2) (1 λK1 + |λ|K2)2 1 1 λK1 |λ|K2 We require this learning rate to be positive (since this was assumed throughout the proof) so we get a condition on λ. 2(1 λK1 |λ|K2) 1 λK1 + |λ|K2)2 1 1 λK1 |λ|K2 = 1 λK1 |λ|K2 1 λK1 + |λ|K2 = (1 λK1)(1 1 2) |λ|K2(1 + 1 = 1 λK1 > |λ|K2 We can now write the earlier bound as: f(xt+1) f(x ) L 4 1 λK1 |λ|K2 1 λK1+|λ|K2 2 2 ((xt x )2 (xt+1 x )2). Applying telescope sum and bounding the LHS using convexity gives the desired result: f( x T ) f(x ) L(x0 x )2 4 1 λK1 |λ|K2 1 λK1+|λ|K2 This rate is at best the same as standard gradient descent since |λ|K2 0. C.2 Proof of Theorem 18 Proof. We begin with L-smoothness. f(xt+1) f(xt) f(xt), xt+1 xt + L 2 xt+1 xt 2 2 = ηt f(xt), Cδα,β ct f(xt) + Lη2 t 2 Cδα,β ct f(xt) 2 2. We bound the first term as: ηt f(xt), Cδα,β ct f(xt) ηt(1 λt K1 |λt|K2) f(xt) 2 2. We bound the second term as: Cδα,β ct f(xt) 2 2 (|1 λt K1| + |λt|K2)2 f(xt) 2 2. Putting everything together yields: f(xt+1) f(xt) ( ηt(1 λt K1 |λt|K2) + Lη2 t 2 (|1 λt K1| + |λt|K2)2) f(xt) 2 2. For this to converge, we need the RHS to be negative which means that 1 λt K1 > |λt|K2 > 0. Now, we use 3 point identity to proceed. xt+1 x 2 2 = xt+1 xt 2 2 + 2 xt+1 xt, xt x + xt x 2 2 = η2 t Cδα,β ct f(xt) 2 2 2ηt Cδα,β ct f(xt), xt x + xt x 2 2. Published in Transactions on Machine Learning Research (03/2024) For bounding the second term (note | | is taken element-wise, ηt 0 is assumed): 2ηt Cδα,β ct f(xt), (xt x ) 2ηt K2|λt| | f(xt)|, |xt x | 2ηt(1 λt K1) f(xt), xt x 2ηt K2|λt| f(xt) 2 xt x 2 2ηt(1 λt K1) f(xt), xt x 2ηt K2|λt|L xt x 2 2 2ηt(1 λt K1)(f(xt) f(x )). Putting everything together gives a bound on f(xt) f(x ). f(xt) f(x ) 1 2ηt(1 λt K1)( xt x 2 2 xt+1 x 2 2) + |λt|K2L 1 λt K1 xt x 2 2 + ηt(1 λt K1 + |λt|K2)2 2(1 λt K1) f(xt) 2 2. Combining this with the bound on f(xt+1) f(xt) yields: f(xt+1) f(x ) 1 2ηt(1 λt K1)( xt x 2 2 xt+1 x 2 2) + |λt|K2L 1 λt K1 xt x 2 2 " (1 λt K1 + |λt|K2)2 2(1 λt K1) (1 λt K1 |λt|K2) 2 (1 λt K1 + |λt|K2)2 # Solving for ηt such that the f(xt) 2 2 terms vanish yields: 2(1 λt K1 |λt|K2) (1 λt K1 + |λt|K2)2 1 1 λt K1 The proof thus far assumes that ηt > 0 which creates a constraint on λt. 2(1 λt K1 |λt|K2) (1 λt K1 + |λt|K2)2 1 1 λt K1 = (1 λt K1 |λt|K2)(1 λt K1) (1 λt K1 + |λt|K2)2 > 1 = (1 λt K1)2 4|λt|K2(1 λt K1) |λt|2K2 2 > 0 = (1 λt K1 2|λt|K2)2 > 5|λt|2K2 2 = 1 λt K1 > ( 5 + 2)|λt|K2. Now, suppose 1 λt K1 = st|λt|K2 (st > 5 + 2). Then the following useful equation holds: 2(st 1) (st + 1)2|λt|K2 1 st|λt|K2 = ηt = 1 Lst|λt|K2 = stηt|λt|K2 = 1 Returning to the bound on f(xt+1) f(x ), with the chosen ηt, we have: f(xt+1) f(x ) 1 2ηt(1 λt K1) + |λt|K2L xt x 2 2 1 2ηt(1 λt K1) xt+1 x 2 2. Published in Transactions on Machine Learning Research (03/2024) For this sum to telescope, we need the following condition to hold: 1 2ηt+1(1 λt+1K1) + |λt+1|K2L 1 λt+1K1 1 2ηt(1 λt K1) = 1 2ηt+1st+1|λt+1|K2 + |λt+1|K2L st+1|λt+1|K2 1 2ηt(st|λt|K2) = (st+1 + 1)2 s2 t+1 4st+1 1 + 2 st+1 (st + 1)2 s2 t 4st 1. 5 + 2, both the LHS and RHS are decreasing functions that going from 1 as st . Thus if st+1 is chosen large enough, it will satisfy the telescoping condition. To solve for an exact cutoff would require solving a cubic equation so for practical use it is better to use a numerical solver. Now, assuming this condition holds, applying telescope sum to the f(xt+1) f(x ) inequality yields the desired result: f( x T ) f(x ) 1 2η0(1 λ0K1) + |λ0|K2L s2 0 4s0 1 + 2 x0 x 2 2 T . D Missing Proofs in Section 6 Smooth and Non-Convex Optimization D.1 Proof of Theorem 20 Proof. Throughout the proof we will use the notation [ai] to denote a vector of k elements with ith element ai. We note that all the results for single variable f hold since f satisfies the single variable (L, p)-Hölder smooth definition in each component. We start with the (L, p)-Hölder smooth property: f(xt+1) f(xt) f(xt), xt+1 xt + L 1 + p xt+1 xt p+1 p+1 = η f(xt), C p δα ctf(xt) + Lηp+1 1 + p C p δα ctf(xt) p+1 p+1 x(i) (xt) , f x(i) (xt)|x(i) t c(i) t |1 p + η f x(i) (xt) , K|x(i) t c(i) t | + Lηp+1 1 + p C p δα ctf(xt) p+1 p+1 = η(λ1 p Kλ) f x(i) (xt) 1+1/p + Lηp+1 1 + p C p δα ctf(xt) p+1 p+1 = η(λ1 p Kλ) f(xt) 1+1/p 1+1/p + Lηp+1 1 + p C p δα ctf(xt) p+1 p+1 η(λ1 p Kλ) f(xt) 1+1/p 1+1/p f x(i) (xt)|x(i) t c(i) t |1 p + K|x(i) t c(i) t | η(λ1 p Kλ) f(xt) 1+1/p 1+1/p 1 + p (λ1 p + Kλ)p+1 " f x(i) (xt) = η(λ1 p Kλ) f(xt) 1+1/p 1+1/p + Lηp+1 1 + p (λ1 p + Kλ)p+1 f(xt) 1+1/p 1+1/p Published in Transactions on Machine Learning Research (03/2024) = η(λ1 p Kλ) f(xt) 1+1/p 1+1/p + Lηp+1 1 + p (λ1 p + Kλ)p+1 f(xt) 1+1/p 1+1/p = ψ f(xt) 1+1/p 1+1/p. For this to converge, we need ψ > 0 which gives a condition on η as follows: λ1 p Kλ L 1 + pηp(λ1 p + Kλ)1+p > 0 = ηp < (1 + p)(λ1 p Kλ) L(λ1 p + Kλ)1+p (1 + p)(λ1 p Kλ) L(λ1 p + Kλ)1+p . This in turn gives a condition on λ in order to make the interior of the root positive: (λ1 p Kλ) > 0 We derive the convergence bound as follows: min 0 t T f(xt) 1+1/p 1+1/p 1 T + 1 t=0 f(xt) 1+1/p 1+1/p f(xt) f(xt+1) = f(x0) f(x T +1) E Missing Proofs in Section 7 Finding the Advantage of Fractional Gradient Descent E.1 Proof of Theorem 21 Proof. We first note some useful equations: 1. f(x) = Ax + b 2. Cδα,β c x = 1 (for 1 dimensional x) 3. Cδα,β c b T x = b 4. Cδα,β c a 2x2 = a(β γ)(x c) + ax (for 1 dimensional x) 5. Cδα,β c 1 = 0 Equation 1 is simple differentiation. Equation 2 follows since the 2nd derivative of x is 0. Equation 3 follows by linearity of the fractional gradient descent operator and equation 2. Equation 4 follows directly from Lemma A.1 of Shin et al. (2021). Equation 5 follows since derivatives of a constant function are 0. Published in Transactions on Machine Learning Research (03/2024) We now leverage these equations to apply the fractional gradient descent operator to f. These equations tell us that the only terms where the fractional gradient descent operator does not act like a normal gradient are the squares corresponding to diagonal elements of A, aii. These elements look precisely like equation 4 and can be handled as follows: Cδα,β c(i) aii x(i) 2 = aii(β γ)(x(i) c(i)) + aiix(i) = aii(β γ) (β γ)L( f(x))(i) + aiix(i) L (Ax + b)(i) + d dx(i) Putting everything together with linearity we calculate the full fractional gradient descent operator as follows: Cδα,β c f(x) = diag([ a11 L , ..., akk L ])(Ax + b) + f(x) = diag([ a11 L , ..., akk L ])(Ax + b) + (Ax + b) = (DA)x + Db = A x + b . Thus we have the desired result for point 1 of the theorem. For point 2, we start by noting that Cδα,β c f(x ) = 0 when DAx = Db. Since A and D are both invertible, we see that x = A 1b. At this point, we also have f(x ) = 0. This tells us that if fractional gradient descent converges, it converges to the right point. The update rule for fractional gradient descent can be written as xt+1 = xt η(A xt + b ). For proving convergence, we begin with the 3 point equality: xt+1 x 2 2 = xt+1 xt 2 2 + 2 xt+1 xt, xt x + xt x 2 2 = η2 A xt + b 2 2 2η A xt + b , xt x + xt x 2 2 = η2 A xt + b 2 2 2η A xt + b A x b , xt x + xt x 2 2 = η2 A xt + b 2 2 2η(xt x )T A (xt x ) + xt x 2 2 η2 A xt + b 2 2 ηµ xt x 2 2 η(xt x )T (A xt + b A x b ) + xt x 2 2 = η2 A xt + b 2 2 ηµ xt x 2 2 η(Axt + b Ax b )T A 1T(A xt + b ) + xt x 2 2 = η2 A xt + b 2 2 ηµ xt x 2 2 η(Axt + b )T A 1T(A xt + b ) + xt x 2 2 η2 A xt + b 2 2 ηµ xt x 2 2 η L Axt + b 2 2 + xt x 2 2 A xt + b 2 2 + (1 ηµ ) xt x 2 2. Setting η = 1 L similarly to in standard gradient descent, we arrive at the desired convergence rate: xt+1 x 2 2 1 µ