Published as a conference paper at ICLR 2025

ADMM FOR STRUCTURED FRACTIONAL MINIMIZATION

Ganzhao Yuan
Peng Cheng Laboratory, China
yuangzh@pcl.ac.cn

This paper considers a class of structured fractional minimization problems. The numerator consists of a differentiable function, a simple nonconvex nonsmooth function, a concave nonsmooth function, and a convex nonsmooth function composed with a linear operator. The denominator is a continuous function that is either weakly convex or has a weakly convex square root. These problems are prevalent in various important applications in machine learning and data science. Existing methods, primarily based on subgradient methods and smoothing proximal gradient methods, often suffer from slow convergence and numerical stability issues. In this paper, we introduce FADMM, the first Alternating Direction Method of Multipliers tailored for this class of problems. FADMM decouples the original problem into linearized proximal subproblems, featuring two variants: one using Dinkelbach's parametric method (FADMM-D) and the other using the quadratic transform method (FADMM-Q). By introducing a novel Lyapunov function, we establish that FADMM converges to ϵ-approximate critical points of the problem within an oracle complexity of O(1/ϵ³). Extensive experiments on synthetic and real-world datasets, including sparse Fisher discriminant analysis, robust Sharpe ratio minimization, and robust sparse recovery, demonstrate the effectiveness of our approach. [1]

1 INTRODUCTION

This paper focuses on the following class of nonconvex and nonsmooth fractional minimization problems (where ≜ denotes definition):

min_x F(x) ≜ u(x)/d(x), where u(x) ≜ f(x) + δ(x) − g(x) + h(Ax). (1)

Here, x ∈ Rn and A ∈ Rm×n. We impose the following assumptions on Problem (1). (i) The function f(x) is differentiable and possibly nonconvex. (ii) The function δ(x) is possibly nonconvex, nonsmooth, and non-Lipschitz.
(iii) Both functions g(x) and h(y) are convex and possibly nonsmooth. (iv) Both functions δ(x) and h(y) are simple, such that their proximal operators can be computed efficiently and exactly. (v) The function d(x) is Lipschitz continuous, and either d(x) itself or its square root d(x)^{1/2} is weakly convex. (vi) To ensure that Problem (1) is well-defined, we assume that all functions g(x), δ(x), h(y), and d(x) are proper and lower semicontinuous, u(x) ≥ 0, and d(x) > 0 for all x and y.

Problem (1) serves as a fundamental optimization framework for various machine learning and data science models, such as sparse Fisher discriminant analysis (Bishop & Nasrabadi, 2006), (robust) Sharpe ratio maximization (Chen et al., 2011), robust sparse recovery (Yuan, 2023; Yang & Zhang, 2011), limited-angle CT reconstruction (Wang et al., 2021), AUC maximization (Wang et al., 2022), and signal-to-noise ratio maximization (Shen & Yu, 2018a;b). An alternative formulation of Problem (1) can be obtained by replacing the minimization with maximization, as explored in the fractional optimization literature (Stancu-Minasian, 2012; Schaible, 1995). Although these two formulations are generally distinct, the corresponding algorithmic developments readily adapt to either formulation.

[1] Future versions of this paper can be found at https://arxiv.org/abs/2411.07496.

1.1 MOTIVATING APPLICATIONS

Many models in machine learning and data science can be formulated as Problem (1). We present the sparse Fisher discriminant analysis application below; additional applications, including robust Sharpe ratio maximization and robust sparse recovery, are detailed in Appendix B.

Sparse Fisher Discriminant Analysis (Sparse FDA). Given observations from two distinct classes, let µ(i) ∈ Rn and Σ(i) ∈ Rn×n represent the mean vector and covariance matrix of class i (i = 1 or 2), respectively.
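As a concrete illustration, the class statistics above, together with the scatter matrices C ≜ Σ(1) + Σ(2) and D ≜ (µ(1) − µ(2))(µ(1) − µ(2))ᵀ used in the sparse FDA model below, can be estimated from data as in the following sketch (ours, not the paper's MATLAB code; all variable names are our own):

```python
import numpy as np

def class_statistics(X):
    """Return the mean vector and covariance matrix of one class.

    X has shape (num_samples, n), one observation per row.
    """
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)
    return mu, Sigma

rng = np.random.default_rng(0)
X1 = rng.normal(loc=0.0, size=(200, 5))  # class-1 observations (toy data)
X2 = rng.normal(loc=1.0, size=(200, 5))  # class-2 observations (toy data)

mu1, Sigma1 = class_statistics(X1)
mu2, Sigma2 = class_statistics(X2)

C = Sigma1 + Sigma2          # within-class scatter, C = Sigma(1) + Sigma(2)
diff = mu1 - mu2
D = np.outer(diff, diff)     # between-class scatter, D = (mu(1)-mu(2))(mu(1)-mu(2))^T
```

Note that D is symmetric and has rank at most one, which is why both tr(XᵀDX) and its square root turn out to be convex in the FDA instance of Problem (1).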
Classical FDA (Bishop & Nasrabadi, 2006; Xu & Li, 2020) aims to find an orthogonal subspace X ∈ Ω with Ω ≜ {X | XᵀX = Ir} that maximizes the between-class variance relative to the within-class variance. This leads to the following optimization problem: min_{X∈Ω} tr(Xᵀ(Σ(1) + Σ(2))X) / tr(Xᵀ((µ(1) − µ(2))(µ(1) − µ(2))ᵀ)X). Inducing sparsity in the solution helps mitigate overfitting and enhances the interpretability of the model in high-dimensional data analysis (Journée et al., 2010). We consider the following Difference-of-Convex (DC) model (Bi et al., 2014; Bi & Pan, 2016; Gotoh et al., 2018) for learning sparse orthogonal loadings for FDA:

min_{X∈Rn×r} {tr(XᵀCX) + ρ(∥X∥₁ − ∥X∥_[k])} / tr(XᵀDX), s.t. X ∈ Ω, (2)

where D ≜ (µ(1) − µ(2))(µ(1) − µ(2))ᵀ, C ≜ Σ(1) + Σ(2), and ∥X∥_[k] is the ℓ1 norm of the k largest (in magnitude) elements of the matrix X. Problem (2) exhibits a beneficial exact penalty property (Bi et al., 2014; Bi & Pan, 2016), such that ∥X∥_[k] closely approximates ∥X∥₁ when ρ exceeds a certain threshold, thereby leading to a solution X with k-sparsity. We define ιΩ(X) as the indicator function of the set Ω. Problem (2) coincides with Problem (1) with x = vec(X), f(x) = tr(XᵀCX), δ(x) = ιΩ(X), g(x) = ρ∥X∥_[k], A = I, h(Ax) = ρ∥X∥₁, and d(x) = tr(XᵀDX). Importantly, both d(x) and d(x)^{1/2} are Wd-weakly convex with Wd = 0.

1.2 CONTRIBUTIONS AND ORGANIZATION

The contributions of this paper are threefold. (i) We propose FADMM, a new ADMM tailored for nonsmooth composite fractional minimization problems. This method includes two specialized variants: FADMM based on Dinkelbach's parametric method (FADMM-D) and FADMM based on the quadratic transform method (FADMM-Q). (ii) We establish that both the FADMM-D and FADMM-Q algorithms converge to an ϵ-critical point with a computational complexity of O(1/ϵ³). This is the first report of iteration complexity results for estimating approximate stationary points for this class of fractional programs.
(iii) We conducted experiments on sparse FDA, robust Sharpe ratio maximization, and robust sparse recovery to demonstrate the effectiveness of our approach.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 presents technical preliminaries. Section 4 details the proposed algorithm. Section 5 discusses global convergence. Section 6 addresses iteration complexity. Section 7 provides some experimental results, and Section 8 concludes the paper.

2 RELATED WORK

We review some nonconvex optimization algorithms that are related to solving the fractional program in Problem (1).

Algorithms in Limited Scenarios. Existing fractional minimization algorithms primarily address a special instance of Problem (1): min_x F(x) ≜ u(x)/d(x), where u(x) ≜ f(x) + δ(x). (i) Dinkelbach's Parametric Algorithm (DPA) (Dinkelbach, 1967) is a classical approach. The fractional program has an optimal solution x̄ if and only if x̄ is an optimal solution to the problem min_x u(x) − λ̄d(x), where λ̄ = F(x̄). Since the optimal value λ̄ is generally unknown, iterative methods are used. DPA generates a sequence {xt} as: xt+1 = arg min_x u(x) − λt d(x), with λt updated as λt = F(xt). (ii) The Quadratic Transform Algorithm (QTA) (Zhou & Yang, 2014; Shen & Yu, 2018a;b) reformulates the problem as: min_x u(x)/d(x) ⇔ min_{x,α} α²u(x) − 2α d(x)^{1/2}. QTA generates a sequence {xt} as: xt+1 = arg min_x (αt)²u(x) − 2αt d(x)^{1/2}, with αt updated as αt = d(xt)^{1/2} u(xt)^{−1}. This method is particularly suited for solving multiple-ratio fractional programs. (iii) Linearized variants of DPA and QTA (Li & Zhang, 2022; Bot et al., 2023a) address the high computational cost of solving nonconvex subproblems in DPA and QTA. They employ full splitting paradigms and achieve fast convergence by performing a single iteration of splitting algorithms at each step, efficiently avoiding inner loops for solving complex nonconvex problems.
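To make the two classical schemes concrete, the following minimal sketch (ours, not from the paper) runs DPA and QTA on a toy single-ratio problem min_x u(x)/d(x) with u(x) = (x − 3)² + 1 and d(x) = x² + 1, solving each subproblem exactly by brute force over a grid:

```python
import numpy as np

grid = np.linspace(-5.0, 5.0, 2001)
u = (grid - 3.0) ** 2 + 1.0   # numerator u(x) > 0
d = grid ** 2 + 1.0           # denominator d(x) > 0

# Dinkelbach's parametric algorithm: x_{t+1} = argmin_x u(x) - lambda_t d(x)
i = 1000                       # start at x0 = 0
for _ in range(50):
    lam = u[i] / d[i]                     # lambda_t = F(x_t)
    i = int(np.argmin(u - lam * d))       # exact parametric subproblem on the grid
x_dpa = grid[i]

# Quadratic transform: x_{t+1} = argmin_x alpha_t^2 u(x) - 2 alpha_t sqrt(d(x))
j = 1000                       # start at x0 = 0
for _ in range(50):
    alpha = np.sqrt(d[j]) / u[j]          # alpha_t = sqrt(d(x_t)) / u(x_t)
    j = int(np.argmin(alpha ** 2 * u - 2.0 * alpha * np.sqrt(d)))
x_qta = grid[j]

x_best = grid[int(np.argmin(u / d))]      # brute-force minimizer of the ratio
```

Both sequences settle at the same minimizer as a direct grid search over u(x)/d(x), illustrating that DPA tracks the ratio value λt while QTA tracks the auxiliary multiplier αt.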
The proposed ADMM algorithm is built on linearized DPA and QTA to solve the subproblems.

General Algorithms for Solving Problem (1). (i) Subgradient Projection Methods (SPM) (Li et al., 2021) provide a simple and intuitive approach to solving Problem (1). SPM iteratively updates the solution by moving along a negative subgradient direction and projecting onto the feasible set: xt+1 = PΩ(xt − ηt gt), where gt ∈ ∂F(xt), Ω is the constraint set, and ηt is a diminishing step size. However, due to the non-uniqueness of gt, these methods often exhibit slower convergence and numerical instability. (ii) Smoothing Proximal Gradient Methods (SPGM) (Beck & Rosset, 2023; Yuan, 2024a; Bian & Chen, 2020; Böhm & Wright, 2021) combine gradient-based optimization with smoothing techniques to handle nonsmooth terms in the objective function. By approximating nonsmooth components with smooth surrogates, SPGM enables more efficient updates via the proximal gradient method, achieving convergence for complex nonsmooth problems. Notably, the proposed FADMM algorithm reduces to SPGM when the multiplier is set to zero in all iterations. (iii) The Full Splitting Algorithm (FSA) (Bot et al., 2023b) applies a smoothing technique by introducing a strongly concave term into the dual maximization problem, effectively framing the method as a primal-dual approach. However, the convergence analysis of FSA relies on the Kurdyka-Łojasiewicz (KL) inequality of the problem, and no iteration complexity results are provided. Overall, our proposed FADMM algorithm demonstrates faster convergence and superior numerical stability compared to SPM, SPGM, and FSA.

Other Fractional Minimization Algorithms. (i) The Charnes-Cooper transform algorithm converts an original linear-fractional programming problem into a standard linear programming problem (Charnes & Cooper, 1962). Using the transformation y = x/d(x), t = 1/d(x), Problem (1) can be reformulated as: min_{t,y} t·u(y/t), s.t. t·d(y/t) = 1.
(ii) Coordinate descent algorithms (Yuan, 2023) iteratively solve one-dimensional subproblems globally and are guaranteed to converge to stronger coordinate-wise stationary points for a specific class of fractional programs. (iii) Inertial proximal block coordinate methods (Bot et al., 2023a), based on the quadratic transform, have been proposed to address a class of nonsmooth sum-of-ratios minimization problems.

ADMM for Nonconvex Optimization. The Alternating Direction Method of Multipliers (ADMM) is a powerful optimization technique that addresses complex problems by breaking them down into simpler, more manageable subproblems, which are then solved iteratively to achieve convergence. The standard ADMM was first introduced in (Gabay & Mercier, 1976), with complexity analyses for convex settings conducted in (He & Yuan, 2012; Monteiro & Svaiter, 2013). Motivated by research on the convergence analysis of nonconvex ADMM (Li & Pong, 2015; Hong et al., 2016; Bot et al., 2019; Bot & Nguyen, 2020; Yuan, 2024b), we propose applying ADMM to solve structured fractional minimization problems. To the best of our knowledge, this is the first instance of ADMM being applied to fractional programs. Our goal is to investigate both the theoretical iteration complexity required to reach an approximate stationary point and the empirical performance of the proposed method.

3 TECHNICAL PRELIMINARIES

This section presents technical preliminaries on basic assumptions, stationary points, and Nesterov's smoothing technique. Notations, additional technical preliminaries, and relevant lemmas are provided in Appendix Section A.

3.1 BASIC ASSUMPTIONS AND STATIONARY POINTS

We impose the following assumptions on Problem (1) throughout this paper.

Assumption 3.1. There exists a universal positive constant x̄ such that ∥x∥ ≤ x̄ for all x ∈ dom(F).

Assumption 3.2. The function f(·) is Lf-smooth, such that ∥∇f(x) − ∇f(x′)∥ ≤ Lf∥x − x′∥ holds for all x, x′ ∈ Rn.
This implies that |f(x) − f(x′) − ⟨∇f(x′), x − x′⟩| ≤ (Lf/2)∥x − x′∥₂² (cf. Lemma 1.2.3 in (Nesterov, 2003)).

Assumption 3.3. Let µ > 0, x′ ∈ Rn, and y′ ∈ Rm. Both proximal operators, Prox(y′; h, µ) ≜ arg min_y (1/(2µ))∥y − y′∥₂² + h(y) and Prox(x′; δ, µ) ≜ arg min_x (1/(2µ))∥x − x′∥₂² + δ(x), can be computed efficiently and exactly.

Assumption 3.4. The function d(x) is Cd-Lipschitz continuous with Cd ≥ 0, and meets one of the following conditions for some Wd ≥ 0: (a) d(x) is Wd-weakly convex. (b) √d(x) is Wd-weakly convex.

Remark 3.5. (i) Assumption 3.1 holds by setting δ(x) = ιΩ(x), where Ω is a compact set. This follows from the fact that if x ∈ dom(F) ≜ {x : F(x) < +∞}, then x is feasible, ensuring that ∥x∥ ≤ x̄ for some x̄ > 0. (ii) Assumption 3.2 is commonly used in the convergence analysis of nonconvex algorithms. (iii) Assumption 3.3 is mild and is satisfied by our applications. Appendix G details the computation of the proximal operators.

We now introduce the definition of stationary points for Problem (1). A straightforward option is the Fréchet stationary point (Rockafellar & Wets., 2009; Mordukhovich, 2006). Recall that a solution x̄ is a Fréchet stationary point of Problem (1) if: 0 ∈ ∂̂F(x̄) = ∂̂((f + δ − g + h∘A)/d)(x̄). However, computing a Fréchet stationary point is challenging for general nonconvex nonsmooth programs. Following the work of (Li et al., 2022b; Li & Zhang, 2022; Bot et al., 2023a;b; Yuan, 2023), we adopt a weaker notion of optimality, namely critical points (or limiting lifted stationary points), defined as follows:

Definition 3.6. (Critical Point) A solution x̄ ∈ dom(F) is a critical point of Problem (1) if: 0 ∈ ∂δ(x̄) + ∇f(x̄) − ∂g(x̄) + Aᵀ∂h(Ax̄) − F(x̄)∂d(x̄).

Remark 3.7. Using Lemma A.1 in Appendix A.2, we obtain that ∂̂F(x) ⊆ {∂δ(x) + ∇f(x) − ∂g(x) + Aᵀ∂h(Ax) − F(x)∂d(x)} for any x. According to Definition 3.6, 0 ∈ ∂̂F(x̄) implies that x̄ is a critical point of Problem (1), while the converse is generally not true.
However, under certain mild conditions discussed in (Bot et al., 2023b;a), Definition 3.6 aligns with the standard Fréchet stationarity condition 0 ∈ ∂̂F(x̄).

3.2 NESTEROV'S SMOOTHING TECHNIQUE

The nonsmooth nature of the function h(y) presents challenges for algorithm design and theoretical analysis. To address this, we approximate h(y) with a smooth function hµ(y) using Nesterov's smoothing technique (Nesterov, 2003; 2013; Devolder et al., 2012), which relies on the conjugate function of h(y). We introduce the following useful definition in this context.

Definition 3.8. For a proper, convex, and lower semicontinuous function h(y) : Rm → R, Nesterov's smoothing function for h(y) with a parameter µ ∈ (0, ∞) is defined as: hµ(y) ≜ max_v ⟨y, v⟩ − h*(v) − (µ/2)∥v∥₂², where h*(·) denotes the convex conjugate of h(·).

We outline some key properties of Nesterov's smoothing function.

Lemma 3.9. (Proof in Appendix C.1) Assume that h(y) is Ch-Lipschitz continuous. We let µ > 0 and 0 < µ2 ≤ µ1. We have the following results. (a) The function hµ(y) is (1/µ)-smooth and convex; with ∇hµ(y) = arg max_v {⟨y, v⟩ − h*(v) − (µ/2)∥v∥₂²}, it holds that ∇hµ(y) = (1/µ)(y − Prox(y; h, µ)). (b) 0 ≤ h(y) − hµ(y) ≤ (1/2)Ch²µ. (c) The function hµ(y) is Ch-Lipschitz continuous. (d) hµ(y) = (1/(2µ))∥y − ȳ∥₂² + h(ȳ), where ȳ = Prox(y; h, µ). (e) 0 ≤ hµ2(y) − hµ1(y) ≤ (1/2)Ch²(µ1 − µ2). (f) hµ2(y) − hµ1(y) ≤ ((µ1 − µ2)/2)∥∇hµ2(y)∥₂².

Lemma 3.10. (Proof in Section C.2) Assume that h(y) is Ch-Lipschitz continuous. Consider y⁺ = arg min_y hµ(y) + (β/2)∥y − b∥₂² ≜ Prox(b; hµ, 1/β), where b ∈ Rm and β, µ > 0. We have: (a) y⁺ = (y̌ + βµb)/(1 + βµ), where y̌ ≜ Prox(b; h, µ + 1/β). (b) β(b − y⁺) ∈ ∂h(y̌). (c) ∥y⁺ − y̌∥ ≤ µCh.

Remark 3.11. (i) Lemmas 3.9 and 3.10 can be derived using standard convex analysis and play an essential role in the analysis of the proposed FADMM algorithm. (ii) Nesterov's smoothing function (Nesterov, 2003) is essentially equivalent to the Moreau envelope smoothing function (Böhm & Wright, 2021; Beck, 2017), as demonstrated in Lemma 3.9(d).
(iii) Lemma 3.10 demonstrates how to compute the proximal operator Prox(b; hµ, 1/β) using Prox(b; h, µ + 1/β), where β > 0 and b ∈ Rm are given parameters.

4 THE PROPOSED FADMM ALGORITHM

This section presents the proposed FADMM algorithm for solving Problem (1), featuring two variants: one based on Dinkelbach's parametric method (FADMM-D) (Dinkelbach, 1967) and the other on the quadratic transform method (FADMM-Q) (Zhou & Yang, 2014; Shen & Yu, 2018a). Notably, FADMM-D and FADMM-Q target different problem structures: FADMM-D is designed for Assumption 3.4(a), while FADMM-Q is suited for Assumption 3.4(b), with potential extensions to multi-ratio fractional programs (Bot et al., 2023a).

We first introduce a new variable y ∈ Rm and reformulate Problem (1) as: min_{x,y} {f(x) + δ(x) − g(x) + hµ(y)}/d(x), s.t. Ax = y, where hµ(y) is Nesterov's smoothing function for h(y), with µ > 0 as the smoothing parameter. Notably, similar smoothing techniques have been used in the design of augmented Lagrangian methods (Zeng et al., 2022), ADMM (Li et al., 2022a; Yuan, 2025; 2024b), and minimax optimization (Zhang et al., 2020). From this problem, we define two functions, referred to as modified augmented Lagrangian functions, as follows:

L(x, y; z; β, µ) ≜ U(x, y; z; β, µ)/d(x), (3)

K(α, x, y; z; β, µ) ≜ −2α√d(x) + α²U(x, y; z; β, µ), (4)

U(x, y; z; β, µ) ≜ f(x) + ⟨Ax − y, z⟩ + (β/2)∥Ax − y∥₂² + δ(x) − g(x) + hµ(y), (5)

where the first three terms on the right-hand side of Equation (5) define the smooth function S(x, y; z; β). Here, β is the penalty parameter and z is the dual variable for the linear constraint. In brief, FADMM updates the primal variables sequentially, keeping the others fixed, and updates the dual variable via gradient ascent on the dual problem. It iteratively generates a sequence {xt, yt, zt, λt, βt, µt} or {αt, xt, yt, zt, βt, µt}, where βt = β0(1 + ξt^p), µt = χ/βt, and {β0, ξ, p, χ} are fixed constants.

Majorization Minimization (MM).
MM is an effective optimization strategy for minimizing complex functions and is widely used to develop practical optimization algorithms (Mairal, 2013; Razaviyayn et al., 2013). This technique iteratively constructs a majorization function that upper-bounds the objective, enabling efficient optimization and gradual reduction of the objective function. We define s(x) ≜ S(x, yt; zt; βt), where t is known from context. Given that s(x) is (Lf + βt∥A∥₂²)-smooth, g(x) is convex, and d(x) or √d(x) is Wd-weakly convex, we construct majorization functions for the four functions as follows. (a) s(x) ≤ Ut(x; xt) ≜ s(xt) + ⟨x − xt, ∇s(xt)⟩ + ((Lf + βt∥A∥₂²)/2)∥x − xt∥₂². (b) −g(x) ≤ R(x; xt) ≜ −g(xt) − ⟨x − xt, ξ⟩, ξ ∈ ∂g(xt). (c) −d(x) ≤ V(x; xt) ≜ −d(xt) − ⟨x − xt, ξ⟩ + (Wd/2)∥x − xt∥₂², ξ ∈ ∂d(xt). (d) −√d(x) ≤ V(x; xt) ≜ −√d(xt) − ⟨x − xt, ξ⟩ + (Wd/2)∥x − xt∥₂², ξ ∈ ∂√d(xt).

FADMM-D Algorithm. Based on Equation (3), FADMM-D alternately updates the primal variables {x, y} and the dual variable z. (i) To update the variable x, we approximately solve Dinkelbach's parametric subproblem: xt+1 ≈ arg min_x Wt(x) ≜ s(x) + δ(x) − g(x) − λt d(x), where λt = L(xt, yt; zt; βt, µt). However, this problem is challenging in general. We apply MM methods and consider the following problem:

min_x Mt(x; xt, λt) ≜ δ(x) + Ut(x; xt) + R(x; xt) + λt V(x; xt) + ((θ − 1)/2)ℓ(βt)∥x − xt∥₂², (6)

where ℓ(βt) ≜ Lf + βt∥A∥₂² + λt Wd, and θ > 1. One can verify that Mt(x; xt, λt) is a majorization function, satisfying Wt(x) ≤ Mt(x; xt, λt) and Wt(xt) = Mt(xt; xt, λt) for all x and xt.

Algorithm 1: FADMM: The proposed ADMM using Dinkelbach's parametric method or the quadratic transform method for solving Problem (1).
(S0) Initialize {x0, y0, z0}.
(S1) Choose ξ ∈ (0, ∞), θ ∈ (1, ∞), p ∈ (0, 1), and χ ∈ (2√(1 + ξ), ∞).
(S2) Choose β0 large enough such that β0 > v̲/(F̲d̲), satisfying Assumption 5.7.
for t from 0 to T do
(S3) βt = β0(1 + ξt^p), µt = χ/βt.
(S4) Solve the x-subproblem using FADMM-D or FADMM-Q:
if FADMM-D then set λt = U(xt, yt; zt; βt, µt)/d(xt), and xt+1 ∈ arg min_x Mt(x; xt, λt);
if FADMM-Q then set αt+1 = √d(xt)/U(xt, yt; zt; βt, µt), and xt+1 ∈ arg min_x Mt(x; xt, αt+1).
(S5) yt+1 = arg min_y hµt(y) + (βt/2)∥y − bt∥₂², where bt ≜ yt − ∇yS(xt+1, yt; zt; βt)/βt. It can be solved as yt+1 = (y̌t+1 + βtµt bt)/(1 + βtµt), where y̌t+1 ≜ Prox(bt; h, µt + 1/βt).
(S6) zt+1 = zt + βt(Axt+1 − yt+1).
end

Problem (6) reduces to the computation of a proximal operator for the function δ(x), yielding xt+1 ∈ arg min_x Mt(x; xt, λt) = Prox(x′; δ, 1/(θℓ(βt))), where x′ = xt − g/(θℓ(βt)) and g ≜ ∇s(xt) − ξg − λt ξd, with ξg ∈ ∂g(xt) and ξd ∈ ∂d(xt). (ii) When minimizing the modified augmented Lagrangian function in Equation (3) over y, the problem reduces to solving: yt+1 = arg min_y hµt(y) + (βt/2)∥y − bt∥₂², where bt ≜ yt − ∇yS(xt+1, yt; zt; βt)/βt. (iii) We adjust the dual variable z using the standard gradient ascent update rule in ADMM.

FADMM-Q Algorithm. Based on Equation (4), FADMM-Q alternates between updating the primal variables {α, x, y} and the dual variable z. (i) To update the variable α, we set the gradient of K(α, x, y; z; β, µ) w.r.t. α to zero, resulting in the update rule: αt+1 = √d(xt)/U(xt, yt; zt; βt, µt). (ii) To update the variable x, we approximately solve the following problem: xt+1 ≈ arg min_x Wt(x) ≜ s(x) + δ(x) − g(x) − 2αt+1√d(x). To tackle this challenging problem, we employ MM methods and formulate the following problem:

min_x Mt(x; xt, αt+1) ≜ δ(x) + Ut(x; xt) + R(x; xt) + 2αt+1 V(x; xt) + ((θ − 1)/2)ℓ(βt)∥x − xt∥₂², (7)

where ℓ(βt) ≜ Lf + βt∥A∥₂² + 2αt+1 Wd, and θ > 1. One can show that Wt(x) ≤ Mt(x; xt, αt+1) and Wt(xt) = Mt(xt; xt, αt+1) for all x and xt. Problem (7) can be efficiently and effectively solved, as it reduces to the computation of a proximal operator for the function δ(x), yielding xt+1 ∈ arg min_x Mt(x; xt, αt+1) = Prox(x′; δ, 1/(θℓ(βt))), where x′ = xt − g/(θℓ(βt)) and g ≜ ∇s(xt) − ξg − 2αt+1 ξd, with ξg ∈ ∂g(xt) and ξd ∈ ∂√d(xt).
(iii) We use the same strategy as in FADMM-D to update the primal variable y and the dual variable z.

We summarize FADMM-D and FADMM-Q in Algorithm 1 and provide the following remarks.

Remark 4.1. (i) The y-subproblem in Step (S5) of Algorithm 1 can be solved by invoking Lemma 3.10. (ii) The introduction of the strongly convex term ((θ − 1)/2)ℓ(βt)∥x − xt∥₂² with θ > 1, as in Problems (6) and (7), is crucial to our analysis. (iii) By minimizing K(α, x, y; z; β, µ) w.r.t. α and setting its gradient to zero, we obtain α⋆ = √d(x)/U(x, y; z; β, µ), which leads to L(x, y; z; β, µ) = −1/K(α⋆, x, y; z; β, µ). Thus, Formulations (3) and (4) are equivalent in a certain sense.

5 GLOBAL CONVERGENCE

This section establishes the global convergence of both FADMM-D and FADMM-Q. We begin with an initial theoretical analysis applicable to both algorithms, followed by a detailed, separate analysis for each.

5.1 INITIAL THEORETICAL ANALYSIS

First, we impose the following condition on Algorithm 1.

Assumption 5.1. Let {xt} be generated by Algorithm 1. For all t, there exist constants {d̲, d̄} such that 0 < d̲ ≤ d(xt) ≤ d̄, and constants {F̲, F̄} such that 0 < F̲ ≤ F(xt) ≤ F̄.

Remark 5.2. (i) The existence of the upper bounds d̄ and F̄ is guaranteed by the boundedness of ∥x∥ and the continuity of the functions d(x) and F(x) within their respective effective domains. (ii) The lower bound condition d̲ > 0 is mild and widely utilized in the literature (Li & Zhang, 2022; Yuan, 2023; Bot et al., 2023b). (iii) The lower bound condition F̲ > 0 is reasonable; otherwise, it suffices to solve the non-fractional problem min_x u(x).

Second, we provide first-order optimality conditions for the solution yt+1.

Lemma 5.3. (Proof in Appendix D.1, First-Order Optimality Conditions) For all t ≥ 0, we have: zt+1 = ∇hµt(yt+1) ∈ ∂h(y̌t+1).

Third, using the subsequent lemma, we establish an upper bound for the term ∥zt+1 − zt∥₂².

Lemma 5.4.
(Proof in Section D.2, Controlling Dual using Primal) For all t ≥ 1, we have: ∥zt+1 − zt∥₂² ≤ (2(βt)²/χ²)∥yt+1 − yt∥₂² + 2Ch²(βt/βt−1 − 1)².

Fourth, we show that the solution {xt, yt, zt} is always bounded for all t ≥ 0.

Lemma 5.5. (Proof in Appendix D.3) Let t ≥ 0. There exist universal constants {x̄, ȳ, z̄} such that ∥xt∥ ≤ x̄, ∥zt∥ ≤ z̄, and ∥yt∥ ≤ ȳ.

Fifth, the subsequent lemma establishes bounds for the term U(xt, yt; zt; βt, µt).

Lemma 5.6. (Proof in Appendix D.4) For all t ≥ 1, we have: F̲d̲ − v̲/βt ≤ U(xt, yt; zt; βt, µt) ≤ F̄d̄ + v̄/βt, where v̲ ≜ 8z̄² + (1/2)χz̄² and v̄ ≜ 24z̄².

Given Lemma 5.6, we make the following additional assumption.

Assumption 5.7. Assume β0 − v̲/(F̲d̲) > 0.

Remark 5.8. (i) By Assumption 5.7, we have βt ≥ β0 > v̲/(F̲d̲), ensuring U(xt, yt; zt; βt, µt) > 0 and λt > 0 for both FADMM-D and FADMM-Q for all t ≥ 1. These inequalities are crucial to our analysis (see Inequalities (31), (38)). (ii) Assumption 5.7 is automatically satisfied when t is sufficiently large due to the increasing penalty update rule. In practice, β0 = 1 can be used.

Finally, we demonstrate some critical properties of the parameters {βt, λt, αt, ℓ(βt)}.

Lemma 5.9. (Proof in Appendix D.5) Let t ≥ 1. For both FADMM-D and FADMM-Q, we have: (a) βt ≤ βt+1 ≤ (1 + ξ)βt. (b) There exist positive constants {λ̲, λ̄} such that λ̲ ≤ λt ≤ λ̄. (c) There exist positive constants {α̲, ᾱ} such that α̲ ≤ αt ≤ ᾱ. (d) There exist positive constants {ℓ̲, ℓ̄} such that βtℓ̲ ≤ ℓ(βt) ≤ βtℓ̄.

5.2 ANALYSIS FOR FADMM-D

This subsection provides the convergence analysis of FADMM-D. We define εx ≜ ℓ̲(θ − 1)/(2d̄) > 0, εy ≜ {1 − 4(1 + ξ)/χ²}/(2d̄) > 0, and εz ≜ ξ/(2d̄) > 0. We define the following sequence associated with a specific potential function: Pt ≜ L(xt, yt; zt; βt, µt) + 12(1 + ξ)Ch²/(β0 d̲ t) + Ch²µt/(2d̲).

We first present two useful lemmas regarding the decrease of the variables {x} and {y, z, β, µ}.

Lemma 5.10. (Proof in Section E.1, Decrease on the Function L(x, y; z; β, µ) w.r.t. x) For all t ≥ 1, we have: εxβt∥xt+1 − xt∥₂² + L(xt+1, yt; zt; βt, µt) ≤ L(xt, yt; zt; βt, µt).
Lemma 5.11. (Proof in Section E.2, Decrease on the Function L(x, y; z; β, µ) w.r.t. {y, z, β, µ}) For all t ≥ 1, we have: εyβt∥yt+1 − yt∥₂² + εzβt∥Axt+1 − yt+1∥₂² + L(xt+1, yt+1; zt+1; βt+1, µt+1) ≤ L(xt+1, yt; zt; βt, µt) + Ut + Tt − Ut+1 − Tt+1, where Ut and Tt denote the second and third terms in the definition of Pt, respectively.

The following lemma demonstrates a decrease property of a potential function.

Lemma 5.12. (Proof in Section E.3, Decrease on a Potential Function) We let t ≥ 1. We define Et ≜ βt{∥xt+1 − xt∥₂² + ∥yt+1 − yt∥₂² + ∥Axt+1 − yt+1∥₂²}. We have: (a) There exists a universal positive constant P̄ such that Pt ≤ P̄. (b) It holds that min(εx, εy, εz)Et ≤ Pt − Pt+1.

The following theorem establishes the global convergence of FADMM-D.

Theorem 5.13. (Proof in Section E.4, Global Convergence) We let t ≥ 1. We define E⁺t ≜ βt{∥xt+1 − xt∥ + ∥yt+1 − yt∥ + ∥Axt+1 − yt+1∥}. We have: (1/T)Σ_{t=1}^T E⁺t ≤ O(T^{(p−1)/2}). In other words, there exists an index t̄ with 1 ≤ t̄ ≤ T such that E⁺t̄ ≤ O(T^{(p−1)/2}).

Remark 5.14. (i) With the choice p ∈ (0, 1), Theorem 5.13 implies that E⁺t converges to 0 in the ergodic sense. (ii) The convergence E⁺t → 0 is significantly stronger than the convergence ∥xt+1 − xt∥ + ∥yt+1 − yt∥ + ∥Axt+1 − yt+1∥ → 0, as {βt} is increasing.

5.3 ANALYSIS FOR FADMM-Q

This subsection presents the convergence analysis of FADMM-Q. We define εx ≜ (1/2)α̲²ℓ̲(θ − 1) > 0, εy ≜ (1/2)α̲²{1 − 4(1 + ξ)/χ²}, and εz ≜ (1/2)ξα̲² > 0. We define the following sequence associated with a specific potential function: Pt ≜ K(αt, xt, yt; zt; βt, µt) + 12ᾱ²(1 + ξ)Ch²/(β0 t) + 2ᾱ²Ch²µt.

The following two lemmas establish the decrease of the variables {λ, x} and {y, z, β, µ}.

Lemma 5.15. (Proof in Section E.5, Decrease on the Function K(λ, x, y; z; β, µ) w.r.t. λ and x) For all t ≥ 1, we have: K(λt+1, xt+1, yt; zt; βt, µt) + εxβt∥xt+1 − xt∥₂² ≤ K(λt, xt, yt; zt; βt, µt).

Lemma 5.16. (Proof in Section E.6, Decrease on the Function K(λ, x, y; z; β, µ) w.r.t.
{y, z, β, µ}) For all t ≥ 1, we have: εyβt∥yt+1 − yt∥₂² + εzβt∥Axt+1 − yt+1∥₂² + K(λt+1, xt+1, yt+1; zt+1; βt+1, µt+1) ≤ K(λt+1, xt+1, yt; zt; βt, µt) + Ut + Tt − Ut+1 − Tt+1.

The following lemma shows a decrease property of a potential function.

Lemma 5.17. (Proof in Section E.7, Decrease on a Potential Function) We let t ≥ 1. We define Et ≜ βt{∥xt+1 − xt∥₂² + ∥yt+1 − yt∥₂² + ∥Axt+1 − yt+1∥₂²}. We have: (a) There exists a universal positive constant P̄ such that Pt ≤ P̄. (b) It holds that min(εx, εy, εz)Et ≤ Pt − Pt+1.

The following theorem establishes the global convergence of FADMM-Q.

Theorem 5.18. (Proof in Section E.8, Global Convergence) We let t ≥ 1. We define E⁺t ≜ βt{∥xt+1 − xt∥ + ∥yt+1 − yt∥ + ∥Axt+1 − yt+1∥}. We have: (1/T)Σ_{t=1}^T E⁺t ≤ O(T^{(p−1)/2}). In other words, there exists an index t̄ with 1 ≤ t̄ ≤ T such that E⁺t̄ ≤ O(T^{(p−1)/2}).

Remark 5.19. Theorem 5.18 is analogous to Theorem 5.13, with E⁺t converging to 0 in the ergodic sense.

6 ITERATION COMPLEXITY

This section examines the iteration complexity of FADMM for converging to critical points. First, we introduce the notion of approximate critical points for Problem (1), which will play an important role in our analysis.

Definition 6.1. (ϵ-Critical Point) We define Crit(x⁺, x, y⁺, y, z⁺, z) ≜ ∥x⁺ − x∥ + ∥y⁺ − y∥ + ∥z⁺ − z∥ + ∥Ax⁺ − y⁺∥ + dist(∂h(y⁺), z⁺) + dist(∂δ(x⁺) + ∇f(x⁺) − ∂g(x) + Aᵀz⁺ − φ(x, y)∂d(x), 0), where φ(x, y) ≜ {f(x) + δ(x) − g(x) + h(y)}/d(x). A solution (x̄⁺, x̄, ȳ⁺, ȳ, z̄⁺, z̄) is an ϵ-critical point of Problem (1) if: Crit(x̄⁺, x̄, ȳ⁺, ȳ, z̄⁺, z̄) ≤ ϵ.

Remark 6.2. (i) If ϵ = 0, Definition 6.1 simplifies to the (exact) critical point described in Definition 3.6. (ii) The study in (Bot et al., 2023b) introduces a notion of approximate limiting subdifferential to define the ϵ-critical point, whereas we simply employ consecutive iterates for its definition.

Finally, we establish the iteration complexity of FADMM as follows.

Theorem 6.3.
(Proof in Section F.1, Iteration Complexity for Both FADMM-D and FADMM-Q) We define Wt ≜ {xt+1, xt, yt+1, yt, zt+1, zt}. Let the sequence {Wt} be generated by FADMM-D or FADMM-Q. If p ∈ (0, 1), we have: (1/T)Σ_{t=1}^T Crit(Wt) ≤ O(T^{−p}) + O(T^{(p−1)/2}). In particular, with the choice p = 1/3, we have (1/T)Σ_{t=1}^T Crit(Wt) ≤ O(T^{−1/3}). In other words, there exists 1 ≤ t̄ ≤ T such that Crit(Wt̄) ≤ ϵ, provided that T ≥ O(1/ϵ³).

Remark 6.4. (i) To our knowledge, Theorem 6.3 is the first complexity result for ADMM applied to this class of fractional programs, and it matches the iteration bound of smoothing proximal gradient methods (Beck & Rosset, 2023; Böhm & Wright, 2021). (ii) The tuple Wt̄ = {xt̄+1, xt̄, yt̄+1, yt̄, zt̄+1, zt̄}, rather than the single point xt̄, serves as an approximate critical point of Problem (1) in Theorem 6.3.

7 EXPERIMENTS

This section evaluates the effectiveness of FADMM-D and FADMM-Q on sparse FDA. For additional experiments on robust Sharpe ratio minimization and robust sparse recovery, please refer to Appendix Section I.

Compared Methods. We compare FADMM-D and FADMM-Q with three state-of-the-art general-purpose algorithms that solve Problem (1): (i) the Subgradient Projection Method (SPM) (Li et al., 2021), (ii) the Smoothing Proximal Gradient Method (SPGM) (Beck & Rosset, 2023; Yuan, 2024a; Bian & Chen, 2020; Böhm & Wright, 2021), and (iii) the Full Splitting Algorithm (FSA) (Bot et al., 2023b). For FADMM-D and FADMM-Q, if we fix zt = 0 and µt = 0 for all t in Algorithm 1, they respectively reduce to two SPGM variants: SPGM-D and SPGM-Q. For FSA, we adapt the algorithm from (Bot et al., 2023b) to our notation to address Problem (1). Implementation details of FSA are provided in Appendix Section H. We examine two fixed small step sizes, γ ∈ {10⁻³, 10⁻⁴}, leading to two variants: FSA-I and FSA-II.

Experimental Settings. For all SPGM and FADMM variants, we consider the default parameter settings (ξ, θ, p, χ) = (1/2, 1.01, 1/3, 2√(1 + ξ) + 10⁻¹⁴).
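As a quick sanity check (our sketch, not the authors' code), these defaults satisfy the requirement χ > 2√(1 + ξ) from Step (S1) of Algorithm 1 and yield the increasing-penalty, vanishing-smoothing schedule of Step (S3):

```python
import math

xi, theta, p = 0.5, 1.01, 1.0 / 3.0
chi = 2.0 * math.sqrt(1.0 + xi) + 1e-14  # default chi, just above its lower bound
beta0 = 1.0                               # Remark 5.8 suggests beta0 = 1 in practice

def beta(t):
    """Penalty parameter beta_t = beta0 * (1 + xi * t^p) from Step (S3)."""
    return beta0 * (1.0 + xi * t ** p)

def mu(t):
    """Smoothing parameter mu_t = chi / beta_t from Step (S3)."""
    return chi / beta(t)
```

Here βt grows like t^p while µt = χ/βt decays at the same rate, which is the trade-off balanced by the choice p = 1/3 in Theorem 6.3; one can also check numerically that βt+1 ≤ (1 + ξ)βt, as stated in Lemma 5.9(a).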
For SPM, we use the default diminishing step size ηt = 1/βt, where βt is the same penalty parameter as in SPGM and FADMM. For all algorithms, we initialize their solutions with entries drawn from a standard Gaussian distribution. All methods are implemented in MATLAB on an Intel 2.6 GHz CPU with 64 GB RAM. We incorporate a set of 8 datasets into our experiments, comprising both randomly generated and publicly available real-world data. Appendix Section I describes how the data used in the experiments are generated. We compare the objective values for all methods after running for 20 seconds. The corresponding MATLAB code is available on the author's research webpage.

Experimental Results on Sparse FDA. We consider solving Problem (2) using the following parameters: r = 20, k = 0.1·n·r, and ρ ∈ {10, 100, 1000, 10000}. According to the exact penalty theory (Bi et al., 2014; Bi & Pan, 2016), a reasonable value for βt is expected to be at least larger than ρ. We set β0 = 100ρ, which appears to work well. The experimental results for ρ ∈ {10, 1000} are presented in Figures 1 and 2, while the results for ρ ∈ {100, 10000} are provided in Appendix Section I. Based on these results, we draw the following conclusions. (i) SPM tends to be less efficient in comparison to other methods.
This is primarily because, in the case of a sparse solution, the subdifferential set of the objective function is large and provides a poor approximation of the (negative) descent direction.

[Figure 1: Results on sparse FDA on different datasets with ρ = 10. Each panel plots objective value against time (seconds, 0-20) for SPM, SPGM-D, SPGM-Q, FSA-I, FSA-II, FADMM-D, and FADMM-Q on: (a) madelon-2000-500, (b) TDT2-1-2-1000-1000, (c) TDT2-3-4-1000-1200, (d) mnist-2000-768, (e) mushroom-2000-112, (f) gisette-800-1000, (g) randn-300-1000, (h) randn-300-1500.]

[Figure 2: Results on sparse FDA on the same eight datasets with ρ = 1000.]
(ii) SPGM-D and SPGM-Q, utilizing a variable smoothing strategy, generally demonstrate better performance than SPM. (iii) The proposed FADMM-D and FADMM-Q generally exhibit similar performance, both achieving the lowest objective function values among all the methods examined. This supports the widely accepted view that primal-dual methods are generally more robust and faster than primal-only methods. (iv) The proposed FADMM-D and FADMM-Q still outperform FSA, which uses a sufficiently small step size to ensure convergence.

8 CONCLUSIONS

In this paper, we introduce FADMM, the first ADMM algorithm designed to solve general structured fractional minimization problems. Our approach integrates Nesterov's smoothing technique (equivalent to the Moreau envelope smoothing technique) into the algorithm's updates to guarantee convergence. We present two specific variants of FADMM: one using Dinkelbach's parametric method (FADMM-D) and the other using the quadratic transform method (FADMM-Q). Additionally, we establish the iteration complexity of FADMM for convergence to approximate critical points. Finally, we validate the effectiveness of our methods through experimental results.

ACKNOWLEDGMENTS

This work was supported by NSFC (12271278, 61772570) and Guangdong Natural Science Funds for Distinguished Young Scholar (2018B030306025).

REFERENCES

Amir Beck. First-Order Methods in Optimization. SIAM, 2017.

Amir Beck and Israel Rosset. A dynamic smoothing technique for a class of nonsmooth optimization problems on manifolds. SIAM Journal on Optimization, 33(3):1473-1493, 2023.

Dimitri Bertsekas. Convex Optimization Algorithms. Athena Scientific, 2015.

Shujun Bi and Shaohua Pan. Error bounds for rank constrained optimization problems and applications. Operations Research Letters, 44(3):336-341, 2016.

Shujun Bi, Xiaolan Liu, and Shaohua Pan. Exact penalty decomposition method for zero-norm minimization based on MPEC formulation.
SIAM Journal on Scientific Computing, 36(4):A1451-A1477, 2014.

Wei Bian and Xiaojun Chen. A smoothing proximal gradient algorithm for nonsmooth convex regression with cardinality penalty. SIAM Journal on Numerical Analysis, 58(1):858-883, 2020.

Christopher M. Bishop and Nasser M. Nasrabadi. Pattern Recognition and Machine Learning, volume 4. Springer, 2006.

Radu Ioan Boţ and Dang-Khoa Nguyen. The proximal alternating direction method of multipliers in the nonconvex setting: convergence analysis and rates. Mathematics of Operations Research, 45(2):682-712, 2020.

Radu Ioan Boţ, Ernö Robert Csetnek, and Dang-Khoa Nguyen. A proximal minimization algorithm for structured nonconvex and nonsmooth problems. SIAM Journal on Optimization, 29(2):1300-1328, 2019.

Axel Böhm and Stephen J. Wright. Variable smoothing for weakly convex composite functions. Journal of Optimization Theory and Applications, 188(3):628-649, 2021.

Radu Ioan Boţ, Minh N. Dao, and Guoyin Li. Inertial proximal block coordinate method for a class of nonsmooth sum-of-ratios optimization problems. SIAM Journal on Optimization, 33(2):361-393, 2023a.

Radu Ioan Boţ, Guoyin Li, and Min Tao. A full splitting algorithm for fractional programs with structured numerators and denominators. arXiv:2312.14341, 2023b.

Abraham Charnes and William W. Cooper. Programming with linear fractional functionals. Naval Research Logistics Quarterly, 9(3-4):181-186, 1962.

Li Chen, Simai He, and Shuzhong Zhang. When all risk-adjusted performance measures are the same: In praise of the Sharpe ratio. Quantitative Finance, 11(10):1439-1447, 2011.

Olivier Devolder, François Glineur, and Yurii Nesterov. Double smoothing technique for large-scale linearly constrained convex optimization. SIAM Journal on Optimization, 22(2):702-727, 2012.

Werner Dinkelbach. On nonlinear fractional programming. Management Science, 13(7):492-498, 1967.

Dmitriy Drusvyatskiy and Courtney Paquette.
Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178:503-558, 2019.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In International Conference on Machine Learning (ICML), pp. 272-279, 2008.

Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17-40, 1976.

Jun-ya Gotoh, Akiko Takeda, and Katsuya Tono. DC formulations and algorithms for sparse optimization problems. Mathematical Programming, 169(1):141-176, 2018.

Bingsheng He and Xiaoming Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700-709, 2012.

Mingyi Hong, Zhi-Quan Luo, and Meisam Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337-364, 2016.

Michel Journée, Yurii Nesterov, Peter Richtárik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11(Feb):517-553, 2010.

Rongjie Lai and Stanley Osher. A splitting method for orthogonality constrained problems. Journal of Scientific Computing, 58(2):431-449, 2014.

Guoyin Li and Ting Kei Pong. Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4):2434-2460, 2015.

Jiaxiang Li, Shiqian Ma, and Tejes Srivastava. A Riemannian ADMM. arXiv preprint arXiv:2211.02163, 2022a.

Qia Li and Na Zhang. First-order algorithms for a class of fractional optimization problems. SIAM Journal on Optimization, 32(1):100-129, 2022.

Qia Li, Lixin Shen, Na Zhang, and Junpeng Zhou.
A proximal algorithm with backtracked extrapolation for a class of structured fractional programming. Applied and Computational Harmonic Analysis, 56:98-122, 2022b. ISSN 1063-5203.

Xiao Li, Shixiang Chen, Zengde Deng, Qing Qu, Zhihui Zhu, and Anthony Man-Cho So. Weakly convex optimization over Stiefel manifold using Riemannian subgradient-type methods. SIAM Journal on Optimization, 31(3):1605-1634, 2021.

Julien Mairal. Optimization with first-order surrogate functions. In International Conference on Machine Learning (ICML), volume 28, pp. 783-791, 2013.

Renato D. C. Monteiro and Benar F. Svaiter. Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM Journal on Optimization, 23(1):475-507, 2013.

Boris S. Mordukhovich. Variational Analysis and Generalized Differentiation I: Basic Theory, volume 330. Springer, Berlin, 2006.

Boris S. Mordukhovich, Nguyen Mau Nam, and N. D. Yen. Fréchet subdifferential calculus and optimality conditions in nondifferentiable programming. Optimization, 55(5-6):685-708, 2006.

Yu Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161, 2013.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2003.

Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127-239, 2014.

Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126-1153, 2013.

R. Tyrrell Rockafellar and Roger J-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

Siegfried Schaible. Fractional programming. pp. 495-608, 1995.

Kaiming Shen and Wei Yu. Fractional programming for communication systems - part I: Power control and beamforming.
IEEE Transactions on Signal Processing, 66(10):2616-2630, 2018a.

Kaiming Shen and Wei Yu. Fractional programming for communication systems - part II: Uplink scheduling via matching. IEEE Transactions on Signal Processing, 66(10):2631-2644, 2018b.

Antonio Silveti-Falls, Cesare Molinari, and Jalal Fadili. Generalized conditional gradient with augmented Lagrangian for composite minimization. SIAM Journal on Optimization, 30(4):2687-2725, 2020.

Ioan M. Stancu-Minasian. Fractional Programming: Theory, Methods and Applications, volume 409. Springer Science & Business Media, 2012.

Chao Wang, Min Tao, James G. Nagy, and Yifei Lou. Limited-angle CT reconstruction via the L1/L2 minimization. SIAM Journal on Imaging Sciences, 14(2):749-777, 2021.

Guanghui Wang, Ming Yang, Lijun Zhang, and Tianbao Yang. Momentum accelerates the convergence of stochastic AUPRC maximization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 3753-3771, 2022.

Zhiqiang Xu and Ping Li. A practical Riemannian algorithm for computing dominant generalized eigenspace. In Conference on Uncertainty in Artificial Intelligence (UAI), pp. 819-828, 2020.

Junfeng Yang and Yin Zhang. Alternating direction algorithms for ℓ1-problems in compressive sensing. SIAM Journal on Scientific Computing, 33(1):250-278, 2011.

Ganzhao Yuan. Coordinate descent methods for fractional minimization. In International Conference on Machine Learning (ICML), pp. 40488-40518, 2023.

Ganzhao Yuan. Smoothing proximal gradient methods for nonsmooth sparsity constrained optimization: Optimality conditions and global convergence. In International Conference on Machine Learning (ICML), 2024a.

Ganzhao Yuan. ADMM for nonsmooth composite optimization under orthogonality constraints. arXiv:2405.15129, 2024b.

Ganzhao Yuan. ADMM for nonconvex optimization under minimal continuity assumption. In International Conference on Learning Representations (ICLR), 2025.
Alp Yurtsever, Olivier Fercoq, Francesco Locatello, and Volkan Cevher. A conditional gradient framework for composite convex minimization with applications to semidefinite programming. In International Conference on Machine Learning (ICML), pp. 5727-5736, 2018.

Jinshan Zeng, Wotao Yin, and Ding-Xuan Zhou. Moreau envelope augmented Lagrangian method for nonconvex optimization with linear constraints. Journal of Scientific Computing, 91(2):61, 2022.

Jiawei Zhang, Peijun Xiao, Ruoyu Sun, and Zhi-Quan Luo. A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems. Advances in Neural Information Processing Systems (NeurIPS), 33:7377-7389, 2020.

Xue-Gang Zhou and Ji-Hui Yang. Global optimization for the sum of concave-convex ratios problem. Journal of Applied Mathematics, 2014(1):879739, 2014.

The appendix is organized as follows. Appendix A presents notation, technical preliminaries, and relevant lemmas. Appendix B provides additional motivating applications. Appendix C contains proofs for Section 3. Appendix D contains proofs for Section 4. Appendix E contains proofs for Section 5. Appendix F contains proofs for Section 6. Appendix G explains the computation of the proximal operator. Appendix I demonstrates additional experimental details and results.

A NOTATIONS, TECHNICAL PRELIMINARIES, AND RELEVANT LEMMAS

A.1 NOTATIONS

In this paper, lowercase boldface letters signify vectors, while uppercase letters denote real-valued matrices. The following notations are utilized throughout this paper.
$[n]$: the set $\{1, 2, \ldots, n\}$
$\|x\|$: Euclidean norm, $\|x\| = \|x\|_2 = \sqrt{\langle x, x\rangle}$
$X^\mathsf{T}$: the transpose of the matrix $X$
$\operatorname{vec}(X)$: the vector formed by stacking the column vectors of $X$, with $\operatorname{vec}(X) \in \mathbb{R}^{nr \times 1}$
$\operatorname{mat}(x)$: converts $x \in \mathbb{R}^{nr \times 1}$ into a matrix, with $\operatorname{mat}(\operatorname{vec}(X)) = X$ and $\operatorname{mat}(x) \in \mathbb{R}^{n \times r}$
$\mathbf{0}_{n,r}$: a zero matrix of size $n \times r$; the subscript is sometimes omitted
$\mathbf{I}_r$: identity matrix with $\mathbf{I}_r \in \mathbb{R}^{r \times r}$
$X \succeq 0$: the matrix $X$ is symmetric positive semidefinite
$\|X\|_F$: Frobenius norm, $(\sum_{ij} X_{ij}^2)^{1/2}$
$\partial f(x)$: limiting subdifferential of $f(x)$ at $x$
$\iota_\Omega(x)$: the indicator function of a set $\Omega$, with $\iota_\Omega(x) = 0$ if $x \in \Omega$ and $+\infty$ otherwise
$\operatorname{tr}(A)$: sum of the elements on the main diagonal of $A$, $\operatorname{tr}(A) = \sum_i A_{i,i}$
$\langle X, Y\rangle$: Euclidean inner product, i.e., $\langle X, Y\rangle = \sum_{ij} X_{ij} Y_{ij}$
$\|X\|$: operator/spectral norm, the largest singular value of $X$
$\operatorname{dist}(\Xi, \Xi')$: the distance between two sets, $\operatorname{dist}(\Xi, \Xi') \triangleq \inf_{x \in \Xi, x' \in \Xi'} \|x - x'\|$
$\|\partial F(x)\|$: $\|\partial F(x)\| = \inf_{z \in \partial F(x)} \|z\| = \operatorname{dist}(0, \partial F(x))$
$\|X\|_{[k]}$: $\ell_1$ norm of the $k$ largest (in magnitude) elements of the matrix $X$
$x_i$: the $i$-th element of the vector $x$
$X_{i,j}$ or $X_{ij}$: the $(i,j)$-th element of the matrix $X$
$\mathcal{P}_\Omega(X')$: orthogonal projection of $X'$, with $\mathcal{P}_\Omega(X') = \arg\min_{X \in \Omega} \|X - X'\|_F^2$
$\partial \varphi_1(x) \pm \partial \varphi_2(x)$: standard Minkowski addition (or subtraction) between the sets $\partial \varphi_1(x)$ and $\partial \varphi_2(x)$

A.2 TECHNICAL PRELIMINARIES ON NONSMOOTH NONCONVEX OPTIMIZATION

We present various techniques in convex analysis, nonsmooth analysis, and nonconvex analysis (Mordukhovich et al., 2006; Mordukhovich, 2006; Rockafellar & Wets, 2009; Bertsekas, 2015), encompassing conjugate functions, weakly convex functions, the Fréchet subdifferential, the limiting subdifferential, and rules for sums and quotients in the Fréchet subdifferential context.

For any extended real-valued (not necessarily convex) function $f : \mathbb{R}^n \to (-\infty, +\infty]$, we denote by $\operatorname{dom}(f) := \{x \in \mathbb{R}^n : f(x) < +\infty\}$ its effective domain. The function $f(x)$ is proper if $\operatorname{dom}(f) \neq \emptyset$. The function $f(x)$ is lower semicontinuous at some point $\bar{x} \in \mathbb{R}^n$ if $\liminf_{x \to \bar{x}} f(x) \geq f(\bar{x})$.

Conjugate Functions.
For a proper, convex, lower semicontinuous function $h : \mathbb{R}^m \mapsto \mathbb{R}$, we denote the (Fenchel) conjugate function of $h(y)$ as $h^*(y) \triangleq \sup_{v \in \operatorname{dom}(h)} \{y^\mathsf{T}v - h(v)\}$, and it follows that $h^{**}(y) = h(y) = \sup_{v \in \operatorname{dom}(h^*)} \{y^\mathsf{T}v - h^*(v)\}$, where $h^{**}(y)$ is called the biconjugate function. For any $\mu > 0$, we have that $(\mu h)^*(y) = \sup_{x \in \operatorname{dom}(h)} \{\langle y, x\rangle - \mu h(x)\} = \mu h^*(\frac{y}{\mu})$. For any $x, y \in \mathbb{R}^n$, the following statements are equivalent (see Proposition 11.3 in (Rockafellar & Wets, 2009)): $\langle x, y\rangle = h(x) + h^*(y) \Leftrightarrow y \in \partial h(x) \Leftrightarrow x \in \partial h^*(y)$. The conjugate of the support function of a closed convex set $\Omega$ is its indicator function, i.e., if $h(y) = \sup_{v \in \Omega} \langle v, y\rangle$, then $h^*(v) = \iota_\Omega(v)$. Typical nonsmooth choices for $h(y)$ include $\{\|y\|_1,\ \|\max(0, y)\|_1,\ \|y\|_\infty\}$, with their respective conjugate functions $h^*(y)$ being $\{\iota_{[-1,1]^m}(y),\ \iota_{[0,1]^m}(y),\ \iota_{\|y\|_1 \leq 1}(y)\}$.

Weakly Convex Functions. The function $d(x)$ is weakly convex if a constant $W_d \geq 0$ exists, making $d(x) + \frac{W_d}{2}\|x\|_2^2$ convex, with the smallest such $W_d$ known as the modulus of weak convexity. Weakly convex functions constitute a diverse class of functions which covers convex functions, differentiable functions whose gradient is Lipschitz continuous, as well as compositions of convex, Lipschitz-continuous functions with $C^1$-smooth mappings that have Lipschitz continuous Jacobians (Drusvyatskiy & Paquette, 2019).

Fréchet Subdifferential and Limiting (Fréchet) Subdifferential. The Fréchet subdifferential of $F$ at $\bar{x} \in \operatorname{dom}(F)$, denoted as $\widehat{\partial} F(\bar{x})$, is defined as $\widehat{\partial} F(\bar{x}) \triangleq \{v \in \mathbb{R}^n : \liminf_{x \to \bar{x},\, x \neq \bar{x}} \frac{F(x) - F(\bar{x}) - \langle v, x - \bar{x}\rangle}{\|x - \bar{x}\|} \geq 0\}$. The limiting subdifferential of $F(x)$ at $\bar{x} \in \operatorname{dom}(F)$, denoted as $\partial F(\bar{x})$, is defined as $\partial F(\bar{x}) \triangleq \{v \in \mathbb{R}^n : \exists\, x^k \to \bar{x},\ F(x^k) \to F(\bar{x}),\ v^k \in \widehat{\partial} F(x^k),\ v^k \to v\}$. It is straightforward to verify that $\widehat{\partial} F(x) \subseteq \partial F(x)$, $\widehat{\partial}(\alpha F)(x) = \alpha \widehat{\partial} F(x)$, and $\partial(\alpha F)(x) = \alpha \partial F(x)$ hold for any $x \in \operatorname{dom}(F)$ and $\alpha > 0$.
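Returning briefly to the conjugate pairs listed under Conjugate Functions above: they can be sanity-checked numerically. The sketch below (helper names `h` and `h_conj` are our own) verifies, for $h = \|\cdot\|_1$ with conjugate $h^*(v) = \iota_{[-1,1]^m}(v)$, that the Fenchel-Young inequality $\langle x, y\rangle \leq h(x) + h^*(y)$ holds with equality precisely at a subgradient:

```python
import numpy as np

# For h(y) = ||y||_1 the conjugate is the indicator of the unit l_inf ball:
# zero inside {||v||_inf <= 1}, +inf outside.
def h(x):
    return np.abs(x).sum()

def h_conj(v):
    return 0.0 if np.max(np.abs(v)) <= 1 + 1e-12 else np.inf

x = np.array([2.0, -0.5, 0.0])
y = np.sign(x)  # a subgradient of ||.||_1 at x (entries at zeros may be any value in [-1,1])

# Fenchel-Young holds with equality exactly when y is a subgradient at x.
assert abs(x @ y - (h(x) + h_conj(y))) < 1e-12

# A point of the ball that is not a subgradient at x gives strict inequality.
v = np.array([0.5, 0.5, 0.5])
assert x @ v < h(x) + h_conj(v)
```

The same check works for the other listed pairs, e.g. $h(y) = \|y\|_\infty$ against $\iota_{\|y\|_1 \leq 1}$.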
Additionally, if $F(\cdot)$ is differentiable at $x$, then $\widehat{\partial} F(x) = \partial F(x) = \{\nabla F(x)\}$, with $\nabla F(x)$ being the gradient of $F(\cdot)$ at $x$; when $F(\cdot)$ is convex, $\widehat{\partial} F(x)$ and $\partial F(x)$ reduce to the classical subdifferential for convex functions, i.e., $\widehat{\partial} F(x) = \partial F(x) = \{v \in \mathbb{R}^n : F(z) - F(x) - \langle v, z - x\rangle \geq 0,\ \forall z \in \mathbb{R}^n\}$.

Sum and Quotient Rules for the Fréchet Subdifferential. First, we examine sum rules for the Fréchet subdifferential. Let $\varphi_1, \varphi_2 : \mathbb{R}^n \to (-\infty, +\infty]$ be proper, closed functions, and let $x \in \operatorname{dom}(\varphi_1) \cap \operatorname{dom}(\varphi_2)$. Then, $\widehat{\partial}\varphi_1(x) + \widehat{\partial}\varphi_2(x) \subseteq \widehat{\partial}(\varphi_1 + \varphi_2)(x)$, where equality holds if $\varphi_1$ or $\varphi_2$ is differentiable at $x$ (see Corollary 10.9 in (Rockafellar & Wets, 2009)). Moreover, $\partial(\varphi_1 + \varphi_2)(x) \subseteq \partial\varphi_1(x) + \partial\varphi_2(x)$ holds when $\varphi_1$ or $\varphi_2$ is locally Lipschitz continuous at $x$, and it holds with equality when $\varphi_1$ or $\varphi_2$ is continuously differentiable at $x$ (see Exercise 10.10 in (Rockafellar & Wets, 2009)).

We now review the quotient rules for the Fréchet subdifferential. The concept of calmness (Rockafellar & Wets, 2009) plays an important role in our analysis. A proper function $g : \mathbb{R}^n \to (-\infty, +\infty]$ is said to be calm at $x \in \operatorname{dom}(g)$ if there exist $\varepsilon > 0$ and $\kappa > 0$ such that $|g(x) - g(x')| \leq \kappa\|x - x'\|$ for all $x' \in B(x, \varepsilon) \triangleq \{z \in \mathbb{R}^n : \|z - x\| < \varepsilon\}$. Many convex functions, including Lipschitz continuous functions, satisfy the calmness condition. We define $\varphi : \mathbb{R}^n \mapsto (-\infty, +\infty]$ as: $\varphi(x) = \frac{\varphi_1(x)}{\varphi_2(x)}$ if $x \in \operatorname{dom}(\varphi_1)$ and $\varphi_2(x) \neq 0$, and $\varphi(x) = +\infty$ otherwise. The following lemma concerns the quotient rules for the Fréchet subdifferential.

Lemma A.1. Let $\varphi_1 : \mathbb{R}^n \to (-\infty, +\infty]$ and $\varphi_2 : \mathbb{R}^n \to \mathbb{R}$ be two functions which are finite at $x$ with $\varphi_2(x) > 0$. We denote $\alpha_1 := \varphi_1(x)$ and $\alpha_2 := \varphi_2(x)$. Suppose that $\varphi_1$ is closed and continuous at $x$ relative to $\operatorname{dom}(\varphi_1)$, and that $\varphi_2$ is calm at $x$. We have the following results: (a) It holds that $\widehat{\partial}\varphi(x) = \frac{1}{\alpha_2^2}\widehat{\partial}(\alpha_2\varphi_1 - \alpha_1\varphi_2)(x)$. (b) If $\varphi_2(x)$ is differentiable at $x$, then $\widehat{\partial}\varphi(x) = \frac{1}{\alpha_2^2}\{\alpha_2\widehat{\partial}\varphi_1(x) - \alpha_1\nabla\varphi_2(x)\}$. (c) If $\alpha_1 \geq 0$ and $\varphi_2(x)$ is convex, then $\widehat{\partial}\varphi(x) \subseteq \frac{1}{\alpha_2^2}\{\widehat{\partial}(\alpha_2\varphi_1)(x) - \alpha_1\widehat{\partial}\varphi_2(x)\} = \frac{1}{\alpha_2}\{\widehat{\partial}\varphi_1(x) - \frac{\alpha_1}{\alpha_2}\widehat{\partial}\varphi_2(x)\}$.

Proof.
Refer to Proposition 2.2 in (Li & Zhang, 2022) and Lemma 2.1 in (Boţ et al., 2023a). We omit the proofs for conciseness.

A.3 RELEVANT LEMMAS

We present some useful lemmas that will be used subsequently.

Lemma A.2. (Extended Moreau Decomposition) Assume $h(y)$ is convex. For any $\mu > 0$ and $b \in \mathbb{R}^m$, we define $\operatorname{Prox}(b; h, \mu) \triangleq \arg\min_y h(y) + \frac{1}{2\mu}\|b - y\|_2^2$. It holds that: $b = \operatorname{Prox}(b; h, \mu) + \mu\operatorname{Prox}(\frac{b}{\mu}; h^*, \frac{1}{\mu})$.

Proof. The conclusion of this lemma can be found in (Nesterov, 2013; Parikh et al., 2014). For completeness, we present the proof. We define $\bar{y} \triangleq \operatorname{Prox}(b; h, \mu)$. We derive:
$b - \bar{y} \in \partial(\mu h)(\bar{y}) \overset{①}{\Leftrightarrow} \bar{y} \in \partial((\mu h)^*)(b - \bar{y}) \Leftrightarrow b - (b - \bar{y}) \in \partial((\mu h)^*)(b - \bar{y}) \Leftrightarrow b - \bar{y} = \operatorname{Prox}(b; (\mu h)^*, 1) \overset{②}{\Leftrightarrow} b = \operatorname{Prox}(b; h, \mu) + \mu\operatorname{Prox}(\frac{b}{\mu}; h^*, \frac{1}{\mu})$,
where step ① uses the definition of the conjugate function and the property of the subdifferential that $v \in \partial h(y) \Leftrightarrow y \in \partial h^*(v)$; step ② uses the following derivations:
$\arg\min_y (\mu h)^*(y) + \frac{1}{2}\|y - b\|_2^2 = \arg\min_y \mu h^*(y/\mu) + \frac{1}{2}\|y - b\|_2^2 = \mu\arg\min_{y'} h^*(y') + \frac{\mu}{2}\|y' - b/\mu\|_2^2 = \mu\operatorname{Prox}(\frac{b}{\mu}; h^*, \frac{1}{\mu})$.

Lemma A.3. Let $\beta_t \triangleq \beta_0(1 + \xi t^p)$ and $\mu_t \triangleq \frac{\chi}{\beta_t}$ for some constant $\chi > 0$, where $\beta_0 > 0$ and $\xi \in (0, 1)$. For any integer $t \geq 1$, we have: $(\frac{\mu_{t-1}}{\mu_t} - 1)^2 \leq \frac{6}{t} - \frac{6}{t+1}$.

Proof. We define $h(t) \triangleq t^p$, where $p \in (0, 1)$. Given the concavity of $h(t)$, it follows that $h(y) - h(x) \leq \langle y - x, \nabla h(x)\rangle$. Letting $x = t - 1$ and $y = t$, for all $t > 1$, we have:
$t^p - (t-1)^p \leq p(t-1)^{p-1}$. (8)
Part (a). When $t = 1$, we have: $(\frac{\mu_0}{\mu_1} - 1)^2 - (\frac{6}{1} - \frac{6}{2}) = (\frac{\beta_1}{\beta_0} - 1)^2 - 3 = \xi^2 - 3 < 0$.
Part (b). When $t \geq 2$, we derive:
$(\frac{\mu_{t-1}}{\mu_t} - 1)^2 \overset{①}{=} (\frac{\beta_t}{\beta_{t-1}} - 1)^2 \overset{②}{=} (\frac{1+\xi t^p}{1+\xi(t-1)^p} - 1)^2 \overset{③}{\leq} (\frac{t^p}{(t-1)^p} - 1)^2 = (\frac{t^p - (t-1)^p}{(t-1)^p})^2 \overset{④}{\leq} (p(t-1)^{-1})^2 \leq \frac{1}{(t-1)^2} \overset{⑤}{\leq} \frac{6}{t} - \frac{6}{t+1}$,
where step ① uses $\mu_t \triangleq \frac{\chi}{\beta_t}$; step ② uses $\beta_t = \beta_0(1 + \xi t^p)$; step ③ uses $1 < \frac{1+\xi t^p}{1+\xi(t-1)^p} \leq \frac{t^p}{(t-1)^p}$; step ④ uses Inequality (8) and $p \leq 1$; step ⑤ uses $\frac{1}{(t-1)^2} \leq \frac{6}{t} - \frac{6}{t+1}$ for any integer $t \geq 2$.

Lemma A.4. For all $t \geq 1$ and $p \in (0, 1)$, it holds that: $-1 \leq (t+1)^p - t^p - 2p(t+1)^{p-1} \leq 0$.

Proof. We assume $t \geq 1$ and $p \in (0, 1)$. First, consider the function $h(t) = t^p$. We have $\nabla h(t) = pt^{p-1}$ and $\nabla^2 h(t) = p(p-1)t^{p-2} < 0$. Therefore, $h(t)$ is concave, and for all $x > 0$ and $y > 0$ we have $h(y) - h(x) \leq \langle y - x, \nabla h(x)\rangle$.
Letting $x = t$ and $y = t+1$, we have:
$(t+1)^p - t^p \leq pt^{p-1}$. (9)
Letting $x = t+1$ and $y = t$, we have:
$(t+1)^p - t^p \geq p(t+1)^{p-1}$. (10)
Second, we show that $t^{p-1} \leq 2(t+1)^{p-1}$. Given $t \geq 1$, we have $\frac{t+1}{t} \leq 2$, leading to $(\frac{t+1}{t})^{p-1} \geq 2^{p-1} \geq \frac{1}{2}$. We obtain:
$t^{p-1} \leq 2(t+1)^{p-1}$. (11)
Part (a). We now prove the upper bound. We derive: $(t+1)^p - t^p - 2p(t+1)^{p-1} \overset{①}{\leq} pt^{p-1} - 2p(t+1)^{p-1} \overset{②}{\leq} 0$, where step ① uses Inequality (9); step ② uses Inequality (11).
Part (b). We now focus on the lower bound. We obtain: $(t+1)^p - t^p - 2p(t+1)^{p-1} \overset{①}{\geq} p(t+1)^{p-1} - 2p(t+1)^{p-1} = -p(t+1)^{p-1} \overset{②}{\geq} -p \overset{③}{\geq} -1$, where step ① uses Inequality (10); step ② uses $(t+1)^{p-1} \leq 1$; step ③ uses $p \leq 1$.

Lemma A.5. For all $T \geq 1$ and $p \in (0, 1)$, it holds that: $(1-p)T^{1-p} \leq \sum_{t=1}^{T} \frac{1}{t^p} \leq \frac{T^{1-p}}{1-p}$.

Proof. We let $t \geq 1$, $p \in (0, 1)$, and $q \in (0, 1)$. We define $h(t) \triangleq \frac{1}{q}(t+1)^q - \frac{1}{q} - qt^q$.
First, we prove that $h(1) = \frac{2^q - 1}{q} - q \geq 0$. We let $f(q) = 2^q - 1 - q^2$. We have $\nabla f(q) = \ln(2)2^q - 2q$ and $\nabla^2 f(q) = \ln(2)^2 2^q - 2 \leq \ln(2)^2 \cdot 2 - 2 < 0$. Therefore, $f(q)$ is concave, and its minimum over $[0,1]$ lies at a boundary point. We have $h(1) = \frac{f(q)}{q} \geq \frac{1}{q}\min(f(0), f(1)) = \frac{1}{q}\min(2^0 - 1 - 0^2,\ 2^1 - 1 - 1^2) = 0$. Therefore, we have:
$h(1) \geq 0$. (12)
Second, we prove that $h(t) \geq 0$ for all $t \geq 1$. The derivative $\nabla h(t) = (t+1)^{q-1} - q^2 t^{q-1}$ is nonnegative for $t \geq 1$ and $q \in (0,1)$, so the function $h(t)$ is increasing. We further obtain:
$\frac{1}{q}(t+1)^q - \frac{1}{q} - qt^q \geq h(1) \overset{①}{\geq} 0$, (13)
where step ① uses Inequality (12).
Third, we define $h(t) = t^{-p}$. Using the integral test for convergence\footnote{https://en.wikipedia.org/wiki/Integral_test_for_convergence}, we have: $\int_1^{T+1} h(x)\,dx \leq \sum_{t=1}^T h(t) \leq h(1) + \int_1^T h(x)\,dx$.
Part (a). We define $g(x) = \frac{1}{1-p}x^{1-p}$, so that $\nabla g(x) = x^{-p}$. We derive: $\sum_{t=1}^T t^{-p} \leq g(1)\cdot\frac{1-p}{1-p}$-free form $1 + \int_1^T x^{-p}\,dx \overset{①}{=} 1 + g(T) - g(1) = 1 + \frac{T^{1-p}}{1-p} - \frac{1}{1-p} = \frac{T^{1-p} - p}{1-p} < \frac{T^{1-p}}{1-p}$, where step ① uses $\nabla g(x) = x^{-p}$.
Part (b). We derive: $\sum_{t=1}^T t^{-p} \geq \sum_{t=1}^T \int_t^{t+1} x^{-p}\,dx = \int_1^{T+1} x^{-p}\,dx \overset{①}{=} g(T+1) - g(1) = \frac{(T+1)^{1-p} - 1}{1-p} \overset{②}{\geq} (1-p)T^{1-p}$, where step ① uses $\nabla g(x) = x^{-p}$; step ② uses Inequality (13) with $q = 1-p$ and $t = T$.
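The two-sided estimate of Lemma A.5 is easy to confirm numerically; a minimal sketch (the helper name `partial_sum` is ours):

```python
# Check (1-p) * T**(1-p) <= sum_{t=1}^T t**(-p) <= T**(1-p) / (1-p)
# for several p in (0, 1) and horizons T (Lemma A.5).
def partial_sum(T, p):
    return sum(t ** (-p) for t in range(1, T + 1))

for p in (0.1, 1.0 / 3.0, 0.5, 0.9):
    for T in (1, 10, 100, 1000):
        s = partial_sum(T, p)
        assert (1 - p) * T ** (1 - p) <= s <= T ** (1 - p) / (1 - p)
```

Such partial-sum bounds are what convert the per-iteration estimates into the overall $\mathcal{O}(1/\epsilon^3)$ complexity of Theorem 6.3.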
B ADDITIONAL MOTIVATING APPLICATIONS

Robust Sharpe Ratio Maximization (Robust SRM). Recall that the standard SRM, which operates without data uncertainty and is common in finance, can be formulated as: $\max_{x \in \Omega} \frac{d^\mathsf{T}x - r}{\sqrt{x^\mathsf{T}Cx}}$, where $C \succ 0$ is the risk covariance matrix, $d \in \mathbb{R}^n$ are the expected returns, $r \in \mathbb{R}$ is the risk-free rate, and $\Omega \triangleq \{x \mid x \geq 0,\ \|x\|_1 = 1\}$ ensures valid portfolio weights (see (Chen et al., 2011; Boţ et al., 2023b)). In contrast, the robust SRM, designed to handle scenario data uncertainty, is defined by: $\min_{x \in \Omega} \frac{\max_{i=1}^m\{b_i - (Dx)_i\}}{\max_{i=1}^p \sqrt{x^\mathsf{T}C^{(i)}x}}$, where each $C^{(i)} \in \mathbb{R}^{n \times n}$ satisfies $C^{(i)} \succ 0$, $b \in \mathbb{R}^m$, and $D \in \mathbb{R}^{m \times n}$. The corresponding equivalent optimization problem is formulated as:
$\min_{x \in \mathbb{R}^n}\ \frac{\max(0, \max(b - Dx))}{\max_{i=1}^p \sqrt{x^\mathsf{T}C^{(i)}x}},\ \text{s.t.}\ x \in \Omega$. (14)
We use $\max(0, \max(b - Dx))$ instead of simply $\max(b - Dx)$ to explicitly enforce nonnegativity in the numerator for all $x \in \Omega$. Problem (14) corresponds to Problem (1) with $f(x) = g(x) = 0$, $\delta(x) = \iota_\Omega(x)$, $A = -D$, $h(y) = \max(0, \max(y + b))$, and $d(x) = \max_{i=1}^p \sqrt{x^\mathsf{T}C^{(i)}x}$. Notably, both $d(x)$ and $d(x)^{1/2}$ are $W_d$-weakly convex with $W_d = 0$.

Robust Sparse Recovery. This is a signal processing technique that effectively acquires and reconstructs a signal by finding a solution of an underdetermined linear system. Given a design matrix $A \in \mathbb{R}^{m \times n}$ and an observation vector $b \in \mathbb{R}^m$, robust sparse recovery can be formulated as the following fractional optimization problem (Li & Zhang, 2022; Yang & Zhang, 2011; Yuan, 2023):
$\min_{x \in \mathbb{R}^n}\ \frac{\rho_1\|Ax - b\|_1 + \rho_2\|x\|_1}{\|x\|_{[k]}},\ \text{s.t.}\ x \in \Omega$, (15)
where $\Omega \triangleq \{x \mid \|x\| \leq \rho_0\}$, and $\rho_0, \rho_1, \rho_2$ are positive constants provided by the users. Problem (15) coincides with Problem (1) with $f(x) = g(x) = 0$, $\delta(x) = \iota_\Omega(x) + \rho_2\|x\|_1$, $h(y) = \|y - b\|_1$, and $d(x) = \|x\|_{[k]}$. Importantly, $d(x)$ is $W_d$-weakly convex with $W_d = 0$.

C PROOFS FOR SECTION 3

C.1 PROOF OF LEMMA 3.9

Proof. The results of this lemma can be derived using standard convex analysis.
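Since the proofs that follow repeatedly manipulate proximal operators, the Moreau envelope $h_\mu$, and the extended Moreau decomposition (Lemma A.2), a small numerical illustration for $h = \|\cdot\|_1$ may help fix ideas. The helper names below are our own, and $C_h = \sqrt{m}$ is the Lipschitz constant of the $\ell_1$ norm on $\mathbb{R}^m$:

```python
import numpy as np

def prox_l1(b, mu):
    # Prox(b; ||.||_1, mu): soft-thresholding.
    return np.sign(b) * np.maximum(np.abs(b) - mu, 0.0)

def prox_l1_conj(b):
    # Prox(b; h*, .): h* is the indicator of {||v||_inf <= 1}, so its prox is
    # the projection onto that ball (the step size is irrelevant for indicators).
    return np.clip(b, -1.0, 1.0)

def moreau_env_l1(y, mu):
    # h_mu(y) = h(Prox(y; h, mu)) + (1/(2*mu)) * ||y - Prox(y; h, mu)||^2.
    p = prox_l1(y, mu)
    return np.abs(p).sum() + np.sum((y - p) ** 2) / (2.0 * mu)

rng = np.random.default_rng(0)
m, mu = 7, 0.3
y = rng.standard_normal(m)

# Extended Moreau decomposition (Lemma A.2):
# b = Prox(b; h, mu) + mu * Prox(b/mu; h*, 1/mu).
assert np.allclose(y, prox_l1(y, mu) + mu * prox_l1_conj(y / mu))

# Smoothing gap (cf. Lemma 3.9(b)): 0 <= h(y) - h_mu(y) <= (mu/2)*C_h^2, C_h^2 = m.
gap = np.abs(y).sum() - moreau_env_l1(y, mu)
assert 0.0 <= gap <= mu / 2.0 * m
```

This is only a sketch for one concrete choice of $h$; the lemmas below establish these facts for general convex $h$.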
Some of these results are well-established and scattered throughout the literature (Nesterov, 2003; Yurtsever et al., 2018; Silveti-Falls et al., 2020; Nesterov, 2013). For completeness, we provide the full proofs of the lemma. For any $y \in \mathbb{R}^m$, we define $h_\mu(y) \triangleq \max_v \langle y, v\rangle - h^*(v) - \frac{\mu}{2}\|v\|_2^2$. Since $\mu > 0$ and $\frac{\mu}{2}\|v\|_2^2$ is $\mu$-strongly convex, the maximization problem has a unique solution and thus the subgradient set is a singleton (Nesterov, 2003), i.e., $\partial h_\mu(y) = \nabla h_\mu(y) = \arg\max_v \{\langle y, v\rangle - h^*(v) - \frac{\mu}{2}\|v\|_2^2\}$.

Part (a). We now prove that $h_\mu(y)$ is $(1/\mu)$-smooth. For any $y_1, y_2 \in \mathbb{R}^m$, we define $v_1 = \arg\max_v\{\langle y_1, v\rangle - h^*(v) - \frac{\mu}{2}\|v\|_2^2\}$ and $v_2 = \arg\max_v\{\langle y_2, v\rangle - h^*(v) - \frac{\mu}{2}\|v\|_2^2\}$. We have:
$\|y_1 - y_2\|\,\|v_1 - v_2\| \overset{①}{\geq} \langle y_1 - y_2, v_1 - v_2\rangle \overset{②}{=} \mu\langle v_1 - v_2, v_1 - v_2\rangle + \langle \xi_1 - \xi_2, v_1 - v_2\rangle \overset{③}{\geq} \mu\|v_2 - v_1\|_2^2 + 0$,
where $\xi_i \in \partial h^*(v_i)$; step ① uses the Cauchy-Schwarz inequality; step ② uses the optimality of $v_1$ that $y_1 - \xi_1 - \mu v_1 = 0$ and the optimality of $v_2$ that $y_2 - \xi_2 - \mu v_2 = 0$; step ③ uses the monotonicity of subdifferentials for the convex function $h^*(v)$. Dividing both sides by $\|v_1 - v_2\|$, we have $\mu\|v_2 - v_1\| \leq \|y_1 - y_2\|$, which implies that the function $h_\mu(y)$ is $(1/\mu)$-smooth.
We now prove that the function $h_\mu(y)$ is convex. For any $y_1, y_2 \in \mathbb{R}^m$ and $\mu > 0$, we define $u_1 = \arg\max_u\{\langle y_1, u\rangle - h^*(u) - \frac{\mu}{2}\|u\|_2^2\}$ and $u_2 = \arg\max_u\{\langle y_2, u\rangle - h^*(u) - \frac{\mu}{2}\|u\|_2^2\}$. We have:
$h_\mu(y_1) - h_\mu(y_2) \overset{①}{=} h_\mu(y_1) - \{\langle y_2, u_2\rangle - h^*(u_2) - \frac{\mu}{2}\|u_2\|_2^2\} \overset{②}{\leq} h_\mu(y_1) - \{\langle y_2, u_1\rangle - h^*(u_1) - \frac{\mu}{2}\|u_1\|_2^2\} \overset{③}{=} \{\langle y_1, u_1\rangle - h^*(u_1) - \frac{\mu}{2}\|u_1\|_2^2\} - \{\langle y_2, u_1\rangle - h^*(u_1) - \frac{\mu}{2}\|u_1\|_2^2\} = \langle u_1, y_1 - y_2\rangle$,
where step ① uses the definition of $h_\mu(y_2)$ and $u_2$; step ② uses the optimality of $u_2$; step ③ uses the definition of $h_\mu(y_1)$. The resulting gradient inequality implies convexity of $h_\mu$.

Part (b). For any $y \in \mathbb{R}^m$ and $\mu > 0$, we define $h_\mu(y) \triangleq \max_v \langle y, v\rangle - h^*(v) - \frac{\mu}{2}\|v\|_2^2$, $u_1 \triangleq \arg\max_u\{\langle y, u\rangle - h^*(u)\}$, and $u_2 \triangleq \arg\max_u\{\langle y, u\rangle - h^*(u) - \frac{\mu}{2}\|u\|_2^2\}$.
b-i). We now prove that $0 \leq h(y) - h_\mu(y)$. We have:
$h_\mu(y) = \max_v\{v^\mathsf{T}y - h^*(v) - \frac{\mu}{2}\|v\|_2^2\} \overset{①}{\leq} \max_v\{v^\mathsf{T}y - h^*(v)\} + \max_v\{-\frac{\mu}{2}\|v\|_2^2\} \overset{②}{=} h(y)$,
where step ① uses a general property of the maximum function when applied to the sum of two functions; step ② uses the definition of $h(y) = \max_v\{v^\mathsf{T}y - h^*(v)\}$ and the fact that $\max_v\{-\frac{\mu}{2}\|v\|_2^2\} = 0$.
b-ii). We now prove that $h(y) - h_\mu(y) \leq \frac{\mu}{2}C_h^2$. We have:
$h(y) - h_\mu(y) \overset{①}{=} \{\langle y, u_1\rangle - h^*(u_1)\} - h_\mu(y) \overset{②}{\leq} \{\langle y, u_1\rangle - h^*(u_1)\} - \{\langle y, u_1\rangle - h^*(u_1) - \frac{\mu}{2}\|u_1\|_2^2\} = \frac{\mu}{2}\|u_1\|_2^2 \overset{③}{\leq} \frac{\mu}{2}C_h^2$,
where step ① uses the definition of $h(y)$ that $h(y) = \langle y, u_1\rangle - h^*(u_1)$; step ② uses the definition of $h_\mu(y)$ and the optimality of $u_2$: $h_\mu(y) = \{\langle y, u_2\rangle - h^*(u_2) - \frac{\mu}{2}\|u_2\|_2^2\} \geq \{\langle y, u_1\rangle - h^*(u_1) - \frac{\mu}{2}\|u_1\|_2^2\}$; step ③ uses $u_1 \in \partial h(y)$ and the $C_h$-Lipschitz continuity of $h$.

Part (c). We now prove that the function $h_\mu(y)$ is $C_h$-Lipschitz continuous. For any $y \in \mathbb{R}^m$ and $\mu > 0$, we have:
$\nabla h_\mu(y) \overset{①}{=} \arg\max_v\{\langle y, v\rangle - h^*(v) - \frac{\mu}{2}\|v\|_2^2\} = \arg\min_v\{h^*(v) + \frac{\mu}{2}\|v - y/\mu\|_2^2\} \overset{②}{=} \operatorname{Prox}(\frac{y}{\mu}; h^*, \frac{1}{\mu}) \overset{③}{=} \frac{1}{\mu}(y - \operatorname{Prox}(y; h, \mu))$ (16)
$\overset{④}{\in} \partial h(\operatorname{Prox}(y; h, \mu))$, (17)
where step ① uses the fact that the function $h_\mu(y)$ is smooth and its gradient can be computed as $\nabla h_\mu(y) = \arg\max_v\{\langle y, v\rangle - h^*(v) - \frac{\mu}{2}\|v\|_2^2\}$; step ② uses the definition of $\operatorname{Prox}(b; h, \mu) \triangleq \arg\min_y h(y) + \frac{1}{2\mu}\|b - y\|_2^2$; step ③ uses the extended Moreau decomposition property as shown in Lemma A.2; step ④ uses the optimality of $\operatorname{Prox}(y; h, \mu)$ that $0 \in \partial h(\operatorname{Prox}(y; h, \mu)) + \frac{1}{\mu}(\operatorname{Prox}(y; h, \mu) - y)$. Using Equation (17), we directly conclude that $h_\mu(y)$ is $C_h$-Lipschitz continuous with $\|\nabla h_\mu(y)\| \leq C_h$.

Part (d). We show how to compute $h_\mu(y)$. For any $y \in \mathbb{R}^m$ and $\mu > 0$, we define $\bar{v} = \arg\max_v\{\langle y, v\rangle - h^*(v) - \frac{\mu}{2}\|v\|_2^2\}$. We have:
$y - \mu\bar{v} \in \partial h^*(\bar{v}) \overset{①}{\Leftrightarrow} \langle y - \mu\bar{v}, \bar{v}\rangle = h^*(\bar{v}) + h(y - \mu\bar{v})$, (18)
where step ① uses the equivalence relation $\langle x, y\rangle = h(x) + h^*(y) \Leftrightarrow y \in \partial h(x) \Leftrightarrow x \in \partial h^*(y)$ for all $x$ and $y$, as stated in Proposition 11.3 of (Rockafellar & Wets, 2009). Therefore, we have:
$h_\mu(y) \overset{①}{=} \langle y, \bar{v}\rangle - h^*(\bar{v}) - \frac{\mu}{2}\|\bar{v}\|_2^2 \overset{②}{=} \langle y, \bar{v}\rangle - \langle y - \mu\bar{v}, \bar{v}\rangle + h(y - \mu\bar{v}) - \frac{\mu}{2}\|\bar{v}\|_2^2 \overset{③}{=} h(\operatorname{Prox}(y; h, \mu)) + \mu\|\bar{v}\|_2^2 - \frac{\mu}{2}\|\bar{v}\|_2^2 = h(\operatorname{Prox}(y; h, \mu)) + \frac{\mu}{2}\|\frac{1}{\mu}(y - \operatorname{Prox}(y; h, \mu))\|_2^2$,
where step ① uses the definition of $h_\mu(y)$; step ② uses Equation (18) that $h^*(\bar{v}) = \langle y - \mu\bar{v}, \bar{v}\rangle - h(y - \mu\bar{v})$; step ③ uses $\bar{v} = \frac{1}{\mu}(y - \operatorname{Prox}(y; h, \mu))$, as shown in Equation (16), so that $y - \mu\bar{v} = \operatorname{Prox}(y; h, \mu)$ and $\langle y, \bar{v}\rangle - \langle y - \mu\bar{v}, \bar{v}\rangle = \mu\|\bar{v}\|_2^2$.

Part (e).
For any $y \in \mathbb{R}^m$ and $\mu_1, \mu_2 > 0$ with $\mu_2 \leq \mu_1$, we define $u_1 = \arg\max_u\{\langle y, u\rangle - h^*(u) - \frac{\mu_1}{2}\|u\|_2^2\}$ and $u_2 = \arg\max_u\{\langle y, u\rangle - h^*(u) - \frac{\mu_2}{2}\|u\|_2^2\}$.
e-i). We now prove that $0 \leq h_{\mu_2}(y) - h_{\mu_1}(y)$ for all $0 < \mu_2 \leq \mu_1$. We have:
$h_{\mu_1}(y) \overset{①}{=} \langle y, u_1\rangle - h^*(u_1) - \frac{\mu_1}{2}\|u_1\|_2^2 \overset{②}{\leq} \langle y, u_1\rangle - h^*(u_1) - \frac{\mu_2}{2}\|u_1\|_2^2 \overset{③}{\leq} h_{\mu_2}(y)$,
where step ① uses the definition of $h_{\mu_1}(y)$; step ② uses $\mu_2 \leq \mu_1$; step ③ uses the optimality of $u_2$ for the problem defining $h_{\mu_2}(y)$.
e-ii). We now prove that $h_{\mu_2}(y) - h_{\mu_1}(y) \leq \frac{\mu_1 - \mu_2}{2}C_h^2$ for all $0 < \mu_2 \leq \mu_1$. We have:
$h_{\mu_2}(y) - h_{\mu_1}(y) \overset{①}{=} \{\langle y, u_2\rangle - h^*(u_2) - \frac{\mu_2}{2}\|u_2\|_2^2\} - h_{\mu_1}(y) \overset{②}{\leq} \{\langle y, u_2\rangle - h^*(u_2) - \frac{\mu_2}{2}\|u_2\|_2^2\} - \{\langle y, u_2\rangle - h^*(u_2) - \frac{\mu_1}{2}\|u_2\|_2^2\} = \frac{\mu_1 - \mu_2}{2}\|\nabla h_{\mu_2}(y)\|_2^2 \overset{③}{\leq} \frac{\mu_1 - \mu_2}{2}C_h^2$,
where step ① uses the definition of $h_{\mu_2}(y)$; step ② uses the definition of $h_{\mu_1}(y)$ and the optimality of $u_1$: $h_{\mu_1}(y) = \{\langle y, u_1\rangle - h^*(u_1) - \frac{\mu_1}{2}\|u_1\|_2^2\} \geq \{\langle y, u_2\rangle - h^*(u_2) - \frac{\mu_1}{2}\|u_2\|_2^2\}$; step ③ uses Claim (c) of this lemma, i.e., $\|\nabla h_{\mu_2}(y)\| = \|u_2\| \leq C_h$.

Part (f). We now prove that $\|\nabla h_{\mu_2}(y) - \nabla h_{\mu_1}(y)\| \leq (\frac{\mu_1}{\mu_2} - 1)C_h$ for all $0 < \mu_2 \leq \mu_1$. Using Equality (16), we have $\nabla h_\mu(y) = \frac{1}{\mu}(y - \operatorname{Prox}(y; h, \mu))$. We now examine the mapping $H(\upsilon) \triangleq \upsilon(y - \operatorname{Prox}(y; h, 1/\upsilon))$, so that $H(1/\mu) = \nabla h_\mu(y)$. We derive:
$\lim_{\delta \to 0} \frac{H(\upsilon + \delta) - H(\upsilon)}{\delta} = \lim_{\delta \to 0} \frac{(\upsilon + \delta)(y - \operatorname{Prox}(y; h, 1/(\upsilon + \delta))) - \upsilon(y - \operatorname{Prox}(y; h, 1/\upsilon))}{\delta} = y - \operatorname{Prox}(y; h, 1/\upsilon)$.
Therefore, the first-order derivative of the mapping $H(\upsilon)$ w.r.t. $\upsilon$ always exists and can be computed as $\nabla_\upsilon H(\upsilon) = y - \operatorname{Prox}(y; h, 1/\upsilon)$, resulting in: for all $\upsilon, \upsilon' > 0$, $\|H(\upsilon) - H(\upsilon')\| \leq |\upsilon - \upsilon'|\cdot\|y - \operatorname{Prox}(y; h, 1/\upsilon)\|$. Letting $\upsilon = 1/\mu_1$ and $\upsilon' = 1/\mu_2$, we derive:
$\|\nabla h_{\mu_1}(y) - \nabla h_{\mu_2}(y)\| \leq |\frac{1}{\mu_1} - \frac{1}{\mu_2}|\cdot\|y - \operatorname{Prox}(y; h, \mu_1)\| \overset{①}{\leq} |\frac{1}{\mu_1} - \frac{1}{\mu_2}|\cdot\mu_1\|\partial h(\operatorname{Prox}(y; h, \mu_1))\| \overset{②}{\leq} |\frac{1}{\mu_1} - \frac{1}{\mu_2}|\cdot\mu_1 C_h = (\frac{\mu_1}{\mu_2} - 1)C_h$,
where step ① uses the optimality of $\operatorname{Prox}(y; h, \mu_1)$ that $0 \in \partial h(\operatorname{Prox}(y; h, \mu_1)) + \frac{1}{\mu_1}(\operatorname{Prox}(y; h, \mu_1) - y)$ for all $\mu_1$; step ② uses the Lipschitz continuity of $h(\cdot)$.

C.2 PROOF OF LEMMA 3.10

Proof.
Consider the strongly convex minimization problem: $\tilde{y} = \arg\min_y h_\mu(y) + \frac{\beta}{2}\|y - b\|_2^2$, which can be equivalently expressed as: $(\tilde{y}, \bar{v}) = \arg\min_y \max_v \{y^\mathsf{T}v - h^*(v) - \frac{\mu}{2}\|v\|_2^2 + \frac{\beta}{2}\|y - b\|_2^2\}$. Using the optimality of the variables $\{\tilde{y}, \bar{v}\}$, we have:
$0 = \bar{v} + \beta(\tilde{y} - b)$, i.e., $\tilde{y} = b - \frac{1}{\beta}\bar{v}$, (19)
$0 \in \partial h^*(\bar{v}) + \mu\bar{v} - \tilde{y}$. (20)
Plugging Equation (19) into Equation (20) to eliminate $\tilde{y}$ yields:
$0 \in \partial h^*(\bar{v}) + \mu\bar{v} - b + \frac{1}{\beta}\bar{v}$. (21)
Second, we derive the following equalities:
$\bar{v} \overset{①}{=} \arg\max_v -h^*(v) + \langle v, b\rangle - \frac{\mu}{2}\|v\|_2^2 - \frac{1}{2\beta}\|v\|_2^2$ (22)
$= \arg\min_v h^*(v) - \langle v, b\rangle + \frac{\mu + 1/\beta}{2}\|v\|_2^2 = \arg\min_v h^*(v) + \frac{\mu + 1/\beta}{2}\|v - \frac{b}{\mu + 1/\beta}\|_2^2 \overset{②}{=} \operatorname{Prox}(\frac{b}{\mu + 1/\beta}; h^*, \frac{1}{\mu + 1/\beta}) \overset{③}{=} \frac{1}{\mu + 1/\beta}(b - \operatorname{Prox}(b; h, \mu + \frac{1}{\beta}))$ (23)
$\overset{④}{\in} \partial h(\operatorname{Prox}(b; h, \mu + \frac{1}{\beta}))$, (24)
where step ① uses the fact that Equation (21) is the necessary and sufficient first-order optimality condition for Problem (22); step ② uses the definition of $\operatorname{Prox}(\cdot; \cdot, \cdot)$; step ③ uses the extended Moreau decomposition that $a = \operatorname{Prox}(a; h, \mu) + \mu\operatorname{Prox}(\frac{a}{\mu}; h^*, \frac{1}{\mu})$ for all $\mu > 0$ and $a$, as shown in Lemma A.2; step ④ uses the following necessary and sufficient first-order optimality condition for $\operatorname{Prox}(b; h, \mu + \frac{1}{\beta})$:
$\frac{1}{\mu + 1/\beta}\{b - \operatorname{Prox}(b; h, \mu + \frac{1}{\beta})\} \in \partial h(\operatorname{Prox}(b; h, \mu + \frac{1}{\beta}))$.
Part (a). Combining Equation (23) with Equation (19) to eliminate $\bar{v}$, we have: $\tilde{y} = b - \frac{1}{\beta}\cdot\frac{1}{\mu + 1/\beta}\{b - \operatorname{Prox}(b; h, \mu + 1/\beta)\} = b - \frac{1}{\mu\beta + 1}\{b - \operatorname{Prox}(b; h, \mu + 1/\beta)\}$.
Part (b). We define $\check{y} \triangleq \operatorname{Prox}(b; h, \mu + 1/\beta)$. We have:
$\beta(b - \tilde{y}) \overset{①}{=} \frac{\beta}{\mu\beta + 1}\{b - \operatorname{Prox}(b; h, \mu + 1/\beta)\} \overset{②}{=} \frac{1}{\mu + 1/\beta}\{b - \check{y}\} \overset{③}{=} \bar{v} \overset{④}{\in} \partial h(\check{y})$, (25)
where step ① uses Claim (a) of this lemma; step ② uses the definition of $\check{y}$; step ③ uses Equality (23); step ④ uses Equality (24) that $\bar{v} \in \partial h(\check{y})$.
Part (c). We now prove that $\|\tilde{y} - \check{y}\| \leq \mu C_h$. We derive:
$\|\tilde{y} - \check{y}\| \overset{①}{=} \|b - \frac{1}{\beta\mu + 1}(b - \check{y}) - \check{y}\| = \frac{\beta\mu}{1 + \beta\mu}\|\check{y} - b\| \overset{②}{\leq} \frac{\beta\mu}{1 + \beta\mu}(\mu + 1/\beta)\|\partial h(\check{y})\| \overset{③}{\leq} \frac{\beta\mu}{1 + \beta\mu}(\mu + 1/\beta)C_h = \mu C_h$,
where step ① uses Claim (a) of this lemma that $\tilde{y} = b - \frac{1}{\beta\mu + 1}\{b - \check{y}\}$; step ② uses Equality (25) that $b - \check{y} \in (\mu + 1/\beta)\partial h(\check{y})$; step ③ uses the fact that $h(y)$ is $C_h$-Lipschitz continuous.

D PROOFS FOR SECTION 4

D.1 PROOF OF LEMMA 5.3

Proof. Part (a). We now show that $z_{t+1} = \nabla h_{\mu_t}(y_{t+1})$.
For any t 0, we have: 0 ①= hµt(yt+1) + βt(yt+1 yt) + y S(xt+1, yt; zt; βt) ②= hµt(yt+1) + βt(yt+1 yt) + βt(yt Axt+1) zt ③= hµt(yt+1) zt+1, where step ①uses the optimality condition for yt+1; step ②uses y S(xt+1, yt; zt; βt) = βt(y Axt+1) zt; step ③uses zt+1 zt = βt(Axt+1 yt+1). Part (b). We now show that hµt(yt+1) h(ˇyt+1). For any t 0, we have: h( yt+1) ① βt(bt yt+1) ②= βt(Axt+1 + zt/βt yt+1) where step ①uses Lemma 3.10(b); step ②uses bt = yt y S(xt+1, yt; zt; βt)/βt = yt [βt(yt Axt+1) zt]/βt = Axt+1 + zt/βt; step ③uses zt+1 zt = βt(Axt+1 yt+1). D.2 PROOF OF LEMMA 5.4 Proof. Since t can be arbitrary, for any t 1, we have from Lemma 5.3: 0 = hµt(yt+1) zt+1, 0 = hµt 1(yt) zt. Combining the two equalities above, we have, for any t 1: zt+1 zt = hµt(yt+1) hµt 1(yt). This further leads to the following inequalities: zt+1 zt 2 2 = hµt(yt+1) hµt 1(yt) 2 2 = hµt(yt+1) hµt(yt) + hµt(yt) hµt 1(yt) 2 2 ① 2 hµt(yt+1) hµt(yt) 2 2 + 2 hµt(yt) hµt 1(yt) 2 2 µt (yt+1 yt) 2 2 + 2( µt 1 χ2 yt+1 yt 2 2 + 2C2 h( 6 where step ①uses a + b 2 2 2 a 2 2 + 2 a 2 2; step ②uses 1 µt -smoothness of hµt(y) for all y, and Lemma 3.9(f) that hµ1(y) hµ2(y) ( µ1 µ2 1)Ch for all 0 < µ2 < µ1; step ③uses µt = χ and Lemma A.3 that ( µt 1 t 6 t+1 for any integer t 1. D.3 PROOF OF LEMMA 5.5 Proof. We define z max( z0 , Ch), and y max( y0 , 2 β0 z + A x). Part (a). The conclusion xt x directly follows by assumption. Published as a conference paper at ICLR 2025 Part (b). We now show that zt z. Using Lemma 5.3(a), we have t 0, zt+1 = hµt(yt+1). This leads to t 1, zt hµt 1(yt) Ch. Therefore, it holds that t 0, zt max( z0 , Ch) z. Part (c). We now show that yt y. For all t 0, we have: βt (zt+1 zt) Axt+1 ② 1 β0 ( zt+1 + zt ) + A xt+1 ③ 2 β0 z + A x, where step ①uses zt+1 = zt + βt(Axt+1 yt+1); step ②uses the triangle inequality, the norm inequality, and β0 βt; step ③uses the boundedness of zt and xt. Therefore, it holds that t 0, yt max( y0 , 2 β0 z + A x) y. D.4 PROOF OF LEMMA 5.6 Proof. 
We define U(x, y; z; β, µ) δ(x) + f(x) g(x) + hµ(y) + Ax y, z + β 2 Ax y 2 2. We define v 8z2 + 1 2χz2, and v 16z2. First, given h(y) is convex, for all y, y Rm, we have: h(y ) h(y) h(y ), y y . (26) Second, for any y Rm and t 1, we have: Axt yt, zt h(y) ①= 1 βt 1 zt zt 1, zt h(y) ② 2 βt zt zt 1 zt h(y) ③ 2 βt 2z (z + h(y) ) ④ 2 βt 2z 2z, (27) where step ①uses the zt+1 zt = βt(Axt+1 yt+1); step ②uses the norm inequality and βt 2βt 1; step ③uses zt z and hµt(y) Ch z, as shown in Lemma D.3. Third, for any t 1, we have: 2 Axt yt 2 2 ①= βt 1 βt 1 (zt zt 1) 2 2 ② βt 2 (βt)2 zt zt 1 2 2 ③ 2 βt zt zt 1 2 2 ④ 2 βt (2 zt 2 2 + 2 zt 1 2 2) ⑤ 8 βt z2, (28) where step ①uses the zt+1 zt = βt(Axt+1 yt+1); step ②uses βt 2βt 1; step ③uses βt β0 for all t 0; step ④uses a + b 2 2 2 a 2 2 + 2 b 2 2; step ⑤uses zt z. Published as a conference paper at ICLR 2025 Part (a). We now derive the lower bound for any t 1, as follows: U(xt, yt; zt; βt, µt) f(xt) + δ(xt) g(xt) + Axt yt, zt + βt 2 Axt yt 2 2 + hµt(yt) ① F d(xt) h(Axt) + Axt yt, zt + βt 2 Axt yt 2 2 + hµt(yt) ② F d(xt) + h(yt) h(Axt) + Axt yt, zt + {hµt(yt) h(yt)} ③ F d(xt) + Axt yt, zt h(Axt) µt ④ F d(xt) 8z2 ⑤= F d(xt) v where step ①uses F(xt) f(xt)+δ(xt) g(xt)+h(Axt) d(xt) F; step ②uses βt 2 Axt yt 2 2 0; step ③uses Inequality (26) with y = yt and y = Axt, and the fact that h(y) hµ(y) + µ 2 C2 h (Lemma 3.9(b)); step ④uses Inequality (27), µt = χ βt , and Ch z; step ⑤uses the definition of v. Part (b). We now derive the upper bound for any t 1, as follows: U(xt, yt; zt; βt, µt) f(xt) + δ(xt) g(xt) + Axt yt, zt + βt 2 Axt yt 2 2 + hµt(yt) ① Fd(xt) h(Axt) + Axt yt, zt + βt 2 Axt yt 2 2 + hµt(yt) ② F d h(Axt) + Axt yt, zt + βt 2 Axt yt 2 2 + h(yt) ③ F d + Axt yt, zt h(yt) + βt 2 Axt yt 2 2 ④ F d + (8+8)z2 where step ①uses F(xt) f(xt)+δ(xt) g(xt)+h(Axt) d(xt) F; step ②uses d(xt) d, and hµ(y) h(y) for all y and µ; step ③uses Inequality (26) with y = Axt and y = yt; step ④uses Inequalities (27) and (28). D.5 PROOF OF LEMMA 5.9 Proof. 
Part (a). For any t 0, we have: βt = 1+ξ(t+1)p 1+ξtp ① 1+ξ(tp+1) 1+ξtp ② 1 + ξ, (29) where step ①uses the fact that (t + 1)p tp + 1p for all p (0, 1) and t 0; step ②uses the fact that a+ξ a 1 + ξ for all a 1 and ξ 0. Part (b). For all t 1, we derive the upper bound and lower bound for λt: λt U(xt,yt;zt;βt,µt) d(xt) F d+v/βt λt U(xt,yt;zt;βt,µt) d(xt) F d v/βt Part (c). For all t 0, we derive the upper bound and lower bound for αt+1: d(xt) U(xt,yt;zt;βt,µt) d F d v/β0 α. d(xt) U(xt,yt;zt;βt,µt) F d+v/β0 α. Published as a conference paper at ICLR 2025 Part (d). We first focus on FADMM-D with ℓ(βt) Lf +βt A 2 2 +λt Wd. For all t 1, we have: ℓ(βt) βt A 2 2. ℓ(βt) βt Lf β0 + βt A 2 2 + βt We now focus on FADMM-Q with ℓ(βt) Lf + βt A 2 2 + (2/αt)Wd. For all t 1, we have: ℓ(βt) βt A 2 2. ℓ(βt) βt Lf β0 + βt A 2 2 + βt E PROOFS FOR SECTION 5 E.1 PROOF OF LEMMA 5.10 Proof. We define S(x, y; z; β) f(x) + Ax y, z + β 2 Ax y 2 2. We define s(x) S(x, yt, zt; βt), where t is known from context. We define L(x, y; z; β, µ) 1 d(x) U(x, y; z; β, µ). We define U(x, y; z; β, µ) S(x, y; z; β) + δ(x) g(x) + hµ(y). We define ℓ(βt) Lf + βt A 2 2 + λt Wd, where λt = U(xt,yt;zt;βt,µt) We define εx (θ 1)ℓ/(2d) > 0. Initially, using the optimality condition of xt+1 arg minx Mt(x; xt, λt), we have: Mt(xt+1; xt, λt) Mt(xt; xt, λt). This leads to xt+1 xt, s(xt) g(xt) λt d(xt) + θℓ(βt) 2 xt+1 xt 2 2 + δ(xt+1) 0 + 0 + δ(xt). Rearranging terms yields: δ(xt+1) δ(xt) + θℓ(βt) 2 xt+1 xt 2 2 xt xt+1, s(xt) g(xt) λt d(xt) ① s(xt) s(xt+1) + Lf +βt A 2 2 2 xt+1 xt 2 2 + g(xt+1) g(xt) + λt(d(xt+1) d(xt)) + λt Wd 2 xt+1 xt 2 2 ②= s(xt) s(xt+1) + ℓ(βt) 2 xt+1 xt 2 2 + g(xt+1) g(xt) + λt(d(xt+1) d(xt)), (30) where step ①uses the facts that the function s(x) is (Lf + βt A 2 2)-smooth w.r.t. 
x, λt > 0, g(x) is convex, and d(x) is Wd-weakly convex, yielding the following inequalities: s(xt+1) s(xt) xt+1 xt, s(xt) + Lf +βt A 2 2 2 xt+1 xt 2 2, g(xt) g(xt+1) g(xt), xt xt+1 , λtd(xt) λtd(xt+1) λt d(xt), xt xt+1 + λt Wd 2 xt+1 xt 2 2; (31) step ②uses the definition of ℓ(βt) = Lf + βt A 2 2 + λt Wd. Published as a conference paper at ICLR 2025 We further derive: L(xt+1, yt; zt; βt, µt) L(xt, yt; zt; βt, µt) ①= 1 d(xt+1){hµt(yt) + δ(xt+1) + s(xt+1) g(xt+1)} λt ② 1 d(xt+1){hµt(yt) + δ(xt) g(xt) + ℓ(βt)(1 θ) 2 xt+1 xt 2 2 + s(xt) λtd(xt)} + λt λt ③= 1 d(xt+1){hµt(yt) + δ(xt) g(xt) + ℓ(βt)(1 θ) 2 xt+1 xt 2 2 {δ(xt) g(xt) + hµt(yt)}} = 1 d(xt+1) ℓ(βt)(1 θ) 2 xt+1 xt 2 2 ④ ℓ(βt) xt+1 xt 2 2 2d {1 θ} ⑤ εxβt xt+1 xt 2 2, where step ①uses the definition of L(xt, yt; zt; βt, µt) = λt; step ②uses Inequality (30); step ③ uses λtd(xt) s(xt) = δ(xt) g(xt) + hµt(yt); step ④uses d(x) d and θ > 1; step ⑤uses the definition of εx (θ 1)ℓ/(2d) > 0. E.2 PROOF OF LEMMA 5.11 Proof. We define L(x, y; z; β, µ) 1 d(x) {f(x) g(x) + hµ(y) + Ax y, z + β 2 Ax y 2 2}. We define Tt 12(1 + ξ)C2 h/(β0dt), and Tt C2 hµt/(2d). We define εz ξ/(2d), and εy {1 4(1 + ξ)/χ2}/(2d). First, we focus on a decrease for the function L(x, y; z; β, µ) w.r.t. y. We have: L(xt+1, yt+1; zt; βt, µt) L(xt+1, yt; zt; βt, µt) ①= 1 d(xt+1){ yt yt+1, zt + βt 2 Axt+1 yt+1 2 2 βt 2 Axt+1 yt 2 2 + hµt(yt+1) hµt(yt)} ②= 1 d(xt+1){ yt yt+1, zt + βt(Axt+1 yt+1) + hµt(yt+1) hµt(yt) βt 2 yt+1 yt 2 2} ③= 1 d(xt+1){ yt yt+1, zt+1 + hµt(yt+1) hµt(yt) βt 2 yt+1 yt 2 2} ④ 1 d(xt+1){ yt yt+1, zt+1 hµt(yt+1) βt 2 yt+1 yt 2 2} ⑤= βt yt+1 yt 2 2 2d(xt+1) , (32) where step ①uses the definition of L(x, y; z; β, µ); step ②uses the Pythagoras relation that 1 2 a c 2 2 1 2 b c 2 2 = 1 2 a b 2 2 + a c, a b ; step ③uses zt+1 zt = βt(Axt+1 yt+1); step ④uses the convexity of hµt( ); step ⑤uses the optimality for yt+1 as in Lemma 5.3(a) that: hµt(yt+1) = zt+1. 
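The envelope identities invoked in the steps above — the gradient formula ∇h_µ(y) = (y − Prox(y; h, µ))/µ from Lemma 3.10, and the bounds of Lemma 3.9 — can be checked numerically for the concrete choice h = ‖·‖_1, whose proximal operator is soft-thresholding. This is an illustrative sanity check, not part of the paper's algorithms.

```python
import numpy as np

def prox_l1(y, mu):
    # Proximal operator of h = ||.||_1: soft-thresholding at level mu.
    return np.sign(y) * np.maximum(np.abs(y) - mu, 0.0)

def h(y):
    return np.abs(y).sum()

def h_mu(y, mu):
    # Moreau envelope h_mu(y) = min_u { h(u) + ||u - y||_2^2 / (2*mu) },
    # attained at u = prox_l1(y, mu).
    u = prox_l1(y, mu)
    return h(u) + np.sum((u - y) ** 2) / (2.0 * mu)

rng = np.random.default_rng(0)
y = rng.normal(size=5)
mu = 0.3
Ch = np.sqrt(y.size)  # Lipschitz constant of ||.||_1 w.r.t. ||.||_2

# Gradient identity: grad h_mu(y) = (y - prox_l1(y, mu)) / mu,
# checked against central finite differences.
g = (y - prox_l1(y, mu)) / mu
eps = 1e-6
g_fd = np.array([(h_mu(y + eps * e, mu) - h_mu(y - eps * e, mu)) / (2 * eps)
                 for e in np.eye(y.size)])
assert np.allclose(g, g_fd, atol=1e-4)

# Sandwich bound (Lemma 3.9(b)): h_mu(y) <= h(y) <= h_mu(y) + (mu/2)*Ch^2.
assert h_mu(y, mu) <= h(y) <= h_mu(y, mu) + 0.5 * mu * Ch ** 2

# Monotonicity in mu (Lemma 3.9(e)): the envelope decreases as mu grows.
assert h_mu(y, 0.1) >= h_mu(y, 0.3)
```

All three assertions pass for any y, which mirrors the role these facts play in the convergence analysis.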
Published as a conference paper at ICLR 2025 Second, we focus on a decrease for the function L(x, y; z; β, µ) w.r.t. {z, β}. We have: ξ 2d 1 β zt+1 zt 2 2 + L(xt+1, yt+1; zt+1; βt+1, µt) L(xt+1, yt+1; zt; βt, µt) ① ξ 2d(xt+1) 1 βt zt+1 zt 2 2 + {L(xt+1, yt+1; zt+1; βt, µt) L(xt+1, yt+1; zt; βt, µt)} + {L(xt+1, yt+1; zt+1; βt+1, µt) L(xt+1, yt+1; zt+1; βt, µt)} ②= 1 d(xt+1) { ξ 2βt zt+1 zt 2 2 + Axt+1 yt+1, zt+1 zt + (βt+1 βt) Axt+1 yt+1 2 2 2 } ③ 1 d(xt+1) { ξ 2βt zt+1 zt 2 2 + zt+1 zt 2 2 βt + (βt(1+ξ) βt) zt+1 zt 2 2 2(βt)2 } = 1 d(xt+1)( ξ βt zt+1 zt 2 2 ④ 1 d(xt+1) 1+ξ βt { 2(βt)2 χ2 yt+1 yt 2 2 + 12C2 h( 1 ⑥ { 1 d(xt+1) 2(1+ξ) χ2 βt yt+1 yt 2 2} + 12(1+ξ)C2 h β0d ( 1 t 1 t+1), (33) where step ①uses d(xt+1) d; step ②uses the definition of L(x, y; z; β, µ); step ③uses βt(Axt+1 yt+1) = zt+1 zt and βt+1 βt(1 + ξ); step ④uses Lemma 5.4(b); step ⑤ uses Lemma 5.4; step ⑥uses d(xt+1) d, and βt β0. Third, we focus on a decrease for the function L(x, y; z; β, µ) w.r.t. µ. We have: L(xt+1, yt+1; zt+1; βt+1, µt+1) L(xt+1, yt+1; zt+1; βt+1, µt) = 1 d(xt+1){hµt+1(yt+1) hµt(yt+1)} ① 1 2d(xt+1)(µt µt+1) C2 h ② 1 2d(µt µt+1) C2 h, (34) where step ①uses Lemma 3.9(e); step ②uses d(xt) d. Adding Inequalities (32), (33), and (34), we have: L(xt+1, yt+1; zt+1; βt+1, µt+1) L(xt+1, yt; zt; βt, µt) + Tt+1 + Ut+1 Tt Ut εzβt Axt+1 yt+1 2 2 1 2d(xt+1) βt yt+1 yt 2 2 {1 4(1+ξ) ① εzβt Axt+1 yt+1 2 2 βt yt+1 yt 2 2 εy, where step ①uses the definition of εy {1 4(1 + ξ)/χ2}/(2d). E.3 PROOF OF LEMMA 5.12 Proof. We define Lt L(xt, yt; zt; βt, µt), and Tt = 12(1 + ξ)C2 h/(β0dt), and Ut = C2 hµt/(2d). We define Pt Lt + Tt + Ut. Part (a). We derive the following inequalities: Pt Lt + Tt + Ut ① L(xt, yt; zt; βt, µt) + 0 ② U(xt,yt;zt;βt,µt) d ④ F d v/β0 Published as a conference paper at ICLR 2025 where step ①uses Tt 0 and Ut 0; step ②uses the definition of L(xt, yt; zt; βt, µt); step ③ uses d(xt) d, and the lower bound of U(xt, yt; zt; βt, µt) in Lemma 5.6; step ④uses βt β0. Part (b). 
Combing Lemmas (5.10) and (5.11) together, we have: εyβt yt+1 yt 2 2 + εxβt xt+1 xt 2 2 + εzβt Axt+1 yt+1 2 2 Lt Lt + Tt Tt+1 + Ut Ut+1 E.4 PROOF OF THEOREM 5.13 Proof. We define c0 1 min(εx,εy,εz) {P1 P}. We define Et βt{ xt+1 xt 2 2 + yt+1 yt 2 2 + Axt+1 yt+1 2 2}. We define Et + βt{ xt+1 xt + yt+1 yt + Axt+1 yt+1 }. Using Lemma 5.12, we have: 0 min(εx, εy, εz)Et + Pt Pt+1. (35) Telescoping Inequality (35) over t from 1 to T, we have: 0 1 min(εx,εy,εz) PT t=1{Pt Pt+1} PT t=1 Et = 1 min(εx,εy,εz) {P1 PT +1} PT t=1 Et ① 1 min(εx,εy,εz) {P1 P} PT t=1 Et ② c0 1 βT PT t=1 βt(xt+1 xt) 2 2 + βt(yt+1 yt) 2 2 + βt(Axt+1 yt+1) 2 2 ③ c0 1 βT 1 3T {PT t=1 βt(xt+1 xt) + βt(yt+1 yt) + βt(Axt+1 yt+1) }2 ④ c0 1 βT 3T {PT t=1 Et +}2 (36) where step ①uses Pt P for all t; step ②uses the definition of c0, the definition of Et, and the H older s inequality that a, b a 1 b ; step ③uses the fact that a 2 2 1 n a 2 1; step ④uses the definition of Et +. We further obtain from Inequality (43) that PT t=1 Et + (3W)1/2(βT T)1/2 ① 1 T PT t=1 Et + O(T (p 1)/2), where step ①βt = β0(1 + ξtp) = O(tp). E.5 PROOF OF LEMMA 5.15 Proof. We define S(x, y; z; β) f(x) + Ax y, z + β 2 Ax y 2 2. We define s(x) S(x, yt, zt; βt), where t is known from context. We define K(α, x, y, z; β, µ) = 2α p d(x) + α2U(x, y; z; β, µ). We define U(x, y; z; β, µ) S(x, y; z; β) + δ(x) g(x) + hµ(y). We define ℓ(βt) Lf + βt A 2 2 + 2 αt+1 Wd, where αt+1 = p d(xt)/U(xt, yt; zt; βt, µt). We define εx 1 Initially, using the optimality condition of xt+1 arg minx Mt(x; xt, αt+1), we have Mt(xt+1; xt, αt+1) Mt(xt; xt, αt+1). This results in xt+1 xt, s(xt) g(xt) Published as a conference paper at ICLR 2025 d(xt) + θℓ(βt) 2 xt+1 xt 2 2 + δ(xt+1) 0 + 0 + δ(xt). 
Rearranging terms yields: δ(xt+1) δ(xt) + θℓ(βt) 2 xt+1 xt 2 2 xt xt+1, s(xt) g(xt) 2 αt+1 p ① s(xt) s(xt+1) + Lf +βt A 2 2 2 xt+1 xt 2 2 + g(xt+1) g(xt) + 2 αt+1 { p d(xt+1) + Wd 2 xt+1 xt 2 2} ② s(xt) s(xt+1) + ℓ(βt) 2 xt+1 xt 2 2 + g(xt+1) g(xt) 2 αt+1 [ p d(xt+1)], (37) where step ①uses the facts that the function s(x) is (Lf + βt A 2 2)-smooth w.r.t. x, αt+1 > 0, g(x) is convex, and p d(x) is Wd-weakly convex, yielding the following inequalities: s(xt+1) s(xt) + xt+1 xt, xs(xt) + Lf +βt A 2 2 2 xt+1 xt 2 2 g(xt) g(xt+1) + xt xt+1, g(xt) d(xt+1)} 2 αt+1 { p d(xt), xt xt+1 + Wd 2 xt+1 xt 2 2}; (38) step ②uses the definition of ℓ(βt) = Lf + βt A 2 2 + 2 λt+1 Wd. We further derive: K(αt+1, xt+1, yt; zt; βt, µt) K(αt, xt, yt; zt; βt, µt) ① K(αt+1, xt+1, yt; zt; βt, µt) K(αt+1, xt, yt; zt; βt, µt) ②= (αt+1)2{s(xt+1) s(xt) + δ(xt+1) δ(xt) g(xt+1) + g(xt) 2 αt+1 [ p ③ (αt+1)2 (1 θ) 2 ℓ(βt) xt+1 xt 2 2 2α2ℓ(1 θ) | {z } βt xt+1 xt 2 2, where step ①uses the fact that αt+1 = arg minα K(α, xt, yt; zt; βt, µt), which implies the inequality K(αt+1, xt, yt; zt; βt, µt) K(αt, xt, yt; zt; βt, µt); step ②uses the definition of the function K(α, x, y; z; β, µ); step ③uses Inequality (37); step ④uses 1 θ < 0, αt α, and ℓ(βt) βtℓ. E.6 PROOF OF LEMMA 5.16 Proof. We define K(α, x, y; z; β, µ) = 2α p d(x) + α2{f(x) + δ(x) g(x) + hµ(y) + Ax y, z + β 2 Ax y 2 2)}. We define Tt 12α2(1 + ξ)C2 h/(β0t), and Ut 1 We define εz 1 2α2ξ, and εy 1 2α2{1 4(1 + ξ)/(χ2)}. 
First, we focus on the sufficient decrease for variables yt+1, we have: K(αt+1, xt+1, yt+1; zt; βt, µt) K(αt+1, xt+1, yt; zt; βt, µt) ①= (αt+1)2{ βt 2 Axt+1 yt+1 2 2 βt 2 Axt+1 yt 2 2 + hµt(yt+1) hµt(yt) + zt, yt yt+1 } ②= (αt+1)2{ βt 2 Axt+1 yt+1 2 2 βt 2 Axt+1 yt 2 2 + yt+1 yt, hµt(yt+1) zt } ③= (αt+1)2{ βt 2 yt+1 yt 2 2 + yt+1 yt, hµt(yt+1) zt βt(Axt+1 yt+1) } ④= (αt+1)2{ βt 2 yt+1 yt 2 2 + yt+1 yt, hµt(yt+1) zt (zt+1 zt) } ⑤= βt(αt+1)2 1 2 yt+1 yt 2 2, (39) Published as a conference paper at ICLR 2025 where step ①uses the definition of K(α, x, y; z; β, µ); step ②uses the convexity of hµt( ) that hµ(y ) hµ(y) y y, hµ(y ) for all y, y Rm, and µ > 0; step ③uses the Pythagoras relation that 1 2 a c 2 2 1 2 b c 2 2 = 1 2 a b 2 2 + a c, a b ; step ④uses the optimality for yt+1 that: hµt(yt+1) = zt + βt(Axt+1 yt+1). Second, we focus on the sufficient decrease for variables {z, β}. We have: 1 2ξα2βt Axt+1 yt+1 2 2 + K(αt+1, xt+1, yt+1; zt+1; βt+1, µt) K(αt+1, xt+1, yt+1; zt; βt, µt) ① (αt+1)2 ξ zt+1 zt 2 2 2βt + {K(αt+1, xt+1, yt+1; zt+1; βt, µt) K(αt+1, xt+1, yt+1; zt; βt, µt)} + {K(αt+1, xt+1; yt+1, zt+1; βt+1, µt) K(αt+1, xt+1, yt+1, zt+1; βt, µt)} ②= (αt+1)2 { ξ zt+1 zt 2 2 2βt + Axt+1 yt+1, zt+1 zt + βt+1 βt 2 Axt+1 yt+1 2 2} ③ (αt+1)2 { ξ βt + βt(1+ξ) βt 2(βt)2 } zt+1 zt 2 2 = (αt+1)2 {(1 + ξ) 1 βt } zt+1 zt 2 2 ④ (αt+1)2(1 + ξ) 1 βt { hµt(yt+1) hµt 1(yt) 2 2} ⑤ (αt+1)2(1 + ξ) 1 βt { 2(βt)2 χ2 yt+1 yt 2 2 + 12C2 h( 1 ⑥ {(αt+1)2(1 + ξ) 2 χ2 βt yt+1 yt 2 2} + 12(1+ξ)C2 hα2 t 1 t+1), (40) where step ①uses α αt; step ②uses the definition of K(α, x, y; z; β, µ); step ③uses βt(Axt+1 yt+1) = zt+1 zt and βt+1 βt(1 + ξ); step ④uses Lemma 5.4(b); step ⑤uses Lemma 5.4; step ⑥uses αt α for all t 1. Third, we focus on the sufficient decrease for variable µ. 
We have: K(αt+1, xt+1, yt+1; zt+1; βt+1, µt+1) K(αt+1, xt+1, yt+1; zt+1; βt+1, µt) = (αt+1)2 (hµt+1(yt+1) hµt(yt+1)) 2C2 h(αt+1)2 (µt µt+1) 2C2 hα2(µt µt+1) = Ut Ut+1 (41) where step ①uses Lemma 3.9(e); step ②uses αt α for all t 1. Adding Inequalities (39), (40), and (41), we have: K(λt+1, xt+1, yt+1; zt+1; βt+1, µt+1) K(λt+1, xt+1, yt; zt; βt, µt) Tt + Ut Tt+1 Ut+1 (αt+1)2βt yt+1 yt 2 2 1 2 {1 4(1 + ξ)/(χ2)} ① Tt + Ut Tt+1 Ut+1 βt yt+1 yt 2 2 εy, where step ①uses αt+1 α, and the definition of εy 1 2α2{1 4(1 + ξ)/(χ2)}. E.7 PROOF OF LEMMA 5.17 Proof. We define Pt Kt + Tt + Ut. We define Kt K(αt, xt, yt; zt; βt, µt), and Tt = 12α2(1 + ξ)C2 h/(β0dt), and Ut = 1 Published as a conference paper at ICLR 2025 Part (a). We derive the following inequalities: Pt Kt + Tt + Ut ① K(αt, xt, yt; zt; βt, µt) + 0 d(xt) + (αt)2U(xt, yt; zt; βt, µt) d + α2 {Fd v/βt} d + α2 {Fd v/β0} P, where step ①uses Tt 0 and Ut 0; step ②uses the definition of K(αt, xt, yt; zt; βt, µt); step ③uses d(xt) d, αt α, and the lower bound of U(xt, yt; zt; βt, µt) in Lemma 5.6; step ④uses βt β0. Part (b). Combing Lemmas (5.15) and (5.16) together, we have: εyβt yt+1 yt 2 2 + εxβt xt+1 xt 2 2 + εzβt Axt+1 yt+1 2 2 Kt Kt + Tt Tt+1 + Ut Ut+1 E.8 PROOF OF THEOREM 5.18 Proof. We define c0 1 min(εx,εy,εz) {P1 P}. We define Et βt{ xt+1 xt 2 2 + yt+1 yt 2 2 + Axt+1 yt+1 2 2}. We define Et + βt{ xt+1 xt + yt+1 yt + Axt+1 yt+1 }. Using Lemma 5.17, we have: 0 min(εx, εy, εz)Et + Pt Pt+1. 
(42) Telescoping Inequality (42) over t from 1 to T, we have: 0 1 min(εx,εy,εz) PT t=1{Pt Pt+1} PT t=1 Et = 1 min(εx,εy,εz) {P1 PT +1} PT t=1 Et ① 1 min(εx,εy,εz) {P1 P} PT t=1 Et ② c0 1 βT PT t=1 βt(xt+1 xt) 2 2 + βt(yt+1 yt) 2 2 + βt(Axt+1 yt+1) 2 2 ③ c0 1 βT 1 3T {PT t=1 βt(xt+1 xt) + βt(yt+1 yt) + βt(Axt+1 yt+1) }2 ④ c0 1 βT 3T {PT t=1 Et +}2 (43) where step ①uses Pt P for all t; step ②uses the definition of c0, the definition of Et, and the H older s inequality that a, b a 1 b ; step ③uses the fact that a 2 2 1 n a 2 1; step ④uses the definition of Et +. We further obtain from Inequality (43) that PT t=1 Et + (3W)1/2(βT T)1/2 ① 1 T PT t=1 Et + O(T (p 1)/2), where step ①βt = β0(1 + ξtp) = O(tp). Published as a conference paper at ICLR 2025 F PROOFS FOR SECTION 6 F.1 PROOF OF THEOREM 6.3 Proof. We let t 1. We define Crit(x+, x, y+, y, z+, z) x+ x + y+ y + z+ z + Ax+ y+ + h(y+) z+ 2 2 + δ(x+) + f(x+) g(x) + ATz+ φ(x, y) d(x) , where φ(x, y) = {f(x) + δ(x) g(x) + h(y)}/d(x). We define Γt 1 δ(xt+1) + f(xt+1) + ATzt+1 g(xt) φ(xt, yt) d(xt) . We define Γt 2 zt+1 zt + h( yt+1) zt+1 + Axt+1 yt+1 . We define Γt 3 yt+1 yt+1 + xt+1 xt + yt+1 yt . Part (a). We now focus on FADMM-D. We define λt = {f(xt) + δ(xt) g(xt) + Axt yt, zt + 1 2βt Axt yt 2 2 + hµt(yt)}/d(xt). First, we focus on the optimality condition of the x-subproblem. We have: δ(xt+1) + g(xt) + λt d(xt) θℓ(βt)(xt+1 xt) + x St(xt, yt; zt; βt) = θℓ(βt)(xt+1 xt) + f(xt) + ATzt + βt AT(Axt yt). 
(44) Second, we derive the following inequalities: d(xt) {λt φ(xt, yt)} ① Cd | f(xt) g(xt)+ Axt yt,zt + 1 2 βt Axt yt 2 2+hµt(yt) d(xt) f(xt) g(xt)+h(yt) d | Axt yt, zt + 1 2βt Axt yt 2 2 + hµt(yt) h(yt)| d { z βt 1 zt zt 1 + βt 2(βt 1)2 zt zt 1 2 2 + 1 2(βt 1)2 (2z)2 + 1 2βt C2 hχ} where step ①uses the fact that d(x) is Cd-Lipschitz continuous, and the definitions of λt and φ(xt, yt); step ②uses d(xt) d; step ③uses 0 < h(y) hµ(y) µ 2 C2 h (refer to Lemma 3.9(b)), the Cauchy-Schwarz Inequality, and the fact that Axt yt = 1 βt 1 (zt zt 1); step ④uses zt zt 1 zt + zt 1 2z; step ⑤uses βt 1 βt (1 + ξ)βt 1. Third, we derive the following results: Γt 1 δ(xt+1) + f(xt+1) + ATzt+1 g(xt) φ(xt, yt) d(xt) ① λt d(xt) φ(xt, yt) d(xt) + f(xt+1) f(xt) + AT(zt+1 zt) {θℓ(βt)(xt+1 xt) + βt AT(Axt yt)} ② λt d(xt) φ(xt, yt) d(xt) + AT(zt+1 zt) + f(xt+1) f(xt) + θℓ(βt) xt+1 xt + βt AT(Axt yt) βt ) + O(βt Axt+1 yt+1 ) + O(βt xt+1 xt ) + O(βt 1 Axt yt ), (46) where step ①uses Equalities (44); step ②uses the triangle inequality; step ③uses Inequality (45), βt 1 βt (1 + ξ)βt 1, and zt+1 zt = βt(Axt+1 yt+1). Published as a conference paper at ICLR 2025 Fourth, we have the following inequalities: Γt 2 zt+1 zt + h( yt+1) zt+1 + Axt+1 yt+1 ① O(βt Axt+1 yt+1 ), (47) where step ①uses zt+1 h( yt+1) (refer to Lemma 5.3), and zt+1 zt = βt(Axt+1 yt+1). Fifth, we have the following results: Γt 3 yt+1 yt+1 + xt+1 xt + yt+1 yt ① yt+1 yt+1 + xt+1 xt + yt+1 yt+1 + yt+1 yt ② µt Ch + xt+1 xt + µt Ch + yt+1 yt βt ) + O(βt xt+1 xt ) + O(βt yt+1 yt ), (48) where step ①uses the triangle inequality; step ②uses Lemma (3.10)(c); step ③uses µt = χ βt = βt ), and 1 βt β0 = O(βt). Part (b). We now focus on FADMM-Q. We define U(αt, xt, yt; zt; βt, µt) f(xt)+δ(xt) g(xt)+hµt(yt)+ Axt yt, zt + βt 2 Axt yt 2 2, and αt+1 p d(xt)/U(xt, yt; zt; βt, µt). First, we focus on the optimality condition of the x-subproblem. 
We have: δ(xt+1) g(xt) 2 αt+1 p θℓ(βt)(xt+1 xt) x S(xt, yt; zt; βt) = θℓ(βt)(xt+1 xt) f(xt) ATzt βt AT(Axt yt). (49) Second, we derive the following inequalities: d(xt) φ(xt, yt) d(xt) ①= 2 αt+1 1 2d(xt) 1/2 d(xt) φ(xt, yt) d(xt) ②= U(xt,yt;zt;βt,µt) d(xt) d(xt) φ(xt, yt) d(xt) ③ Cd| U(xt,yt;zt;βt,µt) d(xt) φ(xt, yt)| d |U(xt, yt; zt; βt, µt) {f(xt) + δ(xt) g(xt) + h(yt)}| d {| Axt yt, zt | + βt 2 Axt yt 2 2 + |hµt(yt) h(yt)|} d { z βt 1 zt zt 1 + βt 2(βt 1)2 zt zt 1 2 2 + µt 2(βt 1)2 (2z)2 + χ 2βt C2 h} where step ①uses p d(xt) = 1 2d(xt) 1/2 d(xt); step ②uses the fact that αt+1 = p d(xt)/U(xt, yt; zt; βt); step ③uses the fact that d(x) is Cd-Lipschitz continuous; step ④uses d(xt) d; step ⑤uses the definitions of U(xt, yt; zt; βt, µt); step ⑥uses 0 < h(y) hµ(y) µ 2 C2 h (refer to Lemma 3.9(b)), the Cauchy-Schwarz Inequality, and the fact that Axt yt = 1 βt 1 (zt zt 1); step ⑦uses zt zt 1 zt + zt 1 2z and µt = χ/βt; step ⑧uses βt 1 βt (1 + ξ)βt 1. Published as a conference paper at ICLR 2025 Third, we derive the following results: Γt 1 δ(xt+1) + f(xt+1) + ATzt+1 g(xt) φ(xt, yt) d(xt) d(xt) φ(xt, yt) d(xt) + f(xt+1) f(xt) + AT(zt+1 zt) {θℓ(βt)(xt+1 xt) + βt AT(Axt yt)} d(xt) φ(xt, yt) d(xt) + AT(zt+1 zt) + f(xt+1) f(xt) + θℓ(βt) xt+1 xt + βt AT(Axt yt) βt ) + O(βt Axt+1 yt+1 ) + O(βt xt+1 xt ) + O(βt 1 Axt yt ), (51) where step ①uses Equalities (49); step ②uses the triangle inequality; step ③uses Inequality (50), βt 1 βt (1 + ξ)βt 1, and zt+1 zt = βt(Axt+1 yt+1). Fourth, we have the following inequalities: Γt 2 zt+1 zt + h( yt+1) zt+1 + Axt+1 yt+1 ① O(βt Axt+1 yt+1 ), (52) where step ①uses zt+1 h( yt+1) (refer to Lemma 5.3), and zt+1 zt = βt(Axt+1 yt+1). Fifth, we have the following results: Γt 3 yt+1 yt+1 + xt+1 xt + yt+1 yt ① yt+1 yt+1 + xt+1 xt + yt+1 yt+1 + yt+1 yt ② µt Ch + xt+1 xt + µt Ch + yt+1 yt βt ) + O(βt xt+1 xt ) + O(βt yt+1 yt ), (53) where step ①uses the triangle inequality; step ②uses Lemma (3.10)(c); step ③uses µt = χ βt = βt ), and 1 βt β0 = O(βt). 
Part (c). Finally, we continue our analysis for both FADMM-D and FADMM-Q, deriving the following inequalities:

(1/T) Σ_{t=1}^T Crit(x^{t+1}, x^t, y^{t+1}, y^t, z^{t+1}, z^t) ① ≤ (1/T) Σ_{t=1}^T {Γ_1^t + Γ_2^t + Γ_3^t} ② ≤ (1/T) Σ_{t=1}^T {O(β^t ‖Ax^{t+1} − y^{t+1}‖) + O(β^t ‖x^{t+1} − x^t‖) + O(β^{t−1} ‖Ax^t − y^t‖) + O(β^t ‖y^{t+1} − y^t‖) + O(1/β^t)} ③ = O(T^{(p−1)/2}) + (1/T) Σ_{t=1}^T O(1/β^t) ④ = O(T^{(p−1)/2}) + O((1/T) · T^{1−p}), (54)

where step ① uses the definition of Crit(x^+, x, y^+, y, z^+, z) and the triangle inequality ‖Ax^{t+1} − ỹ^{t+1}‖ ≤ ‖Ax^{t+1} − y^{t+1}‖ + ‖y^{t+1} − ỹ^{t+1}‖; step ② uses Inequalities (46), (47), and (48) for FADMM-D and Inequalities (51), (52), and (53) for FADMM-Q; step ③ uses Theorem 5.13 for FADMM-D and Theorem 5.18 for FADMM-Q, which give (1/T) Σ_{t=1}^T {β^t ‖x^{t+1} − x^t‖ + β^t ‖y^{t+1} − y^t‖ + β^t ‖Ax^{t+1} − y^{t+1}‖} ≤ O(T^{(p−1)/2}); step ④ uses Σ_{t=1}^T 1/β^t = O(Σ_{t=1}^T 1/t^p) = O(T^{1−p}), as presented in Lemma A.5.

We define W^t ≜ {x^{t+1}, x^t, y^{t+1}, y^t, z^{t+1}, z^t}. With the choice p = 1/3, we have from Inequality (54) that (1/T) Σ_{t=1}^T Crit(W^t) ≤ O(T^{−1/3}). In other words, there exists 1 ≤ t̄ ≤ T such that Crit(W^{t̄}) ≤ ϵ, provided that T ≥ O(1/ϵ^3).

G COMPUTING PROXIMAL OPERATORS

In this section, we demonstrate how to compute the proximal operator for the various functions involved in this paper. The proximal operator is defined as follows:

min_{x ∈ R^r} p(x) + (1/(2µ)) ‖x − x′‖_2^2. (55)

Here, x′ ∈ R^r and µ > 0 are given.

G.1 ORTHOGONALITY CONSTRAINT

When p(x) = ι_Ω(mat(x)) with Ω being the set of orthogonality constraints, Problem (55) simplifies to the following nonconvex optimization problem: x̄ = arg min_x (1/(2µ)) ‖x − x′‖_2^2, s.t. mat(x) ∈ Ω ≜ {V | V^T V = I}. This is the nearest orthogonal matrix problem, whose optimal solution is given by x̄ = vec(Û V̂^T), where mat(x′) = Û Diag(s) V̂^T is the singular value decomposition of the matrix mat(x′). See (Lai & Osher, 2014) for reference.

G.2 GENERALIZED ℓ1 NORM

When p(x) = ρ_2 ‖x‖_1 + ι_Ω(x) with Ω ≜ {x | ‖x‖_∞ ≤ ρ_0}, Problem (55) simplifies to the following strongly convex problem: x̄ = arg min_{x ∈ R^r} ρ_2 ‖x‖_1 + (1/(2µ)) ‖x − x′‖_2^2, s.t. ‖x‖_∞ ≤ ρ_0.
This problem can be decomposed into r independent one-dimensional subproblems x̄_i = arg min_x q_i(x) ≜ (1/(2µ))(x − x′_i)^2 + ρ_2 |x|, s.t. −ρ_0 ≤ x ≤ ρ_0. (56)

We define P_{[l,u]}(x) ≜ max(l, min(u, x)). We consider five candidate points for x̄_i. (i) x_1 = 0. (ii) x_2 = ρ_0. (iii) x_3 = −ρ_0. (iv) x_4 > 0 and x_4 < ρ_0. By omitting the bound constraints, the first-order optimality condition gives (1/µ)(x_4 − x′_i) + ρ_2 = 0, leading to x_4 = x′_i − µρ_2. When the bound constraints are included, we have x_4 = P_{[0,ρ_0]}(x′_i − µρ_2). (v) x_5 < 0 and x_5 > −ρ_0. By dropping the bound constraints, the first-order optimality condition yields (1/µ)(x_5 − x′_i) − ρ_2 = 0, leading to x_5 = x′_i + µρ_2. When the bound constraints are considered, we have x_5 = P_{[−ρ_0,0]}(x′_i + µρ_2). Therefore, the one-dimensional subproblem in Problem (56) contains five critical points, and the optimal solution can be computed as: x̄_i = arg min_x q_i(x), s.t. x ∈ {x_1, x_2, x_3, x_4, x_5}.

G.3 SIMPLEX CONSTRAINT

When p(x) = ι_Ω(x) with Ω ≜ {x | x ≥ 0, x^T 1 = 1}, Problem (55) simplifies to the following strongly convex problem: x̄ = arg min_x (1/(2µ)) ‖x − x′‖_2^2, s.t. x ≥ 0, x^T 1 = 1. This problem is referred to as the Euclidean projection onto the probability simplex. It can be solved exactly in O(n log(n)) time (Duchi et al., 2008).

G.4 GENERALIZED MAX FUNCTION

When p(x) = max(0, max(x + b)) with b ∈ R^r, Problem (55) simplifies to the following strongly convex problem: x̄ = arg min_x (1/(2µ)) ‖x − x′‖_2^2 + max(0, max(x + b)). Using the variable substitution x + b = v, we have the following equivalent problem:

v̄ = arg min_v q(v) ≜ (1/(2µ)) ‖v − v′‖_2^2 + max(0, max(v)), (57)

where v′ ≜ x′ + b.

In what follows, we address Problem (57) by considering two cases for v′. (i) max(v′) ≤ 0. The optimal solution is v̄ = v′, and it holds that q(v̄) = 0. (ii) max(v′) > 0. In this case, there exists an index i ∈ [r] such that v′_i > 0. It is not difficult to verify that the optimal solution v̄ satisfies max(v̄) ≥ 0. Problem (57) thus reduces to: v̄ = arg min_v (1/(2µ)) ‖v − v′‖_2^2 + max(v). (58)
This problem can be equivalently reformulated as: min_{v,τ} (1/(2µ)) ‖v − v′‖_2^2 + τ, s.t. v ≤ τ1, whose dual problem is given by:

z̄ = arg max_z −(µ/2) ‖z‖_2^2 + ⟨z, v′⟩, s.t. z ≥ 0, z^T 1 = 1. (59)

The unique optimal solution z̄ of the dual Problem (59) can be computed in O(n log(n)) time (Duchi et al., 2008). Finally, the optimal solution v̄ of Problem (58) can be recovered as v̄ = v′ − µ z̄.

H IMPLEMENTATION OF THE FULL SPLITTING ALGORITHM (FSA)

This section details the implementation of the Full Splitting Algorithm (FSA) (Bot et al., 2023b) for solving Problem (1), as summarized in Algorithm 2. For simplicity, we set β = 1 and use a constant step size γ^t = γ for all t, where γ ∈ {10^{−3}, 10^{−4}}.

Algorithm 2: FSA: Bot et al.'s Full Splitting Algorithm for Solving Problem (1).
(S0) Initialize {x^0, z^0, u^0}.
(S1) Choose suitable β ∈ (0, 2) and {γ^t}_{t=0}^T.
(S2) Set α^t = 1/γ^t for all t.
for t from 0 to T do
(S3) Let g^t ≜ ∇f(x^t) + A^T z^t − θ^t ∂d(x^t).
(S4) x^{t+1} = arg min_x δ(x) + (α^t/2) ‖x − (u^t − g^t/α^t)‖_2^2.
(S5) u^{t+1} = (1 − β) u^t + β x^{t+1}.
(S6) z^{t+1} = Prox(A x^{t+1} …).
(S7) θ^{t+1} = L(x^t, z^t, u^t, α^t, γ^t)/d(x^t), where L(x, z, u, α, γ) ≜ f(x) + ⟨z, Ax⟩ − h*(z) + δ(x) + (α/2) ‖x − u‖_2^2 − (γ/2) ‖z‖_2^2.
end

I ADDITIONAL EXPERIMENTAL DETAILS AND RESULTS

In this section, we offer further experimental details on the datasets used in the experiments, and include additional results.

Datasets. (i) For the sparse FDA, robust SRM, and robust sparse recovery problems, we use several datasets in our experiments, including randomly generated data and publicly available real-world data. These datasets serve as our data matrices Q ∈ R^{m̄×d̄} and label vectors p ∈ R^{m̄}. The dataset names are as follows: madelon-m̄-d̄, TDT2-1-2-m̄-d̄, TDT2-3-4-m̄-d̄, mnist-m̄-d̄, mushroom-m̄-d̄, gisette-m̄-d̄, and randn-m̄-d̄. Here, randn(m̄, d̄) denotes a function that generates a standard Gaussian random matrix of size m̄ × d̄, and TDT2-i-j refers to the subset of the original TDT2 dataset consisting of data points with labels i and j.
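The three proximal operators derived in Appendix G above — the box-constrained generalized ℓ1 norm of G.2, the simplex projection of G.3, and the generalized max function of G.4 — can be sketched in a few lines. This is a minimal illustration of the case analyses above, not the authors' implementation; the function names are ours.

```python
import numpy as np

def prox_l1_box(xp, mu, rho2, rho0):
    # G.2: argmin_x rho2*||x||_1 + ||x - xp||_2^2/(2*mu)  s.t. |x_i| <= rho0.
    # The problem is separable; clipping the soft-thresholded point to
    # [-rho0, rho0] selects the best of the five candidates enumerated above.
    return np.clip(np.sign(xp) * np.maximum(np.abs(xp) - mu * rho2, 0.0),
                   -rho0, rho0)

def proj_simplex(v):
    # G.3: Euclidean projection onto {z : z >= 0, z^T 1 = 1}, via the
    # O(n log n) sorting scheme of Duchi et al. (2008).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u * np.arange(1, v.size + 1) > css - 1.0)[0][-1]
    theta = (css[idx] - 1.0) / (idx + 1.0)
    return np.maximum(v - theta, 0.0)

def prox_max(xp, mu, b):
    # G.4: argmin_x ||x - xp||_2^2/(2*mu) + max(0, max(x + b)).
    vp = xp + b
    if vp.max() <= 0:                 # case (i): the max term vanishes
        return xp
    zbar = proj_simplex(vp / mu)      # dual solution of Problem (59)
    return vp - mu * zbar - b         # vbar = vp - mu*zbar, undo the shift
```

For example, `prox_l1_box(np.array([3.0, -2.0, 0.1]), 1.0, 0.5, 1.0)` soft-thresholds to [2.5, -1.5, 0.0] and clips to [1.0, -1.0, 0.0], matching the five-point enumeration of G.2.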
The matrix Q ∈ R^{m̄×d̄} is constructed by randomly selecting m̄ examples and d̄ dimensions from the original real-world dataset (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). We normalize each column of D to have unit norm. As randn-m̄-d̄ does not have labels, we randomly and uniformly assign binary labels p ∈ {−1, +1}^{m̄}. (ii) For sparse FDA as in Problem (2), we let D ≜ (µ^{(1)} − µ^{(2)})(µ^{(1)} − µ^{(2)})^T and C = Σ^{(1)} + Σ^{(2)}, where µ^{(i)} ∈ R^n and Σ^{(i)} ∈ R^{n×n} represent the mean vector and covariance matrix of class i (i = 1 or 2), respectively, generated from {Q, p}. We normalize the matrices C and D as C = C/‖C‖_F and D = D/‖D‖_F.

Figure 3: Experimental results on sparse FDA on different datasets with ρ = 100. Panels (a)–(h): madelon-2000-500, TDT2-1-2-1000-1000, TDT2-3-4-1000-1200, mnist-2000-768, mushroom-2000-112, gisette-800-1000, randn-300-1000, randn-300-1500; each panel compares SPM, SPGM-D, SPGM-Q, FSA-I, FSA-II, FADMM-D, and FADMM-Q over time (seconds).
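The sparse-FDA data construction described above — the between-class matrix D built from the class means, C from the class covariances, and both normalized by their Frobenius norms — can be sketched as follows. This is an illustrative reading of the setup; the variable names are ours.

```python
import numpy as np

def build_fda_matrices(Q, p):
    # Q: (m, d) data matrix; p: (m,) labels in {-1, +1}.
    X1, X2 = Q[p == 1], Q[p == -1]
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    diff = (mu1 - mu2)[:, None]
    D = diff @ diff.T                                        # (mu1-mu2)(mu1-mu2)^T
    C = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)  # Sigma1 + Sigma2
    # Normalize by the Frobenius norm, as in the text.
    return C / np.linalg.norm(C, 'fro'), D / np.linalg.norm(D, 'fro')
```

After normalization both matrices have unit Frobenius norm, and D is symmetric with rank one.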
Figure 4: Experimental results on sparse FDA on different datasets with ρ = 10000. Panels (a)–(h) cover the same datasets and methods as Figure 3.

(iii) For robust SRM as in Problem (14), we let D = Q and b = p. Following (Bot et al., 2023b), we generate p matrices {C^{(i)}}_{i=1}^p, where each C^{(i)} ∈ R^{n×n} is constructed as C^{(i)} = (1/n) Y Y^T, with Y = randn(n, n) · 10. We let p = 100. (iv) For robust sparse recovery as in Problem (15), we simply let A = Q and b = p, where p ∈ {−1, +1}^{m̄} represents the data labels.

Additional Experimental Results on Sparse FDA. We present additional experimental results on sparse FDA for ρ ∈ {100, 10000} in Figures 3 and 4. These results reinforce the conclusions presented in the main paper.

Experimental Results on Robust SRM. We consider solving Problem (14) using the proposed methods. For all methods, we set β_0 = 0.001. The results of the algorithms are shown in Figure 5. We draw the following conclusions. (i) SPM appears to outperform both SPGM-D and SPGM-Q; we attribute this to the fact that the negative subgradient provides a good descent direction. (ii) Both variants, FADMM-D and FADMM-Q, generally demonstrate better performance than the other methods, achieving lower objective function values.
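The covariance construction for robust SRM described above, and the worst-case risk it induces, can be sketched as follows. The factor `scale=10.0` reflects our reading of the garbled scaling in the extracted text, and `robust_risk` is our illustrative interpretation of the robust Sharpe-ratio denominator (Problem (14) itself is stated in Appendix B, not reproduced here).

```python
import numpy as np

def generate_srm_covariances(n, num, scale=10.0, seed=0):
    # num PSD matrices C^(i) = (1/n) * Y Y^T with Y = scale * randn(n, n).
    # `scale` is an assumption: the exact factor is unreadable in the source.
    rng = np.random.default_rng(seed)
    Cs = []
    for _ in range(num):
        Y = scale * rng.normal(size=(n, n))
        Cs.append(Y @ Y.T / n)
    return Cs

def robust_risk(x, Cs):
    # Worst-case volatility max_i sqrt(x^T C^(i) x): our illustrative reading
    # of the robust Sharpe-ratio denominator.
    return max(float(np.sqrt(x @ C @ x)) for C in Cs)
```

Each C^(i) is symmetric positive semidefinite by construction, so the risk is well defined for any portfolio x.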
Experimental Results on Robust Sparse Recovery. We consider solving Problem (15) using the following parameters: (ρ_1, ρ_2, ρ_3) ∈ {(10, 1, ), (10, 10, ), (10, 100, ), (100, 1, ), (100, 100, )}. For all methods, we initialize with β_0 = 0.001. Since √(‖x‖_[k]) is not necessarily weakly convex, FADMM-Q is not applicable; thus, we only implement FADMM-D. The results of the compared methods are presented in Figures 6, 7, 8, 9, and 10, from which we draw the following conclusion: although SPM, SPGM-D, and FSA yield comparable or better results than FADMM-D in certain cases, the proposed FADMM-D generally achieves the best performance among the compared methods in terms of speed. These results further corroborate our earlier findings that the proposed method is faster and more numerically robust.

Figure 5: Results on Sharpe ratio maximization on different datasets. Panels (a)–(h): madelon-2000-500, TDT2-1-2-1000-1000, TDT2-3-4-1000-1200, mnist-2000-768, mushroom-2000-112, gisette-800-1000, randn-300-1000, randn-300-1500; each panel compares SPM, SPGM-D, SPGM-Q, FSA-I, FSA-II, FADMM-D, and FADMM-Q over time (seconds).
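The quantity ‖x‖_[k] appearing above can be read as the sum of the k largest absolute entries of x, the largest-k norm common in the robust sparse recovery literature. This interpretation is our assumption, since Problem (15) itself is stated in Appendix B:

```python
import numpy as np

def topk_norm(x, k):
    # ||x||_[k]: sum of the k largest absolute entries (assumed definition).
    a = np.sort(np.abs(np.asarray(x, dtype=float)))
    return float(a[-k:].sum())
```

For k = 1 this recovers the ℓ∞ norm and for k = n the ℓ1 norm; as noted above, its square root is not necessarily weakly convex, which is why FADMM-Q is not applicable here.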
[Figure 6: Results on robust sparse recovery on different datasets with (ρ1, ρ2, ρ0) = (10, 1, ). Panels (a)–(h) plot objective value against time (seconds) for SPM, FSA-I, FSA-II, SPGM-D, and FADMM-D on madelon-2000-500, TDT2-1-2-1000-1000, TDT2-3-4-1000-1200, mnist-2000-768, mushroom-2000-112, gisette-800-1000, randn-300-1000, and randn-300-1500.]

[Figure 7: Results on robust sparse recovery on different datasets with (ρ1, ρ2, ρ0) = (10, 10, ). Panels (a)–(h) plot objective value against time (seconds) for the same five methods on the same eight datasets.]
[Figure 8: Results on robust sparse recovery on different datasets with (ρ1, ρ2, ρ0) = (10, 100, ). Panels (a)–(h) plot objective value against time (seconds) for SPM, FSA-I, FSA-II, SPGM-D, and FADMM-D on madelon-2000-500, TDT2-1-2-1000-1000, TDT2-3-4-1000-1200, mnist-2000-768, mushroom-2000-112, gisette-800-1000, randn-300-1000, and randn-300-1500.]

[Figure 9: Results on robust sparse recovery on different datasets with (ρ1, ρ2, ρ0) = (100, 1, ). Panels (a)–(h) plot objective value against time (seconds) for the same five methods on the same eight datasets.]
[Figure 10: Results on robust sparse recovery on different datasets with (ρ1, ρ2, ρ0) = (100, 100, ). Panels (a)–(h) plot objective value against time (seconds) for SPM, FSA-I, FSA-II, SPGM-D, and FADMM-D on madelon-2000-500, TDT2-1-2-1000-1000, TDT2-3-4-1000-1200, mnist-2000-768, mushroom-2000-112, gisette-800-1000, randn-300-1000, and randn-300-1500.]