# CCCP is Frank-Wolfe in disguise

Alp Yurtsever, Umeå University, alp.yurtsever@umu.se
Suvrit Sra, Massachusetts Institute of Technology, suvrit@mit.edu

This paper uncovers a simple but rather surprising connection: it shows that the well-known convex-concave procedure (CCCP) and its generalization to constrained problems are both special cases of the Frank-Wolfe (FW) method. This connection not only provides insight of deep (in our opinion) pedagogical value, but also transfers the recently discovered convergence theory of nonconvex Frank-Wolfe methods immediately to CCCP, closing a long-standing gap in its non-asymptotic convergence theory. We hope the viewpoint uncovered by this paper spurs the transfer of other advances made for FW to both CCCP and its generalizations.

## 1 Introduction

We study non-convex difference of convex (DC) optimization problems of the form:

$$\min_x \; f(x) - g(x), \quad x \in D, \tag{1.1}$$

where both $f$ and $g$ are convex functions and $D$ is a constraint set that might itself be non-convex (we will specify its structure later). Throughout, we assume that the optimal value $(f-g)^\star$ is finite.

Formulation (1.1) is a DC program. When $f$ and $g$ are smooth, a powerful local method is the Convex-Concave Procedure (CCCP) [Yuille and Rangarajan, 2003], which reduces (1.1) to a sequence of convex problems. CCCP is not only algorithmically appealing, it is also widely applicable, and numerous algorithms can be viewed as special cases of CCCP, most notably the famous EM algorithm; see [Yuille and Rangarajan, 2003, §4]. But despite its wide applicability (see §2.2) and elegance, convergence theory for CCCP is quite limited. Its asymptotic convergence can be obtained by viewing it as a specialization of the DC algorithm (DCA) [Le Thi and Pham Dinh, 2018], or via a more refined analysis designed for CCCP in [Lanckriet and Sriperumbudur, 2009]. A long-standing gap in the literature on CCCP has been to obtain non-asymptotic convergence theory that establishes convergence to an $\varepsilon$-stationary point in $O(\mathrm{poly}(1/\varepsilon))$ or fewer iterations.¹

The starting point of this paper is a discovery that is elementary yet surprising: CCCP is a special case of the Frank-Wolfe (FW) method! This discovery not only provides insight of deep (in our opinion) pedagogical value, but it also transfers the convergence theory of nonconvex FW to CCCP, which closes the above-mentioned gap in CCCP's non-asymptotic convergence theory.

### 1.1 Main contributions

In light of this short background, we summarize now the key contributions of this paper.

- We recognize the basic (unconstrained) CCCP method to be a special case of Frank-Wolfe. By the same argument, even its generalization to convex constrained instances of (1.1) is a special case of FW. This realization allows us to immediately transfer non-asymptotic convergence theory of FW to CCCP and convex constrained CCCP.
- Building on the above connection, we subsequently propose a new variant of FW (called FW+) that applies to the most general form of CCCP (called CCCP+), where the constraint set $D$ in (1.1) is specified via several DC constraints.

¹ We became aware of some recent discoveries on the non-asymptotic convergence of basic CCCP in [Abbaszadehpeivasti et al., 2021, Lê-Huu and Alahari, 2021] after completing this work; see Remark 3.4.
The variant FW+ not only allows us to syntactically view CCCP+ as a special case, but, more importantly, allows us to transfer the non-asymptotic convergence guarantees that we prove for FW+ directly to CCCP+.

While the FW+ variant and its analysis are slightly more subtle than basic FW, the connection between CCCP and FW uncovered by our work is remarkably simple. The introduction of FW+ itself is an immediate consequence of the connection uncovered. It is also quite plausible that the connections described in this paper will have various further consequences, including the development of new variants of CCCP and other related methods. For a more concrete discussion of potential implications, see the discussion in §5. Finally, we note a few additional connections of the view uncovered in this paper in the appendix. In particular, we show that various other methods such as the proximal point method, mirror descent, and mirror prox can also be seen as special instances of the FW method. A deeper exploration of these connections is left as future work.

### 1.2 Related work

FW. Frank and Wolfe [1956] proposed the FW algorithm originally for minimizing a convex quadratic function over a polyhedron and proved an $O(1/k)$ convergence rate in this setting. Levitin and Polyak [1966] extended the analysis and proved the same rate for minimization of an arbitrary smooth convex function over a generic convex compact set, and Canon and Cullum [1968] showed that this rate is optimal. The analysis was extended to the broader Hölder-smooth function class by Nesterov [2018]. Jaggi [2013] presented an affine-invariant analysis for the FW algorithm by replacing smoothness with the bounded curvature assumption, see (2.3). The analysis of FW for non-convex functions appeared in [Lacoste-Julien, 2016]: the method finds a stationary point, as the gap function converges to zero at an $O(1/\sqrt{k})$ rate; see (2.4) for the gap function. It is easy to show that the rate improves to $O(1/k)$ for minimization of a concave function, see Lemma 2.1. An application of the FW algorithm to minimization of a concave function appears in [Luss and Teboulle, 2013]. Design variants of FW with away, pairwise, or in-face steps and line-search or momentum strategies, as well as extensions of FW to stochastic gradients or block-coordinate updates, are extensively studied in the literature. We refer to [Kerdreux, 2020] for an excellent overview of these recent developments. To our knowledge, the variant FW+ introduced in this paper is new.

CCCP. The CCCP method of Yuille and Rangarajan [2003] is a special case of the more general DC algorithm (DCA) [Tao, 1997, Le Thi and Pham Dinh, 2018]. CCCP often provides a strong baseline method for tackling non-convex problems where a DC structure is already known or relatively easy to obtain; see [Lipp and Boyd, 2016] for some recent variations on the CCCP method. As already noted, asymptotic convergence of CCCP follows from the more general convergence theory of DCA [Tao, 1997], while a simplified, more direct convergence analysis was obtained in [Lanckriet and Sriperumbudur, 2009], who analyzed convergence of both function values and iterates; moreover, these convergence results apply to the most general CCCP+ formulation (DC objective with DC constraints).
A more recent work [Khamaru and Wainwright, 2018] considers gradient and subgradient based alternatives to DCA and CCCP, establishing their corresponding non-asymptotic convergence; however, its focus is on alternatives, and its guarantees do not transfer to CCCP, and thus a fortiori not to CCCP+. Both CCCP and DCA have received a large number of applications, in a variety of areas including statistics, machine learning, signal processing, operations research, and many more; we refer the reader to the extensive list documented in the survey [Le Thi and Pham Dinh, 2018] and to the examples in [Yuille and Rangarajan, 2003].

We became aware of some important items of related work in the recent literature after completing this work. [Abbaszadehpeivasti et al., 2021] proposes non-asymptotic convergence guarantees for unconstrained DC optimization. Their analysis is based on the performance estimation technique [Drori and Teboulle, 2014], which is a computer-assisted methodology for estimating the worst-case complexity of an algorithm by solving semidefinite programs. They conjecture the convergence guarantees through systematic numerical experimentation and then analytically verify them. Their proof provides limited insight and does not extend to the more general constrained settings.

The closest in relevance to our work is [Lê-Huu and Alahari, 2021]. The authors therein recognize the connection between CCCP and FW from a conceptually different viewpoint and show that unconstrained CCCP is a special instance of the generalized FW studied in [Bredies et al., 2009]. Generalized FW is an extension of FW for solving composite optimization problems through partial linearization, leading to a modified FW subproblem. Although not identified explicitly before, generalized FW is algebraically equivalent to the original FW applied to an epigraph reformulation in the context of CCCP. We independently discovered the connection between CCCP and FW through the lens of this epigraph reformulation. We emphasize the value of our conceptual perspective: the epigraph reformulation allows us to transfer existing guarantees of FW to CCCP without having to perform a new convergence analysis for the unconstrained and convex constrained CCCP cases. We further expound this connection and develop it for the DC constrained setting of CCCP+, leading to the discovery of FW+ as a natural extension of the FW method for functional constraints. Our FW+ analysis gives the first non-asymptotic convergence guarantees for CCCP+. We further observe that the proximal point method (PPM), mirror descent, and mirror prox are also special cases of FW applied to a suitable epigraph reformulation, see Appendices A and B. Since we reduce PPM to FW, and many other optimization methods reduce to PPM, all these methods are inherently special instances of FW.

## 2 Background on Frank-Wolfe and CCCP

### 2.1 Frank-Wolfe Algorithm

We begin by recalling the basic Frank-Wolfe method (FW) and some of its key properties. We focus in particular on the nonconvex case, as it is key to the rest of our theory. Consider therefore the prototypical constrained optimization template:

$$\min_\omega \; \varphi(\omega) \quad \text{s.t.} \quad \omega \in D, \tag{2.1}$$

where $D$ is a closed and convex set in $\mathbb{R}^n$ and $\varphi$ is a continuously differentiable (potentially nonconvex) function over the domain $D$. FW applied to (2.1) performs the following steps for $k = 1, 2, \ldots$

$$\omega_k^\star \in \operatorname*{argmin}_{\omega \in D} \; \varphi(\omega_k) + \langle \nabla\varphi(\omega_k), \, \omega - \omega_k \rangle, \qquad \omega_{k+1} = (1-\eta_k)\,\omega_k + \eta_k\,\omega_k^\star, \tag{FW}$$

where $\eta_k \in [0,1]$ is a step-size. There are multiple design choices for the step-size.
A common choice is the greedy step-size

$$\eta_k = \operatorname*{argmin}_{\eta \in [0,1]} \; \varphi\big((1-\eta)\,\omega_k + \eta\,\omega_k^\star\big). \tag{2.2}$$

The standard convergence theory of FW crucially depends on the boundedness of the curvature constant of $\varphi$ over $D$, defined as follows:

$$C_\varphi := \sup_{\substack{\omega, \hat\omega \in D;\; \eta \in [0,1] \\ \tilde\omega = (1-\eta)\omega + \eta\hat\omega}} \; \frac{2}{\eta^2}\Big( \varphi(\tilde\omega) - \varphi(\omega) - \langle \nabla\varphi(\omega), \, \tilde\omega - \omega \rangle \Big). \tag{2.3}$$

The bounded curvature assumption is closely related to the Lipschitz smoothness assumption. If $\nabla\varphi$ is $L$-Lipschitz continuous and the domain $D$ has bounded diameter $\mathrm{diam}(D)$, then $C_\varphi \le L\,\mathrm{diam}(D)^2$. However, the former is more general. For instance, it is easy to see that $C_\varphi = 0$ if $\varphi$ is concave. This becomes useful in our analysis later in this paper, and allows us to derive guarantees without any Lipschitz smoothness assumption on the functions involved and without an explicit assumption on the boundedness of the domain.

If $\varphi$ is convex and $C_\varphi$ is bounded, then the sequence $\{\omega_k\}_{k \ge 0}$ generated by (FW) converges to a solution in objective value (see, e.g., [Jaggi, 2013, Theorem 1]), with the convergence rate

$$\varphi(\omega_k) - \varphi^\star \le \frac{2 C_\varphi}{k+1}, \quad \text{where } \varphi^\star := \min_{\omega \in D} \varphi(\omega).$$

When $\varphi$ is non-convex, (FW) can be shown to find a stationary point [Lacoste-Julien, 2016, Theorem 1], with the following guarantee: there exists an index $\tau \in [k]$ such that

$$\langle \nabla\varphi(\omega_\tau), \, \omega_\tau - \omega \rangle \le \frac{\max\{C_\varphi, \; 2(\varphi(\omega_1) - \varphi^\star)\}}{\sqrt{k}} \quad \text{for all } \omega \in D.$$

This bound is often presented in terms of the so-called FW-gap:

$$\mathrm{gap}(\omega_\tau) := \max_{\omega \in D} \; \langle \nabla\varphi(\omega_\tau), \, \omega_\tau - \omega \rangle = \langle \nabla\varphi(\omega_\tau), \, \omega_\tau - \omega_\tau^\star \rangle. \tag{2.4}$$

This quantity naturally arises in the FW algorithm and its analysis, and it provides a reliable stopping criterion. In our analysis, we particularly need to apply FW to concave objectives. In this case, (FW) exhibits better convergence rates than for general non-convex costs. More precisely, we note the following result.

Lemma 2.1. Suppose that the sequence $\{\omega_k\}$ is generated by the FW method with greedy step-size (2.2) applied to Problem (2.1) with a concave objective. Then, there exists $\tau \in [k]$ such that

$$\langle \nabla\varphi(\omega_\tau), \, \omega_\tau - \omega \rangle \le \frac{\varphi(\omega_1) - \varphi^\star}{k} \quad \text{for all } \omega \in D.$$

Proof. First, observe that $\eta_k = 1$ in (2.2) when $\varphi$ is concave. Then, (FW) becomes

$$\omega_{k+1} \in \operatorname*{argmin}_{\omega} \; \varphi(\omega_k) + \langle \nabla\varphi(\omega_k), \, \omega - \omega_k \rangle \quad \text{s.t.} \quad \omega \in D.$$

By concavity of $\varphi$, we have

$$\max_{\omega \in D} \; \langle \nabla\varphi(\omega_k), \, \omega_k - \omega \rangle = \langle \nabla\varphi(\omega_k), \, \omega_k - \omega_{k+1} \rangle \le \varphi(\omega_k) - \varphi(\omega_{k+1}). \tag{2.5}$$

Then, averaging (2.5) over $k$ gives

$$\frac{1}{k} \sum_{i=1}^{k} \max_{\omega \in D} \; \langle \nabla\varphi(\omega_i), \, \omega_i - \omega \rangle \le \frac{1}{k} \big( \varphi(\omega_1) - \varphi^\star \big). \tag{2.6}$$

We conclude the proof by noting that the minimum over $k$ is less than or equal to the average.

### 2.2 The Convex-Concave Procedure (CCCP)

The starting point for the Convex-Concave Procedure (CCCP) of Yuille and Rangarajan [2003] is the unconstrained DC program

$$\min_x \; f(x) - g(x), \tag{2.7}$$

where both $f$ and $g$ are $C^1$ (once continuously differentiable) convex functions; CCCP can also be viewed as a special case of the more general DCA algorithm [Tao, 1997] that does not require differentiability. The key idea behind the success of CCCP is to use the convexity of $g(x)$ to linearize it and obtain the following global upper bound on $f(x) - g(x)$:

$$Q(x; y) := f(x) - g(y) - \langle \nabla g(y), \, x - y \rangle. \tag{2.8}$$

At each iteration, CCCP then updates its guess by solving the convex problem

$$x_{k+1} \in \operatorname*{argmin}_x \; Q(x; x_k). \tag{2.9}$$

A solution to (2.9) always exists because $(f-g)^\star$ is finite by assumption and

$$\min_x \; Q(x; x_k) \ge \min_x \; f(x) - g(x) = (f-g)^\star. \tag{2.10}$$

CCCP is thus a specific Majorize-Minimize (MM) algorithm [Hunter and Lange, 2004], and owing to the update (2.9), it generates a monotonically decreasing sequence $\{f(x_k) - g(x_k)\}_{k \ge 0}$ of objective values. When $f$ is differentiable, update (2.9) is tantamount to the implicit update

$$\nabla f(x_{k+1}) = \nabla g(x_k). \tag{CCCP}$$
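A minimal numerical sketch of the update (2.8)-(2.9) may help fix ideas. The toy DC decomposition below (with $f(x) = \tfrac14 \sum_i x_i^4 + \tfrac12 \|x\|^2$ and $g(x) = \|x\|^2$, so that $f - g$ is a nonconvex double-well) and the use of a generic scipy solver for the convex subproblem are illustrative assumptions, not part of the paper.

```python
# Sketch of the CCCP iteration (2.8)-(2.9) on a toy DC problem; the specific
# f, g, and the generic scipy subproblem solver are illustrative choices.
import numpy as np
from scipy.optimize import minimize

def f(x):       return 0.25 * np.sum(x**4) + 0.5 * np.sum(x**2)   # convex
def g(x):       return np.sum(x**2)                                # convex
def grad_g(x):  return 2.0 * x

def cccp(x0, iters=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        xk, gk = x.copy(), grad_g(x)
        # Convex majorizer Q(x; x_k) = f(x) - g(x_k) - <grad g(x_k), x - x_k>, cf. (2.8)
        Q = lambda z: f(z) - g(xk) - gk @ (z - xk)
        x = minimize(Q, xk).x   # CCCP update (2.9); equivalently solves grad f(x) = grad g(x_k)
    return x

x_hat = cccp(np.array([0.3, -2.0]))
print(x_hat, f(x_hat) - g(x_hat))   # approaches a stationary point of f - g
```

As expected from the majorize-minimize structure of (2.9), the objective values $f(x_k) - g(x_k)$ are non-increasing along these iterates.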
Note that the subproblem (2.9) is a convex optimization problem that does not approximate the function $f(x)$, and thus likely offers a tighter approximate model of the DC cost (2.7), an aspect that might help explain its strong empirical performance [Lipp and Boyd, 2016]. The original CCCP paper [Yuille and Rangarajan, 2003] presented various important applications that could be viewed through the CCCP lens, of which the EM algorithm [Dempster et al., 1977] and Sinkhorn's method for matrix scaling [Sinkhorn, 1967] are perhaps the two most familiar special cases. A formal study of CCCP's asymptotic convergence properties was undertaken in [Lanckriet and Sriperumbudur, 2009], who studied all three variants: unconstrained CCCP, convex constrained CCCP, as well as DC constrained CCCP.

## 3 Unconstrained CCCP and convex constrained CCCP via FW

Our first result shows that CCCP can be viewed as a special case of the FW method.

Proposition 3.1. CCCP is equivalent to the FW method applied to the following epigraph reformulation of Problem (2.7):

$$\min_{x,t} \; t - g(x) \quad \text{s.t.} \quad f(x) \le t. \tag{3.1}$$

Proof. If we apply FW to (3.1), the linear minimization step is

$$(x_k^\star, t_k^\star) \in \operatorname*{argmin}_{(x,t)} \; t - \langle \nabla g(x_k), \, x \rangle \quad \text{s.t.} \quad f(x) \le t. \tag{3.2}$$

KKT conditions are necessary and sufficient for optimality in Problem (3.2) since it is a convex optimization problem and Slater's condition holds. The Lagrangian of this problem is

$$L(x, t, \lambda) = t - \langle \nabla g(x_k), \, x \rangle + \lambda \big( f(x) - t \big).$$

The Lagrangian stationarity condition implies

$$\nabla_x L(x, t, \lambda^\star)\big|_{x = x_k^\star,\, t = t_k^\star} = -\nabla g(x_k) + \lambda^\star \nabla f(x_k^\star) = 0, \qquad \partial_t L(x, t, \lambda^\star)\big|_{x = x_k^\star,\, t = t_k^\star} = 1 - \lambda^\star = 0, \qquad \Longrightarrow \qquad \nabla f(x_k^\star) = \nabla g(x_k). \tag{3.3}$$

We also get $t_k^\star = f(x_k^\star)$ by the complementary slackness condition. Finally, FW updates its estimate via $x_{k+1} = (1-\eta_k)\,x_k + \eta_k\,x_k^\star$ and $t_{k+1} = (1-\eta_k)\,t_k + \eta_k\,t_k^\star$, where $\eta_k \in [0,1]$ is chosen to minimize the objective function in (3.1). Since this function is concave, $\eta_k = 1$, and we conclude that the system in (3.3) is exactly the CCCP subproblem.

Proposition 3.1 recognizes a simple derivation of the CCCP update: reformulate the unconstrained DC program as a concave minimization over a convex set and apply FW. This recognition transfers the convergence theory of FW to CCCP, as we present in the next corollary.

Corollary 3.2. Suppose that the sequence $\{x_k\}$ is generated by (CCCP) applied to Problem (2.7). Then, there exists $\tau \in [k]$ such that

$$f(x_\tau) - f(x) - \langle \nabla g(x_\tau), \, x_\tau - x \rangle \le \frac{1}{k} \Big( f(x_1) - g(x_1) - (f-g)^\star \Big) \quad \text{for all } x.$$

Proof. By Proposition 3.1, CCCP is equivalent to FW applied to (3.1). Observe that (3.1) is an instance of (2.1) with $\omega = (x,t)$, $\varphi(\omega) = t - g(x)$, and $D = \{(x,t) : f(x) \le t\}$. Since $\varphi$ is concave, we can use Lemma 2.1. Hence, there exists $\tau \in [k]$ such that

$$t_\tau - t - \langle \nabla g(x_\tau), \, x_\tau - x \rangle \le \frac{\big( t_1 - g(x_1) \big) - (f-g)^\star}{k} \quad \text{for all } (x,t) : f(x) \le t.$$

Here $\varphi^\star = (f-g)^\star$ by definition (the minimal feasible $t$ for any $x$ is $f(x)$), $t_1 = f(x_1)$ by initialization, and $t_\tau = f(x_\tau)$ by complementary slackness, see the proof of Proposition 3.1; taking $t = f(x)$ yields the claim.

Corollary 3.2 establishes non-asymptotic convergence guarantees for CCCP to a stationary point of the unconstrained DC program (2.7). Remarkably, this result does not require any assumptions on the Lipschitz-smoothness of the functions or an explicit assumption on the boundedness of the domain. The main result in Corollary 3.2 is based on the variational inequality characterization of first-order stationarity. Similar measures are used in the literature for various problems [Yurtsever et al., 2021]. The next lemma clarifies the value of this stationarity condition.

Lemma 3.3. A point $x^\star$ is a first-order stationary point of Problem (2.7) if

$$f(x^\star) - f(x) - \langle \nabla g(x^\star), \, x^\star - x \rangle \le 0 \quad \text{for all } x. \tag{3.4}$$

Proof.
By definition, the point $x^\star$ is first-order stationary if

$$\langle \nabla f(x^\star) - \nabla g(x^\star), \, x - x^\star \rangle \ge 0 \quad \text{for all } x. \tag{3.5}$$

Since $f$ is convex, $f(x^\star) - f(x) \le \langle \nabla f(x^\star), \, x^\star - x \rangle$, and (3.5) immediately implies (3.4). Next, we show that (3.4) implies (3.5) as well. Suppose (3.4) holds and consider $x = x^\star + \alpha s$ for an arbitrary direction $s$ and step-length $\alpha > 0$:

$$f(x^\star + \alpha s) - f(x^\star) - \langle \nabla g(x^\star), \, \alpha s \rangle \ge 0.$$

We divide both sides by $\alpha$ and take the limit as $\alpha \to 0^+$:

$$\lim_{\alpha \to 0^+} \frac{f(x^\star + \alpha s) - f(x^\star)}{\alpha} - \langle \nabla g(x^\star), \, s \rangle = \langle \nabla f(x^\star), \, s \rangle - \langle \nabla g(x^\star), \, s \rangle \ge 0.$$

Finally, we arrive at (3.5) by substituting $s = \alpha^{-1}(x - x^\star)$. In conclusion, (3.4) and (3.5) are equivalent for Problem (2.7).

Remark 3.4 (Comparison with the existing results). There are two other recent works that provide non-asymptotic convergence guarantees for CCCP in the unconstrained setting.

1. Abbaszadehpeivasti et al. [2021] present two main results: (i) Under the assumption that at least one of the terms $f$ or $g$ is smooth, they show $\min_{\tau \in [k]} \|v_\tau - u_\tau\| \le O(1/\sqrt{k})$, where $v_\tau \in \partial f(x_\tau)$ and $u_\tau \in \partial g(x_\tau)$. (ii) For the non-smooth setting, they show $\min_{\tau \in [k]} \max_x \big( f(x_\tau) - f(x) - \langle u_\tau, \, x_\tau - x \rangle \big) \le O(1/k)$, where $u_\tau \in \partial g(x_\tau)$. This second result matches our guarantees in Corollary 3.2. As we also note above, Corollary 3.2 does not require any assumption on smoothness, and it holds with subgradients as well.

2. Lê-Huu and Alahari [2021] provide convergence guarantees for the generalized FW algorithm [Bredies et al., 2009] in non-convex settings. Moreover, they identify CCCP as a special instance of generalized FW. As a result, the guarantees that they present for the concave-$f$ setting (which corresponds to the DC template) in Theorem 1 of their paper are equivalent to our Corollary 3.2. Intriguingly, in this setting, the generalized FW algorithm itself is algebraically equivalent to the standard FW applied to the epigraph reformulation (3.1).

This recognition, although it seems simple, is interesting in its own right because the recent literature on FW habitually associates the algorithm with optimization over convex, closed, and bounded domains. Nevertheless, the bounded domain assumption can be relaxed without losing the guarantees, as long as

$$\text{for all } \hat\omega \in D, \quad \min_{\omega \in D} \; \langle \nabla\varphi(\hat\omega), \, \omega \rangle \ \text{ is bounded below.} \tag{3.6}$$

In other words, the linear minimization subproblem is well-defined everywhere in the domain. As an interesting fact, Frank and Wolfe [1956] use assumption (3.6) in their original work and do not restrict their algorithm to bounded domains; see Hypothesis A in their paper. Assumption (3.6) trivially holds if $D$ is bounded, but not vice versa. It is easy to show that assumption (3.6) is satisfied in Proposition 3.1, since

$$\min_{f(x) \le t} \; t - \langle \nabla g(\hat x), \, x \rangle \;=\; \min_x \; f(x) - \langle \nabla g(\hat x), \, x \rangle \;\ge\; \min_x \; f(x) - g(x) + g(\hat x) - \langle \nabla g(\hat x), \, \hat x \rangle \;\ge\; (f-g)^\star + g(\hat x) - \langle \nabla g(\hat x), \, \hat x \rangle,$$

and $(f-g)^\star$ is finite by the blanket assumption.

### 3.1 Convex constrained CCCP via FW

Here the problem of interest is:

$$\min_x \; f(x) - g(x) \quad \text{s.t.} \quad x \in D, \tag{3.7}$$

for a closed convex set $D$. We can equivalently write this problem as

$$\min_x \; f(x) + \chi_D(x) - g(x), \tag{3.8}$$

where $\chi_D(\cdot)$ denotes the indicator function of $D$. Since $h := f + \chi_D$ is convex, we can apply the same idea as in the unconstrained case by looking at the equivalent problem

$$\min_{x,t} \; t - g(x) \quad \text{s.t.} \quad h(x) \le t. \tag{3.9}$$

If we apply FW to (3.9), the linear minimization step involves solving

$$\min_{x,t} \; t - \langle \nabla g(x_k), \, x \rangle \quad \text{s.t.} \quad f(x) + \chi_D(x) \le t, \tag{3.10}$$

which is clearly equivalent to solving

$$\min_x \; f(x) - \langle \nabla g(x_k), \, x \rangle \quad \text{s.t.} \quad x \in D. \tag{3.11}$$

Subproblem (3.11) is precisely the update of convex constrained CCCP.
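A minimal numerical sketch of the convex constrained update (3.11): the box-constrained toy problem below (the same double-well $f - g$ as in the earlier sketch, now restricted to a box $D$) and the use of scipy's bound-constrained solver for the subproblem are illustrative assumptions, not part of the paper.

```python
# Sketch of convex constrained CCCP: at each step solve (3.11),
#   min_x f(x) - <grad g(x_k), x>  s.t.  x in D,
# here with D a box handled by a bound-constrained scipy solver (illustrative).
import numpy as np
from scipy.optimize import minimize

def f(x):       return 0.25 * np.sum(x**4) + 0.5 * np.sum(x**2)   # convex
def grad_g(x):  return 2.0 * x                                     # g(x) = ||x||^2

def constrained_cccp(x0, bounds, iters=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        gk = grad_g(x)
        sub = lambda z: f(z) - gk @ z                  # linearized objective of (3.11)
        x = minimize(sub, x, bounds=bounds, method="L-BFGS-B").x
    return x

# D = [-0.5, 2] x [-0.5, 2]; the unconstrained stationary point (1, -1) is infeasible,
# so the second coordinate settles on the boundary at -0.5 while the first goes to 1.
print(constrained_cccp(np.array([0.3, -0.4]), bounds=[(-0.5, 2.0), (-0.5, 2.0)]))
```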
## 4 CCCP with difference of convex constraints via FW

At this point, it is natural to wonder whether the most general version of CCCP also admits a natural FW interpretation. Consider therefore the DC programming problem with DC constraints:

$$\min_x \; f_0(x) - g_0(x) \quad \text{s.t.} \quad f_i(x) - g_i(x) \le 0, \quad i = 1, \ldots, m. \tag{4.1}$$

Motivated by applications in Gaussian processes and support vector machines with missing variables, the advanced CCCP for Problem (4.1) was introduced in [Smola et al., 2005, Theorem 1]. This method updates its estimate by solving the following subproblem:

$$\min_x \; f_0(x) - g_0(x_k) - \langle \nabla g_0(x_k), \, x - x_k \rangle \quad \text{s.t.} \quad f_i(x) - g_i(x_k) - \langle \nabla g_i(x_k), \, x - x_k \rangle \le 0, \quad i = 1, \ldots, m. \tag{CCCP+}$$

A large body of DC programming literature focuses on template (4.1), or even more general versions of it. To name a few examples, [Fardad and Jovanović, 2014] studies this template for finding optimal feedback gains in the presence of structural constraints, and [Tao et al., 2016] uses it for solving content-centric sparse multicast beamforming. [Lipp and Boyd, 2016] focuses on the same template and provides a discussion of the applications, also introducing several extensions and variations of CCCP+ (e.g., with line-search). [Shen et al., 2016] proposes a structured way to formulate and handle problems as DC programs with DC constraints, called disciplined convex-concave programming (DCCP). DCCP is a generic framework which covers the Boolean linear program, Boolean least squares, and the quadratic assignment problem as special instances. DC programming has a long history; beyond the references we cite above, we refer to [Le Thi and Pham Dinh, 2018] for an overview.

There is a major challenge in making a natural connection between CCCP+ and FW: the feasible set of (4.1) is non-convex, and the feasible sets of (4.1) and (CCCP+) are not the same. As a result, it seems impossible to write CCCP+ as a special instance of the standard FW method. However, searching for a connection between CCCP+ and FW leads us to the discovery of a more general form of the FW method itself, which we call FW+. This method is obtained by linearizing the constraints along with the objective function, and it recovers CCCP+ as a special case while remaining amenable to a transparent FW-style convergence analysis.

### 4.1 Extended FW algorithm

Here is the problem template:

$$\min_{\omega \in D} \; \varphi(\omega) \quad \text{s.t.} \quad \psi_i(\omega) \le 0, \quad i = 1, \ldots, m, \tag{4.2}$$

where $D$ is a closed and convex set in $\mathbb{R}^n$ and $\varphi$ and $\psi_i$ are continuously differentiable functions with bounded curvature. The FW+ steps for Problem (4.2) are as follows:

$$\omega_k^\star \in \operatorname*{argmin}_{\omega \in D} \; \varphi(\omega_k) + \langle \nabla\varphi(\omega_k), \, \omega - \omega_k \rangle \quad \text{s.t.} \quad \psi_i(\omega_k) + \langle \nabla\psi_i(\omega_k), \, \omega - \omega_k \rangle \le 0, \quad i = 1, \ldots, m,$$
$$\omega_{k+1} = (1-\eta_k)\,\omega_k + \eta_k\,\omega_k^\star. \tag{FW+}$$

Here, we focus on the setting where both $\varphi$ and the $\psi_i$ are concave. In this setting, we can choose $\eta_k = 1$. We defer a more general discussion of FW+ for other settings to an extension of this paper.

Theorem 4.1. Suppose that the sequence $\{\omega_k\}$ is generated by the FW+ algorithm for Problem (4.2) with concave functions $\varphi$ and $\psi_i$. Then, there exists an index $\tau \in [k]$ at which $\psi_i(\omega_\tau) \le 0$ for $i = 1, \ldots, m$, and

$$\langle \nabla\varphi(\omega_\tau), \, \omega_\tau - \omega \rangle \le \frac{\varphi(\omega_1) - \varphi^\star}{k} \quad \text{for all } \omega \in \bigcap_{i=1}^m \{\omega \in D : \hat\psi_i(\omega, \omega_\tau) \le 0\}, \tag{4.3}$$

where $\hat\psi_i(\omega, \omega_\tau) := \psi_i(\omega_\tau) + \langle \nabla\psi_i(\omega_\tau), \, \omega - \omega_\tau \rangle$.

Proof. First, we focus on the constraint functions $\psi_i$. By concavity, for all $i$,

$$\psi_i(\omega_{k+1}) \le \psi_i(\omega_k) + \langle \nabla\psi_i(\omega_k), \, \omega_{k+1} - \omega_k \rangle \le 0,$$

where the last inequality follows from the constraint in the linear minimization step of FW+. Similarly, by concavity of $\varphi$, we have

$$\langle \nabla\varphi(\omega_k), \, \omega_k - \omega_{k+1} \rangle \le \varphi(\omega_k) - \varphi(\omega_{k+1}).$$

Averaging this inequality over $k$ gives

$$\frac{1}{k} \sum_{i=1}^{k} \langle \nabla\varphi(\omega_i), \, \omega_i - \omega_{i+1} \rangle \le \frac{\varphi(\omega_1) - \varphi(\omega_{k+1})}{k} \le \frac{\varphi(\omega_1) - \varphi^\star}{k},$$

where the second inequality holds since $\omega_{k+1}$ lies in a restriction of the feasible set of (4.2). Then, there exists an index $\tau \in [k]$ at which

$$\langle \nabla\varphi(\omega_\tau), \, \omega_\tau - \omega_{\tau+1} \rangle \le \frac{\varphi(\omega_1) - \varphi^\star}{k}.$$

This inequality leads to (4.3) by definition of $\omega_{\tau+1} = \omega_\tau^\star$ in (FW+).
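A minimal numerical sketch of the FW+ iteration on a toy instance of (4.2) with concave $\varphi$ and a concave constraint $\psi$ (so that $\eta_k = 1$): the specific quadratics, the box $D$, and the use of scipy.optimize.linprog for the linearized subproblem are illustrative assumptions, not part of the paper. Because both the objective and the constraint are linearized, each FW+ step here reduces to a linear program.

```python
# Sketch of FW+ on a toy concave-objective, concave-constraint problem:
#   minimize  phi(w) = -||w||^2   s.t.  psi(w) = 1 - ||w - center||^2 <= 0,  w in D = [-2, 2]^2.
# With eta_k = 1, each step solves the linearized subproblem, which is an LP (illustrative).
import numpy as np
from scipy.optimize import linprog

center   = np.array([1.0, 1.0])
phi      = lambda w: -w @ w
grad_phi = lambda w: -2.0 * w
psi      = lambda w: 1.0 - (w - center) @ (w - center)
grad_psi = lambda w: -2.0 * (w - center)

w = np.array([-0.5, 1.8])            # feasible starting point: psi(w) <= 0
for _ in range(20):
    # Linearized constraint: psi(w_k) + <grad psi(w_k), omega - w_k> <= 0
    a, b = grad_psi(w), grad_psi(w) @ w - psi(w)
    res = linprog(c=grad_phi(w), A_ub=a.reshape(1, -1), b_ub=[b],
                  bounds=[(-2.0, 2.0)] * 2, method="highs")
    w = res.x                         # eta_k = 1 step
print(w, phi(w), psi(w))              # settles at a stationary corner; psi(w) <= 0 at every iterate
```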
Before moving on, we characterize stationary points of Problem (4.2) in the next lemma. This gives a clear interpretation of the bounds in (4.3) as a perturbation of the stationarity condition.

Lemma 4.2. Consider Problem (4.2) and assume $\varphi$ and $\psi_i$ are concave. Then, $\omega^\star \in \mathbb{R}^n$ is a feasible stationary point of Problem (4.2) if

$$\omega^\star \in D \ \text{ and } \ \psi_i(\omega^\star) \le 0, \quad i = 1, \ldots, m, \qquad \text{(feasibility)}$$

$$\langle \nabla\varphi(\omega^\star), \, \omega^\star - \omega \rangle \le 0 \quad \text{for all } \omega \in \bigcap_{i=1}^m \{\omega \in D : \hat\psi_i(\omega, \omega^\star) \le 0\}, \qquad \text{(stationarity)}$$

where $\hat\psi_i(\omega, \omega^\star) := \psi_i(\omega^\star) + \langle \nabla\psi_i(\omega^\star), \, \omega - \omega^\star \rangle$.

Proof. Let us note that $\psi_i(\omega) \le \hat\psi_i(\omega, \omega^\star)$ due to concavity; therefore, the set considered in the stationarity condition of Lemma 4.2 is a restriction of the feasible set. The lemma implies that this restriction contains all feasible directions at $\omega^\star$. Here is a simple proof. First, suppose that the conditions in Lemma 4.2 are satisfied at $\omega^\star$ and the inequality constraint is active, i.e., $\psi_i(\omega^\star) = 0$. Then, $\hat\psi_i$ is also active since $\hat\psi_i(\omega^\star, \omega^\star) = \psi_i(\omega^\star) = 0$. Assume that there exists a direction $d \in \mathbb{R}^n$ such that $\tilde\omega := \omega^\star + \alpha d$ is feasible for small enough $\alpha > 0$, and that $d$ is a descent direction, $\langle \nabla\varphi(\omega^\star), \, d \rangle < 0$. By the stationarity condition in Lemma 4.2, this implies either $\tilde\omega \notin D$ or $\hat\psi_i(\tilde\omega, \omega^\star) > 0$. The former clearly contradicts our feasibility assumption. Suppose the latter holds; then we get $\langle \nabla\psi_i(\omega^\star), \, d \rangle > 0$, meaning that $d$ is an ascent direction for $\psi_i$ at $\omega^\star$. This also contradicts our feasibility assumption: the constraint is active and $d$ is an ascent direction, hence $d$ is directed outwards from the feasible region. By contradiction, we conclude that there is no feasible descent direction at $\omega^\star$. Next, suppose the constraint is inactive. This means $\hat\psi_i$ is also inactive, as $\hat\psi_i(\omega^\star, \omega^\star) = \psi_i(\omega^\star) < 0$. Then, we can place an open ball $\mathcal{B}(\omega^\star)$ of infinitesimal radius centered at $\omega^\star$ such that $\hat\psi_i(\omega, \omega^\star) \le 0$ for all points in $\mathcal{B}(\omega^\star)$. This implies $\langle \nabla\varphi(\omega^\star), \, \omega^\star - \omega \rangle \le 0$ for all $\omega \in D \cap \mathcal{B}(\omega^\star)$ by the stationarity condition in Lemma 4.2, so again there is no feasible descent direction at $\omega^\star$.

FW+ is a natural extension of FW with considerably broader applicability. Theorem 4.1 and Lemma 4.2 together imply that FW+ provably finds a stationary point of Problem (4.2) under the concavity assumptions, with non-asymptotic convergence guarantees. We complement this result in Appendix E by deriving convergence guarantees for FW+ when $\varphi$ and the $\psi_i$ are convex. Importantly, FW+ can be used for finding a stationary point of a QCQP (quadratically constrained quadratic program) by solving a sequence of linear programs.

### 4.2 CCCP+ via FW+

We are ready to establish the connection between CCCP+ and FW+ by following similar ideas as in Section 3.

Proposition 4.3. CCCP+ is equivalent to the FW+ method applied to the following epigraph reformulation of Problem (4.1):

$$\min_{x, t_0, \ldots, t_m} \; t_0 - g_0(x) \quad \text{s.t.} \quad f_i(x) \le t_i, \; i = 0, 1, \ldots, m; \qquad g_i(x) \ge t_i, \; i = 1, \ldots, m. \tag{4.4}$$

Proof. Problem (4.4) is a special case of (4.2) with $\omega = (x, t_0, \ldots, t_m)$, $\varphi(\omega) = t_0 - g_0(x)$, $\psi_i(\omega) = t_i - g_i(x)$, and $D = \{\omega : f_i(x) \le t_i, \; i = 0, 1, \ldots, m\}$. We consider FW+ for (4.4). Since $\varphi$ and the $\psi_i$ are concave, we can use $\eta_k = 1$, and the (FW+) procedure amounts to solving

$$\min_{x, t_0, \ldots, t_m} \; t_0 - g_0(x_k) - \langle \nabla g_0(x_k), \, x - x_k \rangle \quad \text{s.t.} \quad f_i(x) \le t_i, \; i = 0, 1, \ldots, m; \qquad t_i - g_i(x_k) - \langle \nabla g_i(x_k), \, x - x_k \rangle \le 0, \; i = 1, \ldots, m,$$

which is clearly equivalent to the (CCCP+) subproblem.

This connection also transfers the convergence theory of FW+ to CCCP+.

Corollary 4.4. Suppose that the sequence $\{x_k\}$ is generated by (CCCP+) applied to Problem (4.1). Then, there exists an index $\tau \in [k]$ such that

$$f_0(x_\tau) - f_0(x) - \langle \nabla g_0(x_\tau), \, x_\tau - x \rangle \le \frac{1}{k} \Big( \big( f_0(x_1) - g_0(x_1) \big) - (f_0 - g_0)^\star \Big),$$

for all $x$ that satisfy $f_i(x) - g_i(x_\tau) - \langle \nabla g_i(x_\tau), \, x - x_\tau \rangle \le 0$, $i = 1, \ldots, m$.

Proof.
By Proposition 4.3 and Theorem 4.1, we get

$$t_{0,\tau} - t_0 - \langle \nabla g_0(x_\tau), \, x_\tau - x \rangle \le \frac{1}{k} \Big( \big( t_{0,1} - g_0(x_1) \big) - (f_0 - g_0)^\star \Big)$$

for all $(x, t_0, \ldots, t_m)$ that satisfy $f_i(x) \le t_i$, $i = 0, 1, \ldots, m$, and $t_i - g_i(x_\tau) - \langle \nabla g_i(x_\tau), \, x - x_\tau \rangle \le 0$, $i = 1, \ldots, m$. This is equivalent to the desired bounds after eliminating $t_0, t_1, \ldots, t_m$ from these formulas.

## 5 Implications and Discussion

Discussion. We reiterate an important aspect of the non-asymptotic convergence guarantees proved for CCCP and CCCP+ in this paper. Owing to the special structure involving minimization of concave functions, the convergence rates are independent of the curvature constant of FW (see (2.3)). This property is valuable since it delivers a convergence analysis of CCCP without imposing the usual L-smoothness assumptions on either $f$, $g$, or their difference $f - g$, a limitation that typical first-order methods impose. While surprising at first sight, we can reconcile this intuitively because CCCP requires solving an optimization subproblem (2.9), which is a stronger oracle than merely a gradient oracle for $f - g$. The ensuing faster convergence (i.e., tighter bound on the rate) may offer one explanation for the intuitive view of why CCCP can often be quite competitive [Yuille and Rangarajan, 2003, Lipp and Boyd, 2016].

Implications. Beyond generic convergence guarantees and the transfer of other progress on FW to CCCP, we note some specific implications worth further study here. The EM algorithm is a special case of CCCP [Yuille and Rangarajan, 2003, §4], and therefore a special case of FW. Thus, one should be able to develop a more refined understanding of the global iteration complexity of EM, as well as sharper local convergence properties. Given the large number of settings in machine learning and statistics where the EM algorithm, or more generally, variational methods [Blei et al., 2017] are used, a further deepening of their relation to FW-methods should prove fruitful. Similar implications about algorithms, convergence, and complexity apply to other instances of CCCP that have been studied, notably for problems such as matrix scaling [Sinkhorn, 1967].

Extensions. The connection established between FW and CCCP (FW+ and CCCP+) paves the way for numerous extensions that are worth pursuing. We hope our work motivates others to pursue some of these or other extensions that these connections can inspire. The first important extension worthy of study is to obtain stochastic variants of CCCP by building on the recent progress on non-convex stochastic FW [Reddi et al., 2016b] as well as non-convex variance reduced versions in the case of finite-sum problems [Reddi et al., 2016a, Allen-Zhu and Hazan, 2016, Yurtsever et al., 2019]. Another line of research worthy of closer attention is to study the case where either $g$ is non-differentiable, or even both $f$ and $g$ in (1.1) are non-differentiable. In that case, developing an extension of the relation between FW and CCCP to a relation between nonsmooth FW [Thekumparampil et al., 2020] (although, as of now, this nonsmooth FW method is limited to convex problems) and DCA should also prove fruitful, given the extensive progress DCA has witnessed over the decades [Le Thi and Pham Dinh, 2018]. The case where $f(x)$ is an indicator function of a convex set fits trivially, as already discussed above. One might also speculate that progress on DCA could be transferred back to discover and develop nonsmooth FW methods.
Finally, given that both DCA and CCCP are special instances of MM methods [Hunter and Lange, 2004], it is possible that other MM methods might be profitably viewed as instances of an optimization procedure (not necessarily FW) whose geometric and convergence properties are better understood. We hope our work inspires the discovery of additional such connections.

Acknowledgments

Alp Yurtsever received support from the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. Suvrit Sra acknowledges support from the NSF CAREER grant (1846088) and the NSF CCF-2112665 (TILOS AI Research Institute).

References

Hadi Abbaszadehpeivasti, Etienne de Klerk, and Moslem Zamani. On the rate of convergence of the difference-of-convex algorithm (DCA). arXiv preprint arXiv:2109.13566, 2021.

Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699-707. PMLR, 2016.

David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.

Kristian Bredies, Dirk A Lorenz, and Peter Maass. A generalized conditional gradient method and its connection to an iterative shrinkage method. Computational Optimization and Applications, 42(2):173-193, 2009.

Michael D Canon and Clifton D Cullum. A tight upper bound on the rate of convergence of the Frank-Wolfe algorithm. SIAM Journal on Control, 6(4):509-516, 1968.

Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1-22, 1977.

Yoel Drori and Marc Teboulle. Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming, 145(1):451-482, 2014.

Makan Fardad and Mihailo R Jovanović. On the design of optimal structured and sparse feedback gains via sequential convex programming. In 2014 American Control Conference, pages 2426-2431. IEEE, 2014.

M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95-110, 1956.

David R Hunter and Kenneth Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30-37, 2004.

Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning (ICML), pages 427-435, 2013.

Thomas Kerdreux. Accelerating conditional gradient methods. PhD thesis, Université Paris Sciences et Lettres, 2020.

Koulik Khamaru and Martin Wainwright. Convergence guarantees for a class of non-convex and non-smooth optimization problems. In International Conference on Machine Learning, pages 2601-2610. PMLR, 2018.

Simon Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv:1607.00345, 2016.

Gert Lanckriet and Bharath K Sriperumbudur. On the convergence of the concave-convex procedure. Advances in Neural Information Processing Systems, 22, 2009.

Đ Khuê Lê-Huu and Karteek Alahari. Regularized Frank-Wolfe for dense CRFs: Generalizing mean field and beyond. Advances in Neural Information Processing Systems, 34:1453-1467, 2021.

Hoai An Le Thi and Tao Pham Dinh. DC programming and DCA: thirty years of developments. Mathematical Programming, 169(1):5-68, 2018.

Evgeny S Levitin and Boris T Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1-50, 1966.
Thomas Lipp and Stephen Boyd. Variations and extension of the convex-concave procedure. Optimization and Engineering, 17(2):263-287, 2016.

Ronny Luss and Marc Teboulle. Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint. SIAM Review, 55(1):65-98, 2013.

B. Martinet. Régularisation d'inéquations variationnelles par approximations successives. Revue Française d'Informatique et de Recherche Opérationnelle, 1970.

Yu Nesterov. Complexity bounds for primal-dual methods minimizing the model of objective function. Mathematical Programming, 171(1):311-330, 2018.

Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314-323. PMLR, 2016a.

Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic Frank-Wolfe methods for nonconvex optimization. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pages 1244-1251. IEEE, 2016b.

R Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877-898, 1976.

Xinyue Shen, Steven Diamond, Yuantao Gu, and Stephen Boyd. Disciplined convex-concave programming. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1009-1014. IEEE, 2016.

Richard Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402-405, 1967.

Alex J Smola, SVN Vishwanathan, and Thomas Hofmann. Kernel methods for missing variables. In International Workshop on Artificial Intelligence and Statistics, pages 325-332. PMLR, 2005.

Meixia Tao, Erkai Chen, Hao Zhou, and Wei Yu. Content-centric sparse multicast beamforming for cache-enabled cloud RAN. IEEE Transactions on Wireless Communications, 15(9):6118-6131, 2016.

Pham Dinh Tao. Convex analysis approach to DC programming: theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289-355, 1997.

Kiran K Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Projection efficient subgradient method and optimal nonsmooth Frank-Wolfe method. Advances in Neural Information Processing Systems, 33:12211-12224, 2020.

Alan L Yuille and Anand Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915-936, 2003.

Alp Yurtsever, Suvrit Sra, and Volkan Cevher. Conditional gradient methods via stochastic path-integrated differential estimator. In International Conference on Machine Learning, pages 7282-7291. PMLR, 2019.

Alp Yurtsever, Varun Mangalick, and Suvrit Sra. Three operator splitting with a nonconvex loss function. In International Conference on Machine Learning, pages 12267-12277. PMLR, 2021.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] The main contribution is the realization of a simple connection between CCCP and FW, clearly described in the title and abstract.
(b) Did you describe the limitations of your work? [Yes] The limitations and some future directions are discussed in detail in Section 5.
(c) Did you discuss any potential negative societal impacts of your work? [No] The paper is of a theoretical nature and it does not have an immediate societal impact.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] We explicitly state the full set of assumptions for all theoretical results. (b) Did you include complete proofs of all theoretical results? [Yes] Complete proofs are present in the paper, with some extensions and extra details in the appendix. 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A] (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A] (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [N/A] (b) Did you mention the license of the assets? [N/A] (c) Did you include any new assets either in the supplemental material or as a URL? [N/A] (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]