# activeset_methods_for_submodular_minimization_problems__0bab556a.pdf

Journal of Machine Learning Research 18 (2017) 1-31 Submitted 2/17; Revised 7/17; Published 12/17

Active-set Methods for Submodular Minimization Problems

K. S. Sesh Kumar S.KARRI@IMPERIAL.AC.UK Imperial College London, Department of Computing 180 Queen s Gate, London SW7 2AZ

Francis Bach FRANCIS.BACH@ENS.FR INRIA - Sierra Project-team D epartement d Informatique de l Ecole Normale Sup erieure (UMR CNRS/ENS/INRIA) 2, rue Simone Iff 75012 Paris, France

Editor: Sebastian Nowozin

We consider the submodular function minimization (SFM) and the quadratic minimization problems regularized by the Lov asz extension of the submodular function. These optimization problems are intimately related; for example, min-cut problems and total variation denoising problems, where the cut function is submodular and its Lov asz extension is given by the associated total variation. When a quadratic loss is regularized by the total variation of a cut function, it thus becomes a total variation denoising problem and we use the same terminology in this paper for general submodular functions. We propose a new active-set algorithm for total variation denoising with the assumption of an oracle that solves the corresponding SFM problem. This can be seen as local descent algorithm over ordered partitions with explicit convergence guarantees. It is more ﬂexible than the existing algorithms with the ability for warm-restarts using the solution of a closely related problem. Further, we also consider the case when a submodular function can be decomposed into the sum of two submodular functions F1 and F2 and assume SFM oracles for these two functions. We propose a new active-set algorithm for total variation denoising (and hence SFM by thresholding the solution at zero). This algorithm also performs local descent over ordered partitions and its ability to warm start considerably improves the performance of the algorithm. In the experiments, we compare the performance of the proposed algorithms with state-of-the-art algorithms, showing that it reduces the calls to SFM oracles.

Keywords: discrete optimization, submodular function minimization, convex optimization, cut functions, total variation denoising.

1. Introduction

Submodular optimization problems such as total variation denoising and submodular function minimization are convex optimization problems which are common in computer vision, signal processing and machine learning (Bach, 2013), with notable applications to graph cut-based image segmentation (Boykov et al., 2001), sensor placement (Krause and Guestrin, 2011), or document summarization (Lin and Bilmes, 2011).

Let F be a normalized submodular function deﬁned on V = {1, . . . , n}, i.e., F : 2V R such that F( ) = 0 and an n-dimensional real vector u, i.e., u Rn. In this paper, we consider the

c 2017 K. S. Sesh Kumar and Francis Bach.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/17-101.html.

KUMAR AND BACH

submodular function minimization (SFM) problem,

min A V F(A) u(A), (1)

where we use the convention u(A) = u 1A and 1A {0, 1}n is the indicator vector of the set A. Note that general submodular functions can always be decomposed into a normalized submodular function, F, i.e., F( ) = 0 and a modular function u (see Bach, 2013). Let f be the Lov asz extension of the submodular function F. Let us consider the following continuous optimization problem

min w [0,1]n f(w) u w. (2)

As a consequence of submodularity, the discrete and continuous optimization problems in Eq. 1 and Eq. 2 respectively have the same optimal solutions (Lov asz, 1982). Let us consider another related continuous optimization problem

min w Rn f(w) u w + 1

2 w 2 2. (3)

If F is a cut function in a weighted undirected graph, then f is its associated total variation, hence the denomination of total variation denoising (TV) problem for Eq. 3, which we use in this paper since it is equivalent to minimizing 1

2 u w 2 2 + f(w). The unique solution of the total variation denoising in Eq. 3 can be used to obtain the solution of the SFM problem in Eq. 1 by thresholding at 0. Conversely, we may obtain the optimal solution of the total variation denoising in Eq. 3 by solving a series of SFM problems using divide-and-conquer strategy. Relationship with existing work. Generic algorithms to optimize SFM in Eq. 1 or TV in Eq. 3 problems which only access F through function values, e.g., subgradient descent or min-normpoint algorithm (Fujishige, 1984), are too slow without any assumptions (Bach, 2013), as for signal processing applications, high precision is typically required (and often the exact solution). For decomposable problems, i.e., when F = F1 + +Fr, where each Fj is simple , some algorithms use more powerful oracles than function evaluations, improving the running times. These powerful oracles include SFM oracles that can solve the SFM problem of simple submodular function, Fj given by min A V Fj(A) uj(A), (4)

where uj Rn. The other set of powerful oracles are total variation or TV oracles, that solve TV problems of the form min w Rn fj(w) u j w + 1

2 w 2 2, (5)

where uj Rn. Note that, in general, the exact total variation oracles are at most O(n) times more expensive than their respective SFM oracle as they solve all SFM problems

min A V Fj(A) uj(A) + λ|A|, (6)

for all λ R, which have at most O(n) unique solutions. For more details refer to Fujishige (1980) and Bach (2013). Here, |A| denotes the cardinality of the set A. There does exist a subclass of submodular functions (cut functions and other submodular functions that can be written in form of

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

cuts) whose total variation oracles are only O(1) times more expensive than the corresponding SFM oracles but are still too expensive in practice. Stobbe and Krause (2010) used SFM oracles instead of function value oracles but their algorithm remains slow in practice. However, when total variation oracles for each Fj are used, they become competitive (Komodakis et al., 2011; Kumar et al., 2015; Jegelka et al., 2013). Therefore, our goal is to design fast optimization strategies using only efﬁcient SFM oracles for each function Fj rather than their expensive TV oracles (Kumar et al., 2015; Jegelka et al., 2013) to solve the SFM and TV denoising problems of F given by Eq. 1 and Eq. 3 respectively. An algorithm was proposed by Landrieu and Obozinski (2016) to search over partition space for solving Eq. 3 with the unary terms ( u w) replaced by a convex differentiable function but it applies only to functions F, which are cut functions. In this paper, we exploit the polytope structure of these non-smooth optimization problems with exponentially many constraints, i.e., 2n, where each face of the constraint set is indexed by an ordered partition of the underlying ground set V = {1, . . . , n}. The main insight of this paper is that given the main polytope associated with a submodular function (namely the base polytope described in Section 2) and an ordered partition, we may uniquely deﬁne a tangent cone of the polytope. Further, orthogonal projections onto the tangent cone may be done efﬁciently by isotonic regressions (Best and Chakravarti, 1990). The time needed is linear in the number of elements of the ordered partition used to deﬁne the tangent cone. We need SFM oracles only to check the optimality of the ordered partition. Given the orthogonal projection s onto the tangent cone, if the minimum of F(A) s(A) with respect to A V is positive then it is optimal. If it is not optimal, it gives us the violating constraints in the form of active-sets that enable us to generate a new ordered partition among the exponentially many ordered partitions. Contributions. We make two main contributions:

Given a submodular function F with an SFM oracle, we propose a new active-set algorithm for total variation denoising in Section 3, which is more efﬁcient and ﬂexible than existing ones. This algorithm may be seen as a local descent algorithm over ordered partitions. It has the additional advantage of allowing warm restarts, which will be beneﬁcial when we have to solve a large number of total variation denoising problems as shown in Section 5.

Given a decomposition of F = F1 + F2, with available SFM oracles for each Fj, we propose an active-set algorithm for total variation denoising in Section 4 (and hence for SFM by thresholding the solution at zero). These algorithms optimize over ordered partitions (one per function Fj). Following Jegelka et al. (2013) and Kumar et al. (2015), they are also naturally parallelizable. Given that only SFM oracles are needed, it is much more ﬂexible than the algorithms requiring a TV oracle, and allow more applications as shown in Section 5.

2. Review of Submodular Analysis

A set-function F : 2V R is submodular if F(A) + F(B) F(A B) + F(A B) for any subsets A, B of V . Our main motivating examples in this paper are cuts in a weighted undirected graph with weight function a : V V R+, which can be deﬁned as

i<j a(i, j)|1i A 1j / A|, (7)

KUMAR AND BACH

s1 + s2 = F({1, 2})

s1 + s3 = F({1, 3}) s2 + s3 = F({2, 3})

s2 = F({2}) s1 = F({1})

s3 = F({3})

({3}, {1, 2})

({2}, {1, 3}) ({1}, {2, 3})

({1, 3}, {2})

({1, 2}, {3})

({2, 3}, {1})

({3}, {2}, {1}) ({3}, {1}, {2})

({2}, {3}, {1})

({2}, {1}, {3})

({1}, {3}, {2})

({1}, {2}, {3})

Figure 1: Base polytope for n=3. (a) deﬁnition from its supporting hyperplanes {s(A) = F(A)}. (b) each face (point or segment) of B(F) is associated with an ordered partition.

where i, j V . Note that there are other submodular functions on which our algorithm works, e.g., concave function on the cardinality set, which cannot be represented in the form of Eq. 7 (Ladicky et al., 2010; Kolmogorov, 2012). However, we use cut functions as a running example to better explain our algorithm as they are most widely studied and understood among submodular functions. We now review the relevant concepts from submodular analysis (for more details, see Bach, 2013; Fujishige, 2005). Lov asz extension and convexity. The power set 2V is naturally identiﬁed with the vertices {0, 1}n of the hypercube in n dimensions (going from A V to 1A {0, 1}n). Thus, any set-function may be seen as a function f on {0, 1}n. It turns out that f may be extended to the full hypercube [0, 1]n by piecewise-linear interpolation, and then to the whole vector space Rn. Given a vector w Rn, and given its unique level-set representation as w = Pm i=1 vi1Ai, with (A1, . . . , Am) a partition of V and v1 > > vm, f(w) is equal to f(w) = Pm i=1 vi F(Bi) F(Bi 1) , where Bi = (A1 Ai). For cut functions, the Lov asz extension happens to be equal to the total variation, f(w) = P i<j a(i, j)|wi wj|, hence our denomination total variation denoising for the problem in Eq. 3. This extension is piecewise linear for any set-function F. It turns out that it is convex if and only if F is submodular (Lov asz, 1982). Any piecewise linear convex function may be represented as the support function of a certain polytope K, i.e., as f(w) = maxs K w s (Rockafellar, 1997). For the Lov asz extension of a submodular function, there is an explicit description of K, which we now review. Base polytope. We deﬁne the base polytope as

B(F) = s Rn, s(V ) = F(V ), A V, s(A) F(A) .

Given that it is included in the afﬁne hyperplane {s(V ) = F(V )}, it is traditionally represented by the projection on that hyperplane (see Figure 1 (a)). A key result in submodular analysis is that the Lov asz extension is the support function of B(F), that is, for any w Rn,

f(w) = sup s B(F) w s. (8)

The maximizers above may be computed in closed form from an ordered level-set representation of w using a greedy algorithm, which (a) ﬁrst sorts the elements of w in decreasing order such

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

that wσ(1) . . . wσ(n) where σ represents the order of the elements in V ; and (b) computes sσ(k) = F({σ(1), . . . , σ(k)}) F({σ(1), . . . , σ(k 1)}). SFM as a convex optimization problem. Another key result of submodular analysis is that minimizing a submodular function F (i.e., minimizing the Lov asz extension f on {0, 1}n), is equivalent to minimizing the Lov asz extension f on the full hypercube [0, 1]n (a convex optimization problem). Moreover, with convex duality we have

min A V F(A) u(A) = min w {0,1}n f(w) u w = min w [0,1]n f(w) u w

= min w [0,1]n max s B(F) s w u w

= max s B(F) min w [0,1]n s w u w = max s B(F)

i=1 min{si ui, 0}.

This dual problem allows to obtain certiﬁcates of optimality for the primal-dual pairs w [0, 1]n

and s B(F) using the quantity

gap(w, s) := f(w) u w

i=1 min{si ui, 0},

which is always non-negative. It is equal to zero only at optimality and the corresponding (w, s) form an optimal primal-dual pair. Total variation denoising as projection onto the base polytope. A consequence of the representation of f as a support function leads to the following primal/dual pair (Bach, 2013, Sec. 8):

min w Rn f(w) u w + 1

2 w 2 2 = min w Rn max s B(F) s w u w + 1

2 w 2 2 using Eq. 8,

= max s B(F) min w Rn s w u w + 1

= max s B(F) 1

2 s u 2 2, (9)

with w = u s at optimality. Thus the TV problem is equivalent to the orthogonal projection of u onto B(F). From TV denoising to SFM. The SFM problem in Eq. 1 and the TV problem in Eq. 3 are tightly connected. Indeed, given the unique solution w of the TV problem, we obtain a solution of min A V F(A) u(A) by thresholding w at 0, i.e., by taking A = {i V, wi 0} (Fujishige, 1980). Conversely, one may solve the TV problem by an appropriate sequence of SFM problems. The original divide-and-conquer algorithm may involve O(n) SFM problems (Groenevelt, 1991). The extended algorithm of Jegelka et al. (2013) can reach a precision ε in O(log 1

ε) iterations but can only get the exact solution in O(n) oracles. Fast efﬁcient algorithms are proposed to solve TV problems with O(1) oracles (Chambolle and Darbon, 2009; Goldfarb and Yin, 2009) but are speciﬁc to cut functions on simple graphs (chains and trees) as they exploit the weight representation given by Eq. 7. Our algorithm in Section 3 is a generalization of the divide-and-conquer strategy for solving the TV problem with general submodular functions.

KUMAR AND BACH

d B({2,3},{1})(F)

d B({2},{3},{1})(F)

Figure 2: Projection algorithm for a single polytope: ﬁrst projecting on the outer approximation b B({2,3},{1})(F), with a projected element which is not in B(F) (blue), then on b B({2},{3},{1})(F), with a projected element being the projection of s onto B(F) (red).

3. Ordered Partitions and Isotonic Regression

The main insight of this paper is (a) to consider the detailed face structure of the base polytope B(F) and (b) to notice that for the outer approximation of B(F) based on the tangent cone to a certain face, the orthogonal projection problem (which is equivalent to constrained TV denoising) may be solved efﬁciently using a simple algorithm, originally proposed to solve isotonic regression in linear time. This allows an explicit efﬁcient local search over ordered partitions.

3.1 Outer Approximations of B(F)

Supporting hyperplanes. The base polytope is deﬁned as the intersection of half-spaces {s(A) F(A)}, for A V . Therefore, faces of B(F) are indexed by subsets of the power set. As a consequence of submodularity (Bach, 2013; Fujishige, 2005), the faces of the base polytope B(F) are characterized by ordered partitions A = (A1, . . . , Am) with V = A1 Am. Then, a face of B(F) is such that s(Bi) = F(Bi) for all Bi = A1 Ai, i = 1, . . . , m. See Figure 1 (b) for the enumeration of faces for n = 3 based on an enumeration of all ordered partitions. Such ordered partitions are associated to vectors w = Pm i=1 vi1Ai with v1 > > vm with all solutions of maxs B(F) w s being on the corresponding face.

From a face of B(F) deﬁned by the ordered partition A, we may deﬁne its tangent cone b BA(F) at this face as the set

b BA(F) = n s Rn, s(V ) = F(V ), i {1, . . . , m 1}, s(Bi) F(Bi) o . (10)

Since we have relaxed all the constraints unrelated to A, these are outer approximations of B(F) as illustrated in Figure 2 for two ordered partitions. Support function. We may compute the support function of b BA(F), which is an upper bound on f(w) since this set is an outer approximation of B(F) as follows:

sup s b BA(F) w s = sup s Rn inf λ Rm 1 + R w s

i=1 λi(s(Bi) F(Bi)) , using Lagrangian duality,

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

= inf λ Rm 1 + R sup s Rn s w

i=1 (λi + + λm)1Ai

i=1 (λi + + λm) F(Bi) F(Bi 1) ,

= inf λ Rm 1 + R

i=1 (λi + + λm) F(Bi) F(Bi 1)

such that w =

i=1 (λi + + λm)1Ai.

Thus, by deﬁning vi = λi + + λm, which are decreasing, the support function is ﬁnite for w having ordered level sets corresponding to the ordered partition A (we then say that w is compatible with A). In other words, if w = Pm i=1 vi1Ai, the support functions is equal to the Lov asz extension f(w). Otherwise, when w is not compatible with A, the support function is inﬁnite. Let us now denote WA as a set of all weight vectors w that are compatible with the ordered partition A. This can be deﬁned as

WA = w Rn | v Rm, w =

i=1 vi1Ai, v1 . . . vm

sup s b BA(F) w s =

( f(w) if w WA, otherwise. (11)

3.2 Isotonic Regression for Restricted Problems

Given an ordered partition A = (A1, . . . , Am) of V , we consider the original TV problem restricted to w in WA. Since on this constraint set f(w) = Pm i=1 vi F(Bi) F(Bi 1) is a linear function, this is equivalent to

i=1 vi F(Bi) F(Bi 1) u(Ai) + 1

2 Pm i=1 |Ai|v2 i such that v1 vm. (12)

This may be done by isotonic regression in complexity O(m) by the weighted pool-adjacentviolator algorithm (Best and Chakravarti, 1990). Typically the solution v will have some values that are equal to each other, which corresponds to merging some sets Ai. If these merges are made, we now obtain a basic ordered partition1 such that our optimal w has strictly decreasing values. Because none of the constraints are tight, primal stationarity leads to explicit values of v given by vi = u(Ai)/|Ai| (F(Bi) F(Bi 1))/|Ai|, i.e., given A, the exact solution of the TV problem may be obtained in closed form. Dual interpretation. Eq. 12 is a constrained TV denoising problem that minimizes the cost function in Eq. 3 but with the constraint that weights are compatible with the ordered partition A,

1. Given a submodular function F and an ordered partition A, when the unique solution problem in Eq. 12 is such that v1 > > vm, we say that we A is a basic ordered partition for F u. Given any ordered partition, isotonic regression allows to compute a coarser partition (obtained by partially merging some sets) which is basic.

KUMAR AND BACH

i.e., minw WA f(w) u w + 1

2 w 2 2. The dual of the problem can be derived in exactly the same way as shown in Eq. 9 in the previous section, using the deﬁnition of the support function deﬁned by Eq. 11. The corresponding dual is given by maxs b BA(F) 1

2 s u 2 2, with the relationship w = u s at optimality. Thus, this corresponds to projecting u onto the outer approximation of the base polytope, b BA(F), which only has m constraints instead of the 2n 1 constraints deﬁning B(F). See an illustration in Figure 2.

3.3 Checking Optimality of a Basic Ordered Partition

Given a basic ordered partition A, the associated w Rn obtained from Eq. 12 is optimal for the TV problem in Eq. 3 if and only if s = u w B(F) due to optimality conditions in Eq. 9, which can be checked by minimizing the submodular function F s. For a basic partition, a more efﬁcient algorithm is available. By repeated application of submodularity, we have for all sets C V , if Ci = C Ai:

F(C) s(C) = F(V C)

i=1 s(Ci) (as s is a modular function),

i=1 s(Ci) +

i=1 F(Bi C) F(Bi C) (as Bm = V ),

i=1 F(Bi C) F(Bi 1 C) s(Ci) (let B0 = and as F( ) = 0),

i=1 F((Bi 1 Ai) C) F(Bi 1 C) s(Ci) (since Bi = Bi 1 Ai),

i=1 F((Bi 1 C) (Ai C)) F(Bi 1 C) s(Ci),

i=1 F((Bi 1 C) Ci) F(Bi 1 C) s(Ci),

F(Bi 1 Ci) F(Bi 1) s(Ci)

(as (Bi 1 C) Bi 1 and due to submodularity of F).

Moreover, we have s(Ai) = F(Bi) F(Bi 1), which implies s(Bi) = F(Bi) for all i {1, . . . , m}, and thus all subproblems min Ci Ai F(Bi 1 Ci) F(Bi 1) s(Ci) have non-positive values. This implies that we may check optimality by solving these m subproblems: s is optimal if and only if all of them have zero values. This leads to smaller subproblems whose overall complexity is less than a single SFM oracle call. Moreover, for cut functions, it may be solved by a single oracle call on a graph where some edges have been removed (Tarjan et al., 2006). Given all sets Ci, we may then deﬁne a new ordered partition by splitting all Ai for which F(Bi 1 Ci) F(Bi 1) s(Ci) < 0. If no split is possible, the pair (w, s) is optimal for Eq. 3. Otherwise, this new strictly ﬁner partition may not be basic, the value of the optimization problem

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

in Eq. 12 is strictly lower as shown in Section 3.5 (and leads to another basic ordered partition), which ensures the ﬁnite convergence of the algorithm.

3.4 Active-set Algorithm

This leads to the novel active-set algorithm below.

Data: Submodular function F with SFM oracle, u Rn, ordered partition A Result: primal optimal: w Rn and dual optimal: s B(F)

1 while True do

2 Solve Eq. 12 using isotonic regression and update A with the basic ordered partition ;

3 Check optimality by solving min Ci Ai F(Bi 1 Ci) F(Bi 1) s(Ci) for i {1, . . . , m} ;

4 if s is optimal then

7 for i {1, . . . , m}, split the set Ai into Ci and Ai \ Ci in that order to get an updated ordered partition A ;

Relationship with divide-and-conquer algorithm. When starting from the trivial ordered partition A = (V ), then we exactly obtain a parallel version of the divide-and-conquer algorithm (Groenevelt, 1991), that is, the isotonic regression problem is always solved without using the constraints of monotonicity, i.e., there are no merges, it is not necessary to re-solve the problems where nothing has changed. This shows that the number of iterations is then less than n. The key added beneﬁts in our formulation is the possibility of warm-starting, which can be very useful for building paths of solutions with different weights on the total variation. This is also useful for decomposable functions where many TV oracles are needed with close-by inputs. See experiments in Section 5.

3.5 Proof of Convergence

In order to prove the convergence of the algorithm in Section 3.4, we only need to show that if the optimality check fails in step (4), then step (7) introduces splits in the partition, which ensures that the isotonic regression in step (2) of the next iteration has a strictly lower value. Let us recall the isotonic regression problem solved in step (2):

vi F(Bi) F(Bi 1) u(Ai) + 1

such that v1 vm. (14)

Steps (2) ensures that the ordered partition A is a basic ordered partition warranting that the inequality constraints are strict, i.e., no two partitions have the same value vi and the values vi for each element of the partition i = {1, . . . , m} is given through

vi|Ai| = u(Ai) (F(Bi) F(Bi 1)), (15)

which can be calculated in closed form.

KUMAR AND BACH

The optimality check in step (4) decouples into checking the optimality in each subproblem as shown in Section 3.3. If the optimality test fails, then there is a subset of Ci of Ai for some of elements of the partition A such that

F(Bi 1 Ci) F(Bi 1) s(Ci) < 0. (16)

We will show that the splits introduced by step (7) strictly reduces the function value of isotonic regression in Eq. 13, while maintaining the feasibility of the problem. The splits modify the cost function of the isotonic regression as follows, as the objective function in Eq. 13 is equal to

vi F(Bi 1 Ci) F(Bi 1) u(Ci) + vi F(Bi) F(Bi 1 Ci) u(Ai \ Ci)

2v2 i |Ci| + 1

2v2 i |Ai \ Ci| . (17)

Let us assume a positive t R, which is small enough. The direction that the isotonic regression moves is vi + t for the partition corresponding to Ci and vi t for the partition corresponding to Ai \ Ci maintaining the feasibility of the isotonic regression problem, i.e., v1 vi + t > vi t vm . The function value is given by

(vi + t) F(Bi 1 Ci) F(Bi 1) u(Ci)

+(vi t) F(Bi) F(Bi 1 Ci) u(Ai \ Ci) + 1

2(vi + t)2|Ci| + 1

2(vi t)2|Ai \ Ci|

vi F(Bi 1 Ci) F(Bi 1) u(Ci) + vi F(Bi) F(Bi 1 Ci) u(Ai \ Ci)

2v2 i |Ci| + 1

2v2 i |Ai \ Ci| + 1

+t 2F(Bi 1 Ci) F(Bi 1) F(Bi) u(Ci) + u(Ai \ Ci) + vi|Ci| vi|Ai \ Ci| .

From this we can compute the directional derivative of the function at t = 0, which is given by

2F(Bi 1 Ci) F(Bi 1) F(Bi) u(Ci) + u(Ai \ Ci) + |Ci|vi |Ai \ Ci|vi = 2F(Bi 1 Ci) F(Bi 1) F(Bi) 2u(Ci) + u(Ai) + 2|Ci|vi |Ai|vi = 2 F(Bi 1 Ci) F(Bi 1) u(Ci) + vi|Ci| (substituting Eq. 15)

= 2 F(Bi 1 Ci) F(Bi 1) s(Ci) < 0 (as s = u w and Eq. 16).

This shows that the function strictly decreases with the splits introduced in step (7).

3.6 Discussion

Certiﬁcates of optimality. The algorithm has dual-infeasible iterates s (they only belong to B(F) at convergence). Suppose that after step (3) we have F(C) s(C) ε for all C V , where ε

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

shrinks as we run more iterations of the outer loop. This implies that s B(F + ε1Card (1,n)), i.e., s B(Fε) with Fε = F + ε1Card (1,n). Since by construction w = u s, we have:

fε(w) u w + 1

2 w 2 2 + 1

2 s u 2 2 = ε max j V wj min j V wj + f(w) u w + w 2

= ε max j V wj min j V wj

i=1 vi F(Bi) F(Bi 1) u(Ai) +

i=1 |Ai|v2 i

= ε max j V wj min j V wj ( using Eq. 15)

= ε range(w),

where range(w) = maxk V wk mink V wk. This means that w is approximately optimal for f(w) u w + 1

2 w 2 2 with certiﬁed gap less than ε range(w) + ε range(w ). Maximal range of an active-set solution. For any ordered partition A, and the optimal value of w (which we know in closed form), we have range(w) range(u)+maxi V F({i})+F(V \{i}) F(V ) . Indeed, for the u part of the expression, this is because values of w are averages of values of u; for the F part of the expression, we always have by submodularity:

F(Bi) F(Bi 1) X

k Ai F({k}) and

F(Bi) F(Bi 1) X

k Ai F(V ) F(V \{k}).

This means that the certiﬁcate can be used in practice by replacing range(w ) by its upperbound. See experimental evaluation for a 2D total variation denoising in Appendix A. Exact solution. If the submodular function only takes integer values and we have an approximate solution of the TV problem with gap ε 1 4n, then we have the optimal solution (Chakrabarty et al., 2014). Relationship with traditional active-set algorithm. Given an ordered partition A, an active-set method solves the unconstrained optimization problem in Eq. 12 to obtain a value of v using the primary stationary conditions. The corresponding primal value w = Pm i=1 vi1Ai and dual value s = u w are optimal, if and only if,

Primal feasibility : w WA, (18)

Dual feasibility : s B(F). (19)

If Eq. 18 is not satisﬁed, a move towards the optimal w is performed to ensure primal feasibility by performing line search, i.e., two consecutive sets Ai and Ai+1 with increasing values vi < vi+1 are merged and a potential w is computed until primal feasibility is met. Then dual feasibility is checked and potential splits are proposed. In our approach, we consider a different strategy which is more direct and does many merges simultaneously by using isotonic regression. Our method explicitly moves from ordered partitions to ordered partitions and computes an optimal vector w, which is always feasible.

KUMAR AND BACH

4. Decomposable Problems

Many interesting problems in signal processing and computer vision naturally involve submodular functions F that decompose into F = F1 + + Fr, with r simple submodular functions (Bach, 2013). For example, a cut function in a 2D grid decomposes into a function F1 composed of cuts along vertical lines and a function F2 composed of cuts along horizontal lines. For both of these functions, SFM oracles may be solved in O(n) by message passing. For simplicity, in this paper, we consider the case r = 2 functions, but following Komodakis et al. (2011) and Jegelka et al. (2013), our framework easily extends to r > 2.

4.1 Reformulation as the Distance Between Two Polytopes

Following Jegelka et al. (2013), we have the primal/dual problems :

min w Rn f1(w) + f2(w) u w + 1

2 w 2 2 = min w Rn max s1 B(F1), s2 B(F2) w (s1 + s2) u w + 1

= max s1 B(F1), s2 B(F2) min w Rn(s1 + s2 u) w + 1

= max s1 B(F1), s2 B(F2) 1

2 s1 + s2 u 2 2, (20)

with w = u s1 s2 at optimality. This is the projection of u on the sum of the base polytopes B(F1) + B(F2) = B(F). Further, this may be interpreted as ﬁnding the distance between two polytopes B(F1) u/2 and u/2 B(F2). Note that these two polytopes typically do not intersect (they will if and only if w = 0 is the optimal solution of the TV problem, which is an uninteresting situation). We now review Alternating projections (AP) (Jegelka et al., 2013), Averaged alternating reﬂections (AAR) (Jegelka et al., 2013) and Dykstra s alternating projections (DAP) (Chambolle and Pock, 2015) to show that a large number of total variation denoising problems need to be solved to obtain an optimal solution of Eq. 20. The ability to warm start and solve these total variation denoising using our algorithm in Section 3.4 can greatly improve the performance of each of these algorithms. Alternating projections (AP). The alternating projection algorithm (Bauschke et al., 1997) was proposed to solve the convex feasibility problem, i.e., to obtain a feasible point in the intersection of two polytopes. It is equivalent to performing block coordinate descent on the dual derived in Eq. 20. Let us denote the projection onto a polytope K as ΠK, i.e., ΠK(y) = argmaxx K x y 2 2. Therefore, alternating projections lead to the following updates for our problem.

zt = Πu/2 B(F2)ΠB(F1) u/2(zt 1),

where z0 is an arbitrary starting point. Thus each of these steps require TV oracles for F1 and F2 since projection onto the base polytope is equivalent to performing TV denoising as shown in Eq. 9. Averaged alternating reﬂections (AAR). The averaged alternating reﬂection algorithm (Bauschke et al., 2004), which is also known as Douglas-Rachford splitting can be used to solve convex feasibility problems. It is observed to converge quicker than alternating projection (Jegelka et al., 2013; Kumar et al., 2015) in practice. We now introduce a reﬂection operator for the polytope K as RK, i.e., RK = 2ΠK I, where I is the identity operator. Therefore, reﬂection of t on a polytope K is

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

given by RK(t) = 2ΠK(t) t. The updates of each iteration of the averaged alternating reﬂections, which starts with an auxiliary sequence z0 initialized to 0 vector, are given by

2(I + Ru/2 B(F2)RB(F1) u/2)(zt 1).

In the feasible case, i.e., intersecting polytopes, the sequence zt weakly converges to a point in the intersection of the polytopes. However, in our case, we have non intersecting polytopes which leads to a converging sequence of zt with AP but a diverging sequence of zt with AAR. However, when we project zt by using the projection operation, s1,t = ΠB(F1) u/2(zt); s2,t = Πu/2 B(F2)(s1,t) the sequences s1,t and s2,t converge to the nearest points on the polytopes, B(F1) u/2 and u/2 B(F2) (Bauschke et al., 2004) respectively. Dykstra s alternating projections (DAP). Dykstra s alternating projection algorithm (Bauschke and Borwein, 1994) retrieves a convex feasible point closest to an arbitrary point, which we assume to be 0. It can also be used and has a form of primal descent interpretation, i.e., as coordinate descent for a well-formulated primal problem (Gaffke and Mathar, 1989). Let us denote ιK as the indicator function of a convex set K. In our case we consider ﬁnding the nearest points on the polytopes B(F1) u/2 and u/2 B(F2) closest to 0, which can be formally written as:

min s B(F1) u

= min s Rn 1 2 s 2 2 + ιB(F1) u

2 (s) + ι u

= min s Rn 1 2 s 2 2 + ιB(F1) u

2 (s) + ιB(F2) u

2 s 2 2 + max w1 Rn w 1 s f1(w1) + w 1 u 2 + max w2 Rn w 2 s f2(w2) + w 2 u 2

= max w1 Rn w2 Rn

f1(w1) f2(w2) + (w1 + w2) u

2 + min s Rn

2 s 2 2 + (w1 w2) s

= max w1 Rn w2 Rn f1(w1) f2(w2) + (w1 + w2) u

2 w1 w2 2 2

= min w1 Rn w2 Rn f1(w1) + f2(w2) (w1 + w2) u

2 w1 w2 2 2,

where s = w2 w1 at optimality. The block coordinate descent algorithm then gives

w1,t = proxf1 u/2(w2,t 1) = (I ΠB(F1) u/2)(w2,t 1),

s1,t = w2,t 1 w1,t,

w2,t = proxf2 u/2(w1,t) = (I ΠB(F2) u/2)(w1,t),

s2,t = w1,t w2,t,

where I is the identity matrix. This is exactly the same as Dykstra s alternating projection steps.

KUMAR AND BACH

2 u 2 B(F2)

d BA1(F1) u

u 2 d BA2(F2)

Figure 3: Closest point between two polytopes. (a) Output of Dykstra s alternating projection algorithm for the TV problem, the pair (s1, s2) may not be unique while w = s1 + s2 u is. (b) Dykstra s alternating projection output for outer approximations.

We have implemented it, and it behaves similar to alternating projections, but it still requires TV oracles for projection (see experiments in Section 5). There is however a key difference: while alternating projections and alternating reﬂections always converge to a pair of closest points, Dykstra s alternating projection algorithm converges to a speciﬁc pair of points, namely the pair closest to the initialization of the algorithm (Bauschke and Borwein, 1994); see an illustration in Figure 3-(a). This insight will be key in our algorithm to avoid cycling. Assuming TV oracles are available for F1 and F2, Jegelka et al. (2013) and Kumar et al. (2015) use alternating projection (Bauschke et al., 1997) and alternating reﬂection (Bauschke et al., 2004) algorithms to solve dual optimization problem in Eq. 20 . Nishihara et al. (2014) gave a extensive theoretical analysis of alternating projection and showed that it converges linearly. However, these algorithms are equivalent to block dual coordinate descent and cannot be cast explicitly as descent algorithms for the primal TV problem. On the other hand, Dykstra s alternating projection is a descent algorithm on the primal, which enables local search over partitions. Complex TV oracles are often implemented by using SFM oracles recursively with the divide-and-conquer strategy on the individual functions. Using our algorithm in Section 3.4, they can be made more efﬁcient using warm-starts (see experiments in Section 5).

4.2 Local Search over Partitions using Active-set Method

Given our algorithm for a single function, it is natural to perform a local search over two partitions A1 and A2, one for each function F1 and F2, and consider in the primal formulation a weight vector w compatible with both A1 and A2; or, equivalently, in the dual formulation, two outer approximations b BA1(F1) and b BA2(F2). That is, given the ordered partitions A1 and A2, using a similar derivation as in Eq. 20, we obtain the primal/dual pairs of optimization problems

max s1 b BA1(F1) s2 b BA2(F2)

2 u s1 s2 2 2 = minw WA1 w WA2 f1(w) + f2(w) u w + 1

with w = u s1 s2 at optimality. Primal solution by isotonic regression. The primal solution w is unique by strong convexity. Moreover, it has to be compatible with both A1 and A2, which is equivalent to being compatible with the coalesced ordered partition A = coalesce(A1, A2) deﬁned as the coarsest ordered partition compatible by both. As shown in Appendix B, A may be found in time O(min(m1, m2)n).

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

Given A, the primal solution w of the subproblem may be found by isotonic regression like in Section 3.2 in time O(m) where m is the number of sets in A. However, ﬁnding the optimal dual variables s1 and s2 turns out to be more problematic. We know that s1 + s2 = u w and that s1 + s2 b BA(F), but the split of s1 + s2 into (s1, s2) is unknown. Obtaining dual solutions. Given ordered partitions A1 and A2, a unique well-deﬁned pair (s1, s2) could be obtained by using convex feasibility algorithms such as alternating projections (Bauschke et al., 1997) or alternating reﬂections (Bauschke et al., 2004). However, the result would depend in non understood ways on the initialization, and we have observed cycling of the activeset algorithm. Using Dykstra s alternating projection algorithm allows us to converge to a unique well-deﬁned pair (s1, s2) that will lead to a provably non-cycling algorithm. When running the Dykstra s alternating projection algorithm starting from 0 on the polytopes b BA1(F1) u/2 and u/2 b BA2(F2), if w is the unique distance vector between the two polytopes, then the iterates converge to the projection of 0 onto the convex sets of elements in the two polytopes that achieve the minimum distance (Bauschke and Borwein, 1994). See Figure 3-(b) for an illustration. This algorithm is however slow to converge when the polytopes do not intersect. Note that w = 0 in most of our situations and convergence is hard to monitor because primal iterates of the Dykstra s alternating projection diverge (Bauschke and Borwein, 1994). Translated intersecting polytopes. In our situation, we have more to work with than just the ordered partitions: we also know the vector w (as mentioned earlier, it is obtained cheaply from isotonic regression). Indeed, from Lemma 2.2 and Theorem 3.8 from Bauschke and Borwein (1994), given this vector w, we may translate the two polytopes and now obtain a formulation where the two polytopes do intersect; that is we aim at projecting 0 on the (non-empty) intersection of b BA1(F1) u/2+w/2 and u/2 w/2 b BA2(F2). See Figure 4. We also refer to this as the translated Dykstra problem2 in the rest of the paper. This is equivalent to solving the following optimization problem

min s b BA1(F1) u w

2 b BA2(F2)

1 2 s 2 2 (21)

= min s Rn 1 2 s 2 2 + ι b BA1(F1) u w

2 (s) + ι u w

2 b BA2(F2)(s),

= min s Rn 1 2 s 2 2 + ι b BA1(F1) u w

2 (s) + ι b BA2(F2) u w

1 2 s 2 2 + maxw1 WA1 w 1 s f1(w1) + w 1 (u w)

+ max w2 WA2 w 2 s f2(w2) + w 2 (u w)

= max w1 WA1 w2 WA2

f1(w1) f2(w2) + (w1 + w2) (u w)

2 + min s Rn 1 2 s 2 2 + (w1 w2) s ,

= max w1 WA1 w2 WA2

f1(w1) f2(w2) + (w1 + w2) (u w)

2 w1 w2 2 2

2. We refer to ﬁnding a Dykstra solution for translated intersecting polytopes as translated Dykstra problem.

KUMAR AND BACH

d BA1(F1) u

u 2 d BA2(F2) B(F1) u w

d BA1(F1) u w

2 d BA2(F2)

Figure 4: Translated intersecting polytopes. (a) output of our algorithm before translation. (b) Translated formulation.

= min w1 WA1 w2 WA2

f1(w1) + f2(w2) (w1 + w2) (u w)

2 w1 w2 2 2

with s = w2 w1 at optimality. In Section 4.4 we propose algorithms to solve the above optimization problems. Assuming that we are able to solve this step efﬁciently, we now present our active-set algorithm for decomposable functions below.

4.3 Active-set Algorithm for Decomposable Functions

Data: Submodular functions F1 and F2 with SFM oracles, u Rn, ordered partitions A1, A2 Result: primal optimal: w Rn and dual optimal: s1 B(F1), s2 B(F2)

1 while True do

2 A = coalesce(A1, A2) ;

3 Estimate w by solving minw WA f(w) u w + 1

2 w 2 2 using isotonic regression ;

4 Projection step: Estimate s1 b BA1(F1) and s2 b BA2(F2) by projecting 0 onto the intersection of b BA1(F1) u/2 + w/2 and u/2 w/2 b BA2(F2) using any of the algorithms described in Section 4.4 ;

5 Merge the sets in Aj which are tight for sj, j {1, 2};

6 Check optimality of s1 and s2 as described in Section 3.3;

7 if s1 and s2 are optimal then

10 for j {1, 2} and ij {1, . . . , mj}, split the set Aj,ij into Cj,ij and Aj,ij \ Cj,ij in that order to get an updated ordered partition Aj ;

Given two ordered partitions A1 and A2, we obtain s1 ˆBA1(F1) and s2 ˆBA2(F2) as described in the following section. The solution w = u s1 s2 is optimal if and only if both

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

s1 B(F1) and s2 B(F2). When checking the optimality described in Section 3.3, we split the partition. As shown in Appendix C, either (a) w 2 2 strictly increases at each iteration, or (b) w 2 2 remains constant but s1 s2 2 2 strictly increases. This implies that the algorithm is ﬁnitely convergent.

4.4 Optimizing the Translated Dykstra Problem

In this section, we describe algorithms for the projection step of the active-set algorithm proposed in Section 4.3 that optimizes the translated Dykstra problem in Eq. 22, i.e.,

min w1 WA1 w2 WA2

f1(w1) + f2(w2) (w1+w2) (u w)

2 w1 w2 2. (23)

The corresponding dual optimization problem is given by

min s b BA1(F1) u w

2 b BA2(F2)

1 2 s 2 2, (24)

with the optimality condition s = w2 w1. Note that the only link to submodularity is that f1 and f2 are linear functions on WA1 and WA2, respectively. The rest of this section primarily deals with optimizing a quadratic program and we present two algorithms in Section 4.4.1 and Section 4.4.2.

4.4.1 ACCELERATED DYKSTRA S ALGORITHM

In this section, we ﬁnd the projection of the origin onto the intersection of the translated base polytopes obtained by solving the optimization problem in Eq. 22 given by

min w1 WA1 w2 WA2

f1(w1) + f2(w2) (w1+w2) (u w)

using Dykstra s alternating projection. It can be solved using the following Dykstra s iterations:

s1,t = Π b BA1(F1)(u/2 w/2 + w2,t 1),

w1,t = u/2 w/2 + w2,t 1 s1,t,

s2,t = Π b BA2(F2)(u/2 w/2 + w1,t),

w2,t = u/2 w/2 + w1,t s2,t,

with ΠC denoting the orthogonal projection onto the sets C, solved here by isotonic regression. Note that the value of the auxiliary variable w2 can be warm-started. The algorithm converges linearly for polyhedral sets (Shusheng, 2000). In our simulations, we have used the recent accelerated version of Chambolle and Pock (2015), which led to faster convergence. In order to monitor convergence of the algorithm, we compute the value of u w s1,t s2,t 1, which is equal to zero at convergence. Note that the algorithm is not ﬁnitely convergent and gives only approximate solutions. Therefore, we introduce the approximation parameters ε1 and ε2 such that s1 lies in the ϵ1-neighborhood of the base polytope of F1, i.e., min A V F1(A) s1(A) ε1 and s2 lies in the ε2-neighborhood of the base polytope of F2,

KUMAR AND BACH

i.e., min A V F2(A) s2(A) ε2, respectively. See Appendix E for more details on the approximation. The optimization problem can also be decoupled into smaller optimization problems by using the knowledge of the face of the base polytopes on which s1 and s2 lie. This is still slow to converge in practice and therefore we present an active-set method in the next section.

4.4.2 PRIMAL ACTIVE-SET METHOD

In this section, we ﬁnd the projection of the origin onto the intersection of the translated base polytopes given by Eq. 22 using the standard active-set method (Nocedal and Wright, 2006) by solving a set of linear equations. For this purpose, we derive the equivalent optimization problems using equality constraints. The ordered partition, Aj is given by (Aj,1, . . . , Aj,mj), where mj is the number of elements in the ordered partitions. Let Bj,ij be deﬁned as (Aj,1 Aj,ij). Therefore,

Fj(Bj,ij) Fj(Bj,ij 1)

ij=1 vj,ij1Aj,ij (26)

with the constraints, vj,1 vj,mj. (27)

On substituting Eq. 25, Eq. 26 and Eq. 27 in Eq. 22, we have an equivalent optimization problem:

min v1,1 ... v1,m1 v2,1 ... v2,m2

F1(B1,i1) F1(B1,i1 1) u(A1,i1) w(A1,i1)

F2(B2,i2) F2(B2,i2 1) u(A2,i2) w(A2,i2)

1 2|A1,i1|v2 1,i1 +

1 2|A2,i2|v2 2,i2

i2=1 v1,i1v2,i21 A1,i11A2,i2.

This can be written as a quadratic program in x = v1 v2

with inequality constraints in the

following form

min x Rm1+m2 D(A1,A2)x 0

1 2x Q(A1, A2)x + c(A1, A2) x. (28)

Here, D(A1, A2) is a sparse matrix of size (m1 + m2 2) (m1 + m2), which is a block diagonal matrix containing the difference or ﬁrst order derivative matrices of sizes m1 1 m1 and m2 1 m2 as the blocks and c(A1, A2) is a linear vector that can be computed using the function evaluations of F1 and F2. Note that these evaluations need to be done only once.

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

Computing Q(A1, A2). Let us consider a bipartite graph, G = (A1, A2, E), with m1 + m2 nodes representing the ordered partitions of A1 and A2 respectively. The weight of the edge between each element of ordered partitions of A1, represented by A1,i1 and each element of ordered partitions of A2, represented by A2,i2 is the number of elements of the ground set V that lie in both these partitions and can be written as e(A1,i1, A2,i2) = 1 A1,i11A2,i2 for all e E. The matrix Q(A1, A2) represents the Laplacian matrix of the graph G. Figure 5 shows a sample bipartite graph with m1 = 6 and m2 = 5.

Initializing with primal feasible point. Primal active-set methods start with a primal feasible point and continue to maintain primal feasible iterates. In our case, the starting point may be obtained using the weight vector w that is estimated using isotonic regression. The vector w is compatible both with A1 and A2, i.e., w WA1 and w WA2. Therefore, we may obtain the vectors v1 and v2 from w and initialize the primal feasible starting point using v1 and v2.

Optimizing the quadratic program in Eq. 28 by using active-set methods is equivalent to ﬁnding the face of the constraint set on which the optimal solution lies. For this purpose, we need to be able to solve the quadratic program in Eq. 28 with equality constraints.

Equality constrained QP. Let us now consider the following quadratic program with equality constraints

min p Rm1+m2 D p=0

1 2p Q(A1, A2)p + Q(A1, A2)xk + c(A1, A2) p, (29)

where D is the subset of the constraints in D(A1, A2), i.e., indices of constraints that are tight and xk is a primal-feasible point. We refer the indices of the tight constraints as the working set and represent them by the set W in the algorithm. Therefore, the set of constraints in D is the restriction of the constraint set D(A1, A2) to the working set constraints denoted by W. The vector p gives the direction of strict descent of the cost function in Eq. 28 from feasible point xk (Nocedal and Wright, 2006).

Without loss of generality, let us assume that the equality constraints are vj,kj = vj,kj+1 for any kj in [0, mj). Let A j be the new ordered partition formed by merging Aj,kj and Aj,kj+1 as vj,kj = vj,kj+1. Similarly, x t can be computed from xt by merging the weights vj,kj and vj,kj+1 into a single weight for the merged element of the ordered partition. Finding the optimal vector p using the quadratic program in Eq. 29 with respect to the ordered partition A j is equivalent to solving the following unconstrained quadratic problem,

Q(A 1, A 2, x t) = min p Rm 1+m 2

2p Q(A 1, A 2)p + Q(A 1, A 2)x t + c(A 1, A 2) p , (30)

where m j is the number of elements of the ordered partition A j. This can be estimated by solving a linear system using conjugate gradient descent. The complexity of each iteration of the conjugate gradient is given by O((m 1 + m 2)k) where k is the number of non-zero elements in the sparse matrix, Q(A 1, A 2) (Vishnoi, 2013). We can build p from p by repeating the values for the elements of the partition that were merged.

KUMAR AND BACH

A1;1 A1;2 A1;3 A1;4 A1;5 A1;6

A2;1 A2;2 A2;3 A2;4 A2;5

Figure 5: Bipartite graph to compute Q(A1, A2) with A1 having m1 = 6 components and A2 having m2 = 5.

Primal active-set algorithm. We now can describe the standard primal active-set method.

Data: Laplacian matrix, Q(A1, A2) and vector, c(A1, A2), ordered partitions A1, A2 Result: x Rm1+m2

1 Initialize: t = 0;

2 Primal feasible x0 using the solution of isotonic regression w;

3 W0 with indices of rows of D(A1, A2) that are equal to 0;

4 while True do

5 Estimate primal: Solve equality constrained QP in Eq. 29 with equality constraints indexed by working set Wt to ﬁnd optimal p ;

6 if p == 0 then

7 Estimate dual: λ = D(A1, A2) Q(A1, A2)xt + c(A1, A2) ;

8 if λi 0 for all i Wt then

11 update j = argminj Wt λj ;

12 update Wt+1 = Wt \ {j};

13 update xt+1 = xt ;

16 Line search: Find least β that retains feasibility of xt+1 = xt + βp and ﬁnd the blocking constraints Bt ;

17 Wt+1 = Wt Bt ;

19 t = t + 1

21 return x = xt

We can estimate w1 and w2 from x , which will enable us to estimate s feasible in Eq. 24. Therefore we can estimate the dual variable s1 BA1(F1) and s2 BA2(F2) using s.

4.5 Decoupled Problem

In our context, the quadratic program in Eq. 28 can be decoupled into smaller optimization problems. Let us consider the bipartite graph G = (A1, A2, E) of which Q is the Laplacian matrix. The number of connected components of the graph, G, is equal to the number of level-sets of w.

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

Let m be the total number of connected components in G. These connected components deﬁne a partition on the ground set V and a total order on elements of the partition can be obtained using the levels sets of w. Let k denote the index of each bipartite subgraph of G represented by Gk = (A1,k, A2,k, Ek), where k = 1, 2, . . . , m. Let Jk denote the indices of the nodes of Gk in G.

x Jk = argmin x Rm1,k+m2,k D(A1,A2)kx 0

1 2x Q(A1, A2)Jk Jkx + c(A1, A2) Jkx, (31)

where mj,k is size of Aj,k. Therefore, m1,k +m2,k is the total number of nodes in the subgraph Gk. Note that this is exactly equivalent to decomposition of the base polytope of Fj into base polytopes of submodular functions formed by contracting Fj on each individual component representing the connected component k. See Appendix D for more details.

5. Experiments

In this section, we show the results of the algorithms proposed on various problems. We ﬁrst consider the problem of solving total variation denoising for a non decomposable function using active-set methods in Section 5.1, speciﬁcally cut functions. Here, our experiments mainly focus on the time comparisons with state-of-art methods and also show an important setting where we show the gain due to the ability to warm-start our algorithm. In Section 5.2, we consider cut functions on a 3D grid decomposed into a function of the 2D grid and a function of chains. We then consider a 2D grid and a concave function on cardinality, which is not a cut function. Our algorithm leads to marginal gains for the usual non decomposable functions. However, in the non decomposable case there are many total variation problems to be solved. The ability to warm-start lends to huge improvements when compared to the usage of standard total variation oracles in this setting.

5.1 Non-decomposable Total Variation Denoising

Our experiments consider images, which are 2-dimensional grids with vertex neighborhood of size 4. The data set comprises of 6 different images of varying sizes. We consider a large image of size 5616 3744 and recursively scale into a smaller image of half the width and half the height maintaining the aspect ratio. Therefore, the size of each image is four times smaller than the size of the previous image. We restrict to anisotropic uniform-weighted total variation to compare with Chambolle and Darbon (2009) but our algorithms work as well with weighted total variation, which is standard in computer vision, and on any graph with SFM oracles. Therefore, the unweighted total variation is f(w) = λ X

i j |wi wj|,

where λ is a regularizing constant for solving the total variation problem in Eq. 3. Maxﬂow (Boykov and Kolmogorov, 2004) is used as the SFM oracle for checking the optimality of the ordered partitions. Figure 6(a) shows the number of SFM oracle calls required to solve the TV problem for images of various sizes. Note that for the algorithm of Chambolle and Darbon (2009) each SFM oracle call optimizes smaller problems sequentially, while each SFM oracle call in our method optimizes several independent smaller problems in parallel. Therefore, our method has lesser number of oracle calls than Chambolle and Darbon (2009). However, oracle complexity of

KUMAR AND BACH

Number of Pixels

2e+04 8e+04 3e+05 1e+06 5e+06 2e+07

SFM Oracle Calls

Chambolle et al. Our

Number of pixels

2e+04 8e+04 3e+05 1e+06 5e+06 2e+07

log(Time in seconds)

Chambolle et al. Our

Decreasing lambda

# Oracle Calls

14 Normal Warm

Decreasing lambda

Avg Oracle Complexity

Normal Warm

Figure 6: (a) Number of SFM oracle calls for images of various sizes, (b) Time taken for images of various sizes, (c) Number of iterations with and without warm start, (d) Average complexity of the oracle with and without warm start.

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

each call is higher for the our method when compared to Chambolle and Darbon (2009). Figure 6(b) shows the time required for each of the methods to solve the TV problem to convergence. We have an optimized code and only use the oracle as plugin which takes about 80-85 percent of the running time. This is primarily the reason our approach takes more time than (Chambolle and Darbon, 2009) despite having fewer oracle calls for small images.

Figure 6(c) also shows the ability to warm start by using the output of a related problem, i.e., when computing the solution for several values of λ (which is typical in practice). In this case, we use optimal ordered partitions of the problem with larger λ to warm start the problem with smaller λ. It can be observed that warm start of the algorithm requires lesser number of oracle calls to converge than using the initialization with trivial ordered partition. Warm start also largely helps in reducing the burden on the SFM oracle. With warm starts the number of ordered partitions does not change much over iterations. Hence, it sufﬁces to query only ordered partitions that have changed. To analyze this we deﬁne oracle complexity as the fraction of pixels in the elements of the partitions that need to be queried. Oracle complexity is averaged over iterations to understand the average burden on the oracle per iteration. With warm starts this reduces drastically, which can be observed in Figure 6(d).

5.2 Decomposable Total Variation Denoising and SFM

Cut functions. In the decomposable case, we consider the SFM and TV problems on a cut function deﬁned on a 3D-grid. The 3D-grid consists of lines parallel lines in each dimension as shown in Figure 7. It can be decomposed into two functions F1 and F2, where F1 is composed on parallel 2D-grids and F2 is composed of parallel chains. From Figure 7, the function F1 represents all the solid edges(red and blue) whereas the function F2 represents the dashed edges(magenta). For brevity, we refer to each 2D-grid of the function F1 as a frame. The SFM oracle for the function F1 is the maxﬂow-mincut (Boykov and Kolmogorov, 2004) algorithm, which may run in parallel for all frames. Similarly the SFM oracle for the function F2 is the message passing algorithm, which may run in parallel for all chains. The corresponding TV oracles, i.e., projection algorithm for F1 and F2 may be solved using the algorithm described in Section 3.4 due to availability of the respective SFM oracles. We consider averaged alternating reﬂection (AAR) (Bauschke et al., 2004) by solving each projection without warm-start and counting the total number of SFM oracle calls of F1 to solve the SFM and TV on the 3D-grid as our baseline. (SGD-P) denotes the dual subgradient based method (Komodakis et al., 2011) modiﬁed with Polyak s rule (Poljak, 1987) to solve SFM on the 3D-grid. We show the performance of alternating projection (AP-WS), averaged alternating reﬂection (AAR-WS) (Bauschke et al., 2004) and Dykstra s alternating projection (DAPWS) (Bauschke and Borwein, 1994) using warm start of each projection with the ordered partitions. WS denotes warm start variant of each of the algorithm. The performance of the active-set algorithm proposed in Section 4.3 with inner loop solved using the primal active-set method proposed in Section 4.4.2 is represented by (ACTIVE).

In our experiments, we consider the 3D volumetric data set of the Stanford bunny (max) of size 102 100 79. The function F1 represents 102 frames while F2 represents the 7900 chains. The dimension of each frame in F1 is 100 79, while the length of each chain in F2 is 102. Figure 8 (a) and (b) show that (AP-WS), (AAR-WS), (DAP-WS) and (ACTIVE) require relatively less number of oracle calls when compared to when compared to AAR or SGD-P. Note that we count 2D SFM oracle calls as they are more expensive than the SFM oracles on chains.

KUMAR AND BACH

(a) (b) (c)

Figure 7: (a) Cut function F deﬁned on a 3D-grid may be decomposed into: (b) F1 represented by solid edges (red and blue) and (c) F2 represented by dashed lines (magenta)

# of Oracle calls 10 0 10 1 10 2 10 3 10 4

SFM - SFM opt

AAR-WS AAR AP-WS DAP-WS SGD-P ACTIVE

# of Oracle calls 10 0 10 2 10 4 10 6

TV - TV opt

AAR-WS AAR AP-WS DAP-WS ACTIVE

# of Oracle calls 10 0 10 1 10 2 10 3 10 4

SFM - SFM opt

AAR-WS AAR AP-WS DAP-WS SGD-P ACTIVE

# of Oracle calls 10 0 10 1 10 2 10 3 10 4 10 5

TV - TV opt

AAR-WS AAR AP-WS DAP-WS ACTIVE

Figure 8: (a) Number of 2D SFM calls to obtain 3D SFM, (b) Number of 2D SFM calls to obtain 3D TV, (c) Number of 2D SFM calls to obtain SFM of 2D + concave function, (d) Number of 2D SFM calls to obtain TV of 2D + concave function.

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

Time comparisons. We also performed time comparisons between the iterative methods and the combinatorial methods on standard data sets. The standard mincut-maxﬂow (Boykov and Kolmogorov, 2004) on the 3D volumetric data set of the Standard bunny (max) of size 102 100 79 takes 0.11 seconds while averaged alternating reﬂections (AAR) without warm start takes 0.38 seconds. The averaged alternating reﬂections with warm start (AAR-WS) takes 0.21 seconds and the active-set method (ACTIVE) takes 0.38 seconds. The main bottleneck in the active-set method is the inversion of the Laplacian matrix and it could considerably improve by using methods suggested by Vishnoi (2013). Note that the projection on the base polytopes of F1 and F2 can be parallelized by projecting onto each of the 2D frame of F1 and each line of F2 respectively (Kumar et al., 2015). The times for (AAR), (AAR-WS) and (ACTIVE) use parallel multi-core architectures3 while the combinatorial algorithm only uses a single core. Note that cut functions on grid structures are only a small subclass of submodular functions with such efﬁcient combinatorial algorithms. In contrast, our algorithm works on more general class of sum of submodular functions than just with cut functions.

Concave functions on cardinality. In this experiment we consider our SFM problem of sum of a 2D cut on a graph of size 5616 3744 and a super pixel based concave function on cardinality (Stobbe and Krause, 2010; Jegelka et al., 2013). The unary potentials of each pixel is calculated using the Gaussian mixture model of the color features. The edge weight a(i, j) = exp( yi yj 2), where yi denotes the RGB values of the pixel i. In order to evaluate the concave function, regions Rj are extracted via superpixels and, for each Rj, deﬁning the function F2(S) = |S||Rj \ S|. We use 200 and 500 regions. Figure 8 (c) and (d) shows that (AP-WS), (AAR-WS), (DAP-WS) and (ACTIVE) algorithms converge for solving TV quickly by using only SFM oracles and relatively less number of oracle calls. Note that we count 2D SFM oracle calls.

6. Conclusion

In this paper, we present an efﬁcient active-set algorithm for optimizing quadratic losses regularized by Lov asz extension of a submodular function using the SFM oracle of the function. We also present an active-set algorithms to minimize sum of simple submodular functions using SFM oracles of the individual simple functions. We also show that these algorithms are competitive to the existing state-of-art algorithms to minimize submodular functions.

Acknowledgments

We acknowledge support from the European Research Council grant SIERRA (project 239993). K. S. Sesh Kumar also acknowledges the support from the European Research Council grant of Prof. Vladimir Kolmogorov DOi CV(project 616160) at IST Austria. The comments of the reviewers have helped us improve the presentation signiﬁcantly.

3. 20 core, Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz with 100Gigabytes of memory. We only use up to 16 cores of the machine to ensure accurate timings.

KUMAR AND BACH

# of Iterations

0 2 4 6 8 10 12 14 16

Optimality gap Bound

Figure 9: Certiﬁed gap vs. bound for TV denoising on a 2D image using the algorithm in Section 3.4.

Appendix A. Certiﬁcates of Optimality

We consider a total variation denoising problem on an 2D image of dimensions 384 288 using the algorithm proposed in Section 3.4 for non-decomposable functions, where we assume a 2D SFM oracle. Here, we plot the optimality gap given by (f(w) u w + 1

2 w 2 2) (f(w ) u w + 1 2 w 2 2) and the bound, ε range(w)+ε range(w ), proposed in Section 3.6. Here, w is the solution at the end of each iteration of the algorithm and w is the optimal solution.

Appendix B. Algorithms for Coalescing Partitions

The basic interpretation in coalescing two ordered partitions is as follows. Given an ordered partition A1 and A2 with m1 and m2 elements in the partitions respectively, we deﬁne for each j = 1, 2, ij = (1, . . . , mj), Bj,ij = (Aj,1 . . . Aj,ij). The inequalities deﬁning the outer approximation of the base polytopes are given by hyperplanes deﬁned by

ij = (1, . . . , mj), sj(Bj,ij) Fj(Bj,ij).

The hyperplanes deﬁned by common sets of both these partitions, deﬁnes the coalesced ordered partitions. The following algorithm performs coalescing between these partitions.

Input: Ordered partitions A1 and A2.

Initialize: x = 1, y = 1, z = 1 and C = .

Algorithm: Iterate until x = m1 and y = m2

1. If |B1,x| > |B2,y| then y := y + 1.

2. If |B1,x| < |B2,y| then x := x + 1.

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

3. If B1,x == B2,y then

Az = (B1,x \ C), C = B1,x, and z := z + 1.

Output: m = z, ordered partitions A = (A1, . . . , Am).

Running time. The algorithm terminates in min(m1, m2) iterations and the checking condition for step (3) takes n iterations. Therefore, the algorithm overall takes a time of O(min(m1, m2)n).

Appendix C. Optimality of Algorithm for Decomposable Problems

In step (10) of the algorithms, when we split partitions, the value of the primal/dual pair of optimization algorithms

max s1 b BA1(F1) s2 b BA2(F2)

2 u s1 s2 2 2 = minw WA1 w WA2 f1(w) + f2(w) u w + 1

cannot increase. This is because, when splitting, the constraint set for the minimization problem only gets bigger. Since at optimality, we have w = u s1 s2, w 2 cannot decrease, which shows the ﬁrst statement. Now, if w 2 remains constant after an iteration, then it has to be the same (and not only have the same norm), because the optimal s1 and s2 can only move in the direction orthogonal to w. In step (4) of the algorithm, we project 0 on the (non-empty) intersection of b BA1(F1) u/2 + w/2 and u/2 w/2 b BA2(F2). This corresponds to minimizing 1

2 s1 u/2 + w/2 2 such that s1 b BA1(F1) and s2 = u w s1 b BA2(F2). This is equivalent to minimizing 1

8 s1 s2 2. We have:

max s1 b BA1(F1) s2 b BA2(F2) s1+s2=u w

8 s1 s2 2 2 = min w1 WA1 w2 WA2

max s1 Rn s2 Rn s1+s2=u w

8 s1 s2 2 2 + f1(w1) + f2(w2)

w 1 s1 w 2 s2

= min w1 WA1 w2 WA2

8 u w 2s2 2 2 + f1(w1) + f2(w2)

w 1 (u w s2) w 2 s2

= min w1 WA1 w2 WA2

8 u w 2 2 1

2 s2 2 2 + 1

+f1(w1) + f2(w2) w 1 (u w s2) w 2 s2

= min w1 WA1 w2 WA2

w 1 (u w) + f1(w1) + f2(w2) 1

+ max s2 Rn 1

2 s2 2 2 + s 2 u w

= min w1 WA1 w2 WA2

w 1 (u w) + f1(w1) + f2(w2) 1

KUMAR AND BACH

5 4 3 2 1 0 1 2 3 4 5 0

Total inner iterations

5 4 3 2 1 0 1 2 3 4 5 120

# Oracle Calls

50 100 150 0

Outer Iteration

# Inner Iterations

(a) (b) (c)

Figure 10: (a) Total number of inner iterations for varying α. (b) Total number of outer iterations for varying α. and (c) Number of inner iterations per each outer iteration for the α = 101.

2 + w1 w2 2 2

= min w1 WA1 w2 WA2

w 1 (u w) + f1(w1) + f2(w2) + 1

2 w1 w2 2 2

2(u w) (w1 w2)

= min w1 WA1 w2 WA2

f1(w1) + f2(w2) 1

2(u w) (w1 + w2)

2 w1 w2 2 2

Thus s1 and s2 are dual to certain vectors w1 and w2, which minimize a decoupled formulation in f1 and f2. To check optimality, like in the single function case, it decouples over the constant sets of w1 and w2, which is exactly what step (5) is performing. If the check is satisﬁed, it means that w1 and w2 are in fact optimal for the problem above without the restriction in compatibilities, which implies that they are the Dykstra solutions for the TV problem. If the check is not satisﬁed, then the same reasoning as for the one function case, leads directions of descent for the new primal problem above. Hence it decreases; since its value is equal to 1

8 s1 s2 2 2, the value of s1 s2 2 2 must increase, hence the second statement.

Appendix D. Decoupled Problems.

Given that we deal with polytopes, knowing w implies that we know the faces on which we have to looked for. It turns outs that for base polytopes, these faces are products of base polytopes for modiﬁed functions (a similar fact holds for their outer approximations). Given the ordered partition A deﬁned by the level sets of w (which have to be ﬁner than A1 and A2), we know that we may restrict b BAj(Fj) to elements s such that s(B) = F(B) for all sup-level sets B of w (which have to be unions of contiguous elements of Aj); see an illustration below.

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

Aj;1 Aj;3 Aj;2 Aj;4 Aj;5

More precisely, if C1, . . . , Cm are constant sets of w ordered with decreasing values. Then, we may search for sj independently for each subvector (sj)Ck RCk, k {1, . . . , m } and with the constraint that (sj)Ck ˆBAj Ck (Fj)Ck|C1 Ck 1 ,

where Aj Ck is the ordered partition obtained from Aj once restricted onto Ck and the submodular function is the so-called contraction of F on Ck given C1 Ck 1, deﬁned as S 7 Fj(S C1 Ck 1) F(C1 Ck 1). Thus this corresponds to solving m different smaller subproblems.

Appendix E. Approximate Dykstra Steps

The Dykstra step, i.e., step (4) of the algorithm proposed in Section 4.4.1 is not ﬁnitely convergent. Therefore, it needs to be solved approximately. For this purpose, we introduce a parameter α to approximately solve the Dykstra step such that s1 + s2 u + w 1 α(ϵ1 + ϵ2). Let ϵ be deﬁned as α(ϵ1 + ϵ2). This shows that the s1 and s2 are ϵ-accurate. Therefore, α must be chosen in such a way that we avoid cycling in our algorithm. However, another alternative is to warm start the Dykstra step with w1 and w2 of the previous iteration. This ensures we don t go back to the same w1 and w2, which we have already encountered and avoid cycling. Figure 10 shows the performance of our algorithm for a simple problem of 100 100 2D-grid with 4-neighborhood and uniform weights on the edges with varying α. Figure 10-(a) shows the total number of inner iterations required to solve the TV problem. Figure 10-(b) gives the total number of SFM oracle calls required to solve the TV problem. In Figure 10-(c), we show the number of inner iterations in every outer iteration for the best α we have encountered.

Maxﬂow dataset online. http://vision.csd.uwo.ca/maxflow-data.

F. Bach. Learning with Submodular Functions: A Convex Optimization Perspective, volume 6 of Foundations and Trends in Machine Learning. NOW, 2013.

H. H. Bauschke and J. M. Borwein. Dykstra s alternating projection algorithm for two sets. Journal of Approximation Theory, 79(3):418 443, 1994.

H. H. Bauschke, J. M. Borwein, and A. S. Lewis. The method of cyclic projections for closed convex sets in Hilbert space. Contemporary Mathematics, 204:1 38, 1997.

H. H. Bauschke, P. L. Combettes, and D. Luke. Finding best approximation pairs relative to two closed convex sets in Hilbert spaces. Journal of Approximation Theory, 127(2):178 192, 2004.

M. J. Best and N. Chakravarti. Active set algorithms for isotonic regression: a unifying framework. Mathematical Programming, 47(1):425 439, 1990.

KUMAR AND BACH

Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-ﬂow algorithms for energy minimization in vision. IEEE Transactions in Pattern Analysis and Machine Intelligence, 26(9):1124 1137, 2004.

Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions in Pattern Analysis and Machine Intelligence, 23(11):1222 1239, 2001.

D. Chakrabarty, P. Jain, and P. Kothari. Provable submodular minimization using Wolfe s algorithm. In Advances in Neural Information Processing Systems. 2014.

A. Chambolle and J. Darbon. On total variation minimization and surface evolution using parametric maximum ﬂows. International Journal of Computer Vision, 84(3):288 307, 2009.

A. Chambolle and T. Pock. A remark on accelerated block coordinate descent for computing the proximity operators of a sum of convex functions. Technical Report 01099182, HAL, 2015.

S. Fujishige. Lexicographically optimal base of a polymatroid with respect to a weight vector. Mathematics of Operations Research, 5(2):186 196, 1980.

S. Fujishige. Submodular systems and related topics. In Mathematical Programming at Oberwolfach II. 1984.

S. Fujishige. Submodular Functions and Optimization. Elsevier, 2005.

N. Gaffke and R. Mathar. A cyclic projection algorithm via duality. Metrika, 36(1):29 54, 1989.

D. Goldfarb and W. Yin. Parametric maximum ﬂow algorithms for fast total variation minimization. SIAM Journal on Scientiﬁc Computing, 31(5):3712 3743, 2009.

H. Groenevelt. Two algorithms for maximizing a separable concave function over a polymatroid feasible region. European Journal of Operational Research, 54(2):227 236, 1991.

S. Jegelka, F. Bach, and S. Sra. Reﬂection methods for user-friendly submodular optimization. In Advances in Neural Information Processing Systems, 2013.

V. Kolmogorov. Minimizing a sum of submodular functions. Discrete Applied Mathematics, 160 (15), 2012.

N. Komodakis, N. Paragios, and G. Tziritas. MRF energy minimization and beyond via dual decomposition. IEEE Transactions in Pattern Analysis and Machine Intelligence, 33(3):531 552, 2011.

A. Krause and C. Guestrin. Submodularity and its applications in optimized information gathering. ACM Transactions on Intelligent Systems and Technology, 2(4), 2011.

K. S. Sesh Kumar, A. Barbero, S. Jegelka, S. Sra, and F. Bach. Convex optimization for parallel energy minimization. Technical Report 01123492, HAL, 2015.

L. Ladicky, C. Russell, P.t Kohli, and P. H. S. Torr. Graph cut based inference with co-occurrence statistics. In Proceedings of the 11th European Conference on Computer Vision, 2010.

ACTIVE-SET METHODS FOR SUBMODULAR MINIMIZATION PROBLEMS

L. Landrieu and G. Obozinski. Cut Pursuit: fast algorithms to learn piecewise constant functions. In Proceedings of Artiﬁcial Intelligence and Statistics, 2016.

H. Lin and J. Bilmes. A class of submodular functions for document summarization. In Proceedings of NAACL/HLT, 2011.

L. Lov asz. Submodular functions and convexity. Mathematical programming: the state of the art, Bonn, pages 235 257, 1982.

R. Nishihara, S. Jegelka, and M. I. Jordan. On the convergence rate of decomposable submodular function minimization. In Advances in Neural Information Processing Systems 27, pages 640 648, 2014.

J. Nocedal and S. J. Wright. Numerical optimization. Springer Series in Operations Research and Financial Engineering. Springer, Berlin, 2006.

B. T. Poljak. Introduction to optimization. Optimization Software, 1987.

R. T. Rockafellar. Convex Analysis. Princeton U. P., 1997.

X. Shusheng. Estimation of the convergence rate of Dykstras cyclic projections algorithm in polyhedral case. Acta Mathematicae Applicatae Sinica (English Series), 16(2):217 220, 2000.

P. Stobbe and A. Krause. Efﬁcient minimization of decomposable submodular functions. In Advances in Neural Information Processing Systems, 2010.

R. Tarjan, J. Ward, B. Zhang, Y. Zhou, and J. Mao. Balancing applied to maximum network ﬂow problems. In European Symp. on Algorithms (ESA), pages 612 623, 2006.

N.K. Vishnoi. Lx = B - Laplacian Solvers and Their Algorithmic Applications. Now Publishers, 2013.