# Provable Submodular Minimization using Wolfe's Algorithm

Deeparnab Chakrabarty (Microsoft Research, 9 Lavelle Road, Bangalore 560001), Prateek Jain (Microsoft Research), Pravesh Kothari (University of Texas at Austin; part of this work was done while interning at Microsoft Research)

**Abstract.** Owing to several applications in large scale learning and vision problems, fast submodular function minimization (SFM) has become a critical problem. Theoretically, unconstrained SFM can be performed in polynomial time [10, 11]. However, these algorithms are typically not practical. In 1976, Wolfe [21] proposed an algorithm to find the minimum Euclidean norm point in a polytope, and in 1980, Fujishige [3] showed how Wolfe's algorithm can be used for SFM. For general submodular functions, this Fujishige-Wolfe minimum norm algorithm seems to have the best empirical performance. Despite its good practical performance, very little is known about Wolfe's minimum norm algorithm theoretically. To our knowledge, the only result is an exponential time analysis due to Wolfe [21] himself. In this paper we give a maiden convergence analysis of Wolfe's algorithm. We prove that in $t$ iterations, Wolfe's algorithm returns an $O(1/t)$-approximate solution to the min-norm point on any polytope. We also prove a robust version of Fujishige's theorem which shows that an $O(1/n^2)$-approximate solution to the min-norm point on the base polytope implies exact submodular minimization. As a corollary, we get the first pseudo-polynomial time guarantee for the Fujishige-Wolfe minimum norm algorithm for unconstrained submodular function minimization.

## 1 Introduction

An integer-valued¹ function $f : 2^X \to \mathbb{Z}$ defined over subsets of some finite ground set $X$ of $n$ elements is submodular if it satisfies the following diminishing marginal returns property: for every $S \subseteq T \subseteq X$ and $i \in X \setminus T$,
$$f(S \cup \{i\}) - f(S) \;\ge\; f(T \cup \{i\}) - f(T).$$
Submodularity arises naturally in several applications such as image segmentation [17], sensor placement [18], etc., where minimizing an arbitrary submodular function is an important primitive. In submodular function minimization (SFM), we assume access to an evaluation oracle for $f$ which, for any subset $S \subseteq X$, returns the value $f(S)$. We denote the time taken by the oracle to answer a single query as EO. The objective is to find a set $T \subseteq X$ satisfying $f(T) \le f(S)$ for every $S \subseteq X$.

¹ One can assume any function is integer-valued after suitable scaling.

In 1981, Grötschel, Lovász and Schrijver [8] demonstrated the first polynomial time algorithm for SFM using the ellipsoid algorithm. This algorithm, however, is practically infeasible due to its running time and the numerical issues in implementing the ellipsoid algorithm. In 2001, Schrijver [19] and Iwata et al. [9] independently designed combinatorial polynomial time algorithms for SFM. Currently, the best algorithm is by Iwata and Orlin [11] with a running time of $O(n^5 \cdot \text{EO} + n^6)$. However, from a practical standpoint, none of the provably polynomial time algorithms exhibits good performance on instances of SFM encountered in practice (see Section 4). This, along with the widespread applicability of SFM in machine learning, has inspired a large body of work on practically fast procedures (see [1] for a survey). But most of these procedures focus either on special submodular functions such as decomposable functions [16, 20] or on constrained SFM problems [13, 12, 15, 14].
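To make the oracle model concrete, here is a minimal Python sketch (our own illustration; the graph and helper names are not from the paper) of an evaluation oracle for the cut function of an undirected graph, a standard submodular function, together with a check of the diminishing-returns inequality on a toy instance.

```python
def cut_oracle(edges):
    """Evaluation oracle f(S) for the cut function of an undirected graph,
    given as a list of (u, v) edges. Cut functions are submodular."""
    def f(S):
        S = set(S)
        # f(S) counts edges with exactly one endpoint inside S.
        return sum(1 for u, v in edges if (u in S) != (v in S))
    return f

# Diminishing marginal returns on a 4-cycle: for S ⊆ T and i ∉ T,
# f(S ∪ {i}) - f(S) >= f(T ∪ {i}) - f(T).
f = cut_oracle([(0, 1), (1, 2), (2, 3), (3, 0)])
S, T, i = {0}, {0, 1}, 2
assert f(S | {i}) - f(S) >= f(T | {i}) - f(T)   # here: 2 >= 0
```

An SFM instance is then simply such an oracle together with $n$; the algorithms below touch $f$ only through calls of this form, which is exactly what the EO cost model measures.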
**Fujishige-Wolfe's Algorithm for SFM:** For any submodular function $f$, the base polytope $B_f$ of $f$ is defined as follows:
$$B_f = \{x \in \mathbb{R}^n : x(A) \le f(A)\ \forall A \subseteq X, \text{ and } x(X) = f(X)\}, \qquad (1)$$
where $x(A) := \sum_{i \in A} x_i$ and $x_i$ is the $i$-th coordinate of $x \in \mathbb{R}^n$. Fujishige [3] showed that if one can obtain the minimum norm point on the base polytope, then one can solve SFM. Finding the minimum norm point, however, is a non-trivial problem; at present, to our knowledge, the only polynomial time algorithm known is via the ellipsoid method. Wolfe [21] described an iterative procedure to find minimum norm points in polytopes as long as linear functions can be (efficiently) minimized over them. Although the base polytope has exponentially many constraints, a simple greedy algorithm can minimize any linear function over it. Therefore, using Wolfe's procedure on the base polytope coupled with Fujishige's theorem becomes a natural approach to SFM. This was suggested as early as 1984 in Fujishige [4] and is now called the Fujishige-Wolfe algorithm for SFM.

This approach towards SFM was revitalized in 2006 when Fujishige and Isotani [6, 7] announced encouraging computational results regarding the minimum norm point algorithm. In particular, this algorithm significantly outperformed all known provably polynomial time algorithms. Theoretically, however, little is known regarding the convergence of Wolfe's procedure except for the finite, but exponential, running time Wolfe himself proved. Nor is the situation any better for its application on the base polytope. Given the practical success, we believe this is an important, and intriguing, theoretical challenge.

In this work, we make some progress towards analyzing the Fujishige-Wolfe method for SFM and, in fact, Wolfe's algorithm in general. In particular, we prove the following two results:

- We prove (in Theorem 4) that for any polytope $B$, Wolfe's algorithm converges to an $\varepsilon$-approximate solution in $O(1/\varepsilon)$ steps. More precisely, in $O(nQ^2/\varepsilon)$ iterations, Wolfe's algorithm returns a point $x$ with $\|x\|_2^2 \le \|x^*\|_2^2 + \varepsilon$, where $Q = \max_{p \in B} \|p\|_2$.
- We prove (in Theorem 5) a robust version of a theorem by Fujishige [3] relating min-norm points on the base polytope to SFM. In particular, we prove that an approximate min-norm point solution provides an approximate solution to SFM as well. More precisely, if $x$ satisfies $\|x\|_2^2 \le z^\top x + \varepsilon^2$ for all $z \in B_f$, then $f(S_x) \le \min_S f(S) + 2n\varepsilon$, where $S_x$ can be constructed efficiently using $x$.

Together, these two results give us our main result, which is a pseudopolynomial bound on the running time of the Fujishige-Wolfe algorithm for submodular function minimization.

**Theorem 1 (Main Result).** Fix a submodular function $f : 2^X \to \mathbb{Z}$. The Fujishige-Wolfe algorithm returns the minimizer of $f$ in $O((n^5 \cdot \text{EO} + n^7)F^2)$ time, where $F := \max_{i=1}^{n} \max\big(|f(\{i\})|,\ |f([n]) - f([n] \setminus \{i\})|\big)$.

Our analysis suggests that the running time of the Fujishige-Wolfe algorithm depends on $F$ and has a worse dependence on $n$ than the Iwata-Orlin [11] algorithm. To verify this, we conducted an empirical study on several standard SFM problems. However, for the considered benchmark functions, the running time of the Fujishige-Wolfe algorithm seemed to be independent of $F$ and exhibited better dependence on $n$ than the Iwata-Orlin algorithm. This is described in Section 4.

## 2 Preliminaries: Submodular Functions and Wolfe's Algorithm

### 2.1 Submodular Functions and SFM

Given a ground set $X$ on $n$ elements, without loss of generality we think of it as the first $n$ integers $[n] := \{1, 2, \ldots, n\}$. Let $f$ be a submodular function.
Since submodularity is translation invariant, we assume $f(\emptyset) = 0$. For a submodular function $f$, we write $B_f \subseteq \mathbb{R}^n$ for the associated base polyhedron of $f$ defined in (1). Given $x \in \mathbb{R}^n$, one can find the minimum value of $q^\top x$ over $q \in B_f$ in $O(n \log n + n \cdot \text{EO})$ time using the following greedy algorithm: renumber indices such that $x_1 \le \cdots \le x_n$, and set $q^*_i = f([i]) - f([i-1])$. Then it can be proved that $q^* \in B_f$ and that $q^*$ is the minimizer of $x^\top q$ over $q \in B_f$.

The connection between the SFM problem and the base polytope was first established in the following minimax theorem of Edmonds [2].

**Theorem 2 (Edmonds [2]).** Given any submodular function $f$ with $f(\emptyset) = 0$, we have
$$\min_{S \subseteq [n]} f(S) = \max_{x \in B_f} \sum_{i=1}^{n} \min(x_i, 0).$$

The following theorem of Fujishige [3] shows the connection between finding the minimum norm point in the base polytope $B_f$ of a submodular function $f$ and the problem of SFM on input $f$. This forms the basis of Wolfe's algorithm. In Section 3.2, we prove a robust version of this theorem.

**Theorem 3 (Fujishige's Theorem [3]).** Let $f : 2^{[n]} \to \mathbb{Z}$ be a submodular function and let $B_f$ be the associated base polyhedron. Let $x^*$ be the optimal solution to $\min_{x \in B_f} \|x\|$. Define $S^* = \{i \mid x^*_i < 0\}$. Then $f(S^*) \le f(T)$ for every $T \subseteq [n]$.

### 2.2 Wolfe's Algorithm for the Minimum Norm Point of a Polytope

We now present Wolfe's algorithm for computing the minimum-norm point in an arbitrary polytope $B \subseteq \mathbb{R}^n$. We assume a linear optimization oracle (LO) which takes as input a vector $x \in \mathbb{R}^n$ and outputs a vector $q \in \arg\min_{p \in B} x^\top p$.

We start by recalling some definitions. The affine hull of a finite set $S \subseteq \mathbb{R}^n$ is $\mathrm{aff}(S) = \{y \mid y = \sum_{z \in S} \alpha_z z,\ \sum_{z \in S} \alpha_z = 1\}$. The affine minimizer of $S$ is defined as $y = \arg\min_{z \in \mathrm{aff}(S)} \|z\|_2$, and $y$ satisfies the following affine minimizer property: for any $v \in \mathrm{aff}(S)$, $v^\top y = \|y\|^2$. The procedure AffineMinimizer($S$) returns $(y, \alpha)$, where $y$ is the affine minimizer and $\alpha = (\alpha_s)_{s \in S}$ is the set of coefficients expressing $y$ as an affine combination of the points in $S$. This procedure can be naively implemented in $O(|S|^3 + n|S|^2)$ time as follows. Let $B$ be the $n \times |S|$ matrix where each column is a point in $S$. Then $\alpha = (B^\top B)^{-1}\mathbf{1} \big/ \big(\mathbf{1}^\top (B^\top B)^{-1}\mathbf{1}\big)$ and $y = B\alpha$.

**Algorithm 1: Wolfe's Algorithm**

1. Let $q$ be an arbitrary vertex of $B$. Initialize $x \leftarrow q$. We always maintain $x = \sum_{i \in S} \lambda_i q_i$ as a convex combination of a subset $S$ of vertices of $B$. Initialize $S = \{q\}$ and $\lambda_1 = 1$.
2. WHILE(true): (MAJOR CYCLE)
   (a) $q := \text{LO}(x)$. // Linear optimization: $q \in \arg\min_{p \in B} x^\top p$.
   (b) IF $\|x\|^2 \le x^\top q + \varepsilon^2$ THEN break. // Termination condition. Output $x$.
   (c) $S := S \cup \{q\}$.
   (d) WHILE(true): (MINOR CYCLE)
       i. $(y, \alpha) = \text{AffineMinimizer}(S)$. // $y = \arg\min_{z \in \mathrm{aff}(S)} \|z\|$.
       ii. IF $\alpha_i \ge 0$ for all $i$ THEN break. // If $y \in \mathrm{conv}(S)$, end the minor loop.
       iii. ELSE // If $y \notin \mathrm{conv}(S)$, update $x$ to the intersection of the boundary of $\mathrm{conv}(S)$ with the segment joining $y$ and the previous $x$. Delete points from $S$ which are not required to describe the new $x$ as a convex combination.
            $\theta := \min_{i: \alpha_i < 0} \lambda_i/(\lambda_i - \alpha_i)$. // Recall $x = \sum_i \lambda_i q_i$.
            Update $x \leftarrow \theta y + (1-\theta)x$. // By the definition of $\theta$, the new $x$ lies in $\mathrm{conv}(S)$.
            Update $\lambda_i \leftarrow \theta\alpha_i + (1-\theta)\lambda_i$. // This sets the coefficients of the new $x$.
            $S = \{i : \lambda_i > 0\}$. // Delete points which have $\lambda_i = 0$. This deletes at least one point.
   (e) Update $x \leftarrow y$. // After the minor loop terminates, $x$ is updated to the affine minimizer of the current set $S$.
3. RETURN $x$.

When $\varepsilon = 0$, the algorithm on termination (if it terminates) returns the minimum norm point in $B$, since $\|x\|^2 \le x^\top x^* \le \|x\| \|x^*\|$ implies $\|x\| \le \|x^*\|$.
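For concreteness, here is a compact Python transcription of Algorithm 1 together with the greedy linear optimization oracle over $B_f$ from Section 2.1. This is a minimal sketch under exact-arithmetic assumptions (the tolerances and the handling of degenerate linear systems are our choices, not the authors'), not a tuned implementation.

```python
import numpy as np

def greedy_lo(x, f):
    """Edmonds' greedy algorithm: argmin_{q in B_f} x.q.
    f is an evaluation oracle on Python sets with f(set()) == 0."""
    order = np.argsort(x)              # renumber so x_1 <= ... <= x_n
    q = np.zeros(len(x))
    prefix, prev = set(), 0.0
    for i in order:
        prefix.add(int(i))
        val = f(prefix)
        q[i] = val - prev              # q*_i = f([i]) - f([i-1])
        prev = val
    return q

def affine_minimizer(S):
    """Min-norm point y of aff(S), S a list of points, via the normal
    equations alpha = (B^T B)^{-1} 1 / (1^T (B^T B)^{-1} 1), y = B alpha."""
    B = np.array(S).T
    w = np.linalg.solve(B.T @ B, np.ones(B.shape[1]))  # assumes nonsingular
    alpha = w / w.sum()
    return B @ alpha, alpha

def wolfe_min_norm(f, n, eps, tol=1e-12):
    """Wolfe's algorithm on B_f: returns x with ||x||^2 <= x.q + eps^2
    for every q in B_f (the termination condition of Algorithm 1)."""
    x = greedy_lo(np.zeros(n), f)      # an arbitrary vertex of B_f
    S, lam = [x.copy()], [1.0]
    while True:                                        # MAJOR CYCLE
        q = greedy_lo(x, f)
        if x @ x <= x @ q + eps ** 2:                  # termination test
            return x
        S.append(q); lam.append(0.0)
        while True:                                    # MINOR CYCLE
            y, alpha = affine_minimizer(S)
            if alpha.min() >= -tol:                    # y in conv(S)
                break
            # Move x to the boundary of conv(S) along [x, y]; drop points.
            neg = [i for i, a in enumerate(alpha) if a < -tol]
            theta = min(lam[i] / (lam[i] - alpha[i]) for i in neg)
            lam = [theta * a + (1 - theta) * l for a, l in zip(alpha, lam)]
            x = theta * y + (1 - theta) * x
            keep = [i for i, l in enumerate(lam) if l > tol]
            S = [S[i] for i in keep]; lam = [lam[i] for i in keep]
        x, lam = y, list(alpha)                        # step 2(e)
```

Each major cycle costs one sort plus $n$ oracle calls for the LO step and an $O(|S|^3)$ solve for the affine minimizer, matching the per-iteration cost used in the proof of Theorem 1 below.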
For completeness, we sketch Wolfe's argument in [21] for finite termination. Note that $|S| \le n$ always; otherwise the affine minimizer is $0$, which either terminates the program or starts a minor cycle that decrements $|S|$. Thus the number of minor cycles in a major cycle is at most $n$, and it suffices to bound the number of major cycles. Each major cycle is associated with a set $S$ whose affine minimizer, which is the current $x$, lies in the convex hull of $S$. Wolfe calls such sets corrals. Next, we show that $\|x\|$ strictly decreases across iterations (major or minor cycles) of the algorithm, which proves that no corral repeats, thus bounding the number of major cycles by the number of corrals. The latter is at most $\binom{N}{n}$, where $N$ is the number of vertices of $B$.

Consider iteration $j$, which starts with $x_j$ and ends with $x_{j+1}$. Let $S_j$ be the set $S$ at the beginning of iteration $j$. If the iteration is a major cycle, then $x_{j+1}$ is the affine minimizer of $S_j \cup \{q_j\}$, where $q_j = \text{LO}(x_j)$. Since $x_j^\top q_j < \|x_j\|^2$ (the algorithm doesn't terminate in iteration $j$) and $x_{j+1}^\top q_j = \|x_{j+1}\|^2$ (affine minimizer property), we get $x_j \ne x_{j+1}$, and so $\|x_{j+1}\| < \|x_j\|$ (since the affine minimizer is unique). If the iteration is a minor cycle, then $x_{j+1} = \theta x_j + (1-\theta) y_j$, where $y_j$ is the affine minimizer of $S_j$ and $\theta < 1$. Since $\|y_j\| < \|x_j\|$ ($y_j \ne x_j$ since $y_j \notin \mathrm{conv}(S_j)$), we get $\|x_{j+1}\| < \|x_j\|$.

Our refined analysis of Wolfe's algorithm is encapsulated in the following theorem.

**Theorem 4.** Let $B$ be an arbitrary polytope such that the maximum Euclidean norm of any vertex of $B$ is at most $Q$. After $O(nQ^2/\varepsilon^2)$ iterations, Wolfe's algorithm returns a point $x \in B$ which satisfies $\|x\|^2 \le x^\top q + \varepsilon^2$ for all points $q \in B$. In particular, this implies $\|x\|^2 \le \|x^*\|^2 + 2\varepsilon^2$.

The above theorem shows that Wolfe's algorithm converges to the minimum norm point at a $1/t$ rate. We stress that the above holds for any polytope. To apply this to SFM, we prove the following robust version of Fujishige's theorem connecting the minimum norm point in the base polytope and the set minimizing the submodular function value.

**Theorem 5.** Fix a submodular function $f$ with base polytope $B_f$. Let $x \in B_f$ be such that $\|x\|^2 \le x^\top q + \varepsilon^2$ for all $q \in B_f$. Renumber indices such that $x_1 \le \cdots \le x_n$. Let $S = \{1, 2, \ldots, k\}$, where $k$ is the smallest index satisfying (C1) $x_{k+1} \ge 0$ and (C2) $x_{k+1} - x_k \ge \varepsilon/n$. Then $f(S) \le f(T) + 2n\varepsilon$ for any subset $T \subseteq [n]$. In particular, if $\varepsilon = \frac{1}{4n}$ and $f$ is integer-valued, then $S$ is a minimizer.

Theorem 4 and Theorem 5 imply our main theorem.

**Theorem 1 (Main Result).** Fix a submodular function $f : 2^X \to \mathbb{Z}$. The Fujishige-Wolfe algorithm returns the minimizer of $f$ in $O((n^5 \cdot \text{EO} + n^7)F^2)$ time, where $F := \max_{i=1}^{n} \max\big(|f(\{i\})|,\ |f([n]) - f([n] \setminus \{i\})|\big)$.

*Proof.* The vertices of $B_f$ are well understood: for every permutation $\sigma$ of $[n]$, we have a vertex with $x_{\sigma(i)} = f(\{\sigma(1), \ldots, \sigma(i)\}) - f(\{\sigma(1), \ldots, \sigma(i-1)\})$. By submodularity of $f$, we get $|x_i| \le F$ for all $i$. Therefore, for any point $x \in B_f$, $\|x\|^2 \le nF^2$. Choose $\varepsilon = 1/4n$. From Theorem 4 we know that if we run $O(n^4 F^2)$ iterations of Wolfe's algorithm, we will get a point $x \in B_f$ such that $\|x\|^2 \le x^\top q + \varepsilon^2$ for all $q \in B_f$. Theorem 5 implies this solves the SFM problem. The running time of each iteration is dominated by the subroutine computing the affine minimizer of $S$, which takes at most $O(n^3)$ time, and by the linear optimization oracle. For $B_f$, LO($x$) can be implemented in $O(n \log n + n \cdot \text{EO})$ time. This proves the theorem.

We prove Theorem 4 and Theorem 5 in Sections 3.1 and 3.2, respectively.
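The rounding step of Theorem 5 is equally short. The sketch below (our code; the sentinel handling of the boundary cases $k = 0$ and $k = n$ is our reading of the theorem statement) extracts the candidate set $S$ from an approximate min-norm point and, combined with the `wolfe_min_norm` sketch above at $\varepsilon = 1/(4n)$, gives the full Fujishige-Wolfe pipeline of Theorem 1. The example reuses `cut_oracle` from the introduction's sketch.

```python
import numpy as np

def extract_minimizer(x, eps):
    """Theorem 5 rounding: sort x ascending, return the k smallest indices,
    k the least index with (C1) x_{k+1} >= 0 and (C2) x_{k+1} - x_k >= eps/n.
    Sentinels x_0 = -inf, x_{n+1} = +inf make both conditions hold at the ends."""
    n = len(x)
    order = np.argsort(x)
    xs = np.concatenate(([-np.inf], np.sort(x), [np.inf]))
    for k in range(n + 1):
        if xs[k + 1] >= 0 and xs[k + 1] - xs[k] >= eps / n:
            return {int(i) for i in order[:k]}
    return {int(i) for i in order}      # unreachable given the sentinels

def minimize_submodular(f, n):
    """Fujishige-Wolfe SFM: eps = 1/(4n) suffices for integer-valued f."""
    eps = 1.0 / (4 * n)
    x = wolfe_min_norm(f, n, eps)       # sketch from Section 2.2
    return extract_minimizer(x, eps)

# Example: the 4-cycle cut function from the introduction has minimum 0,
# attained at the empty set (or the whole ground set).
f = cut_oracle([(0, 1), (1, 2), (2, 3), (3, 0)])
assert f(minimize_submodular(f, 4)) == 0
```

With $\varepsilon = 1/(4n)$, Theorem 5 guarantees $f(S) \le \min_T f(T) + 1/2$, so integrality of $f$ forces the returned set to be exactly optimal.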
### 3.1 Analysis of Wolfe's Min-norm Point Algorithm

The stumbling block in the analysis of Wolfe's algorithm is the interspersing of major and minor cycles, which makes the size of $S$ oscillate and prevents it from being a good measure of progress. Instead, in our analysis, we use the norm of $x$ as the measure of progress. We have already seen that $\|x\|$ strictly decreases. It would be nice to quantify how large the decrease is, say, across one major cycle. This, at present, is out of our reach even for major cycles which contain two or more minor cycles in them. However, we can prove a significant drop in norm across major cycles which have at most one minor cycle in them. We call such major cycles good. The next easy, but very useful, observation is that bad major cycles cannot occur too often without good major cycles in between.

**Lemma 1.** In any sequence of $3n + 1$ consecutive iterations, there exists at least one good major cycle.

*Proof.* Consider a run of $r$ consecutive iterations in which all major cycles are bad, and therefore contain at least 2 minor cycles each. Say there are $k$ major cycles and $r - k$ minor cycles; then $r - k \ge 2k$, implying $r \ge 3k$. Let $S_I$ be the set $S$ at the start of these iterations and $S_F$ the set at the end. Each major cycle adds one point to $S$ and each minor cycle removes at least one, so $|S_F| \le |S_I| + k - (r - k) = |S_I| + 2k - r \le n - r/3$. Therefore $r \le 3n$, since $|S_F| \ge 0$.

Before proceeding, we introduce some notation.

**Definition 1.** Given a point $x \in B$, denote $\mathrm{err}(x) := \|x\|^2 - \|x^*\|^2$. Given points $x$ and $q$, let $\Delta(x, q) := \|x\|^2 - x^\top q$, and let $\Delta(x) := \max_{q \in B} \Delta(x, q) = \|x\|^2 - \min_{q \in B} x^\top q$.

Observe that $\Delta(x) \ge \mathrm{err}(x)/2$, since $\Delta(x) \ge \|x\|^2 - x^\top x^* \ge (\|x\|^2 - \|x^*\|^2)/2$. We now use $t$ to index the good major cycles. Let $x_t$ be the point $x$ at the beginning of the $t$-th good major cycle. The next theorem shows that the norm drops significantly across good major cycles.

**Theorem 6.** For $t$ iterating over good major cycles, $\mathrm{err}(x_t) - \mathrm{err}(x_{t+1}) \ge \Delta^2(x_t)/8Q^2$.

We now complete the proof of Theorem 4 using Theorem 6.

*Proof of Theorem 4.* Using Theorem 6, we get $\mathrm{err}(x_t) - \mathrm{err}(x_{t+1}) \ge \mathrm{err}(x_t)^2/32Q^2$, since $\Delta(x) \ge \mathrm{err}(x)/2$ for all $x$. We claim that within $t^* \le 64Q^2/\varepsilon^2$ good major cycles, we reach $x_{t^*}$ with $\mathrm{err}(x_{t^*}) \le \varepsilon^2$. To see this, rewrite the recurrence as
$$\mathrm{err}(x_{t+1}) \le \mathrm{err}(x_t)\left(1 - \frac{\mathrm{err}(x_t)}{32Q^2}\right), \quad \text{for all } t.$$
Now let $e_0 := \mathrm{err}(x_0)$. Define $t_0, t_1, \ldots$ such that for all $k \ge 1$ we have $\mathrm{err}(x_t) > e_0/2^k$ for $t \in [t_{k-1}, t_k)$. That is, $t_k$ is the first time $t$ at which $\mathrm{err}(x_t) \le e_0/2^k$. Note that for $t \in [t_{k-1}, t_k)$, we have
$$\mathrm{err}(x_{t+1}) \le \mathrm{err}(x_t)\left(1 - \frac{e_0}{32Q^2 \cdot 2^k}\right).$$
This implies that within $32Q^2 \cdot 2^k/e_0$ time units after $t_{k-1}$, we have $\mathrm{err}(x_t) \le \mathrm{err}(x_{t_{k-1}})/2$; here we have used the fact that $(1-\delta)^{1/\delta} < 1/2$ when $\delta < 1/32$. That is, $t_k \le t_{k-1} + 32Q^2 \cdot 2^k/e_0$. We are interested in $t^* = t_K$ where $2^K = e_0/\varepsilon^2$. We get
$$t^* \le \frac{32Q^2}{e_0}\left(1 + 2 + \cdots + 2^K\right) \le \frac{64Q^2 \cdot 2^K}{e_0} = \frac{64Q^2}{\varepsilon^2}.$$
Next, we claim that within $t < t^* + t'$ good major cycles, where $t' = 8Q^2/\varepsilon^2$, we obtain an $x_t$ with $\Delta(x_t) \le \varepsilon^2$. If not, then by Theorem 6, in each of the good major cycles $t^* + 1, t^* + 2, \ldots, t^* + t'$, the quantity $\mathrm{err}(x)$ falls additively by more than $\varepsilon^4/8Q^2$, and thus $\mathrm{err}(x_{t^*+t'}) < \mathrm{err}(x_{t^*}) - \varepsilon^2 \le 0$, a contradiction. Therefore, within $O(Q^2/\varepsilon^2)$ good major cycles, the algorithm obtains an $x = x_t$ with $\Delta(x) \le \varepsilon^2$; by Lemma 1, this accounts for $O(nQ^2/\varepsilon^2)$ iterations in total, proving Theorem 4.

The rest of this subsection is dedicated to proving Theorem 6. We start off with a simple geometric lemma.

**Lemma 2.** Let $S$ be a finite subset of $\mathbb{R}^n$ and suppose $y$ is the minimum norm point of $\mathrm{aff}(S)$. Let $x$ and $q$ be arbitrary points in $\mathrm{aff}(S)$, and let $Q$ be an upper bound on $\|x\|$ and $\|q\|$. Then
$$\|x\|^2 - \|y\|^2 \ge \frac{\Delta^2(x, q)}{4Q^2}.$$

*Proof.* Since $y$ is the minimum norm point in $\mathrm{aff}(S)$, we have $x^\top y = q^\top y = \|y\|^2$. In particular, $\|x - y\|^2 = \|x\|^2 - \|y\|^2$. Therefore,
$$\Delta(x, q) = \|x\|^2 - x^\top q = \|x\|^2 - x^\top y + y^\top q - x^\top q = (y - x)^\top (q - x) \le \|y - x\|\,\|q - x\| \le \|y - x\|\left(\|x\| + \|q\|\right) \le 2Q\|y - x\|,$$
where the first inequality is Cauchy-Schwarz and the second is the triangle inequality. The lemma now follows by squaring the above expression and observing that $\|y - x\|^2 = \|x\|^2 - \|y\|^2$.
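Since Lemma 2 drives the entire analysis, a quick numerical sanity check may help. The sketch below (our illustration, with arbitrary generic points) verifies the affine minimizer property $v^\top y = \|y\|^2$ and the bound of Lemma 2 for two points in an affine hull.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(3, 5))             # 3 points spanning an affine set in R^5

# Affine minimizer of aff(pts) via the normal equations of Section 2.2:
# alpha = (B^T B)^{-1} 1 / (1^T (B^T B)^{-1} 1),  y = B alpha.
B = pts.T
w = np.linalg.solve(B.T @ B, np.ones(3))  # generic points: B^T B nonsingular
y = B @ (w / w.sum())

# Affine minimizer property: v . y = ||y||^2 for every v in aff(pts).
for v in pts:
    assert np.isclose(v @ y, y @ y)

# Lemma 2: for x, q in aff(pts) with norms at most Q,
# ||x||^2 - ||y||^2 >= Delta(x, q)^2 / (4 Q^2), Delta(x, q) = ||x||^2 - x.q.
coeffs = np.array([[0.6, 0.3, 0.1], [-0.5, 1.0, 0.5]])  # rows sum to 1
x, q = coeffs @ pts
Q = max(np.linalg.norm(x), np.linalg.norm(q))
delta = x @ x - x @ q
assert x @ x - y @ y >= delta ** 2 / (4 * Q ** 2) - 1e-9
```

Both assertions hold for any choice of affine coefficients, which is exactly what the proof above establishes.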
The above lemma takes care of major cycles with no minor cycles in them.

**Lemma 3 (Progress in a major cycle with no minor cycles).** Let $t$ be the index of a good major cycle with no minor cycles. Then $\mathrm{err}(x_t) - \mathrm{err}(x_{t+1}) \ge \Delta^2(x_t)/4Q^2$.

*Proof.* Let $S_t$ be the set $S$ at the start of the $t$-th good major cycle, and let $q_t$ be the point minimizing $x_t^\top q$. Let $S = S_t \cup \{q_t\}$ and let $y$ be the minimum norm point in $\mathrm{aff}(S)$. Since there are no minor cycles, $y \in \mathrm{conv}(S)$. Abusing notation, let $x_{t+1} = y$ be the iterate at the start of the next major cycle (not necessarily the next good major cycle); since the norm monotonically decreases, it suffices to prove the lemma statement for this $x_{t+1}$. Now apply Lemma 2 with $x = x_t$, $q = q_t$ and $S = S_t \cup \{q_t\}$. We get
$$\mathrm{err}(x_t) - \mathrm{err}(x_{t+1}) = \|x_t\|^2 - \|y\|^2 \ge \Delta^2(x_t, q_t)/4Q^2 = \Delta^2(x_t)/4Q^2.$$

Now we have to argue about major cycles with exactly one minor cycle. The next observation is a useful structural result.

**Lemma 4 (The new vertex survives a minor cycle).** Consider any (not necessarily good) major cycle. Let $x_t, S_t, q_t$ be the parameters at the beginning of this cycle, and let $x_{t+1}, S_{t+1}, q_{t+1}$ be the parameters at the beginning of the next major cycle. Then $q_t \in S_{t+1}$.

*Proof.* Clearly $S_{t+1} \subseteq S_t \cup \{q_t\}$, since $q_t$ is added and then minor cycles possibly remove some points from $S$. Suppose $q_t \notin S_{t+1}$. Then $S_{t+1} \subseteq S_t$. But $x_{t+1}$ is the affine minimizer of $S_{t+1}$ and $x_t$ is the affine minimizer of $S_t$. Since $S_t$ is the larger set, we get $\|x_t\| \le \|x_{t+1}\|$. This contradicts the strict decrease in the norm.

**Lemma 5 (Progress in an iteration with exactly one minor cycle).** Suppose the $t$-th good major cycle has exactly one minor cycle. Then $\mathrm{err}(x_t) - \mathrm{err}(x_{t+1}) \ge \Delta^2(x_t)/8Q^2$.

*Proof.* Let $x_t, S_t, q_t$ be the parameters at the beginning of the $t$-th good major cycle. Let $y$ be the affine minimizer of $S_t \cup \{q_t\}$. Since there is one minor cycle, $y \notin \mathrm{conv}(S_t \cup \{q_t\})$. Let $z = \theta x_t + (1-\theta)y$ be the intermediate $x$, that is, the point on the line segment $[x_t, y]$ which lies in $\mathrm{conv}(S_t \cup \{q_t\})$. Let $S'$ be the set after the single minor cycle is run. Since there is just one minor cycle, $x_{t+1}$ (abusing notation once again, since the next major cycle may not be good) is the affine minimizer of $S'$.

Let $A := \|x_t\|^2 - \|y\|^2$. From Lemma 2, and using the fact that $q_t$ minimizes $x_t^\top q$ over all $q$, we have
$$A = \|x_t\|^2 - \|y\|^2 \ge \Delta^2(x_t)/4Q^2. \qquad (3)$$
Recall $z = \theta x_t + (1-\theta)y$ for some $\theta \in [0, 1]$. Since $y$ is the min-norm point of $\mathrm{aff}(S_t \cup \{q_t\})$ and $x_t \in \mathrm{aff}(S_t \cup \{q_t\})$, we get $\|z\|^2 = \theta^2\|x_t\|^2 + (1-\theta^2)\|y\|^2$. This yields
$$\|x_t\|^2 - \|z\|^2 = (1-\theta^2)\left(\|x_t\|^2 - \|y\|^2\right) = (1-\theta^2)A. \qquad (4)$$
Further, recall that $S'$ is the set after the only minor cycle in the $t$-th iteration is run; thus, from Lemma 4, $q_t \in S'$, and $z \in \mathrm{conv}(S')$ by definition. Since there is only one minor cycle, $x_{t+1}$ is the affine minimizer of $S'$. We can therefore apply Lemma 2 to $z$, $q_t$ and $x_{t+1}$ to get
$$\|z\|^2 - \|x_{t+1}\|^2 \ge \frac{\Delta^2(z, q_t)}{4Q^2}. \qquad (5)$$
Now we lower bound $\Delta(z, q_t)$. By the definition of $z$, we have
$$z^\top q_t = \theta x_t^\top q_t + (1-\theta)\, y^\top q_t = \theta x_t^\top q_t + (1-\theta)\|y\|^2,$$
where the last equality follows since $y^\top q_t = \|y\|^2$ ($q_t \in S_t \cup \{q_t\}$ and $y$ is the affine minimizer of $S_t \cup \{q_t\}$). This gives
$$\Delta(z, q_t) = \|z\|^2 - z^\top q_t = \theta^2\|x_t\|^2 + (1-\theta^2)\|y\|^2 - \theta x_t^\top q_t - (1-\theta)\|y\|^2 = \theta\left(\|x_t\|^2 - x_t^\top q_t\right) - \theta(1-\theta)\left(\|x_t\|^2 - \|y\|^2\right) = \theta\left(\Delta(x_t) - (1-\theta)A\right). \qquad (6)$$
From (4), (5), and (6), we get
$$\mathrm{err}_t - \mathrm{err}_{t+1} \ge (1-\theta^2)A + \frac{\theta^2\left(\Delta(x_t) - (1-\theta)A\right)^2}{4Q^2}. \qquad (7)$$
We need to show that the RHS is at least $\Delta^2(x_t)/8Q^2$.
Intuitively, if $\theta$ is small (close to 0), the first term implies this via (3), and if $\theta$ is large (close to 1), the second term implies this. The following argument formalizes this intuition for any $\theta$. If $(1-\theta^2)A > \Delta^2(x_t)/8Q^2$, we are done. Therefore, assume $(1-\theta^2)A \le \Delta^2(x_t)/8Q^2$. In this case, using the fact that $\Delta(x_t) \le \|x_t\|^2 + \|x_t\|\|q_t\| \le 2Q^2$, we get
$$(1-\theta)A \le (1-\theta^2)A \le \frac{\Delta^2(x_t)}{8Q^2} \le \frac{\Delta(x_t)}{4}.$$
Substituting in (7), and using (3), we get
$$\mathrm{err}_t - \mathrm{err}_{t+1} \ge (1-\theta^2)\frac{\Delta^2(x_t)}{4Q^2} + \frac{9\theta^2\Delta^2(x_t)}{64Q^2} = \frac{(16 - 7\theta^2)\Delta^2(x_t)}{64Q^2} \ge \frac{\Delta^2(x_t)}{8Q^2}.$$
This completes the proof of the lemma.

Lemma 3 and Lemma 5 together complete the proof of Theorem 6.

### 3.2 A Robust Version of Fujishige's Theorem

In this section we prove Theorem 5, which we restate below.

**Theorem 5.** Fix a submodular function $f$ with base polytope $B_f$. Let $x \in B_f$ be such that $\|x\|^2 \le x^\top q + \varepsilon^2$ for all $q \in B_f$. Renumber indices such that $x_1 \le \cdots \le x_n$. Let $S = \{1, 2, \ldots, k\}$, where $k$ is the smallest index satisfying (C1) $x_{k+1} \ge 0$ and (C2) $x_{k+1} - x_k \ge \varepsilon/n$. Then $f(S) \le f(T) + 2n\varepsilon$ for any subset $T \subseteq [n]$. In particular, if $\varepsilon = \frac{1}{4n}$ and $f$ is integer-valued, then $S$ is a minimizer.

Before proving the theorem, note that setting $\varepsilon = 0$ recovers Fujishige's theorem (Theorem 3).

*Proof.* We claim that the following inequality holds; below, $[i] := \{1, \ldots, i\}$:
$$\sum_{i=1}^{n-1} (x_{i+1} - x_i)\left(f([i]) - x([i])\right) \le \varepsilon^2. \qquad (9)$$
We prove this shortly. Let $S$ and $k$ be as defined in the theorem statement. Note that $\sum_{i \in S: x_i \ge 0} x_i \le n\varepsilon$, since (C2) doesn't hold for any index $i < k$ with $x_{i+1} \ge 0$: every nonnegative coordinate in $S$ is reached from a negative one by increments smaller than $\varepsilon/n$, and is hence smaller than $\varepsilon$. Furthermore, since $x_{k+1} - x_k \ge \varepsilon/n$ and every term in the sum of (9) is nonnegative, we get from (9) that $f(S) - x(S) \le n\varepsilon$. Therefore, $f(S) \le \sum_{i \in S: x_i < 0} x_i + 2n\varepsilon$, which implies the theorem due to Theorem 2 (by (C1), $S$ contains every negative coordinate, so $\sum_{i \in S: x_i < 0} x_i = \sum_i \min(x_i, 0) \le \min_T f(T)$).

Now we prove (9). Let $z \in B_f$ be the point which minimizes $z^\top x$. By the greedy algorithm described in Section 2.1, we know that $z_i = f([i]) - f([i-1])$. Next, we write $x$ in a different basis as
$$x = \sum_{i=1}^{n-1}(x_i - x_{i+1})\mathbf{1}_{[i]} + x_n \mathbf{1}_{[n]},$$
where $\mathbf{1}_{[i]}$ is shorthand for the vector which has 1s in the first $i$ coordinates and 0s everywhere else. Taking the dot product with $(x - z)$, we get
$$\|x\|^2 - x^\top z = (x - z)^\top x = \sum_{i=1}^{n-1}(x_i - x_{i+1})\left(x^\top \mathbf{1}_{[i]} - z^\top \mathbf{1}_{[i]}\right) + x_n\left(x^\top \mathbf{1}_{[n]} - z^\top \mathbf{1}_{[n]}\right). \qquad (10)$$
Since $z_i = f([i]) - f([i-1])$, we get that $x^\top \mathbf{1}_{[i]} - z^\top \mathbf{1}_{[i]} = x([i]) - f([i])$; moreover, the last term vanishes since $x([n]) = z([n]) = f([n])$ for any two points of $B_f$. Therefore the RHS of (10) equals the LHS of (9). The LHS of (10), by the assumption of the theorem, is at most $\varepsilon^2$, implying (9).

## 4 Discussion and Conclusions

*Figure 1: Running time comparison of the Iwata-Orlin (IO) method [11] vs. Wolfe's method. (a): s-t mincut function. (b): Iwata's 3-groups function [16]. (c): Total number of iterations required by Wolfe's method for solving s-t mincut with increasing F.* [Figure omitted.]

We have shown that the Fujishige-Wolfe algorithm solves SFM in $O((n^5 \cdot \text{EO} + n^7)F^2)$ time, where $F$ is the maximum change in the value of the function upon addition or deletion of a single element. Although this is the first pseudopolynomial time analysis of the algorithm, we believe there is room for improvement, and we hope our work triggers more interest. Note that our analysis of the Fujishige-Wolfe algorithm is weaker than the best known method in terms of time complexity (the IO method of [11]) on two counts: (a) dependence on $n$, and (b) dependence on $F$. In contrast, we found this algorithm significantly outperforming the IO algorithm empirically; we show two such plots here. In Figure 1(a), we run both on Erdős-Rényi graphs with $p = 0.8$ and randomly chosen $s, t$ nodes. In Figure 1(b), we run both on the Iwata group functions [16] with 3 groups. Perhaps more interestingly, in Figure 1(c), we ran the Fujishige-Wolfe algorithm on a simple path graph where $s$ and $t$ were the endpoints, and changed the capacities on the edges of the graph, which changed the parameter $F$. As can be seen, the number of iterations of the algorithm remains constant even for exponentially increasing $F$.
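For reference, the benchmark of Figures 1(a) and 1(c) can be driven by an oracle of the following shape; this is our reconstruction of the s-t mincut submodular function (the capacities and the shift making $f(\emptyset) = 0$ are our choices, not the authors' code).

```python
def st_mincut_oracle(n, cap, s=0, t=None):
    """Submodular s-t cut function on the non-terminal vertices of an
    undirected graph: f(S) = capacity of edges leaving {s} ∪ S, shifted
    so that f(empty set) = 0. cap maps edges (u, v) to capacities."""
    t = n - 1 if t is None else t
    inner = [v for v in range(n) if v not in (s, t)]

    def cut(A):
        return sum(c for (u, v), c in cap.items() if (u in A) != (v in A))

    base = cut({s})
    def f(S):
        A = {s} | {inner[i] for i in S}   # S indexes non-terminal vertices
        return cut(A) - base              # translation: f(empty) = 0
    return f, len(inner)

# Path s - 1 - t as in Figure 1(c): scaling the capacities scales F.
f, m = st_mincut_oracle(3, {(0, 1): 1024, (1, 2): 1025})
S = minimize_submodular(f, m)             # pipeline from Section 3
```

Per Figure 1(c), rerunning this with geometrically increasing capacities should leave the iteration count of the Fujishige-Wolfe algorithm essentially unchanged even as $F$ grows.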
## References

[1] Francis Bach. Convex analysis and optimization with submodular functions: a tutorial. CoRR, abs/1010.4207, 2010.
[2] Jack Edmonds. Matroids, submodular functions and certain polyhedra. Combinatorial Structures and Their Applications, pages 69-87, 1970.
[3] Satoru Fujishige. Lexicographically optimal base of a polymatroid with respect to a weight vector. Math. Oper. Res., 5:186-196, 1980.
[4] Satoru Fujishige. Submodular systems and related topics. Math. Programming Study, 1984.
[5] Satoru Fujishige. Submodular Functions and Optimization. Elsevier, 2005.
[6] Satoru Fujishige, Takumi Hayashi, and Shigueo Isotani. The minimum-norm-point algorithm applied to submodular function minimization and linear programming. 2006.
[7] Satoru Fujishige and Shigueo Isotani. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization, 7:3, 2011.
[8] Martin Grötschel, László Lovász, and Alexander Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2):169-197, 1981.
[9] Satoru Iwata, Lisa Fleischer, and Satoru Fujishige. A combinatorial, strongly polynomial-time algorithm for minimizing submodular functions. In STOC, pages 97-106, 2000.
[10] Satoru Iwata, Lisa Fleischer, and Satoru Fujishige. A combinatorial strongly polynomial algorithm for minimizing submodular functions. J. ACM, 48(4):761-777, 2001.
[11] Satoru Iwata and James B. Orlin. A simple combinatorial algorithm for submodular function minimization. In SODA, pages 1230-1237, 2009.
[12] Rishabh Iyer, Stefanie Jegelka, and Jeff Bilmes. Curvature and optimal algorithms for learning and minimizing submodular functions. CoRR, abs/1311.2110, 2013.
[13] Rishabh Iyer, Stefanie Jegelka, and Jeff Bilmes. Fast semidifferential-based submodular function optimization. In ICML (3), pages 855-863, 2013.
[14] Rishabh K. Iyer and Jeff A. Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In NIPS, pages 2436-2444, 2013.
[15] Stefanie Jegelka, Francis Bach, and Suvrit Sra. Reflection methods for user-friendly submodular optimization. In NIPS, pages 1313-1321, 2013.
[16] Stefanie Jegelka, Hui Lin, and Jeff A. Bilmes. On fast approximate submodular minimization. In NIPS, pages 460-468, 2011.
[17] Pushmeet Kohli and Philip H. S. Torr. Dynamic graph cuts and their applications in computer vision. In Computer Vision: Detection, Recognition and Reconstruction, pages 51-108. 2010.
[18] Andreas Krause, Ajit Paul Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9:235-284, 2008.
[19] Alexander Schrijver. A combinatorial algorithm minimizing submodular functions in strongly polynomial time. J. Comb. Theory, Ser. B, 80(2):346-355, 2000.
[20] Peter Stobbe and Andreas Krause. Efficient minimization of decomposable submodular functions. In NIPS, pages 2208-2216, 2010.
[21] Philip Wolfe. Finding the nearest point in a polytope. Math. Programming, 11:128-149, 1976.