# Quadratic Decomposable Submodular Function Minimization

Pan Li (UIUC, panli2@illinois.edu), Niao He (UIUC, niaohe@illinois.edu), Olgica Milenkovic (UIUC, milenkov@illinois.edu)

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We introduce a new convex optimization problem, termed quadratic decomposable submodular function minimization (QDSFM). The problem is closely related to decomposable submodular function minimization and arises in many learning settings on graphs and hypergraphs, such as graph-based semi-supervised learning and PageRank. We approach the problem via a new dual strategy and describe an objective that may be optimized via random coordinate descent (RCD) methods and projections onto cones. We also establish the linear convergence rate of the RCD algorithm and develop efficient projection algorithms with provable performance guarantees. Numerical experiments in semi-supervised learning on hypergraphs confirm the efficiency of the proposed algorithm and demonstrate significant improvements in prediction accuracy with respect to state-of-the-art methods. (The code for QDSFM is available at https://github.com/lipan00123/QDSDM.)

## 1 Introduction

Given $[N] = \{1, 2, \ldots, N\}$, a submodular function $F: 2^{[N]} \to \mathbb{R}$ is a set function that for any $S_1, S_2 \subseteq [N]$ satisfies $F(S_1) + F(S_2) \ge F(S_1 \cup S_2) + F(S_1 \cap S_2)$. Submodular functions are ubiquitous in machine learning as they capture rich combinatorial properties of set functions and provide useful regularization functions for supervised and unsupervised learning [1]. Submodular functions also have continuous Lovász extensions [2], which establish solid connections between combinatorial and continuous optimization problems. Due to their versatility, submodular functions and their Lovász extensions are frequently used in applications such as learning on directed/undirected graphs and hypergraphs [3, 4], image denoising via total variation regularization [5, 6], and MAP inference in high-order Markov random fields [7].

In many optimization settings involving submodular functions, one encounters the convex program

$$\min_{x \in \mathbb{R}^N} \sum_{i \in [N]} (x_i - a_i)^2 + \sum_{r \in [R]} [f_r(x)]^p,$$

where $a \in \mathbb{R}^N$, $p \in \{1, 2\}$, and where, for all $r$ in some index set $[R]$, $f_r$ stands for the Lovász extension of a submodular function $F_r$ that describes a combinatorial structure over the set $[N]$. For example, in image denoising, each parameter $a_i$ may correspond to the observed value of a pixel $i$, while the functions $[f_r(x)]^p$ may be used to impose smoothness constraints on pixel neighborhoods. One of the main difficulties in solving this optimization problem comes from the nondifferentiability of the second term: a direct application of subgradient methods leads to convergence rates as slow as $1/\sqrt{k}$, where $k$ denotes the number of iterations [8].

In recent years, the above described optimization problem with $p = 1$ has received significant interest in the context of decomposable submodular function minimization (DSFM) [9]. The motivation for studying this particular setup is two-fold: first, solving the convex optimization problem directly recovers the combinatorial solution to the submodular min-cut problem $\min_{S \subseteq [N]} F(S)$, where $F(S) = \sum_{r \in [R]} F_r(S) - 2\sum_{i \in S} a_i$ [10]; second, minimizing a submodular function decomposed into a sum of simpler components $F_r$, $r \in [R]$, is much easier than minimizing an unrestricted submodular function $F$ over a large set $[N]$. There are several milestone results for the DSFM problem: Jegelka et al.
[11] first tackled the problem by considering its dual and proposed a solver based on Douglas-Rachford splitting. Nishihara et al. [12] established the linear convergence rate of alternating projection methods for solving the dual problem. Ene et al. [13, 14] presented linear convergence rates of coordinate descent methods and subsequently tightened the results via submodular flows. Li and Milenkovic [15] improved these methods by leveraging incidence relations between the arguments of the submodular function components.

Here, we focus on the other important case, $p = 2$; we refer to the underlying optimization problem as quadratic DSFM (QDSFM). QDSFM appears naturally in a wide spectrum of applications, including learning on graphs and hypergraphs, and in particular semi-supervised learning and PageRank. It has also been demonstrated both theoretically [16] and empirically [4, 17] that employing regularization with quadratic terms offers significantly improved predictive performance when compared to the case $p = 1$. Despite the importance of the QDSFM problem, its theoretical and algorithmic developments have not reached the same level of maturity as those for the DSFM problem. To the best of our knowledge, only a few reported works [17, 18] have provided solutions for specific instances of QDSFM, with sublinear convergence guarantees.

This work takes a substantial step towards solving the QDSFM problem in its most general form by developing a family of algorithms with linear convergence rate and small iteration cost, including the randomized coordinate descent (RCD) and alternating projection (AP) algorithms. Our contributions are as follows. First, we derive a new dual formulation for the QDSFM problem, since an analogue of the dual transformation for the DSFM problem is not applicable. Interestingly, the dual QDSFM problem requires one to find the best approximation of a hyperplane via a product cone, as opposed to a product polytope encountered in the dual DSFM problem. Second, we develop a linearly convergent RCD (and AP) algorithm for solving the dual QDSFM. Because of the special underlying conic structure, new analytic approaches are needed to prove the weak strong convexity of the dual QDSFM, which essentially guarantees linear convergence. Third, we develop generalized Frank-Wolfe and min-norm-point methods for efficiently computing the conic projection required in each step of RCD (and AP) and provide a $1/k$-rate convergence analysis. Finally, we evaluate our methods on semi-supervised learning over hypergraphs using synthetic and real datasets, and demonstrate superior performance both in convergence rate and prediction accuracy compared to existing methods. We postpone all detailed proofs and supplementary discussion to the full version of this paper.

## 2 Notation and Problem Formulation

For a submodular function $F$ defined over the ground set $[N]$, the Lovász extension is a convex function $f: \mathbb{R}^N \to \mathbb{R}$, defined for all $x \in \mathbb{R}^N$ according to

$$f(x) = \sum_{k=1}^{N-1} F(\{i_1, \ldots, i_k\})(x_{i_k} - x_{i_{k+1}}) + F([N])\, x_{i_N}, \qquad (1)$$

where $x_{i_1} \ge x_{i_2} \ge \cdots \ge x_{i_N}$. For a vector $x \in \mathbb{R}^N$ and a set $S \subseteq [N]$, let $x(S) = \sum_{i \in S} x_i$, where $x_i$ is the component of $x$ in the $i$-th dimension. Then, the base polytope of $F$, denoted by $B$, is defined as

$$B = \{y \in \mathbb{R}^N \mid y(S) \le F(S)\ \forall S \subseteq [N],\ y([N]) = F([N])\}. \qquad (2)$$

Using the base polytope, the Lovász extension can also be written as $f(x) = \max_{y \in B} \langle y, x \rangle$. We say that an element $i \in [N]$ is incident to $F$ if there exists an $S \subseteq [N]$ such that $F(S) \ne F(S \cup \{i\})$. Furthermore, we use $(x)_+$ to denote the function $\max\{x, 0\}$.
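As a concrete illustration of (1), a minimal Python sketch of the greedy evaluation is given below; the oracle interface, the helper name, and the toy hyperedge-cut instance are chosen for illustration only and are not part of the formal development.

```python
import numpy as np

def lovasz_extension(F, x):
    """Evaluate the Lovasz extension f(x) of a set function F (a black-box
    oracle on frozensets) via (1): sort the coordinates in decreasing order
    and accumulate the telescoping (greedy) weights."""
    order = np.argsort(-x)          # i_1, ..., i_N with x_{i_1} >= ... >= x_{i_N}
    f_val, prev, S = 0.0, 0.0, set()
    for i in order:
        S.add(int(i))
        cur = F(frozenset(S))       # F({i_1, ..., i_k})
        f_val += (cur - prev) * x[i]
        prev = cur
    return f_val

# Toy example: hyperedge-cut function F_r(S) = w_r if S splits S_r = {0, 1, 2}
w_r, S_r = 1.0, frozenset({0, 1, 2})
F_r = lambda S: w_r if 0 < len(S & S_r) < len(S_r) else 0.0
x = np.array([0.9, 0.1, -0.3, 0.5])
print(lovasz_extension(F_r, x))     # w_r * (max_{i in S_r} x_i - min_{i in S_r} x_i) = 1.2
```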
Given a positive diagonal matrix $W \in \mathbb{R}^{N \times N}$ and a vector $x \in \mathbb{R}^N$, we define the $W$-norm according to $\|x\|_W = \sqrt{\sum_{i=1}^N W_{ii} x_i^2}$, and simply write $\|\cdot\|$ when $W = I$, the identity matrix. For an index set $[R]$, we denote the $R$-fold product of $N$-dimensional Euclidean spaces by $\otimes_{r \in [R]} \mathbb{R}^N$. A vector $y \in \otimes_{r \in [R]} \mathbb{R}^N$ is written as $(y_1, y_2, \ldots, y_R)$, where $y_r \in \mathbb{R}^N$ for all $r \in [R]$. The $W$-norm induced on $\otimes_{r \in [R]} \mathbb{R}^N$ equals $\|y\|_{I(W)} = \sqrt{\sum_{r=1}^R \|y_r\|_W^2}$. We reserve the symbol $\rho$ for $\max_{y_r \in B_r, \forall r} \sqrt{\sum_{r \in [R]} \|y_r\|_1^2}$.

Next, we formally state the QDSFM problem. Consider a collection of submodular functions $\{F_r\}_{r \in [R]}$ defined over the ground set $[N]$, and denote their Lovász extensions and base polytopes by $\{f_r\}_{r \in [R]}$ and $\{B_r\}_{r \in [R]}$, respectively. We use $S_r \subseteq [N]$ to denote the set of variables incident to $F_r$ and make the further assumption that the functions $F_r$ are normalized and nonnegative, i.e., that $F_r(\emptyset) = 0$ and $F_r \ge 0$. These two mild constraints are satisfied by almost all submodular functions that arise in practical applications. We consider the following minimization problem:

$$\text{QDSFM:} \qquad \min_{x \in \mathbb{R}^N} \|x - a\|_W^2 + \sum_{r \in [R]} [f_r(x)]^2, \qquad (3)$$

where $a \in \mathbb{R}^N$ is a given vector and $W \in \mathbb{R}^{N \times N}$ is a positive diagonal matrix. As an immediate observation, the problem has a unique solution, denoted by $x^*$, due to the strong convexity of (3).

## 3 Applications

We start by reviewing some important machine learning problems that give rise to QDSFM.

Semi-supervised learning (SSL) is a learning paradigm that allows one to utilize the underlying structure or distribution of unlabeled samples whenever the information provided by labeled samples does not suffice for learning an inductive predictor [19, 20]. A standard setting for a $K$-class transductive learner is as follows: given $N$ data points $\{z_i\}_{i \in [N]}$ and labels for the first $l$ ($\ll N$) samples $\{y_i \mid y_i \in [K]\}_{i \in [l]}$, the learner is asked to infer the labels of all the remaining data points $i \in [N] \setminus [l]$. The widely used SSL formulation with least-squares loss requires one to solve $K$ regularization problems: for each class $k \in [K]$, set the scores of data points within the class to

$$\hat{x}^{(k)} = \arg\min_{x^{(k)}} \beta \|x^{(k)} - a^{(k)}\|^2 + \Omega(x^{(k)}),$$

where $a^{(k)}$ represents the information provided by the known labels, i.e., $a_i^{(k)} = 1$ if $y_i = k$ and $0$ otherwise, $\beta$ denotes a hyperparameter and $\Omega$ stands for a smoothness regularizer. The labels of the data points are inferred according to $\hat{y}_i = \arg\max_k \hat{x}_i^{(k)}$. For typical graph and hypergraph learning problems, $\Omega$ is often chosen to be a Laplacian regularizer constructed using $\{z_i\}_{i \in [N]}$ (see Table 1). In Laplacian regularization, each edge/hyperedge corresponds to one functional component in the QDSFM problem. Note that the variables may also be normalized with respect to their degrees, in which case the normalized Laplacian is used instead. For example, in graph learning, one of the terms in $\Omega$ assumes the form $w_{ij}(x_i/\sqrt{d_i} - x_j/\sqrt{d_j})^2$, where $d_i$ and $d_j$ correspond to the degrees of the vertex variables $i$ and $j$, respectively. It can be shown using some simple algebra that the normalization term reduces to the matrix $W$ used in the definition of the QDSFM problem (3).
| One component in $\Omega(x)$ | Description of the combinatorial structure | The submodular function |
|---|---|---|
| $w_r(x_i - x_j)^2$, $S_r = \{i, j\}$ | Graphs: nearest neighbors [4] / Gaussian similarity [21] | $F_r(S) = \sqrt{w_r}$ if $\lvert S \cap \{i, j\}\rvert = 1$ |
| $w_r \max_{i,j \in S_r} (x_i - x_j)^2$ | Hypergraphs: categorical features [17] | $F_r(S) = \sqrt{w_r}$ if $\lvert S \cap S_r\rvert \in [1, \lvert S_r\rvert - 1]$ |
| $w_r \max_{(i,j) \in H_r \times T_r} [(x_i - x_j)_+]^2$ | Directed hypergraphs: citation networks [18] | $F_r(S) = \sqrt{w_r}$ if $\lvert S \cap H_r\rvert \ge 1$, $\lvert ([N] \setminus S) \cap T_r\rvert \ge 1$ |
| General $[f_r(x)]^2$ | Submodular hypergraphs: mutual information [22, 23] | A symmetric submodular function |

Table 1: Laplacian regularization in semi-supervised learning. In the third column, whenever the stated conditions are not satisfied, it is assumed that $F_r = 0$. For directed hypergraphs, $H_r$ and $T_r$ are subsets of $S_r$ termed the head and the tail set. When $H_r = T_r = S_r$, one recovers the setting for undirected hypergraphs.

PageRank (PR) is a well-known method used for ranking Web pages [24]. Web pages are linked and naturally give rise to a graph $G = (V, E)$, where, without loss of generality, one may assume that $V = [N]$. Let $A$ and $D$ be the adjacency matrix and diagonal degree matrix of $G$, respectively. PR essentially finds a fixed point $p \in \mathbb{R}^N$ via the iterative procedure $p^{(t+1)} = (1 - \alpha)s + \alpha A D^{-1} p^{(t)}$, where $s \in \mathbb{R}^N$ is a fixed vector and $\alpha \in (0, 1]$. It is easy to verify that $p$ is a solution of the problem

$$\min_p\ \frac{1-\alpha}{\alpha} \|p - s\|_{D^{-1}}^2 + (D^{-1}p)^T (D - A)(D^{-1}p) \;=\; \min_x\ \|x - a\|_W^2 + \sum_{ij \in E} (x_i - x_j)^2, \qquad (4)$$

where $x = D^{-1}p$, $a = D^{-1}s$ and $W = \frac{1-\alpha}{\alpha} D$. Obviously, (4) may be viewed as a special instance of the QDSFM problem. Note that the PR iterations on graphs take the form

$$D^{-1/2}(p^{(t+1)} - p^{(t)}) = (1 - \alpha)\, D^{-1/2}(s - p^{(t)}) - \alpha\, \mathcal{L}(D^{-1/2} p^{(t)}),$$

where $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$ is the normalized Laplacian of the graph.

The PR problem for hypergraphs is significantly more involved and may be formulated using diffusion processes (DP) based on a normalized hypergraph Laplacian operator $\mathcal{L}$ [25]. The underlying PR procedure reads as $\frac{dx}{dt} = (1 - \alpha)(a - x) - \alpha \mathcal{L}(x)$, where $x(t) \in \mathbb{R}^N$ is the potential vector at time $t$. Tracking this DP precisely for every time point $t$ is a difficult task which requires solving a densest subset problem [25]. However, a stationary point of this process, i.e., a point $x$ that satisfies $(1 - \alpha)(a - x) - \alpha \mathcal{L}(x) = 0$, may be easily found by solving the optimization problem

$$\min_x\ (1 - \alpha)\|x - a\|^2 + \alpha \langle x, \mathcal{L}(x) \rangle.$$

The term $\langle x, \mathcal{L}(x) \rangle$ matches the normalized regularization term for hypergraphs listed in Table 1, i.e., $\sum_r \max_{i,j \in S_r} (x_i/\sqrt{d_i} - x_j/\sqrt{d_j})^2$. Clearly, once again this leads to the QDSFM problem. The PR equation for directed or submodular hypergraphs can be stated similarly using the Laplacian operators described in [26, 23, 27]. The PR algorithm defined in this manner has many advantages over the multilinear PR method based on higher-order Markov chains [28], since it allows for arbitrarily large orders and is guaranteed to converge for any $\alpha \in (0, 1]$. In the full version of this paper, we will provide a more detailed analysis of the above described PR method.
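The equivalence between the PR fixed point and problem (4) is easy to check numerically. The following minimal sketch, using a small hypothetical graph and seed vector (not taken from the paper), runs the PR iteration and verifies that its limit satisfies the first-order optimality condition of (4).

```python
import numpy as np

# Toy undirected graph: adjacency A, degree matrix D, seed vector s (all illustrative)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
s = np.array([1.0, 0.0, 0.0, 0.0])
alpha = 0.85

# PR iteration: p^{(t+1)} = (1 - alpha) s + alpha A D^{-1} p^{(t)}
p = np.full(4, 0.25)
for _ in range(300):
    p = (1 - alpha) * s + alpha * A @ np.linalg.solve(D, p)

# With x = D^{-1} p, a = D^{-1} s, W = (1-alpha)/alpha * D, the limit should make
# the gradient 2 W (x - a) + 2 (D - A) x of the objective in (4) vanish.
x, a = np.linalg.solve(D, p), np.linalg.solve(D, s)
W = (1 - alpha) / alpha * D
grad = 2 * W @ (x - a) + 2 * (D - A) @ x
print(np.allclose(grad, 0, atol=1e-10))   # True
```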
## 4 Algorithms for Solving the QDSFM Problem

We next describe the first known linearly convergent algorithms for solving the QDSFM problem. To start with, observe that the QDSFM problem is convex, since the Lovász extensions $f_r$ are convex and nonnegative, but the objective is in general nondifferentiable. To address this issue, we consider the dual of the QDSFM problem. A natural idea is to try to mimic the approach used for DSFM by invoking the characterization of the Lovász extension, $f_r(x) = \max_{y_r \in B_r} \langle y_r, x \rangle$, $\forall r$. However, this leads to a semidefinite programming problem for the dual variables $\{y_r\}_{r \in [R]}$, which is complex to solve for large problems. Instead, we establish a new dual formulation that overcomes this obstacle. The dual formulation hinges upon the following key observation:

$$[f_r(x)]^2 = \max_{\phi_r \ge 0}\ \phi_r f_r(x) - \frac{\phi_r^2}{4} = \max_{\phi_r \ge 0}\ \max_{y_r \in \phi_r B_r}\ \langle y_r, x \rangle - \frac{\phi_r^2}{4}. \qquad (5)$$

Let $y = (y_1, y_2, \ldots, y_R)$ and $\phi = (\phi_1, \phi_2, \ldots, \phi_R)$. Using equation (5), we arrive at

Lemma 4.1. The following optimization problem is dual to (3):

$$\min_{y, \phi}\ g(y, \phi) := \Big\| \sum_{r \in [R]} y_r - 2Wa \Big\|_{W^{-1}}^2 + \sum_{r \in [R]} \phi_r^2, \quad \text{s.t. } y \in \otimes_{r \in [R]} \phi_r B_r,\ \phi \in \otimes_{r \in [R]} \mathbb{R}_{\ge 0}. \qquad (6)$$

By introducing $\Lambda = (\lambda_r) \in \otimes_{r \in [R]} \mathbb{R}^N$, the previous optimization problem can be rewritten as

$$\min_{y, \phi, \Lambda}\ \sum_{r \in [R]} \Big( R \|y_r - \lambda_r\|_{W^{-1}}^2 + \phi_r^2 \Big), \quad \text{s.t. } y \in \otimes_{r \in [R]} \phi_r B_r,\ \phi \in \otimes_{r \in [R]} \mathbb{R}_{\ge 0},\ \sum_{r \in [R]} \lambda_r = 2Wa. \qquad (7)$$

The primal variables in both cases are recovered via $x = a - \frac{1}{2} W^{-1} \sum_{r \in [R]} y_r$.

Counterparts of the above results for the DSFM problem were discussed in Lemma 2 of [11]. However, there is a significant difference between [11] and the QDSFM problem, since in the latter setting we use a conic set constructed from the base polytopes of the submodular functions. More precisely, for each $r$, we define a convex cone $C_r = \{(y_r, \phi_r) \mid \phi_r \ge 0,\ y_r \in \phi_r B_r\}$, which gives the feasible set of the dual variables $(y_r, \phi_r)$. The optimization problem (7) essentially asks one to find the best approximation of an affine space in terms of a product cone $\otimes_{r \in [R]} C_r$, as opposed to a product polytope encountered in DSFM.

Several algorithms have been developed for solving the DSFM problem, including the Douglas-Rachford splitting method (DR) [11], the alternating projection method (AP) [12] and the random coordinate descent method (RCD) [13]. Similarly, for QDSFM, we propose to solve the dual problem (6) using the RCD method, exploiting the separable structure of the feasible set, and to solve (7) using the AP method. Although these DSFM methods carry over to QDSFM, a novel analysis handling the conic structure is required, which takes up the rest of this section and the next. Due to the page limitation, the analysis of the AP method is deferred to the full version of this paper. It is also worth mentioning that the results of this work can be easily extended to the DR method, as well as to accelerated and parallel variants of the RCD method [13, 15].

RCD Algorithm. Define the projection $\Pi$ onto a convex cone $C_r$ as follows: for a given point $b \in \mathbb{R}^N$, let $\Pi_{C_r}(b) = \arg\min_{(y_r, \phi_r) \in C_r} \|y_r - b\|_{W^{-1}}^2 + \phi_r^2$. For each coordinate $r$, optimizing over the dual variables $(y_r, \phi_r)$ is equivalent to computing a projection onto the cone $C_r$. This gives rise to the RCD method summarized in Algorithm 1.

Algorithm 1: RCD Solver for (6)
0. For all $r$, initialize $y_r^{(0)} \leftarrow 0$, $\phi_r^{(0)} \leftarrow 0$, and set $k \leftarrow 0$.
1. In iteration $k$:
2. Uniformly at random pick an $r \in [R]$.
3. $(y_r^{(k+1)}, \phi_r^{(k+1)}) \leftarrow \Pi_{C_r}\big(2Wa - \sum_{r' \ne r} y_{r'}^{(k)}\big)$
4. Set $y_{r'}^{(k+1)} \leftarrow y_{r'}^{(k)}$ for $r' \ne r$.

In Section 5, we describe efficient methods to compute the projections; throughout the remainder of this section, we treat the projections as provided by an oracle. Note that each iteration of the RCD method only requires the computation of one projection onto a single cone. In contrast, methods such as DR, AP and the primal-dual hybrid gradient descent (PDHG) proposed in [29] and used for SSL on hypergraphs [17] require performing a complete gradient descent step and computing a total of $R$ projections per iteration. Thus, from the perspective of iteration cost, RCD is significantly more efficient, especially when $R$ is large and computing $\Pi(\cdot)$ is costly.
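To make Algorithm 1 concrete, the following minimal sketch instantiates it for the special case in which every $F_r$ is the cut function of a single weighted edge. In this case the conic projection admits a simple closed form (worked out here only for this illustration), and the recovered primal iterate can be checked against the exact minimizer of (3), which for edge cuts reduces to a linear system. The toy instance and helper names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: each F_r is the cut of one weighted edge (i, j), so [f_r(x)]^2 = w^2 (x_i - x_j)^2
N = 6
edges = [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 2.0), (3, 4, 1.0), (4, 5, 0.7), (0, 5, 1.5)]
Wdiag = rng.uniform(0.5, 2.0, N)           # diagonal of W
a = rng.standard_normal(N)

def project_edge_cone(b, i, j, w):
    """Closed-form projection onto C_r = {(y, phi): phi >= 0, y in phi B_r} in the
    ||y - b||_{W^{-1}}^2 + phi^2 metric, where B_r = conv{w(e_i - e_j), w(e_j - e_i)}."""
    t = (b[i] / Wdiag[i] - b[j] / Wdiag[j]) / (1 / Wdiag[i] + 1 / Wdiag[j] + 1 / w**2)
    y = np.zeros_like(b)
    y[i], y[j] = t, -t
    return y, abs(t) / w

# Algorithm 1: y_r <- Pi_{C_r}(2 W a - sum_{r' != r} y_{r'}), with r chosen uniformly at random
Y = np.zeros((len(edges), N))
for _ in range(30000):
    r = rng.integers(len(edges))
    i, j, w = edges[r]
    b = 2 * Wdiag * a - (Y.sum(axis=0) - Y[r])
    Y[r], _ = project_edge_cone(b, i, j, w)

x_rcd = a - 0.5 * Y.sum(axis=0) / Wdiag    # primal recovery x = a - (1/2) W^{-1} sum_r y_r

# Ground truth: for edge cuts, (3) is min ||x - a||_W^2 + x^T L x with L the weighted Laplacian
L = np.zeros((N, N))
for i, j, w in edges:
    L[i, i] += w**2; L[j, j] += w**2; L[i, j] -= w**2; L[j, i] -= w**2
x_star = np.linalg.solve(np.diag(Wdiag) + L, Wdiag * a)
print(np.allclose(x_rcd, x_star, atol=1e-6))   # True (up to the RCD accuracy)
```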
The objective $g(y, \phi)$ described in (6) is not strongly convex in general. Inspired by the work on DSFM [13], we show in what follows that this objective nevertheless satisfies a weak strong convexity condition, which guarantees linear convergence of the RCD algorithm. Note that, due to the additional variables $\phi$ that characterize the conic structure, extra analytic effort is required compared to the DSFM case. We start by providing a general result that characterizes relevant geometric properties of the cone $\otimes_{r \in [R]} C_r$.

Lemma 4.2. Consider a feasible solution $(y, \phi) \in \otimes_{r \in [R]} C_r$ and a nonnegative vector $\phi' = (\phi'_r) \in \otimes_{r \in [R]} \mathbb{R}_{\ge 0}$. Let $s$ be an arbitrary point in the base polytope of $\sum_{r \in [R]} \phi'_r F_r$, and let $W^{(1)}, W^{(2)}$ be two positive diagonal matrices. Then, there exists a $y' \in \otimes_{r \in [R]} \phi'_r B_r$ such that $\sum_{r \in [R]} y'_r = s$ and

$$\|y - y'\|_{I(W^{(1)})}^2 + \|\phi - \phi'\|^2 \le \mu(W^{(1)}, W^{(2)}) \Big[ \Big\| \sum_{r \in [R]} y_r - s \Big\|_{W^{(2)}}^2 + \|\phi - \phi'\|^2 \Big], \qquad (8)$$

where

$$\mu(W^{(1)}, W^{(2)}) = \max\Big\{ \max_{i \in [N]} W^{(1)}_{ii} \sum_{j \in [N]} 1/W^{(2)}_{jj},\ \ 9\rho^2 \sum_{i \in [N]} W^{(1)}_{ii} + 1 \Big\}.$$

As a corollary of Lemma 4.2, the next result establishes the weak strong convexity of $g(y, \phi)$. To proceed, we introduce some additional notation. Denote the set of solutions of problem (6) by

$$\Xi = \Big\{ (y, \phi) \ \Big|\ \sum_{r \in [R]} y_r = 2W(a - x^*),\ \ \phi_r = \inf\{\theta \ge 0 : y_r \in \theta B_r\},\ \forall r \Big\}.$$

Note that this representation arises from the relationship between the optimal primal and dual solutions as stated in Lemma 4.1. We denote the optimal value of the objective over $(y, \phi) \in \Xi$ by $g^* = g(y, \phi)$, and define a distance function

$$d((y, \phi), \Xi) = \sqrt{ \min_{(y', \phi') \in \Xi} \|y - y'\|_{I(W^{-1})}^2 + \|\phi - \phi'\|^2 }.$$

Lemma 4.3. Suppose that $(y, \phi) \in \otimes_{r \in [R]} C_r$ and that $(y^*, \phi^*) \in \Xi$ minimizes $\|y - y^*\|_{I(W^{-1})}^2 + \|\phi - \phi^*\|^2$. Then

$$\Big\| \sum_{r \in [R]} (y_r - y_r^*) \Big\|_{W^{-1}}^2 + \|\phi - \phi^*\|^2 \ \ge\ \frac{d^2((y, \phi), \Xi)}{\mu(W^{-1}, W^{-1})}.$$

Based on Lemma 4.3, we can establish the linear convergence rate of the RCD algorithm.

Theorem 4.4. After $k$ iterations of Algorithm 1, we obtain a pair $(y^{(k)}, \phi^{(k)})$ that satisfies

$$\mathbb{E}\Big[ g(y^{(k)}, \phi^{(k)}) - g^* + d^2\big((y^{(k)}, \phi^{(k)}), \Xi\big) \Big] \le \Big( 1 - \frac{2}{R\,[1 + \mu(W^{-1}, W^{-1})]} \Big)^{k} \Big[ g(y^{(0)}, \phi^{(0)}) - g^* + d^2\big((y^{(0)}, \phi^{(0)}), \Xi\big) \Big].$$

Theorem 4.4 implies that $O\big(R\,\mu(W^{-1}, W^{-1}) \log \frac{1}{\epsilon}\big)$ iterations are required to obtain an $\epsilon$-optimal solution. Below, we give an explicit characterization of this complexity for the SSL and PR problems with normalized Laplacian regularization, as discussed in Section 3.

Corollary 4.5. Suppose that $W = \beta D$, where $\beta$ is a hyperparameter and $D$ is a diagonal degree matrix such that $D_{ii} = \sum_{r \in [R]: i \in S_r} \max_{S \subseteq [N]} [F_r(S)]^2$. Algorithm 1 requires an expected number of $O\big(N^2 R \max\{1, 9\beta^{-1}\} \max_{i,j \in [N]} \frac{D_{ii}}{D_{jj}} \log \frac{1}{\epsilon}\big)$ iterations to return an $\epsilon$-optimal solution.

The term $N^2 R$ also appears in the expression for the complexity of the RCD method for solving the DSFM problem [14]. The term $\max\{1, 9\beta^{-1}\}$ implies that the convergence rate is slow whenever $\beta$ is small. This makes sense: for example, in the PR problem (4), a small $\beta$ corresponds to a large $\alpha$, which typically implies longer mixing times of the underlying Markov process. The term $\max_{i,j \in [N]} \frac{D_{ii}}{D_{jj}}$ arises due to the degree-based normalization.

## 5 Computing the Projections $\Pi_{C_r}(\cdot)$

In this section, we provide efficient routines for computing the projection onto the conic set, $\Pi_{C_r}(\cdot)$. As the procedure is the same for all values of $r \in [R]$, we drop the subscript $r$ to simplify notation. First, recall that

$$\Pi_C(a) = \arg\min_{(y, \phi)}\ h(y, \phi) \triangleq \|y - a\|_W^2 + \phi^2, \quad \text{s.t. } y \in \phi B,\ \phi \ge 0, \qquad (9)$$

where, with a slight abuse of notation, $W$ now stands for the inverse of the diagonal matrix used in (3) (i.e., $W \leftarrow W^{-1}$), and where $B$ denotes the base polytope of the submodular function $F$. Let $h^*$ and $(y^*, \phi^*)$ be the optimal value of the objective function and the argument that optimizes it, respectively.
When performing projections, one only needs to consider the variables incident to $F$ and set all other variables to zero. For ease of exposition, we assume that all variables in $[N]$ are incident to $F$. Unlike QDSFM, DSFM involves the computation of projections onto the base polytopes of submodular functions. Two algorithms, the Frank-Wolfe (FW) method [30] and the Fujishige-Wolfe minimum norm point algorithm (MNP) [31], are used for this purpose. Both methods assume cheap linear minimization oracles on polytopes and attain a $1/k$ convergence rate. The MNP algorithm is more sophisticated and empirically more efficient. Nonetheless, neither of these methods can be applied directly to cones. To this end, we modify both methods by adjusting them to the conic structure in (9) and show that the $1/k$ convergence rate still holds. We refer to the procedures as the conic MNP method and the conic FW method, respectively. Here we focus mainly on the conic MNP method described in Algorithm 2, as it is the more sophisticated of the two. A detailed discussion of the conic FW method and its convergence guarantees can be found in the full version of this work.

The conic MNP algorithm keeps track of an active set $S = \{q_1, q_2, \ldots\}$ and searches for the best solution in its conic hull. Let us denote the cone of an active set $S$ by $\text{cone}(S) = \{\sum_{q_i \in S} \alpha_i q_i \mid \alpha_i \ge 0\}$ and its linear span by $\text{lin}(S) = \{\sum_{q_i \in S} \alpha_i q_i \mid \alpha_i \in \mathbb{R}\}$. Similarly to the original MNP algorithm, Algorithm 2 contains two levels of loops: MAJOR and MINOR. In the MAJOR loop, we greedily add a new active point $q^{(k)}$ to the set $S$, obtained from the linear minimization oracle w.r.t. the base polytope (Step 2), and by the end of the MAJOR loop we obtain a $y^{(k+1)}$ that minimizes $h(y, \phi)$ over $\text{cone}(S)$ (Steps 3-8). The MINOR loop is activated when $\text{lin}(S)$ contains a point $z$ that guarantees a smaller value of the objective function than the optimal point in $\text{cone}(S)$, provided that some active points may be removed from $S$. Compared to the original MNP method, Steps 2 and 5, as well as the termination Step 3, are specialized for the conic structure. The following convergence result implies that the conic MNP algorithm also has a convergence rate of order $1/k$; the proof is essentially independent of the submodularity assumption and represents a careful modification of the arguments in [32] for conic structures.

Algorithm 2: The Conic MNP Method for Solving (9)
Input: $W$, $a$, $B$ and a small positive constant $\delta$. Maintain $\phi^{(k)} = \sum_{q_i \in S^{(k)}} \lambda_i^{(k)}$ throughout.
Choose an arbitrary $q_1 \in B$. Set $S^{(0)} \leftarrow \{q_1\}$, $\lambda_1^{(0)} \leftarrow \frac{\langle a, q_1 \rangle_W}{1 + \|q_1\|_W^2}$, $y^{(0)} \leftarrow \lambda_1^{(0)} q_1$, $k \leftarrow 0$.
1. Iteratively execute (MAJOR LOOP):
2. $q^{(k)} \leftarrow \arg\min_{q \in B} \langle \nabla_y h(y^{(k)}, \phi^{(k)}), q \rangle$
3. If $\langle y^{(k)} - a, q^{(k)} \rangle_W + \phi^{(k)} \ge -\delta$, then break; else $S^{(k)} \leftarrow S^{(k)} \cup \{q^{(k)}\}$.
4. Iteratively execute (MINOR LOOP):
5. $\alpha \leftarrow \arg\min_\alpha \big\| \sum_{q_i \in S^{(k)}} \alpha_i q_i - a \big\|_W^2 + \big(\sum_{q_i \in S^{(k)}} \alpha_i\big)^2$, and $z^{(k)} \leftarrow \sum_{q_i \in S^{(k)}} \alpha_i q_i$
6. If $\alpha_i \ge 0$ for all $i$, then break;
7. else $\theta \leftarrow \min_{i: \alpha_i < 0} \lambda_i^{(k)} / (\lambda_i^{(k)} - \alpha_i)$, $\lambda_i^{(k+1)} \leftarrow \theta \alpha_i + (1 - \theta)\lambda_i^{(k)}$,
8. $y^{(k+1)} \leftarrow \theta z^{(k)} + (1 - \theta) y^{(k)}$, $S^{(k+1)} \leftarrow \{q_i : \lambda_i^{(k+1)} > 0\}$, $k \leftarrow k + 1$.
9. $y^{(k+1)} \leftarrow z^{(k)}$, $\lambda^{(k+1)} \leftarrow \alpha$, $S^{(k+1)} \leftarrow \{q_i : \lambda_i^{(k+1)} > 0\}$, $k \leftarrow k + 1$.

Theorem 5.1. Let $B$ be an arbitrary polytope in $\mathbb{R}^N$ and let $C = \{(y, \phi) \mid y \in \phi B,\ \phi \ge 0\}$ be the cone induced by this polytope. For a positive diagonal matrix $W$, define $Q = \max_{q \in B} \|q\|_W$. Algorithm 2 yields a sequence $(y^{(k)}, \phi^{(k)})_{k = 1, 2, \ldots}$ such that $h(y^{(k)}, \phi^{(k)})$ decreases monotonically. Algorithm 2 terminates after $k = O(N \|a\|_W \max\{Q^2, 1\}/\delta)$ iterations, with $h(y^{(k)}, \phi^{(k)}) \le h^* + \delta \|a\|_W$.
Both the conic FW and conic MNP methods are approximate algorithms for computing projections for generic polytopes $B$ and their induced cones. We also devised an algorithm of complexity $O(N \log N)$ that computes the projection exactly for the polytopes $B$ arising in learning on (un)directed hypergraphs. A detailed description of this exact projection algorithm is given in the full version of this paper.

## 6 Extension to mix-DSFM

With tools for solving both the QDSFM and DSFM problems in hand, it is simple to derive an efficient solver for the following mix-DSFM problem. Suppose that $\{F_r\}_{r \in [R_1 + R_2]}$ is a collection of submodular functions such that $F_r \ge 0$ for $r \in [R_1]$, and let $f_r$ be the Lovász extension of $F_r$, $r \in [R_1 + R_2]$. We seek to solve

$$\text{mix-DSFM:} \qquad \min_{x \in \mathbb{R}^N}\ \|x - a\|_W^2 + \sum_{r=1}^{R_1} [f_r(x)]^2 + \sum_{r = R_1 + 1}^{R_1 + R_2} f_r(x). \qquad (10)$$

By applying the same trick as in (5) to the quadratic terms, one may show that the dual of the mix-DSFM problem again asks for the best approximation of an affine space, this time in terms of a product of a mixture of cones and base polytopes. Furthermore, all other related results, including the weak strong convexity of the dual, the linear convergence of RCD/AP and the $1/k$-rate convergence of the MNP/FW methods, can be generalized to the mix-DSFM case via the same techniques developed in this work.

## 7 Experiments

Our experiments focus on SSL over hypergraphs, using both real and synthetic datasets. For this particular problem, the QDSFM objective can be formulated as

$$\min_{x \in \mathbb{R}^N}\ \beta \|x - a\|^2 + \sum_{r \in [R]} \max_{i, j \in S_r} \Big( \frac{x_i}{\sqrt{W_{ii}}} - \frac{x_j}{\sqrt{W_{jj}}} \Big)^2, \qquad (11)$$

where $a_i \in \{-1, 0, 1\}$ indicates whether the corresponding variable $i$ has a negative, missing, or positive label, respectively, $\beta$ is a parameter that balances the influence of the observations and the regularization term, and $\{W_{ii}\}_{i \in [N]}$ defines a positive measure over the variables and may be chosen to be the degree matrix $D$ with $D_{ii} = |\{r \in [R] : i \in S_r\}|$. Each component in the decomposition corresponds to one hyperedge.

We compare eight different solvers falling into three categories: (a) our proposed general QDSFM solvers, QRCD-SPE, QRCD-MNP, QRCD-FW and QAP-SPE; (b) alternative solvers for the specific problem (11), including PDHG [17] and SGD [18]; (c) SSL solvers that do not use QDSFM as the objective, including DRCD [13] and InvLap [33]. The first three methods all have outer loops that execute RCD, but with different inner-loop projections, computed via the exact projection algorithm for undirected hyperedges (QRCD-SPE) or via the generic MNP (QRCD-MNP) and FW (QRCD-FW) routines. The QAP-SPE method uses AP in the outer loop and exact inner-loop projections. PDHG and SGD are the only known solvers for the specific objective (11). DRCD is a state-of-the-art solver for DSFM and also uses a combination of outer-loop RCD and inner-loop projections. InvLap first transforms hyperedges into cliques and then solves a Laplacian-based linear problem. All the aforementioned methods, except InvLap, are implemented in C++ in a nonparallel fashion; InvLap is executed via matrix inversion operations in Matlab, which may be parallelized. The CPU times of all methods are recorded on a 3.2 GHz Intel Core i5. The results are summarized over 100 independent tests. When reporting the results, we use the primal gap ("gap") to characterize the convergence properties of the different solvers. Additional descriptions of the settings, as well as experimental results for the QRCD-MNP and QRCD-FW methods applied to general submodular functions, can be found in the full version of this paper.
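Since all solvers are compared through the primal gap of (11), it is useful to note that the objective itself is cheap to evaluate. A minimal sketch with a hypothetical toy hypergraph (illustrative names and data only) follows.

```python
import numpy as np

def qdsfm_ssl_objective(x, a, beta, W_diag, hyperedges):
    """Evaluate objective (11): squared loss plus the normalized hyperedge regularizer."""
    xn = x / np.sqrt(W_diag)
    reg = sum((xn[list(Sr)].max() - xn[list(Sr)].min()) ** 2 for Sr in hyperedges)
    return beta * np.sum((x - a) ** 2) + reg

# Tiny illustrative call: 4 vertices, two hyperedges, one observed label per class
hyperedges = [{0, 1, 2}, {1, 2, 3}]
a = np.array([1.0, 0.0, 0.0, -1.0])
W_diag = np.array([float(sum(i in Sr for Sr in hyperedges)) for i in range(4)])  # degrees D_ii
print(qdsfm_ssl_objective(np.zeros(4), a, beta=1.0, W_diag=W_diag, hyperedges=hyperedges))
```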
Synthetic data. We generated a hypergraph with $N = 1000$ vertices that belong to two equal-sized clusters. We uniformly at random generated 500 hyperedges within each cluster and 1000 hyperedges across the two clusters. Note that in higher-order clustering, one does not need many hyperedges within each cluster to obtain good clustering results. Each hyperedge includes 20 vertices. We also uniformly at random picked $l = 1, 2, 3, 4$ vertices from each cluster to represent labeled datapoints. With the vector $x$ obtained by solving (11), we classified the variables based on the Cheeger cut rule [17]: suppose that $x_{i_1}/\sqrt{W_{i_1 i_1}} \ge x_{i_2}/\sqrt{W_{i_2 i_2}} \ge \cdots \ge x_{i_N}/\sqrt{W_{i_N i_N}}$, and define $S_j = \{i_1, i_2, \ldots, i_j\}$. We partition $[N]$ into two sets $(S_{j^*}, \bar{S}_{j^*})$, where

$$j^* = \arg\min_{j \in [N]} c(S_j) \triangleq \frac{\big|\{r \in [R] : S_r \cap S_j \ne \emptyset,\ S_r \cap \bar{S}_j \ne \emptyset\}\big|}{\max\big\{\sum_{r \in [R]} |S_r \cap S_j|,\ \sum_{r \in [R]} |S_r \cap \bar{S}_j|\big\}}.$$

The classification error is defined as (# of incorrectly classified vertices)$/N$. In the experiment, we used $W_{ii} = D_{ii}$ for all $i$, and tuned $\beta$ to be nearly optimal for the different objectives with respect to the classification error rate.

The top-left panel of Figure 1 shows that QRCD-SPE converges much faster than all other methods when solving problem (11) according to the gap metric (we only plot the curve for $l = 3$, as all other values of $l$ produce similar patterns). To avoid clutter, we postpone the results for QRCD-MNP and QRCD-FW to the full version of this paper, as these methods are typically 100 to 1000 times slower than QRCD-SPE. In the table accompanying Figure 1, we describe the performance of the different methods under similar CPU times. Note that when QRCD-SPE converges (with primal gap $10^{-9}$, achieved after 0.83 s), the obtained $x$ leads to a much smaller classification error than the other methods. QAP-SPE, PDHG and SGD all have large classification errors, as they do not converge within short CPU time frames. QAP-SPE and PDHG perform only a small number of iterations, but each iteration computes the projections for all hyperedges, which is time-consuming. The poor performance of DRCD implies that DSFM is not a good objective for SSL. InvLap achieves moderate classification errors, but still does not match the performance of QRCD-SPE. Note that InvLap uses Matlab, which is optimized for matrix operations, and is hence fairly efficient. However, for experiments on real datasets, where one encounters fewer but significantly larger hyperedges, InvLap does not offer as good a performance as for the synthetic data. The column "Average $100\,c(S_{j^*})$" also illustrates that the QDSFM objective is a good choice for finding approximately balanced cuts of hypergraphs.

We also evaluated the influence of parameter choices on the convergence of the QRCD methods. According to Theorem 4.4, the required number of RCD iterations for achieving an $\epsilon$-optimal solution of (11) is roughly $O\big(N^2 R \max(1, 9/(2\beta)) \max_{i,j \in [N]} W_{ii}/W_{jj} \log(1/\epsilon)\big)$ (see the full version of this paper). We mainly focus on testing the dependence on the parameters $\beta$ and $\max_{i,j \in [N]} W_{ii}/W_{jj}$, as the term $N^2 R$ also appears in the iteration complexity of DSFM and was shown to be necessary for certain submodular structures [15]. To test the effect of $\beta$, we fix $W_{ii} = 1$ for all $i$ and vary $\beta \in [10^{-3}, 10^3]$. To test the effect of $W$, we fix $\beta = 1$, randomly choose half of the vertices and set their $W_{ii}$ values to lie in $\{1, 0.1, 0.01, 0.001\}$, and set the remaining ones to 1. The two top-right plots of Figure 1 show our results: the logarithm of the gap ratio grows proportionally to $\log \beta^{-1}$ for small $\beta$ and to $\log \max_{i,j \in [N]} W_{ii}/W_{jj}$, although the dependence is not as sensitive as predicted by Theorem 4.4. Moreover, when $\beta$ is relatively large ($> 1$), the dependence on $\beta$ levels out.
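A minimal sketch of the sweep procedure behind the Cheeger cut rule above is given below; the helper and the toy call are illustrative only and assume hyperedges given as Python sets.

```python
import numpy as np

def cheeger_sweep(x, W_diag, hyperedges):
    """Sort vertices by x_i / sqrt(W_ii), scan the prefix sets S_j, and return
    the one minimizing the cut ratio c(S_j) defined above."""
    order = np.argsort(-x / np.sqrt(W_diag))
    best_ratio, best_S = np.inf, None
    for j in range(1, len(order)):
        S = set(order[:j].tolist())
        cut = sum(1 for Sr in hyperedges if (Sr & S) and (Sr - S))
        denom = max(sum(len(Sr & S) for Sr in hyperedges),
                    sum(len(Sr - S) for Sr in hyperedges))
        ratio = cut / denom
        if ratio < best_ratio:
            best_ratio, best_S = ratio, S
    return best_S, best_ratio

# Tiny illustrative call (two 3-vertex hyperedges over 6 vertices)
hyperedges = [{0, 1, 2}, {3, 4, 5}]
x = np.array([0.9, 0.8, 0.7, -0.6, -0.7, -0.8])
print(cheeger_sweep(x, np.ones(6), hyperedges))   # recovers the cut ({0, 1, 2}, {3, 4, 5})
```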
| Solver | Error (%), l=1 (MN / MD) | l=2 | l=3 | l=4 | Avg. $100\,c(S_{j^*})$, l=1 | l=2 | l=3 | l=4 | #iterations | CPU time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| QRCD-SPE | 2.93 / 2.55 | 2.23 / 0.00 | 1.47 / 0.00 | 0.78 / 0.00 | 6.81 | 6.04 | 5.71 | 5.41 | 4.8×10^5 | 0.83 |
| QAP-SPE | 14.9 / 15.0 | 12.6 / 13.2 | 7.33 / 8.10 | 4.07 / 3.80 | 9.51 | 9.21 | 8.14 | 7.09 | 2.7×10^2 | 0.85 |
| PDHG | 9.05 / 9.65 | 4.56 / 4.05 | 3.02 / 2.55 | 1.74 / 0.95 | 8.64 | 7.32 | 6.81 | 6.11 | 3.0×10^2 | 0.83 |
| SGD | 5.79 / 4.15 | 4.30 / 3.30 | 3.94 / 2.90 | 3.41 / 2.10 | 8.22 | 7.11 | 7.01 | 6.53 | 1.5×10^4 | 0.86 |
| DRCD | 44.7 / 44.2 | 46.1 / 45.3 | 43.4 / 44.2 | 45.3 / 44.6 | 9.97 | 9.97 | 9.96 | 9.97 | 3.8×10^6 | 0.85 |
| InvLap | 8.17 / 7.30 | 3.27 / 3.00 | 1.91 / 1.60 | 0.89 / 0.70 | 8.89 | 7.11 | 6.18 | 5.60 | - | 0.07 |

Figure 1: Experimental results on synthetic datasets. Top-left: gap vs. CPU time for different QDSFM solvers (average with standard deviation). Bottom: classification error rates and average $100\,c(S_{j^*})$ for different solvers (MN: mean, MD: median). Top-right: the primal gap ratio of QRCD after $2 \times 10^5$ iterations for different choices of the parameters $\beta$ and $\max_{i,j \in [N]} W_{ii}/W_{jj}$.

Real data. We also evaluated the proposed algorithms on three UCI datasets, Mushroom, Covertype45 and Covertype67, used as standard benchmarks for SSL on hypergraphs [33, 17, 18]. Each dataset corresponds to a hypergraph model constructed as described in [17]: entries correspond to vertices, while each categorical feature is modeled as one hyperedge; numerical features are first quantized into 10 bins of equal size and then mapped to hyperedges. Compared to the synthetic data, in these datasets the size of most hyperedges is much larger (around 1000) while the number of hyperedges is small (around 100). Previous work has shown that fewer classification errors are achieved by using QDSFM as the objective instead of DSFM or InvLap [17]. In our experiments, we therefore focus on comparing the convergence of the different solvers for QDSFM. We set $\beta = 100$ and $W_{ii} = 1$ for all $i$, and set the number of observed labels to 100, which is an appropriate setting as described in [17]. Figure 2 shows the results. Again, the proposed QRCD-SPE and QAP-SPE methods both converge faster than PDHG and SGD, with QRCD-SPE performing best. We do not plot the results for QRCD-MNP and QRCD-FW, as these methods converge extremely slowly due to the large sizes of the hyperedges. InvLap requires 22, 114 and 1802 seconds to run on the Mushroom, Covertype45 and Covertype67 datasets, respectively, and hence does not scale well.

Figure 2: Convergence of different solvers for QDSFM on three real datasets (Mushroom, Covertype45, Covertype67); each panel plots the gap versus CPU time.

## 8 Acknowledgement

The authors gratefully acknowledge many useful suggestions by the reviewers. This work was supported in part by the NIH grant 1u01 CA198943A and the NSF grant CCF 15-27636.

## References

[1] F. Bach, Learning with submodular functions: A convex optimization perspective, Foundations and Trends in Machine Learning, vol. 6, no. 2-3, pp. 145-373, 2013.
[2] L. Lovász, Submodular functions and convexity, in Mathematical Programming: The State of the Art. Springer, 1983, pp. 235-257.
[3] X. Zhu, Z. Ghahramani, and J. D. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in Proceedings of the 20th International Conference on Machine Learning, 2003, pp. 912-919.
[4] D. Zhou, O. Bousquet, T. N.
Lal, J. Weston, and B. Schölkopf, Learning with local and global consistency, in Advances in Neural Information Processing Systems, 2004, pp. 321-328.
[5] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, An iterative regularization method for total variation-based image restoration, Multiscale Modeling & Simulation, vol. 4, no. 2, pp. 460-489, 2005.
[6] A. Chambolle and J. Darbon, On total variation minimization and surface evolution using parametric maximum flows, International Journal of Computer Vision, vol. 84, no. 3, p. 288, 2009.
[7] S. Kumar and M. Hebert, Discriminative random fields: A discriminative framework for contextual interaction in classification, in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2003, pp. 1150-1157.
[8] N. Z. Shor, Minimization Methods for Non-differentiable Functions. Springer Science & Business Media, 2012, vol. 3.
[9] P. Stobbe and A. Krause, Efficient minimization of decomposable submodular functions, in Advances in Neural Information Processing Systems, 2010, pp. 2208-2216.
[10] S. Fujishige, Submodular Functions and Optimization. Elsevier, 2005, vol. 58.
[11] S. Jegelka, F. Bach, and S. Sra, Reflection methods for user-friendly submodular optimization, in Advances in Neural Information Processing Systems, 2013, pp. 1313-1321.
[12] R. Nishihara, S. Jegelka, and M. I. Jordan, On the convergence rate of decomposable submodular function minimization, in Advances in Neural Information Processing Systems, 2014, pp. 640-648.
[13] A. Ene and H. Nguyen, Random coordinate descent methods for minimizing decomposable submodular functions, in Proceedings of the International Conference on Machine Learning, 2015, pp. 787-795.
[14] A. Ene, H. Nguyen, and L. A. Végh, Decomposable submodular function minimization: discrete and continuous, in Advances in Neural Information Processing Systems, 2017, pp. 2874-2884.
[15] P. Li and O. Milenkovic, Revisiting decomposable submodular function minimization with incidence relations, in Advances in Neural Information Processing Systems, 2018.
[16] R. Johnson and T. Zhang, On the effectiveness of Laplacian normalization for graph semi-supervised learning, Journal of Machine Learning Research, vol. 8, no. Jul, pp. 1489-1517, 2007.
[17] M. Hein, S. Setzer, L. Jost, and S. S. Rangapuram, The total variation on hypergraphs: learning on hypergraphs revisited, in Advances in Neural Information Processing Systems, 2013, pp. 2427-2435.
[18] C. Zhang, S. Hu, Z. G. Tang, and T. H. Chan, Re-revisiting learning on hypergraphs: confidence interval and subgradient method, in Proceedings of the International Conference on Machine Learning, 2017, pp. 4026-4034.
[19] A. Gammerman, V. Vovk, and V. Vapnik, Learning by transduction, in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1998, pp. 148-155.
[20] T. Joachims, Transductive learning via spectral graph partitioning, in Proceedings of the 20th International Conference on Machine Learning, 2003, pp. 290-297.
[21] X. Zhu, J. Lafferty, and Z. Ghahramani, Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions, in ICML 2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, vol. 3, 2003.
[22] P. Li and O. Milenkovic, Inhomogeneous hypergraph clustering with applications, in Advances in Neural Information Processing Systems, 2017, pp. 2305-2315.
[23] P. Li and O. Milenkovic, Submodular hypergraphs: p-Laplacians, Cheeger inequalities and spectral clustering, in Proceedings of the International Conference on Machine Learning, 2018.
[24] L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank citation ranking: Bringing order to the web, Stanford InfoLab, Tech. Rep., 1999.
[25] T.-H. H. Chan, A. Louis, Z. G. Tang, and C. Zhang, Spectral properties of hypergraph Laplacian and approximation algorithms, Journal of the ACM (JACM), vol. 65, no. 3, p. 15, 2018.
[26] T. Chan, Z. G. Tang, X. Wu, and C. Zhang, Diffusion operator and spectral analysis for directed hypergraph Laplacian, arXiv preprint arXiv:1711.01560, 2017.
[27] Y. Yoshida, Cheeger inequalities for submodular transformations, arXiv preprint arXiv:1708.08781, 2017.
[28] D. F. Gleich, L.-H. Lim, and Y. Yu, Multilinear PageRank, SIAM Journal on Matrix Analysis and Applications, vol. 36, no. 4, pp. 1507-1541, 2015.
[29] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging, Journal of Mathematical Imaging and Vision, vol. 40, no. 1, pp. 120-145, 2011.
[30] M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics, vol. 3, no. 1-2, pp. 95-110, 1956.
[31] S. Fujishige and S. Isotani, A submodular function minimization algorithm based on the minimum-norm base, Pacific Journal of Optimization, vol. 7, no. 1, pp. 3-17, 2011.
[32] D. Chakrabarty, P. Jain, and P. Kothari, Provable submodular minimization using Wolfe's algorithm, in Advances in Neural Information Processing Systems, 2014, pp. 802-809.
[33] D. Zhou, J. Huang, and B. Schölkopf, Learning with hypergraphs: Clustering, classification, and embedding, in Advances in Neural Information Processing Systems, 2007, pp. 1601-1608.