# Learning to Optimize Combinatorial Functions

Nir Rosenfeld¹, Eric Balkanski¹, Amir Globerson², Yaron Singer¹

¹Harvard University, ²Tel Aviv University. Correspondence to: Nir Rosenfeld.

*Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).*

**Abstract.** Submodular functions have become a ubiquitous tool in machine learning. They are learnable from data, and can be optimized efficiently and with guarantees. Nonetheless, recent negative results show that optimizing learned surrogates of submodular functions can result in arbitrarily bad approximations of the true optimum. Our goal in this paper is to highlight the source of this hardness, and to propose an alternative criterion for optimizing general combinatorial functions from sampled data. We prove a tight equivalence showing that a class of functions is optimizable if and only if it can be learned. We provide efficient and scalable optimization algorithms for several function classes of interest, and demonstrate their utility on the task of optimally choosing trending social media items.

## 1. Introduction

Submodular optimization is fast becoming a primary tool in machine learning. The power of submodularity as a model has been demonstrated in numerous applications, including document summarization (Lin & Bilmes, 2011), clustering (Gomes & Krause, 2010), active learning (Golovin & Krause, 2011; Guillory & Bilmes, 2011; Hoi et al., 2006), graph and network inference (Gomez Rodriguez et al., 2010; Rodriguez & Schölkopf, 2012; Defazio & Caetano, 2012), and information diffusion in networks (Kempe et al., 2003). Crucial to the success of these methods is the fact that optimizing submodular functions can be done efficiently and with provable guarantees (Krause & Golovin, 2014).

In many cases, however, the true function cannot be accessed, and instead a surrogate function is learned from data (Balkanski et al., 2017). To this end, PMAC learning (Balcan & Harvey, 2011) offers a framework for analyzing the learnability of submodular functions, as well as algorithms for learning in practice. Encouraging results show that in many cases submodular functions can be efficiently learned from data (Balcan & Harvey, 2011; Iyer et al., 2013; Feldman & Kothari, 2014; Feldman & Vondrák, 2016).

A natural approach in this setting is to first learn a surrogate function from samples, and then optimize it, hoping that the estimated optimum will be close to the true one. A recent line of work has been devoted to this setting of optimization from samples (OPS) (Balkanski et al., 2016; 2017). The main result of OPS is unfortunately discouraging: for maximizing a submodular function under a cardinality constraint, no algorithm can obtain a constant-factor approximation guarantee given polynomially many samples from any distribution (Balkanski et al., 2017). Thus, optimizing over learned surrogates does not provide any meaningful guarantees with respect to the true function.

The hardness of OPS is, however, a worst-case result. The hardness stems from the discrepancy between how the algorithm gains access to information (via samples) and how it is evaluated (globally). In contrast, machine learning objectives are typically concerned with expected outcomes, and are evaluated over the same distribution from which data is acquired (Valiant, 1984).
In this paper, we build on this motivation and propose an alternative framework for optimizing from samples. The objective we propose, called distributional optimization from samples (DOPS), circumvents the above difficulties by considering a distribution-dependent objective. In general, a function class F is in α-DOPS if an α-approximation of the empirical argmax can be found with arbitrarily high probability using polynomially many samples, for any distribution D and any f ∈ F. Formally:

**Definition 1 (α-DOPS).** Let F = {f : 2^[n] → R_+} be a class of set functions over n elements. We say that F is α-distributionally optimizable from samples if there is an algorithm A such that, for every distribution D over 2^[n], every f ∈ F, and every ϵ, δ ∈ [0, 1], when A is given as input a sample set S = {(S_i, f(S_i))}_{i=1}^M with S_i ∼_iid D, with probability at least 1 − δ over S it holds that:

P_{T∼D^m} [ f(A(T)) ≥ (1/α) max_{S∈T} f(S) ] ≥ 1 − ϵ    (1)

where T = {S_j}_{j=1}^m, A(T) ∈ T is the output of the algorithm, and S is of size M ≤ poly(n, m, 1/ϵ, 1/δ, α).

The criterion in Eq. (1) relaxes the OPS objective to hold in expectation over D. This is achieved by replacing the entire combinatorial domain with a sampled subset T of size m, allowing for a distribution-agnostic notion of approximation. As m increases, satisfying Eq. (1) becomes harder; as m → ∞, DOPS recovers OPS.

Our first goal in this paper is to establish the hardness of DOPS. In general, classic approximation results do not necessarily transfer to statistical settings (Balkanski et al., 2017). Nonetheless, our main theoretical result establishes a tight equivalence between DOPS and PMAC learning (Balcan & Harvey, 2011), meaning that any F that is learnable is also optimizable, and vice versa. This demonstrates an intriguing link between learning and optimizing submodular functions, which are known to be PMAC-learnable (Balcan & Harvey, 2011). The equivalence result is constructive, and gives a general optimization algorithm which can utilize any PMAC learner as a black box for DOPS, and vice versa. While our main focus in this paper is on submodular functions, these results hold for any family of combinatorial functions.

In practice, however, optimizing via PMAC algorithms has several drawbacks (Balcan & Harvey, 2011; Feldman & Kothari, 2014; Feldman & Vondrák, 2016). Our second goal in this paper is hence to design an efficient and scalable DOPS algorithm for several classes of interest. Our algorithm optimizes a loss function whose minimization provides a sufficient condition for DOPS. We prove that the minimizer of the empirical loss can be used for recovering an approximate argmax. In this sense, the framework we propose is one in which the algorithm learns to optimize. We show how the loss can be minimized efficiently and with guarantees for several submodular function classes, including coverage functions, cut functions, and unit demand.

An additional benefit of our approach is that it provides guarantees even when the output of the algorithm is restricted to a set of sampled alternatives. This setting is especially prevalent in cases where both sets and their values are generated by human users. For example, in the problem of influence maximization (Kempe et al., 2003), the goal is to choose a seed set of users that, when exposed to certain content, will maximize its expected propagation. However, targeting arbitrary subsets of users is in most cases impossible, and the algorithm must choose among the sets of users sharing currently trending items. In the last part of the paper we demonstrate the empirical utility of our approach on this task using real data from Twitter.
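To make the DOPS criterion concrete, the following minimal sketch estimates the probability on the left-hand side of Eq. (1) by Monte Carlo. It is illustrative only; the `algorithm` and `f` interfaces are hypothetical placeholders, not part of the paper.

```python
def dops_success_rate(algorithm, f, tuples, alpha):
    """Monte-Carlo estimate of the probability in Eq. (1): the fraction
    of sampled m-tuples T on which the algorithm's chosen set is an
    alpha-approximation of the empirical argmax max_{S in T} f(S).
    `algorithm(T)` must return one element of the tuple T."""
    hits = sum(f(algorithm(T)) >= max(f(S) for S in T) / alpha
               for T in tuples)
    return hits / len(tuples)
```

A class is in α-DOPS precisely when, for every distribution and target function, this success rate can be driven above 1 − ϵ using polynomially many training samples.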
## 2. Distributional optimization and learning

In this section we give a tight characterization of function classes in DOPS by showing that a class F is in DOPS if and only if it is PMAC-learnable. This involves two steps. In the first, we show that if F is α-PMAC learnable with sample complexity M_PMAC, then it is α-DOPS. We augment this result with tight sample complexity bounds for α-DOPS. In the second part, we show that PMAC learnability is not only sufficient but also necessary for distributional optimization from samples. We show that if F is not α-PMAC learnable, then it is not (α − ϵ)-DOPS for any constant ϵ > 0, which is tight. This result is obtained by constructing a novel PMAC algorithm based on a DOPS black box, and may thus be of separate interest in PMAC analysis. Overall, our results determine the hardness of DOPS by establishing a connection between the approximability and learnability of function classes.

We begin by reviewing the notion of PMAC learnability:

**Definition 2 (PMAC, Balcan & Harvey (2011)).** A class F is α-PMAC-learnable if there is an algorithm such that for every distribution D, every f ∈ F, and every ϵ, δ ∈ [0, 1],

P_{S∼D} [ f(S) ≤ f̃(S) ≤ α·f(S) ] ≥ 1 − ϵ    (2)

where the input of the algorithm is a sample set S of size M ≤ poly(n, 1/ϵ, 1/δ, α), the output is a mapping f̃ : 2^[n] → R_+, and Eq. (2) holds with probability at least 1 − δ over S.

Intuitively, PMAC generalizes the standard notion of PAC learning by considering a loss which penalizes predictions that are not within a factor of α of their true value. We are now ready to prove our main theoretical results.

### 2.1. If F is PMAC-learnable then F is in DOPS

We show that if F is α-PMAC learnable with sample complexity M_PMAC(n, δ, ϵ, α), then it is α-DOPS with sample complexity M_PMAC(n, δ, 1 − (1 − ϵ)^{1/m}, α), and this sample complexity is tight. A PMAC algorithm learns a surrogate function f̃. In our reduction, the corresponding DOPS algorithm simply outputs argmax_{S∈T} f̃(S). The technical part of this result is in showing the sample complexity tightness. Intuitively, the sample complexity is exactly the number of samples needed so that, with high probability, f̃ obtains a good approximation on all S ∈ T. We begin by showing that M_PMAC(n, δ, 1 − (1 − ϵ)^{1/m}, α) is sufficient, which follows from the definition of PMAC.

**Theorem 1.** Assume F is α-PMAC-learnable with sample complexity M_PMAC(n, δ, ϵ, α). Then F is α-DOPS with sample complexity at most M_PMAC(n, δ, 1 − (1 − ϵ)^{1/m}, α), i.e., M_DOPS(n, m, δ, ϵ, α) ≤ M_PMAC(n, δ, 1 − (1 − ϵ)^{1/m}, α).

*Proof.* Let f ∈ F, let D be some distribution, let S = {(S_i, f(S_i))}_{i=1}^M and T = {S_i}_{i=1}^m be the train and test sets, and let A be an algorithm that constructs f̃ which α-PMAC learns f with sample complexity M_PMAC(n, δ, ϵ, α). The DOPS algorithm that we analyze constructs f̃ with algorithm A using S and returns Ŝ = argmax_{S∈T} f̃(S). Fix ϵ, δ > 0 and α > 1 and consider M = M_PMAC(n, δ, 1 − (1 − ϵ)^{1/m}, α). By the definition of α-PMAC, we get that with probability 1 − δ over S,

P_{S∼D} [ f(S) ≤ f̃(S) ≤ α·f(S) ] ≥ (1 − ϵ)^{1/m}.

Next, we obtain

P_{T∼D^m} [ f(S) ≤ f̃(S) ≤ α·f(S) ∀ S ∈ T ] = P_{S∼D} [ f(S) ≤ f̃(S) ≤ α·f(S) ]^m ≥ 1 − ϵ,

where the equality is due to the sets S ∈ T being drawn i.i.d. from D, and the inequality holds with probability 1 − δ over S. Let S* = argmax_{S∈T} f(S). With probability 1 − ϵ over T and 1 − δ over S,

f(Ŝ) ≥ (1/α)·f̃(Ŝ) ≥ (1/α)·f̃(S*) ≥ (1/α)·f(S*).

We conclude that with M = M_PMAC(n, δ, 1 − (1 − ϵ)^{1/m}, α) samples, f(Ŝ) ≥ (1/α) max_{S∈T} f(S) with probability 1 − ϵ over T and 1 − δ over S. ∎
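To see where the quantity 1 − (1 − ϵ)^{1/m} comes from, here is the underlying arithmetic as a worked step (our illustration, not from the paper):

```latex
% Per-sample PMAC accuracy needed so that all m i.i.d. test sets are
% well approximated simultaneously with probability 1 - epsilon:
(1-\epsilon')^m = 1-\epsilon
\quad\Longrightarrow\quad
\epsilon' = 1-(1-\epsilon)^{1/m} \approx \frac{\epsilon}{m}
\quad\text{for small } \epsilon.
```

For instance, with ϵ = 0.1 and m = 100, the PMAC learner must be invoked with per-sample failure probability ϵ′ = 1 − 0.9^{1/100} ≈ 1.05 × 10⁻³, roughly ϵ/m.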
For tightness, we give an information-theoretic lower bound by constructing a difficult class F that cannot be in α-DOPS with fewer than M_PMAC(n, δ, 1 − (1 − ϵ)^{1/m}, α) samples.

**Theorem 2.** For all α > 1 and ϵ, δ > 0, and for m sufficiently large, there exists a family of functions F and a function M_PMAC(·) such that for all ϵ′, δ′ > 0: F is α-PMAC-learnable with sample complexity M_PMAC(n, δ′, ϵ′, α), and given strictly fewer than M_PMAC(n, δ, 1 − (1 − ϵ)^{1/m}, α) samples, F is not α-DOPS, i.e., M_DOPS(n, m, δ, ϵ, α) ≥ M_PMAC(n, δ, 1 − (1 − ϵ)^{1/m}, α).

*Proof sketch (see supplementary material for the full proof).* For each f in the difficult class F, only a single set S^⋆ has a high value, while all others have low values. We consider a uniformly random function f ∈ F and the corresponding randomized subclass F′ ⊆ F which consists of all functions f such that S^⋆ is in the test set but not in the train set. Informally, an algorithm which aims to optimize f ∈ F′ cannot use the train set to learn which S ∈ T is S^⋆. More precisely, if f ∈ F′, the decisions of the algorithm are independent of the randomization of f, conditioned on f ∈ F′. Thus, if f ∈ F′, the algorithm does not obtain an α-approximation, because of the gap between the value of S^⋆ and the other sets. We construct F and D such that S^⋆ is in the test set with probability greater than 1 − ϵ. This implies that to satisfy DOPS, the algorithm must observe enough samples so that S^⋆ is in the train set with probability at least 1 − δ. We then argue that this number of samples is at least M_PMAC(n, δ, 1 − (1 − ϵ)^{1/m}, α). ∎

### 2.2. If F is not PMAC-learnable then F is not in DOPS

A simple intuition for Theorem 1 is that if one can accurately predict the values of all S ∈ T, then it is possible to find the empirical argmax. The main result in this section, which is perhaps less intuitive, shows that the reverse implication also holds. Namely, if one can find the empirical argmax, then it is possible to infer the values of all sets in T. The contrapositive of this result is that if F is not PMAC-learnable, then F is not in DOPS. Combining both results provides a full characterization of distributional optimization in terms of learnability.

To construct a PMAC learner from a DOPS algorithm, we first randomly partition S into train and test sets. We then train the DOPS algorithm on the train set, and use it to generate pairwise comparisons with test elements. The learned value for S is given by the maximum value of a test sample that S beats (via the inferred comparisons). At a high level, the analysis uses the DOPS guarantees and a bucketing argument to satisfy the PMAC requirements.

**Theorem 3.** Let µ = max_S f(S) / min_{S : f(S)>0} f(S), let c be any constant such that 1 ≤ c ≤ α, and let M_µ = (8 log µ)/ϵ + 2 log(1/δ). If a class F is in (α/c)-DOPS with sample complexity M_DOPS(n, m, ϵ, δ, α/c), then it is α-PMAC-learnable with sample complexity M_µ + M_DOPS(n, 2, ϵ/M_µ, δ/M_µ, α/c), i.e., M_PMAC(n, ϵ, δ, α) ≤ M_µ + M_DOPS(n, 2, ϵ/M_µ, δ/M_µ, α/c).

*Proof.* Fix ϵ, δ > 0 and α > 1. Let S = {(S_i, f(S_i))}_{i=1}^M be the samples from D that are given as input. We partition the samples in S uniformly at random into S_1 and S_2 of sizes M_1 and M_2, respectively. For some S ∼ D, the goal is to predict f̃(S) such that f(S) ≤ f̃(S) ≤ α·f(S). For each S_i ∈ S_2, define S_{2,i} := {S_i, S}. Since F is in DOPS, with M_1 = M_DOPS(n, 2, ϵ/M_2, δ/M_2, α/c) samples, the algorithm outputs S̄_i ∈ S_{2,i} such that, with probability 1 − δ/M_2 over S_1 and 1 − ϵ/M_2 over S_{2,i}, f(S̄_i) ≥ (c/α)·max(f(S), f(S_i)). By a union bound, this holds for all i ≤ M_2 with probability 1 − δ over S_1 and probability 1 − ϵ over S and S_2.

We say that S beats S_i if the (α/c)-DOPS algorithm outputs S when given S_{2,i}. Let S′_2 be the collection of sets S_i ∈ S_2 such that S beats S_i. The learning algorithm outputs f̃(S) = c·max_{S_i∈S′_2} f(S_i).

Let f_min = min_S f(S) and f_max = max_S f(S). We partition the sets into buckets defined as follows: B_i := {S : f_min·c^{i−1} ≤ f(S) < f_min·c^i} for i ≥ 1, and B_0 = {S : f(S) = 0}. With β := log µ / log c buckets, every set S falls in some bucket since f_min ≤ f(S) ≤ f_max. We define a bucket B_i to be dense if a random set S ∼ D has non-negligible probability of falling in B_i, and sparse otherwise. More precisely, B_i is dense if Pr_{S∼D}[S ∈ B_i] ≥ ϵ/(2β). The set S is in a dense bucket with probability at least 1 − ϵ/2, since there are at most β sparse buckets and S is in each of them with probability at most ϵ/(2β) by the definition of a dense bucket.

With m samples, the expected number of samples in a dense bucket B_i is at least mϵ/(2β). With δ ≥ e^{−mϵ/(16β)} (and hence m ≥ 16β·log(1/δ)/ϵ), a standard concentration bound gives that with probability at least 1 − δ the number of samples in B_i is at least mϵ/(4β); we assume |B_i| ≥ (m/2)·(ϵ/2β) for the remainder of the proof. There is at most one set in bucket B_i that is beaten by all the other sets. Since the set S has equal probability of being any of the sets in B_i,¹ there is at least one other set S′ in B_i which S beats with probability at least 1 − 1/|B_i| ≥ 1 − 4β/(mϵ). With ϵ/2 ≥ 4β/(mϵ) (and hence m ≥ 8β/ϵ²), with probability at least 1 − ϵ over S ∼ D, the set S is in a dense bucket and beats at least one other set S′ ∈ S′_2 in that bucket. We get that

f̃(S) = c·max_{S_i∈S′_2} f(S_i) ≥ c·f(S′) ≥ f(S),

where the equality is by the definition of f̃(S), the first inequality is since S′ ∈ S′_2, and the last is since S and S′ are in the same bucket. We also have

α·f(S) ≥ c·max_{S_i∈S′_2} f(S_i) = f̃(S),

where the inequality is by the definition of S′_2 and the equality is by the definition of f̃(S). Thus f(S) ≤ f̃(S) ≤ α·f(S), and with M_2 = m ≤ (8 log µ)/ϵ + 2 log(1/δ) = M_µ, the total sample complexity is M_µ + M_DOPS(n, 2, ϵ/M_µ, δ/M_µ, α/c). ∎

¹ We assume that the DOPS algorithm breaks ties in a consistent manner, i.e., it cannot be adversarial and break ties depending on whether S is the set we wish to learn or whether S ∈ S_2.
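As a rough illustration of this reduction, the following sketch turns a DOPS black box into the value estimate f̃(S) used in the proof. The interfaces (`dops`, the sample layout) are hypothetical, and consistent tie-breaking is assumed as in the footnote.

```python
def pmac_estimate(dops, S1, S2, S, c):
    """Estimate f(S) as in the proof of Theorem 3.

    dops(train, T): a DOPS black box that returns one element of the
    pair T after training on the labeled samples `train`.
    S1: labeled train samples [(frozenset, value), ...].
    S2: labeled held-out samples used for pairwise comparisons.
    Returns c * max value among held-out sets that S beats (0 if none).
    """
    beaten = [v for (Si, v) in S2 if dops(S1, (S, Si)) == S]
    return c * max(beaten) if beaten else 0.0
```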
**Algorithm 1** DOPS(S = {(S_i, z_i)}_{i=1}^M, m, α)

1. Randomly partition [M] into N = M/m sets A_1, …, A_N
2. Create the m-tuple sample set S̄ = {(S̄_i, z̄_i)}_{i=1}^N from S, where S̄_i = {S_j}_{j∈A_i} and z̄_i = {z_j}_{j∈A_i}
3. Compute α(z̄_i) = {y ∈ [m] : z̄_{iy} ≥ (1/α)·max z̄_i} for all i ∈ [N]
4. θ̂ = argmin_{θ∈Θ} Σ_{i=1}^N max_{y∈[m]} [ 1{y ∉ α(z̄_i)} + f_θ(S̄_{iy}) − ψ_θ(S̄_i, z̄_i) ]_+, where ψ_θ(S̄, z̄) = (1/|α(z̄)|) Σ_{y∈α(z̄)} f_θ(S̄_y)
5. Return h_θ̂(T) = argmax_{S∈T} f_θ̂(S)

## 3. Learning to Optimize at Scale

In this section we give an efficient DOPS algorithm that applies to several interesting parametric submodular subclasses F_Θ = {f_θ : θ ∈ Θ}. Our general technique includes two steps. First, we identify a loss function whose minimization provides a sufficient condition for DOPS (Eq. (1)), but is in general hard to optimize. Then, we show that for the function classes we consider, a transformation of the inputs reveals structure which can be exploited for efficiently optimizing a convex surrogate loss. Note that in principle, due to Thm. 1, any PMAC algorithm can be used for DOPS. This, however, has several practical disadvantages, which we comment on in Sec. 3.5.
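A minimal Python rendering of Algorithm 1 for linear parameterizations f_θ(S) = ⟨φ(S), θ⟩, using plain subgradient descent in place of a dedicated convex solver; the function names and optimization hyperparameters are our own illustrative choices.

```python
import numpy as np

def dops_train(samples, m, alpha, phi, n_features, lr=0.1, epochs=100):
    """Sketch of Algorithm 1. `samples` is a list of (set, value) pairs;
    `phi(S)` returns the feature vector of S (length n_features)."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(samples))
    # Steps 1-2: partition the M samples into N = M/m m-tuples.
    tuples = [idx[i:i + m] for i in range(0, len(idx) - m + 1, m)]
    theta = np.zeros(n_features)
    for _ in range(epochs):
        grad = np.zeros(n_features)
        for A in tuples:
            feats = np.array([phi(samples[j][0]) for j in A])
            z = np.array([samples[j][1] for j in A])
            approx = z >= z.max() / alpha          # step 3: alpha(z)
            psi_feat = feats[approx].mean(axis=0)  # features of psi_theta
            # Step 4: hinge term max_y [l_alpha + f_theta(S_y) - psi]_+
            margins = (~approx) + feats @ theta - psi_feat @ theta
            y = int(np.argmax(margins))            # worst violator
            if margins[y] > 0:                     # hinge subgradient
                grad += feats[y] - psi_feat
        theta -= lr * grad / len(tuples)
    return theta

def dops_choose(theta, phi, T):
    """Step 5: the learned empirical optimizer h_theta over a tuple T."""
    return max(T, key=lambda S: phi(S) @ theta)
```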
We begin by illustrating our approach for coverage functions with parametric weights. We then describe our algorithm, prove its correctness, and show how it can be applied to other classes such as graph cuts, unit demand, and coverage functions with parametric cover sets.

### 3.1. Learning to optimize coverage functions

Coverage functions are a simple but important class of submodular functions, and have been used in applications such as computational linguistics (Sipos et al., 2012), algorithmic game theory (Dughmi & Vondrák, 2015), and influence maximization in social networks (Kempe et al., 2003). Let U be a ground set of d items, and C = {C_1, …, C_n} a collection of subsets with C_i ⊆ U. For a set of non-negative item weights θ = {θ_1, …, θ_d}, a function f_θ : 2^[n] → R is a coverage function if:

f_θ(S) = Σ_{u∈C(S)} θ_u,  C(S) = ∪_{i∈S} C_i    (3)

While apparently simple, coverage functions are quite expressive, and optimizing them from samples is known to be hard (Balkanski et al., 2017). One reason is that, as a function of their inputs S, coverage functions can be highly non-linear. Meanwhile, as a function of their parameters, they become linear via a simple transformation of the inputs:

f_θ(S) = ⟨φ(S), θ⟩,  φ_u(S) = 1{∃ i ∈ S s.t. u ∈ C_i}    (4)

This structure allows our algorithm to efficiently find the approximate empirical argmax of any given T with high probability. The output of the algorithm is a function h ∈ H for choosing one S out of the m candidates in T, where:

H = {h_θ(T) = argmax_{S∈T} f_θ(S) : θ ∈ Θ}    (5)

In this sense, our algorithm learns an empirical optimizer that is guaranteed to correctly optimize collections of size m drawn from D^m.

### 3.2. Algorithm

Pseudocode of our algorithm is given in Algorithm 1. The following theorem establishes its correctness for any parametric class of functions that can be made linear in its parameters using some transformation φ, namely F_Θ = {f_θ(S) = ⟨φ(S), θ⟩ : θ ∈ Θ}. As we show, this holds for several interesting submodular subclasses, including the coverage functions in Sec. 3.1 as well as all other classes presented in Sec. 3.3.

**Theorem 4.** Let m ∈ N and ϵ, δ ∈ [0, 1], and let f = f_θ* with θ* ∈ Θ. For a given α > 0, let h be the output of Algorithm 1 when given S = {(S_i, z_i)}_{i=1}^M, m, and α as input, where z_i = f_θ*(S_i) and S_i ∼_iid D. Then, with probability at least 1 − δ over S, it holds that:

P_{T∼D^m} [ f(h(T)) ≥ (1/α) max_{S∈T} f(S) ] ≥ 1 − ϵ    (6)

for M ∈ O(m(RB/ϵ)²), where R = max_S ‖φ(S)‖ and B = ‖θ*‖.

*Proof.* We begin with some notation. Let S̄ = {S_1, …, S_m} be a tuple of m examples with corresponding values z̄ = {z_1, …, z_m}, where z_y = f(S_y). Algorithm 1 returns a function h that chooses a set S_y ∈ S̄; it will be convenient to instead view h as a mapping from S̄ to indices y ∈ [m]. Denote the set of α-approximate solutions by:

α(z̄) = {y ∈ [m] : z̄_y ≥ (1/α)·max z̄}    (7)

Our analysis makes use of the following loss function:

ℓ_α(z̄, y) = 1{y ∉ α(z̄)}    (8)

Eq. (8) is useful since L(h) := E[ℓ_α(z̄, h(S̄))] ≤ ϵ implies that h satisfies Eq. (6). We therefore focus on bounding L(h). As we do not have access to D, our algorithm chooses an h ∈ H which instead minimizes the empirical loss. Note that while ℓ_α is defined over m-tuples, S contains individual sets. To ensure a consistent empirical loss, we randomly partition [M] into N = M/m disjoint sets A_1, …, A_N, and define an m-tuple sample set S̄ = {(S̄_i, z̄_i)}_{i=1}^N, where S̄_i = {S_y}_{y∈A_i} and z̄_i = {z_y}_{y∈A_i}. The loss is now:

L̂(h; S̄) = (1/N) Σ_{i=1}^N ℓ_α(z̄_i, ŷ_i),  ŷ_i = h(S̄_i)    (9)

Since ℓ_α is not convex, the algorithm instead optimizes a convex surrogate upper bound. There are many ways to do this; here we use an average hinge surrogate:

max_{y∈[m]} [ ℓ_α(z̄_i, y) + f_θ(S̄_{iy}) − ψ_θ(S̄_i, z̄_i) ]_+    (10)

where [a]_+ = max{0, a} and:

ψ_θ(S̄, z̄) = (1/|α(z̄)|) Σ_{y∈α(z̄)} f_θ(S̄_y)    (11)

Eq. (10) is similar in spirit to the loss in Lapin et al. (2015), and is tight w.r.t. Eq. (9) whenever L̂ = 0. Intuitively, minimizing Eq. (10) pushes θ towards values for which the true argmax is scored higher than all others by a margin. Note that the average in Eq. (11) can be replaced with a max to attain a tighter (though no longer convex) surrogate.

Since S is labeled by some f_θ* ∈ F_Θ, we have that L(h_θ*) = 0. This means that there is some θ ∈ Θ for which L̂(h_θ; S̄) = 0 and, due to the tightness of Eq. (10), the surrogate loss vanishes at θ as well. This is sufficient for applying the following generalization bound (Collins, 2004):

L(h) ≤ L̂(h; S̄) + O( √( ((RB log M)² + log(1/δ)) / (M/m) ) )

Plugging in M gives L(h) ≤ ϵ, concluding the proof. ∎

Eq. (10) is convex whenever f_θ is linear in θ for some representation φ. This holds for coverage functions (Eq. (4)) as well as for the other classes we consider in Sec. 3.3. Eq. (10) can then be optimized using standard convex solvers, or with highly efficient and scalable solvers such as the cutting-plane method of Joachims et al. (2009).
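As a concrete instance, here is a sketch of the coverage feature map of Eq. (4); the container types and argument names are our own choices.

```python
import numpy as np

def coverage_phi(S, cover_sets, d):
    """Eq. (4): phi_u(S) = 1 iff item u is covered by some C_i with
    i in S. `cover_sets[i]` is the set of items in C_i; the ground
    set is items 0..d-1."""
    phi = np.zeros(d)
    for i in S:
        phi[list(cover_sets[i])] = 1.0
    return phi
```

With this map, f_θ(S) = coverage_phi(S, C, d) @ θ is linear in θ, so it plugs directly into the Algorithm 1 sketch above.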
### 3.3. Other submodular classes

We now discuss how our method can be extended to other submodular function classes. For each class, we give a transformation φ of the inputs under which the function becomes linear in its parameters. Thm. 4 and Algorithm 1 can then be applied with the appropriate f_θ(S) = ⟨φ(S), θ⟩.

**Graph k-cuts:** Let G = (V, E) be an undirected graph, and let θ ∈ R_+^{|E|} be edge weights. For a partition P ∈ [k]^{|V|} of the nodes into k groups, its value is given by:

f_θ(P) = Σ_{(u,v)∈E : P_u ≠ P_v} θ_{uv}

While k-cut functions are known to be hard to optimize over P, they become linear in θ with the transformation (implemented in the sketch at the end of this subsection):

φ_{uv}(P) = 1{P_u ≠ P_v}  ∀(u, v) ∈ E

**Unit demand:** Let θ ∈ R_+^n be a set of item weights. The value of a subset S ⊆ [n] is given by:

f_θ(S) = max_{u∈S} θ_u

Although it is possible to write f_θ(S) = ⟨θ, φ(S)⟩ with φ_u(S) = 1{θ_u ≥ θ_v ∀v ∈ S}, this representation requires θ, which is unknown. Nonetheless, a similar data-dependent construction can still be used to obtain some θ′ which minimizes the loss. To see why, let S̃ ∈ S be the set with the highest value f_θ(S̃) in S. For this S̃, there must exist some u ∈ S̃ that is not in any other S ∈ S with f_θ(S) < f_θ(S̃). By setting φ_v(S̃) = 1{u = v} and θ′_u = f_θ(S̃), we ensure that f_θ(S̃) = ⟨θ′, φ(S̃)⟩. Note that this does not necessarily imply that θ′_u = θ_u. In a similar fashion, by setting

φ_u(S_i) = 1{u ∈ S_i and ∄ j ≠ i s.t. u ∈ S_j, z_j < z_i}

for every i ≤ M, we get that f_θ(S_i) = ⟨θ′, φ(S_i)⟩ for some θ′, which guarantees L̂ = 0. Note that generalization here concerns φ as applied to examples in both S and T.

**Coverage with parametrized cover sets:** Let U = [N] be a ground set of items with unit weights. The parameters are a collection of item subsets {C_1, …, C_n} with C_i ⊆ U. We use ξ_{iu} = 1{u ∈ C_i} and denote the maximal overlap by d = max_u Σ_i ξ_{iu}. For a subset S ⊆ [n], its value is the number of items it covers. While f_C is not linear over C, it can be linearized over a different parameterization. For x_i = 1{i ∈ S}, we have:

f_C(S) = Σ_{u∈U} ( 1 − Π_{i=1}^n (1 − x_i·ξ_{iu}) )

Since f_C is a polynomial of degree at most d, the explicit size of φ (and hence of the corresponding θ) is n^d. For computational efficiency, we can consider the dual form and implicitly define φ via the kernelized inner product:

⟨φ(S), φ(S′)⟩ = (⟨x_S, x_{S′}⟩ + 1)^d
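For concreteness, the k-cut transformation above as code (a sketch; the data layout is our assumption):

```python
import numpy as np

def cut_phi(P, edges):
    """Feature map for graph k-cuts: phi_uv(P) = 1 iff the partition
    separates u and v. `P` maps node -> group id; `edges` is a list
    of (u, v) pairs, yielding one feature per edge."""
    return np.array([1.0 if P[u] != P[v] else 0.0 for (u, v) in edges])
```

Here f_θ(P) = cut_phi(P, E) @ θ recovers the k-cut value when θ holds the true edge weights, so Algorithm 1 applies unchanged.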
### 3.4. Reducing the sample-complexity cost of m

Interestingly, at the cost of a small additional additive error, the dependence of the generalization bound on m can be removed by considering an alternative loss function. Fix some q ∈ [0, 1]. Given S, define Q to be the set of examples in the top q-quantile. The idea here is to learn θ so that f_θ will score top-quantile examples S ∈ Q above low-quantile examples S′ ∉ Q. The corresponding loss is therefore defined over example pairs:

ℓ_q(S, S′, f_θ) = 1{f_θ(S) < f_θ(S′)}
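A minimal sketch of a hinge relaxation of this pairwise loss under a linear parameterization; the relaxation, margin, and data layout are our assumptions rather than the paper's construction.

```python
import numpy as np

def pairwise_quantile_loss(theta, top_feats, low_feats, margin=1.0):
    """Hinge relaxation of l_q: penalize every pair in which a
    top-quantile example is not scored above a low-quantile one by
    `margin`. `top_feats`/`low_feats` hold phi(S) rows for S in Q
    and S' not in Q, respectively."""
    s_top = top_feats @ theta   # scores of examples in Q
    s_low = low_feats @ theta   # scores of examples outside Q
    gaps = margin - (s_top[:, None] - s_low[None, :])
    return np.maximum(gaps, 0.0).mean()
```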