Parallel Submodular Function Minimization

Deeparnab Chakrabarty (Dartmouth College, Hanover, USA; deeparnab@dartmouth.edu), Andrei Graur (Stanford University, Stanford, USA; agraur@stanford.edu), Haotian Jiang (Microsoft Research, Redmond, USA; jhtdavid96@gmail.com), Aaron Sidford (Stanford University, Stanford, USA; sidford@stanford.edu)

Supported in part by NSF grant #2041920. Supported in part by NSF CAREER Award CCF-1844855, NSF Grant CCF-1955039, and a Stanford Management Science & Engineering Department Nakagawa Fellowship. Supported in part by a Microsoft Research Faculty Fellowship, NSF CAREER Award CCF-1844855, NSF Grant CCF-1955039, a PayPal research award, and a Sloan Research Fellowship.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

We consider the parallel complexity of submodular function minimization (SFM). We provide a pair of methods which obtain two new query versus depth trade-offs for a submodular function defined on subsets of $n$ elements that has integer values between $-M$ and $M$. The first method has depth $2$ and query complexity $n^{O(M)}$, and the second method has depth $\tilde{O}(n^{1/3}M^{2/3})$ and query complexity $O(\mathrm{poly}(n, M))$. Despite a line of work on improved parallel lower bounds for SFM, prior to our work the only known algorithms for parallel SFM either followed from more general methods for sequential SFM or from highly-parallel minimization of convex $\ell_2$-Lipschitz functions. Interestingly, to obtain our second result we provide the first highly-parallel algorithm for minimizing $\ell_\infty$-Lipschitz functions over the hypercube which obtains near-optimal depth at constant accuracy.

1 Introduction

A function $f : 2^{[n]} \to \mathbb{Z}$ is submodular if it has the diminishing marginal return property that $f(S \cup \{i\}) - f(S) \ge f(T \cup \{i\}) - f(T)$ for all elements $i \in [n]$ and subsets $S \subseteq T \subseteq [n] \setminus \{i\}$, i.e., the function value increase of adding $i$ to a set $S$ is at least that of adding $i$ to a superset $T$ that does not contain $i$. Submodular functions are used to model a range of problems arising in machine learning [KG05, KNTB09, KC10], operations research [FG88, QS95], and economics [Top98]. The submodular function minimization (SFM) problem consists of finding the subset $S$ with the smallest $f(S)$, given evaluation oracle access to the submodular function $f$, using as few queries as possible.

SFM has numerous applications. For example, in natural language processing SFM has played a key role in speech analysis, where [LB11, JB11] modeled the task of optimally selecting terse, yet diverse, training data out of a large speech dataset as a SFM problem; in computer vision, the task of energy minimization was reduced to SFM [BVZ01, KKT08, KT10, JBS13, KBJ+15].

Seminal work of [GLS81] established that SFM can be solved with a polynomial number of queries and in polynomial time. In the decades since, there has been extensive research characterizing the query and computational complexity of SFM. When $f$ is integer valued with values between $-M$ and $M$, the state-of-the-art query complexities are $\tilde{O}(nM^2)$ due to [CLSW17] and [ALS20], $\tilde{O}(n^2\log M)$ and $\tilde{O}(n^3)$, both due to [LSW15] (the latter query complexity being improved by a factor of $\log^2 n$ in [Jia21]), and $O(n^2\log n)$ (with an exponential runtime) due to [Jia22]. The largest query complexity lower bound is $\Omega(n\log n)$ for deterministic algorithms due to [CGJS22]. The algorithms underlying the above results are highly sequential.
State-of-the-art SFM algorithms use at least a linear number of rounds of queries to the evaluation oracle ($\tilde{O}(nM^2)$ rounds in [ALS20] and $O(n\log n)$ rounds in [Jia22]). Although there is a trivial 1-round algorithm that queries all $2^n$ inputs of $f$ in parallel, all known polynomial time SFM algorithms use $\Omega(n)$ rounds. With the prevalence of parallel computation in practice and applications of SFM to problems involving massive data sets, recently there has been an increasing number of works which study the parallel complexity of SFM, i.e., the smallest number of rounds a SFM algorithm must take.

Starting with [BS20], significant progress has been made in proving lower bounds on the parallel complexity of SFM. It is now known that any SFM algorithm making polynomially many queries needs $\tilde{\Omega}(n^{1/3})$ rounds when $M = \Theta(n)$ [CCK21] and $\tilde{\Omega}(n)$ rounds when $M = n^{\Theta(n)}$ [CGJS22]. In contrast to the significant progress on lower bounds, less progress has been made on designing SFM algorithms with small parallel complexity. This is due in part to the established lower bounds; as noted above, if we desire range-independent parallel complexity bounds, i.e., independent of the range $M$, then it is impossible to obtain an $O(n^{1-\varepsilon})$-round SFM algorithm, for any constant $\varepsilon > 0$. The central question motivating this paper is what non-trivial parallel speedups for SFM are possible if we allow methods with a range-dependent parallel complexity.

Is it possible to obtain query-efficient $o(n)$-round, $M$-dependent algorithms for SFM?

In this paper we provide positive answers to the above question. Our first result is an algorithm that runs in $\tilde{O}(n^{1/3}M^{2/3})$ rounds with query complexity $\tilde{O}(n^2M^2)$. To achieve this result, we first provide a generic reduction from parallel SFM to parallel convex optimization (a well studied problem discussed below). While naively this approach yields a sublinear bound of $\tilde{O}(n^{2/3}M^{2/3})$ rounds, we show how to further improve these convex optimization methods in our setting. Our second result is a simple 2-round SFM algorithm (Algorithm 1) with query complexity $n^{O(M)}$. For constant $M$, the parallel complexity of $2$ is optimal among the class of query-efficient algorithms that query the minimizer [CLSW17]. It is instructive to contrast our second result to the lower bound in [CCK21], where it is proved that when $M = n$, any algorithm with query complexity $n^{M^{1-\delta}}$ for any constant $\delta > 0$ must proceed in $n^{\Omega(\delta)}$ rounds.

Highly Parallel Convex Optimization. Motivated by applications to distributed and large scale optimization [BPC+11, GR18], the question of highly parallel convex optimization has received significant attention in the last decade. Formally, the task is to find an (approximate) minimizer of a convex function that is Lipschitz in some norm, using as few rounds of $O(\mathrm{poly}(n))$ parallel queries to a subgradient oracle as possible. Over the past few years, there has been progress on both upper [DBW12, SBB+18, BJL+19] and lower bounds [Nem94, BS18, DG19] for this problem. This line of work is particularly relevant as SFM reduces to minimizing the Lovász extension (see Fact 2.4) [Lov83] over $[0,1]^n$, which is convex and $O(M)$-Lipschitz in $\ell_\infty$ (see Definition 2.1). In Section 2.1 we provide a straightforward reduction from this problem to unconstrained minimization of an $O(L)$-Lipschitz function in $\ell_\infty$ where the minimizer has $\ell_\infty$-norm $O(1)$. Consequently, improved parallel $\ell_\infty$-Lipschitz convex optimization algorithms can imply improved parallel SFM algorithms.
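To make the arithmetic behind this implication explicit (our own back-of-the-envelope instantiation of the parameters discussed above, with constants suppressed): the Lovász extension of a submodular function with integer values in $[-M, M]$ is $O(M)$-Lipschitz in $\ell_\infty$, and an approximate minimizer of accuracy below $1$ recovers an exact SFM solution, so running an $\ell_\infty$ method of round complexity $\tilde{O}(n^{1/3}\varepsilon^{-2/3})$ on the extension scaled by $1/M$ at accuracy $\varepsilon = \Theta(1/M)$ gives
$$\tilde{O}\bigl(n^{1/3}\varepsilon^{-2/3}\bigr)\Big|_{\varepsilon = \Theta(1/M)} = \tilde{O}\bigl(n^{1/3}M^{2/3}\bigr) \ \text{rounds},$$
which is the round complexity of our first result.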
Many prior algorithms for highly parallel convex optimization focused on convex functions that are $\ell_2$-Lipschitz, and some were written only for unconstrained optimization problems. Naively using these algorithms for $\ell_\infty$ yields parallel complexities of $\tilde{O}(n^{3/4}/\varepsilon)$ [DBW12] and $\tilde{O}(n^{2/3}/\varepsilon^{2/3})$ [BJL+19, CJJ+23] (see footnote 5) for convex functions that are $\ell_\infty$-Lipschitz. Our first SFM result follows from an improved algorithm that we develop that finds $\varepsilon$-approximate minimizers for convex $\ell_\infty$-Lipschitz functions in $\tilde{O}(n^{1/3}/\varepsilon^{2/3})$ rounds. Interestingly, for constant $\varepsilon > 0$, the dependence on $n$ in this improved parallel complexity is optimal up to logarithmic factors; [Nem94] proved a lower bound of $\tilde{\Omega}(n^{1/3}\ln(1/\varepsilon))$ on the round complexity of any algorithm obtaining an $\varepsilon$-approximate minimizer over $[0,1]^n$ for $\ell_\infty$-Lipschitz convex functions. For constant $\varepsilon > 0$, this lower bound also applies to unconstrained $\ell_\infty$-optimization [DG19].

Footnote 4: Throughout, we use $\tilde{O}(f(n))$ to hide $\mathrm{poly}(\log f(n), \log n)$ factors. Footnote 5: [BJL+19, CJJ+23] study the $\ell_2$ setting. To obtain the corresponding $\tilde{O}(n^{2/3}/\varepsilon^{2/3})$ result for $\ell_\infty$, we combine their result with Lemma 2.3 given in this paper.

SFM:
| Paper | Year | Parallel Rounds |
| [ALS20] | 2020 | $\tilde{O}(nM^2)$ |
| [JLSW20] | 2020 | $\tilde{O}(n\log M)$ |
| [Jia22] | 2021 | $\tilde{O}(n)$ |
| This paper | 2023 | $\tilde{O}(n^{1/3}M^{2/3})$ |

$\ell_\infty$-optimization:
| Paper | Year | Parallel Rounds |
| Subgradient Descent | 1960s | $O(n/\varepsilon^2)$ |
| Cutting Plane | 1965 | $\tilde{O}(n\log(1/\varepsilon))$ |
| [BJL+19, CJJ+23]* | 2019 | $\tilde{O}(n^{2/3}/\varepsilon^{2/3})$ |
| This paper | 2023 | $\tilde{O}(n^{1/3}/\varepsilon^{2/3})$ |

Table 1: State-of-the-art parallel complexity for SFM and $\ell_\infty$-optimization. See Section 1.2 for references on cutting plane methods. *The $\tilde{O}(n^{2/3}/\varepsilon^{2/3})$ result uses Lemma 2.3 from this paper.

Notation. We let $[n] := \{1, \ldots, n\}$ for any $n \in \mathbb{Z}_{>0}$. We let $I_n$ denote the identity matrix in $\mathbb{R}^{n \times n}$.

1.1 Problems, Results, and Approach

Here we formally define the problems we consider, present our results, and discuss some of the key insights of our approach. This section is organized as follows. We begin by presenting the parallel computation model. Then, we introduce the parallel $\ell_p$-Lipschitz convex optimization problem, which is closely tied to parallel SFM. We present our improvement for the $\ell_\infty$-Lipschitz setting and offer a brief overview of our techniques in obtaining this result. Finally, we formally introduce the SFM setup, along with the new results we obtain (Theorem 1.3 and Theorem 1.4).

Parallel Complexity Model. We consider the standard black-box query model for optimization, where we are given a convex function $f : D \to \mathbb{R}$, with domain $D \subseteq \mathbb{R}^n$, accessed through an oracle. Parallel algorithms for minimizing $f$ proceed in rounds, where in each round the algorithm can submit a set of queries to the oracle in parallel. The parallel complexity of an algorithm is the total number of rounds it uses, and it captures how sequential the algorithm is. Additionally, we consider the query complexity of an algorithm, which is the total number of queries it makes.

Parallel $\ell_p$-Lipschitz Convex Optimization. In Section 2 we consider the problem of minimizing a convex function $f : \mathbb{R}^n \to \mathbb{R}$ given access to a subgradient oracle $g : \mathbb{R}^n \to \mathbb{R}^n$, which when queried with a point $x \in \mathbb{R}^n$ returns a vector $g(x)$ that is a subgradient, denoted $g(x) \in \partial f(x)$, where $\partial f(x)$ is the set of all subgradients of $f$ at $x$, i.e., $v \in \partial f(x)$ if and only if $f(y) \ge f(x) + v^\top(y - x)$ for all $y \in \mathbb{R}^n$. Furthermore, we assume that $f$ is $L$-Lipschitz with respect to a norm $\|\cdot\|$.

Definition 1.1 ($L$-Lipschitzness with respect to a given norm). We say that $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-Lipschitz with respect to a norm $\|\cdot\|$ if $|f(x) - f(y)| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^n$.
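As a simple worked illustration of this definition (ours, not from the paper), one direction of the subgradient characterization used later in Section 2 (a convex $f$ is $L$-Lipschitz in $\ell_\infty$ when all of its subgradients satisfy $\|g\|_1 \le L$) follows from convexity and Hölder's inequality:
$$f(x) - f(y) \le g_x^\top(x - y) \le \|g_x\|_1\,\|x - y\|_\infty \le L\,\|x - y\|_\infty \qquad \text{for any } g_x \in \partial f(x),$$
and symmetrically with $x$ and $y$ swapped; this is how the $\ell_1$ bound on Lovász-extension subgradients in Fact 2.4 translates into $\ell_\infty$-Lipschitzness.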
If $f$ is $O(1)$-Lipschitz with respect to a norm $\|\cdot\|$, we say that $f$ is $\|\cdot\|$-Lipschitz. When the norm is $\|\cdot\|_p$ we alternatively say that $f$ is $L$-Lipschitz in $\ell_p$ and that $f$ is $\ell_p$-Lipschitz, respectively. There is a broad line of work studying the parallel complexity of $\ell_p$-Lipschitz convex optimization in which the goal is to efficiently compute an $\varepsilon$-approximate minimizer (i.e., a point $y$ with $f(y) \le \inf_x f(x) + \varepsilon$) of an $\ell_p$-Lipschitz convex function where the minimizer either has $\ell_p$-norm $O(1)$ or the problem is constrained to the $\ell_p$-norm ball of radius $O(1)$. The case when $p = 2$ is perhaps the most well studied, and our new result regarding $\ell_\infty$-Lipschitz convex optimization builds upon a result in [CJJ+23], Theorem 1, which considers this setting. The statement of our result is below:

Theorem 1.2 (Parallel Convex Optimization in $\ell_\infty$). There is an algorithm that, when given a subgradient oracle for convex $f : \mathbb{R}^n \to \mathbb{R}$ that is $1$-Lipschitz in $\ell_\infty$ and has a minimizer $x^\star$ with $\|x^\star\|_\infty \le 1$, computes an $\varepsilon$-approximate minimizer of $f$ in $\tilde{O}(n^{1/3}\varepsilon^{-2/3})$ rounds and $\tilde{O}(n\varepsilon^{-2})$ queries.

In fact, we obtain a more general result that solves stochastic variants of parallel $\ell_\infty$-optimization (see Theorem 2.2), and Theorem 1.2 is an important corollary of this more general result. Our parallel convex optimization algorithms build on machinery developed for highly-parallel algorithms for minimizing convex functions that are $\ell_2$-Lipschitz. These methods consider a convolution of $f$ with a centered Gaussian with covariance $\rho^2 I_n$ (also referred to as Gaussian smoothing), and then apply optimization methods [CJJ+23, ACJ+21, BJL+19] to this smooth function. By leveraging the properties of this smoothing and the convergence rate of the associated optimization methods, they obtain their parallel complexities.

Since functions which are $\ell_\infty$-Lipschitz are also $\ell_2$-Lipschitz, the above algorithms also apply to $\ell_\infty$-Lipschitz functions, but they give us a suboptimal dependence of $n^{2/3}$ on the dimension. We improve the dependence on dimension by utilizing the $\ell_\infty$-Lipschitzness of our function to add more Gaussian smoothing. This allows us to obtain an $n^{1/3}$ dependence on the dimension, which is optimal up to polylogarithmic factors for constant $\varepsilon$ [Nem94, DG19]. The key observation is that convolving an $\ell_\infty$-Lipschitz function with a Gaussian of covariance $\rho^2 I_n$ changes the function value by no more than $O(\rho\sqrt{\log n})$ (see Lemma 2.6), whereas for $\ell_2$-Lipschitz functions it could change the function value by $O(\rho\sqrt{n})$.

Submodular Function Minimization. In SFM, we assume the submodular function $f : 2^{[n]} \to \mathbb{Z}$ is given by an evaluation oracle, which when queried with $S \subseteq [n]$, returns the value of $f(S)$. Throughout the paper, we assume that $f(S) \in [-M, M]$ for some $M \in \mathbb{Z}_{>0}$, and $f(\emptyset) = 0$. The assumption that $f(\emptyset) = 0$ can be made without loss of generality by instead minimizing $f'(S) := f(S) - f(\emptyset)$; this transformation can be implemented with one query and moves the range of $f$ by at most $M$, turning any dependence on $M$ in an algorithm's complexity into a dependence on $2M$. Note that this submodular function minimization setup is different from the setup of parallel convex optimization, as $f$ is only defined on the vertices of the unit hypercube. Nonetheless, it is known that there is a convex function $f_{\mathrm{Lov}}$ defined on $[0,1]^n$, known as the Lovász extension, such that optimizing $f_{\mathrm{Lov}}$ suffices for optimizing $f$. Additionally, it is known how to compute a subgradient of $f_{\mathrm{Lov}}$ at any point $x$ in 1 round using at most $n$ evaluation queries to $f$ (as highlighted in Fact 2.4).
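To make this one-round subgradient computation concrete, here is a minimal Python sketch (our own illustration of the standard construction restated later in Fact 2.4; the function name and oracle interface are ours). In a genuinely parallel implementation the $n+1$ evaluation queries below would be issued as a single batch.

```python
import numpy as np

def lovasz_subgradient(f_oracle, x):
    """Subgradient of the Lovasz extension at x (see Fact 2.4).

    f_oracle(S) evaluates the submodular function on a Python set S;
    the n+1 evaluations below form one parallel round of queries.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: -x[i])  # sort coordinates decreasingly
    prefixes = [set(order[:k]) for k in range(n + 1)]
    vals = [f_oracle(S) for S in prefixes]         # queries; batch these in parallel
    g = np.zeros(n)
    for k, i in enumerate(order, start=1):
        g[i] = vals[k] - vals[k - 1]               # marginal f(S_k) - f(S_{k-1})
    return g

# Toy check: f(S) = min(|S|, 1) is submodular and its Lovasz extension is max_i x_i.
f = lambda S: min(len(S), 1)
print(lovasz_subgradient(f, np.array([0.2, 0.9, 0.4])))  # prints [0. 1. 0.]
```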
Now, we are ready to present our first result on parallel SFM.

Theorem 1.3 (Sublinear Parallel SFM). There is an algorithm that, when given an evaluation oracle for submodular $f : 2^{[n]} \to \mathbb{Z}$ with $f(\emptyset) = 0$ and $|f(S)| \le M$ for all $S \subseteq [n]$, finds a minimizer of $f$ in $\tilde{O}(n^{1/3}M^{2/3})$ rounds and $\tilde{O}(n^2M^2)$ queries.

Later, in Section 2.1, we provide a more general version of this theorem, Theorem 2.5, which gives improved parallel complexities for approximately minimizing bounded, real-valued submodular functions. As discussed, Theorem 1.3 is obtained by using and enhancing tools for optimizing Lipschitz convex functions with a subgradient oracle. We in fact prove a more general result, namely that $\tilde{O}(n^{1/3}/\varepsilon^{2/3})$ rounds and $\tilde{O}(n^2/\varepsilon^2)$ queries are sufficient to find an $\varepsilon M$-approximate minimizer (Theorem 2.5). Since the function is integer valued, approximating the scaled Lovász extension to accuracy $\varepsilon = \Theta(1/M)$ gives the exact minimizer of the submodular function.

Our proof of Theorem 2.5 follows from our new result on $\ell_\infty$-convex optimization (Theorem 1.2) by applying it to a scaled version of the Lovász extension; it is known that if $f$ is a submodular function with range in $\mathbb{Z} \cap [-M, +M]$, then the Lovász extension scaled by $O(1/M)$ is a convex function which is $\ell_\infty$-Lipschitz. However, it is important to note that SFM is only equivalent to constrained minimization of the Lovász extension over $[0,1]^n$, while Theorem 1.2 is unconstrained (e.g., it applies to minimizing over $\mathbb{R}^n$). To apply Theorem 1.2 in the context of SFM, we give a general reduction from constrained to unconstrained optimization by adding a regularizer that restricts the minimizer of the regularized function to the constrained set (see Lemma 2.3). This reduction is a generic technique and might be of independent utility.

Parallel SFM in Two Rounds. Our second SFM result is a simple combinatorial 2-round algorithm which is efficient for functions of constant range.

Theorem 1.4 (Two-round Parallel SFM). There is an algorithm (Algorithm 1) that, when given an evaluation oracle for submodular $f : 2^{[n]} \to \mathbb{Z}$ with $f(\emptyset) = 0$ and $|f(S)| \le M$ for all $S \subseteq [n]$, finds a minimizer of $f$ in 2 rounds and $O(n^{M+1})$ queries.

The algorithm proving Theorem 1.4 relies on two key observations. First, if $S^\star$ is the minimizer of maximum size, then for every subset $T \subseteq S^\star$ and $i \in [n]$ with $f(T \cup \{i\}) \le f(T)$, we have $i \in S^\star$. In other words, every element with a non-positive marginal at a subset $T \subseteq S^\star$ is also contained in $S^\star$. This leads to the idea of augmenting a set $T$ to the set $A(T) = T \cup \{i : f(T \cup \{i\}) \le f(T)\}$. Secondly, every subgradient $g \in \partial f_{\mathrm{Lov}}(x)$ of the Lovász extension has at most $M$ entries that are strictly positive. This ensures that there exists an $M$-sparse subset $T \subseteq S^\star$ with the property that $f(T \cup \{i\}) \le f(T)$ for all $i \in S^\star \setminus T$. Consequently, our algorithm proceeds by augmenting all $M$-sparse sets, as it is guaranteed that one of these augmented sets is the maximal minimizer (see Section 3 for more details).
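As an informal sketch of this two-round scheme (ours; Algorithm 1 in Section 3 is the formal version, and the proof of Theorem 1.4 establishes correctness), the following Python code enumerates all $M$-sparse sets, augments each by its non-positive-marginal elements, and returns the best augmented set. The oracle interface is assumed; the marginal queries and the final evaluations correspond to the two rounds.

```python
from itertools import combinations

def augmenting_sets(f_oracle, n, M):
    """2-round SFM sketch: augment every M-sparse set and keep the best.

    Round 1: evaluate f on all subsets of size at most M + 1 (base values
    and marginals).  Round 2: evaluate f on each augmented set A(S).
    In the parallel model each round's queries would be issued as one
    batch; here they are issued sequentially for clarity.
    """
    ground = range(n)
    sparse = [set(S) for k in range(M + 1) for S in combinations(ground, k)]

    augmented = []
    for S in sparse:                                   # round-1 queries
        fS = f_oracle(S)
        A = S | {i for i in ground if i not in S and f_oracle(S | {i}) <= fS}
        augmented.append(A)

    return min(augmented, key=f_oracle)                # round-2 queries

# Toy usage: a modular (hence submodular) f(S) = sum of weights over S,
# with weights [+1, -1, 0, -1], so f(empty) = 0 and |f(S)| <= M = 2.
w = [1, -1, 0, -1]
f = lambda S: sum(w[i] for i in S)
print(augmenting_sets(f, n=4, M=2))  # a minimizer of value -2, e.g. {1, 2, 3}
```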
1.2 Related Work

Parallel Convex Optimization. As mentioned in Section 1.1, there are a number of works studying the parallel complexity of $\ell_p$-Lipschitz convex optimization. Perhaps the most well-studied case is that of $p = 2$. In this case, the classic subgradient descent algorithm achieves a parallel complexity of $O(\varepsilon^{-2})$ rounds and the standard cutting-plane methods achieve a parallel complexity of $O(n\log(1/\varepsilon))$ rounds. [DBW12, BJL+19, CJJ+23] improved upon this rate, achieving parallel complexities of $\tilde{O}(n^{1/4}\varepsilon^{-1})$ [DBW12] and $\tilde{O}(n^{1/3}\varepsilon^{-2/3})$ [BJL+19, CJJ+23], respectively. The implications of these results for the $p = \infty$ case, which is the object of study in our paper, were discussed earlier, and we are unaware of works on alternative upper bounds for $p = \infty$.

In terms of lower bounds, the $p = \infty$ case was studied in the prescient paper of [Nem94], which obtains a $\tilde{\Omega}(n^{1/3}\ln(1/\varepsilon))$ lower bound for minimizing $\ell_\infty$-Lipschitz functions over the $\ell_\infty$-ball (see also [DG19]). When $\varepsilon$ is a constant, our upper bound matches this lower bound, though our dependence on $\varepsilon$ is polynomial instead of logarithmic. For the $p = 2$ case, [Nem94, BS18] proved a tight lower bound of $\Omega(1/\varepsilon^2)$ on the parallel complexity when $1/\varepsilon^2 \le \tilde{O}(n^{1/3})$, i.e., the parallel complexity of subgradient descent is optimal up to $\tilde{O}(n^{1/3})$ rounds of queries. This was later improved by [BJL+19], which showed that subgradient descent is optimal up to $\tilde{O}(n^{1/2})$ rounds. [DG19] considered the general $p$ case (and other non-Euclidean settings) and proved a lower bound of $\Omega(\varepsilon^{-p})$ on the parallel complexity for $2 \le p < \infty$, $\Omega(\varepsilon^{-2})$ for $1 < p \le 2$, and $\Omega(\varepsilon^{-2/3})$ for $p = 1$. This paper also has lower bounds for smooth convex functions.

Submodular Function Minimization. We now expand on the history of SFM upper and lower bounds for parallel and sequential algorithms touched upon earlier. Since the seminal work of Edmonds in 1970 [Edm70], there has been extensive work [GLS81, GLS88, Sch00, IFF01, Iwa03, FI00, Vyg03, Orl09, IO09, LSW15, CLSW17, ALS20, DVZ21, Jia21, Jia22] on developing query-efficient algorithms for SFM. [GLS81, GLS88] gave the first polynomial time algorithms using the ellipsoid method. The state-of-the-art SFM algorithms include an $\tilde{O}(n^2)$-query exponential time algorithm due to [Jia22], $\tilde{O}(n^3)$-query polynomial time algorithms due to [Jia21, JLSZ23], an $\tilde{O}(n^2\log M)$-query polynomial time algorithm due to [LSW15], and an $\tilde{O}(nM^2)$-query polynomial time algorithm due to [ALS20]. Despite these algorithmic improvements, limited progress has been made on lower bounding the query complexity of SFM, and the best known lower bound has been $\Omega(n)$ for decades [Har08, CLSW17, GPRW20]. Very recently, [CGJS22] proved an $\Omega(n\log n)$ lower bound for deterministic SFM algorithms.

All the algorithms above are highly sequential and proceed in at least $n$ rounds. The question of parallel complexity for SFM was first studied in [BS20] where an $\Omega(\log n/\log\log n)$ lower bound on the number of rounds required by any query-efficient SFM algorithm was given. The range $M$ in their construction is $M = n^{\Theta(n)}$. Subsequently, [CCK21] proved a $\tilde{\Omega}(n^{1/3})$ lower bound on the round-complexity, and the range is $M = n$ for their functions. Recently, [CGJS22] described a $\tilde{\Omega}(n)$ lower bound with functions of range $M = n^{\Theta(n)}$.

Cutting plane methods. Cutting plane methods are a class of optimization methods that minimize a convex function by iteratively refining a convex set containing the minimizer. Since the center of gravity method was developed independently in [Lev65, New65], there have been many developments of faster cutting plane methods over the last six decades [Sho77, YN76, Kha80, KTE88, NN89, Vai89, BV04, LSW15], with the state-of-the-art due to [JLSW20].

2 Minimizing $\ell_\infty$-Lipschitz Functions and Submodular Functions

In this section we provide a new, improved parallel algorithm for convex optimization in $\ell_\infty$ and show how to use these algorithms to obtain an improved parallel algorithm for SFM. In much of this section, we consider the following optimization problem, which we term stochastic convex optimization in $\ell_\infty$.
As we discussed, this problem generalizes parallel convex optimization in $\ell_\infty$. The problem is more general in terms of the norms it considers and how it allows for stochastic gradients; we consider it as it could be useful more broadly and as it perhaps more tightly captures the performance of our optimization algorithm.

Definition 2.1 (Stochastic Convex Optimization in $\ell_\infty$). In the stochastic convex optimization in $\ell_\infty$ problem we have a (stochastic) subgradient oracle $g : \mathbb{R}^n \to \mathbb{R}^n$ such that $\mathbb{E}[g(x)] \in \partial f(x)$ and $\mathbb{E}\|g(x)\|_2^2 \le \sigma^2$ for a convex function $f : \mathbb{R}^n \to \mathbb{R}$ that is $L$-Lipschitz in $\ell_\infty$. Given the guarantee that $f$ has a minimizer $x^\star \in \mathbb{R}^n$ with $\|x^\star\|_2 \le R$, our goal is to compute an $\varepsilon$-approximate minimizer of $f$, i.e., $x \in \mathbb{R}^n$ with $f(x) \le f(x^\star) + \varepsilon$.

Note that in Definition 2.1, $\ell_\infty$ appears only to determine the norm in which $f$ is Lipschitz. However, the bound on $x^\star$ is in $\ell_2$, and it can be easily converted to one in terms of $\ell_\infty$ by using that $\|x\|_2 \le \sqrt{n}\|x\|_\infty$. Furthermore, a convex function $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-Lipschitz in $\ell_\infty$ if and only if $\|g\|_1 \le L$ for all $g \in \partial f(x)$ and $x \in \mathbb{R}^n$. Since $\|g\|_2 \le \|g\|_1$, we see that this stochastic convex optimization problem subsumes the (non-stochastic) problem of computing an $\varepsilon$-approximate minimizer of a convex function that is $L$-Lipschitz in $\ell_\infty$ given a (deterministic) subgradient oracle. We define this more general problem because, interestingly, our algorithm tolerates this weaker stochastic oracle without any loss (as we discussed). Our main result regarding stochastic convex optimization in $\ell_\infty$ is given in the following theorem.

Theorem 2.2 (Stochastic Convex Optimization in $\ell_\infty$). There is an algorithm that solves the stochastic convex optimization problem in $\ell_\infty$ (Definition 2.1) in $\tilde{O}((LR/\varepsilon)^{2/3})$ rounds and $\tilde{O}((\sigma R/\varepsilon)^2)$ queries.

Due to the aforementioned connection between $\ell_\infty$-Lipschitz continuity and bounds on the subgradient, and the fact that $\|x\|_\infty \le 1$ implies $\|x\|_2 \le \sqrt{n}$, Theorem 2.2 immediately yields a $\tilde{O}(n^{1/3}\varepsilon^{-2/3})$-round, $\tilde{O}(n\varepsilon^{-2})$-query algorithm for the problem of minimizing a convex function that is $1$-Lipschitz in $\ell_\infty$ and minimized at a point with $\ell_\infty$-norm at most $1$. As discussed in the introduction, the parallel complexity of this algorithm is near-optimal for constant $\varepsilon$ [Nem94].

Theorem 1.2 (Parallel Convex Optimization in $\ell_\infty$). There is an algorithm that, when given a subgradient oracle for convex $f : \mathbb{R}^n \to \mathbb{R}$ that is $1$-Lipschitz in $\ell_\infty$ and has a minimizer $x^\star$ with $\|x^\star\|_\infty \le 1$, computes an $\varepsilon$-approximate minimizer of $f$ in $\tilde{O}(n^{1/3}\varepsilon^{-2/3})$ rounds and $\tilde{O}(n\varepsilon^{-2})$ queries.

In Section 2.1, we show how to use Theorem 2.2 to obtain our results for SFM. We then present the ingredients in the proof of Theorem 2.2 (which we defer to Appendix A) in Section 2.2. As part of our reduction from SFM to stochastic convex optimization in $\ell_\infty$ in Theorem 2.2, we provide a general tool for reducing constrained to unconstrained minimization (Lemma 2.3); we use this lemma to facilitate our results in both sections.

2.1 From Unconstrained Convex Optimization in $\ell_\infty$ to SFM

Here we show how to use Theorem 1.2 to prove the following theorem regarding SFM.

Theorem 1.3 (Sublinear Parallel SFM). There is an algorithm that, when given an evaluation oracle for submodular $f : 2^{[n]} \to \mathbb{Z}$ with $f(\emptyset) = 0$ and $|f(S)| \le M$ for all $S \subseteq [n]$, finds a minimizer of $f$ in $\tilde{O}(n^{1/3}M^{2/3})$ rounds and $\tilde{O}(n^2M^2)$ queries.

A key ingredient of our proof is the following general, simple technical tool which allows one to reduce constrained Lipschitz optimization over a ball in any norm to unconstrained minimization with only a very mild increase in parameters.

Lemma 2.3.
Let $f : \mathbb{R}^n \to \mathbb{R}$ be convex and $L$-Lipschitz with respect to a norm $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}$. For any $c, x \in \mathbb{R}^n$ and $r \in \mathbb{R}$ let
$$f^{c,r}_{\mathrm{reg}}(x) := f(x) + 2L\max\{0, \|x - c\| - r\}. \qquad (1)$$
Then $f^{c,r}_{\mathrm{reg}}(x)$ is convex and $3L$-Lipschitz with respect to $\|\cdot\|$. Additionally, for any $y \in \mathbb{R}^n$ for which $\|y - c\| \ge r$, define $y^{c,r} := c + \frac{r}{\|y - c\|}(y - c)$. Then $\|y^{c,r} - c\| = r$ and
$$f^{c,r}_{\mathrm{reg}}(y^{c,r}) = f(y^{c,r}) \le f^{c,r}_{\mathrm{reg}}(y) - L(\|y - c\| - r). \qquad (2)$$
Consequently, $f^{c,r}_{\mathrm{reg}}(x)$ has an unconstrained minimizer $x^{c,r}_\star$, and all such minimizers satisfy $\|x^{c,r}_\star - c\| \le r$ and
$$f^{c,r}_{\mathrm{reg}}(x^{c,r}_\star) = f(x^{c,r}_\star) = \min_{x \in \mathbb{R}^n : \|x - c\| \le r} f(x). \qquad (3)$$

Lemma 2.3 implies that minimizing $f$ subject to a distance constraint $\|x - c\| \le r$ reduces to unconstrained minimization of $f^{c,r}_{\mathrm{reg}}$. More formally, to compute an $\varepsilon$-optimal solution to the constrained minimization problem $\min_{x \in \mathbb{R}^n : \|x - c\| \le r} f(x)$, it suffices to instead compute an $\varepsilon$-optimal solution $x_\varepsilon$ to the unconstrained minimization problem $\min_x f^{c,r}_{\mathrm{reg}}(x)$, and then output the point $x_\varepsilon$ if $\|x_\varepsilon - c\| \le r$ and output $x^{c,r}_\varepsilon$ otherwise. From (2), we get $f(x^{c,r}_\varepsilon) \le f^{c,r}_{\mathrm{reg}}(x_\varepsilon) \le \mathrm{opt} + \varepsilon$, where $\mathrm{opt} := \min_x f^{c,r}_{\mathrm{reg}}(x) = \min_{x \in \mathbb{R}^n : \|x - c\| \le r} f(x)$ and the last equality follows from (3). We remark that the $2L$ in the definition of $f^{c,r}_{\mathrm{reg}}$ can be changed to $L + \delta$ for any $\delta \ge 0$, with the only effect of turning the $L$ in (2) into $\delta$ and causing (3) to only hold for some minimizer (rather than all) if $\delta = 0$. The proof of Lemma 2.3 is deferred to Appendix A.

Next, we obtain Theorem 1.3 by applying Theorem 2.2 to the Lovász extension of the submodular function $f$, extended to an unconstrained minimization problem by Lemma 2.3. Given a submodular function $f$ defined over subsets of an $n$ element universe, the Lovász extension $f_{\mathrm{Lov}} : \mathbb{R}^n \to \mathbb{R}$ is defined as follows: $f_{\mathrm{Lov}}(x) := \sum_{i \in [n]} x_{\pi_x(i)}\bigl(f(S_{\pi_x,i}) - f(S_{\pi_x,i-1})\bigr)$, where $\pi_x : [n] \to [n]$ is the permutation such that $x_{\pi_x(1)} \ge x_{\pi_x(2)} \ge \cdots \ge x_{\pi_x(n)}$ (ties broken in an arbitrary but consistent manner), and $S_{\pi_x,j}$ is the subset $\{\pi_x(1), \ldots, \pi_x(j)\}$. Next we give standard properties of the Lovász extension and use them to prove Theorem 1.3.

Fact 2.4 (e.g., [GLS88, JB11]). The following are true about the Lovász extension $f_{\mathrm{Lov}}$:
1. $f_{\mathrm{Lov}}$ is convex with $\min_{x \in [0,1]^n} f_{\mathrm{Lov}}(x) = \min_{S \subseteq V} f(S)$. Indeed, given any $x \in [0,1]^n$, in $n$ queries one can find a subset $S$ with $f(S) \le f_{\mathrm{Lov}}(x)$.
2. Given any $x \in \mathbb{R}^n$ and corresponding permutation $\pi_x$, the vector $g(x) \in \mathbb{R}^n$ where $g(x)(\pi_x(i)) := f(S_{\pi_x,i}) - f(S_{\pi_x,i-1})$ is a subgradient of $f_{\mathrm{Lov}}$ at $x$. Furthermore, $g(x)$ can be computed in 1 round of $n$ queries to an evaluation oracle for $f$.
3. If $f$ has range in $[-M, +M]$, then the $\ell_1$-norm of the subgradient is bounded; in particular, $\|g(x)\|_1 \le 3M$. Equivalently, $f_{\mathrm{Lov}}$ is $3M$-Lipschitz with respect to the $\ell_\infty$-norm.

As discussed in Section 1.1, to prove Theorem 1.3, it suffices to prove the following more general result regarding approximately minimizing a submodular function.

Theorem 2.5 ($\varepsilon$-approximate minimizer for SFM). There is an algorithm that, when given an evaluation oracle for submodular $f : 2^{[n]} \to \mathbb{R}$ with minimizer $x^\star$, $f(\emptyset) = 0$ and $|f(S)| \le M$ for all $S \subseteq [n]$, finds a set $S$ with $f(S) \le f(x^\star) + \varepsilon M$ in $\tilde{O}(n^{1/3}/\varepsilon^{2/3})$ rounds and a total of $\tilde{O}(n^2/\varepsilon^2)$ queries.

Proof of Theorem 2.5. By Fact 2.4, SFM reduces to minimizing $f_{\mathrm{Lov}}$ over $x \in [0,1]^n$, which is the same as the set $\{x \in \mathbb{R}^n : \|x - c\|_\infty \le 0.5\}$ where $c$ is the $n$-dimensional vector with all entries $0.5$. By Lemma 2.3, we can do so by applying Theorem 2.2 to the regularized version $f^{c,0.5}_{\mathrm{reg}}$ of $f_{\mathrm{Lov}}$ with respect to the $\ell_\infty$-norm. This regularized function is guaranteed to have a minimizer $x^\star$ with $\|x^\star\|_2 \le \sqrt{n}$ which also minimizes $f_{\mathrm{Lov}}$ in $[0,1]^n$.
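To spell out the parameter bookkeeping here (our own summary, with constants suppressed; by Lemma 2.3 the regularization changes the Lipschitz constant only by a constant factor): Fact 2.4 gives deterministic subgradients with $\|g\|_2 \le \|g\|_1 \le 3M$, so we may invoke Theorem 2.2 with
$$L = O(M), \qquad \sigma = O(M), \qquad R = \sqrt{n}, \qquad \text{target accuracy } \varepsilon M,$$
which yields
$$\tilde{O}\Bigl(\bigl(\tfrac{LR}{\varepsilon M}\bigr)^{2/3}\Bigr) = \tilde{O}(n^{1/3}/\varepsilon^{2/3}) \ \text{rounds} \qquad \text{and} \qquad \tilde{O}\Bigl(\bigl(\tfrac{\sigma R}{\varepsilon M}\bigr)^{2}\Bigr) = \tilde{O}(n/\varepsilon^{2}) \ \text{subgradient-oracle queries.}$$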
The subgradient of this regularized function at any point can be computed from the subgradient of $f_{\mathrm{Lov}}$ at the same point, which takes $n$ evaluation oracle queries to the submodular function. Hence, we obtain an $\varepsilon M$-approximate minimizer in $\tilde{O}(n^{1/3}/\varepsilon^{2/3})$ rounds and with a total of $\tilde{O}(n^2/\varepsilon^2)$ queries to the evaluation oracle of $f$.

Now, we are ready to prove Theorem 1.3.

Proof of Theorem 1.3. Set $\varepsilon = \frac{1}{2M}$ and apply Theorem 2.5 to obtain, in $\tilde{O}(n^{1/3}M^{2/3})$ rounds and with a total of $\tilde{O}(n^2M^2)$ queries to the evaluation oracle of $f$, an $x \in [0,1]^n$ with $f_{\mathrm{Lov}}(x) \le \min_{z \in [0,1]^n} f_{\mathrm{Lov}}(z) + \frac{1}{2}$. Then by property 1 in Fact 2.4, one can get a subset $A \subseteq V$ with $f(A) \le \min_{S \subseteq V} f(S) + \frac{1}{2}$. As $f$ is assumed to be integer valued, $A$ must be a minimizer of the submodular function.

2.2 Parallel Stochastic Convex Optimization in $\ell_\infty$

Here we present the key steps in proving Theorem 2.2 regarding our new parallel results for the stochastic convex optimization problem in $\ell_\infty$ (Definition 2.1). Throughout this subsection, in our exposition, lemma statements, and proofs, we assume that we are in the setting of Definition 2.1.

To prove Theorem 2.2 we apply the approach of [CJJ+23] with two modifications. First, we consider the convolution of $f$ with a centered Gaussian density function with covariance $\rho^2 I_n$; however, we show that in our setting it is possible to use a larger value of $\rho$ without perturbing the function value too much, due to the $\ell_\infty$ geometry. Unfortunately, the minimizer of the convolved function may move outside the ball of radius $R$. Thus, the second modification we make is working with a regularized function, $f_{\mathrm{reg}}$, which is pointwise close to $f$, still $L$-Lipschitz in the $\ell_\infty$-norm, and keeps the minimizer in the ball of radius $R$ even after applying the convolution with a Gaussian.

In the remainder of this subsection we first present the ingredients going into the proof of Theorem 2.2, and then give a brief explanation of how they fit into the proof of Theorem 2.2, deferring the complete proof to the appendix. We start with our bound on the function perturbation incurred by convolving an $\ell_\infty$-Lipschitz function with a Gaussian of covariance $\rho^2 I_n$ (Lemma 2.6). We then introduce the concept of a ball optimization oracle, along with a result on how to implement it in low depth for the special case of a function that is the result of Gaussian convolution (Proposition 2.8). Lastly, we present the result which allows us to use a ball optimization oracle as a black box to obtain the desired depth (Proposition 2.9).

Now, we are ready to present the lemma which allows us to obtain a better dependence of the depth on the dimension $n$, compared to the naive $n^{2/3}$ obtained by directly applying the $\ell_2$-Lipschitz optimization result. We start with the definition of Gaussian convolution.

Gaussian Convolution. Let $\gamma_\rho(x) := (2\pi\rho^2)^{-n/2}\exp\bigl(-\tfrac{1}{2\rho^2}\|x\|_2^2\bigr)$ be the probability density function of $N(0, \rho^2 I_n)$. Given a function $f : \mathbb{R}^n \to \mathbb{R}$, we define its convolution with a Gaussian of covariance $\rho^2 I_n$ by $\hat{f}_\rho := f * \gamma_\rho$, i.e.,
$$\hat{f}_\rho(x) := \mathbb{E}_{y \sim N(0,\rho^2 I_n)}[f(x + y)] = \int_{\mathbb{R}^n} f(x - y)\,\gamma_\rho(y)\,dy. \qquad (4)$$
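Before stating the distortion bound (Lemma 2.6 below), here is a quick numerical illustration (our own, not from the paper) of the fact it rests on: for $y \sim N(0, \rho^2 I_n)$, the expected $\ell_\infty$-norm grows only like $\rho\sqrt{2\log n}$ while the expected $\ell_2$-norm grows like $\rho\sqrt{n}$, so an $\ell_\infty$-Lipschitz function tolerates a much larger $\rho$ for the same distortion.

```python
import numpy as np

# For y ~ N(0, rho^2 I_n): E||y||_inf scales like rho*sqrt(2 log n), while
# E||y||_2 scales like rho*sqrt(n) -- the gap exploited by Lemma 2.6.
rng = np.random.default_rng(0)
rho, n, trials = 1.0, 10_000, 200
y = rng.normal(scale=rho, size=(trials, n))
print("E||y||_inf ~", np.abs(y).max(axis=1).mean())      # roughly rho*sqrt(2 ln n)
print("E||y||_2   ~", np.linalg.norm(y, axis=1).mean())  # roughly rho*sqrt(n)
print("rho*sqrt(2 ln n) =", rho * np.sqrt(2 * np.log(n)),
      "  rho*sqrt(n) =", rho * np.sqrt(n))
```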
The following lemma shows that we can perform more Gaussian smoothing (as compared to the $\ell_2$-setting) without perturbing the function value too much (as mentioned in Section 1.1).

Lemma 2.6 (Gaussian Convolution Distortion Bound for $\ell_\infty$). Let $f : \mathbb{R}^n \to \mathbb{R}$ be $L$-Lipschitz with respect to the $\ell_\infty$-norm. Then for any point $x \in \mathbb{R}^n$, we have $|\hat{f}_\rho(x) - f(x)| \le L\rho\sqrt{2\log n}$.

Proof. Note that
$$|\hat{f}_\rho(x) - f(x)| \le \int_{\mathbb{R}^n} |f(z) - f(x)|\,\gamma_\rho(x - z)\,dz \le L\int_{\mathbb{R}^n} \|x - z\|_\infty\,\gamma_\rho(x - z)\,dz,$$
where the first inequality follows from the definition of $\hat{f}_\rho$ and the second follows as $f$ is $L$-Lipschitz in the $\ell_\infty$-norm. The right-hand side is simply $L$ times the expected $\ell_\infty$-norm of a zero-mean random Gaussian vector with covariance $\rho^2 I_n$, and this is $\Theta(\rho\sqrt{\log n})$ (e.g., [Ver18]). More precisely, we get
$$|\hat{f}_\rho(x) - f(x)| \le L\,\mathbb{E}_{y \sim N(0,\rho^2 I_n)}\|y\|_\infty \le L\rho\sqrt{2\log n}.$$

As mentioned in Section 1.1, by contrast, convolving a function $f$ that is $L$-Lipschitz in $\ell_2$ with a Gaussian of covariance $\rho^2 I_n$ could change the function value by $O(\rho\sqrt{n})$. Hence, the $\ell_\infty$ geometry allows us to add a larger amount of Gaussian smoothing without changing the function value by more than $\varepsilon$, which in turn allows for better rates.

Ball Optimization. A subroutine that we use for minimizing $\hat{f}_\rho$ is called a ball optimization oracle. As suggested by [CJJ+20], the concept of a ball optimization oracle is related to the notion of trust regions, explored in several papers, such as [CGT00]. The particular ball optimization procedure we employ takes a function $F : \mathbb{R}^n \to \mathbb{R}$ and a point $\bar{x} \in \mathbb{R}^n$, and returns a point which is an approximate solution to (a regularized version of) $F$ in a small ball around $\bar{x}$. More formally, we work with the following definition:

Definition 2.7 (Ball Optimization Oracle [CJJ+23]). Let $F : \mathbb{R}^n \to \mathbb{R}$ be a convex function. $\mathcal{O}_{\mathrm{bo}}$ is a $(\phi, \lambda, r)$-ball optimization oracle for $F$ if, given any $\bar{x} \in \mathbb{R}^n$, it returns an $x' \in \mathbb{R}^n$ with the property
$$F(x') + \frac{\lambda}{2}\|x' - \bar{x}\|_2^2 \le F(x^\star_{\mathrm{loc}}) + \frac{\lambda}{2}\|x^\star_{\mathrm{loc}} - \bar{x}\|_2^2 + \phi, \quad \text{where } x^\star_{\mathrm{loc}} = \arg\min_{x \in B_{\bar{x}}(r)}\Bigl(F(x) + \frac{\lambda}{2}\|x - \bar{x}\|_2^2\Bigr).$$

From [CJJ+20] it is known that for any Lipschitz convex function $f$, any stochastic subgradient oracle $g$ as above, and any $\rho$, if we set $r = \rho$, then efficient ball-optimization oracles exist. More formally, we use the following proposition from [CJJ+23], which is in turn inspired by [ACJ+21].

Proposition 2.8 (Proposition 3, [CJJ+23]). Let $f : \mathbb{R}^n \to \mathbb{R}$ be convex and $g : \mathbb{R}^n \to \mathbb{R}^n$ be a stochastic subgradient oracle satisfying $\mathbb{E}[g(x)] \in \partial f(x)$ and $\mathbb{E}\|g(x)\|_2^2 \le \sigma^2$ for all $x \in \mathbb{R}^n$. Let $\hat{f}_\rho := f * \gamma_\rho$, i.e.,
$$\hat{f}_\rho(x) := \mathbb{E}_{y \sim N(0,\rho^2 I_n)}[f(x - y)] = \int_{\mathbb{R}^n} f(x - y)\,\gamma_\rho(y)\,dy. \qquad (5)$$
Then there is a $(\phi, \lambda, \rho)$-ball optimization oracle for $\hat{f}_\rho$ which makes $O(\frac{\sigma^2}{\phi\lambda})$ total queries to $g$ in a constant number of rounds.

Highly Parallel Optimization. As shown in [CJJ+23], the ball optimization oracle above can be used for highly-parallel optimization as follows.

Proposition 2.9 (Proposition 2 in [CJJ+23]). Fix a function $F : \mathbb{R}^n \to \mathbb{R}$ which is $L$-Lipschitz with respect to the $\ell_2$-norm and convex. Suppose $R \ge 0$ is a parameter such that $x^\star \in \arg\min_x F(x)$ satisfies $\|x^\star\|_2 \le R$. Let $r \in (0, R]$ and $\epsilon_{\mathrm{opt}} \in (0, LR]$ be two parameters. Define the following quantities
$$\kappa := \frac{LR}{\epsilon_{\mathrm{opt}}}, \qquad K := \Bigl(\frac{R}{r}\Bigr)^{2/3}, \qquad \text{and} \qquad \lambda := \frac{\epsilon_{\mathrm{opt}}K^2}{R^2}\log^2\kappa. \qquad (6)$$
Then, there exists a universal constant $C > 0$ and an algorithm BallAccel which runs in $CK\log\kappa$ iterations and produces a point $x$ such that $\mathbb{E}F(x) \le F(x^\star) + \epsilon_{\mathrm{opt}}$. Moreover,
1. Each iteration makes at most $C\log^2(\frac{R\kappa}{r})$ calls to a $(\frac{\lambda' r^2}{C}, \lambda', r)$-ball optimization oracle with values of $\lambda' \in [\lambda, \ldots]$.
2. For each $j \in [\lceil\log_2 K\rceil + C]$, at most $C^2 2^{-j}K\log(\frac{R\kappa}{r})$ iterations query a $(\frac{\lambda' r^2}{C2^j\log^2(\frac{R\kappa}{r})}, \lambda', r)$-ball optimization oracle for some $\lambda' \in [\lambda, \ldots]$.

While the proof of Theorem 2.2 is deferred to Appendix A, we provide some intuition for how to use the stated components to obtain the result. A natural approach would be to apply Proposition 2.9 to $F := \hat{f}_\rho$. However, to ensure that the minimizer of $F$ has $\ell_2$-norm at most $R$, we will instead work with $F := \widehat{(f^{c,R}_{\mathrm{reg}})}_\rho$, the Gaussian smoothing of the regularized function from Lemma 2.3.
The idea is to invoke Lemma 2.3 and apply BallAccel on the function $F := \widehat{(f^{c,R}_{\mathrm{reg}})}_\rho$ with respect to $\|\cdot\|_2$, with $c$ being the origin and $\rho := \frac{\epsilon_{\mathrm{opt}}}{L\sqrt{2\log n}}$. With this choice of $\rho$, by Lemma 2.6, we know $|F(x) - f^{c,R}_{\mathrm{reg}}(x)| \le \epsilon_{\mathrm{opt}}$ everywhere. Noting that $F$ has a minimizer $x^\star$ with $\|x^\star\|_2 \le 3R$, this enables us to apply Proposition 2.9 to $F$, obtaining the stated bounds on the number of rounds and query complexity.

3 2-Round $O(n^{M+1})$-Query Algorithm for SFM

Here we present our 2-round, $O(n^{M+1})$-query algorithm for SFM. The algorithm AugmentingSets, given in Algorithm 1, iterates over every $M$-sparse $S \subseteq [n]$ (i.e., $|S| \le M$); the collection of all such sets is denoted $\mathcal{F}$. For every such $S$ the algorithm then builds the augmented set $A(S)$, consisting of the union of $S$ and all elements $i$ that have non-positive marginal with respect to $S$, i.e., $f(S \cup \{i\}) \le f(S)$. The algorithm then outputs the set $A(S)$ that has the smallest value. As we show below, computing all the $A(S)$ for $M$-sparse sets $S$ can be done in 1 round and $O(n^{M+1})$ queries, and then computing the $A(S)$ with the smallest value can be done in another round and $O(n^{M+1})$ queries. The correctness of the algorithm is guaranteed by the fact that $A(S)$, for some $|S| \le M$, is the maximal minimizer of $f$, and therefore the algorithm outputs a set with the optimum value (note, however, that the algorithm doesn't necessarily output the maximal minimizer itself). Our main result of this section is the following theorem.

Algorithm 1: Augmenting Sets Algorithm
Input: Submodular function $f : 2^{[n]} \to \mathbb{Z}$ and $M \in \mathbb{Z}_{>0}$ such that $|f(S)| \le M$ for all $S \subseteq [n]$ and $f(\emptyset) = 0$
Output: $S_{\mathrm{out}}$, a minimizer of the function $f$
1  Function AugmentingSets($f$, $M$):
2    $\mathcal{F} \leftarrow \{S \subseteq [n] \mid |S| \le M\}$
3    for $S \in \mathcal{F}$ do
4      $A(S) \leftarrow S \cup \{i \notin S \mid f(S \cup \{i\}) \le f(S)\}$   // Compute the augmentation of $S$
5    end for
6    return $S_{\mathrm{out}} \leftarrow \mathrm{argmin}_{S \in \mathcal{F}} f(A(S))$

Theorem 1.4 (Two-round Parallel SFM). There is an algorithm (Algorithm 1) that, when given an evaluation oracle for submodular $f : 2^{[n]} \to \mathbb{Z}$ with $f(\emptyset) = 0$ and $|f(S)| \le M$ for all $S \subseteq [n]$, finds a minimizer of $f$ in 2 rounds and $O(n^{M+1})$ queries.

Proof. First, we bound the parallel and query complexity of the algorithm. Line 2 to Line 5 can be implemented in 1 round as they simply evaluate $f$ on all subsets of $[n]$ of size at most $M + 1$. Line 6 can be implemented in 1 round by evaluating $f(A(S))$ for each $S \in \mathcal{F}$ in parallel. Consequently, the algorithm is implementable in 2 rounds. To bound the query complexity, note that $|\mathcal{F}| \le \sum_{k=0}^{M}\binom{n}{k} = O(n^M)$ and the algorithm only makes $O(n)$ queries for each $S \in \mathcal{F}$ ($O(n)$ in Line 4 and 1 in Line 6). Consequently, the algorithm makes $O(n^{M+1})$ total queries.

It only remains to show that the algorithm outputs a minimizer of $f$. We prove this by showing that for some $S \in \mathcal{F}$, its augmented set $A(S)$ is the maximal minimizer of $f$, i.e., the union of all minimizers, which is a minimizer itself by submodularity. This suffices as the algorithm outputs the $A(S)$ for $S \in \mathcal{F}$ of smallest value.

Let $S^\star$ be the maximal minimizer of $f$. We build a subset $T \subseteq S^\star$ of size $|T| \le M$, which we call an anchor, as follows. Start with $T = \emptyset$ and an arbitrary ordering of the elements of $S^\star$. For each element $i \in S^\star$ in this order, we add it to the current $T$ if $f(T \cup \{i\}) > f(T)$. Since $f$ only takes integer values, this means that whenever we add an element $i$ to $T$, the value of $f(T)$ goes up by at least 1. At the end of the process we have $f(T) \ge |T|$. Since $f(T) \le M$, it follows that $|T| \le M$ and therefore $T \in \mathcal{F}$. We now claim that $A(T) = S^\star$. For any element $i \in S^\star \setminus T$, we did not add $i$ to $T$ because $f(T_i \cup \{i\}) - f(T_i) \le 0$, where $T_i \subseteq T$ is the value of $T$ when element $i$ is visited in the procedure above.
By submodularity, $f(T \cup \{i\}) - f(T) \le f(T_i \cup \{i\}) - f(T_i) \le 0$. This implies that $S^\star \subseteq A(T)$. Also note that for any $j \notin S^\star$, we have $f(S^\star \cup \{j\}) > f(S^\star)$ by the maximality of $S^\star$. It again follows from submodularity and $T \subseteq S^\star$ that $f(T \cup \{j\}) > f(T)$, which implies $j \notin A(T)$. This proves $A(T) = S^\star$ and completes the proof of the theorem.

4 Conclusion

In this paper we designed two new parallel algorithms for minimizing submodular functions $f : 2^{[n]} \to \mathbb{Z} \cap [-M, +M]$ with round complexities $\tilde{O}(n^{1/3}M^{2/3})$ and $2$, and query complexities $\tilde{O}(n^2M^2)$ and $O(n^{M+1})$, respectively. These $M$-dependent, sublinear-in-$n$ round complexities stand in contrast to the $\tilde{\Omega}(n)$ lower bound on the number of rounds required for SFM when $M = n^{\Theta(n)}$. On the way to the first result, we obtain a new efficient parallel algorithm for $\varepsilon$-approximate minimization of $\ell_\infty$-Lipschitz convex functions over $[0,1]^n$ with round complexity $\tilde{O}(n^{1/3}\varepsilon^{-2/3})$. Given the results of [Nem94, DG19], the dependence on $n$ is optimal for constant $\varepsilon$.

Two related open questions are whether one can obtain $o(n)$-round SFM algorithms with polylogarithmic dependence on $M$, and whether one can obtain algorithms for $\varepsilon$-approximate minimization of $\ell_\infty$-Lipschitz convex functions over $[0,1]^n$ in $\tilde{O}(n^{1/3}\,\mathrm{poly}\log(1/\varepsilon))$ rounds, or whether one can prove lower bounds ruling them out. Another related open question is whether one can perform SFM in $O(\mathrm{poly}(M))$ rounds with query complexity $\mathrm{poly}(n)$.

Acknowledgements

We thank the anonymous reviewers of NeurIPS 2023 for helpful comments. Part of this work was done while Deeparnab Chakrabarty, Haotian Jiang, and Aaron Sidford were attending the Discrete Optimization trimester program at the Hausdorff Research Institute for Mathematics.

References

[ACJ+21] Hilal Asi, Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. Stochastic bias-reduced gradient methods. Advances in Neural Information Processing Systems, 34:10810–10822, 2021.

[ALS20] Brian Axelrod, Yang P. Liu, and Aaron Sidford. Near-optimal approximate discrete and continuous submodular function minimization. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 837–853. SIAM, 2020.

[BJL+19] Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, and Aaron Sidford. Complexity of highly parallel non-smooth convex optimization. Advances in Neural Information Processing Systems, 2019.

[BPC+11] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[BS18] Eric Balkanski and Yaron Singer. Parallelization does not accelerate convex optimization: Adaptivity lower bounds for non-smooth convex minimization. arXiv preprint arXiv:1808.03880, 2018.

[BS20] Eric Balkanski and Yaron Singer. A lower bound for parallel submodular minimization. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 130–139, 2020.

[BV04] Dimitris Bertsimas and Santosh Vempala. Solving convex programs by random walks. Journal of the ACM (JACM), 51(4):540–556, 2004.

[BVZ01] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.

[CCK21] Deeparnab Chakrabarty, Yu Chen, and Sanjeev Khanna. A polynomial lower bound on the number of rounds for parallel submodular function minimization and matroid intersection.
In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2021.

[CGJS22] Deeparnab Chakrabarty, Andrei Graur, Haotian Jiang, and Aaron Sidford. Improved lower bounds for submodular function minimization. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 245–254. IEEE, 2022.

[CGT00] A. R. Conn, N. I. M. Gould, and Ph. L. Toint. Trust-Region Methods. MPS/SIAM Series on Optimization. SIAM, Philadelphia, PA, 2000.

[CJJ+20] Yair Carmon, Arun Jambulapati, Qijia Jiang, Yujia Jin, Yin Tat Lee, Aaron Sidford, and Kevin Tian. Acceleration with a ball optimization oracle. Advances in Neural Information Processing Systems, 33:19052–19063, 2020.

[CJJ+23] Yair Carmon, Arun Jambulapati, Yujia Jin, Yin Tat Lee, Daogao Liu, Aaron Sidford, and Kevin Tian. ReSQueing parallel and private stochastic convex optimization. arXiv preprint arXiv:2301.00457, 2023.

[CLSW17] Deeparnab Chakrabarty, Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. Subquadratic submodular function minimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1220–1231, 2017.

[DBW12] John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.

[DG19] Jelena Diakonikolas and Cristóbal Guzmán. Lower bounds for parallel and randomized convex optimization. In Conference on Learning Theory, pages 1132–1157. PMLR, 2019.

[DVZ21] Daniel Dadush, László A. Végh, and Giacomo Zambelli. Geometric rescaling algorithms for submodular function minimization. Mathematics of Operations Research, 46(3):1081–1108, 2021.

[Edm70] Jack Edmonds. Submodular functions, matroids, and certain polyhedra. Edited by G. Goos, J. Hartmanis, and J. van Leeuwen, page 11, 1970.

[FG88] Awi Federgruen and Henry Groenevelt. Characterization and optimization of achievable performance in general queueing systems. Operations Research, 36(5):733–741, 1988.

[FI00] Lisa Fleischer and Satoru Iwata. Improved algorithms for submodular function minimization and submodular flow. In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, pages 107–116, 2000.

[GLS81] Martin Grötschel, László Lovász, and Alexander Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2):169–197, 1981.

[GLS88] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric Algorithms and Combinatorial Optimization. Springer, 1988.

[GPRW20] Andrei Graur, Tristan Pollner, Vidhya Ramaswamy, and S. Matthew Weinberg. New query lower bounds for submodular function minimization. In 11th Innovations in Theoretical Computer Science Conference (ITCS), pages 64:1–64:16, 2020.

[GR18] Pontus Giselsson and Anders Rantzer. Large-Scale and Distributed Optimization, volume 2227. Springer, 2018.

[Har08] Nicholas James Alexander Harvey. Matchings, matroids and submodular functions. PhD thesis, Massachusetts Institute of Technology, 2008.

[IFF01] Satoru Iwata, Lisa Fleischer, and Satoru Fujishige. A combinatorial strongly polynomial algorithm for minimizing submodular functions. Journal of the ACM (JACM), 48(4):761–777, 2001.

[IO09] Satoru Iwata and James B. Orlin. A simple combinatorial algorithm for submodular function minimization. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1230–1237. SIAM, 2009.

[Iwa03] Satoru Iwata. A faster scaling algorithm for minimizing submodular functions. SIAM Journal on Computing, 32(4):833–840, 2003.
[JB11] Stefanie Jegelka and Jeff Bilmes. Submodularity beyond submodular energies: coupling edges in graph cuts. In CVPR 2011, pages 1897–1904. IEEE, 2011.

[JBS13] Stefanie Jegelka, Francis Bach, and Suvrit Sra. Reflection methods for user-friendly submodular optimization. Advances in Neural Information Processing Systems, 26, 2013.

[Jia21] Haotian Jiang. Minimizing convex functions with integral minimizers. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 976–985. SIAM, 2021.

[Jia22] Haotian Jiang. Minimizing convex functions with rational minimizers. Journal of the ACM, 70(1):1–27, 2022.

[JLSW20] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games, and its applications. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 944–953, 2020.

[JLSZ23] Haotian Jiang, Yin Tat Lee, Zhao Song, and Lichen Zhang. Convex minimization with integer minima in $\tilde{O}(n^4)$ time. arXiv preprint arXiv:2304.03426, 2023.

[KBJ+15] K. S. Kumar, Alvaro Barbero, Stefanie Jegelka, Suvrit Sra, and Francis Bach. Convex optimization for parallel energy minimization. arXiv preprint arXiv:1503.01563, 2015.

[KC10] Andreas Krause and Volkan Cevher. Submodular dictionary selection for sparse representation. In International Conference on Machine Learning (ICML), 2010.

[KG05] Andreas Krause and Carlos Guestrin. Optimal nonmyopic value of information in graphical models: efficient algorithms and theoretical limits. Carnegie Mellon University, Center for Automated Learning and Discovery, 2005.

[Kha80] Leonid G. Khachiyan. Polynomial algorithms in linear programming. USSR Computational Mathematics and Mathematical Physics, 20(1):53–72, 1980.

[KKT08] Pushmeet Kohli, M. Pawan Kumar, and Philip H. S. Torr. P3 & beyond: Move making algorithms for solving higher order functions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(9):1645–1656, 2008.

[KNTB09] Yoshinobu Kawahara, Kiyohito Nagano, Koji Tsuda, and Jeff A. Bilmes. Submodularity cuts and applications. Advances in Neural Information Processing Systems, 22, 2009.

[KT10] Pushmeet Kohli and Philip H. S. Torr. Dynamic graph cuts and their applications in computer vision. In Computer Vision, pages 51–108. Springer, 2010.

[KTE88] Leonid G. Khachiyan, Sergei Pavlovich Tarasov, and I. I. Erlikh. The method of inscribed ellipsoids. In Soviet Math. Dokl., volume 37, pages 226–230, 1988.

[LB11] Hui Lin and Jeff Bilmes. Optimal selection of limited vocabulary speech corpora. In Twelfth Annual Conference of the International Speech Communication Association, 2011.

[Lev65] Anatoly Yur'evich Levin. An algorithm for minimizing convex functions. In Doklady Akademii Nauk, volume 160, pages 1244–1247. Russian Academy of Sciences, 1965.

[Lov83] László Lovász. Submodular functions and convexity. Mathematical Programming: The State of the Art, Bonn 1982, pages 235–257, 1983.

[LSW15] Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 1049–1065. IEEE, 2015.

[Nem94] Arkadi Nemirovski. On parallel complexity of nonsmooth convex optimization. Journal of Complexity, 10(4):451–463, 1994.

[New65] Donald J. Newman. Location of the maximum on unimodal surfaces. Journal of the ACM (JACM), 12(3):395–398, 1965.

[NN89] Y. E. Nesterov and A. S. Nemirovskii.
Self-concordant functions and polynomial time methods in convex programming. Preprint, Central Economic and Mathematical Institute, USSR Academy of Sciences, Moscow, USSR, 1989.

[Orl09] James B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Mathematical Programming, 118(2):237–251, 2009.

[QS95] Maurice Queyranne and Andreas S. Schulz. Scheduling unit jobs with compatible release dates on parallel machines with nonstationary speeds. In Integer Programming and Combinatorial Optimization: 4th International IPCO Conference, Copenhagen, Denmark, May 29–31, 1995, Proceedings 4, pages 307–320. Springer, 1995.

[SBB+18] Kevin Scaman, Francis Bach, Sébastien Bubeck, Laurent Massoulié, and Yin Tat Lee. Optimal algorithms for non-smooth distributed optimization in networks. Advances in Neural Information Processing Systems, 31, 2018.

[Sch00] Alexander Schrijver. A combinatorial algorithm minimizing submodular functions in strongly polynomial time. Journal of Combinatorial Theory, Series B, 80(2):346–355, 2000.

[Sho77] Naum Z. Shor. Cut-off method with space extension in convex programming problems. Cybernetics, 13(1):94–96, 1977.

[Top98] Donald M. Topkis. Supermodularity and Complementarity. Princeton University Press, 1998.

[Vai89] Pravin M. Vaidya. A new algorithm for minimizing convex functions over convex sets. In 30th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 338–343, 1989.

[Ver18] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

[Vyg03] Jens Vygen. A note on Schrijver's submodular function minimization algorithm. Journal of Combinatorial Theory, Series B, 88(2):399–402, 2003.

[YN76] David B. Yudin and Arkadii S. Nemirovski. Evaluation of the information complexity of mathematical programming problems. Ekonomika i Matematicheskie Metody, 12:128–142, 1976.

A Supplement

In this supplement, we provide the complete proofs of Lemma 2.3 and Theorem 2.2.

Proof of Lemma 2.3. First we show that $f^{c,r}_{\mathrm{reg}}(x)$ is convex and $3L$-Lipschitz. Let $h(x) := 2L(\|x - c\| - r)$. By the triangle inequality and homogeneity of norms, $2L(\|x - c\| - r)$ is convex and $2L$-Lipschitz (with respect to $\|\cdot\|$). Further, it is straightforward to check that the maximum of two convex, $L$-Lipschitz functions is convex and $L$-Lipschitz, and that the sum of a convex $L$-Lipschitz function and a convex $2L$-Lipschitz function is convex and $3L$-Lipschitz. Since $f^{c,r}_{\mathrm{reg}}(x) = f(x) + \max\{0, h(x)\}$ and constant functions are convex and $0$-Lipschitz, the result follows.

Next, consider $y \in \mathbb{R}^n$ with $\|y - c\| \ge r$. Since $\|\cdot\|$ is a norm, we see that $\|y^{c,r} - c\| = r$ and therefore $f^{c,r}_{\mathrm{reg}}(y^{c,r}) = f(y^{c,r})$. Furthermore, $\|y^{c,r} - y\| = \|y - c\| - r$ and so
$$f^{c,r}_{\mathrm{reg}}(y) = f(y) + 2L(\|y - c\| - r) \ge f(y^{c,r}) - L\|y^{c,r} - y\| + 2L(\|y - c\| - r) = f(y^{c,r}) + L(\|y - c\| - r),$$
where the inequality follows since $f$ is $L$-Lipschitz with respect to $\|\cdot\|$. This yields the desired result, as it implies that the value of $f^{c,r}_{\mathrm{reg}}$ at any point outside the ball is at least its value at a point in the ball (namely $y^{c,r}$), where $f^{c,r}_{\mathrm{reg}}$ agrees with $f$. Further, since $\{x \in \mathbb{R}^n \mid \|x - c\| \le r\}$ is closed and $f$ is Lipschitz, there exists a minimizer of $\min_{x \in \mathbb{R}^n : \|x - c\| \le r} f(x)$ and the result follows.

Proof of Theorem 2.2. We invoke Lemma 2.3 and apply BallAccel on the function $F := \widehat{(f^{c,R}_{\mathrm{reg}})}_\rho$, the Gaussian smoothing of the regularized function defined in Lemma 2.3 with respect to $\|\cdot\|_2$, with $c$ being the origin and $\rho := \frac{\epsilon_{\mathrm{opt}}}{L\sqrt{2\log n}}$. With this choice of $\rho$, by Lemma 2.6, we know $|F(x) - f^{c,R}_{\mathrm{reg}}(x)| \le \epsilon_{\mathrm{opt}}$ everywhere. For brevity, we drop the superscript $(c, R)$ from $f_{\mathrm{reg}}$ for the remainder of the proof. We now claim that $F$ has a minimizer $x^\star$ with $\|x^\star\|_2 \le 3R$.
To see this, let $y \in \mathbb{R}^n$ be a point with $\|y\|_2 > 3R$ and consider $\tilde{y} = \frac{R}{\|y\|_2}\,y$. Therefore,
$$F(y) \ge f_{\mathrm{reg}}(y) - \epsilon_{\mathrm{opt}} = f(y) + 2L(\|y\|_2 - R) - \epsilon_{\mathrm{opt}} \ge f_{\mathrm{reg}}(\tilde{y}) + L(\|y\|_2 - R) - \epsilon_{\mathrm{opt}},$$
where the first inequality follows from the pointwise approximation of $F$ and $f_{\mathrm{reg}}$, and the second follows since $f$ is also $L$-Lipschitz in the $\ell_2$-norm, and so $f_{\mathrm{reg}}(\tilde{y}) = f(\tilde{y}) \le f(y) + L(\|y\|_2 - R)$. Again using the pointwise approximation of $F$ and $f_{\mathrm{reg}}$, we get $F(y) \ge F(\tilde{y}) - 2\epsilon_{\mathrm{opt}} + L(\|y\|_2 - R)$, and if $\|y\|_2 > 3R$, using that $\epsilon_{\mathrm{opt}} \in (0, LR]$ we get a contradiction to the minimality of $F(y)$.

Note that the stochastic subgradient $g'$ of $f_{\mathrm{reg}}$ is given by $g' = g + 2Lv$ for $v \in \partial h(x)$ where $h(x) = \max(0, \|x\|_2 - R)$. Note that $\|v\|_2^2 \le 1$, so the stochastic gradient $g'$ satisfies $\mathbb{E}\|g'(x)\|_2^2 \le 2\sigma^2 + 8L^2$. It follows that by setting $r = \rho$, Proposition 2.8 implies that for any $\phi, \lambda > 0$, there exists a $(\phi, \lambda, \rho)$-ball optimization oracle for $F$ which makes $O(\frac{\sigma^2}{\phi\lambda})$ total queries to $g$ in $O(1)$ parallel rounds (as $\sigma^2 \ge L^2$ by the definition of $L$-Lipschitzness).

Next we apply the ball acceleration of Proposition 2.9 to $F$. We have already argued above that the minimizer $x^\star$ of $F$ satisfies $\|x^\star\|_2 \le 3R$. We set the parameter $r = \rho = \frac{\epsilon_{\mathrm{opt}}}{L\sqrt{2\log n}}$, with the $R$ and $L$ of Proposition 2.9 multiplied by a factor of 3. Using these, the parameters of (6) in Proposition 2.9 become
$$\kappa = \frac{LR}{\epsilon_{\mathrm{opt}}}, \qquad K = \Bigl(\frac{LR\sqrt{2\log n}}{\epsilon_{\mathrm{opt}}}\Bigr)^{2/3}, \qquad \text{and} \qquad \lambda = \frac{\epsilon_{\mathrm{opt}}K^2}{R^2}\log^2\kappa.$$
Using Proposition 2.9 and Proposition 2.8, we get that the number of rounds of queries is $(CK\log\kappa)\cdot(C\log^2(R\kappa/\rho))$, which is $\tilde{O}((LR/\epsilon_{\mathrm{opt}})^{2/3})$.

Next, we bound the query complexity. Adding up the queries made by the ball-optimization oracle calls in all iterations per part 1 of Proposition 2.9 gives
$$(CK\log\kappa)\cdot C\log^2(R\kappa/\rho)\cdot\frac{\sigma^2}{\lambda^2\rho^2} = \tilde{O}\bigl(K\sigma^2\lambda^{-2}\rho^{-2}\bigr).$$
Adding up the queries made by the ball-optimization oracle calls in all iterations per part 2 of Proposition 2.9 gives
$$\sum_{j \in [\lceil\log_2 K\rceil + C]} C^2 2^{-j}K\log(R\kappa/\rho)\cdot\frac{2^j C\sigma^2\log^2(R\kappa/\rho)}{\lambda^2\rho^2},$$
which is also $\tilde{O}\bigl(K\sigma^2\lambda^{-2}\rho^{-2}\bigr)$. Substituting the values from above and $\rho = \frac{\epsilon_{\mathrm{opt}}}{L\sqrt{2\log n}}$, we get that the total query complexity is $\tilde{O}((\sigma R/\epsilon_{\mathrm{opt}})^2)$.

Finally, note that Proposition 2.9 applied to $F$ computes an $\epsilon_{\mathrm{opt}}$-approximate minimizer of $F$, which is a $2\epsilon_{\mathrm{opt}}$-approximate minimizer of $f_{\mathrm{reg}}$. By setting $\epsilon_{\mathrm{opt}} = \varepsilon/2$, Lemma 2.3 implies that we can find an $\varepsilon$-approximate minimizer $x$ of $f$ with $\|x\|_2 \le R$. This completes the proof of the theorem.