# Online Risk-Averse Submodular Maximization

Tasuku Soma^{1,2}, Yuichi Yoshida^{3}

^{1} The University of Tokyo, ^{2} Massachusetts Institute of Technology, ^{3} National Institute of Informatics

tasuku_soma@mist.i.u-tokyo.ac.jp, tasuku@mit.edu, yyoshida@nii.ac.jp

We present a polynomial-time online algorithm for maximizing the conditional value at risk (CVaR) of a monotone stochastic submodular function. Given $T$ i.i.d. samples from an underlying distribution arriving online, our algorithm produces a sequence of solutions that converges to a $(1 - 1/e)$-approximate solution with a convergence rate of $O(T^{-1/4})$ for monotone continuous DR-submodular functions. Compared with previous offline algorithms, which require $\Omega(T)$ space, our online algorithm only requires $O(\sqrt{T})$ space. We extend our online algorithm to portfolio optimization for monotone submodular set functions under a matroid constraint. Experiments conducted on real-world datasets demonstrate that our algorithm can rapidly achieve CVaRs that are comparable to those obtained by existing offline algorithms.

## 1 Introduction

Submodular function maximization is a ubiquitous problem naturally arising in broad areas such as machine learning, social network analysis, economics, combinatorial optimization, and decision-making [Krause and Golovin, 2014; Buchbinder and Feldman, 2018]. Submodular maximization in these applications is stochastic in nature: the input data could be a sequence of samples drawn from some underlying distribution, or there could be some uncertainty in the environment. In this paper, we consider nonnegative monotone continuous DR-submodular¹ [Bian et al., 2017] functions $F(x; z)$ parameterized by a random variable $z$ drawn from some distribution $\mathcal{D}$.
The simplest approach for such stochastic submodular objectives is to maximize the expectation $\mathbb{E}_{z \sim \mathcal{D}}[F(x; z)]$, which has been extensively studied [Karimi et al., 2017; Hassani et al., 2017; Mokhtari et al., 2018; Karbasi et al., 2019]. However, in real-world decision-making tasks in finance, robotics, and medicine, we sometimes must be risk-averse: we want to minimize the risk of suffering a considerably small gain rather than simply maximizing the expected gain [Mansini et al., 2007; Yau et al., 2011; Tamar et al., 2015]. In medical applications, for example, we must avoid catastrophic events such as patient fatalities. In finance and robotics, all progress ceases when poor decisions cause bankruptcies or irreversible damage to robots.

¹See Section 2 for the definition of continuous DR-submodularity.

Conditional value at risk (CVaR) is a popular objective for such risk-averse domains [Rockafellar and others, 2000; Krokhmal et al., 2002]. Formally, given a parameter $\alpha \in [0, 1]$, the CVaR of a feasible solution $x \in \mathbb{R}^n$ is defined as

$$\mathrm{CVaR}_{\alpha,\mathcal{D}}(x) = \mathbb{E}_{z \sim \mathcal{D}}\left[F(x; z) \mid F(x; z) \le \mathrm{VaR}_{\alpha,\mathcal{D}}(x)\right],$$

where $\mathrm{VaR}_{\alpha,\mathcal{D}}(x)$ is the $\alpha$-quantile of the random variable $F(x; z)$, i.e.,

$$\mathrm{VaR}_{\alpha,\mathcal{D}}(x) = \sup\left\{\tau \in \mathbb{R} : \Pr_{z \sim \mathcal{D}}(F(x; z) \le \tau) \le \alpha\right\}.$$

Note that when $\alpha = 1$, the CVaR is simply the expected value of $F(x; z)$. Since $\alpha$ is typically set to 0.10 or 0.05 in practice, we assume that $\alpha$ is a fixed constant throughout this paper. When $\alpha$ is clear from the context, we omit it from the notation. CVaR can also be characterized by a variational formula:

$$\mathrm{CVaR}_{\mathcal{D}}(x) = \max_{\tau \in [0,1]} \left\{ \tau - \frac{1}{\alpha} \mathbb{E}_{z \sim \mathcal{D}}\left[\tau - F(x; z)\right]_+ \right\},$$

where $[\cdot]_+ = \max\{\cdot, 0\}$ [Rockafellar and others, 2000].

[Maehara, 2015] initiated the study of maximizing the CVaR of stochastic submodular set functions. It was shown that the CVaR of a stochastic submodular set function is not necessarily submodular, and that it is impossible to compute a single set that attains any multiplicative approximation to the optimal CVaR.
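As a concrete illustration of the variational formula, the empirical CVaR of finitely many sampled gains can be computed by maximizing over $\tau$; for an empirical distribution the maximum is attained at a sample point. The following is a minimal sketch, not from the paper (the function name is ours):

```python
import numpy as np

def empirical_cvar(values, alpha):
    """Empirical CVaR_alpha of sampled gains, via the variational formula
    max_tau { tau - (1/alpha) * mean([tau - value]_+) }.
    For an empirical distribution the maximum is attained at a sample point,
    so scanning the sample values suffices."""
    values = np.asarray(list(values), dtype=float)
    best = -np.inf
    for tau in np.sort(values):
        obj = tau - np.mean(np.maximum(tau - values, 0.0)) / alpha
        best = max(best, obj)
    return best
```

For instance, with ten equally likely gains $1, \ldots, 10$ and $\alpha = 0.2$, the formula returns the mean of the worst 20% of outcomes, and with $\alpha = 1$ it reduces to the plain expectation, matching the remark above.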
[Ohsaka and Yoshida, 2017] introduced a relaxed problem of finding a portfolio over sets rather than a single set, and devised the first CVaR maximization algorithm with an approximation guarantee for the influence maximization problem [Kempe et al., 2003], a prominent example of discrete submodular maximization. [Wilder, 2018] further developed this approach and devised an algorithm called RASCAL for maximizing the CVaR of continuous DR-submodular functions subject to a down-closed convex set.

*Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21)*

The algorithms mentioned above are offline (batch) methods, i.e., a set of samples drawn from the underlying distribution $\mathcal{D}$ is given as the input. However, because the number of samples needed to accurately represent $\mathcal{D}$ can be exceedingly large, it is often inefficient or even impossible to store all the samples in memory. Further, when a new sample is observed, these offline algorithms must be rerun from scratch. In the machine learning community, online methods, which can efficiently handle large volumes of data, have been extensively studied [Hazan, 2016]. Online methods read data in a streaming fashion and update the current solution using only a few data elements stored in memory. Further, online methods can update a solution at a low computational cost.

### 1.1 Our Contributions

In this work, we propose online algorithms for maximizing the CVaR of stochastic submodular objectives in continuous and discrete settings.

**Continuous setting.** Let $z_1, \ldots, z_T$ be i.i.d. samples drawn from $\mathcal{D}$ arriving sequentially and $K \subseteq \mathbb{R}^n$ be a down-closed convex set. Our main result is a polynomial-time online algorithm, STOCHASTICRASCAL, that finds $x \in K$ such that

$$\mathbb{E}[\mathrm{CVaR}_{\mathcal{D}}(x)] \ge \left(1 - \frac{1}{e}\right) \mathrm{CVaR}_{\mathcal{D}}(x^*) - O(T^{-1/4})$$

for any $x^* \in K$, where the expectation is taken over $z_1, \ldots, z_T$ and the randomness of the algorithm.
STOCHASTICRASCAL only stores $O(\sqrt{T})$ data points in memory, in contrast to RASCAL, which stores all $T$ data points. Note that the approximation ratio of $1 - 1/e$ is optimal for any algorithm that performs polynomially many function value queries, even if $\alpha = 1$ [Vondrák, 2013]. We also conduct several experiments on real-world datasets to show the practical efficiency of our algorithm. We demonstrate that our algorithm rapidly achieves a CVaR comparable to that obtained by known offline methods.

**Discrete setting.** As an application of the above algorithm, we devise an online algorithm to create a portfolio that maximizes the CVaR of a discrete submodular function subject to a matroid constraint. Let $f(\cdot; z) : 2^V \to [0, 1]$ be a monotone stochastic submodular set function on a ground set $V$ and $\mathcal{I} \subseteq 2^V$ be a matroid. The goal is to find a portfolio $P$ of feasible sets in $\mathcal{I}$ that maximizes

$$\mathrm{CVaR}_{\mathcal{D}}(P) = \max_{\tau \in [0,1]} \left\{ \tau - \frac{1}{\alpha} \mathbb{E}_{z \sim \mathcal{D}}\left[\tau - \mathbb{E}_{X \sim P} f(X; z)\right]_+ \right\},$$

given i.i.d. samples from $\mathcal{D}$. We show that this problem can be reduced to online CVaR maximization of continuous DR-submodular functions on a matroid polytope, and devise a polynomial-time online approximation algorithm for creating a portfolio $P$ such that

$$\mathbb{E}[\mathrm{CVaR}_{\mathcal{D}}(P)] \ge \left(1 - \frac{1}{e}\right) \mathrm{CVaR}_{\mathcal{D}}(P^*) - O(T^{-1/10})$$

for any portfolio $P^*$ over feasible sets. Note that this is the first online algorithm that converges to a $(1 - 1/e)$-approximate portfolio in the discrete setting, and it generalizes the known offline algorithm [Wilder, 2018] to the i.i.d. setting.

### 1.2 Our Techniques

To analyze our online algorithms, we introduce a novel adversarial online learning problem, which we call adversarial online submodular CVaR learning. This online learning problem is described as follows. For $t = 1, \ldots, T$, the learner chooses $x_t \in K$ and $\tau_t \in [0, 1]$, possibly in a randomized manner. After $x_t$ and $\tau_t$ are chosen, the adversary reveals a monotone continuous DR-submodular function $F_t : K \to [0, 1]$ to the learner.
The goal of the learner is to minimize the approximate regret

$$\mathrm{regret}_{1-1/e}(T) = \left(1 - \frac{1}{e}\right) \sum_{t=1}^{T} H_t(x^*, \tau^*) - \sum_{t=1}^{T} H_t(x_t, \tau_t)$$

for arbitrary $x^* \in K$ and $\tau^* \in [0, 1]$, where the function $H_t$ is given by $H_t(x, \tau) = \tau - \frac{1}{\alpha}[\tau - F_t(x)]_+$. We devise an efficient algorithm that achieves $O(T^{3/4})$ approximate regret in expectation. Further, we show that, given an online algorithm with sublinear approximate regret, we can construct an online algorithm that achieves a $(1 - 1/e)$-approximation to CVaR maximization, whose convergence rate is $\mathbb{E}[\mathrm{regret}_{1-1/e}(T)]/T$. Combining these results, we obtain an online $(1 - 1/e)$-approximation algorithm for CVaR maximization with a convergence rate of $O(T^{-1/4})$.

We remark that adversarial online submodular CVaR learning may be of interest in its own right: although the objective function $H_t$ is neither monotone nor continuous DR-submodular in general, we can design an online algorithm with sublinear $(1 - 1/e)$-regret by exploiting the underlying structure of $H_t$. To the best of our knowledge, no online algorithm for non-monotone and non-DR-submodular maximization exists in the literature.

### 1.3 Related Work

Several studies have focused on CVaR optimization in the adversarial online and i.i.d. settings. [Tamar et al., 2015] studied CVaR optimization over i.i.d. samples and analyzed stochastic gradient descent under the strong assumption that CVaR is continuously differentiable. Recently, [Cardoso and Xu, 2019] introduced the concept of the CVaR regret for convex loss functions and provided online algorithms for minimizing the CVaR regret under bandit feedback. Online and stochastic optimization for submodular maximization has been extensively studied in [Streeter and Golovin, 2008; Streeter et al., 2009; Golovin et al., 2014; Karimi et al., 2017; Hassani et al., 2017; Mokhtari et al., 2018; Chen et al., 2018; Roughgarden and Wang, 2018; Soma, 2019; Karbasi et al., 2019; Zhang et al., 2019].
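To make the approximate-regret definition above concrete, it can be evaluated directly from a play sequence. Below is a small sketch (not from the paper; helper names are ours, and $x$ is passed around opaquely since only the values $F_t(x)$ matter):

```python
import math

def hinge_objective(F_t, x, tau, alpha):
    """H_t(x, tau) = tau - (1/alpha) * [tau - F_t(x)]_+ from the variational CVaR formula."""
    return tau - max(tau - F_t(x), 0.0) / alpha

def approx_regret(Fs, plays, comparator, alpha, ratio=1.0 - 1.0 / math.e):
    """(1 - 1/e)-approximate regret of plays (x_t, tau_t) against a fixed (x*, tau*):
    ratio * sum_t H_t(x*, tau*) - sum_t H_t(x_t, tau_t)."""
    x_star, tau_star = comparator
    best = sum(hinge_objective(F, x_star, tau_star, alpha) for F in Fs)
    got = sum(hinge_objective(F, x, tau, alpha) for (x, tau), F in zip(plays, Fs))
    return ratio * best - got
```

A sublinear-in-$T$ value of this quantity is exactly what Theorem 4 below guarantees in expectation.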
These studies optimize either the approximate regret or the expectation and do not consider CVaR. Another line of related work is robust submodular maximization [Krause et al., 2008; Chen et al., 2017; Anari et al., 2019], in which we maximize the minimum of $N$ submodular functions, i.e., $\min_{i=1,\ldots,N} f_i(X)$. Robust submodular maximization is the limit of CVaR maximization where $\mathcal{D}$ is the uniform distribution over $N$ values and $\alpha \to 0$. Recently, [Staib et al., 2019] proposed distributionally robust submodular optimization, which maximizes $\min_{\mathcal{D} \in \mathcal{P}} \mathbb{E}_{z \sim \mathcal{D}} f(X; z)$ for an uncertainty set $\mathcal{P}$ of distributions. It is known that CVaR can be formulated in the distributionally robust framework [Shapiro et al., 2014]. However, the algorithms proposed by [Staib et al., 2019] require that $\mathcal{P}$ be a subset of the $N$-dimensional probability simplex; moreover, their time complexity depends on $N$. Our algorithms work even if $N$ is infinite.

### 1.4 Organization of This Paper

This paper is organized as follows. Section 2 introduces the background of submodular optimization. Sections 3 and 4 describe our algorithms for the continuous and discrete settings, respectively. Section 5 presents experimental results on real-world datasets. The omitted analysis and the details of the adversarial setting can be found in the full version.

## 2 Preliminaries

Throughout the paper, $V$ denotes the ground set and $n$ denotes its size. For a set function $f : 2^V \to \mathbb{R}$, the multilinear extension $F : [0, 1]^V \to \mathbb{R}$ is defined as

$$F(x) = \sum_{S \subseteq V} f(S) \prod_{i \in S} x_i \prod_{i \notin S} (1 - x_i).$$

For a matroid on $V$, the base polytope is the convex hull of the bases of the matroid. It is well known that linear optimization over a base polytope can be solved by the greedy algorithm [Fujishige, 2005]. We denote the Euclidean norm and inner product by $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$, respectively. The $\ell_p$ norm ($1 \le p \le \infty$) is denoted by $\|\cdot\|_p$.
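As a concrete illustration of the definition above, the multilinear extension can be evaluated exactly by brute force: $F(x)$ is the expected value of $f$ on the random set that includes each element $i$ independently with probability $x_i$. The sum is exponential in $n$, so the sketch below is for tiny ground sets and illustration only:

```python
import itertools

def multilinear_extension(f, x):
    """Exact multilinear extension F(x) = sum_S f(S) * prod_{i in S} x_i * prod_{i not in S} (1 - x_i).
    Enumerates all 2^n subsets, so only suitable for very small ground sets."""
    n = len(x)
    total = 0.0
    for bits in itertools.product([0, 1], repeat=n):
        S = frozenset(i for i in range(n) if bits[i])
        prob = 1.0
        for i in range(n):
            prob *= x[i] if bits[i] else (1.0 - x[i])
        total += f(S) * prob
    return total
```

At an integral point $x \in \{0, 1\}^V$, the extension coincides with $f$ on the corresponding set, as the definition requires.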
The Euclidean projection of $x$ onto a set $K$ is denoted by $\mathrm{proj}_K(x)$. A convex set $K \subseteq \mathbb{R}^n_{\ge 0}$ is said to be down-closed if $y \in K$ and $0 \le x \le y$ imply $x \in K$. A function $f : \mathbb{R}^n \to \mathbb{R}$ is said to be $L$-Lipschitz (continuous) for $L > 0$ if $|f(x) - f(y)| \le L\|x - y\|$ for all $x, y$. We say that $f$ is $\beta$-smooth for $\beta > 0$ if $f$ is continuously differentiable and $\|\nabla f(x) - \nabla f(y)\| \le \beta\|x - y\|$. A smooth function $F : \mathbb{R}^n \to \mathbb{R}$ is said to be continuous DR-submodular [Bian et al., 2017] if $\frac{\partial^2 F}{\partial x_i \partial x_j} \le 0$ for all $i, j$. The multilinear extension of a submodular function is known to be DR-submodular [Calinescu et al., 2011]. Continuous DR-submodularity implies up-concavity: for a continuous DR-submodular function $F$, $x \in \mathbb{R}^n$, and $d \ge 0$, the univariate function $t \mapsto F(x + td)$ is concave. The uniform distribution over a set $K$ is denoted by $\mathrm{Unif}(K)$. The standard normal distribution is denoted by $N(0, I)$.

## 3 CVaR Maximization of Continuous DR-submodular Functions

We present our online algorithm for CVaR maximization via i.i.d. samples. Let $F_t : \mathbb{R}^n \to [0, 1]$ be the monotone continuous DR-submodular function corresponding to the $t$-th sample $z_t$, i.e., $F_t(x) = F(x; z_t)$ for $t = 1, \ldots, T$. Similarly, define an auxiliary function $H_t$ with respect to $z_t$ by

$$H_t(x, \tau) = \tau - \frac{1}{\alpha}[\tau - F_t(x)]_+ = \tau - \frac{1}{\alpha}[\tau - F(x; z_t)]_+$$

for $t = 1, \ldots, T$. Let $K \subseteq \mathbb{R}^n_{\ge 0}$ be a down-closed convex set. Formally, we make the following very mild assumptions on $F_t$ and $K$.

**Assumption 1.** For all $t$, $F_t$ is $L$-Lipschitz and $\beta$-smooth, and $\|\nabla F_t\| \le G$. The diameter of $K$ is bounded by $D$. We are given a linear optimization oracle over $K$.

When the underlying norm is the $\ell_p$-norm with $p \neq 2$, we write $G_p$ to emphasize it. For example, if $F_t$ is the multilinear extension of a submodular set function $f_t : 2^V \to [0, 1]$ and $K$ is the base polytope of a rank-$k$ matroid, we have $L = \beta = O(n)$, $G = 1$, and $D = O(\sqrt{k})$. Our algorithm borrows some ideas from the RASCAL algorithm [Wilder, 2018].
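The auxiliary functions $H_t$ tie the online objective back to CVaR: by the variational formula, maximizing their batch average over $\tau$ recovers the empirical CVaR of the sampled values $F_t(x)$. A quick numerical check of this identity (a sketch, not from the paper; helper names are ours):

```python
def hinge(tau, f, alpha):
    """H_t(x, tau) = tau - (1/alpha) * [tau - F_t(x)]_+, for fixed x with value f = F_t(x)."""
    return tau - max(tau - f, 0.0) / alpha

def batch_objective(tau, fvals, alpha):
    """Average of H_t(x, tau) over a mini-batch of sample values F_t(x)."""
    return sum(hinge(tau, f, alpha) for f in fvals) / len(fvals)

def tail_mean(fvals, alpha):
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of sample values."""
    vals = sorted(fvals)
    k = max(1, int(round(alpha * len(vals))))
    return sum(vals[:k]) / k
```

Since $\tau \mapsto \frac{1}{B}\sum_t H_t(x, \tau)$ is piecewise linear and concave, its maximum is attained at one of the sample values, and that maximum equals the empirical CVaR; this is the quantity the mini-batch objective below works with.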
First, we define a smoothed auxiliary function $\tilde{H}_t : [0, 1]^V \times [0, 1] \to [0, 1]$ as

$$\tilde{H}_t(x, \tau) = \mathbb{E}_{\xi}\left[(\tau + \xi) - \frac{1}{\alpha}\left[\tau + \xi - F_t(x)\right]_+\right],$$

where $\xi \sim \mathrm{Unif}([-u, 0])$ and $u > 0$ is a smoothing parameter specified later. This smoothing guarantees that $\tilde{H}_t$ is differentiable for all $x$ and has Lipschitz continuous gradients.

**Lemma 2** (Lemma 6 of [Wilder, 2018]). 1. $|\tilde{H}_t(x, \tau) - H_t(x, \tau)| \le \frac{u(1 + 1/\alpha)}{2}$ for all $x$ and $\tau$. 2. If $F_t$ is $L$-Lipschitz and $\beta$-smooth, and $\|\nabla F_t\| \le G$, then $\nabla_x \tilde{H}_t$ is $\frac{1}{\alpha}\left(\beta + \frac{GL}{u}\right)$-Lipschitz.

**Lemma 3** ([Wilder, 2018]). The function $\max_\tau \tilde{H}_t(\cdot, \tau)$ is monotone and up-concave.

### 3.1 Stochastic RASCAL

We now formally describe our algorithm, STOCHASTICRASCAL. We note that RASCAL runs the Frank-Wolfe algorithm on the function $\max_\tau \sum_{t=1}^{T} \tilde{H}_t(\cdot, \tau)$. Owing to the up-concavity and smoothness of this function, one can obtain a $(1 - 1/e)$-approximation. However, in our online setting, we cannot evaluate this function because the $H_t$ are revealed online, and hence we cannot simply run RASCAL.

To overcome this issue, we first split the $T$ samples into mini-batches of length $B$, which we specify later. The key idea is to use the objective function

$$\bar{H}_b(x) = \max_{\tau} \frac{1}{B} \sum_{t=(b-1)B+1}^{bB} \tilde{H}_t(x, \tau)$$

for the $b$-th mini-batch ($b = 1, \ldots, T/B$). We can see that $\bar{H}_b(x)$ is monotone and up-concave, and can be evaluated using only the samples in the $b$-th mini-batch. Then, we run a perturbed version of the Frank-Wolfe algorithm [Golovin et al., 2014; Bian et al., 2017] on $\bar{H}_b$. More formally, we first initialize $x_b^0 = 0$, and for each $s = 0, \delta, 2\delta, \ldots, 1 - \delta$, we perform the update $x_b^{s+\delta} \leftarrow x_b^s + \delta d_b^s$,

**Algorithm 1** STOCHASTICRASCAL
Require: learning rate $\lambda > 0$, step size $\delta > 0$, perturbation distribution $\mathcal{D}_{\mathrm{FPL}}$, smoothing parameter $u > 0$, and mini-batch size $B > 0$
1: Initialize $x_1 \in K$ arbitrarily.
2: for $b = 1, \ldots, T/B$ do
3:   Observe samples $z_{(b-1)B+1}, \ldots, z_{bB}$ and store them in mini-batch $Z_b$.
4:   /* continuous greedy */
5:   Let $x_b^0 \leftarrow 0$.
6:   for $s = 0, \delta, 2\delta, \ldots$
, $1 - \delta$ do
7:     $\tau := \mathrm{SMOOTHTAU}(x_b^s, u, Z_b)$
8:     $g_b^s := \mathrm{SMOOTHGRAD}(x_b^s, \tau, u, Z_b)$
9:     Find a vertex $d_b^s$ of $K$ that maximizes $\left\langle \lambda \sum_{b'=1}^{b} g_{b'}^s + r_b^s, d \right\rangle$ over $d \in K$, where $r_b^s \sim \mathcal{D}_{\mathrm{FPL}}$. // FPL
10:    $x_b^{s+\delta} \leftarrow x_b^s + \delta d_b^s$
11:  end for
12:  Let $x_b = x_b^1$.
13: end for
14: return $x_b$ for $b$ chosen from $\{1, 2, \ldots, T/B\}$ uniformly at random.

**Algorithm 2** SMOOTHGRAD$(x, \tau, u, Z)$
Require: $x$, $\tau$, $u$, and mini-batch $Z$
1: $I_z(\tau) := \max\left\{\min\left\{\frac{\tau - F(x; z)}{u}, 1\right\}, 0\right\}$ for $z \in Z$.
2: return $\sum_{z \in Z} I_z(\tau) \nabla_x F(x; z)$

where $\delta$ is the step size and $d_b^s$ is a solution to a perturbed linear optimization problem on $K$:

$$d_b^s \in \operatorname*{argmax}_{d \in K} \left\langle \lambda \sum_{b'=1}^{b} \nabla \bar{H}_{b'}(x_{b'}^s) + r_b^s, d \right\rangle.$$

Here, $r_b^s \sim \mathcal{D}_{\mathrm{FPL}}$ is a perturbation vector. This perturbation trick, in the style of follow-the-perturbed-leader (FPL), stabilizes the algorithm so that we can maximize the true objective $\max_\tau \sum_{t=1}^{T} \tilde{H}_t(\cdot, \tau)$ using only mini-batch objectives. In each iteration of continuous greedy, we need the gradients $\nabla \bar{H}_b(x_b^s)$, which in turn require us to compute $\operatorname*{argmax}_\tau \frac{1}{B} \sum_{t=(b-1)B+1}^{bB} \tilde{H}_t(x_b^s, \tau)$. These gradients and the optimal $\tau$ can be computed by the SMOOTHGRAD and SMOOTHTAU subroutines, respectively, which were proposed in [Wilder, 2018]; see Algorithms 2 and 3. Let us write $x_b := x_b^1$ for $b = 1, \ldots, T/B$. The final output of STOCHASTICRASCAL is $x_b$ for an index $b$ chosen uniformly at random from $\{1, 2, \ldots, T/B\}$. The pseudocode of STOCHASTICRASCAL is presented in Algorithm 1.

### 3.2 Convergence Rate via Regret Bounds

Let us consider the convergence rate of STOCHASTICRASCAL. The main challenge of the analysis is how to set the parameters used in the algorithm, i.e., the learning rate $\lambda$, step size $\delta$, perturbation distribution $\mathcal{D}_{\mathrm{FPL}}$, smoothing parameter

**Algorithm 3** SMOOTHTAU$(x, u, Z)$
Require: $x$, $u$, and mini-batch $Z$
1: $\mathcal{B} := \{F(x; z) \mid z \in Z\} \cup \{F(x; z) + u \mid z \in Z\}$
2: Sort $\mathcal{B}$ in ascending order, obtaining $\mathcal{B} = \{b_1, \ldots, b_{|\mathcal{B}|}\}$.
3: $i^* := \min\{i : \sum_{z \in Z} I_z(b_i) \ge \alpha|Z|\}$
4: $A := \{z \in Z : F(x; z) < b_{i^*} \le F(x; z) + u\}$
5: $C := \{z \in Z : F(x; z) + u < b_{i^*}\}$
6: return the solution $\tau$ of the linear equation $\sum_{z \in A} \frac{\tau - F(x; z)}{u} + |C| = \alpha|Z|$.
$u$, and mini-batch size $B$, so as to achieve the desired $O(T^{-1/4})$ convergence rate. To this end, using tools from online convex optimization, we prove an approximate regret bound for a variant of STOCHASTICRASCAL in the adversarial online submodular CVaR learning setting (see the Introduction for the definition).

**Theorem 4** (informal). There exists an efficient online algorithm for adversarial online submodular CVaR learning with

$$\left(1 - \frac{1}{e}\right) \sum_{t=1}^{T} H_t(x^*, \tau^*) - \mathbb{E}\left[\sum_{t=1}^{T} H_t(x_t, \tau_t)\right] \le O(T^{3/4})$$

for an arbitrary $x^* \in K$ and $\tau^* \in [0, 1]$, where the big-O notation hides a factor polynomial in $\alpha$, $\beta$, $D$, $G$, and $n$.

We then show that the above regret bound yields a convergence rate for STOCHASTICRASCAL. The technical details of the adversarial setting and the proof of the following theorem are deferred to the full version.

**Theorem 5.** Under Assumption 1, STOCHASTICRASCAL outputs $x \in K$ such that for any $x^* \in K$,

$$\mathbb{E}[\mathrm{CVaR}_{\mathcal{D}}(x)] \ge \left(1 - \frac{1}{e}\right) \mathrm{CVaR}_{\mathcal{D}}(x^*) - O\left(\frac{C_\alpha G D n^{1/8}}{T^{1/4}}\right),$$

where we set $\lambda = \frac{\alpha C_\alpha}{\sqrt{T} D G n^{1/4}}$, $\delta = \frac{\alpha^2}{\sqrt{T} + \alpha\beta T^{1/4}}$, $B = \sqrt{T}$, $u = \frac{T^{-1/4}}{1 + 1/\alpha}$, and $\mathcal{D}_{\mathrm{FPL}} = \mathrm{Unif}([0, 1]^n)$, and $C_\alpha := \max\{1, \frac{1}{\alpha} - 1\}$. Further, if $K$ is an integral polytope contained in $\{x \in [0, 1]^n : \sum_i x_i = k\}$, then

$$\mathbb{E}[\mathrm{CVaR}_{\mathcal{D}}(x)] \ge \left(1 - \frac{1}{e}\right) \mathrm{CVaR}_{\mathcal{D}}(x^*) - O\left(\frac{C_\alpha G k^{3/4} \log^{1/4} n}{T^{1/4}}\right),$$

where we set $\lambda = \frac{\alpha}{2G k^{3/2} \log(n)}$, $\delta = \frac{\alpha^2}{\sqrt{T} + \alpha\beta T^{1/4}}$, $B = \sqrt{T}$, $u = \frac{T^{-1/4}}{1 + 1/\alpha}$, and $\mathcal{D}_{\mathrm{FPL}} = N(0, I)$.

To achieve $\mathbb{E}[\mathrm{CVaR}_{\mathcal{D}}(x)] \ge (1 - \frac{1}{e})\mathrm{CVaR}_{\mathcal{D}}(x^*) - \varepsilon$ for a desired error $\varepsilon > 0$, STOCHASTICRASCAL requires $O(1/\varepsilon^4)$ samples (up to factors polynomial in the above parameters) and $O(DGn^{1.25}/\varepsilon^2)$ space, whereas RASCAL [Wilder, 2018] requires $O(n/\varepsilon^2)$ samples and $O(n^2/\varepsilon^2)$ space. Our algorithm runs in smaller space when the parameters are of moderate size. For example, if $\beta = D = O(1)$ and $L = G = o(n^{1/4})$, the space complexity of STOCHASTICRASCAL is better than that of RASCAL.

## 4 CVaR Maximization of Discrete Submodular Functions

We now present our online algorithm for a monotone submodular set function and a matroid constraint.
Let $f_t : 2^V \to [0, 1]$ be the monotone submodular function corresponding to the $t$-th sample and $F_t$ be its multilinear extension for $t = 1, \ldots, T$. The basic idea is to run STOCHASTICRASCAL on the multilinear extensions $F_t$ and the matroid polytope $K$. However, we must address several technical obstacles.

First, we must compare the output portfolio with the optimal portfolio, whereas the error bound in the previous section compares against an optimal single solution. To this end, we make multiple copies of the variables so that we can approximate an optimal portfolio by a uniform distribution over a multiset of feasible solutions. More precisely, we define a continuous DR-submodular function $\bar{F}_t : K^r \to [0, 1]$ by

$$\bar{F}_t(x^1, \ldots, x^r) = \frac{1}{r} \sum_{i=1}^{r} F_t(x^i)$$

for some sufficiently large $r$. Then, we feed $\bar{F}_1, \ldots, \bar{F}_T$ to STOCHASTICRASCAL. Suppose that we obtain $(x_b^1, \ldots, x_b^r)$ at Line 12 for each mini-batch $b = 1, \ldots, T/B$. Abusing notation, let us write $(x_t^1, \ldots, x_t^r) := (x_b^1, \ldots, x_b^r)$ when the $t$-th sample is in the $b$-th mini-batch.

Next, we need to convert $x_t^1, \ldots, x_t^r$ to feasible sets without significantly deteriorating the values of the multilinear extensions. To this end, we independently apply randomized swap rounding [Chekuri et al., 2010] $q$ times to each $x_t^i$ to obtain feasible sets $X_t^{i,1}, \ldots, X_t^{i,q}$. Note that randomized swap rounding is an oblivious rounding scheme and is independent of $F_t$. We can show that $\frac{1}{q} \sum_{j=1}^{q} f_t(X_t^{i,j})$ is close to $F_t(x_t^i)$ by using a concentration inequality. Finally, after $T$ rounds, we return the uniform portfolio over all $X_t^{i,j}$. The pseudocode is given in Algorithm 4. Carefully choosing $r$ and $q$, we obtain the following theorem.

**Algorithm 4** Online algorithm for maximizing a monotone submodular set function subject to a matroid constraint.
1: Run STOCHASTICRASCAL on $\bar{F}_1, \ldots, \bar{F}_T$ and the matroid polytope $K$, and let $(x_t^1, \ldots, x_t^r)$ be the temporary solution at Line 12 in STOCHASTICRASCAL for $t = 1, \ldots, T$.
2: $X_t^{i,j} \leftarrow \mathrm{RANDOMIZEDSWAPROUNDING}(x_t^i)$ for $t = 1, \ldots, T$, $i = 1, \ldots, r$, and $j = 1, \ldots, q$.
3: return the uniform portfolio $\bar{P}$ over all $X_t^{i,j}$.

**Theorem 6.** Algorithm 4 achieves

$$\mathbb{E}[\mathrm{CVaR}_{\mathcal{D}}(\bar{P})] \ge \left(1 - \frac{1}{e}\right) \mathrm{CVaR}_{\mathcal{D}}(P^*) - O\left(k^{3/4} \log^{1/4}(n)\, T^{-1/10}\right)$$

for an arbitrary portfolio $P^*$, where we set $r = O(T^{1/5})$ and $q = O(T^{3/4})$, and the expectation is taken over $z_1, \ldots, z_T$ and the randomness of the algorithm.

## 5 Experiments

In this section, we present our experimental results. In all the experiments, the parameter $\alpha$ of CVaR was set to 0.1. The experiments were conducted on a Linux server with an Intel Xeon Gold 6242 (2.8 GHz) and 384 GB of main memory.

**Problem Description.** We conducted experiments on the sensor resource allocation problem, in which the goal is to rapidly detect a contagion spreading through a network using a limited budget [Bian et al., 2017; Leskovec et al., 2007; Soma and Yoshida, 2015]. Here, we follow the configuration of the experiments conducted in [Wilder, 2018]. Let $G = (V, E)$ be a graph on $n$ vertices. A contagion starts at a random vertex and spreads over time according to some specific stochastic process. Let $z_v$ be the time at which the contagion reaches $v$, and let $z_{\max} = \max_{v \in V : z_v < \infty} z_v$. If $z_v = \infty$ for some vertex $v \in V$, that is, the contagion does not reach $v$, we reassign $z_v = z_{\max}$, as described in [Wilder, 2018]. The decision maker has a budget $B$ (e.g., energy) to spend on sensing resources. Let $x_v$ represent the amount of energy allocated to the sensor at a vertex $v$. When the contagion reaches $v$ at time $z_v$, the sensor detects it with probability $1 - (1 - p)^{x_v}$, where $p \in [0, 1]$ is the probability of detecting the contagion per unit of energy. The objective $F$ on vectors $x = (x_v)_{v \in V}$ and $z = (z_v)_{v \in V}$ is the expected amount of detection time saved by the sensor placements: with $v_1, v_2, \ldots$ ordered by increasing arrival time $z_v$,

$$F(x; z) = \sum_{i} (z_{\max} - z_{v_i})\left(1 - (1 - p)^{x_{v_i}}\right) \prod_{j < i} (1 - p)^{x_{v_j}}.$$
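The objective can be evaluated by a single scan over vertices in order of arrival time, accumulating the probability that no earlier sensor has detected the contagion. A minimal sketch, assuming the first-detection semantics described above (the original formula is truncated in this copy, so the product term is our reconstruction; `detection_value` is our name, not from the paper):

```python
import numpy as np

def detection_value(x, z, p):
    """Expected detection time saved.  Vertices are scanned in order of arrival
    time z_v; vertex v is the first to detect with probability
    (1 - (1-p)^{x_v}) * prod over earlier vertices w of (1-p)^{x_w},
    in which case z_max - z_v time is saved."""
    z = np.asarray(z, dtype=float)
    zmax = z[np.isfinite(z)].max()
    z = np.where(np.isfinite(z), z, zmax)  # unreached vertices get z_max
    total = 0.0
    miss = 1.0  # probability that no earlier sensor has detected
    for v in np.argsort(z):
        detect = 1.0 - (1.0 - p) ** x[v]
        total += (zmax - z[v]) * detect * miss
        miss *= (1.0 - p) ** x[v]
    return total
```

Allocating more energy to a vertex can only increase the value, consistent with the monotonicity assumed throughout the paper.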