# Risk-Sensitive Submodular Optimization

Bryan Wilder
Department of Computer Science and Center for Artificial Intelligence in Society, University of Southern California
bwilder@usc.edu

## Abstract

The conditional value at risk (CVaR) is a popular risk measure which enables risk-averse decision making under uncertainty. We consider maximizing the CVaR of a continuous submodular function, an extension of submodular set functions to a continuous domain. One example application is allocating a continuous amount of energy to each sensor in a network, with the goal of detecting intrusion or contamination. Previous work allows maximization of the CVaR of a linear or concave function. Continuous submodularity represents a natural class of nonconcave functions with diminishing returns, to which existing techniques do not apply. We give a (1 − 1/e)-approximation algorithm for maximizing the CVaR of a monotone continuous submodular function. This also yields an algorithm for submodular set functions which produces a distribution over feasible sets with guaranteed CVaR. Experimental results in two sensor placement domains confirm that our algorithm substantially outperforms competitive baselines.

## Introduction

Decision-making under uncertainty is a ubiquitous problem. Suppose we want to maximize a function F(x, y), where x is a vector of decision variables and y is a random variable drawn from a distribution D. A natural approach is to maximize E_y[F(x, y)], i.e., the expected value of the chosen decision. However, decision makers are often risk-averse: they would rather minimize the chance of a very low reward than focus purely on the average. This is rational behavior when failure can have large consequences. For instance, if a corporation suffers a disastrous loss, it may simply go out of business. In many other cases, low performance entails safety issues.
For instance, if a sensor network for water contamination detects problems instantly in 80% of cases but fails entirely in the other 20%, the population will inevitably be exposed to an unacceptable health risk. It is much better to have a sensor network which always detects contaminants, even if detection takes somewhat more time on average. Hence, it is natural to move beyond average-case analysis and optimize a risk-aware objective function. One widespread choice is the conditional value at risk (CVaR).

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

CVaR takes a tunable parameter α. Roughly, it measures the performance of a decision in the worst α fraction of scenarios. It is known that when the objective F is concave, CVaR can be optimized via a concave program as well. However, many natural objective functions are not concave, and no general algorithms are known for nonconcave functions. We focus on submodular functions. Submodularity captures diminishing returns and appears in application domains ranging from viral marketing (Kempe, Kleinberg, and Tardos 2003), to machine learning (Kulesza and Taskar 2012), to auction theory (Vondrák 2008). We analyze submodular functions in two settings:

**Continuous:** Continuous submodularity, which has lately received increasing attention (Bach 2015; Bian et al. 2017; Staib and Jegelka 2017), generalizes the notion of a submodular set function to continuous domains. Many well-known discrete problems (e.g., sensor placement, influence maximization, or facility location) admit natural extensions where resources are divided in a continuous manner. Continuous submodular functions have also been extensively studied in economics as a model of diminishing returns or strategic substitutes (Koçkesen, Ok, and Sethi 2000; Sampson 2016). Our main result is a (1 − 1/e)-approximation algorithm for maximizing the CVaR of any monotone, continuous submodular function.
No algorithm was previously known for this problem.

**Portfolio of discrete sets:** Our results for continuous submodular functions also transfer to set functions. We study a setting where the algorithm can select a distribution over feasible sets, which is of interest when the aim is to select a portfolio of sets to hedge against risk (Ohsaka and Yoshida 2017). Similar settings have also been studied in robust submodular optimization (Krause, Roper, and Golovin 2011; Chen et al. 2017; Wilder 2017). We give a black-box reduction from the discrete portfolio problem to CVaR optimization of continuous submodular functions, allowing us to apply our algorithm for the continuous problem. The state of the art for the discrete portfolio setting is an algorithm by Ohsaka and Yoshida (2017) for CVaR influence maximization. Our results are stronger in two ways: (i) they apply to any submodular function, and (ii) they give a stronger approximation guarantee. Allowing the algorithm to select a convex combination of sets is provably necessary: Maehara (2015) proved that, restricted to single sets, it is NP-hard to compute any multiplicative approximation to the CVaR of a submodular set function.

The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

We experimentally evaluate our algorithm for sensor resource allocation, focusing on two domains: detecting contagion or rumors in social networks, and detecting contamination in water networks. In both cases, our algorithm substantially outperforms baselines.

## Problem description

In this section, we formally define continuous submodularity and the conditional value at risk. We first study the continuous setting and then extend our results to discrete portfolio optimization.

**Continuous submodularity:** Let X = ∏_{i=1}^n X_i be a subset of ℝⁿ, where each X_i is a compact subset of ℝ. A twice-differentiable function F : X → ℝ is diminishing returns submodular (DR-submodular) if for all x ∈ X and all i, j = 1...n, ∂²F(x)/∂x_i∂x_j ≤ 0 (Bian et al. 2017). Intuitively, the gradient of F only shrinks as x grows, just as the marginal gains of a submodular set function only decrease as items are added. Continuous submodular functions need not be convex or concave (concavity requires that the Hessian be negative semi-definite, not merely that its individual entries be nonpositive). We consider monotone functions, where F(x) ≥ F(y) whenever x ≥ y (≥ denotes element-wise inequality). We assume that F takes values in [0, M] for some constant M. Without loss of generality, we assume F(0) = 0 (normalization).

In our setting, F is a function of both the decision variables x and a random parameter y. Specifically, we consider functions F(x, y) where F(·, y) is continuous submodular in x for each fixed y. We allow any DR-submodular F which satisfies some standard smoothness conditions. First, we assume that F is L₁-Lipschitz for some constant L₁ (for concreteness, with respect to the ℓ₂ norm¹). Second, we assume that F is twice differentiable with L₂-Lipschitz gradient. Third, we assume that F has bounded gradients, ||∇F||₂ ≤ G. Only the last condition is strictly necessary; our approach can be extended to any F with bounded gradients via known techniques (Duchi, Bartlett, and Wainwright 2012).

**Conditional value at risk:** Intuitively, the CVaR measures performance in the worst α fraction of cases. First, we define the value at risk at level α ∈ [0, 1]:

VaR_α(x) = inf{τ ∈ ℝ : Pr_y[F(x, y) ≤ τ] ≥ α}.

That is, VaR_α(x) is the α-quantile of the random variable F(x, y). CVaR is the expectation of F(x, y), conditioned on it falling into this set of α-worst cases:

CVaR_α(x) = E_y[F(x, y) | F(x, y) ≤ VaR_α(x)].

¹We use the ℓ₂ norm for concreteness. However, our arguments easily generalize to any ℓ_p norm.

CVaR is a more popular risk measure than VaR both because it accounts for the impact of the entire α-tail of the distribution and because it has better mathematical properties (Rockafellar and Uryasev 2000).
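These definitions translate directly into code. The following is a minimal sketch (illustrative names, not from the paper) of the empirical VaR and CVaR of a finite sample of objective values, where lower values are worse because we are maximizing F:

```python
# Empirical VaR and CVaR at level alpha from sampled objective values.
# `values` holds F(x, y_i) for i.i.d. samples y_1...y_s.

def empirical_var(values, alpha):
    """alpha-quantile of the samples: the value below which (at most)
    an alpha fraction of the samples fall."""
    ordered = sorted(values)               # ascending: worst outcomes first
    k = max(1, int(alpha * len(ordered)))  # size of the alpha-tail
    return ordered[k - 1]

def empirical_cvar(values, alpha):
    """Mean of the worst alpha fraction of the samples."""
    ordered = sorted(values)
    k = max(1, int(alpha * len(ordered)))
    return sum(ordered[:k]) / k

vals = [10.0, 2.0, 8.0, 4.0, 6.0, 9.0, 1.0, 7.0, 5.0, 3.0]
print(empirical_var(vals, 0.3))   # 3.0: third-worst of 10 samples
print(empirical_cvar(vals, 0.3))  # 2.0: mean of the worst three (1, 2, 3)
```

Note how CVaR averages over the whole α-tail, whereas VaR only reports its boundary.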
**Optimization problem:** We consider the problem of maximizing CVaR_α(x) over x belonging to some feasible set P. We allow P to be any downward closed polytope. A polytope is downward closed if there is a lower bound ℓ such that x ≥ ℓ for all x ∈ P, and for any y ∈ P, ℓ ≤ x ≤ y implies that x ∈ P. Without loss of generality, we assume that P is entirely nonnegative with ℓ = 0. Otherwise, we can define the translated set P′ = {x − ℓ : x ∈ P} and corresponding function F′(x, y) = F(x + ℓ, y). Let d = max_{x,y∈P} ||x − y||₂ be the diameter of P. We want to solve the problem max_{x∈P} CVaR_α(x).

It is important to note that CVaR_α(x) need not be a smooth DR-submodular function of x. However, we would like to leverage the nice properties of the underlying F. Toward this end, we note that the above problem can be rewritten in a more useful form (Rockafellar and Uryasev 2000). Let [t]⁺ = max(t, 0). Maximizing CVaR_α(x) is equivalent to solving

max_{x∈P, τ∈[0,M]}  τ − (1/α) E[τ − F(x, y)]⁺   (1)

where τ is an auxiliary parameter. For any fixed x, the optimal value of τ is VaR_α(x) (Rockafellar and Uryasev 2000). It is known that when F(·, y) is concave in x, this is a concave optimization problem. However, little is known when F may be nonconcave.

## Related work

CVaR enjoys widespread popularity as a risk measure in many domains, ranging from finance (Mansini, Ogryczak, and Speranza 2007) to electricity production (Yau et al. 2011). More broadly, there is burgeoning interest in methods which move beyond expected performance (Ermon et al. 2011; Yin et al. 2011; Yu and Nikolova 2013; Hoy and Nikolova 2015). Oftentimes, this concern is motivated by safety-critical domains where an algorithm designer must be able to minimize the risk of disastrous events, not just guarantee good results on average. Here, we survey the most closely related work, dealing with CVaR optimization. Rockafellar and Uryasev (2000) introduced CVaR and proposed a linear program for optimizing it.
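The Rockafellar–Uryasev reformulation (Problem 1) can be sanity-checked numerically. The sketch below (illustrative names and sample values, not from the paper) scans the auxiliary parameter τ over a grid and confirms that the objective peaks at the empirical VaR, with the peak value equal to the sampled CVaR in the Rockafellar–Uryasev sense:

```python
# Check the reformulation: for fixed x, the map
#   tau -> tau - (1/alpha) * E[tau - F(x, y)]^+
# peaks at tau = VaR_alpha(x), where it equals CVaR_alpha(x).
# `vals` stands in for sampled objective values F(x, y_i).

def cvar_objective(tau, vals, alpha):
    tail = sum(max(tau - v, 0.0) for v in vals) / len(vals)
    return tau - tail / alpha

vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
alpha = 0.25
grid = [i / 100 for i in range(0, 1001)]   # tau in [0, 10]
best_tau = max(grid, key=lambda t: cvar_objective(t, vals, alpha))
print(best_tau)   # 3.0: the empirical VaR at level 0.25
print(round(cvar_objective(best_tau, vals, alpha), 6))   # 1.8
```

Here the peak value 1.8 = 3 − 4·(2 + 1 + 0)/10 interpolates across the α-tail, which is exactly what makes this form better behaved than the raw quantile.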
This linear program only applies when utility is linear in the decision variables. Iyengar and Ma (2013) and Hong and Liu (2009) present faster gradient-based algorithms for the linear case. Here, we deal with nonlinear functions. The LP approach can be extended, by solving a general concave program, when the utilities are concave. Our main contribution is extending the range of optimizable functions to include nonconcave continuous submodular objectives. Another body of work focuses on CVaR in reinforcement learning and MDPs (Prashanth and Ghavamzadeh 2013; Tamar et al. 2015; Chow et al. 2015). Lastly, Ohsaka and Yoshida (2017) study CVaR for discrete influence maximization; we contrast our results with theirs when we discuss the discrete portfolio setting.

## Preliminaries

We now review techniques for optimizing smooth continuous submodular functions. These do not directly apply to CVaR, but our solution builds on them. An important property is that continuous submodular functions are concave along nonnegative directions. Formally,

**Definition 1.** A function F(x) is up-concave if for any x and any direction y ≥ 0, the function ξ ↦ F(x + ξy) is concave in ξ ∈ [0, 1].

All continuous submodular functions are up-concave (Bian et al. 2017). Monotone up-concave functions can be optimized via a modified Frank-Wolfe algorithm (Bian et al. 2017; Calinescu et al. 2011). Frank-Wolfe is a gradient-based algorithm originally introduced to maximize concave functions. Consider an objective F. Frank-Wolfe algorithms start at an initial point x⁰ ∈ P and then generate a series of feasible solutions x¹...x^K over some number of iterations K. At each step k, the algorithm calculates the gradient at the current point, ∇F(x^{k−1}). It then takes a step towards the point v^k ∈ P which lies furthest in the direction of the gradient. That is, v^k is the solution to the linear optimization problem arg max_{v∈P} ⟨v, ∇F(x^{k−1})⟩.
In the standard Frank-Wolfe algorithm for concave functions, the algorithm then updates to a convex combination of x^{k−1} and v^k by setting x^k = x^{k−1} + γ_k(v^k − x^{k−1}) for some step size γ_k. Note that some entries of x^k may be smaller than the corresponding entries of x^{k−1}. This is necessary for optimality: the algorithm may need to backtrack if it has made some entry too large. This update rule does not work for up-concave functions because the objective is not concave along negative directions. Hence, the update for the modified Frank-Wolfe algorithm is x^k = x^{k−1} + γ_k v^k, which only increases each coordinate. Because the algorithm is unable to backtrack, it achieves a (1 − 1/e)-approximation instead of the global optimum which is achievable for fully concave functions. The process is analogous to the greedy algorithm for submodular set functions, which successively includes elements based on their current marginal gain. The continuous Frank-Wolfe algorithm instead successively increases entries in the solution vector based on the current gradient.

## Algorithmic approach

We now introduce the RASCAL (Risk Averse Submodular optimization via Conditional vALue at risk) algorithm for continuous submodular CVaR optimization. RASCAL solves Problem 1, which is a function of both the decision variables x and the auxiliary parameter τ. Roughly, τ should be understood as a threshold maintained by the algorithm for what constitutes a bad scenario: at each iteration, RASCAL tries to increase F(x, y) for those scenarios y with F(x, y) ≤ τ.

Before describing the optimization algorithm more formally, we deal with the challenge that the expectation in Problem 1 cannot generally be evaluated in closed form. We replace the expectation with the average over a set of sampled scenarios. Suppose that we draw a set of samples y₁...y_s i.i.d. from D; call this set Y. Then we can estimate

E[τ − F(x, y)]⁺ ≈ (1/s) Σ_{y∈Y} [τ − F(x, y)]⁺.

With sufficiently many samples, this approximation will be accurate to any desired level:

**Lemma 1.** Take s = O(nM²/ϵ²) samples and let ĈVaR_α be the empirical CVaR on the samples. Then |ĈVaR_α(x) − CVaR_α(x)| ≤ ϵ/3 holds for all x ∈ P with probability at least 1 − δ.

The proof is in the supplement. As a minor technicality, we assume that F(x, y_i) takes a distinct value for each x and y_i ∈ Y, so that an exact α-quantile exists. This is without loss of generality, since we can always add an arbitrarily small tie-breaking value r_i, using F(x, y_i) + r_i instead.

```
Algorithm 1: RASCAL
Require: K, u, s, LO
 1: Y ← s samples drawn i.i.d. from D
 2: x^0 ← 0, τ ← 0
 3: for k = 1...K do
 4:   g ← SMOOTHGRAD(x^{k−1}, τ, u)
 5:   v ← LO(g)
 6:   x^k ← x^{k−1} + (1/K) v
 7:   τ ← SMOOTHTAU(x^k, u)
 8: end for
 9: return x^K

function SMOOTHGRAD(x, τ, u)
  I_y(τ) ← max(min((τ + u − F(x, y))/u, 1), 0) for each y ∈ Y
  return Σ_{y∈Y} I_y(τ) ∇_x F(x, y)
end function

function SMOOTHTAU(x, u)
  B ← {F(x, y) | y ∈ Y} ∪ {F(x, y) − u | y ∈ Y}, sorted ascending as b₁...b_{2s}
  i* ← min{i = 1...2s : g(b_i) ≥ αs}, where g(τ) = Σ_{y∈Y} I_y(τ)
  C ← {y ∈ Y : F(x, y) ≤ b_{i*−1}}           (scenarios with I_y = 1 on the interval)
  A ← {y ∈ Y : 0 < I_y < 1 on (b_{i*−1}, b_{i*})}   (partially counted scenarios)
  return the τ ∈ [b_{i*−1}, b_{i*}] solving Σ_{y∈A} (τ + u − F(x, y))/u + |C| = αs
end function
```

We can now formally introduce RASCAL (Algorithm 1). RASCAL maximizes the objective

H(x, τ) = τ − (1/(αs)) Σ_{y∈Y} [τ − F(x, y)]⁺.

Maximizing H is equivalent to maximizing the sampled CVaR. RASCAL is a coordinate-ascent-style algorithm. Each iteration first makes a Frank-Wolfe-style update to x (lines 4-6). This step assumes access to a linear optimization oracle LO which maximizes a given linear function over P. RASCAL then sets τ to its optimal value given the current x (line 7). This approach is motivated by the particular properties of H. It can be shown that H is jointly up-concave in (x, τ). However, H is not monotone in τ.
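Under simplifying assumptions that are not from the paper — a toy separable DR-submodular objective, a budget-simplex feasible region whose linear oracle concentrates the whole budget on the best single coordinate, exact gradients, step size γ_k = 1/K, and a plain empirical VaR update in place of SMOOTHTAU — the overall loop can be sketched as:

```python
import math, random

def F(x, y):           # toy monotone DR-submodular objective (illustrative)
    return sum(yi * (1.0 - math.exp(-xi)) for xi, yi in zip(x, y))

def grad_F(x, y):
    return [yi * math.exp(-xi) for xi, yi in zip(x, y)]

def lo_budget(c, budget):  # arg max_{v in P} <v, c> over {v >= 0, sum v <= B}
    v = [0.0] * len(c)
    j = max(range(len(c)), key=lambda i: c[i])
    if c[j] > 0:
        v[j] = budget
    return v

def rascal_sketch(scenarios, n, budget, alpha, u=0.05, K=100):
    s = len(scenarios)
    x, tau = [0.0] * n, 0.0
    for _ in range(K):
        # smoothed gradient: weight each scenario by how far F(x, y)
        # sits below the window [tau, tau + u]
        g = [0.0] * n
        for y in scenarios:
            w = max(min((tau + u - F(x, y)) / u, 1.0), 0.0)
            if w > 0:
                for i, gi in enumerate(grad_F(x, y)):
                    g[i] += w * gi
        v = lo_budget(g, budget)
        x = [xi + vi / K for xi, vi in zip(x, v)]   # monotone FW step
        vals = sorted(F(x, y) for y in scenarios)   # tau <- empirical VaR
        tau = vals[max(1, int(alpha * s)) - 1]
    return x, tau

random.seed(0)
scenarios = [[random.uniform(0.5, 2.0) for _ in range(3)] for _ in range(40)]
x, tau = rascal_sketch(scenarios, n=3, budget=3.0, alpha=0.2)
print([round(xi, 2) for xi in x], round(tau, 3))
```

Because the update only ever adds to x, the final solution stays feasible for the budget constraint by construction, mirroring the no-backtracking Frank-Wolfe step described above.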
Indeed, H is decreasing in τ for τ > VaR_α(x). The Frank-Wolfe algorithm relies crucially on monotonicity; nonmonotonicity is much more difficult to handle. Instead, we exploit a particular form of structure in H. Specifically, H is monotone in x, but only up-concave (not fully concave). Conversely, while H is nonmonotone in τ, we can easily solve the one-dimensional problem max_{τ∈[0,M]} H(x, τ) for any fixed x (we explain how later). Our approach makes use of both properties: the Frank-Wolfe update leverages monotone up-concavity in x, while the update to τ leverages easy solvability of the one-dimensional subproblem.

In order to make this approach work, two ingredients are necessary. First, we need access to the gradient of H in order to implement the Frank-Wolfe update for x. Unfortunately, H is not even differentiable everywhere. We instead present a smoothed estimator, SMOOTHGRAD, which restores differentiability at the cost of introducing a controlled amount of bias. Second, we need to solve the one-dimensional problem of finding the optimal value of τ. We in fact introduce a subroutine, SMOOTHTAU, which solves a smoothed version of the optimal-τ problem.

**Smoothed gradient:** We now calculate the gradient of the objective with respect to x, ∇_x H(x, τ). Essentially, H counts the value of all scenarios y for which F(x, y) ≤ τ. If F(x, y) ≠ τ for all y ∈ Y, then

∇_x H(x, τ) = (1/(αs)) Σ_{y∈Y: F(x,y)≤τ} ∇_x F(x, y).

Unfortunately, if there is a y ∈ Y such that F(x, y) = τ, then H may not be differentiable at x. To see this, consider the directional derivatives from two different directions. From a nonpositive direction, F(x, y) is always below τ and hence counts towards the gradient. From a positive direction, F(x, y) may lie above τ, in which case its contribution is zero. Frank-Wolfe algorithms require differentiability (in fact, they require a Lipschitz gradient).
This is not a minor technical point: if the gradient can change radically over small regions, then gradient-based updates may prove fruitless. Thus, RASCAL uses a smoothed gradient estimate over the window from τ to τ + u for some small u > 0:

SMOOTHGRAD(x, τ) = (1/u) ∫₀ᵘ ∇_x H(x, τ + z) dz.

The intuition is that we average over a small window of τ values so that the contribution of a given scenario to the gradient does not suddenly drop to 0 if x increases slightly. Note that, since we have sampled a finite set of s scenarios, the set of points at which H is not differentiable has measure 0; hence, the integral exists.

We now show how to evaluate the integral exactly (see Algorithm 1 for pseudocode). We have

(1/u) ∫₀ᵘ ∇_x H(x, τ + z) dz = (1/(αs)) Σ_{y∈Y} ∇_x F(x, y) · (1/u) ∫₀ᵘ 1[F(x, y) ≤ τ + z] dz,

where 1[·] is the indicator function. The value of the inner integral is just max(min((τ + u − F(x, y))/u, 1), 0). Call this value I_y(τ). By the above, SMOOTHGRAD(x, τ) = Σ_{y∈Y} I_y(τ) ∇_x F(x, y), where we drop the constant factor 1/(αs) since it does not affect the linear optimization step. This can be computed in time O(s(T₁ + T₂)), where T₁ is the time to evaluate F and T₂ is the time to differentiate it.

**Finding the optimal τ:** The update SMOOTHTAU sets τ to its optimal value over a smoothed window of size u (in order to match SMOOTHGRAD). Specifically, we find the τ maximizing (1/u) ∫₀ᵘ H(x, τ + z) dz. Recall that for the unsmoothed H, the optimal setting of τ is VaR_α(x), i.e., the value such that F takes value at most τ in an α fraction of scenarios. An analogous property holds for the smoothed version:

**Lemma 2.** Define g(τ) = Σ_{y∈Y} I_y(τ). (a) τ maximizes (1/u) ∫₀ᵘ H(x, τ + z) dz if g(τ) = αs. (b) g is piecewise linear and monotone increasing.

In Lemma 2(a), the condition g(τ) = αs expresses that an α fraction of the scenarios, weighted by I_y(τ), should have F(x, y) ≤ τ. The key property for efficiently finding the τ which satisfies this condition is given in Lemma 2(b): g is piecewise linear and monotone increasing in τ.
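Piecewise linearity is what makes the one-dimensional subproblem exactly solvable. The sketch below (illustrative names, following the same smoothing convention as SMOOTHGRAD rather than the paper's exact pseudocode) finds the τ with g(τ) = αs by scanning breakpoints and solving one linear interpolation:

```python
# g(tau) = sum_y I_y(tau), with I_y(tau) = clip((tau + u - F_y)/u, 0, 1),
# is piecewise linear and nondecreasing, so g(tau) = alpha*s can be
# solved by locating the crossing interval between breakpoints.
# `fvals` stands in for the sampled values F(x, y).

def smooth_tau(fvals, alpha, u):
    s = len(fvals)
    target = alpha * s
    breaks = sorted({f for f in fvals} | {f - u for f in fvals})

    def g(tau):
        return sum(max(min((tau + u - f) / u, 1.0), 0.0) for f in fvals)

    lo = breaks[0]
    for hi in breaks[1:]:
        if g(hi) >= target:            # the crossing lies in [lo, hi]
            glo, ghi = g(lo), g(hi)
            if ghi == glo:             # flat segment: any point works
                return lo
            # g is linear on [lo, hi]: interpolate to hit target exactly
            return lo + (target - glo) * (hi - lo) / (ghi - glo)
        lo = hi
    return breaks[-1]

tau = smooth_tau([1.0, 2.0, 3.0, 4.0], alpha=0.5, u=0.1)
print(round(tau, 3))   # 2.0: here g(2.0) = 2 = alpha * s
```

This mirrors the claimed O(s·T₁) cost: each breakpoint interval needs only constant extra work beyond evaluating the sampled objective values once.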
This follows since g is a sum of functions (the I_y) which share these properties. SMOOTHTAU (Algorithm 1) uses these properties as follows. The breakpoints of g are F(x, y) and F(x, y) − u for each y ∈ Y. Let the breakpoints B = {b₁...b_{2s}} be sorted in ascending order. We can find the τ such that g(τ) = αs by first finding the interval such that g(b_i) ≤ αs ≤ g(b_{i+1}). Within this interval, g is linear, and hence we can solve exactly for the desired point. This process takes time O(s T₁).

## Theoretical analysis

We now prove that, with appropriate choices of the smoothing parameter u and the number of steps K, RASCAL efficiently obtains a provably approximate solution. Our main theoretical result is as follows:

**Theorem 1.** For any ϵ > 0, by taking u = ϵ/(3(1 + 1/α)), RASCAL outputs a solution x ∈ P satisfying CVaR_α(x) ≥ (1 − 1/e)OPT − ϵ with probability at least 1 − δ. There are K = O(L₂d²/(α²ϵ²)) iterations, requiring O(sK) total evaluations of F, O(sK) evaluations of ∇F, and K calls to LO.

The rest of this section is devoted to proving Theorem 1. We start by introducing a surrogate objective that we consider for the sake of analysis. Let

H̃(x, τ) = (1/u) ∫₀ᵘ H(x, τ + z) dz.

This is the smoothed version of the objective, for which SMOOTHGRAD computes the gradient. Let τ̃(x) = arg max_τ H̃(x, τ) be the optimal setting of τ under x. Note that this is with respect to the smoothed objective H̃, so τ̃(x) is not necessarily VaR_α(x). We first show that H and H̃ are close:

**Lemma 3.** |H̃(x, τ) − H(x, τ)| ≤ u(1 + 1/α)/2 for all x, τ.

The main idea is to show that H is Lipschitz with respect to τ, so we do not change the value of the function too much by changing τ slightly. This lemma essentially bounds the bias introduced by SMOOTHGRAD. Now we turn to the main step: showing that the coordinate-ascent strategy makes an appropriate amount of progress towards the optimum at each iteration. Note that at the end of each iteration k, RASCAL sets τ_k ← τ̃(x^k).
This is because SMOOTHTAU exactly computes the optimal setting of τ with respect to the smoothed objective H̃. Let x̃* = arg max_{x∈P} H̃(x, τ̃(x)) be the point achieving the optimal value of H̃. Since RASCAL always sets τ to its optimal value via SMOOTHTAU, the gap from optimality at the end of iteration k is exactly

Δ_k := H̃(x̃*, τ̃(x̃*)) − H̃(x^k, τ̃(x^k)).

Our aim is to show that the gap Δ_k decreases by a factor of (1 − γ_k) at each iteration (up to a small amount of additive loss). We start out by providing an upper bound on Δ_k in terms of the current gradient.

**Lemma 4.** At each iteration k = 1...K,

H̃(x̃*, τ̃(x̃*)) − H̃(x^k, τ̃(x^k)) ≤ max_{v∈P} ⟨∇_x H̃(x^k, τ̃(x^k)), v⟩.

The proof uses the underlying up-concavity of F combined with the concavity-preserving properties of CVaR. The intuition is that any concave function is upper bounded by its linearization at a given point (though the bound is weaker than for concave functions (Lacoste-Julien and Jaggi 2015) because F is only up-concave). Lemma 4 gives us a benchmark for tracking progress: it suffices to show that the improvement in iteration k is at least γ_k max_{v∈P} ⟨∇_x H̃(x^k, τ̃(x^k)), v⟩, since this implies that we make up at least a γ_k fraction of the current gap from optimality.

We now express the actual improvement that is made. At iteration k, the Frank-Wolfe update moves from x^{k−1} to x^{k−1} + γ_k v^k. Integrating over the transition between these two points gives

H̃(x^k, τ̃(x^k)) − H̃(x^{k−1}, τ̃(x^{k−1})) = ∫₀¹ ⟨∇_x H̃(x^{k−1} + ξγ_k v^k, τ̃(x^{k−1} + ξγ_k v^k)), γ_k v^k⟩ dξ.   (2)

What we would like is for the gradient to stay relatively constant as we move from x^{k−1} to x^{k−1} + γ_k v^k. This is because we chose v^k to lie in the direction of ∇_x H̃ at the starting point x^{k−1}. If the gradient changes very sharply along the way, then we may not actually improve the objective value very much. There are two obstacles to showing that the gradient is smooth enough. The first is that the value of τ̃ in Equation 2 may change with ξ. We can deal with this as follows.
Note that since v^k is nonnegative, x^{k−1} + ξγ_k v^k ≥ x^{k−1} holds for all ξ ∈ [0, 1]. It is easy to see that τ̃(x) is monotone increasing in x. Thus, τ̃(x^{k−1} + ξγ_k v^k) ≥ τ̃(x^{k−1}). By inspecting the expression for ∇_x H̃(x, τ), we can see that if we increase the value of τ, the gradient can only increase, because more scenarios can contribute. Formally,

**Lemma 5.** If x₂ ≥ x₁, then ∇_x H̃(x₂, τ̃(x₂)) ≥ ∇_x H̃(x₂, τ̃(x₁)).

Applying Lemma 5 to Equation 2 gives

H̃(x^k, τ̃(x^k)) − H̃(x^{k−1}, τ̃(x^{k−1})) ≥ ∫₀¹ ⟨∇_x H̃(x^{k−1} + ξγ_k v^k, τ̃(x^{k−1})), γ_k v^k⟩ dξ.

The second obstacle is that ∇_x H̃ might change sharply as we vary x from x^{k−1} to x^{k−1} + γ_k v^k. However, this is exactly what SMOOTHGRAD is designed to avoid. Formally, the gradient of H̃ is Lipschitz:

**Lemma 6.** If for all y ∈ Y, F(·, y) is L₁-Lipschitz and ∇_x F(·, y) is L₂-Lipschitz with ||∇_x F||₂ ≤ G, then ∇_x H̃ is (1/α)(L₂ + L₁G/u)-Lipschitz.

This gives us the tools to finish the proof. Let C = (1/α)(L₂ + L₁G/u) be the Lipschitz constant of ∇_x H̃. The Cauchy–Schwarz inequality and Lemma 6 yield

⟨∇_x H̃(x^{k−1} + ξγ_k v^k, τ̃(x^{k−1})), v^k⟩ ≥ ⟨∇_x H̃(x^{k−1}, τ̃(x^{k−1})), v^k⟩ − ξγ_k C||v^k||₂²

and hence

H̃(x^k, τ̃(x^k)) − H̃(x^{k−1}, τ̃(x^{k−1})) ≥ ∫₀¹ γ_k⟨∇_x H̃(x^{k−1}, τ̃(x^{k−1})), v^k⟩ − ξγ_k²C||v^k||₂² dξ
= γ_k⟨∇_x H̃(x^{k−1}, τ̃(x^{k−1})), v^k⟩ − γ_k²C||v^k||₂²/2
≥ γ_kΔ_{k−1} − γ_k²Cd²/2,

and by rearranging we obtain

Δ_k ≤ (1 − γ_k)Δ_{k−1} + γ_k²Cd²/2.

This is exactly what we wanted to show: the gap shrinks by a factor of (1 − γ_k) each iteration, up to a small amount of additive loss. From here, the proof proceeds by fairly standard arguments which may be found in the supplement.

## Discrete portfolio optimization

We may also want to optimize the CVaR of a submodular set function, as opposed to the continuous functions that we have dealt with so far. We study the portfolio optimization problem (Ohsaka and Yoshida 2017), where the decision maker may select any distribution over feasible sets. Equivalently, they select a decision which is a convex combination of feasible decisions but which is not guaranteed to lie in the original feasible set itself (Chen et al. 2017).
This is a natural setting for CVaR optimization because the decision maker essentially hedges their bets among multiple options. Formally, we are given a collection of submodular set functions f(·, y) on a ground set X, where y is a random variable. There is a collection of feasible sets I. For instance, I could be all size-k subsets; in general, our algorithm works when I is any matroid. The algorithm selects a distribution q over the sets in I. The objective is to maximize CVaR_α(Σ_{S∈I} q_S f(S, y)). We provide a black-box reduction from this problem to the continuous submodular CVaR optimization problem considered earlier. Since RASCAL solves the continuous problem, we immediately obtain efficient algorithms for a range of portfolio problems. Formally,

**Theorem 2.** Given access to an α-approximation algorithm for the continuous CVaR problem, there is an algorithm which obtains value at least αOPT − ϵ for the discrete portfolio CVaR problem.

A proof is deferred to the supplement. The main idea is to translate from the discrete to the continuous setting via the multilinear extension (Calinescu et al. 2011). The multilinear extension F of a submodular set function f is a continuous function defined on the hypercube [0, 1]^{|X|} which agrees with f at the vertices. We apply the promised continuous CVaR algorithm to the multilinear extensions F(·, y) and then use known rounding techniques (Chekuri, Vondrák, and Zenklusen 2010) to convert the fractional solution to a distribution over integral points which preserves the fractional solution's CVaR value. However, some additional technical steps are needed to make this strategy work (e.g., we need to maintain multiple copies of the decision variables to get the optimal approximation ratio). We note that this result strengthens that of Ohsaka and Yoshida (2017) in two respects. First, their result applies only to influence maximization, while ours applies to any submodular function.
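The multilinear extension at a point x is the expected value of f on a random set that includes each element i independently with probability x_i, and it is commonly evaluated by sampling. A minimal sketch (illustrative names; the coverage function is a stand-in, not from the paper):

```python
import random

def coverage(S, sets):
    """Toy submodular set function: number of elements covered."""
    covered = set()
    for i in S:
        covered |= sets[i]
    return len(covered)

def multilinear(x, f, trials=20000, seed=0):
    """Monte Carlo estimate of F(x) = E[f(S)], where S contains each
    element i independently with probability x_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        S = [i for i, xi in enumerate(x) if rng.random() < xi]
        total += f(S)
    return total / trials

sets = [{1, 2, 3}, {3, 4}, {5}]
f = lambda S: coverage(S, sets)
print(multilinear([1.0, 0.0, 0.0], f))   # 3.0: agrees with f at a vertex
est = multilinear([0.5, 0.5, 0.5], f)
print(round(est, 2))                     # close to the exact value 2.75
```

At fractional points the estimator is unbiased, which is what lets gradient-based continuous algorithms like RASCAL operate on the extension before rounding back to sets.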
Second, they obtain the additive approximation OPT − 1/e when the objective values are rescaled by n (the total number of nodes in the graph for influence maximization) to the interval [0, 1]. Hence, their bound becomes vacuous when OPT ≤ n/e, which is very possible since CVaR counts worst-case outcomes. We incur only an arbitrarily small ϵ of additive loss, which allows for stronger guarantees when OPT is small.

## Experiments

We show experimental results for the sensor resource allocation problem, where the goal is to use a limited sensing budget to quickly detect a contagion spreading through a network (Leskovec et al. 2007; Soma and Yoshida 2015; Bian et al. 2017). We are given a graph G = (V, E) with |V| = n. A contagion starts at a random node y and spreads over time according to a given stochastic process. Let t_v be the time at which the contagion reaches node v; t_v is a random variable which depends on both the source node y and the stochastic contagion process. The vector t collects t_v for all v ∈ V. We assume that t_v < ∞ for all v ∈ V (every node is eventually reached); if this does not hold, we can cut the process off after some large time horizon. Let t̄ be the maximum possible value of t_v.

The decision maker has a budget B (e.g., energy) to spend on sensing resources; x_v represents the amount of energy allocated to the sensor at node v. When the contagion reaches v at time t_v, the sensor detects it with probability 1 − (1 − p)^{x_v} and otherwise fails. Essentially, investing an extra unit of energy in sensor v buys an extra chance to detect the contagion with probability p. Fix a vector of times t, and order the nodes v₁...v_n so that t_{v₁} ≤ t_{v₂} ≤ ... ≤ t_{v_n}. The objective F for source y is the expected amount of detection time saved by the sensor placements:

F(x, t) = Σ_{i=1}^n (t̄ − t_{v_i}) (1 − (1 − p)^{x_{v_i}}) Π_{j<i} (1 − p)^{x_{v_j}}.
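For a fixed time vector t, this objective can be implemented directly: visit sensors in order of arrival time and accumulate each sensor's expected savings, weighted by the probability that all earlier sensors failed. A sketch with hypothetical names, not the paper's code:

```python
def detection_savings(x, t, p, t_max):
    """Expected detection time saved: sum over sensors, in order of
    contagion arrival, of (t_max - t_v) * P(v is the first detector)."""
    order = sorted(range(len(t)), key=lambda v: t[v])
    prob_all_failed = 1.0   # probability no earlier sensor detected
    total = 0.0
    for v in order:
        detect_here = 1.0 - (1.0 - p) ** x[v]
        total += (t_max - t[v]) * detect_here * prob_all_failed
        prob_all_failed *= (1.0 - p) ** x[v]
    return total

t = [1.0, 3.0, 2.0]     # contagion reaches node v at time t[v]
x = [2.0, 0.0, 1.0]     # energy allocated to each sensor
print(detection_savings(x, t, p=0.5, t_max=10.0))   # 7.75
```

Each coordinate enters through 1 − (1 − p)^{x_v}, a concave increasing function of x_v, which is the diminishing-returns structure the continuous submodular framework requires.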