# Risk-Sensitive Submodular Optimization

Bryan Wilder
Department of Computer Science and Center for Artificial Intelligence in Society, University of Southern California
bwilder@usc.edu

## Abstract

The conditional value at risk (CVaR) is a popular risk measure which enables risk-averse decision making under uncertainty. We consider maximizing the CVaR of a continuous submodular function, an extension of submodular set functions to a continuous domain. One example application is allocating a continuous amount of energy to each sensor in a network, with the goal of detecting intrusion or contamination. Previous work allows maximization of the CVaR of a linear or concave function. Continuous submodularity represents a natural class of nonconcave functions with diminishing returns, to which existing techniques do not apply. We give a (1 − 1/e)-approximation algorithm for maximizing the CVaR of a monotone continuous submodular function. This also yields an algorithm for submodular set functions which produces a distribution over feasible sets with guaranteed CVaR. Experimental results in two sensor placement domains confirm that our algorithm substantially outperforms competitive baselines.

## Introduction

Decision-making under uncertainty is a ubiquitous problem. Suppose we want to maximize a function F(x, y), where x is a vector of decision variables and y is a random variable drawn from a distribution D. A natural approach is to maximize E_y[F(x, y)], i.e., the expected value of the chosen decision. However, decision makers are often risk-averse: they would rather minimize the chance of a very low reward than focus purely on the average. This is rational behavior when failure can have large consequences. For instance, if a corporation suffers a disastrous loss, it may simply go out of business. In many other cases, low performance entails safety issues.
For instance, if a sensor network for water contamination detects problems instantly in 80% of cases but fails entirely in the other 20%, the population will inevitably be exposed to an unacceptable health risk. It is much better to have a sensor network which always detects contaminants, even if detection takes somewhat more time on average. Hence, it is natural to move beyond average-case analysis and optimize a risk-aware objective function. One widespread choice is the conditional value at risk (CVaR).

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

CVaR takes a tunable parameter α. Roughly, it measures the performance of a decision in the worst α fraction of scenarios. It is known that when the objective F is concave, CVaR can be optimized via a concave program as well. However, many natural objective functions are not concave, and no general algorithms are known for nonconcave functions. We focus on submodular functions. Submodularity captures diminishing returns and appears in application domains ranging from viral marketing (Kempe, Kleinberg, and Tardos 2003), to machine learning (Kulesza and Taskar 2012), to auction theory (Vondrák 2008). We analyze submodular functions in two settings:

**Continuous:** Continuous submodularity, which has lately received increasing attention (Bach 2015; Bian et al. 2017; Staib and Jegelka 2017), generalizes the notion of a submodular set function to continuous domains. Many well-known discrete problems (e.g., sensor placement, influence maximization, or facility location) admit natural extensions where resources are divided in a continuous manner. Continuous submodular functions have also been extensively studied in economics as a model of diminishing returns or strategic substitutes (Koçkesen, Ok, and Sethi 2000; Sampson 2016). Our main result is a (1 − 1/e)-approximation algorithm for maximizing the CVaR of any monotone, continuous submodular function.
No algorithm was previously known for this problem.

**Portfolio of discrete sets:** Our results for continuous submodular functions also transfer to set functions. We study a setting where the algorithm can select a distribution over feasible sets, which is of interest when the aim is to select a portfolio of sets to hedge against risk (Ohsaka and Yoshida 2017). Similar settings have also been studied in robust submodular optimization (Krause, Roper, and Golovin 2011; Chen et al. 2017; Wilder 2017). We give a black-box reduction from the discrete portfolio problem to CVaR optimization of continuous submodular functions, allowing us to apply our algorithm for the continuous problem. The state of the art for the discrete portfolio setting is an algorithm by Ohsaka and Yoshida (2017) for CVaR influence maximization. Our results are stronger in two ways: (i) they apply to any submodular function, and (ii) they give a stronger approximation guarantee. Allowing the algorithm to select a convex combination of sets is provably necessary: Maehara (2015) proved that, restricted to single sets, it is NP-hard to compute any multiplicative approximation to the CVaR of a submodular set function.

The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

We experimentally evaluate our algorithm for sensor resource allocation, focusing on two domains: detecting contagion or rumors in social networks, and detecting contamination in water networks. In both cases, our algorithm substantially outperforms baselines.

## Problem description

In this section, we formally define continuous submodularity and the conditional value at risk. We first study the continuous setting and then extend our results to discrete portfolio optimization.

**Continuous submodularity:** Let X = ∏_{i=1}^n X_i be a subset of ℝⁿ, where each X_i is a compact subset of ℝ. A twice-differentiable function F : X → ℝ is diminishing returns submodular (DR-submodular) if for all x ∈ X and all i, j = 1...n, ∂²F(x)/∂x_i∂x_j ≤ 0 (Bian et al. 2017). Intuitively, the gradient of F only shrinks as x grows, just as the marginal gains of a submodular set function only decrease as items are added. Continuous submodular functions need not be convex or concave (concavity requires that the Hessian be negative semi-definite, not merely that its individual entries be nonpositive). We consider monotone functions, where F(x) ≥ F(y) whenever x ≥ y (≥ denotes element-wise inequality). We assume that F takes values in [0, M] for some constant M. Without loss of generality, we assume F(0) = 0 (normalization).

In our setting, F is a function of both the decision variables x and a random parameter y. Specifically, we consider functions F(x, y) where F(·, y) is continuous submodular in x for each fixed y. We allow any DR-submodular F which satisfies some standard smoothness conditions. First, we assume that F is L₁-Lipschitz for some constant L₁ (for concreteness, with respect to the ℓ₂ norm¹). Second, we assume that F is twice differentiable with L₂-Lipschitz gradient. Third, we assume that F has bounded gradients, ||∇F||₂ ≤ G. Only the last condition is strictly necessary; our approach can be extended to any F with bounded gradients via known techniques (Duchi, Bartlett, and Wainwright 2012).

**Conditional value at risk:** Intuitively, the CVaR measures performance in the worst α fraction of cases. First, we define the value at risk at level α ∈ [0, 1]:

VaR_α(x) = inf{τ ∈ ℝ : Pr_y[F(x, y) ≤ τ] ≥ α}.

That is, VaR_α(x) is the α-quantile of the random variable F(x, y). CVaR is the expectation of F(x, y), conditioned on it falling into this set of α-worst cases:

CVaR_α(x) = E_y[F(x, y) | F(x, y) ≤ VaR_α(x)].

¹We use the ℓ₂ norm for concreteness. However, our arguments easily generalize to any ℓ_p norm.

CVaR is a more popular risk measure than VaR both because it accounts for the impact of the entire α-tail of the distribution and because it has better mathematical properties (Rockafellar and Uryasev 2000).
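These definitions translate directly into code. The following is a minimal sketch (illustrative names, not from the paper) of the empirical VaR and CVaR of a finite sample of objective values, where lower values are worse because we are maximizing F:

```python
# Empirical VaR and CVaR at level alpha from sampled objective values.
# `values` holds F(x, y_i) for i.i.d. samples y_1...y_s.

def empirical_var(values, alpha):
    """alpha-quantile of the samples: the value below which (at most)
    an alpha fraction of the samples fall."""
    ordered = sorted(values)               # ascending: worst outcomes first
    k = max(1, int(alpha * len(ordered)))  # size of the alpha-tail
    return ordered[k - 1]

def empirical_cvar(values, alpha):
    """Mean of the worst alpha fraction of the samples."""
    ordered = sorted(values)
    k = max(1, int(alpha * len(ordered)))
    return sum(ordered[:k]) / k

vals = [10.0, 2.0, 8.0, 4.0, 6.0, 9.0, 1.0, 7.0, 5.0, 3.0]
print(empirical_var(vals, 0.3))   # 3.0: third-worst of 10 samples
print(empirical_cvar(vals, 0.3))  # 2.0: mean of the worst three (1, 2, 3)
```

Note how CVaR averages over the whole α-tail, whereas VaR only reports its boundary.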
**Optimization problem:** We consider the problem of maximizing CVaR_α(x) over x belonging to some feasible set P. We allow P to be any downward closed polytope. A polytope is downward closed if there is a lower bound ℓ such that x ≥ ℓ for all x ∈ P, and for any y ∈ P, ℓ ≤ x ≤ y implies that x ∈ P. Without loss of generality, we assume that P is entirely nonnegative with ℓ = 0. Otherwise, we can define the translated set P′ = {x − ℓ : x ∈ P} and corresponding function F′(x, y) = F(x + ℓ, y). Let d = max_{x,y∈P} ||x − y||₂ be the diameter of P. We want to solve the problem max_{x∈P} CVaR_α(x).

It is important to note that CVaR_α(x) need not be a smooth DR-submodular function of x. However, we would like to leverage the nice properties of the underlying F. Toward this end, we note that the above problem can be rewritten in a more useful form (Rockafellar and Uryasev 2000). Let [t]⁺ = max(t, 0). Maximizing CVaR_α(x) is equivalent to solving

max_{x∈P, τ∈[0,M]}  τ − (1/α) E[τ − F(x, y)]⁺   (1)

where τ is an auxiliary parameter. For any fixed x, the optimal value of τ is VaR_α(x) (Rockafellar and Uryasev 2000). It is known that when F(·, y) is concave in x, this is a concave optimization problem. However, little is known when F may be nonconcave.

## Related work

CVaR enjoys widespread popularity as a risk measure in many domains, ranging from finance (Mansini, Ogryczak, and Speranza 2007) to electricity production (Yau et al. 2011). More broadly, there is burgeoning interest in methods which move beyond expected performance (Ermon et al. 2011; Yin et al. 2011; Yu and Nikolova 2013; Hoy and Nikolova 2015). Oftentimes, this concern is motivated by safety-critical domains where an algorithm designer must be able to minimize the risk of disastrous events, not just guarantee good results on average. Here, we survey the most closely related work, dealing with CVaR optimization. Rockafellar and Uryasev (2000) introduced CVaR and proposed a linear program for optimizing it.
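The Rockafellar–Uryasev reformulation (Problem 1) can be sanity-checked numerically. The sketch below (illustrative names and sample values, not from the paper) scans the auxiliary parameter τ over a grid and confirms that the objective peaks at the empirical VaR, with the peak value equal to the sampled CVaR in the Rockafellar–Uryasev sense:

```python
# Check the reformulation: for fixed x, the map
#   tau -> tau - (1/alpha) * E[tau - F(x, y)]^+
# peaks at tau = VaR_alpha(x), where it equals CVaR_alpha(x).
# `vals` stands in for sampled objective values F(x, y_i).

def cvar_objective(tau, vals, alpha):
    tail = sum(max(tau - v, 0.0) for v in vals) / len(vals)
    return tau - tail / alpha

vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
alpha = 0.25
grid = [i / 100 for i in range(0, 1001)]   # tau in [0, 10]
best_tau = max(grid, key=lambda t: cvar_objective(t, vals, alpha))
print(best_tau)   # 3.0: the empirical VaR at level 0.25
print(round(cvar_objective(best_tau, vals, alpha), 6))   # 1.8
```

Here the peak value 1.8 = 3 − 4·(2 + 1 + 0)/10 interpolates across the α-tail, which is exactly what makes this form better behaved than the raw quantile.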
This linear program only applies when utility is linear in the decision variables. Iyengar and Ma (2013) and Hong and Liu (2009) present faster gradient-based algorithms for the linear case. Here, we deal with nonlinear functions. The LP approach can be extended, by solving a general concave program, when the utilities are concave. Our main contribution is extending the range of optimizable functions to include nonconcave continuous submodular objectives. Another body of work focuses on CVaR in reinforcement learning and MDPs (Prashanth and Ghavamzadeh 2013; Tamar et al. 2015; Chow et al. 2015). Lastly, Ohsaka and Yoshida (2017) study CVaR for discrete influence maximization; we contrast our results with theirs when we discuss the discrete portfolio setting.

## Preliminaries

We now review techniques for optimizing smooth continuous submodular functions. These do not directly apply to CVaR, but our solution builds on them. An important property is that continuous submodular functions are concave along nonnegative directions. Formally,

**Definition 1.** A function F(x) is up-concave if for any x and any direction y ≥ 0, the function ξ ↦ F(x + ξy) is concave in ξ ∈ [0, 1].

All continuous submodular functions are up-concave (Bian et al. 2017). Monotone up-concave functions can be optimized via a modified Frank-Wolfe algorithm (Bian et al. 2017; Calinescu et al. 2011). Frank-Wolfe is a gradient-based algorithm originally introduced to maximize concave functions. Consider an objective F. Frank-Wolfe algorithms start at an initial point x⁰ ∈ P and then generate a series of feasible solutions x¹...x^K over some number of iterations K. At each step k, the algorithm calculates the gradient at the current point, ∇F(x^{k−1}). It then takes a step towards the point v^k ∈ P which lies furthest in the direction of the gradient. That is, v^k is the solution to the linear optimization problem arg max_{v∈P} ⟨v, ∇F(x^{k−1})⟩.
In the standard Frank-Wolfe algorithm for concave functions, the algorithm then updates to a convex combination of x^{k−1} and v^k by setting x^k = x^{k−1} + γ_k(v^k − x^{k−1}) for some step size γ_k. Note that some entries of x^k may be smaller than the corresponding entries of x^{k−1}. This is necessary for optimality: the algorithm may need to backtrack if it has made some entry too large. This update rule does not work for up-concave functions because the objective is not concave along negative directions. Hence, the update for the modified Frank-Wolfe algorithm is x^k = x^{k−1} + γ_k v^k, which only increases each coordinate. Because the algorithm is unable to backtrack, it achieves a (1 − 1/e)-approximation instead of the global optimum which is achievable for fully concave functions. The process is analogous to the greedy algorithm for submodular set functions, which successively includes elements based on their current marginal gain. The continuous Frank-Wolfe algorithm instead successively increases entries in the solution vector based on the current gradient.

## Algorithmic approach

We now introduce the RASCAL (Risk Averse Submodular optimization via Conditional vALue at risk) algorithm for continuous submodular CVaR optimization. RASCAL solves Problem 1, which is a function of both the decision variables x and the auxiliary parameter τ. Roughly, τ should be understood as a threshold maintained by the algorithm for what constitutes a bad scenario: at each iteration, RASCAL tries to increase F(x, y) for those scenarios y with F(x, y) ≤ τ.

Before describing the optimization algorithm more formally, we deal with the challenge that the expectation in Problem 1 cannot generally be evaluated in closed form. We replace the expectation with the average over a set of sampled scenarios. Suppose that we draw a set of samples y₁...y_s i.i.d. from D; call this set Y. Then we can estimate

E[τ − F(x, y)]⁺ ≈ (1/s) Σ_{y∈Y} [τ − F(x, y)]⁺.

With sufficiently many samples, this approximation will be accurate to any desired level:

**Lemma 1.** Take s = O(nM²/ϵ²) samples and let ĈVaR_α be the empirical CVaR on the samples. Then |ĈVaR_α(x) − CVaR_α(x)| ≤ ϵ/3 holds for all x ∈ P with probability at least 1 − δ.

The proof is in the supplement. As a minor technicality, we assume that F(x, y_i) takes a distinct value for each x and y_i ∈ Y, so that an exact α-quantile exists. This is without loss of generality, since we can always add an arbitrarily small tie-breaking value r_i, using F(x, y_i) + r_i instead.

```
Algorithm 1: RASCAL
Require: K, u, s, LO
 1: Y ← s samples drawn i.i.d. from D
 2: x^0 ← 0, τ ← 0
 3: for k = 1...K do
 4:   g ← SMOOTHGRAD(x^{k−1}, τ, u)
 5:   v ← LO(g)
 6:   x^k ← x^{k−1} + (1/K) v
 7:   τ ← SMOOTHTAU(x^k, u)
 8: end for
 9: return x^K

function SMOOTHGRAD(x, τ, u)
  I_y(τ) ← max(min((τ + u − F(x, y))/u, 1), 0) for each y ∈ Y
  return Σ_{y∈Y} I_y(τ) ∇_x F(x, y)
end function

function SMOOTHTAU(x, u)
  B ← {F(x, y) | y ∈ Y} ∪ {F(x, y) − u | y ∈ Y}, sorted ascending as b₁...b_{2s}
  i* ← min{i = 1...2s : g(b_i) ≥ αs}, where g(τ) = Σ_{y∈Y} I_y(τ)
  C ← {y ∈ Y : F(x, y) ≤ b_{i*−1}}           (scenarios with I_y = 1 on the interval)
  A ← {y ∈ Y : 0 < I_y < 1 on (b_{i*−1}, b_{i*})}   (partially counted scenarios)
  return the τ ∈ [b_{i*−1}, b_{i*}] solving Σ_{y∈A} (τ + u − F(x, y))/u + |C| = αs
end function
```

We can now formally introduce RASCAL (Algorithm 1). RASCAL maximizes the objective

H(x, τ) = τ − (1/(αs)) Σ_{y∈Y} [τ − F(x, y)]⁺.

Maximizing H is equivalent to maximizing the sampled CVaR. RASCAL is a coordinate-ascent-style algorithm. Each iteration first makes a Frank-Wolfe-style update to x (lines 4-6). This step assumes access to a linear optimization oracle LO which maximizes a given linear function over P. RASCAL then sets τ to its optimal value given the current x (line 7). This approach is motivated by the particular properties of H. It can be shown that H is jointly up-concave in (x, τ). However, H is not monotone in τ.
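Under simplifying assumptions that are not from the paper — a toy separable DR-submodular objective, a budget-simplex feasible region whose linear oracle concentrates the whole budget on the best single coordinate, exact gradients, step size γ_k = 1/K, and a plain empirical VaR update in place of SMOOTHTAU — the overall loop can be sketched as:

```python
import math, random

def F(x, y):           # toy monotone DR-submodular objective (illustrative)
    return sum(yi * (1.0 - math.exp(-xi)) for xi, yi in zip(x, y))

def grad_F(x, y):
    return [yi * math.exp(-xi) for xi, yi in zip(x, y)]

def lo_budget(c, budget):  # arg max_{v in P} <v, c> over {v >= 0, sum v <= B}
    v = [0.0] * len(c)
    j = max(range(len(c)), key=lambda i: c[i])
    if c[j] > 0:
        v[j] = budget
    return v

def rascal_sketch(scenarios, n, budget, alpha, u=0.05, K=100):
    s = len(scenarios)
    x, tau = [0.0] * n, 0.0
    for _ in range(K):
        # smoothed gradient: weight each scenario by how far F(x, y)
        # sits below the window [tau, tau + u]
        g = [0.0] * n
        for y in scenarios:
            w = max(min((tau + u - F(x, y)) / u, 1.0), 0.0)
            if w > 0:
                for i, gi in enumerate(grad_F(x, y)):
                    g[i] += w * gi
        v = lo_budget(g, budget)
        x = [xi + vi / K for xi, vi in zip(x, v)]   # monotone FW step
        vals = sorted(F(x, y) for y in scenarios)   # tau <- empirical VaR
        tau = vals[max(1, int(alpha * s)) - 1]
    return x, tau

random.seed(0)
scenarios = [[random.uniform(0.5, 2.0) for _ in range(3)] for _ in range(40)]
x, tau = rascal_sketch(scenarios, n=3, budget=3.0, alpha=0.2)
print([round(xi, 2) for xi in x], round(tau, 3))
```

Because the update only ever adds to x, the final solution stays feasible for the budget constraint by construction, mirroring the no-backtracking Frank-Wolfe step described above.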
Indeed, H is decreasing in τ for τ > VaR_α(x). The Frank-Wolfe algorithm relies crucially on monotonicity; nonmonotonicity is much more difficult to handle. Instead, we exploit a particular form of structure in H. Specifically, H is monotone in x, but only up-concave (not fully concave). Conversely, while H is nonmonotone in τ, we can easily solve the one-dimensional problem max_{τ∈[0,M]} H(x, τ) for any fixed x (we explain how later). Our approach makes use of both properties: the Frank-Wolfe update leverages monotone up-concavity in x, while the update to τ leverages easy solvability of the one-dimensional subproblem.

In order to make this approach work, two ingredients are necessary. First, we need access to the gradient of H in order to implement the Frank-Wolfe update for x. Unfortunately, H is not even differentiable everywhere. We instead present a smoothed estimator, SMOOTHGRAD, which restores differentiability at the cost of introducing a controlled amount of bias. Second, we need to solve the one-dimensional problem of finding the optimal value of τ. We in fact introduce a subroutine, SMOOTHTAU, which solves a smoothed version of the optimal-τ problem.

**Smoothed gradient:** We now calculate the gradient of the objective with respect to x, ∇_x H(x, τ). Essentially, H counts the value of all scenarios y for which F(x, y) ≤ τ. If F(x, y) ≠ τ for all y ∈ Y, then

∇_x H(x, τ) = (1/(αs)) Σ_{y∈Y: F(x,y)≤τ} ∇_x F(x, y).

Unfortunately, if there is a y ∈ Y such that F(x, y) = τ, then H may not be differentiable at x. To see this, consider the directional derivatives from two different directions. From a nonpositive direction, F(x, y) is always below τ and hence counts towards the gradient. From a positive direction, F(x, y) may lie above τ, in which case its contribution is zero. Frank-Wolfe algorithms require differentiability (in fact, they require a Lipschitz gradient).
This is not a minor technical point: if the gradient can change radically over small regions, then gradient-based updates may prove fruitless. Thus, RASCAL uses a smoothed gradient estimate over the window from τ to τ + u for some small u > 0:

SMOOTHGRAD(x, τ) = (1/u) ∫₀ᵘ ∇_x H(x, τ + z) dz.

The intuition is that we average over a small window of τ values so that the contribution of a given scenario to the gradient does not suddenly drop to 0 if x increases slightly. Note that, since we have sampled a finite set of s scenarios, the set of points at which H is not differentiable has measure 0; hence, the integral exists.

We now show how to evaluate the integral exactly (see Algorithm 1 for pseudocode). We have

(1/u) ∫₀ᵘ ∇_x H(x, τ + z) dz = (1/(αs)) Σ_{y∈Y} ∇_x F(x, y) · (1/u) ∫₀ᵘ 1[F(x, y) ≤ τ + z] dz,

where 1[·] is the indicator function. The value of the inner integral is just max(min((τ + u − F(x, y))/u, 1), 0). Call this value I_y(τ). By the above, SMOOTHGRAD(x, τ) = Σ_{y∈Y} I_y(τ) ∇_x F(x, y), where we drop the constant factor 1/(αs) since it does not affect the linear optimization step. This can be computed in time O(s(T₁ + T₂)), where T₁ is the time to evaluate F and T₂ is the time to differentiate it.

**Finding the optimal τ:** The update SMOOTHTAU sets τ to its optimal value over a smoothed window of size u (in order to match SMOOTHGRAD). Specifically, we find the τ maximizing (1/u) ∫₀ᵘ H(x, τ + z) dz. Recall that for the unsmoothed H, the optimal setting of τ is VaR_α(x), i.e., the value such that F takes value at most τ in an α fraction of scenarios. An analogous property holds for the smoothed version:

**Lemma 2.** Define g(τ) = Σ_{y∈Y} I_y(τ). (a) τ maximizes (1/u) ∫₀ᵘ H(x, τ + z) dz if g(τ) = αs. (b) g is piecewise linear and monotone increasing.

In Lemma 2(a), the condition g(τ) = αs expresses that an α fraction of the scenarios, weighted by I_y(τ), should have F(x, y) ≤ τ. The key property for efficiently finding the τ which satisfies this condition is given in Lemma 2(b): g is piecewise linear and monotone increasing in τ.
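Piecewise linearity is what makes the one-dimensional subproblem exactly solvable. The sketch below (illustrative names, following the same smoothing convention as SMOOTHGRAD rather than the paper's exact pseudocode) finds the τ with g(τ) = αs by scanning breakpoints and solving one linear interpolation:

```python
# g(tau) = sum_y I_y(tau), with I_y(tau) = clip((tau + u - F_y)/u, 0, 1),
# is piecewise linear and nondecreasing, so g(tau) = alpha*s can be
# solved by locating the crossing interval between breakpoints.
# `fvals` stands in for the sampled values F(x, y).

def smooth_tau(fvals, alpha, u):
    s = len(fvals)
    target = alpha * s
    breaks = sorted({f for f in fvals} | {f - u for f in fvals})

    def g(tau):
        return sum(max(min((tau + u - f) / u, 1.0), 0.0) for f in fvals)

    lo = breaks[0]
    for hi in breaks[1:]:
        if g(hi) >= target:            # the crossing lies in [lo, hi]
            glo, ghi = g(lo), g(hi)
            if ghi == glo:             # flat segment: any point works
                return lo
            # g is linear on [lo, hi]: interpolate to hit target exactly
            return lo + (target - glo) * (hi - lo) / (ghi - glo)
        lo = hi
    return breaks[-1]

tau = smooth_tau([1.0, 2.0, 3.0, 4.0], alpha=0.5, u=0.1)
print(round(tau, 3))   # 2.0: here g(2.0) = 2 = alpha * s
```

This mirrors the claimed O(s·T₁) cost: each breakpoint interval needs only constant extra work beyond evaluating the sampled objective values once.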
This follows since g is a sum of functions (the I_y) which share these properties. SMOOTHTAU (Algorithm 1) uses these properties as follows. The breakpoints of g are F(x, y) and F(x, y) − u for each y ∈ Y. Let the breakpoints B = {b₁...b_{2s}} be sorted in ascending order. We can find the τ such that g(τ) = αs by first finding the interval such that g(b_i) ≤ αs ≤ g(b_{i+1}). Within this interval, g is linear, and hence we can solve exactly for the desired point. This process takes time O(s T₁).

## Theoretical analysis

We now prove that, with appropriate choices of the smoothing parameter u and the number of steps K, RASCAL efficiently obtains a provably approximate solution. Our main theoretical result is as follows:

**Theorem 1.** For any ϵ > 0, by taking u = ϵ/(3(1 + 1/α)), RASCAL outputs a solution x ∈ P satisfying CVaR_α(x) ≥ (1 − 1/e)OPT − ϵ with probability at least 1 − δ. There are K = O(L₂d²/(α²ϵ²)) iterations, requiring O(sK) total evaluations of F, O(sK) evaluations of ∇F, and K calls to LO.

The rest of this section is devoted to proving Theorem 1. We start by introducing a surrogate objective that we consider for the sake of analysis. Let

H̃(x, τ) = (1/u) ∫₀ᵘ H(x, τ + z) dz.

This is the smoothed version of the objective, for which SMOOTHGRAD computes the gradient. Let τ̃(x) = arg max_τ H̃(x, τ) be the optimal setting of τ under x. Note that this is with respect to the smoothed objective H̃, so τ̃(x) is not necessarily VaR_α(x). We first show that H and H̃ are close:

**Lemma 3.** |H̃(x, τ) − H(x, τ)| ≤ u(1 + 1/α)/2 for all x, τ.

The main idea is to show that H is Lipschitz with respect to τ, so we do not change the value of the function too much by changing τ slightly. This lemma essentially bounds the bias introduced by SMOOTHGRAD. Now we turn to the main step: showing that the coordinate-ascent strategy makes an appropriate amount of progress towards the optimum at each iteration. Note that at the end of each iteration k, RASCAL sets τ_k ← τ̃(x^k).
This is because SMOOTHTAU exactly computes the optimal setting of τ with respect to the smoothed objective H̃. Let x̃* = arg max_{x∈P} H̃(x, τ̃(x)) be the point achieving the optimal value of H̃. Since RASCAL always sets τ to its optimal value via SMOOTHTAU, the gap from optimality at the end of iteration k is exactly

Δ_k := H̃(x̃*, τ̃(x̃*)) − H̃(x^k, τ̃(x^k)).

Our aim is to show that the gap Δ_k decreases by a factor of (1 − γ_k) at each iteration (up to a small amount of additive loss). We start out by providing an upper bound on Δ_k in terms of the current gradient.

**Lemma 4.** At each iteration k = 1...K,

H̃(x̃*, τ̃(x̃*)) − H̃(x^k, τ̃(x^k)) ≤ max_{v∈P} ⟨∇_x H̃(x^k, τ̃(x^k)), v⟩.

The proof uses the underlying up-concavity of F combined with the concavity-preserving properties of CVaR. The intuition is that any concave function is upper bounded by its linearization at a given point (though the bound is weaker than for concave functions (Lacoste-Julien and Jaggi 2015) because F is only up-concave). Lemma 4 gives us a benchmark for tracking progress: it suffices to show that the improvement in iteration k is at least γ_k max_{v∈P} ⟨∇_x H̃(x^k, τ̃(x^k)), v⟩, since this implies that we make up at least a γ_k fraction of the current gap from optimality.

We now express the actual improvement that is made. At iteration k, the Frank-Wolfe update moves from x^{k−1} to x^{k−1} + γ_k v^k. Integrating over the transition between these two points gives

H̃(x^k, τ̃(x^k)) − H̃(x^{k−1}, τ̃(x^{k−1})) = ∫₀¹ ⟨∇_x H̃(x^{k−1} + ξγ_k v^k, τ̃(x^{k−1} + ξγ_k v^k)), γ_k v^k⟩ dξ.   (2)

What we would like is for the gradient to stay relatively constant as we move from x^{k−1} to x^{k−1} + γ_k v^k. This is because we chose v^k to lie in the direction of ∇_x H̃ at the starting point x^{k−1}. If the gradient changes very sharply along the way, then we may not actually improve the objective value very much. There are two obstacles to showing that the gradient is smooth enough. The first is that the value of τ̃ in Equation 2 may change with ξ. We can deal with this as follows.
Note that since v^k is nonnegative, x^{k−1} + ξγ_k v^k ≥ x^{k−1} holds for all ξ ∈ [0, 1]. It is easy to see that τ̃(x) is monotone increasing in x. Thus, τ̃(x^{k−1} + ξγ_k v^k) ≥ τ̃(x^{k−1}). By inspecting the expression for ∇_x H̃(x, τ), we can see that if we increase the value of τ, the gradient can only increase, because more scenarios can contribute. Formally,

**Lemma 5.** If x₂ ≥ x₁, then ∇_x H̃(x₂, τ̃(x₂)) ≥ ∇_x H̃(x₂, τ̃(x₁)).

Applying Lemma 5 to Equation 2 gives

H̃(x^k, τ̃(x^k)) − H̃(x^{k−1}, τ̃(x^{k−1})) ≥ ∫₀¹ ⟨∇_x H̃(x^{k−1} + ξγ_k v^k, τ̃(x^{k−1})), γ_k v^k⟩ dξ.

The second obstacle is that ∇_x H̃ might change sharply as we vary x from x^{k−1} to x^{k−1} + γ_k v^k. However, this is exactly what SMOOTHGRAD is designed to avoid. Formally, the gradient of H̃ is Lipschitz:

**Lemma 6.** If for all y ∈ Y, F(·, y) is L₁-Lipschitz and ∇_x F(·, y) is L₂-Lipschitz with ||∇_x F||₂ ≤ G, then ∇_x H̃ is (1/α)(L₂ + L₁G/u)-Lipschitz.

This gives us the tools to finish the proof. Let C = (1/α)(L₂ + L₁G/u) be the Lipschitz constant of ∇_x H̃. The Cauchy–Schwarz inequality and Lemma 6 yield

⟨∇_x H̃(x^{k−1} + ξγ_k v^k, τ̃(x^{k−1})), v^k⟩ ≥ ⟨∇_x H̃(x^{k−1}, τ̃(x^{k−1})), v^k⟩ − ξγ_k C||v^k||₂²

and hence

H̃(x^k, τ̃(x^k)) − H̃(x^{k−1}, τ̃(x^{k−1})) ≥ ∫₀¹ γ_k⟨∇_x H̃(x^{k−1}, τ̃(x^{k−1})), v^k⟩ − ξγ_k²C||v^k||₂² dξ
= γ_k⟨∇_x H̃(x^{k−1}, τ̃(x^{k−1})), v^k⟩ − γ_k²C||v^k||₂²/2
≥ γ_kΔ_{k−1} − γ_k²Cd²/2,

and by rearranging we obtain

Δ_k ≤ (1 − γ_k)Δ_{k−1} + γ_k²Cd²/2.

This is exactly what we wanted to show: the gap shrinks by a factor of (1 − γ_k) each iteration, up to a small amount of additive loss. From here, the proof proceeds by fairly standard arguments which may be found in the supplement.

## Discrete portfolio optimization

We may also want to optimize the CVaR of a submodular set function, as opposed to the continuous functions that we have dealt with so far. We study the portfolio optimization problem (Ohsaka and Yoshida 2017), where the decision maker may select any distribution over feasible sets. Equivalently, they select a decision which is a convex combination of feasible decisions but which is not guaranteed to lie in the original feasible set itself (Chen et al. 2017).
This is a natural setting for CVaR optimization because the decision maker essentially hedges their bets among multiple options. Formally, we are given a collection of submodular set functions f(·, y) on a ground set X, where y is a random variable. There is a collection of feasible sets I. For instance, I could be all size-k subsets; in general, our algorithm works when I is any matroid. The algorithm selects a distribution q over the sets in I. The objective is to maximize CVaR_α(Σ_{S∈I} q_S f(S, y)). We provide a black-box reduction from this problem to the continuous submodular CVaR optimization problem considered earlier. Since RASCAL solves the continuous problem, we immediately obtain efficient algorithms for a range of portfolio problems. Formally,

**Theorem 2.** Given access to an α-approximation algorithm for the continuous CVaR problem, there is an algorithm which obtains value at least αOPT − ϵ for the discrete portfolio CVaR problem.

A proof is deferred to the supplement. The main idea is to translate from the discrete to the continuous setting via the multilinear extension (Calinescu et al. 2011). The multilinear extension F of a submodular set function f is a continuous function defined on the hypercube [0, 1]^{|X|} which agrees with f at the vertices. We apply the promised continuous CVaR algorithm to the multilinear extensions F(·, y) and then use known rounding techniques (Chekuri, Vondrák, and Zenklusen 2010) to convert the fractional solution to a distribution over integral points which preserves the fractional solution's CVaR value. However, some additional technical steps are needed to make this strategy work (e.g., we need to maintain multiple copies of the decision variables to get the optimal approximation ratio). We note that this result strengthens that of Ohsaka and Yoshida (2017) in two respects. First, their result applies only to influence maximization, while ours applies to any submodular function.
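The multilinear extension at a point x is the expected value of f on a random set that includes each element i independently with probability x_i, and it is commonly evaluated by sampling. A minimal sketch (illustrative names; the coverage function is a stand-in, not from the paper):

```python
import random

def coverage(S, sets):
    """Toy submodular set function: number of elements covered."""
    covered = set()
    for i in S:
        covered |= sets[i]
    return len(covered)

def multilinear(x, f, trials=20000, seed=0):
    """Monte Carlo estimate of F(x) = E[f(S)], where S contains each
    element i independently with probability x_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        S = [i for i, xi in enumerate(x) if rng.random() < xi]
        total += f(S)
    return total / trials

sets = [{1, 2, 3}, {3, 4}, {5}]
f = lambda S: coverage(S, sets)
print(multilinear([1.0, 0.0, 0.0], f))   # 3.0: agrees with f at a vertex
est = multilinear([0.5, 0.5, 0.5], f)
print(round(est, 2))                     # close to the exact value 2.75
```

At fractional points the estimator is unbiased, which is what lets gradient-based continuous algorithms like RASCAL operate on the extension before rounding back to sets.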
Second, they obtain the additive approximation OPT − 1/e when the objective values are rescaled by n (the total number of nodes in the graph for influence maximization) to the interval [0, 1]. Hence, their bound becomes vacuous when OPT ≤ n/e, which is very possible since CVaR counts worst-case outcomes. We incur only an arbitrarily small ϵ of additive loss, which allows for stronger guarantees when OPT is small.

## Experiments

We show experimental results for the sensor resource allocation problem, where the goal is to use a limited sensing budget to quickly detect a contagion spreading through a network (Leskovec et al. 2007; Soma and Yoshida 2015; Bian et al. 2017). We are given a graph G = (V, E) with |V| = n. A contagion starts at a random node y and spreads over time according to a given stochastic process. Let t_v be the time at which the contagion reaches node v; t_v is a random variable which depends on both the source node y and the stochastic contagion process. The vector t collects t_v for all v ∈ V. We assume that t_v < ∞ for all v ∈ V (every node is eventually reached); if this does not hold, we can cut the process off after some large time horizon. Let t̄ be the maximum possible value of t_v.

The decision maker has a budget B (e.g., energy) to spend on sensing resources; x_v represents the amount of energy allocated to the sensor at node v. When the contagion reaches v at time t_v, the sensor detects it with probability 1 − (1 − p)^{x_v} and otherwise fails. Essentially, investing an extra unit of energy in sensor v buys an extra chance to detect the contagion with probability p. Fix a vector of times t, and order the nodes v₁...v_n so that t_{v₁} ≤ t_{v₂} ≤ ... ≤ t_{v_n}. The objective F for source y is the expected amount of detection time saved by the sensor placements:

F(x, t) = Σ_{i=1}^n (t̄ − t_{v_i}) (1 − (1 − p)^{x_{v_i}}) Π_{j<i} (1 − p)^{x_{v_j}}.
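For a fixed time vector t, this objective can be implemented directly: visit sensors in order of arrival time and accumulate each sensor's expected savings, weighted by the probability that all earlier sensors failed. A sketch with hypothetical names, not the paper's code:

```python
def detection_savings(x, t, p, t_max):
    """Expected detection time saved: sum over sensors, in order of
    contagion arrival, of (t_max - t_v) * P(v is the first detector)."""
    order = sorted(range(len(t)), key=lambda v: t[v])
    prob_all_failed = 1.0   # probability no earlier sensor detected
    total = 0.0
    for v in order:
        detect_here = 1.0 - (1.0 - p) ** x[v]
        total += (t_max - t[v]) * detect_here * prob_all_failed
        prob_all_failed *= (1.0 - p) ** x[v]
    return total

t = [1.0, 3.0, 2.0]     # contagion reaches node v at time t[v]
x = [2.0, 0.0, 1.0]     # energy allocated to each sensor
print(detection_savings(x, t, p=0.5, t_max=10.0))   # 7.75
```

Each coordinate enters through 1 − (1 − p)^{x_v}, a concave increasing function of x_v, which is the diminishing-returns structure the continuous submodular framework requires.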