# online_submodular_maximization_via_online_convex_optimization__edb7ac65.pdf Online Submodular Maximization via Online Convex Optimization Tareq Si Salem1, G ozde Ozcan1, Iasonas Nikolaou2, Evimaria Terzi2, Stratis Ioannidis1 1Northeastern University, Boston, MA, USA 2Boston University, Boston, MA, USA 1{sisalem, gozcan, ioannidis}@ece.neu.edu, 2{nikolaou, evimaria}@bu.edu We study monotone submodular maximization under general matroid constraints in the online setting. We prove that online optimization of a large class of submodular functions, namely, weighted threshold potential functions, reduces to online convex optimization (OCO). This is precisely because functions in this class admit a concave relaxation; as a result, OCO policies, coupled with an appropriate rounding scheme, can be used to achieve sublinear regret in the combinatorial setting. We show that our reduction extends to many different versions of the online learning problem, including the dynamic regret, bandit, and optimistic-learning settings. Introduction In online submodular optimization (OSM) (Streeter, Golovin, and Krause 2009), submodular reward functions chosen by an adversary are revealed over several rounds. In each round, a decision maker first commits to a set satisfying, e.g., matroid constraints. Subsequently, the reward function is revealed and evaluated over this set. The objective is to minimize α-regret, i.e., the difference of the cumulative reward accrued from the one attained by a static set, selected by an α-approximation algorithm operating in hindsight. OSM has received considerable interest recently, both in the full information (Harvey, Liaw, and Soma 2020; Matsuoka, Ito, and Ohsaka 2021; Streeter, Golovin, and Krause 2009; Niazadeh et al. 2021) and bandit setting (Niazadeh et al. 2021; Matsuoka, Ito, and Ohsaka 2021; Wan et al. 2023), where only the reward values (rather than the entire functions) are revealed. Online convex optimization (OCO) studies a similar online setting in which reward functions are concave, and decisions are selected from a compact convex set. First proposed by Zinkevich (2003), who showed that projected gradient ascent attains sublinear regret, OCO generalizes previous online problems like prediction with expert advice (Littlestone and Warmuth 1994), and has become widely influential in the learning community (Hazan 2016; Shalev-Shwartz 2012; Mc Mahan 2017). Its success is evident from the multitude of OCO variants in literature: in dynamic regret OCO (Zinkevich 2003; Besbes, Gur, and Zeevi 2015; Jadbabaie et al. 2015; Mokhtari et al. 2016), the regret is evaluated w.r.t. an Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. optimal comparator sequence instead of an optimal static decision in hindsight. Optimistic OCO (Rakhlin and Sridharan 2013; Dekel et al. 2017; Mohri and Yang 2016) takes advantage of benign sequences, in which reward functions are predictable: the decision maker attains tighter regret bounds when predictions are correct, falling back to the existing OCO regret guarantees when predictions are unavailable or inaccurate. Bandit OCO algorithms (Hazan and Levy 2014; Kleinberg 2004; Flaxman, Kalai, and Mc Mahan 2005) study the aforementioned bandit setting, where again only rewards (i.e., function evaluations) are observed. We make the following contributions: We provide a methodology for reducing OSM to OCO, when submodular functions selected by the adversary are bounded from above and below by concave relaxations and coupled with an opportune rounding. We prove that algorithms and regret guarantees in the OCO setting transfer to α-regret guarantees in the OSM setting, via a transformation that we introduce. Ratio α is determined by how well concave relaxations approximate the original submodular functions. We show that the above condition is satisfied by a wide class of submodular functions, namely, weighted threshold potential (WTP) functions. This class strictly generalizes weighted coverage functions (Karimi et al. 2017; Stobbe and Krause 2010), and includes many important applications, including influence maximization (Kempe, Kleinberg, and Tardos 2003), facility location (Krause and Golovin 2014; Frieze 1974), cache networks (Ioannidis and Yeh 2016; Li et al. 2021), similarity caching (Si Salem, Neglia, and Carra 2022), demand forecasting (Ito and Fujimaki 2016), and team formation (Li et al. 2018), to name a few. We show our reduction also extends to the dynamic regret and optimistic settings, reducing such fullinformation OSM settings to the respective OCO variants. To the best of our knowledge, our resulting algorithms are the first to come with guarantees for the dynamic regret and optimistic settings in the context of OSM with general matroid constraints. Finally, we also provide a different reduction for the bandit setting, again for all three (static, dynamic regret, and optimistic) variants; this reduction applies to general submodular functions, but is restricted to partition matroids. The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) Related Work Offline Submodular Maximization and Relaxations of Submodular Functions. Continuous relaxations of submodular functions play a prominent role in submodular maximization. The so-called continuous greedy algorithm (Calinescu et al. 2011) maximizes the multilinear relaxation of a submodular objective over the convex hull of a matroid, using a variant of the Frank-Wolfe algorithm (Frank and Wolfe 1956). The fractional solution is then rounded via pipage (Ageev and Sviridenko 2004) or swap rounding (Chekuri, Vondr ak, and Zenklusen 2010), which we also use. The multilinear relaxation is not convex but is continuous DR-submodular (Bach 2019; Bian et al. 2017), and continuous greedy comes with a 1 1 e approximation guarantee. However, the multilinear relaxation is generally not tractable and is usually estimated via sampling. Ever since the seminal paper by Ageev and Sviridenko (Ageev and Sviridenko 2004), several works have exploited the existence of concave relaxations of weighted coverage functions (e.g., (Karimi et al. 2017; Ioannidis and Yeh 2016)), a strict subset of the threshold potential functions we consider here. For coverage functions, a version of our sandwich property (Asm. 2) follows directly by the Goemans & Williamson inequality (Goemans and Williamson 1994), which we also use. Both in the standard (Ageev and Sviridenko 2004) and stochastic offline (Karimi et al. 2017; Ioannidis and Yeh 2016) submodular maximization setting, in which the objective is randomized, exploiting concave relaxations of coverage functions yields significant computational dividends, as it eschews any sampling required for estimating the multilinear relaxation. We depart from both by considering a much broader class than coverage functions and studying the online/no-regret setting. OSM via Regret Minimization. Several online algorithms have been proposed for maximizing general submodular functions (Niazadeh et al. 2021; Harvey, Liaw, and Soma 2020; Matsuoka, Ito, and Ohsaka 2021; Streeter, Golovin, and Krause 2009) under different matroid constraints. There has also been recent work (Chen, Hassani, and Karbasi 2018; Zhang et al. 2019, 2022) on the online maximization of continuous DR-submodular functions (Bach 2019). Proposed algorithms are applicable to our setting, because the multilinear relaxation is DR-submodular, and guarantees can be extended to matroid constraint sets again through rounding (Chekuri, Vondr ak, and Zenklusen 2010), akin to the approach we follow here. Also pertinent is the work by Kakade, Kalai, and Ligett (2007): their proposed online algorithm operates over reward functions that can be decomposed as the weighted sum of finitely many (non-parametric) reference functions termed Linearly Weighted Decomposable (LWD); the adversary selects only the weights. Applied to OSM, this is a more restrictive function class than the ones we study. We compare these algorithms to our work in Table 1 in (Si Salem et al. 2023). In the full information setting, the OSM algorithm by Harvey, Liaw, and Soma (2020) has a slightly tighter α-regret than us, but also a much higher computational complexity. We attain the same or better regret than DR-S (Chen, Hassani, and Karbasi 2018; Zhang et al. 2019, 2022) and remaining algorithms that either operate on restricted constraint sets (Niazadeh et al. 2021; Matsuoka, Ito, and Ohsaka 2021; Streeter, Golovin, and Krause 2009) or on the much more restrictive LWD class (Kakade, Kalai, and Ligett 2007). Most importantly, our work generalizes to the dynamic and optimistic settings. With the exception of Matsuoka, Ito, and Ohsaka (2021), who study dynamic regret restricted to uniform matroids, our work provides the first guarantees for OSM in the dynamic and optimistic settings under general matroid constraints. OSM in the Bandit Setting. Our reduction to OCO in the bandit setting extends the analysis by Wan et al. (2023), who provide a reduction to just FTRL in the static setting, under general submodular functions and partition matroid constraints. We generalize this to any OCO algorithm and to the dynamic and optimistic settings. Interestingly, Wan et al. (2023) conjecture that no sublinear regret algorithm exists for general submodular functions under general matroid constraints in the bandit setting. We compare to bounds attained by Wan et al. (2023) and other bandit algorithms for OSM (Niazadeh et al. 2021; Streeter, Golovin, and Krause 2009; Zhang et al. 2019; Kakade, Kalai, and Ligett 2007) in Table 4 in (Si Salem et al. 2023). Our main contribution is again the extension to the dynamic and optimistic settings. Technical Preliminary Submodularity and Matroids. Given a ground set V = [n] {1, 2, . . . , n}, a set function f : 2V R is submodular if f(S {i, j}) f(S {j}) f(S {i}) f(S) for all S V and i, j V \ S and monotone if f(A) f(B) for all A B 2V . A matroid is a pair M = (V, I), where I 2V , for which the following holds: (1) if B I and A B, then A I, (2) if A, B I and |A| < |B|, then there exists an x B \ A s.t. A {x} I. The rank r N of M is the cardinality of the largest set in I. With slight abuse of notation, we represent set functions f : 2V R as functions over {0, 1}n: given a set function f and an x {0, 1}n, we denote by f(x) as the value f(supp (x)), where supp (x) {i V : xi = 0} V is the support of x. Similarly, we treat matroids as subsets of {0, 1}n. Online Learning. In the general protocol of online learning (Cesa-Bianchi and Lugosi 2006), a decision-maker makes sequential decisions and incurs rewards as follows: at timeslot t [T], where T N is the time horizon, the decisionmaker first commits to a decision xt X from some set X. Then, a reward function ft : X R 0 is selected by an adversary from a set of functions F and revealed to the decision-maker, who accrues a reward ft(xt). The decision xt is determined according to a (potentially random) mapping PX,t : X t 1 Ft 1 Y, i.e., xt = PX,t (xs)t 1 s=1 , (fs)t 1 s=1 . (1) Let PX = (PX,t)t [T ] be the online policy of the decisionmaker. We define regret T (PX ), the regret at horizon T, as: sup (ft)T t=1 FT t=1 ft(x) EPX h T X t=1 ft(xt) io . (2) We seek policies that attain a sublinear (i.e., o(T)) regret; intuitively, such policies perform on average as well as the The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) static optimum in hindsight. Note that the regret is defined w.r.t. the optimal fixed decision x, i.e., the time-invariant decision x X that would be optimal in hindsight, after the sequence (ft)T t=1 is revealed. When selecting xt, the decisionmaker has no information about the upcoming reward ft. Finally, this is the full-information setting: at each timeslot, the decision maker observes the entire reward function ft( ), rather than just the reward ft(xt) R 0. Deviating from these assumptions is of both practical and theoretical interest. In the dynamic regret setting (Zinkevich 2003), the regret is measured w.r.t. a time-variant optimum, appropriately constrained so that changes from one timeslot to the next do not vary significantly. In learning with optimism (Rakhlin and Sridharan 2013), additional information is assumed to be available w.r.t. ft, in the form of so-called predictions. In the bandit setting (Auer et al. 1995), the online policy PX of the decision maker only has access to rewards ft(xt) R 0, as opposed to the entire reward function. Online Convex Optimization. The online convex optimization (OCO) framework (Zinkevich 2003; Hazan 2016) follows the above online learning protocol (1), where (a) the decision space X is a convex set in Rn, and (b) the set of reward functions F is a subset of concave functions over X. Formally, OCO operates under the following assumption: Assumption 1. Set X Rn is convex and compact. The reward functions in F are all L-Lipschitz concave functions w.r.t. a norm over X, for some common L R>0. There is a rich literature on OCO policies (Hazan 2016); examples include Online Gradient Ascent (OGA) (Hazan 2016), Online Mirror Ascent (OMA) (Bubeck 2011), and Follow-The-Regularized-Leader (FTRL) (Mc Mahan 2017). All three enjoy sublinear regret: Theorem 1. Under Asm. 1 OGA, OMA, and FTRL attain O( T) regret. Details on all three algorithms and the regret they attain are in Appendix A of Si Salem et al. (2023). Most importantly, the OCO framework generalizes to the dynamic, learningwith-optimism, as well as bandit settings (see extentions section). Weighted Threshold Potentials. A threshold potential (Stobbe and Krause 2010) Ψb,w,S : {0, 1}n R 0, also known as a budget-additive function (Andelman and Mansour 2004; Dobzinski 2016; Buchfuhrer et al. 2010), is defined as: Ψb,w,S(x) min n b, P j S xjwj o , for x {0, 1}n, (3) where b R 0 { } is a threshold, S V is a subset of V = [n], and w = (wj)j S [0, b]|S| is a weight vector bounded by b.1 The linear combination of threshold potential functions defines the rich class of weighted threshold potentials (WTP) (Stobbe and Krause 2010), defined as: f(x) P ℓ C cℓΨbℓ,wℓ,Sℓ(x), for x {0, 1}n, (4) where C is an arbitrary index set and cℓ R 0, for ℓ C. WTP functions are submodular and monotone (see Appendix B of Si Salem et al. (2023)). We define the degree of a 1Assumption wj b, j S, is w.l.o.g., as replacing wj with min {wj, b} preserves values of f over {0, 1}n. WTP function f = maxℓ C |Sℓ| as the maximum number of variables that a threshold potential Ψ in f depends on. We give several examples of WTP functions in Appendix B of Si Salem et al. (2023). In short, classic problems such as influence maximization (Kempe, Kleinberg, and Tardos 2003) and facility location (Krause and Golovin 2014; Frieze 1974), resource allocation problems like cache networks (Ioannidis and Yeh 2016; Li et al. 2021) and similarity caching (Si Salem, Neglia, and Carra 2022), as well as demand forecasting (Ito and Fujimaki 2016) and team formation (Li et al. 2018) can all be expressed using WTP functions. Overall, the WTP class is very broad: there exists a hierarchy among submodular functions, including weighted coverage functions (Karimi et al. 2017), weighted cardinality truncations (Dolhansky and Bilmes 2016), and sums of concave functions composed with non-negative modular functions (Stobbe and Krause 2010); all of them are strictly dominated by the WTP class (see Stobbe and Krause (2010), as well as Appendix B of Si Salem et al. (2023)). Problem Formulation We consider a combinatorial version of the online learning protocol defined in Eq. (1). In particular, we focus on the case where (a) the decision set is X {0, 1}n, i.e., the vectorized representation of subsets of V = [n], and (b) the set F of reward functions comprises set functions over X. Though some of our results (e.g., Thm. 2) pertain to this general combinatorial setting, we are particularly interested in the case where (a) X is a matroid, and (b) F is the WTP class, i.e., the set of functions whose form is given by Eq. (4). Both in the general combinatorial setting and for WTP functions, evaluating the best fixed decision may be computationally intractable even in hindsight, i.e., when all reward functions were revealed. As is customary (see, e.g., (Krause and Golovin 2014)), instead of the regret in Eq. (2), we consider α-regret T (PX ), the so-called α-regret, defined as: sup (ft)T t=1 FT n max x X α t=1 ft(x) EPX h T X t=1 ft(xt) io . (5) Intuitively, we compare the performance of the policy PX w.r.t. the best polytime(n) α-approximation of the static offline optimum in hindsight. For example, the approximation ratio would be α = 1 1/e in the case of submodular set functions maximized over matroids. Online Submodular Optimization via Online Convex Optimization The Case of General Set Functions First, we show how OCO can be leveraged to tackle online learning in the general combinatorial setting, i.e., when X {0, 1}n and F comprises general functions defined over X. Rounding Augmented OCO Policy. We begin by stating a sandwich property that functions in F should satisfy, so that the reduction to OCO holds. To do so, we first need to introduce the notion of randomized rounding. Let Y conv (X) be the convex hull of X. A randomized rounding is a random map Ξ : Y X, i.e., a map from a fractional y Y The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) Algorithm 1: Rounding-Augmented OCO (RAOCO) policy Require: OCO policy PY, randomized rounding Ξ : Y X 1: for t = 1, 2, . . . , T do 2: yt PY,t (ys)t 1 s=1, ( fs)t 1 s=1 3: xt Ξ(yt) 4: Receive reward ft(xt) 5: Reward function ft is revealed 6: Construct ft from ft satisfying Asm. 2 7: end for and, possibly, a source of randomness to an integral variable x X. We assume that the set F satisfies the following: Assumption 2. (Sandwich Property) There exists an α (0, 1], an L R>0, and a randomized rounding Ξ : Y X such that, for every f : X R 0 F there exists a LLipschitz concave function f : Y R s.t. f(x) f(x), for all x X, and (6) EΞ [f(Ξ(y))] α f(y), for all y Y. (7) We refer to f as the concave relaxation of f. Intuitively, Asm. 2 postulates the existence of such a concave relaxation f that is not far from f: Eqs. (6) and (7) imply that f bounds f both from above and below, up to the approximation factor α. Moreover, the upper bound (Eq. (6)) needs to only hold for integral values, while the lower bound (Eq. (7)) needs to only hold in expectation, under an appropriately-defined randomized rounding Ξ. Armed with this assumption, we can convert any OCO policy PY operating over Y = conv (X) to a Randomizedrounding Augmented OCO (RAOCO) policy, denoted by PX , operating over X. This transformation (see Alg. 1) uses both the randomized-rounding Ξ, as well as the concave relaxations ( fs)t 1 s=1 of the functions (fs)t 1 s=1 observed so far. At t [T], the RAOCO policy amounts to: yt = PY,t (ys)t 1 s=1, ( fs)t 1 s=1 , (8a) xt = Ξ(yt) X. (8b) In short, the OCO policy PY is first used to generate a new fractional state yt Y by applying PY,t to the history of concave relaxations. Then, this fractional decision yt is randomly mapped to an integral decision xt X according to the rounding scheme Ξ. Then, the reward f(xt) is received and ft is revealed, at which point a concave function ft is constructed from ft and added to the history. Our first main result is the following: Theorem 2. Under Asm. 2, given an OCO policy PY, the RAOCO policy PX described by Alg. 1 satisfies α-regret T (PX ) α regret T (PY) . The proof is in Appendix D of Si Salem et al. (2023). As a result, any regret guarantee obtained by an OCO algorithm over Y, immediately transfers to an α-regret for RAOCO, where α is determined by Asm. 2. In particular, the decision set Y is closed, bounded, and convex by construction. Combined with the fact that concave relaxations f are L-Lipschitz (by Asm. 2), Thms. 1 and 2 yield the following corollary: Corollary 1. Under Asm. 2, RAOCO policy PX in Alg. 1 equipped with OGA, OMA, or FTRL as OCO policy PY has sublinear α-regret. That is, α-regret T (PX ) = O To use this result, Asm. 2 should hold, and both the randomized rounding and the concave relaxations used in RAOCO should be poly-time: all are true for WTP functions optimized over matroid constraints, which we turn to next. The Case of Weighted Threshold Potentials We now consider the case where the decision set is a matroid, and reward functions belong to the class of WTP functions, defined by Eq. (4). We will show that, under appropriate definitions of a randomized rounding and concave relaxations, the class F satisfies Asm. 2 and, thus, online learning via RAOCO comes with the regret guarantees of Corollary 1. For Y = conv (X), consider the map f 7 f of WTP functions f : X R to concave relaxations f : Y R of the form: f(y) f(y) = P ℓ C cℓmin n bℓ, P j Sℓyjwℓ,j o , (9) for y Y. In other words, the relaxation of f is itself: it has the same functional form, allowing integral variables to become fractional.2 This is clearly concave, as the minimum of affine functions is concave, and the positively weighted sum of concave functions is concave. Finally, all such functions are Lipschitz, with a parameter that depends on cℓ, bℓ, wℓ, ℓ C. Let F be the image of F under the map (9). We make the following assumption, which is readily satisfied if, e.g., all constituent parameters (cℓ, bℓ, wℓ, ℓ C) are uniformly bounded, or the set F is finite, etc.: Assumption 3. There exists an L > 0 such that all functions in F are L-Lipschitz. Next, we turn our attention to the randomized rounding Ξ. We can in fact characterize the property that Ξ must satisfy for Asm. 2 to hold for relaxations given by Eq. (9): Definition 1. A randomized rounding Ξ : Y X is negatively correlated if, for x = Ξ(y) X (a) the coordinates of x are negatively correlated 3 and (b) EΞ [x] = y. Our next result immediately implies that any negatively correlated rounding can be used in RAOCO: Lemma 1. Let Ξ : Y X be a negatively correlated randomized rounding, and consider the concave relaxations f constructed from f F via Eq. (9). Then, if Asm. 3 holds for some L > 0, the set F satisfies Asm. 2 with α = 1 1 where = supf F f. The proof is in Appendix E of Si Salem et al. (2023). As , the approximation ratio α approaches 1 1/e from above, recovering the usual approximation guarantee. However, for finite , we in fact obtain an improved approximation ratio; an example (quadratic submodular functions) is described in Appendix B of Si Salem et al. (2023). Finally, and 2In Appendix G.IV of Si Salem et al. (2023), we provide an example where the functional form of concave relaxations differs. 3A set of random variables xi {0, 1} , i [n], are negatively correlated if E Q i S E [xi] for all S [n]. The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) most importantly, a negatively-correlated randomized rounding can always be constructed if X is a matroid. Chekuri, Vondr ak, and Zenklusen (2010) provide two polynomial-time randomized rounding algorithms that satisfy this property: Lemma 2. (Chekuri, Vondr ak, and Zenklusen (2010, Theorem 1.1.)) Given a matroid X {0, 1}n, let y conv (X) and Ξ be either swap rounding or randomized pipage rounding. Then, Ξ is negatively correlated. We review swap rounding in Appendix F of Si Salem et al. (2023). Interestingly, the existence of a negatively correlated rounding is inherently linked to matroids: a negativelycorrelated rounding exists if and only if X is a matroid (see Thm. I.1. in Chekuri, Vondr ak, and Zenklusen (2010)). Lemma 1 thus implies that the reduction of RAOCO to OCO policies is also linked to matroids. Putting everything together, Lemmas 1 2 and Corollary 1 yield the following: Theorem 3. Let X {0, 1}n be a matroid, and F be a subset of the WTP class for which Asm. 3 holds. Then, the RAOCO policy PX defined by Alg. 1 using swap rounding or randomized pipage rounding as Ξ and OGA, OMA, or FTRL as OCO policy PY, and concave relaxations in Eq. (9) has sublinear α-regret. In particular, α-regret (PX ) = O Note that, though all algorithms yield O( T) regret, the dependence of constants on problem parameters (such as n and the matroid rank r), as reported in Table 1 in C of Si Salem et al. (2023) is optimized under OMA (see Appendix C of Si Salem et al. (2023)). Computational Complexity OCO Policy. OCO policies are polytime (see, e.g., (Hazan 2016)). Taking gradient-based OCO policies (e.g., OMA in Alg. 2 in Si Salem et al. (2023)), their computational complexity is dominated by a projection operation to the convex set Y = conv (X). The exact time complexity of this projection depends on Y, however, given a membership oracle that decides y conv (X), the projection can be computed efficiently in polynomial time (Hazan 2016). Moreover, the projection problem is a convex problem that can be computed efficiently (e.g., iteratively to an arbitrary precision) and can also be computed in strongly polynomial time (Gupta, Goemans, and Jaillet 2016, Theorem 3). Concave Relaxations and Randomized Rounding. Concave relaxations are linear in the representation of the function f, but in practice come for free , once parameters in Eq. (4) are provided. Swap rounding over a general matroid is O nr2 , where r is the rank of the matroid (Chekuri, Vondr ak, and Zenklusen 2010). This dominates the remaining operations (including OCO), and thus determines the overall complexity of our algorithm. This O nr2 term assumes access to the decomposition of a fractional point y Y to bases of the matroid. Carath eodory s theorem implies the existence of such decomposition of at most n+1 points in X; moreover, there exists a decomposition algorithm (Cunningham 1984) for general matroids with running time O n6 . However, in all practical cases we consider (including uniform and partition matroids) the complexity is significantly lower. More specifically, for partition matroids, swap rounding reduces to an algorithm with linear time complexity, namely, O (mn) for m partitions (Srinivasan 2001). Extensions Dynamic Setting. In the dynamic setting, the decision maker compares its performance to the best sequence of decisions (y t )t [T ] with a path length regularity condition (Zinkevich 2003; Cesa-Bianchi et al. 2012). I.e., let ΛX (T, PT ) n (xt)T t=1 X T : PT t=1 xt+1 xt PT o X T be the set of sequences of decisions in a set X with path length less than PT R 0 over a time horizon T. We define α-regret T,PT (PX ), the dynamic α-regret, as: sup (ft)T t=1 FT max (x t )T t=1 ΛX (T,PT ) α t=1 ft(x t ) When α-regret T,PT (PX ) is sublinear in T, the policy attains average rewards that asymptotically compete with the optimal decisions of bounded path length, in hindsight. Through our reduction to OCO, we can leverage OGA (Zinkevich 2003) or meta-learning algorithms over OGA (Zhao et al. 2020) to obtain dynamic regret guarantees in OSM. As an additional technical contribution, we provide the first sufficient and necessary conditions for OMA to admit a dynamic regret guarantee (see Appendix A of Si Salem et al. (2023)). This allows to extend a specific instance of OMA operating on the simplex, the so-called fixed-share algorithm (Cesa-Bianchi et al. 2012; Herbster and Warmuth 1998), to matroid polytopes. This yields a tighter regret guarantee than OGA (Zinkevich 2003; Zhao et al. 2020) (see Theorem 7 in Appendix A of Si Salem et al. (2023)). Putting everything together, we get: Theorem 4. Under Asm. 2, RAOCO policy PX described by Alg. 1 equipped with an OMA policy in Appendix A of Si Salem et al. (2023) as OCO policy PY has sublinear dynamic α-regret, i.e., α-regret T,PT (PX ) = O PT T . This follows from Thm. 2 and the dynamic regret guarantee for OMA in Appendix A of Si Salem et al. (2023). Optimistic Setting. In the optimistic setting, the decision maker has access to additional information available in the form of predictions: a function πt+1 : [0, 1]n R 0, serving as a prediction of the reward function ft+1(x) is made available before committing to a decision xt+1 at timeslot t [T]. The prediction πt encodes prior information available to the decision maker at timeslot t.4 Let gt and gπ t be supergradients of ft and πt at point yt, respectively. We can define an optimistic OMA policy (see Alg. 3 in Appendix A of Si Salem et al. (2023)) that leverages both the prediction and the observed rewards. Applying again our reduction of OSM to this setting we get: Theorem 5. Under Asm. 2, RAOCO policy PX in Alg. 1 equipped with OOMA in Appendix A of Si Salem et al. (2023) as policy PY yields α-regret T,PT (PX ) = PT PT t=1 gt gπ t 2 4Function πt can extend over fractional values, e.g., be the multilinear relaxation of a set function. The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) RAOCO-OGA RAOCO-OMA FSF Tabular Greedy Random Datasets F t FX /F FX /F FX /F FX /F FX /F 33 0.902 0.965 0.839 0.833 0.642 66 0.924 0.967 0.896 0.894 0.624 99 0.945 0.982 0.933 0.931 0.622 33 0.994 0.997 0.985 0.953 66 0.99 0.994 0.987 0.950 99 0.993 0.997 0.995 0.953 50 0.845 0.853 0.703 0.694 0.632 100 0.865 0.906 0.776 0.768 0.615 149 0.88 0.925 0.807 0.805 0.629 50 0.826 0.861 0.720 0.620 100 0.854 0.908 0.786 0.619 149 0.88 0.927 0.818 0.625 98 0.749 0.792 0.681 0.69 0.748 196 0.786 0.781 0.713 0.676 0.7 293 0.846 0.866 0.756 0.769 0.711 98 0.889 0.948 0.908 0.829 196 0.872 0.908 0.902 0.814 293 0.926 0.948 0.964 0.874 33 0.984 0.987 0.845 0.844 0.605 66 0.994 0.994 0.868 0.886 0.603 99 0.995 0.998 0.869 0.902 0.612 33 0.98 0.983 0.834 0.611 66 0.990 0.991 0.852 0.607 99 0.993 0.994 0.857 0.601 Table 1: Average cumulative reward FX (t = T/3, 2T/3, T), normalized by fractional optimal F , of integral policies across different datasets and uniform (unif.) and partition (part.) matroid constraints. Optimal hyperparameters (η, γ, cp) and standard deviations are reported in the Appendix H of Si Salem et al. (2023) along with the value ranges explored. RAOCO combined with OGA or OMA outperforms competitors almost reaching a ratio of one, with the exception of Movie Lens, where Tabular Greedy does better. As Random also performs well on Movie Lens, this indicates that the (static) offline optimal is quite poor for this reward sequence. By Property 2, fractional solutions strictly dominate the integral optimal, which implies that in all cases RAOCO outperformed the 1 1/e approximation, attaining an almost optimal value. This theorem follows from Theorem 2 and the optimistic regret guarantee established for OMA policies in Theorem 7 in Appendix A of Si Salem et al. (2023). The optimistic regret guarantee in Theorem 5 shows that the regret of a policy can be reduced to 0 when the predictions are perfect, while providing O T guarantee in Thm. 3 when the predictions are arbitrarily bad (with bounded gradients). To the best of our knowledge, ours is the first work to provide guarantees for optimistic OSM. Bandit Setting. Recall that in the bandit setting the decision maker only has access to the reward ft(xt) after committing to a decision xt X at t [T]; i.e., the reward function is not revealed. Our reduction to OCO does not readily apply to the bandit setting; however, we show that the bandit algorithm by Wan et al. (2023) can be used to construct such a reduction. The main challenge is to estimate gradients of inputs in Y only from bandit feedback; this can be done via a perturbation method (see also Hazan and Levy (2014)). This approach, described in Appendix G of Si Salem et al. (2023), yields the following theorem: Theorem 6. Under bounded submodular monotone rewards and partition matroid constraint sets, LIRAOCO policy PX in Alg. 5 in Appendix G of Si Salem et al. (2023) equipped with an OCO policy PYδ yields α-regret T,PT ,W (PX ) W regret T/W,PT (PYδ) + T W + 2αδr2n T, where δ, W, are tuneable parameters of the algorithm and regret T/W,PT (PYδ) is the regret of an OCO policy executed for T/W timeslots. Thm. 6 applies to general submodular functions, but is restricted to partition matroids. The theorem also extends to dynamic and optimistic settings: we provide the full description in Appendix G of Si Salem et al. (2023). Our analysis generalizes Wan et al. (2023) in that (a) we show that a reduction can be performed to any OCO algorithm, rather than just FTRL, as well as (b) in extending it to the dynamic and optimistic settings (see also Table 4 in Si Salem et al. (2023)). Experiments Datasets and Problem Instances. We consider five different OSM problems: two influence maximization problems, over the ZKC (Zachary 1977) and Epinions (Richardson, Agrawal, and Domingos 2003) graphs, respectively, a facility location problem over the Movie Lens dataset (Harper and Konstan 2015), a team formation, and a weighted cov- The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) 0 10 20 30 40 50 Timeslots OGA OMA OMA Tabular Greedy Static F (a) Stationary 0 10 20 30 40 50 Timeslots OGA OMA OMA Tabular Greedy Static F (b) Non-Stationary Figure 1: Average cumulative reward FX of the different policies under Synth WC dataset under different setups: stationary in (a) and non-stationary in (b). Non-stationarity in (b) is applied by changing the objective at t = 25 (see Appendix H of Si Salem et al. (2023)). The area depicts the standard deviation over 5 runs. 0 10 20 30 40 50 Timeslots Optimistic OGA (nσ = 10) Optimistic OGA (nσ = 100) OGA Static F (a) Optimistic Learning 0 10 20 30 40 50 Timeslots Meta-Policy OGA (η) (b) Meta-Policies Figure 2: Average cumulative reward FX of the different policies under Synth WC dataset under a non-stationary setup: the objective is changed at every timeslot. The algorithms Optimistic OGA and OGA are executed with different learning rates under different prediction accuracy (noise with std. dev. nσ {10, 100}) in (a); the larger learning rate is depicted by a solid line. The meta-policy in (b) can learn the best configuration of OGA without tuning the learning rate. The area depicts the standard deviation over 5 runs. erage problem over synthetic datasets. Reward functions are generated over a finite horizon and optimized online over both uniform and partition matroid constraints. Details are provided in Appendix H of Si Salem et al. (2023). Algorithms. We implement the policy in Alg. 1 (RAOCO), coupled with OGA (RAOCO-OGA) and OMA (RAOCO-OMA) as OCO policies. As competitors, we also implement the fixed share forecaster (FSF ) policy by Matsuoka, Ito, and Ohsaka (2021), which only applies to uniform matroids, and the Tabular Greedy policy by Streeter, Golovin, and Krause (2009), as well as a Random algorithm, which selects a decision u.a.r. from the bases of X. Details and hyperparameters explored are reported in Appendix H of Si Salem et al. (2023). Our code is publicly available.5 Metrics. For each setting, we compute F = maxy Y 1 T PT t=1 f(y), the value of the optimal fractional solution in hindsight, assuming rewards are replaced 5https://github.com/neu-spiral/OSMvia OCO by their concave relaxations. Note that this involves solving a convex optimization problem, and overestimates the (intractable) offline optimal, i.e., F maxx X 1 T PT t=1 f(x). For each online policy, we compute both the integral and fractional average cumulative reward at different timeslots t, given by FX = 1 t Pt s=1 f(xs), FY = 1 t Pt s=1 f(ys), respectively. We repeat each experiment for 5 different random seeds, and use this to report standard deviations. OSM Policy Comparison. A comparison between the two versions of RAOCO (OGA and OMA) with the three competitors is shown in Table 1. We observe that both OGA and OMA significantly outperform competitors, with the exception of Movie Lens, where Tabular Greedy does better. OMA is almost always better than OGA. Most importantly, we significantly outperform both Tabular Greedy and FSF w.r.t. running time, being 2.15 250 times faster (see Appendix H of Si Salem et al. (2023)). Dynamic Regret and Optimistic Learning. Fig. 1 shows the performance of the different policies in a dynamic scenario. All the policies have similar performance in the stationary setting, however in the non-stationary setting only OGA, OMA, and FSF show robustness. We further experiment with an optimistic setting in Fig. 2 (a) under a non-stationary setting where we provide additional information about future rewards to the optimistic policies, which can leverage this information and yield better results (see Appendix H of Si Salem et al. (2023)). Meta-Policies. Fig. 2 (b) shows the performance of the metapolicy that learns over different policies configured with different learning rates. We observe that the meta-policy quickly learns the best learning rate. We provide a reduction of online maximization of a large class of submodular functions to online convex optimization and show that our reduction extends to many different versions of the online learning problem. There are many possible extensions. As our framework does not directly apply to general submodular functions, it would be interesting to derive a general method to construct the concave relaxations which are the building block of our reduction in Theorem 2. It is also meaningful to investigate the applicability of our reduction to monotone submodular functions with bounded curvature (Feldman 2021), and non-monotone submodular functions (Krause and Golovin 2014). Acknowledgements We gratefully acknowledge support from the NSF (grants 1750539, 2112471, 1813406, and 1908510). Ageev, A. A.; and Sviridenko, M. I. 2004. Pipage rounding: A new method of constructing algorithms with proven performance guarantee. Journal of Combinatorial Optimization. Andelman, N.; and Mansour, Y. 2004. Auctions with budget constraints. In SWAT. The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 1995. Gambling in a rigged casino: The adversarial multiarmed bandit problem. In FOCS. Bach, F. 2019. Submodular functions: from discrete to continuous domains. Mathematical Programming. Besbes, O.; Gur, Y.; and Zeevi, A. 2015. Non-stationary stochastic optimization. Operations research, 63(5): 1227 1244. Bian, A. A.; Mirzasoleiman, B.; Buhmann, J.; and Krause, A. 2017. Guaranteed non-convex optimization: Submodular maximization over continuous domains. In AISTATS. Bubeck, S. 2011. Introduction to online optimization. Lecture notes. Buchfuhrer, D.; Dughmi, S.; Fu, H.; Kleinberg, R.; Mossel, E.; Papadimitriou, C.; Schapira, M.; Singer, Y.; and Umans, C. 2010. Inapproximability for vcg-based combinatorial auctions. In SODA. Calinescu, G.; Chekuri, C.; Pal, M.; and Vondr ak, J. 2011. Maximizing a monotone submodular function subject to a matroid constraint. SICOMP. Cesa-Bianchi, N.; Gaillard, P.; Lugosi, G.; and Stoltz, G. 2012. Mirror descent meets fixed share (and feels no regret). Neur IPS. Cesa-Bianchi, N.; and Lugosi, G. 2006. Prediction, Learning, and Games. Cambridge University Press. ISBN 0521841089. Chekuri, C.; Vondr ak, J.; and Zenklusen, R. 2010. Dependent randomized rounding via exchange properties of combinatorial structures. In FOCS. Chen, L.; Hassani, H.; and Karbasi, A. 2018. Online Continuous Submodular Maximization. In AISTATS. Cunningham, W. H. 1984. Testing membership in matroid polyhedra. Journal of Combinatorial Theory, Series B. Dekel, O.; Flajolet, A.; Haghtalab, N.; and Jaillet, P. 2017. Online learning with a hint. Neur IPS. Dobzinski, S. 2016. Breaking the logarithmic barrier for truthful combinatorial auctions with submodular bidders. In STOC. Dolhansky, B. W.; and Bilmes, J. A. 2016. Deep submodular functions: Definitions and learning. Neur IPS. Feldman, M. 2021. Guess free maximization of submodular and linear sums. Algorithmica. Flaxman, A. D.; Kalai, A. T.; and Mc Mahan, H. B. 2005. Online Convex Optimization in the Bandit Setting: Gradient Descent without a Gradient. In SODA. Frank, M.; and Wolfe, P. 1956. An algorithm for quadratic programming. Naval research logistics quarterly. Frieze, A. M. 1974. A cost function property for plant location problems. Mathematical Programming. Goemans, M. X.; and Williamson, D. P. 1994. New 3 4approximation algorithms for the maximum satisfiability problem. SIDMA. Gupta, S.; Goemans, M.; and Jaillet, P. 2016. Solving combinatorial games using products, projections and lexicographically optimal bases. ar Xiv preprint ar Xiv:1603.00522. Harper, F. M.; and Konstan, J. A. 2015. The Movie Lens Datasets: History and Context. Tii S. Harvey, N.; Liaw, C.; and Soma, T. 2020. Improved algorithms for online submodular maximization via first-order regret bounds. Neur IPS. Hazan, E. 2016. Introduction to Online Convex Optimization. Foundations and Trends in Optimization. Hazan, E.; and Levy, K. 2014. Bandit convex optimization: Towards tight bounds. Neur IPS. Herbster, M.; and Warmuth, M. K. 1998. Tracking the best expert. Machine learning. Ioannidis, S.; and Yeh, E. 2016. Adaptive caching networks with optimality guarantees. PER. Ito, S.; and Fujimaki, R. 2016. Large-scale price optimization via network flow. Neur IPS. Jadbabaie, A.; Rakhlin, A.; Shahrampour, S.; and Sridharan, K. 2015. Online optimization: Competing with dynamic comparators. In AISTATS. Kakade, S. M.; Kalai, A. T.; and Ligett, K. 2007. Playing games with approximation algorithms. In STOC. Karimi, M.; Lucic, M.; Hassani, H.; and Krause, A. 2017. Stochastic Submodular Maximization: The Case of Coverage Functions. Neur IPS. Kempe, D.; Kleinberg, J.; and Tardos, E. 2003. Maximizing the spread of influence through a social network. In KDD. Kleinberg, R. 2004. Nearly tight bounds for the continuumarmed bandit problem. Neur IPS. Krause, A.; and Golovin, D. 2014. Submodular Function Maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press. Li, Y.; Cheng, M.; Fujii, K.; Hsieh, F.; and Hsieh, C.-J. 2018. Learning from group comparisons: exploiting higher order interactions. Neur IPS. Li, Y.; Si Salem, T.; Neglia, G.; and Ioannidis, S. 2021. Online caching networks with adversarial guarantees. POMACS. Littlestone, N.; and Warmuth, M. K. 1994. The Weighted Majority Algorithm. Information and computation. Matsuoka, T.; Ito, S.; and Ohsaka, N. 2021. Tracking regret bounds for online submodular optimization. In AISTATS. Mc Mahan, H. B. 2017. A survey of algorithms and analysis for adaptive online learning. JMLR. Mohri, M.; and Yang, S. 2016. Accelerating online convex optimization via adaptive prediction. In AISTATS. Mokhtari, A.; Shahrampour, S.; Jadbabaie, A.; and Ribeiro, A. 2016. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In CDC. Niazadeh, R.; Golrezaei, N.; Wang, J. R.; Susan, F.; and Badanidiyuru, A. 2021. Online learning via offline greedy algorithms: Applications in market design and optimization. In EC. Rakhlin, A.; and Sridharan, K. 2013. Online learning with predictable sequences. In COLT. Richardson, M.; Agrawal, R.; and Domingos, P. 2003. Trust Management for the Semantic Web. In ISWC. The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24) Shalev-Shwartz, S. 2012. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning. Si Salem, T.; Neglia, G.; and Carra, D. 2022. Ascent Similarity Caching With Approximate Indexes. To N. Si Salem, T.; Ozcan, G.; Nikolaou, I.; Terzi, E.; and Ioannidis, S. 2023. Online Submodular Maximization via Online Convex Optimization. ar Xiv preprint ar Xiv:2309.04339. Srinivasan, A. 2001. Distributions on level-sets with applications to approximation algorithms. In FOCS. Stobbe, P.; and Krause, A. 2010. Efficient Minimization of Decomposable Submodular Functions. Neur IPS. Streeter, M.; Golovin, D.; and Krause, A. 2009. Online Learning of Assignments. In Neur IPS. Wan, Z.; Zhang, J.; Chen, W.; Sun, X.; and Zhang, Z. 2023. Bandit Multi-linear DR-Submodular Maximization and Its Applications on Adversarial Submodular Bandits. In Proceedings of the 40th International Conference on Machine Learning (ICML). Zachary, W. W. 1977. An information flow model for conflict and fission in small groups. Journal of anthropological research. Zhang, M.; Chen, L.; Hassani, H.; and Karbasi, A. 2019. Online continuous submodular maximization: From fullinformation to bandit feedback. Neur IPS. Zhang, Q.; Deng, Z.; Chen, Z.; Hu, H.; and Yang, Y. 2022. Stochastic continuous submodular maximization: Boosting via non-oblivious function. In ICML. Zhao, P.; Zhang, Y.-J.; Zhang, L.; and Zhou, Z.-H. 2020. Dynamic regret of convex and smooth functions. Neur IPS. Zinkevich, M. 2003. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In ICML. ISBN 1577351894. The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)