# Submodular Optimization over Streams with Inhomogeneous Decays

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Junzhou Zhao,¹ Shuo Shang,² Pinghui Wang,³ John C.S. Lui,⁴ Xiangliang Zhang¹

¹King Abdullah University of Science and Technology, KSA; ²Inception Institute of Artificial Intelligence, UAE; ³Xi'an Jiaotong University, China; ⁴The Chinese University of Hong Kong, Hong Kong

{junzhou.zhao, xiangliang.zhang}@kaust.edu.sa, jedi.shang@gmail.com, phwang@mail.xjtu.edu.cn, cslui@cse.cuhk.edu.hk

## Abstract

Cardinality-constrained submodular function maximization, which aims to select a subset of size at most k that maximizes a monotone submodular utility function, is key to many data mining and machine learning applications, such as data summarization and maximum coverage problems. When data is given as a stream, streaming submodular optimization (SSO) techniques are desired. Existing SSO techniques apply only to insertion-only streams, where each element has an infinite lifespan, and sliding-window streams, where each element has the same lifespan (i.e., the window size). However, elements in some data streams may have arbitrarily different lifespans, which requires addressing SSO over streams with inhomogeneous decays (SSO-ID). This work formulates the SSO-ID problem and presents three algorithms: BASICSTREAMING is a basic streaming algorithm that achieves a (1/2 − ϵ) approximation factor; HISTAPPROX improves the efficiency significantly and achieves a (1/3 − ϵ) approximation factor; HISTSTREAMING is a streaming version of HISTAPPROX that uses heuristics to further improve the efficiency. Experiments conducted on real data demonstrate that HISTSTREAMING finds high-quality solutions and is up to two orders of magnitude faster than the naive GREEDY algorithm.
## Introduction

Selecting a subset of data that maximizes some utility function under a cardinality constraint is a fundamental problem in many data mining and machine learning applications. In myriad scenarios, ranging from data summarization (Mitrovic et al. 2018), to search-results diversification (Agrawal et al. 2009), to feature selection (Brown et al. 2012), to coverage maximization (Cormode, Karloff, and Wirth 2010), utility functions commonly satisfy submodularity (Nemhauser, Wolsey, and Fisher 1978), which captures the diminishing-returns property. It is therefore not surprising that submodular optimization has attracted a lot of interest in recent years (Krause and Golovin 2014).

Shuo Shang and Xiangliang Zhang are the corresponding authors. Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Insertion-only stream: each element has an infinite lifespan. Sliding-window stream: each element has the same initial lifespan. Our model: each element can have an arbitrary lifespan. (Three panels showing remaining lifespans under insertion-only, sliding-window/homogeneous decay, and inhomogeneous decay at times t, t + 1, t + 2.)

If data is given in advance, the GREEDY algorithm can be applied to solve submodular optimization in batch mode. However, today's data can be generated continuously with no ending, and in some cases data is produced so rapidly that it cannot even be stored in computer main memory; e.g., Twitter generates more than 8,000 tweets every second (Internet Live Stats 2018). Thus, it is crucial to design streaming algorithms where, at any point in time, the algorithm has access to only a small fraction of the data. To this end, streaming submodular optimization (SSO) techniques have been developed for insertion-only streams, where a subset is selected from all historical data (Badanidiyuru et al.
2014), and sliding-window streams, where a subset is selected from the most recent data only (Epasto et al. 2017).

We notice that these two existing streaming settings, i.e., the insertion-only stream and the sliding-window stream, actually represent two extremes. In insertion-only streams, a subset is selected from all historical data elements, which are treated as equally important regardless of how outdated they are. This is often undesirable because stale historical data is usually less important than fresh, recent data. In sliding-window streams, a subset is selected from the most recent data only, and historical data outside of the window is completely discarded. This is also sometimes undesirable because one may not wish to lose the entire history of past data; some historical data may still be important. As a result, SSO over insertion-only streams may find solutions that are not fresh, while SSO over sliding-window streams may find solutions that exclude important historical data or include much recent but valueless data. Can we design SSO techniques with a better streaming setting?

We observe that both the insertion-only stream and the sliding-window stream can be unified by introducing the concept of data lifespan, i.e., the amount of time an element participates in subset selection. As time advances, an element's remaining lifespan decreases. When an element's lifespan reaches zero, the element is discarded and no longer participates in subset selection. Specifically, in insertion-only streams, each element has an infinite lifespan and always participates in subset selection after arrival. In sliding-window streams, each element has the same initial lifespan (i.e., the window size), and hence participates in subset selection for the same amount of time (see Fig. 1). We observe that in some real-world scenarios, it may be inappropriate to assume that each element in a data stream has the same lifespan.
Let us consider the following scenario.

Motivating Example. Consider a news aggregation website such as Hacker News (Y Combinator 2018), where news stories submitted by users form a news stream. Interesting news may attract users to keep clicking and commenting and thus survive for a long time, while boring news may survive for only one or two days (Leskovec, Backstrom, and Kleinberg 2009). In news recommendation tasks, we should select a subset of news from the currently alive news rather than only the most recent news. Therefore, besides the timestamp of each data element, the lifespan of each data element should also be considered in subset selection. Other similar scenarios include hot-video selection on YouTube (where each video may have its own lifespan) and trending-hashtag selection on Twitter (where each hashtag may have a different lifespan).

Overview of Our Approach. We propose to extend the two extreme streaming settings to a more general setting where each element is allowed to have an arbitrary initial lifespan, and thus each element can participate in subset selection for an arbitrary amount of time (see Fig. 1). We refer to this more general decaying mechanism as inhomogeneous decay, in contrast to the homogeneous decay adopted in sliding-window streams. This work presents three algorithms to address SSO over streams with inhomogeneous decays (SSO-ID). We first present a simple streaming algorithm, BASICSTREAMING. We then present HISTAPPROX, which improves the efficiency significantly. Finally, we design a streaming version of HISTAPPROX, namely HISTSTREAMING. We theoretically show that our algorithms have constant approximation factors. Our main contributions include:

- We propose a general inhomogeneous-decaying streaming model that allows each element to participate in subset selection for an arbitrary amount of time.
- We design three algorithms to address the SSO-ID problem with constant approximation factors.
- We conduct experiments on real data, and the results demonstrate that our method finds high-quality solutions and is up to two orders of magnitude faster than GREEDY.

## Problem Statement

Data Stream. A data stream comprises an unbounded sequence of elements arriving in chronological order, denoted by {v₁, v₂, . . .}. Each element is drawn from a set V, called the ground set, and each element v has a discrete timestamp t_v ∈ ℕ. Multiple data elements may arrive at the same time. In addition, there may be other attributes associated with each element.

Inhomogeneous Decay. We propose an inhomogeneous-decaying data stream (IDS) model to enable inhomogeneous decays. An element v arriving at time t_v is assigned an initial lifespan l(v, t_v) ∈ ℕ representing the maximum time span for which the element remains active. As time advances to t ≥ t_v, the element's remaining lifespan decreases to l(v, t) ≜ l(v, t_v) − (t − t_v). If l(v, t′) = 0 at some time t′, v is discarded. We assume l(v, t_v) is given as an input to our algorithm. At any time t, the active elements in the stream form a set, denoted by S_t ≜ {v : v ∈ V ∧ t_v ≤ t ∧ l(v, t) > 0}.

The IDS model is general. If l(v, t_v) = ∞ for every v, an IDS becomes an insertion-only stream. If l(v, t_v) = W for every v, an IDS becomes a sliding-window stream. If l(v, t_v) follows a geometric distribution parameterized by p, i.e., P(l(v, t_v) = l) = (1 − p)^(l−1) p, it is equivalent to saying that an active element is discarded with probability p at each time step. To simplify notation, when time t is clear from context, we write l_v for l(v, t), i.e., the remaining lifespan (or simply "the lifespan") of element v at time t.

Monotone Submodular Function (Nemhauser, Wolsey, and Fisher 1978). A set function f : 2^V → ℝ≥0 is submodular if f(S ∪ {s}) − f(S) ≥ f(T ∪ {s}) − f(T) for all S ⊆ T ⊆ V and s ∈ V \ T. f is monotone (non-decreasing) if f(S) ≤ f(T) for all S ⊆ T ⊆ V. Without loss of generality, we assume f is normalized, i.e., f(∅) = 0.
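As a toy illustration of these definitions, the sketch below uses a coverage utility (the number of distinct items covered by a set of elements), which is a standard example of a monotone submodular function. The element names and coverage sets are invented for illustration:

```python
# A (hypothetical) coverage utility: f(S) = number of items covered by S.
# Coverage functions are monotone and submodular.

def f(S, coverage):
    """Utility of a set S of elements; coverage maps element -> set of items."""
    covered = set()
    for v in S:
        covered |= coverage[v]
    return len(covered)

def marginal_gain(s, S, coverage):
    """delta(s | S) = f(S + {s}) - f(S)."""
    return f(S | {s}, coverage) - f(S, coverage)

# Toy ground set: three elements covering overlapping items.
coverage = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}

small, large = {"a"}, {"a", "b"}                    # small is a subset of large
gain_small = marginal_gain("c", small, coverage)    # gain of c w.r.t. {a}
gain_large = marginal_gain("c", large, coverage)    # gain of c w.r.t. {a, b}

assert gain_small >= gain_large                     # diminishing returns (submodularity)
assert f(large, coverage) >= f(small, coverage)     # monotonicity
print(gain_small, gain_large)  # 3 2
```

Adding "c" to the larger set gains less (2 instead of 3) because item 3 is already covered, which is exactly the diminishing-returns property.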
Let δ(s|S) ≜ f(S ∪ {s}) − f(S) denote the marginal gain of adding element s to S. Monotonicity is then equivalent to saying that the marginal gain of every element is always non-negative, and submodularity is equivalent to saying that the marginal gain δ(s|S) of an element s never increases as the set S grows, a.k.a. the diminishing-returns property.

Streaming Submodular Optimization with Inhomogeneous Decays (SSO-ID). Equipped with the above notation, we formulate the cardinality-constrained SSO-ID problem as follows:

OPT_t ≜ max_S f(S), s.t. S ⊆ S_t, |S| ≤ k,

where k is a given budget.

Remark. The SSO-ID problem is NP-hard, and the active data S_t continuously evolves, with outdated data being discarded and new data being added at every time t, which further complicates the algorithm design. A naive algorithm for the SSO-ID problem is to re-run GREEDY on S_t from scratch whenever S_t is updated; this approach outputs a solution that is (1 − 1/e)-approximate. However, it needs O(k|S_t|) utility-function evaluations at each time step, which is unaffordable for large S_t. Our goal is to find faster algorithms with comparable approximation guarantees.

This section presents three algorithms to address the SSO-ID problem. Due to space limitations, the proofs of all theorems are included in the extended version of this paper.

## Warm-up: The BASICSTREAMING Algorithm

In the literature, SIEVESTREAMING (Badanidiyuru et al. 2014) is designed to address SSO over insertion-only streams. We leverage SIEVESTREAMING as a basic building block to design the BASICSTREAMING algorithm. BASICSTREAMING is simple per se and may be inefficient, but it offers opportunities for further improvement. This section assumes that lifespans are upper bounded by L, i.e., l_v ≤ L for every v; we remove this assumption in the following sections.

SIEVESTREAMING (Badanidiyuru et al. 2014) is a threshold-based streaming algorithm for solving cardinality-constrained SSO over insertion-only streams.
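A minimal sketch of the threshold-sieve idea follows; this is not the authors' implementation. In particular, `opt_lo`/`opt_hi` (assumed bounds on the optimum) and the explicitly built threshold grid are simplifications: the actual algorithm maintains its thresholds lazily from the largest singleton value seen so far.

```python
def sieve_streaming(stream, f, k, eps, opt_lo, opt_hi):
    """Threshold-sieve sketch: one candidate set per geometric guess of OPT."""
    thresholds = []
    tau = opt_lo
    while tau <= 2 * k * opt_hi:     # geometric grid of guesses for OPT
        thresholds.append(tau)
        tau *= 1 + eps
    cands = {tau: set() for tau in thresholds}
    for v in stream:                 # single pass over the stream
        for tau, S in cands.items():
            if len(S) < k:
                gain = f(S | {v}) - f(S)
                # keep v only if its marginal gain clears the per-slot quota
                if gain >= (tau / 2 - f(S)) / (k - len(S)):
                    S.add(v)
    return max(cands.values(), key=f)   # best candidate set is the solution

# Toy coverage utility on four elements, budget k = 2.
cover = {"a": {1, 2}, "b": {3, 4}, "c": {1, 2, 3}, "d": {5}}
f = lambda S: len(set().union(*(cover[v] for v in S))) if S else 0
best = sieve_streaming(["a", "b", "c", "d"], f, k=2, eps=0.1, opt_lo=1, opt_hi=5)
print(f(best))  # 4
```

On this toy instance the best size-2 coverage is 4 (e.g., {a, b} covers {1, 2, 3, 4}), and at least one threshold's candidate set reaches it.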
The high-level idea is that each arriving element is selected only if its marginal gain w.r.t. a candidate set is no less than a threshold. In its implementation, SIEVESTREAMING lazily maintains a set of log_{1+ϵ}(2k) = O(ϵ⁻¹ log k) thresholds, each associated with a candidate set that is initially empty. For each arriving element, its marginal gain w.r.t. each candidate set is computed; if the gain is no less than the corresponding threshold and the candidate set is not full, the element is added to that candidate set. At any time, a candidate set with maximum utility is the current solution. SIEVESTREAMING achieves a (1/2 − ϵ) approximation guarantee.

Algorithm Description. We show how SIEVESTREAMING can be used to design a BASICSTREAMING algorithm that solves the SSO-ID problem. Let V_t denote the set of elements arriving at time t. We partition V_t into (at most) L non-overlapping subsets, i.e., V_t = ∪_{l=1}^{L} V_l^(t), where V_l^(t) is the subset of elements with lifespan l at time t. BASICSTREAMING maintains L SIEVESTREAMING instances, denoted by {A_l^(t)}_{l=1}^{L}, and alternates a data-update step and a time-update step to process the arriving elements V_t.

Data Update. This step processes the arriving data V_t. Instance A_l^(t) processes only elements with lifespan no less than l; in other words, the elements in ∪_{i≥l} V_i^(t) are fed to A_l^(t). After processing V_t, A_1^(t) outputs the current solution.

Time Update. This step prepares for processing the upcoming data in the next time step. We reset instance A_1^(t), i.e., empty its threshold set and each candidate set. Then we conduct a circular-shift operation: A_1^(t+1) ← A_2^(t), A_2^(t+1) ← A_3^(t), . . . , A_L^(t+1) ← A_1^(t).

BASICSTREAMING alternates the two steps and continuously processes data at each time step. We illustrate BASICSTREAMING in Fig. 2, with pseudo-code given in Alg. 1.

Analysis. BASICSTREAMING exhibits the feature that an instance gradually expires (and is reset) as the data it has processed expires.
Such a feature ensures that, at any time t, A_1^(t) has processed exactly the data in S_t. Because A_1^(t) is a SIEVESTREAMING instance, we immediately have the following conclusions.

Figure 2: BASICSTREAMING. Solid lines denote the data update (V_l^(t) fans out to instances A_1^(t), . . . , A_l^(t)); dashed lines denote the time update (reset A_1^(t) and advance t ← t + 1).

Algorithm 1: BASICSTREAMING

    Input: An IDS of data elements arriving over time
    Output: A subset S_t at any time t
    1  Initialize L SIEVESTREAMING instances {A_l^(1)}_{l=1}^{L};
    2  for t = 1, 2, . . . do
    3      for l = 1, . . . , L do Feed A_l^(t) with data ∪_{i≥l} V_i^(t);
    4      S_t ← output of A_1^(t);
    5      for l = 2, . . . , L do A_{l−1}^(t+1) ← A_l^(t);
    6      Reset A_1^(t) and A_L^(t+1) ← A_1^(t);

Theorem 1. BASICSTREAMING achieves a (1/2 − ϵ) approximation guarantee.

Theorem 2. BASICSTREAMING uses O(Lϵ⁻¹ log k) time to process each element, and O(Lkϵ⁻¹ log k) memory to store intermediate results (i.e., candidate sets).

Remark. As illustrated in Fig. 2, data with lifespan l is fed to {A_i^(t)}_{i≤l}. Hence, elements with large lifespans fan out to a large fraction of the SIEVESTREAMING instances and incur high CPU and memory usage, especially when L is large. This is the main bottleneck of BASICSTREAMING. On the other hand, elements with small lifespans need to be fed to only a few instances. Therefore, if data lifespans are mainly distributed over small values, e.g., power-law distributed, BASICSTREAMING is still efficient.

## HISTAPPROX: Improving Efficiency

To address the bottleneck of BASICSTREAMING when processing data with large lifespans, we design HISTAPPROX in this section. HISTAPPROX significantly improves the efficiency of BASICSTREAMING but requires the active data S_t to be stored in RAM.¹ Strictly speaking, HISTAPPROX is therefore not a streaming algorithm; we remove the assumption of storing S_t in RAM in the next section.

Basic Idea.
If, at any time, only a few instances are maintained and running in BASICSTREAMING, then both CPU time and memory usage decrease. Our idea is hence to selectively maintain a subset of the SIEVESTREAMING instances that can approximate the rest. Roughly speaking, this idea can be thought of as using a histogram to approximate a curve.

¹For example, if the lifespan follows a geometric distribution, i.e., P(l_v = l) = (1 − p) p^(l−1), l = 1, 2, . . ., and at most M elements arrive at a time, then |S_t| ≤ Σ_{a=0}^{t−1} M p^a ≤ M/(1 − p). Hence, if RAM is larger than M/(1 − p), S_t can actually be stored in RAM even as t → ∞.

Specifically, let g_t(l) denote the utility of the output of A_l^(t) at time t. For very large L, we can think of {g_t(l)}_{l≥1} as a curve (e.g., the dashed curve in Fig. 3). Our idea is to pick a few instances as active instances and construct a histogram that approximates this curve, as illustrated in Fig. 3.

Figure 3: Approximating the curve {g_t(l)}_{l≥1} by the histogram {g_t(l)}_{l∈x_t} (active indices x₁, . . . , x₅; Cases 1–3 mark where a new lifespan l can fall).

The challenge is that, as new data keeps arriving, the curve changes; hence, we need to update the histogram accordingly to make sure that it always approximates the curve well. Let x_t ≜ {x_1^(t), x_2^(t), . . .} index the set of active instances at time t, where each index x_i^(t) ≥ 1.² In what follows, we describe the x_t updating method, i.e., HISTAPPROX, which guarantees that the maintained histogram satisfies our requirement.

Algorithm Description. HISTAPPROX consists of two steps: (1) updating indices; (2) removing redundant indices.

Updating Indices. The algorithm starts with an empty index set, i.e., x₁ = ∅. At time t, consider a set of newly arrived elements V_l^(t) with lifespan l. These elements increase the curve before l (because the data V_l^(t) is fed to {A_i^(t)}_{i≤l}; see Fig. 2). There are three cases based on the position of l, as illustrated in Fig. 3.

Case 1. If l ∈ x_t, we simply feed V_l^(t) to {A_i^(t)}_{i∈x_t ∧ i≤l}.

Case 2.
If l ∉ x_t and l has no successor in x_t, we create a new instance A_l^(t) and feed V_l^(t) to {A_i^(t)}_{i∈x_t ∧ i≤l}.

Case 3. If l ∉ x_t and l has a successor l₂ ∈ x_t, we let A_l^(t) be a copy of A_{l₂}^(t) and then feed V_l^(t) to {A_i^(t)}_{i∈x_t ∧ i≤l}. Note that A_l^(t) needs to process all data with lifespan ≥ l at time t. Because A_{l₂}^(t) has processed all data with lifespan ≥ l₂, we still need to feed A_l^(t) with the historical data whose lifespans lie in [l, l₂). This is the reason we need S_t to be stored in RAM.

The above scheme guarantees that each A_l^(t), l ∈ x_t, has processed all the data with lifespan ≥ l at time t. The detailed pseudo-code is given in procedure Process of Alg. 2.

Removing Redundant Indices. Intuitively, if the outputs of two instances are close to each other, it is not necessary to keep both of them. We need the following definition to quantify redundancy.

Definition 1 (ϵ-redundancy). At time t, consider two instances A_i^(t) and A_l^(t) with i < l. We say A_l^(t) is ϵ-redundant if there exists j > l such that g_t(j) ≥ (1 − ϵ) g_t(i).

The above definition simply states that, since A_i^(t) and A_j^(t) are already close to each other, the instances between them are redundant and can be removed.

²Superscript t will be omitted if time t is clear from context.

Algorithm 2: HISTAPPROX

    Input: An IDS of data elements arriving over time
    Output: A subset S_t at any time t
    1  x₁ ← ∅;
    2  for t = 1, 2, . . . do                          // V_t = ∪_l V_l^(t)
    3      foreach V_l^(t) ≠ ∅ do Process(V_l^(t));    // data update
    4      S_t ← output of A_{x₁}^(t);
    5      if x₁ = 1 then Kill A_1^(t), x_t ← x_t \ {1};   // time update
    6      for i = 1, . . . , |x_t| do
    7          A_{x_i−1}^(t+1) ← A_{x_i}^(t), x_i^(t+1) ← x_i^(t) − 1;
    8  Procedure Process(V_l^(t))
    9      if l ∉ x_t then
    10         if l has no successor in x_t then       // Case 2 in Fig. 3
    11             A_l^(t) ← new instance;
    12         else                                    // let l₂ denote the successor of l
    13             A_l^(t) ← a copy of A_{l₂}^(t);     // Case 3 in Fig. 3
    14             Feed A_l^(t) with historical data elements s.t.
               their lifespans ∈ [l, l₂);
    15         x_t ← x_t ∪ {l};
    16     foreach i ∈ x_t and i ≤ l do Feed A_i^(t) with V_l^(t);
    17     Reduce_Redundancy();
    18 Procedure Reduce_Redundancy()
    19     foreach i ∈ x_t do
    20         Find the largest j > i in x_t s.t. g_t(j) ≥ (1 − ϵ) g_t(i);
    21         Delete each index l ∈ x_t s.t. i < l < j;

Suppose an index l₀ is deleted as redundant at time t and, at some later time t′ > t, data with lifespan l arrives and A_l^(t′) is created. The data at l₀ becomes the unprocessed historical data. We thus say that unprocessed historical data is caused by the deletion of redundant instances at previous times.

Protecting non-redundant instances. To further ensure the solution quality of HISTSTREAMING, we introduce another heuristic that protects non-redundant instances. Because g_t(l) is unknown, to avoid removing instances that are actually not redundant, we give each instance A_l^(t) an amount of value, denoted by δ_l, as compensation for the historical data it has not processed; i.e., g_t(l) may be as large as ĝ_t(l) + δ_l. This allows us to represent g_t(l) by an interval [g̲_t(l), ḡ_t(l)], where g̲_t(l) ≜ ĝ_t(l) and ḡ_t(l) ≜ ĝ_t(l) + δ_l. As g̲_t(j) ≥ (1 − ϵ) ḡ_t(i) implies g_t(j) ≥ (1 − ϵ) g_t(i), the condition in Line 20 of HISTAPPROX is replaced by g̲_t(j) ≥ (1 − ϵ) ḡ_t(i).

We want δ_l to be related to the amount of historical data that A_l^(t) does not process. Recall the example in Fig. 4: unprocessed historical data is always fed to l's predecessor instance whenever redundant instances are removed in the interval to which l belongs. Also notice that g_{t′}(l₂) ≥ (1 − ϵ) g_{t′}(l₁) holds after the removal of redundant instances. Hence, the contribution of the unprocessed historical data can be estimated to be at most ϵ g_{t′}(l₁). In general, if some redundant indices are removed in an interval (i, j) at time t, we set δ_l = ϵ g_t(i) for each index l that is later created in the interval (i, j).

Algorithm Description. We only need to slightly modify Process and Reduce_Redundancy (see Alg. 3).

Remark. HISTSTREAMING uses heuristics to further improve the efficiency of HISTAPPROX and no longer needs to store S_t in memory.
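The protected redundancy check can be sketched as follows. All names are hypothetical: `g_hat[l]` stands for the observed value ĝ_t(l), `delta[l]` for the compensation δ_l, and an index between i and j is dropped only when even the pessimistic value at j is within (1 − ϵ) of the optimistic value at i.

```python
def reduce_redundancy(xs, g_hat, delta, eps):
    """Conservative redundancy removal: drop indices strictly inside (i, j)
    only when g_hat[j] >= (1 - eps) * (g_hat[i] + delta[i]),
    i.e., lower bound at j vs. upper bound at i."""
    xs = sorted(xs)
    keep = set(xs)
    for a, i in enumerate(xs):
        if i not in keep:
            continue
        # largest j > i that is provably close to i despite the uncertainty
        close = [j for j in xs[a + 1:]
                 if g_hat[j] >= (1 - eps) * (g_hat[i] + delta[i])]
        if close:
            j = close[-1]
            for l in xs[a + 1:]:
                if l < j:
                    keep.discard(l)   # indices strictly inside (i, j) are redundant
    return sorted(keep)

g_hat = {1: 10.0, 3: 9.8, 5: 9.7, 8: 2.0}

# With no compensation, index 3 is squeezed between 1 and 5 and removed.
print(reduce_redundancy([1, 3, 5, 8], g_hat, {1: 0, 3: 0, 5: 0, 8: 0}, 0.05))
# -> [1, 5, 8]

# With compensation on index 1, its optimistic value rises, so nothing
# inside (1, 5) can be proven redundant any more: index 3 is protected.
print(reduce_redundancy([1, 3, 5, 8], g_hat, {1: 1.0, 3: 0, 5: 0, 8: 0}, 0.05))
# -> [1, 3, 5, 8]
```

The second call shows the protection heuristic at work: a positive δ₁ widens instance 1's value interval, making the deletion test fail conservatively.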
In experiments, we observe that HISTSTREAMING finds high-quality solutions.

## Experiments

In this section, we construct several maximum coverage problems to evaluate the performance of our methods. We use real-world and publicly available datasets. Note that the optimization problems defined on these datasets may seem simplistic; our main purpose is to validate the performance of the proposed algorithms, and hence we keep the problem settings as simple and clear as possible.

DBLP. We construct a representative-author selection problem on the DBLP dataset (DBLP 2018), which records the

Algorithm 3: HISTSTREAMING

    1  Procedure Process(V_l^(t))
    2      if l ∉ x_t then
    4          // If l has a successor l₂
    5          A_l^(t) ← a copy of A_{l₂}^(t);
    6          Find i, j ∈ x_t s.t. l ∈ (i, j) and δ_ij is recorded, then let δ_l ← δ_ij;
    8  Procedure Reduce_Redundancy()
    9      foreach i ∈ x_t do
    10         Find the largest j > i in x_t s.t. g̲_t(j) ≥ (1 − ϵ) ḡ_t(i);
    11         Delete each index l ∈ x_t s.t. i < l < j;
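To make the experimental setup concrete, here is a hypothetical miniature of a maximum-coverage instance in the spirit of the representative-author selection task, together with the batch GREEDY baseline the paper compares against. The author-to-venue mapping is invented for illustration and is not the paper's actual objective.

```python
# Hypothetical representative-author selection as maximum coverage:
# pick k authors maximizing the number of distinct venues they cover.

def coverage_utility(authors, covers):
    """f(S) = number of distinct venues covered by the selected authors."""
    covered = set()
    for a in authors:
        covered |= covers[a]
    return len(covered)

def greedy(covers, k):
    """Batch GREEDY baseline: (1 - 1/e)-approximate for monotone submodular f."""
    S = set()
    for _ in range(k):
        # pick the author with the largest marginal gain w.r.t. S
        best = max((a for a in covers if a not in S),
                   key=lambda a: coverage_utility(S | {a}, covers)
                               - coverage_utility(S, covers))
        S.add(best)
    return S

covers = {"alice": {"KDD", "ICML"}, "bob": {"KDD"}, "carol": {"AAAI", "NeurIPS"}}
S = greedy(covers, k=2)
print(coverage_utility(S, covers))  # 4
```

GREEDY picks an author covering two new venues in each round, so the two selected authors cover all four venues; re-running this on every update of S_t is the naive baseline that the streaming algorithms are measured against.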