# Learning Time-Varying Coverage Functions

Nan Du¹, Yingyu Liang², Maria-Florina Balcan³, Le Song¹

¹College of Computing, Georgia Institute of Technology
²Department of Computer Science, Princeton University
³School of Computer Science, Carnegie Mellon University

dunan@gatech.edu, yingyul@cs.princeton.edu, ninamf@cs.cmu.edu, lsong@cc.gatech.edu

**Abstract.** Coverage functions are an important class of discrete functions that capture the law of diminishing returns arising naturally in applications such as social network analysis, machine learning, and algorithmic game theory. In this paper, we propose the new problem of learning time-varying coverage functions, and develop a novel parametrization of these functions using random features. Based on the connection between time-varying coverage functions and counting processes, we also propose an efficient parameter learning algorithm based on likelihood maximization, and provide a sample complexity analysis. We apply our algorithm to the influence function estimation problem in information diffusion over social networks, and show that, with few assumptions about the diffusion process, our algorithm estimates influence significantly more accurately than existing approaches on both synthetic and real-world data.

## 1 Introduction

Coverage functions are a special class of the more general submodular functions, which play an important role in combinatorial optimization with many interesting applications in social network analysis [1], machine learning [2], and economics and algorithmic game theory [3]. A particularly important example of a coverage function in practice is the influence function of users in information diffusion modeling [1]: news spreads across social networks by word of mouth, and a set of influential sources can collectively trigger a large number of follow-ups. Another example is the valuation function of customers in economics and game theory [3]: customers are thought to have certain requirements, and the items being bundled and offered fulfill certain subsets of these demands.

Theoretically, it is usually assumed that users' influence or customers' valuations are known in advance as an oracle. In practice, however, these functions must be learned. For example, given past traces of information spreading in social networks, a social platform host would like to estimate how many follow-ups a set of users can trigger. Or, given past data of customer reactions to different bundles, a retailer would like to estimate how likely customers are to respond to new packages of goods. Learning such combinatorial functions has attracted many recent research efforts from both the theoretical and practical sides (e.g., [4, 5, 6, 7, 8]), many of which show that coverage functions can be learned from just a polynomial number of samples.

However, prior work has largely ignored an important dynamic aspect of coverage functions. For instance, information spreading is a dynamic process in social networks, and the number of follow-ups of a fixed set of sources can increase as the observation time increases. A bundle of items or features offered to customers may trigger a sequence of customer actions over time. These real-world problems inspire and motivate us to consider a novel time-varying coverage function, $f(S, t)$, which is a coverage function of the set $S$ when we fix a time $t$, and a continuous monotonic function of time $t$ when we fix a set $S$.
While learning time-varying combinatorial structures has been explored in the graphical model setting (e.g., [9, 10]), learning time-varying coverage functions has, as far as we are aware, not been addressed in the literature. Furthermore, we are interested in estimating the entire function of $t$, rather than treating time as a discrete index and learning the function values at a small number of discrete points. From this perspective, our formulation generalizes the most recent work [8], with even fewer assumptions about the data used to learn the model.

Generally, we assume that the historical data are provided as pairs of a set and a collection of timestamps at which events caused by the set occur. Such a collection of temporal events associated with a particular set $S_i$ can be modeled naturally by a counting process $N_i(t),\, t \geq 0$, a stochastic process that takes nonnegative integer values and is nondecreasing over time [11]. For instance, in the information diffusion setting of online social networks, given a set of early adopters of some new product, $N_i(t)$ models the time sequence of all triggered events among the followers, where each jump in the process records the timing $t_{ij}$ of an action. In the economics and game theory setting, the counting process $N_i(t)$ records the number of actions a customer has taken over time, given a particular bundled offer.

This raises an interesting question: how can we estimate the time-varying coverage function from the angle of counting processes? We propose a novel formulation that builds a connection between the two by modeling the cumulative intensity function of a counting process as a time-varying coverage function. The key idea is to parametrize the intensity function as a weighted combination of random kernel functions. We then develop an efficient learning algorithm, TCOVERAGELEARNER, to estimate the parameters of the function using a maximum likelihood approach. We show that our algorithm can provably learn the time-varying coverage function using only a polynomial number of samples. Finally, we validate TCOVERAGELEARNER on both influence estimation and maximization problems using cascade data from information diffusion, and show that our method, with little prior knowledge about the dynamics of the actual underlying diffusion processes, performs significantly better than the alternatives.

## 2 Time-Varying Coverage Function

We first give a formal definition of the time-varying coverage function, and then explain its additional properties in detail.

**Definition.** Let $U$ be a (potentially uncountable) domain. We endow $U$ with some $\sigma$-algebra $\mathcal{A}$ and denote a probability distribution on $U$ by $\mathbb{P}$. A coverage function is a combinatorial function over a finite set $V$ of items, defined as

$$ f(S) := Z \cdot \mathbb{P}\Big(\bigcup\nolimits_{s \in S} U_s\Big), \quad \text{for all } S \in 2^V, \tag{1} $$

where $U_s \subseteq U$ is the subset of the domain $U$ covered by item $s \in V$, and $Z$ is a normalization constant. For time-varying coverage functions, we let the subset $U_s$ grow monotonically over time, that is,

$$ U_s(t) \subseteq U_s(\tau), \quad \text{for all } t \leq \tau \text{ and } s \in V, \tag{2} $$

which results in a combinatorial temporal function

$$ f(S, t) = Z \cdot \mathbb{P}\Big(\bigcup\nolimits_{s \in S} U_s(t)\Big), \quad \text{for all } S \in 2^V. \tag{3} $$

In this paper, we assume that $f(S, t)$ is smooth and continuous, and that its first-order derivative with respect to time, $f'(S, t)$, is also smooth and continuous.
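To make the definition concrete, the following minimal Python sketch instantiates a time-varying coverage function on a small finite universe. The universe, ground set, change points, and the uniform choice of $\mathbb{P}$ are illustrative assumptions, not quantities from the paper; the sketch simply checks that $f(S, t)$ as defined in (3) is monotone in $t$ for a fixed $S$.

```python
# A minimal sketch of a time-varying coverage function on a finite
# universe, following Eq. (3). The toy data below are illustrative
# assumptions, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
U = np.arange(100)                 # finite universe standing in for (U, P)
V = list(range(5))                 # ground set of items/sources
Z = 1.0                            # normalization constant

# For each item s and element u, draw the change point tau[s, u]: the
# time at which U_s(t) starts covering u (np.inf = never covered).
tau = rng.exponential(scale=5.0, size=(len(V), len(U)))
tau[rng.random(tau.shape) > 0.3] = np.inf   # each item eventually covers ~30% of U

def f(S, t):
    """Time-varying coverage f(S, t) = Z * P(union of U_s(t), s in S)."""
    covered = np.zeros(len(U), dtype=bool)
    for s in S:
        covered |= tau[s] <= t     # u is in U_s(t) iff its change point has passed
    return Z * covered.mean()      # P is uniform on the toy universe

S = {0, 2}
assert f(S, 1.0) <= f(S, 5.0) <= f(S, 50.0)   # monotone in t for fixed S
print([round(f(S, t), 3) for t in (1.0, 5.0, 50.0)])
```

Representing each element $u$ by its vector of change points, as the array `tau` does here, is exactly the device used in the representation that follows.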
**Representation.** We now show that a time-varying coverage function, $f(S, t)$, can be represented as an expectation over random functions based on multidimensional step basis functions.

Since $U_s(t)$ varies over time, we can associate each $u \in U$ with a $|V|$-dimensional vector $\tau_u$ of change points. In particular, the $s$-th coordinate of $\tau_u$ records the time at which item $s$ covers $u$. Let $\tau$ be the random variable obtained by sampling $u$ according to $\mathbb{P}$ and setting $\tau = \tau_u$. Note that given all the $\tau_u$ we can compute $f(S, t)$; we now claim that the distribution of $\tau$ is sufficient.

We first introduce some notation. Based on $\tau_u$, we define a $|V|$-dimensional step function $r_u(t) : \mathbb{R}_+ \mapsto \{0, 1\}^{|V|}$, where the $s$-th dimension of $r_u(t)$ is 1 if $u$ is covered by the set $U_s(t)$ at time $t$, and 0 otherwise. To emphasize the dependence of the function $r_u(t)$ on $\tau_u$, we will also write $r_u(t)$ as $r_u(t|\tau_u)$. We denote the indicator vector of a set $S$ by $\chi_S \in \{0, 1\}^{|V|}$, where the $s$-th dimension of $\chi_S$ is 1 if $s \in S$, and 0 otherwise. Then $u \in U$ is covered by $\bigcup_{s \in S} U_s(t)$ at time $t$ if and only if $\chi_S^\top r_u(t) \geq 1$.

**Lemma 1.** There exists a distribution $Q(\tau)$ over the vector of change points $\tau$, such that the time-varying coverage function can be represented as

$$ f(S, t) = Z \cdot \mathbb{E}_{\tau \sim Q(\tau)}\big[\phi\big(\chi_S^\top r(t|\tau)\big)\big], \tag{4} $$

where $\phi(x) := \min\{x, 1\}$, and $r(t|\tau)$ is a multidimensional step function parameterized by $\tau$.

*Proof.* Let $U_S := \bigcup_{s \in S} U_s(t)$. By definition (3), we have the integral representation

$$ f(S, t) = Z \int_U \mathbb{I}\{u \in U_S\}\, d\mathbb{P}(u) = Z \int_U \phi\big(\chi_S^\top r_u(t)\big)\, d\mathbb{P}(u) = Z \cdot \mathbb{E}_{u \sim \mathbb{P}(u)}\big[\phi\big(\chi_S^\top r_u(t)\big)\big]. $$

We can define the set of $u$ having the same $\tau$ as $U_\tau := \{u \in U \mid \tau_u = \tau\}$ and define a distribution over $\tau$ as $dQ(\tau) := \int_{U_\tau} d\mathbb{P}(u)$. Then the integral representation of $f(S, t)$ can be rewritten as

$$ Z \cdot \mathbb{E}_{u \sim \mathbb{P}(u)}\big[\phi\big(\chi_S^\top r_u(t)\big)\big] = Z \cdot \mathbb{E}_{\tau \sim Q(\tau)}\big[\phi\big(\chi_S^\top r(t|\tau)\big)\big], $$

which proves the lemma. ∎

## 3 Model for Observations

In general, we assume that the input data are provided in the form of pairs $(S_i, N_i(t))$, where $S_i$ is a set and $N_i(t)$ is a counting process in which each jump records the timing of an event. We first give a brief overview of counting processes [11] and then motivate our model in detail.

**Counting Process.** Formally, a counting process $\{N(t), t \geq 0\}$ is any nonnegative, integer-valued stochastic process such that $N(t') \geq N(t)$ whenever $t' \geq t$ and $N(0) = 0$. The most common use of a counting process is to count the number of occurrences of temporal events along time, so the index set is usually taken to be the nonnegative real numbers $\mathbb{R}_+$.

A counting process is a submartingale: $\mathbb{E}[N(t) \mid \mathcal{H}_{t'}] \geq N(t')$ for all $t > t'$, where $\mathcal{H}_{t'}$ denotes the history up to time $t'$. By the Doob-Meyer theorem [11], $N(t)$ has the unique decomposition

$$ N(t) = \Lambda(t) + M(t), \tag{5} $$

where $\Lambda(t)$ is a nondecreasing predictable process called the compensator (or cumulative intensity), and $M(t)$ is a mean-zero martingale. Since $\mathbb{E}[dM(t) \mid \mathcal{H}_{t^-}] = 0$, where $dM(t)$ is the increment of $M(t)$ over a small time interval $[t, t + dt)$ and $\mathcal{H}_{t^-}$ is the history up to just before time $t$,

$$ \mathbb{E}[dN(t) \mid \mathcal{H}_{t^-}] = d\Lambda(t) := a(t)\, dt, \tag{6} $$

where $a(t)$ is called the intensity of the counting process.

**Model formulation.** We assume that the cumulative intensity of the counting process is modeled by a time-varying coverage function, i.e., the observation pair $(S_i, N_i(t))$ is generated by

$$ N_i(t) = f(S_i, t) + M_i(t) \tag{7} $$

in the time window $[0, T]$ for some $T > 0$, with $df(S, t) = a(S, t)\, dt$. In other words, the time-varying coverage function controls the propensity of events occurring over time.
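Equation (7) only requires $N_i(t)$ to be a counting process whose compensator is $f(S_i, t)$. The simplest process with a prescribed compensator is an inhomogeneous Poisson process, which the sketch below simulates by Ogata-style thinning; the particular smooth form chosen for $f$ and all constants are assumptions for illustration, not the paper's model.

```python
# A minimal simulation of the observation model in Eq. (7): a counting
# process N_i(t) whose compensator is f(S_i, t). The paper only assumes
# the martingale decomposition; an inhomogeneous Poisson process is the
# simplest process with a prescribed compensator, so we use it here.
import numpy as np

rng = np.random.default_rng(1)
T = 10.0

def f(t, scale=20.0, tau0=3.0):
    """Toy cumulative intensity: smooth, monotone, bounded (set S fixed)."""
    return scale * (1.0 - np.exp(-t / tau0))

def intensity(t, scale=20.0, tau0=3.0):
    """a(S, t) = df(S, t)/dt for the toy f above."""
    return (scale / tau0) * np.exp(-t / tau0)

def simulate_counting_process(a, a_max, T):
    """Ogata thinning: accept candidates from a rate-a_max Poisson
    process with probability a(t)/a_max."""
    events, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / a_max)
        if t > T:
            return np.array(events)
        if rng.random() < a(t) / a_max:
            events.append(t)

# a(t) is decreasing here, so its maximum on [0, T] is at t = 0.
events = simulate_counting_process(intensity, intensity(0.0), T)
print(len(events), f(T))   # E[N(T)] = f(S, T), so these should be close
```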
Specifically, for a fixed set $S_i$, the cumulative number of observed events grows as time $t$ increases, because $f(S_i, t)$ is a continuous monotonic function of time; for a given time $t$, as the set $S_i$ changes to another set $S_j$, the amount of coverage over the domain $U$ may change, and hence can result in a different cumulative intensity. This abstract model maps onto real-world applications. In the information diffusion context, for a fixed set of sources $S_i$, the number of influenced nodes in the social network tends to increase with time; for a given time $t$, if we change the sources to $S_j$, the number of influenced nodes may differ depending on how influential the sources are. In the economics and game theory context, for a fixed bundle of offers $S_i$, as time $t$ increases, the merchant becomes more likely to observe customers' actions in response to the offers; even at the same time $t$, different bundles of offers $S_i$ and $S_j$ may have very different ability to drive customers' actions.

Compared to a regression model $y_i = g(S_i) + \epsilon_i$ with i.i.d. input data $(S_i, y_i)$, our model outputs a special random function over time: a counting process $N_i(t)$ whose noise is a zero-mean martingale $M_i(t)$. In contrast to functional regression models, our model exploits much richer structure in the problem. For instance, the random function representation of the last section can be used to parametrize the model. This special structure of the counting process allows us to estimate the parameters of our model efficiently by maximum likelihood, and the martingale noise enables us to use exponential concentration inequalities in analyzing our algorithm.

## 4 Parametrization

Under the following two mild assumptions, we show how to parametrize the intensity function as a weighted combination of random kernel functions, learn the parameters by maximum likelihood estimation, and eventually derive a sample complexity bound.

(A1) $a(S, t)$ is smooth and bounded on $[0, T]$: $0 < a_{\min} \leq a \leq a_{\max} < \infty$, and $\ddot{a} := d^2 a / dt^2$ is absolutely continuous with $\int \ddot{a}(t)\, dt < \infty$.

(A2) There is a known distribution $Q'(\tau)$ and a constant $C$ with $Q'(\tau)/C \leq Q(\tau) \leq C\, Q'(\tau)$.

**Kernel Smoothing.** To facilitate our finite-dimensional parameterization, we first convolve the intensity function with $K(t) = k(t/\sigma)/\sigma$, where $\sigma$ is the bandwidth parameter and $k$ is a kernel function (such as the Gaussian RBF kernel $k(t) = e^{-t^2/2}/\sqrt{2\pi}$) satisfying

$$ 0 \leq k(t) \leq \kappa_{\max}, \quad \int k(t)\, dt = 1, \quad \int t\, k(t)\, dt = 0, \quad \sigma_k^2 := \int t^2 k(t)\, dt < \infty. \tag{8} $$

The convolution results in a smoothed intensity $a_K(S, t) = K(t) * (df(S, t)/dt) = d(K(t) * \Lambda(S, t))/dt$. By the properties of convolution, and by exchanging derivative and integral, we have

$$
\begin{aligned}
a_K(S, t) &= d\Big(Z\, \mathbb{E}_{\tau \sim Q(\tau)}\big[K(t) * \phi(\chi_S^\top r(t|\tau))\big]\Big)/dt && \text{by definition of } f(\cdot) \\
&= Z\, \mathbb{E}_{\tau \sim Q(\tau)}\Big[d\big(K(t) * \phi(\chi_S^\top r(t|\tau))\big)/dt\Big] && \text{exchange derivative and integral} \\
&= Z\, \mathbb{E}_{\tau \sim Q(\tau)}\big[K(t) * \delta(t - t(S, \tau))\big] && \text{by properties of convolution and } \phi(\cdot) \\
&= Z\, \mathbb{E}_{\tau \sim Q(\tau)}\big[K(t - t(S, \tau))\big] && \text{by definition of } \delta(\cdot)
\end{aligned}
$$

where $t(S, \tau)$ is the time at which the function $\phi(\chi_S^\top r(t|\tau))$ jumps from 0 to 1. If we choose a sufficiently small kernel bandwidth, $a_K$ incurs only a small bias relative to $a$. However, the smoothed intensity still involves an infinite number of parameters, due to the unknown distribution $Q(\tau)$. To address this problem, we design the following random approximation with a finite number of parameters.
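As a numerical sanity check on the derivation above, the sketch below evaluates the smoothed intensity $a_K(S, t) = Z\, \mathbb{E}_\tau[K(t - t(S, \tau))]$ as a Monte Carlo average of Gaussian kernel bumps centered at sampled jump times, and verifies that integrating it over $[0, T]$ approximately recovers the coverage value. The sampled jump times are assumptions standing in for draws of $t(S, \tau)$ with $\tau \sim Q$.

```python
# A small numerical check of the kernel-smoothing identity: the
# smoothed intensity is an average of kernel bumps centered at the
# jump times t(S, tau). The sampled jump times are illustrative.
import numpy as np

rng = np.random.default_rng(2)
sigma, Z = 0.5, 1.0
jump_times = rng.uniform(0.0, 10.0, size=1000)   # draws of t(S, tau), tau ~ Q

def K(t):
    """Scaled Gaussian RBF kernel K(t) = k(t/sigma)/sigma, Eq. (8)."""
    return np.exp(-(t / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))

def a_K(t):
    """Smoothed intensity a_K(S, t) = Z * E_tau[K(t - t(S, tau))]."""
    return Z * K(t - jump_times).mean()

# Integrating a_K from 0 to T approximately recovers the smoothed
# coverage f(S, T), here the fraction of jump times occurring by T
# (up to a small boundary effect from kernel mass outside [0, T]).
grid = np.linspace(0.0, 10.0, 2001)
approx_f = np.trapz([a_K(t) for t in grid], grid)
print(approx_f, np.mean(jump_times <= 10.0))   # both close to 1.0
```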
**Random Function Approximation.** The key idea is to sample a collection of $W$ random change points $\tau$ from a known distribution $Q'(\tau)$, which can be different from $Q(\tau)$. If $Q'(\tau)$ is not very far away from $Q(\tau)$, the random approximation will be close to $a_K$, and thus close to $a$. More specifically, we denote the space of weighted combinations of $W$ random kernel functions by

$$ \mathcal{A} := \left\{ a^K_{\mathbf{w}}(S, t) = \sum_{i=1}^{W} w_i\, K(t - t(S, \tau_i)) : \mathbf{w} \geq 0,\ \|\mathbf{w}\|_1 \leq Z \right\}, \quad \{\tau_i\} \overset{\text{i.i.d.}}{\sim} Q'(\tau). \tag{9} $$

**Lemma 2.** If $W = O(Z^2/(\epsilon\sigma)^2)$, then with probability $\geq 1 - \delta$, there exists an $\tilde{a} \in \mathcal{A}$ such that

$$ \mathbb{E}_S \mathbb{E}_t \big(a(S, t) - \tilde{a}(S, t)\big)^2 := \mathbb{E}_{S \sim \mathbb{P}(S)} \int_0^T \big(a(S, t) - \tilde{a}(S, t)\big)^2\, dt / T = O(\epsilon^2 + \sigma^4). $$

The lemma thus suggests setting the kernel bandwidth $\sigma = O(\sqrt{\epsilon})$ to obtain an $O(\epsilon^2)$ approximation error.

## 5 Learning Algorithm

We develop a learning algorithm, referred to as TCOVERAGELEARNER, that estimates the parameters of $a^K_{\mathbf{w}}(S, t)$ by maximizing the joint likelihood of all observed events using convex optimization techniques, as follows.

**Maximum Likelihood Estimation.** Instead of directly estimating the time-varying coverage function, which is the cumulative intensity function of the counting process, we estimate the intensity function $a(S, t) = \partial\Lambda(S, t)/\partial t$. Given $m$ i.i.d. counting processes $\mathcal{D}^m := \{(S_1, N_1(t)), \ldots, (S_m, N_m(t))\}$ observed up to time $T$, the log-likelihood of the dataset is [11]

$$ \ell(\mathcal{D}^m) = \sum_{i=1}^{m} \left( \int_0^T \log a(S_i, t)\, dN_i(t) - \int_0^T a(S_i, t)\, dt \right). \tag{10} $$

Maximizing the log-likelihood with respect to the intensity function $a(S, t)$ then gives us the estimate $\hat{a}(S, t)$. The $W$-term random kernel function approximation reduces a function optimization problem to a finite-dimensional optimization problem, while incurring only a small bias in the estimated function.

**Algorithm 1** TCOVERAGELEARNER
- **Input:** $\{(S_i, N_i(t))\}$, $i = 1, \ldots, m$
- Sample $W$ random features $\tau_1, \ldots, \tau_W$ from $Q'(\tau)$;
- Compute $\{t(S_i, \tau_w)\}$, $\{\mathbf{g}_i\}$, $\{\mathbf{k}(t_{ij})\}$ for $i \in \{1, \ldots, m\}$, $w = 1, \ldots, W$, $t_{ij} < T$;
- Initialize $\mathbf{w}_0 \in \Omega = \{\mathbf{w} : \mathbf{w} \geq 0, \|\mathbf{w}\|_1 \leq 1\}$;
- Apply the projected quasi-Newton algorithm [12] to solve (11);
- **Output:** $a^K_{\mathbf{w}}(S, t) = \sum_{i=1}^{W} w_i K(t - t(S, \tau_i))$

**Convex Optimization.** By plugging the parametrization $a^K_{\mathbf{w}}(S, t)$ of (9) into the log-likelihood (10), we formulate the optimization problem as

$$ \min_{\mathbf{w} \in \Omega}\ \ell(\mathbf{w}) := \sum_{i=1}^{m} \left( \mathbf{w}^\top \mathbf{g}_i - \sum_{t_{ij} < T} \log\big(\mathbf{w}^\top \mathbf{k}(t_{ij})\big) \right), \tag{11} $$

where $(\mathbf{g}_i)_w := \int_0^T K(t - t(S_i, \tau_w))\, dt$ and $(\mathbf{k}(t_{ij}))_w := K(t_{ij} - t(S_i, \tau_w))$ are the quantities precomputed in Algorithm 1.
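The following compact sketch mirrors Algorithm 1 and objective (11): it precomputes the integrated kernel features $\mathbf{g}_i$ and the event-time features $\mathbf{k}(t_{ij})$, then minimizes the negative log-likelihood over $\Omega$. Plain projected gradient descent stands in for the projected quasi-Newton method of [12], and the synthetic data, step size, and iteration count are illustrative assumptions.

```python
# A compact sketch of TCOVERAGELEARNER (Algorithm 1 + objective (11)).
# Projected gradient descent substitutes for the projected quasi-Newton
# method of [12]; the synthetic data below are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
T, sigma, W, m = 10.0, 0.5, 50, 20

# Random features: jump times t(S_i, tau_w); drawn directly here for brevity.
jump = rng.uniform(0.0, T, size=(m, W))            # jump[i, w] = t(S_i, tau_w)
events = [np.sort(rng.uniform(0.0, T, rng.poisson(15))) for _ in range(m)]

# (g_i)_w = int_0^T K(t - t(S_i, tau_w)) dt: a Gaussian CDF difference.
g = norm.cdf((T - jump) / sigma) - norm.cdf(-jump / sigma)       # shape (m, W)
# (k_i)[j, w] = K(t_ij - t(S_i, tau_w))
k = [norm.pdf((t[:, None] - jump[i]) / sigma) / sigma for i, t in enumerate(events)]

def project(w):
    """Euclidean projection onto Omega = {w >= 0, ||w||_1 <= 1}."""
    w = np.maximum(w, 0.0)
    if w.sum() <= 1.0:
        return w
    u = np.sort(w)[::-1]                            # simplex projection
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(u) + 1) > 0)[0][-1]
    return np.maximum(w - css[rho] / (rho + 1.0), 0.0)

def neg_log_lik(w):
    """Objective (11): sum_i (w.g_i - sum_j log(w.k(t_ij)))."""
    return sum(g_i @ w - np.log(np.clip(k_i @ w, 1e-12, None)).sum()
               for g_i, k_i in zip(g, k))

def grad(w):
    return sum(g_i - k_i.T @ (1.0 / np.clip(k_i @ w, 1e-12, None))
               for g_i, k_i in zip(g, k))

w = project(np.full(W, 1.0 / W))
for it in range(500):
    w = project(w - 1e-4 * grad(w))
print(neg_log_lik(w))
```

Since the objective in (11) is a sum of a linear term and negative logs of linear terms in $\mathbf{w}$, it is convex over the convex feasible set $\Omega$, which is what makes first-order (or quasi-Newton) projection methods appropriate here.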