# Learning Time-Varying Coverage Functions

Nan Du¹, Yingyu Liang², Maria-Florina Balcan³, Le Song¹

¹College of Computing, Georgia Institute of Technology
²Department of Computer Science, Princeton University
³School of Computer Science, Carnegie Mellon University

dunan@gatech.edu, yingyul@cs.princeton.edu, ninamf@cs.cmu.edu, lsong@cc.gatech.edu

**Abstract.** Coverage functions are an important class of discrete functions that capture the law of diminishing returns arising naturally in applications such as social network analysis, machine learning, and algorithmic game theory. In this paper, we propose the new problem of learning time-varying coverage functions, and develop a novel parametrization of these functions using random features. Based on the connection between time-varying coverage functions and counting processes, we also propose an efficient parameter learning algorithm based on likelihood maximization, and provide a sample complexity analysis. We apply our algorithm to the influence function estimation problem in information diffusion over social networks, and show that, with few assumptions about the diffusion process, our algorithm estimates influence significantly more accurately than existing approaches on both synthetic and real-world data.

## 1 Introduction

Coverage functions are a special class of the more general submodular functions, which play an important role in combinatorial optimization with many interesting applications in social network analysis [1], machine learning [2], and economics and algorithmic game theory [3]. A particularly important example of a coverage function in practice is the influence function of users in information diffusion modeling [1]: news spreads across social networks by word of mouth, and a set of influential sources can collectively trigger a large number of follow-ups. Another example is the valuation function of customers in economics and game theory [3]: customers are thought to have certain requirements, and the items being bundled and offered fulfill certain subsets of these demands.

Theoretically, it is usually assumed that users' influence or customers' valuations are known in advance as an oracle. In practice, however, these functions must be learned. For example, given past traces of information spreading in social networks, a social platform host would like to estimate how many follow-ups a set of users can trigger. Or, given past data of customer reactions to different bundles, a retailer would like to estimate how likely customers are to respond to new packages of goods. Learning such combinatorial functions has attracted many recent research efforts from both the theoretical and practical sides (e.g., [4, 5, 6, 7, 8]), many of which show that coverage functions can be learned from just a polynomial number of samples.

However, prior work has largely ignored an important dynamic aspect of coverage functions. For instance, information spreading is a dynamic process in social networks, and the number of follow-ups of a fixed set of sources can increase as the observation time increases. A bundle of items or features offered to customers may trigger a sequence of customer actions over time. These real-world problems inspire and motivate us to consider a novel time-varying coverage function, $f(S, t)$, which is a coverage function of the set $S$ when we fix a time $t$, and a continuous monotonic function of time $t$ when we fix a set $S$.
While learning time-varying combinatorial structures has been explored in the graphical model setting (e.g., [9, 10]), learning time-varying coverage functions has, as far as we are aware, not been addressed in the literature. Furthermore, we are interested in estimating the entire function of $t$, rather than treating time as a discrete index and learning the function values at a small number of discrete points. From this perspective, our formulation generalizes the most recent work [8], with even fewer assumptions about the data used to learn the model.

Generally, we assume that the historical data are provided as pairs of a set and a collection of timestamps at which events caused by the set occur. Such a collection of temporal events associated with a particular set $S_i$ can be modeled naturally by a counting process $N_i(t),\, t \geq 0$, a stochastic process that takes nonnegative integer values and is nondecreasing over time [11]. For instance, in the information diffusion setting of online social networks, given a set of early adopters of some new product, $N_i(t)$ models the time sequence of all triggered events among the followers, where each jump in the process records the timing $t_{ij}$ of an action. In the economics and game theory setting, the counting process $N_i(t)$ records the number of actions a customer has taken over time, given a particular bundled offer.

This raises an interesting question: how can we estimate the time-varying coverage function from the angle of counting processes? We propose a novel formulation that builds a connection between the two by modeling the cumulative intensity function of a counting process as a time-varying coverage function. The key idea is to parametrize the intensity function as a weighted combination of random kernel functions. We then develop an efficient learning algorithm, TCOVERAGELEARNER, to estimate the parameters of the function using a maximum likelihood approach. We show that our algorithm can provably learn the time-varying coverage function using only a polynomial number of samples. Finally, we validate TCOVERAGELEARNER on both influence estimation and maximization problems using cascade data from information diffusion, and show that our method, with little prior knowledge about the dynamics of the actual underlying diffusion processes, performs significantly better than the alternatives.

## 2 Time-Varying Coverage Function

We first give a formal definition of the time-varying coverage function, and then explain its additional properties in detail.

**Definition.** Let $U$ be a (potentially uncountable) domain. We endow $U$ with some $\sigma$-algebra $\mathcal{A}$ and denote a probability distribution on $U$ by $\mathbb{P}$. A coverage function is a combinatorial function over a finite set $V$ of items, defined as

$$ f(S) := Z \cdot \mathbb{P}\Big(\bigcup\nolimits_{s \in S} U_s\Big), \quad \text{for all } S \in 2^V, \tag{1} $$

where $U_s \subseteq U$ is the subset of the domain $U$ covered by item $s \in V$, and $Z$ is a normalization constant. For time-varying coverage functions, we let the subset $U_s$ grow monotonically over time, that is,

$$ U_s(t) \subseteq U_s(\tau), \quad \text{for all } t \leq \tau \text{ and } s \in V, \tag{2} $$

which results in a combinatorial temporal function

$$ f(S, t) = Z \cdot \mathbb{P}\Big(\bigcup\nolimits_{s \in S} U_s(t)\Big), \quad \text{for all } S \in 2^V. \tag{3} $$

In this paper, we assume that $f(S, t)$ is smooth and continuous, and that its first-order derivative with respect to time, $f'(S, t)$, is also smooth and continuous.
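To make the definition concrete, the following minimal Python sketch instantiates a time-varying coverage function on a small finite universe. The universe, ground set, change points, and the uniform choice of $\mathbb{P}$ are illustrative assumptions, not quantities from the paper; the sketch simply checks that $f(S, t)$ as defined in (3) is monotone in $t$ for a fixed $S$.

```python
# A minimal sketch of a time-varying coverage function on a finite
# universe, following Eq. (3). The toy data below are illustrative
# assumptions, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
U = np.arange(100)                 # finite universe standing in for (U, P)
V = list(range(5))                 # ground set of items/sources
Z = 1.0                            # normalization constant

# For each item s and element u, draw the change point tau[s, u]: the
# time at which U_s(t) starts covering u (np.inf = never covered).
tau = rng.exponential(scale=5.0, size=(len(V), len(U)))
tau[rng.random(tau.shape) > 0.3] = np.inf   # each item eventually covers ~30% of U

def f(S, t):
    """Time-varying coverage f(S, t) = Z * P(union of U_s(t), s in S)."""
    covered = np.zeros(len(U), dtype=bool)
    for s in S:
        covered |= tau[s] <= t     # u is in U_s(t) iff its change point has passed
    return Z * covered.mean()      # P is uniform on the toy universe

S = {0, 2}
assert f(S, 1.0) <= f(S, 5.0) <= f(S, 50.0)   # monotone in t for fixed S
print([round(f(S, t), 3) for t in (1.0, 5.0, 50.0)])
```

Representing each element $u$ by its vector of change points, as the array `tau` does here, is exactly the device used in the representation that follows.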
**Representation.** We now show that a time-varying coverage function, $f(S, t)$, can be represented as an expectation over random functions based on multidimensional step basis functions.

Since $U_s(t)$ varies over time, we can associate each $u \in U$ with a $|V|$-dimensional vector $\tau_u$ of change points. In particular, the $s$-th coordinate of $\tau_u$ records the time at which item $s$ covers $u$. Let $\tau$ be the random variable obtained by sampling $u$ according to $\mathbb{P}$ and setting $\tau = \tau_u$. Note that given all the $\tau_u$ we can compute $f(S, t)$; we now claim that the distribution of $\tau$ is sufficient.

We first introduce some notation. Based on $\tau_u$, we define a $|V|$-dimensional step function $r_u(t) : \mathbb{R}_+ \mapsto \{0, 1\}^{|V|}$, where the $s$-th dimension of $r_u(t)$ is 1 if $u$ is covered by the set $U_s(t)$ at time $t$, and 0 otherwise. To emphasize the dependence of the function $r_u(t)$ on $\tau_u$, we will also write $r_u(t)$ as $r_u(t|\tau_u)$. We denote the indicator vector of a set $S$ by $\chi_S \in \{0, 1\}^{|V|}$, where the $s$-th dimension of $\chi_S$ is 1 if $s \in S$, and 0 otherwise. Then $u \in U$ is covered by $\bigcup_{s \in S} U_s(t)$ at time $t$ if and only if $\chi_S^\top r_u(t) \geq 1$.

**Lemma 1.** There exists a distribution $Q(\tau)$ over the vector of change points $\tau$, such that the time-varying coverage function can be represented as

$$ f(S, t) = Z \cdot \mathbb{E}_{\tau \sim Q(\tau)}\big[\phi\big(\chi_S^\top r(t|\tau)\big)\big], \tag{4} $$

where $\phi(x) := \min\{x, 1\}$, and $r(t|\tau)$ is a multidimensional step function parameterized by $\tau$.

*Proof.* Let $U_S := \bigcup_{s \in S} U_s(t)$. By definition (3), we have the integral representation

$$ f(S, t) = Z \int_U \mathbb{I}\{u \in U_S\}\, d\mathbb{P}(u) = Z \int_U \phi\big(\chi_S^\top r_u(t)\big)\, d\mathbb{P}(u) = Z \cdot \mathbb{E}_{u \sim \mathbb{P}(u)}\big[\phi\big(\chi_S^\top r_u(t)\big)\big]. $$

We can define the set of $u$ having the same $\tau$ as $U_\tau := \{u \in U \mid \tau_u = \tau\}$ and define a distribution over $\tau$ as $dQ(\tau) := \int_{U_\tau} d\mathbb{P}(u)$. Then the integral representation of $f(S, t)$ can be rewritten as

$$ Z \cdot \mathbb{E}_{u \sim \mathbb{P}(u)}\big[\phi\big(\chi_S^\top r_u(t)\big)\big] = Z \cdot \mathbb{E}_{\tau \sim Q(\tau)}\big[\phi\big(\chi_S^\top r(t|\tau)\big)\big], $$

which proves the lemma. ∎

## 3 Model for Observations

In general, we assume that the input data are provided in the form of pairs $(S_i, N_i(t))$, where $S_i$ is a set and $N_i(t)$ is a counting process in which each jump records the timing of an event. We first give a brief overview of counting processes [11] and then motivate our model in detail.

**Counting Process.** Formally, a counting process $\{N(t), t \geq 0\}$ is any nonnegative, integer-valued stochastic process such that $N(t') \geq N(t)$ whenever $t' \geq t$ and $N(0) = 0$. The most common use of a counting process is to count the number of occurrences of temporal events along time, so the index set is usually taken to be the nonnegative real numbers $\mathbb{R}_+$.

A counting process is a submartingale: $\mathbb{E}[N(t) \mid \mathcal{H}_{t'}] \geq N(t')$ for all $t > t'$, where $\mathcal{H}_{t'}$ denotes the history up to time $t'$. By the Doob-Meyer theorem [11], $N(t)$ has the unique decomposition

$$ N(t) = \Lambda(t) + M(t), \tag{5} $$

where $\Lambda(t)$ is a nondecreasing predictable process called the compensator (or cumulative intensity), and $M(t)$ is a mean-zero martingale. Since $\mathbb{E}[dM(t) \mid \mathcal{H}_{t^-}] = 0$, where $dM(t)$ is the increment of $M(t)$ over a small time interval $[t, t + dt)$ and $\mathcal{H}_{t^-}$ is the history up to just before time $t$,

$$ \mathbb{E}[dN(t) \mid \mathcal{H}_{t^-}] = d\Lambda(t) := a(t)\, dt, \tag{6} $$

where $a(t)$ is called the intensity of the counting process.

**Model formulation.** We assume that the cumulative intensity of the counting process is modeled by a time-varying coverage function, i.e., the observation pair $(S_i, N_i(t))$ is generated by

$$ N_i(t) = f(S_i, t) + M_i(t) \tag{7} $$

in the time window $[0, T]$ for some $T > 0$, with $df(S, t) = a(S, t)\, dt$. In other words, the time-varying coverage function controls the propensity of events occurring over time.
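Equation (7) only requires $N_i(t)$ to be a counting process whose compensator is $f(S_i, t)$. The simplest process with a prescribed compensator is an inhomogeneous Poisson process, which the sketch below simulates by Ogata-style thinning; the particular smooth form chosen for $f$ and all constants are assumptions for illustration, not the paper's model.

```python
# A minimal simulation of the observation model in Eq. (7): a counting
# process N_i(t) whose compensator is f(S_i, t). The paper only assumes
# the martingale decomposition; an inhomogeneous Poisson process is the
# simplest process with a prescribed compensator, so we use it here.
import numpy as np

rng = np.random.default_rng(1)
T = 10.0

def f(t, scale=20.0, tau0=3.0):
    """Toy cumulative intensity: smooth, monotone, bounded (set S fixed)."""
    return scale * (1.0 - np.exp(-t / tau0))

def intensity(t, scale=20.0, tau0=3.0):
    """a(S, t) = df(S, t)/dt for the toy f above."""
    return (scale / tau0) * np.exp(-t / tau0)

def simulate_counting_process(a, a_max, T):
    """Ogata thinning: accept candidates from a rate-a_max Poisson
    process with probability a(t)/a_max."""
    events, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / a_max)
        if t > T:
            return np.array(events)
        if rng.random() < a(t) / a_max:
            events.append(t)

# a(t) is decreasing here, so its maximum on [0, T] is at t = 0.
events = simulate_counting_process(intensity, intensity(0.0), T)
print(len(events), f(T))   # E[N(T)] = f(S, T), so these should be close
```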
Specifically, for a fixed set $S_i$, the cumulative number of observed events grows as time $t$ increases, because $f(S_i, t)$ is a continuous monotonic function of time; for a given time $t$, as the set $S_i$ changes to another set $S_j$, the amount of coverage over the domain $U$ may change, and hence can result in a different cumulative intensity. This abstract model maps onto real-world applications. In the information diffusion context, for a fixed set of sources $S_i$, the number of influenced nodes in the social network tends to increase with time; for a given time $t$, if we change the sources to $S_j$, the number of influenced nodes may differ depending on how influential the sources are. In the economics and game theory context, for a fixed bundle of offers $S_i$, as time $t$ increases, the merchant becomes more likely to observe customers' actions in response to the offers; even at the same time $t$, different bundles of offers $S_i$ and $S_j$ may have very different ability to drive customers' actions.

Compared to a regression model $y_i = g(S_i) + \epsilon_i$ with i.i.d. input data $(S_i, y_i)$, our model outputs a special random function over time: a counting process $N_i(t)$ whose noise is a zero-mean martingale $M_i(t)$. In contrast to functional regression models, our model exploits much richer structure in the problem. For instance, the random function representation of the last section can be used to parametrize the model. This special structure of the counting process allows us to estimate the parameters of our model efficiently by maximum likelihood, and the martingale noise enables us to use exponential concentration inequalities in analyzing our algorithm.

## 4 Parametrization

Under the following two mild assumptions, we show how to parametrize the intensity function as a weighted combination of random kernel functions, learn the parameters by maximum likelihood estimation, and eventually derive a sample complexity bound.

(A1) $a(S, t)$ is smooth and bounded on $[0, T]$: $0 < a_{\min} \leq a \leq a_{\max} < \infty$, and $\ddot{a} := d^2 a / dt^2$ is absolutely continuous with $\int \ddot{a}(t)\, dt < \infty$.

(A2) There is a known distribution $Q'(\tau)$ and a constant $C$ with $Q'(\tau)/C \leq Q(\tau) \leq C\, Q'(\tau)$.

**Kernel Smoothing.** To facilitate our finite-dimensional parameterization, we first convolve the intensity function with $K(t) = k(t/\sigma)/\sigma$, where $\sigma$ is the bandwidth parameter and $k$ is a kernel function (such as the Gaussian RBF kernel $k(t) = e^{-t^2/2}/\sqrt{2\pi}$) satisfying

$$ 0 \leq k(t) \leq \kappa_{\max}, \quad \int k(t)\, dt = 1, \quad \int t\, k(t)\, dt = 0, \quad \sigma_k^2 := \int t^2 k(t)\, dt < \infty. \tag{8} $$

The convolution results in a smoothed intensity $a_K(S, t) = K(t) * (df(S, t)/dt) = d(K(t) * \Lambda(S, t))/dt$. By the properties of convolution, and by exchanging derivative and integral, we have

$$
\begin{aligned}
a_K(S, t) &= d\Big(Z\, \mathbb{E}_{\tau \sim Q(\tau)}\big[K(t) * \phi(\chi_S^\top r(t|\tau))\big]\Big)/dt && \text{by definition of } f(\cdot) \\
&= Z\, \mathbb{E}_{\tau \sim Q(\tau)}\Big[d\big(K(t) * \phi(\chi_S^\top r(t|\tau))\big)/dt\Big] && \text{exchange derivative and integral} \\
&= Z\, \mathbb{E}_{\tau \sim Q(\tau)}\big[K(t) * \delta(t - t(S, \tau))\big] && \text{by properties of convolution and } \phi(\cdot) \\
&= Z\, \mathbb{E}_{\tau \sim Q(\tau)}\big[K(t - t(S, \tau))\big] && \text{by definition of } \delta(\cdot)
\end{aligned}
$$

where $t(S, \tau)$ is the time at which the function $\phi(\chi_S^\top r(t|\tau))$ jumps from 0 to 1. If we choose a sufficiently small kernel bandwidth, $a_K$ incurs only a small bias relative to $a$. However, the smoothed intensity still involves an infinite number of parameters, due to the unknown distribution $Q(\tau)$. To address this problem, we design the following random approximation with a finite number of parameters.
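As a numerical sanity check on the derivation above, the sketch below evaluates the smoothed intensity $a_K(S, t) = Z\, \mathbb{E}_\tau[K(t - t(S, \tau))]$ as a Monte Carlo average of Gaussian kernel bumps centered at sampled jump times, and verifies that integrating it over $[0, T]$ approximately recovers the coverage value. The sampled jump times are assumptions standing in for draws of $t(S, \tau)$ with $\tau \sim Q$.

```python
# A small numerical check of the kernel-smoothing identity: the
# smoothed intensity is an average of kernel bumps centered at the
# jump times t(S, tau). The sampled jump times are illustrative.
import numpy as np

rng = np.random.default_rng(2)
sigma, Z = 0.5, 1.0
jump_times = rng.uniform(0.0, 10.0, size=1000)   # draws of t(S, tau), tau ~ Q

def K(t):
    """Scaled Gaussian RBF kernel K(t) = k(t/sigma)/sigma, Eq. (8)."""
    return np.exp(-(t / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))

def a_K(t):
    """Smoothed intensity a_K(S, t) = Z * E_tau[K(t - t(S, tau))]."""
    return Z * K(t - jump_times).mean()

# Integrating a_K from 0 to T approximately recovers the smoothed
# coverage f(S, T), here the fraction of jump times occurring by T
# (up to a small boundary effect from kernel mass outside [0, T]).
grid = np.linspace(0.0, 10.0, 2001)
approx_f = np.trapz([a_K(t) for t in grid], grid)
print(approx_f, np.mean(jump_times <= 10.0))   # both close to 1.0
```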
**Random Function Approximation.** The key idea is to sample a collection of $W$ random change points $\tau$ from a known distribution $Q'(\tau)$, which can be different from $Q(\tau)$. If $Q'(\tau)$ is not very far away from $Q(\tau)$, the random approximation will be close to $a_K$, and thus close to $a$. More specifically, we denote the space of weighted combinations of $W$ random kernel functions by

$$ \mathcal{A} := \left\{ a^K_{\mathbf{w}}(S, t) = \sum_{i=1}^{W} w_i\, K(t - t(S, \tau_i)) : \mathbf{w} \geq 0,\ \|\mathbf{w}\|_1 \leq Z \right\}, \quad \{\tau_i\} \overset{\text{i.i.d.}}{\sim} Q'(\tau). \tag{9} $$

**Lemma 2.** If $W = O(Z^2/(\epsilon\sigma)^2)$, then with probability $\geq 1 - \delta$, there exists an $\tilde{a} \in \mathcal{A}$ such that

$$ \mathbb{E}_S \mathbb{E}_t \big(a(S, t) - \tilde{a}(S, t)\big)^2 := \mathbb{E}_{S \sim \mathbb{P}(S)} \int_0^T \big(a(S, t) - \tilde{a}(S, t)\big)^2\, dt / T = O(\epsilon^2 + \sigma^4). $$

The lemma thus suggests setting the kernel bandwidth $\sigma = O(\sqrt{\epsilon})$ to obtain an $O(\epsilon^2)$ approximation error.

## 5 Learning Algorithm

We develop a learning algorithm, referred to as TCOVERAGELEARNER, that estimates the parameters of $a^K_{\mathbf{w}}(S, t)$ by maximizing the joint likelihood of all observed events using convex optimization techniques, as follows.

**Maximum Likelihood Estimation.** Instead of directly estimating the time-varying coverage function, which is the cumulative intensity function of the counting process, we estimate the intensity function $a(S, t) = \partial\Lambda(S, t)/\partial t$. Given $m$ i.i.d. counting processes $\mathcal{D}^m := \{(S_1, N_1(t)), \ldots, (S_m, N_m(t))\}$ observed up to time $T$, the log-likelihood of the dataset is [11]

$$ \ell(\mathcal{D}^m) = \sum_{i=1}^{m} \left( \int_0^T \log a(S_i, t)\, dN_i(t) - \int_0^T a(S_i, t)\, dt \right). \tag{10} $$

Maximizing the log-likelihood with respect to the intensity function $a(S, t)$ then gives us the estimate $\hat{a}(S, t)$. The $W$-term random kernel function approximation reduces a function optimization problem to a finite-dimensional optimization problem, while incurring only a small bias in the estimated function.

**Algorithm 1** TCOVERAGELEARNER
- **Input:** $\{(S_i, N_i(t))\}$, $i = 1, \ldots, m$
- Sample $W$ random features $\tau_1, \ldots, \tau_W$ from $Q'(\tau)$;
- Compute $\{t(S_i, \tau_w)\}$, $\{\mathbf{g}_i\}$, $\{\mathbf{k}(t_{ij})\}$ for $i \in \{1, \ldots, m\}$, $w = 1, \ldots, W$, $t_{ij} < T$;
- Initialize $\mathbf{w}_0 \in \Omega = \{\mathbf{w} : \mathbf{w} \geq 0, \|\mathbf{w}\|_1 \leq 1\}$;
- Apply the projected quasi-Newton algorithm [12] to solve (11);
- **Output:** $a^K_{\mathbf{w}}(S, t) = \sum_{i=1}^{W} w_i K(t - t(S, \tau_i))$

**Convex Optimization.** By plugging the parametrization $a^K_{\mathbf{w}}(S, t)$ of (9) into the log-likelihood (10), we formulate the optimization problem as

$$ \min_{\mathbf{w} \in \Omega}\ \ell(\mathbf{w}) := \sum_{i=1}^{m} \left( \mathbf{w}^\top \mathbf{g}_i - \sum_{t_{ij} < T} \log\big(\mathbf{w}^\top \mathbf{k}(t_{ij})\big) \right), \tag{11} $$

where $(\mathbf{g}_i)_w := \int_0^T K(t - t(S_i, \tau_w))\, dt$ and $(\mathbf{k}(t_{ij}))_w := K(t_{ij} - t(S_i, \tau_w))$ are the quantities precomputed in Algorithm 1.
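The following compact sketch mirrors Algorithm 1 and objective (11): it precomputes the integrated kernel features $\mathbf{g}_i$ and the event-time features $\mathbf{k}(t_{ij})$, then minimizes the negative log-likelihood over $\Omega$. Plain projected gradient descent stands in for the projected quasi-Newton method of [12], and the synthetic data, step size, and iteration count are illustrative assumptions.

```python
# A compact sketch of TCOVERAGELEARNER (Algorithm 1 + objective (11)).
# Projected gradient descent substitutes for the projected quasi-Newton
# method of [12]; the synthetic data below are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
T, sigma, W, m = 10.0, 0.5, 50, 20

# Random features: jump times t(S_i, tau_w); drawn directly here for brevity.
jump = rng.uniform(0.0, T, size=(m, W))            # jump[i, w] = t(S_i, tau_w)
events = [np.sort(rng.uniform(0.0, T, rng.poisson(15))) for _ in range(m)]

# (g_i)_w = int_0^T K(t - t(S_i, tau_w)) dt: a Gaussian CDF difference.
g = norm.cdf((T - jump) / sigma) - norm.cdf(-jump / sigma)       # shape (m, W)
# (k_i)[j, w] = K(t_ij - t(S_i, tau_w))
k = [norm.pdf((t[:, None] - jump[i]) / sigma) / sigma for i, t in enumerate(events)]

def project(w):
    """Euclidean projection onto Omega = {w >= 0, ||w||_1 <= 1}."""
    w = np.maximum(w, 0.0)
    if w.sum() <= 1.0:
        return w
    u = np.sort(w)[::-1]                            # simplex projection
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(u) + 1) > 0)[0][-1]
    return np.maximum(w - css[rho] / (rho + 1.0), 0.0)

def neg_log_lik(w):
    """Objective (11): sum_i (w.g_i - sum_j log(w.k(t_ij)))."""
    return sum(g_i @ w - np.log(np.clip(k_i @ w, 1e-12, None)).sum()
               for g_i, k_i in zip(g, k))

def grad(w):
    return sum(g_i - k_i.T @ (1.0 / np.clip(k_i @ w, 1e-12, None))
               for g_i, k_i in zip(g, k))

w = project(np.full(W, 1.0 / W))
for it in range(500):
    w = project(w - 1e-4 * grad(w))
print(neg_log_lik(w))
```

Since the objective in (11) is a sum of a linear term and negative logs of linear terms in $\mathbf{w}$, it is convex over the convex feasible set $\Omega$, which is what makes first-order (or quasi-Newton) projection methods appropriate here.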