# Structured Prediction of Network Response

Hongyu Su HONGYU.SU@AALTO.FI
Aristides Gionis ARISTIDES.GIONIS@AALTO.FI
Juho Rousu JUHO.ROUSU@AALTO.FI
Helsinki Institute for Information Technology (HIIT), Department of Information and Computer Science, Aalto University, Finland

We introduce the following network response problem: given a complex network and an action, predict the subnetwork that responds to the action, that is, which nodes perform the action and which directed edges relay the action to the adjacent nodes. We approach the problem through max-margin structured learning, in which a compatibility score is learned between the actions and their activated subnetworks. Thus, unlike the most popular influence-network approaches, our method, called SPIN, is context-sensitive: the presence, the direction, and the dynamics of influence depend on the properties of the actions. The inference problems of finding the highest-scoring and the worst margin-violating networks are proven to be NP-hard. To solve them, we present an approximate inference method through a semidefinite programming (SDP) relaxation, as well as a more scalable greedy heuristic algorithm. In our experiments, we demonstrate that taking advantage of the context given by the actions and the network structure leads SPIN to markedly better predictive performance than competing methods.

1. Introduction

With the widespread use and extensive availability of large-scale networks, an increasing amount of research has been devoted to studying the structure and function of networks. In particular, network analysis has been applied to study dynamic phenomena and complex interactions, such as information propagation, opinion formation, adoption of technological innovations, viral marketing, and disease spreading (De Choudhury et al., 2010; Kempe et al., 2003; Watts & Dodds, 2007).

Influence models typically consider actions performed by the network nodes. Examples of such actions include buying a product or (re)posting a news story in one's social network. Often, network nodes perform such actions as a result of influence from neighbouring nodes, and a number of different models have been proposed to quantify influence in a network, most notably the independent-cascade and the linear-threshold models (Kempe et al., 2003). On the other hand, performing an action may also come as a result of an external (out-of-the-network) stimulus, a situation that has also been subject to modeling and analysis (Anagnostopoulos et al., 2008). A typical assumption made by existing models is that influence among nodes depends only on the nodes that perform the action and not on the action itself.

A central question in the study of network influence is to infer the latent structure that governs the influence dynamics. This question can be formulated in different ways. In one case, no underlying network is available (for example, news agencies that do not link to each other) and one asks to infer the hidden network structure, e.g., to discover implicit edges between the network nodes (De Choudhury et al., 2010; Du et al., 2012; Eagle et al., 2009; Gomez-Rodriguez et al., 2010; 2011). However, this problem is an unnecessarily hard one to solve in many applications.
On the other hand, in many applications the network is known (e.g., follower links in Twitter), and the research question is to estimate the hidden variables of the influence model (Goyal et al., 2010; Saito et al., 2008).

The present paper is motivated by the following observation: the influence between two nodes in the network does not depend only on the nodes and their connections, but also on the action under consideration. For example, if u and v represent users in Twitter, v may be influenced by u regarding topics related to science, but not regarding topics related to, say, politics. Thus, in our view, the influence model needs to be context-sensitive.

We thus consider the following network response problem: given an action, predict which nodes in the network will perform it and along which edges the action will spread. We approach the problem via structured output learning that models the activated response network as a directed graph. We learn a function that maps action descriptions to response subnetworks. Given an action, the model predicts the directed subnetwork that is most favourable to performing the action.

2. Preliminaries

We consider a directed network G = (V, E), where the nodes v ∈ V represent entities and the edges e = (u, v) ∈ E represent relationships among entities. As discussed in the introduction, for each edge (u, v) we assume that node v can be influenced by node u. In real applications, some networks are directed (e.g., follower networks), while others are undirected (e.g., friendship networks). For simplicity of exposition, and without loss of generality, we formulate our problem for directed networks; indeed, an undirected edge can be modeled by a pair of directed edges. In our experiments we also consider undirected networks.

In addition to other nodes, we allow the nodes to be influenced by external stimuli, modelled by a root node r, which is connected to all other nodes in the network, namely (r, v) ∈ E for all v ∈ V \ {r}. Conversely, no node can influence r, so (v, r) ∉ E for all v ∈ V \ {r}.

The second ingredient of our model consists of the actions performed by the network nodes. We write A to denote the underlying action space, that is, the set of all possible actions, and we use a to indicate a particular action in A. We assume that actions in A are represented using a feature map φ : A → F_A to an associated inner-product space F_A. For example, F_A can be a vector space of dimension k, where each action a is represented by a k-dimensional vector φ(a). In the social-network application discussed in the introduction, where actions a correspond to news articles posted by users, φ(a) can be the bag-of-words representation of the news article a.

We assume that the network gets exposed to an action a ∈ A, and in response a subgraph $G_a = (V_a, E_a) \subseteq G$, called the response network, gets activated. The nodes $V_a \subseteq V$ are the ones that get activated, and $E_a \subseteq E$ is the set of induced edges. We assume that the root r is always activated, i.e., $r \in V_a$. Note that even though r is directly connected to each node $v \in V_a$, in every response network $G_a$ some nodes in $V_a$ may exercise stronger influence on v than the influence that r exercises on v. The nodes that get directly activated by the root node r as a response to an action are called the focal points, or foci, of the response network.
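As a concrete illustration of this setup, the following sketch (ours, not from the paper; it assumes the networkx library and a toy entity network) builds a directed network with the external-stimulus root r attached:

```python
# A minimal sketch of the network setup above: a directed graph whose
# artificial root r models external stimuli by pointing at every node,
# while no edge points back at r. The toy entity edges are hypothetical.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("u", "v"), ("v", "w"), ("u", "w")])  # entity network

ROOT = "r"
for v in list(G.nodes()):          # snapshot taken before ROOT is added
    G.add_edge(ROOT, v)            # (r, v) in E for all v != r

assert G.in_degree(ROOT) == 0      # (v, r) not in E: nothing influences r
```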
Figure 1. An action a performed by nodes u, v, w of a directed network at times 1, 2, 5, respectively. Nodes x and y do not perform the action. The action a is represented by the input feature map φ(a). The response network $G_a$ is represented by the output feature map $\psi(G_a) = (a_{pp}, a_{pn}, a_{nn}, b_{pp}, b_{pn}, b_{nn}, c_{pp}, c_{pn}, c_{nn}, d_{pp}, d_{pn}, d_{nn}, \dots)$, which encodes the propagation of the action with respect to each edge (details in the text). Finally, γ is a scaling function (see Sec. 3.4); for instance, $\gamma(u) = (\gamma(u; a), \gamma(u; b), \gamma(u; c), \gamma(u; d), \dots) = (1, \lambda, \lambda, \lambda^2, \dots)$ is a vector of exponentially decaying weights for node u with respect to all edges.

We assume a dataset $\{(a_i, G_{a_i})\}_{i=1}^m$ of m training examples, where each example $(a_i, G_{a_i})$ consists of an action $a_i$ and the output $G_{a_i}$ encoding the response network activated by $a_i$. Our intention is to build a model that, given a previously unobserved action a, predicts the response network $G_a$.

3. Model for network responses

3.1. Structured-prediction model

Our method is based on embedding the input and output into a joint feature space and learning in that space a linear compatibility score
$$F(a, G_a; w) = \langle w, \varphi(a, G_a) \rangle.$$
The score $F(a, G_a; w)$ is given by the inner product of the parameters w and the joint feature $\varphi(a, G_a)$. As the joint feature we use the tensor product
$$\varphi(a, G_a) = \phi(a) \otimes \psi(G_a)$$
of the input feature map φ(a) of action a and the output feature map $\psi(G_a)$ that represents the response network $G_a$ of the action a. The tensor product $\varphi(a, G_a)$ consists of all pairs of input and output features, $\varphi_{ij}(a, G_a) = \phi_i(a)\,\psi_j(G_a)$.

The output features encode the activated subgraph of the network. We use node labels {p, n} to indicate whether nodes perform an action (positive vs. negative). Similarly, we use edge labels {pp, pn, nn} to indicate the role of edges in the propagation of actions. In particular, for each edge e = (u, v) of a response network $G_a$ and each label ℓ ∈ {pp, pn, nn}, we define the feature $\psi_{e,\ell}(G_a)$ to be 1 if and only if e is of type ℓ in $G_a$ (and 0 otherwise). For example, $\psi_{(u,v),pp}(G_a) = 1$ indicates that both nodes u and v are activated in $G_a$ and u precedes v in the partial order of activation. An example of the model is shown in Figure 1. For the sake of brevity, in the figure we abuse notation and use $e_\ell$ to denote $\psi_{e,\ell}(G_a)$. For instance, in this example we have $a_{pp} = (u, v)_{pp} = 1$, since both u and v are activated and u precedes v in the activation order, and thus it is possible that u has influenced v.

3.2. Maximum-margin structured learning

The feature weight parameters w of the compatibility score function F are learned by solving a regularized structured-output learning problem
$$\min_{w,\xi} \; \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m \xi_i, \quad (1)$$
$$\text{s.t.} \quad F(a_i, G_{a_i}; w) \ge \max_{G'_{a_i} \in \mathcal{H}(G)} \left[ F(a_i, G'_{a_i}; w) + \ell_G(G'_{a_i}, G_{a_i}) \right] - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m.$$
The constraints push the compatibility score of input $a_i$ with output $G_{a_i}$ above the scores of all competing outputs $G'_{a_i} \in \mathcal{H}(G)$, with a margin proportional to the loss $\ell_G(G'_{a_i}, G_{a_i})$ between the correct $G_{a_i}$ and any competing subgraph $G'_{a_i}$. Here $\mathcal{H}(G)$ is the set of directed acyclic subgraphs of G rooted at r. The slack variable $\xi_i$ relaxes the constraints so that a feasible solution can always be found, and C is a slack parameter that controls the amount of regularization in the model. The objective minimizes an L2-norm regularizer of the weight vector plus the slack allocated to the training set. This is equivalent to maximizing the margin while allowing some data to be outliers. In practice, the optimization problem (Eq. 1) is tackled by marginal dual conditional gradient optimization (Rousu et al., 2007).
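To make the tensor-product construction concrete, here is a small numpy sketch (our illustration; the dimensions and vectors are hypothetical) of the joint feature map and the compatibility score:

```python
import numpy as np

def joint_feature(phi_a, psi_G):
    # Tensor (Kronecker) product phi(a) (x) psi(G_a): every pair of an
    # input and an output feature, phi_ij = phi_i(a) * psi_j(G_a).
    return np.kron(phi_a, psi_G)

def compatibility(w, phi_a, psi_G):
    # F(a, G_a; w) = <w, phi(a, G_a)>
    return w @ joint_feature(phi_a, psi_G)

phi_a = np.array([0.2, 0.0, 1.0])        # e.g. bag-of-words weights of an action
psi_G = np.array([1.0, 0, 0, 0, 1, 0])   # {pp, pn, nn} indicator blocks, 2 edges
w = np.ones(phi_a.size * psi_G.size)     # learned parameters (dummy values here)
print(compatibility(w, phi_a, psi_G))
```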
3.3. The inference problem

In the structured prediction model, both in training and in prediction, we need to find the highest-scoring subgraph for an action. The two problems differ only in the definition of the score: in training we need to iteratively find the subgraph that violates its margin the most, while in prediction we need to find the subgraph with the maximum compatibility to a given action. We explain our inference algorithms for the latter problem and note that the former is a straightforward variant.

Given feature weights w and a network G = (V, E), the prediction for a new input action a is the maximally scoring response graph $H^* = (V^H, E^H)$:
$$H^*(a) = \operatorname{argmax}_{H \in \mathcal{H}(G)} F(a, H; w).$$
Writing this problem explicitly, in terms of the parameters and the feature maps, gives
$$H^*(a) = \operatorname{argmax}_{H \in \mathcal{H}(G)} \langle w, \phi(a) \otimes \psi(H) \rangle = \operatorname{argmax}_{H \in \mathcal{H}(G)} \sum_{e \in E_H} s_{y_e}(e, a), \quad (2)$$
where we have substituted $s_{y_e}(e, a) = \sum_i w_{i,e,y_e}\,\phi_i(a)$. We abbreviate $s_{y_e}(e, a)$ to $s_{y_e}(e)$, as the action a is fixed for an individual inference problem.

The output response network H can be specified by node labels $y_v \in \{p, n\}$, where $y_v = p$ if and only if v is activated. We write $H_y$ to emphasize the dependence of the output subgraph H on the labelling y. The node labels $y_v$ induce the edge labels $y_e$. The function $s_{y_e}(e)$ can be interpreted as a score function for the edges, given by the current input a and the weight vector w: for each possible label $y_e$ of an edge e, it assigns a different score. Depending on the values that $y_e$ can take, the inference problem comes in two modes:

Activation mode. We assume $y_e \in \{pp, pn\}$, where $y_e = pp$ implies that node v is activated by u via the directed edge e = (u, v), and $y_e = pn$ means that the activation cannot pass through e. In activation mode, the inference problem amounts to finding the maximally scoring node labels $y_v$ and corresponding edge labels $y_e$, consistent with an activated subgraph $H_y$, given a set of edge scores $s_{y_e}(e)$.

Negative-feed mode. In addition to the setting of the activation mode, we also explicitly model the inactive network by assuming $y_e \in \{pp, pn, nn\}$, where $y_e = nn$ denotes our belief that both u and v should be inactive given action a. The inference problem is then to find the maximally scoring node labels and induced edge labels, with regard to an activated subgraph together with its inactive counterpart, given a set of edge scores $s_{y_e}(e)$.

It is not difficult to show that the inference problem (Eq. 2) is NP-hard. The proof of the following lemma, which provides a reduction from the MAX-CUT problem, is given in the supplementary material.

Lemma 1. Finding the graph that maximizes Eq. (2) is an NP-hard problem.

To solve the inference problem we propose two algorithms, described for the negative-feed mode; similar techniques can be adapted to the activation mode by setting the edge scores $s_{nn}(e) = 0$. The first algorithm is based on a semidefinite programming (SDP) relaxation, similar to the one used for MAX-CUT and satisfiability problems (Goemans & Williamson, 1995). The SDP algorithm offers a constant-factor approximation guarantee for the inference problem, but it requires solving semidefinite programs. Efficient solvers do exist, yet the method is not scalable to large datasets; moreover, it cannot handle the order of activations. In contrast, our second approach is a more efficient GREEDY algorithm that models the activation order in a natural way, but it does not provide any quality guarantee.
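Both algorithms consume only the per-edge scores $s_{y_e}(e)$. As a brief aside, here is how they would be tabulated from the learned weights (our sketch; the weight-tensor layout is an assumption, the notation follows Eq. 2):

```python
import numpy as np

def edge_scores(W, phi_a):
    # s_{y_e}(e, a) = sum_i w_{i, e, y_e} * phi_i(a).
    # W has shape (k, |E|, 3): input feature i, edge e, label in {pp, pn, nn}.
    # Returns an (|E|, 3) array with one score per edge and edge label.
    return np.einsum("i,iel->el", phi_a, W)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4, 3))       # k = 3 features, 4 edges, 3 labels (toy)
phi_a = np.array([0.2, 0.0, 1.0])
print(edge_scores(W, phi_a))         # rows: edges; columns: s_pp, s_pn, s_nn
```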
The SDP inference. Recall that for each edge (u, v) ∈ E we are given three scores: $s_{pp}(u, v)$, $s_{pn}(u, v)$, and $s_{nn}(u, v)$. The inference problem is to assign a label p or n to each vertex u ∈ V. If a vertex u is assigned the label p, we say that u is activated. If both vertices u and v of an edge (u, v) ∈ E are activated, a gain $s_{pp}(u, v)$ is incurred. Respectively, the assignments pn and nn yield gains $s_{pn}(u, v)$ and $s_{nn}(u, v)$. The objective is to find the assignment that maximizes the total gain.

We formulate this optimization problem as a quadratic program. We introduce a variable $x_u \in \{-1, +1\}$ for each u ∈ V. We also introduce a special variable $x_0 \in \{-1, +1\}$, which is used to distinguish the activated vertices. In particular, if $x_u = x_0$ we consider the vertex u to be assigned the label p, and thus activated, while $x_u \neq x_0$ implies that u is assigned the label n and not activated. The network-response inference problem can now be written as
$$\text{(QP):} \quad \max \; \tfrac{1}{4}\sum_{(u,v) \in E} \big[\, s_{pn}(u, v)(1 + x_0x_u - x_0x_v - x_ux_v) + s_{nn}(u, v)(1 - x_0x_u - x_0x_v + x_ux_v) + s_{pp}(u, v)(1 + x_0x_u + x_0x_v + x_ux_v) \,\big]$$
$$\text{s.t.} \quad x_0, x_u, x_v \in \{-1, +1\}, \text{ for all } u, v \in V.$$
The intuition behind the formulation of Problem (QP) is that there is a gain $s_{pn}(u, v)$ if $x_0 = x_u \neq x_v$, a gain $s_{nn}(u, v)$ if $x_0 \neq x_u = x_v$, and a gain $s_{pp}(u, v)$ if $x_0 = x_u = x_v$.

To solve Problem (QP), we use a technique similar to the one introduced by Goemans & Williamson (1995): each variable $x_u$ is relaxed to a unit vector $v_u \in \mathbb{R}^n$. The relaxed quadratic program becomes
$$\text{(RQP):} \quad \max \; \tfrac{1}{4}\sum_{(u,v) \in E} \big[\, s_{pn}(u, v)(1 + v_0 \cdot v_u - v_0 \cdot v_v - v_u \cdot v_v) + s_{nn}(u, v)(1 - v_0 \cdot v_u - v_0 \cdot v_v + v_u \cdot v_v) + s_{pp}(u, v)(1 + v_0 \cdot v_u + v_0 \cdot v_v + v_u \cdot v_v) \,\big]$$
$$\text{s.t.} \quad v_i \in \mathbb{R}^n, \; \|v_i\| = 1, \text{ for all } i = 0, \dots, n.$$
Consider the (n+1) × (n+1) matrix Y whose (u, v) entry is $y_{u,v} = v_u \cdot v_v$. If V is the matrix having the $v_u$'s as its columns, i.e., $V = [v_0 \dots v_n]$, then $Y = V^T V$, implying that the matrix Y is positive semidefinite, a fact we denote by $Y \succeq 0$. Problem (RQP) now becomes
$$\text{(SDP):} \quad \max \; \tfrac{1}{4}\sum_{(u,v) \in E} \big[\, s_{pn}(u, v)(1 + y_{0,u} - y_{0,v} - y_{u,v}) + s_{nn}(u, v)(1 - y_{0,u} - y_{0,v} + y_{u,v}) + s_{pp}(u, v)(1 + y_{0,u} + y_{0,v} + y_{u,v}) \,\big]$$
$$\text{s.t.} \quad Y \succeq 0, \; y_{u,u} = 1 \text{ for all } u.$$
Problem (SDP) asks to find a semidefinite matrix such that a linear function of the entries of the matrix is optimized. This problem can be solved by semidefinite programming within accuracy ϵ, in time polynomial in n and 1/ϵ. After solving the semidefinite program, one rounds each vector $v_u$ to a variable $x_u \in \{-1, +1\}$ in the following way:

1. Factorize Y with the Cholesky decomposition to find $V = [v_0, v_1, \dots, v_n]$.
2. Select a random vector r.
3. For each u = 0, 1, ..., n, if $v_u \cdot r \ge 0$ set $x_u = 1$, otherwise set $x_u = -1$.

Let Z be the value of the solution obtained by the above algorithm, let $Z^*$ be the optimal value of Problem (QP), and let $Z_R$ be the optimal value of Problem (SDP). Since Problem (SDP) is a relaxation of Problem (QP), we have $Z_R \ge Z^*$. Furthermore, it can be shown that the expected value of Z satisfies $E[Z] \ge (\alpha - \epsilon)Z_R$, with α > 0.796 and the expectation taken over the choice of r. Thus the above algorithm is a 0.796-approximation algorithm for Problem (QP).
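The following sketch (ours, assuming the cvxpy solver and dictionary-valued scores; constant factors that do not affect the maximizer are dropped) illustrates the relaxation and the hyperplane rounding:

```python
import numpy as np
import cvxpy as cp

def sdp_inference(n, edges, s_pp, s_pn, s_nn, seed=0):
    # Solve the (SDP) relaxation: Y is PSD with unit diagonal; index 0
    # plays the role of the special variable x0, nodes are 1..n.
    Y = cp.Variable((n + 1, n + 1), PSD=True)
    obj = 0
    for (u, v) in edges:
        obj += s_pn[(u, v)] * (1 + Y[0, u] - Y[0, v] - Y[u, v])
        obj += s_nn[(u, v)] * (1 - Y[0, u] - Y[0, v] + Y[u, v])
        obj += s_pp[(u, v)] * (1 + Y[0, u] + Y[0, v] + Y[u, v])
    cp.Problem(cp.Maximize(obj), [cp.diag(Y) == 1]).solve()

    # Hyperplane rounding: factorize Y = V V^T (an eigendecomposition is
    # used instead of Cholesky for numerical robustness) and cut with a
    # random vector; u is activated iff it lands on the same side as x0.
    eigval, U = np.linalg.eigh(Y.value)
    V = U * np.sqrt(np.clip(eigval, 0, None))   # rows of V are the vectors v_u
    r = np.random.default_rng(seed).normal(size=n + 1)
    x = np.where(V @ r >= 0, 1, -1)
    return [u for u in range(1, n + 1) if x[u] == x[0]]

# Toy instance: two nodes, one edge, label pp strongly rewarded.
print(sdp_inference(2, [(1, 2)], {(1, 2): 2.0}, {(1, 2): 0.1}, {(1, 2): 0.1}))
```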
The GREEDY inference. The inference problem (Eq. 2) is defined on all edges of the network, but it can be expressed equivalently as a function of the activated vertices (see details in the supplementary material):
$$H^*(a) = \operatorname{argmax}_{H \in \mathcal{H}(G)} \sum_{v_i \in V_p^H} F_m(v_i),$$
where $V_p^H$ is the set of activated vertices and $F_m(v_i)$ is the marginal gain of node $v_i$, comprised partly of changing the edge label from pn to pp on the incoming edges $\{(v_p, v_i) \mid v_p \in \text{parents}(v_i)\}$, and partly of changing the edge label from nn to pn on the outgoing edges $\{(v_i, v_c) \mid v_c \in \text{children}(v_i)\}$:
$$F_m(v_i) = \sum_{v_p \in \text{parents}(v_i)} \big[s_{pp}(v_p, v_i) - s_{pn}(v_p, v_i)\big] + \sum_{v_c \in \text{children}(v_i)} \big[s_{pn}(v_i, v_c) - s_{nn}(v_i, v_c)\big].$$
It is difficult to maximize the sum of marginal gains directly, as the activated subnetwork is unknown. One can instead pick the vertex with the maximum marginal gain $\max_{v_i} F_m(v_i)$ in an iterative fashion, as long as $F_m(v_i) \ge 0$, which leads to the following greedy algorithm. The algorithm starts with the activated vertex set $V_p^H = \{r\}$. In each iteration, it chooses the vertex $v_i \in V \setminus V_p^H$ that is the current maximizer of $F_m(v)$ and adds it to $V_p^H$. The procedure terminates when the maximum gain is smaller than 0. The edge set $E^H$ is obtained by adding the edges $e = (v_i, v_j) \in E$ such that $v_i, v_j \in V_p^H$ and $v_i$ was added to $V_p^H$ prior to $v_j$. The time complexity of the greedy inference algorithm is O(|E| log |V|). See the supplementary material for details of the algorithm.

We note that we have not been able to show an approximation guarantee for the quality of the solutions produced by the GREEDY algorithm. A property that is typically used to analyse greedy methods is submodularity. However, for this particular problem submodularity does not hold (it holds only in the special case of MAX-CUT, i.e., when $s_{pp}(e) = s_{nn}(e) = 0$ and $s_{pn}(e) = 1$).
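A compact sketch of the greedy procedure just described (our rendering; a priority queue over the gains would give the stated O(|E| log |V|) bound, which this simple version does not attempt):

```python
def greedy_inference(nodes, parents, children, s_pp, s_pn, s_nn, root="r"):
    # Marginal gain F_m(v): flip incoming edges pn -> pp, outgoing nn -> pn.
    def gain(v):
        g = sum(s_pp[(p, v)] - s_pn[(p, v)] for p in parents[v])
        g += sum(s_pn[(v, c)] - s_nn[(v, c)] for c in children[v])
        return g

    activated = [root]                    # V_p^H, kept in activation order
    candidates = set(nodes) - {root}
    while candidates:
        best = max(candidates, key=gain)
        if gain(best) < 0:                # stop once the best gain is negative
            break
        activated.append(best)
        candidates.remove(best)

    # E^H: edges between activated nodes, directed by activation order.
    rank = {v: i for i, v in enumerate(activated)}
    edges = [(u, v) for u in activated for v in children[u]
             if v in rank and rank[u] < rank[v]]
    return activated, edges
```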
3.4. Loss functions

Instead of penalizing prediction mistakes uniformly over the network G, we wish to focus on the vicinity of the response network. To achieve this effect, we scale the loss accrued on the nodes and edges by their distance to the children of the root of the response network. As the loss function in (1) we use the symmetric-difference loss (or Hamming loss), applied to the nodes and the edges of the subgraphs separately, and scaled by a function $\gamma_G(v_k)$ according to the distance to the focal point $v_k$:
$$\ell_G(G_a, G_b) = \sum_{v \in V} \ell_v(G_a, G_b)\,\gamma_G(v_k; v) + \sum_{(v,v') \in E} \ell_{v,v'}(G_a, G_b)\,\gamma_G(v_k; v, v'),$$
where $\ell_v(G_a, G_b) = [v \in V_a \,\triangle\, V_b]$, $\ell_e(G_a, G_b) = [e \in E_a \,\triangle\, E_b]$, and $S \,\triangle\, S'$ denotes the symmetric difference of two sets S and S'. We consider the following strategies to construct the scaling function $\gamma_G(v_k)$:

Exponential scaling. Mistakes are penalized by a factor λ weighted exponentially according to the shortest-path distance to the focal point $v_k$. Given the focal point $v_k$, an edge $(v_i, v_j)$, and the distance matrix D between the nodes, the scaling function is defined as
$$\gamma_G(v_k; v_i, v_j) = \begin{cases} 1 & \text{if } i = 0, \\ \lambda^{D(k,i)} & \text{if } i \neq 0 \text{ and } D(k,i) \le R, \\ \lambda^{R+1} & \text{if } D(k,i) > R, \end{cases}$$
where λ > 0 is the scaling factor and R > 1 is a radius parameter. Edges outside the radius receive equal scaling.

Diffusion scaling. The diffusion kernel defines a distance-based function between nodes $v_i$ and $v_j$ (Kondor & Lafferty, 2002). The kernel value K(i, j) corresponds to the probability of a random walk from node $v_i$ to node $v_j$. Given the adjacency matrix L of the network G, the diffusion kernel is computed as the matrix exponential $K = \exp(\beta L)$, where β is a parameter that controls how much the random walks deviate from the focal point. Given the focal node $v_k$, an edge $(v_i, v_j)$, and the diffusion kernel K, the scaling function is defined as
$$\gamma_G(v_k; v_i, v_j) = \begin{cases} 1 & \text{if } i = 0, \\ K(v_k, v_i) & \text{otherwise.} \end{cases}$$
The scaling function keeps the loss value on the edges connected to the focal point and scales the other edges by the weights computed from the diffusion kernel. Diffusion scaling has the effect of shrinking the distance to nodes that connect to the focal point through many paths.
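A sketch of the two scaling functions (our illustration; it assumes a dense shortest-path distance matrix D and adjacency matrix L as numpy arrays, and uses scipy's matrix exponential):

```python
import numpy as np
from scipy.linalg import expm

def gamma_exponential(D, k, i, lam=0.5, R=3):
    # Exponential scaling: lambda^distance up to radius R, constant beyond.
    if i == 0:
        return 1.0
    return lam ** D[k, i] if D[k, i] <= R else lam ** (R + 1)

def gamma_diffusion(L, k, i, beta=0.1):
    # Diffusion scaling: K = exp(beta * L), the matrix exponential of the
    # adjacency matrix, scales nodes by random-walk proximity to v_k.
    K = expm(beta * np.asarray(L, dtype=float))
    return 1.0 if i == 0 else K[k, i]
```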
4. Experimental evaluation

In this section we evaluate the performance of SPIN and compare it with state-of-the-art methods through extensive experiments. We use two real-world datasets, DBLP and Memetracker, described below. Statistics of the datasets are given in Table 1.

DBLP¹ is a collection of bibliographic information on major computer science journals and proceedings. We extract a subset of the original data by using inproceedings articles from the year 2000. First, we construct an undirected DBLP network G by connecting pairs of authors who have coauthored more than p papers (p = 5, 10, 15). After that, we generate a set of experimental networks of different sizes by performing snowball sampling (Goodman, 1961). For each experimental network, we extract all the documents for which at least one of the authors is a node in the network. We apply the LDA algorithm (Blei et al., 2002) to the titles of the extracted documents to generate topics. Topics are associated with publications, timestamped by publication dates, and described by bag-of-words features computed from LDA. In this way, a topic can be seen as an action, and we study the influence among authors.

Memetracker² is a set of phrases propagated over prominent online news sites in March 2009. We construct directed networks G for the Memetracker dataset by connecting two websites via a directed edge if there are at least five phrases copied from one website to the other. A posted phrase corresponds to an action, which again is timestamped and represented with bag-of-words features.

¹http://www.informatik.uni-trier.de/~ley/db/
²http://Memetracker.org

Table 1. Statistics of the DBLP and Memetracker datasets.

| Dataset | Training examples | Feature space | Network \|V\| | Network \|E\| |
|---|---|---|---|---|
| DBLP S100 | 440 | 1190 | 100 | 204 |
| DBLP M100 | 478 | 1127 | 100 | 151 |
| DBLP M500 | 2119 | 3619 | 500 | 699 |
| DBLP M700 | 2800 | 4369 | 699 | 952 |
| DBLP M1k | 3720 | 5281 | 1000 | 1368 |
| DBLP M2k | 6030 | 7183 | 2000 | 2687 |
| DBLP L100 | 509 | 1274 | 100 | 152 |
| DBLP L500 | 1869 | 3424 | 499 | 701 |
| DBLP L700 | 2620 | 4300 | 699 | 960 |
| DBLP L1k | 3560 | 5405 | 1000 | 1368 |
| DBLP L2k | 3618 | 5454 | 1023 | 1402 |
| meme S | 4632 | 181 | 82 | 325 |
| meme M | 4804 | 179 | 182 | 521 |
| meme L | 4809 | 179 | 333 | 597 |

4.1. Experimental setup and metrics

SPIN can be applied to predict action-specific network responses (context-aware mode) when the action representation φ(a) is given as input. It is also capable of predicting edge influence scores in context-free mode, when φ(a) is treated as unknown. For comparison purposes, we evaluate SPIN against the following state-of-the-art methods:

Support Vector Machine (SVM) is a single-target classifier used to predict the response network by decomposing it into a bag of nodes and edges and predicting each element of the bag.

Max-Margin Conditional Random Field (MMCRF) (Rousu et al., 2007; Su et al., 2010) is a multilabel classifier that utilizes the structure of the output graph G. The model predicts the node labels of the network.

Expectation-Maximization for the independent cascade model (ICM-EM) (Saito et al., 2008) is a context-free model that infers the influence probabilities of the network, given a directed network and a set of action cascades. We use the publicly available³ implementation of this algorithm from Mathioudakis et al. (2011).

Netrate (Gomez-Rodriguez et al., 2011) models network influence as temporal processes that occur at different rates. It infers the directed edges of the global network and estimates the transmission rate of each edge.

³https://dl.dropboxusercontent.com/u/21620176/public_html/spine/index.html

To quantitatively evaluate the performance of the tested methods in predicting node and edge labels, we adopt two popular metrics: accuracy and the F1 score, defined as $F_1 = \frac{2PR}{P + R}$, where P is precision and R is recall. We also define Predicted Subgraph Coverage (PSC), computed from $V_i$, the set of focal points of action $a_i$, the number of nodes n in the network, and the number of actions m: PSC expresses the relative size of a correctly predicted subgraph, in terms of node predictions, that covers a focal point $v \in V_i$.

Our metrics are computed both in a global context, where we pool all the nodes and edges from the background network, and in a local context, where we collect only the nodes and edges within a certain radius R of the focal points. The experimental results are from five-fold cross-validation.
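For completeness, a small sketch of the node-level F1 computation from predicted and true activated node sets (our code, matching the definition above):

```python
def f1_score(true_nodes, pred_nodes):
    # F1 = 2PR / (P + R), with precision P and recall R over activated nodes.
    tp = len(true_nodes & pred_nodes)
    if tp == 0 or not pred_nodes or not true_nodes:
        return 0.0
    p, r = tp / len(pred_nodes), tp / len(true_nodes)
    return 2 * p * r / (p + r)

print(f1_score({"u", "v", "w"}, {"u", "v", "x"}))  # P = R = 2/3 -> F1 = 0.667
```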
4.2. Experimental results

We examine whether our context-sensitive structured predictor can boost the performance of predicting network responses. We compare SPIN with the other methods on both context-sensitive and context-free problems, and show that SPIN performs significantly better at predicting action-specific network responses.

Comparison of inference algorithms. Table 2 shows the geometric mean of node accuracy, F1, and running time over the parameter space on three DBLP datasets, where Neg and Act denote the GREEDY inference in the negative-feed and activation modes; SDP is also formulated in the negative-feed mode. In general, inference in the negative-feed mode outperforms the activation mode in terms of accuracy; the difference in F1 is smaller. SDP-based inference surpasses GREEDY inference in accuracy, but only by a small margin. In addition, GREEDY inference is almost 10 times faster even on small datasets (the running time is the total time used for cross-validation). For the following experiments, we therefore opted for GREEDY inference in negative-feed mode as the inference engine of SPIN.

Table 2. Comparison of different inference algorithms. Geom. is the geometric mean of the rows.

| Data | Acc SDP | Acc Neg | Acc Act | F1 SDP | F1 Neg | F1 Act | Time (10²s) SDP | Time Neg | Time Act |
|---|---|---|---|---|---|---|---|---|---|
| S100 | 79.9 | 77.6 | 72.9 | 57.2 | 56.2 | 55.5 | 16.0 | 1.5 | 0.2 |
| M100 | 75.8 | 73.6 | 68.5 | 51.6 | 53.1 | 54.5 | 15.2 | 1.4 | 0.2 |
| L100 | 75.1 | 72.0 | 67.4 | 53.5 | 56.9 | 57.2 | 13.7 | 1.6 | 0.3 |
| Geom. | 76.9 | 74.3 | 69.6 | 52.0 | 55.4 | 55.7 | 15.0 | 1.5 | 0.3 |

Context-aware prediction. We apply SPIN with exponential scaling to predict context-sensitive network responses. A comparison of prediction performance against SVM and MMCRF is given in Table 3. SPIN dramatically boosts performance on all measures except node accuracy, where MMCRF wins; SPIN is the second best, and the difference is small. In terms of training time, SPIN is around three times faster than SVM and two times faster than MMCRF on the largest dataset, M2k.

Table 3. Comparison of prediction performance in the global context: node accuracy (NAcc), node F1 score (NF1), edge accuracy (EAcc), PSC, and training time (10³s).

| Dataset | NAcc SVM | NAcc MMCRF | NAcc SPIN | NF1 SVM | NF1 MMCRF | NF1 SPIN | EAcc SVM | EAcc SPIN | PSC SVM | PSC MMCRF | PSC SPIN | Time SVM | Time MMCRF | Time SPIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| meme S | 73.4 | 68.0 | 72.2 | 39.0 | 39.8 | 47.1 | 62.7 | 45.6 | 23.4 | 25.3 | 33.6 | 6.6 | 2.9 | 4.1 |
| meme M | 82.1 | 79.0 | 81.5 | 29.1 | 30.1 | 38.0 | 61.1 | 68.8 | 18.6 | 18.8 | 28.3 | 13.7 | 3.2 | 7.3 |
| meme L | 89.9 | 88.3 | 89.8 | 26.7 | 27.1 | 35.0 | 45.5 | 80.0 | 17.7 | 18.9 | 27.6 | 19.9 | 5.9 | 11.8 |
| M100 | 71.2 | 73.6 | 76.7 | 49.3 | 50.8 | 54.3 | 33.3 | 61.7 | 33.3 | 35.6 | 34.6 | 0.1 | 0.2 | 0.1 |
| M500 | 89.0 | 91.4 | 92.0 | 18.8 | 13.5 | 14.6 | 28.2 | 92.6 | 29.3 | 26.4 | 29.5 | 9.0 | 3.8 | 3.2 |
| M700 | 91.9 | 94.1 | 92.1 | 13.8 | 7.3 | 14.2 | 26.3 | 93.0 | 29.4 | 23.9 | 34.4 | 18.5 | 8.3 | 4.4 |
| M1k | 94.1 | 95.8 | 94.2 | 10.9 | 3.5 | 9.3 | 26.6 | 94.7 | 33.7 | 16.6 | 35.2 | 42.2 | 14.7 | 10.4 |
| M2k | 96.8 | 97.6 | 96.7 | 6.2 | 1.4 | 3.4 | 25.3 | 97.6 | 34.6 | 9.6 | 14.7 | 165.0 | 88.4 | 54.1 |
| L100 | 69.4 | 72.2 | 75.7 | 51.1 | 53.1 | 57.4 | 31.6 | 62.3 | 30.9 | 31.7 | 33.4 | 0.1 | 0.2 | 0.3 |
| L500 | 85.9 | 89.1 | 86.8 | 21.7 | 15.1 | 24.7 | 27.9 | 87.9 | 14.2 | 11.2 | 19.7 | 6.5 | 3.2 | 2.1 |
| L700 | 89.7 | 92.4 | 89.7 | 16.2 | 9.4 | 17.3 | 26.5 | 90.4 | 9.5 | 6.7 | 12.5 | 16.0 | 7.8 | 5.3 |
| L1k | 92.4 | 94.4 | 91.5 | 12.4 | 6.4 | 13.9 | 26.4 | 92.3 | 6.1 | 4.4 | 8.4 | 40.3 | 13.7 | 10.4 |
| L2k | 92.5 | 94.5 | 91.9 | 12.3 | 5.4 | 12.7 | 26.5 | 93.2 | 6.0 | 2.9 | 7.2 | 41.9 | 21.9 | 13.1 |
| Geom. | 85.5 | 86.4 | 86.6 | 19.8 | 12.6 | 20.3 | 32.6 | 79.7 | 18.9 | 14.2 | 21.7 | 9.4 | 4.6 | 4.3 |

Context-free network influence prediction. Here we compare SPIN with methods developed for influence-network prediction, namely Netrate and ICM-EM, on the Memetracker data. To make the comparison fair, we convert the network to an undirected network and replace the action features by a constant value; for SPIN, we further represent each undirected edge by two directed edges. The measure of success is Precision@K, where we take the top K percent of edge predictions from each model and compute the precision. Table 4 shows Precision@K as a function of K; the performance of SPIN surpasses ICM-EM and Netrate over the whole spectrum of K with a noticeable margin. ICM-EM has the least accurate predictions of the three, but achieves by far the best running time. SPIN and Netrate solve more complex convex optimization problems, leading to more accurate predictions at the cost of more CPU time for training, with SPIN being the more efficient of the two on the largest dataset, meme L. The good performance of SPIN compared to Netrate is mostly explained by the fact that Netrate solves a much harder problem, in which the underlying undirected network is assumed to be unknown, while SPIN is able to leverage the known network structure. In the reported experiment, the edge predictions from Netrate are filtered against the underlying network, so as not to excessively penalize influence predictions between non-linked nodes.

Table 4. Model performance in context-free influence-network prediction (T is the training time in 10³s).

| Dataset | Model | T (10³s) | P@10% | P@20% | P@30% | P@40% | P@50% | P@60% | P@70% | P@80% | P@90% | P@100% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| meme S | SPIN | 5.50 | 82.9 | 81.0 | 76.0 | 74.0 | 74.0 | 70.0 | 69.8 | 67.9 | 66.7 | 64.7 |
| | ICM-EM | 0.01 | 60.3 | 63.5 | 65.1 | 62.0 | 62.0 | 61.5 | 62.2 | 60.4 | 60.7 | 61.9 |
| | NETRATE | 5.83 | 76.2 | 73.8 | 70.4 | 68.7 | 68.7 | 66.8 | 64.9 | 63.4 | 62.9 | 61.9 |
| meme M | SPIN | 5.52 | 82.7 | 72.1 | 70.5 | 69.2 | 69.2 | 67.9 | 66.2 | 65.6 | 64.3 | 64.2 |
| | ICM-EM | 0.02 | 56.3 | 55.3 | 56.8 | 57.4 | 57.4 | 56.3 | 57.5 | 57.8 | 58.3 | 58.5 |
| | NETRATE | 13.93 | 61.2 | 64.6 | 62.9 | 62.5 | 62.5 | 62.4 | 61.2 | 60.1 | 58.7 | 58.5 |
| meme L | SPIN | 4.75 | 82.2 | 73.6 | 69.1 | 66.7 | 66.7 | 65.9 | 66.1 | 65.9 | 63.9 | 63.6 |
| | ICM-EM | 0.01 | 52.1 | 55.7 | 54.2 | 56.5 | 56.5 | 56.7 | 57.4 | 58.0 | 57.6 | 57.0 |
| | NETRATE | 12.63 | 56.5 | 57.8 | 60.0 | 59.3 | 59.3 | 59.4 | 58.9 | 58.4 | 57.5 | 57.0 |
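A minimal sketch of the Precision@K computation used in Table 4 (ours; `scores` maps each candidate edge to its predicted influence score):

```python
def precision_at_k(scores, true_edges, k_percent):
    # Rank edges by predicted score, keep the top K percent, and measure
    # the fraction of kept edges that are true influence edges.
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[:max(1, len(ranked) * k_percent // 100)]
    return sum(e in true_edges for e in top) / len(top)

scores = {("a", "b"): 0.9, ("b", "c"): 0.4, ("c", "a"): 0.1, ("a", "c"): 0.7}
print(precision_at_k(scores, {("a", "b"), ("a", "c")}, 50))  # -> 1.0
```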
Effect of loss scaling. Figure 2 depicts the effect of the parameter λ of the exponential loss scaling on prediction performance over subgraphs of different radius, with SVM (dashed line) as the baseline.

Figure 2. The improvement of prediction performance, relative to SVM, for different scaling factors λ (legend: λ = 0.3, 0.5, 0.7). The four panels show node accuracy (top left), node F1 score (top right), PSC (bottom left), and edge accuracy (bottom right) improvement in %, each as a function of the subgraph radius (0-5).

When 0 < λ < 1, the node prediction accuracy (top left) and F1 (top right) decrease with increasing subgraph radius, while λ ≥ 1 leads to the opposite behavior, allowing larger subgraphs to be learned. Predicted subgraph coverage decreases with increasing λ. Edge prediction accuracy (bottom right) increases monotonically in λ, implying that predicting longer influence paths is a hard problem for SVM.

In Table 5 we examine the performance of diffusion scaling. The numbers reported are geometric means over the different Memetracker and DBLP datasets. We observe decreased performance when increasing the parameter β, which corresponds to smoothing the distance matrix. This indicates that emphasizing connections between long-distance nodes makes prediction more difficult, a finding consistent with the results on exponential scaling. Setting β = 0.1 leads to performance comparable to exponential scaling with λ = 0.5, with a slight improvement on the DBLP datasets.

Table 5. Comparison of diffusion scaling (Dif) with exponential scaling (Exp).

| Loss scaling | Node Acc Meme | Node Acc DBLP | Node F1 Meme | Node F1 DBLP | Edge Acc Meme | Edge Acc DBLP | PSC Meme | PSC DBLP | Time (10³s) Meme | Time (10³s) DBLP |
|---|---|---|---|---|---|---|---|---|---|---|
| Dif β = 0.1 | 80.8 | 86.5 | 40.0 | 28.6 | 63.0 | 80.5 | 30.2 | 30.3 | 68.3 | 2.7 |
| Dif β = 0.5 | 66.4 | 86.5 | 42.5 | 28.5 | 40.9 | 80.5 | 33.0 | 30.2 | 50.9 | 4.0 |
| Dif β = 0.8 | 63.5 | 86.5 | 40.9 | 28.5 | 39.3 | 80.5 | 31.2 | 30.2 | 32.6 | 3.2 |
| Exp λ = 0.5 | 80.9 | 83.9 | 39.7 | 28.7 | 63.1 | 77.7 | 29.7 | 24.3 | 71.0 | 10.8 |

5. Discussion

We have presented a novel approach, based on structured output learning, to the problem of modelling influence in networks. In contrast to previous state-of-the-art approaches, such as Netrate and ICM-EM, our proposal, named SPIN, is a context-sensitive model. SPIN does not try to fit global influence parameters; instead, it incorporates the action space into the learning process and makes predictions tailored to the action under consideration. Our method can provide a useful tool in market research or other application scenarios where actions arise from a high-dimensional space and one wants to make predictions for actions not seen before.
Another benefit of our approach, compared to other state-of-the-art methods, is that it does not make explicit assumptions about the underlying propagation model. Additionally, action responses are explicitly formulated as directed acyclic subgraphs, and the model is capable of predicting the complete subgraph structure. We proved that the inference problem of SPIN is NP-hard, and we provided an approximation algorithm based on semidefinite programming (SDP). In addition, we developed a greedy heuristic algorithm for the inference problem that scales linearly in the size of the network, with time consumption in the same ballpark as Netrate. With extensive experiments we showed that SPIN can dramatically boost the performance of action-based network-response prediction. SPIN can also be applied to context-free prediction, where it captures the edge influence weights of the network.

References

Anagnostopoulos, A., Kumar, R., and Mahdian, M. Influence and correlation in social networks. In KDD, 2008.

Blei, D., Ng, A., and Jordan, M. Latent Dirichlet allocation. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.

De Choudhury, M., Mason, W. A., Hofman, J. M., and Watts, D. J. Inferring relevant social networks from interpersonal communication. In WWW, pp. 301-310, 2010.

Du, N., Song, L., Smola, A., and Yuan, M. Learning networks of heterogeneous influence. In NIPS, 2012.

Eagle, N., Pentland, A. S., and Lazer, D. Inferring friendship network structure by using mobile phone data. Proceedings of the National Academy of Sciences, 106(36):15274-15278, 2009.

Goemans, M. and Williamson, D. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6), 1995.

Gomez-Rodriguez, M., Leskovec, J., and Krause, A. Inferring networks of diffusion and influence. In KDD, 2010.

Gomez-Rodriguez, M., Balduzzi, D., and Schölkopf, B. Uncovering the temporal dynamics of diffusion networks. In ICML, 2011.

Goodman, L. A. Snowball sampling. The Annals of Mathematical Statistics, 32(1):148-170, 1961.

Goyal, A., Bonchi, F., and Lakshmanan, L. V. S. Learning influence probabilities in social networks. In WSDM, 2010.

Kempe, D., Kleinberg, J., and Tardos, É. Maximizing the spread of influence through a social network. In KDD, 2003.

Kondor, R. I. and Lafferty, J. D. Diffusion kernels on graphs and other discrete structures. In ICML, 2002.

Mathioudakis, M., Bonchi, F., Castillo, C., Gionis, A., and Ukkonen, A. Sparsification of influence networks. In KDD, 2011.

Rousu, J., Saunders, C., Szedmak, S., and Shawe-Taylor, J. Efficient algorithms for max-margin structured classification. In Predicting Structured Data, pp. 105-129, 2007.

Saito, K., Nakano, R., and Kimura, M. Prediction of information diffusion probabilities for independent cascade model. In Knowledge-Based Intelligent Information and Engineering Systems (KES), 2008.

Su, H., Heinonen, M., and Rousu, J. Structured output prediction of anti-cancer drug activity. In Proceedings of the 5th IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB'10), 2010.

Watts, D. J. and Dodds, P. S. Influentials, networks, and public opinion formation. Journal of Consumer Research, 34(4):441-458, 2007.