# metadataconscious_anonymous_messaging__bad8f1aa.pdf

Metadata-Conscious Anonymous Messaging

Giulia Fanti FANTI@ILLINOIS.EDU Peter Kairouz KAIROUZ2@ILLINOIS.EDU Sewoong Oh SWOH@ILLINOIS.EDU Kannan Ramchandran KANNANR@EECS.BERKELEY.EDU Pramod Viswanath PRAMODV@ILLINOIS.EDU

University of Illinois at Urbana-Champaign, Champaign, IL 61801

University of California, Berkeley, CA 94701

Anonymous messaging platforms like Whisper and Yik Yak allow users to spread messages over a network (e.g., a social network) without revealing message authorship to other users. The spread of messages on these platforms can be modeled by a diffusion process over a graph. Recent advances in network analysis have revealed that such diffusion processes are vulnerable to author deanonymization by adversaries with access to metadata, such as timing information. In this work, we ask the fundamental question of how to propagate anonymous messages over a graph to make it difﬁcult for adversaries to infer the source. In particular, we study the performance of a message propagation protocol called adaptive diffusion introduced in (Fanti et al., 2015). We prove that when the adversary has access to metadata at a fraction of corrupted graph nodes, adaptive diffusion achieves asymptotically optimal source-hiding and significantly outperforms standard diffusion. We further demonstrate empirically that adaptive diffusion hides the source effectively on real social networks.

1. Introduction

Popular social media platforms (Facebook, Twitter, Whatsapp, Kakao) allow users to seamlessly share potentially sensitive content with their friends. The privacy implications of such platforms are gaining attention, leading to the emergence of anonymous social networks like Whis-

Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

per (whi), Yik Yak (yik), Blind (bli) and the now-defunct Secret (sec). These anonymous messaging apps are microblogging services that hide message authorship from other users. When a user posts a message, the message passes (without authorship information) to the users contacts, or friends, in an underlying social network. If a message recipient approves a message by pressing the like button, the message is further propagated to the recipient s friends, and so on. The message thus spreads anonymously through the network, in the sense that no single user can learn who authored a message.

A natural question is whether a group of colluding nodes can identify which node in the social network authored a given message. This poses an interesting estimation problem on a random process over a network. Statistical inference plays a crucial role in understanding the propagation of inﬂuence and diffusion as studied in the data mining community (Gomez Rodriguez et al., 2010; Bakshy et al., 2011). In particular, recent advances in network analysis such as (Shah & Zaman, 2011; Pinto et al., 2012; Zhu et al., 2015) suggest that a moderately-powerful adversary can infer which node started the message, using limited metadata. However, the designers of anonymous messaging platforms have the freedom to inﬂuence content propagation by adding appropriately-chosen delays on top of natural human delays. Our goal is to design messaging protocols that make it provably difﬁcult to infer the initial source node, even for an adversary with inﬁnite computational power.

Adversarial Models. We consider an adversary that has access to the underlying contact network G(V, E). The adversary lacks the resources to monitor all network trafﬁc, but it can collect partial metadata in a number of ways:

One way is to explicitly corrupt some fraction of nodes by bribery or coercion; these corrupted spy nodes continuously monitor metadata like message timestamps and relay

Metadata-Conscious Anonymous Messaging

IDs; we call this a spy-based adversary. This models an adversary using fake or corrupted social media accounts to monitor users (Goldman, 2014).

Alternatively, an adversary could use side channels to collect information on whether each graph node is infected, i.e., whether it received the message, at a ﬁxed time; we call this a snapshot adversarial model. If an adversary uses spies and a snapshot, we call it a spy+snapshot adversary. The snapshot adversary has been well-studied in the literature, for both source identiﬁcation (Shah & Zaman, 2011) and source obfuscation (Fanti et al., 2015). However, spybased adversaries have only been studied for source identiﬁcation (Pinto et al., 2012). In this paper, we focus on spy-based adversaries, and brieﬂy discuss the implications of the spy+snapshot adversary in Sec. 4.

Precisely, under the spy-based model, each node other than the source is a spy with probability p, independently. At some point in time, the source node v starts propagating its message over the graph according to a spreading protocol chosen by the platform. Each spy node si V observes: (a) the time Tsi (relative to an absolute reference) at which it receives the message; (b) the parent node psi that relayed the message; and (c) any other metadata used by the spreading mechanism (such as control signaling in the message header). At some time, spies aggregate their observations; using the collected metadata and the structure of the underlying graph, the adversary estimates the author of the message, ˆv. We are interested in both sides of the problem: the adversary and the system designer. For the adversary, we want to design the maximum likelihood estimator of the source for a given messaging protocol. For the system designer, we want to design a spreading mechanism that minimizes the probability of detection, P(ˆv = v ), for an adversary using the maximum likelihood estimator. This is the focus of this paper.

Spreading mechanisms. A common construction for modeling epidemic propagation over networks is diffusion: each node spreads the message to its neighbors according to independent, random delays. Diffusion is a commonlystudied and useful model due to its simplicity and ﬁrstorder approximation of actual propagation dynamics. Critically, it captures the symmetric spreading of most social media platforms.

Finding a computationally-efﬁcient algorithm for (near-) optimal maximum likelihood (ML) message source inference is an open problem under the spy-based adversarial model, as is the corresponding detection probability analysis. Recent work (Pinto et al., 2012; Zhu et al., 2015) has focused on identifying the message source through heuristic, low-cost algorithms. These ﬁndings suggest that a spybased adversary with metadata can locate the source with high probability under diffusion spreading. Indeed, when

the underlying graph is a d-regular tree, we empirically observe that the probability of detection under diffusion increases with time and the degree of the underlying graph (Figure 1). This has poor implications for anonymity; contact networks may have high degree nodes, and the adversary is not time-constrained.

In the diffusion model used to generate Figure 1, each node propagates the message to each of its neighbors independently with probability q = 0.7 in each time step. We used the Gaussian estimator from (Pinto et al., 2012), which is suboptimal for this spreading model; as such, the plotted curves are lower bounds on the probability of detection using an ML estimator.

0 0.2 0.4 0.6 0.8 1 0

T=1 T=2 T=3 T=4 T=5

Spy probability p

d = 3, varying T

P(V = ˆv ML)

0 0.2 0.4 0.6 0.8 1 0

d=3 d=4 d=5

Spy probability p

T = 5, varying d

P(V = ˆv ML)

Figure 1. Probability of detection (suboptimal) when a message is spread using diffusion over a d-regular tree. Detection becomes more accurate as time and underlying graph degree increase.

We therefore seek a different spreading model with strong anonymity guarantees when the underlying graph has high degree, and estimation occurs at T = . Concretely, the designer is free to use any messaging protocol, under the following scenario. We consider a discrete time system, and at any given time the protocol is allowed to infect any of the adjacent neighbors of an infected node, who is not already infected. The infection can grow at most by 1hop each time, and we may choose to add additional delays. Under this setting, the analog of standard diffusion is choosing to spread with some ﬁxed probability, independently for each neighbor. Adaptive diffusion was introduced in (Fanti et al., 2015) to achieve optimal source

Metadata-Conscious Anonymous Messaging

obfuscation under the snapshot model. In this paper, we analyze the anonymity properties of adaptive diffusion under the spy-based model. There is no reason to believe a priori that adaptive diffusion should perform well against a spy-based adversary with its access to timing information; surprisingly, it does.

Contributions. Our contributions are as follows:

(1) We identify adaptive diffusion as an algorithm that provides strong anonymity guarantees against a spy-based adversary. Since (Fanti et al., 2015) contains multiple variants of adaptive diffusion, we identify the speciﬁc parameter setting under which it is both analytically tractable and provides strong anonymity guarantees.

(2) Under the spy-based adversarial model and adaptive diffusion spreading, we identify a computationallyefﬁcient algorithm for maximum likelihood source detection when the underlying contact network is inﬁnite and tree-structured (Algorithm 2).

(3) We give a precise analysis of the anonymity properties of adaptive diffusion. Such analysis is currently open for regular diffusion; we provide exact expressions for adaptive diffusion over regular trees (Theorem 1) and a lower bound for regular diffusion (Proposition 3.2), and show that our results are numerically stable for ﬁnite, irregular, cyclic social network graphs.

(4) We show that over regular trees, adaptive diffusion has asymptotically optimal hiding guarantees (Proposition 3.1) as the degree of the underlying tree increases. This differs from regular diffusion, whose anonymity properties degrade as degree increases. Intuitively, spies near the source provide more information than distant ones; by spreading symmetrically, diffusion ensures that all nearby spies receive the message. Adaptive diffusion instead spreads asymmetrically, thereby preventing most nearby spies from seeing the message early enough to deanonymize.

Related Work. A snapshot-based adversary observes which nodes are infected at a certain time T. When the infection spreads as per standard diffusion on a d-regular tree, efﬁcient ML estimators exist for ﬁnding the source from the snapshot (Shah & Zaman, 2011). Further, the adversary can identify the source with probability converging to a constant lower-bounded by 1/3, as the time-to-attack grows. Subsequent work suggests that even under various diffusion models and estimators, source detection with a snapshot is reliable (Wang et al., 2014; Prakash et al., 2012; Fioriti & Chinnici, 2012; Luo et al., 2014; Zhu & Ying, 2014; Milling et al., 2012a; 2013; 2012b; Khim & Loh, 2015).

If we know when the adversary will attack (time T), one solution for hiding the source on a d-regular tree is the follow-

ing: for the ﬁrst half of timesteps (until T/2), the infection propagates on a line in a randomly chosen direction; for the remaining half, the infection spreads as per diffusion from the end of the line. At time T all nodes in the boundary of the snapshot are equally likely to be the source, by symmetry. However, this line-and-diffusion protocol fails to protect the source if the adversary attacks sufﬁciently before or after time T. Adaptive diffusion was proposed to provide strong protection against a snapshot-based adversary (Fanti et al., 2015). At any time T, adaptive diffusion ensures that all nodes are equally likely to have been the source. This provides perfect obfuscation; no adversary can ﬁnd the source with probability larger than 1/NT where NT is the number of infected nodes.

When the adversary collects timestamps (and other metadata) from spy nodes, standard diffusion reveals the location of the source (Pinto et al., 2012; Zhu & Ying, 2014; Farajtabar et al., 2015). However, ML estimation is known to be NP-hard (Zhu et al., 2015), and analyzing the probability of detection is also challenging. Figure 1 shows that even with sub-optimal estimators, the source can be effectively identiﬁed. Since both snapshot and spy-based adversaries are plausible, we want to go beyond diffusion and line-and-diffusion. A natural question of interest is how to spread a message in order to provide strong protection against both types of adversaries: snapshot and spy-based. Related challenges include (a) identifying the best algorithm that the adversary might use to infer the location of the source; (b) providing analytical guarantees for the proposed spreading model; and (c) identifying the fundamental limit on what any spreading model can achieve. We address all of these challenges.

2. Adaptive diffusion

Under standard diffusion, inferring the source is easy (for a snapshot adversary) because the source is typically near the center of the snapshot. In (Fanti et al., 2015), the authors present two protocols that make such inference provably difﬁcult on trees with degree d > 2: the tree protocol and a generalization called adaptive diffusion . Intuitively, these protocols randomize the location of the source within each snapshot, making the source difﬁcult to locate.

If the underlying contact network is a tree, then the tree protocol is equivalent to adaptive diffusion for a speciﬁc choice of parameters; this parameter choice implies that the source node is always a leaf of the infected subgraph. General adaptive diffusion does not make this restriction. Over regular trees and a snapshot adversarial model, adaptive diffusion achieves the minimax optimal probability of detection 1/NT , when NT nodes have received the message at the time of attack (Fanti et al., 2015). Over random irregular trees with non-trivial i.i.d. degree distributions,

Metadata-Conscious Anonymous Messaging

the tree protocol was later shown to exhibit a multiplicative gap to optimality, i.e. P(ˆv = v )/(1/NT ), that grows exponentially in T (Fanti et al., 2016).

In this work, we consider a spy-based adversarial model and focus on the tree protocol, exploiting its simplicity and asymmetric spreading. Speciﬁcally, we show that the tree protocol achieves provably (asymptotically) optimal source obfuscation, signiﬁcantly improving upon standard diffusion. Moving forward, we use the terms tree protocol and adaptive diffusion interchangeably.

Algorithm 1 Tree protocol (Fanti et al., 2015)

1: Input: network G = (V, E), source v , time T 2: Output: infected subgraph GT = (VT , ET ) 3: V0 {v } 4: mv 0 and uv 5: v selects one of its neighbors w at random 6: V1 V0 {w} 7: mw 1 and uw 8: t 2 9: for t T do 10: for all v Vt 1 with uninfected neighbors and mv > 0 do 11: if uv = then 12: v selects one of its uninfected neighbors w at random 13: Vt Vt 1 {w} 14: mw mw + 1 and uw 15: end if 16: for all uninfected neighboring nodes z of v do 17: Vt Vt 1 {z} 18: uz and mz mv 1 19: end for 20: end for 21: t t + 1 22: end for

The tree protocol from (Fanti et al., 2015) is outlined in Algorithm 1; the goal is to build an infected subtree with the true source at one of the leaves at all times. This goal is accomplished by introducing state variables. Whenever a node v passes a message to node w, it includes three pieces of metadata: (1) the parent node pw = v, (2) a binary direction indicator uw { , }, and (3) the node s level, mw N, in the infected subtree. The parent pw is the node that relayed the message to w. The direction bit uw ﬂags whether node w is a spine node, responsible for increasing the depth of the infected subtree. The level mw describes the hop distance from w to the nearest leaf node in the ﬁnal infected subtree, as t . The parent metadata did not appear in the original tree protocol (Fanti et al., 2015), and is included purely to facilitate the adversary s source estimation. Even with this extra metadata revealed, the tree

protocol achieves asymptotically optimal hiding.

Figure 2. Message spread using the tree protocol from (Fanti et al., 2015) (left), and the information observed by the spy nodes 3, 7, and 8 (right). Timestamps in this ﬁgure are absolute, but they need not be.

At time t = 0, the source chooses a neighbor uniformly at random (e.g., node 1) and passes the message and metadata (p1 = 0, u1 = , m1 = 1). Figure 2 illustrates an example spread, in which node 0 passes the message to node 1. Yellow denotes spine nodes, which receive the message with uw = , and gray denotes those that receive it with uw = . Whenever a node w receives a message , there are two cases. If uw = , node w forwards the message to another neighbor z chosen uniformly at random with up metadata: (pz = w, uz = , mz = mw + 1). All of w s remaining neighbors z receive the message with down metadata: (pz = w, uz = , mz = mw 1). In Figure 2, node 1 passes the up message to node 2 and the down message to node 3. On the other hand, if uw = and mw > 0, node w forwards the message to all its remaining neighbors with down metadata: (pz = w, uz = , mz = mw 1). If a node receives mw = 0, it does not forward the message further. This protocol ensures that the infected subgraph is a symmetric tree with the true source at one of the leaves (see Algorithm 1).

3. Main results on d-regular trees

We show that adaptive diffusion hides the source better than diffusion over d-regular trees, d > 2, and its probability of detection is asymptotically optimal in the degree of the underlying tree. We ﬁrst present a lower bound on the probability of detection for any choice of spreading protocol. Proposition 3.1. When each node in the network is independently chosen to be a spy with probability p, no spreading protocol that infects at least one node can have a probability of detection less than p, i.e.

min protocol max ˆv P(ˆv = v ) p ,

where the minimization is over all spreading protocols that infect at least one node and the maximization is over all

Metadata-Conscious Anonymous Messaging

0 0.2 0.4 0.6 0.8 1 0

d=3 d=4 d=5 d=15 d=30 d=100 p

P(V = ˆv ML)

AD (Theoretical)

0 0.2 0.4 0.6 0.8 1 0

Adaptive diffusion Diffusion, lower bound Lower bound, p

AD vs. D (d = 3)

0 0.2 0.4 0.6 0.8 1 0

Adaptive diffusion Diffusion, lower bound Lower bound, p

AD vs. D (d = 5)

0 0.2 0.4 0.6 0.8 1 10 4

d=3 d=4 d=5 d=15 d=30 d=100 2(1 p)

E[δH(ˆv ML, v )]

Spy probability p

0 0.2 0.4 0.6 0.8 1 10 5

Adaptive diffusion, lower bound Diffusion, upper bound

Spy probability p

0 0.2 0.4 0.6 0.8 1 10 4

Adaptive diffusion, lower bound Diffusion, upper bound

Spy probability p

Figure 3. Adaptive diffusion (AD) theoretical performance for varying d (left). Adaptive diffusion improves over standard diffusion (D) and the gap increases as the degree of the underlying contact network increases (center, right).

estimators that are measurable functions over the observed meta-data and the network.

The proof considers a ﬁrst-spy estimator, which returns as the estimated source the parent (the sender of the message) of the ﬁrst spy to observe the message. Regardless of spreading mechanism, this estimator returns the true source with probability at least p; with probability p, the ﬁrst node (other than the true source) to receive the message is a spy. This is illustrated in the top panels of Figure 3 as a fundamental limit. Note that this lower bound is independent of the tree degree, and we expect the bound to be tighter for larger-degree trees. The reason is that if d is larger, then it is more likely that one of the neighbors of the source is a spy.

Standard diffusion. The ML estimator under standard diffusion is computationally intractable, and characterizing the probability of detection achieved by such an estimator is also an open problem. We consider a discrete-time diffusion process, in which each infected node passes the message to each neighbor with probability q in each timestep. As q increases, the variance of the associated geometric delay decreases, revealing the true source with higher probability. To lower bound the probability of detection achieved by the best estimator, we consider two heuristic estimators in the numerical experiments: (1) the Gaussian estimator

from (Pinto et al., 2012), and (2) the ﬁrst-spy estimator, which simply returns the parent of the ﬁrst spy to observe the message. The estimator in (Pinto et al., 2012) is ML when delays are i.i.d. Gaussian, whereas our delays are geometric. We nonetheless expect it to perform well for small p; since the distance between spies will be large, the delay distribution can be approximated by a Gaussian.

Figure 3 compares the probability of detection and expected hop distance for diffusion (q = 0.7) using heuristic estimators, against adaptive diffusion using the ML estimator. The lower bound on probability of detection for standard diffusion (top) is the maximum of the simulated Pinto et al. estimator (Pinto et al., 2012) and the ﬁrst-spy estimator; the opposite holds for expected hop distance (bottom). For all p, adaptive diffusion performs better than diffusion, and the gap increases with degree. This effect is sensitive to q for small d, but we show in Section 4 that over real social graphs, the sensitivity to q becomes negligible. We make this precise in the following lower bound:

Proposition 3.2. Suppose the contact network is a regular tree with degree d. Consider a spy-based adversary and diffusion spreading that is, in each timestep, each infected node infects each uninfected neighbor independently with probability q. The optimal source estimator achieves

Metadata-Conscious Anonymous Messaging

a detection probability at least

max ˆv P(ˆv = v ) 1 (1 qp)d ,

where the maximization is over all measurable functions over the observed meta-data and the network.

This bound implies that as degree increases, the probability of detecting the true source of diffusion approaches 1. This proposition can also be proved by considering the ﬁrst-spy estimator used in Proposition 3.1. We consider all neighbors of v that (a) are spies and (b) receive the message at t = 1. If there is at least one such node, then the source is identiﬁed with probability 1. Each neighbor of v meets these criteria with probability pq.

Adaptive diffusion. Unlike standard diffusion, the ML estimator is tractable under adaptive diffusion. Further, we can characterize the probability of detection achieved by this ML estimator precisely, and prove it signiﬁcantly improves over the standard diffusion and achieves the asymptotically optimal performance.

In the spy-based adversarial model, each spy si in the network observes any received messages, the associated metadata, and a timestamp Tsi. Figure 2 (right) illustrates the information observed by each spy node, where spies are outlined in red.

ML estimator under adaptive diffusion. The precise ML estimation algorithm is detailed in Algorithm 2. Because adaptive diffusion has deterministic timing, spies only help the estimator discard candidate nodes. We assume the message spreads for an inﬁnite time. There is at least one spy on the spine; consider the ﬁrst such spy to receive the message, s0. This spine spy (along with its parent and level metadata) allows the estimator to specify a feasible subtree in which the true source must lie. In Figure 2, node 8 is on the spine with level m8 = 4, so the feasible subtree is rooted at node 5 and contains all the pictured nodes except node 8 (9 s children and grandchildren also belong, but are not pictured). Spies outside the feasible subtree do not inﬂuence the estimator, because their information is independent of the source conditioned on s0 s metadata. Only leaves of the feasible subtree could have been the source e.g., nodes 0, 3, 6, and 7, as well as 9 s grandchildren.

The estimator then uses spies within the feasible subtree to prune out candidates. The goal is to identify nodes in the feasible subtree that are on the spine and close to the source. For each spy in the feasible subtree, there exists a unique path to the spine spy s0, and at least one node on that path is on the spine; the spies metadata reveals the identity and level of the spine node on that path with the lowest level we call this node a pivot (details in Algorithm 2). For instance, in Figure 2 (right), we can use spies 7 and 8 to learn that node 2 is a pivot with level m2 = 2. Es-

timation hinges on the minimum-level pivot across all spy nodes, ℓmin. In the example, ℓmin = 1, since spies 3 and 8 identify node 1 as a pivot with level m1 = 1. The true source must lie in a subtree rooted at a neighbor of ℓmin, with no spies. In our example, this leaves only node 0, the true source.

Algorithm 2 ML Source Estimator for Algorithm 1

1: Input: contact network G = (V, E), spy nodes S = {s0, s1 . . .} and metadata si : (psi, msi, usi) 2: Output: ML source estimate ˆv ML 3: Let s0 denote the lowest-level spine spy, with metadata (ps0, ms0, us0). 4: V {v V : δH(v, s0) ms0 and ps0 P(v, s0)}

5: E {(u, v) : (u, v) E and u, v V } 6: Deﬁne the feasible subgraph as F( V , E) 7: L {Set of feasible pivots} 8: K {Set of eliminated pivot neighbors} 9: for all s S with s V do

10: Let hs,ℓs hℓs,s0

|P(s, s0)| Ts0 Ts

11: ℓs v P(s, s0) : δH(s, ℓs) = hs,ℓs 12: ks v P(s, s0) : δH(s, ks) = hs,ℓs 1 13: L L {ℓs} {Add pivot} 14: K K {ks} {Add pivot neighbor} 15: end for 16: Find the lowest-level pivot: ℓmin argminℓ Lmℓ 17: U {Candidate sources} 18: for all v V where v is a leaf in F( V , E) do 19: if P(v, ℓmin) K = then 20: U U {v} 21: end if 22: end for 23: return ˆv ML, drawn uniformly from U

Anonymity properties of adaptive diffusion. Using the described ML estimation procedure, we can exactly compute the probability of detection when adaptive diffusion is run on a d-regular tree.

Theorem 1. Suppose the contact network is a regular tree with degree d > 2. There is a source node v , and each node other than the source is chosen to be a spy i.i.d. with probability p as described in the spy model. Against colluding spies trying to detect the location of the source, adaptive diffusion achieves the following:

(a) The probability of detection is

P(ˆv ML = v ) = p + 1 d 2

qk (d 1)k ,

Metadata-Conscious Anonymous Messaging

qk (1 (1 p)((d 1)k 1)/(d 2))d 1

+(1 p)((d 1)k+1 1)/(d 2).

(b) The expected distance between the source and the estimate is bounded by

E[δH(ˆv ML, v )] 2

k=1 k rk, (1)

where |Td,k| = (d 1)k 1

(1 (1 p)|Td,k|)d 1

+ (d 1)(1 p)|Td,k|

(d 2)(1 p)|Td,k|(d 1) 1 .

The proof is included in the Supplemental Materials. Brieﬂy, it computes the probability of detection by conditioning on the lowest-level pivot node, ℓmin. Given a pivot, the probability of detection depends on the number of subtrees rooted at the neighbors of ℓmin containing no spies. Figure 3 illustrates the theoretical probability of detection and lower bound on expected distance from the true source as a function of the spy probability. We make two key observations:

Asymptotically optimal probability of detection: As tree degree d increases, the probability of detection converges to the degree-independent fundamental limit in Proposition 3.1, i.e., P(V = ˆv ML) = p. This is in contrast to diffusion, whose probability of detection tends to 1 as d . The median Facebook user has 200 friends (Smith, 2014), so these asymptotics have practical implications, as we will see in Section 4.

Expected hop distance asymptotically increasing: We observe empirically that for regular diffusion, E[δH(ˆv ML, v )] approaches 0 as d increases. On the other hand, for adaptive diffusion with a ﬁxed p > 0, as d , lim sup E[δH(ˆv ML, v )] = 2(1 p). This holds because with probability (1 p), the ﬁrst node is not a spy, but with probability approaching 1 for d large enough, the ﬁrst node on the spine will be a pivot node. Since the source is always a leaf, the distance from the estimate to the source will be at most 2 with probability approaching (1 p). Figure 3 includes the line 2(1 p) for reference, and we observe that as d , E[δH(ˆv ML, v )] appears to converge precisely to this line. However, for a ﬁxed d, Theorem 1 implies that as p 0, E[δH(ˆv ML, v )] .

4. Generalizations

Graphs. Here, we consider irregular, ﬁnite graphs that arise in real contact networks. Regardless of spreading protocol, the message always propagates over a tree superimposed on the underlying contact network. The probability of detection over irregular trees is therefore tied to performance over general graphs. ML estimation over irregular trees is more straightforward than in (Fanti et al., 2015), because the tree protocol always places the source at a leaf.

0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0

Lower bound, p Adaptive diffusion Diffusion, q=0.1 Diffusion, q=0.5

Spy probability p

Figure 4. Probability of detection over the Facebook dataset (Viswanath et al., 2009), with standard error. Proposition 4.1. Suppose the underlying contact network G(V, E) is an irregular tree with the degree of each node dv > 1. A node v V starts spreading a message at time T = 0 according to Protocol 1. Each node v V , v = v is a spy with probability p. Let U denote the set of feasible candidate sources obtained by estimation Algorithm 2. Then the maximum likelihood estimate of v given U is ˆv ML = arg maxu U 1 deg(u) Q

v P(u,ℓmin)\{u,ℓmin} 1 deg(v) 1, where ℓmin is the lowest-level pivot node, P(u, ℓmin) is the unique shortest path between u and ℓmin, and deg(u) denotes the degree of node u. (Proof in Section C, Supplemental Materials.)

This ML estimator allows us to evaluate adaptive diffusion over real data (social graph connections for 10,000 Facebook users (Viswanath et al., 2009)) against a spy-based adversary. We simulate adaptive and regular diffusion for q {0.1, 0.5}. We evaluate diffusion with the ﬁrst-spy estimator, and adaptive diffusion with a modiﬁcation of the ML estimator in Proposition 4.1 that accounts for graph cycles. Figure 4 lists the probability of detection averaged over 200 trials, for p 0.15 (at its height, the Stasi employed 11 percent of the population as spies (Koehler, 1999)). Adaptive diffusion hides the source better than diffusion, and its probability of detection is close to the fun-

Metadata-Conscious Anonymous Messaging

damental lower bound of p. This is likely because the mean node degree in the dataset is 25, so high-degree asymptotics are signiﬁcant. While adaptive diffusion does not reach all nodes on a tree, cycles in the Facebook graph allow it to reach 81% of nodes within 20 timesteps.

Adversaries. The spy-based and snapshot adversarial models capture different behavior. The spy-snapshot model considers a combination of both: at a certain time T, the adversary collects both types of metadata and infers the source. Notably, this stronger model does not signiﬁcantly impact the probability of detection as time increases. The snapshot helps detection when there are few spies by revealing which nodes are true leaves. This effect is most pronounced for small T and/or small p. The exact analysis of the probability of detection at T is given in Equation (15) in the Supplementary material, and Figure 6 illustrates the tradeoff between snapshots and spy nodes.

5. Discussion

In this paper, we make the ﬁrst attempt to bridge the gap between state-of-the-art protocols in (Fanti et al., 2015) for source-hiding under a canonical, simpliﬁed, snapshotbased adversarial model, and practical adversarial scenarios. We ask how to spread a message over a graph while preventing a group of colluding nodes from inferring the message source. We show that standard diffusion, modeling existing messaging protocols, exhibits poor hiding properties; i.e. the probability of detection is at least 1 (1 0.5p)d, when each node is a spy with probability p on a d-regular tree. However, no protocol can achieve a probability of detection less than p on any connected graph. To bridge this gap, we propose adaptive diffusion a spreading protocol designed to defend against snapshot adversaries (Fanti et al., 2015) and show it exhibits asymptotically optimal source-hiding against spybased adversaries when the graph is a regular tree, and outperforms standard diffusion. We give an exact probability of detection in Theorem 1, illustrated in Figure 3.

A number of open questions remain, including an analysis of adaptive diffusion on more general graph structures. Another important question stems from the fact that in practice, users propagate (and receive) content asynchronously. Analyzing the effect of these delays on anonymity is an interesting open question. Finally, it is not clear that adaptive diffusion is the best scheme to defend against spy-based adversaries, since it was designed to defend against snapshot adversaries. It may be possible to hide the source better by taking a ﬁrst-principles approach to designing a new spreading mechanism tailored for spy-based adversaries. Beyond rumor spreading, there have been recent advances in ﬁnding the ﬁrst node to start a growing random network (Bubeck et al., 2014; Jog & Loh, 2016; 2015). An inter-

esting direction is to ask how one could grow a random network while preserving the anonymity of the ﬁrst node under certain constraints or utility.

ACKNOWLEDGMENTS

The authors thank the anonymous reviewers for their helpful comments and Paul Cuff for helpful discussions and for pointing out the Bayesian interpretation of the Polya s urn process. This work has been supported by NSF CISE awards CCF-1422278 and CCF-1553452, and Sa TC award CNS-1527754.

Team blind. http://us.teamblind.com/.

Secret. https://www.secret.ly.

Whisper. http://whisper.sh.

Yik yak. http://www.yikyakapp.com/.

Bakshy, E., Hofman, J. M., Mason, W. A., and Watts, D. J. Everyone s an inﬂuencer: quantifying inﬂuence on twitter. In WSDM, pp. 65 74. ACM, 2011.

Bubeck, S., Devroye, L., and Lugosi, G. Finding adam in random growing trees. ar Xiv preprint ar Xiv:1411.3317, 2014.

Fanti, G., Kairouz, P., Oh, S., and Viswanath, P. Spy vs. spy: Rumor source obfuscation. In ACM SIGMETRICS Perform. Eval. Rev., pp. 271 284, 2015.

Fanti, G., Kairouz, P., Oh, S., Ramchandran, K., and Viswanath, P. Rumor source obfuscation on irregular trees. In ACM SIGMETRICS Perform. Eval. Rev., 2016.

Farajtabar, M., Gomez-Rodriguez, M., Du, N., Zamani, M., Zha, H., and Song, L. Back to the past: Source identiﬁcation in diffusion networks from partially observed cascades. ar Xiv preprint ar Xiv:1501.06582, 2015.

Fioriti, V. and Chinnici, M. Predicting the sources of an outbreak with a spectral technique. ar Xiv preprint ar Xiv:1211.2333, 2012.

Goldman, David. Cops can use fake facebook accounts to lure suspects, says judge. CNN Money, 2014.

Gomez Rodriguez, M., Leskovec, J., and Krause, A. Inferring networks of diffusion and inﬂuence. In SIGKDD. ACM, 2010.

Jog, V. and Loh, P. Persistence of centrality in random growing trees. ar Xiv preprint ar Xiv:1511.01975, 2015.

Metadata-Conscious Anonymous Messaging

Jog, V. and Loh, P. Analysis of centrality in sublinear preferential attachment trees via the cmj branching process. ar Xiv preprint ar Xiv:1601.06448, 2016.

Johnson, N.L. and Kotz, S. Urn models and their application. Wiley New York, 1977.

Khim, J. and Loh, PL. Conﬁdence sets for the source of a diffusion in regular trees. ar Xiv preprint ar Xiv:1510.05461, 2015.

Koehler, J.O. STASI: The untold story of the East German secret police. Basic Books, 1999.

Luo, Wuqiong, Tay, Wee Peng, and Leng, Mei. How to identify an infection source with limited observations. JSTSP, 2014.

Milling, C., Caramanis, C., Mannor, S., and Shakkottai, S. Network forensics: Random infection vs spreading epidemic. In SIGMETRICS. ACM, 2012a.

Milling, C., Caramanis, C., Mannor, S., and Shakkottai, S. On identifying the causative network of an epidemic. In Allerton Conference, 2012b.

Milling, C., Caramanis, C., Mannor, S., and Shakkottai, S. Detecting epidemics using highly noisy data. In Mobi Hoc, 2013.

Pinto, P. C., Thiran, P., and Vetterli, M. Locating the source of diffusion in large-scale networks. Physical review letters, 109(6):068702, 2012.

Prakash, B. A., Vreeken, J., and Faloutsos, C. Spotting culprits in epidemics: How many and which ones? In ICDM, volume 12, pp. 11 20, 2012.

Shah, D. and Zaman, T. Rumors in a network: Who s the culprit? Information Theory, IEEE Transactions on, 57 (8):5163 5181, Aug 2011.

Smith, A. 6 new facts about Facebook. Pew Research, 2014.

Viswanath, B., Mislove, A., Cha, M., and Gummadi, K.P. On the evolution of user interaction in facebook. In Proc. of SIGCOMM WOSN. ACM, August 2009.

Wang, Z., Dong, W., Zhang, W., and Tan, C.W. Rumor source detection with multiple observations: Fundamental limits and algorithms. In ACM SIGMETRICS, 2014.

Zhu, Kai and Ying, Lei. A robust information source estimator with sparse observations. Computational Social Networks, 2014.

Zhu, Kai, Chen, Zhen, and Ying, Lei. Locating the contagion source in networks with partial timestamps. Data Mining and Knowledge Discovery, 2015.