Differentially Private Decentralized Learning with Random Walks

Edwige Cyffers (Université de Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France), Aurélien Bellet (Inria, Univ Montpellier, Montpellier, France), Jalaj Upadhyay (Rutgers University). Correspondence to: Edwige Cyffers.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

The popularity of federated learning comes from the possibility of better scalability and the ability for participants to keep control of their data, improving data security and sovereignty. Unfortunately, sharing model updates also creates a new privacy attack surface. In this work, we characterize the privacy guarantees of decentralized learning with random walk algorithms, where a model is updated by traveling from one node to another along the edges of a communication graph. Using a recent variant of differential privacy tailored to the study of decentralized algorithms, namely Pairwise Network Differential Privacy, we derive closed-form expressions for the privacy loss between each pair of nodes, where the impact of the communication topology is captured by graph-theoretic quantities. Our results further reveal that random walk algorithms yield better privacy guarantees than gossip algorithms for nodes close to each other. We supplement our theoretical results with an empirical evaluation on synthetic and real-world graphs and datasets.

1. Introduction

Federated learning allows multiple data owners to collaboratively train a model without sharing their data (Kairouz et al., 2021). Some federated algorithms rely on a central server to orchestrate the process and aggregate model updates (McMahan et al., 2017), with the downsides of creating a single point of failure and limiting scalability in the number of participants (Lian et al., 2017). In this work, we focus on fully decentralized algorithms that replace the central server by peer-to-peer communications between participants, viewed as nodes in a sparse network graph (Koloskova et al., 2020; 2021; Lian et al., 2017; Le Bars et al., 2023; Nedic et al., 2018; Tang et al., 2018). In addition to their scalability, these algorithms can exploit the natural graph structure of some applications, such as social networks where users are linked to their friends, or the geographical position of devices that induces faster communications with the closest users. Keeping data decentralized can reduce communication costs and latency, and it is also welcome from a privacy perspective when the data contains personal information or represents a crucial asset for businesses. However, sharing model parameters can indirectly leak sensitive information and allow data reconstruction attacks (see e.g. Geiping et al., 2020; Nasr et al., 2018; Shokri et al., 2017; Zhu et al., 2019). To mitigate this problem, differential privacy (DP) (Dwork et al., 2006) has become the de facto standard in machine learning to provide robust privacy guarantees. In a nutshell, DP compares two learning scenarios that differ only in the data of a single user and ensures that the output distribution of the algorithm remains similar. This, in particular, means that no attacker can learn too much about the private data of a user by inspecting the outputs, even if they have access to arbitrary auxiliary information.
Decentralized learning algorithms can be made differentially private by having each node add noise to its model updates before sharing them with its neighbors in the graph (Bellet et al., 2018; Cheng et al., 2019; Huang et al., 2015; Xu et al., 2021; Zhang et al., 2018). An important challenge is then to bound the privacy leakage as tightly as possible, given the level of noise and the threat model considered, so as to achieve the best possible privacy-utility trade-off. The baseline approach relies on local DP, which assumes that all information a node sends is observed by all other nodes. This leads to overly pessimistic privacy guarantees for decentralized algorithms, because nodes only observe the messages sent by their direct neighbors. Recent work has shown that it is possible to leverage the graph topology of decentralized algorithms to develop more tailored privacy guarantees, specific to the relation between the different nodes, based on the notion of Pairwise Network Differential Privacy (PNDP) (Cyffers & Bellet, 2022; Cyffers et al., 2022). PNDP takes into account the fact that each node only has a local view of the communications, and allows one to reason about the privacy leakage between each pair of nodes based on these local views. Intuitively, if two nodes are farther apart, the privacy leakage should depend on their relative position in the graph, which matches the natural setting where edges come with some trust level, as in social networks where edges indicate relationships. Cyffers et al. (2022) showed that this intuition is correct for gossip algorithms, where all nodes update their current model and mix it with their neighbors' at every step. However, a major drawback of gossip algorithms is that, even in their more asynchronous versions, they generate redundant communications and require all nodes to be largely available (since any node can be updated at any time). Redundant communication and availability requirements have been identified as a major obstacle in distributed private learning (Smith et al., 2017).

Our Contributions. In this work, we study the privacy guarantees of random walk algorithms (Lopes & Sayed, 2007; Johansson et al., 2010; Mao et al., 2020; Cyffers & Bellet, 2022; Even, 2023), a popular alternative to gossip in the fully decentralized setting. In these algorithms, a token holding the model's current state is updated by a node at each time step and then forwarded to a random neighbor. Random walk algorithms do not require global synchronization, as the token is sent as soon as the current node finishes its update. They can also easily cope with temporary unavailability, and are known to be fast in practice. We first introduce a private version of decentralized stochastic gradient descent (SGD) based on random walks on arbitrary graphs: in a nutshell, the node holding the model at a given step updates it with a local SGD step, adds Gaussian noise, and forwards it to one of its neighbors chosen with appropriate probability. Focusing on the strongly convex setting, we then establish a convergence rate for our algorithm by building upon recent results on SGD under Markovian sampling (Even, 2023), and show that the result compares favorably to its gossip SGD counterpart. Our main contribution lies in precisely characterizing the privacy loss between all pairs of nodes using a PNDP analysis.
We obtain elegant closed-form expressions that hold for arbitrary graphs, capturing the effect of a particular choice of graph through graph-theoretic quantities. We also show how our general closed-form expression yields explicit and interpretable results for specific graphs. Finally, we use synthetic and real graphs and datasets to illustrate our theoretical results and show their practical relevance compared to the gossip algorithms analyzed in previous work. In summary, our contributions are as follows:

1. We propose a private version of random walk stochastic gradient descent for arbitrary graphs (Algorithm 1);
2. We establish its convergence rate for strongly convex loss functions (Theorem 2);
3. We derive closed-form expressions for the privacy loss between each pair of nodes that capture the effect of the topology through graph-theoretic quantities (Theorem 3);
4. We theoretically and experimentally compare our guarantees to those of gossip algorithms, highlighting the superiority of our approach in several regimes.

2. Related Work

Random walks for decentralized optimization. Optimizing the sum of local objective functions by having a token walk on the graph has a long history in the optimization community (Johansson et al., 2010; Lopes & Sayed, 2007; Mao et al., 2020). The main difficulty is to handle the bias introduced by the sampling of the nodes, as a random walk locally forces a structure that differs from the stationary distribution (Sun et al., 2018). One way to avoid this bias is to restrict the underlying graph structure to the complete graph (Cyffers & Bellet, 2022; Cyffers et al., 2023) or to perform an update only after several steps of the walk (Hendrikx, 2022), but this comes at a high communication cost. In this work, we rely on a recent proof that casts Markov chain updates as a specific case of stochastic gradient descent with delays, removing the dependency on the Markov sampling by waiting a sufficient number of steps in the analysis (Even, 2023).

Private decentralized optimization. A classic line of work to improve privacy in a decentralized setting aims to prove that nodes cannot access enough information to reconstruct exactly the contribution of a given node (Gupta et al., 2018; Mo & Murray, 2017). However, these approaches do not provide robust guarantees against approximate reconstruction attacks or adversaries with auxiliary knowledge. Another direction is to rely on local DP (Bellet et al., 2018; Cheng et al., 2019; Huang et al., 2015; Xu et al., 2021; Zhang et al., 2018), where each node assumes that everything it shares is public, but this comes at a high cost in utility (Chan et al., 2012; Wang et al., 2018; Zheng et al., 2017). While it is possible to mitigate this drawback by using other schemes such as shuffling or secure aggregation (Bonawitz et al., 2017; Cheu et al., 2019; Feldman et al., 2020; Liew et al., 2022), these require additional computation and communication as well as an overhaul of the system architecture, making them very difficult to deploy in practice. These limitations have motivated the development of intermediary trust models specific to fully decentralized settings.

Network differential privacy. Bellet et al. (2020) were the first to suggest that decentralization can amplify privacy. This was made precise by Cyffers & Bellet (2022), who introduced Network Differential Privacy (NDP) and proved better privacy guarantees for the complete graph and the ring graph. For the complete graph, Cyffers et al.
(2023) further used NDP to analyze the privacy guarantees of decentralized ADMM. The case of the ring graph was further studied by Yakimenka et al. (2022b), taking into account the additional complexity of dealing with straggler nodes that slow down the computation. In this work, we rely on the pairwise NDP variant introduced in Cyffers et al. (2022), where the authors study private gossip algorithms on arbitrary graphs. We provide both theoretical and empirical comparisons to these prior results, showing significant advantages in favor of our random walk algorithm.

3. Preliminaries

Definition 1 (Irreducibility). A Markov chain defined by a transition matrix $W \in \mathbb{R}^{n \times n}$ is called irreducible if for any two states $i, j$, there exists an integer $t$ such that $\Pr[X_t = j \mid X_0 = i] > 0$.

Definition 2 (Aperiodicity). The period of a state $i$ is defined as the greatest common divisor of the set of natural numbers $\{t : \Pr[X_t = i \mid X_0 = i] > 0\}$. A state $i$ is called aperiodic if its period equals 1. A Markov chain is aperiodic if all its states are aperiodic.

Irreducibility and aperiodicity ensure that, after enough steps, the probability of transiting from any state to any other state is positive.

Definition 3 (Stationary distribution). For a Markov chain defined by the transition matrix $W \in \mathbb{R}^{n \times n}$, a probability distribution $\pi$ is a stationary distribution if $\pi W = \pi$.

4. Problem Setting and Background

We consider a set $V = \{1, \ldots, n\}$ of users (nodes), each user $v$ holding a local dataset $D_v$. The nodes aim to privately optimize a separable function over the joint data $D = \cup_{v \in V} D_v$:

$$\min_{x \in \mathbb{R}^d} f(x) = \sum_{v=1}^{n} \pi_v f_v(x), \qquad (1)$$

where $x \in \mathbb{R}^d$ represents the parameters of the model, the local function $f_v$ depends only on the local dataset of node $v$, and $\pi_v \geq 0$ is the weight given to $f_v$ (in practice, the vector $\pi$ will correspond to the stationary distribution of the random walk as defined below). Below, we introduce the notions and assumptions related to random walks and precisely define the privacy threat model we consider.

4.1. Random Walks

We consider that the underlying network is represented by a connected graph $G = (V, E)$. Two nodes $u$ and $v$ are neighbors when there is an edge $(u, v) \in E$, which indicates that $u$ and $v$ can communicate. Our random walk algorithm will involve a token following a Markov chain on this graph, where the probability of taking each edge is given by an $n \times n$ transition matrix $W$: if the token is at $u$ at step $t$, then $v$ receives the token at time $t + 1$ with probability $W_{uv}$.

Definition 4 (Transition matrix). A transition matrix $W$ on a graph $G = (V, E)$ is a stochastic matrix, i.e., $\forall u \in V, \sum_{v \in V} W_{uv} = 1$, which satisfies $(u, v) \notin E \Rightarrow W_{uv} = 0$.

In particular, the probability for the token to go from node $u$ to node $v$ in $k$ steps is given by the corresponding entry of the $k$-th power of the transition matrix, $(W^k)_{uv}$. To derive convergence in optimization, we need standard assumptions on this transition matrix, which ensure that the Markov chain behaves similarly to a fixed distribution after enough iterations.

Assumption 1. The transition matrix is aperiodic and irreducible, that is, there exists a time $t_0$ such that for all $t \geq t_0$ and any pair of vertices $u$ and $v$, $(W^t)_{uv} > 0$, i.e., the token can go from $u$ to $v$ in $t$ steps.

Under this assumption, the Markov chain has a stationary distribution $\pi$ (belonging to the $n$-dimensional simplex) such that $\pi = \pi W$, and the convergence speed is governed by the mixing time (Levin & Peres, 2017).
Definition 5 (Mixing time). The mixing time $\tau_{\mathrm{mix}}(\iota)$ of a Markov chain with transition matrix $W$ is the time needed for the walk to be within a factor $\iota$ of its asymptotic behavior:

$$\tau_{\mathrm{mix}}(\iota) = \min\Big\{ t : \max_{v} \big\| (W^t)_{v\cdot} - \pi \big\|_{TV} \leq \iota \Big\}, \qquad (2)$$

where $\|P - Q\|_{TV}$ is the total variation distance between two probability measures $P$ and $Q$ defined over the same measurable space $(\Omega, \mathcal{F})$: $\|P - Q\|_{TV} = \sup_{A \in \mathcal{F}} |P(A) - Q(A)|$.

We sometimes omit $\iota$ and write $\tau_{\mathrm{mix}} := \tau_{\mathrm{mix}}(1/4)$. From this quantity, it is easy to derive the mixing time for an arbitrary $\iota$ via the inequality $\tau_{\mathrm{mix}}(\iota) \leq \lceil \log_2(1/\iota) \rceil\, \tau_{\mathrm{mix}}$ (Levin & Peres, 2017). The mixing time depends on the spectral gap of the graph, which is the difference between the largest eigenvalue of the transition matrix $W$ (which is always equal to one and associated with $\pi$) and the absolute value of the second-largest eigenvalue. We denote this quantity by $\lambda_W$, from which we get the following bound on the mixing time: $\tau_{\mathrm{mix}} \leq \lambda_W^{-1} \ln(1 / \min_v \pi_v)$.

A natural choice of transition matrix is to give the same importance to every node and to assume symmetry in the weight of the communications.

Assumption 2. The transition matrix $W$ is bi-stochastic and symmetric.

Under this assumption, we have $\pi = \frac{1}{n}\mathbf{1}$, where $\mathbf{1}$ is the all-one vector. As $W$ is symmetric, it can be decomposed using the spectral theorem, and all its eigenvalues are real. For any connected graph, it is possible to construct a transition matrix that satisfies this assumption, for instance by using the Hamilton weighting on the graph, where the transition probability of each edge is determined by the degrees of its endpoints. Letting $d_u$ denote the degree of node $u$, it sets, for each edge $(u, v) \in E$ with $u \neq v$,

$$W_{uv} = \Pr(X_{t+1} = v \mid X_t = u) = \frac{1}{\max\{d_u, d_v\}},$$

with the remaining probability mass placed on the self-loop $W_{uu}$.

Remark 1. Our optimization results of Section 5 will only require Assumption 1 and thus cover (1) with non-uniform weights. Conversely, our privacy results of Section 6 will only require Assumption 2.
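To make this construction concrete, the following minimal sketch (assuming NumPy and NetworkX are available; the function names are ours for illustration, not from the paper's released code) builds the Hamilton weighting for an arbitrary graph, checks Assumption 2, and evaluates the spectral gap together with the resulting mixing-time bound.

```python
import numpy as np
import networkx as nx

def hamilton_transition_matrix(G: nx.Graph) -> np.ndarray:
    """Symmetric bi-stochastic W with W[u, v] = 1 / max(deg(u), deg(v))
    on each edge, and the leftover mass placed on the self-loop W[u, u]."""
    n = G.number_of_nodes()
    W = np.zeros((n, n))
    for u, v in G.edges():
        W[u, v] = W[v, u] = 1.0 / max(G.degree[u], G.degree[v])
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))  # rows (and columns) now sum to 1
    return W

def mixing_time_bound(W: np.ndarray) -> float:
    """tau_mix <= ln(1 / min_v pi_v) / lambda_W = ln(n) / lambda_W here,
    since the stationary distribution is uniform under Assumption 2."""
    eigenvalues = np.linalg.eigvalsh(W)                 # real spectrum: W symmetric
    lambda_W = 1.0 - np.sort(np.abs(eigenvalues))[-2]   # spectral gap
    return np.log(W.shape[0]) / lambda_W

G = nx.random_regular_graph(4, 128, seed=1)
W = hamilton_transition_matrix(G)
assert np.allclose(W, W.T) and np.allclose(W.sum(axis=1), 1.0)  # Assumption 2
print(f"bound on tau_mix: {mixing_time_bound(W):.1f} steps")
```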
4.2. Privacy Threat Model

In this paper, we aim to quantify how much information each node $u \in V$ leaks about its local dataset $D_u$ to any other node $v$ during the execution of a decentralized learning algorithm. We assume nodes to be honest-but-curious (i.e., they faithfully follow the protocol) and non-colluding (see Appendix D for the possibility of collusion and how it can be seen as a modification of the graph). We consider the graph $G$ and the transition matrix $W$ to be known by all nodes. This scenario is quite common; for instance, in the healthcare domain (where nodes represent hospitals and collaboration between hospitals is public knowledge) or in social networks. We will measure the privacy leakage using Differential Privacy (DP), and more precisely a variant known as Pairwise Network Differential Privacy (PNDP) (Cyffers & Bellet, 2022; Cyffers et al., 2022), tailored to the analysis of decentralized algorithms. In the rest of this section, we introduce the relevant definitions and tools and formally define what is observed by each node during the execution of a random walk algorithm.

Differential Privacy. DP (Dwork & Roth, 2014) quantifies the privacy loss incurred by a randomized algorithm $A$ by comparing its output distribution on two adjacent datasets $D$ and $D'$. The guarantee thus depends on the granularity chosen for the adjacency relation. In this work, we adopt user-level DP, where $D = \cup_{v \in V} D_v$ and $D' = \cup_{v \in V} D'_v$ are adjacent datasets, denoted by $D \sim D'$, if there exists at most one user $v \in V$ such that $D_v \neq D'_v$. We further denote $D \sim_v D'$ if $D$ and $D'$ differ only in the local dataset of user $v$.

We use Rényi Differential Privacy (RDP) to measure the privacy loss, due to its theoretical convenience and better composition properties than the classical $(\epsilon, \delta)$-DP. We recall that any $(\alpha, \varepsilon)$-RDP algorithm is also $(\varepsilon + \ln(1/\delta)/(\alpha - 1), \delta)$-DP for any $0 < \delta < 1$ (Mironov, 2017).

Definition 6 (Rényi Differential Privacy). An algorithm $A$ satisfies $(\alpha, \varepsilon)$-Rényi Differential Privacy (RDP) for $\alpha > 1$ and $\varepsilon > 0$ if for all pairs of neighboring datasets $D \sim D'$:

$$D_\alpha\big(A(D) \,\|\, A(D')\big) \leq \varepsilon, \qquad (3)$$

where $D_\alpha(X \| Y)$ is the Rényi divergence between the random variables $X$ and $Y$:

$$D_\alpha(X \| Y) = \frac{1}{\alpha - 1} \ln \int \mu_X(z)^{\alpha}\, \mu_Y(z)^{1 - \alpha}\, dz,$$

with $\mu_X$ and $\mu_Y$ the respective densities of $X$ and $Y$.

The Gaussian mechanism ensures RDP by adding Gaussian noise to the output of a non-private function $g$: the mechanism $A(D) = g(D) + \eta$ with $\eta \sim \mathcal{N}(0, \sigma^2)$ satisfies $(\alpha, \alpha \Delta_g^2 / 2\sigma^2)$-RDP for any $\alpha > 1$, where $\Delta_g = \sup_{D \sim D'} \|g(D) - g(D')\|_2$ is the sensitivity of $g$ (Mironov, 2017). This baseline privacy guarantee can be amplified when the result is not directly observed but instead used for subsequent computations. In particular, when considering consecutive applications of non-expansive (i.e., 1-Lipschitz) operators, we can rely on the so-called privacy amplification by iteration effect, which we leverage in our analysis.

Theorem 1 (Privacy amplification by iteration, Feldman et al., 2018). Let $T_1, \ldots, T_K$ and $T'_1, \ldots, T'_K$ be non-expansive operators, $x_0 \in U$ an initial random state, and $(\zeta_k)_{k=1}^{K}$ a sequence of noise distributions. Consider the noisy iterations $x_{k+1} = T_{k+1}(x_k) + \eta_{k+1}$ and $x'_{k+1} = T'_{k+1}(x'_k) + \eta'_{k+1}$, where $\eta_{k+1}$ and $\eta'_{k+1}$ are drawn independently from distribution $\zeta_{k+1}$. Let $s_k = \sup_{x \in U} \|T_k(x) - T'_k(x)\|$. Let $(a_k)_{k=1}^{K}$ be a sequence of non-negative real numbers such that

$$\sum_{i \leq k} s_i \geq \sum_{i \leq k} a_i \ \text{ for all } k \leq K, \quad \text{and} \quad \sum_{i \leq K} s_i = \sum_{i \leq K} a_i. \qquad (4)$$

Then,

$$D_\alpha\big(x_K \,\|\, x'_K\big) \leq \sum_{k=1}^{K} \sup_{x : \|x\| \leq a_k} D_\alpha\big(\zeta_k * \delta_x \,\|\, \zeta_k\big), \qquad (5)$$

where $*$ is the convolution of probability distributions and $\delta_x$ denotes the distribution of the random variable that is always equal to $x$.

Pairwise Network Differential Privacy. PNDP allows us to capture the limited view that nodes have in decentralized algorithms and to model privacy guarantees specific to each pair of nodes (Cyffers & Bellet, 2022; Cyffers et al., 2022). Below, the view of a user $v$ is denoted by $O_v(A(D))$.

Definition 7 (Pairwise Network DP). For $b : V \times V \to \mathbb{R}^{+}$, an algorithm $A$ satisfies $(\alpha, b)$-Pairwise Network DP (PNDP) if for all pairs of distinct users $u, v \in V$ and neighboring datasets $D \sim_u D'$:

$$D_\alpha\big(O_v(A(D)) \,\|\, O_v(A(D'))\big) \leq b(u, v). \qquad (6)$$

We denote by $\varepsilon_{u \to v} = b(u, v)$ the privacy loss from $u$ to $v$ and say that $u$ is $(\alpha, \varepsilon_{u \to v})$-PNDP with respect to $v$ when inequality (6) holds for $b(u, v) = \varepsilon_{u \to v}$.

For the random walk algorithms we consider, the complete output $A(D)$ consists of the trajectory of the token and its successive values during training. At a given step, the token shares its current value only with its current location, and the other nodes cannot see this state. Thus, we define the view of a node $v$ as

$$O_v\big(A(D)\big) = \big\{ (t, x_t, w) : \text{the token } x_t \text{ was at } v \text{ at time } t \text{ and was then sent to } w \big\}. \qquad (7)$$

In this definition, nodes know to whom they send the token, but not from whom they receive it. Anonymity of the sender can be achieved by using mix networks (Sampigethaya & Poovendran, 2006) or anonymous routing (Dingledine et al., 2004). However, our results directly extend to the case where the sender's anonymity cannot be ensured; see Remark 3 in Section 6.
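The quantities above have simple closed forms. The sketch below (helper names are ours, for illustration) computes the RDP guarantee of the Gaussian mechanism, the shifted-Gaussian Rényi divergence $D_\alpha(\zeta * \delta_x \| \zeta) = \alpha \|x\|^2 / 2\sigma^2$ for $\zeta = \mathcal{N}(0, \sigma^2 I)$ that appears on the right-hand side of (5), and the RDP-to-DP conversion recalled above.

```python
import math

def gaussian_mechanism_rdp(alpha: float, sensitivity: float, sigma: float) -> float:
    """(alpha, alpha * Delta_g^2 / (2 sigma^2))-RDP of A(D) = g(D) + N(0, sigma^2)."""
    return alpha * sensitivity ** 2 / (2 * sigma ** 2)

def shifted_gaussian_divergence(alpha: float, shift_norm: float, sigma: float) -> float:
    """D_alpha(N(x, s^2 I) || N(0, s^2 I)) for ||x|| = shift_norm: the per-step
    term summed in Theorem 1 when the noise distributions are Gaussian."""
    return alpha * shift_norm ** 2 / (2 * sigma ** 2)

def rdp_to_dp(alpha: float, eps_rdp: float, delta: float) -> float:
    """(alpha, eps)-RDP implies (eps + ln(1/delta) / (alpha - 1), delta)-DP."""
    return eps_rdp + math.log(1.0 / delta) / (alpha - 1.0)
```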
5. Private SGD with Random Walks

In this section, we introduce a decentralized stochastic gradient descent (SGD) random walk algorithm to privately approximate the minimizer of (1), and analyze its convergence in the strongly convex case. This algorithm, presented in Algorithm 1, generalizes the private random walk algorithm on the complete graph, introduced and analyzed by Cyffers & Bellet (2022), to arbitrary graphs. Differential privacy is achieved by adding Gaussian noise to the local gradient at each step. The step size is constant over time, as commonly done in (centralized) differentially private stochastic gradient descent (DP-SGD) (Bassily et al., 2014).

The non-private version of this algorithm converges in various settings under Assumption 1. In this work, we adapt a recent proof for the non-private version (Even, 2023). For simplicity, we focus on strongly convex and smooth objectives with bounded gradients at the global optimum.

Assumption 3 (Bounded gradient and strong convexity). We assume that $f$ is $\mu$-strongly convex and $L$-smooth. Let $x^\star$ be its minimizer. We assume that, for some $\zeta \geq 0$,

$$\forall v \in V, \quad \|\nabla f_v(x^\star)\|^2 \leq \zeta^2.$$

In the case of stochastic gradient descent, stochasticity also comes from the fact that we sample from the local dataset. To handle both cases, we define $g_t$ as an unbiased estimator of $\nabla f_{v_t}(x_t)$. We thus need to bound the variance of this estimator.

Algorithm 1: PRIVATE RANDOM WALK GRADIENT DESCENT (RW DP-SGD)
Input: transition matrix $W$ on a graph $G$, number of iterations $T$, noise variance $\sigma^2$, starting node $v_0$, initial token value $x_0$, step size $\gamma$, gradient sensitivity $\Delta$, local loss functions $f_v$
1: for $t = 0$ to $T - 1$ do
2:   Draw $\eta \sim \mathcal{N}(0, \Delta^2 \sigma^2)$
3:   Compute $g_t$ s.t. $\mathbb{E}[g_t] = \nabla f_{v_t}(x_t)$
4:   $x_{t+1} \leftarrow x_t - \gamma (g_t + \eta)$
5:   Draw $u \sim W_{v_t}$ in the set of neighbors of $v_t$
6:   Send token to $u$

Assumption 4 (Bounded local noise). We assume that the stochastic gradients satisfy

$$\mathbb{E}\big[ \|g_t - \nabla f_{v_t}(x_t)\|^2 \,\big|\, x_t, v_t \big] \leq \sigma_{\mathrm{sgd}}^2.$$

Theorem 2. Under Assumptions 1, 3, and 4, for the step size

$$\gamma = \min\bigg\{ \frac{1}{L},\ \frac{1}{T\mu} \ln\Big( \frac{T \mu^2 \|x_0 - x^\star\|^2}{39 L \tau_{\mathrm{mix}} \zeta^2} \Big) \bigg\},$$

the iterates satisfy

$$\mathbb{E}\big( \|x_T - x^\star\|^2 \big) \leq 2 e^{-T\mu/L} \|x_0 - x^\star\|^2 + \frac{39\, \tau_{\mathrm{mix}} \zeta^2 L}{\mu^3 T} + \frac{(d \sigma^2 \Delta^2 + \sigma_{\mathrm{sgd}}^2) L}{\mu^2 T} \ln\Big( \frac{T \mu^2 \|x_0 - x^\star\|^2}{39 L \tau_{\mathrm{mix}} \zeta^2} \Big).$$

Proof. We adapt the proof from Even (2023), which keeps track of the shift due to the Markov sampling of the random walk through a comparison with delayed gradients. Following the same bounding steps and taking expectation over the noise distribution, we obtain an inequality on the iterates similar to that of the non-private version, up to an additional term for the noise. Setting the step size to balance the terms leads to the final inequality. See Appendix A for a full proof.

The convergence rate in Theorem 2 has three terms: the exponential convergence towards the minimizer, parameterized by the condition number $L/\mu$ of $f$; the impact of the stochasticity of the random walk, in $O(\tau_{\mathrm{mix}} \zeta^2 L / \mu^3 T)$; and the additional term due to the noise injection, in $O(\sigma^2 \Delta^2 L / \mu^2 T)$. A similar term would appear when adapting the non-private convergence proof to other settings, such as under the Polyak-Łojasiewicz condition.
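A minimal Python sketch of Algorithm 1 follows (assuming NumPy; `grad_fns[v]` is a user-supplied stochastic gradient oracle, assumed to be clipped so that the sensitivity bound $\Delta$ holds; this is an illustration, not the paper's released implementation):

```python
import numpy as np

def rw_dp_sgd(W, grad_fns, x0, v0, T, gamma, delta_sens, sigma, seed=None):
    """Private random walk gradient descent (sketch of Algorithm 1).
    W          : (n, n) transition matrix satisfying Assumption 1
    grad_fns   : grad_fns[v](x) = unbiased, pre-clipped estimate of grad f_v(x)
    delta_sens : gradient sensitivity Delta; noise is N(0, Delta^2 sigma^2 I)
    """
    rng = np.random.default_rng(seed)
    n, x, v = W.shape[0], np.array(x0, dtype=float), v0
    for _ in range(T):
        g = grad_fns[v](x)                                       # local step at v_t
        eta = rng.normal(0.0, delta_sens * sigma, size=x.shape)  # Gaussian noise
        x = x - gamma * (g + eta)                                # noisy update
        v = rng.choice(n, p=W[v])                                # forward the token
    return x
```

Note that the noisy token value is the only quantity ever transmitted, which is precisely what the privacy analysis of Section 6 exploits.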
Comparison with private gossip (Cyffers et al., 2022). In our random walk algorithm, each step involves computing the gradient of a single node and sending a single message, whereas the private gossip SGD algorithm of Cyffers et al. (2022) alternates between the computation of local gradients at all nodes in parallel and a multi-step gossip communication phase run until (approximate) consensus. Rephrasing the result of Cyffers et al. (2022) in our notation, each communication phase requires $O(\tau_{\mathrm{mix}} \ln(\zeta^2 / \sigma^2))$ steps in which all the nodes send updates synchronously. For the optimization part, the first term is the same, in $O(e^{-T\mu/L} \|x_0 - x^\star\|^2)$, but there is only one other term, in $O(\sigma^2 L / n \mu T)$. Hence, we lose a factor of $n$, but the reduced communication compensates for this. Our analysis is tighter in the sense that we can separate out the privacy noise, which is independent of the Markov chain, thereby improving the rate in the spectral gap factor compared to a naive analysis. In contrast, the private gossip analysis casts this noise as gradient heterogeneity, because it reuses the non-private convergence analysis of Koloskova et al. (2020).

Special case of averaging. One can use the above algorithm to privately compute the average of the values held by the nodes. In this case, we assume that each node has a private value $y_v$ (a scalar or a vector) and define the local objective function as:

$$f_v(x) = \|x - y_v\|^2. \qquad (8)$$

Note that, in this case, we have $L = \mu = 2$. A natural approach is to compute the running average of the visited values with noise injection at each step, which corresponds to Algorithm 1 with a decreasing step size $\gamma_t = 1/t$. The drawback of this approach is that damping the first terms of the sum (when the Markov chain is not yet well mixed) requires a lot of steps and results in slow convergence (at least $\tau_{\mathrm{mix}}$ steps). Adopting a constant step size instead, as in Algorithm 1, does not modify the leading convergence terms, which stay in $O(1/T)$ for an adequate step size. One way to completely remove the influence of the first terms is to have a burn-in phase, in which the token walks without performing any update so as to come closer to the stationary distribution. Then, after $\tau_{\mathrm{mix}}(\iota/2)$ steps, a running average with error $4\delta^2 / (\iota^3 \lambda_W)$ is obtained, as proven in Theorem 12.21 of Levin & Peres (2017).
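For instance, private averaging can be run directly with the Algorithm 1 sketch given earlier (the names rw_dp_sgd and hamilton_transition_matrix refer to those illustrative sketches, not to the paper's released code; clipping is omitted for brevity):

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n, d = 200, 5
Y = rng.normal(size=(n, d))                  # private local values y_v
# f_v(x) = ||x - y_v||^2, so grad f_v(x) = 2 (x - y_v) and L = mu = 2
grad_fns = [lambda x, y=Y[v]: 2.0 * (x - y) for v in range(n)]

G = nx.connected_watts_strogatz_graph(n, 4, 0.3, seed=0)
W = hamilton_transition_matrix(G)
x = rw_dp_sgd(W, grad_fns, x0=np.zeros(d), v0=0, T=20 * n,
              gamma=0.05, delta_sens=1.0, sigma=2.0, seed=0)
print(np.linalg.norm(x - Y.mean(axis=0)))    # distance to the true average
```

The constant step size here follows the discussion above: it avoids the slow damping of the early, poorly mixed iterates that a $1/t$ step size would require.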
6. Privacy Analysis

In this section, we derive the privacy guarantees of Algorithm 1 for arbitrary graphs and show how this leads to improved trade-offs for specific graphs widely used in decentralized learning. Our main result is a closed-form expression for the privacy loss between each pair of nodes, which holds for arbitrary graphs.

Theorem 3. Consider a graph $G$ with transition matrix $W$ satisfying Assumption 2. After $T$ iterations, for a level of noise $\sigma^2 \geq 2\alpha(\alpha - 1)$, the privacy loss of Algorithm 1 from node $u$ to node $v$ is bounded by:

$$\varepsilon_{u \to v} \leq O\bigg( \frac{\alpha T}{\sigma^2 n} \Big( \frac{\ln T}{n} - \Big[ \ln\Big( I - W + \frac{1}{n}\mathbf{1}\mathbf{1}^\top \Big) \Big]_{uv} \Big) \bigg).$$

We recall that the logarithm of a matrix is the matrix with the same eigenvectors whose eigenvalues are the logarithms of the original eigenvalues. In particular, if $\lambda_1, \ldots, \lambda_n$ are the $n$ eigenvalues (counting multiplicity) of a bi-stochastic symmetric matrix $M$ and $x_1, \ldots, x_n$ are the corresponding eigenvectors, then $\log(M) = \sum_{i=1}^{n} \log(\lambda_i)\, x_i x_i^\top$. Note that Assumption 2 ensures this matrix logarithm is well-defined.

Sketch of proof. We give a high-level overview of the proof here and refer to Appendix B for details. We fix the two vertices $u$ and $v$ and examine how a token visit to $u$ leaks information to $v$. By the post-processing property of RDP, it is sufficient to compute the privacy loss that occurs when the token reaches $v$ for the first time after the visit to $u$. To compute this loss, we use the weak convexity of the Rényi divergence (Van Erven & Harremos, 2014) to condition on the number of steps taken before reaching $v$; the probability of each walk length is given by the corresponding power of the transition matrix. For a given length, we bound the privacy loss using privacy amplification by iteration (Theorem 1). Refactoring the sum yields the logarithm of the matrix. We conclude by composing over the $O(T/n)$ visits of the token to $u$ during the walk.

Remark 2. For the complete graph, the second term is equal to zero, as the transition matrix is exactly $W = \frac{1}{n}\mathbf{1}\mathbf{1}^\top$. Thus, we recover the bound of Cyffers & Bellet (2022). In other words, Theorem 3 is a generalization of this previous result to arbitrary graphs.

Remark 3. Theorem 3 holds for the definition of the view of a node given in (7), where the sender is kept anonymous. We provide in Appendix B.1 a similar theorem for the case where the senders are known. In this case, although the formula is more complex, the asymptotics are the same, as knowing the sender roughly shifts the privacy guarantees by one hop.

6.1. Interpretation of the Formula via Communicability

The privacy loss of Theorem 3 has two terms that we can interpret as follows. The first term is the same as in Cyffers & Bellet (2022) for the complete graph: we have an $O(1/n^2)$ privacy amplification factor compared to local DP, matching what would be obtained in central DP with $n$ users. Our analysis reveals that this term also appears for arbitrary graphs: we can interpret it as a baseline privacy loss that stems from the collaboration of all agents. However, in graphs other than the complete graph, this baseline privacy loss is corrected by the second term, which depends on the specific pair $(u, v)$ of nodes considered. Note that this second term can be negative for some pairs, as eigenvector components can have arbitrary signs. This quantity can be seen as a variant of known graph centrality metrics and, more precisely, of communicability.

Definition 8 (Communicability, Estrada & Hatano, 2008; Estrada & Knight, 2015). For a transition matrix $W$ and a non-increasing positive series $(c_i)$ ensuring convergence, the communicability between two vertices $u, v$ is defined by:

$$G_{uv} = \sum_{i=1}^{\infty} c_i\, (W^i)_{uv}.$$

For $c_i = 1/i!$, this corresponds to the original notion of communicability as presented in Estrada & Hatano (2008), while for $c_i = \alpha^i$ we recover the Katz centrality (Katz, 1953). Our formula corresponds to the case $c_i = \alpha / (\sigma^2 i)$, as proven in Appendix B, and the convergence of the infinite sum is ensured by the fact that we remove the graph component associated with the eigenvalue 1. Communicability is used to detect local structures in complex networks, with applications to community detection and graph clustering, for instance in physical applications (Estrada & Knight, 2015). Communicability is meant to capture how well connected two nodes are in the network, i.e., how close they are. Hence, having the second term of the privacy loss proportional to the communicability of the nodes shows that our formula matches the intuition that nodes leak more information to closer nodes than to more distant ones.
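The bound of Theorem 3 is directly computable for any graph. A sketch follows (assuming NumPy and SciPy; the constant hidden in the $O(\cdot)$ is dropped, so the output should be read as relative rather than absolute):

```python
import numpy as np
from scipy.linalg import logm

def pairwise_privacy_loss(W: np.ndarray, T: int, alpha: float, sigma: float) -> np.ndarray:
    """eps[u, v] ~ (alpha T / (sigma^2 n)) * (ln(T)/n - log(I - W + J/n)[u, v]),
    following Theorem 3 with the O(.) constant dropped."""
    n = W.shape[0]
    M = np.eye(n) - W + np.ones((n, n)) / n  # positive definite under Assumption 2
    log_M = logm(M).real                     # same eigenvectors, log-ed eigenvalues
    return alpha * T / (sigma ** 2 * n) * (np.log(T) / n - log_M)
```

Since $-\ln(1 - \lambda) = \sum_{i \geq 1} \lambda^i / i$, the second term is exactly the communicability of Definition 8 with $c_i = \alpha / (\sigma^2 i)$, restricted to the modes other than the eigenvalue-1 component.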
6.2. Application to Specific Graphs

We now give closed-form expressions of the privacy loss for specific graphs. To show the power of our bound, we note that the results for the two network graphs studied in Cyffers & Bellet (2022), namely the complete graph and the ring graph, follow as corollaries of our general result. We illustrate below the possibility of deriving closed formulas for other classes of graphs. Proofs are given in Appendix E.

Star graph. We consider a star graph with a central node in the first position linked to the $n - 1$ other nodes. We choose the transition matrix such that the self-loop probability $\kappa > 0$ is an arbitrarily small constant, and the distribution over the non-central nodes is uniform. Then we have the following privacy guarantees.

Theorem 4. Let $u, v \in V$ be two distinct nodes of the star graph and $\kappa > 0$ be an arbitrarily small constant. For a single contribution of node $u$ in Algorithm 1 on the star graph, the privacy loss to node $v$ is bounded by:

$$\varepsilon_{u \to v} \leq \begin{cases} -\dfrac{\alpha(1-\kappa)}{\sigma^2 (n-1)} \ln\Big( 1 - \dfrac{1}{n-1} \Big) & \text{if } u \neq 1 \text{ and } v \neq 1, \\[2mm] \dfrac{\alpha(1-\kappa)}{2 \sigma^2 \sqrt{n-1}} \ln\Big( \dfrac{\sqrt{n-1} + 1}{\sqrt{n-1} - 1} \Big) & \text{if } u = 1 \text{ or } v = 1. \end{cases}$$

Sketch of proof. At a high level, as $\kappa$ is an arbitrarily small constant, we can upper bound the entries of all the powers of the adjacency matrix of the star graph. In particular, composing over the $O(T/n)$ contributions, we see that extremal nodes enjoy a privacy amplification factor of order $O(n^2)$ and the central node of order $O(n)$.

Ring graph. We consider a symmetric ring where nodes are numbered from 1 to $n$. This setting thus slightly differs from the one studied in Cyffers & Bellet (2022) and Yakimenka et al. (2022a), where the ring is directed and thus deterministic (up to the possibility of skipping nodes in Yakimenka et al., 2022a). For this graph, our results show that the amplification is parameterized by the distance between the nodes on the ring.

Theorem 5. Let $u, v \in V$ be two distinct nodes of the ring, with $a = (u + v - 2) \bmod n$ and $\bar{\alpha} = \alpha \mathbf{1}_{|u - v| = 1}$. For a single contribution of node $u$ in Algorithm 1 on the ring graph, the privacy loss to node $v$, $\varepsilon_{u \to v}$, is bounded by

$$\varepsilon_{u \to v} \leq \frac{\bar{\alpha}}{\sigma^2} + \frac{\alpha \log(T)}{\sigma^2 n} + \frac{\alpha}{\sigma^2 n} \sum_{k=1}^{n-1} \cos\Big( \frac{2\pi a k}{n} \Big) \ln\big( 3 \csc^2(\pi k / n) \big)$$

in the case of equal probability of going left, going right, or self-looping. Furthermore, if the probability of self-looping is set to $\kappa > 0$, then $\varepsilon_{u \to v}$ is bounded by

$$\varepsilon_{u \to v} \leq \frac{\bar{\alpha}}{\sigma^2} + \frac{\alpha \log(T)}{n \sigma^2} + \frac{\alpha (1 - \kappa)}{\sigma^2 n} \sum_{k=1}^{n-1} \sum_{t \geq 2} \cos^{t-1}\Big( \frac{2\pi k}{n} \Big) \cos\Big( \frac{2\pi (a + 1) k}{n} \Big).$$

Sketch of proof. At a high level, we use the fact that the adjacency matrix of a ring graph is a circulant matrix, so its eigenvectors are the Fourier modes. Therefore, its full spectral decomposition can be easily computed.

To get an intuition regarding Theorem 5, consider two nodes that are close to each other. For simplicity of calculation, consider nodes 1 and 2 in the setting of the second statement. Then $a = 0$ and $\cos^{t-1}(2\pi k / n) \cos(2\pi (a + 1) k / n) = \cos^{t}(2\pi k / n)$. Therefore,

$$\varepsilon_{u \to v} \leq \frac{\alpha}{\sigma^2 n} + \frac{\alpha}{\sigma^2 n} \sum_{k=1}^{n} \frac{\cos^2(2\pi k / n)}{1 - \cos(2\pi k / n)}.$$
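These closed forms can be sanity-checked against the general expression of Theorem 3, reusing the illustrative helpers hamilton_transition_matrix and pairwise_privacy_loss defined earlier (again a sketch, not the paper's code):

```python
import numpy as np
import networkx as nx

n, T, alpha, sigma = 64, 10_000, 2.0, 5.0
for name, G in [("complete", nx.complete_graph(n)),
                ("star", nx.star_graph(n - 1)),   # node 0 is the hub
                ("ring", nx.cycle_graph(n))]:
    eps = pairwise_privacy_loss(hamilton_transition_matrix(G), T, alpha, sigma)
    np.fill_diagonal(eps, np.nan)                 # loss is between distinct pairs
    print(f"{name:>8}: min={np.nanmin(eps):.3f}  max={np.nanmax(eps):.3f}")
```

On the complete graph the matrix-logarithm term nearly vanishes and all pairs get the same baseline loss, while on the star and the ring the spread between the minimum and maximum reflects the pair-dependent correction discussed above.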
7. Experiments

In this section, we illustrate our results numerically on synthetic and real graphs and datasets and show that our random walk approach achieves superior privacy-utility trade-offs compared to gossip as long as the mixing time of the graph is good enough. The code is available at https://github.com/totilas/DPrandomwalk

Privacy losses and comparison with the gossip counterpart. We generate synthetic graphs with $n = 2048$ nodes and report the privacy loss, averaged over 5 runs, for every pair of nodes of the graphs as a function of the length of their shortest path in Figure 1. The transition matrix is computed using the Hamilton weighting. To compare with the private gossip algorithm (Cyffers et al., 2022), we consider the task of averaging (thus with $L = \mu = 2$) and consider the same precision level and the same graphs: an exponential graph, an Erdös-Rényi graph with $q = c \log(n)/n$ for $c > 1$, a grid, and a random geometric graph. Our private random walk approach incurs a smaller privacy loss for close enough nodes than the private gossip algorithm. Remarkably, our approach improves upon the baseline local DP loss even for very close nodes. Conversely, the privacy loss of our approach is generally higher for more distant nodes. Nevertheless, random walks offer uniformly better privacy guarantees than gossip algorithms for graphs with good connectivity, such as Erdös-Rényi graphs or expanders.

Figure 1. Comparison of privacy loss as a function of shortest path length for random walks (solid lines) and gossip (dashed lines) on the same synthetic graphs (exponential, Erdös-Rényi, random geometric, grid) with n = 2048. Random walks allow privacy amplification even for very close nodes, but the decay is slower than for gossip.

Logistic regression on synthetic graphs. We train a logistic regression model on a binarized version of the UCI Housing dataset (https://www.openml.org/d/823/). The objective function corresponds to $f_v(x) = \frac{1}{|D_v|} \sum_{(d, y) \in D_v} \ln(1 + \exp(-y x^\top d))$, where $d \in \mathbb{R}^d$ and $y \in \{-1, 1\}$. As in Cyffers & Bellet (2022), Cyffers et al. (2022), and Yakimenka et al. (2022b), we standardize the features, normalize each data point, and split the dataset uniformly at random into a training set (80%) and a test set (20%). We further split the training set across 2048 users, resulting in local datasets of 8 samples each.

In a first experiment, we compare centralized DP-SGD, local DP-SGD, and our random walk-based DP-SGD. For all algorithms, we follow common practice and clip the updates to tightly control the sensitivity. We set $\varepsilon = 1$ and $\delta = 10^{-6}$. Following Cyffers et al. (2022), we use the mean privacy loss over all pairs of nodes (computed by applying Theorem 3) to set the noise level needed for our random walk-based DP-SGD. Figure 2 reports the objective function through iterations for complete, hypercube, and random geometric graphs. Experimentally, the behavior is the same for all the graphs, meaning that the token walk is diverse enough in every case to behave similarly to a uniformly random choice of nodes. The improvement in the privacy-utility trade-off compared to local DP is significant.

Figure 2. Private logistic regression on the Houses dataset, comparing our RW DP-SGD (on complete, expander, and geometric graphs) with local and centralized DP-SGD as baselines.

We then compare our random walk algorithm to its gossip counterpart (Cyffers et al., 2022) on the same logistic regression task. For both algorithms, we fix the mean privacy loss $\bar{\varepsilon}$ across all pairs of nodes to three different levels ($\bar{\varepsilon} \in \{0.5, 1, 2\}$) and report the accuracy reached by each algorithm on 4 graphs: complete, exponential, geometric, and grid. As shown in Table 1, our random walk algorithm outperforms gossip in all cases, which can be explained by a combination of two factors. First, as seen previously in Figure 1, our algorithm yields a lower mean privacy loss for graphs with good expansion properties, especially when the degree of the nodes is high (because the privacy guarantees of gossip degrade linearly with the degree, while the random walk is insensitive to it). This gain directly translates into less noise injection when fixing the mean privacy loss, and thus better utility. The second factor that explains why the gain in utility is so pronounced (even for the grid) comes from differences between the SGD versions of gossip and random walk. In both methods, the privacy guarantee degrades as a function of the number of participations of each node (linearly in Rényi DP). In gossip, allowing each node to participate 10 times means that the model will essentially be learned by applying 10 global gradient updates (i.e., aggregated over the $n$ nodes), because all local gradients are gossiped until convergence between consecutive gradient computations (see Algorithm 3 in Cyffers et al., 2022). In our random walk algorithm, the model is learned by applying $10n$ local gradient updates. Even if this represents the same amount of information about the data, better progress is made with many noisy steps than with a small number of less noisy steps, in the same way as mini-batch SGD tends to progress faster than GD in practice.
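Calibrating the noise from a target mean privacy loss, as done above, is straightforward because the bound of Theorem 3 scales exactly as $1/\sigma^2$. A sketch reusing the hypothetical pairwise_privacy_loss helper (the paper's released code may proceed differently):

```python
import numpy as np

def calibrate_sigma(W: np.ndarray, T: int, alpha: float, target_mean_eps: float) -> float:
    """Noise level sigma whose mean pairwise loss equals target_mean_eps.
    The loss scales as 1/sigma^2, so one evaluation at sigma = 1 suffices;
    one should additionally check that sigma^2 >= 2 alpha (alpha - 1)."""
    off_diag = ~np.eye(W.shape[0], dtype=bool)
    c = pairwise_privacy_loss(W, T, alpha, 1.0)[off_diag].mean()
    return float(np.sqrt(c / target_mean_eps))
```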
Privacy loss on real graphs and communicability. We consider two real-world graph datasets well-suited for community detection: (i) the Facebook Ego dataset (Leskovec & Mcauley, 2012), which contains subgraphs of the Facebook network, where users are nodes and edges correspond to the friendship relation, and each subgraph corresponds to the set of friends of a unique user who is removed from the subgraph; (ii) the Davis Southern women social network (Roberts, 2000), a graph with 32 nodes derived from a bipartite graph of social event attendance by women, which has been used in Koloskova et al. (2019) and Pasquini et al. (2023). We report side by side the matrices of pairwise privacy losses and of communicability in Figure 3(a) and Figure 3(b). This confirms the similarity between the two quantities discussed in Section 6.1. We also report the Katz centrality against the mean privacy loss of each node in Figure 3(c), showing that these two quantities also behave similarly. We provide other numerical experiments in Appendix C: we report the privacy loss on the other Facebook Ego graphs and study the impact of data heterogeneity.

Figure 3. Link between graph structure and privacy loss. Left: (a) example of a Facebook Ego graph, communicability and privacy loss, logarithmic scale. Middle: (b) same on the Southern women graph. Right: (c) the corresponding mean privacy loss and Katz centrality.

Table 1. Model accuracy at various mean privacy loss levels, averaged over 8 runs.

Graph         Gossip        Random walk
Mean privacy loss of 0.5
  Complete    0.65 ± 0.10   0.841 ± 0.07
  Exponential 0.70 ± 0.10   0.818 ± 0.09
  Geometric   0.60 ± 0.07   0.795 ± 0.06
  Grid        0.60 ± 0.07   0.803 ± 0.10
Mean privacy loss of 1
  Complete    0.70 ± 0.10   0.900 ± 0.04
  Exponential 0.77 ± 0.05   0.883 ± 0.05
  Geometric   0.66 ± 0.10   0.873 ± 0.05
  Grid        0.73 ± 0.10   0.848 ± 0.07
Mean privacy loss of 2
  Complete    0.83 ± 0.06   0.940 ± 0.02
  Exponential 0.89 ± 0.04   0.937 ± 0.01
  Geometric   0.67 ± 0.04   0.933 ± 0.02
  Grid        0.72 ± 0.10   0.919 ± 0.02

8. Conclusion

In this work, we analyzed the convergence and privacy guarantees of private random walks on arbitrary graphs, extending the known results for the complete graph and the ring. Our results show that random walk-based decentralized algorithms provide favorable privacy guarantees compared to gossip algorithms, and establish a link between the privacy loss between two nodes and the notion of communicability in graph analysis.
Remarkably, as long as the spectral gap of the communication graph is large enough, the random walk approach nearly bridges the gap between the local and the central models of differential privacy. Our results could be broadened by showing convergence under more general hypotheses. Other extensions could include skipping some nodes in the walk, accounting for latency, or having several tokens running in parallel.

Impact Statement

This work contributes to proving formal privacy guarantees in machine learning and thus paves the way for the larger adoption of privacy-preserving methods, with a better theoretical understanding. In particular, our work shows that decentralized algorithms may be useful in designing efficient privacy-preserving systems with good utility; the impact of the choice of a decentralized algorithm on privacy guarantees had not been studied so far. We believe our work opens interesting directions for designing decentralized protocols that are private by design.

A potential risk of misuse of our work is that we rely on different privacy budgets across different pairs of users, as done previously in Cyffers et al. (2022). This could sometimes lead to a false sense of security or weaker privacy guarantees than those provided in other threat models. In particular, the question of comparing the privacy guarantees we provide to those of the central model could be discussed. Our experiments use the mean privacy loss of nodes to set the comparison, but it is interesting to note that, for the same mean privacy loss, the distribution of the privacy loss for gossip and random walk algorithms is not the same. Indeed, as shown in Figure 1, random walks have a smoother decay but enjoy privacy amplification even for closely related nodes, whereas gossip tends to saturate the closest nodes first, followed by exponential decay. It is an interesting question how to interpret these privacy loss profiles in order to choose the algorithm whose privacy guarantees best meet the privacy expectations of a particular use case.

Acknowledgments

This work was supported by grant ANR-20-CE23-0015 (Project PRIDE), the ANR-20-THIA-0014 program AI_PhD@Lille, and the ANR 22-PECY-0002 IPOP (Interdisciplinary Project on Privacy) project of the Cybersecurity PEPR. JU's research was supported by a Decanal Research Grant from Rutgers University.

References

Bassily, R., Smith, A., and Thakurta, A. Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pp. 464-473, 2014.

Bellet, A., Guerraoui, R., Taziki, M., and Tommasi, M. Personalized and Private Peer-to-Peer Machine Learning. In AISTATS, 2018.

Bellet, A., Guerraoui, R., and Hendrikx, H. Who started this rumor? Quantifying the natural differential privacy guarantees of gossip protocols. In DISC, 2020.

Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., Ramage, D., Segal, A., and Seth, K. Practical Secure Aggregation for Privacy-Preserving Machine Learning. In CCS, 2017.

Chan, T.-H. H., Shi, E., and Song, D. Optimal Lower Bound for Differentially Private Multi-party Aggregation. In ESA, 2012.

Cheng, H.-P., Yu, P., Hu, H., Zawad, S., Yan, F., Li, S., Li, H. H., and Chen, Y. Towards Decentralized Deep Learning with Differential Privacy. In CLOUD, 2019.

Cheu, A., Smith, A., Ullman, J., Zeber, D., and Zhilyaev, M.
Distributed differential privacy via shuffling. In Advances in Cryptology - EUROCRYPT 2019: 38th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Darmstadt, Germany, May 19-23, 2019, Proceedings, Part I, pp. 375-403. Springer, 2019.

Cyffers, E. and Bellet, A. Privacy Amplification by Decentralization. In AISTATS, 2022.

Cyffers, E., Even, M., Bellet, A., and Massoulié, L. Muffliato: Peer-to-Peer Privacy Amplification for Decentralized Optimization and Averaging. In Advances in Neural Information Processing Systems 35 (NeurIPS), New Orleans, United States, 2022. URL https://cnrs.hal.science/hal-03906768.

Cyffers, E., Bellet, A., and Basu, D. From noisy fixed-point iterations to private ADMM for centralized and federated learning. In Proceedings of the 40th International Conference on Machine Learning, 2023.

Dingledine, R., Mathewson, N., and Syverson, P. Tor: the second-generation onion router. In Proceedings of the 13th Conference on USENIX Security Symposium - Volume 13, 2004.

Dwork, C. and Roth, A. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006, Proceedings, pp. 265-284. Springer, 2006.

Estrada, E. and Hatano, N. Communicability in complex networks. Physical Review E, 77(3), March 2008. doi: 10.1103/physreve.77.036111. URL https://doi.org/10.1103/physreve.77.036111.

Estrada, E. and Knight, P. A First Course in Network Theory. Oxford University Press, United Kingdom, March 2015. ISBN 978-0-19-872645-6.

Even, M. Stochastic gradient descent under Markovian sampling schemes. In ICML, 2023.

Feldman, V., Mironov, I., Talwar, K., and Thakurta, A. Privacy amplification by iteration. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), 2018. doi: 10.1109/focs.2018.00056. URL http://dx.doi.org/10.1109/FOCS.2018.00056.

Feldman, V., McMillan, A., and Talwar, K. Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling. Technical report, arXiv:2012.12803, 2020.

Geiping, J., Bauermeister, H., Dröge, H., and Moeller, M. Inverting gradients - how easy is it to break privacy in federated learning? In NeurIPS, 2020.

Godsil, C. and Royle, G. F. Algebraic graph theory, volume 207. Springer Science & Business Media, 2001.

Gupta, N., Katz, J., and Chopra, N. Information-theoretic privacy in distributed average consensus. arXiv, abs/1809.01794, 2018.

Hendrikx, H. A principled framework for the design and analysis of token algorithms, 2022. URL https://arxiv.org/abs/2205.15015.

Horn, R. A. and Johnson, C. R. Matrix analysis. Cambridge University Press, 2012.

Huang, Z., Mitra, S., and Vaidya, N. Differentially Private Distributed Optimization. In ICDCN, 2015.

Johansson, B., Rabi, M., and Johansson, M. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157-1170, 2010. doi: 10.1137/08073038X. URL https://doi.org/10.1137/08073038X.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., D'Oliveira, R. G. L., Eichner, H., Rouayheb, S.
E., Evans, D., Gardner, J., Garrett, Z., Gascón, A., Ghazi, B., Gibbons, P. B., Gruteser, M., Harchaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Konecný, J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P., Mohri, M., Nock, R., Özgür, A., Pagh, R., Qi, H., Ramage, D., Raskar, R., Raykova, M., Song, D., Song, W., Stich, S. U., Sun, Z., Suresh, A. T., Tramèr, F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F. X., Yu, H., and Zhao, S. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1-2):1-210, 2021.

Katz, L. A new status index derived from sociometric analysis. Psychometrika, 18:39-43, 1953.

Koloskova, A., Stich, S. U., and Jaggi, M. Decentralized stochastic optimization and gossip algorithms with compressed communication. In ICML, 2019.

Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. A unified theory of decentralized SGD with changing topology and local updates. In Proceedings of the 37th International Conference on Machine Learning, 2020.

Koloskova, A., Lin, T., and Stich, S. U. An improved analysis of gradient tracking for decentralized machine learning. In Advances in Neural Information Processing Systems, 2021.

Le Bars, B., Bellet, A., Tommasi, M., Lavoie, E., and Kermarrec, A.-M. Refined convergence and topology learning for decentralized SGD with heterogeneous data. In Ruiz, F., Dy, J., and van de Meent, J.-W. (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pp. 1672-1702. PMLR, 2023. URL https://proceedings.mlr.press/v206/le-bars23a.html.

Leskovec, J. and Mcauley, J. Learning to discover social circles in ego networks. In Pereira, F., Burges, C., Bottou, L., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper/2012/file/7a614fd06c325499f1680b9896beedeb-Paper.pdf.

Levin, D. and Peres, Y. Markov Chains and Mixing Times. American Mathematical Society, 2017. ISBN 9781470429621. URL https://books.google.fr/books?id=f208DwAAQBAJ.

Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. In NIPS, 2017.

Liew, S. P., Takahashi, T., Takagi, S., Kato, F., Cao, Y., and Yoshikawa, M. Network shuffling: Privacy amplification via random walks. arXiv, abs/2204.03919, 2022.

Lopes, C. G. and Sayed, A. H. Incremental adaptive strategies over distributed networks. IEEE Transactions on Signal Processing, 55(8):4064-4077, 2007. doi: 10.1109/TSP.2007.896034.

Mao, X., Yuan, K., Hu, Y., Gu, Y., Sayed, A. H., and Yin, W. Walkman: A communication-efficient random-walk algorithm for decentralized optimization. IEEE Transactions on Signal Processing, 68:2513-2528, 2020. doi: 10.1109/tsp.2020.2983167. URL http://dx.doi.org/10.1109/TSP.2020.2983167.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. A. y. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Singh, A. and Zhu, J. (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 1273-1282. PMLR, 2017.

Mironov, I. Rényi differential privacy.
CoRR, abs/1702.07476, 2017. URL http://arxiv.org/abs/1702.07476.

Mo, Y. and Murray, R. M. Privacy preserving average consensus. IEEE Transactions on Automatic Control, 62(2):753-765, 2017. doi: 10.1109/TAC.2016.2564339.

Nasr, M., Shokri, R., and Houmansadr, A. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning, 2018.

Nedic, A., Olshevsky, A., and Rabbat, M. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953-976, May 2018.

Pasquini, D., Raynal, M., and Troncoso, C. On the (in)security of peer-to-peer decentralized machine learning. In IEEE Symposium on Security and Privacy (S&P), 2023.

Roberts, J. M. Correspondence analysis of two-mode network data. Social Networks, 22(1):65-72, 2000. ISSN 0378-8733. doi: 10.1016/S0378-8733(00)00017-4. URL https://www.sciencedirect.com/science/article/pii/S0378873300000174.

Sampigethaya, K. and Poovendran, R. A survey on mix networks and their secure applications. Proceedings of the IEEE, 94(12):2142-2181, 2006.

Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models. In CCS, 2017.

Smith, A., Thakurta, A., and Upadhyay, J. Is interaction necessary for distributed private learning? In 2017 IEEE Symposium on Security and Privacy (SP), pp. 58-77. IEEE, 2017.

Sun, T., Sun, Y., and Yin, W. On Markov chain gradient descent. In Neural Information Processing Systems, 2018.

Tang, H., Lian, X., Yan, M., Zhang, C., and Liu, J. D2: Decentralized training over decentralized data. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Van Erven, T. and Harremos, P. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797-3820, 2014.

Wang, D., Gaboardi, M., and Xu, J. Empirical Risk Minimization in Non-interactive Local Differential Privacy Revisited. In NeurIPS, 2018.

Xu, J., Zhang, W., and Wang, F. A(DP)2SGD: Asynchronous decentralized parallel stochastic gradient descent with differential privacy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

Yakimenka, Y., Weng, C.-W., Lin, H.-Y., Rosnes, E., and Kliewer, J. Straggler-resilient differentially-private decentralized learning. In 2022 IEEE Information Theory Workshop (ITW), pp. 708-713, 2022a.

Yakimenka, Y., Weng, C.-W., Lin, H.-Y., Rosnes, E., and Kliewer, J. Straggler-resilient differentially-private decentralized learning. In 2022 IEEE Information Theory Workshop (ITW), 2022b. doi: 10.1109/itw54588.2022.9965898. URL http://dx.doi.org/10.1109/ITW54588.2022.9965898.

Zhang, X., Khalili, M. M., and Liu, M. Improving the Privacy and Accuracy of ADMM-Based Distributed Algorithms. In ICML, 2018.

Zheng, K., Mou, W., and Wang, L. Collect at Once, Use Effectively: Making Non-interactive Locally Private Learning Possible. In ICML, 2017.

Zhu, L., Liu, Z., and Han, S. Deep leakage from gradients. In Advances in Neural Information Processing Systems, 2019.

A. Proof of Theorem 2

We adapt the proof of Theorem 5 of Even (2023). Because constants matter in differential privacy, we detail the steps needed to make these constants explicit. We recall the key steps and definitions, and how we adapt the proofs to allow for the addition of Gaussian noise.
We keep the same definitions, except that the sequence of token iterates is now defined by

$$x_{t+1} = x_t - \gamma (g_t + \eta_t).$$

This does not impact Lemma 8, which only relies on the graph properties and the function at the optimum. Up to this transposition of notation, we obtain for any $T \geq 1$ the same bound as in the non-private case, with the gradient variance increased by the variance $d \Delta^2 \sigma^2$ of the injected noise.

B. Proof of Theorem 3

Since $W$ is symmetric and bi-stochastic, its eigenvalues satisfy $1 = \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n > -1$, and the eigenvector associated with the first eigenvalue is $\frac{1}{\sqrt{n}}\mathbf{1}$. Hence, we can isolate the first term and plug it into (13) to get:

$$\varepsilon^{\mathrm{single}}_{u \to v} \leq \sum_{t=1}^{T} \frac{\alpha}{\sigma^2 t} \cdot \frac{1}{n} + \sum_{t=1}^{T} \sum_{i=2}^{n} \frac{\alpha}{\sigma^2 t}\, \lambda_i^t\, \phi_i(u) \phi_i(v). \qquad (15)$$

In these sums, we isolate $\sum_{t=1}^{T} \lambda_i^t / t$. Noticing that the sum converges for all these eigenvalues, we can rewrite it as $\sum_{t=1}^{T} \lambda_i^t / t = -\ln(1 - \lambda_i) + O(\lambda_i^T)$. We use the integral test for convergence to replace the sum by the logarithm. This gives us the privacy loss for a single contribution. In order to conclude, we compose the privacy loss over each of the $N_u$ contributions of node $u$, leading to, for any $\sigma^2 \geq 2\alpha(\alpha - 1)$:

$$\varepsilon_{u \to v} \leq \frac{\alpha N_u \ln(T)}{\sigma^2 n} - \frac{\alpha N_u}{\sigma^2} \Big[ \ln\Big( I - W + \frac{1}{n} \mathbf{1}\mathbf{1}^\top \Big) \Big]_{uv} + O(\lambda_2^T). \qquad (16)$$

As $|\lambda_2| < 1$, its powers decrease exponentially fast with the number of steps, and the last term is thus negligible with respect to the others. The average number of contributions is $T/n$, as the transition matrix is bi-stochastic. We can then upper bound the actual number of contributions with high probability, for example by using Theorem 12.21 of Levin & Peres (2017). The small probability of exceeding the upper bound can be added to $\delta$ when converting from RDP to $(\varepsilon, \delta)$-differential privacy.

Remark 5. The upper bound on $N_u$ tends to be large for cryptographically small $\delta$. An efficient way to avoid this issue in practical implementations is to enforce a tighter bound on the maximum number of contributions of each node: after a node reaches its maximum number of contributions, if the token passes by the node again, the node only adds noise. We use this trick in our numerical experiments.

B.1. Adaptation to the case without sender anonymity

When bounding the privacy loss of a single contribution, we considered above the privacy loss occurring at time $t_l$. However, this is computed without taking into account the knowledge of where the token comes from, for example if $v$ can know which of its neighbors sent it the token. This computation is thus justified only in the specific case where the anonymity of the sender is ensured, e.g., by resorting to mix networks (Sampigethaya & Poovendran, 2006) or anonymous routing (Dingledine et al., 2004).

If this is not the case, then we should a priori use the conditional probability with respect to the position of the token at time $t_l - 1$. However, there is no closed form for this for a general graph. To fix this issue, we consider that the last step for reaching $v$ is only a post-processing of the walk reaching one of $v$'s neighbors. For each of the neighbors, the previous formula applies. We obtain different values for the various neighbors, so a worst-case analysis consists in taking the maximum over this set. Denoting by $\tilde{\varepsilon}^{\mathrm{single}}_{u \to v}$ the privacy loss when node $v$ knows the neighbor from which it received the token, we have

$$\tilde{\varepsilon}^{\mathrm{single}}_{u \to v} \leq \max_{w \in \mathcal{N}(v)} \varepsilon^{\mathrm{single}}_{u \to w},$$

where $\mathcal{N}(v)$ denotes the set of neighbors of $v$. This approach keeps a closed form for the matrix and merely adds a max step. However, this analysis is not tight and may incur a significant cost in some scenarios. A simple example is the special case where the nodes $u$ and $v$ are neighbors: it effectively assumes that the transition between $u$ and $v$ is always direct, which may not always be the case. We provide in Figure 4 the equivalent of Figure 1 obtained by taking this max.
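In code, this max step is a one-line post-processing of the single-contribution loss matrix (a sketch with names of our choosing; eps_single would come from a per-contribution version of the earlier pairwise_privacy_loss helper):

```python
import numpy as np

def loss_known_sender(eps_single: np.ndarray, adjacency: np.ndarray) -> np.ndarray:
    """eps~[u, v] = max over the neighbors w of v of eps_single[u, w]."""
    n = eps_single.shape[0]
    out = np.empty_like(eps_single)
    for v in range(n):
        neighbors = np.flatnonzero(adjacency[v])     # indices of v's neighbors
        out[:, v] = eps_single[:, neighbors].max(axis=1)
    return out
```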
As expected, amplification is smaller for close nodes, but the curves have the same asymptotics.

Figure 4. Comparison of the privacy loss for random walks when nodes know who sent them the token (bold lines) and for gossip (dashed lines) on the same synthetic graphs (exponential, Erdős–Rényi, random geometric, grid) with n = 2048, as a function of the shortest path length; the LDP loss is shown for reference. Privacy amplification is visible even for close neighbors, but the decay is slower than for gossip.

C. Additional Numerical Experiments

In this section, we include other examples of graphs that illustrate how the privacy loss matches the graph structure. A classic synthetic model exhibiting subgroups is the stochastic block model, where each edge is drawn from an independent Bernoulli random variable whose parameter depends on a matrix encoding the relations between clusters.

Figure 5. Stochastic block model with 200 nodes in three clusters (75, 75, 50) and probability matrix [[0.25, 0.05, 0.02], [0.05, 0.35, 0.07], [0.02, 0.07, 0.40]]. The privacy loss matrix recovers the different blocks.

As for the other graphs, the privacy loss behaves similarly to communicability and shows different levels of privacy within and outside a group (Figure 5(b)), so that a node sees the major part of its privacy loss occurring within its own group (Figure 5(a)).

We report other examples of the privacy loss for a node chosen at random in the Facebook Ego graphs. Once again, these results illustrate that our privacy loss guarantees match the existing clusters of the graph (Figure 6).

Intuitively, compared to gossip algorithms where updates only slowly flow through the graph, the random walk should mix heterogeneous data rather easily. While we do not derive mathematical guarantees on heterogeneity, we illustrate this idea with the following numerical experiment. We generate a synthetic random geometric graph and compare two scenarios (Figure 7). In the first, the data is position-dependent, which generates heterogeneity as close nodes also have close data points. We then shuffle the data randomly to destroy the heterogeneity, and compare the convergence of the two settings in Figure 8.

D. Collusion

The results of our work assume that nodes are separate entities that do not share information outside the protocol. One could however argue that a fraction of the nodes may collude and share information among themselves. In this case, if we denote by $F \subseteq V$ the set of colluding nodes, what matters for a given contribution made by $u$ is the first time that one of the nodes of $F$ is reached afterwards. More precisely, we can bound the privacy loss as

$$\varepsilon^{\mathrm{single}}_{u \to F} \le \sum_{i \ge 1} p_i(u, F)\, \frac{\alpha}{\sigma^2 i}\,, \qquad (17)$$

where $p_i(u, F)$ denotes the probability of reaching $F$ in exactly $i$ steps from $u$. By reorganizing these terms, we obtain the upper bound:

$$\varepsilon_{u \to F} \le \sum_{v \in F} \varepsilon_{u \to v}\,. \qquad (18)$$

This also corresponds to the formula one would obtain from basic composition. Hence, our analysis does not avoid the degradation of the privacy guarantees under collusion. Note that if all the colluding nodes are far away from $u$, it is still possible to derive non-trivial guarantees compared to what the local differential privacy bound would give. In comparison to gossip, where the privacy loss decreases more sharply with distance, the cases where the amplification remains are scarcer.
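The composition bound (18) is immediate to evaluate once a pairwise loss matrix is available. Below is a small illustrative sketch; the matrix `eps` would come from a computation such as the one sketched after Section B.1 above, and the colluding set `F` is an arbitrary example, not a setting from the paper.

```python
import numpy as np

def collusion_loss(eps, F):
    """Bound (18): eps_{u -> F} <= sum_{v in F} eps_{u -> v} for every node u.
    This is exactly what basic composition over the colluding set gives."""
    cols = np.asarray(sorted(F))
    return eps[:, cols].sum(axis=1)

# Toy stand-in for a pairwise privacy loss matrix (replace with a real one).
rng = np.random.default_rng(0)
eps = np.abs(rng.normal(size=(32, 32)))
losses = collusion_loss(eps, F={0, 3, 7, 11, 19})
print(losses.round(2))  # one aggregated bound per node u
```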
This is a fundamental limitation of amplification by decentralization, which was already pointed out in (Cyffers & Bellet, 2022; Cyffers et al., 2022).

Figure 6. Privacy loss on the 9 other Facebook Ego graphs, following the same methodology as in Figure 4(a).

Figure 7. Random geometric graph with 200 nodes. On the left (original graph), the label is given by the sum of the coordinates, providing heterogeneity in the graph. On the right (shuffled graph), the same graph has its labels shuffled.

Figure 8. We compute the quantiles of the Euclidean distance between nodes. For each quantile, we report the mean over all pairs of nodes in the quantile of the privacy loss and of the distance between the current estimates, at several time steps of training: (a) with original labels, (b) with shuffled labels. The privacy loss is identical in both cases. At the beginning of training, the homogeneous case has smaller average distances between estimates, but the difference shrinks as training proceeds.

E. Refined Privacy Bounds for Specific Graphs

E.1. Useful Auxiliary Results

We first collect some auxiliary results that we use in our bounds.

Proposition 1. For any $x \in (0, 1)$, we have

$$\sum_{p \ \mathrm{odd}} \frac{x^p}{p} = \frac{1}{2}\ln\Big(\frac{1+x}{1-x}\Big) \qquad \text{and} \qquad \sum_{p \ \mathrm{even}} \frac{x^p}{p} = -\frac{1}{2}\ln(1-x^2)\,.$$

Proof. Recall that $\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots$ and $-\ln(1-x) = x + \frac{x^2}{2} + \frac{x^3}{3} + \cdots$. Now $-\ln(1-x) + \ln(1+x)$ keeps exactly twice the odd terms and gives the first identity, while $-\ln(1-x) - \ln(1+x) = -\ln(1-x^2)$ keeps exactly twice the even terms and gives the second. This completes the proof.

Proposition 2 (Godsil & Royle, 2001). Let $1 \le d \le n-1$ be an integer. Then for any $d$-regular graph $G$, the eigenvectors of the Laplacian and those of the adjacency matrix of $G$ coincide.

E.2. Privacy Loss for Specific Graphs

Complete graph. The transition matrix is exactly $\frac{1}{n}\mathbb{1}\mathbb{1}^\top$ and thus we recover exactly the same formula as in (Cyffers & Bellet, 2022). In particular, there is only one non-zero eigenvalue, of magnitude 1, with the all-ones vector as the corresponding eigenvector. Writing $W^i = \frac{1}{n}\mathbb{1}\mathbb{1}^\top + \sum_{j=2}^{n} \lambda_j^i v_j v_j^\top$ with $\lambda_j = 0$ for $j \ge 2$, we get $W^i_{uv} = \frac{1}{n}$ for all $i \ge 1$.

Ring graph. To ensure an aperiodic Markov chain as required by Assumption 1, the transition matrix should be of the form $aI + b(J + J^\top)$ with $a + 2b = 1$, where $J$ is the cyclic shift matrix. The adjacency matrix $A_R$ of the ring graph $R$ is a circulant matrix. Therefore, all its eigenvectors are just the Fourier modes (Horn & Johnson, 2012):

$$\phi_k = \big(1,\ \omega_k,\ \omega_k^2,\ \dots,\ \omega_k^{n-1}\big)^\top\,,$$

where $\omega_k$ is an $n$-th root of unity, i.e., $\omega_k = e^{2\pi\iota k/n}$ with $\omega_k^n = 1$, for $0 \le k \le n-1$. This can be seen by noting that multiplication with a circulant matrix is a convolution, and in Fourier space convolutions become multiplications. Hence the product of a circulant matrix with a Fourier mode yields a multiple of that Fourier mode, which by definition is an eigenvector. The eigenvalues can then be computed as $\omega_k + \omega_k^{-1} = 2\cos(2\pi k/n)$ for $0 \le k \le n-1$.

Computing $\phi_k\phi_k^*$ is straightforward: its $(u, v)$-th entry is just $\omega_k^{(u+v-2) \bmod n}$. For ease of calculation, let us assume that $a = b = \frac{1}{3}$. Then $W = \frac{A_R + I}{3}$, where $A_R$ is the binary matrix whose $(i, j)$-th entry is 1 only when $|i - j| = 1 \pmod n$. The eigenvalues of $W$ are then given by $\frac{2\cos(2\pi k/n) + 1}{3}$ for $0 \le k \le n-1$. Let $a = (u + v - 2) \bmod n$.
Hence,

$$W^t_{uv} = \frac{1}{n}\sum_{k=0}^{n-1}\Big(\frac{2\cos(2\pi k/n)+1}{3}\Big)^t e^{2\pi\iota a k/n} = \frac{1}{n}\sum_{k=0}^{n-1}\Big(\frac{4\cos^2(\pi k/n)-1}{3}\Big)^t e^{2\pi\iota a k/n} = \frac{1}{n} + \frac{1}{n}\sum_{k=1}^{n-1}\Big(\frac{4\cos^2(\pi k/n)-1}{3}\Big)^t \cos\Big(\frac{2\pi a k}{n}\Big)\,,$$

where the last equality follows from the fact that $W^t_{uv}$ is a real number, so the imaginary part is identically zero. Plugging this into the privacy loss and summing over $t$ as in the proof of Theorem 2 then yields

$$\varepsilon_{u\to v} \le \frac{\alpha}{n\sigma^2}\ln T - \frac{\alpha}{n\sigma^2}\sum_{k=1}^{n-1}\cos\Big(\frac{2\pi a k}{n}\Big)\ln\Big(1 - \frac{4\cos^2(\pi k/n)-1}{3}\Big) = \frac{\alpha}{n\sigma^2}\ln T + \frac{\alpha}{n\sigma^2}\sum_{k=1}^{n-1}\cos\Big(\frac{2\pi a k}{n}\Big)\ln\Big(\frac{3}{4}\csc^2(\pi k/n)\Big)\,,$$

where $\csc$ is the cosecant function and we used $1 - \frac{4\cos^2(\pi k/n)-1}{3} = \frac{4}{3}\sin^2(\pi k/n)$.

In the previous bound, we give self-loops the same probability as the other edges. The main reason to give self-loops a non-zero weight is to ensure irreducibility and aperiodicity of the Markov chain, and the same effect can be achieved by giving any non-negligible weight to the self-loops. In particular, we can consider the following adjacency matrix:

$$\widehat{A}_R = (1-\kappa)A_R + \kappa I$$

for some $\kappa > 0$. The eigenvalues then become $(1-\kappa)(\omega_k + \omega_k^{-1}) + \kappa$. The adjacency matrix is still a circulant matrix; as a result, the eigenvectors remain the same. Furthermore, by the product-to-sum formula,

$$(\omega_k + \omega_k^{-1})^t \cos\Big(\frac{2\pi a k}{n}\Big) = 2^t\cos^t\Big(\frac{2\pi k}{n}\Big)\cos\Big(\frac{2\pi a k}{n}\Big) = 2^{t-1}\cos^{t-1}\Big(\frac{2\pi k}{n}\Big)\Big(\cos\Big(\frac{2\pi (a+1) k}{n}\Big) + \cos\Big(\frac{2\pi (a-1) k}{n}\Big)\Big)\,.$$

Let $\kappa = \frac{1}{T^2}$. Then $\widehat{A}^t_{uv} \ge (1-\kappa)A^t_{uv}$ for $2 \le t \le T$. Therefore, writing

$$S(a) = \sum_{k=1}^{n-1}\sum_{t=2}^{T}\frac{1}{t}\cos^{t-1}\Big(\frac{2\pi k}{n}\Big)\Big(\cos\Big(\frac{2\pi (a+1) k}{n}\Big) + \cos\Big(\frac{2\pi (a-1) k}{n}\Big)\Big)$$

for $a = (u+v-2) \bmod n$, we obtain

$$\epsilon_{u\to v} \le \frac{\alpha}{n\sigma^2}A_R[u,v] + \frac{\alpha(1-\kappa)}{n\sigma^2}S(a)\,.$$

If $|u - v| = 1$, then we have

$$\epsilon_{u\to v} \le \frac{\alpha}{n\sigma^2} + \frac{\alpha(1-\kappa)}{n\sigma^2}S(a)\,;$$

otherwise, we have

$$\epsilon_{u\to v} \le \frac{\alpha(1-\kappa)}{n\sigma^2}S(a)\,.$$

Star graph. We set the vertex set of a simple star graph as $V = \{1, 2, \dots, n\}$ with node 1 being the central node. This gives the edge set $E = \{(1, i) : 2 \le i \le n\}$.

Lemma 3. The eigenvalues of the Laplacian of the star graph are $0$, $1$ (with multiplicity $n-2$), and $n$. The eigenvectors for eigenvalue $1$ are $\delta_i - \delta_{i+1}$ for $2 \le i \le n-1$; the eigenvector corresponding to eigenvalue $n$ is computed in the proof.

Proof. Let $\mathbb{1}$ denote the all-ones vector. Then by the definition of the Laplacian, $L\mathbb{1} = 0$, so eigenvalue $0$ corresponds to the eigenvector $\mathbb{1}$. Now, the trace of the Laplacian is just the sum of its eigenvalues; therefore, $\mathrm{Tr}(L) = 2n - 2$, and the remaining eigenvalue is $n$. Let $v$ be the eigenvector for eigenvalue $n$. Then we know that $v \perp \mathrm{Span}\{\mathbb{1}, e_2 - e_3, e_3 - e_4, \dots, e_{n-1} - e_n\}$. This implies that $v[2] = \cdots = v[n]$ and $(n-1)v[2] + v[1] = 0$, so that, up to scaling, $v = \big(n-1, -1, \dots, -1\big)^\top$.

We could perform the spectral decomposition of the adjacency matrix of the star graph, but noting that this adjacency matrix is

$$A_S = \begin{pmatrix} 0 & 1 & 1 & \cdots & 1 \\ 1 & 0 & 0 & \cdots & 0 \\ 1 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 0 \end{pmatrix},$$

it is easy to compute the coordinates of any higher power of $A_S$ directly. In particular, if $p$ is an even power, then the $(u, v)$-th coordinate of $A_S^p$ is

$$(A_S^p)_{uv} = \begin{cases} (n-1)^{p/2} & u = v = 1 \\ 0 & (u = 1 \text{ or } v = 1) \text{ and } u \ne v \\ (n-1)^{p/2 - 1} & \text{otherwise.} \end{cases}$$

If $p$ is an odd power, then

$$(A_S^p)_{uv} = \begin{cases} (n-1)^{(p-1)/2} & (u = 1 \text{ or } v = 1) \text{ and } u \ne v \\ 0 & \text{otherwise.} \end{cases}$$

Since $W = \frac{A_S}{n-1}$ for the normalization constant $(n-1)$, we have

$$\varepsilon_{u\to v} \le \frac{\alpha}{\sigma^2}\sum_{p \ge 1} \frac{(A_S^p)_{uv}}{p\,(n-1)^p} = \frac{\alpha}{\sigma^2}\Big(\sum_{p \ \mathrm{odd}} \frac{(A_S^p)_{uv}}{p\,(n-1)^p} + \sum_{p \ \mathrm{even}} \frac{(A_S^p)_{uv}}{p\,(n-1)^p}\Big)\,.$$

We now compute the privacy loss in each case. Since we only care about the privacy loss when $u \ne v$, we consider the following two cases:

1. $u = 1$ or $v = 1$, and $u \ne v$. In this case only odd powers contribute, and using Proposition 1 with $x = \frac{1}{\sqrt{n-1}}$ we have

$$\frac{\alpha}{\sigma^2}\sum_{p \ \mathrm{odd}} \frac{(n-1)^{(p-1)/2}}{p\,(n-1)^p} = \frac{\alpha}{2\sigma^2\sqrt{n-1}}\ln\Big(\frac{\sqrt{n-1}+1}{\sqrt{n-1}-1}\Big)\,.$$

2. $u \ne 1$ and $v \ne 1$, $u \ne v$. In this case only even powers contribute, and using Proposition 1 with $x = \frac{1}{\sqrt{n-1}}$ we have

$$\frac{\alpha}{\sigma^2}\sum_{p \ \mathrm{even}} \frac{(n-1)^{p/2-1}}{p\,(n-1)^p} = \frac{\alpha}{2\sigma^2 (n-1)}\ln\Big(\frac{1}{1 - \frac{1}{n-1}}\Big)\,.$$

The above calculation is for the simple star graph. To make the Markov chain aperiodic, as before, we add self-loops with a small weight. For example, we consider the following adjacency matrix:

$$\widehat{A} = (1-\kappa)A_S + \kappa I\,.$$
We pick $\kappa = \frac{1}{T^2}$, so that $(\widehat{A}^p)_{uv} \ge (1-\kappa)(A_S^p)_{uv}$ for $u \ne v$ and $p \le T$. Then, up to lower-order terms in $\kappa$, if $p$ is an even power, the $(u, v)$-th coordinate of $\widehat{A}^p$ is

$$(\widehat{A}^p)_{uv} = \begin{cases} (1-\kappa)(n-1)^{p/2} & u = v = 1 \\ 0 & (u = 1 \text{ or } v = 1) \text{ and } u \ne v \\ (1-\kappa)(n-1)^{p/2 - 1} & \text{otherwise.} \end{cases}$$

If $p$ is an odd power, then

$$(\widehat{A}^p)_{uv} = \begin{cases} (1-\kappa)(n-1)^{(p-1)/2} & (u = 1 \text{ or } v = 1) \text{ and } u \ne v \\ 0 & \text{otherwise.} \end{cases}$$

Now again we have

$$\varepsilon_{u\to v} \le \frac{\alpha}{\sigma^2}\sum_{p \ge 1} \frac{(\widehat{A}^p)_{uv}}{p\,(n-1)^p} = \frac{\alpha}{\sigma^2}\Big(\sum_{p \ \mathrm{odd}} \frac{(\widehat{A}^p)_{uv}}{p\,(n-1)^p} + \sum_{p \ \mathrm{even}} \frac{(\widehat{A}^p)_{uv}}{p\,(n-1)^p}\Big)\,.$$

Using the same calculation as before, we have for all $u \ne v$:

$$\varepsilon_{u\to v} \le \begin{cases} \dfrac{\alpha(1-\kappa)}{2\sigma^2\sqrt{n-1}}\ln\Big(\dfrac{\sqrt{n-1}+1}{\sqrt{n-1}-1}\Big) & (u = 1 \text{ or } v = 1) \text{ and } u \ne v \\[2mm] \dfrac{\alpha(1-\kappa)}{2\sigma^2 (n-1)}\ln\Big(\dfrac{1}{1-\frac{1}{n-1}}\Big) & \text{otherwise.} \end{cases}$$

In particular, this means that the privacy loss involving the apex node is the largest. In contrast, the nodes on the arms have approximately $n$ times more privacy than the apex node, which is what we expect: the apex node is the only one communicating with every other node in the graph.
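As a quick numerical sanity check of the two closed forms above, the following Python snippet compares the truncated series $\sum_p (A_S^p)_{uv}/(p\,(n-1)^p)$ against the expressions obtained from Proposition 1 with $x = \frac{1}{\sqrt{n-1}}$. The graph size $n$ and truncation level $P$ are arbitrary illustrative choices.

```python
import numpy as np

n, P = 50, 200  # illustrative graph size and series truncation
A = np.zeros((n, n))
A[0, 1:] = A[1:, 0] = 1.0   # node 1 (index 0) is the apex of the star

W = A / (n - 1)             # normalization by n - 1, as in the text
series = np.zeros((n, n))
Wp = np.eye(n)
for p in range(1, P + 1):
    Wp = Wp @ W
    series += Wp / p        # accumulates sum_p (A^p)_{uv} / (p (n-1)^p)

x = 1.0 / np.sqrt(n - 1)
leaf_apex = 0.5 * np.log((1 + x) / (1 - x)) / np.sqrt(n - 1)  # odd powers only
leaf_leaf = -0.5 * np.log(1 - x**2) / (n - 1)                 # even powers only
print(np.isclose(series[0, 1], leaf_apex))  # apex <-> leaf entry
print(np.isclose(series[1, 2], leaf_leaf))  # leaf <-> leaf entry
```

Since the spectral radius of $W$ is $\frac{1}{\sqrt{n-1}}$, the truncated series converges geometrically and both checks print True.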