# distributed_eventbased_learning_via_admm__c47dbe71.pdf

Distributed Event-Based Learning via ADMM

Guener Dilsad ER 1 Sebastian Trimpe 2 Michael Muehlebach 1

We consider a distributed learning problem, where agents minimize a global objective function by exchanging information over a network. Our approach has two distinct features: (i) It substantially reduces communication by triggering communication only when necessary, and (ii) it is agnostic to the data-distribution among the different agents. We therefore guarantee convergence even if the local data-distributions of the agents are arbitrarily distinct. We analyze the convergence rate of the algorithm both in convex and nonconvex settings and derive accelerated convergence rates for the convex case. We also characterize the effect of communication failures and demonstrate that our algorithm is robust to these. The article concludes by presenting numerical results from distributed learning tasks on the MNIST and CIFAR-10 datasets. The experiments underline communication savings of 35% or more due to the event-based communication strategy, show resilience towards heterogeneous data-distributions, and highlight that our approach outperforms common baselines such as Fed Avg, Fed Prox, SCAFFOLD and Fed ADMM.

1. Introduction

Distributed learning refers to the minimization of a global objective function over a network of agents, where each agent has only access to a local cost function and can communicate with some or all agents in the network. Distributed learning systems provide a solution for handling the growing amount of data being generated everywhere on earth, by utilizing the computational power of individual devices in a network rather than relying on a central entity. This takes

1Learning and Dynamical Systems Group, Max Planck Institute for Intelligent Systems, 72076 T ubingen, Germany 2Institute for Data Science in Mechanical Engineering, RWTH Aachen University, 52068 Aachen, Germany. Correspondence to: Guener Dilsad ER <gder@tue.mpg.de>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

the burden off central processors and improves data privacy by avoiding a centralized training and storage of data.

Distributed learning is particularly challenging when the data is not independent and identically distributed (noni.i.d.) across the different agents. This situation often hinders the convergence to a globally optimal model. The non-i.i.d. nature leads to disparities in local datasets, preventing the local models from generalizing across the entire dataset, leading to a fundamental dilemma between minimizing local and global objective functions (Acar et al., 2021). In addition, a second key challenge arises from the communication between agents, which is required to ensure convergence to the global solution and may lead to a substantial overhead. This communication overhead results in a waste of energy (Li et al., 2020b), and is prone to delays and communication channel failures. As a result, both, non-i.i.d. datasets and communication overhead, constitute major bottlenecks for enabling large-scale learning systems.

We provide an effective solution to both challenges. Inspired by the sent-on-delta concept (Miskowicz, 2006), we reduce the communication load by introducing an eventbased communication strategy, such that each agent (or computational node) communicates only if necessary. Our communication rule enforces local constraints that collectively guarantee bounded overall error, a paradigm related to safe zone design strategies in distributed computing (Garofalakis & Samoladas, 2017). Our approach is also rooted in event-based estimation, where communication is triggered by significant state changes. We further base our approach on the Alternating Direction Method of Multipliers (ADMM). Our method is therefore robust against ill-conditioning and agnostic towards a disparity of the local data-distributions among the agents (these can be skewed in arbitrary ways). The approach further enables an explicit trade-off between communication load on the network and solution accuracy via a small set of hyperparameters that have a clear interpretation. We explicitly quantify the influence of these hyperparameters on the solution accuracy and analyze the effect of communication failures. The article concludes by highlighting the effectiveness of our algorithm in training neural networks, and solving LASSO problems in a distributed and communication-efficient manner.

Our theoretical analysis builds on a recent trend in the opti-

Distributed Event-Based Learning via ADMM

mization literature (Wibisono et al., 2016; Su et al., 2016; Muehlebach & Jordan, 2019; Tong & Muehlebach, 2023) that views algorithms as dynamical systems and leverages ideas from differential or symplectic geometry, as well as passivity and dissipativity (Lessard et al., 2016; Muehlebach & Jordan, 2020). As we will show, this enables convergence proofs and a convergence rate analysis for our distributed algorithms, together with an analysis of robustness against communication failures. Our work provides important insights into the behavior of the event-based optimization under communication failures, an aspect, which has been overlooked in prior works, and thereby lays the groundwork for future research in this area.

Related Work: In the 1980s, Bertsekas & Tsitsiklis (1989) and others laid the foundation for the analysis of distributed algorithms. As machine learning became popular, distributed learning emerged, specifically focusing on parallelizing computation for empirical risk minimization. Shokri & Shmatikov (2015) explored collaborative deep learning with multiple agents using distributed stochastic gradient descent, which was later coined federated learning (Mc Mahan et al., 2017) and advanced by subsequent contributions (Kairouz et al., 2021; Asad et al., 2023). A unifying element in these works is the consensus problem, where agents agree on a common value or decision. This problem is both central to distributed optimization and is also a special instance of distributed optimization (Wei & Ozdaglar, 2012).

The trade-off between communication and computation is inevitable in distributed optimization (Nedic et al., 2018). Recent work by Cao et al. (2023) categorizes communicationefficient distributed learning into four main strategies: (1) minimizing the number of communications, (2) compression, (3) managing resources (e.g., bandwidth), and (4) using game theoretical approaches. We focus our review on the first category that aligns with our work, and reduces communication by transmitting information only if necessary. A first line of work (Mc Mahan et al., 2017; Wei Liu et al., 2021; Reisizadeh et al., 2020, and many more) proposes algorithms with a periodical exchange of model parameters either among all agents or randomly selected subsets for decreasing communication load. While this approach is particularly straightforward and easy to implement, the random sampling risks missing critical updates or performing redundant communications. A second line of work involves accelerated gradient methods for distributed optimization, reducing the need for many communication rounds to converge. For instance, Kovalev et al. (2020) and Nabli & Oyallon (2023) optimize the number of gradient evaluations together with communication rounds. Shamir et al. (2014) replaces gradient descent with Newton-like methods and Hendrikx et al. (2020) proposes statistical preconditioning where both methods further improve convergence rates at the cost of a higher computational load per iteration. Addi-

tionally, Liu et al. (2021) propose a lazy evaluation of dual gradients, reducing communication by skipping redundant updates, while (Chen et al., 2018) adaptively reuse lagged gradients to meet target accuracy with fewer communication rounds. There has also been a third line of work that focuses on reducing communication via event-based triggering and compression of network parameters (Liu et al., 2019; Ghadikolaei et al., 2021; Singh et al., 2023; Zhang et al., 2023). While (Zhang et al., 2024; 2023) employ an ADMM-based strategy that is similar to ours, their focus lies on investigating different compression schemes, and not on analyzing convergence rates and the effect of communication failures. Event-triggering has also been explored in contexts like dynamics model learning (Solowjow & Trimpe, 2020; Umlauft & Hirche, 2019), and Bayesian optimization (Brunzema et al., 2025). While highlighting the benefit of triggering for reducing communication, these works do not consider distributed optimization problems as we do herein.

In addition to the communication overhead, another major challenge for distributed learning arises from non-i.i.d. data distributions across agents (Zhao et al., 2018; Li et al., 2020c; Glasgow et al., 2022). SCAFFOLD (Karimireddy et al., 2020) addresses this challenge by introducing a client control variate to improve convergence at the cost of doubling communication. Similarly, (Gao et al., 2022) enhances training with auxiliary drift variables, while (Zheng et al., 2024) selects representative clients and adjusts server gradients. Recent contributions by Li et al. (2020a); Acar et al. (2021); Shi et al. (2023) add a proximal regularization term to the local objective functions of the individual agents, whereas Zhang et al. (2021) (Fed PD) and Zhou & Li (2023); Wang et al. (2022); Gong et al. (2022) (Fed ADMM) address the challenge with ADMM formulations. However, compared to our work, Fed ADMM (Zhou & Li, 2023; Wang et al., 2022; Gong et al., 2022) relies on utilizing a random selection of agents that communicate and Fed PD (Zhang et al., 2021) considers full participation, whereas we use an event-triggered mechanism. Alternatively, other splitting schemes such as Douglas-Rachford method proposed by Tran Dinh et al. (2021), similarly align local and global objectives but remain constrained by random agent participation.

ADMM remains a widely-used tool for distributed learning, with recent advancements focusing on improving convergence rates and communication efficiency. For example, Wang et al. (2025) introduce inertia and adaptive iteration strategies to accelerate convergence, while Song et al. (2025) controls inexactness and dynamically tunes penalty parameters. In addition, He et al. (2023) explore dynamic tuning of ADMM hyperparameters, and hierarchical grouping approaches (Qiu et al., 2023), inspired by Elgabli et al. (2020), aim to reduce communication overhead by restricting updates to neighboring workers. These methods share the goal

Distributed Event-Based Learning via ADMM

of improving the efficiency of ADMM in distributed settings, but they still rely on periodic or full-agent participation, whereas our approach uses event-triggered mechanisms to further reduce communication costs.

As we also highlight in numerical experiments, a random selection of agents might prevent important local changes from propagating quickly through the network, leading to a slower convergence. To the best of our knowledge, this is the first work to provide a convergence analysis of distributed learning with event-triggered communication that addresses key aspects, such as packet drop and the presence of non-i.i.d. data.

Contributions are summarized as follows: (i) We propose an event-based communication scheme for distributed optimization, where a communication event is only triggered, when the current state has deviated by a predefined threshold , indicating a significant change in the local decision variables. Therefore, our approach is effective in reducing communication overhead and can adapt to the limited communication resources in heterogeneous networks. Our method is also compatible with and complementary to gradient compression/quantization (Hegazy et al., 2024; Mao et al., 2022; Wang et al., 2018) and fair aggregation techniques (Zhu & Ling, 2021).

(ii) We characterize the effect of the communication threshold on the solution accuracy and therefore quantify the trade-off between communication and solution accuracy. Compared to other ADMM-based approaches, such as (Zhang et al., 2021; Zhou & Li, 2023), our method is versatile, both in the selection of variables that are being communicated (which is important for reducing communication in practice), as well as the different problem formulations that we can address. In particular, our approach goes beyond the scope of consensus problems, and can solve generic constrained optimization problems, sparse regression and LASSO problems, perform robust principal component analysis (Cand es et al., 2011), and solve distributed learning instances where the features but not the data points are distributed (Boyd et al., 2010).

(iii) Numerical experiments support the theoretical analysis and highlight that our approach even converges in the most extreme non-i.i.d. setting, where each agent has only access to training data from a single class (see the MNIST classifier example in Sec. 5). Comparisons to the baselines Fed ADMM (Zhou & Li, 2023), SCAFFOLD (Karimireddy et al., 2020), Fed Prox (Li et al., 2020a) and Fed Avg (Mc Mahan et al., 2017) demonstrate superiority both in terms of communication efficiency and classification accuracy.

(iv) We demonstrate an accelerated convergence rate, and derive symbolic expressions that relate the convergence rate to instance-specific quantities such as the condition number

and the topology of the communication network. The convergence analysis requires a Lyapunov-like function that is different compared to earlier work (Nishihara et al., 2015), due to the presence of the event-based communication.

(v) We study the robustness of our algorithm against communication failures, both in theory as well as in numerical experiments, which, to the best of our knowledge, is largely missing in the literature (a notable exception for the consensus problem is (Bastianello et al., 2021)). We address communication failures algorithmically by proposing a rare periodic reset strategy. We show that, without such a reset strategy, inter-agent errors accumulate rapidly in the presence of packet drops and prevent convergence.

Outline: The article is structured as follows: Sec. 2 describes the problem formulation and introduces our eventbased learning algorithm in the consensus setting. The more general formulation is discussed in Sec. 3, where we also introduce a dynamical systems model for our algorithm. Sec. 4 discusses the convergence analysis of the proposed algorithm and presents convergence rates, while empirical results that underline the theoretical findings are included in Sec. 5 and in App. G. The appendix contains additional technical details about the communication structure in App. A and the details of the convergence analysis in App. C and D.

2. Event-Based Distributed Learning

We consider a distributed learning problem of the type minx Rn PN i=1 f i(x), where the overall cost function f(x) is the sum of N individual, potentially nonsmooth functions. The different f i typically arise from different training datasets stored on different computational nodes. In the most basic instance, our algorithm arises from the consensus formulation

min x1,...,x N Rn

i=1 f i(xi) + g(z),

subject to xi = z, i = 1, . . . , N,

where we impose the constraints xi = z by corresponding dual variables ui. Thus, by guaranteeing constraint satisfaction, we can ensure consensus between the agents despite different local problems and, in particular, arbitrary non-i.i.d. data distributions among the computational nodes. In addition, we assume the function f i : Rn R to be smooth, while g : Rn R is allowed to be nonsmooth (g typically represents a regularizer) and maps to the extended real numbers.

Our event-based algorithm, stated in Alg. 1, works as follows. Each agent (or computational node) i, i = 1, . . . , N, has access to the local objective function f i, its local solution xi, its local multiplier ui, and an estimate ˆzi of the consensus variable z. We further introduce the agent N +1

Distributed Event-Based Learning via ADMM

Agent 1 Agent 2 Agent 3 Agent 4

d1 k+1 d1 [k] d2 k+1 d2 [k] d3 k+1 d3 [k] d4 k+1 d4 [k]

zk+1 z[k] zk+1 z[k] zk+1 z[k] zk+1 z[k]

Figure 1: The figure illustrates the distributed learning setup. The Agents 1 4 store xi, ui and perform updates based on the information received by Agent 5, according to Alg. 1. Agent 5, stores z and performs updates based on the information received by Agent 1 4. This architecture is common in distributed learning, where a single server aggregates updates from multiple distributed clients to collaboratively train a model.

(acting as server) that has access to g, the variable z, and maintains an estimate ˆζ of the average

αxi k+1 + ui k .

Following the communication structure in Fig. 1, the algorithm proceeds in two steps:

i) Parallel update of agents i = 1, . . . , N: Each agent i = 1, . . . , N first updates its estimate ˆzi k based on whether it receives an event-based communication from the agent N+1. The agent then updates its multiplier ui k and solves a local minimization over f i, which also includes a quadratic regularization term. The regularization term ensures that the minimization is well-conditioned (a key advantage to dual ascent, for example) and the local solution xi is biased towards ˆzi k. In practice, the minimization is replaced by a fixed number of (stochastic) gradient descent steps. If the resulting value di k+1 := αxi k+1+ui k of the agent i is significantly different from the value that it last communicated to the agent N +1, an event-based communication is triggered and the difference of di k+1 to the last communicated value is sent to the agent N +1.

ii) Update of agent N+1: The agent N+1 updates its estimate ˆζ of ζ by accumulating the di variables that it receives from all agents. It then updates the consensus variable zk+1 by solving a local minimization over g with a quadratic regularization term. Note that if the nonsmooth component g is missing, zk+1 is simply set to ˆζk (1 α)zk. Finally, the agent N +1 triggers an event-based communication if the value zk+1 is significantly different from the value that it last communicated to the agents i = 1, . . . , N.

Next, we explain the details of the event-based communication protocol on the example of the communication of di, which is related to the primal xi and dual ui variables of agent i. The other event-based communications proceed similarly. The protocol comes in two variants, vanilla event-

Algorithm 1 Event-Based Distributed Learning with Over Relaxed ADMM Require: Local objective functions f i, parameters ρ, d, z, reset period T Require: Initialize ˆxi 0 = x0, ˆz0 = ζ0 = x0, ˆui 1 = ui 0 for k = 0 to tmax do

for i = 1 to N do

ˆzi k receive zk z[k 1] {Agent i} ui k = ui k 1 + αxi k ˆzi k + (1 α)ˆzi k 1 xi k+1 = arg minxi f i(xi) + ρ

2|xi ˆzi k + ui k|2

event-based send of di k+1 di [k] {See (2)}

ˆζk receive 1

i Cd k+1(di k+1 di [k]) {Agent N+1}

zk+1 = arg minz g(z) + Nρ

2 |z ˆζk (1 α)zk|2

event-based send of zk+1 z[k] if mod(k + 1, T) = 0 then

perform reset ˆζk = ζk, ˆzk = zk end if end for

based and randomized event-based.

Vanilla event-based: This communication rule is inspired from the sent-on-delta concept (Miskowicz, 2006), which aims to reduce the number of communications by only sending updates when significant changes occur. A communication is triggered, if the value di k+1 has deviated by more than the predefined threshold d >0 compared to the value that was last communicated. We introduce the variable di [k] to denote the value di k that was last communicated and add the index i to the set Cd k+1. The set Cd k+1 denotes the set of agents that trigger a communication of di k+1 at time-step k, that is,

|di k+1 di [k]|> d i Cd k+1, (2)

and di k+1 di [k] is sent out. Similarly, agent N + 1 triggers a communication if |zk+1 z[k]| > z. We also model communication failures as drops, which we represent by the variables χdi k+1. The variable χdi k+1 takes the value χdi k+1 = (di k+1 di [k]), if di k+1 di [k] is not received by the agent N +1; otherwise χdi k+1 = 0. The agent N +1 updates its estimate of the average ζk according to the primal and dual variables that it has received at time k, that is,

ˆζk = ˆζk 1 + 1

di k+1 di [k] + χdi k+1 .

Randomized event-based: The protocol makes a case distinction. If |di k+1 di [k]| d, a communication is randomly triggered with probability ptrig. If |di k+1 di [k]| > d, a communication is triggered with certainty. Randomized communication from agent N + 1 to the other agents works in a similar way. If |zk+1 z[k]| z, then a

Distributed Event-Based Learning via ADMM

communication is randomly triggered with probability ptrig between agent N + 1 and agent i.

We observed in our numerical experiments that randomized event-based often improves vanilla event-based in terms of the achieved communication versus solution accuracy trade-off.

The error caused by the event-based communication remains bounded at all times thanks to the communication protocol and the periodic resets. This is summarized with the next proposition, whose proof is included in App. E:

Proposition 2.1. The error ˆζk ζk at iteration k is bounded by |ˆζk ζk| d+T χd, where T denotes the reset period (see Alg. 1) and χd is a bound on the disturbance χdi k .

We now state the main convergence result for Alg. 1. The result arises as a corollary from the convergence analysis in the more general distributed optimization setting (see Thm. 4.1). We also provide sublinear convergence rates in a nonconvex setting (see Thm 2.3).

Corollary 2.2. Let f = PN i=1 f i be m-strongly convex and L-smooth with κ = L/m, and g be convex. Let the step-size be ρ = (m L) 1 2 κϵ with ϵ [0, ), and α = 1. For large enough κ, we have

|zk z |2 4 1 1

where z is the optimal value for the consensus variable z, and D0 represents the initial error, D0 = |z0 z |2 + 1 N PN i=1|ui 0 ui |2, with ui denoting the optimal values of the dual variables associated with each agent. Here, = N d+ z+T(N χd+ χz) captures the error arising from the event-based communication.

The convergence result bounds the distance between the consensus variable zk and the optimal solution z = x that minimizes (1). The analysis models our event-based learning algorithm as a dynamical system, accounting for disturbances introduced by the event-based communication strategy. By design, these disturbances remain bounded under the communication protocol. The next section elaborates on the formulation of our algorithm as a dynamical system.

The strong convexity assumption enables faster convergence rates compared to more general nonconvex scenarios. Specifically, under this assumption, the rate of convergence is linear, as shown in Cor. 2.2 and accelerated. In contrast, without such assumptions, convergence rates are generally much slower, typically sublinear or achieving only asymptotic convergence.

We note that the strong convexity assumption in Cor. 2.2 only requires f := PN i=1 f i to be strongly convex, without

imposing the same condition on the individual components f i. In addition, we present a convergence result for general nonconvex cases in Thm. 2.3 leading to sublinear convergence rates. The proof is provided in App. B.

Theorem 2.3. Let each f i : Rn R be smooth (potentially nonconvex) and let g : Rn R be a proper, closed convex function. Let the relaxation parameter be α = 1, and the communication threshold k decay as k = 0/(k + 1)2. Then, the gradients and residuals converge with a rate of O(1/k), and the following bound holds:

ri k+1 2 + 1

where ri k+1 = xi k+1 zk+1 are the residuals, and the gradient terms are given by

i=1 f i(xi k+1) + g(zk+1)

3. Event-Based ADMM as a Dynamical System

We introduce a more general problem formulation that encompasses the previous section as a special case in order to broaden the scope of our analysis. This leads to the following constrained minimization problem

min x Rp, z Rqf(x) + g(z), subject to Ax + Bz = c, (3)

where x Rp and z Rq are decision variables, A Rr p, B Rr q, and c Rr are corresponding matrices, and the objective function is decomposed into a smooth part f :Rp R and a nonsmooth part g :Rq R. We will provide an analysis under the following standard assumptions in distributed optimization. Assumption 3.1. The matrix A is invertible and B is full column rank. Assumption 3.2. The function f is m-strongly convex and L-smooth. The function g is convex.

It is important to emphasize that the assumption of strong convexity for f is introduced to derive linear rates within a dynamical systems framework. This assumption does not limit the practical applicability of the algorithm and a corresponding nonconvex result is included in Thm. 2.3. We also note that Assumption 3.2 allows for nonconvex f i (see (1)) that P f i is strongly convex.

The formulation in (3) accommodates a variety of distributed optimization problems, including consensus, resource sharing, and distributed model fitting, see for example (Boyd et al., 2010). App. A further highlights how the general formulation can be tailored and simplified to accommodate specific applications, such as the sharing problem or finding a consensus on a graph.

Distributed Event-Based Learning via ADMM

Our event-based distributed learning method is summarized in Alg. 2. The algorithm is based on an over-relaxed version of ADMM, where an event-based communication structure between different agents is introduced. The over-relaxation brings the additional parameter α, which, as we will show, can be used to achieve faster convergence rates. The communication structure of the algorithm is shown in Fig. 2 and includes three agents that keep track of the individual quantities rk = Axk, sk = Bzk, and the dual multiplier uk. In the special case of the consensus problem, the updates of the primal variable xk and dual variable uk decompose further into local updates based on xi k and ui k, which results in the communication structure shown in Fig. 1.

Alg. 2 begins by initializing its variables, and over a series of iterations, agents alternate between sharing information and optimizing their local variables. Key steps include updating variables based on local objectives and residuals, and triggering communication events when individual residual changes exceed predefined thresholds. The algorithm leverages event-based communication to reduce the communication load, while still achieving convergence towards an optimal solution of (3), as we will show in the following section. The event-based communication proceeds as in Sec. 2, that is, the r-agent, for example, triggers a communication with the other agents if |rk+1 r[k]|> r, at which point it sends the difference rk+1 r[k] to the other agents. We again model communication failures by introducing the variables χru k+1 if the communication is not received by the u-agent at time k+1. The notation is analogous for the remaining agents and communication lines (see also Fig. 2).

Algorithm 2 Event-Based Distributed Optimization with Over-Relaxed ADMM Require: Functions f and g, matrices A and B, vector c, parameters ρ and α. Initial condition x0, z0 r0 = ˆrs 0 = ˆru 0 =Ax0,s0 = ˆsr 0 = ˆsu 0 =Bz0,u0 = ˆur 0 = ˆus 0 =0 for k = 0 to tmax do

ˆsr k, ˆur k receive sk+1 s[k], uk+1 u[k] xk+1 = argminx f(x) + ρ

2|Ax + ˆsr k c + ˆur k|2

event-based send rk+1 r[k] where rk+1 = Axk+1

ˆrs k+1,ˆus k receive rk+1 r[k], uk+1 u[k] zk+1=argmin z g(z)+ρ

2|αˆrs k+1 (1 α)Bzk+Bz αc+ˆus k|2

event-based send sk+1 s[k], where sk+1 = Bzk+1

ˆru k+1,ˆsu k+1 receive rk+1 r[k], sk+1 s[k] uk+1 = uk + αˆru k+1 (1 α)ˆsu k + ˆsu k+1 αc event-based send uk+1 u[k] if mod(k + 1, T) = 0 then

reset ˆru;s k+1 = rk+1, ˆsu;r k+1 = sk+1, ur;s k+1 = uk+1 end if end for

Alg. 2 has three update steps that occur sequentially,

whereby the first two involve optimization problems that can be replaced by their corresponding stationarity conditions. This yields the following implicit update equations:

0 = f(xk+1) + ρA (Axk+1 + ˆsr k c + ˆur k)

0 g(zk+1)+ρB (αˆrs k+1 (1 α)Bzk+Bzk+1 αc+ˆus k)

uk+1 = uk + αˆru k+1 (1 α)ˆsu k + ˆsu k+1 αc,

which can be expressed by the dynamical system shown in Fig. 2. We note that the variable xk+1 is uniquely determined by ˆsr k and ˆur k and does not depend on xk, which means that only ξk := (sk, uk) comprises the state of the dynamical system. We further note that the dynamical system includes a nonlinear component, which arises from the (sub)gradient evaluations f and g, and the system is subjected to external disturbances ek that arise from the event-based communication. The detailed derivation and the corresponding matrices for the dynamics in Fig. 2 are included in App. C. Our convergence analysis will build on the dynamical systems model of Alg. 2. While our analysis is inspired by earlier works, such as (Nishihara et al., 2015) and (Lessard et al., 2016), the Lyapunov function that is used to prove convergence rates are different due to the external disturbances caused by the event-based communication.

ˆA ˆE ˆB ˆC ˆEy ˆD

e rk+1 c sk+1

Delay sk uk

Figure 2: The figure visualizes the event-based communication structure of Alg. 2 at the top and a discrete-time dynamical system which represents the sequence generated by the event-based ADMM algorithm on the bottom. The function ϕ is nonlinear and represents the evaluation of (sub)gradients.

4. Convergence Analysis

This section provides convergence guarantees for the eventbased learning algorithm (Alg. 2). The detailed proof for Thm. 4.1 is provided in App. D.

Theorem 4.1. Let Assumption 3.1 and 3.2 be satisfied and let the step-size for Alg. 2 be ρ = κϵ

m L/(σ(A) σ(A)), for some ϵ 0 and α (0.675, 1 + p

1 1/ κ), κ =

Distributed Event-Based Learning via ADMM

L σ2(A)/(m σ2(A)), where σ and σ denote the minimum and maximum singular value of a matrix, respectively. Then, for large enough κ, the following bound holds:

|ξk ξ |2 κP |ξ0 ξ |2 1 α

2k + 60κ2+2ϵ

α(1 |α 1|) 2,

with ξk = (sk, uk), and where sk = Bzk, uk is the dual variable, and ξ the optimizer corresponding to (3). Furthermore, = r+ s+ u+T( χr+ χs+ χu) represents the error arising from the event-based communication and κP = (2 κ 1+ p

4κ(α 1)2+1)/(2 κ 1 p

4κ(α 1)2+1).

We conclude the section by highlighting a few important points.

(i) For ϵ=0, α=1, the bound is considerably simplified to

|ξk ξ |2 2|ξ0 ξ |2 1 1 4 κ

2k + 60κ2 2,

which shows that the convergence rate scales with 1/ κ and is therefore accelerated. This also highlights that the same convergence rate (up to constants) can be achieved with the event-based learning algorithm stated in Alg. 1 compared to a standard ADMM algorithm. As we will show in the numerical experiments, our event-based algorithm reduces communication without any significant reduction in accuracy.

(ii) The bound from Thm. 4.1 also highlights how the communication thresholds affect the solution accuracy. In the simplified scenario with ϵ = 0, α = 1 (the more general scenario follows the same rationale), the solution accuracy is bounded by |ξk ξ | 8κ , for large enough k. This means that the solution accuracy of Alg. 2 is proportional to the condition number κ and .

(iii) We can therefore easily ensure convergence, by choosing a time-varying = k such that k 0. The formal statement is included and derived in App. F. We also obtain precise nonasymptotic bounds. For example, if k = 0/(k + 1)t for any t > 0, we conclude that the error converges with O(1/kt) (see again App. F).

(iv) If f fails to be strongly convex, we can include a small regularizer, for example of the type m|x|2/2. Choosing a diminishing regularizer with m = O(1/k2) and a diminishing threshold k = O(1/k4) can be shown to result in an accelerated convergence rate of O(1/k2).

(v) The topology of the communication network, represented by the matrix A, directly influences the convergence rate, through the condition number κ = L σ2(A)/(m σ2(A)). This formulation allows us to generalize our convergence results beyond simple client-server architectures. See App. A.2 for a detailed discussion on how agent network topology is encoded in the matrix A.

5. Numerical Experiments

This section discusses the performance of Alg. 1 in numerical experiments, highlighting that Alg. 1 achieves fast convergence while reducing communication. Numerical experiments with the more general version (Alg. 2) are included in App. G, where distributed training over a network of agents is explored. We also investigate the trade-off between communication load and solution accuracy achieved by selecting different communication thresholds. The communication load is calculated by counting the number of triggered communications for Tmax number of steps and normalizing according to the full communication case of one data package per round.

0 50 100 150 10 20 30 40 50 60 70 80 90

Alg.1, Alg.1-Rand, Fed ADMM

Fed Avg, Fed Prox, SCAFFOLD

Validation Accuracy %

Alg. 1, ( = 1.75) Alg. 1-Rand ( = 3.75, ptrig = 0.7) Fed Avg - part rate = 0.7 Fed Prox - part rate = 0.7 Fed ADMM - part rate = 0.9 SCAFFOLD - part rate = 0.7

0 50 100 150 0

Communication Load %

Alg. 1 ( = 1.75) Alg. 1-Rand ( = 3.75, ptrig = 0.7) Fed Avg/Fed Prox/SCAFFOLD - part rate = 0.7 Fed ADMM - part rate = 0.9

Figure 3: Validation accuracy (top) and communication load percentage (bottom) over 150 communication rounds for training a CIFAR-10 classifier. The results indicate that Alg. 1 achieves top accuracy at a lower communication rate. The plots compare the performance of various algorithms, including Alg. 1 with different parameter settings (Vanilla and randomized), Fed Avg, Fed Prox, Fed ADMM, and SCAFFOLD. Notably, ADMM-based methods (Alg. 1, Alg. 1-Rand and Fed ADMM) demonstrate better convergence by reaching up to 78% test accuracy, compared to other algorithms Fed Avg, Fed Prox and SCAFFOLD, which reach only 70% accuracy. Among ADMM-based methods, Alg. 1 and Alg. 1-Rand achieve the same accuracy with over 20% less communication load. Communication load curves are smoothed using a window length of three for visualization purposes.

Distributed Event-Based Learning via ADMM

Algorithm MNIST Target Accuracy CIFAR-10 Target Accuracy

80% 85% 90% 70% 75% 77% 78%

Alg. 1 - Randomized 629 693 1723 12531 13422 15008 18376 Alg. 1 - Vanilla 816 1285 1710 12214 14780 14780 20690 Fed ADMM (Zhou & Li, 2023) 800 1200 >2000 12000 15000 21000 27000 Fed Avg (Mc Mahan et al., 2017) 800 2000 N/A 3000 N/A N/A N/A Fed Prox (Li et al., 2020a) 1000 2000 N/A 6000 N/A N/A N/A SCAFFOLD (Karimireddy et al., 2020) 1600 2000 3200 12000 N/A N/A N/A

Table 1: The total number of communication events required by each algorithm to achieve the target accuracies for the MNIST and CIFAR-10 classifiers within 100 and 150 rounds, respectively. N/A indicates cases where the target accuracy was not reached within the specified rounds. Parameter choices for the algorithms are detailed in Appendix G. Reported values are averages over multiple experiments with different random seeds, with standard deviations below 2%, making them negligible. The corresponding communication load and accuracy trends for the rightmost column of CIFAR-10 are shown in Fig. 3. The results emphasize the trade-off between achieving higher validation accuracy and maintaining communication efficiency across various algorithm configurations.

Due to space limitations, we present two examples in this section. Further experiments (including linear regression, LASSO, and distributed training over a network of agents) are presented in App. G. App. G also includes hyperparameters for the experiments, model details and discusses the effect of communication failures.

We start by evaluating the performance of Alg. 1 on MNIST (Deng, 2012) and CIFAR-10 (Krizhevsky, 2009). Tab. 1 reports the total number of communication events required by each algorithm to achieve the target accuracies for the MNIST and CIFAR-10 classifiers. Our event-based algorithm consistently requires fewer communication events to achieve high accuracies compared to baseline methods. This reduction is attributed to the selective triggering mechanism, which prevents unnecessary communication while ensuring convergence. For instance, on the CIFAR-10 dataset, our approach achieved 78% accuracy with a cost of 18,376 communication events, compared to 27,000 for Fed ADMM.

The comparison with other federated learning methods emphasizes the challenges associated with non-i.i.d. data distribution and communication overhead. Fed Avg, as highlighted in (Li et al., 2020c; Glasgow et al., 2022), experiences slowdowns in the presence of non-i.i.d. data, and increasing participation does not necessarily alleviate this issue. Fed Prox has the same issue and is unable to converge to a classifier that generalizes across all digits. Fed ADMM and SCAFFOLD can indeed cope with non-i.i.d. data, in general, both achieving high classification accuracies. However, Fed ADMM has disadvantages arising from the random sampling mechanism and SCAFFOLD suffers from an additional communication load to communicate two variables (client drift and local model). Notably, all baselines employ a random selection of agents, which, in non-i.i.d. scenarios, misses crucial changes and results in a waste of communication resources. Our method addresses these challenges by

adopting an event-based agent selection approach and outperforms all baselines by yielding uniformly better trade-off curves.

6. Conclusion

We introduce an event-based distributed learning approach that effectively reduces communication overhead by triggering events only when local models undergo significant changes. The method, based on over-relaxed ADMM, exhibits accelerated convergence rates in convex settings, demonstrates robustness to communication failures, and outperforms common baselines such as Fed Avg, Fed Prox, SCAFFOLD and Fed ADMM in our experiments, which include an MNIST and CIFAR-10 learning task. The experiments highlight that savings of more than 35% are possible without significantly degrading the solution accuracy (less than 1%). Our method allows for explicit trade-offs between communication load and solution accuracy, making it promising for large-scale learning systems with heterogeneous data and communication constraints.

Limitations: While our approach offers significant improvements in communication efficiency, it has not yet accounted for adversarial attacks, such as gradient poisoning, or unreliable nodes in the network. These factors could potentially degrade the robustness of the method. Nevertheless, our event-based methodology can be integrated with robust aggregation or anomaly detection methods (Pillutla et al., 2022; Yin et al., 2018) to improve security without compromising communication efficiency. Additionally, this method has not been specifically analyzed with respect to differential privacy and does not address privacy concerns in the current formulation.

Discussion: In this article, we focused on communication efficiency and convergence relationship under some con-

Distributed Event-Based Learning via ADMM

straints. However, practical implementations often face additional challenges, such as limited bandwidth, high latency, and network failures. Although we have not explicitly named these factors, they can be effectively modeled through our packet drop framework, which provides a foundation for handling such constraints.

Furthermore, while we initially framed our algorithm as a synchronous method, event-based communication can also be adapted to function in asynchronous systems. This flexibility allows the method to accommodate real-world scenarios with varying degrees of network reliability and synchronization.

Finally, our method is compatible with compression and quantization techniques, which can further reduce the size of the models exchanged during communication events and improve communication efficiency. This can also help minimize the amount of data stored in the agents, contributing to more efficient memory usage and reducing the overall communication load.

Impact Statement

This article advances machine learning by introducing a communication-efficient, event-based distributed learning method, enabling scalability for resource-constrained systems. While the work does not explicitly address adversarial robustness or differential privacy, future extensions could explore these aspects to further improve security in applications. We see no immediate negative societal impacts.

Acar, D. A. E., Zhao, Y., Navarro, R. M., Mattina, M., Whatmough, P. N., and Saligrama, V. Federated learning based on dynamic regularization. Proceedings of the International Conference on Learning Representations, pp. 1 36, 2021.

Asad, M., Shaukat, S., Hu, D., Wang, Z., Javanmardi, E., Nakazato, J., and Tsukada, M. Limitations and future aspects of communication costs in federated learning: A survey. Sensors, 23(17):7358, 2023.

Bastianello, N., Carli, R., Schenato, L., and Todescato, M. Asynchronous distributed optimization over lossy networks via relaxed ADMM: Stability and linear convergence. IEEE Transactions on Automatic Control, 66(6): 2620 2635, 2021.

Bertsekas, D. P. and Tsitsiklis, J. N. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1989.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the

alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1 122, 2010.

Brunzema, P., von Rohr, A., Solowjow, F., and Trimpe, S. Event-triggered time-varying Bayesian optimization. Transactions on Machine Learning Research, 2025.

Cand es, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis. Journal of the ACM, 58(3):1 37, 2011.

Cao, X., Bas ar, T., Diggavi, S., Eldar, Y. C., Letaief, K. B., Poor, H. V., and Zhang, J. Communication-efficient distributed learning: An overview. Journal on Selected Areas in Communications, 41(4):851 873, 2023.

Chen, T., Giannakis, G. B., Sun, T., and Yin, W. LAG: Lazily aggregated gradient for communication-efficient distributed learning. Proceedings of the International Conference on Advances in Neural Information Processing Systems, pp. 5055 5065, 2018.

Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141 142, 2012.

Elgabli, A., Park, J., Bedi, A. S., Bennis, M., and Aggarwal, V. GADMM: Fast and communication efficient framework for distributed machine learning. Journal of Machine Learning Research, 21:1 39, 2020.

Gao, L., Fu, H., Li, L., Chen, Y., Xu, M., and Xu, C.-Z. Fed DC: Federated learning with non-iid data via local drift decoupling and correction. Proceedings of the Conference on Computer Vision and Pattern Recognition, 2022.

Garofalakis, M. and Samoladas, V. Distributed query monitoring through convex analysis: Towards composable safe zones. Proceedings of the International Conference on Database Theory, 68(14):1 18, 2017.

Ghadikolaei, H. S., Stich, S. U., and Jaggi, M. LENA: Communication-efficient distributed learning with selftriggered gradient uploads. Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 3943 3951, 2021.

Glasgow, M., Yuan, H., and Ma, T. Sharp bounds for federated averaging (local sgd) and continuous perspective. Proceedings of the International Conference on Artificial Intelligence and Statistics, 151, 2022.

Gong, Y., Li, Y., and Freris, N. M. Fed ADMM: A robust federated deep learning framework with adaptivity to system heterogeneity. Proceedings of the IEEE International Conference on Data Engineering, pp. 2575 2587, 2022.

Distributed Event-Based Learning via ADMM

He, S., Zheng, J., Feng, M., and Chen, Y. Communicationefficient federated learning with adaptive consensus ADMM. Applied Sciences, 13(9):5270, 2023.

Hegazy, M., Leluc, R., Ting Li, C., and Dieuleveut, A. Compression with exact error distribution for federated learning. Proceedings of the International Conference on Artificial Intelligence and Statistics, 238:613 621, 2024.

Hendrikx, H., Xiao, L., Bubeck, S., Bach, F., and Massouli e, L. Statistically preconditioned accelerated gradient method for distributed optimization. Proceedings of the International Conference on Machine Learning, 119: 4203 4227, 2020.

Kairouz et al., P. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1-2):1 210, 2021.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh, A. T. SCAFFOLD: Stochastic controlled averaging for federated learning. Proceedings of International Conference on Machine Learning, 119: 5132 5143, 2020.

Kovalev, D., Salim, A., and Richt arik, P. Optimal and practical algorithms for smooth and strongly convex decentralized optimization. Proceedings of the Conference on Advances in Neural Information Processing Systems, pp. 18342 18352, 2020.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical Report, University of Toronto, 2009.

Lessard, L., Recht, B., and Packard, A. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57 95, 2016.

Li, M., Sanjabi, M., and Jaggi, M. Federated optimization in heterogeneous networks. Proceedings of the Conference on Machine Learning and Systems, 2:429 450, 2020a.

Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50 60, 2020b.

Li, X., Yang, W., Huang, K., Wang, S., and Zhang, Z. On the convergence of Fed Avg on non-i.i.d. Proceedings of the International Conference on Learning Representations, pp. 1 26, 2020c.

Liu, Y., Xu, W., Wu, G., Tian, Z., and Ling, Q. Communication-censored ADMM for decentralized consensus optimization. IEEE Transactions on Signal Processing, 67(10):2565 2579, 2019.

Liu, Y., Sun, Y., and Yin, W. Decentralized learning with lazy and approximate dual gradients. IEEE Transactions on Signal Processing, 69:1362 1377, 2021.

Mao, Y., Zhao, Z., Yan, G., Liu, Y., Lan, T., Song, L., and Ding, W. Communication-efficient federated learning with adaptive quantization. ACM Transactions on Intelligent Systems and Technology, 13(4), 2022.

Mc Mahan, H. B., Moore, E., Ramage, D., and Hampson, S. Communication-efficient learning of deep networks from decentralized data. Proceedings of the International Conference on Artificial Intelligence and Statistics, 54: 1273 1282, 2017.

Miskowicz, M. Send-On-Delta Concept: An event-based data reporting strategy. Sensors, 6(1):49 63, 2006.

Muehlebach, M. and Jordan, M. I. A dynamical systems perspective on Nesterov acceleration. Proceedings of the International Conference on Machine Learning, 97: 4656 4662, 2019.

Muehlebach, M. and Jordan, M. I. Continuous-time lower bounds for gradient-based algorithms. Proceedings of the International Conference on Machine Learning, 119: 7088 7096, 2020.

Nabli, A. and Oyallon, E. DADAO: Decoupled accelerated decentralized asynchronous optimization. Proceedings of the International Conference on Machine Learning, 202: 25604 25626, 2023.

Nedic, A., Olshevsky, A., and Rabbat, M. G. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106 (5):953 976, 2018.

Nishihara, R., Lessard, L., Recht, B., Packard, A., and Jordan, M. I. A general analysis of the convergence of ADMM. Proceedings of the International Conference on Machine Learning, 37:343 352, 2015.

Pillutla, K., Kakade, S. M., and Harchaoui, Z. Robust aggregation for federated learning. IEEE Transactions on Signal Processing, 70:1142 1154, 2022.

Polyak, B. T. Introduction to optimization. Optimization Software Inc., Publications Division, New York, 1987.

Qiu, Y., Lei, Y., and Wang, G. PSRA-HGADMM: A communication efficient distributed ADMM algorithm. Proceedings of the 52nd International Conference on Parallel Processing, pp. 82 91, 2023.

Reisizadeh, A., Jadbabaie, A., Mokhtari, A., Hassani, H., and Pedarsani, R. Fed PAQ: A communication-efficient federated learning method with periodic averaging and

Distributed Event-Based Learning via ADMM

quantization. Proceedings of the International Conference on Artificial Intelligence and Statistics, 108:2021 2031, 2020.

Shamir, O., Srebro, N., and Zhang, T. Communicationefficient distributed optimization using an approximate Newton-type method. Proceedings of the International Conference on Machine Learning, 32(2):1000 1008, 2014.

Shi, Y., Zhang, Y., Zhang, P., Xiao, Y., and Niu, L. Federated learning with ℓ1 regularization. Pattern Recognition Letters, 172:15 21, 2023.

Shokri, R. and Shmatikov, V. Privacy-preserving deep learning. Proceedings of the SIGSAC Conference on Computer and Communications Security, pp. 1310 1321, 2015.

Singh, N., Data, D., George, J., and Diggavi, S. SPARQSGD: Event-triggered and compressed communication in decentralized optimization. IEEE Transactions on Automatic Control, 68(2):721 736, 2023.

Solowjow, F. and Trimpe, S. Event-triggered learning. Automatica, 117:109009, 2020.

Song, Y., Wang, Z., and Zuazua, E. Fed ADMM-In Sa: An inexact and self-adaptive admm for federated learning. Neural Networks, 181:106772, 2025.

Su, W., Boyd, S., and Candes, E. J. A differential equation for modeling Nesterov s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1 43, 2016.

Tong, G. and Muehlebach, M. A dynamical systems perspective on discrete optimization. Proceedings of Machine Learning Research, 211:1 14, 2023.

Tran Dinh, Q., Pham, N. H., Phan, D., and Nguyen, L. Fed DR randomized Douglas-Rachford splitting algorithms for nonconvex federated composite optimization. Proceedings of the International Conference on Advances in Neural Information Processing Systems, 34:30326 30338, 2021.

Umlauft, J. and Hirche, S. Feedback linearization based on gaussian processes with event-triggered online learning. IEEE Transactions on Automatic Control, 65(10):4154 4169, 2019.

Wang, G., Wang, D., Li, C., and Lei, Y. The fast inertial ADMM optimization framework for distributed machine learning. Future Generation Computer Systems, 164: 107575, 2025.

Wang, H., Sievert, S., Liu, S., Charles, Z., Papailiopoulos, D., and Wright, S. ATOMO: Communication-efficient

learning via atomic sparsification. Proceedings of the International Conference on Advances in Neural Information Processing Systems, pp. 9850 9861, 2018.

Wang, H., Marella, S., and Anderson, J. Fed ADMM: A federated primal-dual algorithm allowing partial participation. Proceedings of the IEEE Conference on Decision and Control, pp. 287 294, 2022.

Wei, E. and Ozdaglar, A. Distributed alternating direction method of multipliers. Proceedings of the IEEE Conference on Decision and Control, pp. 5445 5450, 2012.

Wei Liu, Li Chen, and Wenyi Zhang. Decentralized federated learning: Balancing communication and computing costs. IEEE Transactions on Signal and Information Processing over Networks, 8:131 143, 2021.

Wibisono, A., Wilson, A. C., and Jordan, M. I. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47): E7351 E7358, 2016.

Yin, D., Chen, Y., Kannan, R., and Bartlett, P. Byzantinerobust distributed learning: Towards optimal statistical rates. Proceedings of the International Conference on Machine Learning, 80:5650 5659, 2018.

Yu, Z. and Freris, N. M. Communication-efficient distributed optimization with adaptability to system heterogeneity. Proceedings of the IEEE Conference on Decision and Control, pp. 3321 3326, 2023.

Zhang, X., Hong, M., Dhople, S., Yin, W., and Liu, Y. Fed PD: A federated learning framework with adaptivity to non-iid data. IEEE Transactions on Signal Processing, 69:6055 6070, 2021.

Zhang, Z., Yang, S., and Xu, W. Decentralized ADMM with compressed and event-triggered communication. Neural Networks, 165:472 482, 2023.

Zhang, Z., Yang, S., Xu, W., and Di, K. Privacy-preserving distributed ADMM with event-triggered communication. IEEE Transactions on Neural Networks and Learning Systems, 35(2):2835 2847, 2024.

Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. Federated learning with non-iid data. arxiv:1806.00582, 2018.

Zheng, S., Ye, T., Li, X., and Gao, M. Federated learning via consensus mechanism on heterogeneous data: A new perspective on convergence. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7595 7599, 2024.

Distributed Event-Based Learning via ADMM

Zhou, S. and Li, G. Y. Federated learning via inexact ADMM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8):9699 9708, 2023.

Zhu, H. and Ling, Q. Broadcast: Reducing both stochastic and compression noise to robustify communicationefficient federated learning. ar Xiv:2104.06685, 2021.

Distributed Event-Based Learning via ADMM

A. Communication Structure

This section discusses the sharing problem and consensus reaching over graphs as two special cases of the more general constrained minimization problem

min x Rp, z Rqf(x) + g(z), subject to Ax + Bz = c, (4)

with variables x Rp and z Rq and constant matrices A Rr p, B Rr q, and c Rr. The objective function is decomposed into a smooth part f : Rp R and nonsmooth part g : Rq R. The communication structure of the problem formulation (4) is shown in Fig. 4, where primal, dual and auxiliary variables are treated as different communication nodes.

sk+1 s[k] rk+1 r[k]

Figure 4: The communication structure that arises from Alg. 2, where s := Bz, r := Ax, and u denotes the dual variable.

A.1. Sharing Problem

We will show that the event-based communication structure introduced in Fig. 4 simplifies considerably for the sharing problem. The sharing problem takes the following form,

min x1,...,x N Rp

i=1 f i(xi) + g

and arises as a special case from (4) when choosing f(x) = PN i=1 f i(xi), x = (x1, x2, . . . , x N) RNp, A = INp, B = (Ip, Ip, . . . , Ip), c = 0. The problem can be solved via the following updates, by agents i = 1, . . . , N:

xi k+1 = argmin xi Rp f i xi + ρ

xi xi k + ˆhk 2 , (5)

and by agent N + 1:

i=1 ˆxi k+1

zk+1 = argmin z Rp g(Nz) + Nρ

uk+1 = uk + ρ ( xk+1 zk+1)

hk+1 = xk+1 zk+1 + 1

For the sharing problem, the general communication scheme in Fig. 4 reduces to the diagram in Fig. 5, where each node communicates their local variable in an event-based manner.

Agent 1 Agent 2 Agent 3 Agent 4

x1 k+1 x1 [k] x2 k+1 x2 [k] x3 k+1 x3 [k] x4 k+1 x4 [k]

hk+1 h[k] hk+1 h[k] hk+1 h[k] hk+1 h[k]

Figure 5: The diagram visualizes the communication structure for the sharing problem for N = 4 agents.

Distributed Event-Based Learning via ADMM

Agent 1 Agent 2

Agent 3 Agent 4

x1 k+1 x1 [k] x3 k+1 x3 [k]

x3 k+1 x3 [k]

x4 k+1 x4 [k]

x4 k+1 x4 [k] x2 k+1 x2 [k]

x2 k+1 x2 [k]

x3 k+1 x3 [k]

Figure 6: The diagram visualizes the communication structure for a distributed learning problem over a graph that connects four agents with four edges.

A.2. Consensus over a Graph

As another example, we will show that (4) also generalizes to distributed learning scenarios over graphs. We consider a network topology, captured by an undirected connected graph G = (V, E), where V = {1, . . . , N} is the set of vertices and E V V is the set of edges. Each agent (vertex) has a local data distribution, and the aim is to train a model without a central server to aggregate the collected information. The problem can be formulated as follows:

min xi Rp,zij Rp

i=1 f i xi , subject to xi = zij, xj = zij, (i, j) E.

Similar to the formulation in (Yu & Freris, 2023), we define transmitter and receiver matrices ˆAt, ˆAr R|E| N for all edges, i.e.,

ei = h ˆAr i

( 1 (i, j) E 0 otherwise , e E.

By stacking xi, zij Rp into column vectors x RNp, z R|E|p, respectively, we conclude that distributed learning over graphs is indeed a special case of (4),

min x RNp, z R|E|p f(x), subject to ˆAt Ip ˆAr Ip

x = I|E|p I|E|p

where denotes the Kronecker product and Ip the identity matrix. Thus, the matrices A and B encode the topology of the communication graph, which will affect the convergence rates as highlighted with our main result Thm. 4.1 where the convergence rate is dictated by the value κ = σ(A)L/( σ(A)m).

The resulting instance of Alg. 1 takes the following form:

xi k+1 = argmin xi Rp fi (xi) + |Ni|ρ

2 xk i xi k + 1

xi k+1 = 1 |Ni|

j Ni ˆxj k+1

pi k+1 = pi k + ρ

2 xi k+1 xi k+1 ,

where Ni represents the set containing the neighbors of the agent i and |Ni| is the number of vertices. In the event based-communication setting, an agent transmits its local model (xi k+1) to the neighbors only if there has been a significant change in the local model. Fig. 6 shows an example with four agents, each communicating local variables only.

Distributed Event-Based Learning via ADMM

B. Proof of Thm. 2.3

Proof. We will establish the convergence rate by analyzing the behavior of a carefully chosen Lyapunov function. Let us begin by formulating the augmented Lagrangian for (1):

Lρ(x, z, y) =

i=1 f i(xi) + g(z) +

i=1 (yi) (xi z) + ρ

i=1 |xi z|2 2, (8)

where x = (x1, . . . , x N) and y = (y1, . . . , y N) represent the primal and dual variables respectively, and ρ > 0 is the penalty parameter.

We express the ADMM updates in Alg. 1 for α = 1 as follows by introducing the scaled dual variable ui = 1

xi k+1 = arg min xi

f i(xi) + ρ

2|xi ˆzk + ui k|2 2 , i {1, . . . , N}, (9)

zk+1 = arg min z

i=1 |ˆxi k+1 z + ˆui k|2 2

ui k+1 = ui k + xi k+1 ˆzk+1, i {1, . . . , N}. (11)

Here, we use the notation ˆzk+1 = zk+1 + εz k+1, ˆxi k+1 = xi k+1 + εx,i k+1, and ˆui k+1 = ui k+1 + εu,i k+1 to account for errors emerging from event-based communication.

From (9) and (10), we derive the following first-order optimality conditions for xi k+1 and zk+1

0 = f i(xi k+1) + ρ(xi k+1 ˆzk + ui k) (12)

0 g(zk+1) + ρ

i=1 (zk+1 ˆxi k+1 ˆui k). (13)

We then define the Lyapunov function:

Vk = |zk z |2 2 + 1

i=1 |ui k ui |2 2, (14)

where (ui , z ) denotes the optimal dual and consensus variables. Our goal is to demonstrate that this Lyapunov function is monotonically decreasing.

The optimality condition in (12) implies that xi k+1 minimizes,

f i(x) + ρ ui k+1 + zk+1 zk x + ρ(εz k+1 εz k) x.

From this minimization, we can derive the following inequality,

f i(xi k+1) f i(xi ) ρ ui k+1 + zk+1 zk (xi xi k+1) + ρ(εz k+1 εz k) (xi xi k+1). (15)

Similarly, the optimality condition (13) indicates that zk+1 minimizes,

i=1 (ui k+1 + εd,i k+1 + εz k+1) z, (16)

where εd,i k+1 = εx,i k+1 + εu,i k . This minimization leads to,

g (zk+1) g (z ) ρ

i=1 (ui k+1) (zk+1 z ) ρ

i=1 (εd,i k+1 + εz k+1) (zk+1 z ). (17)

Distributed Event-Based Learning via ADMM

By adding (17) and the sum over i of (15), and applying the conditions xi z = 0 along with the relation xi xi k+1 = ri k+1 (zk+1 z ), we obtain,

g(zk+1) g(z )+

f i(xi k+1) f i(xi ) ρ

ui k+1 ri k+1 ρ

i=1 (zk+1 zk) (ri k+1 + (zk+1 z ))

i=1 (εz k+1 εz k) (xi xi k+1)+ρ

i=1 (εd,i k+1 + εz k+1) (zk+1 z ),

where the residual is defined as ri k := xi k zk.

Since (xi , z , ui ) is a saddle point of L0 in (8), i.e., L0(x , z , u ) L0(xk+1, zk+1, u ), we have,

g(zk+1) + g(z )

f i(xi k+1) f i(xi )

i=1 ui (xi k+1 zk+1) =

i=1 ui ri k+1. (19)

By adding (18) and (19) and multiplying by 2

ρ, we arrive at

ui k+1 ui ri k+1 | {z } (I)

i=1 (zk+1 zk) (ri k+1 + (zk+1 z ))

| {z } (II)

i=1 (εz k+1 εz k) (xi xi k+1) 2

i=1 (εd,i k+1+εz k+1) (zk+1 z ).

Here ui k+1 ui ri k+1 can be written as ui k+1 ui (ui k+1 ui k + εz k+1) which splits into

ui k+1 ui ri k+1 = 1

2 ui k+1 ui k εz k+1 + 1

2|ui k+1 ui k|2 + 1

2 ui k+1 ui k (ui k+1 ui k + εz k+1)

+ ui k ui (ui k+1 ui ) + ui k ui εz k+1 |ui k ui |2.

Next, we substitute ui k+1 ui k = ri k+1 εz k+1 and use the following squared norm identity,

1 2|ui k+1 ui k|2 + ui k ui (ui k+1 ui ) = 1

2|ui k+1 ui |2 + 1

2|ui k ui |2.

Consequently, we can expand terms (I) and (II) as follows:

|ui k+1 ui |2 |ui k ui |2 + 2(ui k+1 ui ) εz k+1 + |εz k+1|2 + |ri k+1|2 2(εz k+1) ri k+1

|ri k+1 + (zk+1 zk) |2 + |zk+1 z |2 |zk z |2 |ri k+1|2 .

Substituting these expansions back into (20) and using the definition of our Lyapunov function from (14), we can express the decrease in the Lyapunov function as,

0 N(Vk+1 Vk) +

2(ui k+1 ui ) εz k+1 + |εz k+1|2 2(εz k+1) ri k+1

|ri k+1 + (zk+1 zk) |2 + 2

i=1 (εz k+1 εz k) (xi k+1 xi ) 2

i=1 (εd,i k+1 + εz k+1) (zk+1 z ).

Distributed Event-Based Learning via ADMM

Modifying the error terms using xi k+1 xi = ri k+1 + (zk+1 z ) and ri k+1 = ui k+1 ui k + εz k+1, we get,

i=1 |ri k+1 + (zk+1 zk) |2 +

2(εz k+1 εz k) (ui k+1 ui ) |εz k+1|2

+ 2(εd,i k+1 + εz k) (zk+1 z ) 2(εz k) ui k ui 2(εz k) εz k+1 .

Furthermore, we expand the squared norm term:

i=1 |ri k+1 + (zk+1 zk) |2 =

i=1 |ri k+1|2 2

i=1 (ri k+1) (zk+1 zk)

i=1 |zk+1 zk|2. (22)

Next we will establish a bound for the cross term 2 PN i=1(ri k+1) (zk+1 zk).

We recall that (13) implies (16). This minimization property leads to the following pair of inequalities:

g (zk+1) g (zk) ρ

i=1 (ui k+1 + εd,i k+1 + εz k+1) (zk+1 zk)

g (zk) g (zk+1) ρ

i=1 (ui k + εd,i k + εz k) (zk+1 zk).

By adding these inequalities and rearranging terms, we obtain,

i=1 (ui k+1 ui k + εd,i k+1 εd,i k + εz k+1 εz k) (zk+1 zk) =

i=1 (ri k+1 + εd,i k+1 εd,i k εz k) (zk+1 zk).

This yields the desired bound on the cross term 2 PN i=1(ri k+1) (zk+1 zk) which takes the form

i=1 (ri k+1) (zk+1 zk) 2

i=1 (εd,i k+1 εd,i k εz k) (zk+1 zk). (23)

As the next step, we rewrite |zk+1 zk| in terms of the gradients of f i and g. This will be important for deriving the desired convergence result. From the first-order optimality condition of the z-update (13), and rearranging for zk+1, we get,

xi k+1 + ui k + εd,i k+1 1 ρN g(zk+1). (24)

The optimality condition for xi k+1 (see (12)) results in,

zk = εz k + 1

xi k+1 ui k 1

ρ f i(xi k+1) . (25)

We combine (24) and (25) to express the update for zk+1 zk as follows,

i=1 f i(xi k+1) + g(zk+1)

i=1 εd,i k+1.

Thus, zk+1 zk depends on the averaged gradients f i and g, scaled by the penalty parameter ρ. We now take the square and apply Young s inequality on the cross term with γ = 2, which yields,

|zk+1 zk|2 1

i=1 f i(xi k+1) + νk+1

i=1 εd,i k+1

Distributed Event-Based Learning via ADMM

for any νk+1 g(zk+1).

By substituting (22), (23), and (26) in (21), and applying Young s inequality to cross terms, we derive the following inequality:

i=1 |ri k+1|2 1

i=1 f i(xi k+1) + νk+1

i=1 εd,i k+1

γ4|εz k+1 εz k|2 + 1

γ4 |ri k+1|2 + γ1|εz k+1|2 + 1

γ1 |uk u |2 |εz k+1|2 + γ2|2εd,i k+1 εd,i k |2

γ2 |zk+1 zk|2 + γ3|εd,i k+1 + εz k|2 + 1

γ3 |zk z |2 + 2(εz k+1 2εz k) εz k+1 ,

for any νk+1 g(zk+1).

We can further simplify the expression by choosing the values as γ1 = γ3 = γk and γ2 = γ4 = 3,

N(Vk+1 Vk) N

i=1 |ri k+1|2 1

i=1 f i(xi k+1) + νk+1

i=1 εd,i k+1

3|εz k+1 εz k|2 + γk|εz k+1|2 |εz k+1|2 + 3|2εd,i k+1 εd,i k |2 + γk|εd,i k+1 + εz k|2

+ 4|εz k+1 2εz k|2 + 4|εz k+1|2 ,

for any νk+1 g(zk+1).

The error values arising from event-based communication are bounded by the communication thresholds, |εd,i k | d k, |εz k| z k. This leads to the following inequality,

i=1 |ri k+1|2 1

i=1 f i(xi k+1) + νk+1

+ (3γk + 51 + 8

3N ) z k 2 + (2γk + 30 + 8

3N ) d k 2,

for any νk+1 g(zk+1), where ri k+1 = xi k+1 zk+1 represents the residuals at step k + 1. Simplifying the expression, we obtain:

i=1 |xi k+1 zk+1|2 1

i=1 f i(xi k+1) + νk+1

+ O(γk 2 k), (27)

for any νk+1 g(zk+1), where k represents the time-varying communication threshold, i.e., chosen bound for the perturbation term. This inequality suggests a relationship between Vk and Vk+1 at consecutive steps.

To analyze the convergence of the sequence Vk, we apply Polyak s Lemma (Lemma 2 in (Polyak, 1987, Chapter 2.2)), which establishes convergence under certain additional assumptions. Polyak s Lemma states that if a sequence Vk satisfies an inequality of the form,

Vk c k + c+ k ,

where c k and c+ k are sequences of non-negative terms, then the sequence Vk is bounded above provided that P k=0 1 γk < , and P k=0 c+ k < .

Distributed Event-Based Learning via ADMM

We choose γk = (k + 1)p for some p > 1 to ensure convergence. Substituting this choice into the recurrence relation, we obtain:

Vk+1 1 + 1 (k + 1)p

i=1 |xi k+1 zk+1|2 1

i=1 f i(xi k+1) + νk+1

+ O(kp 2 k),

for any νk+1 g(zk+1).

By assumption the communication threshold k decays as k O (1/kt). Under this assumption, the perturbation term O(kp 2 k) scales as O(kp 2t 2 0), which satisfies P k=0 γk 2 k < if p > 1 and t > 1+p

2 . These conditions ensure that the perturbation term decays sufficiently fast, ensuring boundedness of Vk.

Finally, by summing over k = 0 to K and dividing by K + 1, we obtain the following bound for the average of the residuals and gradient terms:

i=1 |xi k+1 zk+1|2 + 1

i=1 f i(xi k+1) + νk+1

for any νk+1 g(zk+1), where communication threshold decays k 0 (k+1)2 . This result establishes a sublinear convergence rate for both the residuals and the gradient terms. The rate of decay of the communication threshold ensures that these error terms decrease at a rate proportional to O 1

K , which yields the desired result.

Distributed Event-Based Learning via ADMM

C. Derivation of Alg. 2 as a Dynamical System

In this section, we represent Alg. 2 as a dynamical system that consists of linear dynamics with a nonlinear feedback interconnection. The communication structure is summarized with Fig. 2.

For the convenience of the reader, we start by restating Alg. 2, which is based on an over-relaxed ADMM algorithm.

Algorithm 3 Event-Based Distributed Optimization with Over-Relaxed ADMM

Require: Functions f and g, matrices A and B, vector c, parameters ρ and α Require: Initial condition x0, z0

r0 = ˆrs 0 = ˆru 0 = Ax0, s0 = ˆsr 0 = ˆsu 0 = Bz0, u0 = ˆur 0 = ˆus 0 = 0 for k = 0 to tmax do

ˆsr k, ˆur k event-based receive of sk+1 s[k], uk+1 u[k] xk+1 = arg minx f(x) + ρ

2|Ax + ˆsr k c + ˆur k|2 {r-agent} event-based send of rk+1 r[k] where rk+1 = Axk+1

ˆrs k+1, ˆus k event-based receive of rk+1 r[k], uk+1 u[k] zk+1 = arg minz g(z) + ρ

2|αˆrs k+1 (1 α)Bzk + Bz αc + ˆus k|2 {s-agent} event-based send of sk+1 s[k] where sk+1 = Bzk+1

ˆru k+1, ˆsu k+1 event-based receive of rk+1 r[k], sk+1 s[k] uk+1 = uk + αˆru k+1 (1 α)ˆsu k + ˆsu k+1 αc {u-agent} event-based send of uk+1 u[k]

if mod(k + 1, T) = 0 then

reset ˆru;s k+1 = rk+1, ˆsu;r k+1 = sk+1, ur;s k+1 = uk+1 end if end for

The following definitions will be useful for simplifying the updates of the iterates: Definition C.1. Let Assumption 3.1 and 3.2 hold. We define the function ˆf : Rn R as follows,

ˆf = (ρ 1f) A 1, (29)

where ρ is the step-size of Alg. 2. The function is ˆm := m/(ρ σ2(A))-strongly convex and ˆL := L/(ρ σ2(A))-smooth, and has therefore the condition number

κ := ˆL ˆm = L

Definition C.2. Let Assumption 3.1 and 3.2 hold. The function ˆg : Rm R is defined as

ˆg = (ρ 1g) B + ψim(B), (30)

where B is the Moore-Penrose inverse of B, ψim(B) is the indicator function of the image of B, and ρ is the step-size of Alg. 2.

We proceed by summarizing the notation that will be used subsequently. The sequences rk and sk are defined as rk := Axk and sk := Bzk. We introduced the variable ˆrs k, for example, which models agent s s estimate of the variable rk. The variables ˆru k, ˆsr k, ˆsu k, etc., are defined analogously and follow the notational convention

\ variable receiving agent k := receiving agent s estimate of variable at time k.

As a result of the event-based communication, the local estimates ˆrs k, ˆru k, ˆsr k, etc., differ from rk, sk, etc. These differences will be captured by the variable ε for which we introduce the following notational convention:

εvariable,receiving agent k : = \ variable receiving agent k variablek = receiving agent s estimation error of variable at time k.

Distributed Event-Based Learning via ADMM

We further introduce the error

ek := (εrs k+1, εru k+1, εsu k+1, εsr k , εsu k , εur k , εus k ), (31)

that collects the estimation errors of the different agents. By virtue of the event-based communication mechanism and the reset mechanism, the error ek is bounded by the communication threshold . We finally introduce the notation for the corresponding communication thresholds, rs, sr, etc. (see Fig. 2), according to the same rationale:

variable,receiving agent := threshold for triggering a communication of variable to receiving agent.

To sum up, the vector ek contains the errors on the communication lines shown on Fig. 2. For example, εrs k+1 stands for the difference between the actual value of state rk+1 and agent s s estimate ˆrs k+1, that is, εrs k+1 = ˆrs k+1 rk+1 at time step k + 1.

If the value rk+1 has deviated more than rs amount since the time-step [k], where the last value r[k] has been communicated to the agent s, a communication is triggered. This means rs Ck+1.

|rk+1 rlrs k | > rs rs Ck+1 [k + 1] = k + 1.

The set Dk+1, which is a subset of Ck+1, collects indices of failed transmission lines at time step k + 1. We further introduce that the superscript c to denote the complement of a set. If the communication does not fail, that is, rs Dc k+1, then agent s s estimate of rk+1 is updated as follows.

rs Ck+1 rs Dc k+1 ˆrs k+1 = ˆrs k + (rk+1 r[k]).

To incorporate the effect of communication drops, we introduce the variable χrs k+1, which represents the disturbance that results from dropped communications,

rs Dk+1 χrs k+1 = (rk+1 r[k]). (32)

Therefore, the dynamics of ˆrs k+1 are expressed as follows

ˆrs k+1 = r[k+1] +

l=1 χrs l . (33)

When deriving the previous equation, we have exploited the fact that

rs Ck+1 [k + 1] = k + 1

rs Cc k+1 [k + 1] = [k].

To summarize, in the case of communication drop, the agent s updates the image of r with a disturbed value.

We now express the different minimization steps in Alg. 3 by their corresponding stationarity conditions, and simplify the corresponding expressions. The minimization step for the primal variable x can be rewritten as follows

xk+1 = arg min x Rp f(x) + ρ

2 |Ax + ˆsr k c + ˆur k|2

= A 1 arg min r Rn f(A 1r) + ρ

2 |r + ˆsr k c + ˆur k|2 ,

due to the fact that A is invertible, which yields

rk+1 = arg min r Rn ˆf(r) + 1

2 |r + ˆsr k c + ˆur k|2 .

The variable rk+1 satisfies therefore the following stationarity condition

0 = ˆf(rk+1) + rk+1 + ˆsr k c + ˆur k,

Distributed Event-Based Learning via ADMM

which can be rearranged to

rk+1 c = ˆf(rk+1) sk εsr k uk εur k , (34)

where ˆsr k is replaced by sk + εsr k and ˆur k by uk + εur k .

Similarly, the update step of the auxiliary variable z can be reformulated as

zk+1 = arg min z Rn g(z) + ρ

2|αˆrs k+1 (1 α)sk + Bz αc + ˆus k|2

= B arg min s Rm g(B s) + ψim(B)(s) + ρ

2|αˆrs k+1 (1 α)sk + s αc + ˆus k|2,

since the matrix B has full column rank and therefore possesses the left inverse B . This yields

sk+1 = arg min s Rm ˆg(s) + 1

2|αˆrs k+1 (1 α)sk + s αc + ˆus k|2,

and implies the following stationarity condition for sk+1

0 ˆg(sk+1) + αˆrs k+1 (1 α)sk + sk+1 αc + ˆus k. (35)

This stationarity condition can be reformulated as

sk+1 = sk + (α 1)uk + α ˆf(rk+1) γk+1 εus k + αεsr k + αεur k αεrs k+1, (36)

for some γk+1 ˆg(sk+1), and where we have expressed ˆrs k+1 as rk+1 + εrs k+1 and ˆus k as uk + εus k . We have further replaced rk+1 c by the expression given in (34).

The update of the dual variables uk evolve according to the following dynamics:

uk+1 = uk + αˆru k+1 (1 α)ˆsu k + ˆsu k+1 αc

= uk + α(rk+1 + εru k+1) (1 α)(sk + εsu k ) + (sk+1 + εsu k+1) αc.

The dynamics can be further simplified by replacing sk+1 with the help of (35), which yields:

uk+1 = γk+1 αεrs k+1 + αεru k+1 + εsu k+1 + (α 1)εsu k εus k . (37)

As a result of these simplifications, we note that rk+1 is uniquely determined by sk, uk and the corresponding errors εsr k and εur k . We further note that according to (36) and (37) the iterates of Alg. 3 can be represented as an interconnection between a linear dynamical system, with a nonlinear feedback interconnection that models the evaluation of the gradient ˆf and ˆg. The state of the dynamical system is therefore chosen as ξk := (sk, uk), the output as yk := (rk+1 c, sk+1), and the input as vk := ( ˆf(rk+1), γk+1), where γk+1 ˆg(sk+1). We also define output variables w1 k := (rk+1 c, ˆf(rk+1)), w2 k := (sk+1, γk+1), which will be employed for the convergence analysis.

According to these definitions, we can express the iterates of Alg. 3 as trajectories of the following nonlinear dynamical system,

ξk+1 = 1 α 1 0 0

| {z } := ˆ A

ξk + α 1 0 1

| {z } := ˆ B

vk + α 0 0 α 0 α 1 α α 1 0 α 1 0 1

| {z } := ˆ E

ek, vk = ϕ(yk),

yk = 1 1 1 α 1

| {z } := ˆ C

ξk + 1 0 α 1

| {z } := ˆ D

vk + 0 0 0 1 0 1 0 α 0 0 α 0 α 1

| {z } := ˆ Ey

w1 k = 1 1 0 0

| {z } := ˆ C1

ξk + 1 0 1 0

| {z } := ˆ D1

vk + 0 0 0 1 0 1 0 0 0 0 0 0 0 0

| {z } := ˆ E1

w2 k = 1 α 1 0 0

| {z } := ˆ C2

ξk + α 1 0 1

| {z } := ˆ D2

vk + α 0 0 α 0 α 1 0 0 0 0 0 0 0

| {z } := ˆ E2

Distributed Event-Based Learning via ADMM

ˆA ˆE ˆB ˆC ˆEy ˆD

e ˆf(rk+1) γk+1

rk+1 c sk+1

Delay sk uk

Figure 7: The dynamical system following (38) is visualized.

where ϕ denotes the nonlinear feedback interconnection that captures the evaluation of the gradients ˆf and ˆg. Fig. 7 provides a graphical representation of the time-invariant dynamics determined by the matrices ˆA, ˆB, ˆC, ˆD.

We close the section with the following proposition that shows that |ek| is bounded.

Proposition C.3. The error ek at iteration k is bounded by

|ek| , := X

l {rs,ru,su,sr,su,ur,us} l + T χl,

where the variable χl is an upper bound on the communication drops.

Proof. The proof is analogous to Prop. 2.1. The error resulting from the event-based communication structure is given by

εrs k+1 = ˆrs k+1 rk+1 = r[k+1] rk+1 | {z } I

l=1 χrs l | {z } II

We further note that the first term is bounded by rs by virtue of the communication rule

|rk+1 r[k+1]| rs.

Through the assumption |χrs l | χrs, the second part is bounded by T χrs, where T is the reset period. Therefore, we conclude that |ers k+1| is bounded by rs + T χrs. Similarly, the other elements of the vector ek are bounded by ru + T χru, su + T χsu, etc. Hence, |ek| is bounded by where

l {rs,ru,su,sr,su,ur,us} l + T χl.

The analysis indicates that a periodic reset with a period T is required to achieve a bounded error. If no reset is included, Alg. 2 may not converge, which could result in a large error that accumulates over time. The dependence of on the period T highlights how the reset period T affects the error (where smaller T leads to a smaller error bound). If there are no communication failures, there is also no need for a reset ( χ = 0), and reduces to the collection of communication thresholds.

Distributed Event-Based Learning via ADMM

D. Convergence Analysis

We first start by proving the following intermediate lemmas.

Lemma D.1. Let Assumption 3.2 be satisfied. Then, the following holds,

(r1 r2) ( ˆf(r1) ˆf(r2)) 2 ˆmˆL ( ˆm + ˆL) ( ˆm + ˆL) 2

r1 r2 ˆf(r1) ˆf(r2)

for all r1, r2 Rn.

Proof. We define the auxiliary function f(r) := ˆf(r) ˆm

2 |r|2, which is ˆL ˆm-smooth and convex by the properties of ˆf. Then the following inequality holds,

( f(r1) f(r2)) (r1 r2) 1 ˆL ˆm | f(r1) f(r2)|2,

for any r1, r2 Rn. Substituting f(r) := ˆf(r) ˆm

2 |r|2 and f(r) = ˆf(r) ˆmr, we get

( ˆm + ˆL)(r1 r2) ( ˆf(r1) ˆf(r2)) ˆmˆL|r1 r2|2 + | ˆf(r1) ˆf(r2)|2,

which yields the desired result.

Lemma D.2. Let Assumption 3.2 be satisfied. Then, the following holds,

(s1 s2) (γ1 γ2) 0 1 1 0

s1 s2 γ1 γ2

where γ1 ˆg(s1) and γ2 ˆg(s2) and for any s1, s2 Rm.

Proof. The subdifferential of a convex function is a monotone operator, and therefore

(s1 s2) (γ1 γ2) 0.

Lemma D.3. Let x , z denote the minimizer of (3) and define r := Ax , s := Bz , β := ˆf(r ), and γ ˆg(s ). Then, the iterates of Alg. 3 with step-size ρ = ρ0( ˆmˆL) 1 2 satisfy

(wi k wi ) M i(wi k wi ) 0, i {1, 2}, k 0,

2ρ 2 0 ρ 1 0 κ 1

2 + κ 1 2 2

In, M 2 := 0 1 1 0

w1 k := rk+1 c βk+1

, w2 k := sk+1 γk+1

, w1 := r c β

, w2 := s γ

Proof. The proof follows directly from Lemma D.1 and Lemma D.2.

Lemma D.4. Let the sequence Vk 0 satisfy

Vk+1 Vk(1 α) + β α, (43)

for all k 0, where the parameters α, β satisfy 0 < α < 1 and 0 β. Then, the following holds for all k 0:

Vk V0(1 α)k + β. (44)

Distributed Event-Based Learning via ADMM

Proof. We prove the lemma by induction.

The claim holds for k = 0. We therefore assume that the claim holds for k and show that, as a result, the claim holds for k + 1. More precisely,

Vk+1 Vk(1 α) + β α

V0(1 α)k+1 + (1 α) β + β α

V0(1 α)k+1 + β,

which completes the induction argument.

In App. C, we expressed the iterates of Alg. 3 as the trajectories of a dynamical system. The dynamical system was given as a linear time-invariant system that was interconnected in feedback with a nonlinear function ϕ. We now arrive at the main result that will be used to show convergence of Alg. 3.

Theorem D.5. Let Assumption 3.2 be satisfied, let the step-size for Alg. 3 be ρ = ρ0( ˆmˆL) 1 2 , and let ξ = (Bz , u ), where (x , z ) is the minimizer of (3) and u the corresponding dual variable.

Suppose there exists a positive definite matrix P 0, 0 < τ < 1, and nonnegative constants λ1, λ2, γ1, γ2, γ3 and γ4 such that the following linear matrix inequality

0 (1+γ1) ˆA P ˆA τ 2P ˆA P ˆB ˆB P ˆA (1+γ2) ˆB P ˆB

+ ˆC1 ˆD1 ˆC2 ˆD2

Λ1M 1 0 0 Λ2M 2

ˆC1 ˆD1 ˆC2 ˆD2

is satisfied, where Λ1 = λ1(1 + γ3), Λ2 = λ2(1 + γ4). Then, for all k 0, we have

|ξk ξ |2 κP |ξ0 ξ |2τ 2k + σ(Q) 2

σ(P)(1 τ 2), (47)

where κP = σ(P)/ σ(P) denotes the condition number of the matrix P, is a bound on the error ek (see Prop. C.3), and

ˆE P ˆE + 1 + 1

i=1 λi ˆEi M i ˆEi. (48)

Proof. We consider the following quadratic storage function,

Vk = (ξk ξ ) P(ξk ξ ),

and claim that the following inequality holds for the iterates of Alg. 3:

Vk+1 τ 2Vk +

i=1 λi(wi wi ) M i(wi wi )

i=1 λi 1 + 1

Ei M i Ei ek.

Distributed Event-Based Learning via ADMM

Proof of the claim: We insert the system dynamics stated in App. C into the expression on the left-hand side, which yields

Vk+1 τ 2Vk +

i=1 λi(wi k wi ) M i(wi k wi )

=(ξk+1 ξ )T P(ξk+1 ξ ) τ 2(ξk ξ )T P(ξk ξ ) +

i=1 λi(wi k wi ) M i(wi k wi )

= ξ k A PA τ 2P ξk + υ k ˆB P ˆB υk + e k E PEek

+ 2 υ k ˆB P ˆA ξk + e k E P ˆA ξk + υ k ˆB PEek

i=1 λi ξ k ˆCi M i ˆCi ξk + υ k ˆDi M i ˆDi υk + e k Ei M i Eiek

i=1 λi υ k ˆDi M i ˆCi ξk + e k Ei M i ˆCi ξk + υ k ˆDi M i Eiek ,

where ξk = ξk ξ , and υk = υk υ , for simplicity. We now apply Young s inequality on the cross terms in (49), which yields

Vk+1 τ 2Vk +

i=1 λi(wi k wi ) M i(wi k wi )

ξ k A PA τ 2P ξk + υ k ˆB P ˆB υk + e k E PEek + 2 υ k ˆB P ˆA ξk

+ γ2( υ k ˆB P ˆB υk) + 1

γ2 (e k E PEek) + γ1( ξ k ˆA P ˆA ξk) + 1

γ1 (e k E PEek)

i=1 λi ξ k ˆCi M i ˆCi ξk + υ k ˆDi M i ˆDi υk + e k Ei M i Eiek + 2 υ k ˆDi M i ˆCi ξk

i=1 λi γ4 υ k ˆDi M i ˆDi υk + 1

γ4 e k Ei M i Eiek + γ3 ξ k ˆCi M i ˆCi ξk + 1

γ3 e k Ei M i Eiek

If we rearrange the right-hand side of the inequality in matrix form, we obtain,

Vk+1 τ 2Vk+

i=1 λi(wi k wi ) M i(wi k wi )

ξ k υ k (1 + γ1) ˆA P ˆA τ 2P ˆA P ˆB ˆB P ˆA (1 + γ2) ˆB P ˆB

+ ˆC1 ˆD1 ˆC2 ˆD2

λ1M 1(1 + γ3) 0 0 λ2M 2(1 + γ4)

ˆC1 ˆD1 ˆC2 ˆD2

i=1 λi 1 + 1

Ei M i Ei !

The fact that the linear matrix inequality (46) is satisfied proves the claim. Furthermore, we conclude that P2 i=1 λi(wi k wi ) M i(wi k wi ) 0 from Lemma D.1 and D.2. This simplifies the previous expression to

Vk+1 τ 2Vk + e k Qek,

where we have also inserted the definition of the matrix Q. The right-hand side can further be bounded by virtue of the reset mechanism and the event-based communication, which results in

Vk+1 τ 2Vk + σ(Q) 2. (52)

Distributed Event-Based Learning via ADMM

We are now in a position where we can apply Lemma D.4, which concludes

Vk τ 2k V0 + σ(Q) 2

By definition of the quadratic storage function we conclude

|ξk ξ |2 τ 2k σ(P)

σ(P)|ξ0 ξ |2 + σ(Q) 2

σ(P)(1 τ 2),

which implies the result of Thm. D.5.

We are now ready to prove our main result in Thm. 4.1.

Proof of Thm. 4.1. The proof is based on Thm. D.5 which shows that if the matrix inequality

0 (1 + γ1) ˆA P ˆA τ 2P ˆA P ˆB ˆB P ˆA (1 + γ2) ˆB P ˆB

+ ˆC1 ˆD1 ˆC2 ˆD2

Λ1M 1 0 0 Λ2M 2

ˆC1 ˆD1 ˆC2 ˆD2

is satisfied for a symmetric positive definite matrix P and for positive constants Λ1, Λ2, γ1, γ2, the following bound holds

|ξk ξ |2 κP |ξ0 ξ |2τ 2k + σ(Q) 2

σ(P)(1 τ 2),

where κP denotes the condition number of P and Q is defined in App. D. In fact, the following set of parameters satisfies the linear matrix inequality (53),

P = 1 α 1 α 1 1 1 κ

2 , Λ1 = α κϵ 1

2 , Λ2 = α, γ1 = α

This can be checked as follows: The matrix on the right-hand side of (53) can be expressed as 1

4κ 2L, where L is a symmetric 4 4 matrix (compared to earlier analyses (Nishihara et al., 2015), the last row and last column is not zero). We now prove that L is positive semidefinite for all sufficiently large κ by checking the leading principle minors, which can be expressed as polynomials in κ. If the leading terms of the principle minors have positive coefficients, it means that for large enough κ, the principle minor will indeed be positive.

The leading term for the first principle minor is given by 6κ 3 2 ϵ and is therefore positive. Likewise, the second principle minor is dominated by the positive term 24(2 α)κ 7 2 ϵ. For the third leading principle minor, there are two different cases. If ϵ = 0, the leading term of the third leading principle minor is 16κ5(α4 4α3 4α2 + 22α 12)/α, which is positive for α (0.675, 2). If ϵ > 0, the leading term of the third principle minor becomes 192κ5(2 α), which is positive for α (0, 2). Finally, for the fourth principle minor, there are also two different cases. If ϵ = 0, the leading term is 64κ 13

2 (α4 4α3 4α2 + 22α 12)/α2, which is positive for α (0.675, 2). If ϵ > 0, the leading term of the fourth principal minor becomes 768κ 13

2 (2 α)/α, which is positive for α (0, 2). In conclusion, for all sufficiently large κ, all four leading principle minors are positive, which implies that L is positive definite.

It remains to bound the second term σ(Q)/( σ(P)(1 τ 2)). We again investigate the symbolic expression, and conclude that the term is always bounded by 60κ2+2ϵ/(α(1 |α 1|)) for large enough κ. This concludes the proof.

Distributed Event-Based Learning via ADMM

E. Bound on Event-Based Error Variables

We restate Prop. 2.1 and present its proof.

Proposition. The error ˆζk ζk at iteration k is bounded by |ˆζk ζk| d+T χd, where T denotes the reset period (see Alg. 1) and χd is a bound on the disturbance χdi k .

Proof. We note that the error ˆζk ζk can be expressed as

i=1 (di [k+1] di k+1) | {z } I

l=T[k] χdi l+1

where T[k] denotes the last time instant where a reset has been performed. The terms I and II have each a clear interpretation: In the absence of communication failures, ˆζ is an average over the primal and dual variables, xi [k+1] and ui [k], that were last communicated, which leads to the term I. The term II captures the dropped information through failures. The communication protocol ensures that |di k+1 di [k+1]| d, for all k 0, which means that the term I is bounded by d. The bound for the term II arises from the triangle inequality, which yields, χd and concludes the proof.

The previous proposition required the variable χdi k+1 to be bounded. Prop. E.1 establishes such a bound under standard conditions on f and g.

Proposition E.1. Let f be L-smooth and convex and let {z Rn | g(z) < } be contained in a ball of radius R. Then, the disturbances χdi k and χzi k are bounded by

|χzi k | 2R, |χdi k | (α + 1)2(ρ + L)

ρ |xi | + 2R,

for all i=1, . . . , N and all k 0, where xi := arg minx Rn f i(x)+ρ|x|2/2, and where χzi k denotes the communication drops when communicating zk between agent N + 1 and agent i.

Proof. Due to the assumption that the domain of g is contained in a ball of radius R, we conclude |zk| R for all k 0. This also implies that |ˆzi k| R and concludes the bound on χz. For obtaining the remaining two bounds, we analyze the xi, ui dynamics of agent i, where we introduce the convex conjugate

f i(u, z) = sup x Rn u Tx f i(x) ρ

We note that the supremum is attained for xi k+1, if u = ρui k and z = ˆzi k in the previous equation. In addition, due to the properties of the convex conjugate, we conclude that f i( , z) is 1/ρ-smooth and 1/(ρ + L)-strongly convex. The conjugate subgradient theorem implies,

u f i( ρui k, ˆzi k) = xi k+1,

which means that the updates for ui k can now be expressed as:

ui k+1 = ui k + u f i( ρui k, ˆzi k) ˆzi k.

By applying Taylor s theorem, we obtain

ui k+1 = ui k + 2 u f i(νk, ˆzi k)( ρui k) ˆzi k + u f i(0, ˆzi k),

for some νk Rn. By leveraging the fact that, as a result of strong convexity and smoothness of f i( , z), the Hessian 2 u f i( , z) is upper and lower bounded by 1/ρ and 1/(ρ + L), respectively, we conclude

|ui k+1| 1 ρ ρ + L

|ui k| + |ˆzi k u f i(0, ˆzi k)|. (54)

Distributed Event-Based Learning via ADMM

By a similar argument based on Taylor s theorem, we can bound the last term in the previous equation by

|ˆzi k u f i(0, ˆzi k)| |xi | + L ρ + L|ˆzi k|.

Unrolling the recursion in (54) and exploiting the fact that ui 0 = 0 yields

|ui k| ρ + L

ρ |xi | + sup k 0 |ˆzi k|,

where the last term is bounded by R. Finally, due to the 1/ρ-smoothness of f i( , z), we conclude |xi k+1| |ui k|, which yields the desired result as follows,

|χdi| = |αxi k+1 + ui k| α|xi k+1| + |ui k| (α + 1)|ui k| (α + 1) ρ + L

ρ |xi | + sup k 0 |ˆzi k|

(α + 1) ρ + L

ρ |xi | + R .

Distributed Event-Based Learning via ADMM

F. Diminishing Communication Threshold

In the main text, we focused our presentation on fixed communication thresholds. However, it is important to note that our approach and our analysis can be easily extended to the case where communication thresholds are varied as a function of the number of iterations. For example, it is straightforward to show that for any vanishing sequence k, our iterates indeed converge to the minimizer of (3).

Corollary F.1. Let the assumptions of Thm. D.5 be satisfied and let k 0 be such that k 0 for k . Then limk |ξk ξ |2 0.

Proof. According to Thm. D.5, the following holds,

Vk+1 τ 2Vk + σ(Q) 2 k,

see (52). We now apply Lemma 3 in Sec. 2.2 in (Polyak, 1987), which yields the desired result.

We can also derive an explicit convergence rate. In fact, the following corollary proves that if 2 k is the form q/(k + 1)t, where q > 0 and t > 0 are constants and k is the iteration number, |ξk ξ |2 converges at a rate of O(1/kt).

Corollary F.2. Let the assumptions of Thm. D.5 be satisfied and let 2 k q (k+1)t , k 0, t > 0. Then, the following holds for all k 0:

where k0 = 1 2 1+τ2 t 1 and c0 = max n 2 σ(Q)q

1 τ 2 , σ(P)|ξ0 ξ |2o .

Proof. According to (52), the following holds,

Vk+1 τ 2Vk + σ(Q) q (k + 1)t .

We make the following claim:

We prove the claim by induction. The claim holds for k = 0 due to the fact that c0 σ(P)|ξ0 ξ |2. We therefore assume that the claim holds for k and show that this implies that the claim holds for k + 1. This yields

Vk+1 τ 2Vk + σ(Q) q (k + 1)t

t + σ(Q) q (k + 1)t

k0 k + k0 + 1

τ 2 k + k0 + 1

k0 k + k0 + 1

k0 k + k0 + 1

and completes the induction argument.

Distributed Event-Based Learning via ADMM

G. Additional Experiments and Hyperparameters

We ran various experiments in order to assess the performance of the event-based distributed learning algorithm (Alg. 1). In the comparative studies, we choose Fed Avg (Mc Mahan et al., 2017), Fed Prox (Li et al., 2020a), SCAFFOLD (Karimireddy et al., 2020) and Fed ADMM (Zhou & Li, 2023) as baselines, since these methods have been developed to address challenges such as data heterogeneity and communication efficiency. For a fair comparison in terms of computation resources in all setups, each of the agents are run for the same number of local gradient steps.

We include the first example in Sec. 5, which showcases how two image classifiers (for MNIST and CIFAR-10 datasets) can be trained in a distributed and communication efficient way. Our setup included N =10 agents for MNIST, each storing data for a single digit, resulting in the most extreme non-i.i.d. distribution of data among agents. For a CIFAR-10 classifier, the data are distributed among N =100 agents according to a Dirichlet distribution, i.e., we sample pa Dir N(β), where N is the number of agents and β =0.5. We then assign a pa,j proportion of the training data of class a to agent j.

We applied our implementation to train a fully connected neural network on the MNIST dataset and a convolutional network on the CIFAR-10 dataset. The classifier model has 4 convolutional layers, each with 3 3 kernels and 32, 64, 128, and 256 filters, respectively, followed by three fully connected layers with Re LU activation functions. After each set of convolutions, a 2 2 max pooling layer is applied, followed by a Re LU activation. We train the MNIST classifier model using Alg. 1, where we replace the full minimization step of each local objective with five steps of stochastic gradient descent with a learning rate of lr = 10 1, and the CIFAR-10 model with 3 epochs of stochastic gradient descent (batch size 20, learning rate lr = 10 3). Further hyperparameters are listed in Tabs. 3 and 4.

Tab. 1 in Sec. 5 summarizes the main result of this paper, by comparing the performance of different methods. From this table, it is clear that Alg. 1 achieves the same test accuracy with less communication cost. The communication configurations for Tab. 1 are summarized in Tab. 2.

Fig. 8 illustrates the trade-off between accuracy and communication load. The results demonstrate that our event-based approach consistently achieves higher accuracy with fewer communication events compared to baselines. Each point in Fig. 8 represents a different value of , where monotonically increases along the curve, demonstrating that with our algorithm and a well-chosen threshold, communication among agents can be reduced while still achieving a high classification accuracy. Our experimental results indicate that the approach can reduce communication costs by over 30% without significant accuracy degradation. Notably, SCAFFOLD doubles the communication cost due to its dual-package communication protocol. These findings directly translate to the results in Tab. 1 showing total communication events for target accuracies. The extensive experimentation across both small-scale (MNIST) and large-scale (CIFAR-10) scenarios demonstrates the scalability and effectiveness of our event-based communication strategy, particularly for large-scale distributed learning problems.

The next sections provide additional numerical experiments. We first show an example based on LASSO where the local objectives are strongly convex (Sec. G.1 and G.2). In this setup, our theoretical results apply. Sec. G.3 shows how our algorithm can train an MNIST classifier, when only local communications are allowed, as specified by a given communication graph. In such a setup, the baselines Fed Avg, Fed Prox, SCAFFOLD and Fed ADMM are not applicable.

Algorithm MNIST Target Accuracy CIFAR-10 Target Accuracy

80% 85% 90% 70% 75% 77% 78%

Alg. 1-randomized (ptrig, d) (0.1, 5) (0.1, 4) (0.1, 1) (0.2, 4.5) (0.1, 3.75) (0.2, 3.5) (0.7, 3.75) Alg. 1-Vanilla ( d) (3) (2) (1) (4.25) (3.25) (3.25) (1.75) Fed ADMM (part rate) 0.4 0.6 1.0 0.4 0.5 0.7 0.9 Fed Avg (part rate) 0.4 1.0 - 0.1 - - - Fed Prox (part rate) 0.5 1.0 - 0.2 - - - SCAFFOLD (part rate 2) 0.4 2 0.5 2 0.8 2 0.2 2 - - -

Table 2: Communication configurations across algorithms. Values represent the probability of communication for baseline methods. SCAFFOLD values are doubled due to double package transmission per round.

Distributed Event-Based Learning via ADMM

70 75 80 85 90 0

Classification Accuracy (%)

Total Comm. Load (%)

65 70 75 80 0

Classification Accuracy (%)

Total Comm. Load (%)

Alg. 1-Vanilla Alg. 1-Rand Fed ADMM Fed Prox Fed Avg SCAFFOLD

Figure 8: The figure compares different federated learning methods on the MNIST and CIFAR-10 datasets with respect to the resulting trade-off between total communication load and classification accuracy on the test set. In the MNIST case (left), randomization includes agent-to-server communication with 0.1 probability. For CIFAR-10 (right), randomization incorporates server-to-agent communication with 0.2 probability. Points along each curve represent different thresholds, demonstrating the relationship between communication reduction and model accuracy.

Table 3: The table summarizes the hyperparameters used for distributed training of MNIST classifier (Fig. 8, Tab. 1)

Hyperparameter Value

number of agents (N) 10 size of neural network layers [400, 200, 10] learning rate (gradient descent step-size) 0.1 number of iterations 100 d = , z = 0.1 range between [0, 10] µ (Fed Prox) 0.1 augmented lagrangian parameter (ρ) (Fed ADMM, Alg. 1) 1 ng (SCAFFOLD) 1

Table 4: The table summarizes the hyperparameters used for the distributed training of CIFAR-10 classifier (Fig. 8, Tab. 1)

Hyperparameter Value

number of agents (N) 100 augmented lagrangian parameter (ρ) (Fed ADMM, Alg. 1) 0.01 learning rate 0.01 momentum 0.9 number of iterations 150 number of local epochs 3 batch size 20 d = , z = 0.01 range between [0, 4] µ (Fed Prox) 0.1 ng (SCAFFOLD) 1

G.1. Linear Regression and LASSO with Non-i.i.d. Data

We conduct numerical experiments based on the following distributed learning problem:

min x Rn,z Rn

1 2|Aixi bi|2 + λ|z|1,

subject to xi z = 0, i = 1, . . . , N,

where Ai Rm n, bi Rm. In the data generation process, we generate samples from three different distributions: a standard normal distribution, a Student s t distribution with one degree of freedom, and a uniform distribution in the range [ 5, 5]. These samples are concatenated to form a single dataset, which is then partitioned into subsets for each agent i to

Distributed Event-Based Learning via ADMM

obtain (Ai, bi). Finally, we normalize the feature vectors and target values for each agent to prepare the data for the learning problem. In this non-i.i.d. setting, local optimal points of individual agents xi are far away from each other, and their average PN i=1 xi /N is also far away from the global optima x . The experiments were run for Tmax = 50 steps, which are required for Alg. 1 to converge to the global optimal point with high accuracy. Fig. 9 illustrates the communication load against the absolute difference between the objective function value f and the optimal value f , where the communication load is defined as the number of communications accumulated over time.

10 10 10 8 10 6 10 4 10 2 100 0

Comm. load (%)

Alg. 1, α = 1.5 Fed ADMM Fed Prox Fed Avg SCAFFOLD

10 10 10 8 10 6 10 4 10 2 100 0

Comm. load (%)

Figure 9: The figure shows the communication load versus accuracy trade-off for the different methods applied to two distinct problems derived from (55): linear regression (λ = 0, left panel), and on the right, LASSO (λ = 0.1, right panel).

In the first scenario, we set λ = 0 to obtain a linear regression problem, where the proposed algorithm with relaxation parameter α = 1.5 clearly outperforms baseline methods by a large margin. We note that the gap between the global and local optimal points prevents Fed Avg and Fed Prox from converging to the optimal point f .

For the second case, we set λ = 0.1 to solve the LASSO problem. By assumption, Fed Avg, SCAFFOLD and Fed ADMM require the local objective functions to be smooth. However, we allow handling nonsmooth local objective functions, which is relevant to the distributed learning problems with ℓ1 regularization. To avoid a noncontinuous gradient for the local minimization for SCAFFOLD, Fed ADMM, Fed Avg and Fed Prox, the local update step is carried out by the following local gradient,

xi f i(xi) = Ai (Aixi bi) + λ

( sgn(xi) |xi| > δ 1 δ xi |xi| δ , (56)

where δ can be chosen as small as 1e 16 (double precision machine epsilon). However, we found that the results are largely unaffected by the choice of δ.

Table 5: The table summarizes the hyperparameters used for the distributed linear regression and LASSO experiments (Fig. 9).

Hyperparameter Value

number of agents (N) 50 augmented lagrangian parameter (ρ) 1 gradient descent step-size 1 number of iterations 50 d = z = range between [0, 10 2]

G.2. Effect of Communication Drops

To observe the effect of communication drops, we repeated the same LASSO experiment in (55) with hyperparameters in Tab. 6, but this time, we allow the transmission of information from the agents to the server to fail with a probability of 0.3. As seen in the second panel of Fig. 10, if we have no reset, i.e., T = , the algorithm cannot converge and a significant error remains. On the left panel, the trade-off between communication load and suboptimality is presented. More frequent reset operations lead to a faster convergence and less error, in exchange for additional communication cost that comes with the reset.

Distributed Event-Based Learning via ADMM

10 5 10 3 10 1 0

0 25 50 24.6

Objective Value

T = 1 T = 5 T = 10 T =

Figure 10: The left panel presents the trajectory of communication load versus suboptimality of the objective function value. The panel in the center shows the evolution of the objective function for different values of the reset period for a drop rate of 0.3 and for = 10 3, whereas the right panel shows the cumulative communication load over time in addition to the reset communication at each T step.

Table 6: The table summarizes the hyperparameters used for the distributed LASSO experiment against communication drops (Fig. 10).

Hyperparameter Value

number of agents (N) 50 L1 regularization parameter (λ) 0.1 augmented lagrangian parameter (ρ) 1 relaxation parameter (α) 1 gradient descent step-size 1 number of iterations 50 d = z = 10 3

G.3. Distributed Training on a Graph

Our distributed learning algorithm, Alg. 2, is general enough to train a machine learning classifier over a network of agents; the network structure can be encoded by a proper selection of the linear constraint matrices A and B (see App. A for further details). Our framework therefore generalizes well beyond server-client structures, and our theoretical analysis also captures the influence of the network structure on the resulting convergence rate.

In order to highlight the versatility, we train an MNIST Classifier over a network of agents. We use a multi-layer perceptron that has the same structure as in Sec. 5 and consider a situation where each agent has only access to the training data of a single digit. Fig. 11 shows the resulting communication load and classification accuracy trade-off on the entire dataset (left), whereas the diagram on the right shows the network structure (only communication along the edges of the graph is allowed). The error bars indicate the range (minimum and maximum) of the classification accuracy among the different agents.

The results shown in Fig. 11 and highlight that a purely random selection of agents (suggested in (Yu & Freris, 2023) ) results in a worse trade-off curve, which further motivates our event-based strategy. We also apply our algorithm to a much larger distributed learning problem with 50 agents and where the corresponding accuracy versus communication trade-off is shown in Fig. 12, together with the agent network that has been used.

Table 8: The table summarizes the hyperparameters used for the distributed linear regression experiment over a graph (Fig. 12).

Hyperparameter Value

number of agents (N) 50 augmented lagrangian parameter (ρ) 10 5

number of iterations 17 103

x range between [0, 1]

Distributed Event-Based Learning via ADMM

65 70 75 80 85 0

Classification Accuracy (%)

Comm. Load (%)

Vanilla Randomized - ptrig = 0.1 Random selection

Figure 11: The figure shows a comparison of the vanilla event-based and the randomized event-based communication strategy (see Sec. 2) with a purely random selection of agents. The outcome of a purely random strategy is consistently worse with respect to the resulting trade-off between communication load and classification accuracy. The right panel visualizes the agent network with ten agents connected with 70 edges.

Table 7: The table summarizes the hyperparameters used for the distributed training of MNIST classifier over a graph (Fig. 11).

Hyperparameter Value

number of agents (N) 10 size of neural network layers [400, 200, 10] learning rate (gradient descent step-size) 5 10 3

augmented lagrangian parameter (ρ) 5 10 3

number of iterations 103

number of gradient steps per iteration 5 x range between [0.0, 0.2]

10 4 10 3 10 2 0

Comm. load (%)

Vanilla Random selection

Figure 12: The first panel shows the comparison of the communication load versus solution accuracy for different communication methods applied to the linear regression problem derived from (55) (λ = 0). The right panel visualizes the agent network with 50 agents connected with 1762 edges.