# Distributed Online Convex Optimization with Compressed Communication

Zhipeng Tu¹,², Xi Wang¹,², Yiguang Hong³, Lei Wang⁴, Deming Yuan⁵, Guodong Shi²

¹Academy of Mathematics and Systems Science, Chinese Academy of Sciences, China
²Australian Center for Field Robotics, School of AMME, The University of Sydney, Australia
³Department of Control Science and Engineering, Tongji University, China
⁴College of Control Science and Engineering, Zhejiang University, China
⁵School of Automation, Nanjing University of Science and Technology, China

tuzhipeng@amss.ac.cn, wangxi14@mails.ucas.ac.cn, yghong@iss.ac.cn, lei.wangzju@zju.edu.cn, dmyuan1012@gmail.com, guodong.shi@sydney.edu.au

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

We consider a distributed online convex optimization problem in which streaming data are distributed among computing agents over a connected communication network. Since the data are high-dimensional or the network is large-scale, communication load can be a bottleneck for the efficiency of distributed algorithms. To tackle this bottleneck, we apply a state-of-the-art data compression scheme to the fundamental GD-based distributed online algorithms. Three algorithms with difference-compressed communication are proposed for full information feedback (DC-DOGD), one-point bandit feedback (DC-DOBD), and two-point bandit feedback (DC-DO2BD), respectively. We obtain regret bounds explicitly in terms of the time horizon, compression ratio, decision dimension, agent number, and network parameters. Our algorithms are proved to be no-regret and match the regret bounds, w.r.t. the time horizon, of their uncompressed counterparts for both convex and strongly convex losses. Numerical experiments validate the theoretical findings and illustrate that the proposed algorithms can effectively reduce the total transmitted bits for distributed online training compared with the uncompressed baseline.

1 Introduction

Online optimization has attracted considerable attention in recent decades for its remarkable applications in machine learning tasks such as spam filtering, dictionary learning, and advertising selection [1, 2, 3]. In such online tasks, data are revealed incrementally, and decisions must be made before all data are available. When the streaming data are collected at multiple agents, distributed online optimization over a multi-agent network is considered, where data storage and processing are performed at the agents [4, 5]. It is often impractical to communicate raw data among agents, for reasons such as privacy and bandwidth utilization. Also, there is no central agent for global coordination. In such settings, each agent relies on its own data to run an algorithm while communicating decisions with its immediate neighbors.

To be specific, this paper considers the distributed online convex optimization (DOCO) problem over an $N$-agent network. The objective is to minimize the accumulated system-wide loss $\min_{x \in K} \sum_{t=1}^{T} \sum_{i=1}^{N} f_i^t(x)$, where the local convex loss function $f_i^t$ is formed by the data arriving at time $t$ at agent $i$, and $K \subseteq \mathbb{R}^d$ is a convex feasible set. Note that the loss information is revealed to agent $i$ after its decision $x_i^t$ is made. Generally, there are two basic types of information feedback that agents can possess. One is full information feedback, where agents have access to the loss functions.
The other is bandit feedback, where agents can only observe the values of the loss function at points around the decision. At each time step, agents choose their decisions based on their local feedback and their neighbors' information. To measure the performance of an algorithm, the (static) regret is frequently used, which compares the cumulative loss of the online decisions with the loss of the best decision chosen in hindsight over the whole time horizon. The regret of node $j \in V$ is defined as

$$R(j, T) = \sum_{t=1}^{T} \sum_{i=1}^{N} f_i^t(x_j^t) - \min_{x \in K} \sum_{t=1}^{T} \sum_{i=1}^{N} f_i^t(x). \tag{1}$$

An algorithm is called no-regret [6] if the average regret over $T$ goes to zero, which means that the online decision updated from the streaming data is not far from the best decision chosen in hindsight. Distributed no-regret online algorithms have been widely studied in recent years [4, 7, 8, 9, 10, 11].

Although distributed algorithms are theoretically feasible, most of them are not practical as the model size gets large, since communication cost can be a bottleneck for efficiency. In distributed training tasks, agents can be powerful microcomputers, while their communication network may have low bandwidth. The information exchange over the network is slow compared with the computation taking place at the agents [12]. Thus, communication compression techniques are of significance for practical implementations. There have been many attempts to combine distributed optimization algorithms with compressors. A straightforward idea is the direct compression scheme, but algorithms with this simple scheme fail to converge even for the distributed average consensus problem [13, 14]. As an improvement, the extrapolation compression scheme and the difference compression scheme were proposed [15]. Along this line, quite a number of studies successfully extend distributed optimization algorithms with compressors while maintaining the convergence rate [16, 17, 18, 19]. However, distributed online optimization with compression is still an area that has not been fully explored. [20] proposed the ECD-AMSGrad algorithm, which extends AMSGrad to the distributed online setting with extrapolation compression, but only empirical results were given without theoretical analysis. The key open problem in this area is whether it is possible to design provably no-regret distributed online algorithms that work with compressors.

Contributions In this work, we answer the above question in the affirmative. We apply the difference compression scheme to the fundamental GD-based distributed online algorithms. Although the idea of such a combination is simple, the underlying algorithm design and theoretical analysis are challenging since the compression error, projection error, and consensus error are coupled. Our contributions are summarized as follows:

- We propose communication-efficient distributed online algorithms, which consist of difference compression, γ-gossip consensus, gradient descent, and projection, for the cases of full information feedback (DC-DOGD), one-point bandit feedback (DC-DOBD), and two-point bandit feedback (DC-DO2BD), respectively. We make a technical advance in combining the difference compression scheme with the projection scheme. Through proper design, the errors can be estimated and controlled with the consensus stepsize γ and the gradient descent stepsizes.
- We analyze the regret bounds of the proposed algorithms for convex and strongly convex losses, respectively, which are established explicitly in terms of the time horizon $T$, compression ratio $\omega$, decision dimension $d$, agent number $N$, and parameters of the communication graph $G$, as simplified and summarized in Table 1. The obtained regret bounds are in accordance with those of [11] w.r.t. $T$, $N$, $d$. Our algorithms are no-regret with theoretical guarantees.

- We give exhaustive experiments to illustrate the performance of the proposed algorithms. Compared with the uncompressed algorithm DAOL [8], the proposed algorithms reduce the total transmitted bits for distributed online training. Moreover, DC-DOGD and DC-DO2BD significantly outperform the algorithm ECD-AMSGrad [20].

Table 1: Regret bounds in different settings

| Setting | Convex losses | Strongly convex losses |
|---|---|---|
| Full information | $O\big((\omega^{-2}N^{1/2}+\omega^{-4})N\sqrt{T}\big)$ | $O\big((\omega^{-2}N^{1/2}+\omega^{-4})N\ln T\big)$ |
| One-point bandit | $O\big((\omega^{-2}N^{1/2}+\omega^{-4})^{1/2}Nd^{1/2}T^{3/4}\big)$ | $O\big((\omega^{-2}N^{1/2}+\omega^{-4})^{1/3}Nd^{2/3}T^{2/3}\ln^{1/3} T\big)$ |
| Two-point bandit | $O\big((\omega^{-2}N^{1/2}+\omega^{-4})Nd\sqrt{T}\big)$ | $O\big((\omega^{-2}N^{1/2}+\omega^{-4})Nd^{2}\ln T\big)$ |

Related Work Distributed online convex optimization has received considerable attention in recent years. Many basic algorithms have been extended to distributed settings; see [7] and references therein. For instance, [8] proposed a distributed online subgradient algorithm over a static directed network and achieved regrets of $O(\sqrt{T})$ and $O(\ln T)$ for convex and strongly convex losses, in line with those of centralized online algorithms [21, 22]. In the bandit feedback setting, by replacing the gradient with a randomized estimator, [9] proposed a distributed online bandit algorithm with one-point sampling over an undirected network and achieved an $O(T^{2/3}\ln^{1/3} T)$ regret for strongly convex losses. Better bounds can be achieved by two-point sampling: [4] obtained $O(\sqrt{T})$ regret for convex quadratic losses in the case of a bounded decision set. Also, [10] investigated the distributed online two-point bandit algorithm in dynamic environments and achieved $O(\sqrt{T})$ and $O(\ln T)$ regrets for convex and strongly convex losses, matching those of centralized two-point bandit algorithms [23]. [11] comprehensively studied DOCO over Erdős-Rényi random networks with full gradient feedback, one-point bandit feedback, and two-point bandit feedback, and gave regret bounds. Along the line of [11], this paper aims to further introduce compressed communication strategies while preserving the regret bounds order-wise.

Recently, combining distributed optimization algorithms with compressors has seen a dramatic rise in interest. Traditional compressors include quantization [24, 25, 26], sparsification [27, 28, 29], and hybrids of the two [30, 31, 32]. The way a compressor is applied is called a compression scheme. The most widely used compression schemes in distributed optimization are extrapolation compression and difference compression [15]. Extrapolation compression lets agents compress the extrapolation between the last two local states. Decentralized PSGD with extrapolation compression (ECD-PSGD) [15] was proved to converge sublinearly and match the rate of its uncompressed counterpart (D-PSGD). Difference compression (DC), also called CHOCO [16] or innovation compression [19], lets agents keep replicas of neighboring states and compress the state difference.
There have been extensive successful designs combining distributed optimization algorithms with DC, to name a few: DCD-PSGD [15] (based on PSGD), CHOCO-SGD [16] (based on gossip SGD), SPARQ-SGD [17] (based on event triggering), C-GT [18] (based on gradient tracking), and COLD [19] (based on NIDS). The difference compression scheme differs from the error-feedback scheme (EF) [24, 33] in what is compressed, which results in different application fields. DC compresses the difference between the current variable and the replica variable, and is widely used in distributed optimization, where nodes tend to exchange state information whose limit is nonzero in general. EF compresses the sum of the gradient and the residual error, and is widely used in federated learning, where nodes tend to exchange gradient information whose limit is expected to be zero. The insight is that successful designs have to compress something that goes to zero; otherwise the noise introduced by the compression does not vanish, which leads to oscillation or even divergence of the algorithm. It is worth noting that the idea of DC can also be applied to federated learning, as in EF21 [34] and FedPAQ [35], which is another line of research. In this paper, we focus on applying DC to distributed online optimization.

Results on distributed online optimization with compression are quite limited. [20] proposed the ECD-AMSGrad algorithm, which combines AMSGrad with extrapolation compression. Actually, AMSGrad may not be a good choice for the DOCO problem: although AMSGrad itself is proved no-regret [36], a considerable performance gap still exists between AMSGrad and SGD [37]. Besides, the introduction of compression errors further degrades the algorithm, so that ECD-AMSGrad loses the no-regret property (see Section 5). This paper focuses on the fundamental GD-based algorithms and gives no-regret guarantees.

2 Full Information Feedback

In this section, we first introduce the multi-agent network and the compressor we use, and then propose a communication-efficient distributed online algorithm for DOCO with full information feedback. Expected regret bounds are given for both convex and strongly convex losses.

Graph The multi-agent network is described by an undirected graph $G(V, E)$, where $V = \{1, \ldots, N\}$ is the set of nodes, representing the agents, and $E \subseteq V \times V$ is the set of edges. Let $A = [a_{ij}] \in \mathbb{R}^{N \times N}$ be the connectivity matrix of $G$ with $a_{ij} = a_{ji}$. If $(v_i, v_j) \in E$, then $v_i$ and $v_j$ can exchange information and $a_{ij} > 0$; otherwise $a_{ij} = 0$. It is worth noting that communication is node-to-node in our distributed setting, and there is no central node. The graph in this paper satisfies the following assumption.

Assumption 1. The communication graph $G$ is undirected and connected. Its connectivity matrix $A \in [0, 1]^{N \times N}$ is a symmetric doubly stochastic matrix.

Compressor A compressor $Q(\cdot): \mathbb{R}^d \to \mathbb{R}^d$ is a mapping whose output can usually be encoded with fewer bits than its input. In this paper, we consider a broad class of compressors with the following general property, which has been widely considered in distributed optimization with compression [16, 17, 19].

Assumption 2. For some $\omega \in (0, 1]$, $Q$ satisfies

$$\mathbb{E}_Q \|Q(x) - x\|^2 \le (1 - \omega)\|x\|^2, \quad \forall x \in \mathbb{R}^d, \tag{2}$$

where $\mathbb{E}_Q$ denotes the expectation over the internal randomness of $Q$.
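For concreteness, here is a minimal NumPy sketch (our illustration, not the paper's code) of the Top_k and Rand_k sparsifiers together with an empirical check of the contraction property (2); both satisfy Assumption 2 with $\omega = k/d$.

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude coordinates; satisfies (2) with omega = k/d."""
    q = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]      # indices of the k largest |x_i|
    q[idx] = x[idx]
    return q

def rand_k(x, k, rng):
    """Keep k uniformly random coordinates; E_Q||Q(x)-x||^2 = (1 - k/d)||x||^2."""
    q = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    q[idx] = x[idx]
    return q

# Empirical check of (2): the averaged error should not exceed (1 - omega)||x||^2.
rng = np.random.default_rng(0)
d, k = 21, 2
x = rng.standard_normal(d)
err = np.mean([np.linalg.norm(rand_k(x, k, rng) - x) ** 2 for _ in range(20000)])
print(err, (1 - k / d) * np.linalg.norm(x) ** 2)  # the two values nearly coincide
```

Note that Top_k satisfies (2) deterministically, whereas Rand_k attains it with equality in expectation.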
Compressors satisfying the above assumption are called ω-contracted, and they include many important biased or unbiased compressors, such as sparsification (Rand_k and Top_k) [28], random quantization (QSGD_s) [26], and random gossip (RGossip_p) [16].

2.1 Algorithm design

In the full information feedback setting, the loss function $f_i^t$ is revealed to node $i$ at time $t$ after the decision $x_i^t$ is made. Node $i$ then has access to the gradient value $\nabla f_i^t(x_i^t)$ and can use $g_i^t = \nabla f_i^t(x_i^t)$ to make the next decision $x_i^{t+1}$. We propose the DC-DOGD algorithm shown in Algorithm 1, which is based on DAOL [8] and memory-efficient CHOCO-SGD [16]. DC-DOGD consists of two main parts: difference-compressed communication (steps 2 and 3) and local decision update (steps 4 and 5).

Algorithm 1: Distributed Online Gradient Descent with Difference Compression (DC-DOGD)
Input: consensus stepsize $\gamma$, GD stepsizes $\{\eta_t\}_{t=1}^T$, time horizon $T$
Initialize: $x_i^1 = 0$, $\hat{x}_i^1 = 0$, $s_i^1 = 0$ for each node $i \in V$
1: for $t = 1$ to $T-1$ do in parallel for each node $i \in V$
2: Compress the difference vector $q_i^t = Q(x_i^t - \hat{x}_i^t)$ and update the replica $\hat{x}_i^{t+1} = \hat{x}_i^t + q_i^t$.
3: Send $q_i^t$ to its neighbors and receive $q_j^t$ from all its neighbors $j \in N_i$. Update the estimate of the consensus decision by $s_i^{t+1} = s_i^t + \sum_{j \in N_i} a_{ij} q_j^t$.
4: Receive the full information feedback and calculate $g_i^t = \nabla f_i^t(x_i^t)$.
5: Update the decision variable as

$$x_i^{t+1} = P_K\big(x_i^t + \gamma(s_i^{t+1} - \hat{x}_i^{t+1}) - \eta_t g_i^t\big), \tag{3}$$

where $P_K$ denotes the Euclidean projection, i.e., $P_K(x) = \arg\min_{y \in K} \|x - y\|$.
Output: $\{x_i^t\}_{t=1}^T$

The insight behind introducing the variable $\hat{x}_i^t$ and the difference compression is as follows. Suppose that the optimal decision $x^\ast \neq 0$, which is the case in general. Then, if node $i$ transmits the directly compressed information $Q(x_i^t)$ to its neighbors, the compression error $Q(x_i^t) - x_i^t$ will not vanish as $t \to \infty$. The accumulation of compression errors makes the algorithm fail to converge. Instead, we compress something that goes to zero. Let node $i$ and all its neighbors keep an auxiliary variable $\hat{x}_i^t$ locally, which acts as a replica of $x_i^t$. Whenever node $i$ updates its decision variable $x_i^t$, node $i$ calculates the difference $x_i^t - \hat{x}_i^t$, compresses it as $q_i^t = Q(x_i^t - \hat{x}_i^t)$, and sends the compressed information $q_i^t$ to its neighbors. After that, node $i$ and all its neighbors update the local replica $\hat{x}_i^{t+1} = \hat{x}_i^t + q_i^t$. When all nodes approach a consensus optimal decision, the updates of the local decisions are small, and the differences between the replica variables and the true decision variables are also small. The compression errors are then expected to vanish.

The local decision variables are updated through gradient descent, γ-gossip, and projection, in order to minimize the local loss function, keep consensus with neighbors, and remain in the feasible set, respectively. For each node $i$, $s_i^1$ is initialized to $0$, and thus $s_i^t = \sum_{j=1}^{N} a_{ij}\hat{x}_j^t$. Recalling that $\hat{x}_i^t$ tracks $x_i^t$, $s_i^t$ acts as node $i$'s estimate of the consensus decision at time $t$. The γ-gossip protocol is adopted to move the decision variable toward the consensus decision. The consensus stepsize $\gamma \in (0, 1]$ is tunable to control the consensus speed, and it also plays a crucial role in controlling the compression error. If there is no compression, i.e., with exact communication, then $\hat{x}_i^{t+1}$ reduces to $x_i^t$, and $s_i^{t+1}$ becomes $\sum_{j=1}^{N} a_{ij} x_j^t$.
Besides, if we take $\gamma = 1$, then (3) reduces to

$$x_i^{t+1} = P_K\Big(x_i^t + \gamma \sum_{j \in N_i} a_{ij}(x_j^t - x_i^t) - \eta_t \nabla f_i^t(x_i^t)\Big) = P_K\Big(\sum_{j \in N_i \cup \{i\}} a_{ij} x_j^t - \eta_t \nabla f_i^t(x_i^t)\Big),$$

which is the DAOL algorithm in [8].

Remark 1. $s_i^t$ is introduced for memory efficiency. Eq. (3) is equivalent to the update rule

$$x_i^{t+1} = P_K\Big(x_i^t + \gamma \sum_{j \in N_i} a_{ij}(\hat{x}_j^{t+1} - \hat{x}_i^{t+1}) - \eta_t g_i^t\Big). \tag{4}$$

If we adopted the update rule (4) together with $\hat{x}_j^{t+1} = \hat{x}_j^t + q_j^t$ for $j \in N_i \cup \{i\}$ instead of steps 3 and 5, then each node would have to store $\deg(i) + 2$ vectors, namely $x_i$, $\hat{x}_i$, and $\hat{x}_j$, $j \in N_i$, which is memory-consuming.

Remark 2. Our algorithm can potentially be extended from a fixed graph to a time-varying graph $G_t = (V, E_t)$. We outline two settings.

Random switching undirected networks. Our algorithm can be directly applied to the classic time-varying Erdős-Rényi random networks [38], where $G_t$ is generated over the prescribed graph $G$ and $\{i, j\} \in E_t$ with probability $0 < p < 1$ for all $\{i, j\} \in E$. This setting is equivalent to applying a random gossip compressor over the fixed graph $G$, i.e., transmitting information with probability $p$, which satisfies Assumption 2 with $\omega = p$.

Deterministic switching undirected networks. Our algorithm can be applied to this setting by changing the connectivity matrix from $A$ to $A(t)$. With further assumptions on the switching networks, such as the Bounded Intercommunication Interval assumption [39], one can analyze the regret interval-wise. This is interesting but outside the focus of this paper, and we leave it for future study.

2.2 Regret bounds

We consider the following assumptions, which are widely used in studies of distributed online optimization [1, 11, 40].

Assumption 3. The convex set $K$ is bounded with diameter $D$, i.e., $\|x - y\| \le D$, $\forall x, y \in K$.

Assumption 4. For each $i \in V$ and $t = 1, 2, \ldots, T$, the loss function $f_i^t$ is convex with bounded gradients over $K$, i.e., $\max_{i,t,x} \|\nabla f_i^t(x)\| \le G$.

Assumption 5. For each $i \in V$ and $t = 1, 2, \ldots, T$, the loss function $f_i^t$ is $\mu$-strongly convex over $K$ with parameter $\mu > 0$, i.e., $f_i^t(x) - f_i^t(y) \ge \langle x - y, \nabla f_i^t(y)\rangle + \frac{\mu}{2}\|x - y\|^2$, $\forall x, y \in K$.

Suppose that the eigenvalues of the symmetric doubly stochastic connectivity matrix $A$ satisfy $1 = |\lambda_1(A)| > |\lambda_2(A)| \ge \cdots \ge |\lambda_N(A)|$. Define the spectral gap $\delta := 1 - |\lambda_2(A)| \in (0, 1]$ and the spectral radius of the Laplacian matrix $\beta := \|I_N - A\|_2 \in [0, 2]$. We then give the expected regret bounds of Algorithm 1 for convex and strongly convex losses, respectively.

Theorem 1. Let Assumptions 1, 2, 3 and 4 hold. Consider Algorithm 1 with the consensus stepsize

$$\gamma = \frac{3\delta^3\omega^2(\omega+1)}{48(\delta^2 + 18\delta\beta^2 + 36\beta^2)\beta^2(\omega+2)(1-\omega) + 4\delta^2(\beta^2+\beta)(\omega+2)(1-\omega)\omega + 6\delta^3\omega}. \tag{5}$$

(i) (Convex case) Take the gradient descent stepsize $\eta_t = \frac{D}{G\sqrt{t+c}}$ for a constant $c \ge \frac{8}{3\gamma\delta}$. Then for each $j \in V$ and $T \ge 1$,

$$\mathbb{E}_Q[R(j,T)] \le \big(1 + (3\gamma^{-1}\delta^{-1} + 1)(1 + \gamma^{-1}\delta^{-1} + \omega^{-1})\big) NGD\sqrt{T+c}. \tag{6}$$

(ii) (Strongly convex case) Under the additional Assumption 5, take the gradient descent stepsize $\eta_t = \frac{1}{\mu(t+c)}$ for a constant $c \ge \frac{16}{3\gamma\delta}$. Then for each $j \in V$ and $T \ge 1$,

$$\mathbb{E}_Q[R(j,T)] \le \mu c D^2 + 4(3\gamma^{-1}\delta^{-1} + 1)(1 + \gamma^{-1}\delta^{-1} + \omega^{-1}) NG^2\mu^{-1}\ln(T+c). \tag{7}$$

The proof ideas are as follows. First, we derive general regret bounds for each node, which depend on the consensus error, the projection error, the compression error, and the gradient descent stepsize. The key step is then to analyze the coupled relationship between these errors and bound them via the consensus stepsize $\gamma$ and the GD stepsize $\eta_t$. Finally, we choose proper $\gamma$ and $\eta_t$ to obtain Theorem 1. Complete proofs are given in Appendix B.
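To make steps 2-5 of Algorithm 1 concrete, here is a minimal NumPy sketch of one synchronized DC-DOGD round, under our own simplifying assumptions (all nodes update in lockstep, the feasible set $K$ is a Euclidean ball, and the self-weight $a_{ii}$ is included in the gossip sum so that $s_i$ tracks $\sum_j a_{ij}\hat{x}_j$); it is an illustration, not the authors' released code.

```python
import numpy as np

def project_ball(x, radius):
    """Euclidean projection P_K onto the ball K = {x : ||x|| <= radius}."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def dc_dogd_round(x, x_hat, s, A, grads, gamma, eta, Q, radius):
    """One round of Algorithm 1. x, x_hat, s, grads: (N, d) arrays; A: (N, N)."""
    N = x.shape[0]
    q = np.stack([Q(x[i] - x_hat[i]) for i in range(N)])  # step 2: compress differences
    x_hat = x_hat + q                                     # step 2: update local replicas
    s = s + A @ q                      # step 3: gossip of compressed info (incl. a_ii)
    for i in range(N):
        # step 5: gamma-gossip toward the consensus estimate, GD step, projection (3)
        x[i] = project_ball(x[i] + gamma * (s[i] - x_hat[i]) - eta * grads[i], radius)
    return x, x_hat, s
```

With exact communication ($Q$ the identity) and $\gamma = 1$, one round reduces to the DAOL update, as noted above.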
The consensus stepsize $\gamma$ chosen in (5) depends on the compression ratio $\omega$ and the communication graph parameters $\delta$ and $\beta$. Notice that $\gamma$ is an increasing function of $\omega$, with $\gamma|_{\omega=0} = 0$ and $\gamma|_{\omega=1} = 1$; thus $\gamma \in (0, 1]$ for $\omega \in (0, 1]$. If there is no compression ($\omega = 1$), then $\gamma = 1$, and Algorithm 1 exactly reduces to DAOL [8], as mentioned in the algorithm design.

Theorem 1 shows that Algorithm 1 achieves the regret bounds $O\big((\omega^{-2}N^{1/2} + \omega^{-4})N\sqrt{T}\big)$ and $O\big((\omega^{-2}N^{1/2} + \omega^{-4})N\ln T\big)$ for convex losses and strongly convex losses, respectively. The results show that Algorithm 1 is no-regret in both cases, since the time-averaged regret $\mathbb{E}_Q[R(j,T)]/T \to 0$ as $T \to \infty$. The obtained regret bounds $O(\sqrt{T})$ and $O(\ln T)$ are in accordance with those of centralized online algorithms in the respective cases [21, 22]. The node-averaged regret $\mathbb{E}_Q[R(j,T)]/N$ increases with $N$, in line with the result in [11]. As the compression ratio $\omega$ decreases, fewer bits are needed for node-to-node communication in each iteration, while more iteration rounds are needed to reach a desired regret. Hence $\omega$ can be used to balance the iteration rounds against the transmitted bits per iteration, in view of concerns such as bandwidth and agent computation capability. In practice, we can choose a proper $\omega$ to minimize the total transmitted bits or the overall training time.

3 One-point Bandit Feedback

In this section, we apply difference compression to DOCO with one-point bandit feedback. We propose the DC-DOBD algorithm, which basically follows DC-DOGD except for the gradient estimation. In the one-point bandit feedback setting, after making the decision $x_i^t$ at time $t$, agent $i$ can query the loss function value at one point around $x_i^t$ and use the feedback to construct the gradient estimator $g_i^t$. Following the procedure in [41], agent $i$ chooses a unit-norm vector $u_i^t \in \mathbb{R}^d$ uniformly at random, queries the value of $f_i^t$ at the point $y_i^t = x_i^t + \epsilon u_i^t$, and calculates $g_i^t = \frac{d}{\epsilon} f_i^t(y_i^t) u_i^t$. Since the loss function $f_i^t$ is defined on the set $K$, we slightly modify the projection in (3) to $P_{(1-\zeta)K}$ to ensure that the query point $y_i^t \in K$. Algorithm 2 actually performs gradient descent on the smoothed function $\hat{f}_i^t(x) = \mathbb{E}_{u \in B}[f_i^t(x + \epsilon u)]$ restricted to the convex set $(1-\zeta)K$. It has been shown in [41] that $\mathbb{E}[g_i^t] = \nabla\hat{f}_i^t(x_i^t)$.

In the bandit setting, Assumptions 3 and 4 are modified as follows; these are common in online bandit optimization [11, 23, 41].

Assumption 6. The convex set $K$ contains the ball of radius $r$ centered at the origin and is contained in the ball of radius $R$, i.e., $rB \subseteq K \subseteq RB$, where $B = \{u \in \mathbb{R}^d : \|u\| \le 1\}$.

Assumption 7. For each $i \in V$ and $t = 1, 2, \ldots, T$, the loss function $f_i^t$ is convex and $l$-Lipschitz continuous on $K$, i.e., $|f_i^t(x) - f_i^t(y)| \le l\|x - y\|$, $\forall x, y \in K$.

Algorithm 2: Distributed Online One-point Bandit Gradient Descent with Difference Compression (DC-DOBD)
Input: consensus stepsize $\gamma$, GD stepsizes $\{\eta_t\}_{t=1}^T$, time horizon $T$, exploration parameter $\epsilon$, shrinkage parameter $\zeta$
Initialize: $x_i^1 = 0$, $\hat{x}_i^1 = 0$, $s_i^1 = 0$ for each node $i \in V$
1: for $t = 1$ to $T-1$ do in parallel for each node $i \in V$
2: Compress the difference vector $q_i^t = Q(x_i^t - \hat{x}_i^t)$ and update $\hat{x}_i^{t+1} = \hat{x}_i^t + q_i^t$.
3: Spread $q_i^t$ and receive $q_j^t$, $j \in N_i$. Update $s_i^{t+1} = s_i^t + \sum_{j \in N_i} a_{ij} q_j^t$.
4: Receive the one-point bandit feedback and construct $g_i^t = \frac{d}{\epsilon} f_i^t(x_i^t + \epsilon u_i^t) u_i^t$.
5: Update the decision variable $x_i^{t+1} = P_{(1-\zeta)K}\big(x_i^t + \gamma(s_i^{t+1} - \hat{x}_i^{t+1}) - \eta_t g_i^t\big)$.
Output: $\{x_i^t\}_{t=1}^T$
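As an illustration of step 4 of Algorithm 2, the one-point estimator can be sketched as follows (our sketch, assuming the loss value at the single query point can be evaluated as a Python callable).

```python
import numpy as np

def one_point_gradient(f, x, eps, rng):
    """One-point bandit estimator g = (d / eps) * f(x + eps*u) * u, with u drawn
    uniformly from the unit sphere; E[g] equals the gradient of the smoothed
    surrogate f_hat(x) = E_{v in unit ball}[f(x + eps*v)] [41]."""
    u = rng.standard_normal(x.size)
    u /= np.linalg.norm(u)              # uniform direction on the unit sphere
    return (x.size / eps) * f(x + eps * u) * u
```

The single noisy function value makes the estimator's second moment scale roughly like $(dB/\epsilon)^2$, which is the source of the slower rates in Theorem 2 below.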
Assumptions 6 and 7 lead to a uniform upper bound on the function values, i.e., there exists a constant $B > 0$ such that $\max_{x,i,t} |f_i^t(x)| \le B$. We now establish the expected regret bounds of Algorithm 2 for convex and strongly convex losses, respectively.

Theorem 2. Let Assumptions 1, 2, 6 and 7 hold. Consider Algorithm 2 with the consensus stepsize $\gamma$ chosen in (5). Denote

$$H := (3\gamma^{-1}\delta^{-1} + 1)(1 + \gamma^{-1}\delta^{-1} + \omega^{-1}). \tag{8}$$

(i) (Convex case) Take the gradient descent stepsize $\eta_t = \frac{2R\epsilon}{dB\sqrt{t+c}}$ for a constant $c \ge \frac{8}{3\gamma\delta}$, $\epsilon = \big(\frac{(1+4H)dBR}{2}\big)^{1/2}(T+c)^{1/4}T^{-1/2}$, and $\zeta = \epsilon/r$. Then for each $j \in V$ and $T \ge 1$,

$$\mathbb{E}[R(j,T)] \le 2NT^{1/2}(T+c)^{1/4}\sqrt{2(1+4H)(l + B/r)dBR}. \tag{9}$$

(ii) (Strongly convex case) Under the additional Assumption 5, take the gradient descent stepsize $\eta_t = \frac{1}{\mu(t+c)}$ for a constant $c \ge \frac{16}{3\gamma\delta}$, $\epsilon = \Big(\frac{Hd^2B^2\ln(T+c)}{\mu(l+B/r)T}\Big)^{1/3}$, and $\zeta = \epsilon/r$. Then for each $j \in V$ and $T \ge 1$,

$$\mathbb{E}[R(j,T)] \le 4\mu c R^2 + 3N\big(Hd^2B^2\mu^{-1}\big)^{1/3}(l + B/r)^{2/3} T^{2/3}\ln^{1/3}(T+c). \tag{10}$$

Theorem 2 shows that Algorithm 2 is also no-regret; it achieves the regret bounds $O(d^{1/2}N^{5/4}T^{3/4})$ and $O(d^{2/3}N^{7/6}T^{2/3}\ln^{1/3} T)$ for convex losses and strongly convex losses, respectively, which match the bounds obtained in [11]. The regrets scale with $(\omega^{-2}N^{1/2}+\omega^{-4})^{1/2}$ and $(\omega^{-2}N^{1/2}+\omega^{-4})^{1/3}$ for convex and strongly convex losses, respectively, which indicates that the influence of the compression ratio $\omega$ on the regret bounds in the one-point bandit setting is weaker than in the full information setting.

4 Two-point Bandit Feedback

In the two-point bandit feedback setting, agent $i$ can query the loss function values at two points around $x_i^t$. Following the procedure in [23], agent $i$ picks a unit-norm vector $u_i^t \in \mathbb{R}^d$ uniformly at random, queries the values of $f_i^t$ at $y_{i,1}^t = x_i^t + \epsilon u_i^t$ and $y_{i,2}^t = x_i^t - \epsilon u_i^t$, and estimates the gradient as $g_i^t = \frac{d}{2\epsilon}\big(f_i^t(y_{i,1}^t) - f_i^t(y_{i,2}^t)\big) u_i^t$. We thus obtain DC-DO2BD as a variant of DC-DOBD. In the two-point bandit setting, the regret of node $j \in V$ is modified as

$$R_2(j,T) = \sum_{t=1}^{T}\sum_{i=1}^{N} \frac{f_i^t(y_{j,1}^t) + f_i^t(y_{j,2}^t)}{2} - \min_{x \in K}\sum_{t=1}^{T}\sum_{i=1}^{N} f_i^t(x). \tag{11}$$

Algorithm 3: Distributed Online Two-point Bandit Gradient Descent with Difference Compression (DC-DO2BD)
Input: consensus stepsize $\gamma$, GD stepsizes $\{\eta_t\}_{t=1}^T$, time horizon $T$, exploration parameter $\epsilon$, shrinkage parameter $\zeta$
Initialize: $x_i^1 = 0$, $\hat{x}_i^1 = 0$, $s_i^1 = 0$ for each node $i \in V$
1: for $t = 1$ to $T-1$ do in parallel for each node $i \in V$
2: Compress the difference vector $q_i^t = Q(x_i^t - \hat{x}_i^t)$ and update $\hat{x}_i^{t+1} = \hat{x}_i^t + q_i^t$.
3: Spread $q_i^t$ and receive $q_j^t$, $j \in N_i$. Update $s_i^{t+1} = s_i^t + \sum_{j \in N_i} a_{ij} q_j^t$.
4: Receive the two-point feedback and construct $g_i^t = \frac{d}{2\epsilon}\big(f_i^t(x_i^t + \epsilon u_i^t) - f_i^t(x_i^t - \epsilon u_i^t)\big) u_i^t$.
5: Update the decision variable $x_i^{t+1} = P_{(1-\zeta)K}\big(x_i^t + \gamma(s_i^{t+1} - \hat{x}_i^{t+1}) - \eta_t g_i^t\big)$.
Output: $\{x_i^t\}_{t=1}^T$
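Analogously to the one-point case, step 4 of Algorithm 3 can be sketched as follows (our illustration). The difference of the two symmetric queries cancels the $O(1/\epsilon)$ blow-up: for an $l$-Lipschitz loss, $|f(x+\epsilon u) - f(x-\epsilon u)| \le 2\epsilon l$, so $\|g\| \le dl$ regardless of $\epsilon$.

```python
import numpy as np

def two_point_gradient(f, x, eps, rng):
    """Two-point bandit estimator g = (d / (2*eps)) * (f(x+eps*u) - f(x-eps*u)) * u.
    For an l-Lipschitz f, ||g|| <= d*l independently of eps, unlike the one-point case."""
    u = rng.standard_normal(x.size)
    u /= np.linalg.norm(u)              # uniform direction on the unit sphere
    return (x.size / (2 * eps)) * (f(x + eps * u) - f(x - eps * u)) * u
```

This uniformly bounded estimator is what restores the $O(\sqrt{T})$ and $O(\ln T)$ rates in Theorem 3 below.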
Theorem 3. Let Assumptions 1, 2, 6 and 7 hold. Consider Algorithm 3 with the consensus stepsize $\gamma$ chosen in (5) and $H$ defined in (8).

(i) (Convex case) Take the gradient descent stepsize $\eta_t = \frac{2R}{dl\sqrt{t+c}}$ for a constant $c \ge \frac{8}{3\gamma\delta}$, $\epsilon = \frac{1}{T}$, and $\zeta = \epsilon/r$. Then for each $j \in V$ and $T \ge 1$,

$$\mathbb{E}[R_2(j,T)] \le (1 + 4H)RNdl\sqrt{T+c} + (3 + 2R/r)Ndl. \tag{12}$$

(ii) (Strongly convex case) Under the additional Assumption 5, take the gradient descent stepsize $\eta_t = \frac{1}{\mu(t+c)}$ for a constant $c \ge \frac{16}{3\gamma\delta}$, $\epsilon = \frac{\ln(T)}{T}$, and $\zeta = \epsilon/r$. Then for each $j \in V$ and $T \ge 1$,

$$\mathbb{E}[R_2(j,T)] \le 4\mu c R^2 + \mu^{-1}Nd^2l^2H\ln(T+c) + (3 + 2R/r)Ndl\ln(T). \tag{13}$$

Theorem 3 shows that Algorithm 3 achieves $O\big((\omega^{-2}N^{1/2}+\omega^{-4})Nd\sqrt{T}\big)$ and $O\big((\omega^{-2}N^{1/2}+\omega^{-4})Nd^2\ln T\big)$ regret bounds for convex and strongly convex losses, respectively, which recovers the regret bounds $O(\sqrt{T})$ (convex) and $O(\ln T)$ (strongly convex) of the full information case, although the constants are larger than those of Theorem 1.

5 Numerical Experiments

In this section, we evaluate the three proposed algorithms on a real-world online problem: the diabetes prediction task, which aims to diagnose diabetes from several risk factors. Consider distributed online regularized logistic regression with the local loss function

$$f_i^t(x) = \sum_{j=1}^{S} \log\Big(1 + \exp\big(-b_{i,j}^t\,\langle a_{i,j}^t, x\rangle\big)\Big) + \frac{\mu}{2}\|x\|^2, \tag{14}$$

where $\mu$ is the regularization parameter, and a batch of data samples $\{(a_{i,j}^t, b_{i,j}^t)\}_{j=1}^{S}$ is revealed to agent $i$ at time $t$. We adopt the diabetes-binary-BRFSS2015 dataset with 70692 instances, 21 features, and 2 labels from Kaggle.² Here, $a_{i,j}^t \in \mathbb{R}^d$ with $d = 21$, and $b_{i,j}^t \in \{-1, 1\}$. We standardize the data samples and distribute them evenly among the $N$ agents under the sorted setting, i.e., each agent only gets data samples from one class. The connected communication network $G(N, M)$ with $N$ nodes and $M$ edges is generated randomly by the tool NetworkX [42], and we then use the Metropolis rule [43] to construct the connectivity matrix $A$ so as to satisfy Assumption 1 (a sketch of this construction is given at the end of this subsection). We repeat each experiment ten times and depict the mean curves.³ The parameter selection details are given in Appendix E.

²The dataset is from https://www.kaggle.com/code/encode0/diabetes-prediction-and-risk-factors-evaluation.
³All experiments are performed on a 64-bit Windows platform with an Intel(R) Core(TM) i7-6850K 3.6GHz CPU. The code is provided in the supplementary materials.

Comparison experiment We run our algorithms DC-DOGD, DC-DOBD, and DC-DO2BD, and compare them with ECD-AMSGrad [20], for the convex case ($\mu = 0$) and the strongly convex case ($\mu = 1$). The compressor type, the compression ratio, and the communication network are kept the same. Take the setting of compressor QSGD_2 with $\omega = 0.3$ over $G(9, 18)$ as an example. We plot the time-averaged maximum regret $SR(T) := \max_j R(j,T)/T$ versus the time horizon $T$ and versus the total number of transmitted bits in Fig. 1, where the best solution in hindsight $x^\ast$ is obtained by the logistic regression optimizer from scikit-learn [44]. Fig. 1 shows that the time-averaged regrets of DC-DOGD, DC-DOBD, and DC-DO2BD go to zero as $T$ goes to infinity, in agreement with the theoretical result that our algorithms are no-regret. Among the three proposed algorithms, one-point bandit feedback has the worst performance, while using two-point information improves the performance, even reaching that of full information feedback. ECD-AMSGrad deteriorates in the first few steps because the inverse of the second raw moment estimate is large and the algorithm has no projection to restrict the variables. Afterwards, ECD-AMSGrad declines fast, but its time-averaged regret does not reach zero. Clearly, DC-DOGD and DC-DO2BD significantly outperform ECD-AMSGrad.
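As a reproducibility aid, the following sketch (our code, following [42, 43]; the function name is ours) generates a connected $G(N, M)$ graph with NetworkX and builds a Metropolis connectivity matrix satisfying Assumption 1, from which the parameters $\delta$ and $\beta$ used in (5) can be computed.

```python
import networkx as nx
import numpy as np

def metropolis_weights(G):
    """Metropolis rule [43]: a_ij = 1/(1 + max(deg_i, deg_j)) on edges, zero on
    non-edges, and self-weights chosen so each row sums to one. The result is a
    symmetric doubly stochastic matrix, as required by Assumption 1."""
    n = G.number_of_nodes()
    A = np.zeros((n, n))
    for i, j in G.edges():
        A[i, j] = A[j, i] = 1.0 / (1 + max(G.degree[i], G.degree[j]))
    A[np.diag_indices(n)] = 1.0 - A.sum(axis=1)
    return A

G = nx.gnm_random_graph(9, 18, seed=0)   # a random G(9, 18) network
assert nx.is_connected(G)                # Assumption 1 requires connectivity
A = metropolis_weights(G)
eigs = np.sort(np.abs(np.linalg.eigvalsh(A)))
delta = 1 - eigs[-2]                     # spectral gap: delta = 1 - |lambda_2(A)|
beta = np.linalg.norm(np.eye(9) - A, 2)  # beta = ||I_N - A||_2
```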
[Figure 1: Comparison of algorithms DC-DOGD, DC-DOBD, DC-DO2BD, and ECD-AMSGrad with QSGD_2, ω = 0.3, G(9, 18): time-averaged regret versus time horizon and versus transmitted bits; (a) convex case, (b) strongly convex case.]

Impacts of compression ratio and compressor type Fixing the compressor type (Top_k) and the graph $G(9, 18)$, we run DC-DOGD with different compression ratios ($\omega = 0.05, 0.1, 0.5$) for strongly convex losses.⁴ We take DAOL [8] as the baseline, which uses exact communication and is the special case of DC-DOGD with $\omega = 1$. Fig. 2a shows that the greater the compression degree (smaller $\omega$), the more iteration rounds are needed to reach a given average regret, while the total transmitted bits are actually fewer. DC-DOGD with $\omega = 0.05$ achieves approximately an 8× reduction in transmitted bits to reach a given average regret compared with DAOL. Next, we fix the compression ratio ($\omega = 0.3$) and the graph $G(9, 18)$, and run DC-DOGD with different compressors (Top_k, Rand_k, RGossip_p, QSGD_s) for strongly convex losses. Fig. 2b shows that Rand_k and RGossip_p have almost the same performance. This is expected, since Rand_k selects each coordinate of the vector for transmission with probability $\omega = k/d$, which is equivalent to randomly transmitting the whole vector with probability $p = \omega$. Fig. 2b also shows that Top_k performs better than Rand_k, in line with the intuition that the largest $k$ coordinates contain more useful information than arbitrary $k$ coordinates. In addition, the quantization QSGD_s reduces the total transmitted bits more effectively than sparsification at the same compression ratio.

[Figure 2: The impacts of (a) compression ratio and (b) compressor type for DC-DOGD over G(9, 18) in the strongly convex case.]

⁴Since the trajectories in the convex case and the strongly convex case share similar trends, we only present the experimental results in the strongly convex case, due to space limitations.

Impacts of topology and node number The communication network topology affects the parameters $\delta$ and $\beta$, which in turn affect the choice of the stepsizes $\eta_t$ and $\gamma$ as well as the algorithm performance. We take DC-DOGD in the strongly convex case as an example. Fixing the compressor (Top_1) and the compression ratio ($\omega = 0.05$), we assess DC-DOGD over three basic topologies, namely ring, $G(N, 2N)$, and fully connected. Fig. 3a shows that the fully connected topology exhibits the best regret with respect to the time horizon, while $G(N, 2N)$ performs slightly better in terms of transmitted bits. A sparser topology with fewer edges uses fewer bits to transmit information in each iteration, while more iteration rounds are needed for decision consensus; hence there is a case-by-case tradeoff. Finally, we let the node number $N$ vary from 10 to 50 and run DC-DOGD, DC-DOBD, and DC-DO2BD with the same compressor Top_2 and the same compression ratio $\omega = 0.1$ over the fully connected graph, for convex losses and strongly convex losses.
We plot the node-averaged regret $AR(T) := \max_j R(j,T)/N$ versus the node number $N$ in Fig. 3b.

[Figure 3: The impacts of (a) topology and (b) node number (left: convex, right: strongly convex).]

6 Conclusions

In this paper, we considered DOCO with full information feedback, one-point bandit feedback, and two-point bandit feedback. We designed provably no-regret distributed online algorithms that work with ω-contracted compressors. The obtained regret bounds for both convex and strongly convex losses match those of uncompressed algorithms in the literature. We further assessed the influence of the compressor type and the compression ratio ω on the regrets, and showed that ω can be used to balance the iteration rounds against the transmitted bits according to the bandwidth. A limitation of this work is that the obtained regret bounds show a high-order inverse dependence on the compression ratio, which is rather conservative and may be further improved. Besides, the impact of topology sparsity is worth further investigation. We believe this paper is an important step in this direction. Future research includes extensions from fixed graphs to time-varying graphs and combinations with other compression schemes such as extrapolation compression.

Acknowledgments and Disclosure of Funding

This work is supported in part by the Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0100 and the National Natural Science Foundation of China under Grant 62173250, and in part by the Australian Research Council under Grants DP190103615 and LP210200473.

References

[1] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.
[2] David Sculley and Gabriel M Wachman. Relaxed online SVMs for spam filtering. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 415-422, 2007.
[3] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 689-696, 2009.
[4] Deming Yuan, Alexandre Proutiere, and Guodong Shi. Distributed online linear regressions. IEEE Transactions on Information Theory, 67(1):616-639, 2020.
[5] Wenpeng Zhang, Peilin Zhao, Wenwu Zhu, Steven CH Hoi, and Tong Zhang. Projection-free distributed online learning in networks. In International Conference on Machine Learning, pages 4054-4062. PMLR, 2017.
[6] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning. Omnipress, 2010.
[7] Xiuxian Li, Lihua Xie, and Na Li. A survey of decentralized online learning. arXiv preprint arXiv:2205.00473, 2022.
[8] Feng Yan, Shreyas Sundaram, SVN Vishwanathan, and Yuan Qi. Distributed autonomous online learning: Regrets and intrinsic privacy-preserving properties. IEEE Transactions on Knowledge and Data Engineering, 25(11):2483-2493, 2012.
[9] Deming Yuan, Daniel WC Ho, Yiguang Hong, and Guoping Jiang. Online bandit convex optimization over a network. In 35th Chinese Control Conference, pages 8090-8095. IEEE, 2016.
[10] Jueyou Li, Chaojie Li, Wenwu Yu, Xiaomei Zhu, and Xinghuo Yu. Distributed online bandit learning in dynamic environments over unbalanced digraphs. IEEE Transactions on Network Science and Engineering, 8(4):3034-3047, 2021.
[11] Jinlong Lei, Peng Yi, Yiguang Hong, Jie Chen, and Guodong Shi. Online convex optimization over Erdős-Rényi random networks. Advances in Neural Information Processing Systems, 33:15591-15601, 2020.
[12] Dan Alistarh. A brief tutorial on distributed and concurrent machine learning. In Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, pages 487-488, 2018.
[13] Tuncer Can Aysal, Mark J Coates, and Michael G Rabbat. Distributed average consensus with dithered quantization. IEEE Transactions on Signal Processing, 56(10):4905-4918, 2008.
[14] Ruggero Carli, Fabio Fagnani, Paolo Frasca, Tom Taylor, and Sandro Zampieri. Average consensus on networks with transmission noise or quantization. In 2007 European Control Conference, pages 1852-1857. IEEE, 2007.
[15] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. Advances in Neural Information Processing Systems, 31, 2018.
[16] Anastasia Koloskova, Sebastian Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning, pages 3478-3487. PMLR, 2019.
[17] Navjot Singh, Deepesh Data, Jemin George, and Suhas Diggavi. SPARQ-SGD: Event-triggered and compressed communication in decentralized optimization. IEEE Transactions on Automatic Control, 2022.
[18] Yiwei Liao, Zhuorui Li, Kun Huang, and Shi Pu. Compressed gradient tracking methods for decentralized optimization with linear convergence. arXiv preprint arXiv:2103.13748, 2021.
[19] Jiaqi Zhang, Keyou You, and Lihua Xie. Innovation compression for communication-efficient distributed optimization with linear convergence. arXiv preprint arXiv:2105.06697, 2021.
[20] Guangxia Li, Jia Liu, Xiao Lu, Peilin Zhao, Yulong Shen, and Dusit Niyato. Decentralized online learning with compressed communication for near-sensor data analytics. IEEE Communications Letters, 25(9):2958-2962, 2021.
[21] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928-936, 2003.
[22] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169-192, 2007.
[23] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proceedings of the 23rd Annual Conference on Learning Theory, pages 28-40, 2010.
[24] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[25] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pages 560-569. PMLR, 2018.
[26] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30, 2017.
[27] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 440-445, 2017.
[28] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. Advances in Neural Information Processing Systems, 31, 2018.
[29] Aritra Dutta, El Houcine Bergou, Ahmed M Abdelmoniem, Chen-Yu Ho, Atal Narayan Sahu, Marco Canini, and Panos Kalnis. On the discrepancy between the theoretical analysis and practical implementations of compressed communication for distributed deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3817-3824, 2020.
[30] Jiawei Jiang, Fangcheng Fu, Tong Yang, and Bin Cui. SketchML: Accelerating distributed machine learning with data sketches. In Proceedings of the 2018 International Conference on Management of Data, pages 1269-1284, 2018.
[31] Hyeontaek Lim, David G Andersen, and Michael Kaminsky. 3LC: Lightweight and effective traffic compression for distributed machine learning. Proceedings of Machine Learning and Systems, 1:53-64, 2019.
[32] Debraj Basu, Deepesh Data, Can Karakus, and Suhas Diggavi. Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations. Advances in Neural Information Processing Systems, 32, 2019.
[33] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes signSGD and other gradient compression schemes. In International Conference on Machine Learning, pages 3252-3261. PMLR, 2019.
[34] Peter Richtárik, Igor Sokolov, and Ilyas Fatkhullin. EF21: A new, simpler, theoretically better, and practically faster error feedback. Advances in Neural Information Processing Systems, 34, 2021.
[35] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani. FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics, pages 2021-2031. PMLR, 2020.
[36] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
[37] Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.
[38] Paul Erdős, Alfréd Rényi, et al. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci., 5(1):17-60, 1960.
[39] Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48-61, 2009.
[40] Elad Hazan, Alexander Rakhlin, and Peter Bartlett. Adaptive online gradient descent. Advances in Neural Information Processing Systems, 20, 2007.
[41] Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385-394, 2005.
[42] Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Laboratory (LANL), Los Alamos, NM, United States, 2008.
[43] Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65-78, 2004.
[44] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Section 6 (Conclusions).
(c) Did you discuss any potential negative societal impacts of your work? [No] This work is a theoretical contribution addressing the communication bottleneck of distributed algorithms; it does not present foreseeable negative societal impacts.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] See Assumptions 1, 2, 3, 4, 5, 6, and 7.
(b) Did you include complete proofs of all theoretical results? [Yes] We present the proof ideas in the paper and put the complete proofs in the supplemental materials due to space limitations.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The code link is in the supplemental material. Data are from Kaggle, with the URL given. Instructions are included in Section 5.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 5 and Appendix E.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] We repeat each experiment ten times and only depict the mean curves to make the figures easier to read. Our experimental results with different random seeds are similar.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See footnote 3 in Section 5.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] We use the tools NetworkX and scikit-learn with citations. We use the open dataset from Kaggle with its URL.
(b) Did you mention the license of the assets? [Yes] See Appendix F.
(c) Did you include any new assets either in the supplemental material or as a URL? [No]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [No]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]