Compressed-VFL: Communication-Efficient Learning with Vertically Partitioned Data

Timothy Castiglia 1, Anirban Das 1, Shiqiang Wang 2, Stacy Patterson 1

1 Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA. 2 IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. Correspondence to: Timothy Castiglia.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We propose Compressed Vertical Federated Learning (C-VFL) for communication-efficient training on vertically partitioned data. In C-VFL, a server and multiple parties collaboratively train a model on their respective features, utilizing several local iterations and sharing compressed intermediate results periodically. Our work provides the first theoretical analysis of the effect message compression has on distributed training over vertically partitioned data. We prove convergence of non-convex objectives at a rate of O(1/√T) when the compression error is bounded over the course of training. We provide specific requirements for convergence with common compression techniques, such as quantization and top-k sparsification. Finally, we experimentally show that compression can reduce communication by over 90% without a significant decrease in accuracy compared to VFL without compression.

1. Introduction

Federated Learning (McMahan et al., 2017) is a distributed machine learning approach that has attracted much interest in both theory (Li et al., 2020; Wang et al., 2019; Liu et al., 2020) and practice (Bonawitz et al., 2019; Rieke et al., 2020; Lim et al., 2020) in recent years. Naive distributed learning algorithms may require frequent exchanges of large amounts of data, which can lead to slow training performance (Lin et al., 2020). Further, participants may be globally distributed, with high-latency network connections. To mitigate these factors, Federated Learning algorithms aim to be communication-efficient by design. Methods such as local updates (Moritz et al., 2016; Liu et al., 2019), where parties train local parameters for multiple iterations without communication, and message compression (Stich et al., 2018; Wen et al., 2017; Karimireddy et al., 2019) reduce message frequency and size, respectively, with little impact on training performance.

Federated Learning methods often target the case where the data among parties is distributed horizontally: each party's data shares the same features, but parties hold data corresponding to different sample IDs. This is known as Horizontal Federated Learning (HFL) (Yang et al., 2019). However, there are several application areas where data is partitioned in a vertical manner: the parties store data on the same sample IDs but different feature spaces. An example of a vertically partitioned setting is a hospital, bank, and insurance company seeking to train a model to predict something of mutual interest, such as customer credit score. Each of these institutions may have data on the same individuals but store medical history, financial transactions, and vehicle accident reports, respectively. These features must remain local to the institutions due to privacy concerns, rules and regulations (e.g., GDPR, HIPAA), and/or communication network limitations. In such a scenario, Vertical Federated Learning (VFL) methods must be employed.
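To make the two partitioning schemes concrete, the short NumPy sketch below splits a toy feature matrix by rows (the horizontal setting) and by columns (the vertical setting) across hypothetical parties; the party count and feature sizes are illustrative assumptions, not values from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: N samples, D features (sizes chosen only for illustration).
N, D = 6, 8
X = rng.normal(size=(N, D))
y = rng.integers(0, 2, size=N)            # labels for the shared sample IDs

# Horizontal (HFL) partition: each party holds all D features
# for a disjoint subset of the samples.
hfl_parts = np.array_split(X, 2, axis=0)

# Vertical (VFL) partition: each party holds a disjoint subset of the
# features for every sample; sample IDs are aligned across parties.
M = 4
vfl_parts = np.array_split(X, M, axis=1)  # party m holds X_m with D/M columns

for m, X_m in enumerate(vfl_parts, start=1):
    print(f"party {m}: local feature block of shape {X_m.shape}")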
Although VFL is less well-studied than HFL, there has been a growing interest in VFL algorithms recently (Hu et al., 2019; Gu et al., 2021; Cha et al., 2021), and VFL algorithms have important applications including risk prediction, smart manufacturing, and discovery of pharmaceuticals (Kairouz et al., 2021). Typically in VFL, each party trains a local embedding function that maps raw data features to a meaningful vector representation, or embedding, for prediction tasks. For example, a neural network can be an embedding function for mapping the text of an online article to a vector space for classification (Koehrsen, 2018). Referring to Figure 1, suppose Party 1 is a hospital with medical data features x1. The hospital computes its embedding h1(θ1; x1) for the features by feeding x1 through a neural network. The other parties (the bank and insurance company), compute embeddings for their features, then all parties share the embeddings in a private manner (e.g., homomorphic encryption, secure multiparty computation, or secure aggregation). The embeddings Communication-Efficient Learning with Vertically Partitioned Data Figure 1. Example global model with neural networks. To obtain a ˆy prediction for a data sample x, each party m feeds the local features of x, xm, into a neural network. The output of this neural network is the embedding hm(θm; xm). All embeddings are then fed into the server model neural network with parameters θ0. are then combined in a server model θ0 to determine the final loss of the global model. A server model (or fusion network) captures the complicated interactions of embeddings and is often a complex, non-linear model (Gu et al., 2019; Nie et al., 2021; Han et al., 2021b). Embeddings can be very large, in practice, sometimes requiring terabytes of communication over the course of training. Motivated by this, we propose Compressed Vertical Federated Learning (C-VFL), a general framework for communication-efficient Federated Learning over vertically partitioned data. In our algorithm, parties communicate compressed embeddings periodically, and the parties and the server each run block-coordinate descent for multiple local iterations, in parallel, using stochastic gradients to update their local parameters. C-VFL is the first theoretically verified VFL algorithm that applies embedding compression. Unlike in HFL algorithms, C-VFL compresses embeddings rather than gradients. Previous work has proven convergence for HFL algorithms with gradient compression (Stich et al., 2018; Wen et al., 2017; Karimireddy et al., 2019). However, no previous work analyzes the convergence requirements for VFL algorithms that use embedding compression. Embeddings are parameters in the partial derivatives calculated at each party. The effect of compression error on the resulting partial derivatives may be complex; therefore, the analysis in previous work on gradient compression in HFL does not apply to compression in VFL. In our work, we prove that, under a diminishing compression error, C-VFL converges at a rate of O( 1 T ), which is comparable to previous VFL algorithms that do not employ compression. We also analyze common compressors, such as quantization and sparsification, in C-VFL and provide bounds on their compression parameters to ensure convergence. Figure 2. Example local view of a global model with neural networks. When running C-VFL, Party 1 (in green) only has a compressed snapshot of the other parties embeddings and the server model. 
To calculate ˆy, Party 1 uses its own embedding calculated at iteration t, and the embeddings and server model calculated at time t0, the latest communication iteration, and compressed with Cm. C-VFL also generalizes previous work by supporting an arbitrary server model. Previous work in VFL has either only analyzed an arbitrary server model without local updates (Chen et al., 2020), or analyzed local updates with a linear server model (Liu et al., 2019; Zhang et al., 2020; Das & Patterson, 2021). C-VFL is designed with an arbitrary server model, allowing support for more complex prediction tasks than those supported by previous VFL algorithms. We summarize our main contributions in this work. 1. We introduce C-VFL with an arbitrary compression scheme. Our algorithm generalizes previous work in VFL by including both an arbitrary server model and multiple local iterations. 2. We prove convergence of C-VFL to a neighborhood of a fixed point on non-convex objectives at a rate of O( 1 T ) for a fixed step size when the compression error is bounded over the course of training. We also prove that the algorithm convergence error goes to zero for a diminishing step size if the compression error diminishes as well. Our work provides novel analysis for the effect of compressing embeddings on convergence in a VFL algorithm. Our analysis also applies to Split Learning when uploads to the server are compressed. 3. We provide convergence bounds on parameters in common compressors that can be used in C-VFL. In particular, we examine scalar quantization (Bennett, 1948), lattice vector quantization (Zamir & Feder, 1996), and top-k sparsification (Lin et al., 2018). 4. We evaluate our algorithm by training on MIMIC-III, CIFAR-10, and Model Net10 datasets. We empirically show Communication-Efficient Learning with Vertically Partitioned Data how C-VFL can reduce the number of bits sent by over 90% compared to VFL with no compression without a significant loss in accuracy of the final model. Related Work. (Richt arik & Tak ac, 2016; Hardy et al., 2017) were the first works to propose Federated Learning algorithms for vertically partitioned data. (Chen et al., 2020; Romanini et al., 2021) propose the inclusion of an arbitrary server model in a VFL algorithm. However, these works do not consider multiple local iterations, and thus communicate at every iteration. (Liu et al., 2019), (Feng & Yu, 2020), and (Das & Patterson, 2021) all propose different VFL algorithms with local iterations for vertically partitioned data but do not consider an arbitrary server model. Split Learning is a related concept to VFL (Gupta & Raskar, 2018). Split Learning can be thought of a special case of VFL when there is only one party. Recent works (He et al., 2020; Han et al., 2021a) have extended Split Learning to a Federated Learning setting. However, these works focus on the HFL setting and do not apply message compression. In contrast to previous works, our work addresses a vertical scenario, an arbitrary server model, local iterations, and message compression. Message compression is a common topic in HFL scenarios, where participants exchange gradients determined by their local datasets. Methods of gradient compression in HFL include scalar quantization (Bernstein et al., 2018), vector quantization (Shlezinger et al., 2021), and top-k sparsification (Shi et al., 2019). In C-VFL, compressed embeddings are shared, rather than compressed gradients. 
Analysis in previous work on gradient compression in HFL does not apply to compression in VFL, as the effect of embedding compression error on each party s partial derivatives may be complex. No prior work has analyzed the impact of compression on convergence in VFL. Outline. In Section 2, we provide the problem formulation and our assumptions. Section 3 presents the details of C-VFL. In Section 4, we present our main theoretical results. Our experimental results are given in Section 5. Finally, we conclude in Section 6. 2. Problem Formulation We present our problem formulation and notation to be used in the rest of the paper. We let a be the 2-norm of a vector a, and let A F be the Frobenius norm of a matrix A. We consider a set of M parties {1, . . . , M} and a server. The dataset X RN D is vertically partitioned a priori across the M parties, where N is the number of data samples and D is the number of features. The i-th row of X corresponds to a data sample xi. For each sample xi, a party m holds a disjoint subset of the features, denoted xi m, so that xi = [xi 1, . . . , xi M]. For each xi, there is a corresponding label yi. Let y RN 1 be the vector of all sample labels. We let Xm RN Dm be the local dataset of a party m, where the i-th row correspond to data features xi m. We assume that the server and all parties have a copy of the labels y. For scenarios where the labels are private and only present at a single party, the label holder can provide enough information for the parties to compute gradients for some classes of model architectures (Liu et al., 2019). Each party m holds a set of model parameters θm as well as a local embedding function hm( ). The server holds a set of parameters θ0 called the server model and a loss function l( ) that combines the embeddings hm(θm; xi m) from all parties. Our objective is to minimize the following: i=1 l(θ0, h1(θ1; xi 1), . . . , h M(θM; xi M); yi) (1) where Θ = [θT 0 , . . . , θT M]T is the global model. An example of a global model Θ is in Figure 1. For simplicity, we let m = 0 refer to the server, and define h0(θ0; xi) := θ0 for all xi, where h0( ) is equivalent to the identity function. Let hm(θm; xi m) RPm for m = 0, . . . , M, where Pm is the size of the m-th embedding. Let m F(Θ; X; y) := 1 N PN i=1 θml(θ0, h1(θ1; xi 1), . . . , h M(θM; xi M); yi) be the partial derivatives for parameters θm. Let XB and y B be the set of samples and labels corresponding to a randomly sampled mini-batch B of size B. We let the stochastic partial derivatives for parameters θm be m FB(Θ; X; y) := 1 B P xi,yi XB,y B θml(θ0, h1(θ1; xi 1), . . . , h M(θM; xi M); y). We may drop X and y from F( ) and FB( ). With a minor abuse of notation, we let hm(θm; XB m) := {hm(θm; x B1 m ), . . . , hm(θm; x BB m )} be the set of all party m embeddings associated with mini-batch B, where Bi is the i-th sample in the mini-batch B. We let m FB(Θ) and m FB(θ0, h1(θ1; XB 1), . . . , h M(θM; XB M)) be equivalent, and use them interchangeably. Assumption 1. Smoothness: There exists positive constants L < and Lm < , for m = 0, . . . , M, such that for all Θ1, Θ2, the objective function satisfies: F(Θ1) F(Θ2) L Θ1 Θ2 m FB(Θ1) m FB(Θ2) Lm Θ1 Θ2 . Assumption 2. Unbiased gradients: For m = 0, . . . , M, for every mini-batch B, the stochastic partial derivatives are unbiased, i.e., EB m FB(Θ) = m F(Θ). Assumption 3. Bounded variance: For m = 0, . . . 
, M, there exists constants σm < such that the variance Communication-Efficient Learning with Vertically Partitioned Data of the stochastic partial derivatives are bounded as: EB m F(Θ) m FB(Θ) 2 σ2 m B for a mini-batch B of size B. Assumption 1 bounds how fast the gradient and stochastic partial derivatives can change. Assumptions 2 and 3 require that the stochastic partial derivatives are unbiased estimators of the true partial derivatives with bounded variance. Assumptions 1 3 are common assumptions in convergence analysis of gradient-based algorithms (Tsitsiklis et al., 1986; Nguyen et al., 2018; Bottou et al., 2018). We note Assumptions 2 3 are similar to the IID assumptions in HFL convergence analysis. However, in VFL settings, all parties store identical sample IDs but different subsets of features. Hence, there is no equivalent notion of a non-IID distribution in VFL. Assumption 4. Bounded Hessian: There exists positive constants Hm for m = 0, . . . , M such that for all Θ, the second partial derivatives of FB with respect to hm(θm; XB m) satisfy: 2 hm(θm;XB m)FB(Θ) F Hm (2) for any mini-batch B. Assumption 5. Bounded Embedding Gradients: There exists positive constants Gm for m = 0, . . . , M such that for all θm, the stochastic embedding gradients are bounded: θmhm(θm; XB m) F Gm (3) for any mini-batch B. Since we are assuming a Lipschitz-continuous loss function (Assumption 1), we know the Hessian of F is bounded. Assumption 4 strengthens this assumption slightly to also bound the Hessian over any mini-batch. Assumption 5 bounds the magnitude of the partial derivatives with respect to embeddings. This embedding gradient bound is necessary to ensure convergence in the presence of embedding compression error (see Appendix A.2 for details). 3. Algorithm We now present C-VFL, a communication-efficient method for training a global model with vertically partitioned data. In each global round, a mini-batch B is chosen randomly from all samples and parties share necessary information for local training on this mini-batch. Each party m, in parallel, runs block-coordinate stochastic gradient descent on its local model parameters θm for Q local iterations. C-VFL runs for a total of R global rounds, and thus runs for T = RQ total local iterations. For party m to compute the stochastic gradient with respect to its features, it requires the embeddings computed by all other parties j = m. In C-VFL, these embeddings are Algorithm 1 Compressed Vertical Federated Learning 1: Initialize: θ0 m for all parties m and server model θ0 0 2: for t 0, . . . , T 1 do 3: if t mod Q = 0 then 4: Randomly sample Bt {X, y} 5: for m 1, . . . , M in parallel do 6: Send Cm(hm(θt m; XBt m)) to server 7: end for 8: ˆΦt0 {C0(θt 0), C1(h1(θt 1)), . . . , CM(h M(θt M))} 9: Server sends ˆΦt0 to all parties 10: end if 11: for m 0, . . . , M in parallel do 12: ˆΦt m {ˆΦt0 m; hm(θt m; XBt0 m )} 13: θt+1 m θt m ηt0 m FB(ˆΦt m; y Bt0) 14: end for 15: end for shared with the server then distributed to the parties. We reduce communication cost by only sharing embeddings every global round. Further, each party compresses their embeddings before sharing. We define a set of general compressors for compressing party embeddings and the server model: Cm( ) : RPm RPm for m = 0, . . . , M. To calculate the gradient for data sample xi, party m receives Cj(hj(θj; xi j)) from all parties j = m. With this information, a party m can compute m FB and update its parameters θm for multiple local iterations. 
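The sketch below is an illustrative PyTorch rendering of the structure of Algorithm 1, not the authors' implementation: the model sizes, the simple rounding compressor, and all hyperparameters are placeholder assumptions. Once per global round, the parties upload compressed embeddings and receive them back together with a compressed snapshot of the server model; the server and every party then take Q block-coordinate SGD steps on their own parameters while holding the received values fixed.

import torch
import torch.nn as nn

# Illustrative sizes and models (assumptions, not the paper's architectures).
torch.manual_seed(0)
M, D_m, P_m, N, B, Q, R, eta = 4, 5, 8, 256, 32, 10, 20, 0.05

X = [torch.randn(N, D_m) for _ in range(M)]   # party m's feature block X_m
y = torch.randint(0, 2, (N,)).float()         # labels, known to all parties

parties = [nn.Sequential(nn.Linear(D_m, P_m), nn.Tanh()) for _ in range(M)]
server = nn.Linear(M * P_m, 1)                # fusion model theta_0
loss_fn = nn.BCEWithLogitsLoss()

def compress(t):
    # Placeholder compressor C_m: coarse uniform rounding on [-1, 1].
    return torch.round(t.clamp(-1, 1) * 2) / 2

for r in range(R):                            # R global rounds, T = R * Q
    idx = torch.randperm(N)[:B]               # mini-batch B_{t0}
    yb = y[idx]
    # Exchange step: compressed embeddings up, compressed server snapshot down.
    sent = [compress(parties[m](X[m][idx]).detach()) for m in range(M)]
    server_copy = {k: compress(v.detach()) for k, v in server.state_dict().items()}

    for _ in range(Q):                        # Q local iterations (parallel in practice)
        # Server step: received embeddings stay fixed, only theta_0 is updated.
        out = server(torch.cat(sent, dim=1)).squeeze(1)
        server.zero_grad()
        loss_fn(out, yb).backward()
        with torch.no_grad():
            for p in server.parameters():
                p -= eta * p.grad

        # Party steps: fresh local embedding, stale compressed everything else.
        for m in range(M):
            frozen = nn.Linear(M * P_m, 1)    # stale snapshot of the server model
            frozen.load_state_dict(server_copy)
            h = [parties[j](X[j][idx]) if j == m else sent[j] for j in range(M)]
            out = frozen(torch.cat(h, dim=1)).squeeze(1)
            parties[m].zero_grad()
            loss_fn(out, yb).backward()
            with torch.no_grad():
                for p in parties[m].parameters():
                    p -= eta * p.grad

In a real deployment the per-party loop runs in parallel on separate machines, and the placeholder compressor would be replaced by one of the quantizers or the top-k sparsifier analyzed in Section 4.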
Note that each party uses a stale view of the global model to compute its gradient during these local iterations, as it is reusing the embeddings it receives at the start of the round. In Section 4, we show that C-VFL converges even though parties use stale information. An example view a party has of the global model during training is in Figure 2. Here, t is the current iteration and t0 is the start of the most recent global round, when embeddings were shared. Algorithm 1 details the procedure of C-VFL. In each global round, when t mod Q = 0, a mini-batch B is randomly sampled from X and the parties exchange the associated embeddings, compressed using Cm( ), via the server (lines 3-9). Each party m completes Q local iterations, using the compressed embeddings it received in iteration t0 and its own m-th uncompressed embeddings hm(θt m, XBt0 m ). We denote the set of embeddings that each party receives in iteration t0 as: ˆΦt0 := {C0(θt0 0 ), C1(h1(θt0 1 )), . . . , CM(h M(θt0 M))}. (4) We let ˆΦt0 m be the set of compressed embeddings from parties j = m, and let ˆΦt m := {ˆΦt0 m; hm(θt m; XBt0 m )}. For each local iteration, each party m updates θm by computing the stochastic partial derivatives m FB(ˆΦt m; y Bt0) and applying a gradient step with step size ηt0 (lines 11-14). A key difference of C-VFL from previous VFL algorithms Communication-Efficient Learning with Vertically Partitioned Data is the support of a server model with trainable parameters, allowing for arbitrary fusion networks. To support such a model with multiple local iterations, the server model parameters are shared with the parties. Also note that the same mini-batch is used for all Q local iterations, thus communication is only required every Q iterations. Therefore, without any compression, the total communication cost is O(R M (B P m Pm + |θ0|)) for R global rounds. Our compression technique replaces Pm and |θ0| with smaller values based on the compression factor. For cases where embeddings, the batch size, and the server model are large, this reduction can greatly decrease the communication cost. Privacy. We now discuss privacy-preserving mechanisms for C-VFL. In HFL settings, model update or gradient information is shared in messages. It has been shown that gradients can leak information about the raw data (Phong et al., 2018; Geiping et al., 2020). However in C-VFL, parties only share embeddings and can only calculate the partial derivatives associated with the server model and their local models. Commonly proposed HFL gradient attacks cannot be performed on C-VFL. Embeddings may be vulnerable to model inversion attacks (Mahendran & Vedaldi, 2015), which are methods by which an attacker can recover raw input to a model using the embedding output and black-box access to the model. One can protect against such an attack using homomorphic encryption (Cheng et al., 2021; Hardy et al., 2017) or secure multi-party computation (Gu et al., 2021). Alternatively, if the input to the server model is the sum of party embeddings, then secure aggregation methods (Bonawitz et al., 2016) can be applied. Several works have proposed privacy-perserving methods for VFL settings (Cheng et al., 2021; C atak, 2015; Zheng et al., 2022) that are compatible with the C-VFL algorithm. Note that C-VFL assumes all parties have access to the labels. For low-risk scenarios, such as predicting credit score, labels may not need to be private among the parties. 
In cases where labels are private, one can augment C-VFL to apply the method in (Liu et al., 2019) for gradient calculation without the need for sharing labels. Our analysis in Section 4 would still hold in this case, and the additional communication is reduced by the use of message compression. 4. Analysis In this section, we discuss our analytical approach and present our theoretical results. We first define the compression error associated with Cm( ): Definition 4.1. Compression Error: Let vectors ϵxi m for m = 0, . . . , M, be the compression errors of Cm( ) on a data sample xi: ϵxi m := Cm(hm(θm; xi)) hm(θm; xi). Let ϵt0 m be the Pm B matrix with ϵxi m for all data samples xi in mini-batch Bt0 as the columns. We denote the expected squared message compression error from party m at round t0 as Et0 m := E ϵt0 m 2 F. Let ˆG t be the stacked partial derivatives at iteration t: ˆG t := [( 0FB(ˆΦt 0; y Bt0))T , . . . , ( MFB(ˆΦt M; y Bt0))T ]T . The model Θ evolves as: Θt+1 = Θt ηt0 ˆG t. (5) We note the reuse of the mini-batch of Bt0 for Q iterations in this recursion. This indicates that the stochastic gradients are not unbiased during local iterations t0 +1 t t0 +Q 1. Using conditional expectation, we can apply Assumption 2 to the gradient calculated at iteration t0 when there is no compression error. We define Φt0 to be the set of embeddings that would be received by each party at iteration t0 if no compression error were applied: Φt0 := {θt0 0 , h1(θt0 1 ), . . . , h M(θt0 M)}. (6) We let Φt0 m be the set of embeddings from parties j = m, and let Φt m := {Φt0 m; hm(θt m; XBt0 m )}. Then, if we take expectation over Bt0 conditioned on previous global models Θt up to t0: EBt0[ m FB(Φt0 m) | {Θτ}t0 τ=0] = m F(Φt0 m). (7) With the help of (7), we can prove convergence by bounding the difference between the gradient at the start of each global round and those calculated during local iterations (see the proof of Lemma 2 in Appendix A.2 for details). To account for compression error, using the chain rule and Taylor series expansion, we obtain: Lemma 1. Under Assumptions 4-5, the norm of the difference between the objective function value with compressed and uncompressed embeddings is bounded as: E m FB(ˆΦt m) m FB(Φt m) 2 H2 m G2 m j=0,j =m Et0 j . The proof of Lemma 1 is given in Appendix A.2. Using Lemma 1, we can bound the effect of compression error on convergence. We present our main theoretical results. All proofs are provided in Appendix A. Theorem 4.2. Convergence with fixed step size: Under Assumptions 1-5, if ηt0 = η for all iterations and satisfies ηt0 1 16Q max{L,maxm Lm}, then the average squared Communication-Efficient Learning with Vertically Partitioned Data Table 1. Choice of common compressor parameters to achieve a convergence rate of O(1/ T). Pm is the size of the m-th embedding. In scalar quantization, we let there be 2q quantization levels, and let hmax and hmin be respectively the maximum and minimum components in hm(θt m; xi m) for all iterations t, parties m, and xi m. We let V be the size of the lattice cell in vector quantization. We let k be the number of parameters sent in an embedding after top-k sparsification, and ( h 2)max be the maximum value of hm(θt m; xi m) 2 for all iterations t, parties m, and xi m. 
Scalar Quantization Vector Quantization Top-k Sparsification Parameter Choice q = Ω log2 BPm(hmax hmin)2 T V = O 1 BPm k = Ω Pm Pm B( h 2)max Compression Error Et0 m BPm (hmax hmin)2 12 2 2q Et0 m V BPm 24 Et0 m B(1 k Pm )( h 2)max gradient over R global rounds of Algorithm 1 is bounded: t0=0 E h F(Θt0) 2i 4 F(Θ0) E F(ΘT ) m=0 H2 m G2 m j=0,j =m Et0 j . (8) The first term in (8) is based on the difference between the initial model and final model of the algorithm. The second term is the error associated with the variance of the stochastic gradients and the Lipschitz constants L and Lm s. The third term relates to the average compression error over all iterations. The larger the error introduced by a compressor, the larger the convergence error is. We note that setting Et0 j = 0 for all parties and iterations provides an error bound on VFL without compression and is an improvement over the bound in (Liu et al., 2019) in terms of Q, M, and B. The second and third terms include a coefficient relating to local iterations. As the number of local iterations Q increases, the convergence error increases. However, increasing Q also has the effect of reducing the number of global rounds. Thus, it may be beneficial to have Q > 1 in practice. We explore this more in experiments in Section 5. The second and third terms scale with M, the number of parties. However, VFL scenarios typically have a small number of parties (Kairouz et al., 2021), and thus M plays a small role in convergence error. We note that when M = 1 and Q = 1, Theorem 4.2 applies to Split Learning (Gupta & Raskar, 2018) when only uploads to the server are compressed. Remark 4.3. Let E = 1 R PR 1 t0=0 PM m=0 Et0 m. If ηt0 = 1 T for all global rounds t0, for Q and B independent of T, then t0=0 E F(Θt0) 2 = O 1 This indicates that if E = O( 1 T ) then we can achieve a con- vergence rate of O( 1 T ). Informally, this means that C-VFL can afford compression error and not worsen asymptotic convergence when this condition is satisfied. We discuss how this affects commonly used compressors in practice later in the section. We consider a diminishing step size in the following: Theorem 4.4. Convergence with diminishing step size: Under Assumptions 1-5, if 0 < ηt0 < 1 satisfies ηt0 1 16Q max{L,maxm Lm}, then the minimum squared gradient over R global rounds of Algorithm 1 is bounded: min t0=0,...,R 1 E h F(Θt0) 2i = 1 PR 1 t0=0 ηt0 + PR 1 t0=0(ηt0)2 PT 1 t=0 ηt0 + PR 1 t0=0 PM m=0 ηt0Et0 m PR 1 t0=0 ηt0 If ηt0 and Et0 m satisfy P t0=0 ηt0 = , P t0=0(ηt0)2 < , and P t0=0 PM m=0 ηt0Et0 m < , then mint0=0,...,R 1 E h F(Θt0) 2i 0 as R . According to Theorem 4.4, the product of the step size and the compression error must be summable over all iterations. In the next subsection, we discuss how to choose common compressor parameters to ensure this property is satisified. We also see in Section 5 that good results can be achieved empirically without diminishing the step size or compression error. Common Compressors. In this section, we show how to choose common compressor parameters to achieve a convergence rate of O( 1 T ) in the context of Theorem 4.2, and guarantee convergence in the context of Theorem 4.4. We analyze three common compressors: a uniform scalar quantizer (Bennett, 1948), a 2-dimensional hexagonal lattice quantizer (Zamir & Feder, 1996), and top-k sparsification (Lin et al., 2018). For uniform scalar quantizer, we let there be 2q quantization levels. For the lattice vector quantizer, we let V be the volume of each lattice cell. 
For top-k sparsification, we let k be the number of embedding components sent in a message. In Table 1, we present the choice of compressor parameters needed to achieve a convergence rate of O(1/√T) in the context of Theorem 4.2. We show how we calculate these bounds in Appendix B and provide some implementation details for their use. We can also use Table 1 to choose compressor parameters that ensure convergence in the context of Theorem 4.4. Let ηt0 = O(1/t0), where t0 is the current round. Then setting T = t0 in Table 1 provides a choice of compression parameters at each iteration to ensure the compression error diminishes at a rate of O(1/√t0), guaranteeing convergence. Diminishing compression error can be achieved by increasing the number of quantization levels, decreasing the volume of lattice cells, or increasing the number of components sent in a message.

5. Experiments

We present experiments to examine the performance of C-VFL in practice. The goal of our experiments is to examine the effects that different compression techniques have on training, and to investigate the accuracy/communication trade-off empirically. We run experiments on three datasets: the MIMIC-III dataset (Johnson et al., 2016), the ModelNet10 dataset (Wu et al., 2015), and the CIFAR-10 dataset (Krizhevsky et al., 2009). We provide more details on the datasets and training procedure in Appendix C, as well as additional plots and experiments in Appendix D.

MIMIC-III: MIMIC-III is an anonymized hospital patient time-series dataset. In MIMIC-III, the task is binary classification to predict in-hospital mortality. We train with a set of 4 parties, each storing 19 of the 76 features. Each party trains an LSTM, and the server trains two fully-connected layers. We use a fixed step size of 0.01, a batch size of 1000, and train for 1000 epochs.

CIFAR-10: CIFAR-10 is an image dataset for object classification. We train with a set of 4 parties, each storing a different quadrant of every image. Each party trains ResNet18, and the server trains a fully-connected layer. We use a fixed step size of 0.0001 and a batch size of 100, and train for 200 epochs.

ModelNet10: ModelNet10 is a set of CAD models, each with images of 12 different camera views. The task of ModelNet10 is classification of images into 10 object classes. We run experiments with both a set of 4 and 12 parties, where parties receive 3 or 1 view(s) of each CAD model, respectively. Each party's network consists of two convolutional layers and a fully-connected layer, and the server model consists of a fully-connected layer. We use a fixed step size of 0.001 and a batch size of 64, and train for 100 epochs.

We consider the three compressors discussed in Section 4: a uniform scalar quantizer, a 2-dimensional hexagonal lattice (vector) quantizer, and top-k sparsification.

Figure 3. C-VFL when compressing to 2 bits per component. Panels: (a) MIMIC-III by epochs, (b) MIMIC-III by cost, (c) CIFAR-10 by epochs, (d) CIFAR-10 by cost, (e) ModelNet10 by epochs, (f) ModelNet10 by cost. The solid lines are the mean of 5 runs, while the shaded region represents the standard deviation. We show test F1-Score on the MIMIC-III dataset and test accuracy on the CIFAR-10 and ModelNet10 datasets, plotted by epochs and by communication cost.

For both quantizers, the embedding values need to be bounded. In the
case of the models used for MIMIC-III and CIFAR-10, the embedding values are already bounded, but the CNN used for Model Net10 may have unbounded embedding values. We scale embedding values for Model Net10 to the range [0, 1]. We apply subtractive dithering to both the scalar quantizer (Wannamaker, 1997) and vector quantizer (Shlezinger et al., 2021). In our experiments, each embedding component is a 32-bit float. Let b be the bits per component we compress to. For the scalar quantizer, this means there are 2b quantization levels. For the 2-D vector quantizer, this means there are 22b vectors in the codebook. The volume V of the vector quantizer is a function of the number of codebook vectors. For top-k sparsification, k = Pm b 32 as we use 32-bit components. We train using C-VFL and consider cases where b = 2, 3, and 4. We compare with a case where b = 32. This corresponds to a standard VFL algorithm without embedding compression (Liu et al., 2019), acting as a baseline for accuracy. In Figure 3, we plot the test F1-Score and test accuracy for Communication-Efficient Learning with Vertically Partitioned Data Table 2. MIMIC-III maximum F1-Score reached during training, and communication cost to reach a target test F1-Score of 0.4. Value shown is the mean of 5 runs, the standard deviation. In these experiments, Q = 10 and M = 4. Compressor Max F1-Score Cost (MB) Reached Target = 0.4 None b = 32 0.448 0.010 3830.0 558.2 Scalar b = 2 0.441 0.018 233.1 28.7 Vector b = 2 0.451 0.021 236.1 17.9 Top-k b = 2 0.431 0.016 309.8 93.6 Scalar b = 3 0.446 0.011 343.1 18.8 Vector b = 3 0.455 0.020 330.5 10.6 Top-k b = 3 0.435 0.030 470.7 116.8 Scalar b = 4 0.451 0.020 456.0 87.8 Vector b = 4 0.446 0.017 446.5 21.3 Top-k b = 4 0.453 0.014 519.1 150.4 MIMIC-III, CIFAR-10, and Model Net10 when training with b = 2. We use F1-Score for MIMIC-III as the in-hospital mortality prediction task has high class imbalance; most people in the dataset did not die in the hospital. For these experiments, we let M = 4 for Model Net10. The solid line in each plot represents the average accuracy over five runs, while the shaded regions represent one standard deviation. In Figures 3a, 3c, and 3e, we plot by the number of training epochs. We can see in all cases, although convergence can be a bit slower, training with compressed embeddings still reaches similar accuracy to no compression. In Figures 3b, 3d, and 3f, we plot by the communication cost in bytes. The cost of communication includes both the upload of (compressed) embeddings to the server and download of embeddings and server model to all parties. We can see that by compressing embeddings, we can reach higher accuracy with significantly less communication cost. In all datasets, the compressors reach similar accuracy to each other, though top-k sparsification performs slightly worse than the others on MIMIC-III, while vector quantization performs the best in both on CIFAR-10 and Model Net10. In Tables 2, 3 and 4, we show the maximum test accuracy reached during training and the communication cost to reach a target accuracy for MIMIC-III, CIFAR-10, and Model Net10. We show results for all three compressors with b = 2, 3, and 4 bits per component, as well as the baseline of b = 32. For the MIMIC-III dataset, we show the maximum test F1-Score reached and the total communication cost of reaching an F1-Score of 0.4. The maximum F1-Score for each case is within a standard deviation of each other. 
However, the cost to reach target score is much smaller as the value of b decreases for all compressors. We can see that when b = 2, we can achieve over 90% communication cost reduction over no compression to reach a target F1-Score. Table 3. CIFAR-10 maximum test accuracy reached during training, and communication cost to reach a target accuracy of 70%. Value shown is the mean of 5 runs, the standard deviation. A indicates that the target was not reached during training. We let Q = 10 and M = 4. Compressor Max Accuracy Cost (GB) Reached Target = 70% None b = 32 73.18% 0.44% 7.69 0.35 Scalar b = 2 65.16% 1.85% Vector b = 2 71.43% 0.47% 0.68 0.06 Top-k b = 2 66.02% 2.24% Scalar b = 3 71.49% 1.05% 1.22 0.17 Vector b = 3 72.50% 0.40% 0.81 0.05 Top-k b = 3 71.56% 0.81% 1.24 0.22 Scalar b = 4 71.80% 1.18% 1.72 0.26 Vector b = 4 73.17% 0.39% 0.98 0.08 Top-k b = 4 72.03% 1.77% 1.43 0.26 Table 4. Model Net10 maximum test accuracy reached during training, and communication cost to reach a target accuracy of 75%. Value shown is the mean of 5 runs, the standard deviation. We let Q = 10 and M = 4. Compressor Max Accuracy Cost (MB) Reached Target = 75% None b = 32 85.68% 1.57% 9604.80 2933.40 Scalar b = 2 76.94% 5.87% 1932.00 674.30 Vector b = 2 84.80% 2.58% 593.40 170.98 Top-k b = 2 79.91% 2.86% 1317.90 222.95 Scalar b = 3 81.32% 1.61% 1738.80 254.79 Vector b = 3 85.66% 1.36% 900.45 275.01 Top-k b = 3 81.63% 1.24% 1593.90 225.34 Scalar b = 4 81.19% 1.88% 2194.20 266.88 Vector b = 4 85.77% 1.69% 1200.60 366.68 Top-k b = 4 83.50% 1.21% 1821.60 241.40 For the CIFAR-10 and Model Net10 datasets, Tables 3 and 4 show the maximum test accuracy reached and the total communication cost of reaching a target accuracy. We can see that, for both datasets, vector quantization tends to outperform both scalar quantization and top-k quantization. Vector quantization benefits from considering components jointly, and thus can have better reconstruction quality than scalar quantization and top-k sparsification (Woods, 2006). In Table 5, we consider the communication/computation tradeoff of local iterations. We show how the number of local iterations affects the time to reach a target F1-Score in the MIMIC-III dataset. We train C-VFL with vector quantization b = 3 and set the local iterations Q to 1, 10, and 25. Note that the Q = 1 case corresponds to adding embedding compression to previously proposed VFL algorithms that do not have multiple local iterations (Hu et al., 2019; Romanini Communication-Efficient Learning with Vertically Partitioned Data Table 5. MIMIC-III time in seconds to reach a target F1-Score for different local iterations Q and communication latency tc with vector quantization and b = 3. Value shown is the mean of 5 runs, one standard deviation. tc Time to Reach Target F1-Score 0.45 Q = 1 Q = 10 Q = 25 1 694.53 150.75 470.86 235.35 445.21 51.44 10 1262.78 274.10 512.82 256.32 461.17 53.29 50 3788.32 822.30 699.30 349.53 532.12 61.49 200 13259.14 2878.04 1398.60 699.05 798.19 92.23 (a) Parties M = 4 (b) Parties M = 12 Figure 4. Communication cost of training on Model Net10 with vector quantization. The solid lines are the mean of 5 runs, and the shaded region represents one standard deviation. et al., 2021). We simulate a scenario where computation time for training a mini-batch of data at each party takes 10 ms, and communication of embeddings takes a total of 1, 10, 50, and 200 ms roundtrip. 
These different communication latencies correspond to the distance between the parties and the server: within the same cluster, on the same local network, within the same region, and across the globe. According to Theorem 4.2, increasing the number of local iterations Q increases convergence error. However, the target test accuracy is reached within less time when Q increases. The improvement over Q = 1 local iterations increases as the communication latency increases. In systems where communication latency is high, it may be beneficial to increase the number of local iterations. The choice of Q will depend on the accuracy requirements of the given prediction task and the time constraints on the prediction problem. Finally, in Figure 4, we plot the test accuracy of Model Net10 against the communication cost when using vector quantization with b = 2, 3, 4, and 32. We include plots for 4 and 12 parties. We note that changing the number of parties changes the global model structure Θ as well. We can see in both cases that smaller values of b reach higher test accuracies at lower communication cost. The total communication cost is larger with 12 parties, but the impact of increasing compression is similar for both M = 4 and M = 12. 6. Conclusion We proposed C-VFL, a distributed communication-efficient algorithm for training a model over vertically partitioned data. We proved convergence of the algorithm at a rate of O( 1 T ), and we showed experimentally that communication cost could be reduced by over 90% without a significant decrease in accuracy. For future work, we seek to relax our bounded gradient assumption and explore the effect of adaptive compressors. Acknowledgements This work was supported by the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons), and the National Science Foundation under grant CNS-1553340. Bennett, W. R. Spectra of quantized signals. Bell Syst. Tech. J., 27(3):446 472, 1948. Bernstein, J., Wang, Y., Azizzadenesheli, K., and Anandkumar, A. SIGNSGD: compressed optimisation for nonconvex problems. Proc. Int. Conf. on Machine Learn., 2018. Bonawitz, K. A., Ivanov, V., Kreuter, B., Marcedone, A., Mc Mahan, H. B., Patel, S., Ramage, D., Segal, A., and Seth, K. Practical secure aggregation for federated learning on user-held data. ar Xiv:1611.04482, 2016. Bonawitz, K. A., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., Kiddon, C., Koneˇcn y, J., Mazzocchi, S., Mc Mahan, B., Overveldt, T. V., Petrou, D., Ramage, D., and Roselander, J. Towards federated learning at scale: System design. Proc. of Machine Learn. Sys., 2019. Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223 311, 2018. C atak, F. O. Secure multi-party computation based privacy preserving extreme learning machine algorithm over vertically distributed data. Proc. Adv. Neural Inf. Process. Syst., 9490:337 345, 2015. Ceballos, I., Sharma, V., Mugica, E., Singh, A., Roman, A., Vepakomma, P., and Raskar, R. Splitnn-driven vertical partitioning. ar Xiv:2008.04137, 2020. Cha, D., Sung, M., and Park, Y.-R. Implementing vertical federated learning using autoencoders: Practical application, generalizability, and utility study. JMIR Medical Informatics, 9(6):e26598, 2021. Communication-Efficient Learning with Vertically Partitioned Data Chen, T., Jin, X., Sun, Y., and Yin, W. VAFL: a method of vertical asynchronous federated learning. ar Xiv:2007.06081, 2020. 
Cheng, K., Fan, T., Jin, Y., Liu, Y., Chen, T., Papadopoulos, D., and Yang, Q. Secureboost: A lossless federated learning framework. IEEE Intell. Syst., 36(6):87 98, 2021. Das, A. and Patterson, S. Multi-tier federated learning for vertically partitioned data. Proc. IEEE Int. Conf. on Acoust., Speech, and Signal Process., pp. 3100 3104, 2021. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009. Feng, S. and Yu, H. Multi-participant multi-class vertical federated learning. ar Xiv:2001.11154, 2020. Geiping, J., Bauermeister, H., Dr oge, H., and Moeller, M. Inverting gradients - how easy is it to break privacy in federated learning? Adv. Neural Inf. Process. Syst., 2020. Gu, B., Xu, A., Huo, Z., Deng, C., and Huang, H. Privacypreserving asynchronous vertical federated learning algorithms for multiparty collaborative learning. IEEE Trans. on Neural Netw. Learn. Syst., pp. 1 13, 2021. doi: 10.1109/TNNLS.2021.3072238. Gu, Y., Lyu, X., Sun, W., Li, W., Chen, S., Li, X., and Marsic, I. Mutual correlation attentive factors in dyadic fusion networks for speech emotion recognition. Proc. ACM Int. Conf. on Multimedia, 2019. Gupta, O. and Raskar, R. Distributed learning of deep neural network over multiple agents. J. Netw. Comput. Appl., 116:1 8, 2018. Han, D.-J., Bhatti, H. I., Lee, J., and Moon, J. Accelerating federated learning with split learning on locally generated losses. In ICML 2021 Workshop on Federated Learning for User Privacy and Data Confidentiality, 2021a. Han, W., Chen, H., and Poria, S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis. Proc. 2020 Conf. Empir. Methods in Nat. Lang. Process., pp. 9180 9192, 2021b. Hardy, S., Henecka, W., Ivey-Law, H., Nock, R., Patrini, G., Smith, G., and Thorne, B. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. ar Xiv:1711.10677, 2017. He, C., Annavaram, M., and Avestimehr, S. Group knowledge transfer: Federated learning of large cnns at the edge. Proc. Adv. Neural Inf. Process. Syst., 2020. Hu, Y., Niu, D., Yang, J., and Zhou, S. FDML: A collaborative machine learning framework for distributed features. Proc. ACM Int. Conf. Knowl. Discov. Data Min., pp. 2232 2240, 2019. Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Anthony Celi, L., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Nature, 2016. Kairouz, P., Mc Mahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K. A., Charles, Z., Cormode, G., Cummings, R., D Oliveira, R. G. L., Eichner, H., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Gasc on, A., Ghazi, B., Gibbons, P. B., Gruteser, M., Harchaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Koneˇcn y, J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P., Mohri, M., Nock, R., Ozg ur, A., Pagh, R., Qi, H., Ramage, D., Raskar, R., Raykova, M., Song, D., Song, W., Stich, S. U., Sun, Z., Suresh, A. T., Tram er, F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F. X., Yu, H., and Zhao, S. Advances and open problems in federated learning. Found. Trends Mach. Learn., 14(1-2):1 210, 2021. doi: 10.1561/2200000083. Karimireddy, S. P., Rebjock, Q., Stich, S. U., and Jaggi, M. 
Error feedback fixes sign SGD and other gradient compression schemes. Proc. Int. Conf. on Machine Learn., 2019. Koehrsen, W. Book recommendation system. https://github.com/Will Koehrsen/wikipedia-datascience/blob/master/notebooks/Book2018. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009. Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated optimization in heterogeneous networks. Proc. of Machine Learn. Sys., 2020. Lim, W. Y. B., Luong, N. C., Hoang, D. T., Jiao, Y., Liang, Y., Yang, Q., Niyato, D., and Miao, C. Federated learning in mobile edge networks: A comprehensive survey. IEEE Commun. Surveys Tuts., 2020. Lin, T., Stich, S. U., Patel, K. K., and Jaggi, M. Don t use large mini-batches, use local SGD. Proc. Int. Conf. on Learn. Representations, 2020. Lin, Y., Han, S., Mao, H., Wang, Y., and Dally, B. Deep gradient compression: Reducing the communication bandwidth for distributed training. Proc. Int. Conf. on Learn. Representations, 2018. Communication-Efficient Learning with Vertically Partitioned Data Liu, L., Zhang, J., Song, S., and Letaief, K. B. Client-edgecloud hierarchical federated learning. Proc. IEEE Int. Conf. on Comm., 2020. Liu, Y., Kang, Y., Zhang, X., Li, L., Cheng, Y., Chen, T., Hong, M., and Yang, Q. A communication efficient vertical federated learning framework. Adv. Neural Inf. Process. Syst., Workshop on Federated Learning for Data Privacy and Confidentiality, 2019. Mahendran, A. and Vedaldi, A. Understanding deep image representations by inverting them. Proc. IEEE Int. Conf. Comput. Vis., pp. 5188 5196, 2015. Mc Mahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. Proc. 20th Int. Conf. on Artif. Intell., pp. 1273 1282, 2017. Moritz, P., Nishihara, R., Stoica, I., and Jordan, M. I. Sparknet: Training deep networks in spark. Proc. Int. Conf. on Learn. Representations, 2016. Nguyen, L. M., Nguyen, P. H., van Dijk, M., Richt arik, P., Scheinberg, K., and Tak ac, M. SGD and Hogwild! convergence without the bounded gradients assumption. Proc. Int. Conf. on Machine Learn., 80:3747 3755, 2018. Nie, W., Liang, Q., Wang, Y., Wei, X., and Su, Y. MMFN: multimodal information fusion networks for 3d model classification and retrieval. ACM Trans. on Multimedia Computing, Communications, and Applications, 2021. Phong, L. T., Aono, Y., Hayashi, T., Wang, L., and Moriai, S. Privacy-preserving deep learning via additively homomorphic encryption. IEEE Trans. Inf. Forensics Security, 13(5):1333 1345, 2018. Richt arik, P. and Tak ac, M. Parallel coordinate descent methods for big data optimization. Math. Program., 156 (1-2):433 484, 2016. Rieke, N., Hancox, J., Li, W., Milletar ı, F., Roth, H. R., Albarqouni, S., Bakas, S., Galtier, M. N., Landman, B. A., Maier-Hein, K., Ourselin, S., Sheller, M., Summers, R. M., Trask, A., Xu, D., Baust, M., and Cardoso, M. J. Digital Medicine, 2020. Romanini, D., Hall, A. J., Papadopoulos, P., Titcombe, T., Ismail, A., Cebere, T., Sandmann, R., Roehm, R., and Hoeh, M. A. Py Vertical: A vertical federated learning framework for multi-headed Split NN. Int. Conf. Learn. Representations, Workshop on Distributed and Private Machine Learn., 2021. Shi, S., Zhao, K., Wang, Q., Tang, Z., and Chu, X. A convergence analysis of distributed SGD with communicationefficient gradient sparsification. Proc. Int. Joint Conf. on Artif. Intell., 2019. Shlezinger, N., Chen, M., Eldar, Y. C., Poor, H. 
V., and Cui, S. Uveqfed: Universal vector quantization for federated learning. IEEE Trans. Signal Process., 69:500 514, 2021. Stich, S. U., Cordonnier, J., and Jaggi, M. Sparsified SGD with memory. Adv. Neural Inf. Process. Syst., 2018. Tsitsiklis, J., Bertsekas, D., and Athans, M. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans. Autom. Control, 31(9): 803 812, 1986. Wang, S., Tuor, T., Salonidis, T., Leung, K. K., Makaya, C., He, T., and Chan, K. Adaptive federated learning in resource constrained edge computing systems. IEEE J. Sel. Areas Commun., 37(6):1205 1221, 2019. Wannamaker, R. A. The Theory of Dithered Quantization. Ph D thesis, 1997. Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. Terngrad: Ternary gradients to reduce communication in distributed deep learning. Adv. Neural Inf. Process. Syst., 2017. Woods, J. W. Multidimensional signal, image, and video processing and coding. Elsevier, 2006. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., and Xiao, J. 3D shapenets: A deep representation for volumetric shapes. Proc. IEEE Int. Conf. Comput. Vis., pp. 1912 1920, 2015. Yang, Q., Liu, Y., Chen, T., and Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol., 10(2):12:1 12:19, 2019. Zamir, R. and Feder, M. On lattice quantization noise. IEEE Trans. Inf. Theory, 42(4):1152 1159, 1996. Zhang, X., Yin, W., Hong, M., and Chen, T. Hybrid federated learning: Algorithms and implementation. ar Xiv:2012.12420, 2020. Zheng, F., Chen, C., Zheng, X., and Zhu, M. Towards secure and practical machine learning via secret sharing and random permutation. Knowl. Based Syst., 245:108609, 2022. Communication-Efficient Learning with Vertically Partitioned Data A. Proofs of Theorems 4.2 and 4.4 In this section, we provide the proofs for Theorems 4.2 and 4.4. A.1. Additional Notation Before starting the proofs, we define some additional notation to be used throughout. At each iteration t, each party m trains with the embeddings ˆΦt m. This is equivalent to the party training directly with the models θt m and θt0 j for all j = m, where t0 is the last communication iteration when party m received the embeddings. We define: ( θt j m = j θt0 j otherwise (A.1) to represent party m s view of party j s model at iteration t. We define the column vector Γt m = [(γt m,0)T ; . . . ; (γt m,M)T ]T to be party m s view of the global model at iteration t. We introduce some notation to help with bounding the error introduced by compression. We define ˆFB(Γt m) to be the stochastic loss with compression error for a randomly selected mini-batch B calculated by party m at iteration t: ˆFB(Γt m) := FB θt0 0 + ϵt0 0 , h1(θt0 1 ; XBt0 1 ) + ϵt0 1 , . . . , hm(θt m; XBt0 m ), . . . , h M(θt0 M; XBt0 M ) + ϵt0 M . (A.2) Recall the recursion over the global model Θ: Θt+1 = Θt ηt0 ˆG t. (A.3) We can equivalently define ˆG t as follows: ˆG t = h ( 0 ˆFB(Γt 0))T , . . . , ( M ˆFB(Γt M))T i T . (A.4) Note that the compression error in ˆF( ) is applied to the embeddings, and not the model parameters. Thus, F( ) and ˆF( ) are different functions. In several parts of the proof, we need to bound the compression error in m ˆFB(Γt m). For our analysis, we redefine the set of embeddings for a mini-batch B of size B from party m as a matrix: hm(θm; XB m) := h hm(θm; x B1 m ), . . . , hm(θm; x BB m ) i . 
(A.5) hm(θm; XB m) is a matrix with dimensions Pm B where each column is the embedding from party m for a single sample in the mini-batch. Let P = PM m=0 Pm be the sum of the sizes of all embeddings. We redefine the set of embeddings used by a party m to calculate its gradient without compression error as a matrix: ˆΦt m = h (θt0 0 )T , (h1(θt0 1 ; XBt0 1 ))T , . . . , (hm(θt m; XBt0 m ))T , . . . , (h M(θt0 M; XBt0 M ))T i T . (A.6) ˆΦt m is a matrix with dimensions P B where each column is the concatenation of embeddings for all parties for a single sample in the mini-batch. Recall the set of compression error vectors for a mini-batch B of size B from party m is the matrix: ϵt0 m := h ϵB1 m , . . . , ϵBB m i . (A.7) ϵt0 m is a matrix of dimensions Pm B where each column is the compression error from party m for a single sample in the mini-batch. Communication-Efficient Learning with Vertically Partitioned Data We define the compression error on each embedding used in party m s gradient calculation at iteration t: Et0 m = h (ϵt0 0 )T , . . . , (ϵt0 m 1)T , 0T , (ϵt0 m 1)T , . . . , (ϵt0 M)T i T . (A.8) Et0 m is a matrix with dimensions P B where each column is the concatenation of compression error on embeddings for all parties for a single sample in the mini-batch. With some abuse of notation, we define: m FB(Φt m + Et0 m) := m ˆFB(Γt m). (A.9) Note that we can apply the chain rule to m ˆFB(Γt m): m ˆFB(Γt m) = θmhm(θt m) hm(θm)FB(Φt m + Et0 m). (A.10) With this expansion, we can now apply Taylor series expansion to hm(θm)FB(Φt m + Et0 m) around the point Φt m: hm(θm)FB(Φt m + Et0 m) = hm(θm)FB(Φt m) + 2 hm(θm)FB(Φt m)t Et0 m + . . . (A.11) We let the infinite sum of all terms in this Taylor series from the second partial derivatives and up be denoted as Rm 0 : Rm 0 (Φt m + Et0 m) := 2 hm(θm)FB(Φt m)T Et0 m + . . . (A.12) Note that all compression error is in Rm 0 (Φt m + Et0 m). Presented in Section A.2, the proof of Lemma 1 shows how we can bound Rm 0 (Φt m + Et0 m), bounding the compression error in m ˆFB(Γt m). Let Et0 = EBt0[ | {Θτ}t0 τ=0]. Note that by Assumption 2, Et0 Gt0 = F(Θt0) as when there is no compression error in the gradients G, they are equal to the full-batch gradient in expectation when conditioned on the model parameters up to the iteration t0. However, this is not true for iterations t0 + 1 t t0 + Q 1, as we reuse the mini-batch Bt0 in these local iterations. We upper bound the error introduced by stochastic gradients calculated during local iterations in Lemma 2. A.2. Supporting Lemmas Next, we provide supporting lemmas and their proofs. We restate Lemma 1 here: Lemma 1. Under Assumptions 4-5, the norm of the difference between the objective function value with and without error is bounded: E m FB(ˆΦt m) m FB(Φt m) 2 H2 m G2 m j=0,j =m Et0 j . (A.13) To prove Lemma 1, we first prove the following lemma: Lemma 1 . Under Assumptions 4-5, the squared norm of the partial derivatives for party m s embedding multiplied by the Taylor series terms Rm 0 (Φt m + Et0 m) is bounded: θmhm(θt m)Rm 0 (Φt m + Et0 m) 2 H2 m G2 m Et0 m F . (A.14) θmhm(θt m)Rm 0 (Φt m + Et0 m) 2 θmhm(θt m) 2 F Rm 0 (Φt m + Et0 m) 2 F (A.15) H2 m θmhm(θt m) 2 F Et0 m 2 F (A.16) where (A.16) follows from Assumption 4 and the following property of the Taylor series approximation error: Rm 0 (Φt m + Et0 m) F Hm Et0 m F . (A.17) Communication-Efficient Learning with Vertically Partitioned Data Applying Assumption 5, we have: θmhm(θt m)Rm 0 (Φt m + Et0 m) 2 H2 m G2 m Et0 m 2 F . 
(A.18) We now prove Lemma 1. Proof. Recall that: m ˆFB(Γt m) = m FB(Φt m + Et0 m) (A.19) = θmhm(θt m) hm(θm)FB(Φt m + Et0 m). (A.20) Next we apply Taylor series expansion as in (A.11): m ˆFB(Γt m) = θmhm(θt m) hm(θm)FB(Φt m) + Rm 0 (Φt m + Et0 m) (A.21) = m FB(Γt m) + θmhm(θt m)Rm 0 (Φt m + Et0 m) (A.22) Rearranging and applying expectation and the squared 2-norm, we can bound further: E m ˆFB(Γt m) m FB(Γt m) 2 = E θmhm(θt m)Rm 0 (Φt m + Et0 m) 2 (A.23) H2 m G2 m E Et0 m 2 F (A.24) = H2 m G2 m X j =m E ϵt0 j 2 = H2 m G2 m X j =m Et0 j (A.26) where (A.24) follows from Lemma 1 , (A.25) follows from the definition of Et0 m, and (A.26) follows from Definition 4.1. Lemma 2. If ηt0 1 4Q maxm Lm , then under Assumptions 1-5 we can bound the conditional expected squared norm difference of gradients Gt0 and ˆG t for iterations t0 to t0 + Q 1 as follows: t=t0 Et0 ˆG t Gt0 2 16Q3(ηt0)2 M X m=0 L2 m m F(Θt0) 2 + 16Q3(ηt0)2 M X m=0 L2 m σ2 m B m=0 H2 m G2 m Et0 m 2 F . (A.27) Communication-Efficient Learning with Vertically Partitioned Data Et0 ˆG t Gt0 2 = m=0 Et0 m ˆFB(Γt m) m FB(Γt0 m) 2 (A.28) m=0 Et0 m ˆFB(Γt m) ˆFB(Γt 1 m ) + m ˆFB(Γt 1 m ) m FB(Γt0 m) 2 (A.29) m=0 Et0 m ˆFB(Γt m) m ˆFB(Γt 1 m ) 2 m=0 Et0 m ˆFB(Γt 1 m ) m FB(Γt0 m) 2 (A.30) m=0 Et0 h m FB(Γt m) m FB(Γt 1 m ) 2i + 2 (1 + n) m=0 Et0 h θmhm(θt m)Rm 0 (Φt m + Et0 m) θmhm(θt 1 m )Rm 0 (Φt 1 m + Et 1 m ) 2i m=0 Et0 m ˆFB(Γt 1 m ) m FB(Γt0 m) 2 (A.31) m=0 Et0 h m FB(Γt m) m FB(Γt 1 m ) 2i + 8 (1 + n) m=0 H2 m G2 m Et0 m 2 m=0 Et0 m ˆFB(Γt 1 m ) m FB(Γt0 m) 2 (A.32) where (A.30) follows from the fact that (X + Y )2 (1 + n)X2 + (1 + 1 n)Y 2 for some positive n and (A.32) follows from Lemma 1 . Applying Assumption 1 to the first term in (A.30) we have: Et0 ˆG t Gt0 2 2 (1 + n) m=0 L2 m Et0 h Γt m Γt 1 m 2i m=0 Et0 m ˆFB(Γt 1 m ) m FB(Γt0 m) 2 + 8 (1 + n) m=0 H2 m G2 m Et0 m 2 (A.33) = 2(ηt0)2 (1 + n) m=0 L2 m Et0 m ˆFB(Γt 1 m ) 2 m=0 Et0 m ˆFB(Γt 1 m ) m FB(Γt0 m) 2 + 8 (1 + n) m=0 H2 m G2 m Et0 m 2 (A.34) where (A.34) follows from the update rule Γt m = Γt 1 m ηt0 m ˆFB(Γt 1 m ). Communication-Efficient Learning with Vertically Partitioned Data Bounding further: Et0 ˆG t Gt0 2 2(ηt0)2 (1 + n) m=0 L2 m Et0 m ˆFB(Γt 1 m ) m FB(Γt0 m) + m FB(Γt0 m) 2 m=0 Et0 m ˆFB(Γt 1 m ) m FB(Γt0 m) 2 + 8 (1 + n) m=0 H2 m G2 m Et0 m 2 (A.35) 4(ηt0)2 (1 + n) m=0 L2 m Et0 m ˆFB(Γt 1 m ) m FB(Γt0 m) 2 + 4(ηt0)2 (1 + n) m=0 L2 m Et0 h m FB(Γt0 m) 2i m=0 Et0 m ˆFB(Γt 1 m ) m FB(Γt0 m) 2 + 8 (1 + n) m=0 H2 m G2 m Et0 m 2 (A.36) 4(ηt0)2 (1 + n) L2 m + 1 + 1 Et0 m ˆFB(Γt 1 m ) m FB(Γt0 m) 2 + 4(ηt0)2 (1 + n) m=0 L2 m Et0 h m FB(Γt0 m) 2i + 8 (1 + n) m=0 H2 m G2 m Et0 m 2 . (A.37) Let n = Q. We simplify (A.37) further: Et0 ˆG t Gt0 2 4(ηt0)2 (1 + Q) L2 m + 1 + 1 Et0 m ˆFB(Γt 1 m ) m FB(Γt0 m) 2 + 4(ηt0)2 (1 + Q) m=0 L2 m Et0 h m FB(Γt0 m) 2i + 8 (1 + Q) m=0 H2 m G2 m Et0 m 2 F . (A.38) Communication-Efficient Learning with Vertically Partitioned Data Let ηt0 1 4Q maxm Lm . We bound (A.38) as follows: Et0 ˆG t Gt0 2 (1 + Q) 4Q2 + 1 + 1 m=0 Et0 m ˆFB(Γt 1 m ) m FB(Γt0 m) 2 + 4(ηt0)2 (1 + Q) m=0 L2 m Et0 h m FB(Γt0 m) 2i m=0 H2 m G2 m Et0 m 2 F (A.39) m=0 Et0 m ˆFB(Γt 1 m ) m FB(Γt0 m) 2 + 4(ηt0)2 (1 + Q) m=0 L2 m Et0 h m FB(Γt0 m) 2i m=0 H2 m G2 m Et0 m 2 F (A.40) m=0 Et0 m ˆFB(Γt 1 m ) m FB(Γt0 m) 2 + 4(ηt0)2 (1 + Q) m=0 L2 m Et0 h m FB(Γt0 m) 2i m=0 H2 m G2 m Et0 m 2 F . 
(A.41) We define the following notation for simplicity: m=0 Et0 m ˆFB(Γt m) m FB(Γt0 m) 2 (A.42) B0 := 4(ηt0)2 (1 + Q) m=0 L2 m Et0 h m FB(Γt0 m) 2i (A.43) B1 := 8(1 + Q) m=0 H2 m G2 m Et0 m 2 F (A.44) Note that we have shown that At CAt 1 + B0 + B1. Therefore: At0+1 CAt0 + (B0 + B1) (A.46) At0+2 C2At0 + C(B0 + B1) + (B0 + B1) (A.47) At0+3 C3At0 + C2(B0 + B1) + C(B0 + B1) + (B0 + B1) (A.48) ... (A.49) At Ct t0 1At0 + (B0 + B1) k=0 Ck (A.50) = Ct t0 1At0 + (B0 + B1)Ct t0 1 1 C 1 . (A.51) Communication-Efficient Learning with Vertically Partitioned Data We bound the first term in (A.51) by applying Lemma 1: m=0 Et0 m ˆFB(Γt0 m) m FB(Γt0 m) 2 (A.52) m=0 H2 m G2 m Et0 m 2 F . (A.53) Summing over the set of local iterations t0, . . . , t+ 0 , where t+ 0 := t0 + Q 1: t=t0 Ct t0 1At0 = At0 t=t0 Ct t0 1 (A.54) Q Q 1 1 + 2 4QAt0 (A.58) m=0 H2 m G2 m Et0 m 2 F . (A.59) It is left to bound the second term in (A.51) over the set of local iterations t0, . . . , t0 + Q 1. t=t0 (B0 + B1)Ct t0 1 1 t=t0 (B0 + B1)Ct t0 1 1 = (B0 + B1) t=t0 Ct t0 1 Q = (B0 + B1) C 1 Q (A.62) = (B0 + B1) 1 + 2 Q Q 1 1 + 2 = Q(B0 + B1) = Q2(B0 + B1) Q2(B0 + B1) 2Q2(B0 + B1) (A.67) Communication-Efficient Learning with Vertically Partitioned Data Plugging the values for B0 and B1: t=t0 (B0 + B1)Ct t0 1 1 C 1 8Q2(ηt0)2 (1 + Q) m=0 L2 m Et0 h m FB(Γt0 m) 2i + 16Q2(1 + Q) m=0 H2 m G2 m Et0 m 2 F (A.68) Applying Assumption 3 and adding in the first term in (A.51): t=t0 At 8Q2(ηt0)2 (1 + Q) m=0 L2 m m F(Θt0) 2 + 8Q2(ηt0)2 (1 + Q) m=0 L2 m σ2 m B + 4(4Q2(1 + Q) + Q) m=0 H2 m G2 m Et0 m 2 F (A.69) 16Q3(ηt0)2 M X m=0 L2 m m F(Θt0) 2 + 16Q3(ηt0)2 M X m=0 L2 m σ2 m B m=0 H2 m G2 m Et0 m 2 F . (A.70) A.3. Proof of Theorems 4.2 and 4.4 Let t+ 0 := t0 + Q 1. By Assumption 1: F(Θt+ 0 ) F(Θt0) D F(Θt0), Θt+ 0 Θt0E + L Θt+ 0 Θt0 2 (A.71) t=t0 ηt0 ˆG t + t=t0 ηt0 ˆG t t=t0 ηt0 D F(Θt0), ˆG t E + LQ t=t0 (ηt0)2 ˆG t 2 (A.73) where (A.73) follows from fact that (PN n=1 xn)2 N PN n=1 x2 n. Communication-Efficient Learning with Vertically Partitioned Data We bound further: F(Θt+ 0 ) F(Θt0) t=t0 ηt0 D F(Θt0), ˆG t Gt0E t=t0 ηt0 F(Θt0), Gt0 t=t0 (ηt0)2 ˆG t Gt0 + Gt0 2 (A.74) t=t0 ηt0 D F(Θt0), ˆG t Gt0E t=t0 ηt0 F(Θt0), Gt0 t=t0 (ηt0)2 ˆG t Gt0 2 + LQ t=t0 (ηt0)2 Gt0 2 (A.75) t=t0 ηt0 D F(Θt0), Gt0 ˆG t E t=t0 ηt0 F(Θt0), Gt0 t=t0 (ηt0)2 ˆG t Gt0 2 + LQ t=t0 (ηt0)2 Gt0 2 . (A.76) t=t0 ηt0 F(Θt0) 2 t=t0 ηt0 ˆG t Gt0 2 t=t0 ηt0 F(Θt0), Gt0 t=t0 (ηt0)2 ˆG t Gt0 2 + LQ t=t0 (ηt0)2 Gt0 2 (A.77) where (A.77) follows from the fact that A B = 1 We apply the expectation Et0 to both sides of (A.77): Et0 h F(Θt+ 0 ) i F(Θt0) 1 t=t0 ηt0 F(Θt0) 2 + 1 t=t0 ηt0(1 + 2LQηt0)Et0 ˆG t Gt0 2 t=t0 (ηt0)2Et0 h Gt0 2i (A.78) t=t0 ηt0(1 2LQηt0) F(Θt0) 2 t=t0 ηt0(1 + 2LQηt0)Et0 ˆG t Gt0 2 + LQ t=t0 (ηt0)2 (A.79) 2 ηt0(1 2LQηt0) F(Θt0) 2 t=t0 ηt0(1 + 2LQηt0)Et0 ˆG t Gt0 2 + LQ2(ηt0)2 M X σ2 m B (A.80) where (A.78) follows from applying Assumption 2 and noting that Et0 Gt0 = F(Θt0), and (A.80) follows from Assumption 3. Communication-Efficient Learning with Vertically Partitioned Data Applying Lemma 2 to (A.80): Et0 h F(Θt+ 0 ) i F(Θt0) Q 2 ηt0(1 2LQηt0) F(Θt0) 2 + 8Q3(ηt0)3(1 + 2LQηt0) m=0 L2 m m F(Θt0 m) 2 + 8Q3(ηt0)3(1 + 2LQηt0) m=0 L2 m σ2 m B + 32Q3ηt0(1 + 2LQηt0) m=0 H2 m G2 m Et0 m 2 F + LQ2(ηt0)2 M X σ2 m B (A.81) m=0 ηt0(1 2LQηt0 16Q2L2 m(ηt0)2 16Q3L2 m L(ηt0)3)) m F(Θt0) 2 + (LQ2(ηt0)2 + 8Q3L2 m(ηt0)3 + 8Q4LL2 m(ηt0)4) + 32Q3ηt0(1 + 2LQηt0) m=0 H2 m G2 m Et0 m 2 F . (A.82) Let ηt0 1 16Q max{L,maxm Lm}. 
Then we bound (A.82) further: Et0 h F(Θt+ 0 ) i F(Θt0) Q m=0 ηt0 1 1 + (LQ2(ηt0)2 + 8Q3L2 m(ηt0)3 + 8Q4LL2 m(ηt0)4) + 16Q3ηt0(1 + 2LQηt0) m=0 H2 m G2 m Et0 m 2 F (A.83) 8 ηt0 F(Θt0) 2 + (LQ2(ηt0)2 + 8Q3L2 m(ηt0)3 + 8Q4LL2 m(ηt0)4) + 32Q3ηt0(1 + 2LQηt0) m=0 H2 m G2 m Et0 m 2 F (A.84) After some rearranging of terms: ηt0 F(Θt0) 2 4 h F(Θt0) Et0 h F(Θt+ 0 ) ii 3(LQ(ηt0)2 + 8Q2L2 m(ηt0)3 + 8Q3LL2 m(ηt0)4) + 86Q2ηt0(1 + 2LQηt0) m=0 H2 m G2 m Et0 m 2 F (A.85) Communication-Efficient Learning with Vertically Partitioned Data Summing over all global rounds t0 = 0, . . . , R 1 and taking total expectation: t0=0 ηt0E h F(Θt0) 2i 4 F(Θ0) E F(ΘT ) t0=0 (LQ(ηt0)2 + 8Q2L2 m(ηt0)3 + 8Q3LL2 m(ηt0)4) + 86Q2ηt0(1 + 2LQηt0) m=0 H2 m G2 m Et0 m 2 F (A.86) 4 F(Θ0) E F(ΘT ) t0=0 (QL(ηt0)2 + 8Q2L2 m(ηt0)3 + 8Q3LL2 m(ηt0)4) + 86Q2 R 1 X t0=0 ηt0(1 + 2LQηt0) m=0 H2 m G2 m E h Et0 m 2 F m=0 H2 m G2 m E h Et0 m 2 F m=0 H2 m G2 m X j =m E h ϵt0 j 2 m=0 H2 m G2 m X j =m Et0 j (A.89) where (A.89) follows from Definition 4.1. Plugging this into (A.87) t0=0 ηt0E h F(Θt0) 2i 4 F(Θ0) E F(ΘT ) t0=0 (QL(ηt0)2 + 8Q2L2 m(ηt0)3 + 8Q3LL2 m(ηt0)4) + 86Q2 R 1 X t0=0 ηt0(1 + 2LQηt0) m=0 H2 m G2 m X j =m Et0 j . (A.90) Suppose that ηt0 = η for all global rounds t0. Then, averaging over R global rounds, we have: t0=0 E h F(Θt0) 2i 4 F(Θ0) E F(ΘT ) m=0 (QLη + 8Q2L2 mη2 + 8Q3LL2 mη3)σ2 m B t0=0 (1 + 2LQη) m=0 H2 m G2 m X j =m Et0 j . (A.91) 4 F(Θ0) E F(ΘT ) m=0 QLη σ2 m B + 92Q2 m=0 H2 m G2 m X j =m Et0 j . (A.92) where (A.92) follows from our assumption that ηt0 1 16Q max{L,maxm Lm}. This completes the proof of Theorem 4.2. Communication-Efficient Learning with Vertically Partitioned Data We continue our analysis to prove Theorem 4.4. Starting from (A.90), we bound the left-hand side with the minimum over all iterations: min t0=0,...,R 1 E h F(Θt0) 2i 4 F(Θ0) Et0 F(ΘT ) Q PR 1 t0=0 ηt0 PR 1 t0=0(ηt0)2 PR 1 t0=0 ηt0 + 16Q2L2 m PR 1 t0=0(ηt0)3 PR 1 t0=0 ηt0 + 16Q3LL2 m PR 1 t0=0(ηt0)4 PR 1 t0=0 ηt0 m=0 H2 m G2 m PR 1 t0=0 ηt0 P j =m Et0 j PR 1 t0=0 ηt0 + 172LQ3 M X m=0 H2 m G2 m PR 1 t0=0(ηt0)2 P j =m Et0 j PR 1 t0=0 ηt0 (A.93) As R , if PR 1 t0=0 ηt0 = , PR 1 t0=0(ηt0)2 < , and PR 1 t0=0 ηt0 P j =m Et0 j < , then mint0=0,...,R 1 E h F(Θt0) 2i 0. This completes the proof of Theorem 4.4. B. Common Compressors In this section, we calculate the compression error and parameter bounds for uniform scalar quantization, lattice vector quantization and top-k sparsification, as well as discuss implementation details of these compressors in C-VFL. We first consider a uniform scalar quantizer (Bennett, 1948) with a set of 2q quantization levels, where q is the number of bits to represent compressed values. We define the range of values that quantize to the same quantization level as the quantization bin. In C-VFL, a scalar quantizer quantizes each individual component of embeddings. The error in each embedding of a batch B in scalar quantization is Pm 2 12 = Pm (hmax hmin)2 12 2 2q where the size of a quantization bin, Pm is the size of the m-th embedding, hmax and hmin are respectively the maximum and minimum value hm(θt m; xi m) can be for all iterations t, parties m, and xi m. We note that if hmax or hmin are unbounded, then the error is unbounded as well. By Theorem 4.2, we know that 1 R PR 1 t0=0 PM m=0 Et0 m = O( 1 T ) to obtain a convergence rate of O( 1 T ). 
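As an illustration of the scalar quantization error bound above, the following minimal NumPy sketch (the sizes, value range, and names here are made up for illustration and are not taken from the released C-VFL code) quantizes a random P_m × B embedding matrix to 2^q levels and compares the empirical per-embedding squared error to P_m(h_max − h_min)²/(12 · 2^{2q}).

```python
import numpy as np

def uniform_quantize(h, q, h_min, h_max):
    """Quantize each component of h to one of 2**q evenly spaced levels on [h_min, h_max]."""
    levels = 2 ** q
    delta = (h_max - h_min) / levels                       # width of one quantization bin
    idx = np.clip(np.floor((h - h_min) / delta), 0, levels - 1)
    return h_min + (idx + 0.5) * delta                     # midpoint of the selected bin

rng = np.random.default_rng(0)
P_m, B, q = 64, 32, 3                                      # embedding size, batch size, bits
h_min, h_max = -1.0, 1.0
H = rng.uniform(h_min, h_max, size=(P_m, B))               # one embedding per column

H_hat = uniform_quantize(H, q, h_min, h_max)
empirical = ((H_hat - H) ** 2).sum(axis=0).mean()          # mean squared error per embedding
bound = P_m * (h_max - h_min) ** 2 / (12 * 2 ** (2 * q))
print(f"empirical per-embedding error: {empirical:.4f}, bound: {bound:.4f}")
```

For embeddings that are roughly uniform over [h_min, h_max], the empirical error sits close to the bound, and each additional bit in q reduces the bound by a factor of four.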
If we use the same q for all parties and iterations, we can solve for q to find that it must be lower bounded by q = Ω(log_2(P_m (h_max − h_min)² T)) to reach a convergence rate of O(1/T). For a diminishing compression error, required by Theorem 4.4, we let T = t_0 in this bound, indicating that q, the number of quantization bits, must increase as training continues.

A vector quantizer creates a set of d-dimensional vectors called a codebook (Zamir & Feder, 1996). A vector is quantized by dividing its components into sub-vectors of size d, then quantizing each sub-vector to the nearest codebook vector in Euclidean distance. A cell in vector quantization is defined as the set of all points in d-space that quantize to a single codeword. The volume of these cells is determined by how closely packed the codewords are. We consider the commonly applied 2-dimensional hexagonal lattice quantizer (Shlezinger et al., 2021). In C-VFL, each embedding is divided into sub-vectors of size two, scaled to the unit square, then quantized to the nearest codebook vector in Euclidean distance. The error of this vector quantizer is P_m V / 24, where V is the volume of a lattice cell. The more bits available for quantization, the smaller the volume of the cells, and thus the smaller the compression error. We can calculate an upper bound on V based on Theorem 4.2: V = O(1/(P_m T)). If a diminishing compression error is required, we can set T = t_0 in this bound, indicating that V must decrease at a rate of O(1/(P_m t_0)). As the number of iterations increases, the smaller V must be, and thus the more bits must be communicated.

In top-k sparsification (Lin et al., 2018), as used in distributed SGD algorithms, the k largest-magnitude components of the gradient are sent while the rest are set to zero. In the case of embeddings in C-VFL, a large element may be just as important an input to the server model as a small one. We can instead select the k embedding elements whose corresponding partial derivatives of the loss with respect to the embedding have the largest magnitude. Since a party m cannot calculate these partial derivatives until all parties share their embeddings, party m can use the embedding gradient calculated in the previous iteration. This is an intuitive method, as we assume our gradients are Lipschitz continuous, and thus do not change too rapidly. The error of sparsification is (1 − k/P_m)(‖h‖²)_max, where (‖h‖²)_max is the maximum value of ‖h_m(θ_m^t; x_m^i)‖² over all iterations t, parties m, and x_m^i. Note that if (‖h‖²)_max is unbounded, then the error is unbounded as well. We can calculate a lower bound on k: k = Ω(P_m − P_m/((‖h‖²)_max T)). Note that the larger (‖h‖²)_max is, the larger k must be: more components must be sent when the embedding magnitude is large in order to achieve a convergence rate of O(1/T). When considering a diminishing compression error, we set T = t_0, showing that k must increase over the course of training.
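The following is a minimal sketch of the derivative-guided top-k selection described above, assuming the party has cached the loss gradient with respect to its embedding from the previous exchange; the helper name and shapes are hypothetical and not the paper's implementation.

```python
import numpy as np

def topk_by_embedding_grad(h, grad_prev, k):
    """Keep the k elements of embedding h whose cached loss gradients (w.r.t. the
    embedding, from the previous exchange) are largest in magnitude; zero the rest."""
    keep = np.argsort(np.abs(grad_prev))[-k:]     # indices of the k largest |dF/dh| entries
    mask = np.zeros_like(h)
    mask[keep] = 1.0
    return h * mask, keep                         # sparse embedding and the indices to transmit

rng = np.random.default_rng(0)
P_m, k = 16, 4
h = rng.normal(size=P_m)              # party m's current embedding for one sample
grad_prev = rng.normal(size=P_m)      # stale dF/dh_m from the previous communication round

h_sparse, kept = topk_by_embedding_grad(h, grad_prev, k)
print(kept, np.round(h_sparse[kept], 3))   # only these k values (plus their indices) are sent
```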
C. Experimental Details

For our experiments, we used an internal cluster of 40 compute nodes running CentOS 7, each with two 20-core 2.5 GHz Intel Xeon Gold 6248 CPUs, 8 NVIDIA Tesla V100 GPUs with 32 GB HBM, and 768 GB of RAM.

C.1. MIMIC-III

The MIMIC-III dataset can be found at: mimic.physionet.org. The dataset consists of time-series data from 60,000 intensive care unit admissions. The data includes many features about each patient, such as demographics, vital signs, medications, and more. All the data is anonymized. In order to gain access to the dataset, one must take the short online course provided on their website.

Our code for training with the MIMIC-III dataset can be found in the folder titled mimic3. It is an extension of the MIMIC-III benchmarks repository found at: github.com/YerevaNN/mimic3-benchmarks. The original code preprocesses the MIMIC-III dataset and provides starter code for training LSTMs with centralized SGD. Our code updates the existing code to TensorFlow 2. The new file of interest in our code base is mimic3models/in_hospital_mortality/quant.py, which runs C-VFL. Both our code base and the original are under the MIT License. More details on installation, dependencies, and running our experiments can be found in README.md. Each experiment took approximately six hours to run on a node in our cluster.

The benchmark preprocessing code splits the data up into different prediction cases. Our experiments train models to predict in-hospital mortality. For in-hospital mortality, there are 14,681 training samples and 3,236 test samples. In our experiments, we use a step size of 0.01, as is standard for training an LSTM on the MIMIC-III dataset.

C.2. ModelNet10 and CIFAR-10

Details on the ModelNet10 dataset can be found at: modelnet.cs.princeton.edu. The specific link we downloaded the dataset from is the following Google Drive link: https://drive.google.com/file/d/0B4v2jR3WsindMUE3N2xiLVpyLW8/view. The dataset consists of 3D CAD models of common objects. For each CAD model, there are 12 views from different angles saved as PNG files. We only trained our models on the following 10 classes: bathtub, bed, chair, desk, dresser, monitor, night stand, sofa, table, toilet. We used a subset of the data with 1,008 training samples and 918 test samples. In our experiments, we use a step size of 0.001, as is standard for training a CNN on the ModelNet10 dataset.

Our code for learning on the ModelNet10 dataset is in the folder MVCNN_Pytorch and is an extension of the MVCNN-PyTorch repository: github.com/RBirkeland/MVCNN-PyTorch. The file of interest in our code base is quant.py, which runs C-VFL. Both our code base and the original are under the MIT License. Details on how to run our experiments can be found in README.md. Each experiment took approximately six hours to run on a node in our cluster.

In the same folder, MVCNN_Pytorch, we include our code for running CIFAR-10. The file of interest is quant_cifar.py, which trains C-VFL on CIFAR-10. We use the version of CIFAR-10 downloaded through the torchvision library. More information on the CIFAR-10 dataset can be found at: cs.toronto.edu/~kriz/cifar.html.

C.3. ImageNet

In Section D, we include additional experiments that use the ImageNet dataset. Details on ImageNet can be found at: image-net.org. We specifically use a random 100-class subset of the 2012 ILSVRC version of the data. Our code for learning on the ImageNet dataset is in the folder ImageNet_CVFL and is a modification of the moco_align_uniform repository: https://github.com/SsnL/moco_align_uniform. The file of interest in our code base is main_cvfl.py, which runs C-VFL. Both our code base and the original are under the CC-BY-NC 4.0 license. Details on how to run our experiments can be found in README.txt. Each experiment took approximately 24 hours to run on a node in our cluster.
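As a small illustration of how the image datasets above can be vertically partitioned, here is a minimal NumPy sketch of splitting each image into quadrants among M = 4 parties, as in the CIFAR-10 and ImageNet experiments; the helper name and shapes are illustrative and not taken from the released code.

```python
import numpy as np

def split_into_quadrants(images):
    """Split a batch of square images (N, C, H, W) into four feature blocks,
    one quadrant per party, for an M = 4 vertical partition."""
    _, _, H, W = images.shape
    h2, w2 = H // 2, W // 2
    return [
        images[:, :, :h2, :w2],   # party 1: top-left quadrant
        images[:, :, :h2, w2:],   # party 2: top-right quadrant
        images[:, :, h2:, :w2],   # party 3: bottom-left quadrant
        images[:, :, h2:, w2:],   # party 4: bottom-right quadrant
    ]

# Example with random CIFAR-10-shaped data: 8 RGB images of size 32x32.
batch = np.random.rand(8, 3, 32, 32).astype(np.float32)
party_views = split_into_quadrants(batch)
print([v.shape for v in party_views])   # four (8, 3, 16, 16) blocks
```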
D. Additional Plots and Experiments

In this section, we include additional plots using the results from the experiments introduced in Section 5 of the main paper. We also include new experiments with the ImageNet100 dataset. Finally, we include additional experiments with an alternative C-VFL algorithm for Q = 1.

D.1. Additional Plots

We first provide additional plots from the experiments in the main paper. The setup for the experiments is described in the main paper. These plots provide some additional insight into the effect of each compressor on convergence for all datasets. As with the plots in the main paper, the solid lines in each plot are the average of five runs, and the shaded regions represent one standard deviation.

Figure D.1. Test F1-Score on the MIMIC-III dataset; panels (a)-(c) use 2, 3, and 4 bits per parameter, respectively. In these experiments, Q = 10 and M = 4. Scalar and vector quantization achieve similar test F1-Scores even when using only 2 bits in quantization. On the other hand, top-k sparsification performs worse than the other compressors on the MIMIC-III dataset.

Figure D.1 plots the test F1-Score for training on the MIMIC-III dataset for different levels of compression. We can see that scalar and vector quantization perform similarly to no compression and improve as the number of available bits increases. We can also see that top-k sparsification has high variability on the MIMIC-III dataset and generally performs worse than the other compressors.

Figure D.2. Test F1-Score on the MIMIC-III dataset plotted by communication cost; panels (a)-(c) use 2, 3, and 4 bits per parameter, respectively. In these experiments, Q = 10 and M = 4. All compressors reach higher F1-Scores with lower communication cost than no compression. The standard deviation for each compressor decreases as the number of available bits increases. Top-k sparsification generally performs worse than the other compressors on the MIMIC-III dataset.

Figure D.3. Test F1-Score on the MIMIC-III dataset plotted by communication cost; panels show (a) scalar quantization, (b) vector quantization, and (c) top-k sparsification. In these experiments, Q = 10 and M = 4. All compressors reach higher F1-Scores with lower communication cost than no compression. The variability of each compressor decreases as the number of available bits increases.

Figures D.2 and D.3 plot the test F1-Score for training on the MIMIC-III dataset against the communication cost. The plots in Figure D.2 include all compression techniques for a given level of compression, while the plots in Figure D.3 include all levels of compression for a given compression technique. We can see that all compressors reach higher F1-Scores with lower communication cost than no compression. It is interesting to note that increasing the number of bits per parameter reduces the variability of all compressors.

Figure D.4. Test accuracy on the ModelNet10 dataset; panels (a)-(c) use 2, 3, and 4 bits per parameter, respectively. In these experiments, Q = 10 and M = 4. Vector quantization and top-k sparsification perform similarly to no compression, even when only 2 bits are available. Scalar quantization converges to a lower test accuracy and has high variability on the ModelNet10 dataset.

Figure D.4 plots the test accuracy for training on the ModelNet10 dataset.
Vector quantization and top-k sparsification perform similarly to no compression, even when only 2 bits are available. We can see that scalar quantization has high variability on the ModelNet10 dataset.

Figure D.5. Test accuracy on the ModelNet10 dataset plotted by communication cost; panels (a)-(c) use 2, 3, and 4 bits per parameter, respectively. In these experiments, Q = 10 and M = 4. All compressors reach higher accuracies with lower communication cost than no compression. Scalar quantization generally performs worse than the other compressors on the ModelNet10 dataset.

Figure D.6. Test accuracy on the ModelNet10 dataset plotted by communication cost; panels show (a) scalar quantization, (b) vector quantization, and (c) top-k sparsification. In these experiments, Q = 10 and M = 4. All compressors reach higher accuracies with lower communication cost than no compression. When fewer bits are used in each compressor, higher test accuracies are reached at lower communication costs. Scalar quantization generally performs worse than the other compressors on the ModelNet10 dataset.

Figures D.5 and D.6 plot the test accuracy for training on the ModelNet10 dataset against the communication cost. The plots in Figure D.5 include all compression techniques for a given level of compression, while the plots in Figure D.6 include all levels of compression for a given compression technique. We can see that all compressors reach higher accuracies with lower communication cost than no compression. Scalar quantization generally performs worse than the other compressors on the ModelNet10 dataset. From Figure D.6, we also see that when fewer bits are used in each compressor, higher test accuracies are reached at lower communication costs.

Figure D.7. Test accuracy on the CIFAR-10 dataset with the number of parties M = 4 and the number of local iterations Q = 10; panels show results (a) plotted by epochs, (b) plotted by communication cost, and (c) for vector quantization at different compression levels. In the first two plots, the compressors use b = 2, where b is the number of bits used to represent embedding components. In the third plot, b = 32 indicates no compression. The results show vector quantization performs the best out of the compressors, and all compressors show improvement over no compression in terms of the communication cost required to reach target test accuracies.

In Figure D.7, we plot the test accuracy for the CIFAR-10 dataset. The test accuracy is fairly low compared to typical baseline accuracies, which is expected, as learning object classification from only a quadrant of a 32×32 pixel image is difficult. Figure D.7a shows the test accuracy plotted by epochs. We can see that vector quantization performs almost as well as no compression on the CIFAR-10 dataset. When plotting by communication cost, as seen in Figure D.7b, we can see that vector quantization performs the best, though scalar quantization and top-k sparsification show communication savings as well. In Figure D.7c, we plot the test accuracy of C-VFL using vector quantization for different values of b, the number of bits used to represent compressed values. Similar to previous results, lower b tends to improve the test accuracy reached for the same communication cost.

D.2. Additional Experiments with ImageNet

We also run C-VFL on ImageNet100 (Deng et al., 2009). ImageNet is a large image dataset for object classification.
We use a random subset of 100 classes (ImageNet100) from the ImageNet dataset (about 126,000 images). We train a set of 4 parties, each storing a different quadrant of every image. Each party trains a ResNet18, and the server trains a fully-connected layer. We use a variable step size that starts at 0.001 and drops to 0.0001 after 50 epochs. We use a batch size of 256 and train for 100 epochs.

Figure D.8. Test accuracy on the ImageNet100 dataset with the number of parties M = 4 and the number of local iterations Q = 10; panels show results (a) plotted by epochs, (b) plotted by communication cost, and (c) for vector quantization at different compression levels. In the first two plots, the compressors use b = 2, where b is the number of bits used to represent embedding components. In the third plot, b = 32 indicates no compression. The results show vector quantization performs the best out of the compressors, and all compressors show improvement over no compression in terms of the communication cost required to reach target test accuracies.

In Figure D.8, we plot the top-5 test accuracy for ImageNet100. Figure D.8a shows the test accuracy plotted by epochs. We can see that vector quantization performs almost as well as no compression on the ImageNet100 dataset. When plotting by communication cost, as seen in Figure D.8b, we can see that vector quantization performs the best, though scalar quantization and top-k sparsification show communication savings as well. In Figure D.8c, we plot the test accuracy of C-VFL using vector quantization for different values of b, the number of bits used to represent compressed values. Similar to previous results, lower b tends to improve the test accuracy reached for the same communication cost.

D.3. Comparison with an Alternative C-VFL Algorithm for Q = 1

In C-VFL, the server distributes party embeddings to all parties along with the server model parameters. This allows the parties to calculate their partial derivatives for local model updates for multiple local iterations. However, if the number of local iterations Q = 1, then a more efficient method of communication is for the server to compute partial derivative updates for the parties (Hu et al., 2019; Ceballos et al., 2020; Romanini et al., 2021), avoiding the need for parties to receive embeddings from other parties. This approach can be applied to C-VFL as well.

Algorithm 2 Compressed Vertical Federated Learning for Q = 1
1: Initialize: θ_m^0 for all parties m and server model θ_0^0
2: for t = 0, ..., T − 1 do
3:   Randomly sample B^t ⊆ {X, y}
4:   for m = 1, ..., M in parallel do
5:     Send C_m(h_m(θ_m^t; X_m^{B^t})) to the server
6:   end for
7:   Φ̂^t ← {C_0(θ_0^t), C_1(h_1(θ_1^t)), ..., C_M(h_M(θ_M^t))}
8:   θ_0^{t+1} = θ_0^t − η^t ∇_0 F_B(Φ̂^t; y_{B^t})
9:   Server sends ∇_{h_m(θ_m^t; X_m^{B^t})} F_B(Φ̂^t; y_{B^t}) to each party m
10:  for m = 1, ..., M in parallel do
11:    ∇_m F_B(Φ̂^t; y_{B^t}) = ∇_{θ_m} h_m(θ_m^t; X_m^{B^t}) ∇_{h_m(θ_m^t; X_m^{B^t})} F_B(Φ̂^t; y_{B^t})
12:    θ_m^{t+1} = θ_m^t − η^t ∇_m F_B(Φ̂^t; y_{B^t})
13:  end for
14: end for

The pseudo-code for this method is presented in Algorithm 2. In this version of C-VFL, the parties send their compressed embeddings to the server. The server calculates the loss by feeding the embeddings through the server model, then calculates the gradient of the loss. The server sends to each party m the partial derivative with respect to its embedding: ∇_{h_m(θ_m^t; X_m^{B^t})} F_B(Φ̂^t; y_{B^t}). Each party m calculates the following partial derivative with respect to its local model parameters:

∇_m F_B(Φ̂^t; y_{B^t}) = ∇_{θ_m} h_m(θ_m^t; X_m^{B^t}) ∇_{h_m(θ_m^t; X_m^{B^t})} F_B(Φ̂^t; y_{B^t}). (D.1)

Using this partial derivative, the party updates its local model:

θ_m^{t+1} = θ_m^t − η^t ∇_m F_B(Φ̂^t; y_{B^t}). (D.2)

Note that this process is mathematically equivalent to C-VFL when Q = 1; thus the analysis in Section 4 holds.
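Below is a minimal PyTorch-style sketch of one round of Algorithm 2 (Q = 1). It is an illustration only: the toy linear models, dimensions, and the identity compress placeholder are assumptions for the example, not the models or compressors used in our experiments.

```python
import torch
import torch.nn as nn

# Toy setup: M = 2 parties with local feature blocks, a linear server model, Q = 1.
M, B, d_feat, d_emb, n_cls = 2, 8, 10, 4, 3
parties = [nn.Linear(d_feat, d_emb) for _ in range(M)]   # stand-ins for the local models h_m
server = nn.Linear(M * d_emb, n_cls)                      # stand-in for the server model theta_0
loss_fn = nn.CrossEntropyLoss()
eta = 0.1

X = [torch.randn(B, d_feat) for _ in range(M)]            # party m's local features for the batch
y = torch.randint(0, n_cls, (B,))

def compress(h):
    return h  # placeholder for the compressor C_m (identity here)

# Lines 4-7: parties send (compressed) embeddings; the server treats them as leaf
# tensors so it can return dF/dh_m without touching the parties' parameters.
H = [compress(parties[m](X[m])).detach().requires_grad_(True) for m in range(M)]
loss = loss_fn(server(torch.cat(H, dim=1)), y)
grads = torch.autograd.grad(loss, H + list(server.parameters()))
dH, dTheta0 = grads[:M], grads[M:]

# Line 8: server update.
with torch.no_grad():
    for p, g in zip(server.parameters(), dTheta0):
        p -= eta * g

# Lines 9-13: each party backpropagates the received dF/dh_m through its own model
# (chain rule: dF/dtheta_m = (dh_m/dtheta_m)^T dF/dh_m) and takes a gradient step.
for m in range(M):
    h_m = parties[m](X[m])
    h_m.backward(dH[m])
    with torch.no_grad():
        for p in parties[m].parameters():
            p -= eta * p.grad
    parties[m].zero_grad()
```

The key point is that the server differentiates the loss with respect to the received embeddings and returns only those per-embedding gradients, so parties never receive each other's embeddings or the server model.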
The communication cost of Algorithm 2 per communication round without compression is O(B Σ_m P_m), a reduction in communication compared to the per-round communication cost of Algorithm 1: O(M(B Σ_m P_m + |θ_0|)). Although Algorithm 2 reduces communication in a given round, it is limited to the case where Q = 1. For Q > 1, we must use Algorithm 1.

We run experiments on the MIMIC-III dataset to compare C-VFL using Algorithm 2 with C-VFL using Algorithm 1 for values of Q > 1. We show the results of these experiments in Table D.1, which reports the communication cost to reach a target F1-Score of 0.4. The results in the column labeled Q = 1 are from running Algorithm 2, while all other results are from running Algorithm 1. We include results for the case where C-VFL is run without embedding compression, as well as for the case where vector quantization with b = 3 is used to compress the embeddings.

Table D.1. MIMIC-III communication cost to reach a target test F1-Score of 0.4. Values shown are the mean of 5 runs ± the standard deviation. The first row has no embedding compression, while the second row employs vector quantization on embeddings with b = 3. For the cases where Q = 1, Algorithm 2 is used, and for the cases where Q > 1, Algorithm 1 is used. In these experiments, the number of clients M = 4.

Compressor       Cost (MB) to reach target F1-Score 0.4
                 Q = 1               Q = 10              Q = 25
None (b = 32)    4517.59 ± 465.70    3777.21 ± 522.43    1536.69 ± 201.99
Vector (b = 3)   433.28 ± 79.83      330.47 ± 10.63      125.09 ± 4.56

We can see that for this dataset, values of Q > 1 reduce the cost of communication to reach the target F1-Score. In all cases, Algorithm 1 achieves a lower communication cost to reach the target model accuracy than Algorithm 2, despite Algorithm 2 having a lower communication cost per communication round than Algorithm 1. The use of multiple local iterations in Algorithm 1 decreases the number of global rounds required to attain the target accuracy compared to Algorithm 2.
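To make the per-round comparison concrete, here is a small back-of-the-envelope calculation; the batch size, embedding sizes, and server model size below are made-up values for illustration, not the settings from our experiments.

```python
# Per-round communication in transmitted floats, ignoring constants hidden by the O(.) terms.
B, M, P_m, server_params = 64, 4, 128, 50_000   # made-up sizes for illustration
sum_P = M * P_m                                  # sum of embedding sizes across parties

cost_alg2 = B * sum_P                            # Algorithm 2: O(B * sum_m P_m) per round
cost_alg1 = M * (B * sum_P + server_params)      # Algorithm 1: O(M * (B * sum_m P_m + |theta_0|))

print(cost_alg2, cost_alg1)                      # 32768 vs 331072 floats per round
```

Even with this per-round advantage for Algorithm 2, Table D.1 shows that Algorithm 1 with Q > 1 uses fewer total megabytes to reach the target F1-Score, because multiple local iterations reduce the number of communication rounds required.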