Communication-Constrained Bandits under Additive Gaussian Noise

Prathamesh Mayekar, Jonathan Scarlett, Vincent Y. F. Tan
National University of Singapore. Correspondence to: Prathamesh Mayekar, Jonathan Scarlett, Vincent Y. F. Tan.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

We study a distributed stochastic multi-armed bandit where a client supplies the learner with communication-constrained feedback based on the rewards for the corresponding arm pulls. In our setup, the client must encode the rewards such that the second moment of the encoded rewards is no more than $P$, and this encoded reward is further corrupted by additive Gaussian noise of variance $\sigma^2$; the learner only has access to this corrupted reward. For this setting, we derive an information-theoretic lower bound of $\Omega\left(\sqrt{\frac{KT}{\mathrm{SNR} \wedge 1}}\right)$ on the minimax regret of any scheme, where $\mathrm{SNR} := \frac{P}{\sigma^2}$, and $K$ and $T$ are the number of arms and the time horizon, respectively. Furthermore, we propose a multi-phase bandit algorithm, UE-UCB++, which matches this lower bound up to a minor additive factor. UE-UCB++ performs uniform exploration in its initial phases and then utilizes the upper confidence bound (UCB) bandit algorithm in its final phase. An interesting feature of UE-UCB++ is that the coarser estimates of the mean rewards formed during a uniform exploration phase help to refine the encoding protocol in the next phase, leading to more accurate mean estimates of the rewards in the subsequent phase. This positive reinforcement cycle is critical to reducing the number of uniform exploration rounds and closely matching our lower bound.

1. Introduction

The stochastic multi-armed bandit (SMAB) is a classic sequential decision-making problem with decades of research (see Lattimore & Szepesvári (2020) for a recent survey). The SMAB problem crucially captures the trade-off, lying at the heart of many practical sequential decision-making problems, between exploring actions to find the optimal one and exploiting the action currently believed to be the best, which may in fact be suboptimal.

In its simplest form, the SMAB problem is a sequential game between the learner and the environment, played over a finite number of rounds. In each round, the learner plays an arm from a given set of arms, and then the environment reveals a reward corresponding to the played arm. The rewards for a particular arm are i.i.d. across rounds, with the reward distribution unknown to the learner. The learner's goal is to maximize the expected cumulative reward over the entirety of the game.

In another direction, in modern distributed optimization frameworks such as Federated Learning (Konečný et al., 2016), a central learner trains a global machine learning model by leveraging updates from remote clients (typically, stochastic gradients computed on local data). However, a major bottleneck in such settings is the uplink communication cost, i.e., the cost of sending the remote clients' updates to the learner. This has spurred a recent line of work that seeks to minimize the amount of communication by a remote client while still maintaining the accuracy of the global model (see Kairouz et al. (2021) for a recent survey).
An emerging abstraction for understanding the trade-off between communication and accuracy in distributed optimization is the one-client setting, where one remote client supplies the learner with gradient updates over a rate-limited channel (see, for instance, Acharya et al. (2021); Gandikota et al. (2019); Lin et al. (2021); Mayekar & Tyagi (2020)). A similar abstraction has been studied for distributed SMAB problems in two interesting recent works: Hanna et al. (2022) and Mitra et al. (2022). Our work is inspired by these two studies.

In particular, Hanna et al. (2022) studies the distributed SMAB problem where a client supplies the learner with reward feedback corresponding to its arm pull. To account for the uplink cost from the client to the learner, at the $t$th round, the client must quantize the reward feedback to $r_t$ bits, and only the quantized reward feedback is available to the learner for further decision-making. Mitra et al. (2022) study a linear distributed SMAB problem where, at each round, the remote client supplies the learner with an $r$-bit quantized version of its estimate of the $d$-dimensional linear model parameter. These studies investigate similar questions:

Q1a (Hanna et al., 2022): What is the minimum average number of bits $\frac{1}{T}\sum_{t=1}^{T} r_t$ needed to attain the regret of the centralized setting in the distributed bandit game?

Q1b (Mitra et al., 2022): What is the minimum $r$ needed to attain the regret of the centralized setting in the distributed bandit game?

Hanna et al. (2022) shows that $O(1)$ bits of average communication is sufficient to attain the regret of the centralized setting, and Mitra et al. (2022) shows that $r = O(d)$ is sufficient to attain the regret of the centralized setting.

In many practical distributed learning scenarios, the updates from the client to the learner are sent over a noisy wireless network. To model this, recent works on distributed optimization have considered communication over noisy channels; see, for instance, Amiri & Gündüz (2020); Jha et al. (2022) and the references therein. Unfortunately, the results for distributed optimization over noisy channels do not apply to the distributed SMAB problem over noisy channels. Moreover, the model for communication constraints in the recent works on distributed SMAB does not accurately capture communication over noisy channels.

To bridge this gap, we study a distributed SMAB problem where the remote client encodes the reward feedback so that the encoded reward has a second moment of at most $P$, and then transmits the encoded reward over an additive white Gaussian noise (AWGN) channel with noise variance $\sigma^2$ (this is equivalent to corrupting the encoded reward with additive Gaussian noise of mean zero and variance $\sigma^2$); only the channel's output is available to the learner for pulling the arm at the next round. We define the signal-to-noise ratio in our problem as $\mathrm{SNR} := \frac{P}{\sigma^2}$.

We remark that restricting the second moment of the encoded messages to $P$ and transmitting over an AWGN channel is a ubiquitous communication model in information theory and communication, dating back to Shannon's seminal work; see, for instance, Shannon (1948), Cover & Thomas (2006, Chapter 9), and Tse & Viswanath (2005, Chapter 8). The second moment constraint models the fact that most edge devices are restricted to transmitting up to a particular transmission power due to physical constraints. The AWGN channel is often used as a first approximation for various communication settings.
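To make the channel model concrete, the following minimal NumPy sketch simulates the power-constrained AWGN channel described above. It is an illustration only (the values of $P$, $\sigma^2$, and the encoded-symbol distribution are our own choices, not the paper's), and simply checks the empirical signal and noise powers against $\mathrm{SNR} = P/\sigma^2$.

```python
import numpy as np

# Illustrative sketch of the power-constrained AWGN channel (values are ours).
# P bounds the second moment of the transmitted symbol; sigma2 is the noise
# variance, so SNR = P / sigma2.
rng = np.random.default_rng(0)
P, sigma2 = 4.0, 1.0
snr = P / sigma2

# Transmit T i.i.d. encoded symbols whose second moment is at most P.
T = 100_000
encoded = rng.uniform(-np.sqrt(P), np.sqrt(P), size=T)   # E[encoded^2] <= P
received = encoded + rng.normal(0.0, np.sqrt(sigma2), size=T)

print(f"SNR = {snr:.2f}, empirical signal power = {np.mean(encoded**2):.2f}")
print(f"empirical noise power = {np.mean((received - encoded)**2):.2f}")
```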
(The discrepancy between these two results is due to the different settings: in Hanna et al. (2022), a scalar reward feedback needs to be compressed, whereas in Mitra et al. (2022), the $d$-dimensional linear model parameter needs to be compressed.) Coming back to comparisons with Hanna et al. (2022) and Mitra et al. (2022), the counterpart of Q1 in our setting is the following question:

Q2: What is the minimum SNR needed to attain the regret of the centralized setting in the distributed bandit game over an AWGN channel?

In this work, we resolve the following more general question up to multiplicative logarithmic and minor additive factors:

Q3: How does the minimax regret scale as a function of SNR?

1.1. Our Contributions

We show that any bandit algorithm employed by the learner and any encoding strategy employed by the client must incur a worst-case expected cumulative regret of at least (up to constant factors)
$$\sqrt{\frac{KT}{\mathrm{SNR} \wedge 1}} + KB,$$
where $K$ is the number of bandit arms, $T$ is the time horizon, and $B$ is an upper bound on the $L_2$-norm of the rewards. (Throughout, $a \wedge b := \min\{a, b\}$ and $a \vee b := \max\{a, b\}$; the $L_2$-norm of a random variable $X$ is $\sqrt{\mathbb{E}[X^2]}$; $\log(\cdot)$ denotes the logarithm to base 2 and $\ln(\cdot)$ the natural logarithm.)

A technical difficulty in deriving this lower bound is to account for the loss of information due to transmission over the AWGN channel. We overcome this difficulty by leveraging classic information-theoretic results for transmission over an AWGN channel and combining these techniques with the standard minimax lower bound proof for SMAB. An interesting consequence of our proof technique is that, up to constant factors, we can recover the lower bound in Hanna et al. (2022) with an arguably simpler proof; see the Appendix for details.

Unlike Hanna et al. (2022) and Mitra et al. (2022), we do not quantize our updates. In our setup, quantizing and transmitting over an AWGN channel would lead to decoding errors at every round, which can pile up and result in linear regret unless SNR grows with the time horizon $T$; such a scheme would not allow us to resolve Q3. Instead, we use a center-and-scale encoding protocol, CAS, where the centering and scaling factors are learned over time. We also design a new bandit algorithm, UE-UCB++, which, along with the encoding protocol CAS, bounds the worst-case expected cumulative regret by
$$O\!\left(\sqrt{\left(\frac{1}{\mathrm{SNR}} + 1\right) KT \ln T} + \frac{KB \log B}{\mathrm{SNR} \wedge 1}\right).$$
Thus, we resolve Q3 up to a multiplicative $\sqrt{\ln T}$ factor and an additive $\frac{KB \log B}{\mathrm{SNR} \wedge 1}$ term.

UE-UCB++ performs uniform exploration for multiple initial sub-phases and then employs the classic upper confidence bound (UCB) bandit algorithm. The multiple initial uniform exploration sub-phases allow us to learn CAS's optimal centering and scaling factors. An interesting feature of UE-UCB++ is that the coarser mean reward estimates formed for all arms in a given sub-phase help us refine the communication scheme for the subsequent sub-phase. The refined communication scheme, in turn, leads to better mean reward estimates in the subsequent sub-phase. Due to this positive reinforcement cycle, the variance of the mean reward estimates decays exponentially with the number of uniform exploration sub-phases. As a result, we can learn an almost optimal encoding protocol after relatively few rounds of uniform exploration.

We also design a private scheme where the learner does not send any information to the client. (As opposed to the popular notion of differential privacy (Dwork et al., 2014), we emphasize that here we refer to a much more rudimentary privacy notion, where the learner does not send information to the remote client at any round.)
We show that, for such a scheme, the worst-case expected cumulative regret can be bounded by
$$O\!\left(\sqrt{\left(\frac{B^2}{\mathrm{SNR}} + 1\right) KT \ln T} + KB\right).$$
This simple scheme nearly matches our lower bound when $B = O(1)$.

2. Setup and Preliminaries

We now proceed to describe our setup; see Figure 1 for a pictorial description.

The Distributed SMAB Problem over an AWGN channel:

1. At round $t \in [T]$ (for an integer $n$, $[n] := \{1, \ldots, n\}$), the learner asks a client to pull an arm $A_t$. The learner also sends some side information $S_t$, based on the history of the game until time $t-1$, to aid the communication at the client.

2. Upon receiving this information, the client pulls arm $A_t$ and obtains a reward $X_{t,A_t}$ corresponding to the arm $A_t$. The client then encodes this reward using an encoding function $C_t : \mathbb{R} \times \mathcal{S} \to \mathbb{R}$ to form $C_t(X_{t,A_t}, S_t)$, where $\mathcal{S}$ is the domain of the side information $S_t$ sent by the learner. The encoding function must satisfy the power constraint
$$\mathbb{E}\left[C_t(X_{t,A_t}, S_t)^2\right] \le P \tag{1}$$
for some $P > 0$.

3. The client sends the encoded reward over the AWGN channel. The output of the channel, $Y_t$, is received by the learner, where
$$Y_t = C_t(X_{t,A_t}, S_t) + Z_t \tag{2}$$
and $\{Z_t\}_{t \ge 1}$ is a sequence of i.i.d. Gaussian random variables with mean 0 and variance $\sigma^2$.

4. The learner decodes $Y_t$ using a decoding function $D_t : \mathbb{R} \times \mathcal{S} \to \mathbb{R}$ to form a reward estimate $D_t(Y_t, S_t)$. The learner then uses a bandit algorithm $\pi$ to decide on the arm $A_{t+1}$, the side information $S_{t+1}$, and the decoding function $D_{t+1}$ using the history of the game up to round $t$, $H_t := (A_1, Y_1, \ldots, A_t, Y_t)$.

Remark 1. In most distributed learning settings, the communication from the learner to the client (downlink communication) is not expensive. Therefore, we allow for schemes where the learner can, in principle, send the entire history of the game up to a particular round as side information for the next round. However, our proposed scheme will send only a single real number as side information at each round, which is more desirable in practical settings.

Remark 2. The client is memoryless in our setup; the client cannot use rewards from previous arm pulls to encode the current reward. This is motivated by the fact that, as observed in Hanna et al. (2022), in a typical distributed sequential decision-making problem, the client observing the reward is a low-memory sensor. Additionally, in some applications, the client at any given round may differ from those at the previous rounds.
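The following NumPy sketch walks through a single round of steps 1 to 4 above. It is a schematic illustration only: the encoder and decoder shown are placeholder center-and-scale maps (the specific CAS protocol is introduced in Section 4.1), and the numerical values of $P$, $\sigma^2$, $B$, and the reward distribution are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
P, sigma2 = 4.0, 1.0   # power budget and channel noise variance (illustrative)

def encode(x, s, theta):
    """Client-side encoder C_t: center the reward around the side
    information s and scale it by theta (a CAS-style map, see Section 4.1)."""
    return theta * (x - s)

def decode(y, s, theta):
    """Learner-side decoder D_t: invert the scaling and restore the center."""
    return y / theta + s

# One round of the game for a chosen arm with (unknown) mean mu.
mu, B = 0.7, 2.0
s_t = 0.0                       # side information sent by the learner
theta = np.sqrt(P) / B          # ensures E[(theta*(x - s))^2] <= P when E[(x - s)^2] <= B^2
x_t = mu + rng.normal()         # reward observed by the client
y_t = encode(x_t, s_t, theta) + rng.normal(0.0, np.sqrt(sigma2))   # AWGN output
x_hat = decode(y_t, s_t, theta)                                    # learner's reward estimate
print(f"true reward {x_t:.3f}, learner's estimate {x_hat:.3f}")
```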
Denote by $\Pi_T$ the set of all possible $T$-round bandit algorithms as described above, and denote by $\mathcal{C}$ the set of all possible encoding strategies satisfying (1). We note that $\Pi_T$ contains randomized algorithms, so our lower bounds hold for such algorithms, too. However, we will use a deterministic algorithm for our upper bounds.

Reward Distributions: We make the following standard assumptions on the reward distributions. We assume that the reward sequence $\{X_{t,i}\}_{t \ge 1}$ is i.i.d. for every arm $i \in [K]$ and that the rewards for different arms are independent of each other. Denote by $\mu_i$ the mean reward of arm $i$. We assume that $X_{1,i} - \mu_i$ is subgaussian with variance factor 1 (this is without loss of generality; all our regret bounds scale linearly with this factor), where a subgaussian random variable with variance factor $\alpha^2$ is defined below.

Definition 2.1 (Boucheron et al., 2013). We say a random variable $X$ is subgaussian with variance factor $\alpha^2$ if, for all $\lambda \in \mathbb{R}$, we have
$$\ln \mathbb{E}\left[\exp(\lambda X)\right] \le \frac{\lambda^2 \alpha^2}{2}.$$

Figure 1. The distributed bandit game over an AWGN channel: the client encodes the reward $X_{t,A_t}$ as $C_t(X_{t,A_t}, S_t)$, which passes through the AWGN channel to produce $Y_t$, which the learner then decodes via $D_t$.

Furthermore, for a subgaussian random variable $X$ with variance factor $\alpha^2$, we have $\mathbb{E}[X] = 0$ and $\mathbb{E}[X^2] \le \alpha^2$. Finally, for some $B \ge 1$, we assume that, for all $i \in [K]$,
$$\mathbb{E}\left[X_{1,i}^2\right] \le B^2. \tag{3}$$
(Since $X_{1,i} - \mu_i$ is subgaussian with variance factor 1, its second moment is bounded by 1, so we may assume that $B \ge 1$.)

Denote by $\mathcal{P}$ the set of all subgaussian distributions with variance factor 1 whose second moment satisfies (3). Then, $\mathcal{E} := \mathcal{P}^K$ describes the set of all possible bandit instances considered in our setup.

Regret: We define the regret for a bandit instance $\nu \in \mathcal{E}$, when a bandit algorithm $\pi \in \Pi_T$ and encoding strategies $C_1^T := (C_t)_{t \in [T]} \in \mathcal{C}^T$ are employed, as
$$R_T(\nu, \pi, C_1^T) := T\mu_{i^*} - \mathbb{E}\left[\sum_{t=1}^{T} X_{t,A_t}\right],$$
where $i^*$ is the arm with the largest mean for the bandit instance $\nu$. We are interested in characterizing the minimax regret
$$R_T^*(\mathrm{SNR}) := \inf_{\pi \in \Pi_T} \inf_{C_1^T \in \mathcal{C}^T} \sup_{\nu \in \mathcal{E}} R_T(\nu, \pi, C_1^T) \tag{4}$$
as a function of SNR, $K$, $B$, and $T$.

UCB bandit algorithm: We now describe the classical Upper Confidence Bound (UCB) bandit algorithm (Auer et al., 2002), on which we build our algorithms. At each round, UCB plays the arm with the largest upper confidence bound. For subgaussian rewards with variance factor $\alpha^2$, the upper confidence bound for an arm $k \in [K]$ after $t$ rounds is defined as
$$\mathrm{UCB}_k(t; \eta) = \begin{cases} \infty & \text{if } N_k(t) = 0, \\ \hat{\mu}_k(t) + \sqrt{\dfrac{2\eta \ln t}{N_k(t)}} & \text{otherwise}, \end{cases}$$
where $N_k(t)$ is the number of times arm $k$ has been played up to round $t$ and $\hat{\mu}_k(t)$ is the empirical mean estimate of arm $k$. It is not difficult to prove the following regret guarantee for the UCB algorithm (see, for instance, Lattimore & Szepesvári (2020, Theorem 7.2)).

Lemma 2.2. Suppose we have a classical bandit game over $T$ rounds and $K$ arms with the following reward feedback: (i) the reward feedback for all arms is subgaussian with variance factor $\alpha^2$; (ii) the second moment of the reward feedback is at most $B^2$. Then, the regret of the UCB algorithm with parameter $\eta = \alpha^2$ satisfies
$$R_T \le 8\sqrt{\alpha^2 KT \ln T} + 6KB.$$

3. Information-Theoretic Lower Bound

Our lower bound is stated as follows.

Theorem 3.1. For universal constants $c_1$ and $c_2$, where $0 < c_1 < 1$ and $c_2 > 0$, we have
$$R_T^*(\mathrm{SNR}) \ge c_1\left(\sqrt{\frac{KT}{\frac{1}{2}\log(1+\mathrm{SNR}) \wedge 1}} + KB\right)$$
for all $T \ge \frac{c_2 K}{\frac{1}{2}\log(1+\mathrm{SNR}) \wedge 1}$.

The proof of this lower bound adopts a similar high-level approach to the lower bound proofs in the standard setting (see, for instance, Lattimore & Szepesvári (2020, Chapter 15)). However, a crucial difference from the standard proofs is that we now have to account for the loss in information due to the rewards being sent over the Gaussian channel. We defer the proof to the Appendix.

Remark 3. In information theory, the term $\frac{1}{2}\log(1+\mathrm{SNR})$ is the capacity of the Gaussian channel. It is the maximum rate at which information can be transmitted with an arbitrarily small probability of error; see Cover & Thomas (2006) for details.

The following corollary of Theorem 3.1 will be useful for deriving matching upper bounds.

Corollary 3.2. For universal constants $c_1$ and $c_2$, where $0 < c_1 < 1$ and $c_2 > 0$, we have
$$R_T^*(\mathrm{SNR}) \ge c_1\left(\sqrt{\frac{KT}{\mathrm{SNR} \wedge 1}} + KB\right)$$
for all $T \ge \frac{c_2 K}{\frac{1}{2}\log(1+\mathrm{SNR}) \wedge 1}$.

Proof.
The proof follows by combining Theorem 3.1 with the fact that $\ln(1+x) \le x$ for $x \ge 0$.

We note that while $\mathrm{SNR} \wedge 1$ and $\frac{1}{2}\log(1+\mathrm{SNR}) \wedge 1$ may appear quite different at first glance, they are, in fact, the same up to constant factors: $\frac{1}{2}\log(1+\mathrm{SNR})$ is both upper and lower bounded by a constant times SNR for $\mathrm{SNR} \in [0, 3]$, whereas for $\mathrm{SNR} > 3$ both terms take the value 1.

4. Bandit Algorithms over an AWGN Channel

4.1. Our Encoding Algorithm: CAS

A crucial component of all our schemes is an encoding strategy we refer to as centered amplitude scaling (CAS). CAS simply centers the input $X$ around the side information $S$ and multiplies the result by $\theta$ to get
$$C_{\mathrm{CAS}}(X, S; \theta) := \theta (X - S),$$
which is transmitted over the AWGN channel. If the input and side information are such that $\mathbb{E}\left[(X - S)^2\right] \le \frac{P}{\theta^2}$, then $C_{\mathrm{CAS}}(X, S; \theta)$ satisfies the power constraint (1).

We now build up to our main algorithm in the following two subsections.

4.2. Warm-up: UCB0 bandit algorithm

We begin by describing a simple warm-up scheme which uses CAS with parameter $\theta = \frac{\sqrt{P}}{B}$ as the encoding algorithm at each round, together with the UCB0 bandit algorithm described below. UCB0 is a standard bandit algorithm that sends no side information at any round (equivalently, it sends zero as the side information at all rounds). Additionally, UCB0 simply scales the channel output $Y_t$ by the inverse of the scaling parameter used by CAS,
$$\tilde{X}_{t,A_t} = \frac{Y_t}{\theta} = \frac{B}{\sqrt{P}}\, Y_t,$$
which it uses as the reward estimate for round $t$. UCB0, along with its interaction with CAS, is described in Algorithm 1.

Parameters: $\eta = \frac{B^2}{\mathrm{SNR}} + 1$, $\theta = \frac{\sqrt{P}}{B}$
1: for $t \in [T]$ do
2:   At Learner:
3:     $A_t = \arg\max_{k \in [K]} \mathrm{UCB}_k(t-1; \eta)$ and $S_t = 0$
4:     Send $A_t$, $S_t$ to Client
5:   At Client:
6:     Play arm $A_t$ and observe $X_{t,A_t}$
7:     Send $C_{\mathrm{CAS}}(X_{t,A_t}, 0; \theta)$ over the AWGN channel
8:   At Learner:
9:     Observe AWGN output $Y_t = C_{\mathrm{CAS}}(X_{t,A_t}, 0; \theta) + Z_t$
10:    Use $\tilde{X}_{t,A_t} = \frac{Y_t}{\theta}$ as the $t$th-round reward
11:    Update the UCB index for all arms
Algorithm 1. UCB0 bandit algorithm along with the CAS encoding algorithm

Theorem 4.1. For all bandit instances $\nu \in \mathcal{E}$, the UCB0 algorithm along with CAS encoding satisfies the power constraint (1) at all rounds and has the following regret guarantee:
$$R_T(\nu, \text{UCB0}, \mathrm{CAS}) \le 8\sqrt{\left(\frac{B^2}{\mathrm{SNR}} + 1\right) KT \ln T} + 6KB.$$

Proof. Using (3), we have $\mathbb{E}\left[\left(\frac{\sqrt{P}}{B} X_{t,A_t}\right)^2\right] \le P$, so the power constraint is satisfied. We now show that the reward estimates $\tilde{X}_{t,A_t}$ used by the UCB algorithm are unbiased and subgaussian with a variance factor of $\frac{B^2}{\mathrm{SNR}} + 1$. First, notice that
$$\tilde{X}_{t,A_t} = \frac{B}{\sqrt{P}}\, Y_t = \frac{B}{\sqrt{P}}\left(\frac{\sqrt{P}}{B} X_{t,A_t} + Z_t\right) = X_{t,A_t} + \frac{B}{\sqrt{P}}\, Z_t.$$
Note that $X_{t,A_t} - \mu_{A_t}$ and $Z_t$ are independent and subgaussian with variance factors 1 and $\sigma^2$, respectively. Therefore, the centered reward estimate $\tilde{X}_{t,A_t} - \mu_{A_t}$ is subgaussian with variance factor $1 + \frac{B^2 \sigma^2}{P} = \frac{B^2}{\mathrm{SNR}} + 1$. Hence, this algorithm reduces our problem to playing a bandit game with subgaussian rewards that have a variance factor of $\alpha^2 = \frac{B^2}{\mathrm{SNR}} + 1$. Since the learner uses the UCB algorithm with parameter $\eta = \alpha^2$, we can bound the regret using Lemma 2.2.

Remark 4. The regret bound of UCB0 matches our lower bound when $B = O(1)$. However, for larger values of $B$, the regret bound of UCB0 is far from our lower bound. While this is a significant weakness of UCB0, it is still appealing in settings where the learner does not want to send any side information to the client due to privacy concerns.
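The NumPy sketch below simulates Algorithm 1 (UCB0 with CAS) end to end. It is a rough illustration under our own assumptions: the reward distributions, horizon, and the constant in the UCB exploration bonus are illustrative choices and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
K, T = 5, 20_000
P, sigma2 = 4.0, 1.0
SNR = P / sigma2
B = 2.0
mu = rng.uniform(-1.0, 1.0, size=K)        # unknown means (illustrative)

theta = np.sqrt(P) / B                      # CAS scaling used by UCB0
eta = B**2 / SNR + 1.0                      # variance factor of decoded rewards

counts = np.zeros(K)
means = np.zeros(K)
regret = 0.0
for t in range(1, T + 1):
    # UCB index with variance factor eta (the bonus constant is illustrative).
    with np.errstate(divide="ignore", invalid="ignore"):
        bonus = np.sqrt(2.0 * eta * np.log(t) / counts)
    ucb = np.where(counts == 0, np.inf, means + bonus)
    a = int(np.argmax(ucb))

    x = mu[a] + rng.normal()                                    # client's reward
    y = theta * (x - 0.0) + rng.normal(0.0, np.sqrt(sigma2))    # CAS + AWGN
    x_hat = y / theta                                           # decoded reward

    counts[a] += 1
    means[a] += (x_hat - means[a]) / counts[a]
    regret += mu.max() - mu[a]

print(f"cumulative (pseudo-)regret after {T} rounds: {regret:.1f}")
```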
4.3. UE-UCB bandit algorithm

The regret bound in Theorem 4.1 is off by a multiplicative factor of $O(B)$ from our lower bound. This gap results from our encoding algorithm, which leads to subgaussian rewards with a variance factor of $\frac{B^2}{\mathrm{SNR}}$. On the other hand, to match the lower bound in Theorem 3.1, one requires a strategy such that the encoded rewards are subgaussian with a variance factor of $O\!\left(\frac{1}{\mathrm{SNR}}\right)$. We take the first step towards such an algorithm in our second scheme, in which we combine a two-phase bandit algorithm, Uniform Exploration-Upper Confidence Bound (UE-UCB), with the CAS encoding algorithm. In the first phase, UE-UCB performs uniform exploration, and in the second phase, UE-UCB uses the UCB algorithm to select the arm. We now describe in detail the UE-UCB algorithm and the encoding strategy at the client.

Phase 1: In the first phase, the learner uniformly explores all arms and sends no side information to the client. The communication strategy in this phase is the same as in the previous scheme. In particular, we use CAS with parameter $\theta_1 = \frac{\sqrt{P}}{B}$ for encoding at the client. For decoding at the learner, UE-UCB scales the channel output $Y_t$ by $1/\theta_1$ to get
$$\tilde{X}_{t,A_t} = \frac{Y_t}{\theta_1} = \frac{B}{\sqrt{P}}\, Y_t.$$
We run phase one of UE-UCB for $K\tau$ rounds, where $\tau = \frac{B^2}{\mathrm{SNR}} + 1$. (For simplicity of notation, we assume that $\frac{B^2}{\mathrm{SNR}}$, $\frac{1}{\mathrm{SNR}}$, and $\log B$ are integers throughout the paper. If not, the rounding only slightly affects the constant factors in our results.)

Phase 2: For all $k \in [K]$, denote by $\hat{\mu}_k$ the mean estimate formed for arm $k$ at the end of the first phase. In the second phase, the learner runs the classic UCB algorithm, treating $\tilde{X}_{t,A_t}$ as the reward, to decide on an arm $A_t$. It then informs the client of the arm $A_t$ and the mean estimate formed for this arm after phase one, $\hat{\mu}_{A_t}$. On the client's side, the reward corresponding to $A_t$ is encoded using CAS with parameter $\theta_2 = \sqrt{\frac{P}{2}}$:
$$C_{\mathrm{CAS}}(X_{t,A_t}, \hat{\mu}_{A_t}; \theta_2) := \sqrt{\frac{P}{2}}\,(X_{t,A_t} - \hat{\mu}_{A_t}).$$
The decoding at the learner is simply done by scaling $Y_t$ by $\frac{1}{\theta_2}$ and adding the mean estimate $\hat{\mu}_{A_t}$ to get
$$\tilde{X}_{t,A_t} = \sqrt{\frac{2}{P}}\, Y_t + \hat{\mu}_{A_t}.$$
UE-UCB, along with its interaction with CAS, is described in Algorithm 2.

Phase 1 parameters: $\theta_1 = \frac{\sqrt{P}}{B}$, $\tau = \frac{B^2}{\mathrm{SNR}} + 1$
1: for $k \in [K]$ do
2:   for $t \in \{(k-1)\tau + 1, \ldots, k\tau\}$ do
3:     At Learner:
4:       $A_t = k$ and $S_t = 0$
5:       Send $A_t$, $S_t$ to Client
6:     At Client:
7:       Play arm $A_t$ and observe $X_{t,A_t}$
8:       Send $C_{\mathrm{CAS}}(X_{t,A_t}, 0; \theta_1)$ over the AWGN channel
9:     At Learner:
10:      Observe AWGN output $Y_t = C_{\mathrm{CAS}}(X_{t,A_t}, 0; \theta_1) + Z_t$
11:      Use $\tilde{X}_{t,A_t} = \frac{Y_t}{\theta_1}$ as the $t$th-round reward
12:  At Learner:
13:    Form a mean estimate for arm $k$: $\hat{\mu}_k = \frac{1}{\tau}\sum_{t=(k-1)\tau+1}^{k\tau} \tilde{X}_{t,k}$
Phase 2 parameters: $B_2^2 = 2$, $\eta_2 = \frac{2}{\mathrm{SNR}} + 1$, $\theta_2 = \sqrt{\frac{P}{2}}$
14: for $t \in \{K\tau + 1, \ldots, T\}$ do
15:   At Learner:
16:     $A_t = \arg\max_{k \in [K]} \mathrm{UCB}_k(t-1; \eta_2)$ and $S_t = \hat{\mu}_{A_t}$
17:     Send $A_t$, $S_t$ to Client
18:   At Client:
19:     Play arm $A_t$ and observe $X_{t,A_t}$
20:     Send $C_{\mathrm{CAS}}(X_{t,A_t}, S_t; \theta_2)$ over the AWGN channel
21:   At Learner:
22:     Observe AWGN output $Y_t = C_{\mathrm{CAS}}(X_{t,A_t}, S_t; \theta_2) + Z_t$
23:     Use $\tilde{X}_{t,A_t} = \frac{Y_t}{\theta_2} + S_t$ as the $t$th-round reward
24:     Update the UCB index for all arms
Algorithm 2. UE-UCB algorithm along with the CAS encoding algorithm

Remark 5. We will show in the proof of Theorem 4.2 that $\mathbb{E}\left[(X_{t,A_t} - \hat{\mu}_{A_t})^2\right] \le 2$. Thus, our choice of $\theta_2$ ensures that the power constraint is satisfied. Moreover, for $B \ge 2$, $\theta_2$ is greater than $\theta_1$, which essentially means that the client can scale the centered reward in the second phase by a bigger factor than the scaling factor used in Algorithm 1. In both algorithms, since the learner multiplies $Y_t$ by the inverse of the scaling factor used at the client, the larger scaling factor results in smaller variances for the decoded rewards than in Algorithm 1.
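The following small sketch makes the point of Remark 5 numerically: with the phase-1 mean estimates available for centering, the phase-2 scaling $\theta_2$ can be much larger than $\theta_1$, so the decoded rewards are far less noisy. The variance-factor formula $1 + \sigma^2/\theta^2$ follows from the decoding step above; the specific values of $P$, $\sigma^2$, and $B$ are our own illustrative choices.

```python
import numpy as np

# Illustrative comparison of the CAS scaling factors in UE-UCB (values are ours).
P, sigma2, B = 4.0, 1.0, 8.0
SNR = P / sigma2

theta1 = np.sqrt(P) / B          # phase-1 CAS scaling (as in Algorithm 1)
theta2 = np.sqrt(P / 2.0)        # phase-2 CAS scaling after centering

# Variance factor of the decoded reward = 1 + sigma2 / theta^2.
var_phase1 = 1.0 + sigma2 / theta1**2    # = B^2/SNR + 1
var_phase2 = 1.0 + sigma2 / theta2**2    # = 2/SNR + 1

print(f"theta1 = {theta1:.3f}, theta2 = {theta2:.3f}")
print(f"decoded-reward variance factor: phase 1 = {var_phase1:.2f}, "
      f"phase 2 = {var_phase2:.2f}")
```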
Theorem 4.2. For all bandit instances $\nu \in \mathcal{E}$, the UE-UCB algorithm along with the CAS encoding described in Algorithm 2 satisfies the power constraint (1) at all rounds and has the following regret guarantee:
$$R_T(\nu, \text{UE-UCB}, \mathrm{CAS}) \le \frac{2KB^3}{\mathrm{SNR}} + 8KB + 8\sqrt{\left(\frac{2}{\mathrm{SNR}} + 1\right) KT \ln T}.$$

Proof. In the first phase, we incur regret that is linear in the duration of the phase. Specifically, since the time horizon of the first phase is $K\tau = K\left(\frac{B^2}{\mathrm{SNR}} + 1\right)$ and the gap between the mean of the optimal arm and a suboptimal arm can be at most $2B$, the regret incurred in the first phase is at most $\left(\frac{KB^2}{\mathrm{SNR}} + K\right) \cdot 2B$. Moreover, since the encoding of the rewards on the client's side and the decoding of the rewards on the learner's side in the first phase are the same as in Algorithm 1, by the same arguments, the power constraint is satisfied, and for all $t \in \{(k-1)\tau + 1, \ldots, k\tau\}$,
$$\mathbb{E}\left[(\tilde{X}_{t,k} - \mu_k)^2\right] \le \frac{B^2}{\mathrm{SNR}} + 1. \tag{5}$$

For the second phase, we first show that the power constraint is satisfied. Towards proving this, we bound the variance of the mean estimates formed after the first phase. From the description of Algorithm 2, for all $k \in [K]$,
$$\mathbb{E}\left[(\hat{\mu}_k - \mu_k)^2\right] = \frac{1}{\tau^2} \sum_{t=(k-1)\tau+1}^{k\tau} \mathbb{E}\left[(\tilde{X}_{t,k} - \mu_k)^2\right] \le 1,$$
where the inequality follows from (5) and the fact that $\tau = \frac{B^2}{\mathrm{SNR}} + 1$. Now, note that for any round $t$ in phase 2,
$$\mathbb{E}\left[(X_{t,A_t} - \hat{\mu}_{A_t})^2\right] = \mathbb{E}\left[(X_{t,A_t} - \mu_{A_t})^2\right] + \mathbb{E}\left[(\hat{\mu}_{A_t} - \mu_{A_t})^2\right] \le 2,$$
where the equality follows from the independence of $X_{t,A_t}$ and $\hat{\mu}_{A_t}$, and the inequality follows from the fact that both $X_{t,A_t} - \mu_{A_t}$ and $\hat{\mu}_{A_t} - \mu_{A_t}$ are subgaussian with a variance factor of 1. Therefore, scaling the centered reward $X_{t,A_t} - \hat{\mu}_{A_t}$ by $\sqrt{\frac{P}{2}}$ satisfies the power constraint.

The reward estimate used by the learner in phase 2 is
$$\tilde{X}_{t,A_t} = \sqrt{\frac{2}{P}}\, Y_t + \hat{\mu}_{A_t} = X_{t,A_t} + \sqrt{\frac{2}{P}}\, Z_t.$$
This implies that the centered reward estimate $\tilde{X}_{t,A_t} - \mu_{A_t}$ is subgaussian with a variance factor of $\frac{2}{\mathrm{SNR}} + 1$ in the second phase. The proof is completed by upper bounding the regret in the second phase using Lemma 2.2 and summing the regret over the two phases.

4.4. UE-UCB++ bandit algorithm

Algorithm 2 improves over Algorithm 1 in the sense that the $T$-dependent term in its regret upper bound does not have the multiplicative $O(B)$ dependence. This term matches the lower bound in Theorem 3.1 up to logarithmic factors in $T$. However, this comes at the cost of an additional additive term of $O\!\left(\frac{KB^3}{\mathrm{SNR}}\right)$ in the regret bound of Algorithm 2. The strong dependence on $B$ in this additive term means that, for large values of $B$, Algorithm 2 can be far from optimal. In this section, we present our main scheme, which improves on this shortcoming of Algorithm 2 and is only off by a minor additive term from our lower bound.

Parameters: $L = 2\log B$, $\tau = \frac{2}{\mathrm{SNR} \wedge 1}$, $B_1^2 = B^2$
1: for $\ell \in [L]$ do: set $B_{\ell+1}^2 = \frac{B_\ell^2/\mathrm{SNR} + 1}{\tau} + 1$ and $\theta_\ell = \frac{\sqrt{P}}{B_\ell}$
2: for $k \in [K]$ do: set $\hat{\mu}_{0,k} = 0$
Phase 1
3: for $\ell \in [L]$ do
4:   for $k \in [K]$ do
5:     for $t \in \{(\ell-1)K\tau + (k-1)\tau + 1, \ldots, (\ell-1)K\tau + k\tau\}$ do
6:       At Learner:
7:         $A_t = k$ and $S_t = \hat{\mu}_{\ell-1,k}$
8:         Send $A_t$, $S_t$ to Client
9:       At Client:
10:        Play arm $A_t$ and observe $X_{t,A_t}$
11:        Send $C_{\mathrm{CAS}}(X_{t,A_t}, S_t; \theta_\ell)$ over the AWGN channel
12:      At Learner:
13:        Observe AWGN output $Y_t = C_{\mathrm{CAS}}(X_{t,A_t}, S_t; \theta_\ell) + Z_t$
14:        Use $\tilde{X}_{t,A_t} = \frac{Y_t}{\theta_\ell} + S_t$ as the $t$th-round reward
15:    At Learner:
16:      Form a mean estimate for arm $k$: $\hat{\mu}_{\ell,k} = \frac{1}{\tau}\sum_{t=(\ell-1)K\tau+(k-1)\tau+1}^{(\ell-1)K\tau+k\tau} \tilde{X}_{t,k}$
Phase 2 parameters: $\theta_{L+1} = \frac{\sqrt{P}}{B_{L+1}}$, $\eta_{L+1} = \frac{B_{L+1}^2}{\mathrm{SNR}} + 1$
17: for $t \in \{LK\tau + 1, \ldots, T\}$ do
18:   At Learner:
19:     $A_t = \arg\max_{k \in [K]} \mathrm{UCB}_k(t-1; \eta_{L+1})$ and $S_t = \hat{\mu}_{L,A_t}$
20:     Send $A_t$, $S_t$ to Client
21:   At Client:
22:     Play arm $A_t$ and observe $X_{t,A_t}$
23:     Send $C_{\mathrm{CAS}}(X_{t,A_t}, S_t; \theta_{L+1})$ over the AWGN channel
24:   At Learner:
25:     Observe AWGN output $Y_t = C_{\mathrm{CAS}}(X_{t,A_t}, S_t; \theta_{L+1}) + Z_t$
26:     Use $\tilde{X}_{t,A_t} = \frac{Y_t}{\theta_{L+1}} + S_t$ as the $t$th-round reward
27:     Update the UCB index for all arms
Algorithm 3. UE-UCB++ algorithm along with the CAS encoding algorithm
Our scheme combines an improved version of the UE-UCB algorithm, UE-UCB++, with the CAS encoding algorithm. Like UE-UCB, UE-UCB++ is also a two-phase bandit algorithm that performs uniform exploration in the first phase and runs the UCB algorithm in the second phase. The second phase proceeds in the same manner as in the UE-UCB algorithm; the improvement over UE-UCB is in phase 1 of the algorithm.

Phase 1: UE-UCB++, like UE-UCB, performs uniform exploration in all rounds of phase 1. However, phase 1 is divided into $L$ different sub-phases, and the encoding of the rewards differs across sub-phases. The mean estimates formed in the previous sub-phase are used as side information in the current sub-phase and are sent to the client for better encoding. In more detail, we have $L = 2\log B$ sub-phases in total, and each sub-phase runs for $K\tau$ rounds, where
$$\tau = \frac{2}{\mathrm{SNR} \wedge 1}. \tag{6}$$
Thus, phase 1 runs for $LK\tau = \frac{4K\log B}{\mathrm{SNR} \wedge 1}$ rounds.

Remark 6. Phase 1 of UE-UCB++ runs for $O\!\left(\frac{K\log B}{\mathrm{SNR} \wedge 1}\right)$ rounds, as opposed to the $O\!\left(\frac{KB^2}{\mathrm{SNR}}\right)$ rounds used in phase 1 of UE-UCB. We will show that, despite performing uniform exploration for significantly fewer rounds for all arms, we end up with mean estimates of the same variance as the mean estimates formed after phase 1 of UE-UCB, because of a superior encoding protocol in this phase.

The first sub-phase proceeds in exactly the same manner as phase 1 of Algorithm 2. To describe the algorithm in the subsequent sub-phases, we set up some notation: for all $k \in [K]$, denote by $\hat{\mu}_{\ell,k}$ the mean estimate formed for arm $k$ at the end of the $\ell$th sub-phase. For a round $t$ in the $\ell$th sub-phase, the learner informs the client of the arm $A_t$ it wants to play and the mean estimate formed for this arm in the $(\ell-1)$th sub-phase, $\hat{\mu}_{\ell-1,A_t}$. On the client's side, the reward corresponding to $A_t$ is encoded using CAS with parameter $\theta_\ell = \frac{\sqrt{P}}{B_\ell}$:
$$C_{\mathrm{CAS}}(X_{t,A_t}, \hat{\mu}_{\ell-1,A_t}; \theta_\ell) := \frac{\sqrt{P}}{B_\ell}\,(X_{t,A_t} - \hat{\mu}_{\ell-1,A_t}).$$
Here $B_\ell$, $\ell \in [L]$, will be shown to be an upper bound on the $L_2$-norm of the random variable $X_{t,A_t} - \hat{\mu}_{\ell-1,A_t}$ (see the proof of Theorem 4.3) and is described by the recursion
$$B_{\ell+1}^2 = \frac{\frac{B_\ell^2}{\mathrm{SNR}} + 1}{\tau} + 1 \quad \text{and} \quad B_1^2 = B^2. \tag{7}$$
The decoding at the learner is simply done by scaling $Y_t$ by $\frac{1}{\theta_\ell}$ and adding the mean estimate $\hat{\mu}_{\ell-1,A_t}$ to get
$$\tilde{X}_{t,A_t} = \frac{B_\ell}{\sqrt{P}}\, Y_t + \hat{\mu}_{\ell-1,A_t}.$$

Phase 2: Phase 2 proceeds in the same manner as phase 2 of Algorithm 2, with $\theta_{L+1} = \frac{\sqrt{P}}{B_{L+1}}$ used for encoding by CAS and the parameter $\eta_{L+1}$ of the UCB algorithm set to $\frac{B_{L+1}^2}{\mathrm{SNR}} + 1$.

Remark 7. Equation (7) sheds some light on why UE-UCB++ requires fewer uniform exploration rounds than UE-UCB to obtain mean estimates of similar variance. Essentially, the centered reward in the $\ell$th sub-phase has a much smaller $L_2$-norm than the centered reward in the $(\ell-1)$th sub-phase. This smaller second moment allows the client to scale the centered reward by a larger factor, leading to smaller variances for the decoded rewards. This, in turn, leads to the mean estimates formed in that sub-phase having smaller variances, i.e., being more accurate. Then, in the $(\ell+1)$th sub-phase, since we center the rewards around these more accurate mean estimates, the centered rewards have even smaller $L_2$-norms. This positive reinforcement cycle leads to the variance of the mean estimates decreasing exponentially with the number of sub-phases, which, in turn, means that far fewer uniform exploration rounds are needed to obtain mean estimates with the same variance as those after phase 1 of UE-UCB; see the numerical sketch below for an illustration of this decay.
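The short script below iterates the recursion in (7) with $\tau = \frac{2}{\mathrm{SNR}\wedge 1}$ and $L = 2\log B$ sub-phases, showing how $B_\ell^2$ collapses from $B^2$ to a small constant. It is our own numerical illustration of the recursion, with illustrative values of $P$, $\sigma^2$, and $B$.

```python
import numpy as np

# Illustrative simulation of recursion (7): with tau = 2/(SNR ∧ 1) and
# L = 2*log2(B) sub-phases, B_{L+1}^2 drops below 4 (values are ours).
P, sigma2, B = 1.0, 2.0, 64.0
SNR = P / sigma2
tau = 2.0 / min(SNR, 1.0)
L = int(2 * np.log2(B))

B_sq = B**2                                  # B_1^2 = B^2
for ell in range(1, L + 1):
    B_sq = (B_sq / SNR + 1.0) / tau + 1.0    # B_{l+1}^2 from (7)
print(f"L = {L} sub-phases, B_{{L+1}}^2 = {B_sq:.3f} (<= 4, as shown in the proof)")
```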
UE-UCB++, along with its interaction with CAS, is described in Algorithm 3.

Theorem 4.3. For all bandit instances $\nu \in \mathcal{E}$, the UE-UCB++ bandit algorithm along with the CAS encoding algorithm described in Algorithm 3 satisfies the power constraint (1) at all rounds and has the following regret guarantee:
$$R_T(\nu, \text{UE-UCB++}, \mathrm{CAS}) \le \frac{8KB\log B}{\mathrm{SNR} \wedge 1} + 6KB + 8\sqrt{\left(\frac{4}{\mathrm{SNR}} + 1\right) KT \ln T}.$$

The key to the proof of Theorem 4.3 is to show that the second moment of the centered reward $X_{t,A_t} - \hat{\mu}_{\ell-1,A_t}$ is at most $B_\ell^2$, $\ell \in [L]$, when round $t$ belongs to the $\ell$th sub-phase of phase 1, and at most $B_{L+1}^2$ when round $t$ belongs to phase 2. This can be shown by arguments similar to those in the proof of Theorem 4.2. We defer the details to the Appendix.

Remark 8. The $\ln T$ factor in the third term of Theorem 4.3 can be removed by using MOSS (Audibert & Bubeck, 2009) instead of the UCB algorithm. Our work focuses on understanding how the regret grows as a function of SNR; we are not as concerned with logarithmic factors in $T$. Hence, we chose the more popular UCB algorithm.

5. Concluding Remarks

The near-optimal Algorithm 3 uses a highly adaptive encoding protocol, where the history of the game is used to refine the encoding in a particular round. On the other hand, Acharya et al. (2021) showed that, for distributed convex, Lipschitz optimization under information constraints, a fixed, non-adaptive encoding protocol gives optimal performance. This suggests an interesting distinction between encoding for distributed sequential decision-making problems and encoding for distributed optimization problems.

An interesting open problem is closing the additive $\frac{8KB\log B}{\mathrm{SNR} \wedge 1}$ gap between our upper and lower bounds. Furthermore, the refinement of our model where the learner-to-client communication is also over an AWGN channel, as well as the multi-client setting, are interesting directions for future research.

Acknowledgement

This work is supported by the Singapore National Research Foundation (NRF) Fellowship programme (Grant Numbers: A-0005077-01-00 and A-0008064-00-00) and an MOE Tier 1 Grant (Grant Number: A-0009042-01-00).

References

Acharya, J., Canonne, C., Mayekar, P., and Tyagi, H. Information-constrained optimization: Can adaptive processing of gradients help? Advances in Neural Information Processing Systems, 34:7126–7138, 2021.

Amiri, M. M. and Gündüz, D. Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air. IEEE Transactions on Signal Processing, 68:2155–2169, 2020.

Audibert, J.-Y. and Bubeck, S. Minimax policies for adversarial and stochastic bandits. In COLT, volume 7, pp. 1–122, 2009.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. 2nd edition. John Wiley & Sons, Inc., 2006.

Csiszár, I., Shields, P. C., et al. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–528, 2004.
Dragomir, S. Upper and lower bounds for Csiszár's f-divergence in terms of the Kullback-Leibler distance and applications. Inequalities for Csiszár's f-Divergence, 2000.

Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

Gandikota, V., Kane, D., Maity, R. K., and Mazumdar, A. vqSGD: Vector quantized stochastic gradient descent. ICML Workshop on Security and Privacy in Machine Learning, 2019.

Hanna, O. A., Yang, L., and Fragouli, C. Solving multi-arm bandit using a few bits of communication. In International Conference on Artificial Intelligence and Statistics, pp. 11215–11236. PMLR, 2022.

Jha, S. K., Mayekar, P., and Tyagi, H. Fundamental limits of over-the-air optimization: Are analog schemes optimal? IEEE Journal on Selected Areas in Information Theory, 3(2):217–228, 2022.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021.

Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. NIPS Workshop on Private Multi-Party Machine Learning, 2016.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Lin, C.-Y., Kostina, V., and Hassibi, B. Differentially quantized gradient descent. In IEEE International Symposium on Information Theory (ISIT), pp. 1200–1205, 2021.

Mayekar, P. and Tyagi, H. RATQ: A universal fixed-length quantizer for stochastic optimization. IEEE Transactions on Information Theory, 2020.

Mitra, A., Jaafar, R., Pappas, G. J., and Hassani, H. Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. Advances in Neural Information Processing Systems, 34:14606–14619, 2021.

Mitra, A., Hassani, H., and Pappas, G. J. Linear stochastic bandits over a bit-constrained channel. arXiv preprint arXiv:2203.01198, 2022.

Shannon, C. E. A mathematical theory of communication. The Bell System Technical Journal, 27(4):623–656, 1948. doi: 10.1002/j.1538-7305.1948.tb00917.x.

Tse, D. and Viswanath, P. Fundamentals of Wireless Communication. Cambridge University Press, 2005.

A. Proof of Theorem 4.3 (Regret Bound of UE-UCB++)

We first show that the centered rewards transmitted by the client satisfy the power constraint. Note that the centered reward encoded by the client at any round $t$ in the $\ell$th sub-phase of phase 1 is $X_{t,A_t} - \hat{\mu}_{\ell-1,A_t}$, and the client scales it by a factor of $\frac{\sqrt{P}}{B_\ell}$. Thus, we need to show that $\mathbb{E}\left[(X_{t,A_t} - \hat{\mu}_{\ell-1,A_t})^2\right] \le B_\ell^2$. Towards proving this, we write
$$\mathbb{E}\left[(X_{t,A_t} - \hat{\mu}_{\ell-1,A_t})^2\right] = \mathbb{E}\left[(X_{t,A_t} - \mu_{A_t})^2\right] + \mathbb{E}\left[(\hat{\mu}_{\ell-1,A_t} - \mu_{A_t})^2\right] \le 1 + \mathbb{E}\left[(\hat{\mu}_{\ell-1,A_t} - \mu_{A_t})^2\right].$$
From the description of Algorithm 3, for $k \in [K]$, we have
$$\hat{\mu}_{\ell-1,k} = \frac{1}{\tau} \sum_{t=(\ell-2)K\tau+(k-1)\tau+1}^{(\ell-2)K\tau+k\tau} \tilde{X}_{t,k},$$
and, by the same argument as for (5), each $\tilde{X}_{t,k} - \mu_k$ in this sum has second moment at most $\frac{B_{\ell-1}^2}{\mathrm{SNR}} + 1$. Therefore,
$$\mathbb{E}\left[(\hat{\mu}_{\ell-1,A_t} - \mu_{A_t})^2\right] \le \frac{\frac{B_{\ell-1}^2}{\mathrm{SNR}} + 1}{\tau}.$$
Using the preceding inequalities and the definition of $B_\ell$ in (7), we get
$$\mathbb{E}\left[(X_{t,A_t} - \hat{\mu}_{\ell-1,A_t})^2\right] \le 1 + \frac{\frac{B_{\ell-1}^2}{\mathrm{SNR}} + 1}{\tau} = B_\ell^2.$$

To prove that the power constraint is satisfied in the second phase, note that the centered reward transmitted by the client in the second phase is $X_{t,A_t} - \hat{\mu}_{L,A_t}$, and the client scales it by a factor of $\frac{\sqrt{P}}{B_{L+1}}$. By precisely the same arguments as above, we can deduce that $\mathbb{E}\left[(X_{t,A_t} - \hat{\mu}_{L,A_t})^2\right] \le B_{L+1}^2$.
To bound the regret, first note that we incur linear regret in phase 1 of UE-UCB++. Specifically, since the time horizon of phase 1 is $\frac{4K\log B}{\mathrm{SNR} \wedge 1}$ and each round's regret is at most $2B$, we incur a total regret of at most $\frac{8KB\log B}{\mathrm{SNR} \wedge 1}$ in the first phase.

To bound the regret in the second phase, note that the centered reward estimate $\tilde{X}_{t,A_t} - \mu_{A_t}$ in the second phase is subgaussian with a variance factor of $\frac{B_{L+1}^2}{\mathrm{SNR}} + 1$. Substituting the value of $\tau$ from (6) into (7), we obtain
$$B_{\ell+1}^2 = \frac{\mathrm{SNR} \wedge 1}{2}\left(\frac{B_\ell^2}{\mathrm{SNR}} + 1\right) + 1 \le \frac{B_\ell^2}{2} + \frac{\mathrm{SNR} \wedge 1}{2} + 1.$$
Rearranging the terms, we have
$$B_{\ell+1}^2 - \left(\mathrm{SNR} \wedge 1 + 2\right) \le \frac{B_\ell^2 - \left(\mathrm{SNR} \wedge 1 + 2\right)}{2}.$$
Recursively applying this inequality, it follows that
$$B_{L+1}^2 - \left(\mathrm{SNR} \wedge 1 + 2\right) \le \frac{B^2 - \left(\mathrm{SNR} \wedge 1 + 2\right)}{2^L}.$$
Substituting $L = 2\log B$ (so that $2^L = B^2$), we get that $B_{L+1}^2$ is less than 4. The proof is then completed by using Lemma 2.2 to bound the regret in the second phase and adding the regret from the first phase.

B. Proof of Theorem 3.1 (Lower Bound)

We first show that, for $T \ge c_2 K$ and a universal constant $c_1 \in (0, 1)$, we have $R_T^*(\mathrm{SNR}) \ge c_1 KB$. From the definition of $R_T^*(\mathrm{SNR})$ in (4), it is enough to prove a lower bound on the average regret when the problem instance is chosen uniformly at random from a finite subset of $\mathcal{E}$. We choose this subset to be the instances in which a single arm has deterministic reward $B$ and all other arms have deterministic reward $-B$. Therefore, we have $K$ different instances, each occurring with probability $1/K$. Now, observe that, regardless of the algorithm, the probability of finding the $B$-reward arm within the first $K/2$ arm pulls is at most $\frac{1}{2}$. Hence, the regret in the first $K/2$ arm pulls, averaged over the $K$ instances, must be at least $\frac{1}{2} \cdot \frac{K}{2} \cdot 2B = \frac{KB}{2} \ge c_1 KB$.

In the rest of the proof, we show that, for $T \ge \frac{c_2 K}{\frac{1}{2}\log(1+\mathrm{SNR}) \wedge 1}$, we have
$$R_T^*(\mathrm{SNR}) \ge c_1 \sqrt{\frac{KT}{\frac{1}{2}\log(1+\mathrm{SNR}) \wedge 1}}.$$
Then, noting that both bounds hold for $T \ge \frac{c_2 K}{\frac{1}{2}\log(1+\mathrm{SNR}) \wedge 1}$ and using the fact that $a \vee b \ge \frac{a+b}{2}$ completes the proof.

Consider the bandit instance $\nu$ in which the reward for each arm pull is either $1$ or $-1$ and the arms have the following means: $\mu_1 = \Delta$ and $\mu_i = 0$ for all $i \in \{2, \ldots, K\}$, where $\Delta$ is a parameter in $(0, 1/4)$ whose precise value we set later. Using the fact that the rewards for all arms lie in $[-1, 1]$ and Hoeffding's lemma (Boucheron et al., 2013, Lemma 2.2), it is easy to see that the centered rewards for all arms are subgaussian with a variance factor of 1. Therefore, $\nu \in \mathcal{E}$.

Now, fix a bandit algorithm $\pi \in \Pi_T$ and an encoding scheme $C^T \in \mathcal{C}^T$. Denote by $P^T$ the probability distribution of the sequence $(A_1, Y_1, A_2, Y_2, \ldots, A_T, Y_T)$ induced by the algorithm $\pi$ and encoding strategy $C^T$ on the instance $\nu$. Denote by $N_i(T)$ the number of times arm $i$ has been played up to and including round $T$. Let
$$j = \arg\min_{i > 1} \mathbb{E}_{P^T}\left[N_i(T)\right];$$
that is, $j$ is the arm pulled the least number of times among the arms $\{2, \ldots, K\}$. Trivially,
$$\mathbb{E}_{P^T}\left[N_j(T)\right] \le \frac{T}{K-1}. \tag{9}$$

Now consider another bandit instance $\nu'$ in which, again, the rewards are either $1$ or $-1$, with the following means: $\mu_1 = \Delta$, $\mu_j = 2\Delta$, and $\mu_i = 0$ for all $i \in \{2, \ldots, K\} \setminus \{j\}$. Let $Q^T$ be the probability distribution of the sequence $(A_1, Y_1, A_2, Y_2, \ldots, A_T, Y_T)$ induced by the algorithm $\pi$ and encoding strategy $C^T$ on the instance $\nu'$. By a simple calculation, we have
$$R_T(\nu, \pi, C) \ge \frac{T\Delta}{2}\, P^T\!\left(N_1(T) \le T/2\right) \quad \text{and} \quad R_T(\nu', \pi, C) \ge \frac{T\Delta}{2}\, Q^T\!\left(N_1(T) > T/2\right).$$
Combining the two inequalities, using the variational formula for the total variation distance, and applying Pinsker's inequality, we get
$$R_T(\nu, \pi, C) + R_T(\nu', \pi, C) \ge \frac{T\Delta}{2}\left(P^T\!\left(N_1(T) \le T/2\right) + Q^T\!\left(N_1(T) > T/2\right)\right) \ge \frac{T\Delta}{2}\left(1 - \left(P^T(N_1(T) > T/2) - Q^T(N_1(T) > T/2)\right)\right)$$
$$\ge \frac{T\Delta}{2}\left(1 - d_{\mathrm{TV}}(P^T, Q^T)\right) \ge \frac{T\Delta}{2}\left(1 - \sqrt{\tfrac{1}{2} D(P^T \,\|\, Q^T)}\right). \tag{10}$$

Note that $P^T$ and $Q^T$ differ only when arm $j$ is played. Recall that the dependence of $Y$ on an encoding strategy $C$ and the reward $X$ is given by (2). For an encoding scheme $C$ used in a round $t$, denote by $C(\cdot \,|\, x)$ the conditional distribution of $Y$ given that the reward is $X = x$. Let $p_C^j$ and $q_C^j$ denote the distributions of $Y$ when arm $j$ is played at round $t$ under $\nu$ and $\nu'$, respectively. Recalling the reward distribution of arm $j$, it follows that
$$p_C^j(y) = \frac{C(y \,|\, 1)}{2} + \frac{C(y \,|\, {-1})}{2}, \qquad q_C^j(y) = \frac{C(y \,|\, 1)}{2} + \frac{C(y \,|\, {-1})}{2} + \Delta\left(C(y \,|\, 1) - C(y \,|\, {-1})\right).$$
Moreover, using arguments similar to those in Lattimore & Szepesvári (2020, Lemma 15.1), we can prove the following lemma.

Lemma B.1. $D(P^T \,\|\, Q^T) \le \mathbb{E}_{P^T}\left[N_j(T)\right] \sup_{C \in \mathcal{C}} D(p_C^j \,\|\, q_C^j)$.

We defer the proof of the lemma to Appendix C and proceed with proving the lower bound. Combining Lemma B.1 with (9), we have
$$D(P^T \,\|\, Q^T) \le \frac{T}{K-1} \sup_{C \in \mathcal{C}} D(p_C^j \,\|\, q_C^j).$$
We will now use the following fact.

Fact 1. Recall that $d_{\chi^2}(p, q) := \int \frac{(p(x) - q(x))^2}{q(x)}\, dx$ denotes the $\chi^2$-distance between probability measures $p$ and $q$ when they are absolutely continuous with respect to the Lebesgue measure $dx$. Then, it holds that $D(p \,\|\, q) \le d_{\chi^2}(p, q)$.

Fact 1 is well known (see, for instance, Csiszár et al. (2004)), but we provide a short proof in Appendix C for completeness. Using this fact, $D(p_C^j \,\|\, q_C^j)$ can be bounded as follows:
$$D(p_C^j \,\|\, q_C^j) \le d_{\chi^2}(p_C^j, q_C^j) = \int \frac{\left(p_C^j(y) - q_C^j(y)\right)^2}{q_C^j(y)}\, dy = \Delta^2 \int \frac{\left(C(y \,|\, 1) - C(y \,|\, {-1})\right)^2}{\frac{C(y|1)}{2} + \frac{C(y|{-1})}{2} + \Delta\left(C(y|1) - C(y|{-1})\right)}\, dy.$$
Now note that, for $\Delta \le 1/4$, we have
$$\frac{C(y|1)}{2} + \frac{C(y|{-1})}{2} + \Delta\left(C(y|1) - C(y|{-1})\right) \ge \frac{C(y|1)}{4} + \frac{C(y|{-1})}{4}.$$
Using this fact, we get
$$D(p_C^j \,\|\, q_C^j) \le 2\Delta^2 \int \frac{\left(C(y|1) - C(y|{-1})\right)^2}{\frac{C(y|1)}{2} + \frac{C(y|{-1})}{2}}\, dy = 2\Delta^2\left(2\, d_{\chi^2}\!\left(C(\cdot|1), \tfrac{C(\cdot|1) + C(\cdot|{-1})}{2}\right) + 2\, d_{\chi^2}\!\left(C(\cdot|{-1}), \tfrac{C(\cdot|1) + C(\cdot|{-1})}{2}\right)\right)$$
$$\le 2\Delta^2\left(4\, D\!\left(C(\cdot|1) \,\Big\|\, \tfrac{C(\cdot|1) + C(\cdot|{-1})}{2}\right) + 4\, D\!\left(C(\cdot|{-1}) \,\Big\|\, \tfrac{C(\cdot|1) + C(\cdot|{-1})}{2}\right)\right),$$
where the equality is easily verified by substituting the definition of $d_{\chi^2}$, and the last inequality uses the following fact.
Fact 2. If $\frac{dP}{dQ}(x) \le c$ for all $x \in \mathbb{R}$, then $d_{\chi^2}(P, Q) \le c\, D(P \,\|\, Q)$.

Fact 2 is also known (Dragomir, 2000), but we provide a proof in Appendix C for completeness.

Denote by $V$ a hypothetical reward that takes the values $1$ and $-1$ with equal probability, and by $Y$ the output of the AWGN channel when $V$ is encoded to $C(V)$ so as to satisfy (1) and transmitted over the AWGN channel. Then,
$$\frac{1}{2} D\!\left(C(\cdot|1) \,\Big\|\, \tfrac{C(\cdot|1)+C(\cdot|{-1})}{2}\right) + \frac{1}{2} D\!\left(C(\cdot|{-1}) \,\Big\|\, \tfrac{C(\cdot|1)+C(\cdot|{-1})}{2}\right)$$
is the mutual information between $V$ and $Y$, denoted by $I(V; Y)$. To see this, note that $I(W_1; W_2) := D(P_{W_1,W_2} \,\|\, P_{W_1} \otimes P_{W_2}) = \mathbb{E}_{P_{W_1}}\left[D(P_{W_2|W_1} \,\|\, P_{W_2})\right]$. We have therefore shown that
$$D(P^T \,\|\, Q^T) \le \frac{16\Delta^2 T}{K-1} \sup_{C \in \mathcal{C}} I(V; Y). \tag{11}$$

Note that $I(V; Y) \le H(V) = \ln 2$. Since $V \to C(V) \to Y$ forms a Markov chain in this order, by the data processing inequality (Cover & Thomas, 2006, Chapter 2), we have $I(V; Y) \le I(C(V); Y)$. Then, using the fact that $I(C(V); Y)$ can be further upper bounded by the capacity of the AWGN channel (Cover & Thomas, 2006, Chapter 9), we have
$$I(C(V); Y) \le \frac{1}{2}\ln(1 + \mathrm{SNR}) = \frac{1}{2}\log(1 + \mathrm{SNR}) \ln 2.$$
Thus, we have shown that
$$D(P^T \,\|\, Q^T) \le \frac{16 \ln 2\, \Delta^2 T}{K-1}\left(\frac{1}{2}\log(1 + \mathrm{SNR}) \wedge 1\right).$$
Substituting this into (10), we get
$$R_T(\nu, \pi, C) + R_T(\nu', \pi, C) \ge \frac{T\Delta}{2}\left(1 - \sqrt{\frac{8 \ln 2\, \Delta^2 T}{K-1}\left(\tfrac{1}{2}\log(1+\mathrm{SNR}) \wedge 1\right)}\right).$$
Setting $\Delta^2 = \frac{K-1}{16 \ln 2\, T\left(\frac{1}{2}\log(1+\mathrm{SNR}) \wedge 1\right)}$ completes the proof. (Note that we can only set such a $\Delta$ while ensuring $\Delta \le 1/4$ when $T \ge \frac{K-1}{\ln 2\left(\frac{1}{2}\log(1+\mathrm{SNR}) \wedge 1\right)}$, which is the reason we need this condition in the statement of our lower bound.)

We will now show that, for $T \ge \frac{K-1}{\ln 2\left(\frac{1}{2}\log(1+\mathrm{SNR}) \wedge 1\right)} \vee K$ and a universal constant $c$, we have $R_T^*(\mathrm{SNR}) \ge c\, KB$. In fact, we show the stronger bound $R_K^*(\mathrm{SNR}) \ge c\, KB$. Then, using the fact that $a \vee b \ge \frac{a+b}{2}$ completes the proof. We now choose the instance $\nu$ in which the reward for each arm pull is Gaussian with variance 1 and the means are $\mu_1 = 1$ and $\mu_i = 0$ for all $i \in \{2, \ldots, K\}$. Once again, let $j$ be the arm pulled the least number of times. Then define the alternative instance $\nu'$ in which the reward for each arm pull is Gaussian with variance 1 and the means are $\mu_1 = B/2$, $\mu_j = B - 1$, and $\mu_i = 0$ for all $i \in \{2, \ldots, K\} \setminus \{j\}$. Proceeding as earlier, we have
$$R_T(\nu, \pi, C) + R_T(\nu', \pi, C) \ge \frac{T}{2}\left(1 - d_{\mathrm{TV}}(P^T, Q^T)\right).$$
Using the fact that $T \ge \frac{K-1}{\ln 2\left(\frac{1}{2}\log(1+\mathrm{SNR}) \wedge 1\right)} \ge c(K-1)$ and the fact that, for our chosen distributions, $1 - d_{\mathrm{TV}}(P^T, Q^T)$ is a constant completes the proof.

Remark 9 (Communication-constrained setting of Hanna et al. (2022)). We now remark on the possibility of recovering the lower bound in Hanna et al. (2022) with our proof technique, albeit with weaker constants. Note that (10) also holds for the setting considered in Hanna et al. (2022), where the output $Y$ is restricted to $r$ bits. Therefore, $I(V; Y) \le H(Y) \wedge H(V) = (\ln 2)(r \wedge 1)$. Proceeding with the rest of the proof, mutatis mutandis, we get that the regret in this setting is lower bounded by $\Omega\!\left(\sqrt{\frac{KT}{r \wedge 1}}\right)$. Such a lower bound suggests that, even when using a code of length 1 to encode the rewards, it might be possible to achieve the regret of the centralized setting. This is indeed shown to be possible by the scheme in Hanna et al. (2022).

C. Proofs of Intermediary Results in the Proof of Theorem 3.1

Proof of Lemma B.1. We prove the inequality by arguments similar to those in Lattimore & Szepesvári (2020, Lemma 15.1). Recall that $H_t = (A_1, Y_1, \ldots, A_t, Y_t)$ denotes the history of the game up to time $t$. Also recall that the side information $S_t$ is a function of the history of the game up to time $t-1$. Therefore, the encoding scheme at time $t$ depends on $H_{t-1}$, so we write it as $C_{H_{t-1}}$. We additionally denote by $p^{A_t}$ and $q^{A_t}$ the distributions of the reward $X_{t,A_t}$ under instances $\nu$ and $\nu'$, respectively. Therefore, $p_{C_{H_{t-1}}}^{A_t}$ and $q_{C_{H_{t-1}}}^{A_t}$ denote the distributions of $Y_t$ for an arm pull $A_t$ and history $H_{t-1}$ under instances $\nu$ and $\nu'$, respectively. Using the definition of KL divergence, we have
$$D(P^T \,\|\, Q^T) = \mathbb{E}_{P^T}\left[\log \frac{P^T(A_1, Y_1, \ldots, A_T, Y_T)}{Q^T(A_1, Y_1, \ldots, A_T, Y_T)}\right] = \sum_{t=1}^{T} \mathbb{E}_{P^T}\left[\log \frac{P^T(A_t, Y_t \,|\, H_{t-1})}{Q^T(A_t, Y_t \,|\, H_{t-1})}\right]. \tag{12}$$
By the definition of the bandit algorithm $\pi$, the distribution over the actions to be played remains the same for different problem instances as long as the history is the same. Therefore,
$$\frac{P^T(A_t, Y_t \,|\, H_{t-1})}{Q^T(A_t, Y_t \,|\, H_{t-1})} = \frac{P^T(A_t \,|\, H_{t-1})}{Q^T(A_t \,|\, H_{t-1})} \cdot \frac{P^T(Y_t \,|\, A_t, H_{t-1})}{Q^T(Y_t \,|\, A_t, H_{t-1})} = \frac{P^T(Y_t \,|\, A_t, H_{t-1})}{Q^T(Y_t \,|\, A_t, H_{t-1})}.$$
Substituting this identity into (12), we have
$$D(P^T \,\|\, Q^T) = \sum_{t=1}^{T} \mathbb{E}_{P^T}\left[\log \frac{P^T(Y_t \,|\, A_t, H_{t-1})}{Q^T(Y_t \,|\, A_t, H_{t-1})}\right] = \sum_{t=1}^{T} \mathbb{E}_{P^T}\left[\mathbb{E}_{P^T}\left[\log \frac{P^T(Y_t \,|\, A_t, H_{t-1})}{Q^T(Y_t \,|\, A_t, H_{t-1})} \,\Big|\, A_t, H_{t-1}\right]\right] = \sum_{t=1}^{T} \mathbb{E}_{P^T}\left[D\!\left(p_{C_{H_{t-1}}}^{A_t} \,\Big\|\, q_{C_{H_{t-1}}}^{A_t}\right)\right],$$
where the second equality uses the law of iterated expectations and the final one uses the definition of KL divergence.
Since $p^{A_t}$ and $q^{A_t}$ differ only when arm $A_t = j$, we have
$$D(P^T \,\|\, Q^T) = \sum_{t=1}^{T} \mathbb{E}_{P^T}\left[\mathbb{1}_{\{A_t = j\}}\, D\!\left(p_{C_{H_{t-1}}}^{j} \,\Big\|\, q_{C_{H_{t-1}}}^{j}\right)\right] \le \sum_{t=1}^{T} \mathbb{E}_{P^T}\left[\mathbb{1}_{\{A_t = j\}}\right] \sup_{C \in \mathcal{C}} D(p_C^j \,\|\, q_C^j) = \mathbb{E}_{P^T}\left[N_j(T)\right] \sup_{C \in \mathcal{C}} D(p_C^j \,\|\, q_C^j) \le \frac{T}{K-1} \sup_{C \in \mathcal{C}} D(p_C^j \,\|\, q_C^j).$$

C.1. Proof of Fact 1

Using the facts that $\ln x \le x - 1$ and $\int (p(x) - q(x))\, dx = 0$, we have
$$D(p \,\|\, q) = \int p(x) \ln \frac{p(x)}{q(x)}\, dx - \int (p(x) - q(x))\, dx \le \int p(x)\left(\frac{p(x)}{q(x)} - 1\right) dx - \int (p(x) - q(x))\, dx = \int \frac{(p(x) - q(x))^2}{q(x)}\, dx = d_{\chi^2}(p, q).$$

C.2. Proof of Fact 2

We have
$$d_{\chi^2}(p, q) = \int \frac{(p(x) - q(x))^2}{q(x)}\, dx = \int p(x)\left(\frac{p(x)}{q(x)} - 1\right) dx - \int (p(x) - q(x))\, dx \le \int p(x)\, \frac{p(x)}{q(x)} \ln \frac{p(x)}{q(x)}\, dx \le c \int p(x) \ln \frac{p(x)}{q(x)}\, dx = c\, D(p \,\|\, q),$$
where the first inequality follows from the fact that $x - 1 \le x \ln x$ for $x \ge 0$ together with $\int (p(x) - q(x))\, dx = 0$, and the second inequality follows from the fact that $\frac{dP}{dQ}(x) \le c$ for all $x \in \mathbb{R}$.
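As a quick numerical sanity check of Fact 1 (the inequality $D(p\,\|\,q) \le d_{\chi^2}(p,q)$), the short NumPy sketch below evaluates both quantities on random discrete distributions. This is our own illustration of the inequality, not code from the paper; the alphabet size and number of trials are arbitrary choices.

```python
import numpy as np

# Numerical sanity check of Fact 1: D(p||q) <= chi^2(p, q) on random
# discrete distributions (illustrative only).
rng = np.random.default_rng(3)
for _ in range(5):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    kl = np.sum(p * np.log(p / q))
    chi2 = np.sum((p - q) ** 2 / q)
    assert kl <= chi2 + 1e-12
    print(f"D(p||q) = {kl:.4f} <= chi^2(p,q) = {chi2:.4f}")
```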