# incentivized_communication_for_federated_bandits__7ecf486b.pdf Incentivized Communication for Federated Bandits Zhepei Wei1 Chuanhao Li1 Haifeng Xu2 Hongning Wang1 University of Virginia1 University of Chicago2 {tqf5qb, cl5ev, hw5x}@virginia.edu haifengxu@chicago.edu Most existing works on federated bandits take it for granted that all clients are altruistic about sharing their data with the server for the collective good whenever needed. Despite their compelling theoretical guarantee on performance and communication efficiency, this assumption is overly idealistic and oftentimes violated in practice, especially when the algorithm is operated over self-interested clients, who are reluctant to share data without explicit benefits. Negligence of such selfinterested behaviors can significantly affect the learning efficiency and even the practical operability of federated bandit learning. In light of this, we aim to spark new insights into this under-explored research area by formally introducing an incentivized communication problem for federated bandits, where the server shall motivate clients to share data by providing incentives. Without loss of generality, we instantiate this bandit problem with the contextual linear setting and propose the first incentivized communication protocol, namely, INC-FEDUCB, that achieves near-optimal regret with provable communication and incentive cost guarantees. Extensive empirical experiments on both synthetic and real-world datasets further validate the effectiveness of the proposed method across various environments. 1 Introduction Federated bandit learning has recently emerged as a promising new direction to promote the application of bandit models while preserving privacy by enabling collaboration among multiple distributed clients [10, 40, 22, 13, 23, 24, 25, 15, 37, 9]. The main focus in this line of research is on devising communication-efficient protocols to achieve near-optimal regret in various settings. Most notably, the direction on federated contextual bandits has been actively gaining momentum, since the debut of several benchmark communication protocols for contextual linear bandits in the P2P [18] and star-shaped [40] networks. Many subsequent studies have explored diverse configurations of the clients and environmental modeling factors and addressed new challenges arising in these contexts. Notable recent advancements include extensions to asynchronous linear bandits [22, 13], generalized liner bandits [23], and kernelized contextual bandits [24, 25]. Despite the extensive exploration of various settings, almost all existing federated bandit algorithms rely on the assumption that every client in the system is willing to share their local data/model with the server, regardless of the communication protocol design. For instance, synchronous protocols [40, 23] require all clients to simultaneously engage in data exchange with the server in every communication round. Similarly, asynchronous protocols [22, 25, 13] also assume clients must participate in communication as long as the individualized upload or download event is triggered, albeit allowing interruptions by external factors (e.g., network failure). In contrast, our work is motivated by the practical observation that many clients in a federated system are inherently self-interested and thus reluctant to share data without receiving explicit benefits from the server [16]. For instance, consider the following scenario: a recommendation platform Equal Contribution 37th Conference on Neural Information Processing Systems (Neur IPS 2023). (server) wants its mobile app users (clients) to opt in its new recommendation service, which switches previous on-device local bandit algorithm to a federated bandit algorithm. Although the new service is expected to improve the overall recommendation quality for all clients, particular clients may not be willing to participate in this collaborative learning, as the expected gain for them might not compensate their locally increased cost (e.g., communication bandwidth, added computation, lost control of their data, and etc). In this case, additional actions have to be taken by the server to encourage participation, as it has no power to force clients. This exemplifies the most critical concern in the real-world application of federated learning [16]. And a typical solution is known as incentive mechanism, which motivates individuals to contribute to the social welfare goal by offering incentives such as monetary compensation. While recent studies have explored incentivized data sharing in federated learning [30, 38], most of which only focused on the supervised offline learning setting [16]. To our best knowledge, ours is the first work that studies incentive design for federated bandit learning, which inherently imposes new challenges. First, there is a lack of well-defined metric to measure the utility of data sharing, which rationalizes a client s participation. Under the context of bandit learning, we measure data utility by the expected regret reduction from the exchanged data for each client. As a result, each client values data (e.g., sufficient statistics) from the server differently, depending on how such data aligns with their local data (e.g., the more similar the less valuable). Second, the server is set to minimize regret across all clients through data exchange. But as the server does not generate data, it can be easily trapped by the situation where its collected data cannot pass the critical mass to ensure every participating client s regret is close to optimal (e.g., the data under server s possession cannot motivate the clients who have more valuable data to participate). To break the deadlock, we equip the server to provide monetary incentives. Subsequently, the server needs to minimize its cumulative monetary payments, in addition to the regret and communication minimization objectives as required by federated bandit learning. We propose a provably effective incentivized communication protocol, based on a heuristic search strategy to balance these distinct learning objectives. Our solution obtains near-optimal regret O(d T log T) with provable communication and incentive cost guarantees. Extensive empirical simulations on both synthetic and real-world datasets further demonstrate the effectiveness of the proposed protocol in various federated bandit learning environments. 2 Related Work Federated Bandit Learning One important branch in this area is federated multi-armed bandits (MABs), which has been well-studied in the literature [27, 36, 19, 4, 20, 28, 32, 39, 34, 33, 43]. The other line of work focuses on the federated contextual bandit setting [18, 40], which has recently attracted increasing attention. Wang et al. [40] and Korda et al. [18] are among the first to investigate this problem, where multiple communication protocols for linear bandits [1, 26] in star-shaped and P2P networks are proposed. Many follow-up works on federated linear bandits [10, 15, 22, 13] have emerged with different client and environment settings, such as investigating fixed arm set [15], incorporating differential privacy [10], and introducing asynchronous communication [13, 22]. Li et al. [23] extend the federated linear bandits to generalized linear bandits [11]. And they further investigated federated learning for kernelized contextual bandits in both synchronous and asynchronous settings [24, 25]. In this work, we situate the incentivized federated bandit learning problem under linear bandits with time-varying arm sets, which is a popular setting in many recent works [40, 10, 22, 13]. But we do not assume the clients will always participate in data sharing: they will choose not to share their data with the server if the resulting benefit of data sharing is not deemed to outweigh the cost. Here we need to differentiate our setting from those with asynchronous communication, e.g., Asyn-Lin UCB [22]. Such algorithms still assume all clients are willing to share, though sometimes the communication can be interrupted by some external factors (e.g., network failure). We do not assume communication failures and leave it as our future work. Instead, we assume the clients need to be motivated to participate in federated learning, and our focus is to devise the minimum incentives to obtain the desired regret and communication cost for all participating clients. Incentivized Federated Learning Data sharing is essential to the success of federated learning [30], where client participation plays a crucial role. However, participation involves costs, such as the need for additional computing and communication resources, and the risk of potential privacy breaches, which can lead to opt-outs [5, 14]. In light of this, recent research has focused on investigating incentive mechanisms that motivate clients to contribute, rather than assuming their willingness to participate. Most of the existing research involves multiple decentralized clients solving the same task, typically with different copies of IID datasets, where the focus is on designing data valuation methods that ensure fairness or achieve a specific accuracy objective [35, 41, 8]. On the other hand, Donahue et al. [7] study voluntary participation in model-sharing games, where clients may opt out due to biased global models caused by the aggregated non-IID datasets. More recently, Karimireddy et al. [16] investigated incentive mechanism design for data maximization while avoiding free riders. For a detailed discussion of this topic, we refer readers to recent surveys on incentive mechanism design in federated learning [42, 38]. However, most works on incentivized federated learning only focus on better model estimation among fixed offline datasets, which does not apply to the bandit learning problem, where the exploration of growing data is also part of the objective. More importantly, in our incentivized federated bandit problem, the server is obligated to improve the overall performance of the learning system, i.e., minimizing regret among all clients, which is essentially different from previous studies where the server only selectively incentivizes clients to achieve a certain accuracy [35] or to investigate how much accuracy the system can achieve without payment [16]. 3 Preliminaries In this section, we formally introduce the incentivized communication problem for federated bandits under the contextual linear bandit setting. 3.1 Federated Bandit Learning We consider a learning system consisting of (1) N clients that directly interact with the environment by taking actions and receiving the corresponding rewards, and (2) a central server that coordinates the communication among the clients to facilitate their learning collectively. The clients can only communicate with the central server, but not with each other, resulting in a star-shaped communication network. At each time step t [T], an arbitrary client it [N] becomes active and chooses an arm xt from a candidate set At Rd, and then receives the corresponding reward feedback yt = f(xt) + ηt R. Note that At is time-varying, f denotes the unknown reward function shared by all clients, and ηt denotes zero mean sub-Gaussian noise with known variance σ2. The performance of the learning system is measured by the cumulative (pseudo) regret over all N clients in the finite time horizon T, i.e., RT = PT t=1 rt, where rt = maxx At E[y|x] E[yt|xt] is the regret incurred by client it at time step t. Moreover, under the federated learning setting, the system also needs to keep the communication cost CT low, which is measured by the total number of scalars [40] being transferred across the system up to time T. With the linear reward assumption, i.e., f(x) = x θ , where θ denotes the unknown parameter, a ridge regression estimator ˆθt = V 1 t bt can be constructed based on sufficient statistics from all N clients at each time step t, where Vt = Pt s=1 xsx s and bt = Pt s=1 xsys [21]. Using ˆθt under the Optimism in the Face of Uncertainty (OFUL) principle [1], one can obtain the optimal regret RT = O(d T). To achieve this regret bound in the federated setting, a naive method is to immediately share statistics of each newly collected data sample to all other clients in the system, which essentially recovers its centralized counterpart. However, this solution incurs a disastrous communication cost CT = O(d2NT). On the other extreme, if no communication occurs throughout the entire time horizon (i.e., CT = 0), the regret upper bound can be up to RT = O(d NT) when each client interacts with the environment at the same frequency, indicating the importance of timely data/model aggregation in reducing RT . To balance this trade-off between regret and communication cost, prior research efforts centered around designing communication-efficient protocols for federated bandits that feature the delayed update of sufficient statistics [40, 22, 13]. Specifically, each client i only has a delayed copy of Vt and bt, denoted as Vi,t = Vtlast + Vi,t, bi,t = btlast + bi,t, where Vtlast, btlast is the aggregated sufficient statistics shared by the server in the last communication, and Vi,t, bi,t is the accumulated local updates that client i obtain from its interactions with the environment since tlast. In essence, the success of these algorithms lies in the fact that Vt, bt typically changes slowly and thus has little instant impact on the regret for most time steps. Therefore, existing protocols that only require occasional communications can still achieve nearly optimal regret, despite the limitation on assuming clients willingness on participation as we discussed before. 3.2 Incentivized Federated Bandits Different from the prior works in this line of research, where all clients altruistically share their data with the server whenever a communication round is triggered, we are intrigued in a more realistic setting where clients are self-interested and thus reluctant to share data with the server if not well motivated. Formally, each client in the federated system inherently experiences a cost2 of data sharing, denoted by e Dp i R, due to their individual consumption of computing resources in local updates or concerns about potential privacy breaches caused by communication with the server. Moreover, as the client has nothing to lose when there is no local update to share in a communication round at time step t, in this case we assume the cost is 0, i.e., Dp i = e Dp i I( Vi,t = 0). As a result, the server needs to motivate clients to participate in data sharing via the incentive mechanism M : RN Rd d RN, which takes as inputs a collection of client local updates Vi,t Rd d and a vector of cost values Dp = {Dp 1, , Dp N} RN, and outputs the incentive I = {I1,t, , IN,t} RN to be distributed among the clients. Specifically, to make it possible to measure gains and losses of utility in terms of real-valued incentives (e.g., monetary payment), we adopt the standard quasi-linear utility function assumption, as is standard in economic analysis [2, 31]. At each communication round, a client decides whether to share its local update with the server based on the potential utility gained from participation, i.e., the difference between the incentive and the cost of data sharing. This requires the incentive mechanism to be individually rational: Definition 1 (Individual Rationality [29]) An incentive mechanism M : RN Rd d RN is individually rational if for any i in the participant set St at time step t, we have Ii,t Dp i (1) In other words, each participant must be guaranteed non-negative utility by participating in data sharing under M. The server coordinates with all clients and incentivizes them to participate in the communication to realize its own objective (e.g., collective regret minimization). This requires M to be sufficient: Definition 2 (Sufficiency) An incentive mechanism M : RN Rd d RN is sufficient if the resulting outcome satisfies the server s objective. Typically, under different application scenarios, the server may have different objectives, such as regret minimization or best arm identification. In this work, we set the objective of the server to minimize the regret across all clients; and ideally the server aims to attain the optimal e O(d T) regret in the centralized setting via the incentivized communication. Therefore, we consider an incentive mechanism is sufficient if it ensures that the resulting accumulated regret is bounded by e O(d 4 Methodology The communication backbone of our solution derives from Dis Lin UCB [40], which is a widely adopted paradigm for federated linear bandits. We adopt their strategy for arm selection and communication trigger, so as to focus on the incentive mechanism design. We name the resulting algorithm INC-FEDUCB, and present it in Algorithm 1. Note that the two incentive mechanisms to be presented in Section 4.2 and 4.3 are not specific to any federated bandit learning algorithms, and each of them can be easily extended to alternative workarounds as a plug-in to accommodate the incentivized federated learning setting. For clarity, a summary of technical notations can be found in Table 7. 4.1 A General Framework: INC-FEDUCB Algorithm Our framework comprises three main steps: 1) client s local update; 2) communication trigger; and 3) incentivized data exchange among the server and clients. Specifically, after initialization, an active client performs a local update in each time step and checks the communication trigger. If a communication round is triggered, the system performs incentivized data exchange between clients and the server. Otherwise, no communication is needed. 2Note that if the costs are trivially set to zero, then clients have no reason to opt-out of data sharing and our problem essentially reduces to the standard federated bandits problem [40]. Algorithm 1 INC-FEDUCB Algorithm Require: Dc 0, Dp = {Dp 1, , Dp N}, σ, λ > 0, δ (0, 1) 1: Initialize: [Server] Vg,0 = 0d d Rd d, bg,0 = 0d Rd 2: V j,0 = 0d d, b j,0 = 0d, j [N] 3: [All clients] Vi,0 = 0d d, bi,0 = 0d, Vi,0 = 0d d, bi,0 = 0d, ti,0 = 0, i [N] 4: for t = 1, 2, . . . , T do 5: [Client it] Observe arm set At 6: [Client it] Select arm xt At by Eq. (2) and observe reward yt 7: [Client it] Update: Vit,t += xtx t , bit,t += xtyt 8: Vit,t += xtx t , bit,t += xtyt, tit,t += 1 9: if tit,t log det(Vit,t+λI) det(Vit,t Vit,t+λI) > Dc then 10: [All clients Server] Upload Vi,t such that e St = { Vi,t| i [N]} 11: [Server] Select incentivized participants St = M( St) Incentive Mechanism 12: for i : Vi,t St do 13: [Participant i Server] Upload bi,t 14: [Server] Update: Vg,t += Vi,t, bg,t += bi,t 15: V j,t += Vi,t, b j,t += bi,t, j = i 16: [Participant i] Update: Vi,t = 0, bi,t = 0, ti,t = 0 17: for i [N] do 18: [Server All Clients] Download V i,t, b i,t 19: [Client i] Update: Vi,t += V i,t, bi,t += b i,t 20: [Server] Update: V i,t = 0, b i,t = 0 Formally, at each time step t = 1, . . . , T, an arbitrary client it becomes active and interacts with its environment using observed arm set At (Line 5). Specifically, it selects an arm xt At that maximizes the UCB score as follows (Line 6): xt = arg max x At x ˆθit,t 1(λ) + αit,t 1||x||V 1 it,t 1(λ) (2) where ˆθit,t 1(λ) = V 1 it,t 1(λ)bit,t 1 is the ridge regression estimator of θ with regularization parameter λ > 0, Vit,t 1(λ) = Vit,t 1 +λI, and αit,t 1 = σ q log det(Vit,t 1(λ)) det (λI) + 2 log 1/δ + λ. Vit,t(λ) denotes the covariance matrix constructed using data available to client it up to time t. After obtaining a new data point (xt, yt) from the environment, client it checks the communication event trigger tit,t log det(Vit,t(λ)) det(Vit,tlast(λ)) > Dc (Line 9), where tit,t denotes the time elapsed since the last time tlast it communicated with the server and Dc 0 denotes the specified threshold. Incentivized Data Exchange With the above event trigger, communication rounds only occur if (1) a substantial amount of new data has been accumulated locally at client it, and/or (2) significant time has elapsed since the last communication. However, in our incentivized setting, triggering a communication round does not necessarily lead to data exchange at time step t, as the participant set St may be empty (Line 11). This characterizes the fundamental difference between INC-FEDUCB and Dis Lin UCB [40]: we no longer assume all N clients will share their data with the server in an altruistic manner; instead, a rational client only shares its local update with the server if the condition in Eq. (1) is met. In light of this, to evaluate the potential benefit of data sharing, all clients must first reveal the value of their data to the server before the server determines the incentive. Hence, after a communication round is triggered, all clients upload their latest sufficient statistics update Vi,t to the server (Line 10) to facilitate data valuation and participant selection in the incentive mechanism (Line 11). Note that this disclosure does not compromise clients privacy, as the clients secret lies in bi,t that is constructed by the rewards. Only participating clients will upload their bi,t to the server (Line 13). After collecting data from all participants, the server downloads the aggregated updates V i,t and b i,t to every client i (Line 17-20). Following the convention in federated bandit learning [40], the communication cost is defined as the total number of scalars transferred during this data exchange process. Algorithm 2 Payment-free Incentive Mechanism Require: Dp = {Dp i |i [N]}, e St = { Vi,t|i [N]} 1: Initialize participant set St = e St 2: while St = do iteratively update St until it becomes stable 3: Stable Flag = True 4: for i : Vi,t St do 5: if Ii,t < Dp i then Eq. 4 6: Update participant set St = St \ { Vi,t} remove client j from St 7: Stable Flag = False 8: break 9: if Stable Flag = True then 10: break 11: return St e St 4.2 Payment-free Incentive Mechanism As mentioned in Section 1, in federated bandit learning, clients can reduce their regret by using models constructed via shared data. Denote e Vt as the covariance matrix constructed by all available data in the system at time step t. Based on Lemma 5 and 7, the instantaneous regret of client it is upper bounded by: rt 2αit,t 1 q x t e V 1 t 1xt det(e Vt 1) det(Vit,t 1) = O xt e V 1 t 1 det(e Vt 1) det(Vit,t 1) (3) where the determinant ratio reflects the additional regret due to the delayed synchronization between client it s local sufficient statistics and the global optimal oracle. Therefore, minimizing this ratio directly corresponds to reducing client it s regret. For example, full communication keeps the ratio at 1, which recovers the regret of the centralized setting discussed in Section 3.1. Therefore, given the client s desire for regret minimization, the data itself can be used as a form of incentive by the server. And the star-shaped communication network also gives the server an information advantage over any single client in the system: a client can only communicate with the server, while the server can communicate with every client. Therefore, the server should utilize this advantage to create incentives (i.e., the LHS of Eq. (1)), and a natural design to evaluate this data incentive is: Ii,t := Id i,t = det (Di,t(St) + Vi,t) det(Vi,t) 1. (4) where Di,t(St) = P j:{ Vj,t St} {j =i} Vj,t + V i,t denotes the data that the server can offer to client i during the communication at time t (i.e., current local updates from other participants that have not been shared with the server) and V i,t is the historically aggregated updates stored in the server that has not been shared with client i. Eq. (4) suggests a substantial increase in the determinant of the client s local data is desired by the client, which ultimately results in regret reduction. With the above data valuation in Eq. (4), we propose the payment-free incentive mechanism that motivates clients to share data by redistributing data collected from participating clients. We present this mechanism in Algorithm 2, and briefly sketch it below. First, we initiate the participant set St = e St, assuming all clients agree to participate. Then, we iteratively update St by checking the willingness of each client i in St according to Eq. (1). If St is empty or all clients in it are participating, then terminate; otherwise, remove client i from St and repeat the process. While this payment-free incentive mechanism is neat and intuitive, it has no guarantee on the amount of data that can be collected. To see this, we provide a theoretical negative result with rigorous regret analysis in Theorem 3 (see proof in Appendix C). Theorem 3 (Sub-optimal Regret) When there are at most c 2C N log(T/N) number of clients (for some constant C, c > 0), whose cost value Dp i min{(1+ L2 λ )T , (1+ T L2 λd )d}, there exists a linear bandit instance with σ = L = S = 1 such that for T Nd, the expected regret for INC-FEDUCB algorithm with payment-free incentive mechanism is at least Ω(d Algorithm 3 Payment-efficient Incentive Mechanism Require: e St = { Vi,t|i [N]}, data-incentivized participant set b St e St, threshold β 1: for client i : Vi,t e St \ b St do: 2: Compute client s potential contribution to the server (i.e., marginal gain in determinant): ci,t(b St) = det( Vi,t + Vg,t(b St))/ det(Vg,t(b St)), Vg,t(St) = Vg,t 1 + Σ(St) (6) 3: Rank clients {i1, . . . , im} by their potential contribution, where m = |e St \ b St| 4: Segment the list by finding α = min{j | det(Vg,t(b St)+ Vij ,t) det(Vg,t(e St)) β, j [m]} 5: k = α 1, Im last = Dp iα Id iα,t 6: return participant set St = Heuristic Search(k, Im last) Algorithm 4 Recall the discussion in Section 3.1, when there is no communication RT is upper bounded by O(d NT). Hence, in the worst-case scenario, the payment-free incentive mechanism might not motivate any client to participate. It is thus not a sufficient mechanism. 4.3 Payment-efficient Incentive Mechanism To address the insufficiency issue, we further devise a payment-efficient incentive mechanism that introduces additional monetary incentives to motivate clients participation: Ii,t := Id i,t + Im i,t (5) where Id i,t is the data incentive defined in Eq. (4), and Im i,t is the real-valued monetary incentive, i.e., the payment assigned to the client for its participation. Specifically, we are intrigued by the question: rather than trivially paying unlimited amounts to ensure everyone s participation, can we devise an incentive mechanism that guarantees a certain level of client participation such that the overall regret is still nearly optimal but under acceptable monetary incentive cost? Inspired by the determinant ratio principle discussed in Eq. (3), we propose to control the overall regret by ensuring that every client closely approximates the oracle after each communication round, which can be formalized as det(Vg,t)/ det(e Vt) β, where Vg,t = Vg,t 1 + Σ(St) is to be shared with all clients and Σ(St) = P j:{ Vj,t St} Vj,t. The parameter β [0, 1] characterizes the chosen gap between the practical and optimal regrets that the server commits to. Denote the set of clients motivated by Id i,t at time t as Sd t and those motivated by Im i,t as Sm t , and thus St = Sm t Sd t . At each communication round, the server needs to find the minimum Im i,t such that pooling local updates from St satisfies the required regret reduction for the entire system. Note that Algorithm 2 maximizes Id i,t, and thus the servers should compute Im i,t on top of optimal Id i,t and resulting Sd t , which however is still combinatorially hard. First, a brute-force search can yield a time complexity up to O(2N). Second, different from typical optimal subset selection problems [17], the dynamic interplay among clients in our specific context brings a unique challenge: once a client is incentivized to share data, the other uninvolved clients may change their willingness due to the increased data incentive, making the problem even more intricate. To solve the above problem, we propose a heuristic ranking-based method, as outlined in Algorithm 3. The heuristic is to rank clients by the marginal gain they bring to the server s determinant, as formally defined in Eq. (6), which helps minimize the number of clients requiring monetary incentives, while empowering the participation of other clients motivated by the aggregated data. This forms an iterative search process: First, we rank all m non-participating clients (Line 2-3) by their potential contribution to the server (with participant set St committed); Then, we segment the list by β, anyone whose participation satisfies the overall β gap constraint is an immediately valid choice (Line 4). The first client iα in the valid list and its payment Im last ( if not available) will be our last resort (Line 5); Lastly, we check if there exist potentially more favorable solutions from the invalid list (Line 6). Specifically, we try to elicit up to k = α 1 (k = m if iα is not available) clients from the invalid list in n k rounds, where only one client will be chosen using the same heuristic in each round. If having n clients from the invalid list also satisfies the β constraint and results in a reduced monetary incentive cost compared to Im last, then we opt for this alternative solution. Otherwise, we will adhere to the last resort. This Heuristic Search is detailed in Appendix A, and it demonstrates a time complexity of only O(N) in the worst-case scenarios, i.e., n = m = N. Theorem 4 guarantees the sufficiency of this mechanism w.r.t communication and payment bounds. Theorem 4 Under threshold β and clients committed data sharing cost Dp = {Dp 1, , Dp N}, with high probability the monetary incentive cost of INC-FEDUCB satisfies i=1 Pi det(λI) where Pi is the number of epochs client i gets paid throughout time horizon T, P is the total number of epochs, which is bounded P = O(Nd log T) by setting communication threshold Dc = T N2d log T q T 2 N2d R log T log β, where R = d log(1 + T Henceforth, the communication cost satisfies CT = O(Nd2) P = O(N 2d3 log T) Furthermore, by setting β e 1 N , the cumulative regret is The proof of theorem 4 can be found in Appendix D. 5 Experiments We simulate the incentivized federated bandit problem under various environment settings. Specifically, we create an environment of N = 50 clients with cost of data sharing Dp = {Dp 1, , Dp N}, total number of iterations T = 5, 000, feature dimension d = 25, and time-varing arm pool size K = 25. By default, we set Dp i = Dp R, i [N]. Due to the space limit, more detailed results and discussions on real-world dataset can be found in Appendix E. (b) Dp = 10 (c) Dp = 100 Figure 1: Comparison between payment-free vs. payment-efficient incentive designs. The results are averaged over 10 runs with standard deviation as the error bars. 5.1 Payment-free vs. Payment-efficient We first empirically compared the performance of the payment-free mechanism (named as INCFEDUCB-PF) and the payment-efficient mechanism INC-FEDUCB in Figure 1. It is clear that the added monetary incentives lead to lower regret and communication costs, particularly with increased Dp . Lower regret is expected as more data can be collected and shared; while the reduced communication cost is contributed by reduced communication frequency. When less clients can be motivated in one communication round, more communication rounds will be triggered as the clients tend to have outdated local statistics. (a) Accumulative Regret (b) Communication Cost (c) Payment Cost Figure 2: Ablation study on heuristic search (w.r.t Dp [1, 10, 100]). The results are averaged over 10 runs with standard deviation as the error bars. 5.2 Ablation Study on Heuristic Search To investigate the impact of different components in our heuristic search, we compare the full-fledged model INC-FEDUCB with following variants on various environments: (1) INC-FEDUCB (w/o PF): without payment-free incentive mechanism, where the server only use money to incentivize clients; (2) INC-FEDUCB (w/o IS): without iterative search, where the server only rank the clients once. (3) INC-FEDUCB (w/o PF + IS): without both above strategies. In Figure 2, we present the averaged learning trajectories of regret and communication cost, along with the final payment costs (normalized) under different Dp . The results indicate that the full-fledged INC-FEDUCB consistently outperforms all other variants in various environments. Additionally, there is a substantial gap between the variants with and without the PF strategy, emphasizing the significance of leveraging the server s information advantage to motivate participation. 5.3 Environment & Hyper-Parameter Study We further explored diverse β hyper-parameter settings for INC-FEDUCB in various environments with varying Dp , along with the comparison with Dis Lin UCB [40] (only comparable when Dp = 0). Specifically, we explored different hyper-parameter settings for INC-FEDUCB with determinant ratio threshold β [0.3, 0.7, 1], and various environmental configurations with data sharing cost Dp [1, 10, 100]. d = 25, K = 25 Dis Lin UCB INC-FEDUCB (β = 1) INC-FEDUCB (β = 0.7) INC-FEDUCB (β = 0.3) T = 5, 000, N = 50, Dp = 0 Regret (Acc.) 48.46 48.46 48.46 ( = 0%) 48.46 ( = 0%) Commu. Cost 7,605,000 7,605,000 7,605,000 ( = 0%) 7,605,000 ( = 0%) Pay. Cost \ 0 0 ( = 0%) 0 ( = 0%) T = 5, 000, N = 50, Dp = 1 Regret (Acc.) \ 48.46 47.70 ( 1.6%) 48.38 ( 0.2%) Commu. Cost \ 7,605,000 7,668,825 ( + 0.8%) 7,733,575 ( + 1.7%) Pay. Cost \ 75.12 60.94 ( 18.9%) 22.34 ( 70.3%) T = 5, 000, N = 50, Dp = 10 Regret (Acc.) \ 48.46 48.21 ( 0.5%) 47.55 ( 1.9%) Commu. Cost \ 7,605,000 7,779,425 ( + 2.3%) 8,599,950 ( + 13%) Pay. Cost \ 12,819.61 9,050.61 ( 29.4%) 4,859.17 ( 62.1%) T = 5, 000, N = 50, Dp = 100 Regret (Acc.) \ 48.46 48.22 ( 0.5%) 48.44 ( 0.1%) Commu. Cost \ 7,605,000 7,842,775 ( + 3.1%) 8,718,425 ( + 14.6%) Pay. Cost \ 190,882.45 133,426.01 ( 30.1%) 88,893.78 ( 53.4%) Table 1: Study on hyper-parameter of INC-FEDUCB and environment. As shown in Table 1, when all clients are incentivized to share data, our INC-FEDUCB essentially recover the performance of Dis Lin UCB, while overcoming its limitation in incentivized settings when clients are not willing to share by default. Moreover, by reducing the threshold β, we can substantially save payment costs while still maintaining highly competitive regret, albeit at the expense of increased communication costs. And the reason for this increased communication cost has been explained before: more communication rounds will be triggered, as clients become more outdated. 6 Conclusion In this work, we introduce a novel incentivized communication problem for federated bandits, where the server must incentivize clients for data sharing. We propose a general solution framework INC-FEDUCB, and initiate two specific implementations introducing data and monetary incentives, under the linear contextual bandit setting. We prove that INC-FEDUCB flexibly achieves customized levels of near-optimal regret with theoretical guarantees on communication and payment costs. Extensive empirical studies further confirmed our versatile designs in incentive search across diverse environments. Currently, we assume all clients truthfully reveal their costs of data sharing to the server. We are intrigued in extending our solution to settings where clients can exhibit strategic behaviors, such as misreporting their intrinsic costs of data sharing to increase their own utility. It is then necessary to study a truthful incentive mechanism design. Acknowledgement. We thank the anonymous reviewers for their insightful and constructive comments. This project is partially supported by NSF Award IIS-2213700 and IIS-2128019. Haifeng Xu is supported by an NSF Award CCF-2303372, an Army Research Office Award W911NF-23-1-0030, and an Office of Naval Research Award N00014-23-1-2802. [1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, volume 11, pages 2312 2320, 2011. [2] Maurice Allais. Le comportement de l homme rationnel devant le risque: critique des postulats et axiomes de l école américaine. Econometrica: Journal of the econometric society, pages 503 546, 1953. [3] Nicolo Cesa-Bianchi, Claudio Gentile, and Giovanni Zappella. A gang of bandits. In Advances in Neural Information Processing Systems, pages 737 745, 2013. [4] Mithun Chakraborty, Kai Yee Phoebe Chua, Sanmay Das, and Brendan Juba. Coordinated versus decentralized exploration in multi-agent multi-armed bandits. In IJCAI, pages 164 170, 2017. [5] Yae Jee Cho, Divyansh Jhunjhunwala, Tian Li, Virginia Smith, and Gauri Joshi. To federate or not to federate: Incentivizing client participation in federated learning. In Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with Neur IPS 2022), 2022. [6] Jiu Ding and Aihui Zhou. Eigenvalues of rank-one updated matrices with some applications. Applied Mathematics Letters, 20(12):1223 1226, 2007. [7] Kate Donahue and Jon Kleinberg. Model-sharing games: Analyzing federated learning under voluntary participation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5303 5311, 2021. [8] Kate Donahue and Jon Kleinberg. Fairness in model-sharing games. In Proceedings of the ACM Web Conference 2023, pages 3775 3783, 2023. [9] Yihan Du, Wei Chen, Yuko Kuroki, and Longbo Huang. Collaborative pure exploration in kernel bandit. In The Eleventh International Conference on Learning Representations, 2023. [10] Abhimanyu Dubey and Alex Sandy Pentland. Differentially-private federated linear bandits. Advances in Neural Information Processing Systems, 33:6003 6014, 2020. [11] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, 23, 2010. [12] F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis), 5(4):1 19, 2015. [13] Jiafan He, Tianhao Wang, Yifei Min, and Quanquan Gu. A simple and provably efficient algorithm for asynchronous federated contextual linear bandits. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 4762 4775. Curran Associates, Inc., 2022. [14] Rui Hu and Yanmin Gong. Trading data for learning: Incentive mechanism for on-device federated learning. In GLOBECOM 2020-2020 IEEE Global Communications Conference, pages 1 6. IEEE, 2020. [15] Ruiquan Huang, Weiqiang Wu, Jing Yang, and Cong Shen. Federated linear contextual bandits. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 27057 27068. Curran Associates, Inc., 2021. [16] Sai Praneeth Karimireddy, Wenshuo Guo, and Michael Jordan. Mechanisms that incentivize data sharing in federated learning. In Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with Neur IPS 2022), 2022. [17] Ron Kohavi and George H John. Wrappers for feature subset selection. Artificial intelligence, 97(1-2):273 324, 1997. [18] Nathan Korda, Balazs Szorenyi, and Shuai Li. Distributed clustering of linear bandits in peer to peer networks. In International conference on machine learning, pages 1301 1309. PMLR, 2016. [19] Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. On distributed cooperative decision-making in multiarmed bandits. In 2016 European Control Conference (ECC), pages 243 248. IEEE, 2016. [20] Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. Social imitation in cooperative multiarmed bandits: Partition-based algorithms with strictly local information. In 2018 IEEE Conference on Decision and Control (CDC), pages 5239 5244. IEEE, 2018. [21] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020. [22] Chuanhao Li and Hongning Wang. Asynchronous upper confidence bound algorithms for federated linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 6529 6553. PMLR, 2022. [23] Chuanhao Li and Hongning Wang. Communication efficient federated learning for generalized linear bandits. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. [24] Chuanhao Li, Huazheng Wang, Mengdi Wang, and Hongning Wang. Communication efficient distributed learning for kernelized contextual bandits. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. [25] Chuanhao Li, Huazheng Wang, Mengdi Wang, and Hongning Wang. Learning kernelized contextual bandits in a distributed and asynchronous environment. In The Eleventh International Conference on Learning Representations, 2023. [26] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661 670, 2010. [27] Keqin Liu and Qing Zhao. Distributed learning in multi-armed bandit with multiple players. IEEE transactions on signal processing, 58(11):5667 5681, 2010. [28] David Martínez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized cooperative stochastic bandits. Advances in Neural Information Processing Systems, 32, 2019. [29] Roger B Myerson and Mark A Satterthwaite. Efficient mechanisms for bilateral trading. Journal of economic theory, 29(2):265 281, 1983. [30] Jian Pei. A survey on data pricing: from economics to data science. IEEE Transactions on knowledge and Data Engineering, 34(10):4586 4608, 2020. [31] Malcolm Pemberton and Nicholas Rau. Mathematics for economists: an introductory textbook. Manchester University Press, 2007. [32] Abishek Sankararaman, Ayalvadi Ganesh, and Sanjay Shakkottai. Social learning in multi agent multi armed bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1 35, 2019. [33] Chengshuai Shi, Cong Shen, and Jing Yang. Federated multi-armed bandits with personalization. In International Conference on Artificial Intelligence and Statistics, pages 2917 2925. PMLR, 2021. [34] Chengshuai Shi, Wei Xiong, Cong Shen, and Jing Yang. Decentralized multi-player multi-armed bandits with no collision information. In International Conference on Artificial Intelligence and Statistics, pages 1519 1528. PMLR, 2020. [35] Rachael Hwee Ling Sim, Yehong Zhang, Mun Choon Chan, and Bryan Kian Hsiang Low. Collaborative machine learning with incentive-aware model rewards. In International conference on machine learning, pages 8927 8936. PMLR, 2020. [36] Balazs Szorenyi, Róbert Busa-Fekete, István Hegedus, Róbert Ormándi, Márk Jelasity, and Balázs Kégl. Gossip-based distributed stochastic bandit algorithms. In International conference on machine learning, pages 19 27. PMLR, 2013. [37] Chao Tao, Qin Zhang, and Yuan Zhou. Collaborative learning with limited interaction: Tight bounds for distributed exploration in multi-armed bandits. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 126 146. IEEE, 2019. [38] Xuezhen Tu, Kun Zhu, Nguyen Cong Luong, Dusit Niyato, Yang Zhang, and Juan Li. Incentive mechanisms for federated learning: From economic and game theoretic perspective. IEEE Transactions on Cognitive Communications and Networking, 2022. [39] Po-An Wang, Alexandre Proutiere, Kaito Ariu, Yassir Jedra, and Alessio Russo. Optimal algorithms for multiplayer multi-armed bandits. In International Conference on Artificial Intelligence and Statistics, pages 4120 4129. PMLR, 2020. [40] Yuanhao Wang, Jiachen Hu, Xiaoyu Chen, and Liwei Wang. Distributed bandit learning: Near-optimal regret with efficient communication. In International Conference on Learning Representations, 2020. [41] Xinyi Xu, Lingjuan Lyu, Xingjun Ma, Chenglin Miao, Chuan Sheng Foo, and Bryan Kian Hsiang Low. Gradient driven rewards to guarantee fairness in collaborative machine learning. Advances in Neural Information Processing Systems, 34:16104 16117, 2021. [42] Yufeng Zhan, Jie Zhang, Zicong Hong, Leijie Wu, Peng Li, and Song Guo. A survey of incentive mechanism design for federated learning. IEEE Transactions on Emerging Topics in Computing, 10(2):1035 1044, 2021. [43] Zhaowei Zhu, Jingxuan Zhu, Ji Liu, and Yang Liu. Federated bandit: A gossiping approach. In Abstract Proceedings of the 2021 ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, pages 3 4, 2021. A Heuristic Search Algorithm Algorithm 4 Heuristic Search Require: invalid client list {i1, i2, , ik}, data-incentivized participant set b St, and the last resort cost Im last 1: Initialization: St = b St 2: for n [k] do 3: Rank clients {i1, . . . , ik n+1} (in new order) by Eq (6) 4: St = St {ik n+1} add the client with the largest contribution 5: for client j {i1, . . . , ik n} do find extra data-incentivized participants 6: Compute data incentive Id j,t for client j by Eq (4) 7: if Id j,t > Dp j then 8: St = St { Vj,t} 9: Compute total payment Im n,t = P i e St\St Im i,t by Eq (5) 10: if Im n,t Im last then 11: return St = b St { Viα,t} return last resort 12: else 13: if det(Σ(St) + Vg,t 1)/ det(Σ(e St) + Vg,t 1) > β then 14: return St return search result As sketched in Section 4.3, we devised an iterative search method based on the following ranking heuristic (formally defined in Eq. (6)): the more one client assists in increasing the server s determinant, the more valuable its contribution is, and thus we should motivate the most valuable clients to participate. Denote n k (initialized as 1) as the number of clients to be selected from the invalid list {i1, . . . , ik}, and initialize the participant set St = b St. In each round n, we rank the remaining k n + 1 clients based on their potential contribution to the server by Eq. (6), and add the most valuable one to St (Line 3-4). With the latest St committed, we then proceed to determine additional data-incentivized participants by Eq. (4) (Line 5-8), and compute the total payment by Eq. (5) (Line 9). If having n clients results in the total cost Im n,t > Im last, then we terminate the search and resort to our last resort (Line 10-11). Otherwise, if the resulting St enables the server to satisfy the β gap requirement, then we successfully find a better solution than last resort and can terminate the search. However, if having n client is insufficient for the server to pass the β gap requirement, we increase n = n + 1 and repeat the search process (Line 12-14). In particular, if the above process fails to terminate (i.e., having all m clients still not suffices, we will still use the last resort. Note that, by utilizing matrix computation to calculate the contribution list in each round, this method only incurs a linear time complexity of O(N), when n = m = N. B Technical Lemmas Lemma 5 (Lemma H.3 of [40]) With probability 1 δ, single step pseudo-regret rt = θ , x xt is bounded by 2 log det(Vit,t)1/2 det(λI) 1/2 xt V 1 it,t = O xt V 1 it,t . Lemma 6 (Lemma 11 of [1]) Let {Xt} t=1 be a sequence in Rd, V is a d d positive definite matrix and define Vt = V + Pt s=1 Xs X s . Then we have that log det (Vn) t=1 Xt 2 V 1 t 1 . Further, if Xt 2 L for all t, then t=1 min n 1, Xt 2 V 1 t 1 o 2 (log det (Vn) log det V ) 2 d log trace(V ) + n L2 /d log det V . Lemma 7 (Lemma 12 of [1]) Let A, B and C be positive semi-definite matrices such that A = B + C. Then, we have that x Ax x Bx det(A) C Proof of Theorem 3 Our proof relies on the following lower bound result for federated linear bandits established in [13]. Lemma 8 (Theorem 5.3 of [13]) Let pi denote the probability that an agent i [N] will communicate with the server at least once over time horizon T. Then for any algorithm with i=1 pi c 2C N log(T/N) (7) there always exists a linear bandit instance with σ = L = S = 1, such that for T Nd, the expected regret of this algorithm is at least Ω(d In the following, we will create a situation, where Eq. (7) always holds true for the payment-free incentive mechanism. Specifically, recall that the payment-free incentive mechanism (Section 4.2) motivates clients to participate using only data, i.e., the determinant ratio defined in Eq. (4) that indicates how much client i s confidence ellipsoid can shrink using the data offered by the server. Based on matrix determinant lemma [6], we know that Ii,t (1 + L2 λ )T . Additionally, by applying the determinant-trace inequality (Lemma 10 of [1]), we have Ii,t (1 + T L2 λd )d. Therefore, as long as Dp i > min{(1 + L2 λ )T , (1 + T L2 λd )d}, where the tighter choice between the two upper bounds depends on the specific problem instance (i.e., either d or T being larger), it becomes impossible for the server to incentivize client i to participate in the communication. Now based on Lemma 8, if the number of clients that satisfy Dp i min{(1 + L2 λ )T , (1 + T L2 λd )d} is smaller than c 2C N log(T/N), a sub-optimal regret of the order Ω(d NT) is inevitable for payment-free incentive mechanism, which finishes the proof. D Proof of Theorem 4 To prove this theorem, we first need the following lemma. Lemma 9 (Communication Frequency Bound) By setting the communication threshold Dc = T N2d log T q T 2 N2d R log T log β, the total number of epochs defined by the communication rounds satisfies, P = O(Nd log T) where R = d log(1 + T λd) = O(d log T). Proof of Lemma 9. Denote P as the total number of epochs divided by communication rounds throughout the time horizon T, and Vg,tp as the aggregated covariance matrix at the p-th epoch. Specifically, Vg,t0 = λI, e VT is the covariance matrix constructed by all data points available in the system at time step T. Note that according to the incentivized communication scheme in INC-FEDUCB, not all clients will necessarily share their data in the last epoch, hence det(Vg,t P ) det(e VT ) tr(e VT ) d (λ+T/d)d. Therefore, log det(Vg,t P ) det(Vg,t P 1) + log det(Vg,t P 1) det(Vg,t P 2) + + log det(Vg,t1) det(Vg,t0) = log det(Vg,t P ) det(Vg,t0) d log(1 + T Let α R+ be an arbitrary positive value, for epochs with length greater than α, there are at most T α of them. For epochs with length less than α, say the p-th epoch triggered by client i, we have ti,tp log det(Vi,tp) det(Vi,tlast) > Dc Combining the β gap constraint defined in Section 4.3 and the fact that the server always downloads to all clients at every communication round, we have ti,tp α and hence log det(g, Vtp) det(Vg,tp 1) log β det(e Vtp) det(Vg,tp 1) log β det(Vi,tp) det(Vg,tp 1) log β det(Vi,tp) det(Vi,tlast) Dc Let R = d log(1 + T λd) = O(d log T), therefore, there are at most R Dc α +log β epochs with length less than α time steps. As a result, the total number of epochs P T α +log β . Note that α +log β 2 q T R Dc+α log β , where the equality holds when α = q T (Dc+α log β) Furthermore, let Dc = T N2d log T α log β, then α = q T 2 N2d R log T , we have TR Dc + α log β d R log T) = O(Nd log T) (8) This concludes the proof of Lemma 9. Communication Cost: The proof of communication cost upper bound directly follows Lemma 9. In each epoch, all clients first upload O(d2) scalars to the server and then download O(d2) scalars. Therefore, the total communication cost is CT = P O(Nd2) = O(N 2d3 log T) Monetary Incentive Cost: Under the clients committed data sharing cost Dp = {Dp 1, , Dp N}, during each communication round at time step tp, we only pay clients in the participant set Stp. Specifically, the payment (i.e., monetary incentive cost) Im i,tp = 0 if the data incentive is already sufficient to motivate the client to participate, i.e., when Id i,tp Dp i . Otherwise, we only need to pay the minimum amount of monetary incentive such that Eq. (1) is satisfied, i.e., Im i,tp = Dp i Id i,tp. Therefore, the accumulative monetary incentive cost is i=1 Im i,tp = i=1 max n 0, Dp i Id i,tp o I( Vi,tp Stp) i=1 max 0, max i [N]{Dp i } Id i,tp I( Vi,tp Stp) i Np (max i [N]{Dp i } Id i,tp) I( Vi,tp Stp) max i [N]{Dp i } i=1 I( Vi,tp Stp) i Np Id i,tp I( Vi,tp Stp) = max i [N]{Dp i } p Pi Id i,tp where P and N represent the number of epochs and clients, Np is the number of participants in p-th epoch, Np is the set of money-incentivized participants in the p-th epoch, Pi is the set of epochs where client i gets monetary incentive, whose size is denoted as Pi = | Pi|. Denote Dp max = maxi [N]{Dp i } to simplify our later discussion. Recall the definition of data incentive and Di,tp(Stp) = P j:{ Vj,tp Stp} {j =i} Vj,tp + V i,tp introduced in Eq. (4), we can show that Id i,tp = det Di,tp(Stp) + Vi,tp det(Vi,tp) 1 det(Vi,tp) 1 Therefore, we have det(Vi,t1) det(Vg,t2) det(Vi,t2) det(Vg,t Pi) det(Vi,t Pi) det(Vi,t1) det(Vi,t1) det(Vi,t2) det(Vi,t Pi 1) det(Vi,t Pi) det(Vg,t1) det(Vi,t Pi) (1 + Dp max) P N i=1 Pi det(λI) where the second step holds by Cauchy-Schwarz inequality and the last step follows the facts that Pi P, Np N, det(Vg,t1) det(λI), and det(Vi,t Pi) det(VT ). Specifically, by setting the communication threshold Dc = T N2d log T q T 2 N2d R log T log β, where R = d log(1 + T λd) , we have the total number of epochs P = O(Nd log T) (Lemma 9). Therefore, MT (1 + Dp max) O(N 2d log T) i=1 Pi det(λI) = O(N 2d log T) which finishes the proof. Regret: To prove the regret upper bound, we first need the following lemma. Lemma 10 (Instantaneous Regret Bound) Under threshold β, with probability 1 δ, the instantaneous pseudo-regret rt = θ , x xt in j-th epoch is bounded by xt e V 1 t 1 1 β det(Vg,tj) det(Vg,tj 1) Proof of Lemma 10. Denote e Vt as the covariance matrix constructed by all available data in the system at time step t. As introduced in Eq. (3), the instantaneous regret of client i is upper bounded by rt 2αit,t 1 q x t e V 1 t 1xt det(e Vt 1) det(Vit,t 1) = O xt e V 1 t 1 det(e Vt 1) det(Vit,t 1) Suppose the client it appears at the j-th epoch, i.e., tj 1 t tj. As the server always downloads the aggregated data to every client in each communication round, we have det(e Vt) det(Vit,t) det(e Vt) det(Vit,tj 1) det(e Vt) det(Vg,tj 1) Combining the β gap constraint defined in Section 4.3, we can show that det(e Vt) det(Vit,t) det(e Vt) det(Vg,tj 1) det(Vg,tj)/β det(Vg,tj 1) = 1 β det(Vg,tj) det(Vg,tj 1) Lastly, plugging the above inequality into Eq. (3), we have xt e V 1 t 1 1 β det(Vg,tj) det(Vg,tj 1) which finishes the proof of Lemma 10. Now, we are ready to prove the accumulative regret upper bound. Similar to Dis Lin UCB [40], we group the communication epochs into good epochs and bad epochs. Good Epochs: Note that for good epochs, we have 1 det(Vg,tj ) det(Vg,tj 1) 2. Therefore, based on Lemma 10, the instantaneous regret in good epochs is xt e V 1 t 1 r 2 Denote the accumulative regret among all good epochs as REGgood, then using the Cauchy Schwarz inequality we can see that REGgood = X v u u t T d log T t Bp xt 2 e V 1 t 1 Combining the fact x 2 log(1 + x), x [0, 1] and Lemma 6, we have v u u t T d t Bp 2 log 1 + xt 2 e V 1 t 1 v u u t T d p Pgood log det(e Vtp) det(e Vtp 1) v u u t T d p PAll log det(e Vtp) det(e Vtp 1) δ log det(e Vt P ) δ d log 1 + T Bad Epochs: Now moving on to the bad epoch. For any bad epoch starting from time step ts to time step te, the regret in this epoch is τ Ni(te)\Ni(ts) rτ where Ni(t) = {1 τ t : iτ = i} denotes the set of time steps when client i interacts with the environment up to t. Combining the fact rτ 2 and Lemma 5, we have rτ min{2, 2αiτ ,τ 1 q x τ V 1 iτ ,τ 1xτ} = O min{1, xτ V 1 iτ ,τ 1} τ Ni(te)\Ni(ts) min{1, xτ V 1 i,τ 1} τ Ni(te)\Ni(ts) min{1, xτ 2 V 1 i,τ 1} v u u t ti,te X τ Ni(te)\Ni(ts) log 1 + xτ 2 V 1 i,τ 1 v u u t ti,te X τ Ni(te)\Ni(ts) log det(Vi,τ) det(Vi,τ 1) ti,te log det(Vi,te) det(Vi,tlast) where the second step holds by the Cauchy-Schwarz inequality, the third step follows from x 2 log(1 + x), x [0, 1], the fourth step utilizes the elementary algebra, and the last two steps follow the fact that no client triggers the communication before te. Recall that, as introduced in Lemma 9, the number of bad epochs is less than R = d log(1 + T δ ) = O(d log T), therefore the regret across all bad epochs is Dc O(d log T) Combining the regret for all good and bad epochs, we have accumulative regret RT = REGgood + REGbad According to Lemma 10, the above regret bound holds with high probability 1 δ. For completeness, we also present the regret when it fails to hold, which is bounded by δ P rt 2T δ in expectation. And this can be trivially set to O(1) by selecting δ = 1/T. In this way, we can primarily focus on analyzing the following regret when the bound holds. T log T + O Nd1.5 log1.5 T p With our choice of Dc = T N2d log T q T 2 N2d R log T log β in Lemma 9, we have T log T + O Nd1.5 log1.5 T v u u t T N 2d log T N 2d R log T log β Plugging in R = d log(1 + T λd) = O(d log T), we get T log T + O Nd1.5 log1.5 T T N 2d log T + T Nd log T log 1 Furthermore, by setting β > e 1 N , we can show that T N2d log T > T Nd log T log 1 β , and therefore T log T + O d T log T = O d This concludes the proof. E Detailed Experimental Results In addition to the empirical studies on the synthetic datasets reported in Section 5, we also conduct comprehensive experiments on the real-world recommendation dataset Movie Lens [12]. Following [3, 22], we pre-processed the dataset to align it with the linear bandit problem setting, with feature dimension d = 25 and arm set size K = 25. Specifically, it contains N = 54 users and 26567 items (movies), where items receiving non-zero ratings are considered as having positive feedback, i.e., denoted by a reward of 1; otherwise, the reward is 0. In total, there are T = 214729 interactions, with each user having at least 3000 observations. By default, we set all clients costs of data sharing as Dp i = Dp R, i [N]. E.1 Payment-free vs. Payment-efficient incentive mechanism (Supplement to Section 5.1) (b) Dp = 10 (c) Dp = 100 Figure 3: Comparison between payment-free vs. payment-efficient incentive designs. Aligned with the findings presented in Section 5.1, the results on real-world dataset also confirm the advantage of the payment-efficient mechanism over the payment-free incentive mechanism in terms of both accumulative (normalized) reward and communication cost. As illustrated in Figure 3, this performance advantage is particularly notable in a more conservative environment, where clients have higher Dp . And when the cost of data sharing for clients is relatively low, the performance gap between the two mechanisms becomes less significant. We attribute this to the fact that clients with low Dp i values are more readily motivated by the data alone, thus alleviating the need for additional monetary incentive. On the other hand, higher values of Dp i indicate that clients are more reluctant to share their data. As a result, the payment-free incentive mechanism fails to motivate a sufficient number of clients to participate in data sharing, leading to a noticeable performance gap, as evidenced in Figure 3(c). Note that the communication cost exhibits a sharp increase towards the end of the interactions. This is because the presence of highly skewed user distribution in the real-world dataset. For instance, in the last 2,032 rounds, only one client remains actively engaged with the environment, rapidly accumulating sufficient amounts of local updates, thus resulting in an increase in both communication frequency and cost. E.2 Ablation Study on Heuristic Search (Supplement to Section 5.2) We further study the effectiveness of each component in the heuristic search of INC-FEDUCB on the real-world dataset and compare the performance among different variants across different data sharing costs. As presented in Figure 4(a) and 4(b), the variants without payment-free (PF) component, which only rely on monetary incentive to motivate clients to participate, generally exhibit lower rewards and higher communication costs. The reason is a bit subtle: as the payment efficient mechanism is subject to both β gap constraint and minimum payment cost requirement, it tends to satisfy the β gap constraint with minimum amount of data collected. But the payment free mechanism will always collect the maximum amount data possible. As a result, without the PF component, the server (a) Normalized Reward (b) Communication Cost (c) Payment Cost Figure 4: Ablation study on heuristic search (w.r.t Dp [1, 10, 100]). tends to collect less (but enough) data, which in turn leads to more communication rounds and worse regret. The side-effect of the increased communication frequency is the higher payment costs, with respect to the β gap requirement in each communication round. This is particularly notable in a more collaborative environment, where clients have lower data sharing costs. As exemplified in Figure 4(c), when the clients are more willing to share data (e.g., Dp = 1), the variants without PF incur significantly higher payment costs compared to the those with PF, as the server misses the opportunity to get those easy to motivate clients. Therefore, providing data incentives becomes even more crucial in such scenarios to ensure effective client participation and minimize payment costs. On the other hand, the variants without iterative search (IS) tend to maintain competitive performance compared to the fully-fledged model, despite incurring a higher payment cost, highlighting the advantage of IS in minimizing payment. E.3 Environment & Hyper-Parameter Study (Supplement to Section 5.3) d = 25, K = 25, Dc = T N 2d log T q T 2 N2d R log T log β Dis Lin UCB INC-FEDUCB (β = 1) INC-FEDUCB (β = 0.7) INC-FEDUCB (β = 0.3) Movie Lens (Dp = 0) Reward (Acc.) 38,353 38,353 37,731 ( 1.6%) 36,829 ( 2.4%) Commu. Cost 33,415,200 33,415,200 5,967,000 ( 82%) 2,457,000 ( 92.6%) Pay. Cost \ 0 0 ( = 0%) 0 ( = 0%) Movie Lens (Dp = 1) Reward (Acc.) \ 38,353 37,717 ( 1.7%) 36,833 ( 4%) Commu. Cost \ 33,415,200 13,372,250 ( 60%) 5,038,675 ( 84.9%) Pay. Cost \ 7859.67 124.41 ( 98.4%) 0 ( 100%) Movie Lens (Dp = 10) Reward (Acc.) \ 38,353 37,648 ( 1.8%) 36,675 ( 4.4%) Commu. Cost \ 33,415,200 10,041,250 ( 70%) 4,240,625 ( 87.3%) Pay. Cost \ 110,737.62 8,590.43 ( 92.2%) 2,076.98 ( 98.1%) Movie Lens (Dp = 100) Regret (Acc.) \ 38,353 37,641 ( 1.9%) 36,562 ( 4.7%) Commu. Cost \ 33,415,200 8,496,600 ( 74.6%) 5,136,700 ( 84.6%) Pay. Cost \ 1,155,616.99 105,847.84 ( 90.8%) 32,618.34 ( 97.2%) Table 2: Study on hyper-parameter of INC-FEDUCB and environment (w/ theoretical Dc). In contrast to the hyper-parameter study on synthetic dataset with fixed communication threshold reported in Section 5.3, in this section, we comprehensively investigate the impact of β and Dp on the real-world dataset by varying the communication thresholds Dc. First, we empirically validate the effectiveness of the theoretical value of Dc = T N2d log T q T 2 N 2d R log T as introduced in Theoreom 4. The results presented in Table 2 are generally consistent with the findings in Section 5.3: decreasing β can substantially lower the payment cost while still maintaining competitive rewards. We can also find that using the theoretical value of Dc can also save the communication cost. This results from the fact that setting Dc as a function of β leads to a higher communication threshold for lower β, and therefore reducing communication frequency. This observation is essentially aligned with the intuition behind lower β: when the system has a higher tolerance for outdated sufficient statistics, it should not only pay less in each communication round but also trigger communication less frequently. On the other hand, we investigate INC-FEDUCB s performance under two fixed communication thresholds Dc = T/(N 2d log T) and Dc = T/(Nd log T), which are presented in Table 3 and 4, d = 25, K = 25, Dc = T N 2d log T Dis Lin UCB INC-FEDUCB (β = 1) INC-FEDUCB (β = 0.7) INC-FEDUCB (β = 0.3) Movie Lens (Dp = 0) Reward (Acc.) 38,353 38,353 38,353 ( = 0%) 38,353 ( = 0%) Commu. Cost 33,415,200 33,415,200 33,415,200 ( = 0%) 33,415,200 ( = 0%) Pay. Cost \ 0 0 ( = 0%) 0 ( = 0%) Movie Lens (Dp = 1) Reward (Acc.) \ 38,353 38,207 ( 0.4%) 38,208 ( 0.4%) Commu. Cost \ 33,415,200 171,046,600 ( + 412%) 191,280,875 ( + 472%) Pay. Cost \ 7859.67 2095.73 ( 73.3%) 36.32 ( 99.5%) Movie Lens (Dp = 10) Reward (Acc.) \ 38,353 38,251 ( 0.3%) 37,609 ( 1.9%) Commu. Cost \ 33,415,200 135,521,025 ( + 306%) 424,465,650 ( + 1170%) Pay. Cost \ 110,737.62 33,271.39 ( 70%) 33,872.78 ( 69.4%) Movie Lens (Dp = 100) Reward (Acc.) \ 38,353 38,251 ( 0.3%) 37,970 ( 1%) Commu. Cost \ 33,415,200 135,521,025 ( + 306%) 522,196,225 ( + 1463%) Pay. Cost \ 1,155,616.99 352,231.39 ( 69.5%) 346,619.77 ( 70%) Table 3: Study on hyper-parameter of INC-FEDUCB and environment (w/ lower fixed Dc). d = 25, K = 25, Dc = T Nd log T Dis Lin UCB INC-FEDUCB (β = 1) INC-FEDUCB (β = 0.7) INC-FEDUCB (β = 0.3) Movie Lens (Dp = 0) Reward (Acc.) 37,308 37,308 37,308 ( = 0%) 37,308 ( = 0%) Commu. Cost 2,737,800 2,737,800 2,737,800 ( = 0%) 2,737,800 ( = 0%) Pay. Cost \ 0 0 ( = 0%) 0 ( = 0%) Movie Lens (Dp = 1) Reward (Acc.) \ 37,308 37,296 ( 0.1%) 37,306 ( 0.1%) Commu. Cost \ 2,737,800 4,197,525 ( + 53.3%) 5,948,950 ( + 117.3%) Pay. Cost \ 55.31 44.76 ( 19.1%) 0 ( 100%) Movie Lens (Dp = 10) Reward (Acc.) \ 37,308 37,297 ( 0.1%) 37,167 ( 0.1%) Commu. Cost \ 2,737,800 3,696,350 ( + 35%) 5,765,075 ( + 110.6%) Pay. Cost \ 4048.69 3779.77 ( 6.6%) 2242.22 ( 44.6%) Movie Lens (Dp = 100) Reward (Acc.) \ 37,308 37,273 ( 0.1%) 36,946 ( 0.1%) Commu. Cost \ 2,737,800 3,484,850 ( + 27.3%) 5,690,250 ( + 107.8%) Pay. Cost \ 77,041.04 65,286.90 ( 15.3%) 40,010.59 ( 48.1%) Table 4: Study on hyper-parameter of INC-FEDUCB and environment (w/ higher fixed Dc). respectively. These two values are created by increasing the theoretical value of Dc. Overall, the main findings align with those reported in Section 5.3, confirming our previous statements. While reducing β can achieve competitive rewards with lower payment costs, it comes at the expense of increased communication costs, suggesting the trade-off between payment costs and communication costs. Interestingly, the setting under a higher Dp and Dc can help mitigate the impact of β. Specifically, while increasing the client s cost of data sharing inherently brings additional incentive costs, raising the communication threshold results in fewer communication rounds, leading to reduced overall communication costs. This finding highlights the importance of thoughtful design in choosing Dc and β to balance the trade-off between payment costs and communication costs in real-world scenarios with diverse data sharing costs. E.4 Extreme Case Study To further investigate the utility of INC-FEDUCB in extreme cases, we conduct a set of case studies on both synthetic and real-world datasets with fixed data sharing costs. As shown in Table 5 and 6, when β is extremely small, we can achieve almost 100% savings in incentive cost compared to the case where every client has to be incentivized to participate in data sharing (i.e., β = 1). However, this extreme setting inevitably results in a considerable drop in regret/reward performance and potentially tremendous extra communication cost due to the extremely outdated local statistics in clients. Nevertheless, by strategically choosing the communication threshold, we can mitigate the additional communication costs associated with the low β values. For instance, in the synthetic dataset, the difference in performance drop between the theoretical Dc setting and heuristic Dc value d = 25, K = 25, Dp = 100 INC-FEDUCB (β = 1) INC-FEDUCB (β = 0.7) INC-FEDUCB (β = 0.3) INC-FEDUCB (β = 0.01) T = 5000, N = 50 (Dc = T/N 2d log T) Regret (Acc.) 45.37 46.33 ( + 2.1%) 48.49 ( + 6.9%) 51.22 ( + 12.9%) Commu. Cost 174,720,000 264,193,275 ( + 51.2%) 299,134,900 ( + 71.2%) 314,667,500 ( + 80.1%) Pay. Cost 479,397.18 229,999.66 ( 52%) 115,600 ( 75.9%) 42,800 ( 91.1%) T = 5000, N = 50 (Dc = T/N 2d log T p T 2/N 2d R log T log β) Regret (Acc.) 45.37 46.72 ( + 3%) 49.13 ( + 8.3%) 53.72 ( + 18.4%) Commu. Cost 174,720,000 17,808,725 ( 89.8%) 7,237,600 ( 95.9%) 2,981,175 ( 98.3%) Pay. Cost 479,397.18 178,895.78 ( 62.7%) 84,989.39 ( 82.3%) 1,200 ( 99.7%) Table 5: Case study on synthetic dataset. d = 25, K = 25, Dp = 100 INC-FEDUCB (β = 1) INC-FEDUCB (β = 0.7) INC-FEDUCB (β = 0.3) INC-FEDUCB (β = 0.01) Movie Lens (Dc = T/N 2d log T) Reward (Acc.) 38,353 38,251 ( 0.3%) 37,970 ( 1%) 37,039 ( 3.4%) Commu. Cost 33,415,200 135,521,025 ( + 306%) 522,196,225 ( + 1463%) 1,226,741,425 ( + 3571.2%) Pay. Cost 1,155,616.99 352,231.39 ( 69.5%) 346,619.77 ( 70%) 75,799.39 ( 93.4%) Movie Lens (Dc = T/N 2d log T p T 2/N 2d R log T log β) Reward (Acc.) 38,353 37,641 ( 1.9%) 36,562 ( 4.7%) 31,873 ( 16.9%) Commu. Cost 33,415,200 8,496,600 ( 74.6%) 5,136,700 ( 84.6%) 1,880,450 ( 94.4%) Pay. Cost 1,155,616.99 105,847.84 ( 90.8%) 32,618.34 ( 97.2%) 200 ( 99.9%) Table 6: Case study on real-world dataset. is relatively small ( + 18.4% vs. + 12.9%). However, these two different choices of Dc exhibit opposite effects on communication costs, with the theoretical one achieving a significant reduction ( 98.3%) while the heuristic one incurred a significant increase ( + 80.1%). On the other hand, in the real-world dataset, the heuristic choice of Dc may lead to a smaller performance drop compared to the theoretical setting of Dc (e.g., 3.4% vs. 16.9%), reflecting the specific characteristics of the environment (e.g., a high demand of up-to-date sufficient statistics). Similar to the findings in Section E.3, this case study also emphasizes the significance of properly setting the system hyper-parameter β and Dc. By doing so, we can effectively accommodate the trade-off between performance, incentive costs, and communication costs, even in extreme cases. F Notation Table Notation Meaning d context dimension N total number of clients T total number of time steps β hyperparameter that controls the regret level Dc communication threshold Dp i data sharing cost of client i St participant set at time step t Di,t(St) data offered by the sever to client i at time step t Id i,t/Im i,t data/monetary incentive for client i at time step t e Vt covariance matrix constructed by all available data in the system Vi,t, bi,t local data of client i at time step t Vg,t, bg,t global data stored at the server at time step t Vi,t, bi,t data stored at client i that has not been shared with the server V i,t, b i,t data stored at the server that has not been shared with the client i Table 7: Main technical notations used in this paper.