# federated_highdimensional_online_decision_making__a0b8fbdd.pdf

Published in Transactions on Machine Learning Research (07/2023)

Federated High-Dimensional Online Decision Making

Chi-Hua Wang chihuawang@ucla.edu Department of Statistics, University of California, Los Angles

Wenjie Li li3549@purdue.edu Department of Statistics, Purdue University

Guang Lin guanglin@purdue.edu Department of Mathematics, Statistics, Purdue University School of Mechanical Engineering, Purdue University

Reviewed on Open Review: https: // openreview. net/ forum? id= Tja MO63fc9

We resolve the main challenge of federated bandit policy design via exploration-exploitation trade-oﬀdelineation under data decentralization with a local privacy protection argument. Such a challenge is practical in domain-speciﬁc applications and admits another layer of complexity in applications of medical decision-making and web marketing, where highdimensional decision contexts are sensitive but important to inform decision-making. Existing (low dimensional) federated bandits suﬀer super-linear theoretical regret upper bound in high-dimensional scenarios and are at risk of client information leakage due to their inability to separate exploration from exploitation. This paper proposes a class of bandit policy design, termed Fedego Lasso, to complete the task of federated high-dimensional online decision-making with sub-linear theoretical regret and local client privacy argument. Fedego Lasso relies on a novel multi-client teamwork-selﬁsh bandit policy design to perform decentralized collaborative exploration and federated egocentric exploration with logarithmic communication costs. Experiments demonstrate the eﬀectiveness of the proposed algorithms on both synthetic and real-world datasets.

1 Introduction

Federated bandits (Huang et al., 2021; Dubey & Pentland, 2020) is an emerging setting of decentralized sequential decision making that emphasizes on collaborate bandit learning and data decentralized decision making. Designing eﬀective decentralized sequential decision making strategies requires resolving the fundamental exploration-exploitation trade-oﬀunder data decentralization with local privacy protection, that is, clients do not submit their raw data in any circumstance and one client should not have any chance to infer decision rule of another client based on shared information. In principle, clients in collaborate bandit learning stage should be coordinated to explore available actions and gain information to inform future decision making. In decentralized decision making stage, one client s decision should not depend on other client s information to avoid risk of indirect information leakage. Thus, coordinate exploration and privacy protected exploitation are central challenges in federated version of exploration-exploitation trade-oﬀ. However, such trade-oﬀadmits another layer of complexity in the presence of high-dimensional decision context, especially in applications of medical decision making or web marketing (Bastani & Bayati, 2020). Consequently, federated high-dimensional decision making problems lead challenges on both learning and decision making. On learning, the agent should adopt estimation method to handle high-dimensional context with designing smart sampling scheme; on decision making, the agent should only utilize local clients information to give ﬁnal decision to meet the purpose of federation. As a result, designing a bandit sampling scheme smartly is the key to minimize the regret while protect local client privacy with data decentralization.

Published in Transactions on Machine Learning Research (07/2023)

Table 1: Regret guarantee comparison for federated bandit with linear feedback functions. M : number of clients; K : number of arms; T : time horizon; d : dimension of the decision context.

Bandit algorithms Regret (High-d d = O(T) ) Communication cost

Centralized (Dimakopoulou et al. (2017)) O(T q

KM log3(KMT)) O(Md2KT) Decentralized (Amani & Thrampoulidis (2020)) O(T 3/2

M log(MT)) O(d M 2) Fed-PE (Huang et al. (2021)) O(T p

KM log(KMT)) O Md2K log T

Async-Lin UCB (Li & Wang (2021)) O(T 3/2(log T)2) O(d M log T) Byzantine-Robust (Jadbabaie et al. (2022)) O(T 7/4M) O(T 1/2) Lasso Bandit(Bastani & Bayati (2020)) O((log M + log T)2Ks2 0) No communication Teamwork Lasso (Wang & Cheng (2020)) O(M(log T)2Ks2 0) O(KMd log2(T/Kq)) Fedego Lasso(This work) O((log M) (log T)2 Ks2 0) O(KMd log2(T/Kq))

Existing works on federated bandits (Huang et al., 2021; Li & Wang, 2021; Jadbabaie et al., 2022), unfortunately, only protect the local privacy of data but not the local privacy of information . That is, one client s decision making, in current arts of federated bandits, are using other client s information. While their method protect local data privacy, they share the resulting information freely to inform collaborate decision making at the risk of indirect information leakage. More frequent the clients communicate with the central server, more risk on indirect information leakage. Such risk of information leakage is from the Upper Conﬁdence Bound (UCB) approach (Auer, 2002; Li et al., 2010) utilized in current federated bandits works. The reason is that the UCB approach cannot handle exploration and exploitation separately. In addition, Table 1 shows that existing federated (low-dimensional) bandit works suﬀer super-linear regret in the region of high-dimensional scenario, where the decision horizon T scales with the decision context dimension d, i.e. T = O(d) Bastani & Bayati (2020).

To avoid super-linear regret in the high-dimensional scenario require rethinking on exploration and exploitation. Our strategy is to design the decision making policy with (i) decentralized collaborative exploration and (ii) federated egocentric exploitation. In (i), we coordinate all clients to commit same decision no matter their current decision context to study that decision s eﬃcacy. Such coordination prevents leakage of decision information and results in decentralized collaborative exploration. In (ii), we allow clients to commit their own reward-maximizing decision to minimize the regret. Such selﬁsh decision protect client decision information and results in federated egocentric exploitation.

Major Contributions. We deliver a novel architecture for federated high-dimensional online decision making and associated algorithms. Following that, we highlight our major contributions. (i) Fedego Lasso Algorithm. Federated-egocentric Lasso algorithm (Fedego Lasso) exploits Lasso regression to learn sparse local reward models and to inform future decision without compromising local data privacy. In particular, it enables each client to compute a personalized, low-dimensional local reward model, which we refer to as the client private Lasso, that utilizes client s unique decisions, actions and observations of local data. (ii) Convergence Rate and Regret bound. We establish convergence results of three type of Lasso estimates implemented in the proposed Fedego Lasso algorithms (Lemma 3.1, 3.4 and 3.6). Such convergence guarantees is not trivial given the non-i.i.d. properties of dataset collected during online decision making process (Remark 1.2). With these convergence rate results, we establish a theoretical regret upper bound (Theorem 3.7) for the algorithm performance of Fedego Lasso. (iii) Empirical Results. Through a combination of synthetic and real datasets (Pharm GKB, Medical MNIST) we show the beneﬁts of Fedego Lasso in (a) beneﬁts of federation architecture and (b) beneﬁts of recruiting more clients on further regret reduction and faster error rate convergence in real-world tasks including personalize dosage searching and medical image labeling.

Beneﬁts of Fedego Lasso. We list beneﬁts of Fedego Lasso over standard linear contextual bandit learning (that learns a single reward model with no central server to share and no high dimensional decision context). (I) More eﬃcient, eﬀective and secure decision making procedure. By sharing the online-learned Lasso regression, each client can make more eﬃcient and eﬀective local updates at each communication round. Such update is beneﬁcial in committing its own individual decision making. Besides, our federation architecture reduce the risks of indirect data leakage or inverse decision rule recovery. This is unlike standard

Published in Transactions on Machine Learning Research (07/2023)

linear contextual bandit learning where in a heterogeneous setting requires several rounds of model updates to recover oracle policy, and thus hurts performance. (II) Provable federation architecture for online decision making with high-dimensional decision context. Existing works on linear contextual bandit-based online decision making with high-dimensional decision context focus on the eﬀect of sparsity-inspired methods on explore-exploit trade-oﬀ. Current state of related literature does not have work designing federation architecture to further facilitate explore-exploit trade-oﬀin bandit learning. To our knowledge, this is the ﬁrst provable federation architecture for online decision making with high-dimensional decision context that demonstrates the beneﬁt of cooperation.

3 Key Enhancements in Practical Applicability of Fedego Lasso over Teamwork Lasso and Lasso Bandit. Compared to Teamwork Lasso in Wang & Cheng (2020) and Lasso Bandit in Bastani & Bayati (2020), our Fedego Lasso bridges the gap between data privacy restrictions and the need for eﬀective decision making in joint data utilization scenarios. The ﬁrst is to allow data sharing in multi-institutional contexts. While Teamwork Lasso assumes "complete data sharing" scenario, which is not the case in reality, Fedego Lasso enhances data sharing by allowing clients keeping their data in house and only share the regression coeﬃcients to do joint decision making. The second is to address privacy concerns from federation. The policy in Lasso Bandit and Fedego Lasso are infeasible in healthcare ﬁeld due to the restriction from HIPAA regulations, which prevents client to share their data directly to the third party(e.g. central server). Our Fedego Lasso are HIPAA-compliant, since we only share the regression coeﬃcient but not the original data to the third party. The last enhancement is to enable privacy compliant joint decision making. Both Lasso Bandit and Fedego Lasso neglects privacy regulation on data usage, leading risk of violating HIPPA law. We design the policy of Fedego Lasso to align the modern data sharing standard, making our algorithms with broader applicability in reality.

Technical advancements on Lasso Bandit and Teamwork Lasso. Here we discuss our technical advancements on extending the theoretical analysis in Lasso Bandit and Teamwork Lasso to secure logarithm expected regret in federated learning case. The ﬁrst technical contribution is a ﬁner control on the convergence rate of the federated Lasso ˆβ ( equation 5), c.f. Lemma 3.4. This relies on redeﬁning the "good event" in the federated learning case to ensure the probability bound on the "bad event" is independent of the number of client M. Such independence builds on a right lower bound (c.f. equation 3) on the phase length q such that the aggregated estimate ˆβ to have non-asymptotic convergence rate of order O(1/M). Another key technical contribution is a ﬁner control the convergence rate of the egoistic Lasso ˆβ (equation 6) such that the contributions to regret is of order O(log T). This relies on ﬁnding a right regularization level for the Lasso procedure to make sure the non-asymptotic convergence rate of ˆβ is of O(1/M) order, c.f. Lemma 3.6. In conclusion, we reﬁne the proof on the non-asymptotic convergence rate on the Lasso procedure in Lasso Bandit and Teamwork Lasso such that the ﬁnal regret of Fedego Lasso secure logarithmatic regret.

1.1 Related works

Federated bandits. Recent works have called attention to the topic of federated bandits. Shi et al. (2021) examine eﬃcient client-server communication and coordination methods for federated MAB with personalization, using heterogeneous reward distributions at local clients. Zhu et al. (2021); Li et al. (2020); Dubey & Pentland (2020) discuss how to protect local data privacy in federated bandits via diﬀerential privacy. Unfortunately, existing methods to solve federated linear contextual bandit suﬀer super-linear regret in high-dimensional scenario (d = O(T)), since their cumulative regret scales linearly in context dimension d (Table 1). Our work advances these prior eﬀorts by relaxing low dimension reward model assumptions to the case with high-dimensional decision context.

Lasso bandits. Bastani & Bayati (2020) ﬁrst introduced linear contextual bandit with high-dimensional covariate and proposed the Lasso bandit algorithm. Follow-up works such as Wang et al. (2018) and Wang & Cheng (2020) improved the regret bounds and extended it to diﬀerent problems. In contrast, another line of literature (Kim & Paik, 2019; Hao et al., 2020; Oh et al., 2021; Li et al., 2022) proposed diﬀerent style of algorithms for solving high dimensional bandit problems. However, these eﬀorts consider a completely diﬀerent bandit problems where K contexts are observed at each round and only one underlying parameter β is present for all arms. One way to possibly transfer results in Bastani & Bayati (2020) and its follow ups into the other line of research, and compare them, is to concatenate all the parameter βk s for diﬀerent

Published in Transactions on Machine Learning Research (07/2023)

arms into one long vector eβ (see notations in Section 1.2), and assume that the K contexts generated in each round are always one-hot with the same feature for all the arms. Our works advances the setting in Bastani & Bayati (2020) by designing federation architecture to ensure local data privacy security while delineating explore-exploit tradeoﬀin online decision making with high-dimenstional decision context.

Comparison with Teamwork Lasso Bandit in Wang & Cheng (2020). Our bandit sampling policy design (Figure 1) and client-center communication scheme (Figure 2) advances the "alternating two stage" design in Wang & Cheng (2020) in the regard of client privacy protection. Previous work Wang & Cheng (2020) on batch decision making also solves the federated decision making problem in this paper, by setting their Teamwork Lasso Bandit with a batch of clients. However, such approach requires all clients share their raw data to server to train a central Lasso to inform decision, while our Fedego Lasso allows clients keep their data, only share their individual Lasso estimates, and use aggregated Lasso to inform decision. Such fundamental diﬀerence of our Fedego Lasso advances Teamwork Lasso Bandit in Wang & Cheng (2020) into a more private exploration and exploitation design. For exploration stage (Blue; Figure 1), while Wang & Cheng (2020) access all clients raw data for a teamwork Lasso estimate, our Fedego Lasso do not touch client s raw data but only their individual Lasso estimate for a federated Lasso estimate (Federation step at Figure 2). For exploitation stage (Red; Figure 1), while Wang & Cheng (2020) access all clients raw data for a selﬁsh Lasso estimate, our Fedego Lasso allows clients to do egocentric decision by keeping their local estimate private and do not share to center or other clients. While Wang & Cheng (2020) raise algorithmic privacy concerns in the scope of federated decision making, our work makes contributions to federated high-dimensional decision making problems.

1.2 Notations and basic problem formulation

Here we give problem formulation and deﬁne the bandit algorithms regret and the Lasso estimator.

Federated high-dimensional online decision making problems. We investigate a federated linear contextual bandits scenario in which M clients are pulling the same set of K arms, represented by [K] := 1, 2, . . . , K. Each client m [M] at each time t [T] pulls an arm k [K] based on historical information. The expected reward of pulling arm k at decision context x is the inner product x, βk between x and the reward parameter βk. Additionally, there is a sparsity parameter s0 [d], deﬁned as the smallest integer such that for all k [K], βk 0 s0. Regret of bandit algorithms π. A bandit algorithm π of client m pulls arm π(m) t at decision step t. The objective is to design bandit algorithms π that minimizes the expected cumulative regret among all clients, deﬁned as: Regret(T) = E h PM m=1 PT t=1 D x(m) t , βk(m), t βπ(m) t

E i , where k(m), t [K] is a context-speciﬁc

optimal arm such that : l = k(m), t , x(m) t , βk(m), t x(m) t , βl .

In this work, we approach the above federated high-dimensional online decision making problems with LASSO regressions.

Deﬁnition 1.1. (Lasso regression estimator) Given a dataset D = {(X, Y )}, where Y is a |D|-dimension response vector and X is a |D| d decision context matrix from the dataset D. The Lasso regression estimator with regularization level λ 0 is deﬁned as

ˆβ (D, λ) arg min β |Y Xβ 2 2/|D| + λ β 1 . (1)

Remark 1.2. (Non-i.i.d. properties of dataset D.) Due to the nature of online decision making, the Lasso estimate is trained on datasets D that cannot satisfy typical distributional (or i.i.d.) assumptions in the literature of federated learning. The key reason is that the decision at certain time steps are depending on previous history, leading to dependency between decision and historical data. Therefore, standard convergence theory is not applicable for analyzing Lasso trained on the dataset D. Fortunately, it is still possible to analyze and design the statistical properties of the trained Lasso estimate if we carefully design the bandit policy to coordinate the exploration across clients and exploitation within clients during whole online decision making process. See Figure 1 and Section 3.1 for bandit sampling design and Section 3.2 for formal convergence results of trained Lasso estimates. Remark 1.3. (One the usefulness of Lasso regression estimate) We use Lasso since our application is in e-commerce and medical setting. In such setting, the context vector is often high-dimensional, but only

Published in Transactions on Machine Learning Research (07/2023)

Figure 1: Centralized teamwork-selﬁsh sampling policy. In the teamwork stage (Blue), all clients pull prescribed arms, no matter the current decision context, to form collaborate exploration for a quick recovery of arm reward parameters. In the selﬁsh stage (Red), all clients pull the arm with highest estimated reward for current decision context to form egocentric exploitation for maximizing their own cumulative reward. Each stage has length Kq, where K is number of arm and q is the teamwork unit length (See Section 3.1 for details.)

few number of variable are decisive. Lasso captures such property by assuming the sparse structure of the regression coeﬃcients. With such sparsity structure of regression coeﬃcient, we leverage the technique in high-dimensional statistic in our theory (Theorem 3.7). The Lasso bandit Bastani & Bayati (2020) paper explains that in medical applications, we might want to identify a small number of biomarkers that are associated with a certain disease. In e-commerce applications, we might want to select a small number of features that are relevant for predicting user behavior. Both of these problems can be formulated as a sparse linear regression problem, where we want to minimize the number of non-zero coeﬃcients in the regression model. Lasso is particularly useful in these cases because it performs both feature selection and regularization simultaneously. By penalizing the L1 norm of the regression coeﬃcients, Lasso encourages sparsity and tends to produce models with a small number of non-zero coeﬃcients. This can lead to better generalization performance and easier interpretability of the resulting model. Therefore, the motivation for using Lasso is to take advantage of its ability to perform feature selection and regularization simultaneously, which is particularly useful for applications where sparsity is assumed.

2 Federated online decision making

This section elaborates a horizontal federated learning (as deﬁned in Yang et al. (2019)) version of bandit problems with high-dimensional decision context and linear arm rewards (as introduced by Bastani & Bayati (2020) and others). The three main challenges arise from federated contextual bandits are (A) Model local dataset heterogeneity. (B) Tighten local data and decision rule privacy security. (C) Coordinate eﬃcient central communication. Section 2.1 established our framework to resolve these challenges.

2.1 Resolution to federated high-dimensional bandit challenges A. Client local bandit models for local dataset heterogeneity. To account for the intrinsic correlation between rewards associated with diﬀerent clients pulling the same arm, we assume the local bandit model of client m that the expected reward of pulling arm k is a linear function of the decision context xt; formally,

r(xt) βk, xt + ϵ(m) t . (2)

The parameter βk Rd is a constant but unknown vector for each arm k [K]. The noise process {(ϵ(m) t , F(m) t )}t [T ] is assumed to be a σ sub Gaussian martingale diﬀerence sequence (that is, E[ϵ(m) t |

F(m) t 1 ] = 0 and E[exp(λϵ(m) t )|F(m) t 1 ] exp(σ2λ2/2) for all real λ). See Section 2.2 for essential statistical regularity assumptions for analyzing federated contextual bandit algorithms. The local dataset heterogeneity is depicted naturally by the local bandit model equation 2, since clients have their own decision context sequence {x(m) t }T t=1, resulting heterogeneous reward distributions.

B. Federation with decision rule hiding to tighten local privacy security. Our proposal to resolve local data privacy dilemma is via Teamwork-Selﬁsh sampling policy (Figure 1). By local data privacy, we

Published in Transactions on Machine Learning Research (07/2023)

meant that one client should not have any chance to infer decision rule of another client based on shared information. To protect local data privacy, we design federation architecture and hide clients decision rules from central server. Our approach, in the spirit of the doubling trick (Besson & Kaufmann, 2018), do not require the agents to share to the server their datasets, no matter the datasetes are collected from Teamwork mode or Selﬁsh mode. The only information to share is the Lasso estimates trained on datasets collected during Teamwork mode at certain pre-speciﬁed decision points. Thus, the Teamwork-Selﬁsh bandit policy resolves the local data privacy security challenge.

C. Horizontal federation to coordinate eﬃcient communication. In horizontal federated-learning system, M clients with same online decision making task collaboratively learn a model with the help of server. We resolve explore-exploit dilemma via client-center communication protocol (Figure 2), established at Section 3.3. In our scheme, each client trains two Lasso estimates, one from Teamwork dataset (for exploration) and the other from Teamwork-Selﬁsh aggregation dataset (for exploitation). Our strategy is to introduce two diﬀerent modes for agent: Teamwork mode and Selﬁsh mode. Clients only upload their Lasso estimates at the end of the Teamwork mode, thus the communication cost is log(T/Kq). Our approach, which in spirit is an applications of horizontal federated learning architecture (Yang et al., 2019), coordinates eﬃcient communication between clients and center to perform near-optimal decision making and achieve logarithmic regret performance. Consequently, the communication protocol (Figure 2) resolves the local data privacy security challenge.

2.2 Statistical Regularity Conditions

In section 2.2, we list essential statistical regularity assumptions towards convergence rate analysis of Lasso regression estimator (Deﬁnition 1.1) and regret analysis of algorithms. We note that these assumptions are standard in the literature of high-dimensional decision making (Bastani & Bayati, 2020).

Assumption 2.1. For all clients m [M], there exists C0 0 such that for arms k1 = k2 in A, the distribution of decision context X satisfy P(| X, β(m) k1 β(m) k2 | (0, κ)) C0κ for given κ > 0.

Assumption 2.1 is known as Margin Condition in the classiﬁcation literature Tsybakov et al. (2004). Such condition ensures only a minor fraction of decision context is sampled near the classiﬁcation boundary {x : x, βk1 βk2 = 0} in which eﬃcacy of both arms are indistinguishable (Rusmevichientong & Tsitsiklis, 2010; Wang & Cheng, 2020).

Assumption 2.2. For all clients m [M], there exists a optimality gap constant h and two mutually exclusive arm subsets Aopt and Asub with [K] = Aopt Asub such that (a) For each arm k in Asub, it holds for every decision context x X that βk, x < maxa A\{k} βa, x h and (b) For each arm k in Aopt, there exists a constant p > 0 of minimal sampling probability such that mink Aopt P(X Uk) Mp where Uk {x X| βk, x > maxa A\{k} βa, x h}.

Assumption 2.2 is known as the Arm Optimality Condition (Bastani & Bayati, 2020; Wang & Cheng, 2020). Such condition separates arms into an optimal subset Aopt and a suboptimal subset Asub, in the sense that (a) all sub-optimal arms are strictly sub-optimal for every decision context and (b) each optimal arm k Aopt is strictly optimal for some decision context ( Uk at Assumption 2.2).

Assumption 2.3. For a client m, there exists a constant φ0 > 0 such that for each optimal arm k Aopt, its population covariance matrix Σk E[XX |X Uk] belongs to the compatibility set with respect to the true parameter βk. That is, Σk C(supp(βk), φ0), where C(I, φ) n A Rp p 0 | v Rp v I 2 1 |I|(v Av)/φ2 if v Ic 1 3 v I 1} .

Assumption 2.3 is referred to as the Compatibility Condition in high-dimensional statistics (Bühlmann & Van De Geer, 2011). It ensures that the Lasso regression estimator trained on samples X Uk converges to the true parameter βk with high probability as the number of samples grows to inﬁnity.

3 Fedego Lasso algorithms

Algorithm 1 presents our solutions, the Fedego Lasso bandit algorithms, to the federated online decision making problems established at Section 2. There are 3 key components in our design of Fedego Lasso algorithms: the teamwork-selﬁsh bandit sampling strategy (Figure 1), the clients-central server communication

Published in Transactions on Machine Learning Research (07/2023)

protocol (Figure 2) and local data privacy preserving. Section 3.1 establishes the teamwork-selﬁsh bandit sampling strategy to implement sparsity-award collaborate exploration. Section 3.2 illustrates how Fedego Lasso protect local clients privacy. Section 3.3 establishes the communication protocol of resulting Lasso estimates between clients and central server towards optimal regret performance of online decision making task.

Algorithm 1 Fedego Lasso: client m

Require: Decision horizon T, number of arms K, optimality gap h(m), Teamwork stage T.

for t = 1 T do

Observe the decision context x(m) t if t T(Teamwork mode) then

π(x(m) t ) = Col Explore(x(m) t , t) (Alg 2) end if if t / T (Selﬁsh mode) then

ˆK = Fed Screen(x(m) t , {ˆβ k,t}K k=1, h(m)) (Alg 3; where ˆβ k,t is the central federated Lasso estimate deﬁned at equation 5)

π(x(m) t ) = Ego Commit( ˆK, {ˆβ ,(m) k,t }K k=1) (Alg 4; where {ˆβ ,(m) k,t is the private egoistic Lasso estimate deﬁned at equation 6)

end if Pull the arm π(x(m) t ), receive reward r(m) t . end for Return: Cumulative reward R(T) = PT t=1 r(m) t .

3.1 Teamwork-Selﬁsh bandit sampling

This section presents the three key elements implementing the teamwork-selﬁsh bandit sampling strategy (Figure 1): sampling mode, resulting datasets and their Lasso estimates.

Sampling modes: Teamwork and Selfish. Every local client agent has two sampling modes: Teamwork mode (Blue block in Figure 1) and Selfish mode (Red block in Figure 1). In the Teamwork mode, all clients runs collaborate exploration, aiming at a quick recover of the arm reward parameter set {βk}K k=1. In Selfish mode, clients run egocentric exploitation individually, aiming at an optimal decision for their own current decision context. The design of alternating sampling guarantees the convergence of Lasso while optimizing algorithm performance.

Datasets: teamwork set T and selﬁsh set S. Every local client agent maintains two datasets: teamwork dataset T and selﬁsh dataset S. In general, datasets T (m) and S(m) collect samples from Teamwork and Selﬁsh mode respective for the client m. In particular, notations T (m) [t],k and S(m) [t],k denote the teamwork and selﬁsh dataset respectively from pulling arm k during decision period [t] = {1, 2, , t}. Such maintenance is to separate the data source. Technically, the teamwork set s decision is public due to agreement of all clients. The selﬁsh set s decision is private and only accessible by the client itself. Theoretically, the teamwork set s samples are independently distributed since in which the decisions are independent of the previous history, while in selﬁsh set the samples are dependent since the decisions are dependent on the history.

Estimates: teamwork Lasso ˆβ and private Lasso ˆβ Every local client agent maintains two Lasso estimates: client teamwork Lasso and local private Lasso. In principle, the client teamwork Lasso ˆβ = ˆβ(T ) is from running Lasso regression (Deﬁnition 1.1) on the Teamwork dataset; in contrast, local private Lasso ˆβ = ˆβ(T S) is trained on the aggregation dataset of Teamwork and Selﬁsh set. Such maintenance is to to separate federation and decision making. In Fedego Lasso algorithm, all local clients upload their client teamwork Lasso to central server at the end of each Teamwork mode to facilitate federation. At each decision step in Selﬁsh mode, all clients commit ﬁnal decisions based on local private Lasso estimate.

Published in Transactions on Machine Learning Research (07/2023)

3.2 Decision rules in Teamwork and Selﬁsh mode This section presents the three key subroutines implemented in the Fedego Lasso bandit algorithms (Algorithm 1): collaborate exploration (Algorithm 2), federated screening (Algorithm 3) and egocentric commitment (Algorithm 4).

A. Col Explore: Collaborate exploration

Here we introduce 3 key quantities in the Collaborate exploration stage in Algorithm 1. The phase length q determines the block size of each stage in Figure 1, reﬂecting the communication frequency. The Teamwork stage T is the teamwork stage (blue block) in Figure 1, in which the decision in this stage only depends on the timestamp but not depends on previous response history. Last, the sample collected at the teamwork stage is used to learn the Lasso regression coeﬃcient for each client and to be shared to the central server.

Algorithm 2 Col Explore(x, t)

Input: decision context x, decision step t Output: π(x) (t mod K)

A.(1)Phase length q. A key technical challenge is to determine the phase length q to ensure Lasso estimator convergence is fast enough to informal optimal decision making.

q qteamwork max{ 20

Mp , 4C2(φ1)2

Mp , 512x2 max log d C1(φ)M 2p2 h2 }(4 + log M) (3)

A.(2) Planning of Teamwork stage T. Each block in Figure 1 contains Kq decision steps. The planning of teamwork stage is T = log2(T/Kq) n=1 Tn, where the nth Teamwork stage is

Tn = [(2n 1)Kq + 1 : 2n Kq].

A.(3) Client teamwork Lasso. The client teamwork Lasso for client m at time step t is deﬁned by running Lasso regression (Deﬁnition 1.1) in the teamwork dataset T (m) [t],k ; formally,

ˆβ ,(m) k,t ˆβ(T (m) [t],k , λ1). (4)

Lemma 3.1 justiﬁes the convergence of client teamwork Lasso equation 4.

Lemma 3.1. (Deviation inequality of client teamwork Lasso) For all arms k [K], the client teamwork Lasso estimate equation 4 satisﬁes

P ˆβ ,(m) k,t βk 1 > h/4xmax 5/Mt4

if λ(m) 1 = φ2 0Mp h/64s0xmax and q qteamwork.

Proof. See Section A. Lemma 3.1 generalizes the forced sampled Lasso in Bastani & Bayati (2020) into horizontal federated learning scheme. The key diﬀerence is to redeﬁne the length of exploration qteamwork to ensure faster concentration. Finding qteamwork is non-trivial due to the centralized exploration scheme. Our found solution is to inﬂate the length of exploration by a log M factor, that is, qteamwork = O(q0 log M), where q0 is the exploration length in Bastani & Bayati (2020). Remark 3.2. (On the concern of the situation that a proportion of clients explore the same arms) In current design of Collaborate exploration scheme, a subset of customers may keep exploring the same arms. Such possibility raise concerns on fairness of the decision making procedure. In particular, it is challenging to ensure fairness in the allocation of experimental resources. These customers may miss the opportunity to try their best arm, bearing the experimental costs for joint research. To address this issue, a signiﬁcant challenge we anticipate is to "deﬁne a regret formulation that suﬃciently reﬂects decision fairness." Are there minority groups among the clients? How can we ensure that the results of joint decision-making beneﬁt everyone instead of some clients bearing the costs while others enjoy the outcomes? These are non-trivial questions and challenges we expect to encounter.

Published in Transactions on Machine Learning Research (07/2023)

B. Fed Screen: Federated screening.

At each time step in Selﬁsh mode, the client screen available arms with central federated Lasso. In this stage, the goal of Algorithm 3 is to screen out sub-optimal arms for a given decision context. The screening is based on the central federated Lasso regression estimate equation 5, whose statistical accuracy is supported by Lemma 3.4. After the screening, we are sure that the remaining candidate arms are located in the Aopt in Assumption 2.2 and contribute at most h in the formal regret.

Algorithm 3 Fed Screen(x, {ˆβk}K k=1, h)

Input: decision context x, federated estimates {ˆβk}K k=1 (central federated Lasso estimate equation 5), optimality gap h Output: the candidate set ˆK(x) {k A : x, ˆβk maxl A x, ˆβl h/2}

Central federated Lasso. After receiving all clients teamwork Lasso estimates, the central server performs federation by computing the central federated Lasso. The central federated Lasso for arm k at decision step t is deﬁned as the average of all client Teamwork Lasso estimates; formally

m=1 ˆβ ,(m) k,t . (5)

Remark 3.3. Note that the central federated Lasso is the average of teamwork Lasso estimates, which are trained on i.i.d. samples, and therefore has convergence guarantees. Such analytical advantage is due to our policy design that all agents explore the eﬃcacy of pre-speciﬁed arms during Teamwork mode (Blue block in Figure 1).

Lemma 3.4. For all arms k [K], if t (Kq)2, the central federated Lasso estimate equation 5 satisﬁes

P ˆβ k,t βk 1 > h(m)/4xmax 5/t4.

Proof. See Section B. The key is due to our tighter control of client teamwork Lasso (Lemma 3.1). Such control ensure the deviation probability of equation 5 is independent of client number M. Remark 3.5. One can directly apply force estimator deviation inequality in Bastani & Bayati (2020), but the resulting deviation probability will depends on client number M due to union bound. Our technical contribution is to remove such dependency on M by inﬂating the exploration length by log M factor.

C. Ego Commit: Egocentric commitment

Given the resulted set of optimal arm candidates ˆK(x(m) t ) for decision context x(m) t from the federated screening (Algorithm 3), the client agent m commits to the arm with the highest estimated expected reward estimated by the private egocentric Lasso equation 6.

Algorithm 4 Ego Commit( ˆK(x), {ˆβk}K k=1)

Input: candidate set ˆK(x), decision context x, private estimates {ˆβk}K k=1 (private egoistic Lasso estimate equation 6) Output: π(x) arg maxk ˆ K(x) x, ˆβk

Private egoistic Lasso. The private egoistic Lasso for client m at time step t is deﬁned by running Lasso regression (Deﬁnition 1.1) in the teamwork-selﬁsh aggregate dataset T (m) [t],k S(m) [t],k; formally,

ˆβ ,(m) k,t ˆβ T (m) [t],k S[t],k, λ(m) 2,t . (6)

Lemma 3.6 justiﬁes the convergence of private egoistic Lasso equation 6.

Published in Transactions on Machine Learning Research (07/2023)

Client 1 Client m

Client Teamwork Lasso

Client Teamwork Lasso

Client 1 Client m

Central Federated

Central Federated

Federation Broadcast

β ,(m) k,t }K

Figure 2: Federated client-center communication. At the end of each Teamwork stage (Blue block in Figure 1), all clients uploads their current teamwork Lasso estimate (deﬁned at equation 4) to the central server. Then, the central server perform federation to compute the federated Lasso estimates (deﬁned at equation 5). Last, the central server broadcast the federated Lasso estimates to all clients for their upcoming federated screening procedure in the Selﬁsh stage (Red block in Figure 1) .

Lemma 3.6. For all optimal arms k Aopt, set λ(m) 2,t = φ2 0 32s0

M log d Mt C1(φ0)p t and t C5. Then, the private egocentric Lasso estimate equation 6 satisﬁes

P( ˆβ ,(m) k,t βk 1 > 16 p

(log d Mt)/(p3 C1 (φ0) Mt))

2((Mt) 1 + exp( p2 C2 2M 2t/32)).

Proof. See Section C. The key is to reﬁne the estimation error of Lasso estimator.

3.3 Clients-Central Server Communication

Figure 2 presents the three key events in the clients-central server communication protocol: upload, federation and broadcast. In the following we discuss the contribution of such horizontal federated learning architecture to bandit learning and local data privacy security.

Upload: Clients upload local teamwork Lasso to server. At the end of each round of the Teamwork mode, all clients upload their current client Teamwork Lasso estimates equation 4 to the central server to facilitate the federation. The local teamwork Lasso have diﬀerence convergence level due to local data heterogeneity. The collaborate exploration between clients in Teamwork mode ensuring decisions of a client will not compromise to the other client. Such advantage is due to design of Teamwork-Selﬁsh sampling strategy, all clients do experiment for collaborate exploration during Teamwork mode.

Federation: Server compute central federated Lasso. The central sever do averaging to mitigate statistical error of received teamwork Lasso estimates. The outcome estimate is called central federated Lasso (deﬁned at equation 5). The federation event helps security by ensuring internal potential privacy leakage from the central server since clients do not upload any dataset but only their local teamwork Lasso estimates.

Broadcast: Server broadcast federated Lasso to clients. After federation, the central server broadcasts to clients the central federated Lasso equation 5, helping client to do valid screening procedure to exclude sub-optimal arms and ﬁnd out the candidates of optimal arms. Such screening procedure is termed federated screening and elaborated at Algorithm 3. The broadcast event helps security by the fact that one client cannot infer other clients decisions based on the federated Lasso, since the ﬁnal decision is based on private egocentric Lasso (Eq. 6) as in Algorithm 4. Such a feature helps us avoid internal potential privacy leakage from one client to the other.

Published in Transactions on Machine Learning Research (07/2023)

(a) Synthetic Data

(b) Pharm GKB

(c) Medical MNIST

Figure 3: Comparison of the performances of Lasso bandit in Bastani & Bayati (2020), Teamwork Lasso bandit, and Fedego Lasso bandit algorithms. Fig (a). Average cumulative regrets per client on synthetic data. Fig (b). Error rates on the Pharm GKB dataset. Fig(c). Average cumulative regrets per client on the Medical MNIST dataset.

(a) Synthetic Data

(b) Pharm GKB

(c) Medical MNIST

Figure 4: The performances when the number of clients increase. Fig (a). Average cumulative regrets per client on synthetic data. Fig (b). Error rates on the Pharm GKB dataset. Fig(c). Average cumulative regrets per client on the Medical MNIST dataset.

3.4 Communication cost and Regret guarantee of Fedego Lasso

The following theorem bounds the regret of Fedego Lasso bandit algorithm (Algorithm 1).

Theorem 3.7. The communication cost of Fedego Lasso is O(log2(T/Kq)). The regret of Fedego Lasso bandit algorithms π over decision horizon [T] satisﬁes

Regretπ(T) = O(log M[log T + log d]2)

Proof. See Section D for details. The result supports that Fedego Lasso is the ﬁrst federated bandit algorithms (compared to others in Table 1) that secures sub-linear regret in high-dimensional scenario. Additionally, if we transfer the problem into the setting of Kim & Paik (2019); Hao et al. (2020); Li et al. (2022) as mentioned in Section 1.1, both regret bounds in their works and our work are logarithmic in the dimension d. However, their regret bounds are either of order O(

T) or O(T 2/3), while our regret bound is logarithmic in T and the dimension d. The advantage comes from not only our unique algorithm design, but also the diﬀerence in problem setting and the assumptions in Section 2.2

4 Empirical results

While the theoretical regret analysis (Theorem 3.7) provides worst-case guarantees for Fedego Lasso, we now examine their performance on a variety of tasks empirically. Examination is conducted using both synthetic and real-world data.

A. Experiment setup. We compare Fedego Lasso bandit with the vanilla Lasso bandit algorithm in (Bastani & Bayati, 2020) and the Teamwork Lasso bandit algorithm Wang & Cheng (2020) on the three

Published in Transactions on Machine Learning Research (07/2023)

datasets. We run each algorithm for 10 independent trials on each dataset and plot the average results with two standard deviations. Additional experimental details are provided in Appendix F. (a).Synthetic Data. The sparse parameters {β(m) k }K k=1 for each client is generated from randomly sampled support, and then the nonzero parameter values are generated from the uniform distribution on [0,1]. The decision context are drawn randomly from a multivariate normal distribution. (b).Real data: Pharm GKB. The ﬁrst real-world task is personalized dosage searching. The Pharmacogenomics Knowledge Base (Pharm GKB) dataset was used by Bastani & Bayati (2020) to support the superiority of Lasso bandit over the other bandit algorithms. We inherit their settings and investigate the error rates of the two algorithms when giving warfarin dosages based on patient-level decision context such as demographics, diagnosis, and medications. (c).Real data: Medical MNIST. The second real-world task is to collaborate classiﬁcation of Medical MNIST images. To extract the useful features vectors from the medical images, we ﬁrst train a fully-connected neural network on the dataset till it reaches good training and testing accuracies. At each round of the bandit problem, an image is sampled from the dataset and fed into the neural network. The covariates for the bandit algorithms are the output of an intermediate layer of the neural network. We deﬁne the instantaneous regret by whether the image is correctly classiﬁed.

B. Beneﬁts of federation architecture. Figure 3 supports the beneﬁts of federation architecture enjoyed by the proposed Fedego Lasso over the other baselines. In synthetic data scenario, Fedego Lasso enjoys substantial regret reduction in Figure 3.(a). In the task of personalized dosage searching, Fedego Lasso enjoys faster error rate convergence in Figure 3.(b). In the task of medical image labeling, Fedego Lasso again enjoys substantial regret reduction in Figure 3.(c). These empirical evidences supports the beneﬁts of the proposed federation architecture.

C. Beneﬁts of large number of clients. Figure 4 supports the beneﬁt of having large number of clients in federated bandit learning. In the synthetic data and the task of medical image labeling, Fedego Lasso enjoys further regret reduction by recruiting more clients in bandit learning, as supported by Figure 4.(a) and (c). In the task of personalized dosage searching, Fedego Lasso with more clients enjoy faster error rate convergence, as in Figure 4.(b).

5 Conclusion

We build a novel architecture on federated linear contextual bandits model with high-dimensional decision context. Such architecture delivers a uniﬁed federated bandit framework that resolves local dataset heterogeneity, local data and decision rule privacy security and explore-exploit tradeoﬀsimultaneously. The associated algorithm Fedego Lasso utilizes the sparse structure of local bandit models to recover the global parameters and inform eﬃcient federated exploration with eﬀective egocentric exploitation. Theoretical analysis supports that Fedego Lasso secures sub-linear regret in the order of O(log M(log T)2) with a communication cost in the order of O(log2(T/Kq)).

Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-conﬁdence-set conversions and application to sparse stochastic bandits. In Artiﬁcial Intelligence and Statistics, pp. 1 9, 2012.

Alekh Agarwal, John Langford, and Chen-Yu Wei. Federated residual learning. ar Xiv preprint ar Xiv:2003.12880, 2020.

Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pp. 39 1. JMLR Workshop and Conference Proceedings, 2012.

Shipra Agrawal and Navin Goyal. Further optimal regret bounds for thompson sampling. In Artiﬁcial intelligence and statistics, pp. 99 107. PMLR, 2013a.

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoﬀs. In International Conference on Machine Learning, pp. 127 135. PMLR, 2013b.

Noga Alon and Joel H Spencer. The probabilistic method. John Wiley & Sons, 2004.

Published in Transactions on Machine Learning Research (07/2023)

Sanae Amani and Christos Thrampoulidis. Decentralized multi-agent linear bandits with safety constraints. ar Xiv preprint ar Xiv:2012.00314, 2020.

Yoshinori Aono, Takuya Hayashi, Lihua Wang, Shiho Moriai, et al. Privacy-preserving deep learning via additively homomorphic encryption. IEEE Transactions on Information Forensics and Security, 13(5): 1333 1345, 2017.

Peter Auer. Using conﬁdence bounds for exploitation-exploration trade-oﬀs. Journal of Machine Learning Research, 3(Nov):397 422, 2002.

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235 256, 2002.

Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276 294, 2020.

Lilian Besson and Emilie Kaufmann. What doubling tricks can and can t do for multi-armed bandits. ar Xiv preprint ar Xiv:1803.06971, 2018.

Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Machine Learning, 5(1):1 122, 2012.

Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.

Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting shared representations for personalized federated learning. ar Xiv preprint ar Xiv:2102.07078, 2021.

Maria Dimakopoulou, Susan Athey, and Guido Imbens. Estimation considerations in contextual bandits. ar Xiv preprint ar Xiv:1711.07077, 2017.

Abhimanyu Dubey and Alex Pentland. Diﬀerentially-private federated linear bandits. ar Xiv preprint ar Xiv:2010.11425, 2020.

Alexander Goldenshluger and Assaf Zeevi. A linear response bandit problem. Stochastic Systems, 3(1): 230 261, 2013.

Botao Hao, Tor Lattimore, and Mengdi Wang. High-dimensional sparse linear bandits. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 10753 10763. Curran Associates, Inc., 2020. URL https://proceedings.neurips. cc/paper/2020/file/7a006957be65e608e863301eb98e1808-Paper.pdf.

Ruiquan Huang, Weiqiang Wu, Jing Yang, and Cong Shen. Federated linear contextual bandits. In Thirty Fifth Conference on Neural Information Processing Systems, 2021.

Ali Jadbabaie, Haochuan Li, Jian Qian, and Yi Tian. Byzantine-robust federated linear bandits. ar Xiv preprint ar Xiv:2204.01155, 2022.

Peter Kairouz, H Brendan Mc Mahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. ar Xiv preprint ar Xiv:1912.04977, 2019.

Gi-Soo Kim and Myunghee Cho Paik. Doubly-robust lasso bandit. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Jakub Konečn y, H Brendan Mc Mahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication eﬃciency. ar Xiv preprint ar Xiv:1610.05492, 2016.

Published in Transactions on Machine Learning Research (07/2023)

Tze Leung Lai and Herbert Robbins. Asymptotically eﬃcient adaptive allocation rules. Advances in applied mathematics, 6(1):4 22, 1985.

John Langford and Tong Zhang. Epoch-greedy algorithm for multi-armed bandits with side information. Advances in Neural Information Processing Systems (NIPS 2007), 20:1, 2007.

Chuanhao Li and Hongning Wang. Asynchronous upper conﬁdence bound algorithms for federated linear bandits. ar Xiv preprint ar Xiv:2110.01463, 2021.

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661 670, 2010.

Tan Li, Linqi Song, and Christina Fragouli. Federated recommendation system via diﬀerential privacy. In 2020 IEEE International Symposium on Information Theory (ISIT), pp. 2592 2597. IEEE, 2020.

Wenjie Li, Adarsh Barik, and Jean Honorio. A simple uniﬁed framework for high dimensional bandit problems. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 12619 12655. PMLR, 17 23 Jul 2022.

Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-eﬃcient learning of deep networks from decentralized data. In Artiﬁcial intelligence and statistics, pp. 1273 1282. PMLR, 2017.

Min-Hwan Oh, Garud Iyengar, and Assaf Zeevi. Sparsity-agnostic lasso bandit. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8271 8280. PMLR, 18 24 Jul 2021.

Vianney Perchet, Philippe Rigollet, et al. The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693 721, 2013.

Paat Rusmevichientong and John N Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395 411, 2010.

Chengshuai Shi, Cong Shen, and Jing Yang. Federated multi-armed bandits with personalization. In International Conference on Artiﬁcial Intelligence and Statistics, pp. 2917 2925. PMLR, 2021.

Alexander B Tsybakov et al. Optimal aggregation of classiﬁers in statistical learning. The Annals of Statistics, 32(1):135 166, 2004.

Chi-Hua Wang and Guang Cheng. Online batch decision-making with high-dimensional covariates. In International Conference on Artiﬁcial Intelligence and Statistics, pp. 3848 3857. PMLR, 2020.

Xue Wang, Mingcheng Wei, and Tao Yao. Minimax concave penalized multi-armed bandit model with high-dimensional covariates. In International Conference on Machine Learning, pp. 5200 5208. PMLR, 2018.

Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1 19, 2019.

Zhaowei Zhu, Jingxuan Zhu, Ji Liu, and Yang Liu. Federated bandit: A gossiping approach. In Abstract Proceedings of the 2021 ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, pp. 3 4, 2021.

Published in Transactions on Machine Learning Research (07/2023)

A Proof of Lemma 3.1

The proof strategy is to use Lemma E.1 on the dataset (T (m) [t],k ). We ﬁrst prove three lemmas for bounding probability of bad events. Then we combines these results to gain deviation inequality of Teamwork Lasso under Fed Ego bandit sampling scheme.

The following lemma gives a high-probability lower bound of the optimal allocation rate in the dataset (T (m) [t],k ).

Lemma A.1. If t (Kq)2, the following holds

P |(T (m) [t],k ) |

|(T (m) [t],k )| > Mp

1 2 Mt4 . (7)

Proof. Recall a version of the Chernoﬀinequality: given y a sum of mutually independent indicator random variables with expectation µ = E[y], one has that P[|y µ| > µ/2] < 2 exp( µ/10).

Probability upper bound on the expected optimal allocation number. Consider the indicator of optimal allocation variable I(a (T (m) [t],k ) ) for all a (T (m) [t],k ). Assumption 2.2 implies that the expected opti-

mal allocation number µ = E[P

a (T (m) [t],k)]I(a (T (m) [t],k ) ) is at least Mp |(T (m) [t],k )|; that is, µ Mp |(T (m) [t],k )|. In

this case, the Chernoﬀinequality indicate a high-probability upper bound on the optimal allocation number |(T (m) [t],k ) | as

P[|(T (m) [t],k ) | < Mp

2 |(T (m) [t],k )|] < 2 exp( Mp

10 |(T (m) [t],k )|). (8a)

Recall that |(T (m) [t],k )| (1/2)q log t. The equation equation 8a becomes

P[|(T (m) [t],k ) | < Mp

2 |(T (m) [t],k )|] < 2 exp( Mp

20 q log t). (8b)

Require q 20 Mp (4 + log M). The equation equation 8b becomes

P[|(T (m) [t],k ) | < Mp

2 |(T (m) [t],k )|] < 2 exp( (4 log t + log M)) = 2 Mt4 . (8c)

The following lemma gives the probability upper bound of the failure of compatability condition of sample covariance matrix on dataset (T (m) [t],k ) .

P ˆΣ((T (m) [t],k ) ) C supp(β), φ1

1 1 Mt4 (9)

Proof. By an application of Lemma EC.6 in Bastani & Bayati (2020), we ﬁrst have

P ˆΣ((T (m) [t],k ) ) C supp(β), φ1

1 exp( C2(φ1)2|(T (m) [t],k ) |) (10a)

Recall that |(T (m) [t],k )| (1/2)q log t. Lemma A.1 indicates that with probability at least 1 = 2/Mt4,

|(T (m) [t],k ) | Mp

2 |(T (m) [t],k )| Mp

4 q log t (10b)

Published in Transactions on Machine Learning Research (07/2023)

Require q 4C2(φ1)2

Mp (4 + log M). The equation equation 10a becomes

P ˆΣ((T (m) [t],k ) ) / C supp(β), φ1

exp( (4 log t + log M)) = 1 Mt4 . (10c)

Lemma A.3. Set λ(m) 1 φ2 0Mp h/64s0xmax. Then we have,

P max r [d]

2 ε X(r) /|(T (m) [t],k )| λ(m) 1 /2 1 2 Mt4 (11)

Proof. Recall that |(T (m) [t],k )| (1/2)q log t. Require q 512x2 max log d C1(φ)M 2p2 h2 (4 + log M). we have

exp C1(φ Mp

2 )|(T (m) [t],k )|( h 4xmax )2 + log d

exp C1(φ)M 2p2 16 1

2q log t h2

16x2max + log d

= exp C1(φ)M 2p2 h2

512x2max q log t + log d

exp (4 log t + log M) = 1 Mt4

Proposition A.4. Given

Mp , 4C2(φ1)2

Mp , 512x2 max log d C1(φ)M 2p2 h2 }(4 + log M).

P ˆβ((T (m) [t],k ), λ1) βk 1 > h 4xmax

Proof. Combine the above three lemmas.

P ˆβ((T (m) [t],k ), λ1) βk 1 > h 4xmax

P max r [d] 1

|(T (m) [t],k )| |ϵ X(r)| λ0(χ, φ p

+P ˆΣ(((T (m) [t],k )) ) / C(supp(β), φ

+P |((T (m) [t],k )) |

|(T (m) [t],k )| p

Mt + 1 Mt + 2 Mt = 5 Mt

Published in Transactions on Machine Learning Research (07/2023)

B Proof of Lemma 3.4

Recall the deﬁnition of central federated Lasso equation 5 ˆβ k,t 1 M PM m=1 ˆβ ,(m) k,t .

Step 1 Set the size of error χ > 0.

P ˆβ# k,t βk 1 > χ (15)

m=1 ˆβ T (m) [t],k , λ(m) 1 1

m=1 [ˆβ T (m) [t],k , λ(m) 1 βk]

ˆβ T (m) [t],k , λ(m) 1 βk 1 > χ

ˆβ T (m) [t],k , λ(m) 1 βk 1 > Mχ

P m [M] : ˆβ T (m) [t],k , λ(m) 1 βk 1 > χ (20)

m=1 P ˆβ T (m) [t],k , λ(m) 1 βk 1 > χ (21)

The ﬁrst inequality is due to the fact that the ℓ1-norm is convex. An application of Jensen s inequality shows that 1

M PM m=1(ˆβ T (m) [t],k , λ(m) 1 β(m) 0 ) 1 = 1 M PM m=1(ˆβ T (m) [t],k , λ(m) 1 β(m) 0 1

1 M PM m=1 ˆβ T (m) [t],k , λ(m) 1 β(m) 0 1. The second inequality is due to the fact that the ℓ1-norm is non-negative and an application of Pigeonhole principle. The third inequality is an application of Boole s inequality, also known as the union bound.

Step 2 The client m sets the size of error χ = h(m) 4xmax ,

P ˆβ# k,t βk 1 > h(m)

ˆβ# k,t βk 1 > minm [M] h(m)

ˆβ T (m) [t],k , λ(m) 1 βk 1 > minm [M] h(m)

M 5 Mt4 = 5

The ﬁrst inequality is due to the fact that tail probability is a decreasing function. The second inequality is due to Step 1 above. The third inequality is due to is by the deviation inequality of client Teamwork Lasso estimate (lemma 3.1) by setting the regularization level to be λ(m) 1 = (φ0)2p minm [M] h(m)/ (64s0xmax).

Step 3. Deﬁne a "good event"

ˆβ k,t βk 1 h(m)

The event marks that each central federated Lasso is suﬃciently accurate to exclude out sub-optimal arms.

Published in Transactions on Machine Learning Research (07/2023)

P (E(m) t )c =P k A

ˆβ# k,t βk 1 h(m)

k=1 P ˆβ# k,t βk 1 h(m)

The ﬁrst inequality is an application of Boole s inequality, also known as the union bound. The second inequality is due to Step 2 above.

C Proof of Lemma 3.6

The proof strategy is to use Lemma E.1 on the dataset S(m) [t],k.

P |(S(m) [t],k) | Mp

4 t 1 exp( M 2p2 36 t). (30)

Proof. We have E[Mk,t] P[Ant Ant+1] P(Xt Uk) |Vt|

Thus, Hoeﬀding inequality indicates that

P(Mk,t < E[Mk,t] η) exp( 2η2

|Vt|) exp( 4η2

In particular, for η = Mtp /12, we have

P(Mk,t < Mp

4 t) exp( M 2p2 36 t)

Thus, the result follows from the fact that Mk,t |(S(m) [t],k) |.

Lemma C.2. For t > C5 mint{log Mt < t M2p2 C2 2(φ) 16 }

P ˆΣ((S(m) [t],k) ) C supp(β), Mp

1 1 Mt (32)

Proof. In this case, we have sample size at most t and minimal sampling rate Mp /2.The ﬁrst is due to |S(m) [t],k| t and the second is due to |(S(m) [t],k) | Mp t/4. Thus, we have |(S(m) [t],k) |/|S(m) [t],k| Mp /4. By

deﬁnition of C5 mint{log Mt < t M 2p2 C2 2(φ) 16 }, we have the probability bound

P ˆΣ((S(m) [t],k) ) / C supp(β), Mp

exp( t M 2p2 C2 2 16 )

exp( log(Mt)) = 1 Mt

Published in Transactions on Machine Learning Research (07/2023)

Lemma C.3. Set λ(m) 2,t = φ2 0 32s0

M log d Mt C1(φ0)p t , then we have

P max r [d]

2 ε X(r) /|S(m) [t],k| λ(m) 2,t /2 1 2 Mt (34)

Proof. In this case, we have minimal sampling rate Mp /2 since |S(m) [t],k| t and |(S(m) [t],k) | Mp t/4. Let

log d Mt p3 C1(φ0)Mt.

exp C1(φ Mp

2 )|S(m) [t],k|χ2 + log d

exp C1(φ)M 2p2 16 Mp

4 t 256 log(d Mt) p3 C1 (φ0) Mt + log d

= exp 4M 2 log(d Mt) + log d

exp (log t + log M) = 1 Mt

Lemma C.4. Given t > C5. We have

P ˆβ(S(m) [t],k, λ2,t) βk 1 > 16

log t + log M + log d

p3 C1 (φ0) Mt

2 Mt + 2 exp

p2 C2 (φ0)2

Proof. Apply the above three lemmas

P ˆβ(S(m) [t],k, λ1) βk 1 > h 4xmax

P max r [d] 1

|S(m) [t],k| |ϵ X(r)| λ0(χ, φ p

+P ˆΣ((S(m) [t],k) ) / C(supp(β), φ

+P |(S(m) [t],k) |

|S(m) [t],k| p

Mt + exp M 2p2 C2(φ)2

16 t + exp M 2p2 36 t

p2 C2 (φ0)2

D Proof of Theorem 3.7

We break the proof into 3 steps.

Published in Transactions on Machine Learning Research (07/2023)

Step 1. Instantaneous regret. Given Lemma 3.6, if t > C5, we have

i ˆK I h X t+1 ˆβi X t+1 ˆβ1 Bi i

i ˆK I (Bc i )

2bxmax Pr h β1 ˆβ1 1 > δ i + Pr h ˆβi βi 1 > δ i + 2δxmax2C0δxmax

p2 C2 (φ0)2

) + 4C0x2 maxδ2

p2 C2 (φ0)2

) + 1024C0x2 max p3 C1

log t + log M + log d

8bxmax(1 32

p2 C2 (φ0)2 ) 1 M 2t + 1024C0x2 max p3 C1

log t + log M + log d

=L1 1 M 2t + L2 log t + log M + log d

The last inequality is due to exp( x) < 1/x for all x > 0. The constants are L1 = 8bxmax(1 32 p2 C2(φ0)2 )

and L2 = 1024 C0x2 max p3 C1 .

Step 2. Cumulative Regret of one client.

t=C5 rt+1 (38)

log t + log M + log d

M log T + L2

(log T)2 + log(Md) log T (40)

M log T + L2 log(Md) log T + L2(log T)2 (41)

Step 3. Cumulative regret of all client

M 2bxmax + 1

M log T + L2 log(Md) log T + L2(log T)2

=C52bxmax + L1

M + L2 log(Md) log T + L2(log T)2

=O(log M(log T)2)

Published in Transactions on Machine Learning Research (07/2023)

E Supporting Lemmas

Here we state a general version of Lemma 1 in Bastani & Bayati (2020) to facilitate our analysis.

Lemma E.1. (Lasso deviation inequality for non i.i.d. dataset) Given a dataset D with its i.i.d. subdataset D . Let ˆβ(D, λ) be the Lasso regression estimator (Deﬁnition 1.1) trained on D with regularization level λ. Suppose the population covariance matrix Σ satisfy the compatability condition C(supp(β), φ). Let χ > 0 be an error upper bound speciﬁed by user. Set λ χ, φ p/2 = χφ2p/ (16s0). Then, the l1 estimation error of ˆβ(D, λ) satisﬁes the following deviation inequality:

P ˆβ(D, λ) β 1 χ α(χ), (42)

α(χ) = P max r [d] 1 |D||ϵ X(r)| λ0(χ, φ p

2 ) + P ˆΣ(D ) / C(supp(β), φ

2) + P |D |

F Experiment Details and Additional Experiments

F.1 Synthetic Data

For the baseline case, we set d = 100, K = 5, M = 5 and s = 5, where s is the sparsity level of the parameters that should satisfy s d. The β of each arm for each client is generated by ﬁrst choosing the possibly nonzero positions for all clients. In our case we randomly choose 10 positions for each arm and the 10 positions are shared across the diﬀerent clients to represent that the bandit environment faced by diﬀerent clients are similar (but not exactly the same). We then choose the s non-zero elements randomly within the 10 positions of each arm for each client, and then generate the values from a uniform distribution on [0, 1]. Note that the nonzero values of each arm across diﬀerent clients are diﬀerent, and the positions are diﬀerent too. The covariates of each client are generated from a standard normal distributions independently.

We provide some additional experimental results for diﬀerent choices of d, K, M in Figure 5 for readers interested in the eﬀect of changing these hyper-parameters in our synthetic data.

F.2 Pharm GKB

We utilize the publicly available code for Pharm GKB at https://github.com/chuchro3/Warfarin for the processing of the dataset. Speciﬁcally, the covariates are generated from either the value in the dataset if it is numerical or from a bag-of-words of all the categories if categorical. Then the covariates are transformed into one-hot encodings and thus the vector is very high-dimensional (d = 5528). We follow Bastani & Bayati (2020) and classify the diﬀerent dosages of Warfarin into three classes. [0, 3] is the low-dosage class, [3, 7] is the moderate dosage class, and [7, ] is the high-dosage class. The reward is deﬁned to be 1 if the correct dosage class is chosen and 0 if not, and we add standard Gaussian noise to the reward. The error rate (in Figure 3) is deﬁned by the number of wrong classiﬁcations divided by the total number of classiﬁcations.

F.3 Medical MNIST

Neural Network Setting. The fully-connected neural network used to train on the Medical MNIST dataset has the architecture shown in Table 2, where input size is the size of the vectorized image, which is 10800 in our case and output size is the number of classes, which is 6. The dataset was randomly split into a training set and a testing set in the ratio of 9:1. The images are pre-processed with random rotation, random horizontal ﬂipping, resizing, center cropping and normalization. The training batch size was 32. The optimizer was SGD with learning rate 0.01 and no momentum or weight decay. The neural network was trained on the training set for 10 epochs and evaluated on the testing set. The ﬁnal testing classiﬁcation accuracy was around 99%.

Bandit Setting. The number of arms for each agent is equal to the number of classes (6) in this problem. We use the output of the last Re LU layer (i.e., before the last layer) as the feature vector for bandit problems,

Published in Transactions on Machine Learning Research (07/2023)

which means that the feature dimension is 500. At each round, The regret is deﬁned to be whether the feature vector is correctly classiﬁed, i.e., the reward is 0 if the the chosen arm is diﬀerent from the class of the image and 1 if they are the same. Standard Gaussian noise is added to the reward received by the bandit algorithms.

Table 2: Architecture of the neural network

in Layer: Linear Layer(input size, 4000)

Re LU() hidden1: Linear Layer(4000, 2000)

Re LU() hidden2: Linear Layer(2000, 1000)

Re LU() hidden3: Linear Layer(1000, 500)

Re LU() out Layer: Linear Layer(500, output size)

Published in Transactions on Machine Learning Research (07/2023)

(a) M = 5, d = 50

(b) M = 5, d = 100

(c) M = 5, d = 200

(d) M = 10, d = 50

(e) M = 10, d = 100

(f) M = 10, d = 200

(g) M = 15, d = 50

(h) M = 15, d = 100

(i) M = 15, d = 200

Figure 5: Comparison of the the average cumulative regret per client of the Fedego Fedego Lasso algorithm under diﬀerent settings of the parameters M, K, d on the synthetic dataset