# Boosting with Multiple Sources

Corinna Cortes (Google Research, New York, NY 10011; corinna@google.com) · Mehryar Mohri (Google & Courant Institute, New York, NY 10012; mohri@google.com) · Dmitry Storcheus (Courant Institute & Google, New York, NY 10012; dstorcheus@google.com) · Ananda Theertha Suresh (Google Research, New York, NY 10011; theertha@google.com)

We study the problem of learning accurate ensemble predictors, in particular boosting, in the presence of multiple source domains. We show that standard convex-combination ensembles in general cannot succeed in this scenario and adopt instead a domain-weighted combination. We introduce and analyze a new boosting algorithm, MULTIBOOST, for this scenario and show that it benefits from favorable theoretical guarantees. We also report the results of several experiments with our algorithm demonstrating that it outperforms natural baselines on multi-source text-based, image-based, and tabular data. We further present an extension of our algorithm to the federated learning scenario and report favorable experimental results for that setting as well. Additionally, we describe in detail an extension of our algorithm to the multi-class setting, MCMULTIBOOST, for which we also report experimental results.

## 1 Motivation

Ensemble methods such as Bagging, AdaBoost, Stacking, error-correction techniques, Bayesian averaging, AdaNet, or other adaptive methods for learning neural networks are general machine learning techniques used to combine several predictors to devise a more accurate one (Breiman, 1996; Freund and Schapire, 1997; Smyth and Wolpert, 1999; MacKay, 1991; Freund et al., 2004; Cortes et al., 2014; Kuznetsov et al., 2014; Cortes et al., 2017). These techniques are often very effective in practice and benefit from favorable margin-based learning guarantees (Schapire et al., 1997). These algorithms assume access to a training sample drawn from the target distribution.
But, in many applications, the learner receives labeled data from multiple source domains that it seeks to use to find a good predictor for target domains, which are typically assumed to be mixtures of the source distributions (Mansour et al., 2008, 2009a; Hoffman et al., 2018, 2021; Cortes et al., 2021; Muandet et al., 2013; Xu et al., 2014; Hoffman et al., 2012; Saito et al., 2019; Wang et al., 2019a). How can we generalize ensemble methods such as boosting to this scenario? Several related problems have been tackled in previous work. In the special case of a single target domain, boosting solutions have been derived (Dai et al., 2007; Yao and Doretto, 2010). These algorithms have been further improved by boosting base predictors jointly with source features (or feature views) that are predictive in the target (Yuan et al., 2017; Cheng et al., 2013; Zhang et al., 2014; Xu and Sun, 2012, 2011). Such methods have been widely adopted in various domain adaptation scenarios for different types of data. For example, Huang et al. (2010, 2012) showed that, by selecting a base learner jointly with a feature that is predictive across multiple domains at every boosting step, one can achieve higher accuracy than standard transfer learning methods. Moreover, the margin provided by boosting-style algorithms can aid in transfer learning where the target domain is unlabeled. Habrard et al. (2013) developed an algorithm that jointly minimizes the source domain error and the margin violation proportion on the target domain. Wang et al. (2019a) demonstrated that boosting classifiers from different domains can be carried out online. Taherkhani et al. (2020) and Becker et al. (2013) showed that multi-source boosting can be combined with deep neural networks for multi-task learning on large-scale datasets. We give a more detailed discussion of this related prior work in Appendix A.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).
However, as we demonstrate in this paper, several critical questions related to the presence of multiple domains have not been fully addressed by existing boosting methods, from both the algorithmic and the theoretical point of view. First, what should be the form of the ensemble solutions? We show that, in general, the standard convex combinations used in much of the previous work may not succeed in this problem (Section 2). Instead, we put forward Q-ensembles, which are convex combinations weighted by a domain classifier Q, that is, Q(k|x) is the conditional probability of domain k given input point x. This is inspired by similar Q-ensembles (Cortes, Mohri, Suresh, and Zhang, 2021) or distribution-weighted combinations (Mansour, Mohri, and Rostamizadeh, 2008, 2009a; Hoffman, Mohri, and Zhang, 2018, 2021) used in the context of multiple-source adaptation, and it crucially differentiates our work from past studies of ensemble methods. Our learning scenario strictly generalizes that of previous algorithms. In the special case of a single domain, the form of our ensemble solutions coincides with that of the familiar ensembles used in AdaBoost. In Appendix J, we further compare our results and guarantees with the multiple-source adaptation algorithms of (Hoffman, Mohri, and Zhang, 2018, 2021; Cortes, Mohri, Suresh, and Zhang, 2021), when using AdaBoost to derive a predictor for each domain. Second, what should be the form of the objective? Unlike existing boosting methods, which optimize for a specific target distribution or domain, we seek a solution that is accurate for any mixture of the source distributions, where the mixture weights may be constrained to lie in a subset of the simplex. This conforms with the learning goal in this scenario, where the target distribution is typically assumed to be a mixture of the source ones, and is further inspired by the formulation of the agnostic loss in the related context of federated learning (Mohri, Sivek, and Suresh, 2019).
Additionally, this formulation can be viewed as more risk-averse and robust, and it also ensures more favorable fairness properties, since it does not favor any distribution possibly biased towards some subset of the source domains. In Section 3, we introduce and analyze an algorithm, MULTIBOOST, that is based on such an objective. This further distinguishes our boosting algorithm from related prior work. In Appendix L, we further give a detailed description of a multi-class extension of our algorithm, MCMULTIBOOST, including its pseudocode, a discussion of some of its properties, and the results of several experiments. Third, in Section 4, we present a theoretical analysis of our MULTIBOOST algorithm, including margin-based generalization bounds that hold for any target mixture of the source distributions. In Appendix F, we derive finer and more flexible learning bounds that can provide more favorable guarantees, in particular when the family of target mixtures is more constrained. Fourth, in Section 5, we describe FEDMULTIBOOST, an extension of our MULTIBOOST algorithm adapted to the distributed learning scenario of federated learning, which is increasingly important in a wide array of applications (Kairouz et al., 2021), including healthcare (Brisimi et al., 2018). FEDMULTIBOOST allows the server to train ensemble models that benefit from clients' data, while the data remains with the clients. We provide a detailed description of our algorithm, as well as a brief analysis of the communication costs, which are critical in this scenario. Appendix G contains experimental results comparing FEDMULTIBOOST with several baselines. Finally, in Section 6, we report the results of extensive experiments with our MULTIBOOST algorithm. We start with a detailed description of the learning scenario we consider.

## 2 Learning scenario

Let $\mathcal{X}$ denote the input space and $\mathcal{Y} = \{-1, +1\}$ the output space associated with binary classification.
We consider a scenario where the learner receives labeled samples from $p$ source domains, each defined by a distribution $D_k$ over $\mathcal{X} \times \mathcal{Y}$, $k \in [p]$. We denote by $S_k = ((x_{k,1}, y_{k,1}), \ldots, (x_{k,m_k}, y_{k,m_k})) \in (\mathcal{X} \times \mathcal{Y})^{m_k}$ the labeled sample of size $m_k$ received from source $k$, which is drawn i.i.d. from $D_k$. For any function $f \colon \mathcal{X} \to \mathbb{R}$ and distribution $D$ over $\mathcal{X} \times \mathcal{Y}$, let $\mathcal{L}(D, f)$ be the expected loss of $f$, that is $\mathcal{L}(D, f) = \mathbb{E}_{(x, y) \sim D}[\ell(f(x), y)]$, where $\ell$ is the binary loss $\ell(f(x), y) = \mathbb{1}_{y f(x) \leq 0}$. For any $k \in [p]$, let $H_k$ denote a hypothesis set of functions mapping from $\mathcal{X}$ to $[-1, +1]$, with $|H_k| = N_k$. The objective of the learner is to find a predictor $f$ that is accurate for any target distribution $D_\lambda$ that is a mixture of the source distributions, where $\lambda$ may be in a subset $\Lambda$ of the simplex. Thus, $D_\lambda$ can be written as $D_\lambda = \sum_{k=1}^{p} \lambda_k D_k$, with $\lambda = (\lambda_1, \ldots, \lambda_p) \geq 0$ and $\sum_{k=1}^{p} \lambda_k = 1$. This leads to the following agnostic loss of a predictor $f$ (Mohri et al., 2019):

$$\mathcal{L}(D_\Lambda, f) = \max_{\lambda \in \Lambda} \mathcal{L}(D_\lambda, f). \tag{1}$$

To come up with a predictor $f$, the learner seeks an ensemble of functions from the base classes $H_k$. What should be the form of such ensembles? A natural solution is a convex combination of the form

$$f = \sum_{l=1}^{p} \sum_{j=1}^{N_l} \alpha_{l,j} \, h_{l,j}, \tag{2}$$

where $h_{l,j} \in H_l$ and $\alpha_{l,j} \geq 0$, $\sum_{l=1}^{p} \sum_{j=1}^{N_l} \alpha_{l,j} = 1$, as in prominent algorithms such as BAGGING (Breiman, 1999) or ADABOOST (Freund and Schapire, 1997). It is not hard to show, however, that for some distributions, any such convex combination of base predictors would lead to a poor solution.

Proposition 1. There exist distributions $D_1$ and $D_2$ and hypotheses $h_1$ and $h_2$ with $\mathcal{L}(D_1, h_1) = 0$ and $\mathcal{L}(D_2, h_2) = 0$ such that $\mathcal{L}\big(\tfrac{1}{2}(D_1 + D_2),\ \alpha h_1 + (1 - \alpha) h_2\big) \geq \tfrac{1}{2}$ for any $\alpha \in [0, 1]$.

This result is similar to (Mansour et al., 2008, Theorem 1) and to similar statements in (Mansour et al., 2008, 2009a; Hoffman et al., 2018, 2021; Cortes et al., 2021); however, there are some technical differences. In particular, we are considering here a binary classification loss and not the absolute loss. The proof is given in Appendix B.
Note that in the example of the proposition, the base predictors $h_1$ and $h_2$ are both perfect for their respective domains. Nevertheless, unlike the common case of a single source domain, standard convex combinations in general may perform poorly. Thus, we will consider instead the following form for the ensembles of base predictors:

$$\forall x \in \mathcal{X}, \quad f(x) = \sum_{l=1}^{p} \sum_{j=1}^{N_l} \alpha_{l,j} \, Q(l \mid x) \, h_{l,j}(x), \tag{3}$$

where $Q(l \mid x)$ denotes the conditional probability of domain $l$ given $x$. This helps us account for the fact that the learner combines base predictors from different domains. For a given point $x$, experts from the domains where $x$ is more likely to appear are allocated more weight in the voted combination. The following result further substantiates the hope for this method to succeed.

Proposition 2. For the same distributions $D_1$ and $D_2$ and hypotheses $h_1$ and $h_2$ as in Proposition 1, the equality $\mathcal{L}\big(D_\lambda,\ \alpha Q(1 \mid \cdot) h_1 + (1 - \alpha) Q(2 \mid \cdot) h_2\big) = 0$ holds for any $\lambda$ and any $\alpha \in (0, 1)$.

The proof is also given in Appendix B. Thus, our Q-ensembles can be substantially more effective in the multiple-source adaptation problem considered. Note that, in the special case of a single domain, the Q-combinations coincide with standard convex combinations, since $Q(k \mid x) = 1$ for all $x$. As in the standard boosting scenario, we will adopt a weak-learning assumption. However, our assumption here must hold for each source domain: for each domain $k \in [p]$ and any distribution $D$ over $S_k$, there exists a base classifier $h \in H_k$ such that the weighted loss of $h$ is $\gamma$-better than random, $\frac{1}{2}\big[1 - \mathbb{E}_{i \sim D}[y_{k,i} h(x_{k,i})]\big] \leq \frac{1}{2} - \gamma$, for some edge value $\gamma > 0$. This is equivalent to the existence of a weak-learning algorithm for each domain, which is a mild assumption. As in the standard boosting scenario, this corresponds to the existence of a good rule of thumb for each domain. Unlike the standard scenario, however, here we seek a Q-ensemble and further require that ensemble to be accurate for any target mixture $D_\lambda$, $\lambda \in \Lambda$.
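The contrast between the two propositions can be checked numerically. Below is a minimal Python sketch of a toy instantiation of our own (two one-point domains with disjoint supports; this is not the construction used in Appendix B): each base predictor is perfect on its own domain, every convex combination errs on at least half of the uniform mixture, and the Q-ensemble is perfect.

```python
# Toy instantiation of Propositions 1-2 (our own illustrative example):
# domain 1 is concentrated on x1 with label +1, domain 2 on x2 with label +1.
points = {"x1": +1, "x2": +1}

def h1(x):  # perfect on domain 1, wrong on domain 2's support
    return +1 if x == "x1" else -1

def h2(x):  # perfect on domain 2, wrong on domain 1's support
    return +1 if x == "x2" else -1

def zero_one(f, x, y):  # binary loss I(y f(x) <= 0)
    return 1 if y * f(x) <= 0 else 0

def convex(alpha):  # standard convex combination of Eq. (2)
    return lambda x: alpha * h1(x) + (1 - alpha) * h2(x)

# Disjoint supports: Q(1|x1) = 1 and Q(2|x2) = 1.
Q = {("x1", 1): 1.0, ("x1", 2): 0.0, ("x2", 1): 0.0, ("x2", 2): 1.0}

def q_ensemble(alpha):  # the Q-ensemble of Eq. (3)
    return lambda x: alpha * Q[(x, 1)] * h1(x) + (1 - alpha) * Q[(x, 2)] * h2(x)

def loss_uniform_mixture(f):
    # L(1/2 (D1 + D2), f): average of the two one-point domain losses.
    return 0.5 * zero_one(f, "x1", +1) + 0.5 * zero_one(f, "x2", +1)

# Every convex combination errs on at least half of the uniform mixture ...
assert all(loss_uniform_mixture(convex(a / 10)) >= 0.5 for a in range(11))
# ... while the Q-ensemble is perfect for every interior alpha.
assert all(loss_uniform_mixture(q_ensemble(a / 10)) == 0.0 for a in range(1, 10))
```

The convex combination is forced to trade the two supports against each other, while the Q-weights switch the ensemble to the right expert on each support.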
Note that the definition of the hypothesis set $H_k$ and the weak-learning assumption for domain $k$ are intimately related: the elements of $H_k$ are typically simple predictors sufficient to guarantee the weak-learning assumption for that domain. In the next section, we present an algorithm, MULTIBOOST, for deriving an accurate Q-ensemble for any target mixture domain that belongs to the convex combination of the source domains. In Appendix L, we further present a multi-class extension of our algorithm, MCMULTIBOOST.

## 3 Algorithm

Let $\Phi$ be a convex, increasing, and differentiable function such that $u \mapsto \Phi(-u)$ upper-bounds the binary loss $u \mapsto \mathbb{1}_{u \leq 0}$. $\Phi$ could be the exponential function, as in ADABOOST, or the logistic function, as in logistic regression. Using $\Phi$ to upper-bound the agnostic loss leads to the following objective function for an ensemble $f$ defined by (3), for any $\alpha = (\alpha_{l,j})_{(l,j) \in [p] \times [N_l]} \geq 0$:

$$F(\alpha) = \max_{\lambda \in \Lambda} \sum_{k=1}^{p} \lambda_k \frac{1}{m_k} \sum_{i=1}^{m_k} \Phi\Big({-y_{k,i}} \sum_{l=1}^{p} \sum_{j=1}^{N_l} \alpha_{l,j} \, Q(l \mid x_{k,i}) \, h_{l,j}(x_{k,i})\Big). \tag{4}$$

Since a convex function composed with an affine function is convex and a sum of convex functions is convex, $F$ is convex as the maximum of a set of convex functions. In this section, we will consider the case where the set $\Lambda$ coincides with the full simplex, that is $\Lambda = \Delta$. $F$ can then be expressed more simply as $F = \max_{k=1}^{p} F_k$, with

$$F_k(\alpha) = \frac{1}{m_k} \sum_{i=1}^{m_k} \Phi\Big({-y_{k,i}} \sum_{l=1}^{p} \sum_{j=1}^{N_l} \alpha_{l,j} \, Q(l \mid x_{k,i}) \, h_{l,j}(x_{k,i})\Big).$$

It is known that ADABOOST coincides with coordinate descent applied to an exponential loss objective function (Duffy and Helmbold, 1999; Schapire and Freund, 2012; Mohri et al., 2018). Our algorithm similarly applies coordinate descent to the objective just described, as with other boosting-type algorithms (Mason et al., 1999; Cortes et al., 2014; Kuznetsov et al., 2014). While convex, $F$ is not differentiable and, in general, coordinate descent may not succeed in such cases (Tseng, 2001; Luo and Tseng, 1992).
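As a sketch, the objective for $\Lambda = \Delta$ can be computed directly as the worst per-domain surrogate loss, since the maximum over mixture weights is attained at a vertex of the simplex; all data structures below are illustrative placeholders, with $\Phi(u) = e^u$.

```python
import math

# Sketch of the agnostic objective F(alpha) = max_k F_k(alpha) of Eq. (4)
# for Lambda the full simplex, with the exponential surrogate Phi(u) = exp(u).

def agnostic_objective(alpha, samples, Q, hypotheses):
    """alpha: dict (l, j) -> weight; samples[k]: list of (x, y) pairs from
    domain k; Q(l, x): estimated domain posterior; hypotheses[l][j]: base
    classifier with values in [-1, +1]."""
    def f(x):  # the Q-ensemble of Eq. (3)
        return sum(w * Q(l, x) * hypotheses[l][j](x)
                   for (l, j), w in alpha.items())
    per_domain = [sum(math.exp(-y * f(x)) for x, y in S) / len(S)
                  for S in samples]  # F_k(alpha) for each domain k
    return max(per_domain)           # max over simplex vertices
```

With $\alpha = 0$ the ensemble predicts 0 everywhere, so every $F_k$ equals $\Phi(0) = 1$.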
This is because the algorithm may be stuck at a point where no progress is possible along any of the axis directions, while a favorable descent direction exists elsewhere. However, we will show that, under the weak-learning assumption we adopted, at any point $\alpha$ and for each active function $F_k$, that is, each $F_k$ such that $F_k(\alpha) = F(\alpha)$, there exists a coordinate direction along which descent progress is possible. We will assume that these directions are also descent directions for $F$. More generally, it suffices in fact that one such direction admits this guarantee. Alternatively, one can replace the maximum with a soft-max, that is, $(x_1, \ldots, x_p) \mapsto \frac{1}{\mu} \log\big(\sum_{k=1}^{p} e^{\mu x_k}\big)$ for $\mu > 0$, which leads to a continuously differentiable function and a close approximation for $\mu$ sufficiently large.

Description. Let $\alpha_{t-1}$ denote the value of the parameter vector $\alpha = (\alpha_{l,j})$ at the end of the $(t-1)$th iteration and $f_{t-1}$ the corresponding function, $f_{t-1} = \sum_{l=1}^{p} \sum_{j=1}^{N_l} \alpha_{t-1,l,j} \, Q(l \mid \cdot) \, h_{l,j}$. Coordinate descent at iteration $t$ consists of choosing a direction $e_{q,r}$, corresponding to the base classifier $h_{q,r}$, and a step value $\eta > 0$ to minimize $F(\alpha_{t-1} + \eta \, e_{q,r})$. To select a direction, we consider the subdifferential of $F$ along any $e_{q,r}$. Since the functions $F_k$ are differentiable, by the subdifferential calculus for the maximum of functions, the subdifferential of $F$ at $\alpha_{t-1}$ along the direction $e_{q,r}$ is given by $\partial F(\alpha_{t-1}, e_{q,r}) = \operatorname{conv}\{F'_k(\alpha_{t-1}, e_{q,r}) \colon k \in K_t\}$, where $F'_k(\alpha_{t-1}, e_{q,r})$ is the directional derivative of $F_k$ at $\alpha_{t-1}$ along the direction $e_{q,r}$ and where $K_t = \{k \in [p] \colon F_k(\alpha_{t-1}) = F(\alpha_{t-1})\}$. We will therefore consider the direction $e_{q,r}$ with the largest absolute directional derivative value $|F'_k(\alpha_{t-1}, e_{q,r})|$, $k \in K_t$, but will restrict ourselves to $q = k$ since, as we shall see, that will suffice to guarantee a non-zero directional derivative.
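The soft-max replacement mentioned above can be checked numerically; as a sketch (the values and the choices of $\mu$ are illustrative):

```python
import math

# The non-differentiable max over F_k can be replaced by the soft-max
# (x_1, ..., x_p) -> (1/mu) * log(sum_k exp(mu * x_k)), which upper-bounds
# the maximum and approaches it as mu grows.

def softmax_bound(xs, mu):
    m = max(xs)  # shift by the max for numerical stability
    return m + math.log(sum(math.exp(mu * (x - m)) for x in xs)) / mu

values = [0.3, 0.8, 0.5]
for mu in (1.0, 10.0, 100.0):
    s = softmax_bound(values, mu)
    # max <= soft-max <= max + log(p)/mu, so the gap shrinks as mu grows.
    assert max(values) <= s <= max(values) + math.log(len(values)) / mu
```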
To do so, we will express $F'_k(\alpha_{t-1}, e_{k,r})$ in terms of the distribution $D_t(k, \cdot)$ over $S_k$ defined, for all $i \in [m_k]$, by

$$D_t(k, i) = \frac{\Phi'\big({-y_{k,i}} f_{t-1}(x_{k,i})\big)}{Z_{t,k}}, \quad \text{with} \quad Z_{t,k} = \sum_{i=1}^{m_k} \Phi'\big({-y_{k,i}} f_{t-1}(x_{k,i})\big).$$

We can then write

$$F'_k(\alpha_{t-1}, e_{k,r}) = -\frac{1}{m_k} \sum_{i=1}^{m_k} y_{k,i} \, Q(k \mid x_{k,i}) \, h_{k,r}(x_{k,i}) \, \Phi'\big({-y_{k,i}} f_{t-1}(x_{k,i})\big) = \frac{Z_{t,k}}{m_k} \big[2 \epsilon_{t,k,r} - 1\big],$$

where $\epsilon_{t,k,r} = \frac{1}{2}\big[1 - \mathbb{E}_{i \sim D_t(k, \cdot)}[y_{k,i} \, Q(k \mid x_{k,i}) \, h_{k,r}(x_{k,i})]\big]$ denotes the weighted error of $Q(k \mid \cdot)\, h_{k,r}$. For any $s \in [m_k]$, since $x_{k,s}$ is a sample drawn from $D_k$, we have $Q(k \mid x_{k,s}) > 0$ and therefore $Q_{t,k} = \sum_{s=1}^{m_k} D_t(k, s) \, Q(k \mid x_{k,s}) > 0$. Thus, we can write

$$\mathbb{E}_{i \sim D_t(k, \cdot)}\big[y_{k,i} \, Q(k \mid x_{k,i}) \, h_{k,r}(x_{k,i})\big] = \sum_{i=1}^{m_k} D_t(k, i) \, Q(k \mid x_{k,i}) \, y_{k,i} \, h_{k,r}(x_{k,i}) = Q_{t,k} \, \mathbb{E}_{i \sim D'_t(k, \cdot)}\big[y_{k,i} \, h_{k,r}(x_{k,i})\big],$$

where $D'_t(k, i) = \frac{D_t(k, i) \, Q(k \mid x_{k,i})}{Q_{t,k}}$. By our weak-learning assumption, there exists $r \in [N_k]$ such that $\mathbb{E}_{i \sim D'_t(k, \cdot)}[y_{k,i} \, h_{k,r}(x_{k,i})] \geq 2\gamma > 0$. For that choice of $r$, we have $\epsilon_{t,k,r} \leq \frac{1}{2} - \gamma'$, with $\gamma' = \gamma \, Q_{t,k} > 0$. In view of that, it suffices for us to search along the directions $e_{k,r}$ and we do not need to consider the directional derivative of $F_k$ along directions $e_{q,r}$ with $q \neq k$. The direction chosen by our coordinate descent algorithm is thus defined by

$$\operatorname*{argmax}_{k \in K_t,\, r \in [N_k]} \frac{Z_{t,k}}{m_k} \big[1 - 2 \epsilon_{t,k,r}\big].$$

Given the direction $e_{k,r}$, the optimal step value is $\eta = \operatorname{argmin}_{\eta > 0} F(\alpha_{t-1} + \eta \, e_{k,r})$. The pseudocode of our algorithm, MULTIBOOST, is provided in Figure 1. In the most general case, $\eta$ can be found via a line search or other numerical methods.

Step size. In some special cases, the line search can be executed using a simpler expression obtained from an upper bound on $F(\alpha_{t-1} + \eta \, e_{k,r})$. Using the convexity of $\Phi$ and the identity $y_{l,i} \, Q(k \mid x_{l,i}) \, h_{k,r}(x_{l,i}) = \frac{1 + y_{l,i} Q(k \mid x_{l,i}) h_{k,r}(x_{l,i})}{2} \, (+1) + \frac{1 - y_{l,i} Q(k \mid x_{l,i}) h_{k,r}(x_{l,i})}{2} \, (-1)$, the following holds for all $\eta \in \mathbb{R}$:

$$\Phi\big({-y_{l,i}} f_{t-1}(x_{l,i}) - \eta \, y_{l,i} Q(k \mid x_{l,i}) h_{k,r}(x_{l,i})\big) \leq \frac{1 + y_{l,i} Q(k \mid x_{l,i}) h_{k,r}(x_{l,i})}{2} \, \Phi\big({-y_{l,i}} f_{t-1}(x_{l,i}) - \eta\big) + \frac{1 - y_{l,i} Q(k \mid x_{l,i}) h_{k,r}(x_{l,i})}{2} \, \Phi\big({-y_{l,i}} f_{t-1}(x_{l,i}) + \eta\big).$$

In the case of the exponential and logistic functions, the following upper bounds can then be derived.

Lemma 1.
For any $k \in [p]$, the following upper bound holds when $\Phi$ is the exponential or the logistic function:

$$F(\alpha_{t-1} + \eta \, e_{k,r}) \leq \max_{l \in [p]} \frac{Z_{t,l}}{m_l} \big[(1 - \epsilon_{t,l,k,r}) \, e^{-\eta} + \epsilon_{t,l,k,r} \, e^{\eta}\big],$$

where $\epsilon_{t,l,k,r} = \frac{1}{2}\big[1 - \mathbb{E}_{i \sim D_t(l, \cdot)}[y_{l,i} \, Q(k \mid x_{l,i}) \, h_{k,r}(x_{l,i})]\big]$.

For each $l$, the function $\eta \mapsto (1 - \epsilon_{t,l,k,r}) e^{-\eta} + \epsilon_{t,l,k,r} e^{\eta}$ reaches its minimum at $\eta = \frac{1}{2} \log \frac{1 - \epsilon_{t,l,k,r}}{\epsilon_{t,l,k,r}}$. When the maximum is achieved for $l = k$, the solution coincides with the familiar expression of the step size $\eta_t = \frac{1}{2} \log \frac{1 - \epsilon_{t,k,r}}{\epsilon_{t,k,r}}$ used in ADABOOST.

```
MULTIBOOST(S1, . . . , Sp)
 1  α_0 ← 0
 2  for t ← 1 to T do
 3      f_{t−1} ← Σ_{l=1}^p Σ_{j=1}^{N_l} α_{t−1,l,j} Q(l|·) h_{l,j}
 4      Φ_k ← (1/m_k) Σ_{i=1}^{m_k} Φ(−y_{k,i} f_{t−1}(x_{k,i})),  k ∈ [p]
 5      K_t ← {k : k ∈ argmax_{k∈[p]} Φ_k}
 6      for k ∈ K_t do
 7          Z_{t,k} ← Σ_{i=1}^{m_k} Φ′(−y_{k,i} f_{t−1}(x_{k,i}))
 8          for i ← 1 to m_k do
 9              D_t(k, i) ← Φ′(−y_{k,i} f_{t−1}(x_{k,i})) / Z_{t,k}
10      (k, r) ← argmax_{k∈K_t, r∈[N_k]} (Z_{t,k}/m_k) [1 − 2ε_{t,k,r}]
11      η_t ← argmin_{η>0} F(α_{t−1} + η e_{k,r})
12      α_t ← α_{t−1} + η_t e_{k,r}
13  f ← Σ_{l=1}^p Σ_{j=1}^{N_l} α_{T,l,j} Q(l|·) h_{l,j}
14  return f
```

Figure 1: Pseudocode of the MULTIBOOST algorithm. $\epsilon_{t,k,r} = \frac{1}{2}\big[1 - \mathbb{E}_{i \sim D_t(k, \cdot)}[y_{k,i} \, Q(k \mid x_{k,i}) \, h_{k,r}(x_{k,i})]\big]$ denotes the weighted error of $Q(k \mid \cdot)\, h_{k,r}$.

Q-function. The conditional probability functions $Q(k \mid \cdot)$ are crucial to the definition of our algorithm. As pointed out earlier, Q-ensembles can help achieve accurate solutions in some cases that cannot be realized using convex combinations. Furthermore, for any $k \in [p]$, since $D_t(k, \cdot)$ is a distribution over the sample $S_k$, it is natural to assume that for any $j \neq k$ we have $\mathbb{E}_{s \sim D_t(k, \cdot)}[Q(k \mid x_{k,s})] \geq \mathbb{E}_{s \sim D_t(k, \cdot)}[Q(j \mid x_{k,s})]$. Since $\sum_{j=1}^{p} Q(j \mid x) = 1$ for any $x \in \mathcal{X}$, this implies the lower bound $\mathbb{E}_{s \sim D_t(k, \cdot)}[Q(k \mid x_{k,s})] \geq \frac{1}{p}$, which in turn implies $\gamma' \geq \frac{\gamma}{p}$. In the special case where all domains coincide, we have $Q(k \mid x_{k,s}) = \frac{1}{p}$ for all $s$ and this lower bound is reached. At the other extreme, when all domains admit distinct supports, we have $Q(k \mid x_{k,s}) = 1$ for all $s \in [m_k]$ and thus $\gamma' = \gamma$. In practice, we do not have access to the true conditional probabilities $Q(k \mid \cdot)$.
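For illustration, here is a minimal single-file sketch of the loop of Figure 1 for the exponential loss, with the exact line search of line 11 replaced by a coarse grid search. The samples, base classifiers, and Q estimates are illustrative placeholders, and ties in the argmax are broken by first index rather than over the full active set $K_t$.

```python
import math

# Minimal sketch of the MULTIBOOST coordinate-descent loop of Figure 1
# with Phi(u) = exp(u); not the full algorithm (grid search instead of an
# exact argmin over eta, single most-violating domain instead of K_t).

def multiboost(samples, hypotheses, Q, T=20, etas=None):
    """samples[k]: list of (x, y) pairs for domain k; hypotheses[k]: base
    classifiers for domain k with values in [-1, +1]; Q(k, x): estimated
    domain posterior. Returns the final Q-ensemble predictor."""
    etas = etas or [i / 20 for i in range(1, 41)]          # line-search grid
    alpha = {}                                             # (k, r) -> weight

    def f(x, a):
        return sum(w * Q(k, x) * hypotheses[k][r](x) for (k, r), w in a.items())

    def F(a):  # agnostic objective: worst per-domain surrogate loss
        return max(sum(math.exp(-y * f(x, a)) for x, y in S) / len(S)
                   for S in samples)

    for _ in range(T):
        # Most-violating domain (line 5; single argmax in this sketch).
        k = max(range(len(samples)),
                key=lambda i: sum(math.exp(-y * f(x, alpha))
                                  for x, y in samples[i]) / len(samples[i]))
        # Boosting distribution D_t(k, .) over S_k (lines 7-9).
        w = [math.exp(-y * f(x, alpha)) for x, y in samples[k]]
        Z = sum(w)
        D = [wi / Z for wi in w]
        # Direction: base classifier with smallest weighted error (line 10).
        def err(r):
            return 0.5 * (1 - sum(Di * y * Q(k, x) * hypotheses[k][r](x)
                                  for Di, (x, y) in zip(D, samples[k])))
        r = min(range(len(hypotheses[k])), key=err)
        # Step: grid search approximating argmin_eta F (line 11).
        step = min(etas,
                   key=lambda e: F({**alpha, (k, r): alpha.get((k, r), 0.0) + e}))
        alpha[(k, r)] = alpha.get((k, r), 0.0) + step      # line 12
    return lambda x: f(x, alpha)
```

For two domains with disjoint supports and a perfect stump per domain, a few rounds suffice to fit both training sets.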
Instead, we can derive accurate estimates $\widehat{Q}(k \mid \cdot)$ using large unlabeled samples from the source domains, the label used for training being simply the domain index. This can be done using algorithms such as conditional maximum entropy models (Berger et al., 1996) (or multinomial logistic regression), which benefit from strong theoretical guarantees (Mohri et al., 2018, Chapter 13), or other rich models based on neural networks.

Other variants of MULTIBOOST. In Appendix D, we present and discuss other variants of our algorithm, including their convergence guarantees.

## 4 Theoretical analysis

In this section, we present a theoretical analysis of our algorithm, including margin-based learning bounds for multiple-source Q-ensembles. For any $k \in [p]$, define the family $G_k$ as follows: $G_k = \{Q(k \mid \cdot)\, h \colon h \in H_k\}$. Then, the family of ensemble functions $F$ that we consider can be defined as $F = \operatorname{conv}\big(\bigcup_{k=1}^{p} G_k\big)$. For any $\lambda \in \Delta$, let $\widehat{D}_\lambda$ be the distribution defined by $\widehat{D}_\lambda = \sum_{k=1}^{p} \lambda_k \widehat{D}_k$, where $\widehat{D}_k$ is the empirical distribution associated to an i.i.d. sample $S_k$ drawn from $D_k$. $\widehat{D}_\lambda$ is distinct from the empirical distribution associated to a sample drawn from $D_\lambda$. We seek to derive margin-based bounds for ensemble functions $f \in F$ with respect to target mixture distributions $D_\lambda$, while the empirical data is from the multiple source distributions $D_k$, or from $\widehat{D}_\lambda$. Thus, these guarantees differ from standard learning bounds, where the source and target distributions coincide. For any distribution $D$ over $\mathcal{X} \times \mathcal{Y}$ and $\rho > 0$, let $\mathcal{L}_\rho(D, f)$ denote the expected $\rho$-margin loss of $f$ with respect to $D$: $\mathcal{L}_\rho(D, f) = \mathbb{E}_{(x, y) \sim D}[\mathbb{1}_{y f(x) \leq \rho}]$. We denote by $\mathfrak{R}_m(F)$ the Rademacher complexity of $F$ for samples $S = (x_1, \ldots, x_m)$ of size $m$, defined by $\mathfrak{R}_m(F) = \mathbb{E}_{S \sim D^m, \sigma}\big[\sup_{f \in F} \frac{1}{m} \sum_{i=1}^{m} \sigma_i f(x_i)\big]$, with the $\sigma_i$s independent random variables taking values in $\{-1, +1\}$. The following gives a margin bound for our multiple-source learning with Q-ensembles.

Theorem 1. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the draw of a sample S = (S1, . . .
, Sp) drawn according to $D_1^{m_1} \otimes \cdots \otimes D_p^{m_p}$, the following inequality holds for all ensemble functions $f = \sum_{t=1}^{T} \alpha_t \, Q(k_t \mid \cdot)\, h_t \in F$ and all $\lambda \in \Delta$:

$$\mathcal{L}(D_\lambda, f) \leq \mathcal{L}_\rho(\widehat{D}_\lambda, f) + \frac{2}{\rho} \sum_{k=1}^{p} \lambda_k \max_{l \in [p]} \mathfrak{R}_{m_k}(H_l) + \sqrt{\sum_{k=1}^{p} \frac{\lambda_k^2}{2 m_k} \log \frac{p}{\delta}}.$$

Note that the complexity term depends only on the Rademacher complexities of the families of base predictors $H_l$ and does not involve the domain classification function $Q$. Theorem 1 improves upon previously known bounds of Mohri et al. (2019). In particular, it provides a dimension-independent margin guarantee for the zero-one loss, while the bound of Mohri et al. (2019), when applied to the zero-one loss, is dimension-dependent (Mohri et al., 2019, Lemma 3). Additionally, our bound holds for Q-ensembles. Our margin bound can be straightforwardly generalized to hold uniformly for all $\rho \in (0, 1)$, at the cost of an additional mild term depending on $\log \log_2 (2/\rho)$ (see Theorem 4 in Appendix E). For the ensemble functions returned by our algorithm, the bound applies to $x \mapsto \frac{f(x)}{\|\alpha\|_1}$. These learning guarantees suggest choosing $\alpha \geq 0$ and $\rho$ to minimize

$$\max_{\lambda \in \Lambda} \sum_{k=1}^{p} \lambda_k \left[\frac{1}{m_k} \sum_{i=1}^{m_k} \mathbb{1}\Big[y_{k,i} \sum_{t=1}^{T} \frac{\alpha_t}{\|\alpha\|_1} \, Q(k_t \mid x_{k,i}) \, h_t(x_{k,i}) \leq \rho\Big] + \frac{r_k}{\rho}\right],$$

where $r_k = 2 \max_{l \in [p]} \mathfrak{R}_{m_k}(H_l)$. Choosing $\frac{1}{\rho} = \|\alpha\|_1$ and upper-bounding the binary loss using a convex function $\Phi$, the problem can then be equivalently cast as the following unconstrained minimization, for example in the case of the exponential function:

$$\min_{\alpha \geq 0} \max_{\lambda \in \Lambda} \sum_{k=1}^{p} \lambda_k \left[\frac{1}{m_k} \sum_{i=1}^{m_k} \exp\Big({-y_{k,i}} \sum_{t=1}^{T} \alpha_t \, Q(k_t \mid x_{k,i}) \, h_t(x_{k,i})\Big) + r_k \|\alpha\|_1\right].$$

Thus, the learning guarantees justify a posteriori an $\ell_1$-regularized version of our algorithm. It is straightforward to derive that extension of our algorithm and its pseudocode. This is similar to the extension to the $\ell_1$-ADABOOST (Rätsch et al., 2001) of the original unregularized ADABOOST (Freund and Schapire, 1997). Let us add that finer learning guarantees, such as those for DEEPBOOST (Cortes et al., 2014), can be derived similarly here, which would lead to a weighted $\ell_1$-regularization, where the weights are the Rademacher complexities of the families $H_k$. For the sake of simplicity, we do not detail that extension.
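For concreteness, the substitution $\rho \|\alpha\|_1 = 1$ turns the margin-loss term into a form that the exponential function upper-bounds directly; as a sketch, using $\mathbb{1}[u \leq 1] \leq e^{1-u}$ (the constant factor $e$ is harmless and absorbed into the objective):

```latex
\frac{1}{m_k}\sum_{i=1}^{m_k}
  \mathbb{1}\Big[y_{k,i}\sum_{t=1}^{T}\tfrac{\alpha_t}{\|\alpha\|_1}\,
  Q(k_t \mid x_{k,i})\,h_t(x_{k,i}) \le \rho\Big]
= \frac{1}{m_k}\sum_{i=1}^{m_k}
  \mathbb{1}\Big[y_{k,i}\sum_{t=1}^{T}\alpha_t\,
  Q(k_t \mid x_{k,i})\,h_t(x_{k,i}) \le 1\Big]
\le \frac{e}{m_k}\sum_{i=1}^{m_k}
  \exp\Big({-y_{k,i}}\sum_{t=1}^{T}\alpha_t\,
  Q(k_t \mid x_{k,i})\,h_t(x_{k,i})\Big).
```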
Let us point out, however, that such an extension would also lead to an algorithm generalizing ADANET (Cortes et al., 2017) to multiple sources. In some applications, prior knowledge can be used to constrain $\Lambda$ to a strict subset of the simplex $\Delta$. Finer margin-based learning guarantees can be derived to cover that scenario. Let $\bar{\Lambda}$ be the set of vertices of a subsimplicial cover of $\Lambda$, that is, a decomposition of a cover of $\Lambda$ into subsimplices, and let $\Lambda_\epsilon$ be the set of $\lambda$s that are $\epsilon$-close to $\Lambda$ in $\ell_1$-distance. Then, the following guarantees hold.

Theorem 2. Fix $\rho > 0$ and $\epsilon > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S = (S_1, \ldots, S_p)$ drawn according to $D_1^{m_1} \otimes \cdots \otimes D_p^{m_p}$, each of the following inequalities holds:

1. for all ensemble functions $f = \sum_{t=1}^{T} \alpha_t \, Q(k_t \mid \cdot)\, h_t \in F$ and all $\lambda \in \Lambda_\epsilon$:
$$\mathcal{L}(D_\lambda, f) \leq \mathcal{L}_\rho(\widehat{D}_\lambda, f) + \frac{2}{\rho} \max_{r \in [p]} \mathfrak{R}_m(H_r, \lambda) + 3\epsilon + \sqrt{\sum_{k=1}^{p} \frac{\lambda_k^2}{2 m_k} \log \frac{|\bar{\Lambda}|}{\delta}};$$

2. for all ensemble functions $f = \sum_{t=1}^{T} \alpha_t \, Q(k_t \mid \cdot)\, h_t \in F$ and all $\lambda = \sum_{l} \mu_l \beta_l \in \operatorname{conv}(\bar{\Lambda})$:
$$\mathcal{L}(D_\lambda, f) \leq \mathcal{L}_\rho(\widehat{D}_\lambda, f) + \frac{2}{\rho} \sum_{l} \mu_l \max_{r \in [p]} \mathfrak{R}_m(H_r, \beta_l) + \sum_{l} \mu_l \sqrt{\sum_{k=1}^{p} \frac{\beta_{l,k}^2}{2 m_k} \log \frac{|\bar{\Lambda}|}{\delta}}.$$

The proof of the theorem and further discussion are presented in Appendix F. Here, $\mathfrak{R}_m(H_r, \lambda)$ is the $\lambda$-weighted Rademacher complexity of $H_r$ (see Appendix F). For $\Lambda$ a strict subset of the simplex, these learning bounds suggest an algorithm with a regularization term of the form $\sum_{k=1}^{p} \lambda_k^2 / m_k$. Our analysis and learning guarantees can be similarly extended to cover the multi-class setting.

## 5 Federated MULTIBOOST algorithm
```
FEDMULTIBOOST(S1, . . . , Sn)
 1  α_0 ← 0
 2  for t ← 1 to T do
 3      f_{t−1} ← Σ_{l=1}^p Σ_{j=1}^{N_l} α_{t−1,l,j} Q(l|·) h_{l,j}
 4      C_{t,k} ← SELECTCLIENT(k, r),  k ∈ [p]
 5      for k ← 1 to p do
 6          for c ∈ C_{t,k} do
 7              Φ_{k,c} ← (1/m_c) Σ_{i=1}^{m_c} Φ(−y_{c,i} f_{t−1}(x_{c,i}))
 8              Z_{t,c} ← Σ_{i=1}^{m_c} Φ′(−y_{c,i} f_{t−1}(x_{c,i}))
 9      K_t ← {k ∈ [p] : k ∈ argmax_{k∈[p]} Σ_{c∈C_{t,k}} Φ_{k,c}}
10      Z_{t,k} ← Σ_{c∈C_{t,k}} Z_{t,c},  k ∈ [p]
11      H_{t,k} ← SELECTBASECLASSIFIER(k, s)
12      for k ∈ K_t do
13          for c ∈ C_{t,k} do
14              for i ← 1 to m_c do
15                  D_t(c, i) ← Φ′(−y_{c,i} f_{t−1}(x_{c,i})) / Z_{t,k}
16              β_{t,c,r} ← [1 − 2ε_{t,c,r}]
17      (k, r) ← argmax_{k∈K_t, r∈H_{t,k}} (Z_{t,k} / Σ_{c∈C_{t,k}} m_c) Σ_{c∈C_{t,k}} β_{t,c,r}
18      η_t ← argmin_{η>0} F(α_{t−1} + η e_{k,r})
19      α_t ← α_{t−1} + η_t e_{k,r}
20  f ← Σ_{l=1}^p Σ_{j=1}^{N_l} α_{T,l,j} Q(l|·) h_{l,j}
21  return f
```

Figure 2: Pseudocode of the FEDMULTIBOOST algorithm. $\epsilon_{t,c,r}$ denotes the weighted error of $Q(k \mid \cdot)\, h_{k,r}$, $\epsilon_{t,c,r} = \frac{1}{2}\big[1 - \mathbb{E}_{i \sim D_t(c, \cdot)}[y_{c,i} \, Q(k \mid x_{c,i}) \, h_{k,r}(x_{c,i})]\big]$, where client $c$ belongs to domain $k$. Cobalt steps are carried out by the server, red ones on the clients.

In this section, we present an extension of our algorithm to the federated learning scenario, called FEDMULTIBOOST. The pseudocode is given in Figure 2. Federated learning is a distributed learning scenario where a global model is trained at the server level, based on data remaining at the clients (Konečný et al., 2016b,a; McMahan et al., 2017; Yang et al., 2019). Clients are typically mobile phones, network sensors, or other IoT devices. While the data remains at the clients, the trained model significantly benefits from user data, as demonstrated, for example, in next-word prediction (Hard et al., 2018; Yang et al., 2018) and healthcare applications (Brisimi et al., 2018). We refer the reader to (Kairouz et al., 2021) for a more detailed survey of federated learning. A widely used algorithm in this scenario is federated averaging, where the server trains the global model via stochastic updates from the clients (McMahan et al., 2017). It has been argued by Mohri et al. (2019), however, that minimizing the expected loss with respect to a particular training distribution, as done by standard federated learning algorithms, may be risk-prone and benefit only a specific subset of clients.
This compromises fairness, which is one of the critical questions in federated learning, where clients are often heterogeneous (Bickel et al., 1975; Hardt et al., 2016; Abay et al., 2020; Li et al., 2020). This issue can be further accentuated when the set of participating clients during inference differs significantly from that used in training. Such cases may occur, for example, when clients lose access to their network connection during training, which results in distinct training and test distributions. To overcome these problems, we adopt the agnostic federated learning approach suggested by Mohri et al. (2019) and Ro et al. (2021). Following the client-partition approach of Ro et al. (2021), let each client belong to one of $p$ domains. The domains can be based, for example, on the type of device or its geographical location. In this setting, the global model is optimized for all convex combinations of the domain distributions, as in the objective of MULTIBOOST (Equation 4), by taking the maximum over the mixture weights $\lambda$. This ensures that the solution obtained is risk-averse and reduces the worst-case mismatch between the inference and training distributions. We propose FEDMULTIBOOST, a communication-efficient modification of MULTIBOOST that can overcome the communication bottleneck in federated learning (Konečný et al., 2016b). Before presenting our algorithm, we first describe the learning scenario and the notation we adopt. Let $n$ be the number of clients, which in the cross-device setting can be on the order of millions. Let $m_c$ denote the number of samples in client $c$ and $(x_{c,i}, y_{c,i})$ denote the $i$th sample of client $c$. The server has access to a set of base-predictor classes $H_k$, $k \in [p]$, and a domain classifier $Q$ that assigns to each sample $x$ the likelihood of its belonging to each of the $p$ domains. The base classifiers can be obtained by training on public data or by training on client data with federated learning.
The domain classifier $Q$ can be trained using unlabeled client data. Additional discussion of the algorithm, as well as experimental results, can be found in Appendix G. At each round, the algorithm randomly selects $r$ clients and $s$ base classifiers from each domain. Next, it chooses the best classifier out of the previously selected subset, based on the data from the subsampled clients. Since FEDMULTIBOOST operates only on a small set of base classifiers at each round, the algorithm is communication-efficient. At round $t$, the server needs to send $Q$, $f_{t-1}$, and $s$ base classifiers to the selected clients. Hence, the server-to-client communication cost at round $t$ is $O((t + s)K + K')$, where $K$ is the number of parameters in a single base classifier and $K'$ is the number of parameters required to specify $Q$. Furthermore, the clients communicate only the aggregate statistics $\Phi_{k,c}$, $Z_{t,c}$, $\beta_{t,c,r}$ to the server, and hence the client-to-server communication is $O(p \cdot s)$ real numbers, which is independent of the number of parameters of the base classifiers. We note that FEDMULTIBOOST can be extended to the multi-class setting using the techniques presented in Appendix L.

## 6 Experiments

In this section, we present experimental results for the MULTIBOOST algorithm on several multiple-source datasets. Our study is restricted to learning an ensemble of decision stumps $H_{\text{stumps}}$ using the exponential surrogate loss $\Phi(u) = e^{u}$. To estimate the probabilities $Q(k \mid \cdot)$ for $k \in [p]$, we assigned the label $k$ to each sample from domain $k$ and used multinomial logistic regression. This step can be facilitated with unlabeled examples, as only their domain membership information is required. We compare MULTIBOOST with a set of boosting-style algorithms that operate on the same hypothesis class $H_{\text{stumps}}$. These include ADABOOST-$k$ for $k \in [p]$ and ADABOOST-all. The former is the standard ADABOOST algorithm trained only on a single source $S_k$ and the latter is ADABOOST trained on the union of all sources $\bigcup_{k=1}^{p} S_k$.
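As a concrete sketch of this estimation step, here is a from-scratch multinomial logistic regression trained on domain indices, fit by plain batch gradient descent on synthetic 1-D data. All constants and the data are illustrative, not the setup used in the experiments.

```python
import math
import random

# Sketch of estimating Q(k|x): every sample from domain k is labeled k and
# a softmax (multinomial logistic) model is fit by batch gradient descent.

def train_q(xs, ks, p, steps=500, lr=0.5):
    w, b = [0.0] * p, [0.0] * p
    for _ in range(steps):
        gw, gb = [0.0] * p, [0.0] * p
        for x, k in zip(xs, ks):
            z = [w[j] * x + b[j] for j in range(p)]
            m = max(z)                                   # stabilize softmax
            e = [math.exp(zj - m) for zj in z]
            s = sum(e)
            for j in range(p):
                g = e[j] / s - (1.0 if j == k else 0.0)  # cross-entropy grad
                gw[j] += g * x
                gb[j] += g
        for j in range(p):
            w[j] -= lr * gw[j] / len(xs)
            b[j] -= lr * gb[j] / len(xs)

    def Q(k, x):                                         # estimated Q(k|x)
        z = [w[j] * x + b[j] for j in range(p)]
        m = max(z)
        e = [math.exp(zj - m) for zj in z]
        return e[k] / sum(e)
    return Q

# Two synthetic 1-D domains with well-separated supports.
random.seed(0)
xs = [random.gauss(-2, 0.5) for _ in range(100)] + \
     [random.gauss(2, 0.5) for _ in range(100)]
ks = [0] * 100 + [1] * 100
Q = train_q(xs, ks, p=2)
assert Q(0, -2.0) > 0.9 and Q(1, 2.0) > 0.9
```

As the domains become well separated, the estimated posteriors approach the disjoint-support regime discussed in Section 3, where the Q-weights effectively route each point to its own domain's experts.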
It is natural to compare MULTIBOOST against ADABOOST-all, since both of them have access to all sources during training. The difference is that, while ADABOOST-all treats $\bigcup_{k=1}^{p} S_k$ as a single dataset, MULTIBOOST trains base learners separately for each source and weights examples by the domain probabilities $Q(k \mid \cdot)$. We used $T = 100$ boosting steps for all benchmarks. Values of $T$ up to 1,000 were explored, but did not change any results significantly. We also compare our results with a multiple-source adaptation algorithm, DMSA, designed for a scenario where no labeled data is available and where only an accurate predictor per domain and unlabeled data are supplied. DMSA was shown to outperform other algorithms designed for that scenario. To apply DMSA, we use the domain predictors ADABOOST-$k$, $k \in [p]$. We report classification errors on the test data for various mixtures $\lambda_1, \ldots, \lambda_p$ of the source distributions, including: errors for $\lambda$ at the vertices of the simplex (errors on each domain separately); errors on the uniform mixture ($\lambda_k = \frac{1}{p}$ for all $k$); and the agnostic error, which is the maximum error across all sources. The errors and their standard deviations are reported based on 10-fold cross-validation. Each source $S_k$ is independently split into 10 folds $S_k^1, \ldots, S_k^{10}$. For the $i$th cross-validation step, the test set is $\{S_1^i, \ldots, S_p^i\}$, while the rest is used for training. The datasets and preprocessing steps used are described below, with additional dataset details provided in Appendix H. Note that all datasets are public and do not contain any personal identifiers or offensive information. The experiments were performed on Linux and Mac workstations with Quad-Core Intel Core i7 2.9 GHz and Intel Xeon 2.20 GHz processors, respectively. The time taken to perform 10-fold cross-validation varies per dataset: from roughly 1 hour on CPU for all benchmarks on the tabular data to 8 hours on CPU for all benchmarks on the digits data.
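The per-source split just described can be sketched as follows (the sources, fold assignment, and sizes below are illustrative):

```python
# Sketch of the per-source 10-fold protocol: each source S_k is split into
# folds independently, and fold i of every source forms the test set of
# cross-validation step i.

def per_source_folds(sources, n_folds=10):
    folds = [[src[i::n_folds] for i in range(n_folds)] for src in sources]
    for i in range(n_folds):
        test = [f[i] for f in folds]                     # {S_1^i, ..., S_p^i}
        train = [[x for j, fold in enumerate(f) if j != i for x in fold]
                 for f in folds]
        yield train, test

sources = [list(range(100)), list(range(200))]           # two toy sources
for train, test in per_source_folds(sources):
    assert [len(t) for t in test] == [10, 20]
    assert [len(t) for t in train] == [90, 180]
```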
The run-time of a single MULTIBOOST iteration is almost identical to that of ADABOOST-all when a closed-form solution is used for the step size η_t (line 12 in Figure 1) with the exponential surrogate loss function. Alternatively, for some experiments we used a line search with 100 steps. If training time is critical, one can instead use η_t = C/√t, at the cost of a slightly worse convergence guarantee. Convergence plots are illustrated in Appendix I.

Sentiment analysis. The Sentiment dataset (Blitzer et al., 2007) consists of text reviews and rating labels for products sold on Amazon in various categories. We selected p = 3 sources from this dataset: k = 1 corresponds to books, k = 2 to dvd, and k = 3 to electronics. We converted the 5-star ratings to binary labels, assigning y = +1 to a positive review and y = -1 to a negative one. Neutral reviews with 3-star ratings were removed from the dataset. The benchmarks are trained using bigram features generated from the raw text reviews. Since the Sentiment dataset contains only 2,000 instances per domain, we used random train/test splits with 10 different seeds instead of the 10-fold cross-validation. To illustrate the importance of the conditional domain probabilities Q(k|x), we plot, for each domain, its values along the projections of x onto the first principal component of the joint Sentiment dataset in Figure 3.

Figure 3: Mean values of Q(k|·) for the three domains of the Sentiment data, projected on the first principal component of the joint data.

The figure shows that our domain classifiers are able to provide a coherent separation between the three domains and narrow the applicability of the base classifiers. In Appendix K, we further discuss the importance of obtaining a high-quality Q function and illustrate the performance degradation resulting from a lower-quality one.

Digits recognition.
For this problem, we aggregated 32x32-pixel handwritten digit images from p = 3 sources: MNIST (k = 1), SVHN (k = 2), and MNIST-M (k = 3). We compared algorithms on two binary classification problems: digits 4 vs. 9 and digits 1 vs. 7.

Object recognition. We divided images of clothing items from the Fashion-MNIST dataset (Xiao et al., 2017) into two classes, tops and bottoms, drawn from p = 3 sources: k = 1 consists of t-shirts, pullovers, and trousers; k = 2 consists of dresses, coats, and sandals; k = 3 consists of shirts, bags, sneakers, and ankle boots.

Tabular data. In addition to text and images, we tested MULTIBOOST on tabular data. We used the Adult dataset (Kohavi, 1996), which consists of numerical and categorical features describing an individual's socioeconomic characteristics, with the task of predicting whether a person's income exceeds USD 50,000. We divided this dataset into p = 3 sources based on the individual's educational background: source k = 1 consists of individuals with a university degree (BSc, MS, or PhD), source k = 2 of those with only a high school diploma, and source k = 3 of none of the above.

Discussion. As can be seen from Table 1, MULTIBOOST provides agnostic and uniform errors that are significantly better than the baselines on all datasets. While the ADABOOST-k predictors on the digits recognition and object recognition tasks each show the lowest error on their specific domains D1, D2, D3, they fail to generalize to other target distributions in conv{D1, D2, D3}. Moreover, as suggested by the discussion in Section 2, the standard convex combination of domain-specific predictors (ADABOOST-all) also performs poorly on several target distributions in conv{D1, D2, D3}, such as the uniform and agnostic targets. As the experimental results confirm, MULTIBOOST, by leveraging the domain predictors Q(k|·) and the agnostic loss, is able to produce an ensemble of domain-specific predictors that generalizes to different targets.
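The closing point above, that predictions combine domain-specific predictors weighted by Q(k|·), can be illustrated schematically. This is a sketch of a domain-weighted combination under our own simplifications, not the exact MULTIBOOST predictor, and the numbers are made up.

```python
import numpy as np

def domain_weighted_predict(q_probs, domain_scores):
    """Schematic domain-weighted ensemble: weight each domain-specific
    score f_k(x) by the estimated probability Q(k|x) that x belongs to
    domain k, then take the sign of the weighted sum."""
    # q_probs: (n, p) rows of Q(k|x); domain_scores: (n, p) scores f_k(x).
    return np.sign(np.sum(q_probs * domain_scores, axis=1))

# Two toy points: the first is most likely from domain 1, the second from
# domain 2; each prediction follows the locally most relevant predictor.
Q = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1]])          # Q(k|x)
F = np.array([[+1.0, -1.0, -1.0],
              [-1.0, +1.0, +1.0]])       # per-domain scores f_k(x)
print(domain_weighted_predict(Q, F))     # -> [1. 1.]
```

This is the contrast with a plain convex combination: the weights here vary with x through Q(k|x), so each domain-specific predictor is applied mainly where it is competent.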
Table 1: Test errors for multiple benchmarks.

Sentiment Analysis

| Algorithm | Error-1 | Error-2 | Error-3 | Error-Uniform | Error-Agnostic |
|---|---|---|---|---|---|
| ADABOOST-1 | 0.326 ± 0.019 | 0.300 ± 0.008 | 0.360 ± 0.017 | 0.329 ± 0.017 | 0.360 ± 0.019 |
| ADABOOST-2 | 0.354 ± 0.020 | 0.266 ± 0.009 | 0.336 ± 0.023 | 0.318 ± 0.019 | 0.357 ± 0.019 |
| ADABOOST-3 | 0.402 ± 0.015 | 0.334 ± 0.008 | 0.258 ± 0.015 | 0.331 ± 0.016 | 0.402 ± 0.018 |
| ADABOOST-all | 0.354 ± 0.020 | 0.325 ± 0.011 | 0.313 ± 0.022 | 0.324 ± 0.021 | 0.354 ± 0.016 |
| DMSA | 0.332 ± 0.021 | 0.308 ± 0.017 | 0.314 ± 0.015 | 0.318 ± 0.019 | 0.332 ± 0.021 |
| MULTIBOOST | 0.332 ± 0.027 | 0.288 ± 0.018 | 0.284 ± 0.027 | 0.301 ± 0.027 | 0.332 ± 0.024 |

Digits Recognition (4 vs. 9)

| Algorithm | Error-1 | Error-2 | Error-3 | Error-Uniform | Error-Agnostic |
|---|---|---|---|---|---|
| ADABOOST-1 | 0.044 ± 0.007 | 0.615 ± 0.012 | 0.476 ± 0.022 | 0.379 ± 0.008 | 0.615 ± 0.012 |
| ADABOOST-2 | 0.455 ± 0.014 | 0.299 ± 0.011 | 0.504 ± 0.015 | 0.420 ± 0.011 | 0.504 ± 0.015 |
| ADABOOST-3 | 0.549 ± 0.034 | 0.488 ± 0.015 | 0.300 ± 0.013 | 0.446 ± 0.013 | 0.549 ± 0.034 |
| ADABOOST-all | 0.060 ± 0.009 | 0.374 ± 0.015 | 0.353 ± 0.012 | 0.262 ± 0.009 | 0.374 ± 0.015 |
| DMSA | 0.069 ± 0.005 | 0.351 ± 0.012 | 0.310 ± 0.011 | 0.243 ± 0.009 | 0.351 ± 0.015 |
| MULTIBOOST | 0.096 ± 0.008 | 0.283 ± 0.028 | 0.246 ± 0.014 | 0.209 ± 0.013 | 0.284 ± 0.027 |

Digits Recognition (1 vs. 7)

| Algorithm | Error-1 | Error-2 | Error-3 | Error-Uniform | Error-Agnostic |
|---|---|---|---|---|---|
| ADABOOST-1 | 0.005 ± 0.002 | 0.613 ± 0.007 | 0.519 ± 0.012 | 0.379 ± 0.004 | 0.613 ± 0.007 |
| ADABOOST-2 | 0.431 ± 0.022 | 0.252 ± 0.009 | 0.479 ± 0.012 | 0.387 ± 0.010 | 0.479 ± 0.012 |
| ADABOOST-3 | 0.680 ± 0.031 | 0.490 ± 0.014 | 0.244 ± 0.012 | 0.474 ± 0.013 | 0.680 ± 0.031 |
| ADABOOST-all | 0.014 ± 0.003 | 0.286 ± 0.010 | 0.306 ± 0.012 | 0.202 ± 0.005 | 0.306 ± 0.011 |
| DMSA | 0.012 ± 0.003 | 0.288 ± 0.017 | 0.286 ± 0.015 | 0.195 ± 0.013 | 0.288 ± 0.017 |
| MULTIBOOST | 0.026 ± 0.004 | 0.261 ± 0.013 | 0.257 ± 0.015 | 0.181 ± 0.005 | 0.261 ± 0.011 |

Objects Recognition (Fashion-MNIST)

| Algorithm | Error-1 | Error-2 | Error-3 | Error-Uniform | Error-Agnostic |
|---|---|---|---|---|---|
| ADABOOST-1 | 0.015 ± 0.003 | 0.251 ± 0.026 | 0.602 ± 0.028 | 0.288 ± 0.017 | 0.602 ± 0.028 |
| ADABOOST-2 | 0.435 ± 0.007 | 0.015 ± 0.002 | 0.169 ± 0.012 | 0.173 ± 0.003 | 0.435 ± 0.007 |
| ADABOOST-3 | 0.311 ± 0.018 | 0.097 ± 0.005 | 0.014 ± 0.002 | 0.140 ± 0.006 | 0.311 ± 0.018 |
| ADABOOST-all | 0.036 ± 0.004 | 0.020 ± 0.002 | 0.025 ± 0.003 | 0.027 ± 0.002 | 0.036 ± 0.004 |
| DMSA | 0.033 ± 0.008 | 0.015 ± 0.002 | 0.022 ± 0.003 | 0.023 ± 0.007 | 0.033 ± 0.009 |
| MULTIBOOST | 0.028 ± 0.003 | 0.015 ± 0.003 | 0.022 ± 0.002 | 0.021 ± 0.001 | 0.028 ± 0.003 |

Tabular Data (Adult Data)

| Algorithm | Error-1 | Error-2 | Error-3 | Error-Uniform | Error-Agnostic |
|---|---|---|---|---|---|
| ADABOOST-1 | 0.201 ± 0.012 | 0.249 ± 0.021 | 0.215 ± 0.014 | 0.222 ± 0.012 | 0.249 ± 0.021 |
| ADABOOST-2 | 0.298 ± 0.011 | 0.131 ± 0.009 | 0.142 ± 0.006 | 0.190 ± 0.007 | 0.298 ± 0.011 |
| ADABOOST-3 | 0.225 ± 0.015 | 0.138 ± 0.010 | 0.133 ± 0.011 | 0.165 ± 0.008 | 0.225 ± 0.015 |
| ADABOOST-all | 0.221 ± 0.013 | 0.134 ± 0.007 | 0.131 ± 0.010 | 0.155 ± 0.004 | 0.221 ± 0.013 |
| DMSA | 0.195 ± 0.014 | 0.137 ± 0.005 | 0.131 ± 0.008 | 0.154 ± 0.005 | 0.195 ± 0.014 |
| MULTIBOOST | 0.190 ± 0.014 | 0.132 ± 0.008 | 0.130 ± 0.009 | 0.150 ± 0.005 | 0.190 ± 0.014 |

Appendix I provides additional illustrations of Q(k|·) obtained by projecting each domain on the first principal component of the joint data. Moreover, for tabular data and the digits (4 vs. 9) classification problem, MULTIBOOST benefits from positive knowledge transfer and obtains an error on individual domains that is even smaller than that of the domain-specific predictors. From Figure 6 in Appendix I it can be seen that the α-weight for the MULTIBOOST classifier in the 4 vs. 9 problem is heavily centered on D2, but contributions from D1 and D3 add to the stronger performance. The same figure also illustrates the α-mass distribution for the other classification tasks. The standard convex ensemble ADABOOST-all, however, does not seem to exhibit this positive transfer property. Similar conclusions carry over to other target distributions in conv{D1, D2, D3}; for the sake of simplicity we do not present them here.
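The uniform and agnostic columns of Table 1 follow directly from the per-domain errors; as a worked check, here is the MULTIBOOST row on the Sentiment data:

```python
import numpy as np

# Per-domain MULTIBOOST test errors on Sentiment (from Table 1).
errors = np.array([0.332, 0.288, 0.284])

# Uniform error: error under the uniform mixture, lambda_k = 1/p.
uniform = errors.mean()
# Agnostic error: maximum over all mixtures lambda; since the mixture error
# is linear in lambda, the maximum is attained at a vertex of the simplex,
# i.e. it equals the largest per-domain error.
agnostic = errors.max()

print(round(float(uniform), 3), float(agnostic))  # -> 0.301 0.332
```

Both values match the reported Error-Uniform (0.301) and Error-Agnostic (0.332) entries for MULTIBOOST on Sentiment.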
7 Conclusion

We presented a new boosting algorithm, MULTIBOOST, for a multiple-source scenario, a common setting in domain adaptation problems where the target distribution can be any mixture of the source distributions. We showed that our algorithm benefits from strong theoretical guarantees and exhibits favorable empirical performance. We also highlighted the extension of our work to the federated learning scenario, which is a critical distributed learning setting in modern applications.

Acknowledgments

This work was partly supported by NSF CCF-1535987, NSF IIS-1618662, and a Google Research Award.

References

A. Abay, Y. Zhou, N. Baracaldo, S. Rajamoni, E. Chuba, and H. Ludwig. Mitigating bias in federated learning, 2020.
C. J. Becker, C. M. Christoudias, and P. Fua. Non-linear domain adaptation with boosting. In Proceedings of Advances in Neural Information Processing Systems 26: 27th Annual Conference on NIPS, pages 485–493, 2013.
S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137–144, 2007.
S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
A. L. Berger, S. D. Pietra, and V. J. D. Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 1996.
P. J. Bickel, E. A. Hammel, and J. W. O'Connell. Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175):398–404, 1975. ISSN 0036-8075.
J. Blitzer, M. Dredze, and F. Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Proceedings of ACL 2007, Prague, Czech Republic, 2007.
J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems, pages 129–136, 2008.
K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A.
Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, H. B. McMahan, et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.
L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
L. Breiman. Prediction games and arcing algorithms. Neural Computation, 11:1493–1517, October 1999.
T. S. Brisimi, R. Chen, T. Mela, A. Olshevsky, I. C. Paschalidis, and W. Shi. Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics, 112:59–67, 2018.
S. Caldas, P. Wu, T. Li, J. Konečný, H. B. McMahan, V. Smith, and A. Talwalkar. LEAF: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.
Y. Cheng, G. Cao, X. Wang, and J. Pan. Weighted multi-source TrAdaBoost. Chinese Journal of Electronics, 22(3):505–510, 2013.
C. Cortes and M. Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
C. Cortes, M. Mohri, and U. Syed. Deep boosting. In Proceedings of ICML, pages 1179–1187, 2014.
C. Cortes, M. Mohri, and A. Muñoz Medina. Adaptation algorithm and theory based on generalized discrepancy. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169–178, 2015.
C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang. AdaNet: Adaptive structural learning of artificial neural networks. In Proceedings of ICML, pages 874–883, 2017.
C. Cortes, M. Mohri, A. T. Suresh, and N. Zhang. A discriminative technique for multiple-source adaptation. In Proceedings of ICML, volume 139, pages 2132–2143. PMLR, 2021.
W. Dai, Q. Yang, G.-R. Xue, and Y. Yu. Boosting for transfer learning. In Proceedings of ICML, pages 874–883, 2007.
N. Duffy and D. P. Helmbold. Potential boosters? In Proceedings of NIPS, pages 258–264, 1999.
Y. Freund and R. E. Schapire.
A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
Y. Freund, Y. Mansour, and R. E. Schapire. Generalization bounds for averaged classifiers. The Annals of Statistics, 32:1698–1722, 2004.
Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189. PMLR, 2015.
R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2066–2073. IEEE, 2012.
A. Habrard, J.-P. Peyrache, and M. Sebban. Boosting for unsupervised domain adaptation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 433–448. Springer, 2013.
J. Hamer, M. Mohri, and A. T. Suresh. FedBoost: A communication-efficient algorithm for federated learning. In International Conference on Machine Learning, pages 3973–3983. PMLR, 2020.
J. Hampshire and A. Waibel. The Meta-Pi network: building distributed knowledge representations for robust multisource pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(7):751–769, 1992. doi: 10.1109/34.142911.
A. Hard, K. Rao, R. Mathews, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.
M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Proceedings of NIPS, volume 29, 2016.
J. Hoffman, B. Kulis, T. Darrell, and K. Saenko.
Discovering latent domains for multisource domain adaptation. In ECCV, volume 7573, pages 702–715, 2012.
J. Hoffman, E. Rodner, J. Donahue, T. Darrell, and K. Saenko. Efficient learning of domain-invariant image representations. arXiv preprint arXiv:1301.3224, 2013.
J. Hoffman, M. Mohri, and N. Zhang. Algorithms and theory for multiple-source adaptation. In Proceedings of NeurIPS, pages 8256–8266, 2018.
J. Hoffman, M. Mohri, and N. Zhang. Multiple-source adaptation theory and algorithms. Annals of Mathematics and Artificial Intelligence, 89(3-4):237–270, 2021.
P. Huang, G. Wang, and S. Qin. A novel learning approach to multiple tasks based on boosting methodology. Pattern Recognition Letters, 31(12):1693–1700, 2010.
P. Huang, G. Wang, and S. Qin. Boosting for transfer learning from multiple data sources. Pattern Recognition Letters, 33(5):568–579, 2012.
R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D'Oliveira, H. Eichner, S. E. Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konečný, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, M. Raykova, H. Qi, D. Ramage, R. Raskar, D. Song, W. Song, S. U. Stich, Z. Sun, A. T. Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao. Advances and open problems in federated learning, 2021.
R. Kohavi. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In KDD, volume 96, pages 202–207, 1996.
J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik. Federated optimization: Distributed machine learning for on-device intelligence.
arXiv preprint arXiv:1610.02527, 2016a.
J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016b.
N. Konstantinov and C. Lampert. Robust learning from untrusted sources. In International Conference on Machine Learning, pages 3488–3498, 2019.
V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In Proceedings of NIPS, pages 2501–2509, 2014.
P. Lahoti, A. Beutel, J. Chen, K. Lee, F. Prost, N. Thain, X. Wang, and E. H. Chi. Fairness without demographics through adversarially reweighted learning. arXiv e-prints, 2020.
M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York, 1991.
D. Levy, Y. Carmon, J. C. Duchi, and A. Sidford. Large-scale methods for distributionally robust optimization. Advances in Neural Information Processing Systems, 33, 2020.
T. Li, M. Sanjabi, A. Beirami, and V. Smith. Fair resource allocation in federated learning. In Proceedings of ICLR, 2020.
H. Liao. Speaker adaptation of context dependent deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7947–7951. IEEE, 2013.
M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105. PMLR, 2015.
Z.-Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.
D. J. C. MacKay. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1991.
Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. Advances in Neural Information Processing Systems, 21:1041–1048, 2008.
Y. Mansour, M. Mohri, and A. Rostamizadeh.
Multiple source adaptation and the Rényi divergence. In UAI, pages 367–374, 2009a.
Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT, 2009b.
L. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean. Boosting algorithms as gradient descent. In Proceedings of NIPS, pages 512–518, 1999.
B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of AISTATS, pages 1273–1282, 2017.
M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. Adaptive Computation and Machine Learning. MIT Press, second edition, 2018.
M. Mohri, G. Sivek, and A. T. Suresh. Agnostic federated learning. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of ICML, volume 97 of Proceedings of Machine Learning Research, pages 4615–4625. PMLR, 2019.
S. Motiian, Q. Jones, S. Iranmanesh, and G. Doretto. Few-shot adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 6670–6680, 2017a.
S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto. Unified deep supervised domain adaptation and generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 5715–5725, 2017b.
K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10–18. PMLR, 2013.
H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pages 2208–2216, 2016.
S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.
Z. Pei, Z. Cao, M. Long, and J. Wang. Multi-adversarial domain adaptation. In AAAI, pages 3934–3941, 2018.
G. Rätsch, S. Mika, and M. K. Warmuth. On the convergence of leveraging. In NIPS, pages 487–494, 2001.
J.
Ro, M. Chen, R. Mathews, M. Mohri, and A. T. Suresh. Communication-efficient agnostic federated averaging. arXiv preprint arXiv:2104.02748, 2021.
K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko. Semi-supervised domain adaptation via minimax entropy. In Proceedings of the IEEE International Conference on Computer Vision, pages 8050–8058, 2019.
R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.
R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, pages 322–330, 1997.
Z. Shen, H. Hassani, S. Kale, and A. Karbasi. Federated functional gradient boosting. arXiv preprint arXiv:2103.06972, 2021.
P. Smyth and D. Wolpert. Linearly combining density estimators via stacking. Machine Learning, 36:59–83, July 1999.
A. Taherkhani, G. Cosma, and T. M. McGinnity. AdaBoost-CNN: An adaptive boosting algorithm for convolutional neural networks to classify multi-class imbalanced datasets using transfer learning. Neurocomputing, 404:351–366, 2020.
A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528. IEEE, 2011.
P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, pages 475–494, 2001.
E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068–4076, 2015.
B. Wang, J. Mendez, M. Cai, and E. Eaton. Transfer learning via minimizing the performance gap between domains. In Advances in Neural Information Processing Systems, pages 10645–10655, 2019a.
T. Wang, X. Zhang, L. Yuan, and J. Feng. Few-shot adaptive Faster R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7173–7182, 2019b.
H. Xiao, K. Rasul, and R. Vollgraf.
Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.
R. Xu, Z. Chen, W. Zuo, J. Yan, and L. Lin. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3964–3973, 2018.
Z. Xu and S. Sun. Multi-view transfer learning with AdaBoost. In 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence, pages 399–402. IEEE, 2011.
Z. Xu and S. Sun. Multi-source transfer learning with multi-view AdaBoost. In International Conference on Neural Information Processing, pages 332–339. Springer, 2012.
Z. Xu, W. Li, L. Niu, and D. Xu. Exploiting low-rank structure from latent domains for domain generalization. In European Conference on Computer Vision, pages 628–643. Springer, 2014.
J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In Proceedings of the 15th ACM International Conference on Multimedia, pages 188–197, 2007.
Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.
T. Yang, G. Andrew, H. Eichner, H. Sun, W. Li, N. Kong, D. Ramage, and F. Beaufays. Applied federated learning: Improving Google keyboard query suggestions. arXiv preprint arXiv:1812.02903, 2018.
Y. Yao and G. Doretto. Boosting for transfer learning with multiple sources. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1855–1862. IEEE, 2010.
Z. Yuan, D. Bao, Z. Chen, and M. Liu. Integrated transfer learning algorithm using multi-source TrAdaBoost for unbalanced samples classification. In 2017 International Conference on Computing Intelligence and Information System (CIIS), pages 188–195. IEEE, 2017.
Q. Zhang, H. Li, Y. Zhang, and M. Li. Instance transfer learning with multisource dynamic TrAdaBoost.
The Scientific World Journal, 2014, 2014.
H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, and G. J. Gordon. Adversarial multiple source domain adaptation. In Advances in Neural Information Processing Systems, pages 8559–8570, 2018.