# manyobjective_multisolution_transport__6f5310ee.pdf Published as a conference paper at ICLR 2025 MANY-OBJECTIVE MULTI-SOLUTION TRANSPORT Ziyue Li1 Tian Li2 Virginia Smith3 Jeff Bilmes4 Tianyi Zhou1 1University of Maryland 2University of Chicago 3Carnegie Mellon University 4University of Washington, Seattle {ziyueli, tianyi}@umd.edu, litian@uchicago.edu, smithv@cmu.edu, bilmes@uw.edu Project: https://github.com/tianyi-lab/Mos T Optimizing the performance of many objectives (instantiated by tasks or clients) jointly with a few Pareto stationary solutions (models) is critical in machine learning. However, previous multi-objective optimization methods often focus on a few objectives and cannot scale to many objectives that outnumber the solutions, leading to either subpar performance or ignored objectives. We introduce Manyobjective multi-solution Transport (Mos T) , a framework that finds multiple diverse solutions in the Pareto front of many objectives. Our insight is to seek multiple solutions, each performing as a domain expert and focusing on a specific subset of objectives while collectively covering all of them. Mos T formulates the problem as a bi-level optimization of weighted objectives for each solution, where the weights are defined by an optimal transport between objectives and solutions. Our algorithm ensures convergence to Pareto stationary solutions for complementary subsets of objectives. On a range of applications in federated learning, multi-task learning, and mixture-of-prompt learning for LLMs, Mos T distinctly outperforms strong baselines, delivering high-quality, diverse solutions that profile the entire Pareto frontier, thus ensuring balanced trade-offs across many objectives. 1 INTRODUCTION 14 15 16 17 18 MGDA Linearization Fed MGDA Fed Prox Fed Avg Mos T (ours) Figure 1: Accuracies of different methods outputting 5 solutions serving 30 objectives (clients) in federated learning. Mos T results in a better coverage of all the objectives than the other baselines. The underlying goal of many machine learning problems is to simultaneously optimize multiple objectives. Usually, no single solution (or model) is optimal for all objectives at once. Multi-objective optimization (MOO) aims to find a solution on the Pareto frontier where no objective can be improved without degrading others. One approach is to optimize a linear combination of all objectives, which may collapse to solutions for a small subset of objectives. Another method, multi-gradient descent algorithm (MGDA) (Désidéri, 2012) finds a common descent direction at each iteration to update the model so that no objective degrades. However, the trade-offs provided by MGDA solutions are not fully controllable even with recent advances like reference vectors to guide the search space among all the objectives (Mahapatra & Rajan, 2020), especially when the Pareto frontier is unknown and complicated (e.g., non-smooth or discontinuous). Moreover, MOO approaches typically focus on two or three objectives and hardly scale to many objectives, as shown in Figure 1. As objectives increase, it is less plausible that they will reach an agreement on a single solution. Instead of balancing all of them, it is more appealing to find multiple diverse yet complementary solutions on the Pareto frontier each focusing Corresponding author. Published as a conference paper at ICLR 2025 on a local domain of objectives. This problem of finding m Pareto solutions (or training m models) for n objectives can be understood as a multi-solution extension of MOO or a mixture of experts (Mo E) (Jacobs et al., 1991) for multiple objectives. However, as n increases, existing methods (e.g., those based on reference vectors or a uniform exploration of the Pareto frontier) can be computationally prohibitive and thus practically infeasible. It is also challenging to pre-determine the search regions of a few representative solutions profiling the entire Pareto frontier. The setting of n m is emerging in a variety of machine learning problems that involve many (i.e., big-n) users, domains, or evaluation criteria, each associated with a different training objective, but the total available data or computation can only support the training of m n models. In this paper, we ask: can we develop a solution-objective matching mechanism to guide the exploration of m solutions on the high-dimensional (n m) Pareto frontier? For example, some objectives share similar structures; so optimizing them with the same model can bring common improvement, while optimizing separate models for objectives with mutual conflicts can effectively avoid poor performance and the tug-of-war among them. Hence, a matching between models and objectives after every model update step is able to explore the correlation among n objectives along the optimization trajectory. Specifically, we capture the matching relations with a weight matrix Γ Rn m + where Γ ,j reweighs the n objectives that model j [m] aims to optimize. For model j, such a reweighted single-model multi-objective optimization problem steers towards a domain expert focusing on a locally consistent subset of objectives. With complementary (i.e., every objective being equally covered) and balanced (i.e., no model dominating on most objectives) Γ s to adjust the descent directions per iteration, we aim to find m diverse yet complementary solutions that lie in the Pareto front. More concretely, we model the optimization of Γ as an optimal transport (OT) between the m models and n objectives, with two marginal constraints Γ1m = α1 and Γ 1n = β in OT, which allow us to control the ratio of n objectives assigned to each model and the ratio of m solutions optimized for each objective. For example, with a uniform α = (1/n)1n, the m n models are enforced by OT weights Γ to focus on different subsets of objectives otherwise, violate the Γ1m = (1/n)1n constraint could leave some objectives uncovered. Similarly, with a uniform β = (1/m)1m, the training loads for the m models tend to be balanced so no model would dominate the others on a majority of the objectives (Figure 1). We propose an efficient algorithm for the above Many-objective multi-solution Transport (Mos T) problem, which is a bi-level optimization between model parameters and objective-solution matching parameterized by Γ. Specifically, the upper-level is a Γ-reweighted MGDA problem for the m models. The lower-level is a classic OT problem optimizing Γ with marginal constraints. We further introduce a regularization term on top of Γ to promote diversity of the optimal transport. Our algorithm converges to stationary points in the non-convex case, and converges to Pareto stationary solutions in the strongly-convex case under additional stability assumptions. We additionally extend Mos T to handle the n m case by augmenting the n objectives with random linear combinations of them ( Mos T-E ). In practice, we introduce a curriculum for better model specialization by gradually varying the marginals α and β in Mos T focusing more on models selecting objectives at the earlier stage and later on transits to objectives selecting the best models . We apply Mos T to various n m machine learning applications, spanning federated learning (Mc Mahan et al., 2017), multi-task learning (Lin et al., 2019), and mixture-of-prompt learning (Qin & Eisner, 2021). Though the focus of this work is to address the challenges associated with scaling MOO to a large number of objectives, we also apply Mos T to the n m scenarios, including fairness-accuracy trade-offs and other classic MOO problems (Appendix A.3). In all applications (Section 6), Mos T finds diverse high-quality solutions on the Pareto front, consistently outperforming various strong baselines in terms of average accuracy and other popular metrics on the quality of multiple solutions, without extra computation cost. 2 RELATED WORK Single-Solution MOO. The classic goal of MOO is to find some solution lying on the Pareto front of multiple objectives (Désidéri, 2012; Roy et al., 2023; Halffmann et al., 2022; Miettinen, 1999). One approach is to solve a linearized aggregation (i.e., weighted average) of all objectives. However, 11m denotes the m-dimensional all-ones vector and Γ1m computes the row-wise sum of entries in Γ. Published as a conference paper at ICLR 2025 linearization, despite covering broad objective weights, may produce solutions in a small Pareto front area (Boyd & Vandenberghe, 2004). The multiple-gradient descent algorithm (MGDA) (Désidéri, 2012) is widely used due to its capability to handle complicated Pareto fronts and compatibility with gradient-based optimization. In this work, we use MGDA-style optimization methods in our algorithm, and compare with linearization-based objectives (with different weights) empirically (Section 6), showing that our approach can find more diverse Pareto stationary solutions. Multi-Solution MOO. One line of work that aims to discover diverse solutions across the entire Pareto front builds upon MGDA and guides the search process via constraints of preference vectors (Lin et al., 2019; Mahapatra & Rajan, 2020) or constraints of other objectives (Zafar et al., 2017). These methods do not generalize well to the setting where there are many objectives (constraints) or the model dimension is large, since the number of preference vectors to explore the whole Pareto front may depend exponentially on these factors in the worst scenario (Emmerich & Deutz, 2018). Even in the setting where there are only a few (e.g., two or three) objectives, diversity of the preference vectors in the action space (during exploration) may not translate to diversity in the solution space. We showcase the superiority of Mos T relative to these works in Section 6. Some works that balance Pareto optimality and solution diversity cannot guarantee the final solutions are on the Pareto front (Liu et al., 2021). For gradient-free methods, evolution strategies or Bayesian optimization (Coello, 2006; Sindhya et al., 2012) has been explored to find multiple (as opposed to one) Pareto stationary solutions. However, they are usually not efficient when solving practical MOO problems in machine learning due to the lack of gradient information (Liu et al., 2021; Momma et al., 2022); hence, we do not compare with those methods. Applications in Machine Learning. We demonstrate Mos T s effectiveness on applications including cross-device federated learning (Mc Mahan et al., 2017). There is extensive prior research on personalized federated learning (e.g., Smith et al., 2017; Ghosh et al., 2020; Wu et al., 2022), i.e., outputting multiple related models instead of one, to serve all clients. Our approach can be viewed as a personalization objective here. Note that our goal is not to achieve the highest average accuracy for federated learning, but rather, explicitly balance multiple objectives and guarantee that all output solutions are Pareto stationary. We also explore multi-task learning (Lin et al., 2019) and mixtureof-prompt learning (Qin & Eisner, 2021) on standard benchmarks where each objective is a task or a training instance. For the n m case, following the setup in prior MOO works (e.g., Zitzler et al., 2000; Mahapatra & Rajan, 2020), we apply Mos T to a toy problem and address (algorithmic) fairness/utility trade-offs (two objectives). 3 MOST: MANY-OBJECTIVE MULTI-SOLUTION TRANSPORT In this section, we introduce Mos T objective and the alternating optimization algorithm to solve it. Let Li( ) (i [n]) denote the empirical loss function of the i-th objective. When the number of objectives n is much larger than the number of solutions m, it is possible that the learnt solutions (for example, simply by running MGDA for m times with different randomness) cannot cover representative regions on the Pareto front. To address this, we use an assignment matrix Γ Rn m + with constraints Γ1m = α and Γ 1n = β on top of the losses to enforce a balanced matching between objectives and solutions. We could additionally add a regularizer R(Γ) to encourage a more diverse assignment. Our many-objective multi-solution transport (Mos T) objective is as follows. Find {θ1:m} such that every θj (j [m]) is the Pareto solution of m weighted objectives minθj (Γ1,j L1(θj), , Γn,j Ln(θj)) , where (1) Γ arg minΓ Ω P j [m] Γi,j Li(θj) + τR(Γ), (2) Ω {Γ Rn m + : Γ1m = α, Γ 1n = β}. (3) α n, β m are two tunable vectors on nand m-dimensional probability simplexes, respectively. We encourage nondegeneracy of solutions by setting these vectors to follow uniform distributions, i.e., α = 1n/n, β = 1m/m. As discussed in Section 1, the constraint set Ωprevents the undesired outcome where all objectives are matched with a subset of solutions or all solutions optimize a subset of objectives. To explicitly encourage the diversity among the columns in Γ, we define a regularization term R( ) as R(Γ) = P i [n] maxj [m] Γi,j. (4) Published as a conference paper at ICLR 2025 Hence, only the maximum entry maxj [m] Γi,j in each row of Γ contributes to R(Γ). Under the marginal constraints on Γ in Eq 3, minimizing R(Γ) leads to maxj [m] Γi,j = 1/n and zeros for the rest entries in each row-i, resulting in zero cosine similarity between different columns of Γ. Moreover, if n m and n (mod) m = 0, it has exactly n/m nonzero entries (with value 1/n) per column, securing an equal and disjoint partition of the n objectives to the m solutions, which indicates the diversity of objectives used to train the m solutions. Eq. 2 has a weight τ to balance R(Γ) and the transport cost. In the unregularized case when τ = 0 (Eq. 2), the resulting Γ would be a sparse matrix (Proposition 1) (Liu et al., 2022a; Brualdi, 2006). Although enforcing marginal constraints without R(Γ) already leads to improvements over the baselines (Figure 9), empirically, we also showcase that leveraging this extra regularizer further benefits the diverse trade-offs among all objectives (Appendix A.5.3). 3.1 ALGORITHMS FOR MOST At a high level, the bi-level optimization problem described above can be decoupled into two subproblems (over Γ and θ1:m) when fixing one variable and optimizing the other. At each outer iteration, we first solve 2 exactly by running an off-the-shelf OT solver (e.g., IPOT (Xie et al., 2020)). Then we optimize 1 by running a reweighted version of MGDA with a min-norm solver (Désidéri, 2012). The exact algorithm is summarized in Algorithm 1. Algorithm 1 Many-Objective Multi-Solution Transport 1: Input: objectives {Li( )}n i=1, α n, β m, η, K 2: Initialize: m solutions θ1:m 3: for t {1, , T} do 4: Γt solution of Eq. 5 by an optimal transport (OT) solver given θt 1:m; 5: for j {1, , m} do 6: for k {1, , K} do 7: dj Eq. 8, where λ is achieved by a min-norm solver for Eq. 7 given Γt and θj. 8: θj θj + ηdj ; 9: end for 10: dt j dj; θt j θj 11: end for 12: end for 13: Return θT 1:m From Algorithm 1, in each iteration, we first optimize Γt with θt 1:m fixed, i.e., finding the optimal transport (or matching) between the n objectives and the m models by solving the following optimal transport problem with existing algorithms (Xie et al., 2020): minΓ Ω P i [n] Γi,j Li(θj) + τR(Γ). (5) Given the maximum entry per row in Γ, Eq. 5 reduces to a new optimal transport transport problem with an augmented loss Li( ) + τ R(Γ), solvable by an off-the-shelf optimal transport solver. Fixing the optimal Γ, we then optimize a reweighted version of MGDA across objectives (Γ1,j L1(θj), , Γn,j Ln(θj)) for each solution θj, j [m]. To find Pareto stationary solutions, similar as MGDA, we aim to find the common-descent directions d1:m for θ1:m to ensure that all objective values do not increase at each iteration. This reduces to solving m MGDA-type MOO problems (more details in Appendix A.1) in parallel, aiming to find dj for every solution j [m] mindj maxi [n] d j Γi,j θj Li(θj) + 1 2 dj 2 2, (6) which rescales each objective s gradient θj Li(θj) by Γi,j. For simplicity, we will use Li(θj) to denote θj Li(θj) in the remaining of the paper. The dual of Eq. 6 is a min-norm problem over variable λ n as follows: i [n] λiΓi,j Li(θj) 2 , j [m], (7) which can be solved by existing Frank-Wolfe algorithms (Fujishige, 1980). Given the optimal dual solution λ from Eq. 7, the primal solution of dj (j [m]) to Eq. 6 can be derived by the following convex combination of the Γ-weighted gradients: dj = P i [n] λ i Γi,j Li(θj). (8) Published as a conference paper at ICLR 2025 To understand the benefits of Eq. 7, let us consider the vanilla MGDA method: i [n] λi Li(θ) 2 , in which λ may be biased towards the objective with a small gradient norm (i.e., a well-optimized objective). In Mos T, however, OT in Eq. 5 tends to result in a large Γi,j for a small Li(θj), thus moving small gradient away from the origin in Eq. 7 (i.e., preventing a well-optimized objective from dominating dj). The MGDA direction dj guarantees that every objective with non-zero Γi,j will be improved or remain the same after updating θj. After obtaining dj, we update the model parameters θj by moving along this direction (Line 8 in Algorithm 1). Optionally, we can also run such gradient descent steps for K steps in practice under the same Γ. In our convergence analysis (Section 4), we allow for K > 1 and assume a full batch setting with Li(θj) evaluated on all the local data of problem i, for all j s. Empirically, we report our experiment results based on mini-batch gradients in Section 6. 3.2 EXTENSION TO FEW-OBJECTIVE (n < m) CASES The Mos T formulation discussed before is mainly motivated by the challenges of having many objectives in MOO. When n m, the diversity of the m models can be achieved by enforcing the two marginal constraints in the optimal transport problem. However, the diversity cannot be fully guaranteed when n m. For example, when n = 2, by even applying uniform distributions for α and β (i.e., the strongest constraints for diversity), a trivial but feasible solution of Γ = [1m/2, 0m/2; 0m/2, 1m/2] can collapse the m models to duplicates of only two different models, i.e., one minimizing the first objective while the other minimizing the second. To address this problem, we create (n n) m dense interpolations of the n objectives (n m) by sampling (n n) groups of convex combination weights wn+1:n on the simplex, i.e., wi n drawn from a Dirichlet distribution. Then each auxiliary objective Li( ) can be defined as l [n] wi,l Ll(θ), i = n + 1, , n . (9) Thereby, we increase the number of objectives to n m and Mos T can be applied to achieving diverse models for optimizing the n interpolations of the original n objectives. This strategy can be explained as maximizing the coverage of the m models over the dense samples of the Pareto front regions using n random reference vectors. We report the results of this technique with applications on a toy problem and fairness/utility trade-offs in Appendix A.3. 3.3 A PRACTICAL SOLUTION-SPECIALIZATION CURRICULUM In scenarios with diverse objectives, each corresponding to a distinct domain or unique dataset, a practical demand arises: optimizing multiple models and turning them into a mixture of specialized experts (i.e., models). This allows each input sample to select the best expert(s) for inference. Such objective selecting expert/models or objective choice routing strategy corresponds to removing Γ1n = β from the constraint set Ωin Eq. 3. But it may lead to training imbalance among the m models, e.g., one model is chosen by most objectives while other models get nearly zero optimization. As training proceeds, the winning model(s) trained by more objectives tend to be chosen even more frequently; hence joint optimization of m models can collapse to training one single model. To address this challenge, we propose a curriculum of varying the marginal constraints that progressively changes α and β for different training stages. Initially, we mainly focus on enforcing a uniform marginal distribution β so every model receives sufficient training from multiple objectives. By relaxing the other marginal constraint α over n objectives to be slightly non-uniform, the m models have more freedom to choose the objectives they perform the best on (i.e., model selecting objective or model choice routing ) and we allow for slight imbalance among objectives.2 In later stages, the curriculum instead focuses more on enforcing a uniform α so every objective is covered by sufficient models. The marginal constraint β can be relaxed in this stage since models approach convergence. Empirical results in Appendix A.5.2 demonstrate the effectiveness of our curriculum strategy. 4 PROPERTIES OF MOST In this section, we discuss how Mos T encourages diverse solutions and provide convergence guarantees of Algorithm 1. Here we define solution diversity via the diversity of Γ, as stated below. 2Less-selected objectives can be more difficult and it is preferable to learn them later when models improve. Published as a conference paper at ICLR 2025 Definition 1 (Diverse Solutions). We informally say that a set of solutions {θi}i [n] are more diverse if Pm j=1 Pm z=1,z =j cos(Γ ,j, Γ ,z) is small with some feasible Γ, where cos(x,y) = x y ||x|| ||y||. This definition captures the goal of diversifying solution specialization, ensuring at least one suitable model j (corresponding to arg maxj [m] Γi,j) for every objective i, as losses are reweighted by Γ. We note that fixing the losses {Li(θj)}i [n],j [m], when solving for Γ without any regularization (i.e., τ=0 in Eq. 5), the objective inherently results in sparse solutions, as stated in the proposition below. Proposition 1 (Sparsity of Γ Brualdi (2006)). Any Γ that solves the optimal transport problem Eq. 5 with τ=0 has at most n + m 1 non-zero entries. Intuitively, sparse Γ s satisfying the marginal constraint Ωwould prevent the scenarios where only a subset of objectives are well-optimized. In Appendix A.5.3, we empirically verify that even without R(Γ), the sparse transport gives diverse solutions that balance all objectives. Also, setting R(Γ) to be the negative of our diversity measure (Definition 1), further encourage diversity as it is obvious to show that P i [n] maxj [m] Γ i,j(τ) P i [n] maxj [m] Γ i,j(0), where Γ (τ) is defined as the optimal solution of Eq. 5 with a regularization constant τ. We investigate training dynamics and diverse assignments between objectives and models in Section 5. 4.1 CONVERGENCE In this part, we analyze the convergence of Mos T in Algorithm 1 for both strongly-convex and non-convex functions. The alternate minimization scheme poses additional challenges to our analysis compared with prior convergence results in MOO (Fliege et al., 2019). Theorem 1 (Convex and Non-Convex). Assume each objective Li(θ) is ν-smooth. Given marginal distribution constraints α n and β m, under a learning rate η = 1 2ν , after running Algorithm 1 for T outer iterations with full batch multi-gradient descent, we have that j [m] βj dt j 2 O ν We defer the full proofs to Appendix A.2.2. The main step involves leveraging the property of the common descent direction dj obtained from Eq. 8, i.e., for any i [n] and j [m], we have Γt i,j Li(θt j) dt j 1 0 20 40 60 80 100 120 140 Training iterations average test loss oracle 0 50 100 150 200 250 300 350 400 65.0 Percentage (%) Percentage of zero-valued entries (%) KL divergence KL divergence on between two consecutive iterations Figure 2: (a) Left y-axis: Percent of zero-valued entries within Γ. Right y-axis: symmetric KL divergence between Γ in successive iterations. Γ quickly converges to sparsity. (b) Test loss averaged over all solutions vs. test loss of the bestperforming solution for each objective (oracle). As training proceeds, average loss rises while oracle loss continues to decrease, indicating a trend of solution specialization and diversification. Our convergence rate is the same as that of normal gradient descent for non-convex and smooth problems under a fixed learning rate. Note that our result uses full gradients of each objective for the min-norm problem (Eq. 6). In practice, we use K > 1 to run multiple iterations to update the model parameters for each objective locally, and use stochastic mini-batch gradients in Eq. 6. Strongly-convex cases. When the losses Li(θ) (i [n]) is smooth and additionally strongly convex with respect to θ Rd, we show that our proposed algorithm (using full gradients) can converge to Pareto stationary points with respect to a subset of matching objectives under a more restricted assumption on the stability of objective-model matching. Please see Theorem 2 in the appendix for complete statement and a detailed proof. 5 ASSIGNMENT DYNAMICS DURING TRAINING In this section, we empirically examine the optimal assignment matrix Γ during training, highlighting its advantages on objective-solution matching and its impact on solution specialization. Published as a conference paper at ICLR 2025 Pair-wise solution diversity (brighter indicates greater diverse) Figure 3: (a) Training loss and test accuracy curves of each method. Mos T demonstrates faster convergence with higher accuracy. (b) Diversity of solutions during training: each block on a column visualizes the KL divergence of a pair of solutions (brighter indicates a larger value). Mos T produces more diverse solutions. (c) Fairness: Accuracy of the worst 20%, 40%, 60%, and 80% client groups. Diversity leads to better tail performance among all the objectives. Fast convergence of Γ. We observe rapid convergence of Γ by tracking its evolution with Kullback Leibler (KL) divergence between successive iterations (Figure 2(a)), detailed in Appendix A.4. Sparsity of Γ. As training proceeds, Γ becomes more sparse, with nearly 75.00% zero entries (Figure 2(a)). This implies that Γ prompts each solution to focus on only a subset of objectives. Sparsity promotes specialization. Figure 2(b) shows the mean test loss averaged across all solutions evaluated at all the objectives, and the loss when selecting the best-performing solution for each objective (denoted as oracle). Initially, the mean loss increases and then decreases, while the oracle loss consistently decreases. These trends indicate that using all the solutions to serve all the objectives is suboptimal, while objective-specific solutions can focus on subsets of objectives, ultimately contributing to a more complementary and effective overall solution set. 6 MOST APPLICATIONS In this section, we apply Mos T to various ML applications where n m. Though we do not specifically focus on the (less challenging) settings of m n, Mos T can naturally generate diverse solutions for those problems as well. We defer the readers to Appendix A.3, where we show superior performance of Mos T relative to existing methods for the m n case. 6.1 EXPERIMENTAL SETUP This section outlines the experimental setup. We describe the specific setup for each application in their respective sections, and provide more details such as hyperparameter values in Appendix A.4. For all the applications, we at least compare with the following baselines to generate m solutions. Linearization-based MOO (abbr. Linearization below) where we optimize a convex combination of all objectives with m randomly-sampled sets of weights. Minimizing a simple average of all the losses (empirical risk minimization over objectives) is a specific instance of using uniform weights. Running MDGA (Désidéri, 2012) independently for m times with different random seeds. Evaluation metrics. We use task-specific evaluation metrics, such as average accuracy (or tail accuracy) across all objectives or hypervolume (Zitzler & Thiele, 1999). We do not evaluate on hypevolume for many-objective scenarios due to its computational complexity and infeasibility in high-dimensional settings (i.e., many objectives). Each run is repeated three times using different seeds. Published as a conference paper at ICLR 2025 Table 1: Federated Learning: Mean accuracy across clients (mean and std across 3 runs) on federated learning datasets. Mos T outperforms the strong baselines. Dataset MGDA Linearization Fed Avg Fed Prox Fed MGDA+ Mos T w/o R(Γ) Mos T Syn (0.0, 0.0) 77.22 0.41 75.91 0.37 75.71 0.51 75.60 0.42 75.26 1.21 83.09 0.87 84.25 0.51 Syn (0.5, 0.5) 87.09 0.29 87.18 0.27 86.26 0.61 86.13 0.39 85.21 1.42 89.07 0.63 89.99 0.52 Syn (1.0, 1.0) 90.52 0.13 89.87 0.51 88.12 0.75 87.58 1.36 87.16 1.09 91.70 0.02 92.21 0.08 FEMNIST 78.86 1.43 72.62 0.65 72.47 0.19 72.45 0.06 80.08 0.12 80.94 0.34 81.16 0.03 6.2 FEDERATED LEARNING One important scenario where n m is the cross-device federated learning application, where we jointly learn m models over a heterogeneous network of n remote devices. The devices generate local data following non-identical distributions; hence we view the finite sum of empirical losses on each device as one objective, i.e., Li(θ) := 1 vi Pvi s=1 ls(θ) where vi is the number of local samples on device i [n], and ls denotes the individual local loss on sample s. Mos T seamlessly integrates with the decentralized setting of federated learning by computing client-specific local updates (θ1:m) and aggregating them to update the global model by Γ. Moreover, since clients are diverse, it is expected that the solution diversity benefits of Mos T will contribute significantly to the final performance. We conduct experiments on synthetic data and Federated Extended MNIST (FEMNIST) (Cohen et al., 2017; Caldas et al., 2018), where the number of objectives n = 30 and n = 206, respectively. We experiment on three synthetic datasets, denoted as Syn (ρ1, ρ2), with different ρ1 and ρ2 controlling heterogeneity of local models and data, as detailed in Appendix A.4. We compare Mos T with baselines described in Section 6.1 and state-of-the-art federated learning algorithms, including Fed Avg (Mc Mahan et al., 2017), Fed Prox (Li et al., 2020), and Fed MGDA+ (Hu et al., 2022). We run each algorithm m times with different random initializations. It is worth noting that during evaluation, for all methods, we let each device pick a best model out of m solutions based on the validation set, and compute its performance. We then report the average and the quantile accuracy across all devices. As the results shown in Table 1, Mos T outperforms the baselines by a large margin on all datasets. Furthermore, we have the following observations. Mos T results in significant convergence improvements. Figure 3(a) compares the training loss and test accuracy curves of different algorithms. Notably, Mos T outperforms baselines with a lower training loss (faster convergence) and higher test accuracy (better generalization to unseen data). Mos T maintains diversity during training. We analyze how the diversity of solutions evolves for different algorithms. Diversity is quantified using the KL divergence between predictions of any pair of solutions generated by each algorithm. Figure 3(b) shows that initially, all algorithms exhibit high diversity due to randomized initialization. However, baselines witness a notable decrease in solution diversity during training. In contrast, Mos T maintains high diversity throughout the training process. Mos T promotes fairness in FL. Mos T s improved diversity is anticipated to benefit clients typically overlooked by other algorithms, thus enhancing fairness. To validate, we calculate the accuracy of the worst 20%, 40%, 60%, and 80% clients for each algorithm. As depicted in Figure 3(c), Mos T outperforms the baselines by a larger margin for clients with worse performance, which demonstrates that the diversity of Mos T effectively promotes fairness in FL. Furthermore, our study reveals that Mos T strategically assigns diverse solutions for inference, preventing the collapse phenomenon in MOO (detailed in Appendix A.5.1). 6.3 MULTI-TASK LEARNING Mos T seamlessly extends its capabilities to multi-task learning by treating each task as an individual objective. Our experiments explore two real-world datasets, Office-Caltech10 (Saenko et al., 2010; Griffin et al., 2007) and Domain Net (Peng et al., 2019) with n = 4 and 6 objectives, respectively. We have m = 3 solutions for Office-Caltech10 and m = 4 for Domain Net. We conduct a thorough comparison with various state-of-the-art multi-task learning approaches: MGDA, Linearization-based MOO, EPO which is based on user preference vectors (Mahapatra & Rajan, 2020), COSMOS (Ruchte & Grabocka, 2021), and TAG which identifies task grouping for MOO (Fifty et al., 2021). Similar to federated learning datasets, we select the best-performing solutions over the validation set for Published as a conference paper at ICLR 2025 inference and report the accuracy averaged across all tasks. The results are shown in Table 2, which demonstrates Mos T s superiority over existing approaches. Table 2: Multi-task Learning: Average accuracy across all tasks (mean and std across 3 runs) on Office-Caltech10 and Domain Net. Dataset MGDA Linearization EPO COSMOS TAG Mos T Office-10 80.74 0.44 61.26 0.67 61.05 1.09 63.83 1.01 49.38 1.10 82.98 0.51 Domain Net 65.81 0.37 57.15 0.17 58.55 0.37 63.78 0.34 31.05 1.24 67.65 0.55 6.4 MIXTURE-OF-PROMPT LEARNING Another application is prompt learning for language models (Qin & Eisner, 2021), where solutions need to generalize well with diverse instances. To address this, Mos T trains m soft prompts to handle each training instance as a distinct objective, with n being the total number of instances. We define the objective function as Li(θ) := li(θ), where li represents the loss on instance i. This allows Mos T to tailor its learning process to each specific sample, adapting to diverse linguistic nuances. We experiment on three datasets from the Super GLUE benchmark (Wang et al., 2019). For all datasets, we sample n = 128 training instances, while generating m = 3 soft prompts. Note that Mos T prioritizes training complementary solutions for multiple objectives, rather than their assignment. Simple ways, like selecting the best-performing solution over the validation set, are promising for scenarios like federated learning and multi-task learning. However, treating each instance as an objective poses challenges in solution selection during inference. To address this, we train a dispatcher for each algorithm to learn correlations between prompts and instances (implementation details in Appendix A.4), selecting the highest correlated one for inference. We report average accuracy across all tasks, in Table 3. Results show Mos T exhibits a significantly better ability to generalize to unseen instances compared to baselines. 6.5 ABLATION STUDIES AND COMPARISON WITH OTHER BASELINES We conduct extensive ablation studies to validate the design of Mos T, examining the necessity of OT (Appendix A.5.1) and MGDA (Appendix A.5.4), along with specific designs like solutionspecialization curriculum (Appendix A.5.2) and diversity encouragement (Appendix A.5.3). These studies offer insights into the effectiveness of components of Mos T: Necessity of OT and MGDA. Comparing OT-generated and randomly generated weights reveals the necessity of OT for achieving balanced objective-solution matching. Additionally, MGDA consistently outperforms its linearization-based alternative, highlighting its effectiveness in parameter updates for weighted multi-objective optimization. Effectiveness of curriculum. Our proposed scheduling of the marginal constraints (Section 3.3) boosts performance by over 2.00% and enhances training stability. Benefits of diversity-encouragement regularization. It enhances performance and fairness. Preventing collapse with OT. Comparison of three strategies introduced in Section 3.3, Mos T, objective selecting model , and model selecting objective , reveals the collapse phenomenon in MOO, where limited solutions dominate all objectives. In contrast, Mos T using OT with a two-way matching approach achieves a more balanced distribution of objectives among models during training. 6.6 RUNTIME COMPARISONS We enhance the computational efficiency of our algorithm employing several simple practices: 1) detecting early convergence of OT, 2) initializing transport plans using prior computations, 3) running MGDA on a subset of model parameters like the final layer s bias term in a neural network, and 4) effectively reducing objectives in MGDA through the inherent sparsity of regularized OT. Table 4 presents the end-to-end runtime (in seconds) of different approaches on federated learning datasets with n = 30 and n = 206. We see that the running time of Mos T is comparable to that of baseline methods. Notably, the time needed for OT across datasets is negligible due to the use of Published as a conference paper at ICLR 2025 Table 3: Mixture-of-Prompt Learning: Test accuracy on three datasets of Super GLUE benchmark (mean and std across 3 runs). Task MGDA Linearization Mos T Bool Q 62.69 0.71 61.30 0.09 67.03 0.49 Multi RC 60.86 0.50 58.79 1.23 63.78 0.15 Wi C 55.28 0.89 57.16 1.27 62.38 0.31 existing packages and the fast convergence of Γ (shown in Section 5), accounting for less than 1% of the total time. See Appendix A.6 for a complete analysis of computation time in other applications. Table 4: Runtime comparison (sec) on federated learning datasets. Dataset MGDA Linearization Fed Avg Fed MGDA+ Mos T Syn (0.0) 219.86 225.90 222.22 516.25 217.59 Syn (0.5) 208.82 208.27 205.90 495.62 201.70 Syn (1.0) 269.44 268.33 270.65 557.58 260.50 FEMNIST 3522.83 3135.76 3147.62 > 5000.00 3368.63 7 CONCLUSION In this paper, we propose many-objective multi-solution transport (Mos T) , aiming to find m Pareto solutions (models) that achieve diverse trade-offs among n optimization objectives. We investigate a challenging case of n m, where existing methods often struggle with exploring a high-dimensional Pareto frontier. We formulate Mos T as a bi-level optimization of multiple weighted objectives, where weights guide the exploration and are determined by an optimal transport (OT) matching objectives and solutions. Our algorithm theoretically converges to m Pareto solutions by alternating between optimizing the weighted objectives and OT. Mos T extends to achieve diverse solutions for n m. We apply Mos T to various machine learning problems, training m models to serve n users, domains, or criteria. Empirically, Mos T outperforms other strong baselines in tasks like federated learning, multi-task learning, mixture-of-prompt, fairness-accuracy trade-offs, and other MOO benchmarks. ACKNOWLEDGMENTS We would like to thank Shengjie Wang, Xin Yang, and Yuanyuan Yang for their insightful discussions in the earlier-stage exploration of this project s initial idea. We appreciate the reviewers and area chairs for their constructive comments and suggestions. Arthur Asuncion and David Newman. Uci machine learning repository, 2007. Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. Richard A Brualdi. Combinatorial matrix classes, volume 13. Cambridge University Press, 2006. Sebastian Caldas, Peter Wu, Tian Li, Jakub Koneˇcn y, H Brendan Mc Mahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings. ar Xiv preprint ar Xiv:1812.01097, 2018. Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. ar Xiv preprint ar Xiv:1905.10044, 2019. CA Coello Coello. Evolutionary multi-objective optimization: a historical view of the field. IEEE computational intelligence magazine, 1(1):28 36, 2006. Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. Emnist: an extension of mnist to handwritten letters. ar Xiv preprint ar Xiv:1702.05373, 2017. Published as a conference paper at ICLR 2025 US Supreme Court. Griggs v. duke power co, 1971. Jean-Antoine Désidéri. Multiple-gradient descent algorithm (mgda) for multiobjective optimization. Comptes Rendus Mathematique, 350(5):313 318, 2012. Michael TM Emmerich and André H Deutz. A tutorial on multiobjective optimization: fundamentals and evolutionary methods. Natural computing, 17:585 609, 2018. Chris Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, and Chelsea Finn. Efficiently identifying task groupings for multi-task learning. Advances in Neural Information Processing Systems, 34: 27503 27516, 2021. Jörg Fliege, A Ismael F Vaz, and Luís Nunes Vicente. Complexity of gradient descent for multiobjective optimization. Optimization Methods and Software, 34(5):949 959, 2019. Satoru Fujishige. Lexicographically optimal base of a polymatroid with respect to a weight vector. Mathematics of Operations Research, 5(2):186 196, 1980. Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. An efficient framework for clustered federated learning. In Advances in Neural Information Processing Systems, 2020. Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007. Pascal Halffmann, Luca E Schäfer, Kerstin Dächert, Kathrin Klamroth, and Stefan Ruzika. Exact algorithms for multiobjective linear optimization problems with integer variables: A state of the art survey. Journal of Multi-Criteria Decision Analysis, 29(5-6):341 363, 2022. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Zeou Hu, Kiarash Shaloudegi, Guojun Zhang, and Yaoliang Yu. Federated learning meets multiobjective optimization. IEEE Transactions on Network Science and Engineering, 9(4):2039 2051, 2022. Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79 87, 1991. Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 252 262, 2018. Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In Proceedings of the European conference on computer vision (ECCV), pp. 201 216, 2018. Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429 450, 2020. Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. Fedbn: Federated learning on non-iid features via local batch normalization. ar Xiv preprint ar Xiv:2102.07623, 2021. Xi Lin, Hui-Ling Zhen, Zhenhua Li, Qing-Fu Zhang, and Sam Kwong. Pareto multi-task learning. Advances in neural information processing systems, 32, 2019. Tianlin Liu, Joan Puigcerver, and Mathieu Blondel. Sparsity-constrained optimal transport. ar Xiv preprint ar Xiv:2209.15466, 2022a. Xingchao Liu, Xin Tong, and Qiang Liu. Profiling pareto front with multi-objective stein variational gradient descent. Advances in Neural Information Processing Systems, 34:14721 14733, 2021. Published as a conference paper at ICLR 2025 Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11976 11986, 2022b. Debabrata Mahapatra and Vaibhav Rajan. Multi-task learning with user preferences: Gradient descent with controlled ascent in pareto optimization. In International Conference on Machine Learning, pp. 6597 6607. PMLR, 2020. Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 2017. Kaisa Miettinen. Nonlinear multiobjective optimization, volume 12. Springer Science & Business Media, 1999. Michinari Momma, Chaosheng Dong, and Jia Liu. A multi-objective/multi-task learning framework induced by pareto stationarity. In International Conference on Machine Learning, pp. 15895 15907. PMLR, 2022. Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1406 1415, 2019. Mohammad Taher Pilehvar and Jose Camacho-Collados. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. ar Xiv preprint ar Xiv:1808.09121, 2018. Guanghui Qin and Jason Eisner. Learning how to ask: Querying lms with mixtures of soft prompts. ar Xiv preprint ar Xiv:2104.06599, 2021. Abhishek Roy, Geelon So, and Yi-An Ma. Optimization on pareto sets: On a theory of multi-objective optimization. ar Xiv preprint ar Xiv:2308.02145, 2023. Michael Ruchte and Josif Grabocka. Scalable pareto front approximation for deep multi-objective learning. In 2021 IEEE international conference on data mining (ICDM), pp. 1306 1311. IEEE, 2021. Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In Computer Vision ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV 11, pp. 213 226. Springer, 2010. Karthik Sindhya, Kaisa Miettinen, and Kalyanmoy Deb. A hybrid framework for evolutionary multi-objective optimization. IEEE Transactions on Evolutionary Computation, 17(4):495 511, 2012. Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, 2017. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019. Shanshan Wu, Tian Li, Zachary Charles, Yu Xiao, Ziyu Liu, Zheng Xu, and Virginia Smith. Motley: Benchmarking heterogeneity and personalization in federated learning. ar Xiv preprint ar Xiv:2206.09262, 2022. Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. A fast proximal point method for computing exact wasserstein distance. In Uncertainty in artificial intelligence, pp. 433 453. PMLR, 2020. Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rogriguez, and Krishna P Gummadi. Fairness constraints: Mechanisms for fair classification. In Artificial intelligence and statistics, pp. 962 970. PMLR, 2017. Published as a conference paper at ICLR 2025 Eckart Zitzler and Lothar Thiele. Multiobjective evolutionary algorithms: a comparative case study and the strength pareto approach. IEEE transactions on Evolutionary Computation, 3(4):257 271, 1999. Eckart Zitzler, Kalyanmoy Deb, and Lothar Thiele. Comparison of multiobjective evolutionary algorithms: Empirical results. Evolutionary computation, 8(2):173 195, 2000. A.1 BACKGROUND ON MGDA IN MULTI-OBJECTIVE OPTIMIZATION We first describe some background on the multi-gradient descent algorithm to solve multi-objective optimization. Let L(θ) Rn be defined as L(θ) := (L1(θ), , Ln(θ)) , θ Rd. (10) The goal for the multi-objective optimization (minimization) problem is to find Pareto optimal solutions with respect to all objectives Li( ), i [n]. One line of method is at each iteration, to find a common descent direction d for all objectives. Given the current model θ, we would like to find a descent step to minimize each objective value. For the single-objective case, the direction is L(θ). For n objectives, one objective is to solve for d: max i [n] Li(θ) d + 1 2 d 2 , (11) and then apply d as θ = θ + ηd. If the optimal objective value of Eq. 11 is negative, then there exists a descent direction d such that all objective values will be decreased. If θ is Pareto stationary, then d = 0 and the optimal objective value is 0. This formulation is equivalent to min b,d b + 1 s.t. Li(θ) d b, i [n]. Formally, we have the following lemma. Lemma 1 (Good Descent Direction Désidéri (2012) ). Let d, b be the solutions of Eq. 12, then 1. If θ is Pareto stationary, then d = 0 and b = 0. 2. If θ is not Pareto stationary, then 2 d 2 < 0, (13) Li(θ) d b, i [n]. (14) Lemma 2 (A Rescaled Version of Lemma 1). Let dj Rd be the solution of Eq. 6 and Γi,j be some non-negative scalar, then 1. If θj is Pareto stationary, then dj = 0. 2. If θj is not Pareto stationary, then Γi,j Li(θj) dj 1 2 dj 2, i [n]. (15) A.2 CONVERGENCE PROOFS A.2.1 STRONGLY-CONVEX CASES Mos T learns a matching between objectives and solutions represented by its non-zero entries. Therefore, we show that our proposed algorithm (using full gradients) can converge to Pareto stationary points with respect to a subset of matching objectives in strongly-convex cases. We first make a common assumption. Published as a conference paper at ICLR 2025 Assumption 1. Each objective Li(θ) (i [n]) is ν-smooth and µ-strongly convex w.r.t. θ Rd. We introduce another assumption on objective-model matching, as follows. Assumption 2. There exists an s (s < ) such that after s iterations, all the non-zero entries in Γs remain non-zero, which are lower bounded by a small constant ϵ. This assumption can be interpreted as that the assignment of solutions for each objective become stable during training. From Figure 2, we empirically observe that after certain iterations, the learnt Γ will have the same non-zero patterns. Theorem 2 (Strongly-Convex). Let Assumption 1 and 2 hold. Given marginal distribution constraints α n and β m, under a fixed learning rate η 1 ν , after running Algorithm 1 for T + s outer iterations with full multi-gradient descent, we have for each solution j [m], θT +s j θ j 2 (1 µηϵ)T K θs j θ j 2 , (16) where θ j is a Pareto stationary solution across objectives with a non-zero Γs ,j. Proof. First, let us assume K = 1. At each iteration t, we have i [n] λiΓt i,j Li(θt j), j [m] (17) for some {λi}i [n] n which is the solution of Eq. 7. First we note that for every j [m], θj will converge. If dt j = 0, then θj has converged to a Pareto stationary point. Otherwise, for every objective Li, we have monotonically decrease for the sequence: Li(θt+1 j ) Li(θt j) 1 2η(1 νη) dt j 2 < 0. Therefore, every solution will converge. We denote θ j as one of the Pareto stationary solutions of Eq. 1 that solution θj converges to. By definition and the properties of MDGA, we know that for every solution, every objective value will be non-increasing throughout optimization. Hence, for every i [n] and j [m], it holds that Li(θt+1 j ) Li(θ j ) 0. (18) For every i [n], the ν-smoothness and µ-convexity of Li lead to Li(θt+1 j ) = Li(θt j + ηdt j) (19) Li(θt j) + η Li(θt j) dt j + ν 2 ηdt j 2 (20) Li(θ j ) + Li(θt j) (θt j θ j ) µ θt j θ j 2 + η Li(θt j) dt j + ν 2 ηdt j 2. (21) By moving Li(θ j ) to the left-hand side and multiplying both sides by λiΓt i,j, we have i [n] λiΓt i,j Li(θt+1 j ) Li(θ j ) (22) i [n] λiΓt i,j Li(θt j) (θt j θ j + ηdt j) X i [n] λiΓt i,j µ θt j θ j 2 + X i [n] λiΓt i,j ν 2 ηdt j 2. As Γt i,j ϵ > 0, then i [n] λiΓt i,j µ 2 θt j θ j 2 µϵ 2 θt j θ j 2. (24) Published as a conference paper at ICLR 2025 Due to the Hölder inequality, we have P i [n] λiΓi,j λ 1 Γ ,j := βj 1. Hence, we have i [n] λiΓt i,j Li(θt+1 j ) Li(θ j ) i [n] λiΓt i,j Li(θt j) (θt j θ j + ηdt j) µϵ θt j θ j 2 + νβj 2 ηdt j 2 (25) = dt j(θt j θ j ) η dt j 2 µϵ θt j θ j 2 + νβj dt(θt j θ j ) η 1 βj θt j θ j 2 (taking η 1 dt(θt j θ j ) η 2 dt j 2 µϵ θt j θ j 2 (using βj 1) 2η (2ηdt j(θt j θ j ) + ηdt j 2) µϵ 2η (2(θt+1 j θt j) (θt j θ j ) + θt+1 j θt j 2) µϵ 2η (2θt+1 j θt j 2 θt j 2 2(θt+1 j θt j) θ j + θt+1 j 2 + θt j 2 2θt+1 j θt j) µϵ 2η ( θt+1 j 2 2(θt+1 j θt j) θ j θt j 2) µϵ 2η ( θt+1 j θ j 2 θt j θ j 2) µϵ 2η ( θt j θ j 2 θt+1 j θ j 2) µϵ θt j θ j 2 . (27) Since during optimization, we guarantee that every objective value will be non-increasing at each iteration, we have Li(θt+1 j ) Li(θ j ) 0. So the left-hand side of Eq. 22 is non-negative. Hence, 1 2η ( θt j θ j 2 θt+1 j θ j 2) µϵ θt j θ j 2 0, (28) θt+1 j θ j 2 (1 µηϵ) θt j θ j 2, (29) which gives us linear convergence. When K > 1, at each outer iteration, fixing Γt i,j, we are running multiple updates on the model parameters. In this case, we still have θt,k+1 j θ j (1 µηϵ) θt,k j θ j 2, (30) where θt,k+1 j denote the model parameters at the t-th outer iteration and k + 1-th inner iteration (Line 6 of Algorithm 1). Hence, θt+1 j θ j (1 µηϵ)K θt j θ j 2 (31) A.2.2 NON-CONVEX AND SMOOTH CASES For simplicity, we first consider the case where K = 1. From Lemma 2, we know that at each iteration t, Γt i,j Li(θt j) dt j 1 dt j 2 , i [n]. (32) Published as a conference paper at ICLR 2025 Assuming ν-smooth of each Li, we have Γt i,j Li(θt+1 j ) Li(θt j) = Γt i,j Li(θt j + ηdt j) Li(θt j) (33) ηΓt i,j Li(θt j) dt j + νΓt i,j 2 ηdt j 2 (34) 2 dt j 2 + νη2 2 Γt i,j dt j 2 (35) 2Γt i,j dt j 2 + νη2 2 Γt i,j dt j 2 (Γt i,j 1) (36) 2 Γt i,j dt j 2. (37) Sum over all models j [m], X j [m] Γt i,j Li(θt+1 j ) Li(θt j) η(1 νη) j [m] Γt i,j dt j 2. (38) The above result is for a single objective Li( ). Now let s consider the weighted sum of all the n objectives between two steps with different Γ s, i.e., Γt and Γt+1. By the optimality of Γt+1, X j [m] Γt+1 i,j Lj(θt+1 j ) τ X i [n] max j [m] Γt+1 i,j X j [m] Γt i,j Lj(θt+1 j ) τ X i [n] max j [m] Γt i,j. We then have X Γt+1 i,j Li(θt+1 j ) Γt i,j Li(θt j) (39) j [m] Γt i,j Li(θt+1 j ) Li(θt j) + τ X i [n] max j [m] Γt+1 i,j τ X i [n] max j [m] Γt i,j (40) i [n] Γt i,j dt j 2 + τ X i [n] max j [m] Γt+1 i,j τ X i [n] max j [m] Γt i,j (apply Eq. 38) (41) j [m] βj dt j 2 + τ X i [n] max j [m] Γt+1 i,j τ X i [n] max j [m] Γt i,j. (42) Here βj is the j-th dimension of β m in Eq. 3. Hence, X j [m] βj dt j 2 2 η(1 νη) Γt i,j Li(θt j) Γt+1 i,j Li(θt+1 j ) (43) + 2 η(1 νη) i [n] max j [m] Γt+1 i,j τ X i [n] max j [m] Γt i,j Applying telescope sum for t [T] on both sides gives X j [m] βj dt j 2 2 η(1 νη) Γ1 i,j Li(θ1 j) ΓT +1 i,j Li(θT +1 j ) (45) + 2 η(1 νη) i [n] max j [m] ΓT +1 i,j τ X i [n] max j [m] Γ1 i,j Then we get the following bound on the average gradient norm: j [m] βj dt j 2 C If we take νη = 1 2, then C = O(ν). This gives us a O ν T rate in terms of gradient norms for non-convex cases under a fixed learning rate. Published as a conference paper at ICLR 2025 For the case where K > 1, we have X Γt+1 i,j Li(θt+1 j ) Γt i,j Li(θt j) (48) dt,k j 2 (49) j [m] βj dt j 2 , (50) where k denotes the index for inner updates on model parameters fixing Γt. Similarly, we have the result 1 T j [m] βj dt j 2 C A.3 EXPERIMENTS WITH n m Evaluation Metrics. For applications with few objectives (small n), we use the hypervolume measure, a widely used metric for evaluating the quality of MOO solutions and is a proxy of diversity (Zitzler & Thiele, 1999). Hypervolume is feasible to compute when the number of objectives is small. Given a solution set S Rn and a set of reference points r = [r1, . . . , rn] Rn, the Hypervolume of S measures the region weakly dominated by S and bounded above by r: H(S) = Λ ({q Rn | p S : p q and q r}) , where Λ( ) denotes the Lebesgue measure. To ensure fair calculation of it, the reference points are kept consistent across all algorithms for each dataset. These reference points are determined either by following the settings of previous studies or by setting them as the upper bounds of the objective values from all algorithms to be compared. Hyperparameters. For the extended version of Mos T (denoted as Mos T-E) described in Section 3.2, we introduce additional hyperparameters α1, . . . , αn, and n to handle the extension of existing objectives. The parameter αi represents the positive shape parameter of the Dirichlet distribution, used to generate diverse objective weights, and n represents the number of extended objectives. A.3.1 TOY PROBLEMS We demonstrate the effectiveness of Mos T on a toy ZDT problem set. It is a popular MOO benchmark containing two objectives (n = 2) with oracle Pareto fronts (Zitzler et al., 2000). Specifically, we use ZDT-1, ZDT-2, and ZDT3, which are problems with 30 variables and exhibit convex, concave, and disconnected Pareto-optimal fronts, respectively. We compare Mos T with two baselines described in Section 6.1, and two additional methods Exact Pareto Optimization (EPO) based on different preference vectors (Mahapatra & Rajan, 2020), and SVGD based on stein variational gradient descent (Liu et al., 2021). Table 5: Mos T achieves higher Hypervolumes than the baselines on the ZDT bi-objective problem. MGDA Linearization SVGD EPO Mos T ZDT-1 4.02 0.92 5.72 0.01 5.54 0.12 4.40 0.01 5.87 0.00 ZDT-2 4.63 0.94 6.65 0.00 6.65 0.00 6.65 0.00 6.88 0.00 ZDT-3 4.53 0.83 6.27 0.02 5.77 0.15 4.53 0.68 6.39 0.03 We report the Hypervolumes of each method in Table 5 and visualize the obtained solutions alongside the entire Pareto-optimal fronts in Figure 4 for a more intuitive comparison. The results in Table 5 demonstrate that Mos T achieves higher Hypervolumes, indicating its superior ability to generate more diverse solution sets that cover larger areas. Further analysis of the Pareto fronts reveals the following observations: 1) EPO and SVGD prioritize reducing one loss, potentially resulting in biased trade-offs, with SVGD lacking guaranteed convergence to Pareto-optimal solutions; 2) MGDA produces diverse solutions but fails to cover the entire Pareto-optimal fronts; 3) Linearization-based MOO is a competitive baseline with high Hypervolumes, but its solutions do not provide satisfactory Published as a conference paper at ICLR 2025 0.0 0.5 1.0 MGDA 0.0 0.5 1.0 Linearization 0.0 0.5 1.0 SVGD 0.0 0.5 1.0 EPO 0.0 0.5 1.0 Mos T Figure 4: Solutions derived by different methods (blue scatters) on the ZDT bi-objective task, with the oracle Pareto-optimal fronts for the two objectives shown in red scatters. diverse trade-offs, as evident from the Pareto fronts; 4) In contrast, Mos T generates evenly-distributed solutions across the Pareto fronts. Investigation for EPO and SVGD. Despite adhering to the official implementation of EPO, it fails to meet performance expectations due to its extensive requirement for preference vector sampling. Increasing the number of solutions (m) generated by EPO yields noticeable enhancements, particularly in ZDT-3 and, to a lesser extent, in ZDT-1, as shown in Table 6. However, even with a larger m, EPO consistently lags behind Mos T, which achieves well-distributed solutions across Pareto fronts without relying on extensive sampling. It is worth noting that we focus on scenarios where computational constraints limit us to training only m models while needing to address numerous objectives (n m). Therefore, we prioritize algorithms that afford better control over the number of generated solutions to achieve a promising trade-off. Table 6: EPO s performance improves with larger m, yet still falls short compared to Mos T, which achieves superior results with fewer m. EPO (m = 5) EPO (m = 100) Mos T (m = 5) ZDT-1 4.40 0.01 4.46 0.01 5.87 0.00 ZDT-2 6.65 0.00 6.65 0.00 6.88 0.00 ZDT-3 4.53 0.68 6.09 0.00 6.39 0.03 Additionally, we conduct more experiments on SVGD with a much larger number of solutions m (i.e., intentionally being unfair to our method). We observe that (a) using a larger m can improve its performance, but (b) even with m = 100, SVGD still cannot outperform Mos T (ours) with m = 5. Table 7: SVGD s performance improves with larger m, yet still falls short compared to Mos T, which achieves superior results with fewer m. SVGD (m = 5) SVGD (m = 100) Mos T (m = 5) ZDT-1 5.54 0.12 5.73 0.03 5.87 0.00 ZDT-2 6.65 0.00 6.66 0.00 6.88 0.00 ZDT-3 5.77 0.15 6.01 0.02 6.39 0.03 A.3.2 FAIRNESS-ACCURACY TRADE-OFFS (n m) In this section, we apply Mos T to explore various trade-offs between accuracy and algorithmic fairness (i.e., statistical independence between predictions and sensitive attributes). Thus, the number Published as a conference paper at ICLR 2025 of objectives n is 2. However, in this scenario with limited number of objectives, using optimal transport to match solutions and objectives may produce feasible but trivial solutions as explained in Section 3.2. Hence, we adapt the extension of Mos T named Mos T-E (introduced in Section 3.2). As discussed in Section 2, prior works that address fairness-accuracy trade-offs can be limited due to the difficulty of setting constraints before training (Zafar et al., 2017), or the mismatch between diverse exploration space and diverse solutions (Mahapatra & Rajan, 2020). Mos T-E differs by sampling a wide range of preference vectors to encompass various trade-offs comprehensively and using optimal transport to automatically generate solutions that maximize coverage for all preference vectors. We quantify the fairness objective using disparate impact (Court, 1971), and optimize it using its convex approximation (Zafar et al., 2017). We experiment on a synthetic dataset (Zafar et al., 2017) and a real German credit dataset (Asuncion & Newman, 2007). We compare Mos T and Mos T-E with MGDA, linearization-based MOO (which can be viewed as a soft version of Zafar et al. (2017), and EPO (Mahapatra & Rajan, 2020), and select the best parameters for each method based on the highest Hypervolume on a validation set. 0.40 0.60 0.80 0.00 0.40 0.60 0.80 0.40 0.60 0.80 0.40 0.60 0.80 0.40 0.60 0.80 0.55 0.60 MGDA 0.55 0.60 Linearization 0.55 0.60 EPO 0.55 0.60 Mos T 0.55 0.60 Mos T-E Ref Point Solution 1 Solution 2 Solution 3 Solution 4 Solution 5 Figure 5: Hypervolumes (colored areas) formed by five solutions for classification loss (objective 1, x-axis) and fairness (objective 2, y-axis) on synthetic and German datasets. Table 8: Hypervolumes ( 100) on 5 solutions with different fairness-accuracy trade-offs. Mos T-E achieves the highest Hypervolume coverage on two (fairness, accuracy) objectives. MGDA Linearization EPO Mos T Mos T-E Synthetic 3.28 0.05 6.70 0.04 2.44 0.01 3.78 0.04 7.65 0.06 German 5.14 0.06 4.99 0.05 4.56 0.02 4.64 0.05 5.27 0.06 Mos T-E generates more diverse trade-offs. Table 8 shows that Mos T-E achieves the highest Hypervolumes, suggesting a superior quality of the solution set it generates. Furthermore, Figure 5 demonstrates that Mos T-E generates solutions that are not only more diverse but also more evenly distributed across the Pareto fronts. Mos T-E effectively addresses the problem of Mos T under n m. When n m, Mos T may assign models separately to dominate individual objectives, resulting in solutions without sufficient diversity. The solutions generated by Mos T shown in Figure 5, align with our idea by predominantly prioritizing either low classification loss or low disparate impact. This limitation is effectively overcome by Mos T-E, with diversely combining existing few objectives as new objectives. A.4 EXPERIMENTAL DETAILS We will detail the models and hyperparameters used for each dataset. All algorithms follow the same setup, including the train-validation-test split, number of training epochs, and tunable learning rates. Hyperparameters for baselines. In addition to the standard setup, we fine-tune hyperparameters specific to each baseline model, aligning with their original configurations. For instance, we adjust the hyperparameter responsible for scaling the proximal term in Fed Prox according to the recommendations provided in (Li et al., 2020). Published as a conference paper at ICLR 2025 Description of KL divergence used. We employ KL divergence to assess differences in Γ across iterations and solution diversity. Specifically, we utilize symmetric KL divergence, defined as (KL(P ||Q)+KL(Q||P )) 2 , where Q and P represent probabilities in the distribution of P and Q, respectively. A.4.1 EXPERIMENTAL SETUP FOR TOY PROBLEMS ZDT bi-objective problems (Zitzler et al., 2000). It contains a class of benchmark problems commonly used to evaluate optimization algorithms, particularly those designed for multi-objective optimization. These problems involve optimizing two conflicting objectives simultaneously. We specifically employ ZDT-1, ZDT-2, and ZDT-3 to evaluate the performance of algorithms. We use multinomial logistic regression, maintaining a consistent learning rate of 0.005 throughout training. This configuration aligns with the established setup presented in (Liu et al., 2021). We run 1,000 epochs for the datasets. A.4.2 EXPERIMENTAL SETUP FOR FEDERATED LEARNING Synthetic data (Li et al., 2020). This synthetic dataset is specifically designed to provide controlled complexities and diverse scenarios for assessing the performance of algorithms. The synthetic data generation process relies on two hyperparameters, ρ1 and ρ2, which shape the dataset s characteristics. ρ1 controls the heterogeneity among local models used to generate labels on each device. While ρ2 governs the differences in data distribution among devices. Larger ρ1 or ρ2 introduces more heterogeneity. For the generated dataset, we conduct our experiments using a train-validation-test split ratio of 6:2:2. We use multinomial logistic regression as the model and run 400 epochs in total. The learning rates are swept from {0.005, 0.01, 0.05, 0.1} without decaying throughout the training process. FEMNIST (Cohen et al., 2017; Caldas et al., 2018). In addition to the synthetic datasets, we also conduct experiments on the Federated Extended MNIST (FEMNIST) dataset, a widely used realworld dataset in federated learning research (Li et al., 2020), using multinomial logistic regression. It comprises handwritten digit images from multiple users, encompassing 62 classes, including digits (0-9) and uppercase and lowercase letters (A-Z, a-z). The data is distributed across 206 clients, with each client holding a subset of the digit classes. This distribution simulates a real-world federated learning scenario, prioritizing data privacy and distribution concerns. We employ a convolutional neural network featuring two convolutional layers with Re LU activation, followed by max-pooling. Additionally, a fully connected layer maps the flattened features to 62 output classes. We run 400 epochs for training. Learning rates are swept from {0.08, 0.1}. A.4.3 EXPERIMENTAL SETUP FOR FAIRNESS-ACCURACY TRADE-OFF In the context of fairness-accuracy trade-off, we experiment on two datasets, the synthetic dataset and the German dataset, introduced below. We employ multinomial logistic regression as our model, conducting 20 epochs of training and sweeping learning rates from {0.08, 0.1}. We use the enhanced Mos T-E for the German dataset. Mos T-E extends the existing n objectives to n by interpolating them with weights drawn from a Dirichlet distribution. We set all shape parameters, α1, . . . , αn, to the same value within the range [0.1, 0.5, 1.0]. The number of extended objectives is chosen from [10, 15, 20]. In practice, we observe that extending the original 2 objectives to 10 yields results similar to those obtained with 20 objectives. Synthetic dataset (Zafar et al., 2017). The synthetic dataset contains 2,000 binary classification instances generated randomly as specified in (Zafar et al., 2017). Binary labels for classification are generated using a uniform distribution. It features 2-dimensional nonsensitive features generated from two distinct Gaussian distributions, and a 1-dimensional sensitive feature generated using a Bernoulli distribution. UCI German credit risk dataset (Asuncion & Newman, 2007). This dataset comprises 1,000 entries, each characterized by 20 categorical and symbolic attributes. These attributes serve to classify individuals as either good or bad credit risks. Gender is considered as the sensitive attribute in this context. Published as a conference paper at ICLR 2025 A.4.4 EXPERIMENTAL SETUP FOR MULTI-TASK LEARNING In the realm of multi-task learning, we assess the efficacy of Mos T using the Office-Caltech10 and Domain Net datasets. The number of objectives n varies across the datasets: n = 4 for Office Caltech10 and n = 6 for Domain Net. We initialize the models with pre-trained weights for both datasets, leveraging Image Net-pretrained Res Net-18 (He et al., 2016) for Office-Caltech10 and Conv Ne Xt-tiny (Liu et al., 2022b) for Domain Net. It is worth noting that Pareto Multi-task Learning (PMTL) (Lin et al., 2019) is a notable method, but its exclusion from our comparison is due to concerns regarding computational efficiency, particularly when applied to large-scale real-world datasets. Office-Caltech10 dataset (Saenko et al., 2010; Griffin et al., 2007). The Office-Caltech10 dataset comprises images from four distinct data sources: Office-31(Saenko et al., 2010) (three data sources) and Caltech-256 (Griffin et al., 2007) (one data source). These sources capture images using different camera devices or in various real environments with diverse backgrounds, representing different objectives. Domain Net dataset (Peng et al., 2019). The Domain Net dataset includes natural images sourced from six distinct data sources: Clipart, Infograph, Painting, Quickdraw, Real, and Sketch. This dataset is characterized by its diversity, covering a wide range of object categories. For our experiments, we focus on a sub-dataset composed of the top ten most common object categories from the extensive pool of 345 categories within Domain Net, following (Li et al., 2021). A.4.5 EXPERIMENTAL SETUP FOR PROMPT LEARNING We explore prompt learning across three datasets from the Super GLUE benchmark: Bool Q (Clark et al., 2019), Multi RC (Khashabi et al., 2018), and Wi C (Pilehvar & Camacho-Collados, 2018). In our approach, each instance represents a distinct objective. This framework allows us to delve into prompt learning using a limited set of training instances while aiming for generalization to unseen test instances. We randomly sample 128 instances from the training dataset and evenly partition the original validation dataset to form both the validation and test datasets. Our training involves a soft prompt approach based on the T5-base model, with the base model parameters kept frozen. Parameter setup follows (Qin & Eisner, 2021). For prompt learning, where each instance is considered an objective, the absence of client groups or task types, as seen in federated learning or multi-task learning, prevents us from evaluating solution performance over the validation set and then selecting the best solution for inference. To address this, we train a simple dispatcher to learn the correlation between instances and solutions (prompts), predicting the optimal solution for a given instance. Specifically, we train cross-attention on the hidden embedding of soft prompts and instances, with architecture following (Lee et al., 2018) (Section 3.1). These hidden embeddings are generated from a fixed encoder of the T5-base. A.5 ABLATION STUDY ON MOST DESIGN MGDA Linearization Fed Avg Fed Prox Fed MGDA+ Mos T w/o R(Γ) Syn (0.0, 0.0) 77.22 0.41 75.91 0.37 75.71 0.51 75.60 0.42 75.26 1.21 83.09 0.87 Syn (0.5, 0.5) 87.09 0.29 87.18 0.27 86.26 0.61 86.13 0.39 85.21 1.42 89.07 0.63 Syn (1.0, 1.0) 90.52 0.13 89.87 0.51 88.12 0.75 87.58 1.36 87.16 1.09 91.70 0.02 Mos T (O) Mos T (M) Mos T (M, soft) Mos T w/o CL w MGDA Mos T (L) Mos T 76.65 0.81 67.62 3.46 74.66 0.70 81.97 0.58 76.80 0.79 82.62 0.33 84.25 0.51 86.94 0.61 78.98 2.04 80.15 1.54 88.19 0.40 86.18 1.19 88.85 0.34 89.99 0.52 90.42 0.22 75.20 3.75 73.20 4.71 91.25 0.51 89.32 0.76 91.26 0.44 92.21 0.08 Published as a conference paper at ICLR 2025 To generate diverse and complementary solutions for multiple objectives, Mos T first finds a balanced matching between objectives and solutions by OT along with elaborated learning strategies and then locates the descent direction common to re-weighted objectives using MGDA. In this section, we conduct comprehensive ablation studies to verify the Mos T design. Specifically, for OT, we verify the necessity of it (Appendix A.5.1) and its specific designs, including solution-specialization curriculum (Appendix A.5.2) and sparsity encouragement imposed by L1 regularization (Appendix A.5.3). Additionally, we evaluate the necessity of MGDA in Appendix A.5.4. These ablation studies contribute to a thorough understanding of the effectiveness of each component within the overall Mos T framework. Experiments are carried out on three synthetic federated learning datasets, with results shown in Table A.5 and Figure 6. validation loss 0 50 100 150 200 250 300 350 400 training iteration test accuracy MGDA Linearization Fed Avg Fed Prox Fed MGDA Mos T (O) Mos T w/o CL Mos T w/o L1 Mos T 20% 40% 60% 80% group with the worst performance (percentage) MGDA Linearization Fed Avg Fed Prox Fed MGDA Mos T (O) Mos T w/o CL Mos T w/o L1 Mos T Figure 6: Including Mos T with a series of ablation study results, (a) displays training loss and test accuracy curves; (b) shows the accuracy of the worst 20%, 40%, 60% and 80% client groups. We omitted Mos T (M) from panel for clarity, as it exhibits inferior performance compared to other methods. A.5.1 ABLATION STUDY FOR OT OT-Generated v.s. Randomly Generated Weight Assignments. We compare the weight assignments generated by OT with the randomly generated weight assignments. This can verify the impact of the choice of objective weighting method in MGDA on the overall performance of Mos T. In other words, we compare Mos T with executing MGDA m times by using randomly generated weights to reweight objectives, which is denoted as w-MGDA. Experimental results reveal that OT-generated weights work significantly better than random weights. This illustrates the necessity of using OT to find a balanced match between solutions and objectives. Different Matching Strategies. We also conduct ablation studies on the effectiveness of the optimal transport matching (Eq. 3) by comparing three strategies introduced in Section 3.3: 1) the original Mos T objective, which utilizes optimal transport; 2) objective selecting model , which selects the best expert/model for each objective (i.e., removing the Γ 1n = β constraint); and 3) model selecting objective , which selects the best objective for each model (i.e., removing the Γ1m = α constraint). These two variants are denoted as Mos T (O) and Mos T (M), resp., with results shown Published as a conference paper at ICLR 2025 in Table A.5 and Figure 6. We also try a soft version of Mos T (M), utilizing normalized loss over objectives when training each model, denoted as Mos T (M, soft). To ensure a fair comparison, we initialize comparisons with the same model weights. Throughout the training process, we track the assignment of objectives to each model, i.e., for every model, identifying the objectives with the smallest validation loss. We visualize the percentages of the selected objectives for each model over time in Figure 7 on the Syn (0.0, 0.0) dataset. In the case of objective selecting model (middle), we observe that two of the models progressively dominate all the objectives. Similarly, model selecting objective shows the early dominance of one model. While its soft version, Mos T (M, soft), exhibits some improvement over Mos T (M), it still lags behind other methods because of the imbalanced training over the objectives. These findings emphasize the necessity of achieving a balanced model-objective trade-off. These observations confirm the presence of the collapse phenomenon in MOO, where limited solutions dominate all objectives. On the contrary, Mos T using optimal transport involving a two-way matching shows a more balanced distribution of objectives among the models throughout training. 0 200 400 optimal transport (OT) 0 200 400 objective selecting model 0 200 400 model selecting data 0 200 400 OT w/o R( ) 0 200 400 OT w/o CL Figure 7: The percentage of assigned objectives for each model under three matching strategies and two variants of Mos T. Each color band represents a model, with the y-axis indicating the corresponding percentage. We see that Mos T (leftmost) learns 5 diverse models that serve the 30 objectives in a balanced manner. A.5.2 ABLATION STUDY FOR OT DESIGN - CURRICULUM LEARNING We evaluate the impact of curriculum learning (from Section 3.3) on optimizing multiple models using Mos T. Curriculum setup for Mos T. As introduced in Section 3.3, our proposed curriculum strategy involves adjusting marginal distributions α and β over different training stages to balance the freedom of model selecting objective and objective selecting model . In the initial stages, we prioritize a uniform distribution for β to ensure exposure to multiple objectives. As training progresses, we transition α to a uniform distribution, covering all objectives, while relaxing β. This transition is achieved by a hyperparameter that gradually decreases from 1 to 0. Though this hyperparameter gradually approaches zero, the transition direction differs: it shifts β from uniform to performanceoriented and, conversely, shifts α in the opposite direction. We compare standard Mos T with a variant using uniform marginal distributions for α and β throughout training, denoted as Mos T w/o CL. We hypothesize that curriculum learning enhances overall performance and training stability. We conduct experiments on three synthetic federated learning datasets. The results in Table A.5 show that using curriculum learning significantly improves the performance of Mos T, proving its effectiveness. Notably, even without curriculum learning, Mos T outperforms other algorithms. Furthermore, Figure 6 illustrates the training loss and test accuracy curves, highlighting the stability difference between the two approaches during training. Curriculum learning leads to increased stability and better convergence towards better solutions. A.5.3 ABLATION STUDY FOR OT DESIGN - DIVERSITY ENCOURAGEMENT As detailed in Section 4 and supported by empirical evidence in Section 5, encouraging a sparse and balanced alignment between objectives and solutions leads to solution specialization on objectives. Mos T goes a step further by employing diversity regularized optimal transport to promote diversity. Enhancing Diversity through Sparse Transport and Regularization. Before imposing diversity regularization to further enforce diversity, we aim to verify whether Mos T without diversity regular- Published as a conference paper at ICLR 2025 ization can generate diverse solutions. We track solution diversity throughout the training process using KL divergence, as explained in Section 6.2, with results depicted in Figure 9. Our empirical findings indicate that even without R(Γ), sparse transport yields diverse solutions that balance all objectives. However, by setting R(Γ) to be the negative of our diversity measure (Definition 1), we can further encourage diversity. In Figure 8, we depict the distribution of model specialization across objectives, assessed through normalized model accuracies. Notably, solutions trained with Mos T exhibit a tendency to specialize in specific objectives, underscoring their heightened diversity compared to baseline approaches. And these diverse solutions then jointly perform better. Performance Impact of Diversity Encouragement. Building on the motivation outlined in Section 5, this section focuses on showcasing the performance benefits resulting from diversity encouragement. To assess the impact of diversity regularization, we conduct a comparative analysis between scenarios with and without it. Table A.5 and Figure 6 highlight that Mos T with diversity regularization not only enhances performance but also contributes to the fairness of federated learning. A.5.4 ABLATION STUDY FOR MGDA MGDA v.s. Linearization in Weighted Multi-Objective Optimization. We compare Mos T that uses MGDA against the variant that updates model parameters based on the optimal transport solution weights. We aim to understand how effectively these two methods determine gradient updates for weighted multi-objective optimization. In this variant of Mos T, denoted as Mos T (L), instead of seeking the Pareto solution of m weighted objectives (as indicated in Eq. 1), we compute θj as θj = Pn i=1 Γi jθi j, where θi j represents the parameter of θj trained on data from the i-th objective. The experimental results showcase the consistent superiority of Mos T over both MGDA and Mos T (L) across three synthetic federated learning scenarios. It proves the effectiveness of employing MGDA for parameter updates. A.6 COMPARATIVE RUNTIME ANALYSIS We assess the runtime of algorithms on the same platform, providing analyses for various applications: federated learning (Table 9), multi-task learning (Table 10), mixture-of-prompt learning (Table 11), ZDT datasets (Table 12), and fairness-accuracy trade-off (Table 13). Additionally, we include the computation time required for OT and MGDA within Mos T. Our results indicate that Mos T exhibits comparable running times to baselines, despite Mos T involving the computation of OT and MGDA, both of which only account for negligible time. However, for Mos T-E, which explicitly extends the number of objectives, it will require more time than baselines. Additionally, we provide a detailed decomposition on the complexity of all the components below. At each round, we alternate between the optimization of Γ and solving a reweighted MGDA problem. Utilizing IPOT, the complexity for optimizing Γ (Equation 5) is roughly O(mn) [1]. Optimizing θ1:m using reweighted MGDA (Equations 7 and 8) requires n gradient access for each solution θj (j [m]), which takes up O(mn) time in total. Hence, The overall complexity of our algorithm per iteration is O(mn). When m n (our mainly focused setting), e.g., m = O(log n), the above complexity can reduce to O(n log n). Note that though the OT step and MGDA step both scale similarly, their actual running time is drastically different as MGDA requires gradient computation. Table 9: Runtime (sec) comparisons for all methods on federated learning datasets, performed on a single Nvidia RTX A5000 platform. MGDA Linearization Fed Avg Fed Prox Fed MGDA+ Mos T Mos T-OT Mos T-MGDA Syn (0.0, 0.0) 219.86 225.90 222.22 281.69 516.25 217.59 1.00 0.44 Syn (0.5, 0.5) 208.82 208.27 205.90 258.19 495.62 201.70 0.92 0.23 Syn (1.0, 1.0) 269.44 268.33 270.65 333.74 557.58 260.50 0.98 0.35 FEMNIST 3522.83 3135.76 3147.62 3539.85 > 5000.00 3368.63 0.94 45.35 Published as a conference paper at ICLR 2025 (a) Mos T (ours) (c) Linearization (d) Fed Avg (e) Fed Prox (f) Fed MGDA Figure 8: Proportion of 5 solutions (A/B/C/D/E), trained by Mos T (a) and baselines (b)-(f), specialized on 30 objectives (labeled as numbers). Notably, in (a), wider ribbons indicate that Mos T-trained solutions address each objective with more specialization. In contrast, baseline solutions exhibit similar specializations across objectives. This highlights the enhanced solution diversity of Mos T, a key factor contributing to its overall performance as shown in Figure 1. Published as a conference paper at ICLR 2025 Linearization Mos T w/o diversity regularization 0 100 200 300 400 training iteration Figure 9: KL divergence between pairwise solution predictions. Baselines show decreasing diversity over training iterations, whereas both Mos T variants maintain diversity. Mos T with diversity regularization fosters diversity. Table 10: Runtime (sec) comparisons for all methods on multi-task learning datasets, performed on a single Nvidia RTX A5000 platform. MGDA Linearization EPO Mos T Mos T-OT Mos T-MGDA Office-Caltech10 465.73 669.88 775.24 371.06 0.54 8.08 Domain Net 294.23 341.27 347.23 226.9 0.08 15.43 Table 11: Runtime (sec) comparisons for all methods on prompt learning datasets, performed on a single Nvidia RTX A5000 platform. MGDA Linearization Mos T Mos T-OT Mos T-MGDA Bool Q 2287.03 2170.97 1568.47 0.29 0.02 Multi RC 2108.14 1820.73 1892.47 0.37 0.09 Wi C 1286.30 1187.56 1254.51 0.37 0.05 Table 12: Runtime (sec) comparisons for all methods on ZDT datasets, performed on a single Nvidia RTX A4000 platform. MGDA Linearization SVGD EPO Mos T Mos T-OT Mos T-MGDA ZDT-1 123.77 13.03 9.26 34.36 38.05 1.07 2.19 ZDT-2 124.02 13.32 9.29 34.54 37.32 0.77 1.43 ZDT-3 140.85 14.98 10.08 37.75 40.82 0.87 1.99 Table 13: Runtime (sec) comparisons for all methods on fairness-accuracy trade-off datasets, performed on a single Nvidia RTX A4000 platform. MGDA Linearization EPO Mos T-E Mos T-OT Mos T-MGDA Synthetic 888.59 43.32 79.09 1159.17 0.15 0.08 German 454.64 23.98 43.10 595.34 0.14 0.03