# Efficient Pareto Manifold Learning with Low-Rank Structure

Weiyu Chen¹, James T. Kwok¹

¹Department of Computer Science and Engineering, The Hong Kong University of Science and Technology. Correspondence to: Weiyu Chen.

Abstract

Multi-task learning, which optimizes performance across multiple tasks, is inherently a multi-objective optimization problem. Various algorithms have been developed to provide discrete trade-off solutions on the Pareto front. Recently, continuous Pareto front approximations using a linear combination of base networks have emerged as a compelling strategy. However, this strategy suffers from scalability issues when the number of tasks is large. To address this issue, we propose a novel approach that integrates a main network with several low-rank matrices to efficiently learn the Pareto manifold. It significantly reduces the number of parameters and facilitates the extraction of shared features. We also introduce orthogonal regularization to further bolster performance. Extensive experimental results demonstrate that the proposed approach outperforms state-of-the-art baselines, especially on datasets with a large number of tasks.

1. Introduction

Multi-task learning (MTL) (Caruana, 1997; Vandenhende et al., 2021) endeavors to efficiently and effectively learn multiple tasks, leveraging shared information to enhance overall performance. As tasks may have disparate scales and potentially conflicting objectives, balancing the tasks is a critical aspect of MTL. Building on the seminal work that conceptualizes MTL as a multi-objective optimization (MOO) problem (Sener & Koltun, 2018), a suite of algorithms has been developed to obtain trade-off solutions along the Pareto front (PF). Examples include EPO (Mahapatra & Rajan, 2020), PMTL (Lin et al., 2019), CAGrad (Liu et al., 2021a), and Nash-MTL (Navon et al., 2022). However, these methods are limited to producing a finite set of discrete trade-off solutions, lacking the flexibility to adapt to varying user preferences.

Continuous approximation of the PF aims to provide an arbitrary number of solutions on the PF based on a user-provided preference vector. Pareto hypernetwork-based methods (Navon et al., 2020b; Hoang et al., 2023) utilize a hypernetwork (Ha et al., 2016) to learn the parameter space of a base network. The hypernetwork accepts a preference vector as input and outputs the corresponding parameters of the base network. Despite its efficacy, the size of the hypernetwork significantly exceeds that of the base network, restricting its use to small base networks. COSMOS (Ruchte & Grabocka, 2021) attempts to circumvent this limitation by incorporating the preference vector directly into the base network as an extra input. However, it falls short of achieving satisfactory results due to its limited parameter space. Pareto Manifold Learning (PaMaL) (Dimitriadis et al., 2023) proposes to find a collection of base networks whose linear combination in the parameter space, weighted by the preference vector, yields the corresponding Pareto-optimal solution. This approach exhibits promising performance. However, as a separate base network must be maintained for each task, the number of parameters surges when the number of tasks is large. Moreover, the base networks cannot learn from each other during training. This lack of sharing poses challenges when the ensemble comprises a large number of base networks.
To surmount these challenges, we introduce a novel approach that employs a main network together with multiple low-rank matrices, which significantly reduces the number of parameters when the number of tasks is large. Moreover, this design facilitates the extraction of shared features through the main network while simultaneously capturing task-specific differences via the low-rank matrices. We further integrate orthogonal regularization to promote disentanglement of these matrices. Empirical investigations demonstrate that the proposed algorithm outperforms PaMaL and other baselines, particularly when the number of tasks is large.

Notations. We denote the set $\{1, \ldots, m\}$ as $[m]$, and the $m$-dimensional simplex $\{\alpha \mid \sum_{i=1}^m \alpha_i = 1,\ \alpha_i \ge 0\}$ as $\Delta_m$. The norm $\|\cdot\|$ refers to the Euclidean norm when applied to vectors, and to the Frobenius norm when applied to matrices.

2. Background

2.1. Multi-Objective Optimization

A multi-objective optimization (MOO) (Miettinen, 1999) problem can be formulated as

$$\min_{\theta \in \mathbb{R}^{d \times k}} F(\theta) = [f_1(\theta), \ldots, f_m(\theta)], \tag{1}$$

where $m \ge 2$ is the number of objectives, and $d \times k$ is the shape of the parameter matrix. In this paper, we represent the parameters in matrix form for convenience when discussing low-rank approximations.

Definition 2.1 (Pareto Dominance and Pareto-Optimal (Miettinen, 1999)). A solution $\theta_1$ is dominated by another solution $\theta_2$ if and only if $f_i(\theta_1) \ge f_i(\theta_2)$ for all $i \in [m]$, and there exists $i \in [m]$ such that $f_i(\theta_1) > f_i(\theta_2)$. A solution $\theta$ is Pareto-optimal if and only if it is not dominated by any other $\theta'$. The Pareto set (PS) is the set of all Pareto-optimal solutions, and the Pareto front (PF) refers to the functional values associated with the solutions in the PS.
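To make Definition 2.1 concrete, the following is a minimal NumPy sketch of dominance checking and non-dominated filtering under minimization; the function names are ours, for illustration only:

```python
import numpy as np

def dominates(f1: np.ndarray, f2: np.ndarray) -> bool:
    """True if the solution with objective vector f2 dominates the one with f1
    (minimization): f2 is no worse on every objective and strictly better on one."""
    return bool(np.all(f2 <= f1) and np.any(f2 < f1))

def pareto_filter(F: np.ndarray) -> np.ndarray:
    """Return indices of non-dominated rows of F (one objective vector per row)."""
    keep = []
    for i in range(len(F)):
        if not any(dominates(F[i], F[j]) for j in range(len(F)) if j != i):
            keep.append(i)
    return np.array(keep)

# Example: the middle point is dominated by the first one.
F = np.array([[1.0, 2.0], [1.5, 2.5], [2.0, 1.0]])
print(pareto_filter(F))  # -> [0 2]
```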
2.2. Multi-Task Learning

Multi-task learning (MTL) (Caruana, 1997; Vandenhende et al., 2021) refers to learning multiple tasks simultaneously. In this paper, we focus on the optimization aspect, adopting the widely-used shared-bottom architecture (Caruana, 1997) as the base network. More specifically, for an $m$-task network, we have a shared bottom $\theta_{sh}$ and $m$ task-specific heads $\theta_{t_1}, \ldots, \theta_{t_m}$. Throughout this paper, our primary focus is on the shared bottom $\theta_{sh}$; hence, any mention of $\theta$ hereafter refers specifically to $\theta_{sh}$.

MTL can be viewed as a MOO problem (Sener & Koltun, 2018). To deal with possible conflicts among tasks on the shared bottom, various algorithms have been proposed, such as MGDA (Sener & Koltun, 2018; Désidéri, 2012), PCGrad (Yu et al., 2020), IMTL (Liu et al., 2020), Graddrop (Chen et al., 2020), RLW (Lin et al., 2022a), CAGrad (Liu et al., 2021a), Nash-MTL (Navon et al., 2022), and Auto-λ (Liu et al., 2022). These approaches, however, predominantly focus on obtaining a single solution rather than capturing the whole PF.

Discrete Approximation of the Pareto Front. Lin et al. (2019) propose to generate a set of solutions to approximate the Pareto front. Subsequent research has introduced innovative strategies to enhance the discrete approximation, notably EPO (Mahapatra & Rajan, 2020), MOO-SVGD (Liu et al., 2021b), GMOOAR (Chen & Kwok, 2022), and PNG (Ye & Liu, 2022). These methods are inherently limited to producing a fixed set of discrete points on the PF. Ma et al. (2020) propose an algorithm to expand these discrete solutions in their vicinity. However, it can only produce PF segments around each discrete solution instead of the whole continuous PF.

Continuous Approximation of the Pareto Front. Continuous approximation of the PF enables the derivation of solutions tailored to user preferences, offering an arbitrary resolution of the trade-off solutions. Navon et al. (2020b) introduce the Pareto hypernetwork (PHN), which learns a hypernetwork (Ha et al., 2016) that takes a preference vector as input and generates the corresponding Pareto-optimal network parameters. PHN-HVI (Hoang et al., 2023) further develops this method using hypervolume maximization (Wang et al., 2017). However, scalability becomes a concern when applying the hypernetwork to large networks, as the hypernetwork is usually much larger than the base network itself. Conditioned one-shot multi-objective search (COSMOS) (Ruchte & Grabocka, 2021) incorporates the preference vector as an additional input, enabling the generation of outputs with different preferences while introducing few extra parameters. Dosovitskiy & Djolonga (2019) propose the use of FiLM layers (Perez et al., 2018) to incorporate preferences through channel-wise multiplication and addition operations on the feature map. However, these methods are sometimes constrained by the relatively limited parameter space, which can lead to suboptimal performance.

Pareto manifold learning (PaMaL) (Dimitriadis et al., 2023) aims to jointly learn multiple base networks $\theta_1, \ldots, \theta_m$, such that for any given preference vector $\alpha \in \Delta_m$, the corresponding weighted combination of base networks yields a Pareto-optimal model:

$$\theta(\alpha) = \sum_{i=1}^m \alpha_i \theta_i. \tag{2}$$

PaMaL demonstrates commendable performance. However, it only considers two or three tasks. When dealing with a large number of tasks, the number of parameters increases significantly, as each task $i$ requires its own network $\theta_i$. Moreover, the base networks cannot benefit from each other during training, which can impair performance, especially when scaling to a larger number of base networks. Learning a continuous PF has also been explored in Bayesian optimization (Lin et al., 2022b) and evolutionary optimization (Lin et al., 2023) for engineering problems with up to 7 parameters; these methods fail to converge in deep learning scenarios due to the significantly larger parameter spaces.

2.3. Low-Rank Adaptation

Low-rank adaptation (LoRA) (Hu et al., 2021) has emerged as a popular approach for finetuning large pre-trained models. It finetunes only a low-rank matrix instead of the entire model, resulting in improved efficiency and the prevention of overfitting. Subsequent studies (Ilharco et al., 2022; Yadav et al., 2023; Ortiz-Jimenez et al., 2023) explore merging LoRA modules trained on different tasks; these works focus on direct addition or subtraction without incorporating any training. Audibert et al. (2023) propose finetuning a pretrained model for multiple tasks using a single LoRA module (with some task-specific parameters). Wang et al. (2023a) propose finetuning a pretrained model by adding LoRA modules one by one, where each newly-added LoRA module is orthogonal to the previous ones. The main difference between these methods and ours is that they learn a single LoRA module at a time and output a single finetuned network, while we train multiple low-rank matrices simultaneously and output a subspace of networks. To mitigate the dominance of unitary transforms in the weight updates observed in LoRA, Wang et al. (2023b) propose splitting the LoRA module into multiple LoRA modules. However, none of the existing research considers jointly training an arbitrary convex combination of low-rank matrices to learn a model subspace.
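As a point of reference for the structure we build on, below is a minimal sketch of a generic LoRA-style linear layer, where the pretrained weight $W$ is frozen and only the factors $B$ and $A$ (with $BA$ initialized to zero) are trained. Initialization and scaling conventions vary across implementations; this is an illustration, not code from any of the cited works.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA-style linear layer: y = x (W + s * B A)^T,
    where only B and A are trained during finetuning."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8, s: float = 1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(d_out, d_in))  # frozen pretrained W
        nn.init.kaiming_uniform_(self.weight)
        self.weight.requires_grad_(False)
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # zero init: BA = 0 at start
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.s = s

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.weight + self.s * self.B @ self.A).T
```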
3. Proposed Method

In this section, we introduce a novel approach that learns a main module and $m$ low-rank matrices for each layer (Section 3.1). Orthogonal regularization is introduced in Section 3.2. Finally, the whole algorithm is presented in Section 3.3.

3.1. Low-Rank Structure

We start by examining the similarities of the base networks obtained by PaMaL (Dimitriadis et al., 2023) on MultiMNIST (Sabour et al., 2017), using the same experimental setting as PaMaL. Figure 1 shows the layer-wise cosine similarities between the parameters of two base networks throughout training. As can be seen, the similarities are almost zero at the beginning of training. As training progresses, a marked increase in similarity is observed. Furthermore, similarities are higher at the lower-level layers. This motivates us to reduce the redundancy among base networks.

Consider the $m$ base networks in (2). Denote the parameters in the $l$th layer of base network $\theta_i$ by $\theta_i^l$. In the $l$th layer, (2) can be written as:

$$\theta(\alpha)^l = \sum_{i=1}^m \alpha_i \theta_i^l. \tag{3}$$

Let $\theta_0^l = \frac{1}{m} \sum_{i=1}^m \theta_i^l$. We can rewrite (3) as:

$$\theta(\alpha)^l = \sum_{i=1}^m \alpha_i \Big( \theta_0^l + \frac{1}{m} \sum_{j=1}^m (\theta_i^l - \theta_j^l) \Big). \tag{4}$$

Given the similarities observed in Figure 1, we approximate the difference between any two modules $\theta_i^l - \theta_j^l$ by a low-rank matrix. We then use the low-rank matrix $B_i^l A_i^l$ to replace $\frac{1}{m} \sum_{j=1}^m (\theta_i^l - \theta_j^l)$ in (4), leading to:

$$\theta(\alpha)^l = \sum_{i=1}^m \alpha_i (\theta_0^l + B_i^l A_i^l). \tag{5}$$

We further add a scaling factor $s$ to regulate the significance of the low-rank component:

$$\theta(\alpha)^l = \theta_0^l + s \sum_{i=1}^m \alpha_i B_i^l A_i^l. \tag{6}$$

Through the above transformation, instead of learning $m$ base modules in (3), we learn a main module $\theta_0^l$ and $m$ low-rank matrices $B_1^l A_1^l, \ldots, B_m^l A_m^l$. The main modules are expected to capture the common features across multiple tasks, thereby providing a shared foundation that each low-rank adaptation can leverage. An illustrative example of the proposed method is shown in Figure 2.
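A sketch of how Eq. (6) can be realized for a single linear layer is given below; the class and attribute names are ours, and the paper does not publish an implementation. Given a preference vector $\alpha$, one forward pass materializes the layer weight $\theta(\alpha)^l$.

```python
import torch
import torch.nn as nn

class LORPMANLinear(nn.Module):
    """Sketch of Eq. (6) for one linear layer:
    theta(alpha) = theta_0 + s * sum_i alpha_i B_i A_i."""
    def __init__(self, d: int, k: int, num_tasks: int, rank: int = 8, s: float = 1.0):
        super().__init__()
        self.theta0 = nn.Parameter(torch.randn(d, k) * 0.02)    # shared main weight
        self.B = nn.Parameter(torch.zeros(num_tasks, d, rank))  # task-specific factors
        self.A = nn.Parameter(torch.randn(num_tasks, rank, k) * 0.02)
        self.s = s

    def forward(self, x: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # Preference-weighted sum of low-rank updates: sum_i alpha_i B_i A_i.
        delta = torch.einsum('i,idr,irk->dk', alpha, self.B, self.A)
        w = self.theta0 + self.s * delta
        return x @ w  # x: (batch, d) -> (batch, k)
```

PaMaL's Eq. (2) would correspond to replacing each low-rank product $B_i A_i$ here with a full $d \times k$ matrix, which is exactly where the parameter savings of the proposed structure come from.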
3.1.1. Computational Efficiency

In the $l$th layer, let $B_i^l \in \mathbb{R}^{d_l \times r_l}$ and $A_i^l \in \mathbb{R}^{r_l \times k_l}$. The proposed approach reduces the number of parameters to only $d_l k_l + m(d_l r_l + r_l k_l)$. Notably, the rank $r_l$ is usually set to be much smaller than $k_l$ and $d_l$. As a result, the proposed method achieves higher parameter efficiency than PaMaL, which has $m d_l k_l$ parameters. This difference becomes more significant when dealing with larger networks and a larger number of tasks.

3.1.2. Approximation Power

Similar to PaMaL (Dimitriadis et al., 2023), we show in this section the approximation power of the proposed structure. While PaMaL only considers two tasks, we consider the more general case with $m$ tasks. Denote the optimal mapping from network input $x \in \mathcal{X} \subseteq \mathbb{R}^u$ and preference vector $\alpha \in \Delta_m$ to the corresponding point on the PF as $t(x, \alpha) : \mathcal{X} \times \Delta_m \to \mathbb{R}^m$. We have the following Theorem.

Theorem 3.1. Assume that $\mathcal{X} \times \Delta_m$ is compact and $t(x, \alpha)$ is continuous. For any $\epsilon > 0$, there exists a ReLU MLP $h$ with main network $\theta_0$ and $m$ low-rank matrices $B_1 A_1, \ldots, B_m A_m$, such that for all $x \in \mathcal{X}$ and $\alpha \in \Delta_m$,

$$\Big\| t(x, \alpha) - h\Big(x;\ \theta_0 + \sum_{i=1}^m \alpha_i B_i A_i\Big) \Big\| \le \epsilon.$$

The proof is in Appendix A. This Theorem shows that, given any preference vector $\alpha$, the proposed main network with low-rank matrices can approximately output the corresponding point on the PF within any given error margin $\epsilon$.

Figure 1. Layer-wise similarities between base networks obtained by PaMaL on MultiMNIST over three random seeds. Shaded areas represent the 95% confidence interval.

Figure 2. Illustration of the proposed LORPMAN on an $L$-layer base network with $m$ tasks. For each layer, we aim to learn $m$ low-rank matrices which are orthogonal to each other.

3.2. Orthogonal Regularization

Orthogonal regularization, which encourages rows or columns of weight matrices to be approximately orthogonal, has been employed in various scenarios to regulate the parameters of a neural network (Xie et al., 2017; Bansal et al., 2018; Wang et al., 2020). Here, we extend it to orthogonality among the low-rank matrices. First, we flatten each low-rank matrix into a one-dimensional vector and normalize:

$$w_i^l = \frac{\mathrm{flatten}(B_i^l A_i^l)}{\| \mathrm{flatten}(B_i^l A_i^l) \|}.$$

Next, we concatenate these vectors to construct a $(d_l k_l) \times m$ matrix $W^l = [w_1^l, \ldots, w_m^l]$. Our objective is to encourage orthogonality among the matrix columns. This orthogonality loss can be computed as

$$R_o^l = \| (W^l)^\top W^l - I \|^2,$$

where $I$ is the $m \times m$ identity matrix. Finally, we compute the loss for the entire network as

$$R_o = \sum_{l=1}^L R_o^l, \tag{7}$$

where $L$ is the number of layers. The objective of the orthogonal regularization is to reduce redundancy and conflicts between different low-rank matrices. By reducing redundancy, the model is encouraged to learn common features in the main network and task-specific features in the low-rank matrices, which can lead to more efficient parameter usage and better performance.

Note that the computational complexity of this loss is $O(m^2)$, which becomes expensive for large $m$. To address this, we use a stochastic approximation when $m$ exceeds 3. Specifically, in each iteration, a subset of tasks $\mathcal{T}$ with cardinality 3 is randomly selected from the set of all $m$ tasks. We then construct $\hat{W}^l$ by concatenating $\{w_i^l\}_{i \in \mathcal{T}}$, and estimate $R_o^l$ as

$$R_o^l = \| (\hat{W}^l)^\top \hat{W}^l - I \|^2, \tag{8}$$

where $I$ is now the $3 \times 3$ identity matrix. This stochastic approximation ensures that the complexity is independent of $m$. Empirical evaluations in Section 4.3 demonstrate that it still maintains satisfactory performance.
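A sketch of the per-layer orthogonality loss, including the stochastic 3-task approximation, is shown below; the names are ours, and the guard against zero-norm products is an implementation detail we add.

```python
import torch

def orthogonality_loss(B: torch.Tensor, A: torch.Tensor, max_tasks: int = 3) -> torch.Tensor:
    """Sketch of the per-layer loss R_o^l = ||W^T W - I||^2 of Eq. (7)/(8),
    with the stochastic 3-task approximation used when m > 3.
    B: (m, d, r), A: (m, r, k)."""
    m = B.shape[0]
    idx = torch.randperm(m)[:max_tasks] if m > max_tasks else torch.arange(m)
    # Flatten and normalize each selected low-rank product B_i A_i.
    W = torch.stack([(B[i] @ A[i]).flatten() for i in idx], dim=1)  # (d*k, |T|)
    W = W / W.norm(dim=0, keepdim=True).clamp_min(1e-12)
    gram = W.T @ W
    eye = torch.eye(W.shape[1], device=W.device)
    return ((gram - eye) ** 2).sum()
```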
3.3. Optimization

To learn the Pareto manifold, we minimize the expected loss over preference vectors $\alpha$ drawn from the Dirichlet distribution $\mathrm{Dir}(p)$. The optimization objective (without regularization) can be written as

$$\min_{\theta_0, B_1 A_1, \ldots, B_m A_m} \mathbb{E}_\alpha \mathbb{E}_\xi \Big[ \sum_{i=1}^m \alpha_i f_i(\theta(\alpha); \xi) \Big], \tag{9}$$

where $\xi$ is a mini-batch of multi-task training data $\{(x_i^1, x_i^2, \ldots, x_i^m, y_i^1, y_i^2, \ldots, y_i^m)\}_{i=1}^q$ and $q$ is the batch size. For any given $\alpha$, we aim to minimize $\mathbb{E}_\xi[\sum_{i=1}^m \alpha_i f_i(\theta(\alpha); \xi)]$, which is the linear scalarization of the $m$ objective functions. Note that the solution obtained by linear scalarization is Pareto-optimal, which can be formally stated as follows.

Proposition 3.2 ((Boyd & Vandenberghe, 2004)). Given any $\alpha \in \{\alpha \mid \alpha_i > 0\}$, the optimal solution of the scalarized problem $\min_\theta \sum_{i=1}^m \alpha_i f_i(\theta)$ is a Pareto-optimal solution of the original multi-objective optimization problem (1).

For the reader's convenience, the proof is reproduced in Appendix B. Thus, when (9) is minimized, the obtained $\theta(\alpha)$'s are Pareto-optimal solutions. Here, we only consider linear scalarization; other scalarization methods (Bowman Jr, 1975; Sener & Koltun, 2018; Navon et al., 2022) could also be considered.

The proposed algorithm, called LOw-Rank Pareto MANifold learning (LORPMAN), is shown in Algorithm 1. The training process is divided into two phases. First, we adapt both the main model $\theta_0$ and the low-rank matrices. After a certain number of epochs (referred to as the freeze epoch), we fix the main model and adapt only the low-rank matrices. This encourages the low-rank matrices to learn task-specific representations instead of always relying on the main model to improve performance. In each iteration, we sample $b$ preference vectors $\{\alpha^1, \ldots, \alpha^b\}$ from the Dirichlet distribution. For each $\alpha^j$, we compute the corresponding network parameters according to (6). Then, following (Dimitriadis et al., 2023), we calculate the multi-forward regularization loss $R_p$, which penalizes incorrect solution orderings on the PF (details in Appendix C). Subsequently, we incorporate the orthogonal loss in (7) to encourage orthogonality among the different low-rank matrices, as discussed in the previous section.

Algorithm 1: LORPMAN.
  Input: learnable main model parameters $\theta_0$; learnable matrices $\{A_i, B_i\}_{i=1}^m$; distribution parameter $p$; window size $b$; batch size $q$; regularization coefficients $\lambda_p, \lambda_o$; scaling parameter $s$.
  while not converged do
    sample a mini-batch of multi-task training data $\xi = \{(x_i^1, \ldots, x_i^m, y_i^1, \ldots, y_i^m)\}_{i=1}^q$;
    independently sample $\alpha^1, \ldots, \alpha^b$ from $\mathrm{Dir}(p)$;
    compute the model parameters for each $\alpha^j$: $\theta(\alpha^j)^l = \theta_0^l + s \sum_{i=1}^m \alpha_i^j B_i^l A_i^l$;
    compute the multi-forward regularization loss $R_p$;
    compute the orthogonal loss $R_o$ in Eq. (7);
    compute the loss $\mathcal{L} = \sum_{j=1}^b \sum_{i=1}^m \alpha_i^j f_i(\theta(\alpha^j); \xi) + \lambda_p R_p + \lambda_o R_o$;
    if current epoch < freeze epoch then
      take a gradient descent step on $\theta_0$ and $\{A_i, B_i\}_{i=1}^m$;
    else
      take a gradient descent step on $\{A_i, B_i\}_{i=1}^m$;
    end if
  end while
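The sketch below illustrates one optimization step in the spirit of Algorithm 1, under several assumptions of ours: `model(x, alpha)` is assumed to return per-task outputs at $\theta(\alpha)$, `model.orthogonality_loss()` is an assumed helper implementing Eq. (7)/(8), and the multi-forward penalty $R_p$ is omitted for brevity.

```python
import torch
from torch.distributions import Dirichlet

def lorpman_step(model, batch, task_losses, optimizer, m, b=3, p=2.0,
                 lambda_o=1.0, freeze_main=False):
    """One training step in the spirit of Algorithm 1 (simplified sketch).
    `task_losses(outputs, targets)` is assumed to return a length-m loss vector."""
    x, targets = batch
    alphas = Dirichlet(torch.full((m,), p)).sample((b,))   # b preference vectors
    loss = 0.0
    for alpha in alphas:
        outputs = model(x, alpha)                          # forward at theta(alpha)
        losses = task_losses(outputs, targets)             # shape (m,)
        loss = loss + (alpha * losses).sum()               # linear scalarization
    loss = loss + lambda_o * model.orthogonality_loss()    # assumed helper, Eq. (7)/(8)
    optimizer.zero_grad()
    loss.backward()
    if freeze_main:                                        # after the freeze epoch,
        for name, prm in model.named_parameters():         # update only the low-rank
            if 'theta0' in name and prm.grad is not None:  # factors {A_i, B_i}
                prm.grad = None
    optimizer.step()
    return loss.item()
```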
4. Experiments

In this section, we first demonstrate the training process of the proposed algorithm on a toy problem (Section 4.1). Then, we perform experiments on datasets with two or three tasks (Section 4.2). Next, we scale the number of tasks up to 40 (Section 4.3). We then compare the proposed method with algorithms that generate discrete PFs (Section 4.4) and algorithms that generate a single solution (Section 4.5). Finally, ablation studies are presented in Section 4.6.

We adopt the hypervolume (HV) (Zitzler & Thiele, 1998; Knowles et al., 2003) as performance indicator, which is widely used and evaluates both the convergence and diversity of the obtained PF. Details of the HV indicator are in Appendix D.1. To evaluate the obtained PFs, following (Dimitriadis et al., 2023), we sample 11 solutions on the obtained PF in the two-objective cases, 66 solutions in the three-objective cases, and 100 solutions when there are more than three objectives. We tune the hyperparameters according to the HV value on the validation datasets. We compare the proposed method with state-of-the-art continuous PF approximation algorithms, namely PHN (Navon et al., 2020b), PHN-HVI (Hoang et al., 2023), COSMOS (Ruchte & Grabocka, 2021), and PaMaL (Dimitriadis et al., 2023). Since PHN and PHN-HVI require training a hypernetwork with significantly more parameters than the base network, they can only be run on MultiMNIST and Census.

4.1. Illustrative Example

To demonstrate the training process of the proposed algorithm, we employ a widely-used two-objective toy problem (Yu et al., 2020; Liu et al., 2021a; Navon et al., 2022), with parameter $\theta = [\theta_1, \theta_2] \in \mathbb{R}^2$. The detailed definition of the toy problem is in Appendix D.2. We initialize $\theta_0$ to $[4.5, 4.5]$. Since the parameters of the toy problem do not have a matrix structure, instead of using $B_i A_i$'s for low-rank approximation, we use $\theta_i \in \mathbb{R}^2$, where the second component is fixed to mimic the low-rank approximation constraint. We initialize $\theta_1 = [-4.5, 0.0]$ and $\theta_2 = [4.5, 0.0]$. Given an $\alpha$, the corresponding Pareto-optimal parameter is $\theta(\alpha) = \theta_0 + \alpha_1 \theta_1 + \alpha_2 \theta_2$.

Figure 3 shows the trajectories of $\theta$ obtained by LORPMAN with $\alpha = [0.5, 0.5]$, $[1, 0]$, and $[0, 1]$. We can see that $\theta$ reaches the middle of the PF for $\alpha = [0.5, 0.5]$, and the two extreme points of the PF for $\alpha = [1, 0]$ and $[0, 1]$. From Figure 3b, we can see that these three solutions lie on the same line in parameter space, so the whole Pareto set can be obtained by varying $\alpha$.

Figure 3. Trajectory of $\theta$ obtained by LORPMAN with $\alpha = [0.5, 0.5]$ (red), $\alpha = [1, 0]$ (green), and $\alpha = [0, 1]$ (blue) in (a) objective space and (b) parameter space. Circles denote the initial points and squares the final points. The gray lines in (a) and (b) denote the PF and PS, respectively.
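The mechanism of this section can be reproduced with simple stand-in objectives (the actual toy problem is defined in Appendix D.2). The sketch below trains the linear parameterization $\theta(\alpha) = \theta_0 + \alpha_1 \theta_1 + \alpha_2 \theta_2$ with linear scalarization on two quadratic losses, whose Pareto set is the segment between their minimizers; objectives and hyperparameters here are our illustrative choices, not the paper's.

```python
import torch

# Two quadratic objectives as stand-ins for the appendix toy problem.
a, b = torch.tensor([1.0, 0.0]), torch.tensor([-1.0, 0.0])
f = lambda th: torch.stack([((th - a) ** 2).sum(), ((th - b) ** 2).sum()])

theta0 = torch.tensor([4.5, 4.5], requires_grad=True)
theta1 = torch.tensor([-4.5, 0.0], requires_grad=True)
theta2 = torch.tensor([4.5, 0.0], requires_grad=True)
opt = torch.optim.Adam([theta0, theta1, theta2], lr=0.05)

for step in range(2000):
    alpha = torch.distributions.Dirichlet(torch.ones(2)).sample()
    th = theta0 + alpha[0] * theta1 + alpha[1] * theta2   # linear parameterization
    loss = (alpha * f(th)).sum()                          # linear scalarization
    opt.zero_grad()
    loss.backward()
    opt.step()

# Varying alpha after training traces out an approximation of the Pareto set,
# here the segment between a and b.
for w in [0.0, 0.5, 1.0]:
    alpha = torch.tensor([w, 1.0 - w])
    print(w, (theta0 + alpha[0] * theta1 + alpha[1] * theta2).detach())
```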
4.2. Datasets with Two or Three Tasks

In this section, we examine performance on datasets with two or three tasks as used in (Dimitriadis et al., 2023), namely MultiMNIST, Census, and UTKFace.

MultiMNIST and Census. MultiMNIST (Sabour et al., 2017) is a digit classification dataset with two tasks: classification of the top-left digit and classification of the bottom-right digit in each image. Census (Kohavi, 1996) is a tabular dataset with two tasks: age prediction and education level classification. Following (Dimitriadis et al., 2023), we use LeNet (LeCun et al., 1998) and a multilayer perceptron (MLP) as the shared bottom of the base network for MultiMNIST and Census, respectively.¹ For all algorithms, the number of training epochs is set to 10. For LORPMAN, we choose the scaling factor $s \in \{1, 2, 4, 6\}$ and freeze epoch $\in \{4, 6, 8\}$ based on the validation set. For both datasets, the rank $r$ for all layers is set to 8 and the orthogonal regularization coefficient $\lambda_o$ is set to 1; we do not tune $r$ and $\lambda_o$. More experimental details on the two datasets are in Appendices D.3 and D.4, respectively.

¹The parameters of a convolution layer form a four-dimensional tensor instead of a matrix. We adopt the approach in (Hu et al., 2021) to transform it into a matrix before performing low-rank approximation.

The resulting PFs are shown in Figure 4, and the corresponding HV values in Table 1. As can be seen, LORPMAN obtains a PF closer to the top-right region (i.e., with better accuracies on both objectives). COSMOS exhibits limited accuracies due to the constraints imposed by its small number of parameters. Despite having larger numbers of parameters, PaMaL, PHN, and PHN-HVI show inferior HVs compared to LORPMAN.

Table 1. HV values obtained on MultiMNIST and Census (averaged over three random seeds; standard deviation in parentheses). For PHN and PHN-HVI, we report the number of parameters in the hypernetwork.

| Method  | MultiMNIST HV  | # Parameters | Census HV      | # Parameters |
|---------|----------------|--------------|----------------|--------------|
| PHN     | 0.900 (0.0068) | 2.793M       | 0.649 (0.0007) | 12.251M      |
| PHN-HVI | 0.906 (0.0020) | 2.793M       | 0.650 (0.0008) | 12.251M      |
| COSMOS  | 0.888 (0.0077) | 0.028M       | 0.652 (0.0014) | 0.122M       |
| PaMaL   | 0.905 (0.0010) | 0.055M       | 0.654 (0.0006) | 0.242M       |
| LORPMAN | 0.918 (0.0018) | 0.046M       | 0.656 (0.0004) | 0.133M       |

Figure 4. Test performance on MultiMNIST and Census. The PF is shown in bold. We show the results obtained with three different random seeds.

UTKFace. UTKFace (Zhang et al., 2017) is a dataset with three tasks: age prediction, gender classification, and race classification. Following (Dimitriadis et al., 2023), we use ResNet-18 (He et al., 2016) as the shared bottom of the base network. The number of training epochs is 100. We tune $\lambda_o \in \{0.1, 0.5, 1\}$, freeze epoch $\in \{60, 80\}$, and $r \in \{16, 32, 64\}$. The scaling factor $s$ is set to 1 without tuning. More experimental details are in Appendix D.5. We do not compare with PHN and PHN-HVI here because using the same hypernetwork structure as in MultiMNIST and Census would result in a hypernetwork with approximately 1 billion parameters.

Table 2 shows the HVs obtained by COSMOS, PaMaL, and LORPMAN, together with the number of parameters of each algorithm. As can be seen, LORPMAN outperforms PaMaL and COSMOS, despite having fewer parameters than PaMaL. Figure 5 shows the PFs obtained by PaMaL and LORPMAN;² the solutions obtained by LORPMAN outperform those of PaMaL on all three objectives.

²Since the PF obtained by COSMOS is inferior, it is omitted from the figure to maintain visual clarity.

Table 2. HV values obtained on UTKFace (averaged over three random seeds; standard deviation in parentheses).

| Method  | HV            | # Parameters |
|---------|---------------|--------------|
| COSMOS  | 0.281 (0.003) | 11.2M        |
| PaMaL   | 0.304 (0.003) | 33.6M        |
| LORPMAN | 0.314 (0.001) | 24.0M        |

Figure 5. Test performance of PaMaL and LORPMAN on UTKFace. Figures (b), (c), and (d) are 2D projections of (a) for better illustration of the 3D surface.
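Footnote 1 above mentions that convolution kernels are reshaped into matrices before the low-rank approximation is applied, following the treatment of convolutions in (Hu et al., 2021). A minimal sketch of that reshaping (all sizes hypothetical):

```python
import torch

# A (c_out, c_in, k_h, k_w) convolution kernel is flattened to a
# (c_out, c_in * k_h * k_w) matrix, the low-rank update B A is added,
# and the result is reshaped back.
c_out, c_in, kh, kw, r = 64, 32, 3, 3, 8
W = torch.randn(c_out, c_in, kh, kw)                 # original conv weight
B = torch.zeros(c_out, r)
A = torch.randn(r, c_in * kh * kw) * 0.02

W_mat = W.reshape(c_out, -1)                         # (c_out, c_in*k_h*k_w)
W_adapted = (W_mat + B @ A).reshape(c_out, c_in, kh, kw)
print(W_adapted.shape)                               # torch.Size([64, 32, 3, 3])
```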
4.3. Scaling to a Large Number of Tasks

CIFAR-100. CIFAR-100 (Krizhevsky et al., 2009) is an image classification dataset with 100 classes, organized into 20 superclasses. As in (Rosenbaum et al., 2017; Yu et al., 2020; Liu et al., 2022), we consider each superclass as a task. The objective of each task is to classify an image into one of its 5 more-specific classes. Following (Liu et al., 2022), we use VGG-16 (Simonyan & Zisserman, 2014) as the base network. The number of training epochs is set to 300 and we freeze the main model after 250 epochs, a proportion similar to that used for UTKFace. The scaling factor $s$ is set to 1 without tuning. We search the orthogonal regularization coefficient $\lambda_o \in \{0.001, 0.005, 0.01, 0.1\}$ and rank $r \in \{8, 16\}$. More experimental details are in Appendix D.6.

The obtained HVs are shown in Table 3, and the obtained PFs are in Appendix E. The results highlight the challenges encountered by PaMaL when dealing with a large number of objectives. PaMaL needs to jointly train 20 base networks, which leads to a large number of parameters and a small hypervolume. In contrast, LORPMAN achieves a significantly better hypervolume while using only 8.9% of the parameters of PaMaL. This underscores the efficiency and effectiveness of LORPMAN on problems with a large number of tasks.

Table 3. HV values obtained on CIFAR-100 (averaged over three random seeds; standard deviation in parentheses).

| Method  | HV (×10⁻²)     | # Parameters |
|---------|----------------|--------------|
| COSMOS  | 0.344 (0.018)  | 15.0M        |
| PaMaL   | 0.0583 (0.010) | 296.5M       |
| LORPMAN | 0.887 (0.047)  | 26.4M        |

Figure 6 shows the convergence of the training losses of PaMaL with different learning rates; the training loss of LORPMAN (with learning rate 0.01) is also shown for comparison. We observe that the poor performance of PaMaL on CIFAR-100 is due to its slow convergence, since it has to jointly train 20 base networks. Simply increasing the learning rate does not help PaMaL converge.

Figure 6. Training loss of PaMaL versus the number of epochs, using different learning rates.

CelebA. CelebA (Liu et al., 2015) is a face attribute classification dataset with 40 tasks, where each task is binary classification of a face attribute. Following (Sener & Koltun, 2018), we use ResNet-18 (He et al., 2016) as the base network. The number of training epochs is 50, and we freeze the main model after 40 epochs, again a proportion similar to that used for UTKFace. The scaling factor $s$ is set to 1 without tuning. We search the orthogonal regularization coefficient $\lambda_o \in \{0.1, 0.5, 1\}$ and rank $r \in \{8, 16, 32\}$. Here, we consider two more baselines: (i) PHN with chunking (Navon et al., 2020a; Lin et al., 2020), both in the original setting of (Navon et al., 2020a) (i.e., hidden dimension 100) and in a scaled setting with a number of parameters comparable to LORPMAN (i.e., hidden dimension 500); and (ii) FiLM conditioning, with a FiLM conditioning layer added after each block of ResNet-18. More experimental details are described in Appendix D.7.

The HV values are shown in Table 4. LORPMAN achieves much better performance while using only 21% of the parameters of PaMaL. FiLM conditioning suffers from the same problem as COSMOS due to its limited number of parameters. PHN with chunking also performs worse than LORPMAN. In comparison, the proposed LORPMAN is a more straightforward approach that achieves both good performance and parameter efficiency.

Table 4. HV values obtained on CelebA (averaged over three random seeds; standard deviation in parentheses).

| Method                  | HV (×10⁻²)    | # Parameters |
|-------------------------|---------------|--------------|
| PHN-Chunking (original) | 0.663 (0.027) | 36.6M        |
| PHN-Chunking (scaled)   | 0.681 (0.048) | 92.3M        |
| FiLM                    | 0.803 (0.022) | 11.4M        |
| COSMOS                  | 0.783 (0.013) | 11.4M        |
| PaMaL                   | 0.472 (0.018) | 453.3M       |
| LORPMAN                 | 1.167 (0.008) | 96.8M        |
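The parameter savings reported in this section follow directly from the per-layer counts in Section 3.1.1. The illustrative arithmetic below, with a hypothetical layer size rather than the actual model dimensions, shows how the gap widens as $m$ grows:

```python
# Per-layer parameter counts from Section 3.1.1 (numbers are hypothetical,
# chosen only to show the scaling with the number of tasks m):
d, k, r, m = 512, 512, 8, 40           # layer shape, rank, number of tasks

pamal   = m * d * k                    # m full base networks (per layer)
lorpman = d * k + m * (d * r + r * k)  # one main weight + m low-rank factor pairs

print(pamal, lorpman, lorpman / pamal)  # 10485760 589824 ~0.056
```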
4.4. Comparison with Discrete PF Baselines

In this experiment, we compare with five algorithms that produce discrete PFs: EPO (Mahapatra & Rajan, 2020), PMTL (Lin et al., 2019), MOO-SVGD (Liu et al., 2021b), PNG (Ye & Liu, 2022), and GMOOAR (Chen & Kwok, 2022). Since most of these algorithms report experiments on MultiMNIST, we use MultiMNIST for comparison, with the same settings as in Section 4.2. The obtained solutions are shown in Figure 7. As can be seen, the proposed algorithm performs better than the existing discrete PF algorithms, even though they only generate a predetermined number of solutions.

Figure 7. Test performance of LORPMAN and various discrete PF algorithms on MultiMNIST.

4.5. Comparison with Single-Solution Baselines

In this experiment, we compare with multi-task learning baselines that generate a single solution: Linear Scalarization (LS), UW (Kendall et al., 2018), MGDA (Sener & Koltun, 2018), DWA (Liu et al., 2019), PCGrad (Yu et al., 2020), IMTL (Liu et al., 2020), Graddrop (Chen et al., 2020), CAGrad (Liu et al., 2021a), RLW (Lin et al., 2022a), Nash-MTL (Navon et al., 2022), RotoGrad (Javaloy & Valera, 2021), and Auto-λ (Liu et al., 2022). For a more complete comparison, we also include Single-Task Learning (STL), where each task is trained independently, and the COSMOS and PaMaL baselines, which provide continuous approximations of the PF.

Following (Dimitriadis et al., 2023), we use the widely-used Cityscapes dataset (Cordts et al., 2016). It is a scene-understanding dataset with two tasks: semantic segmentation and depth regression. We adopt SegNet (Badrinarayanan et al., 2017) as the base network and use the same parameter configuration as in (Dimitriadis et al., 2023). The number of training epochs is 100. We tune freeze epoch $\in \{80, 90\}$, $s \in \{0.5, 1\}$, and $r \in \{32, 64\}$. More experimental details are in Appendix D.8.

LORPMAN, COSMOS, and PaMaL provide the whole PF; we select one solution from each of their obtained PFs for comparison with the single-solution algorithms. Table 5 shows the performance on semantic segmentation (mIoU and pixel accuracy) and depth prediction (absolute and relative error) on the test set. LORPMAN not only surpasses COSMOS and PaMaL but also outperforms the single-solution algorithms. While MGDA achieves a slightly better relative error in depth estimation, it performs much worse on the semantic segmentation task. It is also worth noting that our objective is to learn the entire PF rather than a single solution.

Table 5. Test performance on Cityscapes (averaged over three random seeds). Results other than LORPMAN are taken from (Dimitriadis et al., 2023).

| Method   | mIoU ↑ | PixAcc ↑ | AbsErr ↓ | RelErr ↓ |
|----------|--------|----------|----------|----------|
| STL      | 70.96  | 92.12    | 0.0141   | 38.644   |
| LS       | 70.12  | 91.90    | 0.0192   | 124.061  |
| UW       | 70.20  | 91.93    | 0.0189   | 125.943  |
| MGDA     | 66.45  | 90.79    | 0.0141   | 53.138   |
| DWA      | 70.10  | 91.89    | 0.0192   | 127.659  |
| PCGrad   | 70.02  | 91.84    | 0.0188   | 126.255  |
| IMTL     | 70.77  | 92.12    | 0.0151   | 74.230   |
| Graddrop | 70.07  | 91.93    | 0.0189   | 127.146  |
| CAGrad   | 69.23  | 91.61    | 0.0168   | 110.139  |
| RLW      | 68.79  | 91.52    | 0.0213   | 126.942  |
| Nash-MTL | 71.13  | 92.23    | 0.0157   | 78.499   |
| RotoGrad | 69.92  | 91.85    | 0.0193   | 127.281  |
| Auto-λ   | 70.47  | 92.01    | 0.0177   | 116.959  |
| COSMOS   | 69.78  | 91.79    | 0.0539   | 136.614  |
| PaMaL    | 70.35  | 91.99    | 0.0141   | 54.520   |
| LORPMAN  | 72.13  | 92.57    | 0.0135   | 54.942   |
4.6. Ablation Studies

In this section, we investigate the effects of (i) orthogonal regularization, (ii) rank $r$, (iii) freeze epoch, and (iv) scaling factor $s$ on the performance of LORPMAN. Experiments are performed on the UTKFace dataset.

Orthogonal Regularization. Table 6 shows the impact of orthogonal regularization on HV and on the average correlation³ over all pairs of low-rank matrices. With orthogonal regularization, the correlation between low-rank matrices is significantly reduced. Similar observations have been made for traditional orthogonal regularization of parameters within a single neural network (Xie et al., 2017; Bansal et al., 2018). This reduction in correlation encourages learning common features in the main network and differences in the low-rank matrices, thus leading to a better HV value.

³The correlation between a pair of low-rank matrices $B_i A_i$ and $B_j A_j$ is computed as $\mathrm{flatten}(B_i A_i)^\top \mathrm{flatten}(B_j A_j) / (\|B_i A_i\| \|B_j A_j\|)$.

Table 6. Effects of orthogonal regularization on UTKFace (averaged over six random seeds).

| Setting             | HV            | Correlation   |
|---------------------|---------------|---------------|
| w/o orthogonal reg. | 0.309 (0.002) | 0.466 (0.045) |
| w/ orthogonal reg.  | 0.314 (0.001) | 0.067 (0.012) |

Rank r. Table 7 shows the effects of the rank $r$ on the HV and the number of parameters. LORPMAN with rank 8 already outperforms COSMOS and PaMaL (see Table 2). Increasing the rank to 64 yields even better performance, but increasing it further does not produce significant change.

Table 7. Effects of rank $r$ on UTKFace (averaged over six random seeds).

| r   | HV            | # Parameters |
|-----|---------------|--------------|
| 4   | 0.306 (0.002) | 12.1M        |
| 8   | 0.310 (0.001) | 12.8M        |
| 16  | 0.310 (0.002) | 14.5M        |
| 32  | 0.311 (0.001) | 17.6M        |
| 64  | 0.314 (0.001) | 24.0M        |
| 128 | 0.313 (0.002) | 35.5M        |

Freeze Epoch. Table 8 shows the effect of the freeze epoch on HV. Compared with not freezing the main module (i.e., freeze epoch = 100), freezing during the latter half of training encourages the low-rank matrices to learn task-specific representations instead of always relying on the main model, thus leading to better performance.

Table 8. Effects of freeze epoch on UTKFace (averaged over six random seeds).

| Freeze epoch | HV            |
|--------------|---------------|
| 20           | 0.305 (0.001) |
| 40           | 0.311 (0.001) |
| 60           | 0.312 (0.002) |
| 80           | 0.314 (0.001) |
| 100          | 0.307 (0.003) |

Scaling Factor s. Table 9 shows the effect of the scaling factor $s$ on HV. Setting $s$ within a reasonable range (such as $[0.1, 1]$) leads to stable performance.

Table 9. Effects of scaling factor $s$ on UTKFace (averaged over six random seeds).

| s   | HV            |
|-----|---------------|
| 0.1 | 0.313 (0.001) |
| 0.5 | 0.313 (0.001) |
| 1   | 0.314 (0.001) |
| 1.5 | 0.312 (0.002) |
| 2   | 0.310 (0.002) |

5. Conclusion

In this paper, we introduce a novel method for continuous approximation of the PF, using a main network together with a collection of low-rank matrices. Our approach leverages the inherent structure of the Pareto manifold to effectively learn trade-off solutions between tasks. Extensive empirical evaluation on various datasets demonstrates the superior performance of the proposed algorithm, especially when the number of tasks is large. One limitation of the proposed work is that we use the same rank for all layers. Using different ranks for different layers to achieve better parameter efficiency, and exploring automatic rank-setting strategies, are interesting future directions.

Acknowledgements

This research was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region (Grant 16202523).

Impact Statement

The proposed algorithm can effectively discover trade-off solutions among multiple tasks, leading to improved performance and cost efficiency in real-world applications involving multiple objectives. However, one should be aware of potential biases and privacy concerns when processing data from different tasks.
References

Audibert, A., Amini, M. R., Usevich, K., and Clausel, M. Low-rank updates of pre-trained weights for multi-task learning. In Findings of the Association for Computational Linguistics, pp. 7544–7554, 2023.

Badrinarayanan, V., Kendall, A., and Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.

Bansal, N., Chen, X., and Wang, Z. Can we gain more from orthogonality regularizations in training deep networks? In Neural Information Processing Systems, 2018.

Bowman Jr, V. J. On the relationship of the Tchebycheff norm and the efficient frontier of multiple-criteria objectives. In Multiple Criteria Decision Making: Proceedings of a Conference Jouy-en-Josas, pp. 76–86. Springer, 1975.

Boyd, S. P. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Caruana, R. Multitask learning. Machine Learning, 28:41–75, 1997.

Chen, W. and Kwok, J. Multi-objective deep learning with adaptive reference vectors. In Neural Information Processing Systems, pp. 32723–32735, 2022.

Chen, Z., Ngiam, J., Huang, Y., Luong, T., Kretzschmar, H., Chai, Y., and Anguelov, D. Just pick a sign: Optimizing deep multitask models with gradient sign dropout. In Neural Information Processing Systems, pp. 2039–2050, 2020.

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223, 2016.

Désidéri, J.-A. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique, 350(5-6):313–318, 2012.

Dimitriadis, N., Frossard, P., and Fleuret, F. Pareto manifold learning: Tackling multiple tasks via ensembles of single-task models. In International Conference on Machine Learning, pp. 8015–8052, 2023.

Dosovitskiy, A. and Djolonga, J. You only train once: Loss-conditional training of deep networks. In International Conference on Learning Representations, 2019.

Ha, D., Dai, A., and Le, Q. V. Hypernetworks. Preprint arXiv:1609.09106, 2016.

Haykin, S. Neural Networks: A Comprehensive Foundation. Prentice Hall, 1998.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hoang, L. P., Le, D. D., Tuan, T. A., and Thang, T. N. Improving Pareto front learning via multi-sample hypernetworks. In AAAI Conference on Artificial Intelligence, pp. 7875–7883, 2023.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. Preprint arXiv:2106.09685, 2021.

Huber, P. J. Robust estimation of a location parameter. In Breakthroughs in Statistics: Methodology and Distribution, pp. 492–518. Springer, 1992.

Ilharco, G., Ribeiro, M. T., Wortsman, M., Schmidt, L., Hajishirzi, H., and Farhadi, A. Editing models with task arithmetic. In International Conference on Learning Representations, 2022.

Javaloy, A. and Valera, I. RotoGrad: Gradient homogenization in multitask learning. Preprint arXiv:2103.02631, 2021.

Kendall, A., Gal, Y., and Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7482–7491, 2018.

Knowles, J. D., Corne, D. W., and Fleischer, M. Bounded archiving using the Lebesgue measure. In Congress on Evolutionary Computation, pp. 2490–2497, 2003.

Kohavi, R. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In International Conference on Knowledge Discovery and Data Mining, pp. 202–207, 1996.
Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Kurin, V., De Palma, A., Kostrikov, I., Whiteson, S., and Mudigonda, P. K. In defense of the unitary scalarization for deep multi-task learning. In Neural Information Processing Systems, pp. 12169–12183, 2022.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lin, B., Feiyang, Y., Zhang, Y., and Tsang, I. Reasonable effectiveness of random weighting: A litmus test for multi-task learning. Transactions on Machine Learning Research, 2022a.

Lin, X., Zhen, H.-L., Li, Z., Zhang, Q.-F., and Kwong, S. Pareto multi-task learning. In Neural Information Processing Systems, 2019.

Lin, X., Yang, Z., Zhang, Q., and Kwong, S. Controllable Pareto multi-task learning. Preprint arXiv:2010.06313, 2020.

Lin, X., Yang, Z., Zhang, X., and Zhang, Q. Pareto set learning for expensive multi-objective optimization. pp. 19231–19247, 2022b.

Lin, X., Zhang, X., Yang, Z., and Zhang, Q. Evolutionary Pareto set learning with structure constraints. Preprint arXiv:2310.20426, 2023.

Liu, B., Liu, X., Jin, X., Stone, P., and Liu, Q. Conflict-averse gradient descent for multi-task learning. In Neural Information Processing Systems, pp. 18878–18890, 2021a.

Liu, L., Li, Y., Kuang, Z., Xue, J.-H., Chen, Y., Yang, W., Liao, Q., and Zhang, W. Towards impartial multi-task learning. In International Conference on Learning Representations, 2020.

Liu, S., Johns, E., and Davison, A. J. End-to-end multi-task learning with attention. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1871–1880, 2019.

Liu, S., James, S., Davison, A., and Johns, E. Auto-Lambda: Disentangling dynamic task relationships. Transactions on Machine Learning Research, 2022.

Liu, X., Tong, X., and Liu, Q. Profiling Pareto front with multi-objective Stein variational gradient descent. In Neural Information Processing Systems, pp. 14721–14733, 2021b.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In International Conference on Computer Vision, 2015.

Ma, P., Du, T., and Matusik, W. Efficient continuous Pareto exploration in multi-task learning. In International Conference on Machine Learning, pp. 6522–6531, 2020.

Mahapatra, D. and Rajan, V. Multi-task learning with user preferences: Gradient descent with controlled ascent in Pareto optimization. In International Conference on Machine Learning, pp. 6597–6607, 2020.

Miettinen, K. Nonlinear Multiobjective Optimization, volume 12. Springer Science & Business Media, 1999.

Navon, A., Shamsian, A., Chechik, G., and Fetaya, E. Learning the Pareto front with hypernetworks. Preprint arXiv:2010.04104, 2020a.

Navon, A., Shamsian, A., Fetaya, E., and Chechik, G. Learning the Pareto front with hypernetworks. In International Conference on Learning Representations, 2020b.

Navon, A., Shamsian, A., Achituve, I., Maron, H., Kawaguchi, K., Chechik, G., and Fetaya, E. Multi-task learning as a bargaining game. In International Conference on Machine Learning, pp. 16428–16446, 2022.

Ortiz-Jimenez, G., Favero, A., and Frossard, P. Task arithmetic in the tangent space: Improved editing of pre-trained models. Preprint arXiv:2305.12827, 2023.
Perez, E., Strub, F., De Vries, H., Dumoulin, V., and Courville, A. FiLM: Visual reasoning with a general conditioning layer. In AAAI Conference on Artificial Intelligence, 2018.

Rosenbaum, C., Klinger, T., and Riemer, M. Routing networks: Adaptive selection of non-linear functions for multi-task learning. Preprint arXiv:1711.01239, 2017.

Ruchte, M. and Grabocka, J. Scalable Pareto front approximation for deep multi-objective learning. In IEEE International Conference on Data Mining, pp. 1306–1311, 2021.

Sabour, S., Frosst, N., and Hinton, G. E. Dynamic routing between capsules. In Neural Information Processing Systems, 2017.

Sener, O. and Koltun, V. Multi-task learning as multi-objective optimization. In Neural Information Processing Systems, 2018.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint arXiv:1409.1556, 2014.

Vandenhende, S., Georgoulis, S., Van Gansbeke, W., Proesmans, M., Dai, D., and Van Gool, L. Multi-task learning for dense prediction tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3614–3633, 2021.

Wang, H., Deutz, A., Bäck, T., and Emmerich, M. Hypervolume indicator gradient ascent multi-objective optimization. In International Conference on Evolutionary Multi-Criterion Optimization, pp. 654–669, 2017.

Wang, J., Chen, Y., Chakraborty, R., and Yu, S. X. Orthogonal convolutional neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11505–11515, 2020.

Wang, X., Chen, T., Ge, Q., Xia, H., Bao, R., Zheng, R., Zhang, Q., Gui, T., and Huang, X.-J. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics, pp. 10658–10671, 2023a.

Wang, Y., Lin, Y., Zeng, X., and Zhang, G. MultiLoRA: Democratizing LoRA for better multi-task learning. Preprint arXiv:2311.11501, 2023b.

Xie, D., Xiong, J., and Pu, S. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6176–6185, 2017.

Yadav, P., Tam, D., Choshen, L., Raffel, C., and Bansal, M. TIES-merging: Resolving interference when merging models. In Neural Information Processing Systems, 2023.

Ye, M. and Liu, Q. Pareto navigation gradient descent: A first-order algorithm for optimization in Pareto set. In Uncertainty in Artificial Intelligence, pp. 2246–2255, 2022.

Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., and Finn, C. Gradient surgery for multi-task learning. In Neural Information Processing Systems, pp. 5824–5836, 2020.

Zhang, Z., Song, Y., and Qi, H. Age progression/regression by conditional adversarial autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5810–5818, 2017.

Zitzler, E. and Thiele, L. Multiobjective optimization using evolutionary algorithms: a comparative case study. In International Conference on Parallel Problem Solving from Nature, pp. 292–301, 1998.

A. Proof of Theorem 3.1

Denote the optimal mapping from network input $x \in \mathcal{X} \subseteq \mathbb{R}^u$ and preference vector $\alpha \in \Delta_m$ to the corresponding point on the PF as $t(x, \alpha) : \mathcal{X} \times \Delta_m \to \mathbb{R}^m$.

Theorem A.1. Assume that $\mathcal{X} \times \Delta_m$ is compact and $t(x, \alpha)$ is continuous. For any $\epsilon > 0$, there exists a ReLU MLP $h$ with main network $\theta_0$ and $m$ low-rank matrices $B_1 A_1, \ldots, B_m A_m$, such that for all $x \in \mathcal{X}$ and $\alpha \in \Delta_m$,

$$\Big\| t(x, \alpha) - h\Big(x;\ \theta_0 + \sum_{i=1}^m \alpha_i B_i A_i\Big) \Big\| \le \epsilon.$$

Proof. Denote $\sigma(x) = \max(0, x)$. By the universal approximation theorem (Haykin, 1998), for any $\epsilon > 0$ there exist $v \in \mathbb{N}$, $M \in \mathbb{R}^{v \times (u+m)}$, $N \in \mathbb{R}^v$, and $C \in \mathbb{R}^{m \times v}$ such that

$$\sup_{x \in \mathcal{X},\ \alpha \in \Delta_m} \| t(x, \alpha) - g(x, \alpha) \| \le \epsilon, \quad \text{where } g(x, \alpha) = C\sigma(M[x, \alpha] + N).$$

We define two matrices $R \in \mathbb{R}^{(2u+m) \times u}$ and $S \in \mathbb{R}^{(u+m) \times (2u+m)}$ (with dimensions made self-consistent here) as

$$R_{ij} = \begin{cases} 1, & \text{if } i = 2j - 1 \\ -1, & \text{if } i = 2j \\ 0, & \text{otherwise} \end{cases}, \qquad S_{ij} = \begin{cases} 1, & \text{if } i \le u \text{ and } j = 2i - 1 \\ -1, & \text{if } i \le u \text{ and } j = 2i \\ 1, & \text{if } i > u \text{ and } j = u + i \\ 0, & \text{otherwise} \end{cases}.$$

Define $m$ vectors $U_1, \ldots, U_m \in \{0, 1\}^{2u+m}$ as

$$(U_i)_j = \begin{cases} 1, & \text{if } j = 2u + i \\ 0, & \text{otherwise} \end{cases}.$$

Then, since $\sigma(a) - \sigma(-a) = a$ for any $a \in \mathbb{R}$ and $\sigma(\alpha_i) = \alpha_i$ for $\alpha_i \ge 0$, we have

$$S\sigma\Big(Rx + \sum_{i=1}^m \alpha_i U_i\Big) = [x, \alpha].$$

We can thus construct an MLP $h(x; M, N, C, R, S, U) = C\sigma(MS\sigma(Rx + U) + N)$. Let $\theta_0 = (M, N, C, R, S, 0)$ and $\theta_i = (0, 0, 0, 0, 0, U_i)$ for $i = 1, \ldots, m$. We have

$$h\Big(x;\ \theta_0 + \sum_{i=1}^m \alpha_i \theta_i\Big) = C\sigma\Big(MS\sigma\Big(Rx + \sum_{i=1}^m \alpha_i U_i\Big) + N\Big) = g(x, \alpha).$$

Since each $\theta_i$ consists of a single non-zero entry (equal to 1) with all other elements being zero, it can be reshaped into a matrix $B_i A_i$ of rank 1. ∎
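As a sanity check of the construction above, the following NumPy snippet (our own, 0-indexed counterpart of the 1-indexed definitions) verifies numerically that $S\sigma(Rx + \sum_i \alpha_i U_i) = [x, \alpha]$:

```python
import numpy as np

u, m = 4, 3
relu = lambda z: np.maximum(z, 0.0)

R = np.zeros((2 * u + m, u))
for j in range(u):
    R[2 * j, j], R[2 * j + 1, j] = 1.0, -1.0     # encodes the (x_j, -x_j) pairs

S = np.zeros((u + m, 2 * u + m))
for i in range(u):
    S[i, 2 * i], S[i, 2 * i + 1] = 1.0, -1.0     # x_j = relu(x_j) - relu(-x_j)
for i in range(m):
    S[u + i, 2 * u + i] = 1.0                    # passes alpha_i through

U = np.zeros((m, 2 * u + m))
for i in range(m):
    U[i, 2 * u + i] = 1.0

x = np.random.randn(u)
alpha = np.random.dirichlet(np.ones(m))           # alpha_i >= 0, so relu is identity
lifted = S @ relu(R @ x + U.T @ alpha)
assert np.allclose(lifted, np.concatenate([x, alpha]))
```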
B. Proof of Proposition 3.2

Proposition B.1 ((Boyd & Vandenberghe, 2004)). Given any $\alpha \in \{\alpha \mid \alpha_i > 0\}$, the optimal solution of the scalarized problem $\min_\theta \sum_{i=1}^m \alpha_i f_i(\theta)$ is a Pareto-optimal solution of the original multi-objective optimization problem (1).

Proof. Suppose $\theta$ is the optimal solution of $\min_\theta \sum_{i=1}^m \alpha_i f_i(\theta)$ but is not Pareto-optimal. By the definition of Pareto optimality, there exists $\theta'$ such that $f_i(\theta) \ge f_i(\theta')$ for all $i \in [m]$, and $f_i(\theta) > f_i(\theta')$ for some $i \in [m]$. Since $\alpha_i > 0$, we have

$$\sum_{i=1}^m \alpha_i (f_i(\theta) - f_i(\theta')) > 0,$$

which can be rewritten as

$$\sum_{i=1}^m \alpha_i f_i(\theta) > \sum_{i=1}^m \alpha_i f_i(\theta'),$$

contradicting the assumption that $\theta$ is the optimal solution of $\min_\theta \sum_{i=1}^m \alpha_i f_i(\theta)$. ∎

C. Multi-Forward Regularization

Dimitriadis et al. (2023) propose the multi-forward regularization to penalize wrong orderings of solutions. Denote the set of currently sampled preference vectors by $V = \{\alpha^1, \ldots, \alpha^b\}$. A directed graph $G_i = (V, E_i)$ is then constructed for each task $i$, where $E_i = \{(\alpha, \alpha') \in V \times V : \alpha_i < \alpha'_i\}$. The multi-forward regularization loss is defined as

$$R_p = \sum_{i=1}^m \sum_{(\alpha, \alpha') \in E_i} e^{[f_i(\theta(\alpha')) - f_i(\theta(\alpha))]_+},$$

so that each term exceeds 1 only when the solution placing the larger weight on task $i$ attains the larger task-$i$ loss, i.e., when the expected ordering on the PF is violated.

D. Experimental Details

D.1. Hypervolume

Hypervolume (HV) (Zitzler & Thiele, 1998; Knowles et al., 2003) is a popular indicator in multi-objective optimization (MOO), offering a measure of the quality of an obtained solution set. Formally, for a given solution set $P$ and a reference point $r$, the HV is defined as

$$HV(P; r) = \Lambda(\{q \in \mathbb{R}^m \mid \exists p \in P : p \succeq q \succeq r\}), \tag{10}$$

where $\Lambda$ denotes the Lebesgue measure. Figure 8 illustrates the HV indicator in a two-objective maximization problem: the shaded area enclosed by the solution set and the reference point represents the HV. For all datasets except UTKFace (Zhang et al., 2017), the reference point for HV evaluation is set to $[0, 0, \ldots, 0]$, since 0 is the smallest possible accuracy. For UTKFace, the first objective is the Huber loss (Huber, 1992), not an accuracy. Since 0.5 is close to the largest Huber loss observed in the experiments, we set the reference point to $[0.5, 0, 0]$.

Figure 8. Illustration of the hypervolume indicator.
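For the two-objective case, the HV of Eq. (10) reduces to the area of a staircase and can be computed directly; the sketch below assumes maximization and is our own illustration (for three or more objectives, dedicated MOO libraries are the practical choice):

```python
import numpy as np

def hypervolume_2d(points: np.ndarray, ref: np.ndarray) -> float:
    """Two-objective hypervolume for maximization via a simple sweep.
    Only points strictly dominating the reference point contribute."""
    pts = points[np.all(points > ref, axis=1)]
    pts = pts[np.argsort(-pts[:, 0])]        # sort by first objective, descending
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                       # non-dominated step of the staircase
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

P = np.array([[0.9, 0.3], [0.7, 0.6], [0.4, 0.8]])
print(hypervolume_2d(P, np.array([0.0, 0.0])))  # 0.56: staircase area w.r.t. [0, 0]
```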
D.2. Toy Problem

We use the two-objective toy problem adopted by (Liu et al., 2021a; Navon et al., 2022; Dimitriadis et al., 2023), with parameters $\theta = [\theta_1, \theta_2] \in \mathbb{R}^2$. The problem is to minimize

$$f_1(\theta) = c_1(\theta) h_1(\theta) + c_2(\theta) g_1(\theta) \quad \text{and} \quad f_2(\theta) = c_1(\theta) h_2(\theta) + c_2(\theta) g_2(\theta),$$

where

$$h_1(\theta) = \log\big(\max(|0.5(-\theta_1 - 7) - \tanh(-\theta_2)|, 0.000005)\big) + 6,$$
$$h_2(\theta) = \log\big(\max(|0.5(-\theta_1 + 3) - \tanh(-\theta_2) + 2|, 0.000005)\big) + 6,$$
$$g_1(\theta) = \big((-\theta_1 + 7)^2 + 0.1(-\theta_2 - 8)^2\big)/10 - 20,$$
$$g_2(\theta) = \big((-\theta_1 - 7)^2 + 0.1(-\theta_2 - 8)^2\big)/10 - 20,$$
$$c_1(\theta) = \max(\tanh(0.5\theta_2), 0), \quad c_2(\theta) = \max(\tanh(-0.5\theta_2), 0).$$

D.3. MultiMNIST

Network Structure. Following (Dimitriadis et al., 2023), the base network consists of a shared bottom and two task-specific heads. The shared bottom is LeNet (LeCun et al., 1998) with two convolution layers and a fully-connected layer. Each task-specific head consists of two fully-connected layers. For PHN and PHN-HVI, we use the same hypernetwork structure as (Navon et al., 2020b): the input preference vector $\alpha$ first goes through a three-layer MLP to obtain a ray embedding; then, for each layer of the base network, a linear layer generates the base network parameters from the ray embedding.

Parameter Settings. Following (Dimitriadis et al., 2023), for LORPMAN and PaMaL, we set the multi-forward regularization coefficient $\lambda_p$ to 0 and the window size $b$ to 4. The distribution parameter $p$ for sampling $\alpha$ is set to 1 for all algorithms. For PHN and PHN-HVI, we use the same parameter settings as in the original papers. For COSMOS, the cosine similarity regularization coefficient $\lambda_c$ is set to 1. We use the Adam optimizer. The learning rate is set to 1e-3 and the batch size to 256.

D.4. Census

Network Structure. Following (Dimitriadis et al., 2023), the base network consists of a shared bottom and two task-specific heads. The shared bottom is a one-layer MLP, and each task-specific head is a fully-connected layer. The hypernetwork is constructed in the same way as for MultiMNIST.

Parameter Settings. Following (Dimitriadis et al., 2023), for LORPMAN and PaMaL, we set the multi-forward regularization coefficient $\lambda_p$ to 5 and the window size $b$ to 2. The distribution parameter $p$ for sampling $\alpha$ is set to 1 for all algorithms. For PHN-HVI, we use the same parameter settings as in the original paper. For COSMOS, the cosine similarity regularization coefficient $\lambda_c$ is set to 1. We use the Adam optimizer. The learning rate is set to 1e-3 and the batch size to 256.

D.5. UTKFace

Network Structure. Following (Dimitriadis et al., 2023), the base network consists of a shared bottom and three task-specific heads. The shared bottom is a ResNet-18 (He et al., 2016), and each task-specific head is a fully-connected layer.

Parameter Settings. Following (Dimitriadis et al., 2023), for LORPMAN and PaMaL, we set the multi-forward regularization coefficient $\lambda_p$ to 1 and the window size $b$ to 3. The distribution parameter $p$ for sampling $\alpha$ is set to 2 for all algorithms. For COSMOS, the cosine similarity regularization coefficient $\lambda_c$ is set to 1. We use the Adam optimizer. The learning rate is set to 1e-3 and the batch size to 256.

D.6. CIFAR-100

CIFAR-100 has 20 5-way classification tasks:

1. Aquatic Mammals: beaver, dolphin, otter, seal, whale
2. Fish: aquarium fish, flatfish, ray, shark, trout
3. Flowers: orchids, poppies, roses, sunflowers, tulips
4. Food Containers: bottles, bowls, cans, cups, plates
5. Fruits and Vegetables: apples, mushrooms, oranges, pears, sweet peppers
6. Household Electrical Devices: clock, computer keyboard, lamp, telephone, television
7. Household Furniture: bed, chair, couch, table, wardrobe
8. Insects: bee, beetle, butterfly, caterpillar, cockroach
9. Large Carnivores: bear, leopard, lion, tiger, wolf
10. Large Man-made Outdoor Things: bridge, castle, house, road, skyscraper
11. Large Natural Outdoor Scenes: cloud, forest, mountain, plain, sea
12. Large Omnivores and Herbivores: camel, cattle, chimpanzee, elephant, kangaroo
13. Medium-sized Mammals: fox, porcupine, possum, raccoon, skunk
14. Non-insect Invertebrates: crab, lobster, snail, spider, worm
15. People: baby, boy, girl, man, woman
16. Reptiles: crocodile, dinosaur, lizard, snake, turtle
17. Small Mammals: hamster, mouse, rabbit, shrew, squirrel
18. Trees: maple, oak, palm, pine, willow
19. Vehicles 1: bicycle, bus, motorcycle, pickup truck, train
20. Vehicles 2: lawn-mower, rocket, streetcar, tank, tractor

Network Structure. Following (Liu et al., 2022), the base network consists of a shared bottom and 20 task-specific heads. The shared bottom is a VGG-16 (Simonyan & Zisserman, 2014), and each task-specific head is a fully-connected layer.

Parameter Settings. For LORPMAN and PaMaL, we set the multi-forward regularization coefficient $\lambda_p$ to 1 and the window size $b$ to 2. The distribution parameter $p$ for sampling $\alpha$ is set to 2 for all algorithms. For COSMOS, the cosine similarity regularization coefficient $\lambda_c$ is set to 1. Following (Liu et al., 2022), we use the SGD optimizer. The learning rate is set to 1e-2 and the batch size to 64.

D.7. CelebA

CelebA (Liu et al., 2015) has 40 binary classification tasks: 5 o'Clock Shadow, Arched Eyebrows, Attractive, Bags Under Eyes, Bald, Bangs, Big Lips, Big Nose, Black Hair, Blond Hair, Blurry, Brown Hair, Bushy Eyebrows, Chubby, Double Chin, Eyeglasses, Goatee, Gray Hair, Heavy Makeup, High Cheekbones, Male, Mouth Slightly Open, Mustache, Narrow Eyes, No Beard, Oval Face, Pale Skin, Pointy Nose, Receding Hairline, Rosy Cheeks, Sideburns, Smiling, Straight Hair, Wavy Hair, Wearing Earrings, Wearing Hat, Wearing Lipstick, Wearing Necklace, Wearing Necktie, Young.

Network Structure. Following (Sener & Koltun, 2018), the base network consists of a shared bottom and 40 task-specific heads. The shared bottom is a ResNet-18 (He et al., 2016), and each task-specific head is a fully-connected layer. Since CelebA is prone to overfitting (Kurin et al., 2022), we adopt dropout regularization as in (Kurin et al., 2022) for all algorithms (i.e., we add a dropout layer in each block, as well as after the first convolution layer and after the average pooling layer).

Parameter Settings. For LORPMAN and PaMaL, we set the multi-forward regularization coefficient $\lambda_p$ to 1 and the window size $b$ to 3. The distribution parameter $p$ for sampling $\alpha$ is set to 2 for all algorithms. For COSMOS, the cosine similarity regularization coefficient $\lambda_c$ is set to 1. We use the Adam optimizer. The learning rate is set to 1e-3 and the batch size to 128.

D.8. Cityscapes

Network Structure. Following (Liu et al., 2021a; Navon et al., 2022; Dimitriadis et al., 2023), the base network consists of a shared bottom and two task-specific heads. The shared bottom is a SegNet (Badrinarayanan et al., 2017), and each task-specific head consists of two convolution layers.

Parameter Settings.
Following (Dimitriadis et al., 2023), for LORPMAN and PaMaL, we set the multi-forward regularization coefficient $\lambda_p$ to 1 and the window size $b$ to 3. The distribution parameter $p$ for sampling $\alpha$ is set to 7 for all algorithms. We use the Adam optimizer. The initial learning rate is set to 1e-4 and is halved after 75 epochs. The batch size is set to 8.

E. Pareto Front Obtained on CIFAR-100

Figure 9 shows the models obtained by COSMOS, PaMaL, and LORPMAN. We can observe that the solutions obtained by PaMaL have much lower accuracies, while the solutions obtained by COSMOS concentrate in a small region. The proposed LORPMAN achieves both good accuracy and diversity.

Figure 9. Test performance of COSMOS, PaMaL, and LORPMAN on CIFAR-100. Panels show pairs of superclass accuracies: aquatic mammals vs. fish; flowers vs. food containers; fruit and vegetables vs. household electrical devices; household furniture vs. insects; large carnivores vs. large man-made outdoor things; large natural outdoor scenes vs. large omnivores and herbivores; medium-sized mammals vs. non-insect invertebrates; people vs. reptiles; small mammals vs. trees; vehicles 1 vs. vehicles 2.