# Aligned Multi-Objective Optimization

Yonathan Efroni 1, Ben Kretzu 2, Daniel R. Jiang 1, Jalaj Bhandari 1, Zheqing Zhu 1, Karen Ullrich 1

*Equal contribution. 1 Meta AI, 2 Technion. Correspondence to: Yonathan Efroni. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

To date, the multi-objective optimization literature has mainly focused on conflicting objectives, studying the Pareto front, or requiring users to balance trade-offs. Yet, in machine learning practice, there are many scenarios where such conflict does not take place. Recent findings from multi-task learning, reinforcement learning, and LLM training show that diverse related tasks can enhance performance across objectives simultaneously. Despite this evidence, the phenomenon has not been examined from an optimization perspective, which has led to a lack of generic gradient-based methods that can scale to scenarios with a large number of related objectives. To address this gap, we introduce the Aligned Multi-Objective Optimization framework, propose new algorithms for this setting, and provide theoretical guarantees of their superior performance compared to naive approaches.

1. Introduction

In many real-world optimization problems, we have access to multi-dimensional feedback rather than a single scalar objective. The multi-objective optimization (MOO) literature has largely focused on the setting where these objectives conflict with each other, which necessitates the Pareto dominance notion of optimality. A closely related area of study is multi-task learning (Caruana, 1997; Teh et al., 2017; Sener and Koltun, 2018; Yu et al., 2020; Lin et al., 2021; Liu et al., 2021; Navon et al., 2022; Zhou et al., 2022; Lin and Zhang, 2023; Liu et al., 2023a; Chen et al., 2023; Achituve et al., 2024; He et al., 2024; Liu and Vicente, 2024), where multiple tasks are learned jointly, typically with both shared and task-specific parameters. The hope is that the model can perform better on individual tasks by sharing common information across tasks. Indeed, the phenomenon of improved performance across all tasks has been observed in several settings (Lin and Zhang, 2023; Lee et al., 2024), suggesting that there may not always be significant trade-offs between objectives. Similar observations appear in meta-learning (Ravi and Larochelle, 2017; Finn et al., 2017; Hospedales et al., 2021), where the goal is to learn representations that enable quick adaptation to new tasks with minimal additional training, as well as in reinforcement learning (Jaderberg et al., 2016; Teh et al., 2017; Veeriah et al., 2019; Dann et al., 2023), where practitioners use multiple reward functions to better specify the policy or its representation.

In this work, we explicitly study a setting where objectives are aligned, namely, where the different objectives share a common solution. This situation arises frequently in practice. For example, when using reinforcement learning (RL) to augment large language models (LLMs) with reasoning capabilities, there are often multiple options for the choice of reward model. Lightman et al. (2023) and Uesato et al. (2022) consider both outcome- and process-based rewards, and, recently, Guo et al. (2025) and Team et al. (2025) discuss the use of accuracy, format rewards, length of response, and reward on math problems as additional reward functions.
In training text-to-image models using RL, Lee et al. (2024) use four reward models (aesthetic quality, human preference, text-image alignment, and image sentiment) and show results where all rewards are increased. Although the method of Lee et al. (2024) is designed for finding Pareto-optimal solutions (implying the existence of trade-offs), the numerical results suggest that the objectives may actually be aligned to a good degree. These observations are also related to a more general phenomenon in RL discussed by Dann et al. (2023), where learning can be accelerated by exploiting several alternative reward specifications that all lead to the same optimal policy. This concept builds on prior work showing that the choice of reward function (e.g., dense versus sparse reward) can have a dramatic effect on training time (Ng et al., 1999; Luo et al., 2020; Wang et al., 2019; Hu et al., 2020). A related idea in statistics is that when labeled data is sparse, practitioners can rely on closely related proxy tasks to improve prediction accuracy (Bastani, 2021).

To our knowledge, there is no work that studies such a framework from an optimization perspective. We ask the following question: Can gradient-descent-type algorithms benefit from multi-objective feedback when the objectives are aligned?

Previous work in multi-task learning has provided convergence guarantees for gradient descent-type algorithms for MOO (Sener and Koltun, 2018; Yu et al., 2020; Liu et al., 2021; Navon et al., 2022; He et al., 2024). However, since these works consider the general multi-objective framework, their algorithms come with worst-case guarantees and no meaningful convergence improvement from the multi-objective feedback.

We provide a positive answer to the aforementioned question. We formally introduce the aligned multi-objective optimization (AMOO) framework. Subsequently, we design new gradient descent-type algorithms and establish their provably improved convergence in the AMOO setting. These can be interpreted as parameter-free algorithms for handling multi-objective feedback when the objectives are aligned. Lastly, we conclude by providing empirical evidence of the improved convergence properties of the new algorithms.

| Algorithm | Asymptotic Convergence |
|---|---|
| CAMOO | $O\big((1 - \mu_G/\beta)^k\big)$ |
| PAMOO | $O\big((1 - \mu_L/\beta)^k\big)$ |

Table 1: The main results introduced in this work. $\beta$ is the smoothness parameter; both $\mu_G$ and $\mu_L$ are structural quantities introduced in Section 4 and Section 5. They characterize notions of optimal curvature of the weighted function and satisfy $\mu_L \ge \mu_G$.

2. Related Work

2.1. Gradient Weights in Multi-task & Meta Learning

Our work is closely related to optimization methods from the multi-task learning (MTL) and meta-learning literature, particularly those that integrate weights into the task gradients or losses. The multiple gradient descent algorithm (MGDA) approach of Fliege and Svaiter (2000); Désidéri (2012); Sener and Koltun (2018); Zhang et al. (2024) is one of the first lines of work in this direction. It proposes an optimization objective that gives rise to a weight vector that implies a descent direction for all tasks and converges to a point on the Pareto set. The PCGrad paper (Yu et al., 2020) identified that conflicting gradients can be detrimental to MTL. The authors then propose to modify the gradients to remove this conflict (by projecting each task's gradient onto the normal plane of another task), forming the basis for the PCGrad algorithm.
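As an illustration of the projection step described above, the following is a small NumPy sketch of a PCGrad-style update for a single pair of task gradients. It is a schematic reconstruction of the idea, not the reference implementation of Yu et al. (2020).

```python
import numpy as np

def project_conflicting(g_i, g_j):
    """If g_i conflicts with g_j (negative inner product), remove from g_i its
    component along g_j, i.e., project g_i onto the normal plane of g_j."""
    dot = g_i @ g_j
    if dot < 0:
        g_i = g_i - (dot / (g_j @ g_j)) * g_j
    return g_i

# Two conflicting task gradients: their inner product is negative.
g1 = np.array([1.0, 1.0])
g2 = np.array([1.0, -2.0])
print(project_conflicting(g1, g2))   # the component of g1 along g2 is removed
```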
Another work that tackles conflicting gradients is the conflict-averse gradient descent (CAGrad) method of Liu et al. (2021). CAGrad generalizes MGDA: its main idea is to minimize a notion of conflict between gradients from different tasks, while staying close to the gradient of the average loss. Notably, CAGrad maintains convergence toward a minimum of the average loss. Another way to handle gradient conflicts is the Nash-MTL method of Navon et al. (2022), where the gradients are combined using a bargaining game. Very recently, Achituve et al. (2024) introduced a Bayesian approach for gradient aggregation that incorporates uncertainty in the gradient dimensions. Other optimization techniques for MTL include tuning gradient magnitudes so that all tasks train at a similar rate (Chen et al., 2018), taking the geometric mean of task losses (Chennupati et al., 2019), and random weighting (Lin et al., 2021). On the meta-learning front, the MAML algorithm (Finn et al., 2019) aims to learn a useful representation such that the model can adapt to new tasks with only a small number of training samples. Since fast adaptation is the primary goal in meta-learning, MAML's loss calculation differs from those found in MTL.

Only a few prior works have provided provable convergence guarantees for existing multi-objective optimization methods. Without additional assumptions on the alignment of the different objectives, these guarantees quantify convergence to a point on the Pareto front. Unlike our work, their convergence guarantees depend on worst-case structural quantities such as the maximal Lipschitz constant among all objectives (Liu et al., 2021; Navon et al., 2022; Zhou et al., 2022; Chen et al., 2023; Liu and Vicente, 2024) or the maximal generalized smoothness (Zhang et al., 2024).

The algorithms introduced in the following are similar to existing ones in that they construct a weighted loss to combine information from different sources of feedback. Unlike previous work, we focus on exploiting the prior knowledge that the objectives are aligned. We introduce new instance-dependent structural quantities that reflect how aligned multi-objective feedback can improve GD performance, improving over convergence guarantees that depend on worst-case structural quantities, as in prior work.

2.2. Proxy & Multi-fidelity Feedback

Other streams of related work are (1) machine learning using proxies and (2) multi-fidelity optimization. These works stand out from MTL in that they both focus on using closely related objectives, while traditional MTL typically considers a set of tasks that are more varied in nature. Proxy-based machine learning attempts to approximate the solution of a primary gold task (for which data is expensive or sparsely available) by making use of a proxy task where data is more abundant (Bastani, 2021; Dzyabura et al., 2019). Similarly, multi-fidelity optimization makes use of data sources of varying levels of accuracy (and potentially lower computational cost) to optimize a target objective (Forrester et al., 2007).

Figure 1: Visualization of AMOO instances in which it is possible to obtain improved convergence compared to optimizing individual functions or the average function: (left) the specification example, (center) a simpler instance of the selection example, and (right) a 3D view of the local curvature example, in which $f_1(x_1, x_2) = \exp(x_1) + \exp(x_2) - x_1 - x_2$ and $f_2(x_1, x_2) = f_1(-x_1, -x_2)$. This example highlights the need to toggle between functions according to their local curvature.
In particular, the idea of using multiple closely related tasks of varying levels of fidelity has seen adoption in settings where function evaluations are expensive, including bandits (Kandasamy et al., 2016a;b), Bayesian optimization (Kandasamy et al., 2017; Song et al., 2019; Wu et al., 2020; Takeno et al., 2020), and active learning (Yi et al., 2021; Li et al., 2020; 2022). The motivations behind the AMOO setting are clearly similar to those of proxy optimization and multi-fidelity optimization. However, our paper takes a pure optimization and gradient-descent perspective, which, to our knowledge, is novel in the literature.

3. Aligned Multi Objective Optimization

Consider an unconstrained multi-objective optimization problem where $F : \mathbb{R}^n \to \mathbb{R}^m$ is a vector-valued function, $F(x) = (f_1(x), f_2(x), \ldots, f_m(x))^\top$, and all functions $\{f_i\}_{i \in [m]}$ are convex, where $[m] := \{1, \ldots, m\}$. Without additional assumptions the components of $F(x)$ cannot be minimized simultaneously. To define a meaningful approach to optimizing $F(x)$, one can study the Pareto front, or properly define how to trade off the objectives. In the AMOO setting we make the assumption that the functions are aligned in a specific sense: we assume that the functions $\{f_i\}_{i \in [m]}$ share an optimal solution.¹ Namely, there exists a point $x^\star$ that minimizes all functions in $F(\cdot)$ simultaneously,
$$x^\star \in \arg\min_{x \in \mathbb{R}^n} f_i(x) \quad \forall i \in [m]. \quad (1)$$
With this assumption one may hope to get quantitative benefits from the multi-objective feedback. How can Gradient Descent (GD) be improved when the functions are aligned?

¹We also study an extension of AMOO where the functions can only be approximately simultaneously minimized. See Section 6.

A common algorithmic approach in the multi-objective setting is to use a weight vector $w \in \mathbb{R}^m$ that maps the vector $F(x)$ to a single objective $f_w(x) := w^\top F(x)$, and to apply a gradient descent step on the weighted function (e.g., Sener and Koltun (2018); Yu et al. (2020); Liu et al. (2021)). Existing algorithms suggest alternatives for choosing $w$ via different weight optimizers. We follow this paradigm and refer to it as Weighted-GD (see Algorithm 1).

Algorithm 1 Weighted-GD
initialize: learning rate $\eta$, Weight-Optimizer
while $k = 1, 2, \ldots$ do
  $w_k \leftarrow$ Weight-Optimizer$\big(\{f_i(x_k)\}_{i=1}^m\big)$
  $g_k \leftarrow \nabla f_{w_k}(x_k)$
  $x_{k+1} = x_k - \eta\, g_k$
end while

We provide definitions of useful structures from convex optimization that will be used in this work.

Definition 3.1 (Smoothness). $f : \mathbb{R}^n \to \mathbb{R}$ is called $\beta$-smooth if for all $x, y \in \mathbb{R}^n$ the following holds: $f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{\beta}{2}\|y - x\|^2$.

Definition 3.2 (Self-concordant). $f : \mathbb{R}^n \to \mathbb{R}$ is called self-concordant with parameter $M_f \ge 0$ if for all $x, y \in \mathbb{R}^n$ the following holds: $\langle \nabla^3 f(x)[y]\, y, y \rangle \le 2 M_f \|y\|_x^3$, where $\nabla^3 f(x)[y] := \lim_{\alpha \to 0} \frac{1}{\alpha}\big(\nabla^2 f(x + \alpha y) - \nabla^2 f(x)\big)$ is the directional derivative of the Hessian in the direction $y$, and $\|y\|_x^2 := y^\top \nabla^2 f(x)\, y$.

Next, towards developing algorithmic intuition for the AMOO setting, we consider a few examples in which alignment between the different objective functions can be exploited.

(i) The Specification Example. Consider the case $F(x) = (f_1(x), f_2(x))$, $x \in \mathbb{R}^2$, where
$$f_1(x) = (1 - \epsilon)x_1^2 + \epsilon x_2^2, \qquad f_2(x) = \epsilon x_1^2 + (1 - \epsilon)x_2^2,$$
for some small $\epsilon \in [0, 0.1]$. It is clear that $F(x)$ can be simultaneously minimized at $x^\star = (0, 0)$; hence, this is an AMOO setting. This example, as we demonstrate, illustrates an instance in which each individual function does not specify the solution well, but with proper weighting the optimal solution is well specified. First, observe that both $f_1$ and $f_2$ are $\epsilon$-strongly convex and $O(1)$-smooth. Hence, GD with a properly tuned learning rate, applied to either $f_1$ or $f_2$, converges with a linear rate of $\Omega(\epsilon)$.
This rate can be dramatically improved by proper weighting of the functions. Let $f_{w_U}$ be the function with equal weighting of $f_1$ and $f_2$, namely, choosing $w_U = (0.5, 0.5)$. We get $f_{w_U}(x) = 0.5 x_1^2 + 0.5 x_2^2$, which is $\Omega(1)$-strongly convex and $O(1)$-smooth. Hence, GD applied to $f_{w_U}$ converges with a linear rate of $\Omega(1)$, much faster than $O(\epsilon)$, since $\epsilon$ can be arbitrarily small.

(ii) The Selection Example. Consider the case $F(x) = (f_1(x), \ldots, f_m(x))$, $x \in \mathbb{R}^n$, where
$$\forall i \in [m-1]: \ f_i(x) = \frac{(1-\epsilon)x_1^2 + \epsilon \sum_{j=2}^{n} x_j^2}{m-1}, \qquad f_m(x) = \sum_{j=1}^{n} x_j^2,$$
and $\epsilon \in [0, 0.5]$. The common minimizer of all functions is $x^\star = 0 \in \mathbb{R}^n$, and, hence, the objectives are aligned. Unlike the specification example, in the selection example there is a single objective function among the $m$ objectives that we should select to improve the convergence rate of GD. Further, in the selection example, choosing the uniform weight degrades the convergence rate. Indeed, setting the weight vector to be uniform, $w_U = (1/m, \ldots, 1/m) \in \mathbb{R}^m$, leads to the function $f_{w_U}(x) = \frac{2-\epsilon}{m} x_1^2 + \sum_{j=2}^{n} \frac{\epsilon+1}{m} x_j^2$, which is $O(1/m)$-strongly convex. Hence, GD applied to $f_{w_U}$ converges with a linear rate of $O(1/m)$. On the other hand, GD applied to $f_m$ converges with a linear rate of $\Omega(1)$. Namely, setting the weight vector to be $(0, \ldots, 0, 1) \in \mathbb{R}^m$ improves upon taking the average when the number of objectives is large.

(iii) The Local Curvature Example. Consider the case $F(x) = (f_1(x), f_2(x))$, $x \in \mathbb{R}$, where
$$f_1(x) = \exp(x) - x, \qquad f_2(x) = \exp(-x) + x,$$
namely, $f_2(x) = f_1(-x)$. Both functions are simultaneously minimized at $x^\star = 0$. This example depicts a scenario in which different functions have a larger curvature in different segments of the parameter space; for $x > 0$, $f_1(x)$ has the larger curvature, and for $x < 0$, $f_2$ has the larger curvature. For such a setting, it is natural to toggle between the two functions, namely, to set the weight vector as $w_1 = (1, 0)$ for $x > 0$ and as $w_2 = (0, 1)$ for $x < 0$. This approach, intuitively, should result in faster convergence to the optimal solution compared to applying GD to an individual function or to the average function, since it effectively applies GD to a function that is uniformly more curved.

The three aforementioned examples highlight a key takeaway: the curvature of the functions plays a vital role in improving convergence guarantees for GD in AMOO. Indeed, all of the examples argued how to improve the convergence of GD based on curvature information. In the next sections we formalize this intuition. We introduce quantities that characterize notions of best curvature, and develop new GD-based algorithms that provably converge with rates that depend on these quantities.
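To make the Weighted-GD template (Algorithm 1) concrete, the following is a minimal NumPy sketch, not the paper's implementation, that runs plain gradient descent on a quadratic family in the spirit of the selection example, once with the uniform weight and once with the one-hot weight on $f_m$. The problem sizes, learning rate, step count, and the specific quadratic construction are illustrative assumptions.

```python
import numpy as np

# Selection-example-style quadratics: f_i(x) = x^T A_i x with a shared minimizer x* = 0.
# The first m-1 objectives are poorly conditioned; the last one is well conditioned.
def make_objectives(n=10, m=50, eps=0.01):
    mats = []
    for _ in range(m - 1):
        diag = np.full(n, eps / (m - 1))
        diag[0] = (1.0 - eps) / (m - 1)
        mats.append(np.diag(diag))
    mats.append(np.eye(n))          # f_m(x) = ||x||^2, well curved in every direction
    return mats

def weighted_gd(mats, w, x0, lr=0.25, steps=50):
    """Algorithm 1 with a fixed weight vector w: gradient steps on f_w = sum_i w_i f_i."""
    x = x0.copy()
    H = sum(wi * A for wi, A in zip(w, mats))   # Hessian of the weighted quadratic (up to a factor 2)
    for _ in range(steps):
        grad = 2.0 * H @ x                       # gradient of x^T H x
        x = x - lr * grad
    return np.linalg.norm(x)

mats = make_objectives()
m = len(mats)
x0 = np.ones(10)
uniform = np.full(m, 1.0 / m)
one_hot = np.zeros(m); one_hot[-1] = 1.0
print("uniform weights, distance to x*:", weighted_gd(mats, uniform, x0))   # still far from 0
print("one-hot weight on f_m, distance:", weighted_gd(mats, one_hot, x0))   # essentially converged
```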
4. The CAMOO Weight Optimizer

We start by introducing and analyzing the Curvature-Aligned Multi-Objective Optimizer (CAMOO). CAMOO (Algorithm 2) directly optimizes the curvature of the weighted function. Towards developing it, we define the global adaptive strong convexity parameter, $\mu_G$. Later we show that when the weighted loss is determined by CAMOO, GD converges at a rate that depends on $\mu_G$. We start by defining the optimal adaptive strong convexity over the class of weights:

Definition 4.1 (Global Adaptive Strong Convexity $\mu_G$). The global adaptive strong convexity parameter, $\mu_G \in \mathbb{R}_+$, is the largest value such that for all $x \in \mathbb{R}^n$ there exists a weight vector $w \in \Delta_m$ such that
$$\lambda_{\min}\Big(\sum_{i=1}^m w_i \nabla^2 f_i(x)\Big) \ge \mu_G. \quad (2)$$

For each $x \in \mathbb{R}^n$, there may be a different weight vector that solves $\arg\max_{w \in \Delta_m} \lambda_{\min}\big(\nabla^2 f_w(x)\big)$ and locally maximizes the curvature. The global adaptive strong convexity parameter $\mu_G$ is the largest such lower bound that holds uniformly over $\mathbb{R}^n$. The specification and selection examples (Section 3) demonstrate that $\mu_G$ can be much larger than both the strong convexity parameter of the average function and that of each individual function; for both examples $\mu_G = O(1)$, whereas the alternatives may have an arbitrarily small strong convexity parameter. Further, the local curvature example highlights a case in which the optimal weight may depend on $x$.

Additionally, this structural definition implies that there is a unique point $x^\star$ that simultaneously minimizes the objectives. Due to this observation, in the following, we aim to design provable GD methods that converge to this optimal point $x^\star$. The following result formalizes this by showing that under a weaker condition than $\mu_G > 0$ there is a unique minimizer (see Appendix F.1 for a proof).

Proposition 4.1 (Unique Optimal Solution). Assume there exists $x^\star \in \mathbb{R}^n$ that simultaneously minimizes $\{f_i\}_{i \in [m]}$, namely, solves Eq. (1). If $\max_{w \in \Delta_m} \lambda_{\min}\big(\nabla^2 f_w(x^\star)\big) > 0$, then $x^\star$ is unique.

Definition 4.1 not only quantifies an optimal notion of curvature, but also directly gives rise to the CAMOO algorithm. CAMOO sets the weights according to Eq. (2): at the $k$th iteration, it finds the weight vector for which $f_w(x_k)$ has the largest local curvature. Then, a gradient step is applied in the direction of $\nabla f_w(x_k)$ (see Algorithm 1). Indeed, CAMOO seems like a natural algorithm to apply in AMOO. Nevertheless, the analysis of CAMOO faces key challenges that make it less straightforward than one may expect.

Algorithm 2 CAMOO
inputs: $\{f_i(x_k)\}_{i=1}^m$
initialize: $w_{\min} = \mu_G / (8 m \beta)$
Get Hessian matrices $\{\nabla^2 f_i(x_k)\}_{i=1}^m$
$w \in \arg\max_{w \in \Delta_{m, w_{\min}}} \lambda_{\min}\big(\sum_i w_i \nabla^2 f_i(x_k)\big)$, where $\Delta_{m, w_{\min}}$ is the simplex restricted to weights with entries at least $w_{\min}$

Challenge (i): $f_{w_k}$ is not a strongly convex function. One may hope that standard GD analysis for strongly convex and smooth functions can be applied. It is well known that if a function $f(x)$ is $\beta$-smooth and $\lambda_{\min}\big(\nabla^2 f(x)\big) \ge \mu$ for all $x \in \mathbb{R}^n$, then GD converges with a $\mu/\beta$ linear rate. Unfortunately, a careful examination of this argument shows that it fails. Even though $\lambda_{\min}\big(\nabla^2 f_{w_k}(x_k)\big) \ge \mu_G$ at each iteration $k$ of CAMOO, it does not imply that $f_{w_k}$ is $\mu_G$-strongly convex for a fixed $x_k$. Namely, it does not necessarily hold that $\lambda_{\min}\big(\nabla^2 f_{w_k}(x)\big) \ge \mu_G$ for all $x \in \mathbb{R}^n$, but only pointwise at $x_k$ (e.g., the local curvature example highlights this issue; see Appendix B.1 for details). This property emerges naturally in AMOO, yet such a nuance is inherently impossible in single-objective optimization.

Challenge (ii): The weighted function is not necessarily convex. A naive reduction may be to apply GD to the function $f_{w^\star(x)}$ where $w^\star(x) \in \arg\max_{w \in \Delta_m} \lambda_{\min}\big(\sum_{i=1}^m w_i \nabla^2 f_i(x)\big)$. Namely, to apply GD to a new weighted function that is determined by optimizing the curvature. Such an approach turns out to be flawed from a theoretical perspective; the function $f_{w^\star(x)}(x) = \sum_{i \in [m]} w^\star_i(x) f_i(x)$ is not necessarily convex nor smooth due to the dependence on a weight vector that itself depends on $x$ (see Appendix B.2 for an example).
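The per-iteration weight problem in Algorithm 2 maximizes the smallest eigenvalue of the weighted Hessian over a truncated simplex, which is a concave, SDP-representable problem. The sketch below solves it with CVXPY as one possible off-the-shelf route; it is an illustration rather than the paper's implementation, and the example Hessians and the $w_{\min}$ value are placeholders.

```python
import numpy as np
import cvxpy as cp

def camoo_weights(hessians, w_min):
    """One CAMOO weight step: w in argmax_{w in simplex, w_i >= w_min} lambda_min(sum_i w_i H_i)."""
    m = len(hessians)
    w = cp.Variable(m)
    weighted = sum(w[i] * hessians[i] for i in range(m))
    objective = cp.Maximize(cp.lambda_min(weighted))
    constraints = [cp.sum(w) == 1, w >= w_min]
    cp.Problem(objective, constraints).solve()
    return w.value

# Toy check on the specification example (the Hessians are constant for quadratics).
eps = 0.01
H1 = 2 * np.diag([1 - eps, eps])
H2 = 2 * np.diag([eps, 1 - eps])
print(camoo_weights([H1, H2], w_min=1e-3))   # expected to be close to (0.5, 0.5)
```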
Next, we provide a positive result. When restricting the class of functions to the set of self-concordant and smooth functions (see Appendix C for formal definitions), we provide a convergence guarantee for Weighted-GD instantiated with CAMOO that depends on $\mu_G$. Further, the result shows that close to the optimal solution the convergence has a linear rate of $O(\mu_G/\beta)$ (see Appendix D for proof details).

Theorem 4.2 ($\mu_G$ Convergence of CAMOO). Suppose $\{f_i\}_{i \in [m]}$ are $\beta$-smooth, $M_f$ self-concordant, share an optimal solution $x^\star$, and that $\mu_G > 0$. Let $k_0 := \Big\lceil \frac{16\beta\big(3\|x_0 - x^\star\| \sqrt{m\beta}\, M_f - \sqrt{\mu_G}\big)}{3 \mu_G^{3/2}} \Big\rceil$, where $\|\cdot\|$ is the Euclidean norm. Then, Weighted-GD instantiated with the CAMOO weight-optimizer and $\eta = 1/2\beta$ converges with rate:
$$\|x_k - x^\star\| \le \begin{cases} \big(1 - \tfrac{3\mu_G}{8\beta}\big)^{(k - k_0)/2}\, \|x_{k_0} - x^\star\| & k \ge k_0, \\ \|x_0 - x^\star\| - k\, \tfrac{\mu_G^{3/2}}{16 \beta^2 \sqrt{m}\, M_f} & \text{otherwise.} \end{cases}$$

Importantly, Theorem 4.2 holds without making a strong convexity assumption on the individual functions; it only requires the adaptive strong convexity parameter $\mu_G$ to be positive, as, otherwise, the result is vacuous.

The proof follows a few key observations. The self-concordance property, we find, implies a useful inequality that depends only on local curvature (see Appendix C):
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + c\, \frac{\|y - x\|^2_{\nabla^2 f(x)}}{1 + M_f \|y - x\|_{\nabla^2 f(x)}}, \quad (3)$$
for some constant $c > 0$. This inequality shares similarity with the more standard inequality used in the analysis of GD convergence for strongly convex functions (Boyd and Vandenberghe, 2004); however, unlike the former, it depends on local curvature. In order to satisfy the assumption that the weighted function $f_{w_k}$ is self-concordant, we rely on the fact that $\min_{i \in [m]} w_{k,i} \ge w_{\min}$ by design of CAMOO. Then, additional analysis leads to a recurrence relation for the residual $r_k := \|x_k - x^\star\|_2$ of the form
$$r_{k+1}^2 \le r_k^2 - \alpha_1 r_k^2 / (1 + \alpha_2 r_k). \quad (4)$$
We provide a bound on this recurrence relation in Appendix C to arrive at the final result of Theorem 4.2.

4.1. Practical Implementation

We now describe a scalable approach for implementing CAMOO that we experiment with in the next sections. Towards large-scale application of CAMOO with modern deep learning architectures, we approximate the Hessian matrices with their diagonals. Prior works used the diagonal Hessian approximation as a pre-conditioner (Chapelle et al., 2011; Schaul et al., 2013; Yao et al., 2021; Liu et al., 2023b; Achituve et al., 2024). Notably, with this approximation the computational cost of CAMOO scales linearly with the number of parameters in the Hessian computation, instead of quadratically. The following result establishes that the value of the optimal curvature, and, hence, the convergence rate of Weighted-GD instantiated with CAMOO, degrades continuously with the quality of the Hessian approximation (see Appendix F.2 for proof details).

Proposition 4.2. Assume that for all $i \in [m]$ and $x \in \mathbb{R}^n$, $\|\nabla^2 f_i(x) - \mathrm{Diag}\big(\nabla^2 f_i(x)\big)\|_2 \le \epsilon$, where $\|A\|_2$ is the spectral norm of $A \in \mathbb{R}^{n \times n}$. Let $w^\star \in \arg\max_{w \in \Delta_m} \lambda_{\min}\big(\sum_i w_i\, \mathrm{Diag}(\nabla^2 f_i(x))\big)$. Then, $\lambda_{\min}\big(\sum_i w^\star_i \nabla^2 f_i(x)\big) \ge \mu_G - 2\epsilon$.

Next we provide high-level details of our implementation (also see Appendix A).

Diagonal Hessian estimation via Hutchinson's method. We use the Hutchinson method (Hutchinson, 1989; Chapelle et al., 2011; Yao et al., 2021), which provides an estimate of the diagonal Hessian by averaging products of the Hessian with random vectors. Importantly, the computational cost of this method scales linearly with the number of parameters.
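Hutchinson's diagonal estimator can be written in a few lines using Hessian-vector products with `torch.autograd.grad`. The following PyTorch sketch is a generic illustration, not the paper's code; the probe count and the toy quadratic used for the check are arbitrary choices.

```python
import torch

def hutchinson_diag_hessian(loss_fn, params, num_probes=10):
    """Estimate diag(H) of loss_fn w.r.t. a flat parameter tensor via E[z * (H z)],
    where z has i.i.d. Rademacher entries and H z is a Hessian-vector product."""
    loss = loss_fn(params)
    (grad,) = torch.autograd.grad(loss, params, create_graph=True)
    diag_est = torch.zeros_like(params)
    for _ in range(num_probes):
        z = (torch.randint(0, 2, params.shape) * 2 - 1).to(params.dtype)  # Rademacher +/-1
        (hz,) = torch.autograd.grad(grad, params, grad_outputs=z, retain_graph=True)
        diag_est += z * hz
    return diag_est / num_probes

# Quick check on a quadratic with a known Hessian diagonal.
A = torch.tensor([[3.0, 1.0], [1.0, 2.0]])
x = torch.zeros(2, requires_grad=True)
est = hutchinson_diag_hessian(lambda p: 0.5 * p @ A @ p, x, num_probes=200)
print(est)   # should be close to tensor([3., 2.])
```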
Maximizing the minimal eigenvalue. Maximizing the minimal eigenvalue of a weighted sum of symmetric matrices is a convex problem (Boyd and Vandenberghe (2004), Example 3.10) and can be solved via semidefinite programming. For diagonal matrices the problem can be cast as a simpler max-min bilinear problem, $\arg\max_{w \in \Delta_m} \min_{q \in \Delta_n} w^\top A q$, where $n$ is the number of parameters, $A \in \mathbb{R}^{m \times n}$, and its $i$th row is the diagonal Hessian of the $i$th objective, namely, for all $i \in [m]$, $A[i, :] = \mathrm{diag}\big(\nabla^2 f_i(x)\big)$. This bilinear optimization problem is well studied (Rakhlin and Sridharan, 2013; Mertikopoulos et al., 2018; Daskalakis and Panageas, 2018). We implemented the PU method of Cen et al. (2021), which, loosely speaking, executes iterative updates via exponentiated gradient descent/ascent. PU has a closed-form update rule and its computational cost scales linearly with the number of parameters.

5. The PAMOO Weight Optimizer

In the previous section, we introduced the global adaptive strong convexity parameter $\mu_G$ and the CAMOO weight optimizer, which chooses the weight vector adaptively, and showed that it has asymptotic linear convergence guarantees that depend on $\mu_G$. In this section we explore an additional adaptive mechanism for choosing the weight vector, based on the Polyak step-size design. We introduce the Polyak Aligned Multi-Objective Optimizer (PAMOO). Unlike CAMOO, it only requires information about the gradients, without requiring access to the Hessians. Interestingly, even though it is computationally much cheaper, PAMOO exhibits an improved convergence rate compared to CAMOO.

PAMOO (Algorithm 3) generalizes the Polyak step-size design to AMOO. As such, it requires access to the optimal function values $f_i(x^\star)$ for all $i \in [m]$. This information may not be readily available in general. However, in modern machine learning applications this value is often zero (Loizou et al., 2021; Wang et al., 2023). Further, there are variations of the Polyak step-size in which the optimal value is estimated (Gower et al., 2021; Orvieto et al., 2022). We leave potential extensions of these to AMOO for future work. Compared to CAMOO, PAMOO only requires access to the gradients of the objectives, and does not assume access to the Hessians. Further, it only requires solving a simple convex quadratic optimization problem in $\mathbb{R}^m$. This problem is simpler than the maximization of the smallest eigenvalue that CAMOO is required to solve.

We now define the local strong convexity parameter over a class of weights. As we later show, this parameter controls the convergence rate of PAMOO:

Definition 5.1 (Local Strong Convexity $\mu_L$). The local strong convexity parameter, $\mu_L \in \mathbb{R}_+$, is the largest value such that there exists a weight vector $w \in \Delta_m$ such that
$$\lambda_{\min}\Big(\sum_{i=1}^m w_i \nabla^2 f_i(x^\star)\Big) \ge \mu_L,$$
where $x^\star$ simultaneously minimizes $\{f_i\}_{i \in [m]}$.

Notice that Proposition 4.1 implies that $x^\star$ is necessarily unique, and, hence, $\mu_L$ is well defined. Further, unlike the global adaptive strong convexity parameter, the local strong convexity parameter only depends on the curvature at $x^\star$, namely, at the optimal solution. From Definition 4.1 and Definition 5.1 we directly get that $\mu_L \ge \mu_G$.

Algorithm 3 PAMOO
1: inputs: $\{f_i(x_k)\}_{i=1}^m$
2: $w \in \arg\max_{w \in \mathbb{R}^m_+} 2 w^\top \Delta_x - w^\top J_x^\top J_x w$
3: $\Delta_x := [\Delta_{x,1}, \ldots, \Delta_{x,m}]^\top$, where $\Delta_{x,i} := f_i(x_k) - f_i(x^\star)$
4: $J_x := [\nabla f_1(x_k)\ \cdots\ \nabla f_m(x_k)] \in \mathbb{R}^{n \times m}$
5: return: $w$

The PAMOO algorithm is inspired by the Polyak step-size design (Polyak, 1987; Hazan and Kakade, 2019) for choosing the learning rate in a parameter-free way. To provide intuition for our derivation, consider the GD update rule in a single-objective optimization problem, $x_{k+1} = x_k - \eta_k g_k$. To derive the Polyak step-size design, observe that by convexity and the GD update rule we have that
$$\|x_{k+1} - x^\star\|^2 \le \|x_k - x^\star\|^2 - 2\eta_k\big(f(x_k) - f(x^\star)\big) + \eta_k^2 \|\nabla f(x_k)\|^2. \quad (6)$$
Minimizing this upper bound with respect to $\eta_k$ leads to $\eta_k = \big(f(x_k) - f(x^\star)\big) / \|\nabla f(x_k)\|^2$, which is the Polyak step-size design choice.
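To connect this derivation to line 2 of Algorithm 3, one can repeat the same bound for the weighted function $f_w$ with step size $\eta = 1$; the following short derivation is a reconstruction of that step (the resulting inequality reappears as Eq. (17) in the proof of Theorem 5.2):
$$\|x_{k+1} - x^\star\|^2 \le \|x_k - x^\star\|^2 - 2\big(f_w(x_k) - f_w(x^\star)\big) + \|\nabla f_w(x_k)\|^2 = \|x_k - x^\star\|^2 - \big(2\, w^\top \Delta_x - w^\top J_x^\top J_x w\big),$$
using $f_w(x_k) - f_w(x^\star) = \sum_i w_i\big(f_i(x_k) - f_i(x^\star)\big) = w^\top \Delta_x$ and $\nabla f_w(x_k) = J_x w$. Choosing $w$ to maximize the guaranteed decrease $2\, w^\top \Delta_x - w^\top J_x^\top J_x w$ over $w \in \mathbb{R}^m_+$ is exactly line 2 of Algorithm 3.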
Building on this derivation, we develop the PAMOO weight optimizer (see Algorithm 3). As we now show, interestingly, its convergence rate has the same functional form as that of CAMOO, while depending on the local strong convexity parameter $\mu_L$ instead of $\mu_G$. Hence, PAMOO has an improved upper bound on its convergence rate compared to CAMOO (see Appendix D for proof details).

Theorem 5.2 ($\mu_L$ Convergence of PAMOO). Suppose $\{f_i\}_{i \in [m]}$ are $\beta$-smooth, $M_f$ self-concordant, share an optimal solution $x^\star$, and $\mu_L > 0$. Let $k_0 := \Big\lceil \frac{64\beta\big(3\|x_0 - x^\star\| \sqrt{m\beta}\, M_f - \sqrt{\mu_L}\big)}{3 \mu_L^{3/2}} \Big\rceil$, where $\|\cdot\|$ is the Euclidean norm. Then, Weighted-GD instantiated with the PAMOO weight-optimizer and $\eta = 1$ converges with rate:
$$\|x_k - x^\star\| \le \begin{cases} \big(1 - \tfrac{3\mu_L}{32\beta}\big)^{(k - k_0)/2}\, \|x_{k_0} - x^\star\| & k \ge k_0, \\ \|x_0 - x^\star\| - k\, \tfrac{\mu_L^{3/2}}{64 \beta^2 \sqrt{m}\, M_f} & \text{otherwise.} \end{cases}$$

This result is established by generalizing the Polyak step-size analysis (see Eq. (6)) while using a key observation. In the analysis, we upper bound the residual $\|x_k - x^\star\|^2$ by a quantity that depends on the curvature of the optimal weight vector at $x^\star$, which is lower bounded by $\mu_L$ by definition. This is valid because we can replace $w_k$ with an alternative weight vector used only in the analysis, since $w_k$ is an optimal solution of $\max_{w \in \mathbb{R}^m_+} 2 w^\top \Delta_x - w^\top J_x^\top J_x w$. This flexibility allows us to upper bound expressions that depend on $w_k$ by any nonnegative weight vector $w \in \mathbb{R}^m_+$. Furthermore, we use tools similar to those developed in the analysis of CAMOO: the property of self-concordant functions (Eq. (3)) and the recurrence relation bound (Eq. (4)).

5.1. Practical Implementation

PAMOO can be implemented in a straightforward and scalable way. It requires access to the Jacobian matrix, which can be readily calculated from the gradients. Calculating the matrix $J_x^\top J_x \in \mathbb{R}^{m \times m}$ has a computational cost of $O(n m^2)$, where $n$ is the dimension of the parameter space and $m$ is the number of objectives, and can be parallelized. Lastly, it requires solving a convex quadratic optimization problem in $\mathbb{R}^m$, where $m$ is expected to be of the order of 10. This can be done efficiently with different convex optimization algorithms, e.g., projected GD. In practice, we initialize the weight vector $w$ with the weight vector of the previous iterate; hence, an approximate optimal solution is found within a few projected gradient descent iterations. PAMOO has a potential advantage over CAMOO due to its scalability and its simple implementation. Lastly, generalizing it to methods in which the optimal value is estimated instead of given is left for future work (Orvieto et al., 2022; Gower et al., 2021).
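Below is a minimal NumPy sketch of this weight computation: projected gradient ascent on the concave quadratic from line 2 of Algorithm 3, with a small ridge term and a clipping floor in the spirit of the implementation details reported in Appendix A. The specific constants, the safe step-size choice, and the random placeholder inputs are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

def pamoo_weights(J, deltas, w_init=None, steps=100, ridge=1e-4, floor=1e-6):
    """Approximately solve  max_{w >= 0}  2 w^T deltas - w^T (J^T J) w  by projected gradient ascent.
    J: (n, m) Jacobian whose columns are grad f_i(x_k); deltas: (m,) vector of f_i(x_k) - f_i(x*)."""
    m = J.shape[1]
    G = J.T @ J + ridge * np.eye(m)              # small ridge keeps the weights from blowing up
    lr = 1.0 / (2.0 * np.linalg.norm(G, 2))      # safe step size for this quadratic
    w = np.full(m, 1.0 / m) if w_init is None else w_init.copy()
    for _ in range(steps):
        grad = 2.0 * deltas - 2.0 * G @ w        # gradient of the concave objective
        w = np.maximum(w + lr * grad, floor)     # ascent step, then project onto w >= floor
    return w

# Example call with random placeholder inputs (n = 1000 parameters, m = 3 objectives).
rng = np.random.default_rng(0)
J = rng.normal(size=(1000, 3))
deltas = np.array([1.0, 0.5, 0.1])
print(pamoo_weights(J, deltas))
```

In practice the previous iterate's weight vector can be passed as `w_init`, as described above, so that only a few ascent steps are needed per iteration.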
6. ϵ-AAMOO: Robustness to the Alignment Assumption

We have analyzed CAMOO and PAMOO assuming perfectly aligned objectives (Eq. (1)). However, in practice, objectives may be similar rather than perfectly aligned. Next, we extend AMOO to address this more realistic scenario and assume that the alignment assumption is only approximately correct. We show that both algorithms are robust to such an approximation and remain effective under these conditions.

Instead of assuming the objectives are perfectly aligned, we consider the ϵ-Approximate AMOO (AAMOO) framework, in which there exists a near-optimal solution with respect to all objectives. Let $C_\epsilon$ be the set of ϵ-approximate solutions:
$$C_\epsilon = \{x \in \mathbb{R}^n \mid f_i(x) - f_i(x_i^\star) \le \epsilon \ \ \forall i \in [m]\}, \quad (7)$$
where $x_i^\star \in \arg\min_{x \in \mathbb{R}^n} f_i(x)$. In the ϵ-AAMOO setting we assume that $C_\epsilon$ is not the empty set. This corresponds to a case in which there exists a point that is a near-optimal solution for all objectives, and it can be understood as a natural generalization of the stricter AMOO setting, in which $\epsilon = 0$.

The main result of this section shows that for both CAMOO and PAMOO the distance between $x_k$ and an ϵ-approximate solution $x_\epsilon \in C_\epsilon$ converges to $\epsilon_{\mathrm{app}}$. Further, $\epsilon_{\mathrm{app}}$ is a polynomial function of ϵ and of structural quantities of the problem, and it vanishes as $\epsilon \to 0$. This provides an approximate convergence guarantee for both algorithms. For CAMOO, the result depends on the $\mu_G$ curvature, as in the AMOO setting. For PAMOO, the convergence depends on the best curvature within the set $C_\epsilon$, defined as follows:

Definition 6.1 (ϵ-Local Strong Convexity $\mu^\epsilon_L$). The ϵ-local strong convexity parameter, $\mu^\epsilon_L \in \mathbb{R}_+$, is the largest value such that there exist $x \in C_\epsilon$ and a weight vector $w \in \Delta_m$ such that
$$\lambda_{\min}\Big(\sum_{i=1}^m w_i \nabla^2 f_i(x)\Big) \ge \mu^\epsilon_L.$$

Let $\mu^\star(x) = \max_{w \in \Delta_m} \lambda_{\min}\big(\sum_{i=1}^m w_i \nabla^2 f_i(x)\big)$ be the largest curvature at point $x$. Then, $\mu^\epsilon_L$ is the maximal largest curvature in the set of near-optimal solutions $C_\epsilon$, namely, $\max_{x \in C_\epsilon} \mu^\star(x)$. Unlike $\mu_G$, which depends on the worst-case curvature over all points, $\mu^\epsilon_L$ depends on the best-case curvature in $C_\epsilon$. The convergence of PAMOO depends on this quantity, which also satisfies $\mu^\epsilon_L \ge \mu_G$ for any ϵ.

We now provide an informal description of an approximate convergence guarantee for both CAMOO and PAMOO (see Appendix E for the formal theorems and proofs):

Theorem 6.2 ((Informal) Approximate Convergence in ϵ-AAMOO). Suppose $\{f_i\}_{i \in [m]}$ are $\beta$-smooth and $M_f$ self-concordant. Let $\mu_A$ be $\mu_G$ for CAMOO and $\mu^\epsilon_L$ for PAMOO, respectively. Assume that ϵ-AAMOO holds and that $\epsilon \le \mathrm{poly}(\mu_A, 1/\beta, 1/m, M_f)$. Then, there exists $x_\epsilon \in C_\epsilon$ such that the iterates of both CAMOO and PAMOO satisfy
$$\|x_k - x_\epsilon\| \le \begin{cases} f_1 \big(1 - c\,\tfrac{\mu_A}{\beta}\big)^{(k - k_0)/2} + f_2\, \epsilon^{1/4} & k \ge k_0, \\ \|x_0 - x_\epsilon\| - f_3\, k & \text{otherwise,} \end{cases}$$
where $c \in (0, 1)$, $f_1 = \|x_{k_0} - x_\epsilon\|$, and $f_2$, $f_3$, and $k_0$ are polynomial functions of $\beta$, $\mu_A$, $m$, and $M_f$.

Namely, the performance of CAMOO and PAMOO degrades continuously with ϵ as the alignment assumption is violated. This holds without modification to the algorithms.

Figure 2: MSE (see Eq. (9)) versus gradient steps. (Left) Local curvature example instance; (right) selection example instance.

7. Toy Experiment

We implemented CAMOO and PAMOO and compared them to a weighting mechanism that uses equal weights on the objectives (EW). We tested these three algorithms as the Weight-Optimizers in Weighted-GD (see Algorithm 1), and we experimented with SGD and ADAM as the optimizers of the weighted loss. In the learning problem we consider, one network is required to match the outputs of a second, fixed network. We denote the fixed network with parameters $\theta^\star$ as $h_{\theta^\star} : \mathbb{R}^{d_i} \to \mathbb{R}^{d_o}$ and the trained network with parameters $\theta$ as $h_\theta : \mathbb{R}^{d_i} \to \mathbb{R}^{d_o}$. Both are 2-layer neural networks with ReLU activations and 512 hidden units. We draw data from a uniform distribution, $D = \{x_i\}_i$ where $x_i \sim \mathrm{Uniform}([-1, 1]^{d_i})$. We consider three loss functions:
$$\Big(\big(h_\theta(x) - h_{\theta^\star}(x)\big)^\top H_i \big(h_\theta(x) - h_{\theta^\star}(x)\big)\Big)^{\alpha_i}, \quad \text{for all } i \in [3],$$
where $H_i \in \mathbb{R}^{d_o \times d_o}$ is a positive definite matrix and $\alpha_i \ge 1$. All loss functions are minimized when $h_\theta(x) = h_{\theta^\star}(x)$, and, hence, this is an instance of AMOO. We investigated two instances. First, a selection example in which $\alpha_i = 1$ and $H_i = \mathrm{diag}(1, 0.01^i, \ldots, 0.01^i)$ for $i \in \{0, 1, 2\}$; there we expect the algorithms to adapt to the first loss function, $f_1$. Second, a local curvature example where $H_i = I$ and $\alpha_i \in \{1, 1.5, 2\}$. For this choice, $f_1$ has larger curvature for small losses whereas $f_2$ and $f_3$ have larger curvature for large losses.
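For concreteness, the following is a minimal PyTorch sketch of the two-network setup and the three aligned losses for the local-curvature instance ($H_i = I$, $\alpha_i \in \{1, 1.5, 2\}$). The layer sizes follow the description above; the seed, batch construction, and helper names are illustrative choices, not the released notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out, hidden = 20, 7, 512

def make_net():
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, d_out))

target_net = make_net()              # fixed network h_{theta*}
for p in target_net.parameters():
    p.requires_grad_(False)
student_net = make_net()             # trained network h_theta

x = torch.rand(200, d_in) * 2 - 1    # D: samples from Uniform([-1, 1]^{d_in})

# Local-curvature-style instance: H_i = I and alpha_i in {1, 1.5, 2}.
alphas = [1.0, 1.5, 2.0]

def objectives(student, target, inputs):
    diff = student(inputs) - target(inputs)
    quad = (diff * diff).sum(dim=1)                  # diff^T H diff with H = I, per sample
    return [torch.mean(quad ** a) for a in alphas]   # one aligned loss per alpha

losses = objectives(student_net, target_net, x)
print([f"{l.item():.3f}" for l in losses])
```

All three losses share the minimizer $h_\theta = h_{\theta^\star}$, so any Weight-Optimizer plugged into Algorithm 1 can combine them without introducing a trade-off.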
We track the performance by measuring the mean-squared error between the networks: MSE = 1 |D| x D (hθ(x) hθ (x)) . (9) Figure 2 shows the convergence plots of our experiments. These highlight the potential of using GD algorithms that designed for AMOO: both CAMOO and especially PAMOO 2https://github.com/facebookresearch/ Aligned Multi Objective Optimization Aligned Multi Objective Optimization show an improved convergence rate. We provide additional experimental results and details in Appendix A. The results show that CAMOO modifies the weights, as expected, by adapting them to the local curvature; i.e., it changes the weights gradually during the optimization phase. 8. Conclusions In this work, we introduced the AMOO framework to study how aligned or approximately aligned multi-objective feedback can improve gradient descent convergence. We designed the CAMOO and PAMOO algorithms, which adaptively weight objectives and offer provably improved convergence guarantees. Future research directions include determining optimal rates for AMOO and conducting comprehensive empirical studies in different domains. Additionally, in this work, we have not explored a stochastic or non-convex optimization frameworks of AMOO, which we believe is of interest for future work. We conjecture that algorithmic advancements in AMOO will improve our ability to scale learning algorithms to handle large number of related tasks efficiently with minimal hyper-parameter tuning; a goal much needed in modern machine learning practice. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning and optimization. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Idan Achituve, Idit Diamant, Arnon Netzer, Gal Chechik, and Ethan Fetaya. Bayesian uncertainty for gradient aggregation in multi-task learning. ar Xiv preprint ar Xiv:2402.04005, 2024. Hamsa Bastani. Predicting with proxies: Transfer learning in high dimension. Management Science, 67(5):2964 2984, 2021. Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. Rich Caruana. Multitask learning. Machine learning, 28: 41 75, 1997. Shicong Cen, Yuting Wei, and Yuejie Chi. Fast policy extragradient methods for competitive games with entropy regularization. Advances in Neural Information Processing Systems, 34:27952 27964, 2021. Olivier Chapelle, Dumitru Erhan, et al. Improved preconditioner for hessian free optimization. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 201. Citeseer, 2011. Lisha Chen, Heshan Fernando, Yiming Ying, and Tianyi Chen. Three-way trade-off in multi-objective learning: Optimization, generalization and conflict-avoidance. Advances in Neural Information Processing Systems, 36: 70045 70093, 2023. Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International conference on machine learning, pages 794 803. PMLR, 2018. Sumanth Chennupati, Ganesh Sistu, Senthil Yogamani, and Samir A Rawashdeh. Multinet++: Multi-stream feature aggregation and geometric loss strategy for multi-task learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0 0, 2019. Christoph Dann, Yishay Mansour, and Mehryar Mohri. Reinforcement learning can be more efficient with multiple rewards. 
In International Conference on Machine Learning, pages 6948 6967. PMLR, 2023. Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. ar Xiv preprint ar Xiv:1807.04252, 2018. Jean-Antoine D esid eri. Multiple-gradient descent algorithm (mgda) for multiobjective optimization. Comptes Rendus Mathematique, 350(5-6):313 318, 2012. Daria Dzyabura, Srikanth Jagabathula, and Eitan Muller. Accounting for discrepancies between online and offline product evaluations. Marketing Science, 38(1):88 106, 2019. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Modelagnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126 1135. PMLR, 2017. Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In International conference on machine learning, pages 1920 1930. PMLR, 2019. J org Fliege and Benar Fux Svaiter. Steepest descent methods for multicriteria optimization. Mathematical methods of operations research, 51:479 494, 2000. Alexander IJ Forrester, Andr as S obester, and Andy J Keane. Multi-fidelity optimization via surrogate modelling. Proceedings of the royal society a: mathematical, physical and engineering sciences, 463(2088):3251 3269, 2007. Aligned Multi Objective Optimization Robert M Gower, Aaron Defazio, and Michael Rabbat. Stochastic polyak stepsize with a moving target. ar Xiv preprint ar Xiv:2106.11851, 2021. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. ar Xiv preprint ar Xiv:2501.12948, 2025. Elad Hazan and Sham Kakade. Revisiting the polyak step size. ar Xiv preprint ar Xiv:1905.00313, 2019. Yifei He, Shiji Zhou, Guojun Zhang, Hyokun Yun, Yi Xu, Belinda Zeng, Trishul Chilimbi, and Han Zhao. Robust multi-task learning with excess risks. ar Xiv preprint ar Xiv:2402.02009, 2024. Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(9):5149 5169, 2021. Yujing Hu, Weixun Wang, Hangtian Jia, Yixiang Wang, Yingfeng Chen, Jianye Hao, Feng Wu, and Changjie Fan. Learning to utilize shaping rewards: A new approach of reward shaping. Advances in Neural Information Processing Systems, 33:15931 15941, 2020. Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059 1076, 1989. Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. ar Xiv preprint ar Xiv:1611.05397, 2016. Kirthevasan Kandasamy, Gautam Dasarathy, Junier B Oliva, Jeff Schneider, and Barnab as P oczos. Gaussian process bandit optimisation with multi-fidelity evaluations. Advances in neural information processing systems, 29, 2016a. Kirthevasan Kandasamy, Gautam Dasarathy, Barnabas Poczos, and Jeff Schneider. The multi-fidelity multi-armed bandit. Advances in neural information processing systems, 29, 2016b. Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, and Barnab as P oczos. Multi-fidelity bayesian optimisation with continuous approximations. In International conference on machine learning, pages 1799 1808. PMLR, 2017. 
Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, et al. Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation. ar Xiv preprint ar Xiv:2401.05675, 2024. Shibo Li, Robert M Kirby, and Shandian Zhe. Deep multifidelity active learning of high-dimensional outputs. ar Xiv preprint ar Xiv:2012.00901, 2020. Shibo Li, Jeff M Phillips, Xin Yu, Robert Kirby, and Shandian Zhe. Batch multi-fidelity active learning with budget constraints. Advances in Neural Information Processing Systems, 35:995 1007, 2022. Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let s verify step by step. ar Xiv preprint ar Xiv:2305.20050, 2023. Baijiong Lin and Yu Zhang. Libmtl: A python library for deep multi-task learning. The Journal of Machine Learning Research, 24(1):9999 10005, 2023. Baijiong Lin, Feiyang Ye, Yu Zhang, and Ivor W Tsang. Reasonable effectiveness of random weighting: A litmus test for multi-task learning. ar Xiv preprint ar Xiv:2111.10603, 2021. Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. Advances in Neural Information Processing Systems, 34:18878 18890, 2021. Bo Liu, Yihao Feng, Peter Stone, and Qiang Liu. Famo: Fast adaptive multitask optimization. Advances in Neural Information Processing Systems, 36:57226 57243, 2023a. Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. ar Xiv preprint ar Xiv:2305.14342, 2023b. Suyun Liu and Luis Nunes Vicente. The stochastic multigradient algorithm for multi-objective optimization and its application to supervised machine learning. Annals of Operations Research, 339(3):1119 1148, 2024. Nicolas Loizou, Sharan Vaswani, Issam Hadj Laradji, and Simon Lacoste-Julien. Stochastic polyak step-size for sgd: An adaptive learning rate for fast convergence. In International Conference on Artificial Intelligence and Statistics, pages 1306 1314. PMLR, 2021. Sha Luo, Hamidreza Kasaei, and Lambert Schomaker. Accelerating reinforcement learning for reaching using continuous curriculum learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1 8. IEEE, 2020. Aligned Multi Objective Optimization Panayotis Mertikopoulos, Houssam Zenati, Bruno Lecouat, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. ar Xiv preprint ar Xiv:1807.02629, 2018. Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, and Ethan Fetaya. Multitask learning as a bargaining game. ar Xiv preprint ar Xiv:2202.01017, 2022. Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013. Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018. Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Icml, volume 99, pages 278 287, 1999. Antonio Orvieto, Simon Lacoste-Julien, and Nicolas Loizou. Dynamics of sgd with stochastic polyak stepsizes: Truly adaptive variants and convergence to exact solution. Advances in Neural Information Processing Systems, 35: 26943 26954, 2022. Boris T Polyak. Introduction to optimization. 1987. 
Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. Advances in Neural Information Processing Systems, 26, 2013. Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International conference on learning representations, 2017. Tom Schaul, Sixin Zhang, and Yann Le Cun. No more pesky learning rates. In International conference on machine learning, pages 343 351. PMLR, 2013. Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. Advances in neural information processing systems, 31, 2018. Jialin Song, Yuxin Chen, and Yisong Yue. A general framework for multi-fidelity bayesian optimization with gaussian processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3158 3167. PMLR, 2019. Shion Takeno, Hitoshi Fukuoka, Yuhki Tsukada, Toshiyuki Koyama, Motoki Shiga, Ichiro Takeuchi, and Masayuki Karasuyama. Multi-fidelity bayesian optimization with max-value entropy search and its parallelization. In International Conference on Machine Learning, pages 9334 9345. PMLR, 2020. Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. ar Xiv preprint ar Xiv:2501.12599, 2025. Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. Advances in neural information processing systems, 30, 2017. Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback. ar Xiv preprint ar Xiv:2211.14275, 2022. Vivek Veeriah, Matteo Hessel, Zhongwen Xu, Janarthanan Rajendran, Richard L Lewis, Junhyuk Oh, Hado P van Hasselt, David Silver, and Satinder Singh. Discovery of useful questions as auxiliary tasks. Advances in Neural Information Processing Systems, 32, 2019. Xiaoyu Wang, Mikael Johansson, and Tong Zhang. Generalized polyak step size for first order optimization with momentum. In International Conference on Machine Learning, pages 35836 35863. PMLR, 2023. Yijia Wang, Matthias Poloczek, and Daniel R Jiang. Dynamic subgoal-based exploration via bayesian optimization. ar Xiv preprint ar Xiv:1910.09143, 2019. Jian Wu, Saul Toscano-Palmerin, Peter I Frazier, and Andrew Gordon Wilson. Practical multi-fidelity bayesian optimization for hyperparameter tuning. In Uncertainty in Artificial Intelligence, pages 788 798. PMLR, 2020. Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, and Michael Mahoney. Adahessian: An adaptive second order optimizer for machine learning. In proceedings of the AAAI conference on artificial intelligence, volume 35, pages 10665 10673, 2021. Jiaxiang Yi, Fangliang Wu, Qi Zhou, Yuansheng Cheng, Hao Ling, and Jun Liu. An active-learning method based on multi-fidelity kriging model for structural reliability analysis. Structural and Multidisciplinary Optimization, 63:173 195, 2021. Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824 5836, 2020. Qi Zhang, Peiyao Xiao, Kaiyi Ji, and Shaofeng Zou. On the convergence of multi-objective optimization under generalized smoothness. ar Xiv preprint ar Xiv:2405.19440, 2024. 
Aligned Multi Objective Optimization Shiji Zhou, Wenpeng Zhang, Jiyan Jiang, Wenliang Zhong, Jinjie Gu, and Wenwu Zhu. On the convergence of stochastic multi-objective gradient manipulation and beyond. Advances in Neural Information Processing Systems, 35:38103 38115, 2022. Aligned Multi Objective Optimization A. Additional Experimental Details We share an IPython notebook in which we implement our algorithms and was used to generate all plots accompanies this submission. Dataset. We sample 200 points from an independent uniform distribution xi Uniform([ 1, 1]20). We generate target data from a randomly initialized network t(xi) = hθ (xi) + 10. The target dimension is 7. Network architecture. We choose the ground truth network and target network to have the same architecture. Both are 2 layer neural networks with 512 hidden dimensions and Re Lu activation. The neural network outputs a vector in dimension 7. Training. For both problems, we set the learning rate of SGD to 0.0005 and ADAM to 0.005, we use the same learning rates for PAMOO. For CAMOOwe multiply the learning rate by the number of loss functions, 3. While equal weighting sets all weights to 1, CAMOOis constrained to set the sum of all weights to one, our adjustment accounts for this normalization. General parameters for CAMOO. We set the number of samples for the Hutchinson method to be NHutch = 10. Namley, we estimate the Hessian matrices by averaging NHutch = 10 estimates obtained from the Hutchnison method. Further, at each training step we perform a single update of the weights based on the PU update rule of Cen et al. (2021) to solve the max-min Bilinear optimization problem (see Section 4.1). We use their primal-dual algorithm, and choose the learning rate as they specified 1/(2 maxi,j |Aij| + τCAMOO) where A is the matrix in the Bilinear optimization problem and τCAMOO = 0.01 is a regularization parameter we choose. Additionally, we set the number of iterations of the primal-dual algorithm to be 100 per-step. We did not choose either parameter with great care, exploring best settings is part of future explorations. General parameters for PAMOO. We solved the constraint convex optimization problem in PAMOO (see Algorithm 3) via the projected GD algorithm, where the projection on Rm + is done by clipping negative values to 10 6. We set the learning rate to be 3e 3, and added a small regularization J x Jx J x Jx + τPAMOOI to avoid exploding weights, where τPAMOO = 1e 4. Additional Plots of Experiments in Section 7. For completeness, we present the three loss functions of the objectives {fi}i [3] and the weighted loss as a function of GD iterates. Figure 3 depicts the losses for the local curvature example instance, and Figure 4 depicts the losses for the selection example instance. Unlike EW, CAMOO and PAMOO adaptively modify the weight vector and adjust it to current parameters. We measured the weight vector as a function of GD iterates and present the results in Figure 9. Local curvature example instance. For the local curvature example instance we expect to see a switching behavior in CAMOO (see Figure 5). Namely, when the loss is large, the weight vector of CAMOO should assign most of its weight to the quartic loss function, f3, since it has the largest curvature for large deviations from optimality. For small deviations, we expect CAMOO to assign weights to the quadratic loss, f1. Selection example instance. 
For the selection example instance we expect CAMOO to set the weights as a 1-hot vector on f1, since its the quadratic function with largest curvature across all dimensions. Figure 7 exemplifies this. Aligned Multi Objective Optimization Figure 3: Local curvature example, all loss functions. Figure 4: Selection example, all loss functions. Aligned Multi Objective Optimization Figure 5: Local curvature example, CAMOO. The weights flip when the curvature of the loss function changes. Figure 6: Local curvature example, PAMOO. Figure 7: Selection example, CAMOO. Figure 8: Selection example, PAMOO. Figure 9: Weight vector evolution versus GD iterates. A.1. Toy Experiments: ϵ Approximate AMOO We include additional experiment for the approximate AMOO setting, in which there exists a near-optimal solution for all objectives. We consider 3 loss functions of the local curvature example and add an approximation error as follows: (hθ(x) hθ (x) + ϵi) Hi(hθ(x) hθ (x)) + ϵi αi, where we set Hi = I and ϵi {0, ϵ, ϵ} where ϵ = 0.01. Aligned Multi Objective Optimization Figure 10: Local curvature example for ϵ-approximate AMOO, all loss functions. A.2. Toy Experiments: Reducing Computational Complexity of PAMOO Both CAMOO and PAMOO solve convex optimization problems in their inner loop to update the weight vector w. In our implementations we use simple first-order algorithms that apply N gradient-descent updates as detailed in Section 4 and Section 5. The results in the main paper depicts the performance of CAMOO and PAMOO when N = 100 and N = 1000, respectively. For these values, both CAMOO and PAMOO are significantly more costly compared to the EW algorithm. Here we experiment we cheaper variations of both CAMOO and PAMOO in which we set N = 10, and, hence, reduce the number of updates to the weight vector. Table 2 shows the improvement in running time of CAMOO and PAMOO as well as comparing their running time to EW. As observed, the running time of PAMOO can be significantly improved whereas the running time of CAMOO is worse than EW. This is caused by the need to estimate Hessian information which can be costly without careful optimization. Figure 11 depicts the performance of CAMOO and PAMOO when N = 10. The performance of both algorithms degraded on the expense of improved running time. This shows that as the running time is improved algorithm designer may expect to experience performance degradation. Aligned Multi Objective Optimization Algorithm Runtime (s) EW Algorithms EW-SGD 0.408452 EW-ADAM 0.480058 CAMOO and PAMOO PAMOO-SGD 8.654938 CAMOO-SGD 6.695782 PAMOO-ADAM 8.435489 CAMOO-ADAM 7.466246 CAMOO and PAMOO with N = 10 PAMOO-SGD 0.715199 CAMOO-SGD 3.674786 CAMOO-ADAM 3.598157 PAMOO-ADAM 0.726002 Table 2: Runtime in seconds. Figure 11: Local curvature example with reduced computational complexity (N = 10). Aligned Multi Objective Optimization B. Challenges in Analyzing CAMOO B.1. fwk is not a Strongly Convex Function We provide an example that shows that for the AMOO setting, when µG > 0, the fact that λmin 2fwk(xk) µG throughout the iterates of CAMOO does not imply that each fwk is µG-strongly convex. The counter example is the local curvature example (see Section 3), namely, f1(x) = exp(x) x, f2(x) = exp( x) + x, and x R. The minimum point of both f1 and f2 is at x = 0. The Hessians of the functions are: 2f1(x) = exp(x), 2f2(x) = exp( x). (10) We see that 2f1(x) > 1 > 2f2(x) for x > 0, 2f1(x) < 1 < 2f2(x) for x <= 0, and 2f1(0) = 2f2(0) = 1. 
This implies that µG = 1, and CAMOO will set the weight vector as w+ = (1, 0) for x > 0 and w = (0, 1) for x 0. However, it is readily observed that neither fw or fw+ are 1-strongly convex functions. The smallest value of the individual Hessians is zero since inf x R 2fw (x) = inf x R exp(x) = 0 inf x R 2fw+(x) = inf x R exp( x) = 0, by Eq. (10). Hence, the functions fwk produced in the iterates of CAMOO are not strongly convex. B.2. Weighted function is not necessarily convex We provide a simple counter example that shows failure of a naive reduction in which we construct fw (x)(x) = P i w ,i(x)fi(x), where w (x) optimizes the curvature at each point via, w (x) arg max w m λmin i [m] wi 2fi(xk) and apply gradient descent to the function fw (x). The reason such approach is problematic is the fact that fw (x) may no longer be a convex function. This is a result of the extra dependence on x of w (x). A counter example can be established for a simple scenario in which f1(x) = x2, f2(x) = x2 + c, where x R, for some c = 0. The two functions are convex, and have a minimizer at x = 0. However, since the Hessians of the functions are equal, 2f1(x) = 2f2(x), the solution of the optimization problem w (x) arg max w m λmin i {1,2} wi 2fi(x) is arbitrary. Namely, each point on the simplex is a solution of this optimization problem. For example, choosing w (x) = (sin2(x), cos2(x)), is a solution of Eq. (11). With this, the function fw (x) takes the form of fw (x) = x2 + c cos2(x), which is not a convex function if, for example, |c| > 1. Aligned Multi Objective Optimization Additionally, choosing w (x) as non-smooth function, for example ( (0, 1) x > 0 (1, 0) x 0, results with a non-smooth function fw (x). This highlights that differentiating the function fw (x) is flawed from a theoretical perspective: fw (x) is not necessarily convex nor smooth. C. Preliminaries and Basic Properties In this section, we formally provide our working assumptions. We assume access to multi-objective feedback with m objectives F(x) = (f1(x), . . . , fm(x)), when i [m] the function fi(x) is smooth and self-concordant. Considering two scenarios AMOO, and ϵ-AAMOO. In AMOO we assume the functions are aligned in the sense of Eq. (1), namely, that they share an optimal solution. In ϵ-AAMOO we assume there is a (non-empty) set of ϵ approximate solutions in the sense of Eq. (7), i.e. every solution in the set has a maximum value gap of ϵ from the minima of every function. We define the following quantities, for the single and multi objective settings: y 2 x := y 2f(x) := 2f(x)y, y y 2 x,w := y 2fw(x) := 2fw(x)y, y . A smooth function satisfies the following: Lemma C.1 (Standard result, E.g., 9.17 Boyd and Vandenberghe (2004)). Let f : Rn R a β-smooth over Rn, and let x arg minx f(x). Then, for every x Rn it holds that f(x) 2 2β (f(x) f(x )) . A self-concordant function satisfies the following: Lemma C.2 (Theorem 5.1.8 & Lemma 5.1.5, Nesterov (2013)). Let f : Rn R be an Mf self-concordant function. Let x, y Rn , we have f(y) f(x) + f(x), y x + 1 M 2 f ω (Mf y x x) , where ω(t) := t ln(1 t), and, for any t > 0, ω(t) t2 2(1+t). Lemma C.3 (Theorem 5.1.1, Nesterov et al. (2018)). Let f1, f2 : Rn R be Mf self-concordant functions. Let w1, w2 > 0. Then, f = w1f1 + w2f2 is M = maxi{ 1 wi }Mf self-concordant function. Proposition C.1 (Weighted sum of self-concordant functions). Let {fi : Rn R}n i=1 be Mf self-concordant functions. Let {wi > 0}. Then, f = Pn i=1 wifi is M = maxi{ 1 wi }Mf self-concordant function. 
C. Preliminaries and Basic Properties

In this section, we formally state our working assumptions. We assume access to multi-objective feedback with $m$ objectives, $F(x) = (f_1(x), \ldots, f_m(x))$, where for every $i \in [m]$ the function $f_i$ is smooth and self-concordant. We consider two scenarios: AMOO and ϵ-AAMOO. In AMOO we assume the functions are aligned in the sense of Eq. (1), namely, that they share an optimal solution. In ϵ-AAMOO we assume there is a (non-empty) set of ϵ-approximate solutions in the sense of Eq. (7), i.e., every solution in the set has a value gap of at most ϵ from the minimum of every function. We define the following quantities, for the single- and multi-objective settings:
$$\|y\|_x^2 := \|y\|^2_{\nabla^2 f(x)} := \langle \nabla^2 f(x)\, y, y \rangle, \qquad \|y\|_{x,w}^2 := \|y\|^2_{\nabla^2 f_w(x)} := \langle \nabla^2 f_w(x)\, y, y \rangle.$$

A smooth function satisfies the following:

Lemma C.1 (Standard result, e.g., 9.17 in Boyd and Vandenberghe (2004)). Let $f : \mathbb{R}^n \to \mathbb{R}$ be $\beta$-smooth over $\mathbb{R}^n$, and let $x^\star \in \arg\min_x f(x)$. Then, for every $x \in \mathbb{R}^n$ it holds that $\|\nabla f(x)\|^2 \le 2\beta\,\big(f(x) - f(x^\star)\big)$.

A self-concordant function satisfies the following:

Lemma C.2 (Theorem 5.1.8 & Lemma 5.1.5, Nesterov (2013)). Let $f : \mathbb{R}^n \to \mathbb{R}$ be an $M_f$ self-concordant function. For any $x, y \in \mathbb{R}^n$ we have
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{M_f^2}\,\omega\!\left(M_f \|y - x\|_x\right),$$
where $\omega(t) := t - \ln(1 + t)$ and, for any $t > 0$, $\omega(t) \ge \frac{t^2}{2(1+t)}$.

Lemma C.3 (Theorem 5.1.1, Nesterov et al. (2018)). Let $f_1, f_2 : \mathbb{R}^n \to \mathbb{R}$ be $M_f$ self-concordant functions and let $w_1, w_2 > 0$. Then $f = w_1 f_1 + w_2 f_2$ is an $M = \max_i\{1/\sqrt{w_i}\}\, M_f$ self-concordant function.

Proposition C.1 (Weighted sum of self-concordant functions). Let $\{f_i : \mathbb{R}^n \to \mathbb{R}\}_{i=1}^m$ be $M_f$ self-concordant functions and let $\{w_i > 0\}$. Then $f = \sum_{i=1}^m w_i f_i$ is an $M = \max_i\{1/\sqrt{w_i}\}\, M_f$ self-concordant function.

The proof of the last proposition is found in Section F.

Theorem C.1 (Weyl's Theorem). Let $A$ and $\Delta$ be symmetric matrices in $\mathbb{R}^{n \times n}$, and let $\lambda_j(A)$ be the $j$th largest eigenvalue of $A$. Then, for all $j \in [n]$ it holds that $|\lambda_j(A) - \lambda_j(A + \Delta)| \le \|\Delta\|_2$, where $\|\Delta\|_2$ is the spectral norm of $\Delta$.

C.1. Auxiliary Results

Further, we have the following simple consequence of the AMOO setting.

Lemma C.4. For all $w \in \Delta_m$ and $x \in \mathbb{R}^n$ it holds that $f_w(x) - f_w(x^\star) \ge 0$.

Proof. Observe that $f_w(x) - f_w(x^\star) = \sum_{i=1}^m w_i\,\big(f_i(x) - f_i(x^\star)\big)$. Since $x^\star$ is an optimal solution of every objective, it holds that $f_i(x) - f_i(x^\star) \ge 0$. The lemma follows from the fact that $w_i \ge 0$ for all $i \in [m]$.

Lemma C.5 (Recurrence bound, AMOO). Let $\{r_k\}_{k \ge 0}$ be a sequence of non-negative real numbers that satisfies the recurrence relation
$$r_{k+1}^2 \le r_k^2 - \alpha_1 \frac{r_k^2}{1 + \alpha_2 r_k},$$
where $\alpha_1 \in [0, 2)$ and $\alpha_2 \in \mathbb{R}_+$. Let $k_0 := \lceil 4(r_0\alpha_2 - 1)/\alpha_1 \rceil$. Then $r_k$ is bounded by
$$r_k \le \begin{cases} \left(1 - \frac{\alpha_1}{2}\right)^{(k - k_0)/2} r_{k_0} & k \ge k_0 \\ r_0 - \frac{\alpha_1}{4\alpha_2}\, k & \text{o.w.} \end{cases}$$

The proof of the lemma is found in Appendix F.3. The next lemma is for the ϵ-AAMOO setting. We state it separately since the recursion and the resulting bound carry additional terms, which would make a combined statement less readable.

Lemma C.6 (Recurrence bound, ϵ-AAMOO). Let $\{r_k\}_{k \ge 0}$ be a sequence of non-negative real numbers that satisfies the recurrence relation
$$r_{k+1}^2 \le r_k^2 - \alpha_1 \frac{r_k^2}{1 + \alpha_2 r_k} + \alpha_3 + \alpha_4 r_k,$$
where $\alpha_1 \in (0, 2)$, $\alpha_2 \in \mathbb{R}_+$, $\alpha_3 \le \frac{\alpha_1^2}{256\alpha_2^2}$, and $\alpha_4 \le \frac{\alpha_1}{4\alpha_2}$. Let $k_0 := \lceil 16(r_0\alpha_2 - 1)/\alpha_1 \rceil$. Then $r_k$ is bounded by
$$r_k \le \begin{cases} \left(1 - \frac{\alpha_1}{2}\right)^{(k - k_0)/2} r_{k_0} + \sqrt{\frac{2\alpha_3}{\alpha_1} + \frac{2\alpha_4}{\alpha_1\alpha_2}} & k \ge k_0 \\ r_0 - \frac{\alpha_1}{16\alpha_2}\, k & \text{o.w.} \end{cases}$$
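The short simulation below (illustrative parameter values, not taken from the paper) runs the worst-case recurrence of Lemma C.5 and compares it with the two-regime bound: an additive decrease of $\alpha_1/(4\alpha_2)$ per step while $r_k > 1/\alpha_2$, followed by a geometric decay.

```python
import numpy as np

# Illustrative constants; alpha1 in (0, 2), alpha2 > 0, r0 > 1/alpha2.
alpha1, alpha2, r0 = 0.5, 10.0, 5.0
k0 = int(np.ceil(4 * (r0 * alpha2 - 1) / alpha1))

# Run the recurrence with equality (the worst case allowed by the lemma).
r = [r0]
for _ in range(k0 + 200):
    rk = r[-1]
    r.append(np.sqrt(max(rk ** 2 - alpha1 * rk ** 2 / (1 + alpha2 * rk), 0.0)))

for k in [0, k0 // 2, k0, k0 + 50, k0 + 200]:
    if k <= k0:
        bound = r0 - alpha1 / (4 * alpha2) * k                  # incremental regime
    else:
        bound = (1 - alpha1 / 2) ** ((k - k0) / 2) * r[k0]      # linear regime
    print(f"k={k:5d}  r_k={r[k]:.4e}  bound={bound:.4e}")
```

In every printed row the simulated $r_k$ stays below the corresponding bound, matching the statement of the lemma.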
D. Proofs of AMOO Results

See Section 4 for a high-level description of the key steps in the proof. We first state two useful lemmas that are used throughout this section.

Lemma D.1 (Weighted Function is Self-Concordant). For any iteration $k$ of CAMOO, the function $f_{w_k}$ is $\frac{1}{\sqrt{w_{\min}}}M_f$ self-concordant.

Proof. This is a direct consequence of Proposition C.1 and the fact that Algorithm 1 sets the weights by optimizing over a set in which every weight is lower bounded by $w_{\min}$.

Lemma D.2 (Continuity of Minimal Eigenvalue of Hessian). Let $x \in \mathbb{R}^n$, and let
$$w^\star \in \arg\max_{w \in \Delta_m} \lambda_{\min}\!\left(\nabla^2 f_w(x)\right), \qquad \hat w \in \arg\max_{w \in \Delta_{m, w_{\min}}} \lambda_{\min}\!\left(\nabla^2 f_w(x)\right).$$
It holds that $\lambda_{\min}\!\left(\nabla^2 f_{\hat w}(x)\right) \ge \lambda_{\min}\!\left(\nabla^2 f_{w^\star}(x)\right) - 2 m w_{\min} \beta$.

With these two results, we are ready to prove Theorem 4.2 and Theorem 5.2.

D.1. Proof of Theorem 4.2

We restate the theorem first.

Theorem 4.2 ($\mu_G$ Convergence of CAMOO). Suppose $\{f_i\}_{i \in [m]}$ are $\beta$-smooth, $M_f$ self-concordant, share an optimal solution $x^\star$, and that $\mu_G > 0$. Let $k_0 := \frac{16\beta\left(3\sqrt{m}\,\beta M_f\,\|x_0 - x^\star\|/\sqrt{\mu_G} - 1\right)}{3\mu_G}$, where $\|\cdot\|$ is the Euclidean norm. Then Weighted-GD instantiated with the CAMOO weight-optimizer and $\eta = 1/2\beta$ converges with rate
$$\|x_k - x^\star\| \le \begin{cases} \left(1 - \frac{3\mu_G}{8\beta}\right)^{(k - k_0)/2} \|x_{k_0} - x^\star\| & k \ge k_0 \\ \|x_0 - x^\star\| - k\,\frac{\mu_G^{3/2}}{16\beta^2 \sqrt{m}\, M_f} & \text{o.w.} \end{cases}$$

Proof. At each iteration, Algorithm 1 obtains $w_k \in \arg\max_{w \in \Delta_{m, w_{\min}}} \lambda_{\min}\left(\sum_i w_i \nabla^2 f_i(x_k)\right)$. Using $\max_{w \in \Delta_m} \lambda_{\min}\left(\nabla^2 f_w(x_k)\right) \ge \mu_G$ (the definition of $\mu_G$), Lemma D.2, and the choice $w_{\min} = \mu_G/(8 m \beta)$, we have that
$$\lambda_{\min}\!\left(\nabla^2 f_{w_k}(x_k)\right) \ge \lambda_{\min}\!\left(\nabla^2 f_{w^\star}(x_k)\right) - 2 m w_{\min} \beta \ge \mu_G - \mu_G/4 \ge (3/4)\mu_G \qquad (12)$$
for all iterations $k$. Recall that the update rule is $x_{k+1} = x_k - \eta \nabla f_{w_k}(x_k)$, where $\eta$ is the step size. Then, for every $x \in \mathbb{R}^n$ we have
$$\|x_{k+1} - x\|^2 = \|x_k - \eta \nabla f_{w_k}(x_k) - x\|^2 = \|x_k - x\|^2 - 2\eta \langle \nabla f_{w_k}(x_k), x_k - x \rangle + \eta^2 \|\nabla f_{w_k}(x_k)\|^2. \qquad (13)$$
By Lemma D.1, $f_{w_k}$ is
$$\widehat M_f := \tfrac{1}{\sqrt{w_{\min}}} M_f \le 3\sqrt{m\beta}\,M_f/\sqrt{\mu_G} \qquad (14)$$
self-concordant. Then, from Lemma C.2, by properties of self-concordant functions, for every $x \in \mathbb{R}^n$ we have
$$\langle \nabla f_{w_k}(x_k), x_k - x \rangle \ge f_{w_k}(x_k) - f_{w_k}(x) + \frac{1}{\widehat M_f^{\,2}}\, \omega\!\left(\widehat M_f \|x - x_k\|_{x_k, w_k}\right).$$
Plugging this inequality into Eq. (13) implies that
$$\|x_{k+1} - x\|^2 \le \|x_k - x\|^2 - 2\eta\left(f_{w_k}(x_k) - f_{w_k}(x) + \frac{1}{\widehat M_f^{\,2}}\, \omega\!\left(\widehat M_f \|x - x_k\|_{x_k, w_k}\right)\right) + \eta^2 \|\nabla f_{w_k}(x_k)\|^2. \qquad (15)$$
Recall that $x^\star \in \arg\min_x f_i(x)$ for every $i \in [m]$. Since $f_{w_k}$ is $\beta$-smooth, by using Lemma C.1 and plugging in $x = x^\star$ we have
$$\|x_{k+1} - x^\star\|^2 \le \|x_k - x^\star\|^2 - 2\eta\left(f_{w_k}(x_k) - f_{w_k}(x^\star) + \frac{1}{\widehat M_f^{\,2}}\, \omega\!\left(\widehat M_f \|x^\star - x_k\|_{x_k, w_k}\right)\right) + 2\beta\eta^2\,\big(f_{w_k}(x_k) - f_{w_k}(x^\star)\big)$$
$$= \|x_k - x^\star\|^2 - \frac{2\eta}{\widehat M_f^{\,2}}\, \omega\!\left(\widehat M_f \|x^\star - x_k\|_{x_k, w_k}\right) + 2\eta(\beta\eta - 1)\,\big(f_{w_k}(x_k) - f_{w_k}(x^\star)\big).$$
By Lemma C.4 it holds that $f_{w_k}(x_k) - f_{w_k}(x^\star) \ge 0$, and since $0 < \eta \le 1/2\beta$, it holds that $2\eta(\beta\eta - 1)\big(f_{w_k}(x_k) - f_{w_k}(x^\star)\big) \le 0$. Then, by using the lower bound from Lemma C.2, i.e. $\omega(t) \ge \frac{t^2}{2(1+t)}$, the following holds:
$$\|x_{k+1} - x^\star\|^2 \le \|x_k - x^\star\|^2 - \frac{2\eta\,\|x^\star - x_k\|^2_{x_k, w_k}}{1 + \widehat M_f\,\|x^\star - x_k\|_{x_k, w_k}}.$$
Note that $\lambda_{\min}\left(\nabla^2 f_{w_k}(x_k)\right)\|x^\star - x_k\|^2 \le \|x^\star - x_k\|^2_{x_k, w_k} \le \lambda_{\max}\left(\nabla^2 f_{w_k}(x_k)\right)\|x^\star - x_k\|^2$. By using $\lambda_{\min}\left(\nabla^2 f_{w_k}(x_k)\right) \ge (3/4)\mu_G$ (Eq. (12)), $\lambda_{\max}\left(\nabla^2 f_{w_k}(x_k)\right) \le \beta$ (smoothness), $\widehat M_f \le 3\sqrt{m\beta}\,M_f/\sqrt{\mu_G}$ (Eq. (14)), and $\eta = 1/2\beta$, we obtain
$$\|x_{k+1} - x^\star\|^2 \le \|x_k - x^\star\|^2 - \frac{3\mu_G}{4\beta}\cdot\frac{\|x^\star - x_k\|^2}{1 + \left(3\sqrt{m}\,\beta M_f/\sqrt{\mu_G}\right)\|x^\star - x_k\|}.$$
Now we are ready for the last step. Denote $\alpha_1 = \frac{3\mu_G}{4\beta}$ and $\alpha_2 = \frac{3\sqrt{m}\,\beta M_f}{\sqrt{\mu_G}}$. Since $\alpha_1 \in (0, 1]$, $\alpha_2 \in \mathbb{R}_+$, and $\|x - x^\star\| \in \mathbb{R}_+$ for every $x \in \mathbb{R}^n$, we arrive at the recurrence relation analyzed in Lemma C.5. Then we obtain
$$\|x_k - x^\star\| \le \begin{cases} \left(1 - \frac{3\mu_G}{8\beta}\right)^{(k - k_0)/2} \|x_{k_0} - x^\star\| & k \ge k_0 \\ \|x_0 - x^\star\| - k\,\frac{\mu_G^{3/2}}{16\beta^2 \sqrt{m}\, M_f} & \text{o.w.,} \end{cases}$$
where $k_0 := \frac{16\beta\left(3\sqrt{m}\,\beta M_f\,\|x_0 - x^\star\|/\sqrt{\mu_G} - 1\right)}{3\mu_G}$.
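As an illustration of the update analyzed above, the following toy sketch (illustrative, not the paper's code) runs Weighted-GD with $\eta = 1/(2\beta)$ on two aligned quadratics, selecting the weights by a grid search over the restricted simplex for $m = 2$; the functions and constants are made up for the example.

```python
import numpy as np

# Two aligned quadratics f_i(x) = 0.5 * x' A_i x, both minimized at x* = 0.
A1 = np.diag([4.0, 0.01])
A2 = np.diag([0.01, 4.0])
grads = lambda x: [A1 @ x, A2 @ x]
hessians = [A1, A2]                                       # constant for quadratics

beta = max(np.linalg.eigvalsh(H)[-1] for H in hessians)   # smoothness constant
ts = np.linspace(0.0, 1.0, 1001)                           # weights w = (t, 1 - t)
lam = lambda t: np.linalg.eigvalsh(t * A1 + (1 - t) * A2)[0]
mu_G = max(lam(t) for t in ts)                             # best achievable curvature
w_min = mu_G / (8 * 2 * beta)                              # w_min = mu_G / (8 m beta), m = 2
eta = 1.0 / (2.0 * beta)

x = np.array([1.0, -1.0])
for k in range(201):
    t = max(ts[(ts >= w_min) & (ts <= 1 - w_min)], key=lam)   # CAMOO-style weights
    x = x - eta * (t * grads(x)[0] + (1 - t) * grads(x)[1])
    if k % 50 == 0:
        print(k, np.linalg.norm(x))
```

In this symmetric toy the curvature-optimal weights coincide with equal weights, so the snippet only demonstrates the mechanics of the Weighted-GD update, not the advantage over the EW baseline.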
D.2. Proof of Theorem 5.2

We restate the theorem first.

Theorem 5.2 ($\mu_L$ Convergence of PAMOO). Suppose $\{f_i\}_{i \in [m]}$ are $\beta$-smooth, $M_f$ self-concordant, share an optimal solution $x^\star$, and $\mu_L > 0$. Let $k_0 := \frac{64\beta\left(3\sqrt{m}\,\beta M_f\,\|x_0 - x^\star\|/\sqrt{\mu_L} - 1\right)}{3\mu_L}$, where $\|\cdot\|$ is the Euclidean norm. Then Weighted-GD instantiated with the PAMOO weight-optimizer and $\eta = 1$ converges with rate
$$\|x_k - x^\star\| \le \begin{cases} \left(1 - \frac{3\mu_L}{32\beta}\right)^{(k - k_0)/2} \|x_{k_0} - x^\star\| & k \ge k_0 \\ \|x_0 - x^\star\| - k\,\frac{\mu_L^{3/2}}{64\beta^2 \sqrt{m}\, M_f} & \text{o.w.} \end{cases}$$

Proof. Recall that for every $w \in \mathbb{R}^m_+$, $f_w$ is a convex function. Hence, for every $x, y \in \mathbb{R}^n$ it holds that $f_w(x) - f_w(y) \le \nabla f_w(x)^\top (x - y)$. Recall that the step size is $\eta = 1$, so the update rule is $x_{k+1} = x_k - \nabla f_{w_k}(x_k)$. Then, by using the previous inequality, for every $x \in \mathbb{R}^n$ we have
$$\|x_{k+1} - x\|^2 = \|x_k - x\|^2 - 2\langle \nabla f_{w_k}(x_k), x_k - x\rangle + \|\nabla f_{w_k}(x_k)\|^2 \le \|x_k - x\|^2 - 2\big(f_{w_k}(x_k) - f_{w_k}(x)\big) + \|\nabla f_{w_k}(x_k)\|^2. \qquad (16)$$
Recall that $x^\star \in \arg\min_{x\in\mathbb{R}^n} f_i(x)$ for every $i \in [m]$. Since the update rule of PAMOO is
$$w_k \in \arg\max_{w \in \mathbb{R}^m_+}\; 2\big(f_w(x_k) - f_w(x^\star)\big) - \|\nabla f_w(x_k)\|^2,$$
the following holds:
$$\|x_{k+1} - x^\star\|^2 \le \|x_k - x^\star\|^2 - \max_{w \in \mathbb{R}^m_+}\left\{2\big(f_w(x_k) - f_w(x^\star)\big) - \|\nabla f_w(x_k)\|^2\right\}. \qquad (17)$$
Denote $w^\star \in \arg\max_{w\in\Delta_{m,w_{\min}}} \lambda_{\min}\left(\sum_{i=1}^m w_i \nabla^2 f_i(x^\star)\right)$, $a_k = f_{w^\star}(x_k) - f_{w^\star}(x^\star)$, and $b_k = \|\nabla f_{w^\star}(x_k)\|^2$. Letting $w(x_k) = w^\star \frac{a_k}{b_k} \in \mathbb{R}^m_+$, we can lower bound the maximum in the last expression as follows:
$$\max_{w\in\mathbb{R}^m_+}\left[2\big(f_w(x_k) - f_w(x^\star)\big) - \|\nabla f_w(x_k)\|^2\right] \ge 2\big(f_{w(x_k)}(x_k) - f_{w(x_k)}(x^\star)\big) - \|\nabla f_{w(x_k)}(x_k)\|^2 = \frac{\big(f_{w^\star}(x_k) - f_{w^\star}(x^\star)\big)^2}{\|\nabla f_{w^\star}(x_k)\|^2}.$$
Since $w^\star \in \Delta_{m,w_{\min}}$, the function $f_{w^\star}$ is $\beta$-smooth. Then $\|\nabla f_{w^\star}(x)\|^2 \le 2\beta\big(f_{w^\star}(x) - f_{w^\star}(x^\star)\big)$ for every $x$, and we have
$$\max_{w\in\mathbb{R}^m_+}\left[2\big(f_w(x_k) - f_w(x^\star)\big) - \|\nabla f_w(x_k)\|^2\right] \ge \frac{f_{w^\star}(x_k) - f_{w^\star}(x^\star)}{2\beta}.$$
Plugging this into Eq. (17), we arrive at
$$\|x_{k+1} - x^\star\|^2 \le \|x_k - x^\star\|^2 - \frac{f_{w^\star}(x_k) - f_{w^\star}(x^\star)}{2\beta}.$$
By Lemma D.1, $f_{w^\star}$ is $\widehat M_f := \frac{1}{\sqrt{w_{\min}}}M_f$ self-concordant. Then, from Lemma C.2 and its lower bound, i.e. $\omega(t) \ge \frac{t^2}{2(1+t)}$, we have
$$f_{w^\star}(x_k) - f_{w^\star}(x^\star) \ge \langle\nabla f_{w^\star}(x^\star), x_k - x^\star\rangle + \frac{\|x_k - x^\star\|^2_{x^\star,w^\star}}{2\left(1 + \widehat M_f\|x_k - x^\star\|_{x^\star,w^\star}\right)} = \frac{\|x_k - x^\star\|^2_{x^\star,w^\star}}{2\left(1 + \widehat M_f\|x_k - x^\star\|_{x^\star,w^\star}\right)}.$$
The equality is due to the optimality condition, $\nabla f_i(x^\star) = 0$ for every $i \in [m]$, and thus $\nabla f_{w^\star}(x^\star) = 0$. Combining the last two displays, we have
$$\|x_{k+1} - x^\star\|^2 \le \|x_k - x^\star\|^2 - \frac{1}{4\beta}\cdot\frac{\|x_k - x^\star\|^2_{x^\star,w^\star}}{1 + \widehat M_f\|x_k - x^\star\|_{x^\star,w^\star}}.$$
Note that $\lambda_{\min}\left(\nabla^2 f_{w^\star}(x^\star)\right)\|x^\star - x_k\|^2 \le \|x^\star - x_k\|^2_{x^\star,w^\star} \le \lambda_{\max}\left(\nabla^2 f_{w^\star}(x^\star)\right)\|x^\star - x_k\|^2$. Since $\lambda_{\max}\left(\nabla^2 f_{w^\star}(x^\star)\right) \le \beta$ (smoothness) and $w_{\min} = \mu_L/(8m\beta)$, we have $\widehat M_f \le 3\sqrt{m\beta}\,M_f/\sqrt{\mu_L}$. Thus, it holds that
$$\|x_{k+1} - x^\star\|^2 \le \|x_k - x^\star\|^2 - \frac{1}{4\beta}\cdot\frac{\lambda_{\min}\left(\nabla^2 f_{w^\star}(x^\star)\right)\|x^\star - x_k\|^2}{1 + \left(3\sqrt{m}\,M_f\beta/\sqrt{\mu_L}\right)\|x_k - x^\star\|}.$$
Using the assumption that $\max_{w\in\Delta_m}\lambda_{\min}\left(\nabla^2 f_w(x^\star)\right) \ge \mu_L$ and Lemma D.2, we have that
$$\lambda_{\min}\left(\nabla^2 f_{w^\star}(x^\star)\right) \ge \max_{w\in\Delta_m}\lambda_{\min}\left(\nabla^2 f_w(x^\star)\right) - 2mw_{\min}\beta \ge \mu_L - \mu_L/4 = (3/4)\mu_L.$$
Finally, we obtain the desired recurrence:
$$\|x_{k+1} - x^\star\|^2 \le \|x_k - x^\star\|^2 - \frac{3\mu_L}{16\beta}\cdot\frac{\|x^\star - x_k\|^2}{1 + \left(3\sqrt{m}\,M_f\beta/\sqrt{\mu_L}\right)\|x_k - x^\star\|}.$$
Denote $\alpha_1 = \frac{3\mu_L}{16\beta}$ and $\alpha_2 = \frac{3\sqrt{m}\,\beta M_f}{\sqrt{\mu_L}}$. Note that $\alpha_1 \in (0, 1]$, $\alpha_2 \in \mathbb{R}_+$, and $\|x - x^\star\| \in \mathbb{R}_+$ for every $x \in \mathbb{R}^n$; thus we arrive at the recurrence relation analyzed in Lemma C.5. Then we obtain
$$\|x_k - x^\star\| \le \begin{cases} \left(1 - \frac{3\mu_L}{32\beta}\right)^{(k-k_0)/2}\|x_{k_0} - x^\star\| & k \ge k_0 \\ \|x_0 - x^\star\| - k\,\frac{\mu_L^{3/2}}{64\beta^2\sqrt{m}\,M_f} & \text{o.w.,} \end{cases}$$
where $k_0 := \frac{64\beta\left(3\sqrt{m}\,\beta M_f\,\|x_0 - x^\star\|/\sqrt{\mu_L} - 1\right)}{3\mu_L}$.
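The PAMOO weight problem used in the proof above, $\max_{w \ge 0}\, 2(f_w(x_k) - f_w(x^\star)) - \|\nabla f_w(x_k)\|^2$, is a concave quadratic in $w$. The sketch below is an illustration only, under the assumption that the per-objective optimal values $f_i(x^\star)$ are available (e.g., zero after shifting); it is not the paper's implementation, and it solves the inner problem by projected gradient ascent over $\mathbb{R}^m_+$.

```python
import numpy as np

def pamoo_weights(grad_list, gap_list, n_steps=100, lr=None):
    """Maximize 2 * sum_i w_i * gap_i - || sum_i w_i g_i ||^2 over w >= 0.
    grad_list: per-objective gradients at x_k; gap_list: f_i(x_k) - f_i(x*),
    assumed known for this illustration. Objective gradient: 2*gaps - 2*G w."""
    J = np.stack(grad_list)             # m x n matrix of gradients
    gaps = np.asarray(gap_list)         # length-m vector of value gaps
    G = J @ J.T                         # Gram matrix of the gradients
    if lr is None:                      # keep the ascent stable
        lr = 1.0 / (2.0 * max(np.linalg.eigvalsh(G)[-1], 1e-12))
    w = np.zeros(len(gaps))
    for _ in range(n_steps):
        w = np.maximum(w + lr * (2.0 * gaps - 2.0 * G @ w), 0.0)   # ascent + clip to R^m_+
    return w

def pamoo_step(x, grad_list, gap_list):
    """One Weighted-GD step with eta = 1: x <- x - grad f_w(x)."""
    w = pamoo_weights(grad_list, gap_list)
    return x - sum(wi * g for wi, g in zip(w, grad_list))
```

The step size defaults to $1/(2\lambda_{\max}(G))$ so that the projected ascent on the concave quadratic is stable.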
E. ϵ-Approximate Solutions

In this section, we present the formal versions of the Informal Theorem 6.2 for CAMOO and PAMOO. First, we restate the definition of the ϵ-approximate solution set.

Definition E.1 (ϵ-Approximate Solution Set). Let $\epsilon \ge 0$. $C_\epsilon$ is an ϵ-Approximate Solution Set (ϵ-ASS) if for every $x \in C_\epsilon$ and every $i \in [m]$ it holds that $f_i(x) - f_i(x_i^\star) \le \epsilon$, i.e.,
$$C_\epsilon = \left\{x \in \mathbb{R}^n \;\middle|\; f_i(x) - f_i(x_i^\star) \le \epsilon \;\; \forall i \in [m]\right\},$$
where $x_i^\star \in \arg\min_{x\in\mathbb{R}^n} f_i(x)$.

We show in the following formal theorem that for $\epsilon > 0$, CAMOO converges, up to an additive $O(\sqrt{\epsilon})$ term, to any chosen point of the ϵ-approximate solution set $C_\epsilon$.

Theorem E.2 ($\mu_G$ Approximate Convergence of CAMOO). Suppose $\{f_i\}_{i\in[m]}$ are $\beta$-smooth, $M_f$ self-concordant, $C_\epsilon$ is an ϵ-ASS, and $\mu_G > 0$. Let $k_0 := \frac{64\beta\left(3\sqrt{m}\,\beta M_f\,\|x_0 - x_\epsilon^\star\|/\sqrt{\mu_G} - 1\right)}{3\mu_G}$, where $\|\cdot\|$ is the Euclidean norm, and assume $\frac{\mu_G^3}{4^6 m\beta^3 M_f^2} \ge \epsilon > 0$. Then, for every point $x_\epsilon^\star \in C_\epsilon$, Weighted-GD instantiated with the CAMOO weight-optimizer and $\eta = 1/2\beta$ converges with rate
$$\|x_k - x_\epsilon^\star\| \le \begin{cases}\left(1 - \frac{3\mu_G}{8\beta}\right)^{(k-k_0)/2}\|x_{k_0} - x_\epsilon^\star\| + \sqrt{\frac{8\epsilon}{3\mu_G}} & k \ge k_0 \\ \|x_0 - x_\epsilon^\star\| - k\,\frac{\mu_G^{3/2}}{64\beta^2\sqrt{m}\,M_f} & \text{o.w.}\end{cases}$$

Proof. Let $x_\epsilon^\star \in C_\epsilon$. Denote $x_k^\star = \arg\min_{x\in\mathbb{R}^n} f_{w_k}(x)$ for every $k$. Since $f_{w_k}$ is $\beta$-smooth, by Lemma C.1 it holds that $\|\nabla f_{w_k}(x)\|^2 \le 2\beta\big(f_{w_k}(x) - f_{w_k}(x_k^\star)\big)$. Plugging $x = x_\epsilon^\star$ into Eq. (15), we have
$$\|x_{k+1} - x_\epsilon^\star\|^2 \le \|x_k - x_\epsilon^\star\|^2 - 2\eta\left(f_{w_k}(x_k) - f_{w_k}(x_\epsilon^\star) + \frac{1}{\widehat M_f^{\,2}}\omega\!\left(\widehat M_f\|x_\epsilon^\star - x_k\|_{x_k,w_k}\right)\right) + 2\beta\eta^2\big(f_{w_k}(x_k) - f_{w_k}(x_k^\star)\big)$$
$$= \|x_k - x_\epsilon^\star\|^2 - \frac{2\eta}{\widehat M_f^{\,2}}\omega\!\left(\widehat M_f\|x_\epsilon^\star - x_k\|_{x_k,w_k}\right) + 2\eta\big(f_{w_k}(x_\epsilon^\star) - f_{w_k}(x_k^\star)\big) + 2\eta(\beta\eta - 1)\big(f_{w_k}(x_k) - f_{w_k}(x_k^\star)\big).$$
Since $f_{w_k}(x_k) - f_{w_k}(x_k^\star) \ge 0$ and $0 < \eta \le 1/2\beta$, it holds that $2\eta(\beta\eta - 1)\big(f_{w_k}(x_k) - f_{w_k}(x_k^\star)\big) \le 0$. In addition, by the lower bound from Lemma C.2, i.e. $\omega(t) \ge \frac{t^2}{2(1+t)}$, and by Definition E.1 together with $\sum_i w_i = 1$ (which implies $f_{w_k}(x_\epsilon^\star) - f_{w_k}(x_k^\star) \le \epsilon$), the following holds:
$$\|x_{k+1} - x_\epsilon^\star\|^2 \le \|x_k - x_\epsilon^\star\|^2 - \frac{2\eta\,\|x_\epsilon^\star - x_k\|^2_{x_k,w_k}}{1 + \widehat M_f\|x_\epsilon^\star - x_k\|_{x_k,w_k}} + 2\eta\epsilon.$$
Note that $\lambda_{\min}\left(\nabla^2 f_{w_k}(x_k)\right)\|x_\epsilon^\star - x_k\|^2 \le \|x_\epsilon^\star - x_k\|^2_{x_k,w_k} \le \lambda_{\max}\left(\nabla^2 f_{w_k}(x_k)\right)\|x_\epsilon^\star - x_k\|^2$. By using $\lambda_{\min}\left(\nabla^2 f_{w_k}(x_k)\right) \ge (3/4)\mu_G$ (Eq. (12)), $\lambda_{\max}\left(\nabla^2 f_{w_k}(x_k)\right) \le \beta$ (smoothness), $\widehat M_f \le 3\sqrt{m\beta}\,M_f/\sqrt{\mu_G}$ (Eq. (14)), and $\eta = 1/2\beta$, we obtain
$$\|x_{k+1} - x_\epsilon^\star\|^2 \le \|x_k - x_\epsilon^\star\|^2 - \frac{3\mu_G}{4\beta}\cdot\frac{\|x_\epsilon^\star - x_k\|^2}{1 + \left(3\sqrt{m}\,\beta M_f/\sqrt{\mu_G}\right)\|x_\epsilon^\star - x_k\|} + \frac{\epsilon}{\beta}.$$
Now we are ready for the last step. Denote $\alpha_1 = \frac{3\mu_G}{4\beta}$, $\alpha_2 = \frac{3\sqrt{m}\,\beta M_f}{\sqrt{\mu_G}}$, and $\alpha_3 = \frac{\epsilon}{\beta}$. Since $\alpha_1 \in (0, 2)$, $\alpha_2 \in \mathbb{R}_+$, $\alpha_3 \le \frac{\alpha_1^2}{256\alpha_2^2}$, and $\|x - x_\epsilon^\star\| \in \mathbb{R}_+$ for every $x \in \mathbb{R}^n$, we arrive at the recurrence relation analyzed in Lemma C.6. Then we obtain the stated rate with $k_0 := \frac{64\beta\left(3\sqrt{m}\,\beta M_f\,\|x_0 - x_\epsilon^\star\|/\sqrt{\mu_G} - 1\right)}{3\mu_G}$.

Before we continue to PAMOO, recall Definition 6.1, which defines $\mu_L^\epsilon$. This parameter is the maximal curvature over the ϵ-approximate solution set, which can be much larger than $\mu_G$ and $\mu_L$. Before stating the formal theorem for PAMOO, we define $\hat x_\epsilon^\star \in C_\epsilon$ as the point with the maximal curvature, i.e.,
$$\hat x_\epsilon^\star = \arg\max_{x_\epsilon \in C_\epsilon}\; \max_{w\in\Delta_m}\lambda_{\min}\!\left(\sum_{i=1}^m w_i\nabla^2 f_i(x_\epsilon)\right), \qquad \mu_L^\epsilon = \max_{w\in\Delta_m}\lambda_{\min}\!\left(\sum_{i=1}^m w_i\nabla^2 f_i(\hat x_\epsilon^\star)\right).$$
We now show the formal theorem stating that, for $\epsilon > 0$, PAMOO converges, up to an additive term, to $\hat x_\epsilon^\star$.

Theorem E.3 ($\mu_L^\epsilon$ Approximate Convergence of PAMOO). Suppose $\{f_i\}_{i\in[m]}$ are $\beta$-smooth, $M_f$ self-concordant, $C_\epsilon$ is an ϵ-ASS, and $\mu_L^\epsilon > 0$, where $0 < \epsilon \le \frac{(\mu_L^\epsilon)^3}{4^4\beta^4 m M_f^2}$ (the smallness condition on $\epsilon$ required by Lemma C.6). Let $k_0 := \frac{256\beta\left(3\sqrt{m}\,\beta M_f\,\|x_0 - \hat x_\epsilon^\star\|/\sqrt{\mu_L^\epsilon} - 1\right)}{3\mu_L^\epsilon}$, where $\|\cdot\|$ is the Euclidean norm. Then Weighted-GD instantiated with the PAMOO weight-optimizer and $\eta = 1$ converges to $\hat x_\epsilon^\star$ with rate
$$\|x_k - \hat x_\epsilon^\star\| \le \begin{cases}\left(1 - \frac{3\mu_L^\epsilon}{32\beta}\right)^{(k-k_0)/2}\|x_{k_0} - \hat x_\epsilon^\star\| + \sqrt{\frac{16\epsilon}{3\mu_L^\epsilon} + \frac{32\sqrt{2\beta\epsilon}}{9\sqrt{\mu_L^\epsilon\, m}\,M_f}} & k \ge k_0 \\ \|x_0 - \hat x_\epsilon^\star\| - k\,\frac{(\mu_L^\epsilon)^{3/2}}{64\beta^2\sqrt{m}\,M_f} & \text{o.w.}\end{cases}$$

Proof. Combining Eq. (16) with the update rule of PAMOO and plugging in $\hat x_\epsilon^\star$, i.e.,
$$w_k \in \arg\max_{w\in\mathbb{R}^m_+}\; 2\big(f_w(x_k) - f_w(\hat x_\epsilon^\star)\big) - \|\nabla f_w(x_k)\|^2,$$
the following holds:
$$\|x_{k+1} - \hat x_\epsilon^\star\|^2 \le \|x_k - \hat x_\epsilon^\star\|^2 - \max_{w\in\mathbb{R}^m_+}\left\{2\big(f_w(x_k) - f_w(\hat x_\epsilon^\star)\big) - \|\nabla f_w(x_k)\|^2\right\}. \qquad (18)$$
Denote $w_\epsilon^\star \in \arg\max_{w\in\Delta_{m,w_{\min}}}\lambda_{\min}\left(\sum_{i=1}^m w_i\nabla^2 f_i(\hat x_\epsilon^\star)\right)$, $a_k = f_{w_\epsilon^\star}(x_k) - f_{w_\epsilon^\star}(\hat x_\epsilon^\star)$, and $b_k = \|\nabla f_{w_\epsilon^\star}(x_k)\|^2$. Letting $w(x_k) = w_\epsilon^\star\frac{a_k}{b_k} \in \mathbb{R}^m_+$, we can lower bound the maximum in the last expression as follows:
$$\max_{w\in\mathbb{R}^m_+}\left[2\big(f_w(x_k) - f_w(\hat x_\epsilon^\star)\big) - \|\nabla f_w(x_k)\|^2\right] \ge 2\big(f_{w(x_k)}(x_k) - f_{w(x_k)}(\hat x_\epsilon^\star)\big) - \|\nabla f_{w(x_k)}(x_k)\|^2 = \frac{\big(f_{w_\epsilon^\star}(x_k) - f_{w_\epsilon^\star}(\hat x_\epsilon^\star)\big)^2}{\|\nabla f_{w_\epsilon^\star}(x_k)\|^2}.$$
Denote $x^\star \in \arg\min_x f_{w_\epsilon^\star}(x)$ and decompose $a_k = \big(f_{w_\epsilon^\star}(x_k) - f_{w_\epsilon^\star}(x^\star)\big) + \big(f_{w_\epsilon^\star}(x^\star) - f_{w_\epsilon^\star}(\hat x_\epsilon^\star)\big)$. Since $w_\epsilon^\star \in \Delta_{m,w_{\min}}$, the function $f_{w_\epsilon^\star}$ is $\beta$-smooth, so $\|\nabla f_{w_\epsilon^\star}(x)\|^2 \le 2\beta\big(f_{w_\epsilon^\star}(x) - f_{w_\epsilon^\star}(x^\star)\big)$ for every $x$. Applying this bound to both terms of the decomposition yields
$$\max_{w\in\mathbb{R}^m_+}\left[2\big(f_w(x_k) - f_w(\hat x_\epsilon^\star)\big) - \|\nabla f_w(x_k)\|^2\right] \ge \frac{f_{w_\epsilon^\star}(x_k) - f_{w_\epsilon^\star}(\hat x_\epsilon^\star)}{2\beta} + \frac{f_{w_\epsilon^\star}(x^\star) - f_{w_\epsilon^\star}(\hat x_\epsilon^\star)}{2\beta}.$$
Plugging this into Eq. (18) and using $f_{w_\epsilon^\star}(\hat x_\epsilon^\star) - f_{w_\epsilon^\star}(x^\star) \le \epsilon$ (Definition E.1 and $\sum_i w^\star_{\epsilon,i} = 1$), we arrive at
$$\|x_{k+1} - \hat x_\epsilon^\star\|^2 \le \|x_k - \hat x_\epsilon^\star\|^2 - \frac{f_{w_\epsilon^\star}(x_k) - f_{w_\epsilon^\star}(\hat x_\epsilon^\star)}{2\beta} + \frac{\epsilon}{2\beta}.$$
By Lemma D.1, $f_{w_\epsilon^\star}$ is $\widehat M_f := \frac{1}{\sqrt{w_{\min}}}M_f$ self-concordant. Then, from Lemma C.2 and its lower bound $\omega(t) \ge \frac{t^2}{2(1+t)}$, we have
$$f_{w_\epsilon^\star}(x_k) - f_{w_\epsilon^\star}(\hat x_\epsilon^\star) \ge -\|\nabla f_{w_\epsilon^\star}(\hat x_\epsilon^\star)\|\,\|x_k - \hat x_\epsilon^\star\| + \frac{\|x_k - \hat x_\epsilon^\star\|^2_{\hat x_\epsilon^\star, w_\epsilon^\star}}{2\left(1 + \widehat M_f\|x_k - \hat x_\epsilon^\star\|_{\hat x_\epsilon^\star, w_\epsilon^\star}\right)}.$$
By Lemma C.1, we have $\|\nabla f_{w_\epsilon^\star}(\hat x_\epsilon^\star)\|^2 \le 2\beta\big(f_{w_\epsilon^\star}(\hat x_\epsilon^\star) - f_{w_\epsilon^\star}(x^\star)\big) \le 2\beta\epsilon$, and hence $\|\nabla f_{w_\epsilon^\star}(\hat x_\epsilon^\star)\| \le \sqrt{2\beta\epsilon}$. Plugging this back results in
$$\|x_{k+1} - \hat x_\epsilon^\star\|^2 \le \|x_k - \hat x_\epsilon^\star\|^2 - \frac{\|x_k - \hat x_\epsilon^\star\|^2_{\hat x_\epsilon^\star, w_\epsilon^\star}}{4\beta\left(1 + \widehat M_f\|x_k - \hat x_\epsilon^\star\|_{\hat x_\epsilon^\star, w_\epsilon^\star}\right)} + \frac{\epsilon}{2\beta} + \sqrt{2\beta\epsilon}\,\|x_k - \hat x_\epsilon^\star\|.$$
Note that $\lambda_{\min}\left(\nabla^2 f_{w_\epsilon^\star}(\hat x_\epsilon^\star)\right)\|\hat x_\epsilon^\star - x_k\|^2 \le \|\hat x_\epsilon^\star - x_k\|^2_{\hat x_\epsilon^\star,w_\epsilon^\star} \le \lambda_{\max}\left(\nabla^2 f_{w_\epsilon^\star}(\hat x_\epsilon^\star)\right)\|\hat x_\epsilon^\star - x_k\|^2$. Since $\lambda_{\max}\left(\nabla^2 f_{w_\epsilon^\star}(\hat x_\epsilon^\star)\right) \le \beta$ (smoothness) and $w_{\min} = \mu_L^\epsilon/(8m\beta)$, we have $\widehat M_f \le 3\sqrt{m\beta}\,M_f/\sqrt{\mu_L^\epsilon}$. Thus it holds that
$$\|x_{k+1} - \hat x_\epsilon^\star\|^2 \le \|x_k - \hat x_\epsilon^\star\|^2 - \frac{\lambda_{\min}\left(\nabla^2 f_{w_\epsilon^\star}(\hat x_\epsilon^\star)\right)}{4\beta}\cdot\frac{\|\hat x_\epsilon^\star - x_k\|^2}{1 + \left(3\sqrt{m}\,M_f\beta/\sqrt{\mu_L^\epsilon}\right)\|x_k - \hat x_\epsilon^\star\|} + \frac{\epsilon}{2\beta} + \sqrt{2\beta\epsilon}\,\|x_k - \hat x_\epsilon^\star\|.$$
Using the assumption that $\max_{w\in\Delta_m}\lambda_{\min}\left(\nabla^2 f_w(\hat x_\epsilon^\star)\right) = \mu_L^\epsilon$ and Lemma D.2, we have that
$$\lambda_{\min}\left(\nabla^2 f_{w_\epsilon^\star}(\hat x_\epsilon^\star)\right) \ge \max_{w\in\Delta_m}\lambda_{\min}\left(\nabla^2 f_w(\hat x_\epsilon^\star)\right) - 2mw_{\min}\beta \ge \mu_L^\epsilon - \mu_L^\epsilon/4 = (3/4)\mu_L^\epsilon.$$
Finally, we obtain the desired recurrence:
$$\|x_{k+1} - \hat x_\epsilon^\star\|^2 \le \|x_k - \hat x_\epsilon^\star\|^2 - \frac{3\mu_L^\epsilon}{16\beta}\cdot\frac{\|x_k - \hat x_\epsilon^\star\|^2}{1 + \left(3\sqrt{m}\,M_f\beta/\sqrt{\mu_L^\epsilon}\right)\|x_k - \hat x_\epsilon^\star\|} + \frac{\epsilon}{2\beta} + \sqrt{2\beta\epsilon}\,\|x_k - \hat x_\epsilon^\star\|.$$
Denote $\alpha_1 = \frac{3\mu_L^\epsilon}{16\beta}$, $\alpha_2 = \frac{3\sqrt{m}\,\beta M_f}{\sqrt{\mu_L^\epsilon}}$, $\alpha_3 = \frac{\epsilon}{2\beta}$, and $\alpha_4 = \sqrt{2\beta\epsilon}$. Note that $\alpha_1 \in (0, 1)$, $\alpha_2 \in \mathbb{R}_+$, $\alpha_3 \le \frac{\alpha_1^2}{256\alpha_2^2}$, $\alpha_4 \le \frac{\alpha_1}{4\alpha_2}$, and $\|x - \hat x_\epsilon^\star\| \in \mathbb{R}_+$ for every $x \in \mathbb{R}^n$; thus we arrive at the recurrence relation analyzed in Lemma C.6. Then we obtain the stated rate with $k_0 := \frac{256\beta\left(3\sqrt{m}\,\beta M_f\,\|x_0 - \hat x_\epsilon^\star\|/\sqrt{\mu_L^\epsilon} - 1\right)}{3\mu_L^\epsilon}$.
F. Missing Proofs

F.1. Proof of Proposition 4.1

Let us restate the claim:

Proposition 4.1 (Unique Optimal Solution). Assume there exists $x^\star \in \mathbb{R}^n$ that simultaneously minimizes $\{f_i\}_{i\in[m]}$, namely, solves Eq. (1). If $\max_{w\in\Delta_m}\lambda_{\min}\left(\nabla^2 f_w(x^\star)\right) > 0$, then $x^\star$ is unique.

Proof. Let $x^\star$ be a minimizer of all functions $\{f_i\}_{i\in[m]}$, which exists by the AMOO assumption, namely, a solution of
$$x^\star \in \arg\min_x f_i(x) \quad \forall i \in [m]. \qquad (19)$$
By assumption, for the weight vector $w^\star \in \arg\max_{w\in\Delta_m}\lambda_{\min}\left(\sum_{i=1}^m w_i\nabla^2 f_i(x^\star)\right)$ it holds that $\lambda_{\min}\left(\sum_{i=1}^m w^\star_i\nabla^2 f_i(x^\star)\right) > 0$, namely,
$$\nabla^2 f_{w^\star}(x^\star) \succ 0. \qquad (20)$$
Notice that $x^\star$ is a minimizer of $f_{w^\star}$ (Lemma C.4) and that $f_{w^\star}$ is a convex function, since the $f_i$ are convex and $w^\star$ has non-negative components. Combined with Eq. (20), this implies that $x^\star$ is a unique minimizer of $f_{w^\star}$. Assume, by way of contradiction, that there exists an additional minimizer solving Eq. (19), denoted $\hat x^\star$. Since it is a solution of Eq. (19), it is also a minimizer of $f_{w^\star}$. This contradicts the fact that $f_{w^\star}$ has a unique optimal solution $x^\star$.

F.2. Proof of Proposition 4.2

The proof of Proposition 4.2 is a corollary of Theorem C.1 (Weyl's Theorem). We establish the result for a general deviation of the Hessian matrices, without requiring it to be diagonal. Let us restate the result:

Proposition 4.2. Assume that for all $i \in [m]$ and $x \in \mathbb{R}^n$, $\left\|\nabla^2 f_i(x) - \mathrm{Diag}\left(\nabla^2 f_i(x)\right)\right\|_2 \le \delta$, where $\|A\|_2$ is the spectral norm of $A \in \mathbb{R}^{n\times n}$. Let $\hat w^\star \in \arg\max_{w\in\Delta_m}\lambda_{\min}\left(\sum_i w_i\,\mathrm{Diag}\left(\nabla^2 f_i(x)\right)\right)$. Then $\lambda_{\min}\left(\sum_i \hat w^\star_i\nabla^2 f_i(x)\right) \ge \mu_G - 2\delta$.

Proof. Denote $A_i := \nabla^2 f_i(x) + E_i$ for every $i \in [m]$, where $E_i$ is a symmetric perturbation with $\|E_i\|_2 \le \delta$ (for Proposition 4.2, $A_i = \mathrm{Diag}\left(\nabla^2 f_i(x)\right)$). Let $w^\star$ and $\hat w^\star$ denote the solutions of
$$w^\star \in \arg\max_{w\in\Delta_m}\lambda_{\min}\left(\sum_i w_i\nabla^2 f_i(x)\right), \qquad \hat w^\star \in \arg\max_{w\in\Delta_m}\lambda_{\min}\left(\sum_i w_i A_i\right),$$
respectively. Let $g(w)$ and $\hat g(w)$ denote the corresponding objective values, $g(w) = \lambda_{\min}\left(\sum_i w_i\nabla^2 f_i(x)\right)$ and $\hat g(w) = \lambda_{\min}\left(\sum_i w_i A_i\right)$. Then, since $\hat g(w^\star) - \hat g(\hat w^\star) \le 0$ by the optimality of $\hat w^\star$ for $\hat g$, the following holds:
$$g(w^\star) = g(w^\star) - \hat g(w^\star) + \hat g(w^\star) - \hat g(\hat w^\star) + \hat g(\hat w^\star) - g(\hat w^\star) + g(\hat w^\star) \le |g(w^\star) - \hat g(w^\star)| + |\hat g(\hat w^\star) - g(\hat w^\star)| + g(\hat w^\star).$$
Using Theorem C.1 (Weyl's Theorem) together with $\left\|\sum_i w_i E_i\right\|_2 \le \sum_i w_i\|E_i\|_2 \le \delta$ for any $w \in \Delta_m$, the following hold: $|g(w^\star) - \hat g(w^\star)| \le \delta$ and $|\hat g(\hat w^\star) - g(\hat w^\star)| \le \delta$. Then we obtain $g(w^\star) \le 2\delta + g(\hat w^\star)$. Finally, since $g(w^\star) \ge \mu_G$ by Definition 4.1, we obtain $g(\hat w^\star) \ge \mu_G - 2\delta$, which concludes the proof.

F.3. Proof of Lemma C.5

Let us restate the claim:

Lemma C.5 (Recurrence bound, AMOO). Let $\{r_k\}_{k\ge 0}$ be a sequence of non-negative real numbers that satisfies the recurrence relation $r_{k+1}^2 \le r_k^2 - \alpha_1\frac{r_k^2}{1 + \alpha_2 r_k}$, where $\alpha_1 \in [0, 2)$ and $\alpha_2 \in \mathbb{R}_+$. Let $k_0 := \lceil 4(r_0\alpha_2 - 1)/\alpha_1\rceil$. Then $r_k$ is bounded by
$$r_k \le \begin{cases}\left(1 - \frac{\alpha_1}{2}\right)^{(k-k_0)/2} r_{k_0} & k \ge k_0 \\ r_0 - \frac{\alpha_1}{4\alpha_2}\,k & \text{o.w.}\end{cases}$$

Proof. We split the proof into two regimes: the incremental-convergence regime and the linear-convergence regime.

Incremental convergence, $r_k > 1/\alpha_2$. Under this assumption we have $1 + \alpha_2 r_k < 2\alpha_2 r_k$. Then the following bound holds:
$$r_{k+1}^2 \le r_k^2\left(1 - \frac{\alpha_1}{2\alpha_2 r_k}\right).$$
Recall that $\sqrt{1 - y} \le 1 - y/2$ for every $y \le 1$. Since $\frac{1}{\alpha_2 r_k} < 1$, we have $\frac{\alpha_1}{2\alpha_2 r_k} < \frac{\alpha_1}{2} < 1$. Hence,
$$r_{k+1} \le r_k\left(1 - \frac{\alpha_1}{4\alpha_2 r_k}\right) = r_k - \frac{\alpha_1}{4\alpha_2}.$$
As long as $r_k > 1/\alpha_2$ the recursion remains in the incremental-convergence regime, and the per-step decrease is at least $\frac{\alpha_1}{4\alpha_2}$. By solving $1/\alpha_2 \ge r_0 - k_0\,\alpha_1/(4\alpha_2)$ we conclude the maximal number of iterations after which $r_k \le 1/\alpha_2$; namely, after at most $k_0$ iterations $r_k$ leaves the incremental-convergence regime.

Linear convergence, $r_k \le 1/\alpha_2$. Under this assumption we have $1 + \alpha_2 r_k \le 2$, so the recursion gives the bound
$$r_{k+1}^2 \le \left(1 - \frac{\alpha_1}{2}\right)r_k^2.$$
Further, since for every $k' \ge k$ it holds that $r_{k'} \le r_k \le 1/\alpha_2$, the recursion remains in the linear-convergence regime. Thus, after at most $k_0$ iterations $r_k$ is in the linear-convergence regime, and for $k \ge k_0$ we have $r_k \le \left(1 - \alpha_1/2\right)^{(k-k_0)/2} r_{k_0}$. Combining the two regimes yields the stated bound.
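The following is a small numerical illustration of Proposition 4.2 proved above (illustrative only, not the paper's code): it generates near-diagonal PSD Hessian surrogates, picks the weights from the diagonal approximations by a grid search over the simplex ($m = 2$), and checks the $\mu_G - 2\delta$ guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

def near_diag_psd(scale):
    """Random PSD matrix whose off-diagonal part has small spectral norm."""
    D = np.diag(rng.uniform(0.5, 2.0, n) * scale)
    E = rng.normal(scale=0.02, size=(n, n)); E = (E + E.T) / 2
    np.fill_diagonal(E, 0.0)
    return D + E

H = [near_diag_psd(1.0), near_diag_psd(3.0)]
delta = max(np.linalg.norm(Hi - np.diag(np.diag(Hi)), 2) for Hi in H)

ts = np.linspace(0.0, 1.0, 1001)                 # weights w = (t, 1 - t)
lam_min = lambda A: np.linalg.eigvalsh(A)[0]
t_diag = max(ts, key=lambda t: lam_min(np.diag(t * np.diag(H[0]) + (1 - t) * np.diag(H[1]))))
t_full = max(ts, key=lambda t: lam_min(t * H[0] + (1 - t) * H[1]))

mu_G = lam_min(t_full * H[0] + (1 - t_full) * H[1])
achieved = lam_min(t_diag * H[0] + (1 - t_diag) * H[1])
print(f"mu_G={mu_G:.4f}  diag-based={achieved:.4f}  lower bound mu_G - 2*delta={mu_G - 2 * delta:.4f}")
```

The diagonal-based weights attain a minimal eigenvalue within $2\delta$ of $\mu_G$, as the proposition guarantees.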
F.4. Proof of Lemma C.6

Let us restate the claim:

Lemma C.6 (Recurrence bound, ϵ-AAMOO). Let $\{r_k\}_{k\ge 0}$ be a sequence of non-negative real numbers that satisfies the recurrence relation $r_{k+1}^2 \le r_k^2 - \alpha_1\frac{r_k^2}{1 + \alpha_2 r_k} + \alpha_3 + \alpha_4 r_k$, where $\alpha_1 \in (0, 2)$, $\alpha_2 \in \mathbb{R}_+$, $\alpha_3 \le \frac{\alpha_1^2}{256\alpha_2^2}$, and $\alpha_4 \le \frac{\alpha_1}{4\alpha_2}$. Let $k_0 := \lceil 16(r_0\alpha_2 - 1)/\alpha_1\rceil$. Then $r_k$ is bounded by
$$r_k \le \begin{cases}\left(1 - \frac{\alpha_1}{2}\right)^{(k-k_0)/2} r_{k_0} + \sqrt{\frac{2\alpha_3}{\alpha_1} + \frac{2\alpha_4}{\alpha_1\alpha_2}} & k \ge k_0 \\ r_0 - \frac{\alpha_1}{16\alpha_2}\,k & \text{o.w.}\end{cases}$$

Proof. We split the proof into two regimes: the incremental-convergence regime and the linear-convergence regime.

Incremental convergence, $r_k > 1/\alpha_2$. Under this assumption we have $1 + \alpha_2 r_k < 2\alpha_2 r_k$. Then the following bound holds:
$$r_{k+1}^2 \le r_k^2\left(1 - \frac{\alpha_1}{2\alpha_2 r_k}\right) + \alpha_4 r_k + \alpha_3 \le r_k^2\left(1 - \frac{\alpha_1}{4\alpha_2 r_k}\right) + \alpha_3,$$
where the second inequality holds since $\alpha_4 \le \frac{\alpha_1}{4\alpha_2}$ by assumption, which implies $\alpha_4 r_k \le \frac{\alpha_1}{4\alpha_2} r_k = r_k^2\,\frac{\alpha_1}{4\alpha_2 r_k}$. Recall that $\sqrt{1 - y} \le 1 - y/2$ for every $y \le 1$, and that $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for $a, b \ge 0$. Since $\frac{1}{\alpha_2 r_k} < 1$, we have $\frac{\alpha_1}{4\alpha_2 r_k} < \frac{\alpha_1}{4} < 1$. Hence,
$$r_{k+1} \le r_k\left(1 - \frac{\alpha_1}{8\alpha_2 r_k}\right) + \sqrt{\alpha_3} = r_k - \frac{\alpha_1}{8\alpha_2} + \sqrt{\alpha_3} \le r_k - \frac{\alpha_1}{16\alpha_2},$$
since $\alpha_3 \le \frac{\alpha_1^2}{256\alpha_2^2} = \frac{\alpha_1^2}{16^2\alpha_2^2}$ by assumption. As long as $r_k > 1/\alpha_2$ the recursion remains in the incremental-convergence regime, and the per-step decrease is at least $\frac{\alpha_1}{16\alpha_2}$. By solving $1/\alpha_2 \ge r_0 - k_0\,\alpha_1/(16\alpha_2)$ we conclude the maximal number of iterations after which $r_k \le 1/\alpha_2$; namely, after at most $k_0$ iterations $r_k$ leaves the incremental-convergence regime.

Linear convergence, $r_k \le 1/\alpha_2$. Under this assumption we have the following bound on the recursion:
$$r_{k+1}^2 \le \left(1 - \frac{\alpha_1}{2}\right)r_k^2 + \alpha_3 + \alpha_4/\alpha_2 = \left(1 - \frac{\alpha_1}{2}\right)r_k^2 + \alpha, \qquad (21)$$
where $\alpha := \alpha_3 + \alpha_4/\alpha_2$. We first show that $r_{k'} \le 1/\alpha_2$ for all $k' \ge k$. Observe that
$$r_{k+1}^2 \le \left(1 - \frac{\alpha_1}{2}\right)\frac{1}{\alpha_2^2} + \alpha \le \frac{1}{\alpha_2^2},$$
since $\alpha = \alpha_3 + \frac{\alpha_4}{\alpha_2} \le \frac{\alpha_1}{2\alpha_2^2}$ by assumption and by the fact that $\alpha_1 \in (0, 2)$. Hence $r_{k+1} \le \frac{1}{\alpha_2}$, which inductively implies $r_{k'} \le 1/\alpha_2$ for all $k' \ge k \ge k_0$. Since Eq. (21) holds for all $k' \ge k$, we can unroll the recursion and obtain
$$r_k^2 \le \left(1 - \frac{\alpha_1}{2}\right)^{k-k_0} r_{k_0}^2 + \alpha\sum_{j\ge 0}\left(1 - \frac{\alpha_1}{2}\right)^j \le \left(1 - \frac{\alpha_1}{2}\right)^{k-k_0} r_{k_0}^2 + \frac{2\alpha}{\alpha_1},$$
where the last inequality holds by summing the geometric series and since $1 - \alpha_1/2 \in (0, 1)$. This implies the result since
$$r_k \le \left(1 - \frac{\alpha_1}{2}\right)^{(k-k_0)/2} r_{k_0} + \sqrt{\frac{2\alpha}{\alpha_1}},$$
using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for $a, b \ge 0$ and $\frac{2\alpha}{\alpha_1} = \frac{2\alpha_3}{\alpha_1} + \frac{2\alpha_4}{\alpha_1\alpha_2}$. Finally, since for every $k' \ge k$ it holds that $r_{k'} \le 1/\alpha_2$, the recursion remains in the linear-convergence regime; thus, after at most $k_0$ iterations $r_k$ is in the linear-convergence regime.

F.5. Proof of Lemma D.2

Let us restate the claim:

Lemma D.2 (Continuity of Minimal Eigenvalue of Hessian). Let $x \in \mathbb{R}^n$, and let
$$w^\star \in \arg\max_{w\in\Delta_m}\lambda_{\min}\left(\nabla^2 f_w(x)\right), \qquad \hat w \in \arg\max_{w\in\Delta_{m,w_{\min}}}\lambda_{\min}\left(\nabla^2 f_w(x)\right).$$
It holds that $\lambda_{\min}\left(\nabla^2 f_{\hat w}(x)\right) \ge \lambda_{\min}\left(\nabla^2 f_{w^\star}(x)\right) - 2mw_{\min}\beta$.

Proof. To establish the lemma, we show that for any $w \in \Delta_m$ there exists $\hat w \in \Delta_{m, w_{\min}}$ such that $\lambda_{\min}\left(\sum_i \hat w_i\nabla^2 f_i(x)\right) \ge \lambda_{\min}\left(\sum_i w_i\nabla^2 f_i(x)\right) - 2mw_{\min}\beta$. We start by bounding the term $\|\nabla^2 f_w(x) - \nabla^2 f_{\hat w}(x)\|_2$ for any $x \in \mathbb{R}^n$. By the triangle inequality and positive homogeneity, we have
$$\left\|\sum_i (w_i - \hat w_i)\nabla^2 f_i(x)\right\|_2 \le \sum_i \left\|(w_i - \hat w_i)\nabla^2 f_i(x)\right\|_2 = \sum_i |w_i - \hat w_i|\,\left\|\nabla^2 f_i(x)\right\|_2 \le \beta\sum_i |w_i - \hat w_i|,$$
where the last inequality holds since $\{f_i\}_{i\in[m]}$ are $\beta$-smooth. Since for any $w \in \Delta_m$ there exists $\hat w \in \Delta_{m,w_{\min}}$ such that $\sum_i |w_i - \hat w_i| \le 2mw_{\min}$, we obtain that for every $x \in \mathbb{R}^n$:
$$\|\nabla^2 f_w(x) - \nabla^2 f_{\hat w}(x)\|_2 = \left\|\sum_i(w_i - \hat w_i)\nabla^2 f_i(x)\right\|_2 \le 2mw_{\min}\beta.$$
Thus, by using Theorem C.1 we obtain that for any $w \in \Delta_m$,
$$\left|\lambda_{\min}\left(\nabla^2 f_w(x)\right) - \lambda_{\min}\left(\nabla^2 f_{\hat w}(x)\right)\right| \le \|\nabla^2 f_w(x) - \nabla^2 f_{\hat w}(x)\|_2 \le 2mw_{\min}\beta.$$
By setting $w = w^\star$, and noting that the restricted maximizer $\hat w$ of the lemma attains a value at least as large as that of the constructed point, we conclude the result.
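The short check below (illustrative only, not the paper's code) verifies Lemma D.2 numerically for $m = 2$: it compares the best minimal eigenvalue over the full simplex with the best over the $w_{\min}$-restricted simplex and confirms the gap is at most $2mw_{\min}\beta$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 2

# Random PSD Hessian surrogates for the two objectives.
H = []
for _ in range(m):
    A = rng.normal(size=(n, n))
    H.append(A @ A.T + 0.1 * np.eye(n))
beta = max(np.linalg.eigvalsh(Hi)[-1] for Hi in H)     # smoothness constant
w_min = 0.05

def lam_min_weighted(t):
    # Minimal eigenvalue of the weighted Hessian with weights w = (t, 1 - t).
    return np.linalg.eigvalsh(t * H[0] + (1 - t) * H[1])[0]

ts = np.linspace(0.0, 1.0, 2001)
t_star = max(ts, key=lam_min_weighted)                                       # full simplex
t_hat = max(ts[(ts >= w_min) & (ts <= 1 - w_min)], key=lam_min_weighted)     # restricted

gap = lam_min_weighted(t_star) - lam_min_weighted(t_hat)
print(f"gap={gap:.4f}  <=  2*m*w_min*beta={2 * m * w_min * beta:.4f}")
```

The printed gap stays below $2mw_{\min}\beta$, matching the continuity bound established in the proof above.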