Published as a conference paper at ICLR 2023

MULTI-OBJECTIVE ONLINE LEARNING

Jiyan Jiang (Tsinghua University, Beijing, China), scjjy95@outlook.com
Wenpeng Zhang (Ant Group, Beijing, China), zhangwenpeng0@gmail.com
Shiji Zhou (Tsinghua University, Beijing, China), zhoushiji00@gmail.com
Lihong Gu, Xiaodong Zeng (Ant Group, Hangzhou, China), {lihong.glh,xiaodong.zxd}@antgroup.com
Wenwu Zhu (Tsinghua University, Beijing, China), wwzhu@tsinghua.edu.cn
(Equal contributions. Corresponding author.)

This paper presents a systematic study of multi-objective online learning. We first formulate the framework of Multi-Objective Online Convex Optimization, which encompasses a novel multi-objective regret. This regret is built upon a sequence-wise extension of the commonly used discrepancy metric Pareto suboptimality gap in zero-order multi-objective bandits. We then derive an equivalent form of the regret, making it amenable to be optimized via first-order iterative methods. To motivate the algorithm design, we give an explicit example in which equipping OMD with the vanilla min-norm solver for gradient composition will incur a linear regret, which shows that merely regularizing the iterates, as in single-objective online learning, is not enough to guarantee sublinear regrets in the multi-objective setting. To resolve this issue, we propose a novel min-regularized-norm solver that regularizes the composite weights. Combining min-regularized-norm with OMD results in the Doubly Regularized Online Mirror Multiple Descent algorithm. We further derive the multi-objective regret bound for the proposed algorithm, which matches the optimal bound in the single-objective setting. Extensive experiments on several real-world datasets verify the effectiveness of the proposed algorithm.

1 INTRODUCTION

Traditional optimization methods for machine learning are usually designed to optimize a single objective. However, in many real-world applications, we are often required to optimize multiple correlated objectives concurrently. For example, in autonomous driving (Huang et al., 2019; Lu et al., 2019b), self-driving vehicles need to solve multiple tasks such as self-localization and object identification at the same time. In online advertising (Ma et al., 2018a;b), advertising systems need to decide on the exposure of items to different users to maximize both the Click-Through Rate (CTR) and the Post-Click Conversion Rate (CVR). In most multi-objective scenarios, the objectives may conflict with each other (Kendall et al., 2018). Hence, there may not exist any single solution that can optimize all the objectives simultaneously. For example, merely optimizing CTR or CVR will degrade the performance of the other (Ma et al., 2018a;b).

Multi-objective optimization (MOO) (Marler & Arora, 2004; Deb, 2014) is concerned with optimizing multiple conflicting objectives simultaneously. It seeks Pareto optimality, where no single objective can be improved without hurting the performance of others. Many different methods for MOO have been proposed, including evolutionary methods (Murata et al., 1995; Zitzler & Thiele, 1999), scalarization methods (Fliege & Svaiter, 2000), and gradient-based iterative methods (Désidéri, 2012). Recently, the Multiple Gradient Descent Algorithm (MGDA) and its variants have been introduced to the training of multi-task deep neural networks and achieved great empirical success (Sener & Koltun, 2018), making them regain a significant amount of research interest (Lin et al., 2019; Yu et al., 2020; Liu et al., 2021).
These methods compute a composite gradient based on the gradient information of all the individual objectives and then apply the composite gradient to update the model parameters. The composite weights are determined by a min-norm solver (Désidéri, 2012) which yields a common descent direction of all the objectives. However, compared to the increasingly wide application prospect, the gradient-based iterative algorithms are relatively understudied, especially in the online learning setting.

Multi-objective online learning is of essential importance for two reasons. First, due to the data explosion in many real-world scenarios such as web applications, making in-time predictions requires performing online learning. Second, the theoretical investigation of multi-objective online learning will lay a solid foundation for the design of new optimizers for multi-task deep learning. This is analogous to the single-objective setting, where nearly all the optimizers for training DNNs are initially analyzed in the online setting, such as AdaGrad (Duchi et al., 2011), Adam (Kingma & Ba, 2015), and AMSGrad (Reddi et al., 2018).

In this paper, we give a systematic study of multi-objective online learning. To begin with, we formulate the framework of Multi-Objective Online Convex Optimization (MO-OCO). One major challenge in deriving MO-OCO is the lack of a proper regret definition. In the multi-objective setting, in general, no single decision can optimize all the objectives simultaneously. Thus, to devise the multi-objective regret, we need to first extend the single fixed comparator used in the single-objective regret, i.e., the fixed optimal decision, to the entire Pareto optimal set. Then we need an appropriate discrepancy metric to evaluate the gap between vector-valued losses. Intuitively, the Pareto suboptimality gap (PSG) metric, which is frequently used in zero-order multi-objective bandits (Turgay et al., 2018; Lu et al., 2019a), is a very promising candidate. PSG can yield scalarized measurements from any vector-valued loss to a given comparator set. However, we find that vanilla PSG is unsuitable for our setting since it always yields non-negative values and may be too loose. In a concrete example, we show that the naive PSG-based regret $R_I(T)$ can even be linear w.r.t. $T$ when the decisions are already optimal, which disqualifies it as a regret metric. To overcome the failure of vanilla PSG, we propose its sequence-wise variant termed S-PSG, which measures the suboptimality of the whole decision sequence to the Pareto optimal set of the cumulative loss function. Optimizing the resulting regret $R_{II}(T)$ will drive the cumulative loss to approach the Pareto front. However, as a zero-order metric motivated geometrically, designing appropriate first-order algorithms to directly optimize it is too difficult. To resolve the issue, we derive a more intuitive equivalent form of $R_{II}(T)$ via a highly non-trivial transformation.

Based on the MO-OCO framework, we develop a novel multi-objective online algorithm termed Doubly Regularized Online Mirror Multiple Descent. The key module of the algorithm is the gradient composition scheme, which calculates a composite gradient in the form of a convex combination of the gradients of all objectives.
Intuitively, the most direct way to determine the composite weights is to apply the min-norm solver (Désidéri, 2012) commonly used in offline multi-objective optimization. However, directly applying min-norm is not workable in the online setting. Specifically, the composite weights in min-norm are merely determined by the gradients at the current round. In the online setting, since the gradients are adversarial, they may result in undesired composite weights, which further produce a composite gradient that reversely optimizes the loss. To rigorously verify this point, we give an example where equipping OMD with vanilla min-norm incurs a linear regret, showing that only regularizing the iterate, as in OMD, is not enough to guarantee sublinear regrets in our setting. To fix the issue, we devise a novel min-regularized-norm solver with an explicit regularization on the composite weights. Equipping it with OMD results in our proposed algorithm. In theory, we derive a regret bound of $O(\sqrt{T})$ for DR-OMMD, which matches the optimal bound in the single-objective setting (Hazan et al., 2016) and is tight w.r.t. the number of objectives. Our analysis also shows that DR-OMMD attains a smaller regret bound than that of linearization with fixed composite weights. We show that, in the two-objective setting with linear losses, the margin between the regret bounds depends on the difference between the composite weights yielded by the two algorithms and the difference between the gradients of the two underlying objectives.

To evaluate the effectiveness of DR-OMMD, we conduct extensive experiments on several large-scale real-world datasets. We first realize adaptive regularization via multi-objective optimization, and find that adaptive regularization with DR-OMMD significantly outperforms fixed regularization with linearization, which verifies the effectiveness of DR-OMMD over linearization in the convex setting. Then we apply DR-OMMD to deep online multi-task learning. The results show that DR-OMMD is also effective in the non-convex setting.

2 PRELIMINARIES

In this section, we briefly review the necessary background knowledge of two related fields.

2.1 MULTI-OBJECTIVE OPTIMIZATION

Multi-objective optimization (MOO) is concerned with solving the problems of optimizing multiple objectives simultaneously (Fliege & Svaiter, 2000; Deb, 2014). In general, since different objectives may conflict with each other, there is no single solution that can optimize all the objectives at the same time, hence the conventional concept of optimality used in the single-objective setting is no longer suitable. Instead, MOO seeks to achieve Pareto optimality. In the following, we give the relevant definitions more formally. We use a vector-valued loss $F = (f^1, \ldots, f^m)$ to denote the objectives, where $m \ge 2$ and $f^i : X \to \mathbb{R}$, $i \in \{1, \ldots, m\}$, $X \subseteq \mathbb{R}^n$, is the $i$-th loss function.

Definition 1 (Pareto optimality). (a) For any two solutions $x, x' \in X$, we say that $x$ dominates $x'$, denoted as $x \prec x'$ (or $x' \succ x$), if $f^i(x) \le f^i(x')$ for all $i$, and there exists some $i$ such that $f^i(x) < f^i(x')$; otherwise, we say that $x$ does not dominate $x'$, denoted as $x \nprec x'$ (or $x' \nsucc x$). (b) A solution $x \in X$ is called Pareto optimal if it is not dominated by any other solution in $X$.

Note that there may exist multiple Pareto optimal solutions. For example, it is easy to show that the optimizer of any single objective, i.e., $x^*_i \in \arg\min_{x \in X} f^i(x)$, $i \in \{1, \ldots, m\}$, is Pareto optimal.
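As a concrete illustration of Definition 1, the following snippet (our sketch, not part of the paper) checks Pareto dominance between two loss vectors; the example values are hypothetical.

```python
import numpy as np

def dominates(F_x, F_y):
    """Return True if loss vector F_x Pareto-dominates F_y:
    no entry is worse and at least one entry is strictly better."""
    F_x, F_y = np.asarray(F_x), np.asarray(F_y)
    return bool(np.all(F_x <= F_y) and np.any(F_x < F_y))

# Toy two-objective example: (1, 3) and (3, 1) are mutually non-dominated
# (different trade-offs), while (2, 2) dominates (3, 3).
print(dominates([1, 3], [3, 1]))   # False
print(dominates([3, 1], [1, 3]))   # False
print(dominates([2, 2], [3, 3]))   # True
```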
Different Pareto optimal solutions reflect different trade-offs among the objectives (Lin et al., 2019).

Definition 2 (Pareto front). (a) All Pareto optimal solutions form the Pareto set $P_X(F)$. (b) The image of $P_X(F)$ constitutes the Pareto front, denoted as $P^*(F) = \{F(x) \mid x \in P_X(F)\}$.

Now that we have established the notion of optimality in MOO, we proceed to introduce the metrics that measure the discrepancy of an arbitrary solution $x \in X$ from being optimal. Recall that, in the single-objective setting with merely one loss function $f : Z \to \mathbb{R}$, for any $z \in Z$, the loss difference $f(z) - \min_{z' \in Z} f(z')$ directly qualifies as the discrepancy measure. However, in MOO with more than one loss, for any $x \in X$, the loss difference $F(x) - F(x^*)$, where $x^* \in P_X(F)$, is a vector. Intuitively, the desired discrepancy metric shall scalarize the vector-valued loss difference and yield 0 for any Pareto optimal solution. In general, in MOO, there are two commonly used discrepancy metrics, i.e., the Pareto suboptimality gap (PSG) (Turgay et al., 2018) and the Hypervolume (HV) (Bradstreet, 2011). As HV is a complex volume-based metric, it is more difficult to optimize via gradient-based algorithms (Zhang & Golovin, 2020). Hence in this paper, we adopt PSG, which has already been extensively used in multi-objective bandits (Turgay et al., 2018; Lu et al., 2019a).

Definition 3 (Pareto suboptimality gap¹). For any $x \in X$, the Pareto suboptimality gap to a given comparator set $Z \subseteq X$, denoted as $\Delta(x; Z, F)$, is defined as the minimal scalar $\epsilon \ge 0$ that needs to be subtracted from all entries of $F(x)$ such that $F(x) - \epsilon \mathbf{1}$ is not dominated by any point in $Z$, where $\mathbf{1}$ denotes the all-one vector in $\mathbb{R}^m$, i.e.,
$$\Delta(x; Z, F) = \inf_{\epsilon \ge 0} \epsilon, \quad \text{s.t.} \ \forall x' \in Z, \ \exists i \in \{1, \ldots, m\}, \ f^i(x) - \epsilon < f^i(x').$$

¹Our definition looks a bit different from (Turgay et al., 2018). In Appendix B, we show they are equivalent.

Clearly, PSG is a distance-based discrepancy metric motivated from a purely geometric viewpoint. In practice, the comparator set $Z$ is often set to be the Pareto set $X^* = P_X(F)$ (Turgay et al., 2018); therein, for any $x \in X$, its PSG is always non-negative and equals zero if and only if $x \in P_X(F)$.

Multiple Gradient Descent Algorithm (MGDA) is an offline first-order MOO algorithm (Fliege & Svaiter, 2000; Désidéri, 2012). At each iteration $l \in \{1, \ldots, L\}$ ($L$ is the number of iterations), it first computes the gradient $\nabla f^i(x_l)$ of each objective, then derives the composite gradient $g^{\mathrm{comp}}_l = \sum_{i=1}^m \lambda^i_l \nabla f^i(x_l)$ as a convex combination of these gradients, and finally applies $g^{\mathrm{comp}}_l$ to execute a gradient descent step to update the decision, i.e., $x_{l+1} = x_l - \eta g^{\mathrm{comp}}_l$ ($\eta$ is the step size). The core part of MGDA is the module that determines the composite weights $\lambda_l = (\lambda^1_l, \ldots, \lambda^m_l)$, given by
$$\lambda_l = \arg\min_{\lambda_l \in S_m} \Big\| \sum_{i=1}^m \lambda^i_l \nabla f^i(x_l) \Big\|_2^2,$$
where $S_m = \{\lambda \in \mathbb{R}^m \mid \sum_{i=1}^m \lambda^i = 1, \ \lambda^i \ge 0, \ \forall i \in \{1, \ldots, m\}\}$ is the probability simplex in $\mathbb{R}^m$. This is a min-norm solver, which finds the weights in the simplex that yield the minimum L2-norm of the composite gradient. Thus MGDA is also called the min-norm method. Previous works (Désidéri, 2012; Sener & Koltun, 2018) showed that when all $f^i$ are convex functions, MGDA is guaranteed to decrease all the objectives simultaneously until it reaches a Pareto optimal decision.
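To make the min-norm solver concrete, the sketch below implements the two-objective case, for which the minimizer over the simplex has a well-known closed form (cf. Sener & Koltun, 2018), together with one MGDA update. This is our illustration rather than code from the paper, and the function names are ours.

```python
import numpy as np

def min_norm_weights_2obj(g1, g2):
    """Closed-form min-norm solver for two objectives:
    find gamma in [0, 1] minimizing || gamma * g1 + (1 - gamma) * g2 ||_2^2."""
    diff = g1 - g2
    denom = float(np.dot(diff, diff))
    if denom == 0.0:                        # identical gradients: any convex combination works
        return np.array([0.5, 0.5])
    gamma = np.dot(g2, g2 - g1) / denom     # unconstrained minimizer of the quadratic
    gamma = float(np.clip(gamma, 0.0, 1.0)) # project onto [0, 1]
    return np.array([gamma, 1.0 - gamma])

def mgda_step(x, grads, eta):
    """One MGDA iteration for m = 2: x_{l+1} = x_l - eta * (lambda_1 * g1 + lambda_2 * g2)."""
    lam = min_norm_weights_2obj(grads[0], grads[1])
    g_comp = lam[0] * grads[0] + lam[1] * grads[1]
    return x - eta * g_comp, lam
```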
2.2 ONLINE CONVEX OPTIMIZATION

Online Convex Optimization (OCO) (Zinkevich, 2003; Hazan et al., 2016) is the most commonly adopted framework for designing online learning algorithms. It can be viewed as a structured repeated game between a learner and an adversary. At each round $t \in \{1, \ldots, T\}$, the learner is required to generate a decision $x_t$ from a convex compact set $X \subseteq \mathbb{R}^n$. Then the adversary replies to the learner with a convex function $f_t : X \to \mathbb{R}$ and the learner suffers the loss $f_t(x_t)$. The goal of the learner is to minimize the regret with respect to the best fixed decision in hindsight, i.e.,
$$R(T) = \sum_{t=1}^T f_t(x_t) - \min_{x^* \in X} \sum_{t=1}^T f_t(x^*).$$
A meaningful regret is required to be sublinear in $T$, i.e., $\lim_{T \to \infty} R(T)/T = 0$, which implies that when $T$ is large enough, the learner can perform as well as the best fixed decision in hindsight.

Online Mirror Descent (OMD) (Hazan et al., 2016) is a classic first-order online learning algorithm. At each round $t \in \{1, \ldots, T\}$, OMD yields its decision via
$$x_{t+1} = \arg\min_{x \in X} \ \eta \langle \nabla f_t(x_t), x \rangle + B_R(x, x_t),$$
where $\eta$ is the step size, $R : X \to \mathbb{R}$ is the regularization function, and $B_R(x, x') = R(x) - R(x') - \langle \nabla R(x'), x - x' \rangle$ is the Bregman divergence induced by $R$. As a meta-algorithm, by instantiating different regularization functions, OMD can induce two important algorithms, i.e., Online Gradient Descent (Zinkevich, 2003) and Online Exponentiated Gradient (Hazan et al., 2016).

3 MULTI-OBJECTIVE ONLINE CONVEX OPTIMIZATION

In this section, we formally formulate the MO-OCO framework.

Framework overview. Analogously to single-objective OCO, MO-OCO can be viewed as a repeated game between an online learner and the adversarial environment. The main difference is that in MO-OCO, the feedback is vector-valued. The general framework of MO-OCO is given as follows. At each round $t \in \{1, \ldots, T\}$, the learner generates a decision $x_t$ from a given convex compact decision set $X \subseteq \mathbb{R}^n$. Then the adversary replies to the decision with a vector-valued loss function $F_t : X \to \mathbb{R}^m$, whose $i$-th component $f^i_t : X \to \mathbb{R}$ is a convex function corresponding to the $i$-th objective, and the learner suffers the vector-valued loss $F_t(x_t)$. The goal of the learner is to generate a sequence of decisions $\{x_t\}_{t=1}^T$ to minimize a certain kind of multi-objective regret.

The remaining work in the framework formulation is to give an appropriate regret definition, which is the most challenging part. Recall that the single-objective regret $R(T) = \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(x^*)$ is defined as the difference between the cumulative loss of the actual decisions $\{x_t\}_{t=1}^T$ and that of the fixed optimal decision in hindsight $x^* \in \arg\min_{x \in X} \sum_{t=1}^T f_t(x)$. When defining the multi-objective analogy to $R(T)$, we encounter two issues. First, in the multi-objective setting, no single decision can optimize all the objectives simultaneously in general, hence we cannot compare the cumulative loss with that of any single decision. Instead, we use the Pareto optimal set $X^*$ of the cumulative loss function $\sum_{t=1}^T F_t$, i.e., $X^* = P_X(\sum_{t=1}^T F_t)$, which naturally aligns with the optimality concept in MOO. Second, to compare $\{x_t\}_{t=1}^T$ and $X^*$ in the loss space, we need a discrepancy metric to measure the gap between vector losses. Intuitively, we can adopt the commonly used PSG metric (Turgay et al., 2018). But we find that vanilla PSG is not appropriate for OCO, which is largely different from the bandits setting. We explicate the reason in the following.

3.1 THE NAIVE REGRET BASED ON VANILLA PSG FAILS IN MO-OCO

By definition, at each round $t$, the difference between the decision $x_t$ and the Pareto optimal set can be evaluated by the PSG $\Delta(x_t; X^*, F_t)$.
Naturally, we can formulate the multi-objective regret by accumulating $\Delta(x_t; X^*, F_t)$ over all rounds, i.e.,
$$R_I(T) := \sum_{t=1}^T \Delta(x_t; X^*, F_t).$$
Recall that the single-objective regret can also be expressed as $R(T) = \sum_{t=1}^T (f_t(x_t) - f_t(x^*))$. Hence, $R_I(T)$ essentially extends the scalar discrepancy $f_t(x_t) - f_t(x^*)$ to the PSG metric $\Delta(x_t; X^*, F_t)$. However, these two discrepancy metrics have a major difference, i.e., $f_t(x_t) - f_t(x^*)$ can be negative, whereas $\Delta(x_t; X^*, F_t)$ is always non-negative. In previous bandits settings (Turgay et al., 2018), the discrepancy is intrinsically non-negative, since the comparator set is exactly the Pareto optimal set of the evaluated loss function. However, the non-negative property of PSG can be problematic in our setting, where the comparator set $X^*$ is the Pareto set of the cumulative loss function rather than of the instantaneous loss $F_t$ that is used for evaluation. Specifically, at some round $t$, the decision $x_t$ may Pareto dominate all points in $X^*$ w.r.t. $F_t$, which corresponds to the single-objective situation where it is possible that $f_t(x_t) < f_t(x^*)$ at some specific round. In this case, we would expect the discrepancy metric at this round to be negative. However, PSG can only yield 0 in this case, making the regret much looser than we expect. In the following, we provide an example in which the naive regret $R_I(T)$ is linear w.r.t. $T$ even when the decisions $x_t$ are already optimal.

Problem instance. Set $X = [-2, 2]$. Let the loss function be identical among all objectives, i.e., $f^1_t(x) = \cdots = f^m_t(x)$, and alternate between $x$ and $-x$. Suppose the time horizon $T$ is an even number; then the Pareto optimal set is $X^* = X$. Now consider the decisions $x_t = 1$, $\forall t \in \{1, \ldots, T\}$. In this case, it can easily be checked that the single-objective regret of each objective is zero, indicating that these decisions are optimal for each objective. To calculate $R_I(T)$, notice that when all the objectives are identical, PSG reduces to $\Delta(x_t; X^*, f^1_t) = \sup_{x^* \in X^*} \max\{f^1_t(x_t) - f^1_t(x^*), 0\}$ at each round $t$. Hence, in this case we have
$$R_I(T) = \sum_{1 \le k \le T/2} \Big( \sup_{x^* \in [-2,2]} \max\{1 - x^*, 0\} + \sup_{x^* \in [-2,2]} \max\{x^* - 1, 0\} \Big) = 2T,$$
which is linear w.r.t. $T$. Therefore, $R_I(T)$ is too loose to measure the suboptimality of decisions, which disqualifies it as a regret metric.

3.2 THE ALTERNATIVE REGRET BASED ON SEQUENCE-WISE PSG

In light of the failure of the naive regret, we need to modify the discrepancy metric in our setting. Recall that the single-objective regret can be interpreted as the gap between the actual cumulative loss $\sum_{t=1}^T f_t(x_t)$ and its optimal value $\min_{x \in X} \sum_{t=1}^T f_t(x)$. In analogy, we can measure the gap between $\sum_{t=1}^T F_t(x_t)$ and the Pareto front $P^* = P^*(\sum_{t=1}^T F_t)$. However, vanilla PSG is a point-wise metric, i.e., it can only measure the suboptimality of a single decision point. To evaluate the decision sequence $\{x_t\}_{t=1}^T$, we modify its definition and propose a sequence-wise variant of PSG.

Definition 4 (Sequence-wise PSG). For any decision sequence $\{x_t\}_{t=1}^T$, the sequence-wise PSG (S-PSG) to a given comparator set² $X^*$ w.r.t. the loss sequence $\{F_t\}_{t=1}^T$ is defined as
$$\Delta(\{x_t\}_{t=1}^T; X^*, \{F_t\}_{t=1}^T) = \inf_{\epsilon \ge 0} \epsilon, \quad \text{s.t.} \ \forall x^* \in X^*, \ \exists i \in \{1, \ldots, m\}, \ \sum_{t=1}^T f^i_t(x_t) - \epsilon < \sum_{t=1}^T f^i_t(x^*).$$
Since $X^*$ is the Pareto set of $\sum_{t=1}^T F_t$, S-PSG measures the discrepancy from the cumulative loss of the decision sequence to the Pareto front $P^*$. Now the regret can be directly given as
$$R_{II}(T) := \Delta(\{x_t\}_{t=1}^T; X^*, \{F_t\}_{t=1}^T).$$
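As a quick numerical illustration of Definition 4 (ours, not from the paper), the sketch below evaluates S-PSG by a brute-force scan over candidate values of $\epsilon$; the cumulative losses and comparator points are hypothetical toy values.

```python
import numpy as np

def s_psg_bruteforce(cum_loss_seq, cum_loss_comparators, eps_grid):
    """Approximate Definition 4 by scanning an ascending grid of candidate epsilons:
    return the smallest eps >= 0 such that, for every comparator, the shifted cumulative
    loss (cum_loss_seq - eps) has at least one entry strictly smaller (i.e. is not dominated).
    Assumes the true value lies within the grid range."""
    for eps in eps_grid:
        shifted = cum_loss_seq - eps
        if all(np.any(shifted < c) for c in cum_loss_comparators):
            return float(eps)
    return float(eps_grid[-1])

# Hypothetical two-objective toy data: cumulative loss of the played decisions and
# cumulative losses of a few Pareto optimal comparators of sum_t F_t.
L_seq = np.array([4.0, 5.0])
L_comparators = [np.array([2.0, 6.0]), np.array([3.0, 3.5]), np.array([5.0, 2.0])]
eps_grid = np.linspace(0.0, 10.0, 10001)
print(s_psg_bruteforce(L_seq, L_comparators, eps_grid))   # approx. 1.0 for this toy data
```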
$R_{II}(T)$ has a clear physical meaning: optimizing it will impose the cumulative loss to be close to the Pareto front $P^*$. However, since PSG (or S-PSG) is a zero-order metric motivated in a purely geometric sense, i.e., its calculation needs to solve a constrained optimization problem with an unknown boundary $\{F_t(x^*) \mid x^* \in X^*\}$, it is difficult to design a first-order algorithm to optimize PSG-based regrets, not to mention the analysis. To resolve this issue, we derive an equivalent form via highly non-trivial transformations, which is more intuitive than its original form.

Proposition 1. The multi-objective regret $R_{II}(T)$ based on S-PSG has an equivalent form, i.e.,
$$R_{II}(T) = \max\Big\{ \sup_{x^* \in X^*} \inf_{\lambda \in S_m} \sum_{t=1}^T \lambda^\top \big(F_t(x_t) - F_t(x^*)\big), \ 0 \Big\}.$$

²It is equivalent to use either $X$ or $X^*$ as the comparator set. See Appendix C for the detailed proof.

Remark. (i) The above form is closely related to the single-objective regret $R(T)$. Specifically, when $m = 1$, we can prove that $R_{II}(T) = \max\{\sum_{t=1}^T F_t(x_t) - \min_{x^* \in X} \sum_{t=1}^T F_t(x^*), 0\} = \max\{R(T), 0\}$. Note that in the regret analysis, we are more interested in the case of $R(T) \ge 0$ (where $R_{II}(T) = R(T)$), since when $R(T) < 0$, it is naturally bounded by any sublinear regret bound. Hence, $R_{II}(T)$ is essentially aligned with $R(T)$ in the single-objective setting. (ii) At first glance, $R_{II}(T)$ can be optimized via linearization with fixed weights $\lambda_0 \in S_m$, or alternatively, by optimizing a single objective $i \in \{1, \ldots, m\}$. We remark that this is not a problem of our regret definition, but an intrinsic requirement of Pareto optimality. Specifically, Pareto optimality characterizes the status where no objective can be improved without hurting others. Hence merely optimizing a single objective naturally achieves Pareto optimality. Please refer to Proposition 8 in (Emmerich & Deutz, 2018) for the rigorous proof. As a general performance metric, our regret should incorporate this special case. Later, we will design a novel algorithm based on the concept of common descent, which outperforms linearization in both theory and experiment.

4 DOUBLY REGULARIZED ONLINE MIRROR MULTIPLE DESCENT

In this section, we present the Doubly Regularized Online Mirror Multiple Descent (DR-OMMD) algorithm, the protocol of which is given in Algorithm 1. At each round $t$, the learner first computes the gradient of the loss regarding each objective, then determines the composite weights of all these gradients, and finally applies the composite gradient to the online mirror descent step.

Algorithm 1 Doubly Regularized Online Mirror Multiple Descent (DR-OMMD)
1: Input: convex set $X$, time horizon $T$, regularization parameter $\alpha_t$, learning rate $\eta_t$, regularization function $R$, user preference $\lambda_0$.
2: Initialize: $x_1 \in X$.
3: for $t = 1, \ldots, T$ do
4:   Predict $x_t$ and receive a loss function $F_t : X \to \mathbb{R}^m$.
5:   Compute the multiple gradients $\nabla F_t(x_t) = [\nabla f^1_t(x_t), \ldots, \nabla f^m_t(x_t)] \in \mathbb{R}^{n \times m}$.
6:   Determine the weights for the gradient composition via min-regularized-norm: $\lambda_t = \arg\min_{\lambda \in S_m} \|\nabla F_t(x_t)\lambda\|_2^2 + \alpha_t \|\lambda - \lambda_0\|_1$.
7:   Compute the composite gradient $g_t = \nabla F_t(x_t)\lambda_t$.
8:   Perform online mirror descent using $g_t$: $x_{t+1} = \arg\min_{x \in X} \ \eta_t \langle g_t, x \rangle + B_R(x, x_t)$.
end for

4.1 VANILLA MIN-NORM MAY INCUR LINEAR REGRETS

The core module of DR-OMMD is the composition of gradients. For simplicity, denote the gradients at round $t$ in matrix form as $\nabla F_t(x_t) = [\nabla f^1_t(x_t), \ldots, \nabla f^m_t(x_t)] \in \mathbb{R}^{n \times m}$. Then the composite gradient is $g_t = \nabla F_t(x_t)\lambda_t$, where $\lambda_t$ denotes the composite weights.
As illustrated in the preliminaries, in the offline setting, the min-norm method (Désidéri, 2012; Sener & Koltun, 2018) is a classic way to determine the composite weights, and it produces a common descent direction that can decrease all the losses simultaneously. Thus, it is tempting to apply it to the online setting. However, directly applying min-norm in the online setting is not workable and may even incur linear regrets. In vanilla min-norm, the composite weights $\lambda_t$ are determined solely by the gradients $\nabla F_t(x_t)$ at the current round $t$, which are very sensitive to the instantaneous loss $F_t$. In the online setting, the losses at each round can be adversarially chosen, and thus the corresponding gradients can be adversarial. These adversarial gradients may result in undesired composite weights, which may further produce a composite gradient that even deteriorates the next prediction. In the following, we provide an example in which min-norm incurs a linear regret. We extend OMD (Hazan et al., 2016) to the multi-objective setting, where the composite weights are directly yielded by min-norm.

Problem instance. We consider a two-objective problem. The decision domain is $X = \{(u, v) \mid |u| + v \le 1/2, \ v \ge 0\}$ and the loss function at each round is
$$F_t(x) = \begin{cases} \big(\|x - a\|_2^2, \ \|x - b\|_2^2\big), & t = 2k - 1, \ k = 1, 2, \ldots; \\ \big(\|x - b\|_2^2, \ \|x - c\|_2^2\big), & t = 2k, \ k = 1, 2, \ldots, \end{cases}$$
where $a = (-2, -1)$, $b = (0, 1)$, $c = (2, -1)$. For simplicity, we first analyze the case where the total time horizon $T$ is an even number. Then we can compute the Pareto set of the cumulative loss $\sum_{t=1}^T F_t$, i.e., $X^* = \{(u, 0) \mid -1/2 \le u \le 1/2\}$, which lies on the horizontal axis. For conciseness of analysis, we instantiate OMD with L2-regularization, which results in the simple OGD algorithm (McMahan, 2011). We start at an arbitrary point $x_1 = (u_1, v_1) \in X$ satisfying $v_1 > 0$. At each round $t$, suppose the decision is $x_t = (u_t, v_t)$; then the gradient of each objective w.r.t. $x_t$ takes
$$g^1_t = \begin{cases} (2u_t + 4, \ 2v_t + 2), & t = 2k - 1; \\ (2u_t, \ 2v_t - 2), & t = 2k, \end{cases} \qquad g^2_t = \begin{cases} (2u_t, \ 2v_t - 2), & t = 2k - 1; \\ (2u_t - 4, \ 2v_t + 2), & t = 2k. \end{cases}$$
Since $0 \le v_t \le 1/2$, we observe that the second entry of either gradient alternates between positive and negative. By using min-norm, the composite weights $\lambda_t$ can be computed as
$$\lambda_t = \begin{cases} \big((1 - u_t - v_t)/4, \ (3 + u_t + v_t)/4\big), & t = 2k - 1; \\ \big((3 - u_t + v_t)/4, \ (1 + u_t - v_t)/4\big), & t = 2k. \end{cases}$$
We observe that both entries of the composite weights alternate between above $1/2$ and below $1/2$, and $\|\lambda_{t+1} - \lambda_t\|_1 \ge 1/2$. Recall that $\|\lambda_t\|_1 = 1$, hence the composite weights at two consecutive rounds change radically. The resulting composite gradient takes
$$g^{\mathrm{comp}}_t = \begin{cases} (u_t - v_t + 1, \ -u_t + v_t - 1), & t = 2k - 1; \\ (u_t + v_t - 1, \ u_t + v_t - 1), & t = 2k. \end{cases}$$
The fluctuating composite weights mix with the positive and negative second entries of the gradients, making the second entry of $g^{\mathrm{comp}}_t$ always negative, i.e., $-u_t + v_t - 1 < 0$ and $u_t + v_t - 1 < 0$. Hence $g^{\mathrm{comp}}_t$ always drives $x_t$ away from the Pareto set $X^*$ that coincides with the horizontal axis. This essentially reversely optimizes the loss, hence increasing the regret. In fact, we can prove that it even incurs a linear regret. Due to the lack of space, we leave the proof of linear regret when $T$ is an odd number to Appendix H. The above results of the problem instance are summarized as follows.

Proposition 2. For OMD equipped with vanilla min-norm, there exists a multi-objective online convex optimization problem in which the resulting algorithm incurs a linear regret.
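To sanity-check the formulas in the instance above, the following small numerical sketch (ours; the probe points are arbitrary feasible choices) computes the vanilla min-norm weights and the composite gradient at a few points of $X$, confirming that the second ($v$) component of the composite gradient is negative at both odd and even rounds, so every descent step pushes the iterate away from the Pareto set $\{v = 0\}$.

```python
import numpy as np

def min_norm_weights_2obj(g1, g2):
    # closed-form min-norm weights for two gradients, clipped to the simplex
    denom = float(np.dot(g1 - g2, g1 - g2))
    gamma = float(np.clip(np.dot(g2, g2 - g1) / denom, 0.0, 1.0))
    return np.array([gamma, 1.0 - gamma])

a, b, c = np.array([-2.0, -1.0]), np.array([0.0, 1.0]), np.array([2.0, -1.0])

def gradients(x, odd_round):
    # odd rounds: F_t = (||x-a||^2, ||x-b||^2); even rounds: F_t = (||x-b||^2, ||x-c||^2)
    return (2 * (x - a), 2 * (x - b)) if odd_round else (2 * (x - b), 2 * (x - c))

# Probe a few points in X = {(u, v): |u| + v <= 1/2, v >= 0}.
for x in [np.array([0.3, 0.1]), np.array([-0.2, 0.25]), np.array([0.0, 0.4])]:
    for odd in (True, False):
        g1, g2 = gradients(x, odd)
        lam = min_norm_weights_2obj(g1, g2)
        g_comp = lam[0] * g1 + lam[1] * g2
        print(x, "odd" if odd else "even", lam.round(3), g_comp.round(3))
        assert g_comp[1] < 0  # v-component of the composite gradient is negative
```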
Remark. Stability is a basic requirement to ensure meaningful regrets in online learning (McMahan, 2017). In the single-objective setting, directly regularizing the iterate $x_t$ (e.g., via OMD) is enough. However, as shown in the above analysis, merely regularizing $x_t$ is not enough to attain sublinear regrets in the multi-objective setting, since there is another source of instability, i.e., the composite weights, that affects the direction of the composite gradients. Therefore, in multi-objective online learning, besides regularizing the iterates, we also need to explicitly regularize the composite weights.

4.2 THE ALGORITHM

Enlightened by the design of regularization in FTRL (McMahan, 2017), we consider the regularizer $r(\lambda, \lambda_0)$, where $\lambda_0$ is the pre-defined composite weight vector that may reflect the user preference. This results in a new solver called min-regularized-norm, i.e.,
$$\lambda_t = \arg\min_{\lambda \in S_m} \|\nabla F_t(x_t)\lambda\|_2^2 + \alpha_t \, r(\lambda, \lambda_0),$$
where $\alpha_t$ is the regularization strength. Equipping OMD with the new solver, we derive the proposed algorithm. Note that beyond the regularization on the iterate $x_t$ that is intrinsic in online learning, there is another regularization on the composite weights $\lambda_t$ in min-regularized-norm. Both regularizations are fundamental, and they together ensure stability in the multi-objective online setting. Hence we call the algorithm Doubly Regularized Online Mirror Multiple Descent (DR-OMMD).

In principle, $r$ can take various forms such as the L1-norm, the L2-norm, etc. Here we adopt the L1-norm since it aligns well with the simplex constraint on $\lambda$. Min-regularized-norm can be computed very efficiently. When $m = 2$, it has a closed-form solution. Specifically, suppose the gradients at round $t$ are $g^1_t$ and $g^2_t$. Set $\gamma_L = \big((g^2_t)^\top (g^2_t - g^1_t) - \alpha_t\big)/\|g^2_t - g^1_t\|_2^2$ and $\gamma_R = \big((g^2_t)^\top (g^2_t - g^1_t) + \alpha_t\big)/\|g^2_t - g^1_t\|_2^2$. Given any $\lambda_0 = (\gamma_0, 1 - \gamma_0) \in S_2$, we can compute the composite weights as $\lambda_t = (\gamma_t, 1 - \gamma_t)$, where $\gamma_t = \max\{\min\{\gamma^*_t, 1\}, 0\}$ and $\gamma^*_t = \max\{\min\{\gamma_0, \gamma_R\}, \gamma_L\}$. When $m > 2$, since the constraint set $S_m$ is a simplex, we can introduce a Frank-Wolfe solver (Jaggi, 2013) (see the detailed protocol in Appendix E.1). We also discuss the L2-norm case in Appendix E.2.

Compared to vanilla min-norm, the composite weights in min-regularized-norm are not fully determined by the adversarial gradients. The resulting relative stability of the composite weights makes the composite gradients more robust to the adversarial environment. In the following, we give a general analysis and prove that DR-OMMD indeed guarantees sublinear regrets.

4.3 THEORETICAL ANALYSIS

Our analysis is based on two conventional assumptions (Jadbabaie et al., 2015; Hazan et al., 2016).

Assumption 1. The regularization function $R$ is 1-strongly convex. In addition, the Bregman divergence is $\gamma$-Lipschitz continuous, i.e., $B_R(x, z) - B_R(y, z) \le \gamma \|x - y\|$, $\forall x, y, z \in \mathrm{dom}\,R$, where $\mathrm{dom}\,R$ is the domain of $R$ and satisfies $X \subseteq \mathrm{dom}\,R \subseteq \mathbb{R}^n$.

Assumption 2. There exists some finite $G > 0$ such that for each $i \in \{1, \ldots, m\}$, the $i$-th loss $f^i_t$ at each round $t \in \{1, \ldots, T\}$ is differentiable and $G$-Lipschitz continuous w.r.t. $\|\cdot\|_2$, i.e., $|f^i_t(x) - f^i_t(x')| \le G \|x - x'\|_2$. Note that in the convex setting, this assumption leads to bounded gradients, i.e., $\|\nabla f^i_t(x)\|_2 \le G$ for any $t \in \{1, \ldots, T\}$, $i \in \{1, \ldots, m\}$, $x \in X$.

Theorem 1. Suppose the diameter of $X$ is $D$. Assume $F_t$ is bounded, i.e., $|f^i_t(x)| \le F$, $\forall x \in X$, $t \in \{1, \ldots, T\}$, $i \in \{1, \ldots, m\}$. For any $\lambda_0 \in S_m$, DR-OMMD attains
$$R_{II}(T) \le \frac{\gamma D}{\eta_T} + \sum_{t=1}^T \frac{\eta_t}{2} \Big( \|\nabla F_t(x_t)\lambda_t\|_2^2 + \frac{4F}{\eta_t} \|\lambda_t - \lambda_0\|_1 \Big).$$

Remark. When $\eta_t = \sqrt{2\gamma D / t}$ and $\alpha_t = 4F/\eta_t$, the bound attains $O(\sqrt{T})$. It matches the optimal single-objective bound w.r.t. $T$ (Hazan et al., 2016) and is tight w.r.t. $m$ (justified in Appendix F.2).
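To make the overall procedure concrete, here is a minimal sketch of DR-OMMD (Algorithm 1) for two objectives, instantiating the mirror map as the Euclidean one (so the mirror step reduces to projected online gradient descent) and using the closed-form two-objective min-regularized-norm solution from Section 4.2. This is our illustration rather than the authors' implementation: the L2-ball decision set, the `grad_fns` interface, and the constants `gamma_bregman`, `D`, and `F_bound` are assumptions, and the step-size and $\alpha_t$ schedules follow the form suggested in the remark after Theorem 1.

```python
import numpy as np

def min_reg_norm_weights_2obj(g1, g2, alpha, gamma0):
    """Closed-form min-regularized-norm for m = 2: clip the preference gamma0 to the
    interval [gamma_L, gamma_R] around the unregularized min-norm solution, then to [0, 1]."""
    denom = float(np.dot(g2 - g1, g2 - g1))
    if denom == 0.0:                       # identical gradients: regularizer keeps lambda at lambda_0
        return np.array([gamma0, 1.0 - gamma0])
    base = float(np.dot(g2, g2 - g1))
    gamma_L = (base - alpha) / denom
    gamma_R = (base + alpha) / denom
    gamma_star = max(min(gamma0, gamma_R), gamma_L)   # median(gamma_L, gamma0, gamma_R)
    gamma_t = float(np.clip(gamma_star, 0.0, 1.0))
    return np.array([gamma_t, 1.0 - gamma_t])

def project_l2_ball(x, radius):
    # Euclidean projection onto a centred L2 ball, used here as one simple decision set.
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def dr_ommd(grad_fns, x1, T, radius=1.0, lambda0=(0.5, 0.5),
            gamma_bregman=1.0, D=2.0, F_bound=1.0):
    """Minimal DR-OMMD sketch for two objectives with the Euclidean mirror map."""
    x = np.asarray(x1, dtype=float)
    for t in range(1, T + 1):
        eta_t = np.sqrt(2.0 * gamma_bregman * D / t)   # assumed step-size schedule
        alpha_t = 4.0 * F_bound / eta_t                # alpha_t = 4F / eta_t
        g1, g2 = grad_fns(x, t)                        # gradients of the two losses at round t
        lam = min_reg_norm_weights_2obj(g1, g2, alpha_t, lambda0[0])
        g_comp = lam[0] * g1 + lam[1] * g2
        x = project_l2_ball(x - eta_t * g_comp, radius)
    return x
```

In an application, the projection and the `grad_fns` oracle would be replaced by whatever decision set and (possibly adversarial) losses the problem defines.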
Comparison with linearization. Linearization with fixed weights $\lambda_0 \in S_m$ essentially optimizes the scalar loss $\lambda_0^\top F_t$ with gradient $g_t = \nabla F_t(x_t)\lambda_0$. From OMD's tight bound (Theorem 6.8 in (Orabona, 2019)), we can derive a bound $\gamma D/\eta_T + \sum_{t=1}^T \frac{\eta_t}{2} \|\nabla F_t(x_t)\lambda_0\|_2^2$ for linearization. In comparison, when $\alpha_t = 4F/\eta_t$, DR-OMMD attains a regret bound $\gamma D/\eta_T + \sum_{t=1}^T \frac{\eta_t}{2} \min_{\lambda \in S_m}\{\|\nabla F_t(x_t)\lambda\|_2^2 + \alpha_t \|\lambda - \lambda_0\|_1\}$, which is smaller than that of linearization. Note that although the bound of linearization refers to the single-objective regret $R(T)$, the comparison is reasonable due to the consistency of the two regret metrics, i.e., $R_{II}(T) = \max\{R(T), 0\}$ when $m = 1$, as proved in Proposition 1. In the following, we further investigate the margin in the two-objective setting with linear losses. Suppose the loss functions are $f^1_t(x) = x^\top g^1_t$ and $f^2_t(x) = x^\top g^2_t$ for some vectors $g^1_t, g^2_t \in \mathbb{R}^n$ at each round. Then we can show that the margin is at least (see Appendix F.3 for the detailed proof)
$$\sum_{t=1}^T \frac{\eta_t}{4} \, \|\lambda_t - \lambda_0\|_2^2 \, \|g^1_t - g^2_t\|_2^2,$$
which indicates the benefit of DR-OMMD. Specifically, while linearization requires an adequate $\lambda_0$, DR-OMMD selects a more proper $\lambda_t$ adaptively; the advantage is more obvious as the gradients of different objectives vary wildly. This matches our intuition that linearization suffers from conflicting gradients (Yu et al., 2020), while DR-OMMD can alleviate the conflict by pursuing common descent.

5 EXPERIMENTS

In this section, we conduct experiments to compare DR-OMMD with two baselines: (i) linearization performs single-objective online learning on the scalar losses $\lambda_0^\top F_t$ with pre-defined fixed $\lambda_0 \in S_m$; (ii) min-norm equips OMD with vanilla min-norm (Désidéri, 2012) for gradient composition.

5.1 CONVEX EXPERIMENTS: ADAPTIVE REGULARIZATION

Many real-world online scenarios adopt regularization to avoid overfitting. A standard scheme is to add a term $r(x)$ to the loss $f_t(x)$ at each round and optimize the regularized loss $f_t(x) + \sigma r(x)$ (McMahan, 2011), where $\sigma$ is a pre-defined fixed hyperparameter. The formalism of multi-objective online learning provides a novel way of regularization. As $r(x)$ measures model complexity, it can be regarded as the second objective alongside the primary goal $f_t(x)$. We can augment the loss to $F_t(x) = (f_t(x), r(x))$ and thereby cast regularized online learning into a two-objective problem. Compared to the standard scheme, our approach chooses $\sigma_t = \lambda^2_t / \lambda^1_t$ in an adaptive way.

[Figure 1: Results to verify the effectiveness of adaptive regularization on protein. (a) Effect of preference: average loss of DR-OMMD and linearization under varying $\lambda_0 = (\lambda^1_0, 1 - \lambda^1_0)$. (b) Learning curve: average loss over rounds using the optimal weights $\lambda_0 = (0.1, 0.9)$.]

[Figure 2: Results to verify the effectiveness of DR-OMMD in the non-convex setting. The two plots show the average loss over rounds of DR-OMMD, min-norm, and linearization with weights (0.25, 0.75), (0.5, 0.5), and (0.75, 0.25) on both tasks (Task L and Task R) of MultiMNIST.]

We use two large-scale online benchmark datasets. (i) protein is a bioinformatics dataset for protein type classification (Wang, 2002), which has 17 thousand instances with 357 features.
(ii) covtype is a biological dataset collected from a non-stationary environment for forest cover type prediction (Blackard & Dean, 1999), which has 50 thousand instances with 54 features. We set the logistic classification loss as the first objective, and the squared L2-norm of model parameters as the second objective. Since the ultimate goal of regularization is to lift predictive performance, we measure the average loss, i.e., P t T lt(xt)/T, where lt(xt) is the classification loss at round t. We adopt a L2-norm ball centered at the origin with diameter K = 100 as the decision set. The learning rates are decided by a grid search over {0.1, 0.2, . . . , 3.0}. For DR-OMMD, the parameter αt is simply set as 0.1. For fixed regularization, the strength σ = (1 λ1 0)/λ1 0 is determined by some λ1 0 [0, 1], which is exactly linearization with weights λ0 = (λ1 0, 1 λ1 0). We run both algorithms with varying λ1 0 {0, 0.1, ..., 1}. In Figure 1, we plot (a) their final performance w.r.t. the choice of λ0 and (b) their learning curves with desirable λ0 (e.g., (0.1, 0.9) on protein). Other results are deferred to the appendix due to the lack of space. The results show that DR-OMMD consistently outperforms fixed regularization; the gap becomes more significant when λ0 is not properly set. 5.2 NON-CONVEX EXPERIMENTS: DEEP MULTI-TASK LEARNING We use Multi MNIST (Sabour et al., 2017), which is a multi-task version of the MNIST dataset for image classification and commonly used in deep multi-task learning (Sener & Koltun, 2018; Lin et al., 2019). In Multi MNIST, each sample is composed of a random digit image from MNIST at the top-left and another image at the bottom-right. The goal is to classify the digit at the top-left (task L) and that at the bottom-right (task R) at the same time. We follow (Sener & Koltun, 2018) s setup with Le Net. Learning rates in all methods are selected via grid search over {0.0001, 0.001, 0.01, 0.1}. For linearization, we examine different weights (0.25, 0.75), (0.5, 0.5), and (0.75, 0.25). For DR-OMMD, αt is set according to Theorem 1, and the initial weights are simply set as λ0 = (0.5, 0.5). Note that in the online setting, samples arrive in a sequential manner, which is different from offline experiments where sample batches are randomly sampled from the training set. Figure 2 compares the average cumulative loss of all the examined methods. We also measure two conventional metrics in offline experiments, i.e., the training loss and test loss (Reddi et al., 2018); the results are similar and deferred to the appendix due to the lack of space. The results show that DR-OMMD outperforms counterpart algorithms using min-norm or linearization in all metrics on both tasks, validating its effectiveness in the non-convex setting. 6 CONCLUSIONS In this paper, we give a systematic study of multi-objective online learning, encompassing a novel framework, a new algorithm, and corresponding non-trivial theoretical analysis. We believe that this work paves the way for future research on more advanced multi-objective optimization algorithms, which may inspire the design of new optimizers for multi-task deep learning. Published as a conference paper at ICLR 2023 ACKNOWLEDGMENTS This work was supported in part by the National Key Research and Development Program of China No. 2020AAA0106300 and National Natural Science Foundation of China No. 62250008. This work was also supported by Ant Group through Ant Research Intern Program. 
We would like to thank Wenliang Zhong, Jinjie Gu, Guannan Zhang and Jiaxin Liu for generous support on this project. Jacob Abernethy, Peter L Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning are equivalent. In Conference on Learning Theory, pp. 27 46, 2011. Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167 175, 2003. Jock A Blackard and Denis J Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131 151, 1999. David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1 8, 1956. Lucas Bradstreet. The hypervolume indicator for multi-objective optimisation: calculation and use. University of Western Australia Perth, 2011. R obert Busa-Fekete, Bal azs Sz or enyi, Paul Weng, and Shie Mannor. Multi-objective bandits: Optimizing the generalized gini index. In International Conference on Machine Learning, pp. 625 634, 2017. Nicol o Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, and Gilles Stoltz. Mirror descent meets fixed share (and feels no regret). In Advances in Neural Information Processing Systems, pp. 980 988, 2012. Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, and Dragomir Anguelov. Just pick a sign: Optimizing deep multitask models with gradient sign dropout. In Advances in Neural Information Processing Systems, pp. 2039 2050, 2020. Sayak Ray Chowdhury and Aditya Gopalan. No-regret algorithms for multi-task bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pp. 1873 1881, 2021. Samuel Daulton, David Eriksson, Maximilian Balandat, and Eytan Bakshy. Multi-objective bayesian optimization over high-dimensional search spaces. In Uncertainty in Artificial Intelligence, pp. 507 517, 2022. Kalyanmoy Deb. Multi-objective optimization. In Search methodologies, pp. 403 449, 2014. R emy Degenne, Thomas Nedelec, Cl ement Calauz enes, and Vianney Perchet. Bridging the gap between regret minimization and best arm identification, with application to a/b tests. In International Conference on Artificial Intelligence and Statistics, pp. 1988 1996, 2019. Jean-Antoine D esid eri. Multiple-gradient descent algorithm for multiobjective optimization. Comptes Rendus Mathematique, 350(5-6):313 318, 2012. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive. ics.uci.edu/ml. John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7):2121 2159, 2011. Michael TM Emmerich and Andr e H Deutz. A tutorial on multiobjective optimization: Fundamentals and evolutionary methods. Natural Computing, 17(3):585 609, 2018. Published as a conference paper at ICLR 2023 J org Fliege and Benar Fux Svaiter. Steepest descent methods for multicriteria optimization. Mathematical Methods of Operations Research, 51(3):479 494, 2000. Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157 325, 2016. Xinyu Huang, Peng Wang, Xinjing Cheng, Dingfu Zhou, Qichuan Geng, and Ruigang Yang. The apolloscape open dataset for autonomous driving and its application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2702 2719, 2019. 
Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Artificial Intelligence and Statistics, pp. 398 406, 2015. Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pp. 427 435, 2013. Adri an Javaloy and Isabel Valera. Rotograd: Gradient homogenization in multitask learning. In International Conference on Learning Representations, 2021. Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7482 7491, 2018. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. Mina Konakovic Lukovic, Yunsheng Tian, and Wojciech Matusik. Diversity-guided multi-objective bayesian optimization with batch evaluations. In Advances in Neural Information Processing Systems, pp. 17708 17720, 2020. Wouter Koolen. The Pareto regret frontier. In Advances in Neural Information Processing Systems, pp. 1 9, 2013. Wouter M Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. In Conference on Learning Theory, pp. 1155 1175, 2015. Xi Lin, Hui-Ling Zhen, Zhenhua Li, Qing-Fu Zhang, and Sam Kwong. Pareto multi-task learning. In Advances in Neural Information Processing Systems, pp. 12060 12070, 2019. Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. In Advances in Neural Information Processing Systems, pp. 18878 18890, 2021. Shiyin Lu, Guanghui Wang, Yao Hu, and Lijun Zhang. Multi-objective generalized linear bandits. In International Joint Conference on Artificial Intelligence, pp. 3080 3086, 2019a. Weixin Lu, Yao Zhou, Guowei Wan, Shenhua Hou, and Shiyu Song. L3-net: Towards learning based lidar localization for autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6389 6398, 2019b. Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In International Conference on Knowledge Discovery & Data Mining, pp. 1930 1939, 2018a. Xiao Ma, Liqin Zhao, Guan Huang, Zhi Wang, Zelin Hu, Xiaoqiang Zhu, and Kun Gai. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1137 1140, 2018b. Wesley J Maddox, Maximilian Balandat, Andrew G Wilson, and Eytan Bakshy. Bayesian optimization with high-dimensional outputs. In Advances in Neural Information Processing Systems, pp. 19274 19287, 2021. Published as a conference paper at ICLR 2023 Shie Mannor, Vianney Perchet, and Gilles Stoltz. Approachability in unknown games: Online learning meets multi-objective optimization. In Conference on Learning Theory, pp. 339 355, 2014. R Timothy Marler and Jasbir S Arora. Survey of multi-objective optimization methods for engineering. Structural and Multidisciplinary Optimization, 26(6):369 395, 2004. Brendan Mc Mahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. In International Conference on Artificial Intelligence and Statistics, pp. 525 533, 2011. H Brendan Mc Mahan. A survey of algorithms and analysis for adaptive online learning. 
Journal of Machine Learning Research, 18(1):3117 3166, 2017. Tadahiko Murata, Hisao Ishibuchi, et al. MOGA: multi-objective genetic algorithms. In IEEE International Conference on Evolutionary Computation, pp. 289 294, 1995. Francesco Orabona. A modern introduction to online learning. ar Xiv preprint ar Xiv:1912.13213, 2019. Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3859 3869, 2017. Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, pp. 525 536, 2018. Nati Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent. In Advances in Neural Information Processing Systems, pp. 2645 2653, 2011. Cem Tekin and Eralp Tur gay. Multi-objective contextual multi-armed bandit with a dominant objective. IEEE Transactions on Signal Processing, 66(14):3799 3813, 2018. Eralp Turgay, Doruk Oner, and Cem Tekin. Multi-objective contextual bandit problem with similarity information. In International Conference on Artificial Intelligence and Statistics, pp. 1673 1681, 2018. Yuanyu Wan, Wei-Wei Tu, and Lijun Zhang. Online strongly convex optimization with unknown delays. Machine Learning, 111(3):871 893, 2022. Jung-Ying Wang. Application of support vector machines in bioinformatics. Ph D thesis, National Taiwan University, 2002. Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. In Advances in Neural Information Processing Systems, pp. 5824 5836, 2020. Richard Zhang and Daniel Golovin. Random hypervolume scalarizations for provable multiobjective black box optimization. In International Conference on Machine Learning, pp. 11096 11105, 2020. Peng Zhao and Lijun Zhang. Improved analysis for dynamic regret of strongly convex and smooth functions. In Learning for Dynamics and Control, pp. 48 59, 2021. Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pp. 928 936, 2003. Eckart Zitzler and Lothar Thiele. Multiobjective evolutionary algorithms: A comparative case study and the strength Pareto approach. IEEE transactions on Evolutionary Computation, 3(4):257 271, 1999. Published as a conference paper at ICLR 2023 The appendix is organized as follows. Appendix A reviews related work. Appendix B validates the correctness of our definition of PSG. Appendix C discusses the domain of the comparator in S-PSG, indicating that it makes no difference whether the comparator is selected from the Pareto optimal set or from the whole domain. Appendix D provides the detailed derivation of the equivalent form of RII(T). Appendix E discusses how to efficiently compute the composition weights for the minregularized-norm solver. Appendix F discusses the order of DR-OMMD s regret bound with fixed or adaptive learning rate, shows the tightness of the derived bound, and provides more details on the regret comparison between DR-OMMD and linearization. Appendix G supplements more details in the experimental setup and empirical results. Appendix H and I provide detailed proofs of the remaining theoretical claims in the main paper. 
Finally, Appendix J supplements regret analysis of DR-OMMD in the strongly convex setting. A RELATED WORK In this section, we review previous work in some related fields, i.e., online learning, multi-objective optimization, multi-objective multi-armed bandits, and multi-objective Bayesian optimization. A.1 ONLINE LEARNING Online learning arms to make sequential predictions for streaming data. Please refer to the introduction books (Hazan et al., 2016; Orabona, 2019) for more background knowledges. Most of the previous works on online learning are conducted in the single-objective setting. As far as we are concerned, there are only two lines of work concerning multi-objective learning. The first line of works provides a multi-objective perspective of the prediction-with-expert-advice (PEA) problem (Koolen, 2013; Koolen & Van Erven, 2015). Specifically, they view each individual expert as a multi-objective criterion, and characterize the Pareto optimal trade-offs among different experts. These works have two main distinctions from our proposed MO-OCO. First, they are still built upon the original PEA problem where the payoff of each expert (or decision) is a scalar, while we focus on vectoral payoffs. Second, their framework is restricted to an absolute loss game, whereas our framework is general and can be applied to any coordinate-wise convex loss functions. The second line of work studies online learning with vectoral payoffs via Blackwell approachability (Blackwell, 1956; Mannor et al., 2014; Abernethy et al., 2011). In their framework, the learner is given a target set T Rm and its goal is to generate decisions {xt}T t=1 to minimize the distance between the average loss PT t=1 lt(xt)/T and the target set T . There are two major differences between Blackwell approachability and our proposed MO-OCO: previous works on Blackwell approachability are zero-order methods and the target set T is often known beforehand (also see the discussion in (Busa-Fekete et al., 2017)), while in MO-OCO we intend to develop a first-order method to reach the unknown Pareto front. A.2 MULTI-OBJECTIVE OPTIMIZATION Multi-objective optimization aims to optimize multiple objectives concurrently. Most of the previous works on multi-objective optimization are conducted in the offline setting, including the batch optimization setting (D esid eri, 2012; Liu et al., 2021) and the stochastic optimization setting (Sener & Koltun, 2018; Lin et al., 2019; Yu et al., 2020; Chen et al., 2020; Javaloy & Valera, 2021). These methods are based on gradient composition, and have shown very promising results in multi-task learning applications. Despite the existence of previous works on multi-objective optimization, as the first work of multiobjective optimization in the OCO setting, our work is largely different from them in three aspects. First, we contribute the first formal framework of multi-objective online convex optimization. In particular, our framework is based on a novel equivalent transformation of the PSG metric, which is intrinsically different from previous offline optimization frameworks. Second, we provide a showcase in which a commonly used method in the offline setting, namely min-norm (D esid eri, 2012; Sener & Koltun, 2018), fail to attain sublinear regret in online setting. Our proposed min-regularized-norm Published as a conference paper at ICLR 2023 is a novel design when tailoring offline methods to the online setting. 
Third, the regret analysis of multi-objective online learning is intrinsically different from the convergence analysis in the offline setting (Yu et al., 2020). A.3 MULTI-OBJECTIVE MULTI-ARMED BANDITS Another branch of related works study multi-objective optimization in the multi-armed bandits setting (Busa-Fekete et al., 2017; Tekin & Tur gay, 2018; Turgay et al., 2018; Lu et al., 2019a; Degenne et al., 2019). Among these works, the most relevant one to ours is (Turgay et al., 2018), which introduces the Pareto suboptimality gap (PSG) metric to characterize the multi-objective regret in the bandits setting, and proposes a zero-order zooming algorithm to minimize the regret. In this work, our regret definition also utilizes the PSG metric (Turgay et al., 2018). However, as the first study of multi-objective optimization in the OCO setting, our work is intrinsically different from these previous works in the following aspects. First, as PSG is a zero-order metric, we perform a novel equivalent transformation, making it amenable to the OCO setting. Second, our proposed algorithm is a first-order multiple gradient algorithm, whose design principles are completely distinct from zero-order algorithms. For example, the concept of the stability of composite weights does not even exist in the design of previous zero-order methods for multi-objective bandits (Turgay et al., 2018; Lu et al., 2019a). Third, the regret analysis of MO-OCO is intrinsically different from that in the bandits setting. A.4 MULTI-OBJECTIVE BAYESIAN OPTIMIZATION The final area related to our work is multi-objective Bayesian optimization (Zhang & Golovin, 2020; Konakovic Lukovic et al., 2020; Chowdhury & Gopalan, 2021; Maddox et al., 2021; Daulton et al., 2022), which studies Bayesian optimization with vector-valued feedback. There are two branches of works in this area, using different notions of regret. The first branch is based on scalarization, which adopts the expectation of the gap between scalarized losses over some given distribution (Chowdhury & Gopalan, 2021) as the regret. In this approach, the distribution of scalarization can be understood as a set of preference, which needs to be known beforehand. The second branch is based on Pareto optimality (Zhang & Golovin, 2020), which uses hypervolume as the discrepancy metric and adopt the gap between the true Pareto front and the estimated Pareto front as the regret. As the first work on multi-objective optimization in the OCO setting, our work is largely different from these works in the following aspects. First, the regret definitions are different. Specifically, compared to the first branch based on scalarization, our regret definition is purely motivated by Pareto optimality, which does not need any preference in advance; compared to the second branch using hypervolume, we note that hypervolume is mainly used for Pareto front approximation, which is unsuitable to our adversarial setting where the goal is to impose the cumulative loss to reach the Pareto front. Second, multi-objective Bayesian optimization is conducted in a stochastic setting, which typically assumes that the losses follow some Gaussian distribution, whereas our work is conducted in the adversarial setting where the losses can be generated arbitrarily. B AN EQUIVALENT DEFINITION OF PSG Recall that in Definition 3, we formulate the PSG metric as a constrained optimization problem. 
We note that, since the PSG metric is based on the notion of non-dominance (Turgay et al., 2018), its most direct form is actually (x; K , F) = inf ϵ 0 ϵ, s.t. x K , i {1, . . . , m}, f i(x) ϵ < f i(x ) or i {1, . . . , m}, f i(x) ϵ = f i(x ). At the first glance, the above definition seems to be quite different from Definition 3, since it has an extra condition i {1, . . . , m}, f i(x) ϵ = f i(x ) . In the following, we prove that both definitions actually yield the same value due to the infimum operation on ϵ. Specifically, for any possible pair (x, K , F), we denote (x; K , F) = ϵ 0 and (x; K , F) = ϵ0. By comparing the constraints of both definitions, it is obvious that ϵ0 must satisfy the constraint Published as a conference paper at ICLR 2023 of (x; K , F), hence the infimum operation guarantees that ϵ 0 ϵ0. It remains to prove that ϵ 0 ϵ0. To this end, we only need to show that ϵ 0 + ξ satisfies the constraint of (x; K , F) for any ξ > 0. Consider an arbitrary x K . From the definition of (x; K , F), we know that either i {1, . . . , m}, f i(x) ϵ 0 < f i(x ) or i {1, . . . , m}, f i(x) ϵ 0 = f i(x ). Whichever condition holds, we must have i {1, . . . , m}, f i(x) ϵ 0 ξ < f i(x ) for any ξ > 0. Since it holds for any x K , ϵ 0 + ξ lies in the feasible region of (x; K , F), hence we have ϵ0 ϵ 0 + ξ, ξ > 0 and thus ϵ0 ϵ 0. In summary, we have (x; K , F) = (x; K , F) for any pair (x, K , F). C DISCUSSION ON THE DOMAIN OF THE COMPARATOR IN S-PSG Recall that in Definition 4, the comparator x in S-PSG is selected from the Pareto optimal set X of the cumulative loss PT t=1 Ft. This actually stems from the original definition of PSG (Turgay et al., 2018), which uses the Pareto optimal set as the comparator set. In fact, comparing with Pareto optimal decisions in X is already enough to measure the suboptimality of any decision sequence {xt}T t=1. The reason is that, for any non-optimal decision x X X , there must exist some Pareto optimal decision x X that dominates x , hence the suboptimality metric does not need to compare with this non-optimal decision x . In other words, even if we extend the comparator set in S-PSG to the whole domain X, the modified form will be equivalent to the original form based on the Pareto optimal set X . In the following, we strictly prove this equivalence ({xt}T t=1; X, {Ft}T t=1) = ({xt}T t=1; X , {Ft}T t=1). Specifically, we modify the definition of S-PSG and let the comparator domain X be any subset of the decision domain X, i.e., ({xt}T t=1; X , {Ft}T t=1) = inf ϵ 0 ϵ, s.t. x X , i {1, . . . , m}, t=1 f i t(xt) ϵ < t=1 f i t(x ). Then the modified regret based on the whole domain X takes R II(T) = ({xt}T t=1; X, {Ft}T t=1). Now we begin to prove the equivalence ({xt}T t=1; X, {Ft}T t=1) = ({xt}T t=1; X , {Ft}T t=1). For any X X, let E(X ) denote the constraint of ({xt}T t=1; X , {Ft}T t=1), i.e., E(X ) = {ϵ 0 | x X , i {1, . . . , m}, t=1 f i t(xt) ϵ < t=1 f i t(x )}, then ({xt}T t=1; X , {Ft}T t=1) = inf E(X ). Hence, we just need to prove inf E(X) = inf E(X ). On the one hand, since X X, from the above definition of S-PSG, it is easy to check that for any ϵ E(X), it must satisfy ϵ E(X ). Hence, we have E(X) E(X ). On the other hand, given any ϵ E(X ), we now check that ϵ E(X). To this end, we consider an arbitrary point x X in two cases. (i) If x X , since ϵ E(X ), we naturally have PT t=1 f i0 t (xt) ϵ < PT t=1 f i0 t (x ) for some i0. 
(ii) If x / X , since X is the Pareto optimal set of PT t=1 Ft, there must exist some Pareto optimal decision ˆx X that dominates x w.r.t. PT t=1 Ft, which means that PT t=1 f i t(ˆx) PT t=1 f i t(x ) for all i {1, ..., m}. Notice that ϵ E(X ) gives PT t=1 f i0 t (xt) ϵ < PT t=1 f i0 t (ˆx) for some i0, hence in this case we also have PT t=1 f i0 t (xt) ϵ < PT t=1 f i0 t (x ). Combining the above two cases, we prove that ϵ E(X), and consequently E(X ) E(X). In summary, we have E(X) = E(X ), hence ({xt}T t=1; X, {Ft}T t=1) = inf E(X) = inf E(X ) = ({xt}T t=1; X , {Ft}T t=1). Therefore, it makes no difference whether the comparator in RII(T) is generated from the Pareto optimal set X or from the whole domain X. D DERIVATION OF THE EQUIVALENT MULTI-OBJECTIVE REGRET FORM In this section, We strictly derive the equivalent form of RII(T) in Proposition 1, which is highly non-trivial and forms the basis of the subsequent algorithm design and theoretical analysis. Published as a conference paper at ICLR 2023 Proof of Proposition 1. Recall that the PSG metric used in RII(T) is an extension of vanilla PSG to leverage any decision sequence. To motivate the analysis, we first investigate vanilla PSG (x; X , F) that deals with a single decision x, and derive a useful lemma as follows. Lemma 1. Vanilla PSG has an equivalent form, i.e., (x; X , F) = sup x X inf λ Sm λ (F(x) F(x))+, where for any vector l = (l1, ..., lm) Rm, the truncation (l)+ produces a vector whose i-th entry equals to max{li, 0} for all i {1, ..., m}. Proof. In the definition of PSG, the evaluated decision x is compared to all Pareto optimal points x X . For any fixed comparator x X , we define the pair-wise suboptimality gap w.r.t. F between decisions x and x as follows δ(x; x , F) = inf ϵ 0{ϵ | F(x) ϵ1 F(x )}. Hence, PSG can be expressed as (x; X , F) = sup x X δ(x; x , F). To proceed, we analyze the pair-wise gap δ(x; x , F). From its definition, we know that δ(x; x , F) measures the minimal non-negative value that needs to be subtracted from each entry of F(x) until it is not dominated by x . Now we consider two cases. (i) If F(x) F(x ), i.e., f k0(x) f k0(x ) for some k0 {1, ..., m}, nothing needs to be subtracted from F(x) and we directly have δ(x; x , F) = 0. (ii) If F(x) F(x ), we have f k(x) f k(x ) for all k {1, ..., m}, which obviously violates the condition F(x) ϵ1 F(x ) when ϵ = 0. Now let us gradually increase ϵ from zero. Notice that such a condition holds only when there there exists some k0 satisfying f k0(x) ϵ f k0(x ), or equivalently ϵ f k0(x) f k0(x ). Hence, in this case, we have δ(x; x , F) = mink {1,...,m}{f k(x) f k(x )}. Combining the above two cases, we derive an equivalent form of the pair-wise suboptimality gap. Specifically, we can easily check that the following form holds for both cases, i.e., δ(x; x , F) = min k {1,...,m} max{f k(x) f k(x ), 0}. To relate the above form with F, denote Um = {ek | 1 k m} as the set of all unit vector in Rm, then we equivalently have δ(x; x , F) = min λ Um λ (F(x) F(x ))+. Now the calculation of δ(x; x , F) is transformed into a minimization problem over λ Um. Since Um is a discrete set, we can apply a linear relaxation trick. Specifically, we now turn to minimize the scalar p(λ) = λ max{F(x) F(x ), 0} over the convex curvature of Um, which is exactly the probability simplex Sm = {λ Rm | λ 0, λ 1 = 1}. Note that Um contains all the vertexes of Sm. Since infλ Sm p(λ) is a linear optimization problem, the minimal point λ must be a vertex of the simplex, i.e., λ Um. 
Hence, the relaxed problem is equivalent to the original problem, namely, δ(x; x , F) = min λ Um λ (F(x) F(x ))+ = inf λ Sm λ (F(x) F(x ))+. Taking the supremum of both sides over x X , we prove the lemma. The above lemma can be naturally extended to the sequence-wise variant S-PSG. Specifically, we can extend the pair-wise suboptimality gap δ(x; x , F) to measure any decision sequence, which now becomes δ({xt}T t=1; x , {Ft}T t=1) = inf ϵ 0{ϵ | t=1 Ft(xt) ϵ1 t=1 Ft(x )}. Published as a conference paper at ICLR 2023 Then S-PSG can be expressed as ({xt}T t=1; X , {Ft}T t=1) = sup x X δ({xt}T t=1; x , {Ft}T t=1). Similar to the derivation of the above lemma, by investigating the relation between PT t=1 Ft(xt) and PT t=1 Ft(x ), we can derive an equivalent form of δ({xt}T t=1; x , {Ft}T t=1) as δ({xt}T t=1; x , {Ft}T t=1) = min k {1,...,m} max{ t=1 f k t (x) t=1 f k t (x ), 0}, and further δ({xt}T t=1; x , {Ft}T t=1) = inf λ Sm λ ( t=1 Ft(x ))+. Hence, the S-PSG-based regret form can be expressed as RII(T) = sup x X inf λ Sm λ ( t=1 Ft(x ))+. The max-min form of RII(T) has a truncation operation ( )+, which brings irregularity to the regret form. To handle the truncation operation, we utilize the following lemma: Lemma 2. (a) For any l Rm, we have infλ Sm λ (l)+ = max{infλ Sm λ l, 0}. (b) For any h : X R, we have supx X max{h(x), 0} = max{supx X h(x), 0}. Proof. To prove the first statement, we consider the following two cases. (i) If l 0, then (l)+ = l. For any λ Sm, we have λ (l)+ = λ l > 0. Taking the infimum over λ Sm on both sides, we have infλ Sm λ (l)+ = infλ Sm λ l 0. Moreover, from the last equation we have max{infλ Sm λ l, 0} = infλ Sm λ l, which proves the statement in this case. (ii) If l 0, then li 0 for some i {1, ..., m}. Set ei as the i-th unit vector in Rm, then we have e i l 0. One the one hand, since ei Sm, we have infλ Sm λ l e i l 0, and further max{infλ Sm λ l, 0} = 0. On the other hand, notice that e i (l)+ = 0 and λ (l)+ 0 for any λ Sm, then infλ Sm λ (l)+ = e i (l)+ = 0. Hence, the statement also holds in this case. To prove the second statement, we also consider two cases. (i) If h(x0) > 0 for some x0 X, then supx X h(x) h(x0) > 0, and max{supx X h(x), 0} = supx X h(x). Since we also have supx X max{h(x), 0} = supx X h(x), the statement holds in this case. (ii) If h(x) 0 for all x X, then supx X h(x) 0, and thus max{supx X h(x), 0} = 0. Meanwhile, for any x X, we have max{h(x)} = 0, which validates the statement in this case. From the above lemma, we directly have RII(T) = sup x X max{ inf λ Sm λ ( t=1 Ft(x )), 0} = max{ sup x X inf λ Sm λ ( t=1 Ft(x )), 0}, which derives the desired equivalent form. E CALCULATION OF MIN-REGULARIZED-NORM In this section, we discuss how to efficiently calculate the solutions to min-regularized-norm with L1-norm and L2-norm. Published as a conference paper at ICLR 2023 Algorithm 2 Frank-Wolfe Solver for Min-Regularized-Norm with L1-Norm 1: Initialize: λt = (γ1 t , . . . , γm t ) = ( 1 m, . . . , 1 2: Compute the matrix U = Ft(xt) Ft(xt), i.e., Uij = f i t(xt) f j t (xt), i, j {1, . . . , m}. 3: repeat 4: Select an index k arg maxi {1,...,m}{Pm j=1 γj t Uij + α sgn(γi t γi 0)}. 5: Compute δ arg min0 δ 1 δ f k t (xt) + (1 δ) Ft(xt)λt 2 2+α δ(ek λt)+λt λ0 1. 6: Update λt = (1 δ)λt + δek. 7: until δ 0 or Number of Iteration Limits 8: return λt. E.1 L1-NORM Similar to (Sener & Koltun, 2018), we first consider the setting of two objectives, namely m = 2. 
In this case, for any λ = (γ, 1 γ), λ0 = (γ0, 1 γ0) S2, the L1-regularization λ λ0 1 equals to 2|γ γ0|. Hence min-regularized-norm with L1-norm at round t reduces to λt = (γt, 1 γt) where γt arg min 0 γ 1 γg1 + (1 γ)g2 2 2 + 2α|γ γ0|. Interestingly, the above problem has a closed-form solution. Proposition 3. Set γL = (g 2 (g2 g1) α)/ g2 g1 2 2, and γR = (g 2 (g2 g1)+α)/ g2 g1 2 2. Then min-regularized-norm with L1-norm produces weights λt = (γt, 1 γt) where γt = max{min{γ t , 1}, 0}, where γ t = max{min{γ0, γR}, γL}. Proof. We solve the following two quadratic sub-problems, i.e., min 0 γ γ0 h1(γ) = γg1 + (1 γ)g2 2 2 + 2α(γ0 γ), as well as min γ0 γ 1 h2(γ) = γg1 + (1 γ)g2 2 2 + 2α(γ γ0). It can be checked that in the former sub-problem, h1 monotonously decreases on ( , γR] and increases on [γR, + ); in the latter sub-problem, h2 monotonously decreases on ( , γL] and increases on [γL, + ). Since each sub-problem has its constraint ([0, γ0] or [γ0, 1]), the solution to the original optimization problem can then be derived by comparing the optimal values of the two sub-problems with their constraints. Specifically, notice that γL γR and 0 γ0 1, and we can consider the following three cases. (i) When 0 γ0 γL γR, then h1 monotonously decreases on [0, γ0] and its minimum on [0, γ0] is h1(γ0). Notice that h1(γ0) = h2(γ0). For the sub-problem of h2, we further consider two situations: (i-a) If γL 1, then γL [γ0, 1], hence the minimum of h2 on [γ0, 1] is h2(γL). Since h2(γL) h2(γ0) = h1(γ0), the minimal point of the original problem is γL, and hence γt = γL. (i-b) If γL > 1, then h2 monotonously decreases on [γ0, 1], and we surely have h2(1) h2(γ0) = h1(γ0). Hence γt = 1 in this situation. Combining the above two situations, we have γt = min{γL, 1} in this case. (ii) When γL γR γ0 1, then h2 monotonously increases on [γ0, 1] and its minimum on [γ0, 1] is h2(γ0). Notice that h1(γ0) = h2(γ0). For the sub-problem of h1, similar to the first case, we also consider two situations: (ii-a) If γR 0, then γR [0, γ0], hence the minimum of h1 on [0, γ0] is h1(γR). Since h1(γR) h1(γ0) = h2(γ0), the minimal point of the original problem is γR, and hence γt = γR. (ii-b) If γR < 0, then h1 monotonously increases on [0, γ0]. Hence we have h1(0) h1(γ0) = h2(γ0). Hence the solution to the original problem γt = 0. Combining the above two situations, we have γt = max{γR, 0} in this case. Published as a conference paper at ICLR 2023 Algorithm 3 Frank-Wolfe Solver for Min-Regularized-Norm with L2-Norm 1: Initialize: λt = (γ1 t , . . . , γm t ) = ( 1 m, . . . , 1 2: Compute the matrix U = Ft(xt) Ft(xt), i.e., Uij = f i t(xt) f j t (xt), i, j {1, . . . , m}. 3: repeat 4: Select an index k arg maxi {1,...,m}{Pm j=1 γj t Uij + α(γi t γi 0)}. 5: Compute δ arg min0 δ 1 δ f k t (xt))+(1 δ) Ft(xt)λt 2 2+α δ(ek λt)+λt λ0 2 2, which has an analytical form δ = max{min{( Ft(xt)λt f k t (xt)) Ft(xt)λt + α ek λt 2 2 Ft(xt)λt f k t (xt) 2 2 + α(ek λt) (λt λ0) , 1}, 0}. 6: Update λt = (1 δ)λt + δek. 7: until δ 0 or Number of Iteration Limits 8: return λt. (iii) When γL < γ0 < γR, then h1 monotonously decreases on [0, γ0] and h2 monotonously increases on [γ0, 1]. Hence each sub-problem attains its minimum at γ0, and thus γt = γ0. Summarizing the above three cases gives min{γL, 1}, γ0 γL; max{γR, 0}, γ0 γR; γ0, otherwise. We can further rewrite the above formula into a compact form as follows, which can be checked case-by-case. 
γt = max{min{γ t , 1}, 0}, where γ t = max{min{γ0, γR}, γL}, This gives the closed-form solution of min-regularized-norm when m = 2. Now that we have derived the closed-form solution to the min-regularized-norm solver with any two gradients, in principle, we can apply (Sener & Koltun, 2018) s technique to efficiently compute the solution to the solver with more than two gradients. We provide the full procedure in Algorithm 2, which is an extension of (Sener & Koltun, 2018). By following the exact line search technique (Jaggi, 2013) in MGDA, we get our line search oracle as line 5 in Algorithm 2. The first term is the same as that in MGDA, and the second term is an extra L1-regularization term related to the design in Algorithm 1. Unlike the oracle of MGDA that has a closed-form solution by a reduction to the case of two gradients, the extra L1-norm term makes our oracle difficult to get a closed-form solution. The reason is that, such an extra term is the L1-norm of a m-dimension vector, hence it can not simply reduce to the case of two gradients. To proceed, we can directly apply numerical methods to get the solution (e.g. similar to the implementation in (Liu et al., 2021)). E.2 L2-NORM Recall that in min-regularized-norm, the regularization on λ can take various forms. In the following, we discuss an alternative regularization, i.e., L2-regularization r(λ, λ0) = 1 2 λ λ0 2 2. In the discussion, we will show that similar to (Sener & Koltun, 2018), min-regularized-norm with L2norm can be computed very efficiently via the Frank-Wolfe method. Similar to the previous discussion on L1-regularization, we first consider the setting of m = 2. In this case, for any λ = (γ, 1 γ), λ0 = (γ0, 1 γ0) S2, the L2-regularization 1 2 λ λ0 2 2 equals to (γ γ0)2. Hence min-regularized-norm with L2-norm at round t reduces to λt = (γt, 1 γt) where γt arg min 0 γ 1 γg1 + (1 γ)g2 2 2 + α(γ γ0)2. Since the above problem is in the quadratic form, it also has a closed-form solution. The proof is elementary and hence omitted. Published as a conference paper at ICLR 2023 Proposition 4. Min-regularized-norm with L2-norm produces weights λt = (γt, 1 γt) where γt = max{min{(g2 g1) g2 + αγ0 g2 g1 2 2 + α , 1}, 0}. When m > 2, since λt is constrained to the probability simplex m, similar to the case of L1regularization, we can use a Frank-Wolfe method to efficiently calculate the composition weights, which is presented in Algorithm 3. Note that since the line search (step 2) has a closed-form solution, its calculation cost is not high, i.e., just the same as the calculation cost of the original min-norm solver (Sener & Koltun, 2018). F MORE DETAILS OF THE THEORETICAL RESULTS In this section, we first prove the remark below Theorem 1, i.e., with proper choices of ηt and αt, DR-OMMD is guaranteed to have a sublinear regret bound in O( T). Then we show the tightness of the above derived regret bound of DR-OMMD. Finally, we give a more detailed comparison with linearization from the theoretical aspect. F.1 MORE DETAILS OF THE REMARK BELOW THEOREM 1 Recall that in the remark below Theorem 1 in our main paper, we claim that with proper choice of ηt and αt, DR-OMMD is guaranteed to attain a sublinear regret bound. We summarize this remark into the following corollary and provide a strict proof. Corollary 1. 
(i) (Fixed learning rate) When setting ηt = 2γD T and αt = 4F ηt , for any λ0 Sm, DR-OMMD achieves the following multi-objective regret (ii) (Diminishing learning rate) When setting ηt = 2γD t , αt = 4F ηt , for any λ0 Sm, DR-OMMD attains the following multi-objective regret Proof. We start from the regret bound regarding λt in Theorem 1. When αt = 4F ηt , from the definition of min-regularized-norm, the composite weights λt generated by DR-OMMD at each round satisfy λt arg min λ Sm Ft(xt)λ 2 2 + 4F Recall that λ0 Sm. Hence, for any t {1, ..., T}, the last term of the regret bound in Theorem 1 can be bounded as Ft(xt)λt 2 2 + 4F ηt λt λ0 1 = min λ Sm Ft(xt)λ 2 2 + 4F ηt λ λ0 1 Ft(xt)λ0 2 2. From Assumption 2, each gradient gi t is bounded as gi t 2 G, hence Ft(xt)λ0 2 Pm i=1 λi 0gi t 2 = Pm i=1 λi 0 gi t 2 G. Therefore, when αt = 4F ηt , we have Ft(xt)λt 2 2 + 4F ηt λt λ0 1 Ft(xt)λ0 2 2 G2. Plugging it into the regret bound in Theorem 1, we have 2 G2T = G p which proves the bound with the fixed optimal learning rate. Alternatively, set ηt = 2γD t and utilize PT t=1 1 T, we also have which proves the bound with the adaptive learning rate. Published as a conference paper at ICLR 2023 F.2 THE TIGHTNESS OF DR-OMMD S BOUND In this subsection, we show that the derived bound in Corollary 1 is tight w.r.t. m regarding any gradient-based algorithm. Specifically, we follow the standard worst-case analysis of deriving lower bounds and construct a special case in which any gradient-based algorithm will incur a regret in the order of Ω( Assume f 1 t = f 2 t = = f m t at each round t. In this case, the instantaneous gradients of all the objectives are identical, i.e., gi t = f i t(xt) f 1 t (xt) = g1 t , i {1, ..., m}. For any gradient-based algorithm, since it can only utilize the gradient information of the objectives, it cannot distinguish the objective to which a certain gradient belongs. Alternatively speaking, in this case, any multiple gradient algorithm will treat all gradients in the same way and thus behave like a single-objective algorithm using the single gradient g1 t . Hence, in intuition, for any gradient-based algorithm, the worst-case bounds are at least independent of m. In particular, the worst-case bounds of gradientbased algorithms cannot decrease as m increases; otherwise, the above case will be violated. In the following, we provide a detailed proof of the tightness of the O( T) bound. In the above case, since f 1 t = f 2 t = = f m t for any t, the cumulative losses of all the objectives are also identical, i.e., PT t=1 f 1 t = PT t=1 f 2 t = = PT t=1 f m t . Therefore, the Pareto set X of the cumulative vector loss PT t=1 Ft coincides with the optimal decision set of the cumulative loss PT t=1 f 1 t of the first objective, i.e., X = argminx X PT t=1 f 1 t (x). Recall our definition of the multi-objective regret. Since λ Ft(x) = f 1 t (x) for any λ Sm, we have R(T) = sup x X ( t=1 f 1 t (xt) t=1 f 1 t (x )) = t=1 f 1 t (xt) min x X t=1 f 1 t (x ), which exactly reduces to the single-objective regret RS(T) defined by the losses {f 1 t }T t=1 of the first objective. Hence we have R(T) = RS(T) in this case. Since the losses {f 1 t }T t=1 of the first objective can be chosen adversarially, we can follow Section 3.2 in (Hazan et al., 2016) to construct a certain sequence {f 1 t }T t=1 that admits a lower single-objective regret bound of Ω( T). Hence in this certain case, any multiple gradient algorithm will admit a multi-objective regret R(T) = Ω( T) w.r.t. 
T and m, matching our derived regret bound for DR-OMMD in terms of both T and m. Some readers may suspect it unreasonable that in the multi-objective setting, the derived regret bounds do not increase as m increases. Now we explicate the rationality of such independence in the following. In fact, the independence of m lies in the adoption of PSG in the formulation of the regret. Recall that, in the definition of PSG, i {1, . . . , m} means that it just needs to pick one coordinate i to satisfy f i t(xt) ϵ < f i t(x ), which omits the dependency of m. We can see this point from another perspective. Recall that in the derivation of Proposition 1, we know that the regret R(T) has an equivalent form, namely supx X infλ Sm(λ ) PT t=1(Ft(xt) Ft(x )), or equivalently supx X mini {1,...,m} PT t=1(f i t(xt) f i t(x )). In particular, PSG takes a minimum operation over all objectives, and thus it does not necessarily increase as m increases. There is another intuitive way that can help understand the rationality of the independence of m. As is well recognized in existing research in multi-objective optimization (Emmerich & Deutz, 2018), the proportion of the Pareto optimal solutions (or more precisely, non-dominated solutions) in the decision domain tends to increase rapidly as the number of objectives increases. As a consequence, it might not be harder to reach the Pareto optimal set when m turns larger, hence intuitively, the regret bound does not necessarily increase as m increases. F.3 MORE DETAILS IN THE COMPARISON WITH LINEARIZATION Recall that in the remark below Theorem 1, we show that our derived bound for DR-OMMD is smaller than that of linearization, and discuss the margin between the two regret bounds in the twoobjective setting with linear losses. We now summarize the result in Theorem 2 in the following. Theorem 2. Consider a two-objective optimization setting with linear losses. Suppose the loss functions are f 1 t (x) = x g1 t and f 2 t (x) = x g2 t at each round t. For any λ0 = (γ0, 1 γ0) Sm, Published as a conference paper at ICLR 2023 let λt = (γt, 1 γt) denote the composite weights produced by min-regularized-norm with L1norm. When the regularization strength is set as αt = 4F/ηt, the margin between the regret bound of linearization with fixed weights λ0 and that of DR-OMMD with composite weights λt is at least 2 (γt γ0)2 g1 t g2 t 2 2. Before proving the theorem, we remark that the two bounds are basically in the same order. Note that this theoretical result is also very commonly seen in the offline setting, where multiple gradient algorithms often have the same (convergence) rate as linearization (Yu et al., 2020; Liu et al., 2021). The benefit of multiple gradient algorithms is mainly due to the implementation of gradient composition. For example, the concept of common descent (Sener & Koltun, 2018; Yu et al., 2020) eliminates the gradient conflicting issue; the resulting algorithm achieves substantial performance improvements compared to linearization in their experiments. In this paper, we move one step forward and discuss the margin between DR-OMMD and linearization. We show that such a margin is due to the gradient difference g1 t g2 t and the gap between the pre-defined weights λ0 and the adaptive weights λt. This is the best we can do for now. The regret bound comparison for m 3 is left for future research. Proof of Theorem 2. We first write out the regret bounds of both methods. 
For DR-OMMD with αt = 4F/ηt, Theorem 1 provides the following regret bound RDR-OMMD(T) γD 2 min λ Sm{ Ft(xt)λ 2 2 + 4F ηt λ λ0 1}. For linearization with fixed weights λ0 Sm, it can be viewed as single-objective optimization with linearized loss λ 0 Ft. Hence, we can directly borrow the tight bound of OMD (e.g., Theorem 6.8 in (Orabona, 2019)) and derive a bound Rlinear(T) γD 2 Ft(xt)λ0 2 2. The margin between the above two bounds takes 2 ( Ft(xt)λ0 2 2 min λ Sm{ Ft(xt)λ 2 2 + 4F ηt λ λ0 1}), which stems from different choices of composite weights. We investigate the margin at each round. Lemma 3. In a two-objective setting, suppose the gradients are g1 and g2 at some specific round t, and the corresponding gradient matrix G = [g1, g2]. For any λ0 = (γ0, 1 γ0) Sm, let λt = (γt, 1 γt) denote the composite weights produced by min-regularized-norm with L1-norm, then the following inequality holds, i.e., Gλ0 2 2 ( Gλt 2 2 + α λt λ0 1) (γ0 γt)2 g2 g1 2 2. Proof. Denote the left side of the target inequality as M(λt, λ0), then it can be simplified as M(λt, λ0) = (Gλ0 Gλt) (Gλ0 + Gλt) α λt λ0 1 = (λ0 λt) G G(λ0 + λt) α λt λ0 1 To leverage this term, we need to plug the derived composite weights λt into M(λt, λ0). Recall that in the two-objective setting, the weight γt is given as γt = max{min{γ t , 1}, 0}, where γ t = max{min{γ0, γR}, γL}, where γL = (g 2 (g2 g1) α)/ g2 g1 2 2 and γR = (g 2 (g2 g1) + α)/ g2 g1 2 2. Since the maximum and minimum operations will truncate the value of the produced weight, we now calculate M(λt, λ0) by case. Specifically, notice that γL < γR and 0 γ0 1, we consider the following cases. Published as a conference paper at ICLR 2023 Case 1: When γR < 0, we must have γL < γR < 0 γ0, which leads to γt = 0. In this case, we have λ0 λt = (γ0, γ0) and λ0 + λt = (γ0, 2 γ0). Therefore, M(λt, λ0) can be computed as M(λt, λ0) = (γ0g1 γ0g2) (γ0g1 + (2 γ0)g2) 2αλ0. Also, from the condition γR < 0, we have α < g 2 (g1 g2). Since λ0 0, plugging it into the above inequality, we have M(λt, λ0) (γ0g1 γ0g2) (γ0g1 γ0g2) = γ2 0 g1 g2 2 2. In this case, since γt = 0, we derive the desired inequality. Case 2: When γL > 1, we must have γ0 1 < γL < γR, which results in γt = 1. In this case, we have λ0 λt = (γ0 1, 1 γ0) and λ0 +λt = (γ0 +1, 1 γ0). Now M(λt, λ0) can be calculated as M(λt, λ0) = ((γ0 1)g1 + (1 γ0)g2) ((γ0 + 1)g1 + (1 γ0)g2) 2α(1 γ0). Notice that the condition γL > 1 gives α < g 1 (g2 g1). Since 1 γ0 0, plugging it into the above inequality, we have M(λt, λ0) ((γ0 1)g1 + (1 γ0)g2) ((γ0 1)g1 + (1 γ0)g2) = (1 γ0)2 g1 g2 2 2. In this case, since γt = 1, we derive the desired inequality. Case 3: When 0 γL γR 1, the margin is a bit more complex since the value of λt further depends on the relation between λ0, λL, and λR. Specifically, we consider the following cases. (i) If 0 γL γ0 γR 1, then γt = γ0. In this case, the inequality trivially holds. (ii) If 0 γ0 γL 1, then γt = γL. In this case, since λt λ0 1 = 2(γL γ0), we can calculate M(λt, λ0) as M(λt, λ0) = ((γ0 γL)g1 + (γL γ0)g2) ((γ0 + γL)g1 + (2 γ0 γL)g2) 2α(γL γ0) = (γ0 γL)((g1 g2) ((γ0 + γL)g1 + (2 γ0 γL)g2) + 2α). Since γt = γL = (g 2 (g2 g1) α)/ g2 g1 2 2, we have α = (g1 g2) ( γLg1 + (γL 1)g2). Therefore, we further have (g1 g2) ((γ0 + γL)g1 + (2 γ0 γL)g2) + 2α = (g1 g2) ((γ0 γL)g1 + (γL γ0)g2) = (g1 g2) (γ0 γL)(g1 g2). Plugging it into the above equation on M(λt, λ0), we derive M(λt, λ0) = (γL γ0)2 g1 g2 2 2 (iii) If 0 γR γ0 1, then γt = γR. 
In this case, ‖λt − λ0‖₁ = 2(γ0 − γR), and M(λt, λ0) can be calculated as
M(λt, λ0) = ((γ0 − γR)g1 + (γR − γ0)g2)ᵀ((γ0 + γR)g1 + (2 − γ0 − γR)g2) − 2α(γ0 − γR)
= (γ0 − γR)((g1 − g2)ᵀ((γ0 + γR)g1 + (2 − γ0 − γR)g2) − 2α).
Since γt = γR = (g2ᵀ(g2 − g1) + α)/‖g2 − g1‖_2^2, we have α = (g1 − g2)ᵀ(γR g1 + (1 − γR)g2), and then
(g1 − g2)ᵀ((γ0 + γR)g1 + (2 − γ0 − γR)g2) − 2α = (g1 − g2)ᵀ(γ0 − γR)(g1 − g2) = (γ0 − γR)‖g1 − g2‖_2^2.
Plugging it into the above expression for M(λt, λ0), we derive
M(λt, λ0) = (γR − γ0)^2 ‖g1 − g2‖_2^2,
which, since γt = γR, is exactly (γt − γ0)^2 ‖g1 − g2‖_2^2. Combining all of the above cases, we prove the lemma.

For any t ∈ {1, ..., T}, set g1 = g_t^1, g2 = g_t^2 (i.e., G = Gt), and α = 4F/ηt in Lemma 3; we then have
‖Gtλ0‖_2^2 − (‖Gtλt‖_2^2 + (4F/ηt)‖λt − λ0‖₁) ≥ (γt − γ0)^2 ‖g_t^2 − g_t^1‖_2^2.
Since ‖λt − λ0‖_2^2 = 2(γt − γ0)^2, weighting the above inequality by ηt/2 and summing over t ∈ {1, ..., T}, we can lower bound the margin by
∑_{t=1}^T (ηt/4) ‖λt − λ0‖_2^2 ‖g_t^2 − g_t^1‖_2^2,
which proves the theorem.

Figure 3: Results to verify the effectiveness of adaptive regularization on covtype. (a) Effect of preference: performance of DR-OMMD and linearization under varying λ0 = (λ_0^1, 1 − λ_0^1). (b) Learning curve: performance using the optimal weights λ0 = (0.1, 0.9).

Figure 4: More empirical results of the multi-task deep online experiments on MultiMNIST. The plots show the average cumulative loss, training loss, and test loss of DR-OMMD and various baselines on both tasks: (a) Task L, average cumulative loss; (b) Task L, training loss; (c) Task L, test loss; (d) Task R, average cumulative loss; (e) Task R, training loss; (f) Task R, test loss.

G MORE EXPERIMENTAL RESULTS

G.1 MORE DETAILS OF THE EXPERIMENTAL SETUP

The protein and covtype datasets used in our experiments are publicly available in (Dua & Graff, 2017). The MultiMNIST dataset is generated with the code provided by (Sener & Koltun, 2018). All runs are deployed on an Intel Xeon(R) E5-2699 @ 2.2GHz.

G.2 MORE RESULTS FOR ADAPTIVE REGULARIZATION

We supplement the empirical results on covtype in Figure 3, which were omitted from our main paper due to space limitations. These results are consistent with the results on protein presented in our main paper.

G.3 MORE RESULTS FOR ONLINE DEEP MULTI-TASK LEARNING

In our main paper, due to the page limit, Figure 2 only reports the average cumulative loss of DR-OMMD and various baselines on MultiMNIST. Here we supplement the results on the training loss and the test loss in Figure 4. The results are consistent with the average cumulative loss, showing the superiority of DR-OMMD over linearization and MGDA in the non-convex setting.

H OMITTED PROOFS OF PROPOSITION 2 (MIN-NORM MAY INCUR LINEAR REGRETS)

Proof. As described in our main paper, we consider the following two-objective optimization problem. The decision domain is set as X = {(u, v) | |u| + |v| ≤ 1/2, v ≥ 0}.
At each Published as a conference paper at ICLR 2023 round t, the loss function Ft : X R2 takes ( ( x a 2 2, x b 2 2), t = 2k 1, k = 1, 2, ...; ( x b 2 2, x c 2 2), t = 2k, k = 1, 2, ..., where a = ( 2, 1), b = (0, 1), c = (2, 1). For simplicity of analysis, we first consider the case when the total time horizon T is an even number. Then it can be checked that the cumulative loss function takes t=1 Ft(x) = T 2 ( x a 2 2 + x b 2 2, x b 2 2 + x c 2 2) = T ((u + 1)2 + v2 + 2, (u 1)2 + v2 + 2), for any x = (u, v) X. Obviously the Pareto optimal set X of the cumulative loss coincides with the line segment between ( 1, 0) and (1, 0), i.e., X = {(u, v) | 1 2, v = 0} (note that X is the intersection of the line segment and X). Now consider equipping OMD with vanilla min-norm, where the composite gradients are produced by the min-norm method. Suppose the learning process starts at any x1 = (u1, v1) X such that v1 > 0. Note that this is true if and only if x1 / X . Then for the iterate xt = (ut, vt) at each round t, we can directly calculate the gradients as g1 t = 2(xt a) = (2ut + 4, 2vt + 2), t = 2k 1; 2(xt b) = (2ut, 2vt 2), t = 2k. g2 t = 2(xt b) = (2ut, 2vt 2), t = 2k 1; 2(xt c) = (2ut 4, 2vt + 2), t = 2k. The min-norm weights can be computed as λt = (γt, 1 γt) where (xt b) (a b) a b 2 2 = 1 ut vt 4 , t = 2k 1; (xt c) (b c) b c 2 2 = 3 ut + vt 4 , t = 2k. The composite gradient gcomp t = γt 2(x a) + (1 γt) 2(x b) = (ut vt + 1, ut + vt 1), t = 2k 1; γt 2(x b) + (1 γt) 2(x c) = ( ut vt 1, ut vt 1), t = 2k. Recall that the update form of OMD takes xt+1 = ΠX (xt ηtgcomp t ), where ηt > 0 is the learning rate and ΠX is the projection operation onto X. Denote the iterate xt = (ut, vt) at each round. Now we can investigate the relation between xt and xt+1 by considering the following two cases: (i) If xt ηtgcomp t X, then we do not need projection, and directly have xt+1 = xt ηtgcomp t . (ii) If xt ηtgcomp t / X, then we need to project xt ηtgcomp t back to X. Denote x t+1 = xt ηtgcomp t . For simplicity we consider the projection based on the Euclidean distance, namely ΠX (x) = arg minx X x x 2 2. Since the composite gradient is orthogonal to the boundary on which the iterate after projection xt+1 = ΠX (x t+1) is located, it can be checked that xt+1 lies on the line segment linking xt and x t+1. Alternatively speaking, xt+1 can be expressed as xt η tgcomp t for some 0 η t < ηt. Combining the above two cases, we know that at each round t, there exists some η t [0, ηt] such that xt+1 = xt η tgcomp t . Now we can analyze the relation between each entry of xt and xt+1. Specifically, since the second entry of the composite gradient is always non-positive, namely ut + vt 1 0 and ut vt 1 0, we have vt+1 vt for any t. Moreover, since the first entry of gcomp t is non-negative when t = 2k 1, namely u2k 1 v2k 1 + 1 0, we have u2k u2k 1 for Published as a conference paper at ICLR 2023 any k; since the first entry of gcomp t is non-positive when t = 2k, namely u2k v2k 1 0, we have u2k+1 u2k for any k. Now we can go back to analyze the gap between the composite weights at any two consecutive rounds. It is easy to verify that γ2k 1 < γ2k and γ2k > γ2k+1, hence we have λ2k λ2k 1 1 = 2(γ2k γ2k 1) = 2 (u2k u2k 1) + (v2k + v2k 1) λ2k+1 λ2k 1 = 2(γ2k γ2k+1) = 2 (u2k u2k+1) + (v2k + v2k+1) Therefore, the composite weights λt indeed change radically at any two consecutive rounds. The above analysis on vt also implies the failure of min-norm in this problem. 
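This failure can also be observed numerically before we turn to the formal regret calculation. The sketch below runs the OMD update with R(x) = (1/2)‖x‖_2^2 (i.e., projected gradient descent) together with the vanilla min-norm solver on the alternating losses above; the step size, horizon, starting point, and the box used as a stand-in for the domain X are illustrative choices made for this sketch, not the exact constants of the construction.

```python
import numpy as np

a, b, c = np.array([-2.0, -1.0]), np.array([0.0, 1.0]), np.array([2.0, -1.0])

def grads(x, t):
    # Gradients of the two quadratic objectives at round t (1-indexed).
    if t % 2 == 1:  # F_t = (||x - a||^2, ||x - b||^2)
        return 2 * (x - a), 2 * (x - b)
    return 2 * (x - b), 2 * (x - c)  # F_t = (||x - b||^2, ||x - c||^2)

def min_norm_weight(g1, g2):
    # Vanilla min-norm solver for two gradients:
    # gamma* = argmin_{gamma in [0, 1]} ||gamma * g1 + (1 - gamma) * g2||^2.
    denom = float(np.dot(g1 - g2, g1 - g2))
    if denom == 0.0:
        return 0.5
    return float(np.clip(np.dot(g2, g2 - g1) / denom, 0.0, 1.0))

T, eta = 10000, 0.01
x = np.array([0.0, 0.4])  # v_1 > 0, i.e. x_1 is not Pareto optimal
for t in range(1, T + 1):
    g1, g2 = grads(x, t)
    gamma = min_norm_weight(g1, g2)
    x = x - eta * (gamma * g1 + (1.0 - gamma) * g2)
    # stand-in projection: a box containing the Pareto segment {v = 0}
    x = np.clip(x, [-0.5, 0.0], [0.5, 0.5])

print("final iterate:", x)  # the second coordinate never decreases below v_1
```

In this simulation the second coordinate vt never decreases, so the iterates remain bounded away from the Pareto segment {v = 0}, in line with the analysis above.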
Recall that any Pareto optimal solution x = (u , v ) X must satisfy v = 0. Suppose the initial iterate x1 = (u1, v1) does not lie in X , i.e., v1 > 0, which is almost sure for random initialization x1 X. Then we iteratively have 0 < v1 v2 ... v T , which means that xt moves away from the Pareto set X . In the following, we strictly prove that min-norm indeed incurs a linear multi-objective regret. To calculate R(T), we first investigate the quantity R(x , λ) = λ PT t=1(Ft(xt) Ft(x )) for any fixed weights λ = (γ, 1 γ) S2 and best fixed decision x = (u , 0) X . Specifically, recall the form of PT t=1 Ft derived above, then we have t=1 Ft(x ) = (γ(u + 1)2 + (1 γ)(u 1)2 + 2)T. Denote the cumulative loss PT t=1 Ft(xt) = (L1, L2), we now consider the loss of each objective L1 and L2 separately. Specifically, for the first objective, we have k=1 ((u2k 1 + 2)2 + u2 2k + (v2k 1 + 1)2 + (v2k 1)2). Since 0 < v1 v2 ... v T 1, for the term regarding vt we have k=1 ((v2k 1 + 1)2 + (v2k 1)2) = (v1 + 1)2 + (v T 1)2 + k=1 ((v2k 1)2 + (v2k+1 + 1)2) k=1 ((v2k 1)2 + (v2k + 1)2) = k=1 (2v2 2k + 2) k=1 (2v2 1 + 2) = (2v2 1 + 2)(T 2 1) v2 1T + T 2. For the k-th term regarding ut, we have (u2k 1 + 2)2 + u2 2k = (u2k 1 + 1)2 + (u2k + 1)2 + 2(u2k 1 u2k) + 2. Recall that we have derived u2k u2k 1, thus we have k=1 (u2k 1+2)2+u2 2k k=1 ((u2k 1+1)2+(u2k +1)2+2) t=1 (ut+1)2+T ( u+1)2T +T, where u = 1 T PT t=1 ut and the last inequality is derived from Jensen s inequality. In summary, for the cumulative loss L1 of the first objective, we have L1 ( u + 1)2T + v2 1T + 2T 2. Similarly, we can analyze the cumulative loss L2 of the second objective k=1 (u2 2k 1 + (u2k 2)2 + (v2k 1 1)2 + (v2k + 1)2). Published as a conference paper at ICLR 2023 Since 0 < v1 v2 ... v T 1, for the term regarding vt we have k=1 ((v2k 1 1)2 + (v2k + 1)2) k=1 ((v2k 1 1)2 + (v2k 1 + 1)2) v2 1T + T. For the term regarding ut, we also have k=1 (u2 2k 1 + (u2k 2)2) = k=1 ((u2k 1 1)2 + (u2k 1)2 + 2(u2k 1 u2k) + 2) t=1 (ut 1)2 + T ( u 1)2T + T, where the last inequality is derived from Jensen s inequality. Therefore, we have L2 ( u 1)2T + v2 1T + 2T. Combining the above inequalities, we have R(x , λ) = γL1 + (1 γ)L2 λ T X γ(( u + 1)2 (u + 1)2) + (1 γ)(( u 1)2 (u 1)2) + v2 1T 2γ. For any λ S2 (i.e., γ [0, 1]), set x = ( u, 0) X , then it holds that R(x , λ) v2 1T 2. Equivalently, the multi-objective regret satisfies R(T) = sup x X inf λ S2 R(x , λ) inf λ S2 R(x , λ) v2 1T 2, which is linear w.r.t. T for any x1 = (u1, v1) X such that v1 > 0. We now investigate the case when T is an odd number. Since the calculation of the composite weights λt and the composite gradient gcomp t at each round is independent of the total time horizon T, we still have λt+1 λt 1 v1 + 1 for any t. Hence the first desired property also holds for any odd T. It remains to prove that OMD with min-norm still incurs a linear regret when T is odd. In this case, the Pareto optimal set X does not lie in the x-axis anymore, hence it is difficult to directly compute R(T). However, we can still use our derived R(x , λ) for any even T to estimate the regret. Specifically, set x = ( 1 T 1 PT 1 t=1 ut, 0); from the above derivation with even T, for any λ S2, we still have (note that now T 1 is an even number) t=1 Ft(xt) λ T 1 X t=1 Ft(x ) v2 1T 2. Since for any x X, we have 0 x a 2 2, x b 2 2, x c 2 2 10, we have R(x , λ) = λ T X t=1 Ft(xt) λ T X t=1 Ft(x ) v2 1T 12. 
Furthermore, from the definition of Pareto optimality, there exists some x X that Pareto dominates x regarding the cumulative loss PT t=1 Ft, namely PT t=1 Ft(x ) PT t=1 Ft(x ). Hence R(x , λ) = λ T X t=1 Ft(xt) λ T X t=1 Ft(x ) R(x , λ), for any λ S2. Therefore, the multi-objective regret R(T) = sup x X inf λ S2 R(x , λ) inf λ S2 R(x , λ) inf λ S2 R(x , λ) v2 1T 12, which is also linear w.r.t. T for any x1 = (u1, v1) X such that v1 > 0. Published as a conference paper at ICLR 2023 I OMITTED PROOFS OF THEOREM 1 Proof. We start from the definition of the multi-objective regret RII(T) (which is abbreviated as R(T)). Specifically, for any λ Sm and λ1 . . . , λT Sm, it holds that R(T) = sup x X inf λ Sm t=1 λ (Ft(xt) Ft(x )) sup x X t=1 λ (Ft(xt) Ft(x )) (λ λt) Ft(xt) + λt (Ft(xt) Ft(x )) + (λt λ) Ft(x ) t=1 F λ λt 1 + sup x X t=1 λt (Ft(xt) Ft(x )) + t=1 F λ λt 1 t=1 λ λt 1 + sup x X t=1 λt (Ft(xt) Ft(x )). To proceed, notice that if the composite weights λt are given beforehand instead of being calculated via min-regularized-norm, DR-OMMD acts just like standard OMD using linearized loss λt Ft. Hence, the second term in the above regret bound can be further analyzed in a similar way as singleobjective OMD (Srebro et al., 2011; Cesa-bianchi et al., 2012). Specifically, at each round t, since Ft is coordinate-wise convex, the linearized loss λ t Ft is also convex. Also notice that the composite gradient gt = Ft(xt)λt is exactly the gradient of λ t Ft at xt. Hence for any x X , we have λ t Ft(xt) λ t Ft(x ) g t (xt x ) = g t (xt+1 x ) + g t (xt xt+1). From the first-order optimal condition of xt+1, for any x X, we have (ηt Ft(xt)λt + R(xt+1) R(xt)) (x xt+1) 0. Recall that gt = Ft(xt)λt. We set x = x and combine the above two inequalities, which derives λ t Ft(xt) λ t Ft(x ) 1 ηt ( R(xt+1) R(xt)) (x xt+1) + g t (xt xt+1). Recall the definition of Bregman divergence BR. We can check that (also see (Beck & Teboulle, 2003)) BR(x , xt) BR(x , xt+1) BR(xt+1, xt) = ( R(xt+1) R(xt)) (x xt+1). Since R is 1-strongly convex, we have BR(xt+1, xt) xt+1 xt 2 2/2. Hence λ t Ft(xt) λ t Ft(x ) 1 ηt (BR(x , xt) BR(x , xt+1) 1 2 xt+1 xt 2 2) + g t (xt xt+1). Moreover, from the Cauchy-Schwartz inequality we have g t (xt xt+1) ηt 2 gt 2 2 + 1 2ηt xt xt+1 2 2. Combining the above two inequalities, we derive λt Ft(xt) λt Ft(x ) 1 ηt (BR(x ; xt) BR(x ; xt+1)) + ηt 2 Ft(xt)λt 2 2, for any x X . Summing it over t {1, ..., T} and utilizing BR(x ; x T +1) 0, we have t=1 λt (Ft(xt) Ft(x )) 1 η1 BR(x ; x1)+ ηt 1 ηt 1 )BR(x ; x T )+ 2 Ft(xt)λt 2 2. Since BR(x ; xt) γD and ηt ηt 1 for any t, we have ( 1 ηt 1 ηt 1 )BR(x ; x T ) ( 1 1 ηt 1 )γD. Hence we further have t=1 λt (Ft(xt) Ft(x )) γD 2 Ft(xt)λt 2 2. Taking the supremum over x X and plugging it back to the above regret bound, we prove the theorem. Published as a conference paper at ICLR 2023 J REGRET ANALYSIS IN THE STRONGLY CONVEX SETTING In this section, we discuss the regret bound of DR-OMMD in the strongly convex setting, where each loss function f i t is H-strongly convex. Recall that most literature in this setting only considers OGD (Zhao & Zhang, 2021; Wan et al., 2022), which is a special case of OMD that instantiates the regularization function on the iterate x as L2-regularizer, i.e., R(x) = 1 2 x 2 2. Hence, in the following, we mainly analyze the bound of the OGD-type variant in the strongly convex setting. Theorem 3. Assume that for any t {1, ..., T}, i {1, ..., m}, the loss function f i t is H-strongly convex. 
Set ηt = 1/(Ht) and R(x) = (1/2)‖x‖_2^2 in DR-OMMD; then it attains the following regret bound:
R(T) ≤ sup_{x* ∈ X*} (H/2)‖x1 − x*‖_2^2 + ∑_{t=1}^T (1/(2Ht)) (‖∇Ft(xt)λt‖_2^2 + 4FHt ‖λt − λ0‖₁).

Remark. By setting αt = 4FHt, the above bound reduces to
sup_{x* ∈ X*} (H/2)‖x1 − x*‖_2^2 + ∑_{t=1}^T (1/(2Ht)) min_{λ ∈ Sm} {‖∇Ft(xt)λ‖_2^2 + αt ‖λ − λ0‖₁}
≤ sup_{x* ∈ X*} (H/2)‖x1 − x*‖_2^2 + ∑_{t=1}^T (1/(2Ht)) ‖∇Ft(xt)λ0‖_2^2
≤ sup_{x* ∈ X*} (H/2)‖x1 − x*‖_2^2 + ∑_{t=1}^T G^2/(2Ht) = O(log T),
which aligns with the optimal regret bound O(log T) in the single-objective strongly convex setting (Hazan et al., 2016).

Proof. From the derivation of Theorem 1, we have
R(T) ≤ 2F ∑_{t=1}^T ‖λ − λt‖₁ + sup_{x* ∈ X*} ∑_{t=1}^T λtᵀ(Ft(xt) − Ft(x*)).
Denote the composite gradient as gt = ∇Ft(xt)λt, which equals the gradient of λtᵀFt at xt. Since λt ∈ Sm, λtᵀFt is a convex combination of f_t^1, ..., f_t^m, hence λtᵀFt is also H-strongly convex. For any x* ∈ X*, we now have
λtᵀFt(xt) − λtᵀFt(x*) ≤ gtᵀ(xt − x*) − (H/2)‖xt − x*‖_2^2.
We now bound the term gtᵀ(xt − x*). When R(x) = (1/2)‖x‖_2^2, the Bregman divergence is BR(x, z) = (1/2)‖x − z‖_2^2 and the OMD update reduces to OGD, i.e., xt+1 = ΠX(xt − ηt gt), where ΠX denotes the standard projection onto X. Plugging the update rule into ‖xt+1 − x*‖_2^2, we derive
‖xt+1 − x*‖_2^2 = ‖ΠX(xt − ηt gt) − x*‖_2^2 ≤ ‖xt − ηt gt − x*‖_2^2,
where the inequality is derived from the Pythagorean theorem. Hence we have
‖xt+1 − x*‖_2^2 ≤ ‖xt − x*‖_2^2 + ηt^2 ‖gt‖_2^2 − 2ηt gtᵀ(xt − x*),
or equivalently
gtᵀ(xt − x*) ≤ (‖xt − x*‖_2^2 − ‖xt+1 − x*‖_2^2)/(2ηt) + (ηt/2)‖gt‖_2^2.
Plugging it into the inequality on λtᵀFt(xt) − λtᵀFt(x*), we derive
λtᵀFt(xt) − λtᵀFt(x*) ≤ (‖xt − x*‖_2^2 − ‖xt+1 − x*‖_2^2)/(2ηt) + (ηt/2)‖gt‖_2^2 − (H/2)‖xt − x*‖_2^2,
for any x* ∈ X*. Summing it over t ∈ {1, ..., T}, we have
∑_{t=1}^T λtᵀ(Ft(xt) − Ft(x*)) ≤ (1/(2η1))‖x1 − x*‖_2^2 + ∑_{t=2}^T (1/(2ηt) − 1/(2η_{t−1}) − H/2)‖xt − x*‖_2^2 + ∑_{t=1}^T (ηt/2)‖∇Ft(xt)λt‖_2^2.
When setting ηt = 1/(Ht), we have 1/(2ηt) − 1/(2η_{t−1}) − H/2 = 0 and 1/(2η1) = H/2. Consequently, we have
∑_{t=1}^T λtᵀ(Ft(xt) − Ft(x*)) ≤ (H/2)‖x1 − x*‖_2^2 + ∑_{t=1}^T (1/(2Ht))‖∇Ft(xt)λt‖_2^2.
Plugging it into the above regret form, we derive the desired regret bound.
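To make the strongly convex update concrete, the following sketch implements one round of the OGD-type DR-OMMD step analyzed above for two objectives, using the closed-form solution of min-regularized-norm with L2 regularization from Proposition 4. The constants H and F, the schedule αt = 4F/ηt = 4FHt, the identity projection, and all function and variable names are assumptions made for illustration; in particular, pairing the L2 variant of the solver with this schedule is done here only to obtain a closed form, whereas the regret bound above is stated for the L1-regularized solver.

```python
import numpy as np

def l2_min_reg_norm_weight(g1, g2, gamma0, alpha):
    # Closed form of Proposition 4:
    # argmin_{gamma in [0, 1]} ||gamma * g1 + (1 - gamma) * g2||^2 + alpha * (gamma - gamma0)^2
    d = g2 - g1
    gamma = (np.dot(d, g2) + alpha * gamma0) / (np.dot(d, d) + alpha)
    return float(np.clip(gamma, 0.0, 1.0))

def dr_ommd_sc_step(x, g1, g2, t, H=1.0, F=1.0, gamma0=0.5, project=lambda z: z):
    # One strongly convex DR-OMMD round: x_{t+1} = Pi_X(x_t - eta_t * grad F_t(x_t) lambda_t).
    eta = 1.0 / (H * t)        # eta_t = 1/(Ht), as in Theorem 3
    alpha = 4.0 * F * H * t    # alpha_t = 4F/eta_t, mirroring the remark above
    gamma = l2_min_reg_norm_weight(g1, g2, gamma0, alpha)
    g_comp = gamma * g1 + (1.0 - gamma) * g2
    return project(x - eta * g_comp), np.array([gamma, 1.0 - gamma])

# toy usage with two 2-strongly convex quadratics f^1 = ||x - e1||^2 and f^2 = ||x + e1||^2
x, e1 = np.zeros(2), np.array([1.0, 0.0])
for t in range(1, 6):
    x, lam = dr_ommd_sc_step(x, 2 * (x - e1), 2 * (x + e1), t, H=2.0)
```

For m > 2, the same update applies with the Frank-Wolfe solver of Algorithm 3 in place of the two-gradient closed form.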