# Learning-to-learn non-convex piecewise-Lipschitz functions

Maria-Florina Balcan, Mikhail Khodak, Dravyansh Sharma (Carnegie Mellon University), {ninamf,mkhodak,dravyans}@cs.cmu.edu
Ameet Talwalkar (Carnegie Mellon University & Hewlett Packard Enterprise), talwalkar@cmu.edu

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

We analyze the meta-learning of the initialization and step-size of learning algorithms for piecewise-Lipschitz functions, a non-convex setting with applications to both machine learning and algorithms. Starting from recent regret bounds for the exponential forecaster on losses with dispersed discontinuities, we generalize them to be initialization-dependent and then use this result to propose a practical meta-learning procedure that learns both the initialization and the step-size of the algorithm from multiple online learning tasks. Asymptotically, we guarantee that the average regret across tasks scales with a natural notion of task-similarity that measures the amount of overlap between near-optimal regions of different tasks. Finally, we instantiate the method and its guarantee in two important settings: robust meta-learning and multi-task data-driven algorithm design.

1 Introduction

While learning-to-learn or meta-learning has long been an object of study [45], in recent years it has gained significant attention as a multi-task paradigm for developing algorithms for learning in dynamic environments, from multiple data sources, and in federated settings. Such methods focus on using data from multiple tasks to improve performance when facing a new, potentially related task. A popular approach is initialization-based meta-learning, in which the meta-learner uses multi-task data to output an initialization for an iterative algorithm such as stochastic gradient descent (SGD) [25]. The flexibility of this approach has led to its widespread adoption, e.g. in robotics [23] and federated learning [18], and to a growing number of attempts to understand it, both empirically and theoretically [20, 29, 24, 40, 42]. However, outside some stylized setups our learning-theoretic understanding of how to meta-learn an initialization is largely restricted to the convex Lipschitz setting. We relax both assumptions to study the meta-learning of online algorithms over piecewise-Lipschitz functions, which can be non-convex and highly discontinuous. As no-regret online learning over such functions is impossible in general, we study the case of piecewise-Lipschitz functions with dispersed discontinuities that do not concentrate in any small compact subset of the domain [11]. Such functions arise frequently in data-driven algorithm design, where the goal is to learn the optimal parameter settings of algorithms for difficult (often NP-hard) problems over a distribution or sequence of instances [4]; for example, a small change to the metric in cluster linkage can lead to a discontinuous change in the classification error [7]. In this paper, we also demonstrate that such losses are relevant in the setting of adversarial robustness, where we introduce a novel online formulation. For both cases, the associated problems are often solved across many time periods or for many different problem domains, resulting in natural multi-task structure that might improve performance. To the best of our knowledge, ours is the first theoretical study of meta-learning in both of these application settings.
In the single-task setting the problem of learning dispersed functions can be solved using simple methods such as the exponentially-weighted forecaster. To design an algorithm for learning to initialize online learners in this setting, we propose a method that optimizes a sequence of data-dependent upper-bounds on the within-task regret [29]. The result is an averaged bound that improves upon the regret of the single-task exponential forecaster so long as there exists an initial distribution that can compactly contain many of the within-task optima of the different tasks. Designing the meta-procedure is especially challenging in our setting because it involves online learning over a set of distributions on the domain. To handle this we study a prescient form of the classic follow-the-regularized-leader (FTRL) scheme that is run over an unknown discretization; we then show the existence of another algorithm that plays the same actions but uses only known information, thus attaining the same regret while being practical to implement. To demonstrate the usefulness of our method, we study this algorithm in two settings.

Multi-task data-driven algorithm design. We consider data-driven tuning of the parameters of combinatorial optimization algorithms for hard problems such as knapsack and clustering. The likely intractability of these problems on worst-case instances has led to several approaches to study them in more realistic settings, such as smoothed analysis [44] and data-driven algorithm configuration [4]. Our setting is more realistic than those considered in prior work: it is more challenging than learning from i.i.d. instances [27, 12], but at the same time less pessimistic than online learning over adversarial problem instances [11], as it allows us to leverage similarity of problem instances coming from different but related distributions. We instantiate our bounds theoretically on several problems where the cost functions are piecewise-constant in the tuned parameters, allowing our meta-procedure to learn the right initial distribution for exponential forecasters. This includes well-known combinatorial optimization problems like finding the maximum weighted independent set (MWIS) of vertices on a graph, solving quadratic programs with integer constraints using algorithms based on the celebrated Goemans-Williamson algorithm, and mechanism design for combinatorial auctions. Then we consider experimentally the problem of tuning the right α for the α-Lloyd's family of clustering algorithms [15]. In experimental evaluations on two datasets, a synthetic Gaussian mixture model and the well-known Omniglot dataset from meta-learning [33], our meta-procedure leads to improved clustering accuracy compared to single-task learning to cluster. We also study our results for a family of greedy algorithms for the knapsack problem introduced by [27] and obtain similar results for a synthetic dataset. The results hold for both one-shot and five-shot tasks.

Online robust meta-learning. The second instantiation of our meta-learning procedure is to a new notion of adversarial robustness for the setting of online learning, where our results imply robust meta-learning in the presence of outliers. In this setting, the adversary can make (typically small) modifications to some example x ∈ X, which can result in potentially large changes to the corresponding loss value ℓ_h(x), where h ∈ H is our hypothesis.
For instance, consider the well-studied setting of adversarial examples for classification of images using deep neural networks [36, 17]. Given a neural network f, the adversary can perturb a datapoint x to a point x′, say within a small Lp-ball around x, such that f(x′) ≠ f(x), so the prediction on x′ no longer matches the true label of x, and therefore ℓ_f(x) ≠ ℓ_f(x′). In general, under the adversarial influence, we observe a perturbed loss function ℓ̃_h(x) = ℓ_h(x) + a_h(x). Typically we are interested in optimizing both the perturbed loss ℓ̃_h(x), i.e. measuring performance relative to the optimum for adversarially perturbed losses, and the true loss ℓ_h(x) (performance on the unobserved, unperturbed loss). For example, in the online learning setting, [1] consider perturbed-loss minimization for linear dynamical systems, while [41] look at true {0, 1} loss minimization in the presence of adversarial noise. Our approach ensures that regret for both the perturbed and the true loss is small, for piecewise-Lipschitz but dispersed adversaries.

2 Preliminaries and initialization-dependent learning of dispersed functions

In this section we introduce our setup and notation for online learning of piecewise-Lipschitz functions in a multi-task environment. We then generalize existing results for the single-task setting in order to obtain within-task regret bounds that depend on both the initialization and the task data. This is critical for both defining a notion of task similarity and devising a meta-learning procedure.

Meta-learning setup. Following past setups [2, 21, 29], for some T, m > 0 and all t ∈ [T] and i ∈ [m] we consider a meta-learner faced with a sequence of Tm loss functions ℓ_{t,i} : C → [0, 1] over a compact subset C ⊂ R^d that lies within a ball B(ρ, R) of radius R around some point ρ ∈ R^d. Here we use the notation [n] = {1, . . . , n}. Before each loss function ℓ_{t,i} arrives, the meta-learner must pick an element ρ_{t,i} ∈ C, after which it suffers the loss or cost ℓ_{t,i}(ρ_{t,i}). For a fixed t, the subsequence ℓ_{t,1}, . . . , ℓ_{t,m} defines a task, for which we expect a single element ρ*_t ∈ C to do well, and thus we will use the within-task regret on task t to describe the quantity

$$R_{t,m} = \sum_{i=1}^m \ell_{t,i}(\rho_{t,i}) - \ell_{t,i}(\rho_t^*), \quad \text{where } \rho_t^* \in \arg\min_{\rho \in C} \sum_{i=1}^m \ell_{t,i}(\rho) \qquad (1)$$

Algorithm 1 Exponential Forecaster
1: Input: step-size parameter λ ∈ (0, 1], initialization w : C → R_{≥0}.
2: Initialize w_1 = w
3: for i = 1, 2, . . . , m do
4:   W_i := ∫_C w_i(ρ) dρ
5:   Sample ρ_i with probability proportional to w_i(ρ_i), i.e. with probability p_i(ρ_i) = w_i(ρ_i)/W_i
6:   Suffer ℓ_i(ρ_i) and observe ℓ_i(·)
7:   For each ρ ∈ C, set w_{i+1}(ρ) = e^{−λℓ_i(ρ)} w_i(ρ)

In the single-task setting the goal is usually to show that R_{t,m} is sublinear in m, i.e. that the average loss decreases with more rounds. A key point here is that the functions we consider can have numerous global optima. In this work we will assume, after going through the m rounds of task t, that we have oracle access to a single fixed optimum for t, which we will refer to using ρ*_t and use in both our algorithm and to define the task-similarity. Note that in the types of applications we are interested in (piecewise-Lipschitz functions) the complexity of computing optima scales with the number of discontinuities; in the important special case of piecewise-constant functions, this dependency becomes logarithmic [19]. Thus this assumption does not affect the usefulness of the result.
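Before turning to the multi-task setting, the following is a minimal sketch of Algorithm 1 for a one-dimensional parameter domain, approximating the continuous measure w_i by its values on a fine uniform grid; the grid size, loss interface, and all identifiers are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def exponential_forecaster(losses, C=(0.0, 1.0), lam=0.1, n_grid=1000, w_init=None, rng=None):
    """Sketch of Algorithm 1 (Exponential Forecaster) on a discretized 1-D domain C.

    losses: list of functions ell_i(rho) -> [0, 1], revealed one per round.
    lam:    step-size lambda in (0, 1].
    w_init: optional initial (unnormalized) weights over the grid; uniform if None.
    """
    rng = np.random.default_rng() if rng is None else rng
    grid = np.linspace(C[0], C[1], n_grid)              # discretization of C
    w = np.ones(n_grid) if w_init is None else np.asarray(w_init, dtype=float).copy()
    played, total_loss = [], 0.0
    for ell in losses:
        p = w / w.sum()                                 # p_i(rho) = w_i(rho) / W_i
        rho = rng.choice(grid, p=p)                     # sample rho_i proportional to w_i
        played.append(rho)
        total_loss += ell(rho)                          # suffer ell_i(rho_i)
        ell_on_grid = np.array([ell(r) for r in grid])  # observe ell_i(.)
        w *= np.exp(-lam * ell_on_grid)                 # multiplicative-weights update
    return played, total_loss

# Example: 100 piecewise-constant losses whose minimizers cluster around rho = 0.3.
rng = np.random.default_rng(0)
losses = [lambda r, c=rng.uniform(0.25, 0.35): float(abs(r - c) > 0.05) for _ in range(100)]
print(exponential_forecaster(losses, lam=0.2, rng=rng)[1] / 100)
```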
Our goal will be to improve the guarantees for regret in the single-task case by using information obtained from solving multiple tasks. In particular, we expect average performance across tasks to improve as we see more tasks; to phrase this mathematically we define the task-averaged regret

$$\bar{R}_{T,m} = \frac{1}{T}\sum_{t=1}^T R_{t,m} = \frac{1}{T}\sum_{t=1}^T \sum_{i=1}^m \ell_{t,i}(\rho_{t,i}) - \ell_{t,i}(\rho_t^*) \qquad (2)$$

and claim improvement over single-task learning if in the limit of T → ∞ it is smaller than R_{t,m}. Note that for simplicity in this work we assume all tasks have the same number of rounds within-task, but as with past work our results are straightforward to extend to the more general setting.

Learning piecewise-Lipschitz functions. We now turn to our target functions and within-task algorithms for learning them: piecewise-Lipschitz losses, i.e. functions that are L-Lipschitz w.r.t. the Euclidean norm everywhere except on measure-zero subsets of the space; there they may have arbitrary jump discontinuities so long as they remain bounded in [0, 1]. Apart from being a natural setting of interest due to its generality compared to past work on meta-learning, this class of functions has also been shown to have important applications in data-driven algorithm configuration [11]; there these functions represent the cost, e.g. an objective value or time-complexity, of algorithms for difficult problems such as integer programming, auction design, and clustering. This literature has also shown lower bounds demonstrating that no-regret learning of piecewise-Lipschitz functions is impossible in general, necessitating assumptions about the sequence. One such condition is dispersion, which requires that the discontinuities are not too concentrated.

Definition 2.1 ([11]). The sequence of random loss functions ℓ_1, . . . , ℓ_m is said to be β-dispersed with Lipschitz constant L if, for all m and for all ε ≥ m^{−β}, we have that, in expectation over the randomness of the functions, at most Õ(εm) functions (the soft-O notation suppresses dependence on quantities besides ε, m and β, as well as logarithmic terms) are not L-Lipschitz for any pair of points at distance ε in the domain C. That is, for all m and for all ε ≥ m^{−β},

$$\mathbb{E}\left[\max_{\substack{\rho, \rho' \in C \\ \|\rho - \rho'\|_2 \le \epsilon}} \big|\{i \in [m] \mid \ell_i(\rho) - \ell_i(\rho') > L\|\rho - \rho'\|_2\}\big|\right] \le \tilde{O}(\epsilon m)$$

For a sequence of m β-dispersed loss functions, if the initial distribution w_1 is set to the uniform distribution over C and the step-size parameter is optimized, the exponential forecaster presented in Algorithm 1 achieves sublinear regret O(√(dm log(Rm)) + (L+1)m^{1−β}). While this result gives a no-regret procedure, its lack of dependence on both the task data and the chosen initialization makes it difficult to meta-learn. In the following theorem, we generalize the regret bound for the exponential forecaster to make it data-dependent and hyperparameter-dependent:

Theorem 2.1. Let ℓ_1, . . . , ℓ_m : C → [0, 1] be any sequence of piecewise L-Lipschitz functions that are β-dispersed. Suppose C ⊂ R^d is contained in a ball of radius R. The exponentially weighted forecaster (Algorithm 1) has expected regret

$$\mathbb{E}[R_m] \le m\lambda + \frac{\log(1/Z)}{\lambda} + O\big((L+1)m^{1-\beta}\big), \quad \text{where } Z = \frac{\int_{B(\rho^*, m^{-\beta})} w(\rho)\,d\rho}{\int_C w(\rho)\,d\rho}$$

for ρ* the optimal action in hindsight.

The proof of this result adapts past analyses of Algorithm 1; setting the step-size λ appropriately recovers the previously mentioned bound. The new bound is useful due to its explicit dependence on both the initialization w and the optimum in hindsight via the log(1/Z) term. Assuming w is a (normalized) distribution, this effectively measures the overlap between the chosen initialization and a small ball around the optimum; we thus call

$$-\log Z = -\log \frac{\int_{B(\rho^*, m^{-\beta})} w(\rho)\,d\rho}{\int_C w(\rho)\,d\rho}$$

the negative log-overlap of the initialization w(·) with the optimum ρ*.
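To make the dispersion condition in Definition 2.1 concrete, the sketch below empirically counts, over grid point pairs at distance roughly ε, how many losses in a sequence violate the L-Lipschitz condition; the grid-based pairing, the piecewise-constant example losses, and all names are simplifying assumptions for illustration.

```python
import numpy as np

def worst_case_violations(losses, eps, L, domain=(0.0, 1.0), n_grid=500):
    """Count, over grid point pairs about eps apart, the largest number of functions
    that are not L-Lipschitz between the pair (cf. Definition 2.1)."""
    grid = np.linspace(domain[0], domain[1], n_grid)
    offset = max(1, int(round(eps / (grid[1] - grid[0]))))
    worst = 0
    for j in range(n_grid - offset):
        a, b = grid[j], grid[j + offset]
        violations = sum(1 for ell in losses if abs(ell(a) - ell(b)) > L * abs(a - b))
        worst = max(worst, violations)
    return worst

# m threshold losses with uniformly random discontinuities are roughly 1/2-dispersed:
# with eps = m**-0.5 the count should grow like eps * m = sqrt(m), up to log factors.
m = 400
rng = np.random.default_rng(0)
losses = [lambda r, c=rng.uniform(): float(r < c) for _ in range(m)]
print(worst_case_violations(losses, eps=m ** -0.5, L=1.0))
```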
We also obtain an asymptotic lower bound on the expected regret of any algorithm by extending the argument of [10] to the multi-task setting. We show that for finite D* we must suffer Ω(m^{1−β}) regret, which limits the improvement we can hope to achieve from task-similarity.

Theorem 2.2. There is a sequence of piecewise L-Lipschitz, β-dispersed functions ℓ_{i,j} : [0, 1] → [0, 1], whose optimal actions in hindsight arg min_ρ Σ_{i=1}^m ℓ_{t,i}(ρ) are contained in some fixed ball of diameter D*, for which any algorithm has expected regret R_m ≥ Ω(m^{1−β}).

2.1 Task-similarity

Before proceeding to our discussion of meta-learning, we first discuss what we might hope to achieve with it; specifically, we consider what a reasonable notion of task-similarity is in this setting. Note that the Theorem 2.1 regret bound has three terms, of which two depend on the hyperparameters and the last is due to dispersion and cannot be improved via better settings. Our focus will thus be on improving the first two terms, which are the dominant ones due to the dependence on the dimensionality and the distance from the initialization encoded in the negative log-overlap. In particular, when the initialization is the uniform distribution, this quantity depends inversely on the size of a small ball around the optimum, which may be quite small. Via meta-learning we hope to assign more of the probability mass of the initializer to areas close to the optimum, which will decrease these terms. On average, rather than a dependence on the volume of a small ball we aim to achieve a dependence on the average negative log-overlap

$$V^2 = \min_{w : C \to \mathbb{R}_{\ge 0},\ \int_C w(\rho)d\rho = 1}\ -\frac{1}{T}\sum_{t=1}^T \log \int_{B(\rho_t^*, m^{-\beta})} w(\rho)\,d\rho \qquad (4)$$

which can be much smaller if the task optima ρ*_t are close together; for example, if they are all the same then V = 0, corresponding to assigning all the initial weight within the common ball B(ρ*, m^{−β}) around the shared optima. This is also true if vol(∩_{t∈[T]} B(ρ*_t, m^{−β})) > 0, as one can potentially initialize with all the weight in the intersection of the balls. On the other hand, if vol(∩_{t∈[T]} B(ρ*_t, m^{−β})) = 0 then V > 0. For example, if a p-fraction of the tasks have optima at ρ_0 and the remaining at ρ_1 with ‖ρ_0 − ρ_1‖ > 2m^{−β}, the task-similarity is given by the binary entropy function, V² = H_b(p) = −p log p − (1 − p) log(1 − p). The settings of Algorithm 1 that achieve the minimum in the definition of V are directly related to V itself: the optimal initializer is the distribution achieving V and the optimal step-size is V/√m. Note that while the explicit definition requires computing a minimum over a set of functions, the task-similarity can be computed using the discretization constructed in Section 3.1.

3 An algorithm for meta-learning the initialization and step-size

Having established a single-task algorithm and shown how its regret depends on the initialization and step-size, we move on to meta-learning these hyperparameters. Recall that our goal is to make the task-averaged regret (2) small, in particular to improve upon the baseline of repeatedly running Algorithm 1 from the uniform distribution, up to o_T(1) terms that vanish as we see more tasks. This accomplishes the meta-learning goal of using multiple tasks to improve upon single-task learning. In this paper, we use the strategy of running online learning algorithms on the data-dependent regret guarantees from above [29]. If we can do so with sublinear regret in T, then we will improve upon the single-task guarantees up to o_T(1) terms, as desired.
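As noted above, the task-similarity can be computed numerically; the following sketch estimates V on a one-dimensional domain by minimizing the average negative log-overlap (4) over distributions on a uniform grid (a simplification of the Section 3.1 discretization), with scipy's SLSQP as an illustrative solver choice.

```python
import numpy as np
from scipy.optimize import minimize

def task_similarity_sq(optima, radius, domain=(0.0, 1.0), n_grid=200):
    """Estimate V^2 = min_p -(1/T) sum_t log(mass p assigns to B(rho*_t, radius))."""
    grid = np.linspace(domain[0], domain[1], n_grid)
    # masks[t, j] = 1 if grid cell j lies in the ball around task t's optimum
    masks = (np.abs(np.subtract.outer(np.asarray(optima), grid)) <= radius).astype(float)

    def neg_avg_log_overlap(p):
        return -np.mean(np.log(masks @ p + 1e-12))

    p0 = np.full(n_grid, 1.0 / n_grid)                  # start from the uniform initialization
    res = minimize(neg_avg_log_overlap, p0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n_grid,
                   constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}])
    return res.fun

# If half the tasks have optima at 0.2 and half at 0.8 (disjoint balls), V^2 should be
# close to the binary entropy H_b(1/2) = log 2 ~ 0.693, as in the example above.
print(task_similarity_sq([0.2] * 5 + [0.8] * 5, radius=0.05))
```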
Specifically, we are faced with a sequence of regret-upper-bounds $U_t(w, v) = (v + f_t(w)/v)\sqrt{m} + g(m)$ over nonnegative functions w on C and positive scalars v > 0. Note that g(m) cannot be improved via meta-learning, so we will focus on learning w and v. To do so, we run two online algorithms, one over the functions f_t and the other over h_t(v) = v + f_t(w_t)/v, where w_t is set by the first procedure. As shown in the following result, if both procedures have sublinear regret then our task-averaged regret will have the desired properties:

Theorem 3.1. Assume each task t ∈ [T] consists of a sequence of m β-dispersed piecewise L-Lipschitz functions ℓ_{t,i} : C → [0, 1]. Let f_t and g be functions such that the regret of Algorithm 1 run with step-size λ = v/√m for v > 0 and initialization w : C → R_{≥0} is bounded by $U_t(w, v) = (v + f_t(w)/v)\sqrt{m} + g(m)$. Suppose we have a procedure that achieves F_T(w) regret w.r.t. any w : C → R_{≥0} by playing actions w_t : C → R_{≥0} on f_t, and another procedure that achieves H_T(v) regret w.r.t. any v > 0 by playing actions v_t > 0 on h_t(v) = v + f_t(w_t)/v, where H_T is non-increasing on the positive reals. Then by setting ρ_{t,i} using Algorithm 1 with step-size v_t/√m and initialization w_t at each task t, for w* = arg min_{w:C→R_{≥0}} Σ_{t=1}^T f_t(w) the optimal initialization and V the task-similarity (4), we get task-averaged regret bounded by

$$\left(\frac{H_T(V)}{T} + \min\left\{\frac{F_T(w^*)}{VT},\ 2\sqrt{\frac{F_T(w^*)}{T}}\right\} + 2V\right)\sqrt{m} + g(m) \qquad (5)$$

This result is an analog of [29, Theorem 3.1] and follows by manipulating the definition of regret. It reduces the problem of obtaining a small task-averaged regret to solving two online learning problems, one to set the initialization and one to set the step-size. So long as both have sublinear regret we will improve over single-task learning. In the next two sections we derive suitable procedures.

3.1 Meta-learning the initialization

The most technically challenging component of the procedure is learning the initialization. As discussed, this can be done via a no-regret procedure for the sequence

$$f_t(w) = -\log\frac{\int_{B(\rho_t^*, m^{-\beta})} w(\rho)\,d\rho}{\int_C w(\rho)\,d\rho}.$$

This is nontrivial as the optimization domain is a set of nonnegative functions, or measures, on the domain C. To handle this, we introduce some notation and abstractions. At each task t we face a function f_t associated with an unknown closed subset C_t ⊂ C, in particular C_t = B(ρ*_t, m^{−β}), with positive volume vol(C_t) > 0, that is revealed after choosing w_t : C → R_{≥0}. For each time t define the discretization

$$\mathcal{D}_t = \Big\{D = \bigcap_{s \le t} C_s^{(c[s])} : c \in \{0, 1\}^t,\ \mathrm{vol}(D) > 0\Big\}$$

of C, where C_t^{(0)} = C_t and C_t^{(1)} = C \ C_t. We will use elements of these discretizations to index nonnegative vectors in R_{≥0}^{|D_t|}; specifically, for any measure w : C → R_{≥0} let w^{(t)} ∈ R_{≥0}^{|D_t|} denote the vector with entries w^{(t)}[D] = ∫_D w(ρ) dρ for D ∈ D_t. Note that we will exclusively use p, q, v, w for measures, with v specifically referring to the uniform measure, i.e. v^{(t)}[D] = vol(D). For convenience, for any real vector x we will use x̂ to denote x/‖x‖_1. Finally, we abuse notation and remove the parentheses to refer to those vectors associated with the final discretization, i.e. v = v^{(T)} and w = w^{(T)}. Now that we have this notation we can turn back to the functions we are interested in: f_t(w) = −log(∫_{C_t} w(ρ)dρ / ∫_C w(ρ)dρ), where C_t = B(ρ*_t, m^{−β}). Observe that we can equivalently write this as f_t(w) = −log⟨w*_t, ŵ⟩, where w*_t[D] = 1_{D⊆C_t}; this translates our online learning problem from the domain of measures on C to the simplex on |D_T| elements.
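In one dimension each C_t is an interval, so the discretization D_t is just the set of sub-intervals between the endpoints seen so far; the sketch below maintains it incrementally and produces the vectors w^{(t)} and w*_s. The 1-D representation, midpoint-rule integration, and class interface are illustrative assumptions, not the paper's construction.

```python
import numpy as np

class Discretization1D:
    """Maintain the partition D_t of C = [lo, hi] induced by intervals C_1, ..., C_t."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.breaks = [lo, hi]      # sorted cell endpoints
        self.balls = []             # the intervals C_s = B(rho*_s, m**-beta)

    def add_ball(self, center, radius):
        a, b = max(self.lo, center - radius), min(self.hi, center + radius)
        self.balls.append((a, b))
        for x in (a, b):            # refine the partition at the new endpoints
            if x not in self.breaks:
                self.breaks.append(x)
        self.breaks.sort()

    def cells(self):
        return list(zip(self.breaks[:-1], self.breaks[1:]))

    def vectorize(self, density):
        """w^(t)[D] = integral of the measure over each cell (midpoint rule)."""
        return np.array([(r - l) * density(0.5 * (l + r)) for l, r in self.cells()])

    def indicator(self, s):
        """w*_s on the current discretization: 1 for cells contained in C_s, else 0."""
        a, b = self.balls[s]
        return np.array([1.0 if a <= l and r <= b else 0.0 for l, r in self.cells()])

# Usage: two tasks with optima 0.3 and 0.5 and ball radius m^{-beta} = 0.1.
D = Discretization1D(0.0, 1.0)
D.add_ball(0.3, 0.1)
D.add_ball(0.5, 0.1)
print(D.cells())        # cells of D_2
print(D.indicator(0))   # w*_1 indexed by those cells
```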
However, we cannot play in this domain explicitly as we do not have access to the final discretization D_T, nor do we get access to w*_t after task t, except implicitly via C_t. In this section we design a method that implicitly runs an online convex optimization procedure over R_{≥0}^{|D_T|} while explicitly playing probability measures w : C → R_{≥0}.

Algorithm 2 Follow-the-Regularized-Leader (prescient form)
1: Input: discretization D_T of C, mixture parameter γ ∈ [0, 1], step-size η > 0
2: Initialize w_1 = v̂
3: for t = 1, 2, . . . , T do
4:   Play w_t and suffer f_t(w_t) = −log⟨w*_t, w_t⟩.
5:   Observe f_t.
6:   Update $w_{t+1} = \arg\min_{\|w\|_1 = 1,\ w \ge \gamma\hat{v}} D_{KL}(w\|\hat{v}) - \eta\sum_{s\le t}\log\langle w_s^*, w\rangle$

As the functions f_t are exp-concave, one might first consider applying a method attaining logarithmic regret on such losses [28, 37]; however, such algorithms have regret that depends linearly on the dimension, which in our case is poly(T). We thus turn to the follow-the-regularized-leader (FTRL) family of algorithms, which in the case of entropic regularization are well-known to have regret logarithmic in the dimension [43]. In Algorithm 2 we display the pseudo-code of a modification with regularizer D_KL(·‖v̂), where recall v is the vector of volumes of the discretization D_T of C, and we constrain the played distribution to have measure at least γv̂[D] over every set D ∈ D_T. While this requires knowing the discretization D_T of C in advance, the following lemma shows that we can run the procedure knowing only the discretization D_t after task t by simply minimizing the same objective over distributions discretized on D_t. This crucially depends on the re-scaling of the entropic regularizer by v̂ (i.e. the uniform distribution over C) and the fact that w*_t ∈ {0, 1}^{|D_T|}.

Lemma 3.1. Let w : C → R_{≥0} be the probability measure corresponding to the minimizer

$$w = \arg\min_{\|q\|_1 = 1,\ q \ge \gamma\hat{v}} D_{KL}(q\|\hat{v}) - \eta\sum_{s\le t}\log\langle w_s^*, q\rangle \qquad (6)$$

and let w̃ : C → R_{≥0} be the probability measure corresponding to the minimizer

$$\tilde{w}^{(t)} = \arg\min_{\|q\|_1 = 1,\ q \ge \gamma\hat{v}^{(t)}} D_{KL}(q\|\hat{v}^{(t)}) - \eta\sum_{s\le t}\log\langle w_s^{*(t)}, q\rangle \qquad (7)$$

Then w̃ = w.

We can thus move on to proving a regret guarantee for Algorithm 2. This follows from Jensen's inequality together with standard results for FTRL once we show that the loss functions are 1/(γ vol(C_t))-Lipschitz over the constrained domain, yielding the following guarantee for Algorithm 2:

Theorem 3.2. Algorithm 2 has regret w.r.t. any w* ∈ arg min_{‖w‖_1=1, w≥0} Σ_{t=1}^T f_t(w) bounded by

$$\frac{D_{KL}(w^*\|\hat{v})}{\eta} + \frac{\eta}{\gamma^2}\sum_{t=1}^T \frac{1}{(\mathrm{vol}(C_t))^2} + \gamma\sum_{t=1}^T \log\frac{1}{\mathrm{vol}(C_t)} \qquad (8)$$

Setting $\gamma^2 = GB/\sqrt{T}$ and $\eta^2 = B^2\gamma^2/(TG^2)$, where $B^2 = D_{KL}(w^*\|\hat{v})$ and $G^2 = \frac{1}{T}\sum_{t=1}^T \frac{1}{(\mathrm{vol}(C_t))^2}$, yields sublinear regret $\tilde{O}(\sqrt{BG}\,T^{3/4})$.

Proof. Algorithm 2 is standard FTRL with regularizer (1/η) D_KL(·‖v̂), which has the same Hessian as the standard entropic regularizer over the simplex and is thus (1/η)-strongly-convex w.r.t. ‖·‖_1 [43, Example 2.5]. Applying Jensen's inequality, the standard regret bound for FTRL [43, Theorem 2.11] together with the Lipschitz guarantee of Claim B.1, and Jensen's inequality again yields the result:

$$\begin{aligned}
\sum_{t=1}^T f_t(w_t) - f_t(w^*) &= \sum_{t=1}^T f_t(w_t) - (1-\gamma)f_t(w^*) - \gamma f_t(\hat{v}) + \gamma\big(f_t(\hat{v}) - f_t(w^*)\big) \\
&\le \sum_{t=1}^T f_t(w_t) - f_t\big(\gamma\hat{v} + (1-\gamma)w^*\big) + \gamma\log\frac{\langle w_t^*, w^*\rangle}{\langle w_t^*, \hat{v}\rangle} \\
&\le \frac{D_{KL}(\gamma\hat{v} + (1-\gamma)w^*\|\hat{v})}{\eta} + \frac{\eta}{\gamma^2}\sum_{t=1}^T \frac{1}{(\mathrm{vol}(C_t))^2} + \gamma\sum_{t=1}^T \log\frac{1}{\mathrm{vol}(C_t)} \\
&\le \frac{D_{KL}(w^*\|\hat{v})}{\eta} + \frac{\eta}{\gamma^2}\sum_{t=1}^T \frac{1}{(\mathrm{vol}(C_t))^2} + \gamma\sum_{t=1}^T \log\frac{1}{\mathrm{vol}(C_t)}
\end{aligned}$$

Since the regret is sublinear in T, this result satisfies our requirement for attaining asymptotic improvement over single-task learning via Theorem 3.1. However, there are several aspects of this bound that warrant some discussion; we take them up after the following illustration of the update.
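Below is a minimal numerical sketch of the update in step 6 of Algorithm 2 on a small discretization, solving the KL-regularized problem under the constraint w ≥ γv̂ with scipy; the solver choice, the small additive constant inside the log, and the toy data are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def ftrl_update(indicators, v_hat, eta, gamma):
    """One step of Algorithm 2: argmin over {||w||_1 = 1, w >= gamma * v_hat} of
    KL(w || v_hat) - eta * sum_s log<w*_s, w>, over cells of the discretization."""
    v_hat = np.asarray(v_hat, dtype=float)
    W = np.asarray(indicators, dtype=float)        # rows are the 0/1 vectors w*_s

    def objective(w):
        kl = np.sum(w * np.log(w / v_hat))
        return kl - eta * np.sum(np.log(W @ w + 1e-12))

    res = minimize(objective, v_hat.copy(), method="SLSQP",
                   bounds=[(gamma * vh, 1.0) for vh in v_hat],   # enforce w >= gamma * v_hat
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x

# Usage on a 4-cell discretization where both tasks' balls so far cover only cell 1:
v_hat = np.array([0.2, 0.2, 0.2, 0.4])
indicators = [np.array([0, 1, 0, 0]), np.array([0, 1, 0, 0])]
print(ftrl_update(indicators, v_hat, eta=0.5, gamma=0.05))
# Mass concentrates on cell 1 while every cell keeps at least gamma * v_hat[D].
```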
The first aspect is the rate of T^{3/4}, which is worse than the standard √T rate and certainly the log T regret attainable for exp-concave functions. However, the functions we face are (a) non-Lipschitz and (b) over a domain that has dimensionality Ω(T); both violate conditions for good rates in online convex optimization [28, 43], making our problem much more difficult. A more salient aspect is the dependence on B² = D_KL(w*‖v̂), effectively the negative entropy of the optimal initialization. This quantity is in principle unbounded, but it is analogous to standard online convex optimization bounds that depend on the norm of the optimum, which e.g. in the Euclidean case are also unbounded. In our case, if the optimal distribution is highly concentrated on a very small subset of the space it will be difficult to compete with. Note that our setting of η depends on knowing or guessing B; this is also standard but is certainly a target for future work to address. For example, past work on parameter-free algorithms has solutions for optimization over the simplex [38]; however, it is unclear whether this is straightforward to do while preserving the property given by Lemma 3.1 allowing us to implicitly work with an unknown discretization. A more reasonable approach may be to compete only with smooth measures that assign probability at most κ vol(D) to any subset D ⊂ C for some constant κ ≥ 1; in this case we will simply have B bounded by log κ. A final issue is the dependence on G, which is bounded by the reciprocal of the smallest volume vol(C_t), which in the dispersed case is roughly O(m^{βd}); this means that the task-averaged regret will have a term that, while decreasing as we see additional tasks, is increasing in the number of within-task iterations and the dispersion parameter, which is counter-intuitive. It also does so exponentially in the dimension. Note that in the common algorithm configuration setting of β = 1/2 and d = 1 this will simply mean that for each task we suffer an extra o_T(1) loss at each within-task round, a quantity which vanishes asymptotically.

3.2 Meta-learning the step-size

In addition to learning the initialization, Theorem 3.1 requires learning the task-similarity to set the within-task step-size λ > 0. This involves optimizing functions of the form h_t(v) = v + f_t(w_t)/v. Since we know that the measures w_t are lower-bounded in terms of γ, we can apply a previous result [29] that solves this by running the EWOO algorithm [28] on the modified sequence $v + \frac{f_t(w_t) + \epsilon^2}{v}$:

Corollary 3.1. For any ε > 0, running the EWOO algorithm on the modified sequence $v + \frac{f_t(w_t) + \epsilon^2}{v}$ over the domain $[\epsilon, \sqrt{D^2 - \log\gamma + \epsilon^2}]$, where $D^2 \ge \frac{1}{T}\sum_{t=1}^T \log\frac{1}{\mathrm{vol}(C_t)}$, attains regret

$$\epsilon^2\min\left\{\frac{1}{v},\frac{1}{\epsilon}\right\}T + 2\max\left\{\frac{D^2 - \log\gamma}{\epsilon^2},\ 1\right\}(1 + \log(T+1)) \qquad (9)$$

on the original sequence h_t(v) = v + f_t(w_t)/v for all v > 0. Setting ε = 1/⁴√T gives a guarantee of the form O((min{1/v*, ⁴√T})√T).

Note this rate might be improvable by using the fact that v* is lower-bounded due to the γ-constraint; however, we do not focus on this since this component is not the dominant term in the regret. In fact, because of this we can adapt a related method that simply runs follow-the-leader (FTL) on the same modified sequence [29] without affecting the dominant terms in the regret:

Corollary 3.2.
For any ε > 0, running the FTL algorithm on the modified sequence $v + \frac{f_t(w_t) + \epsilon^2}{v}$ over the domain $[\epsilon, \sqrt{D^2 - \log\gamma + \epsilon^2}]$, where $D^2 \ge \frac{1}{T}\sum_{t=1}^T \log\frac{1}{\mathrm{vol}(C_t)}$, attains regret

$$\epsilon^2\min\left\{\frac{1}{v},\frac{1}{\epsilon}\right\}T + 2\sqrt{D^2 - \log\gamma}\,\max\left\{\frac{(D^2 - \log\gamma)^{3/2}}{\epsilon^3},\ 1\right\}(1 + \log(T+1)) \qquad (10)$$

on the original sequence h_t(v) = v + f_t(w_t)/v for all v > 0. Setting ε = 1/⁵√T gives a guarantee of the form O((min{1/v*, ⁵√T})T^{3/5}).

The alternatives are described in pseudocode at the bottom of Algorithm 3; while the guarantee of the FTL-based approach is worse, it is almost as simple to compute as the task-similarity and does not require integration, making it easier to implement (a small sketch of this update appears at the end of this section).

Algorithm 3 Meta-learning the parameters of the exponential forecaster (Algorithm 1). Recall that p^{(t)} refers to the time-t discretization of the measure p : C → R_{≥0} (c.f. Section 3.1).
1: Input: domain C ⊂ R^d, dispersion β > 0, step-size η > 0, constraint parameter γ ∈ [0, 1], offset parameter ε > 0, domain parameter D > 0.
2: Initialize w_1 to the uniform measure on C and set $\lambda_1 = \frac{\epsilon + \sqrt{D^2 + \epsilon^2 - \log\gamma}}{2\sqrt{m}}$.
3: for task t = 1, 2, . . . , T do
4:   Run Algorithm 1 with initialization w_t and step-size λ_t and obtain the task-t optimum ρ*_t ∈ C.
5:   Set w*_t = 1_{B(ρ*_t, m^{−β})}, the function that is 1 on the m^{−β}-ball around ρ*_t and 0 elsewhere.
6:   Set w_{t+1} via $w_{t+1}^{(t)} = \arg\min_{\|w\|_1 = 1,\ w \ge \gamma\hat{v}^{(t)}} D_{KL}(w\|\hat{v}^{(t)}) - \eta\sum_{s\le t}\log\langle w_s^{*(t)}, w\rangle$.
7:   if using EWOO then
8:     Define $\mu_t(x) = \exp\left(-\alpha\left(tx + \frac{t\epsilon^2 - \sum_{s\le t}\log\langle w_s^{*(s)}, w_s^{(s)}\rangle}{x}\right)\right)$ for $\alpha = \frac{1}{D}\min\left\{\frac{\epsilon^2}{D^2}, 1\right\}$.
9:     Set $\lambda_{t+1} = \frac{\int_\epsilon^{\sqrt{D^2 + \epsilon^2 - \log\gamma}} x\,\mu_t(x)\,dx}{\sqrt{m}\int_\epsilon^{\sqrt{D^2 + \epsilon^2 - \log\gamma}} \mu_t(x)\,dx}$.
10:  else
11:    Set $\lambda_{t+1} = \sqrt{\frac{1}{mt}\sum_{s\le t}\left(\epsilon^2 - \log\langle w_s^{*(s)}, w_s^{(s)}\rangle\right)}$.

3.3 Putting the two together

We now combine our methods for the initialization and step-size in Algorithm 3 to meta-learn the parameters of the exponential forecaster. This yields a task-averaged regret bound via Theorem 3.1:

Theorem 3.3. Define $B^2 = D_{KL}(w^*\|\hat{v})$, $G^2 = \frac{1}{T}\sum_{t=1}^T \frac{1}{(\mathrm{vol}(C_t))^2}$, and $D^2 \ge \frac{1}{T}\sum_{t=1}^T \log\frac{1}{\mathrm{vol}(C_t)} = O(\beta d\log m)$. Then Algorithm 3, with η, γ set as in Theorem 3.2 and ε = 1/⁴√T (if using EWOO) or 1/⁵√T (otherwise), yields task-averaged regret bounded by

$$\left(\tilde{O}\left(\min\left\{\frac{1}{V}, \sqrt[4]{T}\right\}\frac{1}{\sqrt{T}} + \min\left\{\frac{\sqrt{BG}}{V\sqrt[4]{T}},\ \frac{(BG)^{1/4}}{\sqrt[8]{T}}\right\}\right) + 2V\right)\sqrt{m} + g(m) \qquad (11)$$

As in past meta-learning work this achieves the goal of adapting to the task-similarity, attaining asymptotic average regret of 2V√m + Õ(m^{1−β}), where we substitute for the dispersion term g and V² is the task-similarity encoding the average probability mass assigned to the different task balls by the optimal initialization. We include the minimum of two rates in the bound: the 1/⁴√T rate applies if the task-similarity is a constant Θ_T(1) and the 1/⁸√T rate if it is extremely small. As discussed, this reflects the difficulty of our meta-problem, in which we are optimizing non-smooth functions over distributions; in contrast, past meta-procedures take advantage of nice properties of Bregman divergences to obtain faster rates [29].
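The FTL variant of the step-size update (the last line of Algorithm 3, as reconstructed above) is just a running average followed by a square root; the sketch below implements that reading and should be treated as an illustrative assumption rather than the paper's exact rule.

```python
import numpy as np

def ftl_step_size(overlaps, eps, m):
    """Follow-the-leader step-size for the next task.

    overlaps: the values <w*_s, w_s> for tasks seen so far, i.e. the probability mass the
              task-s initialization placed on the ball around that task's optimum, so that
              f_s(w_s) = -log overlaps[s].
    Returns lambda_{t+1} = v_{t+1} / sqrt(m), where v_{t+1} minimizes
    sum_{s<=t} [ v + (f_s(w_s) + eps^2) / v ].
    """
    f = -np.log(np.asarray(overlaps, dtype=float))
    v_next = np.sqrt(np.mean(f + eps ** 2))   # argmin_v of t*v + (sum_s (f_s + eps^2))/v
    return v_next / np.sqrt(m)

# After three tasks whose initializations put mass 0.3, 0.5 and 0.4 on the right balls:
print(ftl_step_size([0.3, 0.5, 0.4], eps=0.1, m=100))
```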
4 Meta-learning for data-driven algorithm design

We demonstrate the utility of our bounds in a series of applications across data-driven algorithm design and robust learning. This section focuses on the former and demonstrates how our results imply guarantees for meta-learning the tuning of solvers for difficult combinatorial problems. We also demonstrate the practical utility of our approach for tuning clustering on real and synthetic datasets.

4.1 Instantiations for tuning combinatorial optimization algorithms

Algorithm configuration for combinatorial optimization algorithms involves learning algorithm parameters from multiple instances of combinatorial problems [27, 12, 4]. For well-known problems like MWIS (maximum weighted independent set), IQP (integer quadratic programming) and mechanism design for auctions, the algorithmic performance on a fixed instance is typically a piecewise-Lipschitz function of the algorithm parameters. Prior work has looked at learning these parameters in the distributional setting (i.e. assuming i.i.d. draws of problem instances) [12] or the online setting where the problem instances may be adversarially drawn [11, 10]. Instantiating our results for these problems instead provides upper bounds for much more realistic settings where different tasks may be related, and our bounds improve with this relatedness. We demonstrate how to apply our results to combinatorial problems under mild smoothness assumptions. The key idea is to show that if inputs come from a smooth distribution, the performance is dispersed (as a sequence of functions in the algorithm parameters). We demonstrate this for the greedy algorithm for the knapsack problem and for initialization in k-center clustering. Similar results may be obtained for other problems, e.g. MWIS, IQP, and auction design (c.f. Appendix C).

Greedy knapsack. Knapsack is a well-known NP-complete problem. We are given a knapsack with capacity cap and items i ∈ [m] with sizes w_i and values v_i. The goal is to select a subset S of items to add to the knapsack such that Σ_{i∈S} w_i ≤ cap while maximizing the total value Σ_{i∈S} v_i of the selected items. The greedy heuristic that adds items in decreasing order of v_i/w_i gives a 2-approximation. We consider a generalization that uses v_i/w_i^ρ, proposed by [27], for ρ ∈ [0, 10]. For example, for the value-weight pairs {(0.99, 1), (0.99, 1), (1.01, 1.01)} and capacity cap = 2, the classic heuristic ρ = 1 gives value 1.01 but using ρ = 3 gives the optimal value 1.98 (a code sketch of this heuristic on the same example appears at the end of this subsection). We can learn the optimal ρ from similar tasks, and obtain the following formal guarantee (proof in Appendix C).

Theorem 4.1. Consider instances of the knapsack problem given by bounded weights w_{i,j} ∈ [1, C] and κ-bounded independent values v_{i,j} ∈ [0, 1] for i ∈ [m], j ∈ [T]. Then the task-averaged regret for learning the parameter ρ of the greedy heuristic family above is o_T(1) + 2V√m + Õ(√m).

k-center clustering. We consider the α-Lloyd's algorithm family introduced in [15]. In the seeding phase, each point x is sampled with probability proportional to min_{c∈C} d(x, c)^α, where d(·,·) is the distance metric and C is the set of centers chosen so far. The family contains an algorithm for each α ∈ [0, ∞), and includes popular clustering heuristics like randomly initialized k-means (α = 0), k-means++ (α = 2) and farthest-first traversal (α = ∞). Performance is measured using the Hamming distance to the best clustering and is a piecewise-constant function of α. Our result can be instantiated for this problem even without smoothness by leveraging the randomness of the clustering algorithm.

Theorem 4.2. Consider instances of the k-center clustering problem on n points, with Hamming loss l_{i,j} for i ∈ [m], j ∈ [T] against some (unknown) ground truth. Then the task-averaged regret for learning the parameter α of the α-Lloyd's algorithm family [15] is o_T(1) + 2V√m + Õ(√m).
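The parameterized greedy heuristic is easy to state in code. The sketch below (with illustrative names) also reproduces the worked example above, where ρ = 1 packs value 1.01 while ρ = 3 attains the optimum 1.98.

```python
def greedy_knapsack_value(values, weights, cap, rho):
    """Greedy heuristic family of [27]: consider items in decreasing order of
    v_i / w_i**rho, adding each item that still fits. Returns the packed value."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i] ** rho, reverse=True)
    total, remaining = 0.0, cap
    for i in order:
        if weights[i] <= remaining:
            remaining -= weights[i]
            total += values[i]
    return total

# Worked example: value-weight pairs {(0.99, 1), (0.99, 1), (1.01, 1.01)}, cap = 2.
values, weights = [0.99, 0.99, 1.01], [1.0, 1.0, 1.01]
print(greedy_knapsack_value(values, weights, cap=2.0, rho=1))  # 1.01
print(greedy_knapsack_value(values, weights, cap=2.0, rho=3))  # 1.98 (optimal)
```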
4.2 Experiments for greedy knapsack and k-center clustering

Next we demonstrate our meta-initialization algorithm empirically on knapsack and k-center clustering. We design experiments on real and simulated data that show the usefulness of our techniques in learning a sequence of piecewise-Lipschitz functions. For knapsack, we generate a synthetic dataset of instances as follows. For each problem instance of each task, we have cap = 100 and m = 50 items. We have 10 heavy items with w_i ∼ N(27, 0.5) and v_i ∼ N(27, 0.5), and 40 items with w_i ∼ N(19 + w_t, 0.5) and v_i ∼ N(18, 0.5), where w_t ∈ [0, 2] is task-dependent. We also consider the α-Lloyd's algorithm family introduced in [15]. The performance of the algorithm is measured using the Hamming loss relative to the best clustering and is a piecewise-constant function of α. We can compute the pieces for α ∈ [0, 10] by iteratively computing the subset of parameter values for which a candidate point can be the next center. We use the small split of the Omniglot dataset [33], and create clustering tasks by drawing random samples consisting of five characters each, where four characters are constant throughout. We also create a Gaussian mixture binary classification dataset where each class is a 2D diagonal Gaussian distribution consisting of 100 points, with variances σ and 2σ and centers (0, 0) and (dσ, 0). We pick d ∈ [2, 3] to create different tasks. We learn using 30 instances each of 10 training tasks and evaluate the average loss over 5 test tasks. We use 100 trials to average over the randomization of the clustering algorithm and the exponential forecaster algorithm. We perform meta-initialization with parameters γ = η = 0.01 (no hyperparameter search performed). The step-size is set to minimize the regret term in Theorem 2.1, and is not meta-learned. The relative improvement in task-averaged regret due to meta-learning in our formal guarantees depends on the task-similarity V and how it compares to the dispersion-related Õ(m^{1−β}) term, and can be significant when the latter is small. Our results in Table 1 show that meta-learning an initialization, i.e. a distribution over the algorithm parameter, for the exponential forecaster yields improved performance on each dataset. We observe this for both the one-shot and five-shot settings, i.e. when the number of within-task iterations of the test task is one and five, respectively. The benefit of meta-learning is most pronounced for the Gaussian mixture (well-dispersed and similar tasks), and gains for Omniglot may increase with more tasks (dispersed but less similar tasks). For knapsack, the relative gains are smaller (similar tasks, but less dispersed). See Appendix D for further experiments.

Table 1: Effect of meta-initialization on few-shot learning of algorithmic parameters. Performance is reported as a fraction of the average value (Hamming accuracy, or knapsack value) of the offline optimum.

| | Omniglot One-shot | Omniglot Five-shot | Gaussian Mixture One-shot | Gaussian Mixture Five-shot | Knapsack One-shot | Knapsack Five-shot |
|---|---|---|---|---|---|---|
| Single task | 88.67 ± 0.47% | 95.02 ± 0.19% | 90.10 ± 1.10% | 91.43 ± 0.44% | 84.74 ± 0.29% | 98.89 ± 0.17% |
| Meta-initialized | 89.65 ± 0.49% | 96.05 ± 0.15% | 95.76 ± 0.60% | 96.39 ± 0.27% | 85.66 ± 0.57% | 99.12 ± 0.15% |

5 Robust online meta-learning

In online learning, we aim to minimize a sequence of losses and want to perform well relative to the optimum in hindsight. It is possible for the observed losses to be noisy on some inputs, either naturally or due to an adversary; such unreliable data is common when meta-learning from diverse sources. We explore the conditions under which robustness to this adversarial influence (i.e. outlier injection) is possible.
Setup. At round i, we play x_i and observe the perturbed loss ℓ̃_i : X → [0, 1], which is set by the adversary by modifying the true loss ℓ_i : X → [0, 1] using an attack function a_i : X → [0, 1] such that ℓ̃_i = ℓ_i + a_i (and which may be non-Lipschitz); we suffer both the perturbed loss ℓ̃_i(x_i) and the true loss ℓ_i(x_i). We seek to minimize regret relative to the best fixed action in hindsight, i.e.

$$\tilde{R}_m = \sum_{i=1}^m \tilde{\ell}_i(x_i) - \min_{x\in X}\sum_{i=1}^m \tilde{\ell}_i(x)$$

for the perturbed loss and

$$R_m = \sum_{i=1}^m \ell_i(x_i) - \min_{x\in X}\sum_{i=1}^m \ell_i(x)$$

for the true loss. No regret can be achieved provided the adversarially perturbed losses are sufficiently smooth, i.e. satisfy β-dispersion for some β > 0, as this corresponds to online optimization of the perturbed loss function. We can show this for both the perturbed and the true loss. The perturbed-loss guarantee is immediate from standard results on online learning of piecewise-Lipschitz functions [11, 10]. For the true loss, we can achieve no regret if the adversarial perturbation a_i is limited to small balls and the centers of the balls are dispersed, which we capture using the following definition.

Definition 5.1 (δ-bounded, β_a-dispersed attack). An attack function a_i is δ-bounded if there exists a ball B(x_a, δ) of radius δ such that a_i(x) = 0 for each x ∈ X \ B(x_a, δ); x_a is called a center c_{a_i} for the attack a_i. A sequence of attack functions a_1, . . . , a_m is said to be β_a-dispersed if the positions of the attack centers x_a are dispersed, i.e. for all m and for all ε ≥ m^{−β_a},

$$\mathbb{E}\left[\max_{\substack{x, x' \in X \\ x \in B(x', \epsilon)}} \big|\{i \in [m] \mid x = c_{a_i}\}\big|\right] \le \tilde{O}(\epsilon m).$$

Theorem 5.1. Given a sequence of β-dispersed adversarially perturbed losses ℓ̃_i = ℓ_i + a_i, where ℓ̃_i, ℓ_i, a_i are piecewise L-Lipschitz functions X → [0, 1] for i = 1, . . . , m and X ⊂ R^d, the exponential forecaster algorithm has

$$\mathbb{E}[\tilde{R}_m] = O\left(m\lambda + \frac{\log(1/Z)}{\lambda} + (L+1)m^{1-\beta}\right)$$

(with Z as in Theorem 2.1). If in addition a_i is an m^{−β_a}-bounded, β_a-dispersed attack, then

$$\mathbb{E}[R_m] = O\left(m\lambda + \frac{\log(1/Z)}{\lambda} + (L+1)m^{1-\min\{\beta, \beta_a\}}\right).$$

Together with Theorem 3.3, this implies no-regret meta-learning in the presence of dispersed adversaries, in particular the occurrence of unreliable data in small dispersed parts of the domain. We also show a lower bound which establishes that our bounds are essentially optimal in the attack dispersion.

Theorem 5.2. There exist sequences of piecewise L-Lipschitz functions ℓ̃_i, ℓ_i, a_i : [0, 1] → [0, 1] for i = 1, . . . , m such that for any online algorithm:
1. ℓ̃_i is β-dispersed and E[R̃_m] = Ω(m^{1−β}),
2. ℓ̃_i is β-dispersed, a_i is m^{−β_a}-bounded and β_a-dispersed, and E[R_m] = Ω(m^{1−min{β, β_a}}).

6 Conclusion

In this paper we studied the initialization-based meta-learning of piecewise-Lipschitz functions, demonstrating how online convex optimization over an adaptive discretization can find an initialization that improves the performance of the exponential forecaster across tasks, assuming the tasks have related optima. We then applied this result in two settings: online configuration of clustering algorithms and adversarial robustness in online learning. For the latter we introduced a dispersion-based understanding of robustness that we believe to be of independent interest. In addition, there are further interesting applications of our work to other algorithm configuration problems.
Acknowledgments

This material is based on work supported by the National Science Foundation under grants CCF-1535967, CCF-1910321, IIS-1618714, IIS-1901403, SES-1919453, IIS-1705121, IIS-1838017, IIS-2046613 and IIS-2112471; the Defense Advanced Research Projects Agency under cooperative agreements HR00112020003 and FA875017C0141; an AWS Machine Learning Research Award; an Amazon Research Award; a Bloomberg Research Grant; a Microsoft Research Faculty Fellowship; an Amazon Web Services Award; a Facebook Faculty Research Award; funding from Booz Allen Hamilton Inc.; a Block Center Grant; and a Two Sigma Fellowship Award. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of any of these funding agencies.

References

[1] Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning, pages 111-119. PMLR, 2019.
[2] Pierre Alquier, The Tien Mai, and Massimiliano Pontil. Regret bounds for lifelong learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
[3] Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. How to backdoor federated learning. In International Conference on Artificial Intelligence and Statistics, pages 2938-2948. PMLR, 2020.
[4] Maria-Florina Balcan. Data-driven algorithm design. In Beyond Worst Case Analysis of Algorithms, T. Roughgarden (Ed.). Cambridge University Press, 2020.
[5] Maria-Florina Balcan, Avrim Blum, Dravyansh Sharma, and Hongyang Zhang. On the power of abstention and data-driven decision making for adversarial robustness. arXiv preprint arXiv:2010.06154, 2020.
[6] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Efficient representations for lifelong learning and autoencoding. In Proceedings of the 28th Annual Conference on Learning Theory, 2015.
[7] Maria-Florina Balcan, Travis Dick, and Manuel Lang. Learning to link. In International Conference on Learning Representations, 2019.
[8] Maria-Florina Balcan, Travis Dick, and Wesley Pegden. Semi-bandit optimization in the dispersed setting. In Conference on Uncertainty in Artificial Intelligence, pages 909-918. PMLR, 2020.
[9] Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik. Learning to branch. In International Conference on Machine Learning, pages 344-353. PMLR, 2018.
[10] Maria-Florina Balcan, Travis Dick, and Dravyansh Sharma. Learning piecewise Lipschitz functions in changing environments. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pages 3567-3577, 2020.
[11] Maria-Florina Balcan, Travis Dick, and Ellen Vitercik. Dispersion for data-driven algorithm design, online learning, and private optimization. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 603-614, 2018.
[12] Maria-Florina Balcan, Vaishnavh Nagarajan, Ellen Vitercik, and Colin White. Learning-theoretic foundations of algorithm configuration for combinatorial partitioning problems. In Annual Conference on Learning Theory, pages 213-274, 2017.
[13] Maria-Florina Balcan, Tuomas Sandholm, and Ellen Vitercik. A general theory of sample complexity for multi-item profit maximization. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 173-174, 2018.
[14] Maria-Florina Balcan and Dravyansh Sharma.
Data-driven algorithms for limited labeled data learning. arXiv preprint arXiv:2103.10547, 2021.
[15] Maria-Florina F Balcan, Travis Dick, and Colin White. Data-driven clustering via parameterized Lloyd's families. Advances in Neural Information Processing Systems, 31:10641-10651, 2018.
[16] Avrim Blum. Technical perspective: Algorithm selection as a learning problem. Communications of the ACM, 63(6):86-86, 2020.
[17] Wieland Brendel, Jonas Rauber, Alexey Kurakin, Nicolas Papernot, Behar Veliqi, Sharada P Mohanty, Florian Laurent, Marcel Salathé, Matthias Bethge, Yaodong Yu, et al. Adversarial vision challenge. In The NeurIPS '18 Competition, pages 129-153. Springer, 2020.
[18] Fei Chen, Zhenhua Dong, Zhenguo Li, and Xiuqiang He. Federated meta-learning for recommendation. arXiv, 2018.
[19] Vincent Cohen-Addad and Varun Kanade. Online optimization of smoothed piecewise constant functions. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
[20] Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In Proceedings of the 36th International Conference on Machine Learning, 2019.
[21] Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Online-within-online meta-learning. In Advances in Neural Information Processing Systems, 2019.
[22] Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, and Qi Lei. Few-shot learning via learning the representation, provably.
[23] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in Neural Information Processing Systems, 2017.
[24] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020.
[25] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
[26] Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115-1145, 1995.
[27] Rishi Gupta and Tim Roughgarden. A PAC approach to application-specific algorithm selection. SIAM Journal on Computing, 46(3):992-1017, 2017.
[28] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169-192, 2007.
[29] Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, 2019.
[30] Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
[31] Mikhail Khodak, Renbo Tu, Tian Li, Liam Li, Maria-Florina Balcan, Virginia Smith, and Ameet Talwalkar. Federated hyperparameter tuning: Challenges, baselines, and connections to weight-sharing. arXiv, 2021.
[32] Weihao Kong, Raghav Somani, Sham Kakade, and Sewoong Oh. Robust meta-learning for mixed linear regression with small batches. Advances in Neural Information Processing Systems, 33, 2020.
[33] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum.
Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.
[34] Jeffrey Li, Mikhail Khodak, Sebastian Caldas, and Ameet Talwalkar. Differentially private meta-learning. In Proceedings of the 8th International Conference on Learning Representations, 2020.
[35] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(1):2853-2884, 2016.
[36] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427-436, 2015.
[37] Francesco Orabona, Nicolo Cesa-Bianchi, and Claudio Gentile. Beyond logarithmic bounds in online learning. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, 2012.
[38] Francesco Orabona and David Pal. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems, 2016.
[39] Krishna Pillutla, Sham M Kakade, and Zaid Harchaoui. Robust aggregation for federated learning. arXiv preprint arXiv:1912.13445, 2019.
[40] Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In Proceedings of the 8th International Conference on Learning Representations, 2020.
[41] Alon Resler and Yishay Mansour. Adversarial online learning with noise. In International Conference on Machine Learning, pages 5429-5437. PMLR, 2019.
[42] Nikunj Saunshi, Yi Zhang, Mikhail Khodak, and Sanjeev Arora. A sample complexity separation between non-convex and convex meta-learning. In Proceedings of the 37th International Conference on Machine Learning, 2020.
[43] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
[44] Daniel A Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385-463, 2004.
[45] Sebastian Thrun and Lorien Pratt. Learning to Learn. Springer Science & Business Media, 1998.
[46] Vale Tolpegin, Stacey Truex, Mehmet Emre Gursoy, and Ling Liu. Data poisoning attacks against federated learning systems. In European Symposium on Research in Computer Security, pages 480-501. Springer, 2020.
[47] Nilesh Tripuraneni, Chi Jin, and Michael I. Jordan. Provable meta-learning of linear representations. In Proceedings of the 38th International Conference on Machine Learning, 2021.
[48] Xiang Wang, Shuai Yuan, Chenwei Wu, and Rong Ge. Guarantees for tuning the step size using a learning-to-learn approach. In Proceedings of the 38th International Conference on Machine Learning, 2021.
[49] Yufan Zhou, Zhenyi Wang, Jiayi Xian, Changyou Chen, and Jinhui Xu. Meta-learning with neural tangent kernels. In Proceedings of the 9th International Conference on Learning Representations, 2021.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] Alongside contributions in context, e.g. the end of Section 3.1.
   (c) Did you discuss any potential negative societal impacts of your work? [No] Our concern w.r.t. the negative societal impact of this theoretical work is limited to standard risks associated with ML, e.g.
for privacy or fair treatment.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Supplemental material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] Experiments run on a personal computer (16GB, 2.3 GHz Dual-Core).
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]