# metalearning_with_neural_bandit_scheduler__d55d81c8.pdf Meta-Learning with Neural Bandit Scheduler Yunzhe Qi University of Illinois at Urbana-Champaign Champaign, IL yunzheq2@illinois.edu Yikun Ban University of Illinois at Urbana-Champaign Champaign, IL yikunb2@illinois.edu Tianxin Wei University of Illinois at Urbana-Champaign Champaign, IL twei10@illinois.edu Jiaru Zou University of Illinois at Urbana-Champaign Champaign, IL jiaruz2@illinois.edu Huaxiu Yao University of North Carolina at Chapel Hill Chapel Hill, NC huaxiu@cs.unc.edu Jingrui He University of Illinois at Urbana-Champaign Champaign, IL jingrui@illinois.edu Meta-learning has been proven an effective learning paradigm for training machine learning models with good generalization ability. Apart from the common practice of uniformly sampling the meta-training tasks, existing methods working on task scheduling strategies are mainly based on pre-defined sampling protocols or the assumed task-model correlations, and greedily make scheduling decisions, which can lead to sub-optimal performance bottlenecks of the meta-model. In this paper, we propose a novel task scheduling framework under Contextual Bandits settings, named BASS, which directly optimizes the task scheduling strategy based on the status of the meta-model. By balancing the exploitation and exploration in metalearning task scheduling, BASS can help tackle the challenge of limited knowledge about the task distribution during the early stage of meta-training, while simultaneously exploring potential benefits for forthcoming meta-training iterations through an adaptive exploration strategy. Theoretical analysis and extensive experiments are presented to show the effectiveness of our proposed framework. 1 Introduction Meta-learning algorithms [19] have been receiving increased attention due to their strong generalization performance across a wide range of tasks [32, 20]. Most existing meta-learning methods often assume a uniform distribution when drawing meta-training tasks, treating each task as equally important [19]. However, this assumption can possibly fail in real-world scenarios. For example, during data collection, candidate training tasks can be subject to noise perturbation, leading to performance bottlenecks in the meta-model if noisy and clean tasks are treated on an equal footing [43, 45]. In addition, some tasks can pose greater challenges for the meta-model to adapt to, necessitating a more flexible allocation of computational resources during the meta-training process. Furthermore, the task distribution may be skewed, with "tail" tasks receiving inadequate attention under uniform sampling. As a result, the task scheduling methods [14, 51, 31, 24, 29] have been proposed for a refined meta-training strategy. Equal Contribution. 37th Conference on Neural Information Processing Systems (Neur IPS 2023). Figure 1: Exploitation and exploration in meta-training iteration k [K]. E.g., tasks 1, 2, 3 are "frequent" tasks in a skewed task distribution, and tasks 4, 5, 6 are from task distribution "tail". Exploring the "tail" tasks can help improve the meta-model generalization performance (Subsec. 6.2). Existing scheduling approaches mainly aim to improve meta-training strategies based on various pre-defined criteria and assumptions, including both non-adaptive and adaptive methods. Among them, non-adaptive methods working on the gradient that updates trainable parameters [31, 13] have proven their superiority over meta-training methods with uniform sampling. 
But they are unable to adjust the task scheduling strategy adaptively based on the status of the meta-model. On the other hand, adaptive methods aim to schedule the tasks based on Curriculum Learning [14, 52, 50] or adaptively sample the tasks based on the task adaptation difficulty (loss) [51]. However, the existing approaches are all greedy algorithms, which means that they tend to make locally optimal decisions based on the current knowledge (i.e., exploitation only). As the learner only has limited knowledge regarding the task and data distribution at the early stage of meta-training, the greedy strategy can lead to the sub-optimal meta-model due to multiple reasons, such as being misled by the noisy tasks or affected by the skewness of the task distribution. Here, we use Figure 2 to illustrate an example where the greedy approach (only exploitation) may be trapped in sub-optimal solutions. One intuitive solution for the aforementioned problems is tackling the exploitation-exploration dilemma [4] in meta-training process: balancing the exploitation of current knowledge of selected tasks, and the exploration of under-explored tasks for potential long-term benefits. Therefore, unlike existing approaches with greedy strategies, we propose a novel task scheduling framework under the Contextual Bandits settings [15, 28, 4, 1, 54, 53, 2], named BAndit Ta Sk Scheduler (BASS). In each meta-training iteration, we formulate each candidate meta-training task as an arm under the contextual bandit settings, and the corresponding arm reward will naturally be the meta-model generalization ability score after including this candidate task (arm) into the meta-training process. To achieve this objective, BASS directly learns the mapping from the meta-parameters to the meta-model generalization ability score, instead of depending on the hand-crafted criteria or assumptions. This design also enables us to update meta-parameters and BASS simultaneously within one round of meta-optimization. In particular, instead of greedily scheduling the meta-training tasks solely based on the current knowledge (i.e., exploitation), BASS leverages an additional adaptive exploration module with two different exploration objectives to explore for unrevealed benefits. Our main contributions can be summarized as follows: [Problem and Proposed Framework] We are the first to formulate the meta-learning task scheduling problem under the Contextual Bandits settings, where we optimize the meta-model w.r.t. chosen task batches in each meta-training iteration. To tackle this problem, we propose a novel banditbased meta-learning task scheduling framework named BASS, which is model-agnostic and can be applied to various meta-learning frameworks. Different from existing works, BASS directly learns the relationship between the meta-model parameters and the meta-model generalization ability. In addition, instead of greedily exploiting the current knowledge as in existing works, BASS utilizes a novel exploration module to adaptively plan for the future meta-training iterations. [Theoretical Analysis] Under the general meta-learning settings as well as standard assumptions of over-parameterized neural networks and neural bandits, we derive the theoretical performance guarantee for the proposed BASS framework in terms of regret bound. [Experiments] To demonstrate the effectiveness of BASS, we compare our method against seven strong baselines on three real data sets, with different specifications. 
In addition, complementary experiments as well as a case study on Ensemble Inference are also provided to better understand the property and behavior of BASS. 2 Related Works In this section, we briefly review the related works of our proposed BASS framework from the aspects of task scheduling in meta-learning and contextual Bandits. Task Scheduling in Meta-Learning. By considering each meta-training task to be equally important, many existing works sample the training tasks uniformly from the given task distribution [19, 40]. Then, with fixed sampling strategies, [24, 35] propose to assign the task sampling probability based on the quantity of task information, and [31] utilizes a probabilistic sampling method based on class-pairs. Meanwhile, there are also works try to improve task-specific gradients for the randomly sampled tasks [13, 34]. On the other hand, Curriculum-based approaches [14, 52, 50] schedule the tasks based on the difficulty of task adaptation, and [51] propose to adaptively schedule the tasks based on the assumed correlation between the difficulty of meta-training tasks and the meta-model generalization ability. However, the existing works are all greedy approaches that solely focus on the instant benefits, which can lead to sub-optimal meta-models due to the potential performance bottleneck. Compared with existing works, our proposed BASS directly learns the mapping from the meta-parameters to the generalization loss with no pre-defined criteria. With the adaptive exploration strategy, our proposed BASS helps tackle the insufficient knowledge regarding the task distribution by balancing the exploitation and exploration, as well as focusing on the long-term effects. Contextual Bandits. Contextual bandits algorithms aim to solve the sequential decision making problem under the online learning settings, such as online recommendation [28, 49, 6]. Assuming that the mapping from arm contexts to rewards is linear, linear upper confidence bound (UCB) algorithms [15, 28, 4] are proposed to solve the exploitation-exploration dilemma. After kernel-based methods [46, 17] being applied to deal with the non-linear setting where the reward mapping is assumed to be a function in Reproducing Kernel Hilbert Space (RKHS), neural bandit algorithms [54, 53, 7, 5, 39, 8] are proposed to utilize neural networks for the reward and confidence bound estimation. In particular, neural bandit algorithms have demonstrated their superior performance over linear and kernel-based algorithms [54], thanks to the representation power of neural networks. Moreover, instead of using non-negative UCB, EE-Net [9, 10] achieves adaptive exploration by adopting an additional neural network to estimate the potential gain of reward estimations. In this paper, we model the task scheduling problem under the Contextual Bandit settings to balance the exploitation and exploration dilemma regarding the meta-task scheduling. 3 Problem Definition and Learning Objective Under the settings of few-shot supervised meta-learning [19], our goal is to train a meta-model F( ; Θ) with parameters Θ Rp that can generalize well to meta-testing tasks, where p is the number of trainable meta-parameters. The meta-model parameters are initialized as Θ(0). Note that in this work, the trainable parameters are all represented by column vectors, for the simplicity of notation. 
Following similar settings in [19, 51], in each training iteration k [K], we will receive a pool of candidate training tasks Ω(k) task = {Tk,i}i N where its cardinality |Ω(k) task| = Ntask and the index set N = {1, . . . , Ntask}. Given a task distribution P(T ), we draw each candidate task Tk,i from P(T ) to form the candidate pool, i.e., Tk,i P(T ). Each task Tk,i is also associated with a support data set Ds k,i and a query data set Dq k,i, where the samples (including their labels) of Ds k,i, Dq k,i are drawn from the corresponding task data distribution DTk,i of task Tk,i. Meta-training Process. Then, we need to choose a batch of B tasks Ωk = {Tk,bi}bi b Nk Ω(k) task for each meta-training iteration k [K], where b Nk N refers to the indices of chosen tasks. With Θ(k 1) being the meta-model parameters after completing the (k 1)-th meta-training iteration, for each chosen meta-training task Tk,bi,bi b Nk, we run Gradient Descent (GD) for J steps on its support set Ds k,bi to obtain the task-specific parameters Θ(J) k,bi . This refers to the inner-loop optimization. In this work, we consider a loss function L( ; ) that maps the meta-model parameters and the input sample (or sample set) to the loss value, with the range [0, 1] (e.g., the MSE loss on normalized user ratings [27]). Denoting Θ(0) k,bi = Θ(k 1) in meta-training iteration k, each GD step is represented as Θ(j) k,bi = Θ(j 1) k,bi η1 ΘL(Ds k,bi; Θ(j 1) k,bi ), j [J] (1) where η1 R+ is the learning rate of GD. For the simplicity of notation, we formulate the above J-iteration inner-loop optimization as an operator I( , ) : T Θ 7 Θ, such that Θ(J) k,bi = I(Tk,bi , Θ(k 1)). (2) Figure 2: In meta-training iteration k [K], the BASS framework overview. We only need one round of the optimization process (LHS of the figure) to update the meta-model and BASS. Then, we optimize the meta-parameters through the outer-loop optimization with query sets Θ(k) = Θ(k 1)[Ωk] = Θ(k 1) η2 Θ 1 Tk,bi Ωk L(Dq k,bi; Θ(J) k,bi ) (3) where η2 R+ is the learning rate. Here, the trained meta-model parameters Θ(K) is expected to minimize the generalization loss ET P(T ),x DT L x; I(T , Θ(K)) , where we let DT being the associated data distribution of task T . Evaluating Task Batch Benefits. Recall that in meta-training iteration k [K], the meta-model will be updated based on the chosen batch of B tasks (Eq. 3), and we will need to evaluate the benefit in terms of the whole task batch. By the task scheduling problem definition, the learner will select one task batch Ωk Ω(k) task based on its strategy, and the chosen task batch Ωk will be used for meta-training in this iteration k. Here, we define the reward of the corresponding task batch Ωk as h Θ(k 1)[Ωk] = 1 ET P(T ),x DT L x; I(T , Θ(k 1)[Ωk]) (4) where Θ(k 1)[Ωk] refer to the meta-parameters after adapting Θ(k 1) to task batch Ωk based on Eq. 3. For simplicity, we let h : Rp 7 R be the mapping function conditioned on distribution P(T ), which maps the trained meta-model parameters Θ(k 1)[Ωk] to task batch rewards. Learning Objective. Up to meta-training iteration k, we have Ω (K) = {Ω 1, . . . , Ω K} being a series of unknown optimal task batches that minimizes the generalization loss. Each optimal batch is Ω k = {Tk,i }i N k Ω(k) task, k [K] including a total of |Ω k| = B tasks, with the indices N k N. 
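For concreteness, below is a minimal PyTorch sketch of the inner-loop operator I(·, ·) in Eqs. 1-2 and the outer-loop update in Eq. 3. It is an illustrative reimplementation rather than the authors' code; it assumes PyTorch ≥ 2.0 (for torch.func.functional_call), a buffer-free meta-model, and that theta is a dict of leaf parameter tensors, and names such as inner_adapt and meta_update are placeholders.

```python
import torch
from torch.func import functional_call

def inner_adapt(model, theta, support_x, support_y, loss_fn, eta1, J):
    """Operator I(T, Theta): J gradient-descent steps on the support set (Eqs. 1-2).
    create_graph=True keeps the inner steps differentiable, so the outer loop
    (Eq. 3) can back-propagate through the adaptation."""
    adapted = dict(theta)
    for _ in range(J):
        loss = loss_fn(functional_call(model, adapted, (support_x,)), support_y)
        grads = torch.autograd.grad(loss, list(adapted.values()), create_graph=True)
        adapted = {name: p - eta1 * g
                   for (name, p), g in zip(adapted.items(), grads)}
    return adapted

def meta_update(model, theta, task_batch, loss_fn, eta1, eta2, J):
    """Outer-loop update over the chosen batch Omega_k (Eq. 3).
    Each element of task_batch is (support_x, support_y, query_x, query_y)."""
    outer_loss = 0.0
    for sx, sy, qx, qy in task_batch:
        adapted = inner_adapt(model, theta, sx, sy, loss_fn, eta1, J)
        outer_loss = outer_loss + loss_fn(functional_call(model, adapted, (qx,)), qy)
    outer_loss = outer_loss / len(task_batch)
    grads = torch.autograd.grad(outer_loss, list(theta.values()))
    return {name: (p - eta2 * g).detach().requires_grad_(True)
            for (name, p), g in zip(theta.items(), grads)}

# theta = {n: p.detach().clone().requires_grad_(True) for n, p in model.named_parameters()}
# theta = meta_update(model, theta, chosen_task_batch, loss_fn, eta1=0.01, eta2=0.001, J=5)
```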
The corresponding optimal meta-parameters are defined as Θ(K), = ET P(T ),x DT L x; I(T , Θ) , where W Rp refers to the reachable param- eter space, which contains the possible trained meta-parameter values, after observing the available candidate tasks {Ω(k) task}, k [K] and the randomly initialized meta-parameters Θ(0). Meanwhile, denoting our selection of the meta-training task batches as Ω(K) = {Ω1, . . . , ΩK}, we will have the corresponding trained meta-parameters Θ(K). Here, we can define the K-iteration regret as R(K) = ET P(T ),x DT L x; I(T , Θ(K)) L x; I(T , Θ(K), ) (5) which measures the difference of the generalization ability between the trained meta-parameters Θ(K) and the optimal meta-parameters Θ(K), , after K meta-training iterations. 4 Proposed Framework: BASS In this section, we introduce our proposed BASS framework, which simultaneously optimizes the meta-model and the task scheduling strategy on the fly. The pseudo-code is presented in Algo. 1, and the illustration for each meta-training iteration k [K] is shown in Figure 2. Algorithm 1 BAndit Ta Sk Scheduler (BASS) 1: Input: Task distribution P(T ). Iterations K. GD steps J. Number of chosen tasks B. Learning rates for meta-model η1, η2. Learning rates for BASS ηθ 1 , ηθ 2 . Exploration coefficient α (0, 1]. 2: Output: Trained meta-parameters Θ(K). 3: Initialization: Randomly initialized meta-model parameters Θ(0), and BASS parameters θ(0). 4: for each meta-training iteration k [K] do 5: Sample a pool of Ntask candidate meta-training tasks Ω(k) task from the task distribution P(T ). Task Scheduling 6: for For each candidate task Tk,i Ω(k) task do 7: Derive two arm context vectors χs k,i, χq k,i [Eq. 6]. 8: Calculate arm benefit score bsk,i = α f1( ; θ(k 1) 1 ) + f2( ; θ(k 1) 2 ) with BASS. [Eq. 7] 9: end for 10: From Ω(k) task, choose the top-B tasks Ωk Ω(k) task with the highest scores {bsk,i}i N . Parameter Updating 11: for each chosen training task Tk,bi Ωk do 12: Derive the corresponding rewards erk,i and exploration score eek,i. [Eq. 8-9] 13: end for 14: Update meta-parameters to Θ(k) based on chosen arms (tasks) Tk,bi Ωk. [Remark 2, Eq. 3] 15: Update BASS parameters to θ(k) with chosen arms. [Eq. 8-9, Subsec. 4.4] 16: end for Remark 1 (Task Scheduling with Contextual Bandits). By the problem definition, there are Nbatch = Ntask B candidate task batches in each meta-training iteration, which can be a large number. In this case, enumerating all possible task batches and estimating their rewards will be time consuming. Therefore, under the Contextual Bandits settings, BASS alternatively considers each candidate task Tk,i Ω(k) task as an arm, and directly chooses B arms as the meta-training tasks Ωk Ω(k) task. As a result, BASS can (1) reduce the arm space size from O(Nbatch) to O(Ntask), while (2) enjoy the performance guarantee (Section 5) in terms of regret bound (Eq. 5). Section Outline. Here, we will first present our definition of the arm contexts (Subsec. 4.1), whose formulation is challenging, because our settings are different from conventional Contextual Bandits where arm contexts are readily available from the environment. Then, applying two neural networks f1, f2 for exploitation and exploration respectively, we formulate the arm benefit score (Subsec. 4.2), which measures the benefit if we include the corresponding arm (task) into the current meta-training process. Next, we define the arm rewards and exploration scores as the labels for training f1, f2 respectively. 
In particular, to deal with the challenge of achieving exploration in task scheduling, we incorporate the information and the dynamics w.r.t. meta-optimizations, and formulate two separate exploration objectives for a refined exploration strategy (Subsec. 4.3). Finally, we update BASS with GD (Subsec. 4.4), and train the meta-model with chosen tasks. 4.1 Formulating Arm Contexts To encode the information from both the task side and the meta-model side, for each candidate task (i.e., arm) Tk,i Ω(k) task in meta-training iteration k [K], we formulate its arm contexts as the meta-parameters after task adaptations, denoted by χs k,i := Θ(J) k,i = I(Tk,i, Θ(k 1)); χq k,i := Θ(k 1)[Tk,i] = Θ(k 1) η2 ΘL(Dq k,i; Θ(J) k,i ) (6) where Θ(J) k,i are the task-specific parameters after adapting meta-parameters Θ(k 1) to task Tk,i with inner-loop optimization (Eqs. 1-2); while Θ(k 1)[Tk,i] refer to the meta-parameters after adapting the current Θ(k 1) to task Tk,i, with both inner-loop and outer-loop optimization (as in Eq. 3). In particular, we assign each arm Tk,i with two different arm contexts to model the dynamics of meta-parameters w.r.t. inner-loop optimization and outer-loop optimization respectively. For example, the variance of the corresponding data distribution DTk,i can be high. In this case, the support set Ds k,i and the query set Dq k,i will be considerably different, which tends to make the corresponding arm contexts χs k,i, χq k,i divergent. As a result, the gradient vectors θf1(χs k,i), θf1(χq k,i) will likely be distinct from each other. Alternatively, if the support set Ds k,i and the query set Dq k,i are not significantly distinct (the distance between χs k,i and χq k,i is also likely to be relatively small), these two gradient vectors tend to change dramatically when adapting to Tk,i. The reason is possibly that the exploitation model f1 is not well adapted to this task Tk,i. For both scenarios above, it can be beneficial to include more exploration for the task Tk,i, and the target is helping f1 better learn the reward for this task by actively acquiring the knowledge of it. And the two formulated arm contexts can provide important reference for our exploration module. Remark 2 (Recycling Arm Contexts). In order to derive the arm contexts (Eq. 6), the gradients for the outer-loop optimization ΘL(Dq k,bi; Θ(J) k,bi ),bi b Nk of the chosen arms Tk,bi Ωk are calculated. As a result, these gradients can be recycled to update the meta-model parameters based on Eq. 3 (line 15, Algo. 1), which helps reduce the computational cost when updating the meta-model. 4.2 Estimating Benefit Scores for Tasks To determine which arms (tasks) should be included to the meta-training iteration k, we formulate the arm benefit score estimation for each candidate arm Tk,i Ω(k) task. The estimated benefit score consists of two parts: (1) the estimated arm reward of choosing this task based on existing knowledge (i.e., exploitation); (2) and the exploration score for the future potential benefit (i.e., exploration). Inspired by recent advances in neural bandits [9], we introduce two separate neural networks, f1( ; θ1) and f2( ; θ2), to estimate the arm reward and exploration score respectively. The exploitation network f1( ; θ1) aims to learn the mapping h( ) from arm contexts (i.e., meta-parameters) to rewards, while the exploration network f2( ; θ2) aims to learn the uncertainty of reward estimations as the exploration criterion. Different from conventional bandit models, e.g. 
[9], that works on static arm contexts given by the environment, our design alternatively leverages the evolving information from both the task (arm) side and meta-parameters side, across meta-training iterations. In addition, we consider the dynamics of meta-optimizations for a more comprehensive modeling of the exploration aspect, and the details will be introduced later. Here, given a candidate arm Tk,i Ω(k) task, its estimated benefit score bsk,i is formulated as bsk,i = α brk,i + bek,i = α f1(χq k,i; θ(k 1) 1 ) + f2 [ θf1(χs k,i); θf1(χq k,i)]; θ(k 1) 2 where α (0, 1] is the exploration coefficient to balance exploitation and exploration. Notice that f2( ; θ2) will take the concatenated gradient of f1( ; θ1) w.r.t. both arm contexts χs k,i, χq k,i as the input, represented by [ θf1(χs k,i); θf1(χq k,i)]. And the output will be the exploration score estimation ˆek,i. To obtain θf1(χq k,i), we also calculate f1(χq k,i; θ1) and run the back-propagation. Afterwards, we choose the top-B arms with the highest estimated benefit scores bsk,i, i N, as the chosen task batch Ωk Ω(k) task (line 10, Algo. 1). Design Intuition. First, recent advances of neural Contextual Bandits [54, 9, 38] have shown that the uncertainty of reward estimations is directly related to the gradients of the estimation model. Therefore, we leverage an exploration module f2( ; θ2) to directly learn this unknown relationship. Second, since Ds k,i, Dq k,i are from the same data distribution DTk,i, if these two gradients θf1(χs k,i), θf1(χq k,i) are distinct, the reason can be: (1) the variance of the data distribution DTk,i is high (due to the potentially noisy or difficult task); or (2) the gradients of f1( ; θ1) tend to change significantly when adapting to task Tk,i. In both cases, it can be harder for f1( ; θ1) to accurately predict the arm reward rk,i, and the meta-model can fail to properly adapt to the task Tk,i. In this case, we apply the concatenated gradients w.r.t. both arm contexts as the input of f2( ; θ2), in order to provide the information for f2( ; θ2) to evaluate exploration scores. Network Architecture and Parameter Initialization. Here, we consider f1( ; θ1), f2( ; θ2) to be two L-layer fully-connected (FC) networks with network width m, while θ = {θ1, θ2} refer to their trainable parameters. For their randomly initialized parameters θ(0) = {θ(0) 1 , θ(0) 2 }, the weight matrix entries for the first L 1 layers are drawn from the Gaussian distribution N(0, 2/m), while the entries of the last layer (L-th layer) are sampled from N(0, 1/m). Remark 3 (Reducing Input Complexity). The input of f1( ; θ1) is the arm context χq k,i, whose dimensionality is the number of meta-parameters p. A similar situation also exists for the exploration network f2( ; θ2). Inspired by the idea of learning dense low-dimensional representations with Convolutional Neural Networks (CNNs) (e.g., [44]), we apply the average pooling approach to approximate original inputs for reducing the running time and space complexity in practice. To show its effectiveness, we will apply this approach on BASS for all the experiments in Section 6. 4.3 Formulating Arm Rewards and Exploration Scores Different from the conventional neural bandit algorithms [53, 54, 9] where the reward is provided by the environment oracle, we need to carefully design the arm rewards to reflect the arm benefit in terms of the meta-model s generalization ability. Analogous to task batch rewards (Eq. 
4), we formulate the single arm reward rk,i = h(Θ(k 1)[Tk,i]) = 1 ET P(T ),x DT L x; I(T , Θ(k 1)[Tk,i]) for arm Tk,i. Since it is impractical to calculate the arm reward by enumerating over P(T ), we sample a batch of validation tasks Ωvalid k to derive the unbiased reward approximation, denoted by erk,i = 1 1 |Ωvalid k | T valid Ωvalid k L Dq T valid; I(T valid, Θ(k 1)[Tk,i]) . (8) Here, we adopt the single-step inner-loop optimization [19, 40] to derive I(Tk,bi, Θ(k 1)[Tk,i]) in Eq. 8, in order to save the computational cost in practice. Under the few-shot settings, the computation of arm rewards is efficient, since the support set is generally small for inner-loop optimization. The approximation error here can be bounded by the concentration inequality, as the validation tasks Ωvalid k are sampled from P(T ). On the other hand, to formulate the exploration score ek,i (i.e., the label for f2( ; θ2)), we consider two separate exploration objectives: (1) the prediction uncertainty for the exploitation module f1( ; θ1), which is rk,i f1(χq k,i; θ1); (2) the validation loss of the meta-model, which represents the difficulty of adapting to Tk,i, inspired by the "task difficulty measurer" in Curriculum Learning [52, 50]. As a result, with Lk,i = L(Dq k,i; Θ(J) k,i ) being the validation loss of arm Tk,i, we formulate the exploration score as ek,i = α rk,i f1(χq k,i; θ(k 1) 1 ) + (1 α) Lk,i. Analogously, with the approximated reward erk,i (Eq. 8), we calculate the exploration score approximation by eek,i = α erk,i f1(χq k,i; θ(k 1) 1 ) + (1 α) Lk,i. (9) Here, the exploration coefficient α (0, 1] (in Eq. 7) is also used to balance our two exploration objectives, which are (1) prediction uncertainty rk,i f1( ; θ1): if the exploitation model f1( ; θ1) is under-estimating the arm reward, leading to the positive residual rk,i f1( ; θ1), we will have a high exploration score to enhance the exploration for this arm; otherwise, when rk,i f1( ; θ1) is negative, it indicates an excessively high estimation, which will alternatively lower the exploration score to compensate for the over-estimation. With a higher α value, our exploration strategy will focus more on the behavior of the exploitation model f1( ); (2) the difficulty of task adaptation (i.e., validation loss) Lk,i: if the current meta-model does not generalize well to arm Tk,i, the validation loss Lk,i will be high, which will also lead to a high exploration score. In this way, our formulation considers two different exploration objectives as well as the dynamics of meta-optimizations (base on concatenated network gradients), for a refined exploration strategy. 4.4 Updating Bandit Scheduler Parameters After updating the meta-parameters (Line 14, Algo. 1, Remark 2) with tasks Ωk = {Tk,bi}bi b Nk, we proceed to update the parameters of BASS (Line 15, Algo. 1). Recall that f1( ; θ1) tries to learn the reward mapping function h( ), and f2( ; θ2) aims to learn the exploration score. Given the selected arms Ωk, with ηθ 1 , ηθ 2 R+ being the learning rates, we apply the GD and quadratic loss to update the parameters of BASS, denoted by θ(k) 1 = θ(k 1) 1 ηθ 1 θ1 1 Tk,bi Ωk f1(χq k,bi; θ(k 1) 1 ) erk,bi 2 , θ(k) 2 = θ(k 1) 2 ηθ 2 θ2 1 Tk,bi Ωk f2 [ θf1(χs k,bi); θf1(χq k,bi)]; θ(k 1) 2 eek,bi 2 . We refer to Eqs. 8-9 for calculating the approximated arm reward erk,bi and exploration score eek,bi. 
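The scoring and update rules above can be condensed into a short PyTorch sketch. This is a simplified, illustrative reimplementation rather than the authors' released code: chi_s and chi_q are assumed to already be the (average-pooled, Remark 3) context vectors of Eq. 6, reward_fn and val_loss_fn stand in for the approximated reward of Eq. 8 and the validation loss L_{k,i}, and plain SGD optimizers play the role of the gradient-descent steps in Subsec. 4.4.

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """An L = 2 fully-connected network, used for both f1 and f2."""
    def __init__(self, in_dim, width=200):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(),
                                 nn.Linear(width, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

def grad_feature(f1, chi):
    """Flattened gradient of f1(chi; theta_1) w.r.t. theta_1 (one half of f2's input)."""
    grads = torch.autograd.grad(f1(chi), list(f1.parameters()))
    return torch.cat([g.reshape(-1) for g in grads]).detach()

def benefit_score(f1, f2, chi_s, chi_q, alpha):
    """Eq. 7: alpha * f1(chi_q) + f2([grad f1(chi_s); grad f1(chi_q)])."""
    with torch.no_grad():
        r_hat = f1(chi_q)
    g = torch.cat([grad_feature(f1, chi_s), grad_feature(f1, chi_q)])
    with torch.no_grad():
        e_hat = f2(g)
    return alpha * r_hat + e_hat

def bass_round(f1, f2, chi_s, chi_q, alpha, B, reward_fn, val_loss_fn, opt1, opt2):
    """One scheduling round: choose the top-B arms, then fit f1 and f2 (Subsec. 4.4)."""
    scores = torch.stack([benefit_score(f1, f2, cs, cq, alpha)
                          for cs, cq in zip(chi_s, chi_q)])
    chosen = torch.topk(scores, B).indices.tolist()
    # Labels: approximated rewards (Eq. 8) and validation losses of the chosen arms.
    r = torch.tensor([reward_fn(i) for i in chosen])
    L_val = torch.tensor([val_loss_fn(i) for i in chosen])
    feats = torch.stack([torch.cat([grad_feature(f1, chi_s[i]),
                                    grad_feature(f1, chi_q[i])]) for i in chosen])
    # Exploitation network f1: one GD step on the batch-averaged quadratic loss.
    preds = torch.stack([f1(chi_q[i]) for i in chosen])
    loss1 = ((preds - r) ** 2).mean()
    opt1.zero_grad(); loss1.backward(); opt1.step()
    # Exploration network f2: labels from Eq. 9, using the pre-update f1 estimates.
    e = alpha * (r - preds.detach()) + (1.0 - alpha) * L_val
    loss2 = ((f2(feats) - e) ** 2).mean()
    opt2.zero_grad(); loss2.backward(); opt2.step()
    return chosen
```

Here f1 = ScoreNet(in_dim=d) and f2 = ScoreNet(in_dim=2 * p1), where d is the (pooled) context dimensionality and p1 the number of parameters of f1; in practice the gradient features can be pooled as well (Remark 3) to keep f2's input small.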
5 Theoretical Analysis Recall that in each iteration k [K], we receive candidate arms (tasks) Ω(k) task = {Tk,i}i N , and each arm Tk,i is associated with two context vectors χs k,i, χq k,i. For the sake of analysis, we normalize these two contexts such that χs k,i 2 = χq k,i 2 = 1, and set the exploration coefficient α = 1. Following the existing work [47, 48], we let the meta-model F( ; Θ) be a LF-layer FC network with Gaussian Initialization, with the network width m F. Note that our results can also be generalized to other network architectures, such as CNN and Res Net [22], based on the analysis of over-parameterized neural networks [11, 48, 3]. For the theoretical analysis, we adopt Sigmoid activation for f1 and Re LU for f2, in order to make f1 Lipschitz smooth under over-parameterization settings. Then, we draw trained parameters of BASS with {θ(k) 1 , θ(k) 2 } {eθ (τ) 1 , eθ (τ) 2 }τ [k]. Here, starting from the randomly initialized parameters {θ(0) 1 , θ(0) 2 }, each parameter pair {eθ (τ) 1 , eθ (τ) 2 }, τ [k] is separately trained on past arm rewards {rτ ,bi}τ [τ],bi b Nτ and exploration scores {eτ ,bi}τ [τ],bi b Nτ with Jθ-iteration GD. Next, similar to existing neural bandit works (e.g., [54, 9, 53]) and the works on meta-model convergence analysis (e.g., [47, 48]), we have the following separateness assumption. Assumption 5.1 (ρ-Separateness). After K meta-training iterations, for every pair of arm contexts χq k,i, χq k ,i with k, k [K] such that the corresponding arms Tk,i Ωk Tk ,i Ωk , if (k, i) = (k , i ), we have χq k,i χq k ,i 2 ρ where 0 < ρ O( 1 The assumption above is mild because of two main reasons: (1) since L is manually chosen (e.g., L = 2), we can easily satisfy the condition 0 < ρ O( 1 L) as long as no two arm contexts are identical; (2) since the meta-parameters Θ(k), k [K] are constantly changing, the corresponding arm contexts will also be distinct across different meta-training iterations. Additional discussions on this assumption are in Appendix Section B. With standard settings of over-parameterized neural networks [47, 3, 11] and the definition of regret R(K) in Eq. 5, we have the following Theorem 5.2. Theorem 5.2. Define δ (0, 1), 0 < ξ1, ξ2 O(1/K), ξf = max{ξ1, ξ2}, 0 < ρ O(1/L), cξ > 0, ξL = (cξ)L. Suppose the network width m Ω Poly(K, L, Ntask, ρ 1) log(1/δ) ; m F Ω Poly(K, LF, Ntask) log(1/δ) . Then, let the learning rates be η1, η2 = Θ m 1 F Poly(K,Ntask,LF) ; η1 θ, η2 θ = Θ ρ m 1 Poly(K,Ntask,L) . Jθ = Θ Poly(K,Ntask,L) ρ δ2 log( 1 ξ1 ) . Following Algo. 1, with probability at least 1 δ, the K-round R(K) of BASS could be bounded by 2 + (1 + 2γ1) p 2 log(K/δ)) + O(ξ2 LKJB LF m F ) + γm (10) where γm = O( 1 m1/cγ ), cγ > 1, and γ1 = O(1) with sufficient network width m of BASS. The proof is presented in Appendix Section D. Here, the first term on the RHS is scaled by the 1/ K term, which means the regret bound will shrink along with more iterations K. The second term on the RHS is scaled by 1/ m F, which makes it a diminutive term under the over-parameterization settings. Since the network depth L of BASS is a small integer (we apply L = 2 for experiments in Section 6), ξL will also be a relatively small constant. Meanwhile, γm will also decrease significantly with increasingly large network width m of BASS. 
In contrast, with a convex loss function (e.g., L2 loss or cross-entropy loss) and the same over-parameterization settings, the regret upper bound of the uniform sampling strategy [19, 40] can possibly scale up to 1 for the worst-case scenario, and the upper bound will not decrease with more iterations K (Appendix Lemma D.13). Alternatively, BASS works under the bandit settings, by directly measuring the meta-model performance difference w.r.t. the chosen task batch and the optimal one. With more iterations K, BASS tends to make more accurate scheduling decisions, which makes our regret bound possible. For the existing works, [51] prove that they can improve the optimization landscape with the assumed correlation of task difficulties and meta-model generalization ability. [13] show that their self-paced strategy can improve the model robustness when facing noisy training tasks. Different from previous works, we provide the performance guarantee for the proposed BASS under the neural bandit framework. 6 Experiments In this section, we compare BASS against seven strong baselines, including: (1) Uniform Sampling; non-adaptive self-paced methods and task schedulers (2) SPL [26], (3) Focal-Loss (FOCAL) [30], (4) DAML [29], (5) GCP [31], (6) PAML [24]; and the adaptive task scheduler (7) ATS [51]. Since GCP is not originally compatible with our problem settings and can only work with classification problems, we properly adapt it by choosing the tasks with the highest probabilities, and apply it on the classification data sets. ANIL [40] is adopted as the backbone meta-learning framework. Due to page limit, we include the complementary experiments (e.g., parameter study for α, effects of different levels of task skewness), and the configurations to Appendix Section A. 6.1 Real Data Sets with Noisy Meta-Training Tasks We adopt Drug [36], Mini-Image Net (M-Image Net) [42] and CIFAR-100 (CIFAR) [25] data sets under the few-shot learning scenario. Similar to [51], we apply classification accuracy as the evaluation criterion for the Mini-Image Net and CIFAR-100 data sets, and consider the squared Pearson coefficient for the Drug data set. For each meta-learning iteration k [K], the learner is given a candidate pool of 10 tasks (i.e., |Ω(k) task| = Ntask = 10), and it will need to choose a batch of B = 2 tasks as the training tasks Ωk for this iteration. For the Mini-Image Net and CIFAR-100 data sets, we consider half of the meta-training tasks are perturbed by the label flipping noise [21], where the chance of a label being flipped is ϵ [0, 1]. As the Drug data set stands for a regression problem, we draw the label noise from the Gaussian distribution N(0, ϵ2). The experimental settings are under 1-shot or 5-shot, 5-Way (for classification data sets) learning scenario with the noise level ϵ = 0.5. The experiment results are shown in Table 1 and Figure 3. For the average ranking column, we exclude results from the M-Image Net (1) setting, since BASS and the baselines tend to train a meta-model performing "random guessing". More discussions are in the next paragraph. Table 1: Results on real data sets [data set (shot); results standard deviation]. For 1-shot MImage Net, all the methods end up with an invalid meta-model performing "random guessing". Algo. \ Data Drug (1) Drug (5) M-Image Net (1) M-Image Net (5) CIFAR (1) CIFAR (5) Avg. 
Rank
Uniform  0.210 ± 0.013  0.220 ± 0.001  0.201 ± 0.002  0.301 ± 0.025  0.234 ± 0.029  0.526 ± 0.011  4.4
SPL      0.244 ± 0.008  0.236 ± 0.004  0.203 ± 0.002  0.240 ± 0.018  0.200 ± 0.002  0.367 ± 0.039  5.0
FOCAL    0.222 ± 0.024  0.223 ± 0.003  0.200 ± 0.000  0.316 ± 0.029  0.231 ± 0.024  0.485 ± 0.006  4.4
DAML     0.146 ± 0.009  0.177 ± 0.003  0.201 ± 0.001  0.310 ± 0.016  0.247 ± 0.003  0.414 ± 0.025  5.4
GCP      N/A            N/A            0.201 ± 0.001  0.282 ± 0.016  0.243 ± 0.007  0.508 ± 0.009  6.0
PAML     0.192 ± 0.020  0.205 ± 0.009  0.199 ± 0.001  0.218 ± 0.013  0.199 ± 0.004  0.316 ± 0.022  7.2
ATS      0.230 ± 0.002  0.237 ± 0.014  0.201 ± 0.001  0.334 ± 0.053  0.257 ± 0.048  0.515 ± 0.015  2.4
BASS     0.242 ± 0.012  0.245 ± 0.006  0.198 ± 0.004  0.351 ± 0.012  0.272 ± 0.025  0.553 ± 0.008  1.2

Figure 3: Accuracy / Pearson coefficient vs. meta-training iterations on M-ImageNet (5-shot, ϵ = 0.5) and CIFAR (5-shot, ϵ = 0.5), comparing BASS with the baselines. BASS achieves good performance at an early stage of meta-training.

Here, BASS can generally outperform the baselines by directly learning the mapping from the meta-parameters to the arm rewards, as well as by balancing exploitation and exploration. ATS also achieves good performance as it adaptively learns the correlation between task adaptation difficulty and task scheduling, which shows the necessity of an adaptive scheduling strategy instead of a fixed protocol. Meanwhile, BASS can generally train the meta-model more efficiently (Figure 3), leading to good performance at the early stage of meta-training. In particular, for the Mini-ImageNet data set under the 1-shot, 5-way setting, all the algorithms fail to train an effective meta-model: under the 5-way classification scenario, all the methods tend to produce a meta-model that performs "random guessing" (around 20% accuracy). In this case, utilizing Ensemble Inference techniques [12, 16] can help alleviate this problem, and we include further discussion in the case study (Subsec. 6.3).

6.2 Effects of the Skewed Task Distribution

Table 2: CIFAR-100 with skewed task distribution (5-shot, 5-way / 10-way).
Data \ Algo.  Uniform          SPL              ATS              BASS
5-way         0.375 ± 0.009    0.279 ± 0.002    0.382 ± 0.007    0.408 ± 0.008
10-way        0.264 ± 0.047    0.165 ± 0.007    0.283 ± 0.039    0.320 ± 0.021

Skewed task distributions P(T) commonly exist in real-world cases. For instance, consider an animal image classification data set where each class (i.e., task) corresponds to one kind of animal. Felid classes can be considered "frequent" tasks in the task distribution due to their large quantity and strong mutual correlations, compared with "tail" tasks such as kangaroo classes. Paying insufficient attention to the "tail" classes can then impair the generalization performance of the trained meta-model. Thus, to investigate the effects of a skewed task distribution P(T), we randomly choose some tasks from P(T) and assign them higher sampling probabilities (weights), so that sampling from P(T) will likely yield similar tasks. Here, with the CIFAR-100 data set, we sample 10 tasks and assign them higher sampling probabilities (5 tasks with 10% each, 5 tasks with 5% each), while the rest of the tasks equally share the remaining 25% probability (a minimal sampler sketch is given after Figure 4 below).

Figure 4: Average weights of the chosen meta-training tasks, and testing accuracy vs. the number of iterations needed. BASS actively explores the "tail" tasks and requires far fewer iterations to reach the same performance (as few as 1/3 of the baselines' iterations).
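As referenced above, the skewed sampler can be reproduced with a few lines of numpy; the 64-task training split and the helper name in the usage comment are illustrative assumptions rather than details from the released code.

```python
import numpy as np

def make_skewed_sampler(num_tasks, pool_size, rng=None):
    """Skewed task distribution of Subsec. 6.2: 5 'frequent' tasks get 10%
    sampling probability each, 5 get 5% each, and the remaining tasks share
    the leftover 25% uniformly. Returns a function that draws one candidate
    pool of `pool_size` tasks per meta-training iteration."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.full(num_tasks, 0.25 / (num_tasks - 10))
    head = rng.choice(num_tasks, size=10, replace=False)   # randomly chosen "frequent" tasks
    probs[head[:5]] = 0.10
    probs[head[5:]] = 0.05
    assert np.isclose(probs.sum(), 1.0)
    return lambda: rng.choice(num_tasks, size=pool_size, replace=False, p=probs)

# sample_pool = make_skewed_sampler(num_tasks=64, pool_size=10)  # e.g., 64 training classes (tasks)
# candidate_task_ids = sample_pool()
```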
From Table 2, our proposed BASS maintains the best performance due to its adaptive scheduling strategy and the ability of balancing exploitation and exploration. On the other hand, the baselines are unable to adjust their meta-training strategies towards exploration, when the meta-model has already well adapted to the "frequent" tasks. This can possibly lead to the sub-optimal performance of the meta-model. In particular, based on Figure 4, when facing a skewed task distribution, BASS can actively explore the "tail" tasks, after sufficiently exploring the "frequent" tasks. BASS is also able to achieve good performances with fewer meta-training iterations. 6.3 Case Study: BASS-aided Ensemble Inference From Figure 3, we notice that BASS can train a meta-model that achieves good generalization performance at the early stage of meta-training. One application of this property is using BASS to assist meta-learning models under the Ensemble Inference settings, where separate models are combined to enhance the generalization ability. One renowned ensemble approach is the modelparameter ensemble [41, 33]. With a collection of NE individual models {F( ; Θi)}i [NE] of the same architecture, the ensemble model will be FE( ; ΘE), and its parameters ΘE = 1 NE P i [NE] Θi are the averaged parameters across individual models. Then, the ensemble model FE( ; ΘE) will be applied as the inference model for downstream problems. Here, one natural way of obtaining the individual models F( ; Θi), i [NE] is deeming the models from different training iterations as the individual models for ensemble [12, 16]. Here, we conduct experiments using the ensemble techniques with individual models F( ; Θi) trained by baselines and BASS. We choose the top NE = 10 models with the smallest validation loss across different meta-training iterations as the individual models {F( ; Θi)}i [NE] for ensemble. The results are shown in Table 3. We label the ensemble version of BASS as "BASS-E", and the non-ensemble version as "BASS-S". Table 3: Ensemble case study [dataset (shot); results standard deviation]. Algo. \ Data M-Image Net (1) M-Image Net (5) CIFAR (1) CIFAR (5) Uniform 0.231 0.014 0.313 0.027 0.270 0.014 0.534 0.012 SPL 0.218 0.006 0.298 0.004 0.219 0.005 0.363 0.038 FOCAL 0.204 0.005 0.347 0.030 0.235 0.015 0.499 0.003 DAML 0.222 0.011 0.326 0.031 0.261 0.008 0.432 0.019 GCP 0.226 0.006 0.297 0.011 0.268 0.019 0.512 0.015 PAML 0.213 0.024 0.232 0.009 0.223 0.009 0.336 0.029 ATS 0.202 0.002 0.334 0.052 0.313 0.081 0.517 0.017 BASS-S 0.198 0.004 0.351 0.012 0.272 0.025 0.553 0.008 BASS-E 0.242 0.004 0.366 0.003 0.327 0.010 0.551 0.004 Compared with the non-ensemble settings (Table 1), we see that the BASS-aided ensemble model can generally perform better. In particular, the ensemble model can improve the meta-model inference performance in significantly difficult cases, such as the Mini-Image Net under the 1-shot setting (Subsec. 6.1). As a result, BASS can help generate high-quality ensemble model with the meta-models trained in different iterations. While the ensemble inference technique can also benefit the other baselines, BASS still maintains decent performances. 7 Conclusion In this paper, we formulate the task scheduling problem in meta-learning under the Contextual Bandits settings, and propose a novel bandit-based task scheduling framework named BASS. It directly optimizes the task sampling strategy based on the status of the meta-model rather than applying fixed task scheduling protocols. 
Instead of greedily making decisions, BASS can help deal with the insufficient knowledge problem at the early stage of meta-training, as well as plan for the future meta-training iterations with the adaptive exploration strategy. We include both theoretical analyses and a comprehensive set of experiments to demonstrate the effectiveness of our proposed framework as well as its key properties. Acknowledgments and Disclosure of Funding This work is supported by National Science Foundation under Award No. IIS-1947203, IIS-2117902, IIS-2137468, IIS-2002540, and Agriculture and Food Research Initiative (AFRI) grant no. 202067021-32799/project accession no.1024178 from the USDA National Institute of Food and Agriculture. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the government. [1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24:2312 2320, 2011. [2] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML, pages 127 135. PMLR, 2013. [3] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242 252. PMLR, 2019. [4] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235 256, 2002. [5] Yikun Ban and Jingrui He. Convolutional neural bandit: Provable algorithm for visual-aware advertising. ar Xiv preprint ar Xiv:2107.07438, 2021. [6] Yikun Ban and Jingrui He. Local clustering in contextual multi-armed bandits. In Proceedings of the Web Conference 2021, pages 2335 2346, 2021. [7] Yikun Ban, Jingrui He, and Curtiss B Cook. Multi-facet contextual bandits: A neural network perspective. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 35 45, 2021. [8] Yikun Ban, Yunzhe Qi, Tianxin Wei, and Jingrui He. Neural collaborative filtering bandits via meta learning. ar Xiv preprint ar Xiv:2201.13395, 2022. [9] Yikun Ban, Yuchen Yan, Arindam Banerjee, and Jingrui He. EE-net: Exploitation-exploration neural networks in contextual bandits. In International Conference on Learning Representations, 2022. [10] Yikun Ban, Yuchen Yan, Arindam Banerjee, and Jingrui He. Neural exploitation and exploration of contextual bandits. ar Xiv preprint ar Xiv:2305.03784, 2023. [11] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32:10836 10846, 2019. [12] Junbum Cha, Sanghyuk Chun, Kyungjae Lee, Han-Cheol Cho, Seunghyun Park, Yunsung Lee, and Sungrae Park. Swad: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems, 34:22405 22418, 2021. [13] Dong Chen, Lingfei Wu, Siliang Tang, Xiao Yun, Bo Long, and Yueting Zhuang. Robust metalearning with sampling noise and label noise via eigen-reptile. ar Xiv preprint ar Xiv:2206.01944, 2022. [14] Yudong Chen, Xin Wang, Miao Fan, Jizhou Huang, Shengwen Yang, and Wenwu Zhu. Curriculum meta-learning for next poi recommendation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2692 2702, 2021. [15] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. 
In AISTATS, pages 208 214, 2011. [16] Xu Chu, Yujie Jin, Wenwu Zhu, Yasha Wang, Xin Wang, Shanghang Zhang, and Hong Mei. Dna: Domain generalization with diversified neural averaging. In International Conference on Machine Learning, pages 4010 4034. PMLR, 2022. [17] Aniket Anand Deshmukh, Urun Dogan, and Clay Scott. Multi-task learning for contextual bandits. In Neur IPS, pages 4848 4856, 2017. [18] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675 1685. PMLR, 2019. [19] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126 1135. PMLR, 2017. [20] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pages 1568 1577. PMLR, 2018. [21] B Han, Q Yao, X Yu, G Niu, M Xu, W Hu, I Tsang, and M Sugiyama. Robust training of deep neural networks with extremely noisy labels. In Thirty-fourth Conference on Neural Information Processing Systems (Neur IPS), volume 2, page 4, 2020. [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770 778, 2016. [23] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. ar Xiv preprint ar Xiv:1806.07572, 2018. [24] Jean Kaddour, Steindór Sæmundsson, et al. Probabilistic active meta-learning. Advances in Neural Information Processing Systems, 33:20813 20822, 2020. [25] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. [26] M Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. Advances in neural information processing systems, 23, 2010. [27] Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. Melu: Meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1073 1082, 2019. [28] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, pages 661 670, 2010. [29] Xiaomeng Li, Lequan Yu, Yueming Jin, Chi-Wing Fu, Lei Xing, and Pheng-Ann Heng. Difficulty-aware meta-learning for rare disease diagnosis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 357 366. Springer, 2020. [30] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980 2988, 2017. [31] Chenghao Liu, Zhihao Wang, Doyen Sahoo, Yuan Fang, Kun Zhang, and Steven CH Hoi. Adaptive task sampling for meta-learning. In European Conference on Computer Vision, pages 752 769. Springer, 2020. [32] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. ar Xiv preprint ar Xiv:1806.09055, 2018. [33] Shiwei Liu, Tianlong Chen, Zahra Atashgahi, Xiaohan Chen, Ghada Sokar, Elena Mocanu, Mykola Pechenizkiy, Zhangyang Wang, and Decebal Constantin Mocanu. 
Deep ensembling with no overhead for either training or testing: The all-round blessings of dynamic sparsity. ar Xiv preprint ar Xiv:2106.14568, 2021. [34] Zhuoqun Liu, Yuankun Jiang, Chenglin Li, Wenrui Dai, Junni Zou, and Hongkai Xiong. Adaptive task sampling and variance reduction for gradient-based meta-learning. 2022. [35] Ricardo Luna Gutierrez and Matteo Leonetti. Information-theoretic task selection for metareinforcement learning. Advances in Neural Information Processing Systems, 33:20532 20542, 2020. [36] Eric J Martin, Valery R Polyakov, Xiang-Wei Zhu, Li Tian, Prasenjit Mukherjee, and Xin Liu. All-assay-max2 pqsar: activity predictions as accurate as four-concentration ic50s for 8558 novartis assays. Journal of chemical information and modeling, 59(10):4450 4459, 2019. [37] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1406 1415, 2019. [38] Yunzhe Qi, Yikun Ban, and Jingrui He. Neural bandit with arm group graph. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1379 1389, 2022. [39] Yunzhe Qi, Yikun Ban, and Jingrui He. Graph neural bandits. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1920 1931, 2023. [40] Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? towards understanding the effectiveness of maml. In International Conference on Learning Representations, 2019. [41] Alexandre Rame, Matthieu Kirchmeyer, Thibaud Rahier, Alain Rakotomamonjy, Patrick Gallinari, and Matthieu Cord. Diverse weight averaging for out-of-distribution generalization. ar Xiv preprint ar Xiv:2205.09739, 2022. [42] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016. [43] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International conference on machine learning, pages 4334 4343. PMLR, 2018. [44] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. Learning semantic representations using convolutional neural networks for web search. In Proceedings of the 23rd international conference on world wide web, pages 373 374, 2014. [45] Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Metaweight-net: Learning an explicit mapping for sample weighting. Advances in neural information processing systems, 32, 2019. [46] Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. ar Xiv preprint ar Xiv:1309.6869, 2013. [47] Haoxiang Wang, Ruoyu Sun, and Bo Li. Global convergence and generalization bound of gradient-based meta-learning with deep neural nets. ar Xiv preprint ar Xiv:2006.14606, 2020. [48] Haoxiang Wang, Yite Wang, Ruoyu Sun, and Bo Li. Global convergence of maml and theoryinspired neural architecture search for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9797 9808, June 2022. [49] Qingyun Wu, Huazheng Wang, Quanquan Gu, and Hongning Wang. Contextual bandits in a collaborative environment. In SIGIR, pages 529 538, 2016. [50] Tongtong Wu, Xuekai Li, Yuan-Fang Li, Gholamreza Haffari, Guilin Qi, Yujin Zhu, and Guoqiang Xu. Curriculum-meta learning for order-robust continual relation extraction. 
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10363 10369, 2021. [51] Huaxiu Yao, Yu Wang, Ying Wei, Peilin Zhao, Mehrdad Mahdavi, Defu Lian, and Chelsea Finn. Meta-learning with an adaptive task scheduler. Advances in Neural Information Processing Systems, 34:7497 7509, 2021. [52] Ji Zhang, Jingkuan Song, Yazhou Yao, and Lianli Gao. Curriculum-based meta-learning. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1838 1846, 2021. [53] Weitong Zhang, Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural thompson sampling. ar Xiv preprint ar Xiv:2010.00827, 2020. [54] Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pages 11492 11502. PMLR, 2020. A Appendix: Experiments (Continue) A.1 Further Details for the Experiment Settings For the data partitioning, we have the Mini-Image Net and CIFAR-100 data sets divided into the partitions 64 : 16 : 20, which correspond to the training set, validation set and the testing set respectively. Each class is corresponding to a task. Then, for the Drug data set, we partition the tasks into 4100 : 76 : 100 representing the training set, validation set and the testing set. For our BASS, we apply two 2-layer FC networks for f1( ; θ1), f2( ; θ2) respectively, and set network width m = 200. For deriving approximated arm rewards, we let |Ωvalid k | = 5. Recall that we apply the approximation approach mentioned in Remark 3 to reduce the space complexity and time complexity in practice for the experiments. Here, we tune the pooling step such that the inputs of f1( ; θ1), f1( ; θ1) are approximately 50 and 20 respectively. For the learning rate, we find the learning rate for BASS with grid search from {0.01, 0.001, 0.0001}, and choose the learning rates for the meta-model η1 = 0.01, η2 = 0.001. The meta-model architecture as well as its learning rates will stay the same for all the baselines and our proposed BASS. For the CIFAR-100 and Mini-Image Net data sets, we use the the meta-model with four convolutional blocks where the network width of each block is 32, followed by an FC layer as the output layer. For the Drug data set, we apply a meta-model with two FC layers, where the network width is 500. All the experiments are performed on a Linux machine with Intel Xeon CPU, 128GB RAM, and Tesla V100 GPU. Code will be made available at https://github.com/yunzhe0306/Bandit_Task_Scheduler. A.2 Effect of the Task Noise Magnitude We conduct the experiments to show the effects of the noise magnitude factor ϵ on the Drug and CIFAR-100 data sets. The experiment results are shown in Table 4. Table 4: Comparison with baselines with different noise magnitude [data set (noise magnitude ϵ) ; final results standard deviation]. Algo. \ Data Drug (0.3) Drug (0.5) CIFAR100 (0.3) CIFAR100 (0.5) Uniform 0.218 0.007 0.220 0.001 0.655 0.009 0.526 0.011 SPL 0.243 0.008 0.236 0.004 0.625 0.017 0.367 0.039 FOCAL 0.224 0.019 0.223 0.003 0.638 0.010 0.485 0.006 DAML 0.182 0.025 0.177 0.003 0.543 0.017 0.414 0.025 GCP N/A N/A 0.653 0.005 0.508 0.009 PAML 0.186 0.006 0.205 0.009 0.537 0.009 0.316 0.022 ATS 0.239 0.011 0.237 0.014 0.651 0.001 0.505 0.015 BASS (Ours) 0.258 0.003 0.245 0.006 0.657 0.005 0.553 0.008 With increasing noise magnitude ϵ, the performances of the meta-model trained by baselines and our BASS tend to drop, which is intuitive. 
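For concreteness, the noise construction used throughout these robustness experiments (Subsec. 6.1) can be sketched as follows: label flipping with probability ϵ for the classification tasks, and additive N(0, ϵ²) noise for the Drug regression labels. This is an assumed numpy reimplementation for illustration (e.g., flipped labels are redrawn uniformly from the other classes), not the authors' preprocessing code.

```python
import numpy as np

def perturb_labels(labels, eps, num_classes=None, rng=None):
    """Label noise for a meta-training task: classification labels are flipped
    with probability eps (to a uniformly random different class); regression
    labels receive additive Gaussian noise N(0, eps^2)."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    if num_classes is not None:                        # classification: label flipping
        flip = rng.random(labels.shape[0]) < eps
        noisy = labels.copy()
        offsets = rng.integers(1, num_classes, size=int(flip.sum()))
        noisy[flip] = (labels[flip] + offsets) % num_classes
        return noisy
    return labels + rng.normal(0.0, eps, size=labels.shape)   # regression: Gaussian noise
```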
In particular, for the CIFAR-100 data set, when we increase ϵ, the performance difference between BASS and the other baselines tends to increase. This can be the reason that the greedy baselines with no exploration strategies can be more susceptible to the task noise perturbation, which can lead to the sub-optimal performances of the meta-model. Table 5: Experiment results of noise-free settings on three real data sets (5-way, 5-shot). Data \ Algo. Uniform SPL FOCAL-LOSS DAML GCP PAML ATS BASS Drug 0.206 0.012 0.234 0.006 0.240 0.003 0.190 0.002 N/A 0.220 0.010 0.233 0.001 0.256 0.003 M-Image Net 0.576 0.016 0.554 0.004 0.582 0.005 0.437 0.015 0.564 0.002 0.467 0.007 0.561 0.004 0.586 0.008 CIFAR 0.681 0.010 0.681 0.008 0.692 0.023 0.662 0.027 0.681 0.016 0.640 0.011 0.695 0.035 0.697 0.029 From the Table 5, we can see that when there is no noise, the overall performance does not differ significantly across different methods. The reason could be that since the meta-learning backbone remains the same for all the methods, the meta-model performance upper bound can be similar for different scheduling algorithms, without the presence of other confounding factors (e.g., noise, task distribution skewness). In the practical application scenarios with noisy data, BASS-guided meta-models tend to perform well in presence of task noise and skewness compared with baselines, as presented by our experiments in the main body. A.3 Parameter Study for Exploration Coefficient As in Eq. 7 and Eq. 9, BASS involves an exploration coefficient α to balance the exploitationexploration and the two exploration objectives. Here, we conduct the parameter study for the exploration coefficient α, and include the results with no exploration (i.e., removing f2). Table 6: Comparison among different α values [dataset (shot) ; final results standard deviation]. Algo. \ Data Drug (1) Drug (5) CIFAR100 (1) CIFAR100 (5) No Exploration 0.234 0.003 0.239 0.012 0.256 0.027 0.537 0.012 α = 0.1 0.231 0.005 0.233 0.013 0.264 0.051 0.522 0.024 α = 0.3 0.228 0.013 0.231 0.008 0.268 0.047 0.528 0.014 α = 0.5 0.236 0.004 0.245 0.006 0.272 0.025 0.553 0.008 α = 0.7 0.242 0.012 0.227 0.006 0.241 0.005 0.543 0.021 α = 1.0 0.236 0.002 0.235 0.013 0.266 0.006 0.537 0.005 From the results in Table 6, we see that the exploration module can indeed improve the performance of BASS compared with the performance with no exploration. This also fits our initial argument that the greedy algorithm alone can lead to sub-optimal performances of meta-model. By properly choosing the α value, we will be able to achieve a good balance between exploitation and exploration, as well as between the two exploration objectives. Here, setting α [0.5, 0.7] will be good enough to achieve satisfactory performances. Meanwhile, we also note that even with no exploration, our BASS still achieves good performances by directly learning the correlation between the adapted meta-parameter and the generalization score, and refining the scheduling strategy based on the status of the meta-model. A.4 Running Time Comparison In Figure 5, we include the running time comparison with baselines. We can see that BASS can achieve significant improvement in terms of the running time, and can take as little as 50% of ATS s running time. The intuition is that our proposed BASS only needs one round of the optimization process to update the meta-model and BASS. 
A.4 Running Time Comparison

In Figure 5, we include the running time comparison with the baselines. We can see that BASS achieves a significant improvement in running time, taking as little as roughly 50% of ATS's running time. The intuition is that our proposed BASS needs only one round of the optimization process to update both the meta-model and BASS. On the other hand, from Algorithm 1 of the ATS paper [51], we see that ATS requires two optimization rounds for each meta-training iteration: (1) updating the scheduler with the temporary meta-model, and (2) updating the actual meta-model. Based on the figure on the RHS, we also see that BASS achieves a relatively good balance between computational cost and performance.

Figure 5: Running time results (including training both the scheduler and the meta-model). "D", "M" and "C" refer to the "Drug", "Mini-ImageNet" and "CIFAR-100" data sets respectively; running time (seconds) is reported for the 1-shot and 5-shot settings of each data set. BASS can take as little as approximately 50% of ATS's running time. On the RHS, we show a scatter plot of running time vs. performance on the Mini-ImageNet data set.

A.5 Performances with Different Task Skewness Settings

In Table 7, we include experiments with different levels of task skewness. We see that with lower skewness levels (the skewness level decreases from Setting 1 to Setting 3), the accuracy of BASS as well as the baselines continues to improve, while BASS still maintains decent performance.

Skewness Setting \ Algo.   Uniform         ATS             BASS
Skewness Setting 1         0.375 ± 0.009   0.382 ± 0.007   0.408 ± 0.008
Skewness Setting 2         0.429 ± 0.012   0.448 ± 0.006   0.460 ± 0.013
Skewness Setting 3         0.497 ± 0.008   0.502 ± 0.010   0.539 ± 0.009

Table 7: Results for different skewness levels on the CIFAR-100 data set (5-shot). (1) Setting 1 is the original setting in Subsec. 5.2. (2) For Setting 2, we assign 5 tasks an 8% sampling probability each and 5 tasks 3% each, while the remaining tasks equally share the remaining 45% probability. (3) For Setting 3, we assign 5 tasks 5% each and 5 tasks 2% each, while the remaining tasks equally share the remaining 65% probability.

A.6 Performances with Different Batch Sizes

In Table 8, we include additional experiments with different batch sizes B, in comparison with ATS and the uniform sampling approach. We see that with larger B values, the accuracy of BASS as well as the baselines generally improves.

B (batch size) \ Algo.   Uniform         ATS             BASS
1                        0.459 ± 0.009   0.449 ± 0.010   0.472 ± 0.012
2                        0.526 ± 0.011   0.515 ± 0.015   0.553 ± 0.008
3                        0.570 ± 0.012   0.563 ± 0.007   0.588 ± 0.010
5                        0.581 ± 0.005   0.571 ± 0.007   0.586 ± 0.009

Table 8: Results for different B values (batch sizes) on the CIFAR-100 data set (5-shot).

A.7 Performances with Different Embedding Approaches of Arm Contexts

In Table 9, we include additional experiments with different levels of average pooling, such that after the pooling, the dimensionality of the pooled vector representation falls into {20, 50, 100, 500} (with 50 being the default setting from Subsec. A.1). We see that an overly small dimensionality of the average-pooled vector representation (e.g., 20) can lead to sub-optimal performance of the BASS framework, while setting the dimensionality to 50 generally leads to good enough performance. A minimal sketch of this pooling step is given after Table 9.

Dimensionality   20              50              100             500
Accuracy         0.541 ± 0.008   0.553 ± 0.008   0.558 ± 0.006   0.555 ± 0.010

Table 9: With CIFAR-100 (5-shot), different dimensionalities of the average-pooled vector representation (Remark 3) of the meta-parameters.
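To illustrate the pooling step referenced above (Remark 3 in the main body), the following is a minimal sketch, assuming the arm context is built by flattening the adapted meta-parameters and average-pooling the flat vector down to a target dimensionality (e.g., 50). The helper name and the exact padding behavior are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def pooled_arm_context(meta_params, target_dim: int = 50) -> torch.Tensor:
    """Flatten the (adapted) meta-parameter tensors and average-pool the
    resulting vector down to `target_dim` entries."""
    flat = torch.cat([p.detach().reshape(-1) for p in meta_params])
    chunk = -(-flat.numel() // target_dim)                     # ceiling division
    padded = F.pad(flat, (0, chunk * target_dim - flat.numel()))  # pad so it splits evenly
    return padded.reshape(target_dim, chunk).mean(dim=1)

# Example usage: ctx = pooled_arm_context(list(adapted_meta_model.parameters()))
```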
Here, we also include additional experimental results using an MLP to map the original context into a lower-dimensional space instead of our proposed average pooling (Remark 3). Results are shown in Table 10, where we use a one-layer MLP with ReLU activation to embed the original meta-parameters into low-dimensional vector representations. We can see that the MLP-based method can indeed lead to some performance improvement, but in general, the performance difference between the MLP-based embedding and the average-pooled vector representation is subtle. We also note that the MLP-based mapping is more time-consuming than the average pooling approach, since we also need to train the additional embedding layer, which has a considerable number of trainable parameters.

Dimensionality   Original avg-pooled (50)   50              100             200
Accuracy         0.553 ± 0.008              0.558 ± 0.013   0.560 ± 0.012   0.553 ± 0.015

Table 10: With CIFAR-100 (5-shot), different dimensionalities of the one-layer MLP (with ReLU)-embedded vector representation of the meta-parameters. "Original avg-pooled (50)" refers to the average-pooled vector representation (Remark 3) with dimensionality 50.

A.8 Additional Experiments on the "DomainNet" Data Set

In Table 11, we include additional experiments on the new "DomainNet" data set [37]. Within the "real" domain, we filter 100 classes that have at least 600 images each; with each class being a task with 600 images, we obtain a total of 100 tasks. Compared with the image data sets in our paper (Mini-ImageNet and CIFAR-100), we increase the image resolution of "DomainNet" by resizing its images to 128 × 128 pixels. Following the settings in our paper, we divide the tasks into 64 : 16 : 20 portions corresponding to the training, validation and test sets, respectively. For the few-shot settings, we formulate the problem as 5-shot, 5-way / 7-way, with uniform sampling and ATS as baselines. Even with the higher image resolution of "DomainNet", BASS still maintains good performance compared with the baselines.

Setting \ Algo.   Uniform         ATS             BASS
5-way             0.475 ± 0.002   0.483 ± 0.006   0.511 ± 0.012
7-way             0.411 ± 0.005   0.372 ± 0.009   0.435 ± 0.008

Table 11: Results for the "DomainNet" data set (noise level 0.5, 5-shot settings).

B Appendix: Additional Discussion on the Necessity of Assumption 5.1

We would like to mention that, in order to complete the convergence and generalization analysis in neural Contextual Bandit works (e.g., [54, 2, 9]), the separateness assumption on the arm contexts is the minimum requirement on the data set. This is because the training data needs to be non-degenerate (i.e., every pair of samples is distinct) to ensure that the neural network can consistently converge, as indicated by Assumption 2.1 in [3]. Therefore, our Assumption 5.1 regarding arm separateness aims to ensure that BASS is able to adequately learn the underlying reward mapping function with sufficient information. Compared with existing works, the convergence analyses for meta-learning [47, 48] measure separateness in terms of the minimum eigenvalue λ0 (with λ0 > 0) of the Neural Tangent Kernel (NTK) [23] matrix, which is comparable with our Euclidean separateness ρ. For existing neural bandit works, Assumption 5.1 in [9] is similar to our separateness assumption, while Assumption 4.2 in [54] and Assumption 3.4 in [53] also imply that no two arms are identical, in terms of the minimum NTK matrix eigenvalue λ0 > 0. The two comparable conditions can be stated schematically as follows.
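The following LaTeX snippet is only a schematic restatement of the two non-degeneracy conditions mentioned above, assuming unit-norm arm contexts χ_i; it is not a verbatim copy of Assumption 5.1 or of the cited assumptions.

```latex
% Euclidean separateness (schematic form of the arm separateness in Assumption 5.1):
% every pair of distinct (unit-norm) arm contexts is at least \rho apart.
\[
  \|\chi_i\|_2 = 1, \qquad \|\chi_i - \chi_j\|_2 \ge \rho > 0 \quad \text{for all } i \neq j .
\]
% NTK-based non-degeneracy (as used in, e.g., [54, 53, 47]):
% the NTK Gram matrix H over the arm contexts is strictly positive definite.
\[
  \lambda_{\min}(\mathbf{H}) = \lambda_0 > 0 .
\]
```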
C Appendix: Limitation

One potential limitation of BASS is that its improvement over the baselines may not be significant under noise-free settings and non-skewed task distributions (Table 5). Meanwhile, the non-adaptive FOCAL-LOSS [30] tends to achieve performance similar to that of the adaptive method ATS [51], while enjoying an advantage in terms of computational cost. In practical terms, although BASS generally achieves decent performance and has a smaller computational cost than ATS, practitioners still need to consider whether their task distribution is noisy or skewed in order to strike a good balance between the computational resources needed and the meta-model performance, since BASS gains a more significant advantage over the baselines when the task distribution is noisy or skewed.

D Appendix: Theoretical Analysis

In this section, we present the proof of Theorem 5.2. Instead of directly working with the batch setting, where we adopt a training task batch Ω_k for each iteration k ∈ [K] (with |Ω_k| = |Ω*_k| = B), we first present the results for the single-task setting (Subsec. D.1), i.e., |Ω_k| = |Ω*_k| = 1, and then extend them to the batch setting in Subsec. D.2. Recall that for the meta-model, we consider an L_F-layer fully-connected (FC) network of width m_F for the theoretical analysis (lines 237-239). In particular, we follow the settings in [3] for the Gaussian initialization of the weight matrices: each element of the weight matrices in the meta-model's first (L_F − 1) layers is drawn from the Gaussian distribution N(0, 2/m_F), and each element of the last (L_F-th) layer's weight matrix is drawn from N(0, 1).

D.1 Single-task settings

For brevity of notation, we denote the scheduler output as
\[
f\big(\Theta^{(K-1)}[T_{k,i}]; \theta^{(k-1)}\big) \;=\; f_1\big(\chi^q_{k,i}; \theta_1^{(k-1)}\big) \;+\; f_2\big([\nabla_{\theta} f_1(\chi^s_{k,i});\, \nabla_{\theta} f_1(\chi^q_{k,i})];\, \theta_2^{(k-1)}\big),
\]
which corresponds to the definition in Eq. 7. In this case, T^{(K)} = {T_1, ..., T_K} refers to the chosen tasks and T^{(K),*} = {T*_1, ..., T*_K} to the optimal ones. Based on the problem definition, we have
\[
\begin{aligned}
R_{\mathrm{single}}(K)
&= \mathbb{E}_{T \sim P(T),\, x \sim D_T}\big[\mathcal{L}\big(x; I(T, \Theta^{(K)})\big)\big] - \mathbb{E}_{T \sim P(T),\, x \sim D_T}\big[\mathcal{L}\big(x; I(T, \Theta^{(K),*})\big)\big] \\
&= \mathbb{E}_{T \sim P(T),\, x \sim D_T}\big[\mathcal{L}\big(x; I(T, \Theta^{(K-1)}[T_K])\big)\big] - \mathbb{E}_{T \sim P(T),\, x \sim D_T}\big[\mathcal{L}\big(x; I(T, \Theta^{(K-1),*}[T^*_K])\big)\big] \\
&= h\big(\Theta^{(K-1),*}[T^*_K]\big) - h\big(\Theta^{(K-1)}[T_K]\big) \\
&= h\big(\Theta^{(K-1),*}[T^*_K]\big) - f\big(\chi^*_K; \theta^{*(K-1)}\big) + f\big(\chi^*_K; \theta^{*(K-1)}\big) - f\big(\chi_K; \theta^{(K-1)}\big) + f\big(\chi_K; \theta^{(K-1)}\big) - h\big(\Theta^{(K-1)}[T_K]\big) \\
&\le h\big(\Theta^{(K-1),*}[T^*_K]\big) - f\big(\chi^*_K; \theta^{*(K-1)}\big) + f\big(\chi^*_K; \theta^{*(K-1)}\big) - f\big(\Theta^{(K-1)}[T^*_K]; \theta^{(K-1)}\big) + f\big(\chi_K; \theta^{(K-1)}\big) - h\big(\Theta^{(K-1)}[T_K]\big) \\
&\le \big|h\big(\Theta^{(K-1),*}[T^*_K]\big) - f\big(\Theta^{(K-1),*}[T^*_K]; \theta^{*(K-1)}\big)\big|
 + \underbrace{\big|f\big(\Theta^{(K-1),*}[T^*_K]; \theta^{*(K-1)}\big) - f\big(\Theta^{(K-1)}[T^*_K]; \theta^{(K-1)}\big)\big|}_{I_0}
 + \big|f\big(\Theta^{(K-1)}[T_K]; \theta^{(K-1)}\big) - h\big(\Theta^{(K-1)}[T_K]\big)\big|,
\end{aligned}
\]
where the first inequality is due to the arm pulling mechanism, i.e., f(Θ^{(K-1)}[T^*_K]; θ^{(K-1)}) ≤ f(Θ^{(K-1)}[T_K]; θ^{(K-1)}). Here, f(·; θ^{*(K-1)}) is defined as the "shadow" bandit model trained on the optimal tasks {T^*_1, T^*_2, ..., T^*_{K-1}} and the corresponding meta-model parameters. We denote χ_K = Θ^{(K-1)}[T_K] ∈ R^p as the arm context given the arm T_K and the meta-model parameters Θ^{(K-1)}; similarly, χ^*_K = Θ^{(K-1),*}[T^*_K] ∈ R^p is the arm context given the arm T^*_K and the meta-model parameters Θ^{(K-1),*}. Thus, for the term I_0 on the RHS, we have
\[
\begin{aligned}
I_0 &= \big|f\big(\chi^*_K; \theta^{*(K-1)}\big) - f\big(\Theta^{(K-1)}[T^*_K]; \theta^{(K-1)}\big)\big| \\
&= \big|f\big(\Theta^{(K-1),*}[T^*_K]; \theta^{*(K-1)}\big) - f\big(\Theta^{(K-1),*}[T^*_K]; \theta^{(K-1)}\big) + f\big(\Theta^{(K-1),*}[T^*_K]; \theta^{(K-1)}\big) - f\big(\Theta^{(K-1)}[T^*_K]; \theta^{(K-1)}\big)\big| \\
&\le \big|f\big(\Theta^{(K-1),*}[T^*_K]; \theta^{*(K-1)}\big) - f\big(\Theta^{(K-1),*}[T^*_K]; \theta^{(K-1)}\big)\big| + \big|f\big(\Theta^{(K-1),*}[T^*_K]; \theta^{(K-1)}\big) - f\big(\Theta^{(K-1)}[T^*_K]; \theta^{(K-1)}\big)\big|.
\end{aligned}
\]
Then, inserting the inequality will lead to R(K) |h(Θ(K 1)[TK]) f(χK; θ(K 1))| | {z } I1 + |f(χ K; θ (K 1)) h(Θ(K 1), [T K])| | {z } I2 + |f(Θ(K 1), [T K]; θ (K 1)) f(Θ(K 1), [T K]; θ(K 1))| | {z } I3 + |f(Θ(K 1), [T K]; θ(K 1)) f(Θ(K 1)[T K]; θ(K 1))| | {z } I4 Here, the terms I1, I2 refer to the approximation error for the two bandit models (our possessed model f( ; θ(K 1)) and the pseudo model f( ; θ (K 1))). Then, the third term I3 bounds the output difference when given the same input Θ(K 1), [TK] to two separate bandit models, and the final term I4 refers to the difference of the meta-model parameters when adapted to the same task with two individual sets of parameters. Here, the terms I1, I2 can be bounded by Lemma D.1, Corollary D.2. Then, the point is to bound the difference term I4 when given different inputs to the bandit model. D.1.1 Bounding error terms and assembling the regret bound [Bounding term I3] For error term I3, it focuses on bounding the output difference between two bandit models f( ; θ (K 1)), f( ; θ(K 1)) given the same input Θ(K 1), [T K], and we have I3 = |f(Θ(K 1), [T K]; θ (K 1)) f(Θ(K 1), [T K]; θ(K 1))| = |f(χ K; θ (K 1)) f(χ K; θ(K 1))| |f1(χ K; θ (K 1) 1 ) f1(χ K; θ(K 1) 1 )| | {z } I3.1 [ θf1(χs, K ); θf1(χq, K )]; θ (K 1) 2 [ θf1(χs, K ); θf1(χq, K )]; θ(K 1) 2 | | {z } I3.2 With the defined ξL, applying Lemma D.11 as well as Corollary D.12, we will have I3.1 1 + O(KL3 log5/6(m) ρ1/3m1/6 ) O( K3L ρ m log(m)) + O K4L2 log11/6(m) Then, for term I3.2, we have [ θf1(χs, K ); θf1(χq, K )]; θ (K 1) 2 [ θf1(χs, K ); θf1(χq, K )]; θ(K 1) 2 [ θf1(χs, K ); θf1(χq, K )]; θ (K 1) 2 [ θf1(χs, K ); θf1(χq, K )]; θ(K 1) 2 [ θf1(χs, K ); θf1(χq, K )]; θ(K 1) 2 [ θf1(χs, K ); θf1(χq, K )]; θ(K 1) 2 Here, for the first term on the RHS, we apply Lemma D.11 as well as Corollary D.12 to bound. Then, for the second term, with Gaussian initialization of weight matrices, for the over-parameterized FC network f with Lipschitz-smooth activation functions (e.g., Sigmoid), we can have |f(x) f(x )|, f(x) f(x ) ξ x x due to its Lipschitz continuity / smoothness property [47, 18]. Meanwhile, we also have the Lipschitz continuity property for over-parameterized FC network f with Re LU activation [3], such that |f (x) f (x )| ξ x x . By the Gaussian initialization of BASS s weight matrices and the properties of over-parameterized neural networks [3, 47, 18], we have ξL = max{ξ, ξ } O(c L ξ ) being the Lipschitz constant for our f1, f2, where cξ > 1 is a small constant. Applying the conclusion above, we will have [ θf1(χs, K ); θf1(χq, K )]; θ(K 1) 2 [ θf1(χs, K ); θf1(χq, K )]; θ(K 1) 2 ξL [ θf1(χs, K ); θf1(χq, K )] ξL [ θf1(χs, K ); θf1(χq, K )] ξL θf1(χs, K ) θf1(χs, K ) + ξL θf1(χq, K ) θf1(χq, K ) ξL KL4 log5/6(m) where the last inequality is by Theorem 5 in [3] and the proof of Lemma D.11. With the above results, it will give us I3 1 + O(KL3 log5/6(m) ρ1/3m1/6 ) O( K3L ρ m log(m)) + O K4L2 log11/6(m) + ξLKL4 log5/6(m) [Bounding term I4] On the other hand, applying the analogous procedure for term I4, denoting χ K = Θ(K 1), [T K] and χ K = Θ(K 1)[T K] for the brevity of notation, we can have I4 = |f(Θ(K 1), [T K]; θ(K 1)) f(Θ(K 1)[T K]; θ(K 1))| ξL Θ(K 1), [T K] Θ(K 1)[T K] 2 | {z } I4.1 [ θf1(χs, K ); θf1(χq, K )]; θ(K 1) 2 [ θf1( χs, K ); θf1( χq, K )]; θ(K 1) 2 | | {z } I4.2 where χs, K , χq, K respectively represents the support set and query set for task T K and the metaparameters Θ(K 1), . Similar notation also applies to χ K = Θ(K 1)[T K]. 
And the inequality is due to the fact that Re LU networks are naturally Lipschitz continuous w.r.t. some coefficient ξL when they are wide enough [3], as we have discussed above. [Bounding term I4.1] Based on the meta-optimization procedure (inner-loop optimization + outerloop optimization), we have I4.1 = ξL Θ(K 1), [T K] Θ(K 1)[T K] 2 = ξL Θ(K 1), η2 ΘL(Dq, K ; Θ(J), K ) Θ(K 1) η2 ΘL(Dq, K ; Θ(J) K ) 2 where Θ(J), K is the task-specific parameter of T K after adapting on Θ(K 1), with inner-loop optimization, and the Θ(J) K is the similar parameter after adapting on Θ(K 1). Here, we simplify the formula by representing the gradient derivation (inner-loop + outer-loop) with the mapping H : T Θ 7 Rp, which leads to Θ(K 1), [T K] Θ(K 1)[T K] 2 = Θ(K 1), η2 ΘL(Dq; Θ(J), K ) Θ(K 1) η2 ΘL(Dq; Θ(J) K ) 2 = (Θ(K 1), Θ(K 1)) η2 H(T K, Θ(K 1), ) H(T K, Θ(K 1)) 2 = (Θ(K 2), Θ(K 2)) η2 H(T K, Θ(K 1), ) H(T K, Θ(K 1)) η2 H(T K 1, Θ(K 2), ) H(T K 1, Θ(K 2)) 2 k [K] η2 H(T k , Θ(k 1), ) H(T k , Θ(k 1)) 2 Recall that the past arms, including the actual chosen arms {T1, T2, . . . , TK} as well as the optimal ones {T 1 , T 2 , . . . , T K} are all from the candidate pool where each candidate arm is drawn i.i.d. from the task distribution P(T ). Therefore, denoting the bound as H(T K, Θ(K 1), ) H(T K, Θ(K 1)) 2 S1(K), we can have the upper bound as I4.1 η2 ξLK S1(K). Then, for the term S1(K), by definition we have H(T K, Θ(K 1), ) H(T K, Θ(K 1)) S1(K), applying mean-reduction for the sample loss will further leads to H(T K, Θ(K 1), ) H(T K, Θ(K 1)) = ΘL(Dq, K ; Θ(J), K ) ΘL(Dq, K ; Θ(J) K ) 2 = 1 |Dq, K | ΘL(x; Θ(J), K ) 1 |Dq, K | ΘL(x; Θ(J) K ) 2 ΘL(x; Θ(J), K ) ΘL(x; Θ(J) K ) 2. This inequality essentially bound the gradient difference when given the same input task T K w.r.t. different sets of model parameters. Based on the conclusion from Lemma 9 of [47] and Lemma B.3 of [11], we have Θlf(x; ΘK) F , Θl L(x; ΘK) F O( m F), l [LF] for any set of parameters within the sphere ΘK B(Θ0, ω) where Θ0 is the center and ω is the corresponding radius (which is a small value). With a total of LF layers for the meta-model and each layer of m F hidden units, this will give us ΘL(x; Θ(J), K ) 2, ΘL(x; Θ(J) K ) 2 O( m FLF) (Lemma D.15). And this makes S1(K) O( m FLF). Since we have η1, η2 O( 1 m F ), summarizing the results above, the upper bound can then be derived. [Bounding term I4.2] Next, we proceed to bound I4.2, which will be [ θf1(χs, K ); θf1(χq, K )]; θ(K 1) 2 [ θf1( χs, K ); θf1( χq, K )]; θ(K 1) 2 ξL [ θf1(χs, K ); θf1(χq, K )] [ θf1( χs, K ); θf1( χq, K )] 2 ξL θf1(χs, K ) θf1( χs, K ) 2 + ξL θf1(χq, K ) θf1( χq, K ) 2 ξ2 L χs, K χs, K 2 + ξ2 L χq, K χq, K 2 where the inequalities are due to the Lipschitz continuity / smoothness properties of overparameterized FC networks as we discussed above. Here, we notice that the second term on the RHS can be bounded by directly applying the proving procedure of term I4.1. Then, for the first term on the RHS, we can following a similar procedure as for I4.1, by χs, K χs, K 2 = I(T k , Θ(k 1), ) I(T k , Θ(k 1)) 2 = Θ(K 1), η1 X j [J] ΘL(Ds, K ; Θ(j), K ) Θ(K 1) η1 X j [J] ΘL(Ds, K ; Θ(j) K ) 2 = Θ(K 2), Θ(K 2)) η1 X j [J] ΘL(Ds, K 1; Θ(j) K 1) η1 X j [J] ΘL(Ds, K 1; Θ(j), K 1 j [J] ΘL(Ds, K ; Θ(j) K ) η1 X j [J] ΘL(Ds, K ; Θ(j), K 2 j [J] ΘL(Ds, k ; Θ(j) k ) η1 X j [J] ΘL(Ds, k ; Θ(j), k ) 2 where the last inequality is due to Lemma D.15 and by iterating through K meta-training iterations. 
Summing up the results above, we will have I4.2 O(η2ξ2 L K m FLF)+O(η1ξ2 L KJ m FLF). [Summing up the results] Then, combining all the results, we would have Rsingle(K) O( 1 2 + (1 + 2γ1) δ ) + O(ξ2 LKJ LF m F ) + γm γ1 = 2 + O K3L ρ m log m + O L2K4 ρ4/3m1/6 log11/6(m) γm = 1 + O(KL3 log5/6(m) ρ1/3m1/6 ) O( K3L ρ m log(m)) + O K4L2 log11/6(m) + ξLKL4 log5/6(m) Note that the majority of the terms above can be cancelled to O(1) with proper networks width m indicated in Theorem 5.2. With increasingly large network width m, these terms will also become diminutive enough to achieve our regret bound in the main body. D.2 Extending the result to the batch settings (Proof of Theorem 5.2) With the results and conclusions from Subsection D.1, we proceed to provide the proof of Theorem 5.2 under the batch settings. Recall that in our original problem formulation and Algorithm 1, we are expected to select a batch of B arms in each meta-training iteration, denoted by {Ωk}k [K]. Note that each of the candidate arms from Ω(k) task are drawn i.i.d. from the task distribution P(T ). Meantime, we will have the corresponding optimal arm batches, denoted by {Ω k}k [K], which minimizes the loss objective in Eq. 5. Recall that we update the meta-model parameters with Θ(k) = Θ(k 1) η2 Θ Tk,i Ωk L(Dq k,i; Θ(J) k,i ) = Θ(k 1) η2 L(Dq k,i; Θ(J) k,i ) where Θ(J) k,i is the task-specific parameter for Tk,i after the inner-loop optimization for J steps. Analogously, for the notation brevity and the sake of analysis, we denote f(Θ(K 1)[ΩK]; θ(k 1)) = f1(χq k; θ(k 1) 1 ) + f2 [ θf1(χs k); θf1(χq k)]; θ(k 1) 2 where we have χq k := Θ(K 1)[ΩK] being the meta-parameters adapted to batch of tasks ΩK, and the batch-specific parameter is defined as χs k := 1 |ΩK| P Tk,bi ΩK[Θ(J) k,bi ]. Then, the regret under the batch setting can be denoted by R(K) = Rbatch(K) = ET P(T ),x DT L x; I(T , Θ(K)) ET P(T ),x DT L x; I(T , Θ(K), ) = ET P(T ),x DT L x; I(T , Θ(K 1)[ΩK]) ET P(T ),x DT L x; I(T , Θ(K 1), [Ω K]) = h(Θ(K 1), [Ω K]) h(Θ(K 1)[ΩK]) = h(Θ(K 1), [Ω K]) f(Θ(K 1), [Ω K]; θ (K 1)) + f(Θ(K 1), [Ω K]; θ (K 1)) f(Θ(K 1)[ΩK]; θ(K 1)) + f(Θ(K 1)[ΩK]; θ(K 1)) h(Θ(K 1)[ΩK]), and after applying properties of the arm pulling mechanism, it is equivalent to R(K) h(Θ(K 1), [Ω K]) f(Θ(K 1), [Ω K]; θ (K 1)) + f(Θ(K 1), [Ω K]; θ (K 1)) ˆf(Θ(K 1), [Ω K]; θ (K 1)) + ˆf(Θ(K 1), [Ω K]; θ (K 1)) ˆf(Θ(K 1)[Ω K]; θ(K 1)) + ˆf(Θ(K 1)[ΩK]; θ(K 1)) f(Θ(K 1)[ΩK]; θ(K 1)) + f(Θ(K 1)[ΩK]; θ(K 1)) h(Θ(K 1)[ΩK]) |h(Θ(K 1), [Ω K]) f(Θ(K 1), [Ω K]; θ (K 1))| | {z } I5 + |f(Θ(K 1), [Ω K]; θ (K 1)) ˆf(Θ(K 1), [Ω K]; θ (K 1))| | {z } I6 + | ˆf(Θ(K 1), [Ω K]; θ (K 1)) ˆf(Θ(K 1)[Ω K]; θ(K 1))| | {z } I7 + | ˆf(Θ(K 1)[ΩK]; θ(K 1)) f(Θ(K 1)[ΩK]; θ(K 1))| | {z } I8 + |f(Θ(K 1)[ΩK]; θ(K 1)) h(Θ(K 1)[ΩK])| | {z } I9 where the average value of estimated benefit scores for individual tasks TK,i ΩK is represented as bf(Θ(K 1)[ΩK]) = 1 |ΩK| P TK,i ΩK f(Θ(K 1)[TK,i]) = 1 |ΩK| P TK,i ΩK f Θ(k 1) η2 ΘL(Dq K,i; Θ(J) K,i); θ(K 1) , and the inequality is due to the pulling mechanism of BASS. Here, I5, I9 individually correspond to I1, I2 in the single-task setting and can be bounded by Lemma D.3, Corollary D.4. Term I7 can be upper bounded by I3 + I4 from the single-task setting above. Then, for the rest terms I6, I8, we proceed to bound them separately. D.2.1 Bounding error terms and assembling the regret bound We begin with the term I8, and then proceed to I6. 
For the chosen batch of tasks ΩK in the round K, we will have f1(Θ(K 1)[ΩK]) = f1 Θ(K 1) η2 Θ 1 |ΩK| P TK,i ΩK L(Dq K,i; Θ(J) K,i) ; θ(K 1) 1 , In this case, the average value of estimation sampling probabilities for tasks TK,i ΩK is bf(Θ(K 1)[ΩK]) = bf1(Θ(K 1)[ΩK]) + bf2(Θ(K 1)[ΩK]) f1(Θ(K 1)[TK,i]; θ(K 1) 1 ) + f2 [ θf1(χs K,i); θf1(χq K,i)]; θ(K 1) 2 [Bounding the f1 output difference] Next, let us first proceed to bound the output difference with respect to the exploitation module f1, where we can transform this term into f1(Θ(K 1)[ΩK]) bf1(Θ(K 1)[ΩK]) Θ(k 1) 1 η2 Θ 1 |ΩK| TK,i ΩK L(Dq K,i; Θ(J) K,i) ; θ(K 1) 1 Θ(k 1) η2 ΘL(Dq K,j; Θ(J) K,j); θ(K 1) 1 f1 Θ(k 1) η2 Θ 1 |ΩK| TK,i ΩK L(Dq K,i; Θ(J) K,i) ; θ(K 1) 1 f1 Θ(k 1) η2 ΘL(Dq K,j; Θ(J) K,j); θ(K 1) 1 . Then, applying the Lipschitz continuity property will lead to f1(Θ(K 1)[ΩK]) bf1(Θ(K 1)[ΩK]) TK,i ΩK Θ 1 |ΩK| TK,j ΩK L(Dq K,j; Θ(J) K,j) ΘL(Dq K,i; Θ(J) K,i) 2. Here, by the definition of the outer-loop optimization of first-order meta-learning, we will have an alternative form the inequality, denoted by f1(Θ(K 1)[ΩK]) bf1(Θ(K 1)[ΩK]) TK,i ΩK 1 |ΩK| TK,j ΩK Θ L(Dq K,j; Θ(J) K,j) ΘL(Dq K,i; Θ(J) K,i) 2 For the term in the parentheses on the RHS, substituting the backward operation with the H( , ) mapping function, we have X TK,i ΩK 1 |ΩK| TK,j ΩK Θ L(Dq K,j; Θ(J) K,j) ΘL(Dq K,i; Θ(J) K,i) 2 TK,i ΩK 1 |ΩK| TK,j ΩK Θ L(Dq K,j; Θ(J) K,j) 1 |ΩK| TK,j ΩK ΘL(Dq K,i; Θ(J) K,i) 2 TK,j ΩK ΘL(Dq K,j; Θ(J) K,j) ΘL(Dq K,i; Θ(J) K,i) 2 TK,j ΩK H(TK,i, Θ(K 1)) H(TK,j, Θ(K 1)) 2 |ΩK| S1(K). with |ΩK| = B. Therefore, the f1 part of error term I8 could be bounded by f1(Θ(K 1)[ΩK]) bf1(Θ(K 1)[ΩK]) η2 ξL B S1(K). where the upper bound S1(K) O( m FLF) can be found in the procedure bounding term I4.1. [Bounding the f2 output difference] Then, with χK = Θ(K 1)[ΩK], we proceed to bound the output difference with respect to the exploration module, which is represented by f2(Θ(K 1)[ΩK]) bf2(Θ(K 1)[ΩK]) = f2 [ θf1(χs K); θf1(χq K)]; θ(K 1) 2 1 |ΩK| X TK,j ΩK f2 [ θf1(χs K,j); θf1(χq K,j)]; θ(K 1) 2 f2 [ θf1(χs K); θf1(χq K)]; θ(K 1) 2 f2 [ θf1(χs K,j); θf1(χq K,j)]; θ(K 1) 2 By adopting the Lipschitz continuity property of f2, we will have f2(Θ(K 1)[ΩK]) bf2(Θ(K 1)[ΩK]) [ θf1(χs K); θf1(χq K)] [ θf1(χs K,j); θf1(χq K,j)] 2 θf1(χs K) θf1(χs K,j) 2 + θf1(χq K) θf1(χq K,j) 2 ξ2 L |ΩK| X χs K χs K,j 2 + χq K χq K,j 2 = ξ2 L η2 |ΩK| X TK,j ΩK Θ 1 |ΩK| TK,i ΩK L(Dq K,i; Θ(J) K,i) ΘL(Dq K,j; Θ(J) K,j) 2 + ξ2 L |ΩK| X χs K χs K,j 2 η2 ξ2 LB S1(K) + ξ2 L η1 |ΩK| X TK,j ΩK 1 |ΩK| TK,i ΩK Θ(J) K,i 1 |ΩK| TK,i ΩK Θ(J) K,j 2 where the last inequality is by applying the conclusion when bounding the output difference w.r.t. the exploitation module f1. Then, for the second term on the RHS, X TK,j ΩK 1 |ΩK| TK,i ΩK Θ(J) K,i 1 |ΩK| TK,i ΩK Θ(J) K,j 2 1 |ΩK| TK,i ΩK Θ(J) K,i Θ(J) K,j 2 TK,i ΩK ΘK 1 X τ [τ] ΘL(Ds K,i; Θ(τ) K,i) ΘK 1 X τ [J] ΘL(Ds K,j; Θ(τ) K,j) 2 τ [J] ΘL(Ds K,i; Θ(τ) K,i) X τ [J] ΘL(Ds K,j; Θ(τ) K,j) 2 where the last inequality is due to Lemma D.15. Summing up all the results above will give us the upper bound for f2 output difference f2(Θ(K 1)[ΩK]) bf2(Θ(K 1)[ΩK]) O(η2 ξ2 LB m FLF + η1 BJ m FLF). 
[Similar procedure for term I6] Analogously, we can apply the same derivation for the error term I6, which leads to I6 = f(Θ(K 1), [Ω K]; θ (K 1)) ˆf(Θ(K 1), [Ω K]; θ (K 1)) Θ(k 1), η2 Θ 1 |Ω K| Ti Ω K L(Dq i ; Θ(J) i ) ; θ (K 1) 1 Θ(k 1), η2 ΘL(Dq i ; Θ(J) i ); θ (K 1) 1 Following a similar procedure as that of term I8 will give us a similar bound as I6 = f(Θ(K 1), [Ω K]; θ (K 1)) ˆf(Θ(K 1), [Ω K]; θ (K 1)) O(η2 ξ2 LB p m FLF + BJ p where the learning rate η1, η2 O( 1 m F ) is a small value. Then, the upper bounds for error terms I6, I8 are given as desired. [Assembling the results] Then, combining all the results, we would have 2 + (1 + 2γ1) δ ) + O(ξ2 LKBJ LF m F ) + γm γ1 = 2 + O K3L ρ m log m + O L2K4 ρ4/3m1/6 log11/6(m) γm = 1 + O(KL3 log5/6(m) ρ1/3m1/6 ) O( K3L ρ m log(m)) + O K4L2 log11/6(m) + ξLKL4 log5/6(m) Similarly, with proper networks width m as in Theorem 5.2, the majority of the terms above can be cancelled to O(1). With increasingly large network width m under the over-parameterization settings, γ1, γm will also become diminutive. D.3 Performance Guarantee for the Exploitation and Exploration Modules In this subsection, we would like to give the performance guarantee for the proposed BASS framework, and the corresponding performance bound can be applied to derive an upper bound for the error terms I1, I2 for the single-task settings and I5, I9 under the batch settings. Up to meta-training iteration k [K] (before updating the meta-parameters and BASS), we denote all the past records received as Pk 1. Before presenting the main lemmas, we first introduce the following operator. Inspired by [3], with two arbitrary vectors χ, χ such that χ 2 1, χ 2 = 1, we have the following operator ϕ( χ, χ) = ( χ 2 , c) (11) as the concatenation of the two vectors χ 2 and one constant c, where c = q 2. And this operator transforms the transformed vector into unit norm, ϕ( χ, χ) 2 = 1. The idea of this operator is to make the gradients θf1( ; θ1) of the exploitation model, which is the input of the exploration model f2( ; θ2), comply with the normalization requirement and the separateness assumption (Assumption 5.1). For the sake of analysis, we will adopt this operation in the following proof. Note that this operator is just one possible solution, and our results could be easily generalized to other forms of input gradients under the unit-length and separateness assumption. Similar ideas are also applied in previous works [9]. We begin to bound the single-task settings with the following lemma. Lemma D.1. For the constants c g > 0, 0 < ρ O( 1 L) and ξ1 (0, 1), given the past records Pk 1, we suppose m, η1, η2 satisfy the conditions in Theorem 5.2, and randomly draw the parameter {θ(k) 1 , θ(k) 2 } {eθ (τ) 1 , eθ (τ) 2 }τ [k]. Consider the past records Pk 1 up to round k are generated by a fixed policy when witness the candidate arms {Ω(τ) task}τ [k]. Then, with probability at least 1 δ given an arm-reward pair (Tk,bi, rk,bi), we have ETk,i P(T ) ϕ( [ θf1(χs k,bi); θf1(χq k,bi)] c g L , χq k,bi); θ(k 1) 2 rτ f1(χq k,bi; θ(k 1) 1 ) | Ω(k) task, Pk 1 2 + (1 + 2γ1) γ1 = 2 + O k3L ρ m log m + O L2k4 ρ4/3m1/6 log11/6(m) . Proof. The proof of this lemma is inspired by Lemma C.1 from [9]. 
First, we can derive the output upper bound f2 ϕ( [ θf1(χs k,bi); θf1(χq k,bi)] c g L , χq k,bi); θ(k 1) 2 rk f1(χq k,bi; θ(k 1) 1 ) ϕ( [ θf1(χs k,bi); θf1(χq k,bi)] c g L , χq k,bi); θ(k 1) 2 + f1(χq k,bi; θ(k 1) 1 ) + 1 by triangle inequality and applying the generalization result of FC networks (Lemma D.5) on f1( ; θ1), f2( ; θ2). For the brevity of notation, we use f1(Tk,bi) to denote ϕ( [ θf1(χs k,bi); θf1(χq k,bi)] c g L , χq k,bi) and apply (χk, rk) as (χq k,bi, rk,bi) for the following proof. Define the difference sequence as V (1) τ = E f2 f1(Tτ,bi); θ(τ 1) 2 rτ f1(χτ; θ(τ 1) 1 ) f1(Tτ,bi); θ(τ 1) 2 rτ f1(χτ; θ(τ 1) 1 ) . Since the past rewards and the received arm-reward pairs (χτ, rτ) are generated by the same reward mapping function, we have the expectation E[V (1) τ Fτ] =E f2 f1(Tτ,bi); θ(τ 1) 2 rτ f1(χτ; θ(τ 1) 1 ) f1(Tτ,bi); θ(τ 1) 2 rτ f1(χτ; θ(τ 1) 1 ) Fτ where Fτ denotes the filtration given the past records Pτ, up to round τ [k]. This also gives the fact that V (1) τ is a martingale difference sequence. Then, after applying the martingale difference sequence over [k], we have τ [k] V (1) τ =1 f1(Tτ,bi); θ(τ 1) 2 rτ f1(χτ; θ(τ 1) 1 ) f1(Tτ,bi); θ(τ 1) 2 rτ f1(χτ; θ(τ 1) 1 ) . Then, by applying the Azuma-Hoeffding inequality, it leads to τ [k] V (1) τ 1 τ [k] E[V (1) τ ] (1 + 2γ1) Since the expectation of V (1) τ is zero, with the probability at least 1 δ and an existing set of parameters θ2 s.t. θ2 θ(0) 2 O k3 ρ m log m , the above inequality implies τ [k] V (1) τ (1 + 2γ1) ETk,i P(T )E{θ(k 1) 1 ,θ(k 1) 2 } f1(Tτ,bi); θ(k 1) 2 rτ f1(χτ; θ(k 1) 1 ) f1(Tτ,bi); θ(τ 1) 2 rk f1(χτ; θ(τ 1) 1 ) f1(Tτ,bi); θ(τ 1) 2 rτ f1(χτ; θ(τ 1) 1 ) + (1 + 2γ1) f1(Tτ,bi); θ2 rτ f1(χτ; θ(τ 1) 1 ) + 3L 2k + (1 + 2γ1) f1(Tτ,bi); θ2 rτ f1(χτ; θ(τ 1) 1 ) 2k + (1 + 2γ1) 2k + (1 + 2γ1) where the first equality is due to the sampling of candidate tasks and the model parameters. Here, the upper bound (i) is derived by applying the conclusions of Lemma D.6 and Lemma D.10, and the inequality (ii) is derived by adopting Lemma D.6 while defining the empirical loss to be f1(Tτ,bi); θ2 rτ f1(χτ; θ(τ 1) 1 ) 2 ξ1. Finally, applying the union bound would give the aforementioned results. Here, analogous to the trained parameters, we consider the shadow parameters as {θ(k), 1 , θ(k), 2 } {eθ (τ), 1 , eθ (τ), 2 }τ [k]. Similarly, each pair {eθ (τ), 1 , eθ (τ), 2 } is separately trained on past received rewards of the optimal arm(s) {rτ ,i }τ [τ],Tτ ,i Ω k and past exploration scores of the optimal arm(s) {eτ ,i }τ [τ],Tτ ,i Ω k with Jθ-iteration GD, starting from the random initialization {θ(0) 1 , θ(0) 2 }. Corollary D.2. For the constants 0 < ρ O(1/L) and ξ1 (0, 1), given the past records Pk 1, we suppose m, η1, J satisfy the conditions in Theorem 5.2, and randomly draw the parameters {θ(k), 1 , θ(k), 2 } {eθ (τ), 1 , eθ (τ), 2 }τ [k]. For the optimal arm Tk,i Ωk task, consider its union set with the the collection of past optimal arms P k 1 {Tk,i , rk,i } are generated by a fixed policy when witness the candidate arms {Ω(τ) task}τ [k], with P k 1 being the collection chosen by this policy. Then, with probability at least 1 δ, we have ETk,i P(T ) ϕ([ θf1(χs, k ); θf1(χq, k )] c g L , χq, k ); θ(k 1), 2 rτ f1(χq, k ; θ(k 1), 1 ) | Ω(k) task, P k 1 2 + (1 + γ1) where rτ,i is the corresponding reward generated by the mapping function given an arm χτ,i , and Γk = 1 + O(k L3 log5/6(m) ρ1/3m1/6 ) O( k4L ρ m log(m)) + O k5L2 log11/6(m) Proof. 
This corollary is the direct application of Lemma D.1 by following a similar proof procedure. First, suppose the shadow models f1( ; θ2), f2( ; θ2) are trained on the alternative trajectory P k 1. Analogous to the proof of Lemma D.1, we can define the following martingale difference sequence with regard to the previous records P k 1 up to round τ [t] as V (1), τ = E f2 f1(Tτ,i ); θ(τ 1), 2 r τ f1(χ τ; θ(τ 1), 1 ) f1(Tτ,i ); θ(τ 1), 2 r τ f1(χ τ; θ(τ 1), 1 ) . Since the records in set P k 1 are sharing the same reward mapping function, we have the expectation E[V (1), τ F τ ] =E f2 f1(Tτ,i ); θ(τ 1), 2 r τ f1(χ τ; θ(τ 1), 1 ) f1(Tτ,i ); θ(τ 1), 2 r τ f1(χ τ; θ(τ 1), 1 ) F τ where F τ denotes the filtration given the past records P k 1. The mean value of V (1), τ across different time steps will be τ [k] V (1), τ =1 f1(Tτ,i ); θ(τ 1), 2 r τ f1(χ τ; θ(τ 1), 1 ) f1(Tτ,i ); θ(τ 1), 2 r τ f1(χ τ; θ(τ 1), 1 ) . with the expectation of zero. Afterwards, applying the Azuma-Hoeffding inequality, with a constant δ (0, 1), it leads to τ [k] V (1), τ 1 τ [k] E[V (1), τ ] (1 + 2γ1) To bound the output difference between the shadow model f1( ; θ(k 1), 1 ), f2( ; θ(k 1), 2 ) and the model we trained based on received records f1( ; θ(k 1) 1 ), f2( ; θ(k 1) 2 ), we apply the conclusion from Lemma D.11, which leads to that given arbitrary input vectors x, x , we have |f1(x; θ(k 1), 1 ) f1(x; θ(k 1) 1 )|, |f2(x ; θ(k 1), 2 ) f2(x ; θ(k 1) 2 )| 1 + O(k L3 log5/6(m) ρ1/3m1/6 ) O( k3L ρ m log(m)) + O k4L2 log11/6(m) Finally, combining all the results will finish the proof. We will also be able to have the performance guarantee under the batch settings. Recall that given a batch of chosen tasks Ωk Ω(k) task, k [K], we have the meta-parameters adapted to this batch of tasks being Θ(K 1)[ΩK], which we consider as the input for the f1( ; θ1) model, where the tasks within each collection are sampled from the task distribution. Thus, chosen task batches from different iterations are also independent from each other. Intuitively, we can also define the corresponding reward for arm batch Ωk as rk = h(Θ(k 1)[Ωk]). Then, we bound the batch settings with the following lemma and corollary. Lemma D.3. For the constants c g > 0, ρ (0, O( 1 L)) and ξ1 (0, 1), given the past records Pk 1, we suppose m, η1, η2 satisfy the conditions in Theorem 5.2, and randomly draw the parameter {θ(k) 1 , θ(k) 2 } {eθ (τ) 1 , eθ (τ) 2 }τ [k]. Consider the past records Pk 1 up to round k are generated by a fixed policy when witness the candidate arms {Ω(τ) task}τ [k]. Then, with probability at least 1 δ given the pair of chosen arm batch and the reward (Ωk, rk) in round k, we have ETk,i P(T ) ϕ([ θf1(χs k); θf1(χq k)] c g L , χq k); θ(k 1) 2 rτ f1(χq k; θ(k 1) 1 ) | Ω(k) task, Pk 1 2 + (1 + 2γ1) γ1 = 2 + O k3L ρ m log m + O L2k4 ρ4/3m1/6 log11/6(m) . Proof. The proof of this lemma is analogous to the proof of Lemma D.1. First, we can derive the output upper bound f2 ϕ([ θf1(χs k); θf1(χq k)] c g L , χq k); θ(k 1) 2 rk f1(χq k; θ(k 1) 1 ) ϕ([ θf1(χs k); θf1(χq k)] c g L , χq k); θ(k 1) 2 + f1(χq k; θ(k 1) 1 ) + 1 1 + 2γ1 by triangle inequality and applying the generalization result of FC networks (Lemma D.5) on f1( ; θ1), f2( ; θ2), where c g > 0 is a positive number to scale the concatenated gradient vector. For the brevity of notation, we use f1(Ωk) to denote ϕ( [ θf1(χs k); θf1(χq k)] c g L , χq k) and apply (χk, rk) as (χq k, rk) for the following proof. 
Define the difference sequence as V (2) τ = E f2 f1(Ωτ); θ(τ 1) 2 rτ f1(χτ; θ(τ 1) 1 ) f1(Ωτ); θ(τ 1) 2 rτ f1(χτ; θ(τ 1) 1 ) . Since the past rewards and the received arm batch-reward pairs (χτ, rτ) are generated by the same reward mapping function, we have the expectation E[V (2) τ Fτ] =E f2 f1(Ωτ); θ(τ 1) 2 rτ f1(χτ; θ(τ 1) 1 ) f1(Ωτ); θ(τ 1) 2 rτ f1(χτ; θ(τ 1) 1 ) Fτ where Fτ denotes the filtration given the past records Pτ, up to round τ [k]. This also gives the fact that V (2) τ is a martingale difference sequence. Then, after applying the martingale difference sequence over [k], we have τ [k] V (2) τ =1 f1(Ωτ); θ(τ 1) 2 rτ f1(χτ; θ(τ 1) 1 ) f1(Ωτ); θ(τ 1) 2 rτ f1(χτ; θ(τ 1) 1 ) . By the Azuma-Hoeffding inequality, it leads to P 1 k P τ [k] V (2) τ 1 τ [k] E[V (2) τ ] (1 + δ. As we have discussed, the tasks within each collection are sampled from the task distribution, which makes chosen task batches from different iterations Ωk, k [K] are also independent from each other. Since the expectation of V (2) τ is zero, with the probability at least 1 δ and an existing set of parameters θ2 s.t. θ2 θ(0) 2 O k3 ρ m log m , the above inequality implies τ [k] V (1) τ (1 + 2γ1) ETk,i P(T )E{θ(k 1) 1 ,θ(k 1) 2 } f1(Ω); θ(k 1) 2 r f1(χ; θ(k 1) 1 ) f1(Ωτ); θ(τ 1) 2 rk f1(χτ; θ(τ 1) 1 ) f1(Ωτ); θ(τ 1) 2 rτ f1(χτ; θ(τ 1) 1 ) + (1 + 2γ1) rτ f1(χτ; θ(τ 1) 1 ) + 3L 2k + (1 + 2γ1) rτ f1(χτ; θ(τ 1) 1 ) 2k + (1 + 2γ1) 2k + (1 + 2γ1) where the first equality is due to the sampling of candidate tasks and the model parameters. Here, the upper bound (i) is derived by applying the conclusions of Lemma D.6 and Lemma D.10, and the inequality (ii) is derived by adopting Lemma D.6 while defining the empirical loss to be rτ f1(χτ; θ(τ 1) 1 ) 2 ξ2. Finally, applying the union bound would give the aforementioned results. Analogously, we consider the shadow parameters as {θ(k), 1 , θ(k), 2 } {eθ (τ), 1 , eθ (τ), 2 }τ [k] where each pair {eθ (τ), 1 , eθ (τ), 2 } is separately trained on past received rewards of the optimal arm(s) {rτ ,i }τ [τ],Tτ ,i Ω k and past exploration scores of the optimal arm(s) {eτ ,i }τ [τ],Tτ ,i Ω k with Jθ-iteration GD starting from the random initialization {θ(0) 1 , θ(0) 2 }. Corollary D.4. For the constants ρ (0, O( 1 L)) and ξ1 (0, 1), given the past records Pk 1, we suppose m, η1, J satisfy the conditions in Theorem 5.2, and randomly draw the parameters {θ(k), 1 , θ(k), 2 } {eθ (τ), 1 , eθ (τ), 2 }τ [k]. For the optimal arm batch Ω k Ωk task, consider its union set with the the collection of past optimal arms P k 1 {Ω k, r k} are generated by a fixed policy when witness the candidate arms {Ω(τ) task}τ [k], with P k 1 being the collection chosen by this policy. Then, with probability at least 1 δ, we have ETk,i P(T ) ϕ([ θf1(χs, k ); θf1(χq, k )] c g L , χq, k ); θ(k 1), 2 rτ f1(χq, k ; θ(k 1), 1 ) | Ω(k) task, P k 1 2 + (1 + γ1) where rτ,i is the corresponding reward generated by the mapping function given an arm χτ,i , and Γk = 1 + O(k L3 log5/6(m) ρ1/3m1/6 ) O( k4L ρ m log(m)) + O k5L2 log11/6(m) This corollary is a directly application of Lemma D.3 and can be obtained with a similar proof as in Corollary D.2. D.4 Ancillary Lemmas Applying Pk 1 as the training data, we have the following properties for the over-parameterized FC network f( ; θ) after GD. Lemma D.5. For the constants ρ (0, O( 1 L)) and ξ1 (0, 1), given the past records Pk 1 up to time step k, we suppose m, η1, J1 satisfy the conditions in Theorem 5.2. 
Then, with probability at least 1 δ, given a sample-label pair (x, r), we have |f(x; θ(k))| γ1 = 2 + O k3L ρ m log m + O L2k4 ρ4/3m1/6 log11/6(m) . Proof. The LHS of the inequality could be written as |f(x; θ)| |f(x; θ) f(x; θ(0)) θ(0)f(x; θ(0)), θ θ(0) | + |f(x; θ(0)) + θ(0)f(x; θ(0)), θ θ(0) |. Here, we could bound the first term on the RHS with Lemma D.7. Applying Lemma D.8 on the second term, and recalling θ θ(0) 2 ω, would give |f(x; θ)| 2 + θ(0)f(x; θ(0)) 2 θ θ(0) 2+ m log(m)) θ θ(0) 2 2 + O(L) θ θ(0) 2 + O(L2p m log(m))( θ θ(0) 2) 4 3 . Then, applying the conclusion of Lemma D.6 would lead to |f(x; θ)| 2 + O(L) O k3 ρ m log m + O(L2p m log(m)) O( k3 ρ m log m) 4 = 2 + O k3L ρ m log m + O L2k4 ρ4/3m1/6 log11/6(m) = γ1. Lemma D.6 (Theorem 1 from [3]). For any 0 < ξ1 1, 0 < ρ O( 1 L). Given the past records Pk 1, suppose m, η1, J satisfy the conditions in Theorem 5.2, then with probability at least 1 δ, we could have 1. L(θ) ξ1 after J iterations of GD. 2. For any j [J], θ(j) θ(0) O k3 ρ m log m . In particular, Lemma D.6 above provides the convergence guarantee for f( ; θ) after certain rounds of GD training on the past records Pk 1. Lemma D.7 (Lemma 4.1 in [11]). Assume a constant ω such that O(m 3/2L 3/2[log(Tn L2/δ)]3/2) ω O(L 6[log m] 3/2) and n training samples. With randomly initialized θ(0), for parameters θ, θ satisfying θ θ(0) , θ θ(0) ω, we have |f(x; θ) f(x; θ ) θ f(x; θ ), θ θ | O(ω1/3L2p m log(m)) θ θ with the probability at least 1 δ. Lemma D.8. Assume m, η1, J satisfy the conditions in Theorem 5.2 and θ(0) is randomly initialized. Then, with probability at least 1 δ and given an arm x 2 = 1, we have 1. |f(x; θ(0))| 2, 2. θ(0)f(x; θ(0)) 2 O(L). Proof. The conclusion (1) is a direct application of Lemma 7.1 in [3]. Suppose the parameters of the L-layer FC network are θ = {θ1, . . . , θL}. For conclusion (2), applying Lemma 7.3 in [3], for each layer θl {θ1, . . . , θL}, we have θlf(x; θ(0)) 2 = (θLDL 1 Dl+1θl+1) (Dl+1θl+1 D1θ1) x 2 = O( L). where D is the diagonal matrix corresponding to the activation function. Then, we could have the conclusion that θ(0)f(x; θ(0)) 2 = s X l [L] θlf(x; θ(0)) 2 2 = O(L). Lemma D.9 (Theorem 5 in [3]). Assume m, η1, J satisfy the conditions in Theorem 5.2 and θ(0) being randomly initialized. Then, with probability at least 1 δ, and for all parameter θ such that θ θ(0) 2 ω, we have θf(x; θ) θ(0)f(x; θ(0)) 2 O(ω1/3L3p log(m)) Lemma D.10. Assume m, η1 satisfy the condition in Theorem 5.2. For notation brevity, suppose the training sample-label pairs are {xτ, rτ}τ [k]. With the probability at least 1 δ, we have τ [k] |f(xτ; θ(τ)) rτ| X τ [k] |f(xτ; θ(k)) rτ| + 3L Proof. With the notation from Lemma 4.3 in [11], set R = k3 log(m) δ , ν = R2, and ϵ = LR 2νk. Then, considering the loss function to be L(θ) := P τ [k]|f(xτ; θ) rτ| would complete the proof. Lemma D.11. Consider a randomly initialized L-layer Re LU fully-connected network f( ; θ0). For any 0 < ξ2 1, 0 < ρ O( 1 L). Let there be two sets of training samples Pk, P k with the unit-length and the ρ-separateness assumption, and let θ be the trained parameter on Pk while θ is the trained parameter on P k. Suppose the conditions in Theorem 5.2 are satisfied. Then, with probability at least 1 δ, we have |f(x; θ) f(x; θ )| 1 + O(k L3 log5/6(m) ρ1/3m1/6 ) O( k3L ρ m log(m)) + O k4L2 log11/6(m) when given a new sample x Rd. Proof. 
First, based on the conclusion from Theorem 1 from [3] and regarding the t samples, the trained the parameters satisfy θ θ0 2, θ θ0 2 O( k3 ρ m log(m)) = ω where θ0 is the randomly initialized parameter. Then, we could have θf(x; θ) 2 θ0f(x; θ0) 2 + θf(x; θ) θ0f(x; θ0) 2 1 + O(k L3 log5/6(m) ρ1/3m1/6 ) O(L) w.r.t. the conclusion from Theorem 1 and Theorem 5 of [3]. Then, regarding the Lemma 4.1 from [11], we would have |f(x; θ) f(x; θ ) θ f(x; θ ), θ θ | O(ω1/3L2p m log(m)) θ θ 2. Therefore, the our target could be reformed as |f(x; θ) f(x; θ )| θ f(x; θ ) 2 θ θ 2 + O(ω1/3L2p m log(m)) θ θ 2 1 + O(k L3 log5/6(m) ρ1/3m1/6 ) O(L) ω + O(ω4/3L2p Substituting the ω with its value would complete the proof. Corollary D.12. Following a similar settings as in Lemma D.11, consider a randomly initialized L-layer fully-connected network f( ; θ0) with Sigmoid activation. For any 0 < ξ2 1, 0 < ρ O( 1 L). Let there be two sets of training samples Pk, P k with the unit-length and the ρ-separateness assumption, and let θ be the trained parameter on Pk while θ is the trained parameter on P k. Suppose the conditions in Theorem 5.2 are satisfied. Then, with probability at least 1 δ, we have |f(x; θ) f(x; θ )| 1 + O(k L3 log5/6(m) ρ1/3m1/6 ) O( k3L ρ m log(m)) + O k4L2 log11/6(m) when given a new sample x Rd. Proof. This corollary is an intuitive extension of Lemma D.11. Since the result from Theorem 1 of [3] also applies to Lipschitz-smooth (i.e., Sigmoid) activation functions, combining the proof of Lemma D.11 and the result from Lemma 7 in [47] will give the conclusion. D.5 Regret Bound for Uniform Sampling Lemma D.13 (Regret Bound for the Uniform Sampling Approach). When applying the uniform sampling as in most meta-learning frameworks, we denote the corresponding sampled task series as Ωu(K). We will have Ru(K) = ET P(T ),x DT L(x; I(T , Θ(K) u )) L(x; I(T , Θ(K), )) . where Θ(K) u refer to the meta-parameters trained with uniform sampling. With Θ(K) u Θ(K), 2 ω, we have the regret bound for the uniform sampling as Ru(K) = ET P(T ),x DT L(x; I(T , Θ(K) u )) L(x; I(T , Θ(K), )) m FLF ω + O(ω4/3L3 F p m F log(m F)) + O( min O KLF + K4/3L11/3 F p Proof. Here, for the simplicity of notation, we denote Θ = I(T , Θ), and neglect the expectation terms. Note that the difference between adapted meta-parameters and the original meta-parameters is small enough and can be well-bounded. We will then have Ru(K) = ET P(T ),x DT L(x; I(T , Θ(K) u )) L(x; I(T , Θ(K), )) = e L( eΘ (K) u ) e L( eΘ (K), ) where the two sets of meta-parameters are trained with uniformly sampled tasks and the optimal tasks, and eΘ is used to denote the adapted meta-parameters I(T , Θ) for simplicity. With any convex loss function (e.g., L2 loss or cross-entropy loss) under the over-parameterization settings, we will have the generalization loss being almost convex w.r.t. the meta-parameters as in Lemma D.14, which leads to e L( eΘ (K) u ) e L( eΘ (K), ) e Θ e L( eΘ (K) u ), eΘ (K) u eΘ (K), + ϵ e Θ e L( eΘ (K) u ) 2 eΘ (K) u eΘ (K), 2 + ϵ e Θ e L( eΘ (K) u ) 2 Θ(K) u Θ(K), 2 + η1 O( p m FLF ω + O(ω4/3L3 F p m F log(m F)) + O( (ii) O KLF + K4/3L11/3 F p (iii) = e L( eΘ (K) u ) e L( eΘ (K), ) min O KLF + K4/3L11/3 F p where ϵ = O(ω4/3L3 F p m F log(m F)) > 0, and Θ(K), u Θ(K) u 2 ω. Here, the first inequality is due to Lemma D.14 and the convexity of the loss function. The third inequality is due to the upper bound for meta-model gradients (Lemma D.15). The (i) is due to Lemma D.16 and sufficiently small learning rate η1 O( 1 m F ). 
Based on Lemma D.15, we will have ΘL(x; Θ(J), K ) 2, ΘL(x; Θ(J) K ) 2 O( m FLF). Since we have η1, η2 O( 1 m), starting from randomly initialized Θ(0), the parameter shift caused by GD can be upper bounded by Θ(K), u Θ(K) u 2 ω = O(K q LF m F ). The implication (iii) is because the loss function L( ; ) has the value range [0, 1]. Here, we notice that the RHS of the regret bound in Lemma D.13 has two terms. Although the second term can be reduced to O(1) with sufficiently large meta-model width m F > O(Poly(K, L, ρ 1)), the first term tends to grow along with more iterations K and the larger meta-model width m F. The reason is that the radius for the parameter shift during meta-training ω can be as large as O( 1 m F ), which means that it cannot cancel out the effects of gradient norms, which have the order of O( m F). In this case, we will not able to include a m F term to the denominator to scale down the regret with m F, and make the upper bound narrower than 1. Lemma D.14. Given an arbitrary sample x and its label, let e L(Θ) = L(x; Θ). Suppose m F, η1, η2 satisfy the conditions in Theorem 5.2. With probability at least 1 O(KL2 F) exp[ Ω(m Fω2/3LF)] over randomness of Θ(0), for all k [K], and Θ, Θ satisfying Θ Θ(0) 2 ω and Θ Θ(0) 2 ω with ω O(L 6 F [log m F] 3/2), it holds uniformly that e L(Θ) e L(Θ ) Θ e L(Θ), Θ Θ + ϵ. with ϵ = O(ω4/3L3 F log m F)) being a small constant. proof. This proof follows an analogous approach as the proof of Lemma 4.2 in [11]. Let F e L(Θ ) be the derivative of e L with respective to F(x; Θ). Then, it holds that | F e L(Θ )| O(1) based on Lemma D.15. Then, by convexity of e L, we have e L(Θ ) e L(Θ) (i) F e L(Θ) (F(x; Θ ) F(x; Θ)) (ii) F e L(Θ ) ΘF(x; Θ), Θ Θ | F e L(Θ )| |F(x; Θ ) F(x; Θ) F(x; Θ), Θ Θ | Θ e L(Θ), Θ Θ | F e L(Θ )| |F(x; Θ ) F(x; Θ) F(x; Θ), Θ Θ | (iii) Θ e L(Θ), Θ Θ O(ω4/3L3 F p m F log(m F)) Θ e L(Θ), Θ Θ ϵ where (i) is due to the convexity of the loss function L, (ii) is an application of triangle inequality, and (iii) is the application of and Lemma D.16. Finally, denoting ϵ = O(ω4/3L3 F m F log m F) will complete the proof. Lemma D.15. Suppose m F, η1, η2 satisfy the conditions in Theorem 5.2. With probability at least 1 O(KLF) exp( Ω(m Fω2/3LF)) over the random initialization, Θ satisfying Θ Θ(0) 2 ω with ω O(L 9/2 F [log m F] 3), it holds uniformly that ΘF(x; Θ) 2 O( p ΘL(x; Θ) 2 O( p Proof. This lemma is a direct application of Lemma 9 of [47] and Lemma B.2, B.3 of [11]. Lemma D.16. Suppose m F, η1, η2 satisfy the conditions in Theorem 5.2. With probability at least 1 O(KLF) exp( Ω(m Fω2/3LF)), for all t [T], i [k], Θ, Θ (or Θ, Θ ) satisfying Θ Θ(0) 2, Θ Θ(0) 2 ω with ω O(L 9/2 F [log m F] 3), it holds uniformly that |F(x; Θ) F(x; Θ ) Θ F(x; Θ ), Θ Θ | O(w1/3L2 F p m F log(m F)) Θ Θ 2. Proof. The proof for this lemma directly follows the proof of Lemma 4.1 in [11] and Lemma 7 in [47].