Meta-Learning Hypothesis Spaces for Sequential Decision-making

Parnian Kassraie 1, Jonas Rothfuss 1, Andreas Krause 1

Abstract

Obtaining reliable, adaptive confidence sets for prediction functions (hypotheses) is a central challenge in sequential decision-making tasks, such as bandits and model-based reinforcement learning. These confidence sets typically rely on prior assumptions on the hypothesis space, e.g., the known kernel of a Reproducing Kernel Hilbert Space (RKHS). Hand-designing such kernels is error-prone, and misspecification may lead to poor or unsafe performance. In this work, we propose to meta-learn a kernel from offline data (META-KEL). For the case where the unknown kernel is a combination of known base kernels, we develop an estimator based on structured sparsity. Under mild conditions, we guarantee that our estimated RKHS yields valid confidence sets that, with increasing amounts of offline data, become as tight as those given the true unknown kernel. We demonstrate our approach on the kernelized bandit problem (a.k.a. Bayesian optimization), where we establish regret bounds competitive with those given the true kernel. We also empirically evaluate the effectiveness of our approach on a Bayesian optimization task.

1 ETH Zurich, Switzerland. Correspondence to: Parnian Kassraie.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

[Figure 1. Overview of the described framework, with $k^*$ as the true kernel function and $\hat{k}$ as the solution to META-KEL. Offline meta-learning produces $\hat{k}$, which the agent's algorithm uses in the sequential decision-making task in place of oracle knowledge of $k^*$.]

1. Introduction

A number of well-studied machine learning problems such as bandits, Bayesian optimization (BO) and model-based reinforcement learning are characterized by an agent that sequentially interacts with an unknown, responsive system. Throughout the interaction, the agent's goal is to maximize the cumulative reward based on an unknown underlying function $f$. Common to such sequential decision-making problems is an exploration-exploitation trade-off: the agent needs to optimize its reward while, at the same time, learning more about the unknown function $f$. Confidence sets capture and quantify the uncertainty of the learner about $f$. Thus, they are an integral tool for directing exploration towards areas of high uncertainty and balancing it against exploitation. Moreover, in safety-critical applications, confidence sets are used to reason about the safety of actions. Thus, they are central to the efficiency and safety of exploration. In theoretical analyses of sequential decision-making algorithms, a common assumption is that $f$ resides in an RKHS with a known kernel function. This assumption allows for the construction of the confidence sets. In practice, however, the true kernel is unknown and needs to be hand-crafted based on the problem instance. This is a delicate task, since the hand-crafted hypothesis space has to contain the unknown target function $f$. If this is not the case, the learner may be over-confident and converge to a sub-optimal policy, or incorrectly classify actions as safe. At the same time, we want the chosen hypothesis space to be as small as possible, so that the variance of the associated learner is low and the agent converges quickly. This constitutes a dilemma, where we need to trade off efficiency with a potential loss in consistency.
We approach this dilemma in a data-driven manner. Many applications of sequential decision-making, such as hyperparameter tuning with BO or online nonlinear control, are of a repetitive nature: often, data is available from similar, but not identical, tasks which have been solved before. Therefore, we propose to meta-learn the kernel function, and thus the RKHS, from offline meta-data. Our method, Meta-Kernel Learning (META-KEL), works with a generic (i.e., not necessarily i.i.d.) data model and may be applied to a variety of sequential decision-making tasks.

We formally analyze the problem when the true kernel is a combination of known base kernels. We prove that the solution to META-KEL corresponds to an RKHS which contains the true function space (Theorem 4.3). Further, the meta-learned kernel has a sparse structure (Proposition 4.4), which reduces the variance of the resulting learner and makes it more efficient at solving the downstream sequential decision-making problem. Under mild assumptions on the data, we show that the meta-learned kernel yields anytime valid confidence bands matching the ones given oracle knowledge of the true kernel, as more samples of the meta-data are provided (Theorem 5.1). These provably reliable confidence estimates constitute the key contribution of our work and distinguish META-KEL from prior attempts. To demonstrate how META-KEL can be applied to a sequential task, we analyze a Bayesian optimization algorithm when it uses the meta-learned kernel and compare it to the same algorithm when it has knowledge of the true kernel, i.e., the oracle algorithm. As the size of the meta-learning data increases, the regret bound we obtain approaches the rate of the oracle (Corollary 5.2).

Contributions. Our main contributions are:
- We introduce META-KEL, a method for meta-learning the hypothesis space of a sequential decision task, which yields provably valid adaptive confidence sets.
- Our meta-learnt confidence bounds converge to the ones estimated by the oracle at a $O(1/\sqrt{mn})$ rate. Here, $m$ is the number of tasks in the meta-data and $n$ is the number of samples per task.
- Applied to BO, our results imply a sublinear regret guarantee for the GP-UCB algorithm using our meta-learned kernel. This bound approaches that of the oracle algorithm as the amount of meta-data increases.

2. Related Work

Numerous sequential decision-making methods rely on confidence sets for uncertainty quantification, e.g., UCB algorithms (Srinivas et al., 2010; Chowdhury & Gopalan, 2017) for Bayesian optimization and bandits, safe exploration, and various forms of RL (Berkenkamp et al., 2017; Curi et al., 2020; Kakade et al., 2020; Sessa et al., 2020). Most of these methods assume the true hypothesis space as given. However, in practice, we typically do not know the correct hypothesis space, e.g., in the form of a kernel. A body of recent work considers the unknown kernel setting and analyzes the effect of working with a misspecified hypothesis space (Wynne et al., 2021; Simchowitz et al., 2021; Bogunovic & Krause, 2021). Alternatively, Wang & de Freitas (2014) and Berkenkamp et al. (2019) propose to successively expand the hypothesis space throughout the course of BO, so that the algorithm remains no-regret in a setting where the kernel lengthscale is unknown. Our work relates to meta-learning for Bayesian optimization.
There is a recent line of algorithms that improve the accuracy of base sequential learners via meta-learning, albeit without theoretical guarantees (Rothfuss et al., 2021a;b), or with mild guarantees in special cases (Kveton et al., 2020; Boutilier et al., 2020). There are a number of results on updating bandit priors or policies by meta-learning, under problem settings different from ours. Basu et al. (2021) work with a sequence of multi-armed bandit tasks, and adaptively meta-learn the mean of the Gaussian prior used for the next task. Others consider solving a number of structurally similar linear bandit tasks in parallel (Wang et al., 2017; Cella et al., 2020; Cella & Pontil, 2021) and propose how to efficiently update the policy when each learner has access to the data across all tasks. We give stronger guarantees in a less restrictive setting, compared to Wang et al. (2018b), who analyze the simple regret of the GP-UCB algorithm (Srinivas et al., 2010) for multi-armed and linear bandits when the mean and variance of the Gaussian prior are unknown and there is sufficient offline i.i.d. data drawn from the same Gaussian distribution. Our framework considers structural sparsity at the kernel level, which translates to group sparsity of the coefficient vectors when applied to linear bandits. Thus our work relates to results on sparse linear bandits and Lasso bandits. In this area, Bastani & Bayati (2020) and Hao et al. (2020) give dimension-independent regret bounds for Explore-Then-Commit algorithms under certain assumptions on the action set. This line of work does not consider offline data. We draw inspiration from the early Multiple Kernel Learning (MKL) literature, which focuses on kernel design for classification with SVMs (Bach et al., 2004; Gönen & Alpaydın, 2011; Kloft et al., 2011; Evgeniou & Pontil, 2004; Cristianini et al., 2006; Ong et al., 2005). In contrast, our key contribution is to derive adaptive confidence bounds from meta-learnt kernels for regression, even for non-i.i.d. data. Orthogonal to most prior works, we reduce the kernel learning problem to the group Lasso and leverage the properties of the Lasso estimator, in particular the seminal results of Lounici et al. (2011) and Bach (2008). Other relevant works on convergence properties of the group Lasso include Koltchinskii & Yuan (2008), Liu & Zhang (2009) and Bunea et al. (2013).

3. Problem Statement

Consider a sequential decision-making problem, where the agent repeatedly interacts with an environment and makes observations

$$y_t = f(x_t) + \varepsilon_t \qquad (1)$$

of an unknown function $f : \mathcal{X} \to \mathbb{R}$ residing in an RKHS $\mathcal{H}_{k^*}$ that corresponds to an unknown kernel function $k^*$. We further assume that the function has a bounded kernel norm $\|f\|_{k^*} \le B$ and that the domain $\mathcal{X} \subset \mathbb{R}^{d_0}$ is compact. The observation noise terms $\varepsilon_t$ are i.i.d. samples from a zero-mean sub-Gaussian distribution with variance proxy $\sigma^2$. At every step $t$, the chosen input $x_t$ only depends on the history up to step $t$, denoted by the random sequence $H_{t-1} = \{(x_\tau, y_\tau) : 1 \le \tau \le t-1\}$. No further assumptions are made about the algorithm or the policy for choosing $x_t$. Depending on the application, Equation (1) can serve different purposes: it can describe the stochastic reward model of a bandit problem, or it may be the transition dynamics of an RL agent in a stochastic environment. For solving such problems, a central prerequisite for numerous algorithms are confidence sets for $f(x)$ based on the history $H_{t-1}$, used to balance exploration and exploitation at any step $t$.
For any $x \in \mathcal{X}$, the set $C_{t-1}(x)$ defines an interval to which $f(x)$ belongs with high probability, such that

$$P\left(\forall x \in \mathcal{X} : f(x) \in C_{t-1}(x)\right) \ge 1 - \delta.$$

The center of this interval reflects the current knowledge of the agent, relevant for exploitation, and the width corresponds to the uncertainty, guiding further exploration. When the true kernel is known, an approach commonly used in the kernelized bandit literature (Abbasi-Yadkori et al., 2011; Srinivas et al., 2010; Russo & Van Roy, 2014) is to build sets of the form

$$C_{t-1}(k; x) = \left[\mu_{t-1}(k; x) - \nu_t \sigma_{t-1}(k; x),\ \mu_{t-1}(k; x) + \nu_t \sigma_{t-1}(k; x)\right], \qquad (2)$$

where the exploration coefficient $\nu_t$ depends on the desired confidence level $1 - \delta$, and may be set based on the objective of the decision-making task. The functions $\mu_{t-1}$ and $\sigma_{t-1}$ set the center and width of the confidence set as

$$\mu_{t-1}(k; x) = k_{t-1}^T(x)\left(K_{t-1} + \sigma^2 I\right)^{-1} y_{t-1}, \qquad \sigma_{t-1}^2(k; x) = k(x, x) - k_{t-1}^T(x)\left(K_{t-1} + \sigma^2 I\right)^{-1} k_{t-1}(x), \qquad (3)$$

where $\sigma$ is a constant, $y_{t-1} = [y_\tau]_{\tau < t}$ is the vector of past observations, $k_{t-1}(x) = [k(x, x_\tau)]_{\tau < t}$, and $K_{t-1} = [k(x_\tau, x_{\tau'})]_{\tau, \tau' < t}$ is the kernel matrix.

We assume the meta-data $D_{n,m} = \{(x_{s,i}, y_{s,i})\}_{i \le n, s \le m}$ consists of $m$ tasks with $n$ samples each, generated as

$$y_{s,i} = f_s(x_{s,i}) + \varepsilon_{s,i}, \qquad f_s \in \mathcal{H}_{k^*}, \qquad (4)$$

where $\mathcal{H}_{k^*}$ is the oracle hypothesis space, the RKHS of a sparse non-negative combination of $p$ known base kernels,

$$k^*(x, x') = \sum_{j=1}^{p} \eta_j^* k_j(x, x'), \qquad \eta^* \ge 0,\ \|\eta^*\|_1 \le 1, \qquad (5)$$

with each base kernel given by a finite-dimensional feature map, $k_j(x, x') = \phi_j^T(x)\phi_j(x')$, $\phi_j(x) \in \mathbb{R}^{d_j}$, and $J_{k^*} = \{1 \le j \le p : \eta_j^* \neq 0\}$ denoting the set of active base kernels. Writing $\phi$ for the feature map of $k^*$ (see Table 1), each task function can be expressed as

$$f_s(x) = \phi^T(x)\beta_s^*, \qquad \beta_s^* \in \mathbb{R}^d, \qquad (6)$$

so that, by the data-generating model, $J_{k^*} = \{j : \beta^{*(j)} \neq 0\}$.

Assumption 3.1 (Beta-min). There exists an absolute constant $c_1 > 0$ s.t. for all $j \in J_{k^*}$ it holds that $\|\beta^{*(j)}\|_2 \ge c_1$.

This assumption is inevitable for recovering the sparsity pattern from empirical data, and it is widely used in the high-dimensional statistics literature (e.g., Bühlmann & Van De Geer, 2011; Zhao & Yu, 2006; Van de Geer et al., 2011). Assumption 3.1 implies that for $j$ to be in $J_{k^*}$, the coefficient vector corresponding to kernel $k_j$ cannot be zero or arbitrarily close to zero. In practice, $\|\beta^{*(j)}\|_2$ has to be comparable with the noise level for the activity of a base kernel not to be mistaken for randomness.

4. Meta-learning the Hypothesis Space (META-KEL)

In the following section, we present our formulation of the meta-learning problem and analyze the properties of the learned hypothesis space. We meta-learn the kernel by solving the following optimization problem; we then set the hypothesis space of the downstream learning algorithm to be the RKHS of the meta-learned kernel:

$$\min_{\eta, f_1, \ldots, f_m} \frac{1}{mn} \sum_{s=1}^{m} \sum_{i=1}^{n} \left(y_{s,i} - f_s(x_{s,i})\right)^2 + \frac{\lambda}{2}\left[\sum_{s=1}^{m} \|f_s\|_k^2 + \|\eta\|_1\right] \quad \text{s.t.}\ \forall s : f_s \in \mathcal{H}_k,\ k = \sum_{j=1}^{p} \eta_j k_j,\ 0 \le \eta. \qquad (7)$$

We will refer to this problem as Meta-Kernel Learning (META-KEL). The first part of the objective is similar to the kernel ridge regression loss, and accounts for how well a series of regularized $f_s$ fit the meta-data. The last term regularizes our choice of the kernel function. We use $\ell_1$-norm regularization for $\eta$ to implicitly perform meta-model-selection. As shown in Proposition 4.4, the meta-learned kernel will reflect the sparsity pattern of the true kernel.

The optimization problem (7) is convex and admits an efficient solution, as explained next. We first introduce a vectorized formulation of Equation (7). Let $y_s \in \mathbb{R}^n$ denote the observed values for a task $s$ and $y = (y_1^T, \ldots, y_m^T)^T \in \mathbb{R}^{mn}$ the multi-task stacked vector of observations. We then design a multi-task feature matrix: we define $\Phi$ to be a $mn \times md$ block-diagonal matrix, where block $s$ corresponds to $\Phi_s = (\phi(x_{s,1}), \ldots, \phi(x_{s,n}))^T$, the $n \times d$ feature matrix of task $s$. Figure 7 provides an illustration thereof. As shown in Proposition 4.1, this vectorized design brings forth a parametric equivalent of META-KEL, which happens to be the well-known Group Lasso problem.

Proposition 4.1 (Solution of META-KEL). Let $\hat{k} = \sum_j \hat{\eta}_j k_j$ be a solution to Problem (7). Then, for all $1 \le j \le p$, it holds that $\hat{\eta}_j = \|\hat{\beta}^{(j)}\|_2$, with $\hat{\beta} = (\hat{\beta}^{(j)})_{j \le p}$ the solution of the following convex optimization problem:

$$\min_{\beta} \frac{1}{mn} \|y - \Phi\beta\|_2^2 + \lambda \sum_{j=1}^{p} \|\beta^{(j)}\|_2. \qquad (8)$$

We show this equivalence by eliminating $\eta$. We use a trick introduced by Bach et al. (2004), which states that for $w \in \mathbb{R}$, $2|w| = \min_{v \ge 0} w^2/v + v$. The proof is given in Appendix A.2.
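To make Proposition 4.1 concrete, the following is a minimal sketch of the vectorized estimator, written against the celer solver (Massias et al., 2018) that the experiments in Section 6 use. The helper names (`meta_kel`, `Phi_tasks`, `dims`) are illustrative rather than part of any released code, and the `alpha = lam / 2` scaling assumes celer's sklearn-style objective convention, which should be checked against the installed release.

```python
import numpy as np
from celer import GroupLasso  # group-Lasso solver, cf. Massias et al. (2018)

def meta_kel(Phi_tasks, y_tasks, dims, lam):
    """Solve the vectorized META-KEL problem (Eq. 8).

    Phi_tasks: list of m arrays of shape (n, d), columns ordered as the
    stacked base features [phi_1 | ... | phi_p] with d = sum(dims);
    y_tasks: list of m arrays of shape (n,)."""
    m = len(Phi_tasks)
    n, d = Phi_tasks[0].shape
    # Block-diagonal multi-task design matrix Phi of shape (mn, md), cf. Fig. 7.
    Phi = np.zeros((m * n, m * d))
    for s, Phi_s in enumerate(Phi_tasks):
        Phi[s * n:(s + 1) * n, s * d:(s + 1) * d] = Phi_s
    y = np.concatenate(y_tasks)
    # Group j gathers the coefficients of kernel k_j across all m tasks;
    # those indices are non-contiguous, so explicit index lists are passed.
    offsets = np.concatenate(([0], np.cumsum(dims)))
    groups = [
        np.concatenate([s * d + np.arange(offsets[j], offsets[j + 1])
                        for s in range(m)]).tolist()
        for j in range(len(dims))
    ]
    # celer minimizes (1/(2 n_samples))||y - Xw||^2 + alpha * sum_j ||w^(j)||_2,
    # so alpha = lam / 2 matches Eq. (8) up to this convention (an assumption).
    est = GroupLasso(groups=groups, alpha=lam / 2.0, fit_intercept=False)
    est.fit(Phi, y)
    beta = est.coef_
    eta = np.array([np.linalg.norm(beta[g]) for g in groups])  # Prop. 4.1
    return eta, beta
```

Note that the groups are non-contiguous under this stacking: the coefficients of kernel $k_j$ recur once per task block, which is why index lists, rather than contiguous block sizes, are handed to the solver.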
More generally, Problem (8) can be optimized by any Group Lasso solver; Bach et al. (2012) present a number of coordinate descent algorithms which efficiently find the solution.

Before introducing the meta-learned kernel $\hat{k}$, we note that Reproducing Kernel Hilbert Spaces are equivalent up to scaling of the kernel function. For $c > 0$, both $\mathcal{H}_k$ and the scaled version $\mathcal{H}_{ck}$ contain the same set of functions. Going from $\mathcal{H}_k$ to $\mathcal{H}_{ck}$, the squared RKHS norm of any member $f$ scales by $1/c$, i.e., $\|f\|_k = \sqrt{c}\,\|f\|_{ck}$. Hence, the norm $\|\hat{\eta}\|_1$ is irrelevant when meta-learning the function space: it can be scaled or normalized and still yield the same hypothesis space, only with a scaled norm. For consistency of notation, we define $\hat{k}$ as follows. For any two points $x, x' \in \mathcal{X}$, set

$$\hat{k}(x, x') = \sum_{j=1}^{p} \frac{\hat{\eta}_j}{c_1}\, \phi_j^T(x)\phi_j(x'), \qquad (9)$$

where $c_1$ is the same constant as in Assumption 3.1. We emphasize that this scaling does not impose a new assumption on the problem, as it is done only to simplify theorem statements. We denote the set of base kernels active in $\hat{k}$ by $J_{\hat{k}} = \{1 \le j \le p : \hat{\eta}_j \neq 0\}$. The meta-learned hypothesis space will then be $\mathcal{H}_{\hat{k}}$, the RKHS which corresponds to $\hat{k}$.

Properties of the Meta-learned Hypothesis Space

The meta-learned $\hat{k}$ can be used as the kernel function of a model-based sequential decision-making algorithm, as illustrated in Figure 1. Our goal is to analyze how this choice of kernel affects the success of the algorithm, compared to the oracle algorithm with access to the unknown kernel. To this end, we discuss properties of $\hat{k}$. For our main result, we require a final technical assumption to ensure that our meta-data is sufficiently diverse. Consider some vector $b \in \mathbb{R}^{md}$ that adheres to the same group structure as $\beta^*$. For a set of group indices $J \subseteq \{1, \ldots, p\}$, we define $b_J := (b^{(j)})_{j \in J}$ as the sub-vector indicated by the groups in $J$.

Assumption 4.2 (Relaxed Sufficient Exploration). There exists $\kappa = \kappa(s) > 0$ that satisfies

$$\kappa \le \min_{J, b} \frac{\|\Phi b\|_2}{\sqrt{mn}\, \|b_J\|_2} \quad \text{s.t.}\quad |J| \le s,\ b \neq 0,\ \sum_{j \notin J} \|b^{(j)}\|_2 \le 3 \sum_{j \in J} \|b^{(j)}\|_2.$$

This condition makes sure that the meta-training data is not degenerate, e.g., that no two points $x_{s,i}$ and $x_{s,j}$ in the meta-data are identical or too close. Assumption 4.2 is immediately fulfilled if the minimum singular value of $\Phi$ is positive, i.e., if there exists some $\kappa > 0$ with $\lambda_{\min}(\Phi) \ge \kappa$. This stronger version is sometimes referred to as Sufficient Exploration (Basu et al., 2021; Zhou et al., 2020). In the Lasso literature, Assumption 4.2 is commonly referred to as a Restricted Eigenvalue Assumption (Bickel et al., 2009; Bühlmann & Van De Geer, 2011; Javanmard & Montanari, 2014). It is also often required to hold for the kernel matrix in the sparse linear bandits literature (Bastani & Bayati, 2020; Wang et al., 2018a; Hao et al., 2020; Kim & Paik, 2019). Lastly, Assumption 4.2 is satisfied with high probability if $\Phi$ is an i.i.d. sub-Gaussian random matrix, since the singular values of such random matrices are bounded away from zero (Vershynin, 2018, Theorem 4.6.1).

Theorem 4.3 (Hypothesis Space Recovery). Set $0 < \delta < 1$, and choose $\lambda$ such that

$$\lambda \ge \frac{4\sigma}{\sqrt{mn}} \sqrt{1 + \frac{2}{m}\left(2\log(2p/\delta) + \sqrt{m\, d_{\max}\, \log(2p/\delta)}\right)}.$$

If $|J_{k^*}| \le s$ and Assumption 4.2 holds with $\kappa(s)$, then with probability greater than $1 - \delta$, $\mathcal{H}_{k^*} \subseteq \mathcal{H}_{\hat{k}}$, and for every $f \in \mathcal{H}_{k^*}$,

$$\|f\|_{\hat{k}} \le \|f\|_{k^*}\left(1 + \frac{\epsilon(n, m)}{2c_1} + o(\epsilon(n, m))\right),$$

provided that $n$ and $m$ are large enough to satisfy $\epsilon(n, m) \le c_1$. The absolute constant $c_1$ is defined in Assumption 3.1 and

$$\epsilon(n, m) := \frac{32\sigma s}{\kappa^2(s)\sqrt{mn}} \sqrt{1 + \frac{2}{m}\left(2\log(2p/\delta) + \sqrt{m\, d_{\max}\, \log(2p/\delta)}\right)}.$$

The proof is given in Appendix B.1.
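As an illustration of Equation (9), here is a minimal sketch that assembles the meta-learned kernel from the output of the group Lasso above; the `feat` helper that stacks the base feature maps $[\phi_1 | \ldots | \phi_p]$ is an assumed ingredient, not something the paper prescribes.

```python
import numpy as np

def khat_gram(eta, dims, c1, feat):
    """Return a Gram function for Eq. (9):
    khat(x, x') = sum_{j in J_khat} (eta_j / c1) * phi_j(x)^T phi_j(x')."""
    active = np.flatnonzero(eta)                 # J_khat = {j : eta_j != 0}
    offsets = np.concatenate(([0], np.cumsum(dims)))

    def gram(X, Z):
        FX, FZ = feat(X), feat(Z)                # stacked features, shape (N, d)
        K = np.zeros((FX.shape[0], FZ.shape[0]))
        for j in active:                         # only active base kernels enter
            sl = slice(offsets[j], offsets[j + 1])
            K += (eta[j] / c1) * FX[:, sl] @ FZ[:, sl].T
        return K

    return gram
```

Only the active groups contribute to the Gram matrix, which is precisely the structural sparsity that Proposition 4.4 below quantifies.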
Theorem 4.3 states that, provided enough meta-data, $\mathcal{H}_{k^*}$ is contained in $\mathcal{H}_{\hat{k}}$ with high probability, and $\delta$, the probability of failure in recovery, decreases as $m$ and $n$ grow (see Appendix B.3). In addition, the RKHS norm of any $f \in \mathcal{H}_{k^*}$ can be bounded arbitrarily well by $\|f\|_{\hat{k}}$, since the error $\epsilon(n, m)$ decreases at a $O\big(\frac{s}{\sqrt{mn}}\sqrt{1 + m^{-1}\log p}\big)$ rate. This matches the tightest rate for coordinate-wise convergence of the Group Lasso estimator under the same set of assumptions (Lounici et al., 2011; Bunea et al., 2013). The theorem implies that the meta-learner benefits more from increasing the number $m$ of meta-data tasks than from increasing the sample size $n$ of each task, since $\epsilon(n, m)$ shrinks faster with $m$ than with $n$. Note that increasing either $m$ or $n$ results in convergence, and therefore this theorem also holds for the classic offline kernel learning setup where the dataset consists of a single learning task ($m = 1$).

The Benefit of Structural Sparsity

Consider a conservative hand-picked kernel function

$$k_{\text{full}}(x, x') = \frac{1}{p} \sum_{j=1}^{p} \phi_j^T(x)\phi_j(x'), \qquad (10)$$

which does not use any meta-data and instead incorporates all the considered base kernels. When $p$ and $d_{\max}$ are finite, $\mathcal{H}_{k^*}$ is contained in $\mathcal{H}_{k_{\text{full}}}$ and the hand-picked hypothesis space is not misspecified. However, working with an overly large hypothesis space has downsides. Consider using $k_{\text{full}}$ to estimate a function $f \in \mathcal{H}_{k^*}$. Then every base kernel, including $k_j$ with $j \notin J_{k^*}$, appears in the construction of the estimator. These terms contribute to the estimation error and increase the variance of the function estimate. This slows down the rate of convergence, compared to the case where only the active $k_j$ are present in the kernel function. By meta-learning $\hat{k}$ via META-KEL, we can eliminate irrelevant candidate kernels and produce a structurally sparse hypothesis space. Proposition 4.4 guarantees this property; its proof is given in Appendix B.2.

Proposition 4.4 (Bound on structural sparsity of $\hat{k}$). Set $0 < \delta < 1$ and choose $\lambda$ according to Theorem 4.3. Let $|J_{k^*}| \le s$ be the number of candidate kernels that contribute to $k^*$. If Assumption 4.2 holds with $\kappa(s)$, then with probability greater than $1 - \delta$, the number of kernels active in $\hat{k}$ is bounded by

$$|J_{\hat{k}}| \le \frac{64\, s}{mn\, \kappa^2(s)},$$

which implies that if $mn > 64s/(p\,\kappa^2(s))$, then with the same probability $\mathcal{H}_{\hat{k}} \subseteq \mathcal{H}_{k_{\text{full}}}$.

Hence, in the presence of enough meta-data, $\mathcal{H}_{\hat{k}}$ is a strict subset of $\mathcal{H}_{k_{\text{full}}}$, and therefore

$$\mathcal{H}_{k^*} \overset{\text{w.h.p.}}{\subseteq} \mathcal{H}_{\hat{k}} \overset{\text{w.h.p.}}{\subseteq} \mathcal{H}_{k_{\text{full}}},$$

where the left relation is due to Theorem 4.3. Figure 2 illustrates the nested sets. We conclude that our meta-learned hypothesis space has favorable properties: it contains the true hypothesis space, and it is sparse in structure, in particular smaller than the conservative candidate space. The fact that $\mathcal{H}_{\hat{k}}$ is smaller than $\mathcal{H}_{k_{\text{full}}}$ reduces the complexity of the downstream learning problem and yields faster convergence rates. We provide an example of this effect in Section 5, where we analyze a Bayesian optimization problem and establish how choosing $\hat{k}$ improves upon $k_{\text{full}}$. Finally, our experiments (e.g., Figure 4) support the claim that, in practice, the BO algorithm is faster at finding the optimum when it uses the meta-learned kernel.

[Figure 2. The oracle $\mathcal{H}_{k^*}$ (Eq. 5), the meta-learned $\mathcal{H}_{\hat{k}}$ (Eq. 9) and the hand-picked $\mathcal{H}_{k_{\text{full}}}$ (Eq. 10) hypothesis spaces (informal).]

5. Sequential Decision-making with META-KEL

We now analyze the effect of using $\hat{k}$ as the kernel function in the downstream sequential decision-making problem.
We adopt the common construction of confidence sets given in Equation (2), and define $\hat{C}_{t-1}(x) := C_{t-1}(\hat{k}; x)$. We let $\hat{\mu}_{t-1}(x) := \mu_{t-1}(\hat{k}; x)$ and $\hat{\sigma}_{t-1}(x) := \sigma_{t-1}(\hat{k}; x)$, where $\mu_{t-1}(k; x)$ and $\sigma_{t-1}(k; x)$ are as defined in Equation (3), with time-varying $\sigma^2 = 1 + 2/t$. (The functions $\hat{\mu}_{t-1}$ and $\hat{\sigma}_{t-1}$ are the posterior mean and variance of $GP(0, \hat{k})$, conditioned on $H_{t-1}$, with noise variance $\sigma^2$.) Theorem 5.1 shows that, for the right choice of $\nu_t$, the set $\hat{C}_{t-1}(x)$ is a valid confidence bound for any $f \in \mathcal{H}_{k^*}$, evaluated at any $x \in \mathcal{X}$, at any step $t$, with high probability.

Theorem 5.1 (Any-time Valid Confidence Bounds with META-KEL). Let $f \in \mathcal{H}_{k^*}$ with $\|f\|_{k^*} \le B$, where $k^*$ is unknown. Under the assumptions of Theorem 4.3, with probability greater than $1 - \delta$, for all $x \in \mathcal{X}$ and all $t \ge 1$,

$$|\hat{\mu}_{t-1}(x) - f(x)| \le \hat{\sigma}_{t-1}(x)\left(B\Big(1 + \frac{\epsilon(n, m)}{2c_1}\Big) + \sigma\sqrt{\hat{d}\, \log\left(1 + \sigma^{-2}t/c_1\right) + 2 + 2\log(1/\delta)}\right),$$

where $\hat{d} = \sum_{j \in J_{\hat{k}}} d_j$ and $\sigma^2 = 1 + 2/t$. The proof is given in Appendix C. As discussed in Section 4, the $\epsilon(n, m)/2c_1$ term shrinks at a $O(1/\sqrt{mn})$ rate, and $\hat{d}$ approaches $d^* = \sum_{j \in J_{k^*}} d_j$ at a similar rate. Therefore, Theorem 5.1 presents a tight confidence bound relative to the case when $k^*$ is known by the agent. In that case, due to Chowdhury & Gopalan (2017, Theorem 2), the $1 - \delta$ confidence bound would be

$$|\mu^*_{t-1}(x) - f(x)| \le \sigma^*_{t-1}(x)\left(B + \sigma\sqrt{d^*\log\left(1 + \sigma^{-2}t\right) + 2 + 2\log(1/\delta)}\right),$$

where the mean and variance functions are defined by $\mu^*_{t-1}(x) := \mu_{t-1}(k^*; x)$ and $\sigma^*_{t-1}(x) := \sigma_{t-1}(k^*; x)$ with $\sigma^2 = 1 + 2/t$. We conclude that the base learner does not require knowledge of the true kernel for constructing confidence sets, as long as there is sufficient meta-data available. Theorem 4.3 quantifies this notion of sufficiency.

Case Study: Bayesian Optimization

As an example application, we consider the classic Bayesian optimization problem, but in the case where $\mathcal{H}_{k^*}$ is unknown. This example illustrates how Theorem 5.1 may be used to prove guarantees for a decision-making algorithm which uses the meta-learned kernel due to a lack of knowledge of $k^*$. We follow the setup and BO notation of Srinivas et al. (2010). The agent seeks to maximize an unknown reward function $f$, sequentially accessed as described in Equation (1). Their goal is to choose actions $x_t$ which maximize the cumulative reward achieved over $T$ time steps. This is equivalent to minimizing the cumulative regret $R_T = \sum_{t=1}^{T}\left[f(x^*) - f(x_t)\right]$, where $x^*$ is a global maximum of $f$. Note that if $R_T/T \to 0$ as $T \to \infty$, then $\max_{1 \le t \le T} f(x_t) \to f(x^*)$, i.e., the learner converges to the optimal value. We will refer to this property as sublinearity of the regret. In the spirit of the GP-UCB algorithm (Srinivas et al., 2010), we choose the next point by maximizing the upper confidence bound as determined by Theorem 5.1,

$$x_t = \arg\max_{x \in \mathcal{X}}\ \hat{\mu}_{t-1}(x) + \nu_t \hat{\sigma}_{t-1}(x), \qquad (11)$$

where a suitable choice for $\nu_t$ is suggested in Corollary 5.2.

Corollary 5.2 (A Regret Bound with META-KEL). Let $\delta \in (0, 1)$. Suppose $f \in \mathcal{H}_{k^*}$ with $\|f\|_{k^*} \le B$, and that values of $f$ are observed with zero-mean sub-Gaussian noise of variance proxy $\sigma^2$. Then, with probability greater than $1 - \delta$, GP-UCB used together with $\hat{k}$ satisfies

$$R_T = O\left(\sqrt{\hat{d}\, T \log T}\left[B\Big(1 + \frac{\epsilon(n, m)}{2c_1}\Big) + \sqrt{\hat{d}\log T + \log(1/\delta)}\right]\right),$$

provided that the exploration coefficient is set to

$$\nu_t = B\Big(1 + \frac{\epsilon(n, m)}{2c_1}\Big) + \sigma\sqrt{\hat{d}\, \log\left(1 + \sigma^{-2}t/c_1\right) + 2 + 2\log(1/\delta)}.$$

The proof is straightforward: conditioned on the event that $f \in \mathcal{H}_{\hat{k}}$, we may directly use the regret bound of Chowdhury & Gopalan (2017); then, by Theorem 4.3, we calculate the probability of this event (Appendix C.1). The Corollary relies on knowledge of a bound $B$ on $\|f\|_{k^*}$. However, using techniques of Berkenkamp et al.
(2019), it is possible to adapt it even to unknown $B$, at increased (but still sublinear) regret. Corollary 5.2 shows that GP-UCB using the meta-learned kernel guarantees sublinear regret. We obtain a $O(\hat{d}\, B\sqrt{T}\log T)$ rate for the regret, which is tight compared to the $O(d^* B\sqrt{T}\log T)$ rate satisfied by the oracle. It is insightful to compare this convergence result to a scenario where the hypothesis space is misspecified: for a reward function $f \notin \mathcal{H}_{\hat{k}}$, Bogunovic & Krause (2021) show that the learner will not converge to the global optimum, since the cumulative regret has a lower bound that grows linearly in $T$, up to logarithmic factors. Corollary 5.2 suggests that by using a sparse kernel we can potentially find the optimal policy faster than when the complex kernel $k_{\text{full}}$ is used. Recall that $d = \sum_{j=1}^{p} d_j$; by Theorem 2 of Chowdhury & Gopalan (2017), the regret of GP-UCB used together with $k_{\text{full}}$ is bounded by $O(d\sqrt{p}\, B\sqrt{T}\log T)$, since $\|f\|_{k_{\text{full}}} = \sqrt{p}\, \|f\|_{k^*}$. Therefore, using the meta-learned kernel improves the regret bound by a factor of $\hat{d}/(d\sqrt{p})$, implying that the solution may be found faster. The results of our experiments in Figure 4 support this argument. Note that our approach to guaranteeing sublinear regret for GP-UCB without oracle knowledge of $k^*$ naturally generalizes to other sequential decision tasks. In particular, any theoretical result relying on RKHS confidence intervals with a known kernel can be immediately extended to use those of the meta-learned kernel.

[Figure 3. Calibration (left) and sharpness (right) experiment for confidence sets given 4 training samples, comparing $k_{\text{full}}$, META-KEL, and the oracle. Averaged over 50 runs, $\hat{k}$ always gives tight valid confidence intervals.]

[Figure 4. Simple and cumulative regret for GP-UCB with $\hat{k}$, $k_{\text{full}}$, and $k^*$. The algorithm converges at a slower pace when using $k_{\text{full}}$.]

6. Experiments

In this section, we provide experiments to quantitatively illustrate our theoretical contribution.

Experiment Setup (1D). We create a synthetic dataset based on our data model, Equations (1) and (4). We first limit the domain to the 1-dimensional $\mathcal{X} = [-1, 1]$ and use Legendre polynomials $P_j$ as our features $\phi_j$. The sequence $(P_j)_{j \ge 0}$ is a natural choice, since it provides an orthonormal basis for $L_2(\mathcal{X})$. Moreover, Legendre polynomials are eigenfunctions of dot-product kernels such as the Neural Tangent Kernel (Jacot et al., 2018). We let $k^*(x, x') = \sum_{j \in J_{k^*}} \eta_j^* P_j(x)P_j(x')$, where $J_{k^*}$ is a random subset of $\{1, \ldots, p\}$. Each $\eta_j^*$ is sampled independently from the standard uniform distribution and the vector $\eta^*$ is then normalized. Across all experiments, we set $p = 20$ and $s = |J_{k^*}| = 5$. To sample the meta-data $D_{n,m}$, we choose $m$ independent random subsets of $J_{k^*}$ and generate the functions $f_s$ via Equation (6), where the $\beta^{*(j)}$ are drawn i.i.d. from the standard uniform distribution. We then scale the norm $\|f\|_{k^*}$ to $B = 10$. The data for a single task, i.e., $(x_{s,i}, y_{s,i})_{i \le n}$, is then created by uniformly drawing i.i.d. samples from the domain $\mathcal{X}$ and evaluating $f_s$ at those points. We add Gaussian noise with standard deviation $\sigma = 0.01$ to all data points. Figure 8 in the appendix shows how random $f_s$ may look.

[Figure 5. The regret of GP-UCB used with $\hat{k}$ (shown for $\lambda = 0.03$ and $\lambda = 0.06$) approaches the oracle regret as the number of offline tasks increases.]
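For concreteness, a minimal numpy sketch of this 1D data-generating process follows; the Legendre features come from scipy, and details the text leaves open (e.g., the size of each task's random subset of $J_{k^*}$) are illustrative choices.

```python
import numpy as np
from scipy.special import eval_legendre

rng = np.random.default_rng(0)
p, s, n, B, noise = 20, 5, 50, 10.0, 0.01

def feat(x):
    """Stacked Legendre features phi_j(x) = P_j(x); each d_j = 1 here."""
    return np.stack([eval_legendre(j, x) for j in range(p)], axis=1)

# Random sparse true kernel: eta*_j > 0 only on a random index set J_{k*}.
J_star = rng.choice(p, size=s, replace=False)
eta_star = np.zeros(p)
eta_star[J_star] = rng.uniform(size=s)
eta_star /= eta_star.sum()                       # normalize eta*

def sample_task():
    """One meta-task, Eq. (6): f_s(x) = sum_j sqrt(eta*_j) beta_j P_j(x)."""
    k = int(rng.integers(1, s + 1))              # subset size: illustrative
    idx = rng.choice(J_star, size=k, replace=False)
    beta = np.zeros(p)
    beta[idx] = rng.uniform(size=k)
    beta *= B / np.linalg.norm(beta)             # scale ||f_s||_{k*} to B
    w = np.sqrt(eta_star) * beta                 # weights in the P_j basis
    x = rng.uniform(-1.0, 1.0, size=n)
    y = feat(x) @ w + noise * rng.normal(size=n)
    return x, y, w
```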
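Continuing the snippet above, here is a matching sketch of the GP-UCB loop of Equation (11), with the posterior formulas of Equation (3); `gram` can be any kernel Gram function, e.g., one built from $\hat{k}$ as sketched in Section 4, and the fixed `nu` stands in for the schedule $\nu_t$ of Corollary 5.2.

```python
def gp_ucb(gram, domain, f_true, T, nu=2.0):
    """GP-UCB on a finite candidate grid `domain` (a 1D numpy array)."""
    prior_var = gram(domain, domain).diagonal()
    X_hist, y_hist = [], []
    for t in range(1, T + 1):
        if X_hist:
            Xh = np.array(X_hist)
            sigma2 = 1.0 + 2.0 / t                      # time-varying sigma^2
            K = gram(Xh, Xh) + sigma2 * np.eye(len(Xh))
            kx = gram(domain, Xh)                       # rows are k_{t-1}(x)
            mu = kx @ np.linalg.solve(K, np.array(y_hist))          # Eq. (3)
            var = prior_var - np.einsum('ij,ji->i', kx, np.linalg.solve(K, kx.T))
            sd = np.sqrt(np.maximum(var, 1e-12))
        else:
            mu, sd = np.zeros(len(domain)), np.sqrt(prior_var)
        x = float(domain[np.argmax(mu + nu * sd)])      # UCB step, Eq. (11)
        X_hist.append(x)
        y_hist.append(f_true(x) + noise * rng.normal())
    return np.array(X_hist), np.array(y_hist)

# Usage, e.g., with the oracle kernel (c1 = 1.0 is a hypothetical scaling):
#   x_grid = np.linspace(-1.0, 1.0, 500)
#   _, _, w = sample_task()
#   f = lambda xq: float(feat(np.atleast_1d(xq)) @ w)
#   xs, ys = gp_ucb(khat_gram(eta_star, [1] * p, 1.0, feat), x_grid, f, T=50)
```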
For all experiments we set $n = m = 50$, unless stated otherwise. To meta-learn $\hat{k}$, we solve the vectorized META-KEL problem (Eq. 8) over the $\beta^{(j)}$ with CELER, a fast solver for the group Lasso (Massias et al., 2018), and then set $\hat{\eta}$ according to Proposition 4.1. We set $\lambda = 0.03$, such that it satisfies the condition of Theorem 4.3. As shown in Figure 10 in the appendix, the choice of $\lambda$ has little effect on the performance of the algorithm.

Confidence Set Experiment. We perform calibration and sharpness experiments to assess the meta-learned confidence sets (Gneiting et al., 2007). Figure 3 presents the result. To obtain an $\alpha$-confidence interval for some $f(x)$ using a kernel $k$, we assume an $f \sim GP(0, k)$ prior and calculate the $\alpha$-quantile of the posterior after observing 4 noisy i.i.d. draws from the function. For each hypothesis, the y-axis of the left plot shows the empirical coverage of the confidence sets, i.e., the fraction of test points contained in the $\alpha$-confidence intervals, for varying levels $\alpha$. In this plot, if a curve were to fall below the $x = y$ line, it would imply insufficient coverage and hence over-confident sets. The plot on the right shows the posterior variance averaged across all test points. This quantity, referred to as sharpness, reflects the width of the confidence bands. Figure 3 demonstrates that the meta-learned confidence sets are well-calibrated for the entire range of confidence levels and are tight relative to the true sets. In contrast, $k_{\text{full}}$ yields conservative confidence sets, due to considering polynomials $P_j$ that do not contribute to the construction of $f(x)$. We use 1000 test points for calculating the empirical averages. The plot shows the values averaged over 50 runs, where for each run the kernel $k^*$ and the data are generated from scratch.

[Figure 6. Cumulative regret of GP-UCB: a synthetic 2D BO problem with Legendre features (left), and hyper-parameter tuning of GLMNET with Random Fourier Features (right), comparing META-KEL against $k_{\text{full}}$, the SE kernel, and (left) the oracle $k^*$.]

Regret Experiment. We verify the performance of GP-UCB when used together with $\hat{k}$. We generate the random reward function $f$ in a manner similar to the $f_s$ of the meta-data. The BO problem is simulated according to Equation (1), and the actions are selected via Equation (11). Figure 9 in the appendix shows how this algorithm samples the domain and how the confidence estimates shrink as new samples are observed. Keeping the underlying random $k^*$ fixed, we generate 100 random instances of the meta-learning and BO problem. In Figure 4 we present the average regret and its standard deviation. In these plots, the simple regret of GP-UCB with a kernel $k$ is labeled $r_t(k) = f(x^*) - \max_{\tau \le t} f(x_\tau)$. Respectively, the cumulative inference regret is $R_t(k) = \sum_{\tau \le t}\left[f(x^*) - \max_{x} \mu_{\tau-1}(x)\right]$. The algorithm converges to the optimum using all three kernels. The meta-learned kernel, however, improves upon $k_{\text{full}}$ and results in performance competitive with GP-UCB knowing $k^*$. This behavior empirically confirms Corollary 5.2.

Consistency Experiment. From Corollary 5.2 we concluded that, as the size of the meta-data grows, the regret bound achieved via $\hat{k}$ converges to the oracle bound, i.e., the bound satisfied by the learner when it has knowledge of the true kernel. As Figure 5 shows, this consistency is also reflected in the empirical regret values.
As we increase $m$, the number of offline tasks given to the meta-learner, the cumulative inference regret at $T = 100$ improves and approaches the regret obtained by the oracle algorithm. The value of $\lambda$ does not affect this convergence, as long as it satisfies Theorem 4.3. Similar to Figure 4, this plot is generated for a fixed random $k^*$, averaged over 50 random instances of the meta-learning and BO problem.

Regret Experiment for 2D Domain. We repeat the regret experiment with synthetic data over the 2-dimensional domain $\mathcal{X} = [-1, 1]^2$. For $x = (x_1, x_2) \in \mathcal{X}$, we define the Legendre feature map as $\phi(x) = (P_j(x_1)P_{p-j}(x_2))_{0 \le j \le p}$. This feature map is $(p+1)$-dimensional, and has terms of degree at most $p$. We use the polynomial $P_j(x_1)P_{p-j}(x_2)$ as the feature $\phi_j(x)$ and create a synthetic dataset in a fashion identical to the 1-dimensional case, again setting $p = 20$ and $s = 5$. Figure 6 shows the 2-dimensional counterpart of Figure 4. Again, using $\hat{k}$ results in performance competitive with the oracle algorithm that knows $k^*$. Here, we also run GP-UCB together with the infinite-dimensional Squared Exponential (SE) kernel. The regret curve shows that we do not benefit from choosing a complex kernel for solving an inherently low-dimensional problem.

Efficient Hyper-parameter Tuning with META-KEL. A common application of GP-UCB is in optimizing hyper-parameters of machine learning algorithms. In this setting, $\mathcal{X}$ is the algorithm's hyper-parameter space, and $f$ represents the test performance of the algorithm. Evaluating each hyper-parameter configuration is costly, and thus the BO method has to be sample-efficient. We consider the GLMNET algorithm (Friedman et al., 2010) and empirically demonstrate that by meta-learning the kernel, we gather knowledge from prior data and, in turn, tune the hyper-parameters of a new task more efficiently. Mainly, by running GP-UCB with $\hat{k}$ we tend to find the optimal configuration of hyper-parameters for an unseen learning task faster, compared to using data-independent kernels (Figure 6). The OpenML platform (Bischl et al., 2017) enables access to data from hyper-parameter tuning of GLMNET on 38 different classification tasks. We split these datasets into a meta-dataset with $m = 25$ tasks and leave the rest as test tasks. For meta-learning the kernel, we use 500 Random Fourier Features (Rahimi et al., 2007) defined on a 2-dimensional domain, since the GLMNET algorithm has only two hyper-parameters. The remaining details of our experiment setup are given in Appendix D. Figure 6 shows the performance of GP-UCB on the test tasks. Utilizing $\hat{k}$ results in a sample-efficient GP-UCB that rapidly approaches an optimal choice of hyper-parameters. This is in contrast to running GP-UCB with the SE kernel, which takes about 50 iterations to find a good configuration.

7. Conclusion

We proposed META-KEL, a method for reliably learning kernel functions from offline meta-data. As a first, our approach yields provably valid meta-learned adaptive confidence sets, setting it apart from existing meta-learning approaches. Importantly, we showed that our meta-learned kernel produces tight and consistent confidence bounds for the target function, provided that enough meta-data is available. As an example application, we showed that GP-UCB still achieves sublinear regret when using the meta-learned kernel, performing competitively, in theory and experiments, with the oracle that has access to the unknown kernel.
We believe this result opens up avenues towards giving convergence guarantees for other sequential decision algorithms without oracle knowledge of the hypothesis space.

Acknowledgements

We thank Andreea Musat and Victor Armegioiu for fruitful discussions and their contributions to an earlier research path of this project. In addition, we thank Scott Sussex and Anastasia Makarova for their feedback on the draft of this paper. We appreciate Felix Schur and Adrian Müller's thorough feedback on the final manuscript. This research was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program, grant agreement no. 815943. Jonas Rothfuss was supported by an Apple Scholars in AI/ML fellowship.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2011.

Bach, F., Jenatton, R., Mairal, J., Obozinski, G., et al. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 2012.

Bach, F. R. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 2008.

Bach, F. R., Lanckriet, G. R., and Jordan, M. I. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.

Bastani, H. and Bayati, M. Online decision making with high-dimensional covariates. Operations Research, 2020.

Basu, S., Kveton, B., Zaheer, M., and Szepesvári, C. No regrets for learning the prior in bandits. Advances in Neural Information Processing Systems, 34, 2021.

Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. Safe model-based reinforcement learning with stability guarantees. Advances in Neural Information Processing Systems, 2017.

Berkenkamp, F., Schoellig, A. P., and Krause, A. No-regret Bayesian optimization with unknown hyperparameters. Journal of Machine Learning Research, 2019.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 2009.

Bischl, B., Casalicchio, G., Feurer, M., Hutter, F., Lang, M., Mantovani, R., van Rijn, J. N., and Vanschoren, J. OpenML benchmarking suites and the OpenML100. arXiv preprint arXiv:1708.03731, 2017.

Bogunovic, I. and Krause, A. Misspecified Gaussian process bandit optimization. In Conference on Neural Information Processing Systems (NeurIPS), 2021.

Boutilier, C., Hsu, C.-W., Kveton, B., Mladenov, M., Szepesvári, C., and Zaheer, M. Differentiable meta-learning of bandit policies. In Advances in Neural Information Processing Systems, 2020.

Boyd, S., Boyd, S. P., and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Bühlmann, P. and Van De Geer, S. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.

Bunea, F., Lederer, J., and She, Y. The group square-root lasso: Theoretical properties and fast algorithms. IEEE Transactions on Information Theory, 2013.

Cavalier, L., Golubev, G., Picard, D., and Tsybakov, A. Oracle inequalities for inverse problems. The Annals of Statistics, 2002.

Cella, L. and Pontil, M. Multi-task and meta-learning with sparse linear bandits. In Uncertainty in Artificial Intelligence. PMLR, 2021.

Cella, L., Lazaric, A., and Pontil, M. Meta-learning with stochastic linear bandits. In International Conference on Machine Learning. PMLR, 2020.

Chowdhury, S. R. and Gopalan, A.
On kernelized multi-armed bandits. In International Conference on Machine Learning. PMLR, 2017.

Cristianini, N., Kandola, J., Elisseeff, A., and Shawe-Taylor, J. On kernel target alignment. In Innovations in Machine Learning. Springer, 2006.

Curi, S., Berkenkamp, F., and Krause, A. Efficient model-based reinforcement learning through optimistic policy search and planning. Advances in Neural Information Processing Systems, 2020.

Evgeniou, T. and Pontil, M. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.

Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 2010.

Gneiting, T., Balabdaoui, F., and Raftery, A. E. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2007.

Gönen, M. and Alpaydın, E. Multiple kernel learning algorithms. The Journal of Machine Learning Research, 2011.

Hao, B., Lattimore, T., and Wang, M. High-dimensional sparse linear bandits. Advances in Neural Information Processing Systems, 2020.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 2018.

Javanmard, A. and Montanari, A. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 2014.

Kakade, S., Krishnamurthy, A., Lowrey, K., Ohnishi, M., and Sun, W. Information theoretic regret bounds for online nonlinear control. Advances in Neural Information Processing Systems, 2020.

Kim, G.-S. and Paik, M. C. Doubly-robust lasso bandit. Advances in Neural Information Processing Systems, 2019.

Kloft, M., Brefeld, U., Sonnenburg, S., and Zien, A. Lp-norm multiple kernel learning. The Journal of Machine Learning Research, 2011.

Koltchinskii, V. and Yuan, M. Sparse recovery in large ensembles of kernel machines. In Proceedings of the Conference on Learning Theory, 2008.

Kühn, D., Probst, P., Thomas, J., and Bischl, B. Automatic exploration of machine learning experiments on OpenML. arXiv preprint arXiv:1806.10961, 2018.

Kveton, B., Mladenov, M., Hsu, C.-W., Zaheer, M., Szepesvári, C., and Boutilier, C. Meta-learning bandit policies by gradient ascent. arXiv preprint arXiv:2006.05094, 2020.

Liu, H. and Zhang, J. Estimation consistency of the group lasso and its applications. In Artificial Intelligence and Statistics. PMLR, 2009.

Lounici, K., Pontil, M., Van De Geer, S., and Tsybakov, A. B. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 2011.

Massias, M., Gramfort, A., and Salmon, J. Celer: a fast solver for the lasso with dual extrapolation. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Ong, C. S., Smola, A. J., and Williamson, R. C. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 2005.

Perrone, V., Jenatton, R., Seeger, M. W., and Archambeau, C. Scalable hyperparameter transfer learning. In Advances in Neural Information Processing Systems, 2008.

Rahimi, A., Recht, B., et al. Random features for large-scale kernel machines. In NIPS, 2007.

Rothfuss, J., Fortuin, V., Josifoski, M., and Krause, A. PACOH: Bayes-optimal meta-learning with PAC-guarantees. In International Conference on Machine Learning. PMLR, 2021a.

Rothfuss, J., Heyn, D., Chen, J., and Krause, A.
Meta-learning reliable priors in the function space. In Advances in Neural Information Processing Systems, 2021b.

Russo, D. and Van Roy, B. Learning to optimize via posterior sampling. Mathematics of Operations Research, 2014.

Sessa, P. G., Bogunovic, I., Kamgarpour, M., and Krause, A. Learning to play sequential games versus unknown opponents. Advances in Neural Information Processing Systems, 2020.

Simchowitz, M., Tosh, C., Krishnamurthy, A., Hsu, D. J., Lykouris, T., Dudik, M., and Schapire, R. E. Bayesian decision-making under misspecified priors with applications to meta-learning. Advances in Neural Information Processing Systems, 2021.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, 2010.

Vakili, S., Khezeli, K., and Picheny, V. On information gain and regret bounds in Gaussian process bandits. In International Conference on Artificial Intelligence and Statistics. PMLR, 2021.

Van de Geer, S., Bühlmann, P., and Zhou, S. The adaptive and the thresholded lasso for potentially misspecified models (and a lower bound for the lasso). Electronic Journal of Statistics, 2011.

Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.

Wang, X., Wei, M., and Yao, T. Minimax concave penalized multi-armed bandit model with high-dimensional covariates. In International Conference on Machine Learning. PMLR, 2018a.

Wang, Z. and de Freitas, N. Theoretical analysis of Bayesian optimisation with unknown Gaussian process hyper-parameters. arXiv preprint arXiv:1406.7758, 2014.

Wang, Z., Li, C., Jegelka, S., and Kohli, P. Batched high-dimensional Bayesian optimization via structural kernel learning. In International Conference on Machine Learning. PMLR, 2017.

Wang, Z., Kim, B., and Kaelbling, L. P. Regret bounds for meta Bayesian optimization with an unknown Gaussian process prior. arXiv preprint arXiv:1811.09558, 2018b.

Wynne, G., Briol, F.-X., and Girolami, M. Convergence guarantees for Gaussian process means with misspecified likelihoods and smoothness. Journal of Machine Learning Research, 2021.

Zhao, P. and Yu, B. On model selection consistency of lasso. The Journal of Machine Learning Research, 2006.

Zhou, D., Li, L., and Gu, Q. Neural contextual bandits with UCB-based exploration. In International Conference on Machine Learning. PMLR, 2020.

Meta-Learning Hypothesis Spaces for Sequential Decision-making: Supplementary Material

A. Details of the Main Result

Table 1. Notation Guide

$k^*$ : true (unknown) kernel function
$\hat{k}$ : meta-learned kernel
$p$ : total number of candidate base kernels
$k_j$ : base kernels that construct $k^*$ and $\hat{k}$, $1 \le j \le p$
$\eta_j^*$ : coefficient of $k_j$ in the construction of $k^*$, i.e., $k^*(\cdot, \cdot) = \sum_{j=1}^{p} \eta_j^* k_j(\cdot, \cdot)$
$\hat{\eta}_j$ : coefficient of $k_j$ in the construction of $\hat{k}$, i.e., $\hat{k}(\cdot, \cdot) = \sum_{j=1}^{p} \hat{\eta}_j k_j(\cdot, \cdot)$
$\eta^*$ : the vector $(\eta_1^*, \ldots, \eta_j^*, \ldots, \eta_p^*) \in \mathbb{R}^p$
$\hat{\eta}$ : the vector $(\hat{\eta}_1, \ldots, \hat{\eta}_j, \ldots, \hat{\eta}_p) \in \mathbb{R}^p$
$J_{k^*}$ : the set $\{1 \le j \le p\ \text{s.t.}\ \eta_j^* \neq 0\}$
$J_{\hat{k}}$ : the set $\{1 \le j \le p\ \text{s.t.}\ \hat{\eta}_j \neq 0\}$
$\phi_j(\cdot)$ : feature map for $k_j$, i.e., $k_j(x, x') = \phi_j^T(x)\phi_j(x')$
$\phi(\cdot)$ : $\left(\sqrt{\eta_1^*}\,\phi_1^T(x), \ldots, \sqrt{\eta_p^*}\,\phi_p^T(x)\right)^T$, i.e., the feature map for $k^*$
$d_0$ : dimension of the input domain, $\mathcal{X} \subset \mathbb{R}^{d_0}$
$d_j$ : dimension of a feature map, $\phi_j(x) \in \mathbb{R}^{d_j}$
$d_{\max}$ : $\max_{1 \le j \le p} d_j$
$d$ : $\sum_{j=1}^{p} d_j$
$d^*$ : $\sum_{j \in J_{k^*}} d_j$
$\hat{d}$ : $\sum_{j \in J_{\hat{k}}} d_j$
$n$ : number of samples in each task $s$ in the meta-data
$m$ : number of tasks in the meta-data
$f_s$ : target function of task $s$
$\beta_s^*$ : true coefficient vector for task $s$: $f_s(x) = \phi^T(x)\beta_s^*$, in $\mathbb{R}^d$
$\hat{\beta}_s$ : estimate of $\beta_s^*$
$\beta_s^{*(j)}$ : sub-vector of $\beta_s^*$ corresponding to kernel $k_j$, in $\mathbb{R}^{d_j}$
$\hat{\beta}_s^{(j)}$ : estimate of $\beta_s^{*(j)}$
$\beta^*$ : $(\beta_1^{*T}, \ldots, \beta_m^{*T})^T$, in $\mathbb{R}^{md}$
$\hat{\beta}$ : estimate of $\beta^*$
$\beta^{*(j)}$ : sub-vector of $\beta^*$ corresponding to kernel $k_j$, in $\mathbb{R}^{md_j}$
$\hat{\beta}^{(j)}$ : estimate of $\beta^{*(j)}$

A.1. RKHS Refresher

Here we present a compact reminder of RKHS basics, for the sake of completeness and to clarify our notation. We work in a finite-dimensional regime, which could also be described with Euclidean vector algebra; nevertheless, we use the RKHS notation as it gives a powerful framework and hides away the vector algebra. This section is mainly based on Wainwright (2019). For a positive semi-definite kernel function $k$ over some set $\mathcal{X} \times \mathcal{X}$, the corresponding unique Reproducing Kernel Hilbert Space can be constructed as

$$\mathcal{H}_k = \left\{ f : \mathcal{X} \to \mathbb{R} \ \middle|\ f(\cdot) = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot),\ n \in \mathbb{N},\ (x_i)_{i=1}^{n} \subset \mathcal{X},\ \alpha \in \mathbb{R}^n \right\},$$

equipped with the dot product $\langle f, f' \rangle_k = \sum_{i,j} \alpha_i \alpha'_j k(x_i, x_j)$. We limit $\mathcal{X}$ to compact sets, and only consider Mercer kernels, i.e., continuous kernel functions that satisfy the Hilbert-Schmidt condition

$$\int_{\mathcal{X} \times \mathcal{X}} k^2(x, x')\, d\mu(x)\, d\mu(x') < \infty,$$

where $\mu$ is a non-negative measure over $\mathcal{X}$. Mercer's theorem states that under these assumptions, the kernel operator has a sequence of orthonormal eigenfunctions $(\phi_r)_{r \ge 1}$ and non-negative eigenvalues $(\eta_r)_{r \ge 1}$, defined via

$$\int_{\mathcal{X}} k(x, x')\phi_r(x')\, d\mu(x') = \eta_r \phi_r(x). \qquad \text{(A.1)}$$

Moreover, $k$ can be written as their linear combination, $k(x, x') = \sum_r \eta_r \phi_r(x)\phi_r(x')$, or as a non-negative combination of base kernels $k_r(x, x') = \phi_r(x)\phi_r(x')$, namely $k(x, x') = \sum_r \eta_r k_r(x, x')$. It immediately follows that the unique RKHS corresponding to $k$ takes the form

$$\mathcal{H}_k = \left\{ f : \mathcal{X} \to \mathbb{R} \ \middle|\ f(\cdot) = \sum_{r \ge 1} \beta_r \phi_r(\cdot),\ \sum_{r : \eta_r \neq 0} \beta_r^2/\eta_r < \infty \right\},$$

and the inner product the following form,

$$\langle f, g \rangle_k = \sum_{r : \eta_r \neq 0} \frac{\langle f, \phi_r \rangle_2\, \langle g, \phi_r \rangle_2}{\eta_r},$$

where $\langle \cdot, \cdot \rangle_2$ denotes the inner product in $L_2(\mathcal{X})$. It is then implied that $\|f\|_k^2 = \sum_{r : \eta_r \neq 0} \beta_r^2/\eta_r$. Lastly, we define $\phi(x) = (\sqrt{\eta_r}\,\phi_r(x))_{r \ge 1}$ to be the feature map corresponding to $k$. In this paper we refer to the number of non-zero eigenvalues as the dimension of the kernel. Note that under this convention, most kernel functions used in practice, e.g., the RBF kernel or the Matérn family, are infinite-dimensional.

[Figure 7. Visual guide for the Group Lasso formulation. The red shade shows the group of coefficients which correspond to the effect of kernel $k_j$, whose features $\phi_j^T(x_{s,1}), \ldots, \phi_j^T(x_{s,n})$ recur in every task block. The green shade demonstrates how features from each task come together on the diagonal of the multi-task feature matrix; the coefficients that belong to one task are grouped together.]

A.2. Proof of Proposition 4.1

By Equation (6) we may write a parametric equivalent of Problem (7) in terms of the feature maps $\phi_j$:

$$\min_{0 \le \eta,\ \forall s : \beta_s}\ \frac{1}{mn} \sum_{s=1}^{m} \sum_{i=1}^{n} \Big(y_{s,i} - \sum_{j=1}^{p} \sqrt{\eta_j}\, \phi_j^T(x_{s,i})\beta_s^{(j)}\Big)^2 + \frac{\lambda}{2}\Big[\sum_{s=1}^{m} \sum_{j=1}^{p} \|\beta_s^{(j)}\|_2^2 + \|\eta\|_1\Big].$$

This problem is jointly convex in $\eta$ and $(\beta_s)_{s \le m}$, and has an optimal solution (Kloft et al., 2011). Renaming the variable $\beta_s^{(j)} \leftarrow \beta_s^{(j)}/\sqrt{\eta_j}$ gives the following equivalent problem:

$$\min_{0 \le \eta,\ \forall s : \beta_s}\ \frac{1}{mn} \sum_{s=1}^{m} \sum_{i=1}^{n} \Big(y_{s,i} - \sum_{j=1}^{p} \phi_j^T(x_{s,i})\beta_s^{(j)}\Big)^2 + \frac{\lambda}{2}\Big[\sum_{j=1}^{p} \frac{\sum_{s=1}^{m} \|\beta_s^{(j)}\|_2^2}{\eta_j} + \|\eta\|_1\Big].$$

There is no constraint connecting the two variables, and renaming $\beta_s^{(j)}$ does not affect the optimization problem with respect to $\eta$.
Therefore, $\hat{\eta}$ is also an optimum of this problem. Let $(\hat{\eta}, \hat{\beta})$ denote the solution to the problem above. We show that $\hat{\eta}$ has a closed-form expression in terms of $\hat{\beta}$. This observation allows us to reduce the joint optimization problem to an equivalent problem which is only over $(\beta_s)$. We use the $\eta$-trick introduced in Bach et al. (2004). The authors observe that for any two scalar variables $w$ and $v$,

$$2|w| = \min_{v \ge 0} \frac{w^2}{v} + v, \quad \text{with minimizer}\ \hat{v} = |w|.$$

Applying this trick with $w = \|\beta^{(j)}\|_2$ and $v = \eta_j$ for all $j \le p$ gives

$$\min_{\eta \ge 0}\ \frac{\lambda}{2} \sum_{j=1}^{p} \left(\frac{\|\beta^{(j)}\|_2^2}{\eta_j} + \eta_j\right) = \lambda \sum_{j=1}^{p} \|\beta^{(j)}\|_2,$$

which results in the following equivalent problem:

$$\min_{\forall s : \beta_s}\ \frac{1}{mn} \sum_{s=1}^{m} \sum_{i=1}^{n} \Big(y_{s,i} - \sum_{j=1}^{p} \phi_j^T(x_{s,i})\beta_s^{(j)}\Big)^2 + \lambda \sum_{j=1}^{p} \|\beta^{(j)}\|_2.$$

Note that, by definition of $\beta^{(j)}$, the group penalty satisfies $\sqrt{\sum_{s=1}^{m} \|\beta_s^{(j)}\|_2^2} = \|\beta^{(j)}\|_2$. Finally, by simply using the vectorized notation we get Equation (8), concluding the proof.

B. Proof of Statements in Section 4

For the first three lemmas in this section we follow the technique in Lounici et al. (2011) and occasionally use classical ideas established in Bühlmann & Van De Geer (2011).

Notations and naming conventions. When $X$ is a matrix, $\|X\|_2$ and $\|X\|_F$ denote its spectral and Frobenius norms, respectively. Consider the multi-task coefficient vector $\beta \in \mathbb{R}^{md}$, and the sub-vector $\beta^{(j)} \in \mathbb{R}^{md_j}$ which denotes the coefficients corresponding to kernel $k_j$. Throughout this proof, we use the convention "group $j$" to refer to the set of indices of $\beta$ which indicate $\beta^{(j)}$. Similarly, we let $\Phi^{(j)}$ denote the $mn \times md_j$ sub-matrix which only has the features coming from group $j$. Lastly, let $\Psi := \Phi^T\Phi/mn$; then $\Psi^{(j)} = (\Phi^{(j)})^T\Phi^{(j)}/mn$ indicates the $md_j \times md_j$ sub-matrix induced by group $j$.

B.1. Proof of Theorem 4.3

Recall the vectorized formulation of the META-KEL loss,

$$L(\beta) = \frac{1}{mn}\|y - \Phi\beta\|_2^2 + \lambda \sum_{j=1}^{p} \|\beta^{(j)}\|_2.$$

Let $\varepsilon_s = (\varepsilon_{s,i})_{i \le n}$ denote the error for task $s$ and $\varepsilon \in \mathbb{R}^{mn}$ the stacked multi-task error vector. Using $y = \Phi\beta^* + \varepsilon$, we may decompose the loss into a deterministic and a random part. The term $2\varepsilon^T\Phi(\hat{\beta} - \beta^*)/mn$ is the random one, and we will refer to it as the empirical process. The first typical step in bounding the estimation error of Lasso estimators is showing that the empirical process, which comes from the noise in the observed values of $y$, does not play a drastic role. More formally, let $A_j = \left\{\|(\Phi^T\varepsilon)^{(j)}\|_2/mn \le \lambda/4\right\}$ denote the event that the image of the noise in the feature space of $\phi_j$ is dominated by the regularization term on $\beta^{(j)}$. In Lemma B.1 we show that $\bigcap_{j=1}^{p} A_j$ happens with high probability if $\lambda$ is set properly.

Lemma B.1 (Regularization term dominates the empirical process). Set $0 < \delta < 1$. Consider the random event $A = \bigcap_{j=1}^{p} A_j$. Then $A$ happens with probability greater than $1 - \delta$ if

$$\lambda \ge \frac{4\sigma}{\sqrt{mn}} \sqrt{1 + \frac{2}{m}\left(2\log(2p/\delta) + \sqrt{m\, d_{\max}\, \log(2p/\delta)}\right)},$$

where $d_{\max} = \max_{1 \le j \le p} d_j$.

We now show that if the empirical process is controlled by regularization, i.e., if $\lambda$ is set large enough, then $\hat{\beta} = \arg\min_\beta L(\beta)$ has favorable properties.

Lemma B.2 (Conditional properties of $\hat{\beta}$). Assume that event $A$ happens. Then for any solution $\hat{\beta}$ of Problem (8) and all $\beta \in \mathbb{R}^{md}$, the following hold:

$$\frac{1}{mn}\|\Phi(\hat{\beta} - \beta^*)\|_2^2 + \frac{\lambda}{2}\sum_{j=1}^{p}\|\hat{\beta}^{(j)} - \beta^{(j)}\|_2 \le \frac{1}{mn}\|\Phi(\beta - \beta^*)\|_2^2 + 2\lambda \sum_{j : \beta^{(j)} \neq 0} \min\left(\|\beta^{(j)}\|_2,\ \|\hat{\beta}^{(j)} - \beta^{(j)}\|_2\right), \qquad \text{(B.1)}$$

$$\left|\left\{j : \hat{\beta}^{(j)} \neq 0\right\}\right| \le \frac{16}{(mn\lambda)^2}\, \|\Phi^T\Phi(\hat{\beta} - \beta^*)\|_2^2. \qquad \text{(B.2)}$$

Note that, by the meta-data-generating model (Section 3), $J_{k^*} = \{j : \beta^{*(j)} \neq 0\}$.

Lemma B.3 (Complete Variable Screening). Assume $|J_{k^*}| \le s$ and set $0 < \delta \le 1$.
Define

$$\epsilon(n, m) = \frac{32\sigma s}{\kappa^2\sqrt{mn}} \sqrt{1 + \frac{2}{m}\left(2\log(2p/\delta) + \sqrt{m\, d_{\max}\, \log(2p/\delta)}\right)}.$$

Under Assumption 4.2 with $\kappa = \kappa(s)$, if $\lambda$ is chosen according to Lemma B.1, then with probability greater than $1 - \delta$,

$$\max_{1 \le j \le p}\|\hat{\beta}^{(j)} - \beta^{*(j)}\|_2 \le \epsilon(n, m), \qquad \text{(B.3)}$$

and if in addition $\min_{j \in J_{k^*}} \|\beta^{*(j)}\|_2 \ge c_1$, then with the same probability, for all $j \in J_{k^*}$,

$$\|\hat{\beta}^{(j)}\|_2 \ge c_1 - \epsilon(n, m). \qquad \text{(B.4)}$$

We now turn to the first claim made in Theorem 4.3, and prove that $\mathcal{H}_{k^*} \subseteq \mathcal{H}_{\hat{k}}$. Since $\hat{\eta}_j \ge 0$ and the $k_j$ are Mercer, $\hat{k}$ is also Mercer and corresponds to an RKHS, which we have been referring to as $\mathcal{H}_{\hat{k}}$. Consider the RKHS $\mathcal{H}_{k^*}$; since $|J_{k^*}|$ and each of the $d_j$ are finite,

$$\mathcal{H}_{k^*} = \left\{f : f(\cdot) = \sum_{j \in J_{k^*}} \sqrt{\eta_j^*}\, \beta_j^T \phi_j(\cdot),\ \beta_j \in \mathbb{R}^{d_j},\ \|\beta_j\| < \infty \right\}.$$

Therefore, $f \in \mathcal{H}_{k^*}$ if and only if it is in the finite span of the active features, defined as

$$\mathrm{FinSpan}\left(\{\phi_j : j \in J_{k^*}\}\right) = \left\{f : f(\cdot) = \sum_{j \in J_{k^*}} \beta_j^T \phi_j(\cdot),\ \beta_j \in \mathbb{R}^{d_j},\ \|\beta_j\| < \infty \right\}.$$

Lemma B.3 states that $\|\hat{\beta}^{(j)}\|_2 \ge c_1 - \epsilon(n, m)$ with probability greater than $1 - \delta$ for $j \in J_{k^*}$. Therefore, for any $j \in J_{k^*}$, we get $\hat{\eta}_j \ge 1 - \epsilon(n, m)/c_1$, since we had set $\hat{\eta}_j = \|\hat{\beta}^{(j)}\|_2/c_1$. If $c_1 > \epsilon(n, m)$, then $\hat{\eta}_j > 0$ and $j \in J_{\hat{k}}$, implying $J_{k^*} \subseteq J_{\hat{k}}$. Hence, under the assumptions of the theorem, with probability greater than $1 - \delta$,

$$\mathcal{H}_{k^*} = \mathrm{FinSpan}\left(\{\phi_j : j \in J_{k^*}\}\right) \overset{\text{w.h.p.}}{\subseteq} \mathrm{FinSpan}\left(\{\phi_j : j \in J_{\hat{k}}\}\right) = \mathcal{H}_{\hat{k}}.$$

The next lemma bounds the $\hat{k}$-norm of functions contained in $\mathcal{H}_{k^*}$, concluding the proof of Theorem 4.3.

Lemma B.4 (Bounding the $\hat{k}$-norm). Set $0 < \delta \le 1$, choose $\lambda$ according to Lemma B.1, and define $\epsilon(n, m)$ according to Lemma B.3. If $|J_{k^*}| \le s$ and Assumption 4.2 holds with $\kappa = \kappa(s)$, then under Assumption 3.1, for all $f \in \mathcal{H}_{k^*}$ with $\|f\|_{k^*} \le B$,

$$\|f\|_{\hat{k}} \le B\left(1 + \frac{\epsilon(n, m)}{2c_1} + o(\epsilon(n, m))\right). \qquad \text{(B.5)}$$

B.2. Proof of Proposition 4.4

Under the assumptions of the proposition, event $A$ happens with probability greater than $1 - \delta$. From Equation (B.2) in Lemma B.2,

$$|J_{\hat{k}}| \le \frac{16}{(mn\lambda)^2}\|\Phi^T\Phi(\hat{\beta} - \beta^*)\|_2^2 \le \frac{16}{(mn\lambda)^2}\|\Phi(\hat{\beta} - \beta^*)\|_2^2,$$

and by Equation (B.10),

$$\frac{1}{\sqrt{mn}}\|\Phi(\hat{\beta} - \beta^*)\|_2 \le \frac{2\lambda\sqrt{s}}{\kappa},$$

which gives

$$|J_{\hat{k}}| \le \frac{64\, s}{mn\, \kappa^2(s)}.$$

Therefore, if $mn > 64s/(p\,\kappa^2(s))$, then $|J_{\hat{k}}| \le p = |J_{k_{\text{full}}}|$ with probability greater than $1 - \delta$, and via a similar argument as given in the proof of Theorem 4.3,

$$\mathcal{H}_{\hat{k}} = \mathrm{FinSpan}\left(\{\phi_j : j \in J_{\hat{k}}\}\right) \overset{\text{w.h.p.}}{\subseteq} \mathrm{FinSpan}\left(\{\phi_j : j \in J_{k_{\text{full}}}\}\right) = \mathcal{H}_{k_{\text{full}}}.$$

B.3. Proof of Lemmas used in Section B.1

This section presents the proofs of the helper lemmas introduced above.

Proof of Lemma B.1. This proof follows a treatment of the empirical process similar to Lemma 3.1 of Lounici et al. (2011). Since the $\varepsilon_{s,i}$ are i.i.d. zero-mean sub-Gaussian variables, we observe that

$$P(A_j^c) = P\left(\frac{1}{(mn)^2}\varepsilon^T\Phi^{(j)}(\Phi^{(j)})^T\varepsilon > \frac{\lambda^2}{16}\right) = P\left(\sum_{i=1}^{mn} v_i(z_i^2 - 1) > 2\alpha\|v\|_2\right),$$

where the $z_i$ are i.i.d. sub-Gaussian variables with variance proxy 1, the $v_i$ denote the eigenvalues of $\Phi^{(j)}(\Phi^{(j)})^T/mn$, $v$ is the vector of these eigenvalues, and

$$\alpha = \frac{mn\lambda^2/(16\sigma^2) - \mathrm{Tr}(\Psi^{(j)})}{2\|\Psi^{(j)}\|_F}.$$

Equation 27 of Cavalier et al. (2002) yields a tail inequality for $P\big(\sum_{i=1}^{mn}(z_i^2 - 1)v_i \ge 2\alpha\|v\|_2\big)$ in terms of $\alpha$ and $\|v\|_\infty/\|v\|_2$. We choose $\lambda$ such that the right-hand side is bounded by $\delta/p$. By definition of $v$ we have $\|v\|_\infty/\|v\|_2 = \|\Psi^{(j)}\|_2/\|\Psi^{(j)}\|_F$. Then, for $A_j^c$ to happen with probability smaller than $\delta/p$, it suffices that

$$\lambda \ge \frac{4\sigma}{\sqrt{mn}}\sqrt{\mathrm{Tr}(\Psi^{(j)}) + 2\|\Psi^{(j)}\|_2\left(2\log(2p/\delta) + \sqrt{m d_j \log(2p/\delta)}\right)}.$$

Then, by a union bound, $A$ happens with probability greater than $1 - \delta$ if

$$\lambda \ge \max_j \frac{4\sigma}{\sqrt{mn}}\sqrt{\mathrm{Tr}(\Psi^{(j)}) + 2\|\Psi^{(j)}\|_2\left(2\log(2p/\delta) + \sqrt{m d_j \log(2p/\delta)}\right)}.$$

Since the base kernels are normalized, we may bound the norm and trace of $\Psi^{(j)}$:

$$\mathrm{Tr}(\Psi^{(j)}) = \frac{1}{mn}\sum_{s=1}^{m}\mathrm{Tr}\left((\Phi_s^{(j)})^T\Phi_s^{(j)}\right) = \frac{1}{mn}\sum_{s=1}^{m}\sum_{i=1}^{n}\phi_j^T(x_{s,i})\phi_j(x_{s,i}) = \frac{1}{mn}\sum_{s=1}^{m}\sum_{i=1}^{n} k_j(x_{s,i}, x_{s,i}) \le 1.$$

Similarly, $\|\Psi^{(j)}\|_2 \le \frac{1}{mn}\max_s \sum_{i=1}^{n} k_j(x_{s,i}, x_{s,i}) \le 1/m$, thereby concluding the proof.

Proof of Lemma B.2. For proving this lemma we essentially only use Cauchy-Schwarz and the triangle inequality, together with the KKT optimality conditions for $L$.
For any $\beta$, since $\hat{\beta}$ is the minimizer of $L$ and due to the data-generating model (Eq. 4), we have

$$\frac{1}{mn}\|\Phi(\hat{\beta} - \beta^*)\|_2^2 \le \frac{1}{mn}\|\Phi(\beta - \beta^*)\|_2^2 + \frac{2}{mn}\varepsilon^T\Phi(\hat{\beta} - \beta) + \lambda\sum_{j=1}^{p}\left(\|\beta^{(j)}\|_2 - \|\hat{\beta}^{(j)}\|_2\right).$$

By Cauchy-Schwarz and the assumption that $A$ happens,

$$\frac{2}{mn}\varepsilon^T\Phi(\hat{\beta} - \beta) \le \frac{2}{mn}\sum_{j=1}^{p}\|(\Phi^T\varepsilon)^{(j)}\|_2\, \|\hat{\beta}^{(j)} - \beta^{(j)}\|_2 \le \frac{\lambda}{2}\sum_{j=1}^{p}\|\hat{\beta}^{(j)} - \beta^{(j)}\|_2,$$

and thereby

$$\frac{1}{mn}\|\Phi(\hat{\beta} - \beta^*)\|_2^2 + \frac{\lambda}{2}\sum_{j=1}^{p}\|\hat{\beta}^{(j)} - \beta^{(j)}\|_2 \le \frac{1}{mn}\|\Phi(\beta - \beta^*)\|_2^2 + \lambda\sum_{j=1}^{p}\left(\|\hat{\beta}^{(j)} - \beta^{(j)}\|_2 + \|\beta^{(j)}\|_2 - \|\hat{\beta}^{(j)}\|_2\right),$$

which gives Inequality (B.1). By the KKT optimality conditions for convex losses (Boyd et al., 2004), $\hat{\beta}$ is a minimizer of $L$ if and only if $0 \in \partial L(\hat{\beta})$, where $\partial L(\hat{\beta})$ denotes the sub-gradient of the loss evaluated at $\hat{\beta}$. Therefore $\hat{\beta}$ satisfies

$$\frac{2}{mn}\left(\Phi^T(y - \Phi\hat{\beta})\right)^{(j)} = \lambda\,\frac{\hat{\beta}^{(j)}}{\|\hat{\beta}^{(j)}\|_2}, \quad \text{if}\ \hat{\beta}^{(j)} \neq 0, \qquad \text{(B.6)}$$

$$\frac{2}{mn}\left\|\left(\Phi^T(y - \Phi\hat{\beta})\right)^{(j)}\right\|_2 \le \lambda, \quad \text{if}\ \hat{\beta}^{(j)} = 0. \qquad \text{(B.7)}$$

As for Inequality (B.2), conditioned on event $A$, Equation (4) together with the KKT condition (B.6) gives, for all $1 \le j \le p$ with $\hat{\beta}^{(j)} \neq 0$,

$$\frac{1}{mn}\left\|\left(\Phi^T\Phi(\hat{\beta} - \beta^*)\right)^{(j)}\right\|_2 \ge \frac{\lambda}{4}.$$

Following the analysis of Lounici et al., we conclude

$$\left|\left\{j : \hat{\beta}^{(j)} \neq 0\right\}\right| \le \frac{16}{(mn\lambda)^2}\sum_{j : \hat{\beta}^{(j)} \neq 0}\left\|\left(\Phi^T\Phi(\hat{\beta} - \beta^*)\right)^{(j)}\right\|_2^2 \le \frac{16}{(mn\lambda)^2}\left\|\Phi^T\Phi(\hat{\beta} - \beta^*)\right\|_2^2 \le \frac{16}{(mn\lambda)^2}\left\|\Phi(\hat{\beta} - \beta^*)\right\|_2^2,$$

since the kernels $k_j$ are normalized and hence $\|\Phi\|_2 \le \max_j \sup_x k_j(x, x) \le 1$.

Proof of Lemma B.3. Let $\beta_{J_{k^*}} = (\beta^{(j)})_{j \in J_{k^*}}$ denote the sub-vector of coefficients that corresponds to all the active groups. Due to Equation (B.1) with $\beta = \beta^*$, conditioned on event $A$ we have

$$\frac{1}{mn}\|\Phi(\hat{\beta} - \beta^*)\|_2^2 + \frac{\lambda}{2}\sum_{j=1}^{p}\|\hat{\beta}^{(j)} - \beta^{*(j)}\|_2 \le 2\lambda\sum_{j \in J_{k^*}}\|\hat{\beta}^{(j)} - \beta^{*(j)}\|_2 \le 2\lambda\sqrt{s}\sqrt{\sum_{j \in J_{k^*}}\|\hat{\beta}^{(j)} - \beta^{*(j)}\|_2^2} = 2\lambda\sqrt{s}\,\|\hat{\beta}_{J_{k^*}} - \beta^*_{J_{k^*}}\|_2, \qquad \text{(B.8)}$$

where the second inequality follows from Cauchy-Schwarz together with the lemma's assumption $|J_{k^*}| \le s$. By again using Equation (B.1) we get $\sum_{j=1}^{p}\|\hat{\beta}^{(j)} - \beta^{*(j)}\|_2 \le 4\sum_{j \in J_{k^*}}\|\hat{\beta}^{(j)} - \beta^{*(j)}\|_2$, and thereby

$$\sum_{j \notin J_{k^*}}\|\hat{\beta}^{(j)} - \beta^{*(j)}\|_2 \le 3\sum_{j \in J_{k^*}}\|\hat{\beta}^{(j)} - \beta^{*(j)}\|_2.$$

Using Assumption 4.2, this inequality indicates that

$$\|\hat{\beta}_{J_{k^*}} - \beta^*_{J_{k^*}}\|_2 \le \frac{1}{\kappa\sqrt{mn}}\|\Phi(\hat{\beta} - \beta^*)\|_2, \qquad \text{(B.9)}$$

which together with Equation (B.8) gives

$$\frac{1}{\sqrt{mn}}\|\Phi(\hat{\beta} - \beta^*)\|_2 \le \frac{2\lambda\sqrt{s}}{\kappa}. \qquad \text{(B.10)}$$

The next chain of inequalities proves the first statement of the lemma (Equation (B.3)). For all $j$,

$$\|\hat{\beta}^{(j)} - \beta^{*(j)}\|_2 \le \sum_{j'=1}^{p}\|\hat{\beta}^{(j')} - \beta^{*(j')}\|_2 \le 4\sum_{j' \in J_{k^*}}\|\hat{\beta}^{(j')} - \beta^{*(j')}\|_2 \le 4\sqrt{s}\,\|\hat{\beta}_{J_{k^*}} - \beta^*_{J_{k^*}}\|_2 \overset{\text{(B.9)}}{\le} \frac{4\sqrt{s}}{\kappa\sqrt{mn}}\|\Phi(\hat{\beta} - \beta^*)\|_2 \overset{\text{(B.10)}}{\le} \frac{8\lambda s}{\kappa^2} = \epsilon(n, m),$$

for $\lambda$ set to the value given in Lemma B.1. Note that the analysis here is carried out conditional on event $A$; from Lemma B.1, $A$ happens with probability greater than $1 - \delta$ if $\lambda$ is set according to the statement of the lemma. Recall that from the Beta-min condition (Assumption 3.1) we have $\|\beta^{*(j)}\|_2 \ge c_1$, which together with Equation (B.3) concludes the proof.

Proof of Lemma B.4. Let $f \in \mathcal{H}_{k^*}$ with $\|f\|_{k^*}^2 \le B^2$. Then, by the construction of $\mathcal{H}_{k^*}$ (see Appendix A.1),

$$f(\cdot) = \sum_{j \in J_{k^*}} \sqrt{\eta_j^*}\,\phi_j^T(\cdot)\beta^{(j)}, \qquad \|f\|_{k^*}^2 = \sum_{j \in J_{k^*}} \|\beta^{(j)}\|_2^2.$$

We calculate the $\hat{k}$-norm of functions that lie in $\mathcal{H}_{k^*}$. Let $I = \{1 \le j \le p : \eta_j^* \neq 0,\ \hat{\eta}_j \neq 0\}$. Then

$$\|f\|_{\hat{k}}^2 = \sum_{j \in I}\sum_{r \le d_j}\frac{\left(\langle f, \phi_{j,r}\rangle_2\right)^2}{\hat{\eta}_j} = \sum_{j \in I}\frac{\eta_j^*}{\hat{\eta}_j}\,\|\beta^{(j)}\|_2^2,$$

where $\phi_{j,r}$ denotes the $r$-th feature in the feature map $\phi_j$, $\beta_r^{(j)}$ the $r$-th element of the vector $\beta^{(j)}$, and $\langle \cdot, \cdot \rangle_2$ is the inner product in the $L_2(\mathcal{X})$ space, so that $\langle f, \phi_{j,r}\rangle_2 = \sqrt{\eta_j^*}\,\beta_r^{(j)}$. Consider the vector $v = (\|\beta^{(j)}\|_2^2)_{j \in I}$; we observe that $\|v\|_2 = \sqrt{\sum_{j \in I}\|\beta^{(j)}\|_2^4}$ and $\|v\|_1 \le B^2$. Since $\|\cdot\|_2 \le \|\cdot\|_1$ and due to the assumption $\|\eta^*\|_1 \le 1$, applying Cauchy-Schwarz gives

$$\|f\|_{\hat{k}}^2 \le B^2\sum_{j \in I}\frac{\eta_j^*}{\hat{\eta}_j} \le B^2\|\eta^*\|_1\max_{j \in I}\frac{1}{\hat{\eta}_j} \le B^2\max_{j \in I}\frac{1}{\hat{\eta}_j}. \qquad \text{(B.11)}$$

It remains to bound $\max_{j \in I}\hat{\eta}_j^{-1}$. We set $\hat{\eta}_j = \|\hat{\beta}^{(j)}\|_2/c_1$, and under the conditions of the theorem, Lemma B.3 states that $\hat{\eta}_j \ge 1 - \epsilon(n, m)/c_1$ for all $j \in J_{k^*}$. Then, for members of $I \subseteq J_{k^*}$,

$$\frac{1}{\hat{\eta}_j} \le \frac{1}{1 - \epsilon(n, m)/c_1} \le 1 + \frac{\epsilon(n, m)}{c_1} + o(\epsilon(n, m)),$$

which implies the following norm bound, with probability greater than $1 - \delta$:

$$\|f\|_{\hat{k}} \le B\left(1 + \frac{\epsilon(n, m)}{2c_1} + o(\epsilon(n, m))\right). \qquad \text{(B.12)}$$

C. Proof of Statements in Section 5

The following lemma presents a confidence bound for the case where the learner has oracle knowledge of the kernel. This lemma plays an integral role in the proofs of this section.
C. Proof of Statements in Section 5

The following lemma presents a confidence bound for a learner with oracle knowledge of its kernel. This lemma plays an integral role in the proofs of this section.

Lemma C.1 (Theorem 2 of Chowdhury & Gopalan (2017), for $\hat{k}$). Let $f \in H_{\hat{k}}$ for some kernel $\hat{k}$, with a $\hat{k}$-norm bounded by $\hat{B}$. Then with probability greater than $1 - \delta$, for all $x \in X$ and $t \geq 1$,
$$|\hat{\mu}_{t-1}(x) - f(x)| \leq \hat{\sigma}_{t-1}(x) \Big( \hat{B} + \sigma\sqrt{2(\hat{\gamma}_{t-1} + 1 + \log(1/\delta))} \Big), \quad (C.1)$$
where $\hat{\mu}_{t-1}$ and $\hat{\sigma}_{t-1}$ are as defined in Equation (3), with $\sigma = 1 + 2/T$. We skip the proof of this lemma, as it is given in Chowdhury & Gopalan (2017) with the same notation. For the kernel $\hat{k}$ we define the maximum information gain after $t - 1$ observations as
$$\hat{\gamma}_{t-1} := \max_{[x_\tau]_{\tau \leq t-1}} \frac{1}{2} \log\det(I + \sigma^{-2} \hat{K}_{t-1}).$$
This parameter quantifies the speed at which we learn about $f$ when using the kernel $\hat{k}$. Note that $\hat{\gamma}_{t-1}$ is independent of any specific realization of $H_{t-1}$; it only depends on the choice of kernel, the input domain, and the noise variance. The next lemma bounds this parameter.

Lemma C.2 (Information Gain Bound). The maximum information gain for $\hat{k}$ after observing $t$ samples satisfies
$$\hat{\gamma}_t \leq \frac{\hat{d}}{2} \log\Big( 1 + \frac{\sigma^{-2} t}{\hat{d}} \sum_{j \in J_{\hat{k}}} \hat{\eta}_j \Big) = O\big( \hat{d} \log(t/c_1) \big),$$
where $\hat{d} = \sum_{j \in J_{\hat{k}}} d_j \leq d$.

We now have the main tools for proving Theorem 5.1.

Proof of Theorem 5.1. Assume that $f \in H_{k^*}$ and that $\|f\|_{k^*} \leq B$. Then by Theorem 4.3, $f \in H_{\hat{k}}$ with probability greater than $1 - \delta$. Define $\hat{B}$ as
$$\hat{B} = B\Big( 1 + \frac{\epsilon(n, m)}{2c_1} + o(\epsilon(n, m)) \Big); \quad (C.2)$$
then by Lemma B.4, $\|f\|_{\hat{k}} \leq \hat{B}$ with probability greater than $1 - \delta$. We first condition on the event that $f \in H_{\hat{k}}$ and $\|f\|_{\hat{k}} \leq \hat{B}$. Then Lemma C.1 gives the following confidence interval:
$$P\Big( |\hat{\mu}_{t-1}(x) - f(x)| \leq \hat{\sigma}_{t-1}(x)\big( \hat{B} + \sigma\sqrt{2(\hat{\gamma}_{t-1} + 1 + \log(1/\delta'))} \big) \,\Big|\, f \in H_{\hat{k}},\ \|f\|_{\hat{k}} \leq \hat{B} \Big) \geq 1 - \delta'. \quad (C.3)$$
We now remove the conditioning. Let $C_t(x) := [\hat{\mu}(x) - \nu_t \hat{\sigma}(x),\ \hat{\mu}(x) + \nu_t \hat{\sigma}(x)]$ denote the confidence interval above, where $\nu_t$ is its width multiplier. By the chain rule,
$$P(f(x) \in C_t(x)) \geq P\big( f(x) \in C_t(x) \mid f \in H_{\hat{k}},\ \|f\|_{\hat{k}} \leq \hat{B} \big)\, P\big( f \in H_{\hat{k}},\ \|f\|_{\hat{k}} \leq \hat{B} \big) \geq (1 - \delta')(1 - \delta) \geq 1 - \delta - \delta'.$$
Renaming $\delta + \delta'$ to $\delta$ for simplicity, we conclude that with probability greater than $1 - \delta$,
$$|\hat{\mu}_{t-1}(x) - f(x)| \leq \hat{\sigma}_{t-1}(x)\Big( B\big( 1 + \tfrac{\epsilon(n, m)}{2c_1} + o(\epsilon(n, m)) \big) + \sigma\sqrt{2(\hat{\gamma}_{t-1} + 1 + \log(1/\delta))} \Big).$$
Lastly, Lemma C.2 gives an upper bound for $\hat{\gamma}_{t-1}$, which completes the proof.

Proof of Lemma C.2. For this proof, we follow a technique similar to Vakili et al. (2021). Recall that $\hat{k}(x, x') = \sum_{j \in J_{\hat{k}}} \hat{\eta}_j \phi_j^T(x)\phi_j(x')$, where $J_{\hat{k}} = \{1 \leq j \leq p : \hat{\eta}_j \neq 0\}$. Define $\hat{d} = \sum_{j \in J_{\hat{k}}} d_j$, the effective dimension of the kernels $k_j$ corresponding to this index set. Now consider an arbitrary sequence of inputs $(x_\tau)_{\tau=1}^t$ and let $\hat{\Phi}_t = [\hat{\phi}(x_1), \dots, \hat{\phi}(x_t)]^T \in \mathbb{R}^{t \times \hat{d}}$, with $\hat{\phi}(x) = (\phi_j(x))_{j \in J_{\hat{k}}}$. Define the $\hat{d} \times \hat{d}$ matrix $\Lambda = \mathrm{diag}\big((\hat{\eta}_j)_{j \in J_{\hat{k}}}\big)$ as the diagonal matrix containing the weights of $\hat{k}$ (each $\hat{\eta}_j$ repeated $d_j$ times). We have $\hat{K}_t = \hat{\Phi}_t \Lambda \hat{\Phi}_t^T$. Let $H_t = \Lambda^{1/2} \hat{\Phi}_t^T \hat{\Phi}_t \Lambda^{1/2}$; by the Weinstein-Aronszajn identity,
$$\frac{1}{2}\log\det(I + \sigma^{-2}\hat{K}_t) = \frac{1}{2}\log\det(I + \sigma^{-2}H_t) \leq \frac{\hat{d}}{2}\log\big( \mathrm{tr}(I + \sigma^{-2}H_t)/\hat{d} \big).$$
For positive definite matrices $P \in \mathbb{R}^{n \times n}$, we have $\log\det P \leq n \log(\mathrm{tr}(P)/n)$; the inequality follows since $I + \sigma^{-2}H_t$ is positive definite, as $\hat{\eta}_j \geq 0$. We may write
$$\frac{1}{2}\log\det(I + \sigma^{-2}\hat{K}_t) \leq \frac{\hat{d}}{2}\log\Big( 1 + \frac{\sigma^{-2}}{\hat{d}}\,\mathrm{tr}(\Lambda^{1/2}\hat{\Phi}_t^T\hat{\Phi}_t\Lambda^{1/2}) \Big) = \frac{\hat{d}}{2}\log\Big( 1 + \frac{\sigma^{-2}}{\hat{d}} \sum_{\tau=1}^t \|\Lambda^{1/2}\hat{\phi}(x_\tau)\|_2^2 \Big) = \frac{\hat{d}}{2}\log\Big( 1 + \frac{\sigma^{-2}}{\hat{d}} \sum_{\tau=1}^t \sum_{j \in J_{\hat{k}}} \hat{\eta}_j \|\phi_j(x_\tau)\|_2^2 \Big) \leq \frac{\hat{d}}{2}\log\Big( 1 + \frac{\sigma^{-2} t}{\hat{d}} \sum_{j \in J_{\hat{k}}} \hat{\eta}_j \Big).$$
The next-to-last inequality holds since $k_j(x, x) = \|\phi_j(x)\|_2^2$ is normalized to one, and $\hat{\eta}_j = \|\hat{\beta}^{(j)}\|_2 / c_1$. Since the inequality above holds for any sequence $(x_\tau)_{\tau \leq t}$, we conclude $\hat{\gamma}_t = O(\hat{d}\log(t/c_1))$.
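The trace argument in this proof is easy to verify numerically. The illustrative sketch below (the weights $\hat{\eta}_j$, dimensions, and features are made up, not taken from the paper) compares the information gain $\frac{1}{2}\log\det(I + \sigma^{-2}\hat{K}_t)$ of a finite-dimensional additive kernel with the bound $\frac{\hat{d}}{2}\log\big(1 + \sigma^{-2} t \sum_j \hat{\eta}_j / \hat{d}\big)$ derived above:

```python
import numpy as np

# Illustrative check of Lemma C.2 with hypothetical weights eta and random
# unit-norm features: gain = (1/2) log det(I + K_t / sigma^2) stays below
# (d_hat/2) * log(1 + t * sum_j eta_j / (sigma^2 * d_hat)).
rng = np.random.default_rng(2)
t, sigma = 200, 0.5
d_js = [2, 3, 4]                       # dims of the selected base feature maps
d_hat = sum(d_js)
eta = np.array([0.5, 0.3, 0.2])        # hypothetical meta-learned weights

feats = []
for d_j in d_js:                       # unit-norm features so k_j(x, x) = 1
    F = rng.standard_normal((t, d_j))
    feats.append(F / np.linalg.norm(F, axis=1, keepdims=True))
Phi_hat = np.hstack(feats)                                 # shape (t, d_hat)
Lam = np.diag(np.repeat(eta, d_js))                        # weight matrix Lambda
K_t = Phi_hat @ Lam @ Phi_hat.T

gain = 0.5 * np.linalg.slogdet(np.eye(t) + K_t / sigma**2)[1]
bound = 0.5 * d_hat * np.log(1 + t * eta.sum() / (sigma**2 * d_hat))
print(gain, "<=", bound)               # the gain grows only as O(d_hat log t)
```

The inequality holds for any input sequence, which is exactly why the bound transfers to the maximum information gain $\hat{\gamma}_t$.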
C.1. Proof of Corollary 5.2

Similar to the previous section, we take advantage of a classic regret bound for the oracle learner and then apply it to our setting.

Lemma C.3 (Theorem 3 of Chowdhury & Gopalan (2017), for $\hat{k}$). Set $\delta \in (0, 1)$. If $f \in H_{\hat{k}}$ with $\|f\|_{\hat{k}} \leq \hat{B}$, then with probability $1 - \delta$, GP-UCB satisfies
$$R_T = O\Big( \hat{B}\sqrt{T \hat{\gamma}_T} + \sqrt{T \hat{\gamma}_T (\hat{\gamma}_T + \log(1/\delta))} \Big).$$

Proof of Corollary 5.2. Conditioned on the event that $f \in H_{\hat{k}}$ and $\|f\|_{\hat{k}} \leq \hat{B}$, Lemma C.3 states that
$$P\Big( R_T = O\big( \hat{B}\sqrt{T \hat{\gamma}_T} + \sqrt{T \hat{\gamma}_T (\hat{\gamma}_T + \log(1/\delta'))} \big) \,\Big|\, f \in H_{\hat{k}},\ \|f\|_{\hat{k}} \leq \hat{B} \Big) \geq 1 - \delta'.$$
Then by Lemma B.4 and Theorem 4.3, with probability greater than $1 - \delta - \delta'$,
$$R_T = O\Big( \hat{B}\sqrt{T \hat{\gamma}_T} + \sqrt{T \hat{\gamma}_T (\hat{\gamma}_T + \log(1/\delta'))} \Big),$$
where $\hat{B}$ is set according to Equation (C.2). Plugging in Lemma C.2 to bound the information gain and renaming $\delta + \delta'$ to $\delta$ concludes the proof.

D. Experiments

Regret Experiment on Hyper-Parameter Tuning Data. This experiment is based on Rothfuss et al. (2021b, Appendix B.1.2); we repeat some of the details for completeness. We consider the use case of hyper-parameter tuning for machine learning algorithms. In particular, we consider generalized linear models with elastic-net regularization (GLMNET) (Friedman et al., 2010), which have two hyper-parameters, lambda and alpha. Following previous work (e.g., Perrone et al., 2018), we replace the costly training and evaluation step with a cheap table lookup, based on a large number of hyper-parameter evaluations (Kühn et al., 2018) on 38 classification datasets from the OpenML platform (Bischl et al., 2017). The hyper-parameter evaluations are available under a Creative Commons BY 4.0 license and can be downloaded here⁴. In effect, $X$ is a finite set, corresponding to 10,000-30,000 random hyper-parameter evaluations per dataset and machine learning algorithm. Since the sampling is quite dense, for the purpose of empirically evaluating the meta-learned models for BO, this finite domain can be treated like a continuous domain. All datasets correspond to binary classification. The target function we aim to optimize is the area under the ROC curve (AUROC) on a test split of the respective dataset. Since Kühn et al. (2018) sample lambda in log-space, we transform it via lambda ↦ log₂(lambda)/10, so that we can expect reasonably good performance of a vanilla GP-UCB with an SE kernel.

We randomly split the available tasks (i.e., train/test evaluations on a specific dataset) into a set of meta-train and meta-test tasks. In the following, we list the corresponding OpenML dataset identifiers:

meta-train tasks: 3, 1036, 1038, 1043, 1046, 151, 1176, 1049, 1050, 31, 1570, 37, 4134, 1063, 1067, 44, 1068, 50, 1461, 1462

meta-test tasks: 335, 1489, 1486, 1494, 1504, 1120, 1510, 1479, 1480, 333, 1485, 1487, 334

The plots in Figure 6 show the average cumulative regret over the 13 test tasks, with 10 runs using different random seeds for each. A short sketch of the input preprocessing step is given below.

⁴ https://doi.org/10.6084/m9.figshare.5882230.v2
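For concreteness, the snippet below sketches the lambda transform described above (the column layout and the function name are hypothetical; only the map lambda ↦ log₂(lambda)/10 is taken from the text):

```python
import numpy as np

# Sketch of the input preprocessing for the GLMNET tuning task. Kuehn et al.
# (2018) sample lambda in log-space, so we map it to a roughly unit-scale
# axis before running GP-UCB with an SE kernel. Column order is assumed.
def preprocess(configs):
    """configs: array of shape (N, 2) with columns [lambda, alpha]."""
    x = configs.astype(float).copy()
    x[:, 0] = np.log2(x[:, 0]) / 10.0   # lambda -> log2(lambda) / 10
    return x

# BO on the lookup table then reduces to selecting, at each round, the row
# of the preprocessed table that maximizes the UCB acquisition score and
# reading off its precomputed test AUROC as the (noisy) reward.
table = np.array([[2.0**-8, 0.1], [2.0**-2, 0.7], [2.0**3, 0.4]])
print(preprocess(table))
```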
Supplementary Figures. Figure 8 illustrates a few samples of the random functions that we optimize over in the experiments. The functions are constructed using the Legendre basis, as explained in the main text. Figure 9 gives an example of a BO problem where we use IGP-UCB together with the meta-learned kernel to find the minimum of a function with as few samples as possible. As the function estimate improves, the confidence sets rapidly shrink and the learner only samples points close to the minimum. Figure 10 demonstrates that the choice of λ for the META-KEL loss does not have a severe effect on the regret of IGP-UCB, provided that λ satisfies the condition of Theorem 4.3.

Figure 8. Examples of possible functions f_s for the meta-dataset, plotted over x ∈ [−1, 1].

Figure 9. BO (minimization) with META-KEL. The upper plot shows the state at t = 5 and the lower plot at t = 55.

Figure 10. Cumulative regret R_t of IGP-UCB over t = 0, ..., 50 with the meta-learned kernel (META-KEL), the full kernel k_full, and the oracle kernel k*, for several values of λ. For m = n = 50 and p = 20, Theorem 4.3 requires λ > 0.001 for recovery to happen with probability greater than 1 − δ = 0.9. For λ satisfying this condition, the particular choice of its value does not affect performance severely.