# fewshot_learning_via_learning_the_representation_provably__b49342f4.pdf

Published as a conference paper at ICLR 2021

FEW-SHOT LEARNING VIA LEARNING THE REPRESENTATION, PROVABLY

Simon S. Du¹, Wei Hu², Sham M. Kakade¹,³, Jason D. Lee², and Qi Lei²
¹University of Washington ²Princeton University ³Microsoft Research
{ssdu,sham}@cs.washington.edu, {huwei@cs.,jasonlee@,qilei@}princeton.edu

This paper studies few-shot learning via representation learning, where one uses $T$ source tasks with $n_1$ data per task to learn a representation in order to reduce the sample complexity of a target task for which there is only $n_2$ ($\ll n_1$) data. Specifically, we focus on the setting where there exists a good common representation between source and target, and our goal is to understand how much of a sample size reduction is possible. First, we study the setting where this common representation is low-dimensional and provide a risk bound of $\tilde{O}\big(\frac{dk}{n_1T}+\frac{k}{n_2}\big)$ on the target task for the linear representation class; here $d$ is the ambient input dimension and $k$ ($\ll d$) is the dimension of the representation. This result bypasses the $\Omega(\frac{1}{T})$ barrier under the i.i.d. task assumption, and can capture the desired property that all $n_1T$ samples from the source tasks can be pooled together for representation learning. We further extend this result to handle a general representation function class and obtain a similar result. Next, we consider the setting where the common representation may be high-dimensional but is capacity-constrained (say in norm); here, we again demonstrate the advantage of representation learning in both high-dimensional linear regression and neural networks, and show that representation learning can fully utilize all $n_1T$ samples from the source tasks.

1 INTRODUCTION

A popular scheme for few-shot learning, i.e., learning in a data-scarce environment, is representation learning: one first learns a feature extractor, or representation (e.g., the last layer of a convolutional neural network), from different but related source tasks, and then uses a simple predictor (usually a linear function) on top of this representation in the target task. The hope is that the learned representation captures the common structure across tasks, which makes a linear predictor sufficient for the target task. If the learned representation is good enough, a few samples may suffice to learn the target task, far fewer than the number required to learn it from scratch.

While representation learning has achieved tremendous success in a variety of applications (Bengio et al., 2013), its theoretical study is limited. In existing theoretical work, the most natural algorithm is to explicitly look for the representation that, combined with a (different) linear predictor on top for each task, achieves the smallest cumulative training error on the source tasks. Of course, the representation found is not guaranteed to be useful for the target task unless one makes assumptions that characterize the connections between different tasks. Existing work often imposes a probabilistic assumption about the connection between tasks: each task is sampled i.i.d. from an underlying distribution. Under this assumption, Maurer et al.
(2016) showed an $O\big(\frac{1}{\sqrt{T}}+\frac{1}{\sqrt{n_2}}\big)$ risk bound on the target task, where $T$ is the number of source tasks, $n_1$ is the number of samples per source task, and $n_2$ is the number of samples from the target task.¹ Unsatisfactorily, this bound necessarily requires the number of tasks $T$ to be large, and it does not improve when the number of samples per source task, $n_1$, increases. Intuitively, one should expect more data to help, and an ideal bound would therefore be $\frac{1}{\sqrt{n_1T}}+\frac{1}{\sqrt{n_2}}$ (or $\frac{1}{n_1T}+\frac{1}{n_2}$ in the realizable case), because $n_1T$ is the total number of training data points from the source tasks, which can potentially be pooled to learn the representation. Unfortunately, as pointed out by Maurer et al. (2016), there exists an example satisfying the i.i.d. task assumption for which $\Omega(\frac{1}{\sqrt{T}})$ is unavoidable (or $\Omega(\frac{1}{T})$ in the realizable setting). This means that the i.i.d. assumption alone is not sufficient if we want to take advantage of a large number of samples per task. A natural question is therefore:

What connections between tasks enable representation learning to utilize all source data?

*Alphabetical order.
¹We only focus on the dependence on $T$, $n_1$ and $n_2$ in this paragraph. Note that Maurer et al. (2016) only considered $n_1 = n_2$, but their approach does not give a better result even if $n_1 > n_2$.

In this paper, we obtain the first set of results that fully utilize the $n_1T$ data from the source tasks. We replace the i.i.d. assumption over tasks with natural structural conditions on the input distributions and linear predictors. These conditions express that the target task can, in some sense, be covered by the source tasks, which gives rise to the desired guarantees.

First, we study the setting where there exists a common well-specified low-dimensional representation for the source and target tasks, and obtain an $\tilde{O}\big(\frac{dk}{n_1T}+\frac{k}{n_2}\big)$ risk bound on the target task, where $d$ is the ambient input dimension, $k$ ($\ll d$) is the dimension of the representation, and $n_2$ is the number of data from the target task. Note that this improves the $\frac{d}{n_2}$ rate of learning the target task without representation learning. The term $\frac{dk}{n_1T}$ indicates that we can fully exploit all $n_1T$ data in the source tasks to learn the representation. We further extend this result to handle a general representation function class and obtain an $\tilde{O}\big(\frac{\mathcal{C}(\Phi)}{n_1T}+\frac{k}{n_2}\big)$ risk bound on the target task, where $\Phi$ is the representation function class and $\mathcal{C}(\Phi)$ is a certain complexity measure of $\Phi$.

Second, we study the setting where there exists a common linear high-dimensional representation for the source and target tasks, and obtain an $\tilde{O}\Big(R\sqrt{\frac{\mathrm{Tr}(\Sigma)}{n_1T}}+R\sqrt{\frac{\|\Sigma\|_2}{n_2}}\Big)$ rate, where $R$ is a normalized nuclear norm control over the linear predictors and $\Sigma$ is the covariance matrix of the raw feature. This also improves over the baseline rate for the case without representation learning. We further extend this result to two-layer neural networks with ReLU activation. Again, our results indicate that we can fully exploit the $n_1T$ source data.

A technical insight coming out of our analysis is that any capacity-controlled method that achieves low test error on the source tasks must also achieve low test error on the target task, by virtue of being forced to learn a good representation. Our result on high-dimensional representations shows that the capacity control for representation learning does not have to be through explicit low dimensionality.

Organization. The rest of the paper is organized as follows.
We review related work in Section 2. In Section 3, we formally describe the setting we consider. In Section 4, we present our main result for low-dimensional linear representation learning. A generalization to nonlinear representation classes is given in Section 5. In Section 6, we present our main result for high-dimensional linear representation learning. In Section 7, we present our result for representation learning in neural networks. We conclude in Section 8 and defer most of the proofs to the appendices.

2 RELATED WORK

The idea of multitask representation learning dates back at least to Caruana (1997); Thrun and Pratt (1998); Baxter (2000). Empirically, representation learning has shown great power in various domains; see Bengio et al. (2013) for a survey. In particular, representation learning is widely adopted for few-shot learning tasks (Sun et al., 2017; Goyal et al., 2019). Representation learning is also closely connected to meta-learning (Schaul and Schmidhuber, 2010). Recent work by Raghu et al. (2019) empirically suggested that the effectiveness of the popular meta-learning algorithm Model-Agnostic Meta-Learning (MAML) is due to its ability to learn a useful representation. The scheme we analyze in this paper is closely related to those of Lee et al. (2019); Bertinetto et al. (2018) for meta-learning.

On the theoretical side, Baxter (2000) performed the first theoretical analysis and gave sample complexity bounds using covering numbers. Maurer et al. (2016) and follow-up work analyzed the benefit of representation learning for reducing the sample complexity of the target task. They assumed every task is drawn i.i.d. from an underlying distribution and obtained an $O\big(\frac{1}{\sqrt{T}}+\frac{1}{\sqrt{n_2}}\big)$ rate. As pointed out in Maurer et al. (2016), the $\frac{1}{\sqrt{T}}$ dependence is not improvable even if $n_1\to\infty$, because $\frac{1}{\sqrt{T}}$ is the rate of concentration for the distribution over tasks. The concurrent work of Tripuraneni et al. (2020) studies low-dimensional linear representation learning and obtains a similar result to ours in this case, but they assume isotropic inputs for all tasks, which is a special case of our result. Furthermore, we also provide results for high-dimensional linear representations, general non-linear representations, and two-layer neural networks. Tripuraneni et al. (2020) also give a computationally efficient algorithm for standard Gaussian inputs and a lower bound for subspace recovery in the low-dimensional linear setting. Another recent line of theoretical work analyzed gradient-based meta-learning methods (Denevi et al., 2019; Finn et al., 2019; Khodak et al., 2019) and showed guarantees for convex losses using tools from online convex optimization. Lastly, we remark that there are analyses for other representation learning schemes (Arora et al., 2019; McNamara and Balcan, 2017; Galanti et al., 2016; Alquier et al., 2016; Denevi et al., 2018).

3 NOTATION AND SETUP

Notation. Let $[n]=\{1,2,\ldots,n\}$. We use $\|\cdot\|$ or $\|\cdot\|_2$ to denote the $\ell_2$ norm of a vector or the spectral norm of a matrix. Denote by $\|\cdot\|_F$ and $\|\cdot\|_*$ the Frobenius norm and the nuclear norm of a matrix, respectively. Let $\langle\cdot,\cdot\rangle$ be the Euclidean inner product between vectors or matrices. Denote by $I$ the identity matrix. Let $N(\mu,\sigma^2)$ / $N(\mu,\Sigma)$ be the one-dimensional/multi-dimensional Gaussian distribution, and $\chi^2(m)$ the chi-squared distribution with $m$ degrees of freedom. For a matrix $A\in\mathbb{R}^{m\times n}$, let $\sigma_i(A)$ be its $i$-th largest singular value.
Let $\mathrm{span}(A)$ be the subspace of $\mathbb{R}^m$ spanned by the columns of $A$, i.e., $\mathrm{span}(A)=\{Av\mid v\in\mathbb{R}^n\}$. Denote $P_A=A(A^\top A)^\dagger A^\top\in\mathbb{R}^{m\times m}$, the projection matrix onto $\mathrm{span}(A)$; here $\dagger$ stands for the Moore-Penrose pseudoinverse. Note that $0\preceq P_A\preceq I$ and $P_A^2=P_A$. We also define $P_A^\perp=I-P_A$, the projection matrix onto $\mathrm{span}(A)^\perp$, the orthogonal complement of $\mathrm{span}(A)$ in $\mathbb{R}^m$. For a positive semidefinite (psd) matrix $B$, denote by $\lambda_{\max}(B)$ and $\lambda_{\min}(B)$ its largest and smallest eigenvalues, respectively; let $B^{1/2}$ be the psd matrix such that $(B^{1/2})^2=B$. We use the standard $O(\cdot)$, $\Omega(\cdot)$ and $\Theta(\cdot)$ notation to hide universal constant factors. We also write $a\lesssim b$ or $b\gtrsim a$ to indicate $a=O(b)$, and $a\gg b$ or $b\ll a$ to mean that $a\ge C\cdot b$ for a sufficiently large universal constant $C>0$.

Problem Setup. Suppose that there are $T$ source tasks. Each task $t\in[T]$ is associated with a distribution $\mu_t$ over the joint data space $\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the output space. In this paper we consider $\mathcal{X}\subseteq\mathbb{R}^d$ and $\mathcal{Y}\subseteq\mathbb{R}$. For each source task $t\in[T]$ we have access to $n_1$ i.i.d. samples $(x_{t,1},y_{t,1}),\ldots,(x_{t,n_1},y_{t,n_1})$ from $\mu_t$. For convenience, we express these $n_1$ samples collectively as an input matrix $X_t\in\mathbb{R}^{n_1\times d}$ and an output vector $y_t\in\mathbb{R}^{n_1}$.

Multitask learning tries to learn prediction functions for all $T$ source tasks simultaneously, in the hope of discovering some underlying common property of these tasks. The common property we consider in this paper is a representation: a function $\phi:\mathcal{X}\to\mathcal{Z}$ that maps an input to some feature space $\mathcal{Z}\subseteq\mathbb{R}^k$. We restrict the representation function to be in some function class $\Phi$, e.g., neural networks. We use different linear predictors on top of a common representation function $\phi$ to model the input-output relations in different source tasks. Namely, for each task $t\in[T]$, we set the prediction function to be $x\mapsto\langle w_t,\phi(x)\rangle$ ($w_t\in\mathbb{R}^k$). Therefore, using the training samples from the $T$ tasks, we can solve the following optimization problem to learn the representation:²

$$\min_{\phi\in\Phi,\;w_1,\ldots,w_T\in\mathbb{R}^k}\ \frac{1}{2n_1T}\sum_{t=1}^{T}\sum_{i=1}^{n_1}\big(y_{t,i}-\langle w_t,\phi(x_{t,i})\rangle\big)^2. \quad (1)$$

We overload notation and allow $\phi$ to apply to all the samples in a data matrix simultaneously, i.e., $\phi(X_t)=[\phi(x_{t,1}),\ldots,\phi(x_{t,n_1})]^\top\in\mathbb{R}^{n_1\times k}$. Then (1) can be rewritten as

$$\min_{\phi\in\Phi,\;w_1,\ldots,w_T\in\mathbb{R}^k}\ \frac{1}{2n_1T}\sum_{t=1}^{T}\|y_t-\phi(X_t)w_t\|^2. \quad (2)$$

²We use the $\ell_2$ loss throughout this paper.

Let $\hat\phi\in\Phi$ be the representation function obtained by solving (2). We retain this representation and apply it to future (target) tasks. For a target task specified by a distribution $\mu_{T+1}$ over $\mathcal{X}\times\mathcal{Y}$, suppose we receive $n_2$ i.i.d. samples $X_{T+1}\in\mathbb{R}^{n_2\times d}$, $y_{T+1}\in\mathbb{R}^{n_2}$. We further train a linear predictor on top of $\hat\phi$ for this task:

$$\min_{w_{T+1}\in\mathbb{R}^k}\ \frac{1}{2n_2}\|y_{T+1}-\hat\phi(X_{T+1})w_{T+1}\|^2. \quad (3)$$

Let $\hat w_{T+1}$ be the returned solution. We are interested in whether the learned predictor $x\mapsto\langle\hat w_{T+1},\hat\phi(x)\rangle$ works well on average for the target task, i.e., we want the population loss $L_{\mu_{T+1}}(\hat\phi,\hat w_{T+1})=\mathbb{E}_{(x,y)\sim\mu_{T+1}}\big[\frac12\big(y-\langle\hat w_{T+1},\hat\phi(x)\rangle\big)^2\big]$ to be small. In particular, we are interested in the few-shot learning setting, where the number of samples $n_2$ from the target task is small, much smaller than the number of samples required for learning the target task from scratch.
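As a concrete illustration of the target phase (3), the following minimal numpy sketch fits a linear head on top of a frozen, already-learned representation $\hat\phi$ (passed in as a callable) and then estimates the population loss by Monte Carlo. The function names, the Monte Carlo evaluation, and the `sample_task` interface are illustrative assumptions, not constructs from the paper.

```python
import numpy as np

def fit_target_head(phi_hat, X_target, y_target):
    """Solve problem (3): least-squares linear head on frozen features."""
    F = phi_hat(X_target)                 # n2 x k feature matrix
    w, *_ = np.linalg.lstsq(F, y_target, rcond=None)
    return w

def population_loss(phi_hat, w, sample_task, n_mc=100_000):
    """Monte Carlo estimate of the population loss L_{mu_{T+1}}(phi_hat, w).
    `sample_task(n)` is assumed to return n fresh draws (X, y) from mu_{T+1}."""
    X, y = sample_task(n_mc)
    return 0.5 * np.mean((y - phi_hat(X) @ w) ** 2)
```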
Data assumption. In order for the above learning procedure to make sense, we assume that there is a ground-truth optimal representation function $\phi^*\in\Phi$ and specializations $w_1^*,\ldots,w_{T+1}^*\in\mathbb{R}^k$ for all the tasks such that for each task $t\in[T+1]$ we have $\mathbb{E}_{(x,y)\sim\mu_t}[y\mid x]=\langle w_t^*,\phi^*(x)\rangle$. More specifically, we assume $(x,y)\sim\mu_t$ is generated by

$$y=\langle w_t^*,\phi^*(x)\rangle+z,\qquad x\sim p_t,\quad z\sim N(0,\sigma^2), \quad (4)$$

where $x$ and $z$ are independent. Our goal is to bound the excess risk of our learned model on the target task, i.e., how much worse the learned model $(\hat\phi,\hat w_{T+1})$ performs than the optimal model $(\phi^*,w_{T+1}^*)$ on the target task:

$$\mathrm{ER}(\hat\phi,\hat w_{T+1})=L_{\mu_{T+1}}(\hat\phi,\hat w_{T+1})-L_{\mu_{T+1}}(\phi^*,w_{T+1}^*)=\tfrac12\,\mathbb{E}_{x\sim p_{T+1}}\Big[\big(\langle\hat w_{T+1},\hat\phi(x)\rangle-\langle w_{T+1}^*,\phi^*(x)\rangle\big)^2\Big]. \quad (5)$$

Here we have used the relation (4). Oftentimes we are interested in the average performance on a random target task (i.e., $w_{T+1}^*$ is random). In that case we look at the expected excess risk $\mathbb{E}_{w_{T+1}^*}[\mathrm{ER}(\hat\phi,\hat w_{T+1})]$.

4 LOW-DIMENSIONAL LINEAR REPRESENTATIONS

In this section, we consider the case where the representation is a linear map from the original input space $\mathbb{R}^d$ to a low-dimensional space $\mathbb{R}^k$ ($k\ll d$). Namely, we let the representation function class be $\Phi=\{x\mapsto B^\top x\mid B\in\mathbb{R}^{d\times k}\}$. Then the optimization problem (2) for learning the representation can be written as

$$(\hat B,\hat W)\in\arg\min_{B\in\mathbb{R}^{d\times k},\,W=[w_1,\ldots,w_T]\in\mathbb{R}^{k\times T}}\ \frac{1}{2n_1T}\sum_{t=1}^{T}\|y_t-X_tBw_t\|^2. \quad (6)$$

The inputs from the $T$ source tasks, $X_1,\ldots,X_T\in\mathbb{R}^{n_1\times d}$, can be written in the form of a linear operator $\mathcal{X}:\mathbb{R}^{d\times T}\to\mathbb{R}^{n_1\times T}$, where $\mathcal{X}(\Theta)=[X_1\theta_1,\ldots,X_T\theta_T]$ for $\Theta=[\theta_1,\ldots,\theta_T]\in\mathbb{R}^{d\times T}$. With this notation, (6) can be rewritten as

$$(\hat B,\hat W)\in\arg\min_{B\in\mathbb{R}^{d\times k},\,W\in\mathbb{R}^{k\times T}}\ \frac{1}{2n_1T}\|Y-\mathcal{X}(BW)\|_F^2,\qquad\text{where }Y=[y_1,\ldots,y_T]\in\mathbb{R}^{n_1\times T}. \quad (7)$$

With the learned representation $\hat B$ from (7), for the target task we further find a linear function on top of the representation:

$$\hat w_{T+1}\in\arg\min_{w\in\mathbb{R}^k}\ \frac{1}{2n_2}\|y_{T+1}-X_{T+1}\hat Bw\|^2. \quad (8)$$

As described in Section 3, we assume that all $T+1$ tasks share a common ground-truth representation specified by a matrix $B^*\in\mathbb{R}^{d\times k}$, such that a sample $(x,y)\sim\mu_t$ satisfies $x\sim p_t$ and $y=(w_t^*)^\top(B^*)^\top x+z$, where $z\sim N(0,\sigma^2)$ is independent of $x$. Here $w_t^*\in\mathbb{R}^k$, and we assume $\|w_t^*\|=\Theta(1)$ for all $t\in[T+1]$. Denote $W^*=[w_1^*,\ldots,w_T^*]\in\mathbb{R}^{k\times T}$. Then we can write $Y=\mathcal{X}(B^*W^*)+Z$, where the noise matrix $Z$ has i.i.d. $N(0,\sigma^2)$ entries.

Assume $\mathbb{E}_{x\sim p_t}[x]=0$ and let $\Sigma_t=\mathbb{E}_{x\sim p_t}[xx^\top]$ for all $t\in[T+1]$. Note that a sample $x\sim p_t$ can be generated as $x=\Sigma_t^{1/2}\bar x$ for $\bar x\sim\bar p_t$ such that $\mathbb{E}_{\bar x\sim\bar p_t}[\bar x]=0$ and $\mathbb{E}_{\bar x\sim\bar p_t}[\bar x\bar x^\top]=I$. ($\bar p_t$ is called the whitening of $p_t$.)
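For the linear class, here is a minimal numpy sketch of the source phase (6)/(7) via plain gradient descent on $(B,W)$; the resulting $x\mapsto x^\top B$ can then be passed as `phi_hat` to the target-phase sketch above. The learning rate, iteration count, and initialization scale are illustrative choices: the paper analyzes the global minimizer of (7), not a specific optimizer.

```python
import numpy as np

def fit_source(Xs, ys, k, lr=0.1, iters=2000, seed=0):
    """Gradient descent on objective (7): Xs is a list of T source input
    matrices (n1 x d), ys the matching label vectors."""
    rng = np.random.default_rng(seed)
    T, (n1, d) = len(Xs), Xs[0].shape
    B = rng.normal(size=(d, k)) / np.sqrt(d)   # shared representation
    W = rng.normal(size=(k, T)) / np.sqrt(k)   # per-task linear heads
    for _ in range(iters):
        gB = np.zeros_like(B)
        gW = np.zeros_like(W)
        for t in range(T):
            r = Xs[t] @ B @ W[:, t] - ys[t]          # residual of task t
            gB += np.outer(Xs[t].T @ r, W[:, t])     # d/dB of 0.5||r||^2
            gW[:, t] = B.T @ (Xs[t].T @ r)           # d/dw_t of 0.5||r||^2
        B -= lr * gB / (n1 * T)
        W -= lr * gW / (n1 * T)
    return B
```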
In this section we make the following assumptions on the input distributions $p_1,\ldots,p_{T+1}$.

Assumption 4.1 (subgaussian input). There exists $\rho>0$ such that, for all $t\in[T+1]$, the random vector $\bar x\sim\bar p_t$ is $\rho^2$-subgaussian.³

Assumption 4.2 (covariance dominance). There exists $c>0$ such that $\Sigma_t\succeq c\,\Sigma_{T+1}$ for all $t\in[T]$.⁴

Assumption 4.1 is a standard assumption in statistical learning used to obtain the probabilistic tail bounds in our proof. It may be replaced with other moment or boundedness conditions if we adopt different tail bounds in the analysis. Assumption 4.2 says that every direction spanned by $\Sigma_{T+1}$ should also be spanned by $\Sigma_t$ ($t\in[T]$), and the parameter $c$ quantifies how easy it is for $\Sigma_t$ to cover $\Sigma_{T+1}$. Intuitively, the larger $c$ is, the easier it is to cover the target domain using the source domains, and we will indeed see that the risk is proportional to $\frac1c$. We remark that we do not necessarily need $\Sigma_t\succeq c\,\Sigma_{T+1}$ for all $t\in[T]$; as long as this holds for a constant fraction of the $t$'s, our result remains valid.

We also make the following assumption that characterizes the diversity of the source tasks.

Assumption 4.3 (diverse source tasks). The matrix $W^*=[w_1^*,\ldots,w_T^*]\in\mathbb{R}^{k\times T}$ satisfies $\sigma_k^2(W^*)\ge\Omega(\frac{T}{k})$.

Recall that $\|w_t^*\|=\Theta(1)$, which implies $\sum_{j=1}^{k}\sigma_j^2(W^*)=\|W^*\|_F^2=\Theta(T)$. Thus, Assumption 4.3 is equivalent to saying that $\frac{\sigma_1(W^*)}{\sigma_k(W^*)}=O(1)$. Roughly speaking, this means that $\{w_t^*\}_{t\in[T]}$ can cover all directions in $\mathbb{R}^k$. As an example, Assumption 4.3 is satisfied with high probability when the $w_t^*$'s are sampled i.i.d. from $N(0,\Sigma)$ with $\frac{\lambda_{\max}(\Sigma)}{\lambda_{\min}(\Sigma)}=O(1)$.

Finally, we make the following assumption on the distribution of the target task.

Assumption 4.4 (distribution of target task). Assume that $w_{T+1}^*$ follows a distribution $\nu$ such that $\mathbb{E}_{w\sim\nu}[ww^\top]\preceq O(\frac1k)\cdot I$.

Since we assume $\|w_{T+1}^*\|=\Theta(1)$, the assumption $\mathbb{E}_{w\sim\nu}[ww^\top]\preceq O(\frac1k)\cdot I$ means that the distribution of $w_{T+1}^*$ does not align with any direction significantly more than average. It is useful to think of the uniform distribution on the unit sphere as an example, though we can allow a much more general class of distributions. This is also compatible with Assumption 4.3, which says that the $w_t^*$'s cover all directions. Assumption 4.4 can be removed at the cost of a slightly worse risk bound; see Remark 4.1.

Our main result in this section is the following theorem.

Theorem 4.1 (main theorem for linear representations). Fix a failure probability $\delta\in(0,1)$. Under Assumptions 4.1, 4.2, 4.3 and 4.4, further assume $2k\le\min\{d,T\}$ and that the sample sizes in the source and target tasks satisfy $n_1\gtrsim\rho^4(d+\log\frac{T}{\delta})$, $n_2\gtrsim\rho^4(k+\log\frac1\delta)$, and $cn_1\gtrsim n_2$. Define $\kappa=\frac{\max_{t\in[T]}\lambda_{\max}(\Sigma_t)}{\min_{t\in[T]}\lambda_{\min}(\Sigma_t)}$. Then with probability at least $1-\delta$ over the samples, the expected excess risk of the learned predictor $x\mapsto\hat w_{T+1}^\top\hat B^\top x$ on the target task satisfies

$$\mathbb{E}_{w_{T+1}^*\sim\nu}\big[\mathrm{ER}(\hat B,\hat w_{T+1})\big]\lesssim\sigma^2\left(\frac{kd\log(\kappa n_1)}{cn_1T}+\frac{k+\log\frac1\delta}{n_2}\right). \quad (9)$$

³A random vector $x$ is called $\rho^2$-subgaussian if for any fixed unit vector $v$ of the same dimension, the random variable $v^\top x$ is $\rho^2$-subgaussian, i.e., $\mathbb{E}[e^{s\,v^\top(x-\mathbb{E}[x])}]\le e^{s^2\rho^2/2}$ ($\forall s\in\mathbb{R}$).
⁴Note that Assumption 4.2 is a significant generalization of the identically distributed isotropic assumption used in the concurrent work of Tripuraneni et al. (2020): they require $\Sigma_1=\Sigma_2=\cdots=\Sigma_{T+1}=I$.

The proof of Theorem 4.1 is in Appendix A. Theorem 4.1 shows that it is possible to learn the target task using only $O(k)$ samples via learning a good representation from the source tasks, which is better than the baseline $O(d)$ sample complexity for linear regression, thus demonstrating the benefit of representation learning. It also shows that all $n_1T$ samples from the source tasks can be pooled together, bypassing the $\Omega(\frac1T)$ barrier under the i.i.d. task assumption.

Remark 4.1 (deterministic target task). We can drop Assumption 4.4 and easily obtain the following excess risk bound for any deterministic $w_{T+1}^*$ by slightly modifying the proof of Theorem 4.1:

$$\mathrm{ER}(\hat B,\hat w_{T+1})\lesssim\sigma^2\left(\frac{k^2d\log(\kappa n_1)}{cn_1T}+\frac{k+\log\frac1\delta}{n_2}\right),$$

which is at most $k$ times larger than the bound in (9).

5 GENERAL LOW-DIMENSIONAL REPRESENTATIONS

Now we return to the general case described in Section 3, where we allow a general representation function class $\Phi$. We still assume that the representation has low dimension $k$ as in Section 4, and we assume that the inputs from all tasks follow the same distribution, i.e., $p_1=\cdots=p_{T+1}=p$, but each task $t$ still has its own specialization $w_t^*$ (cf. (4)). We overload the notation from Section 4 and use $X$ to represent the collection of all the training inputs from the $T$ source tasks, $X_1,\ldots,X_T\in\mathbb{R}^{n_1\times d}$; we can think of $X$ as a third-order tensor of dimension $n_1\times d\times T$.

To characterize the complexity of the representation function class $\Phi$, we need the standard definition of Gaussian width.

Definition 5.1 (Gaussian width). Given a set $K\subseteq\mathbb{R}^m$, the Gaussian width of $K$ is defined as $G(K):=\mathbb{E}_{z\sim N(0,I)}\big[\sup_{v\in K}\langle v,z\rangle\big]$.
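To build intuition for Definition 5.1, here is a small Monte Carlo sketch estimating $G(K)$ for a set where the supremum is available in closed form, namely the unit Euclidean ball of a $k$-dimensional subspace. The construction and function names are illustrative, not from the paper.

```python
import numpy as np

def gaussian_width_subspace_ball(U, n_mc=20_000, seed=0):
    """Estimate G(K) for K = {Uv : ||v|| <= 1}, U with orthonormal columns.
    Here sup_{x in K} <x, z> = ||U^T z||, so G(K) = E||U^T z|| ~ sqrt(k)."""
    rng = np.random.default_rng(seed)
    m, k = U.shape
    Z = rng.normal(size=(n_mc, m))
    return np.mean(np.linalg.norm(Z @ U, axis=1))

# example: a random 5-dimensional subspace of R^100
U, _ = np.linalg.qr(np.random.default_rng(1).normal(size=(100, 5)))
print(gaussian_width_subspace_ball(U))   # close to (slightly below) sqrt(5)
```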
We will measure the complexity of $\Phi$ using the Gaussian width of the following set, which depends on the input data $X$:

$$F_X(\Phi)=\Big\{A=[a_1,\ldots,a_T]\in\mathbb{R}^{n_1\times T}:\|A\|_F=1,\ \exists\,\phi,\phi'\in\Phi\ \text{s.t.}\ a_t\in\mathrm{span}\big([\phi(X_t),\phi'(X_t)]\big)\ \forall t\in[T]\Big\}. \quad (10)$$

We also need the following definition.

Definition 5.2 (covariance between two representations). Given a distribution $q$ over $\mathbb{R}^d$ and two representation functions $\phi,\phi'\in\Phi$, define the covariance between $\phi$ and $\phi'$ with respect to $q$ as $\Sigma_q(\phi,\phi')=\mathbb{E}_{x\sim q}\big[\phi(x)\phi'(x)^\top\big]\in\mathbb{R}^{k\times k}$. Also define the symmetric covariance as

$$\Lambda_q(\phi,\phi')=\begin{pmatrix}\Sigma_q(\phi,\phi)&\Sigma_q(\phi,\phi')\\ \Sigma_q(\phi',\phi)&\Sigma_q(\phi',\phi')\end{pmatrix}.$$

It is easy to verify that $\Lambda_q(\phi,\phi')\succeq0$ for any $\phi,\phi'$ and $q$, as shown in the proof of Lemma B.2.

We make the following assumptions on the input distribution $p$, which ensure concentration properties of the representation covariances.

Assumption 5.1 (point-wise concentration of covariance). For $\delta\in(0,1)$, there exists a number $N_{\mathrm{point}}(\Phi,p,\delta)$ such that if $n\ge N_{\mathrm{point}}(\Phi,p,\delta)$, then for any given $\phi,\phi'\in\Phi$, $n$ i.i.d. samples of $p$ will with probability at least $1-\delta$ satisfy $0.9\,\Lambda_p(\phi,\phi')\preceq\Lambda_{\hat p}(\phi,\phi')\preceq1.1\,\Lambda_p(\phi,\phi')$, where $\hat p$ is the empirical distribution over the $n$ samples.

Assumption 5.2 (uniform concentration of covariance). For $\delta\in(0,1)$, there exists a number $N_{\mathrm{unif}}(\Phi,p,\delta)$ such that if $n\ge N_{\mathrm{unif}}(\Phi,p,\delta)$, then $n$ i.i.d. samples of $p$ will with probability at least $1-\delta$ satisfy $0.9\,\Lambda_p(\phi,\phi')\preceq\Lambda_{\hat p}(\phi,\phi')\preceq1.1\,\Lambda_p(\phi,\phi')$ for all $\phi,\phi'\in\Phi$, where $\hat p$ is the empirical distribution over the $n$ samples.

Assumptions 5.1 and 5.2 are conditions on the representation function class $\Phi$ and the input distribution $p$ that ensure concentration of empirical covariances to their population counterparts. Typically, we expect $N_{\mathrm{unif}}(\Phi,p,\delta)\ge N_{\mathrm{point}}(\Phi,p,\delta)$, since uniform concentration is a stronger requirement. In Section 4, we have essentially shown that for linear representations and subgaussian input distributions, $N_{\mathrm{unif}}(\Phi,p,\delta)=O(d)$ and $N_{\mathrm{point}}(\Phi,p,\delta)=O(k)$ (see Claims A.1 and A.2).

Our main theorem in this section is the following.

Theorem 5.1 (main theorem for general representations). Fix a failure probability $\delta\in(0,1)$. Suppose $n_1\ge N_{\mathrm{unif}}\big(\Phi,p,\frac{\delta}{3T}\big)$ and $n_2\ge N_{\mathrm{point}}\big(\Phi,p,\frac{\delta}{3}\big)$. Under Assumptions 4.3 and 4.4, with probability at least $1-\delta$ over the samples, the expected excess risk of the learned predictor $x\mapsto\hat w_{T+1}^\top\hat\phi(x)$ on the target task satisfies

$$\mathbb{E}_{w_{T+1}^*\sim\nu}\big[\mathrm{ER}(\hat\phi,\hat w_{T+1})\big]\lesssim\sigma^2\left(\frac{G(F_X(\Phi))^2+\log\frac1\delta}{n_1T}+\frac{k+\log\frac1\delta}{n_2}\right). \quad (11)$$

Theorem 5.1 is very similar to Theorem 4.1 in terms of the result and the assumptions made. In the bound (11), the complexity of $\Phi$ is captured by the Gaussian width of the data-dependent set $F_X(\Phi)$ defined in (10). Data-dependent complexity measures are ubiquitous in generalization theory, one of the most notable examples being Rademacher complexity. A similar complexity measure also appeared in existing representation learning theory (Maurer et al., 2016). Usually, for specific examples, we can apply concentration bounds to remove the data dependency, as in our result for linear representations (Theorem 4.1). Our assumptions on the linear specialization functions $w_t^*$ are the same as in Theorem 4.1. The probabilistic assumption on $w_{T+1}^*$ can also be removed at the cost of an additional factor of $k$ in the bound; see Remark 4.1. We defer the full proof of Theorem 5.1 to Appendix B.
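As a small worked example of Definition 5.2 and Assumption 5.1, the sketch below computes the empirical symmetric covariance $\Lambda_{\hat p}(\phi,\phi')$ for two assumed toy feature maps and checks that it is positive semidefinite; the feature maps and sample sizes are illustrative only.

```python
import numpy as np

def empirical_symmetric_cov(phi, phi2, X):
    """Lambda_{p_hat}(phi, phi'): the 2k x 2k empirical second-moment
    matrix of the stacked features [phi(x); phi'(x)] (Definition 5.2)."""
    F = np.hstack([phi(X), phi2(X)])      # n x 2k
    return F.T @ F / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
phi  = lambda X: np.tanh(X[:, :3])            # toy k = 3 representation
phi2 = lambda X: np.maximum(X[:, 2:5], 0.0)   # another toy representation
Lam = empirical_symmetric_cov(phi, phi2, X)
print(np.linalg.eigvalsh(Lam).min() >= -1e-10)   # Lambda is PSD
```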
6 HIGH-DIMENSIONAL LINEAR REPRESENTATIONS

In this section, we consider the case where the representation is a general linear map without an explicit dimensionality constraint, and we prove a norm-based result by exploiting the intrinsic dimension of the representation. Such a generalization is desirable, since in many applications the representation dimension is not restricted. Without loss of generality, we let the representation function class be $\Phi=\{x\mapsto B^\top x\mid B\in\mathbb{R}^{d\times T}\}$; a dimension-$T$ representation is sufficient for learning $T$ source tasks, and any choice of dimension greater than $T$ would not change our argument. We use the same notation as in Section 4 unless otherwise specified. In this section we additionally assume that all tasks have the same input covariance:

Assumption 6.1. The input distributions in all tasks satisfy $\Sigma_1=\cdots=\Sigma_{T+1}=\Sigma$.

Note that each task $t$ still has its own specialization $w_t^*$ (cf. (4)). We remark that there are many interesting and nontrivial scenarios under Assumption 6.1; for example, consider the case where the inputs in each task are all images from ImageNet and each task asks whether the image is from a specific class.

Since we do not have a dimensionality constraint, we modify (7) by adding norm regularization:

$$(\hat B,\hat W)\in\arg\min_{B\in\mathbb{R}^{d\times T},\,W\in\mathbb{R}^{T\times T}}\ \frac{1}{2n_1}\|Y-\mathcal{X}(BW)\|_F^2+\frac{\lambda}{2}\|W\|_F^2+\frac{\lambda}{2}\|B\|_F^2. \quad (12)$$

For the target task, we also modify (8) by adding a norm constraint:

$$\hat w_{T+1}\in\arg\min_{\|w\|\le r}\ \frac{1}{2n_2}\|X_{T+1}\hat Bw-y_{T+1}\|^2. \quad (13)$$

We specify the choices of regularization, i.e., $\lambda$ and $r$, in Theorem 6.1. As in Section 4, the source task data satisfy $Y=\mathcal{X}(\Theta^*)+Z$, where $\Theta^*\in\mathbb{R}^{d\times T}$ is the ground truth and $Z$ has i.i.d. $N(0,\sigma^2)$ entries. Suppose that the target task data satisfy $y_{T+1}=X_{T+1}\theta_{T+1}^*+z_{T+1}\in\mathbb{R}^{n_2}$. As in Section 4, we assume the task data are subgaussian in the sense of Assumption 4.1.

Theorem 6.1 (main theorem for high-dimensional representations). Fix a failure probability $\delta\in(0,1)$. Under Assumptions 4.1 and 6.1, further assume $n_1\gtrsim n_2$, and let $R=\|\Theta^*\|_*$. Let $r=2\sqrt{R/T}$, $\bar R=R/\sqrt{T}$, and take the $\lambda$ specified in Lemma C.2. Let the target task model $\theta_{T+1}^*$ be coherent with the source task models $\Theta^*$ in the sense that $\theta_{T+1}^*\sim\nu=N\big(0,\Theta^*(\Theta^*)^\top/T\big)$. Then with probability at least $1-\delta$ over the samples, the expected excess risk of the learned predictor $x\mapsto\hat w_{T+1}^\top\hat B^\top x$ on the target task satisfies

$$\mathbb{E}_{\theta_{T+1}^*\sim\nu}\big[\mathrm{ER}(\hat B,\hat w_{T+1})\big]\le\sigma\bar R\cdot\tilde O\left(\sqrt{\frac{\mathrm{Tr}(\Sigma)}{n_1T}}+\sqrt{\frac{\|\Sigma\|_2}{n_2}}\right)+\zeta_{n_1,n_2}, \quad (14)$$

where $\zeta_{n_1,n_2}:=\rho^4\bar R^2\cdot\tilde O\big(\frac{\mathrm{Tr}(\Sigma)}{n_2}\big)$ collects lower-order terms due to the randomness of the input data. Here $\tilde O$ hides logarithmic factors.

The proof of Theorem 6.1 is given in Appendix C.
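A minimal numpy sketch of the regularized source objective (12) follows, using the standard fact that jointly penalizing $\frac{\lambda}{2}(\|B\|_F^2+\|W\|_F^2)$ acts like a nuclear-norm penalty on $\Theta=BW$ (cf. Srebro and Shraibman (2005)). The optimizer and its hyperparameters are illustrative assumptions; the paper's guarantees concern the global minimizer of (12).

```python
import numpy as np

def fit_source_factored(Xs, ys, lam=0.1, lr=0.05, iters=3000, seed=0):
    """Gradient descent on objective (12): weight decay on the factors
    B, W implicitly controls the nuclear norm of Theta = B W."""
    rng = np.random.default_rng(seed)
    T, (n1, d) = len(Xs), Xs[0].shape
    B = rng.normal(size=(d, T)) / np.sqrt(d)
    W = rng.normal(size=(T, T)) / np.sqrt(T)
    for _ in range(iters):
        gB, gW = lam * B, lam * W                 # weight-decay gradients
        for t in range(T):
            r = Xs[t] @ B @ W[:, t] - ys[t]
            gB += np.outer(Xs[t].T @ r, W[:, t]) / n1
            gW[:, t] += B.T @ (Xs[t].T @ r) / n1
        B -= lr * gB
        W -= lr * gW
    return B, W
```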
Note that $\|\Theta^*\|_F=\sqrt{T}$ when each $\theta_t^*$ has unit norm. Thus $\bar R=\|\Theta^*\|_*/\sqrt{T}$ should generally be regarded as $O(1)$ for a well-behaved $\Theta^*$ that is nearly low-dimensional. In this regime, Theorem 6.1 indicates that we are able to exploit all $n_1T$ samples from the source tasks, similar to Theorem 4.1. With a good representation, the sample complexity on the target task also improves over learning the target task from scratch. Consider the baseline of regular ridge regression applied directly to the target task data:

$$\hat\theta\in\arg\min_{\|\theta\|\le\|\theta_{T+1}^*\|}\ \frac{1}{2n_2}\|X_{T+1}\theta-y_{T+1}\|^2. \quad (15)$$

Its standard excess risk bound in fixed design is $\mathrm{ER}(\hat\theta_\lambda)\lesssim\sigma\sqrt{\frac{\|\theta_{T+1}^*\|_2^2\,\mathrm{Tr}(\Sigma)}{n_2}}$ (see, e.g., Hsu et al. (2012)). Taking expectation over $\theta_{T+1}^*\sim\nu=N(0,\Theta^*(\Theta^*)^\top/T)$, we obtain

$$\mathbb{E}_{\theta_{T+1}^*\sim\nu}\big[\mathrm{ER}(\hat\theta_\lambda)\big]\lesssim\sigma\,\frac{\|\Theta^*\|_F}{\sqrt T}\sqrt{\frac{\mathrm{Tr}(\Sigma)}{n_2}}. \quad (16)$$

Compared with (16), our bound (14) is an improvement as long as $\frac{\|\Theta^*\|_*^2}{\|\Theta^*\|_F^2}\ll\frac{\mathrm{Tr}(\Sigma)}{\|\Sigma\|_2}$.

The left-hand side $\frac{\|\Theta^*\|_*^2}{\|\Theta^*\|_F^2}$ is always at most the rank of $\Theta^*$, and we call it the intrinsic rank. Hence we can gain from representation learning when the source predictors are intrinsically low-dimensional. To understand intuitively how this is achieved, note that a representation $B$ reweighs linear combinations of the features according to their importance on the $T$ source tasks. We make an analogy with a simple case of feature selection. Suppose we have learned a representation vector $b$ where $b_i$ scales with the importance of the $i$-th feature, i.e., the representation is $\phi(x)=x\odot b$ (entry-wise product). Then ridge regression on the target task data $(X,y)$,

$$\min_{\|w\|\le r}\ \frac{1}{2n_2}\|X\,\mathrm{diag}(b)\,w-y\|_2^2,$$

is equivalent to

$$\min_{\|\mathrm{diag}(b)^{-1}v\|\le r}\ \frac{1}{2n_2}\|Xv-y\|_2^2.$$

From this reformulation we see that the features with large $|b_i|$ (those that were useful on the source tasks) will be used more heavily than those with small $|b_i|$, due to the reweighed $\ell_2$ constraint. Thus the important features are learned from the source tasks, and their coefficients are learned from the target task.

Remark 6.1 (the non-convex landscape). Although the optimization problem (12) is non-convex, its structure allows us to apply existing landscape analyses of matrix factorization problems (Haeffele et al., 2014) and show that it has no strict saddles and no bad local minima. Therefore, randomly initialized gradient descent or perturbed gradient descent is guaranteed to converge to a global minimum of (12) (Ge et al., 2015; Lee et al., 2016; Jin et al., 2017).

Remark 6.2 (multi-class problems). When both source and target tasks have multi-class labels instead of independent tasks, using the quadratic loss on the one-hot labels, our results apply similarly and attain an excess risk of the same form as (14), i.e., $\sigma\bar R\cdot\tilde O(\cdot)$ plus lower-order terms (see, e.g., Lee et al. (2020)). Notice that the result is independent of the number of classes.

7 NEURAL NETWORKS

In this section, we show that we can provably learn good representations in a neural network. Consider a two-layer ReLU neural network $f_{B,w}(x)=w^\top(B^\top x)_+$, where $w\in\mathbb{R}^d$, $B\in\mathbb{R}^{d_0\times d}$ and $x\in\mathbb{R}^{d_0}$. Here $(\cdot)_+$ is the ReLU activation $(z)_+=\max\{0,z\}$ applied element-wise. Namely, we let the representation function class be $\Phi=\{x\mapsto(B^\top x)_+\mid B\in\mathbb{R}^{d_0\times d}\}$. On the source tasks we use the square loss with a weight decay regularizer:⁵

$$(\hat B,\hat W)\in\arg\min_{B\in\mathbb{R}^{d_0\times d},\,W=[w_1,\ldots,w_T]\in\mathbb{R}^{d\times T}}\ \frac{1}{2n_1T}\sum_{t=1}^{T}\|y_t-(X_tB)_+w_t\|^2+\frac{\lambda}{2}\|B\|_F^2+\frac{\lambda}{2}\|W\|_F^2. \quad (17)$$

On the target task, we simply re-train the output layer while fixing the hidden layer weights:

$$\hat w_{T+1}\in\arg\min_{\|w\|\le r}\ \frac{1}{2n_2}\|y_{T+1}-(X_{T+1}\hat B)_+w\|^2. \quad (18)$$

Assumption 7.1. All tasks share the same input distribution: $p_1=\cdots=p_{T+1}=p$.

We redefine $\Sigma$ to be the covariance operator of the feature induced by the ReLU, i.e., the kernel defined by $\Sigma(u,v)=\mathbb{E}_{x\sim p}[(u^\top x)_+(v^\top x)_+]$ for $u,v$ on the unit sphere $\mathcal{S}^{d_0-1}\subseteq\mathbb{R}^{d_0}$.

Assumption 7.2 (teacher network). Assume that the source task data $y_t=(X_tB^*)_+w_t^*+z_t$ are generated by a teacher network with parameters $B^*\in\mathbb{R}^{d_0\times d}$, $W^*=[w_1^*,\ldots,w_T^*]\in\mathbb{R}^{d\times T}$, and noise $z_t\sim N(0,\sigma^2I)$.

A standard lifting of the neural network is $f_{\alpha_t}=\langle\alpha_t,\phi(x)\rangle$, where $\phi(x):\mathcal{S}^{d_0-1}\to\mathbb{R}$ with $\phi(x)_b=(b^\top x)_+$ is the feature map; i.e., for each task, $\alpha_t(b_i/\|b_i\|)=W_{i,t}\|b_i\|$ and $\alpha_t$ is zero elsewhere. We assume that the $\alpha_{T+1}$ describing the target function follows a Gaussian process $\nu$ with covariance function $K(b,b')=\sum_{t=1}^{T}\alpha_t(b)\alpha_t(b')$.
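Here is a minimal numpy sketch of the weight-decayed source training (17) and the head-only retraining (18) for the two-layer ReLU model. The manual gradients, step size, and the final shrinkage step are illustrative choices: the paper's guarantees are for the global minimizer (cf. footnote 5), and the exact norm-constrained solution of (18) would correspond to a tuned ridge penalty rather than simple shrinkage.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def train_source_relu(Xs, ys, width, lam=1e-3, lr=0.1, iters=3000, seed=0):
    """Gradient descent on objective (17): shared first layer B and
    per-task output heads W, both with weight decay lam."""
    rng = np.random.default_rng(seed)
    T, (n1, d0) = len(Xs), Xs[0].shape
    B = rng.normal(size=(d0, width)) / np.sqrt(d0)
    W = rng.normal(size=(width, T)) / np.sqrt(width)
    for _ in range(iters):
        gB, gW = lam * B, lam * W
        for t in range(T):
            H = relu(Xs[t] @ B)                          # hidden features
            r = H @ W[:, t] - ys[t]                      # residual
            gW[:, t] += H.T @ r / (n1 * T)
            M = (Xs[t] @ B > 0) * np.outer(r, W[:, t])   # ReLU mask
            gB += Xs[t].T @ M / (n1 * T)
        B -= lr * gB
        W -= lr * gW
    return B

def retrain_head(B, X, y, r_max):
    """Problem (18): least-squares head on frozen ReLU features, then a
    simple heuristic shrinkage onto the norm ball ||w|| <= r_max."""
    H = relu(X @ B)
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    n = np.linalg.norm(w)
    return w if n <= r_max else w * (r_max / n)
```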
Theorem 7.1. Fix a failure probability $\delta\in(0,1)$. Under Assumptions 4.1, 7.1 and 7.2, let $n_1\gtrsim n_2$ and $\bar R=\big(\frac12\|B^*\|_F^2+\frac12\|W^*\|_F^2\big)/\sqrt{T}$. Let the target task model $f_{\alpha_{T+1}}=\langle\alpha_{T+1},\phi(x)\rangle$ be coherent with the source task models in the sense that $\alpha_{T+1}\sim\nu$. Set $r^2=\big(\|B^*\|_F^2+\|W^*\|_F^2\big)/T$. Then with probability at least $1-\delta$ over the samples, the expected excess risk of the learned predictor $x\mapsto\hat w_{T+1}^\top(\hat B^\top x)_+$ on the target task satisfies

$$\mathbb{E}_{\alpha_{T+1}\sim\nu}\big[\mathrm{ER}(f_{\hat B,\hat w_{T+1}})\big]\le\sigma\bar R\cdot\tilde O\left(\sqrt{\frac{\mathrm{Tr}(\Sigma)}{n_1T}}+\sqrt{\frac{\|\Sigma\|_2}{n_2}}\right)+\zeta_{n_1,n_2}, \quad (19)$$

where $\zeta_{n_1,n_2}:=\rho^4\bar R^2\cdot\tilde O\big(\frac{\mathrm{Tr}(\Sigma)}{n_2}\big)$ is a lower-order term due to the randomness of the input data.

To highlight the advantage of representation learning, we compare to training a neural network with weight decay directly on the target task, under the same norm budget:

$$(\hat B,\hat w)\in\arg\min_{B,w:\ \frac12\|B\|_F^2+\frac12\|w\|^2\le\bar R\sqrt{T}}\ \frac{1}{2n_2}\|y_{T+1}-(X_{T+1}B)_+w\|^2. \quad (20)$$

The error of this baseline method in fixed design is

$$\mathbb{E}\big[\mathrm{ER}(f_{\hat B,\hat w})\big]\lesssim\sigma\bar R\sqrt{\frac{\mathrm{Tr}(\Sigma)}{n_2}}. \quad (21)$$

We see that (19) is always smaller than (21), since $n_1T\gg n_2$ and $\|\Sigma\|_2\le\mathrm{Tr}(\Sigma)$. See Appendix D for the proof of Theorem 7.1 and the calculation of (21).

8 CONCLUSION

We gave the first statistical analysis showing that representation learning can fully exploit all data points from the source tasks to enable few-shot learning on a target task. Such results were shown for both low-dimensional and high-dimensional representation function classes. There are many important directions to pursue in representation learning and few-shot learning. Our results in Sections 6 and 7 indicate that explicit low dimensionality is not necessary, and norm-based capacity control also forces the classifier to learn good representations. Further questions include whether this is a general phenomenon in all deep learning models, whether other forms of capacity control can be applied, and how to optimize to attain good representations.

⁵Wei et al. (2019) show that (17) can be minimized with polynomial iteration complexity using perturbed gradient descent, though potentially exponential width is required.

ACKNOWLEDGMENTS

SSD acknowledges support of the National Science Foundation (Grant No. DMS-1638352) and the Infosys Membership. JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0303, the Sloan Research Fellowship, and NSF CCF 2002272. WH is supported by NSF, ONR, the Simons Foundation, the Schmidt Foundation, Amazon Research, DARPA and SRC. QL is supported by NSF #2030859 and the Computing Research Association for the CIFellows Project. The authors also acknowledge the generous support of the Institute for Advanced Study through the Theoretical Machine Learning program, where SSD, WH, JDL, and QL were participants.

REFERENCES

Pierre Alquier, The Tien Mai, and Massimiliano Pontil. Regret bounds for lifelong learning. arXiv preprint arXiv:1610.08628, 2016.

Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.

Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 2000.

Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte. Convex neural networks. In Advances in Neural Information Processing Systems, pages 123-130, 2006.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.

Luca Bertinetto, Joao F. Henriques, Philip H. S. Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.

Rich Caruana. Multitask learning.
Machine Learning, 28(1):41-75, Jul 1997. ISSN 1573-0565. doi: 10.1023/A:1007379606734. URL https://doi.org/10.1023/A:1007379606734.

Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089, 2018.

Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In Proceedings of the 36th International Conference on Machine Learning, 2019.

Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.

Tomer Galanti, Lior Wolf, and Tamir Hazan. A theoretical framework for deep transfer learning. Information and Inference: A Journal of the IMA, 5(2):159-209, 2016.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797-842, 2015.

Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 6391-6400, 2019.

Benjamin Haeffele, Eric Young, and Rene Vidal. Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. In International Conference on Machine Learning, pages 2007-2015, 2014.

Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression. In Conference on Learning Theory, pages 9.1-9.24, 2012.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724-1732, 2017.

Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Adaptive gradient-based meta-learning methods. arXiv preprint arXiv:1906.02717, 2019.

Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246-1257, 2016.

Jason D. Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. arXiv preprint arXiv:2008.01064, 2020.

Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657-10665, 2019.

Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853-2884, 2016.

Daniel McNamara and Maria-Florina Balcan. Risk bounds for transferring representations with and without fine-tuning. In Proceedings of the 34th International Conference on Machine Learning, pages 2373-2381. JMLR.org, 2017.

Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. arXiv preprint arXiv:1909.09157, 2019.

Saharon Rosset, Grzegorz Swirszcz, Nathan Srebro, and Ji Zhu. $\ell_1$ regularization in infinite dimensional feature spaces. In International Conference on Computational Learning Theory, pages 544-558. Springer, 2007.

Tom Schaul and Jürgen Schmidhuber. Metalearning. Scholarpedia, 5(6):4650, 2010.
Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In International Conference on Computational Learning Theory, pages 545-560. Springer, 2005.

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pages 843-852, 2017.

Sebastian Thrun and Lorien Pratt. Learning to Learn: Introduction and Overview, pages 3-17. Springer US, Boston, MA, 1998. ISBN 978-1-4615-5529-2. doi: 10.1007/978-1-4615-5529-2_1. URL https://doi.org/10.1007/978-1-4615-5529-2_1.

Nilesh Tripuraneni, Chi Jin, and Michael I. Jordan. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020.

Joel A. Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1-230, 2015.

Roman Vershynin. Four lectures on probabilistic methods for data science. https://arxiv.org/pdf/1612.06661.pdf, 2017.

Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems, pages 9709-9721, 2019.

A PROOF OF THEOREM 4.1

We first prove several claims and then combine them to finish the proof of Theorem 4.1. We will use the technical lemmas proved in Section A.1.

Claim A.1 (covariance concentration of source tasks). Suppose $n_1\gtrsim\rho^4(d+\log(T/\delta))$ for $\delta\in(0,1)$. Then with probability at least $1-\frac{\delta}{10}$ over the inputs $X_1,\ldots,X_T$ in the source tasks, we have

$$0.9\,\Sigma_t\preceq\frac{1}{n_1}X_t^\top X_t\preceq1.1\,\Sigma_t,\qquad\forall t\in[T]. \quad (22)$$

Proof. According to our assumption on $p_t$, we can write $X_t=\bar X_t\Sigma_t^{1/2}$, where $\bar X_t\in\mathbb{R}^{n_1\times d}$ and the rows of $\bar X_t$ hold i.i.d. samples of $\bar p_t$. Since $\bar p_t$ satisfies the conditions in Lemma A.6, we know that with probability at least $1-\frac{\delta}{10T}$, $0.9I\preceq\frac{1}{n_1}\bar X_t^\top\bar X_t\preceq1.1I$, which implies $0.9\,\Sigma_t\preceq\frac{1}{n_1}\Sigma_t^{1/2}\bar X_t^\top\bar X_t\Sigma_t^{1/2}=\frac{1}{n_1}X_t^\top X_t\preceq1.1\,\Sigma_t$. The proof is finished by taking a union bound over all $t\in[T]$. ∎

Claim A.2 (covariance concentration of target task). Suppose $n_2\gtrsim\rho^4(k+\log(1/\delta))$ for $\delta\in(0,1)$. Then for any given matrix $B\in\mathbb{R}^{d\times2k}$ that is independent of $X_{T+1}$, with probability at least $1-\frac{\delta}{10}$ over $X_{T+1}$ we have

$$0.9\,B^\top\Sigma_{T+1}B\preceq\frac{1}{n_2}B^\top X_{T+1}^\top X_{T+1}B\preceq1.1\,B^\top\Sigma_{T+1}B. \quad (23)$$

Proof. According to our assumption on $p_{T+1}$, we can write $X_{T+1}=\bar X_{T+1}\Sigma_{T+1}^{1/2}$, where $\bar X_{T+1}\in\mathbb{R}^{n_2\times d}$ and the rows of $\bar X_{T+1}$ hold i.i.d. samples of $\bar p_{T+1}$. Take the SVD $\Sigma_{T+1}^{1/2}B=UDV^\top$, where $U\in\mathbb{R}^{d\times2k}$ has orthonormal columns. Now consider the matrix $\bar X_{T+1}U\in\mathbb{R}^{n_2\times2k}$: its rows are i.i.d. $2k$-dimensional random vectors with zero mean and identity covariance, and they are $\rho^2$-subgaussian. Therefore, applying Lemma A.6, with probability at least $1-\frac{\delta}{10}$,

$$0.9I\preceq\frac{1}{n_2}U^\top\bar X_{T+1}^\top\bar X_{T+1}U\preceq1.1I,$$

which implies $0.9\,VDDV^\top\preceq\frac{1}{n_2}VDU^\top\bar X_{T+1}^\top\bar X_{T+1}UDV^\top\preceq1.1\,VDDV^\top$. Since $\frac{1}{n_2}VDU^\top\bar X_{T+1}^\top\bar X_{T+1}UDV^\top=\frac{1}{n_2}B^\top\Sigma_{T+1}^{1/2}\bar X_{T+1}^\top\bar X_{T+1}\Sigma_{T+1}^{1/2}B=\frac{1}{n_2}B^\top X_{T+1}^\top X_{T+1}B$ and $VDDV^\top=VDU^\top UDV^\top=B^\top\Sigma_{T+1}B$, the above becomes (23). ∎

Claim A.3 (guarantee on source training data). Under the setting of Theorem 4.1, with probability at least $1-\frac{3\delta}{10}$ we have

$$\|\mathcal{X}(\hat B\hat W-B^*W^*)\|_F^2\lesssim\sigma^2\big(kT+kd\log(\kappa n_1)+\log(1/\delta)\big). \quad (24)$$

Proof. We assume that (22) holds, which happens with probability at least $1-\frac{\delta}{10}$ by Claim A.1. Let $\hat\Theta=\hat B\hat W$ and $\Theta^*=B^*W^*$. From the optimality of $\hat B$ and $\hat W$ for (7) we have $\|Y-\mathcal{X}(\hat\Theta)\|_F^2\le\|Y-\mathcal{X}(\Theta^*)\|_F^2$. Plugging in $Y=\mathcal{X}(\Theta^*)+Z$, this becomes

$$\|\mathcal{X}(\hat\Theta-\Theta^*)\|_F^2\le2\,\langle Z,\mathcal{X}(\hat\Theta-\Theta^*)\rangle. \quad (25)$$
Let $\Delta=\hat\Theta-\Theta^*$. Since $\mathrm{rank}(\Delta)\le2k$, we can write $\Delta=VR=[Vr_1,\ldots,Vr_T]$, where $V\in\mathcal{O}_{d,2k}$ and $R=[r_1,\ldots,r_T]\in\mathbb{R}^{2k\times T}$. Here $\mathcal{O}_{d_1,d_2}$ ($d_1\ge d_2$) is the set of orthonormal $d_1\times d_2$ matrices (i.e., matrices whose columns are orthonormal). For each $t\in[T]$ we further write $X_tV=U_tQ_t$ with $U_t\in\mathcal{O}_{n_1,2k}$. Then we have

$$\langle Z,\mathcal{X}(\Delta)\rangle=\sum_{t=1}^{T}z_t^\top X_tVr_t=\sum_{t=1}^{T}z_t^\top U_tQ_tr_t\le\sqrt{\sum_{t=1}^{T}\|U_t^\top z_t\|^2}\sqrt{\sum_{t=1}^{T}\|U_tQ_tr_t\|^2}=\sqrt{\sum_{t=1}^{T}\|U_t^\top z_t\|^2}\;\|\mathcal{X}(\Delta)\|_F. \quad (26)$$

Next we give a high-probability upper bound on $\sum_{t=1}^{T}\|U_t^\top z_t\|^2$ using the randomness in $Z$. Since the $U_t$'s depend on $V$, which depends on $Z$, we need an $\epsilon$-net argument to cover all possible $V\in\mathcal{O}_{d,2k}$. First, for any fixed $\bar V\in\mathcal{O}_{d,2k}$, write $X_t\bar V=\bar U_t\bar Q_t$ with $\bar U_t\in\mathcal{O}_{n_1,2k}$; the $\bar U_t$'s defined this way are independent of $Z$. Since $Z$ has i.i.d. $N(0,\sigma^2)$ entries, $\sigma^{-2}\sum_{t=1}^{T}\|\bar U_t^\top z_t\|^2$ is distributed as $\chi^2(2kT)$. Using the standard tail bound for $\chi^2$ random variables, with probability at least $1-\delta'$ over $Z$,

$$\sum_{t=1}^{T}\|\bar U_t^\top z_t\|^2\lesssim\sigma^2\big(kT+\log(1/\delta')\big).$$

Therefore, by the same argument as in (26), with probability at least $1-\delta'$,

$$\langle Z,\mathcal{X}(\bar VR)\rangle\lesssim\sigma\sqrt{kT+\log(1/\delta')}\;\|\mathcal{X}(\bar VR)\|_F.$$

Now, from Lemma A.5 we know that there exists an $\epsilon$-net $N$ of $\mathcal{O}_{d,2k}$ in Frobenius norm such that $N\subset\mathcal{O}_{d,2k}$ and $|N|\le(\frac{6\sqrt{2k}}{\epsilon})^{2kd}$. Applying a union bound over $N$, with probability at least $1-\delta'|N|$,

$$\langle Z,\mathcal{X}(\bar VR)\rangle\lesssim\sigma\sqrt{kT+\log(1/\delta')}\;\|\mathcal{X}(\bar VR)\|_F,\qquad\forall\bar V\in N. \quad (27)$$

Choosing $\delta'=\frac{\delta}{10}\big(\frac{6\sqrt{2k}}{\epsilon}\big)^{-2kd}$, we know that (27) holds with probability at least $1-\frac{\delta}{10}$.

We use (22), (25) and (27) to complete the proof, in the following steps.

1. Upper bounding $\|Z\|_F$. Since $\sigma^{-2}\|Z\|_F^2\sim\chi^2(n_1T)$, with probability at least $1-\frac{\delta}{10}$,

$$\|Z\|_F^2\lesssim\sigma^2\big(n_1T+\log(1/\delta)\big). \quad (28)$$

2. Upper bounding $\|\Delta\|_F$. From (25) we have $\|\mathcal{X}(\Delta)\|_F^2\le2\|Z\|_F\|\mathcal{X}(\Delta)\|_F$, which implies $\|\mathcal{X}(\Delta)\|_F\le2\|Z\|_F\lesssim\sigma\sqrt{n_1T+\log(1/\delta)}$. On the other hand, letting the $t$-th column of $\Delta$ be $\delta_t$, we have

$$\|\mathcal{X}(\Delta)\|_F^2=\sum_{t=1}^{T}\delta_t^\top X_t^\top X_t\delta_t\ge0.9\,n_1\sum_{t=1}^{T}\delta_t^\top\Sigma_t\delta_t\ge0.9\,n_1\sum_{t=1}^{T}\lambda_{\min}(\Sigma_t)\|\delta_t\|^2\ge0.9\,n_1\underline\lambda\,\|\Delta\|_F^2,$$

using (22), where $\underline\lambda=\min_{t\in[T]}\lambda_{\min}(\Sigma_t)$. Hence we obtain

$$\|\Delta\|_F^2\lesssim\frac{\sigma^2\big(n_1T+\log(1/\delta)\big)}{n_1\underline\lambda}.$$

3. Applying the $\epsilon$-net $N$. Let $\bar V\in N$ be such that $\|V-\bar V\|_F\le\epsilon$. Then

$$\|\mathcal{X}(VR-\bar VR)\|_F^2=\sum_{t=1}^{T}\|X_t(V-\bar V)r_t\|^2\le\sum_{t=1}^{T}\|X_t\|^2\|V-\bar V\|^2\|r_t\|^2\le\sum_{t=1}^{T}1.1\,n_1\lambda_{\max}(\Sigma_t)\,\epsilon^2\|r_t\|^2\le1.1\,n_1\bar\lambda\,\epsilon^2\|R\|_F^2,$$

using (22), where $\bar\lambda=\max_{t\in[T]}\lambda_{\max}(\Sigma_t)$. Since $\|R\|_F=\|VR\|_F=\|\Delta\|_F$, step 2 gives

$$\|\mathcal{X}(VR-\bar VR)\|_F^2\lesssim n_1\bar\lambda\,\epsilon^2\cdot\frac{\sigma^2\big(n_1T+\log(1/\delta)\big)}{n_1\underline\lambda}=\kappa\,\epsilon^2\sigma^2\big(n_1T+\log(1/\delta)\big). \quad (29)$$

4. Finishing the proof. We have the chain of inequalities

$$\begin{aligned}\tfrac12\|\mathcal{X}(\Delta)\|_F^2&\le\langle Z,\mathcal{X}(\Delta)\rangle\quad\text{(using (25))}\\&=\langle Z,\mathcal{X}(\bar VR)\rangle+\langle Z,\mathcal{X}(VR-\bar VR)\rangle\\&\lesssim\sigma\sqrt{kT+\log(1/\delta')}\,\|\mathcal{X}(\bar VR)\|_F+\|Z\|_F\,\|\mathcal{X}(VR-\bar VR)\|_F\quad\text{(using (27))}\\&\lesssim\sigma\sqrt{kT+\log(1/\delta')}\,\big(\|\mathcal{X}(\Delta)\|_F+\|\mathcal{X}(VR-\bar VR)\|_F\big)+\sigma\sqrt{n_1T+\log(1/\delta)}\,\|\mathcal{X}(VR-\bar VR)\|_F\quad\text{(using (28))}\\&\lesssim\sigma\sqrt{kT+\log(1/\delta')}\,\|\mathcal{X}(\Delta)\|_F+\sigma\sqrt{n_1T+\log(1/\delta')}\sqrt{\kappa\,\epsilon^2\sigma^2\big(n_1T+\log(1/\delta)\big)}\quad\text{(using (29), $k<n_1$ and $\delta'<\delta$)}\\&\lesssim\sigma\sqrt{kT+\log(1/\delta')}\,\|\mathcal{X}(\Delta)\|_F+\epsilon\,\sigma^2\sqrt\kappa\,\big(n_1T+\log(1/\delta')\big).\end{aligned}$$

Finally, let $\epsilon=\frac{k}{\sqrt\kappa\,n_1}$ and recall $\delta'=\frac{\delta}{10}\big(\frac{6\sqrt{2k}}{\epsilon}\big)^{-2kd}$, so that $\log(1/\delta')\lesssim kd\log(\kappa n_1)+\log(1/\delta)$. The above inequality then implies

$$\|\mathcal{X}(\Delta)\|_F^2\lesssim\max\Big\{\sigma^2\big(kT+\log(1/\delta')\big),\ \epsilon\,\sigma^2\sqrt\kappa\,\big(n_1T+\log(1/\delta')\big)\Big\}\lesssim\sigma^2\Big(kT+kd\log(\kappa n_1)+\log\tfrac1\delta\Big).$$

The high-probability events used in the proof are (22), (27) and (28); by a union bound, the total failure probability is at most $\frac{\delta}{10}+\frac{\delta}{5}$. ∎

Claim A.4 (guarantee on target training data). Under the setting of Theorem 4.1, with probability at least $1-\frac{2\delta}{5}$, we have

$$\frac{1}{n_2}\big\|P^\perp_{X_{T+1}\hat B}X_{T+1}B^*\big\|_F^2\lesssim\frac{\sigma^2\big(kT+kd\log(\kappa n_1)+\log\frac1\delta\big)}{cn_1\,\sigma_k^2(W^*)}.$$

Proof.
We suppose that the high-probability events in Claims A.1, A.2 and A.3 hold, which happens with probability at least $1-\frac{2\delta}{5}$; here we instantiate Claim A.2 with $B=[\hat B,B^*]\in\mathbb{R}^{d\times2k}$. From the optimality of $\hat B$ and $\hat W$ in (6) we know $X_t\hat B\hat w_t=P_{X_t\hat B}\,y_t=P_{X_t\hat B}(X_tB^*w_t^*+z_t)$ for each $t\in[T]$. Then we have

$$\begin{aligned}\sigma^2\big(kT+kd\log(\kappa n_1)+\log(1/\delta)\big)&\gtrsim\|\mathcal{X}(\hat B\hat W-B^*W^*)\|_F^2\quad\text{(from (24))}\\&=\sum_{t=1}^{T}\|X_t\hat B\hat w_t-X_tB^*w_t^*\|^2\\&=\sum_{t=1}^{T}\|P_{X_t\hat B}(X_tB^*w_t^*+z_t)-X_tB^*w_t^*\|^2\\&=\sum_{t=1}^{T}\|{-}P^\perp_{X_t\hat B}X_tB^*w_t^*+P_{X_t\hat B}z_t\|^2\\&=\sum_{t=1}^{T}\Big(\|P^\perp_{X_t\hat B}X_tB^*w_t^*\|^2+\|P_{X_t\hat B}z_t\|^2\Big)\quad\text{(the cross term is 0)}\\&\ge\sum_{t=1}^{T}\|P^\perp_{X_t\hat B}X_tB^*w_t^*\|^2\\&\gtrsim n_1\sum_{t=1}^{T}\|P^\perp_{\Sigma_t^{1/2}\hat B}\Sigma_t^{1/2}B^*w_t^*\|^2\quad\text{(using (22) and Lemma A.7)}\\&\gtrsim c\,n_1\sum_{t=1}^{T}\|P^\perp_{\Sigma_{T+1}^{1/2}\hat B}\Sigma_{T+1}^{1/2}B^*w_t^*\|^2\quad\text{(using Assumption 4.2 and Lemma A.7)}\\&=c\,n_1\,\|P^\perp_{\Sigma_{T+1}^{1/2}\hat B}\Sigma_{T+1}^{1/2}B^*W^*\|_F^2\\&\ge c\,n_1\,\|P^\perp_{\Sigma_{T+1}^{1/2}\hat B}\Sigma_{T+1}^{1/2}B^*\|_F^2\,\sigma_k^2(W^*).\end{aligned}$$

Next, write $\hat B=[\hat B,B^*]\binom{I}{0}=:BA$ and $B^*=[\hat B,B^*]\binom{0}{I}=:BC$. Recall from Claim A.2 that $\frac{1}{n_2}B^\top X_{T+1}^\top X_{T+1}B\preceq1.1\,B^\top\Sigma_{T+1}B$. Then, using Lemma A.7, we obtain

$$\frac{1}{1.1\,n_2}\|P^\perp_{X_{T+1}BA}X_{T+1}BC\|_F^2\le\|P^\perp_{\Sigma_{T+1}^{1/2}BA}\Sigma_{T+1}^{1/2}BC\|_F^2,\quad\text{i.e.,}\quad\frac{1}{1.1\,n_2}\|P^\perp_{X_{T+1}\hat B}X_{T+1}B^*\|_F^2\le\|P^\perp_{\Sigma_{T+1}^{1/2}\hat B}\Sigma_{T+1}^{1/2}B^*\|_F^2.$$

Therefore

$$\sigma^2\big(kT+kd\log(\kappa n_1)+\log(1/\delta)\big)\gtrsim\frac{0.9\,cn_1}{n_2}\,\|P^\perp_{X_{T+1}\hat B}X_{T+1}B^*\|_F^2\,\sigma_k^2(W^*),$$

completing the proof. ∎

Proof of Theorem 4.1. We use all the high-probability events in Claims A.1, A.2, A.3 and A.4, with Claim A.2 instantiated at $B=[\hat B,B^*]\in\mathbb{R}^{d\times2k}$. The success probability is at least $1-\frac{4\delta}{5}$.

For the target task, the excess risk of the learned linear predictor $x\mapsto(\hat B\hat w_{T+1})^\top x$ is

$$\mathrm{ER}(\hat B,\hat w_{T+1})=\tfrac12\,\mathbb{E}_{x\sim p_{T+1}}\big[\big(x^\top(\hat B\hat w_{T+1}-B^*w_{T+1}^*)\big)^2\big]=\tfrac12\,(\hat B\hat w_{T+1}-B^*w_{T+1}^*)^\top\Sigma_{T+1}(\hat B\hat w_{T+1}-B^*w_{T+1}^*).$$

Applying Claim A.2 with $B=[\hat B,B^*]$ and the vector $v=\binom{\hat w_{T+1}}{-w_{T+1}^*}$, the inequality $0.9\,B^\top\Sigma_{T+1}B\preceq\frac{1}{n_2}B^\top X_{T+1}^\top X_{T+1}B$ gives

$$(\hat B\hat w_{T+1}-B^*w_{T+1}^*)^\top\Sigma_{T+1}(\hat B\hat w_{T+1}-B^*w_{T+1}^*)\le\frac{1}{0.9\,n_2}\|X_{T+1}(\hat B\hat w_{T+1}-B^*w_{T+1}^*)\|^2.$$

Therefore

$$\mathrm{ER}(\hat B,\hat w_{T+1})\le\frac{1}{1.8\,n_2}\|X_{T+1}(\hat B\hat w_{T+1}-B^*w_{T+1}^*)\|^2.$$

From the optimality of $\hat w_{T+1}$ in (8) we know $X_{T+1}\hat B\hat w_{T+1}=P_{X_{T+1}\hat B}\,y_{T+1}=P_{X_{T+1}\hat B}(X_{T+1}B^*w_{T+1}^*+z_{T+1})$. It follows that

$$\mathrm{ER}(\hat B,\hat w_{T+1})\le\frac{1}{1.8\,n_2}\big\|{-}P^\perp_{X_{T+1}\hat B}X_{T+1}B^*w_{T+1}^*+P_{X_{T+1}\hat B}z_{T+1}\big\|^2\lesssim\frac{1}{n_2}\Big(\|P^\perp_{X_{T+1}\hat B}X_{T+1}B^*w_{T+1}^*\|^2+\|P_{X_{T+1}\hat B}z_{T+1}\|^2\Big).$$

Recall that $w_{T+1}^*\sim\nu$ with $\bar\Sigma:=\mathbb{E}_{w\sim\nu}[ww^\top]\preceq O(\frac1k)\cdot I$. Taking the expectation over $w_{T+1}^*\sim\nu$, we obtain

$$\begin{aligned}\mathbb{E}_{w_{T+1}^*\sim\nu}\big[\mathrm{ER}(\hat B,\hat w_{T+1})\big]&\lesssim\frac1{n_2}\Big(\mathrm{Tr}\big(P^\perp_{X_{T+1}\hat B}X_{T+1}B^*\,\bar\Sigma\,(X_{T+1}B^*)^\top P^\perp_{X_{T+1}\hat B}\big)+\|P_{X_{T+1}\hat B}z_{T+1}\|^2\Big)\\&\lesssim\frac1{n_2}\Big(\frac1k\,\|P^\perp_{X_{T+1}\hat B}X_{T+1}B^*\|_F^2+\|P_{X_{T+1}\hat B}z_{T+1}\|^2\Big)\quad\text{(using }\bar\Sigma\preceq O(\tfrac1k)I)\\&\lesssim\frac{\sigma^2\big(kT+kd\log(\kappa n_1)+\log(1/\delta)\big)}{cn_1\,\sigma_k^2(W^*)}\cdot\frac1k+\frac1{n_2}\|P_{X_{T+1}\hat B}z_{T+1}\|^2\quad\text{(using Claim A.4)}\\&\lesssim\frac{\sigma^2\big(kT+kd\log(\kappa n_1)+\log(1/\delta)\big)}{cn_1T}+\frac1{n_2}\|P_{X_{T+1}\hat B}z_{T+1}\|^2.\quad\text{(using }\sigma_k^2(W^*)\gtrsim\tfrac Tk)\end{aligned}$$

For the last term, notice that $\frac{1}{\sigma^2}\|P_{X_{T+1}\hat B}z_{T+1}\|^2\sim\chi^2(k)$, so with probability at least $1-\frac{\delta}{5}$ we have $\frac{1}{\sigma^2}\|P_{X_{T+1}\hat B}z_{T+1}\|^2\lesssim k+\log\frac1\delta$. Therefore we obtain the final bound

$$\mathbb{E}_{w_{T+1}^*\sim\nu}\big[\mathrm{ER}(\hat B,\hat w_{T+1})\big]\lesssim\sigma^2\left(\frac{kd\log(\kappa n_1)}{cn_1T}+\frac{k}{cn_1}+\frac{\log\frac1\delta}{cn_1T}+\frac{k+\log\frac1\delta}{n_2}\right)\lesssim\sigma^2\left(\frac{kd\log(\kappa n_1)}{cn_1T}+\frac{k+\log\frac1\delta}{n_2}\right),$$

where the last inequality uses $cn_1\gtrsim n_2$. ∎

A.1 TECHNICAL LEMMAS

Lemma A.5. Let $\mathcal{O}_{d_1,d_2}=\{V\in\mathbb{R}^{d_1\times d_2}\mid V^\top V=I\}$ ($d_1\ge d_2$), and let $\epsilon\in(0,1)$.
Then there exists a subset $N\subset\mathcal{O}_{d_1,d_2}$ that is an $\epsilon$-net of $\mathcal{O}_{d_1,d_2}$ in Frobenius norm with $|N|\le\big(\frac{6\sqrt{d_2}}{\epsilon}\big)^{d_1d_2}$; i.e., for any $V\in\mathcal{O}_{d_1,d_2}$ there exists $\bar V\in N$ such that $\|V-\bar V\|_F\le\epsilon$.

Proof. For any $V\in\mathcal{O}_{d_1,d_2}$, each column of $V$ has unit $\ell_2$ norm. It is well known that there exists an $\frac{\epsilon}{2\sqrt{d_2}}$-net (in $\ell_2$ norm) of the unit sphere in $\mathbb{R}^{d_1}$ of size at most $\big(\frac{6\sqrt{d_2}}{\epsilon}\big)^{d_1}$. Using this net to cover all the columns, we obtain a set $N'\subset\mathbb{R}^{d_1\times d_2}$ that is an $\frac{\epsilon}{2}$-net of $\mathcal{O}_{d_1,d_2}$ in Frobenius norm with $|N'|\le\big(\frac{6\sqrt{d_2}}{\epsilon}\big)^{d_1d_2}$.

Finally, we transform $N'$ into an $\epsilon$-net $N$ that is a subset of $\mathcal{O}_{d_1,d_2}$ by projecting each point of $N'$ onto $\mathcal{O}_{d_1,d_2}$: for each $\bar V\in N'$, let $P(\bar V)$ be its closest point in $\mathcal{O}_{d_1,d_2}$ (in Frobenius norm), and define $N=\{P(\bar V)\mid\bar V\in N'\}$. Then $|N|\le|N'|\le\big(\frac{6\sqrt{d_2}}{\epsilon}\big)^{d_1d_2}$, and $N$ is an $\epsilon$-net of $\mathcal{O}_{d_1,d_2}$: for any $V\in\mathcal{O}_{d_1,d_2}$ there exists $\bar V\in N'$ with $\|V-\bar V\|_F\le\frac\epsilon2$, which implies $P(\bar V)\in N$ and

$$\|V-P(\bar V)\|_F\le\|V-\bar V\|_F+\|\bar V-P(\bar V)\|_F\le\|V-\bar V\|_F+\|\bar V-V\|_F=2\|V-\bar V\|_F\le\epsilon.\ \blacksquare$$

Lemma A.6. Let $a_1,\ldots,a_n$ be i.i.d. $d$-dimensional random vectors such that $\mathbb{E}[a_i]=0$, $\mathbb{E}[a_ia_i^\top]=I$, and $a_i$ is $\rho^2$-subgaussian. For $\delta\in(0,1)$, suppose $n\gtrsim\rho^4(d+\log(1/\delta))$. Then with probability at least $1-\delta$ we have

$$0.9I\preceq\frac1n\sum_{i=1}^{n}a_ia_i^\top\preceq1.1I.$$

Proof. Let $A=\frac1n\sum_{i=1}^{n}a_ia_i^\top-I$. It suffices to show $\|A\|\le0.1$ with probability at least $1-\delta$. We use a standard $\epsilon$-net argument on the unit sphere $\mathcal{S}^{d-1}=\{v\in\mathbb{R}^d:\|v\|=1\}$.

First, consider any fixed $v\in\mathcal{S}^{d-1}$. We have $v^\top Av=\frac1n\sum_{i=1}^{n}\big[(v^\top a_i)^2-1\big]$. From our assumptions on $a_i$, the random variable $v^\top a_i$ has mean 0 and variance 1 and is $\rho^2$-subgaussian (note that we must have $\rho\ge1$). Therefore $(v^\top a_i)^2-1$ is zero-mean and $16\rho^2$-sub-exponential. By Bernstein's inequality for sub-exponential random variables, for any $\epsilon>0$,

$$\Pr\big[|v^\top Av|>\epsilon\big]\le2\exp\left(-\frac n2\min\Big\{\frac{\epsilon^2}{(16\rho^2)^2},\,\frac{\epsilon}{16\rho^2}\Big\}\right).$$

Next, take a $\frac15$-net $N\subset\mathcal{S}^{d-1}$ of $\mathcal{S}^{d-1}$ with $|N|\le e^{O(d)}$. By a union bound over $v\in N$,

$$\Pr\Big[\max_{v\in N}|v^\top Av|>\epsilon\Big]\le2|N|\exp\left(-\frac n2\min\Big\{\frac{\epsilon^2}{(16\rho^2)^2},\,\frac{\epsilon}{16\rho^2}\Big\}\right).$$

Plugging in $\epsilon=\frac1{20}$ and noticing $\rho\ge1$, the right-hand side is at most $\delta$, because $n\gtrsim\rho^4(d+\log(1/\delta))$. Therefore, with probability at least $1-\delta$ we have $\max_{v\in N}|v^\top Av|\le\frac1{20}$. Suppose this indeed happens. For any $u\in\mathcal{S}^{d-1}$, there exists $u'\in N$ with $\|u-u'\|\le\frac15$, and

$$u^\top Au\le(u')^\top Au'+2(u-u')^\top Au'+(u-u')^\top A(u-u')\le\frac1{20}+2\|u-u'\|\,\|A\|+\|u-u'\|^2\,\|A\|\le\frac1{20}+\frac12\|A\|.$$

Taking the supremum over $u\in\mathcal{S}^{d-1}$, we obtain $\|A\|\le\frac1{20}+\frac12\|A\|$, i.e., $\|A\|\le\frac1{10}$. ∎

Lemma A.7. If two matrices $A_1$ and $A_2$ (with the same number of columns) satisfy $A_1^\top A_1\preceq A_2^\top A_2$, then for any matrix $B$ (of compatible dimensions) we have $A_1^\top P^\perp_{A_1B}A_1\preceq A_2^\top P^\perp_{A_2B}A_2$. As a consequence, for any matrices $B$ and $B'$ (of compatible dimensions), we have $\|P^\perp_{A_1B}A_1B'\|_F^2\le\|P^\perp_{A_2B}A_2B'\|_F^2$.

Proof. For the first part, it suffices to show, for any vector $v$, that $v^\top A_1^\top P^\perp_{A_1B}A_1v\le v^\top A_2^\top P^\perp_{A_2B}A_2v$, which is equivalent to $\min_w\|A_1Bw-A_1v\|_2^2\le\min_w\|A_2Bw-A_2v\|_2^2$ (the projection residual). Let $w^*\in\arg\min_w\|A_2Bw-A_2v\|_2^2$. Then we have

$$\min_w\|A_1Bw-A_1v\|_2^2\le\|A_1Bw^*-A_1v\|_2^2=(Bw^*-v)^\top A_1^\top A_1(Bw^*-v)\le(Bw^*-v)^\top A_2^\top A_2(Bw^*-v)=\|A_2Bw^*-A_2v\|_2^2=\min_w\|A_2Bw-A_2v\|_2^2,$$

finishing the proof of the first part. For the second part, from $A_1^\top P^\perp_{A_1B}A_1\preceq A_2^\top P^\perp_{A_2B}A_2$ we know $(B')^\top A_1^\top P^\perp_{A_1B}A_1B'\preceq(B')^\top A_2^\top P^\perp_{A_2B}A_2B'$. Taking the trace of both sides (and using $(P^\perp)^2=P^\perp$), we obtain $\|P^\perp_{A_1B}A_1B'\|_F^2\le\|P^\perp_{A_2B}A_2B'\|_F^2$, which finishes the proof. ∎
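A quick numerical sanity check of Lemma A.6 (illustrative only, not part of the paper): draw i.i.d. isotropic Gaussian (hence subgaussian) vectors and verify that, once $n$ is a large multiple of $d$, the eigenvalues of the empirical second moment fall inside $[0.9, 1.1]$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
n = 500 * d                              # n a large multiple of d
A = rng.normal(size=(n, d))              # rows: isotropic subgaussian vectors
S = A.T @ A / n                          # empirical second moment
lo, hi = np.linalg.eigvalsh(S)[[0, -1]]
print(0.9 <= lo and hi <= 1.1)           # holds w.h.p., as in Lemma A.6
```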
B PROOF OF THEOREM 5.1

Here we first prove an important intermediate result on the in-sample risk, which explains how the Gaussian width of $F_X(\Phi)$ arises.

Claim B.1 (analogue of Claim A.3). Let $\hat\phi$ and $\hat w_1,\ldots,\hat w_T$ be the optimal solution to (2). Then with probability at least $1-\delta$ we have

$$\sum_{t=1}^{T}\|\hat\phi(X_t)\hat w_t-\phi^*(X_t)w_t^*\|^2\lesssim\sigma^2\Big(G(F_X(\Phi))^2+\log\tfrac1\delta\Big).$$

Proof. By the optimality of $\hat\phi$ and $\hat w_1,\ldots,\hat w_T$ for (2), we know

$$\sum_{t=1}^{T}\|y_t-\hat\phi(X_t)\hat w_t\|^2\le\sum_{t=1}^{T}\|y_t-\phi^*(X_t)w_t^*\|^2.$$

Plugging in $y_t=\phi^*(X_t)w_t^*+z_t$ (with $z_t\sim N(0,\sigma^2I)$ independent of $X_t$), we get

$$\sum_{t=1}^{T}\|\phi^*(X_t)w_t^*+z_t-\hat\phi(X_t)\hat w_t\|^2\le\sum_{t=1}^{T}\|z_t\|^2,$$

which gives

$$\sum_{t=1}^{T}\|\hat\phi(X_t)\hat w_t-\phi^*(X_t)w_t^*\|^2\le2\sum_{t=1}^{T}\big\langle z_t,\hat\phi(X_t)\hat w_t-\phi^*(X_t)w_t^*\big\rangle.$$

Denote $Z=[z_1,\ldots,z_T]\in\mathbb{R}^{n_1\times T}$ and $A=[a_1,\ldots,a_T]\in\mathbb{R}^{n_1\times T}$, where $a_t=\hat\phi(X_t)\hat w_t-\phi^*(X_t)w_t^*$. Then the above inequality reads $\|A\|_F^2\le2\langle Z,A\rangle$. Notice that $\frac{A}{\|A\|_F}\in F_X(\Phi)$ (cf. (10)). It follows that

$$\|A\|_F\le2\Big\langle Z,\frac{A}{\|A\|_F}\Big\rangle\le2\sup_{\bar A\in F_X(\Phi)}\langle Z,\bar A\rangle. \quad (30)$$

By definition, $\mathbb{E}_Z\big[\sup_{\bar A\in F_X(\Phi)}\langle\sigma^{-1}Z,\bar A\rangle\big]=G(F_X(\Phi))$. Furthermore, since the function $Z\mapsto\sup_{\bar A\in F_X(\Phi)}\langle Z,\bar A\rangle$ is 1-Lipschitz in Frobenius norm (elements of $F_X(\Phi)$ have unit Frobenius norm), the standard Gaussian concentration inequality gives, with probability at least $1-\delta$,

$$\sup_{\bar A\in F_X(\Phi)}\langle\sigma^{-1}Z,\bar A\rangle\le\mathbb{E}\Big[\sup_{\bar A\in F_X(\Phi)}\langle\sigma^{-1}Z,\bar A\rangle\Big]+\sqrt{2\log\tfrac1\delta}=G(F_X(\Phi))+\sqrt{2\log\tfrac1\delta}.$$

The proof is then completed using (30). ∎

The remaining proof is conditioned on several high-probability events, each happening with probability at least $1-\Omega(\delta)$; by a union bound at the end, the final success probability is also at least $1-\Omega(\delta)$. We can always rescale $\delta$ by a constant factor so that the final probability is at least $1-\delta$. Therefore, we do not carefully track the constants in front of $\delta$, and all $\delta$'s should be understood as $\Omega(\delta)$.

We use the following notion of representation divergence.

Definition B.1 (divergence between two representations). Given a distribution $q$ over $\mathbb{R}^d$ and two representation functions $\phi,\phi'\in\Phi$, the divergence between $\phi$ and $\phi'$ with respect to $q$ is defined as

$$D_q(\phi,\phi')=\Sigma_q(\phi',\phi')-\Sigma_q(\phi',\phi)\big(\Sigma_q(\phi,\phi)\big)^\dagger\Sigma_q(\phi,\phi')\ \in\mathbb{R}^{k\times k}.$$

It is easy to verify that $D_q(\phi,\phi')\succeq0$ and $D_q(\phi,\phi)=0$ for any $\phi,\phi'$ and $q$; see the proof of Lemma B.2.

The next lemma shows a relation between the (symmetric) covariance and the divergence.

Lemma B.2. Suppose that two representation functions $\phi,\phi'\in\Phi$ and two distributions $q,q'$ over $\mathbb{R}^d$ satisfy $\Lambda_q(\phi,\phi')\succeq\alpha\,\Lambda_{q'}(\phi,\phi')$ for some $\alpha>0$. Then it must hold that $D_q(\phi,\phi')\succeq\alpha\,D_{q'}(\phi,\phi')$.

Proof. Fix any $v\in\mathbb{R}^k$. We will prove $v^\top D_q(\phi,\phi')v\ge\alpha\,v^\top D_{q'}(\phi,\phi')v$, which will complete the proof of the lemma. Define a quadratic function $f:\mathbb{R}^k\to\mathbb{R}$ as

$$f(w)=[w^\top,-v^\top]\,\Lambda_q(\phi,\phi')\begin{pmatrix}w\\-v\end{pmatrix}=w^\top\Sigma_q(\phi,\phi)w-2w^\top\Sigma_q(\phi,\phi')v+v^\top\Sigma_q(\phi',\phi')v=\mathbb{E}_{x\sim q}\Big[\big(w^\top\phi(x)-v^\top\phi'(x)\big)^2\Big].$$

Therefore $f(w)\ge0$ for all $w\in\mathbb{R}^k$,⁶ so $f$ has a global minimizer in $\mathbb{R}^k$. Since $f$ is convex, setting its gradient $\nabla f(w)=2\Sigma_q(\phi,\phi)w-2\Sigma_q(\phi,\phi')v$ to zero yields a global minimizer $w^*=(\Sigma_q(\phi,\phi))^\dagger\Sigma_q(\phi,\phi')v$. Plugging this into the definition of $f$, we obtain⁷

$$\min_{w\in\mathbb{R}^k}f(w)=f(w^*)=v^\top D_q(\phi,\phi')v. \quad (31)$$

Similarly, letting $g(w)=[w^\top,-v^\top]\,\Lambda_{q'}(\phi,\phi')\binom{w}{-v}$, we have $\min_{w\in\mathbb{R}^k}g(w)=v^\top D_{q'}(\phi,\phi')v$.

From $\Lambda_q(\phi,\phi')\succeq\alpha\,\Lambda_{q'}(\phi,\phi')$ we know $f(w)\ge\alpha\,g(w)$ for all $w\in\mathbb{R}^k$. Recall that $w^*\in\arg\min_w f(w)$. We have

$$\alpha\,v^\top D_{q'}(\phi,\phi')v=\alpha\min_{w}g(w)\le\alpha\,g(w^*)\le f(w^*)=\min_{w}f(w)=v^\top D_q(\phi,\phi')v.$$

This finishes the proof. ∎

⁶Note that this also proves $\Lambda_q(\phi,\phi')\succeq0$.
⁷Note that (31) implies $D_q(\phi,\phi')\succeq0$.
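A small numerical sketch of Definition B.1 (illustrative, with assumed toy feature maps): compute the empirical divergence $D_{\hat q}(\phi,\phi')$ via the pseudoinverse and check that it is PSD and that $D_{\hat q}(\phi,\phi)=0$.

```python
import numpy as np

def cov(phiA, phiB, X):
    return phiA(X).T @ phiB(X) / len(X)       # empirical Sigma_q(phiA, phiB)

def divergence(phi, phi2, X):
    """D_q(phi, phi') = S(p',p') - S(p',p) S(p,p)^+ S(p,p')  (Def. B.1)."""
    S11, S12 = cov(phi, phi, X), cov(phi, phi2, X)
    S22 = cov(phi2, phi2, X)
    return S22 - S12.T @ np.linalg.pinv(S11) @ S12

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 8))
phi  = lambda X: np.tanh(X[:, :3])
phi2 = lambda X: X[:, :3] ** 2 - 1.0
D = divergence(phi, phi2, X)
print(np.linalg.eigvalsh(D).min() >= -1e-8)               # D is PSD
print(np.allclose(divergence(phi, phi, X), 0, atol=1e-8)) # D(phi, phi) = 0
```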
Claim B.3 (analogue of Claim A.4). Under the setting of Theorem 5.1, with probability at least $1-\delta$ we have

$$\frac{1}{n_2}\big\|P^\perp_{\hat\phi(X_{T+1})}\phi^*(X_{T+1})\big\|_F^2\lesssim\frac{\sigma^2\big(G(F_X(\Phi))^2+\log\frac1\delta\big)}{n_1\,\sigma_k^2(W^*)}.$$

Proof. We continue to use the notation from Claim B.1 and its proof. Let $\hat p_t$ be the empirical distribution over the samples in $X_t$ ($t\in[T+1]$). According to Assumptions 5.1 and 5.2 and the setting of Theorem 5.1, the following hold with probability at least $1-\delta$:

$$0.9\,\Lambda_p(\phi,\phi')\preceq\Lambda_{\hat p_t}(\phi,\phi')\preceq1.1\,\Lambda_p(\phi,\phi'),\quad\forall\phi,\phi'\in\Phi,\ t\in[T],\qquad 0.9\,\Lambda_p(\hat\phi,\phi^*)\preceq\Lambda_{\hat p_{T+1}}(\hat\phi,\phi^*)\preceq1.1\,\Lambda_p(\hat\phi,\phi^*). \quad (32)$$

Notice that $\hat\phi$ and $\phi^*$ are independent of the samples from the target task, so $n_2\ge N_{\mathrm{point}}(\Phi,p,\frac\delta3)$ suffices for the second inequality above to hold with high probability. Using Lemma B.2, we know that (32) implies

$$0.9\,D_p(\phi,\phi')\preceq D_{\hat p_t}(\phi,\phi')\preceq1.1\,D_p(\phi,\phi'),\quad\forall\phi,\phi'\in\Phi,\ t\in[T],\qquad 0.9\,D_p(\hat\phi,\phi^*)\preceq D_{\hat p_{T+1}}(\hat\phi,\phi^*)\preceq1.1\,D_p(\hat\phi,\phi^*). \quad (33)$$

By the optimality of $\hat\phi$ and $\hat w_1,\ldots,\hat w_T$ for (2), we know $\hat\phi(X_t)\hat w_t=P_{\hat\phi(X_t)}y_t=P_{\hat\phi(X_t)}(\phi^*(X_t)w_t^*+z_t)$. Then we have the following chain of inequalities:

$$\begin{aligned}\sigma^2\Big(G(F_X(\Phi))^2+\log\tfrac1\delta\Big)&\gtrsim\sum_{t=1}^{T}\|\hat\phi(X_t)\hat w_t-\phi^*(X_t)w_t^*\|^2\quad\text{(Claim B.1)}\\&=\sum_{t=1}^{T}\|P_{\hat\phi(X_t)}(\phi^*(X_t)w_t^*+z_t)-\phi^*(X_t)w_t^*\|^2\\&=\sum_{t=1}^{T}\|{-}P^\perp_{\hat\phi(X_t)}\phi^*(X_t)w_t^*+P_{\hat\phi(X_t)}z_t\|^2\\&=\sum_{t=1}^{T}\Big(\|P^\perp_{\hat\phi(X_t)}\phi^*(X_t)w_t^*\|^2+\|P_{\hat\phi(X_t)}z_t\|^2\Big)\quad\text{(cross term is 0)}\\&\ge\sum_{t=1}^{T}\|P^\perp_{\hat\phi(X_t)}\phi^*(X_t)w_t^*\|^2\\&=\sum_{t=1}^{T}(w_t^*)^\top\phi^*(X_t)^\top\Big(I-\hat\phi(X_t)\big(\hat\phi(X_t)^\top\hat\phi(X_t)\big)^\dagger\hat\phi(X_t)^\top\Big)\phi^*(X_t)w_t^*\\&=n_1\sum_{t=1}^{T}(w_t^*)^\top D_{\hat p_t}(\hat\phi,\phi^*)\,w_t^*\\&\ge0.9\,n_1\sum_{t=1}^{T}(w_t^*)^\top D_p(\hat\phi,\phi^*)\,w_t^*\quad\text{(by (33))}\\&=0.9\,n_1\,\mathrm{Tr}\Big[\big(D_p(\hat\phi,\phi^*)\big)^{1/2}W^*(W^*)^\top\big(D_p(\hat\phi,\phi^*)\big)^{1/2}\Big]\\&\ge0.9\,n_1\,\mathrm{Tr}\big[D_p(\hat\phi,\phi^*)\big]\,\sigma_k^2(W^*)\\&\ge\frac{0.9}{1.1}\,n_1\,\mathrm{Tr}\big[D_{\hat p_{T+1}}(\hat\phi,\phi^*)\big]\,\sigma_k^2(W^*)\quad\text{(by (33))}\\&=\frac{0.9}{1.1}\cdot\frac{n_1}{n_2}\,\big\|P^\perp_{\hat\phi(X_{T+1})}\phi^*(X_{T+1})\big\|_F^2\,\sigma_k^2(W^*),\end{aligned}$$

completing the proof. ∎

Now we can finish the proof of Theorem 5.1.

Proof of Theorem 5.1. The excess risk is bounded as

$$\begin{aligned}\mathrm{ER}(\hat\phi,\hat w_{T+1})&=\tfrac12\,\mathbb{E}_{x\sim p}\Big[\big(\hat w_{T+1}^\top\hat\phi(x)-(w_{T+1}^*)^\top\phi^*(x)\big)^2\Big]\\&=\tfrac12\begin{pmatrix}\hat w_{T+1}\\-w_{T+1}^*\end{pmatrix}^\top\Lambda_p(\hat\phi,\phi^*)\begin{pmatrix}\hat w_{T+1}\\-w_{T+1}^*\end{pmatrix}\\&\lesssim\begin{pmatrix}\hat w_{T+1}\\-w_{T+1}^*\end{pmatrix}^\top\Lambda_{\hat p_{T+1}}(\hat\phi,\phi^*)\begin{pmatrix}\hat w_{T+1}\\-w_{T+1}^*\end{pmatrix}\quad\text{(by (32))}\\&=\frac1{n_2}\big\|\hat\phi(X_{T+1})\hat w_{T+1}-\phi^*(X_{T+1})w_{T+1}^*\big\|^2.\end{aligned}$$

By the optimality of $\hat w_{T+1}$, $\hat\phi(X_{T+1})\hat w_{T+1}=P_{\hat\phi(X_{T+1})}(\phi^*(X_{T+1})w_{T+1}^*+z_{T+1})$, so

$$\frac1{n_2}\big\|\hat\phi(X_{T+1})\hat w_{T+1}-\phi^*(X_{T+1})w_{T+1}^*\big\|^2\lesssim\frac1{n_2}\|P^\perp_{\hat\phi(X_{T+1})}\phi^*(X_{T+1})w_{T+1}^*\|^2+\frac{\sigma^2(k+\log\frac1\delta)}{n_2},$$

using that the cross term vanishes and the $\chi^2$ tail bound for $\|P_{\hat\phi(X_{T+1})}z_{T+1}\|^2$. Taking the expectation over $w_{T+1}^*\sim\nu$, we get

$$\begin{aligned}\mathbb{E}_{w_{T+1}^*\sim\nu}\big[\mathrm{ER}(\hat\phi,\hat w_{T+1})\big]&\lesssim\frac1{n_2}\mathrm{Tr}\Big(P^\perp_{\hat\phi(X_{T+1})}\phi^*(X_{T+1})\,\mathbb{E}_{w\sim\nu}[ww^\top]\,\phi^*(X_{T+1})^\top P^\perp_{\hat\phi(X_{T+1})}\Big)+\frac{\sigma^2(k+\log\frac1\delta)}{n_2}\\&\lesssim\frac1{k\,n_2}\big\|P^\perp_{\hat\phi(X_{T+1})}\phi^*(X_{T+1})\big\|_F^2+\frac{\sigma^2(k+\log\frac1\delta)}{n_2}\\&\lesssim\frac1k\cdot\frac{\sigma^2\big(G(F_X(\Phi))^2+\log\frac1\delta\big)}{n_1\,\sigma_k^2(W^*)}+\frac{\sigma^2(k+\log\frac1\delta)}{n_2}\quad\text{(Claim B.3)}\\&\lesssim\sigma^2\,\frac{G(F_X(\Phi))^2+\log\frac1\delta}{n_1T}+\frac{\sigma^2(k+\log\frac1\delta)}{n_2},\quad\text{(using }\sigma_k^2(W^*)\gtrsim\tfrac Tk)\end{aligned}$$

finishing the proof. ∎

C PROOF OF THEOREM 6.1

C.1 PROOF SKETCH OF THEOREM 6.1

Let $R=\|\Theta^*\|_*$. Recall that $\hat B$ and $\hat W$ are derived from (12), and let $\hat\Theta:=\hat B\hat W$. We first note that the constraint set $\{\|w_i\|^2\le R/T,\ \|B\|_F^2\le R\}$ ensures $\|W\|_F^2\le R$ and $\|BW\|_*\le R$ at a global minimum. On the other hand, our constraint on $W,B$ is also expressive enough to attain any $\hat\Theta$ satisfying $\|\hat\Theta\|_*\le R$; see, e.g., Srebro and Shraibman (2005). Therefore, at a global minimum, $\|\hat W\|_F\le\sqrt R$ and $\|\hat\Theta\|_*\le R$.

For ease of exposition, we introduce the following auxiliary functions and quantities:

$$L_1(W)=\tfrac12\|\Sigma^{1/2}\Theta^*-\Sigma^{1/2}\hat BW\|_F^2,\qquad L_1^\lambda(W)=L_1(W)+\tfrac\lambda2\|W\|_F^2,\qquad W_1^\lambda\in\arg\min_W L_1^\lambda(W),$$
$$L_2(w)=\tfrac12\|\Sigma^{1/2}\theta_{T+1}^*-\Sigma^{1/2}\hat Bw\|^2,\qquad w_2^\lambda\in\arg\min_w\{L_2(w)+\tfrac\lambda2\|w\|^2\},$$
$$\hat L_1(W)=\tfrac1{2n_1}\|\mathcal{X}(\Theta^*-\hat BW)\|_F^2,\qquad\hat L_1^\lambda(W)=\hat L_1(W)+\tfrac\lambda2\|W\|_F^2,\qquad\hat W_1^\lambda\in\arg\min_W\hat L_1^\lambda(W),$$
$$\hat L_2(w)=\tfrac1{2n_2}\|X_{T+1}\theta_{T+1}^*-X_{T+1}\hat Bw\|^2,\qquad\bar w_2\in\arg\min_{\|w\|\le r}\hat L_2(w).$$

We define the terms $\epsilon_{ic,1}$ and $\epsilon_{ic,2}$, which bound the intrinsic-dimension concentration error in the input signal: with high probability,

$$\Big|\|\Sigma^{1/2}\Theta\|_F-\sqrt{\tfrac1{n_1}\textstyle\sum_{t=1}^{T}\|X_t\theta_t\|^2}\Big|\le\epsilon_{ic,1}\|\Theta\|_*,\qquad\Big|\|\Sigma^{1/2}\hat Bv\|-\sqrt{\tfrac1{n_2}\|X_{T+1}\hat Bv\|^2}\Big|\le\epsilon_{ic,2}\|v\|_2.$$

Additionally, we use $\epsilon_{ee,i}$, $i\in\{1,2\}$, to bound the estimation error (for fixed design) incurred by using the noisy labels $Y$ and $y_{T+1}$. The choices of $\epsilon_{ee,i}$ and $\epsilon_{ic,i}$ are justified in Lemma C.5, Lemma C.4, Lemma C.10 and Claim C.11, respectively, along with more detailed descriptions.

Proof of Theorem 6.1.
Proof of Theorem 6.1. We have the chain
$$\begin{aligned}
\mathbb{E}_{\theta\sim\nu}\,\mathrm{ER}(\hat B, \hat w_{T+1}) &= \mathbb{E}_{\theta\sim\nu}\, L_2(\hat w_{T+1}) \\
&\lesssim \mathbb{E}_{\theta\sim\nu}\, \hat L_2(\hat w_{T+1}) + \epsilon_{ic,2}^2 r^2 &&\text{(Claim C.11)}\\
&\lesssim \mathbb{E}_{\theta\sim\nu}\, \hat L_2(\bar w_2) + \epsilon_{ee,2}^2 r + \epsilon_{ic,2}^2 r^2 &&\text{(Lemma C.4)}\\
&\le \mathbb{E}_{\theta\sim\nu}\, \hat L_2(w_2^{\lambda}) + \epsilon_{ee,2}^2 r + \epsilon_{ic,2}^2 r^2 &&\text{(definition of } \bar w_2\text{)}\\
&\lesssim \mathbb{E}_{\theta\sim\nu}\, L_2(w_2^{\lambda}) + \epsilon_{ee,2}^2 r + \epsilon_{ic,2}^2 r^2 &&\text{(Claim C.11)}\\
&= \tfrac{1}{T}\, L_1(W_1^{\lambda}) + \epsilon_{ee,2}^2 r + \epsilon_{ic,2}^2 r^2 &&\text{(Claim C.3)}\\
&\lesssim \tfrac{\lambda R}{T} + \epsilon_{ee,2}^2 r + \epsilon_{ic,2}^2 r^2 &&\text{(Lemma C.2)}\\
&\lesssim \tfrac{\epsilon_{ee,1}^2 R + \epsilon_{ic,1}^2 R^2}{T} + \epsilon_{ee,2}^2 r + \epsilon_{ic,2}^2 r^2. &&\text{(choice of } \lambda\text{)}
\end{aligned}$$
Each step holds with probability at least $1-\delta/10$ over the randomness of $\mathcal{X}$ or $X_{T+1}$; hence, by a union bound, with probability at least $1-\delta$, plugging in the values of $\epsilon_{ic,i}$ and $\epsilon_{ee,i}$ gives
$$\mathbb{E}_{\theta\sim\nu}\,\mathrm{ER}(\hat B, \hat w_{T+1}) \lesssim \sigma R\sqrt{\frac{\mathrm{Tr}(\Sigma)}{T n_1}} + \sigma R\sqrt{\frac{\|\Sigma\|}{n_2}}.$$
Notice that a term $\|\Sigma\|/n_1$ is absorbed into $\|\Sigma\|/n_2$, since we assume $n_1 \ge n_2$. $\square$

Claim C.1 (guarantee with source regularization). For any $\lambda \ge \frac{2}{n_1}\|\mathcal{X}^*(Z)\|$, we have
$$\frac{1}{n_1}\big\|\mathcal{X}(\Theta^* - \hat\Theta)\big\|_F^2 + \lambda\|\hat\Theta\|_* \le 3\lambda\|\Theta^*\|_* \le 3\lambda R, \qquad\text{and}\qquad \|\hat B\|_F^2 \le 3R,\ \ \|\hat W\|_F^2 \le 3R.$$
Here $\mathcal{X}^*$ is the adjoint operator of $\mathcal{X}$, i.e., $\mathcal{X}^*(Z) = \sum_{t=1}^T X_t^\top z_t e_t^\top$.

Proof. By the optimality of $\hat\Theta$ we have
$$\frac{1}{2n_1}\big\|\mathcal{X}(\hat\Theta - \Theta^*) - Z\big\|_F^2 + \lambda\|\hat\Theta\|_* \le \frac{1}{2n_1}\|Z\|_F^2 + \lambda\|\Theta^*\|_*.$$
Let $\Delta = \hat\Theta - \Theta^*$. Rearranging, and using the duality between the nuclear norm and the operator norm together with $\|\Delta\|_* \le \|\Theta^*\|_* + \|\hat\Theta\|_*$,
$$\frac{1}{2n_1}\|\mathcal{X}(\Delta)\|_F^2 \le \lambda\big(\|\Theta^*\|_* - \|\hat\Theta\|_*\big) + \frac{1}{n_1}\big(\|\Theta^*\|_* + \|\hat\Theta\|_*\big)\|\mathcal{X}^*(Z)\| \le \lambda\|\Theta^*\|_* - \lambda\|\hat\Theta\|_* + \frac{\lambda}{2}\|\Theta^*\|_* + \frac{\lambda}{2}\|\hat\Theta\|_*.$$
Therefore $\frac{1}{2n_1}\|\mathcal{X}(\Delta)\|_F^2 + \frac{\lambda}{2}\|\hat\Theta\|_* \le \frac{3\lambda}{2}\|\Theta^*\|_*$, and clearly both terms satisfy $\frac{1}{n_1}\|\mathcal{X}(\Delta)\|_F^2 \le 3\lambda\|\Theta^*\|_*$ and $\|\hat\Theta\|_* \le 3\|\Theta^*\|_*$. $\square$

Lemma C.2 (source task concentration). For a fixed $\delta > 0$, let $\lambda = \epsilon_{ee,1}^2 + \epsilon_{ic,1}^2 R$. Then, with probability at least $1-\delta/10$, $L_1^{\lambda}(W_1^{\lambda}) \lesssim \lambda R$ and $\|W_1^{\lambda}\|_F^2 \lesssim R$.

Proof of Lemma C.2. Since $\|W_1^{\lambda}\|_F^2 \le \frac{2}{\lambda} L_1^{\lambda}(W_1^{\lambda})$, it suffices to bound $L_1^{\lambda}(W_1^{\lambda})$. By the definition of $W_1^{\lambda}$,
$$\begin{aligned}
L_1^{\lambda}(W_1^{\lambda}) \le L_1^{\lambda}(\hat W) &= \big\|\Sigma^{1/2}(\Theta^* - \hat\Theta)\big\|_F^2 + \tfrac{\lambda}{2}\|\hat W\|_F^2 \\
&\le \Big(\tfrac{1}{\sqrt{n_1}}\big\|\mathcal{X}(\Theta^* - \hat\Theta)\big\|_F + O(\epsilon_{ic,1})R\Big)^2 + \tfrac{\lambda}{2}\|\hat W\|_F^2 &&\text{(Lemma C.10)}\\
&\lesssim \tfrac{1}{n_1}\big\|\mathcal{X}(\Theta^* - \hat\Theta)\big\|_F^2 + \tfrac{\lambda}{2}\|\hat W\|_F^2 + O(\epsilon_{ic,1}^2 R^2) \\
&\lesssim \lambda R + O(\epsilon_{ic,1}^2 R^2) \lesssim \lambda R, &&\text{(Claim C.1, choice of }\lambda\text{)}
\end{aligned}$$
so both results have been shown. $\square$

Claim C.3 (source and target connections). $\mathbb{E}_{\theta\sim\nu}\, L_2(w_2^{\lambda}) = \frac{1}{T}\, L_1(W_1^{\lambda})$.

Proof of Claim C.3. The ridge solutions have the closed forms $w_2^{\lambda} = (\hat B^\top\Sigma\hat B + \lambda I)^{-1}\hat B^\top\Sigma\,\theta^*_{T+1} =: S_{\lambda}\theta^*_{T+1}$ and, column by column, $W_1^{\lambda} = S_{\lambda}\Theta^*$, where $S_{\lambda} := (\hat B^\top\Sigma\hat B + \lambda I)^{-1}\hat B^\top\Sigma$. Therefore
$$\mathbb{E}_{\theta\sim\nu}\, L_2(w_2^{\lambda}) = \mathbb{E}_{\theta\sim\nu}\big\|\Sigma^{1/2}(I - \hat B S_{\lambda})\theta^*_{T+1}\big\|^2 = \frac{1}{T}\big\|\Sigma^{1/2}(I - \hat B S_{\lambda})\Theta^*\big\|_F^2 = \frac{1}{T}\, L_1(W_1^{\lambda}),$$
where the middle equality uses the assumption on $\nu$ in the setting of Theorem 6.1. $\square$

Lemma C.4 (estimation error for the target task). With probability at least $1-\delta/10$,
$$\hat L_2(\hat w) - \hat L_2(\bar w) \lesssim R\sqrt{\frac{\|\Sigma\|}{T n_2}}\;\sigma\Big(\log\frac1\delta\Big)^{3/2}\log(n_2) =: \epsilon_{ee,2}^2\, r.$$

Proof of Lemma C.4. With the definition of $\hat w$, we write the basic inequality
$$\frac{1}{2n_2}\big\|X_{T+1}(\hat B\hat w - \theta^*) - z_{T+1}\big\|^2 \le \frac{1}{2n_2}\big\|X_{T+1}(\hat B\bar w - \theta^*) - z_{T+1}\big\|^2.$$
Rearranging, and using $\|\hat w - \bar w\| \le 2\sqrt{R/T}$,
$$\frac{1}{2n_2}\big\|X_{T+1}\hat B(\hat w - \bar w)\big\|^2 \le \frac{1}{n_2}\big\langle \hat w - \bar w,\ \hat B^\top X_{T+1}^\top z_{T+1}\big\rangle \lesssim \frac{\sqrt{R/T}}{n_2}\big\|\hat B^\top X_{T+1}^\top z_{T+1}\big\| \lesssim R\sqrt{\frac{\|\Sigma\|}{T n_2}}\;\sigma\Big(\log\frac1\delta\Big)^{3/2}\log(n_2) = O(r\,\epsilon_{ee,2}^2),$$
where the last inequality uses Claim C.6. $\square$
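The estimator analyzed in Claim C.1 above is a trace-norm-regularized least-squares program. As a purely illustrative instantiation (ours, not the paper's algorithm; the step size, iteration count, and problem sizes are our assumptions), the sketch below solves $\min_{\Theta}\frac{1}{2n_1}\sum_t\|X_t\theta_t - y_t\|^2 + \lambda\|\Theta\|_*$ by proximal gradient descent, where singular-value thresholding is the proximal operator of the nuclear norm.

```python
import numpy as np

def svt(M, tau):
    """Prox of tau * ||.||_*: soft-threshold the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def solve_trace_norm(Xs, ys, lam, lr=0.01, iters=2000):
    T, (n1, d) = len(Xs), Xs[0].shape
    Theta = np.zeros((d, T))
    for _ in range(iters):
        # Gradient of the smooth part, column by column.
        grad = np.stack([Xs[t].T @ (Xs[t] @ Theta[:, t] - ys[t]) / n1
                         for t in range(T)], axis=1)
        Theta = svt(Theta - lr * grad, lr * lam)   # ISTA step
    return Theta

rng = np.random.default_rng(2)
d, k, T, n1 = 20, 3, 10, 50
Theta_star = rng.normal(size=(d, k)) @ rng.normal(size=(k, T))  # rank-k truth
Xs = [rng.normal(size=(n1, d)) for _ in range(T)]
ys = [Xs[t] @ Theta_star[:, t] + 0.1 * rng.normal(size=n1) for t in range(T)]

Theta_hat = solve_trace_norm(Xs, ys, lam=0.05)
print(np.linalg.matrix_rank(Theta_hat, tol=1e-3))  # typically close to k
```

The thresholding step is what enforces the low-rank (shared-representation) structure that the analysis above exploits.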
C.2 TECHNICAL LEMMAS

This section collects the technical ingredients used above: bounding the noise term arising from the basic inequality, and intrinsic-dimension concentration for both the source and the target tasks.

Lemma C.5 (regularizer estimation). Let $X_t \in \mathbb{R}^{n\times d}$ ($t\in[T]$) have rows drawn from a distribution $p$ with covariance matrix $\Sigma$, and let $Z = [z_1,\dots,z_T]$ with $z_t \sim \mathcal{N}(0, \sigma^2 I_n)$. Then with probability at least $1-\delta$,
$$\epsilon_{ee,1}^2 := \frac{1}{n}\big\|\mathcal{X}^*(Z)\big\|_2 \lesssim \frac{\sigma}{\sqrt n}\Big(\log\frac{1}{\delta}\Big)^{3/2}\log(T+n)\,\sqrt{T\|\Sigma\| + \mathrm{Tr}(\Sigma)}.$$

Proof. We use matrix Bernstein with intrinsic dimension (see Theorem 7.3.1 in Tropp (2015)). Write
$$A = \frac{1}{n}\mathcal{X}^*(Z) = \frac{1}{n}\sum_{t=1}^T X_t^\top z_t e_t^\top =: \sum_{t=1}^T S_t.$$
Then
$$\mathbb{E}_{X,Z}[AA^\top] = \frac{1}{n^2}\sum_{t=1}^T \mathbb{E}_X\big[X_t^\top\,\mathbb{E}_Z[z_tz_t^\top]\,X_t\big] = \frac{\sigma^2 T}{n}\,\Sigma, \qquad \mathbb{E}_{X,Z}[A^\top A] = \frac{1}{n^2}\sum_{t=1}^T \mathbb{E}\big[z_t^\top X_tX_t^\top z_t\big]\,e_te_t^\top = \frac{\sigma^2\,\mathrm{Tr}(\Sigma)}{n}\,I_T.$$
Therefore the matrix variance statistic of the sum satisfies $v(A) = \frac{\sigma^2}{n}\max\{T\|\Sigma\|,\ \mathrm{Tr}(\Sigma)\}$. Denote $V = \frac{\sigma^2}{n}\,\mathrm{diag}\big(T\Sigma,\ \mathrm{Tr}(\Sigma)I\big)$ and its intrinsic dimension $d_\Sigma = \mathrm{Tr}(V)/\|V\|$; since $\mathrm{Tr}(V) \le \frac{\sigma^2}{n}(T+n)\mathrm{Tr}(\Sigma)$ and $\|V\| \ge \frac{\sigma^2}{n}\mathrm{Tr}(\Sigma)$, we get $d_\Sigma \le T+n$. Finally, from the Hanson-Wright inequality, each summand is bounded with probability at least $1-\delta$: $\|S_t\| = \frac{1}{n}\|X_t^\top z_t\|_2$ with
$$\|X_t^\top z_t\|_2^2 \lesssim \sigma^2\,\mathrm{Tr}(\Sigma) + \sigma^2\|\Sigma\|\log\tfrac1\delta + \sigma^2\|\Sigma\|_F\sqrt{\log\tfrac1\delta},$$
so, using $\|\Sigma\|_F \le \mathrm{Tr}(\Sigma)$,
$$L := \max_t\|S_t\| \lesssim \frac{\sigma}{n}\sqrt{\Big(1+\sqrt{\log\tfrac1\delta}\Big)\mathrm{Tr}(\Sigma) + \|\Sigma\|\log\tfrac1\delta}.$$
Then, from matrix Bernstein with intrinsic dimension (Theorem 7.3.1 in Tropp (2015)), with probability at least $1-\delta$,
$$\|A\| \le O\Big(\sqrt{\log\tfrac1\delta}\,\sqrt{v(A)\log(d_\Sigma)} + \log\tfrac1\delta\; L\log(d_\Sigma)\Big) \lesssim \frac{\sigma}{\sqrt n}\Big(\log\frac1\delta\Big)^{3/2}\log(T+n)\,\sqrt{T\|\Sigma\|+\mathrm{Tr}(\Sigma)},$$
which proves the lemma. $\square$

Claim C.6 (target noise concentration). For a fixed $\delta > 0$, with probability at least $1-\delta/10$,
$$\epsilon_{ee,2}^2 := \frac{1}{n_2}\big\|\hat B^\top X_{T+1}^\top z\big\| \lesssim \frac{\sigma}{\sqrt{n_2}}\Big(\log\frac1\delta\Big)^{3/2}\log(n_2)\,\sqrt{\mathrm{Tr}(\hat B^\top\Sigma\hat B)} \lesssim \frac{\sigma}{\sqrt{n_2}}\Big(\log\frac1\delta\Big)^{3/2}\log(n_2)\,\sqrt{2R\|\Sigma\|}.$$

Proof. The first inequality follows directly from Lemma C.5 (applied with a single task and features $\hat B^\top x$). Meanwhile,
$$\mathrm{Tr}(\hat B^\top\Sigma\hat B) = \big\langle\Sigma,\ \hat B\hat B^\top\big\rangle \le \|\Sigma\|_2\,\|\hat B\hat B^\top\|_* \le 2R\|\Sigma\|_2.$$
This finishes the proof. $\square$

Definition C.7. The sub-gaussian norm of a random vector $y$ is defined as
$$\|y\|_{\psi_2} := \sup_{x\in\mathcal{S}^{n-1}}\big\|\langle y, x\rangle\big\|_{\psi_2}, \qquad (34)$$
where $\mathcal{S}^{n-1}$ denotes the unit Euclidean sphere in $\mathbb{R}^n$.

Definition C.8. Let $\mathcal{T}\subset\mathbb{R}^d$ be a bounded set, and let $g\sim\mathcal{N}(0, I_d)$ be a standard normal random vector in $\mathbb{R}^d$. Then the quantities
$$w(\mathcal{T}) := \mathbb{E}\sup_{x\in\mathcal{T}}\langle g, x\rangle \qquad\text{and}\qquad \gamma(\mathcal{T}) := \mathbb{E}\sup_{x\in\mathcal{T}}\big|\langle g, x\rangle\big| \qquad (35)$$
are called the Gaussian width of $\mathcal{T}$ and the Gaussian complexity of $\mathcal{T}$, respectively.

Theorem C.9 (matrix deviation inequality, restated from Vershynin (2017)). Let $A$ be an $m\times n$ matrix whose rows $a_i$ are independent, isotropic, sub-gaussian random vectors in $\mathbb{R}^n$, and let $\mathcal{T}\subset\mathbb{R}^n$ be a fixed bounded set. Then
$$\mathbb{E}\sup_{x\in\mathcal{T}}\Big|\,\|Ax\|_2 - \sqrt m\,\|x\|_2\,\Big| \le C\rho^2\gamma(\mathcal{T}), \qquad (36)$$
where $\rho = \max_i\|a_i\|_{\psi_2}$ is the maximal sub-gaussian norm of the rows of $A$. A high-probability version also holds: with probability at least $1-\delta$,
$$\sup_{x\in\mathcal{T}}\Big|\,\|Ax\|_2 - \sqrt m\,\|x\|_2\,\Big| \le C\rho^2\big[\gamma(\mathcal{T}) + \sqrt{\log(2/\delta)}\,r(\mathcal{T})\big], \qquad (37)$$
where the radius $r(\mathcal{T}) := \sup_{x\in\mathcal{T}}\|x\|_2$.

Lemma C.10 (intrinsic-dimension concentration). Let $X, X_t$ ($t\in[T]$) be $n\times d$ matrices whose rows are independent sub-gaussian random vectors in $\mathbb{R}^d$ satisfying Assumption 4.1, with $\mathbb{E}[x] = 0$, $\mathbb{E}[xx^\top] = \Sigma$, and whose whitened distribution has sub-gaussian norm $C_1\rho$. For a fixed $\delta > 0$ and any $v\in\mathbb{R}^d$, we have
$$\big\|\Sigma^{1/2}v\big\|_2 \le \frac{1}{\sqrt n}\|Xv\|_2 + C\rho^2\sqrt{\frac{\log(2/\delta)\,\|\Sigma\|}{n}}\;\|v\|_2.$$
Moreover, for any $\Theta\in\mathbb{R}^{d\times T}$, with probability at least $1-\delta$,
$$\big\|\Sigma^{1/2}\Theta\big\|_F \le \sqrt{\frac1n\sum_{t=1}^T\|X_t\theta_t\|_2^2} + \epsilon_{ic,1}\,\|\Theta\|_*, \qquad (38)$$
where $\epsilon_{ic,1} := 2C\rho^2\sqrt{\log(2/\delta)\,\|\Sigma\|/n}$.

Proof. We use Theorem C.9. Write $x = \Sigma^{1/2}z$ and $X = Z\Sigma^{1/2}$, where $Z$ has isotropic sub-gaussian rows, and apply (37) to $Z$ with $\mathcal{T} = \Sigma^{1/2}\mathcal{S}^{d-1}$, for which $\gamma(\mathcal{T}) \le \sqrt{\mathrm{Tr}(\Sigma)}$ and $r(\mathcal{T}) = \|\Sigma\|^{1/2}$. With probability at least $1-\delta$, simultaneously for all unit vectors $v$,
$$\Big|\,\|Xv\|_2 - \sqrt n\,\big\|\Sigma^{1/2}v\big\|_2\,\Big| \le C\rho^2\Big(\sqrt{\mathrm{Tr}(\Sigma)} + \sqrt{\log(2/\delta)\,\|\Sigma\|}\Big).$$
Dividing by $\sqrt n$ and using homogeneity in $v$, the same bound holds for arbitrary $v$ with an extra factor $\|v\|_2$. Notice that when $n \ge C^2\rho^4\big(\mathrm{Tr}(\Sigma) + \|\Sigma\|\log(1/\delta)\big)$, as in our setting, the Gaussian-width term is of lower order and can be absorbed, giving the first claim. For (38), on the same event, applying the per-direction bound to each column $\theta_t$ gives
$$\frac1n\sum_{t=1}^T\|X_t\theta_t\|_2^2 \ge \sum_{t=1}^T\Big(\big\|\Sigma^{1/2}\theta_t\big\|_2 - \tfrac{\epsilon_{ic,1}}{2}\|\theta_t\|_2\Big)^2 \ge \big\|\Sigma^{1/2}\Theta\big\|_F^2 - \epsilon_{ic,1}\sum_{t=1}^T\|\theta_t\|_2\,\big\|\Sigma^{1/2}\theta_t\big\|_2 \ge \big\|\Sigma^{1/2}\Theta\big\|_F^2 - \epsilon_{ic,1}\,\|\Theta\|_*\,\big\|\Sigma^{1/2}\Theta\big\|_F,$$
where the last step uses Cauchy-Schwarz and $\|\Theta\|_F \le \|\Theta\|_*$. Solving the resulting quadratic inequality for $\|\Sigma^{1/2}\Theta\|_F$ yields (38). $\square$

Claim C.11. Let $X$ be an $n_2\times d$ matrix whose rows are independent sub-gaussian random vectors in $\mathbb{R}^d$ satisfying Assumption 4.1, with $\mathbb{E}[x]=0$ and $\mathbb{E}[xx^\top]=\Sigma$. Let $\Sigma_B = B^\top\Sigma B$ for a matrix $B$ satisfying $\|BB^\top\|_* \le 2R$. Then for a fixed $\delta>0$ and any $v$, we have
$$\big\|\Sigma_{\hat B}^{1/2}v\big\| \le \frac{1}{\sqrt{n_2}}\big\|X\hat Bv\big\| + \epsilon_{ic,2}\|v\|, \qquad \epsilon_{ic,2} := \frac{C\rho^2}{\sqrt{n_2}}\sqrt{R\,\|\Sigma\|\log(1/\delta)},$$
where $C$ is a universal constant.

Proof. This follows directly from Lemma C.10 with $X$ replaced by $X\hat B$: the whitened distribution of $\hat B^\top x$ has the same sub-gaussian norm $C_1\rho$. Therefore
$$\big\|\Sigma_{\hat B}^{1/2}v\big\|_2 \le \frac{1}{\sqrt{n_2}}\big\|X\hat Bv\big\|_2 + S\,\|v\|_2, \qquad S = \frac{C\rho^2}{\sqrt{n_2}}\sqrt{\log(2/\delta)\,\big\|\Sigma_{\hat B}\big\|} \le \frac{C'\rho^2}{\sqrt{n_2}}\sqrt{\|\Sigma\|\,R\log(1/\delta)} =: \epsilon_{ic,2}. \qquad\square$$
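A quick simulation (ours; Gaussian rows as a special case of sub-gaussian, an arbitrary covariance spectrum, and 200 random test directions are all illustrative choices) shows the deviation behavior behind Lemma C.10: the empirical norm $\|Xv\|_2/\sqrt n$ tracks $\|\Sigma^{1/2}v\|_2$, with error decaying roughly like $1/\sqrt n$.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 30
U = np.linalg.qr(rng.normal(size=(d, d)))[0]
Sigma = U @ np.diag(np.linspace(0.1, 2.0, d)) @ U.T
L = np.linalg.cholesky(Sigma)          # Sigma = L L^T

for n in [100, 1000, 10000]:
    X = rng.normal(size=(n, d)) @ L.T  # rows with covariance Sigma
    errs = []
    for _ in range(200):               # random directions v
        v = rng.normal(size=d)
        errs.append(abs(np.linalg.norm(X @ v) / np.sqrt(n)
                        - np.linalg.norm(L.T @ v)))   # ||L^T v|| = ||Sigma^{1/2} v||
    print(n, max(errs))                # shrinks roughly like 1/sqrt(n)
```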
D PROOF OF THEOREM 7.1

First, we describe a standard lifting of neural networks to infinite-dimensional linear regression (Wei et al., 2019; Rosset et al., 2007; Bengio et al., 2006). Define the infinite feature vector with coordinates $\phi(x)_b = (b^\top x)_+$ for every $b\in\mathcal{S}^{d_0-1}$. Let $\alpha_t$ be a signed measure on $\mathcal{S}^{d_0-1}$; the inner-product notation denotes integration, $\alpha_t^\top\phi(x) \equiv \int_{\mathcal{S}^{d_0-1}} \phi(x)_b\, d\alpha_t(b)$. The $t$-th output of the infinite-width neural network is $f_{\alpha_t}(x) = \langle\alpha_t, \phi(x)\rangle$. Consider the least-squares problem
$$\min_{\substack{\alpha_1,\dots,\alpha_T:\ |\mathrm{supp}(\alpha_t)|\le d,\\ \|\alpha\|_{2,1}\le R}}\ \sum_{i,t}\big(y_{it} - \alpha_t^\top\phi(x_i)\big)^2, \qquad (39)$$
where $\alpha(u) = [\alpha_1(u),\dots,\alpha_T(u)]$ and $\|\alpha\|_{2,1} = \int_{\mathcal{S}^{d_0-1}} \|\alpha(b)\|_2\, d(b)$. The regularizer corresponds to a group $\ell_1$ regularizer on the vector measure $\alpha$.

Proposition D.1. Let $\gamma_d$ be the value of Equation (17) when the network has $d$ neurons, and let $\gamma_d'$ be the value of Equation (39). Then
$$\gamma_d = \gamma_d'. \qquad (40)$$

Proof. Let $B, W$ be solutions to Equation (17). Let $\bar B = D_\beta^{-1}B$, where $D_\beta$ is the diagonal matrix whose entries are $\beta_j = \|B^\top e_j\|_2$ (the row norms of $B$). The network satisfies $f_{B,W}(x) = W^\top D_\beta(\bar B x)_+$ and $\|\beta\|_2 = \|B\|_F$.

We first show that $\gamma_d' \le \gamma_d$. Define $\alpha_t$ supported on the normalized rows $\bar b_j$ of $\bar B$ with $\alpha_t(\bar b_j) = W_{tj}\beta_j$. By positive homogeneity of the ReLU, we verify
$$\sum_{j=1}^d \alpha_t(\bar b_j)\,\phi(x)_{\bar b_j} = \sum_{j=1}^d W_{tj}\beta_j(\bar b_j^\top x)_+ = w_t^\top(Bx)_+ = f_{B,w_t}(x).$$
Due to the regularizer, and using the AM-GM inequality, at optimality $\beta_j = \|w_j\|_2$, where $w_j$ is the $j$-th row vector of $W$. Next, we verify that the two regularizer values are the same:
$$\|\alpha\|_{2,1} = \sum_{j=1}^d \beta_j\|w_j\|_2 = \sum_{j=1}^d \big(\beta_j^2/2 + \|w_j\|_2^2/2\big) = \tfrac12\|B\|_F^2 + \tfrac12\|W\|_F^2.$$
Thus the network given by $\alpha_t^\top\phi(x)$ has the same network outputs and the same regularizer value, so $\gamma_d' \le \gamma_d$.

Finally, we show that $\gamma_d \le \gamma_d'$. Let $\bar b_j$, $j\in[d]$, be the support of the optimal measure of (39). Define $\beta_j = \sqrt{\|\alpha(\bar b_j)\|_2}$, $B = D_\beta\bar B$ where $\bar B$ is the matrix whose rows are the $\bar b_j$, and $W$ such that $W_{jt} = \alpha_t(\bar b_j)/\beta_j$. We verify that the network values agree:
$$e_t^\top f_{B,W}(x) = e_t^\top W^\top D_\beta(\bar Bx)_+ = \sum_j W_{jt}\beta_j(\bar b_j^\top x)_+ = \alpha_t^\top\phi(x).$$
Finally, by our construction, $\beta_j = \|We_j\|_2$, so the regularizer values agree. Thus $\gamma_d = \gamma_d'$. $\square$

Finally, we note that the regularizer can be expressed in the variational form
$$\|\alpha\|_{2,1} = \min_{b,W:\ \alpha_t(b)=\beta(b)w_t(b)}\ \tfrac12\|\beta\|_2^2 + \tfrac12\|W\|_F^2,$$
where $\|\beta\|_2^2 = \int \beta(b)^2\, d(b)$ and $\|W\|_F^2 = \sum_t\int w_t(b)^2\, d(b)$. (Informally, if $\alpha\in\mathbb{R}^{D\times T}$ with $D$ potentially infinite, then $\|\alpha\|_{2,1} = \min_{\alpha=\mathrm{diag}(b)W}\ \frac12\|b\|_2^2 + \frac12\|W\|_F^2$.) With these in place, Equation (39) can be expressed as Equation (12), with $B$ constrained to be a diagonal operator and with the lifted features $\phi(x_{it})$ in place of $x_{it}$.

Proof of Theorem 7.1. The global minimizer of Equation (39) with $d = \infty$ may have infinite support, so the corresponding value may not be achieved by minimizing (17). However, Theorem 6.1 only requires that we obtain a learner network whose regularized loss is no larger than the regularized loss of the teacher network. Since the teacher network has $d$ neurons, this value is attainable by (17); thus the finite-size network does not need to attain the global minimum of (39) for Claim C.1 to apply.

Since Theorem 6.1 has no dependence (even in the logarithmic terms) on the input dimension of the data, it can be applied when the input features are the infinite-dimensional feature vectors $\phi(x)$. The only part of the proof of Theorem 6.1 specific to the nuclear norm is that its dual norm is the operator norm. In Lemma C.5 we had an upper bound on $\frac1n\|\mathcal{X}^*(Z)\|_2$. Since we use the $(2,1)$-norm here, we must instead upper bound $\frac1n\|\mathcal{X}^*(Z)\|_{2,\infty}$, the dual of the $(2,1)$-norm. Note that $\|A\|_{2,\infty} \le \|A\|_2$, so the upper bound in Lemma C.5 still applies. Thus, Theorem 7.1 follows from Theorem 6.1. $\square$
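The AM-GM balancing step used in Proposition D.1 above is easy to check numerically. The sketch below (ours; all sizes are arbitrary choices) rescales each neuron by $c_j > 0$, which leaves the ReLU network unchanged by positive homogeneity, and verifies that $(\|B\|_F^2+\|W\|_F^2)/2$ is minimized exactly at the balanced scaling $\beta_j = \|w_j\|_2$, where it equals $\sum_j \|b_j\|_2\|w_j\|_2$.

```python
import numpy as np

rng = np.random.default_rng(4)
d0, width, T = 5, 8, 3
B = rng.normal(size=(width, d0))   # rows b_j: first-layer weights
W = rng.normal(size=(width, T))    # rows w_j: second-layer weights (one column per task)

def net(Bm, Wm, x):
    """Two-layer ReLU network W^T (B x)_+."""
    return Wm.T @ np.maximum(Bm @ x, 0.0)

# Per-neuron rescaling b_j -> b_j / c_j, w_j -> c_j * w_j preserves the function.
c = np.exp(rng.normal(size=width))
x = rng.normal(size=d0)
print(np.allclose(net(B, W, x), net(B / c[:, None], W * c[:, None], x)))  # True

# AM-GM: (||b_j||^2 / c_j^2 + c_j^2 ||w_j||^2) / 2 >= ||b_j|| * ||w_j||,
# with equality at c_j^2 = ||b_j|| / ||w_j||.
balanced = np.sum(np.linalg.norm(B, axis=1) * np.linalg.norm(W, axis=1))
c_opt = np.sqrt(np.linalg.norm(B, axis=1) / np.linalg.norm(W, axis=1))
val = (np.linalg.norm(B / c_opt[:, None])**2
       + np.linalg.norm(W * c_opt[:, None])**2) / 2
print(np.isclose(val, balanced))   # True: balanced regularizer = sum_j ||b_j|| ||w_j||
```

This invariance is why the Frobenius-norm regularizer on $(B, W)$ and the $(2,1)$-norm on the lifted measure define the same objective value.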
Proof of (21). The test error of (20) is bounded by
$$\mathbb{E}\big[\mathrm{ER}(f_{\hat B,\hat w})\big] \lesssim \sigma\cdot\frac{1}{2n}\Big(\big\|B^*_{T+1}\big\|^2 + \big\|w^*_{T+1}\big\|_2^2\Big)\cdot \mathbb{E}_{x_i\overset{\mathrm{iid}}{\sim}p,\ z\sim\mathcal{N}(0,I)}\Big[\big\|\Phi(X)^\top z\big\|\Big], \qquad (41)$$
via the basic inequality (c.f. the proofs of Lemmas C.2 and C.4). By the matrix Bernstein inequality (c.f. Lemma C.5, or Wei et al. (2019)),
$$\mathbb{E}_{x_i\overset{\mathrm{iid}}{\sim}p,\ z\sim\mathcal{N}(0,I)}\Big[\big\|\Phi(X)^\top z\big\|\Big] \lesssim \sqrt{n\,\mathrm{tr}(\Sigma)}.$$
When $B^*_{T+1}, w^*_{T+1}$ are sampled from the same distribution as the source tasks, we have $\frac12\big(\|B^*_{T+1}\|^2 + \|w^*_{T+1}\|_2^2\big) \le \frac{R}{T}$. Thus we conclude
$$\mathbb{E}\big[\mathrm{ER}(f_{\hat B,\hat w})\big] \lesssim \sigma\,\frac{R}{T}\sqrt{\frac{\mathrm{tr}(\Sigma)}{n}}. \qquad\square$$
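As a sanity check on the scaling $\mathbb{E}\|\Phi(X)^\top z\| \lesssim \sqrt{n\,\mathrm{tr}(\Sigma)}$ used above (which already follows from Jensen's inequality, since $\mathbb{E}\|\Phi(X)^\top z\|^2 = n\,\mathrm{tr}(\Sigma)$ for $z\sim\mathcal{N}(0,I_n)$), here is a small Monte Carlo of ours using a finite sample of the ReLU features $\phi(x)_b = (b^\top x)_+$; the width $m$ and all sizes are our choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(5)
d0, m, n, trials = 5, 400, 200, 100

B = rng.normal(size=(m, d0))
B /= np.linalg.norm(B, axis=1, keepdims=True)   # b_j uniform on the sphere S^{d0-1}

X = rng.normal(size=(n, d0))
Phi = np.maximum(X @ B.T, 0.0) / np.sqrt(m)     # Monte Carlo ReLU feature map
Sigma_tr = np.trace(Phi.T @ Phi) / n            # empirical Tr(Sigma)

norms = [np.linalg.norm(Phi.T @ rng.normal(size=n)) for _ in range(trials)]
print(np.mean(norms), np.sqrt(n * Sigma_tr))    # mean stays below the sqrt bound
```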