# Learning Bound for Parameter Transfer Learning

Wataru Kumagai
Faculty of Engineering, Kanagawa University
kumagai@kanagawa-u.ac.jp

## Abstract

We consider a transfer-learning problem in the parameter transfer approach, where a suitable parameter of a feature mapping is learned through one task and applied to another objective task. We introduce the notions of local stability and parameter transfer learnability of parametric feature mappings, and thereby derive a learning bound for parameter transfer algorithms. As an application of parameter transfer learning, we discuss the performance of sparse coding in self-taught learning. Although self-taught learning algorithms with plentiful unlabeled data often show excellent empirical performance, their theoretical analysis has been lacking. In this paper, we also provide the first theoretical learning bound for self-taught learning.

## 1 Introduction

In traditional machine learning, it is assumed that all data are drawn from a single distribution. However, this assumption does not always hold in real-world applications, and it is therefore important to develop methods capable of incorporating samples drawn from different distributions. Transfer learning provides a general way to accommodate such situations. In transfer learning, besides relatively few samples related to an objective task, abundant samples in other domains, which are not necessarily drawn from an identical distribution, are available. Transfer learning then aims at extracting useful knowledge from data in the other domains and applying that knowledge to improve the performance of the objective task. According to the kind of knowledge that is transferred, approaches to transfer-learning problems can be classified into instance transfer, feature representation transfer, and parameter transfer (Pan and Yang (2010)). In this paper, we consider the parameter transfer approach, where some kind of parametric model is supposed and the transferred knowledge is encoded into parameters. Since the parameter transfer approach typically requires many samples to accurately learn a suitable parameter, unsupervised methods are often utilized for the learning process. In particular, transfer learning from unlabeled data for predictive tasks is known as self-taught learning (Raina et al. (2007)), where a joint generative model is not assumed to underlie the unlabeled samples even though the unlabeled samples should be indicative of a structure that would subsequently be helpful in predictive tasks. In recent years, self-taught learning has been intensively studied, encouraged by the development of strong unsupervised methods. Furthermore, sparsity-based methods such as sparse coding or sparse neural networks have often been used in empirical studies of self-taught learning.

Although many algorithms based on the parameter transfer approach have empirically demonstrated impressive performance in self-taught learning, some fundamental problems remain. First, the theoretical aspects of the parameter transfer approach have not been studied, and in particular, no learning bound has been obtained. Second, although it is believed that a large amount of unlabeled data helps to improve the performance of the objective task in self-taught learning, it has not been sufficiently clarified how many samples are required.
Third, although sparsity-based methods are typically employed in self-taught learning, it is unknown how the sparsity works to guarantee the performance of self-taught learning.

The aim of this paper is to shed light on the above problems. We first consider a general model of parametric feature mapping in the parameter transfer approach. We then formulate the local stability of a parametric feature mapping and the parameter transfer learnability for this mapping, and provide a theoretical learning bound for parameter transfer learning algorithms based on these notions. Next, we consider the stability of sparse coding and discuss the parameter transfer learnability of dictionary learning under a sparse model. Applying the learning bound for parameter transfer learning algorithms, we provide a learning bound for the sparse coding algorithm in self-taught learning.

This paper is organized as follows. In the remainder of this section, we refer to related studies. In Section 2, we formulate the stability and the parameter transfer learnability of a parametric feature mapping and present a learning bound for parameter transfer learning. In Section 3, we show the stability of sparse coding under perturbation of the dictionary. Then, by imposing sparsity assumptions on the samples and considering dictionary learning, we derive the parameter transfer learnability for sparse coding. In particular, a learning bound is obtained for sparse coding in the setting of self-taught learning. Section 4 concludes the paper.

### 1.1 Related Works

Approaches to transfer learning can be classified into several cases based on the kind of knowledge being transferred (Pan and Yang (2010)). In this paper, we consider the parameter transfer approach. This approach can be applied to various notable algorithms such as sparse coding, multiple kernel learning, and deep learning, since the dictionary, the weights on kernels, and the weights of the neural network are regarded as parameters, respectively. Those parameters are typically trained or tuned on samples that are not necessarily drawn from the target region.

In the parameter transfer setting, a large number of samples in the source region is often needed to accurately estimate the parameter to be transferred. Thus, it is desirable to be able to use unlabeled samples in the source region. Self-taught learning corresponds to the case where only unlabeled samples are given in the source region while labeled samples are available in the target domain. In this sense, self-taught learning is compatible with the parameter transfer approach. Indeed, in Raina et al. (2007), where self-taught learning was first introduced, a sparse coding-based method is employed and the parameter transfer approach is already used, regarding the dictionary learnt from images as the parameter to be transferred. Although self-taught learning has been studied in various contexts (Dai et al. (2008); Lee et al. (2009); Wang et al. (2013); Zhu et al. (2013)), its theoretical aspects have not been sufficiently analyzed. One of the main results of this paper is the first theoretical learning bound for self-taught learning with the parameter transfer approach. We note that our setting differs from the environment-based setting (Baxter (2000); Maurer (2009)), where a distribution over distributions of labeled samples, known as an environment, is assumed.
In our formulation, the existence of an environment is not assumed and labeled data in the source region are not required.

Self-taught learning algorithms are often based on sparse coding. In the seminal paper, Raina et al. (2007) already proposed an algorithm that learns a dictionary in the source region and transfers it to the target region, and showed the effectiveness of the sparse coding-based method. Moreover, since remarkable progress has been made in unsupervised learning based on sparse neural networks (Coates et al. (2011); Le (2013)), unlabeled samples of the source domain in self-taught learning are often preprocessed by sparsity-based methods. Recently, sparse coding-based generalization bounds have been studied (Mehta and Gray (2013); Maurer et al. (2012)), and the analysis in Section 3.1 is based on Mehta and Gray (2013).

## 2 Learning Bound for Parameter Transfer Learning

### 2.1 Problem Setting of Parameter Transfer Learning

We formulate parameter transfer learning in this subsection. We first briefly introduce the notation and terminology of transfer learning (Pan and Yang (2010)). Let $\mathcal{X}$ and $\mathcal{Y}$ be a sample space and a label space, respectively. We refer to the pair of $\mathcal{Z} := \mathcal{X}\times\mathcal{Y}$ and a joint distribution $P(x,y)$ on $\mathcal{Z}$ as a region. A domain is the pair consisting of the sample space $\mathcal{X}$ and the marginal distribution $P(x)$ on $\mathcal{X}$, and a task is the pair consisting of the label space $\mathcal{Y}$ and the conditional distribution $P(y|x)$. In addition, let $\mathcal{H} = \{h:\mathcal{X}\to\mathcal{Y}\}$ be a hypothesis space and let $\ell:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_{\ge 0}$ be a loss function. Then the expected risk and the empirical risk are defined by
$$R(h) := \mathbb{E}_{(x,y)\sim P}\big[\ell(y, h(x))\big] \quad\text{and}\quad \widehat{R}_n(h) := \frac{1}{n}\sum_{j=1}^{n}\ell(y_j, h(x_j)),$$
respectively. In the setting of transfer learning, besides samples from a region of interest known as the target region, it is assumed that samples from another region known as the source region are also available. We distinguish between the target and source regions by adding a subscript $T$ or $S$ to each notation introduced above (e.g., $P_T$, $R_S$). The homogeneous setting (i.e., $\mathcal{X}_S = \mathcal{X}_T$) is not assumed in general, and thus the heterogeneous setting (i.e., $\mathcal{X}_S \neq \mathcal{X}_T$) can be treated. We note that self-taught learning, which is treated in Section 3, corresponds to the case where the label space $\mathcal{Y}_S$ in the source region is a singleton.

We consider the parameter transfer approach, where the knowledge to be transferred is encoded into a parameter. The parameter transfer approach aims to learn a hypothesis with low expected risk for the target task by obtaining some knowledge about an effective parameter in the source region and transferring it to the target region. In this paper, we suppose that there are parametric models on both the source and target regions and that their parameter spaces are partly shared. Our strategy is then to learn an effective parameter in the source region and transfer a part of the parameter to the target region. We describe the formulation in the following.

In the target region, we assume that $\mathcal{Y}_T \subset \mathbb{R}$ and that there is a parametric feature mapping $\psi_\theta : \mathcal{X}_T \to \mathbb{R}^m$ on the target domain such that each hypothesis $h_{T,\theta,w} : \mathcal{X}_T \to \mathcal{Y}_T$ is represented by
$$h_{T,\theta,w}(x) := \langle w, \psi_\theta(x)\rangle \tag{1}$$
with parameters $\theta\in\Theta$ and $w\in\mathcal{W}_T$, where $\Theta$ is a subset of a normed space with norm $\|\cdot\|$ and $\mathcal{W}_T$ is a subset of $\mathbb{R}^m$. Then the hypothesis set in the target region is parameterized as $\mathcal{H}_T = \{h_{T,\theta,w} \mid \theta\in\Theta,\ w\in\mathcal{W}_T\}$. In the following, we simply denote $R_T(h_{T,\theta,w})$ and $\widehat{R}_T(h_{T,\theta,w})$ by $R_T(\theta,w)$ and $\widehat{R}_T(\theta,w)$, respectively.
In the source region, we suppose that there exists some kind of parametric model, such as a sample distribution $P_{S,\theta,w}$ or a hypothesis $h_{S,\theta,w}$ with parameters $\theta\in\Theta$ and $w\in\mathcal{W}_S$, and that the part $\Theta$ of the parameter space is shared with the target region. Let $\theta^*_S\in\Theta$ and $w^*_S\in\mathcal{W}_S$ be parameters that are supposed to be effective in the source region (e.g., the true parameter of the sample distribution, or the parameter of the optimal hypothesis with respect to the expected risk $R_S$); explicit assumptions are not imposed on these parameters.

The parameter transfer algorithm treated in this paper is described as follows. Let $N$ samples and $n$ samples be available in the source and target regions, respectively. First, the algorithm outputs an estimator $\hat{\theta}_N\in\Theta$ of $\theta^*_S$ by using the $N$ source samples. Next, for the parameter
$$w^*_T := \operatorname*{argmin}_{w\in\mathcal{W}_T} R_T(\theta^*_S, w)$$
in the target region, the algorithm outputs its estimator
$$\hat{w}_{N,n} := \operatorname*{argmin}_{w\in\mathcal{W}_T} \widehat{R}_{T,n}(\hat{\theta}_N, w) + \rho r(w)$$
by using the $n$ target samples, where $r(w)$ is a $1$-strongly convex function with respect to $\|\cdot\|_2$ and $\rho>0$. If the source region is related to the target region in some sense, the effective parameter $\theta^*_S$ in the source region is expected to also be useful for the target task. In the next subsection, we regard $R_T(\theta^*_S, w^*_T)$ as the baseline of predictive performance and derive a learning bound.

### 2.2 Learning Bound Based on Stability and Learnability

We newly introduce the local stability and the parameter transfer learnability below. These notions are essential to deriving the learning bound in Theorem 1.

**Definition 1 (Local Stability).** A parametric feature mapping $\psi_\theta$ is said to be locally stable if there exist $\epsilon_\theta:\mathcal{X}\to\mathbb{R}_{>0}$ for each $\theta\in\Theta$ and $L_\psi>0$ such that, for $\theta,\theta'\in\Theta$,
$$\|\theta-\theta'\| \le \epsilon_\theta(x) \;\Longrightarrow\; \|\psi_\theta(x)-\psi_{\theta'}(x)\|_2 \le L_\psi\|\theta-\theta'\|.$$
We term $\epsilon_\theta(x)$ the permissible radius of perturbation for $\theta$ at $x$. For samples $X_n = \{x_1,\dots,x_n\}$, we set $\epsilon_\theta(X_n) := \min_{j\in[n]}\epsilon_\theta(x_j)$, where $[n]:=\{1,\dots,n\}$ for a positive integer $n$.

Next, we formulate the parameter transfer learnability based on the local stability.

**Definition 2 (Parameter Transfer Learnability).** Suppose that $N$ samples in the source domain and $n$ samples $X_n$ in the target domain are available, and let the parametric feature mapping $\{\psi_\theta\}_{\theta\in\Theta}$ be locally stable. For $\bar{\delta}\in[0,1)$, $\{\psi_\theta\}_{\theta\in\Theta}$ is said to be parameter transfer learnable with probability $1-\bar{\delta}$ if there exists an algorithm, depending only on the $N$ samples in the source domain, whose output $\hat{\theta}_N$ satisfies
$$\Pr\Big[\big\|\hat{\theta}_N - \theta^*_S\big\| \le \epsilon_{\theta^*_S}(X_n)\Big] \ge 1-\bar{\delta}.$$

In the following, we assume that the parametric feature mapping is bounded, $\|\psi_\theta(x)\|_2 \le R_\psi$ for arbitrary $x\in\mathcal{X}$ and $\theta\in\Theta$, and that linear predictors are also bounded, $\|w\|_2 \le R_W$ for any $w\in\mathcal{W}$. In addition, we suppose that the loss function $\ell(\cdot,\cdot)$ is $L_\ell$-Lipschitz and convex with respect to the second variable. We denote $R_r := \sup_{w\in\mathcal{W}}|r(w)|$. Then the following learning bound is obtained, where the strong convexity of the regularization term $\rho r(w)$ is essential.

**Theorem 1 (Learning Bound).** Suppose that the parametric feature mapping $\psi_\theta$ is locally stable and that an estimator $\hat{\theta}_N$ learned in the source region satisfies the parameter transfer learnability with probability $1-\bar{\delta}$. When $\rho = L_\ell R_\psi\sqrt{\frac{8(32+\log(2/\delta))}{R_r n}}$, the following inequality holds with probability at least $1-(\delta+2\bar{\delta})$:
$$R_T\big(\hat{\theta}_N, \hat{w}_{N,n}\big) - R_T\big(\theta^*_S, w^*_T\big) \le L_\ell R_\psi\Big(R_W\sqrt{2\log(2/\delta)} + 2\sqrt{2R_r(32+\log(2/\delta))}\Big)\sqrt{\frac{1}{n}} + L_\ell L_\psi R_W\big\|\hat{\theta}_N-\theta^*_S\big\| + L_\ell\sqrt{2L_\psi R_\psi R_W}\left(\frac{R_r}{2(32+\log(2/\delta))}\right)^{\!1/4} n^{1/4}\sqrt{\big\|\hat{\theta}_N-\theta^*_S\big\|}. \tag{2}$$
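To make the two-stage procedure of this section concrete, the following is a minimal sketch in Python under illustrative choices that are not fixed by the paper: the loss is the squared loss, the regularizer is $r(w)=\frac{1}{2}\|w\|_2^2$ (which is 1-strongly convex with respect to $\|\cdot\|_2$), the norm constraint on $w$ is omitted, and the source-side estimator of $\theta^*_S$ is an arbitrary user-supplied routine. Under these choices, the second stage reduces to ridge regression in the feature space induced by $\psi_{\hat{\theta}_N}$.

```python
# Minimal sketch of the two-stage parameter transfer algorithm of Section 2.1.
# Hypothetical choices (not fixed by the paper): squared loss ell(y, s) = 0.5*(y - s)^2,
# regularizer r(w) = 0.5*||w||_2^2, and a black-box source-side estimator of theta.
import numpy as np

def parameter_transfer(estimate_theta, feature_map, source_samples,
                       X_target, y_target, rho):
    """Stage 1: estimate theta from source samples; Stage 2: regularized ERM for w."""
    # Stage 1: hat{theta}_N, computed from the N (possibly unlabeled) source samples.
    theta_hat = estimate_theta(source_samples)

    # Stage 2: hat{w}_{N,n} = argmin_w (1/n) sum_j 0.5*(y_j - <w, psi_theta(x_j)>)^2
    #                                  + rho * 0.5*||w||_2^2,
    # which is ridge regression on the features psi_{hat{theta}_N}(x_j).
    Psi = np.vstack([feature_map(theta_hat, x) for x in X_target])   # shape (n, m)
    n, m = Psi.shape
    w_hat = np.linalg.solve(Psi.T @ Psi / n + rho * np.eye(m),
                            Psi.T @ y_target / n)
    return theta_hat, w_hat

# Toy usage with an illustrative feature map psi_theta(x) = tanh(theta @ x).
rng = np.random.default_rng(0)
d, m, n = 10, 5, 200
feature_map = lambda theta, x: np.tanh(theta @ x)
estimate_theta = lambda samples: samples.mean(axis=0)       # placeholder source estimator
source_samples = rng.standard_normal((1000, m, d))          # placeholder source data
X_target = rng.standard_normal((n, d))
y_target = rng.standard_normal(n)
theta_hat, w_hat = parameter_transfer(estimate_theta, feature_map, source_samples,
                                      X_target, y_target, rho=0.1)
```

The sketch only illustrates the division of labor between the two stages; any estimator satisfying Definition 2 can play the role of `estimate_theta`.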
If the estimation error $\|\hat{\theta}_N-\theta^*_S\|$ can be evaluated in terms of the number $N$ of source samples, Theorem 1 clarifies which term is dominant and, in particular, how large the number of source samples must be relative to the number $n$ of target samples.

### 2.3 Proof of Learning Bound

We prove Theorem 1 in this subsection. In the proof, we omit the subscript $T$ for simplicity and denote $\theta^*_S$ simply by $\theta^*$. We set
$$\hat{w}_n := \operatorname*{argmin}_{w\in\mathcal{W}}\ \frac{1}{n}\sum_{j=1}^{n}\ell\big(y_j, \langle w, \psi_{\theta^*}(x_j)\rangle\big) + \rho r(w).$$
Then we have
$$R_T\big(\hat{\theta}_N, \hat{w}_{N,n}\big) - R_T\big(\theta^*, w^*\big) = \Big(\mathbb{E}_{(x,y)\sim P}\big[\ell\big(y,\langle\hat{w}_{N,n},\psi_{\hat{\theta}_N}(x)\rangle\big)\big] - \mathbb{E}_{(x,y)\sim P}\big[\ell\big(y,\langle\hat{w}_{N,n},\psi_{\theta^*}(x)\rangle\big)\big]\Big) + \Big(\mathbb{E}_{(x,y)\sim P}\big[\ell\big(y,\langle\hat{w}_{N,n},\psi_{\theta^*}(x)\rangle\big)\big] - \mathbb{E}_{(x,y)\sim P}\big[\ell\big(y,\langle\hat{w}_{n},\psi_{\theta^*}(x)\rangle\big)\big]\Big) + \Big(\mathbb{E}_{(x,y)\sim P}\big[\ell\big(y,\langle\hat{w}_{n},\psi_{\theta^*}(x)\rangle\big)\big] - \mathbb{E}_{(x,y)\sim P}\big[\ell\big(y,\langle w^*,\psi_{\theta^*}(x)\rangle\big)\big]\Big). \tag{3}$$
In the following, we bound the three parts of (3), abbreviating $\mathbb{E} := \mathbb{E}_{(x,y)\sim P}$.

First, the following inequality holds with probability $1-(\delta/2+\bar{\delta})$:
$$\mathbb{E}\big[\ell\big(y,\langle\hat{w}_{N,n},\psi_{\hat{\theta}_N}(x)\rangle\big)\big] - \mathbb{E}\big[\ell\big(y,\langle\hat{w}_{N,n},\psi_{\theta^*}(x)\rangle\big)\big] \le L_\ell R_W\,\mathbb{E}\big[\|\psi_{\hat{\theta}_N}(x)-\psi_{\theta^*}(x)\|_2\big] \le \frac{L_\ell R_W}{n}\sum_{j=1}^{n}\big\|\psi_{\hat{\theta}_N}(x_j)-\psi_{\theta^*}(x_j)\big\|_2 + L_\ell R_W R_\psi\sqrt{\frac{2\log(2/\delta)}{n}} \le L_\ell L_\psi R_W\big\|\hat{\theta}_N-\theta^*\big\| + L_\ell R_W R_\psi\sqrt{\frac{2\log(2/\delta)}{n}},$$
where we used Hoeffding's inequality in the second inequality, and the local stability and the parameter transfer learnability in the last inequality.

Second, the following inequality holds with probability $1-\bar{\delta}$:
$$\mathbb{E}\big[\ell\big(y,\langle\hat{w}_{N,n},\psi_{\theta^*}(x)\rangle\big)\big] - \mathbb{E}\big[\ell\big(y,\langle\hat{w}_{n},\psi_{\theta^*}(x)\rangle\big)\big] \le L_\ell\,\mathbb{E}\big[\big|\langle\hat{w}_{N,n},\psi_{\theta^*}(x)\rangle - \langle\hat{w}_{n},\psi_{\theta^*}(x)\rangle\big|\big] \le L_\ell R_\psi\big\|\hat{w}_{N,n}-\hat{w}_n\big\|_2 \le 2L_\ell R_\psi\sqrt{\frac{L_\ell L_\psi R_W}{\rho}\big\|\hat{\theta}_N-\theta^*\big\|}, \tag{4}$$
where the last inequality is derived from the strong convexity of the regularizer $\rho r(w)$ in the Appendix.

Third, the following holds by Theorem 1 of Sridharan et al. (2009) with probability $1-\delta/2$:
$$\mathbb{E}\big[\ell\big(y,\langle\hat{w}_{n},\psi_{\theta^*}(x)\rangle\big)\big] - \mathbb{E}\big[\ell\big(y,\langle w^*,\psi_{\theta^*}(x)\rangle\big)\big] = \mathbb{E}\big[\ell\big(y,\langle\hat{w}_{n},\psi_{\theta^*}(x)\rangle\big) + \rho r(\hat{w}_n)\big] - \mathbb{E}\big[\ell\big(y,\langle w^*,\psi_{\theta^*}(x)\rangle\big) + \rho r(w^*)\big] + \rho\big(r(w^*)-r(\hat{w}_n)\big) \le \frac{8L_\ell^2 R_\psi^2\big(32+\log(2/\delta)\big)}{\rho n} + \rho\big(r(w^*)-r(\hat{w}_n)\big).$$
Thus, when $\rho = L_\ell R_\psi\sqrt{\frac{8(32+\log(2/\delta))}{R_r n}}$, combining the three bounds yields (2) with probability $1-(\delta+2\bar{\delta})$.

## 3 Stability and Learnability in Sparse Coding

In this section, we consider sparse coding in self-taught learning, where the source region essentially consists of the sample space $\mathcal{X}_S$ without a label space $\mathcal{Y}_S$. We assume that the sample spaces in both regions are $\mathbb{R}^d$. The sparse coding method treated here is a two-stage procedure: a dictionary is learnt in the source region, and then sparse coding with the learnt dictionary is used for a predictive task in the target region. First, we show that sparse coding satisfies the local stability in Section 3.1, and we then explain that appropriate dictionary learning algorithms satisfy the parameter transfer learnability in Section 3.4. As a consequence of Theorem 1, we obtain a learning bound for self-taught learning algorithms based on sparse coding. We note that the results in this section are useful independently of transfer learning.

We summarize here the notation used in this section. Let $\|\cdot\|_p$ be the $\ell_p$-norm on $\mathbb{R}^d$. We define $\mathrm{supp}(a) := \{i\in[m] \mid a_i\neq 0\}$ for $a\in\mathbb{R}^m$, and we denote the number of elements of a set $S$ by $|S|$. When a vector $a$ satisfies $\|a\|_0 = |\mathrm{supp}(a)| \le k$, $a$ is said to be $k$-sparse. We denote the ball of radius $R$ centered at $0$ by $B_{\mathbb{R}^d}(R) := \{x\in\mathbb{R}^d \mid \|x\|_2\le R\}$. We set $\mathcal{D} := \{D = [d_1,\dots,d_m]\in B_{\mathbb{R}^d}(1)^m \mid \|d_j\|_2 = 1\ (j=1,\dots,m)\}$ and call each $D\in\mathcal{D}$ a dictionary of size $m$.

**Definition 3 (Induced matrix norm).** For an arbitrary matrix $E = [e_1,\dots,e_m]\in\mathbb{R}^{d\times m}$, the induced matrix norm is defined by $\|E\|_{1,2} := \max_{i\in[m]}\|e_i\|_2$.

In general, the $(p,q)$-induced norm for $p,q\ge 1$ is defined by $\|E\|_{p,q} := \sup_{v\in\mathbb{R}^m,\,\|v\|_p=1}\|Ev\|_q$; the norm $\|\cdot\|_{1,2}$ in this general definition coincides with that in Definition 3 by Lemma 17 of Vainsencher et al. (2011). We adopt $\|\cdot\|_{1,2}$ to measure the difference of dictionaries since it is typically used in the framework of dictionary learning. We note that $\|D-\tilde{D}\|_{1,2}\le 2$ holds for arbitrary dictionaries $D,\tilde{D}\in\mathcal{D}$.
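As an illustration of the two-stage procedure just described, the following Python sketch learns a dictionary from unlabeled source samples and uses it to encode labeled target samples for a downstream linear predictor. The synthetic data, the choice of scikit-learn's `MiniBatchDictionaryLearning` as the dictionary learner, and the ridge classifier on top of the codes are illustrative stand-ins of our own choosing; the lasso encoder $\varphi_D$, defined in Section 3.1 below, is implemented directly by ISTA so that the regularization parameter matches the objective $\frac{1}{2}\|x-Dz\|_2^2+\lambda\|z\|_1$.

```python
# Minimal sketch of sparse-coding-based self-taught learning under the two-stage scheme
# described above. Data are synthetic placeholders; MiniBatchDictionaryLearning stands in
# for any dictionary learner meeting the accuracy requirement of Section 3.4, and the
# ridge classifier is one possible regularized linear predictor on the codes.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import RidgeClassifier

def sparse_code(D, x, lam, n_iter=200):
    """phi_D(x) = argmin_z 0.5*||x - D z||_2^2 + lam*||z||_1, solved by ISTA."""
    z = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D, 2) ** 2             # 1/L with L = sigma_max(D)^2
    for _ in range(n_iter):
        z = z - step * (D.T @ (D @ z - x))             # gradient step on the quadratic part
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return z

rng = np.random.default_rng(0)
d, m = 64, 32

# Stage 1: learn a dictionary from N unlabeled source samples.
X_source = rng.standard_normal((500, d))               # placeholder unlabeled source data
dl = MiniBatchDictionaryLearning(n_components=m, alpha=1.0, random_state=0).fit(X_source)
D_hat = dl.components_.T                                # d x m, atoms as columns
D_hat = D_hat / np.maximum(np.linalg.norm(D_hat, axis=0, keepdims=True), 1e-12)

# Stage 2: encode n labeled target samples with phi_{D_hat} and fit a linear predictor.
X_target = rng.standard_normal((100, d))                # placeholder labeled target data
y_target = rng.integers(0, 2, size=100)
lam = d ** -0.25                                         # lambda = d^{-1/4}, cf. Section 3.2
Z_target = np.vstack([sparse_code(D_hat, x, lam) for x in X_target])
clf = RidgeClassifier().fit(Z_target, y_target)
```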
### 3.1 Local Stability of Sparse Representation

We show the local stability of the sparse representation under a sparse model. The sparse representation of a sample $x\in\mathbb{R}^d$ with dictionary parameter $D$ is expressed as
$$\varphi_D(x) := \operatorname*{argmin}_{z\in\mathbb{R}^m}\ \frac{1}{2}\|x-Dz\|_2^2 + \lambda\|z\|_1,$$
where $\lambda>0$ is a regularization parameter. This situation corresponds to the case where $\theta = D$ and $\psi_\theta = \varphi_D$ in the setting of Section 2.1. We prepare some notions needed to state the stability of the sparse representation. The following margin and incoherence were introduced by Mehta and Gray (2013).

**Definition 4 ($k$-margin).** Given a dictionary $D = [d_1,\dots,d_m]\in\mathcal{D}$ and a point $x\in\mathbb{R}^d$, the $k$-margin of $D$ on $x$ is
$$M_k(D,x) := \max_{I\subset[m],\,|I|=m-k}\ \min_{j\in I}\ \Big\{\lambda - \big|\langle d_j,\, x - D\varphi_D(x)\rangle\big|\Big\}.$$

**Definition 5 ($\mu$-incoherence).** A dictionary matrix $D = [d_1,\dots,d_m]\in\mathcal{D}$ is termed $\mu$-incoherent if $|\langle d_i, d_j\rangle| \le \mu/\sqrt{d}$ for all $i\neq j$.

Then the following theorem is obtained.

**Theorem 2 (Sparse Coding Stability).** Let $D\in\mathcal{D}$ be $\mu$-incoherent and let $\tilde{D}\in\mathcal{D}$ satisfy $\|D-\tilde{D}\|_{1,2}\le\lambda$. When
$$\|D-\tilde{D}\|_{1,2} \le \epsilon_{k,D}(x) := \frac{M_k(D,x)^2\,\lambda}{64\max\{1,\|x\|_2\}^4}, \tag{5}$$
the following stability bound holds:
$$\big\|\varphi_D(x)-\varphi_{\tilde{D}}(x)\big\|_2 \le \frac{4\|x\|_2\sqrt{d}}{\lambda}\,\big\|D-\tilde{D}\big\|_{1,2}.$$

From Theorem 2, $\epsilon_{k,D}(x)$ serves as the permissible radius of perturbation in Definition 1. Here, we mention the relation to the sparse coding stability result (Theorem 4) of Mehta and Gray (2013), in which the difference of dictionaries is measured by $\|\cdot\|_{2,2}$ instead of $\|\cdot\|_{1,2}$ and the permissible radius of perturbation is given by $M_k(D,x)^2\lambda$ up to a constant factor. Applying the simple inequality $\|E\|_{2,2}\le\sqrt{m}\,\|E\|_{1,2}$ for $E\in\mathbb{R}^{d\times m}$, we can obtain a variant of the sparse coding stability with the norm $\|\cdot\|_{1,2}$. However, the dictionary size $m$ then affects both the permissible radius of perturbation and the stability bound. In contrast, the factor of $m$ does not appear in Theorem 2, and thus the result is effective even for large $m$. In addition, whereas $\|x\|_2\le 1$ is assumed in Mehta and Gray (2013), Theorem 2 does not assume $\|x\|_2\le 1$ and clarifies the dependence on the norm $\|x\|_2$.

In existing studies related to sparse coding, the sparse representation $\varphi_D(x)$ is often modified to $\varphi_D(x)\otimes x$ (Mairal et al. (2009)) or $\varphi_D(x)\otimes\big(x-D\varphi_D(x)\big)$ (Raina et al. (2007)), where $\otimes$ is the tensor product. By the stability of the sparse representation (Theorem 2), it can be shown that such modified representations also have local stability.

### 3.2 Sparse Modeling and Margin Bound

In this subsection, we assume a sparse structure for the samples $x\in\mathbb{R}^d$ and specify a lower bound for the $k$-margin used in (5). The result obtained in this section plays an essential role in showing the parameter transfer learnability in Section 3.4.

**Assumption 1 (Model).** There exists a dictionary matrix $D^*$ such that every sample $x$ is independently generated by a representation $a$ and noise $\xi$ as $x = D^*a + \xi$.

Moreover, we impose the following three assumptions on the above model.

**Assumption 2 (Dictionary).** The dictionary matrix $D^* = [d_1,\dots,d_m]\in\mathcal{D}$ is $\mu$-incoherent.

**Assumption 3 (Representation).** The representation $a$ is a random variable that is $k$-sparse (i.e., $\|a\|_0\le k$) and whose non-zero entries are lower bounded by $C>0$ (i.e., $a_i\neq 0$ implies $|a_i|\ge C$).

**Assumption 4 (Noise).** The noise $\xi$ is independent across coordinates and sub-Gaussian with parameter $\sigma/\sqrt{d}$ on each component.
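The following Python sketch, under illustrative sizes and constants of our own choosing, generates a sample according to Assumptions 1-4, computes the lasso code $\varphi_{D^*}(x)$, and compares the change of the code under a small dictionary perturbation, measured in the $(1,2)$-induced norm of Definition 3, with the linear dependence on $\|D-\tilde{D}\|_{1,2}$ described by Theorem 2. The ISTA solver is repeated here so that the snippet is self-contained.

```python
# Minimal numerical illustration of Assumptions 1-4 and of the stability statement of
# Theorem 2. Sizes and constants are illustrative choices only, not taken from the paper.
import numpy as np

def sparse_code(D, x, lam, n_iter=300):
    """phi_D(x) = argmin_z 0.5*||x - D z||_2^2 + lam*||z||_1 (same ISTA solver as above)."""
    z = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    for _ in range(n_iter):
        z = z - step * (D.T @ (D @ z - x))
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return z

def norm_1_2(E):
    """(1,2)-induced norm of Definition 3: the largest column 2-norm."""
    return np.linalg.norm(E, axis=0).max()

rng = np.random.default_rng(1)
d, m, k, sigma, C = 256, 128, 5, 0.1, 1.0

# Assumption 2: unit-norm random atoms (approximately incoherent in high dimension).
D_star = rng.standard_normal((d, m))
D_star /= np.linalg.norm(D_star, axis=0, keepdims=True)

# Assumption 3: k-sparse representation with non-zero entries of magnitude at least C.
a = np.zeros(m)
support = rng.choice(m, size=k, replace=False)
a[support] = rng.choice([-1.0, 1.0], size=k) * (C + rng.random(k))

# Assumptions 1 and 4: x = D* a + xi with (sub-)Gaussian noise of parameter sigma/sqrt(d).
x = D_star @ a + rng.normal(scale=sigma / np.sqrt(d), size=d)

lam = d ** -0.25                                   # lambda = d^{-1/4}, cf. Section 3.2
z = sparse_code(D_star, x, lam)

# Small perturbation of the dictionary; within the permissible radius, Theorem 2 bounds
# the change of the code by a multiple of ||D~ - D*||_{1,2}.
D_tilde = D_star + 1e-3 * rng.standard_normal((d, m))
D_tilde /= np.linalg.norm(D_tilde, axis=0, keepdims=True)
z_tilde = sparse_code(D_tilde, x, lam)
print("||D~ - D*||_{1,2}           =", norm_1_2(D_tilde - D_star))
print("||phi_D~(x) - phi_D*(x)||_2 =", np.linalg.norm(z_tilde - z))
```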
We note that the assumptions do not require the representation $a$ or the noise $\xi$ to be identically distributed across samples, although their components are independent. This is essential because samples in the source and target domains cannot be assumed to be identically distributed in transfer learning.

**Theorem 3 (Margin Bound).** Let $0<t<1$. We set
$$\delta_{t,\lambda} := \frac{2\sigma}{(1-t)\sqrt{d}\,\lambda}\exp\!\left(-\frac{(1-t)^2 d\lambda^2}{8\sigma^2}\right) + \delta_2 + \delta_3, \tag{6}$$
where $\delta_2$ and $\delta_3$ are exponentially small terms of the same type in $d\lambda^2/\sigma^2$, arising from Corollary 1 of Negahban et al. (2009) and Theorem 4 of Zhao and Yu (2006), respectively. We suppose that $d\ge\big\{\big(1+\tfrac{6}{1-t}\big)\mu k\big\}^2$ and $\lambda = d^{-\tau}$ for an arbitrary $1/4\le\tau\le 1/2$. Under Assumptions 1-4, the following inequality holds with probability at least $1-\delta_{t,\lambda}$:
$$M_k(D^*, x) \ge t\lambda. \tag{7}$$

We comment on the regularization parameter $\lambda$. To appropriately reflect the sparsity of the samples, $\lambda$ must be set suitably. According to Theorem 4 of Zhao and Yu (2006), when the samples follow the sparse model of Assumptions 1-4 and $\lambda = d^{-\tau}$ for $1/4\le\tau\le 1/2$, the representation $\varphi_{D^*}(x)$ reconstructs the true sparse representation $a$ of a sample $x$ with small error. (Theorem 4 of Zhao and Yu (2006) is stated for Gaussian noise; however, it can easily be generalized to sub-Gaussian noise as in Assumption 4. Our setting corresponds to the case in which $c_1 = 1/2$, $c_2 = 1$, $c_3 = (\log\kappa + \log\log d)/\log d$ for some $\kappa>1$ (i.e., $e^{d^{c_3}} = d^\kappa$), and $c_4 = C$ in Theorem 4 of Zhao and Yu (2006). Note that our regularization parameter $\lambda$ corresponds to $\lambda_d/d$ in Zhao and Yu (2006).) In particular, when $\tau = 1/4$ (i.e., $\lambda = d^{-1/4}$) in Theorem 3, the failure probability $\delta_{t,\lambda}$ on the margin is guaranteed to become sub-exponentially small with respect to the dimension $d$ and is negligible in the high-dimensional case. On the other hand, the typical choice $\tau = 1/2$ (i.e., $\lambda = d^{-1/2}$) does not provide a useful result because $\delta_{t,\lambda}$ is then not small at all.

### 3.3 Proof of Margin Bound

We give a sketch of the proof of Theorem 3. We denote the first term of (6) by $\delta_1$. From Assumptions 1 and 3, a sample is represented as $x = D^*a + \xi$ with $\|a\|_0\le k$. Without loss of generality, we assume that the first $m-k$ components of $a$ are $0$ and the last $k$ components are not $0$. Since
$$M_k(D^*,x) \ge \min_{1\le j\le m-k}\ \lambda - \big|\langle d_j,\, x - D^*\varphi_{D^*}(x)\rangle\big| = \min_{1\le j\le m-k}\ \lambda - \big|\langle d_j,\xi\rangle + \langle D^{*\top}d_j,\, a-\varphi_{D^*}(x)\rangle\big|,$$
it is enough to show that the following holds for an arbitrary $1\le j\le m-k$ to prove Theorem 3:
$$\Pr\Big[\big|\langle d_j,\xi\rangle\big| + \big|\langle D^{*\top}d_j,\, a-\varphi_{D^*}(x)\rangle\big| > (1-t)\lambda\Big] \le \delta_{t,\lambda}. \tag{8}$$
Then (8) follows from the following two inequalities:
$$\Pr\Big[\big|\langle d_j,\xi\rangle\big| > \tfrac{1-t}{2}\lambda\Big] \le \delta_1, \tag{9}$$
$$\Pr\Big[\big|\langle D^{*\top}d_j,\, a-\varphi_{D^*}(x)\rangle\big| > \tfrac{1-t}{2}\lambda\Big] \le \delta_2 + \delta_3. \tag{10}$$
The inequality (9) holds since $\|d_j\|_2 = 1$ by definition and by Assumption 4. Thus, all we have to do is to show (10). We have
$$\big|\langle D^{*\top}d_j,\, a-\varphi_{D^*}(x)\rangle\big| = \Big|\big\langle[\langle d_1,d_j\rangle,\dots,\langle d_m,d_j\rangle]^\top,\, a-\varphi_{D^*}(x)\big\rangle\Big| = \Big|\big\langle\mathbf{1}_{\mathrm{supp}(a-\varphi_{D^*}(x))}\odot[\langle d_1,d_j\rangle,\dots,\langle d_m,d_j\rangle]^\top,\, a-\varphi_{D^*}(x)\big\rangle\Big| \le \big\|\mathbf{1}_{\mathrm{supp}(a-\varphi_{D^*}(x))}\odot[\langle d_1,d_j\rangle,\dots,\langle d_m,d_j\rangle]^\top\big\|_2\,\big\|a-\varphi_{D^*}(x)\big\|_2, \tag{11}$$
where $u\odot v$ is the Hadamard product (i.e., component-wise product) of $u$ and $v$, and $\mathbf{1}_A$ for a set $A\subset[m]$ is the vector whose $i$-th component is $1$ if $i\in A$ and $0$ otherwise. Applying Theorem 4 of Zhao and Yu (2006) and using the condition for $\lambda$, the following holds with probability $1-\delta_3$:
$$\mathrm{supp}(a) = \mathrm{supp}\big(\varphi_{D^*}(x)\big). \tag{12}$$
Moreover, under (12), the following holds with probability $1-\delta_2$ by modifying Corollary 1 of Negahban et al. (2009) and using the condition for $\lambda$:
$$\big\|a-\varphi_{D^*}(x)\big\|_2 \le 6\sqrt{k}\,\lambda. \tag{13}$$
Thus, if both (12) and (13) hold, the right-hand side of (11) is bounded as follows:
$$\big\|\mathbf{1}_{\mathrm{supp}(a-\varphi_{D^*}(x))}\odot[\langle d_1,d_j\rangle,\dots,\langle d_m,d_j\rangle]^\top\big\|_2\,\big\|a-\varphi_{D^*}(x)\big\|_2 \le \sqrt{\big|\mathrm{supp}\big(a-\varphi_{D^*}(x)\big)\big|}\,\frac{\mu}{\sqrt{d}}\,\big\|a-\varphi_{D^*}(x)\big\|_2 \le \sqrt{k}\,\frac{\mu}{\sqrt{d}}\cdot 6\sqrt{k}\,\lambda = \frac{6\mu k}{\sqrt{d}}\,\lambda \le \frac{1-t}{2}\lambda,$$
where we used Assumption 2 in the first inequality, (12) and Assumption 3 in the second inequality, and the condition for $d$ in the last inequality.
From the above discussion, the left-hand side of (10) is bounded by the sum of the probability $\delta_3$ that (12) does not hold and the probability $\delta_2$ that (12) holds but (13) does not hold.

### 3.4 Transfer Learnability for Dictionary Learning

When the true dictionary $D^*$ exists as in Assumption 1, we show that the output $\hat{D}_N$ of a suitable dictionary learning algorithm applied to $N$ unlabeled samples satisfies the parameter transfer learnability for the sparse coding $\varphi_D$. Theorem 1 then guarantees the learning bound in self-taught learning, since the discussion in this section does not assume a label space in the source region. This situation corresponds to the case where $\theta^*_S = D^*$, $\hat{\theta}_N = \hat{D}_N$, and $\|\cdot\| = \|\cdot\|_{1,2}$ in Section 2.1.

We show that an appropriate dictionary learning algorithm satisfies the parameter transfer learnability for the sparse coding $\varphi_D$ by focusing on the permissible radius of perturbation in (5) under some assumptions. When Assumptions 1-4 hold and $\lambda = d^{-\tau}$ for $1/4\le\tau\le 1/2$, the margin bound (7) for $x\in\mathcal{X}$ holds with probability $1-\delta_{t,\lambda}$, and thus we have
$$\epsilon_{k,D^*}(x) \ge \frac{t^2\lambda^3}{64\max\{1,\|x\|_2\}^4} = \Theta\big(d^{-3\tau}\big).$$
Thus, if a dictionary learning algorithm outputs an estimator $\hat{D}_N$ such that
$$\big\|\hat{D}_N - D^*\big\|_{1,2} \le O\big(d^{-3\tau}\big) \tag{14}$$
with probability $1-\delta_N$, the estimator $\hat{D}_N$ of $D^*$ satisfies the parameter transfer learnability for the sparse coding $\varphi_D$ with probability $1-\bar{\delta}$, where $\bar{\delta} = \delta_N + n\delta_{t,\lambda}$. Then, by the local stability of the sparse representation and the parameter transfer learnability of such a dictionary learning algorithm, Theorem 1 guarantees that sparse coding in self-taught learning satisfies the learning bound in (2). We note that Theorem 1 applies to any dictionary learning algorithm as long as (14) is satisfied. For example, Arora et al. (2015) show that, when $k = O(\sqrt{d}/\log d)$, $m = O(d)$, and Assumptions 1-4 together with some additional conditions are assumed, their dictionary learning algorithm outputs $\hat{D}_N$ satisfying $\|\hat{D}_N - D^*\|_{1,2} = O(d^{-M})$ with probability $1-d^{-M'}$ for arbitrarily large $M$ and $M'$, as long as $N$ is sufficiently large.

## 4 Conclusion

We derived a learning bound (Theorem 1) for a parameter transfer learning problem based on the local stability and the parameter transfer learnability, which are newly introduced in this paper. Applying it to a sparse coding-based algorithm under a sparse model (Assumptions 1-4), we obtained the first theoretical guarantee of a learning bound in self-taught learning. Although we only considered sparse coding, the framework of parameter transfer learning includes other promising algorithms such as multiple kernel learning and deep neural networks, and thus our results are expected to be useful for analyzing the theoretical performance of these algorithms. Finally, we note that our learning bound can be applied to settings other than self-taught learning, because Theorem 1 includes the case in which labeled samples are available in the source region.

## References

[1] S. Arora, R. Ge, T. Ma, and A. Moitra (2015) Simple, efficient, and neural algorithms for sparse coding, arXiv preprint arXiv:1503.00778.

[2] J. Baxter (2000) A model of inductive bias learning, Journal of Artificial Intelligence Research (JAIR), Vol. 12, pp. 149-198.

[3] A. Coates, A. Y. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning, in International Conference on Artificial Intelligence and Statistics, pp. 215-223.

[4] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu (2008) Self-taught clustering, in Proceedings of the 25th International Conference on Machine Learning, pp. 200-207, ACM.
[5] Q. V. Le (2013) Building high-level features using large scale unsupervised learning, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8595-8598, IEEE.

[6] H. Lee, R. Raina, A. Teichman, and A. Y. Ng (2009) Exponential family sparse coding with application to self-taught learning, in IJCAI, Vol. 9, pp. 1113-1119.

[7] J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, and F. R. Bach (2009) Supervised dictionary learning, in Advances in Neural Information Processing Systems, pp. 1033-1040.

[8] A. Maurer (2009) Transfer bounds for linear feature learning, Machine Learning, Vol. 75, pp. 327-350.

[9] A. Maurer, M. Pontil, and B. Romera-Paredes (2012) Sparse coding for multitask and transfer learning, arXiv preprint arXiv:1209.0738.

[10] N. Mehta and A. G. Gray (2013) Sparsity-based generalization bounds for predictive sparse coding, in Proceedings of the 30th International Conference on Machine Learning (ICML 2013), pp. 36-44.

[11] S. Negahban, B. Yu, M. J. Wainwright, and P. K. Ravikumar (2009) A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers, in Advances in Neural Information Processing Systems, pp. 1348-1356.

[12] S. J. Pan and Q. Yang (2010) A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering, Vol. 22, pp. 1345-1359.

[13] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng (2007) Self-taught learning: transfer learning from unlabeled data, in Proceedings of the 24th International Conference on Machine Learning, pp. 759-766, ACM.

[14] K. Sridharan, S. Shalev-Shwartz, and N. Srebro (2009) Fast rates for regularized objectives, in Advances in Neural Information Processing Systems, pp. 1545-1552.

[15] D. Vainsencher, S. Mannor, and A. M. Bruckstein (2011) The sample complexity of dictionary learning, Journal of Machine Learning Research, Vol. 12, pp. 3259-3281.

[16] H. Wang, F. Nie, and H. Huang (2013) Robust and discriminative self-taught learning, in Proceedings of the 30th International Conference on Machine Learning, pp. 298-306.

[17] P. Zhao and B. Yu (2006) On model selection consistency of Lasso, Journal of Machine Learning Research, Vol. 7, pp. 2541-2563.

[18] X. Zhu, Z. Huang, Y. Yang, H. T. Shen, C. Xu, and J. Luo (2013) Self-taught dimensionality reduction on the high-dimensional small-sized data, Pattern Recognition, Vol. 46, pp. 215-229.