Transfer Learning with Affine Model Transformation

Shunya Minami (The Institute of Statistical Mathematics, mshunya@ism.ac.jp), Kenji Fukumizu (The Institute of Statistical Mathematics, fukumizu@ism.ac.jp), Yoshihiro Hayashi (The Institute of Statistical Mathematics, yhayashi@ism.ac.jp), Ryo Yoshida (The Institute of Statistical Mathematics, yoshidar@ism.ac.jp)

Abstract: Supervised transfer learning has received considerable attention due to its potential to boost the predictive power of machine learning in scenarios where data are scarce. Generally, a given set of source models and a dataset from a target domain are used to adapt the pre-trained models to a target domain by statistically learning domain shift and domain-specific factors. While such procedurally and intuitively plausible methods have achieved great success in a wide range of real-world applications, the lack of a theoretical basis hinders further methodological development. This paper presents a general class of transfer learning regression called affine model transfer, following the principle of expected-square loss minimization. It is shown that the affine model transfer broadly encompasses various existing methods, including the most common procedure based on neural feature extractors. Furthermore, the current paper clarifies theoretical properties of the affine model transfer such as generalization error and excess risk. Through several case studies, we demonstrate the practical benefits of modeling and estimating inter-domain commonality and domain-specific factors separately with the affine-type transfer models.

1 Introduction

Transfer learning (TL) is a methodology to improve the predictive performance of machine learning in a target domain with limited data by reusing knowledge gained from training in related source domains.
Its great potential has been demonstrated in various real-world problems, including computer vision [1, 2], natural language processing [3, 4], biology [5], and materials science [6, 7, 8]. Notably, most of the outstanding successes of TL to date have relied on the feature extraction ability of deep neural networks. For example, a conventional method reuses feature representations encoded in an intermediate layer of a pre-trained model as an input for the target task, or uses samples from the target domain to fine-tune the parameters of the pre-trained source model [9]. While such methods are operationally plausible and intuitive, they lack methodological principles and remain theoretically unexplored in terms of their learning capability for limited data. This study develops a principled methodology generally applicable to various kinds of TL.

In this study, we focus on supervised TL settings. In particular, we deal with settings where, given feature representations obtained from training in the source domain, we use samples from the target domain to model and estimate the domain shift to the target. This procedure is called hypothesis transfer learning (HTL); several methods have been proposed, such as using a linear transformation function [10, 11] and considering a general class of continuous transformation functions [12]. If the transformation function appropriately captures the functional relationship between the source and target domains, only the domain-specific factors need to be additionally learned, which can be done efficiently even with a limited sample size. In other words, the performance of HTL depends strongly on whether the transformation function appropriately represents the cross-domain shift. However, the general methodology for modeling and estimating such domain shifts has been less studied.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
This study derives a theoretically optimal class of supervised TL that minimizes the expected ℓ2 loss function of the HTL. The resulting function class takes the form of an affine coupling g1(fs) + g2(fs)·g3(x) of three functions g1, g2 and g3, where the shift from a given source feature fs to the target domain is represented by the functions g1 and g2, and the domain-specific factors are represented by g3(x) for any given input x. These functions can be estimated simultaneously using conventional supervised learning algorithms such as kernel methods or deep neural networks. Hereafter, we refer to this framework as the affine model transfer. As described later, we can formulate a wide variety of TL algorithms within the affine model transfer, including the widely used neural feature extractors, offset and scale HTLs [10, 11, 12], and Bayesian TL [13]. We also clarify theoretical properties of the affine model transfer, such as the generalization error and the excess risk. To summarize, the contributions of our study are as follows:

- The affine model transfer is proposed to adapt source features to the target domain by separately estimating cross-domain shift and domain-specific factors. The affine form is derived theoretically as an optimal class based on the squared loss for the target task.
- The affine model transfer encompasses several existing TL methods, including neural feature extraction. It can work with any type of source model, including non-machine-learning models such as physical models, as well as multiple source models.
- For each of the three functions g1, g2, and g3, we provide an efficient and stable estimation algorithm when they are modeled using the kernel method.
- Two theoretical properties of the affine transfer model are shown: the generalization bound and the excess risk bound.
With several applications, we compare the affine model transfer with other TL algorithms, discuss its strengths, and demonstrate the advantage of being able to estimate cross-domain shifts and domain-specific factors separately.

2 Transfer Learning via Transformation Function

2.1 Affine Model Transfer

This study considers regression problems with the squared loss. We assume that the output of the target domain y ∈ Y ⊂ R follows y = ft(x) + ε, where ft : X → R is the true model on the target domain, and the observation noise ε has mean zero and variance σ². We are given n samples {(xi, yi)}_{i=1}^n ∈ (X × Y)^n from the target domain and the feature representation fs(x) ∈ Fs from one or more source domains. Typically, fs is given as a vector, comprising the outputs of the source models, observed data in the source domains, or learned features in a pre-trained model, but it can also be a non-vector feature such as a tensor, graph, or text. Hereafter, fs is referred to as the source features.

In this paper, we focus on transfer learning with variable transformations as proposed in [12]. For an illustration of the concept, consider the case where there exists a relationship between the true functions f*s(x) ∈ R and f*t(x) ∈ R such that f*t(x) = f*s(x) + ⟨x, θ*⟩ for x ∈ R^d with an unknown parameter θ* ∈ R^d. If f*s is non-smooth, a large number of training samples is needed to learn f*t directly. However, since the difference f*t − f*s is a linear function of the unknown θ*, it can be learned with fewer samples if prior information about f*s is available. For example, a target model can be obtained by adding fs to a model g trained on the intermediate variable z = y − fs(x). The following is a slight generalization of the TL procedure provided in [12]:

1. With the source features, perform a variable transformation of the observed outputs as zi = ϕ(yi, fs(xi)), using the data transformation function ϕ : Y × Fs → R.

2.
Train an intermediate model ĝ(x) using the transformed sample set {(xi, zi)}_{i=1}^n to predict the transformed output z for any given x.

3. Obtain a target model f̂t(x) = ψ(ĝ(x), fs(x)) using the model transformation function ψ : R × Fs → Y, which combines ĝ and fs to define a predictor.

In particular, [12] considers the case where the model transformation function is equal to the inverse of the data transformation function. We consider a more general case that eliminates this constraint. The objective of step 1 is to identify a transformation ϕ that maps the output variable y to an intermediate variable z suitable for learning. In step 2, a predictive model for z is constructed. Since data are limited in many TL setups, a simple model, such as a linear model, should be used for g. Step 3 transforms the intermediate model g into a predictive model ft for the original output y.

This class of TL includes several approaches proposed in previous studies. For example, [10, 11] proposed a learning algorithm consisting of a linear data transformation and a linear model transformation: ϕ = y − ⟨θ, fs⟩ and ψ = g(x) + ⟨θ, fs⟩ with pre-defined weights θ. In this case, factors unexplained by the linear combination of source features are learned by g, and the target output is predicted additively from the common factor ⟨θ, fs⟩ and the additionally learned g. In [13], it is shown that a type of Bayesian TL is equivalent to using the following transformation functions: for Fs ⊂ R, ϕ = (y − τ fs)/(1 − τ) and ψ = ρ g(x) + (1 − ρ) fs with two hyperparameters τ < 1 and 0 ≤ ρ ≤ 1. This includes TL using density-ratio estimation [14] and neural-network-based fine-tuning as special cases when the two hyperparameters lie in specific regions. The performance of this class of TL strongly depends on the design of the two transformation functions ϕ and ψ. In the sequel, we theoretically derive the optimal form of the transformation functions under the squared loss scenario.
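As an illustration, the three-step HTL recipe with the offset-type transformations of [10, 11] can be sketched in a few lines. All variable names and the toy data below are our own assumptions for the sketch, not the authors' code; a plain ridge regressor stands in for the intermediate model g.

```python
import numpy as np

# Toy sketch of the three-step HTL procedure with the offset transformation
# phi(y, fs) = y - <theta, fs> and its inverse psi(g, fs) = g + <theta, fs>.
rng = np.random.default_rng(0)
n, d, p = 30, 5, 3
X = rng.normal(size=(n, d))              # target inputs x_i
Fs = rng.normal(size=(n, p))             # source features fs(x_i)
theta = np.array([1.0, -0.5, 0.3])       # pre-defined combination weights
y = Fs @ theta + 0.1 * X[:, 0] + 0.01 * rng.normal(size=n)

# Step 1: data transformation z_i = phi(y_i, fs(x_i))
z = y - Fs @ theta

# Step 2: fit a simple intermediate model g(x) on (x_i, z_i) (ridge)
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ z)

# Step 3: model transformation f_t(x) = psi(g(x), fs) = g(x) + <theta, fs>
def f_t(x_new, fs_new):
    return x_new @ w + fs_new @ theta

pred = f_t(X, Fs)
rmse = np.sqrt(np.mean((pred - y) ** 2))
print(rmse)   # small training RMSE: the residual z is easy to learn
```

Because the difference y − ⟨θ, fs⟩ is a simple function of x, the intermediate model learns it from few samples, which is the point of the procedure.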
For simplicity, we denote the transformation functions as ϕfs(·) = ϕ(·, fs) on Y and ψfs(·) = ψ(·, fs) on R. To derive the optimal class of ϕ and ψ, note first that the TL procedure described above can be formulated, in population, as solving two successive least squares problems:

(i) g*ϕ := argmin_g E_pt[ϕfs(Y) − g(X)]²,   (ii) min_ψ E_pt[Y − ψfs(g*ϕ(X))]².

Since the regression function that minimizes the mean squared error is the conditional mean, the first problem is solved by g*ϕ(x) = E[ϕfs(Y) | X = x], which depends on ϕ. We can thus consider the optimal transformation functions ϕ and ψ through the following minimization:

min_{ψ,ϕ} E_pt[Y − ψfs(g*ϕ(X))]².  (1)

It is easy to see that Eq. (1) is equivalent to the following consistency condition: ψfs(g*ϕ(x)) = E_pt[Y | X = x]. From this observation, we make three assumptions to derive the optimal form of ψ and ϕ:

Assumption 2.1 (Differentiability). The data transformation function ϕ is differentiable with respect to its first argument.

Assumption 2.2 (Invertibility). The model transformation function ψ is invertible with respect to its first argument, i.e., its inverse ψfs⁻¹ exists.

Assumption 2.3 (Consistency). For any distribution pt(x, y) on the target domain and for all x ∈ X, ψfs(g*(x)) = E_pt[Y | X = x], where g*(x) = E_pt[ϕfs(Y) | X = x].

Assumption 2.2 is commonly imposed in most existing HTL settings, such as [10] and [12]. It assumes a one-to-one correspondence between the predicted value f̂t(x) and the output of the intermediate model ĝ(x). If this assumption does not hold, then multiple values of ĝ correspond to the same predicted value f̂t, which is unnatural. Note that Assumption 2.3 corresponds to the unbiasedness condition of [12]. We now derive the properties that the optimal transformation functions must satisfy.

Theorem 2.4. Under Assumptions 2.1–2.3, the transformation functions ϕ and ψ satisfy the following two properties:
(i) ψfs⁻¹ = ϕfs.
(ii) ψfs(g) = g1(fs) + g2(fs)·g, where g1 and g2 are some functions.
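As a quick sanity check (the full proof is in the Supplementary Material), one can verify directly that the affine pair in Theorem 2.4 satisfies the consistency condition of Assumption 2.3, provided g2(fs) ≠ 0:

```latex
% Take \psi_{f_s}(g) = g_1(f_s) + g_2(f_s)\, g with g_2(f_s) \neq 0, so that
% \phi_{f_s}(y) = \psi_{f_s}^{-1}(y) = \frac{y - g_1(f_s)}{g_2(f_s)}. Then
\begin{align*}
g^*(x) &= \mathbb{E}_{p_t}\!\left[\phi_{f_s}(Y) \mid X = x\right]
        = \frac{\mathbb{E}_{p_t}[Y \mid X = x] - g_1(f_s)}{g_2(f_s)}, \\
\psi_{f_s}(g^*(x)) &= g_1(f_s)
        + g_2(f_s)\,\frac{\mathbb{E}_{p_t}[Y \mid X = x] - g_1(f_s)}{g_2(f_s)}
        = \mathbb{E}_{p_t}[Y \mid X = x].
\end{align*}
```

That is, the affine model transformation composed with its inverse data transformation reproduces the conditional mean exactly, for any target distribution.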
The proof is given in Section D.1 of the Supplementary Material. Despite not initially assuming that the two transformation functions are inverses of each other, Theorem 2.4 implies that they must be. Furthermore, the mean squared error is minimized when the data and model transformation functions are given by an affine transformation and its inverse, respectively. In summary, under expected squared loss minimization with the HTL procedure, the optimal model class for HTL is expressed as

H = {x ↦ g1(fs) + g2(fs)g3(x) | gj ∈ Gj, j = 1, 2, 3},

where G1, G2 and G3 are arbitrary function classes. Here, each of g1 and g2 is modeled as a function of fs that represents factors common to the source and target domains, while g3 is modeled as a function of x in order to capture the domain-specific factors unexplainable by the source features. We have derived the optimal form of the transformation functions when the squared loss is employed. Even for general convex loss functions, (i) of Theorem 2.4 still holds. However, (ii) of Theorem 2.4 does not hold in general, because the optimal transformation function depends on the loss function. Extensions to other losses are briefly discussed in Section A.1; the establishment of a complete theory is left for future work. Here, the affine transformation was found to be optimal in terms of minimizing the mean squared error. The same optimal form can also be derived by minimizing an upper bound of the estimation error in the HTL procedure, as discussed in Section A.2. One of the key principles for the design of g1, g2, and g3 is interpretability. In our model, g1 and g2 primarily facilitate knowledge transfer, while the estimated g3 is used to gain insight into domain-specific factors. For instance, in order to infer cross-domain differences, we could design g1 and g2 using conventional neural feature extraction, while a simple, highly interpretable model such as a linear model could be used for g3.
Thus, by observing the estimated regression coefficients in g3, one can statistically infer which features of x are related to the inter-domain differences. This advantage of the proposed method is demonstrated in Section 5.2 and Section B.3.

2.2 Relation to Existing Methods

The affine model transfer encompasses several existing TL procedures. For example, by setting g1(fs) = 0 and g2(fs) = 1, the prediction model is estimated without using the source features, which corresponds to ordinary direct learning, i.e., a learning scheme without transfer. Furthermore, various kinds of HTLs can be formulated by imposing constraints on g1 and g2. In prior work, [10] employed a two-step procedure in which the source features are combined with pre-defined weights, and an auxiliary model is then additionally learned for the residuals unexplainable by the source features. The affine model transfer represents this HTL as the special case g2 = 1. [12] uses the transformed output zi = yi/fs with the output value fs ∈ R of a source model; this cross-domain shift is then regressed onto x using a target dataset. This HTL corresponds to g1 = 0 and g2 = fs. When a pre-trained source model is provided as a neural network, TL is usually performed with the intermediate layer as the input to the model in the target domain. This is called a feature extractor or frozen featurizer, and it has been experimentally and theoretically shown to have strong transfer capability, serving as the de facto standard for TL [9, 15]. The affine model transfer encompasses neural feature extraction as the special subclass obtained by setting g2(fs)g3(x) = 0. A performance comparison of the affine model transfer with neural feature extraction is presented in Section 5 and Section B.2.
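The special cases above can be made concrete with a toy snippet. The helper `affine_predict` and all numbers below are illustrative assumptions, not part of the paper's code:

```python
# Toy sketch of how specific choices of (g1, g2, g3) in the affine form
# f(x) = g1(fs) + g2(fs) * g3(x) recover the methods discussed above.

def affine_predict(x, fs, g1, g2, g3):
    return g1(fs) + g2(fs) * g3(x)

g3 = lambda x: 2.0 * x          # toy domain-specific model
theta = 0.5                     # toy pre-defined combination weight
x, fs = 1.5, 4.0                # scalar input and source feature

# Direct learning (no transfer): g1 = 0, g2 = 1  =>  f(x) = g3(x)
direct = affine_predict(x, fs, lambda f: 0.0, lambda f: 1.0, g3)

# Offset HTL [10]: g2 = 1  =>  f(x) = theta*fs + g3(x)
offset = affine_predict(x, fs, lambda f: theta * f, lambda f: 1.0, g3)

# Scale HTL [12]: g1 = 0, g2 = fs  =>  f(x) = fs * g3(x)
scale = affine_predict(x, fs, lambda f: 0.0, lambda f: f, g3)

# Feature extraction: g2 * g3 = 0  =>  f(x) = g1(fs) only
feat = affine_predict(x, fs, lambda f: theta * f, lambda f: 0.0, g3)

print(direct, offset, scale, feat)  # 3.0 5.0 12.0 2.0
```

Each named method is a restriction of the same affine form, which is what allows a single estimation procedure to cover all of them.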
The relationships between these existing methods and the affine model transfer are illustrated in Figure 1 and Figure S.1. The affine model transfer can also be interpreted as generalizing feature extraction by adding the product term g2(fs)g3(x). This additional term allows the transferred model to include unknown factors that are unexplainable by the source features alone. Furthermore, it helps to avoid negative transfer, a phenomenon in which prior learning experiences interfere with training on a new task. Usual TL based only on fs attempts to explain and predict the data-generating process in the target domain using the source features alone. In the presence of domain-specific factors, however, negative transfer can occur owing to this lack of descriptive power. The additional term compensates for this shortcoming. A comparison of the behavior in the case of non-related source features is described in Section 5.1.

Figure 1: Architectures of (a) feature extraction, (b) HTL in [10], and (c) affine model transfer.

Algorithm 1: Block relaxation algorithm [19].
  Initialize: a_0, b_0 = 0, c_0 = 0
  repeat
    a_{t+1} = argmin_a F(a, b_t, c_t)
    b_{t+1} = argmin_b F(a_{t+1}, b, c_t)
    c_{t+1} = argmin_c F(a_{t+1}, b_{t+1}, c)
  until convergence

The affine model transfer can be naturally expressed as an architecture of neural networks. This architecture, called affine coupling layers, is widely used in invertible neural networks for flow-based generative modeling [16, 17]. Neural networks based on affine coupling layers have been proven to have universal approximation ability [18]. This implies that the affine transfer model has the potential to represent a wide range of function classes, despite its simple architecture based on the affine coupling of three functions.

3 Modeling and Estimation

In this section, we focus on using kernel methods for the affine transfer model and provide the estimation algorithm.
Let H1, H2 and H3 be reproducing kernel Hilbert spaces (RKHSs) with positive-definite kernels k1, k2 and k3, which define the feature mappings Φ1 : Fs → H1, Φ2 : Fs → H2 and Φ3 : X → H3, respectively. Denote Φ1,i = Φ1(fs(xi)), Φ2,i = Φ2(fs(xi)), and Φ3,i = Φ3(xi). For the proposed model class, the ℓ2-regularized empirical risk with the squared loss is given as

F(α, β, γ) = Σ_{i=1}^n (yi − ⟨α, Φ1,i⟩ − ⟨β, Φ2,i⟩⟨γ, Φ3,i⟩)² + λ1‖α‖²_H1 + λ2‖β‖²_H2 + λ3‖γ‖²_H3,  (2)

where λ1, λ2, λ3 ≥ 0 are hyperparameters for the regularization. According to the representer theorem, the minimizer of Eq. (2) with respect to the parameters α ∈ H1, β ∈ H2, and γ ∈ H3 reduces to α = Σ_{i=1}^n ai Φ1,i, β = Σ_{i=1}^n bi Φ2,i, γ = Σ_{i=1}^n ci Φ3,i, with n-dimensional unknown parameter vectors a, b, c ∈ R^n. Substituting this expression into Eq. (2), we obtain the objective function

‖y − K1 a − (K2 b) ∘ (K3 c)‖²₂ + λ1 aᵀK1 a + λ2 bᵀK2 b + λ3 cᵀK3 c =: F(a, b, c).  (3)

Here, the symbol ∘ denotes the Hadamard product, and K_I is the Gram matrix associated with the kernel k_I for I ∈ {1, 2, 3}. k_I^(i) = [k_I(xi, x1) ⋯ k_I(xi, xn)]ᵀ denotes the i-th column of the Gram matrix, and the n × n matrix M^(i) is given by the tensor product M^(i) = k_2^(i) ⊗ k_3^(i) of k_2^(i) and k_3^(i). Because the model is linear with respect to the parameter a and bilinear in b and c, the optimization of Eq. (3) can be solved using well-established techniques for low-rank tensor regression. In this study, we use the block relaxation algorithm [19] described in Algorithm 1. It updates a, b, and c by repeatedly fixing two of the three parameters and minimizing the objective function with respect to the remaining one. With two parameters fixed, the resulting subproblem can be solved analytically because the objective function is quadratic in the remaining parameter. Algorithm 1 can be regarded as repeating the HTL procedure introduced in Section 2.1: it alternately estimates the parameters (a, b) of the transformation function and the parameters c of the model for the given transformed data {(xi, zi)}_{i=1}^n.
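A minimal sketch of Algorithm 1 for the objective in Eq. (3) might look as follows. The Gaussian kernels, synthetic data, and initialization below are our own illustrative choices, not the authors' implementation; each block update solves its quadratic subproblem in closed form, so the objective is non-increasing across iterations.

```python
import numpy as np

# Sketch of block relaxation for
# F(a,b,c) = ||y - K1 a - (K2 b) o (K3 c)||^2
#            + l1 a'K1 a + l2 b'K2 b + l3 c'K3 c,   "o" = Hadamard product.

def gram(X, ls):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * ls ** 2))

rng = np.random.default_rng(1)
n = 40
X = rng.normal(size=(n, 2))                  # target inputs x_i
Fs = rng.normal(size=(n, 2))                 # source features fs(x_i)
y = np.sin(Fs[:, 0]) + (1 + 0.5 * Fs[:, 1]) * X[:, 0] + 0.01 * rng.normal(size=n)

K1, K2, K3 = gram(Fs, 1.5), gram(Fs, 1.5), gram(X, 1.5)
l1 = l2 = l3 = 1e-3
I = np.eye(n)

def F(a, b, c):
    r = y - K1 @ a - (K2 @ b) * (K3 @ c)
    return r @ r + l1 * a @ K1 @ a + l2 * b @ K2 @ b + l3 * c @ K3 @ c

def block_step(K, D, r, lam):
    # Closed-form minimizer of ||r - D K u||^2 + lam * u'K u over u
    return np.linalg.solve(D ** 2 @ K + lam * I, D @ r)

a = np.zeros(n)
b = 0.1 * rng.normal(size=n)                 # nonzero init so the product
c = 0.1 * rng.normal(size=n)                 # term can move off zero
F0 = F(a, b, c)
for _ in range(30):                          # Algorithm 1: cycle a -> b -> c
    a = block_step(K1, I, y - (K2 @ b) * (K3 @ c), l1)
    b = block_step(K2, np.diag(K3 @ c), y - K1 @ a, l2)
    c = block_step(K3, np.diag(K2 @ b), y - K1 @ a, l3)

rmse = np.sqrt(np.mean((K1 @ a + (K2 @ b) * (K3 @ c) - y) ** 2))
print(F0, F(a, b, c), rmse)
```

The b-update, for example, minimizes ‖r − D K2 b‖² + λ2 bᵀK2 b with D = diag(K3 c), whose stationarity condition K2(D²K2 b + λ2 b − D r) = 0 is solved by b = (D²K2 + λ2 I)⁻¹ D r; the a- and c-updates are analogous.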
The function F in Algorithm 1 is not jointly convex in general. However, when employing methods such as kernel methods or generalized linear models, and fixing two of the parameters, F is convex with respect to the remaining one. According to [19], when each sub-minimization problem is convex, Algorithm 1 is guaranteed to converge to a stationary point. Furthermore, [19] showed that consistency and asymptotic normality hold for the alternating minimization algorithm.

4 Theoretical Results

In this section, we present two theoretical properties: the generalization bound and the excess risk bound. Let (Z, P) be an arbitrary probability space, and let {zi}_{i=1}^n be independent random variables with distribution P. For a function f : Z → R, denote the expectation of f under P and its empirical counterpart by Pf = E_P f(z) and P_n f = (1/n) Σ_{i=1}^n f(zi), respectively. We use a non-negative loss ℓ(y, y′) that is bounded from above by L > 0 and such that, for any fixed y′ ∈ Y, y ↦ ℓ(y, y′) is µℓ-Lipschitz for some µℓ > 0. Recall that the function class proposed in this work is H = {x ↦ g1(fs) + g2(fs)g3(x) | gj ∈ Gj, j = 1, 2, 3}. In particular, the discussion in this section assumes that g1, g2, and g3 are represented by linear functions on the RKHSs.

4.1 Generalization Bound

The optimization problem is expressed as follows:

min_{α,β,γ} P_n ℓ(y, ⟨α, Φ1⟩_H1 + ⟨β, Φ2⟩_H2 ⟨γ, Φ3⟩_H3) + λα‖α‖²_H1 + λβ‖β‖²_H2 + λγ‖γ‖²_H3,  (4)

where Φ1 = Φ1(fs(x)), Φ2 = Φ2(fs(x)) and Φ3 = Φ3(x) denote the feature maps. Without loss of generality, it is assumed that ‖Φi‖²_Hi ≤ 1 (i = 1, 2, 3) and λα, λβ, λγ > 0. Hereafter, we omit the subscripts Hi in the norms when there is no confusion. Let (α̂, β̂, γ̂) be a solution of Eq. (4), and denote the corresponding function in H by ĥ. For any α, we have

λα‖α̂‖² ≤ P_n ℓ(y, ⟨α̂, Φ1⟩ + ⟨β̂, Φ2⟩⟨γ̂, Φ3⟩) + λα‖α̂‖² + λβ‖β̂‖² + λγ‖γ̂‖² ≤ P_n ℓ(y, ⟨α, Φ1⟩) + λα‖α‖²,

where we use the fact that ℓ(·, ·) and ‖·‖ are non-negative, and that (α̂, β̂, γ̂) is the minimizer of Eq. (4).
Denoting R̂s = inf_α {P_n ℓ(y, ⟨α, Φ1⟩) + λα‖α‖²}, we obtain ‖α̂‖² ≤ λα⁻¹ R̂s. Because the same inequality holds for λβ‖β̂‖², λγ‖γ̂‖² and P_n ℓ(y, ĥ), we have ‖β̂‖² ≤ λβ⁻¹ R̂s, ‖γ̂‖² ≤ λγ⁻¹ R̂s and P_n ℓ(y, ĥ) ≤ R̂s. Moreover, we have Pℓ(y, ĥ) = E P_n ℓ(y, ĥ) ≤ E R̂s. Therefore, it is sufficient to consider the following hypothesis class H and loss class L:

H = {x ↦ ⟨α, Φ1⟩ + ⟨β, Φ2⟩⟨γ, Φ3⟩ : ‖α‖² ≤ λα⁻¹ R̂s, ‖β‖² ≤ λβ⁻¹ R̂s, ‖γ‖² ≤ λγ⁻¹ R̂s, Pℓ(y, h) ≤ E R̂s},
L = {(x, y) ↦ ℓ(y, h(x)) | h ∈ H}.

Here, we show the generalization bound for the proposed model class. The following theorem is based on [11], showing that the difference between the generalization error and the empirical error can be bounded in terms of the degree of relatedness between the domains.

Theorem 4.1. There exists a constant C depending only on λα, λβ, λγ and L such that, for any η > 0 and h ∈ H, with probability at least 1 − e^{−η},

Pℓ(y, h) − P_n ℓ(y, h) = O( Rs √((µℓ² C² + η)/n) + (√Rs (C + η) + C² + η)/n ),

where Rs = inf_α {Pℓ(y, ⟨α, Φ1⟩) + λα‖α‖²}. Because Φ1 is the feature map from the source feature space Fs into the RKHS H1, Rs corresponds to the true risk of training in the target domain using only the source features fs. If this is sufficiently small, e.g., Rs = O(n^{−1/2}), the convergence rate indicated by Theorem 4.1 becomes n^{−1}, an improvement over the naive convergence rate n^{−1/2}. This means that if the source task yields feature representations strongly related to the target domain, training in the target domain is accelerated. Theorem 4.1 measures this cross-domain relation through the metric Rs. Theorem 4.1 is based on Theorem 11 of [11], in which the function class g1 + g3 is considered. Our work differs in two points: the source features are modeled not only additively but also multiplicatively, i.e., we consider the function class g1 + g2·g3; and we also estimate the parameters of the source feature combination, i.e., the parameters of the functions g1 and g2. In particular, the latter affects the resulting rate.
With the source combination parameters fixed, the resulting rate improves only up to n^{−3/4}. The details are discussed in Section D.2.

4.2 Excess Risk Bound

Here, we analyze the excess risk, i.e., the difference between the risk of the estimated function and the smallest possible risk within the function class. Recall that we consider the functions g1, g2 and g3 to be elements of the RKHSs H1, H2 and H3 with kernels k1, k2 and k3, respectively. Define the kernels k^(1) = k1, k^(2) = k2·k3 and k = k^(1) + k^(2), and let H^(1), H^(2) and H be the RKHSs with kernels k^(1), k^(2) and k, respectively. For m = 1, 2, consider the normalized Gram matrix K^(m) = (1/n)(k^(m)(xi, xj))_{i,j=1,…,n} and its eigenvalues (λ̂^(m)_i)_{i=1}^n, arranged in nonincreasing order. We make the following additional assumptions:

Assumption 4.2. There exist h* ∈ H and h^(m)* ∈ H^(m) (m = 1, 2) such that P(y − h*(x))² = inf_{h∈H} P(y − h(x))² and P(y − h^(m)*(x))² = inf_{h∈H^(m)} P(y − h(x))².

Assumption 4.3. For m = 1, 2, there exist a_m > 0 and s_m ∈ (0, 1) such that λ̂^(m)_j ≤ a_m j^{−1/s_m}.

Assumption 4.2 is used in [20] and is not overly restrictive, as it holds for many regularization algorithms and convex, uniformly bounded function classes. In the analysis of kernel methods, Assumption 4.3 is standard [21], and it is known to be equivalent to the classical covering or entropy number assumption [22]. The inverse decay rate s_m measures the complexity of the RKHS, with larger values corresponding to more complex function spaces.

Theorem 4.4. Let ĥ be any element of H satisfying P_n(y − ĥ(x))² = inf_{h∈H} P_n(y − h(x))². Under Assumptions 4.2 and 4.3, for any η > 0, with probability at least 1 − 5e^{−η},

P(y − ĥ(x))² − P(y − h*(x))² = O(n^{−1/(1+max{s1, s2})}).

Theorem 4.4 suggests that the convergence rate of the excess risk depends on the decay rates of the eigenvalues of the two Gram matrices K^(1) and K^(2).
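The dependence on the eigenvalue decay of the two Gram matrices can be probed numerically. The following sketch compares the spectra of K2, K3, and their Hadamard product; the sample size, Gaussian kernels, and independence of the two inputs are illustrative assumptions of ours, not the paper's experiment in Section B.1.

```python
import numpy as np

# Compare eigenvalue decay of K2, K3 and their Hadamard product K2 o K3
# when the source features and inputs carry independent information.

def gram(X, ls=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * ls ** 2)) / len(X)      # 1/n normalization

rng = np.random.default_rng(0)
n = 100
F2 = rng.normal(size=(n, 1))          # stand-in for source features fs(x)
X3 = rng.normal(size=(n, 1))          # independent original inputs x

K2, K3 = gram(F2), gram(X3)
KH = n * (K2 * K3)                     # Hadamard product, renormalized by n

spectra = {name: np.sort(np.linalg.eigvalsh(K))[::-1]
           for name, K in (("K2", K2), ("K3", K3), ("K2oK3", KH))}

# Count eigenvalues above a threshold as a crude "effective dimension";
# a larger count for the product kernel suggests a richer product space.
counts = {name: int((ev > 1e-8).sum()) for name, ev in spectra.items()}
print(counts)
```

By the Schur product theorem, K2 ∘ K3 is again positive semidefinite, so the product term defines a valid kernel; the printed counts give a rough feel for how its spectrum relates to those of the factors.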
The inverse decay rate s1 of the eigenvalues of K^(1) = (1/n)(k1(fs(xi), fs(xj)))_{i,j=1,…,n} represents the learning efficiency when using only the source features, while s2 is the inverse decay rate of the eigenvalues of the Hadamard product of K2 = (1/n)(k2(fs(xi), fs(xj)))_{i,j=1,…,n} and K3 = (1/n)(k3(xi, xj))_{i,j=1,…,n}, which captures the effect of combining the source features with the original input. While a rigorous analysis of the relationship between the spectra of the two Gram matrices K2, K3 and their Hadamard product K2 ∘ K3 seems difficult, intuitively, the smaller the overlap between the space spanned by the source features and that spanned by the original input, the smaller the overlap between H2 and H3. In other words, when the source features and the original input carry different information, the tensor product H2 ⊗ H3 becomes more complex, and s2 is expected to be larger. In Section B.1, we experimentally confirm this speculation.

5 Experimental Results

We demonstrate the potential of the affine model transfer through two case studies: (i) the prediction of feed-forward torque at seven joints of a robot arm [23], and (ii) the prediction of review scores and decisions of scientific papers [24]. The experimental details are presented in Section C. Additionally, two case studies in materials science are presented in Section B. The Python code is available at https://github.com/mshunya/AffineTL.

5.1 Kinematics of the Robot Arm

We experimentally investigated the learning performance of the affine model transfer compared to several existing methods. The objective of the task is to predict the feed-forward torques, required to follow a desired trajectory, at seven different joints of the SARCOS robot arm [23]. Twenty-one features representing the joint positions, velocities, and accelerations were used as the input x. The target task is to predict the torque value at one joint.
The representations encoded in the intermediate layer of the source neural network for predicting the other six joints were used as the source features fs ∈ R^16. The experiments were conducted with seven different tasks (denoted as Torque 1–7) corresponding to the seven joints. For each target task, a training set of size n ∈ {5, 10, 15, 20, 30, 40, 50} was randomly constructed 20 times, and the performances were evaluated using the test data.

Table 1: Performance on predicting the torque values at the first and seventh joints of the SARCOS robot arm. The mean and standard deviation of the RMSE are reported for varying numbers of training samples (n < d for n ∈ {5, 10, 15}, n ≈ d for n = 20, and n > d for n ∈ {30, 40, 50}, where d = 21 is the dimension of the original input x). For each task and n, the smallest mean RMSE is indicated in bold. An asterisk indicates a case where the RMSEs of 20 independent experiments were significantly improved over Direct at the 1% significance level, according to Welch's t-test.

Torque 1:

| Model | n = 5 | n = 10 | n = 15 | n = 20 | n = 30 | n = 40 | n = 50 |
|---|---|---|---|---|---|---|---|
| Direct | 21.3 ± 2.04 | 18.9 ± 2.11 | **17.4 ± 1.79** | 15.8 ± 1.70 | 13.7 ± 1.26 | 12.2 ± 1.61 | 10.8 ± 1.23 |
| Only source | 24.0 ± 6.37 | 22.3 ± 3.10 | 21.0 ± 2.49 | 19.7 ± 1.34 | 18.5 ± 1.92 | 17.6 ± 1.59 | 17.3 ± 1.31 |
| Augmented | 21.8 ± 2.88 | 19.2 ± 1.37 | 17.8 ± 2.30 | **15.7 ± 1.53** | **13.3 ± 1.19** | **11.9 ± 1.37** | **10.7 ± 0.954** |
| HTL-offset | 23.7 ± 6.50 | 21.2 ± 3.85 | 19.8 ± 3.23 | 17.8 ± 2.35 | 16.2 ± 3.31 | 15.0 ± 3.16 | 15.1 ± 2.76 |
| HTL-scale | 23.3 ± 4.47 | 22.1 ± 5.31 | 20.4 ± 3.84 | 18.5 ± 2.72 | 17.6 ± 2.41 | 16.9 ± 2.10 | 16.7 ± 1.74 |
| Affine TL-full | **21.2 ± 2.23** | **18.8 ± 1.31** | 18.6 ± 2.83 | 15.9 ± 1.65 | 13.7 ± 1.53 | 12.3 ± 1.45 | 11.1 ± 1.12 |
| Affine TL-const | **21.2 ± 2.21** | **18.8 ± 1.44** | 17.7 ± 2.44 | 15.9 ± 1.58 | 13.4 ± 1.15 | 12.2 ± 1.54 | 10.9 ± 1.02 |
| Fine-tune | 25.0 ± 7.11 | 20.5 ± 3.33 | 18.6 ± 2.10 | 17.6 ± 2.55 | 14.1 ± 1.39 | 12.6 ± 1.13 | 11.1 ± 1.03 |
| MAML | 29.8 ± 12.3 | 22.5 ± 3.21 | 20.8 ± 2.12 | 20.3 ± 3.14 | 16.7 ± 3.00 | 14.4 ± 1.85 | 13.4 ± 1.19 |
| L2-SP | 24.9 ± 7.09 | 20.5 ± 3.30 | 18.8 ± 2.04 | 18.0 ± 2.45 | 14.5 ± 1.36 | 13.0 ± 1.13 | 11.6 ± 0.983 |
| PAC-Net | 25.2 ± 8.68 | 22.7 ± 5.60 | 20.7 ± 2.65 | 20.1 ± 2.16 | 18.5 ± 2.77 | 17.6 ± 1.85 | 17.1 ± 1.38 |

Torque 7:

| Model | n = 5 | n = 10 | n = 15 | n = 20 | n = 30 | n = 40 | n = 50 |
|---|---|---|---|---|---|---|---|
| Direct | 2.66 ± 0.307 | 2.13 ± 0.420 | 1.85 ± 0.418 | 1.54 ± 0.353 | 1.32 ± 0.200 | 1.18 ± 0.138 | 1.05 ± 0.111 |
| Only source | 2.31 ± 0.618 | \*1.73 ± 0.560 | \*1.49 ± 0.513 | \*1.22 ± 0.269 | \*1.09 ± 0.232 | \*0.969 ± 0.144 | \*0.927 ± 0.170 |
| Augmented | 2.47 ± 0.406 | 1.90 ± 0.515 | 1.67 ± 0.552 | \*1.31 ± 0.214 | 1.16 ± 0.225 | \*0.984 ± 0.149 | \*0.897 ± 0.138 |
| HTL-offset | 2.29 ± 0.621 | \*1.69 ± 0.507 | \*1.49 ± 0.513 | \*1.22 ± 0.269 | \*1.09 ± 0.233 | \*0.969 ± 0.144 | \*0.925 ± 0.171 |
| HTL-scale | 2.32 ± 0.599 | \*1.71 ± 0.516 | 1.51 ± 0.513 | \*1.24 ± 0.271 | \*1.12 ± 0.234 | \*0.999 ± 0.175 | 0.948 ± 0.172 |
| Affine TL-full | **\*2.23 ± 0.554** | \*1.71 ± 0.501 | \*1.45 ± 0.458 | \*1.21 ± 0.256 | \*1.06 ± 0.219 | \*0.974 ± 0.164 | \*0.870 ± 0.121 |
| Affine TL-const | \*2.30 ± 0.565 | \*1.73 ± 0.420 | \*1.48 ± 0.527 | \*1.20 ± 0.243 | \*1.04 ± 0.217 | \*0.963 ± 0.161 | \*0.884 ± 0.136 |
| Fine-tune | \*2.33 ± 0.511 | \*1.62 ± 0.347 | **\*1.35 ± 0.340** | **\*1.12 ± 0.165** | **\*0.959 ± 0.12** | **\*0.848 ± 0.0824** | **\*0.790 ± 0.0547** |
| MAML | 2.54 ± 1.29 | 1.90 ± 0.507 | 1.67 ± 0.313 | 1.63 ± 0.282 | 1.28 ± 0.272 | 1.20 ± 0.199 | 1.06 ± 0.111 |
| L2-SP | \*2.33 ± 0.509 | \*1.65 ± 0.378 | **\*1.35 ± 0.340** | **\*1.12 ± 0.165** | \*0.968 ± 0.114 | \*0.858 ± 0.0818 | \*0.802 ± 0.0535 |
| PAC-Net | 2.24 ± 0.706 | **\*1.61 ± 0.394** | \*1.43 ± 0.389 | \*1.24 ± 0.177 | \*1.18 ± 0.100 | 1.13 ± 0.0726 | 1.100 ± 0.0589 |

The following seven methods were compared, including two existing HTL procedures:

- Direct: Train a model using the target input x with no transfer.
- Only source: Train a model g(fs) using only the source features fs.
- Augmented: Perform a regression with the augmented input vector concatenating x and fs.
- HTL-offset [10]: Calculate the transformed output zi = yi − g_only(fs), where g_only(fs) is the model pre-trained using Only source, and train an additional model with input xi to predict zi.
- HTL-scale [12]: Calculate the transformed output zi = yi / g_only(fs), and train an additional model with input xi to predict zi.
- Affine TL-full: Train the model g1 + g2·g3.
- Affine TL-const: Train the model g1 + g3.

Kernel ridge regression with the Gaussian kernel exp(−‖x − x′‖²/(2ℓ²)) was used for each procedure. The scale parameter ℓ was fixed to the square root of the dimension of the input. The regularization parameter in the kernel ridge regression and λα, λβ, and λγ in the affine model transfer were selected through 5-fold cross-validation.
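As an illustration of the baselines above, HTL-scale with a Gaussian-kernel ridge learner (ℓ fixed to √d as described) can be sketched as follows. The synthetic data, the toy positive source output, and the regularization value are our own assumptions; in the actual experiments, hyperparameters were chosen by cross-validation.

```python
import numpy as np

# Sketch of the HTL-scale baseline: regress z = y / g_only(fs) on x with
# kernel ridge regression, Gaussian kernel exp(-||x-x'||^2 / (2 l^2)).

def gauss_kernel(A, B, ls):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * ls ** 2))

def krr_fit(X, y, lam):
    ls = np.sqrt(X.shape[1])                 # scale l fixed to sqrt(d)
    K = gauss_kernel(X, X, ls)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return lambda Xq: gauss_kernel(Xq, X, ls) @ alpha

rng = np.random.default_rng(0)
n, d = 50, 21
X = rng.normal(size=(n, d))
fs = np.abs(np.sin(X[:, :3]).sum(1)) + 0.5   # toy positive source output
y = fs * (1 + 0.2 * X[:, 0]) + 0.05 * rng.normal(size=n)

# HTL-scale: transform, fit the intermediate model, transform back
g = krr_fit(X, y / fs, lam=1e-2)
pred = fs * g(X)
rmse = np.sqrt(np.mean((pred - y) ** 2))
print(rmse)
```

Swapping the transformation to z = y − fs (and the inverse to addition) yields the HTL-offset baseline with the same learner.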
In addition to the seven feature-based methods, four weight-based TL methods were evaluated: fine-tuning, MAML [25], L2-SP [26], and PAC-Net [27]. Table 1 summarizes the prediction performance of these methods for varying numbers of training samples in two representative tasks: Torque 1 and Torque 7. The joint of Torque 1 is located closest to the root of the arm. Therefore, the learning task for Torque 1 is less related to those for the other joints, and transfer from Torques 2–6 to Torque 1 would not be expected to work. In fact, as shown in Table 1, no method showed a statistically significant improvement over Direct. In particular, Only source failed to acquire predictive ability, and HTL-offset and HTL-scale likewise showed poor prediction performance owing to the negative effect of the failed variable transformation. In contrast, the two affine transfer models showed almost the same predictive performance as Direct, which they include as a submodel, and successfully suppressed negative transfer. Because Torque 7 was measured at the joint closest to the end of the arm, its value depends strongly on those at the other six joints, and the procedures using the source features were more effective than in the other tasks. In particular, Affine TL achieved the best performance among the feature-based methods. This is consistent with the theoretical result that the transfer capability of the affine model transfer improves when the risk of learning using only the source features is sufficiently small. In Table S.3 in Section C.1, we present the results for all tasks. In most cases, Affine TL achieved the best performance among the feature-based methods. In several other cases, Direct produced the best results; in almost all cases, Only source and the two HTLs showed no advantage over Affine TL.
Comparing the weight-based and feature-based methods, we observed that the weight-based methods performed better with large sample sizes. Nevertheless, in scenarios with extremely small sample sizes (e.g., n = 5 or 10), Affine TL exhibited comparable or even superior performance. A strength of our method over weight-based TL, including fine-tuning, is that its performance does not degrade when the cross-domain relationship is weak. While fine-tuning outperformed our method on Torque 7, its performance degraded significantly as the source-target relationship became weaker, as seen in the Torque 1 case. In contrast, our method avoided negative transfer even in such cases. This characteristic is particularly beneficial because, in many cases, the degree of relatedness between the domains is not known in advance. Furthermore, weight-based methods can sometimes be unsuitable, especially when transferring knowledge from large models such as LLMs. In these scenarios, fine-tuning all parameters is infeasible, and feature-based TL is preferred. Our approach often outperforms other feature-based methods.

5.2 Evaluation of Scientific Documents

Through a case study in natural language processing, we compare the performance of the affine model transfer with that of ordinary feature extraction-based TL and show the advantage of being able to estimate domain shift and domain-specific factors separately. We used SciRepEval [24], a benchmark dataset of scientific documents. The dataset consists of abstracts, review scores, and decision statuses of papers submitted to various machine learning conferences. We focused on two primary tasks: a regression task to predict the average review score, and a binary classification task to determine the acceptance or rejection status of each paper. The original input x was represented by a two-gram bag-of-words vector of the abstract.
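A two-gram (bigram) bag-of-words representation of an abstract can be sketched as follows. This is a minimal, pure-Python illustration; the helper name `bigram_bow`, the toy abstracts, and the vocabulary construction are assumptions for the example, not the paper's preprocessing pipeline.

```python
from collections import Counter

def bigram_bow(text, vocab):
    # Represent a document by counts of word 2-grams over a fixed vocabulary.
    tokens = text.lower().split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    return [bigrams.get(b, 0) for b in vocab]

abstracts = [
    "we propose a new state of the art method",
    "recent advances in machine learning enable new tasks",
]
# Build the bigram vocabulary from the corpus (sorted for a stable ordering)
vocab = sorted({b for t in abstracts
                for b in zip(t.lower().split(), t.lower().split()[1:])})
X = [bigram_bow(t, vocab) for t in abstracts]
```

Each row of X is then a fixed-length count vector suitable as the target input x.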
For the source features f_s, we utilized text embeddings of the abstracts generated by pre-trained language models: BERT [28], SciBERT [29], T5 [30], and GPT-3 [31]. In the affine model transfer, we employed neural networks with two hidden layers to model g1 and g2, and a linear model for g3. For comparison, we also evaluated the performance of the ordinary feature extraction-based TL using a two-layer neural network with f_s as input. We used 8,166 training samples and evaluated the performance of the model on 2,043 test samples. Table 2 shows the root mean square error (RMSE) for the regression task and the accuracy for the classification task. In the regression task, the RMSEs of the affine model transfer were significantly improved over ordinary feature extraction for all four types of text embedding. We also observed improvements in accuracy for the classification task, even though the affine model transfer was derived on the basis of regression settings. While pre-trained language models have a remarkable ability to represent text quality and structure, their representations alone are not sufficient for prediction tasks on machine learning documents. The affine model transfer effectively bridged this gap by learning the additional target-specific factor via the target task, resulting in improved prediction performance in both regression and classification tasks. Table 3 provides a list of phrases that were estimated to have a positive or negative effect on the review scores. Because the network for g2 was restricted to output positive values, the influence of each phrase could be inferred from the estimated coefficients of the linear model g3. Specifically, phrases such as "tasks including" and "new state" were estimated to have positive influences on the predicted score.
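The predictor in this setup takes the form ŷ = g1(f_s) + g2(f_s) · g3(x), with g2 constrained to be positive. The following is a forward-pass sketch with untrained, randomly initialized parameters; for brevity the networks here have one hidden layer rather than two, positivity is enforced via softplus (one plausible choice, not necessarily the paper's), and all dimensions and names are illustrative.

```python
import numpy as np

def softplus(a):
    # Smooth positive map: log(1 + exp(a)) > 0 for all a
    return np.log1p(np.exp(a))

def mlp(h, W1, b1, W2, b2):
    # One-hidden-layer perceptron with tanh activation
    return np.tanh(h @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_x, d_s, d_h = 6, 4, 8
x  = rng.normal(size=(5, d_x))    # target inputs
fs = rng.normal(size=(5, d_s))    # source features (e.g., text embeddings)

# Randomly initialized parameters (training, e.g. by SGD, is omitted)
p1 = [rng.normal(size=(d_s, d_h)), np.zeros(d_h), rng.normal(size=(d_h, 1)), np.zeros(1)]
p2 = [rng.normal(size=(d_s, d_h)), np.zeros(d_h), rng.normal(size=(d_h, 1)), np.zeros(1)]
w3 = rng.normal(size=(d_x, 1))    # linear model g3

g1 = mlp(fs, *p1)                 # offset term depending on source features
g2 = softplus(mlp(fs, *p2))       # positive scale term
g3 = x @ w3                       # target-specific linear factor
y_hat = g1 + g2 * g3              # affine model transfer: g1 + g2 * g3
```

Because g3 is linear and g2 is positive, the sign of each coefficient in w3 directly indicates whether the corresponding input feature (here, a phrase count) pushes the prediction up or down, which is what enables the phrase analysis in Table 3.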
These phrases often appear in contexts such as "demonstrated on a wide range of tasks including" or "establishing a new state-of-the-art result," suggesting that superior experimental results tend to yield higher peer review scores. In addition, the phrase "theoretical analysis" was also identified as having a positive effect on the review score, reflecting the significance of theoretical validation in machine learning research. In contrast, general phrases with broader meanings, such as "recent advances" and "machine learning," contributed to lower scores. This observation suggests the importance of explicitly stating the novelty and uniqueness of research findings and of refraining from generic terminology.

Table 2: Prediction performance of peer review scores and acceptance/rejection for submitted papers. The mean and standard deviation of the RMSE and accuracy are reported for the affine model transfer (Affine TL) and feature extraction (FE). Definitions of asterisks and boldface letters are the same as in Table 1.

                     Regression         Classification
BERT     FE          1.3086 ± 0.0035    0.6250 ± 0.0217
         Affine TL  *1.3069 ± 0.0042    0.6252 ± 0.0163
SciBERT  FE          1.2856 ± 0.0144    0.6520 ± 0.0106
         Affine TL  *1.2797 ± 0.0122    0.6507 ± 0.0124
T5       FE          1.3486 ± 0.0175    0.6344 ± 0.0079
         Affine TL  *1.3442 ± 0.0030    0.6366 ± 0.0065
GPT-3    FE          1.3284 ± 0.0138    0.6279 ± 0.0181
         Affine TL  *1.3234 ± 0.0140   *0.6386 ± 0.0095

Table 3: Phrases with the top and bottom ten regression coefficients for g3 in the affine transfer model for the regression task with SciBERT.
Rank  Positive              Negative
1     tasks including       recent advances
2     new state             novel approach
3     high quality          latent space
4     recently proposed     learning approach
5     latent variable       neural architecture
6     number parameters     machine learning
7     theoretical analysis  attention mechanism
8     policy gradient       reinforcement learning
9     inductive bias        proposed framework
10    image generation      descent sgd

As illustrated in this example, integrating modern deep learning techniques with highly interpretable transfer models through the mechanism of the affine model transfer not only enhances prediction performance but also provides valuable insights into domain-specific factors.

5.3 Case Studies in Materials Science

We conducted two additional case studies, both pertaining to scientific tasks in materials science. One experiment examines the relationship between qualitative differences in source features and the learning behavior of the affine model transfer. In the other, we demonstrate the potential utility of the affine model transfer as a calibration tool bridging computational models and real-world systems. In particular, we highlight the benefits of separately modeling and estimating domain-specific factors through a case study in polymer chemistry. The objective is to predict the specific heat capacity at constant pressure of any given organic polymer from the chemical structure of its repeating unit. Specifically, we conduct TL to bridge the gap between experimental values and physical properties calculated from molecular dynamics simulations. Details are given in Section B of the Supplementary Material.

6 Conclusions

In this study, we introduced a general class of TL based on affine model transformations and clarified their learning capability and applicability. The proposed affine model transformation was shown to be an optimal class that minimizes the expected squared loss in the HTL procedure.
The model contrasts with widely applied TL methods, such as reusing features from pre-trained models, which lack a theoretical foundation. The affine model transfer is model-agnostic; it is easily combined with any machine learning model, feature set, or physical model. Furthermore, in the model, domain-specific factors are involved in the incorporation of the source features. This property gives the affine transfer the ability to handle domain-common and domain-specific factors simultaneously and separately. The advantages of the model were verified theoretically and experimentally in this study. We showed theoretical results on the generalization bound and excess risk bound when the regression tasks are solved by kernel methods. If the source features are strongly related to the target domain, the convergence rate of the generalization bound improves over naive learning. The excess risk of the proposed TL is evaluated using the eigenvalue decay of the product kernel, which also illustrates the effect of the overlap between the source and target tasks. In our numerical studies, the affine model transfer generally achieved lower test errors when the target and source tasks were similar. The NLP example also showed that the proposed affine model transfer can identify phrases that are valuable, or detrimental, for high-quality papers, owing to its affine representation of the cross-domain shift and domain-specific factors.

Acknowledgments and Disclosure of Funding

This work was supported by JST SPRING Grant No. JPMJSP2104, JST CREST Grants No. JPMJCR22O3 and No. JPMJCR19I3, MEXT KAKENHI Grant-in-Aid for Scientific Research on Innovative Areas (Grant No. 19H50820), the Grant-in-Aid for Scientific Research (A) (Grant No. 19H01132) and Grant-in-Aid for Research Activity Start-up (Grant No. 23K19980) from the Japan Society for the Promotion of Science (JSPS), and the MEXT Program for Promoting Researches on the Supercomputer Fugaku (No. hp210264).
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Communications of the ACM, vol. 60, pp. 84–90, 2012.
[2] G. Csurka, Domain adaptation for visual applications: A comprehensive survey, arXiv, vol. abs/1702.05374, 2017.
[3] S. Ruder, M. E. Peters, S. Swayamdipta, and T. Wolf, Transfer learning in natural language processing, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pp. 15–18, 2019.
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv, vol. abs/1810.04805, 2019.
[5] R. K. Sevakula, V. Singh, N. K. Verma, C. Kumar, and Y. Cui, Transfer learning for molecular cancer classification using deep neural networks, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 16, pp. 2089–2100, 2019.
[6] H. Yamada, C. Liu, S. Wu, Y. Koyama, S. Ju, J. Shiomi, J. Morikawa, and R. Yoshida, Predicting materials properties with little data using shotgun transfer learning, ACS Central Science, vol. 5, pp. 1717–1730, 2019.
[7] S. Wu, Y. Kondo, M. Kakimoto, B. Yang, H. Yamada, I. Kuwajima, G. Lambard, K. Hongo, Y. Xu, J. Shiomi, et al., Machine-learning-assisted discovery of polymers with high thermal conductivity using a molecular design algorithm, npj Computational Materials, vol. 5, no. 1, pp. 1–11, 2019.
[8] S. Ju, R. Yoshida, C. Liu, K. Hongo, T. Tadano, and J. Shiomi, Exploring diamond-like lattice thermal conductivity crystals via feature-based transfer learning, Physical Review Materials, vol. 5, no. 5, p. 053801, 2021.
[9] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, How transferable are features in deep neural networks?, Advances in Neural Information Processing Systems, vol. 27, 2014.
[10] I. Kuzborskij and F. Orabona, Stability and hypothesis transfer learning, International Conference on Machine Learning, pp.
942–950, 2013.
[11] I. Kuzborskij and F. Orabona, Fast rates by transferring from auxiliary hypotheses, Machine Learning, vol. 106, no. 2, pp. 171–195, 2017.
[12] S. S. Du, J. Koushik, A. Singh, and B. Póczos, Hypothesis transfer learning via transformation functions, Advances in Neural Information Processing Systems, vol. 30, 2017.
[13] S. Minami, S. Liu, S. Wu, K. Fukumizu, and R. Yoshida, A general class of transfer learning regression without implementation cost, Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 8992–8999, 2021.
[14] S. Liu and K. Fukumizu, Estimating posterior ratio for classification: Transfer learning from probabilistic perspective, Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 747–755, 2016.
[15] N. Tripuraneni, M. Jordan, and C. Jin, On the theory of transfer learning: The importance of task diversity, Advances in Neural Information Processing Systems, vol. 33, pp. 7852–7862, 2020.
[16] L. Dinh, D. Krueger, and Y. Bengio, NICE: Non-linear independent components estimation, arXiv, vol. abs/1410.8516, 2014.
[17] L. Dinh, J. N. Sohl-Dickstein, and S. Bengio, Density estimation using Real NVP, International Conference on Learning Representations, 2017.
[18] T. Teshima, I. Ishikawa, K. Tojo, K. Oono, M. Ikeda, and M. Sugiyama, Coupling-based invertible neural networks are universal diffeomorphism approximators, Advances in Neural Information Processing Systems, vol. 33, pp. 3362–3373, 2020.
[19] H. Zhou, L. Li, and H. Zhu, Tensor regression with applications in neuroimaging data analysis, Journal of the American Statistical Association, vol. 108, no. 502, pp. 540–552, 2013.
[20] P. L. Bartlett, O. Bousquet, and S. Mendelson, Local Rademacher complexities, Annals of Statistics, vol. 33, pp. 1497–1537, 2005.
[21] I. Steinwart and A. Christmann, Support Vector Machines. Springer Science & Business Media, 2008.
[22] I. Steinwart, D. R. Hush, and C.
Scovel, Optimal rates for regularized least squares regression, Proceedings of the 22nd Annual Conference on Learning Theory, pp. 79–93, 2009.
[23] C. K. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning, vol. 2. MIT Press, Cambridge, MA, 2006.
[24] A. Singh, M. D'Arcy, A. Cohan, D. Downey, and S. Feldman, SciRepEval: A multi-format benchmark for scientific document representations, arXiv, vol. abs/2211.13308, 2022.
[25] C. Finn, P. Abbeel, and S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, International Conference on Machine Learning, 2017.
[26] L. Xuhong, Y. Grandvalet, and F. Davoine, Explicit inductive bias for transfer learning with convolutional networks, International Conference on Machine Learning, pp. 2825–2834, 2018.
[27] S. Myung, I. Huh, W. Jang, J. M. Choe, J. Ryu, D. Kim, K.-E. Kim, and C. Jeong, PAC-Net: A model pruning approach to inductive transfer learning, International Conference on Machine Learning, pp. 16240–16252, 2022.
[28] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018.
[29] I. Beltagy, K. Lo, and A. Cohan, SciBERT: A pretrained language model for scientific text, Conference on Empirical Methods in Natural Language Processing, 2019.
[30] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
[31] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[32] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47.
Cambridge University Press, 2018.
[33] S. Wu, G. Lambard, C. Liu, H. Yamada, and R. Yoshida, iQSPR in XenonPy: A Bayesian molecular design algorithm, Molecular Informatics, vol. 39, no. 1-2, p. 1900107, 2020.
[34] C. Liu, K. Kitahara, A. Ishikawa, T. Hiroto, A. Singh, E. Fujita, Y. Katsura, Y. Inada, R. Tamura, K. Kimura, and R. Yoshida, Quasicrystals predicted and discovered by machine learning, Physical Review Materials, vol. 7, p. 093805, 2023.
[35] C. Liu, E. Fujita, Y. Katsura, Y. Inada, A. Ishikawa, R. Tamura, K. Kimura, and R. Yoshida, Machine learning to predict quasicrystals from chemical compositions, Advanced Materials, vol. 33, no. 36, p. 2102507, 2021.
[36] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations, 2015.
[37] J. Wang, R. M. Wolf, J. W. Caldwell, P. A. Kollman, and D. A. Case, Development and testing of a general amber force field, Journal of Computational Chemistry, vol. 25, no. 9, pp. 1157–1174, 2004.
[38] Y. Hayashi, J. Shiomi, J. Morikawa, and R. Yoshida, RadonPy: Automated physical property calculation using all-atom classical molecular dynamics simulations for polymer informatics, npj Computational Materials, vol. 8, no. 222, 2022.
[39] S. Otsuka, I. Kuwajima, J. Hosoya, Y. Xu, and M. Yamazaki, PoLyInfo: Polymer database for polymeric materials design, 2011 International Conference on Emerging Intelligent Data and Web Technologies, pp. 22–29, 2011.
[40] M. Kusaba, Y. Hayashi, C. Liu, A. Wakiuchi, and R. Yoshida, Representation of materials by kernel mean embedding, Physical Review B, vol. 108, p. 134107, 2023.
[41] J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research, vol. 12, no. 7, 2011.
[42] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations, 2015.
[43] F. Pedregosa, G. Varoquaux, A. Gramfort, V.
Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[44] M. Mohri, A. Rostamizadeh, and A. S. Talwalkar, Foundations of Machine Learning. MIT Press, 2018.
[45] N. Srebro, K. Sridharan, and A. Tewari, Smoothness, low noise and fast rates, Advances in Neural Information Processing Systems, vol. 23, 2010.
[46] M. Kloft and G. Blanchard, The local Rademacher complexity of ℓp-norm multiple kernel learning, Advances in Neural Information Processing Systems, vol. 24, 2011.

Supplementary Material
Transfer Learning with Affine Model Transformation

A Other Perspectives on Affine Model Transfer

A.1 Transformation Functions for General Loss Functions

Here we discuss the optimal transformation function for general loss functions. Let ℓ(y, y′) ≥ 0 be a convex loss function that returns zero if and only if y = y′, and let g∗(x) be the optimal predictor that minimizes the expectation of ℓ with respect to the target distribution p_t of x and y, with y transformed by ϕ_{f_s}:

    g∗(x) = argmin_g  E_{p_t}[ ℓ(g(x), ϕ_{f_s}(y)) ].

The function g that minimizes the expected loss

    E_{p_t}[ ℓ(g(x), ϕ_{f_s}(y)) ] = ∬ ℓ(g(x), ϕ_{f_s}(y)) p_t(x, y) dx dy

should be a solution to the Euler–Lagrange equation

    ∂/∂g(x) ∫ ℓ(g(x), ϕ_{f_s}(y)) p_t(x, y) dy = ( ∫ ∂/∂g(x) ℓ(g(x), ϕ_{f_s}(y)) p_t(y|x) dy ) p_t(x) = 0.    (S.1)

Denote the solution of Eq. (S.1) by G(x; ϕ_{f_s}). While G depends on the loss ℓ and the distribution p_t, we omit these from the arguments for notational simplicity. Using this function, the minimizer of the expected loss E_{x,y}[ℓ(g(x), y)] can be expressed as G(x; id), where id denotes the identity function. We assume the following, which generalizes Assumption 2.3 in the main text: Assumption 2.3(b).
For any distribution p_t(x, y) on the target domain and all x ∈ X, the following relationship holds:

    ψ_{f_s}(g∗(x)) = argmin_g E_{x,y}[ ℓ(g(x), y) ].

Equivalently, the transformation functions ϕ_{f_s} and ψ_{f_s} satisfy

    ψ_{f_s}( G(x; ϕ_{f_s}) ) = G(x; id).    (S.2)

Assumption 2.3(b) states that if the optimal predictor G(x; ϕ_{f_s}) for the data transformed by ϕ is passed to the model transformation function ψ, the result coincides with the overall optimal predictor G(x; id) on the target domain with respect to the loss ℓ. We consider all pairs of ψ and ϕ that satisfy this consistency condition. We now state the following proposition:

Proposition A.1. Under Assumptions 2.1, 2.2 and 2.3(b), ψ_{f_s}^{-1} = ϕ_{f_s}.

Proof. The proof is analogous to that of Theorem 2.4 in Section D.1. For any y_0 ∈ Y, let p_t(y|x) = δ_{y_0}. Combining this with Eq. (S.1) leads to

    ∂/∂g(x) ℓ(g(x), ϕ_{f_s}(y_0)) = 0    (∀ y_0 ∈ Y).

Because ℓ(y, y′) attains its minimum value zero if and only if y = y′, we obtain G(x; ϕ_{f_s}) = ϕ_{f_s}(y_0). Similarly, we have G(x; id) = y_0. From these two facts and Assumption 2.3(b), we have ψ_{f_s}(ϕ_{f_s}(y_0)) = y_0, which proves the proposition.

Proposition A.1 indicates that the first statement of Theorem 2.4 holds for general loss functions. However, the second claim of Theorem 2.4 generally depends on the type of loss function. Through the following examples, we describe the optimal class of transformation functions for several loss functions.

Example 1 (Squared loss). Let ℓ(y, y′) = |y − y′|². As a solution of Eq. (S.1), the optimal predictor is the conditional expectation E_{p_t}[ ϕ_{f_s}(Y) | X = x ]. As discussed in Section 2.1, the transformation functions ϕ_{f_s} and ψ_{f_s} should then be affine transformations.

Example 2 (Absolute loss). Let ℓ(y, y′) = |y − y′|. Substituting this into Eq. (S.1), we have

    ∂/∂g(x) ∫ |g(x) − ϕ_{f_s}(y)| p_t(y|x) dy = ∫ sign(g(x) − ϕ_{f_s}(y)) p_t(y|x) dy
        = ∫_{ϕ_{f_s}(y) ≤ g(x)} p_t(y|x) dy − ∫_{ϕ_{f_s}(y) > g(x)} p_t(y|x) dy