Transfer Learning with Affine Model Transformation

Shunya Minami (The Institute of Statistical Mathematics, mshunya@ism.ac.jp), Kenji Fukumizu (The Institute of Statistical Mathematics, fukumizu@ism.ac.jp), Yoshihiro Hayashi (The Institute of Statistical Mathematics, yhayashi@ism.ac.jp), Ryo Yoshida (The Institute of Statistical Mathematics, yoshidar@ism.ac.jp)

Abstract: Supervised transfer learning has received considerable attention due to its potential to boost the predictive power of machine learning in scenarios where data are scarce. Generally, a given set of source models and a dataset from a target domain are used to adapt the pre-trained models to a target domain by statistically learning domain shift and domain-specific factors. While such procedurally and intuitively plausible methods have achieved great success in a wide range of real-world applications, the lack of a theoretical basis hinders further methodological development. This paper presents a general class of transfer learning regression called affine model transfer, following the principle of expected-square loss minimization. It is shown that the affine model transfer broadly encompasses various existing methods, including the most common procedure based on neural feature extractors. Furthermore, the current paper clarifies theoretical properties of the affine model transfer such as generalization error and excess risk. Through several case studies, we demonstrate the practical benefits of modeling and estimating inter-domain commonality and domain-specific factors separately with the affine-type transfer models.

1 Introduction

Transfer learning (TL) is a methodology to improve the predictive performance of machine learning in a target domain with limited data by reusing knowledge gained from training in related source domains.
Its great potential has been demonstrated in various real-world problems, including computer vision [1, 2], natural language processing [3, 4], biology [5], and materials science [6, 7, 8]. Notably, most of the outstanding successes of TL to date have relied on the feature extraction ability of deep neural networks. For example, a conventional method reuses feature representations encoded in an intermediate layer of a pre-trained model as an input for the target task, or uses samples from the target domain to fine-tune the parameters of the pre-trained source model [9]. While such methods are operationally plausible and intuitive, they lack methodological principles and remain theoretically unexplored in terms of their learning capability for limited data. This study develops a principled methodology generally applicable to various kinds of TL.

In this study, we focus on supervised TL settings. In particular, we deal with settings where, given feature representations obtained from training in the source domain, we use samples from the target domain to model and estimate the domain shift to the target. This procedure is called hypothesis transfer learning (HTL); several methods have been proposed, such as using a linear transformation function [10, 11] and considering a general class of continuous transformation functions [12]. If the transformation function appropriately captures the functional relationship between the source and target domains, only the domain-specific factors need to be additionally learned, which can be done efficiently even with a limited sample size. In other words, the performance of HTL depends strongly on whether the transformation function appropriately represents the cross-domain shift. However, the general methodology for modeling and estimating such domain shifts has been less studied.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
This study derives a theoretically optimal class of supervised TL that minimizes the expected ℓ2 loss function of the HTL. The resulting function class takes the form of an affine coupling g1(fs) + g2(fs)·g3(x) of three functions g1, g2 and g3, where the shift from a given source feature fs to the target domain is represented by the functions g1 and g2, and the domain-specific factors are represented by g3(x) for any given input x. These functions can be estimated simultaneously using conventional supervised learning algorithms such as kernel methods or deep neural networks. Hereafter, we refer to this framework as the affine model transfer. As described later, we can formulate a wide variety of TL algorithms within the affine model transfer, including the widely used neural feature extractors, offset and scale HTLs [10, 11, 12], and Bayesian TL [13]. We also clarify theoretical properties of the affine model transfer, such as the generalization error and the excess risk. To summarize, the contributions of our study are as follows:

- The affine model transfer is proposed to adapt source features to the target domain by separately estimating cross-domain shift and domain-specific factors. The affine form is derived theoretically as an optimal class based on the squared loss for the target task.
- The affine model transfer encompasses several existing TL methods, including neural feature extraction. It can work with any type of source model, including non-machine-learning models such as physical models, as well as multiple source models.
- For each of the three functions g1, g2, and g3, we provide an efficient and stable estimation algorithm when they are modeled using the kernel method.
- Two theoretical properties of the affine transfer model are shown: the generalization bound and the excess risk bound.
With several applications, we compare the affine model transfer with other TL algorithms, discuss its strengths, and demonstrate the advantage of being able to estimate cross-domain shifts and domain-specific factors separately.

2 Transfer Learning via Transformation Function

2.1 Affine Model Transfer

This study considers regression problems with the squared loss. We assume that the output of the target domain y ∈ Y ⊂ R follows y = ft(x) + ε, where ft : X → R is the true model on the target domain, and the observation noise ε has mean zero and variance σ². We are given n samples {(xi, yi)}_{i=1}^n ∈ (X × Y)^n from the target domain and the feature representation fs(x) ∈ Fs from one or more source domains. Typically, fs is given as a vector, comprising the outputs of the source models, observed data in the source domains, or learned features in a pre-trained model, but it can also be a non-vector feature such as a tensor, graph, or text. Hereafter, fs is referred to as the source features.

In this paper, we focus on transfer learning with variable transformations as proposed in [12]. For an illustration of the concept, consider the case where there exists a relationship between the true functions f*s(x) ∈ R and f*t(x) ∈ R such that f*t(x) = f*s(x) + ⟨x, θ*⟩ for x ∈ R^d with an unknown parameter θ* ∈ R^d. If f*s is non-smooth, a large number of training samples is needed to learn f*t directly. However, since the difference f*t − f*s is a linear function of the unknown θ*, it can be learned with fewer samples if prior information about f*s is available. For example, a target model can be obtained by adding fs to a model g trained on the intermediate variable z = y − fs(x). The following is a slight generalization of the TL procedure provided in [12]:

1. With the source features, perform a variable transformation of the observed outputs as zi = ϕ(yi, fs(xi)), using the data transformation function ϕ : Y × Fs → R.

2.
Train an intermediate model ĝ(x) using the transformed sample set {(xi, zi)}_{i=1}^n to predict the transformed output z for any given x.

3. Obtain a target model f̂t(x) = ψ(ĝ(x), fs(x)) using the model transformation function ψ : R × Fs → Y, which combines ĝ and fs to define a predictor.

In particular, [12] considers the case where the model transformation function is equal to the inverse of the data transformation function. We consider a more general case that eliminates this constraint. The objective of step 1 is to identify a transformation ϕ that maps the output variable y to an intermediate variable z suitable for learning. In step 2, a predictive model for z is constructed. Since data are limited in many TL setups, a simple model, such as a linear model, should be used for g. Step 3 transforms the intermediate model g into a predictive model ft for the original output y.

This class of TL includes several approaches proposed in previous studies. For example, [10, 11] proposed a learning algorithm consisting of a linear data transformation and a linear model transformation: ϕ = y − ⟨θ, fs⟩ and ψ = g(x) + ⟨θ, fs⟩ with pre-defined weights θ. In this case, factors unexplained by the linear combination of source features are learned by g, and the target output is predicted additively from the common factor ⟨θ, fs⟩ and the additionally learned g. In [13], it is shown that a type of Bayesian TL is equivalent to using the following transformation functions: for Fs ⊂ R, ϕ = (y − τ fs)/(1 − τ) and ψ = ρ g(x) + (1 − ρ) fs with two hyperparameters τ < 1 and 0 ≤ ρ ≤ 1. This includes TL using density-ratio estimation [14] and neural-network-based fine-tuning as special cases when the two hyperparameters lie in specific regions. The performance of this class of TL strongly depends on the design of the two transformation functions ϕ and ψ. In the sequel, we theoretically derive the optimal form of the transformation functions under the squared loss scenario.
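As an illustration, the three-step HTL recipe with the offset-type transformations of [10, 11] can be sketched in a few lines. All variable names and the toy data below are our own assumptions for the sketch, not the authors' code; a plain ridge regressor stands in for the intermediate model g.

```python
import numpy as np

# Toy sketch of the three-step HTL procedure with the offset transformation
# phi(y, fs) = y - <theta, fs> and its inverse psi(g, fs) = g + <theta, fs>.
rng = np.random.default_rng(0)
n, d, p = 30, 5, 3
X = rng.normal(size=(n, d))              # target inputs x_i
Fs = rng.normal(size=(n, p))             # source features fs(x_i)
theta = np.array([1.0, -0.5, 0.3])       # pre-defined combination weights
y = Fs @ theta + 0.1 * X[:, 0] + 0.01 * rng.normal(size=n)

# Step 1: data transformation z_i = phi(y_i, fs(x_i))
z = y - Fs @ theta

# Step 2: fit a simple intermediate model g(x) on (x_i, z_i) (ridge)
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ z)

# Step 3: model transformation f_t(x) = psi(g(x), fs) = g(x) + <theta, fs>
def f_t(x_new, fs_new):
    return x_new @ w + fs_new @ theta

pred = f_t(X, Fs)
rmse = np.sqrt(np.mean((pred - y) ** 2))
print(rmse)   # small training RMSE: the residual z is easy to learn
```

Because the difference y − ⟨θ, fs⟩ is a simple function of x, the intermediate model learns it from few samples, which is the point of the procedure.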
For simplicity, we denote the transformation functions as ϕfs(·) = ϕ(·, fs) on Y and ψfs(·) = ψ(·, fs) on R. To derive the optimal class of ϕ and ψ, note first that the TL procedure described above can be formulated, in population, as solving two successive least squares problems:

(i) g*ϕ := argmin_g E_pt[ϕfs(Y) − g(X)]²,   (ii) min_ψ E_pt[Y − ψfs(g*ϕ(X))]².

Since the regression function that minimizes the mean squared error is the conditional mean, the first problem is solved by g*ϕ(x) = E[ϕfs(Y) | X = x], which depends on ϕ. We can thus consider the optimal transformation functions ϕ and ψ through the following minimization:

min_{ψ,ϕ} E_pt[Y − ψfs(g*ϕ(X))]².  (1)

It is easy to see that Eq. (1) is equivalent to the following consistency condition: ψfs(g*ϕ(x)) = E_pt[Y | X = x]. From this observation, we make three assumptions to derive the optimal form of ψ and ϕ:

Assumption 2.1 (Differentiability). The data transformation function ϕ is differentiable with respect to its first argument.

Assumption 2.2 (Invertibility). The model transformation function ψ is invertible with respect to its first argument, i.e., its inverse ψfs⁻¹ exists.

Assumption 2.3 (Consistency). For any distribution pt(x, y) on the target domain and for all x ∈ X, ψfs(g*(x)) = E_pt[Y | X = x], where g*(x) = E_pt[ϕfs(Y) | X = x].

Assumption 2.2 is commonly imposed in most existing HTL settings, such as [10] and [12]. It assumes a one-to-one correspondence between the predicted value f̂t(x) and the output of the intermediate model ĝ(x). If this assumption does not hold, then multiple values of ĝ correspond to the same predicted value f̂t, which is unnatural. Note that Assumption 2.3 corresponds to the unbiasedness condition of [12]. We now derive the properties that the optimal transformation functions must satisfy.

Theorem 2.4. Under Assumptions 2.1–2.3, the transformation functions ϕ and ψ satisfy the following two properties:
(i) ψfs⁻¹ = ϕfs.
(ii) ψfs(g) = g1(fs) + g2(fs)·g, where g1 and g2 are some functions.
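As a quick sanity check (the full proof is in the Supplementary Material), one can verify directly that the affine pair in Theorem 2.4 satisfies the consistency condition of Assumption 2.3, provided g2(fs) ≠ 0:

```latex
% Take \psi_{f_s}(g) = g_1(f_s) + g_2(f_s)\, g with g_2(f_s) \neq 0, so that
% \phi_{f_s}(y) = \psi_{f_s}^{-1}(y) = \frac{y - g_1(f_s)}{g_2(f_s)}. Then
\begin{align*}
g^*(x) &= \mathbb{E}_{p_t}\!\left[\phi_{f_s}(Y) \mid X = x\right]
        = \frac{\mathbb{E}_{p_t}[Y \mid X = x] - g_1(f_s)}{g_2(f_s)}, \\
\psi_{f_s}(g^*(x)) &= g_1(f_s)
        + g_2(f_s)\,\frac{\mathbb{E}_{p_t}[Y \mid X = x] - g_1(f_s)}{g_2(f_s)}
        = \mathbb{E}_{p_t}[Y \mid X = x].
\end{align*}
```

That is, the affine model transformation composed with its inverse data transformation reproduces the conditional mean exactly, for any target distribution.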
The proof is given in Section D.1 of the Supplementary Material. Despite not initially assuming that the two transformation functions are inverses of each other, Theorem 2.4 implies that they must be. Furthermore, the mean squared error is minimized when the data and model transformation functions are given by an affine transformation and its inverse, respectively. In summary, under expected squared loss minimization with the HTL procedure, the optimal model class for HTL is expressed as

H = {x ↦ g1(fs) + g2(fs)g3(x) | gj ∈ Gj, j = 1, 2, 3},

where G1, G2 and G3 are arbitrary function classes. Here, each of g1 and g2 is modeled as a function of fs that represents factors common to the source and target domains, while g3 is modeled as a function of x in order to capture the domain-specific factors unexplainable by the source features. We have derived the optimal form of the transformation functions when the squared loss is employed. Even for general convex loss functions, (i) of Theorem 2.4 still holds. However, (ii) of Theorem 2.4 does not hold in general, because the optimal transformation function depends on the loss function. Extensions to other losses are briefly discussed in Section A.1; the establishment of a complete theory is left for future work. Here, the affine transformation was found to be optimal in terms of minimizing the mean squared error. The same optimal form can also be derived by minimizing an upper bound of the estimation error in the HTL procedure, as discussed in Section A.2. One of the key principles for the design of g1, g2, and g3 is interpretability. In our model, g1 and g2 primarily facilitate knowledge transfer, while the estimated g3 is used to gain insight into domain-specific factors. For instance, in order to infer cross-domain differences, we could design g1 and g2 using conventional neural feature extraction, while a simple, highly interpretable model such as a linear model could be used for g3.
Thus, by observing the estimated regression coefficients in g3, one can statistically infer which features of x are related to the inter-domain differences. This advantage of the proposed method is demonstrated in Section 5.2 and Section B.3.

2.2 Relation to Existing Methods

The affine model transfer encompasses several existing TL procedures. For example, by setting g1(fs) = 0 and g2(fs) = 1, the prediction model is estimated without using the source features, which corresponds to ordinary direct learning, i.e., a learning scheme without transfer. Furthermore, various kinds of HTLs can be formulated by imposing constraints on g1 and g2. In prior work, [10] employed a two-step procedure in which the source features are combined with pre-defined weights, and an auxiliary model is then additionally learned for the residuals unexplainable by the source features. The affine model transfer represents this HTL as the special case g2 = 1. [12] uses the transformed output zi = yi/fs with the output value fs ∈ R of a source model; this cross-domain shift is then regressed onto x using a target dataset. This HTL corresponds to g1 = 0 and g2 = fs. When a pre-trained source model is provided as a neural network, TL is usually performed with the intermediate layer as the input to the model in the target domain. This is called a feature extractor or frozen featurizer, and it has been experimentally and theoretically shown to have strong transfer capability, serving as the de facto standard for TL [9, 15]. The affine model transfer encompasses neural feature extraction as the special subclass obtained by setting g2(fs)g3(x) = 0. A performance comparison of the affine model transfer with neural feature extraction is presented in Section 5 and Section B.2.
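The special cases above can be made concrete with a toy snippet. The helper `affine_predict` and all numbers below are illustrative assumptions, not part of the paper's code:

```python
# Toy sketch of how specific choices of (g1, g2, g3) in the affine form
# f(x) = g1(fs) + g2(fs) * g3(x) recover the methods discussed above.

def affine_predict(x, fs, g1, g2, g3):
    return g1(fs) + g2(fs) * g3(x)

g3 = lambda x: 2.0 * x          # toy domain-specific model
theta = 0.5                     # toy pre-defined combination weight
x, fs = 1.5, 4.0                # scalar input and source feature

# Direct learning (no transfer): g1 = 0, g2 = 1  =>  f(x) = g3(x)
direct = affine_predict(x, fs, lambda f: 0.0, lambda f: 1.0, g3)

# Offset HTL [10]: g2 = 1  =>  f(x) = theta*fs + g3(x)
offset = affine_predict(x, fs, lambda f: theta * f, lambda f: 1.0, g3)

# Scale HTL [12]: g1 = 0, g2 = fs  =>  f(x) = fs * g3(x)
scale = affine_predict(x, fs, lambda f: 0.0, lambda f: f, g3)

# Feature extraction: g2 * g3 = 0  =>  f(x) = g1(fs) only
feat = affine_predict(x, fs, lambda f: theta * f, lambda f: 0.0, g3)

print(direct, offset, scale, feat)  # 3.0 5.0 12.0 2.0
```

Each named method is a restriction of the same affine form, which is what allows a single estimation procedure to cover all of them.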
The relationships between these existing methods and the affine model transfer are illustrated in Figure 1 and Figure S.1. The affine model transfer can also be interpreted as generalizing feature extraction by adding the product term g2(fs)g3(x). This additional term allows the transferred model to include unknown factors that are unexplainable by the source features alone. Furthermore, it helps to avoid negative transfer, a phenomenon in which prior learning experiences interfere with training on a new task. Usual TL based only on fs attempts to explain and predict the data-generating process in the target domain using the source features alone. In the presence of domain-specific factors, however, negative transfer can occur owing to this lack of descriptive power. The additional term compensates for this shortcoming. A comparison of the behavior in the case of non-related source features is described in Section 5.1.

Figure 1: Architectures of (a) feature extraction, (b) HTL in [10], and (c) affine model transfer.

Algorithm 1: Block relaxation algorithm [19].
  Initialize: a_0, b_0 = 0, c_0 = 0
  repeat
    a_{t+1} = argmin_a F(a, b_t, c_t)
    b_{t+1} = argmin_b F(a_{t+1}, b, c_t)
    c_{t+1} = argmin_c F(a_{t+1}, b_{t+1}, c)
  until convergence

The affine model transfer can be naturally expressed as an architecture of neural networks. This architecture, called affine coupling layers, is widely used in invertible neural networks for flow-based generative modeling [16, 17]. Neural networks based on affine coupling layers have been proven to have universal approximation ability [18]. This implies that the affine transfer model has the potential to represent a wide range of function classes, despite its simple architecture based on the affine coupling of three functions.

3 Modeling and Estimation

In this section, we focus on using kernel methods for the affine transfer model and provide the estimation algorithm.
Let H1, H2 and H3 be reproducing kernel Hilbert spaces (RKHSs) with positive-definite kernels k1, k2 and k3, which define the feature mappings Φ1 : Fs → H1, Φ2 : Fs → H2 and Φ3 : X → H3, respectively. Denote Φ1,i = Φ1(fs(xi)), Φ2,i = Φ2(fs(xi)), and Φ3,i = Φ3(xi). For the proposed model class, the ℓ2-regularized empirical risk with the squared loss is given as

F(α, β, γ) = Σ_{i=1}^n (yi − ⟨α, Φ1,i⟩ − ⟨β, Φ2,i⟩⟨γ, Φ3,i⟩)² + λ1‖α‖²_H1 + λ2‖β‖²_H2 + λ3‖γ‖²_H3,  (2)

where λ1, λ2, λ3 ≥ 0 are hyperparameters for the regularization. According to the representer theorem, the minimizer of Eq. (2) with respect to the parameters α ∈ H1, β ∈ H2, and γ ∈ H3 reduces to α = Σ_{i=1}^n ai Φ1,i, β = Σ_{i=1}^n bi Φ2,i, γ = Σ_{i=1}^n ci Φ3,i, with n-dimensional unknown parameter vectors a, b, c ∈ R^n. Substituting this expression into Eq. (2), we obtain the objective function

‖y − K1 a − (K2 b) ∘ (K3 c)‖²₂ + λ1 aᵀK1 a + λ2 bᵀK2 b + λ3 cᵀK3 c =: F(a, b, c).  (3)

Here, the symbol ∘ denotes the Hadamard product, and K_I is the Gram matrix associated with the kernel k_I for I ∈ {1, 2, 3}. k_I^(i) = [k_I(xi, x1) ⋯ k_I(xi, xn)]ᵀ denotes the i-th column of the Gram matrix, and the n × n matrix M^(i) is given by the tensor product M^(i) = k_2^(i) ⊗ k_3^(i) of k_2^(i) and k_3^(i). Because the model is linear with respect to the parameter a and bilinear in b and c, the optimization of Eq. (3) can be solved using well-established techniques for low-rank tensor regression. In this study, we use the block relaxation algorithm [19] described in Algorithm 1. It updates a, b, and c by repeatedly fixing two of the three parameters and minimizing the objective function with respect to the remaining one. With two parameters fixed, the resulting subproblem can be solved analytically because the objective function is quadratic in the remaining parameter. Algorithm 1 can be regarded as repeating the HTL procedure introduced in Section 2.1: it alternately estimates the parameters (a, b) of the transformation function and the parameters c of the model for the given transformed data {(xi, zi)}_{i=1}^n.
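A minimal sketch of Algorithm 1 for the objective in Eq. (3) might look as follows. The Gaussian kernels, synthetic data, and initialization below are our own illustrative choices, not the authors' implementation; each block update solves its quadratic subproblem in closed form, so the objective is non-increasing across iterations.

```python
import numpy as np

# Sketch of block relaxation for
# F(a,b,c) = ||y - K1 a - (K2 b) o (K3 c)||^2
#            + l1 a'K1 a + l2 b'K2 b + l3 c'K3 c,   "o" = Hadamard product.

def gram(X, ls):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * ls ** 2))

rng = np.random.default_rng(1)
n = 40
X = rng.normal(size=(n, 2))                  # target inputs x_i
Fs = rng.normal(size=(n, 2))                 # source features fs(x_i)
y = np.sin(Fs[:, 0]) + (1 + 0.5 * Fs[:, 1]) * X[:, 0] + 0.01 * rng.normal(size=n)

K1, K2, K3 = gram(Fs, 1.5), gram(Fs, 1.5), gram(X, 1.5)
l1 = l2 = l3 = 1e-3
I = np.eye(n)

def F(a, b, c):
    r = y - K1 @ a - (K2 @ b) * (K3 @ c)
    return r @ r + l1 * a @ K1 @ a + l2 * b @ K2 @ b + l3 * c @ K3 @ c

def block_step(K, D, r, lam):
    # Closed-form minimizer of ||r - D K u||^2 + lam * u'K u over u
    return np.linalg.solve(D ** 2 @ K + lam * I, D @ r)

a = np.zeros(n)
b = 0.1 * rng.normal(size=n)                 # nonzero init so the product
c = 0.1 * rng.normal(size=n)                 # term can move off zero
F0 = F(a, b, c)
for _ in range(30):                          # Algorithm 1: cycle a -> b -> c
    a = block_step(K1, I, y - (K2 @ b) * (K3 @ c), l1)
    b = block_step(K2, np.diag(K3 @ c), y - K1 @ a, l2)
    c = block_step(K3, np.diag(K2 @ b), y - K1 @ a, l3)

rmse = np.sqrt(np.mean((K1 @ a + (K2 @ b) * (K3 @ c) - y) ** 2))
print(F0, F(a, b, c), rmse)
```

The b-update, for example, minimizes ‖r − D K2 b‖² + λ2 bᵀK2 b with D = diag(K3 c), whose stationarity condition K2(D²K2 b + λ2 b − D r) = 0 is solved by b = (D²K2 + λ2 I)⁻¹ D r; the a- and c-updates are analogous.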
The function F in Algorithm 1 is not jointly convex in general. However, when employing methods such as kernel methods or generalized linear models, and fixing two of the parameters, F is convex with respect to the remaining one. According to [19], when each sub-minimization problem is convex, Algorithm 1 is guaranteed to converge to a stationary point. Furthermore, [19] showed that consistency and asymptotic normality hold for the alternating minimization algorithm.

4 Theoretical Results

In this section, we present two theoretical properties: the generalization bound and the excess risk bound. Let (Z, P) be an arbitrary probability space, and let {zi}_{i=1}^n be independent random variables with distribution P. For a function f : Z → R, denote the expectation of f under P and its empirical counterpart by Pf = E_P f(z) and P_n f = (1/n) Σ_{i=1}^n f(zi), respectively. We use a non-negative loss ℓ(y, y′) that is bounded from above by L > 0 and such that, for any fixed y′ ∈ Y, y ↦ ℓ(y, y′) is µℓ-Lipschitz for some µℓ > 0. Recall that the function class proposed in this work is H = {x ↦ g1(fs) + g2(fs)g3(x) | gj ∈ Gj, j = 1, 2, 3}. In particular, the discussion in this section assumes that g1, g2, and g3 are represented by linear functions on the RKHSs.

4.1 Generalization Bound

The optimization problem is expressed as follows:

min_{α,β,γ} P_n ℓ(y, ⟨α, Φ1⟩_H1 + ⟨β, Φ2⟩_H2 ⟨γ, Φ3⟩_H3) + λα‖α‖²_H1 + λβ‖β‖²_H2 + λγ‖γ‖²_H3,  (4)

where Φ1 = Φ1(fs(x)), Φ2 = Φ2(fs(x)) and Φ3 = Φ3(x) denote the feature maps. Without loss of generality, it is assumed that ‖Φi‖²_Hi ≤ 1 (i = 1, 2, 3) and λα, λβ, λγ > 0. Hereafter, we omit the subscripts Hi in the norms when there is no confusion. Let (α̂, β̂, γ̂) be a solution of Eq. (4), and denote the corresponding function in H by ĥ. For any α, we have

λα‖α̂‖² ≤ P_n ℓ(y, ⟨α̂, Φ1⟩ + ⟨β̂, Φ2⟩⟨γ̂, Φ3⟩) + λα‖α̂‖² + λβ‖β̂‖² + λγ‖γ̂‖² ≤ P_n ℓ(y, ⟨α, Φ1⟩) + λα‖α‖²,

where we use the fact that ℓ(·, ·) and ‖·‖ are non-negative, and that (α̂, β̂, γ̂) is the minimizer of Eq. (4).
Denoting R̂s = inf_α {P_n ℓ(y, ⟨α, Φ1⟩) + λα‖α‖²}, we obtain ‖α̂‖² ≤ λα⁻¹ R̂s. Because the same inequality holds for λβ‖β̂‖², λγ‖γ̂‖² and P_n ℓ(y, ĥ), we have ‖β̂‖² ≤ λβ⁻¹ R̂s, ‖γ̂‖² ≤ λγ⁻¹ R̂s and P_n ℓ(y, ĥ) ≤ R̂s. Moreover, we have Pℓ(y, ĥ) = E P_n ℓ(y, ĥ) ≤ E R̂s. Therefore, it is sufficient to consider the following hypothesis class H and loss class L:

H = {x ↦ ⟨α, Φ1⟩ + ⟨β, Φ2⟩⟨γ, Φ3⟩ : ‖α‖² ≤ λα⁻¹ R̂s, ‖β‖² ≤ λβ⁻¹ R̂s, ‖γ‖² ≤ λγ⁻¹ R̂s, Pℓ(y, h) ≤ E R̂s},
L = {(x, y) ↦ ℓ(y, h(x)) | h ∈ H}.

Here, we show the generalization bound for the proposed model class. The following theorem is based on [11], showing that the difference between the generalization error and the empirical error can be bounded in terms of the degree of relatedness between the domains.

Theorem 4.1. There exists a constant C depending only on λα, λβ, λγ and L such that, for any η > 0 and h ∈ H, with probability at least 1 − e^{−η},

Pℓ(y, h) − P_n ℓ(y, h) = O( Rs √((µℓ² C² + η)/n) + (√Rs (C + η) + C² + η)/n ),

where Rs = inf_α {Pℓ(y, ⟨α, Φ1⟩) + λα‖α‖²}. Because Φ1 is the feature map from the source feature space Fs into the RKHS H1, Rs corresponds to the true risk of training in the target domain using only the source features fs. If this is sufficiently small, e.g., Rs = O(n^{−1/2}), the convergence rate indicated by Theorem 4.1 becomes n^{−1}, an improvement over the naive convergence rate n^{−1/2}. This means that if the source task yields feature representations strongly related to the target domain, training in the target domain is accelerated. Theorem 4.1 measures this cross-domain relation through the metric Rs. Theorem 4.1 is based on Theorem 11 of [11], in which the function class g1 + g3 is considered. Our work differs in two points: the source features are modeled not only additively but also multiplicatively, i.e., we consider the function class g1 + g2·g3; and we also estimate the parameters of the source feature combination, i.e., the parameters of the functions g1 and g2. In particular, the latter affects the resulting rate.
With the source combination parameters fixed, the resulting rate improves only up to n^{−3/4}. The details are discussed in Section D.2.

4.2 Excess Risk Bound

Here, we analyze the excess risk, i.e., the difference between the risk of the estimated function and the smallest possible risk within the function class. Recall that we consider the functions g1, g2 and g3 to be elements of the RKHSs H1, H2 and H3 with kernels k1, k2 and k3, respectively. Define the kernels k^(1) = k1, k^(2) = k2·k3 and k = k^(1) + k^(2), and let H^(1), H^(2) and H be the RKHSs with kernels k^(1), k^(2) and k, respectively. For m = 1, 2, consider the normalized Gram matrix K^(m) = (1/n)(k^(m)(xi, xj))_{i,j=1,…,n} and its eigenvalues (λ̂^(m)_i)_{i=1}^n, arranged in nonincreasing order. We make the following additional assumptions:

Assumption 4.2. There exist h* ∈ H and h^(m)* ∈ H^(m) (m = 1, 2) such that P(y − h*(x))² = inf_{h∈H} P(y − h(x))² and P(y − h^(m)*(x))² = inf_{h∈H^(m)} P(y − h(x))².

Assumption 4.3. For m = 1, 2, there exist a_m > 0 and s_m ∈ (0, 1) such that λ̂^(m)_j ≤ a_m j^{−1/s_m}.

Assumption 4.2 is used in [20] and is not overly restrictive, as it holds for many regularization algorithms and convex, uniformly bounded function classes. In the analysis of kernel methods, Assumption 4.3 is standard [21], and it is known to be equivalent to the classical covering or entropy number assumption [22]. The inverse decay rate s_m measures the complexity of the RKHS, with larger values corresponding to more complex function spaces.

Theorem 4.4. Let ĥ be any element of H satisfying P_n(y − ĥ(x))² = inf_{h∈H} P_n(y − h(x))². Under Assumptions 4.2 and 4.3, for any η > 0, with probability at least 1 − 5e^{−η},

P(y − ĥ(x))² − P(y − h*(x))² = O(n^{−1/(1+max{s1, s2})}).

Theorem 4.4 suggests that the convergence rate of the excess risk depends on the decay rates of the eigenvalues of the two Gram matrices K^(1) and K^(2).
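The dependence on the eigenvalue decay of the two Gram matrices can be probed numerically. The following sketch compares the spectra of K2, K3, and their Hadamard product; the sample size, Gaussian kernels, and independence of the two inputs are illustrative assumptions of ours, not the paper's experiment in Section B.1.

```python
import numpy as np

# Compare eigenvalue decay of K2, K3 and their Hadamard product K2 o K3
# when the source features and inputs carry independent information.

def gram(X, ls=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * ls ** 2)) / len(X)      # 1/n normalization

rng = np.random.default_rng(0)
n = 100
F2 = rng.normal(size=(n, 1))          # stand-in for source features fs(x)
X3 = rng.normal(size=(n, 1))          # independent original inputs x

K2, K3 = gram(F2), gram(X3)
KH = n * (K2 * K3)                     # Hadamard product, renormalized by n

spectra = {name: np.sort(np.linalg.eigvalsh(K))[::-1]
           for name, K in (("K2", K2), ("K3", K3), ("K2oK3", KH))}

# Count eigenvalues above a threshold as a crude "effective dimension";
# a larger count for the product kernel suggests a richer product space.
counts = {name: int((ev > 1e-8).sum()) for name, ev in spectra.items()}
print(counts)
```

By the Schur product theorem, K2 ∘ K3 is again positive semidefinite, so the product term defines a valid kernel; the printed counts give a rough feel for how its spectrum relates to those of the factors.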
The inverse decay rate s1 of the eigenvalues of K^(1) = (1/n)(k1(fs(xi), fs(xj)))_{i,j=1,…,n} represents the learning efficiency when using only the source features, while s2 is the inverse decay rate of the eigenvalues of the Hadamard product of K2 = (1/n)(k2(fs(xi), fs(xj)))_{i,j=1,…,n} and K3 = (1/n)(k3(xi, xj))_{i,j=1,…,n}, which captures the effect of combining the source features with the original input. While a rigorous analysis of the relationship between the spectra of the two Gram matrices K2, K3 and their Hadamard product K2 ∘ K3 seems difficult, intuitively, the smaller the overlap between the space spanned by the source features and that spanned by the original input, the smaller the overlap between H2 and H3. In other words, when the source features and the original input carry different information, the tensor product H2 ⊗ H3 becomes more complex, and s2 is expected to be larger. In Section B.1, we experimentally confirm this speculation.

5 Experimental Results

We demonstrate the potential of the affine model transfer through two case studies: (i) the prediction of feed-forward torque at seven joints of a robot arm [23], and (ii) the prediction of review scores and decisions of scientific papers [24]. The experimental details are presented in Section C. Additionally, two case studies in materials science are presented in Section B. The Python code is available at https://github.com/mshunya/AffineTL.

5.1 Kinematics of the Robot Arm

We experimentally investigated the learning performance of the affine model transfer compared to several existing methods. The objective of the task is to predict the feed-forward torques, required to follow a desired trajectory, at seven different joints of the SARCOS robot arm [23]. Twenty-one features representing the joint positions, velocities, and accelerations were used as the input x. The target task is to predict the torque value at one joint.
The representations encoded in the intermediate layer of the source neural network for predicting the other six joints were used as the source features fs ∈ R^16. The experiments were conducted with seven different tasks (denoted as Torque 1–7) corresponding to the seven joints. For each target task, a training set of size n ∈ {5, 10, 15, 20, 30, 40, 50} was randomly constructed 20 times, and the performances were evaluated using the test data.

Table 1: Performance on predicting the torque values at the first and seventh joints of the SARCOS robot arm. The mean and standard deviation of the RMSE are reported for varying numbers of training samples (n < d for n ∈ {5, 10, 15}, n ≈ d for n = 20, and n > d for n ∈ {30, 40, 50}, where d = 21 is the dimension of the original input x). For each task and n, the smallest mean RMSE is indicated in bold. An asterisk indicates a case where the RMSEs of 20 independent experiments were significantly improved over Direct at the 1% significance level, according to Welch's t-test.

Torque 1:

| Model | n = 5 | n = 10 | n = 15 | n = 20 | n = 30 | n = 40 | n = 50 |
|---|---|---|---|---|---|---|---|
| Direct | 21.3 ± 2.04 | 18.9 ± 2.11 | **17.4 ± 1.79** | 15.8 ± 1.70 | 13.7 ± 1.26 | 12.2 ± 1.61 | 10.8 ± 1.23 |
| Only source | 24.0 ± 6.37 | 22.3 ± 3.10 | 21.0 ± 2.49 | 19.7 ± 1.34 | 18.5 ± 1.92 | 17.6 ± 1.59 | 17.3 ± 1.31 |
| Augmented | 21.8 ± 2.88 | 19.2 ± 1.37 | 17.8 ± 2.30 | **15.7 ± 1.53** | **13.3 ± 1.19** | **11.9 ± 1.37** | **10.7 ± 0.954** |
| HTL-offset | 23.7 ± 6.50 | 21.2 ± 3.85 | 19.8 ± 3.23 | 17.8 ± 2.35 | 16.2 ± 3.31 | 15.0 ± 3.16 | 15.1 ± 2.76 |
| HTL-scale | 23.3 ± 4.47 | 22.1 ± 5.31 | 20.4 ± 3.84 | 18.5 ± 2.72 | 17.6 ± 2.41 | 16.9 ± 2.10 | 16.7 ± 1.74 |
| Affine TL-full | **21.2 ± 2.23** | **18.8 ± 1.31** | 18.6 ± 2.83 | 15.9 ± 1.65 | 13.7 ± 1.53 | 12.3 ± 1.45 | 11.1 ± 1.12 |
| Affine TL-const | **21.2 ± 2.21** | **18.8 ± 1.44** | 17.7 ± 2.44 | 15.9 ± 1.58 | 13.4 ± 1.15 | 12.2 ± 1.54 | 10.9 ± 1.02 |
| Fine-tune | 25.0 ± 7.11 | 20.5 ± 3.33 | 18.6 ± 2.10 | 17.6 ± 2.55 | 14.1 ± 1.39 | 12.6 ± 1.13 | 11.1 ± 1.03 |
| MAML | 29.8 ± 12.3 | 22.5 ± 3.21 | 20.8 ± 2.12 | 20.3 ± 3.14 | 16.7 ± 3.00 | 14.4 ± 1.85 | 13.4 ± 1.19 |
| L2-SP | 24.9 ± 7.09 | 20.5 ± 3.30 | 18.8 ± 2.04 | 18.0 ± 2.45 | 14.5 ± 1.36 | 13.0 ± 1.13 | 11.6 ± 0.983 |
| PAC-Net | 25.2 ± 8.68 | 22.7 ± 5.60 | 20.7 ± 2.65 | 20.1 ± 2.16 | 18.5 ± 2.77 | 17.6 ± 1.85 | 17.1 ± 1.38 |

Torque 7:

| Model | n = 5 | n = 10 | n = 15 | n = 20 | n = 30 | n = 40 | n = 50 |
|---|---|---|---|---|---|---|---|
| Direct | 2.66 ± 0.307 | 2.13 ± 0.420 | 1.85 ± 0.418 | 1.54 ± 0.353 | 1.32 ± 0.200 | 1.18 ± 0.138 | 1.05 ± 0.111 |
| Only source | 2.31 ± 0.618 | \*1.73 ± 0.560 | \*1.49 ± 0.513 | \*1.22 ± 0.269 | \*1.09 ± 0.232 | \*0.969 ± 0.144 | \*0.927 ± 0.170 |
| Augmented | 2.47 ± 0.406 | 1.90 ± 0.515 | 1.67 ± 0.552 | \*1.31 ± 0.214 | 1.16 ± 0.225 | \*0.984 ± 0.149 | \*0.897 ± 0.138 |
| HTL-offset | 2.29 ± 0.621 | \*1.69 ± 0.507 | \*1.49 ± 0.513 | \*1.22 ± 0.269 | \*1.09 ± 0.233 | \*0.969 ± 0.144 | \*0.925 ± 0.171 |
| HTL-scale | 2.32 ± 0.599 | \*1.71 ± 0.516 | 1.51 ± 0.513 | \*1.24 ± 0.271 | \*1.12 ± 0.234 | \*0.999 ± 0.175 | 0.948 ± 0.172 |
| Affine TL-full | **\*2.23 ± 0.554** | \*1.71 ± 0.501 | \*1.45 ± 0.458 | \*1.21 ± 0.256 | \*1.06 ± 0.219 | \*0.974 ± 0.164 | \*0.870 ± 0.121 |
| Affine TL-const | \*2.30 ± 0.565 | \*1.73 ± 0.420 | \*1.48 ± 0.527 | \*1.20 ± 0.243 | \*1.04 ± 0.217 | \*0.963 ± 0.161 | \*0.884 ± 0.136 |
| Fine-tune | \*2.33 ± 0.511 | \*1.62 ± 0.347 | **\*1.35 ± 0.340** | **\*1.12 ± 0.165** | **\*0.959 ± 0.12** | **\*0.848 ± 0.0824** | **\*0.790 ± 0.0547** |
| MAML | 2.54 ± 1.29 | 1.90 ± 0.507 | 1.67 ± 0.313 | 1.63 ± 0.282 | 1.28 ± 0.272 | 1.20 ± 0.199 | 1.06 ± 0.111 |
| L2-SP | \*2.33 ± 0.509 | \*1.65 ± 0.378 | **\*1.35 ± 0.340** | **\*1.12 ± 0.165** | \*0.968 ± 0.114 | \*0.858 ± 0.0818 | \*0.802 ± 0.0535 |
| PAC-Net | 2.24 ± 0.706 | **\*1.61 ± 0.394** | \*1.43 ± 0.389 | \*1.24 ± 0.177 | \*1.18 ± 0.100 | 1.13 ± 0.0726 | 1.100 ± 0.0589 |

The following seven methods were compared, including two existing HTL procedures:

- Direct: Train a model using the target input x with no transfer.
- Only source: Train a model g(fs) using only the source features fs.
- Augmented: Perform a regression with the augmented input vector concatenating x and fs.
- HTL-offset [10]: Calculate the transformed output zi = yi − g_only(fs), where g_only(fs) is the model pre-trained using Only source, and train an additional model with input xi to predict zi.
- HTL-scale [12]: Calculate the transformed output zi = yi / g_only(fs), and train an additional model with input xi to predict zi.
- Affine TL-full: Train the model g1 + g2·g3.
- Affine TL-const: Train the model g1 + g3.

Kernel ridge regression with the Gaussian kernel exp(−‖x − x′‖²/(2ℓ²)) was used for each procedure. The scale parameter ℓ was fixed to the square root of the dimension of the input. The regularization parameter in the kernel ridge regression and λα, λβ, and λγ in the affine model transfer were selected through 5-fold cross-validation.
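As an illustration of the baselines above, HTL-scale with a Gaussian-kernel ridge learner (ℓ fixed to √d as described) can be sketched as follows. The synthetic data, the toy positive source output, and the regularization value are our own assumptions; in the actual experiments, hyperparameters were chosen by cross-validation.

```python
import numpy as np

# Sketch of the HTL-scale baseline: regress z = y / g_only(fs) on x with
# kernel ridge regression, Gaussian kernel exp(-||x-x'||^2 / (2 l^2)).

def gauss_kernel(A, B, ls):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * ls ** 2))

def krr_fit(X, y, lam):
    ls = np.sqrt(X.shape[1])                 # scale l fixed to sqrt(d)
    K = gauss_kernel(X, X, ls)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return lambda Xq: gauss_kernel(Xq, X, ls) @ alpha

rng = np.random.default_rng(0)
n, d = 50, 21
X = rng.normal(size=(n, d))
fs = np.abs(np.sin(X[:, :3]).sum(1)) + 0.5   # toy positive source output
y = fs * (1 + 0.2 * X[:, 0]) + 0.05 * rng.normal(size=n)

# HTL-scale: transform, fit the intermediate model, transform back
g = krr_fit(X, y / fs, lam=1e-2)
pred = fs * g(X)
rmse = np.sqrt(np.mean((pred - y) ** 2))
print(rmse)
```

Swapping the transformation to z = y − fs (and the inverse to addition) yields the HTL-offset baseline with the same learner.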
In addition to the seven feature-based methods, four weight-based TL methods were evaluated: fine-tuning, MAML [25], L2-SP [26], and PAC-Net [27]. Table 1 summarizes the prediction performance of these methods for varying numbers of training samples in two representative tasks: Torque 1 and Torque 7. The joint of Torque 1 is located closest to the root of the arm. Therefore, the learning task for Torque 1 is less related to those for the other joints, and transfer from Torques 2–6 to Torque 1 would not be expected to work. In fact, as shown in Table 1, no method showed a statistically significant improvement over Direct. In particular, Only source failed to acquire predictive ability, and HTL-offset and HTL-scale likewise showed poor prediction performance owing to the negative effect of the failed variable transformation. In contrast, the two affine transfer models showed almost the same predictive performance as Direct, which they include as a submodel, and successfully suppressed negative transfer. Because Torque 7 was measured at the joint closest to the end of the arm, its value depends strongly on those at the other six joints, and the procedures using the source features were more effective than in the other tasks. In particular, Affine TL achieved the best performance among the feature-based methods. This is consistent with the theoretical result that the transfer capability of the affine model transfer improves when the risk of learning using only the source features is sufficiently small. In Table S.3 in Section C.1, we present the results for all tasks. In most cases, Affine TL achieved the best performance among the feature-based methods. In several other cases, Direct produced the best results; in almost all cases, Only source and the two HTLs showed no advantage over Affine TL.
Comparing the weight-based and feature-based methods, we observed that the weight-based methods performed better with large sample sizes. Nevertheless, in scenarios with extremely small sample sizes (e.g., n = 5 or 10), Affine TL exhibited comparable or even superior performance. A strength of our method over weight-based TL, including fine-tuning, is that its performance does not degrade when the cross-domain relationship is weak. While fine-tuning outperformed our method on Torque 7, its performance degraded significantly as the source-target relationship became weaker, as seen in the Torque 1 case. In contrast, our method avoided negative transfer even in such cases. This characteristic is particularly beneficial because, in many cases, the degree of relatedness between the domains is not known in advance. Furthermore, weight-based methods can sometimes be unsuitable, especially when transferring knowledge from large models such as LLMs. In these scenarios, fine-tuning all parameters is infeasible, and feature-based TL is preferred. Our approach often outperforms other feature-based methods.

5.2 Evaluation of Scientific Documents

Through a case study in natural language processing, we compare the performance of the affine model transfer with that of ordinary feature extraction-based TL and show the advantage of being able to estimate domain shift and domain-specific factors separately. We used SciRepEval [24], a benchmark dataset of scientific documents. The dataset consists of abstracts, review scores, and decision statuses of papers submitted to various machine learning conferences. We focused on two primary tasks: a regression task to predict the average review score, and a binary classification task to determine the acceptance or rejection status of each paper. The original input x was represented by a two-gram bag-of-words vector of the abstract.
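A two-gram (bigram) bag-of-words representation of an abstract can be sketched as follows. This is a minimal, pure-Python illustration; the helper name `bigram_bow`, the toy abstracts, and the vocabulary construction are assumptions for the example, not the paper's preprocessing pipeline.

```python
from collections import Counter

def bigram_bow(text, vocab):
    # Represent a document by counts of word 2-grams over a fixed vocabulary.
    tokens = text.lower().split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    return [bigrams.get(b, 0) for b in vocab]

abstracts = [
    "we propose a new state of the art method",
    "recent advances in machine learning enable new tasks",
]
# Build the bigram vocabulary from the corpus (sorted for a stable ordering)
vocab = sorted({b for t in abstracts
                for b in zip(t.lower().split(), t.lower().split()[1:])})
X = [bigram_bow(t, vocab) for t in abstracts]
```

Each row of X is then a fixed-length count vector suitable as the target input x.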
For the source features f_s, we utilized text embeddings of the abstracts generated by pre-trained language models: BERT [28], SciBERT [29], T5 [30], and GPT-3 [31]. In the affine model transfer, we employed neural networks with two hidden layers to model g1 and g2, and a linear model for g3. For comparison, we also evaluated the performance of the ordinary feature extraction-based TL using a two-layer neural network with f_s as input. We used 8,166 training samples and evaluated the performance of the model on 2,043 test samples. Table 2 shows the root mean square error (RMSE) for the regression task and the accuracy for the classification task. In the regression task, the RMSEs of the affine model transfer were significantly improved over ordinary feature extraction for all four types of text embedding. We also observed improvements in accuracy for the classification task, even though the affine model transfer was derived on the basis of regression settings. While pre-trained language models have a remarkable ability to represent text quality and structure, their representations alone are not sufficient for prediction tasks on machine learning documents. The affine model transfer effectively bridged this gap by learning the additional target-specific factor via the target task, resulting in improved prediction performance in both regression and classification tasks. Table 3 provides a list of phrases that were estimated to have a positive or negative effect on the review scores. Because the network for g2 was restricted to output positive values, the influence of each phrase could be inferred from the estimated coefficients of the linear model g3. Specifically, phrases such as "tasks including" and "new state" were estimated to have positive influences on the predicted score.
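The predictor in this setup takes the form ŷ = g1(f_s) + g2(f_s) · g3(x), with g2 constrained to be positive. The following is a forward-pass sketch with untrained, randomly initialized parameters; for brevity the networks here have one hidden layer rather than two, positivity is enforced via softplus (one plausible choice, not necessarily the paper's), and all dimensions and names are illustrative.

```python
import numpy as np

def softplus(a):
    # Smooth positive map: log(1 + exp(a)) > 0 for all a
    return np.log1p(np.exp(a))

def mlp(h, W1, b1, W2, b2):
    # One-hidden-layer perceptron with tanh activation
    return np.tanh(h @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_x, d_s, d_h = 6, 4, 8
x  = rng.normal(size=(5, d_x))    # target inputs
fs = rng.normal(size=(5, d_s))    # source features (e.g., text embeddings)

# Randomly initialized parameters (training, e.g. by SGD, is omitted)
p1 = [rng.normal(size=(d_s, d_h)), np.zeros(d_h), rng.normal(size=(d_h, 1)), np.zeros(1)]
p2 = [rng.normal(size=(d_s, d_h)), np.zeros(d_h), rng.normal(size=(d_h, 1)), np.zeros(1)]
w3 = rng.normal(size=(d_x, 1))    # linear model g3

g1 = mlp(fs, *p1)                 # offset term depending on source features
g2 = softplus(mlp(fs, *p2))       # positive scale term
g3 = x @ w3                       # target-specific linear factor
y_hat = g1 + g2 * g3              # affine model transfer: g1 + g2 * g3
```

Because g3 is linear and g2 is positive, the sign of each coefficient in w3 directly indicates whether the corresponding input feature (here, a phrase count) pushes the prediction up or down, which is what enables the phrase analysis in Table 3.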
These phrases often appear in contexts such as "demonstrated on a wide range of tasks including" or "establishing a new state-of-the-art result," suggesting that superior experimental results tend to yield higher peer review scores. In addition, the phrase "theoretical analysis" was also identified as having a positive effect on the review score, reflecting the significance of theoretical validation in machine learning research. In contrast, general phrases with broader meanings, such as "recent advances" and "machine learning," contributed to lower scores. This observation suggests the importance of explicitly stating the novelty and uniqueness of research findings and of refraining from generic terminology.

Table 2: Prediction performance of peer review scores and acceptance/rejection for submitted papers. The mean and standard deviation of the RMSE and accuracy are reported for the affine model transfer (Affine TL) and feature extraction (FE). Definitions of asterisks and boldface letters are the same as in Table 1.

                     Regression         Classification
BERT     FE          1.3086 ± 0.0035    0.6250 ± 0.0217
         Affine TL  *1.3069 ± 0.0042    0.6252 ± 0.0163
SciBERT  FE          1.2856 ± 0.0144    0.6520 ± 0.0106
         Affine TL  *1.2797 ± 0.0122    0.6507 ± 0.0124
T5       FE          1.3486 ± 0.0175    0.6344 ± 0.0079
         Affine TL  *1.3442 ± 0.0030    0.6366 ± 0.0065
GPT-3    FE          1.3284 ± 0.0138    0.6279 ± 0.0181
         Affine TL  *1.3234 ± 0.0140   *0.6386 ± 0.0095

Table 3: Phrases with the top and bottom ten regression coefficients for g3 in the affine transfer model for the regression task with SciBERT.
Rank  Positive              Negative
1     tasks including       recent advances
2     new state             novel approach
3     high quality          latent space
4     recently proposed     learning approach
5     latent variable       neural architecture
6     number parameters     machine learning
7     theoretical analysis  attention mechanism
8     policy gradient       reinforcement learning
9     inductive bias        proposed framework
10    image generation      descent sgd

As illustrated in this example, integrating modern deep learning techniques with highly interpretable transfer models through the mechanism of the affine model transfer not only enhances prediction performance but also provides valuable insights into domain-specific factors.

5.3 Case Studies in Materials Science

We conducted two additional case studies, both pertaining to scientific tasks in materials science. One experiment examines the relationship between qualitative differences in source features and the learning behavior of the affine model transfer. In the other, we demonstrate the potential utility of the affine model transfer as a calibration tool bridging computational models and real-world systems. In particular, we highlight the benefits of separately modeling and estimating domain-specific factors through a case study in polymer chemistry. The objective is to predict the specific heat capacity at constant pressure of any given organic polymer from the chemical structure of its repeating unit. Specifically, we conduct TL to bridge the gap between experimental values and physical properties calculated from molecular dynamics simulations. Details are given in Section B of the Supplementary Material.

6 Conclusions

In this study, we introduced a general class of TL based on affine model transformations and clarified their learning capability and applicability. The proposed affine model transformation was shown to be an optimal class that minimizes the expected squared loss in the HTL procedure.
The model contrasts with widely applied TL methods, such as reusing features from pre-trained models, which lack a theoretical foundation. The affine model transfer is model-agnostic; it is easily combined with any machine learning model, feature set, or physical model. Furthermore, in the model, domain-specific factors are involved in the incorporation of the source features. This property gives the affine transfer the ability to handle domain-common and domain-specific factors simultaneously and separately. The advantages of the model were verified theoretically and experimentally in this study. We showed theoretical results on the generalization bound and excess risk bound when the regression tasks are solved by kernel methods. If the source features are strongly related to the target domain, the convergence rate of the generalization bound improves over naive learning. The excess risk of the proposed TL is evaluated using the eigenvalue decay of the product kernel, which also illustrates the effect of the overlap between the source and target tasks. In our numerical studies, the affine model transfer generally achieved lower test errors when the target and source tasks were similar. The NLP example also showed that the proposed affine model transfer can identify phrases that are valuable, or detrimental, for high-quality papers, owing to its affine representation of the cross-domain shift and domain-specific factors.

Acknowledgments and Disclosure of Funding

This work was supported by JST SPRING Grant No. JPMJSP2104, JST CREST Grants No. JPMJCR22O3 and No. JPMJCR19I3, MEXT KAKENHI Grant-in-Aid for Scientific Research on Innovative Areas (Grant No. 19H50820), the Grant-in-Aid for Scientific Research (A) (Grant No. 19H01132) and Grant-in-Aid for Research Activity Start-up (Grant No. 23K19980) from the Japan Society for the Promotion of Science (JSPS), and the MEXT Program for Promoting Researches on the Supercomputer Fugaku (No. hp210264).
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Communications of the ACM, vol. 60, pp. 84–90, 2012.
[2] G. Csurka, Domain adaptation for visual applications: A comprehensive survey, arXiv, vol. abs/1702.05374, 2017.
[3] S. Ruder, M. E. Peters, S. Swayamdipta, and T. Wolf, Transfer learning in natural language processing, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pp. 15–18, 2019.
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv, vol. abs/1810.04805, 2019.
[5] R. K. Sevakula, V. Singh, N. K. Verma, C. Kumar, and Y. Cui, Transfer learning for molecular cancer classification using deep neural networks, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 16, pp. 2089–2100, 2019.
[6] H. Yamada, C. Liu, S. Wu, Y. Koyama, S. Ju, J. Shiomi, J. Morikawa, and R. Yoshida, Predicting materials properties with little data using shotgun transfer learning, ACS Central Science, vol. 5, pp. 1717–1730, 2019.
[7] S. Wu, Y. Kondo, M. Kakimoto, B. Yang, H. Yamada, I. Kuwajima, G. Lambard, K. Hongo, Y. Xu, J. Shiomi, et al., Machine-learning-assisted discovery of polymers with high thermal conductivity using a molecular design algorithm, npj Computational Materials, vol. 5, no. 1, pp. 1–11, 2019.
[8] S. Ju, R. Yoshida, C. Liu, K. Hongo, T. Tadano, and J. Shiomi, Exploring diamond-like lattice thermal conductivity crystals via feature-based transfer learning, Physical Review Materials, vol. 5, no. 5, p. 053801, 2021.
[9] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, How transferable are features in deep neural networks?, Advances in Neural Information Processing Systems, vol. 27, 2014.
[10] I. Kuzborskij and F. Orabona, Stability and hypothesis transfer learning, International Conference on Machine Learning, pp.
942–950, 2013.
[11] I. Kuzborskij and F. Orabona, Fast rates by transferring from auxiliary hypotheses, Machine Learning, vol. 106, no. 2, pp. 171–195, 2017.
[12] S. S. Du, J. Koushik, A. Singh, and B. Póczos, Hypothesis transfer learning via transformation functions, Advances in Neural Information Processing Systems, vol. 30, 2017.
[13] S. Minami, S. Liu, S. Wu, K. Fukumizu, and R. Yoshida, A general class of transfer learning regression without implementation cost, Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 8992–8999, 2021.
[14] S. Liu and K. Fukumizu, Estimating posterior ratio for classification: Transfer learning from probabilistic perspective, Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 747–755, 2016.
[15] N. Tripuraneni, M. Jordan, and C. Jin, On the theory of transfer learning: The importance of task diversity, Advances in Neural Information Processing Systems, vol. 33, pp. 7852–7862, 2020.
[16] L. Dinh, D. Krueger, and Y. Bengio, NICE: Non-linear independent components estimation, arXiv, vol. abs/1410.8516, 2014.
[17] L. Dinh, J. N. Sohl-Dickstein, and S. Bengio, Density estimation using Real NVP, International Conference on Learning Representations, 2017.
[18] T. Teshima, I. Ishikawa, K. Tojo, K. Oono, M. Ikeda, and M. Sugiyama, Coupling-based invertible neural networks are universal diffeomorphism approximators, Advances in Neural Information Processing Systems, vol. 33, pp. 3362–3373, 2020.
[19] H. Zhou, L. Li, and H. Zhu, Tensor regression with applications in neuroimaging data analysis, Journal of the American Statistical Association, vol. 108, no. 502, pp. 540–552, 2013.
[20] P. L. Bartlett, O. Bousquet, and S. Mendelson, Local Rademacher complexities, Annals of Statistics, vol. 33, pp. 1497–1537, 2005.
[21] I. Steinwart and A. Christmann, Support Vector Machines. Springer Science & Business Media, 2008.
[22] I. Steinwart, D. R. Hush, and C.
Scovel, Optimal rates for regularized least squares regression, Proceedings of the 22nd Annual Conference on Learning Theory, pp. 79–93, 2009.
[23] C. K. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning, vol. 2. MIT Press, Cambridge, MA, 2006.
[24] A. Singh, M. D'Arcy, A. Cohan, D. Downey, and S. Feldman, SciRepEval: A multi-format benchmark for scientific document representations, arXiv, vol. abs/2211.13308, 2022.
[25] C. Finn, P. Abbeel, and S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, International Conference on Machine Learning, 2017.
[26] L. Xuhong, Y. Grandvalet, and F. Davoine, Explicit inductive bias for transfer learning with convolutional networks, International Conference on Machine Learning, pp. 2825–2834, 2018.
[27] S. Myung, I. Huh, W. Jang, J. M. Choe, J. Ryu, D. Kim, K.-E. Kim, and C. Jeong, PAC-Net: A model pruning approach to inductive transfer learning, International Conference on Machine Learning, pp. 16240–16252, 2022.
[28] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018.
[29] I. Beltagy, K. Lo, and A. Cohan, SciBERT: A pretrained language model for scientific text, Conference on Empirical Methods in Natural Language Processing, 2019.
[30] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
[31] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[32] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, vol. 47.
Cambridge University Press, 2018.
[33] S. Wu, G. Lambard, C. Liu, H. Yamada, and R. Yoshida, iQSPR in XenonPy: A Bayesian molecular design algorithm, Molecular Informatics, vol. 39, no. 1-2, p. 1900107, 2020.
[34] C. Liu, K. Kitahara, A. Ishikawa, T. Hiroto, A. Singh, E. Fujita, Y. Katsura, Y. Inada, R. Tamura, K. Kimura, and R. Yoshida, Quasicrystals predicted and discovered by machine learning, Physical Review Materials, vol. 7, p. 093805, 2023.
[35] C. Liu, E. Fujita, Y. Katsura, Y. Inada, A. Ishikawa, R. Tamura, K. Kimura, and R. Yoshida, Machine learning to predict quasicrystals from chemical compositions, Advanced Materials, vol. 33, no. 36, p. 2102507, 2021.
[36] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations, 2015.
[37] J. Wang, R. M. Wolf, J. W. Caldwell, P. A. Kollman, and D. A. Case, Development and testing of a general amber force field, Journal of Computational Chemistry, vol. 25, no. 9, pp. 1157–1174, 2004.
[38] Y. Hayashi, J. Shiomi, J. Morikawa, and R. Yoshida, RadonPy: Automated physical property calculation using all-atom classical molecular dynamics simulations for polymer informatics, npj Computational Materials, vol. 8, no. 222, 2022.
[39] S. Otsuka, I. Kuwajima, J. Hosoya, Y. Xu, and M. Yamazaki, PoLyInfo: Polymer database for polymeric materials design, 2011 International Conference on Emerging Intelligent Data and Web Technologies, pp. 22–29, 2011.
[40] M. Kusaba, Y. Hayashi, C. Liu, A. Wakiuchi, and R. Yoshida, Representation of materials by kernel mean embedding, Physical Review B, vol. 108, p. 134107, 2023.
[41] J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research, vol. 12, no. 7, 2011.
[42] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations, 2015.
[43] F. Pedregosa, G. Varoquaux, A. Gramfort, V.
Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[44] M. Mohri, A. Rostamizadeh, and A. S. Talwalkar, Foundations of Machine Learning. MIT Press, 2018.
[45] N. Srebro, K. Sridharan, and A. Tewari, Smoothness, low noise and fast rates, Advances in Neural Information Processing Systems, vol. 23, 2010.
[46] M. Kloft and G. Blanchard, The local Rademacher complexity of ℓp-norm multiple kernel learning, Advances in Neural Information Processing Systems, vol. 24, 2011.

Supplementary Material
Transfer Learning with Affine Model Transformation

A Other Perspectives on Affine Model Transfer

A.1 Transformation Functions for General Loss Functions

Here we discuss the optimal transformation function for general loss functions. Let ℓ(y, y′) ≥ 0 be a convex loss function that returns zero if and only if y = y′, and let g∗(x) be the optimal predictor that minimizes the expectation of ℓ with respect to the target distribution p_t of x and y, with y transformed by ϕ_{f_s}:

    g∗(x) = argmin_g  E_{p_t}[ ℓ(g(x), ϕ_{f_s}(y)) ].

The function g that minimizes the expected loss

    E_{p_t}[ ℓ(g(x), ϕ_{f_s}(y)) ] = ∬ ℓ(g(x), ϕ_{f_s}(y)) p_t(x, y) dx dy

should be a solution to the Euler–Lagrange equation

    ∂/∂g(x) ∫ ℓ(g(x), ϕ_{f_s}(y)) p_t(x, y) dy = ( ∫ ∂/∂g(x) ℓ(g(x), ϕ_{f_s}(y)) p_t(y|x) dy ) p_t(x) = 0.    (S.1)

Denote the solution of Eq. (S.1) by G(x; ϕ_{f_s}). While G depends on the loss ℓ and the distribution p_t, we omit these from the arguments for notational simplicity. Using this function, the minimizer of the expected loss E_{x,y}[ℓ(g(x), y)] can be expressed as G(x; id), where id denotes the identity function. We assume the following, which generalizes Assumption 2.3 in the main text: Assumption 2.3(b).
For any distribution p_t(x, y) on the target domain and all x ∈ X, the following relationship holds:

    ψ_{f_s}(g∗(x)) = argmin_g E_{x,y}[ ℓ(g(x), y) ].

Equivalently, the transformation functions ϕ_{f_s} and ψ_{f_s} satisfy

    ψ_{f_s}( G(x; ϕ_{f_s}) ) = G(x; id).    (S.2)

Assumption 2.3(b) states that if the optimal predictor G(x; ϕ_{f_s}) for the data transformed by ϕ is passed to the model transformation function ψ, the result coincides with the overall optimal predictor G(x; id) on the target domain with respect to the loss ℓ. We consider all pairs of ψ and ϕ that satisfy this consistency condition. We now state the following proposition:

Proposition A.1. Under Assumptions 2.1, 2.2 and 2.3(b), ψ_{f_s}^{-1} = ϕ_{f_s}.

Proof. The proof is analogous to that of Theorem 2.4 in Section D.1. For any y_0 ∈ Y, let p_t(y|x) = δ_{y_0}. Combining this with Eq. (S.1) leads to

    ∂/∂g(x) ℓ(g(x), ϕ_{f_s}(y_0)) = 0    (∀ y_0 ∈ Y).

Because ℓ(y, y′) attains its minimum value zero if and only if y = y′, we obtain G(x; ϕ_{f_s}) = ϕ_{f_s}(y_0). Similarly, we have G(x; id) = y_0. From these two facts and Assumption 2.3(b), we have ψ_{f_s}(ϕ_{f_s}(y_0)) = y_0, which proves the proposition.

Proposition A.1 indicates that the first statement of Theorem 2.4 holds for general loss functions. However, the second claim of Theorem 2.4 generally depends on the type of loss function. Through the following examples, we describe the optimal class of transformation functions for several loss functions.

Example 1 (Squared loss). Let ℓ(y, y′) = |y − y′|². As a solution of Eq. (S.1), the optimal predictor is the conditional expectation E_{p_t}[ ϕ_{f_s}(Y) | X = x ]. As discussed in Section 2.1, the transformation functions ϕ_{f_s} and ψ_{f_s} should then be affine transformations.

Example 2 (Absolute loss). Let ℓ(y, y′) = |y − y′|. Substituting this into Eq. (S.1), we have

    ∂/∂g(x) ∫ |g(x) − ϕ_{f_s}(y)| p_t(y|x) dy = ∫ sign(g(x) − ϕ_{f_s}(y)) p_t(y|x) dy
        = ∫_{ϕ_{f_s}(y) ≤ g(x)} p_t(y|x) dy − ∫_{ϕ_{f_s}(y) > g(x)} p_t(y|x) dy