# PCA-based Multi-Task Learning: a Random Matrix Approach

Malik Tiomoko 1, Romain Couillet 2, Frédéric Pascal 3

1 Huawei Noah's Ark Lab, Paris, France. 2 LIG-Lab, Université de Grenoble Alpes, France. 3 L2S, CentraleSupélec, France. Correspondence to: Malik Tiomoko.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

The article proposes and theoretically analyses a computationally efficient multi-task learning (MTL) extension of popular principal component analysis (PCA)-based supervised learning schemes (Barshan et al., 2011; Bair et al., 2006). The analysis reveals that (i) by default, learning may dramatically fail by suffering from negative transfer, but that (ii) simple counter-measures on data labels avert negative transfer and necessarily result in improved performances. Supporting experiments on synthetic and real data benchmarks show that the proposed method achieves comparable performance with state-of-the-art MTL methods but at a significantly reduced computational cost.

1. Introduction

From single to multiple task learning. Advanced supervised machine learning algorithms require large amounts of labelled samples to achieve high accuracy, which is often too demanding in practice. Multi-task learning (MTL) (Caruana, 1997; Zhang & Yang, 2018; 2021) and transfer learning provide a potent workaround by appending extra, somewhat similar, datasets to the scarce available dataset of interest. The additional data possibly being of a different nature, MTL effectively solves multiple tasks in parallel while exploiting task relatedness to enforce collaborative learning.

State-of-the-art of MTL. To proceed, MTL solves multiple related tasks and introduces shared hyperparameters or feature spaces optimized to improve the performance of the individual tasks. The crux of efficient MTL lies in both enforcing and, most importantly, evaluating task relatedness: this, in general, is highly non-trivial as it implies theoretically identifying the common features of the datasets. Several heuristics have been proposed, which may be split into two groups: parameter- versus feature-based MTL. In parameter-based MTL, the tasks are assumed to share common hyperparameters (Evgeniou & Pontil, 2004; Xu et al., 2013) (e.g., separating hyperplanes in a support vector machine (SVM) flavor) or hyperparameters derived from a common prior distribution (Zhang & Yeung, 2012; 2014). Classical learning mechanisms (SVM, logistic regression, etc.) can be appropriately turned into an MTL version by enforcing parameter relatedness: (Evgeniou & Pontil, 2004; Xu et al., 2013; Parameswaran & Weinberger, 2010) respectively adapt the SVM, least square-SVM (LS-SVM), and large margin nearest neighbor (LMNN) methods into an MTL paradigm. In feature-based MTL, the data are instead assumed to share a common low-dimensional representation, which needs to be identified: through sparse coding, deep neural network embeddings, principal component analysis (PCA) (Argyriou et al., 2008; Maurer et al., 2013; Zhang et al., 2016; Pan et al., 2010), or simply by feature selection (Obozinski et al., 2006; Wang & Ye, 2015; Gong et al., 2012).

The negative transfer plague. A strong limitation of MTL methods is their lack of theoretical tractability: as a result, the biases inherent to the base methods (SVM, LS-SVM, deep nets) are exacerbated in MTL.
A major consequence is that many of these heuristic MTL schemes suffer from negative transfer, i.e., cases where MTL performs worse than a single-task approach (Rosenstein et al., 2005; Long et al., 2013); this often occurs when task relatedness is weaker than assumed, and MTL enforces fictitious similarities.

A large dimensional analysis to improve MTL. To fill these gaps, this work focuses on a large dimensional random matrix setting (El Karoui, 2018) to provide an exact (asymptotic) performance evaluation of an elementary (yet powerful) PCA-based MTL approach. It is worth noticing that, although the proposed framework is asymptotic, it has proven very efficient for small dimensions and small numbers of samples in practice (see e.g. (Couillet & Liao, 2022)). This analysis conveys insights into the MTL inner workings, providing an optimal data labelling scheme to fully avert negative transfer. More fundamentally, the choice of investigating PCA-based MTL results from realizing that the potential gains incurred by a proper theoretical adaptation of simple algorithms vastly outweigh the losses incurred by biases and negative transfer in more complex and elaborate methods (see performance tables in the article). As a result, the article's main contribution lies in achieving high-performance MTL at low computational cost when compared to competitive methods. This finding goes in the direction of the compellingly needed development of cost-efficient and environment-friendly AI solutions (Lacoste et al., 2019; Strubell et al., 2019; Henderson et al., 2020).

Article contributions. In detail, our main contributions may be listed as follows:

- We theoretically compare the performance of two natural PCA-based single-task supervised learning schemes (PCA and SPCA) and justify the uniform superiority of SPCA. As a by-product, this work formally provides the optimal choice of dimension for PCA and SPCA;
- As a consequence, we propose a natural extension of SPCA to multi-task learning, for which we also provide an asymptotic performance analysis;
- The latter analysis (i) theoretically grasps the transfer learning mechanism at play, (ii) exhibits the relevant information being transferred, and (iii) harnesses the sources of negative transfer;
- This threefold analysis unfolds in a counter-intuitive improvement of SPCA-MTL based on an optimal data label adaptation (not set to ±1, which is the very source of negative transfer); the label adaptation depends on the optimization target, changes from task to task, and can be efficiently computed before running the SPCA-MTL algorithm;
- Synthetic and real data experiments support the competitive SPCA-MTL results compared to state-of-the-art MTL methods; these experiments most crucially show that high-performance levels can be achieved at significantly lower computational costs.

Supplementary material. The proofs and Matlab codes to reproduce our main results and simulations, along with theoretical extensions and additional supporting results, are provided in the supplementary material.

Notation. $p$ stands for the data dimension, $n$ for the number of data, $m$ for the number of classes and $k$ for the number of tasks. Vectors (resp. matrices) are denoted by bold-faced lowercase letters (resp. uppercase letters). $e^{[n]}_m \in \mathbb{R}^n$ is the canonical vector of $\mathbb{R}^n$ with $[e^{[n]}_m]_i = \delta_{mi}$. Moreover, $e^{[mk]}_{ij} = e^{[mk]}_{m(i-1)+j}$.
2. Related works

A series of supervised (single-task) learning methods were proposed which rely on PCA (Barshan et al., 2011; Ritchie et al., 2019; Zhang et al., 2021; Ghojogh & Crowley, 2019): the central idea is to project the available data onto a shared low-dimensional space, thus ignoring individual data variations. These algorithms are generically coined supervised principal component analysis (SPCA). Their performances are, however, difficult to grasp as they require understanding the statistics of the PCA eigenvectors: only recently have large dimensional statistics, and specifically random matrix theory, provided the first insights into the behavior of eigenvalues and eigenvectors of sample covariance and kernel matrices (Benaych-Georges & Nadakuditi, 2012; Johnstone, 2001; Baik & Silverstein, 2006; Lee et al., 2010; Paul, 2007). To the best of our knowledge, none of these works have drawn an analysis of SPCA: the closest work is likely (Ashtiani & Ghodsi, 2015), which however only provides statistical bounds on performance rather than exact results.

On the MTL side, several methods were proposed under unsupervised (Long et al., 2016; Saito et al., 2018; Baktashmotlagh et al., 2013), semi-supervised (Rei, 2017; Liu et al., 2007) and supervised (parameter-based (Tiomoko et al., 2020; Evgeniou & Pontil, 2004; Xu et al., 2013; Ando & Zhang, 2005) or feature-based (Argyriou et al., 2008; Liu et al., 2012)) flavors. Although most of these works generally achieve satisfying performances on both synthetic and real data, few theoretical analyses and guarantees exist, so instances of negative transfer are likely to occur. To be exhaustive, we must mention that, for specific types of data (images, text, time series) and under the availability of numerous labelled samples, deep learning MTL methods have recently been devised (Ruder, 2017). These are, however, at odds with the article's requirement to leverage scarce labelled samples and to be valid for generic inputs (beyond images or texts). These methods cannot be compared on even grounds with those discussed in the present study.(1)

(1) But nothing prevents us from exploiting data features extracted from pre-trained deep nets.

3. Supervised principal component analysis: single task preliminaries

Before delving into PCA-based MTL, first results on large dimensional PCA-based single-task learning for a training set $X = [x_1, \dots, x_n] \in \mathbb{R}^{p\times n}$ of $n$ samples of dimension $p$ are needed. To each $x_i \in \mathbb{R}^p$ is attached a (possibly multivariate) label $y_i$: in a binary class setting, $y_i \in \{-1, 1\}$, while for $m \ge 3$ classes, $y_i = e^{[m]}_j \in \mathbb{R}^m$, the canonical vector of the corresponding class $j$.

PCA in supervised learning. Let us first recall that, applied to $X$, PCA identifies a subspace of $\mathbb{R}^p$, say the span of the columns of $U = [u_1, \dots, u_\tau] \in \mathbb{R}^{p\times\tau}$ ($\tau \le p$), which maximizes the variance of the data when projected on the subspace, i.e., $U$ solves:
$$\max_{U\in\mathbb{R}^{p\times\tau}} \operatorname{tr}\Big(U^\top \frac{XX^\top}{p} U\Big) \quad\text{subject to}\quad U^\top U = I_\tau.$$
The solution is the collection of the eigenvectors associated with the $\tau$ largest eigenvalues of $\frac{XX^\top}{p}$. To predict the label $y$ of a test data vector $x$, a simple method to exploit PCA consists in projecting $x$ onto the PCA subspace $U$ and in performing classification in the projected space. This has the strong advantage of providing a (possibly dramatic) dimensionality reduction (from $p$ to $\tau$) to supervised learning mechanisms, thus improving cost efficiency while mitigating the loss incurred by the dimension reduction.
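For concreteness, the following minimal NumPy sketch illustrates this PCA-then-classify pipeline on a toy isotropic Gaussian mixture (the authors' reference code is in Matlab; the function names, data sizes and the nearest-projected-mean classifier used here are illustrative assumptions, not the article's exact implementation).

```python
import numpy as np

def pca_projector(X, tau):
    """Top-tau eigenvectors of X X^T / p (PCA subspace), X of shape (p, n)."""
    p = X.shape[0]
    eigval, eigvec = np.linalg.eigh(X @ X.T / p)     # ascending eigenvalues
    return eigvec[:, -tau:][:, ::-1]                 # U in R^{p x tau}

def nearest_mean_classify(Z_train, labels, Z_test):
    """Assign each projected test point to the class with the closest projected mean."""
    classes = np.unique(labels)
    means = np.stack([Z_train[:, labels == c].mean(axis=1) for c in classes])
    d = ((Z_test.T[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    return classes[d.argmin(axis=1)]

# toy 2-class example with means +/- mu and identity covariance (Assumption 3.1 flavor)
rng = np.random.default_rng(0)
p, n_c = 100, 500
mu = np.zeros(p); mu[0] = 1.0
X = np.concatenate([rng.normal(+mu[:, None], 1.0, (p, n_c)),
                    rng.normal(-mu[:, None], 1.0, (p, n_c))], axis=1)
y = np.array([0] * n_c + [1] * n_c)
X_test = np.concatenate([rng.normal(+mu[:, None], 1.0, (p, 200)),
                         rng.normal(-mu[:, None], 1.0, (p, 200))], axis=1)
y_test = np.array([0] * 200 + [1] * 200)
U = pca_projector(X, tau=2)
acc = (nearest_mean_classify(U.T @ X, y, U.T @ X_test) == y_test).mean()
print(f"PCA + nearest-projected-mean accuracy: {acc:.3f}")
```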
Yet, the PCA step is fully unsupervised and does not exploit the available class information. It is instead proposed in (Barshan et al., 2011; Chao et al., 2019) to trade $U$ for a more representative projector $V$ which maximizes the dependence between the projected data $V^\top X$ and the output labels $Y = [y_1, \dots, y_n]^\top \in \mathbb{R}^{n\times m}$. To this end, (Barshan et al., 2011) exploits the Hilbert-Schmidt independence criterion (Gretton et al., 2005), with corresponding optimization
$$\max_{V\in\mathbb{R}^{p\times\tau}} \operatorname{tr}\Big(V^\top \frac{XYY^\top X^\top}{np} V\Big) \quad\text{subject to}\quad V^\top V = I_\tau.$$
This results in the Supervised PCA (SPCA) projector, the solution $V = V(Y)$ of which is the concatenation of the $\tau$ dominant eigenvectors of $\frac{XYY^\top X^\top}{np}$. Subsequent learning (by SVMs, empirical risk minimizers, discriminant analysis, etc.) is then applied to the projected training $V^\top x_i$ and test $V^\top x$ data.
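A minimal sketch of the SPCA projector follows (illustrative Python, not the authors' Matlab code). Since $XYY^\top X^\top/(np)$ has rank at most $m$, its dominant eigenvectors can be read off the left singular vectors of $XY$, which avoids forming a $p\times p$ matrix; the `one_hot` helper is an assumption on how labels are encoded.

```python
import numpy as np

def spca_projector(X, Y, tau=None):
    """Dominant eigenvectors of X Y Y^T X^T / (n p).

    X : (p, n) data matrix, Y : (n, m) label/score matrix.
    The matrix has rank at most m, so the projector is obtained from the
    left singular vectors of X @ Y (cheaper than a p x p eigensolver).
    """
    p, n = X.shape
    m = Y.shape[1]
    tau = m if tau is None else tau
    U, s, _ = np.linalg.svd(X @ Y / np.sqrt(n * p), full_matrices=False)
    return U[:, :tau]                      # V in R^{p x tau}

def one_hot(labels, m):
    """Canonical-vector labels e_j^{[m]}; in the binary case y_i = +/-1 works as well."""
    Y = np.zeros((len(labels), m))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y
```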
Large dimensional analysis of SPCA. To best grasp the performance of PCA- or SPCA-based learning, assume the data arise from a large dimensional $m$-class Gaussian mixture.(2)

Assumption 3.1 (Distribution of X). The columns of $X$ are independent random vectors with $X = [X_1, \dots, X_m]$, $X_j = [x^{(j)}_1, \dots, x^{(j)}_{n_j}] \in \mathbb{R}^{p\times n_j}$, for $x^{(j)}_i \sim \mathcal{N}(\mu_j, I_p)$, also denoted $x^{(j)}_i \in \mathcal{C}_j$, and $M \equiv [\mu_1, \dots, \mu_m] \in \mathbb{R}^{p\times m}$. Note that $\sum_{j=1}^m n_j = n$.

(2) To obtain simpler intuitions, we consider here an isotropic Gaussian mixture model (i.e., with identity covariance). This strong constraint is relaxed in the supplementary material, where arbitrary covariances are considered; the results only marginally alter the main conclusions.

Recent works in random matrix theory (Seddik et al., 2020; Louart & Couillet, 2018) suggest that the technical arguments used in this paper extend to the broader family of concentrated random vectors, which covers a wide range of realistic random vectors, including Generative Adversarial Network images. Moreover, the experiments on image and text classification suggest that the intuitions drawn here remain robust on real data.

Assumption 3.2 (Growth Rate). As $n \to \infty$, $p/n \to c_0 > 0$, the feature dimension $\tau$ is constant and, for $1 \le j \le m$, $n_j/n \to c_j > 0$; we denote $c = [c_1, \dots, c_m]^\top$ and $D_c = \operatorname{diag}(c)$. Besides, $\frac{1}{c_0} D_c^{\frac12} M^\top M D_c^{\frac12} \to \mathcal{M} \in \mathbb{R}^{m\times m}$.

The growth rate assumption states that $p$ and $n$ are of the same order of magnitude. This is different from, and more realistic than, the assumptions usually considered in classical statistics, where $n$ is very large compared to $p$. More importantly, the technical results are obtained at a convergence rate of order $1/\sqrt{p}$, which allows a smooth application to finite $p, n$. The assumption on the existence of the matrix $\mathcal{M}$ sets the difficulty of the problem as $p$ grows. Any other rate for the order of $M$ (scaling with $p$, for example) would lead to a trivial problem (tasks too easy or too difficult to solve).

Let us now show that, under this setting, SPCA is uniformly more discriminative on new data than PCA. As $n, p \to \infty$, the spectrum of $\frac1p XX^\top$ is subject to a phase transition phenomenon now well established in random matrix theory (Baik & Silverstein, 2006; Benaych-Georges & Nadakuditi, 2012). This result is crucial as the PCA vectors of $\frac1p XX^\top$ are only informative beyond the phase transition and otherwise can be considered pure noise.

Proposition 3.3 (Eigenvalue Phase transition). Under Assumptions 3.1-3.2, as $n, p \to \infty$, the empirical spectral measure $\frac1p\sum_{i=1}^p \delta_{\lambda_i}$ of the eigenvalues $\lambda_1 \ge \dots \ge \lambda_p$ of $\frac{XX^\top}{p}$ converges weakly, with probability one, to the Marchenko-Pastur law (Marchenko & Pastur, 1967) supported on $[(1-\sqrt{1/c_0})^2, (1+\sqrt{1/c_0})^2]$. Besides, for $1 \le i \le m$, and for $\ell_1 > \dots > \ell_m$ the eigenvalues of $\mathcal{M}$,(3)
$$\lambda_i \xrightarrow{a.s.} \begin{cases} 1 + \frac{1}{c_0} + \ell_i + \frac{1}{c_0\ell_i} \;\ge\; (1+\sqrt{1/c_0})^2, & \ell_i \ge \frac{1}{\sqrt{c_0}},\\ (1+\sqrt{1/c_0})^2, & \text{otherwise}, \end{cases} \qquad \lambda_{m+1} \xrightarrow{a.s.} (1+\sqrt{1/c_0})^2.$$

(3) We implicitly assume the $\ell_i$'s distinct for simplicity of exposition.

Proposition 3.3 states that, if $\ell_i \ge 1/\sqrt{c_0}$, the $i$-th largest eigenvalue of $\frac1p XX^\top$ separates from the main bulk of eigenvalues. These isolated eigenvalues are key to the proper functioning of PCA-based classification, as their corresponding eigenvectors are non-trivially related to the class discriminating statistics (here, the $\mu_j$'s). Consequently, $U^\top x \in \mathbb{R}^\tau$ also exhibits a phase transition phenomenon.

Theorem 3.4 (Asymptotic behavior of PCA projectors). Let $x \sim \mathcal{N}(\mu_j, I_p)$ independent of $X$. Then, under Assumptions 3.1-3.2, with $(\ell_i, \bar u_i)$ the decreasing (distinct) eigenpairs of $\mathcal{M}$, as $p, n \to \infty$, $U^\top x - G_j \to 0$ in probability, where $G_j \sim \mathcal{N}(m^{(\mathrm{pca})}_j, I_\tau)$ and
$$[m^{(\mathrm{pca})}_j]_i = \sqrt{\frac{c_0\ell_i^2 - 1}{\ell_i^2(\ell_i+1)}}\; \bar u_i^\top \mathcal{M} D_c^{-\frac12} e^{[m]}_j, \quad i \le \min(m,\tau) \text{ and } \ell_i \ge \frac{1}{\sqrt{c_0}}, \qquad [m^{(\mathrm{pca})}_j]_i = 0 \text{ otherwise.}$$

As such, only the projections on the eigenvectors of $\frac1p XX^\top$ attached to isolated eigenvalues carry informative discriminating features. Practically, for all $n, p$ large, it is thus useless to perform PCA on a larger dimension than the number of isolated eigenvalues, i.e., $\tau = \arg\max_{1\le i\le m}\{\ell_i \ge 1/\sqrt{c_0}\}$.

Consider now SPCA. Since $\frac{XYY^\top X^\top}{np}$ only has $m$ non-zero eigenvalues, no phase transition occurs: all eigenvalues are "isolated". Thus, one may take $\tau = m$ principal eigenvectors for the SPCA projection matrix $V$, which is quite likely informative.

Theorem 3.5 (Asymptotic behavior of SPCA projectors). Let $x \sim \mathcal{N}(\mu_j, I_p)$ independent of $X$. Then, under Assumptions 3.1-3.2, as $p, n \to \infty$, in probability, $V^\top x - g_j \to 0$, $g_j \sim \mathcal{N}(m^{(\mathrm{spca})}_j, I_\tau)$,
$$[m^{(\mathrm{spca})}_j]_i = \sqrt{1/(c_0\tilde\ell_i)}\;\tilde v_i^\top D_c^{\frac12}\mathcal{M} D_c^{-\frac12} e^{[m]}_j,$$
for $\tilde\ell_1 \ge \dots \ge \tilde\ell_m$ the eigenvalues of $D_c^{\frac12}\mathcal{M} D_c^{\frac12} + D_c$ and $\tilde v_1, \dots, \tilde v_m$ their associated eigenvectors.

Since both PCA and SPCA data projections $U^\top x$ and $V^\top x$ are asymptotically Gaussian and isotropic (i.e., with identity covariance), the oracle-best supervised learning performance only depends on the differences $m^{(\cdot)}_j - m^{(\cdot)}_{j'}$ ($(\cdot)$ being pca or spca). Being small dimensional (of dimension $\tau$), the vectors $m^{(\cdot)}_j$ can be consistently estimated from their associated empirical means, and are known in the large $n, p$ limit (with probability one).

Remark 3.6 (Consistent estimate of sufficient statistics). From Assumption 3.2, $c_j$ can be empirically estimated by $n_j/n$. This, in turn, provides a consistent estimate for $D_c$. Besides, as $n, p \to \infty$,
$$\frac{1}{n_j n_{j'}} 1_{n_j}^\top X_j^\top X_{j'} 1_{n_{j'}} \xrightarrow{a.s.} [M^\top M]_{jj'}, \; j \ne j', \qquad \frac{4}{n_j^2} 1_{n_j/2}^\top X_{j,1}^\top X_{j,2} 1_{n_j/2} \xrightarrow{a.s.} [M^\top M]_{jj},$$
where $X_j = [X_{j,1}, X_{j,2}] \in \mathbb{R}^{p\times n_j}$, with $X_{j,1}, X_{j,2} \in \mathbb{R}^{p\times (n_j/2)}$. Combining the results provides a consistent estimate for $\mathcal{M}$ as well as an estimate $\hat m^{(\cdot)}_j$ for the quantities $m^{(\cdot)}_j$, by replacing $c$ and $\mathcal{M}$ by their respective estimates in the definition of $m^{(\cdot)}_j$.

These results ensure the (large $n, p$) optimality of the classification decision rule for a test data $x$:
$$\hat j = \arg\min_{j\in\{1,\dots,m\}} \|U^\top x - \hat m^{(\mathrm{pca})}_j\|^2, \quad (1) \qquad\qquad \hat j = \arg\min_{j\in\{1,\dots,m\}} \|V^\top x - \hat m^{(\mathrm{spca})}_j\|^2. \quad (2)$$

As a consequence, the discriminating power of both PCA and SPCA directly relates to the limiting (squared) distances $\Delta m^{(\cdot)}_{(j,j')} \equiv \|m^{(\cdot)}_j - m^{(\cdot)}_{j'}\|^2$, for all pairs of class indices $1 \le j \ne j' \le m$, and the classification error $P(x \to \mathcal{C}_{j'} \mid x \in \mathcal{C}_j)$ satisfies
$$P(x \to \mathcal{C}_{j'} \mid x \in \mathcal{C}_j) = Q\Big(\tfrac12\sqrt{\Delta m^{(\cdot)}_{(j,j')}}\Big), \qquad Q(t) = \frac{1}{\sqrt{2\pi}}\int_t^\infty e^{-u^2/2}\,du.$$
In particular, and as confirmed by Figure 1, when $c_j = c_{j'}$, SPCA uniformly dominates PCA: $\Delta m^{(\mathrm{spca})}_{(j,j')} - \Delta m^{(\mathrm{pca})}_{(j,j')} \ge 0$.
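The sketch below illustrates how the consistent estimators of Remark 3.6 and the Gaussian error prediction above can be evaluated in practice (Python for illustration; `estimate_MtM` and `binary_error_prediction` are hypothetical helper names, and the error formula is the $Q(\frac12\|\,\cdot\,\|)$ prediction stated above, valid for unit-variance score distributions).

```python
import numpy as np
from scipy.stats import norm

def estimate_MtM(class_blocks):
    """Consistent estimate of M^T M (Remark 3.6).

    class_blocks : list of (p, n_j) arrays, one block X_j per class.
    Off-diagonal entries use cross-class sample means; diagonal entries use a
    split-half product so that the noise does not bias the estimate.
    """
    m = len(class_blocks)
    G = np.zeros((m, m))
    for j, Xj in enumerate(class_blocks):
        for jp, Xjp in enumerate(class_blocks):
            if j != jp:
                G[j, jp] = Xj.mean(axis=1) @ Xjp.mean(axis=1)
            else:
                half = Xj.shape[1] // 2
                G[j, j] = Xj[:, :half].mean(axis=1) @ Xj[:, half:].mean(axis=1)
    return G

def binary_error_prediction(m1, m2):
    """Predicted misclassification rate for two unit-variance Gaussian score
    distributions with mean vectors m1, m2: Q(||m1 - m2|| / 2)."""
    return norm.sf(0.5 * np.linalg.norm(np.asarray(m1) - np.asarray(m2)))
```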
For $m = 2$ classes, irrespective of $c_1, c_2$, the gap even takes an explicit form as a function of $n/p$ and $\|\Delta\mu\|^2$ only, where $\Delta\mu \equiv \mu_1 - \mu_2$, conveniently showing the influence of $n/p$ and of $\|\Delta\mu\|^2$ on the relative performance gap, which vanishes as the task gets easier ($\|\Delta\mu\|^2$ large) or as $n/p$ increases (so with more data).

[Figure 1. Theoretical (Th) vs. empirical (Emp) error for PCA- and SPCA-based binary classification: $x^{(\ell)}_i \sim \mathcal{N}((-1)^\ell\mu, I_p)$ ($\ell\in\{1,2\}$), $\mu = e^{[p]}_1$, $n_1 = n_2 = 500$. Averaged over 1 000 test samples.]

In summary, under a large dimensional setting, we showed that SPCA-based classification uniformly outperforms the PCA alternative, thus motivating the design of an SPCA-based MTL approach. Furthermore, the section not only justifies the superiority of SPCA over PCA qualitatively but, more importantly, quantitatively, by quantifying the gap and highlighting the elements that influence it (notably the norm of the difference of data means $\|\Delta\mu\|^2$ and the ratio $p/n$).

4. From single- to multi-task SPCA-based learning

4.1. Multi-task setting

Let now $X = [X_{[1]}, \dots, X_{[k]}] \in \mathbb{R}^{p\times n}$ be a collection of $n$ independent $p$-dimensional data vectors, divided into $k$ subsets attached to individual "tasks". Task $t$ is an $m$-class classification problem with training samples $X_{[t]} = [X_{[t]1}, \dots, X_{[t]m}] \in \mathbb{R}^{p\times n_t}$, with $X_{[t]j} = [x^{(j)}_{t1}, \dots, x^{(j)}_{tn_{tj}}] \in \mathbb{R}^{p\times n_{tj}}$ the $n_{tj}$ vectors of class $j \in \{1, \dots, m\}$. In particular, $n = \sum_{t=1}^k n_t$ for $n_t \equiv \sum_{j=1}^m n_{tj}$. To each $x^{(j)}_{t\ell} \in \mathbb{R}^p$ is attached a corresponding label (or score) $y^{(j)}_{t\ell} \in \mathbb{R}^m$. We denote in short $Y_t = [y^{(1)}_{t1}, \dots, y^{(m)}_{tn_t}]^\top \in \mathbb{R}^{n_t\times m}$ and $Y = [Y_1^\top, \dots, Y_k^\top]^\top \in \mathbb{R}^{n\times m}$ the matrix of all labels.

The natural MTL extension of SPCA would default $y^{(j)}_{t\ell} \in \mathbb{R}^m$ to the canonical vectors $e^{[m]}_j$ (or to $\pm1$ in the binary case). We depart here from this approach by explicitly not imposing a value for $y^{(j)}_{t\ell}$: this will be seen to be key to avert the problem of negative transfer. We only let $y^{(j)}_{t\ell} = \tilde y_{tj}$, for all $1 \le \ell \le n_{tj}$ and for some generic matrix $\tilde Y = [\tilde y_{11}, \dots, \tilde y_{km}]^\top \in \mathbb{R}^{mk\times m}$, i.e., we impose that $Y = J\tilde Y$, for $J = [j_{11}, \dots, j_{km}]$, where $j_{tj} = (0, \dots, 0, 1_{n_{tj}}^\top, 0, \dots, 0)^\top$. As in the single-task case, we work under a Gaussian mixture model for each class $\mathcal{C}_{tj}$.

Assumption 4.1 (Distribution of X). For class $j$ of task $t$, denoted $\mathcal{C}_{tj}$, $x^{(j)}_{t\ell} \sim \mathcal{N}(\mu_{tj}, I_p)$, for some $\mu_{tj} \in \mathbb{R}^p$. We further denote $M \equiv [\mu_{11}, \dots, \mu_{km}] \in \mathbb{R}^{p\times mk}$.

Assumption 4.2 (Growth Rate). As $n \to \infty$, $p/n \to c_0 > 0$ and, for $1 \le j \le m$ and $1 \le t \le k$, $n_{tj}/n \to c_{tj} > 0$. Denoting $c = [c_{11}, \dots, c_{km}]^\top \in \mathbb{R}^{km}$ and $D_c = \operatorname{diag}(c)$, $\frac{1}{c_0} D_c^{\frac12} M^\top M D_c^{\frac12} \to \mathcal{M} \in \mathbb{R}^{mk\times mk}$.

We are now able to present the article's main technical result.

Theorem 4.3 (MTL Supervised Principal Component Analysis). Let $x \sim \mathcal{N}(\mu_{tj}, I_p)$ independent of $X$ and $V \in \mathbb{R}^{p\times\tau}$ be the collection of the $\tau \le mk$ dominant eigenvectors of $\frac{XYY^\top X^\top}{np} \in \mathbb{R}^{p\times p}$. Then, under Assumptions 4.1-4.2, as $p, n \to \infty$, in probability, $V^\top x - g_{tj} \to 0$, $g_{tj} \sim \mathcal{N}(m_{tj}, I_\tau)$, for
$$[m_{tj}]_i = \sqrt{1/(c_0\bar\ell_i)}\;\bar v_i^\top (\tilde Y\tilde Y^\top)^{\frac12} D_c^{\frac12}\mathcal{M} D_c^{-\frac12} e^{[mk]}_{tj},$$
with $\bar\ell_1 > \dots > \bar\ell_{mk}$ the eigenvalues of $(\tilde Y\tilde Y^\top)^{\frac12}\big(D_c^{\frac12}\mathcal{M} D_c^{\frac12} + D_c\big)(\tilde Y\tilde Y^\top)^{\frac12}$ and $\bar v_1, \dots, \bar v_{mk}$ their eigenvectors.(4)

(4) For simplicity, we avoid the scenario where the eigenvalues $\bar\ell_j$ appear with multiplicity, which would require to gather the eigenvectors into eigenspaces. This would, in effect, only make the notations more cumbersome.

As in the single task case, despite the high dimension of the data statistics appearing in $V$, the asymptotic performance only depends on the (small) $mk\times mk$ matrices $\mathcal{M}$ and $D_c$, which here leverage the inter-task inter-class products $\mu_{tj}^\top\mu_{t'j'}$. This correlation between tasks, together with the labelling choice $\tilde Y$ (importantly recall that here $V = V(Y)$), influences the MTL performance.
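A minimal sketch of the construction used in Theorem 4.3 follows: building the membership matrix $J$, sharing one score $\tilde y_{tj}$ per (task, class) block, and extracting the projector $V$ (illustrative Python; the data layout, with samples stacked task by task and class by class, is an assumption of this sketch).

```python
import numpy as np

def build_J(n_tj):
    """Membership matrix J in R^{n x mk}: column (t, j) indicates the samples of
    class j in task t. n_tj : list (over tasks) of lists (over classes) of counts."""
    sizes = [s for task in n_tj for s in task]
    J = np.zeros((sum(sizes), len(sizes)))
    start = 0
    for col, s in enumerate(sizes):
        J[start:start + s, col] = 1.0
        start += s
    return J

def mtl_spca_projector(X, J, Y_tilde, tau):
    """tau dominant eigenvectors of X Y Y^T X^T / (np) with Y = J @ Y_tilde."""
    p, n = X.shape
    Y = J @ Y_tilde                           # scores shared within each (task, class) block
    V, _, _ = np.linalg.svd(X @ Y / np.sqrt(n * p), full_matrices=False)
    return V[:, :tau]
```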
The next section discusses how to optimally align $\tilde Y$ and $\mathcal{M}$ so as to maximize this performance. This, in addition to Remark 3.6 being still valid here (i.e., $c$ and $\mathcal{M}$ can be a priori consistently estimated), will unfold into our proposed asymptotically optimal MTL SPCA algorithm.

4.2. Binary classification and optimal labels

Let us focus on binary classification ($m = 2$) to obtain more explicit conclusions. In this case, $y = J\tilde y$, with $\tilde y \in \mathbb{R}^{2k}$ (rather than in $\mathbb{R}^{2k\times2}$) unidimensional. Here $\frac{Xyy^\top X^\top}{np}$ has for unique non-trivial eigenvector $v = Xy/\|Xy\|$ and $v^\top x$ is scalar.

Corollary 4.4 (Binary MTL Supervised Principal Component Analysis). Let $x \sim \mathcal{N}(\mu_{tj}, I_p)$ independent of $X$. Then, under Assumptions 4.1-4.2 and the above setting, as $p, n \to \infty$, $v^\top x - g_{tj} \to 0$, $g_{tj} \sim \mathcal{N}(m^{(\mathrm{bin})}_{tj}, 1)$, where
$$m^{(\mathrm{bin})}_{tj} = \frac{e^{[2k]\top}_{tj} D_c^{-\frac12}\mathcal{M} D_c^{\frac12}\,\tilde y}{\sqrt{\tilde y^\top\big(D_c^{\frac12}\mathcal{M} D_c^{\frac12} + D_c\big)\tilde y}}.$$

From Corollary 4.4, denoting $\hat m^{(\mathrm{bin})}_{t1}$ the natural consistent estimate for $m^{(\mathrm{bin})}_{t1}$ (as per Remark 3.6), the optimal class allocation decision for $x$ reduces to the averaged-mean test
$$v^\top x = v(y)^\top x \;\underset{\mathcal{C}_{t2}}{\overset{\mathcal{C}_{t1}}{\gtrless}}\; \frac{\hat m^{(\mathrm{bin})}_{t1} + \hat m^{(\mathrm{bin})}_{t2}}{2} \qquad (3)$$
with corresponding classification error rate $\epsilon_t \equiv \frac12 P(x\to\mathcal{C}_{t2}\mid x\in\mathcal{C}_{t1}) + \frac12 P(x\to\mathcal{C}_{t1}\mid x\in\mathcal{C}_{t2})$ (assuming equal prior class probability) given by
$$\epsilon_t = Q\Big(\frac{m^{(\mathrm{bin})}_{t1} - m^{(\mathrm{bin})}_{t2}}{2}\Big) + o(1). \qquad (4)$$
From the expression of $m^{(\mathrm{bin})}_{tj}$, the asymptotic performance clearly depends on a proper choice of $\tilde y$. This expression being quadratic in $\tilde y$, the $\epsilon_t$ minimizer $\tilde y = \tilde y^\star_{[t]}$ assumes a closed form:
$$\tilde y^\star_{[t]} \equiv \arg\max_{\tilde y\in\mathbb{R}^{2k}} \big(m^{(\mathrm{bin})}_{t1} - m^{(\mathrm{bin})}_{t2}\big)^2 = D_c^{-\frac12}(\mathcal{M} + I_{2k})^{-1}\mathcal{M} D_c^{-\frac12}(e_{t1} - e_{t2}). \qquad (5)$$
Letting $\hat{\tilde y}^\star_{[t]}$ be the natural consistent estimator of $\tilde y^\star_{[t]}$ (again from Remark 3.6), and updating $v = v(\hat{\tilde y}^\star_{[t]})$ accordingly, the corresponding (asymptotically) optimal value $\epsilon^\star_t$ of the error rate $\epsilon_t$ is
$$\epsilon^\star_t = Q\Big(\tfrac12\sqrt{(e^{[2k]}_{t1}-e^{[2k]}_{t2})^\top \mathcal{H}\,(e^{[2k]}_{t1}-e^{[2k]}_{t2})}\Big) + o(1), \qquad \mathcal{H} = D_c^{-\frac12}\mathcal{M}(\mathcal{M}+I_{2k})^{-1}\mathcal{M} D_c^{-\frac12}.$$

This formula is instructive to discuss: under strong or weak task correlation, $\tilde y^\star_{[t]}$ implements differing strategies to avoid negative transfer. For instance, if $\mu_{tj}^\top\mu_{t'j'} = 0$ for all $t' \ne t$ and $j, j' \in \{1, \dots, m\}$, then the two rows and columns of $\mathcal{M}$ associated to task $t$ are all zero, but on the $2\times2$ diagonal block: $\tilde y^\star_{[t]}$ is then all zeros but on its two task-$t$ elements; any other value at these zero-entry locations (such as the usual $\pm1$) is suboptimal and possibly severely detrimental to classification. Letting $\tilde y_{[t]} = [1, -1, \dots, 1, -1]^\top$ is even more detrimental when $\mu_{tj}^\top\mu_{t'j'} < 0$ for some $t' \ne t$: when the mapping of classes across tasks is reversed, these tasks work against the classification.

Remark 4.5 (On Bayes optimality). Under the present MTL setting of a mixture of two isotropic random Gaussian vectors, the authors recently established that the Bayes optimal error rate (associated to the decision rule $\inf_g P(g(x) > 0 \mid x \in \mathcal{C}_{t1})$) precisely coincides with $\epsilon^\star_t$.(5) This proves that, at least under the present data configuration, the proposed SPCA-MTL framework is optimal.

(5) The result builds on recent advances in physics-inspired (spin glass models) large dimensional statistics; see for instance (Lelarge & Miolane, 2019) for a similar result in a single task semi-supervised learning setting. Being a parallel work of the same authors, the reference is concealed in the present version to maintain anonymity.
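The closed form (5) and the averaged-mean test (3) are straightforward to evaluate once $c$ and $\mathcal{M}$ are estimated. Below is an illustrative Python sketch (function names and the 0-indexed task convention are our own; `M_cal` stands for the estimated matrix $\mathcal{M}$ of Assumption 4.2).

```python
import numpy as np

def optimal_binary_labels(M_cal, c, t, k):
    """Optimal label vector of Eq. (5) for target task t (0-indexed), m = 2 classes.

    M_cal : (2k, 2k) estimate of the matrix 'M' of Assumption 4.2,
    c     : (2k,) estimated class proportions n_tj / n.
    The label vector is only defined up to a positive scale factor.
    """
    Dc_inv_sqrt = np.diag(1.0 / np.sqrt(c))
    e = np.zeros(2 * k)
    e[2 * t], e[2 * t + 1] = 1.0, -1.0                   # e_{t1} - e_{t2}
    A = np.linalg.solve(M_cal + np.eye(2 * k), M_cal)    # (M + I)^{-1} M
    return Dc_inv_sqrt @ A @ Dc_inv_sqrt @ e

def binary_decision(scores, m_t1_hat, m_t2_hat):
    """Averaged-mean test of Eq. (3): class 1 if the score exceeds the midpoint."""
    return np.where(scores > 0.5 * (m_t1_hat + m_t2_hat), 1, 2)
```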
4.3. Binary-based multi-class classification

With an optimal binary classification framework for every task and every pair of classes, one may expect to reach high-performance levels in generic multi-class settings by resorting to a one-versus-all extension of the binary case. For every target task $t$, one-versus-all implements $m$ binary classifiers: classifier $\ell \in \{1, \dots, m\}$ separates class $\mathcal{C}_{t\ell}$, locally renamed class $\mathcal{C}^{(\ell)}_{t1}$, from all other classes gathered as a unique class $\mathcal{C}^{(\ell)}_{t2}$. Each binary classifier is then optimized using labels $\tilde y^{(\ell)}_{[t]}$ as per Equation (5); however, the joint class $\mathcal{C}^{(\ell)}_{t2}$ is here composed of a Gaussian mixture: this departs from our optimal framework, thereby in general leading to suboptimal labels. In practice though, for sufficiently distinct classes, the (suboptimal) label $\tilde y^{(\ell)}_{[t]}$ manages to isolate the value $m^{(\mathrm{bin})}_{t\ell} = m^{(\mathrm{bin},\ell)}_{t1}$ for class $\mathcal{C}_{t\ell} = \mathcal{C}^{(\ell)}_{t1}$ from the values $m^{(\mathrm{bin})}_{tj}$ of all other classes $\mathcal{C}_{tj}$, $j \ne \ell$, to such an extent that (relatively speaking) these $m^{(\mathrm{bin})}_{tj}$ can be considered quite close, and so close to their mean $m^{(\mathrm{bin},\ell)}_{t2}$, without much impact on the classifier performance.

Finally, the class allocation for unknown data $x$ is based on the largest classifier score. But, to avoid biases that naturally arise in the one-versus-all approach (Bishop, 2006, Section 7.1.3), this imposes that the $m$ different classifiers be comparable and "aligned". To this end, we exploit Corollary 4.4 and Remark 3.6, which give a consistent estimate of all classifier statistics: the test scores for each classifier can be centered so that the asymptotic distribution for class $\mathcal{C}^{(\ell)}_{t1}$ is a standard normal distribution for each $1 \le \ell \le m$, thereby automatically discarding biases. Thus, instead of selecting the class with the largest score $\arg\max_\ell v(\tilde y^{(\ell)}_{[t]})^\top x$ (as conventionally performed (Bishop, 2006, Section 7.1.3)), the class allocation is based on the centered scores $\arg\max_\ell\{v(\tilde y^{(\ell)}_{[t]})^\top x - m^{(\mathrm{bin},\ell)}_{t1}\}$.(6) These discussions result in Algorithm 1.

(6) More detail and illustrations are provided in the supplementary material.

Algorithm 1. Proposed multi-class MTL SPCA algorithm.
Input: Training $X = [X_{[1]}, \dots, X_{[k]}]$, $X_{[t']} = [X_{[t']1}, \dots, X_{[t']m}]$, $X_{[t']\ell} \in \mathbb{R}^{p\times n_{t'\ell}}$, and test $x$.
Output: Estimated class $\hat\ell \in \{1, \dots, m\}$ of $x$ for target task $t$.
for $\ell = 1$ to $m$ do
  Estimate $c$ and $\mathcal{M}$ (from Remark 3.6) using $X_{[t']\ell}$ as data of class $\mathcal{C}^{(\ell)}_{t'1}$ for each $t' \in \{1, \dots, k\}$ and $\{X_{[t']1}, \dots, X_{[t']m}\}\setminus\{X_{[t']\ell}\}$ as data of class $\mathcal{C}^{(\ell)}_{t'2}$.
  Evaluate labels $\tilde y^{(\ell)}_{[t]} = D_c^{-\frac12}(\mathcal{M} + I_{2k})^{-1}\mathcal{M} D_c^{-\frac12}(e^{[2k]}_{t1} - e^{[2k]}_{t2})$.
  Compute the classification score $g^{(\ell)}_{x,t} = \tilde y^{(\ell)\top}_{[t]} J^\top X^\top x / \|\tilde y^{(\ell)\top}_{[t]} J^\top X^\top\|$.
  Estimate $m^{(\mathrm{bin},\ell)}_{t1}$ as $\hat m^{(\mathrm{bin},\ell)}_{t1}$ from Corollary 4.4.
end for
Output: $\hat\ell = \arg\max_{\ell\in\{1,\dots,m\}}\big(g^{(\ell)}_{x,t} - \hat m^{(\mathrm{bin},\ell)}_{t1}\big)$.
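The following is a compact Python transcription of Algorithm 1, given as an illustrative sketch (the authors provide Matlab implementations; the data layout, the function names and the replacement of the closed-form estimate of $m^{(\mathrm{bin},\ell)}_{t1}$ by its empirical plug-in from the projected training scores are assumptions of this sketch).

```python
import numpy as np

def estimate_c_M(blocks, n):
    """Remark 3.6 estimators of the class proportions c and of D_c^{1/2} M^T M D_c^{1/2}."""
    c = np.array([B.shape[1] for B in blocks]) / n
    G = np.zeros((len(blocks), len(blocks)))
    for a, Xa in enumerate(blocks):
        for b, Xb in enumerate(blocks):
            if a != b:
                G[a, b] = Xa.mean(axis=1) @ Xb.mean(axis=1)
            else:
                h = Xa.shape[1] // 2
                G[a, a] = Xa[:, :h].mean(axis=1) @ Xa[:, h:].mean(axis=1)
    return c, np.diag(np.sqrt(c)) @ G @ np.diag(np.sqrt(c))

def mtl_spca_multiclass(X_tasks, x, t):
    """One-versus-all MTL SPCA (sketch of Algorithm 1).
    X_tasks : list over tasks of lists over classes of (p, n_{t'j}) arrays; t : target task index."""
    k, m = len(X_tasks), len(X_tasks[0])
    n = sum(B.shape[1] for task in X_tasks for B in task)
    scores = []
    for l in range(m):
        # class C^{(l)}_{t'1}: class l of task t'; class C^{(l)}_{t'2}: remaining classes of task t'
        blocks = []
        for task in X_tasks:
            blocks.append(task[l])
            blocks.append(np.concatenate([task[j] for j in range(m) if j != l], axis=1))
        c, M_cal = estimate_c_M(blocks, n)
        Dc_is = np.diag(1.0 / np.sqrt(c))
        e = np.zeros(2 * k); e[2 * t], e[2 * t + 1] = 1.0, -1.0
        y_l = Dc_is @ np.linalg.solve(M_cal + np.eye(2 * k), M_cal) @ Dc_is @ e   # Eq. (5)
        w = np.concatenate([np.full(B.shape[1], y_l[b]) for b, B in enumerate(blocks)])  # y = J y_tilde
        Xall = np.concatenate(blocks, axis=1)
        v = Xall @ w; v /= np.linalg.norm(v)
        m_t1_hat = blocks[2 * t].mean(axis=1) @ v          # empirical plug-in for m^{(bin,l)}_{t1}
        scores.append(v @ x - m_t1_hat)                    # centered score
    return int(np.argmax(scores))
```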
5. Supporting experiments

We here compare the performance of Algorithm 1 (MTL SPCA), on both synthetic and real data benchmarks, to competing state-of-the-art methods, such as MTL-LSSVM (Tiomoko et al., 2020) and CDLS (Hubert Tsai et al., 2016).(7)

(7) We insist that MTL SPCA is intended to function under the constraint of scarce data and does not account for the very nature of these data: to avoid arbitrary conclusions, image- or language-dedicated MTL and transfer learning methods (e.g., modern adaptations of deep nets for transfer learning (Tan et al., 2018)) are not used for comparison.

Transfer learning for binary classification. First consider a two-task two-class ($k, m = 2$) scenario with $x^{(j)}_{t\ell} \sim \mathcal{N}((-1)^j\mu_t, I_p)$, $\mu_2 = \beta\mu_1 + \sqrt{1-\beta^2}\,\mu_1^\perp$ for $\mu_1^\perp$ any vector orthogonal to $\mu_1$ and $\beta \in [0, 1]$ controlling inter-task similarity. Figure 2 depicts the empirical and theoretical classification error $\epsilon_2$ for the above methods for $p = 100$ and $n = 2\,200$; for completeness, the single-task SPCA (ST-SPCA) of Section 3 (which disregards data from other tasks), as well as its naive MTL extension with labels $y_{[t]} = [1, -1, \dots, 1, -1]^\top$ (N-SPCA), were added. MTL SPCA properly tracks task relatedness, while CDLS fails when both tasks are similar. MTL LSSVM shows identical performances, but at the cost of setting optimal hyperparameters. Probably most importantly, when not optimizing the labels $\tilde y$, the performance (of N-SPCA) is strongly degraded by negative transfer, particularly when tasks are not related.

Figure 2 also provides typical computational times for each algorithm when run on a modern laptop and confirms that Algorithm 1 scales very favorably with the data dimension $p$. At the same time, MTL LSSVM and CDLS quickly become prohibitively expensive.

Transfer learning for multi-class classification. We next experiment on the ImageClef dataset (Ionescu et al., 2017), made of 12 common categories shared by 3 public data domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). Every pair of domains is successively selected as source and target for binary (transfer) multi-task learning, resulting in 6 transfer tasks S→T for S,T ∈ {I,C,P}. Table 1 supports the stable and competitive performance of MTL-SPCA, on par with MTL LSSVM (but much cheaper).

Increasing the number of tasks. We now investigate the comparative gains induced when increasing the number of tasks. To best observe the reaction of each algorithm to the additional tasks, we here consider both a tunable synthetic Gaussian mixture and (less tractable) real-world data. The synthetic data consist of two Gaussian classes with means $\mu_{tj} = (-1)^j\mu_{[t]}$ with $\mu_{[t]} = \beta_{[t]}\mu + \sqrt{1-\beta_{[t]}^2}\,\mu^\perp$ for $\beta_{[t]}$ drawn uniformly at random in $[0, 1]$ and with $\mu = e^{[p]}_1$, $\mu^\perp = e^{[p]}_p$. The real-world data are the Amazon review (textual) dataset(8) (Blitzer et al., 2007) and the MNIST (image) dataset (Deng, 2012). For Amazon review, the positive vs. negative reviews of "books", "dvd" and "electronics" products are added to help classify the positive vs. negative reviews of "kitchen" products. For MNIST, additional digit pairs are added progressively to help classify the target pair (1, 4). The results are shown in Figure 3, which confirms that (i) the naive extension of SPCA (N-SPCA) with $\pm1$ labels can fail to the point of being bested by (single task) ST-SPCA, and (ii) MTL-SPCA never degrades as more tasks are added.

(8) Encoded in p = 400-dimensional tf*idf feature vectors of bag-of-words unigrams and bigrams.

Multi-class multi-task classification. We finally turn to the full multi-task multi-class setting of Algorithm 1. Figure 4 simultaneously compares running time and error rates of MTL-SPCA and MTL-LSSVM(9) on a variety of multi-task datasets, and again confirms the overall computational gains (by orders of magnitude!) of MTL-SPCA for approximately the same performance levels.

(9) CDLS only handles multi-task learning with k = 2 and cannot be used for comparison.
[Figure 2. (Left) Theoretical (Th)/empirical (Emp) error rate for 2-class Gaussian mixture transfer with means $\mu_1 = e^{[p]}_1$, $\mu_1^\perp = e^{[p]}_p$, $\mu_2 = \beta\mu_1 + \sqrt{1-\beta^2}\mu_1^\perp$, $p = 100$, $n_{1j} = 1\,000$, $n_{2j} = 50$; curves: ST SPCA, N-SPCA (Emp/Th), MTL SPCA (Emp/Th), MTL LSSVM (Th), as a function of the task relatedness β. (Right) running time comparison (in sec); $n = 2p$, $n_{tj}/n = 0.25$. Averaged over 1 000 test samples.]

| p | MTL SPCA | MTL LSSVM | CDLS |
|---|---|---|---|
| 16 | 0.34 s | 4.15 s | 7.16 s |
| 32 | 0.34 s | 4.46 s | 7.43 s |
| 64 | 0.39 s | 5.38 s | 8.61 s |
| 128 | 0.40 s | 8.28 s | 8.80 s |
| 256 | 0.55 s | 12.2 s | 11.9 s |
| 512 | 0.57 s | 48.3 s | 17.5 s |
| 1024 | 0.88 s | 315.6 s | 27.1 s |
| 2048 | 2.02 s | 1591.8 s | 73.5 s |

Table 1. Transfer learning accuracy for the ImageClef database: P (Pascal), I (ImageNet), C (Caltech); different source to target task pairs (S→T) based on Resnet-50 features.

| S/T | P→I | P→C | I→P | I→C | C→P | C→I | Average |
|---|---|---|---|---|---|---|---|
| ST SPCA | 91.84 | 96.24 | 82.26 | 96.24 | 82.26 | 91.84 | 90.11 |
| N-SPCA | 92.21 | 96.37 | 84.34 | 95.97 | 81.34 | 90.47 | 90.12 |
| MTL LSSVM | 93.03 | 97.24 | 84.79 | 97.74 | 83.74 | 94.92 | 91.91 |
| CDLS | 92.03 | 94.62 | 84.82 | 95.72 | 81.04 | 92.54 | 90.13 |
| MTL SPCA | 93.39 | 96.61 | 85.24 | 96.68 | 83.76 | 93.39 | 91.51 |

[Figure 3. Empirical classification error vs. number of tasks; (Left) Synthetic Gaussian with random task correlation: p = 200, n11 = n12 = 50, n21 = n22 = 5, 10 000 test samples; (Center) Amazon Review (added tasks: books, dvd, electronics): n11 = n12 = 100, n21 = n22 = 50, 2 000 test samples; (Right) MNIST (added digit pairs: [7-9], [3-8], [5-6], [2-9], [3-5]): initial p = 100-PCA preprocessing, n11 = n12 = 100, n21 = n22 = 50, 500 test samples.]

[Figure 4. (Left) Runtime vs. classification error ($\epsilon_t$) for multi-task multi-class MTL-LSSVM (filled marks) and MTL-SPCA (empty marks). (Right) Datasets. Synthetic: $\mu_j = 2e^{[p]}_j$, $\mu_j^\perp = 2e^{[p]}_{p-j}$, $\beta_1 = 0.2$, $\beta_2 = 0.4$, $\beta_3 = 0.6$; p = 200, n1j = n2j = 100, n3j = 50; 1 000 test sample averaging.]

| Datasets (Features) | Tasks | Classes |
|---|---|---|
| Synthetic (Gaussian) | 3 | 10 |
| Office-Caltech (VGG) | 4 | 10 |
| Office31 (Resnet-50) | 4 | 31 |
| Office-Home (Resnet-50) | 3 | 65 |
| ImageClef (Resnet-50) | 3 | 12 |

6. Conclusion

Following recent works on large-dimensional statistics for the design of simple, cost-efficient, and tractable machine learning algorithms (Couillet et al., 2021), the article confirms the possibility of achieving high-performance levels while theoretically averting the main sources of biases, here for the a priori difficult concept of multi-task learning. The article, we hope, will be followed by further investigations of sustainable AI algorithms driven by modern mathematical tools. In the present multi-task learning framework, practical extensions to semi-supervised learning (when labelled data are scarce) with possibly missing, unbalanced, or incorrectly labelled data are being considered.

References

Ando, R. K. and Zhang, T. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817 1853, 2005. Argyriou, A., Evgeniou, T., and Pontil, M. Convex multitask feature learning. Machine learning, 73(3):243 272, 2008. Ashtiani, H. and Ghodsi, A. A dimension-independent generalization bound for kernel supervised principal component analysis. In Feature Extraction: Modern Questions and Challenges, pp. 19 29. PMLR, 2015. Baik, J. and Silverstein, J. W. Eigenvalues of large sample covariance matrices of spiked population models. Journal of multivariate analysis, 97(6):1382 1408, 2006. Bair, E., Hastie, T., Paul, D., and Tibshirani, R. Prediction by supervised principal components. Journal of the American Statistical Association, 101(473):119 137, 2006. Baktashmotlagh, M., Harandi, M. T., Lovell, B.
C., and Salzmann, M. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 769 776, 2013. Barshan, E., Ghodsi, A., Azimifar, Z., and Jahromi, M. Z. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recognition, 44(7):1357 1371, 2011. Benaych-Georges, F. and Nadakuditi, R. R. The singular values and vectors of low rank perturbations of large rectangular random matrices. Journal of Multivariate Analysis, 111:120 135, 2012. Bishop, C. M. Pattern recognition and machine learning. springer, 2006. Blitzer, J., Dredze, M., and Pereira, F. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th annual meeting of the association of computational linguistics, pp. 440 447, 2007. Caruana, R. Multitask learning. Machine learning, 28(1): 41 75, 1997. Chao, G., Luo, Y., and Ding, W. Recent advances in supervised dimension reduction: A survey. Machine learning and knowledge extraction, 1(1):341 358, 2019. Couillet, R. and Debbah, M. Random matrix methods for wireless communications. Cambridge University Press, 2011. Couillet, R. and Liao, Z. Random Matrix Methods for Machine Learning. Cambridge University Press, 2022. Couillet, R., Chatelain, F., and Bihan, N. L. Two-way kernel matrix puncturing: towards resource-efficient pca and spectral clustering. ar Xiv preprint ar Xiv:2102.12293, 2021. Deng, L. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141 142, 2012. El Karoui, N. Random matrices and high-dimensional statistics: beyond covariance matrices. In Proceedings of the International Congress of Mathematicians: Rio de Janeiro 2018, pp. 2857 2876. World Scientific, 2018. Evgeniou, T. and Pontil, M. Regularized multi task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 109 117. ACM, 2004. Ghojogh, B. and Crowley, M. Unsupervised and supervised principal component analysis: Tutorial. ar Xiv preprint ar Xiv:1906.03148, 2019. Gong, P., Ye, J., and Zhang, C.-s. Multi-stage multi-task feature learning. In Advances in neural information processing systems, pp. 1988 1996, 2012. Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pp. 63 77. Springer, 2005. Henderson, P., Hu, J., Romoff, J., Brunskill, E., Jurafsky, D., and Pineau, J. Towards the systematic reporting of the energy and carbon footprints of machine learning. Journal of Machine Learning Research, 21(248):1 43, 2020. Hoffman, J., Rodner, E., Donahue, J., Darrell, T., and Saenko, K. Efficient learning of domain-invariant image representations. ar Xiv preprint ar Xiv:1301.3224, 2013. Hubert Tsai, Y.-H., Yeh, Y.-R., and Frank Wang, Y.-C. Learning cross-domain landmarks for heterogeneous domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5081 5090, 2016. Ionescu, B., Müller, H., Villegas, M., Arenas, H., Boato, G., Dang-Nguyen, D.-T., Cid, Y. D., Eickhoff, C., de Herrera, A. G. S., Gurrin, C., et al. Overview of imageclef 2017: Information extraction from images. In International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 315 337. Springer, 2017. 
PCA-based Multi-Task Learning Johnstone, I. M. On the distribution of the largest eigenvalue in principal components analysis. Annals of statistics, pp. 295 327, 2001. Lacoste, A., Luccioni, A., Schmidt, V., and Dandres, T. Quantifying the carbon emissions of machine learning. ar Xiv preprint ar Xiv:1910.09700, 2019. Lee, S., Zou, F., and Wright, F. A. Convergence and prediction of principal component scores in high-dimensional settings. Annals of statistics, 38(6):3605, 2010. Lelarge, M. and Miolane, L. Asymptotic bayes risk for gaussian mixture in a semi-supervised setting. In 2019 IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp. 639 643. IEEE, 2019. Liu, J., Ji, S., and Ye, J. Multi-task feature learning via efficient l2, 1-norm minimization. ar Xiv preprint ar Xiv:1205.2631, 2012. Liu, Q., Liao, X., and Carin, L. Semi-supervised multitask learning. Advances in Neural Information Processing Systems, 20:937 944, 2007. Long, M., Wang, J., Ding, G., Shen, D., and Yang, Q. Transfer learning with graph co-regularization. IEEE Transactions on Knowledge and Data Engineering, 26 (7):1805 1818, 2013. Long, M., Zhu, H., Wang, J., and Jordan, M. I. Unsupervised domain adaptation with residual transfer networks. ar Xiv preprint ar Xiv:1602.04433, 2016. Louart, C. and Couillet, R. Concentration of measure and large random matrices with an application to sample covariance matrices. ar Xiv preprint ar Xiv:1805.08295, 2018. Marchenko, V. A. and Pastur, L. A. Distribution of eigenvalues for some sets of random matrices. Matematicheskii Sbornik, 114(4):507 536, 1967. Maurer, A., Pontil, M., and Romera-Paredes, B. Sparse coding for multitask and transfer learning. In International conference on machine learning, pp. 343 351, 2013. Obozinski, G., Taskar, B., and Jordan, M. Multi-task feature selection. Statistics Department, UC Berkeley, Tech. Rep, 2(2.2):2, 2006. Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199 210, 2010. Parameswaran, S. and Weinberger, K. Q. Large margin multi-task metric learning. In Advances in neural information processing systems, pp. 1867 1875, 2010. Paul, D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, pp. 1617 1642, 2007. Rei, M. Semi-supervised multitask learning for sequence labeling. ar Xiv preprint ar Xiv:1704.07156, 2017. Ritchie, A., Scott, C., Balzano, L., Kessler, D., and Sripada, C. S. Supervised principal component analysis via manifold optimization. In 2019 IEEE Data Science Workshop (DSW), pp. 6 10. IEEE, 2019. Rosenstein, M. T., Marx, Z., Kaelbling, L. P., and Dietterich, T. G. To transfer or not to transfer. In NIPS 2005 workshop on transfer learning, volume 898, pp. 1 4, 2005. Ruder, S. An overview of multi-task learning in deep neural networks. ar Xiv preprint ar Xiv:1706.05098, 2017. Saenko, K., Kulis, B., Fritz, M., and Darrell, T. Adapting visual category models to new domains. In European conference on computer vision, pp. 213 226. Springer, 2010. Saito, K., Watanabe, K., Ushiku, Y., and Harada, T. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3723 3732, 2018. Seddik, M. E. A., Louart, C., Tamaazousti, M., and Couillet, R. Random matrix theory proves that deep learning representations of gan-data behave as gaussian mixtures. 
ar Xiv preprint ar Xiv:2001.08370, 2020. Strubell, E., Ganesh, A., and Mc Callum, A. Energy and policy considerations for deep learning in nlp. ar Xiv preprint ar Xiv:1906.02243, 2019. Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., and Liu, C. A survey on deep transfer learning. In International conference on artificial neural networks, pp. 270 279. Springer, 2018. Tiomoko, M., Couillet, R., and Tiomoko, H. Large dimensional analysis and improvement of multi task learning. ar Xiv preprint ar Xiv:2009.01591, 2020. Venkateswara, H., Eusebio, J., Chakraborty, S., and Panchanathan, S. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5018 5027, 2017. Wang, J. and Ye, J. Safe screening for multi-task feature learning with multiple data matrices. ar Xiv preprint ar Xiv:1505.04073, 2015. PCA-based Multi-Task Learning Xu, S., An, X., Qiao, X., Zhu, L., and Li, L. Multi-output least-squares support vector regression machines. Pattern Recognition Letters, 34:1078 1084, 07 2013. doi: 10. 1016/j.patrec.2013.01.015. Zhang, W., Li, R., Zeng, T., Sun, Q., Kumar, S., Ye, J., and Ji, S. Deep model based transfer and multi-task learning for biological image analysis. IEEE transactions on Big Data, 2016. Zhang, X., Sun, Q., and Kong, D. Supervised principal component regression for functional response with high dimensional predictors. ar Xiv preprint ar Xiv:2103.11567, 2021. Zhang, Y. and Yang, Q. An overview of multi-task learning. National Science Review, 5(1):30 43, 2018. Zhang, Y. and Yang, Q. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 2021. Zhang, Y. and Yeung, D.-Y. A convex formulation for learning task relationships in multi-task learning. ar Xiv preprint ar Xiv:1203.3536, 2012. Zhang, Y. and Yeung, D.-Y. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):1 31, 2014. PCA-based Multi-Task Learning This document contains the main technical arguments omitted in the article s core due to space limitations and is organized as follows. Section A details the large dimensional analysis of PCA. Section B provides the asymptotic performance of SPCA in the most general case of a Gaussian mixture model (with arbitrary means and covariances) in a multi-task setting. The single-task setting is retrieved as a special case. Section C details and illustrates the binary-based multi-class classification and proposes alternative schemes to the one-versus-all approach covered in the main article. Supplementary experiments are provided in Section D. Finally, Section E explains how to use the code to implement the paper s main results. A. Large dimensional analysis of Single Task PCA We recall that the solution U of PCA is explicitly given by collecting the eigenvectors associated with the τ largest eigenvalues of 1 p XXT. This section aims to compute the isolated eigenvalues of 1 p XXT and to study the behavior of the projection of new test data on the feature space spanned by PCA under the large dimensional regime. Assumption A.1 (Distribution of X and x). The columns of X are independent random vectors with X = [X1, . . . , Xm], Xj = [x(j) 1 , . . . , x(j) nj ] Rp nj where x(j) i N(µj, Ip). As for x, it follows an independent N(µx, Ip) distribution. We will further denote x Cj to indicate that data vector x belongs to class j, i.e., x N(µj, Ip). Assumption A.2 (Growth Rate). 
As $n \to \infty$, $p/n \to c_0 > 0$ and, for $1 \le j \le m$, $n_j/n \to c_j > 0$; we will denote $c = [c_1, \dots, c_m]^\top$. Furthermore, the latent feature space dimension $\tau$ is constant with respect to $n, p$.

A.1. Isolated eigenvalues

To retrieve the isolated eigenvalues of $\frac1p XX^\top$, we simply aim to solve the determinant equation in $z \in \mathbb{R}^+$:
$$\det\Big(\frac1p XX^\top - zI_p\Big) = 0.$$
Writing $X = MJ^\top + W$ with $M = [\mu_1, \dots, \mu_m] \in \mathbb{R}^{p\times m}$, $J = [j_1, \dots, j_m]$, where $j_j = (0, \dots, 0, 1_{n_j}^\top, 0, \dots, 0)^\top$, and where $W$ is a random matrix with independent standard Gaussian entries, this becomes
$$\det\Big(\frac1p WW^\top + UV^\top - zI_p\Big) = 0,$$
where $U = \frac1p[M,\; WJ] \in \mathbb{R}^{p\times 2m}$ and $V = [MJ^\top J + WJ,\; M] \in \mathbb{R}^{p\times 2m}$ are low rank matrices (as $n, p \to \infty$); as for $\frac1p WW^\top$, its limiting eigenvalue distribution under Assumption 3.2 is known as the Marchenko-Pastur law (Marchenko & Pastur, 1967), recalled next in whole generality:

Theorem A.3. Let $W$ be a $p\times n$ matrix with i.i.d. real- or complex-valued entries with zero mean and unit variance. Then, as $n, p \to \infty$ such that $p/n \to c_0$, the empirical spectral measure $\mu_{\hat C} = \frac1p\sum_{i=1}^p \delta_{\hat\lambda_i}$ of the eigenvalues $\hat\lambda_1 \ge \dots \ge \hat\lambda_p$ of $\frac1p WW^\top$ converges weakly, with probability one, to a nonrandom distribution, known as the Marchenko-Pastur law and denoted $\mu^{c_0}_{\rm MP}$. If $c_0 \in (0, 1)$, $\mu^{c_0}_{\rm MP}$ has density
$$\mu^{c_0}_{\rm MP}(dx) = \frac{\sqrt{(\lambda_+ - x)(x - \lambda_-)}}{2\pi x}\,dx, \qquad \lambda_\pm = (1 \pm \sqrt{1/c_0})^2.$$
If $c_0 \in (1, \infty)$, $\mu_{\rm MP}$ is the weighted sum of a point mass at 0 and of the density $\mu^{1/c_0}_{\rm MP}$, with weights $1 - (1/c_0)$ and $1/c_0$.

The spectrum of $\frac1p WW^\top$, which contains no structural information (generally referred to as a "noise bulk"), will not be useful for classification. The challenge is to determine which observed eigenvalues represent the class structure. Specifically, let us seek for the presence of an eigenvalue $\lambda_j$ of $\frac1p XX^\top$ asymptotically greater than the limit $(1+\sqrt{1/c_0})^2$ of the largest eigenvalue of $\frac1p WW^\top$. Following the initial ideas of (Baik & Silverstein, 2006; Benaych-Georges & Nadakuditi, 2012), the approach is to isolate the low-rank contribution $UV^\top$ from the noise matrix $\frac1p WW^\top$. Factoring out $\frac1p WW^\top - zI_p$ and using Sylvester's identity ($\det(AB + I) = \det(BA + I)$), the above equation is equivalent to
$$\det\big(V^\top Q(z)U + I_{2m}\big) = 0, \qquad Q(z) = \Big(\frac1p WW^\top - zI_p\Big)^{-1}.$$
We next retrieve the large dimensional limit (or, more specifically, a deterministic equivalent (Couillet & Debbah, 2011, Chapter 6)) of $V^\top Q(z)U + I_{2m}$ under Assumptions A.1 and A.2. Defining the resolvent and co-resolvent $Q(z) = (\frac1p WW^\top - zI_p)^{-1}$ and $\tilde Q(z) = (\frac1p W^\top W - zI_n)^{-1}$, as $n, p \to \infty$ with $p/n \to c_0 \in (0, \infty)$, we have
$$Q(z) \leftrightarrow \bar Q(z) = \delta(z)I_p, \qquad \tilde Q(z) \leftrightarrow \bar{\tilde Q}(z) = \tilde\delta(z)I_n,$$
where $(\delta(z), \tilde\delta(z))$ are defined as
$$\delta(z) = \frac{c_0 - 1 - c_0 z + \sqrt{(c_0 - 1 - c_0 z)^2 - 4z}}{2z}, \qquad \tilde\delta(z) = \frac{\delta(z)}{c_0} + \frac{1 - c_0}{c_0 z},$$
and the notation $F \leftrightarrow \bar F$ stands for the fact that, under Assumption A.2, for any deterministic linear functional $f: \mathbb{R}^{n\times p} \to \mathbb{R}$, $f(F - \bar F) \to 0$ almost surely (for instance, for $u, v$ of unit norm, $u^\top(F - \bar F)v \xrightarrow{a.s.} 0$ and, for $A \in \mathbb{R}^{p\times n}$ deterministic of bounded operator norm, $\frac1n\operatorname{tr} A(F - \bar F) \xrightarrow{a.s.} 0$). In particular, developing the definitions of $V$ and $U$,
$$\det\big(V^\top Q(z)U + I_{2m}\big) = \det\begin{pmatrix} I_m + \frac1p J^\top J M^\top Q(z)M + \frac1p J^\top W^\top Q(z)M & \frac1p J^\top J M^\top Q(z)WJ + \frac1p J^\top W^\top Q(z)WJ \\ \frac1p M^\top Q(z)M & I_m + \frac1p M^\top Q(z)WJ \end{pmatrix},$$
and we then have, from the above deterministic equivalents and after development of the blocks, that
$$\det\big(V^\top Q(z)U + I_{2m}\big) = \det\Big(I_m - z\delta(z)\tilde\delta(z)\frac{J^\top J}{p}M^\top M\Big) + o(1).$$
The limiting position of the (hypothetical) isolated eigenvalues $z$ is, therefore, the solution of
$$\det\big(I_m - z\delta(z)\tilde\delta(z)\mathcal{M}\big) = 0, \qquad \mathcal{M} = \lim_p \frac{1}{c_0} D_c^{\frac12} M^\top M D_c^{\frac12}.$$
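Before solving this equation explicitly, a quick numerical sanity check of the Marchenko-Pastur bulk edge and of the predicted isolated eigenvalue location can be run (illustrative Python; the spike formula used here is the reconstruction given in the surrounding text and should be read as such).

```python
import numpy as np

# Empirical check of the bulk edge (1 + sqrt(1/c0))^2 and of the predicted
# isolated eigenvalue of XX^T/p for a one-spike Gaussian model.
rng = np.random.default_rng(1)
p, n = 400, 800                      # c_0 = p / n = 0.5
c0 = p / n
mu_norm2 = 3.0                       # ||mu||^2
mu = np.zeros(p); mu[0] = np.sqrt(mu_norm2)
X = mu[:, None] + rng.standard_normal((p, n))    # single class with mean mu
eigs = np.linalg.eigvalsh(X @ X.T / p)           # ascending order

bulk_edge = (1 + np.sqrt(1 / c0)) ** 2
ell = mu_norm2 / c0                  # eigenvalue of (1/c0) Dc^{1/2} M^T M Dc^{1/2} (1x1 here)
spike_pred = 1 + 1 / c0 + ell + 1 / (c0 * ell) if ell >= 1 / np.sqrt(c0) else bulk_edge

print(f"largest eigenvalue : {eigs[-1]:.3f}  (predicted {spike_pred:.3f})")
print(f"second largest     : {eigs[-2]:.3f}  (bulk edge {bulk_edge:.3f})")
```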
Denoting $\ell_1 \ge \dots \ge \ell_m$ the eigenvalues of $\mathcal{M}$, the eigenvalues $z = \hat\lambda_i$ such that $\hat\lambda_i > (1+\sqrt{1/c_0})^2$ are explicit and pairwise associated to the $\ell_i$ whenever
$$\frac{1}{c_0} + 1 + \ell_i + \frac{1}{c_0\ell_i} > (1+\sqrt{1/c_0})^2,$$
which occurs if and only if $\ell_i \ge \frac{1}{\sqrt{c_0}}$. This completes the proof of Proposition 1.

A.2. PCA projectors

In this section, the goal is to study the asymptotic behavior of $u_i^\top x \,|\, x \in \mathcal{C}_j$, for $i \le \tau$. Since, conditionally on the training data $X$, $u_i^\top x$ is the projection of the deterministic vector $u_i$ onto the isotropic Gaussian random vector $x$, it follows that $u_i^\top x$ is asymptotically Gaussian.

Computation of the mean. Since $u_i$ is independent of $x$, we have, conditionally on the training data $X$, that $\mathbb{E}[u_i^\top x] = \mu_j^\top u_i$. It then remains to compute the limit with respect to $X$. First, since $u_i$ is defined up to a sign, we may impose
$$\mu_j^\top u_i = \frac{\mu_j^\top u_i u_i^\top 1_p / p}{\sqrt{1_p^\top u_i u_i^\top 1_p / p^2}}. \qquad (8)$$
Using Cauchy's integral formula, we have, for any vector $a \in \mathbb{R}^p$ of bounded norm (i.e., $\limsup_p \|a\| < \infty$),
$$a^\top u_i u_i^\top 1_p = -\frac{1}{2\pi i}\oint_{\gamma_i} a^\top\Big(\frac1p WW^\top + UV^\top - zI_p\Big)^{-1} 1_p\, dz = \frac{1}{2\pi i}\oint_{\gamma_i} a^\top Q(z)U\big(I_{2m} + V^\top Q(z)U\big)^{-1}V^\top Q(z) 1_p\, dz + o(1),$$
with $\gamma_i$ a contour surrounding only the isolated eigenvalue $\hat\lambda_i$ of $\frac1p XX^\top$ (the $a^\top Q(z)1_p$ term arising from the Woodbury identity has no pole inside $\gamma_i$). Using the deterministic equivalents of $Q(z)$ and $\tilde Q(z)$, we have $a^\top Q(z)U \approx \frac1p[\delta(z)a^\top M,\; 0_{1\times m}]$, and, together with the deterministic equivalents of $I_{2m} + V^\top Q(z)U$ and of $V^\top Q(z)1_p$, the integrand becomes a rational function of $\delta(z), \tilde\delta(z)$ whose only pole inside $\gamma_i$ is the isolated eigenvalue $\hat\lambda_i$. From the residue theorem, this gives an explicit limit for $a^\top u_i u_i^\top 1_p$ involving $\ell_i^2(\ell_i+1)$, $a^\top\mathcal{M}$ and the eigenvector $\bar u_i$ of $\mathcal{M}$ associated to the eigenvalue $\ell_i$. Finally, using Equation (8), we conclude
$$\mu_j^\top u_i \xrightarrow{a.s.} \sqrt{\frac{c_0\ell_i^2 - 1}{\ell_i^2(\ell_i+1)}}\; \bar u_i^\top \mathcal{M} D_c^{-\frac12} e^{[m]}_j.$$

Computation of the variance. The computation is immediate since $U$ is orthonormal and $x$ has identity covariance, therefore $\operatorname{Var}(u_i^\top x) = 1$.

B. Large dimensional analysis of Multi-Task SPCA

We recall that the solution $V$ of SPCA is explicitly given by the collection of the eigenvectors associated with the $\tau$ largest eigenvalues of $\frac{XYY^\top X^\top}{np}$. This section aims to evaluate the position of these isolated eigenvalues and to study the behavior of the projection of new test data on the feature space spanned by SPCA under the large dimensional regime.

Assumption B.1 (Distribution of X). For class $j$ of task $t$, denoted $\mathcal{C}_{tj}$, $x^{(j)}_{t\ell} \sim \mathcal{N}(\mu_{tj}, \Sigma_{tj})$, for some $\mu_{tj} \in \mathbb{R}^p$. We further denote $M \equiv [\mu_{11}, \dots, \mu_{km}] \in \mathbb{R}^{p\times mk}$.
(9) We have for any vector a Rp such that lim p a < , a Tviv T i 1p γi a T 1 p XJ Y YT n JTXT z Ip np XJ( Y YT) 1 2 z Im 1 np( Y YT) 1 2 JTXTXJ( Y YT) 1 2 1 ( Y YT) 1 2 JTXT 1p 1 z a TMDc( Y YT) 1 2 (z Im H) 1 ( Y YT) 1 2 Dc MT 1p 1 λi a TMDc( Y YT) 1 2 vi v T i ( Y YT) 1 2 Dc MT 1p with γi the contour surrounding the eigenvalue λi of H and vi the eigenvector of H associated to λi. µT tjvi a.s. r 1 λi v T i ( Y YT) 1 2 D 2 c e[mk] tj . PCA-based Multi-Task Learning Computation of the variance For the variance, conditionally to the training data X, Var(v T i x) = v T i Σtjvi. Furthermore, it then remains to compute the expectation with respect to the training data X: v T i Σtjvi = tr viv T i Σtj 1 p XJ Y YT n JTXT z Ip 1 npz XJ( Y YT) 1 2 z Im 1 np( Y YT) 1 2 JTXTXJ( Y YT) 1 2 1 ( Y YT) 1 2 JTXT ! 1 npz ( Y YT) 1 2 JTXTΣtj XJ( Y YT) 1 2 z Im 1 np( Y YT) 1 2 JTXTXJ( Y YT) 1 2 1! λi ( Y YT) 1 2 Ttj( Y YT) 1 2 + o(1) where Ttj = 1 n JTD v J + D 1 2c and vab = lim p 1 ptr (ΣtjΣab). When Σtj = Ip, as treated in the main article, it is immediate that Var(u T i x) = 1. C. Binary-based multi-class classification This section provides various applications and optimizations of the proposed MTL-SPCA framework in the context of multi-class classification. C.1. One-versus-all multi-class preliminary The literature (Bishop, 2006) describes broad groups of approaches to deal with classification with m > 2 classes. We focus here on the most common method, namely the one-versus-all approach. The complete optimization of one-versus-all being theoretically heavy to handle and demanding prior knowledge on the decision output statistics, the method inherently suffers from sometimes severe practical limitations; these are partly tackled here exploiting the large dimensional analysis performed in this article. In the one-versus-all method, focusing on Task t, m individual binary classifiers, indexed by ℓ= 1, . . . , m, are trained, each of them separating Class Ctℓfrom the other m 1 classes Ctℓ , ℓ = ℓ. Each test sample is then allocated to the class index corresponding to the classifier reaching the highest of the m classifier scores. Although quite used in practice, the approach first suffers a severe unbalanced data bias when using binary ( 1) labels as the set of negative labels in each binary classification is on average m 1 times larger than the set of positive labels, and also suffers a center-scale issue when ultimately comparing the outputs of the m decision functions, the average locations and ranges of which may greatly differ; these issues lead to undesirable effects, as reported in (Bishop, 2006, section 7.1.3)). These problems are here simultaneously addressed: specifically, having access to the large dimensional statistics of the classification scores allows us to appropriately center and scale the scores. Each centered-scaled binary classifier is then further optimized by appropriately selecting the class labels (different from 1) so to minimize the resulting classification error. See Figure 5 for a convenient illustration of the improvement induced by this centering-scaling and label optimization approach. C.2. One-versus-all multi-class optimization For each target task t, in a one-to-all approach, m MTL-SPCA binary classifications are solved with the target class Ctℓ (renamed class Cℓ t1"), against all other Cℓ t2 classes (combined into a single Cℓ t2 class ). 
Calling $g^{(\ell)}_{x,t}$ the output of classifier $\ell$ for a new datum $x$ in task $t$, the class allocation decision is traditionally based on the largest among all scores $g^{(1)}_{x,t}, \dots, g^{(m)}_{x,t}$. However, this presumes that the distributions of the scores $g^{(1)}_{x,t}$ when $x \in \mathcal{C}_1$, $g^{(2)}_{x,t}$ when $x \in \mathcal{C}_2$, etc., more or less have the same statistical mean and variance. This is not the case in general, as depicted in the first column of Figure 5, where data from class $\mathcal{C}_1$ are more likely to be allocated to class $\mathcal{C}_3$ (compare the red curves).

By providing an accurate estimate of the distribution of the scores $g^{(\ell)}_{x,t}$ for all $\ell$'s and all genuine classes of $x$, Theorem 3 of the main article allows us to predict the various positions of the Gaussian curves in Figure 5. In particular, it is possible, for each binary classifier $\ell$, to center and scale $g^{(\ell)}_{x,t}$ when $x \in \mathcal{C}_{t\ell}$. This operation averts the centering and scaling biases depicted in the first column of Figure 5: the result of the center-scale operation appears in the second column of Figure 5. This first improvement step simplifies the algorithm, which now boils down to determining the index of the largest $g^{(\ell)}_{x,t} - m^{(\mathrm{bin},\ell)}_{t1}$, $\ell \in \{1, \dots, m\}$, while limiting the risks induced by the center-scale biases. This being said, our theoretical analysis further allows us to adapt the input labels $\tilde y^{(\ell)}_{[t]}$ in such a way as to optimize the expected output. Ideally, assuming $x$ genuinely belongs to class $\mathcal{C}_{t\ell}$, one may aim to increase the distance between the output score $g^{(\ell)}_{x,t}$ and the other output scores $g^{(\ell')}_{x,t}$ for $\ell' \ne \ell$. This however raises two technical questions:

1. Corollary 1 of the main article is derived under a 2-class Gaussian mixture model, while for classifier $\ell$ of the one-versus-all approach the data are composed of $m$ Gaussians, of which one belongs to class $\mathcal{C}^{(\ell)}_{t1}$ and the other $m-1$ to class $\mathcal{C}^{(\ell)}_{t2}$ (which remains a mixture when $m > 2$). In this case, the labels express as $y = J\tilde y$, with now $\tilde y \in \mathbb{R}^{mk}$ (instead of $\mathbb{R}^{2k}$), for $J = [j_{11}, \dots, j_{km}]$, $j_{tj} = (0, \dots, 0, 1_{n_{tj}}^\top, 0, \dots, 0)^\top$ as before.

2. The procedure demands to simultaneously adapt all input scores $\tilde y^{(1)}_{[t]}, \dots, \tilde y^{(m)}_{[t]}$.

To solve Item 1, we extend Corollary 1 to a one-versus-all-based binary classification.

Corollary C.1 (One-versus-all Binary MTL Supervised Principal Component Analysis). Let $x \sim \mathcal{N}(\mu_{tj}, I_p)$ independent of $X$. Then, under Assumptions B.1-B.2 and the above setting, as $p, n \to \infty$, $v^\top x - g_{tj} \to 0$, $g_{tj} \sim \mathcal{N}(m^{(\mathrm{bin})}_{tj}, 1)$, where
$$m^{(\mathrm{bin})}_{tj} = \frac{e^{[mk]\top}_{tj} D_c^{-\frac12}\mathcal{M} D_c^{\frac12}\,\tilde y}{\sqrt{\tilde y^\top\big(D_c^{\frac12}\mathcal{M} D_c^{\frac12} + D_c\big)\tilde y}}.$$
Note that Corollary C.1 is similar to Corollary 1 of the main article, but now with $\tilde y \in \mathbb{R}^{mk}$ and $\mathcal{M}, D_c \in \mathbb{R}^{mk\times mk}$.

A first option to solve Item 2 consists in maximizing the distance between the output score $g^{(\ell)}_{x,t}$ for $x \in \mathcal{C}_{t\ell}$ and the scores $g^{(\ell)}_{x,t}$ for $x \notin \mathcal{C}_{t\ell}$. By mechanically pushing away all wrong decisions, this ensures that, when $x \in \mathcal{C}_{t\ell}$, $g^{(\ell)}_{x,t}$ is greater than $g^{(\ell')}_{x,t}$ for $\ell' \ne \ell$. This is visually seen in the third column of Figure 5, where the distances between the rightmost Gaussians and the other two are increased when compared to the second column, and we retrieve the desired behavior. Specifically, the proposed (heuristic) label optimization here consists in solving, for a target task $t$ and each $\ell \in \{1, \dots, m\}$, the optimization problem:
$$\tilde y^{(\ell)\star}_{[t]} = \arg\max_{\tilde y^{(\ell)}_{[t]}\in\mathbb{R}^{km}}\ \min_{j\ne\ell}\ \big(m^{(\mathrm{bin}),\ell}_{t\ell} - m^{(\mathrm{bin}),\ell}_{tj}\big). \qquad (10)$$
Being a non-convex and non-differentiable (due to the max-min) optimization, Equation (10) cannot be solved straightforwardly. An approximated solution consists in relaxing the max operator $\max(x_1, \dots, x_n)$ into the differentiable soft-max operator $\frac{1}{\gamma n}\log\big(\sum_{j=1}^n e^{\gamma x_j}\big)$ for some $\gamma > 0$ (the inner min in (10) being a max of negated arguments), and using a standard gradient-based optimization scheme, here initialized at $\tilde y^{(\ell)}_{[t]} \in \mathbb{R}^{mk}$ filled with $1$'s at every entry $m(i-1)+\ell$, for $i \in \{1, \dots, m\}$, and with $-1$'s everywhere else.
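The sketch below illustrates this relaxation (illustrative Python; it relies on the expression of $m^{(\mathrm{bin})}_{tj}$ reconstructed in Corollary C.1 above, which is an assumption of this sketch, and on hypothetical function names; a generic quasi-Newton solver is used in place of the authors' hand-written gradient scheme).

```python
import numpy as np
from scipy.optimize import minimize

def label_optimization(M_cal, c, t, ell, m, k, gamma=10.0):
    """Soft-max relaxation of the min-max label optimization (10).

    M_cal : (mk, mk) estimate of the matrix 'M' of Assumption B.2,
    c     : (mk,) estimated class proportions; t, ell are 0-indexed.
    """
    B = np.diag(np.sqrt(c)) @ M_cal @ np.diag(np.sqrt(c)) + np.diag(c)   # Dc^1/2 M Dc^1/2 + Dc
    A = np.diag(1 / np.sqrt(c)) @ M_cal @ np.diag(np.sqrt(c))            # Dc^-1/2 M Dc^1/2

    def m_bin(y, tt, j):
        e = np.zeros(m * k); e[m * tt + j] = 1.0
        return (e @ A @ y) / np.sqrt(y @ B @ y)

    def neg_objective(y):
        gaps = np.array([m_bin(y, t, ell) - m_bin(y, t, j) for j in range(m) if j != ell])
        return (1.0 / gamma) * np.log(np.exp(-gamma * gaps).sum())       # negated soft-min of the gaps

    y0 = -np.ones(m * k)
    y0[[m * i + ell for i in range(k)]] = 1.0                            # initialization described above
    return minimize(neg_objective, y0, method="BFGS").x
```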
An alternative option to tackle Item 2. (the one developed in the core article) consists in reducing the dimension of the labels to y^(ℓ)_{[t]} ∈ R^{2k}, by merging all Gaussians of the classes C_tj, j ≠ ℓ, into a unique approximated Gaussian class with mean Σ_{j≠ℓ} n_tj/(n_t − n_tℓ) µ_tj (a short illustrative sketch of this merging step is given below, after Figure 6). We may then (abusively) apply Corollary 1, leading to an explicit expression for the optimal label y^(ℓ)⋆_{[t]}, from which Algorithm 1 in the main article unfolds.

Figure 6 compares the min-max optimization scheme with the scheme assuming a Gaussian approximation for Class 2 (denoted "Gaussian Approx"). Interestingly, the two methods have comparable performance. The synthetic data considered for this experiment consist of 2 tasks with ten Gaussian classes of means µ_1j = µ_j and µ_2j = βµ_j + √(1−β²) µ⊥_j.

Figure 5. Test score distributions in a 2-task and 3-classes-per-task setting, using one-versus-all multi-class classification. Every graph in row ℓ depicts the limiting distributions of g^(ℓ)_{x,t} for x in the different classes. Column 1 (Classical) is the standard implementation of the one-versus-all approach. Column 2 (Scaled scores) is the output for centered and scaled g^(ℓ)_{x,t}, x ∈ C_ℓ. Column 3 (Optimized labels) is the same as Column 2 but with optimized input scores (labels) y^(ℓ)⋆_{[t]}. Under the classical approach, data from C_1 (red curves) will often be misclassified as C_2. With optimized labels, the discrimination of the scores for x in either class C_2 or C_3 is improved (the blue curve in the 2nd row lies further away from the blue curve in the 1st row; similarly for the green curve in the 3rd versus the 1st row).

Figure 6. Empirical accuracy as a function of the relatedness parameter β on synthetic Gaussian data with p = 500, µ_j = 3e^[p]_j, µ⊥_j = 3e^[p]_{p−j}, n_1j = 100, n_2j = 50 for 1 ≤ j ≤ 10; averaged over 10 000 test samples.
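As announced above, a minimal MATLAB sketch of the merging step follows (the helper name, to be saved as merge_classes.m, and its signature are ours and not part of the released code): all classes j ≠ ℓ of Task t are collapsed into a single approximated Gaussian class whose mean is the sample-size-weighted average of their means.

```matlab
% Illustrative sketch (assumed helper, not part of the released code): collapse all
% classes j ~= ell of Task t into one approximated Gaussian class, as described above.
% mu_t: p x m matrix of class means; n_t: 1 x m vector of class sample sizes.
function [mu_merged, n_merged] = merge_classes(mu_t, n_t, ell)
    others    = setdiff(1:numel(n_t), ell);
    w         = n_t(others) / sum(n_t(others));  % weights n_tj / (n_t - n_t_ell), summing to 1
    mu_merged = mu_t(:, others) * w(:);          % mean of the merged "all other classes" Gaussian
    n_merged  = sum(n_t(others));                % size of the merged class
end
```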
D. Supplementary experiments

We next experiment on two transfer learning datasets: (i) the Office31 dataset (Saenko et al., 2010), which contains 31 object categories in three domains: Amazon (A), DSLR (D) and Webcam (W). The Amazon images were captured from a website of online merchants (clean background and unified scale); the DSLR domain contains low-noise, high-resolution images; the Webcam images are of low resolution and exhibit significant noise and color artifacts. Every pair of domains is successively selected as source and target for binary (transfer) multi-task learning, resulting in 6 transfer tasks S→T for S, T ∈ {A, D, W}; (ii) the Office Home dataset (Venkateswara et al., 2017), which consists of images from 4 different domains: Artistic images (A), Clip Art (C), Product images (P) and Real-World images (R). For each domain, the dataset contains images of 65 object categories typically found in office and home settings. Tables 2 and 3 report the comparative performance of the various algorithms and, while exhibiting a slight superiority of the MTL-LSSVM scheme, support the stable and competitive performance of MTL-SPCA.

S/T          w→a     w→d     a→w     a→d     d→w     d→a     Mean score
ST-SPCA      77.63   93.72   90.09   90.51   91.33   75.43   86.45
CDLS         76.47   92.52   91.57   90.07   91.43   74.99   86.17
N-SPCA       74.10   96.44   79.59   81.94   95.10   73.15   83.39
MTL-LSSVM    80.85   97.63   93.11   91.91   95.12   79.41   89.67
MTL-SPCA     77.67   96.70   90.72   91.09   94.83   76.90   87.99

Table 2. Classification accuracy on the Office31 database: w (Webcam), a (Amazon), d (DSLR), for the different source-to-target task pairs (S→T), based on ResNet-50 features.

S/T          A→R    A→P    A→C    R→A    R→P    R→C    P→A    P→R    P→C    C→A    C→R    C→P    Mean score
ST-SPCA      91.07  92.19  74.05  77.61  92.64  72.84  75.66  90.38  71.48  72.26  86.47  89.20  82.15
CDLS         88.30  90.24  75.71  78.04  91.28  75.29  75.59  88.20  73.86  73.43  85.12  88.91  82.00
N-SPCA       89.73  89.26  69.47  76.77  89.90  66.63  71.13  87.41  63.01  70.50  84.30  82.98  78.42
MTL-LSSVM    91.82  92.85  80.09  79.39  93.63  79.13  75.94  90.67  78.19  74.39  88.61  91.56  84.69
MTL-SPCA     91.10  92.28  77.44  79.57  92.79  73.64  76.36  90.39  76.90  74.23  87.01  89.37  83.42

Table 3. Classification accuracy on the Office Home database: Art (A), Real World (R), Product (P), Clipart (C), for the different source-to-target task pairs (S→T), based on ResNet-50 features.

E. Readme Code

This document explains how to use the code implementing the Random Matrix improved Multi-Task Learning Supervised Principal Component Analysis (RMT-MTLSPCA) proposed in the core of the article.

E.1. Archive content

- RMTMTLSPCA_binary_train.m: trains the proposed MTL-SPCA algorithm (binary classification version).
- RMTMTLSPCA_multiclass_train.m: trains the proposed MTL-SPCA algorithm for multi-class classification.
- PCA_versus_SPCA_single_task.m: main script comparing the performance of SPCA and PCA in a single-task setting.
- compare TL_binary.m: main script comparing all algorithms on synthetic data for binary transfer learning.
- Compare TL_multiclass.m: main script comparing all algorithms on synthetic and real data for multi-class transfer learning.
- Increased_tasks.m: main script illustrating the benefit of adding more tasks to the MTL framework.
- Trade Off_performance_running_time.m: main script illustrating the performance-versus-running-time trade-off in multi-task multi-class classification.
- Folder utils: alternative MTL algorithms, among which the MMDT algorithm from (Hoffman et al., 2013) and the MTL-LSSVM algorithm (Tiomoko et al., 2020) (to use these codes, one needs a Matlab compiler for MEX files), together with other functions used by the proposed method.
- Folder datasets: Office+Caltech, Office Home, Image Clef, Office31 and Amazon review datasets. Due to space limitations, the datasets are not included but can be downloaded and placed in the corresponding folder.

Codes and datasets used are publicly accessible and under free licenses.

E.2. Reproducing the results of the article

The following sections detail the parameter settings used to reproduce the experiments of the main article.

E.2.1. Figure 1
Script PCA_versus_SPCA_single_task.m

E.3. Figure 2
Script compare TL_binary.m

E.3.1. Table 1
Script Compare TL_multiclass.m
number_trials 20
dataset synthetic (for synthetic data)
dataset officehome (for the Office Home dataset)
dataset Image Clef (for the Image Clef dataset)

E.3.2. Figure 3
Script Increased_tasks.m
dataset synthetic (for synthetic data)
dataset nlp (for the Amazon review dataset)
dataset mnist (for the MNIST dataset)
E.3.3. Figure 4
Script Trade Off_performance_running_time.m
dataset synthetic (for synthetic data)
dataset officehome (for the Office Home dataset)
dataset Image Clef (for the Image Clef dataset)
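As a purely illustrative example of how these settings might be applied (the exact mechanism by which each script reads its options, e.g. workspace variables versus variables edited at the top of the script, is an assumption on our part and is not specified above), one possible invocation is:

```matlab
% Hypothetical reproduction sketch: dataset/parameter choices listed above, under the
% ASSUMED convention that the scripts read these variables from the base workspace
% (in practice they may instead have to be edited at the top of each script).
number_trials = 20;
dataset       = 'officehome';                     % or 'synthetic', 'Image Clef', 'nlp', 'mnist'

run('Compare TL_multiclass.m');                   % Table 1
% run('Increased_tasks.m');                       % Figure 3
% run('Trade Off_performance_running_time.m');    % Figure 4
```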