# Robust Temporal Smoothness in Multi-Task Learning

Menghui Zhou¹, Yu Zhang², Yun Yang¹, Tong Liu², Po Yang²

¹ Department of Software, Yunnan University, Kunming, China
² Department of Computer Science, Sheffield University, Sheffield, UK

mhzcn@mail.ynu.edu.cn, yzhang489@sheffield.ac.uk, yangyun@ynu.edu.cn, {t.liu, po.yang}@sheffield.ac.uk

## Abstract

Multi-task learning models based on the temporal smoothness assumption, in which each time point in a sequence corresponds to a prediction task, assume that adjacent tasks are similar to each other. However, the effect of outliers is not taken into account. In this paper, we show that even a single outlier task can destroy the performance of the entire model. To solve this problem, we propose two Robust Temporal Smoothness (RoTS) frameworks. Compared with existing models based on temporal relations, our methods not only capture the temporal smoothness information but also identify outlier tasks, without increasing the computational complexity. Detailed theoretical analyses are presented to evaluate the performance of our methods. Experimental results on synthetic and real-life datasets demonstrate the effectiveness of our frameworks. We also discuss several potential applications and extensions of our RoTS frameworks.

## Introduction

In recent years, the temporal smoothness assumption (Wei 2006) has been used in a wide range of machine learning applications (Wang, Shi, and Reddy 2020; Zhou et al. 2022; Xu et al. 2021; Romeo et al. 2020; Emrani, McGuirk, and Xiao 2017; Saha et al. 2018). These methods model the interactions between a time point and its adjacent time points and thus capture the temporal relationship to some extent. Owing to the intrinsic correlation among multiple time points, a joint analysis of multiple time points is expected to be more effective than analysing each time point independently. Therefore, the idea of multi-task learning (MTL) (Shen et al. 2021; Fifty et al. 2021; Zhang and Yang 2021) is applied to analyse multiple time points simultaneously. Specifically, existing methods (Romeo et al. 2020; Wang, Shi, and Reddy 2020; Emrani, McGuirk, and Xiao 2017; Zhao et al. 2015; Zheng and Ni 2013) formulate the prediction of a target at a sequence of time points as a multi-task learning problem, where each task concerns the prediction at one time point. As shown in Figure 1, the $t$-th time point is treated as the $t$-th task $w_t$.

Figure 1: We decompose every $w_i = p_i + r_i$; $P$ satisfies the temporal smoothness, while $R$ identifies the outlier tasks.

The crucial challenge of MTL is to know how the tasks are related and how to capture such complex task relations (Zhang and Yang 2021). One common way is employing the Temporal Smoothness assumption (TS). It assumes that the difference between two successive tasks is relatively small and thus captures the temporal correlation among multiple tasks. With TS, MTL benefits many applications, such as disease progression prediction, survival analysis, and keypoint tracking. In (Nie et al. 2016; Zhou et al. 2011), the authors use MTL with TS to predict the progression of Alzheimer's disease. They assume that the cognitive score of a patient will not fluctuate dramatically over time, so the difference between the cognitive scores at two successive time points is relatively small.
So they penalize $\|w_i - w_{i+1}\|_2^2$, known as the Laplacian-based Temporal Smoothness assumption (LTS). In (Zhou et al. 2022, 2012), the authors argue that LTS only enforces smoothness of the task models across different time points; a better way is to enforce that nearby time points have similar feature weights, so they penalize $|w_{ij} - w_{i,j+1}|$ using the well-known fused Lasso (Tibshirani et al. 2005), referred to as the Fused Lasso-based Temporal Smoothness assumption (FTS). Clearly, if FTS is satisfied, so is LTS. Similarly, in (Emrani, McGuirk, and Xiao 2017), the authors use MTL with TS for the prognosis and diagnosis of the progression of Parkinson's disease. In (Romeo et al. 2020), the authors propose a novel spatio-temporal MTL model based on TS to predict the progression of diabetes and its complications. Beyond disease diagnosis and treatment, (Wang, Shi, and Reddy 2020) applies TS to propose a tensor-based temporal MTL survival model.

Introducing TS into MTL models has been shown to improve performance and robustness; however, the significant problem is that TS considers neither the differences between tasks nor the impact of potential outlier tasks. In fact, the asymptotic property of the fused Lasso proved in (Tibshirani et al. 2005) shows that TS simply tends to average all tasks. Since outlier tasks commonly exist, TS is too restrictive for real-world applications. Here we first define the outlier task: a task should be considered an outlier if it is vastly different from most tasks. In this study, we identify outlier tasks by comparing the magnitudes of the $\ell_2$-norms of the task coefficients. As shown in Figure 2 (same experimental setup as in the Experiments section), LTS and FTS average all tasks and show only an extremely limited trend toward capturing the outlier task (the 4th task). This means that even a single outlier task can destroy the entire performance of MTL models based on TS. Hence, how to detect outlier tasks while pursuing the temporal smoothness assumption is a particularly important and challenging problem for models based on TS.

Our motivation comes from an intuitive idea: outlier tasks arise because, besides the temporal smoothness information shared among tasks, there is other information that depends on specific domain knowledge. An outlier task is determined by this domain information rather than by noise in the data, and such outlier tasks contain valuable information that cannot be ignored. To implement this idea, we propose two Robust Temporal Smoothness (RoTS) frameworks. Mathematically, we write each task model $w_i$ ($i \in \mathbb{N}_m$) as $w_i = p_i + r_i$ (hence the model coefficient matrix $W = P + R$). The temporal part $p_i$ satisfies the temporal smoothness $p_i \approx p_{i+1}$. The discriminative part $r_i$ represents the difference beyond the temporal relation among tasks. If $r_i$ is large, simple temporal smoothness is not suitable, since $w_i - w_{i+1} \approx r_i - r_{i+1}$, i.e., the difference beyond the temporal relation among tasks cannot be ignored; the $i$-th task is then regarded as an outlier task. It is worth noting that it is difficult to give an explicit definition of an outlier task, since it depends on the specific case and is governed by the combined effect of the temporal part $P$ and the discriminative part $R$. Traditionally, an outlier is "an observation that lies an abnormal distance from other values in a random sample from a population", which is significantly different from an error.
However, in many practical applications, outliers may occur randomly yet regularly, depending on how the tasks are defined. Therefore, by defining tasks differently, the threshold in temporal smoothness may classify some errors as outliers, and vice versa. For instance, when predicting the monthly amount of suitable fertilizer with AI models over historical data, the outliers will differ depending on whether we set a 6-month or a 12-month fertilization task over a year. In practice, both circumstances are possible and vary across farms. Hence, we need to consider outliers associated with tasks, which is an important and common phenomenon in practical long-term prediction cases.

Specifically, we propose the first RoTS framework, Laplacian-based RoTS (LRoTS), which utilizes LTS to pursue the temporal smoothness among the $p_i$ and the $\ell_2$-norm to measure the magnitude of $r_i$. The number of outlier tasks is assumed to be small, so we employ the group Lasso (Meier, Van De Geer, and Bühlmann 2008) on the column groups of the discriminative matrix $R$ to detect outlier tasks.

Figure 2: The comparison on the S2 dataset. Both LTS and FTS show only a limited trend toward capturing the outlier (4th) task.

However, LTS only focuses on the smoothness of the prediction models across different time points. Inspired by (Zhou et al. 2022, 2012), we would like to incorporate feature smoothness rather than only task smoothness, so we replace LTS with FTS to obtain the second framework, Fused Lasso-based RoTS (FRoTS), which captures temporal smoothness not only at the task level but also at the feature level. In addition, this kind of temporal smoothness based on the extension of the fused Lasso has another attractive property, sparsity continuity (Tibshirani et al. 2005), which is important for deriving our detailed theoretical analyses.

The main contributions of this work include:

- Our work highlights the importance of outlier tasks in MTL methods and discovers their relationship with temporal smoothness in many real-world applications. We are the first to point out that all MTL models based on TS cannot effectively deal with outlier tasks.
- We propose the RoTS assumption to fully utilize both the temporal information between tasks and the specific domain information in outlier tasks. We accomplish this by decomposing the task coefficients, and we present two frameworks based on RoTS. Compared to models based on TS, our robust frameworks incur no additional computational complexity.
- Through detailed theoretical analyses and experimental results, we verify the superiority and effectiveness of the two RoTS frameworks compared to the TS methods.
- We discuss several possible applications and extensions of our frameworks in broader fields.

**Notations:** Denote $\mathbb{N}_m = \{1, \ldots, m\}$. $x_i$ and $x_{ij}$ denote the $i$-th entry of a vector $x$ and the $(i,j)$-th entry of a matrix $X$. $x^i$ ($x_i$) denotes the $i$-th row (column) of a matrix $X$. $\|X\|_{p,q} = \big(\sum_{j=1}^n (\sum_{i=1}^m |x_{ij}|^p)^{q/p}\big)^{1/q}$. $N(\mu, \sigma^2)$ denotes a normal distribution with mean $\mu$ and standard deviation $\sigma$. $x_{jk}^{(i)}$ and $x_j^{(i)}$ denote the $(j,k)$-th entry and the $j$-th column of a matrix $X_i$. For the implementation code and Appendix, please refer to https://github.com/menghuizhou/RoTS.
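Since the $\|\cdot\|_{2,1}$ penalty is central to both frameworks, a minimal NumPy sketch of the $\|\cdot\|_{p,q}$ convention above may help fix ideas (the helper name is ours, not from the released code):

```python
import numpy as np

def norm_pq(X, p, q):
    """||X||_{p,q} from the Notations paragraph: the q-norm of the vector
    of column-wise p-norms. norm_pq(X, 2, 1) sums the L2 norms of the
    columns, which is the outlier-task penalty ||X||_{2,1} used later."""
    col_norms = np.sum(np.abs(X) ** p, axis=0) ** (1.0 / p)
    return np.sum(col_norms ** q) ** (1.0 / q)
```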
## The Proposed Frameworks

Assume that we are given a sequence of $m$ time points, each of which concerns a task. The training data is $\{(X_1, y_1), \ldots, (X_m, y_m)\}$, where $X_i \in \mathbb{R}^{d \times n_i}$ is the data matrix of the $i$-th task with each column as a sample; $y_i \in \mathbb{R}^{n_i}$ is the response of the $i$-th task ($y_i$ has continuous values for regression and discrete values for classification); $d$ is the data dimension; and $n_i$ is the number of samples of the $i$-th task. Denoting $W = [w_1, \ldots, w_m] \in \mathbb{R}^{d \times m}$ as the weight matrix to be estimated, the empirical risk is

$$L(W) = \frac{1}{m}\sum_{i=1}^m \frac{1}{n_i}\sum_{j=1}^{n_i} l\big((x_j^{(i)})^T w_i, (y_i)_j\big),$$

where the loss function $l(\cdot,\cdot)$ is the squared loss for regression and the logistic loss for binary classification. To learn the $m$ tasks simultaneously, we minimize $L(W) + \Omega(W)$, where $\Omega$ is the regularization term encoding the prior knowledge.

### Laplacian-Based Robust Temporal Smoothness

In our RoTS frameworks, we decompose the weight matrix as $W = P + R$, i.e., $w_i = p_i + r_i$. The temporal part $p_i$ satisfies the temporal smoothness $p_i \approx p_{i+1}$, and the discriminative part $r_i$ represents the difference beyond the temporal relation among tasks. To capture the temporal smoothness, we introduce a regularization term that penalizes large deviations between the models at neighboring time points; to identify the outlier tasks, we use the group Lasso $\ell_{2,1}$-norm regularization term. Formally, our first framework is formulated as

$$\min_{P,R}\; L(P + R) + \lambda_1 \sum_{i=1}^{m-1} \|p_i - p_{i+1}\|_2^2 + \lambda_2 \|R\|_{2,1},$$

where $\lambda_1$ and $\lambda_2$ are regularization parameters. For notational simplicity, we use the following formulation:

$$\min_{P,R}\; L(P + R) + \lambda_1 \|PH\|_F^2 + \lambda_2 \|R\|_{2,1}, \tag{1}$$

where $H \in \mathbb{R}^{m \times (m-1)}$ is defined as $h_{ij} = 1$ if $i = j$, $h_{ij} = -1$ if $i = j + 1$, and $h_{ij} = 0$ otherwise. The regularization term $\|PH\|_F^2$ is also called the Laplacian term (Zhou et al. 2011), so we call (1) the Laplacian-based Robust Temporal Smoothness framework (LRoTS).

### Fused Lasso-Based Robust Temporal Smoothness

Since the Laplacian term is differentiable, LRoTS avoids computational difficulty. However, LRoTS only encourages smoothness between adjacent tasks. We emphasize that decomposing $P$ into row vectors is often meaningful; for example, in disease progression modeling (Zhou et al. 2012; Emrani, McGuirk, and Xiao 2017; Zhou et al. 2022), it is more natural that a feature has similar weights at adjacent time points. We therefore propose the second framework, Fused Lasso-based Robust Temporal Smoothness (FRoTS), with the following formulation:

$$\min_{P,R}\; L(P + R) + \lambda_1 \|FP^T\|_{1,1} + \lambda_2 \|R\|_{2,1}, \tag{2}$$

where $\|FP^T\|_{1,1} = \sum_{i=1}^d \sum_{j=1}^{m-1} |p_{i,j} - p_{i,j+1}|$ and $F = H^T$. The term $\|FP^T\|_{1,1}$ is an extension of the fused Lasso (Tibshirani et al. 2005) to the multi-task setting, which is where the name FRoTS comes from. Compared with LRoTS, FRoTS has another advantage: it encourages each row of $P$ to admit a sparse solution, where sparsity refers to the first differences $|p_{i,j} - p_{i,j+1}|$. This property is attractive for interpretation, which LRoTS lacks; it is also necessary for deriving our theoretical analyses.

### Optimization Algorithm

In this section, we show how to solve the two RoTS frameworks efficiently using the accelerated proximal gradient method (APM) (Li, Fang, and Lin 2020). Denote

$$L(P, R) = \frac{1}{m}\sum_{i=1}^m \frac{1}{n_i}\sum_{j=1}^{n_i} l\big((x_j^{(i)})^T (p_i + r_i), (y_i)_j\big), \tag{3}$$

$$\Omega(P, R) = \lambda_1 \Omega(P) + \lambda_2 \Omega(R), \tag{4}$$

where $\Omega(P) = \|PH\|_F^2$ in (1) and $\|FP^T\|_{1,1}$ in (2), and $\Omega(R) = \|R\|_{2,1}$. The objective function of both RoTS frameworks is thus a composite of a differentiable term $L(P, R)$ and a non-differentiable term $\Omega(P, R)$.
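Before turning to the solver, here is a minimal NumPy sketch of the composite objectives (1) and (2) for the squared-loss case. The function names and the exact loss scaling are our assumptions for illustration; the paper's released code may differ.

```python
import numpy as np

def squared_loss(W, Xs, ys):
    """Empirical risk L(W): average over tasks of the mean squared loss;
    Xs[i] is d x n_i with samples as columns, ys[i] has length n_i."""
    m = len(Xs)
    return sum(0.5 * np.mean((X.T @ W[:, i] - y) ** 2)
               for i, (X, y) in enumerate(zip(Xs, ys))) / m

def laplacian_penalty(P):
    """||P H||_F^2: squared differences between adjacent task columns."""
    return np.sum((P[:, :-1] - P[:, 1:]) ** 2)

def fused_penalty(P):
    """||F P^T||_{1,1}: absolute first differences along the time axis."""
    return np.sum(np.abs(P[:, :-1] - P[:, 1:]))

def l21_penalty(R):
    """||R||_{2,1}: sum of column-wise L2 norms; zeroed columns of R
    correspond to non-outlier tasks."""
    return np.sum(np.linalg.norm(R, axis=0))

def rots_objective(P, R, Xs, ys, lam1, lam2, fused=False):
    """LRoTS objective (1) when fused=False, FRoTS objective (2) otherwise."""
    temporal = fused_penalty(P) if fused else laplacian_penalty(P)
    return squared_loss(P + R, Xs, ys) + lam1 * temporal + lam2 * l21_penalty(R)
```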
Denote

$$T_{Q,S,\eta}(P, R) = L(Q, S) + \left\langle \frac{\partial L(Q, S)}{\partial Q},\, P - Q \right\rangle + \frac{\eta}{2}\|P - Q\|_F^2 + \left\langle \frac{\partial L(Q, S)}{\partial S},\, R - S \right\rangle + \frac{\eta}{2}\|R - S\|_F^2,$$

$$(P^k, R^k) = \arg\min_{P,R}\; T_{Q^k, S^k, \eta_k}(P, R) + \Omega(P, R), \tag{5}$$

where $Q^1 = P^0$, $S^1 = R^0$, and $Q^{k+1} = P^k + \alpha_k(P^k - P^{k-1})$, $S^{k+1} = R^k + \alpha_k(R^k - R^{k-1})$ for $k \ge 1$; the values of $\eta_k$ and $\alpha_k$ follow the strategy in (Beck and Teboulle 2009). According to the theoretical analysis in (Beck and Teboulle 2009; Chen, Zhou, and Ye 2011), we have the following convergence result for the two RoTS frameworks:

**Theorem 1.** Let $(P^k, R^k)$ be generated by (5), where $\eta_k$ follows the strategy in (Beck and Teboulle 2009). Let $f(\cdot,\cdot)$ and $(P^*, R^*)$ be, respectively, the objective function and an optimal solution of the RoTS formulations (1) or (2). Then for any $k \ge 1$, we have the optimal convergence rate among first-order methods:

$$f(P^k, R^k) - f(P^*, R^*) = O\!\left(\frac{1}{k^2}\right).$$

### Computing the Proximal Operator

A key building block of APM is computing the proximal operator of the non-smooth term $\Omega(P, R)$ efficiently. Due to the decomposable structure of (5), we cast it into the following two separate proximal operator problems:

$$P^* = \arg\min_P\; \frac{1}{2}\|P - U\|_F^2 + \frac{\lambda_1}{\eta_k}\Omega(P), \tag{6}$$

$$R^* = \arg\min_R\; \frac{1}{2}\|R - V\|_F^2 + \frac{\lambda_2}{\eta_k}\|R\|_{2,1}. \tag{7}$$

If $\Omega(P) = \|PH\|_F^2$, (6) admits an analytical solution via a matrix inverse, but at an expensive complexity of $O(\max(m^3, dm^2))$. We emphasize that the matrix $(I + \frac{2\lambda_1}{\eta_k} HH^T) \in \mathbb{R}^{m \times m}$ is tridiagonal and non-singular; this special structure allows us to use the chasing method (Golub and Van Loan 2013) to reduce the complexity to $O(dm)$. When $\Omega(P) = \|FP^T\|_{1,1}$, (6) no longer admits an analytical solution; however, it can be solved efficiently using the Fused Lasso Signal Approximation (FLSA) algorithm proposed in (Liu, Yuan, and Ye 2010), which has been shown to scale to large problems. For updating $R$, (7) admits a closed-form solution with complexity $O(dm)$ (Liu, Ji, and Ye 2012). We conclude that both frameworks are scalable to large-scale datasets using the proposed optimization algorithm.
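As a concrete illustration of the two proximal steps, the sketch below implements the LRoTS $P$-update (6) via the chasing (Thomas) method and the closed-form $R$-update (7). For $\Omega(P) = \|PH\|_F^2$, setting the gradient of (6) to zero gives $P(I + \frac{2\lambda_1}{\eta_k}HH^T) = U$, and $HH^T$ is the path-graph Laplacian, so each row of $P$ solves a tridiagonal system in $O(m)$. This is a minimal sketch under our own naming; the FRoTS $P$-update would instead call an FLSA solver (Liu, Yuan, and Ye 2010), which we omit.

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system (the 'chasing method') with sub-diagonal a,
    diagonal b, super-diagonal c, and right-hand side d; a[0], c[-1] unused."""
    n = len(b)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def prox_laplacian(U, rho):
    """P-update (6) for LRoTS with rho = 2*lam1/eta_k: solve each row of
    P (I + rho * H H^T) = U. Since H H^T is the path-graph Laplacian,
    the coefficient matrix is tridiagonal and the cost is O(dm)."""
    d, m = U.shape
    diag = np.full(m, 1.0 + 2.0 * rho)
    diag[0] = diag[-1] = 1.0 + rho
    off = np.full(m, -rho)
    return np.stack([thomas_solve(off, diag, off, U[j]) for j in range(d)])

def prox_l21(V, tau):
    """R-update (7) with tau = lam2/eta_k: group soft-thresholding that
    shrinks each column of V toward zero, zeroing non-outlier columns."""
    norms = np.maximum(np.linalg.norm(V, axis=0, keepdims=True), 1e-12)
    return V * np.maximum(0.0, 1.0 - tau / norms)
```

With $\rho = 2\lambda_1/\eta_k$ and $\tau = \lambda_2/\eta_k$, one APM iteration costs $O(dm)$ beyond the gradient evaluation, matching the complexity claims above.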
## Theoretical Analysis

Since LTS does not induce the sparsity pattern of the first differences $|p_{i,j} - p_{i,j+1}|$, we do not discuss LRoTS here; we provide the theoretical analysis of FRoTS.

### Basic Assumptions

We begin by outlining some fundamental assumptions for the subsequent theoretical analyses. Assume the features are normalized so that all diagonal elements of the matrix $X_i X_i^T$ equal 1, i.e., $\sum_{k=1}^{n_i} (x_{jk}^{(i)})^2 = 1$ for all $j \in \mathbb{N}_d$. Assume that the linear predictive function associated with the $i$-th task satisfies

$$y_{ji} = f_i^*(x_j^{(i)}) + \delta_{ji} = (x_j^{(i)})^T w_i^* + \delta_{ji},$$

where $i \in \mathbb{N}_m$, $j \in \mathbb{N}_n$, and the noise $\delta_i = [\delta_{1i}, \ldots, \delta_{ni}]^T \in \mathbb{R}^n$ with $\delta_{ji} \sim N(0, \sigma^2)$; $X_i = [x_1^{(i)}, \ldots, x_n^{(i)}] \in \mathbb{R}^{d \times n}$ and $y_i = [y_{1i}, \ldots, y_{ni}]^T \in \mathbb{R}^n$ are the training data and responses of the $i$-th task. $W^*$ is the true weight matrix, decomposed as the sum of two underlying true components $P^*$ and $R^*$, i.e., $W^* = [w_1^*, \ldots, w_m^*] = P^* + R^* \in \mathbb{R}^{d \times m}$. The true evaluation is

$$f_i^* = X_i^T w_i^* = [f_i^*(x_1^{(i)}), \ldots, f_i^*(x_n^{(i)})]^T \in \mathbb{R}^n. \tag{8}$$

Thus we have $y_i = f_i^* + \delta_i$, $i \in \mathbb{N}_m$. We also define the index sets $Q$ and $J$ for the sparsity patterns as

$$Q(A) = \{(i, j) \mid a_{ij} \ne 0\}, \quad \bar{Q}(A) = \{(i, j) \mid a_{ij} = 0\}, \tag{9}$$

$$J(A) = \{i \mid a_i \ne 0\}, \quad \bar{J}(A) = \{i \mid a_i = 0\}. \tag{10}$$

For simplicity, we assume that the training sample sizes are the same for all tasks ($n_i = n$); the analysis can easily be modified to account for different sample sizes across tasks. For notational simplicity, let $X \in \mathbb{R}^{dm \times nm}$ be the block diagonal matrix with $X_i \in \mathbb{R}^{d \times n}$ ($i \in \mathbb{N}_m$) as its $i$-th block, and $\mathrm{vec}(A) \triangleq [a_1^T, \ldots, a_m^T]^T$ for $A \in \mathbb{R}^{d \times m}$.

### Theoretical Analysis for FRoTS

**Theorem 2.** Let $(\hat{P}, \hat{R})$ be an optimal solution of (2) for $m \ge 2$ and $n, d \ge 1$, and let $X_i$ and $y_i$ satisfy the above assumptions. Take the regularization parameters $\lambda_1$ and $\lambda_2$ such that

$$2\lambda_1(m-1) \ge \alpha, \quad \lambda_2 \ge \alpha, \quad \alpha = 2\sigma\sqrt{dm + t}, \tag{11}$$

where $t > 0$ is a universal constant. Then with probability at least $1 - \exp(-\frac{1}{2}(t - dm\log(1 + \frac{t}{dm})))$, for any $P, R \in \mathbb{R}^{d \times m}$ we have

$$\frac{1}{mn}\sum_{i=1}^m \|X_i^T(\hat{p}_i + \hat{r}_i) - f_i^*\|_2^2 \le \frac{1}{mn}\sum_{i=1}^m \|X_i^T(p_i + r_i) - f_i^*\|_2^2 + 2\lambda_1(m-1)\|(P - \hat{P})^T\|_{2,1} + 2\lambda_2\|(\hat{R} - R)_{J(R)}\|_{2,1}. \tag{12}$$

Then (12) can be written as

$$\frac{1}{mn}\|X^T\mathrm{vec}(\hat{P} + \hat{R}) - \mathrm{vec}(F^*)\|_2^2 \le \frac{1}{mn}\|X^T\mathrm{vec}(P + R) - \mathrm{vec}(F^*)\|_2^2 + 2\lambda_1(m-1)\|(P - \hat{P})^T\|_{2,1} + 2\lambda_2\|(\hat{R} - R)_{J(R)}\|_{2,1}, \tag{13}$$

where $F^* = [f_1^*, \ldots, f_m^*] \in \mathbb{R}^{n \times m}$.

We make the following assumption about the training data and the weight matrix.

**Assumption 1.** For a matrix pair $\Gamma_P \in \mathbb{R}^{d \times m}$ and $\Gamma_R \in \mathbb{R}^{d \times m}$, let $r$ and $c$ ($1 \le r \le d(m-1)$, $1 \le c \le m$) be the upper bounds of $|Q(F(P^*)^T)|$ and $|J(R^*)|$, respectively. Let $\beta$ be a positive scalar. Given that $XX^T$ is positive definite, there exist positive scalars $\kappa_1(r)$ and $\kappa_2(c)$ such that

$$\kappa_1(r) \le \min_{(\Gamma_P, \Gamma_R) \in \mathcal{R}(r,c)} \frac{\|X^T\mathrm{vec}(\Gamma_P + \Gamma_R)\|_2}{\sqrt{mn}\,\|\Gamma_P\|_F}, \tag{14}$$

$$\kappa_2(c) \le \min_{(\Gamma_P, \Gamma_R) \in \mathcal{R}(r,c)} \frac{\|X^T\mathrm{vec}(\Gamma_P + \Gamma_R)\|_2}{\sqrt{mn}\,\|(\Gamma_R)_{J(R)}\|_F}, \tag{15}$$

where the set $\mathcal{R}(r, c)$ is defined as

$$\mathcal{R}(r, c) = \big\{\Gamma_P, \Gamma_R \in \mathbb{R}^{d \times m} \;\big|\; \Gamma_P \ne 0,\; \Gamma_R \ne 0,\; |Q(F\Gamma_P^T)| \le r,\; |J(\Gamma_R)| \le c,\; \|(\Gamma_R)_{\bar{J}(R)}\|_{2,1} \le \beta\|(\Gamma_R)_{J(R)}\|_{2,1}\big\}, \tag{16}$$

and $|J|$ and $|Q|$ denote the numbers of elements in the sets $J$ and $Q$, respectively.

Note that Assumption 1 is connected to the restricted eigenvalue assumption, which is essential in (Bickel, Ritov, and Tsybakov 2009). Similar assumptions have also been used in earlier studies on multi-task learning (Gong, Ye, and Zhang 2012; Chen, Zhou, and Ye 2011; Lounici et al. 2009). The following theorem on performance bounds is a concise statement of our main theoretical finding.

**Theorem 3.** Let $(\hat{P}, \hat{R})$ be an optimal solution of (2) for $m \ge 2$ and $n, d \ge 1$. Take the regularization parameters $\lambda_1$ and $\lambda_2$ as in (11). Then under Assumption 1, the following results hold with probability at least $1 - \exp(-\frac{1}{2}(t - dm\log(1 + \frac{t}{dm})))$, $t > 0$:

$$\frac{1}{mn}\|X^T\mathrm{vec}(\hat{P} + \hat{R}) - \mathrm{vec}(F^*)\|_2^2 \le \left(\frac{2\lambda_1\sqrt{r}(m-1)}{\kappa_1(r)} + \frac{2\lambda_2\sqrt{c}}{\kappa_2(c)}\right)^2, \tag{17}$$

$$\|\hat{R} - R^*\|_{2,1} \le \sqrt{c}(\beta + 1)\left(\frac{2\lambda_1\sqrt{r}(m-1)}{\kappa_1(r)} + \frac{2\lambda_2\sqrt{c}}{\kappa_2(c)}\right). \tag{18}$$

**Theorem 4.** Based on Theorem 3, let

$$b = \sqrt{c}(\beta + 1)\left(\frac{2\lambda_1\sqrt{r}(m-1)}{\kappa_1(r)} + \frac{2\lambda_2\sqrt{c}}{\kappa_2(c)}\right),$$

and suppose the following condition holds:

$$\min_{j \in J(R^*)} \|r_j^*\|_2 > 2b. \tag{19}$$

Define

$$\hat{J} = \{j \mid \|\hat{r}_j\|_2 > b\}. \tag{20}$$

Then, with the same probability, $\hat{J}$ estimates the true sparsity pattern $J(R^*)$; that is, $\hat{J} = J(R^*)$.

Theorem 3 gives an essential theoretical guarantee for FRoTS. Specifically, these bounds assess how well FRoTS can approximate the true evaluation values $F^*$ as well as the true outlier tasks $r_i^*$, $i \in \mathbb{N}_m$. Furthermore, by Theorem 4 we can estimate the true sparsity pattern $J(R^*)$ with high probability, i.e., at least $1 - \exp(-\frac{1}{2}(t - dm\log(1 + \frac{t}{dm})))$, provided the underlying true weights are above the noise level, i.e., $\min_{j \in J(R^*)} \|r_j^*\|_2 > 2b$.

## Experiments

To demonstrate the competitiveness of the proposed approaches, we compare them with Laplacian-based temporal smoothness (LTS) and fused Lasso-based temporal smoothness (FTS). The implementation code of all competing methods is in the supplementary material. For all methods, the hyperparameters are selected by grid search with 3-fold cross-validation. For each dataset, the experiments are repeated 5 times with random data splits, and the mean and standard deviation of the results are reported. Note that for numerical accuracy, we solve the involved formulations with their objective functions multiplied by $\sum_{i=1}^m n_i$. The search range of the regularization parameters is $\{0.1, 1, 10, 50, 100, 200, 500, 1000, 2500, 5000\}$.
The root mean square error (rMSE) is used to evaluate the performance of the involved methods, as is common in the multi-task learning literature (Yao, Cao, and Chen 2019). We stop the iterative procedure of the algorithms when the change of the objective value between two consecutive iterations is smaller than $10^{-4}$. The training ratio, defined as the ratio of the training set size to the dataset size, is 0.5.

Figure 3: Correlation coefficient with the true task on S1.

Figure 4: The $\ell_2$-norm of the task coefficients on S3.

### Synthetic Datasets and Experimental Results

To validate the effectiveness of the proposed approaches in terms of robustness against outlier tasks, we first evaluate them on the following three synthetic datasets (a generation sketch for S1 is given at the end of this subsection):

- **S1:** We have 5 tasks ($m = 5$) and set $w_1 = w_2 = \frac{2}{3}w_3 = w_4 = w_5 \sim N(0, 1)$; hence the 3rd task is an outlier task. The input data are generated from $X_i \sim N(0, 1)$ with feature dimensionality $d = 100$ and $n_i = 100$ ($i \in \mathbb{N}_5$), and the output of the $i$-th task is obtained by $y_i = X_i^T w_i + N(0, 1)$.
- **S2:** Denote $\mathbf{1}$ as a vector whose elements are all one. We set 7 tasks ($m = 7$), $n_i = 20$ ($i \in \mathbb{N}_7$), and dimensionality $d = 20$, with $W_1 = [\mathbf{1}, \ldots, \mathbf{1}] \in \mathbb{R}^{10 \times m}$, $W_2 = 5\cdot[\mathbf{1}, \mathbf{1}, \mathbf{1}, 3\cdot\mathbf{1}, \mathbf{1}, \mathbf{1}, \mathbf{1}] \in \mathbb{R}^{10 \times m}$, and $W = [W_1; W_2]$. The 4th task $w_4$ is thus an outlier task.
- **S3:** This dataset is similar to S2, but with 18 tasks ($m = 18$) and $W_2 = 5\cdot[\mathbf{1}, \mathbf{1}, \mathbf{1}, \alpha_1\mathbf{1}, \mathbf{1}, \mathbf{1}, \mathbf{1}, \alpha_2\mathbf{1}, \mathbf{1}, \mathbf{1}, \mathbf{1}, \alpha_3\mathbf{1}, \mathbf{1}, \mathbf{1}, \mathbf{1}, \alpha_4\mathbf{1}, \mathbf{1}, \mathbf{1}]$, where $\alpha_1, \alpha_2, \alpha_3, \alpha_4$ are generated from a uniform distribution on $[0.3, 0.8]$. Thus the 4th, 8th, 12th, and 16th tasks are outliers.

We verify the performance of the different methods on the S1 dataset by calculating the correlation coefficients between the model parameters learned by each method and the true model. As shown in Figure 3, the correlation coefficient associated with the 3rd task is generally lower than the others, which indicates that the influence of the outlier task is obvious. Note that the correlation coefficients of LRoTS and FRoTS are significantly better than those of LTS and FTS, which shows the effectiveness of our proposed methods. However, this does not clearly illustrate how well our methods capture the outlier tasks. To analyze the differences between the various methods more intuitively, we designed the datasets S2 and S3. Note that if we only look at $W_1$, there is no outlier task; this setting is designed to analyze the difference between the two ways of pursuing temporal information, since LTS focuses on the task level while FTS focuses on the feature level (every entry of the task coefficients). As shown in Figures 2 and 4, the two RoTS frameworks are significantly better than LTS and FTS at detecting outlier tasks, and also better at fitting the non-outlier tasks. This also indicates that we cannot simply average all tasks to pursue the temporal information: both LTS and FTS are too strict.
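For reproducibility, the following sketch generates S1 as described above and computes the per-task correlation coefficients plotted in Figure 3. The random seed and helper names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)          # seed is our choice
d, n, m = 100, 100, 5

# S1: w1 = w2 = (2/3) w3 = w4 = w5 ~ N(0, 1), i.e. w3 = 1.5 * base,
# making the 3rd task the outlier.
base = rng.standard_normal(d)
W_true = np.stack([base, base, 1.5 * base, base, base], axis=1)   # d x m

Xs = [rng.standard_normal((d, n)) for _ in range(m)]              # samples as columns
ys = [X.T @ W_true[:, i] + rng.standard_normal(n) for i, X in enumerate(Xs)]

def task_correlations(W_hat, W_true):
    """Per-task Pearson correlation between learned and true coefficients,
    the quantity reported in Figure 3."""
    return [np.corrcoef(W_hat[:, i], W_true[:, i])[0, 1]
            for i in range(W_true.shape[1])]
```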
### Real Datasets and Experimental Results

Here we introduce the two real datasets used in this work.

**SmartFert dataset.** This dataset is designed for global soil health assessment. The data are collected from 354 geographic sites in 42 countries and include many factors describing agriculture, such as climate, soil type, yield, and fertilization. After data preprocessing, the SmartFert dataset contains available data from four farms with the same standard and 12 features. The corresponding labels are the amounts of fertilizer applied in each month of the year, including nitrogen, phosphorus, and potash content. We emphasize that in the SmartFert dataset, heavy fertilization is applied only in the 6th, 7th, 8th, and 9th months; some farms apply additional fertilization in the 11th month. From the perspective of our proposed methods, the 6th, 7th, 8th, and 9th months can be regarded as outlier time points, since the amount of fertilization in these months differs dramatically from the other months.

**Alzheimer's Disease (AD) dataset.** This dataset (Jack Jr et al. 2008) consists of three subsets: RAVLT, MMSE, and ADAS-Cog (ADAS). The National Institutes of Health (NIH) funded the Alzheimer's Disease Neuroimaging Initiative (ADNI) in 2003 to facilitate the scientific evaluation of neuroimaging data, including magnetic resonance imaging (MRI), together with clinical and neuropsychological assessments, for predicting the onset and progression of mild cognitive impairment (MCI) and AD. The three subsets RAVLT, MMSE, and ADAS all come from ADNI (Weiner et al. 2017). Each has 313 MRI features and six corresponding time points.

### Evaluation of Performance

We verify our methods on the SmartFert dataset; the results are shown in Table 1. Note that the variances of the four methods are all large. The likely reason is that the sample size of the SmartFert dataset is small, so the models cannot be trained adequately. However, even in this limited-data scenario, both RoTS frameworks achieve significant improvements over LTS and FTS, with FRoTS performing the best: compared to LTS, FRoTS reduces the rMSE from 44.26 to 29.22, almost 34% lower. This shows that our methods have greater potential to achieve good performance with limited data.

| Dataset | LTS | FTS | LRoTS | FRoTS |
|---|---|---|---|---|
| SmartFert | 44.26 ± 10.92 | 40.95 ± 11.72 | 35.19 ± 6.27 | 29.22 ± 5.39 |
| RAVLT | 4.74 ± 0.22 | 4.72 ± 0.25 | 4.80 ± 0.51 | 4.67 ± 0.37 |
| MMSE | 5.21 ± 0.41 | 5.13 ± 0.26 | 5.01 ± 0.21 | 5.07 ± 0.32 |
| ADAS | 9.38 ± 0.03 | 9.39 ± 0.03 | 9.35 ± 0.03 | 9.33 ± 0.04 |

Table 1: The comparison of performance in terms of rMSE (mean ± std).

To visualize the detection ability and practical significance of outlier tasks, we compute the $\ell_2$-norm of each column of the discriminative matrix $R$. As shown in Figure 5, the $\ell_2$-norms of the 6th, 7th, 8th, 9th, and 11th tasks are significantly higher than the others, so these tasks can be considered outlier tasks. This is consistent with reality, since in the SmartFert dataset heavy fertilization is applied only in the 6th to 9th months, with occasional extra fertilizer in the 11th month.

Figure 5: The $\ell_2$-norm of each column of the matrix $R$, generated by the two RoTS frameworks on the SmartFert dataset.
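The outlier-task criterion used in Figures 5 and 6 reduces to inspecting the column norms of the learned $\hat{R}$. A minimal sketch follows; the paper plots the raw norms and reads them visually, so the fixed ratio threshold here is our illustrative assumption, not part of the method (Theorem 4's threshold $b$ gives a principled alternative).

```python
import numpy as np

def column_norms(R):
    """||r_i||_2 for each task column of the learned discriminative matrix."""
    return np.linalg.norm(R, axis=0)

def detect_outlier_tasks(R, ratio=2.0):
    """Flag task i as an outlier when ||r_i||_2 exceeds `ratio` times the
    median column norm (heuristic threshold of ours)."""
    norms = column_norms(R)
    return np.where(norms > ratio * np.median(norms))[0]
```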
To analyze the performance of our methods more comprehensively, we conduct experiments on the AD datasets with a training ratio of 0.2. As shown in Table 1, both RoTS methods clearly outperform LTS and FTS. Note that the RoTS frameworks do not improve over the baselines on the three AD datasets as much as on the SmartFert dataset, possibly because the cognitive decline of AD patients is a fairly smooth process (Zhou et al. 2022). This reveals a limitation of our methods: the stronger the temporal information is, the more limited the improvement achieved by our RoTS frameworks.

We find that in most cases FRoTS performs better than LRoTS, which seems to suggest that FRoTS is the better choice of the two. However, we emphasize that although the optimization algorithm we designed is highly efficient and scales to large datasets, FRoTS requires computing the proximal operator of the fused Lasso penalty, which makes it more complicated than LRoTS. We conclude that if efficiency matters more, LRoTS is the better option; if performance matters more, FRoTS is the better choice.

We also visually analyze the detection of outlier tasks on the three AD sub-datasets. As shown in Figure 6, the detection results on the three AD datasets are not as clear-cut as on the SmartFert dataset, which suggests that the temporal relation in the AD data is stronger than in the SmartFert data. We also notice large differences among the results on the three datasets. For example, on the ADAS dataset the 2nd and 3rd tasks are clearly identified as outlier tasks (right subfigure of Figure 6), whereas on the MMSE dataset only the 2nd task is an obvious outlier (middle subfigure of Figure 6); on the RAVLT dataset, the 1st and 3rd tasks are clear outliers. The reason may lie in the differences among the three datasets themselves: ADAS focuses on the patient's language and cognitive abilities, MMSE on arithmetic, memory, and orientation, and RAVLT on the patient's learning ability. It is worth emphasizing that the outlier tasks mainly appear in the early stages. A possible reason is that in the initial stages of the disease the patient's condition is relatively good, but it deteriorates rapidly in later stages owing to a rapid loss of many cognitive functions, creating a large difference between the patient's state at the beginning and at the other time points.

Figure 6: The results of detecting the outlier tasks with LRoTS and FRoTS on the RAVLT (left), MMSE (middle), and ADAS (right) datasets.

## Possible Specific Applications and Extensions

We point out that whenever the temporal smoothness assumption (TS) is useful in a scenario, the two RoTS frameworks are a better option. For instance, (Zhou et al. 2012) proposed cFSGL based on TS for modeling disease progression. We can easily extend cFSGL to

$$L(P + R) + \lambda_1\|P^T\|_{1,1} + \lambda_2\|P^T\|_{2,1} + \lambda_3\|FP^T\|_{1,1} + \lambda_4\|R\|_{2,1}.$$

It employs the sparse group Lasso penalty $\lambda_1\|P^T\|_{1,1} + \lambda_2\|P^T\|_{2,1}$ (Simon et al. 2013) to perform joint feature selection across all tasks together with the selection of a task-specific set of features, while the FRoTS terms $\lambda_3\|FP^T\|_{1,1} + \lambda_4\|R\|_{2,1}$ capture the robust temporal smoothness. The decomposition property of $\lambda_1\|P^T\|_{1,1} + \lambda_2\|P^T\|_{2,1} + \lambda_3\|FP^T\|_{1,1}$, proved in (Zhou et al. 2012), enables the proximal operator to be computed efficiently and the model to scale to large problems. Similarly, the two RoTS frameworks have a potential extension to the temporal survival model of (Wang, Shi, and Reddy 2020).

Our RoTS assumption can possibly be extended to tackle other kinds of sequence data. Gene expression sequence data usually exhibit ordered patterns (Robinson, McCarthy, and Smyth 2010). Tibshirani et al. (2005) proposed the fused Lasso to encourage ordered successive features to be similar; however, they did not consider outlier features. We may propose the following robust fused Lasso formulation to tackle this:

$$L(p + r) + \lambda_1\|p\|_1 + \lambda_2\|Fp\|_1 + \lambda_3\|r\|_1.$$
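As a sketch of what this proposed extension could look like (it is not implemented in the paper), the objective below combines a squared loss on $w = p + r$ with the three penalties; all names and the loss scaling are hypothetical:

```python
import numpy as np

def robust_fused_lasso_objective(p, r, X, y, lam1, lam2, lam3):
    """Proposed robust fused Lasso for ordered features: lasso and
    fused-lasso penalties on the smooth part p, and an L1 penalty on r
    to absorb outlier features (X is d x n with samples as columns)."""
    w = p + r
    loss = 0.5 * np.mean((X.T @ w - y) ** 2)
    return (loss
            + lam1 * np.sum(np.abs(p))           # ||p||_1
            + lam2 * np.sum(np.abs(np.diff(p)))  # ||F p||_1
            + lam3 * np.sum(np.abs(r)))          # ||r||_1
```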
Another example is spatial sequence data. Some works (Xu et al. 2016; Gao et al. 2019) utilize the spatial smoothness assumption, which states that the closer two objects are, the more similar they are. Analogously to the RoTS assumption, a robust spatial smoothness assumption could be proposed that simultaneously captures the spatial smoothness and detects outliers.

## Conclusion

The temporal smoothness assumption is widely used in the multi-task learning setting to analyze multiple time points simultaneously. However, it treats all tasks equally without considering the differences between them, thereby ignoring the negative effect of outlier tasks. In this paper, we assumed that every task consists of a temporal part and a discriminative part. Based on this, we proposed two Robust Temporal Smoothness (RoTS) frameworks that simultaneously pursue the temporal smoothness among tasks and capture the outlier tasks, with no additional computational complexity. The effectiveness of our approach is demonstrated by experimental results and theoretical analyses. Finally, we presented some possible applications in modeling disease progression, tensor multi-task models, and survival models, and discussed potential extensions of the RoTS idea to other kinds of sequence data, such as gene expression data and spatial data. Our future work will focus on applying these frameworks in broader areas.

## Acknowledgments

This research was supported by the National Natural Science Foundation of China (No. 62061050).

## References

Beck, A.; and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1): 183–202.

Bickel, P. J.; Ritov, Y.; and Tsybakov, A. B. 2009. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4): 1705–1732.

Chen, J.; Zhou, J.; and Ye, J. 2011. Integrating low-rank and group-sparse structures for robust multi-task learning. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 42–50.

Emrani, S.; McGuirk, A.; and Xiao, W. 2017. Prognosis and diagnosis of Parkinson's disease using multi-task learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1457–1466.

Fifty, C.; Amid, E.; Zhao, Z.; Yu, T.; Anil, R.; and Finn, C. 2021. Efficiently identifying task groupings for multi-task learning. Advances in Neural Information Processing Systems, 34.

Gao, Y.; Zhao, L.; Wu, L.; Ye, Y.; Xiong, H.; and Yang, C. 2019. Incomplete label multi-task deep learning for spatio-temporal event subtype forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 3638–3646.

Golub, G. H.; and Van Loan, C. F. 2013. Matrix Computations. JHU Press.

Gong, P.; Ye, J.; and Zhang, C. 2012. Robust multi-task feature learning. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 895–903.

Jack Jr, C. R.; Bernstein, M. A.; Fox, N. C.; Thompson, P.; Alexander, G.; Harvey, D.; Borowski, B.; Britson, P. J.; L. Whitwell, J.; Ward, C.; et al. 2008. The Alzheimer's Disease Neuroimaging Initiative (ADNI): MRI methods. Journal of Magnetic Resonance Imaging, 27(4): 685–691.

Li, H.; Fang, C.; and Lin, Z. 2020. Accelerated first-order optimization algorithms for machine learning. Proceedings of the IEEE, 108(11): 2067–2082.

Liu, J.; Ji, S.; and Ye, J. 2012. Multi-task feature learning via efficient l2,1-norm minimization. arXiv preprint arXiv:1205.2631.

Liu, J.; Yuan, L.; and Ye, J. 2010. An efficient algorithm for a class of fused lasso problems.
In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 323–332.

Lounici, K.; Pontil, M.; Tsybakov, A. B.; and Van De Geer, S. 2009. Taking advantage of sparsity in multi-task learning. arXiv preprint arXiv:0903.1468.

Meier, L.; Van De Geer, S.; and Bühlmann, P. 2008. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1): 53–71.

Nie, L.; Zhang, L.; Meng, L.; Song, X.; Chang, X.; and Li, X. 2016. Modeling disease progression via multisource multitask learners: A case study with Alzheimer's disease. IEEE Transactions on Neural Networks and Learning Systems, 28(7): 1508–1519.

Robinson, M. D.; McCarthy, D. J.; and Smyth, G. K. 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1): 139–140.

Romeo, L.; Armentano, G.; Nicolucci, A.; Vespasiani, M.; Vespasiani, G.; and Frontoni, E. 2020. A novel spatio-temporal multi-task approach for the prediction of diabetes-related complication: a cardiopathy case of study. In IJCAI, 4299–4305.

Saha, T. K.; Williams, T.; Hasan, M. A.; Joty, S.; and Varberg, N. K. 2018. Models for capturing temporal smoothness in evolving networks for learning latent representation of nodes. arXiv preprint arXiv:1804.05816.

Shen, J.; Zhen, X.; Worring, M.; and Shao, L. 2021. Variational multi-task learning with Gumbel-Softmax priors. Advances in Neural Information Processing Systems, 34.

Simon, N.; Friedman, J.; Hastie, T.; and Tibshirani, R. 2013. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2): 231–245.

Tibshirani, R.; Saunders, M.; Rosset, S.; Zhu, J.; and Knight, K. 2005. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1): 91–108.

Wang, P.; Shi, T.; and Reddy, C. K. 2020. Tensor-based temporal multi-task survival analysis. IEEE Transactions on Knowledge and Data Engineering.

Wei, W. W. 2006. Time series analysis. In The Oxford Handbook of Quantitative Methods in Psychology: Vol. 2.

Weiner, M. W.; Veitch, D. P.; Aisen, P. S.; Beckett, L. A.; Cairns, N. J.; Green, R. C.; Harvey, D.; Jack Jr, C. R.; Jagust, W.; Morris, J. C.; et al. 2017. Recent publications from the Alzheimer's Disease Neuroimaging Initiative: Reviewing progress toward improved AD clinical trials. Alzheimer's & Dementia, 13(4): e1–e85.

Xu, J.; Tan, P.-N.; Luo, L.; and Zhou, J. 2016. GSpartan: a geospatio-temporal multi-task learning framework for multi-location prediction. In Proceedings of the 2016 SIAM International Conference on Data Mining, 657–665. SIAM.

Xu, Y.; Sun, S.; Zhang, H.; Yi, C.; Miao, Y.; Yang, D.; Meng, X.; Hu, Y.; Wang, K.; Min, H.; et al. 2021. Time-aware graph embedding: A temporal smoothness and task-oriented approach. ACM Transactions on Knowledge Discovery from Data (TKDD), 16(3): 1–23.

Yao, Y.; Cao, J.; and Chen, H. 2019. Robust task grouping with representative tasks for clustered multi-task learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1408–1417.

Zhang, Y.; and Yang, Q. 2021. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering.

Zhao, L.; Li, X.; Xiao, J.; Wu, F.; and Zhuang, Y. 2015. Metric learning driven multi-task structured output optimization for robust keypoint tracking. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Zheng, J.; and Ni, L. M. 2013.
Time-dependent trajectory regression on road networks via multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 27, 1048–1055.

Zhou, J.; Liu, J.; Narayan, V. A.; and Ye, J. 2012. Modeling disease progression via fused sparse group lasso. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1095–1103.

Zhou, J.; Yuan, L.; Liu, J.; and Ye, J. 2011. A multi-task learning formulation for predicting disease progression. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 814–822.

Zhou, M.; Zhang, Y.; Liu, T.; Yang, Y.; and Yang, P. 2022. Multi-task learning with adaptive global temporal structure for predicting Alzheimer's disease progression. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2743–2752.