
Personalizing EEG-Based Affective Models with Transfer Learning

Wei-Long Zheng1 and Bao-Liang Lu1,2,3,*

1Center for Brain-like Computing and Machine Intelligence, Department of Computer Science and Engineering

2Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering

3Brain Science and Technology Research Center

Shanghai Jiao Tong University, Shanghai, China

{weilong, bllu}@sjtu.edu.cn

Abstract

Individual differences across subjects and the nonstationary characteristics of electroencephalography (EEG) limit the generalization of affective brain-computer interfaces in real-world applications. On the other hand, it is very time-consuming and expensive to acquire a large amount of subject-specific labeled data for learning subject-specific models. In this paper, we propose to build personalized EEG-based affective models without labeled target data using transfer learning techniques. We mainly explore two types of subject-to-subject transfer approaches. One is to exploit the shared structure underlying the source domain (source subject) and the target domain (target subject). The other is to train multiple individual classifiers on source subjects and transfer knowledge about classifier parameters to target subjects; its aim is to learn a regression function that maps feature distributions to classifier parameters. We compare the performance of five different approaches on an EEG dataset for constructing an affective model with three affective states: positive, neutral, and negative. The experimental results demonstrate that our proposed subject transfer framework achieves a mean accuracy of 76.31%, compared with 56.73% on average for a conventional generic classifier.

1 Introduction

Affective brain-computer interfaces (aBCIs) [Mühl et al., 2014a] introduce affective factors into conventional brain-computer interfaces [Chung et al., 2011]. aBCIs provide relevant context information about users' affective states in brain-computer interfaces (BCIs) [Zander and Jatzev, 2012], which can help BCI systems react adaptively according to users' affective states rather than in a rule-based fashion. This is an efficient way to enhance current BCI systems by increasing the information flow without additional cost. Therefore, aBCIs have attracted increasing interest in both research and industry communities, and various studies have demonstrated their efficiency and feasibility [Mühl et al., 2014a; 2014b; Jenke et al., 2014; Eaton et al., 2015]. Although considerable progress has been made in the development of aBCIs in recent years, challenges remain, such as adaptation to changing environments and individual differences.

*Corresponding author

Until now, most studies have emphasized the choice of features and classifiers [Singh et al., 2007; Jenke et al., 2014; Abraham et al., 2014] and neglected individual differences in target persons. They focus on training subject-specific affective models. However, these approaches are practically infeasible in real-world scenarios since they require collecting a large amount of labeled data. In addition, the calibration phase is time-consuming and annoying. An intuitive and straightforward way of dealing with this problem is to train a generic classifier on data collected from a group of subjects and then make inferences on unseen data from a new subject. However, existing studies indicate that the performance of generic classifiers trained in this way degrades dramatically due to the structural and functional variability between subjects as well as the non-stationary nature of EEG signals [Samek et al., 2013; Morioka et al., 2015]. Technically, this issue is an instance of the covariate-shift challenge [Sugiyama et al., 2007]. An alternative way of dealing with this problem is to personalize a generic classifier for target subjects in an unsupervised fashion, with knowledge transferred from the existing labeled data at hand.

The problem mentioned above has motivated many researchers from different fields to develop transfer learning and domain adaptation algorithms [Duan et al., 2009; Pan and Yang, 2010; Chu et al., 2013]. Transfer learning methods try to transfer knowledge from a source domain to a target domain with few or no labeled samples available from the subjects of interest, which correspond to the inductive and transductive setups, respectively. Figure 1 illustrates the covariate-shift challenges of constructing EEG-based affective models. Traditional machine learning methods assume that the training data and the test data are independently and identically distributed (i.i.d.). However, due to the variability from subject to subject, this assumption cannot always be satisfied in aBCIs.

[Figure 1 plot omitted. Axes: Component 1, Component 2, Component 3. Legend: Negative, Neutral, and Positive, each for Subject 1 and Subject 2.]

Figure 1: Illustration of the covariate-shift challenges of constructing EEG-based affective models. Two sample subjects (subjects 1 and 2) are shown with three classes of emotions; the EEG features differ in their conditional probability distributions across subjects.

Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

In this work, we adopt transfer learning algorithms in a transductive setup (without any labeled samples from target subjects) to tackle the subject-to-subject variability in building EEG-based affective models. Let $X \in \mathcal{X}$ be the EEG recording of a sample $(X, y)$, where $y \in \mathcal{Y}$ represents the corresponding emotion label. In this case, $\mathcal{X} = \mathbb{R}^{C \times d}$, where $C$ is the number of channels and $d$ is the number of time series samples. Let $P(X)$ be the marginal probability distribution of $X$. According to [Pan and Yang, 2010], $\mathcal{D} = \{\mathcal{X}, P(X)\}$ is a domain, which in our case is a given subject from whom we record the EEG signals. The source and target domains in this paper share the same feature space, $\mathcal{X}_S = \mathcal{X}_T$, but the respective marginal probability distributions are different, that is, $P(X_S) \neq P(X_T)$. The key assumption in most domain adaptation methods is that $P(Y_S|X_S) = P(Y_T|X_T)$.

Recently, Jayaram and colleagues published a timely survey of current transfer learning techniques for BCIs [Jayaram et al., 2016]. Morioka et al. proposed to learn a common dictionary shared by multiple subjects and used the resting-state activity of a previously unseen target subject, rather than task sessions, as calibration data to compensate for individual differences [Morioka et al., 2015]. Krauledat and colleagues proposed a zero-training framework for extracting prototypical spatial filters that have better generalization properties [Krauledat et al., 2008]. Although most of these methods are based on variants of common spatial patterns (CSP) for motor-imagery paradigms, some studies have focused on passive BCIs [Wu et al., 2013] and EEG-based emotion recognition [Zheng et al., 2015].

In this paper, we propose to personalize EEG-based affective models by adopting two kinds of domain adaptation approaches in an unsupervised manner. One is to find a common feature space shared by the source and target domains. We apply the Transfer Component Analysis (TCA) and Kernel Principal Component Analysis (KPCA) based methods proposed in [Pan et al., 2011]. These methods learn a set of common transfer components underlying both the source domain and the target domain. When the data are projected to this subspace, the difference between the feature distributions of the two domains can be reduced. The other is to construct individual classifiers and learn a regression function that maps data distributions to classifier parameters, which is referred to as Transductive Parameter Transfer (TPT) [Sangineto et al., 2014]. We evaluate the performance of these approaches on an EEG dataset, SEED1, to personalize EEG-based affective models.

2.1 TCA-based Subject Transfer

Transfer component analysis (TCA), proposed by Pan et al., learns a set of common transfer components between the source domain and the target domain [Pan et al., 2011] and finds a low-dimensional feature subspace across the two domains in which the difference between their feature distributions is reduced. The aim of this transfer learning algorithm is to find a transformation $\phi(\cdot)$ such that $P(\phi(X_S)) \approx P(\phi(X_T))$ and $P(Y_S|\phi(X_S)) \approx P(Y_T|\phi(X_T))$ without any labeled data in the target domain (target subjects). An intuitive approach to finding the mapping $\phi(\cdot)$ is to minimize the Maximum Mean Discrepancy (MMD) [Gretton et al., 2006] between the empirical means of the two domains,

$$\mathrm{Dist}(X'_S, X'_T) = \left\| \frac{1}{n_1}\sum_{i=1}^{n_1}\phi(x_i^s) - \frac{1}{n_2}\sum_{j=1}^{n_2}\phi(x_j^t) \right\|_{\mathcal{H}}^2, \quad (1)$$

where $n_1$ and $n_2$ represent the numbers of samples in the source domain and the target domain, respectively. However, $\phi(\cdot)$ is usually highly nonlinear, and a direct optimization with respect to $\phi(\cdot)$ can easily get stuck in poor local minima [Pan et al., 2011].
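As a minimal sketch of the quantity described above, the snippet below computes the squared distance between the empirical means of two sample sets, assuming the identity feature map $\phi(x) = x$ (a linear kernel). The function name `linear_mmd` and the synthetic arrays standing in for two subjects' features are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

def linear_mmd(source, target):
    """Squared distance between the empirical means of two sample sets.

    Assumes the identity feature map phi(x) = x, i.e. a linear kernel.
    """
    mean_gap = source.mean(axis=0) - target.mean(axis=0)
    return float(mean_gap @ mean_gap)

# Synthetic stand-ins for the features of two subjects (hypothetical data).
rng = np.random.default_rng(0)
subject_a = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
subject_b = rng.normal(loc=0.5, scale=1.0, size=(150, 5))

mmd_shifted = linear_mmd(subject_a, subject_b)  # clearly positive: the means differ
mmd_same = linear_mmd(subject_a, subject_a)     # exactly zero: identical samples
```

A larger value indicates a larger gap between the two marginal distributions, at least as far as their means are concerned.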

TCA is a dimensionality-reduction-based domain adaptation method. It embeds both the source and target domain data into a shared low-dimensional latent space using a mapping $\phi$. Specifically, let the Gram matrices defined on the source domain, target domain, and cross-domain data in the embedded space be $K_{S,S}$, $K_{T,T}$, and $K_{S,T}$, respectively. The kernel matrix $K$ is defined on all the data as

$$K = \begin{bmatrix} K_{S,S} & K_{S,T} \\ K_{T,S} & K_{T,T} \end{bmatrix} \in \mathbb{R}^{(n_1+n_2) \times (n_1+n_2)}. \quad (2)$$

By virtue of the kernel trick, the MMD distance can be rewritten as $\mathrm{tr}(KL)$, where $K = [\phi(x_i)^\top \phi(x_j)]$, and $L_{ij} = 1/n_1^2$ if $x_i, x_j \in X_S$, $L_{ij} = 1/n_2^2$ if $x_i, x_j \in X_T$, and $L_{ij} = -1/(n_1 n_2)$ otherwise. A matrix $\widetilde{W} \in \mathbb{R}^{(n_1+n_2) \times m}$ transforms the empirical kernel map $K$ to an $m$-dimensional space (where $m \ll n_1 + n_2$). The resultant kernel matrix is

$$\widetilde{K} = (K K^{-1/2} \widetilde{W})(\widetilde{W}^\top K^{-1/2} K) = K W W^\top K, \quad (3)$$

where $W = K^{-1/2} \widetilde{W}$. With the definition of $\widetilde{K}$ in Eq. (3), the MMD distance between the empirical means of the two domains $X'_S$ and $X'_T$ can be rewritten as

$$\mathrm{Dist}(X'_S, X'_T) = \mathrm{tr}((K W W^\top K) L) = \mathrm{tr}(W^\top K L K W). \quad (4)$$

A regularization term $\mathrm{tr}(W^\top W)$ is usually added to control the complexity of $W$ while minimizing Eq. (4).

1http://bcmi.sjtu.edu.cn/~seed/index.html

Besides reducing the difference between the two distributions, $\phi$ should also preserve the data variance that is related to the target learning task. From Eq. (3), the variance of the projected samples is $W^\top K H K W$, where $H = I_{n_1+n_2} - \frac{1}{n_1+n_2}\mathbf{1}\mathbf{1}^\top$ is the centering matrix, $\mathbf{1} \in \mathbb{R}^{n_1+n_2}$ is the column vector of all ones, and $I_{n_1+n_2} \in \mathbb{R}^{(n_1+n_2) \times (n_1+n_2)}$ is the identity matrix.
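The trace form of the MMD and the centering matrix can be checked numerically. The sketch below assumes a linear kernel and small synthetic data; the helper names `mmd_matrix` and `centering_matrix` are hypothetical. It builds $L$ with entries $1/n_1^2$, $1/n_2^2$, and $-1/(n_1 n_2)$ and compares $\mathrm{tr}(KL)$ with the squared gap between the two empirical means.

```python
import numpy as np

def mmd_matrix(n1, n2):
    """Coefficient matrix L: 1/n1^2 (source block), 1/n2^2 (target block),
    and -1/(n1*n2) for cross-domain pairs."""
    e = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)])
    return np.outer(e, e)

def centering_matrix(n):
    """H = I - (1/n) 1 1^T."""
    return np.eye(n) - np.ones((n, n)) / n

rng = np.random.default_rng(1)
xs = rng.normal(size=(6, 3))             # hypothetical source samples
xt = rng.normal(loc=1.0, size=(4, 3))    # hypothetical target samples
x_all = np.vstack([xs, xt])

K = x_all @ x_all.T                      # linear kernel on all samples
L = mmd_matrix(len(xs), len(xt))
H = centering_matrix(len(x_all))

trace_form = np.trace(K @ L)
gap = xs.mean(axis=0) - xt.mean(axis=0)
direct_form = gap @ gap                  # squared distance between the means
```

Under these assumptions the two quantities agree up to floating-point error, and $H$ is idempotent and annihilates constant vectors, which is what makes $W^\top K H K W$ a variance term.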

Therefore, the objective function of TCA is

$$\min_{W} \; \mathrm{tr}(W^\top K L K W) + \mu\,\mathrm{tr}(W^\top W) \quad \text{s.t.} \quad W^\top K H K W = I_m, \quad (5)$$

where $\mu > 0$ is a regularization parameter, and $I_m \in \mathbb{R}^{m \times m}$ is the identity matrix. According to [Pan et al., 2011], the solutions $W$ are the $m$ leading eigenvectors of $(KLK + \mu I)^{-1}KHK$, where $m \le n_1 + n_2 - 1$. The algorithm of TCA for subject transfer is summarized in Algorithm 1. We refer readers to [Pan et al., 2011] for a detailed description of TCA. After obtaining the transformation matrix $W$, standard machine learning methods can be used in this feature subspace.

Algorithm 1 TCA-based Subject Transfer

input: Source domain data set $D_S = \{(x_i^s, y_i^s)\}_{i=1}^{n_1}$, and target domain data set $D_T = \{x_j^t\}_{j=1}^{n_2}$.
output: Transformation matrix $W$.

1: Compute the kernel matrix $K$ from $\{x_i^s\}_{i=1}^{n_1}$ and $\{x_j^t\}_{j=1}^{n_2}$, the matrix $L$, and the centering matrix $H$.
2: Eigendecompose the matrix $(KLK + \mu I)^{-1}KHK$ and select the $m$ leading eigenvectors to construct the transformation matrix $W$.
3: return transformation matrix $W$.
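The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used in the experiments (which was written in MATLAB): it assumes a linear kernel, takes the $m$ leading eigenvectors of $(KLK + \mu I)^{-1}KHK$ as stated in the text, and does not rescale them to satisfy the constraint exactly. The names `tca_fit` and `std_gap` and the synthetic data are assumptions.

```python
import numpy as np

def tca_fit(xs, xt, m=2, mu=1.0):
    """Return the kernel matrix K and the transformation matrix W (linear kernel)."""
    n1, n2 = len(xs), len(xt)
    n = n1 + n2
    x_all = np.vstack([xs, xt])
    K = x_all @ x_all.T
    e = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)])
    L = np.outer(e, e)                              # MMD coefficient matrix
    H = np.eye(n) - np.ones((n, n)) / n             # centering matrix
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(A)             # A is not symmetric in general
    order = np.argsort(-eigvals.real)[:m]           # m leading eigenvalues
    return K, eigvecs[:, order].real

def std_gap(a, b):
    """Per-dimension gap between domain means, in units of pooled standard deviation."""
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / np.vstack([a, b]).std(axis=0)

# Hypothetical data: the target is shifted along the first feature only.
rng = np.random.default_rng(0)
xs = rng.normal(size=(40, 4))
xt = rng.normal(size=(30, 4)) + np.array([2.0, 0.0, 0.0, 0.0])

K, W = tca_fit(xs, xt, m=2, mu=1.0)
Z = K @ W                                           # embedded source and target samples
zs, zt = Z[:40], Z[40:]

gap_before = std_gap(xs, xt).max()
gap_after = std_gap(zs, zt).max()
```

In this toy setting the transfer components avoid the shifted direction, so the standardized gap between the domain means is smaller in the embedded space than in the original feature space. A standard classifier would then be trained on `zs` and applied to `zt`.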

2.2 KPCA-based Subject Transfer

Kernel PCA [Schölkopf et al., 1998] projects the original D-dimensional feature space into an M-dimensional feature space with a nonlinear transformation φ(x), where M D. For KPCA-based subject transfer, we concatenate the source and target data as the training data and construct the kernel matrix. The kernel principal components are then computed using singular value decomposition. Each sample $x_i$ is projected to a point $\phi(x_i)$ in lower-dimensional subspaces. The algorithm of KPCA-based subject transfer is summarized in Algorithm 2.

Algorithm 2 KPCA-based Subject Transfer

input: Source domain data set $D_S = \{(x_i^s, y_i^s)\}_{i=1}^{n_1}$, and target domain data set $D_T = \{x_j^t\}_{j=1}^{n_2}$.
output: The kernel principal components $p_k$.

1: Concatenate the source and target domain data sets as the training data set, $\{x_i\}_{i=1}^{n_1+n_2} = [\{x_i^s\}_{i=1}^{n_1}, \{x_j^t\}_{j=1}^{n_2}]$.
2: Construct the kernel matrix from the training data set $\{x_i\}_{i=1}^{n_1+n_2}$.
3: Compute the vectors $a_k$.
4: Compute the kernel principal components $p_k$.
5: Return the kernel principal components $p_k$.
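Algorithm 2 can be sketched as follows, assuming a linear kernel and using a symmetric eigendecomposition of the centered kernel matrix in place of an explicit singular value decomposition (the two give the same components for a symmetric positive semidefinite matrix). The function name `kpca_projections` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def kpca_projections(xs, xt, n_components=2):
    """Kernel PCA on the concatenated source and target data (linear kernel)."""
    x_all = np.vstack([xs, xt])                 # step 1: concatenate both domains
    n = len(x_all)
    K = x_all @ x_all.T                         # step 2: kernel matrix
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                              # centre the data in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    a = eigvecs[:, order] / np.sqrt(eigvals[order])   # step 3: coefficient vectors a_k
    return Kc @ a                               # step 4: components p_k for every sample

rng = np.random.default_rng(0)
xs = rng.normal(size=(40, 4))                   # hypothetical source features
xt = rng.normal(size=(30, 4)) + 1.0             # hypothetical target features
P = kpca_projections(xs, xt, n_components=2)

# With a linear kernel the result matches ordinary PCA scores up to sign.
xc = np.vstack([xs, xt])
xc = xc - xc.mean(axis=0)
U, S, _ = np.linalg.svd(xc, full_matrices=False)
pca_scores = U[:, :2] * S[:2]
```

Unlike TCA, this projection contains no term that penalizes the distance between the two domains; it only preserves variance of the pooled data.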

[Figure 2 diagram omitted. Block labels: Labeled Source Data, Individual Classifier, Feature Distributions, Parameter Regression Model, Parameter Transfer, Unlabeled Target Data, Personalized Affective Model. Phases: Training Phase, Testing Phase.]

Figure 2: The framework of the transductive parameter transfer (TPT) approach adopted in this work.

2.3 Transductive Parameter Transfer

The transductive parameter transfer (TPT) approach was first proposed by Sangineto and colleagues for action unit detection and spontaneous pain recognition [Sangineto et al., 2014]. The TPT approach consists of three main steps, as illustrated in Figure 2. First, an individual classifier is learned on each training dataset $D_i^s$. Second, a regression function is trained to learn the relation between the data distributions and the classifiers' parameter vectors. Finally, the target classifier is obtained using the target feature distribution and the distribution-to-classifier mapping.

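We adopt the TPT approach to personalize EEG-based affective models in this paper. In the first phase, we train an individual classifier on each source dataset $D_i^s$. Here, we use a linear support vector machine (SVM) as the classifier. $\theta_i = [w_i^\top, b_i]$ defines the hyperplane in the feature space. The objective function is as follows,

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 + \lambda_L \sum_{j} l(w^\top x_j + b, y_j), \quad (6)$$

where $l(\cdot)$ is the hinge loss and the sum runs over the labeled samples $(x_j, y_j)$ of $D_i^s$.

In the second step, a regression function $f$ is learned for the mapping $\mathcal{D} \to \Theta$ with the source data distributions, where $\Theta$ denotes the space of classifier parameter vectors.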

Since the data distributions and the corresponding optimal hyperplanes are related, we can predict the hyperplane and construct the classifier on the target data without any label information by learning this mapping. To quantify the similarity between pairs of datasets $X_i$ and $X_j$, a kernel function $\kappa(X_i, X_j)$ is adopted. In this implementation, we use the density estimation kernel [Blanchard et al., 2011], which is defined as follows,

$$\kappa(X_i, X_j) = \frac{1}{nm} \sum_{p=1}^{n} \sum_{q=1}^{m} \kappa_X(x_p, x_q), \quad (7)$$

where $n$ and $m$ are the cardinalities of $X_i$ and $X_j$, respectively, and $\kappa_X(\cdot, \cdot)$ is a Gaussian kernel. After computing the kernel matrix, the mapping function $f$ can be learned using the Multioutput Support Vector Regression framework [Tuia et al., 2011].
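The density estimation kernel is simply the average of a Gaussian kernel over all cross pairs of samples from the two datasets. A minimal sketch follows; the bandwidth `sigma`, the function names, and the synthetic datasets are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Pairwise Gaussian kernel between the rows of a and the rows of b."""
    sq_dist = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def dataset_kernel(xi, xj, sigma=1.0):
    """Average of the Gaussian kernel over all n*m cross pairs of samples."""
    return float(gaussian_kernel(xi, xj, sigma).mean())

rng = np.random.default_rng(0)
subject_a = rng.normal(size=(50, 3))
subject_b = rng.normal(size=(60, 3)) + 0.1    # similar distribution
subject_c = rng.normal(size=(40, 3)) + 5.0    # very different distribution

k_ab = dataset_kernel(subject_a, subject_b)
k_ac = dataset_kernel(subject_a, subject_c)
```

Datasets drawn from similar distributions receive a larger kernel value than datasets drawn from distant ones, which is the property the regression step relies on.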

Finally, the parameter vector of the target classifier can be predicted by $\theta_t = f(X^t)$ without any label information from target subjects. Given the target features $x$ and the classifier parameters $\theta_t$, the label can be predicted by the decision function $y = \mathrm{sign}(w_t^\top x + b_t)$. The algorithm of TPT-based subject transfer is summarized in Algorithm 3. For more details about the TPT algorithm, we refer readers to [Sangineto et al., 2014].

Algorithm 3 TPT-based Subject Transfer

input: Source domain data sets $D_1^s, \ldots, D_N^s$, target domain data set $X^t$, and some regularization parameters for SVMs.
output: The parameter vector of the target classifier: $w_t$, $b_t$.

1: Construct individual classifiers: $\{\theta_i = (w_i, b_i)\}_{i=1}^{N}$.
2: Create a training set $\mathcal{T} = \{X_i^s, \theta_i\}_{i=1}^{N}$.
3: Compute the kernel matrix $K$, $K_{ij} = \kappa(X_i^s, X_j^s)$.
4: Given $K$ and $\mathcal{T}$, learn $f(\cdot)$ using multioutput support vector regression.
5: Compute $(w_t, b_t) = f(X^t)$.
6: Return $w_t$, $b_t$.
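The pipeline of Algorithm 3 can be illustrated on toy data. The sketch below is not the authors' implementation and makes two substitutions that should be read as assumptions: a least-squares linear classifier stands in for the linear SVM, and kernel ridge regression stands in for multioutput support vector regression. It covers only a two-class problem; the toy subjects, function names, and parameter values are all hypothetical.

```python
import numpy as np

def make_subject(rng, shift, n_per_class=100):
    """Toy 2-D subject: two classes centred at shift -/+ 1 along the first axis."""
    y = np.repeat([-1.0, 1.0], n_per_class)
    x = rng.normal(scale=0.3, size=(2 * n_per_class, 2))
    x[:, 0] += shift + y
    return x, y

def fit_linear(x, y, ridge=1e-3):
    """Least-squares linear classifier, theta = [w, b] (stand-in for a linear SVM)."""
    xa = np.hstack([x, np.ones((len(x), 1))])
    return np.linalg.solve(xa.T @ xa + ridge * np.eye(xa.shape[1]), xa.T @ y)

def dataset_kernel(xi, xj, sigma=1.0):
    """Density estimation kernel: mean Gaussian kernel over all cross pairs."""
    sq = (xi**2).sum(axis=1)[:, None] + (xj**2).sum(axis=1)[None, :] - 2.0 * xi @ xj.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2)).mean()

def accuracy(theta, x, y):
    return float(np.mean(np.sign(x @ theta[:-1] + theta[-1]) == y))

rng = np.random.default_rng(0)
sources = [make_subject(rng, s) for s in (-4.0, 0.0, 4.0)]
x_t, y_t = make_subject(rng, 3.5)        # target labels are used only for scoring

# Step 1: one classifier per source subject.
thetas = np.array([fit_linear(x, y) for x, y in sources])
# Steps 2-4: regress classifier parameters on the dataset kernel
# (kernel ridge regression used here instead of multioutput SVR).
K = np.array([[dataset_kernel(xi, xj) for xj, _ in sources] for xi, _ in sources])
alpha = np.linalg.solve(K + 0.01 * np.eye(len(sources)), thetas)
# Step 5: predict the target parameters from the unlabeled target data.
k_t = np.array([dataset_kernel(xi, x_t) for xi, _ in sources])
theta_t = k_t @ alpha

# Generic baseline: a single classifier trained on all pooled source data.
generic = fit_linear(np.vstack([x for x, _ in sources]),
                     np.concatenate([y for _, y in sources]))
acc_tpt = accuracy(theta_t, x_t, y_t)
acc_generic = accuracy(generic, x_t, y_t)
```

In this toy setting the pooled classifier places its boundary near the average of all subjects and fails on the shifted target, whereas the predicted parameters follow the most similar source subject. Adding a new source subject only requires one more classifier and one more row and column of the kernel matrix.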

3 Experiment Setup

3.1 EEG Dataset for Constructing Affective Models

We adopt film clips as stimuli to elicit emotions in the laboratory environment, since film clips contain both visual and auditory stimuli. A preliminary study is conducted using scores (1-5) to select a pool of film clips to elicit three emotions: positive, neutral, and negative. Each clip is edited to be coherent in eliciting the corresponding emotion and to last about four minutes. Finally, 15 emotional film clips with high ratings are chosen from the materials. For the experimental protocol, each experiment consists of 15 sessions with 15 different film clips. For each session, there is a 5 s hint before each clip, followed after the clip by a 45 s self-assessment and a 15 s rest. A total of 15 subjects (8 females; age mean: 23.27, std: 2.37) participated in our experiments. They are all informed about the experiments and given instructions before the experiments. They are asked to elicit their own corresponding emotions while watching the film clips. Only the data for which the intended emotions are elicited are used in further analysis. EEG data are recorded simultaneously with a 62-electrode cap according to the international 10-20 system using an ESI NeuroScan system. The original sampling rate is 1000 Hz. The impedance of each electrode is lower than 5 kΩ.

3.2 Data Preprocessing and Feature Extraction

For data preprocessing, since EEG data are often contaminated by electromyography (EMG) signals from facial expressions and electrooculography (EOG) signals from eye movements [Fatourechi et al., 2007], the raw EEG data are processed with a bandpass filter between 1 Hz and 75 Hz, and data with serious noise and artifacts are discarded. The 62-channel EEG signals are further down-sampled to 200 Hz, and EEG features are extracted from each segment of the preprocessed EEG data using a non-overlapping 1 s time window.

For feature extraction, we employ differential entropy (DE) features [Duan et al., 2013; Zheng and Lu, 2015], which show superior performance to conventional power spectral density (PSD) features. According to [Zheng and Lu, 2015], for a fixed-length EEG sequence, the DE feature is equivalent to the logarithm of the PSD in a certain frequency band. Therefore, the DE features can be calculated using the short-time Fourier transform in five frequency bands that are widely used in EEG studies (δ: 1-3 Hz, θ: 4-7 Hz, α: 8-13 Hz, β: 14-30 Hz, and γ: 31-50 Hz). The total feature dimension of a 62-channel EEG segment is 310 (62 channels × 5 bands).
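A rough sketch of the DE computation for one 1 s segment at 200 Hz is given below. It assumes that the band-limited signal is approximately Gaussian, so that its differential entropy is $\frac{1}{2}\log(2\pi e \sigma^2)$, and it estimates the band variance $\sigma^2$ from a plain FFT of the segment via Parseval's relation. The rectangular window, the function name `de_features`, and the synthetic test signals are assumptions, not details taken from the experiments.

```python
import numpy as np

# Frequency bands in Hz as listed in the text.
BANDS = {"delta": (1, 3), "theta": (4, 7), "alpha": (8, 13),
         "beta": (14, 30), "gamma": (31, 50)}

def de_features(segment, fs=200):
    """DE features of one segment of shape (channels, samples).

    Returns a vector of length channels * 5, ordered channel by channel.
    """
    n = segment.shape[1]
    spectrum = np.fft.rfft(segment, axis=1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    feats = []
    for low, high in BANDS.values():
        mask = (freqs >= low) & (freqs <= high)
        # Parseval: variance carried by the bins of this band (DC and Nyquist excluded).
        band_var = 2.0 * np.sum(np.abs(spectrum[:, mask]) ** 2, axis=1) / n**2
        feats.append(0.5 * np.log(2.0 * np.pi * np.e * band_var))
    return np.stack(feats, axis=1).ravel()

rng = np.random.default_rng(0)
segment = rng.normal(size=(62, 200))           # hypothetical 62-channel, 1 s segment
features = de_features(segment)                # 62 channels x 5 bands = 310 values

# Sanity check: a 10 Hz sinusoid of amplitude 2 has variance 2 in the alpha band.
t = np.arange(200) / 200.0
tone = 2.0 * np.sin(2.0 * np.pi * 10.0 * t) + 1e-3 * rng.normal(size=200)
tone_features = de_features(tone[None, :])
expected_alpha = 0.5 * np.log(2.0 * np.pi * np.e * 2.0)
```

Because the DE is a logarithm of band power, it compresses the large dynamic range of EEG power across bands, which is consistent with the equivalence to log-PSD noted above.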

3.3 Evaluation Details

We adopt a leave-one-subject-out cross-validation method for the evaluation. Each time, we use the data from one subject as the target domain and the data from the remaining 14 subjects as the source domain. For the baseline method, we concatenate the data from all available source subjects as training data and train a generic classifier with a linear SVM. A variant of SVM called Transductive SVM (T-SVM), which learns a decision boundary and maximizes the margin using unlabeled data [Collobert et al., 2006], is also evaluated. We use the implementation of T-SVM in SVMlight [Joachims, 1999].
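The leave-one-subject-out protocol can be written down as a small generator. The function name and the use of integer subject indices 1 to 15 are illustrative assumptions.

```python
def leave_one_subject_out(subject_ids):
    """Yield (target, sources) pairs: each subject is the target domain once."""
    for target in subject_ids:
        sources = [s for s in subject_ids if s != target]
        yield target, sources

subjects = list(range(1, 16))                  # 15 subjects, indexed 1..15
folds = list(leave_one_subject_out(subjects))

first_target, first_sources = folds[0]         # subject 1 as target, 2..15 as sources
```

Reported accuracies are then averaged over the 15 folds, with the target subject's labels used only for scoring.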

For TCA and KPCA, it is practically infeasible to include all the data from the available subjects due to the memory and time cost of singular value decomposition. Therefore, we randomly select a subset of 5000 samples from the 14 source subjects as training data. For kernel functions, we employ linear kernels. We evaluate how the performance varies with respect to the dimensionality of the new feature space. The regularization parameter µ is set to 1, the same as in [Pan et al., 2011]. After TCA and KPCA project the original features into a low-dimensional subspace using transfer components, standard linear SVMs are trained on the newly extracted features. The value of the regularization parameter for TPT is 0.1. To deal with the multi-class classification task, we adopt the one-vs-one strategy to avoid the label imbalance problem. All the algorithms are implemented in MATLAB.

4 Experiment Results

In this section, we carry out experiments to demonstrate the effectiveness of the proposed methods for personalizing EEG-based affective models. All results are obtained using a leave-one-subject-out evaluation scheme. First, we evaluate how the performance of the KPCA and TCA approaches varies with the dimensionality of the low-dimensional feature subspace and choose the best dimensions. Figure 3 depicts the accuracy curves with respect to varying dimensions. TCA achieves better performance than KPCA in lower dimensions (less than 30) and reaches its peak accuracy of 63.64% in the 30-dimensional subspace. In contrast, the accuracy of KPCA increases approximately monotonically with the number of dimensions and saturates at about 35 dimensions. In terms of peak accuracy, TCA slightly outperforms KPCA (63.64% vs 61.28%).

[Figure 3 plot omitted. Horizontal axis: # Dimension (0 to 50).]

Figure 3: Comparison of KPCA and TCA approaches for different dimensionality of the subspace.

We compare the performance of the different subject transfer methods. Figure 4 shows the accuracies of the different methods (generic classifiers, KPCA, TCA, T-SVM, and TPT) for all 15 subjects, and Table 1 presents the mean accuracies and standard deviations. The generic classifiers perform poorly, with a mean accuracy of only 56.73%, because this method directly includes all source samples in training. Since there are individual differences across subjects and training samples from irrelevant subjects are included, these factors may bias the optimal hyperplane and dramatically degrade the performance of subject transfer, a phenomenon known as negative transfer. TCA and KPCA learn a set of transfer components underlying both the source and target distributions. The feature distributions of both domains are similar in the new low-dimensional subspace. Both TCA and KPCA outperform the generic classifiers, indicating efficient knowledge transfer through feature reduction. T-SVM achieves comparatively high accuracy among these methods, with a mean accuracy of 72.53% and a standard deviation of 14.00%. T-SVM learns the decision boundary in a semi-supervised manner and weights all the training instances equally, which may still introduce some irrelevant source data during training.

Stats.   Generic   KPCA     TCA      T-SVM    TPT
Mean     0.5673    0.6128   0.6364   0.7253   0.7631
Std.     0.1629    0.1462   0.1488   0.1400   0.1589

Table 1: The mean accuracies and standard deviations of the five different approaches.

The TPT method outperforms the other four approaches with the highest mean accuracy of 76.31% and achieves a significant improvement in comparison with the generic classifiers (one-way analysis of variance, p < 0.01). TPT measures the similarity between pairs of data distributions from different subjects and learns the mapping from data distributions to classifier parameters. Therefore, it can extract the relevant information to determine the decision function and bypass the bias caused by irrelevant information. Besides accuracy, we compare the generalization performance of the KPCA, TCA, and TPT approaches. KPCA and TCA need to construct the kernel matrix using the source and target domains and find the manifold subspaces in which the differences between them are reduced. Their limitation is the high memory and time cost of training, so they cannot continually benefit from an increasing number of samples in the source domain. In contrast, TPT keeps individual classifiers for the different source subjects, and the new regression function is trained only on the kernel matrix. It can therefore be updated incrementally when data from a new training subject become available. As a result, the TPT approach is more feasible in practice with regard to both performance and incremental learning.

The confusion matrices of the five approaches are shown in Figure 5. Each row of a confusion matrix represents the target class and each column represents the predicted class. The element (i, j) is the percentage of samples in class i that are classified as class j. For the generic method, the accuracies for the three emotions are similar. TCA and KPCA improve the classification of positive emotions, with accuracies of 75.88% and 66.23%, respectively. Besides the improved performance in recognizing positive emotions, T-SVM shows a significant increase in performance for recognizing neutral emotions (71.52%). TPT has the highest accuracies for all three emotions among these approaches. Comparing the accuracies for the different emotions, we find that positive emotions can be recognized more easily using EEG, with a comparatively high accuracy of 85.01%. Negative emotions are often confused with neutral emotions (25.76%) and vice versa (10.24%). These results indicate that the neural patterns of negative and neutral emotions are similar. In summary, the experimental results demonstrate the efficiency of the TPT approach for constructing personalized affective models from EEG.
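The row-normalized confusion matrix described above can be computed as in the following sketch. The label encoding (0 = negative, 1 = neutral, 2 = positive), the function name, and the toy predictions are assumptions chosen for illustration and do not reproduce the reported results.

```python
import numpy as np

def confusion_percent(y_true, y_pred, n_classes=3):
    """Element (i, j): percentage of samples of class i that are predicted as class j."""
    counts = np.zeros((n_classes, n_classes))
    np.add.at(counts, (y_true, y_pred), 1)      # accumulate one count per sample
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)

# Toy labels: 0 = negative, 1 = neutral, 2 = positive (hypothetical encoding).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 1, 1, 2, 2, 2, 2, 2])
cm = confusion_percent(y_true, y_pred)
```

Each row sums to 100, so the diagonal gives the per-class accuracy and the off-diagonal entries show which emotions are confused with one another.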

5 Conclusion and Future Work

In this paper, we have proposed a novel method for personalizing EEG-based affective models with transfer learning techniques. The affective models are personalized and constructed for a new target subject without any label information. We have compared the performance of five different methods: TPT, T-SVM, TCA, KPCA, and conventional generic classifiers. The experimental results have demonstrated that the transductive parameter transfer approach significantly outperforms the other approaches in terms of accuracy, achieving an increase of 19.58 percentage points in recognition accuracy over the generic classifiers (76.31% vs 56.73%). TPT can capture the similarity between data distributions by taking advantage of kernel functions and learn the mapping from data distributions to classifier parameters with the regression framework. In future work, we will evaluate the performance of our proposed approaches on more categories of emotions as well as on publicly available EEG datasets.

[Figure 4 plot omitted. Horizontal axis: Subject Index (1 to 15, and Mean). Legend: Generic, KPCA, TCA, T-SVM, TPT.]

Figure 4: The accuracies of the five different methods (Generic classifiers, KPCA, TCA, T-SVM, and TPT) for each subject.

Figure 5: The confusion matrices of the five methods.

Acknowledgments

This work was supported in part by the grants from the National Natural Science Foundation of China (Grant No.61272248), the National Basic Research Program of China (Grant No.2013CB329401), and the Major Basic Research Program of Shanghai Science and Technology Committee (15JC1400103).

References

[Abraham et al., 2014] Alexandre Abraham, Fabian Pedregosa, Michael Eickenberg, Philippe Gervais, Andreas Mueller, Jean Kossaifi, Alexandre Gramfort, Bertrand Thirion, and Gaël Varoquaux. Machine learning for neuroimaging with scikit-learn. Frontiers in Neuroinformatics, 8, 2014.

[Blanchard et al., 2011] Gilles Blanchard, Gyemin Lee, and Clayton Scott. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems, pages 2178-2186, 2011.

[Chu et al., 2013] Wen-Sheng Chu, Fernando De la Torre, and Jeffrey F Cohn. Selective transfer machine for personalized facial action unit detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3515-3522. IEEE, 2013.

[Chung et al., 2011] Mike Chung, Willy Cheung, Reinhold Scherer, and Rajesh PN Rao. A hierarchical architecture for adaptive brain-computer interfacing. In IJCAI'11, volume 22, page 1647, 2011.

[Collobert et al., 2006] Ronan Collobert, Fabian Sinz, Jason Weston, and Léon Bottou. Large scale transductive SVMs. The Journal of Machine Learning Research, 7:1687-1712, 2006.

[Duan et al., 2009] Lixin Duan, Ivor W Tsang, Dong Xu, and Tat-Seng Chua. Domain adaptation from multiple sources via auxiliary classifiers. In 26th Annual International Conference on Machine Learning, pages 289-296. ACM, 2009.

[Duan et al., 2013] Ruo-Nan Duan, Jia-Yi Zhu, and Bao-Liang Lu. Differential entropy feature for EEG-based emotion classification. In 6th International IEEE/EMBS Conference on Neural Engineering, pages 81-84. IEEE, 2013.

[Eaton et al., 2015] Joel Eaton, Duncan Williams, and Eduardo Miranda. The space between us: Evaluating a multiuser affective brain-computer music interface. Brain-Computer Interfaces, 2(2-3):103-116, 2015.

[Fatourechi et al., 2007] Mehrdad Fatourechi, Ali Bashashati, Rabab K Ward, and Gary E Birch. EMG and EOG artifacts in brain computer interface systems: A survey. Clinical Neurophysiology, 118(3):480-494, 2007.

[Gretton et al., 2006] Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pages 513-520, 2006.

[Jayaram et al., 2016] Vinay Jayaram, Morteza Alamgir, Yasemin Altun, Bernhard Schölkopf, and Moritz Grosse-Wentrup. Transfer learning in brain-computer interfaces. IEEE Computational Intelligence Magazine, 11(1):20-31, 2016.

[Jenke et al., 2014] Robert Jenke, Angelika Peer, and Martin Buss. Feature extraction and selection for emotion recognition from EEG. IEEE Transactions on Affective Computing, 5(3):327-339, 2014.

[Joachims, 1999] Thorsten Joachims. Making large scale SVM learning practical. Technical report, Universität Dortmund, 1999.

[Krauledat et al., 2008] Matthias Krauledat, Michael Tangermann, Benjamin Blankertz, and Klaus-Robert Müller. Towards zero training for brain-computer interfacing. PLoS One, 3:e2967, 08 2008.

[Morioka et al., 2015] Hiroshi Morioka, Atsunori Kanemura, Jun-ichiro Hirayama, Manabu Shikauchi, Takeshi Ogawa, Shigeyuki Ikeda, Motoaki Kawanabe, and Shin Ishii. Learning a common dictionary for subject-transfer decoding with resting calibration. NeuroImage, 111:167-178, 2015.

[Mühl et al., 2014a] Christian Mühl, Brendan Allison, Anton Nijholt, and Guillaume Chanel. A survey of affective brain computer interfaces: principles, state-of-the-art, and challenges. Brain-Computer Interfaces, 1(2):66-84, 2014.

[Mühl et al., 2014b] Christian Mühl, Camille Jeunet, and Fabien Lotte. EEG-based workload estimation across affective contexts. Frontiers in Neuroscience, 8, 2014.

[Pan and Yang, 2010] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2010.

[Pan et al., 2011] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199-210, 2011.

[Samek et al., 2013] Wojciech Samek, Frank C Meinecke, and Klaus-Robert Müller. Transferring subspaces between subjects in brain computer interfacing. IEEE Transactions on Biomedical Engineering, 60(8):2289-2298, 2013.

[Sangineto et al., 2014] Enver Sangineto, Gloria Zen, Elisa Ricci, and Nicu Sebe. We are not all equal: Personalizing models for facial expression analysis with transductive parameter transfer. In ACM International Conference on Multimedia, pages 357-366. ACM, 2014.

[Schölkopf et al., 1998] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.

[Singh et al., 2007] Vishwajeet Singh, Krishna P Miyapuram, and Raju S Bapi. Detection of cognitive states from fMRI data using machine learning techniques. In IJCAI'07, pages 587-592, 2007.

[Sugiyama et al., 2007] Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. The Journal of Machine Learning Research, 8:985-1005, 2007.

[Tuia et al., 2011] Devis Tuia, Jochem Verrelst, Luis Alonso, Fernando Pérez-Cruz, and Gustavo Camps-Valls. Multioutput support vector regression for remote sensing biophysical parameter estimation. IEEE Geoscience and Remote Sensing Letters, 8(4):804-808, 2011.

[Wu et al., 2013] Dongrui Wu, Brent J Lance, and Thomas D Parsons. Collaborative filtering for brain-computer interaction using transfer learning and active class selection. PLoS One, 8(2), 2013.

[Zander and Jatzev, 2012] Thorsten O Zander and S Jatzev. Context-aware brain computer interfaces: exploring the information space of user, technical system and environment. Journal of Neural Engineering, 9(1):016003, 2012.

[Zheng and Lu, 2015] Wei-Long Zheng and Bao-Liang Lu. Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Transactions on Autonomous Mental Development, 7(3):162-175, 2015.

[Zheng et al., 2015] Wei-Long Zheng, Yong-Qi Zhang, Jia-Yi Zhu, and Bao-Liang Lu. Transfer components between subjects for EEG-based emotion recognition. In International Conference on Affective Computing and Intelligent Interaction, pages 917-922. IEEE, 2015.