The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Distribution-Based Semi-Supervised Learning for Activity Recognition

Hangwei Qian, Sinno Jialin Pan, Chunyan Miao
School of Computer Science and Engineering
Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly
Interdisciplinary Graduate School
Alibaba-NTU Singapore Joint Research Institute
Nanyang Technological University, Singapore
qian0045@e.ntu.edu.sg, {sinnopan, ascymiao}@ntu.edu.sg

Abstract

Supervised learning methods have been widely applied to activity recognition. The success of existing methods, however, rests on two crucial prerequisites: proper feature extraction and sufficient labeled training data. The former is important to differentiate activities, while the latter is crucial to build a precise learning model. These two prerequisites have become bottlenecks that limit the practicality of existing methods: most existing feature extraction methods depend heavily on domain knowledge, while labeled data requires intensive human annotation effort. Therefore, in this paper, we propose a novel method, named Distribution-based Semi-Supervised Learning (DSSL), to tackle both limitations. The proposed method automatically extracts powerful features with no domain knowledge required, while alleviating the heavy annotation effort through semi-supervised learning. Specifically, we treat the data stream of sensor readings received in a period as a distribution, and map all training distributions, both labeled and unlabeled, into a reproducing kernel Hilbert space (RKHS) using the kernel mean embedding technique. The RKHS is then altered by exploiting the underlying geometric structure of the unlabeled distributions. Finally, a classifier is trained with the labeled distributions in the altered RKHS. We conduct extensive experiments on three public datasets to verify the effectiveness of our method against state-of-the-art baselines.

Introduction

Human activity recognition has spurred a great deal of interest thanks to a wide spectrum of real-world applications, such as security, personalized health monitoring, and assisted living (Janidarmian et al. 2017; Bulling, Blanke, and Schiele 2014; Lara and Labrador 2013; Frank, Mannor, and Precup 2010; Avci et al. 2010). Generally, there are two types of scenarios: wireless-sensor-based and video-based. In this work, we focus on wireless-sensor-based activity recognition. In this setting, the data often takes the form of a continuous multivariate time series from multiple sensors, and thus needs to be divided into segments first, each of which corresponds to a single label. Traditionally, this requires intensive annotation of the starting and ending times of each activity. Furthermore, in order to increase the expressiveness of the data, feature extraction is commonly applied to each segment. The extracted features are then fed into a classifier to recognize different activities. Note that feature extraction and a large amount of labeled training data are both crucial to this process; we discuss them in detail hereinafter. It is well known that good features help to discriminate different classes of activities by increasing the expressiveness of each activity.
Generally, feature extraction approaches can be classified into two categories: statistical and structural (Lara and Labrador 2013). Structural features take into account the overall information of the data. For example, the SAX method transforms continuous data into discrete symbolic strings (Lin et al. 2007), and the ECDF method preserves the overall shape and spatial information of time-series data (Hammerla et al. 2013; Plötz, Hammerla, and Olivier 2011). Consequently, structural features rely heavily on domain knowledge. Statistical features, on the other hand, aim to capture the statistical information underlying each time-series segment. Around twenty handcrafted statistical features are commonly used and have proven beneficial in practice, including low-order moments (mean, variance, skewness, etc.), the median, and so on (Janidarmian et al. 2017). The major limitations of statistical features are the limited flexibility of handcrafted features and the domain knowledge required to design them. Recently, Qian, Pan, and Miao (2018) proposed the SMMAR approach, which is capable of automatically extracting all orders of moments as statistical features for activity recognition. Though SMMAR is able to systematically extract powerful statistical features, as a supervised learning method it requires a large amount of labeled data for training.

Note that label annotation on a large-scale dataset of sensor readings is a costly process. Therefore, growing research interest has focused on exploring the tradeoff between label ambiguity and human annotation effort. Some researchers focus on efficient annotation strategies to reduce labeling effort, both offline and online (Stikic et al. 2011), such as experience sampling, self-recall, and video recording. Several other works apply semi-supervised learning (Zhu 2005) to activity recognition, exploiting unlabeled data, which is supposedly easy to collect at very low cost, to learn a precise classifier even from a limited number of labeled examples (Guan et al. 2007; Stikic, Larlus, and Schiele 2009; Stikic et al. 2011). Most existing semi-supervised learning methods, however, still adopt handcrafted features.

In this paper, we propose a novel semi-supervised learning method, namely Distribution-based Semi-Supervised Learning (DSSL), which frees the practitioner from intensive feature engineering by using the kernel mean embedding technique for distributions (Berlinet and Thomas-Agnan 2011). To elaborate, we treat the data stream of sensor readings received in a period as a probability distribution. Modeling input instances as probability distributions is a new and promising machine learning paradigm, and some methods have been successfully developed in the supervised setting, e.g., Support Measure Machines (SMMs) (Muandet et al. 2012; 2017). Recently, Qian, Pan, and Miao (2018) proposed an SMM-based framework for activity recognition, known as SMMAR. A major advantage of SMMAR over other supervised learning methods for activity recognition is its capability of automatically extracting all orders of statistical moments as features to represent each input instance. Our proposed method, DSSL, extends SMMAR to the semi-supervised setting.
Compared with SMMAR and other supervised or semi-supervised learning methods for activity recognition, our contributions are four-fold:

1. Compared with other supervised or semi-supervised learning methods, DSSL is able to represent each instance, i.e., the data stream of a period, using all orders of statistical moments implicitly and automatically, which provides rich information to distinguish activities.
2. Compared with SMMAR, DSSL relaxes the full-supervision assumption and exploits unlabeled instances to learn an underlying data structure. With the learned structure and a few labeled instances, DSSL is able to learn a precise classifier for activity recognition.
3. Most existing works on learning with distributions are supervised. To the best of our knowledge, DSSL is the first attempt at semi-supervised learning with distributions. Moreover, we provide theoretical analysis proving that DSSL is valid for semi-supervised learning in a reproducing kernel Hilbert space (RKHS).
4. Extensive experiments are conducted to demonstrate the superior performance of DSSL over a number of state-of-the-art baselines.

Other Related Work

Limited labeled training data is insufficient to train a good classifier due to the cold-start problem of supervised learning. Semi-supervised learning approaches are appealing in practice since they require only a small fraction of labeled training data together with a large amount of easily obtained unlabeled data (Chapelle, Schölkopf, and Zien 2010; Zhu 2005). Among existing semi-supervised learning approaches, manifold regularization (Belkin, Niyogi, and Sindhwani 2006) and warping kernels using the point cloud (Sindhwani, Niyogi, and Belkin 2005) are two classic methods, which incorporate the manifold structure underlying both unlabeled and labeled data into the learning of Support Vector Machines (SVMs). In the context of activity recognition, Stikic, Larlus, and Schiele (2009) proposed a multi-graph-based semi-supervised approach named GLSVM, where each graph propagates different information about activities. Different graphs are then combined to improve label propagation, after which an SVM classifier is trained on both the initially labeled training data and the propagated labels. Matsushige, Kakusho, and Okadome (2015) proposed a semi-supervised kernel logistic regression method for activity recognition, denoted SSKLR, which extends kernel logistic regression to the semi-supervised setting and solves the problem with the Expectation-Maximization algorithm. Yao et al. (2016) proposed a robust graph-based semi-supervised method named RSAR to tackle the intra-class variability of activities across different subjects. RSAR extracts intrinsic shared subspace structures from activities under the assumption that intrinsic relationships have invariant properties and are thus less sensitive to varying subjects. In (Nazábal et al. 2016), a new Bayesian model is proposed to tackle scenarios with a very low number of sensors; the dynamic nature of human activities is further modeled as a first-order homogeneous Markov chain. Our proposed DSSL is a unified framework that naturally inherits the spirit of both learning from distributions and manifold learning.

Preliminaries

Support Measure Machines

In supervised learning with distributions, we are given a set of labeled data $\{X_i, y_i\}_{i=1}^{n}$, where $X_i = \{x_{ij}\}_{j=1}^{n_i}$ and $n_i$ may vary across different $X_i$'s. The goal is to learn a classifier $f$ to map $\{X_i\}$'s to $\{y_i\}$'s. In SMMs (Muandet et al.
2012), each $X_i$ is mapped to a functional in an RKHS $\mathcal{H}$ via kernel mean embedding (Berlinet and Thomas-Agnan 2011) as $\mu_{\mathbb{P}_i} = \mathbb{E}_{x_{ij} \sim \mathbb{P}_i}[k(x_{ij}, \cdot)]$, where $k(\cdot, \cdot)$ is a characteristic kernel associated with the RKHS $\mathcal{H}$. It has been proven that if the kernel is characteristic, then an arbitrary probability distribution $\mathbb{P}_i$ is uniquely represented by an element $\mu_{\mathbb{P}_i}$ in the RKHS, which implicitly captures all orders of statistical moments of $X_i$. The inner product of two distributions, i.e., a linear kernel, which measures their similarity, can be defined as

$\langle \mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_j} \rangle = \frac{1}{n_i n_j} \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} k(x_{ia}, x_{jb})$.

One can also define a nonlinear kernel on $\mu_{\mathbb{P}_i}$ and $\mu_{\mathbb{P}_j}$ to capture their nonlinear relationships via

$\bar{k}(\mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_j}) = \langle \psi(\mu_{\mathbb{P}_i}), \psi(\mu_{\mathbb{P}_j}) \rangle_{\bar{\mathcal{H}}}$,   (1)

where $\bar{k}(\cdot, \cdot)$ is the nonlinear kernel induced by the nonlinear feature map $\psi(\cdot)$, and $\bar{\mathcal{H}}$ is the corresponding RKHS. To train a classifier from $\{X_i\}$'s to $\{y_i\}$'s, SMMs define the optimization problem of learning $f \in \bar{\mathcal{H}}$ that minimizes the following regularized risk functional:

$\sum_{i=1}^{n} \ell(\mu_{\mathbb{P}_i}, y_i, f) + \Omega(\|f\|_{\bar{\mathcal{H}}})$,   (2)

where $\ell(\cdot)$ is the loss function and $\Omega(\cdot)$ is the regularization term. Note that $\bar{\mathcal{H}} = \mathcal{H}$ if $\bar{k}$ is linear.

Random Fourier Features Approximation

The kernel embedding technique of distributions used in SMMs is computationally expensive, as it requires computing kernel matrices. This makes it impractical in real-world applications with large datasets. To scale up SMMs, Qian, Pan, and Miao (2018) proposed an accelerated version that uses Random Fourier Features to construct an explicit feature map instead of relying on the kernel trick. Based on Bochner's Theorem (Rahimi and Recht 2007), a continuous, shift-invariant positive definite kernel $k(x, x')$ can be linearized by using a randomized feature map $z : \mathbb{R}^d \to \mathbb{R}^D$ as

$k(x, x') = \langle \phi(x), \phi(x') \rangle \approx z(x)^\top z(x')$,   (3)

where the inner product of explicit feature maps uniformly approximates the kernel values without the kernel trick, and each random Fourier feature is generated by

$z_w(x) = \sqrt{2} \cos(w^\top x + b)$,   (4)

where $w \sim p(w)$, the Fourier transform distribution of $k(\cdot, \cdot)$, and $b$ is sampled uniformly from $[0, 2\pi]$. It can be proven that $k(x, x') = \mathbb{E}(z_w(x)^\top z_w(x'))$ for all $x$ and $x'$. In practice, $D$ can be small, which enables SMMs to handle large-scale datasets.

The Proposed Methodology

Problem Statement

In our setting of activity recognition, we are given a set of $l$ labeled segments $\{X_i, y_i\}_{i=1}^{l}$ and a set of $u = n - l$ unlabeled segments $\{X_i\}_{i=l+1}^{n}$ as training data, obtained by applying segmentation methods to the raw data, where $X_i = [x_{i1} \, \dots \, x_{in_i}] \in \mathbb{R}^{d \times n_i}$, $y_i \in \{1, \dots, L\}$, $l \ll u$, and $n_i$ may vary across different segments. The goal is to make use of both labeled and unlabeled segments to learn a classifier from each segment $X$ to its corresponding label $y$. Following (Qian, Pan, and Miao 2018), each segment $X_i$, labeled or unlabeled, is treated as a sample of $n_i$ data points drawn from an unknown distribution $\mathbb{P}_i$. Kernel mean embedding is then applied to map each $X_i$ to an element $\mu_{\mathbb{P}_i}$ in an RKHS. In practice, to make the learning process more efficient, random Fourier features are used to approximate the nonlinear feature map induced by the kernel of the RKHS via $\mu_{\mathbb{P}_i} = \frac{1}{n_i} \sum_{j=1}^{n_i} z(x_{ij})$, where $\mu_{\mathbb{P}_i} \in \mathbb{R}^D$. Therefore, our goal becomes to learn a classifier $f : \mu_{\mathbb{P}_i} \mapsto y_i$ from $\{\mu_{\mathbb{P}_i}, y_i\}_{i=1}^{l}$ and $\{\mu_{\mathbb{P}_i}\}_{i=l+1}^{n}$.
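To make this construction concrete, below is a minimal sketch (ours, not the authors' released code) of the RFF-approximated kernel mean embedding for an RBF kernel; the dimensions, bandwidth $\gamma$, and toy segments are illustrative assumptions. It also checks the inner product of two embeddings against the exact linear kernel of the two distributions defined above.

```python
import numpy as np

def rff_map(X, W, b):
    """Random Fourier feature map z(x) = sqrt(2/D) * cos(W^T x + b),
    applied row-wise to a segment X of shape (n_i, d)."""
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def embed_segment(X, W, b):
    """Approximate kernel mean embedding: mu_P = (1/n_i) * sum_j z(x_ij)."""
    return rff_map(X, W, b).mean(axis=0)

rng = np.random.default_rng(0)
d, D, gamma = 6, 512, 0.5  # input dim, RFF dim, RBF bandwidth: all assumed here
# For k(x, x') = exp(-gamma * ||x - x'||^2), the Fourier transform distribution
# p(w) is Gaussian, w ~ N(0, 2*gamma*I); b is sampled uniformly from [0, 2*pi].
W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

X1 = rng.normal(size=(68, d))  # two toy "segments" of sensor readings
X2 = rng.normal(size=(70, d))
mu1, mu2 = embed_segment(X1, W, b), embed_segment(X2, W, b)

# mu1 @ mu2 approximates the linear kernel of the two distributions,
# (1/(n1*n2)) * sum_a sum_b k(x_1a, x_2b); compare with the exact value.
exact = np.mean([np.exp(-gamma * np.sum((xa - xb) ** 2)) for xa in X1 for xb in X2])
print(f"RFF approximation: {mu1 @ mu2:.4f}, exact: {exact:.4f}")
```

With the explicit map, embedding a segment costs $O(n_i d D)$ and comparing two embeddings costs $O(D)$, replacing the $O(n_i n_j)$ pairwise kernel evaluations required by the exact linear kernel.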
Distribution-based Semi-Supervised Learning

Borrowing ideas from manifold regularization (Belkin, Niyogi, and Sindhwani 2006) and the technique of warping data-dependent kernels (Sindhwani, Niyogi, and Belkin 2005), we aim to incorporate the underlying manifold structure of both labeled and unlabeled data into the learning of a classifier by warping an RKHS. Specifically, we warp the RKHS $\bar{\mathcal{H}}$ defined in (1) into another RKHS $\tilde{\mathcal{H}}$ by leveraging the unlabeled training segments, or distributions, to reflect the underlying geometry of the $\psi(\mu_{\mathbb{P}_i})$'s. The notations for the different kernels and their corresponding RKHSs used in this paper are summarized in Table 1.

Table 1: Notations of different kernels used in the paper.

| Kernel | Space | Description |
|---|---|---|
| $k$ | $\mathcal{H}$ | kernel mean embedding of distributions |
| $\bar{k}$ | $\bar{\mathcal{H}}$ | kernel on the embedded distributions |
| $\tilde{k}$ | $\tilde{\mathcal{H}}$ | data-dependent kernel constructed based on $\bar{k}$ for semi-supervised learning |

The new RKHS $\tilde{\mathcal{H}}$ is associated with the new kernel $\tilde{k}$, which is data-dependent for semi-supervised learning. We will discuss how to construct this kernel, as well as the resulting new space, later. For now, assume the new kernel $\tilde{k}$ has been constructed; the revised optimization problem over $\tilde{\mathcal{H}}$ is then formulated as

$f^{\star} = \arg\min_{f \in \tilde{\mathcal{H}}} \sum_{i=1}^{l} \ell(\mu_{\mathbb{P}_i}, y_i, f) + \|f\|^2_{\tilde{\mathcal{H}}}$,   (5)

where $\ell(\cdot)$ is the loss function. Note that this objective looks similar to the supervised objective in (2). However, in (5) the RKHS over which the functional is optimized is $\tilde{\mathcal{H}}$, which is influenced by both labeled and unlabeled distributions, while the RKHS in (2) is $\bar{\mathcal{H}}$, which is defined by labeled distributions only. The new optimization problem raises a potential issue: $f$ is to be learned in $\tilde{\mathcal{H}}$, while the input space of $\mu_{\mathbb{P}}$ is $\bar{\mathcal{H}}$. As these RKHSs are not the same, how to calculate the loss function remains a question. To sum up, in order to solve the optimization problem (5), three crucial questions need to be answered:

1. How to construct the data-dependent kernel $\tilde{k}$ by incorporating unlabeled training data?
2. Is the new space $\tilde{\mathcal{H}}$ valid?
3. How to calculate the loss function, given that $\mu_{\mathbb{P}} \in \bar{\mathcal{H}}$ and $f \in \tilde{\mathcal{H}}$ are not in the same space?

In the following, we investigate these questions one by one.

1) Construction of the Data-dependent Kernel $\tilde{k}$. Since unlabeled data may shed light on the underlying structure and manifolds of all the data, the problem becomes how to appropriately construct a valid RKHS $\tilde{\mathcal{H}}$ from $\bar{\mathcal{H}}$ that reflects this structure. We first define $\tilde{\mathcal{H}}$ to be the space of functionals from $\bar{\mathcal{H}}$ equipped with the following modified inner product:

$\langle f, g \rangle_{\tilde{\mathcal{H}}} = \langle f, g \rangle_{\bar{\mathcal{H}}} + \langle Sf, Sg \rangle_V$,   (6)

where $V$ is a linear space and $S : \bar{\mathcal{H}} \to V$ is a bounded linear operator. The first term in (6) is the common definition of the inner product between two functionals, while the second term, with the operator $S$, reflects how the unlabeled embedded distributions alter our beliefs about the overall structure. Denoting $f(\mu) = (f(\mu_{\mathbb{P}_1}), \dots, f(\mu_{\mathbb{P}_n}))^\top$, we have $\langle Sf, Sf \rangle_V = f(\mu)^\top M f(\mu)$, with $M$ being a positive semi-definite matrix.

2) Validity of $\tilde{\mathcal{H}}$.

Theorem 1. $\tilde{\mathcal{H}}$ is a valid RKHS.

A space is valid if it is bounded and complete.

3) Loss Function Calculation. Based on Theorem 1, we have the following propositions.

Proposition 1. $\tilde{\mathcal{H}} = \bar{\mathcal{H}}$.

The two spaces are the same if each space is a subset of the other. Although the two spaces are the same, the kernels therein are not identical. They are, however, connected through the involvement of the unlabeled distributions.

Proposition 2. $\tilde{K} = (I + KM)^{-1} K$, where $K$ with $K_{ij} = \bar{k}(\mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_j})$ is the kernel matrix for $\bar{\mathcal{H}}$ on the $\mu_{\mathbb{P}_i}$'s, and $\tilde{K}$ is the kernel matrix in the altered space $\tilde{\mathcal{H}}$.
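For illustration, the following sketch (our own; the toy embeddings and the placeholder $M$ are assumptions, and the paper's Laplacian-based choice of $M$ is given in the next subsection) computes the altered kernel matrix of Proposition 2 with a single linear solve:

```python
import numpy as np

def warped_kernel(K, M):
    """Proposition 2: K_tilde = (I + K M)^{-1} K.
    K: (n, n) kernel matrix on all embedded distributions (labeled + unlabeled).
    M: (n, n) positive semi-definite matrix encoding the geometric structure."""
    n = K.shape[0]
    # Solve (I + K M) K_tilde = K rather than forming the inverse explicitly.
    return np.linalg.solve(np.eye(n) + K @ M, K)

# Sanity check: with M = 0 the inner product in (6) is unaltered, so K_tilde = K.
rng = np.random.default_rng(1)
mu = rng.normal(size=(12, 5))  # toy embeddings of 12 distributions
K = np.exp(-0.5 * np.sum((mu[:, None] - mu[None]) ** 2, axis=-1))  # RBF on embeddings
assert np.allclose(warped_kernel(K, np.zeros((12, 12))), K)
```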
Note that detailed proofs and derivations of the theorems and propositions introduced in this section can be found in the Detailed Proofs section below.

The complexity of the above kernel may appear problematic as the data scales up, since it involves matrix multiplication as well as matrix inversion. In practice, however, the problem is not severe when conducting experiments on large-scale activity recognition datasets. The reason is that the size of the kernel matrices depends on the number of distributions, i.e., the number of segments, each containing one repetition of an activity, rather than on the total number of instances, i.e., one entry per timestamp, which equals the number of samples times the number of instances per sample. Other feasible ways to further alleviate this problem include matrix factorization and low-rank approximation (Bach and Jordan 2005). Data selection or feature selection (Nie et al. 2010) can also be conducted on the training data beforehand to keep only a small fraction of key training data. The proposed method can further be developed in an online learning fashion (Hoi, Wang, and Zhao 2014), so that the matrices are kept small.

Note that the choice of $M$ is crucial for properly incorporating the unlabeled embedded distributions. In this paper, we set $M = rL^2$, where $r$ is a scalar and $L = D - W$ is the Laplacian matrix, which is widely used in semi-supervised learning (Sindhwani, Niyogi, and Belkin 2005; Belkin, Niyogi, and Sindhwani 2006) to model the geometric structure underlying the data. To be specific, $W_{ij} = \exp(-\|\mu_{\mathbb{P}_i} - \mu_{\mathbb{P}_j}\|^2)$ if $\mu_{\mathbb{P}_i}$ and $\mu_{\mathbb{P}_j}$ are connected in the graph, and $D$ is the diagonal matrix with $D_{ii} = \sum_j W_{ij}$. Based on the following Theorem 2 (whose derivation is at the end of the paper), the solution of the optimization problem in (5) can be expressed as a linear combination of the functionals $\{\tilde{k}(\mu_{\mathbb{P}_i}, \cdot)\}_{i=1}^{l}$ as

$f^{\star}(\mu_{\mathbb{P}}) = \sum_{i=1}^{l} \alpha_i \tilde{k}(\mu_{\mathbb{P}}, \mu_{\mathbb{P}_i})$.   (7)

Theorem 2 (Representer Theorem for the proposed DSSL method). Given $l$ labeled distributions $\{(\mathbb{P}_1, y_1), \dots, (\mathbb{P}_l, y_l)\} \subset \mathcal{P} \times \mathbb{R}$, a loss function $\ell : (\mathcal{P} \times \mathbb{R}^2)^l \to \mathbb{R} \cup \{+\infty\}$, and a strictly monotonically increasing real-valued function $\Omega$ on $[0, +\infty)$, the minimizer of the regularized risk functional

$\ell(\mathbb{P}_1, y_1, \mathbb{E}_{\mathbb{P}_1}[f], \dots, \mathbb{P}_l, y_l, \mathbb{E}_{\mathbb{P}_l}[f]) + \Omega(\|f\|_{\tilde{\mathcal{H}}})$   (8)

admits an expansion $f^{\star} = \sum_{i=1}^{l} \alpha_i \tilde{k}(\mu_{\mathbb{P}_i}, \cdot)$, where $\alpha_i \in \mathbb{R}$ for $i = 1, \dots, l$.
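Putting the pieces together, here is a hedged end-to-end sketch (ours; the k-nearest-neighbor graph connectivity and the use of an off-the-shelf hinge-loss SVM solver for (5) are our assumptions, not details fixed by the paper). It builds $M = rL^2$ over all embedded distributions, warps the kernel as in Proposition 2, and, as justified by Theorem 2, trains on the labeled block of $\tilde{K}$ only:

```python
import numpy as np
from sklearn.svm import SVC

def laplacian_M(mu, r=100.0, n_neighbors=5):
    """M = r * L^2 with L = D - W, where W_ij = exp(-||mu_i - mu_j||^2) for
    connected pairs; the k-NN connectivity rule here is our own assumption."""
    n = mu.shape[0]
    sq = np.sum((mu[:, None] - mu[None]) ** 2, axis=-1)
    W = np.exp(-sq)
    mask = np.zeros_like(W, dtype=bool)  # keep each point's nearest neighbors
    idx = np.argsort(sq, axis=1)[:, 1:n_neighbors + 1]  # column 0 is the point itself
    mask[np.arange(n)[:, None], idx] = True
    W = np.where(mask | mask.T, W, 0.0)  # symmetrize the graph
    L = np.diag(W.sum(axis=1)) - W
    return r * (L @ L)

def dssl_fit(mu_all, y_lab, l, r=100.0):
    """Warp the kernel using all n embeddings, then train on the l labeled ones."""
    K = np.exp(-np.sum((mu_all[:, None] - mu_all[None]) ** 2, axis=-1))  # RBF kernel
    n = mu_all.shape[0]
    K_tilde = np.linalg.solve(np.eye(n) + K @ laplacian_M(mu_all, r), K)
    clf = SVC(kernel="precomputed").fit(K_tilde[:l, :l], y_lab)
    return clf, K_tilde

# Toy run: 40 embedded distributions, of which the first 6 are labeled. By (7),
# f(mu_P) = sum_i alpha_i * k_tilde(mu_P, mu_P_i), so predicting an unlabeled
# training distribution j only needs the row K_tilde[j, :l].
rng = np.random.default_rng(2)
mu_all = rng.normal(size=(40, 16))
y_lab = np.array([0, 0, 1, 1, 2, 2])
clf, K_tilde = dssl_fit(mu_all, y_lab, l=6)
print(clf.predict(K_tilde[6:, :6]))
```

Training a standard SVM with the precomputed warped kernel is exactly the trick that makes the warped-RKHS view attractive: the unlabeled data enters only through $\tilde{K}$, so any off-the-shelf kernel machine can solve (5).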
Detailed Proofs

Proof of Theorem 1. Let us start with $\bar{\mathcal{H}}$ with the kernel $\bar{k}$. $\bar{\mathcal{H}}$ is a complete Hilbert space, and the evaluation functionals therein are bounded, i.e., $\forall \mu \in \mathcal{H}, \forall f \in \bar{\mathcal{H}}, \exists C_\mu \in \mathbb{R}$ s.t. $|f(\mu)| \le C_\mu \|f\|_{\bar{\mathcal{H}}}$. Moreover, the bounded operator $S$ is bounded by a constant $D$, i.e., $\|S\| = \sup_{f \in \bar{\mathcal{H}}} \|Sf\|_V / \|f\|_{\bar{\mathcal{H}}} \le D$. Completeness of $\bar{\mathcal{H}}$ means every Cauchy sequence in the space converges to an element of $\bar{\mathcal{H}}$. Let $(f_n)$ be a Cauchy sequence in $\bar{\mathcal{H}}$ converging to $f$; then $\forall \epsilon > 0$ there exists an integer $N(\epsilon)$ such that $\forall m, n > N(\epsilon)$,

$\|f_m - f_n\|_{\bar{\mathcal{H}}} < \epsilon / \sqrt{1 + D^2}$.

Now let us turn to $\tilde{\mathcal{H}}$. We first need to prove the completeness of the space. According to the definition in (6), we obtain that for any Cauchy sequence in $\tilde{\mathcal{H}}$,

$\|f_m - f_n\|^2_{\tilde{\mathcal{H}}} = \|f_m - f_n\|^2_{\bar{\mathcal{H}}} + \|S(f_m - f_n)\|^2_V \le \|f_m - f_n\|^2_{\bar{\mathcal{H}}} + D^2 \|f_m - f_n\|^2_{\bar{\mathcal{H}}}$,

so that

$\|f_m - f_n\|_{\tilde{\mathcal{H}}} \le \sqrt{1 + D^2}\, \|f_m - f_n\|_{\bar{\mathcal{H}}} < \sqrt{1 + D^2} \cdot \epsilon / \sqrt{1 + D^2} = \epsilon$.

Hence $\tilde{\mathcal{H}}$ is complete, since every Cauchy sequence in $\tilde{\mathcal{H}}$ converges to an element of $\tilde{\mathcal{H}}$. Moreover, $\tilde{\mathcal{H}}$ is bounded based on the property that any Cauchy sequence is bounded (Berlinet and Thomas-Agnan 2011, Lemma 5). This completes the proof.

Proof of Proposition 1. Firstly, we decompose $\tilde{\mathcal{H}}$ into two orthogonal parts as $\tilde{\mathcal{H}} = \mathrm{span}\{\tilde{k}(\mu_{\mathbb{P}_1}, \cdot), \dots, \tilde{k}(\mu_{\mathbb{P}_l}, \cdot)\} \oplus \tilde{\mathcal{H}}^{\perp}$, where $\tilde{\mathcal{H}}^{\perp}$ vanishes at all labeled embedded distributions, i.e.,

$\forall f \in \tilde{\mathcal{H}}^{\perp}, \forall i \in \{1, \dots, l\}: \quad f(\mu_{\mathbb{P}_i}) = 0$.   (9)

Accordingly, $Sf = 0$, which means $\langle f, g \rangle_{\tilde{\mathcal{H}}} = \langle f, g \rangle_{\bar{\mathcal{H}}}$ for all $f \in \tilde{\mathcal{H}}^{\perp}, g \in \tilde{\mathcal{H}}$. Moreover,

$f(\mu_{\mathbb{P}}) = \langle f, \bar{k}(\mu_{\mathbb{P}}, \cdot) \rangle_{\bar{\mathcal{H}}} = \langle f, \tilde{k}(\mu_{\mathbb{P}}, \cdot) \rangle_{\tilde{\mathcal{H}}} = \langle f, \tilde{k}(\mu_{\mathbb{P}}, \cdot) \rangle_{\bar{\mathcal{H}}} + \langle Sf, S\tilde{k}(\mu_{\mathbb{P}}, \cdot) \rangle_V = \langle f, \tilde{k}(\mu_{\mathbb{P}}, \cdot) \rangle_{\bar{\mathcal{H}}}$.

Thus, we have

$\forall f \in \tilde{\mathcal{H}}^{\perp}: \quad \langle f, \bar{k}(\mu_{\mathbb{P}}, \cdot) - \tilde{k}(\mu_{\mathbb{P}}, \cdot) \rangle_{\bar{\mathcal{H}}} = 0$.   (10)

That is, $\bar{k}(\mu_{\mathbb{P}}, \cdot) - \tilde{k}(\mu_{\mathbb{P}}, \cdot) \in (\tilde{\mathcal{H}}^{\perp})^{\perp}$. By substituting (9) into (10), we obtain $\bar{k}(\mu_{\mathbb{P}_i}, \cdot) \in (\tilde{\mathcal{H}}^{\perp})^{\perp}$ for all $i$, which means

$\mathrm{span}\{\bar{k}(\mu_{\mathbb{P}_i}, \cdot)\}_{i=1}^{l} \subseteq \mathrm{span}\{\tilde{k}(\mu_{\mathbb{P}_i}, \cdot)\}_{i=1}^{l}$.   (11)

Secondly, we decompose $\bar{\mathcal{H}}$ as $\bar{\mathcal{H}} = \mathrm{span}\{\bar{k}(\mu_{\mathbb{P}_i}, \cdot)\}_{i=1}^{l} \oplus \bar{\mathcal{H}}^{\perp}$. Similarly, we have $\langle f, \bar{k}(\mu_{\mathbb{P}_i}, \cdot) \rangle_{\bar{\mathcal{H}}} = 0$ for all $f \in \bar{\mathcal{H}}^{\perp}, i \in \{1, \dots, l\}$. As $Sf = 0$, we have $\langle f, g \rangle_{\tilde{\mathcal{H}}} = \langle f, g \rangle_{\bar{\mathcal{H}}}$, and

$f(\mu_{\mathbb{P}}) = \langle f, \bar{k}(\mu_{\mathbb{P}}, \cdot) \rangle_{\bar{\mathcal{H}}} = \langle f, \tilde{k}(\mu_{\mathbb{P}}, \cdot) \rangle_{\tilde{\mathcal{H}}} = \langle f, \tilde{k}(\mu_{\mathbb{P}}, \cdot) \rangle_{\bar{\mathcal{H}}} + \langle Sf, S\tilde{k}(\mu_{\mathbb{P}}, \cdot) \rangle_V = \langle f, \tilde{k}(\mu_{\mathbb{P}}, \cdot) \rangle_{\bar{\mathcal{H}}}$.

Therefore, we have $\langle f, \bar{k}(\mu_{\mathbb{P}}, \cdot) - \tilde{k}(\mu_{\mathbb{P}}, \cdot) \rangle_{\bar{\mathcal{H}}} = 0$. Since $f \in \bar{\mathcal{H}}^{\perp}$, this becomes $\langle f, \tilde{k}(\mu_{\mathbb{P}}, \cdot) \rangle_{\bar{\mathcal{H}}} = 0$, i.e., $\tilde{k}(\mu_{\mathbb{P}}, \cdot) \in (\bar{\mathcal{H}}^{\perp})^{\perp}$. Therefore, we have

$\mathrm{span}\{\tilde{k}(\mu_{\mathbb{P}_i}, \cdot)\}_{i=1}^{l} \subseteq \mathrm{span}\{\bar{k}(\mu_{\mathbb{P}_i}, \cdot)\}_{i=1}^{l}$.   (12)

Finally, by considering both (11) and (12), we conclude that the two spans are the same. This completes the proof.

Proof of Proposition 2. Based on Proposition 1, we have

$\tilde{k}(\mu_{\mathbb{P}}, \cdot) = \bar{k}(\mu_{\mathbb{P}}, \cdot) + \sum_{j=1}^{n} \beta_j(\mu_{\mathbb{P}}) \bar{k}(\mu_{\mathbb{P}_j}, \cdot)$,   (13)

where the coefficients $\beta_j$ depend on $\mu_{\mathbb{P}}$. If we can obtain the exact form of the $\beta_j$'s, then we can derive the relation between the two spaces in explicit form. To find the $\beta_j$'s, we use a system of linear equations generated by evaluating $\tilde{k}(\mu_{\mathbb{P}_i}, \cdot)$ at $\mu_{\mathbb{P}}$:

$\langle \tilde{k}(\mu_{\mathbb{P}_i}, \cdot), \tilde{k}(\mu_{\mathbb{P}}, \cdot) \rangle_{\tilde{\mathcal{H}}} = \langle \tilde{k}(\mu_{\mathbb{P}_i}, \cdot), \bar{k}(\mu_{\mathbb{P}}, \cdot) + \sum_{j=1}^{n} \beta_j(\mu_{\mathbb{P}}) \bar{k}(\mu_{\mathbb{P}_j}, \cdot) \rangle_{\tilde{\mathcal{H}}} = \langle \tilde{k}(\mu_{\mathbb{P}_i}, \cdot), \bar{k}(\mu_{\mathbb{P}}, \cdot) + \sum_{j=1}^{n} \beta_j(\mu_{\mathbb{P}}) \bar{k}(\mu_{\mathbb{P}_j}, \cdot) \rangle_{\bar{\mathcal{H}}} + \bar{k}_{\mu_{\mathbb{P}_i}}^\top M g$,

where $\bar{k}_{\mu_{\mathbb{P}_i}} = (\bar{k}(\mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_1}), \dots, \bar{k}(\mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_n}))^\top$ and $g$ consists of the components $g_i = \bar{k}(\mu_{\mathbb{P}}, \mu_{\mathbb{P}_i}) + \sum_{j=1}^{n} \beta_j(\mu_{\mathbb{P}}) \bar{k}(\mu_{\mathbb{P}_j}, \mu_{\mathbb{P}_i})$. Then we have the following linear equation for the coefficients $\beta(\mu_{\mathbb{P}}) = (\beta_1(\mu_{\mathbb{P}}), \dots, \beta_n(\mu_{\mathbb{P}}))^\top$:

$-M \bar{k}_{\mu_{\mathbb{P}}} = (I + M K) \beta(\mu_{\mathbb{P}})$.   (14)

Based on (13) and (14), we obtain the following explicit form of $\tilde{k}(\cdot, \cdot)$:

$\tilde{k}(\mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_j}) = \bar{k}(\mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_j}) - \bar{k}_{\mu_{\mathbb{P}_i}}^\top (I + M K)^{-1} M \bar{k}_{\mu_{\mathbb{P}_j}}$.

The above equation can be written in the following concise matrix form:

$\tilde{K} = K - K (I + M K)^{-1} M K$.   (15)

It can be shown that by applying the Sherman-Morrison-Woodbury (SMW) identity, (15) can be further rewritten as

$\tilde{K} = (I - K (I + M K)^{-1} M) K = (I + K M)^{-1} K$.   (16)

This completes the proof.

Proof of Theorem 2. Any functional $f \in \tilde{\mathcal{H}}$ can be uniquely decomposed into a component $f_{\mu}$ in the space spanned by the kernel mean embeddings, $f_{\mu} = \sum_{i=1}^{l} \alpha_i \tilde{k}(\mu_{\mathbb{P}_i}, \cdot)$, and a component $f^{\perp}$ orthogonal to it, i.e., $\langle f^{\perp}, \tilde{k}(\mu_{\mathbb{P}_j}, \cdot) \rangle = 0$ for all $j \in \{1, \dots, l\}$. Therefore, we have

$f = f_{\mu} + f^{\perp} = \sum_{i=1}^{l} \alpha_i \tilde{k}(\mu_{\mathbb{P}_i}, \cdot) + f^{\perp}$.

Thus, for all $j$, we can further deduce that

$f(\mu_{\mathbb{P}_j}) = \left\langle \sum_{i=1}^{l} \alpha_i \tilde{k}(\mu_{\mathbb{P}_i}, \cdot) + f^{\perp}, \tilde{k}(\mu_{\mathbb{P}_j}, \cdot) \right\rangle = \left\langle \sum_{i=1}^{l} \alpha_i \tilde{k}(\mu_{\mathbb{P}_i}, \cdot), \tilde{k}(\mu_{\mathbb{P}_j}, \cdot) \right\rangle$.

This indicates that the loss term in (8) does not depend on $f^{\perp}$. Besides, the second term $\Omega(\cdot)$ in (8) is strictly monotonically increasing, so we have

$\Omega(\|f\|_{\tilde{\mathcal{H}}}) = \Omega\left( \left\| \sum_{i=1}^{l} \alpha_i \tilde{k}(\mu_{\mathbb{P}_i}, \cdot) + f^{\perp} \right\|_{\tilde{\mathcal{H}}} \right) = \Omega\left( \sqrt{ \left\| \sum_{i=1}^{l} \alpha_i \tilde{k}(\mu_{\mathbb{P}_i}, \cdot) \right\|^2_{\tilde{\mathcal{H}}} + \|f^{\perp}\|^2_{\tilde{\mathcal{H}}} } \right) \ge \Omega\left( \left\| \sum_{i=1}^{l} \alpha_i \tilde{k}(\mu_{\mathbb{P}_i}, \cdot) \right\|_{\tilde{\mathcal{H}}} \right)$,

where the equality holds if and only if $f^{\perp} = 0$. Therefore, the first term in (8) is independent of $f^{\perp}$, and the second term reaches its minimum when $f^{\perp} = 0$. Consequently, any minimizer must take the form $f^{\star} = f_{\mu} = \sum_{i=1}^{l} \alpha_i \tilde{k}(\mu_{\mathbb{P}_i}, \cdot)$. This completes the proof.

Experiments

We conduct experiments on three sensor-based activity datasets, whose statistics are listed in Table 2. Skoda records 10 gestures in a car maintenance scenario, with 20 acceleration sensors placed on the arms of the subject (Stiefmeier, Roggen, and Tröster 2007). Each gesture is repeated around 70 times. The transitions between two gestures are labeled as the Null class, which is also treated as an activity. WISDM uses accelerometer sensors embedded in phones to collect six regular activities: jogging, walking, ascending stairs, descending stairs, sitting, and standing (Kwapisz, Weiss, and Moore 2010).
HCI consists of gestures in which the hand describes different shapes: a circle, a square, a pointing-up triangle, an upside-down triangle, and an infinity symbol (Förster, Roggen, and Tröster 2009). Each gesture is recorded over 50 repetitions, at about 5 to 8 seconds per repetition. A Null class exists in the HCI dataset as well.

Experimental Setup

Following the criteria in (Qian, Pan, and Miao 2018), we adopt both the micro-F1 score (mi-F) and the weighted macro-F1 score (ma-F) to evaluate the performance of the different methods. All the reported results are average values together with standard deviations over 6 random splits into training and testing. Each dataset is randomly split into three subsets: a labeled training set, an unlabeled training set, and a test set, each containing activities of all classes. We set the ratio to 0.02:0.1:0.88 and fix r = 100; the impact of varying r is discussed later. Different from the experimental setups in existing papers, which set the labeled data ratio to be quite large (Matsushige, Kakusho, and Okadome 2015; Stikic, Larlus, and Schiele 2009), we deliberately set the labeled data ratio to be extremely small. Hence, our method requires fewer labels and is thus more practical in terms of real-world applicability. Evaluations are conducted on the test set. We adopt RBF kernels for all the kernels used in the experiments.

Table 2: Statistics of datasets used in experiments.

| Dataset | # Samples | # Instances per sample | # Features | # Classes |
|---|---|---|---|---|
| Skoda | 1,447 | 68 | 60 | 10 |
| HCI | 264 | 602 | 48 | 5 |
| WISDM | 389 | 705 | 6 | 6 |

Baselines

We compare the proposed DSSL method with the following state-of-the-art methods.

State-of-the-art supervised methods with various features:
- SVMs (Chang and Lin 2011): as the SVM is a vectorial-based classifier, we use mean, variance, etc., to generate a feature vector for each segment.
- SAX-a (Lin et al. 2007) treats data as strings, from which structural features are extracted. We follow the settings in (Lin et al. 2007) with no dimension reduction. The alphabet-size parameter ranges over a ∈ {3, 6, 9}.
- ECDF-d (Hammerla et al. 2013; Plötz, Hammerla, and Olivier 2011) extracts d descriptors from each dimension of each sensor, with d ∈ {5, 15, 30, 45}. Note that the overall shape and spatial features, in addition to the mean and variance features, are concatenated before applying the SVM classifier.

The state-of-the-art supervised method based on distributions, SMMAR (Qian, Pan, and Miao 2018).

Classic vectorial-based semi-supervised methods:
- LapSVM (Belkin, Niyogi, and Sindhwani 2006) is an extension of the SVM with manifold regularization.
- TSVM (Chapelle and Zien 2005) is a Transductive SVM trained by gradient descent. As this is a transductive approach rather than a truly semi-supervised learning approach, we make the test data available in its training phase.

State-of-the-art semi-supervised methods specifically designed for activity recognition:
- SSKLR (Matsushige, Kakusho, and Okadome 2015) is a semi-supervised kernel logistic regression method using the Expectation-Maximization algorithm.
- GLSVM (Stikic, Larlus, and Schiele 2009) is a multi-graph method where each graph captures different aspects of the activities.

Experimental Results

Overall Experimental Results. The experimental results are presented in Table 3. The proposed DSSL consistently performs the best on all datasets: it outperforms all the other methods by 5.6%, 17.7%, and 14.4% on the three datasets respectively in terms of mi-F. This strongly indicates the effectiveness of the proposed DSSL.
Note that in Table 3, the performance of the comparison methods on WISDM is much worse than on the other two datasets. This may be due to the data complexity caused by the large number of subjects in WISDM. On the Skoda and HCI datasets, the performance ranking is DSSL > SMMAR > SVMs ≈ ECDF > SAX, which reveals that 1) distribution-based methods are more capable of distinguishing different activities; 2) feature extraction plays an important role, and the string-based data representation in SAX is less suitable for activity data than ECDF; and 3) as the number of descriptors d increases, the performance of ECDF increases on HCI while decreasing on Skoda and WISDM, suggesting that ECDF may be task-dependent. However, note that SMMAR performs the worst on WISDM, which illustrates that distribution-based methods depend more on the amount of labeled data than vectorial-based methods do. This indeed reflects the motivation of our proposed method. DSSL does not suffer from this limitation, thanks to its semi-supervised nature. Among the semi-supervised methods, the ranking is DSSL > LapSVM ≈ GLSVM ≈ TSVM > SSKLR, which demonstrates the advantage of graph-based methods over the logistic regression method for activity data.

Impact of Ratio of Labeled Data. To analyze the impact of the proportion of labeled training data, we conduct experiments on the WISDM dataset. We fix the ratios of test data and unlabeled training data to 20% and 20% respectively, and vary the ratio of labeled training data over {0.02, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9} of the remaining 60% of the data. The results are depicted in Figure 1(a). DSSL performs the best under all ratios. When more labeled training data becomes available, all methods perform better. Moreover, the distribution-based method (SMMAR) enjoys a larger performance gain than the vectorial-based methods, which further verifies the superiority of learning from distributions.

Impact of Ratio of Unlabeled Data. We investigate the influence of unlabeled data by fixing the ratios of labeled training data and test data to 1% and 20% respectively, and varying the ratio of unlabeled training data over {0.1, 0.3, 0.5, 0.7, 0.9} of the remaining 79% of the data. Note that the supervised methods (SMMAR, SVMs) and the transductive method (TSVM) perform the same under this setting, while the performance of the semi-supervised methods keeps increasing with more unlabeled training data, as shown in Figure 1(b).

Impact of Parameter r. In the previous experiments, we fixed r = 100. Here we conduct a sensitivity test on r.

Table 3: Experimental results on 3 activity datasets (unit: %).
| Methods | Skoda mi-F | Skoda ma-F | HCI mi-F | HCI ma-F | WISDM mi-F | WISDM ma-F |
|---|---|---|---|---|---|---|
| Vectorial-based supervised | | | | | | |
| SVMs | 85.7±1.8 | 42.5±0.9 | 69.7±9.6 | 69.6±9.4 | 41.5±5.2 | 39.6±6.8 |
| SAX-3 | 39.6±6.3 | 18.7±2.9 | 36.0±3.0 | 34.7±2.5 | 34.6±1.4 | 30.6±1.2 |
| SAX-6 | 37.2±6.1 | 18.6±2.8 | 39.7±7.3 | 38.4±7.9 | 34.9±3.0 | 30.5±5.0 |
| SAX-9 | 40.3±6.5 | 19.9±3.2 | 39.8±8.7 | 37.0±9.2 | 33.6±2.9 | 28.8±5.8 |
| ECDF-5 | 84.2±2.1 | 41.6±1.0 | 67.7±10.1 | 67.6±9.1 | 42.1±6.3 | 40.5±7.7 |
| ECDF-15 | 79.8±1.5 | 39.2±0.7 | 68.4±10.4 | 68.5±9.6 | 39.4±3.3 | 36.2±5.7 |
| ECDF-30 | 72.6±1.2 | 35.4±0.3 | 68.6±11.1 | 68.7±10.5 | 37.7±2.5 | 32.6±4.9 |
| ECDF-45 | 65.7±2.5 | 31.5±1.3 | 68.6±11.4 | 68.6±10.8 | 36.4±1.4 | 31.3±3.6 |
| Vectorial-based semi-supervised | | | | | | |
| LapSVM | 89.7±2.1 | 44.6±1.2 | 76.1±4.8 | 76.3±4.7 | 40.1±3.8 | 34.5±3.5 |
| TSVM | 85.9±2.7 | 84.8±2.8 | 75.4±11.5 | 75.5±11.2 | 41.3±5.6 | 39.4±6.9 |
| SSKLR | 25.4±19.3 | 12.1±2.5 | 24.2±17.2 | 18.1±10.1 | 24.6±17.0 | 17.3±9.9 |
| GLSVM | 89.7±2.1 | 44.5±1.2 | 75.7±5.8 | 75.7±5.7 | 40.4±3.8 | 33.9±4.0 |
| Distribution-based supervised | | | | | | |
| SMMAR | 93.2±0.9 | 93.1±1.0 | 82.2±13.4 | 78.9±18.4 | 20.5±3.3 | 11.7±3.9 |
| Distribution-based semi-supervised | | | | | | |
| DSSL | 98.8±0.5 | 98.8±0.5 | 99.9±0.2 | 99.9±0.2 | 56.5±5.1 | 55.6±5.0 |

[Figure 1: Performance of DSSL on WISDM under different settings (in mi-F, %, log scale). (a) Varying ratios of labeled data. (b) Varying ratios of unlabeled data. (c) Impact of r on the performance.]

As indicated in Fig. 1(c), the performance of DSSL on the test data remains stable for r ∈ [10^{-6}, 1]. When r becomes larger, the performance of DSSL begins to decrease. This observation indicates that r balances the tradeoff between labeled and unlabeled data: a larger r implies a stronger emphasis on the unlabeled data. More importantly, under all the different r values, DSSL consistently outperforms all the other methods. Fig. 1(c) also shows the best baseline, i.e., ECDF-5 in WISDM's case.

Impact of Random Fourier Feature (RFF) Dimension D. We analyze how R-DSSL accelerates DSSL with D-dimensional explicit statistical features. The experiments are conducted on a Linux server with an Intel(R) Xeon(R) E5-2695 2.40GHz CPU.

[Figure 2: Impact of D on the performance (mi-F, %, log scale) and run time (s) on WISDM, comparing R-DSSL, DSSL, and the best baseline.]

As shown in Fig. 2, R-DSSL steadily outperforms the best baseline when D ≥ 2. Note that R-DSSL performs slightly worse than DSSL due to its approximate nature; however, it requires less computational run time than DSSL when D < 8.

Conclusion

In this paper, we propose a semi-supervised learning framework, DSSL, for sensor-based activity recognition problems. The proposed DSSL naturally embeds automatic feature extraction and classification in a semi-supervised learning manner. Extensive experiments are conducted on three activity datasets to demonstrate the superiority of DSSL over a number of state-of-the-art methods.

Acknowledgments

This research is supported, in part, by the National Research Foundation, Prime Minister's Office, Singapore under its IDM Futures Funding Initiative, the Singapore Ministry of Health under its National Innovation Challenge on Active and Confident Ageing (NIC Project No. MOH/NIC/COG04/2017 and MOH/NIC/HAIG03/2017), and the Interdisciplinary Graduate School, Nanyang Technological University under its Graduate Research Scholarship. Sinno J.
Pan thanks the support from the NTU Singapore Nanyang Assistant Professorship (NAP) grant M4081532.020.

References

Avci, A.; Bosch, S.; Marin-Perianu, M.; Marin-Perianu, R.; and Havinga, P. J. M. 2010. Activity recognition using inertial sensing for healthcare, wellbeing and sports applications: A survey. In ARCS Workshops, 167–176.

Bach, F. R., and Jordan, M. I. 2005. Predictive low-rank decomposition for kernel methods. In ICML, 33–40.

Belkin, M.; Niyogi, P.; and Sindhwani, V. 2006. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7:2399–2434.

Berlinet, A., and Thomas-Agnan, C. 2011. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media.

Bulling, A.; Blanke, U.; and Schiele, B. 2014. A tutorial on human activity recognition using body-worn inertial sensors. ACM Comput. Surv. 46(3):33:1–33:33.

Chang, C., and Lin, C. 2011. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3):27:1–27:27.

Chapelle, O., and Zien, A. 2005. Semi-supervised classification by low density separation. In AISTATS.

Chapelle, O.; Schölkopf, B.; and Zien, A. 2010. Semi-Supervised Learning. The MIT Press.

Förster, K.; Roggen, D.; and Tröster, G. 2009. Unsupervised classifier self-calibration through repeated context occurences: Is there robustness against sensor displacement to gain? In ISWC, 77–84.

Frank, J.; Mannor, S.; and Precup, D. 2010. Activity and gait recognition with time-delay embeddings. In AAAI.

Guan, D.; Yuan, W.; Lee, Y.; Gavrilov, A.; and Lee, S. 2007. Activity recognition based on semi-supervised learning. In RTCSA, 469–475.

Hammerla, N. Y.; Kirkham, R.; Andras, P.; and Ploetz, T. 2013. On preserving statistical characteristics of accelerometry data using their empirical cumulative distribution. In ISWC, 65–68.

Hoi, S. C. H.; Wang, J.; and Zhao, P. 2014. LIBOL: A library for online learning algorithms. J. Mach. Learn. Res. 15(1):495–499.

Janidarmian, M.; Fekr, A. R.; Radecka, K.; and Zilic, Z. 2017. A comprehensive analysis on wearable acceleration sensors in human activity recognition. Sensors 17(3):529.

Kwapisz, J. R.; Weiss, G. M.; and Moore, S. 2010. Activity recognition using cell phone accelerometers. SIGKDD Explorations 12(2):74–82.

Lara, O. D., and Labrador, M. A. 2013. A survey on human activity recognition using wearable sensors. IEEE Communications Surveys and Tutorials 15(3):1192–1209.

Lin, J.; Keogh, E. J.; Wei, L.; and Lonardi, S. 2007. Experiencing SAX: A novel symbolic representation of time series. Data Min. Knowl. Discov. 15(2):107–144.

Matsushige, R.; Kakusho, K.; and Okadome, T. 2015. Semi-supervised learning based activity recognition from sensor data. In GCCE, 106–107.

Muandet, K.; Fukumizu, K.; Dinuzzo, F.; and Schölkopf, B. 2012. Learning from distributions via support measure machines. In NIPS, 10–18.

Muandet, K.; Fukumizu, K.; Sriperumbudur, B. K.; and Schölkopf, B. 2017. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning 10(1-2):1–141.

Nazábal, A.; García-Moreno, P.; Artés-Rodríguez, A.; and Ghahramani, Z. 2016. Human activity recognition by combining a small number of classifiers. IEEE J. Biomed. Health Inform. 20(5):1342–1351.

Nie, F.; Huang, H.; Cai, X.; and Ding, C. H. Q. 2010. Efficient and robust feature selection via joint l2,1-norms minimization. In NIPS, 1813–1821.

Plötz, T.; Hammerla, N. Y.; and Olivier, P. 2011.
Feature learning for activity recognition in ubiquitous computing. In IJCAI, 1729–1734.

Qian, H.; Pan, S. J.; and Miao, C. 2018. Sensor-based activity recognition via learning from distributions. In AAAI.

Rahimi, A., and Recht, B. 2007. Random features for large-scale kernel machines. In NIPS, 1177–1184.

Sindhwani, V.; Niyogi, P.; and Belkin, M. 2005. Beyond the point cloud: From transductive to semi-supervised learning. In ICML, 824–831.

Stiefmeier, T.; Roggen, D.; and Tröster, G. 2007. Fusion of string-matched templates for continuous activity recognition. In ISWC, 41–44.

Stikic, M.; Larlus, D.; Ebert, S.; and Schiele, B. 2011. Weakly supervised recognition of daily life activities with wearable sensors. IEEE Trans. Pattern Anal. Mach. Intell. 33(12):2521–2537.

Stikic, M.; Larlus, D.; and Schiele, B. 2009. Multi-graph based semi-supervised learning for activity recognition. In ISWC, 85–92.

Yao, L.; Nie, F.; Sheng, Q. Z.; Gu, T.; Li, X.; and Wang, S. 2016. Learning from less for better: Semi-supervised activity recognition via shared structure discovery. In UbiComp, 13–24.

Zhu, X. 2005. Semi-supervised learning literature survey. Technical report, Computer Sciences, University of Wisconsin-Madison.