# Unsupervised Active Learning via Subspace Learning

Changsheng Li1*, Kaihang Mao1, Lingyan Liang2, Dongchun Ren3, Wei Zhang4, Ye Yuan1, Guoren Wang1
1School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
2Inspur Group Company Limited, China
3Meituan, Beijing, China
4School of Information and Communication Engineering, University of Electronic Science and Technology of China
{lcs,khmao,yuan-ye}@bit.edu.cn, lianglingyan@inspur.com, rendongchun@meituan.com, zhanggwei1103@163.com, wanggrbit@126.com

## Abstract

Unsupervised active learning has been an active research topic in the machine learning community, with the purpose of choosing representative samples to be labelled in an unsupervised manner. Previous works usually take the minimization of the data reconstruction loss as the criterion for selecting representative samples, by which the original inputs can be better approximated. However, in many scenarios data are drawn from low-dimensional subspaces embedded in an arbitrary high-dimensional space, so attempting to precisely reconstruct all entries of an observation may bring in considerable noise and lead to a suboptimal solution. In view of this, this paper proposes a novel unsupervised Active Learning model via Subspace Learning, called ALSL. In contrast to previous approaches, ALSL aims to discover low-rank structures of data and then perform sample selection based on the learnt low-rank representations. To this end, we devise two different strategies and propose two corresponding formulations to select samples with and under low-rank sample representations, respectively. Since the proposed formulations involve several non-smooth regularization terms, we develop a simple but effective optimization procedure to solve them. Extensive experiments are performed on five publicly available datasets, and the results demonstrate that the first formulation achieves performance comparable to the state of the art, while the second formulation significantly outperforms it, achieving up to a 13% improvement over the second-best baseline.

*Changsheng Li is the corresponding author.
Copyright 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Introduction

Recently, unsupervised active learning has attracted a lot of attention in the machine learning community. Different from supervised active learning approaches, which pretrain a classifier on labeled data (Roy and McCallum 2001; Gal, Islam, and Ghahramani 2017; Yoo and Kweon 2019), the goal of unsupervised active learning is to select representative samples to be labelled in an unsupervised manner, so as to reduce the cost of annotation while still guaranteeing the performance of downstream models trained on the selected samples.

One key point in unsupervised active learning is how to evaluate the representativeness of a sample. To this end, some researchers propose to measure it by minimizing the data reconstruction loss and taking a sample's contribution to reconstructing other samples as its representativeness. Based on this criterion, many unsupervised active learning methods have been proposed in the past decade (Yu, Bi, and Tresp 2006; Zhang et al. 2011; Cai and He 2011; Nie et al. 2013; Hu et al. 2013; Shi and Shen 2016; Li et al. 2019, 2020a,b). Roughly speaking, unsupervised active learning methods can be divided into two categories: linear and nonlinear ones (Li et al. 2020a).
Linear methods usually assume that each sample can be well reconstructed by a linear combination of a selected sample subset in the original space. Typical works include transductive experimental design (TED) (Yu, Bi, and Tresp 2006), robust structured representation (RRSS) (Nie et al. 2013), active learning via neighborhood reconstruction (ALNR) (Hu et al. 2013), and joint active learning and feature selection (ALFS) (Li et al. 2019). Given that the intrinsic structures of data are often complex (e.g., nonlinear) in practice, a few nonlinear methods have been proposed to handle such cases, including the manifold adaptive experimental design (MAED) algorithm (Cai and He 2011) and deep unsupervised active learning (DUAL) (Li et al. 2020a). They usually map the original inputs nonlinearly into a latent space, where a linear model is utilized to select samples. In this paper, we focus on linear unsupervised active learning because of its simplicity and effectiveness.

Most of the above linear unsupervised algorithms select representative samples by minimizing the reconstruction loss in the original space. In practice, however, data are often approximately drawn from low-dimensional subspaces embedded in an arbitrary high-dimensional space. In such a case, it is not necessary to reconstruct every entry of an observation well. On the contrary, if the features of the observations are noisy, reconstructing them can even degrade model performance, resulting in a suboptimal solution. Thus, sample selection will benefit if we can learn and leverage the subspace structures of the data during training.

We take Figure 1 as an example to illustrate the main idea behind our method. Figure 1 (a) is a 2-D visualization of the data matrix from the Extended Yale Face B dataset (Georghiades, Belhumeur, and Kriegman 2001). We use the first 5 subjects, taking the first 50 frontal face images of each subject.

[Figure 1: Illustration of our basic idea. (a) Each row denotes one sample with an original 1024-dimensional vector. (b) Feature representations learnt by a subspace learning method, LSR (Lu et al. 2012), from the original inputs. (c) The samples selected by a typical active learning method, TED (Yu, Bi, and Tresp 2006), in the original space. (d) An ideal sample selection result obtained by our method using the learnt feature representations. Red lines in (c) and (d) denote selected samples.]

In Figure 1 (a), each row denotes one sample with a 1024-dimensional feature vector. The samples in the first 50 rows belong to one subject, i.e., to the same category, those in the next 50 rows belong to the second subject, and so on. From Figure 1 (a), we can observe that the first three subjects show similar input patterns, while they are actually drawn from different subspaces, as shown in Figure 1 (b). Directly selecting samples in the original space therefore gives a poor result, as shown in Figure 1 (c).
In Figure 1 (c), we show the five top-ranked samples selected by a typical active learning method, TED (Yu, Bi, and Tresp 2006): three of them come from the first subject, and no sample is selected from the third or fourth subject. A classifier trained on these five samples will perform poorly, because some subjects are not represented at all. An ideal sample selection is shown in Figure 1 (d), obtained by using the learnt representations in Figure 1 (b) rather than the original space in Figure 1 (a). In Figure 1 (d), the five top-ranked samples cover all subjects, which benefits downstream classification tasks.

In light of this, this paper proposes a novel unsupervised Active Learning model based on Subspace Learning, called ALSL. ALSL learns subspace structures of the data and performs sample selection in a unified framework. To reach this goal, we devise two different strategies and propose two corresponding formulations to select samples with and under low-rank sample representations, respectively. In the first formulation, we learn a reconstruction coefficient matrix that minimizes the reconstruction loss of the original inputs, while simultaneously requiring the coefficient matrix to be low-rank and row-sparse, so that samples are selected with low-rank representations. Different from the first formulation, which still minimizes the reconstruction loss of the original inputs, the second formulation learns low-rank structures of the data and then selects representative samples that best approximate the learnt low-rank representations, so as to further suppress noise. Finally, we develop a simple but effective procedure to solve the two optimization problems. Extensive experiments are performed on multiple tasks that usually require high annotation costs, and experimental results on five publicly available datasets demonstrate the efficacy of the proposed models.

## Related Work

In this section, we review linear unsupervised active learning algorithms, which are the most related to our method. Linear unsupervised active learning utilizes a linear model to choose representative samples to be labelled in an unsupervised manner. An early approach is transductive experimental design (TED) (Yu, Bi, and Tresp 2006), whose core idea is to select a sample subset that best approximates the whole dataset by optimizing a least squares loss plus a ridge regularization term. Later, (Yu et al. 2008) extended TED to a convex formulation by replacing the cardinality constraint with a sparsity regularization, and (Shi and Shen 2016) further extended TED to select samples that are simultaneously representative and diverse. RRSS (Nie et al. 2013) utilizes a structured sparsity regularization to select samples and a robust sample representation strategy to mitigate the issue of outliers. However, its complexity, of order $O(n^4)$, becomes prohibitive when the number of samples is large; to address this, (FY et al. 2015) proposed an accelerated version of RRSS. ALNR (Hu et al. 2013) incorporates the neighborhood relations of samples into the reconstruction process, so that the nearest neighbors of a sample contribute more to its reconstruction. More recently, a unified framework for simultaneous active learning and feature selection, called ALFS, was proposed in (Li et al. 2019), demonstrating that the two tasks benefit each other.
Most of the above linear methods concentrate on sample selection in the original feature space, ignoring the fact that data are often approximately drawn from low-dimensional subspaces embedded in a high-dimensional space.

## Proposed Method

Let $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$ be a collection of data points drawn from a union of multiple low-dimensional subspaces, where $d$ is the dimension of each sample and $n$ is the number of samples. Our goal is to perform sample selection and subspace learning simultaneously, making the selected sample subset $S \in \mathbb{R}^{d \times k}$ more representative, where $k$ denotes the number of selected samples.

Before introducing our method, we first summarize the notation and the definitions of the norms used in this paper. Vectors are written as boldface lowercase letters and matrices as boldface uppercase letters. For an arbitrary matrix $A \in \mathbb{R}^{n \times m}$, we use $a^i$, $a_j$, and $a_{ij}$ to denote its $i$-th row, $j$-th column, and $(i,j)$-th entry, respectively. The Frobenius norm of $A$ is defined as $\|A\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m} a_{ij}^2}$, and the $\ell_{2,1}$-norm of $A$ is defined as $\|A\|_{2,1} = \sum_{i=1}^{n}\sqrt{\sum_{j=1}^{m} a_{ij}^2} = \sum_{i=1}^{n}\|a^i\|_2$. $\|A\|_*$ denotes the nuclear norm of $A$, defined as $\|A\|_* = \sum_i \sigma_i$, where $\sigma_i$ is the $i$-th singular value of $A$. The sup-norm of $A$ is defined as $\|A\|_\infty = \sum_{i=1}^{n}\|a^i\|_\infty = \sum_{i=1}^{n} \max_{1 \le j \le m}|a_{ij}|$. The Euclidean inner product between two matrices is $\langle A, B \rangle = \mathrm{tr}(A^T B)$, where $A^T$ is the transpose of $A$ and $\mathrm{tr}(\cdot)$ is the trace of a matrix.

### Formulation I: Active Learning with Low-Rank Representation

Following previous approaches, we still adopt the strategy of minimizing the overall reconstruction loss in the original space to select the most representative samples. To this end, we consider the following objective function:

$$\min_{Q \in \mathbb{R}^{n \times n}} \|X - XQ\|_F^2 + \eta\|Q\|_\ell, \tag{1}$$

where $\eta \geq 0$ is a tradeoff parameter. In (1), the first term minimizes the reconstruction loss, where $Q$ is the reconstruction coefficient matrix; the second term $\|Q\|_\ell$ is a certain norm of $Q$ used as a regularizer. Since we attempt to select the $k$ most representative samples, the reconstruction coefficients on the $k$ selected samples should have larger weights, and those on the other $n-k$ samples should have weights as small as possible. In the extreme case where all entries of a row of $Q$ are zero, the corresponding sample will not be selected as a representative one, since it makes no contribution to reconstructing other samples. Thus, $Q$ should be row-sparse, because each row of $Q$ reflects the impact of the corresponding sample on reconstructing the other samples. To make $Q$ row-sparse, $\|Q\|_\ell$ can be instantiated as $\|Q\|_{2,1}$ or the sup-norm $\|Q\|_\infty$; we use $\|Q\|_{2,1}$ in this paper, although $\|Q\|_\infty$ could in principle also be applied.

The matrix $Q$ actually plays two roles in (1). First, it is a reconstruction coefficient matrix whose columns contain the weights of the linear combination of all samples used to reconstruct the corresponding sample. Second, it is a self-representation matrix: each column $q_i \in \mathbb{R}^n$ of $Q$ can be taken as a new feature representation of $x_i$, with $X$ regarded as a dictionary.

As aforementioned, the data are often drawn from low-dimensional subspaces embedded in a high-dimensional space. We therefore force $Q$ to be low-rank and propose to minimize the following objective function:

$$\min_{Q \in \mathbb{R}^{n \times n}} \|X - XQ\|_F^2 + \eta\|Q\|_{2,1} + \lambda\,\mathrm{rank}(Q), \tag{2}$$

where $\lambda \geq 0$ is a tradeoff parameter and $\mathrm{rank}(\cdot)$ denotes the rank of a matrix. Minimizing the third term makes the learnt $Q$ low-rank, so as to recover the low-rank structure of the given observation matrix $X$. However, this problem is in general NP-hard; a common practice is to replace the rank of $Q$ by its nuclear norm $\|Q\|_*$ (Recht, Fazel, and Parrilo 2010), which leads to the convex formulation

$$\min_{Q \in \mathbb{R}^{n \times n}} \|X - XQ\|_F^2 + \eta\|Q\|_{2,1} + \lambda\|Q\|_*, \tag{3}$$

where $\|Q\|_*$ is a convex envelope of the rank function.
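To make the notation and the objective in (3) concrete, the following NumPy sketch (our own illustration, not code from the paper; the function names are hypothetical) evaluates the norms defined above and the value of the Formulation I objective for a given coefficient matrix.

```python
import numpy as np

def frobenius_norm(A):
    # ||A||_F: square root of the sum of squared entries
    return np.sqrt((A ** 2).sum())

def l21_norm(A):
    # ||A||_{2,1}: sum of the l2 norms of the rows
    return np.linalg.norm(A, axis=1).sum()

def nuclear_norm(A):
    # ||A||_*: sum of the singular values
    return np.linalg.svd(A, compute_uv=False).sum()

def sup_norm(A):
    # ||A||_inf as defined above: sum over rows of the largest absolute entry
    return np.abs(A).max(axis=1).sum()

def formulation1_objective(X, Q, eta, lam):
    # Objective of (3): ||X - XQ||_F^2 + eta * ||Q||_{2,1} + lambda * ||Q||_*
    return frobenius_norm(X - X @ Q) ** 2 + eta * l21_norm(Q) + lam * nuclear_norm(Q)
```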
### Formulation II: Active Learning under Low-Rank Representation

Although the formulation in (3) performs unsupervised active learning with low-rank sample representations, the reconstruction loss is still built directly upon the original input matrix $X$ when selecting the $k$ representative samples. In many cases, because of the low-dimensional subspace structure, the observations contain many noisy features; reconstructing every entry of an observation well may bring this noise into training and degrade model performance. As aforementioned, the original inputs can be taken as a dictionary, and the reconstruction coefficients can be regarded as new feature representations of the samples. By adding a subspace-learning-based regularizer on the reconstruction coefficients, e.g., a low-rank constraint, we can obtain an intrinsic low-rank representation of the data. Motivated by this, we propose to jointly learn a low-rank representation and select the $k$ representative samples under the new low-rank representation instead of the original inputs. To this end, we propose a second formulation:

$$\min_{Q, Z \in \mathbb{R}^{n \times n}} \|X - XQ\|_F^2 + \lambda\|Q\|_* + \mu\|Q - QZ\|_F^2 + \eta\|Z\|_{2,1}, \tag{4}$$

where $\mu \geq 0$ is a tradeoff parameter. In (4), the first two terms learn a low-rank representation $Q$, while the last two terms select the $k$ most representative samples that best approximate the learnt low-rank representation $Q$. By jointly optimizing these terms, the proposed formulation performs unsupervised active learning under a low-rank representation, making the selected samples more representative. Compared with (3), the main differences are the additional reconstruction loss term in (4) and the fact that the $\ell_{2,1}$-norm is now imposed on $Z$ rather than on $Q$, enabling (4) to perform sample selection under the new low-rank representation. These changes further improve the performance of the model.

Finally, we use $Q$ to select the $k$ most representative samples for the first formulation, and $Z$ for the second formulation. Specifically, we sort all data points by the $\ell_2$-norm of the rows of $Q$ or $Z$ in descending order and select the top $k$ samples as the most representative ones.
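The selection step itself is straightforward. The following sketch (a hypothetical helper written for illustration, not taken from the paper) ranks samples by the row-wise $\ell_2$ norms of the learnt coefficient matrix and returns the indices of the top-$k$ samples; the columns of $X$ at those indices form the selected subset $S$.

```python
import numpy as np

def select_samples(coef_matrix, k):
    """Rank samples by the l2 norm of the rows of the learnt coefficient
    matrix (Q for Formulation I, Z for Formulation II) and return the
    indices of the k top-ranked, i.e., most representative, samples."""
    row_scores = np.linalg.norm(coef_matrix, axis=1)  # one score per sample
    return np.argsort(-row_scores)[:k]                # descending order
```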
### Optimization Procedure

The objective function (3) is convex, while (4) is non-convex; thus (3) and (4) admit globally and locally optimal solutions, respectively. To solve them, we employ the alternating direction method of multipliers (ADMM) (Boyd et al. 2011) to separate the joint problem into easier sub-problems, which converge to a minimum (Boyd et al. 2011; Hajinezhad et al. 2016). Because of space limitations, we mainly introduce how to optimize (4) using ADMM; the procedure for solving (3) is similar.

To solve (4), we first introduce two auxiliary variables $W$ and $E$ to convert (4) into the following equivalent problem:

$$\min_{Q,Z,W,E} \|X - XQ\|_F^2 + \lambda\|W\|_* + \mu\|Q - QZ\|_F^2 + \eta\|E\|_{2,1} \quad \text{s.t. } Q = W,\ Z = E. \tag{5}$$

The augmented Lagrangian function of problem (5) is

$$\mathcal{L}_{\rho_1,\rho_2}(Q, Z, W, E, \Lambda_1, \Lambda_2) := \|X - XQ\|_F^2 + \lambda\|W\|_* + \mu\|Q - QZ\|_F^2 + \eta\|E\|_{2,1} + \langle\Lambda_1, Q - W\rangle + \frac{\rho_1}{2}\|Q - W\|_F^2 + \langle\Lambda_2, Z - E\rangle + \frac{\rho_2}{2}\|Z - E\|_F^2, \tag{6}$$

where $\Lambda_1$ and $\Lambda_2$ are Lagrange multipliers, and $\rho_1$ and $\rho_2$ are penalty parameters on the constraint violations.

Solver for $W$: Removing the terms irrelevant to $W$ from (6), the sub-problem becomes

$$\min_W\ \lambda\|W\|_* + \frac{\rho_1}{2}\Big\|Q - W + \frac{\Lambda_1}{\rho_1}\Big\|_F^2. \tag{7}$$

Problem (7) can be solved by singular value thresholding (Cai, Candes, and Shen 2010) and has the closed-form solution

$$W = U_M\, \mathcal{S}_{\lambda/\rho_1}(\Sigma_M)\, V_M^T, \tag{8}$$

where $\mathcal{S}_\tau(x) = \mathrm{sgn}(x)\max(|x| - \tau, 0)$ is the soft-thresholding operator applied to the singular values, and $U_M \Sigma_M V_M^T$ is the singular value decomposition (SVD) of $M = Q + \frac{\Lambda_1}{\rho_1}$.

Solver for $E$: $E$ is the minimizer of

$$\min_E\ \eta\|E\|_{2,1} + \frac{\rho_2}{2}\Big\|Z - E + \frac{\Lambda_2}{\rho_2}\Big\|_F^2. \tag{9}$$

This problem can be solved row by row using the following lemma (Yang et al. 2009):

Lemma 1. For any $\kappa, \nu > 0$ and $g \in \mathbb{R}^q$, the minimizer of $\min_{t \in \mathbb{R}^q} \kappa\|t\|_2 + \frac{\nu}{2}\|t - g\|_2^2$ is given by $t^* = \big(1 - \frac{\kappa}{\nu\|g\|_2}\big)g$ if $\|g\|_2 > \frac{\kappa}{\nu}$, and $t^* = 0$ otherwise.

Thus, the solution of (9) is

$$E^i = \begin{cases} \Big(1 - \dfrac{\eta}{\rho_2\|v^i\|_2}\Big)v^i, & \|v^i\|_2 > \dfrac{\eta}{\rho_2}, \\ 0, & \|v^i\|_2 \le \dfrac{\eta}{\rho_2}, \end{cases} \tag{10}$$

where $v^i = \big(Z + \frac{\Lambda_2}{\rho_2}\big)^i$ and $E^i$ denotes the $i$-th row of $E$, $i = 1, \ldots, n$.

Solver for $Z$: The sub-problem with respect to $Z$ can be written as

$$\min_Z\ \mu\|Q - QZ\|_F^2 + \frac{\rho_2}{2}\Big\|Z - E + \frac{\Lambda_2}{\rho_2}\Big\|_F^2. \tag{11}$$

Taking the gradient of (11) with respect to $Z$ and setting it to zero yields the closed-form solution

$$Z = (2\mu Q^T Q + \rho_2 I)^{-1}(2\mu Q^T Q + \rho_2 E - \Lambda_2). \tag{12}$$

Solver for $Q$: With the other variables fixed, $Q$ is obtained by minimizing

$$\|X - XQ\|_F^2 + \mu\|Q - QZ\|_F^2 + \frac{\rho_1}{2}\Big\|Q - W + \frac{\Lambda_1}{\rho_1}\Big\|_F^2. \tag{13}$$

Since (13) is convex in $Q$, the optimal solution can be found by differentiating (13) and setting the derivative to zero, which implies

$$(2X^T X + \rho_1 I)Q + 2\mu Q(I - Z)(I - Z^T) = 2X^T X + \rho_1 W - \Lambda_1. \tag{14}$$

For convenience, let $M = 2X^T X + \rho_1 I$, $N = 2\mu(I - Z)(I - Z^T)$, and $H = 2X^T X + \rho_1 W - \Lambda_1$; then Eq. (14) can be written as

$$MQ + QN = H. \tag{15}$$

Since $M$ and $N$ are symmetric and positive semi-definite, we have

$$M = U\Sigma_1 U^T, \qquad N = V\Sigma_2 V^T, \tag{16}$$

where $U$ and $V$ are orthogonal matrices and $\Sigma_1$, $\Sigma_2$ are diagonal. Plugging (16) into (15), we obtain

$$U\Sigma_1 U^T Q + QV\Sigma_2 V^T = H \;\Longrightarrow\; \Sigma_1 U^T Q + U^T Q V\Sigma_2 V^T = U^T H \;\Longrightarrow\; \Sigma_1 U^T Q V + U^T Q V\Sigma_2 = U^T H V. \tag{17}$$

Let $S = U^T Q V$; then (17) becomes $\Sigma_1 S + S\Sigma_2 = U^T H V$, whose solution is

$$S_{ij} = \frac{(U^T H V)_{ij}}{(\Sigma_1)_{ii} + (\Sigma_2)_{jj}}, \quad i, j = 1, \ldots, n. \tag{18}$$

Since $(\Sigma_1)_{ii} > 0$ and $(\Sigma_2)_{jj} \ge 0$, the denominator in (18) is greater than zero. After obtaining $S$, we recover $Q$ by

$$Q = USV^T. \tag{19}$$

Updating $\Lambda_1$ and $\Lambda_2$: The Lagrange multipliers are updated by

$$\Lambda_1 \leftarrow \Lambda_1 + \rho_1(Q - W), \qquad \Lambda_2 \leftarrow \Lambda_2 + \rho_2(Z - E). \tag{20}$$

We list the key steps for solving (4) in Algorithm 1; the procedure for solving (3) is similar.

Algorithm 1: Optimization Procedure for Solving Formulation II
Input: the matrix $X \in \mathbb{R}^{d \times n}$ and the tradeoff parameters.
Initialization: $Q^0 = W^0 = Z^0 = E^0 = 0$, $\Lambda_1^0 = \Lambda_2^0 = 0$, $\rho_1 = \rho_2 = 10^{-6}$, $\mathrm{max}_\rho = 10^6$, $\tau = 1.1$, $\epsilon = 10^{-3}$, $t = 0$.
While not converged do:
1. Fix the other variables and update $W$ by (8);
2. Fix the other variables and update $E$ by (10);
3. Fix the other variables and update $Z$ by (12);
4. Fix the other variables and update $Q$ by (19);
5. Update the multipliers $\Lambda_1$ and $\Lambda_2$ by (20);
6. Update the penalty parameters by $\rho_i = \min(\tau\rho_i, \mathrm{max}_\rho)$, $i = 1, 2$;
7. Check the convergence conditions: $\|Q - W\| < \epsilon$, $\|Z - E\| < \epsilon$, and $\big|\frac{L^{t+1} - L^t}{L^t}\big| < \epsilon$, where $L^t$ is the value of the augmented Lagrangian function (6) at the $t$-th iteration;
8. $t \leftarrow t + 1$;
End while
Output: $Z$.
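To make Algorithm 1 easier to follow, here is a minimal NumPy sketch of the update loop, written by us under the assumptions stated in the comments; it is not the authors' released implementation, and the default tradeoff values and the simplified stopping test are placeholders.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: U * S_tau(Sigma) * V^T, cf. Eq. (8)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def row_shrink(V, tau):
    """Row-wise shrinkage from Lemma 1 / Eq. (10): scale each row v by max(1 - tau/||v||_2, 0)."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0) * V

def solve_mq_qn(M, N, H):
    """Solve M Q + Q N = H for symmetric PSD M, N via eigendecomposition, cf. Eqs. (15)-(19)."""
    s1, U = np.linalg.eigh(M)
    s2, V = np.linalg.eigh(N)
    S = (U.T @ H @ V) / (s1[:, None] + s2[None, :])
    return U @ S @ V.T

def alsl_formulation2(X, lam=0.1, mu=0.1, eta=0.1, max_iter=200, eps=1e-3):
    """Sketch of Algorithm 1 for Formulation II. X is d x n; returns Z (n x n)."""
    n = X.shape[1]
    Q, W, Z, E = (np.zeros((n, n)) for _ in range(4))
    Lam1, Lam2 = np.zeros((n, n)), np.zeros((n, n))
    rho1 = rho2 = 1e-6
    tau, max_rho = 1.1, 1e6
    XtX = X.T @ X
    I = np.eye(n)
    for _ in range(max_iter):
        W = svt(Q + Lam1 / rho1, lam / rho1)                      # update W by (8)
        E = row_shrink(Z + Lam2 / rho2, eta / rho2)               # update E by (10)
        Z = np.linalg.solve(2 * mu * Q.T @ Q + rho2 * I,
                            2 * mu * Q.T @ Q + rho2 * E - Lam2)   # update Z by (12)
        M = 2 * XtX + rho1 * I
        N = 2 * mu * (I - Z) @ (I - Z).T
        H = 2 * XtX + rho1 * W - Lam1
        Q = solve_mq_qn(M, N, H)                                  # update Q by (19)
        Lam1 = Lam1 + rho1 * (Q - W)                              # update multipliers by (20)
        Lam2 = Lam2 + rho2 * (Z - E)
        rho1, rho2 = min(tau * rho1, max_rho), min(tau * rho2, max_rho)
        # Simplified stopping test on the primal residuals only.
        if max(np.abs(Q - W).max(), np.abs(Z - E).max()) < eps:
            break
    return Z
```

Once the loop finishes, the returned Z can be passed to the select_samples helper sketched earlier to pick the top-k representative samples.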
### Convergence and Computational Complexity Analysis

It is easy to prove that Algorithm 1 converges to a local optimum based on (Hajinezhad et al. 2016); because of space limitations, we omit the proof and refer the reader to (Hajinezhad et al. 2016) for details. We also analyze the computational complexity of Algorithm 1. The main computational costs stem from updating $W$, $E$, $Z$, and $Q$. Updating $W$ typically needs $O(n^3)$ because it involves the SVD of an $n \times n$ matrix. Updating $E$ costs $O(n^2)$. Updating $Z$ costs $O(n^3)$, since it requires several matrix multiplications and a matrix inversion. Updating $Q$ requires $O(n^3 + n^2 d)$. Thus, the total complexity of Algorithm 1 in each iteration is $O(n^3 + n^2 + n^3 + n^3 + n^2 d) = O(n^3 + n^2 d)$. By comparison, ALFS is of order $O(n^3 + n^2 d + d^2 n + d^3)$ and RRSS needs $O(n^4)$ per iteration, where $n$ is the number of samples and $d$ is the dimension of the samples. Thus, our method has lower complexity than ALFS and RRSS.

## Experiments

In this section, we evaluate the performance of the proposed formulations on five publicly available datasets across different tasks, including facial age estimation, video action recognition, medical image classification, and wine quality prediction. Manually labelling samples for these tasks is usually time-consuming and expensive. We evaluate the proposed methods on two video action recognition datasets, HMDB51 (Kuehne et al. 2011) and UCF50 (Reddy and Shah 2013), one facial age estimation dataset, UTKFace (Zhang, Song, and Qi 2017), one medical image dataset, HAM10000 (Tschandl, Rosendahl, and Kittler 2018), and one wine quality dataset (Cortez et al. 2009). HMDB51 is composed of 6,766 realistic video clips from 51 action categories. UCF50 contains 6,681 realistic YouTube videos from 50 action categories. For each video, we extract a 512-dimensional feature vector for each frame and use the average over all frames as the feature representation of the video. UTKFace consists of over 20,000 face images with age annotations. The ages range from 0 to 116 years and are divided into eight groups (0-9, 10-19, ..., 60-69, and 70+) for age group estimation. We randomly select 200 images from each class to avoid the effect of class imbalance.

### Compared Methods and Experimental Protocol

We compare with related unsupervised active learning algorithms, including RRSS (Nie et al. 2013), ALNR (Hu et al. 2013), and ALFS (Li et al. 2019). In addition, we compare with a popular matrix column subset selection algorithm, DCS (Papailiopoulos, Kyrillidis, and Boutsidis 2014), which can also be used for sample selection. Our method has two variants, ALSL_W and ALSL_U, which perform sample selection with and under low-rank sample representations, respectively. To evaluate the effectiveness of the proposed method, we train an SVM classifier using the selected samples and evaluate its accuracy on the unseen samples. The parameters $\lambda$, $\mu$, and $\eta$ in our algorithm are searched from $\{0.001, 0.01, 0.1, 1, 10\}$. We repeat every test case 5 times and report the average result and standard deviation.

### Experimental Results

General Performance: The results are listed in Tables 1-5. ALSL_U consistently outperforms all other methods on the five datasets. For instance, with the number of queries set to 100 on the UCF50 dataset, ALSL_U obtains a 13% improvement over DCS, which achieves the second best performance. This illustrates that the strategy of selecting samples to best approximate the low-rank representations is effective. ALSL_W achieves performance comparable to the other baselines; this is because ALSL_W still aims to reconstruct the original inputs and therefore struggles with noise.
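As a concrete reading of the protocol described above (train an SVM on the actively selected samples, evaluate on unseen samples, and average over repeated runs), a minimal sketch could look as follows; the data loading, splits, and SVM settings are our own placeholder assumptions, since the paper does not specify them.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def evaluate_selection(X_pool, y_pool, X_test, y_test, selected_idx):
    """Train an SVM on the selected (now labelled) samples and report
    accuracy on unseen test samples. X_* are (n_samples, d) arrays."""
    clf = SVC()  # kernel and regularization settings are unspecified in the paper
    clf.fit(X_pool[selected_idx], y_pool[selected_idx])
    return accuracy_score(y_test, clf.predict(X_test))

# Hypothetical usage: average over 5 repetitions, as in the experimental protocol.
# accs = [evaluate_selection(Xp, yp, Xt, yt, select_samples(Z, k)) for (Xp, yp, Xt, yt) in splits]
```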
Ablation Study: We study the effectiveness of our components on the two image datasets. The setting is as follows: we perform sample selection without subspace learning, i.e., setting $\lambda = 0$ in (3), which we abbreviate as AL. The experimental results are reported in Table 6. ALSL_W is better than AL, indicating that learning low-rank sample representations benefits sample selection. ALSL_U achieves superior performance over ALSL_W, which demonstrates that selecting samples to reconstruct the low-rank representations better suppresses noise.

Effectiveness of Joint Learning: In (4), we propose a joint framework that performs subspace learning and sample selection simultaneously. One may ask whether we could instead first learn a low-rank representation $Q$ using the first two terms of (4), and then select samples to best approximate $Q$ using the last two terms of (4). To verify the effectiveness of joint learning, we compare ALSL_U with the results of this two-step procedure. The results are listed in Table 7. ALSL_U clearly achieves better performance, which stems from the coupling effect of active learning and subspace learning, as verified in (Li et al. 2019).

| #Num. | ALFS | ALNR | DCS | RRSS | ALSL_W | ALSL_U |
|---|---|---|---|---|---|---|
| 50 | 0.184±0.007 | 0.131±0.016 | 0.127±0.011 | 0.150±0.018 | 0.218±0.024 | 0.242±0.018 |
| 100 | 0.196±0.024 | 0.160±0.018 | 0.133±0.015 | 0.214±0.018 | 0.243±0.010 | 0.273±0.015 |
| 150 | 0.259±0.011 | 0.183±0.014 | 0.153±0.015 | 0.233±0.009 | 0.285±0.035 | 0.301±0.022 |
| 200 | 0.299±0.024 | 0.190±0.011 | 0.221±0.018 | 0.275±0.017 | 0.303±0.018 | 0.317±0.003 |
| 250 | 0.318±0.019 | 0.213±0.014 | 0.260±0.015 | 0.302±0.015 | 0.326±0.013 | 0.351±0.023 |
| 300 | 0.331±0.024 | 0.200±0.013 | 0.295±0.011 | 0.317±0.019 | 0.340±0.026 | 0.374±0.004 |
| 350 | 0.351±0.014 | 0.254±0.022 | 0.315±0.017 | 0.352±0.011 | 0.356±0.028 | 0.380±0.010 |
| 400 | 0.355±0.024 | 0.265±0.012 | 0.328±0.021 | 0.349±0.024 | 0.368±0.017 | 0.392±0.004 |
| 450 | 0.371±0.015 | 0.297±0.012 | 0.355±0.027 | 0.379±0.009 | 0.376±0.013 | 0.398±0.014 |

Table 1: Quantitative results (accuracy, mean ± std) of different active learning methods on the UTKFace dataset.

| #Num. | ALFS | ALNR | DCS | RRSS | ALSL_W | ALSL_U |
|---|---|---|---|---|---|---|
| 50 | 0.227±0.015 | 0.204±0.020 | 0.126±0.013 | 0.191±0.018 | 0.256±0.019 | 0.271±0.020 |
| 100 | 0.233±0.014 | 0.232±0.015 | 0.127±0.010 | 0.227±0.011 | 0.276±0.018 | 0.295±0.015 |
| 150 | 0.274±0.020 | 0.253±0.022 | 0.206±0.020 | 0.239±0.014 | 0.285±0.019 | 0.328±0.020 |
| 200 | 0.302±0.012 | 0.284±0.010 | 0.259±0.010 | 0.280±0.019 | 0.295±0.019 | 0.331±0.013 |
| 250 | 0.344±0.016 | 0.301±0.023 | 0.268±0.015 | 0.287±0.013 | 0.302±0.011 | 0.355±0.018 |
| 300 | 0.359±0.011 | 0.312±0.015 | 0.311±0.010 | 0.319±0.010 | 0.318±0.020 | 0.374±0.017 |
| 350 | 0.368±0.021 | 0.335±0.020 | 0.339±0.014 | 0.326±0.014 | 0.334±0.015 | 0.399±0.019 |
| 400 | 0.379±0.017 | 0.358±0.010 | 0.357±0.010 | 0.346±0.015 | 0.351±0.011 | 0.402±0.019 |
| 450 | 0.393±0.018 | 0.352±0.019 | 0.359±0.014 | 0.369±0.019 | 0.360±0.010 | 0.407±0.020 |

Table 2: Quantitative results (accuracy, mean ± std) of different active learning methods on the HAM10000 dataset.
| #Num. | ALFS | ALNR | DCS | RRSS | ALSL_W | ALSL_U |
|---|---|---|---|---|---|---|
| 50 | 0.117±0.010 | 0.111±0.018 | 0.115±0.017 | 0.122±0.015 | 0.108±0.020 | 0.159±0.011 |
| 100 | 0.154±0.017 | 0.143±0.011 | 0.182±0.028 | 0.157±0.021 | 0.159±0.020 | 0.207±0.012 |
| 150 | 0.208±0.009 | 0.180±0.006 | 0.214±0.014 | 0.208±0.010 | 0.207±0.010 | 0.243±0.010 |
| 200 | 0.220±0.016 | 0.208±0.013 | 0.254±0.007 | 0.226±0.023 | 0.234±0.022 | 0.269±0.012 |
| 250 | 0.266±0.012 | 0.247±0.027 | 0.277±0.014 | 0.281±0.015 | 0.257±0.025 | 0.294±0.007 |
| 300 | 0.279±0.020 | 0.276±0.016 | 0.302±0.015 | 0.290±0.019 | 0.277±0.014 | 0.318±0.008 |
| 350 | 0.314±0.012 | 0.300±0.010 | 0.324±0.012 | 0.327±0.014 | 0.302±0.022 | 0.344±0.007 |
| 400 | 0.326±0.018 | 0.313±0.014 | 0.341±0.016 | 0.330±0.010 | 0.323±0.013 | 0.368±0.003 |
| 450 | 0.352±0.009 | 0.333±0.009 | 0.367±0.009 | 0.364±0.009 | 0.339±0.008 | 0.377±0.009 |

Table 3: Quantitative results (accuracy, mean ± std) of different active learning methods on the HMDB51 dataset.

| #Num. | ALFS | ALNR | DCS | RRSS | ALSL_W | ALSL_U |
|---|---|---|---|---|---|---|
| 50 | 0.142±0.016 | 0.152±0.020 | 0.186±0.017 | 0.185±0.018 | 0.130±0.008 | 0.249±0.010 |
| 100 | 0.237±0.017 | 0.242±0.022 | 0.246±0.017 | 0.240±0.010 | 0.246±0.003 | 0.377±0.011 |
| 150 | 0.340±0.015 | 0.310±0.014 | 0.297±0.011 | 0.345±0.014 | 0.335±0.010 | 0.450±0.010 |
| 200 | 0.389±0.019 | 0.370±0.024 | 0.404±0.021 | 0.374±0.011 | 0.406±0.006 | 0.511±0.008 |
| 250 | 0.476±0.010 | 0.422±0.012 | 0.449±0.012 | 0.441±0.013 | 0.450±0.018 | 0.565±0.010 |
| 300 | 0.525±0.015 | 0.468±0.012 | 0.482±0.017 | 0.469±0.023 | 0.488±0.008 | 0.607±0.005 |
| 350 | 0.564±0.011 | 0.510±0.015 | 0.522±0.012 | 0.495±0.013 | 0.512±0.012 | 0.641±0.017 |
| 400 | 0.597±0.013 | 0.542±0.017 | 0.551±0.009 | 0.506±0.023 | 0.555±0.007 | 0.667±0.009 |
| 450 | 0.628±0.008 | 0.581±0.017 | 0.580±0.006 | 0.532±0.013 | 0.598±0.007 | 0.684±0.014 |

Table 4: Quantitative results (accuracy, mean ± std) of different active learning methods on the UCF50 dataset.

| #Num. | ALFS | ALNR | DCS | RRSS | ALSL_W | ALSL_U |
|---|---|---|---|---|---|---|
| 50 | 0.447±0.007 | 0.428±0.009 | 0.448±0.024 | 0.472±0.002 | 0.451±0.012 | 0.474±0.007 |
| 100 | 0.441±0.018 | 0.443±0.011 | 0.466±0.010 | 0.453±0.010 | 0.417±0.023 | 0.498±0.009 |
| 150 | 0.471±0.017 | 0.479±0.005 | 0.477±0.010 | 0.475±0.024 | 0.490±0.008 | 0.507±0.015 |
| 200 | 0.475±0.011 | 0.479±0.024 | 0.494±0.017 | 0.481±0.024 | 0.499±0.013 | 0.510±0.007 |
| 250 | 0.502±0.017 | 0.490±0.018 | 0.503±0.009 | 0.495±0.013 | 0.501±0.011 | 0.516±0.006 |
| 300 | 0.505±0.017 | 0.498±0.013 | 0.504±0.026 | 0.496±0.014 | 0.500±0.015 | 0.521±0.006 |
| 350 | 0.509±0.018 | 0.504±0.017 | 0.515±0.014 | 0.503±0.023 | 0.509±0.013 | 0.526±0.006 |
| 400 | 0.511±0.017 | 0.508±0.017 | 0.517±0.019 | 0.503±0.013 | 0.511±0.011 | 0.528±0.001 |
| 450 | 0.517±0.013 | 0.511±0.024 | 0.519±0.016 | 0.512±0.023 | 0.519±0.019 | 0.534±0.006 |

Table 5: Quantitative results (accuracy, mean ± std) of different active learning methods on the Wine Quality dataset.

(a) UTKFace

| #Num. | AL | ALSL_W | ALSL_U |
|---|---|---|---|
| 50 | 0.130±0.013 | 0.218±0.024 | 0.242±0.018 |
| 100 | 0.223±0.016 | 0.243±0.010 | 0.273±0.015 |
| 150 | 0.237±0.026 | 0.285±0.035 | 0.301±0.022 |
| 200 | 0.264±0.027 | 0.303±0.018 | 0.317±0.003 |
| 250 | 0.265±0.013 | 0.326±0.013 | 0.351±0.023 |
| 300 | 0.293±0.018 | 0.340±0.026 | 0.374±0.004 |
| 350 | 0.322±0.016 | 0.356±0.028 | 0.380±0.010 |
| 400 | 0.331±0.016 | 0.368±0.017 | 0.392±0.004 |
| 450 | 0.335±0.014 | 0.376±0.013 | 0.398±0.014 |

(b) HAM10000

| #Num. | AL | ALSL_W | ALSL_U |
|---|---|---|---|
| 50 | 0.156±0.036 | 0.256±0.019 | 0.271±0.020 |
| 100 | 0.196±0.009 | 0.276±0.018 | 0.295±0.015 |
| 150 | 0.201±0.010 | 0.285±0.019 | 0.328±0.020 |
| 200 | 0.207±0.032 | 0.295±0.019 | 0.331±0.013 |
| 250 | 0.254±0.019 | 0.302±0.011 | 0.355±0.018 |
| 300 | 0.272±0.021 | 0.318±0.020 | 0.374±0.017 |
| 350 | 0.271±0.016 | 0.334±0.015 | 0.399±0.019 |
| 400 | 0.290±0.023 | 0.351±0.011 | 0.402±0.019 |
| 450 | 0.321±0.022 | 0.360±0.010 | 0.407±0.019 |

Table 6: Ablation study of the proposed methods.
(a) HMDB51

| #Num. | Separate Steps | ALSL_U |
|---|---|---|
| 50 | 0.101±0.022 | 0.159±0.011 |
| 100 | 0.168±0.020 | 0.207±0.012 |
| 150 | 0.204±0.018 | 0.243±0.010 |
| 200 | 0.238±0.016 | 0.269±0.012 |
| 250 | 0.251±0.016 | 0.294±0.007 |
| 300 | 0.282±0.016 | 0.318±0.008 |
| 350 | 0.305±0.012 | 0.344±0.007 |
| 400 | 0.316±0.021 | 0.368±0.003 |
| 450 | 0.326±0.020 | 0.377±0.009 |

(b) UCF50

| #Num. | Separate Steps | ALSL_U |
|---|---|---|
| 50 | 0.107±0.022 | 0.249±0.010 |
| 100 | 0.166±0.020 | 0.377±0.011 |
| 150 | 0.234±0.018 | 0.450±0.010 |
| 200 | 0.299±0.016 | 0.511±0.008 |
| 250 | 0.357±0.016 | 0.565±0.010 |
| 300 | 0.395±0.016 | 0.607±0.005 |
| 350 | 0.441±0.012 | 0.641±0.017 |
| 400 | 0.480±0.021 | 0.667±0.009 |
| 450 | 0.523±0.020 | 0.684±0.014 |

Table 7: Effectiveness verification of joint learning on the HMDB51 and UCF50 datasets.

Parameter Study: We also study the parameter sensitivities of our method on the Wine Quality dataset. In this experiment, the number of selected samples is set to 50. The results of ALSL_W are shown in Figure 2. Our method is not sensitive to the three parameters over wide ranges.

[Figure 2: Parameter study on the Wine Quality dataset; each panel varies one of the tradeoff parameters over {0.001, 0.01, 0.1, 1, 10}.]

## Conclusion

In this paper, we proposed a subspace learning based model for unsupervised active learning. To leverage subspace learning, we proposed two formulations to perform sample selection with and under low-rank sample representations, respectively. Experimental results on multiple tasks demonstrated their effectiveness. Moreover, we conclude that selecting samples by approximating low-rank representations of the data obtains the best performance.

## Acknowledgments

This work was supported in part by the NSFC under Grants 61806044, U2001211, 61932004, and 61732003. It was also partially supported by the Open Project Program of the National Laboratory of Pattern Recognition (NLPR).

## References

Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1): 1-122.

Cai, D.; and He, X. 2011. Manifold adaptive experimental design for text categorization. TKDE 24(4): 707-719.

Cai, J.; Candes, E. J.; and Shen, Z. 2010. A Singular Value Thresholding Algorithm for Matrix Completion. SIAM Journal on Optimization 20(4): 1956-1982.

Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; and Reis, J. 2009. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems 47(4): 547-553.

FY, Z.; Zhu, X.; SM, X.; CH, P.; et al. 2015. 10,000+ Times Accelerated Robust Subset Selection (ARSS). In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 3217-3223.

Gal, Y.; Islam, R.; and Ghahramani, Z. 2017. Deep Bayesian active learning with image data. In Proceedings of the International Conference on Machine Learning (ICML), 1183-1192.

Georghiades, A. S.; Belhumeur, P. N.; and Kriegman, D. J. 2001. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 23(6): 643-660.

Hajinezhad, D.; Chang, T.-H.; Wang, X.; Shi, Q.; and Hong, M. 2016. Nonnegative matrix factorization using ADMM: Algorithm and convergence analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4742-4746.

Hu, Y.; Zhang, D.; Jin, Z.; Cai, D.; and He, X. 2013. Active learning via neighborhood reconstruction. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1415-1421.
Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; and Serre, T. 2011. HMDB: A large video database for human motion recognition. In ICCV, 2556-2563.

Li, C.; Ma, H.; Kang, Z.; Yuan, Y.; Zhang, X.-Y.; and Wang, G. 2020a. On deep unsupervised active learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI).

Li, C.; Wang, X.; Dong, W.; Yan, J.; Liu, Q.; and Zha, H. 2019. Joint active learning with feature selection via CUR matrix decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41(6): 1382-1396.

Li, C.; Yang, C.; Liang, L.; Yuan, Y.; and Wang, G. 2020b. On Robust Grouping Active Learning. IEEE Transactions on Emerging Topics in Computational Intelligence.

Lu, C.; Min, H.; Zhao, Z.-Q.; Zhu, L.; Huang, D.-S.; and Yan, S. 2012. Robust and efficient subspace segmentation via least squares regression. In ECCV, 347-360.

Nie, F.; Wang, H.; Huang, H.; and Ding, C. 2013. Early active learning via robust representation and structured sparsity. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1572-1578.

Papailiopoulos, D.; Kyrillidis, A.; and Boutsidis, C. 2014. Provable deterministic leverage score sampling. arXiv:1404.1530.

Recht, B.; Fazel, M.; and Parrilo, P. A. 2010. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52(3): 471-501.

Reddy, K. K.; and Shah, M. 2013. Recognizing 50 human action categories of web videos. Machine Vision and Applications 24(5): 971-981.

Roy, N.; and McCallum, A. 2001. Toward optimal active learning through Monte Carlo estimation of error reduction. In Proceedings of the International Conference on Machine Learning (ICML), 441-448.

Shi, L.; and Shen, Y.-D. 2016. Diversifying convex transductive experimental design for active learning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1997-2003. AAAI Press.

Tschandl, P.; Rosendahl, C.; and Kittler, H. 2018. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data 5: 180161.

Yang, J.; Yin, W.; Zhang, Y.; and Wang, Y. 2009. A fast algorithm for edge-preserving variational multichannel image restoration. SIAM Journal on Imaging Sciences 2(2): 569-592.

Yoo, D.; and Kweon, I. S. 2019. Learning Loss for Active Learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 93-102.

Yu, K.; Bi, J.; and Tresp, V. 2006. Active learning via transductive experimental design. In Proceedings of the International Conference on Machine Learning (ICML), 1081-1088.

Yu, K.; Zhu, S.; Xu, W.; and Gong, Y. 2008. Non-greedy active learning for text categorization using convex transductive experimental design. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 635-642.

Zhang, L.; Chen, C.; Bu, J.; Cai, D.; He, X.; and Huang, T. S. 2011. Active Learning Based on Locally Linear Reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 33(10): 2026-2038.

Zhang, Z.; Song, Y.; and Qi, H. 2017. Age Progression/Regression by Conditional Adversarial Autoencoder. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).