Semi-supervised Active Linear Regression

Fnu Devvrit, Department of Computer Science, University of Texas at Austin, devvrit@cs.utexas.edu
Nived Rajaraman, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, nived.rajaraman@berkeley.edu
Pranjal Awasthi, Google Research & Department of Computer Science, Rutgers University, pranjal.awasthi@rutgers.edu

Abstract

Labeled data often comes at a high cost as it may require recruiting human labelers or running costly experiments. At the same time, in many practical scenarios, one already has access to a partially labeled, potentially biased dataset that can help with the learning task at hand. Motivated by such settings, we formally initiate a study of semi-supervised active learning through the frame of linear regression. Here, the learner has access to a dataset $X \in \mathbb{R}^{(n_{un}+n_{lab}) \times d}$ composed of $n_{un}$ unlabeled examples that the learner can actively query, and $n_{lab}$ examples labeled a priori. Denoting the true labels by $Y \in \mathbb{R}^{n_{un}+n_{lab}}$, the learner's objective is to find $\hat{\beta} \in \mathbb{R}^d$ such that,
$$\|X\hat{\beta} - Y\|_2^2 \le (1+\epsilon) \min_{\beta \in \mathbb{R}^d} \|X\beta - Y\|_2^2 \tag{1}$$
while querying the labels of as few unlabeled points as possible. In this paper, we introduce an instance-dependent parameter called the reduced rank, denoted $R_X$, and propose an efficient algorithm with query complexity $O(R_X/\epsilon)$. This result directly implies improved upper bounds for two important special cases: (i) active ridge regression, and (ii) active kernel ridge regression, where the reduced rank equates to the statistical dimension $sd_\lambda$ and the effective dimension $d_\lambda$ of the problem respectively, and $\lambda > 0$ denotes the regularization parameter. Finally, we introduce a distributional version of the problem as a special case of the agnostic formulation we consider earlier; here, for every $X$, we prove a matching instance-wise lower bound of $\Omega(R_X/\epsilon)$ on the query complexity of any algorithm.

1 Introduction

Classification and regression are the cornerstones of machine learning. Often these tasks are carried out in a supervised manner: a learner collects many examples and their labels, and then runs an optimization algorithm to minimize a loss function and learn a model. However, the process of collecting labeled data comes at a cost, for instance when a human labeler is required, or when an expensive experiment needs to be run. Motivated by such scenarios, two popular approaches have been proposed to mitigate these issues in practice: active learning [24] and semi-supervised learning [30]. In active learning, the learner carries out the task by adaptively querying the labels of a small subset of characteristic data points. Semi-supervised learning, on the other hand, is motivated by the fact that learners often have access to massive amounts of unlabeled data in addition to some labeled data, and its algorithms leverage both to carry out the learning task. Active learning and semi-supervised learning have found numerous applications in areas such as text classification [26], visual recognition [17] and foreground-background segmentation [15]. In practice, learners often have access to smaller labeled datasets and larger unlabeled datasets, chosen points of which can be submitted for manual labeling.
This motivates the semi-supervised active regression (SSAR) formulation: here, the learner is provided an a priori labeled dataset $X_{lab} \in \mathbb{R}^{n_{lab} \times d}$, as well as an unlabeled dataset $X_{un} \in \mathbb{R}^{n_{un} \times d}$, points of which can be iteratively submitted to an external agent to be labeled at unit cost. In the linear regression framework, the objective of the learner is then to minimize the squared loss $\|X\beta - Y\|_2^2$ while making as few label queries (of points in $X_{un}$) as possible. Here $X$ is the overall dataset, $X_{lab} \cup X_{un}$, represented as a matrix in $\mathbb{R}^{(n_{un}+n_{lab}) \times d}$, while $Y \in \mathbb{R}^{n_{un}+n_{lab}}$ denotes the corresponding labels. The learner seeks to find a regression function $\hat{\beta} \in \mathbb{R}^d$ such that,
$$\|X\hat{\beta} - Y\|_2^2 \le (1+\epsilon) \min_{\beta \in \mathbb{R}^d} \|X\beta - Y\|_2^2. \tag{2}$$
Likewise, the $\epsilon$-query complexity of an algorithm is defined as the number of unlabeled points in $X_{un}$ that must be queried in order to return a $\hat{\beta}$ such that eq. (2) is satisfied with constant probability.

In the special case when no labeled data is provided (i.e. $X_{lab}$ is empty), SSAR reduces to active linear regression, which has been studied in several prior works [9, 5, 7, 8, 6]. SSAR includes several other important problems as special cases. For some $\lambda > 0$, consider the setting where the labeled dataset $X_{lab}$ is the set of standard basis vectors scaled by a factor of $\sqrt{\lambda}$, with corresponding labels $0$. The loss of a learner $\hat{\beta}$ on this dataset exactly corresponds to the ridge regression loss on the unlabeled dataset, $\|X_{un}\hat{\beta} - Y_{un}\|_2^2 + \lambda\|\hat{\beta}\|_2^2$, where $Y_{un}$ are the labels of points in $X_{un}$. Thus, active ridge regression is captured as a special case of SSAR. It turns out that active kernel ridge regression is also captured as a special case of SSAR. Here, data points lie in a reproducing kernel Hilbert space (RKHS) [21, 25, 20] that is potentially infinite dimensional, but the learner has access to a kernel representation of the $N$ data points. Denoting the kernel matrix by $K \in \mathbb{R}^{N \times N}$ and the labels of the points by $Y \in \mathbb{R}^N$, the objective in kernel ridge regression is to minimize the loss $\|K\beta - Y\|_2^2 + \lambda\|\beta\|_K^2$. This can be viewed as a special case of the SSAR framework by choosing the labeled dataset as $X_{lab} = \sqrt{\lambda}\, K^{1/2} \in \mathbb{R}^{N \times N}$ with each point labeled by $0$, and the unlabeled dataset $X_{un} = K \in \mathbb{R}^{N \times N}$ with labels $Y \in \mathbb{R}^N$.

In this work, the main question we ask is: can we design algorithms for SSAR with instance-wise optimal query complexities?

Our first main contribution is to propose an instance-dependent parameter called the reduced rank, denoted $R_X$, that can be used to upper bound the worst-case optimal query complexity of SSAR. Intuitively, $R_X$ measures the relative importance of the points at which labels are known, $X_{lab}$, in the context of the overall dataset. Formally, the reduced rank is defined as:
$$R_X = \mathrm{Tr}\left( \left(X_{un}^T X_{un} + X_{lab}^T X_{lab}\right)^{-1} X_{un}^T X_{un} \right) \tag{3}$$
We propose an efficient polynomial time algorithm for SSAR with $\epsilon$-query complexity bounded by $O(R_X/\epsilon)$. In the special cases of active ridge regression and kernel ridge regression, $R_X$ is shown to be equal to the statistical dimension $sd_\lambda$ [3] and the effective dimension $d_\lambda$ [28] of the problem respectively, which directly implies query complexity guarantees for the two special cases. These parameters are formally defined in Section 3.
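To make eq. (3) concrete, the following minimal NumPy sketch (our own illustration with synthetic data and arbitrary constants; the function name `reduced_rank` is not from the paper) computes the reduced rank and checks that, for the ridge instance $X_{lab} = \sqrt{\lambda} I_d$, it coincides with the statistical dimension $sd_\lambda$ discussed in Section 3.

```python
import numpy as np

def reduced_rank(X_un: np.ndarray, X_lab: np.ndarray) -> float:
    """Reduced rank R_X = Tr((X_un^T X_un + X_lab^T X_lab)^{-1} X_un^T X_un), eq. (3)."""
    G_un = X_un.T @ X_un
    G_lab = X_lab.T @ X_lab
    # Solve a linear system instead of forming an explicit inverse.
    return float(np.trace(np.linalg.solve(G_un + G_lab, G_un)))

rng = np.random.default_rng(0)
d, n_un, lam = 20, 500, 5.0
X_un = rng.standard_normal((n_un, d))

# Ridge instance: X_lab = sqrt(lam) * I_d. Then R_X should equal
# sd_lambda = sum_i sigma_i^2 / (sigma_i^2 + lam), the statistical dimension.
X_lab = np.sqrt(lam) * np.eye(d)
sigma = np.linalg.svd(X_un, compute_uv=False)
print(reduced_rank(X_un, X_lab), np.sum(sigma**2 / (sigma**2 + lam)))  # agree
```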
Optimal algorithms for active ridge regression. The recent work of [6] studies the problem of active regression and, following a long line of work (see [9]), provides a near-optimal (up to constant factors) algorithm querying the labels of at most $O(d/\epsilon)$ points. The algorithm of [6], which addresses the unregularized case $\lambda = 0$, can also be used for active ridge regression, resulting in a query complexity guarantee of $O(d/\epsilon)$. However, the dependence on $d$ is not ideal in this setting: an instance-dependent parameter of the data known as the statistical dimension, $sd_\lambda$, is the right measure of the complexity of the problem. As a consequence of our general algorithm and guarantee, we obtain an algorithm for active ridge regression with query complexity upper bounded by $O(sd_\lambda/\epsilon)$, which we show is in fact optimal in the worst case. As a result, we resolve the complexity of ridge regression in the active setting.

Improved guarantees for active kernel ridge regression. In the context of kernel ridge regression, the work of [1] analyzed kernel leverage score sampling (and, in general, Nyström-approximation based methods), showing that these sampling approaches query $O(d_\lambda \log(N))$ labels to find a constant factor approximation to the optimal loss. In contrast, our algorithm, when applied in the kernel ridge regression setting, achieves a query complexity guarantee of $O(d_\lambda/\epsilon)$ bearing no dependence on the number of data points $N$, which is often large in practice.

Techniques. Our algorithm is based on a novel variant of the Randomized BSS spectral sparsification algorithm of [16, 4], which returns a small weighted subset of the dataset, queries its labels, and subsequently carries out weighted least-squares regression on this subset. The key technical contribution of this work is to construct a sampling algorithm which is guaranteed to query the labels of only a limited number of unlabeled points. We discuss the algorithm and its guarantees in more detail in Section 4.

The rest of the paper is organized as follows. We cover related work in Section 1.1 and introduce relevant preliminaries in Section 2. We motivate and present the new parameter $R_X$ in Section 3, and present our algorithm for SSAR along with a proof sketch in Section 4. In Section 5 we introduce the distributional SSAR problem and show a matching instance-dependent lower bound for the problem.

1.1 Related Work

Several prior works use active learning to augment semi-supervised learning based approaches [12, 29, 10, 22, 23]. The key idea in several of these works is to use active learning methods to decide which labels to sample, and then use semi-supervised learning to finally learn the model. In the context of active learning for linear regression, [9, 27] analyze the leverage score sampling algorithm, showing a sample complexity upper bound of $O(d\log(d)/\epsilon)$. Subsequently, for the case of $\epsilon \ge d + 1$, [7] shows that volume sampling achieves an $O(d)$ sample complexity. Later, [8] shows that rescaled volume sampling matches the sample complexity of $O(d\log(d)/\epsilon)$ achieved by leverage score sampling. Finally, [6] shows that a spectral sparsification based approach using the Randomized BSS algorithm [16] achieves the optimal sample complexity of $O(d/\epsilon)$. In the context of kernel ridge regression, the works of [11, 28, 19] study the problem of reducing the runtime and the number of kernel evaluations needed to approximately solve the problem. These results show that the number of kernel evaluations can be made to scale only linearly with the effective dimension of the problem, $d_\lambda(K)$, up to log factors. Recently, [1] proved a statistical guarantee showing that any optimal Nyström-approximation based approach samples at most $d_\lambda(K)\log(n)$ labels to find a constant factor approximation to the kernel regression loss, where $n$ is the dimension of the kernel matrix.
A closely related problem to active learning is the coreset (and weak coreset) problem [2, 18], where the objective is to find a low-memory data structure that can be used to approximately reconstruct the loss function, which naïvely requires storing all the training examples to compute. [13] study the weak coreset problem for ridge regression and propose an algorithm that returns a weak coreset of $O(sd_\lambda/\epsilon)$ points which can approximately recover the optimal loss. A drawback of their algorithm is that the labels of all the points in the dataset are required, which may come at a very high cost in practice. As we discuss in Section 4, our algorithm also returns a weighted subset of $O(sd_\lambda/\epsilon)$ points in $\mathbb{R}^d$ and a set of $d$ additional weights that can be used to reconstruct a $(1+\epsilon)$-approximation for the original ridge regression instance. In terms of memory requirement, this matches the result of [13]; however, it suffices for our algorithm to query at most $O(sd_\lambda/\epsilon)$ labels.

| Setting | Upper bounds | Lower bounds |
|---|---|---|
| Agnostic SSAR | $R_X\log(d)/\epsilon$ (LSS: Theorem 1); $R_X/\epsilon$ (ASURA: Theorems 3 + 4) | — |
| Distributional SSAR | $R_X/\epsilon$ (ASURA: Corollary 1) | $R_X/\epsilon$ (Theorem 5) |

Table 1: Guarantees for SSAR. The agnostic SSAR problem assumes no distribution on the input and no label noise. The distributional SSAR problem (Section 5) assumes that the input follows a known distribution and that queried labels are corrupted by Gaussian noise. LSS stands for leverage score sampling.

| Setting | Upper bounds | Lower bounds |
|---|---|---|
| Active ridge regression | $d\log(d)/\epsilon$ (LSS [9, 13]); $d/\epsilon$ (linear sparsification [6]); $sd_\lambda/\epsilon$ (ASURA: Theorems 3 + 4) | $sd_\lambda/\epsilon$ (Theorem 5) |
| Active kernel ridge regression | $d_\lambda\log(n)/\epsilon$ (Nyström approximation [1]); $d_\lambda/\epsilon$ (ASURA: Corollary 1) | $d_\lambda/\epsilon$ (Theorem 5) |

Table 2: Guarantees for special cases of SSAR. LSS stands for leverage score sampling.

2 Preliminaries

In this section, we formally define semi-supervised active regression (SSAR) under squared loss in the linear setting. The learner is provided datasets $X_{un} \in \mathbb{R}^{n_{un}\times d}$ (comprised of $n_{un}$ points in $\mathbb{R}^d$) and $X_{lab} \in \mathbb{R}^{n_{lab}\times d}$. The labels of the points are respectively $Y_{un} \in \mathbb{R}^{n_{un}}$ and $Y_{lab} \in \mathbb{R}^{n_{lab}}$. Define,
$$X = \begin{bmatrix} X_{un} \\ X_{lab} \end{bmatrix}, \qquad Y = \begin{bmatrix} Y_{un} \\ Y_{lab} \end{bmatrix}. \tag{4}$$
In the agnostic setting, the labels of the points are not assumed to be generated from an unknown underlying linear function and can be arbitrary. We study the problem in the overconstrained setting with $n_{un} + n_{lab} \ge d$. Furthermore, in the active setting, it is typically the case that $n_{un} \gg d$. When clear from the context, we use $X$, $X_{lab}$ and $X_{un}$ to represent the datasets in set notation. We assume that the learner is a priori provided $Y_{lab}$ and hence knows the labels of all points in $X_{lab}$, but the labels of points in $X_{un}$ are not known. However, the learner can probe an oracle to return the label of any queried point in $X_{un}$, and incurs a unit cost for each label returned. The learner's objective is to return a linear function $\hat{\beta} \in \mathbb{R}^d$ which approximately minimizes the squared $\ell_2$ loss,
$$\|X\hat{\beta} - Y\|_2^2 \le (1+\epsilon)\,\|X\beta^* - Y\|_2^2, \quad \text{where } \beta^* \in \arg\min_{\beta\in\mathbb{R}^d} \|X\beta - Y\|_2^2. \tag{5}$$
SSAR has the following two important problems as subcases.

Active ridge regression: In the active ridge regression problem, the learner is provided an unlabeled dataset $X_{un} \in \mathbb{R}^{n\times d}$ of $n$ points. The learner has access to an oracle which can be actively queried to label points in $X_{un}$. Representing the labels by $Y_{un} \in \mathbb{R}^n$, the objective of the learner is to minimize $L_\lambda(\beta) \triangleq \|X_{un}\beta - Y_{un}\|_2^2 + \lambda\|\beta\|_2^2$. Defining $X = \begin{bmatrix} X_{un} \\ \sqrt{\lambda}\, I_d \end{bmatrix} \in \mathbb{R}^{(n+d)\times d}$ and $Y = \begin{bmatrix} Y_{un} \\ 0 \end{bmatrix}$, observe that $\|X\beta - Y\|_2^2 = \|X_{un}\beta - Y_{un}\|_2^2 + \lambda\|\beta\|_2^2$. Thus, active ridge regression is a special case of the SSAR objective in eq. (5).
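As a quick numerical sanity check of this reduction (a sketch of ours, with synthetic data; none of the variable names come from the paper), the augmented objective and the ridge objective coincide exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 200, 10, 2.0
X_un = rng.standard_normal((n, d))
Y_un = rng.standard_normal(n)
beta = rng.standard_normal(d)

# Augmented SSAR instance: X = [X_un; sqrt(lam) I_d], Y = [Y_un; 0].
X = np.vstack([X_un, np.sqrt(lam) * np.eye(d)])
Y = np.concatenate([Y_un, np.zeros(d)])

ssar_loss = np.sum((X @ beta - Y) ** 2)
ridge_loss = np.sum((X_un @ beta - Y_un) ** 2) + lam * np.sum(beta ** 2)
assert np.isclose(ssar_loss, ridge_loss)  # the two objectives coincide
```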
Active kernel ridge regression: In the kernel ridge regression problem, the learner operates in a reproducing kernel Hilbert space $\mathcal{H}$ and is provided an unlabeled dataset $X$ of $n$ points with (implicit) feature representations $\Phi(x)$ for $x \in X$. The objective is to learn a function $F(x)$, linear in the feature representations of the points in $X$, that minimizes the regularized squared loss $\sum_{i=1}^n (F(x_i) - Y_i)^2 + \lambda\|F\|_{\mathcal{H}}^2$. By the representer theorem [14], the optimal regression function $F^*(\cdot)$ can be expressed as $\sum_{i=1}^n \beta_i\, \kappa(x_i, \cdot)$, where $\kappa(\cdot,\cdot)$ is the kernel function. The kernel ridge regression objective can therefore be expressed in terms of the kernel matrix $K \triangleq [\kappa(x_i,x_j)]_{i,j=1}^n \in \mathbb{R}^{n\times n}$. In other words, denoting the labels by $Y_{un} \in \mathbb{R}^n$ and writing $\|\beta\|_K^2 = \beta^T K \beta$, the learner's objective is to minimize over $\beta \in \mathbb{R}^n$,
$$\|K\beta - Y_{un}\|_2^2 + \lambda\|\beta\|_K^2 \tag{6}$$
while minimizing the number of label queries of points in $X_{un}$. The objective can be represented as,
$$\left\| \begin{bmatrix} K \\ \sqrt{\lambda}\,\sqrt{K} \end{bmatrix} \beta - \begin{bmatrix} Y_{un} \\ 0 \end{bmatrix} \right\|_2^2, \tag{7}$$
where $\sqrt{K}$ is defined as the unique PSD solution $Z$ to $K = Z^T Z$. Yet again, this is a special case of the linear regression objective in eq. (5), with $X_{un} = K$, $X_{lab} = \sqrt{\lambda}\,\sqrt{K}$ and $Y_{lab} = 0$.

3 Motivating the reduced rank $R_X$

A natural algorithm for SSAR is to use the popular approach of leverage score sampling [9, 13]. This approach uses the entire dataset $X$ (both labeled and unlabeled) to construct a diagonal weight matrix $W_S$: here $S$ represents the support of the diagonal and $\mathbb{E}[|S|] \le m = O(d\log(d)/\epsilon)$. The learner subsequently solves the problem $\hat{\beta} = \arg\min_\beta \|W_S(X\beta - Y)\|_2^2$, which is a weighted linear regression problem on $|S|$ examples. The works of [9, 13] show that the resulting linear function produces a $(1+\epsilon)$ multiplicative approximation to the optimal squared loss:
$$\|X\hat{\beta} - Y\|_2^2 \le (1+\epsilon) \inf_{\beta\in\mathbb{R}^d} \|X\beta - Y\|_2^2 \tag{8}$$
with constant probability. The weight matrix $W_S$ returned by leverage score sampling is constructed as follows: letting $U\Sigma V^T$ denote the singular value decomposition of $X$, the algorithm iterates over all the points in $X$ and includes a point $x$ in $S$ with probability $p(x) \triangleq \min\{1, \tfrac{m}{d}\|U(x)\|_2^2\}$, i.e., proportional (up to capping) to its leverage score $\|U(x)\|_2^2$, where $U(x)$ is the row in $U$ corresponding to the point $x \in X$. The corresponding diagonal entry of $W_S$ is set as $1/p(x)$. Since the points in $X_{lab}$ are already labeled, the number of times the algorithm queries the oracle to label a point is equal to $|S \cap X_{un}|$. By definition of the sampling probabilities, this is proportional to the sum of the leverage scores across the points in $X_{un}$:
$$\mathbb{E}\big[|S \cap X_{un}|\big] = \frac{m}{d}\sum_{x\in X_{un}} \|U(x)\|_2^2 = O\!\left(\frac{\log(d)}{\epsilon}\right)\sum_{x\in X_{un}} \|U(x)\|_2^2. \tag{9}$$
In Lemma 5, we prove that the term $\sum_{x\in X_{un}}\|U(x)\|_2^2$ equals the reduced rank $R_X$, defined as $\mathrm{Tr}\big((X_{lab}^T X_{lab} + X_{un}^T X_{un})^{-1} X_{un}^T X_{un}\big)$. Combining with eq. (9) results in
$$\mathbb{E}\big[|S \cap X_{un}|\big] = O\!\left(\frac{R_X\log(d)}{\epsilon}\right). \tag{10}$$
In particular, by Markov's inequality, with large enough constant probability, leverage score sampling makes at most $O\!\big(\tfrac{R_X\log(d)}{\epsilon}\big)$ label queries (of points in $X_{un}$). Combining with the guarantee on the quality of the solution returned by leverage score sampling in eq. (8) results in the following guarantee.

Theorem 1. For $0 < \epsilon < 1$, the $\epsilon$-query complexity of leverage score sampling for SSAR is upper bounded by $O\!\big(\tfrac{R_X\log(d)}{\epsilon}\big)$.
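The query-counting step above is easy to simulate. Below is a minimal NumPy sketch (our own; the normalization $m\|U(x)\|_2^2/d$ of the inclusion probabilities and the function name `leverage_score_sample` are illustrative assumptions) that samples by leverage scores over the combined dataset and reports how many sampled rows would actually require a label query.

```python
import numpy as np

def leverage_score_sample(X_un, X_lab, m, seed=2):
    """Leverage-score sampling over the combined dataset X = [X_un; X_lab].

    Returns the sampled row indices, importance weights 1/p(x), and the number
    of label queries, i.e. how many sampled rows come from the unlabeled part.
    """
    X = np.vstack([X_un, X_lab])
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    lev = np.sum(U ** 2, axis=1)                  # leverage scores ||U(x)||_2^2
    d = X.shape[1]
    p = np.minimum(1.0, m * lev / d)              # capped inclusion probabilities
    keep = np.random.default_rng(seed).random(len(p)) < p
    S = np.flatnonzero(keep)
    weights = 1.0 / p[S]
    n_queries = int(np.sum(S < X_un.shape[0]))    # labels we must actually buy
    return S, weights, n_queries
```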
In order to further motivate $R_X$, we next turn to the ridge regression problem. Here, it turns out that $R_X$ equals the statistical dimension of $X_{un}$, defined as:
$$sd_\lambda(X_{un}) \triangleq \sum_{i=1}^{\mathrm{rank}(X_{un})} \frac{1}{1 + \lambda/\sigma_i^2(X_{un})}, \tag{11}$$
where $\sigma_i(X_{un})$ is the $i$-th largest singular value of $X_{un}$. This can be seen by plugging $X_{lab} = \sqrt{\lambda}\, I$ into eq. (3) and expanding $X_{un}$ using its singular value decomposition $U_1\Sigma_1 V_1^T$. In comparison, in kernel ridge regression, $R_X$ evaluates to the effective dimension, defined as:
$$d_\lambda(K) \triangleq \sum_{i=1}^{\mathrm{rank}(K)} \frac{1}{1 + \lambda/\sigma_i(K)}, \tag{12}$$
where the $\sigma_i$'s are the eigenvalues of $K$. This yet again follows by plugging in the choice of $X_{un}$ and $X_{lab}$ in the kernel ridge regression setting from eq. (7).

The common theme in the prior sampling based approaches [1] for kernel ridge regression, as well as the bound on leverage score sampling established for SSAR (Theorem 1), is that they incur a logarithmic dependence on the dimension parameter $d$, or $N$ in the case of kernel ridge regression, which are often large in practice. Indeed, we show that this logarithmic dependence is unavoidable in the worst case for leverage score sampling.

Theorem 2. There exists an instance of SSAR such that the $\epsilon$-query complexity of leverage score sampling, for any constant $\epsilon$, is at least $\tfrac{1}{4} R_X \log(d)$.

In this paper, we subsume previous results and show an algorithm with an $\epsilon$-query complexity of $O\!\big(\tfrac{R_X}{\epsilon}\big)$ for SSAR. This shows that it is possible to derive statistical guarantees for active ridge / kernel ridge regression that respectively depend only on the statistical dimension / effective dimension.

4 Algorithm

In this section, we design a polynomial time algorithm (Algorithm 1) for SSAR and prove that the $\epsilon$-query complexity of the algorithm is $O\!\big(\tfrac{R_X}{\epsilon}\big)$. The core construction of the algorithm is based on an iterative mechanism known as an $\epsilon$-well-balanced sampling procedure, which decides which points to query the labels of. At an intuitive level, an $\epsilon$-well-balanced sampling procedure guarantees that the points sampled so far are well representative of the overall dataset (in a weighted sense), while at the same time guaranteeing that a certain weighted condition number, related to the error incurred by least squares regression, is small. Our main algorithm, Algorithm 1, follows the template of [6]: it (i) repeatedly queries the labels of points in an adaptive manner using an $\epsilon$-well-balanced sampling procedure Alg, and (ii) solves a weighted linear regression problem on the set of points sampled in step (i).

Algorithm 1: ASURA (Active semi-SUpervised Regression Algorithm)
Input: $X \in \mathbb{R}^{(n_{un}+n_{lab})\times d}$; accuracy parameter $\epsilon$; any well-balanced sampling procedure Alg (cf. Definition 1 or Algorithm 2).
1. Run Alg$(X, \epsilon)$ to return a subset of points $\{x_1,\dots,x_m\}$ and weights $\{w_1,\dots,w_m\}$.  // $m$ is determined internally by Alg
2. Query the labels $y_i$ of the points $x_i \in X_{un}$ among $x_1,\dots,x_m$; the labels of sampled points in $X_{lab}$ are already known.  // the label query cost of the algorithm
Output: $\hat{\beta} \triangleq \arg\min_{\beta\in\mathbb{R}^d} \sum_{i=1}^m w_i\,(\beta^T x_i - y_i)^2$.

We first begin by defining the notion of a well-balanced sampling procedure.

Definition 1 (well-balanced sampling procedure). [6, Def 2.1] A well-balanced sampling procedure is a randomized algorithm Alg$(X,\epsilon)$ that outputs a set of points $\{x_1,\dots,x_m\}$ and weights $\{w_1,\dots,w_m\}$ as follows: in each iteration $i = 1,\dots,m$, the algorithm chooses a distribution $D_i$ over points in $X$ and samples $x_i \sim D_i$. Let $D$ be the uniform distribution over all points in $X$. Then, Alg is said to be an $\epsilon$-well-balanced sampling procedure if it satisfies the following two properties with probability $3/4$:

1. The matrix $Z = \mathrm{diag}(\{\sqrt{w_i}\}_{i=1}^m)\, U_{[x_1:x_m]} \in \mathbb{R}^{m\times d}$, where $U\Sigma V^T$ is the SVD of $X$ and $U_{[x_1:x_m]}$ are the rows of $U$ corresponding to the points $\{x_1,\dots,x_m\}$, satisfies:
$$\tfrac{3}{4}\, I \;\preceq\; Z^T Z \;\preceq\; \tfrac{5}{4}\, I. \tag{13}$$
Equivalently, for every $\beta \in \mathbb{R}^d$, $\sum_{i=1}^m w_i\,\langle \beta, x_i\rangle^2 \in \big[\tfrac{3}{4}, \tfrac{5}{4}\big]\cdot \mathbb{E}_{x\sim D}\big[\langle\beta, x\rangle^2\big]$.

2. For each $i \in [m]$, define $\alpha_i = \tfrac{D_i(x_i)}{D(x_i)}\cdot w_i$. Then, Alg must satisfy $\sum_{i=1}^m \alpha_i = O(1)$ and $\alpha_i \cdot K_{D_i} = O(\epsilon)$, where $K_{D_i}$ is the re-weighted condition number:
$$K_{D_i} = \sup_{x,\,\beta} \frac{D(x)\,\langle\beta, x\rangle^2}{D_i(x)\,\mathbb{E}_{x'\sim D}\big[\langle\beta, x'\rangle^2\big]}. \tag{14}$$
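For concreteness, here is a minimal sketch of Algorithm 1's two steps in NumPy. The names `asura`, `sampling_procedure`, `query_label`, `is_unlabeled` and `y_lab` are hypothetical interfaces of ours standing in for Alg, the labeling oracle, and the a priori labels; duplicate samples are simply queried again in this sketch.

```python
import numpy as np

def asura(X, y_lab, is_unlabeled, query_label, sampling_procedure, eps):
    """Sketch of Algorithm 1: run a well-balanced sampling procedure, query the
    labels of the sampled unlabeled points, and solve weighted least squares."""
    idx, w = sampling_procedure(X, eps)          # step 1: row indices and weights
    y = np.empty(len(idx))
    for t, i in enumerate(idx):
        # step 2: only unlabeled points cost a label query
        y[t] = query_label(i) if is_unlabeled[i] else y_lab[i]
    # step 3: minimize sum_i w_i (beta^T x_i - y_i)^2 via a sqrt-weighted lstsq
    sw = np.sqrt(w)
    beta_hat, *_ = np.linalg.lstsq(sw[:, None] * X[idx], sw * y, rcond=None)
    return beta_hat
```

Here `y_lab` is indexed like the rows of `X` (its entries at unlabeled positions are never read), and `is_unlabeled` is a boolean mask over the rows of `X`.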
In [6] it is shown that any sampling procedure in step (i) of Algorithm 1 which is well-balanced (Definition 1) guarantees that the resulting solution is approximately optimal.

Theorem 3. [6, Theorem 2.3] For any $0 < \epsilon < 1$, instantiate Algorithm 1 with any $\epsilon$-well-balanced sampling procedure Alg$(X,\epsilon)$. Denoting by $\hat{\beta}$ the output of Algorithm 1, with constant probability,
$$\|X\hat{\beta} - Y\|_2^2 \le (1 + O(\epsilon)) \min_{\beta\in\mathbb{R}^d} \|X\beta - Y\|_2^2. \tag{15}$$

Theorem 3 shows that to solve the SSAR problem, it suffices to design a well-balanced sampling procedure which queries a minimal number of labels of data points in the unlabeled set $X_{un}$. We now present Algorithm 2 and show that it indeed satisfies both of these criteria. The algorithm is parameterized by $\gamma$, chosen to be $\sqrt{\epsilon}/C_0$ for a sufficiently large constant $C_0 > 0$.

Theorem 4. Algorithm 2 is an $\epsilon$-well-balanced sampling procedure, for any $0 < \epsilon < 1$. Furthermore, Algorithm 2 samples the labels of at most $O\!\big(\tfrac{R_X}{\epsilon}\big)$ points in $X_{un}$.

Algorithm 2: $\epsilon$-well-balanced sampling procedure for SSAR
Input: $X \in \mathbb{R}^{n\times d}$.
1. Initialization: $j = 0$; $\gamma = \sqrt{\epsilon}/C_0$; $u_0 = 2d/\gamma$; $l_0 = -2d/\gamma$; $A_0 = 0$.
2. while $(u_j - l_j) + \sum_{i=0}^{j-1}\Phi^{I_d}_i < 8d/\gamma$ do
3.   Define $\Phi^{I_d}_j = \mathrm{Tr}\big((u_j I - A_j)^{-1} + (A_j - l_j I)^{-1}\big)$.
4.   Set coefficient $\alpha'_j = \gamma/\Phi^{I_d}_j$.
5.   Sample a point from the multinomial distribution on $X$ which assigns to $x$ the probability
$$p_x = \frac{U(x)^T\big[(u_j I - A_j)^{-1} + (A_j - l_j I)^{-1}\big]U(x)}{\Phi^{I_d}_j}. \tag{16}$$
// Define the sampled point as $x_j$ and its sampling probability as $p_j$
6.   Update $A_{j+1} \leftarrow A_j + \frac{\gamma}{\Phi^{I_d}_j}\cdot\frac{1}{p_j}\, U(x_j)U(x_j)^T$.
7.   Define the weight $w'_j \triangleq \frac{\gamma}{\Phi^{I_d}_j}\cdot\frac{1}{p_j}$.
8.   Update $u_{j+1} \leftarrow u_j + \frac{\gamma}{1-2\gamma}\cdot\frac{1}{\Phi^{I_d}_j}$ and $l_{j+1} \leftarrow l_j + \frac{\gamma}{1+2\gamma}\cdot\frac{1}{\Phi^{I_d}_j}$.
9.   $j \leftarrow j + 1$.
10. end while
11. Assign $m = j$.
12. Define $\mathrm{mid} \triangleq \frac{u_m + l_m}{2}$ and, for each $j$, set the coefficient $\alpha_j = \alpha'_j/\mathrm{mid}$ and the weight $w_j = w'_j/\mathrm{mid}$.
Output: $\{x_1,\dots,x_m\}$; $\{w_1,\dots,w_m\}$.

Algorithm 2 is a variant of the Randomized BSS algorithm of [16] for spectral sparsification (cf. eq. (13)) with an updated stopping criterion, which we discuss in Remark 1. In each iteration, the algorithm maintains a matrix $A$ which, up to a scaling factor, is the same as the matrix $Z^T Z$ in Definition 1, together with an upper and a lower threshold $u$ and $l$ which are updated iteratively. The update rule is guaranteed to maintain the invariant $l I \preceq A \preceq u I$ in every iteration, which is ensured by the adaptive sampling distribution in eq. (16). Over the iterations, the thresholds $u$ and $l$, which sandwich $Z^T Z$ (equal to $A$ up to scaling), are brought relatively closer to each other until eq. (13) is finally satisfied. It turns out that the desirable behavior is that $u$ and $l$ progressively become larger (which dictates the scaling factor relating $A$ and $Z^T Z$), while at the same time the gap between the two, $u - l$, remains small.

The proof of correctness of Algorithm 2 requires two ingredients:
1. Algorithm 2 is indeed an $\epsilon$-well-balanced sampling procedure (Section 4.1), and
2. the number of label queries made by Algorithm 2 is bounded by $O(R_X/\epsilon)$ (Section 4.2).

Remark 1. Notice the important difference in the stopping criterion of Algorithm 2 compared to that of the Randomized BSS algorithm of [16], which is just the condition $u_j - l_j < 8d/\gamma$. The more aggressive stopping criterion can be used to show that Algorithm 2 terminates within $2d/\gamma^2$ iterations of the while loop in Step 2 almost surely, whereas the Randomized BSS algorithm [16] terminates within $2d/\gamma^2$ iterations only with constant probability. This behavior is critical in bounding the number of label queries made by Algorithm 2.
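The following NumPy sketch shows how we read Algorithm 2. It is illustrative rather than definitive: the function name `well_balanced_sample`, the default `C0 = 4.0`, and the choice of returning row indices in place of the points themselves are our own assumptions.

```python
import numpy as np

def well_balanced_sample(X, eps, C0=4.0, seed=None):
    """Sketch of Algorithm 2: a Randomized-BSS-style sampler over the rows of U."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U, _, _ = np.linalg.svd(X, full_matrices=False)   # rows U(x) of the SVD of X
    gamma = np.sqrt(eps) / C0
    u, l = 2 * d / gamma, -2 * d / gamma              # upper / lower barriers
    A = np.zeros((d, d))
    phi_sum = 0.0
    idx, w_raw = [], []
    while (u - l) + phi_sum < 8 * d / gamma:
        M = np.linalg.inv(u * np.eye(d) - A) + np.linalg.inv(A - l * np.eye(d))
        phi = np.trace(M)                              # potential Phi^{I_d}_j
        p = np.einsum("ij,jk,ik->i", U, M, U) / phi    # sampling distribution, eq. (16)
        j = rng.choice(n, p=p / p.sum())               # renormalize for float safety
        scale = gamma / (phi * p[j])
        A += scale * np.outer(U[j], U[j])              # rank-one barrier update
        idx.append(j)
        w_raw.append(scale)
        u += gamma / ((1 - 2 * gamma) * phi)
        l += gamma / ((1 + 2 * gamma) * phi)
        phi_sum += phi
    mid = (u + l) / 2
    return np.array(idx), np.array(w_raw) / mid        # rescaled weights w_j
```

In terms of the sketch of Algorithm 1 given earlier, the pair returned here can be passed directly as the `sampling_procedure` output (row indices and weights).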
4.1 Algorithm 2 is an $\epsilon$-well-balanced sampling procedure

We first address the question of showing that Algorithm 2 is an $\epsilon$-well-balanced sampling procedure. We provide a brief sketch which addresses the first condition, eq. (13). First observe that the stopping criterion of the algorithm is triggered at the first iteration $j$ for which
$$\sum_{i=0}^{j-1} \frac{4\gamma^2}{(1-4\gamma^2)\,\Phi^{I_d}_i} + \sum_{i=0}^{j-1} \Phi^{I_d}_i \;>\; \frac{8d}{\gamma}.$$
Note that each term in the summation, $\frac{4\gamma^2}{(1-4\gamma^2)\Phi^{I_d}_i} + \Phi^{I_d}_i$, is at least $2\sqrt{\frac{4\gamma^2}{(1-4\gamma^2)\Phi^{I_d}_i}\cdot\Phi^{I_d}_i} \ge 4\gamma$ by the AM-GM inequality. Therefore, once the number of terms in the summation reaches $\frac{8d/\gamma}{4\gamma} = \frac{2d}{\gamma^2}$, the while-loop condition must certainly be violated. This results in a bound on the number of iterations, $m$, before the while loop of Algorithm 2 terminates.

Lemma 1. Almost surely, $m \le 2d/\gamma^2$.

While $m$ is upper bounded by $2d/\gamma^2$, the key result of this section is that the number of iterations of the algorithm also satisfies $m \ge cd/\gamma^2$ with constant probability, for a sufficiently small constant $c$. The implication of the algorithm not terminating too soon is that the upper and lower boundaries maintained by the algorithm are updated over a large number of iterations, and therefore the eigenvalues of the matrix $A$ are very tightly sandwiched, proving the eigenvalue condition in eq. (13). We flesh out the details below. Note that in each iteration, the upper boundary $u_j$ increments by $\frac{\gamma}{1-2\gamma}\cdot\frac{1}{\Phi^{I_d}_j}$. As we show in Appendix A.3 (Lemma 7), $\Phi^{I_d}_j = \Theta(\gamma)$ almost surely, and therefore $u_j$ increments by at least a constant, $\frac{\gamma}{(1-2\gamma)\Phi^{I_d}_j} = \Omega(1)$, in each iteration of Algorithm 2. Since the algorithm does not terminate too soon and $m \ge cd/\gamma^2$ with constant probability, in the final iteration $u_m$ is also $\Omega(d/\gamma^2)$ with constant probability. More precisely, we have the following result.

Lemma 2. For $\gamma < 1/4$ and any $0 \le p < 1$, with probability at least $1 - p$, $u_m \ge \frac{p^2 d}{8\gamma^2}$.

We next relate the matrix $A$ maintained in the algorithm to the matrix $Z^T Z$ in Definition 1. When Algorithm 2 terminates, the stored matrix $A_m$ equals $\sum_{j=0}^{m-1}\frac{\gamma}{\Phi^{I_d}_j}\cdot\frac{1}{p_j}\, U(x_j)(U(x_j))^T$. From the definition of $\mathrm{mid}$ and $w_j$ in Algorithm 2, a quick calculation shows that $Z^T Z = \frac{1}{\mathrm{mid}} A_m = \frac{2}{u_m+l_m} A_m$. Furthermore, since the invariant of Algorithm 2 guarantees that $l_m I \preceq A_m \preceq u_m I$, we have that $\frac{2l_m}{u_m+l_m} I \preceq Z^T Z \preceq \frac{2u_m}{u_m+l_m} I$. Thus, to prove eq. (13), it suffices to show that $\frac{u_m - l_m}{u_m + l_m}$ is small. By the stopping criterion of the algorithm, we are guaranteed that the gap satisfies $u_{m-1} - l_{m-1} \le 8d/\gamma$; formally, we show that $u_m - l_m \le 9d/\gamma$ in Appendix A.4 (Lemma 9), implying that $u_m$ and $l_m$ are close to each other. Moreover, from Lemma 2, with constant probability $u_m = \Omega(d/\gamma^2)$, and hence also $l_m = u_m - O(d/\gamma) = \Omega(d/\gamma^2)$. Thus the ratio $\frac{u_m - l_m}{u_m + l_m} = O(\gamma)$, and eq. (13) is satisfied for sufficiently small $\gamma$.

Lemma 3. With probability at least $3/4$, $(1 - O(\gamma))\, I \preceq Z^T Z \preceq (1 + O(\gamma))\, I$.

We defer the proof of the fact that Algorithm 2 satisfies the second condition in Definition 1 to Appendix A (Lemma 12), since it largely follows from the analysis in [6].
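Collecting the pieces of this subsection in one place (a recap of the argument sketched above, using the quantities from Lemmas 2 and 9, not a new claim):

```latex
Z^T Z \;=\; \tfrac{1}{\mathrm{mid}}\, A_m \;=\; \tfrac{2}{u_m + l_m}\, A_m,
\qquad
l_m I \;\preceq\; A_m \;\preceq\; u_m I
\;\;\Longrightarrow\;\;
\Big(1 - \tfrac{u_m - l_m}{u_m + l_m}\Big) I
\;\preceq\; Z^T Z \;\preceq\;
\Big(1 + \tfrac{u_m - l_m}{u_m + l_m}\Big) I,
```
```latex
\text{and}\qquad
\frac{u_m - l_m}{u_m + l_m}
\;\le\; \frac{9d/\gamma}{\Omega(d/\gamma^2)}
\;=\; O(\gamma),
```

which is exactly the statement of Lemma 3.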
4.2 Bounding the number of label queries made by Algorithm 2

Finally, we upper bound the number of times Algorithm 1 queries the labels of points in the unlabeled set $X_{un}$. We denote this quantity by $lq(X_{un})$. First observe that $lq(X_{un})$ is at most the number of iterations in which the well-balanced sampling procedure Alg in Line 1 of Algorithm 1 samples a point in $X_{un}$; the former is smaller if a point is sampled more than once by Alg. Indeed,
$$\mathbb{E}\big[lq(X_{un})\big] \le \mathbb{E}\left[\sum_{j=0}^{m-1}\sum_{x\in X_{un}} I^{(j)}_x\right], \tag{17}$$
where $m$ is the total number of iterations the algorithm runs for, and $I^{(j)}_x$ is an indicator which is $1$ if $x$ is sampled in Step 5 of Algorithm 2 in iteration $j$ and $0$ otherwise. Note that $\mathbb{E}[I^{(j)}_x] = \mathbb{E}[p^{(j)}_x]$ is the probability of sampling the point $x$, as defined in eq. (16). Recall that in Lemma 1 we show that $m$ is upper bounded by $2d/\gamma^2$ almost surely. By linearity of expectation,
$$\mathbb{E}\big[lq(X_{un})\big] \le \sum_{j=0}^{2d/\gamma^2 - 1}\sum_{x\in X_{un}} \mathbb{E}\big[p^{(j)}_x\big]. \tag{18}$$

Remark 2. Note that the bound on the expected number of label queries in eq. (17) is a sum of non-negative random variables ($I^{(j)}_x$) which are correlated with the number of terms in the summation ($m$). The almost-sure upper bound on $m$ enables this correlation to be decoupled, which is not easily possible with a constant-probability or even high-probability upper bound on $m$, as guaranteed by the Randomized BSS algorithm [16] (discussed in Remark 1).

Before studying $\sum_{j=0}^{2d/\gamma^2-1}\sum_{x\in X_{un}}\mathbb{E}\big[p^{(j)}_x\big]$, we first introduce some relevant notation.

Definition 2. For a PSD matrix $M$, define the potential function $\Phi^M_j = \mathrm{Tr}\big(M(u_j I - A_j)^{-1} + M(A_j - l_j I)^{-1}\big)$, where $u_j$, $l_j$ and $A_j$ are as defined in Algorithm 2.

Simplifying from the definition of $p^{(j)}_x$ results in the following lemma.

Lemma 4. In iteration $j = 0,\dots,m-1$, $\sum_{x\in X_{un}} p^{(j)}_x = \frac{\Phi^D_j}{\Phi^{I_d}_j}$, where $D = \sum_{x\in X_{un}} U(x)(U(x))^T$.

As we show in Appendix A (Lemma 7), the term in the denominator satisfies $\Phi^{I_d}_j \ge \gamma/2$ almost surely. Plugging Lemma 4 into eq. (18), the expected number of labels of points in $X_{un}$ queried by Algorithm 2 is upper bounded by,
$$\mathbb{E}\big[lq(X_{un})\big] \le \frac{2}{\gamma}\sum_{j=0}^{2d/\gamma^2-1}\mathbb{E}\big[\Phi^D_j\big]. \tag{19}$$
Over the course of time, one expects the gap between $u_j$ and $l_j$ to grow (from the update rule on Line 8 of Algorithm 2), while $A_j$ remains tightly sandwiched between the two bounds. It turns out that, as a consequence of this fact, the expected potential $\mathbb{E}[\Phi^D_j]$ is a decreasing function of $j$. We formally prove this result in Appendix A.3 (Lemma 8). From eq. (19), this results in the following upper bound on the number of label queries,
$$\mathbb{E}\big[lq(X_{un})\big] \le \frac{4d}{\gamma^3}\,\mathbb{E}\big[\Phi^D_0\big] = \frac{4\,\mathrm{Tr}(D)}{\gamma^2}, \tag{20}$$
where the last equality, $\mathbb{E}[\Phi^D_0] = \mathbb{E}\big[\mathrm{Tr}\big(D(u_0 I - A_0)^{-1} + D(A_0 - l_0 I)^{-1}\big)\big] = \frac{\gamma\,\mathrm{Tr}(D)}{d}$, is a short calculation from the definition, using the fact that $u_0 = -l_0 = \frac{2d}{\gamma}$ and $A_0 = 0$. The final lemma relates $\mathrm{Tr}(D)$ to the reduced rank $R_X$.

Lemma 5 (Relating $D$ to $R_X$). With $D = \sum_{x\in X_{un}} U(x)(U(x))^T$, $\mathrm{Tr}(D) = R_X$.

Putting together Lemma 5 with eq. (20), and noting the definition $\gamma = \sqrt{\epsilon}/C_0$, results in the final bound on the number of label queries made by Algorithm 2:
$$\mathbb{E}\big[lq(X_{un})\big] \le \frac{4R_X}{\gamma^2} = O\!\left(\frac{R_X}{\epsilon}\right). \tag{21}$$

Remark 3. In the case of ridge regression, Algorithm 1 can be used to construct a weak coreset for $\beta^*$. The advantage of our algorithm over [13] is that the weak coreset can be constructed without having oracle access to the set of all labels. In particular, our algorithm returns a coreset which uses $O(d\,sd_\lambda/\epsilon)$ bits of memory: up to log factors, the weight matrix $W_S$ requires $sd_\lambda/\epsilon$ bits of memory to store, and the $sd_\lambda/\epsilon$ points queried in $X_{un}$ require $d\,sd_\lambda/\epsilon$ bits of memory to store. Solving $\arg\min_{\beta\in\mathbb{R}^d}\|W_S X\beta - W_S Y\|_2^2$ returns a $(1+\epsilon)$-approximate optimizer of the loss. The memory requirement of our algorithm matches that proposed in [13], while only querying a few labels.

5 Instance-dependent lower bound for distributional SSAR

In this section we introduce a distributional, or inductive [6], formulation of semi-supervised active regression as a special case of the agnostic formulation discussed earlier.

Definition 3 (Distributional SSAR).
The setup here is identical to the agnostic setting, but with the labels revealed to the learner corrupted by independent stochastic noise. In particular, at training time, each time the label of $x \in X$ is observed (either in the labeled dataset, or by explicitly querying), a noisy version of the ground-truth label $f(x)$ is revealed. Namely, $y = f(x) + Z$ is observed, where the training-time noise $Z \sim \mathcal{N}(0, \sigma_x^2)$.² The objective is to minimize the generalization error,
$$\sum_{x\in X}\big(\langle\beta, x\rangle - f(x) - Z'_x\big)^2, \tag{22}$$
where, for each $x \in X$, $Z'_x \sim \mathcal{N}(0, \sigma_x^2)$ is sampled independently of the noise at training time.

Note that the distributional SSAR problem is a special case of the agnostic formulation: it corresponds to having infinitely many copies of the labeled and unlabeled datasets, where the label in each copy is the ground truth corrupted by a different realization of the stochastic noise. Therefore, it may be easily verified that the guarantee of ASURA (Theorem 3) carries over here.

Corollary 1. The ASURA algorithm can be implemented efficiently (polynomial in $d$ and the support size of $X$) for the distributional SSAR problem, requiring only SVD computations over a single copy of the matrix $X$. Furthermore, the guarantee of ASURA still carries over: the algorithm queries the labels of at most $O(R_X/\epsilon)$ points, and guarantees a $(1+\epsilon)$ approximation to the best achievable generalization loss (eq. (22)).

It turns out that in the distributional setting, one may also come up with an instance-dependent sample complexity lower bound, where the input design $X$ is fixed, but the adversary has the power to control the ground-truth labels $f(\cdot)$ as well as the noise variances $\sigma_x^2$ in designing the instance.

Theorem 5. Fix any $0 < \epsilon < 1$, $d$ and any $\lambda > 0$, and suppose $X$ satisfies the condition $\sigma_{\min}\big(X(X^T X)^{-1}X_{lab}^T\big) \le \frac{\epsilon}{1-\epsilon}$. In the inductive setting, for each such $X$ and each learner, there exists an instance of SSAR such that if the $\hat{\beta}$ returned by the learner satisfies $\mathbb{E}\big[\|X\hat{\beta} - Y\|_2^2\big] \le (1+\epsilon)\min_{\beta\in\mathbb{R}^d}\mathbb{E}\big[\|X\beta - Y\|_2^2\big]$, then the learner must query the labels of at least $\Omega\!\big(\tfrac{R_X}{\epsilon}\big)$ points.

The proof of this lower bound is quite involved because of both the adaptive nature of the problem and the requirement of an instance-dependent bound. At a high level, we use a hypothesis-testing based construction (Assouad's lemma) with a careful choice of hypothesis space. The main challenge is choosing the underlying instance in a way that recovers the reduced rank as the measure of complexity, which has a very delicate dependence on (the spectrum of) $X_{lab}$ and $X_{un}$. In the interest of space, we defer the details of the proof to Appendix A.5. This result also shows that no commutation structure between the labeled and unlabeled data matrices $X_{lab}$ and $X_{un}$ is needed for the reduced rank to remain the right instance-dependent lower bound.

6 Conclusion

We introduce the semi-supervised active learning problem and prove a query complexity upper bound of $O(R_X/\epsilon)$ for linear regression. In the special case of active ridge regression, this implies a sample complexity upper bound of $O(sd_\lambda/\epsilon)$ labels, which we prove is optimal. The problem also generalizes active kernel ridge regression, where we show a query complexity upper bound of $O(d_\lambda/\epsilon)$, improving on the results of [1]. It is an open question to extend these results to other loss functions and to design near-linear time algorithms for the problem.

7 Acknowledgements

NR was partially supported by NSF Grants IIS-1901252 and CCF-1909499.
² One may instead assume that the noise is sub-Gaussian with variance $\sigma_x^2$. For ease of exposition we stick to Gaussian noise.

References

[1] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel methods with statistical guarantees, 2015.
[2] Alexandr Andoni, Collin Burns, Yi Li, Sepideh Mahabadi, and David P. Woodruff. Streaming complexity of SVMs, 2020.
[3] Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Sharper bounds for regularized data fitting, 2017.
[4] Joshua Batson, Daniel A. Spielman, and Nikhil Srivastava. Twice-Ramanujan sparsifiers, 2009.
[5] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal coresets for least-squares regression. IEEE Transactions on Information Theory, 59(10):6880–6892, Oct 2013. doi: 10.1109/TIT.2013.2272457. URL http://dx.doi.org/10.1109/TIT.2013.2272457.
[6] Xue Chen and Eric Price. Active regression via linear-sample sparsification, 2019.
[7] Michał Dereziński and Manfred K. Warmuth. Unbiased estimates for linear regression via volume sampling, 2018.
[8] Michał Dereziński, Manfred K. Warmuth, and Daniel Hsu. Leveraged volume sampling for linear regression, 2018.
[9] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions, 2007.
[10] Thomas Drugman, Janne Pylkkonen, and Reinhard Kneser. Active and semi-supervised learning in ASR: Benefits on the acoustic and language models, 2019.
[11] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1 of Springer Series in Statistics. Springer, New York, 2001. URL https://web.stanford.edu/~hastie/ElemStatLearn/.
[12] Mingfei Gao, Zizhao Zhang, Guo Yu, Sercan O. Arik, Larry S. Davis, and Tomas Pfister. Consistency-based semi-supervised active learning: Towards minimizing labeling cost, 2020.
[13] Praneeth Kacham and David Woodruff. Optimal deterministic coresets for ridge regression. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 4141–4150, Online, 26–28 Aug 2020. PMLR.
[14] George Kimeldorf and Grace Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82–95, 1971.
[15] Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. Introducing geometry in active learning for image segmentation, 2015.
[16] Yin Tat Lee and He Sun. Constructing linear-sized spectral sparsification in almost-linear time, 2015.
[17] Tong Luo, K. Kramer, S. Samson, A. Remsen, D. B. Goldgof, L. O. Hall, and T. Hopkins. Active learning to recognize multiple types of plankton. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), volume 3, pages 478–481, 2004. doi: 10.1109/ICPR.2004.1334570.
[18] Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler, and David P. Woodruff. On coresets for logistic regression, 2018.
[19] Cameron Musco and Christopher Musco. Recursive sampling for the Nyström method, 2017.
[20] Arti Patle and Deepak Singh Chouhan. SVM kernel functions for classification. In 2013 International Conference on Advances in Technology and Engineering (ICATE), pages 1–9, 2013. doi: 10.1109/ICAdTE.2013.6524743.
[21] Vern I. Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2016. doi: 10.1017/CBO9781316219232.
[22] Phill Kyu Rhee, Enkhbayar Erdenee, Shin Dong Kyun, Minhaz Uddin Ahmed, and Songguo Jin. Active and semi-supervised learning for object detection with imperfect data. Cognitive Systems Research, 45:109–123, 2017. doi: 10.1016/j.cogsys.2017.05.006. URL https://www.sciencedirect.com/science/article/pii/S1389041716301127.
[23] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach, 2018.
[24] Burr Settles. Active learning literature survey, 2009.
[25] Hiroyuki Takeda, Sina Farsiu, and Peyman Milanfar. Kernel regression for image processing and reconstruction. IEEE Transactions on Image Processing, 16(2):349–366, 2007. doi: 10.1109/TIP.2006.888330.
[26] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66, March 2002. doi: 10.1162/153244302760185243. URL https://doi.org/10.1162/153244302760185243.
[27] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014. doi: 10.1561/0400000060. URL http://dx.doi.org/10.1561/0400000060.
[28] Taisuke Yasuda, David Woodruff, and Manuel Fernandez. Tight kernel query complexity of kernel ridge regression and kernel k-means clustering. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7055–7063. PMLR, 09–15 Jun 2019. URL http://proceedings.mlr.press/v97/yasuda19a.html.
[29] Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML 2003 Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, pages 58–65, 2003.
[30] Xiaojin Jerry Zhu. Semi-supervised learning literature survey, 2005.