# Active Heteroscedastic Regression

Kamalika Chaudhuri (University of California, San Diego), Prateek Jain (Microsoft Research, India), Nagarajan Natarajan (Microsoft Research, India). Authors listed in alphabetical order. Correspondence to: Nagarajan Natarajan.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract. An active learner is given a model class $\Theta$, a large sample of unlabeled data drawn from an underlying distribution, and access to a labeling oracle that can provide a label for any of the unlabeled instances. The goal of the learner is to find a model $\theta \in \Theta$ that fits the data to a given accuracy while making as few label queries to the oracle as possible. In this work, we give a theoretical analysis of the label requirement of active learning for regression under a heteroscedastic noise model, where the noise depends on the instance. We provide bounds on the convergence rates of active and passive learning for heteroscedastic regression. Our results illustrate that, just as in binary classification, some partial knowledge of the nature of the noise can lead to significant gains in the label requirement of active learning.

## 1. Introduction

An active learner is given a model class $\Theta$, a large sample of unlabeled data drawn from an underlying distribution $P_x$, and access to a labeling oracle $O$ which can provide a label for any of the unlabeled instances. The goal of the learner is to find a model $\theta \in \Theta$ that fits the data to a given accuracy while making as few label queries to the oracle as possible.

There is a large body of theoretical literature on active learning, most of it in the context of binary classification in the PAC model (Balcan et al., 2009; Hanneke, 2007; Dasgupta et al., 2007; Beygelzimer et al., 2009; Awasthi et al., 2014; Zhang and Chaudhuri, 2014). For classification, the problem is known to be particularly difficult when no classifier in the class perfectly fits the labels induced by the oracle's responses. Prior work in the PAC model has shown that the difficulty is alleviated when the noise is more benign, for example when there is a ground-truth classifier that induces a labeling and the oracle's responses are these labels corrupted by certain kinds of noise (Hanneke, 2007; Awasthi et al., 2014; Zhang and Chaudhuri, 2014; Awasthi et al., 2016). In particular, significant improvements in label complexity have been obtained under the Tsybakov noise conditions, which model the realistic case of noise that decreases as we move further from the decision boundary.

Active learning for regression, however, is significantly less well understood. In particular, we only have a theoretical understanding of the two extreme cases: no noise (that is, no model mismatch) and arbitrary model mismatch. Chaudhuri et al. (2015) show that allowing the learner to actively select instances for labeling under regression with no model mismatch can only improve the convergence rates by a constant factor; moreover, in many natural cases, such as when the unlabeled data is drawn from a spherical Gaussian, there is no improvement. Sabato and Munos (2014) study the other extreme, where arbitrary model mismatch is allowed, and provide an algorithm that attempts to learn the locations of the mismatch through increasingly refined partitions of the space, and then learns a model accordingly.
However, if the model mismatch is allowed to be arbitrary, this algorithm either requires an extremely refined partition, leading to a very high running time, or a large number of labels. More recently, Anava and Mannor (2016) study an online learning approach for estimating heteroscedastic variances and provide general task-dependent regret bounds, but no exact parameter recovery guarantees.

In this paper we take a step towards closing this gap by considering active regression under a realistic yet more benign noise model, in which the variance of the label noise depends linearly on the example $x$. Specifically, the oracle's response on an unlabeled example $x$ is distributed as $N(\langle x, \beta^* \rangle, \sigma_x^2)$ with $\sigma_x = \langle f^*, x \rangle$; here $\beta^*$ is the unknown vector of regression coefficients and $f^*$ is an unknown parameter vector. In classical statistics, this framework is called heteroscedastic regression, and it is known to arise in econometric and medical applications. While the usual least squares estimator remains statistically consistent under heteroscedasticity, we find that even in the passive learning case, optimal convergence rates for heteroscedastic regression are not known.

We thus begin with a convergence analysis of heteroscedastic regression for passive learning when the distribution $P_x$ over the unlabeled examples is a spherical Gaussian in $d$ dimensions. We show that even in this very simple case the usual least squares estimator is sub-optimal, even when the noise model $f^*$ is known to the learner. Instead, we propose a weighted least squares estimator, and show that its rate of convergence is $O(\|f^*\|^2(1/n + d^2/n^2))$ when the noise model is known, and $\tilde{O}(\|f^*\|^2 d/n)$ when it needs to be estimated from the data; here, $n$ denotes the number of labeled examples used to obtain the estimator. The latter matches the convergence rate of the least squares estimator for the usual homoscedastic linear regression, where $\|f^*\|^2$ plays the role of the variance $\sigma^2$.

We next turn to active heteroscedastic regression and propose a two-stage active estimator. We show that when the noise model is known, the convergence rate of active heteroscedastic regression is $O(\|f^*\|^2(1/n + d^2/n^4))$, a small improvement over passive learning. However, in the more realistic case where the noise model is unknown, the rate becomes $O(\|f^*\|^2(1/n + d^2/n^2))$, which improves over the passive estimator by a factor of $d$. Our results extend to the case when the distribution $P_x$ over unlabeled examples is an arbitrary Gaussian with covariance matrix $\Sigma$ and the error is measured in the $\Sigma$-norm.

Our work illustrates that, just as in binary classification, even partial knowledge of the nature of the model mismatch significantly helps the label complexity of active learning. It is a first step towards a study of active maximum likelihood estimation under controlled yet realistic forms of noise. There are several avenues for future work. For simplicity, the convergence bounds we present relate to the case when the distribution $P_x$ is a Gaussian. An open problem is to combine our techniques with those of Chaudhuri et al. (2015) and establish convergence rates for general unlabeled distributions. Another interesting line of future work is to come up with other realistic noise models that apply to maximum likelihood estimation problems such as regression and logistic regression, and to determine when active learning can help under these noise models. A summary of our main results is given in Table 1.
We conclude the paper by presenting simulations that support our theoretical bounds, as well as experiments on real-world data.

## 2. Problem Setup and Preliminaries

Let $x$ denote instances in $\mathbb{R}^d$, and let $P_x$ denote a fixed unknown distribution over instances. The response $y$ is generated according to the model $y = \langle \beta^*, x \rangle + \eta_x$, where $\eta_x$ denotes instance-dependent corruption and $\beta^*$ is the ground-truth parameter. In this work, we consider the following heteroscedastic model:

$$\eta_x \sim N(0, \sigma^2(x)), \qquad (1)$$

with a standard parametric (linear) model for the heteroscedastic noise:

$$\eta_x \sim N(0, \langle f^*, x \rangle^2), \qquad (2)$$

for some unknown vector $f^*$, which in general differs from $\beta^*$. Each response is independently corrupted via (2). The goal is to recover $\beta^*$ using instances drawn from $P_x$ and their responses sampled from $N(\langle \beta^*, x \rangle, \langle f^*, x \rangle^2)$.

|  | NOISE MODEL KNOWN | NOISE MODEL ESTIMATED |
|---|---|---|
| PASSIVE | $O\big(\lVert f^*\rVert^2 (1/n + d^2/n^2)\big)$ | $\tilde{O}\big(\lVert f^*\rVert^2\, d/n\big)$ |
| ACTIVE | $O\big(\lVert f^*\rVert^2 (1/n + d^2/n^4)\big)$ | $O\big(\lVert f^*\rVert^2 (1/n + d^2/n^2)\big)$ |

Table 1. Summary of our results: rates of convergence of the estimators, i.e. $\|\hat{\beta} - \beta^*\|_2^2$, under the heteroscedastic noise model (2). Here, $d$ is the data dimensionality and $n$ is the number of labeled examples used for estimation.

Remark 1. The noise $\eta_x$ can be sampled from any sub-Gaussian distribution with $E[\eta_x] = 0$ and bounded second moment $E[\eta_x^2] \le \sigma^2$ (for some constant $\sigma$). For simplicity, we will consider the Gaussian model (1).

Our approach is based on the maximum likelihood estimator (MLE) for regression. In the homoscedastic setting (i.e. $\sigma(x)^2 = \sigma^2$ for all $x$ in (1)), the MLE is known to give minimax optimal rates (a notable exception is Stein's estimator, which may do better in high-dimensional spaces (Stein et al., 1956)). The standard least squares estimator computed on $n$ i.i.d. training instances $(x_i, y_i)$, $i = 1, 2, \ldots, n$, is given by

$$\hat{\beta}_{LS} = (X^T X)^{-1} X^T y, \qquad (3)$$

where $X \in \mathbb{R}^{n \times d}$ has the $x_i$ as rows and $y \in \mathbb{R}^n$ is the vector of responses, and is the solution to the minimization problem $\hat{\beta}_{LS} = \arg\min_\beta \sum_{i=1}^n (\langle \beta, x_i \rangle - y_i)^2$. In the heteroscedastic setting, it is easy to show that the standard least squares estimator is consistent.

Remark 2. The standard least squares estimator is consistent for the heteroscedastic noise model (2): assuming $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, n$, are drawn i.i.d. from the standard multivariate Gaussian, we have the rate $\|\hat{\beta}_{LS} - \beta^*\|_2^2 = O(\|f^*\|^2 d/n)$.

While the estimator (3) is consistent, it does not exploit knowledge of the noise model, and it does not give better rates even when the noise model $f^*$ is known exactly. We therefore look at the maximum likelihood estimator for the heteroscedastic noise (1), which is given by the weighted least squares estimator (sometimes referred to as the generalized least squares estimator):

$$\hat{\beta}_{GLS} = \Big(\sum_{i=1}^n w_i x_i x_i^T\Big)^{-1} \sum_{i=1}^n w_i x_i y_i, \qquad (4)$$

where $w_i = \frac{1}{\sigma(x_i)^2}$. When the weights are known, the weighted estimator is known to be the right estimator to study; in particular, it is the minimum variance unbiased linear estimator (Theorem 10.7, Greene (2002)). However, we do not know of strong learning rate guarantees for the weighted least squares estimator in general, or for the model (2) in particular, compared to the ordinary least squares estimator. This raises two important questions, which we answer in the subsequent sections.

1. What are the rates of convergence of the maximum likelihood estimator for the heteroscedastic model when the noise model, i.e. $f^*$, is unknown?
2. Can we achieve a better label requirement via active learning?
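As a concrete illustration of the two estimators before the formal problem statement, the following is a minimal simulation sketch (our own, not from the paper): it draws data from the heteroscedastic model (2) with a known noise vector $f^*$ and compares the ordinary least squares estimator (3) against the weighted least squares estimator (4). The variable names and the tiny offset added to the weights for numerical safety are ours.

```python
# Minimal sketch (not the paper's code): data from the heteroscedastic model (2),
# comparing ordinary least squares (3) with weighted least squares (4) when f* is known.
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 2000

beta_star = rng.standard_normal(d)          # ground-truth regression vector beta*
f_star = rng.standard_normal(d)
f_star /= np.linalg.norm(f_star)            # fix ||f*|| = 1 for convenience

X = rng.standard_normal((n, d))             # x_i ~ N(0, I_d)
sigma_x = np.abs(X @ f_star)                # per-instance noise level |<f*, x_i>|
y = X @ beta_star + sigma_x * rng.standard_normal(n)   # y_i ~ N(<beta*, x_i>, <f*, x_i>^2)

# (3): ordinary least squares.
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# (4): weighted least squares with w_i = 1 / <f*, x_i>^2
# (a tiny offset avoids division by zero on nearly noise-free points).
w = 1.0 / (sigma_x ** 2 + 1e-12)
XtW = X.T * w                               # equals X^T W for diagonal W
beta_gls = np.linalg.solve(XtW @ X, XtW @ y)

print("OLS error :", np.linalg.norm(beta_ls - beta_star) ** 2)
print("GLS error :", np.linalg.norm(beta_gls - beta_star) ** 2)
```

On a typical run the weighted estimator's error is substantially smaller, in line with the gap between the two estimators analyzed in Section 3.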
The problem is formally stated as follows. Given a set of $m$ instances $U = \{x_1, x_2, \ldots, x_m\}$ sampled i.i.d. from the underlying $P_x$, a label budget $n \le m$, and access to a label oracle $O$ that generates responses $y_i$ according to the heteroscedastic noise model (2), we want an estimator $\hat{\beta}$ of the regression parameter $\beta^*$ such that the estimation error is small, i.e. $\|\hat{\beta} - \beta^*\|_2 \le O(\epsilon)$.

Remark 3. Existing active regression methods (Sabato and Munos, 2014; Chaudhuri et al., 2015) do not consider the heteroscedastic noise model. Note that when $f^*$ is known exactly, one can reduce the heteroscedastic model to a homoscedastic one by scaling each instance $x$ and its response by $1/\langle f^*, x \rangle$. However, we still may not be able to apply the existing active learning results to the transformed problem, as the modified data distribution may no longer satisfy the required regularity properties. The resulting estimators do not yield advantages over passive learning, even in simple cases such as when $P_x$ is a spherical Gaussian.

Notation. $I_d$ denotes the identity matrix of size $d$. We use bold letters to denote vectors and capital letters to denote matrices.

## 3. Basic Idea: Noise Model is Known Exactly

To motivate our approach, we begin with the basic heteroscedastic setting in which $f^*$ in (2) is known exactly. Even in this arguably simple setting, the rates for passive and active learning are a priori not clear, and the exercise turns out to be non-trivial. The results and the analysis here help gain insight into the label complexities achievable via passive and active learning strategies.

In the standard (passive) learning setting, we sample $n$ instances uniformly from the set $U$ and compute the maximum likelihood estimator (4) with weights $w_i = 1/\langle f^*, x_i \rangle^2$. The procedure is given in Algorithm 1. The resulting estimator is unbiased, i.e. $E[\hat{\beta}_{GLS} \mid X] = \beta^*$. Let $W$ denote the diagonal weighting matrix with $W_{ii} = w_i$. The variance of the estimator is $\mathrm{Var}(\hat{\beta}_{GLS} \mid X) = (X^T W X)^{-1}$.

The question of interest is if and when the weighted estimator $\hat{\beta}_{GLS}$ is qualitatively better than the ordinary least squares estimator $\hat{\beta}_{LS}$. The following theorem shows that the variance of the latter, and in turn its estimation error, can be much larger; in particular, the gap between their estimation errors is at least a factor of the dimensionality $d$.

Theorem 1 (Passive Regression With Noise Oracle). Let $\hat{\beta}_{GLS}$ denote the estimator in (4) (or the output of Algorithm 1), where $x_i \sim N(0, I_d)$ i.i.d., with label budget $n \ge 2d \ln d$, and let $\hat{\beta}_{LS}$ denote the ordinary least squares estimator (3). There exist positive constants $C$ and $c \ge 1$ such that, with probability at least $1 - d/n^c$, both of the following statements hold:

1. $\|\hat{\beta}_{GLS} - \beta^*\|_2^2 \le C \|f^*\|^2 \left( \frac{1}{n} + \frac{d^2 \ln n}{n^2} \right)$,
2. $\|\hat{\beta}_{LS} - \beta^*\|_2^2 \ge \frac{\|f^*\|^2 d}{C n}$.

Remark 4. We present the results for instances sampled from $N(0, I_d)$ for clarity. The estimation error bounds extend naturally to Gaussian distributions with an arbitrary covariance matrix $\Sigma$; in that case, the bounds (in Theorem 1, for example) continue to hold for the estimation error measured w.r.t. $\Sigma$, i.e. $(\hat{\beta} - \beta^*)^T \Sigma (\hat{\beta} - \beta^*)$. Furthermore, with some calculations, we can obtain analogous bounds for sub-Gaussian distributions, with distribution-specific constants appearing in the resulting bounds.

Remark 5. In Theorem 1, when $n > d$, the $d^2/n^2$ term is the lower-order term; thus, up to constants, the error of the weighted least squares estimator is at most $\|f^*\|^2 (1/n)$, while that of the ordinary least squares estimator is at least $\|f^*\|^2 (d/n)$. Thus, if the noise model is known, the weighted least squares estimator can give a factor-$d$ improvement in the convergence rate.

Remark 6 (Technical challenges).
The proofs of the key results in this paper involve controlling quantities such as sums of ratios of Gaussian random variables and ratios of chi-squared random variables, which do not even have expectations, let alone higher moments; standard concentration arguments therefore cannot be applied directly. However, in many of these cases we can show that our error bounds hold with sufficiently high probability.

The following lemma is key to proving Theorem 1; the proof sketch illustrates some of the aforementioned technical challenges. Unlike typical results in this domain, which bound $\mathrm{tr}(A^{-1})$ by providing concentration bounds for $A$, we bound $\mathrm{tr}(A^{-1})$ by providing a lower bound on each eigenvalue of $A$.

Lemma 1. Let $X \in \mathbb{R}^{n \times d}$, where the rows $x_i$ are sampled i.i.d. from $N(0, I_d)$. Assume $n \ge 2d \ln d$. Let $W$ denote a diagonal matrix with $W_{ii} = 1/\langle x_i, f \rangle^2$ for a fixed $f \in \mathbb{R}^d$, and let $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d$ denote the eigenvalues of $X^T W X$. Then, with probability at least $1 - d/n^c$:

1. $\sigma_d(X^T W X) \ge \frac{n}{2\|f\|^2}$, and
2. $\sigma_i(X^T W X) \ge \frac{C n^2}{d \|f\|^2 \ln n}$ for $i = 1, 2, \ldots, d-1$,

where $c \ge 1$ and $C > 0$ are constants.

Proof. We give a sketch of the proof here (see Appendix B.2 for details). To lower bound the smallest eigenvalue, we first show that the corresponding eigenvector is very close to $f$ with sufficiently large probability. To do so, we exploit the fact that the smallest eigenvalue is at most $n/\|f\|^2$, which can be readily seen. For the second part, we consider the variational characterization of the $(d-1)$-st eigenvalue,
$$\sigma_{d-1}(X^T W X) = \max_{U : \dim(U) = d-1} \ \min_{v \in U, \|v\| = 1} v^T X^T W X v,$$
and take the particular subspace orthogonal to $f$ to get the desired bound. One key challenge is controlling quantities of the form $\sum_{i=1}^n g_i^2 / h_i^2$, where $g_i$ and $h_i$ are i.i.d. Gaussian variables. We use a blocking technique based on the magnitude of $\langle f, x_i \rangle$ and lower bound the quantity using just the first block (as all the quantities involved are positive). This requires proving a bound on the order statistics of i.i.d. Gaussian random variables (Lemma 7 in Appendix A).

Theorem 1 shows that weighting nearly clean instances (i.e. those with $\langle f^*, x \rangle \approx 0$) much more heavily than noisy instances yields a highly accurate estimator of $\beta^*$. But can we instead prefer to query labels on instances for which we know a priori that the response will be relatively noise-free? This motivates a simple active learning strategy: in principle, if we actually know $f^*$, we could query the labels of instances with low noise and hope to obtain an accurate estimator. The active learning procedure is given in Algorithm 2. Besides the label budget $n$, it takes another parameter $\tau$ as input, which is a threshold on the noise level. We state the convergence guarantee for Algorithm 2 below.

Theorem 2 (Active Regression with Noise Oracle). Consider the output estimator $\hat{\beta}$ of Algorithm 2, with input label budget $n \ge 2d \ln d$, unlabeled set $U$ with $|U| = m$ and $x_i \sim N(0, I_d)$ i.i.d., and $\tau = 2n/m$. Then, with probability at least $1 - 1/n^c$,
$$\|\hat{\beta} - \beta^*\|_2^2 \le C \|f^*\|^2 \left( \frac{1}{n} + \frac{d^2 \ln n}{m^2} \right)$$
for some constants $c > 1$ and $C > 0$.

Remark 7. We observe that the estimation error via active learning (Theorem 2) goes to $O(1/n)$ as the number of unlabeled examples $m$ becomes larger. Note that $O(1/n)$ is the error of the one-dimensional problem and is much better than the $d^2/n^2$ we get from uniform sampling.

Remark 8. If we have $m \ge n^2$ unlabeled samples, then active learning (Theorem 2) achieves a better convergence rate than passive learning (Theorem 1): the lower-order term in the active bound is $O(d^2/n^4)$, versus $O(d^2/n^2)$ for passive learning. The improvement is especially pronounced when $n < d^2$ (as we also observe in simulations).
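The selection rule in Algorithm 2 is short enough to sketch in code. The snippet below (our own illustration, using the threshold $\tau = 2n/m$ from Theorem 2 and a tiny numerical offset in the weights) queries labels only for instances with $|\langle f^*, x \rangle| \le \tau$ and then solves the weighted least squares problem on the selected set.

```python
# Sketch of Algorithm 2 (active regression with a noise oracle): select low-noise
# instances, query their labels, and run weighted least squares. Illustrative only.
import numpy as np

rng = np.random.default_rng(1)
d, n = 10, 500
m = 2 * n ** 2                                   # size of the unlabeled pool

beta_star = rng.standard_normal(d)
f_star = rng.standard_normal(d)
f_star /= np.linalg.norm(f_star)

U = rng.standard_normal((m, d))                  # unlabeled instances x_i ~ N(0, I_d)

tau = 2.0 * n / m                                # threshold used in Theorem 2
L = U[np.abs(U @ f_star) <= tau][:n]             # at most n low-noise instances

sigma_L = np.abs(L @ f_star)
y_L = L @ beta_star + sigma_L * rng.standard_normal(len(L))   # oracle responses

w = 1.0 / (sigma_L ** 2 + 1e-12)                 # weights 1/<f*, x_i>^2 as in Algorithm 2
XtW = L.T * w
beta_hat = np.linalg.solve(XtW @ L, XtW @ y_L)
print("active (noise oracle) error:", np.linalg.norm(beta_hat - beta_star) ** 2)
```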
The proof of Theorem 2 relies on the two lemmas stated below.

Lemma 2. Let $X \in \mathbb{R}^{n \times d}$ denote the design matrix whose rows $x_i$ are sampled from $U$ such that they satisfy $|\langle x_i, f \rangle| \le \|f\|\,\tau$ for a fixed $f \in \mathbb{R}^d$. Assume $n \ge 2d \ln d$. Let $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d$ denote the eigenvalues of $X^T X$. Then, with probability at least $1 - \frac{1}{n^c}$:

1. $\sigma_d(X^T X) \ge C\, n \tau^2$, and
2. $\sigma_i(X^T X) \ge \frac{1}{2} n$ for $i = 1, 2, \ldots, d-1$,

for some constants $C > 0$ and $c \ge 1$.

Lemma 3. For each $x_i \in U$, where $|U| = m$, define $g_i = \langle x_i, z \rangle$ for an arbitrary fixed $z \in \mathbb{R}^d$. Then, with probability at least $1 - \exp(-m\tau/3)$, the number of low-noise instances satisfies $|\{i : |g_i| \le \|z\|\,\tau\}| \ge \frac{m\tau}{3}$.

Algorithm 1  Passive Regression With Noise Oracle
Input: labeling oracle $O$, instances $U = \{x_i, i \in [m]\}$, label budget $n$, noise model $f^*$.
1. Choose a subset $L$ of size $n$ uniformly at random from $U$. Query their labels using $O$.
2. Estimate $\hat{\beta}$ using (4) on $L$, with $w_i = 1/\langle f^*, x_i \rangle^2$.
Output: $\hat{\beta}$.

Algorithm 2  Active Regression With Noise Oracle
Input: labeling oracle $O$, noise model $f^*$, instances $U = \{x_i, i \in [m]\}$, label budget $n$, noise tolerance $\tau$.
1. Choose a subset $L$ of size at most $n$ from $U$ with expected noise within the given tolerance $\tau$, i.e. $|\langle x_i, f^* \rangle| \le \tau$ for all $x_i \in L$. Query their labels using $O$.
2. Estimate $\hat{\beta} = (X^T W X)^{-1} X^T W y$, where $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$ are formed from $L$, and $W$ is the diagonal matrix with $W_{ii} = 1/\langle f^*, x_i \rangle^2$.
Output: $\hat{\beta}$.

## 4. Estimating Noise: Algorithms and Analysis

In this section, we first show that we can obtain a consistent estimator of $f^*$ as long as we have a reasonably good estimate of $\beta^*$. Let $\hat{\beta}_0$ denote the ordinary least squares estimator of $\beta^*$, obtained via (3) on $m_1$ labeled instances drawn i.i.d. from $N(0, I_d)$. The largest eigenvector of the residual-weighted empirical covariance matrix

$$\hat{S} = \sum_i (y_i - \langle x_i, \hat{\beta}_0 \rangle)^2\, x_i x_i^T \qquad (5)$$

gives a sufficiently good estimate of $f^*$. This is established formally in the following lemma.

Lemma 4. Let $m_1 = \Omega(d \log^3 d)$. Then, with probability at least $1 - \frac{1}{m_1^5}$, the largest eigenvector $\hat{f}$ of $\hat{S}$ in (5) converges to $f^*$:
$$\|\hat{f} - f^*\|_2^2 \le C_1\, E\big[\|\hat{\beta}_0 - \beta^*\|_2^2\big]$$
for some positive constant $C_1$, where the expectation is with respect to the randomness in the estimator $\hat{\beta}_0$.

We first discuss the implications of using the estimated $\hat{f}$ to form the generalized least squares estimator (4), and then present the active learning solution.

### 4.1. Weighted Least Squares

We now consider a simple (passive) learning algorithm for the setting where the noise model is estimated, based on the weighted least squares solution discussed in Section 3. We first obtain a good estimate of $f^*$ (as in Lemma 4), and then compute the weighted least squares estimator $\hat{\beta} = (X^T \hat{W} X)^{-1} X^T \hat{W} y$, where $\hat{W}$ is the diagonal matrix of inverse noise variances obtained from the estimate $\hat{f}$ with a small additive offset $\gamma$. The procedure is presented in Algorithm 3.

Remark 9. Algorithm 3 can be thought of as a special case of the well-known iterative weighted least squares procedure (with just one iteration), which has been studied in the past (Carroll et al., 1988). First estimating the weights and then computing the weighted estimator is a well-known heuristic; the approach has been widely used for decades in multiple communities, including econometrics and bioinformatics (Harvey, 1976; Greene, 2002). However, we do not know of strong convergence rates for this solution. To our knowledge, the most comprehensive analysis is that of Carroll et al. (1988). Their analysis is not directly applicable to us for two reasons: (i) it focuses on a maximum likelihood estimate of the parameters of the heteroscedastic noise model and does not apply to our noise model (2), and (ii) it relies on the noise being smooth (to obtain a tighter Taylor series approximation). More importantly, their analysis conceals significant factors in both $d$ and $n$, and the resulting statements about convergence rates are not useful for our purposes (see Appendix C).
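To make the estimation step concrete, here is a short sketch (ours, not the paper's code) of the residual-weighted covariance (5) followed by one round of weighted least squares as in Algorithm 3. Computing (5) on the pilot sample is one reading of the procedure, and the value used for the offset $\gamma^2$ is only illustrative.

```python
# Sketch of noise estimation via (5) and one round of weighted least squares
# with estimated weights (Algorithm 3). Parameter choices are illustrative.
import numpy as np

rng = np.random.default_rng(2)
d, n, m1 = 10, 2000, 2000

beta_star = rng.standard_normal(d)
f_star = rng.standard_normal(d)
f_star /= np.linalg.norm(f_star)

def query_labels(X):
    """Oracle responses under the heteroscedastic model (2)."""
    return X @ beta_star + np.abs(X @ f_star) * rng.standard_normal(len(X))

# Pilot stage: ordinary least squares estimate beta_0 on m1 labeled points.
X0 = rng.standard_normal((m1, d))
y0 = query_labels(X0)
beta_0 = np.linalg.lstsq(X0, y0, rcond=None)[0]

# Residual-weighted empirical covariance (5); its top (unit-norm) eigenvector
# recovers the direction of f*, up to sign.
r2 = (y0 - X0 @ beta_0) ** 2
S = (X0 * r2[:, None]).T @ X0 / m1
f_hat = np.linalg.eigh(S)[1][:, -1]

# Weighted least squares on a fresh labeled set, with a small additive offset
# gamma^2 in the estimated variances (the exact choice in Theorem 3 depends on d and n).
X = rng.standard_normal((n, d))
y = query_labels(X)
gamma2 = np.sqrt(d / n)
w = 1.0 / ((X @ f_hat) ** 2 + gamma2)
XtW = X.T * w
beta_hat = np.linalg.solve(XtW @ X, XtW @ y)

print("f  direction error:", min(np.linalg.norm(f_hat - f_star), np.linalg.norm(f_hat + f_star)))
print("beta error        :", np.linalg.norm(beta_hat - beta_star))
```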
Theorem 3. Consider the output estimator $\hat{\beta}$ of Algorithm 3, with label budget $n \ge 2d \ln d$ and offset $\gamma^2 = \sqrt{d/n}$. Then, with probability at least $1 - 1/n^c$,
$$\|\hat{\beta} - \beta^*\|_2^2 \le \frac{C \|f^*\|^2\, d \ln^4 n}{n}$$
for some constants $C > 0$ and $c \ge 1$.

Remark 10. Note that the above result holds for the specific choice of $\gamma$. When $\gamma = 0$, we obtain the weighted least squares estimator analogous to the one used in Algorithm 1. However, when the weights are estimated with $\gamma = 0$, the resulting estimator $\hat{\beta}$ has a poor convergence rate; in particular, we observe empirically that the error $\|\hat{\beta} - \beta^*\|^2$ scales as $O(d^3)$ in the dimension.

### 4.2. Active Regression

We now show that active learning can help overcome the inadequacy of the passive weighted least squares solution. The proposed active regression algorithm, presented in Algorithm 4, consists of two stages. In the first stage, we obtain an estimate $\hat{f}$ similar to that in Algorithm 3. Note that if $\hat{f}$ is indeed very close to $f^*$, then $\hat{f}$ serves as a good proxy for selecting instances whose labels are nearly noise-free. To this end, we sample instances that have sufficiently small noise, $|\langle \hat{f}, x \rangle| \le \tau$, where $\tau$ is an input parameter to the algorithm. If $\hat{f}$ were exact, the algorithm would reduce to the strategy outlined in Algorithm 2. Our algorithm uses a single round of interaction (in light of the analysis presented in the passive learning setting) to achieve a good estimate of the underlying $\beta^*$, akin to the active MLE estimation algorithm studied by Chaudhuri et al. (2015).

Lemma 5. Let $\hat{f}$ denote an estimator of $f^*$ satisfying $\|\hat{f} - f^*\|_2 \le \epsilon$. For a given $\tau$, let $L$ denote a set of $|L| \ge 2d \log d$ instances sampled from the $m$ unlabeled instances $U$ such that $|\langle \hat{f}, x_i \rangle| \le \tau$ for all $x_i \in L$, and let $y_i$ denote their corresponding labels. Consider the ordinary least squares estimator obtained using $L$, i.e. $\hat{\beta} = (X^T X)^{-1} X^T y$ with $X$ and $y$ formed from $L$. Then, with probability at least $1 - \frac{1}{n^c}$,
$$\|\hat{\beta} - \beta^*\|_2^2 \le C \|f^*\|^2 (\tau^2 + \epsilon^2) \left( \frac{1}{m\tau^3} + \frac{d-1}{n} \right)$$
for some constants $C > 0$ and $c \ge 1$.

Remark 11. The bound in the above lemma recovers the known-variance case discussed in Theorem 2, where the estimation error $\epsilon = 0$ and the choice of $\tau = 2n/m$.

Compared to the passive learning error bound in Theorem 3, we hope to gain as long as we can choose $\tau$ sufficiently small while still guaranteeing that the number of samples $m_2$ in Step 4 of Algorithm 4 is sufficiently large. The following theorem shows that this is indeed the case, and that the proposed active learning solution achieves the optimal learning rate.

Theorem 4 (Active Regression with Noise Estimation). Consider the output estimator $\hat{\beta}$ of Algorithm 4, with input label budget $n$, number of unlabeled examples $m \ge n^2$, $m_1 = n/2$, and $\tau = 2\sqrt{d/n}$. Then, with probability at least $1 - 1/n^c$,
$$\|\hat{\beta} - \beta^*\|_2^2 \le C \|f^*\|^2 \left( \frac{1}{n} + \frac{d^2}{n^2} \right)$$
for some constants $C > 0$ and $c > 1$.

Remark 12. We observe that active learning (Theorem 4) has the same convergence rate, for sufficiently large $n$, as the case when $f^*$ is known exactly (Theorem 2). Note that $d^2/n^2$ and $d^2/m^2$ are lower-order terms in the compared bounds.
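A compact sketch of the two-stage procedure in Algorithm 4 follows (again our own illustration, not the paper's code). The parameter choices of $m_1 = n/2$ and a threshold of roughly $2\sqrt{d/n}$, with $m = 2n^2$ unlabeled points, follow our reading of Theorem 4 and should be treated as indicative only.

```python
# Sketch of Algorithm 4 (two-stage active regression with estimated noise model).
# Stage 1 estimates f* from a pilot sample via (5); stage 2 queries labels only
# where the estimated noise is small and fits plain OLS. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(3)
d, n = 10, 500
m = 2 * n ** 2
m1 = n // 2
m2 = n - m1
tau = 2.0 * np.sqrt(d / n)                       # noise tolerance (our reading of Theorem 4)

beta_star = rng.standard_normal(d)
f_star = rng.standard_normal(d)
f_star /= np.linalg.norm(f_star)

def query_labels(X):
    """Oracle responses under the heteroscedastic model (2)."""
    return X @ beta_star + np.abs(X @ f_star) * rng.standard_normal(len(X))

U = rng.standard_normal((m, d))                  # unlabeled pool

# Stage 1: pilot OLS on m1 random points, then estimate f* via (5).
idx = rng.choice(m, size=m1, replace=False)
X0, y0 = U[idx], query_labels(U[idx])
beta_0 = np.linalg.lstsq(X0, y0, rcond=None)[0]
r2 = (y0 - X0 @ beta_0) ** 2
S = (X0 * r2[:, None]).T @ X0 / m1
f_hat = np.linalg.eigh(S)[1][:, -1]

# Stage 2: query labels only where the estimated noise is small, then plain OLS
# (Algorithm 4 deliberately skips reweighting in this final step).
X1 = U[np.abs(U @ f_hat) <= tau][:m2]
y1 = query_labels(X1)
beta_hat = np.linalg.lstsq(X1, y1, rcond=None)[0]
print("active (estimated noise) error:", np.linalg.norm(beta_hat - beta_star) ** 2)
```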
Remark 13. Unlike in the case when the noise model is known (Theorem 2), here we cannot do better even with infinitely many unlabeled examples. The source of the trouble is the estimation error in $\hat{f}$: beyond a point, even active learning does not provide an improvement. Note that, for the same reason, we do not compute a weighted least squares estimator in the final step of Algorithm 4, unlike in Algorithm 2.

## 5. Simulations

We now present simulations that support the convergence bounds developed in this work. The setup is as follows. We sample unlabeled instances $x_1, x_2, \ldots, x_m$ from $N(0, I_d)$. Labels are generated according to the heteroscedastic model $y_i = \langle \beta^*, x_i \rangle + g_i \langle f^*, x_i \rangle$, where the $g_i$ are i.i.d. standard Gaussian random variables. We fix $\|f^*\|_2 = 1$ and $d = 10$. We examine how the model estimation error $\|\hat{\beta} - \beta^*\|$ decays as a function of the label budget $n$ for Algorithms 1 and 2 (with $m = 2n^2$ in all simulations). We also track the estimation error of the noise model for Algorithms 3 and 4.

The results for convergence of the model estimate when the noise model is known are presented in Figure 1 (a)-(b). In passive learning, the bounds in Theorem 1 suggest that when $n \le d^2$ we have $\|\beta^* - \hat{\beta}\| = O(d/n)$, but once $n > d^2$ we get a convergence of $O(1/\sqrt{n})$. The result in Figure 1 (a) closely matches these bounds (for better resolution, we plot $\|\beta^* - \hat{\beta}\|$ rather than the $\|\beta^* - \hat{\beta}\|^2$ given in the theorem statements). In the case of active learning, the bounds in Theorem 2 for $m \ge n^2$ suggest an error rate of $\|\beta^* - \hat{\beta}\| = O(d/n^2)$; we observe a similar phenomenon in Figure 1 (b). Turning to the noise estimation setting for passive learning, we see in Figure 1 (c) that the estimation errors of $\beta^*$ and $f^*$ both decay as $d/n$ (as suggested by Theorem 3); for active learning, we see in Figure 1 (d) that the estimation error of $\beta^*$ is noticeably better, in particular better than that of $f^*$, and approaches $1/\sqrt{n}$ as $n$ becomes larger than $d^2$.

We also study the performance of the algorithms on two real-world datasets from UCI: (1) WINE QUALITY, with $m = 6500$ and $d = 11$, and (2) MSD (a subset of the Million Song Dataset), with $m = 515345$ and $d = 90$. For each dataset, we create a 70-30 train-test split and learn the best linear regressor using ordinary least squares, which forms our $\beta^*$. We then sample labels using $\beta^*$ and a simulated heteroscedastic noise model $f^*$. We compare the active and passive learning algorithms on the root mean square error (RMSE) obtained on the test set. In Figure 1 (e), we see that active learning with noise estimation gives a significant reduction in RMSE early on for WINE QUALITY. We also see that weighted least squares gives a slight benefit over ordinary least squares. On the MSD dataset (here the response variable is the year of the song; we make the response mean zero in our experiments), we again observe that our active learning algorithm consistently achieves a marginal reduction in RMSE as the number of labeled examples increases.
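For reference, here is a small driver in the spirit of the known-noise simulations above (our own code, mirroring the setup $d = 10$, $\|f^*\| = 1$, $m = 2n^2$ rather than reproducing the paper's exact experiments): it sweeps the label budget $n$ and prints the estimation errors of Algorithm 1 (passive) and Algorithm 2 (active).

```python
# Driver in the spirit of Figure 1(a)-(b): sweep the label budget n with d = 10,
# ||f*|| = 1 and m = 2n^2, comparing Algorithm 1 (passive) and Algorithm 2 (active)
# when the noise model f* is known. Illustrative, not the paper's experiment code.
import numpy as np

rng = np.random.default_rng(4)
d = 10
beta_star = rng.standard_normal(d)
f_star = rng.standard_normal(d)
f_star /= np.linalg.norm(f_star)

def labels(X):
    return X @ beta_star + np.abs(X @ f_star) * rng.standard_normal(len(X))

def wls(X, y, w):
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)

for n in [100, 200, 400]:
    m = 2 * n ** 2
    U = rng.standard_normal((m, d))

    # Algorithm 1: n uniformly chosen labels, weights 1/<f*, x>^2.
    Xp = U[rng.choice(m, size=n, replace=False)]
    yp = labels(Xp)
    beta_p = wls(Xp, yp, 1.0 / ((Xp @ f_star) ** 2 + 1e-12))

    # Algorithm 2: label only low-noise points, |<f*, x>| <= tau with tau = 2n/m.
    tau = 2.0 * n / m
    Xa = U[np.abs(U @ f_star) <= tau][:n]
    ya = labels(Xa)
    beta_a = wls(Xa, ya, 1.0 / ((Xa @ f_star) ** 2 + 1e-12))

    print(f"n={n:4d}  passive error={np.linalg.norm(beta_p - beta_star):.4f}  "
          f"active error={np.linalg.norm(beta_a - beta_star):.4f}")
```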
## 6. Conclusions and Future Work

In conclusion, we consider active regression under a heteroscedastic noise model. Previous work has looked at active regression either with no model mismatch (Chaudhuri et al., 2015) or with arbitrary model mismatch (Sabato and Munos, 2014). In the first case, active learning provides no improvement even in the simple case where the unlabeled examples are drawn from Gaussians. In the second case, under arbitrary model mismatch, the algorithm either requires a very high running time or a large number of labels. We provide bounds on the convergence rates of active and passive learning for heteroscedastic regression. Our results illustrate that, just as in binary classification, some partial knowledge of the nature of the noise has the potential to lead to significant gains in the label requirement of active learning.

There are several avenues for future work. For simplicity, the convergence bounds we present relate to the case when the distribution $P_x$ over unlabeled examples is a Gaussian. An open problem is to combine our techniques with those of Chaudhuri et al. (2015) and establish convergence rates for general unlabeled distributions. Another interesting line of future work is to come up with other realistic noise models that apply to maximum likelihood estimation problems such as regression and logistic regression, and to determine when active learning can help under these noise models.

Algorithm 3  Least Squares with Estimated Weights
Input: labeling oracle $O$, unlabeled samples $U = \{x_i, i \in [m]\}$, label budget $n$, parameter $m_1$, offset $\gamma$.
1. Draw $m_1$ examples uniformly at random from $U$ and query their labels $y$ using $O$.
2. Estimate $\hat{\beta}_0$ by solving $y \approx X \hat{\beta}_0$, where $X \in \mathbb{R}^{m_1 \times d}$ has the $x_i$ as rows and $y \in \mathbb{R}^{m_1}$ is the vector of labels.
3. Draw a subset $L$ of $n$ examples uniformly at random from $U$. Form $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$.
4. Compute $\hat{f}$ as the largest eigenvector of the residual-weighted empirical covariance given in (5).
5. Set $\hat{w}_i = \frac{1}{\langle x_i, \hat{f} \rangle^2 + \gamma^2}$ for $x_i \in L$.
6. Estimate $\hat{\beta} = (X^T \hat{W} X)^{-1} X^T \hat{W} y$, where $\hat{W}$ is the diagonal matrix with $\hat{W}_{ii} = \hat{w}_i$.
Output: $\hat{\beta}$.

Algorithm 4  Active Regression
Input: labeling oracle $O$, unlabeled samples $U = \{x_i, i \in [m]\}$, label budget $n$, parameters $m_1$, $\tau$.
1. Draw $m_1$ examples uniformly at random from $U$ and query their labels $y$ using $O$.
2. Estimate $\hat{\beta}_0$ by solving $y \approx X \hat{\beta}_0$, where $X \in \mathbb{R}^{m_1 \times d}$ and $y \in \mathbb{R}^{m_1}$.
3. Compute $\hat{f}$ as the largest eigenvector of $\hat{S}$ given in (5).
4. Choose a subset $L$ of $m_2 = n - m_1$ instances from $U$ with estimated noise variance within the tolerance $\tau^2$, i.e. $|\langle x_i, \hat{f} \rangle|^2 \le \tau^2$ for all $x_i \in L$. Query their labels using $O$.
5. Estimate $\hat{\beta} = (X^T X)^{-1} X^T y$, where $X \in \mathbb{R}^{m_2 \times d}$ and $y \in \mathbb{R}^{m_2}$.
Output: $\hat{\beta}$.

[Figure 1: six panels showing estimation error or test RMSE versus label budget $n$: (a) Algorithm 1 (Passive), (b) Algorithm 2 (Active), (c) Algorithm 3 (Passive), (d) Algorithm 4 (Active), (e) WINE QUALITY, (f) MSD.]

Figure 1. Plots (a)-(b): convergence of model ($\beta^*$) estimation error when the noise model is known. Plots (c)-(d): convergence of model ($\beta^*$) estimation error as well as noise parameter ($f^*$) estimation error when the noise model is estimated. Plots (e)-(f): RMSE on test data for two real-world datasets.

## References

Oren Anava and Shie Mannor. Heteroscedastic sequences: Beyond Gaussianity. In Proceedings of The 33rd International Conference on Machine Learning, pages 755-763, 2016.

Pranjal Awasthi, Maria-Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. In Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 449-458, 2014.
Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Hongyang Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, pages 152-192, 2016.

M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78-89, 2009.

A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, 2009.

Raymond J. Carroll, C. F. Jeff Wu, and David Ruppert. The effect of estimating weights in weighted least squares. Journal of the American Statistical Association, 83(404):1045-1054, 1988.

Kamalika Chaudhuri, Sham M. Kakade, Praneeth Netrapalli, and Sujay Sanghavi. Convergence rates of active learning for maximum likelihood estimation. In Advances in Neural Information Processing Systems, pages 1090-1098, 2015.

S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.

Yehoram Gordon, Alexander Litvak, Carsten Schütt, and Elisabeth Werner. On the minimum of several random variables. Proceedings of the American Mathematical Society, 134(12):3665-3675, 2006.

William H. Greene. Econometric Analysis. Prentice Hall, 2002.

S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.

Andrew C. Harvey. Estimating regression models with multiplicative heteroscedasticity. Econometrica: Journal of the Econometric Society, pages 461-465, 1976.

Prateek Jain and Ambuj Tewari. Alternating minimization for regression problems with vector-valued outputs. In Advances in Neural Information Processing Systems, pages 1126-1134, 2015.

Sivan Sabato and Remi Munos. Active regression by stratification. In Advances in Neural Information Processing Systems, pages 469-477, 2014.

Charles Stein et al. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197-206, 1956.

Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. In Neural Information Processing Systems (NIPS), 2014.