# hardmargin_active_linear_regression__feb943b7.pdf

Hard-Margin Active Linear Regression

Elad Hazan EHAZAN@IE.TECHNION.AC.IL Technion, Haifa, Israel

Zohar Karnin ZKARNIN@YAHOO.COM Yahoo Labs, Haifa, Israel

We consider the fundamental problem of linear regression in which the designer can actively choose observations. This model naturally captures various experiment design settings in medical experiments, ad placement problems and more. Whereas previous literature addresses the soft-margin or mean-square-error variants of the problem, we consider a natural machine learning hard-margin criterion. In this setting, we show that active learning admits signiﬁcantly better sample complexity bounds than the passive learning counterpart, and give efﬁcient algorithms that attain near-optimal bounds.

1. Introduction

In this paper we consider a problem of experiment design from an active learning viewpoint. The setting at hand is a linear regression problem where the error is not measured in the standard mean square loss function, but rather in the natural 0-1 loss setting. That is, a data point will induce a loss of cost one if the regressor produces an error larger than some threshold and zero otherwise.

The 0-1 loss scenario is motivated by numerous real-life scenarios. The ﬁrst example is medical experiments: consider a scientist attempting to predict a condition according to patient attributes. In this setting, different patients have different attributes, or features, and the experiment designer goal is to test a minimal number of patients before being able to successfully predict the condition of future patients based on their attributes. In a regression setting it is likely that predicting the outcome up to some small error will induce no loss, while erring by slightly more or much more will have the same affect of mistreating the patient.

Proceedings of the 31 st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Another example is a recommendation or ad serving system where the objective is to serve relevant items to users. These items could be movies, articles, webpages, advertisements, etc.. A very common approach to the problem is to assume that the relevance of an item for a user is determined as a bi-linear function of the features of the user and item. Consider the case where the users are not given by their identity but only as a list of attributes such as gender, location, age, etc (a common scenario in web related applications). By considering categories of the items the problem is reduced to learning multiple regressors. The true objective of these regressors is to present relevant items to the users. The underlying objective function aligns perfectly with the 0-1 loss setting; an item is either relevant or not relevant, as indicated by the user when choosing whether to view / click / purchase the item.

Formally we can model this experiment design setting as follows: a data set (of patients/users/etc) is associated with a set of feature, or attribute, vectors K Rd. An experiment with patient x K corresponds to a noisy measurement, or query , x, w + εx, where w is the optimal underlying linear model and εx is a zero mean and bounded variance noise. The goal of the experiment designer is to adaptively choose and measure a sequence of experiments with the goal of attaining, with high probability, an ε-close estimate of entire data set 1

The difﬁculty presented in this problem is two-fold: ﬁrst there is the statistical-geometrical difﬁculty of exploration - which experiments should be conducted in order to accurately predict the values over all data points? Second - there is the optimization difﬁcult of efﬁcient computation of the optimal linear regressor in an adaptive setting. These two questions are obviously related - the ideal method should interpolate between measurements, optimization to identify the areas that require more exploration, more experiments

1By standard techniques this can be generalised to a distributional setting, in which patients x arrive from an unknown and i.i.d distribution x D and the requirement is that with probability at least 1 δ the regressor errs by less than ε.

Hard-Margin Active Linear Regression

Figure 1. Illustration of a discrete experiment design: the main problem is to pick the most informative experiments. In this case there is an optimal orthonormal basis that spans the space - the two highlighted points.

Clearly, if the data lives on the real line, Ω( 1

ε2 ) noisy measurements are required to obtain an ε-approximate regressor. This is generalized to Ω( d

ε2 ) experiments for data that lives in d-dimensional Euclidean space. The question becomes interesting once we have very large data sets. Suppose there are n d patients for which predictions are needed, can we regress on the entire data making less than Θ(n) experiments?

Clearly in a passive-learning setting the answer is no. The distribution over the data could be so skewed as to give no or very little information about an important part of space. However - an active learner can potentially exploit the geometric structure of the data and attain much better bounds.

Our main contribution is indeed an active/adaptive algorithm for experiment design in this framework that achieves the following guarantee: after querying O d

ε2 carefullychosen experiments, the algorithm returns a vector that is ε-accurate for all data points. Furthermore, our algorithm is computationally efﬁcient and can be implemented to run in near-linear running time given a low-variance exploration basis for the experiment space (that can be constructed in polynomial time).

1.1. Related work

The natural problem we address has been considered before in many variants in two main communities: experiment design and active learning.

Experiment design: In the statistical ﬁeld called optimal design of experiments, or just optimal design (Atkinson & Donev, 1992; Wu, 1978) , a statistician is faced with the task of choosing experiments to perform from a given pool, with the goal of producing the optimal result within the budget constraint.

Formally, consider a pool of possible experiments denoted

x1, ..., xn Rd. The goal of the designer is to choose a distribution over the pool of experiments, such that experiments chosen according to this distribution produce a hypothesis ˆw that is as close as possible to the true linear function behind the data. The distance between the hypothesis and true linear function can be measured in different ways, each corresponding to a different optimality criteria. The common property of the criteria is that they all minimize the variance of the hypothesis. Since the variance is not a scalar but a d d matrix, the different criteria differ by the fact that each one minimizes a different function Φ : Rd d R over the covariance matrix. Common criteria are the A-, D-, and E-optimality criteria. D-optimality, minimizes the determinant of the covariance matrix, and thus minimizes the volume of the conﬁdence region. In A-optimality the trace of the covariance matrix, i.e. the total variance of the parameter estimates, is minimized. Eoptimality minimizes the maximum eigenvalue of the covariance matrix, and thus minimizes the size of the major axis of the conﬁdence region.

The above criteria do not directly characterize the quality of predictions on test data. A common criterion that directly takes the test data into account is that of G-optimality. Here the goal is to minimize the maximum variance of the predicted values. In other words, by denoting Var S(xi) the variance of the prediction of xi after querying the points of S, the goal in G-optimality is to minimize maxi Var S(xi). G-optimality and D-optimality are closely related in the sense that an exact solution to one is the solution to the other (see e.g. (Spruill & Studden, 1979)). Note that this criterion is similar to our objective. The difference is twofold. First, we do not aim to minimize the variance but to obtain a high probability bound. Second, in optimal design the quality of the distribution is measured when the budget tends to inﬁnity. Speciﬁcally, notice that for a distribution over the possible experiments, rather than a deterministic subset of them, the corresponding covariance matrix is random. The discussed minimizations are done over the expected covariance matrix, where the expectation is taken over the subset of chosen experiments. When the budget tends to inﬁnity the actual covariance matrix is close w.h.p to its expected counterpart. We call this the inﬁnite budget setting. Ours is a ﬁnite budget setting where one does not aim to provide a distribution over the possible experiments but a deterministic subset of them of a ﬁxed size.

For the ﬁnite budget setting various relaxations have been considered in the statistical literature, usually without an approximation guarantee. Our method differs from previous works of this spirit by: First, we do not impose a hard-budget constraint of experiments, but rather bound the number of experiments as a function of the desired approximation guarantee. Second, we obtain a computationally efﬁcient algorithm with provable optimality results. Finally,

Hard-Margin Active Linear Regression

as an added bonus our solution has the property of choosing very few data points to explore, potentially much less than the budget. A motivating example for this property is the medical experiment design. Here a data point is a human subject and it is more realistic to have few volunteers being thoroughly tested on as opposed to performing few tests over many volunteers.

Our setting is arguably more natural for the medicalpatient-experiment motivating example; in general there are numerous examples where the budget of experiments is not ﬁxed but rather the tolerable error. Of equal importance is the fact that our setting allows to derive efﬁcient algorithms with rigorous theoretical guarantees.

A related and recently popular model is called random design (Hsu et al., 2012; Audibert & Catoni, 2010; Gy orﬁ, 2002; Audibert & Catoni, 2011). In this setting the designer is given a set of measurements {xi, yi|i [n]} for xi Rd drawn from an unknown distribution D. The goal is to predict as well as the best linear predictor measured according to the mean square error, i.e. minimize

E(x,y) D (x w y)2 (x w y)2

where w is the optimal linear regressor. Various other performance metrics have been considered in the referenced papers, i.e. measuring the norm of the regressor vs. the optimal regressor in a norm proportional to the covariance matrix. However, in this setting an expected error is the criterion vs. our criterion of worst-case, or a high conﬁdence bound on the error2, which is more suitable for some experiment design settings.

Active learning: The most well-studied setting in active learning is pool-based active learning (Mc Callum & Nigam, 1998), in which the learner has access to a pool of examples, and can iteratively query labels of particular examples of her choice. Compared to passive learning, in which labelled examples are drawn from a ﬁxed unknown distribution, known active learning algorithms can attain a certain generalization error guarantee albeit observing exponentially fewer labelled examples, e.g. (Cohn et al., 1994; Dasgupta et al., 2009; Hanneke, 2007; Balcan et al., 2009), under certain assumptions such as special hypothesis classes, realizability or large-margin. Active learning with noise is a much less studied topic: (Balcan et al., 2009) give an exponential improvement over passive learning of linear threshold functions, but under the condition that the noise is smaller than the desired accuracy. Real valued active learning with a soft-margin criteria was addressed in (Ganti & Gray, 2012). The reader is referred to (Dasgupta & Langford, 2009) for a more detailed survey of active learning literature.

2Our results though stated as a worst case error can be generalized to a high probability solution in the random design scenario

2. Preliminaries

Let Rd be the space of d dimensional vectors over the reals. Throughout the paper we denote the set of n data points as K = {x1, . . . , xn}. We denote by w Rd the hidden regressor. A query at point x, reveals a noisy output denoted by \ x, w consisting of the true inner product x, w plus a zero mean noise. We operate under the following assumptions: First, the different noise elements of different queries are independent. Second, the output of \ x, w is always bounded in [ 1, 1]; furthermore, all of the data points x K and the hidden vector w are in the Euclidean unit sphere. Relaxations of these assumptions are in many cases possible via standard techniques, and in this extended abstract we focus on the core problem. The objective in the Hard-Margin Active Linear Regression problem, or ALR in short, is to to learn a regressor w Rd where it holds with probability at least 1 δ that

x K, | x, w x, w | < ε (1)

An algorithm for this problem has two measurements. First and more important for our setting is the query complexity. That is, how many queries did the algorithm use? The second measure is the computational complexity which is the running time.

Before diving into our algorithmic results, we mention here a lower bound on the query complexity of ALR that can be derived by reduction to known lower bounds for the stochastic multi-armed bandit problem. We prove the following theorem in section 3: Theorem 1 (Query Complexity Lower Bound). Any algorithm for the ALR problem requires Ω(d log(1/δ)/ε2) measurements in order to compute a linear regressor with error of at most ε with probability larger than 1 δ.

2.1. Volumetric Spanners

A non-active solution to the ALR problem would simply experiment over all data points sufﬁciently many times, and apply any optimization method for ﬁnding the optimal regressor over the measurements. This is of course unsatisfactory - an active learner would want to exploit similarities in data and closeness in feature space to achieve faster learning.

An important technical tool for exploiting such structure is an informative exploration basis. That is, a subset of points from K such that querying the points of this set will provide the most information for all of the points in K. To this end we use a recently devised geometric object called a volumetric spanner (Hazan et al., 2013). A volumetric spanner is a ﬁnite subset of K such that any point in K is spanned by it with small coefﬁcients, or formally: Deﬁnition 1. Let K Rd and let S = {v1, . . . , v|S|} be

Hard-Margin Active Linear Regression

a (multi-)subset3 of K. Let V be the d |S| matrix whose columns are the vectors of S. For a vector x let

The set S is a volumetric spanner of K when for all x K it holds that x E(S) 1

We later show that the properties of the volumetric spanner are exactly those that are required of an exploration set. That is, when querying the points of a set S K, the quality of the corresponding regressor w.r.t a data point x is exactly proportional to x E(S). The query complexity of the algorithm will be determined by the size of the set S. In (Hazan et al., 2013) it is proven that such spanners of cardinality O(d) exist for any compact set K. Furthermore, for ﬁnite sets K Rd of cardinality n, such spanners can be found in polynomial time in n and d 4. Formally:

Theorem 2. [(Hazan et al., 2013)] Let K Rd be a ﬁnite set of size n. There exist an algorithm that constructs a volumetric spanner for K of cardinality at most 12d whose running time is O(n3.5 + n3d + d5). An alternative algorithm for the problem exits with running time of O(nd2) achieving a volumetric spanner of size O(d log(d) log(n)).

3. A lower bound for passive linear regression

In this section we provide an example for a set X where the passive learning algorithm must use Ω( n

ε2 ) observations to obtain a regressor w with additive error of at most ε on all of the data points. We start by better deﬁning the passive setup. Here, a query returns a random point x chosen uniformly from the set K and a noisy measurement of w , x . As before we assume that all points, including w are contained in the ℓ2 unit sphere.

Our set K is deﬁned in the following manner. Let Y Rd

be an arbitrary set of size n 1 such that for all x Y , x, e1 = 0. Let K = Y {e1}.

Theorem 3. Any algorithm in the passive setting achieving an additive error of at most ε in all of the data points of K whose success probability is 1 δ requires Ω(log(1/δ)n/ε2) queries.

The theorem is an immediate corollary of the following lemma.

Lemma 4. For K deﬁned above, any policy distinguishing between the case where w , e1 = ε and w , e1 = ε with probability larger than 1 δ must use Ω(n log(1/δ)/ε2) queries.

3S can be a multi-set, meaning it can contain several copies of the same element 4 Extensions to inﬁnite sets is also addressed in (Hazan et al., 2013), but outside the scope of this paper

Proof. We begin by mentioning that a query of a point x where x, e1 = 0 provides no information to the sign of w , e1 , hence does not help distinguish between the two hypotheses. The following lemma provides a lower bound to the number of queries at point e1 required to estimate the w , e1 up to a sufﬁciently small additive error and with sufﬁcient conﬁdence. It is a folklore lemma in statistics and appears e.g., in (Mannor & Tsitsiklis, 2004) in a much more general form.

Lemma 5 (Theorem 1 of (Mannor & Tsitsiklis, 2004)). Let D be a distribution over [ 1, 1]. Let ε > 0 be such that for X D, |E[X]| ε. Let T be the expected number of queries required by any algorithm that queries i.i.d copies of X D until being able to distinguish, with probability at least 1 δ between the cases E[X] ε and E[X] ε. Then for universal constants ε0 > 0, δ0 > 0, c1, c2 it holds that if ε < ε0 and δ < δ0 then T c1 log(c2/δ)

It follows that the expected number of queries needed in order to distinguish between the two hypotheses with probability 1 δ is at least c1n log(c2/δ)

ε2 , as the probability of observing a query to the inner product with e1 is 1/n.

4. Our ALR solution

In this section we provide a high level description of our solution. In the following sections we prove the following.

Theorem 6. There exists a solution to the ALR problem with success probability of at least 1 δ with the following properties: (1) The solution requires a preprocessing stage of O(n3.5 + dn3 + d5) (2) it s running time (after prepro-

cessing) is O nd log(1/δ)

ε2 and (3) it s query complexity is

at most O d log(n) log(1/δ)

ε2 . An alternative algorithm ex-

ists with a preprocessing time of O(nd2) requiring an additional factor of log(n) log(d) for the number of queries.

The intuition behind the algorithm is the following. We begin with a preprocessing stage of computing a volumetric spanner S for the set of points K. Given this spanner we can implement a procedure that outputs, for all of the points of K simultaneously, an unbiased estimator of w , x with variance of at most |S|. To demonstrate the usefulness of this estimator, consider averaging |S| log(n/δ)/ε2 i.i.d outputs of S. Standard concentration bound show that w.p at least 1 δ the estimates of all points in K are correct up to an additive error of ε. Rather than computing a noisy output for w on the points and recovering a hypothesis w from that we use a technique by (Clarkson et al., 2012) that given an oracle for a function over a set of data points constructs a hypothesis w using a small number of queries to the oracle.

Hard-Margin Active Linear Regression

4.1. Constructing a low variance estimator

In the following section we provide an algorithm that requires a black box providing a noisy estimate of x, w to all of the data point of K simultaneously. The intuition behind the algorithm is that given sufﬁciently many queries to the noisy estimator, a union bound argument can ensure an accurate estimate in all of the data points simultaneously. In this section we begin with the description of this black box providing the estimates.

The main tool used for this all-point-estimator is the volumetric spanner. Algorithm 1 provides the formal description of the method used to obtain the estimates.

Algorithm 1 Sample(K)

1: Input: set K = {x1, ..., xn}, Volumetric spanner for K denoted S, and measurement oracle that given x returns an unbiased estimator \ x, w with variance at most one for some ﬁxed w . 2: Choose a point vj S uniformly at random, query its inner product ˆℓ= \ vj, w 3: let V be the d |S| matrix whose columns are the elements of S, and let V R|S| d be its Moore-Penrose pseudo inverse. 4: For x K, let αx = V x and let ˆℓx (αx)j ˆℓ |S| 5: return estimates {ˆℓx}

The following lemma provides the analysis of Algorithm 1.

Lemma 7. Algorithm 1 queries a single point from K. Its estimates have the properties of (1) being unbiased and (2) have a variance of at most 12d. More formally, we have

x K . E[ˆℓx] = x, w , Var(ˆℓx) |S|

E h ˆℓx i = P

j S Pr[vj] (αx)j E \ vj, w |S|

j S(αx)j E \ vj, w

= (V x)T V T w = x, w

For the variance, recall that x K and S is a volumetric spanner of K indicates that αx 2 1:

E[ˆℓ2 x] = P

j S Pr[vj] (αx)2 j E \ vj, w 2 |S|2

j S(αx)2 j |S|

By Theorem 2 we can efﬁciently construct volumetric spanners of size |S| = 12d.

4.2. Algorithm and its analysis

In this section we present an algorithm for the ALR problem, following the primal-dual paradigm of (Clarkson et al., 2012). It assumes an oracle to a procedure Sample(K) that returns a vector of length |K| whose entries are unbiased estimators of x, w , for all x K whose variance is upper bounded by O(d). Recall that such a procedure was given in Section 4.1.

To avoid extraneous notions we will assume henceforth w.l.o.g that K is symmetric meaning that x K iff x K. This is without loss of generality since an unbiased estimator for x, w is obtained by negating the estimator for x, w .

Consider the following mathematical program:

min w 1 g(w) s.t. g(w) = maxx K cx(w)

cx(w) = x, w x, w

Note that by deﬁnition g(w ) = 0, which is the optimal solution to the problem. In addition, an ε approximate solution to the ALR instance, assuming w 1, corresponds to a vector ˆw with g( ˆw) ε. The following algorithm is an instantiation of Alg 3 from (Clarkson et al., 2012) applied to mathematical program 2 with the following arguments:

1. The primal decision set { w 1} and (linear) cost functions cx(w), admits an iterative low regret algorithm, namely online gradient descent, with expected regret E[R(T)] 2

T. This follows since the norms of x, w (for all x K) are bounded by one. See e.g. Theorem 1 in (Zinkevich, 2003).

2. We assume oracle access to a procedure Sample(K) that returns, for all x K, an unbiased estimate of cx(w ) with variance at most s.

The following theorem provides the analysis of Algorithm 2. Given the above it is immediately derived from Lemma 4.1 in (Clarkson et al., 2012).

Theorem 8. Algorithm 2 runs in time O( dn

ε2 ) and requires O( d log d log(n)

ε2 ) queries to the procedure Sample(K). It returns, with probability of at least 1

2, a vector w such that maxx K w w , x ε.

In the following section we describe a validation procedure that can verify, w.h.p, whether a proposed hypothesis is correct. The validation procedure will not increase the asymptotic behavior of the sample nor running time complexity of the algorithm. With it, we can repeat the procedure of Theorem 8 log 1

δ many times, and achieve a high probability result described below.

Hard-Margin Active Linear Regression

Corollary 9. There exists an algorithm that runs in time O( dn log 1

δ ε2 ) and returns, with probability of at least 1 δ, a vector w such that maxx K w w , x ε.

Algorithm 2 Primal-Dual Algorithm

1: Input: T

2: Let w1 0, q0 1n, η 1 100

T . 3: for t = 1 to T do 4: Query Sample(K) to obtains estimators at(i) for all ci s 5: for i [n] do 6: at(i) clip( at(i), 1/η) 7: qt(i) qt 1(i)(1 ηat(i) + η2at(i)2) 8: end for 9: Choose it [n] at random with Pr[it = i] qt(i) 10: wt wt 1 1

t wcit, where wcit = xit 11: end for 12: return w = 1

4.3. Validation

In this section we present Algorithm 3 that given a hypothesis w, veriﬁes, w.h.p., that w is sufﬁciently accurate.

Algorithm 3 Veriﬁcation

1: Input: Volumetric spanner S, parameters ε, δ > 0 2: run Sample(K) T = 2 log(2n/δ)|S|/ε2 times and obtain for each data point in K, T i.i.d samples of w , x 3: for each x K let f(x) be the average of the above T samples. 4: declare w as accurate iff for all x, | w, x f(x)| < 2ε

Lemma 10. Algorithm 3 has the following properties: (1) it requires O(log(n/δ)|S|/ε2) queries to the oracle (2) if the worst-case error of w is bounded by ε then w.p. at least 1 δ it will be declared as accurate (3) if the worst-case error of w is larger than 3ε then w.p. at least 1 δ it will be declared as inaccurate.

Proof. We recall that the process Sample(K) returns unbiased estimates of w , x for all of the data points where the estimates are bounded in absolute value by |S|. By the Chernoff bound it can be veriﬁed that for any given x K it holds with probability at least 1 δ/n that | w , x f(x)| < ε. A union bound shows that w.p. at least 1 δ the above holds for all data points simultaneously.

5. The Agnostic Case

In this section we discuss the agnostic case, where the learned function is not necessarily linear. The formal setting here is the following. There exist some function

f : K [ 1, 1] for which we have noisy access. That is, a query at point x K returns an answer d f(x) consisting of f(x) plus some zero mean noise. We denote by w Rd the optimal linear regressor, being the minimizer of the expression

Φ(w ) = max x K |f(x) w , x |

We denote the attained value by λ. Our objective is to ﬁnd a regressor w such that Φ(w) is as close to λ as possible. We analyze both an algorithmic result obtaining a multiplicative approximation to λ and proceed to show the tightness of this result in the sense that any algorithm achieving a better approximation must query all of the data points.

The following theorems describe both results correspondingly

Theorem 11. Let f : K [ 1, 1] and let

λ = min w Rd max x K |f(x) w, x | .

Given access to a noisy oracle of f, algorithm 2 equipped with the sample procedure described in algorithm 1 outputs a regressor w such that with probability at least 1 δ, for each point x K it holds that

|f(x) w, x | O p

d log(d) λ + ε

The query complexity of the process is O(d log(n) log(1/δ)/ε2).

Theorem 12. For any integer n, 0 < λ < ln(2n)/d and any policy requiring o n/(dλ2) queries when given a set of n points in Rd, there exist a set of points K of cardinality |K| = n and a function f : K [ 1, 1] with the following properties. First, there exists some w Rd such that Φ(w ) = λ. Second, the solution w obtained by the policy is such that

E[Φ(w)] λ c

where c > 0 is some universal constant. Here, the expectation is taken both over the (possible) internal randomness of the policy and the query noise.

5.1. An Agnostic Algorithm

proof of Theorem 11. For simplicity we describe an algorithm with a success probability of at least 1/2. The high probability result can be achieved analogously to the nonagnostic case (section 4.3). Recall from Algorithm 1 that S is the chosen volumetric spanner of K with |S| = O(d). Recall that V is the d |S| matrix whose columns are the elements of S, and V R|S| d is its Moore-Penrose pseudo

Hard-Margin Active Linear Regression

inverse. We also recall that that for all x K it holds for αx = V x that αx 1 and x = V αx.

We deﬁne for x K the function f(x) = f(V ) αx where f(V ) is the vector of length |S| whose values are the values of f on the elements of S.

Observation 13. f(x) is a linear function. Furthermore, the procedure in algorithm 1 returns unbiased estimates of variance at most |S| to f at all points x K simultaneously.

As a result of the above observation we get that by Theorem 8, when running Algorithm 2 with T = Θ(log(n)|S|/ε2) queries to f we obtain, with probability at least 1/2, a regressor w such that for all x K, w, x f(x) < ε (2)

This is since all of our queries are made by uniformly sampling an element of S, hence the oracle we are querying is in fact an unbiased estimator for f. It thus remains to prove an upper bound for the ℓ distance between f and f. Let w Rd be the optimal regressor and let b Rn be the vector of biases corresponding to w ; that is for x K, b(x) = x, w f(x). For x K we have that

f(x) = f(V ) αx = w (V ) αx + b(V ) αx =

w , x + b(V ) αx

It follows that

f(x) f(x) = f(x) w , x b(V ) αx (1)

λ + b(V ) αx (2) λ + b(V ) (3) λ 1 + p

Transitions (1) and (3) are due to the optimality of w . Inequality (2) is due to the Cauchy-Schwartz inequality and αx 1. The proof follows from combining inequalities (2) and (3), along with |S| = O(d).

5.2. Query Complexity Lower Bound

In order to prove the theorem we will present different scenarios that cannot be distinguished without using the required number of queries. To this end we will use a negative result for the PAC setting of the Multi-Armed Bandit (MAB) problem. We are interested in a special case of the problem; here, there exist some unknown vector p [ 1/2, 1/2]n and the player can query indices of p. Each query is independent of the past and is a sample of a

random variable in [ 1, 1] with expectation pj where j is the index being queried. The goal of the player is to ﬁnd an index j whose value pj is maximal, i.e, ﬁnd maxargjpj. The theorem below provides a lower bound to the number of queries required for the task. It is a special case of Theorem 5 in (Mannor & Tsitsiklis, 2004).

Lemma 14. There exists a global constant c such that: For any p [ 1/2, 1/2]n a policy that successfully identiﬁes maxargjpj with probability at least 2/3, requires at least

queries, where j = maxargjpj.

We begin with the construction of the set of data points. Let x1, . . . , xn be i.i.d random points chosen uniformly from the normalized hypercube { 1/

d}d. We require the points to be uncorrelated in a manner guaranteed by the following lemma

Lemma 15. With probability larger than 1/2, it holds that for every j = j [n]

| xj, xj | p

Proof. Notice that xj, xj = Pd i=1(xj)i (xj )i = 1 d Pd i=1 Yi where the Yi s are i.i.d r.v chosen uniformly from { 1, 1}. By the well known Chernoff inequality we get that for any t > 0

2 exp( t2/2)

Taking t = p

4 ln(2n) ensures the above probability is at most 1/n2 and a union bound argument proves the lemma.

We deﬁne the set of data points as follows: x1, . . . , xn are points in the normalized hypercube for which the statement of the above lemma applies. We deﬁne n scenarios, enumerated from 1 to n that will be indistinguishable by the policy. For j [n], scenario j is deﬁned as follows: f(xj ) = 0 for all j [n] where j = j and f(xj) = λ p

d/4 ln(2n).

We ﬁrst analyze the regressors of f in the different scenarios.

Lemma 16. In any of the above scenarios, there exists a linear regressor w such that |f(xj) w , xj | λ for all j. Furthermore, any regressor will achieve a worst case error of at least p

d/16 ln(2n)λ in n 1 of the n scenarios.

Proof. We prove the ﬁrst claim for scenario 1 and notice that the other scenarios have analogous proofs. Let w =

Hard-Margin Active Linear Regression

d/4 ln(2n). Notice that due to our assumption on λ, w is contained in the unit sphere. For x1, we have f(x1) = w, x1 . For j > 1 we have

|f(xj) w, xj | = | w, xj | =

| x1, xj | λ p

d/4 ln(2n) λ

This proves the ﬁrst claim in the lemma.

To prove the second claim, we deﬁne w as an arbitrary regressor and consider two cases. In one, w, xj < p

d/16 ln(2n)λ for all j. Clearly, the worst case error is p

d/16 ln(2n)λ in all scenarios. In the other case, for some point j, w, xj p

d/16 ln(2n)λ. In any scenario except scenario j, f(xj) = 0 hence the worst case error is at least p

d/16 ln(2n)λ, proving the claim.

We now conclude the proof of Theorem 12 with a lemma showing that without many queries, it is not possible to identify the correct scenario.

Lemma 17. Any policy that identiﬁes the correct scenario with probability at least 2/3 requires Ω(n/(dλ2)) queries.

Proof. Consider the vector p Rn where pj = f(xj). Assuming sufﬁciently small λ, p [ 1/2, 1/2]n. Also, for j = maxargjpj, it holds that

1 (pj pj)2 = Ω(n/(dλ2))

Finnaly, a policy that identiﬁes the scenario with probability at least 2/3 implicitly identiﬁes the index j with the same probability and by that solves the exploratory MAB problem associated with p. Hence, by Lemma 14, it must use

= Ω n/(dλ2)

proof of Theorem 12. Assume that the scenario deﬁning f is chosen uniformly at random among the n scenarios presented above. Consider a policy using o(n/(dλ2)) queries. According to Lemma 17, the correct scenario cannot be identiﬁed with probability larger than 2/3. This, and Lemma 16 indicate that with probability at least 1/3 the regressor chosen will have a worst-case additive error of at least p

d/16 ln(2n)λ. The claim of the theorem now follows.

6. Conclusions and Open Questions

We have studied active real-valued learning (i.e. regression) in the presence of noise, which has been studied before in the experiment-design literature. While exponential separation of active and passive learning is notoriously hard, in the fundamental hard-margin linear regression setting we show it to be attainable: while passive algorithms require O( n

ε2 ) queries in order to correctly deduce the optimal linear regressor up to precision ε, our active algorithm requires only O( d

ε2 ) queries.

The latter is attainable using recently-developed techniques in sublinear optimization and exploration using volumetric ellipsoids.

We continue to study active regression in the agnostic setting, where we show that the same algorithm gives a

dapproximation to the optimal linear regressor, and that no algorithm can do better in general without querying all data points.

Many intriguing questions remain open: what is the query complexity of active non-linear regression? Of particular interest is the characterization of the query complexity of the most popular regression models, i.e. polynomial and support-vector regression.

Acknowledgements

The ﬁrst author gratefully acknowledges support by the Microsoft-Technion EC center, ISF grant 810/11 and an ERC Starting Grant project SUBLRN.

Atkinson, A.A.C. and Donev, A.A.N. Optimum Experimental Designs. Oxford science publications. OXFORD University Press, 1992. ISBN 9780198522546. URL http://books.google. co.il/books?id=cmm OA_-M7S0C.

Audibert, Jean-Yves and Catoni, Olivier. Linear regression through PAC-Bayesian truncation. 2010. URL http://hal.archives-ouvertes.fr/ hal-00522536.

Audibert, Jean-Yves and Catoni, Olivier. Robust linear least squares regression. Annals of Statistics, 2011.

Balcan, Maria-Florina, Beygelzimer, Alina, and Langford, John. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78 89, January 2009. ISSN 0022-0000. doi: 10.1016/j.jcss.2008.07.003. URL http://dx.doi. org/10.1016/j.jcss.2008.07.003.

Clarkson, Kenneth L., Hazan, Elad, and Woodruff, David P.

Hard-Margin Active Linear Regression

Sublinear optimization for machine learning. J. ACM, 59 (5):23:1 23:49, November 2012.

Cohn, David, Atlas, Les, and Ladner, Richard. Improving generalization with active learning. Mach. Learn., 15(2): 201 221, May 1994. ISSN 0885-6125. doi: 10.1023/A: 1022673506211. URL http://dx.doi.org/10. 1023/A:1022673506211.

Dasgupta, Sanjoy and Langford, John. Active learning tutorial. Proceedings of the 26th International Conference on Machine Learning (ICML), 2009. URL http: //hunch.net/ active_learning/.

Dasgupta, Sanjoy, Kalai, Adam Tauman, and Monteleoni, Claire. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10:281 299, 2009.

Ganti, Ravi and Gray, Alexander G. Upal: Unbiased pool based active learning. In AISTATS, volume 22 of JMLR Proceedings, pp. 422 431. JMLR.org, 2012.

Gy orﬁ, L. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer, 2002. ISBN 9780387954417. URL http://books. google.co.il/books?id=Q9NMAp-c Ky0C.

Hanneke, Steve. Teaching dimension and the complexity of active learning. In Proceedings of the 20th annual conference on Learning theory, COLT 07, pp. 66 81, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-72925-9. URL http://dl.acm.org/ citation.cfm?id=1768841.1768851.

Hazan, Elad, Karnin, Zohar, and Mehka, Raghu. Volumetric spanners and their applications to machine learning. Co RR, abs/1312.6214, 2013.

Hsu, Daniel, Kakade, Sham M., and Zhang, Tong. Random design analysis of ridge regression. Journal of Machine Learning Research - Proceedings Track, 23:9.1 9.24, 2012.

Mannor, Shie and Tsitsiklis, John N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623 648, 2004.

Mc Callum, Andrew and Nigam, Kamal. Employing em and pool-based active learning for text classiﬁcation. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), pp. 350 358, 1998.

Spruill, MC and Studden, WJ. A kiefer-wolfowitz theorem in a stochastic process setting. The Annals of Statistics, 7(6):1329 1332, 1979.

Vapnik, Vladimir. The nature of statistical learning theory. springer, 2000.

Wu, Chien-Fu. Some algorithmic aspects of the theory of optimal designs. Annals of Statistics, 6(6):1286 1301, 1978.

Zinkevich, Martin. Online convex programming and generalized inﬁnitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pp. 928 936, 2003.