# the_teaching_dimension_of_linear_learners__019b43d3.pdf

Journal of Machine Learning Research 17 (2016) 1-25 Submitted 12/15; Revised 6/16; Published 9/16

The Teaching Dimension of Linear Learners

Ji Liu ji.liu.uwisc@gmail.com Department of Computer Science University of Rochester Rochester, NY 14627, USA

Xiaojin Zhu jerryzhu@cs.wisc.edu Department of Computer Sciences University of Wisconsin-Madison Madison, WI 53706, USA

Editor: Sanjoy Dasgupta

Teaching dimension is a learning theoretic quantity that speciﬁes the minimum training set size to teach a target model to a learner. Previous studies on teaching dimension focused on version-space learners which maintain all hypotheses consistent with the training data, and cannot be applied to modern machine learners which select a speciﬁc hypothesis via optimization. This paper presents the ﬁrst known teaching dimension for ridge regression, support vector machines, and logistic regression. We also exhibit optimal training sets that match these teaching dimensions. Our approach generalizes to other linear learners.

Keywords: Optimization based learner, Karush-Kuhn-Tucker conditions, VC-dimension

1. Introduction

Consider a teacher who knows both a target model and the learning algorithm used by a machine learner. The teacher wants to teach the target model to the learner by constructing a training set. The training set does not need to contain independent and identically distributed items drawn from some distribution. Furthermore, the teacher can construct any item in the input space. How many training items are needed? This is the question addressed by the teaching dimension (Goldman and Kearns, 1995; Shinohara and Miyano, 1991). We give the precise deﬁnition in section 2, but ﬁrst illustrate the intuition with an example.

Consider integers x {1 . . . 10} and threshold classiﬁers hθ on them, so that hθ(x) returns -1 if x < θ and 1 if x θ. Now let the hypothesis space H consist of eleven classiﬁers H = {hθ | θ {1 . . . 11}}. Let the learner be a version-space learner, namely it maintains a version space {hθ H | hθ consistent with the training set}. Equivalently, the learner is a 0-1 loss empirical risk minimizer (ERM) which ﬁnds all models with zero training error. If we want to teach a target model (in this paper we use hypothesis and model exchangeably), say h9, to such a learner, we can construct a training set that results in a singleton version space {h9}. It is easy to see that the training set D = {(x1 = 8, y1 = 1), (x2 = 9, y2 = 1)} is the smallest set for this purpose. We say that the teaching dimension of h9 with respect

c 2016 Ji Liu and Xiaojin Zhu.

Liu and Zhu

to H is TD(h9) = |D| = 2. Similarly, TD(h11) = 1 because D = {(x1 = 10, y1 = 1)} suﬃces. In fact, TD(h θ) = 1 for target model θ = 1 or 11, and 2 for θ = 2, 3, . . . , 10. The astute reader may notice that this example does not apply to continuous spaces. To see this, let us extend x R and H = {hθ | θ R}. The learner s version space under any linearly separable training set would now be represented by the interval between the two closest oppositely labeled items. It is impossible for the version-space learner to pick out a unique target model hθ with a ﬁnite training set. In other words, TD(hθ ) = for all target models θ . This is counterintuitive because ostensibly we can teach any one of the modern machine learning algorithms such as a support vector machine (SVM) with only two training items: D = {(x1 = θ ϵ, y1 = 1), (x2 = θ + ϵ, y2 = 1)} with any ϵ > 0. The issue here is that a version-space learner is not equipped with the ability to pick the max-margin (or any other speciﬁc) hypothesis from the version space. In contrast, an SVM is not a version-space learner in our terminology; we have stronger knowledge from optimization on how it picks a speciﬁc hypothesis from the hypothesis space. This paper will utilize such knowledge to derive teaching dimensions that are distinct from classic teaching dimension analysis (e.g. Doliwa et al. (2014)). Speciﬁcally, we extend teaching dimension to linear learners that learn by regularized surrogate-loss empirical risk minimization:

Aopt(D) := Argminθ Rd

i=1 ℓ(x i θ, yi) + λ

2 θ 2 A | {z } =:f(θ)

Here, we identify H with Rd, h with θ, the surrogate loss function ℓis either smooth or convex in the ﬁrst argument, λ > 0 is the regularization coeﬃcient, and A is a positive semideﬁnite matrix. A is the Mahalanobis norm: θ A :=

θ Aθ. This covers both homogeneous (e.g. A = I) and inhomogeneous (e.g. A = [I, 0; 0, I]) learners. We follow the convention in optimization when we use the capitalized Argmin to emphasize that it returns a set that achieves the minimum. The teacher can construct a training set with any items in Rd. The alternative pool-based teaching setting, where the teacher is given a ﬁnite pool of candidate training items and must select items from that pool, is not studied in this paper. By linear learners we mean the input x and the parameter θ interact only via their inner product x θ. Linear learners include SVMs, logistic regression, and linear regression. Our analysis technique involves a novel application of the Karush-Kuhn-Tucker (KKT) conditions.

homogeneous inhomogeneous ridge SVM logistic ridge SVM logistic

exact parameter 1

λ θ 2 l λ θ 2

2 2 l λ w 2

decision boundary - 1 1 - 2 2

Table 1: The teaching dimension of ridge regression, SVM, and logistic regression. ( : up to rounding eﬀect, see section 3.3).

The Teaching Dimension of Linear Learners

To our knowledge, this paper gives the ﬁrst known values of teaching dimension for ridge regression, SVM, and logistic regression. We summarize our main results in Table 1. The table separately lists homogeneous (without a bias term) and inhomogeneous (with a bias term) versions of the linear learners. The teaching goal refers to the intention of the teacher: is teaching considered successful only when the learner learns the exact target parameter, or when the learner learns the correct decision boundary (which can be achieved by any positive scaling of the target parameter)? See section 3 for deﬁnition of the target parameters θ , w and the constant τmax. The target parameters are assumed to be nonzero. We will also present the corresponding minimum teaching set construction in section 3.

2. Classic Teaching Dimension and its Limitations

Let X denote the input space and Y R the output space. A hypothesis is a function h : X Y. In this section we identify a hypothesis hθ with its model parameter θ. The hypothesis space H is a set of hypotheses. By training item we mean a pair (x, y) X Y. A training set is a multiset D = {(x1, y1) . . . (xn, yn)} where repeated items are allowed. Importantly, for the purpose of teaching we do not assume that D be drawn i.i.d. from a distribution. Let D = n=1(X Y)n denote the set of all training sets of all sizes. A learning algorithm A : D 2H takes in a training set D D and outputs a subset of the hypothesis space H. That is, A does not necessarily return a unique hypothesis. Classic teaching dimension analysis is restricted to the version-space learner Avs:

Avs(D) = {h H | (x, y) D, h(x) = y}. (2)

That is, the learner Avs keeps track of the version space consisting of all hypotheses h that are consistent with D. Let the target model be hθ H. Teaching is successful if the teacher identiﬁes a training set D D such that Avs(D) = {hθ } the singleton set. Such a D is called a teaching set of hθ with respect to H. The teaching dimension of the hypothesis hθ is the minimum size of the teaching set:

TD(hθ ) = min D D |D|, for D a teaching set of hθ , if no teaching set exists

Furthermore, the teaching dimension of the whole hypothesis space H is deﬁned by the hardest hypothesis: TD(H) = maxh H TD(h). In this paper we will focus on the ﬁnegrained teaching dimension of individual hypothesis TD(h). Classic teaching dimension analysis has several limitations: the learner is assumed to be a version-space learner Avs, and the hypothesis space is typically ﬁnite or countably inﬁnite. As the example in section 1 showed, these fail to capture the teaching dimension of modern machine learners which has Rd as input space and picks a unique hypothesis via regularized empirical risk minimization (1). Furthermore, the target model can be ambiguous when the learner is a classiﬁer: should the learner learn the exact target parameter θ , or the target decision boundary? In linear models any scaled parameter cθ with c > 0 produces the same target decision boundary. These limitations motivate us to generalize the teaching dimension in the next section.

Liu and Zhu

3. Main Results

To make our teaching dimension s dependency on the learning algorithm explicit, henceforth we write teaching dimension with two arguments as

where h H is the target model, and A : D 2H is the learning algorithm which given a training set D D returns a set of hypotheses A(D). We deﬁne teaching dimension to be the size of the smallest training set D such that A(D) = {h }, the singleton set containing the target model. With this notation, the classic teaching dimension is TD(h , Avs) where Avs is the version space learning algorithm (2). In this paper we focus on Aopt in (1) instead, namely linear learners in Rd. Linear learners include many popular members such as both homogeneous and inhomogeneous versions of linear regression, SVM, and logistic regression. In addition, the linear interaction between x and θ makes the loss function subgradient easy to compute, though in principle our analysis technique is applicable to other optimizationbased learners, too. In this section our goal is to teach the exact parameter θ , consequently our teaching dimension of interest is

TD(θ , Aopt).

Later in section 4 for classiﬁcation we will teach the decision boundary instead. How to reason about our teaching dimension TD(θ , Aopt)? It is the size of the smallest training set D with which (1) has a unique solution θ . Our strategy is to ﬁrst establish a number of lower bounds LB TD(θ , Aopt) by showing that any training set with which (1) has a unique solution θ must have at least LB items. Section 3.1 is devoted to such lower bounds. The actual teaching dimension is learner dependent. In sections 3.2 and 3.3 we construct speciﬁc teaching sets for three popular learners: ridge regression, SVM, and logistic regression. These teaching sets uniquely returns θ via (1). By deﬁnition, the size of these teaching sets is an upper bound on TD(θ , Aopt), respectively. If the lower and upper bounds match, we would have identiﬁed the teaching dimension TD(θ , Aopt).

3.1 Lower Bounds on Teaching Dimension TD(θ , Aopt)

In this section we provide three general lower bounds on the teaching dimension. These lower bounds capture diﬀerent aspects of a teaching set, and should be used in conjunction (i.e. taking the maximum) when applicable. We will instantiate these lower bounds for speciﬁc learners in section 3.2. In the following let X and Y be the feasible region of all xi s and yi s respectively. We will use the notation 1ℓ( , ) in the following way: if ℓ( , ) is smooth, then it denotes a singleton set only containing the gradient w.r.t. the ﬁrst argument; if ℓ( , ) is convex, then it denotes the subdiﬀerential w.r.t the ﬁrst argument. LB1 comes from a degree-of-freedom perspective. It is necessary to have this amount of training items for a unique solution to exist in (1).

Theorem 1 Given any target model θ , there is a degree-of-freedom lower bound on the number of training items to obtain a unique solution θ from solving (1):

( d Rank(A) + 1, if Aθ = 0 d Rank(A), otherwise. (3)

The Teaching Dimension of Linear Learners

Proof Let n be the minimal number of training items to ensure a unique solution θ . First consider the case n = 0. It happens if and only if θ = 0 and Rank(A) = d, which is a special case of Aθ = 0. Clearly, this case is consistent with LB1. Next consider the case n 1. Since θ solves (1), the KKT condition holds:

i=1 1ℓ(x i θ , yi)xi. (4)

We seek all δ such that θ + δ satisﬁes

A(θ + δ) = Aθ and x i (θ + δ) = x i θ i = 1, , n , (5)

For any such δ, simple algebra veriﬁes that θ + tδ satisﬁes the KKT condition (4) for any t [0, 1]. Consequently, θ + δ also solves the problem in (1). To see this, we consider two situations:

If the loss function ℓ( , ) is convex in the ﬁrst argument, the KKT condition is a suﬃcient optimality condition, which means that θ + δ solves (1).

If the loss function ℓ( , ) is smooth (not necessary convex) in the ﬁrst argument, we have f(θ ) = f(θ +δ) by using the Taylor expansion (recall f is deﬁned in equation 1):

f(θ + δ) =f(θ ) + f(θ + tδ),δ (for some t [0, 1])

i=1 1ℓ(x i (θ + tδ), yi)xi + λA(θ + tδ)), δ

i=1 1ℓ(x i θ , yi)xi + λAθ

| {z } =0 due to the KKT condition (4)

Therefore, θ + δ also solves (1). However, the uniqueness of θ requires δ = 0 to be the only value satisfying (5). This is equivalent to say

Null(A) Null(Span{x1, , xn }) = {0}. (6)

It indicates that Rank(A) + Dim(Span{x1, , xn }) d.

From n Dim(span{x1, , xn }), we have n d Rank(A). We proved the general case for LB1. If we have Aθ = 0, we can further improve LB1. Let g = (g 1, . . . , g n ) be the vector satisfying

i=1 g i xi and g i 1ℓ(x i θ , yi) i = 1, 2, , n . (7)

Liu and Zhu

Since θ satisﬁes the KKT condition, such vector g must exist. Applying Aθ = 0 to (7), we have g = 0 and

Dim (Span{A.1, A.2, , A.d} Span{x1, , xn }) 1. (8)

To satisfy (6), we must have

d = Dim (Span{A.1, A.2, , A.d, x1, , xn }) .

Using the fact in linear algebra

Dim (Span{A.1, A.2, , A.d, x1, , xn })

= Dim (Span{A.1, A.2, , A.d}) | {z } =Rank(A)

Dim (Span{x1, , xn }) | {z } n

Dim (Span{A.1, A.2, , A.d} Span{x1, , xn }) | {z } 1 (from (8))

We conclude that n d Rank(A) + 1. We completed the proof for LB1.

LB2 observes that the regularizer acts as a prior. If λ is large, more items are needed to sway the prior toward the target θ .

Theorem 2 Given any target model θ , there is a strength-of-regularization lower bound on the required number of training items to obtain a unique solution θ from solving (1):

λ supα R,y Y,g 1ℓ(α θ 2 A,y) αg 1 , if A has full rank and θ = 0

0, otherwise. (9)

Proof When A has full rank we have an equivalent expression for the KKT condition (4):

2 xi 1ℓ(x i θ , yi) i = 1, , n . (10)

Let us decompose A 1

2 xi for all i = 1, , n into A 1

2 xi = αi A 1 2θ +ui, where ui is orthogonal to A 1 2θ : u i A 1 2θ = 0. Equivalently xi = αi Aθ +A 1 2 ui. Applying this decomposition, we have x i θ = αi θ 2 A + u i A 1 2θ = αi θ 2 A.

Putting it back in (10) we obtain

αi A 1 2θ + ui 1ℓ(αi θ 2 A, yi) i = 1, , n . (11)

The Teaching Dimension of Linear Learners

Since ui is orthogonal to A 1 2θ , (11) can be rewritten as

αi R, yi Y, gi 1ℓ(αi θ 2 A, yi) i = 1, , n

i=1 giui = 0

λA 1 2θ = A 1 2θ n X

i=1 αigi (12)

Since Aθ = 0, we have A 1 2θ = 0 and (12) is equivalent to λ = Pn i=1 αigi. It follows that

i=1 αigi n sup α R,y Y,g 1ℓ(α θ 2 A,y) αg = n sup α R,y Y,g 1ℓ(α θ 2 A,y) αg

It indicates the lower bound for n

& λ supα R,y Y,g 1ℓ(α θ 2 A,y) αg

LB1 and LB2 apply to all generalized linear learners. Due to the popularity of inhomogeneous margin-based linear learners (which include the standard form of SVM and logistic regression), we provide a tighter lower bound LB3 for such learners in Theorem 3. For inhomogeneous margin-based linear learners the learning algorithm Aopt solves a special form of (1):

Aopt(D) = Argminw,b

i=1 ℓ(yi(x i w + b)) + λ

2 w 2 A. (13)

LB3 will prove to be instrumental in computing the teaching dimension for those learners. Following standard notation, we deﬁne θ = [w; b] where w Rd is the weight vector and b R the bias (oﬀset) term. Note θ Rd+1 now. The d d regularization matrix A applies only to w while b is not regularized. Furthermore, margin-based linear learners have loss functions deﬁned on the margin y(x w + b). This loss function structure will play a key role in obtaining LB3.

Theorem 3 Assume matrix A in (13) has full rank and w = 0. Given any target model [w ; b ], there is an inhomogeneous-margin lower bound on the required number of training items to obtain a unique solution [w ; b ] from solving (13):

sup α R,g ℓ(α w 2 A) αg

Liu and Zhu

Proof Let D = {xi, yi}i=1, ,n be a teaching set for [w ; b ]. The following KKT condition needs to be satisﬁed:

i=1 ℓ(yi(x i w + b ))yi

If we construct a new training set

ˆD = ˆxi = xi + b

w 2 A Aw , ˆyi = yi

then [w ; 0] satisﬁes the KKT condition deﬁned on ˆD. This can be veriﬁed as follows:

i=1 ℓ(ˆyi(ˆx i w )) ˆyi

i=1 ℓ(yi(x i w + b ))yi

" xi + b w 2 A Aw

i=1 ℓ(yi(x i w + b ))yi

| {z } 0 from (15)

" b w 2 A Aw

i=1 ℓ(yi(x i w + b ))yi

| {z } 0 from (15)

where 0 Pn i=1 ℓ(yi(x i w + b ))yi is from the bias dimension in (15). It follows that

i=1 ℓ(ˆyiˆx i w )ˆyiˆxi + λAw

which is equivalent to

i=1 ℓ(ˆyiˆx i w )A 1

2 ˆyiˆxi |{z} =:zi

i=1 ℓ(z i w )A 1

2 zi + λA 1 2 w . (16)

We decompose A 1

2 zi = αi A 1 2 w + ui where ui satisﬁes u i A 1 2 w = 0. Applying this decomposition to (16), we have

i=1 ℓ(αi w 2 A)(αi A 1 2 w + ui). (17)

Since ui is orthogonal to A 1 2 w , (17) implies that

i=1 ℓ(αi w 2 A)αi A 1 2 w .

The Teaching Dimension of Linear Learners

Since w = 0 we have

i=1 ℓ(αi w 2 A)αi.

Together with n X

i=1 ℓ(αi w 2 A)αi n sup α R,g ℓ(α w 2 A) αg,

we obtain LB3.

3.2 The Teaching Dimension TD(θ , Aopt) of Three Homogeneous Learners

We now turn to upper bounding teaching dimension by constructing teaching sets. To prove that we indeed have a teaching set for a target θ , we need to show that θ is a solution of (1), and the solution is unique. The size of any such teaching set is an upper bound on the teaching dimension. The teaching dimension itself is determined if such an upper bound matches the corresponding lower bound. We show that this is indeed the case for our constructed teaching sets. For the sake of reference we preview in Table 2 the instantiated lower bounds that we will use in this section; their derivation will be shown below.

homogeneous inhomogeneous lower bound ridge SVM logistic ridge SVM logistic LB1 1 1 1 2 2 2

λ θ 2 l λ θ 2

LB3 - - - -

λ w 2 l λ w 2

Table 2: Lower bounds of teaching dimension TD(θ , Aopt) for homogeneous and inhomogeneous versions of ridge regression, SVM, and logistic regression.

Teaching dimension is learner-dependent. We choose three learners to study their teaching dimension due to these learners popularity in machine learning: ridge regression, SVM, and logistic regression. It turns out that homogeneous and inhomogeneous versions of these learners require diﬀerent analysis. We devote this section to the homogeneous version where the regularizer matrix A = I the identity matrix, and the next section to the inhomogeneous version. It is possible to extend our analysis to other linear learners of the form (1). It is easy to see that if the target model θ = 0, we do not need any training data to uniquely obtain the target model from (1). In the following, we only consider the nontrivial case θ = 0. Homogeneous ridge regression solves the following optimization problem:

1 2(x i θ yi)2 + λ

2 θ 2. (18)

We only need one training item to uniquely obtain any nonzero target model θ , as the following construction shows.

Liu and Zhu

Proposition 1 Given any target model θ = 0, the following is a teaching set for homogeneous ridge regression (18):

x1 = aθ , y1 = λ + x1 2

where a can be any nonzero real number.

Proof We simply verify the KKT condition to see that θ is a solution to (18) by applying the construction in (19). The uniqueness of θ is guaranteed by the strong convexity of (18).

It is worth to note that the teaching set is inconsistent with the target model, that is, x 1 θ = a θ 2 = y1 = λ

a + a θ 2, unless the regularization is absent λ = 0. The teacher intentionally overshoots the target in order to precisely counter the learner s regularizer. This has been observed before for Bayesian learners, too (Zhu, 2013).

We encourage the reader to distinguish two senses of uniqueness. The teaching set itself is not necessarily unique. In the construction (19), any a = 0 leads to a valid teaching set. Nonetheless, any one of the teaching sets will lead to the unique solution θ in (18).

Corollary 1 The teaching dimension TD(θ , Ahom ridge) = 1 for homogeneous ridge regression and target θ = 0.

Proof Substituting A by I in LB1 (3), we obtain the lower bound d Rank(I) + 1 = 1 which matches the teaching set size in (19).

Homogeneous SVM solves the problem:

i=1 max(1 yix i θ, 0) + λ

2 θ 2. (20)

To teach this learner one training item is in general not enough: we will show that we need λ θ 2 training items. In fact, we will construct such a teaching set consisting of identical training items. It is well-known in the teaching literature that a teaching set does not need to consist of i.i.d. samples from a distribution, and can look unusual. It is possible to incorporate additional constraints into a teaching problem if one wants the training items to be diverse, but we do not consider that in the present paper.

Proposition 2 Given any target model θ = 0, the following is a teaching set for homogeneous SVM (20). There are n = λ θ 2 identical training items, each taking the form

λ θ 2 , yi = 1. (21)

Proof We only need to verify that the KKT condition holds for θ . Due to the strong convexity of (20) uniqueness is guaranteed automatically. We denote the subgradient

The Teaching Dimension of Linear Learners

a max(1 a, 0) = 1 max(1 a, 0) = I(a), where

1, if a < 1 [0, 1], if a = 1 0, otherwise

The KKT condition is

i=1 yixi 1 max(1 yix i θ , 0) + λθ

i=1 yixi I(yix i θ ) + λθ

λ θ 2 I λ θ 2

= λθ I λ θ 2

where the last line is due to I λ θ 2

λ θ 2 giving either the set [0, 1] or the value 1.

Corollary 2 The teaching dimension TD(θ , Ahom svm) = λ θ 2 for homogeneous SVM and target θ = 0.

Proof We show this number matches LB2. Let A = I, ℓ(a, b) = max(1 ab, 0), and consider the denominator of (9):

sup α R,y Y,g 1ℓ(α θ 2,y) αg = sup α,y { 1,1},g y I(yα θ 2) αg

= sup α,g I(α θ 2) αg

where the ﬁrst equality is due to 1ℓ(a, b) = b I(ab). Therefore, LB2 = λ θ 2 which matches the construction in (21).

Homogeneous logistic regression solves the problem:

i=1 log(1 + exp{ yix i θ}) + λ

where log has base e. The situation is similar to homogeneous SVM. However, due to the negative log likelihood term we have a coeﬃcient deﬁned by the Lambert W function (Corless et al., 1996), which we denote by Wlam. Recall the deﬁning equation for Lambert W

Liu and Zhu

function is Wlam(x)e Wlam(x) = x. We further deﬁne

τmax := max t t 1 + et = Wlam(1/e) 0.2785,

where the equality can be derived in following: The optimal t satisﬁes

1 + et = t et (t 1)et 1 = 1/e

which suggests t = Wlam(1/e) + 1. We apply the optimality condition above and the optimal value of t to obtain

max t t 1 + et = t

et = 1 e e Wlam(1/e) = Wlam(1/e).

For any value a τmax, we deﬁne τ 1(a) as the solution to a = t 1+et . By using the Lambert W function τ 1(a) can be expressed as τ 1(a) a Wlam( aea), which can be derived from t 1 + et = a Wlam( aea)

1 + ea Wlam( aea) = a + aea/e Wlam( aea)

1 + ea Wlam( aea) = a.

Proposition 3 Given any target model θ = 0, the following is a teaching set for homogeneous logistic regression (23). There are n = l λ θ 2

m identical training items, each taking the form

λ θ 2 λ θ 2

θ 2 , yi = 1. (24)

Proof We ﬁrst verify that θ is a solution to (23) based on the teaching set construction in (24). We only need to verify the gradient of (23) is zero. Computing the gradient of (23), we have

yixi 1 + exp{yix i θ } + λθ

1 + exp τ 1 λ θ 2 l λ θ 2

= n τ 1 λ θ 2 l λ θ 2

1 + exp τ 1 λ θ 2 l λ θ 2

= nλ θ 2 λ θ 2

where the third equality uses the fact λ θ 2 l λ θ 2

m 1 τmax and the property a =

τ 1(a) 1+eτ 1(a) . The strong convexity of (23) automatically implies uniqueness.

The Teaching Dimension of Linear Learners

Corollary 3 The teaching dimension TD(θ , Ahom log ) = l λ θ 2

m for homogeneous logistic regression and target θ = 0.

Proof We show that the number matches LB2. In (9) let A = I and ℓ(a, b) = log(1 + exp{ ab}). The denominator of LB2 is:

sup α R,y Y,g 1ℓ(α θ 2,y) αg = sup α,y { 1,1},g=y(1+exp{yα θ 2}) 1 αg

= sup α,g=(1+exp{α θ 2}) 1 αg

= sup α α 1 + exp{α θ 2}

= θ 2 sup t t 1 + exp{t}

which implies LB2 = l λ θ 2

3.3 The Teaching Dimension TD(θ , Aopt) of Three Inhomogeneous Learners

Inhomogeneous learners are deﬁned by θ = [w; b] where the weight vector w Rd and the scalar oﬀset b R. The oﬀset b is not regularized. Similar to the previous section, we need to instantiate the teaching dimension lower bounds and design the teaching sets. We show that the size of our teaching set exactly matches the lower bound for inhomogeneous ridge regression, and diﬀers from the lower bound of inhomogeneous SVM and logistic regression by at most one due to rounding. Therefore, up to rounding we also establish the teaching dimension for these inhomogeneous learners. Inhomogeneous ridge regression solves the problem:

min w Rd,b R

1 2(x i w + b yi)2 + λ

Proposition 4 Given any target model [w ; b ], if w = 0 (b can be an arbitrary value), the following is a teaching set for inhomogeneous ridge regression (25) with n = 1:

x1 = 0, y1 = b . (26)

If w = 0, any n = 2 items satisfying the following are a teaching set for a = 0:

x1 x2 = aw , y1 = x 1 w + b + λ

a, y2 = y1 a w 2 2λ

Proof We ﬁrst prove the case for w = 0. We can verify that the KKT condition is satisﬁed by designing x1 and y1 as in (26):

(x 1 w + b y1)x1 + λw =0

x 1 w + b y1 =0.

Liu and Zhu

The uniqueness of [w ; b ] is indicated by the strong convexity of (25) when n = 1. We then prove the case for w = 0. With simple algebra, we can verify the KKT condition holds via the construction in (27):

(x 1 w + b y1)x1 + (x 2 w + b y2)x2 + λw =0

(x 1 w + b y1) + (x 2 w + b y2) =0.

Similarly, the uniqueness is implied by the strong convexity of (25) when n = 2.

Corollary 4 The teaching dimension for inhomogeneous ridge regression with target θ = [w ; b ] is TD(θ , Ainh ridge) = 1 if target w = 0, or TD(θ , Ainh ridge) = 2 if w = 0, regardless of the target oﬀset b .

Proof We match the lower bound LB1 in (3). Note θ = [w ; b ] Rd+1, and A in this case is a (d + 1) (d + 1) matrix with the d d identity matrix Id padded with one additional row and column of zeros for the oﬀset. Therefore Rank(A) = Rank(Id) = d. When w = 0, Aθ = 0 and LB1 = (d + 1) Rank(A) = 1. When w = 0, Aθ = 0 and LB1 = (d + 1) Rank(A) + 1 = 2. These lower bounds match the teaching set sizes in (26) and (27), respectively.

Inhomogeneous SVM solves the problem:

min w Rd,b R

i=1 max(1 yi(x i w + b), 0) + λ

2 w 2. (28)

Proposition 5 Given any target model [w ; b ] with w = 0, the following is a teaching set for inhomogeneous SVM (28). We need n = 2 l λ w 2

2 m training items, half of which

are identical positive items xi = x+, yi = 1, i 1, , n

2 and half identical negative items xi = x , yi = 1, i n

2 + 1, , n . x+ and x can be designed as any vectors satisfying

x +w = 1 b , x = x+ 2w

Proof Unlike in previous learners (including homogeneous SVM), we no longer have strong convexity w.r.t. b. In order to prove that (29) is a teaching set, we need to verify the KKT condition and verify solution uniqueness. We ﬁrst verify the KKT condition to show that the solution under (29) includes the target model [w ; b ]. From (29), we have

x +w + b = 1, x w + b = 1. (30)

The Teaching Dimension of Linear Learners

Applying them to the KKT condition and using the notation in (22) we obtain

2 I(x +w + b ) x+ 1

2 I( x w b ) x 1

2 I(1) x+ 1

2 I(1) x x+ 0

setting the last dimension to 0

applying (29)

observing n λ w 2

It proves that [w ; b ] solves (28) by our teaching set construction. Next we prove uniqueness by contradiction. We use f(w, b) to denote the objective function in (28) under the teaching set. It is easy to verify that f(w , b ) = λ 2 w 2. Assume that there exists another solution [ w; b] diﬀerent from [w ; b ]. We can obtain w 2 w 2 due to

λ 2 w 2 = f(w , b ) = f( w, b) λ

The second equality is due to [ w; b] being a solution; the inequality is due to whole-part relationship. Therefore, there are only two possibilities for the norm of w: w = w or w = t w for some 0 t < 1. Next we will show that both cases are impossible. (Case 1) For the case w = w , we have

f( w, b) =n

2 max 1 (x + w + b), 0 + n

2 max 1 + (x w + b), 0 + λ

x +(w w) + (b b) | {z } =: +

x (w w) (b b) | {z } =:

2 max ( +, 0) + n

2 max ( , 0) + f(w , b ).

From f( w, b) = f(w , b ), it follows + 0 and 0. Since

0 + + = (x+ x ) (w w) = 2(w ) (w w)

w 2 = 2 2 w w

we have w w w 2. But because w = w , we must have w = w . Applying this new observation to + 0 and 0, we obtain b = b. It means that [w ; b ] = [ w; b], contradicting our assumption [w ; b ] = [ w; b].

Liu and Zhu

(Case 2) Next we turn to the case w = t w for some t [0, 1). Recall our assumption that [ w; b] solves (28). Then it follows that the following speciﬁc construction [ ˆw,ˆb] solves (28) as well:

ˆw = tw , ˆb = tb . (31)

To see this, we consider the following optimization problem:

min w,b L(w, b) := n

2 max(1 (x +w + b), 0) + n

2 max(1 + (x w + b), 0)

s.t. w t w . (32)

Since [ w; b] solves (28), it is easy to see that [ w; b] solves (32) too, otherwise there exists a solution for (32) which gives a lower function value on (28). Then we can verify that [ ˆw;ˆb] solves (32) as well by showing the following geometric optimality condition holds:

# [ ˆw;ˆb] N w t w ( ˆw,ˆb) | {z }

Normal cone to the set {[w; b] : w t w } at [ ˆw;ˆb]

Given a convex closed set Ωand a point θ Ω, the normal cone at point θ is deﬁned to be a set

NΩ(θ) = {φ : φ,ψ θ 0 ψ Ω}.

The optimality condition basically suggests that at the optimal point, the negative (sub)gradient direction overlaps with the normal cone. In other words, there does not exist any valid direction to decrease the objective at the optimal point. Readers can refer to Nocedal and Wright (2006) or Bertsekas and Nedic (2003) for more explanations about the geometric optimality condition.

Because of (30) and (31), we have x + ˆw + ˆb = t < 1. Thus at [ ˆw;ˆb] the subgradient is

# [ ˆw;ˆb] = n

And the normal cone is

N w t w ( ˆw,ˆb) =

The intersection is non-empty by choosing s = n w 2 . Since both [ ˆw;ˆb] and [ w; b] solve

(32), we have L( ˆw,ˆb) = L( w, b). Together with ˆw = w , we have

f( ˆw,ˆb) = L( ˆw,ˆb) + λ

2 ˆw 2 = f( w, b) = f(w , b ).

The Teaching Dimension of Linear Learners

Therefore, we proved that [ ˆw;ˆb] solves (28) as well. To see the contradiction, let us check the function value of f( ˆw,ˆb) via a diﬀerent route:

f( ˆw,ˆb) =f(tw , tb )

i=1 max 1 t(x +w + b ), 0 +

i=1 max 1 + t(x w + b ), 0 + λ

i=1 max (1 t, 0) +

i=1 max (1 t, 0) + λ

2 w 2(1 t2) + λ

2 (1 t2) + λ

2 (1 t)2 + f(w , b )

>f(w , b ),

where the ﬁrst inequality uses the fact that n λ w 2. It contradicts our early assertion f( ˆw,ˆb) = f(w , b ). Putting cases 1 and 2 together we prove uniqueness.

Our construction of the teaching set in (29) requires n = 2 l λ w 2

2 m training items. This is an upper bound on the teaching dimension. Meanwhile, we show below that the inhomogeneous SVM lower bound is LB3 = λ w 2 . There can be a diﬀerence of at most one between the lower and upper bounds, which we call the rounding eﬀect. We suspect that this small gap is a technicality and not intrinsic. However, at present we do not have a teaching set construction that bridges this gap. Therefore, we state the teaching dimension as an interval in the following corollary and leave the precise value as an open question for future research.

Corollary 5 The teaching dimension for inhomogeneous SVM and target θ = [w ; b ] where w = 0 is in the interval λ w 2 TD(θ , Ainh svm) 2 l λ w 2

Proof The upper bound directly follows Proposition 5. We only need to show the lower bound LB3 = λ w 2 in Theorem 3. Let A = I, ℓ(a) = max(1 a, 0), and consider the denominator of (14):

sup α R,g ℓ(α w 2) αg = sup α,g I(α w 2) αg = 1 w 2

where the ﬁrst equality is due to ℓ(a) = I(a). Therefore, LB3 = λ w 2 which proves the lower bound.

Inhomogeneous logistic regression solves the problem

min w Rd,b R

i=1 log(1 + exp{ yi(x i w + b)}) + λ

2 w 2. (33)

Liu and Zhu

Proposition 6 To create a teaching set for target model [w ; b ] with nonzero w for inhomogeneous logistic regression (33), we can use n = 2 l λ w 2

m training items where

xi = x+, yi = 1, i 1, , n

2 and xi = x , yi = 1, i n

2 + 1, , n . x+ and x can be designed as any vectors satisfying

x +w = t b , x = x+ 2t w 2 w , (34)

where the constant t is deﬁned by t := τ 1 λ w 2

Proof We ﬁrst point out that for t to be well-deﬁned the argument to τ 1() has to be bounded λ w 2

n τmax. This implies n λ w 2

τmax . The size of our proposed teaching set is the smallest among all such symmetric construction that satisfy this constraint. We verify that the KKT condition to show the construction in (34) includes the solution [w ; b ]. From (34), we have

x +w + b = t x w + b = t.

We apply them and the teaching set construction to compute the gradient of (33):

2 1 1 + exp{x +w + b }

2 1 1 + exp{ x w b }

2 1 1 + exp{t}

2 1 1 + exp{t}

= n w 2 t 1 + exp{t}

= n w 2 λ w 2

This veriﬁes the KKT condition. Finally we show uniqueness. The Hessian matrix of the objective function (33) under our training set (34) is:

2 exp{t} (1 + exp{t})2 | {z } :=a

x+x + + x x x+ + x x + + x 2

Note a > 0 and A = x+ 1

x 1 is positive semi-deﬁnite. We show

that a A + λB is positive deﬁnite. Suppose not. Then there exists [u; v] = 0 such that [u; v] (a A + λB)[u; v] = 0. This implies [u; v] (a A)[u; v] + λu u = 0. Since the ﬁrst term is non-negative due to A being positive semi-deﬁnite, u = 0. But then we have 2av2 = 0 which implies [u; v] = 0, a contradiction. Therefore uniqueness is guaranteed.

The Teaching Dimension of Linear Learners

Corollary 6 The teaching dimension for inhomogeneous logistic regression and target θ = [w ; b ] where w = 0 is in the interval l λ w 2

m TD(θ , Ainh log ) 2 l λ w 2

Proof The upper bound directly follows Proposition 6. We only need to show the lower bound l λ w 2

m by applying LB3 in Theorem 3. Let A = I and ℓ(a) = log(1 + exp{ a}) and consider the denominator of (14):

sup α R,g ℓ( α w 2) αg = sup α,g=(1+exp{α w 2}) 1 αg

= sup α α 1 + exp{α w 2}

= w 2 sup t t 1 + exp{t}

which implies LB3 = l λ w 2

4. Teaching a Decision Boundary Instead of a Parameter

In section 3 we considered the teaching goal where the learner is required to learn the exact target parameter θ . But when the learner is a classiﬁer often a weaker teaching goal is suﬃcient, namely teaching the learner a target decision boundary. In this section we consider this teaching goal. Equivalently, such a goal is deﬁned by the set of parameters that produce the target decision boundary. Teaching is successful if the learner arrives at any one parameter within that set. In the case of inhomogeneous linear learners, the linear decision boundary {x | x w + b = 0} is identiﬁed with the parameter set {t[w ; b ] : t > 0}. Here we assume w is nonzero. The parameter θ = [w ; b ] is just a representative member of the set. Homogeneous linear learners are similar without b . We denote the corresponding decision boundary teaching dimension by TD({tθ }, Aopt). This notation extends our earlier definition of TD by allowing the ﬁrst argument to be a set, with the understanding that the teaching goal is for the learned model to be an element in the set. It immediately follows that TD({tθ }, Aopt) = min t>0 TD(tθ , Aopt).

Since it is suﬃcient to teach the parameter tθ for some t > 0 in order to teach the decision boundary, we can choose the best t that minimizes TD(tθ , Aopt). For SVM and logistic regression either homogeneous or inhomogeneous the teaching dimension TD(tθ , Aopt) depends on tθ (see Table 1). We can choose t suﬃciently small to drive down the teaching set size toward its possible minimum indicated by the LB1 value in Table 2 (which is nonzero because of the ceiling function). Speciﬁcally, for any ﬁxed parameter θ representing the target decision boundary:

(homogeneous SVM): we choose t 1

λ θ so that TD({tθ }, Ahom svm) = 1;

Liu and Zhu

(homogeneous logistic regression): we choose t τmax

λ θ so that TD({tθ }, Ahom log ) = 1;

(inhomogeneous SVM): we choose t

λ w so that TD({tθ }, Ainh svm) = 2 (note LB1=2

in Table 2);

(inhomogeneous logistic regression): we choose t 2τmax

λ w so that TD({tθ }, Ainh log ) = 2

(note LB1=2 in Table 2).

The resulting teaching dimension TD({tθ }, Aopt) is listed in Table 1 on the row marked by decision boundary. The teaching set construction is the same as in sections 3.2 and 3.3, respectively, but with tθ .

5. Related Work

Teaching dimension as a learning-theoretic quantity has attracted a long history of research. It was proposed independently in Goldman and Kearns (1995); Shinohara and Miyano (1991). Subsequent theoretical developments can be found in e.g. Zilles et al. (2011); Balbach and Zeugmann (2009); Angluin (2004); Angluin and Krikis (1997); Goldman and Mathias (1996); Mathias (1997); Balbach and Zeugmann (2006); Balbach (2008); Kobayashi and Shinohara (2009); Angluin and Krikis (2003); Rivest and Yin (1995); Ben David and Eiron (1998); Doliwa et al. (2014). Many of them assume little extra knowledge on the learner other than that it is consistent with the training data; though Zilles et al. (2011); Balbach (2008) allow the teacher and the learner to cooperate. These theoretically elegant teaching deﬁnitions diverge from the practice of modern machine learning where the learner solves an optimization problem to ﬁnd a single model that is not necessarily the 0-1 loss ERM. Teaching such modern learners is our goal. Section 6 discusses a new view to unify our work and some existing optimal teaching work.

Teaching dimension is distinct from VC dimension. For a ﬁnite hypothesis space H, Goldman and Kearns (1995) proved the relation

V C(H)/ log(|H|) TD(H) V C(H) + |H| 2V C(H).

These inequalities are somewhat weak, as Goldman and Kearns had shown both cases where one quantity is much larger than the other. The distinction between TD and VC dimension is also present in our setting. For example, by inspecting the inhomogeneous SVM column in Table 1 we note that TD does not depend on the dimensionality d of the feature space Rd. To see why this makes intuitive sense, note two d-dimensional points are suﬃcient to specify any bisecting hyperplane in Rd. On the other hand, recall that the VC dimension for inhomogeneous hyperplanes in Rd is d + 1. Furthermore, there is an interesting connection to sample compression in Floyd and Warmuth (1995). Our teaching set can be viewed as the compressed sample, but with two generalizations: (i) the original sample is the whole input space, and (ii) the labels is allowed to diverge from the target model. Further quantiﬁcation of these connections remains an open research question.

The teaching setting we considered is also distinct from active learning. In teaching the teacher knows the target model a priori and her goal is to encode the target model as a

The Teaching Dimension of Linear Learners

training set, knowing that the decoder is special (namely a speciﬁc machine learning algorithm). This communication perspective highlights the diﬀerence to active learning, which must explore the hypothesis space to ﬁnd the target model. Consequently, the teaching dimension can be dramatically smaller than the active learning query complexity for the same learner and hypothesis space. For example, Zhu (2013) demonstrated that to learn a 1D threshold classiﬁer within ϵ error, the teaching dimension is a constant TD=2 regardless of ϵ, while active learning would require O(log 1

ϵ) queries which can be arbitrarily larger than TD. While the present paper focused on the theory of optimal teaching, there are practical applications, too. One such application is computer-aided personalized education. The human student is modeled by a computational cognitive model, or equivalently the learning algorithm. The educational goal is speciﬁed by the target model. The optimal teaching set is then well-deﬁned, and represents the best personalized lesson for the student (Zhu, 2015, 2013; Khan et al., 2011). In one experiment, Patil et al. showed that real human students learn statistically signiﬁcantly better under such optimal teaching set compared to an i.i.d. training set (Patil et al., 2014). Because contemporary cognitive models often employ optimization-based machine learners, our teaching dimension study helps to characterize these optimal lessons. Another application of optimal teaching is in computer security. In particular, optimal teaching is the mathematical formalism to study the so-called data poisoning attacks (Barreno et al., 2010; Mei and Zhu, 2015a,b; Alfeld et al., 2016). Here the teacher is an attacker who has a nefarious target model in mind. The student is a learning agent (such as a spam ﬁlter) which accepts data and adapts itself. The attacker wants to minimally manipulate the input data in order to manipulate the learning agent toward the attacker s target model. Teaching dimension quantiﬁes the diﬃculty of data-poisoning attacks, and supports research on defenses.

Teaching dimension also has applications in interactive machine learning to quantify the minimum human interaction necessary (Suh et al., 2016; Cakmak and Thomaz, 2011), and in formal synthesis to generate computer programs satisfying a speciﬁcation (Jha and Seshia, 2015).

6. A New View on Teaching

The optimal teaching literature has been cautious about the so-called collusion or coding tricks between the teacher and the learner. Nonetheless, what constitutes collusion does not have a fully satisfactory deﬁnition. Goldman and Mathias (1996) deﬁned the teacher and the learner as collusion-free if (i) the teaching set is consistent with the target concept; (ii) any superset of the teaching set will make the learner learn the target concept, too. While this deﬁnition of collusion-free is useful, it does not capture all interesting learning behaviors. For example, Zilles et al. (2011, section 4) had to introduce a diﬀerent notion of collusion in order to allow benign cooperation between the teacher and the learner. As another example, standard machine learning algorithms such as ridge regression does not satisfy either of the two properties: the teaching set (19) is inconsistent in that y1 = x 1 θ , and adding more consistent training items will in general produce a diﬀerent model due to regularization.

Liu and Zhu

We advance an alternative view on the relation between the teacher and the learner. Under this view, the learner publishes his learning algorithm A : D 2H. Recall A takes in a training set D D and outputs a subset of the hypothesis space H. The teacher then uses a ﬁxed strategy: she simply solves the training set cardinality minimization problem under the constraint that A returns the target hypothesis set Θ . For example, to teach a speciﬁc parameter vector θ the target is the singleton set Θ = {θ }; to teach a decision boundary the target is the set Θ = {tθ | t > 0}. More precisely, the teacher s strategy is to solve the following optimization problem, whose objective value is the (learner-dependent) teaching dimension TD(Θ , A):

min D D |D| (35)

s.t. Θ = A(D).

Our teaching dimension for linear learners clearly ﬁts this view, with Aopt being a regularized empirical risk minimizer (1). Let us look at a few other interesting learners A under this view. We will use the following hypothesis space as it is historically used to contrast those learners (Goldman and Kearns, 1995; Zilles et al., 2011). Let X = {x1, . . . , xn}. Let hi(x) = 1 if x = xi and 0 otherwise, for i = 1 . . . n. In other words, hi is the indicator concept on xi. Let the all-negative concept be h0(x) = 0 for all x. Let H = {h0, h1, . . . , hn}.

The version-space learner Avs as deﬁned by (2). This is the learner behind the teaching dimension deﬁned by Goldman and Kearns (1995). We have Avs({(xi, 1)}) = {hi} for i = 1 . . . n, such that these target concepts have classic teaching dimension TD(hi, Avs) = 1. But note that Avs({(xi, 0)}) = {h0, . . . , hi 1, hi+1, . . .} which does not reduce the version space to a single element. To specify the all-negative concept we need Avs({(x1, 0), . . . , (xn, 0)}) = {h0}. That is, h0 s classic teaching dimension is TD(h0, Avs) = n. These teaching dimensions are the objective values in our view (35) when we plug in Avs.

The Balbach learner AB (Balbach, 2008). Balbach noticed that h1, . . . , hn can each be taught with one item. The reasoning goes that as soon as the teaching set contains more than one item, it must be a helpful teacher s hint that the target concept is h0. That is, the size of the training set carries useful information about the target concept. In the view of (35), we may deﬁne AB({(xi, 1)}) = {hi} for i = 1 . . . n, and AB({(xi, 0), (xj, 0)}) = {h0} for any i = j. For the sake of completeness, here and below for all other D D not explicitly mentioned we simply deﬁne A (D) = {h consistent with D}. When we plug AB into (35) we obtain Balbach s teaching dimension TD(hi, AB) of 1 for h1, . . . , hn, and 2 for h0.

The subset learner As (Zilles et al., 2011). Since the teaching sets for h1, . . . , hn each contain a positive item, it stands to reason that h0 is the target concept as soon as a single negative training item is observed. We can deﬁne As({(xi, 1)}) = {hi} and As({(xi, 0)}) = {h0} for i = 1 . . . n. When we plug As into (35) we obtain the subset teaching dimension of TD(h, As) = 1 for all h H, which is an improvement over the Balbach teaching dimension by a certain benign cooperation.

The Teaching Dimension of Linear Learners

A coding-trick learner Ac1. This learner uses x to encode hypothesis: Ac1({(xi, y)}) = {hi} for i = 1 . . . n regardless of y, and all non-singleton training set maps to h0: Ac1(D) = {h0} if |D| = 1. Ac1 is mathematically well-deﬁned for teaching in (35), but one can argue that it does not seem like a reasonable learner: it ignores y completely and thus is inconsistent (although recall modern regularized empirical risk minimizers (1) can be inconsistent, too).

Another coding-trick learner Ac2. This learner uses training set size to encode the hypothesis, while ignoring the content of the training set: Ac2(D) = {h|D|} if |D| n, and if |D| > n. Again, Ac2 is mathematically well-deﬁned but does not seem like a reasonable learner.

As the examples above show, our alternative view of teaching in (35) does not resolve the issue of what constitutes coding-tricks. All the learners A are well-deﬁned functions mapping a training set to a subset of hypotheses, so that the optimization problem (35) is also well-deﬁned even for unreasonable learners like Ac1 and Ac2. However, our alternative view does provide two beneﬁts:

Because the teacher employs a ﬁxed strategy (35), this view removes the notion of collusion altogether. Instead, the question becomes what learning algorithm A one would consider as admissible. This view point can be more natural when we extend teaching to richer, more complex learners.

There can be a misconception that the classic teaching dimension deﬁned by Goldman and Kearns (1995) is learner-independent and a property of H only, in part perhaps fueled by the original notation TD(H). Our view highlights classic teaching dimension s dependency on the version space learner Avs. It is true that Avs is a particularly simple and elegant learner with very nice properties. But, as others have observed (e.g. Balbach (2008); Zilles et al. (2011)), it does not capture all natural teaching and learning behaviors.

7. Conclusion

We have presented a generalization on teaching dimension to optimization-based learners. To the best of our knowledge, our teaching dimension for ridge regression, SVM, and logistic regression is new; so are the lower bounds and our analysis technique in general. There are many possible extensions to the present work. For example, one may extend our analysis to nonlinear learners. This can potentially be achieved by using the kernel trick on the linear learners. As another example, one may allow approximate teaching by relaxing the teaching goal, such that teaching is considered successful if the learner arrives at a model close enough to the target model. Taken together, the present paper and its extensions are expected to enrich our understanding of optimal teaching and enable novel applications.

Acknowledgments

Liu and Zhu

The authors thank the editor and referees for their valuable comments. Special thanks to the production editor Dr. Charles Sutton for his help to prepare the ﬁnal version of this paper. This work is supported in part by NSF grants CNS-1548078, IIS-0953219, DGE1545481, and by the University of Wisconsin-Madison Graduate School with funding from the Wisconsin Alumni Research Foundation.

S. Alfeld, X. Zhu, and P. Barford. Data poisoning attacks against autoregressive models. AAAI, 2016.

D. Angluin. Queries revisited. Theoretical Computer Science, 313(2):175 194, 2004.

D. Angluin and M. Krikis. Teachers, learners and black boxes. COLT, 1997.

D. Angluin and M. Krikis. Learning from diﬀerent teachers. Machine Learning, 51(2): 137 163, 2003.

F. J. Balbach. Measuring teachability using variants of the teaching dimension. Theor. Comput. Sci., 397(1-3):94 113, 2008.

F. J. Balbach and T. Zeugmann. Teaching randomized learners. COLT, pages 229 243, 2006.

F. J. Balbach and T. Zeugmann. Recent developments in algorithmic teaching. In Proceedings of the 3rd International Conference on Language and Automata Theory and Applications, pages 1 18, 2009.

M. Barreno, B. Nelson, A. D. Joseph, and J. D. Tygar. The security of machine learning. Machine Learning Journal, 81(2):121 148, 2010.

S. Ben-David and N. Eiron. Self-directed learning and its relation to the VC-dimension and to teacher-directed learning. Machine Learning, 33(1):87 104, 1998.

D. Bertsekas and A. Nedic. Convex analysis and optimization (conservative). Athena Scientiﬁc, 2003.

M. Cakmak and A. Thomaz. Mixed-initiative active learning. ICML Workshop on Combining Learning Strategies to Reduce Label Cost, 2011.

R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeﬀrey, and D. E. Knuth. On the Lambert W function. Advances in Computational Mathematics, 5(1):329 359, 1996.

T. Doliwa, G. Fan, H. U. Simon, and S. Zilles. Recursive teaching dimension, VC-dimension and sample compression. Journal of Machine Learning Research, 15:3107 3131, 2014.

S. Floyd and M. Warmuth. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine learning, 21(3):269 304, 1995.

S. Goldman and M. Kearns. On the complexity of teaching. Journal of Computer and Systems Sciences, 50(1):20 31, 1995.

The Teaching Dimension of Linear Learners

S. A. Goldman and H. D. Mathias. Teaching a smarter learner. Journal of Computer and Systems Sciences, 52(2):255 267, 1996.

S. Jha and S. A. Seshia. A theory of formal synthesis via inductive learning. Co RR, 2015.

F. Khan, X. Zhu, and B. Mutlu. How do humans teach: On curriculum learning and teaching dimension. NIPS, 2011.

H. Kobayashi and A. Shinohara. Complexity of teaching by a restricted number of examples. COLT, pages 293 302, 2009.

H. David Mathias. A model of interactive teaching. J. Comput. Syst. Sci., 54(3):487 501, 1997.

S. Mei and X. Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. AAAI, 2015a.

S. Mei and X. Zhu. The security of latent Dirichlet allocation. AISTATS, 2015b.

J. Nocedal and S. J. Wright. Numerical Optimization (2nd edition). Springer, 2006.

K. Patil, X. Zhu, L. Kopec, and B. C. Love. Optimal teaching for limited-capacity human learners. NIPS, 2014.

R. L. Rivest and Y. L. Yin. Being taught can be faster than asking questions. COLT, 1995.

A. Shinohara and S. Miyano. Teachability in computational learning. New Generation Computing, 8(4):337 348, 1991.

J. Suh, X. Zhu, and S. Amershi. The label complexity of mixed-initiative classiﬁer training. ICML, 2016.

X. Zhu. Machine teaching for Bayesian learners in the exponential family. NIPS, 2013.

X. Zhu. Machine teaching: an inverse problem to machine learning and an approach toward optimal education. AAAI, 2015.

S. Zilles, S. Lange, R. Holte, and M. Zinkevich. Models of cooperative teaching and learning. Journal of Machine Learning Research, 12:349 384, 2011.