# Teaching a Black-Box Learner

Sanjoy Dasgupta (University of California, San Diego), Daniel Hsu (Columbia University), Stefanos Poulis (University of California, San Diego; NTENT), Xiaojin Zhu (University of Wisconsin-Madison). Correspondence to: Sanjoy Dasgupta.

*Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).*

**Abstract.** One widely-studied model of teaching (Goldman & Kearns, 1995; Shinohara & Miyano, 1991; Anthony et al., 1992) calls for a teacher to provide the minimal set of labeled examples that uniquely specifies a target concept. The assumption is that the teacher knows the learner's hypothesis class, which is often not true of real-life teaching scenarios. We consider the problem of teaching a learner whose representation and hypothesis class are unknown: that is, the learner is a black box. We find that a teacher who does not interact with the learner can do no better than providing random examples. However, by interacting with the black-box learner, a teacher can efficiently find a set of teaching examples that is a provably good approximation to the optimal set. As an illustration, we show how this scheme can be used to shrink training sets for any family of classifiers: that is, to find an approximately minimal subset of training instances that yields the same classifier as the entire set.

## 1. Introduction

The theory of machine learning has focused primarily on situations where training data consists of random samples from an underlying distribution, as in the statistical learning framework (Valiant, 1984), or is chosen in an arbitrary and possibly adversarial manner, as in online learning (Littlestone, 1988). In real life, however, data is often chosen by a teacher who wishes to help the learner. This is likely to be the case, for instance, when a human is personalizing an electronic assistant; or when an intelligent tutoring system is explaining concepts to a human student; or when one machine is communicating a classifier to another machine with a different architecture or representation. To model such scenarios, several notions of teaching have been developed.

One influential model, introduced independently by Goldman & Kearns (1995), Shinohara & Miyano (1991), and Anthony et al. (1992), is based on the notion of a *teaching set*. To define this, let X be any finite instance space and H any finite set of concepts on X, so that each h ∈ H is of the form h : X → {0, 1}. Let h* ∈ H denote a target concept. We say S ⊆ X is a teaching set for (h*, H) if h* is the only concept in H that is consistent with the labeled examples {(x, h*(x)) : x ∈ S}. An optimal teacher is then one who provides the learner with the smallest possible teaching set.

This notion of teaching is related to compression. A simple learner, human or machine, might have difficulty remembering a large set of examples and finding a hypothesis consistent with them. The teacher helps by identifying a concise set of examples that conveys the desired concept.

Consider, for instance, the case of thresholds on the line. Here data points are real numbers, so X ⊂ ℝ, and the hypothesis class is H = {h_w : w ∈ ℝ}, where

$$h_w(x) = \begin{cases} 1 & \text{if } x \ge w \\ 0 & \text{otherwise.} \end{cases}$$

For finite X, we need only consider |X| + 1 distinct hypotheses. If the target concept is given by threshold w*, the optimal teaching set consists of the two points in X nearest w*, one on either side of it. In this case, no matter how large X might be, two teaching examples suffice. Thus the teacher makes learning easier.
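To make the example concrete, here is a minimal sketch (ours, not the paper's; the helper name is hypothetical) of how such a two-point teaching set can be computed for thresholds on the line:

```python
# Hypothetical helper illustrating the thresholds-on-the-line example.
def optimal_teaching_set(X, w_star):
    """Return the two points of X nearest the target threshold w*, one on
    each side. Labeling just these two points leaves h_{w*} as the only
    threshold behavior consistent with the labels on X."""
    left = max(x for x in X if x < w_star)    # closest point labeled 0
    right = min(x for x in X if x >= w_star)  # closest point labeled 1
    return [(left, 0), (right, 1)]

print(optimal_teaching_set(range(1, 101), 41.5))  # [(41, 0), (42, 1)]
```

Any threshold consistent with these two labels classifies all of X exactly as h_{w*} does, which is what makes the pair a teaching set.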
This example also illustrates a significant issue with this notion of teaching: it requires the teacher to know H, the learner's hypothesis class. This can be unrealistic in many scenarios. When teaching a human, one generally has no idea what the underlying hypotheses might be. And when teaching a machine, the general type of concept might be known (a neural net, for instance), but the specifics (number of layers, number of nodes per layer, other parameter settings) may be opaque; and even if they were known, it is unclear how they would be used in choosing a teaching set.

Under the strong assumption that H is known, teaching does not need to be adaptive: the teacher merely serves up the relevant examples beforehand, and does not need to see what the learner does with them. We will call this *oblivious teaching*. This paper studies two basic questions. How well can oblivious teachers perform in situations where they are not aware of the learner's concept class? Can interaction help in such cases?

### 1.1. Contributions

We begin with a negative result for oblivious teaching. We show that an oblivious teacher who does not know the learner's concept class must, in general, provide labels on all of X. This holds even if the teacher knows that the unknown class H has low VC dimension and small teaching set size.

We then provide two positive results for interactive teaching. We consider an interactive protocol in which the teacher provides one teaching example at a time, and in the interim is allowed to probe the predictions of the learner's current model, rather like giving the learner a quiz. We show that without knowing H, such a teacher can efficiently pick a teaching set of size at most O(t log |X| log |H|), where t is the optimal teaching set size for H. We also look at a setting in which the teacher probes not only the learner's predictions but also its uncertainty levels. Here we show bounds on the number of teaching examples that depend only on the VC dimension of H and its disagreement coefficient (Hanneke, 2011).

One interesting use of our teaching algorithm is in shrinking a training set T: finding a subset S ⊆ T that yields the same final classifier. This can be useful in situations where the computational complexity of training scales poorly (e.g., quadratically) with the number of training instances. Our method can be used with any black-box learner. It constructs S incrementally by adding a few examples at a time, assessing the resulting classifier, and then deciding whether more examples are needed. In this way, it never needs to train on more than |S| instances. We illustrate its behavior in experiments with kernel machines and neural nets.

### 1.2. From finite to infinite

The notion of teaching set is defined for finite instance spaces and concept classes. To see what this means for more general spaces, suppose we have an arbitrary instance space X_o with some distribution P on it, and that the possibly-infinite concept class H_o has VC dimension d. We can draw m = O(d/ε) examples at random from P and treat these as a finite instance space X; by standard generalization bounds, with high probability over the choice of examples, any h ∈ H_o that agrees with the target concept h* on all of X will have error at most ε on distribution P. Moreover, since |X| = m, we know by VC bounds (Sauer, 1972) that the effective number of distinct hypotheses is at most N = O(m^d), and we can take H to be this finite set of candidate concepts. These correspondences, |X| = O(d/ε) and |H| = O(|X|^d), are useful in interpreting the bounds we obtain. Our lower bound then says that an oblivious teacher will need to use a teaching set of size Ω(d/ε), which could be very large for small ε. On the other hand, an interactive teacher can find a teaching set of size O(t d log²(d/ε)), which has only a logarithmic dependence on 1/ε.
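For the record, the substitution behind the last sentence is the following routine calculation (our expansion of the bounds above):

```latex
% Substituting |X| = m = O(d/\epsilon) and |H| = N = O(m^d) into the
% interactive bound O(t \log|X| \log|H|) from Section 1.1:
\log|H| = O(d \log m) = O\!\left(d \log \tfrac{d}{\epsilon}\right),
\qquad
\log|X| = O\!\left(\log \tfrac{d}{\epsilon}\right),
\quad\text{hence}\quad
t \log|X| \log|H| = O\!\left(t\, d \log^2 \tfrac{d}{\epsilon}\right).
```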
### 1.3. Related work: models of teaching

The literature on teaching can be organized along two main dimensions: whether the learner is required to be consistent with all teaching examples, and whether the teacher has full knowledge of the learner (Zhu et al., 2018). Earlier theoretical work on teaching assumes both, such as the classic teaching dimension (Goldman & Kearns, 1995; Shinohara & Miyano, 1991), the recursive teaching dimension (Zilles et al., 2011; Hu et al., 2017), and the preference-based teaching dimension (Gao et al., 2017). Recently, there has been growing interest in settings where one or both of these assumptions are dropped: for instance, the learner is a convex empirical risk minimizer (e.g., Liu & Zhu, 2016), or the teacher does not target a specific learner (Zhu et al., 2017), or the teacher does not know the learner's hyperparameters or hypothesis space (Liu et al., 2017). Of particular relevance is recent work by Liu et al. (2018), which assumes the teacher and the learner use different linear feature spaces. The teacher cannot fully observe the learner's linear model but knows the learner's algorithm and can employ active querying to learn the mapping between feature spaces. In contrast, the present work assumes the learner is consistent with teaching examples but does not require knowledge of its concept class or learning algorithm. This setting is closer to that of classical learning theory and offers a crisp characterization of teaching black-box learners.

### 1.4. Related work: sample compression

The notion of sample compression was introduced by Littlestone & Warmuth (1986) and has been the subject of much further work (e.g., Floyd & Warmuth, 1995; Moran & Yehudayoff, 2016). It is centered on an intriguing question: for a given concept class H, is it possible to design (1) a learning algorithm A that operates on labeled samples of some fixed size k, and (2) a procedure that, given any labeled data set, chooses a subset of size k such that when A is applied to this subset, it produces a classifier consistent with the full data set? A recent result of Moran & Yehudayoff (2016) showed that if H has VC dimension d, then k = d·2^d is always achievable. The results in our paper can be thought of as a form of adaptive sample compression, where the concept class H is unknown and the learning algorithm A is fixed in advance and also unknown.

## 2. Lower bounds for oblivious teaching

Suppose the teacher knows only that the learner's concept class is one of H_1, ..., H_k, but not which one. They all contain the target h*, and suppose they each have teaching sets of size t. What can an oblivious teacher do? One option is to provide a teaching set for the union of the concept classes, H_1 ∪ ⋯ ∪ H_k. This could have size up to tk. We will see that, in general, an oblivious teacher can do no better than this. Thus, if k is large, the teacher simply has to provide labels for all of X.
We will demonstrate this lower bound through two examples that illustrate different problems with oblivious teaching.

### 2.1. Example 1: Decision stumps

It might seem that, even without knowing H exactly, the teacher would still be able to select examples near the true decision boundary. We will see that this is not the case. The different hypothesis classes could effectively deal with very different representations of the data, so that the identities of the boundary examples change dramatically from one hypothesis class to the next.

The instance space will be a finite set X ⊂ ℝ^k, to be specified shortly. The concept classes H_1, ..., H_k consist of thresholds along individual coordinates: H_i consists of all functions h_{i,w} : X → {0, 1} of the form

$$h_{i,w}(x) = \begin{cases} 1 & \text{if } x_i > w \\ 0 & \text{otherwise,} \end{cases}$$

where w ∈ ℝ. That is, the hypotheses in H_i use only the i-th coordinate of the data. Each H_i has VC dimension 1 and a teaching set of size 2.

We select X so that every point in it has either all positive coordinates or all negative coordinates. The target concept h* is 1 if the coordinates are all positive and 0 if they are all negative. Thus h* lies in every H_i: in particular, h* = h_{i,0} for all i.

Set X = {x^(1), x^(2), ..., x^(k), −x^(1), ..., −x^(k)}, where the x^(i) ∈ ℝ^k_+ are defined as follows: the values of the k features of x^(i) are 2, 3, 4, ..., k, in that order, with a 1 inserted in the i-th position. Thus x^(1) = (1, 2, 3, ..., k), x^(2) = (2, 1, 3, ..., k), x^(3) = (2, 3, 1, ..., k), and x^(k) = (2, 3, 4, ..., k, 1).

Along any coordinate i, the correct threshold is 0, and the minimal teaching set consists of the two examples closest to 0, one on either side of it. These are x^(i) and −x^(i), whose i-th coordinates have values 1 and −1 respectively. In other words: for H_i, the optimal teaching set consists of x^(i) and −x^(i). However, the only teaching set that works for every H_i simultaneously is all of X.

**Theorem 1** In the construction above, the concept classes H_1, ..., H_k each have VC dimension 1 and teaching set size 2. If an oblivious teacher does not know which of these concept classes is being used by the learner, the smallest possible teaching set it can provide is all of X, of size 2k.

PROOF: Consider any teaching set that leaves out some point of X, say x^(i). Then, if the learner happens to have concept class H_i, it can consistently set the threshold along the i-th coordinate to 1.5, since the k − 1 positive instances it has seen all have i-th coordinate at least 2. Thus it will get x^(i) wrong. ∎

Thus in this situation, the minimal teaching set has size tk, where t = 2 is the teaching set size of each individual H_i.
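The proof can be checked mechanically. Below is a small sketch (our code, mirroring the Theorem 1 construction with 0-indexed coordinates): omitting the single positive example x^(i) leaves the stump "x_i > 1.5" consistent with everything the learner has seen, even though it misclassifies x^(i).

```python
import numpy as np

k = 5
def x_vec(i):
    """The point x^(i): feature values 2, 3, ..., k with a 1 in position i."""
    v = [float(j) for j in range(2, k + 1)]
    v.insert(i, 1.0)
    return np.array(v)

i = 2                                        # the learner secretly uses H_i
stump = lambda x: int(x[i] > 1.5)            # an alternative hypothesis in H_i
h_star = lambda x: int(np.all(x > 0))        # target: all-positive vs all-negative

# Teaching set that omits x^(i): all other positives, plus every negative.
shown = [x_vec(j) for j in range(k) if j != i] + [-x_vec(j) for j in range(k)]
assert all(stump(x) == h_star(x) for x in shown)   # stump survives the teaching
assert stump(x_vec(i)) != h_star(x_vec(i))         # ...yet errs on the omitted point
```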
### 2.2. Example 2: All-but-one rules

In the previous example, the different concept classes H_i dealt with different representations of the input, and the resulting differences in boundary regions created problems for the teacher. But even if the different hypothesis classes work with the same representation of the input, they might focus on different regions of the input space, in the sense of allowing a more detailed boundary in those regions. In such cases, a teaching set would need to provide the necessary level of detail in all of these regions.

To see a situation of this type, let X be a finite instance space. Define the target hypothesis h* to be identically zero. For any x′ ∈ X, define the hypothesis h_{x′} : X → {0, 1} to be zero everywhere except x′, that is,

$$h_{x'}(x) = \begin{cases} 1 & \text{if } x = x' \\ 0 & \text{otherwise.} \end{cases}$$

For any subset S ⊆ X, define the hypothesis class H(S) = {h*} ∪ {h_x : x ∈ S}. Now partition X into k subsets S_1, ..., S_k of size |X|/k each, and define H_j = H(S_j).

**Theorem 2** In the construction above, each H_j has VC dimension 1 and teaching set size t = |X|/k. If an oblivious teacher does not know which of these concept classes is being used by the learner, the smallest teaching set it can provide is all of X, of size tk.

PROOF: Each hypothesis class H_j = H(S_j) has a teaching set of size t = |S_j|, consisting of {(x, 0) : x ∈ S_j}. There is no smaller teaching set: if x ∈ S_j is left out of the set, then the concepts h* and h_x are both consistent. Likewise, if the teacher does not know which of H_1, ..., H_k is being used by the learner, its minimal teaching set must consist of all points in X. Suppose, on the contrary, that it leaves out a particular x, with x ∈ S_j. Then a learner using H_j could choose h_x as a consistent hypothesis. ∎

## 3. Teaching by probing the learner's predictions

We now consider scenarios in which the teacher has no knowledge of the learner's concept class other than an upper bound on its size or VC dimension. It does not, for instance, have a shortlist of possibilities H_1, ..., H_k, as in the previous section. Nevertheless, if the teacher is allowed to interact with the learner, it can come up with a teaching set that is provably within a factor log |X| log |H| of optimal.

Here is a model in which the teacher and learner communicate in rounds of interaction. On each round:

- The teacher supplies one or more teaching examples (x, y) ∈ X × {0, 1} to the learner.
- The learner gives the teacher a black-box classifier h : X → {0, 1} that is consistent with all the teaching examples it has seen so far.

The idea here is that the teacher cannot look inside the black-box classifier, but can test it on examples to get a sense of where its mistakes lie. If the learner is a machine, this is a natural setup. If the learner is a human, probing the black box corresponds to giving the learner a quiz. In either case, we distinguish teaching examples from probes to the black box. It is of primary interest to keep the former few in number: they constitute a summary of what is important, and the learner is required to be consistent with them. But we also want the number of probes to be polynomially bounded. We will say that an interactive teaching strategy of the type above is a (t, p)-protocol if it ultimately yields a teaching set of size t after making at most p probes.

### 3.1. A generic interactive teaching procedure

Suppose the learner's hypothesis class H has optimal teaching set size t. The teacher has no information about H, except that it contains h*. We will see that there is an efficient interactive protocol of the type above in which the teacher needs to provide at most O(t log |X| log |H|) teaching examples before the learner converges to h*.

The key idea is as follows. Teaching is essentially a set cover problem: each teaching example eliminates some suboptimal hypotheses in H, and a teaching set is a collection of examples that eliminate, or "cover", all suboptimal hypotheses. In this view, optimal teaching is equivalent to minimum set cover. However, in our setting, the set to be covered, namely the set of suboptimal hypotheses, is unknown, since H is unknown, and this would seem to be a major problem. But there is an alternative *online* formulation of the set cover problem, in which the elements to be covered are not provided beforehand but appear one at a time, and must be covered immediately. An elegant algorithm is known for online set cover (Alon et al., 2009), and we will see that it can be simulated in our setting.

The resulting teaching algorithm is shown in Figure 1. It is a randomized procedure that begins by drawing thresholds T_x, one for each x ∈ X, from a suitable exponential distribution. Then the interaction loop begins. A key quantity computed by the algorithm, for any learner-supplied black-box classifier h, is its set of misclassified points,

$$\Delta(h) = \{x \in X : h(x) \neq h^\ast(x)\}.$$

Roughly speaking, the points x most likely to be chosen as teaching examples are those that have been misclassified multiple times by the learner's models, and for which T_x happens to be small.

**Figure 1.** The teacher's algorithm. Here m = |X| and N = |H|. For S ⊆ X, we define w(S) = Σ_{x∈S} w(x).

1. Let S = ∅ (teaching set).
2. For each x ∈ X:
   - Initialize the weight w(x) = 1/m.
   - Choose a threshold T_x from an exponential distribution with rate λ = ln(N/δ).
3. Repeat until done:
   - The learner provides some h : X → {0, 1} as a black box.
   - By probing the black box, determine Δ(h) = {x ∈ X : h(x) ≠ h*(x)}.
   - If Δ(h) = ∅: halt and accept h.
   - While w(Δ(h)) < 1:
     - Double each w(x), for x ∈ Δ(h).
     - If this doubling causes w(x) to exceed T_x for the first time, add x to S and provide (x, h*(x)) as a teaching example to the learner.
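A direct Python transcription of Figure 1 may be helpful. The learner interface below (add_example, predict) is our own minimal assumption; the paper treats the learner as an arbitrary black box that stays consistent with the teaching examples it receives.

```python
import math
import random

def teach(X, h_star, learner, N, delta=0.1):
    """Sketch of the teacher in Figure 1. X: finite instance space (hashable
    items); h_star: the target labeling; N: assumed upper bound on |H|."""
    m = len(X)
    lam = math.log(N / delta)                       # rate lambda = ln(N/delta)
    w = {x: 1.0 / m for x in X}                     # weights, initially 1/m
    T = {x: random.expovariate(lam) for x in X}     # exponential thresholds T_x
    S, taught = [], set()
    while True:
        mistakes = [x for x in X if learner.predict(x) != h_star(x)]  # Delta(h)
        if not mistakes:
            return S                                # learner has reached h*
        while sum(w[x] for x in mistakes) < 1.0:    # until w(Delta(h)) >= 1
            for x in mistakes:
                w[x] *= 2.0                         # doubling step
                if w[x] > T[x] and x not in taught: # crossed its threshold
                    taught.add(x)
                    S.append((x, h_star(x)))
                    learner.add_example((x, h_star(x)))
```

Each pass over X to compute Δ(h) is one round of probes; Theorem 3 below bounds both the number of rounds and the expected size of S.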
**Theorem 3** Let t be the size of an optimal teaching set for H. Pick any 0 < δ < 1. With probability at least 1 − δ, the algorithm of Figure 1 halts after at most t lg(2|X|) iterations. The number of teaching examples it provides is in expectation at most

$$\bigl(1 + t \lg(2|X|)\bigr) \ln \frac{|H|}{\delta}.$$

The algorithm of Figure 1 is efficient and yields a teaching set of size O(t log |X| log |H|), despite having no knowledge of the concept class H. This can be significantly better than a teaching set of all |X| points, which, as we have seen, is what an oblivious teacher would need. Along the way, the teacher makes O(t |X| log |X|) probes to intermediate classifiers provided by the learner. This is polynomial, but nonetheless significant, and consequently this algorithm is better suited to teaching machines than humans. In fact, we will later see (Section 5) that probing all of X is inevitable when the teacher does not know H.

### 3.2. Proof of Theorem 3

The proof of Theorem 3 follows that of the original online set cover algorithm (Alon et al., 2009). We provide it here for reference and because it differs in several small details.

**Lemma 4** Let t be the size of an optimal teaching set for H. Then the total number of doubling steps performed by the algorithm is at most t lg(2m), and at any point in time,

$$\sum_{x \in X} w(x) \le 1 + t \lg(2m).$$

PROOF: First, w(x) ≤ 2 for all x, always. This is because w(x) increases only during a doubling step, which happens only if x belongs to a subset of X of total weight less than 1. Let T* ⊆ X denote an optimal teaching set, of size t. By definition, T* must intersect Δ(h) for all h ≠ h*. Now, a doubling step doubles the weight of each x ∈ Δ(h), and thus of some element of T*. And since the weight of an individual point begins at 1/m and never exceeds 2, the total number of doubling steps cannot exceed t lg(2m). During each doubling step, w(Δ(h)), and thus Σ_x w(x), increases by at most 1. The lemma follows by noting that the initial value of this summation is 1, and there are at most t lg(2m) doubling steps. ∎

**Lemma 5** With probability at least 1 − δ, at the end of any iteration of the main loop, every hypothesis h ≠ h* with w(Δ(h)) ≥ 1 has been invalidated by the teaching examples.
PROOF: Fix any h ≠ h* and consider the first point in time at which w(Δ(h)) ≥ 1. Recall that the thresholds T_x are drawn from an exponential distribution with rate λ = ln(N/δ). Thus the probability, over the random choice of thresholds, that no point of Δ(h) has been chosen as a teaching example is

$$\prod_{x \in \Delta(h)} \Pr(w(x) \le T_x) = \prod_{x \in \Delta(h)} \exp(-\lambda w(x)) = \exp\bigl(-\lambda\, w(\Delta(h))\bigr) \le \exp(-\lambda) = \frac{\delta}{N}.$$

Now take a union bound over all N hypotheses in H. ∎

**Lemma 6** The expected total number of teaching examples provided is at most (1 + t lg(2m)) ln(N/δ).

PROOF: The probability that any particular x ∈ X is eventually provided as a teaching example is

$$\Pr\bigl(\text{final value of } w(x) \text{ exceeds } T_x\bigr) = 1 - \Pr(T_x > w(x)) = 1 - \exp(-\lambda w(x)) \le \lambda w(x),$$

where λ = ln(N/δ) is the rate parameter of the exponential distribution from which T_x is chosen. Thus the expected number of teaching examples is at most

$$\sum_{x \in X} \lambda w(x) \le \lambda \bigl(1 + t \lg(2m)\bigr),$$

where the last inequality invokes Lemma 4. ∎

## 4. Teaching by probing the learner's uncertainty

Suppose that the learner communicates not just a classifier but also its uncertainty. We model this by allowing the teacher two forms of communication with the learner:

- **Teaching example.** The teacher provides a labeled example (x, h*(x)). The learner must subsequently adopt this as a hard constraint.
- **Uncertainty-rated probe.** The teacher makes a query x, and the learner answers according to its current version space: 0 or 1 if everything in the version space agrees on this label, and ? if there is disagreement within the version space.

We again use the notation (t, p)-protocol if a teaching set of size t is produced after at most p uncertainty-rated probes.

To make the notion of disagreement precise, let V denote the current version space of the learner: that is, the set of hypotheses in H that are consistent with all teaching examples seen so far. We define

$$\mathrm{DIS}(V) = \{x \in X : \text{there exist } h, h' \in V \text{ for which } h(x) \neq h'(x)\}.$$

The uncertainty-rated probe then returns ? exactly when x ∈ DIS(V).

If the learner is a machine, it can easily return a black box that computes uncertainty probes. This is no harder than learning: if S is the set of labeled teaching examples seen so far, an uncertainty probe on x ∈ X can be performed efficiently, as follows:

- Let S_0 = S ∪ {(x, 0)}. Check whether there is a hypothesis in H that perfectly fits S_0.
- Let S_1 = S ∪ {(x, 1)}. Check whether there is a hypothesis in H that perfectly fits S_1.
- If both checks succeed, return ?. If only S_0 succeeds, return 0. If only S_1 succeeds, return 1.

If the learner is human, the uncertainty-probe model is less reasonable, and it would make sense to consider a version with weaker requirements, for instance where the learner answers ? only if there is a substantial amount of disagreement on the label, with respect to its (unknown) prior over concepts.

### 4.1. Teaching by simulating the Cohn-Atlas-Ladner active learner

Uncertainty-rated probes permit a teaching strategy that simulates the active learning algorithm of Cohn et al. (1994):

1. The teacher randomly permutes the instance space X.
2. For each x ∈ X, in this order: the teacher probes the learner on x; if the learner returns ?, the teacher provides (x, h*(x)) as a teaching example.
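The following sketch (ours) simulates both sides of this protocol. We represent the learner's version space explicitly so that the uncertainty probe can be answered directly; a real black-box learner would instead answer via the two consistency checks described above.

```python
import random

def probe(V, x):
    """Uncertainty-rated probe: the agreed label, or '?' on disagreement."""
    labels = {h(x) for h in V}
    return "?" if len(labels) > 1 else labels.pop()

def cal_teach(X, h_star, H):
    """Section 4.1 strategy: random order; teach exactly the '?' points."""
    order = list(X)
    random.shuffle(order)                            # step 1: random permutation
    V = list(H)                                      # learner's version space
    S = []
    for x in order:                                  # step 2: probe in order
        if probe(V, x) == "?":
            S.append((x, h_star(x)))                 # teach the disputed point
            V = [h for h in V if h(x) == h_star(x)]  # learner drops violators
    return S
```

Run on the thresholds class from the introduction, for example, the returned set is typically a small cluster of points around the decision boundary.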
Using an analysis by Hanneke (2011), we show that this method constructs a teaching set of expected size θ·(log² |X| + log |X| log |H|), where

$$\theta = \sup_{r > 0} \frac{1}{r} \left| \bigcup_{h \in H :\, |\Delta(h)| \le r} \Delta(h) \right|$$

is the disagreement coefficient of h* within H. (Recall Δ(h) = {x ∈ X : h(x) ≠ h*(x)}.)

Hanneke introduced the disagreement coefficient to bound the label complexity of active learning. To get some intuition for it in our context, consider k hypotheses h_1, ..., h_k that all differ from h* on the same number of points, and suppose the hypothesis class is just H = {h_1, ..., h_k, h*}. If Δ(h_i) ∩ Δ(h_j) = ∅ for all i ≠ j (so θ = k), then a teaching set must contain at least one point from each Δ(h_i). If, on the other hand, the Δ(h_i) overlap substantially (so θ ≪ k), then just a few random points in Δ(h_1) ∪ ⋯ ∪ Δ(h_k) are likely to constitute a teaching set. In many cases, θ has a bound that is independent of the cardinality of X; see Hanneke (2014) for several examples. For instance, for thresholds on the line, θ = 2.

**Theorem 7** The teaching strategy that simulates the Cohn-Atlas-Ladner algorithm constructs a teaching set for (h*, H) of expected cardinality O(θ (log² |X| + log |X| log |H|)). Thus, a (θ (log² |X| + log |X| log |H|), |X|)-protocol is always achievable.

A more careful analysis, as in Hanneke (2011), replaces the log² |X| with log |X| log log |X|.

### 4.2. Proof of Theorem 7

Let x_1, x_2, ..., x_m be the random ordering of X used by the teacher (with m = |X|), and for any 1 ≤ k ≤ m, let

$$H_k = \{h \in H : \Delta(h) \cap \{x_1, \ldots, x_k\} \neq \emptyset\}$$

be the set of hypotheses in H that disagree with h* on at least one of the first k points.

**Lemma 8** The set of teaching examples provided through time k is a teaching set for (h*, H_k).

PROOF: Take any h ∈ H_k, and consider the first x_i ∈ Δ(h) ∩ {x_1, ..., x_k}. Since h agrees with h* on x_1, ..., x_{i−1}, it is consistent with every teaching example provided before time i, and hence still lies in the learner's version space when x_i is probed. As h(x_i) ≠ h*(x_i), the learner must return ?, so (x_i, h*(x_i)) is provided as a teaching example, which invalidates h. ∎

Lemma 8 implies that the final set of teaching examples is indeed a teaching set for (h*, H). It remains to bound the expected number of teaching examples. Define r_{k,δ} = (m/k) ln(|H|/δ).

**Lemma 9** Pick any δ ∈ (0, 1) and any 1 ≤ k ≤ m. With probability at least 1 − δ, every hypothesis h ∈ H that agrees with h* on x_1, ..., x_k has |Δ(h)| ≤ r_{k,δ}.

PROOF: Pick any h ∈ H with |Δ(h)| > r_{k,δ}. The probability that it agrees with h* on x_1, ..., x_k is at most

$$\left(1 - \frac{|\Delta(h)|}{m}\right)^k \le \exp\!\left(-\frac{|\Delta(h)|\, k}{m}\right) < \frac{\delta}{|H|}.$$

Now apply a union bound over all such h. ∎

**Lemma 10** The expected number of teaching examples is at most 2 + θ ln(m) ln(|H| m).

PROOF: Let E_k be the event of Lemma 9, which holds with probability at least 1 − δ, and let Q_k be the event that the learner returns ? when probed on x_k. We will show that Q_k implies that either E_{k−1} fails, or

$$x_k \in X_{k,\delta} := \bigcup_{h \in H :\, |\Delta(h)| \le r_{k-1,\delta}} \Delta(h).$$

Indeed, if Q_k happens, then by Lemma 8 there is some h ∈ H that agrees with h* on x_1, ..., x_{k−1} but has h(x_k) ≠ h*(x_k). If E_{k−1} holds, then any such hypothesis h must have |Δ(h)| ≤ r_{k−1,δ}, whence x_k ∈ X_{k,δ}. This line of reasoning gives the bound

$$\Pr(Q_k) \le \Pr(\neg E_{k-1}) + \frac{|X_{k,\delta}|}{m} \le \delta + \frac{\theta\, r_{k-1,\delta}}{m}.$$

Summing Pr(Q_k) over k = 1, ..., m and choosing δ = 1/m proves the claim. ∎

## 5. A lower bound on the number of probes

In both of our teaching procedures, all of X is probed. To see why this is necessary, recall the hypothesis classes H(S) of Section 2.2 and consider the |X| classes of the form H({x}) = {h*, h_x}, where x is a single element of X. Each such class has a teaching set of size 1, consisting of the labeled example (x, 0). If the teacher does not know which of these concept classes is used by the learner, then every x must either be probed or supplied as a teaching example; otherwise, the teacher cannot be sure that the learner is not using h_x.

## 6. Experimental illustration

In this section, we use the teaching algorithm of Figure 1 to shrink several synthetic and real datasets, that is, to find subsets (teaching sets) of the data that yield the same final classifier.
This can be useful for reducing the storage and transmission costs of training data, or in situations where the computational complexity of training scales poorly with the number of samples. Suppose the learning algorithm has running time T(n), where n is the size of the training set. The algorithm builds a teaching set incrementally, in iterations that involve adding a few points, invoking the learning algorithm, and evaluating the classifier that results. If the teaching set sizes along the way are t_1 < t_2 < ⋯ < t_k, the total training time is T(t_1) + ⋯ + T(t_k), which can be much smaller than T(n).

**Synthetic data.** We looked at synthetic data in the form of moons, circles, and mixtures. For each, we generated two-dimensional separable and non-separable datasets of 4000 points each, by varying the level of noise. We then tested the algorithm using SVM learners with linear, quadratic, and RBF kernels. For each simulation we report: (1) the support vectors (SVs) of each learner; (2) the teaching points (TPs), as decided by the algorithm; (3) the points that are both support vectors and teaching points (TPs AND SVs); and (4) teaching curves.

For a support vector machine, it is always possible to create a teaching set of size two by choosing points whose perpendicular bisector is the target boundary; the maximum-margin objective function will then yield exactly the target classifier. However, any given data set is unlikely to contain such a pair of points. Thus in our examples, the size of the optimal teaching set is not known, although it is certainly upper-bounded by the number of support vectors.

Some of the results are shown in Figure 2. For instance, the top left-hand panel shows the result of the teaching algorithm on the moon-shaped data. There are 123 support vectors in the full data set, but a teaching set of just 19 points is found. As can be seen on the right, these points are picked in five batches: the first batch has two points and already brings the accuracy above 75%. Overall, the learning algorithm is called five times, on data sets of size 2, 10, 13, 17, and 19, and we get the same effect as calling it on the entire set of 4000 points. The full range of experiments on synthetic data can be seen in Figures 3 to 20 in the appendix.

**Real datasets.** We also looked at the MNIST and Fashion-MNIST (Xiao et al., 2017) datasets, both with 60K points.

1. On MNIST, we used an SVM with a quadratic kernel. This data has 32,320 support vectors, and a teaching set of 4,445 points is found (almost all of them support vectors).
2. On Fashion-MNIST, we used a convolutional network with four layers of 2d convolutions (32, 64, 128, and 128 channels), each followed by a ReLU and a max-pooling layer.

The bottom panel of Figure 2 shows the teaching curves for these two datasets. In either case, the accuracy achieved on the full training set is below 100%. For all experiments we used the same termination criterion: the algorithm terminated when it got within .01 of the accuracy of the learner trained on the full data. Also, to initialize the thresholds T_x of the data points, we set the confidence parameter δ of the algorithm to .1.

*Figure 2. Top: Moon data with RBF-kernel SVM; Middle: Mixtures data with quadratic kernel; Bottom: MNIST (quadratic SVM) and Fashion-MNIST (convolutional neural net). Shown: support vectors (SV), teaching points (TP), regular points (RP).*
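To make the shrinking recipe concrete, here is a hypothetical usage sketch (the wrapper class, the choice N = |X|², and all names are ours) that plugs a scikit-learn SVM into the teach() sketch from Section 3. The target h* is the classifier trained on the full set, so the returned teaching set reproduces it; unlike our experiments, this sketch runs the teacher to exact agreement on X rather than to within .01 accuracy.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

class SVCLearner:
    """Minimal black-box learner wrapper assumed by teach() (names ours)."""
    def __init__(self):
        self.X, self.y, self.clf = [], [], None
    def add_example(self, example):
        x, label = example
        self.X.append(list(x)); self.y.append(label)
        if len(set(self.y)) > 1:                       # SVC needs two classes
            self.clf = SVC(kernel="rbf").fit(self.X, self.y)
    def predict(self, x):
        if self.clf is None:
            return self.y[0] if self.y else 0          # degenerate early guess
        return int(self.clf.predict([list(x)])[0])

data, labels = make_moons(n_samples=400, noise=0.1, random_state=0)
full = SVC(kernel="rbf").fit(data, labels)             # classifier on all data
h_star = lambda x: int(full.predict([list(x)])[0])     # target to reproduce

X = [tuple(row) for row in data]                       # hashable instances
S = teach(X, h_star, SVCLearner(), N=len(X) ** 2, delta=0.1)
print(f"teaching set size: {len(S)} of {len(X)}")
```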
## Acknowledgements

This collaboration began during the Foundations of Machine Learning program at the Simons Institute for the Theory of Computing, Berkeley. Dasgupta is grateful to the NSF for support under grant CCF-1813160. Hsu was supported in part by NSF grant CCF-1740833. Zhu was supported in part by NSF grants 1545481, 1704117, and 1836978. Poulis is grateful for support from NTENT. The authors thank Daniel Kane and Phil Long for helpful conversations.

## References

Alon, N., Awerbuch, B., Azar, Y., Buchbinder, N., and Naor, S. The online set cover problem. SIAM Journal on Computing, 39(2):361–370, 2009.

Anthony, M., Brightwell, G., Cohen, D., and Shawe-Taylor, J. On exact specification by examples. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 311–318, 1992.

Cohn, D., Atlas, L., and Ladner, R. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.

Floyd, S. and Warmuth, M. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning, 21(3):269–304, 1995.

Gao, Z., Ries, C., Simon, H. U., and Zilles, S. Preference-based teaching. Journal of Machine Learning Research, 18(31):1–32, 2017.

Goldman, S. and Kearns, M. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, 1995.

Hanneke, S. Rates of convergence in active learning. Annals of Statistics, 39(1):333–361, 2011.

Hanneke, S. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014. URL http://dx.doi.org/10.1561/2200000037.

Hu, L., Wu, R., Li, T., and Wang, L. Quadratic upper bound for recursive teaching dimension of finite VC classes. In Proceedings of the 30th Conference on Learning Theory (COLT), pp. 1147–1156, 2017.

Littlestone, N. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.

Littlestone, N. and Warmuth, M. Relating data compression and learnability. Unpublished manuscript, 1986.

Liu, J. and Zhu, X. The teaching dimension of linear learners. Journal of Machine Learning Research, 17(162):1–25, 2016. URL http://jmlr.org/papers/v17/15-630.html.

Liu, W., Dai, B., Humayun, A., Tay, C., Yu, C., Smith, L. B., Rehg, J. M., and Song, L. Iterative machine teaching. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 2149–2158, 2017.

Liu, W., Dai, B., Li, X., Liu, Z., Rehg, J., and Song, L. Towards black-box iterative machine teaching. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 3147–3155, 2018.

Moran, S. and Yehudayoff, A. Sample compression schemes for VC classes. Journal of the ACM, 63(3), 2016.

Sauer, N. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145–147, 1972.

Shinohara, A. and Miyano, S. Teachability in computational learning. New Generation Computing, 8(4):337–347, 1991.

Valiant, L. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv e-prints, 2017.

Zhu, X., Liu, J., and Lopes, M. No learner left behind: On the complexity of teaching multiple learners simultaneously. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017.

Zhu, X., Singla, A., Zilles, S., and Rafferty, A. An overview of machine teaching. arXiv e-prints, January 2018. URL https://arxiv.org/abs/1801.05927.
Zilles, S., Lange, S., Holte, R., and Zinkevich, M. Models of cooperative teaching and learning. Journal of Machine Learning Research, 12:349–384, 2011.