# Iterative Machine Teaching

Weiyang Liu¹, Bo Dai¹, Ahmad Humayun¹, Charlene Tay², Chen Yu², Linda B. Smith², James M. Rehg¹, Le Song¹

¹Georgia Institute of Technology, ²Indiana University. Correspondence to: Weiyang Liu, Le Song.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

**Abstract.** In this paper, we consider the problem of machine teaching, the inverse problem of machine learning. Different from traditional machine teaching, which views the learner as a batch algorithm, we study a new paradigm where the learner uses an iterative algorithm and a teacher can feed examples sequentially and intelligently based on the current performance of the learner. We show that the teaching complexity in the iterative case is very different from that in the batch case. Instead of constructing a minimal training set for learners, our iterative machine teaching focuses on achieving fast convergence in the learner model. Depending on the level of information the teacher has about the learner, we design teaching algorithms which can provably reduce the number of teaching examples and achieve faster convergence than learning without teachers. We also validate our theoretical findings with extensive experiments on different data distributions and real image datasets.

## 1. Introduction

Machine teaching is the problem of constructing an optimal (usually minimal) dataset for a target concept such that a student model can learn the target concept from this dataset. Recently, there has been a surge of interest in machine teaching, which has found diverse applications in model compression (Bucila et al., 2006; Han et al., 2015; Ba & Caruana, 2014; Romero et al., 2014), transfer learning (Pan & Yang, 2010) and cyber-security (Alfeld et al., 2016; 2017; Mei & Zhu, 2015). Machine teaching is also closely related to other subjects of interest, such as curriculum learning (Bengio et al., 2009) and knowledge distillation (Hinton et al., 2015).

*Figure 1. Comparison between iterative machine teaching and the other learning paradigms (machine teaching, active learning, passive learning).*

In the traditional machine teaching paradigm, a teacher typically constructs a batch set of examples and provides it to a learning algorithm in one shot; the learning algorithm then works on this batch dataset, trying to learn the target concept. Much research on this topic therefore tries to construct the smallest such dataset, or to characterize its size, called the teaching dimension of the student model (Zhu, 2013; 2015). There is also a body of seminal theoretical work analyzing the teaching dimension of different models (Shinohara & Miyano, 1991; Goldman & Kearns, 1995; Doliwa et al., 2014; Liu et al., 2016). However, in many real-world applications the student model is updated via an iterative algorithm, and we have the opportunity to observe the performance of the student model as we feed examples to it. For instance, in model compression, where we want to transfer a target teacher model to a destination student model, we can constantly observe the student model's predictions on the current training points.
Intuitively, such observations allow us to better estimate where the student model is and to pick examples more intelligently, guiding the student model to convergence faster. In a cyber-security setting where an attacker wants to mislead a recommendation system that learns online, the attacker can constantly generate fake clicks and observe the system's response. Such feedback allows the attacker to infer the state of the learning system and to design a better strategy to mislead it. From the perspectives of both faster model compression and better defense against such attacks, we seek to understand some fundamental questions: what sequence of examples should the teacher feed to the student in each iteration in order to achieve fast convergence? And how many such examples, or sequential steps, are needed?

In this paper, we focus on this new paradigm, called iterative machine teaching, which extends traditional machine teaching from the batch setting to the iterative setting. In this new setting, the teacher model can communicate with and influence the student model over multiple rounds, but the student model remains passive. More specifically, in each round the teacher model can observe (potentially different levels of) information about the student in order to intelligently choose one example, and the student model runs a fixed iterative algorithm using this chosen example. Furthermore, the smallest number of examples (or rounds) the teacher needs to construct in order for the student to efficiently learn a target model is called the iterative teaching dimension of the student algorithm. Notice that in this new paradigm, we shift from describing the complexity of a model to the complexity of an algorithm. Therefore, for the same student model, such as logistic regression, the iterative teaching dimension can differ depending on the student's learning algorithm, e.g., gradient descent versus conjugate gradient descent. In some sense, the teacher in this new setting becomes active, but not the student. In Fig. 1, we summarize the differences between iterative machine teaching and traditional machine teaching, active learning and passive learning.

Besides introducing the new paradigm, we also propose three iterative teaching algorithms, called the omniscient teacher, the surrogate teacher and the imitation teacher, based on the level of information about the student that the teacher has access to. Furthermore, we provide partial theoretical analysis for these algorithms under different example construction schemes. Our analysis shows that under suitable conditions, iterative teachers can always perform better than a passive teacher and can achieve exponential improvements. Our analysis also identifies two crucial properties, namely teaching monotonicity and teaching capability, which play critical roles in achieving fast iterative teaching. To corroborate our theoretical findings, we conduct extensive experiments on both synthetic data and real image data. In both cases, the experimental results verify our theoretical findings and the effectiveness of our proposed iterative teaching algorithms.

## 2. Related Work

**Machine teaching.** The machine teaching problem is to find an optimal training set given a student model and a target. (Zhu, 2015) proposes a general teaching framework.
(Zhu, 2013) considers a Bayesian learner in the exponential family and formulates machine teaching as an optimization problem over teaching examples that balances the future loss of the learner and the effort of the teacher. (Liu et al., 2016) provides the teaching dimension of several linear learners. The framework has been applied to security (Mei & Zhu, 2015), human-computer interaction (Meek et al., 2016) and education (Khan et al., 2011). (Johns et al., 2015) further extends machine teaching to interactive settings. However, these works ignore the fact that a student model is typically learned by an iterative algorithm, and that we usually care more about how fast the student can learn from the teacher.

**Interactive machine learning.** (Cakmak & Thomaz, 2014) consider the scenario of a human training an agent to perform a classification task by showing examples. They study how to improve the human teacher by giving teaching guidance. (Singla et al., 2014) consider the crowdsourcing problem and propose a sequential teaching algorithm that can teach crowd workers to better classify queries. Both works consider a very different setting, where the learner (i.e., a human learner) is not iterative and does not follow a particular optimization algorithm.

**Active learning.** Active learning enables a learner to interactively query the oracle to obtain the desired outputs at new samples. Machine teaching differs from active learning in that active learners explore the optimal parameters by themselves rather than being guided by a teacher. Consequently, the two have different sample complexities (Balcan et al., 2010; Zhu, 2013).

**Curriculum learning.** Curriculum learning (Bengio et al., 2009) is a general training strategy that encourages feeding training examples from easy ones to difficult ones. Interestingly, our iterative teacher model suggests a similar training strategy in our experiments.

## 3. Iterative Machine Teaching

The proposed iterative machine teaching is a general concept, and this paper considers the following settings.

**Student's asset.** In general, the asset of a student (learner) includes the initial parameter $w^0$, the loss function, the optimization algorithm, the representation (feature), the model, the learning rate $\eta_t$ over time (and the initial $\eta_0$), and the trackability of the parameter $w^t$. The ideal case is that the teacher has access to all of them and can track the parameters and learning rate, while the worst case is that the teacher knows nothing. How practical the teaching is depends on how much prior knowledge and trackability the teacher has.

**Representation.** The teacher represents an example as $(x, y)$, while the student represents the same example as $(\tilde{x}, \tilde{y})$ (typically $y = \tilde{y}$). The representations $x \in \mathcal{X}$ and $\tilde{x} \in \tilde{\mathcal{X}}$ can be different but are deterministically related: we assume $\tilde{x} = \mathcal{G}(x)$ for an unknown invertible mapping $\mathcal{G}$.

**Model.** The teacher uses a linear model $y = \langle v, x \rangle$ with parameter $v$ ($w$ in the student's space) that is taught to the student. The student also uses a linear model $\tilde{y} = \langle w, \tilde{x} \rangle$ with parameter $w$, i.e., $\tilde{y} = \langle w, \mathcal{G}(x) \rangle = f(x)$ in general. $w$ and $v$ do not necessarily lie in the same space, but for the omniscient teacher they are equivalent and used interchangeably.

**Teaching protocol.** In general, the teacher can only communicate with the student via examples. In this paper, the teacher provides one example $x^t$ per iteration, where $t$ denotes the $t$-th iteration. The goal of the teacher is to provide examples in each iteration such that the student parameter $w^t$ converges to its optimum $w^*$ as fast as possible.
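To make this setup concrete, here is a minimal sketch of the two representation spaces. The linear invertible map `G` below is a hypothetical choice for illustration only; the paper merely requires that $\mathcal{G}$ exist and be invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Hypothetical invertible map G: the student sees G(x) instead of x.
A = rng.standard_normal((d, d)) + d * np.eye(d)   # well-conditioned, hence invertible
def G(x):
    return A @ x

# Teacher's target model v lives in teacher space; the student's optimum w*
# satisfies <w*, G(x)> = <v, x> for all x, i.e., w* = A^{-T} v in this linear case.
v = rng.standard_normal(d)
w_star = np.linalg.inv(A).T @ v

x = rng.standard_normal(d)
teacher_pred = v @ x            # y  = <v, x>
student_pred = w_star @ G(x)    # y~ = <w*, G(x)> = f(x)
print(np.isclose(teacher_pred, student_pred))   # True: same prediction, different spaces
```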
**Loss function.** The teacher and the student share the same loss function. We assume a convex loss function $\ell(f(x), y)$, and the best model is found by minimizing the expected loss:

$$
w^* = \arg\min_{w} \ \mathbb{E}_{(x,y)}\left[\ell(\langle w, x \rangle, y)\right], \tag{1}
$$

where the examples are drawn from the sampling distribution $(x, y) \sim P(x, y)$. Without loss of generality, we only consider typical loss functions, such as the square loss $\frac{1}{2}(\langle w, x \rangle - y)^2$, the logistic loss $\log(1 + \exp(-y\langle w, x \rangle))$ and the hinge loss $\max(1 - y\langle w, x \rangle, 0)$.

**Algorithm.** The student uses stochastic gradient descent to optimize the model. The iterative update is

$$
w^{t+1} = w^{t} - \eta_t \frac{\partial \ell(\langle w^{t}, x \rangle, y)}{\partial w^{t}}.
$$

Without the teacher's guidance, the student can be viewed as being guided by a random teacher who randomly feeds an example to the student in each iteration.

## 4. Teaching by an Omniscient Teacher

An omniscient teacher has access to the student's feature space, model, loss function and optimization algorithm. Specifically, the omniscient teacher's $(x, y)$ and the student's $(\tilde{x}, \tilde{y})$ share the same representation space, and the teacher's optimal model $v$ is the same as the student's optimal model $w^*$.

### 4.1. Intuition and teaching algorithm

To gain intuition about how to make the student model converge faster, we start by examining the difference between the current student parameter and the teacher parameter $w^*$ at each iteration:

$$
\left\| w^{t+1} - w^* \right\|_2^2
= \left\| w^{t} - \eta_t \frac{\partial \ell(\langle w^{t}, x \rangle, y)}{\partial w^{t}} - w^* \right\|_2^2
= \left\| w^{t} - w^* \right\|_2^2
+ \eta_t^2 \underbrace{\left\| \frac{\partial \ell(\langle w^{t}, x \rangle, y)}{\partial w^{t}} \right\|_2^2}_{T_1(x,y\,|\,w^t):\ \text{difficulty of an example } (x,y)}
- \, 2\eta_t \underbrace{\left\langle w^{t} - w^*, \frac{\partial \ell(\langle w^{t}, x \rangle, y)}{\partial w^{t}} \right\rangle}_{T_2(x,y\,|\,w^t):\ \text{usefulness of an example } (x,y)}
$$

Based on this decomposition of the parameter error, the teacher aims to choose a particular example $(x, y)$ such that $\|w^{t+1} - w^*\|_2^2$ is reduced as much as possible compared to $\|w^{t} - w^*\|_2^2$ from the last iteration. Thus the general strategy for the teacher is to choose an example $(x, y)$ such that $\eta_t^2 T_1 - 2\eta_t T_2$ is minimized in the $t$-th iteration:

$$
\operatorname*{argmin}_{x \in \mathcal{X},\, y \in \mathcal{Y}}\ \eta_t^2\, T_1(x, y\,|\,w^t) - 2\eta_t\, T_2(x, y\,|\,w^t). \tag{4}
$$

The teaching algorithm of the omniscient teacher is summarized in Alg. 1. The smallest possible value of $\eta_t^2 T_1 - 2\eta_t T_2$ is $-\|w^t - w^*\|_2^2$. If the teacher achieves this, the teaching goal is reached after this iteration. However, this usually cannot be done in a single iteration because of the limited capability of the teacher to provide examples. $T_1$ and $T_2$ have some nice intuitive interpretations:

**Difficulty of an example.** $T_1$ quantifies the difficulty level of an example. This interpretation becomes especially clear for different loss functions when the data lives on the surface of a sphere, i.e., $\|x\| = 1$. For instance:

- For linear regression, $T_1 = (\langle w, x \rangle - y)^2$. The larger the norm of the gradient, the more difficult the example.
- For logistic regression, $T_1 = \left(\frac{1}{1 + \exp(y\langle w, x \rangle)}\right)^2$, where $\frac{1}{1 + \exp(y\langle w, x \rangle)}$ is the probability of predicting the wrong label. The larger this probability, the more difficult the example.
- For support vector machines, $T_1 = \frac{1}{2}\left(\mathrm{sign}(1 - y\langle w, x \rangle) + 1\right)$. Different from the above losses, the hinge loss uses a threshold to identify the difficulty of an example: when the example is difficult enough, $T_1$ is 1; otherwise it is 0.

Interestingly, the difficulty level is not related to the teacher's $w^*$; it depends only on the current parameter $w^t$ of the learner. From another perspective, the difficulty level can also be interpreted as the information that an example carries: a difficult example is usually more informative. In this sense, our difficulty level has an interpretation similar to curriculum learning, but with a different expression.
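As an illustration of these two quantities, the following sketch computes $T_1$, $T_2$ and the greedy objective $\eta_t^2 T_1 - 2\eta_t T_2$ of Eq. (4) for a single candidate example under the square and logistic losses. This is a minimal sketch; the function names are ours, and only the two smooth losses are shown.

```python
import numpy as np

def grad(w, x, y, loss):
    """Gradient of the per-example loss w.r.t. w."""
    if loss == "square":      # 1/2 (<w,x> - y)^2
        return (w @ x - y) * x
    if loss == "logistic":    # log(1 + exp(-y <w,x>))
        return -y * x / (1.0 + np.exp(y * (w @ x)))
    raise ValueError(loss)

def teaching_objective(w_t, w_star, x, y, eta, loss):
    g = grad(w_t, x, y, loss)
    T1 = g @ g                          # difficulty: squared gradient norm
    T2 = (w_t - w_star) @ g             # usefulness: correlation with the discrepancy
    return eta**2 * T1 - 2 * eta * T2   # Eq. (4): smaller is better

# Tiny usage example with made-up numbers.
rng = np.random.default_rng(1)
d = 4
w_star = rng.standard_normal(d)
w_t = np.zeros(d)
x = rng.standard_normal(d); x /= np.linalg.norm(x)   # ||x|| = 1
y = np.sign(w_star @ x)                              # label from the target model
print(teaching_objective(w_t, w_star, x, y, eta=0.1, loss="logistic"))
```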
**Usefulness of an example.** $T_2$ quantifies the usefulness of an example. Concretely, $T_2$ is the correlation between the discrepancy $w^t - w^*$ and the information (difficulty) carried by the example. If this information has a large correlation with the discrepancy, the example is very useful in this teaching iteration.

**Trade-off.** Eq. (4) aims to simultaneously minimize the difficulty $T_1$ and maximize the usefulness $T_2$. In other words, the teacher always prefers easy but useful examples. When the learning rate is large, the $T_1$ term plays the more important role; when the learning rate is small, the $T_2$ term does. This suggests that the teacher should initially choose easier examples to feed to the student model, and later choose examples that focus more on reducing the discrepancy $w^t - w^*$; such examples are very likely the difficult ones. Even if the learning rate is fixed, the gradient $\nabla_w \ell$ of a convex loss function is usually large at the beginning, so reducing the difficulty level (choosing easy examples) is more important then. Near the optimum, the gradient $\nabla_w \ell$ is usually small, so $T_2$ becomes more important and difficult examples are more likely to be chosen. This has a nice connection to curriculum learning (easy examples first, difficult ones later) and boosting (gradually focusing on difficult examples).

### 4.2. Teaching monotonicity and universal speedup

Can the omniscient teacher always do better (in terms of convergence) than a teacher who feeds random examples to the student? In this section, we identify generic conditions under which the iterative teaching algorithm is guaranteed to perform at least as well as the random teacher.

**Definition 1 (Teaching Volume)** For a specific loss function $\ell$, we define the teaching volume function $TV(w)$ at model parameter $w$ as

$$
TV(w) = \max_{x \in \mathcal{X},\, y \in \mathcal{Y}} \left\{ -\eta_t^2\, T_1(x, y\,|\,w) + 2\eta_t\, T_2(x, y\,|\,w) \right\}. \tag{5}
$$

**Theorem 2 (Teaching Monotonicity)** Given a training set $\mathcal{X}$ and a loss function $\ell$, if the inequality

$$
\|w_1 - w^*\|^2 - TV(w_1) \;\le\; \|w_2 - w^*\|^2 - TV(w_2) \tag{6}
$$

holds for any $w_1, w_2$ that satisfy $\|w_1 - w^*\|^2 \le \|w_2 - w^*\|^2$, then with the same parameter initialization and learning rate, the omniscient teacher always converges no slower than the random teacher.

The teaching volume represents the teacher's teaching effort in one iteration, so $\|w^t - w^*\|^2 - TV(w^t)$ characterizes the remaining teaching effort needed to achieve the teaching goal after iteration $t$. Theorem 2 says that, for a given loss function and training set, if the remaining teaching effort decreases monotonically as the model parameter gets closer to the optimum, the omniscient teacher is guaranteed to converge no slower than the random teacher. This is a sufficient condition on the loss function for achieving faster convergence than SGD. For example, the square loss satisfies the condition for a certain training set:

**Proposition 3** The square loss satisfies the teaching monotonicity condition given the training set $\{x \mid \|x\| \le R\}$.
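As a concrete illustration of Definition 1 and Theorem 2, the sketch below numerically estimates $TV(w)$ over a finite candidate pool for the square loss and evaluates the remaining teaching effort $\|w - w^*\|^2 - TV(w)$ as $w$ approaches $w^*$. This is only a rough numerical check on assumed toy data, not part of the paper's proof.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_pool, eta = 3, 200, 0.3
w_star = rng.standard_normal(d)
pool_x = rng.standard_normal((n_pool, d))
pool_x *= rng.uniform(0.1, 1.0, (n_pool, 1)) / np.linalg.norm(pool_x, axis=1, keepdims=True)  # ||x|| <= 1
pool_y = pool_x @ w_star                 # labels produced by the target model

def teaching_volume(w):
    """TV(w) = max over the pool of -eta^2*T1 + 2*eta*T2, for the square loss."""
    residual = pool_x @ w - pool_y       # <w,x> - y
    grads = residual[:, None] * pool_x   # per-example gradients
    T1 = np.sum(grads**2, axis=1)
    T2 = grads @ (w - w_star)
    return np.max(-eta**2 * T1 + 2 * eta * T2)

# Remaining effort along a segment from a far point toward w*.
w_far = w_star + 2.0 * rng.standard_normal(d)
for alpha in [1.0, 0.75, 0.5, 0.25, 0.1]:
    w = w_star + alpha * (w_far - w_star)
    effort = np.sum((w - w_star)**2) - teaching_volume(w)
    print(f"alpha={alpha:.2f}  remaining effort={effort:.4f}")
```

If the monotonicity condition holds for this pool, the printed remaining effort should shrink toward zero as `alpha` decreases.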
### 4.3. Teaching capability and exponential speedup

The theorem in the previous subsection ensures that under certain conditions the omniscient teacher always leads to faster convergence for the student model, but can there be an exponential speedup? To this end, we introduce further assumptions on the richness of the teaching examples, which we call teaching capability. We start from the ideal case, i.e., the synthesis-based omniscient teacher with a hyperspherical feature space, and then extend to more realistic cases with restrictions on the teacher's knowledge domain, sampling scheme, and student information. We present specific teaching strategies ordered by teaching capability (strong to weak): synthesis, combination and (rescalable) pool.

**Synthesis-based teaching.** In synthesis-based teaching, the teacher can provide any sample from

$$
\mathcal{X} = \{x \in \mathbb{R}^d,\ \|x\| \le R\}, \qquad \mathcal{Y} = \mathbb{R}\ \text{(regression)}\ \text{or}\ \{-1, 1\}\ \text{(classification)}.
$$

**Theorem 4 (Exponential Synthesis-based Teaching)** For a synthesis-based omniscient teacher and a student with fixed learning rate $\eta \neq 0$, suppose the loss function $\ell(\cdot,\cdot)$ satisfies the following: for any $w \in \mathbb{R}^d$, there exists $\gamma \neq 0$ with $|\gamma| \le \frac{R}{\|w - w^*\|}$ such that, with $\hat{x} = \gamma (w - w^*)$ and $\hat{y} \in \mathcal{Y}$,

$$
0 < \gamma\, \frac{\partial \ell(\langle w, \hat{x} \rangle, \hat{y})}{\partial \langle w, \hat{x} \rangle} \le \frac{1}{\eta}.
$$

Then the student can learn an $\epsilon$-approximation of $w^*$ with $O\!\left(C^1_{\gamma,\eta} \log \frac{1}{\epsilon}\right)$ samples. We call such a loss function $\ell(\cdot,\cdot)$ exponentially teachable in synthesis-based teaching. The constant is $C^1_{\gamma,\eta} = \left(\log \frac{1}{1 - \eta\, \nu(\gamma)}\right)^{-1}$, where $\nu(\gamma) := \min_{w, y}\ \gamma\, \frac{\partial \ell(\langle w, \hat{x} \rangle, y)}{\partial \langle w, \hat{x} \rangle} > 0$. $\nu(\gamma)$ is related to the convergence speed.

Note that this sample complexity serves as the iterative teaching dimension corresponding to this particular teacher, student, algorithm and training data. The sample complexity in iterative teaching is deterministic, unlike the high-probability bounds of traditional sample complexity with random i.i.d. samples or actively requested samples. This is because the teacher provides the samples deterministically, without noise, in every iteration.

The radius $R$ of $\mathcal{X}$, which can be interpreted as the knowledge domain of the teacher, affects the sample complexity by constraining the valid values of $\gamma$, and thus $C^1_{\gamma,\eta}$. For example, for the absolute loss, if $R$ is large enough that $\frac{1}{\eta} \le \frac{R}{\|w^0 - w^*\|}$, then $\gamma$ can be set to $\frac{1}{\eta}$ and $\nu(\gamma)$ equals $\frac{1}{\eta}$. In this case $C^1_{\gamma,\eta} = 0$, which means the student can learn with only one example (one iteration). However, if $\frac{1}{\eta} > \frac{R}{\|w^0 - w^*\|}$, we have $C^1_{\gamma,\eta} > 0$, and the student converges exponentially. A similar phenomenon appears for the square loss, hinge loss and logistic loss; refer to Appendix A for details.

Exponential synthesis-based teaching is closely related to Lipschitz smoothness and strong convexity of the loss function, in the sense that these two regularities provide positive lower and upper bounds for $\gamma\, \frac{\partial \ell(\langle w, x \rangle, y)}{\partial \langle w, x \rangle}$.

**Proposition 5** Lipschitz smooth and strongly convex loss functions are exponentially teachable in synthesis-based teaching.

Exponential synthesis-based teachability is a weaker condition than strong convexity plus Lipschitz smoothness. We can show that, besides Lipschitz smooth and strongly convex losses, there are other loss functions that are not strongly convex but are still exponentially teachable in the synthesis-based scenario, e.g., the hinge loss and the logistic loss. Proofs are in Appendix A.

**Algorithm 1** The omniscient teacher

1: Randomly initialize the student and teacher parameter $w^0$;
2: Set $t = 1$ and the maximal iteration number $T$;
3: **while** $w^t$ has not converged or $t < T$ **do**
4: Solve the optimization (e.g., pool-based teaching):
$$
(x^t, y^t) = \operatorname*{argmin}_{x \in \mathcal{X},\, y \in \mathcal{Y}}\ \eta_t^2 \left\| \frac{\partial \ell(\langle w^{t-1}, x \rangle, y)}{\partial w^{t-1}} \right\|_2^2 - 2\eta_t \left\langle w^{t-1} - w^*, \frac{\partial \ell(\langle w^{t-1}, x \rangle, y)}{\partial w^{t-1}} \right\rangle
$$
5: Use the selected example $(x^t, y^t)$ to perform the update:
$$
w^t = w^{t-1} - \eta_t\, \frac{\partial \ell(\langle w^{t-1}, x^t \rangle, y^t)}{\partial w^{t-1}}
$$
6: $t \leftarrow t + 1$
7: **end while**
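Below is a minimal runnable sketch of Algorithm 1 for pool-based teaching with the square loss, compared against a random teacher. The toy data and variable names are ours, used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_pool, eta, T = 10, 500, 0.05, 200
w_star = rng.standard_normal(d)
X = rng.standard_normal((n_pool, d))
Y = X @ w_star                                    # labels given by the target model

def square_loss_grad(w, x, y):
    return (w @ x - y) * x                        # gradient of 1/2 (<w,x> - y)^2

def teach(select):
    w = np.zeros(d)
    errors = []
    for _ in range(T):
        i = select(w)
        w = w - eta * square_loss_grad(w, X[i], Y[i])
        errors.append(np.linalg.norm(w - w_star))
    return errors

def omniscient_select(w):
    grads = (X @ w - Y)[:, None] * X              # per-example gradients
    T1 = np.sum(grads**2, axis=1)                 # difficulty
    T2 = grads @ (w - w_star)                     # usefulness
    return int(np.argmin(eta**2 * T1 - 2 * eta * T2))   # Eq. (4)

def random_select(w):
    return int(rng.integers(n_pool))

err_teacher = teach(omniscient_select)
err_random = teach(random_select)
print(f"after {T} iterations: omniscient {err_teacher[-1]:.4f} vs random {err_random[-1]:.4f}")
```

In such toy runs the omniscient teacher typically drives $\|w^t - w^*\|$ down much faster than random sampling, consistent with the faster convergence the teacher's greedy selection is designed to achieve.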
**Combination-based teaching.** In this scenario, the teacher can provide examples from

$$
\mathcal{X} = \Big\{ x \ \Big|\ \|x\| \le R,\ x = \sum_{i=1}^{m} \alpha_i x_i,\ \alpha_i \in \mathbb{R},\ x_i \in \mathcal{D} \Big\},\quad \mathcal{D} = \{x_1, \ldots, x_m\},
$$
$$
\mathcal{Y} = \mathbb{R}\ \text{(regression)}\ \text{or}\ \{-1, 1\}\ \text{(classification)}.
$$

**Corollary 6** For a combination-based omniscient teacher and a student with fixed learning rate $\eta \neq 0$ and initialization $w^0$, if the loss function is exponentially synthesis-based teachable and $w^0 - w^* \in \mathrm{span}(\mathcal{D})$, the student can learn an $\epsilon$-approximation of $w^*$ with $O\!\left(C^1_{\gamma,\eta} \log \frac{1}{\epsilon}\right)$ samples.

Although the teacher's knowledge pool is more restricted than in the synthesis-based scenario, with the extra work of combining samples the teacher can behave the same as the most knowledgeable synthesis-based teacher.

**Rescalable pool-based teaching.** This scenario further restricts both the knowledge pool and the effort spent preparing samples. The teacher can provide examples from $\mathcal{X} \times \mathcal{Y}$ with

$$
\mathcal{X} = \{ x \mid \|x\| \le R,\ x = \gamma x_i,\ x_i \in \mathcal{D},\ \gamma \in \mathbb{R} \},\quad \mathcal{D} = \{x_1, \ldots, x_m\},
$$
$$
\mathcal{Y} = \mathbb{R}\ \text{(regression)}\ \text{or}\ \{-1, 1\}\ \text{(classification)}.
$$

In this scenario, the teacher cannot synthesize arbitrary directions; it can only rescale samples from the candidate pool. Therefore, to achieve the exponential improvement, the candidate pool should contain sufficiently rich directions. To characterize this richness in the finite case, we define the pool volume:

**Definition 7 (Pool Volume)** Given the training example pool $\mathcal{X} \subset \mathbb{R}^d$, the volume of $\mathcal{X}$ is defined as

$$
\mathcal{V}(\mathcal{X}) := \min_{w \in \mathrm{span}(\mathcal{D})}\ \max_{x \in \mathcal{X}}\ \left\langle \frac{w}{\|w\|}, \frac{x}{\|x\|} \right\rangle.
$$

Obviously, for the candidate pool of the synthesis-based teacher, $\mathcal{V}(\mathcal{X}) = 1$. In general, for a finite candidate pool, the pool volume satisfies $0 < \mathcal{V}(\mathcal{X}) \le 1$.
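As a rough illustration of Definition 7, the sketch below estimates the pool volume of a finite pool, assuming the normalized (cosine) form given above and a rescalable (sign-symmetric) pool. The Monte-Carlo search over directions in $\mathrm{span}(\mathcal{D})$ only upper-bounds the true minimum, so this is a crude over-estimate rather than an exact computation.

```python
import numpy as np

rng = np.random.default_rng(4)

def pool_volume_estimate(D, n_dirs=5000):
    """Crude Monte-Carlo estimate of V(X) = min_{w in span(D)} max_i |cos(w, x_i)|.

    Sampling only a finite set of directions upper-bounds the true minimum.
    """
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)      # unit-norm pool
    coeffs = rng.standard_normal((n_dirs, D.shape[0]))     # random w in span(D)
    W = coeffs @ D
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    # |cos| because the rescalable pool contains gamma * x_i for any real gamma.
    cos = np.abs(W @ Dn.T)
    return cos.max(axis=1).min()

# A well-spread pool has volume close to 1; a nearly collinear pool is much smaller.
spread = rng.standard_normal((50, 5))
collinear = np.outer(np.linspace(1, 2, 50), rng.standard_normal(5)) + 0.01 * rng.standard_normal((50, 5))
print(pool_volume_estimate(spread), pool_volume_estimate(collinear))
```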