# Iterative Teaching by Label Synthesis

Weiyang Liu1,2,* Zhen Liu3,* Hanchen Wang1,* Liam Paull3,4 Bernhard Schölkopf2 Adrian Weller1,5 1University of Cambridge 2MPI for Intelligent Systems, Tübingen 3Mila, Université de Montréal 4CIFAR AI Chair 5The Alan Turing Institute *Equal Contribution {wl396@cam.ac.uk, zhen.liu.2@umontreal.ca}

In this paper, we consider the problem of iterative machine teaching, where a teacher provides examples sequentially based on the current iterative learner. In contrast to previous methods that have to scan over the entire pool and select teaching examples from it in each iteration, we propose a label synthesis teaching framework in which the teacher randomly selects input teaching examples (e.g., images) and then synthesizes suitable outputs (e.g., labels) for them. We show that this framework can avoid costly example selection while still provably achieving exponential teachability. We propose multiple novel teaching algorithms in this framework. Finally, we empirically demonstrate the value of our framework.

1 Introduction

Machine teaching [103, 106] studies the problem of constructing a minimal dataset for a target concept such that a learner can learn the concept based on this dataset. Machine teaching has diverse applications ranging from crowdsourcing [71, 72, 100, 101] to model robustness [2, 3, 54, 63]. Machine teaching also has nice connections with curriculum learning [7] and coresets [1, 28].

Figure 1: Comparison of (a) vanilla iterative machine teaching and (b) label synthesis teaching. The red dotted frames indicate the teacher's efforts.

Based on the learner, machine teaching can be performed in either batch or sequential fashion. The majority of prior work studies batch machine teaching [42, 56, 102, 103], where a teacher constructs a minimal batch set of training samples and provides it to a student in one shot without further interactions. The student then keeps learning from this batch dataset for the target concept. The size of such a minimal batch set is called the teaching dimension [22]. In contrast, sequential machine teaching [38, 43, 44, 61] bridges the gap between machine teaching and practical learning algorithms by studying sequential (i.e., iterative) learners such as neural networks. A typical example is iterative machine teaching (IMT) [43, 44], where the teacher guides a learner to a target concept by interacting with the learner (e.g., feeding training samples) in every iteration. The minimum number of such iterations is called the iterative teaching dimension. One of the greatest challenges in sequential teaching is how to effectively and efficiently provide teaching examples to the iterative learner. We are mostly interested in pool-based teaching in IMT since it well matches the setting of modern machine learning. However, exploring all the possible teaching trajectories is computationally prohibitive. For example, there are $\binom{m}{k}^n$ possible teaching trajectories ($n$ is the number of iterations) if we select $k$ samples per iteration from a pool of size $m$. Due to such a huge search space, the selection of teaching examples is a combinatorial problem that is inherently difficult to solve.
IMT [43] performs the teaching sample selection with a greedy policy, but this can be substantially sub-optimal in certain cases [38] and its computational complexity also scales linearly with the size of the dataset. We propose a general teaching framework called LAbel Synthesis Teaching (LAST), which in its standard version avoids the problems of sample selection, though we will later discuss how sample selection can be combined with our approach. In the standard version of LAST, teaching examples are randomly sampled (similar to SGD) from the pool and the teacher synthesizes their labels in order to quickly guide the learner to the desired model. A brief comparison between IMT and LAST is given in Fig. 1. LAST restricts the teaching to the label space and bypasses the selection of teaching samples. Therefore, LAST avoids the high complexity of selecting teaching samples in a large pool.

Intuition for why LAST is able to achieve promising teaching performance comes from the empirical success of knowledge distillation [26] and label smoothing [73]. Knowledge distillation shows that training neural networks with soft labels from a pretrained model can generally improve generalization. Label smoothing demonstrates that the ground truth labels are not necessarily the optimal supervision for training a model. Instead, smoothing the label with a uniform distribution can calibrate the model and lead to better generalization. Both methods can be viewed as providing an alternative label (instead of the ground truth) to the learner in order to improve its generalizability. Moreover, [77, 78] show that there exists privileged information beyond the ground truth labels that can significantly improve the convergence rate of learning algorithms. Therefore, we can safely argue that the ground truth labels are not always optimal learning signals. Motivated by these works, we aim to construct a teacher model that can adaptively synthesize suitable labels for a learner (with the hope of implicitly encoding privileged information) in order to improve the learner's convergence. Specifically, we study LAST primarily under the omniscient scenario where the teacher knows everything about the learner (e.g., the optimal learner parameters). To perform omniscient teaching, we consider a greedy teacher and a parameterized teacher. We show that greedy teaching can achieve exponential teachability (ET) [43] without selecting teaching examples. Additionally, we touch upon the black-box teaching scenario where the teacher knows less about the learner (e.g., the optimal learner parameters are unavailable), and discuss how to perform LAST in this case.

LAST provides a unified view for understanding soft label methods, e.g., knowledge distillation [26, 53], label smoothing [11, 59, 73], and self-training [104]. All these methods can be interpreted as modifying the labels to achieve desirable learning behavior and outcome. With LAST, we can connect iterative machine teaching to many classic learning algorithms and shed novel light on them.

2 Related Work

Machine teaching. Batch machine teaching [42, 56, 102, 103] has drawn increasing attention recently. Different learners typically exhibit distinct behaviors during teaching. [10, 74] focus on how to teach version space learners. The teaching dimension of linear learners [42], kernel learners [36] and reinforcement learners [95] has been studied extensively.
Teaching a forgetful learner is explored in [29, 44]. [94, 105] consider how to teach multiple learners simultaneously. Sequential machine teaching [38, 43, 44, 91] studies iterative learners by considering the specific optimization algorithm that the learner adopts. Unlike the batch scenario, the teaching quality is evaluated by the learner's convergence. [38] connects sequential teaching to optimal control and gains interesting insights, but it cannot produce a practical teaching policy. [91] uses locality-sensitive sampling to scale IMT to large-scale problems. Machine teaching is shown useful in reinforcement learning [24, 32, 63, 76], human-in-the-loop learning [9, 31, 55], crowdsourcing [72, 100, 101] and cyber security [2, 57, 96, 97, 98]. [9, 13, 21, 35, 65, 108] study machine teaching from a more theoretical point of view.

Cooperative communication. Cooperative communication [69] is a mutual theory-of-mind reasoning between a teacher and a learner. The teacher selects data to convey a hypothesis and the learner infers a hypothesis given the selected data. Both agents have the shared goal of successfully transmitting beliefs. [84] formulates cooperative communication with the framework of optimal transport. In this framework, pedagogical reasoning [66, 67, 68], cooperative inference [82, 83, 93] and Bayesian teaching [15, 16, 92] can all be viewed as special cases. Among this line of research, cooperative inference [82] shares many similarities with our work and also adopts an iterative teaching paradigm. While [82] formulates the teaching problem with a probabilistic inference framework, LAST addresses iterative teaching with a nested optimization framework.

Soft-supervised learning. It has been shown that supervising the model with soft labels can be beneficial to generalization [73, 89] and robustness [75, 107]. Label smoothing [73] replaces the one-hot label vector with a mixture of itself and the uniform distribution, achieving stronger generalization in training neural networks. Knowledge distillation [26] uses the soft predicted label from a pretrained model to supervise the training of a student model, which can make the student model generalize better than a trained-from-scratch one. [89] perturbs the one-hot label vector with random noise and also observes better generalization. Large-margin softmax [45, 50, 51, 52] can be approximately viewed as training learners with dynamic soft labels (Appendix E). Self-training [8, 90, 104] also uses dynamic soft labels to iteratively train the learner. In contrast to the empirical nature of soft-supervised learning, LAST formulates it as a unified and principled iterative teaching framework.

3 Label Synthesis Teaching

Teaching protocol. We first consider the simplest teaching protocol following [43]. Both the teacher and the learner observe the same sample $\mathcal{A}$ and share the same feature space by representing $\mathcal{A}$ as $x$ with the label $y$. The teacher has access to all the information about the learner, such as the model parameters $w^i$ at the $i$-th iteration, the learning rate $\eta_i$, the loss function $\ell$ and the specific optimization algorithm (e.g., SGD). For interaction, the teacher can only communicate with the learner via examples $\{(x^i_j, y^i_j)\}_{j=1,\dots,m}$ in the $i$-th iteration, where $m$ denotes the batch size.

Teaching objective. The goal of the teacher is to provide examples in each iteration such that the learner parameters $w$ converge to the desired parameters $w^*$ as fast as possible. For example, $w^*$ typically is $\arg\min_w \mathbb{E}_{(x,y)}\{\ell(x,y|w)\}$.
Therefore, a general objective for any teaching algorithm is $\min_{\{(x_1,y_1),\dots,(x_T,y_T)\}} d(w_T, w^*)$, where $T$ denotes the termination iteration for the teaching algorithm. We use batch size 1 here, but it is straightforward to extend the batch size to arbitrary $m$. $d(\cdot,\cdot)$ denotes some discrepancy measure (e.g., Euclidean distance). This objective aims to find a teaching trajectory, terminated at the $T$-th step, such that the discrepancy between the current learner parameters and the desired learner parameters is the smallest. This is very challenging given the enormous search space. To simplify the problem, [43] proposes a greedy policy to generate the teaching curriculum, but it is computationally expensive when dealing with a large pool.

3.1 The LAST Framework

Figure 2: Overview of the LAST framework.

Unlike the general teaching objective that optimizes the sequential sample selection, the LAST framework instead optimizes the teaching examples in the label space, and the corresponding objective becomes $\min_{\{\hat y_1,\dots,\hat y_T\}} d(w_T, w^*)$, where $x_1,\dots,x_T$ are sampled uniformly from the pool (same as SGD). Such a teaching objective is general and does not make any assumption on the optimal structure in the teaching trajectory, making the problem intractable. To simplify the problem, we can assume that the teaching signal (i.e., the label $\hat y$ in LAST) only depends on the randomly sampled $\{x, y\}$ and the learner's current and desired models (i.e., parameters). Therefore, we can formulate the corresponding label synthesis policy as $\hat y = \pi_\theta(x, y, w_t, w^*)$, where $\pi$ is optionally parameterized by $\theta$ and $y$ denotes the original ground truth label. Then the teaching objective becomes $\arg\min_{\pi_\theta} d(w_T, w^*)$. An overview of LAST is given in Fig. 2.

Moreover, LAST provides a unified framework for knowledge distillation, label smoothing and self-training. Knowledge distillation [26] generates the label with the policy $\hat y_i = \frac{\exp(z_i(x)/\psi)}{\sum_j \exp(z_j(x)/\psi)}$, where $\psi$ is the temperature and $z_i(x)$ is the $i$-th logit output of $x$ from a pretrained neural network. Label smoothing [73] uses $\hat y = \mu y + \frac{1-\mu}{K}\mathbf{1}$ ($K$ is the number of classes) as the label synthesis policy. Self-training [104] feeds the learner with the pseudo-label predicted by the learner from the last iteration. All these methods can be viewed as using some form of customized label synthesis policy.

3.2 Greedy Teaching Policy

For the $t$-th iteration, we can approximately solve the original teaching problem with the optimization $\min_{\{\hat y_{t+1},\dots,\hat y_{t+v}\}} d(w_{t+v}, w^*)$, where $v \ll T$ is the number of steps being considered. By using $v=1$ and the gradient descent update in LAST, we obtain a greedy policy $\arg\min_{\hat y_t} d(w_{t+1}, w^*)$, where $w_{t+1}$ is a one-step update from the current learner $w_t$ w.r.t. the learner loss $\ell$ and the example $\{x_t, \hat y_t\}$. The greedy policy minimizes the Euclidean distance between the current learner parameters and the desired parameters in each iteration by learning a suitable label. We apply the greedy policy to each iteration for generating labels. The label for the example $x_t$ at the $t$-th iteration is produced by

$$\arg\min_{\hat y_t}\ \underbrace{\big\|w_{t-1} - \eta_t \nabla_w \ell(x_t, \hat y_t|w_{t-1}) - w^*\big\|_2^2}_{G(x_t,\hat y_t|w_{t-1})} = \|w_{t-1}-w^*\|_2^2 + \eta_t^2 \underbrace{\big\|\nabla_w \ell(x_t, \hat y_t|w_{t-1})\big\|_2^2}_{T_1(x_t,\hat y_t|w_{t-1})} - 2\eta_t \underbrace{\Big\langle w_{t-1}-w^*,\ \nabla_w \ell(x_t, \hat y_t|w_{t-1})\Big\rangle}_{T_2(x_t,\hat y_t|w_{t-1})} \tag{1}$$

where $\ell(x_t, \hat y_t|w_t)$ denotes the loss function of the learner and $w_t$ is the learner parameter in the $t$-th iteration. For simplicity, we consider the mini-batch size as 1. It is straightforward to write Eq. (1) in a mini-batch form.
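To make Eq. (1) concrete, here is a minimal sketch (our own illustration, not the paper's released code) of greedy label synthesis for an LR learner, where a magnitude-style constraint is imposed simply by searching over a discretized candidate set of scalar labels:

```python
import numpy as np

def lr_grad(w, x, y):
    # Gradient of the logistic loss log(1 + exp(-y <w, x>)) w.r.t. w.
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))

def greedy_label(w, w_star, x, lr, candidates):
    # Eq. (1): choose the label whose one-step SGD update lands closest to w*.
    dists = [np.linalg.norm(w - lr * lr_grad(w, x, y) - w_star) for y in candidates]
    return candidates[int(np.argmin(dists))]

rng = np.random.default_rng(0)
w, w_star, lr = np.zeros(2), np.array([2.0, -1.0]), 0.5
candidates = np.linspace(-3.0, 3.0, 121)      # discretized, magnitude-constrained labels
for t in range(200):
    x = rng.normal(size=2)                    # randomly sampled input, as in LAST
    y_hat = greedy_label(w, w_star, x, lr, candidates)
    w -= lr * lr_grad(w, x, y_hat)            # learner's SGD step on the synthesized label
print(np.linalg.norm(w - w_star))             # distance to w* after teaching (small)
```

The inner search is over a one-dimensional label here, so exhaustive enumeration suffices; with vector labels or simplex constraints, the same objective would be handed to a constrained solver as discussed above.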
According to this decomposition of the parameter discrepancy, the teacher optimizes the labels such that the parameter discrepancy is minimized. In contrast to IMT, which aims to optimize the sample $x_t$, greedy teaching in LAST first randomly samples the teaching example and then optimizes the label based on $\arg\min_{\hat y_t}\ \eta_t^2\, T_1(x_t, \hat y_t|w_t) - 2\eta_t\, T_2(x_t, \hat y_t|w_t)$. The overall procedure is given in Algorithm 1. Depending on the teacher's capability, we can optionally consider a few constraints for the synthesized labels.

Algorithm 1 Omniscient Greedy LAST
  Initialize $t=1$, $w_0$, $\epsilon$ and $T$;
  while $\|w_t - w^*\|_2 \ge \epsilon$ and $t \le T$ do
    Randomly select a sample $x_t$ from the pool;
    Solve Eq. (1) to synthesize the label $\hat y_t$;
    Use the synthesized label $\hat y_t$ for the update: $w_t = w_{t-1} - \eta_t \nabla_w \ell(x_t, \hat y_t|w_{t-1})$;
    Set $t \leftarrow t+1$;
  end

One-hot constraint. For classification tasks, we can constrain the teacher to synthesize one-hot labels for $x$. With the one-hot constraint, label synthesis becomes a standard integer programming problem [87].

Soft constraint. A straightforward relaxation of the one-hot constraint is to constrain the labels to lie on a probability simplex, i.e., $\sum_i \hat y_i = 1$ and $\hat y_i \ge 0,\ \forall i$, and label synthesis becomes a constrained optimization problem.

Magnitude constraint. For regression, we can constrain the $p$-norm of the label vector to be smaller than a positive constant $r$, i.e., $\|\hat y - \langle w_t, x\rangle\|_p \le r$ (constraining the synthesized label to be around the current prediction) or $\|\hat y - y\|_p \le r$ (constraining the synthesized label to be around the ground truth label), where we usually use $p=2$. A larger magnitude $r$ indicates a more flexible and powerful teacher. For classification, we can use the same magnitude constraints (optionally) along with the soft constraints $\sum_i \hat y_i = 1$ and $\hat y_i \ge 0,\ \forall i$.

No constraint. The most powerful teacher has no constraint on the synthesized labels and can synthesize any label vector, making it easy to solve the unconstrained optimization in Eq. (1).

3.2.1 Teaching Linear Learners

We start by discussing how to use greedy LAST to teach representative linear learners such as linear least square regression (LSR) and logistic regression (LR), and then give theoretical analyses of their iterative teaching dimension (i.e., convergence properties). Specifically, we consider the following standard objective functions: $\ell_{\text{LSR}}(x,y|w)=\frac{1}{2}(\langle w,x\rangle - y)^2$ for the LSR learner and $\ell_{\text{LR}}(x,y|w)=\log(1+\exp\{-y\langle w,x\rangle\})$ for the LR learner. $\{x[i], y[i]\}$ is the $i$-th sample in the pool. We gain intuition by looking at the synthesized label. Taking LSR as a concrete example, Eq. (1) becomes

$$\min_{\hat y}\ \eta_t^2\langle x,x\rangle \hat y^2 + \big(2\eta_t\langle w-w^*,x\rangle - 2\eta_t^2\langle x,x\rangle\langle w,x\rangle\big)\hat y + \eta_t^2\langle x,x\rangle\langle w,x\rangle^2 - 2\eta_t\langle w,x\rangle\langle w-w^*,x\rangle \tag{2}$$

from which we can obtain the closed-form solution for the synthesized label:

$$\underbrace{\hat y = \frac{2\eta_t^2\langle x,x\rangle\langle w,x\rangle - 2\eta_t\langle w-w^*,x\rangle}{2\eta_t^2\langle x,x\rangle}}_{\text{Synthesized label}} = (1-\lambda)\underbrace{\langle w,x\rangle}_{\text{Predicted label (easy)}} + \lambda\underbrace{\langle w^*,x\rangle}_{\text{Optimal label (hard)}} \tag{3}$$

where $\lambda=(\eta_t\langle x,x\rangle)^{-1}$. $\lambda$ is essentially a parameter that weights the importance between using the current prediction as the training label and using the (pseudo) ground truth label as the training label. Suppose that the input data lie on a hypersphere with radius $\gamma$; then we have $\lambda = \frac{1}{\eta_t\gamma^2}$. Typically $\eta_t$ will gradually decrease during training, so $\lambda$ will gradually increase. Interestingly, this implies that the training label should be gradually shifted from the current predicted label to the ground truth label during training. This result validates our argument that it is not always optimal to feed the learner with ground truth labels.
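As a sanity check on the closed form above, here is a short sketch (our own, with illustrative variable names) that teaches an LSR learner with the unconstrained label of Eq. (3). Note that with no constraint, each update removes exactly the component of $w_t-w^*$ along $x$, so the distance to $w^*$ decays geometrically under random inputs:

```python
import numpy as np

def last_lsr_label(w, w_star, x, lr):
    # Eq. (3): y_hat = (1 - lam) * <w, x> + lam * <w*, x>, with lam = 1 / (lr * <x, x>).
    lam = 1.0 / (lr * np.dot(x, x))
    return (1.0 - lam) * np.dot(w, x) + lam * np.dot(w_star, x)

rng = np.random.default_rng(0)
w, w_star, lr = np.zeros(3), np.array([1.0, -2.0, 0.5]), 0.1
for t in range(50):
    x = rng.normal(size=3)
    y_hat = last_lsr_label(w, w_star, x, lr)
    w -= lr * (np.dot(w, x) - y_hat) * x      # LSR gradient step on the synthesized label
    if t % 10 == 0:
        print(t, np.linalg.norm(w - w_star))  # shrinks roughly geometrically
```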
The predicted label is trivial for the current learner to learn (since the learner can output this same prediction without any update) and the optimal label is the most difficult to learn, revealing a principle that a learner should be taught in an easy-to-hard fashion in order to learn a concept efficiently. This implication nicely matches the conclusions drawn in curriculum learning [7], self-paced learning [37], cognitive shaping [17, 34, 62], continuation methods [4] and IMT [43, 44].

Figure 3: An illustrative example (loss contour).

Moreover, greedy LAST has the property of teaching monotonicity [43] if the learner loss satisfies certain conditions (see Proposition 1 for a formal statement). This nice property implies that LAST always converges to $w^*$ faster than a random teacher (i.e., SGD) if both of them sample teaching examples uniformly from the pool. An illustration of why LAST yields better convergence to $w^*$ is given in Fig. 3. We can observe that LAST is guaranteed to better minimize the discrepancy between $w_1$ and $w^*$ in the first iteration if both LAST and SGD use the same initialization and training examples. Such a guarantee leads to teaching monotonicity under a few conditions. If the learner loss is properly designed (e.g., least square loss [43]) such that teaching monotonicity is satisfied, then LAST will always converge faster than SGD.

Proposition 1 (Teaching Monotonicity). Given a learner $\ell$ and a randomly sampled teaching data point $(x,y)\sim P$ for both LAST and a random teacher (SGD), if $\mathbb{E}\{G(x,y|w_1)\} \le \mathbb{E}\{G(x,y|w_2)\}$ holds for any $w_1, w_2$ that satisfy $\mathbb{E}\{\|w_1-w^*\|\} \le \mathbb{E}\{\|w_2-w^*\|\}$, then with the same initialization and learning rate, in expectation LAST converges no slower than the random teacher.

3.2.2 Teaching Neural Learners

We extend greedy teaching to neural learners. As an illustrative example, we teach a two-layer multi-layer perceptron (MLP): $\ell_{\text{MLP}} = \text{CrossEntropy}(\text{SoftMax}(W\sigma(Vx)), y)$, where $V \in \mathbb{R}^{d_1\times d_2}$ is the weight matrix of the first layer, $W \in \mathbb{R}^{d_2\times K}$ is the weight matrix of the output layer, $\sigma(\cdot)$ denotes an element-wise nonlinear function (e.g., ReLU), $\text{CrossEntropy}(\cdot,\cdot)$ is the cross-entropy loss and $\text{SoftMax}(\cdot)$ is the softmax function. We denote the intermediate outputs as $U = Vx$ and $P = \sigma(U)$. Because the weights of the MLP are hierarchically structured, we propose a simple yet effective heuristic to teach MLP learners. The training label is generated by

$$\min_{\hat y}\ \underbrace{\big\|W^t - \eta_t \nabla_W \ell(x,\hat y|V^t,W^t) - W^*\big\|_F^2}_{\text{Discrepancy of the output layer: } \|W^{t+1}-W^*\|_F^2} + \beta\, \underbrace{\big\|V^t - \eta_t \nabla_V \ell(x,\hat y|V^t,W^t) - V^*\big\|_F^2}_{\text{Discrepancy of the first layer: } \|V^{t+1}-V^*\|_F^2} \tag{4}$$

where we use the chain rule to compute $\nabla_V \ell$. Both $P^t$ and $U^t$ are intermediate outputs at the $t$-th iteration. $\beta$ is a hyperparameter that balances the importance of weights in different layers. When $\beta=1$, Eq. (4) becomes $\min_{\hat y} \|[W^{t+1}; V^{t+1}] - [W^*; V^*]\|_F^2$, which is equivalent to minimizing the Euclidean distance between the concatenated weight matrices and the desired ones.

3.3 Learning a Parameterized Teaching Policy

The greedy policy only considers $v=1$ in $\min_{\{\hat y_{t+1},\dots,\hat y_{t+v}\}} d(w_{t+v}, w^*)$ and is by no means optimal. We take a step further by considering $v>1$. To approximate $\min_{\hat y_{t+1},\dots,\hat y_{t+v}} \|w_{t+v}-w^*\|_2^2$, we propose to learn a parameterized teaching policy $\pi_\theta$ through $\min_{\pi_\theta} \|w_{t+v}-w^*\|_2^2$.

Unrolling. We can unroll the optimization algorithm of the learner into the teaching objective. In general, we aim to solve the following bi-level nested optimization objective for the teacher:
$$\min_\theta\ \|w_v(\theta) - w^*\|_2^2 \quad \text{s.t.} \quad w_v(\theta) = \arg\min_w \mathbb{E}_{\{x,y\}}\,\ell\big(x, \pi_\theta(x, y, w_t, w^*)\,\big|\,w\big) \tag{5}$$

where $\ell$ is the learner loss and $\pi_\theta$ is the parameterized teaching policy that generates the label. For gradient descent learners, $w_v(\theta)$ is obtained by unrolling $v$ steps of stochastic gradient descent w.r.t. $\ell$. Unrolling is shown useful in [12, 41, 46, 47, 58] and shares a similar spirit with back-propagation through time in recurrent networks [85] and meta-learning [20]. The greedy policy is the optimal solution to Eq. (5) for $v=1$, and the parameterized policy aims to approximate the solution for $v>1$. For large $v$, unrolling builds a large computational graph for back-propagation and is very costly.

Figure 4: Unrolling the teacher.

Optimizing the unrolled teaching policy is conceptually similar to optimizing recurrent neural networks with back-propagation through time, as illustrated in Fig. 4. Unrolling $v$ steps is equivalent to seeking a teaching policy that best minimizes the weight discrepancy after $v$-step gradient descent. Broadly speaking, our method also has intrinsic connections to learning an optimizer [5, 39] and teaching a loss function [88] in the sense that all of them aim to learn better gradients.

Policy gradient. We can define $r_t = -\|w_t - w^*\|$ as the reward signal and directly apply policy gradient [86] to maximize it. The final teaching objective is to maximize the expected reward: $\max_\theta J(\theta) := \mathbb{E}_{(s_t,a_t)\sim\pi_\theta(a|s)}\{\sum_{t=1}^T \gamma^t r_t(s_t, a_t)\}$, where $\gamma\in(0,1]$ is the discounting factor, $T$ is the termination iteration and $(s_t, a_t)$ is the state-action pair at the $t$-th iteration. The state is characterized by the sample $\{x, y\}$, the current learner $w$ and the optimal learner $w^*$. The action space is a discretized label space. The policy gradient for updating $\theta$ is given by $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta(a_t|s_t)}\{R\, \nabla_\theta \sum_{t=1}^T \log \pi_\theta(a_t|s_t)\}$, where $R = \sum_{t=1}^T \gamma^t r_t(s_t, a_t)$. Appendix G gives the details of the state and action. Note that we stick to the simplest policy gradient for demonstration.

Parameterization of the teacher. The teaching policy can be parameterized as any neural network, such as an MLP, a CNN [25, 27, 33, 49, 52, 70] or a transformer [14, 79]. It may also be beneficial to use dynamic networks [23, 30, 48], since the optimal teacher parameters may vary for different inputs.

3.4 Theoretical Guarantees and Justifications

We first study how label synthesis affects the gradients when the label is a scalar. For LSR learners, the gradient w.r.t. $w$ of a single sample $(x,y)$ is $\nabla_w\ell = (\langle w,x\rangle - y)x$. For LR learners, the gradient is $\nabla_w\ell = \frac{-yx}{1+\exp(y\langle w,x\rangle)}$. Modifying the label essentially re-scales the gradient without changing its direction. We denote the new gradient after modifying the label to $\hat y$ as $\nabla_w\hat\ell = g(\hat y)\,\nabla_w\ell$, where $g(\hat y)$ is a scalar function of $\hat y$. For LSR and LR learners, we have $g(\hat y)=\frac{\langle w,x\rangle - \hat y}{\langle w,x\rangle - y}$ and $g(\hat y)=\frac{\hat y\,(1+\exp(y\langle w,x\rangle))}{y\,(1+\exp(\hat y\langle w,x\rangle))}$, respectively ($y$ is the original label for $x$). In general, the new gradient of linear learners (e.g., LSR, LR and SVM) in greedy LAST can be re-written as $g(\hat y)\,\nabla_w\ell$, where $g(\hat y)$ varies in different cases. However, in the mini-batch case, modifying labels becomes much more flexible than adjusting the learning rate, since $g(\hat y)$ is generally sample-specific and the learning rate is not. We discuss how $g(\hat y)$ determines exponential teachability, since LAST can be viewed as modifying $g(\hat y)$ to minimize the model parameter discrepancy. Here we consider the learner to have a loss function $f(w)=\sum_{i=1}^n \ell_i(w)$, where $\ell_i(w)=\ell(x[i], y[i]|w)$ and $n$ is the number of samples.
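The rescaling identity $\nabla_w\hat\ell = g(\hat y)\,\nabla_w\ell$ can be checked numerically; below is a small sketch (our own) for the LR learner:

```python
import numpy as np

def lr_grad(w, x, y):
    # Gradient of log(1 + exp(-y <w, x>)) w.r.t. w.
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))

def g_lr(w, x, y, y_hat):
    # Rescaling factor: g(y_hat) = y_hat (1 + exp(y <w,x>)) / (y (1 + exp(y_hat <w,x>))).
    s = np.dot(w, x)
    return y_hat * (1.0 + np.exp(y * s)) / (y * (1.0 + np.exp(y_hat * s)))

rng = np.random.default_rng(1)
w, x = rng.normal(size=3), rng.normal(size=3)
y, y_hat = 1.0, -0.4                           # original vs. synthesized (scalar) label
lhs = lr_grad(w, x, y_hat)
rhs = g_lr(w, x, y, y_hat) * lr_grad(w, x, y)
print(np.allclose(lhs, rhs))                   # True: synthesis rescales the gradient
```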
We show in Theorem 1 that exponential teachability [43, 44] (i.e., the teacher guides the learner to converge to $w^*$ at an exponential rate) can be achieved by intelligently tuning $g(\hat y)$ in LAST:

Theorem 1 (Exponential teachability). Assume that each learner loss $\ell_i$ satisfies the interpolation property, is $L_i$-Lipschitz and convex, and that $f$ is $\bar\mu$-strongly convex of order 1. Then LAST can achieve ET with $g(\hat y)=c_1\|w_t-w^*\|$, i.e., $\mathbb{E}\{\|w_T-w^*\|^2\} \le \big(1 - c_1\eta_t\bar\mu + c_1^2\eta_t^2 L_{\max}\big)^T \|w_0-w^*\|^2$, where $L_{\max}=\max_i L_i$ and $\bar\mu=\sum_i \mu_i/n$. This implies that $O\big((\log\frac{1}{c_0})^{-1}\log(\frac{1}{\epsilon})\big)$ samples are needed to achieve $\mathbb{E}\{\|w_T-w^*\|^2\}\le\epsilon$, where $c_0 = 1 - c_1\eta_t\bar\mu + c_1^2\eta_t^2 L_{\max}$ and $c_1$ is adjusted such that $0 < c_1\eta_t < \bar\mu/L_{\max}$.

Theorem 1 shows that $g(\hat y)$ plays a critical role, similar in spirit to the way the step size is adaptively adjusted in line search. We emphasize that a random teacher (i.e., vanilla SGD) cannot achieve ET and yields an optimal rate of $O(\frac{1}{\epsilon})$ [60] (in order to reach an $\epsilon$-approximation of the loss value). Additionally, we connect Theorem 1 to Armijo line search [6, 80] by showing that ET can also be achieved by tuning $g(\hat y)$ in a more fine-grained way under a different set of conditions:

Theorem 2 (Exponential teachability without target parameters). Assume that each learner loss $\ell_i$ satisfies the interpolation property, is $L_i$-smooth and convex, and that $f$ is $\bar\mu$-strongly convex of order 2. If we adjust $g(\hat y)$ such that the following condition is always satisfied in each iteration: $\ell_{i_t}\big(w_t - \eta_t g(\hat y)\nabla\ell_{i_t}(w_t)\big) \le \ell_{i_t}(w_t) - c_2\,\eta_t g(\hat y)\,\|\nabla\ell_{i_t}(w_t)\|^2$, then LAST can achieve ET for iterative learners with $c_2=\frac{1}{2}$: $\mathbb{E}\{\|w_T-w^*\|^2\} \le \max\big\{\big(1-\frac{\bar\mu}{L_{\max}}\big),\ \big(1-\bar\mu\,\eta_{\max}\big)\big\}^T \|w_0-w^*\|^2$, which indicates that $O(c_3\log(\frac{1}{\epsilon}))$ samples are needed to achieve $\mathbb{E}\{\|w_T-w^*\|^2\}\le\epsilon$.

Remark 1 (Meaningfulness of omniscient teaching). A natural question arises: what is the point of the target parameters $w^*$ if we can achieve ET without them? First of all, both Theorem 1 and Theorem 2 merely consider the case where the synthesized label is a scalar, while labels are usually high-dimensional vectors in practice. When labels are vectors, LAST is much more flexible than tuning the learning rate. Second, the advantage of LAST is reflected in the ease of setting $g(\hat y)$. Because there exists a universal constant $c_1$ for $g(\hat y)=c_1\|w_t-w^*\|$, $g(\hat y)$ is much easier to set in Theorem 1. In contrast, $g(\hat y)$ in Theorem 2 has to satisfy a dynamically changing inequality, which is highly nontrivial. Third, Theorem 2 also shows that Theorem 1 can still be substantially improved and that the current lower bound on the iterative teaching dimension is not necessarily tight.

When the label $\hat y$ is a scalar, LAST can also be viewed as a method that intelligently controls the step size in order to improve convergence to a desired model $w^*$. Theorem 1 is reasonably comparable to the theoretical results in [43, 44], but it does not require modifying $x$ to achieve ET. In contrast to the deterministic ET in [43], LAST achieves probabilistic ET due to the randomness in data sampling. When $\hat y$ becomes a vector, LAST essentially controls a multi-dimensional learning rate that can modify the gradient in a more fine-grained fashion. As an example, for LSR learners with a vector $\hat y\in\mathbb{R}^K$, the new gradient matrix of size $d\times K$ (after label synthesis) is $\nabla_w\hat\ell = g(\hat y)\odot\nabla_w\ell$, where $g(\hat y)=\big(\mathbf{1}_d(\hat y - \langle w,x\rangle)^\top\big)\oslash\big(\mathbf{1}_d(y - \langle w,x\rangle)^\top\big)$ and $\oslash$ denotes the Hadamard division. More interestingly, Theorem 2 also validates the feasibility of black-box LAST teaching, since convergence can be improved even without the target parameters.
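As an illustrative toy check of Theorem 1 (our own construction, not an experiment from the paper): an absolute-loss learner that satisfies interpolation and has bounded gradients, taught with the scaling $g(\hat y)=c_1\|w_t-w^*\|$, exhibits the predicted geometric decay of $\|w_t-w^*\|$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, c1 = 5, 0.1, 1.0
w_star = rng.normal(size=d)
X = rng.normal(size=(100, d))                  # pool; labels y_i = <w*, x_i> (interpolation)
w = np.zeros(d)
for t in range(1501):
    x = X[rng.integers(len(X))]
    grad = np.sign(np.dot(w - w_star, x)) * x  # subgradient of |<w, x> - y|, bounded by ||x||
    g = c1 * np.linalg.norm(w - w_star)        # Theorem 1's scaling g(y_hat) = c1 ||w_t - w*||
    w -= eta * g * grad
    if t % 300 == 0:
        print(t, np.linalg.norm(w - w_star))   # decays geometrically with t
```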
We show in Theorem 3 that super-exponential teachability can be achieved under certain conditions with a carefully chosen $g(\hat y)$:

Theorem 3 (Super-exponential teachability). Consider a learner loss $f(w)=\sum_{i=1}^n \ell_i(w)$ and an optimal weight matrix $w^*$ such that $\nabla_w f(w^*)=0$. Assume that $\nabla_w f(w)$ is $L$-smooth and continuously differentiable within a sufficiently small $\delta$-neighborhood $\|w-w^*\|\le\delta$. In LAST, the gradient update is $w_{t+1}=w_t-\eta_t g(\hat y)\nabla_w f(w)$. If $w_0$ is in the $\delta$-neighborhood, $|\lambda_{\min}(\nabla_w^2 f(w))|\ge\mu$ and $L\mu^{-1}\delta<2$, then LAST can achieve super-ET with $g(\hat y)=\eta_t^{-1}\big(\nabla_w^2 f(\alpha w^* + (1-\alpha)w)\big)^{-1}$, where $\alpha\in[0,1]$. When $\alpha=0$, the result reduces to Newton's method.

Theorem 3 can be viewed as a generalization of Newton's method. In contrast to Newton's method, LAST takes advantage of the knowledge of the optimal learner $w^*$ and reduces the required Hessian information. If we simply set $\alpha=1$, then LAST only needs a single static $\nabla_w^2 f(w^*)$ to achieve super-ET. In contrast, Newton's method requires $\nabla_w^2 f(w)$ in every iteration (i.e., Hessian information of a local neighborhood). In light of [64], we can easily extend Theorem 3 to work with the sub-sampled Hessian $\frac{1}{|S|}\sum_{i\in S}\nabla_w^2\ell_i(w^*)$, where $S$ is a subset of indices.

3.5 The Best of Both Worlds? Combining Sample Selection with LAST

Figure 5: Comparison in the gradient space.

It is natural to further consider combining sample selection with LAST. A straightforward approach is to first perform the teaching sample selection of IMT and then run LAST to synthesize labels for the selected samples. This can be viewed as applying alternating optimization to approximately solve $\min_{x,\hat y} G(x,\hat y|w_t)$. Specifically, mixed teaching first solves for $x$ with $\hat y$ fixed as the original ground truth label, and then solves for $\hat y$ with $x$ fixed. This procedure is iterated until a sufficiently small error $\|w^*-w_t\|$ is attained. Due to its greater flexibility, this mixed greedy teaching strategy yields a more powerful policy than LAST. Based on [43], it is easy to verify that mixed teaching can achieve ET. Fig. 5 gives an illustrative comparison of SGD, LAST and mixed teaching. Unlike LAST, which tends to follow the largest gradient direction (same as SGD), mixed teaching finds the closest point to $w^*$ in the gradient space. Mixed teaching combines the label synthesis of LAST with the example selection of IMT, and is able to achieve faster teaching empirically.

4 A Distant Look at Black-box Label Synthesis Teaching

The previous section considers omniscient teaching where everything about the learner is accessible. However, the optimal learner parameters $w^*$ are often unavailable in practice. This motivates us to study black-box teaching [18, 19, 44, 88] where the teacher knows much less about the learner. This problem is extremely challenging, but for completeness, we briefly discuss some potential strategies for black-box LAST (BLAST) to tackle it and also provide some preliminary experiments in Appendix F.8. Because a black-box teacher has no access to $w^*$, we need a surrogate objective to efficiently guide the learner to $w^*=\arg\min_w \mathbb{E}_{(x,y)}\{\ell(x,y|w)\}$. Such a surrogate objective can either be a differentiable criterion on a validation set (e.g., some learner loss function) or a non-differentiable performance metric on a validation set (e.g., validation accuracy). For the former, we can unroll the teacher into the optimization of the learner. For the latter, we can define the performance metric as the reward signal and formulate black-box teaching as a reinforcement learning problem.
We then utilize the policy gradient method to solve it.

Unrolling. We start with a fully differentiable variant of BLAST. Because $w^*$ is no longer available in the black-box scenario, the objective $\|w_v-w^*\|$ becomes infeasible. In order to conduct black-box teaching, we propose to directly use the learner loss $\ell(x,y|w)$ as the surrogate. After removing the dependence on $w^*$ from Eq. (5), we obtain the objective function of BLAST:

$$\min_\theta\ \mathbb{E}_{\{x,y\}\sim D_a}\,\ell(x, y|w_v) \quad \text{s.t.} \quad w_v(\theta) = \arg\min_w \mathbb{E}_{\{x,y\}\sim D_r}\,\ell\big(x, \pi_\theta(x, y, w)\,\big|\,w\big) \tag{6}$$

where $y$ denotes the ground truth label of sample $x$ and the output of $\pi_\theta$ is the synthesized label $\hat y$. $D_a$ denotes the validation set, and $D_r$ denotes the training set. Sometimes we can use the same set for $D_a$ and $D_r$. Here we use stochastic gradient descent to solve the inner minimization. Typically we can unroll $v$-step gradient updates and train the teacher in an end-to-end fashion. To reduce the difficulty of learning the teacher in practice, it can be beneficial to parameterize the synthesized label as $\alpha y + (1-\alpha)\hat y$, where $\alpha\in(0,1)$ and $\hat y = \pi_\theta(x, y, w)$. This essentially learns a residual between the synthesized label and the ground truth label. The intuition behind this is to constrain the space of synthesized labels such that black-box teaching becomes easier.

Policy gradient. We also discuss a more general variant of BLAST. We first frame black-box teaching as a reinforcement learning task. Since we have no access to the optimal model $w^*$, we use a hold-out set to evaluate the performance of the learner (Fig. 6), which also serves as the surrogate learning objective. Typically the performance on the hold-out set is non-differentiable w.r.t. $\theta$, so we leverage the policy gradient method [86] to learn the teaching policy $\theta$. Specifically, we use the following objective function: $\max_\theta J(\theta) := \mathbb{E}_{(s_t,a_t)\sim\pi_\theta(a|s)}\{\sum_{t=1}^T r_t(s_t, a_t)\}$, where $T$ denotes the termination iteration, $(s_t, a_t)$ is the state-action pair at the $t$-th iteration and $r_t(s_t, a_t)$ is the reward. Since we use a terminal reward $R = r_T(s_T, a_T)$ defined on a hold-out set, the objective becomes $J(\theta)=\mathbb{E}_{\pi_\theta}\{R\}$. Therefore, we update the teacher parameters $\theta$ with $\theta \leftarrow \theta + \eta\nabla_\theta J(\theta)$, where $\nabla_\theta J(\theta)=\sum_t \nabla_\theta \log\pi_\theta(a_t|s_t)\,R$. We take classification as an illustrative example here.

Figure 6: RL formulation for BLAST.

Specifically, there are many ways to define the terminal reward, for example, (i) the accuracy after $T$ iterations on the hold-out set, or (ii) the number of iterations needed to achieve a preset accuracy $\zeta$ on the hold-out set ($T$ is the maximal iteration number). We can parameterize the teacher as any approximation method, such as a linear classifier or a neural network. The learner optimizes its weights with the cross-entropy loss and SGD. The teaching policy aims to generate suitable labels for input samples so that the learner can converge efficiently. Fig. 6 illustrates the reinforcement learning formulation. Since there are many choices for designing state features, action spaces and teaching policies, we aim to demonstrate the feasibility of applying reinforcement learning to black-box teaching rather than enumerating all the design choices.

State. State features are the input to the teacher model and are crucial for learning a good teaching policy. As an example for linear learners, we can construct the state features $h(s_t)$ from the current learner $w$, the optimal learner $w^*$, the input sample $x$ and the original ground truth label $y$.
For nonlinear learners, it remains an open problem to design informative state features [7, 18, 37].

Action. For a small number of classes, we can uniformly discretize the label space (i.e., the simplex) into a set of vectors. For example, the binary label space can be discretized into [0, 1], [0.25, 0.75], [0.5, 0.5], [0.75, 0.25] and [1, 0]. For a large number of classes, we need to reduce the search difficulty in the label space. For example, inspired by label smoothing (LS) and knowledge distillation (KD), we can design the synthesized label to be $\hat y = \mu y + (1-\mu)p$, where $y$ is the one-hot ground truth label and $p$ is an all-$\frac{1}{K}$ vector in LS or a soft label vector from a pretrained model in KD ($K$ is the number of classes). We discretize the action space $\mu\in[0.5, 1]$ and use the reward to guide the search for a state-dependent $\mu$. The teaching policy produces $\mu_t = \pi_\theta(h(s_t))$, where $\pi_\theta$ can be an MLP.

5 Beyond LAST: Intriguing Insights into Iterative Teaching

5.1 Additional Discussions

Practicality of iterative teaching. One of the most significant differences between batch machine teaching (BMT) and IMT is whether the learner's optimization algorithm is taken into consideration. More generally, IMT views the learner as a specific procedure or algorithm, while BMT views the learner as an objective function. When the objective function is non-convex and has multiple equivalent global minima, it could be infeasible for BMT to construct a minimal teaching set that guides the learner to a particular target model $w^*$. In contrast, IMT can still guide the learner to a specific model by considering the parameter update in every iteration. Overall, IMT makes a paradigm shift from focusing on the learner's model to focusing on the learner's algorithm.

Iterative teaching as constrained communication. We provide an alternative perspective on iterative teaching. If we view the teacher as a sender and the learner as a receiver, then the task of iterative teaching is to transmit some information (i.e., the target model $w^*$) from the sender to the receiver. Such information can only be transmitted over a constrained channel. For example, vanilla IMT [43] uses a communication channel that can only transmit teaching examples, and LAST uses a channel that only transmits labels. The teacher encodes the target model into teaching examples or labels, and the learner uses a decoder (i.e., its learning algorithm) to recover the target model. The constraints on the communication channel determine the difficulty of iterative teaching.

What makes iterative teaching interesting? First of all, iterative teaching provides a theoretically grounded framework to study how different sequences of training samples (i.e., different data sampling schemes) affect the convergence of a learner. More broadly, iterative teaching could help us better understand to what extent the inductive bias can be affected by different sequences of training samples. This has immediate connections to curriculum learning [7], self-paced learning [37] and importance sampling [99]. Second, iterative teaching can inspire new heuristics that improve convergence and generalization for complex nonlinear learners (e.g., neural networks) in practice. For example, both vanilla IMT and LAST imply that training a learner with easy samples first and hard samples later can improve convergence. Such an implication can be used to design practical training heuristics.
Moreover, black-box iterative teaching is studied to bridge the gap between omniscient iterative teaching and practical machine learning. Third, iterative teaching yields a unified framework in which the definition of the teaching medium is very flexible. We have studied sample selection [43, 44, 91], label synthesis (this paper) and the combination of both ([43] and this paper). Iterative teaching is not limited to training data, and the effectiveness of other teaching spaces remains to be explored.

Open problems. Iterative teaching is still in its infancy and there are numerous open problems in its theory, algorithms and practice. On the theory side, we can only obtain a lower bound on the iterative teaching dimension for now. The tightness of such lower bounds remains unknown. Moreover, an upper bound on the iterative teaching dimension and how to loosen the assumptions on learners are also important open problems. On the algorithm side, better algorithms for omniscient, gray-box or black-box teaching are yet to be designed. In practice, the scalability and efficiency of iterative teaching are also a major bottleneck for large-scale applications [91].

5.2 A Generalized Iterative Teaching Framework

We present a generalized iterative teaching (GIT) framework that considers a broader class of iterative learners (not limited to first-order gradient learners). At the $t$-th iteration, GIT is formulated as

$$\text{Learner: } w_{t+1} = \underbrace{Q\big(\{M_i^{[t]}\}_{i=1}^m,\ w_t,\ \ell\big)}_{\text{Iterative update function}} \qquad \text{Teacher: } \arg\min_{\big\{\{M_i^{[1]}\}_{i=1}^m,\ \dots,\ \{M_i^{[v]}\}_{i=1}^m\big\}}\ d\big(w_{t+v},\ w^*\big) \tag{7}$$

where $M$ denotes the teaching material (e.g., $M_i^{[j]}=\{x_i^j, y_i^j\}$ is the $i$-th training sample in the mini-batch of the $j$-th iteration), $Q$ is a general function denoting the learner's iterative update rule (e.g., an SGD learner uses $Q=w_t-\eta_t\nabla\ell(x_t,y_t|w_t)$), $\ell$ is the learner's objective function, $w_t$ is the learner's model parameters at the $t$-th iteration and $w^*$ is the target parameters. The teacher interacts with the learner every $v$ iterations. We denote this generalized teaching problem as GIT$(m, v, M, Q)$. For standard iterative teaching, we use training data as $M$ and the SGD update as $Q$, and abbreviate the problem as GIT$(m, v)$. For the ultimate IMT problem, we aim to solve GIT$(m, T)$, where $T$ is the termination iteration. Most prior studies [43, 44, 91] focus on solving GIT$(1, 1)$, while our paper addresses GIT$(1, v)$ with $v\ge1$. For strongly convex learners, BMT is equivalent to finding a minimal $m^*$ that solves GIT$(m^*, \infty)$ under the constraint $M_i^{[1]}=M_i^{[j]},\ \forall j$, and $m^*$ is also the teaching dimension in BMT. Alternatively, BMT also solves GIT$\big(m^*, \infty, \{x,y\}, Q(\{M_i^{[1]}\}_{i=1}^m, w, \ell)\big)$, where the learner always uses the mini-batch of the first iteration to update itself (and ignores mini-batches in all other iterations). The first mini-batch is the same as the minimal teaching set in BMT.

6 Empirical Evaluation

This section comprehensively evaluates LAST in the omniscient teaching scenario. Experimental details and more results (including BLAST) are given in Appendix G and Appendix F, respectively.

6.1 Omniscient Greedy Teaching

Figure 7: Convergence curves for linear regression (objective value and $\|w_t-w^*\|$ versus iteration). LAST (r=0.5) denotes LAST with a magnitude constraint of r=0.5.
Least square regression learner. We evaluate LAST on linear regression where the training data is generated by $y=\langle w^*, x\rangle + \epsilon$ ($\epsilon$ follows a normal distribution). We plot the training objective value and the Euclidean distance between $w_t$ and $w^*$ in Fig. 7 to evaluate empirical convergence. For LAST, we test no constraint (denoted as NC) and magnitude constraints. The results show that LAST generally converges better with a more flexible teacher, and outperforms IMT with magnitude coefficient r=2. This suggests that a suitable label may sometimes be more crucial than a suitable input example. We also observe that mixed teaching (LAST+IMT) converges significantly faster than IMT or LAST alone, implying that the alternating optimization works well.

Figure 8: Binary classification for logistic regression. One-Hot: one-hot constraint. Soft: soft constraints.

Logistic regression learner. Next we evaluate LAST on linear classification with logistic regression learners. We teach the logistic regression learner to perform binary classification on both synthetic data and real image data. For synthetic data, we use 2D half-moon data and Gaussian cluster data (in Appendix F). For real image data, we use the 3/5 digits in MNIST. From Fig. 8, we observe that mixed teaching converges efficiently and consistently achieves the best performance. All variants of LAST significantly outperform SGD. Without constraints, LAST is comparable to IMT in terms of convergence but does not require any costly example selection.

Figure 9: Synthesized labels on half-moon data with the one-hot constraint. (a) ground truth labels. (b,c,d) labels synthesized by LAST. The purple line is the current classifier and the green dotted line is the optimal classifier.

Fig. 9 shows how the synthesized labels change depending on the learner's parameters. We observe that LAST synthesizes labels adaptively based on the current learner. Our results connect well to the VC theory result that the convergence rate depends on how well a classifier can separate the data [77]. The synthesized labels are always linearly separable, which effectively improves convergence.

Figure 10: Binary classification for MLP learners.

Multi-layer perceptron learner. We apply LAST to neural networks in order to empirically show its effectiveness for nonlinear learners. Specifically, we use LAST ($\beta=1$) to teach a 2-layer MLP with ReLU nonlinearity (the output layer is essentially a logistic regression classifier). We perform binary classification of digits 3 and 5 on MNIST. Results in Fig. 10 show that all variants of LAST achieve significantly better convergence than SGD. With a teacher that generates labels without constraints, both LAST and mixed teaching achieve far better convergence than IMT. LAST is also much more efficient than IMT.

Figure 11: Ground truth vs. LAST (examples across iterations whose synthesized one-hot labels differ from the ground truth).
Fig. 11 presents some teaching examples for MLP learners in 3/5 MNIST digit classification, where the one-hot labels synthesized by LAST differ from the ground truth. The synthesized labels depend on the learner status and may vary across different iterations. We can observe that most of the examples whose synthesized labels differ from the ground truth are semantically ambiguous to humans, suggesting that the labels of hard samples near the decision boundary are crucial for convergence. Moreover, such an observation is consistent for linear learners (see Appendix F.4).

6.2 Omniscient Parameterized Teaching

Figure 12: Omniscient parameterized teaching on binary classification for logistic regression learners. Left: time-wise convergence. Right: iteration-wise convergence.

We learn a parameterized teaching policy for logistic regression learners using both unrolling and policy gradient. We parameterize the teacher model with an MLP. Experimental details are given in Appendix G. We evaluate the learned policies on binary classification of the 3/5 digits in MNIST. Fig. 12 shows that both unrolling and policy gradient can learn an effective teaching policy that achieves better convergence than SGD. Moreover, we evaluate two unrolling settings (20 steps and 100 steps); unrolling more steps shows marginal benefit on convergence. On time-wise convergence, we observe that all variants of LAST perform significantly faster than IMT, because no costly example selection is needed. On iteration-wise convergence, IMT still achieves the fastest convergence due to its strong flexibility from example selection. LAST with either unrolling or policy gradient converges much faster than SGD, and policy gradient yields slightly better convergence than unrolling in our experiments.

7 Concluding Remarks

In this paper, we present a novel iterative teaching framework via label synthesis. We consider both the omniscient scenario where the teacher has access to the optimal learner parameters and the black-box scenario where the teacher does not. We propose greedy teaching and parameterized teaching, and provide theoretical insights on why they can guide the learner to converge quickly. Extensive experiments demonstrate the effectiveness of our methods.

Acknowledgments

WL is supported by a Cambridge-Tübingen Fellowship, an NVIDIA GPU grant, DeepMind and the Leverhulme Trust via CFI. AW acknowledges support from a Turing AI Fellowship under grant EP/V025379/1, The Alan Turing Institute, and the Leverhulme Trust via CFI. LP is supported by the CIFAR AI Chair Program and acknowledges support from Samsung Electronics Co., Ltd. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039B, and by the Machine Learning Cluster of Excellence, EXC number 2064/1, Project number 390727645.

References

[1] Pankaj K Agarwal, Sariel Har-Peled, Kasturi R Varadarajan, et al. Geometric approximation via coresets. Combinatorial and Computational Geometry, 2005.
[2] Scott Alfeld, Xiaojin Zhu, and Paul Barford. Data poisoning attacks against autoregressive models. In AAAI, 2016.
[3] Scott Alfeld, Xiaojin Zhu, and Paul Barford. Explicit defense actions against test-set attacks. In AAAI, 2017.
[4] Eugene L Allgower and Kurt Georg. Numerical continuation methods: an introduction, volume 13. Springer Science & Business Media, 2012.
[5] Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In NeurIPS, 2016.
[6] Larry Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 1966.
[7] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, 2009.
[8] Beidi Chen, Weiyang Liu, Zhiding Yu, Jan Kautz, Anshumali Shrivastava, Animesh Garg, and Animashree Anandkumar. Angular visual hardness. In ICML, 2020.
[9] Yuxin Chen, Oisin Mac Aodha, Shihan Su, Pietro Perona, and Yisong Yue. Near-optimal machine teaching via explanatory teaching sets. In AISTATS, 2018.
[10] Yuxin Chen, Adish Singla, Oisin Mac Aodha, Pietro Perona, and Yisong Yue. Understanding the role of adaptivity in machine teaching: The case of version space learners. In NeurIPS, 2018.
[11] Jan Chorowski and Navdeep Jaitly. Towards better decoding and language model integration in sequence to sequence models. arXiv preprint arXiv:1612.02695, 2016.
[12] Bo Dai, Hanjun Dai, Niao He, Weiyang Liu, Zhen Liu, Jianshu Chen, Lin Xiao, and Le Song. Coupled variational Bayes via optimization embedding. In NeurIPS, 2018.
[13] Thorsten Doliwa, Gaojian Fan, Hans Ulrich Simon, and Sandra Zilles. Recursive teaching dimension, VC-dimension and sample compression. JMLR, 2014.
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
[15] Baxter S Eaves Jr, Naomi H Feldman, Thomas L Griffiths, and Patrick Shafto. Infant-directed speech is consistent with teaching. Psychological Review, 2016.
[16] Baxter S Eaves Jr and Patrick Shafto. Toward a general, scaleable framework for Bayesian teaching with applications to topic models. arXiv preprint arXiv:1605.07999, 2016.
[17] Jeffrey L Elman. Learning and development in neural networks: The importance of starting small. Cognition, 1993.
[18] Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In ICLR, 2018.
[19] Yang Fan, Yingce Xia, Lijun Wu, Shufang Xie, Weiqing Liu, Jiang Bian, Tao Qin, and Xiang-Yang Li. Learning to reweight with deep interactions. In AAAI, 2021.
[20] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
[21] Ziyuan Gao, Hans Ulrich Simon, and Sandra Zilles. On the teaching complexity of linear sets. In ALT, 2015.
[22] S.A. Goldman and M.J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 1995.
[23] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. In ICLR, 2016.
[24] Luis Haug, Sebastian Tschiatschek, and Adish Singla. Teaching inverse reinforcement learners via features and demonstrations. In NeurIPS, 2018.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[26] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[27] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[28] Jonathan H Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable Bayesian logistic regression. In NIPS, 2016.
[29] Anette Hunziker, Yuxin Chen, Oisin Mac Aodha, Manuel Gomez Rodriguez, Andreas Krause, Pietro Perona, Yisong Yue, and Adish Singla. Teaching multiple concepts to a forgetful learner. In NeurIPS, 2019.
[30] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In NIPS, 2016.
[31] Edward Johns, Oisin Mac Aodha, and Gabriel J Brostow. Becoming the expert - interactive multi-class machine teaching. In CVPR, 2015.
[32] Parameswaran Kamalaruban, Rati Devidze, Volkan Cevher, and Adish Singla. Interactive teaching algorithms for inverse reinforcement learning. arXiv preprint arXiv:1905.11867, 2019.
[33] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
[34] Kai A Krueger and Peter Dayan. Flexible shaping: How learning in small steps helps. Cognition, 2009.
[35] Akash Kumar, Adish Singla, Yisong Yue, and Yuxin Chen. Average-case complexity of teaching convex polytopes via halfspace queries. arXiv preprint arXiv:2006.14677, 2020.
[36] Akash Kumar, Hanqi Zhang, Adish Singla, and Yuxin Chen. The teaching dimension of kernel perceptron. arXiv preprint arXiv:2010.14043, 2020.
[37] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In NeurIPS, 2010.
[38] Laurent Lessard, Xuezhou Zhang, and Xiaojin Zhu. An optimal control approach to sequential machine teaching. In AISTATS, 2019.
[39] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
[40] Gui-Hua Lin and M Fukushima. Some exact penalty results for nonlinear programs and mathematical programs with equilibrium constraints. Journal of Optimization Theory and Applications, 2003.
[41] Rongmei Lin, Weiyang Liu, Zhen Liu, Chen Feng, Zhiding Yu, James M Rehg, Li Xiong, and Le Song. Regularizing neural networks via minimizing hyperspherical energy. In CVPR, 2020.
[42] Ji Liu, Xiaojin Zhu, and H Gorune Ohannessian. The teaching dimension of linear learners. In ICML, 2016.
[43] Weiyang Liu, Bo Dai, Ahmad Humayun, Charlene Tay, Chen Yu, Linda B. Smith, James M. Rehg, and Le Song. Iterative machine teaching. In ICML, 2017.
[44] Weiyang Liu, Bo Dai, Xingguo Li, Zhen Liu, James Rehg, and Le Song. Towards black-box iterative machine teaching. In ICML, 2018.
[45] Weiyang Liu, Rongmei Lin, Zhen Liu, Lixin Liu, Zhiding Yu, Bo Dai, and Le Song. Learning towards minimum hyperspherical energy. In NeurIPS, 2018.
[46] Weiyang Liu, Rongmei Lin, Zhen Liu, James M Rehg, Liam Paull, Li Xiong, Le Song, and Adrian Weller. Orthogonal over-parameterized training. In CVPR, 2021.
[47] Weiyang Liu, Rongmei Lin, Zhen Liu, Li Xiong, Bernhard Schölkopf, and Adrian Weller. Learning with hyperspherical uniformity. In AISTATS, 2021.
[48] Weiyang Liu, Zhen Liu, James M Rehg, and Le Song. Neural similarity learning. In NeurIPS, 2019.
[49] Weiyang Liu, Zhen Liu, Zhiding Yu, Bo Dai, Rongmei Lin, Yisen Wang, James M Rehg, and Le Song. Decoupled networks. In CVPR, 2018.
[50] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[51] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.
[52] Weiyang Liu, Yan-Ming Zhang, Xingguo Li, Zhen Liu, Bo Dai, Tuo Zhao, and Le Song. Deep hyperspherical learning. In NIPS, 2017.
[53] David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015.
[54] Yuzhe Ma, Xuezhou Zhang, Wen Sun, and Jerry Zhu. Policy poisoning in batch reinforcement learning and control. In NeurIPS, 2019.
[55] Oisin Mac Aodha, Shihan Su, Yuxin Chen, Pietro Perona, and Yisong Yue. Teaching categories to human learners with visual explanations. In CVPR, 2018.
[56] Farnam Mansouri, Yuxin Chen, Ara Vartanian, Jerry Zhu, and Adish Singla. Preference-based batch and sequential teaching: Towards a unified view of models. In NeurIPS, 2019.
[57] Shike Mei and Xiaojin Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. In AAAI, 2015.
[58] Vishal Monga, Yuelong Li, and Yonina C Eldar. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. IEEE Signal Processing Magazine, 2021.
[59] Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? In NeurIPS, 2019.
[60] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009.
[61] Tomi Peltola, Mustafa Mert Çelikok, Pedram Daee, and Samuel Kaski. Machine teaching of active sequential learners. In NeurIPS, 2019.
[62] Gail B Peterson. A day of great illumination: B.F. Skinner's discovery of shaping. Journal of the Experimental Analysis of Behavior, 2004.
[63] Amin Rakhsha, Goran Radanovic, Rati Devidze, Xiaojin Zhu, and Adish Singla. Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. In ICML, 2020.
[64] Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738, 2016.
[65] Rahim Samei, Pavel Semukhin, Boting Yang, and Sandra Zilles. Algebraic methods proving Sauer's bound for teaching complexity. Theoretical Computer Science, 2014.
[66] Patrick Shafto and Noah Goodman. Teaching games: Statistical sampling assumptions for learning in pedagogical situations. In CogSci, 2008.
[67] Patrick Shafto, Noah D Goodman, and Michael C Frank. Learning from others: The consequences of psychological reasoning for human learning. Perspectives on Psychological Science, 2012.
[68] Patrick Shafto, Noah D Goodman, and Thomas L Griffiths. A rational account of pedagogical reasoning: Teaching by, and learning from, examples. Cognitive Psychology, 2014.
[69] Patrick Shafto, Junqi Wang, and Pei Wang. Cooperative communication as belief transport. Trends in Cognitive Sciences, 2021.
[70] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[71] Adish Singla, Ilija Bogunovic, G Bartók, A Karbasi, and A Krause. On actively teaching the crowd to classify. In NeurIPS Workshop on Data Driven Education, 2013.
[72] Adish Singla, Ilija Bogunovic, Gabor Bartok, Amin Karbasi, and Andreas Krause. Near-optimally teaching the crowd to classify. In ICML, 2014.
[73] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[74] Behzad Tabibian, Utkarsh Upadhyay, Abir De, Ali Zarezade, Bernhard Schölkopf, and Manuel Gomez Rodriguez. Enhancing human learning via spaced repetition optimization. Proceedings of the National Academy of Sciences, 2019.
[75] Christian Thiel. Classification on soft labels is robust against label noise. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pages 65–73. Springer, 2008.
[76] Sebastian Tschiatschek, Ahana Ghosh, Luis Haug, Rati Devidze, and Adish Singla. Learner-aware teaching: Inverse reinforcement learning with preferences and constraints. In NeurIPS, 2019.
[77] Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: Similarity control and knowledge transfer. JMLR, 2015.
[78] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5):544–557, 2009.
[79] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
[80] Sharan Vaswani, Aaron Mishkin, Issam Laradji, Mark Schmidt, Gauthier Gidel, and Simon Lacoste-Julien. Painless stochastic gradient: Interpolation, line-search, and convergence rates. In NeurIPS, 2019.
[81] Feng Wang, Weiyang Liu, Haijun Liu, and Jian Cheng. Additive margin softmax for face verification. arXiv preprint arXiv:1801.05599, 2018.
[82] Junqi Wang, Pei Wang, and Patrick Shafto. Sequential cooperative Bayesian inference. In ICML, 2020.
[83] Pei Wang, Pushpi Paranamana, and Patrick Shafto. Generalizing the theory of cooperative inference. In AISTATS, 2019.
[84] Pei Wang, Junqi Wang, Pushpi Paranamana, and Patrick Shafto. A mathematical theory of cooperative communication. In NeurIPS, 2020.
[85] Paul J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1988.
[86] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
[87] Laurence A. Wolsey and George L. Nemhauser. Integer and Combinatorial Optimization. John Wiley & Sons, 1999.
[88] Lijun Wu, Fei Tian, Yingce Xia, Yang Fan, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. Learning to teach with dynamic loss functions. In NeurIPS, 2018.
[89] Lingxi Xie, Jingdong Wang, Zhen Wei, Meng Wang, and Qi Tian. DisturbLabel: Regularizing CNN on the loss layer. In CVPR, 2016.
[90] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with noisy student improves ImageNet classification. In CVPR, 2020.
[91] Zhaozhuo Xu, Beidi Chen, Chaojian Li, Weiyang Liu, Le Song, Yingyan Lin, and Anshumali Shrivastava. Locality sensitive teaching. In NeurIPS, 2021.
[92] Scott Cheng-Hsin Yang and Patrick Shafto. Explainable artificial intelligence via Bayesian teaching. In NIPS 2017 Workshop on Teaching Machines, Robots, and Humans, 2017.
[93] Scott Cheng-Hsin Yang, Yue Yu, Pei Wang, Wai Keen Vong, Patrick Shafto, et al. Optimal cooperative inference. In AISTATS, 2018.
[94] Teresa Yeo, Parameswaran Kamalaruban, Adish Singla, Arpit Merchant, Thibault Asselborn, Louis Faucon, Pierre Dillenbourg, and Volkan Cevher. Iterative classroom teaching. In AAAI, 2019.
[95] Xuezhou Zhang, Shubham Kumar Bharti, Yuzhe Ma, Adish Singla, and Xiaojin Zhu. The teaching dimension of Q-learning. arXiv preprint arXiv:2006.09324, 2020.
[96] Xuezhou Zhang, Yuzhe Ma, Adish Singla, and Xiaojin Zhu. Adaptive reward-poisoning attacks against reinforcement learning. In ICML, 2020.
[97] Xuezhou Zhang, Xiaojin Zhu, and Laurent Lessard. Online data poisoning attacks. In L4DC, 2020.
[98] Xuezhou Zhang, Xiaojin Zhu, and Stephen Wright. Training set debugging using trusted items. In AAAI, 2018.
[99] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In ICML, 2015.
[100] Yao Zhou, Arun Reddy Nelakurthi, and Jingrui He. Unlearn what you have learned: Adaptive crowd teaching with exponentially decayed memory learners. In KDD, 2018.
[101] Yao Zhou, Arun Reddy Nelakurthi, Ross Maciejewski, Wei Fan, and Jingrui He. Crowd teaching with imperfect labels. In WWW, 2020.
[102] Xiaojin Zhu. Machine teaching for Bayesian learners in the exponential family. In NeurIPS, 2013.
[103] Xiaojin Zhu. Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, 2015.
[104] Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2009.
[105] Xiaojin Zhu, Ji Liu, and Manuel Lopes. No learner left behind: On the complexity of teaching multiple learners simultaneously. In IJCAI, 2017.
[106] Xiaojin Zhu, Adish Singla, Sandra Zilles, and Anna N. Rafferty. An overview of machine teaching. arXiv preprint arXiv:1801.05927, 2018.
[107] Bojia Zi, Shihao Zhao, Xingjun Ma, and Yu-Gang Jiang. Revisiting adversarial robustness distillation: Robust soft labels make student better. In ICCV, 2021.
[108] Sandra Zilles, Steffen Lange, Robert Holte, and Martin Zinkevich. Teaching dimensions based on cooperative learning. In COLT, 2008.