The Teaching Dimension of Regularized Kernel Learners

Hong Qian*1,2, Xu-Hui Liu*3,4, Chen-Xi Su1, Aimin Zhou1,5, Yang Yu3,4

*Equal contribution. 1School of Computer Science and Technology, East China Normal University, Shanghai, China. 2Shanghai Key Laboratory of Multidimensional Information Processing. 3School of Artificial Intelligence, Nanjing University, Nanjing, China. 4National Key Laboratory for Novel Software Technology. 5Shanghai Institute of AI for Education. Correspondence to: Hong Qian.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Teaching dimension (TD) is a fundamental theoretical property for understanding machine teaching algorithms. It measures the sample complexity of teaching a target hypothesis to a learner. The TD of linear learners has been studied extensively, whereas results on teaching non-linear learners are rare. A recent result investigates the TD of polynomial and Gaussian kernel learners. Unfortunately, the theoretical bounds therein show that the TD is high when teaching those non-linear learners. Inspired by the fact that regularization can reduce the learning complexity in machine learning, a natural question is whether a similar phenomenon occurs in machine teaching. To answer this essential question, this paper proposes a unified theoretical framework termed STARKE to analyze the TD of regularized kernel learners. On the basis of STARKE, we derive a generic result for any type of kernel. Furthermore, we disclose that the TD of regularized linear and regularized polynomial kernel learners can be strictly reduced. For regularized Gaussian kernel learners, we reveal that, although their TD is infinite, their ϵ-approximate TD can be exponentially reduced compared with that of the unregularized learners. Extensive experimental results of teaching the optimization-based learners verify the theoretical findings.

1. Introduction

Machine teaching (Zhu et al., 2018) is aimed at designing an optimal training set (aka teaching set) to steer a learner (aka student) towards a target hypothesis. It can be regarded as an inverse problem of machine learning. Machine teaching has various applications, such as reinforcement learning (Kamalaruban et al., 2019), trustworthy AI (Zhang et al., 2018), education (Patil et al., 2014) and cognitive psychology (Shafto et al., 2014). In those scenarios, although a teacher knows the target hypothesis, she cannot telepathize it into the learner's mind. For instance, a botanist intends to teach students to categorize flowers into peony, rose, and azalea. The botanist has the correct decision boundary in mind, but she can only teach by picking representative and informative flower examples and showing them to the students. The choice of training set can be optimized if the teacher has a good understanding of how the students learn from the examples.

Related Work. Teaching dimension (TD) (Goldman & Kearns, 1991; Shinohara, 1991) is a fundamental theoretical property for understanding machine teaching algorithms. It measures the sample complexity of teaching, and is defined as the minimal number of training examples required in the worst case to teach a target hypothesis to a learner.
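To make this definition concrete, the following minimal sketch brute-forces the TD of a toy class of one-dimensional threshold hypotheses under a consistency-based learner (the version space learner discussed next). The instance space, the hypothesis class, and the helper names are illustrative assumptions, not constructions from this paper.

```python
from itertools import combinations

# Toy setting: instance space {1,...,5}, hypotheses h_t(x) = +1 iff x > t.
X = list(range(1, 6))
thresholds = list(range(0, 6))


def h(t, x):
    """Threshold hypothesis: +1 if x > t, else -1."""
    return 1 if x > t else -1


def consistent(t, examples):
    """A version space learner keeps t iff it labels every example correctly."""
    return all(h(t, x) == y for (x, y) in examples)


def teaching_dim(target):
    """Smallest teaching set size that leaves exactly {target} in the version
    space; labels are produced by the target hypothesis itself."""
    pool = [(x, h(target, x)) for x in X]      # all truthfully labeled examples
    for size in range(0, len(pool) + 1):
        for teaching_set in combinations(pool, size):
            survivors = [t for t in thresholds if consistent(t, teaching_set)]
            if survivors == [target]:
                return size
    return float("inf")                        # no finite teaching set exists


per_hypothesis = {t: teaching_dim(t) for t in thresholds}
print(per_hypothesis)                          # interior thresholds need 2 examples
print("TD(H) =", max(per_hypothesis.values())) # worst case over the class
```

For this toy class the worst-case TD is 2, since an interior threshold needs one example on each side of the boundary; this already previews why richer hypothesis spaces can require far more examples.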
Along the theoretical research direction of TD, one of the most considered settings is teaching a version space learner (Goldman & Kearns, 1991; Anthony et al., 1995; Chen et al., 2018; Kirkpatrick et al., 2019). A version space learner maintains a set of hypotheses, keeps removing those which are not consistent with the receiving training examples, and outputs a qualified hypothesis in the end. To reduce the teaching dimension, i.e., the teaching complexity, a series of teaching models are proposed, such as recursive teaching (Zilles et al., 2011), preference-based teaching (Gao et al., 2017; Mansouri et al., 2019) and non-clashing teaching (Kirkpatrick et al., 2019). However, the power of version space learners is limited (Goldman & Mathias, 1996), and it can hardly model the behavior of a wide range of modern learners. To tackle the above issue, the notation of TD is extended to the optimization-based learners, i.e, the empirical risk minimization (ERM) learners. The scenario of teaching ERM learners is more realistic, and the version space learners are a special case of it provided that we optimize the 0-1 loss. Liu et al. (2016) extensively investigate the TD of ERM learners under the linear hypothesis space. Since the linear ERM learners may be restricted, recently, Kumar et al. (2021) generalize the hypothesis space to the non-linear ones by considering the kernel perceptrons in ERM. They disclose The Teaching Dimension of Regularized Kernel Learners (a) Hinge loss function without regularization (b) Hinge loss function with regularization Figure 1. Teaching a 1-D threshold classifier. The positive training examples are marked as red diamonds and the negative ones are marked as blue rectangles. The black point represents the target hypothesis θ . (a) A hinge loss learner without regularization and the output hypotheses marked by greed line under the given training set. (b) A hinge loss learner with L2 regularization and the teaching set {(θ , 1)}. that the TD is Θ(d) for teaching linear kernel perceptrons in Rd , and Θ(dp) for polynomial kernel perceptrons with degree p. For Gaussian kernel perceptrons, they reveal that the exact teaching is impossible with a finite teaching set, and thus propose the ϵ-approximate TD (ϵ-TD). It is the minimal number of training examples required in the worst case to teach a target hypothesis to a learner that has no more than ϵ excess risk. Under the approximate teaching scenario, the ϵ-TD of Gaussian kernel perceptrons is d O(log2 (1/ϵ)). Problem & Motivation. The theoretical bounds derived in Kumar et al. (2021) indicate that the exact or ϵ-approximate TD is high when teaching linear, polynomial, and Gaussian kernel learners. The high sample complexity of teaching could lead to the low efficiency of machine teaching algorithms thus blocking their further applications. Inspired by the fact that regularization can reduce the learning complexity in machine learning (Bousquet & Elisseeff, 2002; Mohri et al., 2012), a natural and fundamental question is whether the similar fact happens in machine teaching. Furthermore, Kumar et al. (2021) only consider the linear, polynomial, and Gaussian kernel learners. Other types of widely-used kernels, e.g., exponential kernels (Feragen et al., 2015) and Laplacian kernels (Fadel et al., 2016; Drewnik & Pasternak Winiarski, 2017), are omitted therein. Therefore, a generic theoretical analysis framework that is able to derive the TD of any type of (non-linear) kernel learners is quite appealing. Our Contribution. 
In this paper, we focus on answering the above essential question: Can regularization help reduce the teaching complexity of kernel learners? Fortunately, our answer is YES. Intuitively, consider a 1-dimensional threshold classifier hθ (x) = 2(I(x > θ ) 0.5), i.e., hθ (x) returns 1 if x θ and +1 if x > θ . As illustrated in Figure 1, the hinge loss is used. For a learner without regularization, the hypotheses in the interval between the closest and oppositely labeled two examples in the training set have the equal hinge loss. Thus, it is impossible for the learner to pick out the unique target hypothesis θ with a finite teaching set. In contrast, for a learner with L2 regularization whose regularization coefficient is less Table 1. The teaching dimension (TD) of ERM linear, polynomial and Gaussian kernel learners with and without regularization (line 2-4). The ϵ-approximate teaching dimension (ϵ-TD) of ERM Gaussian kernel learners with and without regularization (line 5). Kernel Type (TD Type) With Regularization (This Paper) Without Regularization (Kumar et al., 2021) Linear (TD) 1 Θ(d) Polynomial (TD) G (d, p) TD d+p 1 p Gaussian (TD) Gaussian (ϵ-TD) O(1/ϵ2) d O(log2 (1/ϵ)) than 1, one teaching example (θ , 1) is enough due to its preference to θ with less L2 norm. The contribution of this paper is three folds: As a cornerstone of analyzing TD, we at first propose a unified theoretical framework termed STARKE. Based on the subset analysis and the extension technique, STARKE is able to analyze the exact or ϵ-TD of the regularized ERM linear and non-linear kernel learners. With the help of STARKE, a generic result of TD or ϵ-TD is derived for any type of kernels. Via specifying the kernel type in STARKE, we disclose the TD of the regularized ERM homogeneous linear and polynomial kernel learners. Their TD is strictly reduced compared with that of the unregularized learners. As shown in Table 1, the TD is reduced from Θ(d) to 1 for the linear kernel learners. For the polynomial kernel learners, we reveal that G (d, p) d+p 1 p , and their TD is strictly smaller than d+p 1 p for some target hypotheses. For regularized ERM Gaussian kernel learners, we reveal that, although their TD is infinite, their ϵ-TD can be exponentially reduced compared with that of the unregularized learners, as shown in Table 1. Besides, the experiment results verify the theoretical findings. To the best of our knowledge, the above results are the first known bounds on (approximately) teaching the regularized ERM non-linear kernel learners. The consequent sections introduce the preliminaries, describe the proposed STARKE framework, present the case study on five types of kernels, show the experiment results, and finally give the discussion and conclusion. The Teaching Dimension of Regularized Kernel Learners 2. Preliminaries This section introduces the necessary notations, definitions, concepts, and the existing results, so as to pave the way for the consequent sections. Basic Definitions. We denote X Rd as the input space and Y as the output space. A hypothesis is a mapping h: X Y. This paper assumes that the hypothesis hθ can be identified by its parameter θ. The hypothesis space H consists of a set of hypotheses. A training example is denoted as a pair (x, y) X Y. A training set is a multiset D = {(x1, y1), , (xn, yn)}, where the repeated pairs are acceptable. Let D denote the set of all training sets with all sizes. An algorithm A: D 2H learns from a training set D D and outputs a subset of the hypothesis space H. 
Let θ H be the target hypothesis. The exact teaching is successful if a helpful teacher identifies a training set D D such that A(D) = {θ }. Such a D is called the teaching set (TS) of θ with respect to H. The teaching dimension of θ is the minimum size of the teaching set, i.e., TD(θ ) = min D D |D| , D is a teaching set of θ ; , if no teaching set exists. Furthermore, the teaching dimension of the whole hypothesis space H is defined as the teaching dimension of the hardest hypothesis, i.e., TD(H) = maxθ H TD(θ). Regularized Empirical Risk Minimization. Consider a training set D = {(xi, yi)}n i=1, where xi Rd. The optimization problem identified by the regularized ERM can be formulated as A(D) = arg min θ H i=1 ℓ(fθ(xi), yi) + Ω(||θ||2) , (1) where ℓis the loss function and || || is the L2 norm. This paper assumes that the regularization function Ω(x2) is strictly increasing, differentiable and convex. This assumption is not strong in the field of teaching theory, since analyzing TD of the non-linear learners is still in its infancy. It can be easily satisfied by the widely-used Ω(x2) = 1 2µx2 regularization function. The regularized ERM learner outputs a set of hypotheses A(D). Approximate Teaching. When a finite teaching set does not exist, exact teaching is impossible and the teaching dimension is meaningless. Therefore, Kumar et al. (2021) propose to consider the ϵ-approximate teaching set and the ϵapproximate teaching dimension instead. Let {ˆθ} = A(D), and θ is the target hypothesis. If ˆθ satisfies F(ˆθ) F(θ ) ϵ , where F(θ) = E[ℓ(fθ(x), y)] + Ω(||θ||2) and the expectation is over (x, y) P, then we call D as the ϵ-approximate teaching set for the regularized ERM learners. In a similar way, the ϵ-approximate teaching dimension (ϵ-TD) of them can also be defined. Reproducing kernel Hilbert space (RKHS). An RKHS H is uniquely determined by a reproducing kernel, which is a kernel operator k: X X R adheres to the Mercer s positive definite conditions (Vapnik, 1998). Let Hpre = {Pn i=1 αik(xi, ): n N, αi R, xi X}, then H is the closure of Hpre. An RKHS with k can be decomposed as k(x1, x2) = Φ(x1), Φ(x2) (Steinwart & Christmann, 2008) for any x1, x2 X, where Φ( ) is the feature map. Specifically, the equation holds for Φ(x) = k(x, ), and this type of feature map is called the canonical feature map. We can generalize the problem identified by Equation (1) to the non-linear setting by the kernel method, namely, rewriting fθ(x) as the inner product θ, Φ(x) . If the canonical feature map is used, the optimization problem becomes A(D) = arg min θ H i=1 ℓ( θ, k(xi, ) , yi)+Ω(||θ||2 H), (2) where H is the RKHS norm. In this way, the hypothesis space becomes the RKHS, i.e., H = H, and we are able to express the hypothesis as θ = P i=1 αik(xi, ) (Steinwart & Christmann, 2008). 3. STARKE: A Unified Theoretical Framework In this section, we introduce the proposed subset analysis framework for deriving teaching dimension of regularized kernel learners (STARKE). The STARKE theoretical framework consists of two essential parts. First, analyzing the teaching dimension of a considered subset of RKHS. Second, extending the teaching to the whole RKHS via the strong representation ability of the considered subset. The two parts of STARKE and the relationship among the main theoretical results therein are illustrated in Figure 2. We elaborate them respectively. 
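Before elaborating the two parts, we fix a concrete reference point for the learner being taught, namely the regularized kernel ERM of Equation (2). The following minimal sketch solves Equation (2) for the square loss with the widely used Ω(x²) = ½µx²: by the Representer Theorem (stated below as Lemma 1), the minimizer lies in the span of {k(xi, ·)}, and setting the gradient with respect to the coefficients α to zero yields the linear system (2K + µI)α = 2y. The Gaussian kernel, the synthetic data, and the value of µ are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the regularized kernel ERM learner in Equation (2) with
# square loss and Omega(x) = (mu/2) * x. By the representer theorem the
# solution is theta = sum_i alpha_i k(x_i, .), and the first-order condition
# in alpha reads (2K + mu*I) alpha = 2y.

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def fit_regularized_kernel_erm(X, y, mu=0.1, sigma=1.0):
    """Return alpha such that f(x) = sum_i alpha_i k(x_i, x) minimizes
    sum_i (f(x_i) - y_i)^2 + (mu/2) * ||f||_H^2."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(2.0 * K + mu * np.eye(len(X)), 2.0 * y)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(30)

alpha = fit_regularized_kernel_erm(X, y, mu=0.1, sigma=0.9)
K = gaussian_kernel(X, X, sigma=0.9)
print("training RMSE:", np.sqrt(np.mean((K @ alpha - y) ** 2)))
print("RKHS norm^2  :", float(alpha @ K @ alpha))
```

Teaching this learner amounts to choosing the pairs (xi, yi) so that the coefficient vector α returned above coincides with that of the target hypothesis, which is exactly what the constructions in Section 3.1 do.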
S: Subset Hm H: RKHS H L: Lemma T: Theorem Representer Theorem Lower bound of TD(S) Square loss Teaching set of S (L-3) TD(S) (T-1) Teaching set of S (L-4) Upper bound of TD(S) (L-5) TD(S) (T-2) Finite dim H Upper Bound of TD(H) (T-3) Infinite dim H TD(H) can be infinite (T-4) Upper bound of approximate teaching excess risk (L-6) Upper bound of 𝜖-TD(H) (T-5) Figure 2. The STARKE framework and the relationship among the main theoretical results therein. The Teaching Dimension of Regularized Kernel Learners 3.1. Subset Analysis This paper considers the subset of RKHS defined as Hm = {Pm i=1 αik(xi, ): αi R, xi X}. Notably, Hm is different from the aforementioned Hpre because m is fixed here. In the following analysis, we assume that Hm Hm 1 = . The assumption means that we only consider the smallest m such that Hm stays unchanged. In fact, if the assumption does not hold, i.e., Hm Hm 1 = , it implies that Hm = Hm 1 and thus Hm can be replaced by Hm 1. This process can be done recursively until we find an m0 such that the assumption holds for Hm0. Furthermore, the Representer Theorem is a canonical result in kernel methods. It is necessary in our proof process, and we state it in the following lemma for the purpose of self-contained. Lemma 1 (Representer Theorem). Given a training set {xi, yi}n i=1, if the regularization function Ωin Equation (2) is monotonically increasing, then the solution has the following form i=1 αik(xi, ) , αi R . (3) The proof of the above lemma can be found in (Sch olkopf & Smola, 2002). Note that the regularization function Ω meets the condition of Lemma 1, and thus the solution of Equation (2) has the form of Equation (3). Next, we study the teaching dimension of Hm with square loss function and hinge loss function. It is easy to see that if the target hypothesis θ = 0, we do not need any training data to uniquely obtain the target hypothesis from Equation (2). Thus, we only consider the non-trivial case when θ = 0. At first, the lower bound of the teaching dimension of subset Hm is derived, and its proof can be found in Appendix A.1. Lemma 2 (Lower Bound of TD(Hm)). The lower bound of the teaching dimension of subset Hm is m. To determine the teaching dimension, it suffices to derive the upper bound if it matches the lower bound. The upper bound of TD(Hm) can be established by providing a teaching set (TS), the cardinality of which is its upper bound. We now construct the teaching sets for Hm, and the square loss function and the hinge loss function are analyzed respectively. The analysis is applied to both regression and classification tasks, since the classification tasks can be accomplished by simply set a threshold for the predicted value. The square loss function is ℓ(x, y) = (x y)2. The teaching set is provided in the following lemma, and its proof can be found in Appendix A.2. Lemma 3 (TS(Hm)-Square Loss). Given any θ Hm Hm 1, where θ = Pm i=1 α i k(x i , ), then a teaching set of θ with the square loss function is X = X , Y = K α α Ω ((α )T K α ) , where X = (x 1, , x m)T , α = (α 1, , α m)T , Ω (x) is the derivative of the regularization function Ω(x) and K is the Gram matrix such that K ij = α i α jk(x i , x j). Lemma 3 gives a teaching set for θ Hm Hm 1. Noticing that Hm = H0 (H1 H0) (Hm Hm 1), we can derive the teaching set for all hypotheses in Hm by substituting different m. Since the teaching set has m elements, the upper bound of the teaching dimension of Hm is m. Combined it with Lemma 2 results in Theorem 1. Theorem 1 (TD(Hm)-Square Loss). 
The teaching dimension of subset Hm with the square loss function is m. Theorem 1 indicates that whatever the kernel function we use, exact teaching can be performed in the subset Hm. We now turn to the hinge loss function defined as ℓ(x, y) = max(1 xy, 0). We denote k(x, ), θ as g(x, θ ) and denote {1, 2, . . . , n} as [n]. The teaching set is constructed in Lemma 4, and its proof is in Appendix A.3. Lemma 4 (TS(Hm)-Hinge Loss). If θ Hm, then a teaching set of θ with the hinge loss function is xij = x i , yij = αiΩ /ni, i [m], j [ni], where ni = max(1, αig(x i , θ )Ω ) and Ω is defined as 2Ω ( θ 2 H). Lemma 5 (Upper Bound of TD(Hm)-Hinge Loss). If θ Hm, then the upper bound of TD(Hm) with the hinge loss function is Pm i=1 max(1, αig(x i , θ )Ω ) , where θ = Pm i=1 α i k(x i , ), Ω = 2Ω ( θ 2 H). The corresponding upper bound of TD(Hm) is shown in Lemma 5, and it indicates that TD(Hm) = m is non-trivial. It also implies the difference of TD between the square loss and the hinge loss function. That is to say, TD(Hm) = m does not hold for all loss functions and regularization functions. Based on Lemma 5, the upper bound of TD(Hm) can be improved with properly selected regularization functions. The improved upper bound matches the lower bound so that we obtain TD(Hm) for the hinge loss function. Its proof is in Appendix A.4. Theorem 2 (TD(Hm)-Hinge Loss). If the regularization function Ωsatisfies Ω (x) mini I 1/β, where I = {i: α i g(x i , θ ) > 0} and β = 2α i p k(x i , x i ) θ H, then TD(Hm) with the hinge loss function is m. Remark. We would like to point out that, for the widelyused regularization function Ω(x) = 1 2µx, the condition of Ω (x) mini I 1/β implies µ mini I 1/β, thus the condition can be satisfied by choosing small enough µ. In a nutshell, the teaching dimension of subset Hm for both square loss and hinge loss functions are m under the mild assumptions we have assumed. The Teaching Dimension of Regularized Kernel Learners 3.2. Representation Ability of Subset Hm w.r.t. RKHS In this section, we analyze the representation ability of subset Hm with respect to (w.r.t.) the RKHS H. Then the TD or ϵ-TD of any type of regularized ERM kernel learners can be disclosed consequently. 3.2.1. TEACHING DIMENSION FOR FINITE DIMENSIONAL RKHS For any finite dimensional RKHS H, H can be represented by Hm with large enough m. Theorem 3 (Upper Bound of TD(H)-Finite Dimension). If the dimension of the RKHS dim(H) = d H, then we have that Hm = H with m d H. The proof can be found in Appendix B.1. From the theorems and the conclusion of the above section, the teaching dimension of the regularized kernel learners with finite dimensional RKHSs is O(d H) if dim(H) = d H. 3.2.2. TEACHING DIMENSION FOR INFINITE DIMENSIONAL RKHS Kumar et al. (2021) prove that exact teaching requires the infinite teaching set for Gaussian kernel which induces an infinite dimensional RKHS. However, for the regularized learners, since the regularization function can be arbitrarily chosen, the result is not so obvious. In this section, we prove that even if using regularization, for some specific kernels, such as Gaussian, exponential and Laplacian kernels, we cannot find a finite teaching set for some target hypotheses. The pessimistic result is shown in Theorem 4, and its proof is in Appendix B.1. Theorem 4 (TD(H)-Pessimistic Infinity). Assume the RKHS H of the considered kernel is infinite dimensional. For all regularization functions, the TD of the regularized learner is if there exists no m0 < s.t. Hm0 = H. Remark 1. 
Theorem 4 does not require the assumption of Ω. In other words, this theorem can be applied to all types of regularization functions. Furthermore, this theorem can also be applied to all types of loss functions. Remark 2. The RKHSs induced by Gaussian, exponential and Laplacian kernels are infinite dimensional and can not be represented by Hm if m < . The proof can be found in the Appendix B.3. 3.2.3. APPROXIMATE TEACHING FROM Hm TO H Although the whole RKHS cannot be exactly taught for some kernels, the result on Hm implies that we can realize exact teaching on the subset of H. In this way, the subset Hm can be seen as an approximation to H, and performing teaching in Hm is performing approximate teaching in H. It has been clarified that there exists an RKHS that cannot be expressed as Hm with a finite m. In this case, we attempt to explore how well Hm can approximate H. Let θ be the optimal solution to min θ H F(θ) = min θ H E[ℓ( k(x, ), θ , y)] + Ω(||θ||2 H) , (4) where the expectation is taken over the joint distribution P(x, y). Following Koltchinskii (2011), we define the excess risk of any hypothesis θ as Λ(θ) = F(θ) F(θ ) . (5) Let θ m be the optimal solution to minθ Hm F(θ), then the approximation error of Hm is Λ(θ m). Similar to Yang et al. (2012), we assume maxy Y ℓ(0, y) 1 and ℓ(z, y) has a bounded partial derivative | zℓ(z, y)| C1. Consider {(xi, yi)N i=1}, which is i.i.d. and sampled from the joint distribution P(x, y). Let K be the Gram matrix of {xi}N i=1, and supx k(x, x) C(k) where C(k) is a constant corresponding to kernel k, and {λi}N i=1 be the eigenvalues of K with λ1 λN, Ω 1 be the inverse function of Ω, the existence of which is guaranteed by the monotonicity of Ω, we can upper bound the excess risk of approximate teaching as Lemma 6. Its proof is in Appendix B.4. Lemma 6 (Upper Bound of Approximate Teaching Excess Risk). If M1 = Ω 1(Ω(0) + 1) e2N/4γ2, where γ 1, sup θ H M1 2 θ H Ω ( θ 2 H) is bounded, and λm+1 O(N/ m) for all N, then we have that Λ (θ m) O C(k) + 1 m Lemma 6 indicates that the excess risk raised by teaching in a subset rather than the whole RKHS is upper bounded by O ((C(k) + 1)/ m). Since the definition of ϵ-TD is that the excess risk between the taught hypothesis and the target hypothesis is no more than ϵ, we can derive the ϵ-TD based on Lemma 6 as shown in Theorem 5. Theorem 5 (Upper Bound of ϵ-TD(H)). Under the conditions as Lemma 6, for the regularized ERM learner, the ϵ-TD of the RKHS H is upper bounded by 4. Theoretical Case Study We take the linear, polynomial, Gaussian, exponential and Laplacian kernels as the case study to show how to apply the generic STARKE framework to determine the TD or ϵ-TD of the regularized kernel learners. The RKHSs of the linear The Teaching Dimension of Regularized Kernel Learners and polynomial kernels are finite dimensional. Based on Theorem 3, exact teaching is achievable under such kernels. However, the situation differs for Gaussian, exponential and Laplacian kernels, and Theorem 5 can be helpful. 4.1. Linear Kernel The linear kernel is defined as k(x1, x2) = x1, x2 +c. For simplicity, this paper considers the homogeneous scenario, i.e., c = 0. The result when c = 0 can be derived similarly. The canonical feature map of linear kernel is Φ(x) = x, . We denote the dimension of input space dim(X) = d. Based on the linearity of inner product, we have Lemma 7. Lemma 7 (H Identification). Let H be the RKHS defined by the linear kernel, and dim(X) = d, then H = H1. We leave the proof of Lemma 7 in Appendix C.1. 
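Lemma 7 states that the RKHS of the homogeneous linear kernel collapses to H1, so the subset results of Section 3.1 apply with m = 1 and a single example should suffice; this is formalized as Corollary 1 below. The following minimal numerical check uses the square loss with Ω(x²) = ½µx²: instantiating the Lemma 3 construction with m = 1 gives the single teaching example (θ*, ∥θ*∥² + µ/2), whose optimality can be verified directly from the first-order condition. The target, the dimension d = 3, and µ are arbitrary illustrative choices.

```python
import numpy as np

# Single-example exact teaching of a regularized linear learner with square
# loss and Omega(x) = (mu/2) * x. The teaching example is
# (x, y) = (theta*, ||theta*||^2 + mu/2), following Lemma 3 with m = 1.

rng = np.random.default_rng(1)
d, mu = 3, 0.5
theta_star = rng.standard_normal(d)                     # target hypothesis in R^3

x = theta_star.copy()
y = float(theta_star @ theta_star) + mu / 2.0           # label from Lemma 3

# The learner minimizes (theta^T x - y)^2 + (mu/2)||theta||^2; its unique
# minimizer solves (2 x x^T + mu I) theta = 2 x y.
theta_hat = np.linalg.solve(2.0 * np.outer(x, x) + mu * np.eye(d), 2.0 * x * y)

print("target :", theta_star)
print("learned:", theta_hat)
print("max abs error:", np.max(np.abs(theta_hat - theta_star)))   # ~1e-16
```

Without the regularizer, every θ satisfying θᵀx = y attains zero square loss, which is why the unregularized learner needs d linearly independent examples (cf. Proposition 2 in Appendix D and the Θ(d) bound of Kumar et al. (2021)).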
According to Lemma 7 and the theoretical result of subset, we derive the following corollary. Corollary 1 (TD under Linear Kernel). Under the conditions in Theorem 2, for the regularized ERM linear kernel learners with the considered two loss functions, the TD of the RKHS H is TD(H) = 1. Remark. For the linear kernel learner without regularization, Kumar et al. (2021) derive the teaching dimension, which is Θ(d). When the regularization term is equipped, the TD is drastically reduced to 1. This implies that regularization is helpful to reduce the teaching complexity. 4.2. Polynomial Kernel The polynomial kernel of degree p N is defined as k(x1, x2) = ( x1, x2 + c)p, where c 0 is a constant. We consider the homogeneous scenario for simplicity, i.e., c = 0. The result when c > 0 can be derived similarly. For the input space X Rd, the dimension of RKHS induced by polynomial kernel with degree p is ϕ = d+p 1 p . By Theorem 3, we have that θ = Pϕ i=1 αik(xi, ). Note that αik(xi, ) = k( p αixi, ). Without loss of generality, we assume αi = 1. Let z = (z1, , zd), we have θ , z = C1zp 1 + p! (p 1)!1!C2zp 1 1 z2 + p! (p 1)!1!C3zp 1 1 z3 + + Cpzp d . Consider the following system of polynomial equations yp 11 + yp 21 + + yp m1 = C1 yp 1 11 y12 + + yp 1 m1 ym2 = C2 yp 1d + + yp md = Cϕ , where yij are variables. If the solution for the polynomial system exists, let yi = (yi1, , yid), then θ = Pm i=1 k(yi, ). Namely, θ can be expressed with no more than m components. For simplicity, let the polynomials be f1, f2, . . . , fϕ. Define the degree lexicographic order as y11 y12 y21 ymd, and the derived Gr obner basis (Hartshorne, 1977) for fi as G(θ , d, p, m). We have the following lemma. Lemma 8 (H Identification). Let H be the RKHS defined by the homogeneous polynomial kernel with degree p, then H = HG (d,p), where G (d, p) = max θ H G(θ, d, p) arg min m {m : G(θ, d, p, m) = {1}} . Its proof is in Appendix C.2. With Lemma 8 and the results of Theorem 1 and 2, the following corollary is derived. Corollary 2 (TD under Polynomial Kernel). Under the conditions in Theorem 2, for the regularized ERM polynomial kernel learners with the considered two loss functions, the teaching dimension of θ is TD(θ ) = G(θ , d, p) if θ = 0 (TD(0) = 0) , and the teaching dimension of the RKHS H is TD(H) = G (d, p). Remark 1. G (d, p) d+p 1 p . This can be seen by letting m = d+p 1 p , then the polynomial system (6) has at least one solution yij = xij. By the Hilbert s Nullstellensatz (Hartshorne, 1977), G(θ , d, p, m) = {1}, and this holds for all θ H. Remark 2. G(θ , d, p) can be strictly smaller than d+p 1 p for some θ . We can easily find out an example to satisfy it (cf. Appendix C.3 for more details of the example). Remark 3. For the polynomial kernel learner without regularization, Kumar et al. (2021) derive the lower bound of the TD, which is d+p 1 p . As mentioned before, the TD for regularized learner meets G (d, p) d+p 1 p , and is strictly smaller than d+p 1 p for some θ . Besides, the lower bound of ERM learner without regularization needs the assumptions on θ , which are stated in Appendix D. This indicates that regularization not only reduces the sample complexity of teaching, but also relaxes the conditions on exact teaching for polynomial kernel. 4.3. Gaussian Kernel The Gaussian kernel with parameter σ > 0 is defined as k(x1, x2) = exp( ||x1 x2||2 2σ2 ). The TD of Gaussian kernel learners is infinity, and ϵ-TD is suitable for it. 
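Before stating the approximate-teaching result for the Gaussian kernel, we note that Remark 2 of Lemma 8 admits a quick computational check on the example worked out in Appendix C.3: for d = 2, p = 3, and θ* = Σ⁴ᵢ₌₁ k(xi, ·) with x1 = (1, 2), x2 = (3, 4), x3 = (5, 6), x4 = (7, 8), the polynomial system (6) with m = 2 has a Gröbner basis different from {1}, so θ* is expressible with 2 < 4 components. The sketch below is a verification with sympy; the degree lexicographic order used in the text corresponds to sympy's 'grlex'.

```python
import sympy as sp

# Computational check of Remark 2 of Lemma 8 on the Appendix C.3 example:
# a Groebner basis equal to {1} means the unit ideal, i.e., system (6) has no
# (complex) solution by Hilbert's Nullstellensatz; any other basis means a
# 2-component representation of theta* exists.

xs = [(1, 2), (3, 4), (5, 6), (7, 8)]           # the four components of theta*
p = 3

# Right-hand sides C_1, ..., C_4: the monomial sums induced by theta*.
targets = [sum(a ** (p - j) * b ** j for a, b in xs) for j in range(p + 1)]
print("C =", targets)                            # [496, 580, 680, 800]

y11, y12, y21, y22 = sp.symbols("y11 y12 y21 y22")
system = [
    y11 ** 3 + y21 ** 3 - targets[0],
    y11 ** 2 * y12 + y21 ** 2 * y22 - targets[1],
    y11 * y12 ** 2 + y21 * y22 ** 2 - targets[2],
    y12 ** 3 + y22 ** 3 - targets[3],
]

G = sp.groebner(system, y11, y12, y21, y22, order="grlex")
is_unit_ideal = len(G.exprs) == 1 and G.exprs[0] == 1
print("Groebner basis is {1}:", is_unit_ideal)   # False
```

Replacing the system by its m = 1 analogue returns the basis {1}, so G(θ*, 2, 3) = 2 < 4 for this target, as computed in Appendix C.3.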
Corollary 3 shows the result of ϵ-TD for the regularized ERM Gaussian kernel learners, and we leave its proof in Appendix C.4. The Teaching Dimension of Regularized Kernel Learners Corollary 3 (ϵ-TD under Gaussian Kernel). Under the conditions in Theorem 2 and 5, for the regularized ERM Gaussian kernel learners with the considered two loss functions, the ϵ-approximate teaching dimension of the RKHS H is ϵ-TD(H) O(1/ϵ2). Remark. For the ERM Gaussian kernel learners without regularization, Kumar et al. (2021) derive that the ϵ-TD is upper bounded by d O(log2(1/ϵ)). It can be seen that the ϵ-TD is exponentially reduced with regularization. Besides, the assumptions they made is not necessary in the regularization scenario (cf. Appendix D for more discussions). 4.4. Exponential and Laplacian Kernels In addition to linear, polynomial and Gaussian kernels, our generic STARKE framework can also be applied to other types of kernels. We take exponential and Laplacian kernels as examples, and analyze TD and ϵ-TD of them. Notably, Kumar et al. (2021) do not involve those types of kernels. As shown in Appendix B.3, the RKHSs of both exponential and Laplacian kernels cannot be expressed by Hm. Thus, the TD of exponential and Laplacian kernels is infinite. On the other hand, approximate teaching from Hm to H enables us to obtain the ϵ-TD of the two kernels as Corollary 4. Its proof can be found in Appendix C.4. Corollary 4 (ϵ-TD under Exponential and Laplacian Kernels). Under the conditions in Theorem 2 and 5, for the regularized ERM exponential and Laplacian kernel learners with the considered two loss functions, the ϵ-approximate teaching dimension of the RKHS H is ϵ-TD(H) O(1/ϵ2). 5. Experiments In this section, we perform the empirical study to verify the theoretical results. The code is available at https://github. com/liuxhym/STARKE.git. For exact teaching, we provide numerical results of linear and polynomial kernel learners respectively. For the regularized linear learners, with the target hypothesis in R3, the teaching sets have only one element for both square loss and hinge loss. However, teaching the unregularized learners needs three elements for square loss and five elements for hinge loss. For the regularized polynomial learner, given a certain target hypothesis, the teaching sets have two elements for both square loss and hinge loss. While the unregularized learners need four elements for square loss and cannot be taught for hinge loss since the violation of assumption (cf. Appendix F for more details). We next show the empirical results for approximate teaching with Gaussian kernel. Since STARKE can be applied to any type of kernels, we also conduct experiments on exponential and Laplacian kernels (cf. Appendix G for more details due to page limitation). For Gaussian kernel, we set σ = 0.9 and adopt square loss for regression while hinge loss for classification. For regression, we choose two synthetic datasets: the make-regression (MR) dataset from sklearn as well as the Sin dataset, and two real-world datasets: MPG from UCI (Blake et al., 1998) and Eunite. The regularization function Ω(x2) = x2 is applied. For classification, we choose two synthetic datasets: the two-moon (Moon) dataset as well as the two-circles (Circle) dataset from sklearn, and two UCI binary classification datasets: Adult and Sonar. The classification threshold is set as zero in experiments. The regularization function Ω(x2) = 1 200x2 is applied. 
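For reference, the following sketch reproduces the basic experimental setup on the synthetic datasets: σ = 0.9 translates to gamma = 1/(2σ²) for sklearn's RBF kernel, the square-loss learner is a Gaussian kernel ridge regressor, and the hinge-loss learner is a Gaussian kernel SVM. Mapping Ω(x²) = x² to KernelRidge's alpha = 1 and Ω(x²) = x²/200 to SVC's C = 100 is an assumption of this sketch (SVC additionally fits an unregularized intercept), and the sample sizes and noise levels are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression, make_moons, make_circles
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVC

sigma = 0.9
gamma = 1.0 / (2.0 * sigma ** 2)          # sklearn RBF: exp(-gamma ||x - x'||^2)

# Regression: square loss, Gaussian kernel ridge regression.
X_sin = np.linspace(-3, 3, 200).reshape(-1, 1)               # Sin dataset
y_sin = np.sin(X_sin[:, 0])
X_mr, y_mr = make_regression(n_samples=200, n_features=5,
                             noise=5.0, random_state=0)      # MR dataset

krr = KernelRidge(kernel="rbf", gamma=gamma, alpha=1.0).fit(X_sin, y_sin)
print("Sin  R^2:", krr.score(X_sin, y_sin))
print("MR   R^2:", KernelRidge(kernel="rbf", gamma=gamma,
                               alpha=1.0).fit(X_mr, y_mr).score(X_mr, y_mr))

# Classification: hinge loss, Gaussian kernel SVM, threshold at zero.
X_moon, y_moon = make_moons(n_samples=200, noise=0.1, random_state=0)
X_circ, y_circ = make_circles(n_samples=200, noise=0.05,
                              factor=0.5, random_state=0)

svm = SVC(kernel="rbf", gamma=gamma, C=100.0).fit(X_moon, y_moon)
print("Moon   accuracy:", svm.score(X_moon, y_moon))
print("Circle accuracy:", SVC(kernel="rbf", gamma=gamma,
                              C=100.0).fit(X_circ, y_circ).score(X_circ, y_circ))
```

These two fitted models play the role of the target hypothesis θ* on the synthetic datasets, as described in the comparison below.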
According to the theoretical result, the performance of the teaching is measured by excess risk. However, this measurement varies dramatically among different datasets. In order to avoid the influence of datasets on the excess risk, we introduce the excess risk ratio Λ as the measurement, which is the value of the current excess risk Λ divided by a reference excess risk. For each dataset, the reference risk is the average of Λ calculated by θ with 5% random samples for 100 times except for Adult. Considering the larger sample size of Adult compared with the others, if we set under 5% random samples as before, the teaching set will be very large even if the excess risk ratio is 100%, resulting in the relationship between teaching set size and excess risk ratio being expressed inappropriately. Thus, for Adult, the reference risk is calculated under 1% random samples for 100 times and then averaged. We use Nystr om method (Williams & Seeger, 2000) to approximate θ m Hm, which is an approximation of θ H. For approximate teaching, θ m is treated as the target hypothesis, and the approximate teaching set is constructed via Lemma 3 for square loss and Lemma 4 for hinge loss based on θ m. Visualization of the Teaching Set. For intuitive understanding, we visualize the teaching set for the regularized ERM learners with Gaussian kernel on the synthetic datasets. The cardinality of teaching sets is determined by whether it is enough to obtain a low risk learned hypothesis, and we choose the smallest possible one. Figure 3 shows the results for regression on Sin dataset, while Figure 4 shows the results for classification on Moon and Circle datasets. The left sub-figure of Figure 3 shows the data points in the dataset and the target hypothesis. The teaching set and the learned hypothesis is shown in the right sub-figure. The top subfigures in Figure 4 are the results on the Circle dataset, and the bottom sub-figures are the results on the Moon dataset. The interface between the blue and red areas is the decision boundary of teacher in the sub-figure (a), and learner in the sub-figure (b). The positive and negative points in dataset are marked by red and blue dots respectively in the sub-figure (a). The constructed teaching set (TS) is shown by dark-red stars in the sub-figure (b). The results show that with much less data points than that of the dataset, the The Teaching Dimension of Regularized Kernel Learners Training Data Data Points in TS Target Hypothesis Learned Hypothesis from TS Figure 3. Approximate teaching on the Sin dataset. The left subfigure: the target hypothesis θ is marked as the red solid line and the data points is marked as the black dots. The right sub-figure: the learned hypothesis is marked as the dashed red line and the teaching set is marked as the dark-red stars. 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 (b) Learned hypothesis Figure 4. Approximate teaching on the Moon and Circle datasets. The binary dataset is marked by red and blue dots. The interface between the blue and red regions is the decision boundary of the learned hypothesis. (a) The target hypothesis θ . (b) The learned hypothesis with the teaching set being marked as the dark-red stars. learner can obtain nearly the same hypothesis as the target hypothesis generated by the dataset. On Approximate Ability of Hm. In the approximate teaching setting, we treat Hm as an approximation to H. From theoretical perspective, the approximation error of Hm decreases as O((C(k) + 1)/ m), implying Hm is a good approximation of H. 
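As a concrete reference for how the m-component hypothesis θ*_m ∈ Hm is obtained before the teaching set is built, the following sketch implements a Nyström-style restriction of Gaussian kernel ridge regression to Hm: the hypothesis is constrained to the span of m landmark points and the coefficients solve the corresponding regularized least-squares problem. The random landmark selection, µ, and the toy data stand in for the paper's pipeline, which additionally repeats the sampling and keeps the best θ*_m.

```python
import numpy as np

# Sketch of fitting theta*_m in H_m: restrict the hypothesis to the span of m
# landmark points, f(x) = sum_j alpha_j k(l_j, x), and minimize
# sum_i (f(x_i) - y_i)^2 + mu * alpha^T K_mm alpha (mu = 1 matches the
# regression regularizer Omega(x^2) = x^2 used above).

def gaussian_kernel(A, B, sigma=0.9):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def nystrom_hypothesis(X, y, m, mu=1.0, sigma=0.9, seed=0):
    """Solve (K_nm^T K_nm + mu K_mm) alpha = K_nm^T y over m random landmarks."""
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=m, replace=False)]
    K_nm = gaussian_kernel(X, landmarks, sigma)           # n x m
    K_mm = gaussian_kernel(landmarks, landmarks, sigma)   # m x m
    alpha = np.linalg.lstsq(K_nm.T @ K_nm + mu * K_mm,
                            K_nm.T @ y, rcond=None)[0]
    return landmarks, alpha

def predict(landmarks, alpha, X_new, sigma=0.9):
    return gaussian_kernel(X_new, landmarks, sigma) @ alpha

# Toy data standing in for the Sin dataset.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

for m in (1, 2, 4, 8):
    L, a = nystrom_hypothesis(X, y, m)
    rmse = np.sqrt(np.mean((predict(L, a, X) - y) ** 2))
    print(f"m = {m:2d} components -> training RMSE {rmse:.4f}")
```

With the landmarks and α in hand, θ*_m = Σ_j α_j k(l_j, ·) has exactly the form required by Lemma 3 and Lemma 4, from which the approximate teaching set is then constructed.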
We study the approximate ability of Hm empirically. To show the approximate ability of Hm, we perform Nystr om method (Williams & Seeger, 2000) with m components to fit the kernel SVM and kernel ridge regression. The whole process is repeated for 100 times and the excess risk of Hm is represented by the best learned hypothesis to match the machine teaching setting. The result is shown in Table 2. The value inside the table is the smallest m such that Hm achieves an excess risk ratio no more than that in the first line of the table. For the Sin dataset, the excess risk of H8 can even achieve zero. All the Table 2. The relation between excess risk ratio Λ and m of Hm. Dataset Λ = 100% 80% 60% 40% 20% 0% Sin 1 1 1 1 1 8 MR 1 1 2 5 7 >60 MPG 2 3 3 5 10 >60 Eunite 3 4 5 6 7 >60 Circle 1 2 3 3 4 >60 Moon 1 3 3 4 5 >60 Adult 9 139 396 >400 >400 >400 Sonar 4 28 62 103 149 > 200 regression datasets achieve less than 20% excess risk ratio within 10 samples. For classification, Hm achieves small excess risk in synthetic datasets with small m. Adult and Sonar are more difficult and need higher m to achieve small excess risk ratio. However, if we focus on the accuracy of classification rather than the excess risk, the hypothesis in Hm with small m also performs well, which is shown in Appendix H. Therefore, the approximate ability of Hm is good enough for approximate teaching. We can also observe that the number from left to right in the table increases quadratically as exposed by Corollary 3. Regularization vs. Without Regularization. Before teaching, the target hypothesis θ H should be obtained first. For regression, θ is obtained by performing the Gaussian kernel ridge regression on the whole dataset, while the Gaussian kernel SVM is performed for classification. For approximate teaching, the estimated θ , which belongs to Hm, is obtained by performing Nystr om method (Williams & Seeger, 2000) with m components on the dataset. In order to reduce the error caused by randomness, the Nystr om method is repeated for 15 times and we choose the one with the lowest excess risk. The comparison between the regularized learners and the unregularized ones is shown Table 3. The first line of Table 3 is as same as Table 2. The value/symbol inside the table shows the difference between the ϵ-TD of the regularized learners and that of the unregularized ones. means the unregularized learner cannot reach such excess risk ratio within 60 samples, and means both regularized and unregularized learners cannot reach such excess risk ratio within 60 samples. The result shows that the regularized learner surpasses the unregularized one in approximate teaching, because the differences are all positive. The scalability of teaching for regularized learner is also better than unregularized one, as teaching fails for unregularized learner on hard datasets or under small excess risk ratio, as indicated by the . This is because as ϵ decreases, the ϵ-TD of unregularized learner increases exponentially while regularized learner increases quadratically. It matches our theoretical findings. By comparing the difference of ϵ-TD on Sin, MR, MPG, The Teaching Dimension of Regularized Kernel Learners Table 3. The difference between ϵ-TD of the regularized learners and that of the unregularized learners under the excess risk ratio Λ. means that only unregularized learner cannot reach such ratio within 60 samples, and means that both learners cannot reach such ratio within 60 samples. 
Dataset Λ = 100% 80% 60% 40% 20% 0% Sin MR 4 5 4 9 8 MPG 23 25 27 29 Eunite Circle Moon 25 26 Adult Sonar Eunite, Circle, Moon, Adult and Sonar under a unified criterion, e.g., 40% excess risk ratio, it reveals that Sin, Eunite, Circle, Moon and Sonar are hard for teaching without regularization, while MR and MPG are easy (MR is easier than MPG). Adult is hard for both teaching with and without regularization. The excess risk ratio Λ enables us to explicitly observe the advantages of regularization on different datasets in a unified way. 6. Discussion and Conclusion This paper gives an affirmative answer to the essential question of whether regularization can help reduce the teaching complexity in machine teaching. We propose a unified theoretical framework STARKE that is able to analyze any type of kernels. With the help of STARKE, we intensively analyze the popular regularized ERM (non-linear) kernel learners, e.g., kernel SVM and kernel ridge regression. Our theoretical findings reveal that, when equipped with regularization, the TD or ϵ-TD of them is substantially reduced compared with that of the unregularized ones. The results obtained may be beneficial for the researchers to have a deeper understanding of teaching the complex concepts. We would like to point out that Kumar et al. (2021) inspire this work. They analyze the perceptron loss instead of the square loss and the hinge loss, whereas we do not include the perceptron loss in our framework. On the one hand, the square loss and the hinge loss function may be more popular. On the other hand, the power of perceptron loss may be weaker and the optimal solution of the optimization problem is dominated by the regularization function, so that it cannot learn some hypotheses. The possible future work could be generalizing the proposed framework to other loss functions (e.g., exponential loss and logistic loss), and imposing some realistic conditions on the teaching set under the real-world scenarios instead of arbitrary selection. Acknowledgements The authors would like to thank Wei Wang for the valuable suggestions, and the anonymous reviewers for their helpful comments. This work is supported by the National Natural Science Foundation of China (No. 62106076), the Natural Science Foundation of Shanghai (No. 21ZR1420300), the Chenguang Program sponsored by Shanghai Education Development Foundation and Shanghai Municipal Education Commission (No. 21CGA32), the Shanghai Key Laboratory of Multidimensional Information Processing at East China Normal University (No. MIP202101), the Fundamental Research Funds for the Central Universities, and the National Key Laboratory for Novel Software Technology at Nanjing University (No. KFKT2021B14). Anthony, M., Brightwell, G. R., and Shawe-Taylor, J. On specifying boolean functions by labelled examples. Discrete Applied Mathematics, 61(1):1 25, 1995. Blake, C. L., Keogh, E., and Merz, C. J. UCI Repository of machine learning databases. [http://www.ics.uci.edu/ mlearn/MLRepository.html], 1998. Bousquet, O. and Elisseeff, A. Stability and generalization. Journal of Machine Learning Research, 2:499 526, 2002. Chen, Y., Singla, A., Aodha, O. M., Perona, P., and Yue, Y. Understanding the role of adaptivity in machine teaching: The case of version space learners. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Neur IPS 18), pp. 1483 1493, Montreal, Canada, 2018. Drewnik, M. and Pasternak-Winiarski, Z. 
SVM kernel configuration and optimization for the handwritten digit recognition. In Proceedings of 16th Conference on Computer Information Systems and Industrial Management (CISIM 17), pp. 87 98, Bialystok, Poland, 2017. Drineas, P. and Mahoney, M. W. On the nystr om method for approximating a gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(12): 2153 2175, 2005. Fadel, S., Ghoniemy, S., Abdallah, M., Sorra, H. A., Ashour, A., and Ansary, A. Investigating the effect of different kernel functions on the performance of svm for recognizing arabic characters. International Journal of Advanced Computer Science and Applications, 7(1):446 450, 2016. Feragen, A., Lauze, F., and Hauberg, S. Geodesic exponential kernels: When curvature and linearity conflict. In IEEE Conference on Computer Vision and Pattern The Teaching Dimension of Regularized Kernel Learners Recognition (CVPR 15), pp. 3032 3042, Boston, MA, 2015. Gao, Z., Ries, C., Simon, H. U., and Zilles, S. Preferencebased teaching. Journal of Machine Learning Research, 2017. Goldman, S. A. and Kearns, M. J. On the complexity of teaching. In Proceedings of the 4th Workshop on Computational Learning Theory (COLT 91), pp. 303 314, Santa Cruz, CA, 1991. Goldman, S. A. and Mathias, H. D. Teaching a smarter learner. Journal of Computer and System Sciences, 52(2): 255 267, 1996. Hartshorne, R. Algebraic Geometry. Springer, 1977. Kamalaruban, P., Devidze, R., Cevher, V., and Singla, A. Interactive teaching algorithms for inverse reinforcement learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI 19), pp. 2692 2700, Macao, China, 2019. Kirkpatrick, D. G., Simon, H. U., and Zilles, S. Optimal collusion-free teaching. In Proceedings of 30th International Conference on Algorithmic Learning Theory (ALT 19), pp. 506 528, Chicago, IL, 2019. Koltchinskii, V. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer, 2011. Koltchinskii, V. and Yuan, M. Sparsity in multiple kernel learning. The Annals of Statistics, 38(6):3660 3695, 2010. Kumar, A., Zhang, H., Singla, A., and Chen, Y. The teaching dimension of kernel perceptron. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS 21), pp. 2071 2079, Virtual Event, 2021. Liu, J., Zhu, X., and Ohannessian, H. The teaching dimension of linear learners. In Proceedings of the 33rd International Conference on Machine Learning (ICML 16), pp. 117 126, New York City, NY, 2016. Mansouri, F., Chen, Y., Vartanian, A., Zhu, X. J., and Singla, A. Preference-based batch and sequential teaching: Towards a unified view of models. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (Neur IPS 19), pp. 9195 9205, Vancouver, Canada, 2019. Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2012. Patil, K. R., Zhu, X., Kopec, L., and Love, B. C. Optimal teaching for limited-capacity human learners. In Proceedings of the 27th International Conference on Neural Information Processing Systems (Neur IPS 14), pp. 2465 2473, Montreal, Canada, 2014. Sch olkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and beyond. MIT Press, 2002. Shafto, P., Goodman, N. D., and Griffiths, T. L. A rational account of pedagogical reasoning: Teaching by, and learning from, examples. Cognitive Psychology, 71:55 89, 2014. Shinohara, A. 
Teachability in computational learning. New Generation Computing, 8(4):337 347, 1991. Steinwart, I. and Christmann, A. Support Vector Machines. Springer, 2008. Vapnik, V. Statistical Learning Theory. Wiley, 1998. Williams, C. K. I. and Seeger, M. W. Using the nystr om method to speed up kernel machines. In Proceedings of the 13rd Conference on Neural Information Processing Systems (Neur IPS 20), pp. 682 688, Denver, CO, 2000. Yang, T., Li, Y., Mahdavi, M., Jin, R., and Zhou, Z. Nystr om method vs random fourier features: A theoretical and empirical comparison. In Proceedings of the 26th Conference on Neural Information Processing Systems (ICML 12), pp. 485 493, Lake Tahoe, NV, 2012. Zhang, X., Zhu, X., and Wright, S. J. Training set debugging using trusted items. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI 18), pp. 4482 4489, New Orleans, LA, 2018. Zhu, X., Singla, A., Zilles, S., and Rafferty, A. N. An overview of machine teaching. Co RR, abs/1801.05927, 2018. Zilles, S., Lange, S., Holte, R., and Zinkevich, M. Models of cooperative teaching and learning. Journal of Machine Learning Research, 12:349 384, 2011. The Teaching Dimension of Regularized Kernel Learners A. Proofs in Section 3.1 A.1. Proof of Lemma 2 Proof of Lemma 2. Because Hm Hm 1 = , there exists θ Hm such that θ cannot be expressed with less than m terms. However, by Lemma 1, the number of terms contained by the learned hypothesis is no more than the cardinality of the training set. Therefore, the cardinality of the teaching set is no less than m examples, and we conclude the proof. A.2. Proof of Lemma 3 We first introduce Lemma 9, which is necessary to prove Lemma 3 in the paper. Lemma 9. For {xi}m i=1, denote θ = Pm i=1 αik(xi, ), if θ Hm Hm 1, then the Gram matrix K with Kij = k(xi, ), k(xj, ) is invertible. Proof. Let F = span{k(x1, ), , k(xm, )}, then F has finite dimension and we can find a standard orthogonal basis of F via Schmidt orthogonalization. We denote the basis as e1, e2, e3, , em, then (k(x1, ), k(x2, ), , k(xm, )) = (e1, e2, , em)P, where P is invertible. We have that αik(xi, ), αjk(xj, ) = r=1 pkiprj ek, er k=1 pkipkj ek, ek k=1 pkipkj. Thus, K = P T P and P is invertible K is invertible. Proof of Lemma 3. Provided X = X , the estimated ˆθ can be expressed as Pm i=1 αik(x i , ) according to Lemma 1 in the main paper. Therefore, we can rewrite the optimization problem as A(D) = arg min α i=1 ℓ(βi, yi) + Ω(αT Kα), (7) where βi is the i-th element of Kα. Furthermore, the KKT (Karush Kuhn Tucker) condition states that 2KαΩ (αT Kα) i=1 (K)i βiℓ(βi, yi), (8) where βiℓ(βi, yi) = ℓ(βi,yi) βi and Ki is the i-th column of K. By substituting ℓ(x, y) = (x y)2 into Formula (10), we have that 2KαΩ (αT Kα) 2K(Kα Y ). (9) By Lemma 9, we have that 2αΩ (αT Kα) 2K(Kα Y ). (10) By substituting ℓ(x, y) = (x y)2 into Formula (10), we have that 2αΩ (αT Kα) 2(Kα Y ). (11) Obviously, α = α satisfies Equation (11). By the assumption of Ωmade in the paper and the strong convexity of square loss function as well as the Hilbert norm, we have that the optimization problem (2) in the paper is strongly convex. Therefore, α is the only solution to this problem, which proves the lemma. A.3. Proof of Lemma 4 We denote the sub-gradient u max(1 u, 0) = I(u), where 1, if u < 1; [0, 1], if u = 1; 0, otherwise. Proof of Lemma 4. The KKT (Karush Kuhn Tucker) condition of Equation (2) is 2θ Ω ( θ 2) i=1 k(xi, )yi I(yif(xi, θ )) . 
(12) Remember that Ω = 2Ω ( θ 2), the teaching dimension is the minimum of n such that yi Ω I(yig(xi, θ ))k(xi, ) . (13) Remember θ can be expressed as i=1 αik(x i , ) , we can construct the teaching set using {xi}m i=1. If αiΩ g(xi, θ ) 1, let yi = αiΩ and xi = x i , we have yi Ω I(yig(xi, θ ))k(xi, ) = αik(x i , ) . The Teaching Dimension of Regularized Kernel Learners If αiΩ g(xi, θ ) > 1, then the construction of teaching set can be expressed as the following optimization problem i=1 yi = αiΩ yig(xi, θ ) 1 i [ni] . The solution to the above optimization problem is ni = max(1, αig(x i , θ )Ω ) and the corresponding teaching items can be xij = x i , yij = αiΩ /n, for j [ni]. According to the assumption of Ωmade in the main paper and the convexity of hinge loss and strong convexity of Hilbert norm, the uniqueness of θ is guaranteed. Then we have that xij = x i , yij = αiΩ /ni, i [m], j [ni] , is the teaching set. A.4. Proof of Theorem 2 Proof of Theorem 2. Note that α i g(x i , θ ) = α i q k(x i , x i ) * k(x i , ) p k(x i , x i ) , θ + k(xi, xi) θ k(x i , x i ) θ H , then Ω (x) mini I 1/(2α i p k(x i , x i ) θ H) implies Ω (x) 1/(2α i g(x i , θ )). Combine the conclusion of Lemma 5, we have that the teaching dimension is m. B. Proofs in Section 3.2 B.1. Proof of Theorem 3 Proof of Theorem 3. Suppose θ Hd H+r Hd H, where r Z+, then we have that i=1 αik(xi, ), where at least d H + 1 elements in {k(xi, )}d H+r i=1 are linear independent, or θ can be expressed with no more than d H components, which implies θ Hd H, a contradiction. However, the linear independence of d H + 1 elements is contradict to dim(H) = d H. Then Hd H+r Hd H = . Because the conclusion holds for all r Z+, and H is the closure of Hpre, we have that Hd H = H. Thus, Hm = H with m d H. B.2. Proof of Theorem 4 Proof of Theorem 4. We prove the result by contradiction. Assume that D = {(x1, y1), , (xn, yn)} is a finite teaching set for a target hypothesis θ . Let i=1 αik(xi, ): αi R Denote H n the orthogonal subspace of Hn, we have Hn H n = H, where denotes direct sum. By the definition of Hn, we obtain dim(Hn) n. Since dim(H) = > dim(Hn), we have H n = . Then, there exists d H n , such that d = 0 and d Hn. For all λ R, since k(xi, ) Hn, we have that i=1 ℓ( θ , k(xi, ) , yi) i=1 ℓ( θ + λd, k(xi, ) , yi). By the definition of norm and inner product, we have that ||θ + λd||2 H = ||θ ||2 H + λ2||d||2 H + 2λ θ , d . The problem can be divided into two cases. Case 1: θ , d = 0. In this case, let λ = 2 θ ,d ||d||2 H = 0, then ||θ ||2 H = ||θ +λd||2 H, but θ = θ +λd. Combined this result with Equation (14), we have that θ + λd is also a solution to Equation (2) in the paper, thus a contradiction. Case 2: θ , d = 0. Because of the uniqueness of θ , we have that Ω(||θ ||2 H) < Ω(||θ + λd||2 H) holds for all λ R. This is equivalent to Ω(x) < Ω(x ), for x = ||θ ||2 H and all x > x. Since θ can be arbitrary chosen in H, x ranges from 0 to infinity. Therefore, Ωis a monotonically increasing for x 0. By Lemma 1 in the paper, the hypothesis has the following form i=1 αik(xi, ). Let r(θ) be the infimum number of data points such that θ H can be expressed as the linear combination of the functions mapped from the data points by the canonical feature map. The assumption that the teaching dimension is finite implies that supθ H r(θ) = N < . Let i=1 αik(xi, ): αi R, xi X then HN = H, hence a contradiction to the non-existence of finite m0. The Teaching Dimension of Regularized Kernel Learners B.3. 
Proof of Remark of Theorem 4 We at first recall the expression of Gaussian, exponential and Laplacian kernels, Gaussian kernel: e ||x1 x2||2 exponential kernel: e ||x1 x2|| Laplacian kernel: e ||x1 x2|| Proposition 1. The RKHS induced by Gaussian kernel, exponential kernel and Laplacian kernel cannot be represented by Hm with m . Proof of Proposition 1. Let H, H and H be the RKHS induced by Gaussian kernel, exponential kernel and Laplacian kernel, respectively. It suffices for us to show for the case d = 1, i.e., dim(X) = 1. Without loss of generality, we assume the parameters for Gaussian kernel, exponential kernel and Laplacian kernel are The following function belongs to H by its definition, n= e x n 2, n Z , then f(p) = f(q) > 0, if p and q Z. It implies that limx f(x) = 0. If H = Hm for some finite m, we have i=1 αie x xi 2. i=1 αie x xi 2 = 0, a contradiction. For exponential kernel and Laplacian kernel, consider the following function, n= e x n , n Z . It is easy to see that g(x) H and g(x) H . And then the proof is same as the case for Gaussian kernel. B.4. Proof of Lemma 6 In order to calculate Λ(θ m), we introduce θN H, such that θN is the optimal solution to L(θ) = min θ H 1 N i=1 ℓ( k(xi, ), θ , yi)+Ω( θ 2 H), (15) where {(xi, yi)N i=1} is i.i.d. and sampled from the joint distribution P(x, y). We next bound the generalization performance of θm via the generalization performance of θN. Following the work of (Yang et al., 2012), we now exploit the local Rademacher complexity. We define ψ(δ) = ( 2 N PN i=1 min(δ2, λi))1/2. Let eε denote the solution to δ2 = ψ(δ), where the existence and uniqueness of eε are determined by the sub-root property of ψ(δ). Denote γ = max (eε, p 6 ln N/N). According to (Koltchinskii, 2011), we have that γ2 O(N 1/2), and when the eigenvalues of kernel function follow the p-power law, it can be improved to γ2 O(N p/(p+1)). The following Lemma 10 bounds Λ(θ m) by Λ(θ N). Lemma 10. For M1 = Ω 1(Ω(0) + 1) e2N/4γ2 with γ 1, sup θ H M1 2 θ H Ω ( θ 2 H) M2, and λm+1 O(N/ m), if Ω(x2) is µ-strongly convex, then with a probability 1 2N 3, Λ (θ m) Λ (θ N) + O γ + C(k) + 1 m + e N . Note that in our setting N can be arbitrarily chosen, we can let N . In this way, we have Λ(θ N) 0, γ 0 and e N 0, and complete the proof of Lemma 6. We denote 1 N PN i=1 ℓ( k(xi, ), θ , yi)+Ω( θ 2 H) as e L(θ). Let K be the Gram matrix of {xi}N i=1 under kernel k, i.e., K = [k(xi, xj)]N N. Then we sample m construct a low rank matrix with the teaching set {x i , y i }m i=1 as ˆKm = Kb K m KT b , where Kb = [k(x i , xj)]N m, Km = [k(x i , x j)]m m, and K m is the pseudo inverse of K m. We first introduce Lemma 11, 12, 13, and 14, which are necessary to prove Lemma 10. We now give a recap of the concept of Fenchel conjugate. Let f(x) be a function of x, its Fenchel conjugate f (α) is defined as f (α) = sup z (αz f(z)) . Suppose f is convex and differentiable over Rn. Any maximizer z of αz f(z) satisfies α = zf(z ). It implies α falls in the range of the mapping zf(z): R R. Lemma 11. If f is closed and strong convex with parameter µ, then f has a Lipschitz continuous gradient with parameter 1 Proof. By implication of strong convexity, we have sx sy µ x y sx f(x), sy f(y) , which implies sx sy µ f (sx) f (sy) . Hence, f has a Lipschitz continuous gradient with constant value 1 The Teaching Dimension of Regularized Kernel Learners Lemma 12. 
Under the assumption of Ωmade in the paper, we have that 0 e L(θ m) e L(θ N) C1C2µ where C1 is the upper bound of xℓ(x, y), and C2 is the upper bound of |y| with y Y. 2 stands for the spectral norm of a matrix. θ N is the solution that minimizes e L(θ). Proof. Let ℓ (α) and Ω denote the Fenchel conjugate of ℓ(z, y) and Ω(z2) in terms of z, respectively, i.e., ℓ (α) = sup z (αz ℓ(z, y)), Ω (α) = sup z (αz Ω(z2)). By the property of conjugate function, the derivative of Ω at 0 is arg minx Ω(x2) = 0. According to Lemma 11, (Ω ) has a Lipschitz constant 1 µ, thus (Ω ) (x) (Ω ) (0) + 1 Using the conjugates, we are able to rewrite e L(θ N) as an equivalent form e L(θ N) = max α 1 1 N 2 (α y)T K(α y) where α = (α1, , αN)T and denotes the element-wise dot product. Similarly, we can rewrite e L(θ m) as e L(θ m) = max α 1 1 N 2 (α y)T ˆKm(α y) Then, we have that e L(θ m) = max α 1 1 N 2 (α y)T K(α y) 1 N 2 (α y)T K(α y) 1 N 2 (α y)T ˆKm(α y) e L(θ m) max α 1 N 2 (α y)T K(α y) 1 N 2 (α y)T K(α y) 1 N 2 (α y)T ˆKm(α y) (a) = e L(θ N) + max α (Ω ) ( x0) 2N 2 x0 (α y)T (K ˆKm)(α y) (b) e L(θ N) + max α 1 2µN 2 (α y)T (K ˆKm)(α y) e L(θ N) + max α C2 2µN 2 α 2 K ˆKm 2 (c) e L(θ N) + C1C2 where (a) uses Lagrange s mean value theorem, (b) is derived from Equation (16), and (c) follows |αi| C1 duo to xℓ(x, y) C1. Lemma 13. For M1 = Ω 1(Ω(0) + 1) e2N/(4γ2) with γ 1 and sup θ 2 H M1 2 θ H Ω ( θ 2 H) M2 , with a probability 1 2N 3, we have that Λ (θ m) Λ (θ N) + C3 K b Km 2 N + e N where C3 is a constant. Proof. Define the loss function ℓ( θ, k(x, , y) = ℓ( θ, k(x, , y) + Ω( θ 2 H). To simplify our notations, we define PN and P as PN( ℓ θ) = 1 i=1 ℓ( θ, k(x, , yi) = e L(θ), P( ℓ θ) = E[ ℓ( θ, k(x, , y)] = F(θ). The Teaching Dimension of Regularized Kernel Learners Using those notations, we have that Λ (θ m) Λ (θ N) = P ℓ θ m ℓ θ N = PN ℓ θ m ℓ θ N + (P PN) ℓ θ m ℓ θ N PN ℓ θ m ℓ θ N + max (θ,θ ) G (P PN) ℓ θ ℓ θ . In the above formula, G is defined as G = (θ, θ ) : θ θ ℓ2 r, θ θ H R , where θ ℓ2 = p E[θ2] θ H, and r as well as R are given by r = R = θ m θ N H 2M 1/2 1 . Using Lemma 9 from (Koltchinskii & Yuan, 2010), we have that, with probability 1 2N 3, for any γr e N, γ2R e N, sup (θ,θ ) G (P PN) ℓ θ ℓ θ C4L rγ + Rγ2 + e N C5L rγ + e N , where C4, C5 are constants, and L is the upper bound of the gradient of ℓfor functions in G and is given by L sup θ 2 H M 1/2 1 2||θ||HΩ ( θ 2 H) + C1 M2 + C1. Since max( θ N H , θ m H) M1 and M1 e2N/(4γ2), we have the condition γr e N satisfied. Therefore, with a probability 1 2N 3, we have that P ℓ θ m ℓ θ N PN ℓ θ m ℓ θ N + C5(M2 + C1) rγ + e N C K b Km 2 2µN + C5(M2 + C1) rγ + e N , where we use the result in Lemma 12. By the fact of Λ(θ) = P( ℓ θ ℓ θ ), we have that Λ (θ m) Λ (θ N) + C K b Km 2 2µN + C5(M2 + C1) 2M 1/2 1 γ + e N . We complete the proof by absorbing the constant terms into a constant C3. Now we only need to bound K b Km 2. To this end, we apply the conclusion from (Drineas & Mahoney, 2005), and we state it in the following lemma. Lemma 14. Suppose G is an N N symmetric positive semi-definite (SPSD) matrix, let b Gm be constructed by sampling m columns of G with a given probability. In addition, let Gm be the best rank-m approximation to G. With 8 log(1/δ), we have that, with probability at least 1 δ, G b Gm 2 G Gm 2 + 2η m where 2 is the spectral norm of matrix. With the property that Gram matrix K is symmetric and positive semi-definite, we can apply Lemma 14 to K and b Km. 
With the property that the Gram matrix K is symmetric and positive semi-definite, we can apply Lemma 14 to K and K̂_m. Notably, in the machine teaching setting, K̂_m is constructed optimally, while in the setting of Lemma 14, Ĝ_m is constructed by sampling. Therefore, we can achieve this inequality if δ > 0. Then, we have that

‖K − K̂_m‖₂ ≤ ‖K − K_m‖₂ + (2/√m) Σ_{i=1}^N K²_ii   (17)

in our setting, where K_m is the best rank-m approximation to K. Using the property of the spectral norm of a matrix, we have ‖K − K_m‖₂ = λ_{m+1}. By substituting Inequality (17) into the result of Lemma 13, and letting 2 sup_x k²(x, x) = C(k), we complete the proof of Lemma 10. □

C. Proofs in Section 4

C.1. Proof of Lemma 7

Proof of Lemma 7. For dim(X) = d, the dimension of the induced RKHS H is dim(H) = d. By Theorem 3 in the paper, θ* = Σ_{i=1}^d α_i k(x_i, ·). For the linear kernel, we then have that

θ* = Σ_{i=1}^d α_i ⟨x_i, ·⟩ (a)= ⟨Σ_{i=1}^d α_i x_i, ·⟩ (b)= ⟨x', ·⟩ = k(x', ·),

where (a) holds because of the linearity of the inner product, and (b) holds because X is a linear space and there exists x' ∈ X such that x' = Σ_{i=1}^d α_i x_i. Therefore, every hypothesis in H can be expressed by a single term, which ends the proof. □

C.2. Proof of Lemma 8

Proof of Lemma 8. According to the definition of G*(d, p), we have that G(θ*, d, p, G*(d, p)) ≠ {1}. By Hilbert's Nullstellensatz (Hartshorne, 1977), the solution to the polynomial system exists when the Gröbner basis is not {1}. □

C.3. An example for Remark 2 of Lemma 8

Let θ* = Σ_{i=1}^4 k(x_i, ·), where x₁ = (1, 2), x₂ = (3, 4), x₃ = (5, 6), x₄ = (7, 8), i.e., d = 2. Let p = 3; then the binomial coefficient (d+p−1 choose p) = 4. Instantiating (6) with this example and letting m = 2, (6) becomes

y₁₁³ + y₂₁³ = 496,
y₁₁² y₁₂ + y₂₁² y₂₂ = 580,
y₁₁ y₁₂² + y₂₁ y₂₂² = 680,
y₁₂³ + y₂₂³ = 800.

Then the Gröbner basis is

G(θ*, 2, 3, 2) = {y₁₂y₂₁³ − y₁₁y₂₂y₂₁², y₁₂y₂₁²y₂₂ − y₁₁y₂₁y₂₂², y₁₂y₂₁y₂₂² − y₁₁y₂₂³, y₁₁³ + y₂₁³, y₁₂y₁₁² + y₂₂y₂₁², y₁₁y₁₂² + y₂₁y₂₂², y₁₂³ + y₂₂³} ≠ {1}.

Similarly, we have G(θ*, 2, 3, 1) = {1}. According to the definition of G(θ*, 2, 3), we have G(θ*, 2, 3) = 2 < 4.
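The computation above can be reproduced with a computer algebra system. The following minimal sketch (our illustration; the variable names and the choice of the grevlex monomial order are ours) instantiates the system for m = 2 and the analogous one-point system for m = 1, and checks whether the Gröbner basis collapses to {1}.

```python
from sympy import symbols, groebner

y11, y12, y21, y22 = symbols('y11 y12 y21 y22')

# System (6) instantiated with theta* = sum_i k(x_i, .), x_i in {(1,2), (3,4), (5,6), (7,8)}, p = 3, m = 2.
system_m2 = [
    y11**3 + y21**3 - 496,
    y11**2 * y12 + y21**2 * y22 - 580,
    y11 * y12**2 + y21 * y22**2 - 680,
    y12**3 + y22**3 - 800,
]
gb2 = groebner(system_m2, y11, y12, y21, y22, order='grevlex')
print("m = 2 basis is {1}:", list(gb2) == [1])   # expected False: a two-point representation exists

# The analogous system with a single teaching point (m = 1).
z1, z2 = symbols('z1 z2')
system_m1 = [z1**3 - 496, z1**2 * z2 - 580, z1 * z2**2 - 680, z2**3 - 800]
gb1 = groebner(system_m1, z1, z2, order='grevlex')
print("m = 1 basis is {1}:", list(gb1) == [1])   # expected True: one point cannot represent theta*
```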
C.4. Proof of Corollaries 3 and 4

Proof of Corollaries 3 and 4. According to Theorem 5 in the paper, the ϵ-approximate teaching dimension is O((C(k) + 1)²/ϵ²). For the Gaussian kernel, the exponential kernel and the Laplacian kernel, C(k) = 2 sup_x k²(x, x) = 2. Then, the ϵ-TD can be simplified to O(1/ϵ²). □

D. For the Unregularized Learners

We derive the teaching set for unregularized learners with the square loss and the hinge loss in this section. We also state the assumptions needed by the perceptron loss.

D.1. Square Loss

We start by providing a teaching set for the linear kernel.

Proposition 2. For the target hypothesis θ*, {(x_i, (θ*)ᵀx_i)}_{i=1}^d is a teaching set, where {x_i}_{i=1}^d are linearly independent.

Proof. Since L(θ) = Σ_{i=1}^d (θᵀx_i − y_i)², it is easy to see that L(θ*) = 0. Therefore, θ* is in the solution set. If L(θ* + δ) = 0, then

Σ_{i=1}^d ((θ* + δ)ᵀx_i − y_i)² = Σ_{i=1}^d (δᵀx_i)² = 0.

Since {x_i}_{i=1}^d are linearly independent and dim(δ) = d, we have δ = 0. This guarantees that θ* is the unique solution. □

According to Kumar et al. (2021), the RKHS of the polynomial kernel is isomorphic to that of a higher-dimensional linear kernel. Therefore, if the following assumption is satisfied, the teaching set for the polynomial kernel can be derived in the same way as for the linear kernel.

Assumption 1. For the target hypothesis θ* ∈ H, we assume that there exist r = dim(H) linearly independent polynomials of the form {Φ(z_i)}_{i=1}^r, where ∀i, z_i ∈ X and Φ is the feature map of the polynomial kernel.

For the Gaussian kernel, Kumar et al. (2021) use the truncated kernel

k̃(x₁, x₂) = e^{−‖x₁‖²/(2σ²)} e^{−‖x₂‖²/(2σ²)} Σ_{s=0}^{S} ⟨x₁, x₂⟩^s / (σ^{2s} s!)

as an approximation of the Gaussian kernel. The approximation error is small when the following assumption holds.

Assumption 2 (Assumption 3.4.2 in Kumar et al. (2021)). For the target hypothesis θ* = Σ_{i=1}^l α_i k(x_i, ·), the learner optimizes to a solution θ̂ with bounded coefficients. Alternatively, the sums Σ_{i=1}^r |α_i| and |β| + Σ_{j=2}^r |γ_j| are bounded, where θ̂ has the form θ̂ = βk(x₁, ·) + Σ_{j=2}^r γ_j k(x_j, ·) and r = dim(H̃).

We denote by H̃ the RKHS induced by k̃, and by Pθ* the projection of θ* onto H̃. Similar to Assumption 1, we need an assumption for the approximate kernel.

Assumption 3. For the target hypothesis θ* ∈ H, we assume that there exist r = dim(H̃) linearly independent elements in H̃ of the form {Φ̃(z_i)}_{i=1}^r, where ∀i, z_i ∈ X and Φ̃ is the feature map of the approximate Gaussian kernel.

With this assumption, we can provide a teaching set for the approximate kernel in the same way as for the linear kernel, and this teaching set is also the approximate teaching set for the Gaussian kernel.

D.2. Hinge Loss

Similar to the square loss, we start by providing a teaching set for the linear kernel.

Proposition 3. For the target hypothesis θ*, the following is a teaching set:

x_i = v_i + θ*/‖θ*‖²₂, y_i = 1, ∀i ∈ {1, …, d−1};
x_i = −v_{i−d+1} + θ*/‖θ*‖²₂, y_i = 1, ∀i ∈ {d, …, 2d−2};
x_{2d−1} = −θ*/‖θ*‖²₂, y_{2d−1} = 1,

where {v_i}_{i=1}^d is an orthogonal basis of R^d with v_d = θ*.

Proof. Since L(θ) = Σ_{i=1}^{2d−1} max(1 − y_i θᵀx_i, 0), we can calculate L(θ*) = 2. Suppose θ₀ = θ* + δ is the optimal solution of min_θ L(θ), where δ can be represented as Σ_{i=1}^{d−1} t_i v_i + t_d θ*. Note that max(1 − y_{2d−1}θ₀ᵀx_{2d−1}, 0) = 2 + t_d, so the optimality of θ₀ implies t_d ≤ 0. However,

Σ_{i=1}^{2d−2} max(1 − y_i θ₀ᵀx_i, 0) ≥ −(2d−2)t_d,

so L(θ₀) ≥ 2 + t_d − (2d−2)t_d ≥ 2 = L(θ*) when t_d ≤ 0, which forces t_d = 0. Based on the fact that t_d = 0, we have

Σ_{i=1}^{2d−2} max(1 − y_i θ₀ᵀx_i, 0) ≥ Σ_{i=1}^{d−1} |t_i| ‖v_i‖²₂,

and the optimality of θ₀ then implies t_i = 0. Then we get δ = 0, and θ* is the unique solution of min_θ L(θ). □
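As a small sanity check of Proposition 3 (a minimal sketch added here, assuming the teaching set construction as stated above, with an arbitrary target θ* ∈ R³ and an arbitrary orthogonal basis of our choosing), minimizing the total hinge loss over the 2d − 1 examples can be cast as a linear program, whose solution should recover θ* with objective value 2.

```python
import numpy as np
from scipy.optimize import linprog

d = 3
theta = np.array([1.0, 2.0, 2.0])                      # target hypothesis (arbitrary choice)
V = np.array([[2.0, -1.0, 0.0],                        # orthogonal basis of the subspace
              [2.0, 4.0, -5.0]])                       # orthogonal to theta
base = theta / np.dot(theta, theta)                    # theta / ||theta||_2^2

# Teaching set from Proposition 3: 2d - 1 = 5 examples, all labeled +1.
X = np.vstack([V + base, -V + base, -base])
y = np.ones(2 * d - 1)

# min sum_i xi_i   s.t.   xi_i >= 0,   xi_i >= 1 - y_i * <theta, x_i>
n = len(y)
c = np.concatenate([np.zeros(d), np.ones(n)])
A_ub = np.hstack([-(y[:, None] * X), -np.eye(n)])      # encodes 1 - y_i x_i^T theta - xi_i <= 0
b_ub = -np.ones(n)
bounds = [(None, None)] * d + [(0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")

print("recovered theta:", np.round(res.x[:d], 6))      # expected: close to [1, 2, 2]
print("minimal hinge loss:", round(res.fun, 6))        # expected: 2.0
```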
The derivation for the polynomial kernel and the Gaussian kernel is similar to that for the square loss, except for some modifications of Assumptions 1 and 3. The modification comes from the appearance of θ* in the teaching set. We state the modified assumptions as follows.

Assumption 4. For the target hypothesis θ* ∈ H, let Φ be the feature map of the polynomial kernel. We assume that:
1. There exist (r−1) linearly independent polynomials on the orthogonal subspace of θ* in H of the form {Φ(z_i)}_{i=1}^{r−1}, where ∀i, z_i ∈ X and r = dim(H).
2. There exists a polynomial such that θ* = Φ(z), where z ∈ X.

Assumption 5. For the target hypothesis θ* ∈ H, let Φ̃ be the feature map of the approximate Gaussian kernel. We assume that:
1. There exist (r−1) linearly independent elements on the orthogonal subspace of θ* in H̃ of the form {Φ̃(z_i)}_{i=1}^{r−1}, where ∀i, z_i ∈ X and r = dim(H̃).
2. There exists z ∈ X such that θ* = Φ̃(z).

D.3. Perceptron Loss

Because we only focus on the square loss and the hinge loss in this paper, the teaching set for the perceptron loss is omitted and can be found in Kumar et al. (2021). In this section, we provide the assumptions needed by the perceptron loss for the purpose of self-containedness. For the polynomial kernel, we need one assumption.

Assumption 6 (Assumption 3.2.1 in Kumar et al. (2021)). For the target hypothesis θ* ∈ H, let Φ be the feature map of the polynomial kernel. We assume that there exist (r−1) linearly independent polynomials on the orthogonal subspace of θ* in H of the form {Φ(z_i)}_{i=1}^{r−1}, where ∀i, z_i ∈ X and r = dim(H).

For the Gaussian kernel, Assumption 2 and the following assumption are needed.

Assumption 7 (Assumption 3.4.1 in Kumar et al. (2021)). For the target hypothesis θ* ∈ H, let Φ̃ be the feature map of the approximate Gaussian kernel. We assume that there exist (r−1) linearly independent elements on the orthogonal subspace of Pθ* in H̃ of the form {Φ̃(z_i)}_{i=1}^{r−1}, where ∀i, z_i ∈ X and r = dim(H̃).

E. Experiment Details

This section introduces the selected datasets.

Sin: Generated by the sine operator with Gaussian noise (µ = 0, σ = 0.15) on [0, 5]. It has 150 samples and the dimension of the input space is 1.

Make-regression: An optionally-sparse random linear combination of random features with noise. It has 250 samples and the dimension of the input space is 2.

MPG: A dataset taken from the StatLib library maintained at Carnegie Mellon University. It concerns city-cycle fuel consumption, to be predicted in terms of both multi-valued discrete and continuous attributes. It has 392 samples and the dimension of the input space is 7.

Eunite: The EUNITE 2001 competition dataset (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#eunite2001). Given the load and some other information in previous years, the task is to predict the daily maximum load in the next January. It has 336 samples and the dimension of the input space is 16.

Two-moon: Two interleaving half circles. It has 250 samples and the dimension of the input space is 2.

Two-circle: A large circle containing a smaller circle in 2D. It has 250 samples and the dimension of the input space is 2.

Adult: A high-dimensional binary classification problem with both continuous and discrete features. It has 1605 samples and the dimension of the input space is 123.

Sonar: The task is to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The data consists of 111 patterns from the metal cylinder and 97 patterns from rocks; each pattern consists of 60 numbers representing the energy within a particular frequency band. It has 208 samples and the dimension of the input space is 60.

F. Numerical Results for Exact Teaching

To illustrate exact teaching, we provide numerical examples for teaching the linear kernel learner and the polynomial kernel learner.

Linear Kernel. For the target hypothesis θ* = (−1, 1, 0) ∈ R³, we consider the regularization function Ω(x²) = ½x². According to Lemmas 3 and 4, the teaching set for the square loss is {((−1, 1, 0), 3/2)}, and for the hinge loss it is {((−1, 1, 0), 1)}. However, for the unregularized learner, the teaching set can be {((1, 0, 0), −1), ((0, 1, 0), 1), ((0, 0, 1), 0)} for the square loss. For the hinge loss, we first obtain an orthogonal basis {(1/2, 1/2, 0), (0, 0, 1)} of the subspace orthogonal to θ*, and, following Proposition 3, the teaching set is

{((0, 1, 0), 1), ((−1/2, 1/2, 1), 1), ((−1, 0, 0), 1), ((−1/2, 1/2, −1), 1), ((1/2, −1/2, 0), 1)}.

Polynomial Kernel. For the target hypothesis θ* = Σ_{i=1}^4 k(x_i, ·), where x₁ = (1, 2), x₂ = (3, 4), x₃ = (5, 6), x₄ = (7, 8), let the degree of the polynomial kernel be 3, i.e., p = 3. We consider the regularization function Ω(x²) = 10⁻⁷x². By solving the polynomial system (6), we have θ* = k((2.22, 3.48), ·) + k((7.86, 9.12), ·). Then we obtain the teaching set {((2.22, 3.48), 123946.37), ((7.86, 9.12), 3164724.12)} for the square loss. Note that for a classification problem, tθ* is equivalent to θ* if t is a positive constant. We can therefore construct the teaching set for the hinge loss as {((2.22, 3.48), 1), ((7.86, 9.12), 1)}.
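The two representation points above can also be recovered numerically. The following minimal sketch (our illustration; the starting point for the root finder is an arbitrary choice) solves the instantiated system (6) from Appendix C.3 with a standard solver.

```python
import numpy as np
from scipy.optimize import fsolve

# System (6) for theta* = sum_i k(x_i, .), x_i in {(1,2), (3,4), (5,6), (7,8)}, p = 3, m = 2.
def moment_equations(v):
    a, b, c, d = v                       # the two candidate teaching points (a, b) and (c, d)
    return [a**3 + c**3 - 496,
            a**2 * b + c**2 * d - 580,
            a * b**2 + c * d**2 - 680,
            b**3 + d**3 - 800]

root = fsolve(moment_equations, x0=[2.0, 3.0, 8.0, 9.0])
print("points:", np.round(root, 2))                        # expected: roughly [2.22 3.48 7.86 9.12]
print("residual:", np.round(moment_equations(root), 6))    # should be close to zero
```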
For the unregularized learner, we first consider the teaching set for the square loss. Note that (1, 0), (0, 1), (1, 2), (2, 1) ∈ X, and the functions induced by mapping these four elements into the RKHS with the canonical feature map are linearly independent. We can therefore construct the teaching set as {((1, 0), 496), ((0, 1), 800), ((1, 2), 18536), ((2, 1), 15808)}. For the hinge loss, the teaching set cannot be constructed because Assumption 4 is not satisfied.

G. Experiments on Laplacian and Exponential Kernels

In this section, we perform experiments on the Laplacian kernel and the exponential kernel. The parameter σ of the Laplacian kernel is set to d, and the σ of the exponential kernel is set to 0.9, where d is the dimension of the input space. Figure 5 and Figure 6 visualize the teaching sets for the Laplacian kernel learner and the exponential kernel learner with the hinge loss function, respectively. Similar to the Gaussian kernel, the top sub-figures are the results on the Circle dataset and the bottom sub-figures are the results on the Moon dataset. The interface between the blue and red areas is the decision boundary of the teacher in sub-figure (a) and of the learner in sub-figure (b). The positive and negative points of the dataset are marked by red and blue dots, respectively, in sub-figure (a). The constructed teaching set (TS) is shown by dark-red stars in sub-figure (b). The results show that, with far fewer data points than the original dataset, both the Laplacian kernel learner and the exponential kernel learner obtain nearly the same hypothesis as the target hypothesis generated from the dataset.

Figure 5. Approximate teaching with the Laplacian kernel learner on the Moon and Circle datasets. The binary dataset is marked by red and blue dots. The interface between the blue and red areas is the decision boundary of the learned hypothesis. (a) The target hypothesis θ*. (b) The learned hypothesis, with the teaching set marked as dark-red stars.

Figure 6. Approximate teaching with the exponential kernel learner on the Moon and Circle datasets. The binary dataset is marked by red and blue dots. The interface between the blue and red areas is the decision boundary of the learned hypothesis. (a) The target hypothesis θ*. (b) The learned hypothesis, with the teaching set marked as dark-red stars.

Table 4 and Table 5 show the relationship between the excess risk ratio Λ and the ϵ-TD for the Laplacian kernel learner and the exponential kernel learner, respectively. Each row of a table represents a dataset and each column indicates a ratio of the reference risk. The numbers inside the tables are the TD of our method.

Table 4. ϵ-TD of the regularized Laplacian kernel learner under the excess risk ratio Λ.
Dataset  Λ = 100%  80%  60%  40%  20%  0%
Sin        2   2   2   2   4   >60
MR         2   2   3   5   10  >60
MPG        1   1   1   1   2   >60
Eunite     1   1   1   1   1   >60
Circle     2   3   4   5   6   17
Moon       2   2   3   5   6   18
Adult      1   3   8   15  37  >60
Sonar      1   3   5   11  36  >60

Table 5. ϵ-TD of the regularized exponential kernel learner under the excess risk ratio Λ.
Dataset  Λ = 100%  80%  60%  40%  20%  0%
Sin        2   2   2   2   3   >60
MR         2   2   3   5   9   >60
MPG        1   1   1   2   4   >60
Eunite     1   1   2   3   5   >60
Circle     2   2   3   4   5   26
Moon       2   2   2   3   5   16
Adult      1   1   1   36  >60 >60
Sonar      1   4   8   16  42  >60
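For reference, a minimal sketch of how the two kernels in this section can be evaluated, assuming the standard forms k(x, x') = exp(−‖x − x'‖₁/σ) for the Laplacian kernel and k(x, x') = exp(−‖x − x'‖₂/σ) for the exponential kernel (the exact parameterization is an assumption of this sketch), with the parameter settings stated above.

```python
import numpy as np

def laplacian_kernel(X1, X2, sigma):
    # k(x, x') = exp(-||x - x'||_1 / sigma)  (assumed parameterization)
    dists = np.abs(X1[:, None, :] - X2[None, :, :]).sum(-1)
    return np.exp(-dists / sigma)

def exponential_kernel(X1, X2, sigma):
    # k(x, x') = exp(-||x - x'||_2 / sigma)  (assumed parameterization)
    dists = np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1))
    return np.exp(-dists / sigma)

# Settings used in this section: sigma = d for the Laplacian kernel, sigma = 0.9 for the exponential kernel.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                      # d = 2, as in the Moon and Circle datasets
print(laplacian_kernel(X, X, sigma=X.shape[1]).round(3))
print(exponential_kernel(X, X, sigma=0.9).round(3))
```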
H. Extra Experiments on Gaussian Kernel with Hinge Loss

To better illustrate the ability of H_m when it comes to the hinge loss, we also provide empirical results on the relationship between the error rate (ER) and the m of H_m, where ER is defined as 1 − accuracy. The results are shown in Table 6. Each line in the table stands for a binary classification dataset and each column indicates an error rate. The value inside the table is the smallest m such that H_m achieves an ER no larger than that in the first line of the table.

Table 6. The relationship between ER and the m of H_m.
Dataset  ER = 0.5  0.4  0.3  0.2  0.1  0
Circle     1   1   2   3   3    4
Moon       1   1   1   1   3    7
Adult      1   1   1   8   >400 >400
Sonar      1   1   3   17  52   182