# preferencebased_teaching__32dc501c.pdf

Journal of Machine Learning Research 18 (2017) 1-32 Submitted 9/16; Revised 2/17; Published 4/17

Preference-based Teaching

Ziyuan Gao GAO257@CS.UREGINA.CA Department of Computer Science, University of Regina

Christoph Ries CHRISTOPH.RIES@RUB.DE Department of Mathematics, Ruhr-University Bochum

Hans U. Simon HANS.SIMON@RUB.DE Department of Mathematics, Ruhr-University Bochum

Sandra Zilles ZILLES@CS.UREGINA.CA Department of Computer Science, University of Regina

Editor: Manfred Warmuth

We introduce a new model of teaching named preference-based teaching and a corresponding complexity parameter the preference-based teaching dimension (PBTD) representing the worstcase number of examples needed to teach any concept in a given concept class. Although the PBTD coincides with the well-known recursive teaching dimension (RTD) on ﬁnite classes, it is radically different on inﬁnite ones: the RTD becomes inﬁnite already for trivial inﬁnite classes (such as half-intervals) whereas the PBTD evaluates to reasonably small values for a wide collection of inﬁnite classes including classes consisting of so-called closed sets w.r.t. a given closure operator, including various classes related to linear sets over N0 (whose RTD had been studied quite recently) and including the class of Euclidean half-spaces. On top of presenting these concrete results, we provide the reader with a theoretical framework (of a combinatorial ﬂavor) which helps to derive bounds on the PBTD. Keywords: teaching dimension, preference relation, recursive teaching dimension, learning halfspaces, linear sets

1. Introduction

The classical model of teaching (Shinohara and Miyano, 1991; Goldman and Kearns, 1995) formulates the following interaction protocol between a teacher and a student:

Both of them agree on a classiﬁcation-rule system , formally given by a concept class L.

In order to teach a speciﬁc concept L L, the teacher presents to the student a teaching set, i.e., a set T of labeled examples so that L is the only concept in L that is consistent with T.

The student determines L as the unique concept in L that is consistent with T.

Goldman and Mathias (1996) pointed out that this model of teaching is not powerful enough, since the teacher is required to make any consistent learner successful. A challenge is to model powerful teacher/student interactions without enabling unfair coding tricks . Intuitively, the term coding trick refers to any form of undesirable collusion between teacher and learner, which would reduce the learning process to a mere decoding of a code the teacher sent to the learner. There is no

c 2017 Ziyuan Gao, Christoph Ries, Hans U. Simon, and Sandra Zilles.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/16-460.html.

generally accepted deﬁnition of what constitutes a coding trick, in part because teaching an exact learner could always be considered coding to some extent: the teacher presents a set of examples which the learner decodes into a concept. In this paper, we adopt the notion of valid teacher/learner pair introduced by Goldman and Mathias (1996). They consider their model to be intuitively free of coding tricks while it provably allows for a much broader class of interaction protocols than the original teaching model. In particular, teaching may thus become more efﬁcient in terms of the number of examples in the teaching sets. Further deﬁnitions of how to avoid unfair coding tricks have been suggested (Zilles et al., 2011), but they were less stringent than the one proposed by Goldman and Mathias. The latter simply requests that, if the learner hypothesizes concept L upon seeing a sample set S of labeled examples, then the learner will still hypothesize L when presented with any sample set S S , where S contains only examples labeled consistently with L. A coding trick would then be any form of exchange between the teacher and the learner that does not satisfy this deﬁnition of validity. The model of recursive teaching (Zilles et al., 2011; Mazadi et al., 2014), which is free of coding tricks according to the Goldman-Mathias deﬁnition, has recently gained attention because its complexity parameter, the recursive teaching dimension (RTD), has shown relations to the VCdimension and to sample compression (Chen et al., 2016; Doliwa et al., 2014; Moran et al., 2015; Simon and Zilles, 2015), when focusing on ﬁnite concept classes. Below though we will give examples of rather simple inﬁnite concept classes with inﬁnite RTD, suggesting that the RTD is inadequate for addressing the complexity of teaching inﬁnite classes. In this paper, we introduce a model called preference-based teaching, in which the teacher and the student do not only agree on a classiﬁcation-rule system L but also on a preference relation (a strict partial order) imposed on L. If the labeled examples presented by the teacher allow for several consistent explanations (= consistent concepts) in L, the student will choose a concept L L that she prefers most. This gives more ﬂexibility to the teacher than the classical model: the set of labeled examples need not distinguish a target concept L from any other concept in L but only from those concepts L over which L is not preferred.1 At the same time, preference-based teaching yields valid teacher/learner pairs according to Goldman and Mathias s deﬁnition. We will show that the new model, despite avoiding coding tricks, is quite powerful. Moreover, as we will see in the course of the paper, it often allows for a very natural design of teaching sets. Assume teacher and student choose a preference relation that minimizes the worst-case number M of examples required for teaching any concept in the class L. This number M is then called the preference-based teaching dimension (PBTD) of L. In particular, we will show the following: (i) Recursive teaching is a special case of preference-based teaching where the preference relation satisﬁes a so-called ﬁnite-depth condition . It is precisely this additional condition that renders recursive teaching useless for many natural and apparently simple inﬁnite concept classes. Preference-based teaching successfully addresses these shortcomings of recursive teaching, see Section 3. For ﬁnite classes, PBTD and RTD are equal. (ii) A wide collection of geometric and algebraic concept classes with inﬁnite RTD can be taught very efﬁciently, i.e., with low PBTD. To establish such results, we show in Section 4 that spanning sets can be used as preference-based teaching sets with positive examples only a result that is very simple to obtain but quite useful.

1. Such a preference relation can be thought of as a kind of bias in learning: the student is biased towards concepts that are preferred over others, and the teacher, knowing the student s bias, selects teaching sets accordingly.

PREFERENCE-BASED TEACHING

(iii) In the preference-based model, linear sets over N0 with origin 0 and at most k generators can be taught with k positive examples, while recursive teaching with a bounded number of positive examples was previously shown to be impossible and it is unknown whether recursive teaching with a bounded number of positive and negative examples is possible for k 4. We also give some almost matching upper and lower bounds on the PBTD for other classes of linear sets, see Section 6. (iv) The PBTD of halfspaces in Rd is upper-bounded by 6, independent of the dimensionality d (see Section 7), while its RTD is inﬁnite. (v) We give full characterizations of concept classes that can be taught with only one example (or with only one example, which is positive) in the preference-based model (see Section 8). Based on our results and the naturalness of the teaching sets and preference relations used in their proofs, we claim that preference-based teaching is far more suitable to the study of inﬁnite concept classes than recursive teaching. Parts of this paper were published in a previous conference version (Gao et al., 2016).

2. Basic Deﬁnitions and Facts

N0 denotes the set of all non-negative integers and N denotes the set of all positive integers. A concept class L is a family of subsets over a universe X, i.e., L 2X where 2X denotes the powerset of X. The elements of L are called concepts. A labeled example is an element of X { , +}. We slightly deviate from this notation in Section 7, where our treatment of halfspaces makes it more convenient to use { 1, 1} instead of { , +}, and in Section 8, where we perform Boolean operations on the labels and therefore use {0, 1} instead of { , +}. Elements of X are called examples. Suppose that T is a set of labeled examples. Let T + = {x X : (x, +) T} and T = {x X : (x, ) T}. A set L X is consistent with T if it includes all examples in T that are labeled + and excludes all examples in T that are labeled , i.e, if T + L and T L = . A set of labeled examples that is consistent with L but not with L is said to distinguish L from L . The classical model of teaching is then deﬁned as follows.

Deﬁnition 1 (Shinohara and Miyano (1991); Goldman and Kearns (1995)) A teaching set for a concept L L w.r.t. L is a set T of labeled examples such that L is the only concept in L that is consistent with T, i.e., T distinguishes L from any other concept in L. Deﬁne TD(L, L) = inf{|T| : T is a teaching set for L w.r.t. L}. i.e., TD(L, L) is the smallest possible size of a teaching set for L w.r.t. L. If L has no ﬁnite teaching set w.r.t. L, then TD(L, L) = . The number TD(L) = sup L L TD(L, L) N0 { } is called the teaching dimension of L.

For technical reasons, we will occasionally deal with the number TDmin(L) = inf L L TD(L, L), i.e., the number of examples needed to teach the concept from L that is easiest to teach. In this paper, we will examine a teaching model in which the teacher and the student do not only agree on a classiﬁcation-rule system L but also on a preference relation, denoted as , imposed on L. We assume that is a strict partial order on L, i.e., is asymmetric and transitive. The partial order that makes every pair L = L L incomparable is denoted by . For every L L, let

L L = {L L : L L}

be the set of concepts over which L is strictly preferred. Note that L L = for every L L.

As already noted above, a teaching set T of L w.r.t. L distinguishes L from any other concept in L. If a preference relation comes into play, then T will be exempted from the obligation to distinguish L from the concepts in L L because L is strictly preferred over them anyway.

Deﬁnition 2 A teaching set for L X w.r.t. (L, ) is deﬁned as a teaching set for L w.r.t. L\L L. Furthermore deﬁne

PBTD(L, L, ) = inf{|T| : T is a teaching set for L w.r.t. (L, )} N0 { } .

The number PBTD(L, ) = sup L L PBTD(L, L, ) N0 { } is called the teaching dimension of (L, ).

Deﬁnition 2 implies that

PBTD(L, L, ) = TD(L, L \ L L) . (1)

Let L 7 T(L) be a mapping that assigns a teaching set for L w.r.t. (L, ) to every L L. It is obvious from Deﬁnition 2 that T must be injective, i.e., T(L) = T(L ) if L and L are distinct concepts from L. The classical model of teaching is obtained from the model described in Deﬁnition 2 when we plug in the empty preference relation for . In particular, PBTD(L, ) = TD(L). We are interested in ﬁnding the partial order that is optimal for the purpose of teaching and we aim at determining the corresponding teaching dimension. This motivates the following notion:

Deﬁnition 3 The preference-based teaching dimension of L is given by

PBTD(L) = inf{PBTD(L, ) : is a strict partial order on L} .

A relation R on L is said to be an extension of a relation R if R R . The order-extension principle states that any partial order has a linear extension (Jech, 1973). The following result (whose second assertion follows from the ﬁrst one in combination with the order-extension principle) is pretty obvious:

Lemma 4 1. Suppose that extends . If T is a teaching set for L w.r.t. (L, ), then T is a teaching set for L w.r.t. (L, ). Moreover PBTD(L, ) PBTD(L, ).

2. PBTD(L) = inf{PBTD(L, ) : is a strict linear order on L}.

Recall that Goldman and Mathias (1996) suggested to avoid coding tricks by requesting that any superset S of a teaching set for a concept L remains a teaching set, if S is consistent with L. This property is obviously satisﬁed in preference-based teaching. A preference-based teaching set needs to distinguish a concept L from all concepts in L that are preferred over L. Adding more labeled examples from L to such a teaching set will still result in a set distinguishing L from all concepts in L that are preferred over L.

PREFERENCE-BASED TEACHING

Preference-based teaching with positive examples only. Suppose that L contains two concepts L, L such that L L . In the classical teaching model, any teaching set for L w.r.t. L has to employ a negative example in order to distinguish L from L . Symmetrically, any teaching set for L w.r.t. L has to employ a positive example. Thus classical teaching cannot be performed with one type of examples only unless L is an antichain w.r.t. inclusion. As for preference-based teaching, the restriction to one type of examples is much less severe, as our results below will show. A teaching set T for L L w.r.t. (L, ) is said to be positive if it does not make use of negatively labeled examples, i.e., if T = . In the sequel, we will occasionally identify a positive teaching set T with T +. A positive teaching set for L w.r.t. (L, ) can clearly not distinguish L from a proper superset of L in L. Thus, the following holds:

Lemma 5 Suppose that L 7 T +(L) maps each L L to a positive teaching set for L w.r.t. (L, ). Then must be an extension of (so that proper subsets of a set L are strictly preferred over L) and, for every L L, the set T +(L) must distinguish L from every proper subset of L in L.

PBTD+(L, L, ) = inf{|T| : T is a positive teaching set for L w.r.t. (L, )} . (2)

The number PBTD+(L, ) = sup L L PBTD+(L, L, ) (possibly ) is called the positive teaching dimension of (L, ). The positive preference-based teaching dimension of L is then given by

PBTD+(L) = inf{PBTD+(L, ) : is a strict partial order on L} . (3)

Monotonicity. A complexity measure K that assigns a number K(L) N0 to a concept class L is said to be monotonic if L L implies that K(L ) K(L). It is well known (and trivial to see) that TD is monotonic. It is fairly obvious that PBTD is monotonic, too:

Lemma 6 PBTD and PBTD+ are monotonic.

As an application of monotonicity, we show the following result:

Lemma 7 For every ﬁnite subclass L of L, we have PBTD(L) PBTD(L ) TDmin(L ).

Proof The ﬁrst inequality holds because PBTD is monotonic. The second inequality follows from the fact that a ﬁnite partially ordered set must contain a minimal element. Thus, for any ﬁxed choice of , L must contain a concept L such that L L = . Hence,

PBTD(L , ) PBTD(L , L , ) (1) = TD(L , L \ L L ) = TD(L , L ) TDmin(L ) .

Since this holds for any choice of , we get PBTD(L ) TDmin(L ), as desired.

3. Preference-based versus Recursive Teaching

The preference-based teaching dimension is a relative of the recursive teaching dimension. In fact, both notions coincide on ﬁnite classes, as we will see shortly. We ﬁrst recall the deﬁnitions of the recursive teaching dimension and of some related notions (Zilles et al., 2011; Mazadi et al., 2014). A teaching sequence for L is a sequence of the form S = (Li, di)i 1 where L1, L2, L3, . . . form a partition of L into non-empty sub-classes and, for every i 1, we have that

di = sup L Li TD L, L \ i 1 j=1Lj . (4)

If, for every i 1, di is the supremum over all L Li of the smallest size of a positive teaching set for L w.r.t. j i Lj (and di = if some L Li does not have a positive teaching set w.r.t. j i Lj), then S is said to be a positive teaching sequence for L. The order of a teaching sequence or a positive teaching sequence S (possibly ) is deﬁned as ord(S) = supi 1 di. The recursive teaching dimension of L (possibly ) is deﬁned as the order of the teaching sequence of lowest order for L. More formally, RTD(L) = inf S ord(S) where S ranges over all teaching sequences for L. Similarly, RTD+(L) = inf S ord(S), where S ranges over all positive teaching sequences for L. Note that the following holds for every L L and for every teaching sequence S = (Li, di)i 1 for L such that ord(S) = RTD(L ):

RTD(L) RTD(L ) = ord(S) d1 = sup L L1 TD(L, L ) TDmin(L ) . (5)

Note an important difference between PBTD and RTD: while RTD(L) TDmin(L ) for all L L, in general the same holds for PBTD only when restricted to ﬁnite L , cf. Lemma 7. This difference will become evident in the proof of Lemma 10. The depth of L L w.r.t. a strict partial order imposed on L is deﬁned as the length of the longest chain in (L, ) that ends with the -maximal element L (resp. as if there is no bound on the length of these chains). The recursive teaching dimension is related to the preference-based teaching dimension as follows:

Lemma 8 RTD(L) = inf PBTD(L, ) and RTD+(L) = inf PBTD+(L, ) where ranges over all strict partial orders on L that satisfy the following ﬁnite-depth condition : every L L has a ﬁnite depth w.r.t. .

The following is an immediate consequence of Lemma 8 and the trivial observation that the ﬁnite-depth condition is always satisﬁed if L is ﬁnite:

Corollary 9 PBTD(L) RTD(L), with equality if L is ﬁnite.

While PBTD(L) and RTD(L) refer to the same ﬁnite number when L is ﬁnite, there are classes for which the RTD is inﬁnity and yet the PBTD is ﬁnite, as Lemma 10 will show.

Lemma 10 There exists an inﬁnite class L of VC-dimension 1 such that PBTD+(L ) = 1 and RTD(L ) = .

PREFERENCE-BASED TEACHING

Proof Let L be the family of closed half-intervals over [0, 1), i.e., L = {[0, a] : 0 a < 1}. We ﬁrst prove that PBTD+(L ) = 1. Consider the preference relation given by [0, b] [0, a] iff a < b. Then, for each 0 a < 1, we have

PBTD([0, a], L , ) (1) = TD([0, a], {[0, b] : 0 b a}) = 1

because the single example (a, +) sufﬁces for distinguishing [0, a] from any interval [0, b] with b < a. It was observed by Moran et al. (2015) already that RTD(L ) = because every teaching set for some [0, a] must contain an inﬁnite sequence of distinct reals that converges from above to a. Thus, using Equation (5) with L = L, we have RTD(L ) TDmin(L ) = .

Remark 11 For every k 1, there exists an inﬁnite class Lk such that PBTD+(Lk) = 1 and RTD(Lk) = k; see (Gao et al., 2017, Lemma 6).

4. Preference-based Teaching with Positive Examples Only

The main purpose of this section is to relate positive preference-based teaching to spanning sets and closure operators , which are well-studied concepts in the computational learning theory literature. Let L be a concept class over the universe X. We say that S X is a spanning set of L L w.r.t. L if S L and any set in L that contains S must contain L as well.2 In other words, L is the unique smallest concept in L that contains S. We say that S X is a weak spanning set of L L w.r.t. L if S L and S is not contained in any proper subset of L in L.3 We denote by I(L) (resp. I (L)) the smallest number k such that every concept L L has a spanning set (resp. a weak spanning set) w.r.t. L of size at most k. Note that S is a spanning set of L w.r.t. L iff S distinguishes L from all concepts in L except for supersets of L, i.e., iff S is a positive teaching set for L w.r.t. (L, ). Similarly, S is a weak spanning set of L w.r.t. L iff S distinguishes L from all its proper subsets in L (which is necessarily the case when S is a positive teaching set). These observations can be summarized as follows:

I (L) PBTD+(L) PBTD+(L, ) I(L) . (6)

The last two inequalities are straightforward. The inequality I (L) PBTD+(L) follows from Lemma 5, which implies that no concept L can have a preference-based teaching set T smaller than its smallest weak spanning set. Such a set T would be consistent with some proper subset of L, which is impossible by Lemma 5. Suppose L is intersection-closed. Then L L:S LL is the unique smallest concept in L containing S. If S L0 is a weak spanning set of L0 L, then L L:S LL = L0 because, on the one hand, L L:S LL L0 and, on the other hand, no proper subset of L0 in L contains S. Thus the distinction between spanning sets and weak spanning sets is blurred for intersection-closed classes:

2. This generalizes the classical deﬁnition of a spanning set (Helmbold et al., 1990), which is given w.r.t. intersectionclosed classes only. 3. Weak spanning sets have been used in the ﬁeld of recursion-theoretic inductive inference under the name tell-tale sets (Angluin, 1980).

Lemma 12 Suppose that L is intersection-closed. Then I (L) = PBTD+(L) = I(L).

Example 1 Let Rd denote the class of d-dimensional axis-parallel hyper-rectangles (= d-dimensional boxes). This class is intersection-closed and clearly I(Rd) = 2. Thus PBTD+(Rd) = 2.

A mapping cl : 2X 2X is said to be a closure operator on the universe X if the following conditions hold for all sets A, B X:

A B cl(A) cl(B) and A cl(A) = cl(cl(A)) .

The following notions refer to an arbitrary but ﬁxed closure operator. The set cl(A) is called the closure of A. A set C is said to be closed if cl(C) = C. It follows that precisely the sets cl(A) with A X are closed. With this notation, we observe the following lemma.

Lemma 13 Let C be the set of all closed subsets of X under some closure operator cl, and let L C. If L = cl(S), then S is a spanning set of L w.r.t. C.

Proof Suppose L C and S L . Then L = cl(S) cl(L ) = L .

For every closed set L L, let scl(L) denote the size (possibly ) of the smallest set S X such that cl(S) = L. With this notation, we get the following (trivial but useful) result:

Theorem 14 Given a closure operator, let C[m] be the class of all closed subsets C X with scl(C) m. Then PBTD+(C[m]) PBTD+(C[m], ) m. Moreover, this holds with equality provided that C[m] \ C[m 1] = .

Proof The inequality PBTD+(C[m], ) m follows directly from Equation (6) and Lemma 13. Pick a concept C0 C[m] such that scl(C0) = m. Then any subset S of C0 of size less than m spans only a proper subset of C0, i.e., cl(S) C0. Thus S does not distinguish C0 from cl(S). However, by Lemma 5, any preference-based learner must strictly prefer cl(S) over C0. It follows that there is no positive teaching set of size less than m for C0 w.r.t. C[m].

Many natural classes can be cast as classes of the form C[m] by choosing the universe and the closure operator appropriately; the following examples illustrate the usefulness of Theorem 14 in that regard.

Example 2 Let LINSETk = { G : (G N) (1 |G| k)}

where G = n P g G a(g)g : a(g) N0 o . In other words, LINSETk is the set of all non-empty linear subsets of N0 that are generated by at most k generators. Note that the mapping G 7 G is a closure operator over the universe N0. Since obviously LINSETk \ LINSETk 1 = , we obtain PBTD+(LINSETk) = k.

Example 3 Let X = R2 and let Ck be the class of convex polygons with at most k vertices. Deﬁning cl(S) to be the convex closure of S, we obtain C[k] = Ck and thus PBTD+(Ck) = k.

Example 4 Let X = Rn and let Ck be the class of polyhedral cones that can be generated by k (or less) vectors in Rn. If we take cl(S) to be the conic closure of S Rn, then C[k] = Ck and thus PBTD+(Ck) = k.

PREFERENCE-BASED TEACHING

5. A Convenient Technique for Proving Upper Bounds

In this section, we give an alternative deﬁnition of the preference-based teaching dimension using the notion of an admissible mapping . Given a concept class L over a universe X, let T be a mapping L 7 T(L) X { , +} that assigns a set T(L) of labeled examples to every set L L such that the labels in T(L) are consistent with L. The order of T, denoted as ord(T), is deﬁned as sup L L |T(L)| N { }. Deﬁne the mappings T + and T by setting T +(L) = {x : (x, +) T(L)} and T (L) = {x : (x, ) T(L)} for every L L. We say that T is positive if T (L) = for every L L. In the sequel, we will occasionally identify a positive mapping L 7 T(L) with the mapping L 7 T +(L). The symbol + as an upper index of T will always indicate that the underlying mapping T is positive. The following relation will help to clarify under which conditions the sets (T(L))L L are teaching sets w.r.t. a suitably chosen preference relation:

RT = {(L, L ) L L : (L = L ) (L is consistent with T(L ))} .

The transitive closure of RT is denoted as trcl(RT ) in the sequel. The following notion will play an important role in this paper:

Deﬁnition 15 A mapping L 7 T(L) with L ranging over all concepts in L is said to be admissible for L if the following holds:

1. For every L L, L is consistent with T(L).

2. The relation trcl(RT ) is asymmetric (which clearly implies that RT is asymmetric too).

If T is admissible, then trcl(RT ) is transitive and asymmetric, i.e., trcl(RT ) is a strict partial order on L. We will therefore use the notation T instead of trcl(RT ) whenever T is known to be admissible.

Lemma 16 Suppose that T + is a positive admissible mapping for L. Then the relation T + on L extends the relation on L. More precisely, the following holds for all L, L L:

L L (L, L ) RT + L T + L .

Proof If T + is admissible, then L is consistent with T +(L ). Thus T +(L ) L L so that L is consistent with T +(L ) too. Therefore (L, L ) RT +, i.e., L T + L .

The following result clariﬁes how admissible mappings are related to preference-based teaching:

Lemma 17 For each concept class L, the following holds:

PBTD(L) = inf T ord(T) and PBTD+(L) = inf T + ord(T +)

where T ranges over all mappings that are admissible for L and T + ranges over all positive mappings that are admissible for L.

Proof We restrict ourselves to the proof for PBTD(L) = inf T ord(T) because the equation PBTD+(L) = inf T + ord(T +) can be obtained in a similar fashion. We ﬁrst prove that PBTD(L) inf T ord(T). Let T be an admissible mapping for L. It sufﬁces to show that, for every L L, T(L) is a teaching set for L w.r.t. (L, T ). Suppose L L \ {L} is consistent with T(L). Then (L , L) RT and thus L T L. It follows that T prefers L over all concepts L L \ {L} that are consistent with T(L). Thus T is a teaching set for L w.r.t. (L, T ), as desired. We now prove that inf T ord(T) PBTD(L). Let be a strict partial order on L and let T be a mapping such that, for every L L, T(L) is a teaching set for L w.r.t. (L, ). It sufﬁces to show that T is admissible for L. Consider a pair (L , L) RT . The deﬁnition of RT implies that L = L and that L is consistent with T(L). Since T(L) is a teaching set w.r.t. (L, ), it follows that L L. Thus, is an extension of RT . Since is transitive, it is even an extension of trcl(RT ). Because is asymmetric, trcl(RT ) must be asymmetric, too. It follows that T is admissible.

6. Preference-based Teaching of Linear Sets

Some work in computational learning theory (Abe, 1989; Gao et al., 2015; Takada, 1992) is concerned with learning semi-linear sets, i.e., unions of linear subsets of Nk for some ﬁxed k 1, where each linear set consists of exactly those elements that can be written as the sum of some constant vector c and a linear combination of the elements of some ﬁxed set of generators, see Example 2. While semi-linear sets are of common interest in mathematics in general, they play a particularly important role in the theory of formal languages, due to Parikh s theorem, by which the so-called Parikh vectors of strings in a context-free language always form a semi-linear set (Parikh, 1966). A recent study (Gao et al., 2015) analyzed computational teaching of classes of linear subsets of N (where k = 1) and some variants thereof, as a substantially simpler yet still interesting special case of semi-linear sets. In this section, we extend that study to preference-based teaching. Within the scope of this section, all concept classes are formulated over the universe X = N0. Let G = {g1, . . . , gk} be a ﬁnite subset of N. We denote by G resp. by G + the following sets:

i=1 aigi : a1, . . . , ak N0

i=1 aigi : a1, . . . , ak N

We will determine (at least approximately) the preference-based teaching dimension of the following concept classes over N0:

LINSETk = { G : (G N) (1 |G| k)} .

CF-LINSETk = { G : (G N) (1 |G| k) (gcd(G) = 1)} .

NE-LINSETk = { G + : (G N) (1 |G| k)} .

NE-CF-LINSETk = { G + : (G N) (1 |G| k) (gcd(G) = 1)} .

A subset of N0 whose complement in N0 is ﬁnite is said to be co-ﬁnite. The letters CF in CF-LINSET mean co-ﬁnite . The concepts in LINSETk have the algebraic structure of a monoid w.r.t. addition. The concepts in CF-LINSETk are also known as numerical semigroups (Rosales

PREFERENCE-BASED TEACHING

and Garc ıa-S anchez, 2009). A zero coefﬁcient aj = 0 erases gj in the linear combination Pk i=1 aigi. Coefﬁcients from N are non-erasing in this sense. The letters NE in NE-LINSET mean nonerasing . The shift-extension L of a concept class L over the universe N0 is deﬁned as follows:

L = {c + L : (c N0) (L L)} . (7)

The following bounds on RTD and RTD+ (for sufﬁciently large values of k)4 are known from (Gao et al., 2015):

RTD+ RTD LINSETk = ? CF-LINSETk = k {k 1, k} NE-LINSET k = k + 1 {k 1, k, k + 1}

Here NE-LINSET k denotes the shift-extension of NE-LINSETk . The following result shows the corresponding bounds with PBTD in place of RTD:

Theorem 18 The bounds in the following table are valid:

PBTD+ PBTD LINSETk = k {k 1, k} CF-LINSETk = k {k 1, k}

NE-LINSETk j k 1

NE-CF-LINSETk j k 1

PBTD+(L ) = k + 1 PBTD(L ) {k 1, k, k + 1} (8)

holds for all L {LINSETk, CF-LINSETk, NE-LINSETk, NE-CF-LINSETk}.

Note that the equation PBTD+(LINSETk) = k was already proven in Example 2, using the fact that G 7 G is a closure operator. Since G 7 G + is not a closure operator, we give a separate argument to prove an upper bound of k on PBTD+(NE-LINSETk) (see Lemma 37 in Appendix A). All other upper bounds in Theorem 18 are then easy to derive. The lower bounds in Theorem 18 are much harder to obtain. A complete proof of Theorem 18 will be given in Appendix A.

Remark 19 The lower bound on PBTD+(NE-LINSETk) and PBTD+(NE-CF-LINSETk) may be improved to k 1; see (Gao et al., 2017, Theorem 2, Appendix A.3).

4. For instance, RTD+(LINSETk) = holds for all k 2 and RTD(LINSETk) = ? (where ? means unknown ) holds for all k 4.

7. Preference-based Teaching of Halfspaces

In this section, we study preference-based teaching of halfspaces. We will denote the all-zeros vector as 0. The vector with 1 in coordinate i and with 0 in the remaining coordinates is denoted as ei. The dimension of the Euclidean space in which these vectors reside will always be clear from the context. The sign of a real number x (with value 1 if x > 0, value 1 if x < 0, and value 0 if x = 0) is denoted by sign(x). Suppose that w Rd \ { 0} and b R. The (positive) halfspace induced by w and b is then given by Hw,b = {x Rd : w x + b 0} .

Instead of Hw,0, we simply write Hw. Let Hd denote the class of d-dimensional Euclidean halfspaces: Hd = {Hw,b : w Rd \ { 0} b R} .

Similarly, H0 d denotes the class of d-dimensional homogeneous Euclidean halfspaces:

H0 d = {Hw : w Rd \ { 0}} .

Let Sd 1 denote the (d 1)-dimensional unit sphere in Rd. Moreover S+ d 1 = {x Sd 1 : xd > 0} denotes the northern hemisphere . If not stated explicitly otherwise, we will represent homogeneous halfspaces with normalized vectors residing on the unit sphere. We remind the reader of the following well-known fact:

Remark 20 The orthogonal group in dimension d (i.e., the multiplicative group of orthogonal (d d)-matrices) acts transitively on Sd 1 and it conserves the inner product.

We now prove a helpful lemma, stating that each vector w in the northern hemisphere may serve as a representative for some homogeneous halfspace Hu in the sense that all other elements of Hu in the northern hemisphere have a strictly smaller d-th component than w . This will later help to teach homogeneous halfspaces with a preference that orders vectors by the size of their last coordinate.

Lemma 21 Let d 2, let 0 < h 1 and let Rd,h = {w Sd 1 : wd = h}. With this notation the following holds. For every w Rd,h, there exists u Rd \ { 0} such that

(w Hu) ( w (S+ d 1 Hu) \ {w } : wd < h) . (9)

Proof For h = 1, the statement is trivial, since Rd,1 = { ed}. So let h < 1. Because of Remark 20, we may assume without loss of generality that the vector w Rd,h equals (0, . . . , 0,

1 h2, h). It sufﬁces therefore to show that, with this choice of w , the vector u = (0, . . . , 0, w d, w d 1) satisﬁes (9). Note that w Hu iff u, w = w dwd 1 w d 1wd 0. Since u, w = 0, we have w Hu. Moreover, it follows that

S+ d 1 Hu =

w S+ d 1 : wd 1

wd w d 1 w d > 0

It is obvious that no vector w S+ d 1 Hu can have a d-th component wd exceeding w d = h and that setting wd = h = w d forces the settings wd 1 = w d 1 =

1 h2 and w1 = . . . = wd 2 = 0.

PREFERENCE-BASED TEACHING

Consequently, (9) is satisﬁed, which concludes the proof.

With this lemma in hand, we can now prove an upper bound of 2 for the preference-based teaching dimension of the class of homogeneous halfspaces, independent of the underlying dimension d.

Theorem 22 PBTD(H0 1) = TD(H0 1) = 1 and, for every d 2, we have PBTD(H0 d) 2.

Proof Clearly, PBTD(H0 1) = TD(H0 1) = 1 since H0 1 consists of the two sets {x R : x 0} and {x R : x 0}. Suppose now that d 2. Let w be the target weight vector (i.e., the weight vector that has to be taught). Under the following conditions, we may assume without loss of generality that w d = 0:

For any 0 < s1 < s2, the student prefers any weight vector that ends with s2 zero coordinates over any weight vector that ends with only s1 zero coordinates.

If the target vector ends with (exactly) s zero coordinates, then the teacher presents only examples ending with (at least) s zero coordinates.

In the sequel, we specify a student and a teacher such that these conditions hold, so that we will consider only target weight vectors w with w d = 0. The student has the following preference relation:

Among the weight vectors w with wd = 0, the student prefers vectors with larger values of |wd| over those with smaller values of |wd|.

The teacher will use two examples. The ﬁrst one is chosen as ( ( ed, ) if w d > 0 ( ed, ) if w d < 0 .

This example reveals whether the unknown weight vector w Sd 1 has a strictly positive or a strictly negative d-th component. For reasons of symmetry, we may assume that w d > 0. We are now precisely in the situation that is described in Lemma 21. Given w and h = w d, the teacher picks as a second example (u, +) where u Rd \ { 0} has the properties described in the lemma. It follows immediately that the student s preferences will make her choose the weight vector w .

The upper bound of 2 given in Theorem 22 is tight, as is stated in the following lemma.

Lemma 23 For every d 2, we have PBTD(H0 d) 2.

Proof We verify this lemma via Lemma 7, by providing a ﬁnite subclass F of H0 2 such that TDmin(F) = 2. Let F = {Hw : 0 = w { 1, 0, 1}2}. It is easy to verify that each of the 8 halfspaces in F has a teaching dimension of 2 with respect to F. This example can be extended to higher dimensions in the obvious way.

We thus conclude that the class of homogeneous halfspaces has a preference-based teaching dimension of 2, independent of the dimensionality d 2.

Corollary 24 For every d 2, we have PBTD(H0 d) = 2.

By contrast, we will show next that the recursive teaching dimension of the class of homogeneous halfspaces grows with the dimensionality.

Theorem 25 For any d 2, TD(H0 d) = RTD(H0 d) = d + 1.

Proof Assume by normalization that the target weight vector has norm 1, i.e., it is taken from Sd 1. Remark 20 implies that all weight vectors in Sd 1 are equally hard to teach. It sufﬁces therefore to show that TD(H e1, H0 d) = d + 1. We ﬁrst show that TD(H e1, H0 d) d + 1. Deﬁne u = Pd i=2 ei. We claim that T = {( ei, +) : 2 i d} {(u, +), ( e1, +)} is a teaching set for H e1 w.r.t. H0 d. Consider any w Sd 1 such that Hw is consistent with T. Note that wi = ei, w 0 for all i {2, . . . , d} and u, w = Pd i=2 wi 0 together imply that wi = 0 for all i {2, . . . , d} and therefore w = e1. Furthermore, w1 = w, e1 0, and so w = e1, as required. Now we show that TD(H e1, H0 d) d + 1 holds for all d 2. It is easy to see that two examples do not sufﬁce for distinguishing e1 R2 from all weight vectors in S1. In other words, TD(H e1, H0 2) 3. Suppose now that d 3. It is furthermore easy to see that a teaching set T which distinguishes e1 from all weight vectors in Sd 1 must contain at least one positive example u that is orthogonal to e1. The inequality TD(H e1, H0 d) d + 1 is now obtained inductively because the example (u, +) T leaves open a problem that is not easier than teaching e1 w.r.t. the (d 2)- dimensional sphere {x Sd 1 : x u}.

We have thus established that the class of homogeneous halfspaces has a recursive teaching dimension growing linearly with d, while its preference-based teaching dimension is constant. In the case of general (i.e., not necessarily homogeneous) d-dimensional halfspaces, the difference between RTD and PBTD is even more extreme. On the one hand, by generalizing the proof of Lemma 10, it is easy to see that RTD(Hd) = for all d 1. On the other hand, we will show in the remainder of this section that PBTD(Hd) 6, independent of the value of d. We will assume in the sequel (by way of normalization) that an inhomogeneous halfspace has a bias b { 1}. We start with the following result:

Lemma 26 Let w Rd be a vector with a non-trivial d-th component w d = 0 and let b { 1} be a bias. Then there exist three examples labeled according to Hw ,b such that the following holds. Every weight-bias pair (w, b) consistent with these examples satisﬁes b = b , sign(wd) = sign(w d) and ( |wd| |w d| if b = 1 |wd| |w d| if b = +1 . (10)

Proof Within the proof, we use the label 1 instead of + and the label 1 instead of . The pair (w, b) denotes the student s hypothesis for the target weight-bias pair (w , b ). The examples shown to the student will involve the unknown quantities w and b . Each example will lead to a new constraint on w and b. We will see that the collection of these constraints reveals the required information. We proceed in three stages:

1. The ﬁrst example is chosen as ( 0, b ). The pair (w, b) can be consistent with this example only if b = 1 in the case that b = 1 and b {0, 1} in the case that b = 1.

PREFERENCE-BASED TEACHING

2. The next example is chosen as a2 = 2b

w d ed and labeled b . Note that w , a2 + b = b . We obtain the following new constraint:

w, a2 + b =

{0,1} z}|{ b < 0 if b = 1 +2 wd

w d + b |{z} = 1 0 if b = 1 .

The pair (w, b) with b = b if b = 1 and b {0, 1} if b = 1 can satisfy the above constraint only if the sign of wd equals the sign of w d.

3. The third example is chosen as the example a3 = b

w d ed with label 1 . Note that w , a3 + b = 0. We obtain the following new constraint:

w, a3 = b wd

w d + b 0 .

Given that w is already constrained to weight vectors satisfying sign(wd) = sign(w d), we can safely replace wd/w d by |wd|/|w d|. This yields |wd|/|w d| b if b = 1 and |wd|/|w d| b if b = 1. Since b is already constrained as described in stage 1 above, we obtain |wd|/|w d| b {0, 1} if b = 1 and |wd|/|w d| b = 1 if b = 1. The weight-bias pair (w, b) satisﬁes these constraints only if b = b and if (10) is valid.

The assertion of the lemma is immediate from this discussion.

Theorem 27 PBTD(Hd) 6.

Proof As in the proof of Lemma 26, we use the label 1 instead of + and the label 1 instead of . As in the proof of Theorem 22, we may assume without loss of generality that the target weight vector w Rd satisﬁes w d = 0. The proof will proceed in stages. On the way, we specify six rules which determine the preference relation of the student. Stage 1 is concerned with teaching homogeneous halfspaces given by w (and b = 0). The student respects the following rules:

Rule 1: She prefers any pair (w, 0) over any pair (w , b) with b = 0. In other words, any homogeneous halfspace is preferred over any non-homogeneous halfspace.

Rule 2: Among homogeneous halfspaces, her preferences are the same as the ones that were used within the proof of Theorem 22 for teaching homogeneous halfspaces.

Thus, if b = 0, then we can simply apply the teaching protocol for homogeneous halfspaces. In this case, w can be taught at the expense of only two examples. Stage 1 reduces the problem to teaching inhomogeneous halfspaces given by (w , b ) with b = 0. We assume, by way of normalization, that b { 1}, but note that w can now not be assumed to be of unit (or any other ﬁxed) length. In stage 2, the teacher presents three examples in accordance with Lemma 26. It follows that the student will take into consideration only weight-bias pairs (w, b) such that the constraints b = b , sign(wd) = sign(w d) and (10) are satisﬁed. The following rule will then induce the constraint wd = w d:

Rule 3: Among the pairs (w, b) such that wd = 0 and b { 1}, the student s preferences are as follows. If b = 1 (resp. b = 1), then she prefers vectors w with a smaller (resp. larger) value of |wd| over those with a larger (resp. smaller) value of |wd|.

Thanks to Lemma 26 and thanks to Rule 3, we may from now on assume that b = b and wd = w d. In the sequel, let w be decomposed according to w = ( w d 1, w d) Rd 1 R. We think of wd 1 as the student s hypothesis for w d 1. Stage 3 is concerned with the special case where w d 1 = 0. The student will automatically set wd 1 = 0 if we add the following to the student s rule system:

Rule 4: Given that the values for wd and b have been ﬁxed already (and are distinct from 0), the student prefers weight-bias pairs with wd 1 = 0 over any weight-bias pair with wd 1 = 0.

Stage 3 reduces the problem to teaching (w , b ) with ﬁxed non-zero values for wd and b

(known to the student) and with w d 1 = 0. Thus, essentially, only w d 1 has still to be taught. In the next stage, we will argue that the problem of teaching w d 1 is equivalent to teaching a homogeneous halfspace. In stage 4, the teacher will present only examples a such that ad = b

w d so that the contribution of the d-th component to the inner product of w and a cancels with the bias b . Given this commitment for ad, the ﬁrst d 1 components of the examples can be chosen so as to teach the homogeneous halfspace H w d 1. According to Theorem 22, this can be achieved at the expense of two more examples. Of course the student s preferences must match with the preferences that were used in the proof of this theorem:

Rule 5: Suppose that the values of wd and b have been ﬁxed already (and are distinct from 0) and suppose that wd 1 = 0. Then the preferences for the choice of wd 1 match with the preferences that were used in the protocol for teaching homogeneous halfspaces.

After stage 4, the student takes into consideration only weight-bias pairs (w, b) such that wd = w d, b = b and H wd 1 = H w d 1. However, since we had normalized the bias and not the weight vector, this does not necessarily mean that wd 1 = w d 1. On the other hand, the two weight vectors already coincide modulo a positive scaling factor, say

wd 1 = s w d 1 for some s > 0 . (11)

In order to complete the proof, it sufﬁces to teach the L1-norm of w d 1 to the student (because (11) and wd 1 1 = w d 1 1 imply that wd 1 = w d 1). The next (and ﬁnal) stage serves precisely this purpose. As for stage 5, we ﬁrst ﬁx some notation. For i = 1, . . . , k 1, let βi = sign(w i ). Note that (11) implies that βi = sign(wi). Let L = w d 1 1 denote the L1-norm of w d 1. The ﬁnal example is chosen as a6 = (β1, . . . , βd 1, (L + b )/w d) and labeled 1 . Note that w , a6 + b = |w 1| + . . . + |w d 1| L = 0 .

Given that βi = sign(wi), wd = w d and b = b , the student can derive from a6 and its label the following constraint on wd 1:

w, a6 + b = |w1| + . . . + |wd 1| L 0 .

In combination with the following rule, we can now force the constraint wd 1 1 = L:

PREFERENCE-BASED TEACHING

Rule 6: Suppose that the values of wd and b have been ﬁxed already (and are distinct from 0) and suppose that H wd 1 has already been ﬁxed. Then, among the vectors representing H wd 1, the ones with a smaller L1-norm are preferred over the ones with a larger L1-norm.

An inspection of the six stages reveals that at most six examples altogether were shown to the student (three in stage 2, two in stage 4, and one in stage 5). This completes the proof of the theorem.

Note that Theorems 22 and 27 remain valid when we allow w to be the all-zero vector, which extends H0 d by {Rd} and Hd by {Rd, }. Rd will be taught with a single positive example, and with a single negative example. The student will give the highest preference to Rd, the second highest to , and among the remaining halfspaces, the student s preferences stay the same.

8. Classes with PBTD or PBTD+ Equal to One

In this section, we will give complete characterizations of (i) the concept classes with a positive preference-based teaching dimension of 1, and (ii) the concept classes with a preference-based teaching dimension of 1. Throughout this section, we use the label 1 to indicate positive examples and the label 0 to indicate negative examples. Let I be a (possibly inﬁnite) index set. We will consider a mapping A : I I {0, 1} as a binary matrix A {0, 1}I I. A is said to be lower-triangular if there exists a linear ordering on I such that A(i, i ) = 0 for every pair (i, i ) such that i i . We will occasionally identify a set L X with its indicator function by setting L(x) = 1[x L]. For each M X, we deﬁne M L = (L \ M) (M \ L)

and M L = {M L : L L} .

For T X {0, 1}, we deﬁne similarly

M T = {(x, y) : (x, y) T and x M} {(x, y) T : x / M} .

Moreover, given M X and a linear ordering on L, we deﬁne a linear ordering M on M L as follows: M L M M L M (M L ) | {z } =L M (M L) | {z } =L

Lemma 28 With this notation, the following holds. If the mapping L L 7 T(L) X {0, 1} assigns a teaching set to L w.r.t. (L, ), then the mapping M L M L 7 M T(L) X {0, 1} assigns a teaching set to M L w.r.t. (M L, M).

Since this result is rather obvious, we skip its proof. We say that L and L are equivalent if L = M L for some M X (and this clearly is an equivalence relation). As an immediate consequence of Lemma 28, we obtain the following result:

Lemma 29 If L is equivalent to L , then PBTD(L) = PBTD(L ).

The following lemma provides a necessary condition for a concept class to have a preferencebased teaching dimension of one.

Lemma 30 Suppose that L 2X is a concept class of PBTD 1. Pick a linear ordering on L and a mapping L L 7 (x L, y L) X {0, 1} such that, for every L L, {(x L, y L)} is a teaching set for L w.r.t. (L, ). Then

either every instance x X occurs at most once in (x L)L L

or there exists a concept L L that is preferred over all other concepts in L and x L is the only instance from X that occurs twice in (x L)L L.

Proof Since the mapping T must be injective, no instance can occur twice in (x L)L L with the same label. Suppose that there exists an instance x X and concepts L L such that x = x L = x L and, w.l.o.g., y L = 1 and y L = 0. Since {(x, 1)} is a teaching set for L w.r.t. (L, ), every concept L L (including the ones that are preferred over L ) must satisfy L (x) = 0. For analogous reasons, every concept L L (if any) must satisfy L (x) = 1. A concept L L that is preferred over L would have to satisfy L (x) = 0 and L (x) = 1, which is impossible. It follows that there can be no concept that is preferred over L .

The following result is a consequence of Lemmas 28 and 30.

Theorem 31 If PBTD(L) = 1, then there exists a concept class L that is equivalent to L and satisﬁes PBTD(L ) = PBTD+(L ) = 1.

Proof Pick a linear ordering on L and, for every L L, a pair (x L, y L) X {0, 1} such that T(L) = {(x L, y L)} is a teaching set for L w.r.t. (L, ).

Case 1: Every instance x X occurs at most once in (x L)L L. Then choose M = {x L : y L = 0} and apply Lemma 28.

Case 2: There exists a concept L L that is preferred over all other concepts in L and x L is the only instance from X that occurs twice in (x L)L L. Then choose M = {x L : y L = 0 L = L } and apply Lemma 28. With this choice, we obtain M T(L) = {(x L, 1)} for every L L \ {L }. Since L is preferred over all other concepts in L, we may teach L w.r.t. (L, ) by the empty set (instead of employing a possibly 0-labeled example).

The discussion shows that there is a class L that is equivalent to L and can be taught in the preference-based model with positive teaching sets of size 1 (or size 0 in case of L ).

We now have the tools required for characterizing the concept classes whose positive PBTD equals 1.

Theorem 32 PBTD+(L) = 1 if and only if there exists a mapping L L 7 x L X such that the matrix A {0, 1}(L\{ }) (L\{ }) given by A(L, L ) = L (x L) is lower-triangular.

PREFERENCE-BASED TEACHING

Proof Suppose ﬁrst that PBTD+(L) = 1. Pick a linear ordering on L and, for every L L\{ }, pick x L X such that {x L} is a positive teaching set for L w.r.t. (L, ).5 If L L (so that L

is preferred over L), we must have L (x L) = 0. It follows that the matrix A, as speciﬁed in the theorem, is lower-triangular. Suppose conversely that there exists a mapping L L 7 x L X such that the matrix A {0, 1}(L\{ }) (L\{ }) given by A(L, L ) = L (x L) is lower-triangular, say w.r.t. the linear ordering on L \ { }. Then, for every L L \ { }, the singleton {x L} is a positive teaching set for L w.r.t. (L, ) because it distinguishes L from (of course) and also from every concept L L \ { } such that L L. If L, then extend the linear ordering by preferring over every other concept from L (so that is a positive teaching set for w.r.t. (L, )).

In view of Theorem 31, Theorem 32 characterizes every class L with PBTD(L) = 1 up to equivalence.

Let Sg(X) = {{x} : x X} denote the class of singletons over X and suppose that Sg(X) is a sub-class of L and PBTD(L) = 1. We will show that only fairly trivial extensions of Sg(X) with a preference-based dimension of 1 are possible.

Lemma 33 Let L 2X be a concept class of PBTD 1 that contains Sg(X). Let T be an admissible mapping for L that assigns a labeled example (x L, y L) X {0, 1} to each L L. For b = 0, 1, let Lb = {L L : y L = b}. Similarly, let X b = {x X : y{x} Lb}. With this notation, the following holds:

1. If L L1 and L L L, then L L1.

2. If L L0 and L L L, then L L0.

3. |X 0| 2. Moreover if |X 0| = 2, then there exist q = q X such that X 0 = {q, q } and x{q} = q .

Proof Recall that RT = {(L, L ) L L : (L = L ) (L is consistent with T(L ))} and that RT (and even the transitive closure of RT ) is asymmetric if T is admissible.

1. If L L1 and L L , then y L = 1 so that L is consistent with the example (x L, y L). It follows that (L , L) RT . L L0 would similarly imply that (L, L ) RT so that RT would not be asymmetric. This is in contradiction with the admissibility of T.

2. The second assertion in the lemma is a logically equivalent reformulation of the ﬁrst assertion.

3. Suppose for the sake of contradiction that X 0 contains three distinct points, say q1, q2, q3. Since, for i = 1, 2, 3, T assigns a 0-labeled example to {qi}, at least one of the remaining two points is consistent with T({qi}). Let G be the digraph with the nodes q1, q2, q3 and with an edge from qj to qi iff {qj} is consistent with T({qi}). Then each of the three nodes has an indegree of at least 1. Digraphs of this form must contain a cycle so that trcl(RT ) is not asymmetric. This is in contradiction with the admissibility of RT .

5. Such an x L always exists, even if is a teaching set for L, because every superset of a teaching set for L that is still consistent with L is still a teaching set for L, cf. the discussion immediately after Lemma 4.

A similar argument holds if X 0 contains only two distinct elements, say q and q . If neither x{q} = q nor x{q } = q, then ({q }, {q}) RT and ({q}, {q }) RT so that RT is not asymmetric again a contradiction to the admissibility of RT .

We are now in the position to characterize those classes of PBTD one that contain all singletons.

Theorem 34 Suppose that L 2X is a concept class that contains Sg(X). Then PBTD(L) = 1 if and only if the following holds. Either L coincides with Sg(X) or L contains precisely one additional concept, which is either the empty set or a set of size 2.

Proof We start with proving . It is well known that PBTD+(L) = 1 for L = Sg(X) { }: prefer over any singleton set, set T( ) = and, for every x X, set T({x}) = {(x, 1)}. In a similar fashion, we can show that PBTD(L) = 1 for L = Sg(X) {{q, q }} for any choice of q = q X. Prefer {q, q } over {q} and {q }, respectively. Furthermore, prefer {q} and {q } over all other singletons. Finally, set T({q, q }) = , T({q}) = {(q , 0)}, T({q }) = {(q, 0)} and, for every x X \ {q, q }, set T({x}) = {(x, 1)}. As for the proof of , we make use of the notions T, x L, y L, L0, L1, X 0, X 1 that had been introduced in Lemma 33 and we proceed by case analysis.

Case 1: X 0 = . Since X 0 = , we have X = X 1. In combination with the ﬁrst assertion in Lemma 33, it follows that L \ { } = L1. We claim that no concept in L contains two distinct elements. Assume for the sake of contradiction that there is a concept L L such that |L| 2. It follows that, for every q L, x{q} = q and y{q} = 1 so that (L, {q}) RT . Moreover, there exists q0 L such that x L = q0 and y L = 1. It follows that ({q0}, L) RT , which contradicts the fact that RT is asymmetric.

Case 2: X 0 = {q} for some q X. Set q = x{q} and note that y{q} = 0. Moreover, since X 1 = X \ {q}, we have x{p} = p and y{p} = 1 for every p X \ {q}. We claim that L cannot contain a concept L of size at least 2 that contains an element of X \ {q, q }. Assume for the sake of contradiction, that there is a set L such that |L| 2 and p L for some p X \ {q, q }. The ﬁrst assertion in Lemma 33 implies that y L = 1 (because y{p} = 1 and {p} L). Since all pairs (x, 1) with x = q are already in use for teaching the corresponding singletons, we may conclude that q L and T(L) = {(q, 1)}. This contradicts the fact that trcl(RT ) is asymmetric, because our discussion implies that (L, {p}), ({p}, {q}), ({q}, L) RT . We may therefore safely assume that there is no concept of size at least 2 in L that has a non-empty intersection with X \ {q, q }. Thus, except for the singletons, the only remaining sets that possibly belong to L are and {q, q }. We still have to show that not both of them can belong to L. Assume for the sake of contradiction that , {q, q } L. Since is consistent with T({q}) = {(q , 0)}, we have ( , {q}) RT . Clearly, y = 0. Since {q} is consistent with every pair (x, 0) except for (q, 0), we must have x = q. (Otherwise, we have ({q}, ) RT and arrive at a contradiction.) Let us now inspect the possible teaching sets for L = {q, q }. Since {q, q } is consistent with T({q }) = {(q , 1)}, setting y L = 0 would lead to a contradiction. The example (q , 1)

PREFERENCE-BASED TEACHING

is already in use for teaching {q }. It is therefore necessary to set T(L) = {(q, 1)}. An inspection of the various teaching sets shows that ( , {q}), ({q}, L), (L, {q }), ({q }, ) RT , which contradicts the fact that trcl(RT ) is asymmetric.

Case 3: X 0 = {q, q } for some q = q X. Note ﬁrst that y{q} = y{q } = 0 and y{p} = 1 for every p X \ {q, q }. We claim that / L. Assume for the sake of contradiction that L. Then ( , {q}), ( , {q }) RT since is consistent with the teaching sets for instances from X 0. But then, no matter how x in T( ) = {(x, 0)} is chosen, at least one of the sets {q} and {q } will be consistent with T( ) so that at least one of the pairs ({q}, ) and ({q }, ) belongs to RT . This contradicts the fact that RT must be asymmetric. Thus / L, indeed. Now it sufﬁces to show that L cannot contain a concept of size at least 2 that contains an element of X \{q, q }. Assume for the sake of contradiction that there is a set L L such that |L| 2 and p L for some p X \{q, q }. Observe that (L, {p}) RT . Another application of the ﬁrst assertion in Lemma 33 shows that y L = 1 (because y{p} = 1 and p L) and x L {q, q } (because the other 1-labeled instances are already in use for teaching the corresponding singletons). It follows that one of the pairs ({q}, L) and ({q }, L) belongs to RT . The third assertion of Lemma 33 implies that T(q) = {(q , 0)} or T(q ) = {(q, 0)}. For reasons of symmetry, we may assume that T(q) = {(q , 0)}. This implies that ({p}, {q}) RT . Let q be given by T(q ) = {(q , 0)}. Note that either q = q or q X \{q, q }. In the former case, we have that ({p}, {q }) RT and in the latter case we have that ({q}, {q }) RT . Since ({p}, {q}) RT (which was observed above already), we conclude that in both cases, ({p}, {q}), ({p}, {q }) trcl(RT ). Combining this with our observations above that (L, {p}) RT and that one of the pairs ({q}, L) and ({q }, L) belongs to RT , yields a contradiction to the fact that trcl(RT ) is asymmetric.

Corollary 35 Let L 2X be a concept class that contains Sg(X). If PBTD(L) = 1, then RTD(L) = 1.

Proof According to Theorem 34, either L coincides with Sg(X) or L contains precisely one additional concept that is or a set of size 2. The partial ordering on L that is used in the ﬁrst part of the proof of Theorem 34 (proof direction ) is easily compiled into a recursive teaching plan of order 1 for L.6

The characterizations proven above can be applied to certain geometric concept classes. Consider a class L, consisting of bounded and topologically closed objects in the d-dimensional Euclidean space, that satisﬁes the following condition: for every pair (A, B) Rd, there is exactly one object in L, denoted as LA,B in the sequel, such that A, B L and such that A B coincides with the diameter of L. This assumption implies that |L \ Sg(Rd)| = . By setting A = B, it furthermore implies Sg(Rd) L. Let us prefer objects with a small diameter over objects with a larger diameter. Then, obviously, {A, B} is a positive teaching set for LA,B. Because

6. This also follows from Lemma 8 and the fact that there are no chains of a length exceeding 2 in (L, ).

of |L \ Sg(Rd)| = , L does clearly not satisfy the condition in Theorem 34, which is necessary for L to have a PBTD of 1. We may therefore conclude that PBTD(L) = PBTD+(L) = 2. The family of classes with the required properties is rich and includes, for instance, the class of d-dimensional balls as well as the class of d-dimensional axis-parallel rectangles.

9. Conclusions

Preference-based teaching uses the natural notion of preference relation to extend the classical teaching model. The resulting model is (i) more powerful than the classical one, (ii) resolves difﬁculties with the recursive teaching model in the case of inﬁnite concept classes, and (iii) is at the same time free of coding tricks even according to the deﬁnition by Goldman and Mathias (1996). Our examples of algebraic and geometric concept classes demonstrate that preference-based teaching can be achieved very efﬁciently with naturally deﬁned teaching sets and based on intuitive preference relations such as inclusion. We believe that further studies of the PBTD will provide insights into structural properties of concept classes that render them easy or hard to learn in a variety of formal learning models. We have shown that spanning sets lead to a general-purpose construction for preference-based teaching sets of only positive examples. While this result is fairly obvious, it provides further justiﬁcation of the model of preference-based teaching, since the teaching sets it yields are often intuitively exactly those a teacher would choose in the classroom (for instance, one would represent convex polygons by their vertices, as in Example 3). It should be noted, too, that it can sometimes be difﬁcult to establish whether the upper bound on PBTD obtained this way is tight, or whether the use of negative examples or preference relations other than inclusion yield smaller teaching sets. Generally, the choice of preference relation provides a degree of freedom that increases the power of the teacher but also increases the difﬁculty of establishing lower bounds on the number of examples required for teaching.

Acknowledgements. Sandra Zilles was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), in the Discovery Grant and Canada Research Chairs programs. We thank the anonymous referees for their numerous thoughtful comments, which greatly helped to improve the presentation of the paper.

N. Abe. Polynomial learnability of semilinear sets. In Proceedings of the 2nd Annual Conference on Learning Theory (COLT), pages 25 40, 1989.

D. Angluin. Inductive inference of formal languages from positive data. Information and Control, 45(2):117 135, 1980.

X. Chen, Y. Cheng, and B. Tang. On the recursive teaching dimension of vc classes. In Advances in Neural Information Processing Systems 29 (NIPS 2016), pages 2164 2171, 2016.

T. Doliwa, G. Fan, H. U. Simon, and S. Zilles. Recursive teaching dimension, VC-dimension, and sample compression. Journal of Machine Learning Research, 15:3107 3131, 2014.

Z. Gao, H. U. Simon, and S. Zilles. On the teaching complexity of linear sets. In Proceedings of the 26th International Conference on Algorithmic Learning Theory (ALT), pages 102 116, 2015.

PREFERENCE-BASED TEACHING

Z. Gao, C. Ries, H. U. Simon, and S. Zilles. Preference-based teaching. In Proceedings of the 29th Conference on Learning Theory (COLT), pages 971 997, 2016.

Z. Gao, C. Ries, H. U. Simon, and S. Zilles. Preference-based teaching. Ar Xiv e-prints, February 2017. URL https://arxiv.org/pdf/1702.02047.pdf.

S. A. Goldman and M. J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50:20 31, 1995.

S. A. Goldman and H. D. Mathias. Teaching a smarter learner. Journal of Computer and System Sciences, 52:255 267, 1996.

D. Helmbold, R. Sloan, and M. K. Warmuth. Learning nested differences of intersection-closed concept classes. Machine Learning, pages 165 196, 1990.

T. J. Jech. The Axiom of Choice. North-Holland Pub. Co., Amsterdam, 1973.

Z. Mazadi, Z. Gao, and S. Zilles. Distinguishing pattern languages with membership examples. In Proceedings of the 8th International Conference on Language and Automata Theory and Applications (LATA), pages 528 540, 2014.

S. Moran, A. Shpilka, A. Wigderson, and A. Yehudayoff. Compressing and teaching for low VCdimension. In Proceedings of the 56th Annual Symposium on the Foundations of Computer Science (FOCS), pages 40 51, 2015.

R. J. Parikh. On context-free languages. Journal of the ACM, 13(4):570 581, 1966.

J. G. Rosales and P. A. Garc ıa-S anchez. Numerical Semigroups. Springer, 2009.

A. Shinohara and S. Miyano. Teachability in computational learning. New Generation Computing, 8(4):337 347, 1991.

H. U. Simon and S. Zilles. Open problem: Recursive teaching dimension versus VC dimension. In Proceedings of the 28th Annual Conference on Learning Theory (COLT), pages 1770 1772, 2015.

Y. Takada. Learning semilinear sets from examples and via queries. Theoretical Computer Science, 104(2):207 233, 1992.

S. Zilles, S. Lange, R. Holte, and M. Zinkevich. Models of cooperative teaching and learning. Journal of Machine Learning Research, 12:349 384, 2011.

Appendix A. Proof of Theorem 18

In Section A.1, we present a general result which helps to verify the upper bounds in Theorem 18. These upper bounds are then derived in Section A.2. Section A.3 is devoted to the derivation of the lower bounds.

A.1 The Shift Lemma

In this section, we assume that L is a concept class over a universe X {N0, Q+ 0 , R+ 0 }. We furthermore assume that 0 is contained in every concept L L. We can extend L to a larger class, namely the shift-extension L of L, by allowing each of its concepts to be shifted by some constant which is taken from X: L = {c + L : (c X) (L L)} .

The next result states that this extension has little effect only on the complexity measures PBTD and PBTD+:

Lemma 36 (Shift Lemma) With the above notation and assumptions, the following holds:

PBTD(L) PBTD(L ) 1+PBTD(L) and PBTD+(L) PBTD+(L ) 1+PBTD+(L) .

Proof It sufﬁces to verify the inequalities PBTD(L ) 1 + PBTD(L) and PBTD+(L ) 1 + PBTD+(L) because the other inequalities hold by virtue of monotonicity. Let T be an admissible mapping for L. It sufﬁces to show that T can be transformed into an admissible mapping T for L

such that ord(T ) 1+ord(T) and such that T is positive provided that T is positive. To this end, we deﬁne T as follows:

T (c + L) = {(c, +)} {(c + x, b) : (x, b) T(L)} .

Obviously ord(T ) 1+ord(T). Note that c c+L because of our assumption that 0 is contained in every concept in L. Moreover, since the admissibility of T implies that L is consistent with T(L), the above deﬁnition of T (c+L) makes sure that c+L is consistent with T (c+L). It sufﬁces therefore to show that the relation trcl(RT ) is asymmetric. Consider a pair (c +L , c+L) RT . By the deﬁnition of RT , it follows that c +L is consistent with T (c+L). Because of (c, +) T (c+L), we must have c c. Suppose that c = c. In this case, L must be consistent with T(L). Thus L T L. This reasoning implies that (c + L , c + L) RT can happen only if either c < c or (c = c) (L T L). Since T is asymmetric, we may now conclude that trcl(RT ) is asymmetric, as desired. Finally note that, according to our deﬁnition above, the mapping T is positive provided that T is positive. This concludes the proof.

A.2 The Upper Bounds in Theorem 18

We remind the reader that the equality PBTD+(LINSETk) = k was stated in Example 2. We will show in Lemma 37 that PBTD+(NE-LINSETk) k. In combination with the Shift Lemma, this implies that PBTD+(LINSET k) k + 1 and PBTD+(NE-LINSET k) k + 1. All remaining upper bounds in Theorem 18 follow now by virtue of monotonicity.

Lemma 37 PBTD+(NE-LINSETk) k.

Proof We want to show that there is a preference relation for which k positive examples sufﬁce to teach any concept in NE-LINSETk. To this end, let G = {g1, . . . , gℓ} be a generator set with ℓ k where g1 < . . . < gℓ. We use sum(G) = g1 + . . . + gℓto denote the sum of all generators in G. We say that gi is a redundant generator in G if gi {g1, . . . , gi 1} . Let G = {g 1, . . . , g ℓ } G

PREFERENCE-BASED TEACHING

with g 1 < . . . < g ℓ be the set of non-redundant generators in G and let tuple(G) = (g 1, . . . , g ℓ ) be the corresponding ordered sequence. Then G is an independent subset of G generating the same linear set as G when allowing zero coefﬁcients, i.e., we have G = G (although G + = G + whenever G is a proper subset of G). To deﬁne a suitable preference relation, let G, b G be generator sets of size k or less with tuple(G) = (g 1, . . . , g ℓ ) and tuple( b G) = (bg 1, . . . , bg bℓ ). Let the student prefer G over b G if any of the following conditions is satisﬁed:

Condition 1: sum(G) > sum( b G).

Condition 2: sum(G) = sum( b G) and tuple(G) is lexicographically greater than tuple( b G) without having tuple( b G) as preﬁx.

Condition 3: sum(G) = sum( b G) and tuple(G) is a proper preﬁx of tuple( b G).

To teach a concept G NE-LINSETk with sum(G) = g and tuple(G) = (g 1, . . . , g ℓ ), one uses the teaching set S = {(g, +), (g + g 1, +), . . . , (g + g h , +)}

( ℓ 1 if G = G ℓ if G G . (12)

Note that S contains at most |G| k examples. Let b G with D b G E

+ NE-LINSETk denote the

generator set that is returned by the student. Clearly D b G E satisﬁes sum( b G) = g since

concepts with larger generator sums are inconsistent with (g, +), and

concepts with smaller generator sums have a lower preference (compare with Condition 1 above).

It follows that g + g i D b G E

+ is equivalent to g i D b G E = D b G E . We conclude that the smallest

generator in tuple( b G) equals g 1 since

a smallest generator in tuple( b G) that is greater than g 1 would cause an inconsistency with (g + g 1, +), and

a smallest generator in tuple( b G) that is smaller than g 1 would have a lower preference (compare with Condition 2 above).

Assume inductively that the i 1 smallest generators in tuple( b G) are g 1, . . . , g i 1. Since g i / D {g 1, . . . , g i 1} E , we may apply a reasoning that is similar to the above reasoning concerning g 1 and conclude that the i th smallest generator in tuple( b G) equals g i . The punchline of this discussion is that the sequence tuple( b G) starts with g 1, . . . , g h with h given by (12). Let G = G \ G be the set of redundant generators in G and note that

( g ℓ if G = G P g G g if G G .

Let b G = b G \ {g 1, . . . , g h}. We proceed by case analysis:

Case 1: G = G. Since b G is consistent with (g, +), we have P

g b G g = g ℓ . Since g ℓ / D {g 1, . . . , g ℓ 1} E ,

the set b G must contain an element that cannot be generated by g 1, . . . , g ℓ 1. Given the preferences of the student (compare with Condition 2), she will choose b G = {g ℓ }. It follows that b G = G.

Case 2: G G. Here, we have P

g b G g = P g G g . Given the preferences of the student (compare with

Condition 3), she will choose b G such that b G = G and b G consists of elements from G that sum up to P g G g (with b G = n P g G g o among the possible choices). Clearly, D b G E

Thus, in both cases, the student comes up with the right hypothesis.

A.3 The Lower Bounds in Theorem 18

The lower bounds in Theorem 18 are an immediate consequence of the following result:

Lemma 38 The following lower bounds are valid:

PBTD+(NE-CF-LINSET k) k + 1 . (13)

PBTD(NE-CF-LINSET k) k 1 . (14)

PBTD(NE-CF-LINSETk) k 1

PBTD(CF-LINSETk) k 1 . (16)

This lemma can be seen as an extension and a strengthening of a similar result in (Gao et al., 2015) where the following lower bounds were shown:

RTD+(NE-LINSET k) k + 1 .

RTD(NE-LINSET k) k 1 .

RTD(CF-LINSETk) k 1 .

The proof of Lemma 38 builds on some ideas that are found in (Gao et al., 2015) already, but it requires some elaboration to obtain the stronger results. We now brieﬂy explain why the lower bounds in Theorem 18 directly follow from Lemma 38. Note that the lower bound k 1 in (8) is immediate from (14) and a monotonicity argument. This is because NE-LINSET k NE-CF-LINSET k as well as LINSET k CF-LINSET k NE-CF-LINSET k. Note furthermore that PBTD+(CF-LINSET k) k + 1 because of (13) and a monotonicity argument. Then the Shift Lemma implies that PBTD+(CF-LINSETk) k. All remaining lower bounds in Theorem 18 are obtained from these observations by virtue of monotonicity. The proof of Theorem 18 can therefore be accomplished by proving Lemma 38. It turns out that the proof of this lemma is quite involved. We will present in Section A.3.1 some theoretical prerequisites. Sections A.3.2 and A.3.3 are devoted to the actual proof of the lemma.

PREFERENCE-BASED TEACHING

A.3.1 SOME BASIC CONCEPTS IN THE THEORY OF NUMERICAL SEMIGROUPS

Recall from Section 6 that G = n P g G a(g)g : a(g) N0 o . The elements of G are called generators of G . A set P N is said to be independent if none of the elements in P can be written as a linear combination (with coefﬁcients from N0) of the remaining elements (so that P

is a proper subset of P for every proper subset P of P). It is well known (Rosales and Garc ıa S anchez, 2009) that independence makes generating systems unique, i.e., if P, P are independent, then P = P implies that P = P . Moreover, for every independent set P, the following implication is valid: (S P P S) ( S P ) . (17)

Let P = {a1, . . . , ak} be independent with a1 = min P. It is well known7 and easy to see that the residues of a1, a2, . . . , ak modulo a1 must be pairwise distinct (because, otherwise, we would obtain a dependence). If a1 is a prime and |P| 2, then the independence of P implies that gcd(P) = 1. Thus the following holds:

Lemma 39 If P N is an independent set of cardinality at least 2 and min P is a prime, then gcd(P) = 1.

In the remainder of the paper, the symbols P and P are reserved for denoting independent sets of generators. It is well known that G is co-ﬁnite iff gcd(G) = 1 (Rosales and Garc ıa-S anchez, 2009). Let P be a ﬁnite (independent) subset of N such that gcd(P) = 1. The largest number in N \ P is called the Frobenius number of P and is denoted as F(P). It is well known (Rosales and Garc ıa-S anchez, 2009) that F({p, q}) = pq p q (18)

provided that p, q 2 satisfy gcd(p, q) = 1.

A.3.2 PROOF OF (13)

The shift-extension of NE-CF-LINSETk is (by way of deﬁnition) the following class:

NE-CF-LINSET k = {c + P + : (c N0) (P N) (|P| k) (gcd(P) = 1)} . (19)

It is easy to see that this can be written alternatively in the form

NE-CF-LINSET k =

N + P : N N0 P N |P| k gcd(P) = 1 X

(20) where N in (20) corresponds to c + P p P p in (19). For technical reasons, we deﬁne the following subfamilies of NE-CF-LINSET k. For each N 0, let NE-CF-LINSET k[N] = {N + L : L LINSETk[N]}

7. E.g., see (Rosales and Garc ıa-S anchez, 2009)

LINSETk[N] =

P LINSETk : (gcd(P) = 1)

In other words, NE-CF-LINSET k[N] is the subclass consisting of all concepts in NE-CF-LINSET k (written in the form (20)) whose constant is N. A central notion for proving (13) is the following one:

Deﬁnition 40 Let k, N 2 be integers. We say that a set L NE-CF-LINSET is (k, N)-special if it is of the form L = N + P such that the following holds:

1. P is an independent set of cardinality k and min P is a prime (so that gcd(P) = 1 according to Lemma 39, which furthermore implies that P is co-ﬁnite).

2. Let q(P) denote the smallest prime that is greater than F(P) and greater than max P. For a = min P and r = 0, . . . , a 1, let

tr(P) = min{s P : s r (mod a)} and tmax(P) = max 0 r a 1 tr(P) .

Then N k(a + tmax(P)) and N q(P) + X

p P\{a} p . (21)

We need at least k positive examples in order to distinguish a (k, N)-special set from all its proper subsets in NE-CF-LINSET k[N], as the following result shows:

Lemma 41 For all k 2, the following holds. If L NE-CF-LINSET is (k, N)-special, then L NE-CF-LINSET [N] and I (L, NE-CF-LINSETk[N]) k.

Proof Suppose that L = N + P is of the form as described in Deﬁnition 40. Let P = {a, a2 . . . , ak} with a = min P. For the sake of simplicity, we will write tr instead of tr(P) and tmax instead of tmax(P). The independence of P implies that tai mod a = ai for i = 2, . . . , k. It follows that tmax max P. Since, by assumption, N k tmax, it becomes obvious that L NE-CF-LINSET [N]. Assume by way of contradiction that the following holds:

(A) There is a weak spanning set S of size k 1 for L w.r.t. NE-CF-LINSET k[N].

Since N is contained in any concept from NE-CF-LINSET k[N], we may assume that N / S so that S is of the form S = {N + x1, . . . , N + xk 1} for integers xi 1. For i = 1, . . . , k 1, let ri = xi mod a {0, 1, . . . , a 1}. It follows that each xi is of the form xi = qia + tri for some integer qi 0. Let X = {x1, . . . , xk 1}. We proceed by case analysis:

Case 1: X {a2, . . . , ak} (so that, in view of |X| = k 1, we even have X = {a2, . . . , ak}). Let L = N + X . Then S L . Note that X P but P X. We may conclude from (17) that X P and, therefore, L L. Thus L is a proper subset of L which contains S. Note that (21) implies that N Pk i=2 ai = Pk 1 i=1 xi. If gcd(X) = 1, then L NE-CF-LINSET[N] and we have an immediate contradiction to the above assumption (A).

PREFERENCE-BASED TEACHING

Otherwise, if gcd(X) 2, then we deﬁne L = N+ X {q(P)} . Note that S L L . Since q(P) > F(P), we have X {q(P)} P and, since q(P) > max P, we have P X {q(P)}. We may conclude from (17) that X {q(P)} P and, therefore, L L. Thus, L is a proper subset of L which contains S. Because X = {a2, . . . , ak} and q(P) is a prime that is greater than max P, it follows that gcd(X {q(P)}) = 1. In combination with (21), it easily follows now that L NE-CF-LINSET[N]. Putting everything together, we arrive at a contradiction to the assumption (A).

Case 2: X {a2, . . . , ak}. If ri = 0 for i = 1, . . . , k 1, then each xi is a multiple of a. In this case, N + a, q(P) is a proper subset of L = N + P that is consistent with S, which yields a contradiction. We may therefore assume that there exists i {1, . . . , k 1} such that ri = 0. From the case assumption, X {a2, . . . , ak}, it follows that there must exist an index i {1, . . . , k 1} such that qi 1 or tri / {a2, . . . , ak}. For i = 1, . . . , k 1, let q i = min{qi, 1} and x i = q ia + tri. Note that q i = 1 iff qi 1. Deﬁne L = N + X for X = {a, x 1, . . . , x k 1} and observe the following. First, the set L clearly contains S. Second, the choice of x 1, . . . , x k 1 implies that X P . Third, it easily follows from q i = 1 or tri / {a2, . . . , ak} that P {a, x 1, . . . , x k 1}. We may conclude from (17) that X P and, therefore, L L. Thus, L is a proper subset of L which contains S. Since ri = 0 and a is a prime, it follows that gcd(a, x i ) = 1 and, therefore, gcd(X ) = 1. In combination with (21), it easily follows now that L NE-CF-LINSET[N]. Putting everything together, we obtain again a contradiction to the assumption (A).

For the sake of brevity, let L = NE-CF-LINSET . Assume by way of contradiction that there exists a positive mapping T of order k that is admissible for Lk. We will pursue the following strategy:

1. We deﬁne a set L Lk of the form L = N + p + 1 .

2. We deﬁne a second set L = N + G L that is (k, N)-special and consistent with T +(L). Moreover, L \ L = {N}.

If this can be achieved, then the proof will be accomplished as follows:

According to Lemma 41, T +(L ) must contain at least k examples (all of which are different from N) for distinguishing L from all its proper subsets in Lk[N].

Since L is consistent with T +(L), the set T +(L ) must contain an example which distinguishes L from L. But the only example which ﬁts this purpose is (N, +).

The discussion shows that T +(L ) must contain k examples in order to distinguish L from all its proper subsets in Lk plus one additional example, N, needed to distinguish L from L.

We obtain a contradiction to our initial assumption that T + is of order k.

We still have to describe how our proof strategy can actually be implemented. We start with the deﬁnition of L. Pick the smallest prime p k + 1. Then {p, p + 1, . . . , p + k} is independent.

Let M = F({p, p + 1}) (18) = p(p + 1) p (p + 1). An easy calculation shows that k 2 and p k + 1 imply that M p + k. Let I = {p, p + 1, . . . , M}. Choose N large enough so that all concepts of the form

N + P where |P| = k, p = min P and P I

are (k, N)-special. With these choices of p and N, let L = N +p+ 1 . Note that N +p, N +p+1 T +(L) because, otherwise, one of the concepts N + p + 1 + 1 , N + p + 2, 3 L would be consistent with T +(L) whereas T +(L) must distinguish L from all its proper subsets in Lk. Setting A = {x : N + x T +(L)}, it follows that |A| = |T +(L)| k and p, p + 1 A. The set A is not necessarily independent but it contains an independent subset B such that p, p + 1 B and A = B . Since M = F({p, p + 1}), it follows that any integer greater than M is contained in p, p + 1 . Since B is an independent extension of {p, p + 1}, it cannot contain any integer greater than M. It follows that B I. Clearly, |B| k and gcd(B) = 1. We would like to transform B into another generating system G I such that

B G , gcd(G) = 1 and |G| = k .

If |B| = k, we can simply set G = B. If |B| < k, then we make use of the elements in the independent set {p, p + 1, . . . , p + k} I and add them, one after the other, to B (thereby removing other elements from B whenever their removal leaves B invariant) until the resulting set G contains k elements. We now deﬁne the set L by setting L = N + G . Since G I = {p, p + 1, . . . , M}, and p, p + 1 G, it follows that p = min G, gcd(G) = 1 and min(L \ {N}) is N + p. Thus, L \ L = {N}, as desired. Moreover, since N had been chosen large enough, the set L is (k, N)- special. Thus L and L have all properties that are required by our proof strategy and the proof of (13) is complete.

A.3.3 PROOF OF (14), (15), AND (16)

We make use of some well known (and trivial) lower bounds on TDmin:

Example 5 For every k N, let [k] = {1, 2, . . . , k}, let 2[k] denote the powerset of [k] and, for all ℓ= 0, 1, . . . , k, let [k] ℓ

= {S [k] : |S| = ℓ}

denote the class of those subsets of [k] that have exactly ℓelements. It is trivial to verify that

TDmin 2[k] = k and TDmin

= min{ℓ, k ℓ} .

In view of PBTD+(LINSETk) = k, the next results show that negative examples are of limited help only as far as preference-based teaching of concepts from LINSETk is concerned:

PREFERENCE-BASED TEACHING

Lemma 42 For every k 1 and for all ℓ= 0, . . . , k 1, let

Lk = { k, p1, . . . , pk 1 : pi {k + i, 2k + i}} ,

Lk,ℓ = {{ k, p1, . . . , pk 1 Lk : |{i : pi = k + i}| = ℓ} .

With this notation, the following holds:

TDmin(Lk) k 1 and TDmin(Lk,ℓ) min{ℓ, k 1 ℓ} .

Proof For k = 1, the assertion in the lemma is vacuous. Suppose therefore that k 2. An inspection of the generators k, p1, . . . , pk 1 with pi {k + i, 2k + i} shows that

Lk = {Lk,S : S {k + 1, k + 2, . . . , 2k 1}}

Lk,ℓ = {Lk,S : (S {k + 1, k + 2, . . . , 2k 1}) (|S| = ℓ)}

where Lk,S = {0, k} {2k, 2k + 1, . . .} S .

Note that the examples in {0, 1, . . . , k} {2k, 2k + 1, . . . , } are redundant because they do not distinguish between distinct concepts from Lk. The only useful examples are therefore contained in the interval {k + 1, k + 2, . . . , 2k 1}. From this discussion, it follows that teaching the concepts of Lk (resp. of Lk,ℓ) is not essentially different from teaching the concepts of 2[k 1] resp. of [k 1] ℓ . This completes the proof of the lemma because we know from Example 5 that

TDmin(2[k 1]) = k 1 and TDmin [k 1] ℓ = min{ℓ, k 1 ℓ}.

We claim now that the inequalities (14), (15) and (16) are valid, i.e., we claim that the following holds:

1. PBTD(CF-LINSETk) k 1.

2. PBTD(NE-CF-LINSETk) (k 1)/2 .

3. PBTD(NE-CF-LINSET k) k 1.

Proof For k = 1, the inequalities are obviously valid. Suppose therefore that k 2.

1. Since gcd(k, k + 1) = gcd(k, 2k + 1) = 1, it follows that Lk is a ﬁnite subclass of CF-LINSETk. Thus PBTD(CF-LINSETk) PBTD(Lk) TDmin(Lk) k 1.

2. Deﬁne Lk[N] = {N + L : L Lk} and Lk,ℓ[N] = {N + L : L Lk,ℓ}. Clearly TDmin(Lk[N]) = TDmin(Lk) and TDmin(Lk,ℓ[N]) = TDmin(Lk,ℓ) holds for every N 0. It follows that the lower bounds in Lemma 42 are also valid for the classes Lk[N] and Lk,ℓ[N] in place of Lk and Lk,ℓ, respectively. Let

N(k) = k2+(k 1 (k 1)/2 )k+

i=1 i = k2+(k 1 (k 1)/2 )k+ 1

2(k 1)k . (22)

It sufﬁces to show that N(k) + Lk, (k 1)/2 is a ﬁnite subclass of NE-CF-LINSETk. To this end, ﬁrst note that

k, p1, . . . , pk 1 + = k +

i=1 pi + k, p1, . . . , pk 1 .

Call pi light if pi = k + i and call it heavy if pi = 2k + i. Note that a concept L from N(k) + Lk,ℓis of the general form

L = N(k) + k, p1, . . . , pk 1 (23)

with exactly ℓlight parameters among p1, . . . , pk 1. A straightforward calculation shows that, for ℓ= (k 1)/2 , the sum k + Pk 1 i=1 pi equals the number N(k) as deﬁned in (22). Thus, the concept L from (23) with exactly (k 1)/2 light parameters among {p1, . . . , pk 1} can be rewritten as follows:

L = N(k) + k, p1, . . . , pk 1 = k, p1, . . . , pk 1 + .

This shows that L NE-CF-LINSETk. As L is a concept from N(k) + Lk, (k 1)/2 in general form, we may conclude that N(k)+Lk, (k 1)/2 is a ﬁnite subclass of NE-CF-LINSETk, as desired.

3. The proof of the third inequality is similar to the above proof of the second one. It sufﬁces to show that, for every k 2, there exists N N such that N + Lk is a subclass of NE-CF-LINSET k. To this end, we set N = 3k2. A concept L from 3k2 + Lk is of the general form L = 3k2 + k, p1, . . . , pk 1

with pi {k + i, 2k + i} (but without control over the number of light parameters). It is easy to see that the constant 3k2 is large enough so that L can be rewritten as

+ k, p1, . . . , pk 1 +

where 3k2 k + Pk 1 i=1 pi 0. This shows that L NE-CF-LINSET k. As L is a

concept from 3k2 + Lk in general form, we may conclude that 3k2 + Lk is a ﬁnite subclass of NE-CF-LINSET k, as desired.