# map_and_mlebased_teaching__c5405b50.pdf

Journal of Machine Learning Research 25 (2024) 1-34 Submitted 8/23; Revised 3/24; Published 3/24

MAPand MLE-Based Teaching

Hans Ulrich Simon hsimon@mpi-inf.mpg.de Max-Planck Institute for Informatics, Germany and Ruhr-University Bochum, Department of Mathematics, Germany

Jan Arne Telle telle@ii.uib.no Department of Informatics, University of Bergen, Norway

Editor: Aryeh Kontorovich

Imagine a learner L who tries to infer a hidden concept from a collection of observations. Building on the work of Ferri et al. (2022), we assume the learner to be parameterized by priors P(c) and by c-conditional likelihoods P(z|c) where c ranges over all concepts in a given class C and z ranges over all observations in an observation set Z. L is called a MAP-learner (resp. an MLE-learner) if it thinks of a collection S of observations as a random sample and returns the concept with the maximum a-posteriori probability (resp. the concept which maximizes the c-conditional likelihood of S). Depending on whether L assumes that S is obtained from ordered or unordered sampling resp. from sampling with or without replacement, we can distinguish four diﬀerent sampling modes. Given a target concept c C, a teacher for a MAP-learner L aims at ﬁnding a smallest collection of observations that causes L to return c . This approach leads in a natural manner to various notions of a MAPor MLE-teaching dimension of a concept class C. Our main results are as follows. First, we show that this teaching model has some desirable monotonicity properties. Second we clarify how the four sampling modes are related to each other. As for the (important!) special case, where concepts are subsets of a domain and observations are 0,1-labeled examples, we obtain some additional results. First of all, we characterize the MAPand MLE-teaching dimension associated with an optimally parameterized MAP-learner graph-theoretically. From this central result, some other ones are easy to derive. It is shown, for instance, that the MLE-teaching dimension is either equal to the MAP-teaching dimension or exceeds the latter by 1. It is shown furthermore that these dimensions can be bounded from above by the so-called antichain number, the VC-dimension and related combinatorial parameters. Moreover they can be computed in polynomial time.

1. Introduction

In formal models of machine learning we have a concept class C of possible concepts/hypotheses, an unknown target concept c C and training data given by correctly labeled random examples. In formal models of machine teaching a collection T(c ) of labeled examples is instead carefully chosen by a teacher T in a way that the learner can reconstruct the target concept c from T(c ). In recent years, the ﬁeld of machine teaching has seen various applications in ﬁelds like explainable AI as in H avardstun et al. (2023), trustworthy AI as in Zhu et al. (2018) and pedagogy as in Shafto et al. (2014).

. This research was supported by the Norwegian Research Council, project Machine Teaching for XAI .

c 2024 Hans Simon and Jan Arne Telle.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1086.html.

Simon and Telle

Various models of machine teaching have been proposed, e.g. the classical teaching model in Shinohara and Miyano (1991) and in Goldman and Kearns (1995), the optimal teacher model in Balbach (2008), the recursive teaching in Zilles et al. (2011), the preferencebased teaching in Gao et al. (2017), or the no-clash teaching in Kirkpatrick et al. (2019) and in Fallat et al. (2022). These models diﬀer mainly in the restrictions that they impose on the learner and the teacher in order to avoid unfair collusion or cheating. The common goal is to keep the size of the largest teaching set, maxc C |T(c)|, as small as possible. There are also other variants using probabilities, from Muggleton (1996) where examples are sampled based on likelihoods for a target concept, to Shafto et al. (2014) who calls this pedagogical sampling and leads into the Bayesian Teaching of Eaves and Shafto (2016) and of Yang and Shafto (2017), to the Bayesian learners of Zhu (2013) with a proper teacher selecting examples. In this paper we continue this line of research and consider the probabilistic model that had been described in the abstract. This model is inspired by and is an extension of the model that was introduced in Ferri et al. (2022). As already observed in Ferri et al. (2022), the condition for collusion-avoidance from Goldman and Mathias (1996) may here be violated, i.e., the learner may ﬁrst reconstruct a concept c1 from some given observations but, after having received additional observations, switch to another concept c2 even if the new observations have given additional support to c1. Like the authors of Ferri et al. (2022), we would like to argue that this should not be considered as collusion or cheating as long as the parameters assigned to the learner reﬂect some factual information about the world. In our paper the parameters can be freely chosen, and thus in order to disallow collusion one would have to impose more restrictions on the model, for instance a notion of natural parameterization. However, that is not our focus in this paper, which is rather on ﬁnding lower bounds on the teaching dimension that are even valid in the most liberal model of MAP-teaching. As already outlined in the abstract, we will distinguish between four distinct sampling modes: ordered sampling with replacement ((O, R)-mode), unordered sampling with replacement ((O, R)-mode), ordered sampling without replacement ((O, R)-mode) and unordered sampling without replacement ((O, R)-mode). The smallest number d such that every c C can be taught to a given MAP-learner L by a collection of at most d observations is denoted by MAP-TDα,β L (C) where (α, β) {O, O} {R, R} indicates the underlying sampling mode. Then MAP-TDα,β(C) = min L MAP-TDα,β L (C) is the corresponding parameter with an optimally parameterized learner L. The analogous notation is used for MLE-learners. Our main results are as follows:

1. The MAP-teaching model has two desirable and quite intuitive monotonicity properties. Loosely speaking, adding new observations (making Z larger) leads to smaller MAP-TD while adding new concepts (making C larger) leads to larger MAP-TD. See Section 3.2 for details.

2. The sampling modes (O, R) and (O, R) are equivalent. The sampling modes (O, R), (O, R) and (O, R) are pairwise incomparable (i.e., which one leads to smaller values of MAP-TDL(C) depends on the choice of C and L). Note that incomparability of the modes (α, β) and (α , β ) does not rule out the possibility that MAP-TDα,β(C) MAP-TDα ,β (C) for each concept class C. See Section 3.3 for details.

MAPand MLE-Based Teaching

3. As for the (important!) special case, where concepts are subsets of a domain and observations are 0,1-labeled examples, we obtain some additional results, the ﬁrst of which is the central one:

(a) For a (properly deﬁned) bipartite graph G(C)α,β associated with C and (α, β) = (O, R), one gets1

MAP-TDα,β(C) = SMN(G(C)α,β) . (1)

If we replace G(C)α,β by a slightly modiﬁed graph, we obtain the corresponding result for MLE-TD at the place of MAP-TD.2 Fig. 1 visualizes this result. See Sections 4 and 5.1 for details.

(b) The MLE-teaching dimension is either equal to the MAP-teaching dimension or exceeds the latter by 1. See Section 5.2 for details.

(c) The MAPand the MLE-teaching dimension can be bounded from above by the so-called antichain number, the VC-dimension and related combinatorial parameters. See Section 5.3 for details.

(d) Moreover the MAPand the MLE-teaching dimension can be computed in polynomial time from a natural encoding of the underlying concept class. See Section 5.4 for details.

2. Deﬁnitions and Notations

We ﬁrst ﬁx some general notation. Afterwards, in Sections 2.1, 2.2, and 2.3, the MAPand MLE-based teaching model is introduced, step-by-step.

Mappings. The restriction of a mapping f : A B to a subset A A will be denoted by f A . Suppose that B is a set that is equipped with a size function which associates a size |b| with each b B. Then the order of a mapping f : A B is deﬁned as the size of the largest element in the image of f, i.e., the order of f equals maxa A |f(a)|.

Graphs and Matchings. For a graph G = (V, E) and a set U V , we denote by Γ(U) the set of vertices which are adjacent to at least one vertex in U. If G = (V1, V2, E) is the bipartite graph with vertex sets V1 and V2 and with edge set E V1 V2, then U V1 implies (of course) that Γ(U) V2. A matching M in a bipartite graph G = (V1, V2, E) can be viewed as a (partially deﬁned and injective) function M : V1 V2 with the property that (v, M(v)) E for each v having an M-partner. If V1 is saturated by M, i.e., every vertex in V1 has an M-partner, then this function is fully deﬁned.

VC-Dimension, Vapnik and Chervonenkis (1971). Let C be a family of subsets of some ground set X. For c C and x X, we also write c(x) = 1 if x c and c(x) = 0 if x / c. We say that S X is shattered by C if, for every b : S {0, 1}, there is some c C that coincides with b on S. The VC-dimension of C is deﬁned as if there exist arbitrarily large shattered sets, and it is deﬁned as the size of a largest shattered set otherwise.

1. SMN(G) denotes the saturating matching number of a bipartite graph G (formally deﬁned in Section 4) 2. Some bounds on MLE-TD numbers in terms of SMN numbers are already found in Ferri et al. (2022), but no results that hold with equality (as in (1)) are proven there.

Simon and Telle

ord ord ord ord

Figure 1: For any binary concept class C 2X and 0, 1-labeled examples as observations, the tree visualizes the identities in (1). Using the same color for the two leftmost leaves in the MAP-subtree is justiﬁed by the equivalence of the modes (O, R) and (O, R) as shown in Corollary 19. A parameter represented by a leaf in the MAP-subtree has the same value as the parameter represented by a leaf of the same color in the SMN-subtree, as shown in Theorems 25, 27, 28 and Corollary 26. The parameters represented in the SMN-subtree are ordered as indicated by the rightmost diagram, with lowest value on top and highest value at bottom. We will see later, in Theorem 20, that parameters represented in diﬀerent colors can generally have diﬀerent values.

2.1 Concept Classes

Let C be a ﬁnite set of size at least 2, let Z be another non-empty ﬁnite set and let |= be a relation on C Z. We refer to C as a concept class and to Z as a set of observations. If c |= z, then we say that the concept c is consistent with the observation z. We say that c is consistent with a set (resp. multiset) A of observations, which is written as c |= A, if c is consistent with every z A. The notation c |= z with z = (z1, . . . , zn) Zn is understood analogously. For each c C, we deﬁne

Zc = {z Z : c |= z} .

Example 1 (Positive Examples as Observations) Let Z = X be a set of examples and let C be a family of subsets of X. Let the consistency relation be given by

c C, x X : c |= x x c .

Note that Zc = c in this setting, i.e., concepts are identiﬁed with the sets of observations they are consistent with.

Example 2 (Labeled Examples as Observations) Let Z = X {0, 1} be a set of labeled examples and let C be a family of subsets of X. Let the consistency relation be given by c C, (x, b) Z : c |= (x, b) (b = 1 x c) (b = 0 x / c) . (2)

Note that Zc = {(x, 1) : x c} {(x, 0) : x / c} in this setting. It follows that |Zc| = |X| for all c C.

MAPand MLE-Based Teaching

We will occasionally identify a set c X with the corresponding 0, 1-valued function so that c(x) = 1 for x c and c(x) = 0 for x X \ c. The equivalence in (2) can then be written in the form c |= (x, b) b = c(x).

Example 3 (Labeled Examples and Probabilistic Concepts) Let Z = X {0, 1} be again a set of labeled examples and let C be a family of functions from X to [0, 1]. Let the consistency relation be given by

c C, x X : c |= (x, 1) c(x) > 0 and c |= (x, 0) c(x) < 1 .

Intuitively we should think of c(x) as the probability that c assigns label 1 to instance x. If all concepts c C were 0, 1-valued, we would again be in the setting of Example 2.

Note that within Examples 1 and 2, we have that

c, c C : c = c Zc = Zc (3)

so that each concept c C is uniquely determined by the full set Zc of observations that c is consistent with. Of course this will not necessarily be the case if the concepts are probabilistic as in Example 3.

2.2 Variants of Sampling

As formalized in the deﬁnitions below, we distinguish between ordered and unordered sampling and we may have sampling with or without replacement.

Deﬁnition 1 (Sampling with Replacement) Let Q = (q(z))z Z be a collection of probability parameters, i.e., q(z) 0 and P z Z q(z) = 1. For n 0, we deﬁne n-fold (ordered resp. unordered) Q-sampling with replacement as the following random procedure:

1. Choose z1, . . . , zn independently at random according to Q.

2. In case of ordered sampling, return the sequence (z1, . . . , zn) whereas, in case of unordered sampling, return the multiset {z1, . . . , zn}.3

Let z = (z1, . . . , zn) Zn be a sequence that contains k distinct elements, say z 1, . . . , z k, and let ni denote the number of occurrences of z i in z. Let Az Z be the corresponding multiset. The probability that z (resp. Az) is obtained from n-fold ordered (resp. unordered) Q-sampling with replacement is henceforth denoted by P O,R(z|Q) (resp. by P O,R(Az|Q)). With these notations, the following holds:

P O,R(z|Q) =

i=1 q(zi) =

i=1 q(z i)ni and P O,R(Az|Q) = n! n1! . . . nk!

i=1 q(z i)ni . (4)

Deﬁnition 2 (Sampling without Replacement) Let Q = (q(z))z Z be a collection of probability parameters. Let N+(Q) be the number of z Z such that q(z) > 0. For 0 n N+(Q), we deﬁne n-fold (ordered resp. unordered) Q-sampling without replacement as the following random procedure:

3. If n = 0, then the empty sequence resp. the empty multiset is returned,

Simon and Telle

1. Choose z1 at random according to Q.

2. For i = 2, . . . , n do the following: Choose zi Z \ {z1, . . . , zi 1} at random where, for each z Z \ {z1, . . . , zi 1}, the probability for zi = z equals q(z) 1 (q(z1)+...+q(zi 1)).4

3. In case of ordered sampling, return the sequence (z1, . . . , zn) whereas, in case of unordered sampling, return the set {z1, . . . , zn}.

Let z = (z1, . . . , zn) Zn be a repetition-free sequence and let Az Z be the corresponding set. For a permutation σ of 1, . . . , n, we deﬁne zσ = (zσ(1), . . . , zσ(n)). The probability that z (resp. Az) is obtained from n-fold ordered (resp. unordered) Q-sampling without replacement is henceforth denoted by P O,R(z|Q) (resp. by P O,R(Az|Q)). With these notations, the following holds:

P O,R(z|Q) =

q(zi) 1 (q(z1) + . . . + q(zi 1)) and P O,R(Az|Q) = X

σ P O,R(zσ|Q) , (5)

where σ ranges over all permutations of 1, . . . , n. We introduce the following notation:

ZO,R = Z denotes the set of sequences over Z (including the empty sequence).

ZO,R denotes the set of multisets over Z (including the empty multiset).

ZO,R denotes the set of repetition-free sequences over Z (including the empty sequence).

ZO,R = 2Z denotes the powerset of Z.

The pairs (α, β) {O, O} {R, R} are called sampling modes. We use the symbol not only to denote the empty set but also to denote the empty multiset or the empty sequence. If A is a ﬁnite set or multiset, then |A| denotes its size where, in case of a multiset, the multiple occurrences of elements are taken into account. The length of a ﬁnite sequence z is denoted by |z|.

Remark 3 (Trivial Identities) Suppose that Q = (q(z))z Z is collection of probability parameters. Then, for each sampling mode (α, β), we have that P α,β( |Q) = 1. Moreover, if all parameters q(z) with z Z are strictly positive, then P O,R(Z|Q) = 1.

We close this section with a more or less obvious result whose proof will be given for sake of completeness.

Remark 4 Let z1, . . . , zn be a sequence with pairwise distinct elements from Z. Let p1 > p2 > . . . > pn be a strictly decreasing sequence of strictly positive parameters such that Pn i=1 pi 1. For each permutation σ of [n], consider the parameter collection Qσ = (qσ(zi))i=1,...,n given by qσ(zi) = pσ(i). Then the identity permutation is the unique maxi-

mizer of P O,R(z1, . . . , zn|Qσ).

4. Note that the probability parameters for z Z\{z1, . . . , zi 1} are the same as before up to normalization.

MAPand MLE-Based Teaching

Proof According to (5), we have

P O,R(z1, . . . , zk|Qσ) =

qσ(zi) 1 (qσ(z1) + . . . + qσ(zi 1))

pσ(i) 1 (pσ(1) + . . . + pσ(i 1)) = Qn i=1 pi Qn i=1(1 (pσ(1) + . . . + pσ(i 1))

The product in the numerator is the same for all permutations σ. The following assertions are equivalent:

1. σ is the identity permutation.

2. The sequence pσ (1), . . . , pσ (n) is strictly decreasing.

3. For each permutation σ = σ and each i [n], we have that

pσ (1) + . . . + pσ (i 1) pσ(1) + . . . + pσ(i 1)

and, for at least one i [n], this inequality is strict.

4. The permutation σ is the unique maximizer of P O,R(z1, . . . , zk|Qσ).

The remark now is immediate from the equivalence of the ﬁrst and the fourth statement.

2.3 MAPand MLE-based Teaching

An MLE-learner will always choose a hypothesis from a class C that maximizes the likelihood of a given set of observations. MAP-learners are a bit more general because they additionally bring into play priors (P(c))c C. The notion of likelihood depends on how the observations are randomly sampled. We proceed with the formal deﬁnition of MAPand MLE-learners and their teachers:

Deﬁnition 5 (MAPand MLE-Learner) A MAP-Learner L for C is given by (and henceforth identiﬁed with) parameters P(z|c) 0 and P(c) > 0 for z Z and c C such that X

c C P(c) = 1 and c C : X

z Z P(z|c) = 1 .

The parameters P(c) are referred to as priors. The parameters P(z|c), referred to as cconditional likelihoods, must satisfy the following validity condition:

c |= z P(z|c) = 0 . (6)

Set Z+ c (L) := {z Z : P(z|c) > 0} and N+(C, L) = minc C |Z+ c (L)|.5 L can be in four diﬀerent sampling modes (depending on the assumed kind of sampling). These modes determine the form of L s input and the choice of its output as will be detailed below.

5. Because of the validity condition, Z+ c (L) is a subset of Zc = {z Z : c |= z}.

Simon and Telle

(O, R)-mode: For every n 0 and every sequence a Zn, we denote by P O,R(a|c) the probability that a is obtained from n-fold ordered P( |c)-sampling with replacement. Given a sequence a ZO,R, L returns the concept arg!maxc C P(c) P O,R(a|c) if it exists, and a question mark otherwise.6

(O, R)-mode: For every n 0 and and every multiset A Z of size n, we denote by P O,R(A|c) the probability that A is obtained from n-fold unordered P( |c)-sampling with replacement. Given a multiset A ZO,R, L returns the concept arg!maxc C h P(c) P O,R(A|c) i if it exists, and a question mark otherwise.

(O, R)-mode: For every 0 n N+(C, L), and every repetition-free sequence a Zn, we denote by P O,R(a|c)) the probability that a is obtained from n-fold ordered P( |c)- sampling without replacement. Given a repetition-free sequence a ZO,R with |a| N+(C, L), L returns the concept arg!maxc C h P(c) P O,R(a|c) i if it exists, and a

question mark otherwise. If |a| > N+(C, L), then also a question mark is returned.

(O, R)-mode: For every 0 n N+(C, L), and every set A Z of size n, we denote by P O,R(A|c) the probability that A is obtained from n-fold unordered P( |c)-sampling without replacement. Given a set A ZO,R with |A| N+(C, L), L returns the concept arg!maxc C h P(c) P O,R(A|c) i if it exists, and a question mark otherwise. If

|A| > N+(C, L), then also a question mark is returned.

An MLE-learner is a MAP-learner with uniform priors (so that the factor P(c) in the above arg!max-expressions can be dropped).

Deﬁnition 6 (Teacher) Suppose that L is a MAP-learner for C that is in sampling mode (α, β) {O, O} {R, R}. A (successful) teacher for L is a mapping T which assigns to each concept c0 C an input I = T(c0) for L such that L(I) = c0. In other words:

1. I Zα,β and, if β = R, then |I| N+(C, L).

2. c0 = arg!maxc C P(c) P α,β(I|c) .

A couple of observations are in place here.

Remark 7 Suppose that L is a MAP-learner for C which is in sampling mode (α, β) {O, O} {R, R}. Suppose that T is a teacher for L. Then the following holds for all c, c C:

L(T(c)) = c , P α,β( |c) = 1 , P α,β(T(c)|c) > 0 , c |= T(c) and (c = c T(c) = T(c )) . (7) Moreover, if L is an MLE-learner and T is a teacher for L, then T(c) = .

6. The operator arg!maxc C f(c) returns the unique maximizer c C of f(c) provided that it exists.

MAPand MLE-Based Teaching

Proof L(T(c)) = c is an immediate consequence of Deﬁnitions 5 and 6. It now follows that, if T(c) = T(c ), then c = L(T(c)) = L(T(c )) = c . In other words, c = c implies that T(c) = T(c ). 0-fold sampling conditioned to c yields regardless of how c is chosen. It follows that P α,β( |c) = 1. Assume now for contradiction that P α,β(T(c )|c ) = 0. But then c cannot be the unique maximizer of P α,β(T(c )|c) P(c) in C. This is in contradiction with L(T(c )) = c . Assume for contradiction that T(c) contains an observation z Z such that c |= z. It follows that P α,β(T(c)|c) = 0, which is in contradiction with P α,β(T(c)|c) > 0. Thus c |= T(c). Finally, suppose that the priors are uniform, i.e., P(c) = 1/|C| for every c C. Assume for contradiction that T(c0) = for some c0 C. For every c C, we have P(c) P α,β( |c) = P(c) = 1/|C|. Hence c0 cannot be unique maximizer of P(c) P α,β( |c) in C. This is in contradiction with L(T(c0)) = c0.

Here is the deﬁnition of the parameter that is in the focus of our interest:

Deﬁnition 8 (MAPand MLE-Teaching Dimension) Suppose that L is a MAP-learner for C who is in sampling mode (α, β). The MAP-teaching dimension of C given L and (α, β), denoted as MAP-TDα,β L (C), is deﬁned as the smallest number d such that there exists a teacher of order d for L, respectively as if there does not exist a teacher for L. The MAP-teaching dimension of C with respect to sampling mode (α, β) is then given by

MAP-TDα,β(C) := min L MAP-TDα,β L (C) ,

where L ranges over all MAP-learners for C. Similarly, the MLE-teaching dimension of C with respect to sampling mode (α, β) is given by MLE-TDα,β(C) := min L MAP-TDα,β L (C) with L ranging over all MLE-learners for C.

The parameter MAP-TDα,β(C) equals the number of observations needed to teach an optimally parameterized learner. It represents an information-theoretic barrier that cannot be broken regardless of how the learner is parameterized. Of course, this parameter will generally be smaller than the parameter MAP-TDα,β L (C) associated with a naturally parameterized learner. We close this section by mentioning the inequality

MAP-TDα,β(C) MLE-TDα,β(C) ,

which (for trivial reasons) holds for each choice of C and (α, β).

3. Basic Results on the MAP-Based Teaching Model

In Ferri et al. (2022), the authors used a more restrictive condition in place of the validity condition. However, as we will see in Section 3.1, in the context of MAP-learners and their teachers, both conditions lead essentially to the same results. In Section 3.2, we discuss two natural monotonicity properties and thereafter, in Section 3.3, we note the equivalence of (O, R)- and the (O, R)-mode and prove the pairwise incomparability of the modes (O, R), (O, R) and (O, R).

Simon and Telle

3.1 Validity and Strong Validity

We will refer to c |= z P(z|c) = 0

as the strong validity condition for the parameters (P(z|c))z Z,c C. This is the condition that the authors of Ferri et al. (2022) had imposed on the c-conditional likelihoods associated with a MAP-learner. We will see that each L satisfying the validity condition has a close relative Lε that satisﬁes the strong validity condition. Here comes the deﬁnition of Lε:

Deﬁnition 9 (ε-Shift) Let L be given by parameters P(c) and P(z|c) with c C and z Z such that the validity condition is satisﬁed but the strong validity condition is not. We say that Lε (with 0 < ε 1/2) is the ε-shift of L if Lε is given by the parameters P(c) and Pε(z|c) where

(1 ε) P(z|c) if z Z+ c (L) ε |Zc\Z+ c (L)| if z Zc \ Z+ c (L)

0 if z Z \ Zc

For convenience, we set Pε(z|c) = P(z|c) if already L satisﬁes the strong validity condition.

Note that Lε satisﬁes the strong validity condition because Pε(z|c) = 0 iﬀz Zc and Zc = {z Z : c |= z}. A learner and its ε-shift are related as follows:

Lemma 10 Let L be a MAP-learner for C whose parameters satisfy the validity condition. Then the following holds for each (α, β) {O, O} {R, R} and all suﬃciently small ε > 0: each teacher for L in sampling mode (α, β) is also a teacher for Lε in sampling mode (α, β).

Proof Suppose that L and Lε are both in sampling mode (α, β). Consider a teacher T for L. We claim that the following holds:

c0, c C : lim ε 0 P α,β ε (T(c0)|c) = P α,β(T(c0)|c) . (8)

This would imply that, for every c0 C and suﬃciently small ε, we have

c0 = arg!maxc C P α,β(T(c0)|c) = arg!maxc C P α,β ε (T(c0)|c) ,

which, in turn, implies that T is a teacher for Lε. We still have to verify (8). This can be done by means of a simple continuity argument. Note ﬁrst that

c C, z Z : lim ε 0 Pε(z|c) = P(z|c) .

Since P α,R ε (T(c0)|c) is a polynomial (and hence a continuous function) in the variables Pε(z|c) with z T(c0), we may conclude that (8) is true in case of β = R. Suppose now that (α, β) = (O, R) and T(c0) = (z1, . . . , zn), which implies that n N+(C, L) and z1, . . . , zn Z+ c (L). The function

P O,R ε (T(c0)|c) =

Pε(zi|c) 1 (Pε(z1|c) + . . . + Pε(zi 1|c))

MAPand MLE-Based Teaching

is a rational function in the variables Pε(zi|c) for i = 1, . . . , n. Hence we can apply the continuity argument again but, in addition, we must rule out that the denominator, 1 (Pε(z1|c) + . . . + Pε(zi 1|c)), converges to 0 when ε approaches 0. This, however, can be ruled out as follows:

2 minc C,z Z+ c (L) P(z|c) and note that 0 < ρ minc C,z Z+ c (L) Pε(z|c). The latter inequality holds because of Pε(z|c) = (1 ε) P(z|c) and ε 1/2.

Because of n N+(C, L), the set {z1, . . . , zn 1} cannot contain all elements of Z+ c (L).

Therefore 1 (Pε(z1|c) + . . . + Pε(zi 1|c) ρ for all i = 1, . . . , n and the limit for ε 0 cannot be equal to 0.

We may therefore conclude that (8) is true in case of (α, β) = (O, R). The proof in case of (α, β) = (O, R) is similar.

Corollary 11 With the notation from Deﬁnition 9, we have

MAP-TDα,β L (C) = MAP-TDα,β Lε (C)

for all suﬃciently small ε > 0.

3.2 Monotonicity Properties

It is clear, intuitively, that adding concepts without adding observations should make the teaching problem harder. Conversely, adding observations without adding concepts should make the teaching problem easier. In this section, we formalize these statements and prove them. All results in this section are formulated in terms of MAP-TD. But the corresponding results with MLE-TD in place of MAP-TD hold as well. We say that (C , Z , |= ) is an extension of (C, Z, |=) if C C , Z Z and, for all c C and z Z, we have that c |= z if and only if c |= z. So far, we used a notation (e.g. MAP-TDα,β(C) instead of MAP-TDα,β(C, Z, |=)) which made a dependence on (C, Z, |=) explicit for C only (because the corresponding Z and the corresponding relation |= were clear from context). In this section, there is some danger of confusion and, consequently, we use a notation which makes the dependence on the whole triple (C, Z, |=) more explicit.

Deﬁnition 12 Let (C , Z , |= ) be an extension of (C, Z, |=) with Z = Z. Let L be a MAPlearner for (C , Z, |= ) with parameters P(c ) > 0 and P(z|c ) for c C and z Z. Set P(C) = P c C P(c). The MAP-learner with parameters P(c)/P(C) and P(z|c) for c C and z Z, denoted by L C, is called the restriction of L to subclass C.

The parameters of a MAP-learner L for (C , Z, |= ) must satisfy the validity condition. Clearly the parameters of L C satisfy the validity condition too. Moreover, for each c C, we have that Z+ c (L C) = Z+ c (L). These observations can be used for showing the following result:

Simon and Telle

Lemma 13 (Concept-Class Monotonicity) With the assumptions and notation as in Deﬁnition 12, the following holds for each sampling mode (α, β):

MAP-TDα,β L C(C, Z, |=) MAP-TDα,β L (C , Z, |= ) .

Proof Let T : C Zα,β be a teacher for L and let T C denote its restriction to subclass C. Clearly the order of T C is upper-bounded by the order of T. It suﬃces to show that T C is a teacher for L C. To this end, we have to show the following:

(a) If β = R then, for all c C, we have that |T C(c)| N+(C, L C).

(b) For all c0 C, c C\{c0}, we have that P(c) P α,β(T C(c0)|c) < P(c0) P α,β(T C(c0)|c0).

Of course, since T is teacher for L, we know that the following hold:

(a ) If β = R then, for all c C , we have that |T(c )| N+(C , L).

(b ) For all c 0 C , c C \{c 0}, we have that P(c ) P α,β(T(c 0)|c ) < P(c 0) P α,β(T(c 0)|c 0).

The following calculation veriﬁes (a) under the assumption that β = R:

|T C(c)| = |T(c)| N+(C , L) = min c C |Z+ c (L)|

min c C |Z+ c (L)| = min c C |Z+ c (L C)| = N+(C, L C) .

Suppose that c0 C and c C \ {c0}. Then (b) can be veriﬁed as follows:

P(c) P α,β(T C(c0)|c) = P(c) P α,β(T(c0)|c) < P(c0) P α,β(T(c0)|c0) = P(c0) P α,β(T C(c0)|c0) .

Here the ﬁrst and the last equation hold because c0 C and therefore T C(c0) = T(c0).

Corollary 14 If (C , Z , |= ) is an extension of (C, Z, |=) with Z = Z , then

MAP-TDα,β(C, Z, |=) MAP-TDα,β(C , Z, |= ) .

Deﬁnition 15 Let (C , Z , |= ) be an extension of (C, Z, |=) with C = C. Let L be a MAP-learner for (C, Z, |=) with parameters P(c) and P(z|c) for c C and z Z. The MAP-learner with parameters P Z (c) = P(c) and

P Z (z |c) = P(z |c) if z Z 0 otherwise ,

denoted by L Z , is called the extension of L to superset Z .

The parameters of a MAP-learner L for (C, Z, |=) must satisfy the validity condition. It is easy to check that, therefore, the parameters of L Z satisfy the validity condition too. Moreover, for each c C, we have that

{z Z : P Z (z |c) > 0} = {z Z : P(z|c) > 0} = Z+ c (L) ,

which implies that N+(C, L Z ) = N+(C, L). These observations can be used for showing the following result:

MAPand MLE-Based Teaching

Lemma 16 (Observation-Set Monotonicity) With the assumptions and the notation as in Deﬁnition 15, the following holds for each sampling mode (α, β):

MAP-TDα,β L (C, Z, |=) MAP-TDα,β L Z (C, Z , |= ) .

Proof Let T : C Zα,β be a teacher for L. It is suﬃcient to show that T is also a teacher for L Z (albeit a teacher for L Z who does not make use of observations in Z \ Z). To this end, we have to show the following:

(a) If β = R then, for all c C, we have that |T(c)| N+(C, L Z ).

(b) For all c0 C, c C \{c0}, we have that P(c) P α,β Z (T(c0)|c) < P(c0) P α,β Z (T(c0)|c0).

Assertion (a), assuming β = R, is obtained by

|T(c)| N+(C, L) = N+(C, L Z ) ,

where the ﬁrst inequality holds because T is a teacher for L. Suppose that c0 C and c C \ {c0}. Assertion (b) is obtained by

P(c) P α,β Z (T(c0)|c) = P(c) P α,β(T(c0|c) < P(c0) P α,β(T(c0)|c0) = P(c0) P α,β Z (T(c0)|c0) ,

where the ﬁrst and the last equation holds because T(c0) Z so that the likelihoods of observations in Z \ Z do not come into play. The inequality in the middle holds because T is a teacher for L.

Corollary 17 If (C , Z , |= ) is an extension of (C, Z, |=) with C = C , then

MAP-TDα,β(C, Z, |=) MAP-TDα,β(C, Z , |= ) .

3.3 A Comparison of the Sampling Modes

We say that the sampling mode (α, β) dominates the sampling mode (α , β ) if, for every concept class C and every MAP-learner L for C, we have that MAP-TDα,β L (C) MAP-TDα ,β L (C). We say they are equivalent if they mutually dominate each other, i.e., if MAP-TDα,β L (C) = MAP-TDα ,β L (C) holds for every choice of C and L. We say, they are incomparable if none of them dominates the other one. We start with an easy observation:

Remark 18 The sampling modes (O, R) and (O, R) are equivalent.

Proof Consider a concept class C and a MAP-learner L for C. Let a Zn be a sequence of k distinct elements with multiplicities n1, . . . , nk, respectively. Denote by A the corresponding multiset. An inspection of (4) shows that the following holds for each c C:

P O,R(A|c) = n! n1! . . . nk! P O,R(a|c) . (9)

Simon and Telle

Let a be a sequence obtained from a by a permutation of the components. Since a also consists of k distinct elements with multiplicities n1, . . . , nk, respectively, equation (9) also holds with a in place of a. It therefore easily follows that a teacher T for L, with L being in sampling mode (O, R), can be converted into a teacher T of the same order for L with L being in sampling mode (O, R), and vice versa:

Suppose that T is given. If T(c) = a, then deﬁne T (c) = A where A is the multiset induced by a.

Suppose that T is given. If T (c) = A then deﬁne T(A) = a where a is an (arbitrarily chosen) sequence containing the same elements as A with the same multiplicities.

It follows from this discussion that MAP-TDO,R L (C) = MAP-TDO,R L (C), which concludes the proof.

Corollary 19 MAP-TDO,R(C) = MAP-TDO,R(C) and MLE-TDO,R(C) = MLE-TDO,R(C).

We now turn our attention to the incomparability results:

Theorem 20 The sampling modes (O, R), (O, R) and (O, R) are pairwise incomparable.

In order to prove the theorem, we will consider triples (C, Z, |=) with C = {c1, c2, c3}, Z = {z1, z2, z3} and ci |= zj for all 1 i, j 3. An important role will be played by concepts of the form c with parameters given by

P(z1|c ) = p + , P(z2|c ) = p and P(z3|c ) = 1 2p . (10)

The following Facts 1 4, which pave the way for the proof of Theorem 20, can be proven by using the derivation rules of analysis. For sake of completeness, these proofs are given in the appendix.

Fact 1: Suppose that 0 | | < p < 1/2. Let c be the concept whose parameters are given by (10). Then P O,R(z1, z2|c ) and P O,R(z1, z2|c ) are both strictly decreasing when | | is increased, which implies that = 0 is the unique maximizer.

Fact 2: Suppose that 0 | | < p < 1/2. Let c be the concept whose parameters are given by (10). Then

P O,R(z1, z2|c ) P O,R(z1, z2|c 0)

= 0 if {0, p2

> 0 if 0 < < p2 1 p < 0 otherwise

Fact 3: Suppose that 0 < p < 1/2. Let c be the concept whose parameters are given by (10). Then

P O,R(z1, z1, z2|c ) P O,R(z1, z1, z2|c 0)

= 0 if 0, 1

> 0 if 0 < < 1

5 1)p < 0 otherwise . (12)

MAPand MLE-Based Teaching

Fact 4: Suppose that 0 < p < 1/2 and 1 t < 1 p

p . Let c(t) be the concept whose parameters are given by

c(t)(z1) = pt , c(t)(z2) = p/t and c(t)(z3) = 1 pt p/t . (13)

Then P O,R(z1, z2|c(t)) is strictly increasing with t.

A couple of more intuitive remarks are in place here. Fact 1 tells us that, in sampling modes (O, R) and (O, R), the better a concept explains observations z1, z2 (in the maximum likelihood sense), the more evenly it splits the available probability mass 2p among them. We will refer to an application of Fact 1 as applying the even-split argument . In sampling mode (O, R), however, the even split does not maximize the likelihood of these observations. The likelihood of z1, z2 becomes larger if the probability assigned to z1 is slightly larger than the probability assigned to z2. See (11). A similar remark applies to the sampling mode (O, R) and the sequence z1, z1, z2. See (12). Fact 4 is concerned with sampling mode (O, R) and a multiplicative decomposition of p2 into pt (the probability assigned to z1) and p/t (the probability assigned to z2) with t 1. According to Fact 4, the likelihood of {z1, z2} becomes larger when the scaling factor t 1 is increased. Note that this is not in contradiction with the even-split argument, because pt+p/t is itself strictly increasing with t so that the even-split argument does not apply. We would furthermore like to note that the c-conditional likelihood of a (multi-)set or sequence of observations becomes larger if one of the relevant c-conditional likelihood parameters is increased while the others are ﬁxed. We refer to this way of arguing as applying the monotonicity argument . Theorem 20 is a direct consequence of the following three lemmas.

Lemma 21 Consider the triple (C, Z, |=) with C = {c1, c2, c3}, Z = {z1, z2, z3} and ci |= zj for all 1 i, j 3. Let L be an MLE-learner for C with parameters given by

P(z|c) c1 c2 c3 z1 p + 1 p + 2 p z2 p 1 p 2 p z3 1 2p 1 2p 1 2p

where 0 < 1 < p2 1 p < 2 = 1

5 1)p < p 0.4.7 Then

MLE-TDO,R L (C) = 3 , MLE-TDO,R L (C) = 2 and MLE-TDO,R L (C) = . (14)

Proof It is obvious that, in any mode of sampling, the concept c2 can be taught by observation z1 and the concept c3 can be taught by observation z2. An inspection of (11) and (12) reveals that

P O,R L (z1, z2|c1) > P O,R L (z1, z2|c3) > P O,R L (z1, z2|c2) ,

P O,R L (z1, z1, z2|c1) > max{P O,R L (z1, z1, z2|c2), P O,R L (z1, z1, z2|c3)} .

7. The constraint p 0.4 has the eﬀect that p 1 p < 1

Simon and Telle

It follows that c1 can be taught in (O, R)-mode (resp. in (O, R)-mode) by the sequence z1, z2 (resp. by the sequence z1, z1, z2). We will argue now that there are no shorter sequences for teaching c1 and that, in (O, R)-mode, c1 cannot be taught at all. An application of the monotonicity argument yields that c1 cannot be taught by a single observation (regardless of the sampling mode). The same remark holds for 2 observations except, possibly, for observations z1, z2. But, by the even-split argument, it is the concept c3 that assigns the highest probability to the sequence (z1, z2) ZO,R resp. to the set {z1, z2} ZO,R. Thus (O, R) is the only sampling mode in which c1 can be taught by 2 observations. It follows that, in (O, R)-mode, c1 cannot be taught at all.8 We may conclude from this discussion that the identities in (14) are valid.

Lemma 21 implies that (O, R) does not dominate (O, R) and (O, R) does not dominate any of the other sampling modes. The next result leads to some more no-domination results:

Lemma 22 Consider the triple (C, Z, |=) with C = {c1, c2, c3}, Z = {z1, z2, z3} and ci |= zj for all 1 i, j 3. Let L be an MLE-learner for C with the parameters P(z|c) given by

P(z|c) c1 c2 c3 z1 p p + p z2 p p p + z3 1 2p 1 2p 1 2p

where 0 < < p2 1 p < p < 1/2. Then

MLE-TDO,R L (C) = MLE-TDO,R L (C) = 2 and MLE-TDO,R L (C) = . (15)

Proof Clearly the concept c2 can be taught by observation z1 and the concept c3 can be taught by observation z2 in any mode of sampling. The concept c1 cannot be taught by a single observation. But it can be taught by the sequence (z1, z2) in (O, R)-mode and by the set {z1, z2} in (O, R)-mode (application of the even-split argument). We ﬁnally discuss teachability of c1 in (O, R)-mode. An application of the monotonicity argument yields that c1 cannot be taught in (O, R)-mode by two observations except, possibly, by the observations (z1, z2) or (z2, z1) in ZO,R. But an inspection of (11) reveals that it is the concept c2 (resp. c3) that assigns the highest probability to (z1, z2) (resp. to (z2, z1)). It follows that, in (O, R)-mode, the concept c1 cannot be taught at all. We may conclude from this discussion that the identities in (15) are valid.

Lemma 22 implies that (O, R) does not dominate any of the other sampling modes. The next result implies (O, R) does not dominate (O, R).

8. Here we make use of the fact that, if Zc = Z for each c C, then P O,R(Z|c) = 1 for each c C. Note that this rules out the possibility of having teaching sets of size 3 = |Z|.

MAPand MLE-Based Teaching

Lemma 23 Consider the triple (C, Z, |=) with C = {c1, c2, c3}, Z = {z1, z2, z3} and ci |= zj for all 1 i, j 3. Let L be an MLE-learner for C with parameters P(z|c) given by

P(z|c) c1 c2 c3 z1 sp p sp + ε z2 p/s p p/s ε z3 1 sp p/s 1 2p 1 sp p/s

where 0 < p < 1

2, 1 < s 1 p

p and 0 < ε < min{1 sp, p/s}. Then

MLE-TDO,R L (C) = 2 < MLE-TDO,R L (C) . (16)

Proof Clearly, the concept c2 can be taught by observation z2 and c3 can be taught by observation z1 in any mode of sampling. It is obvious that c1 cannot be taught by a single observation (regardless of the sampling mode). In (O, R)-mode, the concept c1 cannot be taught by sequences of length 2 because c1 is for none of them the unique maximizer:

P O,R L (z1, z2|c1) = p2 = P O,R L (z1, z2|c2).

P O,R L (z1, z3|c1) < P O,R L (z1, z3|c3) and P O,R L (z2, z3|c1) < P O,R L (z2, z3|c2).9

However, in (O, R)-mode, the concept c1 can be taught by the set {z1, z2}:

Concept c1 distributes the probability mass sp + p/s (slightly) more evenly on z1 and z2 than the concept c3. By the even-split argument, we obtain P O,R({z1, z2}|c1) > P O,R({z1, z2}|c3).

Recall from Fact 4 that c(t), with t 1, denotes the concept which assigns probability pt to z1, probability p/t to z2 and the remaining probability mass to z3. Note that c1 = c(s) and c2 = c(1). According to Fact 4, the function P O,R(z1, z2|c(t)) is strictly increasing with t. Hence P O,R({z1, z2}|c1) > P O,R({z1, z2}|c2).

The identities in (16) are immediate from this discussion.

Putting the above three lemmas together, we obtain Theorem 20.

4. MAP-Based Teaching and Saturating Matchings

Suppose that C is a concept class with observation set Z and consistency relation |=. The bipartite graph G(C) = (C, Z, E) with

E = {(c, z) C Z : c |= z}

is called the consistency graph (associated with C). Let Zα,β with (α, β) {O, O} {R, R} be the notation that was introduced in Section 2.2. The bipartite graph G(C)α,β = (C, Zα,β, Eα,β) with Eα,β = {(c, ζ) C Zα,β : c |= ζ}

9. These are two applications of the monotonicity argument. Note that s + 1

s > 2 for all s > 1.

Simon and Telle

is called the extended consistency graph (associated with C). The graph resulting from G(C)α,β by the removal of the vertex from the second vertex class Zα,β will be denoted by G(C)α,β = . We denote by SMN(G(C)α,β) the smallest possible order of a C-saturating

matching in G(C)α,β. Analogously, SMN(G(C)α,β = ) denotes the smallest possible order of

a C-saturating matching in G(C)α,β = . For ease of later reference, we make the following observation:

Remark 24 Suppose that T : C Zα,β is a mapping which satisﬁes

c, c C : (c |= T(c)) (c = c T(c) = T(c )) . (17)

Then T is of order at least SMN(G(C)α,β). Moreover, if T satisﬁes (17) and is not in the image of T, then T is of order at least SMN(G(C)α,β = ).

Proof If T satisﬁes (17), then T represents a C-saturating matching in G(C)α,β. If additionally is not in the image of T, then T represents a C-saturating matching in G(C)α,β = .

Here is the main result of this section:

Theorem 25 For each sampling mode (α, β), we have

MAP-TDα,β(C) SMN(G(C)α,β) and MLE-TDα,β(C) SMN(G(C)α,β = ) . (18)

Moreover, for (α, β) = (O, R), this holds with equality.

Proof Let L be a MAP-learner for C and let (α, β) denote its sampling mode. Let T be a teacher for L. Recall from (7) that T satisﬁes (17). Moreover, if L is an MLE-learner for C, then T(c) = for all c C. Now an application of Remark 24 yields (18). We move on and prove that MLE-TDO,R(C) SMN(G(C)O,R = ). Suppose that M is a C-

saturating matching in G(C)O,R = that is of order SMN(G(C)O,R = ). For each c C and z Z, let n(z, c) denote the number of occurrences of z in the multiset M(c) and let n(c) = |M(c)|. Consider a learner L with uniform priors (= MLE-learner) and the parameters P(z|c) = n(z,c)

n(c) . Note that these parameters satisfy the validity condition. It suﬃces to show that M represents a teacher for L, i.e., we have to show that

c C : c = arg!maxc C P O,R(M(c )|c) .

To this end, we pick a concept c from C \ {c }, and proceed by case analysis:

Case 1: M(c ) and M(c) contain the same elements of Z (albeit with diﬀerent multiplicities)10. Denote these elements by z1, . . . , zk. Let n := n(c ), ni = n(zi, c ). Then pi := ni/n

10. The multiplicities cannot be the same because M : C ZO,R is a matching.

MAPand MLE-Based Teaching

is the relative frequency of zi in M(c ). Let qi denote the relative frequency of zi in M(c), which implies that q = p. It follows that

P O,R(M(c )|c ) = n! n1! . . . nk!

i=1 pni i and P O,R(M(c )|c) = n! n1! . . . nk!

i=1 qni i .

A straightforward calculation shows that P O,R(M(c )|c ) > P O,R(M(c )|c) iﬀ

i=1 pi log pi

The left-hand side is the Kullback-Leibler divergence (= KLD) between p and q. Since the KLD is non-negative and 0 only if q = p, the condition (19) is satisﬁed.

Case 2: M(c ) contains an element that is not contained in M(c). Then the c-conditional likelihood of M(c ) equals 0.

Case 3: All elements in M(c ) are contained in M(c), but M(c) contains an element that is not contained in M(c ). Then the c-conditional likelihood of M(c ) can be expressed as Pr(E1) Pr(E2|E1) for the following two events:

E1: n(c )-fold c-sampling yields only elements from M(c ).

E2: n(c )-fold c-sampling yields M(c ).

Since M(c) contains an element that is not contained in M(c ), we have Pr(E1) < 1. It follows from the analysis of Case 1 that Pr(E2|E1) is upper-bounded by the c - conditional likelihood of M(c ).

We may conclude from the above discussion that c = arg!maxc C P O,R(M(c )|c). Thus M

can be seen as a teacher for L. It follows that MLE-TDO,R(C) SMN(G(C)O,R = ).

The inequality MAP-TDO,R(C) SMN(G(C)O,R) can be obtained in a similar fashion. We start with a C-saturating matching M in G(C)O,R that is of order SMN(G(C)O,R). If M does not assign to any concept, we can proceed as before. Otherwise, if M(c0) = for some c0 C, we still use a similar reasoning but with a slight modiﬁcation of the parameter collection of the learner L:

The priors are given by setting P(c0) = 1+ε

|C| and by letting the remaining |C| 1 concepts evenly share the remaining probability mass (still almost uniform priors).

The parameters P(z|c) are chosen as before.

We can again view the matching M as a teacher for L. Since P O,R( |c) = 1 for all c C, we obtain arg!maxc C P(c) P O,R( |c) = arg!maxc C P(c) = c0 .

For the remaining concepts, the reasoning is as before provided that ε > 0 s suﬃciently small: this is an easy continuity argument which exploits that the priors converge to the

Simon and Telle

uniform distribution on C if ε approaches 0.

SMN(G(C)O,R) min{SMN(G(C)O,R), SMN(G(C)O,R)}

max{SMN(G(C)O,R), SMN(G(C)O,R)} SMN(G(C)O,R)

SMN(G(C)O,R = ) min{SMN(G(C)O,R = ), SMN(G(C)O,R = )}

max{SMN(G(C)O,R = ), SMN(G(C)O,R = )} SMN(G(C)O,R = ) .

Combining this with Theorem 25 and with Corollary 19, we immediately obtain the following result:

Corollary 26

1. MAP-TDO,R(C) = SMN(G(C)O,R) SMN(G(C)O,R) MAP-TDO,R(C).

2. MLE-TDO,R(C) = SMN(G(C)O,R = ) SMN(G(C)O,R = ) MLE-TDO,R(C).

Hence we get MAP-TDO,R(C) MAP-TDO,R(C) and MLE-TDO,R(C) MLE-TDO,R(C) despite the fact that (O, R) does not dominate (O, R).

5. On Concepts Taught by Labeled Examples

In this section, we will restrict ourselves to triples (C, Z, |=) of the form as described in Example 2, i.e., C is a family of subsets of a domain X, Z = X {0, 1} and |= is given by (2). We will see that, for each triple (C, Z, |=) of this special form and for each sampling mode (α, β) except (O, R), we have MAP-TDα,β(C) = SMN(G(C)α,β). For (α, β) = (O, R), this is already known from Theorem 25. For the other sampling modes, (O, R) and (O, R), it will be shown in Section 5.1, Since the modes (O, R) and (O, R) are equivalent, we see that, for triples of the special form, the MAP-teaching dimensions of C are fully determined by the saturating matching numbers associated with G(C). In Section 5.2 we explore how MAPand MLE-learners are related. For a given collection of conditional likelihoods, it can make much of a diﬀerence whether we commit ourselves to uniform priors or not. However, in the case of optimally parameterized learners, the freedom for choosing a non-uniform prior is of minor importance only: it turns out that the MLE-teaching dimension exceeds the MAP-teaching dimension at most by 1. In Section 5.3, we will see that the MLE-TDO,R(C) is upper bounded by the so-called antichain number of C, by the VC-dimension of C and by the no-clash teaching dimension of C. These upper bounds are then, all the more, valid for all parameters MAP-TDα,β(C) (no matter how the sampling mode (α, β)) is chosen). In Section 5.4, we will show that the saturating matching numbers associated with G(C) (and hence the MAP-teaching dimensions of C) can be computed in polytime.

MAPand MLE-Based Teaching

5.1 Saturating Matching Number Revisited

We start with the two main results of this section.

Theorem 27 Suppose that (C, Z, |=) is of the form as described in Example 2. Then MAP-TDO,R(C) = SMN(G(C)O,R) and MLE-TDO,R(C) = SMN(G(C)O,R = ).

Proof The -direction of the claimed equalities is covered by Theorem 25. We have to show the -direction. We may restrict ourselves to proving MLE-TDO,R(C) SMN(G(C)O,R = )

because the proof for MAP-TDO,R(C) SMN(G(C)O,R) is quite similar and uses the same kind of arguments that we had used in the ﬁnal part of the proof of Theorem 25. Set m = |X|, d+ = SMN(G(C)O,R) and let M : C ZO,R\{ } be a C-saturating matching in G(C)O,R of order d+. For every c C, we set d(c) = |M(c)|. Note that 1 d(c) d+. If d+ = m, then we are done because MLE-TDO,R(C) cannot exceed m. We may assume therefore that d+ m 1. Let 0 < ε 1

2 be a small real number (where the meaning of small will become clear from what follows). For each c C, we set

U0(c) := {(x, b) Z : c(x) = b} , U1(c) := {(x, b) Z : c(x) = b (x, b) / M(c)} (20)

and U(c) = U0(c) U1(c). Note that, for each c C, the set Z partitions into M(c), U0(c) and U1(c). For each c C and each (x, b) Z, we set

P((x, b)|c) =

1 ε d(c) if (x, b) M(c) ε m d(c) if (x, b) U1(c) 0 if (x, b) U0(c) . (21)

Let L be the MLE-learner given by (21). We aim at showing that the matching M : C ZO,R \{ } can be seen as a teacher for L. To this end, it suﬃces to show that the condition

c = c0 C : P O,R(M(c0)|c0) > P O,R(M(c0)|c) (22)

is satisﬁed provided that ε is suﬃciently small. We brieﬂy note that |M(c)|+|U1(c)| = m d+ and ε 1/2, and proceed with two claims which will help us to verify (22).

Claim 1: Call a subset of Z c-rare if it contains a (low probability) element from U(c) while missing a (high probability) element from M(c). Suppose that d d+. Then the probability that d-fold P( |c)-sampling without replacement leads to a c-rare sample is smaller than dε divided by 1 ε

d(c) and, therefore, smaller than 2dd(c)ε.

Proof of Claim 1: The total P( |c) probability mass of U(c) is ε whereas any element of M(c) has a P( , c)-probability of 1 ε

d(c). For k = 1, . . . , d, let Ek be the event that, within trial k, a point from U(c) is sampled although at least one point from M(c) has not been sampled before. It suﬃces to upper-bound the probability of E1 . . . Ed. The probability of Ek is obviously smaller than ε divided by 1 ε

d(c) and therefore smaller

1 ε 2d(c)ε. An application of the union bound yields an additional factor d.

Claim 2: Suppose that d d(c). Then a sample of size d which contains an element from U1(c) is c-rare (because it necessarily must miss an element from M(c)).

Simon and Telle

Setting c = c0 and d = d(c0), we infer from the above claims that P O,R(M(c0)|c0) > 1 2d(c0)2ε. Consider now an arbitrary, but ﬁxed, concept c1 C\{c0}. Then M(c1) = M(c0). We proceed by case analysis:

Case 1: Neither M(c0) M(c1) nor M(c1) M(c0). Then M(c0) is a c1-rare sample. Hence P O,R(M(c0)|c1) < 2d(c0)d(c1)ε.

Case 2: M(c0) M(c1). We apply a symmetry argument. Every sample containing d(c0) elements of M(c1) has the same chance for being obtained from d(c0)-fold P( |c1)-sampling without replacement. Hence

P O,R(M(c0)|c1) d(c1) d(c0)

1 1 d(c1) 1

where the last two inequalities follow from 1 d(c0) d(c1) 1.

Case 3: M(c1) M(c0). We may assume that M(c0) M(c1) U1(c1) because, otherwise, we obtain directly P O,R(M(c0)|c1) = 0. We apply again a symmetry argument. Every sample containing M(c1) and d(c0) d(c1) elements of U1(c1) has the same chance for being obtained from d(c0)-fold P( |c1)-sampling without replacement. Hence

P O,R(M(c0)|c1) m d(c1) d(c0) d(c1)

The latter expression is upper-bounded by 1

2 because 1 d(c0) d(c1) < m d(c1), d(c1) d(c0) 1 m 2 and, therefore, m d(c1) 2.

It becomes obvious from this discussion that condition (22) is satisﬁed provided that ε is suﬃciently small.

Theorem 28 Suppose that (C, Z, |=) is of the form as described in Example 2. Then MAP-TDO,R(C) = SMN(G(C)O,R) and MLE-TDO,R(C) = SMN(G(C)O,R = ).

Proof The -direction of the claimed equalities is covered by Theorem 25. We have to show the -direction. We may restrict ourselves to proving MLE-TDO,R(C) SMN(G(C)O,R = )

because the proof for MAP-TDO,R(C) SMN(G(C)O,R) is quite similar and uses the same kind of arguments that we had used in the ﬁnal part of the proof of Theorem 25. Set m = |X|, d+ = SMN(G(C)O,R = ) and let M : C ZO,R\{ } be a C-saturating matching

in G(C)O,R = of order d+. If d+ = m, then we are done because MLE-TDO,R(C) cannot exceed m. We may assume therefore that d+ m 1. For every c C, we set d(c) = |M(c)|. Note that 1 d(c) d+. We ﬁx for each concept c C a sequence zc 1, . . . , zc m consisting of all elements of Zc subject to the constraint that zc 1, . . . , zc d(c) = M(c), i.e., this sequence must start with M(c). In the sequel, we will specify the parameter set of an MLE-learner

MAPand MLE-Based Teaching

of C. We do this in two stages. In Stage 1, we make a preliminary deﬁnition which already achieves that each c C is a (not necessarily unique) maximizer of P O,R(M(c |c)). In Stage 2, we make some inﬁnitesimal changes of the parameter set (by bringing a small parameter ε > 0 into play) so that, after these changes have taken place, each c C will be a unique maximizer of P O,R(M(c |c)). This would imply that M can be viewed as a teacher for L, which would complete the proof. Details follow. We enter Stage 1 of the parameter construction. Let L be the MLE-learner whose parameters are given by

2 i if 1 i d(c) and z = zc i 2 d(c) m d(c) if d(c) + 1 i m and z = zc i 0 if z Z \ Zc

In other words, given c, L assigns probability mass 2 i to the i-the element of the sequence M(c) and distributes the remaining probability mass, 2 d(c), evenly on the elements of Zc \ M(c). Note that the c-conditional likelihood of an element in M(c) is at least 2 d(c)

while the probability of an element in Zc \ M(c) equals 2 d(c) m d(c) 2 d(c) with equality only if d(c) = m 1. It is easy to determine the c-conditional likelihood of M(c):

P O,R(M(c)|c) = Qd(c) i=1 2 i Qd(c) 1 i=1 2 i = 2 d(c) .

The middle term contains in the numerator the product of the c-conditional likelihoods of zc 1, . . . , zc d(c), respectively. In the denominator, it contains the product of the corresponding normalization factors: if zc 1, . . . , zc j haven been sampled within the ﬁrst j trials, then the remaining probability mass equals 1 Pj i=1 2 i = 2 j. Let us now ﬁx an arbitrary target concept c C and see how the c -conditional likelihood of M(c ) relates to the c-conditional likelihood of M(c ) for some other concept c C \ {c }. We aim at showing that P O,R(M(c )|c) P O,R(M(c )|c ). We may assume that c |= M(c ) because, otherwise, we would obtain P O,R(M(c )|c) = 0, and we were done. For sake of simplicity, we set d := d(c ) and zi := zc i for i = 1, . . . , d. Let us brieﬂy discuss the case that M(c) and M(c ) are equal as sets. Then there exists a permutation σ such that M(c) = zσ(1), . . . , zσ(d). Since M is a matching, σ cannot

be the identity permutation. It follows that P O,R(M(c )|c ) > P O,R(M(c )|c) because (P(zi|c ))i=1,...,d = (2 i)i=1,...,d is a strictly decreasing sequence while (P(zi|c))i=1,...,d (as a non-identity permutation of (2 i)i=1,...,d) is not.11

From now, we assume that M(c) and M(c ) are diﬀerent even when viewed as sets. Let j be the number of z Z occurring in M(c) and in M(c ). We can make the pessimistic assumption that the sequence M(c) starts with z1, . . . , zj because this will lead to the largest conceivable value of P O,R(M(c )|c).12 The remaining observations zj+1, . . . , zd(c) must then be members of Zc \ M(c). Remember that for each z Zc \ M(c) we have that P(z|c) = 2 d(c) m d(c). The term P O,R(M(c )|c) can be expressed as a product of two terms.

11. Compare with Remark 4. 12. This brings the j largest c-conditional likelihoods into play and puts them in the most eﬀective position.

Simon and Telle

The ﬁrst one (resp. second one) is the contribution of the ﬁrst j trials (resp. the last d j trials). Since M(c) starts with z1, . . . , zj, the ﬁrst term is simply T1 := 2 j. The second term has the following form

2 j m j d j

2 j 2 j 2 j

m j 2 j 2 2 j

m j . . . 2 j (d j 1) 2 j

As usual, the numerator contains the product of the c-conditional (here: uniform) likelihoods while the denominator contains the product of the corresponding normalization factors. T2 looks terrifying at ﬁrst glance, but luckily there is a lot of cancellation and T2 can be rewritten as follows:

(m j)d j 1 1 m j 1 2 m j . . . 1 d j 1

= 1 (m j)(m j 1)(m j 2) . . . (m d + 1) .

Remember that d = d(c ) m 1. It follows that m d + 1 2 and therefore

T2 2 (d j) and P O,R(M(c )|c) = T1 T2 2 d

with equality only if either j = d or d = m 1 and j = m 2. Note that j = d if and only if the sequence M(c) starts with the sequence M(c ) = z1, . . . , zd. We enter now Stage 2 of the parameter construction, in which we make some inﬁnitesimal changes of the parameters that we have used so far. In order to distinguish the new parameter collection from the old one, the new parameters are denoted by Pε(z|c). They are deﬁned as follows:

2 i if 1 i d(c) 1 and z = zc i 2 i + ε if i = d(c) and z = zc i 2 d(c) ε

m d(c) if d(c) + 1 i m and z = zc i 0 if z Z \ Zc

The main diﬀerence to the old parameter collection is the extra-bonus ε that c assigns to the last element zc d(c) of the sequence M(c). Now the total probability mass assigned to zc 1, . . . , zc d(c) is by the amount of ε greater than before, so that only probability mass

2 d(c) ε is left for Zc \ M(c). Again, this probability mass is shared evenly among the elements of Zc \ M(c). Here comes the central observation:

Claim: If ε > 0 is suﬃciently small, then the following implications are valid:

P O,R(M(c )|c ) > P O,R(M(c )|c) P O,R ε (M(c )|c ) > P O,R ε (M(c )|c) ,

P O,R(M(c )|c ) = P O,R(M(c )|c) P O,R ε (M(c )|c ) > P O,R ε (M(c )|c) .

Proof of the Claim: The ﬁrst implication is based on a simple continuity argument. The second implication can be veriﬁed as follows. Remember from the discussion in Stage

MAPand MLE-Based Teaching

1 that P O,R(M(c )|c ) = P O,R(M(c )|c) can occur only if either M(c) starts with M(c ) = z1, . . . , zd or if d = m 1 and j = m 2. In the former case, the eﬀect of Pε(zd|c ) = P(zd|c ) + ε and Pε(zd|c) = P(zd|c) will be that

P O,R ε (M(c )|c ) > P O,R(M(c )|c ) = P O,R(M(c )|c) = P O,R ε (M(c )|c) ,

as desired. In the latter case, we have M(c ) = z1, . . . , zm 1 and either M(c) = z1, . . . , zm 2 or M(c) = z1, . . . , zm 2, zm. In the latter case, we obtain

P O,R ε (M(c )|c ) > P O,R(M(c )|c ) = P O,R(M(c )|c) > P O,R ε (M(c )|c) ,

which is again the desired result. Suppose therefore that M(c ) = z1, . . . , zm 1 and M(c) = z1, . . . , zm 2. Here the situation is less clear, because the ε-bonus will affect not only the c -conditional likelihood of M(c ) but also the c-conditional likelihood. We therefore compute both quantities and compare them afterwards. Clearly P O,R ε (M(c )|c ) = 2 (m 1) + ε. The term P O,R ε (M(c )|c) can be expressed as a product of two terms, The ﬁrst one (resp. second one) is the contribution of the ﬁrst m 2 trials (resp. the last trial). Since M(c) = z1, . . . , zm 2, the ﬁrst term clearly equals 2 (m 2) + ε. Note that 2 (m 2) ε is the probability mass remaining for, and evenly shared by, zm 1 and zm. The second term equals therefore

Pε(zm 1|c) 2 (m 2) ε =

2 (m 2) ε /2 2 (m 2) ε = 1

It follows that

P O,R ε (M(c )|c) = 1

2 2 (m 2) + ε = 2 (m 1) + ε

which is less than P O,R ε (M(c )|c ) = 2 (m 1) + ε. This completes the proof of the claim.

The above discussions show that we can view M a teacher for the learner L with parameter collection (Pε(z|c))z Z,c C. This completes the proof of the theorem.

Combining Theorems 27 and 28 with what we already know about saturating matching numbers, we obtain the following result:

Corollary 29 Suppose that (C, Z, |=) is of the form as described in Example 2 and (α, β) = (O, R). Then

MAP-TDα,β(C) = SMN(G(C)α,β) and MLE-TDα,β(C) = SMN(G(C)α,β = ) .

MAP-TDO,R(C) max{MAP-TDO,R(C), MAP-TDO,R(C)} ,

MLE-TDO,R(C) max{MLE-TDO,R(C), MLE-TDO,R(C)} .

Simon and Telle

The ﬁrst assertion of the corollary implies the correctness of the results which are visualized in Fig. 1. The following result provides some supplementary information:

Theorem 30 Let (α, β) and (α , β ) be two diﬀerent sampling modes. There exists a concept class C such that SMN(G(C)α ,β ) = SMN(G(C)α,β).

Proof We present the proof for (α, β) = (O, R) and (α , β ) = (O, R).13 Let X = {x1, . . . , xm}, let Z = X {0, 1}, let Cm be the powerset of X and let |= be given by (2). Let Z2 (resp. Z 2) be the set of all A ZO,R (resp. A ZO,R) such that |A| 2. A simple counting argument shows that |Z 2| < |Z2|. Consider the bipartite graph G with vertex sets Cm and Z2 and with an edge (c, A) if and only if c |= A. Each vertex in Z2 has degree at least D := 2m 2 whereas each vertex in Cm has degree d := 1 + 2m + 1

2(m 1)m. Suppose that m is suﬃciently large such that d D. Fix an arbitrary subset S of Z2. It follows that |Γ(S)| D

so that G satisﬁes Hall s condition. It follows that G admits a Z2-saturating matching, say M. Let C be the set of concepts in Cm having an M-partner. By construction: SMN(G(C)O,R) = 2. For cardinality reasons, namely |C| = |M| = |Z2| > |Z 2|, we have SMN(G(C)O,R) > 2.

Theorem 30 implies that the parameters with diﬀerent colors in Fig. 1 can generally have diﬀerent values.

5.2 MAPversus MLE-Learners

Suppose that L is an MLE-learner for C. Let L be a MAP-learner that diﬀers from L only by having non-uniform priors, i.e., the conditional likelihoods are the same. The following example demonstrates that the gap between MAP-TDα,β L (C) and MAP-TDα,β L (C) can become arbitrarily large.14

Example 4 Let X = {x1, . . . , xm}, Z = X {0, 1}, C = {{x1}, . . . , {xm}} { } and let |= be given by (2). Consider the MLE-learner L be given by the parameters

P((xi, c(xi))|c) = 1 m

for each c C and i = 1, . . . , m. We assume for simplicity that the sampling mode (α, β) of L equals (O, R), but the following reasoning (mutatis mutandis) applies to any other sampling mode as well. Clearly, for each k [m], the concept {xk} can be taught by the single observation (xk, 1). However can only be taught by the full set A0 := {(xi, 0) : i = 1, . . . , m} of observations that is consistent with: as long as some (xk, 0) is missing in a set A A0, we have that P(A| ) = P(A|{xk}) so that is not the unique maximizer

13. The proof for the other choices of (α, β) and (α , β ) is similar. 14. This example uses a concept class, namely singletons plus empty set, which is often used to demonstrate that the classical teaching model from Shinohara and Miyano (1991); Goldman and Kearns (1995) may assign an inappropriately high teaching dimension to a trivial concept class.

MAPand MLE-Based Teaching

of P(A|c). We may conclude from this discussion that MAP-TDα,β L (C) = m. Let L be a MAP-learner that diﬀers from L only by having for a higher prior than for the other concepts in C. Then the concept {xk} can still be taught by the single observation (xk, 1). But now also the concept C can be taught in a trivial fashion by 2Z. We may conclude that MAP-TDα,β L (C) = 1.

In contrast to Example 4, the next result shows that, in case of optimally parameterized learners, the advantage of MAP-learners over MLE-learners is anything but dramatic:

Theorem 31 Suppose that (C, Z, |=) is of the form as described in Example 2 and (α, β) = (O, R). Then

MAP-TDα,β(C) MLE-TDα,β(C) 1 + MAP-TDα,β(C) . (23)

Moreover, there exist concept classes C and C such that

MLE-TDα,β(C ) = MAP-TDα,β(C ) and MLE-TDα,β(C ) = 1 + MAP-TDα,β(C ) . (24)

Proof Clearly MAP-TDα,β(C) MLE-TDα,β(C). In order to obtain (23), it suﬃces therefore to show that MLE-TDα,β(C) 1+MAP-TDα,β(C), or equivalently, that SMN(G(C)α,β = ) 1 + SMN(G(C)α,β). We present the proof for (α, β) = (O, R).15 For sake of brevity, set m := |X|, G = G(C)O,R and d := SMN(G). Since SMN(G = ) m, we may assume that d m 2. Let M : C 2Z be a C-saturating matching of order d in G. If M does not assign to any concept in C, then SMN(G = ) d. Otherwise, if M(c0) = for some c0 C, then we may arbitrarily pick a set A X of size d + 1 and replace the M-partner of c0 by the set B = {(a, c0(a)) : a A}. The resulting matching now witnesses that SMN(G = ) d + 1. We still have to specify concept classes C and C which satisfy (24). As for C , there are plenty of choices, e.g., C = {{xi} : i = 1, . . . , m} satisﬁes

MLE-TDα,β(C ) = MAP-TDα,β(C ) = 1 .

In order to specify an appropriate class C , we assume again that (α, β) = (O, R) and proceed as follows. Let X = {x1, . . . , xm}, let Z = X {0, 1}, let Cm be the powerset of X and let |= be given by (2). Let Z d (resp. Z d) be the set of subsets (resp. non-empty subsets) of Z of size at most d. Consider the bipartite graph G with vertex sets Cm and Z d an edge (c, A) if and only if c |= A. If m is suﬃciently large (while d is kept ﬁxed), G admits a Z d-saturating matching, say M. Let C be the set of concepts in Cm having an M-partner. By construction: SMN(G(C )O,R) = d. For cardinality reasons, namely |C | = |M| = |Z d| > |Z d| 1 = |Z d|, we have SMN(G(C )O,R = ) > d, which implies that

SMN(G(C )O,R = ) = d + 1.

15. The proof for the other choices of (α, β) is similar.

Simon and Telle

5.3 Parameters Bounding MLE-TD from Above

Since MLE-TD can never be smaller than MAP-TD, it follows that MLE-TDO,R(C) is the largest among the parameters occurring in Corollary 29. Hence upper bounds on MLE-TDO,R(C) are, all the more, upper bounds on the other parameters. For this reason, we conﬁne ourselves to MLE-learners and to sampling mode (O, R) in what follows. In order to simplify notation, we will write

2Z instead of ZO,R,

MLE-TD(C) instead of MLE-TDO,R(C),

G+(C) instead of G(C)O,R = .

Among the parameters that bound MLE-TD(C) from above are the antichain number of C, the VC-dimension of C and the so-called no-clash teaching dimension of C. We begin with the deﬁnition of the antichain number:

Deﬁnition 32 (Antichain Mapping and Antichain Number) T : C 2Z is called an antichain mapping for C if the following holds:

1. Each concept c C is consistent with T(c).

2. The sets (T(c))c C form an antichain, i.e.,

c1 = c2 C : T(c1) T(c2) T(c2) T(c1) .

The smallest possible order of an antichain mapping for C is called the antichain number of C and denoted by AN(C).

It is well-known that the antichain number is upper-bounded by the VC-dimension:

Theorem 33 (Mansouri et al. (2022)) Suppose that the concept class C is a family of subsets of a ﬁnite domain X. Then AN(C) VCdim(C).

We proceed with the deﬁnition of the teaching dimension in the so-called no-clash model of teaching:

Deﬁnition 34 (No-clash Teaching Dimension Kirkpatrick et al. (2019); Fallat et al. (2022)) A mapping T : C 2Z is called clash-free on C if it satisﬁes the following:

1. Each c C is consistent with T(c).

2. If c1 = c2 C, then c1 is inconsistent with T(c2) or c2 is inconsistent with T(c1).16

The no-clash teaching dimension of C, denoted as NC-TD(C), is the smallest possible order of a mapping T : C 2Z that is clash-free on C.

16. The situation that c1 is consistent with T(c2) and c2 is consistent with T(c1) would be called a clash of c1 and c2. This explains why the mapping T is called clash-free.

MAPand MLE-Based Teaching

Theorem 35 Suppose that (C, Z, |=) is of the form as described in Example 2. Then MLE-TD(C) AN(C) and MLE-TD(C) NC-TD(C).

Proof Because MLE-TD(C) = SMN(G+(C)), it suﬃces to show that SMN(G+(C)) is upper-bounded by AN(C) and NC-TD(C). An antichain mapping T : C 2Z clearly satisﬁes (17) and does not have in its image. Thus, an application of Remark 24 yields AN(C) SMN(G+(C)). A clash-free mapping T : C 2Z must be of order at least 1. There can be at most one concept c in C such that T(c) = . Suppose that T(c) = . Consider an arbitrary, but ﬁxed, concept c C \ {c}. Since c is consistent with (the empty sample) T(c) and T is clash-free, the concept c must be inconsistent with T(c ). Let us redeﬁne T(c) as a singleton set {(x, b)} such that b = c(x). This modiﬁcation of T is still clash-free and leaves the order of T unchanged. Moreover, after this modiﬁcation, T satisﬁes (17) and does not have in its image. Now another application of Remark 24 yields NC-TD(C) SMN(G+(C)).

The inequality MLE-TD(C) NC-TD(C) had been proven already in Ferri et al. (2022). The proof given there does not make use of saturating matching numbers and is more complicated. Because AN(C) VCdim(C), we immediately obtain the following result:

Corollary 36 Suppose that (C, Z, |=) is of the form as described in Example 2. Then MLE-TD(C) VCdim(C).

5.4 Computational Considerations

We will show in the course of this section that SMN(G+(C)) (and related quantities) can be computed in time poly(|C|, |X|) from a given (ﬁnite) concept class C 2X. The central observation will be that, in order to ﬁnd a C-saturating matching of minimum order in G+(C), we do not need to compute the (possibly exponentially large) bipartite graph G+(C). All pieces of information about G+(C) that we need in the course of the algorithm can be eﬃciently extracted from the much smaller bipartite graph G(C). We start with a lemma that is particularly interesting when we have a bipartite graph whose ﬁrst vertex set, V1, is much smaller than its second vertex set, V2:

Lemma 37 Let G = (V1, V2, E) with E V1 V2 be a bipartite graph. Let O be an oracle that, upon request (v, k) with v V1 and k [|V1|], returns min{deg G(v), k} distinct neighbors of v.17 Then there is an oracle algorithm AO which computes a maximum matching in G and has a time bound that is polynomial in |V1|.

Proof For sake of brevity, we set n = |V1|. Let V 1 V1 be the set of vertices in V1 with less than n neighbors, and let V 1 = V1 \ V 1 be the set of remaining vertices in V1, i.e., the vertices with at least n neighbors. The algorithm AO proceeds as follows:

1. For each v V1, it sends the request (v, n) to O and receives a list of all neighbors if v V 1, resp. a list of n distinct neighbors if v V 1 .

17. The oracle O can be implemented eﬃciently if, for instance, G is represented by the adjacency lists for the vertices in V1 and there is direct access to each of these lists.

Simon and Telle

2. Let G be the subgraph of G that is induced by V 1 and Γ(V 1). Let MAX-MATCH be a standard polynomial-time algorithm for maximum matching computation. AO

applies MAX-MATCH to G and MAX-MATCH returns a maximum matching M in G .

3. AO augments M to a V1-saturating matching in a greedy fashion: for each v V 1 , it inspects the list of n distinct neighbors of v and matches v with the ﬁrst neighbor which had not been matched before.

Note that G has at most n(n 1) vertices. Moreover, among n neighbors of a vertex v V 1 , there must be at least 1 neighbor which is not already matched with another vertex in V1. It easily follows that AO returns a maximum matching in poly(|V1|) time.

With a bipartite graph G = (V1, V2, E), we associate the bipartite graph

G+ = (V1, 2V2 \ { }, E+) with E+ = {(v, B) V1 2V2 \ { } : {v} B E} . (25)

In other words: the pair (v, B) with v V1 and B V2 is an edge in E+ iﬀ, for every v B, the pair (v, v ) is an edge in E.

Theorem 38 Given a bipartite graph G = (V1, V2, E), a V1-saturating matching of minimum order in G+ (resp. an error message if a V1-saturating matching does not exist) can be computed in polynomial time.

Proof We consider ﬁrst the problem of computing a V1-saturating matching of minimum order in G+. Let us ﬁx some notation. For ℓ= 1, . . . , |V2|, let G(ℓ) = (V1, V (ℓ) 2 , E(ℓ)) be the bipartite graph given by

V (ℓ) 2 = {B V2 : 1 |B| ℓ}) and E(ℓ) 2 = {(v, B) V1 V (ℓ) 2 : {v} B E} .

In other words, G(ℓ) is the subgraph of G+ induced by V1 and V (ℓ) 2 . Given G, ℓ [|V2|], k [|V1|] and v V1, it is easy to compute a list of min{deg(v), k} distinct neighbors of v in G(ℓ). It follows from Lemma 37 that, given G and ℓ [|V2|], we can compute in poly(|V1|, |V2|) steps a maximum matching Mℓin G(ℓ). Let ℓ+ be the minimum ℓsuch that Mℓis of size |V1|, respectively ℓ+ = 1 + |V2| if none of the Mℓsaturates V1. If ℓ+ |V2|, then Mℓ+ is the desired V1-saturating matching of minimum order in G+. If ℓ+ = |V2| + 1, we may report error because G+ does not admit a V1-saturating matching.

Corollary 39 Suppose that (C, Z, |=) is of the form as described in Example 2. Then the following objects can be computed in polynomial time:

the bipartite consistency graph G(C) with vertex sets C and Z

the (identical) parameters SMN(G+(C)) and MLE-TD(C)

a C-saturating matching M in G+(C) of order SMN(G+(C))

MAPand MLE-Based Teaching

parameters representing an MLE-learner L for C and a teacher T for L who is of order MLE-TD(C)

Proof Given C, the set Z and the bipartite graph G(C) can clearly be computed in polynomial time. We may now apply Theorem 38 to the bipartite graph G = G(C). Then G+

in Theorem 38 equals G+(C). Hence the algorithm sketched in the proof of Theorem 38 can be used for ﬁnding a C-saturating matching M in G+(C) of minimum order (which is order SMN(G+(C))). As a byproduct, the parameter SMN(G+(C)) is now known as well. As for the speciﬁcation of an appropriate MLE-learner L, we may use the parameter setting that is found in the proof of Theorem 25. As also shown in that proof, M (already known to be computable from C in polynomial time) represents a teacher of order MLE-TD(C) for L. This completes the proof of the corollary.

It is straightforward to extend Corollary 39 from sampling mode (O, R) to other sampling modes, and from MLE-TD to MAP-TD. The main point is to adjust the deﬁnition of G+

in (25) so that G(C)+ becomes identical to G(C)α,β = resp. to G(C)α,β. We omit the details.

Open Problems and Future Work. What are natural parameterizations of MAPor MLE-learners? Does MAP-based teaching of naturally parameterized learners force the teacher to present observations/examples which illustrate the underlying target concept in an intuitively appealing way?

Acknowledgements We would like to thank the reviewers for their very useful comments and suggestions.

Appendix A. Proof of Facts 1 4

Fact 1: Suppose that 0 | | < p < 1/2. Let c be the concept whose parameters are given by (10). Then P O,R(z1, z2)|c ) and P O,R(z1, z2|c ) are both strictly decreasing when | | is increased.

Proof The assertion is obvious for P O,R(z1, z2)|c ) = (p+ )(p ) = p2 2. Consider now the function

h( ) := P O,R(z1, z2|c ) = (p + )(p )

1 p + (p )(p + )

1 p + = 2(1 p)(p2 2)

where the last equation can be obtained by a straightforward calculation. Another straightforward, but tedious, calculation shows that

h ( ) = 4(1 p)(1 2p)

((1 p)2 2)2 .

Hence the function h( ) is strictly increasing for < 0 and strictly decreasing for > 0. It is therefore strictly decreasing when | | is increased.

Simon and Telle

Fact 2: Suppose that 0 < p < 1/2. Let c be the concept whose parameters are given by (10). Then

P O,R(z1, z2|c ) P O,R(z1, z2|c 0)

= 0 if {0, p2

> 0 if 0 < < p2 1 p < 0 otherwise

Proof We set

h( ) := P O,R(z1, z2|c ) = (p + )(p )

and observe that

P O,R(z1, z2|c ) P O,R(z1, z2|c 0) = h( ) h(0)

= (1 p)(p2 2) (1 p )p2

(1 p )(1 p)

= (p2 (1 p) ) (1 p )(1 p) .

The denominator of the latter expression is strictly positive. Moreover

(p2 (1 p) )

= 0 if {0, p2

> 0 if 0 < < p2 1 p < 0 otherwise

which accomplishes the proof of Fact 2.

Fact 3: Suppose that 0 < p < 1/2. Let c be the concept whose parameters are given by (10). Then

P O,R(z1, z1, z2|c ) P O,R(z1, z1, z2|c 0)

= 0 if {0, 1

5 1)p} > 0 if 0 < < 1

5 1)p < 0 otherwise .

Proof Let 0 < δ < 1 be given by = δp and note that

P O,R(z1, z1, z2|c δp) = (p + δp)2 (p δp) = (1 + δ)2 (1 δ) p3 = (1 + δ δ2 δ3) p3 .

It follows that

P O,R(z1, z1, z2|c δp) P O,R(z1, z1, z2|c 0) = δ (1 δ δ2) p3 .

Furthermore

= 0 if δ 0, 1

> 0 if 0 < δ < 1

5 1) < 0 otherwise .

We may conclude from this discussion that (12) is valid.

MAPand MLE-Based Teaching

Fact 4: Suppose that 0 < p < 1/2 and 1 t < 1 p

p . Let c(t) be the concept whose

parameters are given by (13). Then P O,R(z1, z2|c(t)) is strictly increasing with t.

h(t) := P O,R(z1, z2|c(t))

p2 = 1 1 pt + 1 1 p/t = 1 1 pt + t t p

= (t p) + (1 pt)t

(1 pt)(t p) = 2t pt2 p (p2 + 1)t pt2 p .

It suﬃces to show that h(t) is strictly increasing with t. To this end, we compute the ﬁrst derivative:

h (t) = (2 2pt) ((p2 + 1)t pt2 p) (2t pt2 p)(p2 + 1 2pt)

(1 pt)2 (t p)2 .

The denominator is strictly positive. After an application of the distributive law and some cancellation, the numerator has the form

f(t) := p(1 p2)(t2 1) .

Hence the numerator equals 0 for t = 1 and is strictly positive for t > 1. It follows that h(t) with t 1 is strictly increasing.

Frank Balbach. Measuring teachability using variants of the teaching dimension. Theoretical Computer Science, 397(1 3):94 113, 2008.

Baxter S. Eaves, Jr. and Patrick Shafto. Toward a general, scaleable framework for Bayesian teaching with applications to topic models. Co RR, 2016. URL http://arxiv.org/abs/ 1605.07999.

Shaun Fallat, David Kirkpatrick, Hans U. Simon, Abolghasem Soltani, and Sandra Zilles. On batch teaching without collusion. Journal of Machine Learning Research, 23:1 32, 2022.

C esar Ferri, Jos e Hern andez-Orallo, and Jan Arne Telle. Non-cheating teaching revisited: A new probabilistic machine teaching model. In Luc De Raedt, editor, Proceedings of the Thirty-First International Joint Conference on Artiﬁcial Intelligence, IJCAI 2022, pages 2973 2979, 2022. URL https://doi.org/10.24963/ijcai.2022/412.

Ziyuan Gao, Christoph Ries, Hans U. Simon, and Sandra Zilles. Preference-based teaching. Journal of Machine Learning Research, 18(31):1 32, 2017.

Sally A. Goldman and Michael J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20 31, 1995.

Simon and Telle

Sally A. Goldman and H. David Mathias. Teaching a smarter learner. Journal of Computer and System Sciences, 52(2):255 267, 1996.

Brigt Arve Toppe H avardstun, C esar Ferri, Jos e Hernandez-Orallo, Pekka Parviainen, and Jan Arne Telle. XAI with machine teaching when humans are (not) informed about the irrelevant features. In ECML, 2023. To appear.

David G. Kirkpatrick, Hans U. Simon, and Sandra Zilles. Optimal collusion-free teaching. In Aur elien Garivier and Satyen Kale, editors, Proceedings of Machine Learning Research (ALT 2019), volume 98, pages 1 23, 2019. URL http://proceedings.mlr.press/v98/ kirkpatrick19a/kirkpatrick19a.pdf.

Farnam Mansouri, Hans Simon, Adish Singla, and Sandra Zilles. On batch teaching with sample complexity bounded by vcd. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 15732 15742. Curran Associates, Inc., 2022.

Stephen H. Muggleton. Learning from positive data. In Stephen H. Muggleton, editor, Inductive Logic Programming, 6th International Workshop, ILP 1996, volume 1314 of Lecture Notes in Computer Science, pages 358 376. Springer, 1996. URL https://link. springer.com/content/pdf/10.1007/3-540-63494-0_65.pdf.

Patrick Shafto, Noah D. Goodman, and Thomas L. Griﬃths. A rational account of pedagogical reasoning: Teaching by, and learning from, examples. Cognitive Psychology, 71:55 89, 2014. URL http://www.sciencedirect.com/science/article/pii/ S0010028514000024.

Ayumi Shinohara and Satoru Miyano. Teachability in computational learning. New Generation Computing, 8(4):337 348, 1991.

Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & its Applications, 16 (2):264 280, 1971.

Scott Cheng-Hsin Yang and Patrick Shafto. Explainable artiﬁcial intelligence via bayesian teaching. In NIPS 2017 workshop on Teaching Machines, Robots, and Humans, pages 127 137, 2017. URL https://www.scottchenghsinyang.com/paper/Yang Shafto_ NIPS_2017.pdf.

Jerry Zhu. Machine teaching for bayesian learners in the exponential family. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.

Xiaojin Zhu, Adish Singla, Sandra Zilles, and Anna N. Raﬀerty. An overview of machine teaching. Co RR, 2018. URL https://doi.org/10.48550/ar Xiv.1801.05927.

Sandra Zilles, Steﬀen Lange, Robert Holte, and Martin Zinkevich. Models of cooperative teaching and learning. Journal of Machine Learning Research, 12:349 384, 2011.