Journal of Machine Learning Research 25 (2024) 1-45. Submitted 10/22; Revised 11/23; Published 4/24.

Margin-Based Active Learning of Multiclass Classifiers

Marco Bressan (marco.bressan@unimi.it), Dept. of Computer Science, Università degli Studi di Milano, Italy
Nicolò Cesa-Bianchi (nicolo.cesa-bianchi@unimi.it), Dept. of Computer Science, Università degli Studi di Milano, Italy & Politecnico di Milano, Italy
Silvio Lattanzi (silviol@google.com), Google Research
Andrea Paudice (andrea.paudice@unimi.it), Dept. of Computer Science, Università degli Studi di Milano, Italy

Editor: Samory Kpotufe

Preliminary versions of this work appeared in the Proceedings of the 34th Annual Conference on Learning Theory (COLT), PMLR 134:775-803, 2021 (Bressan et al., 2021a) and in Advances in Neural Information Processing Systems 34 (NeurIPS), 34:25231-25243, 2021 (Bressan et al., 2021b).

©2024 Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, Andrea Paudice. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/22-1127.html.

Abstract

We study active learning of multiclass classifiers, focusing on the realizable transductive setting. The input is a finite subset $X$ of some metric space, and the concept to be learned is a partition $\mathcal{C}$ of $X$ into $k$ classes. The goal is to learn $\mathcal{C}$ by querying the labels of as few elements of $X$ as possible. This is a useful subroutine in pool-based active learning, and is motivated by applications where labels are expensive to obtain. Our main result is that, in very different settings, there exist interesting notions of margin that yield efficient active learning algorithms. First, we consider the case $X \subset \mathbb{R}^m$, assuming that each class has an unknown personalized margin separating it from the rest. Second, we consider the case where $X$ is a finite metric space, and the classes are convex with margin according to the geodesic distances in the thresholded connectivity graph. In both cases, we give algorithms that learn $\mathcal{C}$ exactly, in polynomial time, using $O(\log n)$ label queries, where $O(\cdot)$ hides a near-optimal dependence on the dimension of the metric spaces. Our results actually hold for, or can be adapted to, more general settings, such as pseudometric and semimetric spaces.

Keywords: active learning, semimetric space, pseudometric space, convexity, margin

1. Introduction

We study the following active learning problem. Given a finite set $X$ from some domain, we have access to an oracle $O_h$ that, upon receiving $x \in X$, replies with $h(x)$, where $h : X \to [k]$ is an unknown function from some class $\mathcal{H}$. The queries to $O_h$ can be adaptive, that is, each element $x$ queried can depend on the answers to previous queries. The goal is to learn $h$ exactly by making as few queries to $O_h$ as possible. This problem, and active learning more in general, have gained a lot of attention in recent years. The reason is that collecting unlabeled data has become relatively cheap, and the most expensive part has become data annotation (which often requires human intervention). As a result, active learning is now a mainstream
field of machine learning, with a rich literature providing algorithms and bounds for several settings (Dasgupta, 2005; Dasgupta et al., 2005; Balcan et al., 2007; Settles, 2012; Balcan and Hanneke, 2012; Balcan and Long, 2013; Gonen et al., 2013; Hanneke, 2014; Dasarathy et al., 2015; Hanneke and Yang, 2015; Wiener et al., 2015; Beygelzimer et al., 2016; Kane et al., 2017; Hopkins et al., 2020a,b, 2021).

It is well known that active learning may provide exponential savings over passive learning. Consider for instance the class of thresholds $\mathcal{H} = \{\mathbb{I}_{x \le t} : t \in [0,1]\}$ over $X = [0,1]$, and suppose one wants to learn a threshold that, with probability at least $1-\delta$, has error at most $\varepsilon$ against the target threshold $t$ with respect to some unknown distribution $P$ over $X$. In the passive model, one can draw a sample $X$ of $n = \Theta\big(\frac{1}{\varepsilon}\log\frac{1}{\delta}\big)$ points $x$, obtain the pairs $(x, \mathbb{I}_{x \le t}(x))$, and then take any threshold $\hat{t}$ consistent with the labeled sample. However, if one is allowed to first draw $X$ alone (that is, without the labels) from $P$, and then to adaptively query the points of $X$ for their labels, then a simple binary search learns a good threshold by looking at only $O(\log n) = O\big(\log\frac{1}{\varepsilon} + \log\log\frac{1}{\delta}\big)$ labels. This idea can be applied to any $\mathcal{H}$ and any $P$. First, draw a set $X$ of $n = n(\varepsilon, \delta)$ i.i.d. unlabeled points from $P$, where $n(\varepsilon, \delta)$ is the sample complexity of the passive case. Then, learn $h$ exactly over $X$ by making as few queries as possible; ideally only $O(\log n)$, where the constants hidden in the $O(\cdot)$ notation can depend on $\mathcal{H}$, $X$, and other relevant parameters. Clearly, if $\mathcal{H} = [k]^X$ then learning $\mathcal{H}$ even approximately with non-vanishing probability requires $\Omega(|X|)$ queries. Thus the interesting question is finding useful classes $\mathcal{H}$ that can be learned with $O(\log n)$ queries over any $n$-point set $X$, possibly in polynomial time. The present work addresses precisely this question.

As a starting point, consider the classic SVM margin assumption in $\mathbb{R}^m$. For notational convenience, we represent the target classifier $h : X \to [k]$ as a partition $\mathcal{C} = (C_1, \ldots, C_k)$ of $X$. The SVM margin assumption requires every class $C \in \mathcal{C}$ to be separated from $X \setminus C$ by a hyperplane in $\mathbb{R}^m$ with margin $\gamma > 0$. In this case one can learn $\mathcal{C}$ with $O(\log n)$ queries, where the hidden constants are functions of $m$ and $\gamma$ (see Bressan et al., 2020). There are two key ingredients that enable such a query rate: the convexity of the classes and the margin between them. Inspired by these findings, in this work we devise novel, general conditions that allow one to attain a query rate of $O(\log n)$, with constants that we can often prove to be near-optimal, and in polynomial time.

1.1 Our contributions

We study two settings. The first one considers the classic Euclidean case, $\mathcal{X} = \mathbb{R}^m$, and the more general case of finite pseudometric spaces (where the triangle inequality holds, but the distance between distinct points can be zero). The second setting is inspired by algorithms designed for dense classes, such as DBSCAN, and applies to any (finite) semimetric space (where the distance between distinct points is positive, but the triangle inequality may fail), including $\mathbb{R}^m$.

1.1.1 Euclidean spaces and pseudometric spaces

For the case $\mathcal{X} = \mathbb{R}^m$, we introduce a new notion of margin, called the convex hull margin (Definition 1), enabling the margin of each class to be measured in terms of its own diameter and according to a personalized pseudometric. This notion strictly generalizes the spherical margin of Ashtiani et al.
(2016), the ellipsoidal margin of Bressan et al. (2020), and (as a Margin-Based Active Learning of Classifiers consequence) the SVM margin. Using the convex hull margin, we develop a technique called convex hull expansion trick, based on recent results from convex geometry. This technique recovers a constant fraction of any class, with one-sided error, by sampling roughly 1+ 1 of its points. By a simple boosting argument this yields a polynomial-time algorithm that learns all classes using roughly k2 1 + 1 2 log n queries (Theorem 2), without knowing the pseudometrics that yield the margin for each class.1 The dependence on γ is nearly optimal, as we prove that roughly 1 + 1 2 queries are necessary in the worst case (Theorem 3). For the case of finite pseudometric spaces, we generalize the convex hull margin by introducing the one-versus-all margin (Definition 11). Interestingly, this notion captures as special cases standard notions of stability for clustering problems such as k-means or k-centers. We show that, if a classifier has one-versus-all margin γ, then it can be learned with M(γ) poly(k) log n queries via a purely learning-theoretic approach (Theorem 12), where M(γ) measures the dimension of the pseudometric space via packing numbers. We also show that the dependence on M(γ) is essentially optimal and thus M(γ) characterizes the learnability of multiclass classifiers in this setting. Finally, we study the case when the classes are realized by some concept class H over X (that is, when for each class Ci there is a concept hi H such that X hi = Ci). We show that if a certain combinatorial parameter, the coslicing dimension cosl(H), is bounded, then one can learn C with cosl(H) poly(k) log n label queries; otherwise, Ω(n) queries are needed in the worst case (Theorem 21). Moreover we show that, for all concept classes in Rm that are closed under affine transformations and well-behaved in a natural sense, finite coslicing dimension and positive one-versus-all margin are equivalent (Theorem 22). 1.1.2 Finite semimetric spaces The second setting assumes X is a finite set and d is a semimetric over X (that is, a metric without the triangle inequality). For instance d could be the Wasserstein distance between images, the Levenshtein distance between strings, the cosine dissimilarity in document analysis, or the Pearson dissimilarity in bioinformatics. In such a general setting, it is not clear how to define notions of convexity or margin for the elements of a partition C. We fill this gap by introducing (β, γ)-convexity (Definition 23), a property of classes based on the connectivity graph G(ε) where {x, y} X is an edge iffd(x, y) ε. Loosely speaking, a class C is (β, γ)-convex if the subgraph G(ε)[C] is connected and closed under taking nearly-shortest paths. When X = Rm and d is the Euclidean distance, (β, γ)-convexity captures classes that are very far from being convex, but still natural in an intuitive way. Our main result is that (β, γ)-convex classifiers can be learned deterministically with O k2 log n + (6/βγ)dens(X) queries, provided that we know some point (a seed ) from each class (Theorem 24). Here, dens(X) is the density dimension of X (see Gottlieb and Krauthgamer, 2013), a generalization of the doubling dimension that is used to bound the size of packings in metric spaces. This dependence of our exponent on dens(X) is asymptotically optimal, as we prove that Ω(2dens(X)) queries are necessary to learn a (β, γ)-convex classifier in the worst case. 
Thus, dens(X) plays a role similar to that of the dimensionality m in 1. Any upper bound B on the query cost of our algorithms should be intended as min(n, B) since obviously one never needs to query the same point twice. A similar observation holds for our running times, which are all polynomial in n, m, k. Bressan, Cesa-Bianchi, Lattanzi, Paudice Euclidean spaces. The algorithm is completely different from those of Section 1.1.1: rather than sampling and boosting, it computes certain separators between each class C and every other class in turn. To do so, it carefully exploits the structural properties of (β, γ)-convexity by analysing shortest paths and edge cuts between different classes in G(ε). The algorithm runs in time almost linear in the size of the encoding of (X, d) as a weighted graph. 1.1.3 Enriched queries and learning the parameters Finally, we investigate the use of more powerful queries for learning the model parameters. In particular we consider a query seed that, given a set U X and an index i [k], either certifies that Ci U or returns some x Ci \ U. It is easy to see that k seed queries are sufficient to test whether any given k-partition b C coincides with C; by an easy halving/guessing argument, this implies one can learn C without knowing γ (in the first setting) or one of γ and β (in the second setting) by running our algorithms O k log p times where p is the unknown parameter and p a known upper bound on it. For the second setting we give several additional results. First we show that, without seed points, any algorithm needs Ω(n) label queries to learn C in the worst case; and if one allows for a different semimetric for each cluster, then Ω(n) queries are needed to learn C even if seed is allowed. Next we show that, with O(k2) additional seed queries, our algorithms can learn classes that are (β, γ)-convex at different connectivity thresholds ε1, . . . , εk, and thus are dense at different scales (Theorem 26). Moreover we show that O(k log n) seed queries are sufficient and Ω(k log n k ) are necessary to learn ε1, . . . , εk (Theorem 46). Finally, we show how to adapt our algorithms to the case that one or both of β and γ are unknown by using a small number of additional seed queries and incurring a small computational overhead. Again, these adapted algorithms run in almost linear time. 2. Related work Ashtiani et al. (2016) introduced semi-supervised active clustering (SSAC), a setting very similar to ours where the oracle supports same-cluster queries that reveal whether two given points are in the same class. They showed how to exactly recover the optimal k-means clustering with O poly(k) log n queries when each cluster lies inside a sphere centered in the cluster s centroid and well separated from the spheres of other clusters. Bressan et al. (2020) extended these results to clusters separated by arbitrary ellipsoids with arbitrary centers.2 Compared to Bressan et al. (2020), our algorithms for the Euclidean case work under strictly more general conditions and reduce the queries by a factor of roughly mm(1 + 1/γ) m 1 2 . We note that label queries can be simulated (up to a permutation of the classes) with k same-cluster queries, hence our upper bounds work for cluster recovery with an additional multiplicative factor of k. As same-cluster queries can be simulated with two label queries, our lower bounds hold for same-cluster queries as well. 
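To make the two reductions just mentioned concrete, here is a minimal sketch. The oracle interfaces `same_cluster` and `label`, and the `representatives` map, are hypothetical stand-ins introduced only for illustration: a label query is simulated by comparing the point against one stored representative per class discovered so far (at most k same-cluster queries, recovering labels up to a permutation), and a same-cluster query is simulated with two label queries.

```python
# Hedged sketch of the query simulations discussed above.
# `same_cluster(x, y)` and `label(x)` are hypothetical oracle stand-ins.

def label_via_same_cluster(x, same_cluster, representatives):
    """Simulate a label query with at most k same-cluster queries.

    `representatives` maps a pseudo-label to one known member of that class.
    Labels are recovered only up to a permutation of the true class names.
    """
    for pseudo_label, rep in representatives.items():
        if same_cluster(x, rep):
            return pseudo_label
    new_label = len(representatives)   # first point seen from a new class
    representatives[new_label] = x
    return new_label

def same_cluster_via_label(x, y, label):
    """Simulate a same-cluster query with two label queries."""
    return label(x) == label(y)
```

This is exactly why the upper bounds transfer to cluster recovery at the price of a multiplicative factor of k, while the lower bounds transfer to same-cluster queries at the price of a factor of two.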
Various notions of margin are central in active learning and cluster recovery (see Xu et al., 2004; Balcan et al., 2007; Balcan and Long, 2013; Kane et al., 2017; Bressan et al., 2021a). Our coslicing dimension is similar to the slicing dimension of Kivinen (1995) and 2. Thanks to their stronger notion of margin, the results and bounds of Ashtiani et al. (2016) are dimensionfree, and can also be extended to infinite-dimensional Hilbert spaces. Margin-Based Active Learning of Classifiers the star number of Hanneke and Yang (2015). Our lower bounds, like many others, are inspired by a construction by Dasgupta (2005). Our arguments based on packing numbers are similar to those based on the inference dimension of Kane et al. (2017) or the lossless sample compression of Hopkins et al. (2021). A query bound similar to the one given by our convex hull expansion trick, but worse by a factor roughly 2m, can be inferred by adapting arguments of Hopkins et al. (2020b). Combinatorial characterizations of multiclass learning have been proposed in the passive case by Ben-David et al. (1995), Rubinstein et al. (2009), and Daniely and Shalev-Shwartz (2014). Other learning settings related to one-sided and active learning are RPU learning (Rivest and Sloan, 1988) and perfect selective classification (El-Yaniv and Wiener, 2012) see Hopkins et al. (2020a) for a discussion. Concerning learning on graphs, Dasarathy et al. (2015) develop a probabilistic active classification algorithm, called S2 (shortest-shortest-path), whose label complexity depends on the graph s structure. In particular, the query complexity is linear in the size of the boundary of the edge cut between nodes with different labels. Unfortunately, even under (β, γ)-convexity, in GX(ε) this boundary can have size Ω(n). Thiessen and Gärtner (2021) show a deterministic algorithm with label complexity proportional to the size of the shortest path cover of the graph (the smallest set of shortest paths that cover all nodes). Again, in GX(ε) this cover could have size Ω(n) even under (β, γ)-convexity. Active classification on unweighted graph has been also studied by Afshani et al. (2007), Cesa-Bianchi et al. (2010), and Guillory and Bilmes (2011), but only for approximate learning; to learn exactly, their algorithms may need Ω(n) queries. A number of works studied cluster reconstruction under stochastic models using samecluster queries. Mazumdar and Saha (2017a) prove a logarithmic query bound assuming some latent distribution of similarities between points, while Mazumdar and Saha (2017b) obtain a linear bound when the oracle can err with a certain probability. Stochastic block models (Zhang et al., 2014; Gadde et al., 2016) and geometric block models (Chien et al., 2020) have been also considered as generative models for active clustering on graphs. Finally, there is a rich stream of research on approximating the optimal solution to clustering optimization problems using oracles (see Ailon et al., 2018a,b; Gamlath et al., 2018; Mazumdar and Pal, 2017; Saha and Subramanian, 2019). seed and their variants are motivated and used by Hanneke (2009) as positive example queries, by Balcan and Hanneke (2012) as conditional class queries, and by Beygelzimer et al. (2016) and Attenberg and Provost (2010) as search queries. They are also used implicitly by Tong and Chang (2001), Doyle et al. (2011), and Vikram and Dasgupta (2016). 
It also easy to see that seed queries are equivalent to the partial equivalence queries of Maass and Turán (1992) and to the subset plus superset queries of Angluin (1988). 3. Preliminaries and notation Let X be a set. A function d : X 2 R is a metric if for all x, y X 2 we have (i) d(x, y) = 0 iff x = y, (ii) d(x, y) = d(y, x), and (iii) d(x, y) d(x, z) + d(z, y) for all z X. The function is a pseudometric if (i) is replaced by d(x, x) = 0, and is a semimetric if (iii) is removed. Let d be any (pseudo/semi) metric over X. For x X and r 0 let B(x, r, d) = {y X : d(x, y) r} denote the closed ball of radius r centered at x. For any S X, the packing number M(S, r, d) of S in X is the maximum size of any A S such that d(x, y) > r for all distinct Bressan, Cesa-Bianchi, Lattanzi, Paudice x, y A. For any η > 0 let M(η, d) = sup{M B(x, r, d), ηr, d : x X, r > 0}. This is the largest packing number of any ball of radius r in X using balls of radius ηr. Note that M(η, d) 1 for all η, d. For any A X let φd(A) = supx,x A d(x, x ). For any U, S X let d(U, S) = infx U,y S d(U, S). For any X Rm let conv(X) be the convex hull of X. The unit sphere in Rm is Sm 1 = {x Rm : x 2 = 1}. We may omit d if clear from the context. We recall some learning-theoretic facts. A concept class over X is a family H 2X of measurable sets.3 A set X X is shattered by H if for every S X there is h H such that S = X h. The Vapnik-Chervonenkis dimension of H, denoted by vc-dim(H, X), is the size of the largest such set; we write vc-dim(H) when X is clear from the context. The intersection class of H is I(H) = S i N{h1 . . . hi : h1, . . . , hi H}. For a given X X, the smallest concept in I(H) consistent with X is T{h H : h X = X}. An algorithm A learns H with one-sided error ε and confidence δ with r examples if, for any target concept h H and any probability measure P over X, by drawing r independent labeled examples from P the algorithm outputs a concept h h such that P(h \ h) ε with probability at least 1 δ. The input to our problem depends on the setting. For the Euclidean setting, the input set is X X where X = Rm. For the finite semimetric setting, the input is a weighted graph G = (X, E, d) encoding the semimetric d over X. In both cases, the input also contains a pointer to a labeling oracle OC consistent with a k-partition C = (C1, . . . , Ck) of X, so that OC(x) = i if and only if x Ci. In line with previous work we assume k is known and we allow for empty classes. By high probability we mean a probability that, for any a > 0, can be made higher than 1 n a by increasing the cost by no more than a multiplicative factor ca for some universal constant c > 0. Further definitions and notation are given when needed. 4. Learning classifiers in Euclidean and pseudometric spaces In this section we consider the problem of learning a k-classifier over a finite input set X from some space X endowed with a number of metrics or pseudometrics. We start with the special case X = Rm with pseudometrics induced by seminorms, and then we generalize to more abstract settings. The input to our problem always contains the pair (X, OC), and possibly some additional information, such as the margin γ. Depending on the setting, the algorithm may know or may not know the pseudometrics relevant to the classifier. 4.1 The convex hull margin in Rm Let X Rm and let C be a k-partition of X. 
We introduce a novel notion of margin, called convex hull margin, which captures the idea that points not in a certain class should be well separated from that class in terms of the class diameter. Instead of using the Euclidean metric, however, we allow distances to be measured by any pseudometric in Rm, which the algorithm may not know, and which may be specific to each class. We only require that the pseudometric be homogeneous and invariant under translation (that is, that the pseudometric be induced by a seminorm). 3. We will interchangeably consider hypotheses as functions from X to {0, 1} and as subsets of X. Margin-Based Active Learning of Classifiers Definition 1 (Convex hull margin) Let D be the family of all pseudometrics induced by seminorms over Rm, and let X Rm be a finite set. A k-partition C = (C1, . . . , Ck) of X has convex hull margin γ if for every i [k] there exists di D such that: di X \ Ci, conv(Ci) > γ φdi(Ci) Definition 1 has some interesting properties. First, it generalizes both the ellipsoidal margin of Bressan et al. (2020) and the spherical margin of Ashtiani et al. (2016), as D contains the pseudometric induced by Wx 2 for every positive semidefinite matrix W 0, and it generalizes also the classic SVM margin see Lemma 4 below. Second, pseudometrics are versatile and can express, for instance, the Euclidean distance after a projection on a subspace, modeling scenarios where each class only cares about a certain subset of the features. Our main contribution is the design and analysis of Cheat Rec (for Convex Hull Exp Ansion Trick Recovery), a polynomial-time algorithm for actively learning k-classifiers with positive convex hull margin. Theorem 2 Let C be a k-partition of X Rm with convex hull margin γ > 0. Then Cheat Rec(X, OC, γ) outputs C, runs in time poly(k, n, m), and with high probability makes a number of label queries to OC bounded by k2 2O(m) (1 + 1/γ) m 1 2 log(1 + 1/γ) log n. The key ingredient behind Cheat Rec is the convex hull expansion trick, a technique that enables one-sided-error learning without knowing the pseudometrics that define the margin. Section 4.1.1 describes this technique and contains the proof of Theorem 2. To put Theorem 2 in perspective, consider the algorithm of Bressan et al. (2020). Under an ellipsoidal margin of γEL, that algorithm achieves a query bound of roughly m γEL m log n. One can check that an ellipsoidal margin of γEL implies a convex hull margin of γ γEL all γEL 1.4 Hence, in this range, our dependence on γ is better by factors (c1m)m(c2/γ) m+1 2 for some c1, c2 > 0. In fact, our dependence is nearly optimal: Theorem 3 For some a > 0 the following claim holds. For all γ 0, 1 5 , all m 2, and every (randomized) algorithm A, there exists an instance (X, OC) with X Rm such that: 2 , |C| = 2, C has convex hull margin γ, and, in order to return C with constant probability, A must make at least a|X| queries to OC in expectation. Theorem 3 is proven in Section 4.1.2. Between Theorem 3 and Theorem 2 there is a gap of at most 2O(m) poly(k) log(1 + 1 γ ) log n. We believe removing the 2O(m) factor requires deriving tight bounds on M(Sm 1, γ). We also note that, while the definition of convex hull margin could be generalized to arbitrary normed spaces, our algorithm Cheat Rec relies on several technical results (Section 4.1.1) whose generalizability beyond Rm is unclear. Before continuing, we show that the convex hull margin generalizes the SVM margin. 
For any two sets C1, C2 Rm define their multiplicative SVM margin as supu Sm 1(0,1) infx C1,y C2 | u, x y | supx,y C1 C2 x y Note that this is precisely the usual SVM margin after scaling C1 C2 so to have radius 1. 4. Their definition uses squared distances, so the relationship with our margin is 1 + γ 1 + γEL. Bressan, Cesa-Bianchi, Lattanzi, Paudice Lemma 4 If C1, C2 have multiplicative SVM margin γ, then they have convex hull margin γ. Moreover, for every η (0, 1) there exists a partition C = (C1, C2) of a set X R2 having multiplicative SVM margin at most η but convex hull margin γ for all γ > 0. Proof The first claim is immediate from the definition of multiplicative SVM margin. For the second claim let X = C1 C2 where C1 = {(η, 0), (1, 0)} and C2 = {(0, η), (0, 1)}. One can check that the multiplicative SVM margin between C1 and C2 is precisely η. Now, for any x, y C let d1(x, y) = |x2 y2| and d2(x, y) = |x1 y1|. Clearly, both d1 and d2 are induced by seminorms over R2, and d1(C1, C2) = d2(C1, C2) > 0 while φd1(C1) = φd2(C2) = 0. Hence d1(C1, C2) > γφd1(C1) and d2(C1, C2) > γφd2(C2) for all γ > 0. 4.1.1 Cheat Rec and proof of Theorem 2 The idea behind Cheat Rec is to boost learning with one-sided error. Suppose we draw a sample SC of points from some class C that has convex hull margin γ with respect to a pseudometric d. Using SC we would like to find out, say, half of C, by taking all points that are close to SC under d. If we could to this, then we could learn C exactly by repeating the process O(k log n) times, since there are k classes with at most n points each. Unfortunately, this approach has two problems. First, we do not know d. Second, even if we knew d, there may be very few points close to SC. We show that both problems can be overcome as follows. Let K = conv(SC), choose any z K, and let Q be the scaling of K about z by a factor 1 + γ. Since d is homogeneous and invariant under translation (recall that d is induced by a seminorm), then any y X such that d(y, K) γφd(K) is in C as well, hence X Q C. This solves the first problem. The second problem is less trivial and requires choosing carefully |SC| and z. We prove that, if |SC| is large enough and z is (approximately) the center of mass of K, then Q contains half of C with good probability. We call this the convex hull expansion trick. Before continuing, we need some more definitions. Without loss of generality assume K has full rank (otherwise use the subspace spanned by SC, which can be computed in time O(|SC| m)). Let R Rm be any convex body. We denote by µR = R R x dx and R respectively the center of mass and the boundary of R. We say R is in isotropic position5 if µR = 0 and R R x, u 2 dx = 1 for all u Sm 1. We let R = f R( ) 2 where f R is the unique invertible affine transformation such that f R(R) is in isotropic position, and let d R(x, y) = x y R for all x, y Rm. For any two points x, y Rm and any α > 0, let σ(x, y, α) = y + α(x y) be the scaling of x by a factor of α w.r.t. y, and for any A Rm let σ(A, y, α) = x Aσ(x, y, α). Lemma 5 Suppose C Rm has convex hull margin γ > 0. Let SC be a sample of s independent uniform random points from C for s = 2O(m) 1 + 1 2 log2 1 + 1 γ large enough, let z K = conv(SC) be a (random) point such that P d K(z, µK) 1 4, and let Q = σ(K, z, 1 + γ). Then Q (X \ C) = and P |Q C| |C|/2 1 Proof We use the following results (slightly rephrased from their original version): 5. 
Not to be confused with the definition of Giannopoulos (2003), where the assumption R R x, u 2 dx = 1 is replaced by vol(R) = 1. Margin-Based Active Learning of Classifiers Lemma 6 (Kannan et al. 1995) If R Rm is a convex body in isotropic position, then r m B(0, 1) R p m(m + 1)B(0, 1) Theorem 7 (Naszódi et al. 2020) For every convex body R Rm and every γ > 0 there is a polytope P Rm on 2O(m) 1 + 1 2 vertices such that R P σ(R, µR, 1 + γ). Theorem 8 (Kupavskii 2020) The VC dimension of the family of t-vertex polytopes in Rm is at most 8m2t log2 t. First, let us prove that Q (X \ C) = . Let d be any pseudometric that is homogeneous and invariant under translation. Since z K, then any y σ(K, z, 1 + γ) X satisfies d(y, K) γ φd(K). But K conv(C), therefore φd(K) φd(C). Hence d(y, conv(C)) γ φd(C). By the convex hull margin this implies y C. Hence Q X C, which in turn implies Q (X \ C) = . Next we prove that P d K(z, µK) 1 2 3 4 implies P |Q C| |C|/2 1 2. Let t = 2O(m) 1 + 1 2 large enough and let Pt,m be the family of all polytopes on at most t vertices in Rm. By Theorem 8, vc-dim(Pt) = 2O(m) 1 + 1 2 log 1 + 1 Thus, if s = 2O(m) 1 + 1 2 log 1 + 1 γ is large enough and SC is a sample of s independent uniform points from C, then by standard PAC bounds any P Pt,m that is consistent with SC (i.e., such that P SC) satisfies P |P C| 1 4. It remains to show that, if d K(z, µK) 1 2, then there exists such a P that satisfies P Q; the claim then follows by a union bound. By Theorem 7 there is P Pt,m such that K P σ K, µK, 1 + γ 2 ; thus we only need to prove that σ K, µK, 1 + γ 2 σ(K, z, 1 + γ). Observe that, to this end, it is sufficient to show that d K σ(K, z, 1 + γ), σ(K, µK, 1 + γ 2) > 0, since z K as implied by Lemma 6 and by d K(z, µK) 1 2. Now, by the triangle inequality: d K σ(K, z, 1 + γ), σ K, µK, 1 + γ d K σ(K, µK, 1 + γ), σ K, µK, 1 + γ d K σ(K, µK, 1 + γ), σ(K, z, 1 + γ) On the one hand, by standard arguments and by Lemma 6, d K σ(K, µK, 1 + γ), σ K, µK, 1 + γ 2 inf x K x K γ On the other hand, since σ(K, z, 1 + γ) = σ(K, µK, 1 + γ) + γ(z µK), d K ( σ(K, µK, 1 + γ), σ(K, z, 1 + γ)) γ z µK K = γ d K(z, µK) Bressan, Cesa-Bianchi, Lattanzi, Paudice d K σ(K, z, 1 + γ), σ(K, µK, 1 + γ 2 d K(z, µK) which is positive for d K(z, µK) 1 2, concluding the proof. Next, we give a polynomial-time implementation of the convex hull expansion trick of Lemma 5, see Algorithm 1. The minimum volume enclosing ellipsoid (MVEE) of a set P Rm is an ellipsoid E Rm of smallest volume such that P E. A convex body P Rm is in near-isotropic position if the ratio between the smallest radius of a ball containing P and the largest radius of a ball contained in P is at most m. By hit-and-run we mean the hit-and-run algorithm of Lovász and Vempala 2006. Algorithm 1 Expand Hull(SC, 1 + γ) 1: let N = Ω(m2) large enough and ε = O(m 1) small enough 2: compute the MVEE of conv(SC) and put conv(SC) in near-isotropic position 3: for i = 1, . . . , N do 4: Xi = output of poly(m/ε) hit-and-run steps over conv(SC) starting from 0 5: let z = 1 N PN i=1 Xi 6: return σ(SC, z, 1 + γ) Lemma 9 Expand Hull(SC, 1 + γ) can be implemented to run in time poly(|SC|, m). Moreover, its output σ(SC, z, 1 + γ) satisfies P d K(z, µK) 1 4 where K = conv(SC). Proof We start with the running time bound. Let K = conv(SC). Computing the MVEE E of K takes time |SC|m2 ln m+ln ln |SC| , see Khachiyan (1996). 
Putting K in near-isotropic position can be done by computing the affine transformation that maps E into the unit ball B(0, 1), which clearly takes time poly(m). At that point by John s theorem (John, 1985) we have B(0, 1/m) K, hence 0 is at distance at least 1 m from K. If we then execute hit-and-run from 0, we have: Lemma 10 (see Lovász and Vempala 2006, Corollary 1.2) For any ε > 0, the distribution of the random walk after t = Θ m5 ln m ε steps is ε-uniform over K. It remains to implement hit-and-run in time poly(|SC|, m). To this end we need to solve the following problem: given any x K and any u Sn 1, determine the intersection of the ray {x + αu}α 0 with K. This amounts to finding the maximum α 0 such that x + αu is in conv(SC), which can be done via linear programming in time t K = poly(|SC|, m) with polynomial precision. Summarizing, the total time to draw X1, . . . , XN and compute z is: O |SC|m2 ln m + ln ln |SC| + Nt K m5 ln m Setting N = O(m2), ε = O(m 1), and t K = poly(|SC|, m) yields a bound of poly(|SC|, m). We now prove that σ(SC, z, 1+γ) satisfies P d K(z, µK) 1 4. Recall that the uniform probability measure U over a body K is defined by U(K ) = vol(K ) vol(K) for all measurable K K. Margin-Based Active Learning of Classifiers A probability measure P is ε-uniform if |P(K ) U(K )| ε for all measurable K K. Fix η > 0, p > 0, ε η 4(m+1), and N 16(m+1)2 p2η2 . We show that, if X1, . . . , XN are independent ε-uniform points from K and X = 1 N Xi, then: X, µK η 1 p (1) which, for η = 1 2 and p = 1 4, implies the claim above since by construction z = X. To prove (1) we need two ancillary results. Define RK(K) = supx K x K. First, RK(K) m + 1 (2) Consider indeed K in isotropic position, and let K = ϑK where ϑ = vol(K) 1/m, so that vol(K ) = 1. Then, K is in isotropic position according to the definition of Giannopoulos (2003). Define R(K ) = supx K x . Theorem 1.2.4 of Giannopoulos (2003) implies R(K ) (m + 1)LK, where LK is the isotropic constant which, for all u Sm 1, satisfies R K x, u 2 dx = L2 K. Since K = ϑK and R K x, u 2 dx = 1 by the isotropy of K, we have LK = ϑ. Hence R(K ) (m + 1)ϑ, that is, RK(K) m + 1. Second, we claim that if X is ε-uniform over K then EX K 2ε(m + 1). Indeed, since X is ε-uniform over K, then there exists a coupling (X, Y ) with P(X = Y ) ε and Y uniform over K. Since EY K = 0, we have: EX K = E[X Y ] K P(X = Y ) sup x,y K d K(x, y) ε 2RK(K) 2ε(m + 1) where the last inequality is given by (2). We can now prove (1). Consider K in isotropic position; clearly the Xi are still independent and ε-uniform over K, and (1) becomes P X 2 η 1 p. As X 2 EX 2 + X EX 2, proving that EX 2 η 2 and P( X EX 2 η 2) 1 p is sufficient. For the first part, by the above bound on EX K and since ε η 4(m+1), EX 2 = EXi 2 2ε(m + 1) η For the second part, by (2) Xi 2 m + 1 for all i. Therefore, letting Yi = Xi EXi for all i, we have Yi 2 2(m + 1), and thus Yi 2 2 4(m + 1)2. Let Y = 1 N PN i=1 Yi. Since the Yi are independent with EYi = 0, then E Yi, Yj = 0 whenever i = j and therefore: E Y 2 2 = E 1 i,j=1 Yi, Yj = 1 N2 i=1 E Yi 2 2 4(m + 1)2 Using N 16(m+1)2 p2η2 and Jensen s inequality yield E Y 2 2 E Y 2 2 p2η2 4 . Therefore E Y 2 pη 2 , which by Markov s inequality implies P Y 2 > η 2 < p. Substituting Y with X EX completes the proof of (1). We can now prove Theorem 2 by analysing the pseudocode in Algorithm 2. Let a round be a single iteration of the while loop in Cheat Rec. 
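Before the round-by-round analysis, the following is a minimal numerical sketch of the expansion-and-selection step that each round of Cheat Rec performs via Expand Hull. It is a simplification made under two assumptions: the center of mass is estimated by a plain sample mean instead of hit-and-run, and the hull-membership test is a standard LP feasibility check assuming SciPy is available. It illustrates the geometry only, not the guarantees of Lemma 9.

```python
import numpy as np
from scipy.optimize import linprog

def in_hull(x, vertices):
    """LP feasibility check: is x a convex combination of the rows of `vertices`?"""
    s = len(vertices)
    A_eq = np.vstack([vertices.T, np.ones((1, s))])  # sum_i lam_i * v_i = x, sum_i lam_i = 1
    b_eq = np.append(x, 1.0)
    res = linprog(c=np.zeros(s), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.success

def expand_and_select(S_C, X, gamma):
    """Scale conv(S_C) by (1 + gamma) about an estimated center of mass,
    then return the points of X falling inside the expanded hull.
    The center is the sample mean: a crude stand-in for the hit-and-run
    estimate used by Expand Hull (Algorithm 1)."""
    z = S_C.mean(axis=0)
    expanded = z + (1.0 + gamma) * (S_C - z)   # sigma(S_C, z, 1 + gamma), row by row
    return [x for x in X if in_hull(x, expanded)]
```

By the convex hull margin, every point of X selected this way belongs to C; this is the one-sided-error property established in the first part of Lemma 5.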
For the correctness just note that Bressan, Cesa-Bianchi, Lattanzi, Paudice Algorithm 2 Cheat Rec(X, OC, γ) 1: let s = 2O(m) 1 + 1 γ large enough 2: while X = do 3: if |X| ks then query OC to label each point in X and return 4: draw ks i.i.d. uniform random points S from X 5: learn the labels of S by querying OC 6: for i = 1, . . . , k do 7: let Si be the points of S having label i 8: S i = Expand Hull(Si, 1 + γ) 9: compute b Ci = conv(S i ) X via LP 10: label all points of b Ci with label i 11: X = X \ b Ci b Ci Ci for all i [k] by Lemma 9. For the query bound, since |S| = ks some i [k] satisfies |Si| s, so by Lemma 9 with probability at least 1 2 we have | b Ci X| |Ci| 2 . As shown in Lemma 3 of Bressan et al. (2020), this implies that Cheat Rec performs more than 4k log n + 6a k log n rounds with probability at most n a for all a > 0. This in turn implies the query bound by substituting s and noting that each round makes ks queries. For the running time note that drawing S, asking for all labels, and computing the sets Si requires time poly(n, m). By Lemma 9, each call to Expand Hull costs time poly(|Si|, m). Computing conv(S i ) X amounts to solving a polytope membership problem for each x X, which takes time poly(|S i |, m) via linear programming. Since at each round at least one new point is labeled, the algorithm makes at most n rounds, proving that Cheat Rec runs in time poly(n, m) = poly(n, k, m). 4.1.2 Proof of Theorem 3 Let ε = 5γ; clearly ε (0, 1] by the assumption on γ. Let B Rm 1 be the unit (m 1)- dimensional ball centered at the origin, and Y be an ε-packing of B of maximum size. It is well known that |Y | 1 ε m 1, see Vershynin (2018). With our choice of ε this gives: where the second inequality comes from γ 1 5. Let f : Y Rm be defined by: f(y1, . . . , ym 1) = ( p 1 y 2, y1, . . . , ym 1) for all y = (y1, . . . , ym 1) Y . Define X = {f(y) : y Y }, and let d be the Euclidean distance. Note that d(f(y), f(y )) d(y, y ) for all y, y Y . Thus X Sm 1 is an ε-packing of Sm 1 of size n 1+γ 2 . Now choose any x X. Let C1 = {x}, C2 = X \ C1, and C = (C1, C2). We claim that both C1 and C2 have convex hull margin γ under d. For C1 this is trivial since φd(C1) = 0 and d(X \ C1, conv(C1)) > 0. For C2, consider any x X \ {x} and let p be the projection Margin-Based Active Learning of Classifiers of x on the subspace span(x) spanned by x. As the hyperplane orthogonal to span(x) and containing p separates X \ C2 and C2, d(X \ C2, conv(C2)) d(x, p) Let θ be the angle between x and x . Standard trigonometric arguments show that d(x, p) = d(x, x ) sin θ 2, and since θ d(x, x ) > ε, then d(x, p) > ε sin ε 2. As sin a > 0.8 a for all a (0, 1/2], then d(x, p) > 0.4 ε2. We conclude that d(X \ C2, conv(C2)) > 0.4 ε2 too. Since φ(C2) < 2, the convex hull margin of C2 is at least 0.2 ε2 = γ. Now consider an instance (X, OC) obtained by drawing x uniformly at random from X and letting C = (X \ {x}, {x}) as described above. Clearly, any deterministic algorithm that returns C with constant probability must make at least a|X| queries to OC for some universal a > 0. By Yao s principle (see Motwani and Raghavan 1995) this implies the claim. 4.2 The one-versus-all margin This section generalizes the setting of Section 4.1 by letting X be a generic set equipped with a set of pseudometrics. As in Section 4.1, we want a notion of margin between classes. 
However, we cannot express the margin in terms of diameter of convex hulls, since we do not assume X to be a vector space or to have any notion of convexity. Thus, we introduce a notion called one-versus-all margin. We prove that classifiers with positive one-versus-all margin can be learned with O(log n) oracle queries. However, we do not provide a running time analysis; the implementation of our algorithm depends on X and on properties of C, and in general may take time superpolynomial in the size of the input. Definition 11 (One-versus-all margin) Fix k pseudometrics d1, . . . , dk over X. A partition C = (C1, . . . , Ck) of a finite set X X has one-versus-all margin γ with respect to d1, . . . , dk if, for all i [k], we have di(X \ Ci, Ci) > γ φdi(Ci). Note that, unlike the convex hull margin (Definition 1), here d1, . . . , dk are fixed in advance. The reason is that, as said above, X may not be a linear space, C1, . . . , Ck may not be convex, and d1, . . . , dk may not be induced by seminorms. These properties appear to be crucial for learning C without knowing d1, . . . , dk (see Section 4.1). Note that if a clustering C in Rm has convex hull margin γ then it has one-versus-all margin γ, too, since di(X \ Ci, Ci) di(X \ Ci, conv(Ci)) > γ φdi(Ci). In this sense the one-versus-all margin is a generalization of the convex hull margin. Under the one-versus-all margin we prove two main results. First, in Section 4.2.1 we prove that the one-versus-all margin subsumes well-known notions of stability that yield polynomial-time algorithms for center-based clusterings, such as k-means or k-medians. Second, in Section 4.2.2 we give an algorithm, m Rec, that learns C with a query rate of O(log n). Formally, recalling M(γ, ) from Section 3 and letting M(γ) = maxi [k] M(γ, di), we prove: We prove: Theorem 12 Let C be a k-partition of X with one-versus-all margin γ > 0 with respect to d1, . . . , dk. Then m Rec(X, OC, γ, d1, . . . , dk) outputs C while making O M(γ) k log k log |X| queries to OC with high probability. Moreover, for some universal b > 0, every (randomized) algorithm that learns classifiers with one-versus-all margin γ > 0 makes at least b M(2γ) label queries in the worst case. Bressan, Cesa-Bianchi, Lattanzi, Paudice Due to the generality of the setting, m Rec is based on purely learning-theoretical arguments. Thus, unlike what we did for Cheat Rec (Section 4.1), we do not prove running time bounds for m Rec, as they depend on the particular semimetric space at hand. 4.2.1 One-versus-all-margin and stability of center-based clusterings Fix any pseudometric d over X. A partition C = (C1, . . . , Ck) of X is a center-based clustering if there exist k points c1, . . . , ck X, called centers, such that for every i [k] and every x Ci we have d(x, cj) > d(x, ci) for all j = i. In words, every point is assigned to the nearest center. It is well known that computing center-based clusterings that minimize objective functions such as k-means or k-centers is NP-hard in general (Cohen-Addad and Srikanta, 2019). However, those problems become polynomial-time solvable if the optimal solution C meets certain stability properties. We show that two such properties, the α-center proximity of Awasthi et al. (2012) and the (1 + ε)-perturbation resilience of Bilu and Linial (2012), imply positive one-versus-all margin. Let us recall these properties. A function d (not necessarily a pseudometric) is a (1 + ε)-perturbation of d if d d (1 + ε)d. Definition 13 Let C be a center-based clustering. 
C satisfies the α-center proximity property with α > 1 if, for all i [k], for all x Ci and all j = i we have d(x, cj) > α d(x, ci). C is (1+ε)-perturbation resilient with ε > 0 if it is induced by the same centers c1, . . . , ck under any (1 + ε)-perturbation of d. It is known that (1 + ε)-perturbation resilience implies α-center proximity with α = 1 + ε, see (Awasthi et al., 2012). We show: Theorem 14 If C satisfies α-center proximity, then it has one-versus-all margin (α 1)2 Hence, if C satisfies (1 + ε)-perturbation stability, then it has one-versus-all margin ε2 2(ε+2). Proof Consider any cluster Ci with center ci, let x = arg maxx Ci d(x, ci), and choose any x Ci and any y Cj with j = i. If d(x , ci) = 0 then clearly φd(Ci) = 0, and α-center proximity implies d(y, x) = d(y, ci) > α d(y, cj) 0 = γφd(Ci) for all γ > 0. Suppose then d(x , ci) > 0; we consider two cases. Case 1: d(x, ci) = 0. Then d(y, ci) = d(y, x) and d(x , ci) = φd(Ci). We obtain: φd(Ci) = d(x , ci) d(x , ci) + d(ci, y) + d(y, cj) φd(Ci) + d(y, x) + 1 αφd(Ci) + 1 αd(y, x) + 1 from which we infer d(y, x) > α(α 1) α+1 φd(Ci) > (α 1)2 2(α+1)φd(Ci). Margin-Based Active Learning of Classifiers Case 2: d(x, ci) > 0. We start by deriving: d(y, x) d(x, cj) d(y, cj) > d(x, cj) 1 α d(y, x) + d(x, ci) αd(y, x) + d(x, cj) 1 and thus d(y, x) α+1 α > d(x, cj) 1 αd(x, ci), which yields d(y, x) > α α + 1d(x, cj) 1 α + 1d(x, ci) (3) Let β = d(x ,ci) d(x,ci) . Note that φd(Ci) 2βd(x, ci), thus d(x, ci) φd(Ci) 2β . We consider two cases. First, suppose β α+1 α 1. In this case apply d(x, cj) > αd(x, ci) to (3) to obtain: d(y, x) > α2 α + 1d(x, ci) 1 α + 1d(x, ci) = d(x, ci)(α 1) φd(Ci)(α 1)2 Suppose instead β > α+1 α 1. Since we chose x Ci such that d(x , ci) = βd(x, ci), d(x, cj) > d(x , cj) d(x, x ) > αd(x , ci) d(x, ci) + d(x , ci) = αβd(x, ci) d(x, ci) + βd(x, ci) = d(x, ci)((α 1)β 1) Combining this with (3), we obtain: d(y, x) > d(x, ci) α α + 1((α 1)β 1) 1 α + 1 = d(x, ci) α(α 1)β = φd(Ci) α(α 1) > φd(Ci) α(α 1) (α 1) = φd(Ci)(α 1)2 Bressan, Cesa-Bianchi, Lattanzi, Paudice Hence in all cases d(y, x) > φd(Ci) (α 1)2 2(α+1). This concludes the proof. 4.2.2 m Rec and proof of Theorem 12 The idea behind m Rec is a classic one: use the smallest hypothesis consistent with the labeled sample. To be more formal, we need: Definition 15 For any finite X X, any pseudometric d over X, and any γ > 0, the effective concept class with one-versus-all margin γ with respect to d over X is: H(X, d, γ) = {C X : d(X \ C, C) > γ φd(C)} Lemma 16 Let d be any pseudometric over X and H = H(X, d, γ). Then H = I(H). Proof Let C1, C2 H and C = C1 C2. Let y X \ C; without loss of generality we may assume y / C1. Since C C1, then φd(C1) φd(C) and d(y, C) d(y, C1). Moreover, d(y, C1) > γ φd(C1) since C1 H. Therefore: d(y, C) d(y, C1) > γ φd(C1) γ φd(C) Thus C has one-versus-all margin γ. As this holds for all C = C1 C2, then H = I(H). Let d1, . . . , dk be the pseudometrics of Definition 11. In each round, m Rec draws a uniform random sample S from X, labels it and partitions it accordingly into S1, . . . , Sk. Then, for each i [k], it computes the smallest set b Ci H(X, di, γ) consistent with Si and labels all of b Ci as i. By smallest set we mean the intersection of all elements of H(X, di, γ) that include S. It is easy to see that all assigned labels are correct. To prove the upper bounds of Theorem 12 it remains to show that, if S is sufficiently large compared to M(γ), then every round labels a good fraction of X. 
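As a side illustration, here is a brute-force sketch of the two ingredients just described: the membership test for H(X, d, γ) of Definition 15 and the smallest consistent set computed at line 5 of Algorithm 3 below. The enumeration is exponential in |X| and is meant for tiny examples only, consistent with the remark that we do not analyse the running time of m Rec; the pseudometric `d` is assumed to be given as a function and points are assumed hashable (e.g., tuples).

```python
from itertools import combinations

def diam(C, d):
    """phi_d(C): diameter of C under the pseudometric d."""
    return max((d(x, y) for x in C for y in C), default=0.0)

def in_H(C, X, d, gamma):
    """Membership test for H(X, d, gamma): d(X \\ C, C) > gamma * diam(C)."""
    outside = [x for x in X if x not in C]
    if not outside or not C:
        return True
    sep = min(d(x, y) for x in outside for y in C)
    return sep > gamma * diam(C, d)

def smallest_consistent(S, X, d, gamma):
    """Smallest set in H(X, d, gamma) containing S, i.e. the intersection of all
    consistent concepts (well defined by Lemma 16). Exponential brute force."""
    S, X = set(S), list(X)
    hull = set(X)                      # X itself is always consistent
    others = [x for x in X if x not in S]
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            C = S | set(extra)
            if in_H(C, X, d, gamma):
                hull &= C
    return hull
```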
To begin with, we prove that vc-dim(H(X, di, γ)) is bounded by 1 + M(γ). Recall that I(H) denotes the intersection class of H. Algorithm 3 m Rec(X, OC, γ, d1, . . . , dk) 1: while X = do 2: draw a sample S of Θ M(γ) k ln k points uniformly at random from X 3: use OC to label S and partition it into S1, . . . , Sk 4: for i [k] do 5: let b Ci be the smallest set in H(X, di, γ) such that Si b Ci 6: label every x b Ci with i 7: X = X \ i [k] b Ci Lemma 17 Let d be any pseudometric over X and H = H(X, d, γ). Then vc-dim(H, X) 1 + M(γ, d). Proof By Lemma 16 vc-dim(H, X) = vc-dim(I(H), X). Now we use the following results: Margin-Based Active Learning of Classifiers Definition 18 (Kivinen 1995, Definition 5.11) Let X be any set and H 2X . We say that H slices X X if, for each x X, there is h H such that X h = X \ {x}. The slicing dimension of (H, X), denoted by sl(H, X), is the maximum size of a set sliced by H. If H slices arbitrarily large sets, then we let sl(H, X) = . Lemma 19 (Kivinen 1995, Lemma 5.19) vc-dim(I(H), X) sl(H, X). Next, we show that sl(H, X) 1 + M(γ, d); by Lemma 19 this implies the claim. Let S X be any set sliced by H. We show that |S| 1 + M(γ, d). The case |S| 2 is trivial since M(γ, d) 1. Suppose then that |S| 3. Choose a, b S such that d(a, b) = φd(S). Since S is sliced by H, for any x S we have S \ {x} = S C for some C H. Since S \ {x} C, we have d S \ {x}, x d(C, x) and φd(C) φd(S \ {x}). Moreover, d(C, x) > γ φd(C), since x X \ C and C H. Therefore: d S \ {x}, x d(C, x) > γ φd(C) γ φd(S \ {x}) Hence, d(S \ {x}, x) > γ φd(S \ {x}) for all x S. Now, suppose first that γ 1. Since |S| 3, any x S \ {a, b} yields the absurd: φd(S) d(S \ {x}, x) > φd(S \ {x}) since γ 1 = d(a, b) since a, b S \ x = φd(S) by the choice of a, b Hence |S| 2 1 + M(γ, d). Suppose instead that γ < 1. Then, for any two distinct points x, y S, we have d(x, y) > γφd(S). This is trivially true if x = a and y = b; otherwise, assuming x / {a, b}, it follows by the fact that d(x, y) d(S \ {x}, x) > γφd(S \ {x}) = γφd(S), as seen above. Moreover, S is contained in the closed ball B(x, φd(S)) for any x S. Therefore, M(B(x, r), γr, d) |S| for r = φd(S) and some x X. By definition of M(γ, d) this implies that |S| < 1 + M(γ, d). This concludes the proof. We can finally prove Theorem 12. First, we prove the lower bound. By definition of M(2γ), there is a set X of M(2γ) points that, according to some d {d1, . . . , dk}, lies within a ball of radius r > 0, and thus has diameter at most 2r, and such that d(x, y) > 2γr for all distinct x, y X. Now choose x X uniformly at random, and define C = ({x}, X \ {x}). The same argument used in the proof of Lemma 17 shows that d(x, X \{x}) > γ φd(X \{x}), which implies that C has one-versus-all margin γ. Clearly, in expectation over the distribution of C, any algorithm that learns C exactly makes b |X| = b M(2γ) queries for some universal b > 0. By Yao s principle, then, any such algorithm makes b M(2γ) queries on some instance. We turn to the upper bounds. Consider a generic iteration of m Rec and let i [k]. By Lemma 17 and standard generalization bounds, if |S| = Θ 1 ε M(γ) log 1 P Ci \ b Ci ε|X| 1 δ Bressan, Cesa-Bianchi, Lattanzi, Paudice Set ε = 1 2k and δ = 1 2. This makes |S| = Θ M(γ) k log k , and since |Ci \ b Ci| = |Ci| | b Ci|, the probability bound above yields: P b Ci |Ci| |X| which, since b Ci 0, implies: Thus, the expected number of points m Rec removes from X at the end of a generic round is: E b C1 . . . 
b Ck = Standard concentration bounds show that with high probability m Rec makes O(log n) rounds and thus O(M(γ) k log k log n) queries, concluding the proof of Theorem 12. 4.3 One-versus-all classifiers Let X be any domain, X X a finite subset, and C = (C1, . . . , Ck) a partition of X. Let H be any concept class over X. For example, for the ellipsoidal clusters of Bressan et al. (2020) X = Rm and H is the set of all ellipsoids in Rm; for the convex classes of Section 4.1, X = Rm and H is the family of all polytopes in Rm. Recall that C is realized by H if for all i [k] there is some hi H such that Ci = X hi. We start by introducing the coslicing dimension, a combinatorial quantity similar to the star number of Hanneke and Yang (2015) and the slicing dimension of Definition 18. Definition 20 H coslices X X if for all x X there exist h+ x , h x H such that X h+ x = {x} and X h x = X \ {x}. The coslicing dimension of H is cosl(H) = sup{|X| : X is cosliced by H}. If H coslices arbitrarily large sets then we let cosl(H) = . Consider for example X = Rm. If H is the set of linear separators, then cosl(H) = ; if instead H is the set of axis-aligned boxes, it is not hard to see that cosl(H) = 2m. We give two results. First, we show that C is actively learnable if and only if cosl(H) < (Theorem 21). Second, for a wide family of concept classes in Rm, we show that cosl(H) < is equivalent to a positive one-versus-all margin (Theorem 22). Theorem 21 Suppose cosl(H) < . There is an algorithm that, given (X, OC) such that C is realized by H, learns C with O(cosl(H) k log k log |X|) queries to OC with high probability; moreover, any algorithm A makes Ω(cosl(H)) label queries in expectation to learn C. If instead cosl(H) = , then for any algorithm A that learns C there are instances (X, OC) with X arbitrarily large where A makes Ω(|X|) queries in expectation in the worst case. Proof The proof is along the lines of the proof of Theorem 12. For the lower bound, let X be a set cosliced by H. Let x X be uniform random and let C = ({x}, X \ {x}). By definition of cosliced set C is realised by H. The same arguments of the proof of Theorem 12 Margin-Based Active Learning of Classifiers show that any algorithm makes Ω(|X|) queries in the worst case. This implies the lower bounds for both cosl(H) < and cosl(H) = . For the upper bounds, let Pk(X) be the set of all k-partitions of X that are realised by H. For each i [k] define: Hi = C : C = C i (C 1, . . . , C k) Pk(X) Adapt m Rec (Algorithm 3) by (i) drawing S of size Θ(vc-dim(I(Hi), X) k ln k), (ii) letting b Ci be the smallest hypothesis in I(Hi) consistent with Si. If we prove that sl(Hi, X) cosl(H), then the proof of Lemma 17 carries on with vc-dim(I(Hi), X) cosl(H) and yields our bound. To prove that sl(Hi, X) cosl(H), suppose that U = {x1, . . . , xℓ} X is sliced by Hi. By construction of Hi this means that there are ℓclasses C1, . . . , Cℓ, each one realised by H, such that Ci = ({xi}, U \ {xi}) for all i [k]. Thus U is cosliced by H, so |U| cosl(H), and sl(Hi, X) cosl(H). We conclude that the adaptation of m Rec learns C by making with high probability O(cosl(H)k log k log n) queries. For our second result, let us say that H is non-fractal if there exists h H such that both h and X \ h contain some ball of positive radius; this avoids pathological cases such as H containing only affine transformations of Cantor s set. Theorem 22 Let H be a concept class in Rm that is non-fractal and closed under affine transformations. 
There is an algorithm that, given any instance (X, OC) where C is realized by H and has one-versus-all margin γ > 0, learns C by making O (1 + 4/γ)m k log k log |X| queries to OC with high probability. Moreover, for any randomized algorithm A there are instances (X, OC), with X arbitrarily large and C realized by H with arbitrarily small oneversus-all margin, where A makes Ω(|X|) label queries in expectation to learn C. Proof The upper bound follows from Theorem 12, by recalling that in the Euclidean metric the unit ball has covering number N(B(x, 1), ε) (1 + 2/ε)m for all ε > 0 and M(B(x, 1), ε) N(B(x, 1), ε/2). For ε = γ this yields M(γ) (1 + 4/γ)m. We now prove the lower bound. Choose γ > 0 arbitrarily small. Take any h H such that both h and h = Rm \ h contain a ball of positive radius. Then for any ρ > 0 there exists a closed ball B with radius r > 0 such that: x h : d(B, x) ρ Let c be the center of B. Consider a sphere S of radius r that contains x and whose center c lies on the affine subspace x + α(c x), see Figure 1. Let η = supy S\B d(x, y) and let X be an η-packing of S. By making r and η r arbitrarily small we can make X arbitrarily large. Now consider x X \ {x}. Since by construction d(x , x) > η, and as r < r, then x B h. Hence h satisfies X h = X \ {x}. Now, for any x X there is an invertible affine transformation Rx (a rotation) such that Rx (x ) = x. Let then hx = R 1 x hx. Clearly X hx = X \ {x }, and since H is closed under affine transformations, then hx H. Thus for every x X there exists hx H such that X hx = X \ {x}. Now consider the complement h of h. The argument above applies to h, too, hence for every x X there exists an invertible affine transformation Rx such that hx := Rx h Bressan, Cesa-Bianchi, Lattanzi, Paudice Figure 1: The η-packing X of S is in B, and thus in h, except for x that lies in h. By taking B arbitrarily close to x we can make η arbitrarily small and X arbitrarily large. satisfies X hx = X \ {x}. The complement hx of hx thus satisfies X hx = {x}, and since hx = Rx h, then hx H. We conclude that for every x X there exist h x , h+ x H such that X h x = X \ {x} and X h+ x = {x}; thus every 2-clustering C of X in the form C1 = {x}, C2 = X \ {x} is realized by H. This immediately implies that any algorithm makes Ω(|X|) queries to learn C on some instance (X, OC). As we can make X arbitrarily large, this completes the proof. Note that Theorem 22 applies to many fundamental concept classes for instance linear separators, ellipsoids, polytopes, and more in general convex bodies (bounded or not, and possibly degenerate), as well as disjoint unions of those concepts. 5. Learning classifiers in finite semimetric spaces In this section we consider learning classifiers in (finite) semimetric spaces. The use of semimetrics is common in domains where the notion of distance is strongly non-Euclidean (see Gottlieb et al., 2017). Classic examples include the Wasserstein distance in vision, the Levenshtein distance in string matching, the cosine dissimilarity in document analysis, the Pearson dissimilarity in bioinformatics. In all these cases, the notions of convexity and separability with margin are lost, or exist only in certain generalized forms, hence the techniques of Section 4.1 do not apply anymore. To fill this gap, we introduce a novel notion of class convexity that can be applied to any finite semimetric space and exploited algorithmically. Let X be a finite set and d a semimetric over X. 
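As a quick sanity check that such dissimilarities are genuinely semimetrics, the snippet below (assuming NumPy) gives three vectors on which the cosine dissimilarity is symmetric and positive between distinct points yet violates the triangle inequality.

```python
import numpy as np

def cosine_dissimilarity(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0]) / np.sqrt(2)
c = np.array([0.0, 1.0])

# d(a, c) = 1, but d(a, b) + d(b, c) is about 0.586: the triangle inequality
# fails, so cosine dissimilarity is a semimetric rather than a metric.
print(cosine_dissimilarity(a, c),
      cosine_dissimilarity(a, b) + cosine_dissimilarity(b, c))
```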
We assume that (X, d) is given as a weighted graph G = (X, E, d) where {x, y} E if and only if d(x, y) > 0. We would like to use G to define our abstract notion of convexity. We start by considering geodesic convexity (Pelayo, 2013), a well-known generalization of Euclidean convexity. A set C X is geodesically convex if every shortest path in G between pairs of vertices of C lies in C. Unfortunately, this condition is too lax. To see this, let X consist of n distinct points on a circle with the Euclidean metric. In G, the shortest path between any two points x, y is the edge {x, y}. Thus, any C X will be geodesically convex, hence H = [k]X and Ω(n) queries are needed to learn C in the worst case. However, a variant of this approach gives a suitable notion of convexity. For any ε > 0 let Gε be the unit disk graph of X, d ε ; this is the simple graph where x, y X are connected if and only if d(x, y) ε. For any simple Margin-Based Active Learning of Classifiers graph G = (V, E) and any x, y V denote by d G(x, y) the shortest-path distance between x and y in G. Figure 2: Left: a set X R2. Middle and right: the graphs Gε1 and Gε2 where ε1 < ε2 are the radii of the outer class C1 and the inner class C2. The partition is (β, γ)-convex for β .5 and γ .1. Definition 23 Let β, γ (0, 1]. A k-partition C = (C1, . . . , Ck) of X is (β, γ)-convex if there is ε > 0 such that for each i [k] the following properties hold: 1. connectedness: Gε[Ci] is connected 2. local metric margin: for all x, y X, if x Ci and y / Ci then d(x, y) > βε 3. geodesic convexity with margin: if x, y Ci, then in Gε any simple path between x and y of length at most (1 + γ)d Gε(x, y) lies in Ci The smallest value of ε satisfying the three properties is called the radius of the classes.6 Figure 3: Left: a toy point set. Right: the graph Gε[X] and a valid (β, γ)-convex partition for X for any β < 1 2 and γ > 0 (classes encoded by the color of the points). Although definition 23 might seem artificial at first sight, it captures pretty interesting scenarios. Indeed, (β, γ)-convex classes can be strongly non-convex in Rm, see Figure 3. Section 5.3 gives a more elaborate version of Definition 23 allowing for classes with different radii, as in Figure 2. These examples suggest that one can view (β, γ)-convexity as a way of translating density into convexity. Moreover, we can show that (β, γ)-convexity becomes uninteresting if we drop any one of the conditions of Definition 23. Let indeed X = R2, let k = 2, and let d be the Euclidean distance. We use Figure 4 for reference. 6. Actually, our algorithm in Section 5.2 works with any such ε; we use the minimum to disambiguate. Bressan, Cesa-Bianchi, Lattanzi, Paudice Figure 4: Example of classes that sastisfy all properties of Definition 23 except: connectivity (top), local metric margin (middle), geodesic convexity with margin (bottom). Empty nodes are in C1, filled nodes are in C2. The edges shown are those of Gε[X]. Removing connectivity. Choose any β, γ 1. Let X consist of disjoint subsets, each one with diameter ε, and sufficiently far from each other. Label any subsets as C1 and the rest as C2. Note that the second and third property in Definition 23 are satisfied. Removing local metric margin. Choose any γ 1. Let C1 consist of collinear points spaced by ε, and the same for C2, so that two extremal points of C1 and C2 are at arbitrarily small distance δ < ε. Note that the first and third property in Definition 23 are satisfied. Removing geodesic convexity with margin. Choose any β < 1 2. 
Let X consist of collinear points spaced by ε/2, with the points of C1 and C2 interleaved. Note that the first and second properties in Definition 23 are satisfied, but the third one is violated for any γ = ω(1/n): take the two extreme nodes of C1 and modify their shortest path so that it passes through a point of C2.

Our main result is Learn BGC (Algorithm 4 in Section 5.2), an efficient active learning algorithm for (β, γ)-convex classifiers.

Theorem 24 Let C = (C1, . . . , Ck) be a (β, γ)-convex partition of X with radius ε, and let s = (s1, . . . , sk) where si ∈ Ci for each i. Then Learn BGC(G, β, γ, ε, s, OC) learns C exactly in time O(k^2(n + m)) using O(k^2 log n + k^2 M(βγ/(2+γ))) = O(k^2 log n + k^2 (6/βγ)^dens(X)) queries.

In many aspects, Theorem 24 cannot be significantly improved: as we show in Section 5.6, the dependence on M and dens is essentially tight, and any algorithm needs Ω(n) queries if the seed nodes s1, . . . , sk are not available or if one changes the definition of (β, γ)-convexity to allow for a different semimetric for each class.

To relax the assumptions of Theorem 24, we introduce seed queries. For any S ⊆ X and i ∈ [k], seed(S, i) returns an arbitrary point si ∈ Ci ∩ S, or nil if Ci ∩ S = ∅. We show that, at the price of some seed queries, one can learn C without knowing ε or s1, . . . , sk. In fact, if we adopt a hereditary version of the geodesic convexity property, then we can even allow each class Ci to have its own radius εi, which captures classes that are dense at different scales, as in Figure 2. Formally:

Definition 25 Let β, γ ∈ (0, 1]. A partition C = (C1, . . . , Ck) of X is hereditary (β, γ)-convex if for some ε = (ε1, . . . , εk) ∈ R^k_≥0 the following properties hold for all i ∈ [k]:
1. connectedness: Gεi[Ci] is connected
2. local metric margin: for all x, y ∈ X, if x ∈ Ci and y ∉ Ci, then d(x, y) > βεi
3. hereditary geodesic convexity with margin: for any ε ≥ εi, if x, y ∈ Ci and d_Gε(x, y) < ∞, then in Gε any simple path between x and y of length at most (1 + γ) d_Gε(x, y) lies in Ci

Our main result is Learn HBGC (Algorithm 10 in Section 5.3), an efficient algorithm based on label and seed queries to learn hereditary (β, γ)-convex classifiers.

Theorem 26 Let C be hereditary (β, γ)-convex. Then Learn HBGC(G, β, γ, ε, OC) learns C exactly, has the same runtime as Learn BGC, and uses the same number of label queries as Learn BGC plus O(k^2) seed queries.

Next, we investigate the cost of learning C when the model parameters are unknown. First we show how, when β and/or γ are unknown, one can learn a (hereditary) (β, γ)-convex classifier C with a small overhead in terms of queries and running time. Formally, we prove:

Theorem 27 Suppose C is (hereditary) (β, γ)-convex. If only one of β, γ is unknown, then one can learn C in time O((k^2 + log(2/p))(n + m)) using O(k^2 log n + (k^2 + log(2/p)) M(βγ/2)) label queries and 3k^2 + 2 log(2/p) seed queries, where p ∈ {β, γ} is the unknown parameter. If both β, γ are unknown, then one can learn C in time O((k^2 + log^2(2/βγ))(n + m)) using O(k^2 log n + (k^2 + log^2(2/βγ)) M(βγ/2)) label queries and 3k^2 + 2 log(2/βγ) seed queries.

Finally, we show one can efficiently learn the radii of a hereditary (β, γ)-convex classifier.

Theorem 28 The radii ε1, . . . , εk can be learned using O(k log n) seed queries in time O(m α(m, n) + kn log n), where m = |E| and α(m, n) is the functional inverse of the Ackermann function.7 Moreover, any (randomized) algorithm needs Ω(k log(n/k)) seed and/or label queries to learn the radii of all classes with constant probability.

7. For all practical purposes, α can be considered constant. For instance, α(m, m) ≤ 4 for all m ≤ (1/8) · 2^(2^(2^65536)).
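For concreteness, the basic objects used throughout Section 5 are the threshold graphs Gε and the hop distances d_Gε. The following minimal Python sketch shows one way to realize these two primitives when the finite semimetric is given as a distance matrix; the representation and the function names (threshold_graph, hop_distances) are ours and are meant only as a reading aid, not as the paper's implementation.

```python
from collections import deque

def threshold_graph(dist, eps):
    """Adjacency lists of G_eps: x ~ y iff dist[x][y] <= eps, for x != y.

    `dist` is a symmetric n x n matrix (list of lists) over points 0..n-1;
    it is an illustrative stand-in for the paper's weighted graph G = (X, E, d).
    """
    n = len(dist)
    adj = [[] for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):
            if dist[x][y] <= eps:
                adj[x].append(y)
                adj[y].append(x)
    return adj

def hop_distances(adj, source):
    """BFS hop distances d_{G_eps}(source, .); unreachable nodes get infinity."""
    d = [float("inf")] * len(adj)
    d[source] = 0
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if d[y] == float("inf"):
                d[y] = d[x] + 1
                queue.append(y)
    return d
```

In this illustration, the graph Gβε used by the routine MBS below is simply threshold_graph(dist, beta * eps) restricted to the relevant vertices.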
5.1 Additional notation

We denote by G = (V, E) a generic unweighted graph. We denote by ρ(G) the number of connected components of G; for any U ⊆ V we denote by G[U] the subgraph of G induced by U, and by Γ_G(U) the set of edges of G with exactly one endpoint in U. An edge in Γ_G(U) is called a cut edge of U. We often treat such edges as oriented, writing (x, y) ∈ Γ_G(U) to specify that x ∈ U and y ∉ U.

We assume the algorithm is given a vector s = (s1, . . . , sk) of seed nodes si ∈ Ci for each i ∈ [k], and that the radius ε is known. Without s any algorithm needs Ω(n) queries, see Section 5.6. These assumptions may be avoided by introducing seed queries: for any S ⊆ X and i ∈ [k], seed(S, i) returns an arbitrary point si ∈ Ci ∩ S, or nil if Ci ∩ S = ∅. Given a bipartition (S, X \ S) of X and i ∈ [k], with two seed queries one can check whether Ci is cut by the partition, yielding a kind of separation oracle.

Recall M(η) from Section 3. Following Gottlieb et al. (2017), we define the density constant of X as µ(X) = M(1/2) and the density dimension of X as dens(X) = log_2 µ(X).8 It is easy to see that M(η) ≤ µ(X)^⌈log_2(1/η)⌉ ≤ (2/η)^dens(X) for every η > 0.

8. While Gottlieb et al. (2017) uses open balls, we use closed balls.

5.2 Learn BGC and proof of Theorem 24

This section describes Learn BGC and proves the correctness and query bounds of Theorem 24. The proofs of the running time bounds require some care, and are deferred to Section 5.4. To simplify the notation we omit G, β, γ, OC from the parameters passed to the algorithms and their subroutines. Learn BGC first computes the graph Gε and then invokes a second algorithm, Learn Class, for which we prove the following guarantee:

Lemma 29 Learn Class(Gε, ε, s, i) returns Ci using O(k log n + k M(βγ/(2+γ))) queries.

The correctness and query bounds of Theorem 24 follow immediately from Lemma 29 and the pseudocode of Learn BGC below. Hence the rest of the section deals with Learn Class and the proof of Lemma 29.

Algorithm 4 Learn BGC(ε, s)
1: compute Gε
2: for i = 1, . . . , k do
3: Ci := Learn Class(Gε, ε, s, i)
return (C1, . . . , Ck)

5.2.1 Margin-based separation, and binary search on shortest paths.

We begin with a simple routine MBS (for Margin-Based Separator) that is useful for learning the labels of any set of small radius.

Algorithm 5 MBS(Z, ε)
1: Z1 := · · · := Zk := ∅
2: for each connected component H of Gβε[Z] do
3: query OC for the label i of any x ∈ V(H)
4: add V(H) to Zi
return (Z1, . . . , Zk)

Lemma 30 For any Z ⊆ X such that Z ⊆ B(z, r) for some z ∈ X and r < ∞, MBS(Z, ε) returns (Z ∩ C1, . . . , Z ∩ Ck) using at most M(βε/r) queries.

Proof If H is a connected component of Gβε[Z] then d(x, y) ≤ βε for all {x, y} ∈ E(H), hence by the local metric margin V(H) ⊆ Ci for some i. This proves the correctness. For the query bound, note that the queried vertices form a set P that is independent in Gβε[Z], thus d(x, y) > βε for all distinct x, y ∈ P. By definition of M(·), this yields |P| ≤ M(βε/r).

Next, we give a condition for finding efficiently a cut edge of Ci. A path π = (x1, . . . , x|π|) is Ci-prefixed if there exists j* ≤ |π| such that xj ∈ Ci if and only if j ∈ {1, . . . , j*}.

Lemma 31 Let Ci ⊆ R ⊆ X be such that Gε[R] is connected. Then in Gε[R] any shortest path between any si ∈ Ci and any s ∈ R \ Ci is Ci-prefixed.

Proof Suppose Gε[R] contains a shortest path π between si ∈ Ci and s ∈ R \ Ci that is not Ci-prefixed. Then some prefix π′ of π is a shortest path between si ∈ Ci and some s′ ∈ Ci that intersects R \ Ci.
Now observe that π′ is a shortest path in Gε, too. This holds because any shortest path between si and s′ in Gε lies in Gε[Ci] by geodesic convexity, and thus in Gε[R] as Ci ⊆ R. By geodesic convexity this implies π′ ⊆ Gε[Ci], a contradiction.

Finally, in a Ci-prefixed simple path, a cut edge of Ci can be found via binary search, yielding a routine Find Cut Edge with the following guarantees, whose proof is straightforward:

Lemma 32 Given a simple Ci-prefixed path π, Find Cut Edge(π) returns the unique cut edge of Ci in π using O(log n) queries.

Algorithm 6 Find Cut Edge(π(si, sh))
1: if |π(si, sh)| = 1 then return π(si, sh)
2: choose a median node x in π(si, sh)
3: if OC(x) = i then return Find Cut Edge(π(x, sh)) else return Find Cut Edge(π(si, x))

5.2.2 Class separators

We introduce the notion of class separator, which is at the heart of Learn Class.

Definition 33 Let R ⊆ X and Ci, Cj ∈ C. A partition (Si, Sj) of R is an (i, j)-separator of R if Si ∩ Cj = Sj ∩ Ci = ∅.

A class separator is similar to half-spaces in abstract closure systems (Seiffarth et al., 2019), which require Ci ⊆ Si and Cj ⊆ Sj. We use the weaker condition Si ∩ Cj = Sj ∩ Ci = ∅ because in some of our algorithms R ⊊ X, and thus Cj ⊆ R might not hold. However, we will always ensure that Ci ⊆ R, and thus Ci ⊆ Si, where Ci is the class we are learning. Therefore, if we could compute an (i, j)-separator for all j ≠ i, then we could compute Ci as a simple intersection.

To compute (Si, Sj) efficiently, suppose we know a cut edge (ui, uj) between Ci and Cj. First, we compute Z = {x ∈ R : d_Gε(x, ui) < 1/γ} and learn Z ∩ Ci using MBS (Algorithm 5). For every x ∈ R \ Z, instead, we show that the condition d_Gε(x, ui) ≤ d_Gε(x, uj) tells whether to place x in Si or Sj without making queries:

Lemma 34 Let (ui, uj) ∈ Gε be a cut edge between Ci and Cj, and suppose 1/γ ≤ d_Gε(ui, x) < ∞. If d_Gε(ui, x) ≤ d_Gε(uj, x) then x ∉ Cj, and if d_Gε(ui, x) ≥ d_Gε(uj, x) then x ∉ Ci.

Proof Assume d_Gε(ui, x) ≤ d_Gε(uj, x) and x ∈ Cj; we show this leads to a contradiction. Consider the path π = π(x, ui) + (ui, uj). First, π is a simple path, as otherwise uj ∈ π(x, ui), which implies d_Gε(uj, x) ≤ d_Gε(ui, x) − 1, contradicting the assumption. Second:

|π| = d_Gε(x, ui) + 1
    ≤ d_Gε(x, uj) + 1        since d_Gε(ui, x) ≤ d_Gε(uj, x)
    ≤ (1 + γ) d_Gε(x, uj)    since d_Gε(x, uj) ≥ d_Gε(x, ui) ≥ 1/γ

Thus, π is a simple path between two points of Cj, with length at most (1 + γ) times their distance in Gε, and containing a point of Ci. This violates the geodesic convexity of Cj. We conclude that x ∉ Cj. The other case is symmetric.

We can now introduce Find Separator (Algorithm 7), a routine for computing an (i, j)-separator given a cut edge between Ci and Cj.

Algorithm 7 Find Separator(Gε, ui, uj)
1: Z := {x ∈ V(Gε) : d_Gε(ui, x) < 1/γ}
2: (Z1, . . . , Zk) := MBS(Z, ε)
3: Ui := ∅, Uj := ∅
4: for each x ∈ V(Gε) \ Z do
5: if d_Gε(x, ui) ≤ d_Gε(x, uj) then add x to Ui else add x to Uj
6: return (Zi ∪ Ui, (Z \ Zi) ∪ Uj)

Lemma 35 Let (ui, uj) ∈ Gε be a cut edge between Ci and Cj. Find Separator(Gε, ui, uj) returns an (i, j)-separator (Si, Sj) of V(Gε) using at most M(βγ) queries.

Proof The query bound follows from Lemma 30 by observing that Z ⊆ B(ui, ε/γ). For the correctness, we show that (Zi, Z \ Zi) is an (i, j)-separator of Z and (Ui, Uj) is an (i, j)-separator of V(Gε) \ Z. For Z, Lemma 30 guarantees that Zi = Ci ∩ Z. So Zi ∩ Cj = (Z \ Zi) ∩ Ci = ∅, and (Zi, Z \ Zi) is an (i, j)-separator of Z. Consider now any x ∈ V(Gε) \ Z. By definition of Z, we have d_Gε(x, ui) ≥ 1/γ.
Therefore, by Lemma 34, if the algorithm assigns x to Ui then x / Cj, and if the algorithm assigns x to Uj then x / Ci. Therefore Ui Cj = Uj Ci = , and (Ui, Uj) is an (i, j)-separator of V (Gε) \ Z. 5.2.3 Learning a single class with Learn Class We can finally introduce Learn Class (Algorithm 8) and prove Lemma 29. Learn Class starts by computing the set of vertices Ri reachable from si in Gε, and the induced subgraph Gε,i = Gε[Ri]. Note that, by the connectedness of the classes, Ri is simply the union of Ci and possibly other classes. Now, consider a seed node sh s such that sh Ri \ {si}. If no such sh exists, then Ri = Ci and we are done. Otherwise, compute the shortest path π between sh and si in Gε,i. By Lemma 31, π is Ci-prefixed, and so by Lemma 32 we can find a cut edge (ui, uj) of Ci with O(log n) queries. Using (ui, uj) we can then compute an (i, j)-separator (Si, Sj) of X = V (Gε) using Find Separator. Finally, we update Gε,i to be the connected component of si in Gε[Ri Si], and Ri to be V (Gε,i). By definition of (Si, Sj), this removes from Gε,i all vertices of Cj, so we have reduced by at least one the number of classes other than Ci intersected by Ri. After at most k 1 of these rounds we have Ri = Ci. This process has an issue: it can run out of seeds. Indeed, by taking Ri Si we could remove every seed node sh, even if Ri Si still contains points of Ch. If this is the case then Learn Class would be stuck, that is, unable to compute a new cut edge. One is tempted to ignore that sh / Ri, and compute the shortest path between sh and si for all h = i nonetheless. This however does not work, as the only cut edge on all those shortest paths could be (ui, uj). We bypass this obstacle by exploiting the separators found by Learn Class. By carefully analysing the cuts of those separators we devise an algorithm, Find New Seed, that either finds some new seed sh Ri \ Ci or certifies that Ri = Ci. Find New Seed is invoked by Learn Class at the beginning of each round and is described below. Margin-Based Active Learning of Classifiers Algorithm 8 Learn Class(Gε, ε, s, i) 1: Gε,i := {the connected component of si in Gε}, Ri := V (Gε,i) 2: u := an empty vector 3: while true do 4: sh := Find New Seed(Gε, Ri, ε, s, u, i) 5: if sh = nil then return Ri 6: π(si, sh) := Shortest Path(Gε[Ri], si, sh) 7: (ui, uj) := Find Cut Edge(π(si, sh)) 8: add ui to u 9: (Si, Sj) := Find Separator(Gε, ui, uj) 10: Gε,i := {the connected component of si in Gε,i[Ri Si]}, Ri := V (Gε,i) Proof [of Lemma 29] For each ℓ= 1, 2, . . ., denote by Rℓ i the set Ri at the beginning of the ℓ-th iteration, and by κ(Rℓ i) the number of distinct classes that Rℓ i intersects. We show that, at the beginning of the ℓ-th iteration, the following invariants hold: I1. Gε[Rℓ i] is connected I2. Ci Rℓ i I3. κ(Rℓ i) κ(R1 i ) (ℓ 1) Note that I2 and I3 imply that the algorithm stops and returns Ci (line 5) after at most k iterations. Invariant I1 holds by the construction of Rℓ i, so we focus on proving I2 and I3. Suppose first ℓ= 1. Then, I2 holds by the assumptions of the lemma, and I3 is trivial. Suppose then I1, I2, I3 hold for some ℓ 1, and that iteration ℓ+ 1 exists; we show that I2, I3 hold at iteration ℓ+ 1 as well. First, if iteration ℓ+ 1 exists, then Ci Rℓ i. Then, by Lemma 36, Find New Seed returns sh Rℓ i \ Ci. As Gε[Rℓ i] is connected by I1, the shortest path π(si, sh) exists in Gε[Rℓ i], and is Ci-prefixed by Lemma 31 applied to Rℓ i. Then by Lemma 32 Find Cut Edge(π(si, sh)) returns a cut edge (ui, uj) of Ci in π(si, sh). 
At this point the hypotheses of Lemma 35 are satisfied, so Find Separator(Gε, ui, uj) returns an (i, j)-separator (Si, Sj) of V (Gε), hence Ci Rℓ i Si. This implies that the connected component of si in Gε[Rℓ i Si] still contains Ci. The vertex set of this connected component is precisely Rℓ+1 i (line 10), hence I2 holds. Moreover uj Rℓ i, since uj π(si, sh) Rℓ i. Therefore Rℓ i Cj = . However, by construction, Rℓ+1 i Rℓ i Si and Si Cj = since (Si, Sj) is an (i, j)-separator of V (Gε). Therefore κ(Rℓ+1 i ) κ(Rℓ i) 1. As I3 holds at iteration ℓ, then κ(Rℓ+1 i ) κ(R1 i ) (ℓ 1) 1 = κ(R1 i ) ((ℓ+ 1) 1). So I3 holds too. Finally, we bound the number of queries. By Lemma 36 (see below), Find New Seed makes at most k M βγ 2+γ scq in total across all iterations. By Lemma 32, Find Cut Edge makes O(log n) queries at each iteration. By Lemma 35, Find Separator makes at most M(βγ) queries at each iteration. Summing the three terms yields the bound. Finding new seed nodes with Find New Seed. We describe the routine Find New Seed (Algorithm 9) that, given Ri Ci, either decides that Ri = Ci or finds a node sh Ri \ Ci. The key idea behind Find New Seed is this: if Ri does not contain any seed sh s other than si, then for each h = i either Ch Ri = or, by the connectedness of Gε[Ch], the cut Bressan, Cesa-Bianchi, Lattanzi, Paudice ΓGε(Ri) must contain some edge of Gε[Ch]. Therefore the task boils down to finding such an edge, or deciding that there is none. Clearly, checking all edges in ΓGε(Ri) would require too many queries. Fortunately, the following approach works: for all those edges that are close to cut edges previously used, we perform an explicit search using MBS(Algorithm 5). For the remaining edges, geodesic convexity tells whether a cut edge exists without making queries. Algorithm 9 Find New Seed(Gε, Ri, ε, s, u, i) 1: if s Ri = {si} then return any s s Ri \ {si} 2: for each u u do 3: Z = {x Ri : d Gε(u, x) < 2/γ + 1} 4: (Z1, . . . , Zk) := MBS(Z, ε) 5: if Z \ Zi = then return any x Z \ Zi 6: if (x, y) ΓGε(Ri) such that u u : d Gε(u, x) 2/γ + 1 then return any such x 7: else return nil Lemma 36 Consider the beginning of any iteration of Learn Class(Gε, ε, s, i). Then, the call to Find New Seed(Gε, Ri, ε, s, u) returns a point x Ri \ Ci if Ri = Ci, or nil if Ri = Ci, using at most k M βγ 2+γ queries. Moreover, Find New Seed can be adapted so as to make at most k M βγ 2+γ queries over the entire execution of Learn Class. In order to prove Lemma 36 we need to two ancillary results, Lemma 37 and Lemma 38. Lemma 37 Let (Si, Sj) be an (i, j)-separator for V (Gε) obtained from Find Separator(V (Gε), ui, uj). If (x, y) ΓGε(Si) and d G(ui, x) 2 γ + 1 then x / Ci. Proof We show that x Ci violates the geodesic convexity of Ci. First (x, y) ΓGε(Si) implies y Sj. Moreover, d Gε(ui, y) d Gε(ui, x) 1 > 1 γ . Now recall the code of Find Separator. Since d Gε(ui, y) > 1 γ , then y / Zj. But y Sj = Zj Uj, and therefore y Uj. This means that Find Separator executed line 5, which happens only if: d Gε(uj, y) < d Gε(ui, y) (4) Now consider the path π = (ui, uj)+π(uj, y)+(y, x) from ui to x where π(uj, y) has length d Gε(uj, y). First, we show that π is a simple path. Suppose indeed that π was not simple. Since π(uj, y) is simple (it is a shortest path), and since ui = x (because d Gε(ui, x) 1 by assumption), then ui π(uj, y) or x π(uj, y). If ui π(uj, y), then d Gε(uj, y) > d Gε(ui, y), which contradicts (4). 
If instead x π(uj, y), then d Gε(uj, y) > d Gε(uj, x), which gives: d Gε(uj, y) > d Gε(uj, x) d Gε(ui, x) since x Si d Gε(ui, y) 1 since (x, y) E(Gε) which, since d Gε is integral, implies d Gε(uj, y) d Gε(ui, y). This contradicts again (4). Thus, π is a simple path. Next, we show that |π| d Gε(ui, x) (1 + γ). From (4): d Gε(uj, y) d Gε(ui, y) 1 since d Gε N d Gε(ui, x) + d Gε(x, y) 1 since (x, y) E(Gε) = d Gε(ui, x) Margin-Based Active Learning of Classifiers And therefore: |π| = d Gε(uj, y) + 2 d Gε(ui, x) + 2 since d Gε(uj, y) d Gε(ui, x) d Gε(ui, x) (1 + γ) since d Gε(ui, x) 2 Hence π is a simple path violating the geodesic convexity of Ci. So x / Ci, as claimed. Recall Learn Class(Gε, ε, s, i) in Algorithm 8. Let Rℓ i be the value of Ri at the beginning of the ℓ-th iteration, and let (Sℓ i , Sjℓ) be the separator computed by Find Separator at the ℓ-th iteration. Lemma 38 Let ℓ 2. If (x, y) ΓGε(Rℓ i), then (x, y) Γ(Sτ i ) for some τ {1, . . . , ℓ 1}. Proof Let τ = min 1 t ℓ 1 : (x, y) ΓGε(Rt+1 i ) . First, x Sτ i . Indeed x Rτ+1 i , and by construction Rτ+1 i Rτ i Sτ i . Second, y / Sτ i . Suppose indeed by contradiction that y Sτ i . Since x Rτ i and (x, y) / ΓGε(Rτ i ), then y Rτ i . Thus y Sτ i Rτ i . So y is adjacent to x in Gε[Sτ i Rτ i ], and therefore y Rτ+1 i as well, by construction of Rτ+1 i as a connected component. But then (x, y) / ΓGε(Rτ+1 i ), which contradicts our hypothesis. Therefore x Sτ i and y / Sτ i . We conclude that (x, y) ΓGε(Sτ i ). Proof [of Lemma 36] The query bound holds by Lemma 30 noting that Z B u, ε 2 γ . Let us now prove the correctness. In the first iteration of Learn Class Ri is the union of Ci and zero or more other classes, ΓGε(Ri) = , and u = . In that case, if Ri = Ci then Find New Seed returns nil, otherwise it returns some s s Ri \ {si}. Consider now any subsequent iteration of Learn Class. Suppose first that Ri = Ci. Then s Ri = {si}. Moreover, at each for iteration, by Lemma 30 Zi = Z, so Z \ Zi = . Finally, note that no edge (x, y) ΓGε(Ri) exists such that d Gε(u, x) 2 γ +1 for all u u. Indeed, if such an edge existed, Lemma 38 would imply (x, y) ΓGε(Si) in some previous round of Learn Class, which by Lemma 37 implies x / Ci a contradiction, since x Ri = Ci. Hence, Find New Seed reaches line 7 and returns nil. Now suppose that Ri = Ci. If sh Ri for some sh = si, then Find New Seed returns sh at line 1. Otherwise, we must have Ch Ri = but Ch Ri for some h = i. By the connectedness of Ch in Gε this implies the existence of an edge (x, y) ΓGε(Ri) with x Ch. If any such edge exists with d Gε(u, x) < 2 γ + 1 for some u u, then line 5 finds x and returns it. Otherwise, any such edge has d Gε(u, x) 2 γ + 1 for all u u. In this case, line 6 returns x such that (x, y) ΓGε(Ri) and d Gε(u, x) 2 γ + 1 for all u u, in which case x / Ci by Lemma 38 and Lemma 37. To make Find New Seed use at most k M βγ 2+γ queries over the execution of Learn Class, we keep track of Z \ Zi. At each invocation compute the set Zu = {x Ri : d Gε(u, x) < 2 γ + 1}, where u is the last node added to u. Invoke MBS on Zu, retrieve Zu i , and add Zu \ Zu i to Z \ Zi. Finally, remove from Z \ Zi all points not in Ri. This sequence of operations costs M βγ 2+γ queries, and the resulting set is exactly the Z \ Zi computed at line 5 of Find New Seed. 
Hence the behavior of Learn Class is unchanged, but the total number of queries is at most k M βγ Bressan, Cesa-Bianchi, Lattanzi, Paudice 5.3 Learn HBGC and proof of Theorem 26 This section describes Learn HBGC (Algorithm 10) and proves Theorem 26. We start by noting that, if the conditions of Definition 25 hold, then they hold for the smallest εi such that Ci is connected in Gεi,9 so without loss of generality we can use such an εi. Lemma 39 Let C be a hereditary (β, γ)-convex partition of X, and for i [k] let: ζi = min{ζ : x, y Ci : d Gζ(x, y) < } (5) ζ i = min{ζ : ρ(Gζ[Ci]) = 1} (6) Then ζi = ζ i , and all properties of Definition 25 hold when εi = ζi = ζ i . Proof First, note that ζi ζ i , since ρ(Gζ[Ci]) = 1 implies x, y Ci : d Gζ(x, y) < , and so the minimum in (5) is taken over a superset of that of (6). Now consider the graph Gζi. For x, y Ci, since ζi ζ i and d Gζi(x, y) < , the geodesic convexity implies that any shortest path in Gζi between x and y lies in Ci. Thus the subgraph induced by Ci in Gζi is connected. By definition of ζ i this implies ζ i ζi, so ζi = ζ i . For the second claim, let εi = ζ i . The connectedness of Gζ i [Ci] holds by definition of ζ i . Now let ζ be any value such that the three properties hold when εi = ζ. Then ζ ζ i , because for εi < ζ i the connectedness fails, by definition of ζ i . The hereditarity implies that the local metric margin and the geodesic convexity with margin hold for εi = ζ i , since they hold for εi = ζ ζ i . We now turn to Learn HBGC (Algorithm 10). The basic idea is to learn each Ci separately by invoking Learn Class (Algorithm 8) on the graph Gεi[X]. Unfortunately, this does not work: any Cj with εj < εi may not satisfy geodesic convexity in Gεi, so Learn Class may fail. However, we can make things work under the following precautions. First, recover the classes in nondecreasing order of radius. Second, when recovering Ci, restrict Gεi to the connected component of si. Third, after recovering Ci, delete it from X. Note that this procedure learns each Ci on a graph thresholded at a different radius and with only a subset of the original vertices. Thus, its correctness is not be obvious. In particular, it is not obvious that the partition induced by the sub-instance used at the i-th iteration is (β, γ)-convex, which is necessary for Learn Class to work. We show that it is: at every iteration, the residual instance is (β, γ)-convex, and the guarantees of Lemma 29 hold. In all this, the role of seed queries is to find a seed node for each class in the connected component of Gεi containing si, as required by Learn Class. After one last technical lemma, we can describe Learn HBGC and prove Theorem 26. Lemma 40 Let C be a hereditary (β, γ)-convex partition of X. Let ε be the smallest radius of any class and let X be the vertex set of any connected component of Gε that contains a class with radius ε . Finally, let C = C1 X , . . . , Cκ X . Then C is a (β, γ)-convex partition of X with radius ε in in Gε [X ]. 9. Note that this is different from requiring that Gεi[Ci] is connected; here we are only requiring that any two points of Ci have a connecting path in Gεi. Lemma 39 however shows that the smallest εi that satisfies either condition is actually the same. Margin-Based Active Learning of Classifiers Proof We verify that the three properties of Definition 23 hold with radius ε . This implies that ε is also the smallest such value, since the corresponding classes becomes disconnected in Gε[X ] for any ε < ε . 
Let C i = Ci X for all i [κ]. To simplify the notation let G = Gε [X ]. Connectivity: the subgraph induced by C i in G is connected. Consider two points x, y C i ; obviously x, y Ci. Since G is connected, d G (x, y) < . Moreover, d G (x, y) = d Gε (x, y), since by construction G is the connected component of Gε containing x, y. Thus, d Gε (x, y) < . Moreover, ε εi by assumption. Thus, x, y Ci, and d Gε (x, y) < with ε εi. Then, by the hereditary geodesic convexity of C on X, any shortest path π between x and y in Gε lies in Ci. Since again G is the connected component of Gε containing x, y, then π lies in X . We conclude that π Ci X = C i . This holds for any choice of x, y. Therefore, G [C i ] is connected. Local metric margin: for all x, y X , if x C i and y / C i , then d(x, y) > βε . Since x C i and y / C i , then x Ci and y / Ci. By the local metric margin of C, and since εi ε , we have d(x, y) > βεi βε . Geodesic convexity with margin: if x, y C i , then in G any simple path between x and y of length at most (1 + γ)d G (x, y) lies in C i . By the same argument of connectivity, we invoke the hereditary geodesic convexity for x, y Ci, for ε = ε εi. We obtain that Gε contains no simple path of length at most (1 + γ)d Gε (x, y) between x and y that leaves Ci. This implies that no such path exists in G as well, since G Gε . Moreover, since C i = Ci X , no such path exists in G that leaves C i (otherwise it would leave Ci). Recalling that (1 + γ)d Gε (x, y) = (1 + γ)d G (x, y), we deduce that in G there is no path of length at most (1 + γ)d G (x, y) between x and y that leaves C i . Algorithm 10 Learn HBGC(X,ε, s) 1: assume ε1 . . . εk 2: for i = 1, . . . , k do 3: G := the connected component of si in Gεi[X], X := V (G ) 4: for j = i + 1, . . . , k do s j := seed(X , j) 5: s := (si, s i+1, . . . , s k) 6: c Ci := Learn Class(G , εi, s , i) 7: output c Ci 8: X := X \ c Ci Proof [of Theorem 26] For each i = 1, . . . , k let Xi be the value of X at the beginning of the i-th iteration of Learn HBGC(X,ε, s). We prove the following three invariants: I1. Xi = Ci . . . Ck I2. (Ci, . . . , Ck) is a hereditary (β, γ)-convex partition of Xi I3. if i > 1 then the algorithm has output C1, . . . , Ci 1 so far. For i = 1 we have Xi = X and I1-I3 clearly hold. Now assume I1-I3 hold at the beginning of the i-th iteration for some i 1. Observe that if Learn HBGC sets c Ci = Ci then I1-I3 Bressan, Cesa-Bianchi, Lattanzi, Paudice hold at iteration i + 1. For I1 and I3 this is trivial; for I2, note that deleting a class never invalidates the properties of Definition 25, thus, (Ci+1, . . . , Ck) will be a hereditary (β, γ)- convex partition of Xi+1 = Ci+1 . . . Ck. Hence, we prove that Learn HBGC sets c Ci = Ci. Consider the subgraph G and its node set V computed at line 3. Let Ci = (Ci, . . . , Ck), and let C i = (Ci X , . . . , Ck X ). Since εi is the smallest radius of all classes in Ci, then by Lemma 40 C i is a hereditary (β, γ)-convex partition of Xi with radius εi. Furthermore, by construction s contains one seed for each nonempty class in C . Thus by Lemma 29, Learn Class(G , ε, s , i) returns Ci, so c Ci = Ci. The query bound is straightforward. 5.4 Bounds on the running time of Learn BGC and Learn HBGC We show how to implement our algorithms in time linear or essentially linear in the size of G. Without loss of generality assume that, for any graph G and any x V (G), the edges incident with x can be listed in constant time per edge. 
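Before the formal claims, it may help to see the breadth-first-search pattern that drives the O(n + m) bounds of this section. The sketch below is an illustrative Python rendering of Find Separator (Algorithm 7): two BFS passes plus one call to MBS, which is passed in as a callable so that all label queries are confined to it (as in Lemma 30). The interface and all names are ours, meant only as a reading aid and not as the authors' implementation.

```python
from collections import deque

def bfs_hops(adj, source):
    """Hop distances from `source` in an unweighted graph given as adjacency lists."""
    d = [float("inf")] * len(adj)
    d[source] = 0
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if d[y] == float("inf"):
                d[y] = d[x] + 1
                queue.append(y)
    return d

def find_separator(adj_eps, u_i, u_j, i, gamma, mbs):
    """Sketch of Find Separator: split the vertices of G_eps into (S_i, S_j).

    `mbs(Z)` stands for Algorithm 5 restricted to the vertex set Z and returns
    the tuple (Z_1, ..., Z_k); it is the only place where labels are queried.
    """
    d_i = bfs_hops(adj_eps, u_i)
    d_j = bfs_hops(adj_eps, u_j)
    near = [x for x in range(len(adj_eps)) if d_i[x] < 1.0 / gamma]
    Z_i = set(mbs(near)[i])            # vertices of C_i close to u_i
    S_i, S_j = set(Z_i), set(near) - Z_i
    for x in range(len(adj_eps)):
        if d_i[x] < 1.0 / gamma:
            continue                   # already placed via MBS
        if d_i[x] <= d_j[x]:
            S_i.add(x)                 # by Lemma 34, x cannot be in C_j
        else:
            S_j.add(x)                 # by Lemma 34, x cannot be in C_i
    return S_i, S_j
```

The two searches and the final loop each take time O(n + m), which is the behavior bounded formally in Lemma 41 below.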
The following claim is straightforward: Claim 1 For every ε 0 the graph Gε can be computed in time O(n + m). This holds more in general for thresholding any subgraph of G. By Claim 1 we can implement Learn Class in time O(k2(n + m)): essentially, for O(k2) times the algorithm computes the distances of all nodes of Gε[X] from some given node u, which takes time O(n+m) via breadth-first search. Since Learn Class is invoked once per each class, this would give a total running time of O(k3(n + m)) for both Learn BGC and Learn HBGC. With some extra effort, however, we show how to adapt Learn Class so that it runs in time O(k(n + m)), by amortizing the cost of Find New Seed. In the end, we obtain a running time of O(k2(n + m)) for both Learn BGC and Learn HBGC. Lemma 41 MBS(Z, ε) and Find Separator(G, ui, uj) both run in time O(n + m). Proof For MBS(Z, ε), computing Gβε takes time O(n + m) by Claim 1, and thereafter computing Gβε[Z] and listing its connected components takes again time O(n+m). Consider now Find Separator(Gε, ui, uj). Computing d Gε(ui, x) and d Gε(uj, x) for all x Gε takes time O(n + m) using a BFS from ui and uj. Computing Z takes time O(n). MBS(Z, ε) takes time O(n + m) as said, and the loop at line 4 takes time O(n). Lemma 42 In Learn Class(Gε, ε, s, i), each call to Find New Seed(Gε, Ri, ε, s, u, i) takes time O(k(n + m)). By adapting both algorithms, this can be reduced to O(n + m) while adding at most an additive O(n + m) to the running time of each iteration of Learn Class. Proof Let us start with the O(k(n + m)) bound given by a naive implementation of Find New Seed. Line 1 computes s Ri in time O(k|Ri|) = O(kn) and performs the check in constant time. Line 2 makes |u| k iterations. Each iteration computes Z in time O(n + m) with a BFS from u, runs MBS(Z, ε) in time O(n + m) by Lemma 41, and possibly searches for x Z \ Zi in time O(n). Hence the entire loop of line 2 takes time O(k(n + m)). Finally, line 6 computes {(x, y) Γ(Ri) : u u : d Gε(u, x) 2 γ + 1}. To this end, first compute Margin-Based Active Learning of Classifiers the set {x Ri : u u : d Gε(u, x) 2 γ + 1} in time O(k(n + m)) using a BFS from u. For each x in that set list all its edges (x, y) E(Gε); if some edge satisfies y / Ri then return y, otherwise return nil. Thus, this part takes time O(k(n + m)). Therefore a single call to Find New Seed takes time O(k(n + m)). Since Find New Seed is called at most k times, the total running time is O(k2(n + m)). Let us now see how to reduce to O(n + m) the running time of Find New Seed by adding at most O(n + m) to each iteration of Learn Class. First, consider line 1 of Find New Seed. We keep s updated so that s Ri = s \ {si}; in this way we can run line 1 in constant time. To this end modify Learn Class as follows. First, we store si separately from s. Second, after updating Gε,i and Ri at line 10 of Learn Class, we replace s with s Ri. To this end encode Ri as a bitmap of size O(n), for every x bs check if x Ri, and if so then append x to a new vector s ; then replace s with s . Summarizing, we spend an additional O(n) time at each iteration of Learn Class, and line 1 of Find New Seed will run in constant time. Now consider the loop at line 2 of Find New Seed. Let Z(u) be the value of Z computed for u u at line 3, and Zi(u) the corresponding value of Zi computed by MBS. Define: Zi(u) := x Ri \ Ci : u u : d Gε(u, x) < 2 Note that Zi(u) = u u Z(u) \ Zi(u); thus the loop detects precisely if Zi(u) = , in which case it returns some x Zi(u). 
We adapt Learn Class so as to maintain Zi(u) as a linked list across subsequent invocations of Find New Seed; we can then replace the loop of line 2 with O(1) elementary operations (checking Zi(u) = and returning, say, its first element). Start by initializing Zi(u) to an empty list. After updating Gε,i and Ri at line 10, perform the following operations. First, make Learn Class compute Z(ui). Second, run MBS(Z(ui), ε) to obtain Zi(ui) = Z(ui) Ci. Third, compute Zi(ui) = Z(ui) \ Zi(ui). Fourth, add Zi(ui) to Z(u). Finally, replace Z(u) with Z(u) Ri. The last two steps can be performed in time O(n) by encoding the sets as bitmaps, and every other step requires time O(n + m); thus, overall the running time of each iteration of Learn Class increases by O(n + m). Finally, consider line 6 of Find New Seed. At the beginning of the first iteration of Learn Class mark all nodes of Ri as active. At each iteration of Learn Class, after computing (ui, uj), mark as inactive every x Ri such that d Gε(ui, x) < 2 γ + 1; this takes time O(n + m) by running a BFS from ui. Finally, for every active x Ri list all its edges (x, y), until finding y / Ri or concluding that no such y exists (the check y / Ri takes time O(1) by again encoding Ri as a bitmap). This makes line 6 run in time O(n + m) while making the running time of each iteration of Learn Class increase by O(n + m). Lemma 43 Learn Class(Gε, ε, s, i) runs in time O(k(n + m)). Proof At every instant, computing Gε,i and Ri takes time O(n + m). Now consider each iteration of the main loop. By Lemma 42, we can make Find New Seed(Gε, Ri, ε, s, u, i) run in time O(n+m) while increasing the overall running time of the iteration of Learn Class by O(n+ m). Shortest Path(Gε[Ri], si, sh) takes time O(n + m) via BFS, and Find Cut Edge(π(si, sh)) clearly runs in time O(log n). Adding ui to u takes constant time. Find Separator(G, ui, uj) runs in time O(n + m), see Lemma 41. Hence each iteration takes time O(n + m) overall. Bressan, Cesa-Bianchi, Lattanzi, Paudice To conclude, note that at most k iterations are made. We can finally prove the running time bounds of Theorem 24 and Theorem 26. First, consider Learn BGC(G, ε, s). By Claim 1, computing Gε takes time O(n + m). Each call to Learn Class(Gε, ε, s, i) takes time O(k(n + m)) by Lemma 43. As there are k classes, the O(k2(n + m)) bound of Theorem 24 follows. Now consider Learn HBGC. At every iteration, computing G takes time O(n + m) via a BFS. The seed part takes time O(k) = O(n), as does the construction of s . The call to Learn Class takes time O(k(n + m)) by Lemma 43. Writing out c Ci takes time O(n). This proves the bounds of Theorem 26. 5.5 Learning the radii and the convexity parameters In this section we show how to use seed queries to learn the radii of the classes and to adapt our algorithms when β or γ are unknown. 5.5.1 Learning the radii This section proves Theorem 28. Upper bounds. The upper bounds of Theorem 28 hinge on three observations. First, εi is actually the smallest ε such that all nodes of Ci belong to a single connected component of Gε[X], see above. Second, with O(1) seed queries we can test whether Ci belongs to a single connected component of G for any given G. Third, if T is a minimum spanning tree of G, then the connected components of Gε[X] are exactly the connected components of Tε. Our algorithm starts by computing T; for each i [k] it then performs a binary search to find the smallest edge weight ε such that Ci is connected in Tε. 
We start with a simple routine for checking if Ci V is connected in a graph G = (V, E). Algorithm 11 Is Connected(G, i) 1: u := seed(V (G), i) 2: U := the connected component of u in G, or if u = nil 3: return (seed(V (G) \ U, i) = nil) Claim 2 Is Connected(G, i) returns true if and only if d G(x, y) < for all x, y Ci. Now let T be a minimum spanning tree of G = (X, E, d). For ε > 0 let Tε be the forest formed by the edges (x, y) of T such that d(x, y) ε. Observe the following basic fact: Claim 3 The connected components of Tε are the connected components of Gε. As a consequence, we have: Claim 4 Is Connected(Tε, i) = Is Connected(Gε, i), for any i [k] and any ε > 0. We can now give the algorithm for learning the radius of a single class. Lemma 44 If T is a MST of G = (X, E, d), then Get Epsilon(T, i) returns εi in time O(n log n) using O(log n) seed queries where n = |X|. Margin-Based Active Learning of Classifiers Algorithm 12 Get Epsilon(T, i) 1: w := (w1, . . . , wℓ), the distinct edge weights of T in increasing order 2: lo := 1, hi := ℓ 3: while lo < hi do 4: mid := lo+hi 5: if Is Connected(Twmid, i) then hi := mid else lo := mid + 1 6: return whi Proof Clearly Get Epsilon(T, i) makes O(log n) iterations, since w has length O(n2). Moreover, since T has O(n) edges, every call to Is Connected(Twmid, i) takes time O(|T|) = O(n). This gives the time bound of O(n log n). Next we show that Get Epsilon(T, i) returns εi, or equivalently by Lemma 39, w = min{w w : Ci is connected in Gw[X]}. Consider the test lo < hi at beginning of each iteration. We claim that w whi. Indeed, Ci is connected in Gwhi[X] before the while loop starts, when whi = wℓ, and at each iteration hi is set to mid only if Is Connected(Twmid, i) = true. We claim that w wlo, too. Indeed w w0 at the first iteration, and at each iteration lo is set to mid+1 only if Is Connected(Twmid, i) = false. Therefore at return time wlo = whi = w , hence Get Epsilon(T, i) returns w . We conclude with the algorithm to learn the radii and with the upper bounds of Theorem 28. Algorithm 13 Get Epsilons(G = (X, E, d), k) 1: T := MST(G) 2: for i = 1, . . . , k do bεi := Get Epsilon(T, i) 3: return bε1, . . . , bεk Lemma 45 Get Epsilons(G, k) returns the radii ε1, . . . , εk in time O(m α(m, n) + kn log n) using O(k log n) seed queries, where α(m, m) is the functional inverse of Ackermann s function (Chazelle, 2000). Proof One can compute MST(G) in time O(m α(m, n)), see (Chazelle, 2000), and by Lemma 44 Get Epsilon(T, i) returns εi in time O(n log n) using O(log n) seed queries. Lower bounds. The lower bounds of Theorem 28 are given by: Theorem 46 For any k 2 and any sufficiently large n there exists a distribution of (hereditary) (1/2, 1)-convex classifiers on n points such that any (randomized) algorithm needs Ω(k log n k ) seed and/or label queries to learn the radii of all classes with constant probability. Proof Suppose first k = 2. Let G = (X, E, d) be a path with increasing edge weights. Formally, let X = [n], let E = {{j, j + 1} : j [n 1]}, and let d(j, j + 1) = 1 + βj n for each j [n 1]. Draw j uniformly at random in {2, . . . , n 1}, let C1 = {1, . . . , j 1}, and C2 = {j , . . . , n}. Bressan, Cesa-Bianchi, Lattanzi, Paudice First, we prove that C1 and C2 are (β, γ)-convex with radii respectively ε1 = d(j 1, j ) and ε2 = d(n 1, n). For the connectivity, clearly ρ(Gε1[C1]) = ρ(Gε2[C2]) = 1. For the local metric margin note that the choice of β and d ensure d 1 and ε1, ε2 3 2. Thus d(x, y) > β max(ε1, ε2) for any two distinct x, y X. 
Therefore the local metric margin is satisfied. For geodesic convexity, note that there is only one edge between C1 and C2 in G, thus no simple path between two points of one class intersects the other class. Thus, the properties of Definition 23 are satisfied by ε1, ε2. To show that Definition 25 is satisfied as well, we show ε1, ε2 are the smallest such values, see Lemma 39. To this end, simply note that ε1 = min{ζ : ρ(Gζ[C1]) = 1}, since d(j 1, j ) = ε1 and d(j, j + 1) ε1 for all j = 1, . . . , j 1. Similarly, ε2 = min{ζ : ρ(Gζ[C2]) = 1}, since d(n 1, n) = ε2 and d(j, j + 1) ε1 for all j = j , . . . , n 1. Finally, we show that any algorithm needs Ω(log n) queries to learn ε1. Clearly, if the algorithm learns ε1 then it can also output the index j , which is a function of ε1. Therefore, we show that finding j requires Ω(log n) queries. First, we observe that seed queries can be emulated by label queries. Consider indeed any U X, and recall that if U Ci = then seed(U, i) may return any x U Ci. We then let seed(U, i) return min(U Ci) if i = 1, and max(U Ci) if i = 2. Now, observe that min(U C1) = min U and max(U C1) = max U. Therefore seed(U, i) can be emulated by asking the labels of min(U) and max(U). Since every label query reveals at most one bit of information and j is uniform over Ω(n) elements, learning j with constant probability requires Ω(log n) label queries. To extend the construction to k > 2, take K = k 2 disjoint weighted paths on n K nodes each (without loss of generality we can assume k is even). Each such path is weighted as in the construction above, with the weights of the h-th path all smaller than the weights of the (h + 1)-th path. For each path, we draw j uniformly at random like above, and form two classes. The same proof used above shows that any algorithm needs Ω(k log n k ) queries to learn the radii of all classes with constant probability. 5.5.2 Learning β and γ We adapt our algorithms, using seed queries, to the case where β and/or γ are unknown (that is, it is known only that β, γ 1). We begin with an alternative implementation of Find New Seed, see Algorithm 14. It is easy to see that it has the same guarantees as Find New Seed. Algorithm 14 Find New Seed-2(Ri, c, i) 1: for each j c do 2: sj = seed(Ri, j) 3: if sj = nil then return c, sj 4: c := c \ {j} 5: return c, nil Next, we describe Learn Class-2, an alternative implementation of Learn Class that learns one or both β, γ along the way. The subroutine Update Parameters is described in the proofs below; it depends on whether one knows β, γ, or neither. It is crucial to observe that if Margin-Based Active Learning of Classifiers a classifier C is (hereditary) (β, γ)-convex then it is also (hereditary) (bβ, bγ)-convex for all bβ β and bγ γ. This implies that if we run our algorithms with bβ, bγ in place of β, γ then the output will be correct (and this holds for subroutines too). 
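As a reading aid, a direct Python transcription of Find New Seed-2 could look as follows; the seed oracle is a callable returning a point of Cj inside the given set or None, and skipping the class i currently being learned is our reading of the pseudocode (all names and interfaces are illustrative, not from the paper).

```python
def find_new_seed_2(R_i, candidates, i, seed):
    """Sketch of Find New Seed-2 (Algorithm 14).

    Returns (updated candidate set, a seed of some class j != i inside R_i),
    or (updated candidate set, None) if every remaining candidate class is
    disjoint from R_i.  Each None answer permanently removes a class, so at
    most k seed queries are answered None over a whole run of Learn Class-2.
    """
    remaining = set(candidates)
    for j in sorted(remaining):
        if j == i:
            continue
        s_j = seed(R_i, j)        # seed query: a point of C_j in R_i, or None
        if s_j is not None:
            return remaining, s_j
        remaining.discard(j)      # class j ruled out for the rest of the run
    return remaining, None
```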
Algorithm 15 Learn Class-2(Gε, ε, i) 1: si := seed(V, i) 2: Gε,i := {the connected component of si in Gε}, Ri := V (Gε,i) 3: c := [k] 4: while true do 5: c, sh := Find New Seed-2(Ri, c, i) 6: if sh = nil then return Ri 7: π(si, sh) := Shortest Path(Gε[Ri], si, sh) 8: (ui, uj) := Find Cut Edge(π(si, sh)) 9: while true do 10: (Si, Sj) := Find Separator(Gε, ui, uj) 11: if seed(Si, j) = seed(Sj, i) = nil then break 12: Update Parameters() 13: Gε,i := {the connected component of si in Gε[Ri Si]}, Ri := V (Gε,i) 14: c := c \ {j} We start by proving correctness and bounds for Learn Class-2 when only one of β, γ is unknown; when both are unknown we will only need to adapt Update Parameters. Lemma 47 Let C be (β, γ)-convex. Suppose Learn Class-2 and its subroutines use bγ in place of γ and that each call to Update Parameters halves bγ. Then Learn Class-2(Gε, ε, i) returns Ci in time O k + log γ0 γ1 (n + m) using O k log n + k + log γ0 γ1 M(βγ1) label queries and at most 3k + 2 log γ0 γ1 seed queries, where γ0, γ1 are the values of bγ at the beginning and at the end of the execution. The same holds with β, bβ, β0, β1 in place of γ, bγ, γ0, γ1. Proof For the correctness, note that Find New Seed-2 behaves exactly as Find New Seed (see Lemma 36) and that when line 13 is executed (Si, Sj) is an (i, j)-separator. The running time bounds follow by the analysis of Section 5.4, noting that Find New Seed-2 runs in time O(k) = O(n + m) and that lines 10 11 are repeated at most log γ0 γ1 times. For the label query bound, Find Cut Edge is called at most k times and by Lemma 32 each call uses O(log n) queries, while Find Separator is called at most k + log γ0 γ1 times and by Lemma 35 every call uses at most M(βbγ) M(βγ1) queries. For the seed query bound, note that Learn Class-2 makes at most 2k + 2 log γ0 γ1 queries at line 11, while Find New Seed-2 makes at most k queries overall as after each query some element is deleted from c (either by Find New Seed-2 itself or by Learn Class-2 at line 14). Finally, note that all arguments hold with β, bβ, β0, β1 in place of γ, bγ, γ0, γ1. We can now prove Theorem 27. First, assume C is (β, γ)-convex. The first claim follows by Lemma 47 when calling Learn Class-2(Gε, ε, i) for each i [k]. For the second claim, for every j = 0, 1, . . . let δ = 2 j and consider the j + 1 pairs of values (δ, 1), (2δ, 1 2), . . . , (1, δ). This Bressan, Cesa-Bianchi, Lattanzi, Paudice yields a succession of values for (bβ, bγ). Starting with bβ = bγ = 1, each call to Update Parameters updates bβ and bγ by moving to the next term in the succession. The claim follows by the same argument as above. Suppose now C is hereditary (β, γ)-convex. Consider Learn HBGC-2, see Algorithm 16; it is an adaptation of Learn HBGC that uses Learn Class-2 in place of Learn Class. Note that Learn HBGC-2 does not need to receive or compute the list of seed vertices; any necessary seed is found already by Learn Class-2. The correctness of Learn HBGC-2 follows by the same arguments used for Learn HBGC, see the proof of Theorem 26. The bounds on the number of queries follow by the same arguments of Theorem 27 (note that seed is used only by Learn Class-2; the O(k2) seed queries appearing in Theorem 26 are not used anymore). The claim on the running time holds again by Theorem 27. Algorithm 16 Learn HBGC-2(X,ε) 1: assume ε1 . . . εk 2: for i = 1, . . . 
, k do 3: G := the connected component of si in Gεi[X], X := V (G ) 4: c Ci := Learn Class-2(G , εi, i) 5: output c Ci 6: X := X \ c Ci 5.6 Lower bounds In this section we show that some of our parameters and assumptions are necessary. Theorem 48 (Dependence on dens(X)) Choose any β, γ (0, 1). There is a distribution of (hereditary) (β, γ)-convex classifiers C, where n = |X| = µ(X) = 2dens(X) is arbitrarily large, such that any (randomized) algorithm needs Ω(µ(X)) = Ω(2dens(X)) label and/or seed queries to learn C with constant probability. This holds even if β, γ, ε are known. Proof Let G = (X, E, d) where (X, E) is the complete graph on n nodes and d = 1, and let C = (C1, C2) be a uniform random partition of X. We claim that C is (hereditary) (β, γ)-convex for ε = 1. The connectivity of Gε[C1] and Gε[C2] holds trivially. The local metric margin holds as well, since any two distinct points x, y X satisfy d(x, y) = d > βd, as β < 1. To see that geodesic convexity holds, too, note that for any x, y C1 we have d Gε(x, y) 1 and any (simple) path between x and y that contains a point in X \ C1 has length at least 2 > (1 + γ)d Gε(x, y). Finally, note that Gε is an independent set for any ε < ε. Hence C satisfies both Definition 23 and Definition 25. Since C is chosen uniformly at random among all bipartitions of X, then Ω(n) label and/or seed queries are needed to learn C exactly with non-vanishing probability. Noting that µ(X) = n and recalling that dens(X) = log2 µ(X) completes the proof. Theorem 49 (Necessity of seeds) Choose any β, γ (0, 1]. There is a distribution of (hereditary) (β, γ)-convex classifiers C, with n = |X| arbitrarily large, such that any (randomized) algorithm needs Ω(n) label queries to learn C with non-vanishing probability if no seed nodes are given. This holds even if γ, β, ε are known. Margin-Based Active Learning of Classifiers Proof Let X = UP LOW R2 where (see Figure 5): j=1 {(2j, 1)}, LOW = j=1 {(j, 0)} Moreover let G = (X, E, d) where E = X 2 and d is the Euclidean distance. Figure 5: the graph Gε for ε = 1. All points are in C1, save for the filled point in C2, chosen uniformly at random in UP. Choose a point z uniformly at random from UP, and let C1 = X \ {z} and C2 = {z}. One can check that all properties of Definition 23 and Definition 25 hold for ε = 1. In particular, in Gε no simple path between two points of C1 contains z. Hence C is (hereditary) (β, γ)-convex. Learning C requires finding z, and finding z with constant probability requires Ω(n) queries in expectation even if γ, β and ε are known. Next, we show that relaxing our notion of (β, γ)-convexity by allowing a different semimetric for each class leads to an Ω(n) query lower bound. To this end, for each i [k] let Gi = (X, Ei, di) be the weighted graph encoding the semimetric di. The notation and definitions formulated for G apply to Gi with the obvious modifications. Definition 50 Let β, γ (0, 1]. A k-partition C = (C1, . . . , Ck) of X is personalized (β, γ)-convex if there is ε > 0 such that for each i [k] the following properties hold: 1. connectedness: Gi ε[Ci] is connected 2. local metric margin: for all x, y X, if x Ci and y / Ci, then di(x, y) > βε 3. geodesic convexity with margin: if x, y Ci, then in Gi ε any simple path between x and y of length at most (1 + γ)d Giε(x, y) lies in Ci The smallest value of ε satisfying the three properties is called the radius of the classes. Theorem 51 (Necessity of a universal semimetric) Choose any β, γ (0, 1). 
There is a distribution of personalized (β, γ)-convex 2-class classifiers C, with n = |X| arbitrarily large, such that any (randomized) algorithm needs Ω(n) seed and/or label queries to learn C with non-vanishing probability, even if γ, β, ε are known. Proof Let ε > 0, let X = {a1, a2} M where M = [n], and for i = 1, 2 and all x, y X let: di(x, y) = ε x M, y = ai 2ε otherwise Note that Gi ε is the union of an isolated vertex and a star centered in ai with M as leaves. Note also that for any X X the set Ci = {ai} X satisfies all properties of Definition 50. Bressan, Cesa-Bianchi, Lattanzi, Paudice Thus any bipartition (C1, C2) of X such that ai Ci for i = 1, 2 is personalized (β, γ)-convex. Taking the distribution of classifiers where ai Ci and every x M is independently assigned to Ci with probability 1 2 proves the claim. 6. Conclusions and future work We have shown that, in very different settings, one can devise meaningful notions of margins that allow for efficient and exact active learning of multiclass classifiers. Interestingly, this can be achieved with very different techniques, from boosting learners with one-sided errors through expansion of convex hull, to computing abstract separators that play the role of halfspaces in vector spaces. Our work also shows that exact learning of multiclass classifiers requires an inevitable exponential dependence on the dimensionality of the space. There are two main natural questions that our work leaves open. The first one is to what extent these results can be adapted to the non-realizable case, that is, to noisy oracles. The second one is whether a parsimonious use of more powerful queries, such as seed, can remove the exponential dependence on the dimensionality of the ambient space while keeping the running time polynomial. Acknowledgments The authors gratefully acknowledge partial support by the Google Focused Award Algorithms and Learning for AI (ALL4AI). Nicolò Cesa-Bianchi acknowledges the financial support from the FAIR (Future Artificial Intelligence Research) project, funded by the Next Generation EU program within the PNRR-PE-AI scheme (M4C2, investment 1.3, line on Artificial Intelligence) and the EU Horizon CL4-2022-HUMAN-02 research and innovation action under grant agreement 101120237, project ELIAS (European Lighthouse of AI for Sustainability). Peyman Afshani, Ehsan Chiniforooshan, Reza Dorrigiv, Arash Farzan, Mehdi Mirzazadeh, Narges Simjour, and Hamid Zarrabi-Zadeh. On the complexity of finding an unknown cut via vertex queries. In Proceedings of the International Computing and Combinatorics Conference (COCOON), pages 459 469, 2007. doi: 10.1007/978-3-540-73545-8_45. Nir Ailon, Anup Bhattacharya, and Ragesh Jaiswal. Approximate correlation clustering using same-cluster queries. In Proceedings of the Latin American Symposium on Theoretical Informatics (LATIN), pages 14 27, 2018a. doi: 10.1007/978-3-319-77404-6_2. Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Approximate clustering with same-cluster queries. In Proceedings of Innovations in Theoretical Computer Science (ITCS), pages 40:1 40:21, 2018b. doi: 10.4230/LIPIcs.ITCS.2018.40. Dana Angluin. Queries and concept learning. Machine Learning, 2(4):319 342, 1988. doi: 10.1023/A:1022821128753. Margin-Based Active Learning of Classifiers Hassan Ashtiani, Shrinu Kushagra, and Shai Ben-David. Clustering with same-cluster queries. In Advances in Neural Information Processing Systems (Neur IPS), pages 3216 3224, 2016. Josh Attenberg and Foster Provost. 
Why label when you can search? Alternatives to active learning for applying human resources to build classification models under extreme class imbalance. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), page 423 432, 2010. doi: 10.1145/1835804.1835859. Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. Information Processing Letters, 112(1):49 54, 2012. doi: 10.1016/j.ipl.2011.10.006. Maria Florina Balcan and Steve Hanneke. Robust interactive learning. In Proceedings of the Annual Conference Computational Learning Theory (COLT), pages 20.1 20.34, 2012. Maria-Florina Balcan and Phil Long. Active and passive learning of linear separators under log-concave distributions. In Proceedings of the Annual Conference Computational Learning Theory (COLT), pages 288 316, 2013. Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In Proceedings of the Annual Conference Computational Learning Theory (COLT), pages 35 50, 2007. Shai Ben-David, Nicolo Cesa-Bianchi, David Haussler, and Philip Long. Characterizations of learnability for classes of {0, ..., n}-valued functions. Journal of Computer and System Sciences, 50(1):74 86, 1995. doi: 10.1006/jcss.1995.1008. Alina Beygelzimer, Daniel J Hsu, John Langford, and Chicheng Zhang. Search improves label for active learning. In Advances in Neural Information Processing Systems (Neur IPS), page 3350 3358, 2016. Yonatan Bilu and Nathan Linial. Are stable instances easy? Combinatorics, Probability and Computing, 21(5):643 660, 2012. doi: 10.1017/S0963548312000193. Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, and Andrea Paudice. Exact recovery of mangled clusters with same-cluster queries. In Advances in Neural Information Processing Systems (Neur IPS), pages 9324 9334, 2020. Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, and Andrea Paudice. Exact recovery of clusters in finite metric spaces using oracle queries. In Proceedings of the Annual Conference Computational Learning Theory (COLT), pages 775 803, 2021a. Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, and Andrea Paudice. On margin-based cluster recovery with oracle queries. In Advances in Neural Information Processing Systems (Neur IPS), pages 25231 25243, 2021b. Nicolò Cesa-Bianchi, Claudio Gentile, Fabio Vitale, and Giovanni Zappella. Active learning on trees and graphs. In Proceedings of the Annual Conference Computational Learning Theory (COLT), pages 320 332, 2010. Bressan, Cesa-Bianchi, Lattanzi, Paudice Bernard Chazelle. A minimum spanning tree algorithm with inverse-Ackermann type complexity. Journal of the ACM, 47(6):1028 1047, 2000. doi: 10.1145/355541.355562. Eli Chien, Antonia Tulino, and Jaime Llorca. Active learning in the geometric block model. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 3641 3648, 2020. Vincent Cohen-Addad and Karthik Srikanta. Inapproximability of clustering in L-p metrics. In Proceedings of the IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 519 539, 2019. doi: 10.1109/FOCS.2019.00040. Amit Daniely and Shai Shalev-Shwartz. Optimal learners for multiclass problems. In Proceedings of the Annual Conference Computational Learning Theory (COLT), pages 287 316, 2014. Gautam Dasarathy, Robert Nowak, and Xiaojin Zhu. S2: An efficient graph based active learning algorithm with application to nonparametric classification. 
In Proceedings of the Annual Conference Computational Learning Theory (COLT), pages 503 522, 2015. Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems (Neur IPS), pages 337 344, 2005. Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of Perceptron-based active learning. In Proceedings of the Annual Conference Computational Learning Theory (COLT), pages 249 263, 2005. Scott Doyle, James Monaco, Michael Feldman, John Tomaszewski, and Anant Madabhushi. An active learning based classification strategy for the minority class problem: application to histopathology annotation. BMC Bioinformatics, 12(1), 2011. Ran El-Yaniv and Yair Wiener. Active learning via perfect selective classification. Journal of Machine Learning Research, 13(2), 2012. Akshay Gadde, Eyal En Gad, Salman Avestimehr, and Antonio Ortega. Active learning for community detection in stochastic block models. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), pages 1889 1893, 2016. doi: 10.1109/ISIT.2016. 7541627. Buddhima Gamlath, Sangxia Huang, and Ola Svensson. Semi-supervised algorithms for approximately optimal and accurate clustering. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP), pages 57:1 57:14, 2018. doi: 10.4230/LIPIcs.ICALP.2018.57. Apostolos Giannopoulos. Notes on isotropic convex bodies, 2003. URL http://users.uoa. gr/~apgiannop/isotropic-bodies.pdf. Alon Gonen, Sivan Sabato, and Shai Shalev-Shwartz. Efficient active learning of halfspaces: an aggressive approach. Journal of Machine Learning Research, 14(1):2583 2615, 2013. Margin-Based Active Learning of Classifiers Lee-Ad Gottlieb and Robert Krauthgamer. Proximity algorithms for nearly doubling spaces. SIAM Journal on Discrete Mathematics, 27(4):1759 1769, 2013. doi: 10.1137/120874242. Lee-Ad Gottlieb, Aryeh Kontorovich, and Pinhas Nisnevitch. Nearly optimal classification for semimetrics. The Journal of Machine Learning Research, 18(1):1233 1254, 2017. Andrew Guillory and JeffBilmes. Active semi-supervised learning using submodular functions. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pages 274 282, 2011. Steve Hanneke. Theoretical Foundations of Active Learning. Ph D thesis, Carnegie Mellon University, 2009. URL http://reports-archive.adm.cs.cmu.edu/anon/ml2009/ CMU-ML-09-106.pdf. AAI3362265. Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131 309, 2014. doi: 10.1561/2200000037. Steve Hanneke and Liu Yang. Minimax analysis of active learning. The Journal of Machine Learning Research, 16(1):3487 3602, 2015. Max Hopkins, Daniel Kane, and Shachar Lovett. The power of comparisons for actively learning linear classifiers. In Advances in Neural Information Processing Systems (Neur IPS), pages 6342 6353, 2020a. Max Hopkins, Daniel Kane, Shachar Lovett, and Gaurav Mahajan. Noise-tolerant, reliable active classification with comparison queries. In Proceedings of the Annual Conference Computational Learning Theory (COLT), pages 1957 2006, 2020b. Max Hopkins, Daniel Kane, Shachar Lovett, and Michal Moshkovitz. Bounded memory active learning through enriched queries. In Proceedings of the Annual Conference Computational Learning Theory (COLT), pages 2358 2387, 2021. Fritz John. Extremum problems with inequalities as subsidiary conditions. Birkhäuser Boston e Books, Jan 1985. doi: 10.1007/978-1-4612-5412-6_25. 
Daniel Kane, Shachar Lovett, Shay Moran, and Jiapeng Zhang. Active classification with comparison queries. In Proceedings of the IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 355-366, 2017. doi: 10.1109/FOCS.2017.40.
Ravi Kannan, László Lovász, and Miklós Simonovits. Isoperimetric problems for convex bodies and a localization lemma. Discrete & Computational Geometry, 13(3):541-559, 1995. doi: 10.1007/BF02574061.
Leonid G. Khachiyan. Rounding of polytopes in the real number model of computation. Mathematics of Operations Research, 21(2):307-320, 1996. doi: 10.1287/moor.21.2.307.
J. Kivinen. Learning reliably and with one-sided error. Mathematical Systems Theory, 28(2):141-172, 1995. doi: 10.1007/BF01191474.
Andrey Kupavskii. The VC-dimension of k-vertex d-polytopes. Combinatorica, 40(6):869-874, 2020. doi: 10.1007/s00493-020-4475-4.
László Lovász and Santosh Vempala. Hit-and-run from a corner. SIAM Journal on Computing, 35(4):985-1005, 2006. doi: 10.1137/S009753970544727X.
Wolfgang Maass and György Turán. Lower bound methods and separation results for on-line learning models. Machine Learning, 9(2):107-145, 1992. doi: 10.1007/BF00992674.
Arya Mazumdar and Soumyabrata Pal. Semisupervised clustering, AND-queries and locally encodable source coding. In Advances in Neural Information Processing Systems (NeurIPS), pages 6489-6499, 2017.
Arya Mazumdar and Barna Saha. Query complexity of clustering with side information. In Advances in Neural Information Processing Systems (NeurIPS), pages 4682-4693, 2017a.
Arya Mazumdar and Barna Saha. Clustering with noisy queries. In Advances in Neural Information Processing Systems (NeurIPS), pages 5788-5799, 2017b.
Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, USA, 1995. ISBN 0521474655. doi: 10.1017/CBO9780511814075.
Márton Naszódi, Fedor Nazarov, and Dmitry Ryabogin. Fine approximation of convex bodies by polytopes. American Journal of Mathematics, 142(3):809-820, 2020. doi: 10.1353/ajm.2020.0018.
Ignacio M. Pelayo. Geodesic convexity in graphs. Springer-Verlag New York, 2013. doi: 10.1007/978-1-4614-8699-2.
Ronald L. Rivest and Robert Sloan. Learning complicated concepts reliably and usefully. In Proceedings of the Annual Conference on Computational Learning Theory (COLT), pages 69-79, 1988.
Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein. Shifting: One-inclusion mistake bounds and sample compression. Journal of Computer and System Sciences, 75(1):37-59, 2009. doi: 10.1016/j.jcss.2008.07.005.
Barna Saha and Sanjay Subramanian. Correlation clustering with same-cluster queries bounded by optimal cost. In Proceedings of the European Symposium on Algorithms (ESA), pages 81:1-81:17, 2019. doi: 10.4230/LIPIcs.ESA.2019.81.
Florian Seiffarth, Tamás Horváth, and Stefan Wrobel. Maximal closed set and half-space separations in finite closure systems. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), pages 21-37, 2019.
Burr Settles. Active learning. Technical report, University of Wisconsin-Madison, 2012. URL http://digital.library.wisc.edu/1793/60660.
Maximilian Thiessen and Thomas Gärtner. Active learning of convex halfspaces on graphs. In Advances in Neural Information Processing Systems (NeurIPS), pages 23413-23425, 2021.
Simon Tong and Edward Chang. Support vector machine active learning for image retrieval. In Proceedings of the ACM International Conference on Multimedia (ICM), pages 107-118, 2001. doi: 10.1145/500141.500159.
Roman Vershynin. High-dimensional probability: An introduction with applications in data science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018. doi: 10.1017/9781108231596.
Sharad Vikram and Sanjoy Dasgupta. Interactive Bayesian hierarchical clustering. In Proceedings of the International Conference on Machine Learning (ICML), volume 48, pages 2081-2090, 2016.
Yair Wiener, Steve Hanneke, and Ran El-Yaniv. A compression technique for analyzing disagreement-based active learning. Journal of Machine Learning Research, 16:713-745, 2015.
Linli Xu, James Neufeld, Bryce Larson, and Dale Schuurmans. Maximum margin clustering. In Advances in Neural Information Processing Systems (NeurIPS), pages 1537-1544, 2004.
Pan Zhang, Cristopher Moore, and Lenka Zdeborová. Phase transitions in semisupervised clustering of sparse networks. Physical Review E, 90(5):052802, 2014. doi: 10.1103/PhysRevE.90.052802.