(Almost) No Label No Cry

Giorgio Patrini1,2, Richard Nock1,2, Paul Rivera1,2, Tiberio Caetano1,3,4
Australian National University1, NICTA2, University of New South Wales3, Ambiata4
Sydney, NSW, Australia
{name.surname}@anu.edu.au

Abstract

In Learning with Label Proportions (LLP), the objective is to learn a supervised classifier when, instead of labels, only label proportions for bags of observations are known. This setting has broad practical relevance, in particular for privacy-preserving data processing. We first show that the mean operator, a statistic which aggregates all labels, is minimally sufficient for the minimization of many proper scoring losses with linear (or kernelized) classifiers without using labels. We provide a fast learning algorithm that estimates the mean operator via a manifold regularizer with guaranteed approximation bounds. We then present an iterative learning algorithm that uses this estimate as initialization. We ground this algorithm in Rademacher-style generalization bounds that fit the LLP setting, introducing a generalization of Rademacher complexity and a Label Proportion Complexity measure. This latter algorithm optimizes tractable bounds for the corresponding bag-empirical risk. Experiments are provided on fourteen domains, whose size ranges up to 300K observations. They display that our algorithms are scalable and tend to consistently outperform the state of the art in LLP. Moreover, in many cases, our algorithms compete with or are just a few percent of AUC away from the Oracle that learns knowing all labels. On the largest domains, half a dozen proportions can suffice, i.e. roughly 40K times fewer than the total number of labels.

1 Introduction

Machine learning has recently experienced a proliferation of problem settings that, to some extent, enrich the classical dichotomy between supervised and unsupervised learning. Settings such as multiple instance labels, noisy labels, partial labels, as well as semi-supervised learning, have been studied, motivated by applications where fully supervised learning is no longer realistic. In the present work, we are interested in learning a binary classifier from information provided at the level of groups of instances, called bags. The type of information we assume available is the label proportion per bag, indicating the fraction of positive binary labels among its instances. Inspired by [1], we refer to this framework as Learning with Label Proportions (LLP). Settings that perform a bag-wise aggregation of labels include Multiple Instance Learning (MIL) [2]. In MIL, the aggregation is logical rather than statistical: each bag is provided with a binary label expressing an OR condition on all the labels contained in the bag. More general settings also exist [3] [4] [5].

Many practical scenarios fit the LLP abstraction. (a) Only aggregated labels can be obtained due to the physical limits of measurement tools [6] [7] [8] [9]. (b) The problem is semi- or unsupervised, but domain experts have knowledge about the unlabelled samples in the form of expectations, as pseudo-measurements [5]. (c) Labels existed once but are now only available in an aggregated fashion for privacy-preserving reasons, as in medical databases [10], fraud detection [11], the house price market, election results, census data, etc. (d) The setting also arises in computer vision [12] [13] [14].

Related work. Two papers independently introduced the problem, [12] and [9].
In the first, the authors propose a hierarchical probabilistic model which generates labels consistent with the proportions, and make inference through MCMC sampling. Similarly, the second and its follower [6] offer a variety of standard machine learning methods designed to generate self-consistent labels. [15] gives a Bayesian interpretation of LLP where the key distribution is estimated through an RBM. Other ideas rely on structural learning of Bayesian networks with missing data [7], and on K-MEANS clustering to solve a preliminary label assignment in order to resort to fully supervised methods [13] [8]. Recent SVM implementations [11] [16] outperform most of the other known methods. Theoretical works on LLP belong to two main categories. The first contains uniform convergence results, for the estimators of label proportions [1], or for the estimator of the mean operator [17]. The second contains approximation results for the classifier [17]. Our work builds upon the Mean Map algorithm of [17], which relies on the trick that the logistic loss may be split in two parts: a convex part depending only on the observations, and a linear part involving a sufficient statistic for the labels, the mean operator. Being able to estimate the mean operator means being able to fit a classifier without using labels. In [17], this estimation relies on a restrictive homogeneity assumption: the class-conditional averages of the features do not depend on the bags. Experiments display the limits of this assumption [11] [16].

Contributions. In this paper we consider linear classifiers, but our results hold for kernelized formulations following [17]. We first show that the trick about the logistic loss can be generalized: the mean operator is actually minimally sufficient for a wide set of symmetric proper scoring losses with no class-dependent misclassification cost, which encompass the logistic, square and Matsushita losses [18]. We then provide an algorithm, LMM, which estimates the mean operator via a Laplacian-based manifold regularizer without resorting to the homogeneity assumption. We show that under a weak distinguishability assumption between bags, our estimate of the mean operator improves as the norm of the observations increases. This, as we show, cannot hold for the Mean Map estimator. Then, we provide a data-dependent approximation bound for our classifier with respect to the optimal classifier, which is shown to be better than previous bounds [17]. We also show that the manifold regularizer's solution is tightly related to the linear separability of the bags. We then provide an iterative algorithm, AMM, that takes as input the solution of LMM and optimizes it further over the set of consistent labelings. We ground the algorithm in a uniform convergence result involving a generalization of Rademacher complexities for the LLP setting. The bound involves a bag-empirical surrogate risk for which we show that AMM optimizes tractable bounds. All our theoretical results hold for any symmetric proper scoring loss. Experiments are provided on fourteen domains, ranging from hundreds to hundreds of thousands of examples, comparing AMM and LMM to their contenders: Mean Map, InvCal [11] and ∝SVM [16]. They display that AMM and LMM outperform their contenders, and sometimes even compete with the fully supervised learner while requiring only few proportions. Tests on the largest domains display the scalability of both algorithms. Such experimental evidence seriously questions the safety of privacy-preserving summarization of data, whenever accurate aggregates and informative individual features are available.
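To make the logistic-loss trick mentioned above concrete, here is the underlying identity written out (a standard derivation added for illustration, not an excerpt from the paper); for a linear classifier $h = \theta^\top x$ and a label $y \in \{-1, +1\}$:

```latex
% Split of the logistic loss into a label-independent convex part
% and a part that is linear in the label y:
\log\left(1 + e^{-y h}\right)
  = \underbrace{\tfrac{1}{2}\Big[\log\!\left(1 + e^{-h}\right) + \log\!\left(1 + e^{h}\right)\Big]}_{\text{does not depend on } y}
  \;-\; \tfrac{1}{2}\, y h .
% Averaging over a sample and taking h = \theta^\top x_i, the second term
% becomes -\tfrac{1}{2}\,\theta^\top \mu_{\mathcal{S}}, with
% \mu_{\mathcal{S}} = \tfrac{1}{m}\sum_i y_i x_i, so the labels enter the
% empirical risk only through \mu_{\mathcal{S}}.
```

Section 2 generalizes this decomposition beyond the logistic loss.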
Section 2 presents our algorithms and related theoretical results. Section 3 presents experiments. Section 4 concludes. A Supplementary Material [19] includes proofs and additional experiments.

2 LLP and the mean operator: theoretical results and algorithms

Learning setting. Hereafter, boldfaces like $p$ denote vectors, whose coordinates are denoted $p_l$ for $l = 1, 2, ...$. For any $m \in \mathbb{N}_*$, let $[m] \doteq \{1, 2, ..., m\}$, $\Sigma_m \doteq \{-1, 1\}^m$ and $\mathcal{X} \subseteq \mathbb{R}^d$. Examples are couples (observation, label) $\in \mathcal{X} \times \Sigma_1$, sampled i.i.d. according to some unknown but fixed distribution $D$. Let $\mathcal{S} \doteq \{(x_i, y_i), i \in [m]\} \sim D^m$ denote a size-$m$ sample. In Learning with Label Proportions (LLP), we do not observe $\mathcal{S}$ directly but $\mathcal{S}_{|y}$, which denotes $\mathcal{S}$ with labels removed; we are given its partition into $n > 0$ bags, $\mathcal{S}_{|y} = \cup_j \mathcal{S}_j$, $j \in [n]$, along with their respective label proportions $\hat{\pi}_j \doteq \hat{P}[y = +1 \mid \mathcal{S}_j]$ and bag proportions $\hat{p}_j \doteq m_j/m$, with $m_j = \mathrm{card}(\mathcal{S}_j)$. (This generalizes to a cover of $\mathcal{S}$, by copying examples among bags.) The bag assignment function that partitions $\mathcal{S}$ is unknown but fixed. In real-world domains it would rather be known, e.g. state, gender, age band. A classifier is a function $h : \mathcal{X} \rightarrow \mathbb{R}$, from a set of classifiers $H$. $H_L$ denotes the set of linear classifiers, noted $h_\theta(x) \doteq \theta^\top x$ with $\theta \in \mathcal{X}$. A (surrogate) loss is a function $F : \mathbb{R} \rightarrow \mathbb{R}_+$. We let $F(\mathcal{S}, h) \doteq (1/m) \sum_i F(y_i h(x_i))$ denote the empirical surrogate risk on $\mathcal{S}$ corresponding to loss $F$. For the sake of clarity, indexes $i$, $j$ and $k$ respectively refer to examples, bags and features.

The mean operator and its minimal sufficiency. We define the (empirical) mean operator as:

$\mu_\mathcal{S} \doteq \frac{1}{m} \sum_i y_i x_i$ .   (1)

Algorithm 1 Laplacian Mean Map (LMM)
Input: $\mathcal{S}_j$, $\hat{\pi}_j$, $j \in [n]$; $\gamma > 0$ (7); $w$ (7); $V$ (8); permissible $\phi$ (2); $\lambda > 0$
Step 1: let $\tilde{B}^\pm \leftarrow \arg\min_{X \in \mathbb{R}^{2n \times d}} \ell(L, X)$ using (7) (Lemma 2)
Step 2: let $\tilde{\mu}_\mathcal{S} \leftarrow \sum_j \hat{p}_j (\hat{\pi}_j \tilde{b}^+_j - (1 - \hat{\pi}_j) \tilde{b}^-_j)$
Step 3: let $\tilde{\theta} \leftarrow \arg\min_\theta F_\phi(\mathcal{S}_{|y}, \theta, \tilde{\mu}_\mathcal{S}) + \lambda \|\theta\|_2^2$ (3)
Return $\tilde{\theta}$

Table 1: Correspondence between permissible functions $\phi$ and the corresponding loss $F_\phi$.

loss name       | $F_\phi(x)$           | $\phi(x)$
logistic loss   | $\log(1 + \exp(-x))$  | $x \log x + (1-x)\log(1-x)$
square loss     | $(1-x)^2$             | $-x(1-x)$
Matsushita loss | $-x + \sqrt{1+x^2}$   | $-\sqrt{x(1-x)}$

The estimation of the mean operator $\mu_\mathcal{S}$ appears to be a learning bottleneck in the LLP setting [17]. The fact that the mean operator is sufficient to learn a classifier without the label information motivates the notion of a minimal sufficient statistic for features in this context. Let $\mathcal{F}$ be a set of loss functions, $H$ be a set of classifiers and $\mathcal{I}$ be a subset of features. Some quantity $t(\mathcal{S})$ is said to be a minimal sufficient statistic for $\mathcal{I}$ with respect to $\mathcal{F}$ and $H$ iff: for any $F \in \mathcal{F}$, any $h \in H$ and any two samples $\mathcal{S}$ and $\mathcal{S}'$, the quantity $F(\mathcal{S}, h) - F(\mathcal{S}', h)$ does not depend on $\mathcal{I}$ iff $t(\mathcal{S}) = t(\mathcal{S}')$. This definition can be motivated from the one in statistics by building losses from log-likelihoods. The following Lemma motivates further the mean operator in the LLP setting, as it is the minimal sufficient statistic for a broad set of proper scoring losses that encompass the logistic and square losses [18]. The proper scoring losses we consider, hereafter called symmetric proper scoring losses (SPSL), are twice differentiable, non-negative and such that the misclassification cost is not label-dependent.

Lemma 1 $\mu_\mathcal{S}$ is a minimal sufficient statistic for the label variable, with respect to SPSL and $H_L$. ([19], Subsection 2.1)
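As a quick numerical illustration of eq. (1) and of the identity behind Step 2 of Algorithm 1, the mean operator can be assembled from purely bag-level, label-wise averages. The following sketch (Python/NumPy on synthetic data; the variable names and the data are ours, not the authors') checks the decomposition $\mu_\mathcal{S} = \sum_j \hat{p}_j(\hat{\pi}_j b^+_j - (1 - \hat{\pi}_j) b^-_j)$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n = 300, 5, 3                          # examples, features, bags
X = rng.normal(size=(m, d))
y = rng.choice([-1, 1], size=m)              # labels in {-1, +1}
bag = rng.integers(0, n, size=m)             # hypothetical bag assignment

# Mean operator, eq. (1): mu_S = (1/m) * sum_i y_i x_i
mu_S = (y[:, None] * X).mean(axis=0)

# Bag-level statistics: sizes p_j, proportions pi_j, label-wise means b_j^{+/-}
mu_from_bags = np.zeros(d)
for j in range(n):
    in_j = bag == j
    p_j = in_j.mean()                            # \hat{p}_j = m_j / m
    pi_j = (y[in_j] == 1).mean()                 # \hat{\pi}_j
    b_plus = X[in_j & (y == 1)].mean(axis=0)     # b_j^{+1}
    b_minus = X[in_j & (y == -1)].mean(axis=0)   # b_j^{-1}
    mu_from_bags += p_j * (pi_j * b_plus - (1 - pi_j) * b_minus)

assert np.allclose(mu_S, mu_from_bags)       # the two expressions coincide
```

In LLP only the $\hat{p}_j$ and $\hat{\pi}_j$ on the right-hand side are observed; the label-wise bag means are unknown, and estimating them is exactly what Steps 1-2 of LMM do with the $\tilde{b}^\sigma_j$.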
This property, very useful for LLP, may also be exploited in other weakly supervised tasks [2]. Up to constant scalings that play no role in its minimization, the empirical surrogate risk corresponding to any SPSL, $F_\phi(\mathcal{S}, h)$, can be written with loss:

$F_\phi(x) \doteq \frac{\phi(0) + \phi^\star(-x)}{\phi(0) - \phi(1/2)} \doteq a_\phi + b_\phi \, \phi^\star(-x)$ ,   (2)

where $\phi$ is a permissible function [20, 18], i.e. $\mathrm{dom}(\phi) \supseteq [0, 1]$, $\phi$ is strictly convex, differentiable and symmetric with respect to $1/2$, and $\phi^\star$ is the convex conjugate of $\phi$. Table 1 shows examples of $F_\phi$. It follows from Lemma 1 and its proof that any $F_\phi(\mathcal{S}, \theta)$ can be written, for any $\theta$ such that $h_\theta \in H_L$, as:

$F_\phi(\mathcal{S}, \theta) = \frac{1}{2m} \sum_i \sum_{\sigma \in \Sigma_1} F_\phi(\sigma \, \theta^\top x_i) - \frac{b_\phi}{2} \, \theta^\top \mu_\mathcal{S} \doteq F_\phi(\mathcal{S}_{|y}, \theta, \mu_\mathcal{S})$ .   (3)

The Laplacian Mean Map (LMM) algorithm. The sum in eq. (3) is convex and differentiable in $\theta$. Hence, once we have an accurate estimator of $\mu_\mathcal{S}$, we can easily fit $\theta$ to minimize $F_\phi(\mathcal{S}_{|y}, \theta, \tilde{\mu}_\mathcal{S})$. This two-step strategy is implemented in LMM, Algorithm 1. $\mu_\mathcal{S}$ can be retrieved from the $2n$ bag-wise, label-wise unknown averages $b^\sigma_j$:

$\mu_\mathcal{S} = \frac{1}{2} \sum_j \hat{p}_j \sum_{\sigma \in \Sigma_1} (2\hat{\pi}_j + \sigma - 1) \, b^\sigma_j$ ,   (4)

with $b^\sigma_j \doteq E_\mathcal{S}[x \mid \sigma, j]$ denoting these $2n$ unknowns (for $j \in [n]$, $\sigma \in \Sigma_1$); let also $b_j \doteq (1/m_j) \sum_{x_i \in \mathcal{S}_j} x_i$. The $2n$ $b^\sigma_j$'s are solution of a set of $n$ identities that are, in matrix form:

$B - \Pi^\top B^\pm = 0$ ,   (5)

where $B \doteq [b_1 | b_2 | \cdots | b_n]^\top \in \mathbb{R}^{n \times d}$, $\Pi \doteq [\mathrm{DIAG}(\hat{\pi}) \,|\, \mathrm{DIAG}(1 - \hat{\pi})]^\top \in \mathbb{R}^{2n \times n}$ and $B^\pm \in \mathbb{R}^{2n \times d}$ is the matrix of unknowns:

$B^\pm \doteq [b^{+1}_1 | b^{+1}_2 | \cdots | b^{+1}_n \,|\, b^{-1}_1 | b^{-1}_2 | \cdots | b^{-1}_n]^\top$ .   (6)

System (5) is underdetermined, unless one makes the homogeneity assumption that yields the Mean Map estimator [17]. Rather than making such a restrictive assumption, we regularize the cost that brings (5) with a manifold regularizer [21], and search for $\tilde{B}^\pm = \arg\min_{X \in \mathbb{R}^{2n \times d}} \ell(L, X)$, with:

$\ell(L, X) \doteq \mathrm{tr}\left( (B - \Pi^\top X)^\top D_w (B - \Pi^\top X) \right) + \gamma \, \mathrm{tr}\left( X^\top L X \right)$ ,   (7)

and $\gamma > 0$. $D_w \doteq \mathrm{DIAG}(w)$ is a user-fixed bias matrix with $w \in \mathbb{R}^n_+$ (and $w = \hat{p}$ in general), and:

$L \doteq \varepsilon I + \begin{bmatrix} L_a & 0 \\ 0 & L_a \end{bmatrix} \in \mathbb{R}^{2n \times 2n}$ ,   (8)

where $L_a \doteq D - V \in \mathbb{R}^{n \times n}$ is the Laplacian of the bag similarities. $V$ is a symmetric similarity matrix with non-negative coordinates, and the diagonal matrix $D$ satisfies $d_{jj} \doteq \sum_{j'} v_{jj'}$, $\forall j \in [n]$. The size of the Laplacian is $O(n^2)$, which is very small compared to $O(m^2)$ if there are not many bags. One can interpret the Laplacian regularization as smoothing the estimates of the $b^\sigma_j$ w.r.t. the similarity of the respective bags.

Lemma 2 The solution $\tilde{B}^\pm$ to $\min_{X \in \mathbb{R}^{2n \times d}} \ell(L, X)$ is $\tilde{B}^\pm = (\Pi D_w \Pi^\top + \gamma L)^{-1} \Pi D_w B$. ([19], Subsection 2.2)

This Lemma explains the role of the penalty $\varepsilon I$ in (8): $\Pi D_w \Pi^\top$ and the block-Laplacian part of $L$ have null spaces of dimension $n$ and at least 1, respectively, so without it the inversion may not be possible. Even when this does not happen exactly, it may incur numerical instabilities in computing the inverse. For domains where this risk exists, picking a small $\varepsilon > 0$ solves the problem. Let $\tilde{b}^\sigma_j$ denote the row-wise decomposition of $\tilde{B}^\pm$ following (6), from which we compute $\tilde{\mu}_\mathcal{S}$ following (4), using these $2n$ estimates in lieu of the true $b^\sigma_j$. We compare $\mu_j \doteq \hat{\pi}_j b^+_j - (1 - \hat{\pi}_j) b^-_j$, $j \in [n]$, to our estimates $\tilde{\mu}_j \doteq \hat{\pi}_j \tilde{b}^+_j - (1 - \hat{\pi}_j) \tilde{b}^-_j$, $j \in [n]$, granted that $\mu_\mathcal{S} = \sum_j \hat{p}_j \mu_j$ and $\tilde{\mu}_\mathcal{S} = \sum_j \hat{p}_j \tilde{\mu}_j$.
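The two estimation steps of LMM then amount to one linear solve followed by a weighted combination. Below is a minimal sketch of Steps 1-2 under our reading of Lemma 2 and eqs. (5)-(8) (Python/NumPy; the function name, default values and input conventions are ours, not the authors' code):

```python
import numpy as np

def lmm_mean_operator(bags, pis, V, w, gamma=1.0, eps=1e-3):
    """Steps 1-2 of LMM as sketched here: estimate the label-wise bag means
    with the closed form of Lemma 2, then recombine them into mu_S.
    bags: list of (m_j, d) arrays; pis, w: length-n arrays; V: (n, n) similarity."""
    pis, w = np.asarray(pis, float), np.asarray(w, float)
    n, d = len(bags), bags[0].shape[1]
    m = sum(len(S) for S in bags)
    B = np.stack([S.mean(axis=0) for S in bags])           # (n, d) bag means b_j
    p = np.array([len(S) / m for S in bags])                # \hat{p}_j
    Pi = np.vstack([np.diag(pis), np.diag(1.0 - pis)])      # (2n, n)
    Dw = np.diag(w)
    La = np.diag(V.sum(axis=1)) - V                         # Laplacian of bag similarities
    L = eps * np.eye(2 * n) + np.block([[La, np.zeros((n, n))],
                                        [np.zeros((n, n)), La]])
    # Step 1 (Lemma 2 as reconstructed above): closed-form minimizer of (7)
    B_tilde = np.linalg.solve(Pi @ Dw @ Pi.T + gamma * L, Pi @ Dw @ B)   # (2n, d)
    b_plus, b_minus = B_tilde[:n], B_tilde[n:]
    # Step 2: recombine the estimated label-wise means into the mean operator
    return sum(p[j] * (pis[j] * b_plus[j] - (1 - pis[j]) * b_minus[j]) for j in range(n))
```

With $w = \hat{p}$ and $V$ built from eq. (10) or (11), this reproduces the structure of Algorithm 1; Step 3 is then an ordinary regularized convex minimization that plugs $\tilde{\mu}_\mathcal{S}$ into eq. (3).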
Theorem 3 Suppose that $\gamma$ satisfies $\gamma \leq 2 (\varepsilon (2n)^{-1} + \max_{j \neq j'} v_{jj'}) / \min_j w_j$. Let $M \doteq [\mu_1 | \mu_2 | \cdots | \mu_n]^\top \in \mathbb{R}^{n \times d}$, $\tilde{M} \doteq [\tilde{\mu}_1 | \tilde{\mu}_2 | \cdots | \tilde{\mu}_n]^\top \in \mathbb{R}^{n \times d}$ and $\varsigma(V, B^\pm) \doteq (\varepsilon (2n)^{-1} + \max_{j \neq j'} v_{jj'})^2 \, \|B^\pm\|_F$. The following holds:

$\|M - \tilde{M}\|_F \leq \frac{\sqrt{2n}}{2 \min_j w_j^2} \, \varsigma(V, B^\pm)$ .   (9)

([19], Subsection 2.3) The multiplicative factor to $\varsigma$ in (9) is roughly $O(n^{5/2})$ when there is no large discrepancy in the bias matrix $D_w$, so the upper bound is driven by $\varsigma(\cdot, \cdot)$ when there are not many bags. We have studied its variations when the distinguishability between bags increases. This setting is interesting because in this case we may kill two birds in one shot, with the estimation of $M$ and the subsequent learning problem potentially easier, in particular for linear separators. We consider two examples for $v_{jj'}$, the first being (half) the normalized association [22]:

$v^{\mathrm{nc}}_{jj'} \doteq \frac{1}{2} \left( \frac{\mathrm{ASSOC}(\mathcal{S}_j, \mathcal{S}_j)}{\mathrm{ASSOC}(\mathcal{S}_j, \mathcal{S}_j \cup \mathcal{S}_{j'})} + \frac{\mathrm{ASSOC}(\mathcal{S}_{j'}, \mathcal{S}_{j'})}{\mathrm{ASSOC}(\mathcal{S}_{j'}, \mathcal{S}_j \cup \mathcal{S}_{j'})} \right) = \frac{1}{2} \, \mathrm{NASSOC}(\mathcal{S}_j, \mathcal{S}_{j'})$ ,   (10)

$v^{\mathrm{G},s}_{jj'} \doteq \exp(-\|b_j - b_{j'}\|_2 / s)$ , $s > 0$ .   (11)

Here, $\mathrm{ASSOC}(\mathcal{S}_j, \mathcal{S}_{j'}) \doteq \sum_{x \in \mathcal{S}_j, x' \in \mathcal{S}_{j'}} \|x - x'\|_2^2$ [22]. To put these two similarity measures in the context of Theorem 3, consider the setting where we can make assumption (D1) that there exists a small constant $\kappa > 0$ such that $\|b_j - b_{j'}\|_2^2 \geq \kappa \max_{\sigma, j} \|b^\sigma_j\|_2^2$, $\forall j \neq j' \in [n]$. This is a weak distinguishability property: if no such $\kappa$ exists, then the centers of distinct bags may just be confounded. Consider also the additional assumption (D2) that there exists $\kappa' > 0$ such that $\max_j d_j^2 \leq \kappa'$, where $d_j \doteq \max_{x_i, x_{i'} \in \mathcal{S}_j} \|x_i - x_{i'}\|_2$ is a bag's diameter. In the following Lemma, the little-oh notation is with respect to the largest unknown in eq. (4), i.e. $\max_{\sigma, j} \|b^\sigma_j\|_2$.

Lemma 4 There exists $\varepsilon^* > 0$ such that, for all $\varepsilon \leq \varepsilon^*$, the following holds: (i) $\varsigma(V^{\mathrm{nc}}, B^\pm) = o(1)$ under assumptions (D1 + D2); (ii) $\varsigma(V^{\mathrm{G},s}, B^\pm) = o(1)$ under assumption (D1), for all $s > 0$. ([19], Subsection 2.4)

Hence, provided a weak (D1) or stronger (D1 + D2) distinguishability assumption holds, the divergence between $M$ and $\tilde{M}$ gets smaller as the norm of the unknowns $b^\sigma_j$ increases. The proof of the Lemma suggests that the convergence may be faster for $V^{\mathrm{G},s}$. The following Lemma shows that both similarities also partially encode the hardness of solving the classification problem with linear separators, so that the manifold regularizer limits the distortion of the $\tilde{b}^\cdot_j$'s between two bags that tend not to be linearly separable.

Lemma 5 Take $v_{jj'} \in \{v^{\mathrm{G},\cdot}_{jj'}, v^{\mathrm{nc}}_{jj'}\}$. There exist $0 < \kappa_l < \kappa_n < 1$ such that (i) if $v_{jj'} > \kappa_n$ then $\mathcal{S}_j, \mathcal{S}_{j'}$ are not linearly separable, and (ii) if $v_{jj'} < \kappa_l$ then $\mathcal{S}_j, \mathcal{S}_{j'}$ are linearly separable. ([19], Subsection 2.5)

This Lemma argues for fitting $s$ in $v^{\mathrm{G},s}_{jj'}$ in a data-dependent way.
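For illustration, here is how the Gaussian bag similarity of eq. (11) could be assembled (Python/NumPy sketch under our reading of (11); whether the exponent uses the norm or its square is flagged in the comment, and the bandwidth s is meant to be chosen by cross-validation, as discussed above):

```python
import numpy as np

def gaussian_bag_similarity(bags, s):
    """V^{G,s} of eq. (11), as we read it: similarity between bag means.
    bags: list of (m_j, d) arrays; s: bandwidth (> 0), e.g. selected by CV."""
    B = np.stack([S.mean(axis=0) for S in bags])               # bag means b_j
    dist = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=-1)
    # exp(-||b_j - b_j'||_2 / s); a squared-norm variant is also common.
    # The diagonal (self-similarity) cancels in the Laplacian D - V anyway.
    return np.exp(-dist / s)
```

The resulting V plugs directly into the Laplacian construction of the sketch after Lemma 2; the normalized-association similarity of eq. (10) can be computed analogously from the pairwise squared distances defining ASSOC.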
The question may be raised as to whether finite-sample approximation results like Theorem 3 can be proven for the Mean Map estimator [17]; [19], Subsection 2.6, answers in the negative. In the Laplacian Mean Map algorithm (LMM, Algorithm 1), Steps 1 and 2 have now been described. Step 3 is a differentiable convex minimization problem for $\theta$ that does not use the labels, so it does not present any technical difficulty. An interesting question is how much our classifier $\tilde{\theta}$ in Step 3 diverges from the one that would be computed with the true expression for $\mu_\mathcal{S}$, $\theta^*$. It is not hard to show that Lemma 17 in Altun and Smola [23] and Corollary 9 in Quadrianto et al. [17] hold for LMM, so that $\|\tilde{\theta} - \theta^*\|_2^2 \leq (2\lambda)^{-1} \|\tilde{\mu}_\mathcal{S} - \mu_\mathcal{S}\|_2^2$. The following Theorem shows a data-dependent approximation bound that can be significantly better, when it holds that $\tilde{\theta}^\top x_i, \theta^{*\top} x_i \in \phi'([0, 1])$, $\forall i$ ($\phi'$ is the first derivative of $\phi$). We call this setting proper scoring compliance (PSC) [18]. PSC always holds for the logistic and Matsushita losses, for which $\phi'([0, 1]) = \mathbb{R}$. For other losses like the square loss, for which $\phi'([0, 1]) = [-1, 1]$, shrinking the observations into a ball of sufficiently small radius is sufficient to ensure this.

Theorem 6 Let $f_k \in \mathbb{R}^m$ denote the vector encoding the $k$-th feature variable in $\mathcal{S}$: $f_{ki} = x_{ik}$ ($k \in [d]$). Let $\tilde{F}$ denote the feature matrix with column-wise normalized feature vectors $\tilde{f}_k \doteq (d / \sum_k \|f_k\|_2^2)^{(d-1)/(2d)} f_k$. Under PSC, we have $\|\tilde{\theta} - \theta^*\|_2^2 \leq (2\lambda + q)^{-1} \|\tilde{\mu}_\mathcal{S} - \mu_\mathcal{S}\|_2^2$, with:

$q \doteq \frac{\det(\tilde{F}^\top \tilde{F})}{m} \cdot 2 e^{-1} b_\phi \, \phi''\!\left( (\phi')^{-1}(q^*/\lambda) \right) \; (> 0)$ ,   (12)

for some $q^* \in \mathrm{I} \doteq [\pm(x_* + \max\{\|\tilde{\mu}_\mathcal{S}\|_2, \|\mu_\mathcal{S}\|_2\})]$. Here, $x_* \doteq \max_i \|x_i\|_2$ and $\phi''$ denotes the second derivative of $\phi$. ([19], Subsection 2.7)

To see how large $q$ can be, consider the simple case where all eigenvalues of $\tilde{F}^\top \tilde{F}$ lie within $\delta$ of a common value, for small $\delta$. In this case, $q$ is proportional to the average feature norm:

$\frac{\det(\tilde{F}^\top \tilde{F})}{m} = \frac{\mathrm{tr}(F^\top F)}{md} + o(\delta) = \frac{\sum_i \|x_i\|_2^2}{md} + o(\delta)$ .

The Alternating Mean Map (AMM) algorithm. Let us denote $\Sigma_{\hat{\pi}} \doteq \{\sigma \in \Sigma_m : \sum_{i : x_i \in \mathcal{S}_j} \sigma_i = (2\hat{\pi}_j - 1) m_j, \forall j \in [n]\}$ the set of labelings that are consistent with the observed proportions $\hat{\pi}$, and $\tilde{\mu}_\mathcal{S}(\sigma) \doteq (1/m) \sum_i \sigma_i x_i$ the biased mean operator computed from some $\sigma \in \Sigma_{\hat{\pi}}$. Notice that the true mean operator satisfies $\mu_\mathcal{S} = \tilde{\mu}_\mathcal{S}(\sigma)$ for at least one $\sigma \in \Sigma_{\hat{\pi}}$.

Algorithm 2 Alternating Mean Map (AMM_OPT)
Input: LMM parameters + optimization strategy OPT ∈ {min, max} + convergence predicate PR
Step 1: let $\theta_0 \leftarrow$ LMM(LMM parameters) and $t \leftarrow 0$
Step 2: repeat
  Step 2.1: let $\sigma_t \leftarrow \arg\mathrm{OPT}_{\sigma \in \Sigma_{\hat{\pi}}} F_\phi(\mathcal{S}_{|y}, \theta_t, \tilde{\mu}_\mathcal{S}(\sigma))$
  Step 2.2: let $\theta_{t+1} \leftarrow \arg\min_\theta F_\phi(\mathcal{S}_{|y}, \theta, \tilde{\mu}_\mathcal{S}(\sigma_t)) + \lambda \|\theta\|_2^2$
  Step 2.3: let $t \leftarrow t + 1$
until predicate PR is true
Return $\theta^* \doteq \arg\min_t F_\phi(\mathcal{S}_{|y}, \theta_{t+1}, \tilde{\mu}_\mathcal{S}(\sigma_t))$

The Alternating Mean Map algorithm (AMM, Algorithm 2) starts with the output of LMM and then optimizes it further over the set of consistent labelings. At each iteration, it first picks a consistent labeling in $\Sigma_{\hat{\pi}}$ that is the best (OPT = min) or the worst (OPT = max) for the current classifier (Step 2.1), and then fits a classifier $\theta$ on the given set of labels (Step 2.2). The algorithm iterates until a convergence predicate is met, which tests whether the difference between two successive values of $F_\phi(\cdot, \cdot, \cdot)$ is too small (AMMmin), or whether the number of iterations exceeds a user-specified limit (AMMmax). The classifier returned, $\theta^*$, is the best in the sequence. In the case of AMMmin, it is the last of the sequence, as the risk $F_\phi(\mathcal{S}_{|y}, \cdot, \cdot)$ cannot increase. Again, Step 2.2 is a convex minimization with no technical difficulty. Step 2.1 is combinatorial. It can be solved in time almost linear in $m$ ([19], Subsection 2.8).

Lemma 7 The running time of Step 2.1 in AMM is $\tilde{O}(m)$, where the tilde notation hides log-terms.
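Since the objective in Step 2.1 decomposes over examples and the constraint acts bag by bag, one way to read the $\tilde{O}(m)$ claim is the following: bag $j$ must receive exactly $\hat{\pi}_j m_j$ positive labels, and for OPT = min it is optimal to give $+1$ to the examples with the smallest values of $F_\phi(\theta^\top x_i) - F_\phi(-\theta^\top x_i)$, which one sort per bag delivers. A sketch under that reading (Python/NumPy, illustrative; not the authors' implementation):

```python
import numpy as np

def amm_step_2_1(bags_scores, pis, loss, opt="min"):
    """Pick a labeling consistent with the bag proportions that OPT-imizes
    the empirical surrogate risk for the current classifier.
    bags_scores: list of arrays of scores theta^T x_i, one array per bag;
    pis: label proportion of each bag;
    loss: surrogate loss F_phi, e.g. lambda z: np.logaddexp(0.0, -z)."""
    sigmas = []
    for scores, pi in zip(bags_scores, pis):
        m_j = len(scores)
        k_j = int(round(pi * m_j))             # number of positives in this bag
        delta = loss(scores) - loss(-scores)   # extra cost of labeling +1 vs -1
        order = np.argsort(delta)              # ascending: cheapest positives first
        if opt == "max":
            order = order[::-1]                # worst-case labeling instead
        sigma = -np.ones(m_j)
        sigma[order[:k_j]] = 1.0               # k_j selected examples get label +1
        sigmas.append(sigma)
    return sigmas
```

The resulting $\sigma_t$ then feeds $\tilde{\mu}_\mathcal{S}(\sigma_t) = (1/m)\sum_i \sigma_{t,i} x_i$ in Step 2.2.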
Bag-Rademacher generalization bounds for LLP. We relate the min and max strategies of AMM by uniform convergence bounds involving the true surrogate risk, i.e. integrating the unknown distribution $D$ and the true labels (which we may never know). Previous uniform convergence bounds for LLP focus on coarser-grained problems, like the estimation of label proportions [1]. We rely on an LLP generalization of Rademacher complexity [24, 25]. Let $F : \mathbb{R} \rightarrow \mathbb{R}_+$ be a loss function and $H$ a set of classifiers. The bag-empirical Rademacher complexity of sample $\mathcal{S}$, $R^{\mathrm{b}}_m$, is defined as

$R^{\mathrm{b}}_m \doteq E_{\sigma \sim \Sigma_m} \sup_{h \in H} \left\{ E_{\sigma' \sim \Sigma_{\hat{\pi}}} E_\mathcal{S}[\sigma(x) F(\sigma'(x) h(x))] \right\}$ .

The usual empirical Rademacher complexity equals $R^{\mathrm{b}}_m$ for $\mathrm{card}(\Sigma_{\hat{\pi}}) = 1$. The Label Proportion Complexity of $H$ is:

$L_{2m} \doteq E_{D^{2m}} E_{\mathrm{I}^{/2}_1, \mathrm{I}^{/2}_2} \sup_{h \in H} E_\mathcal{S}\!\left[ \sigma_1(x) \left( \hat{\pi}^s_{|2}(x) - \hat{\pi}^\ell_{|1}(x) \right) h(x) \right]$ .   (13)

Here, each of $\mathrm{I}^{/2}_l$, $l = 1, 2$, is a random (uniform) subset of $[2m]$ of cardinality $m$. Let $\mathcal{S}(\mathrm{I}^{/2}_l)$ be the size-$m$ subset of $\mathcal{S}$ that corresponds to the indexes. Take $l = 1, 2$ and any $x_i \in \mathcal{S}$. If $i \in \mathrm{I}^{/2}_l$, then $\hat{\pi}^s_{|l}(x_i) = \hat{\pi}^\ell_{|l}(x_i)$ is $x_i$'s bag's label proportion measured on $\mathcal{S} \setminus \mathcal{S}(\mathrm{I}^{/2}_l)$. Else, $\hat{\pi}^s_{|2}(x_i)$ is its bag's label proportion measured on $\mathcal{S}(\mathrm{I}^{/2}_2)$ and $\hat{\pi}^\ell_{|1}(x_i)$ is its label (i.e. a bag's label proportion that would contain only $x_i$). Finally, $\sigma_1(x) \doteq 2 \cdot 1_{x \in \mathcal{S}(\mathrm{I}^{/2}_1)} - 1 \in \Sigma_1$. $L_{2m}$ tends to be all the smaller as classifiers in $H$ have small magnitude on bags whose label proportion is close to $1/2$.

Theorem 8 Suppose there exists $h_* \geq 0$ such that $|h(x)| \leq h_*$, $\forall x, \forall h$. Then, for any loss $F_\phi$, any training sample of size $m$ and any $0 < \delta \leq 1$, with probability $> 1 - \delta$ the following bound holds over all $h \in H$:

$E_D[F_\phi(y h(x))] \leq E_{\Sigma_{\hat{\pi}}} E_\mathcal{S}[F_\phi(\sigma(x) h(x))] + 2 R^{\mathrm{b}}_m + L_{2m} + 4\sqrt{2} \, h_* \sqrt{\frac{\log(2/\delta)}{m}}$ .   (14)

Furthermore, under PSC (Theorem 6), we have for any $F_\phi$:

$R^{\mathrm{b}}_m \leq 2 b_\phi \, E_{\Sigma_m} \sup_{h \in H} \left\{ E_\mathcal{S}[\sigma(x) (\hat{\pi}(x) - 1/2) h(x)] \right\}$ .   (15)

([19], Subsection 2.9) Despite their similar shapes, (13) and (15), $R^{\mathrm{b}}_m$ and $L_{2m}$ behave differently: when bags are pure ($\hat{\pi}_j \in \{0, 1\}$, $\forall j$), $L_{2m} = 0$; when bags are impure ($\hat{\pi}_j = 1/2$, $\forall j$), $R^{\mathrm{b}}_m = 0$. As bags get impure, the bag-empirical surrogate risk, $E_{\Sigma_{\hat{\pi}}} E_\mathcal{S}[F_\phi(\sigma(x) h(x))]$, also tends to increase. AMMmin and AMMmax respectively minimize a lower bound and an upper bound of this risk.

3 Experiments

Algorithms. We compare LMM and AMM ($F_\phi$ = logistic loss) to the original MM [17], InvCal [11], conv-∝SVM and alter-∝SVM [16] (linear kernels). To make experiments extensive, we test several initializations for AMM that are not displayed in Algorithm 2 (Step 1): (i) the edge mean map estimator, $\mu^{\mathrm{EMM}}_\mathcal{S} \doteq (1/m^2)(\sum_i x_i)(\sum_i y_i)$ (AMM_EMM); (ii) the constant estimator $\mu^{1}_\mathcal{S} \doteq 1$ (AMM_1); and finally AMM_10ran, which runs 10 random initial models ($\|\theta_0\|_2 \leq 1$) and selects the one with smallest risk; this is the same procedure as in alter-∝SVM. The matrix $V$ (eqs. (10), (11)) used is indicated in subscript: LMM/AMM_G, LMM/AMM_{G,s} and LMM/AMM_nc respectively denote $v^{\mathrm{G},s}$ with $s = 1$, $v^{\mathrm{G},s}$ with $s$ learned by cross-validation (CV; validation ranges indicated in [19]), and $v^{\mathrm{nc}}$. For space reasons, results not displayed in the paper can be found in [19], Section 3 (including runtime comparisons and detailed results by domain).

Figure 1: (a) AUC relative to MM vs divergence from the homogeneity assumption, for MM, LMM_G, LMM_{G,s}, LMM_nc. (b) AUC relative to the Oracle vs bag-assignment entropy on heart, for LMM. (c) Same as (b) for AMMmin. (d) AUC relative to the Oracle vs #bags/#instances for AMMmin_{G,s}, on small and bigger domains.

Table 2: Small domains results: #win/#lose for row vs column over the 50 runs (e.g. AMMmin_{G,s} wins over AMMmin_G 7 times, loses 15 times, with 28 ties); significance assessed by Wilcoxon signed-rank tests (p-val < .001). The top-left block compares one-shot methods, the bottom-right block iterative methods, and the bottom-left block compares the two; InvCal, conv-∝SVM and alter-∝SVM are the state of the art, and AMMmin_G is the best of all.

vs            | MM    | LMM_G | LMM_G,s | LMM_nc | InvCal | AMMmin_MM | AMMmin_G | AMMmin_G,s | AMMmin_10ran | AMMmax_MM | AMMmax_G | AMMmax_G,s | AMMmax_10ran | conv-∝SVM
LMM_G         | 36/4  |       |         |        |        |           |          |            |              |           |          |            |              |
LMM_G,s       | 38/3  | 30/6  |         |        |        |           |          |            |              |           |          |            |              |
LMM_nc        | 28/12 | 3/37  | 2/37    |        |        |           |          |            |              |           |          |            |              |
InvCal        | 4/46  | 3/47  | 4/46    | 4/46   |        |           |          |            |              |           |          |            |              |
AMMmin_MM     | 33/16 | 26/24 | 25/25   | 32/18  | 46/4   |           |          |            |              |           |          |            |              |
AMMmin_G      | 38/11 | 35/14 | 30/20   | 37/13  | 47/3   | 31/7      |          |            |              |           |          |            |              |
AMMmin_G,s    | 35/14 | 33/17 | 30/20   | 35/15  | 47/3   | 24/11     | 7/15     |            |              |           |          |            |              |
AMMmin_10ran  | 27/22 | 24/26 | 22/28   | 26/24  | 44/6   | 20/30     | 16/34    | 19/31      |              |           |          |            |              |
AMMmax_MM     | 25/25 | 23/27 | 22/28   | 25/25  | 45/5   | 15/35     | 13/37    | 13/37      | 8/42         |           |          |            |              |
AMMmax_G      | 27/23 | 22/28 | 21/28   | 26/24  | 45/5   | 17/33     | 14/36    | 14/36      | 10/40        | 13/14     |          |            |              |
AMMmax_G,s    | 25/25 | 21/29 | 22/28   | 24/26  | 45/5   | 15/35     | 13/37    | 13/37      | 12/38        | 15/22     | 16/22    |            |              |
AMMmax_10ran  | 23/27 | 21/29 | 19/31   | 24/26  | 50/0   | 19/31     | 15/35    | 17/33      | 7/43         | 19/30     | 20/29    | 17/32      |              |
conv-∝SVM     | 21/29 | 2/48  | 2/48    | 2/48   | 2/48   | 4/46      | 3/47     | 3/47       | 4/46         | 3/47      | 3/47     | 4/46       | 0/50         |
alter-∝SVM    | 0/50  | 0/50  | 0/50    | 0/50   | 20/30  | 0/50      | 0/50     | 0/50       | 3/47         | 3/47      | 2/48     | 1/49       | 0/50         | 27/23

We split the algorithms in two groups, one-shot and iterative. The latter, including AMM and (conv/alter)-∝SVM, iteratively optimize a cost over labelings (always consistent with the label proportions for AMM, not always for (conv/alter)-∝SVM).
The former (LMM, InvCal) do not, and are thus much faster. Tests are done on a 4-core 3.2GHz Mac with 32GB of RAM. AMM/LMM/MM are implemented in R. Code for InvCal and ∝SVM comes from [16].

Simulated domains, MM and the homogeneity assumption. The testing metric is the AUC. Prior to testing on our domains, we generate 16 domains that gradually move the $b^\sigma_j$ away from each other (w.r.t. $j$), thus increasingly violating the homogeneity assumption [17]. The degree of violation is measured as the Frobenius distance between $B^\pm$ and the matrix obtained under the homogeneity assumption, i.e. with every $b^\sigma_j$ replaced by a bag-independent $b^\sigma$, for $\sigma \in \{-1, 1\}$ (see eq. (5)). Figure 1 (a) displays the ratio of the AUC of LMM to the AUC of MM. It shows that LMM fares all the better with respect to MM as the homogeneity assumption is violated. Furthermore, learning $s$ in LMM improves the results. Experiments on the simulated domain of [16], on which MM obtains zero accuracy, also display that our algorithms perform better (a single iteration of AMMmax brings 100% AUC).

Small and large domains experiments. We convert 10 small domains [19] ($m \leq 1000$) and 4 bigger ones ($m > 8000$) from UCI [26] into the LLP framework. We cast to one-against-all classification when the problem is multiclass. On large domains, the bag assignment function is inspired by [1]: we craft bags according to a selected feature value, and then we remove that feature from the data; a sketch of this protocol is given below. This conforms to the idea that bag assignment is structured and non-random in real-world problems. Most of our small domains, however, do not have a lot of features, so instead of clustering on one feature and then discarding it, we run K-MEANS on the whole data to make the bags, for $K = n \in 2^{[5]}$, i.e. $n \in \{2, 4, 8, 16, 32\}$.
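The sketch below illustrates the large-domain protocol just described: bags are the groups induced by one feature's values, the label proportion of each bag is recorded, and both the feature and the labels are then dropped (Python/pandas; the column names and the helper are hypothetical, not taken from the paper's code):

```python
import numpy as np
import pandas as pd

def make_bags_from_feature(df, feature, label_col="label"):
    """Group examples into bags by the value of one feature, record each bag's
    label proportion, then drop both the feature and the labels.
    Labels are assumed to be in {-1, +1}."""
    pis, bags = [], []
    for _, group in df.groupby(feature):
        pis.append((group[label_col] == 1).mean())          # \hat{\pi}_j
        bags.append(group.drop(columns=[feature, label_col]).to_numpy(float))
    return bags, np.array(pis)
```

For the small domains, the clusters returned by K-MEANS (e.g. sklearn.cluster.KMeans) can play the role of the groupby key above.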
Small domains results. We perform 5-fold nested CV comparisons on the 10 domains, which yields 50 AUC values for each algorithm. Table 2 synthesises the results [19], splitting one-shot and iterative algorithms. LMM_{G,s} outperforms all other one-shot algorithms. LMM_G and LMM_{G,s} are competitive with many iterative algorithms, but lose against their AMM counterparts, which proves that the additional optimization over labels is beneficial. AMMmin_G and AMMmin_{G,s} are confirmed as the best variants of AMM, the former being the best in this case. Surprisingly, all mean map algorithms, even the one-shot ones, are clearly superior to the ∝SVMs. Further results [19] reveal that ∝SVM performances are dampened by learning classifiers with inverted polarity, i.e. flipping the sign of the classifier improves its performances. Figure 1 (b, c) presents the AUC relative to the Oracle (which learns the classifier knowing all labels and minimizing the logistic loss), as a function of the Gini entropy of the bag assignment, $\mathrm{gini}(\mathcal{S}) \doteq 4 E_j[\hat{\pi}_j (1 - \hat{\pi}_j)]$. For an entropy close to 1, we were expecting a drop in performance. The unexpected observation [19] is that on some domains, large entropies ($\geq .8$) do not prevent AMMmin from competing with the Oracle. No such pattern clearly emerges for ∝SVM and AMMmax [19].

Big domains results. We adopt a 1/5 hold-out method. Scalability results [19] display that every method using $v^{\mathrm{nc}}$, as well as ∝SVM, is not scalable to big domains; in particular, the estimated time for a single run of alter-∝SVM is >100 hours on the adult domain. Table 3 presents the results on the big domains, distinguishing the feature used for bag assignment. Big domains confirm the efficiency of LMM+AMM. No approach clearly outperforms the rest, although LMM_{G,s} is often the best one-shot method.

Table 3: AUCs on the big domains (name: #instances x #features). Bag-assignment features: I = cap-shape, II = habitat, III = cap-colour, IV = race, V = education, VI = country, VII = poutcome, VIII = job; the number of bags is given in parentheses. The first block contains one-shot algorithms, the second AMMmin, the third AMMmax; the last row is the Oracle.

algorithm    | mushroom: 8124 x 108  | adult: 48842 x 89     | marketing: 45211 x 41  | census: 299285 x 381
             | I(6)   II(7)  III(10) | IV(5)  V(16)  VI(42)  | V(4)   VII(4) VIII(12) | IV(5)  VIII(9) VI(42)
EMM          | 55.61  59.80  76.68   | 43.91  47.50  66.61   | 63.49  54.50  44.31    | 56.05  56.25   57.87
MM           | 51.99  98.79  5.02    | 80.93  76.65  74.01   | 54.64  50.71  49.70    | 75.21  90.37   75.52
LMM_G        | 73.92  98.57  14.70   | 81.79  78.40  78.78   | 54.66  51.00  51.93    | 75.80  71.75   76.31
LMM_G,s      | 94.91  98.24  89.43   | 84.89  78.94  80.12   | 49.27  51.00  65.81    | 84.88  60.71   69.74
AMMmin_EMM   | 85.12  99.45  69.43   | 49.97  56.98  70.19   | 61.39  55.73  43.10    | 87.86  87.71   40.80
AMMmin_MM    | 89.81  99.01  15.74   | 83.73  77.39  80.67   | 52.85  75.27  58.19    | 89.68  84.91   68.36
AMMmin_G     | 89.18  99.45  50.44   | 83.41  82.55  81.96   | 51.61  75.16  57.52    | 87.61  88.28   76.99
AMMmin_G,s   | 89.24  99.57  3.28    | 81.18  78.53  81.96   | 52.03  75.16  53.98    | 89.93  83.54   52.13
AMMmin_1     | 95.90  98.49  97.31   | 81.32  75.80  80.05   | 65.13  64.96  66.62    | 89.09  88.94   56.72
AMMmax_EMM   | 93.04  3.32   26.67   | 54.46  69.63  56.62   | 51.48  55.63  57.48    | 71.20  77.14   66.71
AMMmax_MM    | 59.45  55.16  99.70   | 82.57  71.63  81.39   | 48.46  51.34  56.90    | 50.75  66.76   58.67
AMMmax_G     | 95.50  65.32  99.30   | 82.75  72.16  81.39   | 50.58  47.27  34.29    | 48.32  67.54   77.46
AMMmax_G,s   | 95.84  65.32  84.26   | 82.69  70.95  81.39   | 66.88  47.27  34.29    | 80.33  74.45   52.70
AMMmax_1     | 95.01  73.48  1.29    | 75.22  67.52  77.67   | 66.70  61.16  71.94    | 57.97  81.07   53.42
Oracle       | 99.82  99.81  99.8    | 90.55  90.55  90.50   | 79.52  75.55  79.43    | 94.31  94.37   94.45

Synthesis. Figure 1 (d) gives the AUCs of AMMmin_G relative to the Oracle's for all domains [19], as a function of the "degree of supervision", n/m (= 1 if the problem is fully supervised). Noticeably, on 90% of the runs, AMMmin_G gets an AUC representing at least 70% of the Oracle's. Results on big domains can be remarkable: on the census domain with bag assignment on race, 5 proportions are sufficient for an AUC 5 points below the Oracle's, which learns with 200K labels.

4 Conclusion

In this paper, we have shown that efficient learning in the LLP setting is possible, for general loss functions, via the mean operator and without resorting to the homogeneity assumption. Thanks to this sufficiency, estimating the mean operator allows one to resort to standard learning procedures for binary classification, practically implementing a reduction between machine learning problems [27]; hence the mean operator estimation may be a viable shortcut to tackle other weakly supervised settings [2] [3] [4] [5]. Approximation results and generalization bounds are provided. Experiments display results that are superior to the state of the art, with algorithms that scale to big domains at affordable computational costs. Performances sometimes compete with the Oracle's, which learns knowing all labels, even on big domains. These experimental findings have severe implications for the reliability of privacy-preserving aggregation techniques based on simple group statistics like proportions.

Acknowledgments

NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. The first author would like to acknowledge that part of this research was conducted during his internship at the Commonwealth Bank of Australia. We thank A. Menon and D. García-García for useful discussions.

References

[1] F.-X. Yu, S. Kumar, T. Jebara, and S.-F. Chang. On learning with label proportions. CoRR, abs/1402.5902, 2014.
[2] T.-G. Dietterich, R.-H. Lathrop, and T. Lozano-Pérez.
Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31-71, 1997.
[3] G.-S. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning of conditional random fields. In 46th ACL, 2008.
[4] J. Graça, K. Ganchev, and B. Taskar. Expectation maximization and posterior constraints. In NIPS*20, pages 569-576, 2007.
[5] P. Liang, M.-I. Jordan, and D. Klein. Learning from measurements in exponential families. In 26th ICML, pages 641-648, 2009.
[6] D.-J. Musicant, J.-M. Christensen, and J.-F. Olson. Supervised learning by training on aggregate outputs. In 7th ICDM, pages 252-261, 2007.
[7] J. Hernández-González, I. Inza, and J.-A. Lozano. Learning Bayesian network classifiers from label proportions. Pattern Recognition, 46(12):3425-3440, 2013.
[8] M. Stolpe and K. Morik. Learning from label proportions by optimizing cluster model selection. In 15th ECMLPKDD, pages 349-364, 2011.
[9] B.-C. Chen, L. Chen, R. Ramakrishnan, and D.-R. Musicant. Learning from aggregate views. In 22nd ICDE, pages 3-3, 2006.
[10] J. Wojtusiak, K. Irvin, A. Birerdinc, and A.-V. Baranova. Using published medical results and non-homogenous data in rule learning. In 10th ICMLA, pages 84-89, 2011.
[11] S. Rüping. SVM classifier estimation from group probabilities. In 27th ICML, pages 911-918, 2010.
[12] H. Kück and N. de Freitas. Learning about individuals from group statistics. In 21st UAI, pages 332-339, 2005.
[13] S. Chen, B. Liu, M. Qian, and C. Zhang. Kernel k-means based framework for aggregate outputs classification. In 9th ICDMW, pages 356-361, 2009.
[14] K.-T. Lai, F.-X. Yu, M.-S. Chen, and S.-F. Chang. Video event detection by inferring temporal instance labels. In 11th CVPR, 2014.
[15] K. Fan, H. Zhang, S. Yan, L. Wang, W. Zhang, and J. Feng. Learning a generative classifier from label proportions. Neurocomputing, 139:47-55, 2014.
[16] F.-X. Yu, D. Liu, S. Kumar, T. Jebara, and S.-F. Chang. ∝SVM for Learning with Label Proportions. In 30th ICML, pages 504-512, 2013.
[17] N. Quadrianto, A.-J. Smola, T.-S. Caetano, and Q.-V. Le. Estimating labels from label proportions. JMLR, 10:2349-2374, 2009.
[18] R. Nock and F. Nielsen. Bregman divergences and surrogates for learning. IEEE Trans. PAMI, 31:2048-2059, 2009.
[19] G. Patrini, R. Nock, P. Rivera, and T.-S. Caetano. (Almost) no label no cry - supplementary material. In NIPS*27, 2014.
[20] M.-J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In 28th ACM STOC, pages 459-468, 1996.
[21] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7:2399-2434, 2006.
[22] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22:888-905, 2000.
[23] Y. Altun and A.-J. Smola. Unifying divergence minimization and statistical inference via convex duality. In 19th COLT, pages 139-153, 2006.
[24] P.-L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463-482, 2002.
[25] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. of Stat., 30:1-50, 2002.
[26] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[27] A. Beygelzimer, V. Dani, T. Hayes, J. Langford, and B. Zadrozny. Error limiting reductions between classification tasks. In 22nd ICML, pages 49-56, 2005.