# Federated Learning with Only Positive Labels

Felix X. Yu 1, Ankit Singh Rawat 1, Aditya Krishna Menon 1, Sanjiv Kumar 1

Abstract

We consider learning a multi-class classification model in the federated setting, where each user has access to the positive data associated with only a single class. As a result, during each federated learning round, the users need to locally update the classifier without having access to the features and the model parameters of the negative classes. Thus, naïvely employing conventional decentralized learning such as distributed SGD or Federated Averaging may lead to trivial or extremely poor classifiers. In particular, for embedding-based classifiers, all the class embeddings might collapse to a single point. To address this problem, we propose a generic framework for training with only positive labels, namely Federated Averaging with Spreadout (FedAwS), where the server imposes a geometric regularizer after each round to encourage classes to be spreadout in the embedding space. We show, both theoretically and empirically, that FedAwS can almost match the performance of conventional learning where users have access to negative labels. We further extend the proposed method to settings with large output spaces.

1. Introduction

We consider learning a classification model in the federated learning (McMahan et al., 2017) setup, where each user has access to only a single class. The users are not allowed to communicate with each other, nor do they have access to the classification model parameters associated with other users' classes. Examples of such settings include decentralized training of face recognition or speaker identification models, where, in addition to the user-specific facial images and voice samples, the classifiers of the users also constitute sensitive information that cannot be shared with other users.

1 Google Research, New York. Correspondence to: Felix X. Yu, Ankit Singh Rawat. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

This setting can also be extended to the case where each user has access to data associated with a small number of classes. For example, one application is deep retrieval in the federated setting: training a query-to-document relevance model based on user interactions, such as documents clicked after issuing a query, assuming that the clicks do not get recorded by a central server.

In this work, we assume that the classification models are embedding-based discriminative models (Krizhevsky et al., 2012; Vaswani et al., 2017; Konečný et al., 2016): both the classes and the input instances are embedded into the same space, and the similarity between the class embedding and the input embedding (i.e., the logit or score) captures the likelihood of the input belonging to the class. A popular example of this framework is neural network based classification. Here, given an input instance x ∈ X, a neural network g_θ : X → R^d (parameterized by θ) embeds the instance into a d-dimensional vector g_θ(x). The class embeddings are learned as a matrix W ∈ R^{C×d}, commonly referred to as the classification matrix, where C denotes the number of classes. Finally, the logits for the instance x are computed as W g_θ(x).

In the federated learning setup, one collaboratively learns the classification model with the help of a server which facilitates the iterative training process by keeping track of a global model.
During each round of the training process:

- The server sends the current global model to a set of participating users.
- Each user updates the model with its local data, and sends the model delta to the server.
- The server averages (Federated Averaging) the deltas collected from the participating users and updates the global model.

Notice that conventional synchronized distributed SGD falls into the federated learning framework if each user runs a single step of SGD and the data at different users is i.i.d. Federated learning has been widely studied in distributed training of neural networks due to its appealing characteristics, such as leveraging the computational power of edge devices (Li et al., 2019), removing the need to send user data to the server (McMahan et al., 2017), and various improvements on trust/security (Bonawitz et al., 2016), privacy (Agarwal et al., 2018), and fairness (Mohri et al., 2019).

However, conventional federated learning algorithms are not directly applicable to the problem of learning with only positive labels, for two key reasons. First, the server cannot communicate the full model to each user. Besides sending the instance embedding model g_θ(·), the server can communicate to the i-th user only the class embedding vector w_i associated with the positive class of that user. Note that, in various applications, the class embeddings constitute highly sensitive information, as they can potentially be used to identify the users. Second, when the i-th user updates the model using its local data, it only has access to a set of instances x ∈ X_i from the i-th class along with the class embedding vector w_i.

While training a standard embedding-based multi-class classification model, the underlying loss function encourages two properties: i) the similarity between an instance embedding and the positive class embedding should be as large as possible; and ii) the similarity between the instance embedding and the negative class embeddings should be as small as possible. In our problem setting, the latter is not possible because the user does not have access to the negative class embeddings. In other words, if we were to use the vanilla federated learning approach, we would essentially be minimizing a loss function that only encourages small distances between the instances and their positive classes in the embedding space. As a result, this approach would lead to a trivial optimal solution where all instances and classes collapse to a single point in the embedding space.

To address this problem, we propose the Federated Averaging with Spreadout (FedAwS) framework, where, in addition to Federated Averaging, the server applies a geometric regularization to make sure that the class embeddings are well separated (cf. Section 4). This prevents the model from collapsing to the aforementioned trivial solution. To the best of our knowledge, this is the first principled approach for learning in the federated setting without explicit access to negative classes. We further show that the underlying regularizer can be suitably modified to extend the FedAwS framework to settings with a large number of classes. This extension is crucial for real-world applications such as user identification models with a large number of users. Subsequently, we theoretically justify the FedAwS framework by showing that it approximates the conventional training setting with a loss function that has access to both positive and negative labels (cf. Section 5).
We further confirm the effectiveness of the proposed framework on various standard datasets in Section 6. Before presenting the aforementioned contributions, we begin by discussing the related work and formally describing the problem setup in Sections 2 and 3, respectively.

2. Related Work

To the best of our knowledge, this is the first work addressing the novel setting of distributed learning with only positive labels in the federated learning framework. The learning setting we consider is related to the positive-unlabeled (PU) setting, where one only has access to positives and unlabeled data. Different from PU learning (Liu et al., 2002; Elkan & Noto, 2008; du Plessis et al., 2015; Hsieh et al., 2015), in our federated setting the clients do not have access to unlabeled data spanning both positive and negative classes. The setting is also related to one-class classification (Moya & Hush, 1996; Manevitz & Yousef, 2001), used in applications such as outlier detection and novelty detection. Different from one-class classification, we are interested in collaboratively learning a multi-class classification model.

We consider the setting of learning a discriminative embedding-based classifier; popular neural networks fall in this category. An alternative approach is to train generative models. For example, each user can learn a generative model based on its own data, and the server performs MAP estimation at inference time. We do not consider this approach because it does not fit into the federated learning framework, where the clients and the server collaboratively train a single model. In addition, training a good generative model is both data- and computation-intensive. Another possible generative approach is to use federated learning to train a GAN model that synthesizes negative examples for each user, possibly using the techniques proposed in Augenstein et al. (2019), and thereby convert the problem into learning with both positives and negatives. However, training a GAN model in the federated setting is a separate and expensive process. In this paper we consider the setting where the users have access to neither true nor synthesized negatives.

As mentioned in the introduction, a typical application of federated learning with only positive labels is training user identification models such as speaker/face recognition models. Although the proposed FedAwS algorithm promotes user privacy by not sharing the data among the users or with the server, FedAwS itself does not provide formal privacy guarantees. To obtain formal privacy guarantees, we note that differential privacy methods for federated learning (Agarwal et al., 2018; Abadi et al., 2016) can be readily employed in FedAwS by adding noise to the updates sent from each user.

On the technical side, the proposed FedAwS can be seen as using stochastic negative mining to improve the spreadout regularizer. The stochastic negative mining method was first proposed in Reddi et al. (2019) to mine hard negative classes for each data point; in contrast, we mine hard negative classes for each class. The spreadout regularization was first proposed to improve the learning of discriminative visual descriptors (Zhang et al., 2017) and was further used in the extreme multiclass classification setting (Guo et al., 2019).
The spreadout regularization is related to the design of error-correcting output code (ECOC) matrices (Dietterich & Bakiri, 1991; Pujol et al., 2006). In order for the ECOC matrix to work, the class embeddings have to be well separated from each other. In particular, similar to Proposition 1, Yu et al. (2013) show that the classification error can be bounded by the distance between the data and the positive label in the embedding space, together with a measure of spreadout of the classes. Differently, our result is on the true error instead of the empirical error.

3. Problem Setup

3.1. Federated learning of a classification model

Let us first consider the conventional federated learning of a classification model, where each client has access to data from multiple classes. Let the instance space be X, and suppose there are C classes indexed by the set [C]. Let F ⊆ {f : X → R^C} be a set of scorer functions, where each scorer, given an instance x, assigns a score to each of the C classes. In particular, for c ∈ [C], f(x)_c represents the relevance of the c-th class for the instance x, as measured by the scorer f ∈ F. We consider scorers of the form

$f(x) = W g_\theta(x), \qquad (1)$

where g_θ : X → R^d maps the instance x to a d-dimensional embedding, and W ∈ R^{C×d} uses this embedding to produce the scores (or logits) for the C classes as W g_θ(x). The c-th row of W, w_c, is referred to as the embedding vector of the c-th class. The score of the c-th class is thus w_c^⊤ g_θ(x).

Let us assume a distributed setup with m clients. In the traditional federated learning setup, for i ∈ [m], the i-th client has access to n_i instance and label pairs S_i = {(x^i_1, y^i_1), ..., (x^i_{n_i}, y^i_{n_i})} ⊂ X × [C] distributed according to an unknown distribution P^i_{XY}, i.e., (x^i_j, y^i_j) ~ P^i_{XY}. Let S = ∪_{i∈[m]} S_i denote the set of n = Σ_{i∈[m]} n_i instance and label pairs collectively available at all the clients.

Our objective is to find a scorer in F that captures the true relevance of a class for a given instance. Formally, let ℓ : R^C × [C] → R be a loss function such that ℓ(f(x), y) measures the quality of the scorer f on the pair (x, y). The client minimizes an empirical estimate of the risk based on its local observations S_i:

$\hat{f} = \operatorname*{argmin}_{f \in \mathcal{F}} \ \hat{R}(f; S_i) := \frac{1}{n_i} \sum_{j \in [n_i]} \ell\big(f(x^i_j),\, y^i_j\big). \qquad (2)$

In the federated learning setting, the m clients are interested in collaboratively training a single classification model on their joint data. A coordinator server facilitates the joint iterative distributed training as follows. At the t-th round of training, the coordinator sends the current model parameters θ_t and W_t to all clients. For i ∈ [m], the i-th client updates the current model based on its local empirical estimate of the risk:¹

$\theta^i_t = \theta_t - \eta\, \nabla_{\theta_t} \hat{R}(f_t; S_i), \qquad (3)$

$W^i_t = W_t - \eta\, \nabla_{W_t} \hat{R}(f_t; S_i). \qquad (4)$

The coordinator receives the updated model parameters from all clients {θ^i_t, W^i_t}_{i∈[m]}, and updates its estimate of the model parameters using Federated Averaging:

$\theta_{t+1} = \sum_{i \in [m]} \omega_i\, \theta^i_t, \qquad W_{t+1} = \sum_{i \in [m]} \omega_i\, W^i_t, \qquad (5)$

where ω = (ω_1, ..., ω_m) denotes the weights that the coordinator assigns to the training samples of different clients. For example, ω_i = n_i/n assigns uniform importance to all the training samples across different clients.²
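To make the round structure in (3)-(5) concrete, the following is a minimal NumPy sketch of one conventional federated round; the single local gradient step, the `grad_risk` callback, and the per-client sample counts are illustrative assumptions rather than part of the formal setup above.

```python
import numpy as np

def client_update(theta, W, X_i, y_i, grad_risk, eta=0.1):
    """One local gradient step on the client's empirical risk, as in (3)-(4).

    grad_risk(theta, W, X, y) is an assumed callback returning (d_theta, d_W),
    the gradients of the local empirical risk R_hat(f; S_i).
    """
    d_theta, d_W = grad_risk(theta, W, X_i, y_i)
    return theta - eta * d_theta, W - eta * d_W

def server_average(client_params, counts):
    """Federated Averaging, as in (5), with weights omega_i = n_i / n."""
    n = float(sum(counts))
    thetas, Ws = zip(*client_params)          # list of (theta_i, W_i) pairs
    theta_next = sum((c / n) * th for c, th in zip(counts, thetas))
    W_next = sum((c / n) * W for c, W in zip(counts, Ws))
    return theta_next, W_next
```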
In the above, assuming that each client has data from multiple classes, the loss function in (2) can take various forms, such as the contrastive loss (Hadsell et al., 2006; Chopra et al., 2005), the triplet loss (Chechik et al., 2010), and the softmax cross-entropy. All such losses encourage two properties:

- The embedding vector g_θ(x^i_j) and its positive class embedding w_{y^i_j} are close. In other words, one wants large logits or scores for positive instance and label pairs.
- The embedding vector g_θ(x^i_j) and the negative class embeddings w_c, c ≠ y^i_j, are far away. In other words, one wants small logits or scores for negative instance and label pairs.

For example, given a distance measure d(·,·), the contrastive loss is expressible as

$\ell_{cl}\big(f(x), y\big) = \underbrace{\alpha\, d\big(g_\theta(x), w_y\big)^2}_{\ell^{pos}_{cl}(f(x),y)} \;+\; \underbrace{\beta \sum_{c \neq y} \Big(\max\big\{0,\ \nu - d\big(g_\theta(x), w_c\big)\big\}\Big)^2}_{\ell^{neg}_{cl}(f(x),y)}, \qquad (6)$

where α, β ∈ R are some predefined constants. In (6), ℓ^pos_cl(·) encourages a high logit for the positive instance and label pair. Similarly, ℓ^neg_cl(·) aims to decrease the logits for the negative instance and label pairs.

¹ In the federated learning setup, the client may also update the model with a few steps, not just a single step.
² Recently, Mohri et al. (2019) proposed the agnostic federated learning framework to account for heterogeneous data distributions across the clients, which crucially relies on selecting non-uniform weights. In this paper, for ease of exposition, we restrict ourselves to the uniform weights, i.e., ω_i = n_i/n.

3.2. Federated learning with only positive labels

In this work, we consider the case where each client has access to only the data belonging to a single class. To simplify the notation, we assume that there are m = C clients and that the i-th client has access to the data of the i-th class. The algorithm and analysis also apply to the setting where multiple clients share the same class. The clients are not allowed to share their data with other clients, nor can they access the label embeddings associated with other clients. Formally, in each communication round, the i-th client has access to:

- n_i instance and label pairs with the same label i: S_i = {(x^i_1, i), ..., (x^i_{n_i}, i)} ⊂ X × [C];
- its own class embedding w_i;
- the current instance embedding model parameter θ.

Without access to negative instance and label pairs, the loss function can only encourage the instance embedding and the positive class embedding to be close to each other. For example, with the contrastive loss in (6), in the absence of negative labels, one can only employ the ℓ^pos_cl(·) part of the loss function. Since ℓ^pos_cl(·) is a monotonically decreasing function of the distance between the instance and the positive label, this approach would quickly lead to a trivial solution with small risk where all the users and the classes have an identical embedding. Regardless of the underlying loss function, training with only positive instance and label pairs will result in this degenerate solution. We propose an algorithm to address this problem in the next section.

4. Algorithm

To prevent all the class embeddings {w_i}_{i=1}^C from collapsing into a single point during optimization, we propose Federated Averaging with Spreadout (FedAwS).

4.1. Federated Averaging with Spreadout (FedAwS)

In addition to Federated Averaging, the server performs an additional optimization step on the class embedding matrix W ∈ R^{C×d} to ensure that different class embeddings are separated from each other by at least a margin of ν. In particular, in each round of training, the server employs a geometric regularization, namely the spreadout regularizer, which takes the following form:

$\mathrm{reg}_{sp}(W) = \sum_{c \in [C]} \sum_{c' \neq c} \Big(\max\big\{0,\ \nu - d(w_c, w_{c'})\big\}\Big)^2. \qquad (7)$
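For concreteness, here is a minimal NumPy sketch of (7); it assumes d(·,·) is the Euclidean distance and sums over ordered pairs, so each unordered pair is counted twice (a constant factor that can be absorbed into the regularization strength).

```python
import numpy as np

def spreadout_reg(W, nu):
    """Spreadout regularizer (7): sum over pairs c != c' of max(0, nu - d(w_c, w_c'))^2.

    W:  (C, d) matrix of class embeddings.
    nu: separation margin.
    """
    # Pairwise squared Euclidean distances between class embeddings.
    sq_norms = (W ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * W @ W.T
    dists = np.sqrt(np.maximum(sq_dists, 0.0))
    hinge = np.maximum(0.0, nu - dists) ** 2
    np.fill_diagonal(hinge, 0.0)   # exclude the c == c' terms
    return hinge.sum()             # counts each unordered pair twice
```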
A similar objective was first proposed as a regularizer to improve the learning of discriminative visual descriptors (Zhang et al., 2017) and was then used in extreme multiclass classification (Guo et al., 2019). There, it was shown that the spreadout regularization can improve the quality and stability of the learned models. In this work, we argue that the spreadout regularizer, along with the positive part of the underlying loss function (e.g., ℓ^pos_cl(·) in (6)), constitutes a valid loss function that takes the similarity of the instance to both positive and negative labels into account (cf. Section 5). This proves critical in allowing for meaningful training in the federated setting with only positive labels.

The FedAwS algorithm, which modifies Federated Averaging using the spreadout regularizer, is summarized in Algorithm 1. Note that in Step 7, the local objective at each client is defined by the positive part ℓ^pos(·) of the underlying loss (cf. (6)). The algorithm differs from conventional Federated Averaging in two ways. First, averaging of W is replaced by updating the class embeddings received from each client (Step 13). Second, an additional optimization step is performed on the server to encourage the separation of the class embeddings (Step 14). Here, we also introduce a learning rate multiplier λ which controls the effect of the spreadout regularization term on the trained model.

Algorithm 1: Federated Averaging with Spreadout (FedAwS)
1: Input: For C clients and C classes indexed by [C], n_i examples S_i at the i-th client.
2: Server initializes model parameters θ_0, W_0.
3: for t = 0, 1, ..., T - 1 do
4:   The server communicates (θ_t, w^t_i) to the i-th client.
5:   for i = 1, 2, ..., C do
6:     The i-th client updates the model based on S_i:
7:       (θ_{t,i}, w^{t,i}_i) ← (θ_t, w^t_i) - η ∇_{(θ_t, w^t_i)} R̂_pos(S_i),
8:       where R̂_pos(S_i) = (1/n_i) Σ_{j∈[n_i]} ℓ^pos(f(x^i_j), y^i_j).
9:     The i-th client sends (θ_{t,i}, w^{t,i}_i) to the server.
10:  end for
11:  Server updates the model parameters:
12:    θ_{t+1} = (1/C) Σ_{i∈[C]} θ_{t,i}.
13:    W̃_{t+1} = [w^{t,1}_1, ..., w^{t,C}_C]^⊤.
14:    W_{t+1} ← W̃_{t+1} - λ η ∇_{W̃_{t+1}} reg_sp(W̃_{t+1}).
15: end for
16: Output: θ_T and W_T.

Remark 1. In Algorithm 1, we assume that all clients participate in each communication round for ease of exposition. However, the algorithm easily extends to the practical setting where only a subset of clients is involved in each round: let C_t denote the set of clients participating in the t-th round; then the server performs the updates in Step 12 and Step 13 using the information received from the clients indexed by C_t. Note that the optimization in Step 7 and Step 14 can employ multiple SGD steps or be based on other optimizers.

4.2. FedAwS with stochastic negative mining

Two unique challenges arise when we perform optimization with respect to (7). First, the best ν is problem dependent and therefore hard to choose. Second, when C is large (the extreme multiclass classification setting), even computing the spreadout regularizer becomes expensive. To this end, we propose the following modification of (7):

$\mathrm{reg}^{top}_{sp}(W) = -\sum_{c \in \widetilde{C}} \sum_{y \neq c} d^2(w_c, w_y)\, [\![\, y \in N_k(c) \,]\!], \qquad (8)$

where C̃ is a subset of classes, and N_k(c) denotes the set of k classes that are closest to the class c in the embedding space. The regularizer in (8) can be viewed as an adaptive approximation of the spreadout regularizer in (7), where, for each class c, we adaptively set ν to be the distance between w_c and its (k+1)-th closest class embedding.
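The sketch below illustrates one way the server-side step of Algorithm 1 (Step 14) could be carried out with the mined regularizer (8); the explicit loops, the choice to update both embeddings of a mined pair, and the example values of k, λ, and η are assumptions made for clarity rather than a reference implementation.

```python
import numpy as np

def mined_spreadout_grad(W, k, classes=None):
    """Gradient of the negative-mined spreadout term (8) with respect to W.

    For each class c (optionally a sampled subset), push away its k nearest
    class embeddings by differentiating -sum_y d^2(w_c, w_y) over y in N_k(c).
    Assumes Euclidean distance; loops are kept for readability.
    """
    C, _ = W.shape
    if classes is None:
        classes = range(C)
    grad = np.zeros_like(W)
    for c in classes:
        dists = np.linalg.norm(W - W[c], axis=1)
        dists[c] = np.inf                      # exclude the class itself
        neighbors = np.argsort(dists)[:k]      # N_k(c): k closest classes
        for y in neighbors:
            diff = W[c] - W[y]
            grad[c] += -2.0 * diff             # d/dw_c of -||w_c - w_y||^2
            grad[y] += 2.0 * diff              # d/dw_y of -||w_c - w_y||^2
    return grad

# Server-side update, mirroring Step 14 of Algorithm 1 (illustrative values):
# W = W - lam * eta * mined_spreadout_grad(W, k=10)
```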
Intuitively, we only need to ensure that, in the embedding space, each class is as far away as possible from its close classes. This approach of adaptively picking ν is motivated by the stochastic negative mining method first proposed in Reddi et al. (2019), where, for each instance, only the positive label and a small set of most confusing ("hard") negative labels are used to define the underlying loss function. In contrast, we pick the most confusing classes based only on the class embeddings. Furthermore, the method is applied at the server as a regularizer, as opposed to defining the underlying loss function for an individual instance. As we demonstrate in Section 6, stochastic negative mining is crucial to improving the quality of FedAwS. Before presenting these empirical results, we provide a theoretical justification in the following section.

5. Analysis

To justify our FedAwS technique, we will: (i) relate the classification error to the separation of the class embeddings; (ii) introduce a particular cosine contrastive loss, which we show to be consistent for classification; and (iii) relate the FedAwS objective to empirical risk minimization using the cosine contrastive loss, despite the latter requiring both positive and negative labels. Put together, this justifies why the FedAwS classifier can be close in performance to a consistent classifier, despite only being trained with positive labels.

We first state a simple result arguing that a small distance between the instance embedding and the true class embedding, together with a large distance between the class embeddings, implies a low classification error.

Proposition 1. Let the minimum distance between the class embeddings be ρ := inf_{i≠j} d(w_i, w_j), and the expected distance between the embedding of an instance x and its true class y be ε = E_{(x,y)~P_XY} d(g_θ(x), w_y). Then, the probability of misclassification satisfies

$\mathbb{P}\big(\exists\, z \neq y \ \text{s.t.}\ d(g_\theta(x), w_y) \geq d(g_\theta(x), w_z)\big) \leq 2\epsilon/\rho.$

Proof. Note that, if there exists any z ≠ y such that d(g_θ(x), w_y) ≥ d(g_θ(x), w_z), then

$d(g_\theta(x), w_y) \geq \tfrac{1}{2}\big(d(g_\theta(x), w_y) + d(g_\theta(x), w_z)\big) \overset{(i)}{\geq} \tfrac{1}{2}\, d(w_y, w_z) \overset{(ii)}{\geq} \rho/2, \qquad (9)$

where (i) and (ii) follow from the triangle inequality and the definition of ρ, respectively. Next, by combining (9) with Markov's inequality, we obtain

$\mathbb{P}\big(\exists\, z \neq y \ \text{s.t.}\ d(g_\theta(x), w_y) \geq d(g_\theta(x), w_z)\big) \leq \mathbb{P}\big(d(g_\theta(x), w_y) \geq \rho/2\big) \leq \frac{2\, \mathbb{E}_{(x,y) \sim P_{XY}}\, d(g_\theta(x), w_y)}{\rho} = \frac{2\epsilon}{\rho}.$

To relate the FedAwS objective to a contrastive loss, without loss of generality, we work with normalized embeddings; i.e., we assume that the rows of the matrix W as well as the instance embeddings generated by g_θ(·) have unit Euclidean norm.³ We can then adopt the cosine distance

$d_{cos}(u, u') = 1 - u^\top u', \qquad u, u' \in \mathbb{R}^d. \qquad (10)$

Specializing the contrastive loss in (6) to the cosine distance measure gives us the cosine contrastive loss.

Definition 1 (Cosine contrastive loss). Given an instance and label pair (x, y) and the scorer f(x) in (1), the cosine contrastive loss takes the following form:

$\ell_{ccl}\big(f(x), y\big) = d_{cos}\big(g_\theta(x), w_y\big)^2 + \sum_{c \neq y} \Big(\max\big\{0,\ \nu - d_{cos}\big(g_\theta(x), w_c\big)\big\}\Big)^2. \qquad (11)$

Further, by using s_c = g_θ(x)^⊤ w_c to denote the logit for class c, the cosine contrastive loss can be expressed as

$\ell_{ccl}\big(f(x), y\big) = (1 - s_y)^2 + \sum_{c \neq y} \big(\max\{0,\ \nu - 1 + s_c\}\big)^2. \qquad (12)$

³ The analysis in this section easily extends to unnormalized embeddings. However, the restriction to normalized embeddings slightly improves performance empirically.

Note that, besides utilizing the cosine distance, we have used α = 1 and β = 1 in (6) to obtain (11).
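A minimal NumPy sketch of (12) for a single example is given below, assuming unit-norm embeddings; the default value of ν is illustrative.

```python
import numpy as np

def cosine_contrastive_loss(emb, W, y, nu=1.5):
    """Cosine contrastive loss (12) for a single (x, y) pair.

    emb: unit-norm instance embedding g_theta(x), shape (d,).
    W:   unit-norm class embeddings, shape (C, d).
    y:   index of the true class.
    """
    s = W @ emb                              # logits s_c = w_c^T g_theta(x)
    pos = (1.0 - s[y]) ** 2                  # (1 - s_y)^2
    hinge = np.maximum(0.0, nu - 1.0 + s) ** 2
    hinge[y] = 0.0                           # negatives only: c != y
    return pos + hinge.sum()
```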
The following result states that the cosine contrastive loss is a valid surrogate loss (Bartlett et al., 2006) for the misclassification error.

Lemma 1. Let ν ∈ (1, 2). The cosine contrastive loss in (12) is a surrogate loss of the misclassification error, i.e.,

$\ell_{ccl}\big(f(x), y\big) \geq 2(\nu - 1)\, [\![\, y \notin \mathrm{Top1}(f(x)) \,]\!], \qquad (13)$

where Top1(f(x)) denotes the indices of the classes to which f(·) assigns the highest score for the instance x.

Proof. If y ∈ Top1(f(x)), then [[y ∉ Top1(f(x))]] = 0. Since ℓ_ccl(f(x), y) ≥ 0, we have

$\ell_{ccl}\big(f(x), y\big) \geq 2(\nu - 1)\, [\![\, y \notin \mathrm{Top1}(f(x)) \,]\!] \qquad (14)$

in this case. Now, let us consider the case when y ∉ Top1(f(x)). For a ∈ R, let φ(a) = (1 − a)² and φ̄(a) = (max{0, ν − 1 − a})². We have

$\ell_{ccl}\big(f(x), y\big) = \varphi(s_y) + \sum_{c \neq y} \bar{\varphi}(-s_c) \geq \varphi(s_y) + \bar{\varphi}\big(-\max_{c \neq y} s_c\big) \overset{(i)}{\geq} \bar{\varphi}(s_y) + \bar{\varphi}\big(-\max_{c \neq y} s_c\big) \overset{(ii)}{\geq} 2\, \bar{\varphi}\Big(\tfrac{s_y - \max_{c \neq y} s_c}{2}\Big) \overset{(iii)}{\geq} 2(\nu - 1) = 2(\nu - 1)\, [\![\, y \notin \mathrm{Top1}(f(x)) \,]\!], \qquad (15)$

where (i) follows since φ(a) ≥ φ̄(a) for all a, (ii) utilizes the convexity of φ̄, and (iii) follows since φ̄(a) > ν − 1 for a < 0, and y ∉ Top1(f(x)) implies s_y − max_{c≠y} s_c < 0. The statement of the lemma follows from (14) and (15).

Lemma 1 established that the cosine contrastive loss is a valid surrogate for the misclassification error in the sense of Bartlett et al. (2006). We note here that the bound is tight. For example, for ε > 0, suppose ν = 2 − ε, s_y = −ε, s_{y'} = ε for some y' ≠ y, and s_c = −1 + ε/2 for all c ∉ {y, y'}. Then ℓ_ccl(f(x), y) = 2(1 − ε) + ε² + 4ε = 2(ν − 1)[[y ∉ Top1(f(x))]] + ε² + 4ε, which tends to 2(ν − 1)[[y ∉ Top1(f(x))]] as ε goes to 0. One may follow a similar analysis as in Reddi et al. (2019, Theorem 4) to show the statistical consistency (Zhang, 2004) of minimizing this loss.

We now explicate a connection between the classification-consistent cosine contrastive loss and the objective underlying the FedAwS algorithm. To do so, we assume that n_1 = ... = n_C = n/C, and note that FedAwS effectively seeks to collaboratively minimize

$\sum_{i \in [C]} \frac{n_i}{n}\, \hat{R}_{pos}(S_i) + \lambda\, \mathrm{reg}_{sp}(W), \qquad (16)$

with the regularizer reg_sp(W) from (7). Now we observe:

Proposition 2. Suppose λ = 1/C and n_1 = ... = n_C = n/C. Then, the FedAwS objective equals the empirical risk with respect to the loss function

$\ell_{sp}\big(f(x), y\big) = (1 - s_y)^2 + \sum_{c \neq y} \big(\max\{0,\ \nu - 1 + w_y^\top w_c\}\big)^2, \qquad (17)$

i.e., R_sp(f) = (1/n) Σ_{(x,y)∈S} ℓ_sp(f(x), y).

Proof. Note that

$\sum_{i \in [C]} \frac{n_i}{n}\, \hat{R}_{pos}(S_i) + \lambda\, \mathrm{reg}_{sp}(W) = \frac{1}{n} \sum_{(x,y) \in S} \ell^{pos}_{ccl}\big(f(x), y\big) + \lambda\, \mathrm{reg}_{sp}(W) \overset{(i)}{=} \frac{1}{n} \sum_{(x,y) \in S} \ell^{pos}_{ccl}\big(f(x), y\big) + \frac{\lambda C}{n} \sum_{(x,y) \in S} \sum_{c \neq y} \big(\max\{0,\ \nu - d_{cos}(w_y, w_c)\}\big)^2 \overset{(ii)}{=} \frac{1}{n} \sum_{(x,y) \in S} \Big[\ell^{pos}_{ccl}\big(f(x), y\big) + \sum_{c \neq y} \big(\max\{0,\ \nu - 1 + w_y^\top w_c\}\big)^2\Big] = \frac{1}{n} \sum_{(x,y) \in S} \ell_{sp}\big(f(x), y\big), \qquad (18)$

where (i) and (ii) follow from the assumptions that n_1 = ... = n_C and λ = 1/C, respectively. Note that the contribution of the negative labels in the loss function ℓ_sp is independent of the input embedding g_θ(x).

Remark 2. Proposition 2 shows that the spreadout regularizer reg_sp(W) can be viewed as the negative component of a conventional loss. The assumption that each user has an equal number of examples makes this easy to see; our analysis easily extends to unequal numbers of examples by using a weighted spreadout term for each class.

Next, we utilize Proposition 2 to argue that the FedAwS objective R_sp(f) approximates the empirical risk with respect to the cosine contrastive loss, a valid surrogate for the misclassification error (cf. Lemma 1). Proposition 2 establishes that R_sp(f) corresponds to the empirical risk with respect to ℓ_sp. Note that ℓ^pos_sp(f(x), y) = ℓ^pos_ccl(f(x), y) (cf. (6)). Thus, the desired result follows by establishing that ℓ^neg_sp(f(x), y) approximates ℓ^neg_ccl(f(x), y). This approximation becomes better as the input embedding g_θ(x) gets closer to its class embedding w_y, as encouraged by ℓ^pos_sp(f(x), y).
Theorem 1. Let ν ∈ (1, 2). Then, the loss ℓ_sp in (17) satisfies

$\ell_{ccl}\big(f(x), y\big) - (1 + 2\nu) \sum_{c \neq y} \big|w_c^\top r_{x,y}\big| \ \leq\ \ell_{sp}\big(f(x), y\big) \ \leq\ \ell_{ccl}\big(f(x), y\big) + (1 + 2\nu) \sum_{c \neq y} \big|w_c^\top r_{x,y}\big|, \qquad (19)$

where r_{x,y} = w_y − g_θ(x).

Proof. Note that r_{x,y} = w_y − g_θ(x) denotes the mismatch between w_y and g_θ(x). Thus, w_y^⊤ w_c = (g_θ(x) + r_{x,y})^⊤ w_c = s_c + r_{x,y}^⊤ w_c. As a result, ℓ_sp in (17) can be written as

$\ell_{sp}\big(f(x), y\big) = (1 - s_y)^2 + \sum_{c \neq y} \big(\max\{0,\ \nu - 1 + s_c + w_c^\top r_{x,y}\}\big)^2 = (1 - s_y)^2 + \sum_{c \neq y} \big(\max\{0,\ \nu - 1 + s_c\}\big)^2 + \sum_{c \neq y} \Delta_c = \ell_{ccl}\big(f(x), y\big) + \sum_{c \neq y} \Delta_c, \qquad (20)$

where

$\Delta_c := \big(\max\{0,\ \nu - 1 + s_c + w_c^\top r_{x,y}\}\big)^2 - \big(\max\{0,\ \nu - 1 + s_c\}\big)^2. \qquad (21)$

The result follows from (20) and Claim 1 below.

Claim 1. Given an instance and label pair (x, y) and the scorer f, for c ≠ y, let Δ_c be as defined in (21). Then,

$|\Delta_c| \leq (1 + 2\nu)\, \big|w_c^\top r_{x,y}\big|. \qquad (22)$

Proof. Let a = ν − 1 + s_c and b = w_c^⊤ r_{x,y}. Thus, we want to show that |(max{0, a + b})² − (max{0, a})²| ≤ (1 + 2ν)|b|. Let us consider four possible cases.

Case 1 (a + b < 0 and a < 0). In this case, (max{0, a + b})² − (max{0, a})² = 0.

Case 2 (a + b > 0 and a > 0). Note that |(max{0, a + b})² − (max{0, a})²| = |(a + b)² − a²| = |b(b + 2a)| ≤ (1 + 2ν)|b|, where the last inequality follows from the facts that |b| ≤ 1 and a = ν − 1 + s_c ≤ ν, since s_c ≤ 1.

Case 3 (a + b > 0 and a < 0). In this case, |(max{0, a + b})² − (max{0, a})²| = (a + b)² ≤ b² ≤ |b|, where (a + b)² ≤ b² holds since a < 0 < a + b implies 0 < a + b < b, and the last inequality follows as |b| = |w_c^⊤ r_{x,y}| ≤ 1.

Case 4 (a + b < 0 and a > 0). Note that |(max{0, a + b})² − (max{0, a})²| = (max{0, a})² = |a|² ≤^{(i)} |b|² ≤ |b|, where (i) follows since combining a > 0 and a + b < 0 gives the order b < −a < 0 < a, hence |a| < |b|.

Now, by combining the four cases above and using the fact that ν ∈ (1, 2), we obtain the desired result.

As a final remark, our analysis above assumed that the cosine contrastive loss (11) uses all labels c ≠ y as negatives for the given label y. However, using similar ideas as in Reddi et al. (2019), we may easily extend our analysis to the case where the loss uses only the k hardest labels as negatives (cf. (8)).

6. Experiments

We empirically evaluate the proposed FedAwS method on benchmark image classification and extreme multi-class classification datasets. In all experiments, both the class embeddings w_c and the instance embedding g_θ(x) are ℓ2-normalized, as we found this slightly improves model quality. For FedAwS, we use the squared hinge loss with cosine distance to define R̂_pos(S_i) at the clients (cf. Algorithm 1):

$\ell^{pos}\big(f(x), y\big) = \Big(\max\big\{0,\ 0.9 - g_\theta(x)^\top w_y\big\}\Big)^2. \qquad (23)$

This encourages all positive instance and label pairs (x, y) to have a dot product larger than 0.9 in the embedding space.
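For reference, a short NumPy sketch of the client-side objective (23) is shown below; it assumes ℓ2-normalized embeddings as described above, and the margin value is the one used in the experiments.

```python
import numpy as np

def positive_hinge_loss(emb, w_y, margin=0.9):
    """Client-side positive-only loss (23): max(0, 0.9 - g_theta(x)^T w_y)^2.

    emb and w_y are assumed to be l2-normalized, so the dot product is a
    cosine similarity that the loss pushes above the margin.
    """
    return max(0.0, margin - float(emb @ w_y)) ** 2
```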
We compare the following methods in our experiments.

- Baseline-1: Training with only the positive squared hinge loss. As expected, we observe very low precision values because the model quickly collapses to a trivial solution.
- Baseline-2: Training with only the positive squared hinge loss, with the class embeddings kept fixed. This is a simple way of preventing the class embeddings from collapsing into a single point.
- FedAwS: Our method with stochastic negative mining (cf. Section 4.2).
- Softmax: An oracle method of regular training with the softmax cross-entropy loss function, which has access to both positive and negative labels.

| Dataset | Model | Baseline-1 | Baseline-2 | FedAwS | Softmax (Oracle) |
|---|---|---|---|---|---|
| CIFAR-10 | RESNET-8 | 10.7 | 83.3 | 86.3 | 88.4 |
| CIFAR-10 | RESNET-32 | 9.8 | 92.1 | 92.4 | 92.4 |
| CIFAR-100 | RESNET-32 | 1.0 | 65.1 | 67.9 | 68.0 |
| CIFAR-100 | RESNET-56 | 1.1 | 67.5 | 69.6 | 70.0 |

Table 1: Precision@1 (%) on CIFAR-10 and CIFAR-100.

6.1. Experiments on CIFAR

We first present results on the CIFAR-10 and CIFAR-100 datasets. We train ResNets (He et al., 2016a;b) with different numbers of layers as the underlying models. Specifically, we train RESNET-8 and RESNET-32 for CIFAR-10, and RESNET-32 and RESNET-56 for CIFAR-100, which has a larger number of classes.

From Table 1, we see that on both CIFAR-10 and CIFAR-100, FedAwS almost matches or comes very close to the performance of the oracle method which has access to all labels. The first baseline method, training with only the positive squared hinge loss, does not lead to any meaningful precision values; as discussed above, the model collapses into a degenerate solution. Interestingly, the naïve way of preventing the embeddings from collapsing, namely fixing the class embeddings at their random initialization, gives a much better result. In fact, on CIFAR-10 with RESNET-32, Baseline-2 performs almost identically to the oracle and FedAwS. The reason behind this good performance is that, with a smaller number of classes, at a random initialization in a high-dimensional space (64 dimensions in this case) the class embeddings are already well spread-out, as they are almost orthogonal to each other. In addition, the 10 classes of CIFAR-10 are not related to each other. This makes the 10 nearly-orthogonal vectors ideal to be used as-is for class embeddings.

6.2. Experiments on extreme multiclass classification

Datasets. We test the proposed approach on standard extreme multi-label classification datasets (Varma, 2018). These datasets have a large number of classes and are therefore good representatives of the applications of federated learning with only positive labels. Similar to Reddi et al. (2019), because these datasets are multi-label, we uniformly sample positive labels to obtain datasets corresponding to multi-class classification problems. The datasets and their statistics are summarized in Table 2.

| Dataset | #Features | #Labels | #Train Points | #Test Points | Avg. #I/L | Avg. #L/I |
|---|---|---|---|---|---|---|
| AMAZONCAT | 203,882 | 13,330 | 1,186,239 | 306,782 | 448.57 | 5.04 |
| WIKILSHTC | 1,617,899 | 325,056 | 1,778,351 | 587,084 | 17.46 | 3.19 |
| AMAZON670K | 135,909 | 670,091 | 490,449 | 153,025 | 3.99 | 5.45 |

Table 2: Summary of the datasets used in the paper. #I/L is the number of instances per label, and #L/I is the number of labels per instance.

Model architecture. We use a simple embedding-based classification model wherein an instance x ∈ R^{d'}, a high-dimensional sparse vector, is first embedded into R^512 using a linear embedding lookup followed by averaging. The vector is then passed through a three-layer neural network with layer sizes 1024, 1024, and 512, respectively. The first two layers in the network apply a ReLU activation function. The output of the network is then normalized to obtain instance embeddings with unit ℓ2-norm. Each class is represented as a 512-dimensional normalized vector.
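The following NumPy sketch mirrors the instance tower described above; the exact weight shapes, the presence of bias terms, and the initialization are assumptions made for illustration and not the exact training configuration.

```python
import numpy as np

def instance_embedding(feature_ids, emb_table, W1, b1, W2, b2, W3, b3):
    """Sketch of the instance tower used in the extreme-classification runs.

    feature_ids: indices of the non-zero features of the sparse input x.
    emb_table:   (num_features, 512) embedding lookup table.
    W1/b1, W2/b2, W3/b3: weights of the 1024-1024-512 feed-forward network.
    """
    h = emb_table[feature_ids].mean(axis=0)       # lookup + averaging -> (512,)
    h = np.maximum(0.0, h @ W1 + b1)              # 512 -> 1024, ReLU
    h = np.maximum(0.0, h @ W2 + b2)              # 1024 -> 1024, ReLU
    h = h @ W3 + b3                               # 1024 -> 512, linear
    return h / (np.linalg.norm(h) + 1e-12)        # unit l2-norm output
```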
Training setup. SGD with a large learning rate is used to optimize the embedding layers, and Adagrad is used to update the other model parameters. In each round, we randomly select 4K clients associated with 4K labels. In addition to the methods used in the CIFAR experiments, we also compare FedAwS with SLEEC, an oracle method of regular training with access to both positive and negative labels.

Results. We report precision@k for k ∈ {1, 3, 5} in Table 3. On all the datasets, FedAwS largely outperforms the two baseline methods of training with only positive labels. On both AMAZONCAT and AMAZON670K, it matches or comes very close to the performance of Softmax and SLEEC. Baseline-2 gives reasonable (although quite sub-optimal) performance on AMAZONCAT, but does not work on AMAZON670K and WIKILSHTC, which have larger numbers of classes. Thus, randomly initialized class embeddings are not ideal when there are many classes, and it is crucial to train the class embeddings together with the rest of the model.

| Dataset | Metric | Baseline-1 | Baseline-2 | FedAwS | Softmax (Oracle) | SLEEC (Oracle) |
|---|---|---|---|---|---|---|
| AMAZONCAT | P@1 | 3.4 | 64.1 | 92.1 | 92.1 | 90.5 |
| AMAZONCAT | P@3 | 3.2 | 46.8 | 70.8 | 77.9 | 76.3 |
| AMAZONCAT | P@5 | 3.1 | 32.6 | 58.7 | 62.3 | 61.5 |
| AMAZON670K | P@1 | 0.0 | 4.3 | 33.1 | 35.2 | 35.1 |
| AMAZON670K | P@3 | 0.0 | 2.8 | 29.6 | 31.6 | 31.3 |
| AMAZON670K | P@5 | 0.0 | 2.2 | 27.4 | 29.5 | 28.6 |
| WIKILSHTC | P@1 | 7.6 | 7.9 | 37.2 | 54.1 | 54.8 |
| WIKILSHTC | P@3 | 4.5 | 3.4 | 22.6 | 38.8 | 33.4 |
| WIKILSHTC | P@5 | 2.8 | 2.6 | 16.2 | 29.9 | 23.9 |

Table 3: P@1, P@3, P@5 (%) of different methods on AMAZONCAT, AMAZON670K, and WIKILSHTC. Baseline-1, Baseline-2, and FedAwS use only positive labels; Softmax and SLEEC are oracle methods.

Meta-parameters. There are two meta-parameters in the proposed method: the learning rate multiplier λ of the spreadout loss (cf. Algorithm 1), and the number k of top confusing labels considered in each round (cf. (8)). To make a fair comparison with the other methods, which do not have these meta-parameters, in all of our other experiments in Table 3 we simply use k = 10 and λ = 10. We analyze these two parameters on the AMAZONCAT dataset in Table 4. The reason for the bad performance with a small k is that most of the picked labels are in fact positives in this setting (due to the inherent multi-label nature of the dataset), and over-spreading the positive classes is not desirable. On the other hand, a very large k leads to sub-optimal performance, verifying the benefit and necessity of stochastic negative mining. That said, we would like to emphasize that once k is large enough, FedAwS is robust with respect to k. Even with k = all (i.e., no stochastic negative mining), FedAwS is still far better than the baselines. Regarding λ, a relatively large value (10 or 100) is necessary to ensure that the class embeddings are sufficiently spreadout.

| Metric | Baseline-1 | Baseline-2 | k = 10 | k = 100 | k = 500 | k = all | λ = 1 | λ = 10 | λ = 100 |
|---|---|---|---|---|---|---|---|---|---|
| P@1 | 3.4 | 64.1 | 26.3 | 92.1 | 86.9 | 87.7 | 73.2 | 92.1 | 92.2 |
| P@3 | 3.2 | 46.8 | 21.5 | 70.8 | 66.1 | 69.7 | 50.2 | 70.8 | 71.7 |
| P@5 | 3.1 | 32.6 | 18.2 | 58.7 | 49.3 | 52.2 | 40.4 | 58.7 | 57.9 |

Table 4: P@1, P@3, P@5 (%) of different meta-parameters (FedAwS) on AMAZONCAT.

7. Conclusion

We studied a novel learning setting, federated learning with only positive labels, and proposed an algorithm that can learn a high-quality classification model without requiring negative instance and label pairs. The idea is to impose a geometric regularization on the server side to make all class embeddings spreadout. We justified the proposed method both theoretically and empirically. One can extend the identity-based class embeddings to settings where the class embeddings are generated from class-level features. In addition, we note that negative sampling techniques are crucial to make conventional extreme multiclass classification work; the proposed method is of independent interest in this setting because it replaces negative sampling altogether by imposing a strong geometric regularization. Further, even though our proposed method achieves near-oracle performance on multiple datasets, exploring the fundamental trade-off between the performance in our novel setting and that in the oracle setting is another interesting future research direction.

References

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318, 2016.
Agarwal, N., Suresh, A. T., Yu, F. X., Kumar, S., and McMahan, B. cpSGD: Communication-efficient and differentially-private distributed SGD. In Advances in Neural Information Processing Systems, pp. 7564–7575, 2018.

Augenstein, S., McMahan, H. B., Ramage, D., Ramaswamy, S., Kairouz, P., Chen, M., Mathews, R., et al. Generative models for effective ML on private, decentralized datasets. arXiv preprint arXiv:1911.06679, 2019.

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., Ramage, D., Segal, A., and Seth, K. Practical secure aggregation for federated learning on user-held data. arXiv preprint arXiv:1611.04482, 2016.

Chechik, G., Sharma, V., Shalit, U., and Bengio, S. Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 11(Mar):1109–1135, 2010.

Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, pp. 539–546, 2005.

Dietterich, T. G. and Bakiri, G. Error-correcting output codes: A general method for improving multiclass inductive learning programs. In AAAI, pp. 572–577, 1991.

du Plessis, M., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pp. 1386–1394, Lille, France, 07–09 Jul 2015. PMLR.

Elkan, C. and Noto, K. Learning classifiers from only positive and unlabeled data. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 213–220, 2008.

Guo, C., Mousavi, A., Wu, X., Holtmann-Rice, D. N., Kale, S., Reddi, S., and Kumar, S. Breaking the glass ceiling for embedding-based classifiers for large output spaces. In Advances in Neural Information Processing Systems, pp. 4944–4954, 2019.

Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Computer Vision and Pattern Recognition, volume 2, pp. 1735–1742, 2006.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016b.

Hsieh, C.-J., Natarajan, N., and Dhillon, I. PU learning for matrix completion. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pp. 2445–2453. PMLR, 07–09 Jul 2015.

Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. arXiv preprint arXiv:1908.07873, 2019.
Liu, B., Lee, W. S., Yu, P. S., and Li, X. Partially supervised classification of text documents. In International Conference on Machine Learning, volume 2, pp. 387–394, 2002.

Manevitz, L. M. and Yousef, M. One-class SVMs for document classification. Journal of Machine Learning Research, 2(Dec):139–154, 2001.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282, 2017.

Mohri, M., Sivek, G., and Suresh, A. T. Agnostic federated learning. In International Conference on Machine Learning, pp. 4615–4625, 2019.

Moya, M. M. and Hush, D. R. Network constraints and multi-objective optimization for one-class classification. Neural Networks, 9(3):463–474, 1996.

Pujol, O., Radeva, P., and Vitria, J. Discriminant ECOC: A heuristic method for application dependent design of error correcting output codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):1007–1012, 2006.

Reddi, S. J., Kale, S., Yu, F., Holtmann-Rice, D., Chen, J., and Kumar, S. Stochastic negative mining for learning with large output spaces. In Artificial Intelligence and Statistics, 2019.

Varma, M. Extreme classification repository. Website, August 2018. http://manikvarma.org/downloads/XC/XMLRepository.html.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Yu, F. X., Cao, L., Feris, R. S., Smith, J. R., and Chang, S.-F. Designing category-level attributes for discriminative visual recognition. In Computer Vision and Pattern Recognition, pp. 771–778, 2013.

Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56–85, 2004.

Zhang, X., Yu, F. X., Kumar, S., and Chang, S.-F. Learning spread-out local feature descriptors. In International Conference on Computer Vision, pp. 4595–4603, 2017.