# Unbiased Multi-Label Learning from Crowdsourced Annotations

Mingxuan Xia 1,2, Zenan Huang 2, Runze Wu 3, Gengyu Lyu 4, Junbo Zhao 2, Gang Chen 2, Haobo Wang 1,2

Abstract

This work studies the novel Crowdsourced Multi-Label Learning (CMLL) problem, where each instance is associated with multiple true labels but the model only receives unreliable labels from different annotators. Although a few Crowdsourced Multi-Label Inference (CMLI) methods have been developed, they require both the training and testing sets to be assigned crowdsourced labels and focus on inferring the true labels rather than prediction, making them less practical. In this paper, by excavating the generation process of crowdsourced labels, we establish the first unbiased risk estimator for CMLL based on the crowdsourced transition matrices. To facilitate transition matrix estimation, we upgrade our unbiased risk estimator by aggregating crowdsourced labels and transition matrices from all annotators while guaranteeing its theoretical characteristics. Integrating the unbiased risk estimator, we further propose a decoupled autoencoder framework to exploit label correlations and boost performance. We also provide a generalization error bound to ensure the convergence of the empirical risk estimator. Experiments on various CMLL scenarios demonstrate the effectiveness of our proposed method. The source code is available at https://github.com/MingxuanXia/CLEAR.

1School of Software Technology, Zhejiang University, Ningbo, China. 2State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China. 3Fuxi AI Lab, NetEase Inc., Hangzhou, China. 4Faculty of Information Technology, Beijing University of Technology, Beijing, China. Correspondence to: Haobo Wang, Runze Wu.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Figure 1. CMLI approaches focus on directly uncovering the ground-truth labels given the crowdsourced ones on both training and testing sets, while CMLL takes a further step by learning a robust predictor based on crowdsourced labels that can generalize well on unseen instances.

1. Introduction

Multi-label learning (MLL) deals with scenarios where each instance belongs to multiple categories concurrently (Zhang & Zhou, 2007; 2014), and is widely adopted in real-world applications such as image recognition (Zha et al., 2008; Chen et al., 2019b), document classification (Rubin et al., 2012; Xiao et al., 2019), protein function prediction (Wu et al., 2014), and so on. However, the success of MLL relies on large amounts of precisely labeled data, making data annotation labor-intensive and time-consuming. On the other hand, crowdsourcing (Snow et al., 2008; Albarqouni et al., 2016; Rodrigues & Pereira, 2018) has recently established itself as an efficient and cost-effective solution for large-scale data annotation, where labels are collected from low-cost crowds. This gives rise to the potential significance of implementing crowdsourcing in the context of MLL.
Nonetheless, the study of crowdsourced MLL has been overlooked, since most existing crowdsourcing methods emphasize multi-class classification problems, where each instance is associated with a single label (Guan et al., 2018; Rodrigues & Pereira, 2018; Wei et al., 2022; Gao et al., 2022). Recently, a few Crowdsourced Multi-Label Inference (CMLI) approaches (Zhang & Wu, 2018; Li et al., 2019) have been proposed to address the crowdsourcing scenario when learning with multiple labels. As shown in the upper part of Figure 1, CMLI approaches focus on directly uncovering the ground-truth labels given the crowdsourced ones on both training and testing sets. However, CMLI is not only less practical but also lacks solid theoretical grounding. On the one hand, CMLI requires access to crowdsourced labels on testing sets, which is typically intractable. On the other hand, no existing CMLI method provides theoretical guarantees on how a model trained on crowdsourced labels generalizes to unseen instances. This raises an emergent question: how can we infer a theoretically robust MLL classifier from crowdsourced labels?

To bridge the gaps, we deal with the urgent but under-explored CMLL problem, which aims to train a multi-label predictor directly from crowdsourced labels, and propose a theoretically grounded method named CLEAR, i.e., Crowdsourced muLti-label learning with dEcoupled AutoencodeR. Specifically, we first excavate the generation process of crowdsourced data in the multi-label setting and establish the first unbiased risk estimator for CMLL based on the crowdsourced transition matrices. Subsequently, to avoid the high time cost and accumulated errors of estimating per-annotator transition matrices, we upgrade the unbiased risk estimator by aggregating labels from multiple annotators and establish the existence and formulation of the aggregated transition matrices. We also design feasible solutions for approximating the noisy posterior and estimating the aggregated transition matrices, which make our unbiased objective practical. Equipped with the unbiased risk estimator, we further devise a benchmark solution based on a decoupled autoencoder framework with latent space distillation to exploit label correlations and boost performance. Besides, we derive a generalization error bound for our statistically consistent algorithm to guarantee the performance of our method on new instances. Empirically, we evaluate CLEAR on five multi-label datasets under three crowdsourcing scenarios, where CLEAR demonstrates superior results against all baselines, including crowdsourcing-based, MLL, and weakly-supervised MLL approaches.

2. Related Work

2.1. Multi-Label Learning

Multi-label learning (MLL) (Liu et al., 2022) is a classical learning paradigm where each data example simultaneously relates to multiple binary labels. The most intuitive strategy for MLL is the one-versus-all algorithm (OVA) (Zhang & Zhou, 2014), which decomposes MLL into several binary prediction problems and is also followed by recent deep approaches (Ridnik et al., 2021; Li et al., 2022; Gao et al., 2023). Despite its simplicity, OVA neglects the rich semantic dependencies among labels and thus suffers from limited performance.
To remedy this problem, a plethora of approaches has been developed, such as chain-based algorithms (Read et al., 2011; Wang et al., 2016), graph-based methods (Chen et al., 2019b; Zhu et al., 2023), attention-based methods (Huynh & Elhamifar, 2020a; Zhu & Wu, 2021), and vision-language models (Hu et al., 2023; Ding et al., 2023). Among them, label embedding algorithms (Chen & Lin, 2012; Yeh et al., 2017; Chen et al., 2019a; Wang et al., 2020; Xiong et al., 2022) are a popular solution that assumes the label vectors can be projected into a lower-dimensional space due to semantic relations. Following this line of work, we also devise a label embedding framework, which only manipulates the feature space, to be compatible with our unbiased loss.

2.2. Weakly-Supervised Multi-Label Learning

Classical MLL approaches mostly assume the training data are fully supervised. However, due to the complicated structure of the label space, it can be too expensive and time-consuming to collect precise labels. To mitigate this problem, researchers have proposed a variety of weakly-supervised MLL settings, including semi-supervised MLL (Wei et al., 2018; Shi et al., 2020; Wang et al., 2021), multi-label learning with missing labels (Durand et al., 2019; Huynh & Elhamifar, 2020b; Schultheis et al., 2022), MLL with a single positive label (Cole et al., 2021; Cho et al., 2022; Xu et al., 2022), and partial MLL (Xie & Huang, 2018; Wang et al., 2019; Lyu et al., 2020; Xu et al., 2020). In this work, we study crowdsourced MLL, which collects labels from multiple weak annotators at reduced cost.

2.3. Crowdsourcing

Crowdsourcing is a popular paradigm that collects low-cost but unreliable labels, which relieves the burden of large-scale data annotation (Liu et al., 2023; Wang et al., 2024). Traditional crowdsourcing methods model crowdsourced labels with the expectation-maximization (EM) algorithm (Dawid & Skene, 1979) to identify the accurate labels (Whitehill et al., 2009; Raykar et al., 2009; Raykar & Yu, 2012; Dalvi et al., 2013; Zhang et al., 2016). Subsequently, deep learning-based methods (Albarqouni et al., 2016; Guan et al., 2018) were proposed and demonstrated superiority; they handle crowdsourced label noise by learning label transition matrices (Rodrigues & Pereira, 2018; Li et al., 2023; Chen et al., 2020; Wei et al., 2022; Gao et al., 2022). However, these methods study the setting where each instance is associated with a single label. Instead, we explore the crowdsourcing problem in the multi-label learning scenario, where samples are related to multiple labels. There have also been works (Zhang & Wu, 2018; Li et al., 2019) studying crowdsourcing in the context of learning with multiple labels. Nevertheless, they mostly concentrate on inferring the ground-truth labels behind the crowdsourced labels and need further training to obtain a predictor. In contrast, our work aims to learn, end-to-end, a classifier that generalizes to unseen instances, and provides theoretical insights.

3. Problem Setting

3.1. Multi-Label Learning

Multi-label learning (MLL) aims at assigning each instance multiple binary labels simultaneously. Let $\mathcal{X}$ denote the $d$-dimensional feature space and $\mathcal{Y} = \{0,1\}^K$ denote the label space with $K$ class labels. The training dataset $D = \{(x_i, y_i) \mid 1 \le i \le n\}$ contains $n$ examples, where $x_i \in \mathcal{X}$ is the instance vector and $y_i \in \mathcal{Y}$ is the label vector.
In this setting, $y_{ik} = 1$ indicates that the $k$-th label is associated with instance $x_i$, and $y_{ik} = 0$ otherwise. MLL aims to learn a multi-label predictor $f: \mathcal{X} \to \mathcal{Y}$ by minimizing the following risk:

$$R(f) = \mathbb{E}_{(x,y) \sim P(X,Y)}\left[\mathcal{L}(f(x), y)\right], \tag{1}$$

where $\mathcal{L}: \mathbb{R}^K \times \mathcal{Y} \to \mathbb{R}$ is the multi-label loss function, $X$ and $Y$ denote the random variables of $x$ and $y$, and $P(X, Y)$ is the data distribution from which the dataset is sampled. Note that we say a method guarantees risk consistency if an unbiased risk estimator is implemented, i.e., a risk estimator that equals $R(f)$ for the same classifier $f$ (Mohri et al., 2018; Feng et al., 2020).

3.2. Crowdsourced Multi-Label Learning

In this paper, we study a novel scenario called Crowdsourced Multi-Label Learning (CMLL), where fully supervised data is not accessible and a crowdsourced dataset $\tilde{D} = \{(x_i, \{\tilde{y}_i^m\}_{m=1}^M) \mid 1 \le i \le n\}$ is given instead. Specifically, each instance is labeled by $M$ annotators independently, and $\tilde{y}_i^m \in \mathcal{Y}$ denotes the label vector tagged by the $m$-th annotator. Let $\tilde{y}_{imk}$ denote the label on the $k$-th class given by the $m$-th annotator. The goal of CMLL is to learn a multi-label predictor $f: \mathcal{X} \to \mathcal{Y}$ from $\tilde{D}$ that assigns a relevant label set to each unseen instance. It is worth noting that the ground-truth label $y_i$ of each instance is related to the crowdsourced labels $\{\tilde{y}_i^m\}_{m=1}^M$ but is inaccessible during training. In the following, we omit the sample index $i$ when the context is clear.

Data Generation Process of CMLL. We consider that each crowdsourced label $\tilde{y}_{mk}$ is corrupted from its ground-truth label $y_k$ through one of $M \times K$ class-dependent, instance-independent transition matrices $\{T^{mk}\}_{m=1,k=1}^{M,K} \subset [0,1]^{2 \times 2}$. Denoting by $\tilde{Y}_{mk}$ and $Y_k$ the random variables of $\tilde{y}_{mk}$ and $y_k$, the transition matrix is defined by $T^{mk}_{ij} = P(\tilde{Y}_{mk} = j \mid Y_k = i)$, $i, j \in \{0,1\}$. With the instance-independence assumption (Xie & Huang, 2023; Li et al., 2022), i.e., $P(\tilde{Y}_{mk} = j \mid Y_k = i, X = x) = P(\tilde{Y}_{mk} = j \mid Y_k = i)$, the transition matrix bridges the class posterior probabilities of noisy and clean data:

$$P(\tilde{Y}_{mk} = j \mid X = x) = \sum_{i \in \{0,1\}} T^{mk}_{ij}\, P(Y_k = i \mid X = x),\quad j \in \{0,1\},\qquad T^{mk}_{01} + T^{mk}_{10} < 1. \tag{2}$$

Note that we assume the annotators do not make profound mistakes (Xie & Huang, 2023; Gao et al., 2022), which gives rise to the constraint on $T^{mk}$ in Eq. (2).

4. The Proposed Method

4.1. Unbiased Risk Estimator

In this subsection, we establish the first unbiased risk estimator for the CMLL problem. The theorem below guarantees risk consistency when solving CMLL.

Theorem 1. Decompose the MLL problem into $K$ independent binary classification problems, i.e., $\mathcal{L}(f(x), y) = \sum_{k=1}^K \ell(f_k(x), y_k)$, where $f_k$ refers to the prediction of the model on the $k$-th class and $\ell$ is the base loss function. Let $\tilde{R}(f) = \mathbb{E}_{P(X, \tilde{Y})}\big[\tilde{\mathcal{L}}(f(x), \{\tilde{y}^m\}_{m=1}^M)\big]$ and define

$$\tilde{\mathcal{L}}(f(x), \{\tilde{y}^m\}_{m=1}^M) = \frac{1}{2^M} \sum_{k=1}^K \sum_{j=0}^{1} \frac{P(Y_k = j \mid X = x)}{\prod_{m=1}^{M} \sum_{i=0}^{1} T^{mk}_{i \tilde{y}_{mk}}\, P(Y_k = i \mid X = x)}\; \ell(f_k(x), j), \tag{3}$$

where $\tilde{Y}$ denotes the random variable of the crowdsourced labels for each $x$. Then, $\tilde{R}(f)$ is an unbiased risk estimator with respect to $R(f)$.

The proof is provided in Appendix A. Note that decomposing the MLL loss into multiple binary classification losses is common practice in deep MLL (Ridnik et al., 2021; Li et al., 2022; Gao et al., 2023).

Remark. The unbiased risk estimator provided by Theorem 1 directly models the impact of each individual annotator. However, this objective requires estimating $M \times K$ individual transition matrices, which is not only time-consuming but also troublesome since transition matrix estimation errors can accumulate.
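To make Theorem 1 concrete, below is a minimal NumPy sketch of the per-example loss in Eq. (3). It is a direct transcription of the formula rather than the authors' implementation, and it assumes estimates of the clean posteriors $P(Y_k = j \mid x)$ are available as an input; the function and variable names are our own.

```python
import numpy as np

def theorem1_loss(f_x, y_crowd, T, p_clean, ell):
    """Per-example unbiased loss of Eq. (3).

    f_x:     (K,) model predictions for one example
    y_crowd: (M, K) binary labels from the M annotators
    T:       (M, K, 2, 2) transition matrices, T[m, k, i, j] = P(y~_mk = j | y_k = i)
    p_clean: (K, 2) estimated clean posteriors P(Y_k = j | x)
    ell:     base loss, e.g. lambda f, j: (f - j) ** 2
    """
    M, K = y_crowd.shape
    total = 0.0
    for k in range(K):
        # P({y~_mk} | x) factorizes over annotators: prod_m sum_i T[m,k,i,y~] P(Y_k=i|x)
        denom = np.prod([T[m, k, 0, y_crowd[m, k]] * p_clean[k, 0]
                         + T[m, k, 1, y_crowd[m, k]] * p_clean[k, 1]
                         for m in range(M)])
        total += sum(p_clean[k, j] / denom * ell(f_x[k], j) for j in (0, 1))
    return total / 2 ** M
```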
In what follows, we show that there exists an alternative solution that aggregates the $M$ crowdsourced label vectors $\{\tilde{y}^m\}_{m=1}^M \in \{0,1\}^{M \times K}$ into one label vector $\bar{y} \in \{0,1\}^K$. In this way, we only need to estimate $K$ transition matrices, provided transition matrices for the aggregated labels do exist.

Theorem 2. Let $\bar{y} = [\bar{y}_1, \ldots, \bar{y}_K]$ be the aggregated label vector for each $x$, and let $\bar{Y}_k$ be the random variable of $\bar{y}_k$. We have the following consequences:

(Existence) There exists a set of class-dependent, instance-independent transition matrices $\{\bar{T}^k\}_{k=1}^K \subset [0,1]^{2 \times 2}$ with $\bar{T}^k_{ij} = P(\bar{Y}_k = j \mid Y_k = i)$, $i, j \in \{0,1\}$, such that the unbiased risk estimator for CMLL with respect to $R(f)$ is $\bar{R}(f) = \mathbb{E}_{P(X, \bar{Y})}\big[\bar{\mathcal{L}}(f(x), \bar{y})\big]$, where

$$\bar{\mathcal{L}}(f(x), \bar{y}) = \sum_{k=1}^K \left[\frac{P(\bar{Y}_k = 1 \mid X = x) - \bar{T}^k_{01}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\, \ell(f_k(x), 1) + \frac{P(\bar{Y}_k = 0 \mid X = x) - \bar{T}^k_{10}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\, \ell(f_k(x), 0)\right]. \tag{4}$$

(Formulation of Transition Matrices) Let $A$ be the random variable of the index of the annotator. Denoting by $\omega_m = P(A = m \mid X = x)$ the contribution of the $m$-th annotator to tagging $x$, and with $T^{mk}$ defined in subsection 3.2, the transition matrices $\{\bar{T}^k\}_{k=1}^K$ for the aggregated labels are linear combinations of the $T^{mk}$:

$$\bar{T}^k = \sum_{m=1}^M \omega_m T^{mk}, \tag{5}$$

which are class-dependent and instance-independent.

The proof of Theorem 2 is provided in Appendix B. Theorem 2 enables us to adopt an aggregated version of the unbiased risk estimator, reducing the cost of estimating transition matrices.

Practical Implementation. Despite the efficiency brought by objective (4), the aggregated label $\bar{y}$ is unfortunately inaccessible, and so is its noisy posterior probability $P(\bar{Y}_k = 1 \mid X = x)$. To deal with this problem, assuming that each annotator tags each instance with uniform contribution, we approximate $P(\bar{Y}_k = 1 \mid X = x)$ by averaging the crowdsourced labels of the $M$ annotators, i.e., $s_k = \frac{1}{M} \sum_{m=1}^M \tilde{y}_{mk}$. Thus, our unbiased objective function is finally formalized as

$$\bar{\mathcal{L}}(f(x), s) = \sum_{k=1}^K \left[\left[\frac{s_k - \bar{T}^k_{01}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\right]_+ \ell(f_k(x), 1) + \left[\frac{(1 - s_k) - \bar{T}^k_{10}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\right]_+ \ell(f_k(x), 0)\right], \tag{6}$$

where $s = [s_1, \ldots, s_K]$ denotes the averaged crowdsourced label vector for sample $x$, and $[\cdot]_+$ abbreviates $\max(\cdot, 0)$, which keeps the loss non-negative.

Algorithm 1 Pseudo-code of CLEAR.
Input: Crowdsourced multi-label dataset $\tilde{D}$
1: Aggregate the crowdsourced labels by $s_k = \frac{1}{M} \sum_{m=1}^M \tilde{y}_{mk}$
2: Fit $s$ with a neural network and estimate $\{\bar{T}^k\}_{k=1}^K$ by averaging its predictions over the top-$C$ anchor points (Algorithm 2)
3: Initialize the input of the label VAE by $s' = s$
4: for epoch = 1, 2, ... do
5:   for step = 1, 2, ... do
6:     Calculate the unbiased loss $\mathcal{L}_{unbiased}$ by Eq. (7)
7:     Calculate the distillation loss $\mathcal{L}_{distill}$ by Eq. (10)
8:     Train the decoupled autoencoders $f$ and $f'$ by minimizing $\mathcal{L}_{final} = \mathcal{L}_{unbiased} + \mathcal{L}_{distill}$
9:     Update $s'$ by Eq. (9)
10:  end for
11: end for
Output: Multi-label predictor $f$

Transition Matrix Estimation. With the existence of the aggregated transition matrices proved in Theorem 2, we now introduce how to estimate them in practice. Our implementation is motivated by the anchor point assumption, which is widely adopted in noisy label learning (Liu & Tao, 2016; Patrini et al., 2017; Xia et al., 2019). Here, we present the transition matrix estimator following the anchor point assumption in the MLL setting.

Proposition 1. Given a sample $x$, if $x$ satisfies $P(Y_k = a \mid X = x) = 1$ for $a \in \{0,1\}$, we say that $x$ is an anchor point for label value $a$ of class $k$, and we have $\bar{T}^k_{aj} = P(\bar{Y}_k = j \mid X = x)$.

The proof follows from $P(\bar{Y}_k = j \mid X = x) = \sum_{i \in \{0,1\}} \bar{T}^k_{ij}\, P(Y_k = i \mid X = x) = \bar{T}^k_{aj}$.
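As an illustration of Proposition 1, the following NumPy sketch estimates the aggregated flip rates $\bar{T}^k_{01}$ and $\bar{T}^k_{10}$ from the noisy posteriors predicted by a network fitted to the averaged labels $s$, using the top-$C$ averaging detailed in the next paragraph; the array layout and names are assumptions of this sketch, not the released code.

```python
import numpy as np

def estimate_transition(noisy_probs, C=20):
    """Anchor-point estimate of the aggregated transition matrices.

    noisy_probs: (n, K) outputs of a sigmoid network fitted to s,
                 interpreted as P(Ybar_k = 1 | x).
    Returns T01, T10 of shape (K,), i.e. Tbar^k_{01} and Tbar^k_{10}.
    """
    n, K = noisy_probs.shape
    T01, T10 = np.empty(K), np.empty(K)
    for k in range(K):
        p = noisy_probs[:, k]
        neg_anchors = np.argsort(1.0 - p)[-C:]  # most confident negatives, P(Y_k=0|x) ~ 1
        pos_anchors = np.argsort(p)[-C:]        # most confident positives, P(Y_k=1|x) ~ 1
        T01[k] = p[neg_anchors].mean()          # P(Ybar_k = 1 | Y_k = 0)
        T10[k] = (1.0 - p[pos_anchors]).mean()  # P(Ybar_k = 0 | Y_k = 1)
    return T01, T10
```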
In other words, Proposition 1 enables us to estimate the transition matrices from the noisy class probabilities, which are approximated by the aggregated crowdsourced labels $s_k$ as mentioned above. Following (Liu & Tao, 2016; Patrini et al., 2017; Xia et al., 2019), we select samples far from the classification boundary as anchor points, namely $x_{ka} = \arg\max_{x \in \tilde{D}}\, a \hat{f}_k(x) + (1-a)(1 - \hat{f}_k(x))$, where $x_{ka}$ is the anchor point for label value $a$ of class $k$, and $\hat{f}$ is the multi-label predictor after the sigmoid, trained on $s$. Moreover, instead of selecting only the single most confident sample, we select the top-$C$ most confident samples as anchor points and average their noisy class probabilities to approximate the aggregated transition matrices, which turns out to be more robust. The detailed pseudo-code of transition matrix estimation is summarized in Algorithm 2.

4.2. Training with Decoupled Autoencoder

Despite the risk consistency provided by the above objective functions, all labels are treated independently. To capture the rich semantic correlations among labels, we propose a benchmark solution for CMLL, i.e., Crowdsourced muLti-label learning with dEcoupled AutoencodeR (CLEAR), which integrates the unbiased risk estimator into a decoupled autoencoder framework.

Figure 2. Model architecture of our proposed CLEAR framework.

As shown in Figure 2, CLEAR contains two variational autoencoders (VAEs), namely a feature VAE $f$ and a label VAE $f'$, where the feature $x$ and the denoised label $s'$ are encoded and decoded, respectively, to reconstruct the aggregated crowdsourced label $s$. On the one hand, the unbiased risk estimator is implemented as the reconstruction loss of the two VAEs, which encourages training robust classifiers and building clean latent spaces. On the other hand, we leverage the label VAE, whose latent embedding contains implicit label correlations, to distill its latent space to the feature VAE. Guided by the unbiased objective and label correlation distillation, the feature VAE $f$ is optimized and serves as the predictor at test time.

Specifically, each encoder first maps its input to a Gaussian subspace, namely $\mathcal{N}(\mu_\phi(x), \Sigma_\phi(x))$ and $\mathcal{N}(\mu_\psi(s'), \Sigma_\psi(s'))$, where $\phi$ and $\psi$ are the trainable parameters of the two encoders. Let $z_x$ and $z_s$ denote samples from these two distributions, respectively. We then map $z_x$ and $z_s$ through two decoders $g_x$ and $g_y$, respectively, and feed them into two multivariate probit models (Chen et al., 2018) to output the final predictions $\hat{y}^f$ and $\hat{y}^l$, which has been shown to be effective for building label dependencies (Bai et al., 2020). With the unbiased loss $\bar{\mathcal{L}}$ defined in Eq. (6), our reconstruction loss is formalized as

$$\mathcal{L}_{unbiased} = \bar{\mathcal{L}}(\hat{y}^f, s) + \bar{\mathcal{L}}(\hat{y}^l, s). \tag{7}$$

Without loss of generality, we implement the popular binary cross-entropy (BCE) loss as the base loss function, i.e.,

$$\ell(\hat{y}_k, s_k) = -\left(s_k \log \sigma(\hat{y}_k) + (1 - s_k) \log\big(1 - \sigma(\hat{y}_k)\big)\right), \tag{8}$$

where $\hat{y}_k$ denotes the $k$-th value of the output vector $\hat{y}^f$ or $\hat{y}^l$.
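A minimal PyTorch sketch of the unbiased reconstruction loss in Eqs. (6)–(8), with the $[\cdot]_+$ truncation implemented by clamping; the function name and tensor layout are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def unbiased_bce_loss(logits, s, T01, T10):
    """Eq. (6) with the BCE base loss of Eq. (8).

    logits: (B, K) raw model outputs (before the sigmoid)
    s:      (B, K) averaged crowdsourced labels in [0, 1]
    T01, T10: (K,) estimated aggregated flip rates Tbar^k_{01}, Tbar^k_{10}
    """
    denom = 1.0 - T01 - T10  # positive by the constraint in Eq. (2)
    w_pos = torch.clamp((s - T01) / denom, min=0.0)          # [.]_+ keeps the loss non-negative
    w_neg = torch.clamp(((1.0 - s) - T10) / denom, min=0.0)
    ell_pos = F.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits), reduction="none")   # ell(f_k(x), 1)
    ell_neg = F.binary_cross_entropy_with_logits(
        logits, torch.zeros_like(logits), reduction="none")  # ell(f_k(x), 0)
    return (w_pos * ell_pos + w_neg * ell_neg).sum(dim=1).mean()
```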
Note that instead of directly reconstructing $s$ for the label VAE, which might be problematic since the crowdsourced labels are unreliable, we adopt a more stable solution that feeds the label VAE a denoised label $s'$, initialized by $s$ and progressively refined by the increasingly reliable output $\hat{y}^f$, i.e.,

$$s' \leftarrow \eta\, s' + (1 - \eta)\, \mathrm{MultiHot}(\hat{y}^f), \tag{9}$$

where $\eta$ is the momentum parameter and $\mathrm{MultiHot}(\hat{y}^f)$ is the multi-hot version of $\hat{y}^f$ with threshold 0.5. We then define the latent space distillation loss via distance measures on multiple layers between the two autoencoders, i.e.,

$$\mathcal{L}_{distill} = \alpha \mathcal{L}_{kl} + \beta \mathcal{L}_{mse}, \tag{10}$$

where $\mathcal{L}_{kl}$ is the KL divergence between the two multivariate Gaussian distributions, and $\mathcal{L}_{mse}$ is the mean squared error between the latent embedding samples and the decoder outputs of the two autoencoders. $\alpha$ and $\beta$ are hyper-parameters that trade off the weights of the different losses. The two losses are formalized as follows:

$$\mathcal{L}_{kl} = \frac{1}{2}\left[\sum_{i=1}^{d}\left(\log \frac{\Sigma^\phi_{i,i}(x)}{\Sigma^\psi_{i,i}(s')} + \frac{\Sigma^\psi_{i,i}(s')}{\Sigma^\phi_{i,i}(x)} + \frac{\big(\mu^\phi_i(x) - \mu^\psi_i(s')\big)^2}{\Sigma^\phi_{i,i}(x)}\right) - d\right], \tag{11}$$

$$\mathcal{L}_{mse} = (z_x - z_s)^2 + \big(g_x(z_x) - g_y(z_s)\big)^2, \tag{12}$$

where $d$ is the dimension of the Gaussian subspace. Overall, the final objective of CLEAR is $\mathcal{L}_{final} = \mathcal{L}_{unbiased} + \mathcal{L}_{distill}$. The pseudo-code of CLEAR is summarized in Algorithm 1.

4.3. Generalization Error Bound

In this subsection, we establish a generalization error bound for our proposed method. With the unbiased risk estimator defined in Eq. (6), we obtain a learned classifier $\hat{f}$ by minimizing the empirical risk $\hat{R}(f) = \frac{1}{n} \sum_{i=1}^n \bar{\mathcal{L}}(f(x_i), s_i)$. We define $\mathcal{F}$ as the hypothesis class and $\mathcal{H}_k = \{h: x \mapsto f_k(x) \mid f \in \mathcal{F}\}$ as the functional space for the $k$-th class. Further, denoting by $\mathfrak{R}_n(\mathcal{H}_k)$ the expected Rademacher complexity (Bartlett & Mendelson, 2002) of $\mathcal{H}_k$ with sample size $n$, the generalization error bound for our unbiased risk estimator can be derived as the following theorem.

Theorem 3. Assume that the true aggregated transition matrices $\{\bar{T}^k\}_{k=1}^K$ are given, the loss function $\bar{\mathcal{L}}(f(x), s)$ is $L_T$-Lipschitz continuous with respect to $f(x)$, and the base loss function $\ell$ is upper-bounded by $\lambda$. Let $\mu = \max_k \frac{1}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, we have

$$\mathbb{E}[\bar{R}(\hat{f})] - \hat{R}(\hat{f}) \le 2\sqrt{2}\, L_T \sum_{k=1}^K \mathfrak{R}_n(\mathcal{H}_k) + \lambda K (\mu + 1) \sqrt{\frac{\ln(1/\delta)}{2n}}. \tag{13}$$

Theorem 3 shows that minimizing the empirical risk also bounds the population-level error, which ensures the generalization ability of our proposed unbiased loss. The proof is given in Appendix C.
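Before turning to the experiments, here is a compact PyTorch sketch of the remaining ingredients of Section 4.2: the distillation terms of Eqs. (10)–(12) and the momentum refinement of Eq. (9). The argument names are our own, the encoders are assumed to output log-variances of diagonal Gaussians, and $\hat{y}^f$ is assumed to hold probabilities after the sigmoid:

```python
import torch
import torch.nn.functional as F

def distill_loss(mu_x, logvar_x, mu_s, logvar_s, z_x, z_s, dec_x, dec_s,
                 alpha=1.0, beta=1.1):
    """Eqs. (10)-(12): KL between the two diagonal Gaussians plus MSE terms."""
    var_x, var_s = logvar_x.exp(), logvar_s.exp()
    # KL( N(mu_s, var_s) || N(mu_x, var_x) ), summed over the d latent dims (Eq. 11)
    kl = 0.5 * (logvar_x - logvar_s
                + (var_s + (mu_x - mu_s) ** 2) / var_x - 1.0).sum(dim=1).mean()
    # MSE on the latent samples and on the decoder outputs (Eq. 12)
    mse = F.mse_loss(z_x, z_s) + F.mse_loss(dec_x, dec_s)
    return alpha * kl + beta * mse

def refine_labels(s_prime, y_hat_f, eta=0.9):
    """Eq. (9): momentum refinement of the label-VAE input."""
    multi_hot = (y_hat_f.detach() > 0.5).float()
    return eta * s_prime + (1.0 - eta) * multi_hot
```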
5. Experiments

In this section, we report empirical results demonstrating the superiority of CLEAR. We refer the reader to the Appendix for more experimental results.

Datasets. We conduct our experiments on five benchmark multi-label image datasets (http://mulan.sourceforge.net/datasets-mlc.html): Image, Scene, Corel5K, Mirflickr, and NUS-WIDE. For these datasets, we corrupt the training sets according to true transition matrices $\{T^{mk}\}_{m=1,k=1}^{M,K}$. We consider the following three CMLL scenarios: $T^{mk}_{01} = T^{mk}_{10}$, $T^{mk}_{01} < T^{mk}_{10}$, and $T^{mk}_{01} > T^{mk}_{10}$. Specifically, the anti-diagonal elements of the aggregated transition matrices $(\bar{T}^k_{01}, \bar{T}^k_{10})$ are set to (0.2, 0.2), (0.2, 0.5), and (0.5, 0.2) for the three scenarios, respectively. For convenience, each annotator uses the same true transition matrices for all classes, but this information is not leaked to the algorithm. Moreover, we set the number of annotators to $M = 5$ for all experiments unless otherwise specified.
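For concreteness, the corruption procedure described above can be simulated as follows. This is a hypothetical re-implementation under the stated class-dependent, instance-independent assumption, not the authors' released script:

```python
import numpy as np

def corrupt_labels(Y, T, rng):
    """Simulate crowdsourced labels from clean ones.

    Y: (n, K) clean binary labels
    T: (M, K, 2, 2) true transition matrices, T[m, k, i, j] = P(y~ = j | y = i)
    Returns an (M, n, K) array of crowdsourced labels.
    """
    M = T.shape[0]
    n, K = Y.shape
    out = np.empty((M, n, K), dtype=int)
    for m in range(M):
        # probability that annotator m reports 1 for each entry
        p_one = np.where(Y == 1, T[m, :, 1, 1], T[m, :, 0, 1])
        out[m] = (rng.random((n, K)) < p_one).astype(int)
    return out

# e.g., T[m, k] = [[0.8, 0.2], [0.5, 0.5]] realizes the (0.2, 0.5) scenario
```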
Baselines. For a comprehensive comparison, we employ four types of baselines: 1) a naive baseline, BCE, which trains the classifier with the BCE loss on the aggregated crowdsourced label $s$; 2) two crowdsourcing-based methods, namely MV (majority voting) (Zhou, 2012), which trains the classifier with majority-voting labels, and DoctorNet (Guan et al., 2018), which models the annotators individually and averages their outputs at test time (we replace its softmax layers and cross-entropy loss with sigmoid layers and the BCE loss); 3) two MLL methods, namely ML-KNN (Zhang & Zhou, 2007), a nearest-neighbor-based MLL approach to which we feed the majority-voting labels as training targets, and MPVAE (Bai et al., 2020), a multivariate probit variational autoencoder designed to learn latent embedding spaces with label correlations; 4) a partial multi-label learning (PML) method, PML-NI (Xie & Huang, 2022), which simultaneously models the ground-truth labels and the noisy labels.

Implementation Details. The encoders and decoders of CLEAR are parameterized as three-layer fully connected networks with hidden sizes 512 and 256. For a fair comparison, the compared methods are equipped with the same network structure whenever neural networks are used; in particular, for the BCE, MV, and DoctorNet baselines we also use three-layer fully connected networks with hidden sizes 512 and 256. Following (Bai et al., 2020), we train the models with the Adam optimizer (Kingma & Ba, 2015) with a learning rate of $7.5 \times 10^{-4}$ and a weight decay of $1 \times 10^{-5}$. For the hyper-parameters of CLEAR, the confident-sample number $C$ and the momentum parameter $\eta$ are fixed to 20 and 0.9 in all settings. The Gaussian subspace dimensionality $d$ is set to 100 for Corel5K and NUS-WIDE, and 50 otherwise. The trade-off parameters $\alpha$ and $\beta$ are set to 1.0 and 1.1 by default. Other parameters of the baselines are set to their default values. Besides, the training targets of all baselines are replaced by the aggregated crowdsourced label $s$, except for ML-KNN, MV, and DoctorNet. For performance evaluation, we adopt three widely used multi-label metrics, namely example-based F1 (Example-F1), micro-averaged F1 (Micro-F1), and macro-averaged F1 (Macro-F1) (Zhang & Zhou, 2014; Chen et al., 2019a; Bai et al., 2020); for all of these metrics, higher is better. For all experiments, we perform ten-fold cross-validation and report the mean as well as the standard deviation of the metric values.
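The three evaluation metrics map directly onto scikit-learn's averaging modes for multi-label indicator matrices; a minimal sketch on random data (illustrative only):

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 5))  # (n, K) ground-truth indicators
y_pred = rng.integers(0, 2, size=(100, 5))  # (n, K) predicted indicators

example_f1 = f1_score(y_true, y_pred, average="samples", zero_division=0)
micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
```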
5.2. Main Results

The comparison results for the three CMLL scenarios are shown in Tables 1, 2, and 3, where the best results are shown in boldface. Overall, our proposed method outperforms all baselines on the three metrics on most datasets and scenarios. For example, in the setting of $\bar{T}^k_{01} = 0.2$ and $\bar{T}^k_{10} = 0.5$ on the Mirflickr dataset, CLEAR improves over the best baseline by notable margins of 28.64%, 25.40%, and 21.69% on Example-F1, Micro-F1, and Macro-F1, respectively. The superior results against various types of baselines imply that CLEAR can effectively tackle the CMLL task. Specifically, CLEAR improves over BCE on average by 12.49%, 11.89%, and 10.03% on Example-F1, Micro-F1, and Macro-F1, respectively, which clearly demonstrates the effectiveness of our proposed method. Moreover, CLEAR outperforms the crowdsourcing-based approaches, i.e., MV and DoctorNet, especially when the corruption probability is large. This is because MV naively trusts the false majority-voting labels, while DoctorNet overfits the large number of erroneous labels when most labels are incorrectly tagged; besides, both methods ignore label dependencies. Furthermore, CLEAR achieves better results than the MLL algorithms, i.e., ML-KNN and MPVAE, which shows the robustness of CLEAR when handling unreliable crowdsourced information. Our method also outperforms PML-NI, which is designed for handling redundant labels.

Table 1. Comparison of CLEAR with baselines when $\bar{T}^k_{01} = 0.2$ and $\bar{T}^k_{10} = 0.2$. The best results are shown in boldface.

| Metric | Dataset | BCE | MV | DoctorNet | ML-KNN | MPVAE | PML-NI | CLEAR |
|---|---|---|---|---|---|---|---|---|
| Example-F1 | Image | 0.6360 | 0.6433 | 0.6427 | 0.5060 | 0.6269 | 0.5845 | 0.6730 |
| Example-F1 | Scene | 0.7241 | 0.7277 | 0.7288 | 0.6701 | 0.7106 | 0.5949 | 0.7596 |
| Example-F1 | Corel5K | 0.0194 | 0.0210 | 0.0234 | 0.0172 | 0.1240 | 0.1084 | 0.1237 |
| Example-F1 | Mirflickr | 0.7792 | 0.7838 | 0.7841 | 0.7098 | 0.7875 | 0.6830 | 0.7997 |
| Example-F1 | NUS-WIDE | 0.2677 | 0.2931 | 0.2893 | 0.1014 | 0.2934 | 0.0677 | 0.3198 |
| Micro-F1 | Image | 0.6579 | 0.6677 | 0.6721 | 0.5707 | 0.6491 | 0.6072 | 0.6809 |
| Micro-F1 | Scene | 0.7569 | 0.7579 | 0.7593 | 0.7219 | 0.7364 | 0.5706 | 0.7743 |
| Micro-F1 | Corel5K | 0.0195 | 0.0210 | 0.0249 | 0.0251 | 0.1437 | 0.1375 | 0.1432 |
| Micro-F1 | Mirflickr | 0.8162 | 0.8185 | 0.8160 | 0.7567 | 0.8179 | 0.6717 | 0.8254 |
| Micro-F1 | NUS-WIDE | 0.3358 | 0.3524 | 0.3420 | 0.1628 | 0.3533 | 0.0630 | 0.3827 |
| Macro-F1 | Image | 0.6621 | 0.6692 | 0.6754 | 0.5666 | 0.6508 | 0.6079 | 0.6851 |
| Macro-F1 | Scene | 0.7639 | 0.7634 | 0.7651 | 0.7249 | 0.7433 | 0.5750 | 0.7836 |
| Macro-F1 | Corel5K | 0.0103 | 0.0049 | 0.0013 | 0.0067 | 0.0237 | 0.0216 | 0.0263 |
| Macro-F1 | Mirflickr | 0.7133 | 0.7228 | 0.7175 | 0.6330 | 0.7240 | 0.5591 | 0.7405 |
| Macro-F1 | NUS-WIDE | 0.0314 | 0.0423 | 0.0407 | 0.0194 | 0.0714 | 0.0548 | 0.0740 |

Table 2. Comparison of CLEAR with baselines when $\bar{T}^k_{01} = 0.2$ and $\bar{T}^k_{10} = 0.5$. The best results are shown in boldface.

| Metric | Dataset | BCE | MV | DoctorNet | ML-KNN | MPVAE | PML-NI | CLEAR |
|---|---|---|---|---|---|---|---|---|
| Example-F1 | Image | 0.3098 | 0.3183 | 0.3347 | 0.0305 | 0.3292 | 0.5617 | 0.5738 |
| Example-F1 | Scene | 0.3348 | 0.3337 | 0.3519 | 0.0955 | 0.3656 | 0.4310 | 0.6055 |
| Example-F1 | Corel5K | 0.0182 | 0.0176 | 0.0112 | 0.0005 | 0.0390 | 0.0241 | 0.0601 |
| Example-F1 | Mirflickr | 0.4272 | 0.4344 | 0.4403 | 0.0535 | 0.4242 | 0.4242 | 0.7267 |
| Example-F1 | NUS-WIDE | 0.0101 | 0.1028 | 0.0876 | 0.0001 | 0.0781 | 0.0832 | 0.1968 |
| Micro-F1 | Image | 0.3905 | 0.4092 | 0.4189 | 0.0497 | 0.4181 | 0.5761 | 0.6036 |
| Micro-F1 | Scene | 0.4325 | 0.4468 | 0.4528 | 0.1628 | 0.4662 | 0.4199 | 0.6609 |
| Micro-F1 | Corel5K | 0.0182 | 0.0176 | 0.0121 | 0.0007 | 0.0406 | 0.0341 | 0.0677 |
| Micro-F1 | Mirflickr | 0.5109 | 0.5278 | 0.5170 | 0.0773 | 0.5039 | 0.4278 | 0.7818 |
| Micro-F1 | NUS-WIDE | 0.0219 | 0.1474 | 0.1290 | 0.0001 | 0.1122 | 0.0781 | 0.2565 |
| Macro-F1 | Image | 0.3892 | 0.4074 | 0.4192 | 0.0492 | 0.4165 | 0.5783 | 0.6050 |
| Macro-F1 | Scene | 0.4338 | 0.4473 | 0.4552 | 0.1563 | 0.4652 | 0.4186 | 0.6606 |
| Macro-F1 | Corel5K | 0.0099 | 0.0045 | 0.0008 | 0.0003 | 0.0145 | 0.0060 | 0.0159 |
| Macro-F1 | Mirflickr | 0.3949 | 0.4159 | 0.3993 | 0.0799 | 0.4015 | 0.3651 | 0.6328 |
| Macro-F1 | NUS-WIDE | 0.0019 | 0.0142 | 0.0121 | 0.0005 | 0.0163 | 0.0584 | 0.0491 |

Table 3. Comparison of CLEAR with baselines when $\bar{T}^k_{01} = 0.5$ and $\bar{T}^k_{10} = 0.2$. The best results are shown in boldface.

| Metric | Dataset | BCE | MV | DoctorNet | ML-KNN | MPVAE | PML-NI | CLEAR |
|---|---|---|---|---|---|---|---|---|
| Example-F1 | Image | 0.4885 | 0.4889 | 0.4912 | 0.3938 | 0.4891 | 0.5824 | 0.6224 |
| Example-F1 | Scene | 0.4429 | 0.4433 | 0.4482 | 0.3539 | 0.4484 | 0.3122 | 0.7043 |
| Example-F1 | Corel5K | 0.0259 | 0.0261 | 0.0264 | 0.0255 | 0.0256 | 0.0075 | 0.0258 |
| Example-F1 | Mirflickr | 0.5432 | 0.5395 | 0.5453 | 0.4274 | 0.5388 | 0.3969 | 0.7106 |
| Example-F1 | NUS-WIDE | 0.0782 | 0.0834 | 0.0821 | 0.0697 | 0.0817 | 0.0571 | 0.0763 |
| Micro-F1 | Image | 0.4821 | 0.4836 | 0.4856 | 0.4006 | 0.4819 | 0.5753 | 0.6120 |
| Micro-F1 | Scene | 0.4254 | 0.4258 | 0.4306 | 0.3499 | 0.4272 | 0.3126 | 0.6865 |
| Micro-F1 | Corel5K | 0.0259 | 0.0261 | 0.0264 | 0.0255 | 0.0256 | 0.0103 | 0.0258 |
| Micro-F1 | Mirflickr | 0.5402 | 0.5382 | 0.5423 | 0.4329 | 0.5366 | 0.4045 | 0.7166 |
| Micro-F1 | NUS-WIDE | 0.0794 | 0.0856 | 0.0844 | 0.0706 | 0.0830 | 0.0578 | 0.0786 |
| Macro-F1 | Image | 0.4796 | 0.4816 | 0.4837 | 0.3986 | 0.4796 | 0.5778 | 0.6235 |
| Macro-F1 | Scene | 0.4236 | 0.4252 | 0.4288 | 0.3575 | 0.4268 | 0.3114 | 0.7051 |
| Macro-F1 | Corel5K | 0.0180 | 0.0184 | 0.0187 | 0.0176 | 0.0180 | 0.0017 | 0.0192 |
| Macro-F1 | Mirflickr | 0.4275 | 0.4264 | 0.4302 | 0.3610 | 0.4265 | 0.3520 | 0.6378 |
| Macro-F1 | NUS-WIDE | 0.0542 | 0.0622 | 0.0628 | 0.0523 | 0.0612 | 0.0523 | 0.0683 |

5.3. Additional Experiments

Ablation Analysis. To show how the proposed unbiased risk estimator and the decoupled autoencoder each contribute to CLEAR, we compare CLEAR with two variants: 1) CLEAR-B, which uses the BCE loss as the reconstruction loss instead of the unbiased loss; and 2) CLEAR-M, which trains each class predictor individually with the unbiased loss, without considering label relationships. For CLEAR-M, we implement a three-layer fully connected network with the same hidden sizes as CLEAR. Figure 3 shows the comparison results on three datasets. In general, CLEAR consistently achieves the best performance among the variants. These results clearly verify the superiority of our proposed unbiased loss and the decoupled autoencoder framework.

Figure 3. Ablation analysis when $\bar{T}^k_{01} = 0.2$ and $\bar{T}^k_{10} = 0.5$, where CLEAR is compared with its variants CLEAR-B and CLEAR-M.

Effect of Annotator Number. To investigate how the number of annotators affects CLEAR, we further evaluate our method and the competitive baselines over a wide range of annotator numbers $M \in \{3, 5, 7, 9, 11, 13, 15\}$ on the Image dataset with $\bar{T}^k_{01} = 0.2$ and $\bar{T}^k_{10} = 0.5$. As shown in Figure 4, CLEAR achieves the best results, beating all baselines in all settings on Example-F1, Micro-F1, and Macro-F1. In addition, as the number of annotators increases, the results of CLEAR generally improve while most baselines degrade. This indicates that CLEAR can benefit from a growing number of annotations, even under heavy noise in the crowdsourced labels.

Figure 4. Comparison results for different numbers of annotators on Image in the setting of $\bar{T}^k_{01} = 0.2$ and $\bar{T}^k_{10} = 0.5$.

Results of Estimating Transition Matrices. Moreover, we evaluate the transition matrix estimation results of CLEAR under different estimation strategies. Specifically, T-max uses the model-predicted probability of the single most confident sample to estimate the transition matrices, following (Patrini et al., 2017; Xia et al., 2019; Li et al., 2022), while S-C approximates them by averaging over the top-$C$ confident crowdsourced labels, as described in Section 4.1. We measure the estimation quality by the mean absolute error $\frac{1}{K} \sum_{k=1}^K \|\hat{\bar{T}}^k - \bar{T}^k\|_1$, where $\hat{\bar{T}}^k$ and $\bar{T}^k$ denote the estimated and ground-truth transition matrices, respectively. As shown in Table 4, our proposed estimation strategy (S-20) achieves the best results.

Table 4. Results of estimating transition matrices of CLEAR with different strategies. Smaller mean absolute error is better; the best results are shown in boldface.

| Dataset | $\bar{T}^k_{01}$/$\bar{T}^k_{10}$ | T-max | S-1 | S-20 |
|---|---|---|---|---|
| Image | 0.2/0.2 | 0.55 | 0.34 | 0.21 |
| Image | 0.2/0.5 | 0.31 | 0.66 | 0.22 |
| Image | 0.5/0.2 | 0.33 | 0.74 | 0.27 |
| Scene | 0.2/0.2 | 0.61 | 0.47 | 0.30 |
| Scene | 0.2/0.5 | 0.30 | 0.63 | 0.20 |
| Scene | 0.5/0.2 | 0.34 | 0.57 | 0.25 |
| Corel5K | 0.2/0.2 | 0.73 | 0.73 | 0.57 |
| Corel5K | 0.2/0.5 | 0.33 | 0.35 | 0.16 |
| Corel5K | 0.5/0.2 | 0.60 | 0.63 | 0.59 |
| Mirflickr | 0.2/0.2 | 0.52 | 0.39 | 0.15 |
| Mirflickr | 0.2/0.5 | 0.25 | 0.57 | 0.12 |
| Mirflickr | 0.5/0.2 | 0.38 | 0.39 | 0.26 |
| NUS-WIDE | 0.2/0.2 | 0.63 | 0.69 | 0.54 |
| NUS-WIDE | 0.2/0.5 | 0.25 | 0.34 | 0.17 |
| NUS-WIDE | 0.5/0.2 | 0.58 | 0.60 | 0.57 |

6. Conclusion

In this work, we study the CMLL problem, which aims to learn a robust multi-label predictor from crowdsourced labels.
We establish the first unbiased risk estimator for CMLL and upgrade it by aggregating the annotations while preserving its theoretical guarantees. We then exploit label correlations through a decoupled autoencoder framework. Experiments on various CMLL settings verify the effectiveness of our algorithm. Our study alleviates the data annotation burden in MLL by greatly reducing the labeling cost while ensuring the robustness of the learner.

Acknowledgements

This work is mainly supported by the Pioneer R&D Program of Zhejiang (No. 2024C01035) and the CCF-NetEase Thunder Fire Innovation Research Funding (No. 202305). This work is also supported by the NetEase Youling Crowdsourcing Platform. Gengyu Lyu would like to thank the National Natural Science Foundation of China (No. 62306020).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Albarqouni, S., Baur, C., Achilles, F., Belagiannis, V., Demirci, S., and Navab, N. AggNet: Deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Trans. Medical Imaging, 35(5):1313–1321, 2016.

Bai, J., Kong, S., and Gomes, C. P. Disentangled variational autoencoder based multi-label classification with covariance-aware multivariate probit model. In IJCAI, pp. 4313–4321. ijcai.org, 2020.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, 2002.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Chen, C., Wang, H., Liu, W., Zhao, X., Hu, T., and Chen, G. Two-stage label embedding via neural factorization machine for multi-label classification. In AAAI, pp. 3304–3311. AAAI Press, 2019a.

Chen, D., Xue, Y., and Gomes, C. P. End-to-end learning for the deep multivariate probit model. In ICML, volume 80 of Proceedings of Machine Learning Research, pp. 931–940. PMLR, 2018.

Chen, Y. and Lin, H. Feature-aware label space dimension reduction for multi-label classification. In NeurIPS, pp. 1538–1546, 2012.

Chen, Z., Wei, X., Wang, P., and Guo, Y. Multi-label image recognition with graph convolutional networks. In CVPR, pp. 5177–5186. Computer Vision Foundation / IEEE, 2019b.

Chen, Z., Wang, H., Sun, H., Chen, P., Han, T., Liu, X., and Yang, J. Structured probabilistic end-to-end learning from crowds. In IJCAI, pp. 1512–1518. ijcai.org, 2020.

Cho, Y., Kim, D., Khan, M. A., and Choo, J. Mining multi-label samples from single positive labels. In NeurIPS, 2022.

Cole, E., Aodha, O. M., Lorieul, T., Perona, P., Morris, D., and Jojic, N. Multi-label learning from single positive labels. In CVPR, pp. 933–942. Computer Vision Foundation / IEEE, 2021.

Dalvi, N. N., Dasgupta, A., Kumar, R., and Rastogi, V. Aggregating crowdsourced binary ratings. In WWW, pp. 285–294. International World Wide Web Conferences Steering Committee / ACM, 2013.

Dawid, A. P. and Skene, A. M. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):20–28, 1979.

Ding, Z., Wang, A., Chen, H., Zhang, Q., Liu, P., Bao, Y., Yan, W., and Han, J. Exploring structured semantic prior for multi label recognition with incomplete labels. In CVPR, pp. 3398–3407. IEEE, 2023.

Durand, T., Mehrasa, N., and Mori, G.
Learning a deep ConvNet for multi-label classification with partial labels. In CVPR, pp. 647–657. IEEE, 2019.

Feng, L., Lv, J., Han, B., Xu, M., Niu, G., Geng, X., An, B., and Sugiyama, M. Provably consistent partial-label learning. In NeurIPS, 2020.

Gao, Y., Xu, M., and Zhang, M. Unbiased risk estimator to multi-labeled complementary label learning. In IJCAI, pp. 3732–3740. ijcai.org, 2023.

Gao, Z., Sun, F., Yang, M., Ren, S., Xiong, Z., Engeler, M., Burazer, A., Wildling, L., Daniel, L., and Boning, D. S. Learning from multiple annotator noisy labels via sample-wise label fusion. In ECCV, volume 13684 of Lecture Notes in Computer Science, pp. 407–422. Springer, 2022.

Guan, M. Y., Gulshan, V., Dai, A. M., and Hinton, G. E. Who said what: Modeling individual labelers improves classification. In AAAI, pp. 3109–3118. AAAI Press, 2018.

Hu, P., Sun, X., Sclaroff, S., and Saenko, K. DualCoOp++: Fast and effective adaptation to multi-label recognition with limited annotations. CoRR, abs/2308.01890, 2023.

Huynh, D. and Elhamifar, E. A shared multi-attention framework for multi-label zero-shot learning. In CVPR, pp. 8773–8783. Computer Vision Foundation / IEEE, 2020a.

Huynh, D. and Elhamifar, E. Interactive multi-label CNN learning with partial labels. In CVPR, pp. 9420–9429. IEEE, 2020b.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.

Li, S., Jiang, Y., Chawla, N. V., and Zhou, Z. Multi-label learning from crowds. IEEE Trans. Knowl. Data Eng., 31(7):1369–1382, 2019.

Li, S., Xia, X., Zhang, H., Zhan, Y., Ge, S., and Liu, T. Estimating noise transition matrix with label correlations for noisy multi-label learning. In NeurIPS, 2022.

Li, S., Xia, X., Deng, J., Ge, S., and Liu, T. Transferring annotator- and instance-dependent transition matrix for learning from crowds. CoRR, abs/2306.03116, 2023.

Liu, H., Wang, F., Lin, M., Wu, R., Zhu, R., Zhao, S., Wang, K., Lv, T., and Fan, C. Towards long-term annotators: A supervised label aggregation baseline. CoRR, abs/2311.14709, 2023.

Liu, T. and Tao, D. Classification with noisy labels by importance reweighting. IEEE Trans. Pattern Anal. Mach. Intell., 38(3):447–461, 2016.

Liu, W., Wang, H., Shen, X., and Tsang, I. W. The emerging trends of multi-label learning. IEEE Trans. Pattern Anal. Mach. Intell., 44(11):7955–7974, 2022.

Lyu, G., Feng, S., and Li, Y. Partial multi-label learning via probabilistic graph matching mechanism. In KDD, pp. 105–113. ACM, 2020.

Maurer, A. A vector-contraction inequality for Rademacher complexities. In ALT, volume 9925 of Lecture Notes in Computer Science, pp. 3–17, 2016.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2018.

Patrini, G., Rozza, A., Menon, A. K., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, pp. 2233–2241. IEEE Computer Society, 2017.

Raykar, V. C. and Yu, S. Eliminating spammers and ranking annotators for crowdsourced labeling tasks. J. Mach. Learn. Res., 13:491–518, 2012.

Raykar, V. C., Yu, S., Zhao, L. H., Jerebko, A. K., Florin, C., Valadez, G. H., Bogoni, L., and Moy, L. Supervised learning from multiple experts: whom to trust when everyone lies a bit. In ICML, volume 382 of ACM International Conference Proceeding Series, pp. 889–896. ACM, 2009.

Read, J., Pfahringer, B., Holmes, G., and Frank, E. Classifier chains for multi-label classification. Mach. Learn., 85(3):333–359, 2011.

Ridnik, T., Baruch, E. B., Zamir, N., Noy, A., Friedman, I., Protter, M., and Zelnik-Manor, L. Asymmetric loss for multi-label classification. In ICCV, pp. 82–91. IEEE, 2021.
Rodrigues, F. and Pereira, F. C. Deep learning from crowds. In AAAI, pp. 1611–1618. AAAI Press, 2018.

Rubin, T. N., Chambers, A., Smyth, P., and Steyvers, M. Statistical topic models for multi-label document classification. Mach. Learn., 88(1-2):157–208, 2012.

Schultheis, E., Wydmuch, M., Babbar, R., and Dembczynski, K. On missing labels, long-tails and propensities in extreme multi-label classification. In SIGKDD, pp. 1547–1557. ACM, 2022.

Shi, W., Sheng, V. S., Li, X., and Gu, B. Semi-supervised multi-label learning from crowds via deep sequential generative model. In SIGKDD, pp. 1141–1149. ACM, 2020.

Snow, R., O'Connor, B., Jurafsky, D., and Ng, A. Y. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In ACL, pp. 254–263. ACL, 2008.

Wang, F., Liu, H., Bi, H., Shen, X., Zhu, R., Wu, R., Lin, M., Lv, T., Fan, C., Liu, Q., Huang, Z., and Chen, E. A dataset for the validation of truth inference algorithms suitable for online deployment. CoRR, abs/2403.08826, 2024.

Wang, H., Liu, W., Zhao, Y., Zhang, C., Hu, T., and Chen, G. Discriminative and correlative partial multi-label learning. In IJCAI, pp. 3691–3697. ijcai.org, 2019.

Wang, H., Chen, C., Liu, W., Chen, K., Hu, T., and Chen, G. Incorporating label embedding and feature augmentation for multi-dimensional classification. In AAAI, pp. 6178–6185. AAAI Press, 2020.

Wang, J., Yang, Y., Mao, J., Huang, Z., Huang, C., and Xu, W. CNN-RNN: A unified framework for multi-label image classification. In CVPR, pp. 2285–2294. IEEE Computer Society, 2016.

Wang, L., Liu, Y., Di, H., Qin, C., Sun, G., and Fu, Y. Semi-supervised dual relation learning for multi-label classification. IEEE Trans. Image Process., 30:9125–9135, 2021.

Wei, H., Xie, R., Feng, L., Han, B., and An, B. Deep learning from multiple noisy annotators as a union. IEEE Transactions on Neural Networks and Learning Systems, 2022.

Wei, T., Guo, L., Li, Y., and Gao, W. Learning safe multi-label prediction for weakly labeled data. Mach. Learn., 107(4):703–725, 2018.

Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J., and Movellan, J. R. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In NeurIPS, pp. 2035–2043. Curran Associates, Inc., 2009.

Wu, J., Huang, S., and Zhou, Z. Genome-wide protein function prediction through multi-instance multi-label learning. IEEE ACM Trans. Comput. Biol. Bioinform., 11(5):891–902, 2014.

Xia, X., Liu, T., Wang, N., Han, B., Gong, C., Niu, G., and Sugiyama, M. Are anchor points really indispensable in label-noise learning? In NeurIPS, pp. 6835–6846, 2019.

Xiao, L., Huang, X., Chen, B., and Jing, L. Label-specific document representation for multi-label text classification. In EMNLP-IJCNLP, pp. 466–475. Association for Computational Linguistics, 2019.

Xie, M. and Huang, S. Partial multi-label learning. In AAAI, pp. 4302–4309. AAAI Press, 2018.

Xie, M. and Huang, S. Partial multi-label learning with noisy label identification. IEEE Trans. Pattern Anal. Mach. Intell., 44(7):3676–3687, 2022.

Xie, M. and Huang, S. CCMN: A general framework for learning with class-conditional multi-label noise. IEEE Trans. Pattern Anal. Mach. Intell., 45(1):154–166, 2023.

Xiong, B., Cochez, M., Nayyeri, M., and Staab, S. Hyperbolic embedding inference for structured multi-label prediction. In NeurIPS, 2022.
Xu, N., Liu, Y., and Geng, X. Partial multi-label learning with label distribution. In AAAI, pp. 6510–6517. AAAI Press, 2020.

Xu, N., Qiao, C., Lv, J., Geng, X., and Zhang, M. One positive label is sufficient: Single-positive multi-label learning with label enhancement. In NeurIPS, 2022.

Yeh, C., Wu, W., Ko, W., and Wang, Y. F. Learning deep latent space for multi-label classification. In AAAI, pp. 2838–2844. AAAI Press, 2017.

Zha, Z., Hua, X., Mei, T., Wang, J., Qi, G., and Wang, Z. Joint multi-label multi-instance learning for image classification. In CVPR. IEEE Computer Society, 2008.

Zhang, J. and Wu, X. Multi-label inference for crowdsourcing. In SIGKDD, pp. 2738–2747. ACM, 2018.

Zhang, M. and Zhou, Z. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit., 40(7):2038–2048, 2007.

Zhang, M. and Zhou, Z. A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng., 26(8):1819–1837, 2014.

Zhang, Y., Chen, X., Zhou, D., and Jordan, M. I. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. JMLR, 17:102:1–102:44, 2016.

Zhou, Z.-H. Ensemble Methods: Foundations and Algorithms. CRC Press, 2012.

Zhu, K. and Wu, J. Residual attention: A simple but effective method for multi-label recognition. In ICCV, pp. 184–193. IEEE, 2021.

Zhu, X., Liu, J., Liu, W., Ge, J., Liu, B., and Cao, J. Scene-aware label graph learning for multi-label image classification. In ICCV, pp. 1473–1482, 2023.

A. Proof of Theorem 1

Theorem 1. Decompose the MLL problem into $K$ independent binary classification problems, i.e., $\mathcal{L}(f(x), y) = \sum_{k=1}^K \ell(f_k(x), y_k)$, where $f_k$ refers to the prediction of the model on the $k$-th class and $\ell$ is the base loss function. Let $\tilde{R}(f) = \mathbb{E}_{P(X, \tilde{Y})}\big[\tilde{\mathcal{L}}(f(x), \{\tilde{y}^m\}_{m=1}^M)\big]$ and define

$$\tilde{\mathcal{L}}(f(x), \{\tilde{y}^m\}_{m=1}^M) = \frac{1}{2^M} \sum_{k=1}^K \sum_{j=0}^{1} \frac{P(Y_k = j \mid X = x)}{\prod_{m=1}^{M} \sum_{i=0}^{1} T^{mk}_{i \tilde{y}_{mk}}\, P(Y_k = i \mid X = x)}\; \ell(f_k(x), j), \tag{14}$$

where $\tilde{Y}$ denotes the random variable of the crowdsourced labels for each $x$. Then, $\tilde{R}(f)$ is an unbiased risk estimator with respect to $R(f)$.

Proof. The multi-label learning risk $R(f)$ can be rewritten as

$$\begin{aligned}
R(f) &= \mathbb{E}_{P(X,Y)}\left[\mathcal{L}(f(x), y)\right] \\
&= \sum_{k=1}^K \mathbb{E}_{P(X,Y_k)}\left[\ell(f_k(x), y_k)\right] \\
&= \sum_{k=1}^K \int \sum_{j=0}^{1} P(X = x, Y_k = j)\, \ell(f_k(x), j)\, dx \\
&= \sum_{k=1}^K \int \sum_{j=0}^{1} \frac{1}{2^M} \sum_{\{\tilde{y}_{mk}\}_{m=1}^M \in \{0,1\}^M} \frac{P(Y_k = j \mid X = x)}{P(\{\tilde{Y}_{mk}\}_{m=1}^M = \{\tilde{y}_{mk}\}_{m=1}^M \mid X = x)}\, P\big(X = x, \{\tilde{Y}_{mk}\}_{m=1}^M = \{\tilde{y}_{mk}\}_{m=1}^M\big)\, \ell(f_k(x), j)\, dx \\
&= \sum_{k=1}^K \mathbb{E}_{P(X, \{\tilde{Y}_{mk}\}_{m=1}^M)}\left[\frac{1}{2^M} \sum_{j=0}^{1} \frac{P(Y_k = j \mid X = x)}{P(\{\tilde{Y}_{mk}\}_{m=1}^M = \{\tilde{y}_{mk}\}_{m=1}^M \mid X = x)}\, \ell(f_k(x), j)\right] \\
&= \sum_{k=1}^K \mathbb{E}_{P(X, \{\tilde{Y}_{mk}\}_{m=1}^M)}\left[\frac{1}{2^M} \sum_{j=0}^{1} \frac{P(Y_k = j \mid X = x)}{\prod_{m=1}^M P(\tilde{Y}_{mk} = \tilde{y}_{mk} \mid X = x)}\, \ell(f_k(x), j)\right] \\
&= \sum_{k=1}^K \mathbb{E}_{P(X, \{\tilde{Y}_{mk}\}_{m=1}^M)}\left[\frac{1}{2^M} \sum_{j=0}^{1} \frac{P(Y_k = j \mid X = x)}{\prod_{m=1}^M \sum_{i=0}^{1} T^{mk}_{i \tilde{y}_{mk}}\, P(Y_k = i \mid X = x)}\, \ell(f_k(x), j)\right] \\
&= \mathbb{E}_{P(X, \tilde{Y})}\left[\tilde{\mathcal{L}}(f(x), \{\tilde{y}^m\}_{m=1}^M)\right] = \tilde{R}(f).
\end{aligned}$$

The fourth equality holds because each of the $2^M$ summands $P(X = x, \{\tilde{Y}_{mk}\} = \{\tilde{y}_{mk}\}) / P(\{\tilde{Y}_{mk}\} = \{\tilde{y}_{mk}\} \mid X = x)$ equals $P(X = x)$, and the sixth equality uses the conditional independence of the annotators given $x$. By showing that $\tilde{R}(f)$ equals $R(f)$ for the same classifier $f$, we demonstrate that our proposed risk estimator $\tilde{R}(f)$ guarantees risk consistency.
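As a numerical sanity check of this proof (not part of the original paper), the following Monte-Carlo sketch verifies the unbiasedness claim for a single fixed instance with one class and $M = 3$ annotators; all sizes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 500_000  # annotators, Monte-Carlo draws
p1 = 0.3           # P(Y = 1 | x) for one fixed instance (K = 1)
T = np.array([[[0.9, 0.1], [0.2, 0.8]]] * M)  # T[m, i, j] = P(y~_m = j | y = i)
f = 0.7                                       # an arbitrary fixed prediction
ell = lambda pred, j: (pred - j) ** 2         # any bounded base loss works

true_risk = (1 - p1) * ell(f, 0) + p1 * ell(f, 1)

y = (rng.random(N) < p1).astype(int)          # latent clean labels
flip = np.where(y == 1, T[:, 1, 1][:, None], T[:, 0, 1][:, None])
y_crowd = (rng.random((M, N)) < flip).astype(int)  # (M, N) crowdsourced labels

# P(Y~ = y_crowd | x) factorizes over annotators by conditional independence
p = np.array([1 - p1, p1])
rows = np.arange(M)[:, None]
per_m = T[rows, 0, y_crowd] * p[0] + T[rows, 1, y_crowd] * p[1]
denom = per_m.prod(axis=0)

corrected = (p[0] * ell(f, 0) + p[1] * ell(f, 1)) / (2 ** M * denom)
print(true_risk, corrected.mean())  # the two values agree up to MC error
```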
B. Proof of Theorem 2

Theorem 2. Let $\bar{y} = [\bar{y}_1, \ldots, \bar{y}_K]$ be the aggregated label vector for each $x$, and let $\bar{Y}_k$ be the random variable of $\bar{y}_k$. We have the following consequences:

(Existence) There exists a set of class-dependent, instance-independent transition matrices $\{\bar{T}^k\}_{k=1}^K \subset [0,1]^{2 \times 2}$ with $\bar{T}^k_{ij} = P(\bar{Y}_k = j \mid Y_k = i)$, $i, j \in \{0,1\}$, such that the unbiased risk estimator for CMLL with respect to $R(f)$ is $\bar{R}(f) = \mathbb{E}_{P(X, \bar{Y})}\big[\bar{\mathcal{L}}(f(x), \bar{y})\big]$, where

$$\bar{\mathcal{L}}(f(x), \bar{y}) = \sum_{k=1}^K \left[\frac{P(\bar{Y}_k = 1 \mid X = x) - \bar{T}^k_{01}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\, \ell(f_k(x), 1) + \frac{P(\bar{Y}_k = 0 \mid X = x) - \bar{T}^k_{10}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\, \ell(f_k(x), 0)\right]. \tag{16}$$

(Formulation of Transition Matrices) Let $A$ be the random variable of the index of the annotator. Denoting by $\omega_m = P(A = m \mid X = x)$ the contribution of the $m$-th annotator to tagging $x$, and with $T^{mk}$ defined in subsection 3.2, the transition matrices $\{\bar{T}^k\}_{k=1}^K$ for the aggregated labels are linear combinations of the $T^{mk}$:

$$\bar{T}^k = \sum_{m=1}^M \omega_m T^{mk}, \tag{17}$$

which are class-dependent and instance-independent.

Proof. We first detail the proof of the existence and formulation of the aggregated transition matrices $\{\bar{T}^k\}_{k=1}^K$. Assuming there exists a set of class-dependent, instance-independent transition matrices $\{\bar{T}^k\}_{k=1}^K$, we have $P(\bar{Y}_k = j \mid X = x) = \sum_{i=0}^{1} \bar{T}^k_{ij}\, P(Y_k = i \mid X = x)$, mirroring the data generation process discussed in subsection 3.2. Also,

$$\begin{aligned}
P(\bar{Y}_k = j \mid X = x) &= \sum_{m=1}^M P(\bar{Y}_k = j, A = m \mid X = x) \\
&= \sum_{m=1}^M P(A = m \mid X = x)\, P(\bar{Y}_k = j \mid A = m, X = x) \\
&= \sum_{m=1}^M \omega_m\, P(\tilde{Y}_{mk} = j \mid X = x) \\
&= \sum_{m=1}^M \omega_m \sum_{i=0}^{1} T^{mk}_{ij}\, P(Y_k = i \mid X = x) \\
&= \sum_{i=0}^{1} \Big(\sum_{m=1}^M \omega_m T^{mk}_{ij}\Big)\, P(Y_k = i \mid X = x).
\end{aligned}$$

Thus, $\{\bar{T}^k\}_{k=1}^K$ exist and $\bar{T}^k = \sum_{m=1}^M \omega_m T^{mk}$. The unbiased risk estimator is then derived by

$$\begin{aligned}
R(f) &= \mathbb{E}_{P(X,Y)}\left[\mathcal{L}(f(x), y)\right] = \sum_{k=1}^K \mathbb{E}_{P(X,Y_k)}\left[\ell(f_k(x), y_k)\right] \\
&= \sum_{k=1}^K \int \sum_{j=0}^{1} P(X = x)\, P(Y_k = j \mid X = x)\, \ell(f_k(x), j)\, dx \\
&= \sum_{k=1}^K \mathbb{E}_{P(X)}\big[P(Y_k = 1 \mid X = x)\, \ell(f_k(x), 1) + P(Y_k = 0 \mid X = x)\, \ell(f_k(x), 0)\big] \\
&= \mathbb{E}_{P(X, \bar{Y})}\left[\sum_{k=1}^K \frac{P(\bar{Y}_k = 1 \mid X = x) - \bar{T}^k_{01}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\, \ell(f_k(x), 1) + \frac{P(\bar{Y}_k = 0 \mid X = x) - \bar{T}^k_{10}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\, \ell(f_k(x), 0)\right] \\
&= \mathbb{E}_{P(X, \bar{Y})}\left[\bar{\mathcal{L}}(f(x), \bar{y})\right] = \bar{R}(f).
\end{aligned}$$

The second-to-last equality holds because, with $P(\bar{Y}_k = j \mid X = x) = \sum_{i=0}^{1} \bar{T}^k_{ij}\, P(Y_k = i \mid X = x)$, we have

$$\begin{aligned}
P(\bar{Y}_k = 1 \mid X = x) &= \bar{T}^k_{01}\, P(Y_k = 0 \mid X = x) + \bar{T}^k_{11}\, P(Y_k = 1 \mid X = x) \\
&= \bar{T}^k_{01}\big(1 - P(Y_k = 1 \mid X = x)\big) + (1 - \bar{T}^k_{10})\, P(Y_k = 1 \mid X = x) \\
&= \bar{T}^k_{01} + (1 - \bar{T}^k_{01} - \bar{T}^k_{10})\, P(Y_k = 1 \mid X = x),
\end{aligned}$$

thus $P(Y_k = 1 \mid X = x) = \frac{P(\bar{Y}_k = 1 \mid X = x) - \bar{T}^k_{01}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}$, and similarly $P(Y_k = 0 \mid X = x) = \frac{P(\bar{Y}_k = 0 \mid X = x) - \bar{T}^k_{10}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}$.

C. Proof of Theorem 3

Theorem 3. Assume that the true aggregated transition matrices $\{\bar{T}^k\}_{k=1}^K$ are given, the loss function $\bar{\mathcal{L}}(f(x), s)$ is $L_T$-Lipschitz continuous with respect to $f(x)$, and the base loss function $\ell$ is upper-bounded by $\lambda$. Let $\mu = \max_k \frac{1}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, we have

$$\mathbb{E}[\bar{R}(\hat{f})] - \hat{R}(\hat{f}) \le 2\sqrt{2}\, L_T \sum_{k=1}^K \mathfrak{R}_n(\mathcal{H}_k) + \lambda K (\mu + 1) \sqrt{\frac{\ln(1/\delta)}{2n}}.$$

Recall that the empirical risk is defined by $\hat{R}(f) = \frac{1}{n} \sum_{i=1}^n \bar{\mathcal{L}}(f(x_i), s_i)$, with

$$\bar{\mathcal{L}}(f(x), s) = \sum_{k=1}^K \left[\left[\frac{s_k - \bar{T}^k_{01}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\right]_+ \ell(f_k(x), 1) + \left[\frac{(1 - s_k) - \bar{T}^k_{10}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\right]_+ \ell(f_k(x), 0)\right]. \tag{22}$$

Let $S$ and $S'$ be two crowdsourced datasets that differ in exactly the $i$-th example, i.e.,

$$S = \{(x_1, s_1), \ldots, (x_i, s_i), \ldots, (x_n, s_n)\}, \qquad S' = \{(x_1, s_1), \ldots, (x'_i, s'_i), \ldots, (x_n, s_n)\}, \tag{23}$$

and define the function $\Phi$ as

$$\Phi(S) = \sup_{f \in \mathcal{F}} \left(\mathbb{E}[\bar{R}(f)] - \hat{R}(f)\right), \tag{24}$$

where the generalization risk $\mathbb{E}[\bar{R}(f)]$ is equivalent to $\mathbb{E}_{P(X, \bar{Y})}\big[\bar{\mathcal{L}}(f(x_i), s_i)\big] = \bar{R}(f)$ and the empirical risk $\hat{R}(f)$ is equivalent to $\hat{\mathbb{E}}_S\big[\bar{\mathcal{L}}(f(x_i), s_i)\big]$. The proof of Theorem 3 is mainly composed of the following two lemmas.

Lemma 1. Let $\hat{f}$ be the minimizer of the empirical risk $\hat{R}(f)$, and let $\mathbb{E}_S[\Phi(S)]$ denote the expectation of $\Phi(S)$ over all $S$ drawn from the data distribution.
With the base loss function $\ell$ upper-bounded by $\lambda$ and $\mu = \max_k \frac{1}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}$, for any $\delta > 0$, with probability at least $1 - \delta$, we have

$$\mathbb{E}[\bar{R}(\hat{f})] - \hat{R}(\hat{f}) \le \mathbb{E}_S[\Phi(S)] + \lambda K (\mu + 1) \sqrt{\frac{\ln(1/\delta)}{2n}}.$$

Proof. To apply McDiarmid's inequality (Boucheron et al., 2013), we first check the bounded difference property of $\Phi(S)$:

$$\begin{aligned}
\Phi(S) - \Phi(S') &\le \sup_{f \in \mathcal{F}} \frac{\bar{\mathcal{L}}(f(x_i), s_i) - \bar{\mathcal{L}}(f(x'_i), s'_i)}{n} \\
&\le \frac{1}{n} \sup_{f \in \mathcal{F}} \sum_{k=1}^K \Bigg(\left[\frac{s_{ik} - \bar{T}^k_{01}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\right]_+ \ell(f_k(x_i), 1) - \left[\frac{s'_{ik} - \bar{T}^k_{01}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\right]_+ \ell(f_k(x'_i), 1) \\
&\qquad\qquad + \left[\frac{(1 - s_{ik}) - \bar{T}^k_{10}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\right]_+ \ell(f_k(x_i), 0) - \left[\frac{(1 - s'_{ik}) - \bar{T}^k_{10}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\right]_+ \ell(f_k(x'_i), 0)\Bigg) \\
&\le \frac{1}{n} \sum_{k=1}^K \left(\frac{1 - \bar{T}^k_{01}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\, \lambda + \frac{1 - \bar{T}^k_{10}}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}}\, \lambda\right) \\
&= \frac{\lambda}{n} \sum_{k=1}^K \left(\frac{1}{1 - \bar{T}^k_{01} - \bar{T}^k_{10}} + 1\right) \le \frac{\lambda K (\mu + 1)}{n}.
\end{aligned}$$

Similarly, we can obtain $\Phi(S') - \Phi(S) \le \frac{\lambda K (\mu + 1)}{n}$ and thus $|\Phi(S) - \Phi(S')| \le \frac{\lambda K (\mu + 1)}{n}$. Then, by McDiarmid's inequality, for any $\delta > 0$, with probability at least $1 - \delta$, we have

$$\Phi(S) \le \mathbb{E}_S[\Phi(S)] + \lambda K (\mu + 1) \sqrt{\frac{\ln(1/\delta)}{2n}}.$$

Besides, we have $\mathbb{E}[\bar{R}(\hat{f})] - \hat{R}(\hat{f}) \le \sup_{f \in \mathcal{F}} \big(\mathbb{E}[\bar{R}(f)] - \hat{R}(f)\big) = \Phi(S)$, which completes the proof.

We now give an upper bound on $\mathbb{E}_S[\Phi(S)]$ in the following lemma.

Lemma 2. Denote by $\mathcal{F}$ the hypothesis class, by $\mathcal{H}_k = \{h: x \mapsto f_k(x) \mid f \in \mathcal{F}\}$ the functional space for the $k$-th class, and by $\mathfrak{R}_n(\mathcal{H}_k)$ the expected Rademacher complexity of $\mathcal{H}_k$ with sample size $n$. Assuming the loss function $\bar{\mathcal{L}}(f(x), s)$ is $L_T$-Lipschitz continuous with respect to $f(x)$, we have

$$\mathbb{E}_S[\Phi(S)] \le 2\sqrt{2}\, L_T \sum_{k=1}^K \mathfrak{R}_n(\mathcal{H}_k). \tag{28}$$

Proof. Note that the Rademacher complexity of $\mathcal{H}_k$ is formalized as $\mathfrak{R}_n(\mathcal{H}_k) = \mathbb{E}_{\sigma, S}\big[\sup_{h \in \mathcal{H}_k} \frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i)\big]$, where $\sigma = [\sigma_1, \ldots, \sigma_n]$ are i.i.d. Rademacher random variables. Abbreviating $\mathbb{E}_{P(X, \bar{Y})}\big[\bar{\mathcal{L}}(f(x_i), s_i)\big]$ as $\mathbb{E}[\bar{\mathcal{L}} \circ f]$, we bound $\mathbb{E}_S[\Phi(S)]$ by the following derivations:

$$\begin{aligned}
\mathbb{E}_S[\Phi(S)] &= \mathbb{E}_S\left[\sup_{f \in \mathcal{F}} \Big(\mathbb{E}[\bar{\mathcal{L}} \circ f] - \hat{\mathbb{E}}_S[\bar{\mathcal{L}} \circ f]\Big)\right] \\
&= \mathbb{E}_S\left[\sup_{f \in \mathcal{F}} \mathbb{E}_{S'}\Big[\hat{\mathbb{E}}_{S'}[\bar{\mathcal{L}} \circ f] - \hat{\mathbb{E}}_S[\bar{\mathcal{L}} \circ f]\Big]\right] \\
&\le \mathbb{E}_{S, S'}\left[\sup_{f \in \mathcal{F}} \Big(\hat{\mathbb{E}}_{S'}[\bar{\mathcal{L}} \circ f] - \hat{\mathbb{E}}_S[\bar{\mathcal{L}} \circ f]\Big)\right] \\
&= \mathbb{E}_{S, S', \sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i \Big(\bar{\mathcal{L}}(f(x'_i), s'_i) - \bar{\mathcal{L}}(f(x_i), s_i)\Big)\right] \\
&\le \mathbb{E}_{S', \sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i \bar{\mathcal{L}}(f(x'_i), s'_i)\right] + \mathbb{E}_{S, \sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i \bar{\mathcal{L}}(f(x_i), s_i)\right] \\
&= 2\, \mathbb{E}_{S, \sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i \bar{\mathcal{L}}(f(x_i), s_i)\right] \\
&\le 2\sqrt{2}\, L_T \sum_{k=1}^K \mathbb{E}_{S, \sigma}\left[\sup_{h \in \mathcal{H}_k} \frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i)\right] = 2\sqrt{2}\, L_T \sum_{k=1}^K \mathfrak{R}_n(\mathcal{H}_k).
\end{aligned}$$

The second inequality holds by the sub-additivity of the supremum, and the last inequality holds by the Rademacher vector-contraction inequality (Maurer, 2016).

Theorem 3 follows by combining Lemma 1 and Lemma 2.

Extension of Theorem 3. Theorem 3 ensures that the empirical risk minimizer approximately approaches its population counterpart. Further, using the uniform convergence of $\mathbb{E}[\bar{R}(f)] - \hat{R}(f)$, we extend our theoretical guarantees with Theorem 4, which shows that the empirical risk minimizer converges to the true risk minimizer as $n \to \infty$.

Theorem 4. Let $\hat{f}$ and $f^*$ be the minimizers of $\hat{R}(f)$ and of $R(f)$, respectively. Under the conditions of Theorem 3, for any $\delta > 0$, with probability at least $1 - \delta$, we have

$$\bar{R}(\hat{f}) - \bar{R}(f^*) \le 4\sqrt{2}\, L_T \sum_{k=1}^K \mathfrak{R}_n(\mathcal{H}_k) + 2\lambda K (\mu + 1) \sqrt{\frac{\ln(2/\delta)}{2n}}.$$

Proof. We first bound the left-hand side by the following derivation:

$$\begin{aligned}
\bar{R}(\hat{f}) - \bar{R}(f^*) &= R(\hat{f}) - R(f^*) \\
&= \big[\bar{R}(\hat{f}) - \hat{R}(\hat{f})\big] + \big[\hat{R}(\hat{f}) - \hat{R}(f^*)\big] + \big[\hat{R}(f^*) - \bar{R}(f^*)\big] \\
&\le \big[\bar{R}(\hat{f}) - \hat{R}(\hat{f})\big] + 0 + \big[\hat{R}(f^*) - \bar{R}(f^*)\big] \\
&\le 2 \sup_{f \in \mathcal{F}} \big|\hat{R}(f) - \bar{R}(f)\big|,
\end{aligned}$$

where the first equality holds by the risk consistency of the estimator, and the first inequality holds since $\hat{f}$ is, by definition, the minimizer of $\hat{R}(f)$. Using the same argument as in Lemma 1, we can derive the same bound for $\sup_{f \in \mathcal{F}} \big(\hat{R}(f) - \mathbb{E}[\bar{R}(f)]\big)$ as for
Then, combining with Lemma 2, we have E R(f) R(f) 2 k=1 Rn(Hk) + λK(µ + 1) Note that E R(f) is equivalent to R(f), thus R( ˆf) R(f ) 2 sup f F R(f) R(f) 4 k=1 Rn(Hk) + 2λK(µ + 1) D. Pseudo-Code of Transition Matrix Estimation Algorithm 2 Pseudo-code of Transition Matrix Estimation. Input: Crowdsourced multi-label dataset D, a randomly initialized fully connected network ˆf 1: Aggregating the crowdsourced labels by sk = 1 M ymk 2: Train ˆf by minimizing Lest = PK k=1 sk log ˆfk(x) + (1 sk) log(1 ˆfk(x)) until convergence 3: for k in {1, . . . , K} do 4: for a in {0, 1} do 5: Select top-C samples in D with largest value of a ˆfk(x) + (1 a) (1 ˆfk(x)), denoting them as Aka 6: Set ˆ T k a(1 a) = 1 x Aka(1 ˆfk(x)) and ˆ T k aa = 1 x Aka ˆfk(x) 7: end for 8: end for Output: Estimated transition matrices {ˆ T k}K k=1 E. Complexity Analyses Let B, Dx, and K denote the batch size, the dimensionality of feature x and the number of classes, and let Dh denote the proxy of the hidden dimensionalities of the encoders and decoders in CLEAR. On the one hand, the time complexity of the feature VAE and label VAE correspond to O(BDx Dh) and O(BKDh) respectively. On the other hand, the time complexity of the sampling process in the multivariate probit models corresponds to O(BSK) where S is the sampling number. Thus, the total time complexity of CLEAR is O(B(Dx Dh + KDh + SK)). Table 5 shows the empirical running time (in seconds) of CLEAR and the deep-model-based baselines, regarding one training epoch, which shows that CLEAR is in the same magnitude as baseline methods. Table 5. Running time (in seconds) of one training epoch of deep-model-based approach. Dataset BCE MV Doctor Net MPVAE CLEAR Image 1.04 1.04 1.15 1.85 2.06 Scene 1.09 1.07 1.22 2.06 2.47 Corel5K 1.41 1.40 1.94 6.96 7.82 Mirflickr 1.64 1.60 2.12 6.08 7.63 NUS-WIDE 24.24 24.55 26.30 31.33 32.94 F. Standard Deviation In this paper, we conduct ten-fold cross-validation for all the experiments, and only the mean metric values are reported in the main paper since the page is limited. In this section, we further report the standard deviations of CLEAR and baselines in Table 6, 7, and 8, which demonstrates the robustness of CLEAR. Unbiased Multi-Label Learning from Crowdsourced Annotations Table 6. Standard deviations of CLEAR and baselines when T k 01 = 0.2 and T k 10 = 0.2. Metric Dataset BCE MV Doctor Net ML-KNN MPVAE PML-NI CLEAR Image 0.0389 0.0358 0.0232 0.0416 0.0360 0.0287 0.0370 Scene 0.0279 0.0260 0.0272 0.0431 0.0312 0.0146 0.0245 Corel5K 0.0032 0.0067 0.0142 0.0051 0.0066 0.0130 0.0110 Mirflickr 0.0097 0.0094 0.0107 0.0123 0.0087 0.0324 0.0094 NUS-WIDE 0.0013 0.0044 0.0017 0.0075 0.0012 0.0035 0.0082 Image 0.0350 0.0334 0.0277 0.0355 0.0285 0.0256 0.0343 Scene 0.0214 0.0223 0.0261 0.0402 0.0314 0.0130 0.0212 Corel5K 0.0032 0.0067 0.0149 0.0070 0.0066 0.0185 0.0136 Mirflickr 0.0082 0.0063 0.0068 0.0096 0.0057 0.0368 0.0075 NUS-WIDE 0.0028 0.0065 0.0041 0.0126 0.0055 0.0014 0.0100 Image 0.0328 0.0323 0.0271 0.0341 0.0281 0.0293 0.0342 Scene 0.0200 0.0226 0.0259 0.0377 0.0305 0.0139 0.0202 Corel5K 0.0018 0.0014 0.0006 0.0014 0.0030 0.0031 0.0023 Mirflickr 0.0165 0.0110 0.0150 0.0187 0.0108 0.0404 0.0154 NUS-WIDE 0.0004 0.0016 0.0025 0.0011 0.0055 0.0004 0.0067 Table 7. Standard deviations of CLEAR and baselines when T k 01 = 0.2 and T k 10 = 0.5. 
Table 7. Standard deviations of CLEAR and baselines when $\bar{T}^k_{01} = 0.2$ and $\bar{T}^k_{10} = 0.5$.

| Metric | Dataset | BCE | MV | DoctorNet | ML-KNN | MPVAE | PML-NI | CLEAR |
|---|---|---|---|---|---|---|---|---|
| Example-F1 | Image | 0.0343 | 0.0475 | 0.0282 | 0.0164 | 0.0344 | 0.0376 | 0.0467 |
| Example-F1 | Scene | 0.0309 | 0.0312 | 0.0253 | 0.0178 | 0.0295 | 0.0236 | 0.0251 |
| Example-F1 | Corel5K | 0.0309 | 0.0312 | 0.0253 | 0.0178 | 0.0295 | 0.0236 | 0.0251 |
| Example-F1 | Mirflickr | 0.0119 | 0.0084 | 0.0134 | 0.0341 | 0.0166 | 0.0141 | 0.0174 |
| Example-F1 | NUS-WIDE | 0.0007 | 0.0079 | 0.0054 | 0.0000 | 0.0035 | 0.0260 | 0.0029 |
| Micro-F1 | Image | 0.0369 | 0.0477 | 0.0282 | 0.0259 | 0.0367 | 0.0328 | 0.0459 |
| Micro-F1 | Scene | 0.0334 | 0.0332 | 0.0248 | 0.0279 | 0.0344 | 0.0213 | 0.0242 |
| Micro-F1 | Corel5K | 0.0182 | 0.0048 | 0.0070 | 0.0008 | 0.0038 | 0.0090 | 0.0056 |
| Micro-F1 | Mirflickr | 0.0185 | 0.0153 | 0.0162 | 0.0519 | 0.0207 | 0.0121 | 0.0104 |
| Micro-F1 | NUS-WIDE | 0.0017 | 0.0106 | 0.0085 | 0.0000 | 0.0065 | 0.0197 | 0.0116 |
| Macro-F1 | Image | 0.0355 | 0.0469 | 0.0243 | 0.0243 | 0.0370 | 0.0336 | 0.0430 |
| Macro-F1 | Scene | 0.0307 | 0.0321 | 0.0262 | 0.0201 | 0.0355 | 0.0212 | 0.0204 |
| Macro-F1 | Corel5K | 0.0099 | 0.0015 | 0.0005 | 0.0004 | 0.0023 | 0.0020 | 0.0037 |
| Macro-F1 | Mirflickr | 0.0230 | 0.0272 | 0.0126 | 0.0246 | 0.0252 | 0.0069 | 0.0256 |
| Macro-F1 | NUS-WIDE | 0.0001 | 0.0010 | 0.0005 | 0.0006 | 0.0012 | 0.0048 | 0.0013 |

Table 8. Standard deviations of CLEAR and baselines when $\bar{T}^k_{01} = 0.5$ and $\bar{T}^k_{10} = 0.2$.

| Metric | Dataset | BCE | MV | DoctorNet | ML-KNN | MPVAE | PML-NI | CLEAR |
|---|---|---|---|---|---|---|---|---|
| Example-F1 | Image | 0.0133 | 0.0139 | 0.0092 | 0.0039 | 0.0091 | 0.0247 | 0.0269 |
| Example-F1 | Scene | 0.0111 | 0.0086 | 0.0148 | 0.0140 | 0.0119 | 0.0039 | 0.0204 |
| Example-F1 | Corel5K | 0.0259 | 0.0004 | 0.0003 | 0.0005 | 0.0005 | 0.0026 | 0.0006 |
| Example-F1 | Mirflickr | 0.0106 | 0.0116 | 0.0103 | 0.0091 | 0.0113 | 0.0025 | 0.0374 |
| Example-F1 | NUS-WIDE | 0.0008 | 0.0003 | 0.0004 | 0.0008 | 0.0006 | 0.0003 | 0.0027 |
| Micro-F1 | Image | 0.0121 | 0.0126 | 0.0103 | 0.0041 | 0.0071 | 0.0220 | 0.0236 |
| Micro-F1 | Scene | 0.0096 | 0.0082 | 0.0123 | 0.0132 | 0.0102 | 0.0041 | 0.0243 |
| Micro-F1 | Corel5K | 0.0259 | 0.0004 | 0.0003 | 0.0005 | 0.0005 | 0.0044 | 0.0006 |
| Micro-F1 | Mirflickr | 0.0101 | 0.0116 | 0.0100 | 0.0086 | 0.0110 | 0.0025 | 0.0314 |
| Micro-F1 | NUS-WIDE | 0.0007 | 0.0004 | 0.0005 | 0.0009 | 0.0006 | 0.0003 | 0.0028 |
| Macro-F1 | Image | 0.0128 | 0.0129 | 0.0120 | 0.0046 | 0.0081 | 0.0219 | 0.0215 |
| Macro-F1 | Scene | 0.0102 | 0.0086 | 0.0137 | 0.0167 | 0.0104 | 0.0043 | 0.0208 |
| Macro-F1 | Corel5K | 0.0180 | 0.0003 | 0.0003 | 0.0004 | 0.0005 | 0.0007 | 0.0004 |
| Macro-F1 | Mirflickr | 0.0066 | 0.0078 | 0.0050 | 0.0056 | 0.0067 | 0.0023 | 0.0304 |
| Macro-F1 | NUS-WIDE | 0.0020 | 0.0007 | 0.0005 | 0.0003 | 0.0006 | 0.0002 | 0.0013 |