# semicrowdsourced_clustering_with_deep_generative_models__86110c60.pdf

Semi-crowdsourced Clustering with Deep Generative Models

Yucen Luo, Tian Tian, Jiaxin Shi, Jun Zhu , Bo Zhang Dept. of Comp. Sci. & Tech., Institute for AI, THBI Lab, BNRist Center, State Key Lab for Intell. Tech. & Sys., Tsinghua University, Beijing, China {luoyc15,shijx15}@mails.tsinghua.edu.cn, rossowhite@163.com {dcszj,dcszb}@mail.tsinghua.edu.cn

We consider the semi-supervised clustering problem where crowdsourcing provides noisy information about the pairwise comparisons on a small subset of data, i.e., whether a sample pair is in the same cluster. We propose a new approach that includes a deep generative model (DGM) to characterize low-level features of the data, and a statistical relational model for noisy pairwise annotations on its subset. The two parts share the latent variables. To make the model automatically trade-off between its complexity and ﬁtting data, we also develop its fully Bayesian variant. The challenge of inference is addressed by fast (natural-gradient) stochastic variational inference algorithms, where we effectively combine variational message passing for the relational part and amortized learning of the DGM under a uniﬁed framework. Empirical results on synthetic and real-world datasets show that our model outperforms previous crowdsourced clustering methods.

1 Introduction

Clustering is a classic data analysis problem when the taxonomy of data is unknown in advance. Its main goal is to divide samples into disjunct clusters based on the similarity between them. Clustering is useful in various application areas including computer vision [21], bioinformatics [28], anomaly detection [2], etc. When the feature vectors of samples are observed, most clustering algorithms require a similarity or distance metric deﬁned in the feature space, so that the optimization objective can be built. Since different metrics may result in entirely different clustering results, and general geometry metrics may not meet the intention of the tasks designer, many clustering approaches learn the metric from the side-information provided by domain experts [30], thus the manual labeling procedure of experts could be a bottleneck for the learning pipeline.

Crowdsourcing is an efﬁcient way to collect human feedbacks [12]. It distributes micro-tasks to a group of ordinal web workers in parallel, so the whole task can be done fast with relatively low cost. It has been used on annotating large-scale machine learning datasets such as Image Net [6], and can also be used to collect side-information for clustering. However, directly collecting labels from crowds may lead to low-quality results due to the lack of expertise of workers. Consider an example of labeling a set of images of ﬂowers from different species. One could show images to the web workers and ask them to identify the corresponding species, but such tasks require the workers to be experts in identifying the ﬂowers and have all the species in their minds, which is not always possible. A more reasonable and easier task is to ask the workers to compare pairs of ﬂower images and to answer whether they are in the same species or not. Then speciﬁc clustering methods are required to discover the clusters from the noisy feedbacks.

To solve above clustering problems with pairwise similarity labels between samples from the crowds, Crowdclustering [8] discovers the clusters within the dataset using a Bayesian hierarchical model.

corresponding author

32nd Conference on Neural Information Processing Systems (Neur IPS 2018), Montréal, Canada.

By explicitly modeling the mistakes and preferences of web workers, the outputs will match the human consciousness of the clustering tasks. This method reduces the labeling cost to a great degree compared with expert labeling. However, the cost still grows quadratically as the dataset size grows, so it is still only suitable for small datasets. In this work, we move one step further and consider the semi-supervised crowdclustering problem that jointly models the feature vectors and the crowdsourced pairwise labels for only a subset of samples. When we control the size of the subset to be labeled by crowds, the total labeling budget and time can be controlled. A similar problem has been discussed by [31], while the authors use a linear similarity function deﬁned on the low-level object features, and ignore the noise and inter-worker variations in the manual annotations.

Different from existing approaches, we propose a semi-supervised deep Bayesian model to jointly model the generation of the labels and the raw features for both crowd-labeled and unlabeled samples. Instead of the direct usage of low-level features, we build a ﬂexible deep generative model (DGM) to capture the latent representation of data, which is more suitable to express the semantic similarity than the low-level features. The crowdsourced pairwise labels are modeled by a statistical relational model, and the two parts (i.e., DGM and the relational model) share the same latent variables. We also investigate the fully Bayesian variant of this model so that it can automatically control its complexity. Due to the intractability of exact inference, we develop fast (natural-gradient) stochastic variational inference algorithms. To address the challenges in fully Bayesian inference over model parameters, we effectively combine variational message passing and natural gradient updates for the conjugate part (i.e., the relational model and the mixture model) and amortized learning of the nonconjugate part (i.e., DGM) under a uniﬁed framework. Empirical results on synthetic and real-world datasets show that our model outperforms previous crowdsourced clustering methods.

2 Semi-crowdsourced deep clustering

Figure 1: Semi-crowdsourced Deep Clustering (SCDC).

In this section, we propose the semi-crowdsourced clustering with deep generative models for directly modeling the raw data, which enables end-to-end training. We call the model Semi-crowdsourced Deep Clustering (SCDC), whose graphical model is shown in Figure 1. This model is composed of two parts: the raw data model handles the generative process of the observations O; the crowdsourcing behavior model on labels L describes the labeling procedure of the workers. The details for each part will be introduced below.

2.1 Model the raw data deep generative models We denote the raw data observations by O = {o1, ..., o N}. For images, on RD denotes the pixel values. For each data point on we have a corresponding latent variable xn Rd and p(on|xn, γ) is a ﬂexible neural network density model parametrized by γ. p(xn|zn; µ, Σ) is a Gaussian mixture where zn comprises a 1-of-K binary vector with elements znk for k = 1, ..., K. Here K denotes the number of clusters. We denote the local latent variables by X = {x1, ..., x N}, Z = {z1, ..., z N}. When real-valued observations are given, the generative process is as follows:

n=1 p(zn; π) =

k=1 πznk k , p(X|Z; µ, Σ) =

k=1 N(xn; µk, Σk)znk,

p(O|X; γ) =

n=1 N(on|µγ(xn), diag(σ2 γ(xn))),

where µγ( ) and σ2 γ( ) are two neural networks parameterized by γ. For other types of observations O, p(o|x; γ) can be other distributions, e.g. Bernoulli distribution for binary observations. In general, our model is a deep generative model with structured latent variables.

2.2 Model the behavior of each worker two-coin Dawid-Skene model

We collect pairwise annotations provided by M workers. A partially observed L(m) {0, 1, NULL}Nl Nl is the annotation matrix of the m-th worker, where Nl is the number of an-

notated data points. For observation pairs (oi, oj), i = j, L(m) ij = 1 represents that the m-th worker provides a must-link (ML) constraint, which means observations i and j belong to a same cluster, L(m) ij = 0 represents cannot-link (CL) constraint, which means observations i and j belong to differ-

ent clusters, and NULL represents that L(m) ij is not observed. It is obvious that L(m) is symmetric,

i.e., L(m) ij = L(m) ji , i, j, m. Self-edges are not allowed, i.e., L(m) ii = NULL, i.

Among all the N data observations O, we only crowdsource pairwise annotations for a small portion of O, denoted by OL. Each worker only provides annotations to a small amount of items in OL and the annotation accuracies of non-expert workers may vary with observations and levels of expertise. We adopt the two-coin Dawid-Skene model for annotators from [18] and develop a probabilistic model by explicitly modeling the uncertainty of each worker. Speciﬁcally, the uncertainty of the m-th worker can be characterized by accuracy parameters (αm, βm), where αm represents sensitivity, which means the probability of providing ML constraints for sample pairs belonging to the same cluster. And βm is the m-th worker s speciﬁcity, which means the probability of providing CL constraints for sample pairs from different clusters. Let α = {α1, ..., αM} and β = {β1, ..., βM}. The likelihood is deﬁned as p(L(m) ij |zi, zj; αm, βm) = Bern(L(m) ij |αm)z i zj Bern(L(m) ij |1 βm)1 z i zj, (1)

or equivalently, p(L(m) ij = 1|zi = zj, αm) = αm, p(L(m) ij = 0|zi = zj, βm) = βm. To simplify the

notation, we deﬁne I(m) ij = I[L(m) ij = NULL]. Using the symmetry of L(m), the total likelihood of annotations can be written

p(L|Z; α, β) =

1 i<j N p(L(m) ij |zi, zj; αm, βm)I(m) ij . (2)

2.3 Amortized variational inference

As described above, the parameters in the semi-crowdsourced deep clustering model include π [0, 1]K, µ RK d,Σ RK d d, α, β [0, 1]M, and the parameters of neural networks γ. Let Θ = {π, µ, Σ, α, β}, the overall joint likelihood of the model is p(Z, X, O, L; Θ, γ) = p(Z; π)p(X|Z; µ, Σ)p(O|X; γ)p(L|Z; α, β). (3) For this model, the learning objective is to maximize the variational lower bound L(O, L) of the marginal log likelihood of the entire dataset log p(O, L): log p(O, L) Eq(Z,X|O) [log p(Z, X, O, L; Θ, γ) log q(Z, X|O)] = L(O, L; Θ, γ, φ) (4)

To deal with the non-conjugate likelihood p(O|X; γ), we introduce inference networks for each of the latent variables zn and xn. The inference networks are assumed to have a factorized form q(zn, xn|on) = q(zn|on; φ)q(xn|zn, on; φ), which are Categorical and Normal distributions, respectively: q(zn|on; φ) = Cat(zn; π(on; φ)), q(xn|zn, on; φ) = N(µ(zn, on; φ), diag(σ2(zn, on; φ))),

where σ(zn, on; φ) is a vector of standard deviations and φ denotes the inference networks parameters. Similar to the approach in [14], we can analytically sum over the discrete variables zn in the lower bound and use the reparameterization trick to compute gradients w.r.t. to Θ, γ and φ.

The above objective sums over all data and annotations. For large datasets, we can conveniently use a stochastic version by approximating the lower bound with subsampled minibatches of data. Speciﬁcally, the variational lower bound is decomposed into two terms: L(L, O; Θ, γ, φ) = Llocal + Lrel, where Llocal = PN n=1 Eq(zn,xn|on) [log p(zn) + log p(xn|zn) + log p(on|xn) log q(zn, xn|on)], and Lrel = PM m=1 P

1 i<j N I(m) ij Eq(zi,zj|oi,oj;φ) log p(L(m) ij |zi, zj; αm, βm). It is easy to derive an unbiased stochastic approximation of Llocal:

n B Eq(zn,xn|on) [log p(zn) + log p(xn|zn) + log p(on|xn) log q(zn, xn|on)] ,

where B is the sampled minibatch. For Lrel, we can similarly randomly sample a minibatch S of annotations: Lrel Na |S| P

(i,j,m) S Eq(zi,zj|oi,oj;φ) log p(L(m) ij |zi, zj; αm, βm), where Na = PM m=1 P

1 i<j N I(m) ij denotes the total number of annotations.

3 Natural gradient inference for the fully Bayesian model

In the previous section, the global parameters Θ = {α, β, π, µ, Σ} are assumed to be deterministic and are directly optimized by gradient descent. In this section, we propose a fully Bayesian variant of our model (Bayes SCDC), which has an automatic trade-off between its complexity and ﬁtting the data. There is no overﬁtting if we choose a large number K of components in the mixture, in which case the variational treatment below can automatically determine the optimal number of mixture components. We develop fast natural-gradient stochastic variational inference algorithms for Bayes SCDC, which effectively combines variational message passing for the conjugate structures (i.e., the relational part and the mixture part) and amortized learning of deep components (i.e., the deep generative model).

3.1 Fully Bayesian semi-crowdsourced deep clustering (Bayes SCDC)

For the mixture model, we choose a Dirichlet prior over the mixing coefﬁcients π and an independent Normal-Inverse-Wishart prior governing the mean and covariance (µ, Σ) of each Gaussian component, given by

p(π) = Dir(π|α0) = C(α0)

k=1 πα0 1 k , p(µ, Σ) =

k=1 NIW(µk, Σk|m, κ, S, ν), (5)

where m Rd is the location parameter, κ > 0 is the concentration, S Rd d is the scale matrix (positive deﬁnite), and ν > d 1 is the degrees of freedom. The densities of π, z, (µ, Σ), x can be written in the standard form of exponential families as:

p(π) = exp η0 π, t(π) log Z(η0 π) , p(µ, Σ) = exp η0 µ,Σ, t(µ, Σ) log Z(η0 µ,Σ) ,

p(z|π) = exp η0 z(π), t(z) log Z(η0 z(π)) = exp { t(π), (t(z), 1) } ,

p(x|z, µ, Σ) = exp t(z), t(µ, Σ) (t(x), 1) ,

where η denotes the natural parameters, t( ) denotes the sufﬁcient statistics2, and log Z( ) denotes the log partition function.

For the relational model, we assume the accuracy parameters of all workers (α, β) are drawn independently from common priors. We choose conjugate Beta priors for them as

m=1 p(αm) =

m=1 Beta(τα1 0, τα2 0), p(β) =

m=1 p(βm) =

m=1 Beta(τβ1 0, τβ2 0). (6)

We write the exponential family form of p(αm) as: p(αm) = exp η0 αm, t(αm) log Z(η0 αm)

(p(βm) is similar), where η0 αm = [τα1 0 1, τα2 0 1] and t(αm) = [log αm, log(1 αm)] .

3.2 Natural-gradient stochastic variational inference

The overall joint distribution of all of the hidden and observed variables takes the form:

p(L(1:M), O, X, Z, Θ; γ) = p(π)p(Z|π)p(µ, Σ)p(X|Z, µ, Σ)p(O|X; γ) (7)

p(α)p(β)p(L(1:M)|Z, α, β).

Our learning objective is to maximize the marginal likelihood of observed data and pairwise annotations log p(O, L(1:M)). Exact posterior inference for this model is intractable. Thus we consider a mean-ﬁeld variational family q(Θ, Z, X) = q(α)q(β)q(Z)q(X)q(π)q(µ, Σ). To simplify the notations, we write each variational distribution in its exponential family form: q(θ) = exp { ηθ, t(θ) log Z(ηθ)} , θ Θ Z X. The evidence lower bound (ELBO) L(ηΘ, ηZ, ηX; γ) of log p(O, L(1:M)) is

log p(O, L(1:M)) L(ηΘ, ηZ, ηX; γ) Eq(Θ,Z,X) log p(L(1:M), O, X, Z, Θ; γ)

q(Θ)q(Z)q(X)

2Detailed expressions of each distribution can be found in Appendix A

In traditional mean-ﬁeld variational inference for conjugate models, the optimal solution of maximizing eq. (8) over each variational parameter can be derived analytically given other parameters ﬁxed, thus a coordinate ascent can be applied as an efﬁcient message passing algorithm [27, 11]. However, it is not directly applicable to our model due to the non-conjugate observation likelihood p(O|X; γ). Inspired by [13], we handle the non-conjugate likelihood by introducing recognition networks r(oi; φ). Different from SCDC in Section 2.3, the recognition networks here are used to form conjugate graphical model potentials:

ψ(xi; oi, φ) r(oi; φ), t(xi) . (9)

By replacing the non-conjugate likelihood p(O|X; γ) in the original ELBO with a conjugate term deﬁned by ψ(xi; oi, φ), we have the following surrogate objective ˆL:

b L(ηΘ, ηZ, ηX; φ) Eq(Θ,Z,X)log p(L(1:M), X, Z, Θ) exp{ψ(X; O, φ)}

q(Θ)q(Z)q(X)

As we shall see, the surrogate objective b L helps us exploit the conjugate structure in the model, thus enables a fast message-passing algorithm for these parts. Speciﬁcally, we can view eq. (10) as the ELBO of a conjugate graphical model with the same structure as in Fig. 1 (up to a constant). Similar to coordinate-ascent mean-ﬁeld variational inference [11], we can derive the local partial optimizers of individual variational parameters as below.

The optimal solution for q (X) factorizes over n , i.e., q (X) = QN i=1 q (xi), and q (xi) depends on the expected sufﬁcient statistics of (µ, Σ) and zn:

log q (xi) = Eq(µ,Σ)q(zi) log p(xi|zi, µ, Σ) + r(oi; φ), t(xi) + const, (11)

η xi = Eq(µ,Σ)[η0 xi(µ, Σ)] Eq(zi)[t(zi)]+r(oi; φ). (12)

By further assuming a mean-ﬁeld structure over Z: q (Z) = QN i=1 q (zi), we have the local partial optimizer for each single q(zi) as

log q (zi) = Eq(π) log p(zi|π) + Eq(µ,Σ)q(xi) log p(xi|zi, µ, Σ)

+ Eq(α)q(β)q(Z i) h log p(L(1:M)|Z, α, β) i + const, (13)

η zi = Eq(π)t(π) + Eq(µ,Σ) [t(µ, Σ)] Eq(xi) [(t(xi), 1)] +

j=1 w(m) ij Eq(zj)[t(zj)], (14)

where w(m) ij = I(m) ij Eq(α,β) h ln 1 αm

βm + L(m) ij ln αm 1 αm + ln βm 1 βm

i is the weight of the message from zj to zi. Using a block coordinate ascent algorithm that applies eqs. (12) and (14) alternatively, we can ﬁnd the joint local partial optimizers (η Z(ηΘ, φ), η X(ηΘ, φ)) of b L w.r.t. (ηX, ηZ) given other parameters ﬁxed, i.e.,

ηZ b L(ηΘ, η Z(ηΘ, φ), η X(ηΘ, φ), φ) = 0, ηX b L(ηΘ, η Z(ηΘ, φ), η X(ηΘ, φ), φ) = 0. (15)

Plugging (η Z(ηΘ, φ), η X(ηΘ, φ)) back into L, we deﬁne the ﬁnal objective

J (ηΘ; φ, γ) L(ηΘ, η Z(ηΘ, φ), η X(ηΘ, φ), γ). (16)

As shown in [13], J (ηΘ; φ, γ) lower-bounds the partially-optimized mean ﬁeld objective, i.e., maxηX,ηZ L(ηΘ, ηZ, ηX, γ) J (ηΘ, γ, φ), thus can serve as a variational objective itself. We compute the natural gradients of J w.r.t. the global variational parameters ηΘ:

e ηΘJ = h η0 Θ + Eq (Z)q (X) t(Z, X, L(1:M)), 1 ηΘ i

+ ( ηZ,ηXL(ηΘ, η Z(ηΘ, φ), η X(ηΘ, φ); γ), 0) . (17)

Note that the ﬁrst term in eq. (17) is the same as the formula of natural gradient in SVI [11], which is easy to compute, and the second term originates from the dependence of η Z, η X on ηΘ and can be computed using the reparameterization trick. For other parameters φ, γ, we can also get the gradients φJ (ηΘ; φ, γ) and γJ (ηΘ; φ, γ) using the reparameterization trick.

Algorithm 1 Semi-crowdsoursed clustering with DGMs (Bayes SCDC)

Input: observations O = {o1, ..., o N}, annotations L(1:M), variational parameters (ηΘ, γ, φ) repeat

ψi r(oi; φ), t(xi) , i = 1, ..., N for each local variational parameter η xi and η zi do Update alternatively using eq. (12) and eq. (14) end for Sample ˆxi q (xi), i = 1, ..., N Use ˆxi to approximate Eq (x) log p(o|x; γ) in the lower bound J eq. (16) Update the global variational parameters ηΘ using the natural gradient in eq. (17) Update φ, γ using φ,γJ (ηΘ; φ, γ) until Convergence

Stochastic approximation: Computing the full natural gradient in eq. (17) requires to scan over all data and annotations, which is time-consuming. Similar to Section 2.3, we can approximate the variational lower bound with unbiased estimates using mini-batches of data and annotations, thus getting a stochastic natural gradient. Several sampling strategies have been developed for relational model [9] to keep the stochastic gradient unbiased. Here we choose the simplest way: we sample annotated data pairs uniformly from the annotations and form a subsample of the relational model, and do local message passing (eqs. (12) and (14)), then perform the global update using stochastic natural gradient calculated in the subsample. Besides, for all the unannotated data, we also subsample mini-batches from them and perform local and global steps without relational terms. The algorithm of Bayes SCDC is shown in Algorithm 1.

Comparison with SCDC Bayes SCDC is different in two aspects: (a) fully Bayesian treatment of global parameters; (b) variational algorithms. As we shall see in experiments, the result of (a) is that Bayes SCDC can automatically determine the number of mixture components during training. As for (b), note that the variational family used in SCDC is not more ﬂexible, but more restricted compared to Bayes SCDC. In Bayes SCDC, the mean-ﬁeld q(z)q(x) doesn t imply that q (z) and q (x) are independent, instead they implicitly inﬂuence each other through message passing in Eqs. (12) and (14). More importantly, in Bayes SCDC the variational posterior gathers information from L through message passing in the relational model. In contrast, the amortized form q(z|o)q(x, z|o) used in SCDC ignores the effect of observed annotations L. Another advantage of the inference algorithm in Bayes SCDC is in the computational cost. As we have seen in Algorithm 1, the number of passes through the x to o network is no longer linear with K because we get rid of summing over z in the observation term as in Section 2.3.

4 Related work

Most previous works on learning-from-crowds are about aggregating noisy crowdsourced labels from several predeﬁned classes [5, 18, 26, 32, 24]. A common way they use is to simultaneously estimate the workers behavior models and the ground truths. Different from this line of work, crowdclustering [8] collects pairwise labels, including the must-links and the cannot-links, from the crowds, then discovers the items afﬁliations as well as the category structure from these noisy labels, so it can be used on a border range of applications compared with the classiﬁcation methods. Recent work [25] also developed crowdclustering algorithm on triplet annotations.

One shortcoming of crowdclustering is that it can only cluster objects with available manual annotations. For large-scale problems, it is not feasible to have each object manually annotated by multiple workers. Similar problems were extensively discussed in the semi-supervised clustering area, where we are given the features for all the items and constraints on only a same portion of the items. Metric learning methods, including Information-Theoretic Metric Learning (ITML) [4] and Metric Pairwise Constrained KMeans (MPCKMeans) [1], are used on this problem, they ﬁrst learn the similarity metric between items mainly based on the supervised portion of data, then cluster the rest items using this metric. Semi-crowdsourced clustering (Semi Crowd) [31] combines the idea of crowdclustering and semi-supervised clustering, it aims to learn a pairwise similarity measure from the crowdsourced labels of n objects (n N) and the features of N objects. Unlike crowdclustering, the number of clusters in Semi Crowd is assumed to be given a priori. And it doesn t estimate the behavior

(a) The Pinwheel dataset.

(b) Without annotations, good initialization.

(c) Without annotations, bad initialization.

(d) With noisy annotations on a subset of data.

Figure 2: Clustering results on the Pinwheel dataset, with each color representing one cluster.

of different workers. Multiple Clustering Views from the Crowd (MCVC) [3] extends the idea to discover several different clustering results from the noisy labels provided by uncertain experts. A common shortcoming of these semi-crowdsourced clustering methods is they cannot make good use of unlabeled items when measuring the similarities, while our model is a step towards this direction.

As shown in Section 2.1, our model is a deep generative model (DGM) with relational latent structures. DGMs are a kind of probabilistic graphical models that use neural networks to parameterize the conditional distribution between random variables. Unlike traditional probabilistic models, DGMs can directly model high-dimensional outputs with complex structures, which enables end-to-end training on real data. They have shown success in (conditional) image generation [15, 19], semisupervised learning [14], and one-shot classiﬁcation [20]. Typical inference algorithms for DGMs are in the amortized form like that in Section 2.3. However, this approach cannot leverage the conjugate structures in latent variables. Therefore few works have been done on fully Bayesian treatment of global parameters in DGMs. [13, 16] are two exceptions. In [13] the authors propose using recognition networks to produce conjugate graphical model potentials, so that traditional variational message passing algorithms and natural gradient updates can be easily combined with amortized learning of network parameters. Our work extends their algorithm to relational observations, which has not been investigated before.

5 Experiments

In this section, we demonstrate the effectiveness of the proposed methods on synthetic and real-world datasets with simulated or crowdsourced noisy annotations. Code is available at https://github. com/xinmei9322/semicrowd. Part of the implementation is based on Zhu Suan [22].

5.1 Toy Pinwheel dataset

Simulating noisy annotations from workers. Suppose we have M workers with accuracy parameters (αm, βm). We random sample pairs of items oi and oj and generate the annotations provided by worker m based on the true clustering labels of oi and oj as well as the worker s accuracy parameters (αm, βm). If oi and oj belong to the same cluster, the worker has probability αm to provide ML constraint L(m) ij = 1. If not, the worker has probability βm to provide CL constraint L(m) ij = 0.

Evaluation metrics. The clustering performance is evaluated by the commonly used normalized mutual information (NMI) score [23], measuring the similarity between two partitions. Following recent work [29], we also report the unsupervised clustering accuracy, which requires to compute the best mapping using the Hungarian algorithm efﬁciently.

First we apply our method to a toy example the pinwheel dataset in Fig. 2a following [13, 16]. It has 5 clusters and each cluster has 100 data points, thus there are 500 data points in total. We compare with unsupervised clustering to understand the help of noisy annotations. The clustering results are shown in Fig. 2. We random sampled 100 data points for annotations and simulate 20 workers, each worker gives 49 pairs of annotations, 980 in total. We set equal accuracy to each worker αm = βm = 0.9.

We use the fully Bayesian model (Bayes SCDC) described in Section 3. The initial number of clusters is set to a larger number K = 15 since the hyper priors have sparsity property intrinsically and can learn the number of clusters automatically. Unsupervised clustering is sensitive to the initializations, which achieves 95.6% accuracy and NMI score 0.91 with good initializations as shown in Fig. 2b.

200 400 600 800 1000 1200 1400 1600 1800 2000 Number of Constraints

Normalized Mutual Information

MCVC Semi Crowd

ITML MPCKMeans

CSPA Proposed

200 400 600 800 1000 1200 1400 1600 1800 2000 Number of Constraints

Normalized Mutual Information

MCVC Semi Crowd

ITML MPCKMeans

CSPA Proposed

1 2 3 4 5 Worker ID

True Weights Recovered Weights

Figure 3: Comparison to baselines: (a) Face: All the data points are annotated; (b) Face: Only 100 data points are annotated; (c) True accuracies are set to α = β = [0.95, 0.9, 0.85, 0.8, 0.75]. The green line is the true weights of each worker and the red line is the estimated weights by our model.

After training, it learns K = 8 clusters. However, with bad initializations, the accuracy and NMI score of unsupervised clustering are 75.6% and 0.806, respectively, as shown in Fig. 2c. With noisy annotations on random sampled 100 data points, our model improves accuracy to 96.6% and NMI score to 0.94. And it converges to K = 6 clusters. Our model prevents the bad results in Fig. 2c by making use of annotations.

5.2 UCI benchmark experiments

In this subsection, we compare the proposed SCDC with the competing methods on the UCI benchmarks. The baselines include MCVC [3], Semi Crowd [31], semi-supervised clustering methods such as ITML [4], MPCKMeans [1] and Cluster-based Similarity Partitioning Algorithm (CSPA) [23].

Crowdsourced annotations are not available for UCI datasets. Following the experimental protocol in MCVC [3], we generate noisy annotations given by M = 5 simulated workers with different sensitivity and speciﬁcity, i.e., α = β = [0.95, 0.9, 0.85, 0.8, 0.75], which is more challenging than equal accuracy parameters. The annotations provided by each worker varies from 200 to 2000 and the number of ML constraints equals to the number of CL constraints.

We test on Face dataset [7], containing 640 face images from 20 people with different poses (straight, left, right, up). The ground-truth clustering is based on the poses. The original image has 960 pixels. To speed up training, baseline methods apply Principle Component Analysis (PCA) and keep 20 components. For fair comparison, we test the proposed SCDC on the features after PCA. Fig. 3 plots the mean and standard deviation of NMI scores in 10 different runs for each ﬁxed number of constraints. In Fig. 3a, the annotations are randomly generated on the whole dataset. We observe that our method consistently outperforms all competing methods, demonstrating that the clustering beneﬁts from the joint generative modeling of inputs and annotations.

Annotations on a subset. To illustrate the beneﬁts of our method in the situation where only a small part of data points are annotated, we simulate noisy annotations on only 100 images. Fig. 3b shows the results of 100 annotated images. Our method exploits more structure information in the unlabeled data and shows notable improvements over all competing methods.

Recover worker behaviors. For each worker m, our model estimates the different accuracies αm and βm. We can derive from eq. (2) that the annotations of each worker m are weighted by log αm 1 αm + log βm 1 βm , which means workers with higher accuracies are more reliable and will be weighted higher. We plot the weights of 5 workers in the Face experiments in Fig. 3c.

5.3 End-to-end training with raw images

MNIST As mentioned earlier, an important feature of DGMs is that they can directly model raw data, such as images. To verify this, we experiment with the MNIST dataset of digit images, which includes 60k training images from handwritten digits 0-9. We collect crowdsourced annotations from M = 48 workers and get 3276 annotations in total. The two variants of our model (SCDC, Bayes SCDC) are tested with or without annotations. For Bayes SCDC, a non-informative prior Beta(1, 1) is placed over α, β. For fair comparison, we also randomly sample the initial accuracy parameters α, β from Beta(1, 1) for SCDC. We average the results of 5 runs. In each run we randomly initialize the model for 10 times and pick the best result. All models are trained for

Table 1: Clustering performance on MNIST. The average time per epoch is reported.

Method without annotations with annotations Accuracy NMI Time Accuracy NMI Time

SCDC 65.92 3.47 % 0.6953 0.0167 177.3s 81.87 3.86% 0.7657 0.0233 201.7s

Bayes SCDC 77.64 3.97 % 0.7944 0.0178 11.2s 84.24 5.52% 0.8120 0.0210 16.4s

Figure 4: (a) MNIST: visualization of generated random samples of 50 clusters during training Bayes SCDC. Each column represents a cluster, whose inferred proportion (πk) is reﬂected by brightness; (b) Clustering results on CIFAR-10: (top) unsupervised; (bottom) with noisy annotations.

200 epochs with minibatch size of 128 for each random initialization. The results are shown in Table 1. We can see that both models can effectively combine the information from the raw data and annotations, i.e., they worked reasonably well with only unlabeled data, and better when given noisy annotations on a subset of data. In terms of clustering accuracy and NMI, Bayes SCDC outperforms SCDC. We believe that this is because the variational message passing algorithm used in Bayes SCDC can effectively gather information from the crowdsourced annotations to form better variational approximations, as explained in Section 3.2. Besides being more accurate, Bayes SCDC is much faster because the computation cost caused by neural networks does not scales linearly with the number of clusters K (50 in this case). In Fig. 4a we show that Bayes SCDC is more ﬂexible and automatically determines the number of mixture components during training.

CIFAR-10 We also conduct experiments with real crowdsourced labels on more complex natural images, i.e., CIFAR-10. Using the same crowdsourcing scheme, we collect 8640 noisy annotations from 32 web workers on a subset of randomly sampled 4000 images. We apply SCDC with/without annotations for 5 runs of random initializations. SCDC without annotations failed with NMI score 0.0424 0.0119 and accuracy 14.23 0.69% among 5 runs. But the NMI score achieved by SCDC with noisy annotations is 0.5549 0.0028 and the accuracy is 50.09 0.08%. The clustering results on test dataset are shown in Fig. 4b. We plot 10 test samples with the largest probability for each cluster. More experiment details and discussions could be found in the supplementary material.

6 Conclusion

In this paper, we proposed a semi-crowdsourced clustering model based on deep generative models and its fully Bayesian version. We developed fast (natural-gradient) stochastic variational inference algorithms for them. The resulting method can jointly model the crowdsourced labels, worker behaviors, and the (un)annotated items. Experiments have demonstrated that the proposed method outperforms previous competing methods on standard benchmark datasets. Our work also provides general guidelines on how to incorporate DGMs to statistical relational models, where the proposed inference algorithm can be applied under a broader context.

Acknowledgement

Yucen Luo would like to thank Matthew Johnson for helpful discussions on the SVAE algorithm [13], and Yale Chang for sharing the code of the UCI benchmark experiments. We thank the anonymous reviewers for feedbacks that greatly improved the paper. This work was supported by the National Key Research and Development Program of China (No. 2017YFA0700904), NSFC Projects (Nos. 61620106010, 61621136008, 61332007), Beijing NSF Project (No. L172037), Tiangong Institute for Intelligent Computing, NVIDIA NVAIL Program, and the projects from Siemens, NEC and Intel.

[1] Mikhail Bilenko, Sugato Basu, and Raymond J Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the twenty-ﬁrst international conference on Machine learning, page 11. ACM, 2004.

[2] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR), 41(3):15, 2009.

[3] Yale Chang, Junxiang Chen, Michael H Cho, Peter J Castaldi, Edwin K Silverman, and Jennifer G Dy. Multiple clustering views from multiple uncertain experts. In International Conference on Machine Learning, pages 674 683, 2017.

[4] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric learning. In Proceedings of the 24th international conference on Machine learning, pages 209 216. ACM, 2007.

[5] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Applied Statistics, pages 20 28, 1979.

[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248 255. IEEE, 2009.

[7] Dua Dheeru and EﬁKarra Taniskidou. UCI machine learning repository, 2017.

[8] Ryan G Gomes, Peter Welinder, Andreas Krause, and Pietro Perona. Crowdclustering. In Advances in neural information processing systems, pages 558 566, 2011.

[9] Prem K Gopalan, Sean Gerrish, Michael Freedman, David M Blei, and David M Mimno. Scalable inference of overlapping communities. In Advances in Neural Information Processing Systems, pages 2249 2257, 2012.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770 778, 2016.

[11] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303 1347, 2013.

[12] Jeff Howe. The rise of crowdsourcing. Wired magazine, 14(6):1 4, 2006.

[13] Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pages 2946 2954, 2016.

[14] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581 3589, 2014.

[15] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013.

[16] Wu Lin, Mohammad Emtiyaz Khan, and Nicolas Hubacher. Variational message passing with structured inference networks. In International Conference on Learning Representations, 2018.

[17] Yucen Luo, Jun Zhu, Mengxi Li, Yong Ren, and Bo Zhang. Smooth neighbors on teacher graphs for semi-supervised learning. In The IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[18] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. JMLR, 11:1297 1322, 2010.

[19] Yong Ren, Jun Zhu, Jialian Li, and Yucen Luo. Conditional generative moment-matching networks. In Advances in Neural Information Processing Systems, pages 2928 2936, 2016.

[20] Danilo J Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. One-shot generalization in deep generative models. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48, pages 1521 1529. JMLR. org, 2016.

[21] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888 905, 2000.

[22] Jiaxin Shi, Jianfei. Chen, Jun Zhu, Shengyang Sun, Yucen Luo, Yihong Gu, and Yuhao Zhou. Zhu Suan: A library for Bayesian deep learning. ar Xiv preprint ar Xiv:1709.05870, 2017.

[23] Alexander Strehl and Joydeep Ghosh. Cluster ensembles a knowledge reuse framework for combining multiple partitions. Journal of machine learning research, 3(Dec):583 617, 2002.

[24] Tian Tian and Jun Zhu. Max-margin majority voting for learning from crowds. In Advances in Neural Information Processing Systems, pages 1621 1629, 2015.

[25] Ramya Korlakai Vinayak and Babak Hassibi. Crowdsourced clustering: Querying edges vs triangles. In Neural Information Processing System, 2016.

[26] Peter Welinder, Steve Branson, Pietro Perona, and Serge J Belongie. The multidimensional wisdom of crowds. In Advances in neural information processing systems, pages 2424 2432, 2010.

[27] John Winn and Christopher M Bishop. Variational message passing. Journal of Machine Learning Research, 6(Apr):661 694, 2005.

[28] Christian Wiwie, Jan Baumbach, and Richard Röttger. Comparing the performance of biomedical clustering methods. Nature methods, 12(11):1033 1038, 2015.

[29] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pages 478 487, 2016.

[30] Eric P Xing, Michael I Jordan, Stuart J Russell, and Andrew Y Ng. Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems, pages 521 528, 2003.

[31] Jinfeng Yi, Rong Jin, Shaili Jain, Tianbao Yang, and Anil K Jain. Semi-crowdsourced clustering: Generalizing crowd labeling by robust distance metric learning. In Advances in neural information processing systems, pages 1772 1780, 2012.

[32] Dengyong Zhou, Qiang Liu, John Platt, and Christopher Meek. Aggregating ordinal labels from crowds by minimax conditional entropy. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 262 270, 2014.