# Pareto Optimal Streaming Unsupervised Classification

Soumya Basu¹, Steven Gutstein², Brent Lance², Sanjay Shakkottai¹

¹The University of Texas at Austin, USA. ²Army Research Lab, USA. Correspondence to: Soumya Basu.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

We study an online and streaming unsupervised classification system. Our setting consists of a collection of classifiers (with unknown confusion matrices), each of which can classify one sample per unit time, and which are accessed by a stream of unlabeled samples. Each sample is dispatched to one or more classifiers and, depending on the labels collected from these classifiers, may be sent to other classifiers to collect additional labels. The labels are continually aggregated. Once the aggregated label has high enough accuracy (a pre-specified accuracy threshold) or the sample has been sent to all the classifiers, the now-labeled sample is ejected from the system. For any given pre-specified accuracy threshold, the objective is to sustain the maximum possible sample arrival rate such that the number of samples in memory does not grow unboundedly. In this paper, we characterize the Pareto-optimal region of accuracy and arrival rate, and develop an algorithm that can operate at any point within this region. Our algorithm uses queueing-based routing and scheduling approaches, combined with a novel online tensor decomposition method to learn the hidden parameters, to obtain Pareto-optimality guarantees. We finally verify our theoretical results through simulations on two ensembles formed using AlexNet, VGG, and ResNet deep image classifiers.

1. Introduction

Computational outsourcing, via the cloud, is becoming increasingly important. However, the efficacy of cloud computing agents often remains unknown (e.g., task processing on Mechanical Turk, where the human agents have unknown efficiencies). Furthermore, ground truth data needed to evaluate the agents' performance is often absent. Such problems must be addressed through an ensemble of classifiers operating in an unsupervised setting (i.e., unsupervised ensemble learning, UEL). Over the past three decades, significant theoretical and empirical research has been performed in this area (Dawid & Skene, 1979; Li et al., 2013; Zhang et al., 2014; Jain & Oh, 2014; Jaffe et al., 2016; Shaham et al., 2016). In particular, crowdsourcing, which focuses on selecting and aggregating a large number of classifiers, has an extensive literature (Karger et al., 2011; 2014; Ok et al., 2016) culminating in optimal aggregation schemes. The key focuses of most previous works are 1) from a crowdsourcing perspective, choosing a small subset of an effectively infinite pool of classifiers for each sample, and 2) from an unsupervised learning perspective, offline/one-shot classification of a set of samples (see Section 7 for related work).

We take a somewhat different approach from traditional crowdsourcing and UEL. Our system consists of a fixed ensemble of deterministic classifiers, each of which has a fixed processing speed. Such classifiers are available in many emerging fields such as cloud computation or human-in-the-loop computation. As a motivating example, consider an ensemble of both human experts and third-party neural network classifiers, which are jointly tasked to label images in real time.
The individual labeling characteristics of these classifiers are often unknown: for human agents, because the agents have not been evaluated sufficiently in the past; and for DNNs, due to contractual or privacy reasons (e.g., the neural network was downloaded from the web, where it was previously trained on an unknown data set). Importantly, these classifiers are deterministic when the DNNs have fixed weights and the humans have fixed opinions (i.e., repeatedly providing the same sample to an agent will not result in changing labels).

In our streaming model, samples (e.g., images) arrive online and are assigned to an ensemble of expert classifiers. Each classifier is characterized by a latent confusion matrix and can process one sample per time slot. For each sample, the labels are continually collected and aggregated as it sequentially passes through distinct classifiers. During this process of aggregation, once the sample achieves a pre-specified desired accuracy, it is ejected from the system. Furthermore, in our fixed ensemble, repeated access to a specific classifier results in the same output. Thus, once classification by all the classifiers is completed, a sample is removed even if the desired accuracy has not been achieved. Two key features of our system are: (i) the sequential selection of classifiers depends on the current set of collected labels, and (ii) the decision to either eject a sample or continue to acquire additional labels also depends on the current set of collected labels.

Sustaining a high arrival rate requires that the average number of classifiers used to label a sample be small, because each classifier has limited processing speed. Yet, obtaining a high threshold accuracy requires that the average number of classifiers used to label a sample be large, because larger ensembles are more accurate. This is a fundamental trade-off in our system.

1.1. Our Contribution

Our main contributions are as follows.

1. We define and characterize the notion of a Pareto region for sample arrival rate and threshold accuracy in the streaming, online Dawid-Skene model. We show that the posterior class distribution for a set of samples under any history-dependent algorithm is independent of that history, given the partial labels collected by the samples (Lemma 3.4). Using this conditional independence, we characterize the Pareto region as a feasible network flow (Theorem 3.6).

2. We design a scheduling algorithm, which is inspired by max-weight scheduling in stochastic networks, but uses a threshold-based departure rule and maximum-likelihood labeling. We show it achieves any point in the interior of the Pareto region (Theorem 4.1). The designed algorithm does not assume knowledge of the arrival rate or the classifier statistics. Using an explore-exploit strategy, we learn these parameters online with a spectral estimation technique.

3. Off-the-shelf unsupervised learning techniques are computationally prohibitive for the online learning of hidden parameters. We develop a novel online tensor decomposition algorithm that reuses eigenvectors from the previous round. We show that such reuse, along with the application of the stopping criteria in (Anandkumar et al., 2015), decreases the computational complexity (Lemma 5.2).

4. Finally, we validate our theoretical results using ensembles comprised of AlexNet (Krizhevsky et al., 2012), VGG (Simonyan & Zisserman, 2014) and ResNet (He et al., 2016) deep convolutional neural nets.
We perform experiments on modified CIFAR-10 datasets, in which various classes are merged into super-classes so that we work with datasets containing only 3 or 5 classes. Our results highlight the trade-off between threshold accuracy and arrival rate.

2. System Model

We consider a time-slotted online unsupervised classification problem where in each time slot (a.k.a. round) samples (e.g., images) arrive into the system for classification. In the remainder of this section, we first describe how samples are generated, and then give the system's description and dynamics.

2.1. Notation

We define the set [N] = {0, 1, ..., N−1} for every positive integer N. We further include the null element, denoted e, in this set to define [N]_e = {e} ∪ [N]. An ordered array is denoted {a_i : i ∈ [N]} and an ordered matrix is denoted {A_ij : i ∈ [N], j ∈ [N′]}. More generally, an array can be indexed by an ordered set I as {a_i : i ∈ I}.

2.2. Online Dawid-Skene Model

We adopt a model which is the online analogue of the generalized Dawid-Skene model (Dawid & Skene, 1979) for unsupervised classification. We call our model the Online Dawid-Skene (OnDS) model. In our model, there are K classes and M deterministic classifiers, with M > 1. Each classifier i ∈ [M] has a confusion matrix C_i := {C_i(k, l) : k, l ∈ [K]}. We assume each classifier can process one sample per time slot, which can easily be generalized to non-uniform processing speeds.¹

The samples arrive online and are represented by their unique ids. In our deterministic classification, the true label and the individual classifiers' labels for each sample are generated once and fixed thereafter. Formally, the sample with id j has a latent tuple s_j ∈ [K]^(M+1). For each i ∈ [M], s_j(i) denotes the deterministic label received from classifier i, and s_j(M) denotes the true label of the sample. For example, with two classifiers (M = 2) and two classes (K = 2), if sample 0 has the latent tuple s_0 = (0, 1, 1), then for sample 0: (i) classifier 0's label = 0, (ii) classifier 1's label = 1, and (iii) the true label = 1. The generation process of s_j, for each j, is as follows: 1) For each sample, an independently and identically distributed (i.i.d.) true label s_j(M) is first chosen from the K classes following a probability vector p_g; thus, for all j, P[s_j(M) = k] := p_g(k) for k ∈ [K]. 2) Next, for all i ∈ [M], s_j(i) is chosen according to the conditional probability P[s_j(i) = l | s_j(M) = k] := C_i(k, l), for all k, l ∈ [K].

¹For each i ∈ [M], if classifier i labels at most µ_i samples per round, for any finite integer µ_i > 1, we can replace the i-th classifier with µ_i separate classifiers, each with confusion matrix C_i and a processing speed of one sample per round.
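As a concrete illustration of the generative model above, the following is a minimal sketch in Python. The names p_g, confusions and draw_latent_tuple, as well as the numerical values, are illustrative choices made here and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, M = 2, 2                                    # classes, classifiers
p_g = np.array([0.4, 0.6])                     # true class distribution p_g
# confusions[i][k, l] = C_i(k, l) = P(classifier i outputs l | true class is k)
confusions = [np.array([[0.8, 0.2], [0.1, 0.9]]),
              np.array([[0.7, 0.3], [0.2, 0.8]])]

def draw_latent_tuple():
    """Draw one sample's latent tuple s_j in [K]^(M+1): coordinates 0..M-1 are
    the (fixed once drawn) classifier labels, coordinate M is the true label."""
    true_label = rng.choice(K, p=p_g)
    labels = [rng.choice(K, p=confusions[i][true_label]) for i in range(M)]
    return tuple(labels) + (true_label,)

# e.g. (0, 1, 1): classifier 0 says 0, classifier 1 says 1, the true label is 1
print([draw_latent_tuple() for _ in range(3)])
```

Once a tuple is drawn it never changes: re-querying a classifier for the same sample returns the same coordinate, which is exactly the determinism assumption of the OnDS model.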
2.3. Memory

In our time-slotted system there is a common memory where past samples reside. At the beginning of round 1, the memory contains a set of samples denoted S^1. At each time τ, a sample j in memory is associated with a label tuple ŝ^τ_j ∈ [K]^M_e, which changes with time. Specifically, for all classifiers i ∈ [M] and rounds τ ≥ 1, at the beginning of round τ: (i) if the sample has been labeled by classifier i, then ŝ^τ_j(i) = s_j(i); (ii) otherwise, ŝ^τ_j(i) = e. A sample j is partially labeled if any of its coordinates is unobserved (i.e., ŝ^τ_j(i) = e for some i); otherwise, it is completely labeled. For each label tuple ℓ ∈ [K]^M_e, the number of samples with that label tuple at the beginning of round τ is denoted Q_ℓ(τ) = |{j ∈ S^τ : ŝ^τ_j = ℓ}|. For example, with M = 2 and K = 2, let sample 0 (with s_0 = (0, 1, 1)) be labeled by classifier 0 in round τ = 1 and by classifier 1 in round τ = 3. Then ŝ^1_0 = (e, e), ŝ^2_0 = (0, e), ŝ^3_0 = (0, e), and ŝ^τ_0 = (0, 1) for all τ ≥ 4.

2.4. System Dynamics

Each round has four sequential stages: 1) scheduling, 2) labeling, 3) departure, and 4) arrival.

Scheduling: Initially, the samples in memory are matched with the classifiers. The system schedules a matching in the bipartite graph G^τ with the two partites S^τ (the samples) and [M] (the classifiers). For each sample j ∈ S^τ and classifier i ∈ [M], there is an edge (j, i) if and only if ŝ^τ_j(i) = e. Let 𝒜^τ denote the set of feasible matchings, and let A^τ ∈ 𝒜^τ be the scheduled matching in round τ. If classifier i is matched with sample j then A^τ(i) = j; otherwise A^τ(i) = e.

Labeling: The labels obtained from the classifiers are denoted Cl^τ ∈ [K]^M_e. If classifier i is not matched with any sample then Cl^τ(i) = e. Otherwise, for the sample j = A^τ(i) the latent label s_j(i) is revealed, i.e. Cl^τ(i) = s_j(i). Accordingly, the observed tuple of sample j changes to ŝ^(τ+1)_j, with ŝ^(τ+1)_j(i) = s_j(i) and ŝ^(τ+1)_j(i′) = ŝ^τ_j(i′) for all i′ ≠ i in [M]. For all unmatched samples, the observed tuples remain unchanged.

Departure: Next, each sample either obtains a final label and exits the system, or re-enters the memory. The set of departing samples is denoted D^τ. Each sample j ∈ D^τ has an associated final label k̂_j. The set of partially labeled departing samples is denoted D^τ_p.

Arrival: Finally, a random number of new samples, N^τ (generated following the OnDS model), arrive into the system. Here N^τ is i.i.d. across τ, with finite mean λ_ex = E[N^τ] and bounded support sup_{τ≥1} N^τ ≤ λ_max. Let the set of arriving samples be N^τ_ex. The sample ids are ordered according to their arrival.

Update: In the (τ+1)-th time slot, the set of samples in memory changes to S^(τ+1) = S^τ \ D^τ ∪ N^τ_ex. The number of samples that change into the label tuple ℓ from another tuple and re-enter the system in round τ is denoted N_ℓ(τ), and the number of samples that change from the label tuple ℓ to another tuple in round τ is denoted R_ℓ(τ). Then, for all ℓ ∈ [K]^M_e, the queue lengths evolve as

Q_ℓ(τ+1) = (Q_ℓ(τ) − R_ℓ(τ))^+ + N_ℓ(τ).
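To make the four-stage dynamics concrete, here is a minimal sketch of one round under a trivial scheduling rule (each classifier takes any sample it has not yet labeled) and a caller-supplied departure test. It builds on draw_latent_tuple, K and M from the previous sketch; one_round, E and depart are illustrative names, and the greedy matching is only a placeholder for the policies developed in Section 4.

```python
from collections import Counter

E = None  # the null label `e`

def one_round(memory, latent, next_id, n_arrivals, depart):
    """memory: {sample_id: partial label tuple}, latent: {sample_id: latent tuple},
    depart(partial) -> True if the sample should leave the system."""
    # 1) Scheduling: a feasible matching gives each classifier at most one sample
    #    whose i-th coordinate is unobserved, and each sample at most one classifier.
    matching = {}
    for i in range(M):
        for j, partial in memory.items():
            if partial[i] is E and j not in matching.values():
                matching[i] = j
                break
    # 2) Labeling: each matched classifier reveals its (deterministic) latent label.
    for i, j in matching.items():
        p = list(memory[j]); p[i] = latent[j][i]; memory[j] = tuple(p)
    # 3) Departure: ejected samples leave memory with a final label (not tracked here).
    for j in [j for j, p in memory.items() if depart(p)]:
        del memory[j]
    # 4) Arrival: new, completely unlabeled samples enter.
    for _ in range(n_arrivals):
        latent[next_id] = draw_latent_tuple()
        memory[next_id] = (E,) * M
        next_id += 1
    # Queue lengths Q_l(tau+1): number of samples per partial label tuple.
    return Counter(memory.values()), next_id

# One round from an empty memory with 3 arrivals; only fully labeled samples depart.
memory, latent, next_id = {}, {}, 0
queues, next_id = one_round(memory, latent, next_id, n_arrivals=3,
                            depart=lambda p: E not in p)
print(queues)
```

The departure test depart=lambda p: E not in p removes only completely labeled samples; the Bayesian threshold rule of Algorithm 1 replaces this test.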
3. Pareto Region of OnUEL

3.1. Preliminaries and Definitions

System History: The history of the system is the set of all events that have happened in the past. Formally, at the various stages of round τ ≥ 1 the histories are defined recursively as follows. Let H^1_0 be the initial history of the system, and let H^τ_0 be the history at the beginning of round τ. (i) After scheduling, the history is H^τ_1 = H^τ_0 ∪ {A^τ}. (ii) After labeling, it is H^τ_2 = H^τ_1 ∪ {Cl^τ}. (iii) After departure, it is H^τ_3 = H^τ_2 ∪ {D^τ, {k̂_j : j ∈ D^τ}}. (iv) After the new arrivals, it is H^(τ+1)_0 = H^τ_3 ∪ {N^τ_ex}.

Stability: A policy φ is a sequence of matching, departure and labeling decisions {A^τ, D^τ, {k̂_j : j ∈ D^τ} : τ ≥ 1}. The time-average backlog of the system under a policy φ is Q^φ_sum := lim sup_{T→∞} (1/T) Σ_{τ=1}^{T} E[Σ_ℓ Q_ℓ(τ)]. A policy φ stabilizes the system if and only if Q^φ_sum < ∞.

Threshold Accuracy: Our objective is to ensure that each sample either achieves the required confidence or is processed on a best-effort basis. Specifically, we ensure that, on average, a sample with a final label either i) has been labeled by all classifiers, or ii) has a probability of correct labeling greater than or equal to a given threshold α_th. We formalize this using the notion of threshold accuracy. At time τ, the accuracy of a departing sample j ∈ D^τ with true label s_j(M) is Acc_j(τ) := P(k̂_j = s_j(M) | H^τ_2). The event {inf_{j∈D^τ_p} Acc_j(τ) < α} indicates that some partially labeled departing sample has accuracy less than α, i.e., neither condition above is satisfied for that sample. Threshold accuracy requires that the long-run rate of such events be zero: the threshold accuracy under a stabilizing policy φ is α_th if and only if, for all α ≤ α_th, lim_{T→∞} (1/T) Σ_{τ=1}^{T} P(inf_{j∈D^τ_p} Acc_j(τ) < α) = 0. The stabilizing policy φ is then called α_th-accurate.

The OnDS parameters θ = (p_g, {C_i : i ∈ [M]}), the arrival rate λ_ex, and the desired threshold accuracy α_th describe a system completely. Therefore, we define P = (θ, λ_ex, α_th) as the OnUEL instance.

Causal Policy: A policy φ is causal if and only if, for each τ ≥ 1, the matching A^τ is a random function of the history H^τ_0 and the parameters P, while the departure D^τ and the final labeling {k̂_j : j ∈ D^τ} are random functions of the history H^τ_2 and the parameters P. The class of causal policies is denoted C.

Pareto Region: We now define the Pareto region of a causal policy and of the OnUEL.

Definition 3.1. Given an OnDS model θ, the Pareto region of a causal policy φ, Λ_φ(θ), is the set of all tuples (λ_ex, α_th) such that the system is stable under policy φ for arrival rate λ_ex, and the policy φ is α_th-accurate.

Definition 3.2. Given an OnDS model θ, the Pareto region Λ(θ) is the convex closure of the Pareto regions of the individual causal policies, i.e. Λ(θ) = conv({(λ_ex, α_th) : ∃φ ∈ C, (λ_ex, α_th) ∈ Λ_φ(θ)}). A policy φ is Pareto-optimal if and only if, for all ε_g > 0 and all (λ_ex, α_th) with (λ_ex, α_th)(1 + ε_g) ∈ Λ(θ), we have (λ_ex, α_th) ∈ Λ_φ(θ).

Remark: The notions of Pareto region and Pareto optimality are similar, but not identical, to the concepts of capacity region and throughput optimality, respectively, in the stochastic optimization literature (Tassiulas & Ephremides, 1992; Stolyar, 2005). The key difference is that throughput optimality concerns a set of feasible arrival rates, whereas Pareto optimality relates arrival rates to accuracy.

3.2. Evolution of Posterior Distribution of Samples

We consider a fixed causal policy φ and an OnUEL instance P. Consider any stage st ∈ {0, 1, 2, 3} in round τ ≥ 1. S^τ is the set of samples in memory, and each sample j ∈ S^τ has true label s_j(M). Given a vector t = {t_j ∈ [K] : j ∈ S^τ}, we are interested in the posterior probability of the true labels being t given the history up to time τ, i.e. P_φ({s_j(M) = t_j : j ∈ S^τ} | H^τ_st).

Dependence on Entire History: The history H^τ_st is formed by stacking up the snapshots of the system at each stage up to stage st of round τ. What does this history include? It includes all the events that have happened in the past; only a tiny part of it is the set of partial labels of the samples present in the system. Recall that the partial labels are {ŝ^τ_j : j ∈ S^τ} for stages st = 0, 1, and {ŝ^(τ+1)_j : j ∈ S^τ} for stages st = 2, 3.

Let us consider a specific event: in round τ and stage 0, sample j is in the memory with partial label ŝ^τ_j. The events that lead to this observation are as follows. (1) The sample j arrives at some round before τ. (2) In each round from its arrival up to τ, it was either kept unmatched or matched with a specific classifier. (3) In each round from its arrival up to τ, when matched to a classifier it obtained a specific label. (4) In each round from its arrival up to τ, the sample was not evicted from the system. The events in step (3) are independent of the evolution of the other samples, as the labels are generated independently given the true label s_j(M).
However, the events in steps (1), (2) and (4) couple the posterior evolution with the history. In step (1), we know that all the samples j′ < j have arrived, because the samples are numbered in the order of their arrival into the system. In step (2), the matching events impose a fixed order in which the labels are collected for this sample. Further, the matching events in step (2) and the departure events in step (4) are taken collectively for all the samples present. This implies that the evolution of all the other samples that were present in the system from the arrival of j up to τ is tightly coupled with the event under consideration.

Notations for Posterior Evolution: We now set up some notation for describing the posterior evolution. We adopt the convention C_i(k, e) = 1 for all k ∈ [K] and i ∈ [M]. For all ℓ ∈ [K]^M_e and k ∈ [K], let

P(ℓ, k) := p_g(k) ∏_{i∈[M]} C_i(k, ℓ(i)).   (1)

For all ℓ ∈ [K]^M_e and k ∈ [K], let acc(ℓ, k) := P(ℓ, k) / Σ_{k′∈[K]} P(ℓ, k′). For all i ∈ [M], k ∈ [K], and any ℓ ∈ [K]^M_e,

P(k | i, ℓ) := 0 if ℓ(i) ≠ e, and otherwise P(k | i, ℓ) := [Σ_{k′∈[K]} P(ℓ, k′) C_i(k′, k)] / [Σ_{k′,k″∈[K]} P(ℓ, k′) C_i(k′, k″)].

For all i ∈ [M] and any two ℓ, ℓ′ ∈ [K]^M_e, if ℓ(i′) = ℓ′(i′) for all i′ ≠ i in [M], we define P(ℓ′ | i, ℓ) := P(ℓ′(i) | i, ℓ); otherwise, P(ℓ′ | i, ℓ) := 0.

Given a vector t ∈ [K]^{|S^τ|}, let n^(ℓ,st)_k(t) denote the number of samples in memory with partial label ℓ and candidate label k in round τ ≥ 1, immediately after stage st ∈ {0, 1, 2, 3}. Specifically, for all k ∈ [K] and ℓ ∈ [K]^M_e, n^(ℓ,st)_k(t) := |{j ∈ S^τ : t_j = k, ŝ^τ_j = ℓ}| for st ∈ {0, 1}, and n^(ℓ,st)_k(t) := |{j ∈ S^τ : t_j = k, ŝ^(τ+1)_j = ℓ}| for st ∈ {2, 3}.

Conditional Independence: We show that under a causal policy the posterior probability of the true label of each sample is independent of the history given the labels collected for that sample.

Assumption 3.3. The initial state is finite, |S^1| < ∞, and there is no labeled sample, i.e. Q_ℓ(1) = 0 for all ℓ ∈ [K]^M_e with ℓ ≠ {e}^M.

The above assumption ensures that the prior distribution of the samples present in round 1 is well behaved. The following lemma gives the exact expression for the joint posterior of the samples conditioned on the history.

Lemma 3.4. For a given problem instance P satisfying Assumption 3.3, under any causal policy φ, the following holds w.p. 1 for all τ ≥ 1, every stage st ∈ {0, 1, 2, 3}, and every t ∈ [K]^{|S^τ|}:

P_φ(∀j ∈ S^τ, s_j(M) = t_j | H^τ_st) = ∏_{j∈S^τ} P(s_j(M) = t_j | ŝ^(τ,st)_j) ∝ ∏_{ℓ∈[K]^M_e} ∏_{k∈[K]} P(ℓ, k)^{n^(ℓ,st)_k(t)},

where ŝ^(τ,st)_j denotes ŝ^τ_j for st ∈ {0, 1} and ŝ^(τ+1)_j for st ∈ {2, 3}.

Remarks on Lemma 3.4: Firstly, Lemma 3.4 shows that the posterior admits a product form across all the samples present in the system. Secondly, for each sample the posterior probability is independent of the history given only its partial label. Finally, this implies the posterior is independent of the sample ids given the counts n^(ℓ,st)_k(t).

The following statements are simple consequences of the above lemma.

Corollary 3.5. For a given problem instance P satisfying Assumption 3.3, the following statements are true for all τ ≥ 1 and any sample j with true label s_j(M).
1) For all k ∈ [K] and ℓ ∈ [K]^M_e, P(s_j(M) = k | ŝ^τ_j = ℓ) = acc(ℓ, k).
2) For all i ∈ [M], k ∈ [K], and ℓ ∈ {ℓ′ ∈ [K]^M_e : ℓ′(i) = e}, P(Cl^τ(i) = k | A^τ(i) = j, ŝ^τ_j = ℓ) = P(k | i, ℓ).

The maximum likelihood estimate (MLE) of the true label is the same for any sample with partial label ℓ. We denote it k*(ℓ) := arg max_{k∈[K]} acc(ℓ, k), and we write acc*(ℓ) := acc(ℓ, k*(ℓ)).
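The quantities P(ℓ, k), acc(ℓ, k), P(k | i, ℓ) and the MLE label k*(ℓ) drive every later decision (departure, final labeling, and the max-weight weights), so here is a minimal sketch of them, reusing p_g, confusions, K and M from the earlier sketch. The function names (P_joint, acc, P_next, mle_label) and the E sentinel are illustrative choices, not the paper's notation.

```python
E = None  # null label `e`, with the convention C_i(k, e) = 1

def P_joint(partial, k):
    """P(l, k) = p_g(k) * prod_i C_i(k, l(i)) for a partial label tuple l (Eq. (1))."""
    out = p_g[k]
    for i, lab in enumerate(partial):
        if lab is not E:
            out *= confusions[i][k, lab]
    return out

def acc(partial, k):
    """acc(l, k): posterior probability that the true class is k given partial label l."""
    return P_joint(partial, k) / sum(P_joint(partial, kp) for kp in range(K))

def P_next(k, i, partial):
    """P(k | i, l): probability that classifier i outputs k, given l with l(i) = e."""
    if partial[i] is not E:
        return 0.0
    num = sum(P_joint(partial, kp) * confusions[i][kp, k] for kp in range(K))
    den = sum(P_joint(partial, kp) * confusions[i][kp, kpp]
              for kp in range(K) for kpp in range(K))
    return num / den

def mle_label(partial):
    """k*(l) and acc*(l): the most likely class and its posterior accuracy."""
    k_star = max(range(K), key=lambda k: acc(partial, k))
    return k_star, acc(partial, k_star)

# e.g. after classifier 0 reports class 0 and classifier 1 has not been queried yet:
print(mle_label((0, E)))
```

With the toy confusion matrices above, mle_label((0, E)) returns class 0 with posterior accuracy of roughly 0.84; Algorithm 1 compares exactly this acc*(ℓ) value against the threshold α_th when deciding departures.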
3.3. Characterization of the Pareto Region

Network Model: We construct a network-flow instance to characterize the Pareto region for the problem instance (θ, λ_ex, α_th). The feasibility of the flow instance will imply that (λ_ex, α_th) is in the Pareto region, as proven in Theorem 3.6 below. The network consists of |[K]^M_e| label nodes. The label nodes ℓ for which either acc*(ℓ) ≥ α_th (the desired accuracy is reached) or ℓ(i) ≠ e for all i ∈ [M] (all labels are collected) comprise the set of terminal nodes T(α_th); the set of non-terminal nodes is T^c(α_th) := [K]^M_e \ T(α_th). The label node {e}^M is the unique source node.

There are hyper-edges connecting the nodes. Each hyper-edge corresponds to an allocation of labels to classifiers. Specifically, a hyper-edge h ∈ (T^c(α_th))^M indicates that the non-terminal label h(i) is matched with classifier i, for all i ∈ [M]. The movement of flow is decided by the selection of a hyper-edge. In particular, when a hyper-edge h is chosen, at most |{i ∈ [M] : h(i) = ℓ}| units of flow can exit the label node ℓ, and the amount of flow entering label node ℓ is Σ_{i∈[M]} P(ℓ | i, h(i)). The all-empty label node {e}^M has an external incoming flow λ_ex. A probability distribution over the hyper-edges, denoted x, determines a flow in the network. The non-emptiness of the following polytope describes the feasibility of the network flow corresponding to the tuple (θ, λ_ex, α_th).

Flow Polytope, FP(θ, λ_ex, α_th):
• x_h ≥ 0 for all h, and Σ_h x_h ≤ 1;
• λ_in({e}^M) = λ_ex;
• λ_out(ℓ) ≥ λ_in(ℓ) for all ℓ;
• λ_in(ℓ) = Σ_h x_h Σ_{i∈[M]} P(ℓ | i, h(i)) for all ℓ ≠ {e}^M;
• λ_out(ℓ) = Σ_h x_h |{i ∈ [M] : h(i) = ℓ}| for all ℓ ∉ T(α_th).

The following theorem characterizes the Pareto region of the OnUEL corresponding to the OnDS model θ.

Theorem 3.6. Given an OnDS model θ satisfying Assumption 3.3, the Pareto region is

Λ(θ) = ([0, ∞) × [0, ‖p_g‖_∞]) ∪ {(λ_ex, α_th) : α_th > ‖p_g‖_∞, FP(θ, λ_ex, α_th) ≠ ∅}.

Remarks on the Pareto Region: For α_th ≤ ‖p_g‖_∞, technically any data rate (λ_ex = ∞) can be supported by giving every sample the final label k = arg max{p_g(k) : k ∈ [K]}; in the network-flow formulation the source {e}^M itself becomes a terminal. Further, for all α_th > ‖p_g‖_∞ the maximum arrival rate is at most M, as each classifier processes at most one sample per round.
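The non-emptiness of FP(θ, λ_ex, α_th) can be checked as a small linear-programming feasibility problem. Below is a hedged sketch of such a check for tiny instances, written against the reconstruction of the polytope given above; it reuses acc, P_next, K, M and E from the earlier sketches, restricts hyper-edges to assignments h with h(i)(i) = e (other assignments carry no useful flow), and all remaining names are illustrative.

```python
from itertools import product
from scipy.optimize import linprog

def P_next_tuple(l_to, i, l_from):
    """P(l_to | i, l_from): nonzero only if l_to fills in coordinate i of l_from."""
    if l_from[i] is not E or l_to[i] is E:
        return 0.0
    if any(l_to[ip] != l_from[ip] for ip in range(M) if ip != i):
        return 0.0
    return P_next(l_to[i], i, l_from)

def flow_polytope_feasible(lam_ex, alpha_th):
    tuples = list(product([E] + list(range(K)), repeat=M))        # label nodes [K]^M_e
    def terminal(l):
        return (E not in l) or max(acc(l, k) for k in range(K)) >= alpha_th
    nonterm = [l for l in tuples if not terminal(l)]
    source = (E,) * M
    if not nonterm:
        return True          # the source itself is terminal: any arrival rate works
    # Hyper-edges: one non-terminal label per classifier, with that coordinate unobserved.
    edges = [h for h in product(nonterm, repeat=M)
             if all(h[i][i] is E for i in range(M))]
    if not edges:
        return source not in nonterm   # with no service, only a terminal source is fine
    # Feasibility LP in the variables x_h >= 0.
    A_ub = [[1.0] * len(edges)]                                   # sum_h x_h <= 1
    b_ub = [1.0]
    for l in nonterm:                                             # lambda_in(l) <= lambda_out(l)
        row = []
        for h in edges:
            inflow = sum(P_next_tuple(l, i, h[i]) for i in range(M))
            outflow = sum(1.0 for i in range(M) if h[i] == l)
            row.append(inflow - outflow)
        A_ub.append(row)
        b_ub.append(-lam_ex if l == source else 0.0)              # source also receives lambda_ex
    res = linprog(c=[0.0] * len(edges), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * len(edges), method="highs")
    return res.status == 0                                        # 0: a feasible point was found

# e.g. sweep arrival rates at a fixed threshold to trace the boundary of the Pareto region
print([lam for lam in (0.5, 1.0, 1.5, 2.0) if flow_polytope_feasible(lam, alpha_th=0.9)])
```

For the toy 2-classifier, 2-class instance this reproduces the intuition of Theorem 3.6: when the threshold forces every sample to visit both classifiers, arrival rates above one sample per round make the polytope empty.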
4. Learning Aided Scheduling

In this section we design dynamic matching, departure and labeling policies. We do not assume knowledge of the parameters P except α_th (a design specification). Our algorithm tries to stabilize the system while maintaining accuracy; as a consequence, different classifiers may see different true-label distributions among their incoming samples. We therefore resort to an explore-exploit learning strategy. In the explore phase, we select an unlabeled sample and send it to all the classifiers. This ensures that the true-label distributions of the incoming exploration samples are the same for all the classifiers. Using the samples classified in this phase, we gradually learn the hidden parameters with an online variant of a spectral method. In the exploit phase, we match samples to classifiers based on the current queue-length vector. For departure and labeling we use a Bayesian threshold-based approach. Our algorithm resembles max-weight matching (Tassiulas & Ephremides, 1992) in network optimization.

The algorithm is structured in two separate parts. In the first part, we present the routing algorithm along with the exploration events. Here we assume that, for all labels ℓ, ℓ′ ∈ [K]^M_e, classifiers i ∈ [M] and classes k ∈ [K], estimates âcc(ℓ, k) and P̂(ℓ′ | i, ℓ) are provided by an oracle (possibly erroneous). This oracle also has access to the labels obtained during exploration events. Further, in each round the algorithm receives the queue-length vector Q(τ). In the second part, deferred to Sec. 5, we construct an oracle which uses the explored samples to learn the hidden parameters, i.e. the classifiers' confusion matrices and the true class distribution. The oracle computes âcc(ℓ, k) and P̂(ℓ′ | i, ℓ) as queries arrive.

4.1. Max-weight Matching

Bipartite Graph: In any time step τ ≥ 1, the samples and the classifiers form the bipartite graph G^τ, as given in Sec. 2. We assign a weight to each edge in the graph as a function of the queue-length vector Q(τ). Consider an edge (i, j) for some j ∈ S^τ and some i ∈ [M]. For notational convenience, let sample j have the partial label ℓ (i.e., ŝ^τ_j = ℓ). Also, for each k ∈ [K], let ℓ′_k denote the unique label for which ℓ′_k(i′) = ℓ(i′) for all i′ ≠ i and ℓ′_k(i) = k, whereas ℓ(i) = e. The weight of the edge (i, j) at time τ is given as

w^τ(i, j) = Q_ℓ(τ) − Σ_{k∈[K]} P̂(ℓ′_k | i, ℓ) Q_{ℓ′_k}(τ).

To account for idle machines we let w^τ(i, e) = 0. The weight of a matching A is then W^τ(A) = Σ_{i∈[M]} w^τ(i, A(i)).

Description: In each round τ ≥ 1 the algorithm is provided with an explore flag Ex(τ), which is a 0-1 random variable drawn independently with mean s(τ) > 0. Here, s(τ) is the exploration rate parameter. If the exploration flag is 1, the algorithm explores by sending a single sample j to all the classifiers. Otherwise, it creates the graph G^τ and solves the maximum-weight bipartite matching problem to compute the matching A^τ; here it requires oracle access to P̂(·). For each classifier i, using the label Cl^τ(i) the algorithm updates the label of sample j = A^τ(i) to ŝ^(τ+1)_j. Next, the MLE accuracy âcc*(ŝ^(τ+1)_j) is obtained from the oracle. If the estimated accuracy is larger than (α_th + ε(τ)), the sample j departs the system with label k̂*(ŝ^(τ+1)_j); otherwise, it re-enters the system. The error parameter ε(τ) accounts for the errors in the oracle estimates.

Algorithm 1 Max-weight with Bayesian Departure
1: for all τ ≥ 1 do
2:   Obtain explore flag Ex(τ) and queue lengths Q(τ)
3:   if Ex(τ) = 1 and Q_{{e}^M}(τ) > 0 then          ▷ Explore
4:     Find a j s.t. ŝ^τ_j = {e}^M; set A^τ(i) = j for all i ∈ [M]
5:   else                                             ▷ Exploit
6:     Set A^τ = arg max_{A ∈ 𝒜^τ} W^τ(A)
7:   for all i ∈ [M] do
8:     Update the label of j := A^τ(i) to ŝ^(τ+1)_j using Cl^τ(i)
9:     if âcc*(ŝ^(τ+1)_j) ≥ α_th + ε(τ) then          ▷ Departure
10:      Sample j departs with label k̂*(ŝ^(τ+1)_j)
11:    else
12:      Sample j re-enters the system

Exploration: In each exploration round, a single sample is sent to all classifiers. Further, the labels acquired in this round are sent to the oracle, which may use them for learning the hidden parameters.

Query Complexity: The algorithm makes oracle queries both while computing A^τ and while deciding whether a sample should depart the system. Each of the at most M|Q(τ)|_1 edges in the graph G^τ requires at most K P̂(·)-queries; thus, computing A^τ requires KM|Q(τ)|_1 queries. For each matched sample, K âcc(·)-queries are required to take the departure and final-labeling decisions, which adds at most KM queries. Therefore, the query complexity per round is O(KM|Q(τ)|_1).

Time Complexity: The creation of the graph G^τ requires O(KM|Q(τ)|_1) time. On the unbalanced graph G^τ with (possibly) |Q(τ)|_1 ≫ M, solving the max-weight matching takes another O(max{M² log(M), M|Q(τ)|_1}) time. Overall, the time complexity is O(KM|Q(τ)|_1) per round. The amortized time complexity per round under the max-weight policy is O(max{M² log(M), M Q^MW_sum}), where Q^MW_sum is the time-average backlog of Algorithm 1.
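For concreteness, here is a hedged sketch of the exploit step (line 6 of Algorithm 1): form the queue-differential edge weights defined above and solve the maximum-weight bipartite matching, here via the Hungarian method in SciPy. It reuses E, K, M and the exact acc and P_next from the earlier sketches in place of the oracle estimates; max_weight_matching and the dictionary layout are illustrative choices, not the paper's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_weight_matching(memory, Q):
    """memory: {sample_id: partial label tuple}; Q: {label tuple: queue length}.
    Returns {classifier i: sample_id}, an arg max of the matching weight W(A)."""
    samples = list(memory)
    if not samples:
        return {}
    W = np.zeros((M, len(samples)))          # 0 also stands for 'leave classifier i idle'
    for col, j in enumerate(samples):
        l = memory[j]
        for i in range(M):
            if l[i] is not E:
                continue                     # no edge: classifier i already labeled j
            w = Q.get(l, 0)
            for k in range(K):               # expected downstream backlog after label k
                l_next = l[:i] + (k,) + l[i + 1:]
                w -= P_next(k, i, l) * Q.get(l_next, 0)
            W[i, col] = max(w, 0.0)          # a non-positive edge never beats idling
    rows, cols = linear_sum_assignment(-W)   # Hungarian method, maximizing total weight
    return {i: samples[c] for i, c in zip(rows, cols) if W[i, c] > 0}
```

Pairs whose weight is zero are dropped, which corresponds to leaving that classifier idle (the w(i, e) = 0 convention above); the queue dictionary Q can be the Counter of label tuples returned by one_round in the earlier sketch.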
4.2. Analysis of Algorithm 1

In this section we present the guarantees of the max-weight algorithm. Specifically, we show that Algorithm 1 is Pareto-optimal when the oracle has certain asymptotic convergence properties, which we now make precise.

(ε(σ), δ(σ))-Oracle: An oracle is an (α, β)-oracle if and only if, for all (i) partial labels ℓ, ℓ′ ∈ [K]^M_e, (ii) classifiers i ∈ [M], and (iii) classes k ∈ [K], the oracle with access to σ exploration samples satisfies: (1) |acc(ℓ, k) − âcc(ℓ, k)| ≤ σ^(−α) with probability at least (1 − σ^(−β)), and (2) |P(ℓ′ | i, ℓ) − P̂(ℓ′ | i, ℓ)| ≤ σ^(−α) with probability at least (1 − σ^(−β)).

The following theorem provides guarantees for Algorithm 1, under the assumption that an (α, β)-oracle with α, β > 0 exists. In Sec. 5, we construct such an oracle.

Theorem 4.1. Consider an OnUEL instance P = (θ, λ_ex, α_th) such that Assumption 3.3 holds, α_th > ‖p_g‖_∞, and (λ_ex, α_th)(1 + ε_g) ∈ Λ(θ) for some ε_g > 0; and suppose an (α, β)-oracle with α, β > 0 is available. Then, under Algorithm 1 run with an exploration rate s(τ) and a departure margin ε(τ) chosen as appropriately decaying functions of τ (of order log(τ) divided by a power of τ), the system is stable, with a time-average backlog bound that scales with λ_ex and M² and inversely with ε_g λ_min, where λ_min = λ_ex min_{ℓ∈T(α_th(1+ε_g))} Σ_k P(ℓ, k) p_g(k). Furthermore, the threshold accuracy of Algorithm 1 is α_th.

5. Learning with Explored Samples

In this section, we describe the unsupervised learning of the confusion matrices {C_i : i ∈ [M]} and the true class distribution p_g. Specifically, our purpose is to design an (α, β)-oracle that provides âcc(ℓ, k) and P̂(ℓ′ | i, ℓ) for all ℓ, ℓ′. We require the following assumption, which is common in the literature.

Assumption 5.1. [Assumptions on OnDS] (i) The entries of the confusion matrices are strictly positive, i.e. C_i(k, k′) > 0 for all i ∈ [M] and k, k′ ∈ [K]. (ii) The class distribution is bounded from below, i.e. p_g(k) ≥ p_0 > 0 for all k ∈ [K] and some constant p_0. (iii) Finally, for all classifiers i ∈ [M], the diagonal terms of the confusion matrix dominate, i.e. min_{k,k′∈[K], k′≠k} (C_i(k, k) − C_i(k, k′)) > 0.

Existence of an (α, β)-Oracle: Our algorithm is closely related to the one-shot unsupervised learning in (Zhang et al., 2014). The algorithm proposed in (Zhang et al., 2014) ensures that, for fixed constants δ_U < 1 and ε_U > 0, given O(K⁵ log((K + M)/δ_U)/ε²_U) samples, with probability at least (1 − δ_U) the confusion matrices C_i for all i ∈ [M] and the true class distribution p_g are recovered with maximum error ε_U (Theorem 2 in (Zhang et al., 2014)). Furthermore, as P̂(ℓ′ | i, ℓ) and acc(ℓ, k) are functions of the confusion matrices and the true class distribution, the errors in the corresponding estimates are also bounded: given (i) max_{i∈[M]} ‖C_i − Ĉ_i‖_∞ = O(ε_U) and (ii) ‖p_g − p̂_g‖_∞ = O(ε_U), under Assumption 5.1 we have |acc(ℓ, k) − âcc(ℓ, k)| = O(ε_U), and similarly for P̂. This ensures that, with σ exploration samples, the errors in the estimates P̂(·) and âcc(·) are bounded from above by O(σ^(−α)), for all α < 1/2, with probability at least 1 − σ^(−β) for any β < 1. Therefore, an (α, β)-oracle exists for some α > 0 and β < 1, which shows that under the additional Assumption 5.1 the max-weight algorithm is Pareto-optimal.

Algorithm 2: We design an online spectral learner, Algorithm 2, because Algorithm 1 of (Zhang et al., 2014) is computationally inefficient for our problem (see the appendix for details).

Algorithm 2 Online Spectral Learner
1: Initialize: eigenvectors {v_k : k ∈ [K]}
2: for all rounds τ ≥ 1 do
3:   Obtain explore flag Ex(τ) and new labels Cl^τ
4:   if Ex(τ) is true then                              ▷ Moment Update
5:     Update the moments M̂_2 and M̂_3.
6:     Generate the whitened tensor T_0 = T̂.
7:     for all r ∈ [K] do                               ▷ Online RTP
8:       while v_r does not satisfy Eq. (3) do
9:         Randomly pick v_r on the K-dimensional unit sphere.
10:        Repeat v_r ← T_r(I, v_r, v_r)/‖T_r(I, v_r, v_r)‖_2, R_U times.
11:      Repeat v_r ← T_r(I, v_r, v_r)/‖T_r(I, v_r, v_r)‖_2, R_U times.
12:      Let T_(r+1) ← T_r − T_r(v_r, v_r, v_r) v_r^⊗3.
13:  Recover the parameters using {v_k : k ∈ [K]}.
14:  Answer âcc(·) and P̂(·) queries.

In line 8, Algorithm 2 checks whether the eigenvector v_r from the previous round, for each r ∈ [K], still satisfies the stopping criterion, Eq. (3), of (Anandkumar et al., 2015); roughly, this criterion accepts v_r once |T_r(v_r, v_r, v_r)| is sufficiently large relative to ‖T_r‖_F/(2√K) and to ‖T_r(I, I, v_r)‖_F. The remaining steps are identical to the approach in (Zhang et al., 2014) and are thus omitted for lack of space. Specifically, line 5 of Algorithm 2 uses (2a) and (2b) in (Zhang et al., 2014), and line 13 uses (5) in (Zhang et al., 2014). We do not require the expectation-maximization steps therein.

The following lemma states that Algorithm 2 is an (α, β)-oracle with constant amortized time complexity; here we treat the dependence on M and K as constant. Together with Theorem 4.1, the lemma implies that Algorithm 1 combined with Algorithm 2 is Pareto-optimal.

Lemma 5.2. Under Assumption 5.1, Algorithm 2 with R_U = Θ(1) is an (α, β)-oracle for any α < 1/2 and any β < 1. Furthermore, the amortized time complexity of Algorithm 2 is O(1).
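The computational savings behind Lemma 5.2 come from warm-starting the robust tensor power (RTP) iterations with the previous round's eigenvectors. The sketch below illustrates that idea on a whitened K × K × K tensor T. The convergence test used here is a generic fixed-point check standing in for the criterion Eq. (3) of (Anandkumar et al., 2015), and online_rtp, power_map and R_U are illustrative names; building M̂_2, M̂_3 and the whitening step (lines 5-6) are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def power_map(T, v):
    """One tensor power update: T(I, v, v), normalized."""
    w = np.einsum('ijk,j,k->i', T, v, v)
    return w / np.linalg.norm(w)

def converged(T, v, tol=1e-6):
    # Stand-in for Eq. (3): accept v if it is (numerically) a fixed point of the power map.
    return np.linalg.norm(power_map(T, v) - v) < tol

def online_rtp(T, prev_vs, R_U=30):
    """Deflation loop of the online RTP (lines 7-12 of Algorithm 2),
    reusing last round's eigenvectors prev_vs as warm starts."""
    dim = T.shape[0]
    T_r = T.copy()
    vs, lams = [], []
    for r in range(dim):
        v = prev_vs[r]
        while not converged(T_r, v):              # line 8: keep v_r if it still passes the test
            v = rng.standard_normal(dim)          # line 9: otherwise restart at random
            v /= np.linalg.norm(v)
            for _ in range(R_U):                  # line 10: R_U power iterations
                v = power_map(T_r, v)
        for _ in range(R_U):                      # line 11: R_U refinement iterations
            v = power_map(T_r, v)
        lam = np.einsum('ijk,i,j,k->', T_r, v, v, v)
        T_r = T_r - lam * np.einsum('i,j,k->ijk', v, v, v)   # line 12: deflate T_r
        vs.append(v)
        lams.append(lam)
    return vs, lams

# Tiny demonstration on an exactly decomposable tensor with 3 orthonormal components.
dim = 3
basis, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
T = sum((r + 1.0) * np.einsum('i,j,k->ijk', basis[:, r], basis[:, r], basis[:, r])
        for r in range(dim))
warm = [basis[:, r] for r in range(dim)]          # perfect warm start: no restarts needed
vs, lams = online_rtp(T, warm)
print(np.round(sorted(lams), 3))                  # approximately [1.0, 2.0, 3.0]
```

When the moments change only slightly between exploration rounds, the previous eigenvectors typically still pass the test in line 8, so the expensive random restarts of the standard RTP are skipped; this is the source of the O(1) amortized complexity claimed in Lemma 5.2.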
6. Empirical Results

Experimental Setup: For our experiments, we create two datasets in which the classes of CIFAR-10 (Krizhevsky & Hinton, 2009) are merged into larger groups, effectively giving us fewer classes. Our ensembles consist of deep convolutional neural networks. Specifically, we consider two modifications of CIFAR-10. In the first, we have three groups: (airplanes, ships, trucks, cars), (birds, frogs, cats), and (dogs, deer, horses). In the second, we have five groups: (airplanes, ships), (trucks, cars), (birds, frogs), (cats, dogs) and (deer, horses).

The first ensemble has 6 classifiers: three AlexNet, one VGG-19, and two ResNet-18 neural nets (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; He et al., 2016). The second ensemble also has 6: one VGG-11, one VGG-16, two VGG-19, and two ResNet-18 neural nets. Each classifier is individually trained. The accuracies of the various net architectures differ across the two variations of CIFAR-10; the ResNet-18 nets outperform the other variants, whereas the AlexNet nets perform the worst. Therefore, the performance of the first ensemble is worse than that of the second. The classifiers' accuracies and confusion matrices are given in the supplementary material. There we also show that, across a broad range of arrival rates, the proposed algorithm attains 2% to 5% higher average accuracy compared to a random scheduler.

Convergence Property: We now study our algorithm's temporal dynamics. The arrival rate λ_ex and threshold accuracy α_th were set to 2 samples/round and 0.85, respectively. In Fig. 1a, we show that inside the Pareto region the sum of queue lengths (a.k.a. backlog) fluctuates around an average of 10.8; the convergence to this average value happens within a few hundred rounds. The average accuracy of the system (i.e., departures with correct labels / total departures) also converges, to an average of 0.87, as shown in Fig. 1b. In Fig. 1c we observe that the online spectral learning effectively learns the parameters involved. The error in this figure is the total error in estimating the confusion matrices and the true class distribution, i.e. ‖p_g − p̂_g‖_1 + (1/M) Σ_{i∈[M]} ‖Ĉ_i − C_i‖_1. Similar results are observed for the other dataset in Figs. 2a-2c.

Pareto Region: We study the Pareto regions of the two ensembles in Fig. 1d and Fig. 2d.
The sizes of the circles in these figures depict the logarithm of the average queue length for a given (λ_ex, α_th) pair. We see that for higher arrival rates, the average queue lengths start increasing at smaller thresholds, and vice versa. Furthermore, we observe that the Pareto region of the first ensemble is smaller than that of the second ensemble. This is consistent with more accurate ensembles supporting larger Pareto regions.

Figure 1: Performance of Ensemble-1 on the 3-Group dataset. Panels: (a) Backlog, (b) Accuracy, (c) Estimation Error, (d) Capacity.

Figure 2: Performance of Ensemble-2 on the 5-Group dataset. Panels: (a) Backlog, (b) Accuracy, (c) Estimation Error, (d) Capacity.

7. Related Work

Our work belongs to the family of techniques derived from the Dawid-Skene model (Dawid & Skene, 1979), which performs unsupervised ensemble learning (UEL) using an expectation-maximization (EM) based algorithm. The need for UEL appears in fields ranging from anthropology (Romney et al., 1986) to medical diagnostics (Zhou et al., 2009). Most recently, UEL has been of interest to the crowdsourcing community (Snow et al., 2008; Sheng et al., 2008; Whitehill et al., 2009; Raykar et al., 2010). This has led to the development of various approaches, such as weighted majority voting (Li et al., 2013), message passing (Karger et al., 2011; 2014; Ok et al., 2016), and spectral-method-based estimation (Zhang et al., 2014; Jain & Oh, 2014; Jaffe et al., 2016), all of which have provable guarantees on classification error and on learning the parameters defining the ensemble and the data.

We have developed an adaptive task-routing technique with latent labels. In this sense, our work is most similar to (Massoulié & Xu, 2016; Shah et al., 2017), but with some important distinctions. The setting in (Massoulié & Xu, 2016) allows for resampling of classifiers, whereas our ensemble is fixed. In (Shah et al., 2017), the departure decisions are not made by the ensemble, whereas in our approach both the departure and labeling decisions are made by the ensemble (thus statistically correlating the acquired labels and the departures). Finally, unlike (Massoulié & Xu, 2016; Shah et al., 2017), our technique requires no a priori knowledge of the classifiers or the incoming data.

Acknowledgments

This work is supported by ARO grant W911NF-17-1-0359 and the US DoT supported D-STOP Tier 1 University Transportation Center.

References

Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. Tensor decompositions for learning latent variable models (a survey for ALT). In International Conference on Algorithmic Learning Theory, pp. 19-38. Springer, 2015.

Dawid, A. P. and Skene, A. M. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pp. 20-28, 1979.

Gopalan, A., Caramanis, C., and Shakkottai, S. On wireless scheduling with partial channel-state information. IEEE Transactions on Information Theory, 58(1):403-420, 2012.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Jaffe, A., Fetaya, E., Nadler, B., Jiang, T., and Kluger, Y. Unsupervised ensemble learning with dependent classifiers. In Artificial Intelligence and Statistics, pp. 351-360, 2016.

Jain, P. and Oh, S. Learning mixtures of discrete product distributions using spectral decompositions. In Conference on Learning Theory, pp. 824-856, 2014.

Karger, D. R., Oh, S., and Shah, D. Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems, pp. 1953-1961, 2011.
Karger, D. R., Oh, S., and Shah, D. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1-24, 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.

Li, H., Yu, B., and Zhou, D. Error rate bounds in crowdsourcing models. arXiv preprint arXiv:1307.2674, 2013.

Massoulié, L. and Xu, K. On the capacity of information processing systems. In Conference on Learning Theory, pp. 1292-1297, 2016.

Neely, M. J., Modiano, E., and Rohrs, C. E. Dynamic power allocation and routing for time-varying wireless networks. IEEE Journal on Selected Areas in Communications, 23(1):89-103, 2005.

Ok, J., Oh, S., Shin, J., and Yi, Y. Optimality of belief propagation for crowdsourced classification. In International Conference on Machine Learning, pp. 535-544, 2016.

Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297-1322, 2010.

Romney, A. K., Weller, S. C., and Batchelder, W. H. Culture as consensus: A theory of culture and informant accuracy. American Anthropologist, 88(2):313-338, 1986.

Shah, V., Gulikers, L., Massoulié, L., and Vojnović, M. Adaptive matching for expert systems with uncertain task types. In Communication, Control, and Computing (Allerton), 2017 55th Annual Allerton Conference on, pp. 753-760. IEEE, 2017.

Shaham, U., Cheng, X., Dror, O., Jaffe, A., Nadler, B., Chang, J., and Kluger, Y. A deep learning approach to unsupervised ensemble learning. In International Conference on Machine Learning, pp. 30-39, 2016.

Sheng, V. S., Provost, F., and Ipeirotis, P. G. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 614-622. ACM, 2008.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Snow, R., O'Connor, B., Jurafsky, D., and Ng, A. Y. Cheap and fast, but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 254-263. Association for Computational Linguistics, 2008.

Stolyar, A. L. Maximizing queueing network utility subject to stability: Greedy primal-dual algorithm. Queueing Systems, 50(4):401-457, 2005.

Tassiulas, L. and Ephremides, A. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE Transactions on Automatic Control, 37(12):1936-1948, 1992.

Whitehill, J., Wu, T.-F., Bergsma, J., Movellan, J. R., and Ruvolo, P. L. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, pp. 2035-2043, 2009.

Zhang, Y., Chen, X., Zhou, D., and Jordan, M. I. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems, pp. 1260-1268, 2014.
Zhou, X.-H., McClish, D. K., and Obuchowski, N. A. Statistical Methods in Diagnostic Medicine, volume 569. John Wiley & Sons, 2009.