# Pareto Optimal Streaming Unsupervised Classification

Soumya Basu¹, Steven Gutstein², Brent Lance², Sanjay Shakkottai¹

¹The University of Texas at Austin, USA. ²Army Research Lab, USA. Correspondence to: Soumya Basu.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

We study an online and streaming unsupervised classification system. Our setting consists of a collection of classifiers (with unknown confusion matrices), each of which can classify one sample per unit time, and which are accessed by a stream of unlabeled samples. Each sample is dispatched to one or more classifiers and, depending on the labels collected from these classifiers, may be sent to other classifiers to collect additional labels. The labels are continually aggregated. Once the aggregated label has high enough accuracy (a pre-specified accuracy threshold) or the sample has been sent to all the classifiers, the now-labeled sample is ejected from the system. For any given pre-specified accuracy threshold, the objective is to sustain the maximum possible sample arrival rate such that the number of samples in memory does not grow unboundedly. In this paper, we characterize the Pareto-optimal region of accuracy and arrival rate, and develop an algorithm that can operate at any point within this region. Our algorithm uses queueing-based routing and scheduling approaches, combined with a novel online tensor decomposition method to learn the hidden parameters, to obtain Pareto-optimality guarantees. We finally verify our theoretical results through simulations on two ensembles formed using AlexNet, VGG, and ResNet deep image classifiers.

1. Introduction

Computational outsourcing, via the cloud, is becoming increasingly important. However, the efficacy of cloud computing agents often remains unknown (e.g., task processing on Mechanical Turk, where the human agents have unknown efficiencies). Furthermore, ground truth data needed to evaluate the agents' performance is often absent. Such problems must be addressed through an ensemble of classifiers operating in an unsupervised setting (i.e., unsupervised ensemble learning, UEL). Over the past three decades, significant theoretical and empirical research has been performed in this area (Dawid & Skene, 1979; Li et al., 2013; Zhang et al., 2014; Jain & Oh, 2014; Jaffe et al., 2016; Shaham et al., 2016). In particular, crowdsourcing, which focuses on selecting and aggregating a large number of classifiers, has an extensive literature (Karger et al., 2011; 2014; Ok et al., 2016) culminating in optimal aggregation schemes. The key focuses of most previous works are 1) from a crowdsourcing perspective, choosing a small subset of an effectively infinite pool of classifiers for each sample, and 2) from an unsupervised learning perspective, offline/one-shot classification of a set of samples (see Section 7 for related work).

We take a somewhat different approach from traditional crowdsourcing and UEL. Our system consists of a fixed ensemble of deterministic classifiers, each of which has a fixed processing speed. Such classifiers are available in many emerging fields such as cloud computation or human-in-the-loop computation. As a motivating example, consider an ensemble of both human experts and third-party neural network classifiers, which are jointly tasked to label images in real time.
The individual labeling characteristics of these classifiers are often unknown: for human agents, because the agents have not been evaluated sufficiently in the past; and for DNNs, due to contractual or privacy reasons (e.g., the neural network was downloaded from the web, where it was previously trained on an unknown data set). Importantly, these classifiers are deterministic when the DNNs have fixed weights and the humans have fixed opinions (i.e., repeatedly providing the same sample to an agent will not result in changing labels).

In our streaming model, samples (e.g., images) arrive online and are assigned to an ensemble of expert classifiers. Each classifier is characterized by a latent confusion matrix and can process one sample per time slot. For each sample, the labels are continually collected and aggregated as it sequentially passes through distinct classifiers. During this process of aggregation, once the sample achieves a pre-specified desired accuracy, it is ejected from the system. Furthermore, in our fixed ensemble, repeated access to a specific classifier results in the same output. Thus, once classification by all the classifiers is completed, a sample is removed even if the desired accuracy has not been achieved. Two key features of our system are: (i) the sequential selection of classifiers depends on the current set of collected labels, and (ii) the decision to either eject a sample or continue to acquire additional labels also depends on the current set of collected labels.

Sustaining a high arrival rate requires that the average number of classifiers used to label a sample be small, because each classifier has limited processing speed. Yet, obtaining a high threshold accuracy requires that the average number of classifiers used to label a sample be large, because larger ensembles are more accurate. This is a fundamental trade-off in our system.

1.1. Our Contribution

Our main contributions are as follows.

1. We define and characterize the notion of a Pareto region for sample arrival rate and threshold accuracy in the streaming, online Dawid-Skene model. We show that the posterior class distribution for a set of samples under any history-dependent algorithm is independent of that history, given the partial labels collected by the samples (Lemma 3.4). Using this conditional independence, we characterize the Pareto region as a feasible network flow (Theorem 3.6).

2. We design a scheduling algorithm, which is inspired by max-weight scheduling in stochastic networks, but uses a threshold-based departure rule and maximum-likelihood labeling. We show it achieves any point in the interior of the Pareto region (Theorem 4.1). The designed algorithm does not assume knowledge of the arrival rate or the classifier statistics. Using an explore-exploit strategy, we learn these parameters online with a spectral estimation technique.

3. Off-the-shelf unsupervised learning techniques are computationally prohibitive for the online learning of hidden parameters. We develop a novel online tensor decomposition algorithm that reuses eigenvectors from the previous round. We show that such reuse, along with the application of the stopping criteria in (Anandkumar et al., 2015), decreases the computational complexity (Lemma 5.2).

4. Finally, we validate our theoretical results using ensembles comprised of AlexNet (Krizhevsky et al., 2012), VGG (Simonyan & Zisserman, 2014) and ResNet (He et al., 2016) deep convolutional neural nets.
We perform experiments on modified CIFAR-10 datasets, in which various classes are merged into super-classes so that we work with datasets containing only 3 or 5 classes. Our results highlight the trade-off between threshold accuracy and arrival rate.

2. System Model

We consider a time-slotted online unsupervised classification problem where in each time slot (a.k.a. round) samples (e.g., images) arrive into the system for classification. In the remainder of this section, we first describe how samples are generated, and then give the system's description and dynamics.

2.1. Notation

We define the set [N] = {0, 1, ..., N−1} for every positive integer N. We further include the null element, denoted e, in this set to define [N]_e = {e} ∪ [N]. An ordered array is denoted {a_i : i ∈ [N]} and an ordered matrix is denoted {A_ij : i ∈ [N], j ∈ [N′]}. More generally, an array can be indexed by an ordered set I as {a_i : i ∈ I}.

2.2. Online Dawid-Skene Model

We adopt a model which is the online analogue of the generalized Dawid-Skene model (Dawid & Skene, 1979) for unsupervised classification. We call our model the Online Dawid-Skene (OnDS) model. In our model, there are K classes and M deterministic classifiers, with M > 1. Each classifier i ∈ [M] has a confusion matrix C_i := {C_i(k, l) : k, l ∈ [K]}. We assume each classifier can process one sample per time slot, which can easily be generalized to non-uniform processing speeds.¹

The samples arrive online and are represented by their unique ids. In our deterministic classification, the true label and the individual classifiers' labels for each sample are generated once and fixed thereafter. Formally, the sample with id j has a latent tuple s_j ∈ [K]^(M+1). For each i ∈ [M], s_j(i) denotes the deterministic label received from classifier i, and s_j(M) denotes the true label of the sample. For example, with two classifiers (M = 2) and two classes (K = 2), if sample 0 has the latent tuple s_0 = (0, 1, 1), then for sample 0: (i) classifier 0's label = 0, (ii) classifier 1's label = 1, and (iii) the true label = 1. The generation process of s_j, for each j, is as follows: 1) For each sample, an independently and identically distributed (i.i.d.) true label s_j(M) is first chosen from the K classes following a probability vector p_g; thus, for all j, P[s_j(M) = k] := p_g(k) for k ∈ [K]. 2) Next, for all i ∈ [M], s_j(i) is chosen according to the conditional probability P[s_j(i) = l | s_j(M) = k] := C_i(k, l), for all k, l ∈ [K].

¹For each i ∈ [M], if classifier i labels at most µ_i samples per round, for any finite integer µ_i > 1, we can replace the i-th classifier with µ_i separate classifiers, each with confusion matrix C_i and a processing speed of one sample per round.
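As a concrete illustration of the generative model above, the following is a minimal sketch in Python. The names p_g, confusions and draw_latent_tuple, as well as the numerical values, are illustrative choices made here and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, M = 2, 2                                    # classes, classifiers
p_g = np.array([0.4, 0.6])                     # true class distribution p_g
# confusions[i][k, l] = C_i(k, l) = P(classifier i outputs l | true class is k)
confusions = [np.array([[0.8, 0.2], [0.1, 0.9]]),
              np.array([[0.7, 0.3], [0.2, 0.8]])]

def draw_latent_tuple():
    """Draw one sample's latent tuple s_j in [K]^(M+1): coordinates 0..M-1 are
    the (fixed once drawn) classifier labels, coordinate M is the true label."""
    true_label = rng.choice(K, p=p_g)
    labels = [rng.choice(K, p=confusions[i][true_label]) for i in range(M)]
    return tuple(labels) + (true_label,)

# e.g. (0, 1, 1): classifier 0 says 0, classifier 1 says 1, the true label is 1
print([draw_latent_tuple() for _ in range(3)])
```

Once a tuple is drawn it never changes: re-querying a classifier for the same sample returns the same coordinate, which is exactly the determinism assumption of the OnDS model.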
2.3. Memory

In our time-slotted system there is a common memory where past samples reside. At the beginning of round 1, the memory contains a set of samples denoted S^1. At each time τ, a sample j in memory is associated with a label tuple ŝ^τ_j ∈ [K]^M_e, which changes with time. Specifically, for all classifiers i ∈ [M] and rounds τ ≥ 1, at the beginning of round τ: (i) if the sample has been labeled by classifier i, then ŝ^τ_j(i) = s_j(i); (ii) otherwise, ŝ^τ_j(i) = e. A sample j is partially labeled if any of its coordinates is unobserved (i.e., ŝ^τ_j(i) = e for some i); otherwise, it is completely labeled. For each label tuple ℓ ∈ [K]^M_e, the number of samples with that label tuple at the beginning of round τ is denoted Q_ℓ(τ) = |{j ∈ S^τ : ŝ^τ_j = ℓ}|. For example, with M = 2 and K = 2, let sample 0 (with s_0 = (0, 1, 1)) be labeled by classifier 0 in round τ = 1 and by classifier 1 in round τ = 3. Then ŝ^1_0 = (e, e), ŝ^2_0 = (0, e), ŝ^3_0 = (0, e), and ŝ^τ_0 = (0, 1) for all τ ≥ 4.

2.4. System Dynamics

Each round has four sequential stages: 1) scheduling, 2) labeling, 3) departure, and 4) arrival.

Scheduling: Initially, the samples in memory are matched with the classifiers. The system schedules a matching in the bipartite graph G^τ with the two partites S^τ (the samples) and [M] (the classifiers). For each sample j ∈ S^τ and classifier i ∈ [M], there is an edge (j, i) if and only if ŝ^τ_j(i) = e. Let 𝒜^τ denote the set of feasible matchings, and let A^τ ∈ 𝒜^τ be the scheduled matching in round τ. If classifier i is matched with sample j then A^τ(i) = j; otherwise A^τ(i) = e.

Labeling: The labels obtained from the classifiers are denoted Cl^τ ∈ [K]^M_e. If classifier i is not matched with any sample then Cl^τ(i) = e. Otherwise, for the sample j = A^τ(i) the latent label s_j(i) is revealed, i.e. Cl^τ(i) = s_j(i). Accordingly, the observed tuple of sample j changes to ŝ^(τ+1)_j, with ŝ^(τ+1)_j(i) = s_j(i) and ŝ^(τ+1)_j(i′) = ŝ^τ_j(i′) for all i′ ≠ i in [M]. For all unmatched samples, the observed tuples remain unchanged.

Departure: Next, each sample either obtains a final label and exits the system, or re-enters the memory. The set of departing samples is denoted D^τ. Each sample j ∈ D^τ has an associated final label k̂_j. The set of partially labeled departing samples is denoted D^τ_p.

Arrival: Finally, a random number of new samples, N^τ (generated following the OnDS model), arrive into the system. Here N^τ is i.i.d. across τ, with finite mean λ_ex = E[N^τ] and bounded support sup_{τ≥1} N^τ ≤ λ_max. Let the set of arriving samples be N^τ_ex. The sample ids are ordered according to their arrival.

Update: In the (τ+1)-th time slot, the set of samples in memory changes to S^(τ+1) = S^τ \ D^τ ∪ N^τ_ex. The number of samples that change into the label tuple ℓ from another tuple and re-enter the system in round τ is denoted N_ℓ(τ), and the number of samples that change from the label tuple ℓ to another tuple in round τ is denoted R_ℓ(τ). Then, for all ℓ ∈ [K]^M_e, the queue lengths evolve as

Q_ℓ(τ+1) = (Q_ℓ(τ) − R_ℓ(τ))^+ + N_ℓ(τ).
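To make the four-stage dynamics concrete, here is a minimal sketch of one round under a trivial scheduling rule (each classifier takes any sample it has not yet labeled) and a caller-supplied departure test. It builds on draw_latent_tuple, K and M from the previous sketch; one_round, E and depart are illustrative names, and the greedy matching is only a placeholder for the policies developed in Section 4.

```python
from collections import Counter

E = None  # the null label `e`

def one_round(memory, latent, next_id, n_arrivals, depart):
    """memory: {sample_id: partial label tuple}, latent: {sample_id: latent tuple},
    depart(partial) -> True if the sample should leave the system."""
    # 1) Scheduling: a feasible matching gives each classifier at most one sample
    #    whose i-th coordinate is unobserved, and each sample at most one classifier.
    matching = {}
    for i in range(M):
        for j, partial in memory.items():
            if partial[i] is E and j not in matching.values():
                matching[i] = j
                break
    # 2) Labeling: each matched classifier reveals its (deterministic) latent label.
    for i, j in matching.items():
        p = list(memory[j]); p[i] = latent[j][i]; memory[j] = tuple(p)
    # 3) Departure: ejected samples leave memory with a final label (not tracked here).
    for j in [j for j, p in memory.items() if depart(p)]:
        del memory[j]
    # 4) Arrival: new, completely unlabeled samples enter.
    for _ in range(n_arrivals):
        latent[next_id] = draw_latent_tuple()
        memory[next_id] = (E,) * M
        next_id += 1
    # Queue lengths Q_l(tau+1): number of samples per partial label tuple.
    return Counter(memory.values()), next_id

# One round from an empty memory with 3 arrivals; only fully labeled samples depart.
memory, latent, next_id = {}, {}, 0
queues, next_id = one_round(memory, latent, next_id, n_arrivals=3,
                            depart=lambda p: E not in p)
print(queues)
```

The departure test depart=lambda p: E not in p removes only completely labeled samples; the Bayesian threshold rule of Algorithm 1 replaces this test.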
3. Pareto Region of OnUEL

3.1. Preliminaries and Definitions

System History: The history of the system is the set of all events that have happened in the past. Formally, at the various stages of round τ ≥ 1 the histories are defined recursively as follows. Let H^1_0 be the initial history of the system, and let H^τ_0 be the history at the beginning of round τ. (i) After scheduling, the history is H^τ_1 = H^τ_0 ∪ {A^τ}. (ii) After labeling, it is H^τ_2 = H^τ_1 ∪ {Cl^τ}. (iii) After departure, it is H^τ_3 = H^τ_2 ∪ {D^τ, {k̂_j : j ∈ D^τ}}. (iv) After the new arrivals, it is H^(τ+1)_0 = H^τ_3 ∪ {N^τ_ex}.

Stability: A policy φ is a sequence of matching, departure and labeling decisions {A^τ, D^τ, {k̂_j : j ∈ D^τ} : τ ≥ 1}. The time-average backlog of the system under a policy φ is Q^φ_sum := lim sup_{T→∞} (1/T) Σ_{τ=1}^{T} E[Σ_ℓ Q_ℓ(τ)]. A policy φ stabilizes the system if and only if Q^φ_sum < ∞.

Threshold Accuracy: Our objective is to ensure that each sample either achieves the required confidence or is processed on a best-effort basis. Specifically, we ensure that, on average, a sample with a final label either i) has been labeled by all classifiers, or ii) has a probability of correct labeling greater than or equal to a given threshold α_th. We formalize this using the notion of threshold accuracy. At time τ, the accuracy of a departing sample j ∈ D^τ with true label s_j(M) is Acc_j(τ) := P(k̂_j = s_j(M) | H^τ_2). The event {inf_{j∈D^τ_p} Acc_j(τ) < α} indicates that some partially labeled departing sample has accuracy less than α, i.e., neither condition above is satisfied for that sample. Threshold accuracy requires that the long-run rate of such events be zero: the threshold accuracy under a stabilizing policy φ is α_th if and only if, for all α ≤ α_th, lim_{T→∞} (1/T) Σ_{τ=1}^{T} P(inf_{j∈D^τ_p} Acc_j(τ) < α) = 0. The stabilizing policy φ is then called α_th-accurate.

The OnDS parameters θ = (p_g, {C_i : i ∈ [M]}), the arrival rate λ_ex, and the desired threshold accuracy α_th describe a system completely. Therefore, we define P = (θ, λ_ex, α_th) as the OnUEL instance.

Causal Policy: A policy φ is causal if and only if, for each τ ≥ 1, the matching A^τ is a random function of the history H^τ_0 and the parameters P, while the departure D^τ and the final labeling {k̂_j : j ∈ D^τ} are random functions of the history H^τ_2 and the parameters P. The class of causal policies is denoted C.

Pareto Region: We now define the Pareto region of a causal policy and of the OnUEL.

Definition 3.1. Given an OnDS model θ, the Pareto region of a causal policy φ, Λ_φ(θ), is the set of all tuples (λ_ex, α_th) such that the system is stable under policy φ for arrival rate λ_ex, and the policy φ is α_th-accurate.

Definition 3.2. Given an OnDS model θ, the Pareto region Λ(θ) is the convex closure of the Pareto regions of the individual causal policies, i.e. Λ(θ) = conv({(λ_ex, α_th) : ∃φ ∈ C, (λ_ex, α_th) ∈ Λ_φ(θ)}). A policy φ is Pareto-optimal if and only if, for all ε_g > 0 and all (λ_ex, α_th) with (λ_ex, α_th)(1 + ε_g) ∈ Λ(θ), we have (λ_ex, α_th) ∈ Λ_φ(θ).

Remark: The notions of Pareto region and Pareto optimality are similar, but not identical, to the concepts of capacity region and throughput optimality, respectively, in the stochastic optimization literature (Tassiulas & Ephremides, 1992; Stolyar, 2005). The key difference is that throughput optimality concerns a set of feasible arrival rates, whereas Pareto optimality relates arrival rates to accuracy.

3.2. Evolution of Posterior Distribution of Samples

We consider a fixed causal policy φ and an OnUEL instance P. Consider any stage st ∈ {0, 1, 2, 3} in round τ ≥ 1. S^τ is the set of samples in memory, and each sample j ∈ S^τ has true label s_j(M). Given a vector t = {t_j ∈ [K] : j ∈ S^τ}, we are interested in the posterior probability of the true labels being t given the history up to time τ, i.e. P_φ({s_j(M) = t_j : j ∈ S^τ} | H^τ_st).

Dependence on Entire History: The history H^τ_st is formed by stacking up the snapshots of the system at each stage up to stage st of round τ. What does this history include? It includes all the events that have happened in the past; only a tiny part of it is the set of partial labels of the samples present in the system. Recall that the partial labels are {ŝ^τ_j : j ∈ S^τ} for stages st = 0, 1, and {ŝ^(τ+1)_j : j ∈ S^τ} for stages st = 2, 3.

Let us consider a specific event: in round τ and stage 0, sample j is in the memory with partial label ŝ^τ_j. The events that lead to this observation are as follows. (1) The sample j arrives at some round before τ. (2) In each round from its arrival up to τ, it was either kept unmatched or matched with a specific classifier. (3) In each round from its arrival up to τ, when matched to a classifier it obtained a specific label. (4) In each round from its arrival up to τ, the sample was not evicted from the system. The events in step (3) are independent of the evolution of the other samples, as the labels are generated independently given the true label s_j(M).
However, the events in steps (1), (2) and (4) couple the posterior evolution with the history. In step (1), we know that all the samples j′ < j have arrived, because the samples are numbered in the order of their arrival into the system. In step (2), the matching events impose a fixed order in which the labels are collected for this sample. Further, the matching events in step (2) and the departure events in step (4) are taken collectively for all the samples present. This implies that the evolution of all the other samples that were present in the system from the arrival of j up to τ is tightly coupled with the event under consideration.

Notations for Posterior Evolution: We now set up some notation for describing the posterior evolution. We adopt the convention C_i(k, e) = 1 for all k ∈ [K] and i ∈ [M]. For all ℓ ∈ [K]^M_e and k ∈ [K], let

P(ℓ, k) := p_g(k) ∏_{i∈[M]} C_i(k, ℓ(i)).   (1)

For all ℓ ∈ [K]^M_e and k ∈ [K], let acc(ℓ, k) := P(ℓ, k) / Σ_{k′∈[K]} P(ℓ, k′). For all i ∈ [M], k ∈ [K], and any ℓ ∈ [K]^M_e,

P(k | i, ℓ) := 0 if ℓ(i) ≠ e, and otherwise P(k | i, ℓ) := [Σ_{k′∈[K]} P(ℓ, k′) C_i(k′, k)] / [Σ_{k′,k″∈[K]} P(ℓ, k′) C_i(k′, k″)].

For all i ∈ [M] and any two ℓ, ℓ′ ∈ [K]^M_e, if ℓ(i′) = ℓ′(i′) for all i′ ≠ i in [M], we define P(ℓ′ | i, ℓ) := P(ℓ′(i) | i, ℓ); otherwise, P(ℓ′ | i, ℓ) := 0.

Given a vector t ∈ [K]^{|S^τ|}, let n^(ℓ,st)_k(t) denote the number of samples in memory with partial label ℓ and candidate label k in round τ ≥ 1, immediately after stage st ∈ {0, 1, 2, 3}. Specifically, for all k ∈ [K] and ℓ ∈ [K]^M_e, n^(ℓ,st)_k(t) := |{j ∈ S^τ : t_j = k, ŝ^τ_j = ℓ}| for st ∈ {0, 1}, and n^(ℓ,st)_k(t) := |{j ∈ S^τ : t_j = k, ŝ^(τ+1)_j = ℓ}| for st ∈ {2, 3}.

Conditional Independence: We show that under a causal policy the posterior probability of the true label of each sample is independent of the history given the labels collected for that sample.

Assumption 3.3. The initial state is finite, |S^1| < ∞, and there is no labeled sample, i.e. Q_ℓ(1) = 0 for all ℓ ∈ [K]^M_e with ℓ ≠ {e}^M.

The above assumption ensures that the prior distribution of the samples present in round 1 is well behaved. The following lemma gives the exact expression for the joint posterior of the samples conditioned on the history.

Lemma 3.4. For a given problem instance P satisfying Assumption 3.3, under any causal policy φ, the following holds w.p. 1 for all τ ≥ 1, every stage st ∈ {0, 1, 2, 3}, and every t ∈ [K]^{|S^τ|}:

P_φ(∀j ∈ S^τ, s_j(M) = t_j | H^τ_st) = ∏_{j∈S^τ} P(s_j(M) = t_j | ŝ^(τ,st)_j) ∝ ∏_{ℓ∈[K]^M_e} ∏_{k∈[K]} P(ℓ, k)^{n^(ℓ,st)_k(t)},

where ŝ^(τ,st)_j denotes ŝ^τ_j for st ∈ {0, 1} and ŝ^(τ+1)_j for st ∈ {2, 3}.

Remarks on Lemma 3.4: Firstly, Lemma 3.4 shows that the posterior admits a product form across all the samples present in the system. Secondly, for each sample the posterior probability is independent of the history given only its partial label. Finally, this implies the posterior is independent of the sample ids given the counts n^(ℓ,st)_k(t).

The following statements are simple consequences of the above lemma.

Corollary 3.5. For a given problem instance P satisfying Assumption 3.3, the following statements are true for all τ ≥ 1 and any sample j with true label s_j(M).
1) For all k ∈ [K] and ℓ ∈ [K]^M_e, P(s_j(M) = k | ŝ^τ_j = ℓ) = acc(ℓ, k).
2) For all i ∈ [M], k ∈ [K], and ℓ ∈ {ℓ′ ∈ [K]^M_e : ℓ′(i) = e}, P(Cl^τ(i) = k | A^τ(i) = j, ŝ^τ_j = ℓ) = P(k | i, ℓ).

The maximum likelihood estimate (MLE) of the true label is the same for any sample with partial label ℓ. We denote it k*(ℓ) := arg max_{k∈[K]} acc(ℓ, k), and we write acc*(ℓ) := acc(ℓ, k*(ℓ)).
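The quantities P(ℓ, k), acc(ℓ, k), P(k | i, ℓ) and the MLE label k*(ℓ) drive every later decision (departure, final labeling, and the max-weight weights), so here is a minimal sketch of them, reusing p_g, confusions, K and M from the earlier sketch. The function names (P_joint, acc, P_next, mle_label) and the E sentinel are illustrative choices, not the paper's notation.

```python
E = None  # null label `e`, with the convention C_i(k, e) = 1

def P_joint(partial, k):
    """P(l, k) = p_g(k) * prod_i C_i(k, l(i)) for a partial label tuple l (Eq. (1))."""
    out = p_g[k]
    for i, lab in enumerate(partial):
        if lab is not E:
            out *= confusions[i][k, lab]
    return out

def acc(partial, k):
    """acc(l, k): posterior probability that the true class is k given partial label l."""
    return P_joint(partial, k) / sum(P_joint(partial, kp) for kp in range(K))

def P_next(k, i, partial):
    """P(k | i, l): probability that classifier i outputs k, given l with l(i) = e."""
    if partial[i] is not E:
        return 0.0
    num = sum(P_joint(partial, kp) * confusions[i][kp, k] for kp in range(K))
    den = sum(P_joint(partial, kp) * confusions[i][kp, kpp]
              for kp in range(K) for kpp in range(K))
    return num / den

def mle_label(partial):
    """k*(l) and acc*(l): the most likely class and its posterior accuracy."""
    k_star = max(range(K), key=lambda k: acc(partial, k))
    return k_star, acc(partial, k_star)

# e.g. after classifier 0 reports class 0 and classifier 1 has not been queried yet:
print(mle_label((0, E)))
```

With the toy confusion matrices above, mle_label((0, E)) returns class 0 with posterior accuracy of roughly 0.84; Algorithm 1 compares exactly this acc*(ℓ) value against the threshold α_th when deciding departures.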
3.3. Characterization of the Pareto Region

Network Model: We construct a network-flow instance to characterize the Pareto region for the problem instance (θ, λ_ex, α_th). The feasibility of the flow instance will imply that (λ_ex, α_th) is in the Pareto region, as proven in Theorem 3.6 below. The network consists of |[K]^M_e| label nodes. The label nodes ℓ for which either acc*(ℓ) ≥ α_th (the desired accuracy is reached) or ℓ(i) ≠ e for all i ∈ [M] (all labels are collected) comprise the set of terminal nodes T(α_th); the set of non-terminal nodes is T^c(α_th) := [K]^M_e \ T(α_th). The label node {e}^M is the unique source node.

There are hyper-edges connecting the nodes. Each hyper-edge corresponds to an allocation of labels to classifiers. Specifically, a hyper-edge h ∈ (T^c(α_th))^M indicates that the non-terminal label h(i) is matched with classifier i, for all i ∈ [M]. The movement of flow is decided by the selection of a hyper-edge. In particular, when a hyper-edge h is chosen, at most |{i ∈ [M] : h(i) = ℓ}| units of flow can exit the label node ℓ, and the amount of flow entering label node ℓ is Σ_{i∈[M]} P(ℓ | i, h(i)). The all-empty label node {e}^M has an external incoming flow λ_ex. A probability distribution over the hyper-edges, denoted x, determines a flow in the network. The non-emptiness of the following polytope describes the feasibility of the network flow corresponding to the tuple (θ, λ_ex, α_th).

Flow Polytope, FP(θ, λ_ex, α_th):
• x_h ≥ 0 for all h, and Σ_h x_h ≤ 1;
• λ_in({e}^M) = λ_ex;
• λ_out(ℓ) ≥ λ_in(ℓ) for all ℓ;
• λ_in(ℓ) = Σ_h x_h Σ_{i∈[M]} P(ℓ | i, h(i)) for all ℓ ≠ {e}^M;
• λ_out(ℓ) = Σ_h x_h |{i ∈ [M] : h(i) = ℓ}| for all ℓ ∉ T(α_th).

The following theorem characterizes the Pareto region of the OnUEL corresponding to the OnDS model θ.

Theorem 3.6. Given an OnDS model θ satisfying Assumption 3.3, the Pareto region is

Λ(θ) = ([0, ∞) × [0, ‖p_g‖_∞]) ∪ {(λ_ex, α_th) : α_th > ‖p_g‖_∞, FP(θ, λ_ex, α_th) ≠ ∅}.

Remarks on the Pareto Region: For α_th ≤ ‖p_g‖_∞, technically any data rate (λ_ex = ∞) can be supported by giving every sample the final label k = arg max{p_g(k) : k ∈ [K]}; in the network-flow formulation the source {e}^M itself becomes a terminal. Further, for all α_th > ‖p_g‖_∞ the maximum arrival rate is at most M, as each classifier processes at most one sample per round.
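The non-emptiness of FP(θ, λ_ex, α_th) can be checked as a small linear-programming feasibility problem. Below is a hedged sketch of such a check for tiny instances, written against the reconstruction of the polytope given above; it reuses acc, P_next, K, M and E from the earlier sketches, restricts hyper-edges to assignments h with h(i)(i) = e (other assignments carry no useful flow), and all remaining names are illustrative.

```python
from itertools import product
from scipy.optimize import linprog

def P_next_tuple(l_to, i, l_from):
    """P(l_to | i, l_from): nonzero only if l_to fills in coordinate i of l_from."""
    if l_from[i] is not E or l_to[i] is E:
        return 0.0
    if any(l_to[ip] != l_from[ip] for ip in range(M) if ip != i):
        return 0.0
    return P_next(l_to[i], i, l_from)

def flow_polytope_feasible(lam_ex, alpha_th):
    tuples = list(product([E] + list(range(K)), repeat=M))        # label nodes [K]^M_e
    def terminal(l):
        return (E not in l) or max(acc(l, k) for k in range(K)) >= alpha_th
    nonterm = [l for l in tuples if not terminal(l)]
    source = (E,) * M
    if not nonterm:
        return True          # the source itself is terminal: any arrival rate works
    # Hyper-edges: one non-terminal label per classifier, with that coordinate unobserved.
    edges = [h for h in product(nonterm, repeat=M)
             if all(h[i][i] is E for i in range(M))]
    if not edges:
        return source not in nonterm   # with no service, only a terminal source is fine
    # Feasibility LP in the variables x_h >= 0.
    A_ub = [[1.0] * len(edges)]                                   # sum_h x_h <= 1
    b_ub = [1.0]
    for l in nonterm:                                             # lambda_in(l) <= lambda_out(l)
        row = []
        for h in edges:
            inflow = sum(P_next_tuple(l, i, h[i]) for i in range(M))
            outflow = sum(1.0 for i in range(M) if h[i] == l)
            row.append(inflow - outflow)
        A_ub.append(row)
        b_ub.append(-lam_ex if l == source else 0.0)              # source also receives lambda_ex
    res = linprog(c=[0.0] * len(edges), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * len(edges), method="highs")
    return res.status == 0                                        # 0: a feasible point was found

# e.g. sweep arrival rates at a fixed threshold to trace the boundary of the Pareto region
print([lam for lam in (0.5, 1.0, 1.5, 2.0) if flow_polytope_feasible(lam, alpha_th=0.9)])
```

For the toy 2-classifier, 2-class instance this reproduces the intuition of Theorem 3.6: when the threshold forces every sample to visit both classifiers, arrival rates above one sample per round make the polytope empty.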
4. Learning Aided Scheduling

In this section we design dynamic matching, departure and labeling policies. We do not assume knowledge of the parameters P except α_th (a design specification). Our algorithm tries to stabilize the system while maintaining accuracy; as a consequence, different classifiers may see different true-label distributions among their incoming samples. We therefore resort to an explore-exploit learning strategy. In the explore phase, we select an unlabeled sample and send it to all the classifiers. This ensures that the true-label distributions of the incoming exploration samples are the same for all the classifiers. Using the samples classified in this phase, we gradually learn the hidden parameters with an online variant of a spectral method. In the exploit phase, we match samples to classifiers based on the current queue-length vector. For departure and labeling we use a Bayesian threshold-based approach. Our algorithm resembles max-weight matching (Tassiulas & Ephremides, 1992) in network optimization.

The algorithm is structured in two separate parts. In the first part, we present the routing algorithm along with the exploration events. Here we assume that, for all labels ℓ, ℓ′ ∈ [K]^M_e, classifiers i ∈ [M] and classes k ∈ [K], estimates âcc(ℓ, k) and P̂(ℓ′ | i, ℓ) are provided by an oracle (possibly erroneous). This oracle also has access to the labels obtained during exploration events. Further, in each round the algorithm receives the queue-length vector Q(τ). In the second part, deferred to Sec. 5, we construct an oracle which uses the explored samples to learn the hidden parameters, i.e. the classifiers' confusion matrices and the true class distribution. The oracle computes âcc(ℓ, k) and P̂(ℓ′ | i, ℓ) as queries arrive.

4.1. Max-weight Matching

Bipartite Graph: In any time step τ ≥ 1, the samples and the classifiers form the bipartite graph G^τ, as given in Sec. 2. We assign a weight to each edge in the graph as a function of the queue-length vector Q(τ). Consider an edge (i, j) for some j ∈ S^τ and some i ∈ [M]. For notational convenience, let sample j have the partial label ℓ (i.e., ŝ^τ_j = ℓ). Also, for each k ∈ [K], let ℓ′_k denote the unique label for which ℓ′_k(i′) = ℓ(i′) for all i′ ≠ i and ℓ′_k(i) = k, whereas ℓ(i) = e. The weight of the edge (i, j) at time τ is given as

w^τ(i, j) = Q_ℓ(τ) − Σ_{k∈[K]} P̂(ℓ′_k | i, ℓ) Q_{ℓ′_k}(τ).

To account for idle machines we let w^τ(i, e) = 0. The weight of a matching A is then W^τ(A) = Σ_{i∈[M]} w^τ(i, A(i)).

Description: In each round τ ≥ 1 the algorithm is provided with an explore flag Ex(τ), which is a 0-1 random variable drawn independently with mean s(τ) > 0. Here, s(τ) is the exploration rate parameter. If the exploration flag is 1, the algorithm explores by sending a single sample j to all the classifiers. Otherwise, it creates the graph G^τ and solves the maximum-weight bipartite matching problem to compute the matching A^τ; here it requires oracle access to P̂(·). For each classifier i, using the label Cl^τ(i) the algorithm updates the label of sample j = A^τ(i) to ŝ^(τ+1)_j. Next, the MLE accuracy âcc*(ŝ^(τ+1)_j) is obtained from the oracle. If the estimated accuracy is larger than (α_th + ε(τ)), the sample j departs the system with label k̂*(ŝ^(τ+1)_j); otherwise, it re-enters the system. The error parameter ε(τ) accounts for the errors in the oracle estimates.

Algorithm 1 Max-weight with Bayesian Departure
1: for all τ ≥ 1 do
2:   Obtain explore flag Ex(τ) and queue lengths Q(τ)
3:   if Ex(τ) = 1 and Q_{{e}^M}(τ) > 0 then          ▷ Explore
4:     Find a j s.t. ŝ^τ_j = {e}^M; set A^τ(i) = j for all i ∈ [M]
5:   else                                             ▷ Exploit
6:     Set A^τ = arg max_{A ∈ 𝒜^τ} W^τ(A)
7:   for all i ∈ [M] do
8:     Update the label of j := A^τ(i) to ŝ^(τ+1)_j using Cl^τ(i)
9:     if âcc*(ŝ^(τ+1)_j) ≥ α_th + ε(τ) then          ▷ Departure
10:      Sample j departs with label k̂*(ŝ^(τ+1)_j)
11:    else
12:      Sample j re-enters the system

Exploration: In each exploration round, a single sample is sent to all classifiers. Further, the labels acquired in this round are sent to the oracle, which may use them for learning the hidden parameters.

Query Complexity: The algorithm makes oracle queries both while computing A^τ and while deciding whether a sample should depart the system. Each of the at most M|Q(τ)|_1 edges in the graph G^τ requires at most K P̂(·)-queries; thus, computing A^τ requires KM|Q(τ)|_1 queries. For each matched sample, K âcc(·)-queries are required to take the departure and final-labeling decisions, which adds at most KM queries. Therefore, the query complexity per round is O(KM|Q(τ)|_1).

Time Complexity: The creation of the graph G^τ requires O(KM|Q(τ)|_1) time. On the unbalanced graph G^τ with (possibly) |Q(τ)|_1 ≫ M, solving the max-weight matching takes another O(max{M² log(M), M|Q(τ)|_1}) time. Overall, the time complexity is O(KM|Q(τ)|_1) per round. The amortized time complexity per round under the max-weight policy is O(max{M² log(M), M Q^MW_sum}), where Q^MW_sum is the time-average backlog of Algorithm 1.
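For concreteness, here is a hedged sketch of the exploit step (line 6 of Algorithm 1): form the queue-differential edge weights defined above and solve the maximum-weight bipartite matching, here via the Hungarian method in SciPy. It reuses E, K, M and the exact acc and P_next from the earlier sketches in place of the oracle estimates; max_weight_matching and the dictionary layout are illustrative choices, not the paper's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_weight_matching(memory, Q):
    """memory: {sample_id: partial label tuple}; Q: {label tuple: queue length}.
    Returns {classifier i: sample_id}, an arg max of the matching weight W(A)."""
    samples = list(memory)
    if not samples:
        return {}
    W = np.zeros((M, len(samples)))          # 0 also stands for 'leave classifier i idle'
    for col, j in enumerate(samples):
        l = memory[j]
        for i in range(M):
            if l[i] is not E:
                continue                     # no edge: classifier i already labeled j
            w = Q.get(l, 0)
            for k in range(K):               # expected downstream backlog after label k
                l_next = l[:i] + (k,) + l[i + 1:]
                w -= P_next(k, i, l) * Q.get(l_next, 0)
            W[i, col] = max(w, 0.0)          # a non-positive edge never beats idling
    rows, cols = linear_sum_assignment(-W)   # Hungarian method, maximizing total weight
    return {i: samples[c] for i, c in zip(rows, cols) if W[i, c] > 0}
```

Pairs whose weight is zero are dropped, which corresponds to leaving that classifier idle (the w(i, e) = 0 convention above); the queue dictionary Q can be the Counter of label tuples returned by one_round in the earlier sketch.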
4.2. Analysis of Algorithm 1

In this section we present the guarantees of the max-weight algorithm. Specifically, we show that Algorithm 1 is Pareto-optimal when the oracle has certain asymptotic convergence properties, which we now make precise.

(ε(σ), δ(σ))-Oracle: An oracle is an (α, β)-oracle if and only if, for all (i) partial labels ℓ, ℓ′ ∈ [K]^M_e, (ii) classifiers i ∈ [M], and (iii) classes k ∈ [K], the oracle with access to σ exploration samples satisfies: (1) |acc(ℓ, k) − âcc(ℓ, k)| ≤ σ^(−α) with probability at least (1 − σ^(−β)), and (2) |P(ℓ′ | i, ℓ) − P̂(ℓ′ | i, ℓ)| ≤ σ^(−α) with probability at least (1 − σ^(−β)).

The following theorem provides guarantees for Algorithm 1, under the assumption that an (α, β)-oracle with α, β > 0 exists. In Sec. 5, we construct such an oracle.

Theorem 4.1. Consider an OnUEL instance P = (θ, λ_ex, α_th) such that Assumption 3.3 holds, α_th > ‖p_g‖_∞, and (λ_ex, α_th)(1 + ε_g) ∈ Λ(θ) for some ε_g > 0; and suppose an (α, β)-oracle with α, β > 0 is available. Then, under Algorithm 1 run with an exploration rate s(τ) and a departure margin ε(τ) chosen as appropriately decaying functions of τ (of order log(τ) divided by a power of τ), the system is stable, with a time-average backlog bound that scales with λ_ex and M² and inversely with ε_g λ_min, where λ_min = λ_ex min_{ℓ∈T(α_th(1+ε_g))} Σ_k P(ℓ, k) p_g(k). Furthermore, the threshold accuracy of Algorithm 1 is α_th.

5. Learning with Explored Samples

In this section, we describe the unsupervised learning of the confusion matrices {C_i : i ∈ [M]} and the true class distribution p_g. Specifically, our purpose is to design an (α, β)-oracle that provides âcc(ℓ, k) and P̂(ℓ′ | i, ℓ) for all ℓ, ℓ′. We require the following assumption, which is common in the literature.

Assumption 5.1. [Assumptions on OnDS] (i) The entries of the confusion matrices are strictly positive, i.e. C_i(k, k′) > 0 for all i ∈ [M] and k, k′ ∈ [K]. (ii) The class distribution is bounded from below, i.e. p_g(k) ≥ p_0 > 0 for all k ∈ [K] and some constant p_0. (iii) Finally, for all classifiers i ∈ [M], the diagonal terms of the confusion matrix dominate, i.e. min_{k,k′∈[K], k′≠k} (C_i(k, k) − C_i(k, k′)) > 0.

Existence of an (α, β)-Oracle: Our algorithm is closely related to the one-shot unsupervised learning in (Zhang et al., 2014). The algorithm proposed in (Zhang et al., 2014) ensures that, for fixed constants δ_U < 1 and ε_U > 0, given O(K⁵ log((K + M)/δ_U)/ε²_U) samples, with probability at least (1 − δ_U) the confusion matrices C_i for all i ∈ [M] and the true class distribution p_g are recovered with maximum error ε_U (Theorem 2 in (Zhang et al., 2014)). Furthermore, as P̂(ℓ′ | i, ℓ) and acc(ℓ, k) are functions of the confusion matrices and the true class distribution, the errors in the corresponding estimates are also bounded: given (i) max_{i∈[M]} ‖C_i − Ĉ_i‖_∞ = O(ε_U) and (ii) ‖p_g − p̂_g‖_∞ = O(ε_U), under Assumption 5.1 we have |acc(ℓ, k) − âcc(ℓ, k)| = O(ε_U), and similarly for P̂. This ensures that, with σ exploration samples, the errors in the estimates P̂(·) and âcc(·) are bounded from above by O(σ^(−α)), for all α < 1/2, with probability at least 1 − σ^(−β) for any β < 1. Therefore, an (α, β)-oracle exists for some α > 0 and β < 1, which shows that under the additional Assumption 5.1 the max-weight algorithm is Pareto-optimal.

Algorithm 2: We design an online spectral learner, Algorithm 2, because Algorithm 1 of (Zhang et al., 2014) is computationally inefficient for our problem (see the appendix for details).

Algorithm 2 Online Spectral Learner
1: Initialize: eigenvectors {v_k : k ∈ [K]}
2: for all rounds τ ≥ 1 do
3:   Obtain explore flag Ex(τ) and new labels Cl^τ
4:   if Ex(τ) is true then                              ▷ Moment Update
5:     Update the moments M̂_2 and M̂_3.
6:     Generate the whitened tensor T_0 = T̂.
7:     for all r ∈ [K] do                               ▷ Online RTP
8:       while v_r does not satisfy Eq. (3) do
9:         Randomly pick v_r on the K-dimensional unit sphere.
10:        Repeat v_r ← T_r(I, v_r, v_r)/‖T_r(I, v_r, v_r)‖_2, R_U times.
11:      Repeat v_r ← T_r(I, v_r, v_r)/‖T_r(I, v_r, v_r)‖_2, R_U times.
12:      Let T_(r+1) ← T_r − T_r(v_r, v_r, v_r) v_r^⊗3.
13:  Recover the parameters using {v_k : k ∈ [K]}.
14:  Answer âcc(·) and P̂(·) queries.

In line 8, Algorithm 2 checks whether the eigenvector v_r from the previous round, for each r ∈ [K], still satisfies the stopping criterion, Eq. (3), of (Anandkumar et al., 2015); roughly, this criterion accepts v_r once |T_r(v_r, v_r, v_r)| is sufficiently large relative to ‖T_r‖_F/(2√K) and to ‖T_r(I, I, v_r)‖_F. The remaining steps are identical to the approach in (Zhang et al., 2014) and are thus omitted for lack of space. Specifically, line 5 of Algorithm 2 uses (2a) and (2b) in (Zhang et al., 2014), and line 13 uses (5) in (Zhang et al., 2014). We do not require the expectation-maximization steps therein.

The following lemma states that Algorithm 2 is an (α, β)-oracle with constant amortized time complexity; here we treat the dependence on M and K as constant. Together with Theorem 4.1, the lemma implies that Algorithm 1 combined with Algorithm 2 is Pareto-optimal.

Lemma 5.2. Under Assumption 5.1, Algorithm 2 with R_U = Θ(1) is an (α, β)-oracle for any α < 1/2 and any β < 1. Furthermore, the amortized time complexity of Algorithm 2 is O(1).
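The computational savings behind Lemma 5.2 come from warm-starting the robust tensor power (RTP) iterations with the previous round's eigenvectors. The sketch below illustrates that idea on a whitened K × K × K tensor T. The convergence test used here is a generic fixed-point check standing in for the criterion Eq. (3) of (Anandkumar et al., 2015), and online_rtp, power_map and R_U are illustrative names; building M̂_2, M̂_3 and the whitening step (lines 5-6) are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def power_map(T, v):
    """One tensor power update: T(I, v, v), normalized."""
    w = np.einsum('ijk,j,k->i', T, v, v)
    return w / np.linalg.norm(w)

def converged(T, v, tol=1e-6):
    # Stand-in for Eq. (3): accept v if it is (numerically) a fixed point of the power map.
    return np.linalg.norm(power_map(T, v) - v) < tol

def online_rtp(T, prev_vs, R_U=30):
    """Deflation loop of the online RTP (lines 7-12 of Algorithm 2),
    reusing last round's eigenvectors prev_vs as warm starts."""
    dim = T.shape[0]
    T_r = T.copy()
    vs, lams = [], []
    for r in range(dim):
        v = prev_vs[r]
        while not converged(T_r, v):              # line 8: keep v_r if it still passes the test
            v = rng.standard_normal(dim)          # line 9: otherwise restart at random
            v /= np.linalg.norm(v)
            for _ in range(R_U):                  # line 10: R_U power iterations
                v = power_map(T_r, v)
        for _ in range(R_U):                      # line 11: R_U refinement iterations
            v = power_map(T_r, v)
        lam = np.einsum('ijk,i,j,k->', T_r, v, v, v)
        T_r = T_r - lam * np.einsum('i,j,k->ijk', v, v, v)   # line 12: deflate T_r
        vs.append(v)
        lams.append(lam)
    return vs, lams

# Tiny demonstration on an exactly decomposable tensor with 3 orthonormal components.
dim = 3
basis, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
T = sum((r + 1.0) * np.einsum('i,j,k->ijk', basis[:, r], basis[:, r], basis[:, r])
        for r in range(dim))
warm = [basis[:, r] for r in range(dim)]          # perfect warm start: no restarts needed
vs, lams = online_rtp(T, warm)
print(np.round(sorted(lams), 3))                  # approximately [1.0, 2.0, 3.0]
```

When the moments change only slightly between exploration rounds, the previous eigenvectors typically still pass the test in line 8, so the expensive random restarts of the standard RTP are skipped; this is the source of the O(1) amortized complexity claimed in Lemma 5.2.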
6. Empirical Results

Experimental Setup: For our experiments, we create two datasets in which the classes of CIFAR-10 (Krizhevsky & Hinton, 2009) are merged into larger groups, effectively giving us fewer classes. Our ensembles consist of deep convolutional neural networks. Specifically, we consider two modifications of CIFAR-10. In the first, we have three groups: (airplanes, ships, trucks, cars), (birds, frogs, cats), and (dogs, deer, horses). In the second, we have five groups: (airplanes, ships), (trucks, cars), (birds, frogs), (cats, dogs) and (deer, horses).

The first ensemble has 6 classifiers: three AlexNet, one VGG-19, and two ResNet-18 neural nets (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; He et al., 2016). The second ensemble also has 6: one VGG-11, one VGG-16, two VGG-19, and two ResNet-18 neural nets. Each classifier is individually trained. The accuracies of the various net architectures differ across the two variations of CIFAR-10; the ResNet-18 nets outperform the other variants, whereas the AlexNet nets perform the worst. Therefore, the performance of the first ensemble is worse than that of the second. The classifiers' accuracies and confusion matrices are given in the supplementary material. There we also show that, across a broad range of arrival rates, the proposed algorithm attains 2% to 5% higher average accuracy compared to a random scheduler.

Convergence Property: We now study our algorithm's temporal dynamics. The arrival rate λ_ex and threshold accuracy α_th were set to 2 samples/round and 0.85, respectively. In Fig. 1a, we show that inside the Pareto region the sum of queue lengths (a.k.a. backlog) fluctuates around an average of 10.8; the convergence to this average value happens within a few hundred rounds. The average accuracy of the system (i.e., departures with correct labels / total departures) also converges, to an average of 0.87, as shown in Fig. 1b. In Fig. 1c we observe that the online spectral learning effectively learns the parameters involved. The error in this figure is the total error in estimating the confusion matrices and the true class distribution, i.e. ‖p_g − p̂_g‖_1 + (1/M) Σ_{i∈[M]} ‖Ĉ_i − C_i‖_1. Similar results are observed for the other dataset in Figs. 2a-2c.

Pareto Region: We study the Pareto regions of the two ensembles in Fig. 1d and Fig. 2d.
The sizes of the circles in these figures depict the logarithm of the average queue length for a given (λ_ex, α_th) pair. We see that for higher arrival rates, the average queue lengths start increasing at smaller thresholds, and vice versa. Furthermore, we observe that the Pareto region of the first ensemble is smaller than that of the second ensemble. This is consistent with more accurate ensembles supporting larger Pareto regions.

Figure 1: Performance of Ensemble-1 on the 3-Group dataset. Panels: (a) Backlog, (b) Accuracy, (c) Estimation Error, (d) Capacity.

Figure 2: Performance of Ensemble-2 on the 5-Group dataset. Panels: (a) Backlog, (b) Accuracy, (c) Estimation Error, (d) Capacity.

7. Related Work

Our work belongs to the family of techniques derived from the Dawid-Skene model (Dawid & Skene, 1979), which performs unsupervised ensemble learning (UEL) using an expectation-maximization (EM) based algorithm. The need for UEL appears in fields ranging from anthropology (Romney et al., 1986) to medical diagnostics (Zhou et al., 2009). Most recently, UEL has been of interest to the crowdsourcing community (Snow et al., 2008; Sheng et al., 2008; Whitehill et al., 2009; Raykar et al., 2010). This has led to the development of various approaches, such as weighted majority voting (Li et al., 2013), message passing (Karger et al., 2011; 2014; Ok et al., 2016), and spectral-method-based estimation (Zhang et al., 2014; Jain & Oh, 2014; Jaffe et al., 2016), all of which have provable guarantees on classification error and on learning the parameters defining the ensemble and the data.

We have developed an adaptive task-routing technique with latent labels. In this sense, our work is most similar to (Massoulié & Xu, 2016; Shah et al., 2017), but with some important distinctions. The setting in (Massoulié & Xu, 2016) allows for resampling of classifiers, whereas our ensemble is fixed. In (Shah et al., 2017), the departure decisions are not made by the ensemble, whereas in our approach both the departure and labeling decisions are made by the ensemble (thus statistically correlating the acquired labels and the departures). Finally, unlike (Massoulié & Xu, 2016; Shah et al., 2017), our technique requires no a priori knowledge of the classifiers or the incoming data.

Acknowledgments

This work is supported by ARO grant W911NF-17-1-0359 and the US DoT supported D-STOP Tier 1 University Transportation Center.

References

Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. Tensor decompositions for learning latent variable models (a survey for ALT). In International Conference on Algorithmic Learning Theory, pp. 19-38. Springer, 2015.

Dawid, A. P. and Skene, A. M. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pp. 20-28, 1979.

Gopalan, A., Caramanis, C., and Shakkottai, S. On wireless scheduling with partial channel-state information. IEEE Transactions on Information Theory, 58(1):403-420, 2012.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Jaffe, A., Fetaya, E., Nadler, B., Jiang, T., and Kluger, Y. Unsupervised ensemble learning with dependent classifiers. In Artificial Intelligence and Statistics, pp. 351-360, 2016.

Jain, P. and Oh, S. Learning mixtures of discrete product distributions using spectral decompositions. In Conference on Learning Theory, pp. 824-856, 2014.

Karger, D. R., Oh, S., and Shah, D. Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems, pp. 1953-1961, 2011.
Karger, D. R., Oh, S., and Shah, D. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1-24, 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.

Li, H., Yu, B., and Zhou, D. Error rate bounds in crowdsourcing models. arXiv preprint arXiv:1307.2674, 2013.

Massoulié, L. and Xu, K. On the capacity of information processing systems. In Conference on Learning Theory, pp. 1292-1297, 2016.

Neely, M. J., Modiano, E., and Rohrs, C. E. Dynamic power allocation and routing for time-varying wireless networks. IEEE Journal on Selected Areas in Communications, 23(1):89-103, 2005.

Ok, J., Oh, S., Shin, J., and Yi, Y. Optimality of belief propagation for crowdsourced classification. In International Conference on Machine Learning, pp. 535-544, 2016.

Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297-1322, 2010.

Romney, A. K., Weller, S. C., and Batchelder, W. H. Culture as consensus: A theory of culture and informant accuracy. American Anthropologist, 88(2):313-338, 1986.

Shah, V., Gulikers, L., Massoulié, L., and Vojnović, M. Adaptive matching for expert systems with uncertain task types. In Communication, Control, and Computing (Allerton), 2017 55th Annual Allerton Conference on, pp. 753-760. IEEE, 2017.

Shaham, U., Cheng, X., Dror, O., Jaffe, A., Nadler, B., Chang, J., and Kluger, Y. A deep learning approach to unsupervised ensemble learning. In International Conference on Machine Learning, pp. 30-39, 2016.

Sheng, V. S., Provost, F., and Ipeirotis, P. G. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 614-622. ACM, 2008.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Snow, R., O'Connor, B., Jurafsky, D., and Ng, A. Y. Cheap and fast, but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 254-263. Association for Computational Linguistics, 2008.

Stolyar, A. L. Maximizing queueing network utility subject to stability: Greedy primal-dual algorithm. Queueing Systems, 50(4):401-457, 2005.

Tassiulas, L. and Ephremides, A. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE Transactions on Automatic Control, 37(12):1936-1948, 1992.

Whitehill, J., Wu, T.-F., Bergsma, J., Movellan, J. R., and Ruvolo, P. L. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, pp. 2035-2043, 2009.

Zhang, Y., Chen, X., Zhou, D., and Jordan, M. I. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems, pp. 1260-1268, 2014.
Zhou, X.-H., McClish, D. K., and Obuchowski, N. A. Statistical Methods in Diagnostic Medicine, volume 569. John Wiley & Sons, 2009.