# A POMDP Formulation of Proactive Learning

Kyle Hollins Wray and Shlomo Zilberstein
College of Information and Computer Sciences
University of Massachusetts, Amherst, MA 01003, USA
{wray,shlomo}@cs.umass.edu

## Abstract

We cast the Proactive Learning (PAL) problem, i.e., Active Learning (AL) with multiple reluctant, fallible, cost-varying oracles, as a Partially Observable Markov Decision Process (POMDP). The agent selects an oracle at each time step to label a data point while it maintains a belief over the true underlying correctness of its current dataset's labels. The goal is to minimize labeling costs while considering the value of obtaining correct labels, thus maximizing the accuracy of the final resultant classifier. We prove three properties showing that our particular formulation leads to a structured and bounded-size set of belief points, enabling strong performance of point-based methods to solve the POMDP. Our method is compared with the original three algorithms proposed by Donmez and Carbonell and a simple baseline. We demonstrate that our approach matches or improves upon the original approach within five different oracle scenarios, each on two datasets. Finally, our algorithm provides a general, well-defined mathematical foundation to build upon.

## Introduction

Active Learning (AL) techniques capture the process of taking an unlabeled dataset and labeling a selected subset by querying an omniscient oracle for labels (Cohn, Atlas, and Ladner 1994). In practice, however, active learning makes strong assumptions regarding the labeling process. Specifically, real-world applications often involve multiple oracles, each of which may be reluctant to answer, may answer incorrectly, and may have data point-sensitive costs subject to a fixed budget (Attenberg and Provost 2011). Proactive Learning (PAL) captures all of these properties in a formal problem domain (Donmez and Carbonell 2008a). We present a Partially Observable Markov Decision Process (POMDP) solution for this inherently sequential optimization problem.

Within AL, multi-oracle (Ipeirotis et al. 2014; Yan et al. 2011), imprecise-oracle (Golovin and Krause 2011; Ipeirotis et al. 2014), and cost-varying-oracle (Culotta and McCallum 2005; Golovin and Krause 2011) scenarios have been explored separately in depth. Also, for single-oracle AL, MDP (Lizotte, Madani, and Greiner 2003), POMDP (Jaulmes, Pineau, and Precup 2005), and other related methods (Golovin and Krause 2011) have been devised. These do not capture the full combination of multiple, fallible, reluctant, and cost-varying oracles found in the realistic PAL domain. As such, their MDP-like states, actions, observations, and transition functions differ markedly from ours. The distinct, but related, preference elicitation problem seeks to perform as few queries as possible while maximizing the belief about a user's preferences. POMDPs have been successfully applied here; however, they do not maintain a belief over a dataset, model oracles, or manage a budget (Boutilier 2002).

Our primary contribution is an algorithm that maps a PAL problem directly to the true underlying sequential optimization problem using a POMDP. To the best of our knowledge, no one has proposed such a POMDP solution to PAL.
We state and prove three propositions regarding the belief points and horizon required to rapidly produce high-quality solutions for the PAL POMDP using point-based methods. Additionally, we provide experimental evidence demonstrating that our approach either meets or exceeds the performance of the original algorithms and a random baseline.

The next section formally defines PAL, POMDPs, and our algorithm. Then, we provide a rigorous theoretical analysis of the point-based algorithm used to solve our PAL POMDP. Next, we present our experiments and discuss our findings. Finally, we conclude with a summary of our contributions.

## An Automated Planning Approach

We formalize the general proactive learning problem and propose a general POMDP framework for solving it. To the best of our knowledge, neither has been previously formulated in this manner.

### Proactive Learning Definition

Proactive learning, originally proposed by Donmez and Carbonell (2008a), considered four relaxations to active learning, and their approach handled each of them separately. We state the problem in its most general form, describing all four simultaneously within one problem domain.

The Proactive Learning (PAL) problem is a tuple $\langle X, Y_l, O, P_r, P_c, C \rangle$. $X = X_l \cup X_u$ is a dataset of $d$ data points (zero-indexed), where $X_l$ denotes the $d_l$ labeled data points with corresponding labels $Y_l$, and $X_u$ denotes the $d_u$ unlabeled data points. $O$ is a set of $m$ oracles. $P_r : X \times O \rightarrow [0, 1]$ denotes the probability that an oracle will respond for a data point. $P_c : X \times O \rightarrow [0, 1]$ denotes the probability that an oracle's response is correct for a data point. These probabilities may be given (e.g., guarantees by oracles), estimated (e.g., via cluster centroids), or a mix of both. $C : X \times O \rightarrow \mathbb{R}$ is a cost function for an oracle labeling a data point.

The true underlying objective in PAL is to maximize classifier accuracy while minimizing cost. Unfortunately, we cannot directly measure classifier accuracy, given that we do not necessarily have any labeled data for comparison. Thus, the objective of a PAL algorithm is to maximize a measure of the expected information gain, subject to a budget constraint $\beta \in \mathbb{R}_+$. Formally, let $S \subseteq X \times O$ be a set of samples of data point-oracle pairs, and let $\mathcal{V} : S \rightarrow \mathbb{R}$ denote any value-of-information metric. Our optimization problem is to maximize $E[\sum_{s \in S} \mathcal{V}(s)]$ subject to $\sum_{\langle x, o \rangle \in S} C(x, o) \leq \beta$.

### PAL Algorithm Initialization

As stated above, the probabilities $P_r$ and $P_c$ are obtained in one of two ways: (1) initially given, or (2) acquired using a clustering method. We experiment with both scenarios.

In many proactive learning domains, these probabilities are given. For example, consider medical domains that require a company oracle to conduct lab work in order to label a data point. Any contract with the company to do this work will clearly state estimates for duration, and thus oracle reluctance, as well as probabilistic guarantees regarding the quality of the labels, and thus oracle correctness. Naturally, costs are provided for each oracular company.

Other domains require a pre-processing step to obtain estimates of these probabilities. We employ a clustering method similar to the original PAL algorithms (Donmez and Carbonell 2008a) and others (Wallace et al. 2011). In summary, we are given an initial clustering budget $\beta_c < \beta$.
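To keep this notation concrete before describing the clustering step, here is a minimal sketch of the PAL problem as a data structure, in Python. This is our own illustration, not the authors' implementation; all names and signatures are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class PALProblem:
    """A PAL instance <X, Y_l, O, P_r, P_c, C> together with a labeling budget beta."""
    X_l: List[Sequence[float]]        # labeled data points
    Y_l: List[int]                    # their labels
    X_u: List[Sequence[float]]        # unlabeled data points
    oracles: List[str]                # the oracle set O
    P_r: Callable[[int, str], float]  # P_r(data point index, oracle) -> response probability
    P_c: Callable[[int, str], float]  # P_c(data point index, oracle) -> correctness probability
    C: Callable[[int, str], float]    # C(data point index, oracle) -> query cost
    beta: float                       # total budget
    spent: float = 0.0                # cost accumulated so far

    def within_budget(self, x: int, o: str) -> bool:
        """Check the budget constraint sum of C(x, o) <= beta before issuing a query."""
        return self.spent + self.C(x, o) <= self.beta
```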
Using this clustering budget, we run k-means, obtain $k \propto \beta_c$ data points closest to the cluster centroids, query the oracles to obtain labels, and train a classifier on the labeled data points obtained, yielding parameters $\hat{w}$. Then, for each unlabeled data point $x_i \in X_u$, we now have $\Pr(y_i \mid x_i, \hat{w})$. Using this and a distance metric $d(x_i) \in [0, 1]$ from the closest cluster centroid, denoted $x_c$ with label $y_c$ (possibly undefined, denoted $\emptyset$), we compute our probabilities for oracle $o \in O$:

$$P_r(x_i, o) = \sigma\big((2[y_c \neq \emptyset] - 1)(1 - d(x_i))\big) \qquad (1)$$

$$P_c(x_i, o) = \sigma\big((2 \max_{y \in Y} \Pr(y \mid x_i, \hat{w}) - 1)(1 - d(x_i))\big) \qquad (2)$$

with the standard sigmoid function $\sigma(\cdot)$ and Iverson brackets $[\cdot]$. This assumes (strongly) that if an oracle does not respond for a data point, then it is likely not to respond for nearby data points.

### POMDP Definition

A POMDP is represented by the tuple $\langle S, A, \Omega, T, O, R \rangle$ (Smallwood and Sondik 1973; Sondik 1978; Kaelbling, Littman, and Cassandra 1998). $S$ is a set of $n$ states, $A$ is a set of $m$ actions, and $\Omega$ is a set of $z$ observations. $T$ is a state transition function that captures the stochastic Markovian state transitions after each action is taken, with $T : S \times A \times S \rightarrow [0, 1]$ such that $T(s, a, s') = \Pr(s' \mid s, a)$. $O$ is an observation function that stochastically presents an observation to the agent based on the action performed and the true underlying state that resulted from that action, with $O : A \times S \times \Omega \rightarrow [0, 1]$ such that $O(a, s', \omega) = \Pr(\omega \mid a, s')$. Finally, $R : S \times A \rightarrow \mathbb{R}$ is a function mapping state-action pairs to rewards such that $R(s, a) \in \mathbb{R}$.

The sequential optimization process considers a number of time steps called the horizon $h$. Infinite-horizon ($h = \infty$) POMDPs can be approximated using a finite horizon ($h \in \mathbb{N}$) or solved directly using a variety of methods (Amato, Bernstein, and Zilberstein 2007). The overall objective is to maximize the cumulative reward over the problem horizon, where rewards are discounted by a discount factor $\gamma \in [0, 1]$ per time step.

The decision maker, or agent, must select actions without knowing the true underlying state of the system. Instead, it maintains a belief over the possible states, denoted $b \in \triangle^n$, or a set of $r$ belief points over the standard $n$-simplex $\triangle^n$, denoted $B \subseteq \triangle^n$. At every time step, the agent takes action $a$ at belief $b$ and makes observation $\omega$. This updates the belief over all possible successor states $s'$ following:

$$b'(s' \mid b, a, \omega) = \eta \, O(a, s', \omega) \sum_{s \in S} T(s, a, s') \, b(s) \qquad (3)$$

with normalizing constant $\eta = \Pr(\omega \mid b, a)^{-1}$ (Kaelbling, Littman, and Cassandra 1998). This belief is a sufficient statistic for the entire history of actions and observations the agent has taken. For notational brevity, we often denote $b' = [b'(s_1 \mid b, a, \omega), \ldots, b'(s_n \mid b, a, \omega)]^T$.

A policy $\pi : B \rightarrow A$ maps beliefs to actions. Given a policy, we define a value function $V : B \rightarrow \mathbb{R}$ denoting the expected reward at each belief. Value functions are piecewise linear and convex (Smallwood and Sondik 1973). We represent them by a set of $\alpha$-vectors $\Gamma = \{\alpha_1, \ldots, \alpha_r\}$ such that each $\alpha_i = [V(s_1), \ldots, V(s_n)]^T$, with the value of each state denoted $V(s_i)$. A policy is defined by attaching an action to each $\alpha$-vector, compactly denoting $V(b) = \alpha_i \cdot b$ and $\pi(b) = a_{\alpha_i} \in A$. We may write the value function for an initial belief $b$, following policy $\pi$, at horizon $h$ as:

$$V_\pi^h(b) = E\Big[\sum_{t=0}^{h-1} \gamma^t R(b_t, \pi_t(b_t)) \,\Big|\, b_0 = b, \pi\Big]$$

The objective is to select a policy that maximizes the expected reward earned over time.
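To make the belief update in Equation 3 concrete, the following is a minimal sketch in Python. It is our own illustration, not the authors' implementation; the array shapes and names are assumptions.

```python
import numpy as np

def belief_update(b, a, omega, T, O):
    """One step of the POMDP belief update (Equation 3).

    b     : (n,) current belief over states
    a     : int, index of the action taken (oracle queried)
    omega : int, index of the observation received
    T     : (n, |A|, n) array, T[s, a, s'] = Pr(s' | s, a)
    O     : (|A|, n, |Omega|) array, O[a, s', w] = Pr(w | a, s')
    """
    # Unnormalized successor belief: O(a, s', w) * sum_s T(s, a, s') b(s)
    b_next = O[a, :, omega] * (b @ T[:, a, :])
    # Normalize by Pr(omega | b, a); guard against an impossible observation
    z = b_next.sum()
    return b_next / z if z > 0 else b_next

# Tiny example: two states, one action, two observations
T = np.zeros((2, 1, 2)); T[0, 0] = [0.3, 0.7]; T[1, 0] = [0.0, 1.0]
O = np.zeros((1, 2, 2)); O[0, 0] = [0.9, 0.1]; O[0, 1] = [0.2, 0.8]
print(belief_update(np.array([1.0, 0.0]), a=0, omega=1, T=T, O=O))
```

In the PAL POMDP defined below, $\Omega$ has only two elements (the oracle responded or it did not), so this update remains cheap.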
The optimal policy selects the maximizing $\alpha$-vector at each belief point (Kaelbling, Littman, and Cassandra 1998):

$$V^t(b) = \max_{a \in A} \Big[ b \cdot r_a + \sum_{\omega \in \Omega} \max_{\alpha \in \Gamma^{t-1}} \sum_{s \in S} b(s) \, V^t_{s a \omega \alpha} \Big] \qquad (4)$$

with $r_a = [R(s_1, a), \ldots, R(s_n, a)]^T$ and:

$$V^t_{s a \omega \alpha} = \gamma \sum_{s' \in S} O(a, s', \omega) \, T(s, a, s') \, \alpha(s')$$

Let the initial $\alpha$-vectors be $\alpha(s) = \underline{R} / (1 - \gamma)$, with $\underline{R} = \min_{s \in S} \min_{a \in A} R(s, a)$, for all $s \in S$. This guarantees that the $\alpha$-vectors weakly monotonically increase (Lovejoy 1991).

### The Proactive Learning POMDP Model

We begin by ordering the data points in $X_u$ so that, during policy execution, we initially select the most informative data points to label, progressively selecting less informative ones at each time step until we run out of either data points or, more likely, the allotted budget. The order follows by selecting the oracle which will receive the highest utility, then selecting the data point which yields the highest utility given this fixed oracle. Formally, for each $x \in X_u$ and $o \in O$, let $\hat{U}(x, o) = P_r(x, o) P_c(x, o) \mathcal{V}(x, o) / C(x, o)$ be the utility. Given the current ordered set $X_o^{t-1} = \{x_1, \ldots, x_{t-1}\}$, the oracle and data point for step $t$, denoted $o_t$ and $x_t$, are:

$$o_t = \operatorname*{argmax}_{o \in O} \max_{x \in X_u \setminus X_o^{t-1}} \hat{U}(x, o)$$

$$x_t = \operatorname*{argmax}_{x \in X_u \setminus X_o^{t-1}} \hat{U}(x, o_t)$$

Importantly, this procedure only selects the order, which is connected to the POMDP state structure below. Hereafter, we assume $X_u$ is reassigned to the ordering of $X_o^{d_u}$.

We propose a mapping from a PAL problem to a POMDP. Figure 1 provides a visual explanation of the mapping, in addition to an example representation of a policy.

*Figure 1: Example of a PAL POMDP with $d_u = 3$ data points. States denote the current dataset's correct and incorrect labelings, as well as whether the last query's oracle responded. Arrows denote the non-zero probabilistic state transitions, with duplicates for each specific action (oracle query) omitted for clarity. Boxes visually represent an example policy, mapping a range of beliefs regarding the true correctness of the dataset to an oracle selection action.*

The high-level idea is to represent the process of constructing a correctly labeled dataset as a sequential optimization problem using a POMDP. States in the POMDP capture the quality of the labelings within the dataset. As such, they include both the number of correctly and incorrectly labeled data points. In order to properly adjust the policy based on oracle responses, we also record in the state whether the previous oracle responded. Formally, for $d_u$ unlabeled data points,

$$S = \{\langle c, i, r \rangle \mid c, i \in \mathbb{N}, r \in \{T, F\}, 0 \leq c + i < d_u\} \cup \{\langle c, i, T \rangle \mid c, i \in \mathbb{N}, c + i = d_u\}.$$

For example, $\langle 3, 2, F \rangle \in S$ means the dataset has 3 correct labels, 2 incorrect labels, and there was no response from the previously queried oracle. The final dataset's correctness states do not need $F$, since the dataset is complete upon reaching the end.

Actions within our model directly correspond to which oracle should be queried, i.e., $A = O$. Observations simply indicate whether we observed an oracle response, i.e., $\Omega = \{T, F\}$. This simplified set of observations works due to the important structure of the subsequent belief update, which takes into consideration the state transition function $T$ and the observation function $O$ (detailed below). These two probability functions, combined with this definition of observations, yield exactly the desired result: certainty about the size of the labeled dataset and the previous oracle response, and uncertainty about the true labelings within the current labeled dataset.
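As an illustration of this state space, here is a minimal enumeration sketch in Python; it is our own illustration of the definition above, and the function name is hypothetical.

```python
from itertools import product

def build_state_space(d_u):
    """Enumerate S = {<c, i, r>} for the PAL POMDP with d_u unlabeled points.

    States with c + i < d_u carry a response flag r in {True, False};
    terminal states with c + i == d_u only need r = True.
    """
    states = []
    for c, i in product(range(d_u + 1), repeat=2):
        if c + i < d_u:
            states.append((c, i, True))
            states.append((c, i, False))
        elif c + i == d_u:
            states.append((c, i, True))
    return states

# With d_u = 3 (as in Figure 1) there are (d_u + 1)^2 = 16 states;
# actions are the oracle indices and observations are simply {True, False}.
S = build_state_space(3)
assert len(S) == (3 + 1) ** 2
```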
Beliefs over this state space are then mapped to optimal actions, i.e., the selection of the best oracle to label the next data point.

The state transition function $T(s, a, s')$, for states $s, s' \in S$ and action $a \in A$, captures both the probability of an oracle responding and the probability that it successfully labels the data point. Formally, for $s = \langle c, i, r \rangle$ and $s' = \langle c', i', r' \rangle$:

$$T(s, a, s') = \begin{cases}
1 - P_r(x_{c+i}, a) & \text{if } c = c',\ i = i',\ r' = F \\
P_r(x_{c+i}, a) \, P_c(x_{c+i}, a) & \text{if } c + 1 = c',\ i = i',\ r' = T \\
P_r(x_{c+i}, a) \, (1 - P_c(x_{c+i}, a)) & \text{if } c = c',\ i + 1 = i',\ r' = T \\
1 & \text{if } c = c',\ i = i',\ c + i = d_u \\
0 & \text{otherwise}
\end{cases} \qquad (5)$$

The observation function $O(a, s', \omega)$, for action $a \in A$, successor $s' \in S$, and observation $\omega \in \Omega$, only needs to inform the agent whether the oracle responded. For $s' = \langle c', i', r' \rangle$:

$$O(a, s', \omega) = [\omega = r'] \qquad (6)$$

Lastly, the reward function $R(s, a)$, for state $s \in S$ and action $a \in A$, is built upon the original utility function proposed by Donmez and Carbonell (2008a), which essentially uses the value of information divided by the cost. For the value of the information gained by labeling a data point $x \in X_u$, denoted $\mathcal{V}(x)$, we use the same uncertainty-weighted density score as Donmez and Carbonell. This metric assumes that data points within the same region are relatively clustered and share the same label. Thus, we first define a neighborhood of indexes $N_x$ over $x$ that are within some threshold distance $\tau$: $N_x = \{i \in \{0, \ldots, d_u - 1\} \mid \|x - x_i\| < \tau\}$. $\mathcal{V}(x)$ is defined in Equation 7 below, with weights $\hat{w}$ and entropy $H(y_k \mid x_k, \hat{w})$ using the probability of labeling data point $x_k$ as $y_k$ following the model learned so far (see the initialization section) (Donmez and Carbonell 2008b):

$$\mathcal{V}(x) = \sum_{k \in N_x} \exp(-\|x - x_k\|^2) \, H(y_k \mid x_k, \hat{w}) \qquad (7)$$

The reward weighs $\mathcal{V}$, $C$, and a ratio of correctly versus incorrectly labeled points, in addition to a penalty factor $\epsilon < 1$ applied only if the agent selects a reluctant oracle when it just failed to receive a label. Formally, for state $s = \langle c, i, r \rangle$ and action $a$:

$$R(s, a) = \frac{c + 1}{i + 1} \cdot \frac{\mathcal{V}(x_{c+i})}{C(x_{c+i}, a)} \, \epsilon(s, a)$$

with $\epsilon(s, a) = \epsilon$ only if $P_r(x_{c+i}, a) < 1$ and $r = F$, as described above; $\epsilon(s, a) = 1$ otherwise.

**Algorithm 1** Proactive Learning POMDP: initially unknown oracles; thus, it requires clustering.

Require: $\langle X_u, X_l, Y_l \rangle$: the unlabeled and labeled datasets.
Require: $O$: the set of oracles that may be queried.
Require: $\langle \beta_c, \beta \rangle$: the initial clustering budget and the entire budget.

1: $\langle P_r, P_c, C \rangle \leftarrow \text{init\_pal}(X_u, X_l, Y_l, O, \beta_c, \beta)$
2: $\langle S, A, \Omega, T, O, R \rangle \leftarrow \text{init\_pomdp}(X_u, O, P_r, P_c, C)$
3: $\pi \leftarrow \text{solve}(S, A, \Omega, T, O, R)$
4: $b(s) \leftarrow [s = \langle 0, 0, T \rangle],\ \forall s \in S$
5: $i \leftarrow 0$
6: $c_t \leftarrow \beta_c$
7: while $c_t < \beta$ and $X_u \neq \emptyset$ do
8:   $\langle y, c \rangle \leftarrow \text{query}(x_i, \pi(b))$
9:   $\omega \leftarrow F$
10:   if $y \neq \emptyset$ then
11:     $\langle X_u, X_l, Y_l \rangle \leftarrow \langle X_u \setminus \{x_i\}, X_l \cup \{x_i\}, Y_l \cup \{y\} \rangle$
12:     $i \leftarrow i + 1$
13:     $\omega \leftarrow T$
14:   end if
15:   $b(s) \leftarrow b'(s \mid b, \pi(b), \omega),\ \forall s \in S$
16:   $c_t \leftarrow c_t + c$
17: end while
18: return $\langle X_u, X_l, Y_l \rangle$

The entire PAL POMDP procedure is detailed in Algorithm 1. The functions $\text{init\_pal}(\cdot)$ and $\text{init\_pomdp}(\cdot)$ implement the previous two sections, respectively. The function $\text{solve}(\cdot)$ solves the POMDP, returning a policy $\pi$, using Point-Based Value Iteration (PBVI) (Pineau, Gordon, and Thrun 2003). Note another Iverson bracket on Line 4. The function $\text{query}(\cdot)$ queries an oracle for a data point's label.

A variant of Algorithm 1 reorders all future data points upon each successful query, after updating $P_r$ and $P_c$ (Equations 1 and 2). We then re-solve this modified POMDP at every step. Note that the current belief $b$ over the dataset does not change, since the ordering of prior data points remains fixed. This process is obviously computationally expensive.
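To make the mapping concrete, the following is a minimal sketch of how Equations 5 and 6 could be assembled into arrays for a generic POMDP solver, reusing the state enumeration sketched earlier. This is our own illustrative Python, not the authors' implementation; the function names and signatures are assumptions.

```python
import numpy as np

def build_pal_pomdp_dynamics(states, m, P_r, P_c, d_u):
    """Assemble the transition and observation arrays of Equations 5 and 6.

    states : list of (c, i, r) tuples, e.g. from build_state_space(d_u)
    m      : number of oracles (the action set A = O)
    P_r    : function (data point index, oracle index) -> response probability
    P_c    : function (data point index, oracle index) -> correctness probability
    """
    n = len(states)
    T = np.zeros((n, m, n))   # T[s, a, s'] = Pr(s' | s, a)
    O = np.zeros((m, n, 2))   # O[a, s', w], w in {0: no response, 1: response}
    for si, (c, i, r) in enumerate(states):
        for a in range(m):
            x = c + i  # index of the next data point in the fixed ordering
            for sj, (c2, i2, r2) in enumerate(states):
                if c + i == d_u:                          # terminal: absorbing self-loop
                    T[si, a, sj] = 1.0 if (c2, i2) == (c, i) else 0.0
                elif (c2, i2, r2) == (c, i, False):       # oracle did not respond
                    T[si, a, sj] = 1.0 - P_r(x, a)
                elif (c2, i2, r2) == (c + 1, i, True):    # responded, correct label
                    T[si, a, sj] = P_r(x, a) * P_c(x, a)
                elif (c2, i2, r2) == (c, i + 1, True):    # responded, incorrect label
                    T[si, a, sj] = P_r(x, a) * (1.0 - P_c(x, a))
    for a in range(m):
        for sj, (_, _, r2) in enumerate(states):
            O[a, sj, 1 if r2 else 0] = 1.0                # Equation 6: [w = r']
    return T, O
```

Each row $T[s, a, :]$ sums to one, which is a quick sanity check on the mapping; the reward of Equation 7 would be filled in analogously from $\mathcal{V}$, $C$, and $\epsilon$. The re-solving variant described above simply rebuilds these arrays with the updated $P_r$ and $P_c$ and calls the solver again.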
This re-solving cost is acceptable in practice: domains with a long real-world time delay for oracle queries (e.g., biological experiments) leave plenty of time to re-solve the POMDP.

## Theoretical Analysis and Optimizations of Point-Based Value Iteration

Exact solutions to POMDPs require defining a policy tree. These trees have a height equal to the horizon $h$, with each node's branching factor equal to the number of possible observations $z$. Each node in this tree is assigned an action to take given its history of observations. In practice, this makes exact solutions intractable for anything but the smallest POMDPs; in fact, solving POMDPs is PSPACE-hard (Papadimitriou and Tsitsiklis 1987). Instead, point-based methods were developed, which operate over a fixed set of belief points, with additional methods for intelligently adding new belief points to the set (Pineau, Gordon, and Thrun 2003). We briefly describe the point-based method, then establish three properties of our PAL POMDP that enable us to define $B$ and $h$ so as to quickly produce high-quality solutions.

### Point-Based POMDP Solvers

Point-Based Value Iteration (PBVI) was proposed by Pineau, Gordon, and Thrun (2003). It operates over the set of $\alpha$-vectors $\Gamma^t$, updating each one in the set at each time step $t$. Over these time steps, we only focus on a fixed set of belief points $B$. This is commonly defined using the intermediate variables $\Gamma_b$ and $\Gamma_{a\omega}$:

$$\Gamma^t_{a\omega} = \big\{ [V^t_{s_1 a \omega \alpha}, \ldots, V^t_{s_n a \omega \alpha}]^T,\ \forall \alpha \in \Gamma^{t-1} \big\}$$

$$\Gamma^t_b = \Big\{ r_a + \sum_{\omega \in \Omega} \operatorname*{argmax}_{\alpha \in \Gamma^t_{a\omega}} \alpha \cdot b,\ \forall a \in A \Big\},\ \forall b \in B \qquad (8)$$

$$\Gamma^t = \Big\{ \operatorname*{argmax}_{\alpha \in \Gamma^t_b} \alpha \cdot b,\ \forall b \in B \Big\} \qquad (9)$$

### Theoretical Analysis

The quality of the solution returned by this approach depends heavily on which belief points are chosen for $B$. Common algorithms explore the policy tree (i.e., reachable belief points) to define a $B$ which best represents the belief space, in order to obtain the highest (and thus most accurate) values. Our specific PAL POMDP enables us to state three strong properties regarding the reachable belief points and the horizon required for PBVI.

First, we define some helpful variables. Let $b^0 \in \triangle^n$ be the initial belief from Algorithm 1, Line 4. Let $\hbar = \langle a^0, \omega^1, a^1, \omega^2, \ldots, a^{h-1}, \omega^h \rangle$ be any history. Let $b^h \in \triangle^n$ be the resultant belief from applying Equation 3 at each time step of history $\hbar$, starting with the initial belief $b^0$. Finally, let $S_{d\omega} = \{s \in S \mid s = \langle c, i, r \rangle,\ c + i \neq d \vee r \neq \omega\}$ be the set of states which are not at data point index $d$ or did not have previous oracle response $\omega$ (or both).

Our first proposition describes the structure of our belief points following any history of actions and observations. Importantly, it means that the agent's uncertainty about the true state is only over the accuracy of the dataset it has built so far.

**Proposition 1** (Guarantee: Belief Is Always Over Dataset Correctness). For any history $\hbar$, with $d_h = |\{i \in \{1, \ldots, h\} \mid \omega^i = T\}|$, if $s \in S_{d_h \omega^h}$, then $b^h(s) = 0$.

*Proof.* By induction on $h$.

Base Case: $h = 0$. Thus, $b^0 = b^h$, and with $s^0 = \langle 0, 0, T \rangle$ we have $d_h = 0$ and an implicit $\omega^0 = T$. This yields $S_{d_0 \omega^0} = S \setminus \{\langle 0, 0, T \rangle\}$. By Algorithm 1, Line 4, $b^0(s) = 0$ for all $s \neq \langle 0, 0, T \rangle$, i.e., for all $s \in S_{d_0 \omega^0}$. Thus, the base case holds.

Inductive Step: Assume the claim holds for $h - 1$ (the induction hypothesis); we must show that, for $h$ and for all $s' \in S_{d_h \omega^h}$ (given by $\hbar$), $b^h(s') = 0$. Given our history $\hbar$, and thus $a^{h-1}$ and $\omega^h$, we apply Equation 3 to $b^{h-1}$ and prove that $b^h$ equals zero for all $s' \in S_{d_h \omega^h}$:

$$b^h(s' \mid b^{h-1}, a^{h-1}, \omega^h) = \eta \, O(a^{h-1}, s', \omega^h) \sum_{s \in S} T(s, a^{h-1}, s') \, b^{h-1}(s)$$

There are two cases (by the definition of $S_{d_h \omega^h}$): (1) $r' \neq \omega^h$, and (2) $c' + i' \neq d_h$.

Case 1: $r' \neq \omega^h$. Thus, by Equation 6, we have $O(a^{h-1}, s', \omega^h) = 0$. Therefore, $b^h(s') = 0$.

Case 2: $c' + i' \neq d_h$.
We assume $r' = \omega^h$, since otherwise $b^h(s') = 0$ by Case 1. Thus, $O(a^{h-1}, s', \omega^h) = 1$, and we eliminate the terms in which $b^{h-1}$ is zero (by the induction hypothesis):

$$b^h(s' \mid b^{h-1}, a^{h-1}, \omega^h) = \eta \sum_{s \in S} T(s, a^{h-1}, s') \, b^{h-1}(s) = \eta \sum_{s \in S \setminus S_{d_{h-1} \omega^{h-1}}} T(s, a^{h-1}, s') \, b^{h-1}(s)$$

We must show that, for all remaining $s = \langle c, i, r \rangle \in S \setminus S_{d_{h-1} \omega^{h-1}}$, either $T$ or $b^{h-1}$ is zero. By the definition of $S_{d_{h-1} \omega^{h-1}}$, we know $c + i = d_{h-1}$ and $r = \omega^{h-1}$. By the definition of $S_{d_h \omega^h}$, we know $c' + i' \neq d_h$ or $r' \neq \omega^h$. By the definition of $d_h$, we also know $d_{h-1} \leq d_h$, and since $c' + i' \neq d_h$ it follows that $d_{h-1} < d_h$ must hold. This, in turn, implies $\omega^h = T$ by the definition of $d_h$.

There are three possible sub-cases: (1) $c' + i' < d_{h-1}$, (2) $c' + i' > d_h$, and (3) $c' + i' = d_{h-1}$.

By Equation 5, sub-case (1) has $T(s, a^{h-1}, s') = 0$, since $c' + i' < d_{h-1} = c + i$ implies none of the four non-zero conditions can occur; informally, it describes transitioning to a state with fewer data points. For sub-case (2), $c' + i' > d_h > d_{h-1} = c + i$, which again implies none of the four non-zero conditions can occur; informally, it describes skipping data points, because $s$ and $s'$ would be at least two data point indexes away from one another. Lastly, sub-case (3) implies $c' + i' = d_{h-1} = c + i$. Since $r' = \omega^h = T$, again none of the four non-zero conditions can occur, namely condition one (because $r' = T$) and condition four (because $c + i = d_{h-1} < d_h \leq d_u$); informally, it describes a contradiction in which the agent receives a response but does not transition. In each sub-case, $T(s, a^{h-1}, s') = 0$, thus $b^h(s') = 0$.

Cases 1 and 2 are proven: $b^h(s') = 0$ for all $s' \in S_{d_h \omega^h}$. The Base Case and Inductive Step are proven; thus, by induction, for any history $\hbar$ and for all $s \in S_{d_h \omega^h}$, $b^h(s) = 0$. ∎

The second proposition has four consequences. First, it establishes a bound on the number of potentially non-zero values in any belief point vector. Second, this bound is proportional to the number of data points, i.e., the square root of the number of states, so that as the POMDP grows, the maximal number of non-zero values in a belief point becomes asymptotically smaller relative to the state space. Third, the size of the POMDP itself grows quadratically with the number of data points, not exponentially. Lastly, applying this insight enables us to compute dot products with the belief point vectors $b$ over only the potentially non-zero elements (as in Equations 4, 8, and 9). We found this to vastly improve the performance of PBVI.

**Proposition 2** (Bound: Reachable Belief Size; POMDP Size). For any history $\hbar$ and its belief $b^h \in \triangle^n$, the number of non-zero elements in $b^h$ is bounded: $|\{s \in S \mid b^h(s) > 0\}| \leq d_u + 1 = \sqrt{n}$.

*Proof.* By Proposition 1, if $s \in S_{d_h \omega^h}$, then $b^h(s) = 0$. Thus:

$$|\{s \in S \mid b^h(s) > 0\}| = |\{s \in S \setminus S_{d_h \omega^h} \mid b^h(s) > 0\}| \leq |S \setminus S_{d_h \omega^h}| = |\{\langle c, i, r \rangle \in S \mid c + i = d_h \wedge r = \omega^h\}| \leq d_h + 1 \leq d_u + 1$$

To relate this to $n$, define a function which maps a data point index $k \in \{0, \ldots, d_u\}$ to the number of states with that index: $f(k) = |\{\langle c, i, r \rangle \in S \mid c + i = k\}|$. By the definition of the state space $S$:

$$n = \sum_{k=0}^{d_u} f(k) = 2(1 + 2 + \cdots + d_u) + (d_u + 1) = d_u(d_u + 1) + (d_u + 1) = (d_u + 1)^2$$

Therefore, $|\{s \in S \mid b^h(s) > 0\}| \leq d_u + 1 = \sqrt{n}$. ∎

Our third proposition determines the horizon $h$ to select for PBVI in order to guarantee that $h$ is large enough to capture the entire labeling process.

**Proposition 3** (Bound: Required Horizon). Let $\underline{P_r} = \min_{x \in X_u} \min_{o \in O} P_r(x, o)$. For a horizon $h \geq (1 - \epsilon) d_u / \underline{P_r}$, the state $s^h = \langle c_h, i_h, r_h \rangle \in S$ will have $c_h + i_h = d_u$ with probability $1 - \epsilon$.

*Proof.* Given a history $\hbar$, the action sequence induces a Markov process with respect to the true underlying states (with $s^0 = \langle 0, 0, T \rangle$). The probabilities $\Pr(s^{t+1} \mid s^t, a^t)$ vary following Equation 5.
Taking the lower bound $\underline{P_r}$ over all time steps (possible data points) and oracles (possible actions) reduces the process to a binomial distribution $Z \sim \text{Binomial}(h, \underline{P_r})$, in which the number of successes maps to the data point index, up to $d_u$. We apply Markov's inequality in order to determine $h$:

$$1 - \epsilon = \Pr(Z \geq d_u) \leq \frac{E[Z]}{d_u} = \frac{h \, \underline{P_r}}{d_u} \;\Rightarrow\; h \geq (1 - \epsilon) \frac{d_u}{\underline{P_r}} \qquad \blacksquare$$

## Experimentation

We create experiments similar to those of Donmez and Carbonell for two of the well-known UCI datasets they used: Adult and Spambase (Lichman 2013). Adult contains 32544 people, each with 14 features; the objective is to classify which individuals make over $50K a year. Spambase contains 4601 emails, each with 57 features; the objective is to classify which emails are spam.

We consider five oracle scenarios: (1-3) Original #1-#3 and (4-5) Complex #1-#2. In all cases, we vary the budget $\beta$ from 3 to 21 with an interval of 3, randomly sample 100-300 data points for $X_u$, and use the remaining data points, up to 10000, for our test set. We use a support vector machine for the final classifier, training with the proactively learned $X_l$ and $Y_l$. Importantly, we only select the top 40 points for our POMDP's $X_u$ due to their complexity ($n = (d_u + 1)^2 = 1681$). If budget remains after 40 labelings, then we execute another POMDP with the next 40 points, and so on, until the budget is exhausted. Each configuration of dataset, oracle scenario, and budget is run 10 times and averaged to produce the final accuracy results shown in Figure 2. PBVI uses horizon $2 d_u$ (increasing it only improves performance) and discount factor $0.9$. Our implementation uses Python 3.4.3 with scikit-learn 0.16.1, NumPy 1.9.2, and SciPy 0.15.1, run on an Intel(R) Core(TM) i7-4702HQ CPU at 2.20GHz with 8GB of RAM and an Nvidia(R) GeForce GTX 870M. We leverage a high-performing GPU-based implementation of PBVI using CUDA 6.5 (Wray and Zilberstein 2015a; 2015b).

We compare our algorithm with the three original decision-theoretic algorithms, designed for reluctant, fallible, and cost-varying oracles and denoted PAL #1, #2, and #3, respectively (Donmez and Carbonell 2008a). PAL #1 selects oracles using a utility of $P_r(x, o) \mathcal{V}(x) / C_{round}$, for data point $x$ and oracle $o$; $C_{round}$ handles reluctant oracles by aggregating the cost spent so far trying to label the same data point, so that preference shifts away from an oracle that is continually queried but fails to respond. PAL #2 selects oracles using a utility of $P_c(x, o) \mathcal{V}(x) / C(x, o)$. PAL #3 selects oracles using a utility of $\mathcal{V}(x) - C(x, o)$. We also include a random baseline.

In Original #1-#3, we re-implement the original three scenarios and their oracles using the process described by Donmez and Carbonell. Original #1 has a normal and a reluctant oracle, Original #2 has a normal and a fallible oracle, and Original #3 has a normal and a cost-varying oracle. These three scenarios assume $P_r$ and $P_c$ are unknown, requiring a clustering step with $\beta_c = \min\{5, \beta/2\}$. The normal and other oracles cost 0.5 and 0.125, respectively.

We observe that our method has similar or improved performance in comparison with PAL #1-#3. This is to be expected, since both the PAL POMDP and the PAL #$i$ algorithm were designed to handle the oracle within scenario Original #$i$. This also supports the accuracy of our implementation of the original PAL #1-#3, reproducing their above-baseline performance. Furthermore, it demonstrates that the clustering approach properly estimates $P_r$ and $P_c$ (initially unknown) for use with our algorithm. These scenarios, however, are quite simple, serving as a first comparison of the individual PAL #1-#3 and our PAL POMDP.
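For concreteness, the three baseline selection rules described above can be sketched as follows. This is our own illustrative Python, not the original implementation; the value-of-information function `V`, the probabilities, the costs, and `C_round` are assumed to be supplied by the caller.

```python
from functools import partial

def pal1_utility(x, o, V, P_r, C_round):
    """PAL #1 (reluctant oracle): response probability times value of
    information, over the cost accumulated so far trying to label this point."""
    return P_r(x, o) * V(x) / C_round(x, o)

def pal2_utility(x, o, V, P_c, C):
    """PAL #2 (fallible oracle): correctness probability times value, over cost."""
    return P_c(x, o) * V(x) / C(x, o)

def pal3_utility(x, o, V, C):
    """PAL #3 (cost-varying oracle): value of information minus cost."""
    return V(x) - C(x, o)

def select_query(X_u, O, utility):
    """Greedily pick the next (data point, oracle) pair under one rule.
    `utility` is a two-argument function of (x, o), for example
    partial(pal2_utility, V=V, P_c=P_c, C=C)."""
    return max(((x, o) for x in X_u for o in O),
               key=lambda xo: utility(xo[0], xo[1]))
```

In contrast to these one-step rules, the PAL POMDP policy weighs response, correctness, and cost jointly over the remaining horizon, which is what separates the methods in the Complex scenarios below.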
In Complex #1-#2, we seek to compare the algorithms more directly by removing the noise introduced by the initial clustering methods that estimate the probabilities, since all algorithms share that initial clustering step regardless. (Let $\langle \mathcal{X}, \mathcal{Y}, \mathcal{Z} \rangle$ denote $P_r(x, o) \in \mathcal{X}$, $P_c(x, o) \in \mathcal{Y}$, and $C(x, o) \in \mathcal{Z}$ for all $x \in X$ and $o \in O$.) In Complex #1, we create four oracles randomly assigned $\langle [0.25, 1], [0.25, 1], [0.01, 1] \rangle$. This represents a realistic scenario with heavily mixed oracles, each with its own benefits and drawbacks for different data points. Complex #1 on both Adult and Spambase strongly shows the capability of our method. Finally, in Complex #2, we create three oracles: $\langle \{1\}, \{1\}, \{1\} \rangle$, $\langle \{0.5\}, \{0.01\}, \{0.01\} \rangle$, and $\langle \{0.01\}, \{0.5\}, \{0.01\} \rangle$. This extreme example illustrates our PAL POMDP algorithm's success in solving the true sequential optimization problem, which weighs all oracle information, as well as the drawbacks of the previous algorithms. It is evident from both datasets that the previous algorithms select either the severely fallible or the severely reluctant oracles. Our approach outperforms PAL #1-#3 as well as the random baseline, demonstrating overall that it builds upon the original algorithms and is able to select the correct oracle each time.

*Figure 2: Mean accuracies with 95% confidence intervals for Adult (left) and Spambase (right) with five oracle scenarios: Original #1-#3 and Complex #1-#2 (top to bottom). Complex #1-#2's algorithms are marked the same as Original #1-#3 (legend omitted for clarity).*

## Conclusion

We propose a general solution for proactive learning which casts the problem as a POMDP, taking into account the observed history in order to optimally select oracles and construct a high-quality labeled dataset. This approach simultaneously captures all aspects of proactive learning (multiple reluctant, fallible, cost-varying oracles) under one unified mathematical framework. We provide three propositions which assure that point-based methods will quickly produce quality policies for large datasets. Our algorithm is compared with the original three PAL algorithms, as well as a simple baseline, on two distinct datasets. We explore five oracle configurations, demonstrating that our algorithm builds upon the original three, outperforming them when presented with realistic reluctant, fallible, and cost-varying oracles.

In future work, we hope to introduce richer aspects of oracles (e.g., responses) and of data point selection (e.g., relabeling), and to integrate this approach into other domains (e.g., robotics). Our model's data point ordering $\hat{U}$, clustering step, and reward function components are purposefully built from aspects of the original PAL algorithms (Donmez and Carbonell 2008a); considering different methods within these components is fertile ground for future exploration. Additionally, our experiments are purposefully designed to match those of the original paper. This guarantees the accuracy of our implementations and allows for comparison with more difficult scenarios (Complex #1-#2). We will investigate additional application domains with this solid foundation established. Another extension of our work is to use a constrained POMDP to explicitly incorporate the budget; however, this alone presents numerous interesting facets that can be explored beyond the scope of this paper. Finally, we will provide our source code so that others may more easily build upon this work to design and implement high-caliber proactive learners.

## Acknowledgments

We thank the anonymous reviewers for their helpful comments.
Also, we thank Luis Pineda for his valuable comments. This work was supported in part by the National Science Foundation under grant number IIS-1405550.

## References

Amato, C.; Bernstein, D. S.; and Zilberstein, S. 2007. Solving POMDPs using quadratically constrained linear programs. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2418-2424.

Attenberg, J., and Provost, F. 2011. Inactive learning? Difficulties employing active learning in practice. ACM SIGKDD Explorations Newsletter 12(2):36-41.

Boutilier, C. 2002. A POMDP formulation of preference elicitation problems. In Proceedings of the 18th AAAI Conference on Artificial Intelligence, 239-246.

Cohn, D.; Atlas, L.; and Ladner, R. 1994. Improving generalization with active learning. Machine Learning 15(2):201-221.

Culotta, A., and McCallum, A. 2005. Reducing labeling effort for structured prediction tasks. In Proceedings of the 20th AAAI Conference on Artificial Intelligence, 746-751.

Donmez, P., and Carbonell, J. G. 2008a. Proactive learning: Cost-sensitive active learning with multiple imperfect oracles. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM), 619-628.

Donmez, P., and Carbonell, J. G. 2008b. Paired sampling in density-sensitive active learning. In Proceedings of the 10th International Symposium on Artificial Intelligence and Mathematics (ISAIM).

Golovin, D., and Krause, A. 2011. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research (JAIR) 42:427-486.

Ipeirotis, P. G.; Provost, F.; Sheng, V. S.; and Wang, J. 2014. Repeated labeling using multiple noisy labelers. Data Mining and Knowledge Discovery 28(2):402-441.

Jaulmes, R.; Pineau, J.; and Precup, D. 2005. Active learning in partially observable Markov decision processes. In Proceedings of the 16th European Conference on Machine Learning (ECML), 601-608.

Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101:99-134.

Lichman, M. 2013. UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.

Lizotte, D. J.; Madani, O.; and Greiner, R. 2003. Budgeted learning of naive-Bayes classifiers. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI), 378-385.

Lovejoy, W. S. 1991. Computationally feasible bounds for partially observed Markov decision processes. Operations Research 39(1):162-175.

Papadimitriou, C. H., and Tsitsiklis, J. N. 1987. The complexity of Markov decision processes. Mathematics of Operations Research 12(3):441-450.

Pineau, J.; Gordon, G.; and Thrun, S. 2003. Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI), 1025-1032.

Smallwood, R. D., and Sondik, E. J. 1973. The optimal control of partially observable Markov processes over a finite horizon. Operations Research 21(5):1071-1088.

Sondik, E. J. 1978. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research 26(2):282-304.

Wallace, B. C.; Small, K.; Brodley, C. E.; and Trikalinos, T. A. 2011. Who should label what? Instance allocation in multiple expert active learning. In Proceedings of the SIAM International Conference on Data Mining (SDM), 176-187.

Wray, K. H., and Zilberstein, S. 2015a. Multi-objective POMDPs with lexicographic reward preferences. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 1719-1725.
Wray, K. H., and Zilberstein, S. 2015b. A parallel point-based POMDP algorithm leveraging GPUs. In AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents (SDMIA), 95-96.

Yan, Y.; Rosales, R.; Fung, G.; and Dy, J. G. 2011. Active learning from crowds. In Proceedings of the 28th International Conference on Machine Learning (ICML), 1161-1168.