Optimal Decision Tree with Noisy Outcomes

Su Jia (Carnegie Mellon University) sjia1@andrew.cmu.edu
Fatemeh Navidi (University of Michigan) navidi@umich.edu
Viswanath Nagarajan (University of Michigan) viswa@umich.edu
R. Ravi (Carnegie Mellon University) ravi@andrew.cmu.edu

(Su Jia and Fatemeh Navidi contributed equally to this work. Research of F. Navidi and V. Nagarajan partly supported by NSF grant CCF-1750127.)

Abstract

A fundamental task in active learning involves performing a sequence of tests to identify an unknown hypothesis that is drawn from a known distribution. This problem, known as optimal decision tree, has been widely studied for decades, and an asymptotically best-possible approximation algorithm has been devised for it. We study a generalization where certain test outcomes are noisy, even in the more general case when the noise is persistent, i.e., repeating a test gives the same noisy output, so simple repetition cannot be used to gain confidence. We design new approximation algorithms for both the non-adaptive setting, where the test sequence must be fixed a priori, and the adaptive setting, where the test sequence depends on the outcomes of prior tests. Previous work in the area assumed at most a logarithmic number of noisy outcomes per hypothesis and provided approximation ratios that depended on parameters such as the minimum probability of a hypothesis. Our new approximation algorithms provide guarantees that are nearly best possible and work for the general case of a large number of noisy outcomes per test or per hypothesis, where the performance degrades smoothly with this number. Our results adapt and generalize methods used for submodular ranking and stochastic set cover. We evaluate the performance of our algorithms on two natural applications with noise: toxic chemical identification and active learning of linear classifiers. Despite our theoretical logarithmic approximation guarantees, our methods give solutions with cost very close to the information-theoretic minimum, demonstrating their effectiveness.

1 Introduction

The classic optimal decision tree (ODT) problem involves identifying an initially unknown hypothesis $\bar{x}$ that is drawn from a known probability distribution over a set of $m$ possible hypotheses. We can perform tests in order to distinguish between these hypotheses. Each test produces a binary outcome (positive or negative), and the precise outcome of each hypothesis-test pair is known beforehand. (We consider binary test outcomes only for simplicity: our results also hold for finitely many outcomes.) So an instance of ODT can be viewed as a $\pm1$ matrix with the hypotheses as rows and tests as columns. The goal is to identify hypothesis $\bar{x}$ using the minimum number of tests in expectation.

As a motivating application, consider the following task in medical diagnosis [25]. A doctor needs to diagnose a patient's disease by performing tests. Given an a priori probability distribution over possible diseases, what sequence of tests should the doctor perform? Another application is in active learning [10]. Given a set of data points, one wants to learn a classifier that labels the points correctly as positive and negative. There is a set of $m$ possible classifiers, which is assumed to contain the true classifier.
In the Bayesian setting, which we consider, the true classifier is drawn from some known probability distribution. The goal is to identify the true classifier by querying labels at the minimum number of points (in expectation). Other applications include entity identification in databases [6] and experimental design to choose the most accurate theory among competing candidates [14].

An important issue that is not considered in the classic ODT model is that of unknown or noisy outcomes. In fact, our research was motivated by a dataset involving toxic chemical identification, where the outcomes of many hypothesis-test pairs are stated as unknown (one of our experimental results is also on this dataset). Prior work incorporating noise in ODT [14] is restricted to settings with very sparse noise. In this paper, we design approximation algorithms for the noisy optimal decision tree problem in full generality.

We consider a standard model for persistent noise. Certain outcomes (i.e., entries in the hypothesis-test matrix) are random with a known distribution: for simplicity we treat each noisy outcome as an unbiased $\pm1$ random variable. Our results extend directly to the case when each noisy outcome has a different probability of being $+1$. Persistent noise means that repeating the same test always produces the same $\pm1$ outcome. We assume that the instance is identifiable, i.e., a unique hypothesis can always be identified irrespective of the noisy outcomes. (This assumption can be relaxed: see Section 5.)

We consider both non-adaptive policies (where the test sequence is fixed upfront) and adaptive policies (where the test sequence is built incrementally and depends on observed test outcomes). Clearly, adaptive policies perform at least as well as non-adaptive ones. (There are also instances where the relative gap between the best adaptive and non-adaptive policies is $\Omega(m)$.) However, non-adaptive policies are very simple to implement (requiring minimal incremental computation) and may be preferred in time-sensitive applications. Our main contributions are:

- an $O(\log m)$-approximation algorithm for non-adaptive ODT with noise;
- an $O(\min(h,r)+\log m)$-approximation algorithm for adaptive ODT with noise, where $h$ (resp. $r$) is the maximum number of noisy outcomes for any hypothesis (resp. test);
- an $O(\log m)$-approximation algorithm for adaptive ODT with noise when every test has at least $m-O(\sqrt m)$ noisy outcomes;
- experimental results on applications to toxic chemical identification and active learning.

We note that both the non-adaptive and adaptive versions (even for the usual ODT) generalize the set cover problem, so an $\Omega(\log m)$ approximation ratio is the best possible (unless P=NP).

Related Work. The optimal decision tree problem (without noise) has been extensively studied for several decades [11, 20, 25, 24, 1, 2, 7, 18]. The state-of-the-art result [18] is an $O(\log m)$-approximation for instances with arbitrary probability distributions and costs. It is also known that ODT cannot be approximated to a factor better than $O(\log m)$, unless P=NP [6]. The application of ODT to Bayesian active learning was formalized in [10]. There are also several results on the statistical complexity of active learning, e.g. [4, 19, 29], where the focus is on proving bounds for structured hypothesis classes. In contrast, we consider arbitrary hypothesis classes and obtain computationally efficient policies with provable approximation bounds relative to the optimal (instance-specific) policy. This approach is similar to that in [10, 15, 12, 14, 9, 22]. The noisy ODT problem was studied previously in [14].
Using a connection to adaptive submodularity [12], they obtained an $O(\log^2\frac{1}{p_{\min}})$-approximation algorithm for noisy ODT in the presence of very few noisy outcomes; here $p_{\min}\le\frac1m$ is the minimum probability of any hypothesis. (The paper [14] states the approximation ratio as $O(\log\frac{1}{p_{\min}})$ because it relied on an erroneous claim in [12]; the correct approximation ratio, based on [28, 13], is $O(\log^2\frac{1}{p_{\min}})$.) In particular, the running time of the algorithm in [14] is exponential in the number of noisy outcomes per hypothesis, which is polynomial only if this number is at most logarithmic in the number of hypotheses/tests. Our result provides the following improvements: (i) the running time is polynomial irrespective of the number of noisy outcomes, and (ii) the approximation ratio is better by at least one logarithmic factor. We note that a better $O(\log m)$ approximation ratio (still only for very sparse noise) follows from subsequent work on the equivalence class determination problem by [9]. For this setting, our result is also an $O(\log m)$ approximation, but the algorithm is simpler. More importantly, ours is the first result that can handle any number of noisy outcomes.

Other variants of noisy ODT have also been considered, e.g. [27, 5, 8], where the goal is to identify the correct hypothesis with at least some target probability. The theoretical results in [8] provide bicriteria approximation bounds where the algorithm has a larger error probability than the optimal policy. Our setting is different because we require zero probability of error.

Many algorithms for ODT (including ours) rely on some underlying submodularity properties. We briefly survey some background results. The basic submodular cover problem was first considered by [31], who proved that the natural greedy algorithm is a $(1+\ln\frac1\epsilon)$-approximation algorithm, where $\epsilon$ is the minimal positive marginal increment of the function. [3] obtained an $O(\log\frac1\epsilon)$-approximation algorithm for the submodular ranking problem, which involves simultaneously covering multiple submodular functions; [21] extended this result to also handle costs. [23] studied an adaptive version of the submodular ranking problem. We utilize results and techniques from these papers.

Finally, we note that there is also work on minimizing the worst-case (instead of average-case) cost in ODT and active learning [26, 30, 16, 17]. These results are incomparable to ours because we are interested in the average case, i.e., minimizing expected cost.

2 Problem Definition

We start by formally defining the optimal decision tree problem with noise (ODTN). There is a set of $m$ possible hypotheses with a probability distribution $\{\pi_x\}_{x=1}^m$, from which an unknown hypothesis $\bar{x}$ is drawn. There is also a set $U=[n]$ of binary tests. Each test $e\in U$ is associated with a 3-way partition $T^+(e), T^-(e), T^*(e)$ of the hypotheses, where the test outcome is (a) positive if $\bar{x}$ lies in $T^+(e)$, (b) negative if $\bar{x}\in T^-(e)$, and (c) positive or negative with probability $\frac12$ each if $\bar{x}\in T^*(e)$ (these are the noisy outcomes). We assume that, conditioned on $\bar{x}$, the noisy outcomes are independent. We use $r_x(e)$ to denote the part of test $e$ that hypothesis $x$ lies in, i.e.,

$$ r_x(e) = \begin{cases} +1 & \text{if } x\in T^+(e) \\ -1 & \text{if } x\in T^-(e) \\ * & \text{if } x\in T^*(e). \end{cases} $$

While we know the 3-way partition $T^+(e), T^-(e), T^*(e)$ for each test $e\in U$ upfront, we are not aware of the actual outcomes for the noisy hypothesis-test pairs.
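To make the setup concrete, here is a minimal sketch (our own illustration, not the authors' released code) of one way to represent an ODTN instance and simulate persistent noise; the class and method names are ours:

```python
import random
from dataclasses import dataclass, field

STAR = "*"  # symbol for a noisy (unknown) hypothesis-test entry

@dataclass
class ODTNInstance:
    prior: list          # pi_x for each hypothesis x in [m]
    table: list          # table[x][e] in {+1, -1, STAR}: encodes (T+(e), T-(e), T*(e))
    _memo: dict = field(default_factory=dict)  # persistent-noise cache

    def r(self, x, e):
        """r_x(e): +1, -1, or STAR, read off the known 3-way partition."""
        return self.table[x][e]

    def run_test(self, x, e, rng):
        """Observed outcome of test e when the realized hypothesis is x.
        A noisy entry is sampled once and cached, so repeating the test
        returns the same outcome (persistent noise)."""
        out = self.table[x][e]
        if out != STAR:
            return out
        if (x, e) not in self._memo:
            self._memo[(x, e)] = rng.choice([+1, -1])
        return self._memo[(x, e)]
```

For example, `ODTNInstance(prior=[1/3]*3, table=[[+1, STAR], [-1, +1], [-1, -1]])` is identifiable: every pair of rows disagrees deterministically on some test.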
It is assumed that the realized hypothesis $\bar{x}$ can be uniquely identified by performing all tests, regardless of the outcomes of the $*$-tests. This means that for every pair $x,y\in[m]$ of hypotheses, there is some test $e\in U$ with $x\in T^+(e)$ and $y\in T^-(e)$, or vice versa. The goal is to perform an adaptive (or non-adaptive) sequence of tests to identify hypothesis $\bar{x}$ using the minimum expected number of tests. Note that the expectation is taken over both the prior distribution of $\bar{x}$ and the random outcomes of the noisy tests for $\bar{x}$.

In our algorithms and analysis, it will be convenient to work with an expanded set of hypotheses $M$. For a binary vector $b\in\{\pm1\}^U$ and hypothesis $x\in[m]$, we say $b$ is consistent with $x$, denoted $b\sim x$, if $b_e=r_x(e)$ for each $e\in U$ with $r_x(e)\ne *$. Let $M=\{(b,x)\in\{\pm1\}^U\times[m] : b\sim x\}$, and let $M_x\subseteq M$ be all copies associated with a particular hypothesis $x\in[m]$; note that $\{M_x\}_{x=1}^m$ is a partition of $M$. Each expanded hypothesis $(b,x)\in M$ corresponds to the case where the true hypothesis $\bar{x}=x$ and the test outcomes are given by $b$. We assign probability $q_{b,x}=\pi_x/2^{h_x}$ to each $(b,x)\in M$, where $h_x$ is the number of $*$-tests for $x$. Note that, conditioned on $\bar{x}=x$, the probability of observing outcomes $b$ is exactly $2^{-h_x}$; so $\Pr[\bar{x}=x \text{ and test outcomes are } b]=q_{b,x}$. For any $(b,x)\in M$ and $e\in U$, define $r_{b,x}(e)=b_e$ to be the observed outcome of test $e$ if $\bar{x}=x$ and the test outcomes are $b$. For every expanded hypothesis $(b,x)\in M$ and test $e\in U$, define

$$ T_{b,x}(e) = \begin{cases} T^-(e) & \text{if } r_{b,x}(e)=+1 \\ T^+(e) & \text{if } r_{b,x}(e)=-1, \end{cases} \qquad (1) $$

which is the subset of (original) hypotheses that can definitely be ruled out based on test $e$ if $\bar{x}=x$ and the test outcomes are given by $b$. Note that hypotheses in $T^*(e)$ are never part of $T_{b,x}(e)$, as their outcome on test $e$ can be positive or negative (so they cannot be ruled out). For every hypothesis $(b,x)\in M$, define a monotone submodular function $f_{b,x}:2^U\to[0,1]$:

$$ f_{b,x}(S) = \frac{1}{m-1}\,\Big|\bigcup_{e\in S} T_{b,x}(e)\Big|, \qquad S\subseteq U, \qquad (2) $$

which equals the fraction of the $m-1$ hypotheses (excluding $x$) that have been ruled out based on the tests in $S$, if $\bar{x}=x$ and the test outcomes are given by $b$. Assuming $\bar{x}=x$ and test outcomes $b$, hypothesis $x$ is uniquely identified after tests $S$ if and only if $f_{b,x}(S)=1$.

A non-adaptive policy is specified by just a permutation of tests. The policy performs tests in this sequence and eliminates incompatible hypotheses until there is a unique compatible hypothesis (which is $\bar{x}$). Note that the number of tests performed under such a policy is still random (it depends on $\bar{x}$ and the outcomes of the noisy tests). An adaptive policy chooses tests incrementally, depending on prior test outcomes. The state of a policy is a tuple $(E,d)$, where $E\subseteq U$ is a subset of tests and $d\in\{\pm1\}^E$ denotes the observed outcomes on the tests in $E$. An adaptive policy is specified by a mapping $\Phi : 2^U\times\{\pm1\}^U\to U$ from states to tests, where $\Phi(E,d)$ is the next test to perform at state $(E,d)$. Equivalently, we can view a policy as a decision tree with nodes corresponding to states, labels at nodes representing the test performed at that state, and branches corresponding to the $\pm1$ outcome of the current test. As the number of states can be exponential, we cannot hope to specify arbitrary adaptive policies. Instead, we want implicit policies $\Phi$, where given any state $(E,d)$, the test $\Phi(E,d)$ can be computed efficiently. This implies that the total time taken under any outcome is polynomial. We note that an optimal policy $\Phi^*$ can be very complex and the map $\Phi^*(E,d)$ may not be efficiently computable. We will still compare the performance of our (efficient) policy to $\Phi^*$.
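Continuing the sketch above, definitions (1) and (2) translate directly into code; `ruled_out` and `f_value` are our own names:

```python
def ruled_out(table, b, e):
    """T_{b,x}(e) from (1): hypotheses whose deterministic outcome on test e
    disagrees with the observed outcome b[e] (this depends only on b, not x).
    STAR entries never match -b[e], so *-hypotheses are never ruled out."""
    return {y for y in range(len(table)) if table[y][e] == -b[e]}

def f_value(table, b, S):
    """f_{b,x}(S) from (2): fraction of the other m-1 hypotheses ruled out by S."""
    m = len(table)
    covered = set()
    for e in S:
        covered |= ruled_out(table, b, e)
    return len(covered) / (m - 1)
```

Here $x$ is identified exactly when `f_value` reaches 1.0; the consistent hypothesis $x$ itself is never counted, since its entry on each test is either $b_e$ or a $*$.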
In this paper, we consider the persistent noise model: repeating a test $e$ with $\bar{x}\in T^*(e)$ always produces the same outcome. An alternative model is non-persistent noise, where each run of test $e$ with $\bar{x}\in T^*(e)$ produces an independent random outcome. The persistent noise model is more appropriate for handling missing data. It also contains the non-persistent noise model as a special case (by introducing multiple tests with identical partitions). One can easily obtain an adaptive $O(\log^2 m)$-approximation for the non-persistent model using existing algorithms for noiseless ODT [7] and repeating each test $O(\log m)$ times. The persistent-noise model that we consider is much harder.

3 Non-Adaptive Algorithm

Our algorithm is based on a reduction to the submodular ranking problem [3], defined below.

Submodular Function Ranking (SFR). An instance of SFR consists of a ground set $U$ of elements and a collection of monotone submodular functions $\{f_1,\dots,f_m\}$, $f_x:2^U\to[0,1]$, with $f_x(\emptyset)=0$ and $f_x(U)=1$ for all $x\in[m]$. Additionally, there is a weight $w_x\ge0$ for each $x\in[m]$. A solution is a permutation of the elements $U$. Given any permutation $\sigma$ of $U$, the cover time of function $f$ is $C(f,\sigma) := \min\{t : f(\{\sigma(1),\dots,\sigma(t)\})=1\}$, where $\sigma(i)$ is the $i$th element in $\sigma$. In words, it is the earliest time at which the value of $f$ reaches the unit threshold. The goal is to find a permutation $\sigma$ of $U$ with minimum total cover time $\sum_{x\in[m]} w_x\, C(f_x,\sigma)$. We will use the following result:

Theorem 3.1 ([3]). There is an $O(\log\frac1\epsilon)$-approximation for SFR, where $\epsilon$ is the minimum marginal increment of any function.

The non-adaptive ODTN problem can be expressed as an instance of SFR as follows. The elements are the tests $U$. For each hypothesis-copy $(b,x)\in M$ there is a function $f_{b,x}$ (see (2)) with weight $q_{b,x}$. Based on the definition of these functions, the parameter $\epsilon=\frac{1}{m-1}$. To see the equivalence, note that a solution to non-adaptive ODTN is also a permutation $\sigma$ of $U$, and hypothesis $x$ is uniquely identified under outcome $(b,x)$ exactly when function $f_{b,x}$ reaches value one. Moreover, the objective of the ODTN problem is the expected number of tests in $\sigma$ needed to identify the realized hypothesis $\bar{x}$, which equals

$$ \sum_{x=1}^m \pi_x \sum_{b\,\sim\, x} 2^{-h_x}\, C_{b,x}(\sigma) \;=\; \sum_{(b,x)\in M} q_{b,x}\, C_{b,x}(\sigma), $$

where $C_{b,x}(\sigma)$ is the cover time of function $f_{b,x}$. It now follows that this SFR instance is equivalent to the non-adaptive ODTN instance. However, we cannot apply Theorem 3.1 directly to obtain an $O(\log m)$ approximation. This is because we have an exponential number of functions (note that $|M|$ can be exponential in $m$), which means that a direct implementation of the algorithm from [3] requires exponential time. Nevertheless, we show that a variant of the SFR algorithm can be used to obtain:

Theorem 3.2. There is an $O(\log m)$-approximation for non-adaptive ODTN.

The SFR algorithm [3] is a greedy-style algorithm that, at any point, having already chosen tests $E$, assigns to each test $e\in U\setminus E$ the score

$$ G_E(e) \;=\; \sum_{(b,x)\in M:\, f_{b,x}(E)<1} q_{b,x}\,\frac{f_{b,x}(E\cup\{e\})-f_{b,x}(E)}{1-f_{b,x}(E)} \;=\; \sum_{(b,x)\in M} q_{b,x}\,\Delta_E(b,x,e), \qquad (3) $$

$$ \Delta_E(b,x,e) \;=\; \begin{cases} \dfrac{f_{b,x}(E\cup\{e\})-f_{b,x}(E)}{1-f_{b,x}(E)} & \text{if } f_{b,x}(E)<1; \\ 0 & \text{otherwise,} \end{cases} \qquad (4) $$

where $\Delta_E(b,x,e)$ is the gain of test $e$ for the hypothesis-copy $(b,x)$. At each step, the algorithm chooses the test of maximum score. However, we do not know how to compute the score (3) in polynomial time. Instead, using the fact that $G_E(e)$ is the expectation of $\Delta_E(b,x,e)$ over the hypothesis-copies $(b,x)\in M$, we will show that we can obtain an approximate maximizer by sampling.
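For concreteness, here is a sketch (ours, building on `f_value` above) of the gain (4) and the exact score (3). Since $\sum_{(b,x)\in M} q_{b,x}=1$, the score is an expectation over copies, so sampling copies with probability $q_{b,x}$, i.e., drawing $x\sim\pi$ and filling the $*$-entries with fair coins, yields an unbiased estimate:

```python
def gain(table, b, E, e):
    """Delta_E(b, x, e) from (4): normalized marginal increment of f_{b,x}."""
    fE = f_value(table, b, E)
    if fE >= 1.0:
        return 0.0
    return (f_value(table, b, E | {e}) - fE) / (1.0 - fE)

def score_exact(copies, table, E, e):
    """G_E(e) from (3) over explicitly enumerated copies [(q_{b,x}, b), ...].
    |M| can be exponential in general, which is exactly why the algorithm
    estimates this expectation by sampling copies instead."""
    return sum(q * gain(table, b, E, e) for (q, b) in copies)
```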
Moreover, Theorem 3.1 continues to hold when we choose a test with an approximately maximum score: this follows directly from the analysis in [21]. This sampling approach alone is still not sufficient, because it can fail when the value $G_E(e)$ is very small. A key observation is that when the score $G_E(e)$ is small for all tests $e$, then with high probability the already-performed tests $E$ uniquely identify hypothesis $\bar{x}$; hence the future tests do not affect the expected cover time by much. As the realized hypothesis $\bar{x}$ can always be identified uniquely, for every pair $x,y\in[m]$ of hypotheses there is a test where $x$ and $y$ have opposite outcomes (i.e., one is $+$ and the other $-$). So there is a set $L$ of at most $\binom{m}{2}$ tests such that any hypothesis will be uniquely identified by performing all the tests in $L$.

The non-adaptive ODTN algorithm (Non-Adap) involves two phases. In the first phase, we run the SFR algorithm using sampling to get estimates $\widehat{G}_E(e)$ of the scores $G_E(e)$ at each step; let $\bar{e}=\arg\max_{e\in U}\widehat{G}_E(e)$ denote the chosen test. If at some step the maximum sampled score is less than $m^{-5}$, then we go to the second phase, where we perform all the tests in $L$ and stop. The number of samples used to obtain each estimate is polynomial in $m$, so the overall runtime is polynomial. The complete proof can be found in the full version of this paper.

4 Adaptive Algorithms

Our adaptive algorithm chooses between two algorithms (ODTNr and ODTNh) based on the noise-sparsity parameters $h$ (maximum number of noisy outcomes per hypothesis) and $r$ (maximum number of noisy outcomes per test). Both algorithms maintain the posterior probability of each hypothesis based on the previous test outcomes, and use these probabilities to calculate a score for each test. The score of a test has two components: (i) a term that prioritizes splitting the candidate hypotheses in a balanced way, and (ii) terms that correspond to the expected number of hypotheses eliminated. We maintain the following information at each point in the algorithm: the already-performed tests $E\subseteq U$, the compatible hypotheses $H\subseteq[m]$, and a (posterior) probability $p_x$ for each $x\in H$. Given values $\{p_x : x\in[m]\}$, to reduce notation we use the shorthand $p(S)=\sum_{x\in S}p_x$ for any subset $S\subseteq[m]$. The main difference between the two algorithms lies in the metric used for component (i). First we discuss ODTNr.

Algorithm 1 (ODTNr)
  initially $E\leftarrow\emptyset$, $H\leftarrow[m]$ and $p_x\leftarrow\pi_x$ for all $x\in[m]$.
  while $|H|>1$ do
    for any test $e\in U$, let $L_e(H)$ be the smaller-cardinality set among $T^+(e)\cap H$ and $T^-(e)\cap H$;
    select the test $e\in U\setminus E$ that maximizes
    $$ p\big(L_e(H)\big) + \frac{|T^-(e)\cap H|}{|H|-1}\,p\big(T^+(e)\cap H\big) + \frac{|T^+(e)\cap H|}{|H|-1}\,p\big(T^-(e)\cap H\big) + \frac{|H\setminus T^*(e)|}{2(|H|-1)}\,p\big(T^*(e)\cap H\big); \qquad (5) $$
    if the outcome of test $e$ is $+$ then $H\leftarrow H\setminus T^-(e)$; else $H\leftarrow H\setminus T^+(e)$;
    $E\leftarrow E\cup\{e\}$ and update $p_x\leftarrow p_x/2$ for all $x\in T^*(e)$.
  end while

Theorem 4.1. Algorithm 1 is an $O(r+\log m)$-approximation algorithm for adaptive ODTN, where $r$ is the maximum number of noisy outcomes per test.
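One selection step of ODTNr, sketched in code (ours; `STAR` and the table representation are from the earlier sketch, and `p` is indexable by hypothesis):

```python
def odtn_r_score(table, p, H, e):
    """Score (5) for test e, given compatible hypotheses H and posteriors p[x]."""
    Tp = [x for x in H if table[x][e] == +1]
    Tm = [x for x in H if table[x][e] == -1]
    Ts = [x for x in H if table[x][e] == STAR]
    small = Tp if len(Tp) <= len(Tm) else Tm        # L_e(H): smaller-cardinality side
    n = len(H) - 1                                  # |H| > 1 inside the while loop
    return (sum(p[x] for x in small)
            + len(Tm) / n * sum(p[x] for x in Tp)   # eliminated on a '+' outcome
            + len(Tp) / n * sum(p[x] for x in Tm)   # eliminated on a '-' outcome
            + (len(Tp) + len(Tm)) / (2 * n) * sum(p[x] for x in Ts))  # *-hypotheses

def apply_outcome(table, p, H, e, outcome):
    """Loop body of Algorithm 1: halve p_x for every x in T*(e), then prune H."""
    for x in range(len(table)):
        if table[x][e] == STAR:
            p[x] /= 2                # posteriors kept unnormalized, as in the update rule
    return [x for x in H if table[x][e] != -outcome]
```

The first term of (5) rewards balanced splits, while the remaining terms measure the expected fraction of $H$ eliminated, matching components (i) and (ii) above.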
The high-level idea is to view any ODTN instance $\mathcal{I}$ as a suitable instance $\mathcal{J}$ of adaptive submodular ranking (ASR). Then we use and modify an existing framework for analyzing ASR from [23].

An equivalent ASR instance $\mathcal{J}$. This involves the expanded hypothesis set $M$, where each hypothesis $(b,x)\in M$ occurs with probability $q_{b,x}=\pi_x/2^{h_x}$. Each hypothesis $(b,x)$ is also associated with (i) the submodular function $f_{b,x}:2^U\to[0,1]$ and (ii) a feedback function $r_{b,x}:U\to\{+,-\}$, where $r_{b,x}(e)$ is the outcome of test $e$ under hypothesis $(b,x)$. The goal in the ASR instance is to adaptively select a subset $S\subseteq U$ such that $f_{b,x}(S)=1$ for the realized hypothesis $(b,x)$. The objective is to minimize the expected cost $\mathbb{E}[|S|]$.

Lemma 4.2. The ASR instance $\mathcal{J}$ is equivalent to the ODTN instance $\mathcal{I}$.

Now we present an algorithm for the ASR instance $\mathcal{J}$ that we will show is equivalent to running Algorithm 1 on instance $\mathcal{I}$. Crucially, this ASR algorithm is almost identical to the one studied in prior work [23], so we can essentially re-use the analysis from that paper to prove our bound. Recall that the expanded hypotheses are $M=\cup_{x=1}^m M_x$, where $M_x$ are all the copies of hypothesis $x\in[m]$. To reduce notation, we use $q(S)=\sum_{(b,x)\in S}q_{b,x}$ for any subset $S\subseteq M$. Also note that hypothesis $(b,x)$ is covered when $f_{b,x}(E)=1$, which implies identifying hypothesis $x\in[m]$.

Algorithm 2 (for the ASR instance $\mathcal{J}$)
  initially $E\leftarrow\emptyset$, $H'\leftarrow M$.
  while $H'\ne\emptyset$ do
    $H\leftarrow\{x\in[m] : M_x\cap H'\ne\emptyset\}$;
    for any test $e\in U$, let $L'_e(H') = \{(b,x)\in H' : x\in L_e(H)\} = H'\cap\bigcup_{x\in L_e(H)} M_x$;
    select the test $e\in U\setminus E$ that maximizes
    $$ q\big(L'_e(H')\big) + \sum_{(b,x)\in H'} q_{b,x}\,\frac{f_{b,x}(E\cup\{e\})-f_{b,x}(E)}{1-f_{b,x}(E)}; \qquad (6) $$
    remove incompatible and covered hypotheses from $H'$ based on the feedback from $e$;
    $E\leftarrow E\cup\{e\}$.
  end while

We now prove the equivalence between Algorithms 1 and 2. The state of either algorithm is represented by the set $E$ of performed tests along with their outcomes.

Lemma 4.3. The decision tree produced by Algorithm 1 on $\mathcal{I}$ is the same as that produced by Algorithm 2 on $\mathcal{J}$.

Based on Lemmas 4.2 and 4.3, in order to prove Theorem 4.1 it suffices to show that Algorithm 2 is an $O(r+\log m)$-approximation algorithm for the ASR instance $\mathcal{J}$. The proof is very similar to the analysis in [23], so we only provide an outline of the overall proof while emphasizing the differences. For $k=0,1,\dots$, define the following quantities:

- $A_k\subseteq M$ is the set of uncovered hypotheses in ALG at time $L\cdot 2^k$, and $a_k=q(A_k)$;
- $Y_k$ is the set of uncovered hypotheses in OPT at time $2^{k-1}$, and $y_k=q(Y_k)$.

Here $L=O(r+\log m)$. The key step is to show

$$ a_k \le 0.2\,a_{k-1} + 3\,y_k, \qquad \text{for all } k\ge1. \qquad (7) $$

As shown in [23], this implies an $O(L)$ approximation ratio.
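To give a sense of why (7) suffices, here is a brief sketch of the standard accounting (our paraphrase of [23]; the constants below are illustrative, not the paper's). Mass uncovered in ALG at time $L2^k$ pays for at most $L2^k$ further tests, while mass uncovered in OPT at time $2^{k-1}$ lower-bounds OPT's cost, so

$$ \mathbb{E}[\mathrm{ALG}] \le 2L + L\sum_{k\ge1} 2^k a_k \qquad\text{and}\qquad \mathbb{E}[\mathrm{OPT}] \ge \frac14\sum_{k\ge1} 2^{k-1} y_k. $$

Multiplying (7) by $2^k$ and summing over $k\ge1$ gives, with $S:=\sum_{k\ge1}2^k a_k$,

$$ S \le 0.4\,(a_0+S) + 6\sum_{k\ge1}2^{k-1}y_k \;\implies\; S \le \tfrac23\,a_0 + 10\sum_{k\ge1}2^{k-1}y_k \le \tfrac23 + 40\,\mathbb{E}[\mathrm{OPT}], $$

using $a_0\le1$. Since $\mathbb{E}[\mathrm{OPT}]\ge1$, we conclude $\mathbb{E}[\mathrm{ALG}] \le O(L)\cdot\mathbb{E}[\mathrm{OPT}]$.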
In order to prove (7), we use the quantity

$$ Z := \sum_{t=L\cdot2^{k-1}+1}^{L\cdot2^k}\; \sum_{(E,H')\in R(t)}\; \max_{e\in U\setminus E} \left( \sum_{(b,x)\in L'_e(H')} q_{b,x} + \sum_{(b,x)\in H'} q_{b,x}\,\frac{f_{b,x}(E\cup\{e\})-f_{b,x}(E)}{1-f_{b,x}(E)} \right). \qquad (8) $$

Above, $R(t)$ denotes the set of states $(E,H')$ that occur at time $t$ in ALG. Inequality (7) will be proved by separately lower- and upper-bounding $Z$.

Lemma 4.4 ([23]). We have $Z \ge L\,(a_k-3y_k)/3$.

Lemma 4.5. We have $Z \le a_{k-1}\,(1+\ln m+r+\log_2 m)$.

Proof. For any hypothesis $(b,x)\in A_{k-1}$ (i.e., uncovered in ALG by time $L\cdot2^{k-1}$), let $\sigma_{b,x}$ be the path traced by $(b,x)$ in ALG's decision tree, starting from time $L\cdot2^{k-1}$ and ending at time $L\cdot2^k$ or when $(b,x)$ gets covered. Recall that for any $L\cdot2^{k-1}<t\le L\cdot2^k$, any hypothesis in $H'$ for any state in $R(t)$ appears in $A_{k-1}$. So only hypotheses in $A_{k-1}$ can contribute to $Z$, and we can rewrite (8) as

$$ Z \le \sum_{(b,x)\in A_{k-1}} q_{b,x} \sum_{e\in\sigma_{b,x}} \left( \frac{f_{b,x}(E\cup\{e\})-f_{b,x}(E)}{1-f_{b,x}(E)} + \mathbf{1}\big[(b,x)\in L'_e(H')\big] \right) $$
$$ = \sum_{(b,x)\in A_{k-1}} q_{b,x} \left( \sum_{e\in\sigma_{b,x}} \frac{f_{b,x}(E\cup\{e\})-f_{b,x}(E)}{1-f_{b,x}(E)} + \sum_{e\in\sigma_{b,x}} \mathbf{1}\big[(b,x)\in L'_e(H')\big] \right). \qquad (9) $$

Above, for any $e\in\sigma_{b,x}$, we use $(E,H')$ to denote the state at which $e$ is selected. Fix any hypothesis $(b,x)\in A_{k-1}$. For the first term, we use Lemma 4.6 below and the definition of $\epsilon$. This implies $\sum_{e\in\sigma_{b,x}} \frac{f_{b,x}(E\cup\{e\})-f_{b,x}(E)}{1-f_{b,x}(E)} \le 1+\ln\frac1\epsilon \le 1+\ln m$, as the parameter $\epsilon\ge1/m$ for $f_{b,x}$. Now we bound the second term by proving the inequality

$$ \sum_{e\in\sigma_{b,x}} \mathbf{1}\big[(b,x)\in L'_e(H')\big] \le r + \log_2 m. \qquad (10) $$

To prove this inequality, consider the sets $H$ along the path. If hypothesis $(b,x)\in L'_e(H')$ when ALG selects test $e$, then $x\in L_e(H)$. Suppose $L_e(H)=T^+(e)\cap H$; the other case is identical. Let $D_e(H)=T^-(e)\cap H$ and $S_e(H)=T^*(e)\cap H$. As $x\in L_e(H)$, path $\sigma_{b,x}$ must follow the $+$ branch out of $e$. So the number of candidate hypotheses on this path after test $e$ is

$$ |L_e(H)|+|S_e(H)| \;\le\; \frac{|L_e(H)|}{2}+\frac{|D_e(H)|}{2}+|S_e(H)| \;=\; \frac{|H|}{2}+\frac{|S_e(H)|}{2} \;\le\; \frac{|H|}{2}+\frac{r}{2}. $$

The first inequality uses the definition of $L_e(H)$ (it is the smaller side), and the last inequality uses the bound of $r$ on the number of hypotheses with $*$ outcomes. Hence, each time that $(b,x)\in L'_e(H')$ along path $\sigma_{b,x}$, the number of candidate hypotheses changes as $|H_{\mathrm{new}}|\le\frac12|H_{\mathrm{old}}|+\frac{r}{2}$. This implies that after $\log_2 m$ such events, $|H|\le r$. Let $\sigma'_{b,x}$ denote the portion of path $\sigma_{b,x}$ after $|H|$ drops to at most $r$. Note that each time $(b,x)\in L'_e(H')$ we have $L_e(H)\ne\emptyset$, so $|H|$ decreases by at least one after each such test $e$. Hence $\sum_{e\in\sigma'_{b,x}}\mathbf{1}[(b,x)\in L'_e(H')]\le r$. As the portion of path $\sigma_{b,x}$ until $|H|\le r$ contributes at most $\log_2 m$ to the left-hand side of (10), the total is at most $r+\log_2 m$, as needed.

Lemma 4.6 ([3]). Let $f:2^U\to[0,1]$ be any monotone function with $f(\emptyset)=0$, and let $\epsilon=\min\{f(S\cup\{e\})-f(S) : e\in U,\ S\subseteq U,\ f(S\cup\{e\})-f(S)>0\}$. Then, for any sequence $\emptyset=S_0\subseteq S_1\subseteq\cdots\subseteq S_k\subseteq U$ of subsets, we have $\sum_{t=1}^k \frac{f(S_t)-f(S_{t-1})}{1-f(S_{t-1})} \le 1+\ln\frac1\epsilon$.

Setting $L=15\,(1+\ln m+r+\log_2 m)$ and applying Lemmas 4.4 and 4.5 completes the proof of (7), and hence of Theorem 4.1.

We now discuss algorithm ODTNh. This is based on directly applying the ASR algorithm from [23] to $\mathcal{J}$. The resulting algorithm is very similar to Algorithm 2: it changes the definition of $L'_e(H')$ in score (6) to be the smaller of the two sets $\{(b,x)\in H' : r_{b,x}(e)=+\}$ and $\{(b,x)\in H' : r_{b,x}(e)=-\}$. The main difference in the analysis is in proving the following inequality instead of inequality (10) in the proof of Lemma 4.5:

$$ \sum_{e\in\sigma_{b,x}} \mathbf{1}\big[(b,x)\in L'_e(H')\big] \le h + \log_2 m. \qquad (11) $$

This follows from the observation that each time $(b,x)\in L'_e(H')$ along path $\sigma_{b,x}$, the size $|H'|$ reduces by at least a factor $\frac12$ (note that initially $|H'|=|M|\le 2^h\,m$). Finally, by choosing the better of ODTNr and ODTNh, we obtain:

Theorem 4.7. There is an $O(\min(h,r)+\log m)$-approximation algorithm for adaptive ODTN, where $h$ (resp. $r$) is the maximum number of noisy outcomes per hypothesis (resp. test).

We note that our analysis is tight up to constant factors. In order to understand the dependence of the approximation ratio on the noise sparsity $\min(h,r)$, we study instances that have a very large number of noisy outcomes. Formally, we define an $\alpha$-sparse instance ($\alpha\le\frac12$) as follows: there is a constant $C$ such that $\max\{|T^+(e)|,|T^-(e)|\}\le C\,m^{\alpha}$ for all tests $e\in U$. Somewhat surprisingly, there is a very different algorithm (see the full version) that can actually take advantage of the large amount of noise:

Theorem 4.8. There is an adaptive $O(\log m)$-approximation for ODTN on $\alpha$-sparse instances with $\alpha\le\frac12$.

5 Non-identifiable Instances

We have assumed that for every pair $x,y$ of hypotheses, there is some test that distinguishes them deterministically. Without this assumption, we can still obtain similar results by slightly changing the stopping criterion. Define a similarity graph $G$ on $m$ nodes (corresponding to hypotheses), with an edge $(x,y)$ whenever there is no test separating $x$ and $y$ deterministically. Let $D_x$ denote the set containing $x$ and all its neighbors in $G$, for each $x\in[m]$. The neighborhood stopping criterion involves stopping when the set $H$ of compatible hypotheses is contained in some $D_x$, where $x$ may or may not be $\bar{x}$. The clique stopping criterion involves stopping when $H$ is contained in some clique of $G$.
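A small sketch (ours) of the similarity graph and the two stopping rules; note that $H$ lies inside some clique of $G$ exactly when $H$ itself induces a clique:

```python
from itertools import combinations

def similarity_graph(table):
    """Adjacency sets of G: edge (x, y) iff no test deterministically separates x and y."""
    m, n = len(table), len(table[0])
    adj = {x: set() for x in range(m)}
    for x, y in combinations(range(m), 2):
        separated = any({table[x][e], table[y][e]} == {+1, -1} for e in range(n))
        if not separated:
            adj[x].add(y)
            adj[y].add(x)
    return adj

def neighborhood_stop(adj, H):
    """Stop when H is inside some closed neighborhood D_x = {x} ∪ N(x)."""
    return any(set(H) <= adj[x] | {x} for x in adj)

def clique_stop(adj, H):
    """Stop when H is contained in a clique, i.e., all pairs in H are adjacent."""
    return all(y in adj[x] for x, y in combinations(H, 2))
```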
Note that clique stopping is a stronger notion of identification than neighborhood stopping. Our algorithms' performance guarantees will now also depend on the maximum degree $d$ of $G$; note that $d=0$ in the perfectly identifiable case. We obtain a non-adaptive algorithm with approximation ratio $O((d+1)\log m)$ and an adaptive algorithm with approximation ratio $O(d+\min(h,r)+\log m)$. Below we outline our approach for the adaptive algorithm; the details can be found in the full version.

For the adaptive version, we run a two-phase algorithm. In the first phase, we identify some subset $N\subseteq[m]$ containing $\bar{x}$ with $|N|\le d$. This can be done using the algorithm in Section 4 with the following submodular function for each $(b,x)\in M$:

$$ f_{b,x}(S) = \frac{1}{m-d}\,\Big|\bigcup_{e\in S} T_{b,x}(e)\Big|, \qquad S\subseteq U. $$

The expected cost of this phase is $O(\min(r,h)+\log m)\cdot\mathrm{OPT}$, by an analysis identical to that of Section 4; here OPT denotes the optimal value. Then, in the second phase, we run a simple splitting algorithm that iteratively selects any test that splits the current set $H$ of candidate hypotheses, until the neighborhood or clique stopping criterion is satisfied. The expected cost of this phase is at most $d\cdot\mathrm{OPT}$. Combining both phases, we obtain an $O(d+\min(h,r)+\log m)$-approximation algorithm.

6 Extensions

Non-binary outcomes. We can also handle tests with an arbitrary set $\Sigma$ of outcomes (instead of $\pm1$). This requires extending the outcome vectors $b$ to lie in $\Sigma^U$ and applying this change to the definitions of the sets $T_{b,x}$ in (1) and the submodular functions $f_{b,x}$ in (2).

Non-uniform noise distribution. Our results extend directly to the case where each noisy outcome has a different probability of being $+1$. Suppose that the probability of every noisy outcome is between $\delta$ and $1-\delta$. Then Theorems 3.2 and 4.7 continue to hold (irrespective of $\delta$), and Theorem 4.8 holds with a slightly worse $O(\frac1\delta\log m)$ approximation ratio.

7 Experiments

We implemented our algorithms and performed experiments on real-world and synthetic data sets. We compare our algorithms' cost (expected number of tests) with an information-theoretic lower bound on the optimal cost and show that the difference is negligible. Thus, despite our logarithmic approximation ratios, the practical performance can be much better.

Chemicals with Unknown Test Outcomes. One natural application of ODT is identifying chemical or biological materials. We considered a data set called WISER (https://wiser.nlm.nih.gov), which includes 400+ chemicals (hypotheses) and 78 binary tests. Every chemical has either a positive, negative, or unknown result on each test. We ran our algorithms on both the original instance, in which some chemicals are not perfectly identifiable, and a modified version. In the modified version, to ensure that every pair of chemicals can be distinguished, we removed the chemicals that are not identifiable from each other, leaving 255 chemicals (to do this, we used a greedy rule that iteratively drops the highest-degree hypothesis in the similarity graph).

Random Binary Classifiers with Margin Error. We construct a dataset containing 100 two-dimensional points by picking each of their attributes uniformly in $[-1000,1000]$. We also choose 2000 random triples $(a,b,c)$ to form linear classifiers $\frac{ax+by}{\sqrt{a^2+b^2}}+c\ge0$, where $a,b\sim N(0,1)$ and $c\sim U(-1000,1000)$. The point labels are binary, and we introduce noisy outcomes based on the distance of each point to a classifier. Specifically, for each threshold $d\in\{0,5,10,20,30\}$ we define a dataset CL-$d$ that has a noisy outcome for any classifier-point pair where the distance of the point to the boundary of the classifier is smaller than $d$. In order to ensure that the instances are perfectly identifiable, we remove equivalent classifiers, which leaves 234 classifiers.
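A sketch of this construction (our own code, not the released experiment code; `STAR` is from the earlier sketch). Since $(a,b)/\sqrt{a^2+b^2}$ is a unit vector, $\big|\frac{a p_x + b p_y}{\sqrt{a^2+b^2}}+c\big|$ is exactly the distance of the point $(p_x,p_y)$ to the classifier boundary:

```python
import math
import random

def make_cl_dataset(d, n_points=100, n_classifiers=2000, seed=0):
    """Build a CL-d instance: rows are classifiers (hypotheses), columns are
    points (tests). An entry is STAR when the point lies within distance d
    of the classifier's boundary."""
    rng = random.Random(seed)
    pts = [(rng.uniform(-1000, 1000), rng.uniform(-1000, 1000))
           for _ in range(n_points)]
    table = []
    for _ in range(n_classifiers):
        a, b = rng.gauss(0, 1), rng.gauss(0, 1)
        c = rng.uniform(-1000, 1000)
        norm = math.hypot(a, b)
        row = []
        for (px, py) in pts:
            signed_dist = (a * px + b * py) / norm + c  # signed distance to boundary
            row.append(STAR if abs(signed_dist) < d
                       else (+1 if signed_dist >= 0 else -1))
        table.append(row)
    return table  # equivalent rows would then be deduplicated, as described above
```

Note that CL-0 is noiseless by construction, matching its $(r,h)=(0,0)$ entry in Table 1 below.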
Algorithms. We implement the following algorithms: the adaptive $O(r+\log m)$-approximation (ODTNr), the adaptive $O(h+\log m)$-approximation (ODTNh), the non-adaptive $O(\log m)$-approximation (Non-Adap), and a slightly adaptive version of Non-Adap (Low-Adap). Algorithm Low-Adap considers the same sequence of tests as Non-Adap while (adaptively) skipping non-informative tests based on observed outcomes. The implementations of the adaptive and non-adaptive algorithms are available online (https://github.com/FatemehNavidi/ODTN and https://github.com/sjia1/ODT-with-noisy-outcomes). We also consider three different stopping criteria: unique stopping for perfectly identifiable instances, and neighborhood and clique stopping (defined in Section 5) for the original WISER dataset.

Results. Table 1 shows the results of the different algorithms with unique stopping on all identifiable datasets when the distribution over hypotheses is uniform, and Table 2 summarizes the results on the original WISER dataset with the other stopping criteria. Experiments with some non-uniform distributions are presented in the supplementary material. Table 1 also reports the value of an information-theoretic lower bound (the entropy $\log_2 m$) on the optimal cost (Low-BND). We can see that ODTNr consistently outperforms the other algorithms and is very close to the lower bound. Note that the original WISER dataset used to produce the results in Table 2 has $m=414$ hypotheses and $d=54$ in its similarity graph, while the processed WISER dataset in Table 1 is perfectly identifiable with $m=255$ hypotheses.

| Algorithm | WISER (245,45) | CL-0 (0,0) | CL-5 (5,3) | CL-10 (7,6) | CL-20 (12,8) | CL-30 (13,8) |
|---|---|---|---|---|---|---|
| Low-BND | 7.994 | 7.870 | 7.870 | 7.870 | 7.870 | 7.870 |
| ODTNr | 8.357 | 7.910 | 7.927 | 7.915 | 7.962 | 8.000 |
| ODTNh | 9.707 | 7.910 | 7.979 | 8.211 | 8.671 | 8.729 |
| Non-Adap | 11.568 | 9.731 | 9.831 | 9.941 | 9.996 | 10.204 |
| Low-Adap | 9.152 | 8.619 | 8.517 | 8.777 | 8.692 | 8.803 |

Table 1: Cost of the algorithms with unique stopping, for the uniform distribution. For each dataset we also indicate the noise parameters $(r,h)$: the maximum number of noisy outcomes per test and per hypothesis, respectively.

| Algorithm | Neighborhood Stopping | Clique Stopping |
|---|---|---|
| ODTNr | 11.163 | 11.817 |
| ODTNh | 11.908 | 12.506 |
| Non-Adap | 16.995 | 21.281 |
| Low-Adap | 16.983 | 20.559 |

Table 2: Cost of the algorithms on the original WISER dataset with neighborhood and clique stopping, for the uniform distribution.

References

[1] M. Adler and B. Heeringa. Approximating optimal binary decision trees. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 1-9. Springer, 2008.
[2] E. M. Arkin, H. Meijer, J. S. Mitchell, D. Rappaport, and S. S. Skiena. Decision trees for geometric models. International Journal of Computational Geometry & Applications, 8(3):343-363, 1998.
[3] Y. Azar and I. Gamzu. Ranking with submodular valuations. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1070-1079. SIAM, 2011.
[4] M. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML 2006), Pittsburgh, Pennsylvania, USA, pages 65-72, 2006.
[5] G. Bellala, S. K. Bhavnani, and C. Scott.
Active diagnosis under persistent noise with unknown noise distribution: A rank-based approach. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), Fort Lauderdale, USA, pages 155-163, 2011.
[6] V. T. Chakaravarthy, V. Pandit, S. Roy, P. Awasthi, and M. K. Mohania. Decision trees for entity identification: Approximation algorithms and hardness results. ACM Trans. Algorithms, 7(2):15:1-15:22, 2011.
[7] V. T. Chakaravarthy, V. Pandit, S. Roy, and Y. Sabharwal. Approximating decision trees with multiway branches. In International Colloquium on Automata, Languages, and Programming, pages 210-221. Springer, 2009.
[8] Y. Chen, S. H. Hassani, and A. Krause. Near-optimal Bayesian active learning with correlated and noisy tests. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), Fort Lauderdale, FL, USA, pages 223-231, 2017.
[9] F. Cicalese, E. S. Laber, and A. M. Saettler. Diagnosis determination: decision trees optimizing simultaneously worst and expected testing cost. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), Beijing, China, pages 414-422, 2014.
[10] S. Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, pages 337-344, 2005.
[11] M. Garey and R. Graham. Performance bounds on the splitting algorithm for binary testing. Acta Informatica, 3:347-355, 1974.
[12] D. Golovin and A. Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. J. Artif. Intell. Res., 42:427-486, 2011.
[13] D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. CoRR, abs/1003.3967, 2017.
[14] D. Golovin, A. Krause, and D. Ray. Near-optimal Bayesian active learning with noisy observations. In Advances in Neural Information Processing Systems 23 (NeurIPS 2010), Vancouver, British Columbia, Canada, pages 766-774, 2010.
[15] A. Guillory and J. A. Bilmes. Average-case active learning with costs. In Algorithmic Learning Theory, 20th International Conference (ALT 2009), Porto, Portugal, pages 141-155, 2009.
[16] A. Guillory and J. A. Bilmes. Interactive submodular set cover. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, pages 415-422, 2010.
[17] A. Guillory and J. A. Bilmes. Simultaneous learning and covering with adversarial noise. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), Bellevue, Washington, USA, pages 369-376, 2011.
[18] A. Gupta, V. Nagarajan, and R. Ravi. Approximation algorithms for optimal decision trees and adaptive TSP problems. Mathematics of Operations Research, 42(3):876-896, 2017.
[19] S. Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML 2007), Corvallis, Oregon, USA, pages 353-360, 2007.
[20] L. Hyafil and R. L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15-17, 1976/77.
[21] S. Im, V. Nagarajan, and R. van der Zwaan. Minimum latency submodular cover.
ACM Transactions on Algorithms (TALG), 13(1):13, 2016.
[22] S. Javdani, Y. Chen, A. Karbasi, A. Krause, D. Bagnell, and S. S. Srinivasa. Near optimal Bayesian active learning for decision making. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS 2014), Reykjavik, Iceland, pages 430-438, 2014.
[23] P. Kambadur, V. Nagarajan, and F. Navidi. Adaptive submodular ranking. In Integer Programming and Combinatorial Optimization, 19th International Conference (IPCO 2017), Waterloo, ON, Canada, pages 317-329, 2017 (full version: https://arxiv.org/abs/1606.01530).
[24] S. R. Kosaraju, T. M. Przytycka, and R. Borgstrom. On an optimal split tree problem. In Workshop on Algorithms and Data Structures, pages 157-168. Springer, 1999.
[25] D. W. Loveland. Performance bounds for binary testing with arbitrary weights. Acta Informatica, 22(1):101-114, 1985.
[26] M. J. Moshkov. Greedy algorithm with weights for decision tree construction. Fundamenta Informaticae, 104(3):285-292, 2010.
[27] M. Naghshvar, T. Javidi, and K. Chaudhuri. Noisy Bayesian active learning. In 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton 2012), Monticello, IL, USA, pages 1626-1633, 2012.
[28] F. Nan and V. Saligrama. Comments on the proof of adaptive stochastic set cover based on adaptive submodularity and its implications for the group identification problem in "Group-based active query selection for rapid diagnosis in time-critical situations". IEEE Trans. Information Theory, 63(11):7612-7614, 2017.
[29] R. D. Nowak. Noisy generalized binary search. In Advances in Neural Information Processing Systems 22 (NeurIPS 2009), Vancouver, British Columbia, Canada, pages 1366-1374, 2009.
[30] A. M. Saettler, E. S. Laber, and F. Cicalese. Trading off worst and expected cost in decision tree problems. Algorithmica, 79(3):886-908, 2017.
[31] L. A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4):385-393, 1982.