# Approximating a Distribution Using Weight Queries

Nadav Barak¹, Sivan Sabato¹

¹Department of Computer Science, Ben Gurion University, Israel. Correspondence to: Nadav Barak, Sivan Sabato.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

We consider a novel challenge: approximating a distribution without the ability to randomly sample from that distribution. We study how such an approximation can be obtained using weight queries. Given some data set of examples, a weight query presents one of the examples to an oracle, which returns the probability, according to the target distribution, of observing examples similar to the presented example. This oracle can represent, for instance, counting queries to a database of the target population, or an interface to a search engine which returns the number of results that match a given search. We propose an interactive algorithm that iteratively selects data set examples and performs corresponding weight queries. The algorithm finds a reweighting of the data set that approximates the weights according to the target distribution, using a limited number of weight queries. We derive an approximation bound on the total variation distance between the reweighting found by the algorithm and the best achievable reweighting. Our algorithm takes inspiration from the UCB approach common in multi-armed bandits problems, and combines it with a new discrepancy estimator and a greedy iterative procedure. In addition to our theoretical guarantees, we demonstrate in experiments the advantages of the proposed algorithm over several baselines. A python implementation of the proposed algorithm and of all the experiments can be found at https://github.com/Nadav-Barak/AWP.

1. Introduction

A basic assumption in learning and estimation tasks is the availability of a random sample from the distribution of interest. However, in many cases, obtaining such a random sample is difficult or impossible. In this work, we study a novel challenge: approximating a distribution without the ability to randomly sample from it. We consider a scenario in which the only access to the distribution is via weight queries. Given some data set of examples, a weight query presents one of these examples to an oracle, which returns the probability, according to the target distribution, of observing examples which are similar to the presented example. For instance, the available data set may list patients in a clinical trial, and the target distribution may represent the population of patients in a specific hospital, which is only accessible through certain database queries. In this case, the weight query for a specific data set example can be answered using a database counting query, which indicates how many records with the same demographic properties as the presented example exist in the database. A different example of a relevant oracle is that of a search engine for images or documents, which returns the number of objects in its database that are similar to the searched object.

We study the possibility of using weight queries to find a reweighting of the input data set that approximates the target distribution. Importantly, we make no assumptions on the relationship between the data set and the target distribution. For instance, the data set could be sampled from a different distribution, or be collected via a non-random process.
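To make the oracle interface concrete, the following is a minimal sketch (not part of the paper) of a weight-query oracle backed by a counting query over a table of the target population, in the spirit of the hospital example above. The record fields and the exact-match notion of "similar" are illustrative assumptions; any counting interface with the same semantics would do.

```python
from collections import Counter

class CountingWeightOracle:
    """Answers a weight query by counting matching records in a target-population table."""

    def __init__(self, target_records, key_fields):
        self.key_fields = key_fields
        keys = [tuple(r[f] for f in key_fields) for r in target_records]
        self.counts = Counter(keys)
        self.total = len(keys)

    def weight(self, example):
        """Probability mass of target records similar to `example` (exact match on key fields)."""
        key = tuple(example[f] for f in self.key_fields)
        return self.counts[key] / self.total

# Toy usage with hypothetical demographic records.
target = [{"age_group": "65+", "occupation": "retired"}] * 3 + \
         [{"age_group": "30-45", "occupation": "gov"}]
oracle = CountingWeightOracle(target, key_fields=("age_group", "occupation"))
print(oracle.weight({"age_group": "65+", "occupation": "retired"}))  # 0.75
```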
Reweighting the data set to match the true target weights would be easy if one simply queried the target weight of all the data set examples. In contrast, our goal in this work is to study whether a good approximated weighting can be found using a number of weight queries that is independent of the data set size. A data set reweighted to closely match a distribution is often used as a proxy to a random sample in learning and statistical estimation tasks, as done, for instance, in domain adaptation settings (e.g., Bickel et al. 2007; Sugiyama et al. 2008; Bickel et al. 2009).

We consider a reweighting scheme in which the weights are fully defined by some partition of the data set into a small number of subsets. Given the partition, the weights of examples in each subset of the partition are uniform, and are set so that the total weight of the subset is equal to its true target weight. An algorithm for finding a good reweighting should thus search for a partition such that within each of its subsets, the true example weights are as close to uniform as possible. For instance, in the hospital example, consider a case in which the hospital of interest has more older patients than their proportion in the clinical trial data set, but within each of the old and young populations, the makeup of the patients in the hospital is similar to that in the clinical trial. In this case, a partition based on patient ages will lead to an accurate reweighting of the data set. The goal of the algorithm is thus to find a partition which leads to an accurate reweighting, using a small number of weight queries.

We show that identifying a good partition using only weight queries of individual examples is impossible, unless almost all examples are queried for their weight. We thus consider a setting in which the algorithm has limited access to higher-order queries, which return the total weight of a subset of the data set. These higher-order queries are limited to certain types of subsets, and the algorithm can only use a limited number of such queries. For instance, in the hospital example, higher-order queries may correspond to database counting queries for some less specific demographic criteria.

To represent the available higher-order queries for a given problem, we assume that a hierarchical organization of the input data set is provided to the algorithm in the form of a tree, whose leaves are the data set examples. Each internal node in the tree represents the subset of leaves that are its descendants, and the only allowed higher-order queries are those that correspond to one of the internal nodes. Given such a tree, we consider only partitions that are represented by some pruning of the tree, where a pruning is a set of internal nodes such that each leaf has exactly one ancestor in the set. A useful tree is one that includes a small pruning that induces a partition of the data set into near-uniform subsets, as described above.

Main results: We give an algorithm that, for any given tree, finds a near-optimal pruning of size $K$ using only $K-1$ higher-order queries and $O(K^3/\epsilon^2)$ weight queries of individual examples, where $\epsilon$ measures the total variation distance of the obtained reweighting. The algorithm greedily splits internal nodes until reaching a pruning of the requested size. To decide which nodes to split, it iteratively makes weight queries of examples based on an Upper Confidence Bound (UCB) approach.
Our UCB scheme employs a new estimator that we propose for the quality of an internal node. We show that this is necessary, since a naive estimator would require querying the weight of almost all data set examples. Our guarantees depend on a property of the input hierarchical tree that we term the split quality, which essentially requires that local node splits of the input tree are not too harmful. We show that any algorithm that is based on iteratively splitting the tree and obtains a nontrivial approximation guarantee requires some assumption on the quality of the tree. To supplement our theoretical analysis, we also implement the proposed algorithm and report several experiments, which demonstrate its advantage over several natural baselines. A python implementation of the proposed algorithm and of all the experiments can be found at https://github.com/Nadav-Barak/AWP.

Related work. In classical density estimation, the goal is to estimate the density function of a random variable given observed data (Silverman, 1986). Commonly used methods are based on Parzen or kernel estimators (Wand and Jones, 1994; Goldberger and Roweis, 2005), expectation maximization (EM) algorithms (McLachlan and Krishnan, 2007; Figueiredo and Jain, 2002) or variational estimation (Corduneanu and Bishop, 2001; McGrory and Titterington, 2007). Some works have studied active variants of density estimation. In Ghasemi et al. (2011), examples are selected for kernel density estimation. In Kristan et al. (2010), density estimation in an online and interactive setting is studied. We are not aware of previous works that consider estimating a distribution using weight queries.

The domain adaptation framework (Kifer et al., 2004; Ben-David et al., 2007; Blitzer et al., 2008; Mansour et al., 2009; Ben-David et al., 2010) assumes a target distribution with scarce or unlabeled random examples. In Bickel et al. (2007); Sugiyama et al. (2008); Bickel et al. (2009), reweighting the labeled source sample based on unlabeled target examples is studied. Tradeoffs between source and target labels are studied in Kpotufe and Martinet (2018). In Berlind and Urner (2015), a labeled source sample guides target label requests.

In this work, we assume that the input data set is organized in a hierarchical tree, which represents relevant structures in the data set. This type of input is common to many algorithms that require structure. For instance, such an input tree is used for active learning in Dasgupta and Hsu (2008); Urner et al. (2013). In Slivkins (2011); Bubeck et al. (2011), a hierarchical tree is used to organize different arms in a multi-armed-bandits problem, and in Munos (2011) such a structure is used to adaptively estimate the maximum of an unknown function. In Kpotufe et al. (2015); Cortes et al. (2020), an iterative partition of the domain is used for active learning.

2. Setting and Notations

Denote by $\mathbf{1}$ the all-ones vector; its size will always be clear from context. For an integer $n$, let $[n] := \{1, \dots, n\}$. For a vector or a sequence $x = (x(1), \dots, x(n))$, let $\|x\|_1 := \sum_{i \in [n]} |x(i)|$ be the $\ell_1$ norm of $x$. For a function $f : X \to \mathbb{R}$ on a discrete domain $X$, denote $\|f\|_1 := \sum_{x \in X} |f(x)|$. The input data set is some finite set $S \subseteq X$. We assume
that the examples in $S$ induce a partition on the domain $X$, where the part represented by $x \in S$ is the set of target examples that are similar to it (in some application-specific sense).

The target weighting of $S$ is denoted $w^* : S \to [0, 1]$, where $\|w^*\|_1 = 1$. $w^*(x)$ is the probability mass, according to the unknown target distribution, of the examples in the domain that are represented by $x$. For a set $V \subseteq S$, denote $w^*(V) := \sum_{x \in V} w^*(x)$. The goal of the algorithm is to approximate the target weighting $w^*$ using weight queries, via a partition of $S$ into at most $K$ parts, where $K$ is provided as input to the algorithm. A basic weight query presents some $x \in S$ to the oracle and receives its weight $w^*(x)$ as an answer.

To define the available higher-order queries, the input to the algorithm includes a binary hierarchical tree $T$ whose leaves are the elements in $S$. For an internal node $v$, denote by $L_v \subseteq S$ the set of examples in the leaves descending from $v$. A higher-order query presents some internal node $v$ to the oracle, and receives its weight $w^*(L_v)$ as an answer. We note that the algorithm that we propose below uses higher-order queries only for nodes of depth at most $K$.

A reweighting algorithm attempts to approximate the target weighting by finding a small pruning of the tree that induces a weighting on $S$ which is as similar as possible to $w^*$. Formally, for a given tree, a pruning of the tree is a set of internal nodes such that each tree leaf is a descendant of exactly one of these nodes. Thus, a pruning of the input tree $T$ induces a partition on $S$. The weighting induced by the pruning is defined so that the weights assigned to all the leaves (examples) descending from the same pruning node $v$ are the same, and their total weight is equal to the true total weight $w^*(L_v)$. Formally, for a node $v$, let $w^*_v := w^*(L_v)$ and $N_v := |L_v|$. For a pruning $P$ and an example $x \in S$, let $A(P, x)$ be the node in $P$ which is the ancestor of $x$. The weighting $w^P : S \to \mathbb{R}_+$ induced by the pruning $P$ is defined as $w^P(x) := w^*_{A(P,x)} / N_{A(P,x)}$.

The quality of a weighting $w^P$ is measured by the total variation distance between the distribution induced by $w^P$ and the one induced by $w^*$. This is equivalent to the $\ell_1$ norm between the weight functions (see, e.g., Wilmer et al., 2009). Formally, the total variation distance between two weight functions $w_1, w_2 : S \to \mathbb{R}_+$ is
$$\mathrm{dist}(w_1, w_2) := \max_{S' \subseteq S} |w_1(S') - w_2(S')| = \frac{1}{2}\sum_{x \in S} |w_1(x) - w_2(x)| = \frac{1}{2}\|w_1 - w_2\|_1.$$
For a node $v$, define $D_v := \sum_{x \in L_v} |w^*_v/N_v - w^*(x)|$. The discrepancy of $w^P$ (with respect to $w^*$) is
$$D_P := 2\,\mathrm{dist}(w^P, w^*) = \|w^P - w^*\|_1 = \sum_{v \in P} D_v.$$
Intuitively, this measures the distance of the weights in each pruning node from uniform weights. More generally, for any subset $G$ of a pruning, define $D_G := \sum_{v \in G} D_v$. We also call $D_G$, $D_v$ the discrepancy of $G$, $v$, respectively. The goal of the algorithm is thus to find a pruning $P$ such that $w^P$ has a low discrepancy, using a small number of weight queries.

3. Estimating the Discrepancy

As stated above, the goal of the algorithm is to find a pruning with a low discrepancy. A necessary tool for such an algorithm is the ability to estimate the discrepancy $D_v$ of a given internal node using a small number of weight queries. In this section, we discuss some challenges in estimating the discrepancy and present an estimator that overcomes them.

First, it can be observed that the discrepancy of a node cannot be reliably estimated from basic weight queries alone, unless almost all leaf weights are queried. To see this, consider two cases: one in which all the leaves in $L_v$ have an equally small weight, and one in which this holds for all but one leaf, which has a large weight. The discrepancy $D_v$ in the first case is zero, while it is large in the second case.
However, it is impossible to distinguish between the cases using basic weight queries, unless they happen to include the heavy leaf. A detailed example of this issue is provided in Appendix B. To overcome this issue, the algorithm uses a higher-order query to obtain the total weight of the internal node, $w^*_v$, in addition to a random sample of basic weight queries of examples in $L_v$. However, even then, a standard empirical estimator of the discrepancy, obtained by aggregating $|w^*_v/N_v - w^*(x)|$ over sampled examples, can have a large estimation error due to the wide range of possible values (see example in Appendix B). We thus propose a different estimator, which circumvents this issue. The lemma below gives this estimator and proves a concentration bound for it. In this lemma, the leaf weights are represented by $w = (w_1, \dots, w_n)$ and the discrepancy of the node is $D(w)$. In more general terms, this lemma gives an estimator for the uniformity of a set of values.

Lemma 3.1. Let $w = (w_1, \dots, w_n)$ be a sequence of non-negative real values with a known $\|w\|_1$. Let $\bar{W} := \|w\|_1/n$, and $D(w) := \|w - \bar{W}\mathbf{1}\|_1$. Let $U$ be a uniform distribution over the indices in $[n]$, and suppose that $m$ i.i.d. samples $\{I_i\}_{i \in [m]}$ are drawn from $U$. Denote $Z_i := w_{I_i}$ for $i \in [m]$, and $Z := (Z_1, \dots, Z_m)$. Let the estimator for $D(w)$ be:
$$\hat{D}(w) := \|w\|_1 + \frac{n}{m}\big(\|Z - \bar{W}\mathbf{1}\|_1 - \|Z\|_1\big).$$
Then, with a probability at least $1 - \delta$, $|D(w) - \hat{D}(w)| \le \|w\|_1\sqrt{2\ln(2/\delta)/m}$.

Proof. Let $Z'_i := |Z_i - \bar{W}| - Z_i$. If $Z_i \ge \bar{W}$, we have $Z'_i = -\bar{W}$. Otherwise, $0 \le Z_i < \bar{W}$, and therefore $Z'_i = \bar{W} - 2Z_i \in [-\bar{W}, \bar{W}]$. Combining the two cases, we get $\mathbb{P}[Z'_i \in [-\bar{W}, \bar{W}]] = 1$. Thus, by Hoeffding's inequality, with a probability at least $1 - \delta$,
$$\Big|\mathbb{E}[Z'_1] - \frac{1}{m}\sum_{i \in [m]} Z'_i\Big| \le \bar{W}\sqrt{2\ln(2/\delta)/m}.$$
Now, $\mathbb{E}[Z'_1] = \frac{1}{n}\big(\|w - \bar{W}\mathbf{1}\|_1 - \|w\|_1\big) = \frac{1}{n}\big(D(w) - \|w\|_1\big)$. In addition, $\frac{1}{m}\sum_{i \in [m]} Z'_i = \frac{1}{m}\big(\|Z - \bar{W}\mathbf{1}\|_1 - \|Z\|_1\big) = \frac{1}{n}\big(\hat{D}(w) - \|w\|_1\big)$. Therefore, with a probability at least $1 - \delta$,
$$|D(w) - \hat{D}(w)| = n\,\Big|\mathbb{E}[Z'_1] - \frac{1}{m}\sum_{i \in [m]} Z'_i\Big| \le n\bar{W}\sqrt{2\ln(2/\delta)/m} = \|w\|_1\sqrt{2\ln(2/\delta)/m},$$
as claimed.
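To illustrate Lemma 3.1, the following sketch (ours, not part of the paper's implementation) applies the estimator $\hat{D}(w)$ to synthetic values containing one heavy entry, and compares it to the plain empirical estimate discussed earlier in this section; all numbers are illustrative.

```python
import numpy as np

def d_hat(w_total, n, z):
    """Estimator of Lemma 3.1: D_hat(w) = ||w||_1 + (n/m)(||Z - W*1||_1 - ||Z||_1),
    where W = ||w||_1 / n and z holds the m uniformly sampled values."""
    m = len(z)
    W = w_total / n
    return w_total + (n / m) * (np.sum(np.abs(z - W)) - np.sum(z))

rng = np.random.default_rng(0)
n, m = 10_000, 200
w = np.full(n, 0.5 / n)   # many light values...
w[0] += 0.5               # ...and one heavy value holding half the total mass
true_D = np.sum(np.abs(w - np.sum(w) / n))

z = w[rng.integers(0, n, size=m)]                     # m uniform value samples (weight queries)
naive = (n / m) * np.sum(np.abs(z - np.sum(w) / n))   # plain empirical estimate of D(w)
print(f"true D = {true_D:.3f}, d_hat = {d_hat(np.sum(w), n, z):.3f}, naive = {naive:.3f}")
```

Because $\hat{D}(w)$ uses the known total $\|w\|_1$, it stays close to the true discrepancy even when the heavy entry is never sampled, while the plain empirical estimate is far off unless the heavy entry happens to be drawn.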
4. Main result: the AWP algorithm

We propose the AWP (Approximated Weights via Pruning) algorithm, listed in Alg. 1. AWP uses weight queries to find a pruning $P$, which induces a weight function $w^P$ as defined in Section 2. AWP gets the following inputs: a binary tree $T$ whose leaves are the data set elements, the requested pruning size $K \ge 2$, a confidence parameter $\delta \in (0, 1)$, and a constant $\beta > 1$, which controls the trade-off between the number of weight queries requested by AWP and the approximation factor that it obtains.

Finding a pruning with a small discrepancy while limiting the number of weight queries involves several challenges, since the discrepancy of any given pruning is unknown in advance, and the number of possibilities is exponential in $K$. AWP starts with the trivial pruning, which includes only the root node. It iteratively samples weight queries of leaves to estimate the discrepancy of nodes in the current pruning. AWP decides in a greedy manner when to split a node in the current pruning, that is, to replace it in the pruning with its two child nodes. It stops after reaching a pruning of size $K$.

We first provide the notation for Alg. 1. For a node $v$ in $T$, $M_v$ is the number of weight queries of examples in $L_v$ requested so far by the algorithm for estimating $D_v$. The sequence of $M_v$ weights returned by the oracle for these queries is denoted $z_v$. Note that although an example in $L_u$ is also in $L_v$ for any ancestor $v$ of $u$, weight queries used for estimating $D_v$ are not reused for estimating $D_u$, since this would bias the estimate. AWP uses the estimator for $D_v$ provided in Lemma 3.1. In AWP notation, the estimator is:
$$\hat{D}_v := w^*_v + \frac{N_v}{M_v}\Big(\Big\|z_v - \frac{w^*_v}{N_v}\mathbf{1}\Big\|_1 - \|z_v\|_1\Big). \qquad (1)$$

AWP iteratively samples weight queries of examples for nodes in the current pruning, until it can identify a node which has a relatively large discrepancy. The iterative sampling procedure takes inspiration from the upper-confidence-bound (UCB) approach, common in best-arm-identification problems (Audibert et al., 2010); in our case, the goal is to find the best node up to a multiplicative factor. In each iteration, the node with the maximal known upper bound on its discrepancy is selected, and the weight of a random example from its leaves is queried. To calculate the upper bound, we define
$$\Delta_v := w^*_v\sqrt{2\ln\big(2K\pi^2 M_v^2/(3\delta)\big)/M_v}. \qquad (2)$$
We show in Section 5 that $|D_v - \hat{D}_v| \le \Delta_v$ with a high probability. Hence, the upper bound for $D_v$ is set to $\hat{D}_v + \Delta_v$.

Whenever a node from the pruning is identified as having a large discrepancy in comparison with the other nodes, it is replaced by its child nodes, thus increasing the pruning size by one. Formally, AWP splits a node $v$ if, with a high probability, $\forall v' \in P \setminus \{v\}$, $D_v \ge D_{v'}/\beta$. The factor of $\beta$ trades off the optimality of the selected node in terms of its discrepancy with the number of queries needed to identify such a node and perform a split. In addition, it makes sure that a split can be performed even if all nodes have a similar discrepancy. The formal splitting criterion is defined via the following Boolean function:
$$SC(v, P) := \mathbb{I}\Big[\beta\big(\hat{D}_v - \Delta_v\big) \ge \max_{v' \in P \setminus \{v\}}\big(\hat{D}_{v'} + \Delta_{v'}\big)\Big]. \qquad (3)$$

To summarize, AWP iteratively selects a node using the UCB criterion, and queries the weight of a random leaf of that node. Whenever the splitting criterion holds, AWP splits a node that satisfies it. This is repeated until reaching a pruning of size $K$. In addition, when a node is added to the pruning, its weight is queried for use in Eq. (1).

Algorithm 1 AWP: Approximated weighting of a data set via a low-discrepancy pruning
1: Input: A binary tree $T$ with examples as leaves; maximum pruning size $K \ge 2$; $\delta \in (0, 1)$; $\beta > 1$.
2: Output: $P_o$: a pruning of $T$ with a small $D_{P_o}$.
3: Initializations: $P \leftarrow \{\text{the root node of } T\}$;
4: For all nodes $v$ in $T$: $z_v \leftarrow ()$, $M_v \leftarrow 0$.
5: while $|P| < K$ do
6:   ▷ Select a node to sample from
7:   Set $v_s \leftarrow \arg\max_{v \in P}(\hat{D}_v + \Delta_v)$.
8:   Draw a uniformly random example $x$ from $L_{v_s}$.
9:   Query the true weight of $x$, $w^*(x)$.
10:  $M_{v_s} \leftarrow M_{v_s} + 1$; append $w^*(x)$ to $z_{v_s}$.
11:  ▷ Decide whether to split a node in the pruning:
12:  while $\exists v \in P$ s.t. $SC(v, P)$ holds (Eq. (3)) do
13:    Let $v_R$, $v_L$ be the children of such a $v$.
14:    Set $P \leftarrow P \setminus \{v\} \cup \{v_R, v_L\}$.
15:    Query $w^*_{v_R}$; set $w^*_{v_L} \leftarrow w^*_v - w^*_{v_R}$.
16:  end while
17: end while
18: return $P_o := P$.
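The following Python sketch illustrates the quantities that drive the main loop of Alg. 1: the estimator of Eq. (1), the confidence radius of Eq. (2), and the splitting criterion of Eq. (3). It is a simplified illustration under our own naming (the Node container and function names are not from the paper); the authors' implementation is available at the repository linked above.

```python
import math

class Node:
    """Minimal container for the per-node state used by the sketch."""
    def __init__(self, w_star, leaves, left=None, right=None):
        self.w_star = w_star          # w*_v, known from a higher-order query
        self.leaves = leaves          # the examples in L_v
        self.left, self.right = left, right
        self.z = []                   # leaf weights queried so far for this node (z_v)

    @property
    def N(self):                      # N_v = |L_v|
        return len(self.leaves)

def d_hat(v):
    """Eq. (1): hat{D}_v = w*_v + (N_v/M_v)(||z_v - (w*_v/N_v) 1||_1 - ||z_v||_1)."""
    M = len(v.z)
    if M == 0:
        return 0.0                    # value is irrelevant while the radius below is infinite
    W = v.w_star / v.N
    return v.w_star + (v.N / M) * (sum(abs(z - W) for z in v.z) - sum(v.z))

def radius(v, K, delta):
    """Eq. (2): confidence radius Delta_v; infinite before any leaf of v was queried."""
    M = len(v.z)
    if M == 0:
        return math.inf
    return v.w_star * math.sqrt(2 * math.log(2 * K * math.pi**2 * M**2 / (3 * delta)) / M)

def ucb_node(pruning, K, delta):
    """Line 7 of Alg. 1: the node whose leaf weight is sampled next."""
    return max(pruning, key=lambda v: d_hat(v) + radius(v, K, delta))

def split_candidate(pruning, K, delta, beta):
    """Eq. (3): a node v with beta(hat{D}_v - Delta_v) >= max over the others of (hat{D} + Delta)."""
    for v in pruning:
        rest = [d_hat(u) + radius(u, K, delta) for u in pruning if u is not v]
        if rest and beta * (d_hat(v) - radius(v, K, delta)) >= max(rest):
            return v
    return None
```

In this sketch, a node that has not yet been sampled has an infinite upper bound, so it is selected first and never triggers a split, mirroring the behavior of the UCB selection and the splitting check in the listing above.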
We now provide our guarantees for AWP. The properties of the tree $T$ affect the quality of the output weighting. First, the pruning found by AWP cannot be better than the best pruning of size $K$ in $T$. Thus, we guarantee an approximation factor relative to that pruning. In addition, we require the tree to be sufficiently nice, in that a child node should have a somewhat lower discrepancy than its parent. Formally, we define the notion of split quality.

Definition 4.1 (Split quality). Let $T$ be a hierarchical tree for $S$, and $q \in (0, 1)$. $T$ has a split quality $q$ if for any two nodes $v, u$ in $T$ where $u$ is a child of $v$, we have $D_u \le q\,D_v$.

This definition is similar in nature to other tree quality notions, such as the taxonomy quality of Slivkins (2011), though the latter restricts weights and not the discrepancy. We note that Def. 4.1 could be relaxed, for instance by allowing different values of $q$ in different tree levels. Nonetheless, we prove in Appendix A that greedily splitting the node with the largest discrepancy cannot achieve a reasonable approximation factor without some restriction on the input tree, and that this also holds for other types of greedy algorithms. It is an open problem whether this limitation applies to all greedy algorithms. Note also that even in trees with a split quality less than 1, splitting a node might increase the total discrepancy; see Section 5 for further discussion. Our main result is the following theorem.

Theorem 4.2. Suppose that AWP gets the inputs $T$, $K \ge 2$, $\delta \in (0, 1)$, $\beta > 1$, and let $P_o$ be its output. Let $Q$ be a pruning of $T$ of size $K$ with a minimal $D_Q$. With a probability at least $1 - \delta$, we have:
- If $T$ has split quality $q$ for some $q < 1$, then $D_{P_o} \le 2\beta\Big(\frac{\log(K)}{\log(1/q)} + 1\Big) D_Q$.
- AWP requests $K - 1$ weight queries of internal nodes.
- AWP requests $O\big((1 + \frac{1}{\beta - 1})^2 K^3 \ln(1/\delta)/D_{P_o}^2\big)$ weight queries of examples.¹

Thus, an approximation factor with respect to the best achievable discrepancy is obtained, while keeping the number of higher-order weight queries minimal and bounding the number of weight queries of examples requested by the algorithm. The theorem is proved in the next section.

¹The $O$ notation hides constants and logarithmic factors; these are explicit in the proof of the theorem.

5. Analysis

In this section, we prove the main result, Theorem 4.2. First, we prove the correctness of the definition of $\Delta_v$ given in Section 4, using the concentration bound given in Lemma 3.1 for the estimator $\hat{D}_v$. The proof is provided in Appendix C.

Lemma 5.1. Fix inputs $T, K, \delta, \beta$ to AWP. Recall that $P$ is the pruning updated by AWP during its run. The following event holds with a probability at least $1 - \delta$ on the randomness of AWP:
$$E_0 := \{\text{At all times during the run of AWP, } \forall v \in P,\ |D_v - \hat{D}_v| \le \Delta_v\}. \qquad (4)$$

Next, we bound the increase in discrepancy that could be caused by a node split. Even in trees with a split quality less than 1, a split could increase the discrepancy of the pruning. The next lemma bounds this increase, and shows that this bound is tight. The proof is provided in Appendix D.

Lemma 5.2. Let $r$ be the root of a hierarchical tree and let $P$ be a pruning of this tree. Then $D_P \le 2D_r$. Moreover, for any $\epsilon > 0$, there exists a tree with a split quality $q < 1$ and a pruning $P$ of size 2 such that $D_P \ge (2 - \epsilon)D_r$.

We now prove the two main parts of Theorem 4.2, starting with the approximation factor of AWP. In the proof of the following lemma, the proof of some claims is omitted. The full proof is provided in Appendix E.

Lemma 5.3. Fix inputs $T, K, \delta, \beta$ to AWP, and suppose that $T$ has a split quality $q \in (0, 1)$. Let $P_o$ be the output of AWP. Let $E_0$ be the event defined in Eq. (4). In any run of AWP in which $E_0$ holds, for any pruning $Q$ of $T$ such that $|Q| = K$, we have $D_{P_o} \le 2\beta\Big(\frac{\log(K)}{\log(1/q)} + 1\Big) D_Q$.

Proof. Let $Q$ be some pruning such that $|Q| = K$. Partition $P_o$ into $R$, $P_a$ and $P_d$, where $R := P_o \cap Q$, $P_a \subseteq P_o$ is the set of strict ancestors of nodes in $Q$, and $P_d \subseteq P_o$ is the set of strict descendants of nodes in $Q$. Let $Q_a \subseteq Q$ be the ancestors of the nodes in $P_d$ and let $Q_d \subseteq Q$ be the descendants of the nodes in $P_a$, so that $R$, $Q_d$ and $Q_a$ form a partition of $Q$. First, we prove that we may assume without loss of generality that the sets $P_a, P_d, Q_a, Q_d$ are non-empty.

Claim 1: If any of the sets $P_a, P_d, Q_a, Q_d$ is empty, then the statement of the lemma holds.

The proof of Claim 1 is deferred to Appendix E. Assume henceforth that $P_a, P_d, Q_a, Q_d$ are non-empty.
Let $r$ be the node with the smallest discrepancy out of the nodes that were split by AWP during the entire run. Define $\theta := |P_a|\, D_r/D_{Q_a}$ if $D_{Q_a} > 0$ and $\theta := 0$ otherwise.

Claim 2: $D_{P_o} \le \max(2, \beta\theta)\, D_Q$.

Proof of Claim 2: We bound the discrepancies of $P_d$ and of $P_a$ separately. For each node $u \in Q_a$, denote by $P(u) \subseteq P_d$ the descendants of $u$ in $P_d$. These form a pruning of the sub-tree rooted at $u$. In addition, the sets $\{P(u)\}_{u \in Q_a}$ form a partition of $P_d$. Thus, by the definition of discrepancy and Lemma 5.2,
$$D_{P_d} = \sum_{u \in Q_a} D_{P(u)} \le \sum_{u \in Q_a} 2D_u = 2D_{Q_a}. \qquad (5)$$

Let $P$ be the pruning when AWP decided to split node $r$. By the definition of the splitting criterion $SC$ (Eq. (3)), for all $v \in P \setminus \{r\}$, at that time it held that $\beta(\hat{D}_r - \Delta_r) \ge \hat{D}_v + \Delta_v$. Since $E_0$ holds, we have $D_r \ge \hat{D}_r - \Delta_r$ and $\hat{D}_v + \Delta_v \ge D_v$. Therefore, $\forall v \in P \setminus \{r\}$, $\beta D_r \ge D_v$. Now, any node $v' \in P_o \setminus P$ is a descendant of some node $v \in P$. Since $T$ has split quality $q$ for $q < 1$, we have $D_{v'} \le D_v$. Therefore, for all $v \in P_o$, $D_v \le \beta D_r$. In particular, $D_{P_a} = \sum_{v \in P_a} D_v \le \beta|P_a| D_r$. Since all nodes in $Q_a$ were split by AWP, and $r$ has the smallest discrepancy among the split nodes, $D_{Q_a} = 0$ implies $D_r = 0$; therefore, in all cases $D_{P_a} \le \beta\theta D_{Q_a}$. Combining this with Eq. (5), we get that
$$D_{P_o} = D_R + D_{P_d} + D_{P_a} \le D_R + 2D_{Q_a} + \beta\theta D_{Q_a} \le \max(2, \beta\theta)\, D_Q,$$
which completes the proof of Claim 2.

It follows from Claim 2 that to bound the approximation factor, it suffices to bound $\theta$. Let $P'_d$ be the set of nodes both of whose child nodes are in $P_d$, and denote $n := |P'_d|$. In addition, define
$$\alpha := \frac{\log(1/q)}{\log(|P_a|) + \log(1/q)} \le 1.$$
We now prove that $\theta \le 2/\alpha$ by considering two complementary cases, $n \ge \alpha|P_a|$ and $n < \alpha|P_a|$. The following claim handles the first case.

Claim 3: if $n \ge \alpha|P_a|$, then $\theta \le 2/\alpha$.

Proof of Claim 3: Each node in $P'_d$ has an ancestor in $Q_a$, and no ancestor in $P'_d$. Therefore, $P'_d$ can be partitioned to subsets according to their ancestor in $Q_a$, and each such subset is a part of some pruning of that ancestor. Thus, by Lemma 5.2, $D_{P'_d} \le 2D_{Q_a}$. Hence, for some node $v \in P'_d$, $D_v \le 2D_{Q_a}/n$. It follows from the definition of $r$ that $D_r \le 2D_{Q_a}/n$. Hence, $\theta \le 2|P_a|/n$. Since $n \ge \alpha|P_a|$, we have $\theta \le 2/\alpha$ as claimed.

We now prove that this bound holds also in the case $n < \alpha|P_a|$. For a node $v$ with an ancestor in $Q_a$, let $l_v$ be the path length from this ancestor to $v$, and define $L := \sum_{v \in P'_d} l_v$. We start with an auxiliary Claim 4.

Claim 4: $L \ge |P_a| - n$.

The proof of Claim 4 is deferred to Appendix E. We use Claim 4 to prove the required upper bound on $\theta$ in Claim 5.

Claim 5: if $n < \alpha|P_a|$, then $\theta \le 2/\alpha$.

Proof of Claim 5: It follows from Claim 4 that for some node $v \in P'_d$, $l_v \ge (|P_a| - n)/n = |P_a|/n - 1 > 0$, where the last inequality follows since $n < \alpha|P_a| \le |P_a|$. Letting $u \in Q_a$ be the ancestor of $v$ in $Q_a$, we have by the split quality $q$ of $T$ that $D_v \le D_u\, q^{l_v} \le D_u\, q^{|P_a|/n - 1}$. Since $u \in Q_a$, we have $D_u \le D_{Q_a}$. In addition, $D_r \le D_v$ by the definition of $r$. Therefore, $D_r \le D_{Q_a}\, q^{|P_a|/n - 1}$. Since $n < |P_a|\alpha$ and $q < 1$, from the definitions of $\alpha$ and $\theta$ we have
$$\theta \le |P_a|\, q^{\frac{|P_a|}{n} - 1} \le 1 \le 2/\alpha.$$
This proves Claim 5.

Claims 3 and 5 imply that in all cases, $\theta \le 2/\alpha$. By substituting $\alpha$, we have that
$$\theta \le 2\Big(\frac{\log(|P_a|)}{\log(1/q)} + 1\Big) \le 2\Big(\frac{\log(K)}{\log(1/q)} + 1\Big).$$
Placing this upper bound in the statement of Claim 2 concludes the proof of the lemma.

Next, we prove an upper bound on the number of weight queries requested by AWP.

Lemma 5.4. Let $\beta > 1$. Consider a run of AWP in which $E_0$ holds, and fix some iteration. Let $P$ be the current pruning, and for a node $v$ in $T$, denote $\alpha_v := \frac{\beta}{\beta - 1}\cdot\frac{w^*_v}{\max_{u \in P} D_u}$. Then in this iteration, the node $v_s$ selected by AWP satisfies $M_{v_s} < 22\,\alpha_{v_s}^2\big(\ln(\alpha_{v_s}^4 K/\delta) + 10\big)$.

Proof. Recall that the next weight query to be sampled by AWP is set to $v_s \in \arg\max_{v \in P}(\hat{D}_v + \Delta_v)$.
First, if some nodes $v \in P$ have $M_v = 0$, they will have $\Delta_v = \infty$, thus one of them will be set as $v_s$, in which case the bound on $M_{v_s}$ trivially holds. Hence, we assume below that for all $v \in P$, $M_v \ge 1$.

Denote $D_{\max} := \max_{v \in P} D_v$. Let $u \in \arg\max_{v \in P} \hat{D}_v$ be a node with a maximal estimated discrepancy. Since $E_0$ holds, for all $v \in P$ we have $\hat{D}_v + \Delta_v \ge D_v$. Thus, by the definition of $v_s$, $\hat{D}_{v_s} + \Delta_{v_s} \ge D_{\max}$, which implies $\hat{D}_u \ge \hat{D}_{v_s} \ge D_{\max} - \Delta_{v_s}$. Therefore,
$$\beta(\hat{D}_u - \Delta_u) = \hat{D}_u + (\beta - 1)\hat{D}_u - \beta\Delta_u \ge \hat{D}_u + (\beta - 1)(D_{\max} - \Delta_{v_s}) - \beta\Delta_u.$$

Denote $\theta := (\beta - 1)D_{\max}/(2\beta)$, and assume for contradiction that $\Delta_{v_s} \le \theta$. By the definitions of $u$ and $v_s$, we have $\Delta_u \le \Delta_{v_s} \le \theta$. Thus, from the inequality above and the definitions of $\theta$, $\Delta_u$, $\Delta_{v_s}$,
$$\beta(\hat{D}_u - \Delta_u) \ge \hat{D}_u + (\beta - 1)(D_{\max} - \theta) - \beta\theta = \hat{D}_u + \theta \ge \hat{D}_{v_s} + \Delta_{v_s} \ge \max_{z \in P \setminus \{u\}}(\hat{D}_z + \Delta_z), \qquad (6)$$
implying that $SC(u, P)$ holds. But this means that the previous iteration should have split this node in the pruning, a contradiction. Therefore, $\Delta_{v_s} > \theta$. Since by Eq. (2), $\Delta_{v_s} = w^*_{v_s}\sqrt{2\ln(2K\pi^2 M_{v_s}^2/(3\delta))/M_{v_s}}$, it follows that
$$M_{v_s} < \frac{2(w^*_{v_s})^2}{\theta^2}\ln\Big(\frac{2K\pi^2 M_{v_s}^2}{3\delta}\Big).$$
Denoting $\varphi = 4(w^*_{v_s})^2/\theta^2$ and $\mu = \sqrt{2K\pi^2/(3\delta)}$, this is equivalent to $M_{v_s} < \varphi\ln(\mu M_{v_s})$. By Lemma F.1 in Appendix F, this implies $M_{v_s} < e\varphi\ln(e\mu\varphi)$. Noting that $\varphi = 16\alpha_{v_s}^2$ and bounding constants, this gives the required bound.

Lastly, we combine the lemmas to prove the main theorem.

Proof of Theorem 4.2. Consider a run in which $E_0$ holds. By Lemma 5.1, this occurs with a probability of $1 - \delta$. If $T$ has a split quality $q < 1$, the approximation factor follows from Lemma 5.3. Next, note that AWP makes a higher-order query of a node only if it is the right child of a node that was split in line 15 of Alg. 1. Thus, it makes $K - 1$ higher-order weight queries. This proves the first two claims.

We now use Lemma 5.4 to bound the total number of weight queries of examples under $E_0$. First, we lower bound $\max_{v \in P} D_v$ for any pruning $P$ during the run of AWP. Let $u \in \arg\max_{v \in P_o} D_v$. By the definition of $u$, we have $D_u \ge D_{P_o}/K$. Now, at any iteration during the run of AWP, some ancestor $v$ of $u$ is in $P$. By Lemma 5.2, we have $D_v \ge D_u/2$. Therefore, $\max_{v \in P} D_v \ge D_{P_o}/(2K)$. Thus, by Lemma 5.4, at any iteration of AWP, $v_s$ is set to some node in $P$ that satisfies $M_{v_s} < 22\,\alpha_{v_s}^2(\ln(\alpha_{v_s}^4 K/\delta) + 10)$, where $\alpha_{v_s} := \frac{\beta}{\beta - 1}\cdot\frac{2K w^*_{v_s}}{D_{P_o}}$. Hence, AWP takes at most $T_v := 22\,\alpha_v^2(\ln(\alpha_v^4 K/\delta) + 10) + 1$ samples for node $v$, and so the total number of example weight queries taken by AWP is at most $\sum_{v \in V} T_v$, where $V$ is the set of nodes that participate in $P$ at any time during the run. To bound this sum, note that for any pruning $P$ during the run of AWP, we have $\sum_{v \in P} w^*_v = 1$. Hence, $\sum_{v \in P} (w^*_v)^2 \le 1$. Since there are $K$ different prunings during the run, we have $\sum_{v \in V} (w^*_v)^2 \le K$. Substituting $\alpha_v$ by its definition and rearranging, we get that the total number of example weight queries by AWP is $O\big((1 + \frac{1}{\beta - 1})^2 K^3 \ln(1/\delta)/D_{P_o}^2\big)$.

6. Experiments

We report experiments that compare AWP to several natural baselines. The full results of the experiments described below, as well as results for additional experiments, are reported in Appendix H. A python implementation of the proposed algorithm and of all the experiments can be found at https://github.com/Nadav-Barak/AWP. The implementation can be easily run on a standard personal computer, with the longest runs taking a few hours.

The implementation of AWP includes two practical improvements. First, we use an empirical Bernstein concentration bound (Maurer and Pontil, 2009) to reduce the size of $\Delta_v$ when possible; this does not affect the correctness of the analysis. See Appendix G for details.
Second, for all algorithms, we take into account the known weight values of single examples in the output weighting, as follows. For $v \in P_o$, let $S_v$ be the examples in $L_v$ whose weight was queried. Given the output pruning $P$, we define the weighting $w^P$ as follows. For $x \in S_v$, $w^P(x) := w^*(x)$; for $x \in L_v \setminus S_v$, we set $w^P(x) := \big(w^*_v - \sum_{x' \in S_v} w^*(x')\big)/(N_v - |S_v|)$.

In all the plots, we report the normalized output distance $\mathrm{dist}(w^P, w^*) = \|w^P - w^*\|_1/2 \in [0, 1]$, which is equal to half the discrepancy of $w^P$. To fairly compare AWP to the baselines, they were allowed the same number of higher-order weight queries and basic weight queries as requested by AWP for the same inputs. The baselines are non-adaptive, thus the basic weight queries were drawn uniformly from the data set at the start of their run. We tested the following baselines: (1) WEIGHT: Iteratively split the node with the largest weight. Use basic weight queries only for the final calculation of $w^P$. The rationale here is to get a finer resolution pruning in more important parts of the data set. However, this can lead to wasted resources on parts of the tree which are both high-weight and low-discrepancy, and an unbounded approximation factor, as proved in Appendix A. (2) UNIFORM: Iteratively split the node $v$ with the largest $\hat{D}_v$, the estimator used by AWP, but without adaptive queries. (3) EMPIRICAL: Same as UNIFORM, except that instead of $\hat{D}_v$, it uses the naive empirical estimator, $\frac{N_v}{|S_v|}\sum_{x \in S_v}|w^*_v/N_v - w^*(x)|$ (see Section 3).

We ran several types of experiments, all with inputs $\delta = 0.05$ and $\beta = 4$. Note that $\beta$ defines a trade-off between the number of queries and the quality of the solution. Therefore, its value represents user preferences and not a hyperparameter to be optimized. For each experiment, we report the average normalized output distance over 10 runs, as a function of the pruning size. Error bars, displayed as shaded regions, show the maximal and minimal normalized distances obtained in these runs. For each input data set, we define an input hierarchical tree $T$ and test several target distributions, which determine the true weight function $w^*$.

In the first set of experiments, the input data set was Adult (Dua and Graff, 2017), which contains 45,000 population census records (after excluding those with missing values). We created the hierarchical tree via a top-down procedure, in which each node was split by one of the data set attributes, using a balanced splitting criterion. Similarly to the hospital example given in Section 1, the higher-order queries (internal nodes) in this tree require finding the proportion of the population with a specific set of demographic characteristics, which could be obtained via a database counting query. For instance, a higher-order query could correspond to the set of single government employees aged 45 or higher. To generate various target distributions, we partitioned the data set into ordered bins, where in each experiment the partition was based on the value of a different attribute. The target weights were set so that each example in a given bin was $N$ times heavier than each example in the next bin. We ran this experiment with values $N = 2, 4$. Some plots are given in Figure 1. See Appendix H for full details and results.

The rest of the experiments were done on visual data sets, with a tree that was generated synthetically from the input data set using Ward's method (Müllner, 2011), as implemented in the scipy python package.
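As an illustration of this step, an input tree of this kind can be produced with scipy's Ward linkage roughly as follows. The feature matrix below is synthetic and the helper function is ours, so this is only a sketch of the construction, not the exact preprocessing used in the experiments.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))      # one feature row per data set example (synthetic here)

Z = linkage(X, method="ward")        # Ward agglomerative clustering
root = to_tree(Z)                    # binary tree whose leaves are example indices

def leaves(node):
    """Indices of the examples under an internal node (the set L_v in the paper's notation)."""
    return [node.id] if node.is_leaf() else node.pre_order(lambda n: n.id)

print(root.get_count(), len(leaves(root.get_left())))
```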
First, we tested the MNIST (LeCun and Cortes, 2010) training set, which contains 60,000 images of hand-written digits, classified into 10 classes (digits). The target distributions were generated by allocating the examples into ordered bins based on the class labels, and allocating the weights as in the Adult data set experiment above, again with $N = 2, 4$. We tested three random bin orders. See Figure 1 for some of the results. See Appendix H for full details and results.

Lastly, we ran experiments using an input data set and target distribution based on data set pairs commonly studied in domain adaptation settings (e.g., Gong et al. 2012; Hoffman et al. 2012; Ding et al. 2015). The input data set was Caltech256 (Griffin et al., 2007), with 29,780 images of various objects, classified into 256 classes (not including the singular clutter class). In each experiment, the target distribution was determined by a different data set as follows: the target weight $w^*$ of each image in the Caltech256 data set was set to the fraction of images from the target data set that have this image as their nearest neighbor. We used two target data sets: (1) the Office data set (Saenko et al., 2010), of which we used the 10 classes that also exist in Caltech256 (1410 images); (2) the Bing data set (Bergamo and Torresani, 2010a;b), which includes 300 images in each Caltech256 class. For Bing, we also ran experiments where images from a single super-class from the taxonomy in Griffin et al. (2007) were used as the target data set. See Figure 1 for some of the results, and Appendix H for full results.

In all experiments, except for a single configuration, AWP performed better than the other algorithms. In addition, UNIFORM and EMPIRICAL behave similarly in most experiments, with UNIFORM sometimes being slightly better. This shows that our new estimator is empirically adequate, on top of its crucial advantage in getting a small $\Delta_v$. We note that in our experiments, the split quality $q$ was usually close to one. This shows that AWP can be successful even in cases not covered by Theorem 4.2. We did find that the average splitting values were usually lower; see Appendix I.

[Figure 1. Some of the experiment results. Full results are provided in Appendix H. Top to bottom: Adult data set, bins assigned by marital status, N = 4; MNIST: N = 2, N = 4; the Caltech-Office experiment.]

7. Conclusions

In this work, we studied a novel problem of approximating a distribution via weight queries, using a pruning of a hierarchical tree. We showed, both theoretically and experimentally, that such an approximation can be obtained using an efficient interactive algorithm which iteratively constructs a pruning. In future work, we plan to study the effectiveness of our algorithm under more relaxed assumptions, and to generalize the input structure beyond a hierarchical tree.

References

Alessandro Bergamo and Lorenzo Torresani. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Neural Information Processing Systems (NIPS), December 2010a.
Alessandro Bergamo and Lorenzo Torresani. The Bing data set. http://vlg.cs.dartmouth.edu/projects/domainadapt/data/Bing/, 2010b.
Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. COLT 2010, page 41, 2010.
Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in neural information processing systems, pages 137-144, 2007.
Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151-175, 2010.
Christopher Berlind and Ruth Urner. Active nearest neighbors in changing environments. In International Conference on Machine Learning, pages 1870-1879, 2015.
Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning, pages 81-88. ACM, 2007.
Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(Sep):2137-2155, 2009.
John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in neural information processing systems, pages 129-136, 2008.
Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12(May):1655-1695, 2011.
Adrian Corduneanu and Christopher M. Bishop. Variational Bayesian model selection for mixture distributions. In Artificial Intelligence and Statistics, volume 2001, pages 27-34. Morgan Kaufmann, Waltham, MA, 2001.
Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, and Ningshan Zhang. Adaptive region-based active learning. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2144-2153. PMLR, 13-18 Jul 2020. URL http://proceedings.mlr.press/v119/cortes20a.html.
Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In Proceedings of the 25th international conference on Machine learning, pages 208-215. ACM, 2008.
Zhengming Ding, Ming Shao, and Yun Fu. Deep low-rank coding for transfer learning. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
Mario A. T. Figueiredo and Anil K. Jain. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):381-396, 2002.
Alireza Ghasemi, Mohammad T. Manzuri, Hamid R. Rabiee, Mohammad H. Rohban, and Siavash Haghiri. Active one-class learning by kernel density estimation. In 2011 IEEE International Workshop on Machine Learning for Signal Processing, pages 1-6. IEEE, 2011.
Jacob Goldberger and Sam T. Roweis. Hierarchical clustering of a mixture model. In Advances in neural information processing systems, pages 505-512, 2005.
Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2066-2073. IEEE, 2012.
Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. http://www.vision.caltech.edu/Image_Datasets/Caltech256/, 2007.
Judy Hoffman, Brian Kulis, Trevor Darrell, and Kate Saenko. Discovering latent domains for multisource domain adaptation. In European Conference on Computer Vision, pages 702-715. Springer, 2012.
Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In VLDB, volume 4, pages 180-191. Toronto, Canada, 2004.
Samory Kpotufe and Guillaume Martinet. Marginal singularity, and the benefits of labels in covariate-shift. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1882-1886. PMLR, 06-09 Jul 2018.
Samory Kpotufe, Ruth Urner, and Shai Ben-David. Hierarchical label queries with data-dependent partitions. In Conference on Learning Theory, pages 1176-1189. PMLR, 2015.
Matej Kristan, Danijel Skočaj, and Aleš Leonardis. Online kernel density estimation for interactive learning. Image and Vision Computing, 28(7):1106-1116, 2010.
Yann LeCun and Corinna Cortes. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/, 2010.
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample-variance penalization. In Proceedings of the 22nd Annual Conference on Learning Theory, COLT 2009, 2009.
Clare A. McGrory and D. M. Titterington. Variational approximations in Bayesian model selection for finite mixture distributions. Computational Statistics & Data Analysis, 51(11):5352-5367, 2007.
Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM algorithm and extensions, volume 382. John Wiley & Sons, 2007.
Daniel Müllner. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378, 2011.
Rémi Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Advances in neural information processing systems, pages 783-791, 2011.
Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In European conference on computer vision, pages 213-226. Springer, 2010. URL https://drive.google.com/open?id=0B4IapRTv9pJ1WGZVd1VDMmhwdlE.
Bernard W. Silverman. Density estimation for statistics and data analysis, volume 26. CRC Press, 1986.
Aleksandrs Slivkins. Multi-armed bandits on implicit metric spaces. In Advances in Neural Information Processing Systems, pages 1602-1610, 2011.
Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, pages 1433-1440, 2008.
Ruth Urner, Sharon Wulff, and Shai Ben-David. PLAL: Cluster-based active learning. In Conference on Learning Theory, pages 376-397. PMLR, 2013.
Matt P. Wand and M. Chris Jones. Kernel smoothing. CRC Press, 1994.
E. L. Wilmer, David A. Levin, and Yuval Peres. Markov chains and mixing times. American Mathematical Soc., Providence, 2009.