# dart_distance_assisted_recursive_testing__85ed6b89.pdf Journal of Machine Learning Research 24 (2023) 1-41 Submitted 10/22; Revised 3/23; Published 4/23 DART: Distance Assisted Recursive Testing Xuechan Li xuechan.li@duke.edu Department of Biostatistics Duke University Durham, NC 27705, USA Anthony D. Sung anthony.sung@duke.edu Department of Medicine Duke University Durham, NC 27705, USA Jichun Xie jichun.xie@duke.edu Department of Biostatistics Duke University Durham, NC 27705, USA Editor: Sayan Mukherjee Multiple testing is a commonly used tool in modern data science. Sometimes, the hypotheses are embedded in a space; the distances between the hypotheses reflect their co-null/coalternative patterns. Properly incorporating the distance information in testing will boost testing power. Hence, we developed a new multiple testing framework named Distance Assisted Recursive Testing (DART). DART features in joint artificial intelligence (AI) and statistics modeling. It has two stages. The first stage uses AI models to construct an aggregation tree that reflects the distance information. The second stage uses statistical models to embed the testing on the tree and control the false discovery rate. Theoretical analysis and numerical experiments demonstrated that DART generates valid, robust, and powerful results. We applied DART to a clinical trial in the allogeneic stem cell transplantation study to identify the gut microbiota whose abundance was impacted by post-transplant care. Keywords: multiple testing, hierarchical testing, aggregation tree, false discovery rate, auxiliary information 1. Introduction Multiple testing is commonly used to discover important features in modern data science. Each feature represents a hypothesis: the important features correspond to alternative hypotheses, and the rest to null hypotheses. A rejected hypothesis corresponds to an identified feature. The goal is to discover the alternative hypotheses with a controlled false discovery rate (FDR), the expected number of false discoveries over the total number of discoveries. Under many circumstances, the hypotheses are coupled with other attributes in a metric space with a known pairwise distance. For example, previous brain studies have shown that the anatomical distance between the neurons can partially inform brain activities and neurons 2023 Xuechan Li, Anthony D. Sung, and Jichun Xie. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1131.html. Li, Sung and Xie co-functioning patterns (Alexander-Bloch et al., 2013; Perinelli et al., 2019; Kristanto et al., 2020). For another example, the polygenic distance between amplicon sequence variants (ASVs) often informs their functional similarity and survival (Chen et al., 2012; Garcia et al., 2014; Martiny et al., 2015). In these examples, if we form neurons or ASVs as hypotheses, properly incorporating anatomy distance or the polygenic distance between hypotheses can improve the power in identifying functionally important neurons or ASVs. In other words, the distance among the hypotheses partially reflects the hypotheses co-null or co-alternative status, called co-status hereafter. We developed a new hierarchical multiple testing procedure called DART. It incorporates distance information to boost the power of testing. DART has two stages. The first stage is based on AI modeling: we use the automatic algorithm to construct an aggregation tree that incorporates the distance information and facilitates downstream testing. The second stage is based on statistical modeling: we perform a bottom-up multiple testing procedure on the aggregation tree to control FDR. Unlike traditional multiple testing that only uses statistical modeling, DART combines the power of statistical models and AI models to improve testing accuracy: statistical models are less flexible and less data-adaptive, but they can provide highly interpretable results; AI models fail to provide interpretable results, but they can explore complex underlying structures. Our study provides a new solution to generate data-adaptive and highly-interpretable results. Our work is closely related to three streams of research. Testing using distance/side information. Some methods incorporate side information via parametric modeling. For example, Shu et al. (2015) imposes a 3D hidden Ising model to model the nearby hypothesis co-status. Lee and Lee (2016) uses a scalar parameter in a specific exponential-family distribution to control the level of co-status among nearby hypotheses. Cai et al. (2020) uses kernel functions to enforce similar prior null probabilities on nearby hypotheses. Lei and Fithian (2018) proposes an iterative FDR control procedure that incorporates side information via a parametric model estimated by the EM algorithm. These methods use parametric forms to model how distance/side information impacts the hypothesis co-status. Although parametric methods usually achieve good performances on simulated data, their performance on real data is unclear. For example, for f MRI analysis, Eklund et al. (2012) used extensive real data sets to show that many commonly used parametric models that specify temporal correlations in f MRI analysis are inappropriate, and thus lead to highly inflated type I error in multiple testing. Recently, Ramdas et al. (2019b) develops an FDR control procedure that allows scientists to incorporate four types of prior knowledge simultaneously. The procedure allows mix and match techniques and using multiple different forms of prior knowledge simultaneously while maintaining internal consistency among the pattern of rejections and acceptances. Besides, Xia et al. (2017) and Tansey et al. (2018) employ neural networks to leverage side information to improve testing accuracy while controlling FDR. All these methods specify the side information for each hypothesis specifically. However, the distance information is not defined for each hypothesis but for each hypothesis pair, and thus cannot be incorporated directly. Alternatively, some methods adopt non-parametric models to incorporate distance information into multiple testing. Zhang et al. (2011) uses a local neighborhood size and uses the aggregated DART: Distance Assisted Recursive Testing P-values in the neighborhood to perform multiple testing, called FDRL. Li et al. (2013) uses the non-parametric propagation-separation approach (Belomestny and Spokoiny, 2007) to smooth the coefficients in the generating generalized estimating equations (GEE) models. These methods are more data adaptive. Because Li et al. (2013) only works for the GEE models, we will mainly compare DART with FDRL later. Hierarchical multiple testing. a) Gatekeeping. Suppose the hypotheses are grouped into m 2 ordered families. Gatekeeping procedures test a later family only if they reject the previous families (Dmitrienko et al., 2007, 2008, 2011; Dmitrienko and Tamhane, 2013; Xi and Tamhane, 2014) . The aim is to control the family-wise error rate (FWER). Gatekeeping procedures are usually designed for clinical trials. For other applications, the tests lose power when they discard the families of hypotheses whose previous family is accepted. b) Top-down hierarchical testing. Soriano and Ma (2017) applied a tree-structured Markov prior distribution to the indicators of hypotheses status and then calculated their posterior being alternative. The method relies on parametric modeling. Yekutieli (2008) proposes a top-down testing rule similar to gatekeeping. A node on the tree (a set with multiple hypotheses) will be tested only if its parent node is rejected. Other top-down multiple testing procedure also have been proposed for hypotheses structured in DAG, such as Ramdas et al. (2019a),Loper et al. (2022), Meijer and Goeman (2015), and Goeman and Finos (2012). These methods require the hypotheses to be partially ordered in the DAG, and thus not applicable to general hypotheses testing. Lei et al. (2020) proposes an iterative testing algorithm to perform FDR control on a series of contiguous candidate sets in a constrained set. However, for hypotheses with distance information, how to turn the distance into contiguous candidate sets is unclear. c) Bottom-up hierarchical testing. This approach tests the individual hypotheses first and then tests the aggregated hypotheses later. Our method, DART, adopts this approach. The most similar work to ours is the one introduced by Li et al. (2020b). They proposed a bottom-up procedure to test conjugate nodes (sets of hypotheses) with tree structures. A conjugate node is alternative only when all its containing hypotheses are alternative. Their method aims to control the node-level FDR. In contrast, DART focuses on each hypothesis. It aims to control hypothesis-level FDR. Thus, DART is fundamentally different. Explainable AI (XAI). XAI aims to (a) produce more explainable models while maintaining a high level of learning performance (prediction accuracy) and (b) enable human users to understand, appropriately trust, and effectively manage the emerging generation of AI partners. (Turek, 2021) However, to the best of our knowledge, no XAI method has been proposed to control FDR in multiple testing. In general, pure AI modeling introduces intrinsic difficulties in controlling FDR. Li, Sung and Xie 2. Preliminaries Suppose there are m hypotheses, forming the null and alternative hypothesis sets Ω0 and Ω1: Ω0 Ω1 = , Ω0 Ω1 = [m]. For hypothesis i, a P-value statistic Ti is derived. Previous studies often assume the null P-values follow super-uniform distributions: i Ω0, p (0, 1), P(Ti p) p. Our work relaxes this assumption. We allow a null Ti asymptotically converges to a superuniform statistic Ti in the following way. sup i Ω0 sup p Pi0 P(Ti < p) P( Ti < p) 1 δ0m, where lim m δ0m = o(1) and Pi0 = n p [0, 1] : P( Ti < p) m(log m log log m)1/2 1o . (1) Pi0 excludes the left tail regions (close to zero) to make the convergence easier. Otherwise, if P( Ti < p) is too small, the convergence is hard to achieve. If Ti follows a super-uniform (resp. uniform) distribution, we call Ti asymptotically super-uniform (resp. uniform). We relax the P-value null distribution assumptions because many P-values derived from asymptotic tests (e.g., Wald, score, and likelihood-ratio tests) do not follow super-uniform distributions, but they are asymptotically super-uniform. We do not make assumptions on alternative P-value distributions. Although they can be arbitrary, it is useful to think of the alternative P-values have larger probabilities to be small than the uniform distributions. Thus, many methods reject the hypotheses when the P-values are small. For example, a commonly used P-value threshold (Liu et al., 2013; Cai and Liu, 2016a; Xie and Li, 2018) is ˆt = sup αm t α : mt P i [m] I(Ti < t) α , where αm = (m log m) 1. (2) Here, the dotted fraction notation denotes the shorthand a b = a b 1. The threshold ˆt for the Benjamini and Hochberg procedure (BH) (Benjamini and Hochberg, 1995) is slightly different but asymptotically equivalent to (2). In general, the rejection set is R = {i : Ti t} for some threshold t. Then, U = Ω1 R is called the true discovery set, and V = Ω0 R is called the false discovery set. The false discovery proportion and rate are defined as FDP := |V| |R| , FDR := E (FDP) , where |A| denotes the cardinality of set A. Our task is to design a powerful rejection rule to asymptotically control FDR when m goes to infinity. We hope to gain power by properly incorporating the distance information between hypotheses, assuming they reflect their co-status patterns. 3. DART Algorithm Description DART has two stages. Stage I uses an AI method to transfer the distance matrix into an aggregation tree, which defines the testing structure (Section 3.1). Stage II tests hypotheses embedded in the tree to gain power while controlling FDR (Section 3.2). An illustrating example is provided in Section 3.3. DART: Distance Assisted Recursive Testing 3.1 Distance Matrix and Aggregation tree Stage I aims to construct an aggregation tree T that provides a hierarchical testing structure for stage II. The aggregation tree T has L layers, constructed based on the distance matrix D = (dij). The node-set on layer ℓis denoted by A(ℓ). On the first (bottom) layer, A(1) = {{1}, . . . , {m}}; each node represents a hypothesis. In general, A(ℓ) contains the nodes representing a set of hypotheses. The hypotheses have different co-status patterns: some alternative hypotheses might stand-alone, some co-alternative within a small region, and others co-alternative within a large region. For any node A, denote its diameter (within-node distance) by dia(A) = maxi,j A dij. On higher layers of T , nodes have larger diameters. We design multiple layers to construct nodes with various diameters and adapt to various co-status patterns. In stage II, if a node is rejected, we will reject all the hypotheses within the node. For this purpose, we require the hypothesis distance within the node to be small so that they are likely co-status; thus, imposing the same decision rule on these hypotheses is reasonable. We require that for all A A(ℓ), dia(A) g(ℓ), where g(ℓ) is a distance threshold with 0 < g(2) < . . . < g(L). We set g(1) = 0 because we do not need to aggregate hypotheses on the first layer. Another requirement is to limit nodes child numbers. Any node A in A(ℓ) with ℓ 2 is formed by aggregating the nodes from the previous layer. These nodes are called the children of A; they form the child set C(A). We require that |C(A)| M for any node A. Here, M is called the maximum node size. We set up this requirement to increase the following testing s stability: if A contains too many hypotheses, rejecting A leads to rejecting all hypotheses in A. This introduces large variability in testing and makes the algorithm outputs unstable. To make sure all the nodes on layer ℓsatisfy dia(A) g(ℓ) and |C(A)| M, we developed Algorithm 1. The key strategy is to recursively set A(ℓ) based on A(ℓ 1) and find the closest node pair for aggregation. The resulting tree depends on the tuning parameter L, M, and g(ℓ) with ℓ {2, . . . , L}. We introduce the tuning parameter selection criteria in Section 3.4. Algorithm 1 is a feasible algorithm to construct an aggregation tree satisfying dia(A) g(ℓ) and |C(A)| M. Alternative AI approaches could also be used. For example, an aggregation tree can be constructed by recursively applying community detection algorithms, such as K-means clustering (Jin and Han, 2010), Louvain method (Blondel et al., 2008), and Leiden algorithm (Traag et al., 2019), in the similar spirit of hierarchical clustering (Zepeda-Mendoza and Resendis-Antonio, 2013). If a modified algorithm based on these algorithms could restrict the maximum node size, it may also be applied here. The key of the desired algorithm is to generate a tree with few mixed nodes (defined in Section 4) to ensure the asymptotic FDR control of Step 2. 3.2 Recursive Testing Embedded in the Tree Recall that the tree nodes consist of close hypotheses likely to be co-status. We developed a multiple testing algorithm that incorporates the tree to improve the testing power. Algorithm 2 describes the recursive testing procedure from the single-hypothesis layer (bottom) to the large-node layer (top). It starts with testing all hypotheses using the traditional FDR control procedure. This step does not use any hypothesis co-status patterns. Thus, DART is likely to discover the hypotheses with strong signal-to-noise ratios (SNRs) on Li, Sung and Xie the bottom layer. On higher layers, DART tests larger nodes containing more hypotheses. The weaker-SNR hypotheses are likely to be aggregated then to increase their identification chances. Algorithm 2 mentioned a few new terms: dynamic nodes, candidate node P-values, and recursive P-value cutoffs. We provide more details on these terms below. Candidate dynamic nodes. The dynamic nodes are the nodes excluding the rejected hypotheses on previous layers. If a hypothesis is rejected, we will not test it again. There are two main reasons. First, if the rejected hypotheses were not removed, they could become nonsignificant after being aggregated with other null hypotheses on higher layers. It introduces interpretation difficulties. Second, a rejected hypothesis on low layers usually has strong SNRs. If we include these hypotheses in higher layers, this hypothesis alone may make the entire node significant, even if the node contains null hypotheses. Then, the rejection of the entire node may lead to false rejections on its containing null hypotheses. A candidate dynamic node is a dynamic node with |C(S)| 2. If a node S has |C(S)| = 1, this node must have been tested on previous layers. Thus, we do not need to test it again. The set of the candidate dynamic nodes on layer ℓis denoted by B(ℓ). Candidate node P-values. For any candidate dynamic node S, we use the Gaussian aggregation approach to derive candidate node P-value: TS = Φ n P j S Φ 1(Tj)/ p where Φ is the complement cumulative density function (CCDF) of the standard Gaussian distribution. The aggregation is efficient. Alternatively, we may consider using other aggregation approaches, such as the Fisher s combination (Fisher, 1925), the chi-square aggregation, and the Cauchy aggregation (Liu and Xie, 2020). The Fisher s combination and the chi-square aggregation approaches have lower power than the Gaussian aggregation when the hypotheses in the node are co-status. The Cauchy aggregation relies on the heavy-tail Cauchy density; thus, it introduces theoretical challenges to studying the asymptotic null distributions of the node P-values. The main challenge lies in accounting for the post-selection effect: the hypotheses and their P-values only have the chance to be aggregated when they are not rejected on the previous layers. In contrast, for Gaussian aggregation, the postselection problem can be solved because the candidate node P-values are still asymptotically super-uniform (Lemmas 7 and 8). Thus, we go with the Gaussian aggregation. P-value cutoffs. On layer ℓ, we already have the set of the rejected hypotheses on the previous ℓ 1 layers, denoted by R(1:ℓ 1). We set up the P-value threshold ˆt(ℓ) as ˆt(ℓ) = sup αm t α : Pℓ 1 ℓ =1 m(ℓ )ˆt(ℓ ) + m(ℓ)t |R(1:ℓ 1)| + P S B(ℓ) |S|I(TS < t) α Here, αm = (m log m) 1 and m(ℓ ) = |B(ℓ )|. It is easy to see that ˆt(ℓ) depends on the previous layer s cutoff and rejection set. Comparing ˆt in (2) and ˆt(ℓ) in (3), it is easy to see that they share some similarities. The numerators bound the false discoveries asymptotically, and the denominators count the total discoveries. By making the fraction less than or equal to the desired FDR level α, DART asymptotically controls the weighted node-FDR (Section 4). DART also asymptotically controls the hypothesis level FDR when most nodes contain co-status hypotheses. DART: Distance Assisted Recursive Testing Algorithm 1: Transform the distance matrix into an aggregation tree. Data: Distance matrix D = (dij)m m, maximum layer L, maximum node size M, distance thresholds g(ℓ) with ℓ [L]. Result: An aggregation tree T = {A(ℓ) : ℓ [L]}. A(1) = {{1}, . . . , {m}}; for ℓ {2, . . . , L} do A(ℓ) = A(ℓ 1); flag=TRUE ; // Initiate A(ℓ) Calculate the between-node distances: A1, A2 A(ℓ), dist(A1, A2) = maxi A1,j A2 dij; while flag=TRUE do Pick up the node pair ( A1, A2) with the smallest between-node distance; if dist( A1, A2) > g(ℓ) then flag=FALSE ; else if |C( A1 A2)| M then Put A = A1 A2 in A(ℓ) and remove A1 and A2; Set the between-node distance: A A(ℓ), dist(A, A) = maxi A,j A dij; else Update the between-node distance: dist( A1, A2) = ; Algorithm 2: Recursive testing embedded in the tree. Data: P-values T1, . . . , Tm; tree T = {A(ℓ) : ℓ [L]}. Result: Rejection set R. Following the BH procedure (2) to set the threshold ˆt(1) and R = I(i : Ti < ˆt(1)); for ℓ {2, . . . , L} do Define the candidate dynamic node set B(ℓ) = {S : S = A \ R, |C(S) 2|}; For all S B(ℓ), get the candidate node P-values TS ; Set the recursive P-value cutoff ˆt(ℓ) as in (3) and let R(ℓ) node = {S : TS < ˆt(ℓ)}; Map the rejections to hypothses and update R = R {i : i S, S R(ℓ) node}. Li, Sung and Xie 3.3 A Toy Example to Illustrate DART We provide a toy example in Figure 1 to illustrate DART. The detailed algorithm descriptions are provided in Algorithm 1 and Algorithm 2 in Section 3. Setting up dynamic working Signal-to-noise (a) 1 2 3 4 5 6 7 1 0 2 4 5 5 8 11 2 2 0 2 3 3 6 9 3 4 2 0 1 1 8 11 4 5 3 1 0 2 9 12 5 5 3 1 2 0 9 12 6 8 6 8 9 9 0 3 7 11 9 11 12 12 3 0 Stage II: Embed multiple testing in the tree (c) Stage I: Transform the distance matrix into an aggregation tree 1 2 3 4 5 6 7 Layer 2 Layer 3 11 3 2 5 4 6 7 1 2 4 5 6 7 1 3 and FDR control Tree nodes Dynamic nodes Rejected nodes Accepted nodes Figure 1: An illustrating example of DART with 7 features. Figure 1(a) shows the distance matrix of the seven hypotheses. In stage I, we transfer the distance matrix into the 3-layer aggregation tree based on Algorithm 1. The resulting 3-layer aggregation tree is shown in Figure 1(b). Each leaf (node on the bottom layer) on the tree corresponds to a hypothesis and a test statistic. The gray scales illustrate the underlying SNR ratios in the statistics; these SNR ratios are unknown. In stage II, we perform the multiple testing embedded in the aggregation tree based on Algorithm 2. We start from the bottom layer and hierarchically proceed to higher layers. On layer 1 (the bottom layer), hypotheses 1 and 3 are rejected because their test statistics have strong SNR ratios. DART: Distance Assisted Recursive Testing After the testing procedure on each layer, the rejected nodes are marked by solid squares and the accepted nodes solid hexagons. If a node is rejected, all its containing hypotheses are rejected. When testing on a higher layer, all previously rejected features are excluded from the nodes (dashed-line circled) to form the dynamic nodes (solid-line circled) on this layer. For example, on layer 2, tree node A1 = {1, 2} turns into the dynamic node S1 = {2} because hypothesis 1 is rejected on the bottom layer. S1 will not be tested on layer two because it only contains hypothesis 2, already tested on layer 1. Using the terminology in Section 3.2, the number of children |C(S2)| = 1; thus, S1 is not a candidate dynamic node and will not be tested on layer 2. For another example, A2 = {3, 4, 5} turns to the dynamic node S2 = {4, 5} on layer 2 because hypothesis 3 was rejected on the bottom layer. |C(S2)| = 2 and thus, it will be tested on layer 2. The test statistics of hypotheses 4 and 5 have relatively weak SNRs. They are not significant enough to be rejected on layer 1. However, on layer 2, they are aggregated to form the dynamic node S2 = {4, 5}; the aggregated SNR is large enough so that S2 is rejected, leading to the rejections of hypotheses 4 and 5. Thus, the power of DART is higher than the power of a single-layer testing method. 3.4 Tuning Parameter Selection Proper parameters will result in a tree empowering testing. We suggest setting the maximum node size M as 2 or 3. Appendix C verified that when M = 2 or M = 3, DART asymptotically controls FDR and is more powerful. Denote the desired minimal number of nodes on layer L by cm. If cm < 35, DART s asymptotic validity might fail to kick in, leading to possibly inflated FDR. Therefore, we request cm 35. Accordingly, we set the maximum layer number L = log M m log M cm log M m log M 35 . The distance thresholds are set recursively to maximize the number of candidate nodes on each layer. We first try a set of possible thresholds G = {g1, . . . , g K}. On layer ℓwith ℓ 2, we let G(ℓ) = {g G : g g(ℓ 1)}. Hierarchically, on layer ℓ, we try every g G(ℓ) and count the number of resulting candidate nodes on this layer. We set g(ℓ) as the g with the most candidate nodes. See Algorithm 3 in Appendix C for details. 4. Asymptotic Validity This section shows that DART asymptotically controls the hypothesis FDR under mild conditions. Weighted node-FDR and hypothesis FDR. For any candidate dynamic node, if the node contains any alternative hypothesis, we call the node alternative; otherwise, it is called null. On layer ℓ, we denote the set of null candidate dynamic nodes by B(ℓ) 0 and the set of rejected nodes by R(ℓ) node. Then the weighted node-FDR is S R(ℓ) node B(ℓ) 0 |S| PL ℓ=1 P S R(ℓ) node |S| = S R(ℓ) node B(ℓ) 0 |S| |R| , FDRnode = E(FDPnode). Li, Sung and Xie In contrast, the hypothesis FDR is FDP := |R| Ω0 |R| , FDR = E(FDP). Notably, FDPnode only accounts for the false discoveries in null nodes. If an alternative node containing both null and alternative hypotheses is rejected, the rejection will not increase the numerator of FDPnode but will increase the numerator of FDP. Thus, FDPnode FDP. Although our ultimate goal is to control FDP, we control FDPnode as an intermediate step. The difference between FDPnode and FDP relies on the number of the rejected mixed nodes that contain both null and alternative hypotheses. Weighted node-FDR control. We require the following conditions for weighted node-FDR control. Condition 1. Sparse alternatives. The alternative hypothesis number m1 = O(mr1), for some r1 < (ML 1 + 1) 1. Condition 2. Sufficient moderate SNR nodes (see Definition 4 in Appendix A). Denote the number of moderate SNR nodes on the tree T by mmd. We require that mmd O(log m). A moderate SNR node (a) contains no hypotheses that will be rejected with non-vanishing probabilities before its locating layer, and (b) will be rejected on its locating layer with a non-vanishing positive probability. The existence of these nodes is to guarantee that some alternative nodes are rejected on each layer so that the threshold ˆt(ℓ) is not too small; otherwise, the number of total rejections will be too small so that a single false rejection would inflate FDR. Condition 3. Almost independence. Most hypothesis P-values are mutually independent. The number of dependent P-values does not exceed o(mmd). Here, Conditions 1 and 2 are inherited and extended from the previous multiple testing literature (Cai and Liu, 2016b). Condition 1 assumes the alternative hypothesis sparsity. Condition 2 usually holds when the sample size n is sufficiently large compared to p, L is properly chosen, and the signal-to-noise ratio distribution of the alternative hypotheses has continuous support over a large range. Condition 3 is a strong assumption. We require it to ensure that after higher-layer aggregation, most node P-values are still asymptotically super-uniformly distributed under the null. It is possible to relax this condition. However, the proof will be much more complicated. Lemma 1 Under Conditions 1 3, at any pre-specified level α (0, 1), Algorithm 2 satisfies the following asymptotic validity. (a) For any ϵ > 0, limm P(FDPnode α + ϵ) = 1. Consequently, limm FDRnode α. (b) Let Ω0 be the set of null P-values that are asymptotically uniform. If limm | Ω0|/m = 1, then for any ϵ > 0, limm P(|FDPnode α| ϵ) = 1. Consequently, limm FDRnode = α. See Appendix B for proof of this lemma. Two main challenges in the proof are the hierarchical testing structure and the post-selective effect introduced by the dynamic nodes. Thus, we proved the FDRnode control recursively, starting from the bottom layer. The bottom layer follows the BH procedure. Then, given the FDRnode control on the previous layers, we proved the control on the current layer. Recall that dynamic nodes do not contain the already-rejected hypotheses. To account for the post-selection effect, we proved that DART: Distance Assisted Recursive Testing conditioning on the testing results from the previous layers, the dynamic node P-values are still asymptotically super-uniform or asymptotically uniform. Hypothesis FDR control. Previously, we constructed a tree where the hypotheses were hierarchically aggregated into the nodes based on their distance. Thus, we expect many nodes contain co-status hypotheses. However, some large nodes on high layers may be mixed, containing null and alternative hypotheses. Some mixed nodes are concerning, while others are not. For example, suppose a node A on layer ℓcontains null hypotheses, strong SNR hypotheses (see Definition 5 in Appendix A) and weak SNR hypotheses (see Definition 6 in Appendix A). The strong SNR hypotheses are probably rejected before layer ℓ. Thus, when Algorithm 2 reaches layer ℓ, A probably already turns into a dynamic node only containing the null and weak SNR hypotheses. The weak SNR hypotheses are the alternative hypotheses unlikely to be rejected. As a result, the null hypotheses in A will not be rejected, and thus, the existence of A will not inflate the FDR. Thus, to control FDR, we only need to restrict those concerning mixed nodes whose null hypotheses are likely to be rejected with non-vanishing probabilities. Definition 2 (Concerning mixed nodes) For any node A A(L), let A = A \ (Ωst Ωwk), where Ωst is the strong SNR hypothesis set, and Ωwk is the weak SNR hypothesis set. The definitions of Ωst and Ωwk are provided in Appendix A. If A Ω0 = and A Ω1 = , we call A a concerning mixed node. Condition 4. sparse concerning mixed nodes. On the top layer of the tree A(L), the number of the concerning mixed nodes cannot exceed o(mmd). We allow the existence of concerning mixed nodes, but to asymptotically control FDR, Algorithm 2 cannot afford too many of them. Condition 4 specifies the tolerance level. Intrinsically, Condition 4 depends on the assumption that the distance matrix predominantly reflects the hypothesis co-status. If so, with properly selected {g(ℓ) : ℓ [L]}, Algorithm 1 will probably generate a tree satisfying Condition 4, because it uses the greedy algorithm to aggregate the closest remaining hypotheses. On the other hand, if this assumption does not hold, Algorithm 1 cannot generate a tree with nodes implying hypothesis co-status. Under this case, we do not recommend using DART. By adding Condition 4, we extend the FDRnode control to FDR control (Theorem 3). Theorem 3 Under Conditions 1-4, at any pre-specified level α (0, 1), Algorithm 2 satisfies the following asymptotic vality. (a) For any ϵ > 0, limm P(FDP α + ϵ) = 1. Consequently, limm FDR α. (b) Let Ω0 be the set of null P-values that are asymptotically uniform. If limm | Ω0|/m = 1, then for any ϵ > 0, limm P(|FDP α| ϵ) = 1. Consequently, limm FDR = α. 5. Numerical Simulation We simulated m = 1000 features located in the two-dimensional Euclidean space with randomly generated location coordinates: the first coordinate follows N(0, 2), and the second Li, Sung and Xie coordinate follows Unif(0, 4). A distance matrix D = (dij)m m was calculated based on the Euclidean distance between two features locations. Feature i links to a parameter of interest θi. The hypotheses are H0,i : θi = 0 versus H1,i : θi = 0, i [m]. We considered four settings, SE1 SE4. SE1 simulated a straightforward case where the P-values follow uniform distributions under the null. SE2 misspecified the null distributions of the test statistics, in order to evaluate the methods robustness. SE3 simulated the linear regression model, and SE4 simulated the Cox proportional hazard model; their P-values were derived from the Wald tests. Each setting contained 200 repetitions. The setting details were described in Appendix C. Under different nominal FDR levels α {0.05, 0.1, 0.15, 0.2}, we compared the performance of DART and its competitors: BH (Benjamini and Hochberg, 1995), Ada PT (Lei and Fithian, 2018) and FDRL (FDRL I and FDRL II) (Zhang et al., 2011). Ada PT incorporates the location coordinates as side information; DART incorporates the distance matrix; FDRL incorporates the information of each hypothesis s k nearest neighbors. Thus, the settings favor Ada PT because we provided it with the most information. The tuning parameters used in DART, Ada PT, and FDRL procedures were discussed in Appendix C. Figure 2A shows the type I error (measured by average FDP) and power (measured by average sensitivity) and their error bars of various methods: Average FDP: Under SE1, SE3, and SE4, DART, BH, Ada PT, and FDRL II control the average FDP well. Under SE2, DART s (resp. BH s) average FDP is slightly inflated when α = 5% (resp. α 15%). This is because we deliberately misspecified the P-value null distributions in SE2. Ada PT has consistently good FDR control. In contrast, FDRL I exhibited severe FDR inflation under all four settings. FDRL I has longer error bars than DART; so does FDRL II when α 10%. This suggests that DART s FDP is less variable than FDRL I and II. Average sensitivity: DART s sensitivities are consistently higher than BH. DART has much higher sensitivities than Ada PT (resp. FDRL II) when α 15% (resp. α 10%) and slightly lower sensitivities when α = 20%. If a low nominal FDR level (such as 5%) is preferred, DART is the most powerful among all methods. Computation time: DART is computationally efficient. For example, on average, one run (per repetition) of DART takes only 0.64 minutes across all settings. In contrast, Ada PT failed in generating any testing results within 8 minutes in about 17% of the runs (Table 1 in Appendix C). Among Ada PT s successful runs (within 8 minutes), one run on average takes 3.90 minutes. DART is at least 6 times faster. DART assumes that the distances reflect the hypothesis co-status patterns. However, in practice, this assumption could be partially violated. To assess the methods robustness, we switched the proportion τ of the alternative hypotheses with the null hypotheses (Appendix C). Figure 2B shows that FDRL have inflated average FDP when the switching proportion τ 6%. All other methods still have good FDR control. Even under these assumption partial violation cases, DART s sensitivity is still much higher than BH. Compared to Ada PT, DART still has higher sensitivity when α 15% and a slightly lower average sensitivity when α = 20%. These results show that DART s performance is consistently satisfying even when the data are less ideal. DART: Distance Assisted Recursive Testing Test BH FDRL I FDRL II Ada PT DART α=0.05 α=0.1 α=0.15 α=0.2 FDP Sensitivity 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 0.0 0.1 0.2 0.3 0.0 0.2 0.4 0.6 α=0.05 α=0.1 α=0.15 α=0.2 τ=0.02 τ=0.04 τ=0.06 τ=0.08 τ=0.1 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 Figure 2: A, FDP and sensitivity of various testing methods under SE1-SE4. B, FDP of various testing methods under SE1-SE4 when proportion τ of the alternative hypotheses were switched with the null. The dashed lines in the FDP panels mark the nominal FDR levels. Li, Sung and Xie One reason for DART s satisfying performance is because DART incorporates AI modeling to transfer the distance matrix into an aggregation tree, which later defines the testing structure. The AI modeling is data adaptive (compared to fixed neighborhood modeling as in FDRL) and robust (compared to the parametric modeling as in Ada PT) and thus generate satisfying results under various settings. 6. Real-world Experiment We applied DART to a clinical trial on hematopoietic stem cell transplantation (HCT). Graft-versus-host disease (GVHD) is one of the major complications of HCT. Recent studies have linked GVHD to the disruptions of the gut microbiome (Jenq et al., 2012). The disruptions may be related to the environmental changes such as post-transplant care (Claesson et al., 2012). This study investigates the impact of two post-transplant cares, home care (HC) and standard hospital care (SH), on patients gut microbiota compositions. In our data, patient fecal samples were collected before and after HCT; the fecal microbiome are sequenced by the 16S ribosomal RNA sequencing at the Memorial Sloan Kettering Cancer Center. The data were then pre-processed by the R package, DADA2 (Callahan et al., 2016), to generate the amplicon sequence variants (ASV) and the read counts. To improve the analysis quality, we removed the ASVs present in fewer than 10% of the samples. Samples with follow-up time longer than 1-year was also removed from the study. The zero counts were replaced by 0.5 (Aitchison, 1982; Kurtz et al., 2015). After pre-processing, our data contain 456 microbiome samples from 126 leukemia patients before and after the HCT. Each microbiome sample contains 866 amplicon sequence variants (ASVs). We excluded the 9 ASVs with missing taxonomy order information. In microbiome studies, the ASV relative abundance (measured by its abundance proportions) is more meaningful than its absolute abundance. Thus, for the remaining 857 ASVs, we calculated their log odds. Here, the odds for an ASV is odds = ASV abundance proportion 1 ASV abundance proportion. These ASV s abundance proportions do not add up to 1 because 9 ASVs were excluded. We set up the longitudinal linear mixed model Yijk =θ0,i + θ1,i W1,k + θ2,i W2,jk + θ3,i W1,jk W2,jk + bij + ϵijk. (4) Here i is the ASV index, j is the sample index, and k is the patient index. The outcome Yijk is the log odds of ASVi when sample j of patient k was collected. For patient k, W1,k is one s after-transplant care type (HC for 1 and SH for 0), W2,jk is care time length. bij is the random effect to incorporate the dependence across measurements for the same patient, and ϵijk is the random error. To identify ASVs whose abundance change is impacted by the after-transplant care (the interaction between the post-transplant care type and time), we set up the hypotheses: H0,i : θ3,i = 0, i [857]. The distances between the hypotheses were defined by the evolutionary distances among ASVs. Previous studies showed that evolutionally close ASVs might be functional similar. (Chen et al., 2012; Garcia et al., 2014; Martiny et al., 2015). We used the Wald tests to calculate the P-values. DART: Distance Assisted Recursive Testing Bacillales Bacteroidales Clostridiales Erysipelotrichales Lactobacillales Verrucomicrobiales Bacteroidaceae Peptostreptococcaceae Lachnospiraceae Clostridiaceae Erysipelotrichaceae Lactobacillaceae Enterococcaceae Verrucomicrobiaceae Bacteroides Clostridium_XI Intestinibacter Anaerostipes CSS Lactobacillus Enterococcus Akkermansia 0e+00 4e 04 0e+00 2e 04 Figure 3: Forest plots to visualize the after-transplant time effect in home care group (HC) and standard hospital care group (SH) on the abundance of 43 DART-identified ASVs. A dot represents the estimated time effect on an ASV; a vertical line marks its 95% confidence interval. Here, CSS stands for Genus Clostridium sensu stricto. We applied BH and DART to test the hypotheses with the nominal FDR level 5%. Details about aggregation tree construction in DART can be found in Appendix C. BH failed to identify any important ASVs. In contrast, DART identified 43 ASVs by incorporating the evolutionary distance information among ASVs. Among them, 39 ASVs have well-annotated Genus information. Figure 3 shows their log relative abundance change across time in home-care (HC) and standard hospital care (SH) groups Higher abundance in Enterococcus, Clostridium XI and Akkermansia were associated with more severe GVHD (Payen et al., 2020; Li et al., 2020a,a; Shono et al., 2016). Most of the ASVs in these three genera had less abundance over time in the home care group, suggesting home care might help reduce the GVHD severity. Higher abundances in Bacteroides, Anaerostipes and Lactobacillus were associated with reduced GVHD severity (Payen et al., 2020; Lin et al., 2021). Almost all the identified ASVs in Bacteroides had increased relative abundance in the home care group but decreasing abundance in the standard hospital care group. These findings suggested that home care might help relieve the GVHD severity. 7. Conclusion and Discussion In this paper, we developed a novel multiple testing method, DART, to incorporate feature distance in multiple testing. Under many application contexts, the feature distances serve as auxiliary information of their co-importance pattern. DART incorporates this information to boost the testing power. DART applies to the P-values obtained from many asymptotic tests, and thus can work with a wide range of models. Stage 1 of DART involves constructing an aggregation tree. We provided Algorithm 1 to construct the aggregation tree. Other algorithms may also work, and result in a different aggregation tree from the same distance matrix. Consequently, Stage 2 testing process could lead to different results based on different trees. In practice, if several aggregation trees exist, DART can be applied to all of them, and we can take the one with the most rejections. The asymptotic validity will still hold for this procedure. The main limitation of the work is that the FDR control is asymptotic and relies on several conditions. Recently, many hypothesis testing literature develops finite-sample FDR control procedures. These procedures usually impose stronger assumptions on p-values Li, Sung and Xie or test-statistics (Lei and Fithian, 2018; Ren and Candès, 2020). The primary obstacle for incorporating these conditions into DART lies in ensuring that higher layer p-values or statistics also adhere to them, thereby facilitating higher layer FDR control proof via deduction. This intriguing area of research warrants further exploration. Additionally, we aspire to alleviate Condition 4 without relying on the presumption that the distance matrix primarily represents co-status patterns. Our objective is to devise a robust testing algorithm that ensures FDR control in the absence of this assumption while enhancing power when the assumption is valid. In conclusion, our paper initiates an attempt at joint AI-statistics modeling to generate data-adaptive, powerful, and high-interpretable analysis results. It can be easily extended to the case where other information implies the co-importance pattern of the features. Such information could from domain knowledge, external data sets, or other resources. In addition, the hierarchical testing ideas and techniques can also be extended to solve other multiple testing problems. 8. Acknowledgment The microbiome samples were collected and sequenced at Memorial Sloan Kettering Cancer Center (MSKCC) and pre-processed at Duke Cancer Institute (DCI) Bioinformatics Shared Resource (BSR). We thank Tsoni Peled and Marcel van den Brink from MSKCC for their help in sample collection and sequencing. We thank Kouros Owzar and Alexander Sibley from DCI-BSR for the help in data pre-processing and constructive discussions. Xuechan Li s research is supported by Duke University. Anthony Sung s research is supported by NIH Award 1-R01-HL151365. Jichun Xie s research is supported by NIH Award 1-R01-HG01255501. DART: Distance Assisted Recursive Testing Appendix Appendix A. More Definitions In Section 4, we introduced a few terms, including moderate SNR nodes, strong SNR hypotheses, and weak SNR hypotheses. We provide their mathematical definitions here. Appendix A.1 Moderate SNR Nodes For any node A, we define its descendant set as D(A) = {D : ℓ, such that D A(ℓ) and D A}. The descendant set contains all the nodes from previous layers that a subset of A. Definition 4 (Moderate SNR node) A node A is called a moderate SNR node if P{TA < αm, D D(A), TD Φ(mr1 1p log m)} C1 > 0, (5) To provide more intuitions on the moderate SNR nodes, we provide a sufficient condition for a node being a moderate SNR node when the test statistics are Gaussian-distributed. Example 1 A sequence of independent Gaussian-distributed test statistics Zi N(τi, 1) for i [m]. To test the two-sided hypothesis Hi : τi = 0 versus τi = 0, we derive the P-value Ti = 2 Φ(|Zi|). In example 1, if a node A satisfies with βm = p 2(1 r1) log m 2 log log m and γm = p 2 log m + log log log m, (6) then A has moderate SNR. We request each τi falls in the range involving βm and γm; both slowly increase with m. In practice, the test statistics are calculated based on the observed samples. We usually consider the sample size n increases with the number of hypotheses m. As sample size n increases, the alternative SNR τi will also increase, often at the rate of n. Thus, we usually expect to have some alternatives whose SNR falls into the range. In (6), each τi needs to fall in the range. More generally, the purpose to define moderate SNR nodes is to define a set of alternative nodes (a) remaining in tree till they become candidate nodes, and (b) having large enough signals to be discovered when they become candidate nodes. We only need O(log m) such nodes to make sure the multiple testing procedure is stable and asymptotic valid. Li, Sung and Xie Appendix A.2 Strong SNR Hypotheses The definition of strong SNR hypotheses is linked with the definition of strong SNR node. On layer 1, we define strong SNR hypotheses. On layer ℓwith ℓ 2, recursively, we exclude the strong SNR hypotheses defined on previous layers from each node, and then evaluate if this node is a strong SNR node; if so, all the hypotheses in this node are counted as strong SNR hypotheses. This process ends till reaching the top of the tree. Denote the strong SNR node set on layer ℓby G(ℓ) st and the strong SNR hypothesis set on layer ℓby Ω(ℓ) st . To initiate, let Ω(0) st = . Definition 5 (Strong SNR hypotheses) On layer ℓ, after excluding the strong SNR hypotheses from the previous layers, define A(ℓ) = {A \ ℓ 1 ℓ =1Ω(ℓ ) st : A A(ℓ)}. For any node A A(ℓ), if for all i A, P{Ti κ(| A|)} > 1 o(m r1) with κ(| A|) = [m 1 r1 | A| 1 , m(log m log log m)1/2 1/| A|], (7) Then A is called a strong SNR node. The strong SNR hypothesis set on layer ℓis Ω(ℓ) st = {i A : A is a strong SNR node in A(ℓ)}. The overall strong SNR hypothesis set is defined as Ωst = L 1 ℓ=1 Ω(ℓ) st . With a high probability converging to 1, no hypothesis in Ω(ℓ) st will be rejected before layer ℓ, but all of them will be rejected on layer ℓ. Note that on layer ℓ 2, it is possible that m(log m log log m)1/2 1/| A| < m 1 r1 | A| 1 , which leads κ(| A|) = . In that case, strong SNR nodes do not exist. Our method does not require the existence of the strong SNR nodes. Under Example 1, (7) is satisfied when the SNR | A| + λm, βm q with βm, γm defined in (6), λm = p 2r1 log m. (8) Appendix A.3 Weak SNR Hypotheses Weak SNR hypotheses are those with weak SNRs so that they are very unlikely to be rejected if aggregated with other null hypotheses. Definition 6 (Weak SNR hypothesis) For any alternative hypothesis i Ω1, if P(Ti ι) = o(m r1) with ι = (0, m r1 1 ML 1 ), (9) this hypothesis is called a weak SNR hypothesis. Under Example 1, (9) is satisfied if |τi| (0, βm/ DART: Distance Assisted Recursive Testing Appendix Appendix B. Proofs of Main Theorems We introduce some notations before we provide the proofs. On layer ℓ, for a working node S B(ℓ), let U(S) = {S S : S ℓ 1 ℓ =1B(ℓ )} be the collection of sets in the testing path of S. In addition, let Uc(S) = {S ℓ 1 ℓ =1B(ℓ ) : S S = , S S A, for some A A(ℓ)} be the collection of sets that was planning to combined with S on layer ℓof the static aggregation tree but rejected on previous layers. When S B(1), we set U(S) = Uc(S) = . We define GS(c) as the complementary CDF conditional on previous testing results. When ℓ= 1, we have S = {i} {1, ..., m}, and GS(c) = P(Zi c) with Z1, . . . , Zm iid N(0, 1). When ℓ> 1, the oracle rejection path for set S B(ℓ) is recursively defined as Q(1:ℓ 1) z = {z : S U(S), GS (ZS ) ˆt(ℓS )(α), S Uc(S), GS (ZS ) ˆt(ℓS )(α)}, where GS(c) = P ZS c Q(1:ℓ 1) z and ZS = P i S Zi/ p |S|, and ℓS , ℓS {1, ..., ℓ 1} is the value s.t. S B(ℓS ) and S B(ℓS ), respectively. Given Z1, . . . , Zm are mutually independent, we have GS(c) = P ZS c S U(S), GS (ZS ) ˆt(ℓS )(α) Given the definition of GS(c), we define the rejection path as Q(1:ℓ 1) = {x : S U(S), GS (XS ) ˆt(ℓS )(α), S Uc(S), GS (XS ) ˆt(ℓS )(α)} (10) In addition, for two sequence of real numbers am and bm, we write am = o(bm) when am/bm 0, and am = O(bm) when limm |am/bm| C for some constant C. To prove the asymptotic properties of DART, we need the following lemmas. The proofs of lemmas are shown in supplementary materials. Lemma 7 Let Pi = {p [0, 1] : P( Ti < p) ϵ(m)} and P i = {p [0, 1] : P( Ti < p) ϵ(m)ϵ (m)}, with ϵ(m), ϵ (m) 0. For any set of independent random variable ˆTi [0, 1], and a collection M = {S {1, ..., m} : |S| < c0} with some constant c0, (1) If maxi M supp P i P( ˆTi < p)/ P( Ti < p) 1 0, then, sup S0 M sup p ϵ(m) P(P i S0 ˆXi > c S0(p)) P(P i S0 Xi > c S0(p)) 1 0, (2) If limm maxi M supp P i P( ˆTi < p)/ P( Ti < p) 1 0, then, lim m sup S0 M sup p ϵ(m) P(P i S0 ˆXi > c S0(p)) P(P i S0 Xi > c S0(p)) 1 0 Here, ˆXi = Φ 1( ˆTi), Xi = Φ 1( Ti) and c S0(p) is the value s.t. P[P i S0 Xi > c S0(p)] = p. Li, Sung and Xie Lemma 8 Let Ω0 = {i : Ti follows Unif(0, 1)}, B(ℓ) 0a := {S B(ℓ) 0 : A A(L) \ A , s.t.S A}, and B(ℓ) 0b := {S B(ℓ) 0a : S Ω0}, we have: (1) max S B(ℓ) 0a sup c [0,γm] (2) max S B(ℓ) 0b sup c [0, Φ 1(1/m)] P(XS > c|Q(1:ℓ 1)) P(XS > c) 1 0 Lemma 9 Define |S|I(TS < ˆt(ℓ)) X S B(ℓ) 0 |S|I(TS < ˆt(ℓ)) P S B(ℓ) 0 |S|ˆt(ℓ) 1 ϵ Then, ℓ= 1, ..., L, when the FDR control holds on layer 1, ..., ℓ 1, (1) For all ϵ (0, α), if P(mˆt(ℓ) Ccmd) 1, then P(X (ℓ)) = 1 o(1). Together with limm | Ω0|/m = 1, we have P(X (ℓ)) = 1 o(1). (2) On ℓ h=1X (h), there exist a constant C s.t. ˆt(ℓ) Cmr1 1. (3) Let ˆc S be the rejection threshold for the test node S B(ℓ), s.t. GS(ˆc S) = ˆt(ℓ). Then on ℓ h=1X (h), ˆc S > βm, S B(ℓ), and on ℓ 1 h=1X (h), ˆc S < γm, S B(ℓ). S B(ℓ) 0 |S|ˆt(ℓ) P S B(ℓ)|S|I(TS < ˆt(ℓ)) = α(1 + o(1)) (12) Proof [Proof of Theorem 3] Let V(ℓ) = {S B(ℓ) 0 : S R(ℓ)} and W(ℓ) = {S B(ℓ) 1 : S R(ℓ)} be the false rejection node set and the rejection node set on layer ℓ, respectively. Define X1 = {S L ℓ=2W(ℓ) : S Ωst = and S Ω0 = } X2 = {S L ℓ=2W(ℓ) : S Ωwk = , S \ (Ω0 Ωwk) = and S Ω0 = } X3 = {S L ℓ=2W(ℓ) : S Ω1 \ (Ωwk Ωst) = and S Ω0 = } DART: Distance Assisted Recursive Testing P(X1 = ) P(X1 = | L ℓ=1 X (ℓ)) P( L ℓ=1X (ℓ)) + P(( L ℓ=1X (ℓ))c) Cmr1o(m r1) + o(1) 0 P(X2 = ) P(X2 = | L ℓ=1 X (ℓ)) P( L ℓ=1X (ℓ)) + P(( L ℓ=1X (ℓ))c) Cmr1o(m r1) + o(1) 0 Here, the inequality (a) is based on Lemma 9 (1) and (3). By condition 4, |X3| = o(cmd), accordingly, P FDP > α + ϵ P(X1 X2 = ) + P PL ℓ=1 P S V(ℓ) |S| PL ℓ=1 P S R(ℓ) node |S| > α + ϵ, X1 X2 = P S V(ℓ)\X3 |S| P S R(ℓ) node |S| > α + ϵ + o(1) So statement (a) is proved. The statement (b) can be proved in the similar way. Appendix Appendix C. Additional Details on Simulation and Real Data Analysis All the experiments were conducted on 2.10 GHz Intel Xeon Gold 6252 processors with 16 Gb memory at the Duke Compute Cluster. We requested 80 cores when running the simulation experiments to save time. Experiment code can be found in https://github. com/jichunxie/DART_manu_support.git. We also built an R package, which can be found in https://github.com/jichunxie/DART.git. Appendix C.1 Details on Numerical Experiments We generated four simulation settings, each with n = 300 observations on m = 1000 features (hypotheses). All four simulation settings were based on a set of parameters ηi, i [m] related to alternative hypothesis signal levels. We defined two driver features 7 and 156; the features close to them were likely alternative. We also added 10 stand-alone features. Define the stand-alone feature set Ω2 = {100, 200, . . . , 1000}. η i = {[3.4ϕ1(d156,i) 0.8] 0} + 3{ϕ2(d7,i)} + 10 I(i Ω2), ηi = η i I(η i > 0.15) Here, ϕ1 and ϕ2 are the probability density functions of N(0, 0.8) and N(0, 0.05), respectively. Once feature locations and signals ηis were generated, they were fixed across all settings and all repetitions. We visualized the feature locations and their ηi in Figure 4. Below is the list of the four settings. Li, Sung and Xie 6 3 0 3 6 d1 Alternative Figure 4: Illustration of the simulated hypotheses affiliated location and their corresponding ηi. A dot stands for a hypothesis. The dot color indicates its corresponding hypothesis status; and its size is L-eta = log(ηi + 1) + 0.01. SE1: For node i {1, ..., m}, we generated the P-values Ti = 2 Φ(| Zi|) with Z1, . . . , Zm independently from N( nθi, 1), with θi = 1 5ηi. SE2: For node i {1, ..., m}, we generated the P-values Ti = 2 Φ(| Zi|) with Z1, . . . , Zm independently from the mixed Gaussian and T distribution 0.04t5( nθi) + 0.96N( nθi, 1). Here, t5( nθi) stands for the student t distribution with the degree of freedom 5 and none centrality parameter nθi, with θi = 1 5ηi. SE3: Consider the linear mode Yi = ϑ0,i+θi W1+ϑ2,i W2+ϵi. Here, W1 and W2 were generated from Binom(0.5) and Unif(0.1, 0.5); ϵi was from N(0, 1). We set ϑ0,i = ϑ2,i = 0.1 and θi = 2 5ηi. The P-values Ti were generated from the Wald test of testing whether θi is zero. SE4: We generated the data D = {(W1,j, W2,j, ij, Eij) : i [m], j [n]}. Here, the covariates W1,j and W2,j were sampled from Binom(0.5) and Unif(0.1, 0.5), respectively. Eij [0, + ) is the survival time and ij {0, 1} is the event indicator. The true event time was generated from the exponential distribution with the rate exp{θi W1,j+ϑi W2,j}, where ϑi = 0.1 and θi = 1 2ηi. The centering time Cij was generated from Unif(0, 5). The observed event time was set as Eij = min{ Eij, Cij}. We used the Cox regression model to regress the covariates W1,j and W2,j on the event ( ij, Eij) Cox (1972). The P-values Ti were generated from the Wald test of testing whether θi is zero. DART: Distance Assisted Recursive Testing To check the robustness of the algorithms when the distance cannot fully reflect the co-status patterns, we switched some hypotheses null/alternative statuses. In particular, we randomly changed proportion τ of alternative hypotheses to be null and proportional τ of the null hypotheses to be alternative. The switching rate τ reflects the violation level of the co-status patterns. Here is the detailed switching process. (1) Denote the original null and alternative sets by Ω 0 = {i : θi = 0} and Ω 1 = {i : θi > 0}. (2) We randomly selected the elements from the original null and alternative sets to form the alternative-to-null set Ω 1,0 Ω 1, where |Ω 1,0| = τ|Ω 1| ; and the null-to-alternative set Ω 0,1 Ω 0, where |Ω 0,1| = |Ω 1,0| (3) Re-assign the signal parameters. For i Ω 0,1, re-assign θi = θj for a random-chosen j Ω 1,0. For i Ω 1,0, set θi = 0. (4) Define the new null and alternative sets Ω0 = Ω 0 Ω 1,0 \ Ω 0,1 and Ω1 = Ω 1 Ω 0,1 \ Ω 1,0. The average FDP and sensitivity across 200 repetitions were summarized in Figure 2B and Figure 5. We applied DART, BH (Benjamini and Hochberg, 1995), two FDRL procedures (Zhang et al., 2011), and Ada PT (Lei and Fithian, 2018) to the above numerical settings. BH does not have tuning parameters. For DART, based on the tuning parameter selection criterion in Section 3.4, we set M = 3 and constructed a 3-layer aggregation tree, with distance thresholds g(2) = 0.88 and g(3) = 1.52. For the two FDRL procedures, Zhang et al. (2011) recommended setting k as an odd number greater than three and used k = 5 in the simulation experiments. Thus, we set k = 5 in our numerical experiments too. For Ada PT, we followed the instructions found at https://cran.r-project.org/web/packages/adapt MT/ vignettes/adapt_demo.html to set up its tuning parameters. During simulation, we noticed that Ada PT sometimes failed to generate results after a long execution time. To ensure valid results from Ada PT, Figures 2 and 5 only summarize the repetition who successfully deliver a result. Table 1 lists the number of repetitions that failed under different scenarios among 200 repetitions. SE τ = 0 τ = 0.02 τ = 0.04 τ = 0.06 τ = 0.08 τ = 0.1 1 33 41 34 45 46 45 2 34 37 29 42 39 45 3 23 30 36 42 32 51 4 12 19 15 27 25 36 Table 1: Number of repetition fails to deliver testing result within 8 minutes. Li, Sung and Xie Appendix C.2 Experiments to Decide the Optimal Maximum Node Size We considered four settings of the maximum node sizes: M {2, 3, 4, 5}. To design a good tuning parameter selection criterion, we evaluated DART s performance with different M via numerical experiments described in Section 5. The maximum number of layers L and distance thresholds g(ℓ) were set according to Section 3.4 and Algorithm 3. We listed the resulting tuning parameters in Table 2. Figure 6 shows that no matter what M we used, DART always controls the average FDR well under SE1, SE3, and SE4. For SE2, when the nominal FDR is 5%, DART has slight inflation because SE2 deliberately misspecified the null distribution. However, when the nominal FDR 10%, DART under SE2 still has the average FDP well controlled. This indicates that DART s performance is robust in M. When M {2, 3}, DART has higher sensitivity. Thus, in practice, we recommend using M = 2 or M = 3. M 2 3 4 5 Maximum Layer L 4 3 2 2 Distance thresholds g(2) = 1.20 g(3) = 1.52 g(4) = 1.74 g(2) = 0.88 g(3) = 1.52 g(2) = 0.25 g(2) = 0.19 Table 2: Summary of the tuning parameters selected under different M Appendix C.3 Aggregation Tree Construction in Real-world Experiment Because 9 ASVs are chosen as the reference ASVs, the distance matrix is calculated among the remaining 857 non-reference ASV using the R package Phangorn (Schliep, 2011) based on the JC69 model (Jukes et al., 1969). As the default model in Phangorn, the JC69 model is a classical Markov model of DNA sequence evolution and can be used to estimate the evolutionary distance between sequences. Two ASVs with similar sequences tend to be evolutionary close to each other, and more likely to perform similar biological functions. Therefore, we incorporate the distance matrix in identifying the important ASVs. Based on the tuning parameter selection procedure described in Section 2.3, we construct an aggregation tree with M = 3, L = 3. The set of possible threshold G is set as {4, 8, 12, 16}/ n log m log log m, with n = 456 and m = 857, and we choose g(2) = g(3) = 16/ n log m log log m = 0.21. DART: Distance Assisted Recursive Testing Algorithm 3: g(ℓ) Selection algorithm. Data: Distance Matrix D = (dij)m m, Sample size n, number of features m, the maximum children number M, the maximum layer L Result: g(2), ..., g(L). // set searching upper bound dmax and step-size sn,m Let dmax = maxj Ωmini {i:i =j} dij; sn,m = 4/ p n log(m) log log(m) ; for ℓ= 2, ..., L do // on layer ℓ, search g(ℓ) from (g(ℓ 1), dmax], g(1) = 0 Let Mg =NULL; eg=1; G =NULL; g = g(ℓ 1) + sn,m; while g (2ML 2 1)dmax and eg < 10 do // stop searching process if the value g exceed the searching upper bound or the | A(ℓ)(g)| does not increase for past 10 candidate values g. // stop searching process if the value g exceed the searching upper bound. Use Algorithm 1 to Construct an ℓlayers aggregation tree Tℓ= {A(ℓ ) : ℓ = 1, ..., ℓ} with maximum children number M, and (g(1), ..., g(ℓ 1), g); Set A(ℓ)(g) = {A : A A(ℓ)(g), |C(A)| 2}; if mg | A(ℓ)(g)| then eg = eg + 1; G = (G, g); Mg = (Mg, | A(ℓ)(g)|) ; mg = | A(ℓ)(g)| ; g = g + sn,m; g(ℓ) = min{arg maxg G Mg}; Li, Sung and Xie α=0.05 α=0.1 α=0.15 α=0.2 τ=0.02 τ=0.04 τ=0.06 τ=0.08 τ=0.1 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 Figure 5: Sensitivity of various testing methods under SE1-SE4 when proportion τ of the alternative hypothesis were switched with the null. The main bars mark the average values across 200 repetitions; the error bars mark their 25% and 75% quantiles. Every column shows the sensitivity under a nominal FDR level α. DART: Distance Assisted Recursive Testing α=0.05 α=0.1 α=0.15 α=0.2 FDP Sensitivity 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 0.00 0.05 0.10 0.15 0.20 Figure 6: Performance of DART across different M. The main bars mark the average values across 200 repetitions; the error bars mark their 25% and 75% quantiles. Every column shows the performance under a nominal FDR level α. The first row represents the average FDP, and the dashed horizontal lines marks the desired FDR level. The second row represents the average sensitivity. John Aitchison. The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B, 44(2):139 160, 1982. Aaron F Alexander-Bloch, Petra E Vértes, Reva Stidd, François Lalonde, Liv Clasen, Judith Rapoport, Jay Giedd, Edward T Bullmore, and Nitin Gogtay. The anatomical distance of functional connections predicts brain network topology in health and schizophrenia. Cereb Cortex, 23(1):127 38, Jan 2013. doi: 10.1093/cercor/bhr388. Denis Belomestny and Vladimir Spokoiny. Spatial aggregation of local likelihood estimates with applications to classification. The Annals of Statistics, 35(5):2287 2311, 2007. ISSN 00905364. URL http://www.jstor.org/stable/25464582. Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1):289 300, 1995. Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, oct 2008. doi: 10.1088/1742-5468/2008/10/p10008. URL https://doi.org/10.1088%2F1742-5468%2F2008%2F10%2Fp10008. T Tony Cai and Weidong Liu. Large-scale multiple testing of correlations. Journal of the American Statistical Association, 111(513):229 240, 2016a. doi: 10.1080/01621459.2014. 999157. Li, Sung and Xie T Tony Cai and Weidong Liu. Large-scale multiple testing of correlations. Journal of the American Statistical Association, 111(513):229 240, 2016b. T Tony Cai, Wenguang Sun, and Yin Xia. Laws: A locally adaptive weighting and screening approach to spatial multiple testing. Journal of the American Statistical Association, pages 1 30, 2020. Benjamin J Callahan, Paul J Mc Murdie, Michael J Rosen, Andrew W Han, Amy Jo A Johnson, and Susan P Holmes. Dada2: high-resolution sample inference from illumina amplicon data. Nature Methods, 13(7):581, 2016. Jun Chen, Kyle Bittinger, Emily S Charlson, Christian Hoffmann, James Lewis, Gary D Wu, Ronald G Collman, Frederic D Bushman, and Hongzhe Li. Associating microbiome composition with environmental covariates using generalized unifrac distances. Bioinformatics, 28(16):2106 13, Aug 2012. doi: 10.1093/bioinformatics/bts342. Marcus J Claesson, Ian B Jeffery, Susana Conde, Susan E Power, Eibhlís M O connor, Siobhán Cusack, Hugh MB Harris, Mairead Coakley, Bhuvaneswari Lakshminarayanan, Orla O Sullivan, et al. Gut microbiota composition correlates with diet and health in the elderly. Nature, 488(7410):178 184, 2012. David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society: Series B, 34(2):187 202, 1972. Alex Dmitrienko and Ajit C Tamhane. General theory of mixture procedures for gatekeeping. Biometrical Journal, 55(3):402 19, May 2013. doi: 10.1002/bimj.201100258. Alex Dmitrienko, Brian Wiens, Ajit Tamhane, and Xin Wang. Tree-structured gatekeeping tests in clinical trials with hierarchically ordered objectives. Statistics in Medicine, 26: 2465 78, 05 2007. doi: 10.1002/sim.2716. Alex Dmitrienko, Ajit C Tamhane, Lingyun Liu, and Brian L Wiens. A note on tree gatekeeping procedures in clinical trials. Statistics in Medicine, 27(17):3446 51, Jul 2008. doi: 10.1002/sim.3307. Alex Dmitrienko, George Kordzakhia, and Ajit C Tamhane. Multistage and mixture parallel gatekeeping procedures in clinical trials. Journal of Biopharmaceutical Statistics, 21(4): 726 47, Jul 2011. doi: 10.1080/10543406.2011.551333. Anders Eklund, Mats Andersson, Camilla Josephson, Magnus Johannesson, and Hans Knutsson. Does parametric fmri analysis with spm yield valid results? an empirical study of 1484 rest datasets. Neuroimage, 61(3):565 78, Jul 2012. doi: 10.1016/j.neuroimage. 2012.03.093. Ronald Fisher. Statistical method for research workers. Oliver and Boyd, Edinburgh ;London, 1925. Tanya P Garcia, Samuel Müller, Raymond J Carroll, and Rosemary L Walzem. Identification of important regressor groups, subgroups and individuals via regularization methods: DART: Distance Assisted Recursive Testing application to gut microbiome data. Bioinformatics, 30(6):831 7, Mar 2014. doi: 10.1093/ bioinformatics/btt608. Jelle J Goeman and Livio Finos. The inheritance procedure: multiple testing of treestructured hypotheses. Statistical Applications in Genetics and Molecular Biology, 11(1): Article 11, Jan 2012. doi: 10.1515/1544-6115.1554. Robert R Jenq, Carles Ubeda, Ying Taur, Clarissa C Menezes, Raya Khanin, Jarrod A Dudakov, Chen Liu, Mallory L West, Natalie V Singer, Michele J Equinda, et al. Regulation of intestinal inflammation by microbiota following allogeneic bone marrow transplantation. Journal of Experimental Medicine, 209(5):903 911, 2012. Xin Jin and Jiawei Han. K-Means Clustering, pages 563 564. Springer US, Boston, MA, 2010. ISBN 978-0-387-30164-8. doi: 10.1007/978-0-387-30164-8_425. URL https: //doi.org/10.1007/978-0-387-30164-8_425. Thomas H Jukes, Charles R Cantor, et al. Evolution of protein molecules. Mammalian Protein Metabolism, 3:21 132, 1969. Daniel Kristanto, Mianxin Liu, Xinyang Liu, Werner Sommer, and Changsong Zhou. Predicting reading ability from brain anatomy and function: From areas to connections. Neuroimage, 218:116966, 09 2020. doi: 10.1016/j.neuroimage.2020.116966. Zachary D Kurtz, Christian L Müller, Emily R Miraldi, Dan R Littman, Martin J Blaser, and Richard A Bonneau. Sparse and compositionally robust inference of microbial ecological networks. PLo S Computational Biology, 11(5), 2015. Donghwan Lee and Youngjo Lee. Extended likelihood approach to multiple testing with directional error control under a hidden markov random field model. Journal of Multivariate Analysis, 151:1 13, 2016. ISSN 0047-259X. doi: https://doi.org/10. 1016/j.jmva.2016.07.001. URL http://www.sciencedirect.com/science/article/pii/ S0047259X16300458. Lihua Lei and William Fithian. Adapt: an interactive procedure for multiple testing with side information. Journal of the Royal Statistical Society: Series B, 80(4):649 679, 2018. Lihua Lei, Aaditya Ramdas, and William Fithian. A general interactive framework for false discovery rate control under structural constraints. Biometrika, 108(2):253 267, 07 2020. ISSN 0006-3444. doi: 10.1093/biomet/asaa064. URL https://doi.org/10.1093/biomet/ asaa064. Xiaoqing Li, Yu Lin, Xue Li, Xiaoxiao Xu, Yanmin Zhao, Lin Xu, Yang Gao, Yixue Li, Yamin Tan, Pengxu Qian, et al. Tyrosine supplement ameliorates murine agvhd by modulation of gut microbiome and metabolome. EBio Medicine, 61:103048, 2020a. Yimei Li, John H Gilmore, Dinggang Shen, Martin Styner, Weili Lin, and Hongtu Zhu. Multiscale adaptive generalized estimating equations for longitudinal neuroimaging data. Neuroimage, 72:91 105, May 2013. doi: 10.1016/j.neuroimage.2013.01.034. Li, Sung and Xie Yunxiao Li, Yi-Juan Hu, and Glen A. Satten. A bottom-up approach to testing hypotheses that have a branching tree dependence structure, with error rate control. Journal of the American Statistical Association, pages 1 18, sep 2020b. doi: 10.1080/01621459.2020. 1799811. URL https://doi.org/10.1080%2F01621459.2020.1799811. Dandan Lin, Bo Hu, Pengfei Li, Ye Zhao, Yang Xu, and Depei Wu. Roles of the intestinal microbiota and microbial metabolites in acute gvhd. Experimental Hematology & Oncology, 10(1):1 19, 2021. Weidong Liu et al. Gaussian graphical model estimation with false discovery rate control. The Annals of Statistics, 41(6):2948 2978, 2013. Yaowu Liu and Jun Xie. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. Journal of the American Statistical Association, 115(529):393 402, 2020. doi: 10.1080/01621459.2018.1554485. Jackson H Loper, Lihua Lei, William Fithian, and Wesley Tansey. Smoothed nested testing on directed acyclic graphs. Biometrika, 109(2):457 471, 2022. Jennifer B H Martiny, Stuart E Jones, Jay T Lennon, and Adam C Martiny. Microbiomes in light of traits: A phylogenetic perspective. Science, 350(6261):aac9323, Nov 2015. doi: 10.1126/science.aac9323. Rosa J Meijer and Jelle J Goeman. A multiple testing method for hypotheses structured in a directed acyclic graph. Biometrical Journal, 57(1):123 143, 2015. Mathilde Payen, Ioannis Nicolis, Marie Robin, David Michonneau, Johanne Delannoye, Camille Mayeur, Nathalie Kapel, Béatrice Berçot, Marie-José Butel, Jérôme Le Goff, et al. Functional and phylogenetic alterations in gut microbiome are linked to graft-versus-host disease severity. Blood Advances, 4(9):1824 1832, 2020. Alessio Perinelli, Davide Tabarelli, Carlo Miniussi, and Leonardo Ricci. Dependence of connectivity on geometric distance in brain networks. Scientific Reports, 9(1):13412, 09 2019. doi: 10.1038/s41598-019-50106-2. Aaditya Ramdas, Jianbo Chen, Martin J Wainwright, and Michael I Jordan. A sequential algorithm for false discovery rate control on directed acyclic graphs. Biometrika, 106(1): 69 86, 2019a. Aaditya K. Ramdas, Rina F. Barber, Martin J. Wainwright, and Michael I. Jordan. A unified treatment of multiple testing with prior knowledge using the p-filter. The Annals of Statistics, 47(5), oct 2019b. doi: 10.1214/18-aos1765. URL https://doi.org/10.1214% 2F18-aos1765. Zhimei Ren and Emmanuel Candès. Knockoffs with side information. 01 2020. URL https://arxiv.org/pdf/2001.07835.pdf. K.P. Schliep. phangorn: phylogenetic analysis in r. Bioinformatics, 27(4):592 593, 2011. URL https://doi.org/10.1093/bioinformatics/btq706. DART: Distance Assisted Recursive Testing Yusuke Shono, Melissa D Docampo, Jonathan U Peled, Suelen M Perobelli, Enrico Velardi, Jennifer J Tsai, Ann E Slingerland, Odette M Smith, Lauren F Young, Jyotsna Gupta, et al. Increased gvhd-related mortality with broad-spectrum antibiotic use after allogeneic hematopoietic stem cell transplantation in human patients and mice. Science Translational Medicine, 8(339):339ra71 339ra71, 2016. Hai Shu, Bin Nan, and Robert Koeppe. Multiple testing for neuroimaging via hidden markov random field. Biometrics, 71(3):741 750, 2015. J. Soriano and L. Ma. Probabilistic multi-resolution scanning for two-sample differences. Journal of The Royal Statistical Society Series B-statistical Methodology, 79:547 572, 2017. Wesley Tansey, Yixin Wang, David Blei, and Raul Rabadan. Black box FDR. In International conference on machine learning, pages 4867 4876. PMLR, 2018. Vincent A. Traag, Ludo Waltman, and Nees Jan van Eck. From louvain to leiden: guaranteeing well-connected communities. Scientific Reports, 9, 2019. Matt Turek. Explainable artifical intelligence (xai). https://www.darpa.mil/program/explainable-artificial-intelligence, 2021. Dong Xi and Ajit C Tamhane. A general multistage procedure for k-out-of-n gatekeeping. Statistics in Medicine, 33(8):1321 35, Apr 2014. doi: 10.1002/sim.6028. Fei Xia, Martin J Zhang, James Y Zou, and David Tse. Neuralfdr: Learning discovery thresholds from hypothesis features. Advances in neural information processing systems, 30, 2017. Jichun Xie and Ruosha Li. False discovery rate control for high dimensional networks of quantile associations conditioning on covariates. Journal of the Royal Statistical Society: Series B, 80(5):1015 1034, Nov 2018. doi: 10.1111/rssb.12288. Daniel Yekutieli. Hierarchical false discovery rate-controlling methodology. Journal of the American Statistical Association, 103(481):309 316, 2008. ISSN 01621459. URL http://www.jstor.org/stable/27640041. Marie Lisandra Zepeda-Mendoza and Osbaldo Resendis-Antonio. Hierarchical Agglomerative Clustering, pages 886 887. Springer New York, New York, NY, 2013. ISBN 978-14419-9863-7. doi: 10.1007/978-1-4419-9863-7_1371. URL https://doi.org/10.1007/ 978-1-4419-9863-7_1371. Chunming Zhang, Jianqing Fan, and Tao Yu. Multiple testing via FDRL for large scale imaging data. Annals of Statistics, 39(1):613, 2011. DART: Distance Assisted Recursive Testing Supplementary Materials for "DART: Distance Assisted Recursive Testing" Appendix S1. Proofs of Lemmas Proof [Proof of Lemma 1] Since the proof of the theorem statement (b) is similar to the proof of the theorem statement (a), we will only focusing on the proof of statement (a). The random variable FDP (ℓ) can be decomposed to the product of two parts. S B(ℓ) 0 |S|I{TS < ˆt(ℓ)} P S B(ℓ) 0 |S|ˆt(ℓ) S B(ℓ) 0 |S|ˆt(ℓ) max(P S B(ℓ) |S|I{TS < ˆt(ℓ)}, 1) (S1) Based on (S1), in order to prove limm P(FDP (ℓ) α + ϵ) = 1 for all ϵ > 0, we only need prove S B(ℓ) 0 |S|I{TS < ˆt(ℓ)} P S B(ℓ) 0 |S|ˆt(ℓ) 1 < ϵ S B(ℓ) 0 |S|ˆt(ℓ) max(P S B(ℓ) |S|I{TS < ˆt(ℓ)}, 1) α > ϵ (S3) is immediately followed by Lemma 10, and we will prove (S2) by induction. Below is a list of the proof sketch: 1. On layer 1, show P(mˆt(1) Ccmd) 1. Then, by applying Lemma 9, we have P(X (1)) 1, which is equivalent to (S2). Hence, we proved the FDR control on layer 1. P(βm < ˆc S < γm, S B(1)) 1, and P(ˆc S < γm, S B(2)) 1. Note that although this conclusion is not used to prove the FDR control on the current layer, but is necessary to guarantee the FDR control on higher layers. 2. On layer ℓ 2, assume the FDR control holds on previous layers and P(X (ℓ )) 1 for all ℓ = 1, . . . , ℓ 1. Then by Lemma 9, P(βm < ˆc S < γm, S ℓ 1 ℓ =1B(ℓ )) 1, and P(ˆc S < γm, S B(ℓ)) 1. Accordingly, we can get P(mˆt(1) Ccmd) 1. Then, by applying the Lemma 9 again, we have P(X (ℓ)) 1, which is equivalent to (S2). Hence, we proved the FDR control on layer ℓ. P(βm < ˆc S < γm, S B(ℓ)) 1, and P(ˆc S < γm, S B(ℓ+1)) 1. We start the proof on layer 1. Layer 1: Take a subset F(1) Amd A(1), such that |F(1)| = cmd. For any i F(1), we have P(Xi > γm) C. By Markov s inequality, we have: i F(1) I(Xi > γm) X i F(1) P(Xi > γm)| c3/4 md C(cmd) 1/2 Li, Sung and Xie 1 i m I(Ti ˆt(1)) Ccmd c3/4 md 1 o(1) Therefore, by Lemma 10, exists constant C(1), s.t. P m0ˆt(1) C(1)cmd 1 o(1) (S4) Together with Lemma 9 (1), we have P(X (1)) 1 and accordingly, P(FDP (1) < α + ϵ) 1. Layer ℓ: Based on similar arguments on Layer 1, it suffices to show P(m0ˆt(ℓ) > C(ℓ)cmd) 1 for some constant C(ℓ). Assume h = 1, . . . , ℓ 1, P(X (h)) 1, then by Lemma 9, we have P(βm < ˆc S < γm, S B(h)) 1, and P(c S < γm, S B(ℓ)) 1. Let F(ℓ) Amd A(ℓ) with |F(ℓ)| = cmd. Define ˆF(ℓ) = A B(ℓ) F(ℓ) : TA < αm By condition 2, A F(ℓ), P(A ˆF(ℓ)) P TA < αm, TD Φ(mr1 1p log m), D D(A) C1 (S5) Accordingly, define ˆ X (ℓ) = {| ˆF(ℓ)| cmd/2}, then P( ˆ X (ℓ)) 1 o(1). On ˆ X (ℓ), we have X I(TS ˆt(ℓ)) Ccmd Then based on Lemma 8, we can conclude that P(m0ˆt(ℓ) C(ℓ)cmd) 1 o(1) for some constant C(ℓ). Proof [Proof of Lemma 7] (1) Define Xi = Φ( Ti), For k {1, ..., c0}, let q0 ϵ(m). Also define b1,k(q0), c1,...,ck be the value s.t. P(Pk j=1 Xj > b1,k(q0)) = q0 ϵ (m) (c0 k)/c0, and P( X1 > c1) = ... = P( Xk > ck) = ϵ(m)ϵ (m), respectively. For simplicity s sake, we use b1,k to present b1,k(q0). Based on the definition, we have Thus, when k = 2, P( ˆX1 + ˆX2 > b1,2) =P( ˆX1 + ˆX2 > b1,2, ˆX1 > b1,2 c2, ˆX2 > b1,2 c1) + P( ˆX1 + ˆX2 > b1,2, ˆX1 < b1,2 c2) + P( ˆX1 + ˆX2 > b1,2, ˆX2 < b1,2 c1) =P( ˆX1 + ˆX2 > b1,2, c1 > ˆX1 > b1,2 c2) + P( ˆX1 > c1, ˆX2 > b1,2 c1) + P( ˆX1 + ˆX2 > b1,2, ˆX1 < b1,2 c2) + P( ˆX1 + ˆX2 > b1,2, ˆX2 < b1,2 c1) DART: Distance Assisted Recursive Testing Based on construction, the last three terms always smaller than ϵ(m)ϵ (m)(1 + δ4(m)) for δ4(m) := maxi Ωsupp P i P( ˆTi < p)/P( Ti < p) 1 0, and accordingly, we have P( ˆX1 + ˆX2 > b1,2, c1 > ˆX1 > b1,2 c2) + P( ˆX1 > c1, ˆX2 > b1,2 c1) [P( ˆX1 + X2 > b1,2, c1 > ˆX1 > b1,2 c2) + P( ˆX1 > c1, X2 > b1,2 c1)](1 + δ4(m)) [P( ˆX1 + X2 > b1,2, ˆX1 > b1,2 c2, X2 > b1,2 c1)](1 + δ4(m)) P( X1 + X2 > b1,2, X1 > b1,2 c2, X2 > b1,2 c1)(1 + δ4(m))2 Based on similar arguments, we can also have P( ˆX1 + ˆX2 > b1,2, c1 > ˆX1 > b1,2 c2) + P( ˆX1 > c1, ˆX2 > b1,2 c1) P( X1 + X2 > b1,2, X1 > b1,2 c2, X2 > b1,2 c1)(1 δ4(m))2 q0 ϵ(m) ϵ (m) c0 2 P( ˆX1 + ˆX2 > b1,2) P( X1 + X2 > b1,2) 1 0 Similarly, if sup q0 ϵ(m) ϵ (m) c0 k P(Pk j=1 ˆ Xj>b1,k) P(Pk j=1 Xj>b1,k) 1 0, we can have q0 ϵ(m) ϵ (m) c0 k 1 P(Pk+1 j=1 ˆXj > b1,k+1) P(Pk+1 j=1 Xj > b1,k+1) 1 0 Thus, we can get (1). In addition, based on the similar arguments, we can get (2). Proof [Proof of Lemma 8] (1) Let Z 1, ..., Z K iid N(0, 1), with 2 K < ML 1. Define the set M = {M1 {1, ..., m} : 1 |M1| K 1}. It is suffice to show: lim m sup M1 M sup c1 [β0,γm] c2 [0,γm] K PK i=1 Z i > c2, 1 |M1| P j M1 Z j > c1) K PK i=1 Z i > c2) = 0 Here, β0 = p 2b(1 r1) log m + b(1 r1) log log log m, with 2(1 r1) ML 1 (ML 1 + 1)(1 r1), 1 . For simplification, let k1 = |M1|. For Z1 and Z2 iid N(0, 1), define Dm = c2 (0, γm) : d dc2 k1 K Z1 + q K Z2 > c2, Z1 > β0 k1 K Z1 + q K Z2 > c2 = 0 Li, Sung and Xie sup c1 [β0,γm] c2 [0,γm] K PK i=1 Z i > c2, 1 |M1| P j M1 Z j > c1) K PK i=1 Z i > c2) 2 sup c2 [0,γm] k1 K Z1 + q K Z2 > c2, Z1 > β0 k1 K Z1 + q max c2=0 or γm k1 K Z1 + q K Z2 > c2, Z1 > β0 k1 K Z1 + q K Z2 > c2 , k1 K Z1 + q K Z2 > c2, Z1 > β0 k1 K Z1 + q (i). When c2 = 0, K Z2 > c2, Z1 > β0 K Z2 > c2 = lim m 2P r K Z2 > c2, Z1 > β0 = 0 (ii). When c2 = γm, c2/β0 = q K Z2 > c2, Z1 > β0 K Z2 > c2 = lim β0 S q K K k1 β0 r k1 K k1 z1 ϕ(z1)ϕ(z2)dz2dz1 R Sβ0 ϕ(z)dz S q K K k1 β0 r k1 K k1 β0 ϕ(β0)ϕ(z)dz + R β0 ϕ(z)ϕ(S q K K k1 β0 q k1 K k1 z)dz ϕ(Sβ0) (L Hopital s rule) Where S = q 1 b(1 r1) (iii). When c2 Dm, given k1 K Z1 + q K Z2 > c2, Z1 > β0 k1 K Z1 + q k1 K Z1 + q K Z2 > c2 2 K Z2 > c2 d K Z2 > c2, Z1 > β0 K Z2 > c2, Z1 > β0 d DART: Distance Assisted Recursive Testing k1 K Z1 + q K Z2 > c2, Z1 > β0 k1 K Z1 + q K Z2 > c2 = k1 K Z1 + q K Z2 > c2, Z1 > β0 k1 K Z1 + q k1 K Z1 + q K Z2 > c2, Z1 > β0 k1 K Z1 + q = sup c2 Dm k1 K Z1 + q K Z2 > c2, Z1 > β0 k1 K Z1 + q = sup c2 Dm C Z Combine (i), (ii) and (iii), we have lim m sup M1 M sup c1 [β0,γm] c2 [0,γm] K PK i=1 Zi > c2| 1 |M1| P j M1 Zj| > c1) K PK i=1 Zi > c2) = 0 It is suffice to show lim m sup M1 M sup c2 [0, Φ 1(1/m)] K PK i=1 Xi > c2, 1 |M1| P j M1 Xj > β0) P(PK i=1 Z i/ Let X1 = P i M1 Xi/ k1, X2 = P i M\M1 Xi/ K k1. Based on lemma 7, δ6m = |P( Xj > p)/P(Zj > p) 1| 0 uniformly for j = 1, 2 and p > αm. Li, Sung and Xie Thus, uniformly, K X2 > c2, X1 > β0) k1 K β0, X1 > β0) (1 + δ6m) P( k1 K β0, Z1 > β0) K X2 > c2) + P(Z1 > Φ 1(αm)) (1 + δ6m)2 P( K Z2 > c2, Z1 > β0) + (1 + δ6m) j=1 P(Zj > Φ 1(αm)) (1 + δ6m)2 P( K Z2 > c2, Z1 > β0) + 2(1 + δ6m)αm Proof [Proof of Lemma 9] (i) Prove that (1) can leads to (2): On ℓ t=1X (t), |S|I(TS < ˆt(ℓ)) X |S|ˆt(ℓ) + X Combined with X |S|ˆt(ℓ) α X S B(ℓ) |S|I{TS < ˆt(ℓ)} S B(ℓ) |S|I{TS < ˆt(ℓ)} = X |S|I{TS < ˆt(ℓ)} + X |S|I{T (ℓ) S ˆt(ℓ)} |S|I{T (ℓ) S < ˆt(ℓ)} + Cmr1 We have: (1 α αϵ) X |S|ˆt(ℓ) αCmr1 Thus, 2|B(ℓ) 0 |ˆt(ℓ) P S B(ℓ) 0 |S|ˆt(ℓ) α 1 α αϵmr1, for any 1 ℓ L. DART: Distance Assisted Recursive Testing When ℓ= 1, by |B(1) 0 | = m0 = m(1 + o(1)), we have ˆt(ℓ) Cm(r1 1). When ℓ 2, on (ℓ) k=1X (k), we have max k=1,...,ℓ{FDP (k) α} < ϵ which leads to |B(ℓ) 0 |/|B(ℓ)| 1. And accordingly, ˆt(ℓ) Cm(r1 1). (ii) Prove that statement (2) leads to statement (3) On layer 1, Φ(ˆc S) = ˆt(1) C(m)r1 1. On layer ℓ 2 and ℓ h=1X (h), for all S B(ℓ), Φ(ˆc S) GS(ˆc S) + X Φ(ˆc S ) (S6) Suppose Φ(ˆc S ) C(m)r1 1 for S ℓ 1 k=1B(k), then together with GS(ˆc S) = ˆt(ℓ) Cmr1 1 and (S6), we have 2(1 r1) log m 2 log log m = βm for all S B(ℓ). In addition, for S B(ℓ), on ℓ 1 h=1X (h), GS(ˆc S)[1 Φ( β0 ML 1 )]ML 1 Φ(ˆc S) (S7) So we have Φ(ˆc S) ˆt(ℓ)(1 + o(1)), and accordingly, ˆc S γm. Note that the ˆc S γm only depends on the statement (2) on layer ℓ 1. Thus, we can apply the conclusion to show P(m0ˆt(ℓ) > c log m) 1 in the proof of theorem 1. (iii) Prove that statement (1) holds on layer 1 (ℓ= 1): Define νm = [(|A |2/m + δ2m) 1]/ cmd log m. Let 0 = c0 < ... < c γm/νm = γm satisfy ck ck 1 = νm for 1 k < γm/νm and c γm/νm c γm/νm 1 νm. We can get the corresponding p-values sequence q0 > ... > q γm/νm with qk = 1 Φ(ck). Let value q(1) = C(1)cmd/m, by (S4), we have P(ˆt > q(1)) 1. We define the working p-value sequence on layer 1 as P (1) sub = {q0, ..., qk(1), q(1)}, where k(1) {0, ..., γm/νm 1} is the index s.t. qk(1) q(1) and qk(1)+1 q(1). P max q P (1) sub S B(1) 0 I(XS > Φ 1(q)) P S B(1) 0 P(XS > Φ 1(q))(1 δ0m) P Li, Sung and Xie P max q P (1) sub S B(1) 0 I(T (1) S < q) P S B(1) 0 q P S B(1) 0 q > ϵ P max q P (1) sub S B(1) 0 I(XS > Φ 1(q)) P S B(1) 0 P( XS > Φ 1(q)) P S B(1) 0 q > ϵ P max q P (1) sub S B(1) 0 I(XS > Φ 1(q)) P S B(1) 0 P(XS > Φ 1(q))(1 δ0m) P S B(1) 0 q > ϵ P max q P (1) sub S B(1) 0 I(XS > Φ 1(q)) P S B(1) 0 P(XS > Φ 1(q))(1 δ0m) P Together with the fact that supj=1,...,k q(j)/q(j 1) 1 = o(1), we have P sup q [q(1),α] S B(1) 0 I(TS < q) P S B(1) 0 q P S B(1) 0 q > ϵ = o(1) Thus, to prove (1) holds on layer 1, we only need to show (S8). Define C(1) sub = {c0, ..., ck , c }, with c = Φ 1(q ). In order to show (S8), it is suffice to show Z c S B(1) 0 I(XS > c) P(XS > c)(1 δ0m) S B(1) 0 Φ(c) ϵ dc = o(νm) (S10) Note that by Markov inequality, S B(1) 0 I(XS > c) P(XS > c)(1 δ0m) S B(1) 0 Φ(c) S B(1) 0 I(XS > c) P(XS > c) S B(1) 0 Φ(c) ϵ (1 + δ0m)δ0m S,S B(1) 0 P(XS > c, XS > c) P(XS > c)P(XS > c) S B(1) 0 Φ(c) 2[ϵ (1 + δ0m)δ0m]2 We can divide the S, S B(1) 0 into the following three subsets: B(1) 01 = {S, S B(1) 0 : S = S } B(1) 02 = {S, S B(ℓ) 0 : S = S , A, A A(L), s.t.S A, S A , and A ΓA} (S11) B(1) 03 = {S, S B(1) 0 : S = S } \ B(1) 02 DART: Distance Assisted Recursive Testing (S,S ) B(1) 01 P(XS > c, XS > c) P(XS > c)P(XS > c) S B(1) 0 Φ(c) 2[ϵ (1 + δ0m)δ0m]2 C P S B(1) 0 Φ(c) Based on condition 3, P (S,S ) B(1) 02 P(XS > c, XS > c) P(XS > c)P(XS > c) S B(1) 0 Φ(c) 2[ϵ (1 + δ0m)δ0m]2 C(|A |2/m + δ2m) P S B(1) 0 Φ(c) In addition, P (S,S ) B(1) 03 P(XS > c, XS > c) P(XS > c)P(XS > c) S B(1) 0 Φ(c) 2[ϵ (1 + δ0m)δ0m]2 = o(1) Thus, after some calculation, we can prove (S10) and then P(X (1)) 1. Similarly, if | Ω0| = m(1 + o(1)), based on (S8), we have P max q P (1) sub S B(1) 0 I(T (1) S < q) P S B(1) 0 q P Hence, P(X (1)) 1. (iv) Prove that statement (1) holds on layer ℓ 2 when statement (1) holds on previous layers: On layer ℓ, we can divide the S, S B(ℓ) 0 into the following three subsets: B(ℓ) 01 = {S, S B(ℓ) 0 : S = S , {Ti : i S} are mutually independent} B(ℓ) 02 = {S, S B(ℓ) 0 : A, A A(L), s.t.S A, S A , and A ΓA} B(ℓ) 03 = {S, S B(ℓ) 0 : S = S } \ B(ℓ) 02 Consider the p-values sequence q0 > ... > q γm/νm constructed in (iii). Let q(ℓ) = C(ℓ)cmd/m, by (S4), we have P(ˆt > q(ℓ)) 1. We define the working p-value sequence on layer 1 as P (ℓ) sub = {q0, ..., qk(ℓ), q(ℓ)}, where k(ℓ) {0, ..., γm/νm 1} is the index s.t. qk(ℓ) q(ℓ) and qk(ℓ)+1 q(ℓ). In view of statement (3) and Lemma 8, we have sup k=0,..., γm/νm Φ(ck) 1 = o(1) Together with statement (3) and Lemma 7, there exists δ5(m) 0 with max S B(ℓ) 0 P(XS > Φ 1(q)|Q(1:ℓ 1)) max S B(ℓ) 0 P(XS > Φ 1(q)) P(ZS > Φ 1(q))[1 Φ( β0 ML 1 )]ML 1 Li, Sung and Xie Then ϵ > 0, by following the similar arguments in (iii), we can have P max q P (ℓ) sub S B(ℓ) 01 |S|I(XS > Φ 1(q)) P S B(ℓ) 01 |S|P(XS > Φ 1(q)|Q(1:ℓ 1))(1 + δ0m) P S B(ℓ) 01 |S|q P max q P (ℓ) sub S B(ℓ) 0 |S|I(TS < q) P S B(ℓ) 0 |S|q P S B(ℓ) 0 |S|q > ϵ P max q P (ℓ) sub S B(ℓ) 0 |S|I(XS > Φ 1(q)) P S B(ℓ) 0 |S|P(XS > Φ 1(q)|Q(1:ℓ 1)) P S B(ℓ) 0 |S|q > ϵ/2 =o(1) (S13) Together with the fact that supj=1,...,k q(j)/q(j 1) 1 = o(1), we have P sup q [q(ℓ),α] S B(ℓ) 0 |S|I(TS < q) P S B(ℓ) 0 |S|q P S B(ℓ) 0 |S|q > ϵ Q(1:ℓ 1) = o(1) And thus P(X (ℓ)) 1. Similarly, when | Ω0| = m(1 + o(1)), we have P(X (ℓ)) 1 based on Lemma 8 (2). Proof [Proof of Lemma 10] When ℓ= 1: for δ = 1/m4, X |S|ˆt(1) α X S B(1) |S|I(TS < ˆt(1)) S B(1) |S|I(TS < ˆt(1) + δ) |S|ˆt(1)(1 + o(1)) (S14) Assume (12) holds on layer 1, . . . , ℓ 1. Then, X |S|ˆt(ℓ) α(1 + o(1)) X S B(ℓ) |S|I(TS < ˆt(ℓ)) Thus, by following the similar arguments on (S14), we can get (12) on layer ℓ.