# on_the_adversarial_robustness_of_benjamini_hochberg__89829878.pdf On the Adversarial Robustness of Benjamini Hochberg Louis L Chen Operations Research Department Naval Postgraduate School Monterey, CA 93943 louis.chen@nps.edu Roberto Szechtman Operations Research Department Naval Postgraduate School Monterey, CA 93943 rszechtm@nps.edu Matan Seri Operations Research Department Naval Postgraduate School Monterey, CA 93943 matan.seri@gmail.com The Benjamini-Hochberg (BH) procedure is widely used to control the false detection rate (FDR) in multiple testing. Applications of this control abound in drug discovery, forensics, anomaly detection, and, in particular, machine learning, ranging from nonparametric outlier detection to out-of-distribution detection and one-class classification methods. Considering this control could be relied upon in critical safety/security contexts, we investigate its adversarial robustness. More precisely, we study under what conditions BH does and does not exhibit adversarial robustness, we present a class of simple and easily implementable adversarial test-perturbation algorithms, and we perform computational experiments. With our algorithms, we demonstrate that there are conditions under which BH s control can be significantly broken with relatively few (even just one) test score perturbation(s), and provide non-asymptotic guarantees on the expected adversarial-adjustment to FDR. Our technical analysis involves a combinatorial reframing of the BH procedure as a balls into bins process, and drawing a connection to generalized ballot problems to facilitate an information-theoretic approach for deriving non-asymptotic lower bounds. 1 Introduction Multiple testing has broad applications in drug discovery, forensics, candidate screening, anomaly detection, and in particular, machine learning. Indeed, recent works [5, 18, 28, 32], in nonparametric outlier detection, out-of-distribution detection (OOD), and one-class classification have all adopted multiple testing methodology in developing principled decision rules with statistical guarantees. In fact, the Benjamini-Hochberg (BH) multiple testing procedure, widely used to control the false detection rate (FDR), is either used or modified in all these recent methods. Considering this FDR control could be relied upon in some critical (safety/security) contexts, for which false positives incur costs, we investigate its adversarial robustness. Adversarial corruption presents a challenge to statistical methodology, and is a modern-day concern due to not only the ease with which high volumes of data can now be accessed/processed but also the increasingly widespread use of statistical procedures. This threat poses vulnerabilities to machine learning tasks like OOD, which would aim to fortify security systems like fraud detection [5]. website: https://louislchen.github.io/ 38th Conference on Neural Information Processing Systems (Neur IPS 2024). Manipulation of data and experimental results are common means by which incorrect conclusions can be reached. Worse, strategic perturbation can dramatically decrease the fidelity of the models and methods used. A burgeoning field of adversarial corruption has gained traction in recent years to meet this concern, most notably in the area of (deep) machine learning; see, for example, [27, 16, 20]. In this work we address adversarial corruption in hypothesis testing, specifically in the large-scale context in which the primary focus is on the aggregate metric: FDR. BH [6] is one of the most widely used multiple testing procedures, which upon input of a collection of p-values, outputs a rejection region ensuring that the FDR is no greater than a user-defined threshold q (0,1). This control of FDR holds under independently generated p-values as well as some restricted forms of dependency like positive regression dependent on a subset (PRDS) [7] but it generally holds without strong assumptions on the alternative distributions. This degree of distributional robustness, however, could be said to come at the cost of adversarial robustness, as we show in this work. 1.1 Literature Review Although OOD methods [23, 12, 21, 22] are often complex and not always supported by statistical guarantees, conformal inference has made possible the use of one-class classifiers to generate conformal p-values for which OOD can now be conducted via multiple testing. This has led to the adoption of the BH procedure in OOD. Indeed, [5] leverages the FDR control afforded by BH over conformal p-values (shown to be PRDS) to test for outliers. More precisely, given a test set of observations for which we wish to identify as inliers or outliers (out of distribution), a conformal p-value is generated for each observation, which is then processed by BH to decide which are likely outliers. We refer the reader to [18, 28] for other recent works along this vein. In recent years, concerns have risen over the possibility of adversarial manipulation of statistical methodologies. This manipulation commonly occurs at the level of data collection and training, often invalidating the assumptions made regarding how data is drawn, but it can also occur at test time. There is a growing literature on adversarial robustness, which is concerned with securing statistical methods like (deep) machine learning [27, 16, 20], linear regression [8], M-estimation [9], and online learning [26, 17, 1]. In particular, [13] considers contamination models that incorporate (adaptive) adversarial perturbation of up to an ϵ fraction of drawn data. Indeed, we adopt this modeling in our own study - see (c-Perturb). As well, a similar concept to the notion of adversarial robustness that we adopt in this paper is one the literature refers to as perturbation resilience. Generally speaking, a problem instance is called α perturbation resilient when despite a degree (parameterized by α) of perturbation to the instance, the optimal solution does not change. First introduced in [10] for combinatorial optimization (in particular, MAX-CUT), the concept has since also inspired research into devising resilient unsupervised learning, particularly in clustering [4, 3, 2]. With respect to the hypothesis testing literature, there are recent adversarial robust studies focused on simple [19] and sequential hypothesis testing [11] from a game theoretic perspective, in which protection of statistical power, risk, or sample size from corruption is of chief concern. Complementing the adversarial robust perspective are several distributionally robust studies, in which the data-generating distribution is known only to lie in a parametric family. Recent works include [11, 24, 15] which focus on test risk in single and sequential hypothesis testing settings employing uncertainty sets of distributions of fixed distance (e.g. Wasserstein, phi-divergence) for the null and alternative hypotheses. In contrast to these works, this paper is focused on FDR, not individual test risk. Furthermore, distributional robustness is not equivalent to the perturbation-robustness that this paper and other adversarial robust studies seek in general. Indeed, [30] shows that the BH procedure s FDR control exhibits a distributional robustness to possible dependence between null and non-null hypotheses. On the other hand, our work would illustrate that, distributional robustness aside, BH can lack adversarial robustness. 1.2 Preliminaries Let N = {1,...,N}, where N Z+ , be a finite set for which each member i N denotes a binary hypothesis test deciding between a null and alternative hypothesis. Further, there exists a partitioning, N = H0 H1, such that the correct decision for test i N is either null if i H0, or alternative if i H1. Here, H0 and H1 are the (unknown) sets of null and alternative test indices, respectively; consequently, for any test i, the correct decision (i.e. set membership) is unknown to the decision maker. In fact, while the number of tests N is known (and large, on the order of thousands), neither N0 = H0 nor π0 = N0 N is known to the decision maker, although π0 0.90 is reasonable in most large-scale testing situations" - ([14], p. 285). For each test i N, p-value pi [0,1] is randomly generated (independent of all other pj, j i), which we model as a draw from either U(0,1) when i H0 (pi then referred to as a null p-value) or some alternative distribution P1 i on [0,1] when i H1 (pi then referred to as an alternative p-value). A multiple-testing algorithm A takes as input a randomly generated collection of p-values p = {pi}i N and outputs for each test i a determination A(i) {0,1} with A(i) = 1 iff the determination is to reject the null hypothesis (i.e., claim i H1), or sometimes referred to as make a discovery" for the i-th test. With ap = i H0 A(i) denoting the number of false discoveries, and Rp = A 1(1) denoting the number of rejections/discoveries made, we refer to FDP[A;p] = ap Rp 1 as the false detection proportion summarizing A s decisions on p-values p, where x y is shorthand for max(x,y) for any x,y R. We refer to its expectation with respect to the random generation of p as the false detection rate, FDR(A) = Ep FDP[A;p]. In this work, we focus on the Benjamini Hochberg procedure, a widely-used multiple-testing algorithm. 1.3 The Benjamini Hochberg (BH) Procedure Given a collection of p-values p = {pi}i N and desired control level q (0,1) to bound FDR, the BH procedure BHq operates as follows: 1. The p-values are sorted in increasing order, p(1) p(2) ... p(N). 2. The index imax = max{i [0,N]Z p(i) i q N } is identified, with p(0) = 0. 3. Reject the tests corresponding to the smallest imax p-values: p(1),p(2),...,p(imax). BHq provides provable FDR control at the level of q without any assumptions on the alternative distributions {P1 i }i H1. Lemma 1.1 (Theorem 4.1 from [14]). If every null p-value is super-uniform, equiv., pi P0 i U(0,1) for all i H0, and the collection is jointly independent, then regardless of the collection of alternative distributions {P1 i }i H1, FDR(BHq) = π0q q, q (0,1). (1) Remark 1.2. In fact, the assumption of joint independence of the null p-values can be relaxed to a form of dependency known as PRDS - see [7] for this generalization of Lemma 1.1. 1.4 The Adversary and the c-Perturbation Problem We model an (omniscient) adversary with knowledge of H0, H1, and that knows the decision maker s choice of control level q. The adversary receives the p-values p = (pi)N i=1 after they are generated but before they are received by the decision maker, or before test time. Given the ability to perturb c 1 p-values, the adversary solves max p p p 0 c FDP[BHq;p ], (c-Perturb) where p denotes the p-values derived from an adjusted collection of p-values p = (p i)N i=1. In words, the adversary finds the perturbation of at most c many p-values before the execution of BHq so as to maximize the adversarially-adjusted false detection proportion FDP[BHq,p ]. Perturbation of p-values is implicitly the result of data perturbation, and we refer to both Remark 4.6 and Section 5.2 for examples and experiments involving direct perturbation of data. While we assume omniscience for the adversary throughout, we will briefly address modifications in analysis for an oblivious adversary that has no knowledge of H0, nor H1. Indeed, our algorithms to be presented can be modified naturally for implementation by an oblivious adversary (see comments in Section 3); further, there is nearly equivalent performance when π0 is large, as is typical. Hence, it is for the sake of brevity that we omit explicit analysis of the oblivious adversary. 1.5 Main Results Intuitively, the effect of a small number of p-value perturbations becomes insignificant in settings where a large number of tests are rejected (see Theorem 3.2). This happens, for instance, when either the number of tests N, or the control level q, or the distance (e.g. KL-divergence) between the null and alternative distributions, is large. For this reason, we focus on results that are non-asymptotic in the number of tests N. In Section 3, we present the algorithm INCREASE-c that uses strategic increases to c null p-values to induce expansion of the BH rejection region. We also present an efficient, optimal algorithm MOVE-1 (Appendix Section 7.2) for the adversary s maximization of FDR with at most one (i.e. c = 1) p-value perturbation. In Section 4 we discuss the adversarial robustness of BH through study of the adversarial adjustments to FDR by the INCREASE-c algorithms, revealing where its control is and isn t adversarially robust. In Section 5 we provide accompanying numerical experiments on i.i.d. as well as PRDS p-values. 2 BH as Balls into Bins In this section we will establish important notation for the discussions to follow. We reduce the real line to a collection of N + 1 bins". N0 and N1 balls will each be assigned to one of these bins independently of each other, from discrete distributions that are specified in the next subsection. The main motivation for this reduction is to facilitate discussion of effective perturbations in Section 3, and for the technical analysis in Section 4. 2.1 The Balls into Bins System We partition the segment [0,1] into N equiprobable segments that will be referred to as bins. We define the i-th bin Bi = {p R (i 1) q N }, for i = 1,...,N. What remains forms bin N +1, i.e., BN+1 = {p R 1 p 1 q}. For shorthand, we write B1 i = i l=1Bl = {p R 0 p < i q N }, Finally, we write B0 i = Bi {pj}j H0 and BN i = Bi {pj}j N for bin i s, respectively, null-load and total load. The alternativeload B1 i is defined analogously, as are B0 1 i BN 1 i, and B1 1 i. Rejection Count: Borrowing terminology from the classic balls into bins problem of probability theory [29], this framework facilitates a re-interpretation of the random drawing of p-values as balls being randomly placed into an ordered collection of bins, enumerated 1 up to N + 1. Framed in this way, we see that BHq operates by identifying the rejection count k = max{i [0,N]Z BN 1 i = i}, (2) which corresponds to the largest collection of consecutive bins 1,...,i that collectively contain precisely i balls, so that BHq rejects all tests with p-values lying in the first k bins. The case k = 0 corresponds to rejecting no tests. In fact, k is a stopping time under a filtration F that we define next. Filtration F = {Fi}N i=0: Let Ω = [0,1]N be a sample space with the classical Borel σalgebra B and (with slight abuse of notation) probability measure P = ( i H0U(0,1)) ( i H1P1 i ). We define a filtration beginning with FN = σ(BN N+1,B0 N+1), and continuing inductively (N towards 0), let Fi be the σ-algebra generated by {BN j }N+1 j=i+1 and {B0 j }N+1 j=i+1. In words, this filtration corresponds to what is cumulatively learned about the bin loads (null and total) upon examination of the bins in sequence starting with bin N + 1 and concluding with bin 1, assuming each observed p value comes with correct identification of whether or not i H0. We note the fact that for ℓ> i, it follows that E [ B0 1 i i Fℓ] = E [ B0 1 i i B0 1 ℓ ℓ] = B0 1 ℓ ℓ a.s., so that B0 1 N N , B0 1 N 1 N 1 ,..., B0 1 1 1 form a martingale sequence adapted to the filtration. This fact will prove useful when combined with the optional stopping theorem to facilitate several results in this work. Under this lens, the adversary s (algorithmic) task reduces to reshuffling p-values among the bins, and in Section 3 we demonstrate there are indeed simple, tractable ways of performing this to manipulate BH, with potentially great effect on FDR control. The key insight is that there exist alternative stopping times that present alternative rejection counts (/regions) that can break FDR control. 3 Adversarial Algorithm: INCREASE-c Throughout this section, we don the role of the adversary and study the c-Perturb problem, in which we are given q (0,1), a realized collection p = {pi}i N (along with the labels of null or alternative for each pi), and a budget c 1, and our task is to produce a perturbed collection p . Toward this, we focus on a procedure called INCREASE-c that despite its sub-optimality (see Appendix Section 7.2) is intuitive and simple to execute; further, as theoretical and empirical analysis in sections 4 and 5 respectively show, it has strong performance in expectation. We begin by defining a random variable that is a stopping time adapted to F; given an integer c 1, let k+c = {max{i [c,N]Z BN 1 i = i c} B0 N+1 c k B0 N+1 < c, (3) and we choose to write k+ in place of k+1. Increasing the Rejection Count The interest in k+c is that if we moved any selection of c null p-values from bin N + 1 into bin k+c (in fact any bin i k+c), then BHq would output a new, increased rejection count k+c. We formally study this in Section 4. In the meantime, we comment on the increase k+c k, which is a difference between two stopping times It is easy to see that k+c k c whenever B0 N+1 c; hence, the increase in the rejection count is at least c, but possibly more. We provide a stronger lower bound on this increase by utilizing the ratio between the number B0 k+2 N of nulls not rejected by BHq and the number N ( k + 1) of bins left outside of the BHq rejection region in the case of no corruption. Computational experiments indicate comparable performance of this bound with those of simulations presented in Section 5 s Table 1. Theorem 3.1. If c 1, then E[ k+c k B0 N+1 c] c 1 1 E[ B0 k+2 N N ( k+1) B0 N+1 c] + 1 (4) for any collection of alternative hypothesis distributions {P1 i }i H1. INCREASE-c runs as follows: 1. IF B0 N+1 c, then move the largest c (ties broken arbitrarily) in the (N+1)-th bin to bin k+c. 2. ELSE leave the p-values unperturbed. It in fact suffices for the c many p-values to be placed in any bin i k+c. We remark that since an oblivious adversary cannot discern null-drawn from alternative-drawn in the collection p, INCREASEc as written is unimplementable in such a case. Hence, for the oblivious adversary, we modify INCREASE-c s criterion to BN N+1 c and have the oblivious adversary now take the c many p-values uniformly at random from among the p-values in the (N + 1)-th bin. Intuitively, this modification for the oblivious adversary should yield nearly c null p-values being moved (on average) just as in the non-oblivious case, assuming the proportion of nulls among the BN N+1 - many p-values is high, as a typically large π0 would entail. We conclude this section with a characterization of the average increase in FDR, denoted c that INCREASE-c induces. Theorem 3.2. Given c 1, let p+c denote the perturbed form of p that INCREASE-c produces. Then the adversarially-adjusted FDR induced by INCREASE-c is EFDP[BHq;p+c] = EFDP[BHq;p] + c, for any collection of alternative distributions {P1 i }i H1, where k+c ;B0 N+1 c]. (5) In Section 4 to follow, we provide analytical lower bounds for c as part of a discussion on BH s adversarial robustness. Section 5 presents computational experiments (e.g. Table 1). We remark that INCREASE-c is not optimal for all instances of c-Perturb; indeed, for c = 1, we present a provably optimal algorithm MOVE-1 in Appendix Section 7.2, which in contrast to INCREASE1 sometimes induces a reduced rejection count. However, INCREASE-c remains a formidable adversarial procedure, as the results of Section 5 demonstrate on not only i.i.d. p-values but also PRDS conformal p-values. 4 Theoretical Analysis: Performance Guarantees and Insights into Adversarial Robustness BH s FDR control Lemma 1.1 is (distributionally) robust in the sense that it holds no matter the alternative distributions {P1 i }i H1. However, as Theorem 3.2 indicates, the degree to which this control can withstand data perturbations at test time, i.e., its adversarial robustness, very much depends on {P1 i }i H1. Recalling Lemma 1.1, we may assume without loss of generality that no alternative distribution P1 i stochastically dominates U(0,1) (equiv., P1 i U(0,1)). That being said, the degree" to which the alternative distributions {P1 i }i H1 are (stoch.) dominated by the null distribution U(0,1) (equiv., P1 i U(0,1)) is critical. We briefly preview two regimes of special interest for which each of the next two subsections cover. High sub-uniformity: When the alternatives are sub-uniform P1 i U(0,1) for all i H1, and highly so, such that for all i H1 it holds that P1 i (pi < ϵ) 1 for some small ϵ > 0, then it follows that k is large and B1 k should contain most alternative p-values. Consequently, in order for INCREAES-c to induce any sizeable increase to the FDR, the adversary will need to expand the BH rejection region significantly so as to introduce a commensurate number of nulls. Table 1 indicates c may need to be quite large to make a dent in FDR control. This message is made more precise in Theorem 4.1. Low sub-uniformity: As we will see, when the alternative p-values are barely dominated by U(0,1), c can be rather large. In fact, in the special case that there is no dominance such that P1 i = U(0,1) for all i H1, a strikingly vulnerable state occurs with high probability. Indeed, in this case where nulls and alternatives are virtually indistinguishable, BHq (in fact any A) admits an FDR of π0 whenever any rejections are made (i.e., E[FDP[BHq;p] k 1] = π0) so that BHq accordingly compensates by making no rejections with high probability (P( k = 0) = 1 q), which follows by the distributional robust control (1) from Lemma 1.1. But those times when k = 0 is precisely when INCREASE-c s simultaneous expansion of the rejection region and injection of nulls into this region is most damaging. That this event and other similarly vulnerable events occurs with high probability is the fault of the distributional robustness. This message is made rigorous in the forthcoming Theorem 4.5. 4.1 Case of High Sub-Uniformity in Alternatives {P1 i }i H1 If P1 i U(0,1) for all i H1, with P1 i (pi < ϵ) 1 for some small ϵ > 0, then it is clear that the number of alternatives rejected by BHq should be nearly the maximum number N1 of correct rejections possible (i.e., B1 1 k N1) with high probability, limiting any potential impact of INCREASE-c. The following bounds are formulated to elaborate on such dynamics in this case of large separation between alternatives and nulls, for which the event [B1 1 c = N1] has probability close to 1. Theorem 4.1. If c 1, then c P(B1 1 c = N1)E[ c c + N1 + B0 1 N1+c B0 N+1 c] + 1 P(B1 1 c = N1) E[ k+c B0 N+1 c] (N1 + c) P(B1 1 c = N1) 1 E[ B0 1 N N B0 1 N N0 c] . (6) In words, for fixed c, as the alternative distributions concentrate more and more on 0, it follows that P(B1 1 c = N1) 1, so that the effect c of INCREASE-c on BH s FDR is dampened. And this occurs despite the fact that the increase k+c k in rejection count produced by INCREASE-c consists of mostly the introduction of nulls, and tends to a magnification of (N1 + c) by at least a factor of the inverse of 1 E[ B0 1 N N B0 1 N N0 c], which is straightforward to compute since B0 1 N Binom(N0,q). We refer the reader to Section 5 s Table 1 for simulations illustrating the above. 4.2 Case of Low Sub-Uniformity in Alternatives {P1 i }i H1 In the study of this regime, we aim to demonstrate that the adversarial increase c can be rather large. We provide a lower bound Lc on c that will be a function of parameters q,N,N0, as well as the alternative distributions {P1 i }i H1. 4.2.1 Lower Bounding c In view of 5, the event [ k+c = c] is clearly of significance in the computation of c. In words, this event describes the case that the BHq rejection region captures nothing, and yet when the adversary successfully executes INCREASE-c the rejection will now capture only nulls - generating a false detection proportion of 1. Hence, we lower bound the adjustment c of Theorem 3.2 by lower-bounding the probability of this event. Our strategy: (1) first, characterize this probability under the special case that P1 i = U(0,1) for all i H1; (2) second, to handle when P1 i U(0,1) for some i H1, we translate the KL divergence of the resulting discrete, bin-assignment distributions into a bound via Pinsker s Inequality. The key to the first step will be to recognize that when P1 i = U(0,1) for all i H1, the vector of total loads (BN 1 ,...,BN N ) is exchangeable (i.e., its law is invariant under permutations), given BN 1 N. Indeed, exchangeability, combined with the generalized Ballot Theorem of [31] yields the following result: Corollary 4.2. Let n 1, p [0,1], and x a non-negative integer such that 0 x n. If B = ( B1,..., Bn) Multinomial(x,(p,...,p)), then P B ( n r=1 [ r i=1 Bi < r]) = 1 x Armed with Corollary 4.2, we can begin to estimate the probability laws of k and k+c. Corollary 4.3. If P1 i = U(0,1) for all i H1, then P( k = ℓ) = ( 1 q 1 q+ (N ℓ)q N ) P(BN 1 ℓ= ℓ) for ℓ= 0,...,N, where 00 = 1, and P(BN 1 ℓ= ℓ) = (N ℓ)( qℓ If B0 N+1 c, then P( k+c = c B0 N+1) = (1 cq N )N0 B0 N+1 (1 N0 B0 N+1 N c N1q N cq). Next, we carry out the second step of our plan. Towards this, we make the following assumption. Assumption 4.4. Suppose the alternative p-values are independent and identically distributed, with common distribution P1, i.e., {pi}i H1 iid P1. We write δ = (1 q) P1 (pi BN+1), and δj = P1 (pi Bj) q N , and we assume that P1 (pi Bj) > 0 for all j [N], as well as P1 (pi Bℓ N) > 0 for arbitrary ℓ< N. This assumption not only reduces the notational burden (otherwise documenting N1 many alternative distributions) but it also allows the leveraging of KL divergences between the different bin assignment distributions implied by the alternative P1 versus the null U(0,1). Theorem 4.5. Suppose Assumption 4.4. Then c Lc(q,N,N0,P1) = (1 cq N0 (1 πc Vc)Mc + Zc] where πc(q,N,N0,P1) = N0+E[B1 c+1 N B1 1 c=0] N c , E[B1 c+1 N B1 1 c = 0] = N1 (N c)q+Nδc+1 N N cq Nδ1 c , Vc(q,N,N0,P1) = 2 E[B1 c+1 N B1 1 c = 0]DKL(P1,c), Mc(q,N,N0) = N cq N ) B0 N+1 ;B0 N+1 c 1], Zc(q,N,N0) = P(B0 N+1 c)(1 c N ) N0 E[B0 N+1 B0 N+1 c] E[B0 N+1 B0 N+1 c] N c DKL(P1,c) = j 1 N clog ( q+δ c j=1(q/N+δj) (N c)(q/N+δj) ) Remark 4.6. For a more concrete application/example, consider when the p-values {pi}i N are derived from z-scores {zi}i N via pi = P(Z > zi), where Z N(0,1), with null z-scores {zi}i H0 i.i.d. N(0,1) and alternative z-scores {zi N(µi 1,1)}i H1 (with all µi 1 0). Indeed, under this framework, we may view the µi 1 as the distance between P1 i and U(0,1). Towards satisfying Assumption 4.4, we can assume throughout that there exists a nonnegative parameter µ1 such that the alternative parameters (µi 1,σi) = (µ1,1) for all i H1. Indeed, if every member of the collection {µi 1}i H1 is in reality only close to zero, then the forthcoming estimates (lower bounds) would provide close, conservative estimates when setting µ1 = max{µi 1 i H1}. When µ1 > 0, it follows that δ(µ1) = (1 q) Φ( Φ 1(1 q) µ1 σ1 ) > 0. As well, δj(µ1) = Φ( Φ 1(1 (j 1)q N ) µ1 σ1 ) Φ( Φ 1(1 j q N ) µ1 σ1 ) q N . Intuitively, BH is adversarially robust ( c small) when the alternative distributions are far" (µ1 >> 0) from the theoretical null, and the opposite holds when they are close" (µ1 = 0) - see Figures 2 and 3. 5 Simulations and Data Experiments In this section, we provide computations (performed in R and Python on a Macbook Air-M2 chip, 8GB memory, with no experiment time exceeding 5 minutes) to demonstrate the performance of the adversarial algorithm INCREASE-c. We demonstrate its performance through simulation on synthetic data to make comparisons to the theoretical estimates provided in Section 4. We then demonstrate its performance on a real-data experiment in outlier detection. 5.1 INCREASE-c Simulations on Synthetic Data 5.1.1 INCREASE-c Simulations on i.i.d. p-values Following the framework from Remark 4.6, we simulated 104 replications of the following experiment: (1) N = 1000 p-values are generated, with {pi}i N0 iid U(0,1), and each pi among {pi}i N1 gen- erated via pi = 1 Φ( zi µ1 1 ), with {zi}i N0 iid N(µ1,1); (2) FDP[BHq;p] and FDP[BHq;p+c] are calculated. In Figure 1 each of the 104many (FDP[BHq;p], FDP[BHq;p+c]) pairs are plotted. As can be seen, the vast majority of the pairs satisfy FDP[BHq;p+c] > FDP[BHq;p], and, further, all pairs lie above the horizontal line situated at the level of the BH control level π0 q = .09, i.e, FDP[BHq;p+c] > π0 q. In Table 1, we present the effectiveness of INCREASE-c over ranges of corruption budget c and µ1 from small to large. As can be seen, when µ1 = 0, any amount of corruption budget c yields large post-corruption FDR Ez FDP[BHq;z+c]. When µ1 > 0 grows, however, the budget c must correspondingly grow in order for there to be nontrivial post-corruption FDR. Finally, we note that for any fixed c, the increase in rejection count k+c k is on the average larger when µ1 is larger. For experiments on noni.i.d., PRDS p-values, we refer the interested reader to Section 5.2 or Appendix Section 7.4. 5.1.2 INCREASE-1 Simulations versus Theoretical Bounds Under Small µ1 In Figures 2 and 3 we illustrate how well the insights discussed in Section 4.2 capture the sensitivity of BH to adversarial perturbations when µ1 is near 0. Specifically, for each q in the grid {.01,.02,...,.99}, we computed the difference FDP[BHq;p+] FDP[BHq;p] across 103 replications of the setup as in Section 5.1.1, with the average of this difference being an estimate of 1. The plot of these 1 estimates with respect to q is then compared with the plot of our lower bound L1 as a function of q. Figures 2 and 3 illustrate that the stricter the control, i.e., the smaller q is, the more effective INCREASE-c becomes. In fact, the bound indicates high vulnerability for the typical use case 0.00 0.25 0.50 0.75 1.00 BH FDP before Adversary BH FDP after Adversary N0=900, N1=100, q = 0.10, µ1=1, σ1=1, # Simulations = 1e4 Simulation of INCREASE 10 Figure 1: 104 simulations of FDP by BHq before and after INCREASE-10 is executed on the pvalues. N = 103, N0 = 900, and q = 0.10. c µ1 0 1 2 1 .99 (1.11) .75 (1.39) .14 (1.9) 2 .99 (2.22) .77 (2.75) .18 (3.7) 5 .99 (5.6) .81 (6.67) .26 (9.0) 10 .99 (11.22) .84 (13.76) .36 (17.0) 100 .99 (111.30) .91 (121.48) .72 (136.2) 200 .99 (222.41) .93 (237.78) .81 (256.0) 500 .99 (555.59) .95 (580.75) .89 (601.5) Table 1: Sample Average (104-batch) estimates of Ez FDP[BHq;z+c] and E[ k+c k] (in parentheses) when all {µi 1}i H1 commonly equal some µ1 {0,1,2} and N = 103, q = 0.10, π0 = 0.90, and all σi = 1. Figure 2: Comparing the FDR increase 1 of INCREASE-1 with the lower bound L1 of Theorem 4.5 as functions of q when µ1 = 0, N = 1000 Figure 3: Comparing the FDR increase 1 of INCREASE-1 with the lower bound L1 of Theorem 4.5 as functions of q when µ1 = .25, N = 1000 of q (0,0.10). On the other hand, as q increases, INCREASE-c s effect weakens; however, for increased q, the decision maker is accepting a higher FDR already. 5.2 Real-Data Experiment: Credit Card Fraud Detection The Credit Card2 dataset D = {(Xi,Yi)}n i=1 R30 {0,1} contains n = 284,807 credit card transactions in September 2013 by European cardholders over the course of two days - 492 of which were frauds. Each Xi R30 consists of numerical input variables that are the result of a PCA transformation. The Class" label Yi takes value 1 in case of fraud and 0 otherwise (/genuine), yielding the partition [n] = n0 n1, with n0 = 284,315 and n1 = 492. Fraud Detection Experiment Given a set of unlabeled transactions {Xi}i S, where S [n], the fraud detection task is to identify which members of S belong to n1, i.e., are fraudulent. We experiment with the BH-based, outlier detection method of [5] on this fraud detection task - specifically, we consider the false detection proportion of this method in the absence and presence of an adversary. The details of the experimental setup will now be discussed. 2https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud We begin by training an unsupervised decision-tree-based algorithm on a training set. From the set n0 of genuine transactions, we uniformly at random select a subset ntrain n0 of size 141,758 to form a training set Dtrain = {Xi}i ntrain upon which we train an isolation forest [25] ˆs R30 R+ using the R library isotree3, where, in principle, ˆs(Xi) returns an isolation depth that is smaller if Yi = 1 (i.e. is an outlier) and larger if Yi = 0 (i.e. is an inlier) Calibration and (Adversarially-Perturbed) Testing Then for each of 102 simulations, we uniformly at random selected a subset ncal n0 ntrain of size 141,657 to form a calibration set Dcal = {Xi}i ncal of strictly genuine transactions. As well, we uniformly at random selected a subset ntest,1 n1 of 100 fraudulent transactions to append to the 900 remaining genuine transactions comprising ntest,0 = n0 ntrain ncal to form a test set Dtest = {Xi}i ntest, where ntest = ntest,0 ntest,1. Finally, we transformed Xi pi (0,1) for each i ntest via pi = 1+ {j ncal ˆs(Xj) ˆs(Xi)} ncal . The resulting collection of conformal p-values p = ( pi)i ntest is PRDS, as proven in [5], and hence the FDR control of Lemma 1.1 holds (see [7]). In contrast, upon executing INCREASE-c to generate a corresponding adversarially-perturbed collection p+c = ( p+c,i)i ntest, we obtain a collection for which BH s FDR control no longer holds. Experimental Results We executed BH0.1, on both p and p+c, with an execution providing the decision for each i ntest whether to report it as genuine (null) or fraudulent (alternative). We report the average (over the 102 simulations) false detection proportion (FDP) produced by BH0.1, i.e., both E[FDP[BH0.1; p]] and E[FDP[BH0.1; p+c]] (for c = 1,5,10). As well, we report the average number of alleged frauds E[ k] and E[ k+c]. As Table 2 indicates, although the method of [5] can ordinarily control the c E[FDP[BH0.1; p]] E[FDP[BH0.1; p+c]] E[ k] E[ k+c] 1 .09 .10 60.21 60.21 5 .08 .17 48.69 57.12 10 .09 .23 56.39 72.85 20 0.09 0.31 58.12 89.06 Table 2: Credit Card Fraud Detection Experiment FDR below the explicit 0.10 level, INCREASE-c has the ability to push the FDR above this control level. 6 Conclusions This is the first work to consider adversarial corruption of the popular Benjamini Hochberg multiple testing procedure to break its FDR control. While BH may exhibit robustness when the alternative distributions are far" from the null, it exhibits great sensitivity in practical cases when the alternatives are closer" to the null. In such cases, with the modification of few p-values (as few as one), the attacker can increase the expected FDR well past the guarantee stipulated by the BH procedure. This study suggests some caution may be necessary when using BH, especially in safety-security settings. Numerical experiments support the analytical results. Finally, BH is but one member of the family of step-up multiple testing procedures, which generally entail rejection regions decided via a stopping time, which since our paper shows can be manipulated in the case of BH, it means other step-up procedures can be similarly prone. 3https://cran.r-project.org/web/packages/isotree/vignettes/An_Introduction_to_Isolation_Forests.html Acknowledgments and Disclosure of Funding We wish to thank Wang Chi Cheung for a stimulating conversation. This work was supported by the Air Force Office of Scientific Research (Mathematical Optimization Program) under the grant: Optimal Decision Making under Tight Performance Requirements in Adversarial and Uncertain Environments: Insight from Rockafellian Functions . [1] Idan Amir, Idan Attias, Tomer Koren, Yishay Mansour, and Roi Livni. Prediction with corrupted expert advice. Advances in Neural Information Processing Systems, 33:14315 14325, 2020. [2] Haris Angelidakis, Konstantin Makarychev, and Yury Makarychev. Algorithms for stable and perturbation-resilient problems. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 438 451, 2017. [3] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. Information Processing Letters, 112(1-2):49 54, 2012. [4] Maria-Florina Balcan, Nika Haghtalab, and Colin White. K-center clustering under perturbation resilience. ACM Trans. Algorithms, 16(2), mar 2020. [5] Stephen Bates, Emmanuel Candès, Lihua Lei, Yaniv Romano, and Matteo Sesia. Testing for outliers with conformal p-values. The Annals of Statistics, 51(1):149 178, 2023. [6] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological), 57(1):289 300, 1995. [7] Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165 1188, 2001. [8] Daniel Berend, Aryeh Kontorovich, Lev Reyzin, and Thomas Robinson. On biased random walks, corrupted intervals, and learning under adversarial design. Annals of Mathematics and Artificial Intelligence, 88:887 905, 2020. [9] Sujay Bhatt, Guanhua Fang, Ping Li, and Gennady Samorodnitsky. Minimax m-estimation under adversarial corruption. In Proceedings of the 39th International Conference on Machine Learning (ICML), Baltimore, MD, 2022. [10] Yonatan Bilu and Nathan Linial. Are stable instances easy? Combinatorics, Probability and Computing, 21(5):643 660, 2012. [11] Shuchen Cao, Ruizhi Zhang, and Shaofeng Zou. Adversarially robust sequential hypothesis testing. Sequential Analysis, 41(1):81 103, 2022. [12] D.Hendrycks and K.Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proceedings of International Conference on Learning Representations, 2017. [13] Ilias Diakonikolas and Daniel M Kane. Algorithmic high-dimensional robust statistics. Cambridge university press, 2023. [14] Bradley Efron and Trevor Hastie. Computer age statistical inference, 2013. [15] Rui Gao, Liyan Xie, Yao Xie, and Huan Xu. Robust hypothesis testing using wasserstein uncertainty sets. Advances in Neural Information Processing Systems, 31, 2018. [16] Ian Goodfellow, Patrick Mc Daniel, and Nicolas Papernot. Making machine learning robust against adversarial inputs. Commun. ACM, 61(7):56 66, jun 2018. [17] Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pages 1562 1578. PMLR, 2019. [18] Ying Jin and Emmanuel J Candès. Selection by prediction with conformal p-values. Journal of Machine Learning Research, 24(244):1 41, 2023. [19] Yulu Jin and Lifeng Lai. On the adversarial robustness of hypothesis testing. IEEE Transactions on Signal Processing, 69:515 530, 2021. [20] Akshay Krishnamurthy, Thodoris Lykouris, Chara Podimata, and Robert Schapire. Contextual search in the presence of adversarial corruptions. Operations Research, 71(4):1120 1135, 2023. [21] K. Lee, K. Lee, H. Lee, and J. Shin. A simple unified framework for detecting out-ofdistribution samples and adversarial attacks. In Neur IPS, 2018. [22] K. Lee, K. Lee, H. Lee, and J. Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In International Conference on Learning Representations, 2018. [23] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems, 31, 2018. [24] Zishuo Li, Yilin Mo, and Fei Hao. Game theoretical approach to sequential hypothesis test with byzantine sensors. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 2654 2659. IEEE, 2019. [25] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 eighth ieee international conference on data mining, pages 413 422. IEEE, 2008. [26] Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 114 122, 2018. [27] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. [28] Akshayaa Magesh, Venugopal V Veeravalli, Anirban Roy, and Susmit Jha. Principled out-ofdistribution detection via multiple testing. Journal of Machine Learning Research, 24(378):1 35, 2023. [29] Martin Raab and Angelika Steger. balls into bins a simple and tight analysis. In International Workshop on Randomization and Approximation Techniques in Computer Science, pages 159 170. Springer, 1998. [30] Weijie J Su. The fdr-linking theorem. ar Xiv preprint ar Xiv:1812.08965, 2018. [31] Lajos Takács. A generalization of the ballot problem and its application in the theory of queues. Journal of the American Statistical Association, 57(298):327 337, 1962. [32] Weiwei Liu Xinsong Ma, Xin Zou. A provable decision rule for out-of-distribution detection. Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 2024. 7 Appendix / supplemental material 7.1 Adversarial Algorithm: INCREASE-c Theorem 3.1. If c 1, then E[ k+c k B0 N+1 c] c 1 1 E[ B0 k+2 N N ( k+1) B0 N+1 c] + 1 (4) for any collection of alternative hypothesis distributions {P1 i }i H1. Proof. Since the statement in the case of c = 1 is trivially true, we henceforth assume that c > 1. Recall k+c = max{i [c,N]Z B1 i = i c}. Let us define k0 +c = max{i [c,N]Z B0 1 i = i (B1 1 k + c)}. Then we begin by establishing that k+c k0 +c > k + 1. To see why this holds, we note that B0 1 k+1 = k B1 1 k+1 = k + 1 (B1 1 k+1 + 1) > k + 1 (B1 1 k+1 + c), and this implies k + 1 < k0 +c. Further, B1 k0 +c = B0 1 k0 +c + B1 1 k0 +c = k0 +c (B1 1 k + c) + B1 1 k0 +c = k0 +c c + (B1 1 k0 +c B1 1 k) k0 +c c, which implies k+c k0 +c. Next, we justify the relation B0 k+2 N N ( k + 1) k,B1 1 k,[B0 N+1 c] = E B0 k+2 k0 +c k0+c ( k + 1) k,B1 1 k,[B0 N+1 c] . It suffices to establish it for any fixed, joint realization of k,B1 1 k that occurs consistent with the event [B0 N+1 c] with positive probability. With slight abuse of notation, we will continue to use k,B1 1 k for such a fixed realization, and write E[ ] for E[ k,B1 1 k,[B0 N+1 c]]. The plan is to apply the Optional Stopping Theorem on a martingale sequence. To do so, let us form a (backwards-running) filtration: let FN be the sigma-algebra generated by the event [B0 N+1 c] as well as the random variable B0 N+1, and let Fi be the sigma-algebra generated by the event [B0 N+1 c] as well as the random variables {B0 j }N+1 j=i+1. Then for any integers k,ℓ {N,N 1,...1} such that ℓ> k, E[B0 k+2 [k ( k+2)] Fℓ] = E[B0 k+2 [k ( k+2)] B0 k+2 [ℓ ( k+2)]] = [k ( k + 1)] 1 [ℓ ( k + 1)] 1 B0 k+2 [ℓ ( k+2)], where B0 k+2 [ℓ ( k+2)] Fℓsince B0 k+2 [ℓ ( k+2)] = N0 B0 N+1 ( k B1 1 k) B0 (ℓ+1) ( k+3) N is a measurable function of the random variables in Fℓ. In other words, the collection { B0 k+2 [k ( k+2)] [k ( k+1)] 1 } forms a martingale under the filtration. As well, k0 +c is a stopping time. Hence, by the Optional Stopping Theorem, B0 k+2 N N ( k + 1) k,B1 1 k,[B0 N+1 c] = E B0 k+2 N N ( k + 1) [B0 N+1 c] B0 k+2 k0 +c k0+c ( k + 1) [B0 N+1 c] = E B0 k+2 k0 +c k0+c ( k + 1) k,B1 1 k,[B0 N+1 c] . We note that the martingale property could be established even without the event [B0 N+1 c] in the filtration but for the sake of forthcoming computations we have included it. Next we observe that B0 k+2 k0 +c + k B1 1 k = B0 k+2 k0 +c + B0 1 k = B0 1 k0 +c = k0 +c (B1 1 k + c) Ô B0 k+2 k0 +c = k0 +c ( k + c), B0 k+2 k0 +c k0+c ( k + 1) = k0 +c ( k + c) k0+c ( k + 1) = k0 +c ( k + 1) c + 1 k0+c ( k + 1) = 1 c 1 k0+c ( k + 1) B0 k+2 N N ( k + 1) k,B1 1 k,[B0 N+1 c] = 1 (c 1)E[ 1 k0+c ( k + 1) k,B1 1 k,[B0 N+1 c]] < 1 E[ k0+c ( k + 1) k,B1 1 k,[B0 N+1 c]] E[ 1 k0+c ( k + 1) k,B1 1 k,[B0 N+1 c]] = 1 E[ B0 k+2 N N ( k+1) k,B1 1 k,[B0 N+1 c]] 1 E[ B0 k+2 N N ( k+1) k,B1 1 k,[B0 N+1 c]] +1 E[ k0 +c k k,B1 1 k,[B0 N+1 c]] E[ k+c k k,B1 1 k,[B0 N+1 c]]. It follows that c 1 1 E[ B0 k+2 N N ( k+1) B0 N+1 c] + 1 = c 1 E[1 E[ B0 k+2 N N ( k+1) k,B1 1 k,[B0 N+1 c]] B0 N+1 c] + 1 1 E[ B0 k+2 N N ( k+1) k,B1 1 k,[B0 N+1 c]] B0 N+1 c E[ k+c k B0 N+1 c], where we have used (conditional) Jensen s Inequality - valid because c > 1 and (*) ensures 1 E[ B0 k+2 N N ( k+1) k,B1 1 k,[B0 N+1 c]] > 0. Theorem 3.2. Given c 1, let p+c denote the perturbed form of p that INCREASE-c produces. Then the adversarially-adjusted FDR induced by INCREASE-c is EFDP[BHq;p+c] = EFDP[BHq;p] + c, for any collection of alternative distributions {P1 i }i H1, where k+c ;B0 N+1 c]. (5) Proof. We derive EFDP[BHq;p+] = E B0 1 k+c + c k+c ;B0 N+1 c + E B0 1 k 1 k 1 ;B0 N+1 < c B0 1 k+c k+c ;B0 N+1 c + E[ c k+c ;B0 N+1 c] + E B0 1 k 1 k 1 ;B0 N+1 < c = P(B0 N+1 c)E[B0 1 N N B0 N+1 c] + E[ c k+c ;B0 N+1 c] + P(B0 N+1 < c)E[B0 1 N N B0 N+1 < c] = E[B0 1 N N ] + E[ c k+c ;B0 N+1 c], as desired. MOVE-1 is an efficient, optimal algorithm for an (omniscient) adversary with budget c = 1. We begin with several small insights en route to describing MOVE-1 in full. Perturbing the Rejection Count If p = p, i.e., the sample statistics are left unperturbed, we obtain a false detection proportion of FDP [BHq;p] = B0 1 k k . If it is possible to improve on this, it can be easily shown that in the case of c = 1 we must induce an altered rejection count k = max{i [0,N]Z {p i}i N B1 i} = i}, (7) henceforth referred to as the (adversarially) adjusted rejection count resulting from the adversary s choice of p . As it turns out, for the special case of c = 1, k k is necessary if a larger FDP is to be obtained, as the following lemma indicates. Lemma 7.1. If p p 0 1 and FDP [BHq;p ] FDP [BHq;p], then k k. Proof. We suppose the contrary for the sake of a contradiction, i.e., k = k. It follows that {j N 0 p j < k q N } = k = k = {j N 0 pj < k q and then by FDP [BHq;p ;] FDP [BHq;p], it follows that {j H0 0 p j < k q N } {j H0 0 pj < k q Since p p 0 1, these two conclusions are at odds, presenting a contradiction. Considering Lemma 7.1, the adversary s p [0,1]N decision simplifies to deciding on an adjusted rejection count k from among a constrained set of integers consistent with the constraint p p 0 1. An optimal k can indeed be larger or smaller than, or even equal to k. Hence, towards understanding this new search space, it suffices to characterize the set of feasible k that are larger than k, and smaller. Increasing the Rejection Count (c = 1) Towards understanding how to increase the rejection count, we first highlight the following fact that follows immediately from (2). Lemma 7.2. If i { k + 1,...,N}, then BN 1 i i 1. In particular, BN 1 k+1 = i 1. Proof. Suppose there exists i { k + 1,...,N} such that BN 1 i > i 1. If BN 1 i = i, then i > k presents a contradiction of (2). However, proceeding with BN 1 i i + 1, we see that there necessarily exists j {i + 1,...,N} for which BN 1 j = j, for if this were not the case, then BN 1 j j + 1 for all j {i + 1,...,N}, meaning BN 1 N N + 1, yet another contradiction since there are only N p-values. Consequently, for k > k, we require a perturbed collection p for which bin counts increase. This necessitates a decrease of a p-value - see Lemma 7.5 for details. Hence, we define L = {i { k + 1,...,N} BN 1 i = i 1} (8) and note that these are precisely the positions i for which the addition of one (and only one) p-value into B1 i through the decrease of a p-value brings i into candidacy for the new rejection count - see equation (2). Proposition 7.3. L = { k k > k, p p 0 1} Proof. To prove the first statement, we recall that by Lemma 7.2, if i > k, then BN 1 i i 1. So, if i k + 1 with BN 1 i < i 1, then no decrease of a single p-value could increase BN 1 i to i (Lemma 7.5), precluding the possibility of k = i. In other words, no i L is achievable for the new rejection count k . On the other hand, say we enumerate L with i L > i L 1 > ... > i1 = k + 1. For the case of i L, k = i L is achieved if and only if a p-value is decreased from any bin Bj with j > iℓto a bin Bj s where i L j s 1. For the case of iℓ, k = iℓis achieved if and only if a p-value is decreased from any bin Bj with iℓ+1 j > iℓto a bin Bj s where iℓ j s 1. Finally, all these movements of p-value just described are always possible for any given p = {pi}i N . Decreasing the Rejection Count (c = 1) Analogously, it follows that the increase of a p-value is necessary for the decrease of the rejection count - see Lemma 7.6, and we define R = {i {i + 1,..., k 1} BN 1 i = i} {i }, (9) where i = 0 max{i {1,..., k 1} BN 1 i = i + 1}. We note that R is precisely the set of all the achievable new (decreased) rejection counts k < k we could induce with the perturbation of a single p-value. Proposition 7.4. R = { k k > k , p p 0 1} Proof. We begin by proving the first statement that k < k implies k R. To do so, we proceed in two steps. First we show that k i , so that i k k 1. Then we show that if k = i for some i {i + 1,..., k 1} in which BN 1 i i, we arrive at a contradiction. To see that k i , if i > k , then BN 1 i = i + 1 becomes BN 1 i i 1 by Lemma 7.2; in other words, the change in magnitude of BN 1 i is at least 2, which contradicts Lemma 7.6. Next, if i {i + 1,..., k 1}, then BN 1 i i by the definition of i . This means if BN 1 i i, then BN 1 i < i, so that Lemma 7.6 indicates k could not be i. As for the second statement, let R be enumerated i R > i R 1 > ... > i1 = i . For the case of i R, k = i R is achieved if and only if a p-value is moved from bin Bj with k j > i R to a bin Bj+s > k. For R 1 r 2, by Lemma 7.6 it holds that k = ir is achieved if and only if a p-value is moved from a bin Bj with ir+1 j > ir to a bin Bj+s with j + s > k. Finally, for the case of i1, k = i1 = i is achieved if and only if a p-value is moved from Bj with i j to a Bj+s with j + s > k. Finally, all these movements of p-values just described are always possible for any given p = {pi}i N . It follows that k k if and only if k L R, so an efficient, optimal search procedure becomes straightforward. Informally, we iterate over the bins in reverse order, beginning with N + 1 and terminating with i . At iteration (/bin number) i, if i L R, then the trivial subproblem maxp p p 0 1, k =i FDP[BHq;p ] is solved; otherwise, nothing is done. Upon termination, the best FDP encountered is the answer. This is summarized in Theorem 7.7 Lemma 7.5 (Decreasing a p-value). Let p be such that p p 0 1. If p is the decrease of a single p-value in p from Bj to Bj s, where j {2,...,N + 1} and 1 s j 1, then i {1,...,j s 1} Ô BN 1 i remains constant i {j s,...,j 1} Ô BN 1 i increases by 1 i {j,...,N + 1} Ô BN 1 i remains constant. Lemma 7.6 (Increasing a p-value). Let p be such that p p 0 1. If p is the increase of a single p-value in p from Bj to Bj+s, where j {1,...,N} and 1 s N + 1 j, then i {1,...,j 1} Ô BN 1 i remains constant i {j,...,j + s 1} Ô BN 1 i decreases by 1 i {j + s,...,N + 1} Ô BN 1 i remains constant. Theorem 7.7 (MOVE-1). Let q (0,1), p-values {pi}i N and sets H0, H1 be given. For each i L R, let FDPi = max p p p 0 1, k =i FDP[BHq;p ]. max p p p 0 1FDP[BHq;p ] = ( max i L R { k} FDPi) This result explains that with one pass of the p-values from the largest to the smallest in the collection, we can ascertain the optimal perturbation p . As the execution based on this result is straightforward, we omit the pseudocode for the sake of brevity. In Table 3, we compare the average performance of INCREASE-1 against that of the optimal MOVE-1 over 104 simulations. The experiments followed the setup described in Remark 4.6, in which N = 103 p-values are derived from independently generated z-scores, with N0(= 900) null z-scores i.i.d. N(0,1) and N1(= 100) alternative z-scores i.i.d. N(µ1,1), for µ1 = 1,2. As Table 3 indicates, INCREASE-1 can provide nearly identical performance in adjustment to FDR; however, the perturbation distance z z is on the average much greater than in MOVE-1. MOVE-1(INCREASE-1) Ez FDP[BHq;z ] Average z z 1 µ1 = 2 0.140 (0.139) 0.139 (1.563) µ1 = 1 0.775 (0.751) 0.492 (2.297) µ1 = 0 .992 (0.990) 0.551 (2.41) Table 3: Sample Average (104-batch) estimates of Ez FDP[BHq;z ] under MOVE-1 and INCREASE-1 (in parentheses) when µ1 = 0,1,2 and N = 103,q = 0.10,π0 = 0.90, and all σi = 1. 7.3 Theoretical Analysis: Performance Guarantees and Insights into Adversarial Robustness Theorem 4.1. If c 1, then c P(B1 1 c = N1)E[ c c + N1 + B0 1 N1+c B0 N+1 c] + 1 P(B1 1 c = N1) E[ k+c B0 N+1 c] (N1 + c) P(B1 1 c = N1) 1 E[ B0 1 N N B0 1 N N0 c] . (6) Proof. When B1 1 c = N1, it easily follows that k+c c + N1 + B0 1 N1+c. This combined with Theorem 3.2 easily yields the first inequality. As for the second inequality, let k0 +c = max{i [0,N]Z B0 1 i = i (N1 + c)} N1 + c > 0. E[B0 1 N N B0 N+1 c] = E B0 1 k0 +c k0+c B0 N+1 c = E[ k0 +c (N1 + c) k0+c B0 N+1 c] = 1 (N1 + c)E[ 1 k0+c B0 N+1 c] implies that E[ k0+c B0 N+1 c] E[ 1 k0+c B0 N+1 c] = 1 N1 + c (1 E[B0 1 N N B0 N+1 c]), E[ k+c B0 N+1 c,B1 1 c = N1] = E[ k0 +c B0 N+1 c] N1 + c 1 E[ B0 1 N N B0 N+1 c] , since whenever B1 1 c = N1, it follows that k+c = k0 +c. Consequently, we derive 1 E[ B0 1 N N B0 N+1 c] E[ k+c B0 N+1 c,B1 1 c = N1] = E[ k+c;B1 1 c = N1,B0 N+1 c] P(B1 1 c = N1)P(B0 N+1 c) = E[ k+c;B0 N+1 c] E[ k+c;B1 1 c < N1,B0 N+1 c] P(B1 1 c = N1)P(B0 N+1 c) = E[ k+c B0 N+1 c] 1 P(B1 1 c = N1) E[ k+c B1 1 c < N1,B0 N+1 c] P(B1 1 c < N1) P(B1 1 c = N1), P(B1 1 c = N1) N1 + c 1 E[ B0 1 N N B0 N+1 c] +E[ k+c B1 1 c < N1,B0 N+1 c] P(B1 1 c < N1) E[ k+c B0 N+1 c]. Corollary 4.2. Let n 1, p [0,1], and x a non-negative integer such that 0 x n. If B = ( B1,..., Bn) Multinomial(x,(p,...,p)), then P B ( n r=1 [ r i=1 Bi < r]) = 1 x Proof. We will appeal to the following result of [31]: Lemma 7.8 (Theorem 1 of [31]). Let there be n 1 non-negative integers z1,...,zn summing to x. If π denotes a permutation drawn uniformly at random, then P π ( n r=1 [ r i=1 z π(i) < r]) = [1 x Let π be a random permutation on {1,...,n} that is drawn uniformly at random, independently of B. Since B = ( B1,..., Bn) Multinomial(x,(p,...,p)), then B π = ( B π(1),..., B π(n)) is equivalent in distribution to B, or B π D= B. Therefore, P B ( n r=1 [ r i=1 Bi < r]) = E B [P π ( n r=1 [ r i=1 B π(i) < r])] Corollary 4.3. If P1 i = U(0,1) for all i H1, then P( k = ℓ) = ( 1 q 1 q+ (N ℓ)q N ) P(BN 1 ℓ= ℓ) for ℓ= 0,...,N, where 00 = 1, and P(BN 1 ℓ= ℓ) = (N ℓ)( qℓ If B0 N+1 c, then P( k+c = c B0 N+1) = (1 cq N )N0 B0 N+1 (1 N0 B0 N+1 N c N1q N cq). Proof. The event [ k = ℓ] is equivalent to { N j=ℓ+1 [BN ℓ+1 j < j ℓ]} [BN 1 ℓ= ℓ], an intersection of two events. Regarding the event [BN 1 ℓ= ℓ], because µ1 = 0, it is clear that P(BN 1 ℓ= ℓ) = (N ℓ)( qℓ N )N ℓ. To complete the characterization of P( k = ℓ) it suffices to find P( N j=ℓ+1 [BN ℓ+1 j < j ℓ] [BN 1 ℓ= ℓ]). Towards doing so, we note that conditioned on [BN 1 ℓ= ℓ], we have BN ℓ+1 N Binom N ℓ, N 1 q + (N ℓ)q and, upon further conditioning on BN ℓ+1 N, (BN ℓ+1,...,BN N ) Multinom(BN ℓ+1 N,( 1 N ℓ,..., 1 N ℓ)). Hence, we derive P( N j=ℓ+1 [BN ℓ+1 j < j ℓ] [BN 1 ℓ= ℓ],BN ℓ+1 N) = 1 BN ℓ+1 N N ℓ, by Corollary 4.2. To conclude, we integrate out BN ℓ+1 N from this derivation; more precisely, P( N j=ℓ+1 [BN ℓ+1 j < j ℓ] [BN 1 ℓ= ℓ]) = EBN ℓ+1 N [P( N j=ℓ+1 [BN ℓ+1 j < j ℓ] [BN 1 ℓ= ℓ],BN ℓ+1 N) [BN 1 ℓ= ℓ]] = 1 E[BN ℓ+1 N [BN 1 ℓ= ℓ]] N 1 q + (N ℓ)q 1 q + (N ℓ)q As for the second statement, in similar fashion to above, Corollary 4.2 yields P( N j=c+1[BN c+1 j < j c] [BN 1 c = 0],B0 N+1,B1 c+1 N) = 1 N0 B0 N+1 + B1 c+1 N N c , P( N j=c+1[BN c+1 j < j c] [BN 1 c = 0],B0 N+1) = 1 N0 B0 N+1 + E[B1 c+1 N B1 1 c = 0] N c . P( k+c = c B0 N+1) = P(BN 1 c = 0 B0 N+1)P( N j=c+1[BN c+1 j < j 1] BN 1 c = 0,B0 N+1) N )N0 B0 N+1 (1 N0 B0 N+1 N c N1q N cq ). Theorem 4.5. Suppose Assumption 4.4. Then c Lc(q,N,N0,P1) = (1 cq N0 (1 πc Vc)Mc + Zc] where πc(q,N,N0,P1) = N0+E[B1 c+1 N B1 1 c=0] N c , E[B1 c+1 N B1 1 c = 0] = N1 (N c)q+Nδc+1 N N cq Nδ1 c , Vc(q,N,N0,P1) = 2 E[B1 c+1 N B1 1 c = 0]DKL(P1,c), Mc(q,N,N0) = N cq N ) B0 N+1 ;B0 N+1 c 1], Zc(q,N,N0) = P(B0 N+1 c)(1 c N ) N0 E[B0 N+1 B0 N+1 c] E[B0 N+1 B0 N+1 c] N c DKL(P1,c) = j 1 N clog ( q+δ c j=1(q/N+δj) (N c)(q/N+δj) ) Proof. Observe that c = E[ c k+c ;B0 N+1 c] P( k+c = c,B0 N+1 c). By Theorem 4.3, P( k+c = c,B0 N+1 c) = E[P( k+c = c B0 N+1);B0 N+1 c] P(B1 c = 0 B0 N+1) P( N j=c+1[BN c+1 j < j c] [BN 1 c = 0],B0 N+1) ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ (#) where we now proceed to lower bound (#). The following discussion outlines how we proceed in the case of c = 1, but this is without loss of generality. Let B0 N+1 and B1 2 N be given, along with the event [BN 1 = 0]. Then there are N0 B0 N+1 many null p-values and B1 2 N many alternative p-values each of whose assignment to one of bin number 2,...,N remains stochastic. Let there be an arbitrary enumeration of these null p-values, upon which we let the collection of their bin-assignment random variables be denoted (α0 i )N0 B0 N+1 i=1 . Let there also be an arbitrary enumeration of these alternative p-values, upon which we let the collection of their bin-assignment random variables be denoted (α1 i )B1 2 N i=1 . More precisely, α0 i Unif({2,...,N}), and α1 i = j with probability q/N+δj q+δ (q/N+δ1), for j = 2,...,N. Finally, we will write Pbins 1 for the joint distribution derived from the independent coupling of all the random variables (α0 i )N0 B0 N+1 i=1 and (α1 i )B1 2 N i=1 . For contrast, let α1 i Unif({2,...,N}), which expresses the random bin assignment of any alternative p-value to one of B2,...,BN given that BN 1 = 0, were P1 = U(0,1). Then Pbins 0 , the joint derived from the independent coupling of (α0 i )N0 B0 N+1 i=1 and ( α1 i )B1 2 N i=1 , offers a useful counterpoint to Pbins 1 . By the chain rule for KL-divergence, we find DKL (Pbins 0 Pbins 1 ) = N0 B0 N+1 i=1 DKL (α0 i α0 i ) + B1 2 N i=1 DKL ( α1 i α1 i ) 1 N 1 q/N+δj q+δ (q/N+δ1) ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ = DKL(P1,1) and by Pinsker s Inequality, for any event E, Pbins 1 (E) Pbins 0 (E) 2 DKL (Pbins 0 Pbins 1 ). Let a0 i (respectively a1 i ) denote the realization of the i-th null (respectively alternative) p-value s bin assignment. Then if we set E to be the collection of realizations (a0 i )N0 B0 N+1 i=1 and (a1 i )B1 2 N i=1 that are consistent with N j=2[B2 j < j 1], we find P( N j=2[BN 2 j < j 1] BN 1 = 0,B0 N+1,B1 2 N) = Pµ1 (E) P0 (E) 2 DKL (Pbins 0 Pbins 1 ) = (1 N0 B0 N+1 + B1 2 N N 1 ) 2 B1 2 NDKL(P1,1). (Corollary 4.2) Hence, for the case of general c, (#) = E[P( N j=c+1[BN c+1 j < j c] [BN 1 c = 0],B0 N+1,B1 c+1 N) [BN 1 c = 0],B0 N+1] E (1 N0 B0 N+1 + B1 c+1 N N c ) 2 B1 c+1 NDKL(P1,c) [BN 1 c = 0],B0 N+1 (Pinsker s Inequality) 1 N0 B0 N+1 + E[B1 c+1 N B1 1 c = 0] 2 E[B1 c+1 N B1 1 c = 0]DKL(P1,c). It follows that P( k+c = c,B0 N+1 c) P(BN 1 c = 0 B0 N+1) ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ N δ1 c)N1(1 c N )N0 B0 N+1 1 N0 B0 N+1 + E[B1 c+1 N B1 1 = 0] 2 E[B1 c+1 N B1 1 = 0]DKL(P1,c) ;B0 N+1 c To conclude, P( k+c = c,B0 N+1 1) N )N0 B0 N+1 1 N0 B0 N+1 + E[B1 c+1 N B1 1 c = 0] 2 E[B1 2 N B1 1 = 0]DKL(P1,c) ;B0 N+1 c 1 N0 + E[B1 c+1 N B1 1 c = 0] N c ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ π(c,µ1) 2 E[B1 c+1 N B1 1 c = 0]DKL(P1,c) ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ V (c,µ1) N ) B0 N+1;B0 N+1 c] ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ( ) N )N0 B0 N+1 B0 N+1 N c;B0 N+1 c] ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ( ) ( ) = E (1 c B0 N+1 E (1 c B0 N+1 ;B0 N+1 c 1 = N cq B0 N+1 ;B0 N+1 c 1 , ( ) P(B0 N+1 c)(1 c N0 E[B0 N+1 B0 N+1 c] E[B0 N+1 B0 N+1 c] N c (Jensen s) 0.0 0.1 0.2 0.3 FDP[BHq ; p] FDP[BHq ; p+ c ] N0=900, N1=100, q = 0.10, a=1.5, # Simulations = 1e3 Simulation of INCREASE 5 on Marginal Conformal P values Figure 4: 103 simulations of FDP by BHq with and without application of INCREASE-5 on marginal conformal p-values [5] derived from an SVM one-class classifier on a test set with outliers drawn with a = 1.5. c a 1 1.5 2 2.5 3 1 1 .559 .096 .096 .098 5 1 .368 .142 .135 .133 10 .997 .435 .190 .174 .170 50 .992 .657 .429 .389 .383 Table 4: Sample Average (103-batch) estimates of EFDP[BHq;p+c] on marginal conformal pvalues for outlier detection [5], in terms of signal strength a and budget c. 7.4 Simulations and Data Experiments 7.4.1 INCREASE-c Simulations on PRDS p-values In Table 4 we illustrate the effectiveness of INCREASE-c in disrupting the nonparametric outlier detection method of [5] that is based on the application of BH on conformal p-values, and in doing so, demonstrate effectiveness of INCREASE-c on PRDS p-values. We follow the simulation setting of Section 5.2 in [5], using their publicly available source code to generate the conformal p-values. In short, a data set is generated in R50, along with 103 training observations used to fit a one-class SVM classifier, as well as 103 observations forming a calibration set to be used with a test set to derive (marginal) conformal p-values. In each of 103 independent replications, INCREASE-c was applied to a new test set consisting of 103 conformal p-values, designed to discern inliers (signals drawn from a mixture of multivariate gaussians with identity covariance matrices) from outliers (signals drawn from a mixture of multivariate gaussians with identity covariance matrices scaled by a strength a). This was performed for a {1,1.5,2,2.5,3}; a signal strength a = 1 corresponds to identical null and alternative distributions, while larger values of a make it easier to detect outliers. We set the fraction of outliers in each test set to π1 = .1, so that a fraction π0 = .90 of the observations are inliers in each data set. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: Our abstract and introduction precisely summarize the results of our paper s content. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Yes, we discuss to what extent our analysis could be extended and further generalized, but were not done due to space constraints. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: Our appendix contains the full proofs of all theoretical results. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: All such information is disclosed in our Experimental Results section. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: Experiments conducted were either with simulations on synthetic data, or a Kaggle data set in which the URL is provided. They are simple to set up, and we described them precisely. The codes we implemented were either our own scripts, or code that we cited from [5]. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: Training of an isolation forest was done in Section 5.2. Training of a one-class SVM was also done in Appendix Section 7.4 when we implemented the simulations of [5] as executed with their publicly available code at: https://github.com/msesia/conditionalconformal-pvalues.git 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We found it unnecessary to report error bars as the purpose behind our simulations was to showcase/illustrate how large the false detection proportion could be made as a result of small perturbation. Further, the simulation averages were consistent with our theory. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: See Section 5 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: We affirm that the research conforms to the code of ethics. 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: In the Introduction and Conclusion we discuss potential concerns for safetysecurity settings 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer:[NA] Justification: [NA] 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We give credit to [5] for the code we implemented in our own experiments. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: [NA] 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: [NA] 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer:[NA] Justification: [NA]