# PAPRIKA: Private Online False Discovery Rate Control

Wanrong Zhang¹, Gautam Kamath*², Rachel Cummings*³

In hypothesis testing, a false discovery occurs when a hypothesis is incorrectly rejected due to noise in the sample. When adaptively testing multiple hypotheses, the probability of a false discovery increases as more tests are performed. Thus the problem of False Discovery Rate (FDR) control is to find a procedure for testing multiple hypotheses that accounts for this effect in determining the set of hypotheses to reject. The goal is to minimize the number (or fraction) of false discoveries, while maintaining a high true positive rate (i.e., a high rate of correct discoveries). In this work, we study FDR control in multiple hypothesis testing under the constraint of differential privacy for the sample. Unlike previous work in this direction, we focus on the online setting, meaning that a decision about each hypothesis must be made immediately after the test is performed, rather than waiting for the output of all tests as in the offline setting. We provide new private algorithms based on state-of-the-art results in non-private online FDR control. Our algorithms have strong provable guarantees for privacy and statistical performance as measured by FDR and power. We also provide experimental results to demonstrate the efficacy of our algorithms in a variety of data environments.

*Equal contribution as last author. ¹H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA. ²Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada. ³Department of Industrial Engineering and Operations Research, Columbia University, New York, NY, USA. Correspondence to: Wanrong Zhang, Gautam Kamath, Rachel Cummings. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

## 1. Introduction

In the modern era of big data, data analyses play an important role in decision-making in healthcare, information technology, and government agencies. The growing availability of large-scale datasets and the ease of data analysis, while beneficial to society, have created a severe crisis of reproducibility in science. In 2011, Bayer HealthCare reviewed 67 in-house projects and found that fewer than 25 percent could be replicated, and that over two-thirds of the projects had major inconsistencies (National Academies, 2019). One major reason is that random noise in the data can often be mistaken for interesting signal, which does not yield valid and reproducible results. This problem is particularly relevant when testing multiple hypotheses, where there is an increased chance of false discoveries based on noise in the data. For example, an analyst may conduct 250 hypothesis tests and find that 11 are significant at the 5% level. This may be exciting to the researcher who publishes a paper based on these findings, but elementary statistics suggests that (in expectation) $250 \times 0.05 = 12.5$ of those tests will be significant at that level purely by chance, even if all the null hypotheses are true. To avoid such problems, statisticians have developed tools for controlling overall error rates when performing multiple hypothesis tests.
In hypothesis testing, the null hypothesis of no interesting scientific discovery (e.g., a drug has no effect) is tested against the alternative hypothesis of a particular scientific theory being true (e.g., a drug has a particular effect). The significance of each test is measured by a p-value, which is the probability of the observed data occurring under the null hypothesis, and a hypothesis is rejected if the corresponding p-value is below some (fixed) significance level. Each rejection is called a discovery, and a rejected hypothesis is a false discovery if the null hypothesis is actually true. When testing multiple hypotheses, the probability of a false discovery increases as more tests are performed. The problem of false discovery rate (FDR) control is to find a procedure for testing multiple hypotheses that takes in the p-values of each test and outputs a set of hypotheses to reject. The goal is to minimize the number of false discoveries, while maintaining a high true positive rate (i.e., many true discoveries).

In many applications, the dataset may contain sensitive personal information, and the analysis must be conducted in a privacy-preserving way. For example, in genome-wide association studies (GWAS), a large number of single-nucleotide polymorphisms (SNPs) are tested for an association with a disease, either simultaneously or adaptively. Prior work has shown that the statistical analysis of these datasets can lead to privacy concerns, and it is possible to identify an individual's genotype even when only minor allele frequencies are revealed (Homer et al., 2008). The field of differential privacy (Dwork et al., 2006) offers data analysis tools that provide powerful worst-case privacy guarantees, and it has become a de facto gold standard in private data analysis. Informally, an algorithm that is ε-differentially private ensures that any particular output of the algorithm is at most a factor of $e^\varepsilon$ more likely when a single data point is changed. This parameterization allows for a smooth tradeoff between accurate analysis and privacy for the individuals who have contributed data. In the past decade, researchers have developed a wide variety of differentially private algorithms for many statistical tasks; these tools have been implemented in practice at major organizations including Google (Erlingsson et al., 2014), Apple (Differential Privacy Team, Apple, 2017), Microsoft (Ding et al., 2017), and the U.S. Census Bureau (Dajani et al., 2017).

**Related Work.** The only prior work on differentially private FDR control (Dwork et al., 2018) considers the classic offline multiple testing problem, where an analyst has all the hypotheses and corresponding p-values upfront. Their private method repeatedly applies REPORTNOISYMIN (Dwork & Roth, 2014) to the celebrated Benjamini-Hochberg (BH) procedure (Benjamini & Hochberg, 1995) for offline multiple testing to privately pre-screen the p-values, and then applies the BH procedure again to select the significant p-values. The (non-private) BH procedure first sorts all p-values, and then sequentially compares them to an increasing threshold, rejecting all p-values below their (ranked and sequential) threshold. The REPORTNOISYMIN mechanism privatizes this procedure by repeatedly (and privately) finding the hypothesis with the lowest p-value.
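For intuition, the following is a minimal sketch of the (non-private) BH step-up procedure just described; the function name and the synthetic example are our own illustration, not code from the paper.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Non-private Benjamini-Hochberg step-up procedure (a sketch).

    Compares the i-th smallest p-value to the increasing threshold
    alpha * i / m, and rejects every hypothesis whose p-value is at most
    the largest sorted p-value falling below its threshold.
    """
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()   # largest rank still below its threshold
        reject[order[:k + 1]] = True     # reject the k+1 smallest p-values
    return reject

# Toy example: 3 strong signals hidden among 17 nulls.
rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(0, 1e-3, 3), rng.uniform(0, 1, 17)])
print(benjamini_hochberg(p, alpha=0.05).sum(), "discoveries")
```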
Although the work of (Dwork et al., 2018) showed that it is possible to integrate differential privacy with FDR control in multiple hypothesis testing, the assumption of having all hypotheses and p-values upfront is not reasonable in many practical settings. For example, a hospital may conduct multi-phase clinical trials where more patients join over time, or a marketing company may perform A/B tests sequentially. In this work, we focus on the more practical online hypothesis testing problem, where a stream of hypotheses arrives sequentially, and decisions to reject hypotheses must be made based on current and previous results before the next hypothesis arrives. The sequence of hypotheses could be independent or adaptively chosen. Due to the fundamental difference between offline and online FDR procedures, the method of (Dwork et al., 2018) based on REPORTNOISYMIN cannot be applied to the online setting. Instead, we use SPARSEVECTOR, described in Section 2.1, as a starting point. Discussion of non-private online multiple hypothesis testing appears in Section 2.2.

**Our Results.** We develop a differentially private online FDR control procedure for multiple hypothesis testing, which takes a stream of p-values, a target FDR level, and a privacy parameter ε, and outputs discoveries such that the FDR is controlled at a certain level at any time point. Such a procedure provides unconditional differential privacy guarantees (to ensure that privacy will be protected even in the worst case) and satisfies the theoretical guarantees dictated by the FDR control problem. Our algorithm, the Private Alpha-investing P-value Rejecting Iterative sparse veKtor Algorithm (PAPRIKA, Algorithm 1), is presented in Section 3. Its privacy and accuracy guarantees are stated in Theorems 3 and 4, respectively. While the full proofs appear in the appendix, we describe the main ideas behind the algorithms and proofs in the surrounding prose. In Section 4, we provide a thorough empirical investigation of PAPRIKA, with additional empirical results in Appendix C.

## 2. Preliminaries

### 2.1. Background on Differential Privacy

Differential privacy bounds the maximal amount that one data entry can change the output of a computation. Databases belong to the space $\mathcal{D}^n$ and contain $n$ entries, one for each individual, where each entry belongs to a data universe $\mathcal{D}$. We say that $D, D' \in \mathcal{D}^n$ are neighboring databases if they differ in at most one data entry.

**Definition 1** (Differential Privacy (Dwork et al., 2006)). An algorithm $M : \mathcal{D}^n \to \mathcal{R}$ is (ε, δ)-differentially private if for every pair of neighboring databases $D, D' \in \mathcal{D}^n$, and for every subset of possible outputs $S \subseteq \mathcal{R}$,
$$\Pr[M(D) \in S] \le \exp(\varepsilon) \Pr[M(D') \in S] + \delta.$$
If δ = 0, we say that $M$ is ε-differentially private.

The additive sensitivity of a real-valued query $f : \mathcal{D}^n \to \mathbb{R}$ is denoted $\Delta f$, and is defined to be the maximum change in the function's value that can be caused by changing a single entry. That is,
$$\Delta f = \max_{D, D' \text{ neighbors}} |f(D) - f(D')|.$$
Differential privacy guarantees are often achieved by adding Laplace noise at various places in the computation, where the noise scales with $\Delta f / \varepsilon$. A Laplace random variable with parameter $b$ is denoted $\mathrm{Lap}(b)$, and has probability density function $p_{\mathrm{Lap}(b)}(x) = \frac{1}{2b} \exp(-|x|/b)$.

The SPARSEVECTOR algorithm, first introduced by (Dwork et al., 2010) and refined to its current form by (Dwork & Roth, 2014), privately reports the outcomes of a potentially very large number of computations, provided that only a few are significant.
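To fix ideas, here is a minimal sketch of this framework in the above-threshold direction, loosely following (Dwork & Roth, 2014); the function name, the noise scales for a budget of $c$ above-threshold answers, and the halting behavior are our illustrative choices, not the exact pseudocode of Appendix A.

```python
import numpy as np

def sparse_vector(query_answers, threshold, sensitivity, epsilon, c):
    """Minimal AboveThreshold/SparseVector sketch.

    Reports, for a stream of query answers with the given additive
    sensitivity, whether each noisy answer exceeds a noisy threshold;
    halts after c above-threshold reports. Returns a list of booleans.
    """
    rng = np.random.default_rng()
    noisy_t = threshold + rng.laplace(scale=2 * sensitivity * c / epsilon)
    out, count = [], 0
    for q in query_answers:
        noisy_q = q + rng.laplace(scale=4 * sensitivity * c / epsilon)
        if noisy_q >= noisy_t:
            out.append(True)
            count += 1
            if count >= c:
                break
            # Refresh the threshold noise after each above-threshold report.
            noisy_t = threshold + rng.laplace(scale=2 * sensitivity * c / epsilon)
        else:
            out.append(False)
    return out
```

The noise scales of $2\Delta c/\varepsilon$ on the threshold and $4\Delta c/\varepsilon$ on the queries are the same ones Algorithm 1 later uses with $\Delta = \eta$.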
SPARSEVECTOR takes in a stream of queries and releases a bit vector indicating whether or not each noisy query answer is above a fixed noisy threshold. Pseudocode appears in Appendix A. We build off this algorithm, using it as a framework for our online private false discovery rate control algorithm: new hypotheses arrive online, and we only care about the significant hypotheses, whose p-values fall below a certain threshold. We note that the standard presentation below checks for queries with values above a threshold, but by simply changing signs this framework can be used to check for values below a threshold, as we will do with the p-values.

**Theorem 1** ((Dwork et al., 2010)). For any sequence of $k$ queries $f_1, \ldots, f_k$ with sensitivity $\Delta$ such that $|\{i : f_i(D) \ge T - \alpha_{SV}\}| \le c$, SPARSEVECTOR outputs with probability at least $1 - \beta$ a stream $a_1, \ldots, a_k \in \{\top, \bot\}$ such that $a_i = \bot$ for every $i \in [k]$ with $f_i(D) < T - \alpha_{SV}$ and $a_i = \top$ for every $i \in [k]$ with $f_i(D) > T + \alpha_{SV}$, as long as
$$\alpha_{SV} \ge \frac{8 \Delta c \log(2kc/\beta)}{\varepsilon}.$$

Unlike the conventional use of additive sensitivity, (Dwork et al., 2018) defined the notion of multiplicative sensitivity specifically for p-values. It is motivated by the observation that, although the additive sensitivity of a p-value may be large, the relative change of the p-value on two neighboring datasets is stable unless the p-value is very small. This notion allows us to treat the logarithm of the p-values as having additive sensitivity η, substantially reducing the scale of noise required to preserve privacy.

**Definition 2** (Multiplicative Sensitivity (Dwork et al., 2018)). A p-value function $p$ is said to be $(\eta, \mu)$-multiplicatively sensitive if for all neighboring databases $D$ and $D'$, either both $p(D), p(D') \le \mu$, or $\exp(-\eta)\, p(D) \le p(D') \le \exp(\eta)\, p(D)$.

### 2.2. Background on Online False Discovery Rate Control

In the online false discovery rate (FDR) control problem, a data analyst receives a stream of hypotheses on the database $D$, or equivalently, a stream of p-values $p_1, p_2, \ldots$. The analyst must pick a threshold $\alpha_t$ at each time $t$, rejecting the hypothesis when $p_t \le \alpha_t$; this threshold can depend on previous hypotheses and discoveries, and each rejection decision must be made before the next hypothesis arrives. The error metric is the false discovery rate, formally defined as
$$\mathrm{FDR} = \mathbb{E}[\mathrm{FDP}] = \mathbb{E}\left[\frac{|\mathcal{H}^0 \cap \mathcal{R}|}{|\mathcal{R}| \vee 1}\right],$$
where $\mathcal{H}^0$ is the (unknown to the analyst) set of hypotheses for which the null hypothesis is true, and $\mathcal{R}$ is the set of rejected hypotheses. We will also write these terms as a function of time $t$ to indicate their values after the first $t$ hypotheses: $\mathrm{FDR}(t)$, $\mathrm{FDP}(t)$, $\mathcal{H}^0(t)$, $\mathcal{R}(t)$. The goal of FDR control is to guarantee that for any time $t$, the FDR up to time $t$ is less than a pre-determined quantity $\alpha \in (0, 1)$.

This problem was first investigated by (Foster & Stine, 2008), who proposed a framework known as online alpha-investing that models the hypothesis testing problem as an investment problem. The analyst is endowed with an initial budget, can test hypotheses at a unit cost, and receives an additional reward for each discovery. The alpha-investing procedure ensures that the analyst always maintains an α-fraction of their wealth, and can therefore continue testing future hypotheses indefinitely. Unfortunately, this approach only controls a slightly relaxed version of FDR, known as mFDR, which is given by
$$\mathrm{mFDR}(t) = \frac{\mathbb{E}[|\mathcal{H}^0 \cap \mathcal{R}(t)|]}{\mathbb{E}[|\mathcal{R}(t)|]}.$$
This approach was later extended to a class of generalized alpha-investing (GAI) rules (Aharoni & Rosset, 2014).
A generalized alpha-investing procedure starts with an initial wealth $W(0) = \alpha$, where $\alpha$ is the testing level. It uses a GAI rule $\mathcal{I}_{W(0)}$ that takes in past rejections to determine three quantities at each time $t$: the level of the test $\alpha_t$, the amount $\varphi_t$ subtracted from the wealth, and the reward $\psi_t$ received for each discovery:
$$(\alpha_t, \varphi_t, \psi_t) = \mathcal{I}_{W(0)}(\{R_1, R_2, \ldots, R_{t-1}\}).$$
The wealth updating rule is $W(t) = W(t-1) - \varphi_t + R_t \psi_t$. A GAI rule maintains nonnegative wealth $W(t) \ge 0$ for all $t$, and the following inequality holds:
$$0 \le \psi_t \le \min\left(\frac{\varphi_t}{\rho_t} + \alpha,\ 1\right), \tag{1}$$
where $\rho_t$ is the best power of the $t$-th test.

One subclass of GAI rules, Level based On Recent Discovery (LORD), was shown to have consistently good performance in practice (Javanmard & Montanari, 2015; 2018). GAI++ (Ramdas et al., 2017) improves the class of GAI rules, with LORD++ as an explicit example. The SAFFRON procedure, proposed by (Ramdas et al., 2018), further improves the LORD procedures by adaptively estimating the proportion of true nulls, and is the current state of the art in online FDR control for multiple hypothesis testing. To understand the main differences between the SAFFRON and LORD procedures, we first introduce an oracle estimate of the FDP,
$$\mathrm{FDP}^*(t) = \frac{\sum_{j \le t,\, j \in \mathcal{H}^0} \alpha_j}{|\mathcal{R}(t)| \vee 1}.$$
The numerator $\sum_{j \le t,\, j \in \mathcal{H}^0} \alpha_j$ overestimates the number of false discoveries, so $\mathrm{FDP}^*(t)$ overestimates the FDP. The oracle estimator $\mathrm{FDP}^*(t)$ cannot be calculated, since $\mathcal{H}^0$ is unknown. LORD's naive estimator $\sum_{j \le t} \alpha_j / (|\mathcal{R}(t)| \vee 1)$ is a natural overestimate of $\mathrm{FDP}^*(t)$. SAFFRON's threshold sequence is based on a novel estimate of the FDP,
$$\widehat{\mathrm{FDP}}_{\mathrm{SAFFRON}}(t) = \frac{\sum_{j \le t} \alpha_j \frac{\mathbb{I}(p_j > \lambda_j)}{1 - \lambda_j}}{|\mathcal{R}(t)| \vee 1},$$
where $\{\lambda_j\}_{j=1}^\infty$ is a sequence of user-chosen parameters in the interval (0, 1), each of which can be a constant or a deterministic function of the information up to time $t - 1$. This estimate provides the null-proportion adaptivity underlying SAFFRON.

Our private algorithm is built upon the LORD++ and SAFFRON algorithms, which are given formally in Algorithms 3 and 4 in Appendix A. As GAI rules, both LORD++ and SAFFRON start off with an error budget, which is allocated to different tests over time. The wealth budget decays as each hypothesis is tested, and wealth is earned back on every rejection except the first. The decay factors $\gamma_j$ that depreciate past wealth form a non-increasing sequence summing to one, which ensures that the sum of the wealth budget always stays below the desired level $\alpha$. SAFFRON involves an additional candidacy-checking step to be null-proportion adaptive: it never loses wealth when testing candidate p-values with $p_j < \lambda_j$. The sequence $\{\lambda_j\}_{j=1}^\infty$ can be defined by any coordinate-wise non-decreasing function $g_t$. For example, $\{\lambda_j\}_{j=1}^\infty$ can be a deterministic sequence of constants, or $\lambda_t = \alpha_t$, as in the case of alpha-investing. These $\lambda_j$ values serve as a weak overestimate of $\alpha_j$. The algorithm first checks whether a p-value is below $\lambda_j$, and if so, adds it to the candidate set of hypotheses that may be rejected. It then computes the threshold $\alpha_j$ based on the current wealth, the current size of the candidate set, and the number of rejections so far, and decides to reject the hypothesis if $p_j \le \alpha_j$.

Both LORD++ and SAFFRON require that the input sequence of p-values are still valid p-values given past information, which is formalized as conditional super-uniformity of the null p-values with respect to a filtration process on the sequence of rejection decisions $\{R_j\}$ and candidacy indicators $\{C_j\}$ (for SAFFRON). This is stated formally in Appendix A.
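To make the estimator above concrete, here is a small helper (ours, not the paper's code) that computes $\widehat{\mathrm{FDP}}_{\mathrm{SAFFRON}}(t)$ from the sequences observed so far, assuming the non-private rejection rule $p_j \le \alpha_j$:

```python
import numpy as np

def saffron_fdp_estimate(alphas, pvals, lambdas):
    """SAFFRON's running FDP estimate (a sketch of the formula above).

    alphas, pvals, lambdas: sequences observed up to time t.
    Returns sum_j alpha_j * 1{p_j > lambda_j} / (1 - lambda_j),
    divided by the number of rejections so far (at least 1).
    """
    alphas, pvals, lambdas = map(np.asarray, (alphas, pvals, lambdas))
    rejections = max(int(np.sum(pvals <= alphas)), 1)
    numer = np.sum(alphas * (pvals > lambdas) / (1 - lambdas))
    return numer / rejections
```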
Intuitively, conditional super-uniformity means that the sequence of hypotheses cannot be overly adaptive. Independent p-values are a special case of conditional super-uniformity. SAFFRON provides the following accuracy guarantees under this condition.

**Theorem 2** ((Ramdas et al., 2018)). If the null p-values are conditionally super-uniformly distributed, then:

(a) $\mathbb{E}\left[\sum_{j \le t,\, j \in \mathcal{H}^0} \frac{\alpha_j\, \mathbb{I}(p_j > \lambda_j)}{1 - \lambda_j}\right] \ge \mathbb{E}\left[|\mathcal{H}^0 \cap \mathcal{R}(t)|\right]$;

(b) the condition $\widehat{\mathrm{FDP}}_{\mathrm{SAFFRON}}(t) \le \alpha$ for all $t \in \mathbb{N}$ implies that $\mathrm{mFDR}(t) \le \alpha$ for all $t \in \mathbb{N}$.

If the null p-values are additionally independent of each other and of the non-null p-values, and $\{\alpha_t\}$ and $\{\lambda_t\}$ are coordinate-wise non-decreasing functions of the vector $(R_1, \ldots, R_{t-1}, C_1, \ldots, C_{t-1})$, then:

(c) $\mathbb{E}\big[\widehat{\mathrm{FDP}}_{\mathrm{SAFFRON}}(t)\big] \ge \mathbb{E}[\mathrm{FDP}(t)] =: \mathrm{FDR}(t)$ for all $t \in \mathbb{N}$;

(d) the condition $\widehat{\mathrm{FDP}}_{\mathrm{SAFFRON}}(t) \le \alpha$ for all $t$ implies that $\mathrm{FDR}(t) \le \alpha$ for all $t \in \mathbb{N}$.

## 3. Private Online False Discovery Rate Control

In this section, we provide our algorithm for private online false discovery rate control, PAPRIKA, given formally in Algorithm 1. It starts from SAFFRON, using SPARSEVECTOR to ensure privacy of the rejection set. However, the combination of these tools is far from immediate, and several algorithmic innovations are required, including: dynamic thresholds in SPARSEVECTOR to accommodate the alpha-investing rule; adding noise that scales with the multiplicative sensitivity of the p-values to reduce the noise required for privacy; shifting the SPARSEVECTOR threshold to accommodate FDR as a novel accuracy metric; and handling the candidacy indicator step, which cannot be done privately and requires modifications to the wealth updates. We resolve the last issue by using a wealth updating rule similar to that of LORD++. We provide new analyses for both privacy and accuracy. Complete proofs of our privacy and accuracy results appear in the appendix; we elaborate here on the algorithmic details and on why these modifications are needed to ensure privacy and FDR control.

The non-private online false discovery rate control algorithms decide to reject hypothesis $t$ if the corresponding p-value $p_t$ is less than the rejection threshold $\alpha_t$; that is, if $p_t \le \alpha_t$. We instantiate the SPARSEVECTOR framework in this setting, where $p_t$ plays the role of the $t$-th query answer $f_t(X)$, and $\alpha_t$ plays the role of the threshold. Note that SPARSEVECTOR uses a single fixed threshold for all queries, while our algorithm PAPRIKA allows a dynamic threshold that depends on the previous output. Our privacy analysis accounts for this change and shows that dynamic thresholds do not affect the privacy guarantees of SPARSEVECTOR. However, the algorithm would not be private if the dynamic thresholds also depended on the data. Note that SAFFRON never loses wealth when testing candidate p-values with $p_j \le \lambda_j$, so its threshold $\alpha_j$ depends on the data, since it is based on the current wealth. We remove this dependence in PAPRIKA by losing wealth at every step, regardless of whether we test a candidate p-value, similar to LORD++. This results in stricter FDR control (and potentially weaker power) because our wealth decays faster.

Similar to prior work on private offline FDR control (Dwork et al., 2018), we use multiplicative sensitivity as described in Definition 2, as p-values may have high additive sensitivity and would require unacceptably large noise to preserve privacy. We assume that each p-value in our input stream $p_1, p_2, \ldots$ has multiplicative sensitivity $(\eta, \mu)$.
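As a concrete illustration of Definition 2 (our own sanity check, using the binomial-tail p-value that appears later in Section 4.1): changing one record moves the success count by at most one, and the induced change in $\log p$ stays small until the p-value itself becomes tiny.

```python
import numpy as np
from scipy.stats import binom

n = 1000

def log_p(t):
    # log of p(D) = Pr[Bin(n, 1/2) >= t], the binomial-tail p-value.
    return binom.logsf(t - 1, n, 0.5)

for t in [510, 550, 600, 650]:
    # Neighboring databases can shift the count t by at most 1; the
    # log-p gap grows only as the p-value itself shrinks toward mu.
    print(t, np.exp(log_p(t)), abs(log_p(t + 1) - log_p(t)))
```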
As long as $\mu$ is small enough (i.e., less than the rejection threshold), we can treat the logarithms of the p-values as the queries, with additive sensitivity $\eta$. Because of this change, we must make rejection decisions based on the logarithm of the p-values, so our rejection condition is $\log p_t + Z_t \le \log \alpha_t + Z_\alpha$ for Laplace noise terms $Z_t, Z_\alpha$ drawn from the appropriate distributions.

**Algorithm 1** PAPRIKA($\alpha, \lambda, W_0, \gamma, c, \varepsilon, \delta, s$)

Input: stream of p-values $\{p_1, p_2, \ldots\}$ with multiplicative sensitivity $(\eta, \mu)$, target FDR level $\alpha$, initial wealth $W_0 < \alpha$, positive non-increasing sequence $\{\gamma_j\}_{j=0}^\infty$ summing to one, expected number of rejections $c$, privacy parameters $\varepsilon, \delta$, threshold shift magnitude $s$, maximum number of p-values $k$.

Let $Z_\alpha^0 \sim \mathrm{Lap}(2\eta c/\varepsilon)$, $\mathrm{count} = 0$, and
$$A = \frac{s c \eta}{\varepsilon} \log \frac{2}{3 \min\{\delta,\ 1 - ((1-\delta)/\exp(\varepsilon))^{1/k}\}}.$$
For each p-value $p_t$:
- If $\mathrm{count} \ge c$, output $R_t = 0$.
- Else:
  - Sample $Z_t \sim \mathrm{Lap}(4\eta c/\varepsilon)$. Set $\lambda_t = g_t(R_{1:t-1}, C_{1:t-1})$, and set the candidacy indicator $C_t = \mathbb{I}(\log p_t < \log 2\lambda_t)$.
  - If $t = 1$, set $\alpha_1 = (1 - 2\lambda_1)\gamma_1 W_0$; else compute $\alpha_t = (1 - 2\lambda_t)\big(W_0 \gamma_t + (\alpha - W_0)\gamma_{t - \tau_1} + \sum_{j \ge 2} \alpha\, \gamma_{t - \tau_j}\big)$, where $\tau_j$ denotes the time of the $j$-th rejection.
  - If $C_t = 1$ and $\log p_t + Z_t \le \log \alpha_t - A + Z_\alpha^{\mathrm{count}}$: output $R_t = 1$, set $\mathrm{count} = \mathrm{count} + 1$, and sample $Z_\alpha^{\mathrm{count}} \sim \mathrm{Lap}(2\eta c/\varepsilon)$.
  - Else, output $R_t = 0$.

The accuracy guarantees of SPARSEVECTOR ensure that if a value is reported to be below threshold, then with high probability it is not more than $\alpha_{SV}$ above the threshold. However, to ensure that our algorithm satisfies the desired bound $\mathrm{FDR} \le \alpha$, we require that reports of "below threshold" truly correspond to p-values below the desired threshold $\alpha_t$. To accommodate this, we shift our rejection threshold $\log \alpha_t$ down by a parameter $A$. The value of $A$ is chosen so that the algorithm satisfies (ε, δ)-differential privacy, but the choice can be seen as inspired by the $\alpha_{SV}$-accuracy term of SPARSEVECTOR given in Theorem 1. Our final rejection condition is therefore $\log p_t + Z_t \le \log \alpha_t - A + Z_\alpha$. This ensures that below-threshold reports are below $(\log \alpha_t - A) + \alpha_{SV} \le \log \alpha_t$ with high probability. Empirically, we observe that the bound on $A$ in Theorem 3 may be overly conservative and lead to no hypotheses being rejected, so we allow an additional parameter $s$ that scales the magnitude of the shift. The conservative bounds of Theorem 3 correspond to $s = 4$, but in many scenarios a smaller value of $s = 1$ or $2$ leads to better performance while still satisfying the privacy guarantee. Further guidance on choosing this shift parameter is given in Appendix C.1.

Even with these modifications, a naive combination of SPARSEVECTOR and SAFFRON would still not satisfy differential privacy, due to the candidacy indicator step of the algorithm. In the SAFFRON algorithm, a preprocessing candidacy step occurs before any rejection decisions. This step checks whether each p-value $p_t$ is smaller than a loose upper bound $\lambda_t$ on the eventual rejection threshold $\alpha_t$. The algorithm chooses $\alpha_t$ using an α-investing rule that depends on the number of candidate hypotheses seen so far, and ensures that $\alpha_t \le \lambda_t$, so only hypotheses in this candidate set can be rejected. These λ values are used to control $\widehat{\mathrm{FDP}}_{\mathrm{SAFFRON}}(t)$, which serves as a conservative overestimate of $\mathrm{FDP}(t)$. (For a discussion of how to choose $\lambda_t$, see Lemma 1 or our experimental results in Section 4. Reasonable choices would be $\lambda_t = \alpha_t$ or a small constant such as 0.2.)
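To make the control flow concrete, here is a condensed Python sketch (ours) of Algorithm 1's main loop. It bakes in the simplifications used in our experiments, a constant λ and the constant sequence $\gamma_j = 1/k$, and the default parameter values below are illustrative rather than prescribed.

```python
import numpy as np

def paprika(pvals, alpha=0.2, lam=0.2, W0=0.05, c=40, eps=5.0, delta=2.5e-4,
            eta=1 / np.sqrt(1000), s=1.0, seed=0):
    """Condensed sketch of Algorithm 1 with constant lam and gamma_j = 1/k."""
    rng = np.random.default_rng(seed)
    k = len(pvals)
    gamma = np.full(k, 1.0 / k)                      # constant {gamma_j}, sums to one
    A = s * c * eta / eps * np.log(
        2 / (3 * min(delta, 1 - ((1 - delta) / np.exp(eps)) ** (1 / k))))
    Z_alpha = rng.laplace(scale=2 * eta * c / eps)   # noisy-threshold noise
    rejects, rej_times = [], []
    for t, p in enumerate(pvals, start=1):
        if len(rej_times) >= c:                      # rejection budget spent
            rejects.append(0)
            continue
        Z_t = rng.laplace(scale=4 * eta * c / eps)   # per-query noise
        C_t = np.log(p) < np.log(2 * lam)            # relaxed candidacy check
        # LORD++-style threshold: wealth decays every step and is earned
        # back on rejections (rej_times holds the past rejection times tau_j).
        a = W0 * gamma[t - 1]
        if rej_times:
            a += (alpha - W0) * gamma[t - rej_times[0] - 1]
            a += alpha * sum(gamma[t - tau - 1] for tau in rej_times[1:])
        alpha_t = (1 - 2 * lam) * a
        if C_t and np.log(p) + Z_t <= np.log(alpha_t) - A + Z_alpha:
            rejects.append(1)
            rej_times.append(t)
            Z_alpha = rng.laplace(scale=2 * eta * c / eps)  # refresh threshold noise
        else:
            rejects.append(0)
    return np.array(rejects)
```

The threshold noise $Z_\alpha$ is resampled after every rejection, mirroring SPARSEVECTOR's budget of $c$ above-threshold answers.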
Without adding noise to the candidacy condition, there may be neighboring databases with p-values $p_t, p'_t$ for some hypothesis such that $\log p_t < \log \lambda_t < \log p'_t$, so the hypothesis would have positive probability of being rejected under the first database and zero probability of rejection under its neighbor. This would violate the (ε, 0)-differential privacy guarantee intended under SPARSEVECTOR. If we were to privatize the condition for candidacy using, for example, a parallel instantiation of SPARSEVECTOR, then we would have to reuse the same realizations of the noise when computing the rejection threshold $\alpha_t$ to still control FDP, but this would no longer be private.

Since we cannot add noise to the candidacy condition, we weaken it in PAPRIKA to $\log p_t < \log 2\lambda_t$.¹ Then if a hypothesis has different candidacy results under neighboring databases and the multiplicative sensitivity η is small, the hypothesis is still extremely unlikely to be rejected even under the database for which it was a candidate. To see this, consider a pair of neighboring databases that induce p-values with $\log p_t < \log 2\lambda_t < \log p'_t$. Due to the multiplicative sensitivity constraint, we know that $\log p_t \ge \log 2\lambda_t - \eta$. Plugging this into the rejection condition $\log p_t + Z_t \le \log \alpha_t - A + Z_\alpha$, we see that we would need the difference of the noise terms to satisfy $Z_t - Z_\alpha \le \log \frac{1}{2} - A + \eta$, which, by analysis of the Laplace distribution, happens with exponentially small probability in $n$ when $\eta = \mathrm{poly}^{-1}(n)$.² Our PAPRIKA algorithm is thus (ε, δ)-differentially private, and we account for this failure probability in our (exponentially small) δ parameter, as stated in Theorem 3.

¹We note that although this change is algorithmically equivalent to scaling up the parameter $\lambda_t$ by a factor of 2, this slack is relevant for certain instantiations of PAPRIKA that set $\lambda_t = \alpha_t$, which we show perform well empirically. (See Section 4 for more details.) We write this step as a relaxation of the candidacy condition both for notational consistency with existing non-private alpha-investing-based FDR control methods, such as SAFFRON AI (Ramdas et al., 2018), that also choose $\lambda_t = \alpha_t$, and to emphasize that this slack in the candidacy condition is necessary to ensure differential privacy of the overall algorithm.

²Such values of η are typical; see the examples in Section 4, where $\eta = 1/\sqrt{n}$. The shift term $A$ also depends on η, which contributes to the bound.

One may wonder whether this candidacy step is necessary at all. Since we have removed the dependence of $\alpha_t$ on the size of the candidate set in PAPRIKA, the threshold $\alpha_t$ is no longer null-proportion adaptive. The advantage of null-proportion adaptivity in SAFFRON increases as the proportion of non-nulls increases, but we focus on the case where the non-nulls are sparse, so adaptivity has little impact in our setting. In Section 4, we empirically compare PAPRIKA to two private versions of LORD++, which we call Priv LORD and Priv LORD2. The former combines SPARSEVECTOR and LORD++, with the same threshold shifting as described earlier in this section. The latter adds the candidacy-checking step on top of Priv LORD. We see in Section 4.2 that both methods provide poor FDR control relative to PAPRIKA, providing empirical evidence that the candidacy step in PAPRIKA plays a vital role in FDR control, even though $\alpha_t$ is not null-proportion adaptive. Further details about Priv LORD and Priv LORD2 are deferred to Appendix C.
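As a numeric sanity check of the candidacy-slack argument above (ours, with parameter values mirroring Section 4), one can estimate the probability of the bad noise event by Monte Carlo; with the conservative shift $s = 4$ it comes out on the order of the δ we budget for.

```python
import numpy as np

# Estimate Pr[Z_t - Z_alpha <= log(1/2) - A + eta] under Algorithm 1's
# noise scales, with the conservative shift (s = 4).
rng = np.random.default_rng(1)
n, k, c, eps, delta, s = 1000, 800, 40, 5.0, 2.5e-4, 4
eta = 1 / np.sqrt(n)
A = s * c * eta / eps * np.log(
    2 / (3 * min(delta, 1 - ((1 - delta) / np.exp(eps)) ** (1 / k))))
Z_t = rng.laplace(scale=4 * eta * c / eps, size=10**6)
Z_a = rng.laplace(scale=2 * eta * c / eps, size=10**6)
print(np.mean(Z_t - Z_a <= np.log(0.5) - A + eta))  # on the order of delta
```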
Our PAPRIKA algorithm allows analysts to specify a maximum number of hypotheses tested, $k$, and a maximum number of rejections, $c$. We require a bound on the maximum number of hypotheses tested because the accuracy guarantees of SPARSEVECTOR only allow exponentially many queries (in the size of the database) to be answered accurately. Once the total number of rejections reaches $c$, the algorithm fails to reject all future hypotheses. We do not halt the algorithm as in SPARSEVECTOR; therefore PAPRIKA has no stopping criterion, and we can safely speak of FDR control at any fixed time, just as for SAFFRON and LORD++. Our algorithm also controls, at each time $t$,
$$\widehat{\mathrm{FDP}}_{\mathrm{PAPRIKA}}(t) := \frac{\sum_{j \le t} \alpha_j \frac{\mathbb{I}(p_j > 2\lambda_j)}{1 - 2\lambda_j}}{|\mathcal{R}(t)| \vee 1}.$$
We note that this is equivalent to $\widehat{\mathrm{FDP}}_{\mathrm{SAFFRON}}(t)$ with $\lambda_j$ scaled down by a factor of 2. By analyzing and bounding this expression, we obtain FDR bounds for PAPRIKA, as stated in Theorem 4.

**Theorem 3.** For any stream of p-values $\{p_1, p_2, \ldots\}$, PAPRIKA is (ε, δ)-differentially private.

As a starting point, our privacy comes from SPARSEVECTOR, but as discussed above, many crucial modifications are required. To briefly summarize the key considerations, we must handle different thresholds at different times, multiplicative rather than additive sensitivity, a modified notion of the candidate set, and the introduction of a small δ parameter to account for the new candidate set definition and the shift. The proof of Theorem 3 appears in Appendix D.

Next we describe the theoretical guarantees of FDR control for our private algorithm PAPRIKA, which are an analog of Theorem 2. We modify the notation of the conditional super-uniformity assumption of SAFFRON to incorporate the added Laplace noise; the conditions are otherwise identical. (See (2) in Appendix A for comparison.) We note that independent p-values are a special case of conditional super-uniformity, but this requirement more generally allows a broader class of dependencies among p-values. Let $R_j := \mathbb{I}(\log p_j + Z_j \le \log \alpha_j - A + Z_\alpha)$ be the rejection decisions, and let $C_j := \mathbb{I}(p_j < 2\lambda_j)$ be the indicators for candidacy. We let $\alpha_t := f_t(R_1, \ldots, R_{t-1}, C_1, \ldots, C_{t-1})$, where $f_t$ is an arbitrary function of the first $t - 1$ indicators for rejection and candidacy. Define the filtration formed by the sequence of σ-fields $\mathcal{F}^t := \sigma(R_1, \ldots, R_t, C_1, \ldots, C_t, Z_1, \ldots, Z_t, Z_\alpha)$. The null p-values are conditionally super-uniformly distributed with respect to the filtration $\{\mathcal{F}^t\}$ if, whenever the null hypothesis $H_t$ is true, $\Pr(p_t \le \alpha_t \mid \mathcal{F}^{t-1}) \le \alpha_t$. We emphasize that this condition is only needed for FDR control: our privacy guarantee of Theorem 3 holds for arbitrary streams of p-values, even those which do not satisfy conditional super-uniformity.

Our FDR control guarantees for PAPRIKA mirror those of SAFFRON (Theorem 2). The first two statements apply if the p-values are conditionally super-uniform, and the last two apply if the null p-values are additionally independent. The proof of Theorem 4 appears in Appendix E.

**Theorem 4.** If the null p-values are conditionally super-uniformly distributed, then:

(a) $\mathbb{E}\left[\sum_{j \le t,\, j \in \mathcal{H}^0} \frac{\alpha_j\, \mathbb{I}(p_j > 2\lambda_j)}{1 - 2\lambda_j}\right] + \delta t \ge \mathbb{E}\left[|\mathcal{H}^0 \cap \mathcal{R}(t)|\right]$;

(b) the condition $\widehat{\mathrm{FDP}}_{\mathrm{PAPRIKA}}(t) \le \alpha$ for all $t \in \mathbb{N}$ implies that $\mathrm{mFDR}(t) \le \alpha + \delta t$ for all $t \in \mathbb{N}$.

If the null p-values are independent of each other and of the non-null p-values, and $\{\alpha_t\}$ and $\{\lambda_t\}$ are coordinate-wise non-decreasing functions of the vector $(R_1, \ldots, R_{t-1}, C_1, \ldots, C_{t-1})$, then:
(c) $\mathbb{E}\big[\widehat{\mathrm{FDP}}_{\mathrm{PAPRIKA}}(t)\big] + \delta t \ge \mathbb{E}[\mathrm{FDP}(t)] =: \mathrm{FDR}(t)$ for all $t \in \mathbb{N}$;

(d) the condition $\widehat{\mathrm{FDP}}_{\mathrm{PAPRIKA}}(t) \le \alpha$ for all $t$ implies that $\mathrm{FDR}(t) \le \alpha + \delta t$ for all $t \in \mathbb{N}$.

Relative to the non-private guarantees of Theorem 2, the FDR bounds provided by PAPRIKA are weaker by an additive term of $\delta t$. In most differential privacy applications, δ is required to be cryptographically small (i.e., at most negligible in the size of the database) (Dwork & Roth, 2014), so this additional term should have a minuscule effect on the FDR.³ We note that ε plays a role in the analysis of Theorem 4, although it does not appear in the FDR bounds. Equation (22) in the appendix shows that the additive slack term $\delta t$ in Theorem 4 is in fact $\min\{\delta,\ 1 - ((1-\delta)/\exp(\varepsilon))^{1/k}\} \cdot t$, which is upper bounded by $\delta t$.

³Alternatively, δ could be treated as a tunable parameter to balance the tradeoff between privacy and FDR control. If an analyst has an upper bound on the allowable slack in FDR, say 0.01, then she could set δ = 0.01/t to ensure her desired bound.

The following lemma is a key tool in the proof of Theorem 4. Though it is qualitatively similar to Lemma 2 in (Ramdas et al., 2018), it is crucially modified to show that an analogous statement holds under the addition of Laplace noise. Its proof appears in Appendix F.

**Lemma 1.** Assume $p_1, p_2, \ldots$ are all independent, and let $h : \{0,1\}^k \to \mathbb{R}$ be any coordinate-wise non-decreasing function. Assume $f_t$ and $g_t$ are coordinate-wise non-decreasing functions, and that $\alpha_t = f_t(R_{1:t-1}, C_{1:t-1})$ and $\lambda_t = g_t(R_{1:t-1}, C_{1:t-1})$. Then for any $t \le k$ such that $H_t \in \mathcal{H}^0$, we have
$$\mathbb{E}\left[\frac{\alpha_t\, \mathbb{I}(p_t > 2\lambda_t)}{1 - 2\lambda_t}\, h(R_{1:k}) \,\middle|\, \mathcal{F}^{t-1}\right] \ge \mathbb{E}\left[\alpha_t\, h(R_{1:k}) \,\middle|\, \mathcal{F}^{t-1}\right]$$
and
$$\mathbb{E}\left[\min\{\alpha_t \exp(Z_\alpha - Z_t - A),\, 1\}\, h(R_{1:k}) \,\middle|\, \mathcal{F}^{t-1}\right] \ge \mathbb{E}\left[\mathbb{I}(\log p_t + Z_t \le \log \alpha_t + Z_\alpha - A)\, h(R_{1:k}) \,\middle|\, \mathcal{F}^{t-1}\right].$$

There are no known theoretical bounds on the statistical power of SAFFRON, even in the non-private setting. Instead, we validate power empirically through the experimental results in Section 4.

## 4. Experiments

We experimentally compare the FDR and statistical power of variations of the PAPRIKA and SAFFRON procedures under different sequences $\{\lambda_j\}$. Following the convention of (Ramdas et al., 2018), we define PAPRIKA Alpha-Investing, or PAPRIKA AI, to be the instantiation of Algorithm 1 with the sequence $\lambda_j = \alpha_j$, where the rejection threshold matches the α-investing rule, and we use PAPRIKA to denote Algorithm 1 instantiated with a constant sequence of $\lambda_j$, which in our experiments is $\lambda_j = 0.2$. We use $\lambda_j = 0.5$ in SAFFRON.⁴ We generally observe that, even under moderately stringent privacy restrictions, PAPRIKA and its AI variant perform comparably to the non-private alternatives, with PAPRIKA AI typically outperforming PAPRIKA. This suggests that even though setting $\lambda_j$ to a fixed constant may be easier to implement, parameter optimization can lead to meaningful performance improvements. We chose the sequence $\{\gamma_j\}$ to be the constant $1/k$ up to time $k$. Note that the sequence can instead be decreasing, for example of the form $\gamma_j \propto j^{-s}$ as in (Ramdas et al., 2018), which concentrates the wealth around small values of $j$; see (Ramdas et al., 2018) for more discussion on the choice of $\{\gamma_j\}$. In our experiments, we set the target FDR level $\alpha + \delta t = 0.2$, and thus our privacy parameter δ is bounded by $0.2/800 = 2.5 \times 10^{-4}$. The maximum number of rejections is $c = 40$. All results are averaged over 100 runs.

⁴Recall from Section 3 that our $\lambda_j$ is equivalent to the $\lambda_j$ in SAFFRON scaled down by a factor of 2.
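For reference, the following is the kind of utility code (ours, not from the paper) used to estimate FDR and power in such simulations: the FDP and the fraction of true discoveries are computed per run, then averaged over the 100 runs.

```python
import numpy as np

def fdp_and_power(rejections, is_null):
    """Empirical FDP and power for one simulated run.

    rejections, is_null: boolean arrays over the k hypotheses.
    FDP = false discoveries / max(total discoveries, 1);
    power = true discoveries / max(number of non-nulls, 1).
    """
    rejections = np.asarray(rejections, dtype=bool)
    is_null = np.asarray(is_null, dtype=bool)
    discoveries = max(rejections.sum(), 1)
    non_nulls = max((~is_null).sum(), 1)
    fdp = (rejections & is_null).sum() / discoveries
    power = (rejections & ~is_null).sum() / non_nulls
    return fdp, power

# Averaging fdp over independent runs estimates the FDR.
```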
We investigate two settings: observations drawn from Bernoulli distributions in Section 4.1, and observations generated from truncated exponential distributions in Section 4.3. In Section 4.2, we compare our algorithm against other private algorithms. In Appendix C.1, we discuss our choice of the shift parameter $A$ and give guidance on how to choose this parameter in practice. Code for PAPRIKA and our experiments is available at https://github.com/wanrongz/PAPRIKA.

### 4.1. Testing with Bernoulli Observations

We assume that the database $D$ contains $n$ individuals with $k$ independent features. The $i$-th feature is associated with $n$ i.i.d. Bernoulli variables $\xi^i_1, \ldots, \xi^i_n$, each of which takes the value 1 with probability $\theta_i$, and the value 0 otherwise. Let $t_i$ be the sum of the $i$-th features. A p-value for testing the null hypothesis $H^i_0 : \theta_i \le 1/2$ against $H^i_1 : \theta_i > 1/2$ is given by
$$p_i(D) = \sum_{k' = t_i}^{n} \frac{1}{2^n} \binom{n}{k'}.$$
(Dwork et al., 2018) showed that $p_i$ is $(\eta, \mu)$-multiplicatively sensitive for $\mu = m^{-1-c}$ and $\eta = 1/\sqrt{n}$, where $m \le \mathrm{poly}(n)$ and $c$ is any small positive constant. We choose $\theta_i = 0.5$ with probability $1 - \pi_1$, and $\theta_i = 0.75$ with probability $\pi_1$, for varying values of $\pi_1$, which represents the expected fraction of non-null hypotheses. We consider relatively small values of $\pi_1$, as most practical applications of FDR control (such as GWAS studies) have only a small fraction of true discoveries in the data.

In the following experiments, we sequentially test $H^i_0$ versus $H^i_1$ for $i = 1, \ldots, k$. We use $n = 1000$ as the size of the database $D$, and $k = 800$ as the number of features as well as the number of hypotheses. Our experiments are run under several different shifts $A$, but due to space constraints, we only report results in the main body with $A = \frac{c\eta}{\varepsilon} \log \frac{2}{3\min\{\delta,\ 1 - ((1-\delta)/\exp(\varepsilon))^{1/k}\}}$ (i.e., with $s = 1$), which still satisfies our privacy guarantee. Further discussion of the choice of $A$, and additional results under other shift parameters $s$, are deferred to Appendix C.1.

The results are summarized in Figure 1, which plots the FDR and statistical power against the expected fraction of non-nulls, $\pi_1$. In Figures 1(a) and (b), we compare our algorithms with privacy parameter ε = 5 to the non-private baseline methods of LORD (Javanmard & Montanari, 2015; 2018), Alpha-investing (Aharoni & Rosset, 2014), and SAFFRON and SAFFRON AI from (Ramdas et al., 2018). In Figures 1(c,d) and (e,f), we compare the performance of PAPRIKA AI and PAPRIKA, respectively, with varying privacy parameters ε = 3, 5, 10. We also list these values in Table 1 in Appendix C.2. As expected, the performance of PAPRIKA generally diminishes as ε decreases. A notable exception is that the FDR also decreases in Figure 1(c). This phenomenon arises because we set $\lambda_j = \alpha_j$, resulting in a smaller candidacy set and leading to insufficient rejections. Surprisingly, PAPRIKA AI also yields a lower FDR than many of the non-private algorithms (Figure 1(a)), since it tends to make fewer rejections. We also see that PAPRIKA AI performs dramatically better than PAPRIKA, suggesting that the choice $\lambda_j = \alpha_j$ should be preferred to a constant $\lambda_j$ to ensure good performance in practice.

*Figure 1.* FDR and statistical power versus fraction of non-null hypotheses $\pi_1$ for PAPRIKA (with $\lambda_j = 0.2$), PAPRIKA AI (with $\lambda_j = \alpha_j$), and non-private algorithms (SAFFRON, LORD, Alpha-investing, SAFFRON AI), when the database consists of Bernoulli observations. Panels (c,d) show PAPRIKA AI and panels (e,f) show PAPRIKA under varying ε.
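For completeness, the binomial-tail p-value above can be computed directly; the snippet below is our illustration using scipy, not code from the paper's repository.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(2)
n, theta = 1000, 0.75                 # a non-null feature (theta > 1/2)
t_i = rng.binomial(n, theta)          # sum of the i-th feature's n Bernoullis
p_i = binom.sf(t_i - 1, n, 0.5)       # p_i(D) = Pr[Bin(n, 1/2) >= t_i]
print(t_i, p_i)
```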
### 4.2. Comparison with Other Private Algorithms

As PAPRIKA is the first algorithm for private online FDR control, there is no private baseline for comparison. In Appendix C, we show that naïve Laplace privatization of SAFFRON is ineffective. This naïve approach applies the Laplace mechanism (Dwork et al., 2006) to the p-value of each hypothesis, and then uses these noisy p-values as input to SAFFRON. We see that this baseline mechanism performs extremely poorly relative to PAPRIKA and PAPRIKA AI.

We also compare PAPRIKA against Priv LORD and Priv LORD2, with Bernoulli observations in Figure 2 and truncated exponential observations in Figure 6 in Appendix C. For comparison, we use the same shift $A$ for all four algorithms, but we note that $A$ should be larger in Priv LORD to control FDR at the level 0.2, as it lacks the candidacy-checking step, and a larger $A$ leads to worse power; see Section C.1.

*Figure 2.* FDR and statistical power versus fraction of non-nulls $\pi_1$ for PAPRIKA (with $\lambda_j = 0.2$), PAPRIKA AI (with $\lambda_j = \alpha_j$), Priv LORD, and Priv LORD2, when the database consists of Bernoulli observations.

We make three key observations. First, Priv LORD makes significantly more false discoveries than the other three algorithms, suggesting that the candidacy-checking step largely offsets the added noise in private algorithms. The performance of Priv LORD gets closer to that of PAPRIKA and PAPRIKA AI as ε grows large and less noise is added. Second, PAPRIKA with constant $\lambda_t$ has stricter FDR control and slightly weaker power compared to Priv LORD2, as expected, since the threshold $\alpha_t$ in PAPRIKA carries an additional constant factor $(1 - 2\lambda_t)$. Third, PAPRIKA AI provides dramatically better FDR and power tradeoffs: it controls FDR at a much lower level while maintaining power at a level similar to the other methods (it is even the best in Figures 2(f) and 6(d)), suggesting that PAPRIKA with a smart choice of the predictable sequence $\{\lambda_t\}$ is preferred.

### 4.3. Testing with Truncated Exponential Observations

We again assume that the database $D$ contains $n$ individuals with $k$ independent features. The $i$-th feature is associated with $n$ i.i.d. truncated exponential variables $\xi^i_1, \ldots, \xi^i_n$, each of which is sampled according to the density
$$f_i(x \mid \theta_i, b) = \frac{\theta_i \exp(-\theta_i x)}{1 - \exp(-b\theta_i)}\, \mathbb{I}(0 \le x \le b),$$
for positive parameters $b$ and $\theta_i$. Let $t_i$ be the realized sum of the $i$-th features, and let $T_i$ denote the random variable given by the sum of the $n$ truncated exponential variables in the $i$-th entry. A p-value for testing the null hypothesis $H^i_0 : \theta_i = 1$ against the alternative hypothesis $H^i_1 : \theta_i > 1$ is given by $p_i(D) = \Pr_{\theta_i = 1}(T_i > t_i)$. (Dwork et al., 2018) showed that $p_i$ is $(\eta, \mu)$-multiplicatively sensitive for $\mu = m^{-1-c}$ and $\eta = 1/\sqrt{n}$, where $m \le \mathrm{poly}(n)$ and $c$ is any small positive constant. In the following experiments, we generate our database using the exponential distribution model truncated at $b = 1$.
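As a sketch (ours) of how these p-values can be evaluated in practice, the snippet below follows the p-value formula above together with the normal approximation described next; the mean and variance expressions are the ones given below for $b = 1$, and the sampler and example are illustrative.

```python
import numpy as np
from scipy.stats import norm

def trunc_exp_pvalue(t_i, n, theta=1.0):
    """CLT approximation to p_i(D) = Pr_{theta_i = 1}(T_i > t_i), with b = 1.

    Per-sample moments from the text: E[xi] = 1/theta + 1/(1 - e^theta),
    Var[xi] = 1/theta^2 - e^theta / (e^theta - 1)^2.
    """
    m = 1 / theta + 1 / (1 - np.exp(theta))
    v = 1 / theta**2 - np.exp(theta) / (np.exp(theta) - 1) ** 2
    return norm.sf(t_i, loc=n * m, scale=np.sqrt(n * v))

# Example: a null feature (theta = 1), sampled by inverse-CDF on [0, 1].
rng = np.random.default_rng(3)
n, theta = 1000, 1.0
u = rng.uniform(size=n)
xs = -np.log(1 - u * (1 - np.exp(-theta))) / theta
print(trunc_exp_pvalue(xs.sum(), n))   # roughly uniform on (0, 1) under the null
```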
We set $\theta_i = 1$ with probability $1 - \pi_1$, and $\theta_i = 1.95$ with probability $\pi_1$, again varying the value of $\pi_1$. We sequentially test $H^i_0$ versus $H^i_1$ for $i = 1, \ldots, k$. We use $n = 1000$ as the size of the database $D$, and $k = 800$ as the number of features as well as the number of hypotheses. While there is no closed form for these p-values, the sum of $n = 1000$ i.i.d. samples is approximately normally distributed by the Central Limit Theorem. The expectation and variance of $\xi^i_j$ with $b = 1$ are
$$\mathbb{E}[\xi^i_j] = \frac{1}{\theta_i} + \frac{1}{1 - \exp(\theta_i)} \quad\text{and}\quad \mathrm{Var}[\xi^i_j] = \frac{1}{\theta_i^2} - \frac{\exp(\theta_i)}{(\exp(\theta_i) - 1)^2},$$
respectively. Therefore, $T_i$ is approximately distributed as $N(n\, \mathbb{E}[\xi^i_j],\ n\, \mathrm{Var}[\xi^i_j])$, and we compute the p-values accordingly.

We run the experiments with shift $A = \frac{c\eta}{\varepsilon} \log \frac{2}{3\min\{\delta,\ 1 - ((1-\delta)/\exp(\varepsilon))^{1/k}\}}$ (shift magnitude $s = 1$). The results are shown in Figure 3, which plots the FDR and statistical power against the expected fraction of non-nulls, $\pi_1$. As in the case of binomial data, we see that the performance of PAPRIKA generally diminishes as ε decreases, and that PAPRIKA AI outperforms PAPRIKA, again reinforcing the value of tuning the parameters $\lambda_j$ based on the alpha-investing rule. All methods perform well in this setting, and the FDR of PAPRIKA AI is visually indistinguishable from 0 at all levels of ε and $\pi_1$ tested. Numerical values are listed in Table 2 in Appendix C for ease of comparison.

Additionally, in Appendix C we plot the rejection threshold $\alpha_t$ and the wealth of all methods over time (Figure 4), and find that the rejections of our private algorithms are consistent with those of the non-private algorithms, which empirically confirms their accuracy from another perspective. We also vary $\theta_i$ in the alternative hypotheses, which parameterizes the strength of the signal between the null and alternative hypotheses; Figure 5 shows that performance begins to decline with a weaker signal. Finally, in Appendix C.1 we discuss how the shift parameter $A$ should be chosen to balance the tradeoff between FDR and statistical power, displaying results in Figure 7.

*Figure 3.* FDR and statistical power versus fraction of non-nulls $\pi_1$ for PAPRIKA (with $\lambda_j = 0.2$), PAPRIKA AI (with $\lambda_j = \alpha_j$), and non-private algorithms, when the database consists of truncated exponential observations.

## Acknowledgements

W.Z. was supported in part by a Mozilla Research Grant, NSF grant CNS-1850187, and an ARC-TRIAD Fellowship from the Georgia Institute of Technology. G.K. was supported by a University of Waterloo startup grant and an NSERC Discovery grant. R.C. was supported in part by a Mozilla Research Grant, a Google Research Fellowship, NSF grants CNS-1850187 and CNS-1942772 (CAREER), and a JPMorgan Chase Faculty Research Award. Most of this work was completed while R.C. was at the Georgia Institute of Technology. This work was initiated while all authors were visiting the Simons Institute for the Theory of Computing.

## References

Aharoni, E. and Rosset, S. Generalized α-investing: definitions, optimality results and application to public databases. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):771–794, 2014.
Benjamini, Y. and Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.

Dajani, A. N., Lauger, A. D., Singer, P. E., Kifer, D., Reiter, J. P., Machanavajjhala, A., Garfinkel, S. L., Dahl, S. A., Graham, M., Karwa, V., Kim, H., Leclerc, P., Schmutte, I. M., Sexton, W. N., Vilhuber, L., and Abowd, J. M. The modernization of statistical disclosure limitation at the U.S. Census Bureau, 2017. Presented at the September 2017 meeting of the Census Scientific Advisory Committee.

Differential Privacy Team, Apple. Learning with privacy at scale. https://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf, December 2017.

Ding, B., Kulkarni, J., and Yekhanin, S. Collecting telemetry data privately. In Advances in Neural Information Processing Systems 30, NIPS '17, pp. 3571–3580. Curran Associates, Inc., 2017.

Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC '06, pp. 265–284, 2006.

Dwork, C., Naor, M., Pitassi, T., and Rothblum, G. N. Differential privacy under continual observation. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC '10, pp. 715–724, 2010.

Dwork, C., Su, W. J., and Zhang, L. Differentially private false discovery rate control. arXiv preprint arXiv:1807.04209, 2018.

Erlingsson, U., Pihur, V., and Korolova, A. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM Conference on Computer and Communications Security, CCS '14, pp. 1054–1067, New York, NY, USA, 2014. ACM.

Foster, D. P. and Stine, R. A. α-investing: a procedure for sequential control of expected false discoveries. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(2):429–444, 2008.

Homer, N., Szelinger, S., Redman, M., Duggan, D., Tembe, W., Muehling, J., Pearson, J. V., Stephan, D. A., Nelson, S. F., and Craig, D. W. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, 4(8):e1000167, 2008.

Javanmard, A. and Montanari, A. On online control of false discovery rate. arXiv preprint arXiv:1502.06197, 2015.

Javanmard, A. and Montanari, A. Online rules for control of false discovery rate and false discovery exceedance. The Annals of Statistics, 46(2):526–554, 2018.

National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science. 2019.

Ramdas, A., Yang, F., Wainwright, M. J., and Jordan, M. I. Online control of the false discovery rate with decaying memory. In Advances in Neural Information Processing Systems, pp. 5650–5659, 2017.

Ramdas, A., Zrnic, T., Wainwright, M., and Jordan, M. SAFFRON: an adaptive algorithm for online control of the false discovery rate. In International Conference on Machine Learning, pp. 4286–4294, 2018.

Tian, J. and Ramdas, A. ADDIS: An adaptive discarding algorithm for online FDR control with conservative nulls. In Advances in Neural Information Processing Systems 32, NeurIPS '19, pp. 9383–9391. Curran Associates, Inc., 2019.