# The Limits of Post-Selection Generalization

Kobbi Nissim (Georgetown University, kobbi.nissim@georgetown.edu), Adam Smith (Boston University, ads22@bu.edu), Thomas Steinke (IBM Research Almaden, posel@thomas-steinke.net), Uri Stemmer (Ben-Gurion University, u@uri.co.il), Jonathan Ullman (Northeastern University, jullman@ccs.neu.edu)

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Supported by NSF award CNS-1565387. Supported by NSF awards IIS-1447700 and AF-1763665, a Google Faculty Award, and a Sloan Foundation Research Award. Work done while U.S. was a postdoctoral researcher at the Weizmann Institute of Science, supported by a Koshland fellowship and by the Israel Science Foundation (grants 950/16 and 5219/17). Supported by NSF awards CCF-1718088, CCF-1750640, and CNS-1816028, and a Google Faculty Award.

While statistics and machine learning offer numerous methods for ensuring generalization, these methods often fail in the presence of post-selection, the common practice in which the choice of analysis depends on previous interactions with the same dataset. A recent line of work has introduced powerful, general-purpose algorithms that ensure a property called post hoc generalization (Cummings et al., COLT'16), which says that no analyst, given the output of the algorithm, should be able to find any statistic for which the data differs significantly from the population it came from.

In this work we show several limitations on the power of algorithms satisfying post hoc generalization. First, we show a tight lower bound on the error of any algorithm that satisfies post hoc generalization and answers adaptively chosen statistical queries, demonstrating a strong barrier to progress in post-selection data analysis. Second, we show that post hoc generalization is not closed under composition, despite many examples of such algorithms exhibiting strong composition properties.

1 Introduction

Consider a dataset X consisting of n independent samples from some unknown population P. How can we ensure that the conclusions drawn from X generalize to the population P? Despite decades of research in statistics and machine learning on methods for ensuring generalization, there is an increased recognition that many scientific findings do not generalize, with some even declaring this to be a statistical crisis in science [14]. While there are many reasons a conclusion might fail to generalize, one that is receiving increasing attention is post-selection, in which the choice of method for analyzing the dataset depends on previous interactions with the same dataset. Post-selection can arise from many common practices, such as variable selection, exploratory data analysis, and dataset re-use. Unfortunately, post-selection invalidates traditional methods for ensuring generalization, which assume that the method is independent of the data.

Numerous methods have been devised for statistical inference after post-selection (e.g. [16, 18, 12, 13, 23]). These are primarily special-purpose procedures that apply to specific types of simple post-selection that admit direct analysis. A more limited number of methods apply where the data is reused in one of a small number of prescribed ways (e.g. [2, 4]).

A recent line of work initiated by Dwork et al. [9] posed the question: Can we design general-purpose algorithms for ensuring generalization in the presence of post-selection?
These works (e.g. [9, 8, 19, 1]) identified properties of an algorithm that ensure generalization under post-selection, including differential privacy [10], information-theoretic measures, and compression. They also identified many powerful general-purpose algorithms satisfying these properties, leading to algorithms for post-selection data analysis with greater statistical power than all previously known approaches.

Each of the aforementioned properties gives incomparable generalization guarantees and allows for qualitatively different types of algorithms. However, Cummings et al. [7] identified that the common thread in each of these approaches is to establish a notion of post hoc generalization (which they originally called robust generalization), and they initiated a general study of algorithms satisfying this notion. Informally, an algorithm M satisfies post hoc generalization if there is no way, given only the output of M(X), to identify any statistical query [17] (that is, a bounded, linear, real-valued statistic on the population) such that the value of that query on the dataset is significantly different from its value on the whole population.

Definition 1.1 (Post Hoc Generalization [7]). An algorithm M : X^n → Y satisfies (ε, δ)-post hoc generalization if for every distribution P over X and every algorithm A that outputs a bounded function q : X → [−1, 1], if X ∼ P^n, y ∼ M(X), and q ∼ A(y), then P[ |q(P) − q(X)| > ε ] ≤ δ, where we use the notation q(P) = E_{x∼P}[q(x)] and q(X) = (1/n) Σ_i q(X_i), and the probability is over the sampling of X and any randomness of M and A.

Post hoc generalization is easily satisfied whenever n is large enough to ensure uniform convergence for the class of statistical queries. However, uniform convergence holds only in the unrealistic regime where n is much larger than |X|. Algorithms that satisfy post hoc generalization are interesting in the realistic regime where there exist queries q for which q(P) and q(X) are far apart, but these queries cannot be found. The definition also extends seamlessly to richer types of statistics than statistical queries; however, restricting to statistical queries only strengthens our negative results.

Since all existing general-purpose algorithms for post-selection data analysis are analyzed via post hoc generalization, it is crucial to understand what we can achieve with algorithms satisfying post hoc generalization. In this work we present several strong limitations on the power of such algorithms. Our results identify natural barriers to progress in this area and highlight important challenges for future research on post-selection data analysis.

1.1 Our Results

Sample Complexity Bounds for Statistical Queries. Our first contribution is strong new lower bounds on any algorithm that satisfies post hoc generalization and answers a sequence of adaptively chosen statistical queries, the setting introduced by Dwork et al. [9] and further studied in [1, 15, 20]. In this model, there is an underlying distribution P. We would like to design an algorithm M that holds a sample X ∼ P^n, takes statistical queries q, and returns accurate answers a such that a ≈ q(P). To model post-selection, we consider a data analyst A that issues a sequence of queries q_1, ..., q_k, where each query q_j may depend on the answers a_1, ..., a_{j−1} given by the algorithm in response to previous queries.

The simplest algorithm M for this task of answering adaptive statistical queries would return the empirical mean q_j(X) = (1/n) Σ_i q_j(X_i) in response to each query, and one can show that this algorithm answers every query to within error ε given n = O(k/ε²) samples. Surprisingly, we can improve the sample complexity to n = O(√k/ε²) by returning q(X) perturbed with carefully calibrated noise [9, 1]. The analysis of this approach uses post hoc generalization: the noise is chosen so that |a − q(X)| ≤ ε/2, and the noise also ensures |q(P) − q(X)| ≤ ε/2 for every query the analyst asks.
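To make the model concrete, the following minimal sketch simulates the interaction between an adaptive analyst and a noise-adding mechanism. It is only an illustration of the setting: the function names are ours, and the noise scale sigma is a heuristic placeholder rather than the carefully calibrated noise of [9, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mechanism(X, sigma):
    """Answer statistical queries on the sample X with Gaussian noise.

    X: array of shape (n, d) with entries in {-1, +1}.
    sigma: noise scale (a placeholder; the cited mechanisms calibrate it
           so that both |a - q(X)| and |q(P) - q(X)| stay below eps/2).
    """
    def answer(q):
        # q maps a single record (a length-d vector) to a value in [-1, 1].
        empirical = np.mean([q(x) for x in X])              # q(X)
        return float(np.clip(empirical + rng.normal(0, sigma), -1.0, 1.0))
    return answer

# Toy run: uniform population over {-1,+1}^d and a few adaptively chosen queries.
n, d, k = 1000, 20, 50
X = rng.choice([-1.0, 1.0], size=(n, d))
mech = make_mechanism(X, sigma=0.05)

answers = []
for j in range(k):
    # An "adaptive" analyst: each query may depend on the earlier answers.
    sign = 1.0 if (not answers or answers[-1] >= 0) else -1.0
    coord = j % d
    q = lambda x, s=sign, c=coord: s * x[c]
    answers.append(mech(q))

print("first few answers:", np.round(answers[:5], 3))
```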
Our main result shows that the sample complexity n = O(√k/ε²) is essentially optimal for any algorithm that uses the framework of post hoc generalization.

Theorem 1.2 (Informal). If M takes a sample of size n, satisfies (ε, δ)-post hoc generalization, and, for every distribution P over X = {±1}^{k+O(log(n/ε))} and every data analyst A who asks k statistical queries, satisfies P[ ∃ j ∈ [k] : |q_j(P) − a_j| > ε ] ≤ δ, then n = Ω(√k/ε²), where the probability is taken over X ∼ P^n and the coins of M and A.

To prove our theorem, we construct a joint distribution over pairs (A, P) such that, when M is given too small a sample X and A asks k − 1 statistical queries, either M does not answer all the queries accurately or A outputs a k-th query q* such that q*(P) − q*(X) > ε. Thus, M cannot be both accurate and satisfy post hoc generalization. Our proof of this result refines the techniques of [15, 20], which yield a lower bound of n = Ω(√k) for ε = 1/3.

Our proof circumvents a barrier in previous lower bounds. The previous works use the sequence of queries to uncover almost all of the sample held by the mechanism (a reconstruction attack of sorts). Once the analyst has identified all the points in the sample, it is easy to force an error: the analyst randomly asks one of two queries (one that is zero everywhere, and one that is zero on the reconstructed sample and one elsewhere) that look the same to M but have different true answers. We cannot use that approach because, in our setting, it is impossible to reconstruct any of the sample. Indeed, for the parameter regime we consider, differentially private algorithms could be used to prevent reconstruction with any meaningful confidence. All we can hope for is a weak, approximate reconstruction of the sample. This means the algorithm will have sufficient information to distinguish the two queries above, so we cannot end the proof the same way as before.

Intuitively, our attack approximately reconstructs the dataset in a way that is only O(ε) better than guessing. This is not enough to completely cut off the algorithm and force an error, but, as we will see, it does allow the analyst to construct a query q* that overfits, i.e., |q*(X) − q*(P)| > ε. Our approximate reconstruction is accomplished using a modification of the reconstruction attack techniques of prior work. Specifically, we employ tools from the fingerprinting codes literature [3, 22, 6], but we output quantitative scores rather than a hard in/out decision about what is in the sample.

Independently, Wang [24] proved a quantitatively similar bound to Theorem 1.2. However, Wang's bound applies only to algorithms M that receive only the empirical mean q(X) of each query (as opposed to the whole dataset). This precludes mechanisms, such as sample splitting, that treat records asymmetrically. Wang's bound also applies to a slightly different (though closely related) class of statistics.

The dimensionality of X required in our result is at least as large as k, which is somewhat necessary.
Indeed, if the support of the distribution is {±1}^d, then there is an algorithm M that takes a sample of size just n = O(√d · log(k)/ε³) [9, 1], so the conclusion is simply false if d ≪ k. Even when d ≥ k, the aforementioned algorithms require running time at least 2^d per query. [15, 20] also showed that any polynomial-time algorithm that answers k queries to constant error requires n = Ω(√k). We improve this result to have the optimal dependence on ε.

Theorem 1.3 (Informal). Assume one-way functions exist and let c > 0 be any constant. If M takes a sample of size n, has polynomial running time, satisfies (ε, δ)-post hoc generalization, and, for every distribution P over X = {±1}^{k^c+O(log(n/ε))} and every data analyst A who asks k statistical queries, satisfies P[ ∃ j ∈ [k] : |q_j(P) − a_j| > ε ] ≤ δ, then n = Ω(√k/ε²), where the probability is taken over X ∼ P^n and the coins of M and A.

We prove the information-theoretic result (Theorem 1.2) in Section 2. Due to space restrictions, we defer the computational result (Theorem 1.3) to the full version of this work.

Negative Results for Composition. Differential privacy provides optimal or near-optimal methods for answering an adaptively chosen sequence of statistical queries. However, even for answering statistical queries, outside constraints sometimes preclude randomized algorithms (to allay reproducibility concerns, for instance). Furthermore, one of the main goals of the emerging study of adaptive data analysis is to understand unstructured, unplanned dataset re-use. At this point, we know several techniques for reasoning about generalization in the adaptive setting: differential privacy and algorithmic stability, information bounds, and compression (and there may be many more yet to be discovered) [7]. These techniques are not directly comparable, but they all use post hoc generalization as a fundamental unit of their analysis. If post hoc generalization were to compose well, then this would provide an avenue for combining these techniques (and possibly others). However, we show that this is not the case and, hence, we must search elsewhere for a unifying theory.

Intuitively, we show that if the same dataset is analyzed by many different algorithms, each satisfying post hoc generalization, then the composition of these algorithms may not satisfy post hoc generalization. That is, combining the information output by several algorithms may permit overfitting even when the individual outputs do not. The key reason differential privacy is used for adaptive data analysis is that it satisfies strong composition properties; this is what quantitatively distinguishes the technique from, say, data splitting. We show that post hoc generalization does not have even weak adaptive composition properties. This shows a stark difference between differential privacy and post hoc generalization as tools for analyzing adaptive data analysis. This result can be viewed as further motivation for using differential privacy in this setting: its composition properties are special.

Theorem 1.4 states that there is a set of O(log n) algorithms that have almost optimal post hoc generalization, but whose composition does not have any non-trivial post hoc generalization.

Theorem 1.4. For every n ∈ N there is a collection of ℓ = O(log n) algorithms M_1, ..., M_ℓ that take n samples from a distribution over X = {0, 1}^{O(log n)} such that (1) each of these algorithms is (ε, δ)-post hoc generalizing for every δ > 0 and ε = O(√(log(n/δ)/n^{0.999})), but (2) the composition (M_1, ..., M_ℓ) is not (1.999, 0.999)-post hoc generalizing.
If we consider a relaxed notion of computational post hoc generalization, then we show that composition can fail even for just two algorithms. Informally, computational post hoc generalization means that Definition 1.1 is satisfied whenever the algorithm A runs in polynomial time.

Theorem 1.5. Assume one-way functions exist. For every n ∈ N there are two algorithms M_1, M_2 that take n samples from a distribution over X = {0, 1}^{O(log n)} such that (1) both algorithms are (ε, δ)-computationally post hoc generalizing for every δ > n^{−O(1)} and ε = O(√(log(n/δ)/n^{0.999})), but (2) the composition (M_1, M_2) is not (1.999, 0.999)-computationally post hoc generalizing.

We prove the information-theoretic result (Theorem 1.4) in Section 3. Due to space restrictions, we defer the computational result (Theorem 1.5) to the full version of this work.

2 Lower Bounds for Statistical Queries

2.1 Post Hoc Generalization for Adaptive Statistical Queries

We are interested in the ability of interactive algorithms satisfying post hoc generalization to answer a sequence of statistical queries. Definition 1.1 applies to such algorithms via the following experiment.

Algorithm 1: AQ_{X,n,k}[M ↔ A]
  A chooses a distribution P over X
  X ∼ P^n and X is given to M (but not to A)
  For j = 1, ..., k:
    A outputs a statistical query q_j (possibly depending on q_1, a_1, ..., q_{j−1}, a_{j−1})
    M(X) outputs a_j

Definition 2.1. An algorithm M is (ε, δ)-post hoc generalizing for k adaptive queries over X given n samples if for every adversary A,
  P_{AQ_{X,n,k}[M ↔ A]}[ ∃ j ∈ [k] : |q_j(X) − q_j(P)| > ε ] ≤ δ.

2.2 A Lower Bound for Natural Algorithms

We begin with an information-theoretic lower bound for a class of algorithms M that we call natural algorithms. These are algorithms that can only evaluate the query on the sample points they are given. That is, an algorithm M is natural if, when given a sample X = (X_1, ..., X_n) and a statistical query q : X → [−1, 1], the algorithm M returns an answer a that is a function only of (q(X_1), ..., q(X_n)). In particular, it cannot evaluate q on data points of its choice. Many algorithms in the literature have this property. Formally, we define natural algorithms via the game NAQ_{X,n,k}[M ↔ A]. This game is identical to AQ_{X,n,k}[M ↔ A] except that when A outputs q_j, M does not receive all of q_j, but instead receives only q_j|_X = (q_j(X_1), ..., q_j(X_n)).

Theorem 2.2 (Lower Bound for Natural Algorithms). There is an adversary A_NAQ such that for every natural algorithm M, and for universe size N = 8n/ε, if
  P_{NAQ_{[N],n,k}[M ↔ A_NAQ]}[ ∃ j ∈ [k] : |q_j(X) − q_j(P)| > ε ∨ |a_j − q_j(P)| > ε ] ≤ 1/100,
then n = Ω(√k/ε²). Here the sample X is chosen via the game NAQ_{[N],n,k} (it is sampled uniformly from the domain [N]).

The proof uses the analyst A_NAQ described in Algorithm 2. For notational convenience, A_NAQ actually asks k + 1 queries, but this does not affect the final result.

Algorithm 2: A_NAQ
  Parameters: sample size n, universe size N = 8n/ε, number of queries k, target accuracy ε
  Let P ← U_[N], A_1 ← ∅, and τ ← 9ε√(k log(1/ε)) + 1
  For j ∈ [k]:
    Sample p_j ∼ U_[0,1]
    For i ∈ [N]: sample q_i^j ∼ Ber(p_j) and let q_j(i) ← q_i^j · 1[i ∉ A_j]
    Ask query q_j and receive answer a_j
    For i ∈ [N]: let z_i^j ← trunc_{3ε}(a_j − p_j) · (q_i^j − p_j) · 1[i ∉ A_j],
      where trunc_{3ε}(x) takes x ∈ R and returns the nearest point in [−3ε, 3ε] to x
    Let A_{j+1} ← { i ∈ [N] : |Σ_{ℓ=1}^j z_i^ℓ| > τ − 1 }  (N.B. By construction, A_j ⊆ A_{j+1}.)
  Define z_i ← Σ_{j=1}^k z_i^j and q*_i ← z_i / τ
  Let q* : [N] → [−1, 1] be defined by q*(i) ← q*_i

Table 1: Notation and intuition for Algorithm 2
  n: sample size.
  N: universe size.
  P: target distribution, uniform on [N].
  A_j: universe elements suspected of being in the sample during the j-th round of the attack.
  q_j: the query constructed in the j-th round.
  z_i^j: if i is not in the sample then E[(a_j − p_j)(q_i^j − p_j)] = E[a_j − p_j] · E[q_i^j − p_j] = 0; the bigger Σ_{j=1}^k z_i^j is, the more we suspect element i of being in the database.
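The following is a minimal simulation of the analyst of Algorithm 2 run against one concrete natural mechanism, namely a noisy empirical mean. It is our own illustration: the mechanism, the noise scale, and the parameter values are illustrative choices, and at this toy scale the resulting gap is small but visibly positive (the proof takes k = Θ(ε⁴n²) with a large constant before the gap exceeds ε).

```python
import numpy as np

rng = np.random.default_rng(1)

def anaq_attack(n=200, eps=0.1, k=400):
    """Run the analyst of Algorithm 2 against a noisy-empirical-mean mechanism."""
    N = int(8 * n / eps)                          # universe size, as in Algorithm 2
    X = rng.integers(0, N, size=n)                # sample X ~ uniform([N])^n
    tau = 9 * eps * np.sqrt(k * np.log(1 / eps)) + 1
    accused = np.zeros(N, dtype=bool)             # the sets A_j
    scores = np.zeros(N)                          # running sums of z_i^j

    for _ in range(k):
        p = rng.uniform()
        q = rng.binomial(1, p, size=N).astype(float)
        q[accused] = 0.0                          # q_j(i) = 0 for i in A_j
        # The natural mechanism sees only the query values on its sample.
        a = float(np.clip(q[X].mean() + rng.normal(0, eps / 4), 0.0, 1.0))
        z = np.clip(a - p, -3 * eps, 3 * eps) * (q - p)
        z[accused] = 0.0
        scores += z
        accused |= np.abs(scores) > tau - 1

    q_star = np.clip(scores / tau, -1.0, 1.0)     # final query q*(i) = z_i / tau
    return q_star[X].mean() - q_star.mean()       # q*(X) - q*(P)

print("overfitting gap q*(X) - q*(P) =", round(anaq_attack(), 4))
```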
In order to prove Theorem 2.2, it suffices to prove that either the answer a_j to one of the initial queries q_j fails to be accurate (in which case M is not accurate), or the final query q* gives significantly different answers on X and P (in which case M is not robustly generalizing). Formally, we have the following proposition.

Proposition 2.3. For an appropriate choice of k = Θ(ε⁴n²), and for n and 1/ε sufficiently large, for any natural M, with probability at least 2/3, either (1) ∃ j ∈ [k] : |a_j − q_j(P)| > ε, or (2) q*(X) − q*(P) > ε, where the probability is taken over the game NAQ_{X,n,k}[M ↔ A_NAQ] and A_NAQ is specified by Algorithm 2.

We prove Proposition 2.3 using a series of claims. The first claim states that none of the values z_i is ever too large in absolute value; it follows immediately from the definition of the set A_j and the fact that each term z_i^j is bounded.

Claim 2.4. For every i ∈ [N], |z_i| ≤ τ.

The next claim states that, no matter how the mechanism answers, very few of the items not in the sample get accused of membership, that is, included in the set A_j.

Claim 2.5 (Few Accusations). P[ |A_k \ X| ≤ εN/8 ] ≥ 1 − e^{−Ω(εn)}.

Proof. Fix the biases p_1, ..., p_k as well as all of the information visible to the mechanism (the query values {q_i^j : i ∈ X, j ∈ [k]}, as well as the answers a_1, ..., a_k). We prove that, conditioned on any setting of these variables, the event |A_k \ X| ≤ εN/8 occurs with high probability.

The main observation is that, once we condition on the biases p_j, the query values {q_i^j : i ∉ X, j ∈ [k]} are independent, with q_i^j ∼ Ber(p_j). This is true because M is a natural algorithm (so it sees only the query values for points in X) and, more subtly, because the analyst's decisions about how to sample the p_j's, and about which points in X to include in the sets A_j, are independent of the query values outside of X. By the principle of deferred decisions, we may thus think of the query values {q_i^j : i ∉ X, j ∈ [k]} as selected after the interaction with the mechanism is complete.

Fix i ∉ X. For every j ∈ [k] we have
  E[z_i^j] = E[ trunc_{3ε}(a_j − p_j) · (q_i^j − p_j) ] = E[ trunc_{3ε}(a_j − p_j) ] · E[ q_i^j − p_j ] = 0.
By linearity of expectation, we also have E[z_i] = E[ Σ_{j=1}^k z_i^j ] = 0. Next, note that |z_i^j| ≤ 3ε, since trunc_{3ε}(a_j − p_j) ∈ [−3ε, 3ε] and q_i^j − p_j ∈ [−1, 1].

The terms z_i^j are not independent, since if a partial sum Σ_{ℓ=1}^j z_i^ℓ ever exceeds τ − 1 in absolute value, then the subsequent values z_i^ℓ for ℓ > j will be set to 0. However, we may consider a related sequence given by sums of the terms z̃_i^j = trunc_{3ε}(a_j − p_j) · (q_i^j − p_j) (the difference from z_i^j is that we use the sampled values q_i^j ∼ Ber(p_j) regardless of whether item i is in A_j). Once we have conditioned on the biases and the mechanism's outputs, Σ_{j=1}^k z̃_i^j is a sum of bounded independent random variables. By Hoeffding's inequality, the sum is bounded by O(ε√(k log(1/ε))) with high probability: for every i ∉ X,
  P[ |Σ_{j=1}^k z̃_i^j| > (τ − 1)/3 ] ≤ ε/48.
By Etemadi's inequality, a related bound holds uniformly over all of the intermediate partial sums: for every i ∉ X,
  P[ ∃ ℓ ∈ [k] : |Σ_{j=1}^ℓ z̃_i^j| > τ − 1 ] ≤ 3 · (ε/48) = ε/16.
Finally, notice that, by construction, the real scores z_i^j are all set to 0 once an item is added to A_j, so the sets A_j are nested (A_j ⊆ A_{j+1}), and a bound on the partial sums of the z̃_i^j applies equally well to the partial sums of the z_i^j. Thus, for every i ∉ X,
  P[ ∃ ℓ ∈ [k] : |Σ_{j=1}^ℓ z_i^j| > τ − 1 ] ≤ ε/16.

Now, the scores z_i are independent across elements i ∉ X (again, because we have fixed the biases p_j and the mechanism's outputs). We can bound the probability that more than εN/8 elements outside the sample are accused over the course of the algorithm using a Chernoff bound:
  P[ |A_k \ X| > εN/8 ] ≤ e^{−εN/64} = e^{−Ω(n)}.
The claim now follows by averaging over all of the choices we fixed.

The next claim states that the sum of the scores over all i not in the sample is small.

Claim 2.6. With probability at least 99/100, Σ_{i ∈ [N]\X} z_i = O(τ√N).

Proof. Fix a choice of (p_1, ..., p_k) ∈ [0,1]^k, the in-sample query values (q_1|_X, ..., q_k|_X) ∈ {0,1}^{n×k}, and the answers (a_1, ..., a_k) ∈ [0,1]^k. Conditioned on these, the values z_i for i ∉ X are independent and identically distributed. They have expectation 0 (see the proof of Claim 2.5) and are bounded by τ (by Claim 2.4). By Hoeffding's inequality, with probability at least 99/100, Σ_{i ∈ [N]\X} z_i = O(τ√N), as desired. The claim now follows by averaging over all of the choices we fixed.

Claim 2.7. There exists c > 0 such that, for all sufficiently small ε and sufficiently large n, with probability at least 99/100, either ∃ j ∈ [k] : |a_j − q_j(P)| > ε (large error), or Σ_{i ∈ [N]} z_i ≥ ck (high scores in sample).

The proof of Claim 2.7 relies on the following key lemma. The lemma has appeared in various forms [20, 11, 21]; the form we use is [5, Lemma 3.6] (rescaled from {−1, +1} to {0, 1}).

Lemma 2.8 (Fingerprinting Lemma). Let f : {0,1}^m → [0,1] be arbitrary. Sample p ∼ U_[0,1] and sample x_1, ..., x_m ∼ Ber(p) independently. Then
  E[ f(x) · Σ_{i ∈ [m]} (x_i − p) + |f(x) − x̄| ] ≥ 1/12,  where x̄ = (1/m) Σ_{i ∈ [m]} x_i.
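As a quick sanity check on the lemma (our own illustration, not part of the paper's argument), the following Monte Carlo estimate of the left-hand side for two simple choices of f stays comfortably above 1/12 ≈ 0.083; the function names and parameters below are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def fp_quantity(f, m=50, trials=50_000):
    """Monte Carlo estimate of E[f(x) * sum_i(x_i - p) + |f(x) - mean(x)|]
    with p ~ U[0,1] and x_1, ..., x_m ~ Ber(p) i.i.d."""
    p = rng.uniform(size=trials)
    x = rng.binomial(1, p[:, None], size=(trials, m)).astype(float)
    fx = np.apply_along_axis(f, 1, x)
    return float(np.mean(fx * (x - p[:, None]).sum(axis=1)
                         + np.abs(fx - x.mean(axis=1))))

# Both estimates should be at least 1/12 ~ 0.083.
print("f = empirical mean :", round(fp_quantity(np.mean), 3))        # about 1/6
print("f = constant 1/2   :", round(fp_quantity(lambda x: 0.5), 3))  # about 1/4
```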
Proof of Claim 2.7. To make use of the Fingerprinting Lemma, we consider a variant of Algorithm 2 that does not truncate the quantity a_j − p_j to the range [−3ε, 3ε] when computing the score z_i^j for each element i. Specifically, we consider scores based on the quantities
  ẑ_i^j = (a_j − p_j)(q_i^j − p_j) if i ∉ A_j, and ẑ_i^j = 0 if i ∈ A_j; and ẑ_i = Σ_{j=1}^k ẑ_i^j.
We prove two main statements: first, that these untruncated scores are equal to the truncated ones with high probability as long as the mechanism's answers are accurate; second, that the expected sum of the untruncated scores is large. Together these give the desired statement.

To relate the truncated and untruncated scores, write q̃_j : [N] → {0,1} for the query before accused items are zeroed out, i.e., q̃_j(i) = q_i^j, and consider the following three key events:
  1. (Few accusations): Let F be the event that the set of accused items outside of the sample is small: |A_k \ X| ≤ εN/8. Since the A_j are nested, event F implies the same condition for all j ∈ [k].
  2. (Low population error): Let G be the event that at every round j ∈ [k], the mechanism's answer satisfies |a_j − q_j(P)| ≤ ε.
  3. (Representative queries): Let H be the event that |q̃_j(P) − p_j| ≤ ε for all rounds j ∈ [k]; that is, each query's population average is close to the corresponding sampling bias p_j.

Sub-Claim 2.9. Conditioned on F ∩ G ∩ H, the truncated and untruncated scores are equal. Specifically, |a_j − p_j| ≤ 3ε for all j ∈ [k].

Proof. We can bound the difference |a_j − p_j| via the triangle inequality:
  |a_j − p_j| ≤ |a_j − q_j(P)| + |q_j(P) − q̃_j(P)| + |q̃_j(P) − p_j|.
The first term is the mechanism's error on the population (at most ε when G occurs). The second is the distortion of the population mean introduced by setting the query values of i ∈ A_j to 0. This distortion is at most |A_j|/N. When F occurs, A_j has size at most |X| + |A_k \ X| ≤ n + εN/8 = εN/4, so the second term is at most ε/4. Finally, the last term is bounded by ε when H occurs, by definition. The three terms add up to at most 3ε when F, G, and H all occur.

We can bound the probability of H via a Chernoff bound: the probability that a binomial random variable deviates from its mean by εN is at most 2 exp(−ε²N/3).

The technical core of the proof is the use of the Fingerprinting Lemma to analyze the quantity D that combines the sum of the untruncated scores with the summed population errors:
  D := Σ_{i=1}^N ẑ_i + Σ_{j=1}^k ( |a_j − q_j(P)| + |A_j|/(N − |A_j|) ).

Sub-Claim 2.10. E[D] = Ω(k).

Proof. We show that, for each round j, the expected sum of scores for that round, Σ_i ẑ_i^j, is at least 1/12 − E[ |a_j − q_j(P)| + |A_j|/(N − |A_j|) ]. This is true even when we condition on all of the random choices and communication in rounds 1 through j − 1. Adding up these expectations over all rounds gives the desired expectation bound for D.

First, note that summing ẑ_i^j over all elements i ∈ [N] is the same as summing over that round's unaccused elements i ∈ [N] \ A_j (since ẑ_i^j = 0 for i ∈ A_j). Thus,
  Σ_{i=1}^N ẑ_i^j = Σ_{i ∈ [N]\A_j} ẑ_i^j = (a_j − p_j) · Σ_{i ∈ [N]\A_j} (q_i^j − p_j).
We can now apply the Fingerprinting Lemma with m = N − |A_j|, p = p_j, x_i = q_i^j for i ∉ A_j, and f((x_i)_{i ∉ A_j}) = a_j (note that f depends implicitly on A_j, but since we condition on the outcome of previous rounds, we may take A_j as fixed for round j). We obtain
  E[ (a_j − p_j) · Σ_{i ∉ A_j} (q_i^j − p_j) + | a_j − (1/(N − |A_j|)) Σ_{i ∉ A_j} q_i^j | ] ≥ 1/12.
Now the difference between (1/(N − |A_j|)) Σ_{i ∉ A_j} q_i^j and the population mean q_j(P) = (1/N) Σ_{i=1}^N q_j(i) is at most N · (1/(N − |A_j|) − 1/N) = |A_j|/(N − |A_j|). Thus we can upper bound the term inside the absolute value above by |a_j − q_j(P)| + |A_j|/(N − |A_j|), which proves the sub-claim.

A direct corollary of Sub-Claim 2.10 is that there is a constant c > 0 such that, with probability at least 199/200, D ≥ ck. Let us call that event I. Conditioned on F ∩ G ∩ H, we know that each ẑ_i equals the real score z_i (by Sub-Claim 2.9), that |a_j − q_j(P)| ≤ 3ε for each j, and that |A_j| ≤ εN/4 for each j. If we also take the intersection with I, then we have
  Σ_{i=1}^N z_i = D − Σ_{j=1}^k ( |a_j − q_j(P)| + |A_j|/(N − |A_j|) ) ≥ ck − 3kε − k · (ε/4)/(1 − ε/4) ≥ k(c − 4ε)
(for sufficiently small ε). By a union bound, the probability of ¬(F ∩ H ∩ I) is at most 1/200 + exp(−Ω(ε²n)) ≤ 1/100 (for sufficiently large n). Thus we get
  P[ (¬G) or Σ_{i=1}^N z_i ≥ c′k ] ≥ 99/100,
where c′ = c − 4ε is positive for sufficiently small ε. This completes the proof of Claim 2.7.

To complete the proof of the proposition, suppose that |a_j − q_j(P)| ≤ ε for every j, so that by Claims 2.6 and 2.7 we may assume Σ_{i ∈ X} z_i = Ω(k). A relatively straightforward calculation (omitted for space) then lower-bounds q*(X) − q*(P) by a main term of order k/(nτ) minus an error term of order 1/√N coming from the out-of-sample scores. Choosing k = Θ(ε⁴n²) appropriately, the main term is at least 2ε. Also, we have N ≥ n = Θ(√k/ε²), so once k is larger than some absolute constant, the O(1/√N) term is at most ε. Thus, either M fails to be accurate, so that ∃ j ∈ [k] : |a_j − q_j(P)| > ε, or we find a query q* such that q*(X) − q*(P) > ε.

2.3 Lower Bounds for All Algorithms via Random Masks

We prove Theorem 1.2 by constructing a transformation from an adversary that defeats all natural algorithms into an adversary that defeats all algorithms. The main idea of the reduction is to use random masks to hide information about the evaluation of the queries at points outside of the dataset. This effectively forces the algorithm to behave like a natural algorithm because, intuitively, it does not know how to evaluate the query anywhere other than on the dataset. The reduction is described in Algorithm 3; due to space restrictions, we omit its analysis.

Algorithm 3: A_AQ
  Parameters: sample size n, universe size N = 8n/ε, number of queries k, target accuracy ε.
  Oracle: an adversary A_NAQ for natural algorithms with sample size n, universe size N, number of queries k, target accuracy ε.
  Let X = {(i, y)}_{i ∈ [N], y ∈ {±1}^k}
  For i ∈ [N]: choose m_i = (m_i^1, ..., m_i^k) ∼ U({±1}^k)
  Let P be the uniform distribution over the pairs (i, m_i) for i ∈ [N]
  For j ∈ [k]:
    Receive the query q̂_j : [N] → [−1, 1] from A_NAQ
    Form the query q_j(i, y) = y_j · m_i^j · q̂_j(i)  (N.B. q_j(i, m_i) = q̂_j(i))
    Send the query q_j to M and receive the answer a_j
    Send the answer a_j to A_NAQ
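The following sketch (our own illustration, with variable names of our choosing) shows just the masking step: each universe element i is paired with a fresh random sign vector m_i that is part of the population, and a query q̂_j from the natural-algorithm adversary is rewritten so that it agrees with q̂_j on the true records (i, m_i) but evaluates to an unpredictable sign on any other pair.

```python
import numpy as np

rng = np.random.default_rng(3)

# Masking step of Algorithm 3 (A_AQ): plant a random sign mask m_i in {-1,+1}^k
# next to each universe element i, and rewrite a query q_hat over [N] as a
# query over pairs (i, y):  q_j(i, y) = y_j * m_i[j] * q_hat(i).

N, k = 1000, 8
masks = rng.choice([-1, 1], size=(N, k))        # the masks m_i, part of P

def wrap_query(q_hat, j):
    """Turn q_hat: [N] -> [-1, 1] into the masked query q_j over pairs (i, y)."""
    def q_j(i, y):
        return y[j] * masks[i, j] * q_hat(i)
    return q_j

q_hat = lambda i: 1.0 if i % 2 == 0 else -1.0   # an arbitrary inner query
q0 = wrap_query(q_hat, 0)

i = 7
print(q0(i, masks[i]))    # on the true record (i, m_i): equals q_hat(i)
print(q0(i, -masks[i]))   # with a wrong mask the sign flips; without m_i it looks random
```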
3 Post Hoc Generalization Does Not Compose

In this section we prove that post hoc generalization is not closed under composition.

Theorem 3.1. For every n ∈ N and every α > 0 there is a collection of ℓ = O((1/α) log n) algorithms M_1, ..., M_ℓ : ({0,1}^{5 log n})^n → Y such that (1) for every i = 1, ..., ℓ and every δ > 0, M_i satisfies (ε, δ)-post hoc generalization for ε = O(√(log(n/δ)/n^{1−α})), but (2) the composition (M_1, ..., M_ℓ) is not (2 − 2/n⁴, 1 − 1/(2n³))-post hoc generalizing.

The result is based on an algorithm that we call Encrypermute. Before proving Theorem 3.1, we introduce Encrypermute and establish the main properties that it satisfies.

Algorithm 4: Encrypermute
  Input: parameter k and a sample X = (x_1, x_2, ..., x_n) ∈ ({0,1}^d)^n for d = 5 log n.
  If X contains n distinct elements:
    Let π be the permutation that sorts (x_1, ..., x_k) and identify π with a number r ∈ {0, 1, ..., k! − 1}
    Let α ∈ [0, 1] be the largest number such that k ≥ n^α, and let t ← αk/20  (N.B. 2^{dt} ≤ k!)
    Identify (x_{k+1}, ..., x_{k+t}) ∈ ({0,1}^d)^t with a number m ∈ {0, 1, ..., k! − 1}
    Return c = m + r mod k!
  Else:
    Return a uniformly random number c ∈ {0, 1, ..., k! − 1}

The key facts about Encrypermute are as follows.

Claim 3.2. Let D be any distribution over ({0,1}^d)^n. Let D ∼ D, let X be a uniformly random permutation of D, and let C ∼ Encrypermute(X). Then D and C are independent.

Intuitively, the claim follows from the fact that r is uniformly random and depends only on the permutation, so it is independent of D. Therefore m + r mod k! is uniformly random and independent of m.

Lemma 3.3. For every δ > 0, Encrypermute satisfies (ε, δ)-post hoc generalization for ε = √(2 ln(2/δ)/n).

Intuitively, the lemma follows from the fact that C is independent of D. We omit the proofs of both of these statements due to space restrictions.

Proof of Theorem 3.1. Fix α ∈ (0, 1), and let M_1 denote the mechanism that takes a database of size n and outputs the first n^α elements of its sample. As M_1 outputs a sublinear portion of its input, it satisfies post hoc generalization with strong parameters. Specifically, by [7, Lemma 3.5], M_1 is (ε, δ)-post hoc generalizing for ε = O(√(log(n/δ)/n^{1−α})). Now consider composing M_1 with O((1/α) log n) copies of Encrypermute, with exponentially growing choices of the parameter k, where for the i-th copy we set k = (1 + α/20)^i · n^α. By Lemma 3.3, each of these mechanisms satisfies post hoc generalization for ε = O(√(log(1/δ)/n)), so this composition satisfies the assumptions of the theorem.

Let P be the uniform distribution over {0,1}^d, where d = 5 log n, and let X ∼ P^n.
By a standard analysis, X contains n distinct elements with probability at least 1 − 1/(2n³). Assuming that this is the case, the first copy of Encrypermute outputs c = m + r mod k!, where m encodes the rows of X in positions n^α + 1, ..., (1 + α/20)n^α, and where r is a deterministic function of the first n^α rows of X. Hence, when composed with M_1, these two mechanisms reveal the first (1 + α/20)n^α rows of X. By induction, the output of the composition of all of the copies of Encrypermute with M_1 reveals all of X. Hence, from the output of this composition, we can define the predicate q : {0,1}^d → {±1} that evaluates to 1 on every element of X, and to −1 otherwise. This predicate satisfies q(X) = 1, but q(P) ≤ −1 + 2n/2^d = −1 + 2/n⁴.
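To make the composition attack concrete, here is a toy sketch (our own illustration) of Encrypermute's encoding and of the decryption step available to someone who already knows the first k rows in order, as M_1 reveals them. The bit-width, block sizes, and helper names below are illustrative choices and not the exact parameters of Algorithm 4.

```python
import math
import random

random.seed(4)

def sorting_perm(xs):
    """The permutation (as a list of indices) that sorts xs."""
    return sorted(range(len(xs)), key=lambda i: xs[i])

def perm_rank(perm):
    """Lexicographic rank of a permutation of {0,...,k-1}, in {0,...,k!-1}."""
    k, rank = len(perm), 0
    for i in range(k):
        smaller_later = sum(1 for j in range(i + 1, k) if perm[j] < perm[i])
        rank += smaller_later * math.factorial(k - 1 - i)
    return rank

def encrypermute(rows, k, t, d):
    """rows: distinct d-bit integers. Returns c = (m + r) mod k! as in Algorithm 4."""
    assert 2 ** (t * d) <= math.factorial(k)
    r = perm_rank(sorting_perm(rows[:k]))        # the "pad" from the first k rows
    m = 0
    for x in rows[k:k + t]:                      # pack the next t rows into m
        m = (m << d) | x
    return (m + r) % math.factorial(k)

def decrypt(c, known_prefix, k, t, d):
    """Given the first k rows in order, recover rows k+1, ..., k+t."""
    r = perm_rank(sorting_perm(known_prefix[:k]))
    m = (c - r) % math.factorial(k)
    return [(m >> (d * (t - 1 - j))) & ((1 << d) - 1) for j in range(t)]

d, n = 16, 40
X = random.sample(range(2 ** d), n)              # n distinct d-bit rows
k, t = 20, 3                                     # 2^(t*d) <= k!
c = encrypermute(X, k, t, d)
print(decrypt(c, X[:k], k, t, d) == X[k:k + t])  # True: the composition leaks rows
```

Iterating this decryption with the growing values of k used in the proof recovers the entire sample, even though each individual output on its own is independent of the data (Claim 3.2).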
References

[1] Raef Bassily, Kobbi Nissim, Adam D. Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In STOC, 2016.
[2] Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, Linda Zhao, et al. Valid post-selection inference. The Annals of Statistics, 41(2):802–837, 2013.
[3] Dan Boneh and James Shaw. Collusion-secure fingerprinting for digital data. IEEE Transactions on Information Theory, 44(5):1897–1905, 1998.
[4] Andreas Buja, Richard Berk, Lawrence Brown, Edward George, Emil Pitkin, Mikhail Traskin, Linda Zhao, and Kai Zhang. Models as approximations: A conspiracy of random regressors and model deviations against classical inference in regression. Statistical Science, 1460, 2015.
[5] Mark Bun, Thomas Steinke, and Jonathan Ullman. Make up your mind: The price of online queries in differential privacy. In SODA, 2017.
[6] Mark Bun, Jonathan Ullman, and Salil P. Vadhan. Fingerprinting codes and the price of approximate differential privacy. In STOC, pages 1–10. ACM, May 31–June 3, 2014.
[7] Rachel Cummings, Katrina Ligett, Kobbi Nissim, Aaron Roth, and Zhiwei Steven Wu. Adaptive learning with robust generalization guarantees. In COLT, 2016.
[8] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. In NIPS, 2015.
[9] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. In STOC, 2015.
[10] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
[11] Cynthia Dwork, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan. Robust traceability from trace amounts. In FOCS, 2015.
[12] Bradley Efron. Estimation and accuracy after model selection. Journal of the American Statistical Association, 109(507):991–1007, 2014.
[13] William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.
[14] Andrew Gelman and Eric Loken. The statistical crisis in science. American Scientist, 102(6):460, 2014.
[15] Moritz Hardt and Jonathan Ullman. Preventing false discovery in interactive data analysis is hard. In FOCS, 2014.
[16] Clifford M. Hurvich and Chih-Ling Tsai. The impact of model selection on inference in linear regression. The American Statistician, 44(3):214–217, 1990.
[17] Michael J. Kearns. Efficient noise-tolerant learning from statistical queries. In STOC, 1993.
[18] Benedikt M. Pötscher. Effects of model selection on inference. Econometric Theory, 1991.
[19] Daniel Russo and James Zou. Controlling bias in adaptive data analysis using information theory. In AISTATS, 2016.
[20] Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness of preventing false discovery. In COLT, 2015.
[21] Thomas Steinke and Jonathan Ullman. Tight lower bounds for differentially private selection. In FOCS, 2017.
[22] Gábor Tardos. Optimal probabilistic fingerprint codes. J. ACM, 55(2), 2008.
[23] Jonathan Taylor and Robert J. Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
[24] Yu-Xiang Wang. New Paradigms and Optimality Guarantees in Statistical Learning and Estimation. PhD thesis, Carnegie Mellon University, 2017.