# natural_analysts_in_adaptive_data_analysis__da49aad3.pdf

Natural Analysts in Adaptive Data Analysis

Tijana Zrnic 1 Moritz Hardt 1

Adaptive data analysis is frequently criticized for its pessimistic generalization guarantees. The source of these pessimistic bounds is a model that permits arbitrary, possibly adversarial analysts that optimally use information to bias results. While being a central issue in the ﬁeld, still lacking are notions of natural analysts that allow for more optimistic bounds faithful to the reality that typical analysts aren t adversarial. In this work, we propose notions of natural analysts that smoothly interpolate between the optimal non-adaptive bounds and the best-known adaptive generalization bounds. To accomplish this, we model the analyst s knowledge as evolving according to the rules of an unknown dynamical system that takes in revealed information and outputs new statistical queries to the data. This allows us to restrict the analyst through different natural control-theoretic notions. One such notion corresponds to a recency bias, formalizing an inability to arbitrarily use distant information. Another complementary notion formalizes an anchoring bias, a tendency to weight initial information more strongly. Both notions come with quantitative parameters that smoothly interpolate between the non-adaptive case and the fully adaptive case, allowing for a rich spectrum of intermediate analysts that are neither non-adaptive nor adversarial. Natural not only from a cognitive perspective, we show that our notions also capture standard optimization methods, like gradient descent in various settings. This gives a new interpretation to the fact that gradient descent tends to overﬁt much less than its adaptive nature might suggest.

1Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, USA. Correspondence to: Tijana Zrnic <tijana@eecs.berkeley.edu>.

Proceedings of the 36 th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

1. Introduction

Modern data analysis is usually adaptive in the sense that past analyses shape future analyses. This practice offers power and ﬂexibility to data science at the cost of a greater potential for spurious results. The issue is now well recognized in multiple communities. The problem of inference after selection is an active research area in statistics, while computer science has developed an area known as adaptive data analysis.

The statistical community has focused on analyzing concrete two-step procedures, such as, variable selection followed by a signiﬁcance test on the chosen variables (Fithian et al., 2014; Belloni et al., 2014). This approach leads to precise insight into some concrete procedures, but it does not capture the workﬂow of typical analysts that proceed in more than two steps.

Computer scientists took an alternative route by focusing on a powerful statistical query model that in principle captures all sorts of different analyses involving many adaptive steps. In this model, an analyst interacts with a data set through a primitive called statistical queries. In each round, the analyst can evaluate one statistical query on the data. Future statistical queries may depend arbitrarily on the revealed transcript of past queries and query results. This level of generality comes at the cost of diminished generalization ability.

To review what s known, the generalization error on t nonadaptively chosen statistical queries on a data set of size n is on the order of O( p

log(t)/n), as follows from Hoeffding s bound. In the fully adaptive model, Hoeffding s bound would only give a rate of O( p

t/n). This disappointing bound coincides with the naive strategy of splitting the data set into t chunks, each of size n/t and using one chunk for each query. Noise addition techniques combined with the mature technical repertoire of differential privacy yield a better bound of O(t1/4/ n) (Bassily et al., 2016). However, this bound still features a polynomial dependence on the number of queries t that has resisted improvement for years. Negative results suggest that it might in fact be computationally hard to improve on this bound (Hardt & Ullman, 2014; Ullman et al., 2018).

For years, the knee jerk reaction to such pessimistic bounds

Natural Analysts in Adaptive Data Analysis

has been to point out that natural analysts aren t adversarial. However, it has proved challenging to formalize what makes natural analysts more benign than the worst-case bounds suggest. Indeed, to date there is still no comprehensive proposal for a class of analysts that allows for interesting intermediate points between the fully adaptive and nonadaptive case.

1.1. Our Contributions

In this work, we tackle the central conceptual challenge of formalizing classes of natural analysts using ideas from dynamical systems theory. Speciﬁcally, we model the analyst s knowledge as evolving according to the rules of an unknown dynamical system in discrete time. The system takes in query results at at each step and maintains a hidden state ht at time t. Based on its hidden state, the system chooses a new query qt = ft(ht) as a function of the hidden state (that may vary with time) and updates its hidden state ht+1 = ψt(ht, at) according to a state-transition map ψt that is allowed to vary with time. It is clear that we can recover the non-adaptive case by forcing the hidden state to be constant at all steps, whereas the fully adaptive case corresponds to an unrestricted hidden state and state transition rule.

What is interesting is that this dynamical perspective allows us to restrict the analyst in natural ways, which we show lead to interesting trade-offs. These restrictions simultaneously correspond to natural control-theoretic notions, subsume common optimization procedures, and can be seen as formalizing well-known cognitive biases. We focus on two complementary notions of natural analysts that we call progressive and conservative.

Progressive analysts. Progressive analysts, intuitively speaking, have a recency bias and weight recent information more strongly than information received far into the past. We can think of a discount factor λ (0, 1) by which the analyst downweights past observations. Formally, we call an analyst λ-progressive if the state transition map is contractive1: ψt(h, a) ψt(h , a) λ h h .

To gain intuition, in the case of a linear state-transition map ht+1 = Aht + Bat, this requirement corresponds to the condition A op λ, where op denotes the operator norm. In control-theoretic terms, this requirement expresses that the system is stable. Trajectories cannot blow up under repeated application of the state transition map. We show that this control-theoretic stability has a strong regularizing effect.

Theorem 1 (Informal result for progressive analysts). There is a computationally efﬁcient algorithm to an-

1From here forward, we will assume denotes some ℓp-norm, p 1.

swer t statistical queries chosen adaptively by a λprogressive analyst so that the error on each query is at most O p

K(λ)dq log(t)/n , where K(λ) =

O log(1/(1 λ))

log(1/λ) , dq is the dimension of the queries, and n is the size of the data set.

Since Theorem 1 allows queries of arbitrary dimension, dq can also be thought of as the number of parallel statistical queries in one round, making the total number of one-dimensional queries after t rounds equal to tdq. With this in mind, we can see that for λ = 1 1/t, the bound reduces to the adaptive Hoeffding bound O( p

tdq/n) (by a ﬁrst-order Taylor approximation). For any constant λ bounded away from 1, we recover the non-adaptive bound. The proof of this result combines a simple compression argument with recent ideas in the context of recurrent neural networks (Miller & Hardt, 2019).

We could hope that as λ approaches 1 we not only recover the Hoeffding bound but rather the best known adaptive bounds that follow from differential privacy techniques. While this turns out to be difﬁcult for progressive analysts for reasons we elaborate on later, we can indeed achieve this better trade-off for our second notion.

Conservative analysts. Conservative analysts favor initial information over new information in their decision making. Intuitively, this can be seen as a kind of anchoring bias. One of the ways we can express this is by requiring that the state-transition map gets increasingly Lipschitz in its second argument over time:

ψt(h, a) ψt(h, a ) η ψt 1(h, a) ψt 1(h, a ) ,

for some η (0, 1). We call analysts satisfying this requirement ηt-conservative, leading to the following result.

Theorem 2 (Informal result for conservative analysts, special case). There is a computationally efﬁcient algorithm to answer t statistical queries chosen adaptively by a ηt-conservative analyst so that the error on each query is at most O (K(ηt)dq log(t))1/4/ n , where K(ηt) =

O 1 log(1/η) , dq is the dimension of the queries, and n is the size of the data set.

Contrary to progressive analysts, if η = 1 1/t, the bound reduces to a multi-dimensional generalization of the hardto-improve generalization bound O((tdq)1/4/ n).

1.2. Proof Technique Overview

The main technical tool used in our generalization proofs is an algorithmic abstraction called the truncated analyst. For both progressive and conservative analysts, we design their respective truncated counterpart, which acts according to the

Natural Analysts in Adaptive Data Analysis

same dynamics ψt. By construction, the truncated analyst has a time-independent number of rounds of adaptivity. We will also refer to the true analyst as the full analyst, to contrast it with the corresponding truncated abstraction.

We ﬁrst derive a natural conclusion stating that truncated analysts have time-independent generalization properties. Then, we show that, for a large enough level of truncation (which is still time-independent), the truncated analyst closely approximates the full one. This observation will enable us to claim that the full analyst, which is either progressive or conservative, inherits the generalization properties of its corresponding truncated version. One of the conclusions we will derive from here is the following: setting the parameters of progressiveness or conservatism to be constant with respect to the number of interactions yields a generalization error that scales only logarithmically with the number of queries.

2. Analysts as Dynamical Systems

2.1. Problem Setting

Let S := {X1, . . . , Xn} Dn be a data set of n i.i.d. samples from a distribution P supported on D. On one side, there is a data analyst, who initially has no information about the drawn samples in S. On the other side there is a statistical mechanism with access to S, however with no knowledge of its true underlying distribution. At each time step t N, the analyst and statistical mechanism have an interaction: the analyst asks a statistical query qt Q, and the statistical mechanism responds with an answer at A. In the adaptive data analysis literature, statistical queries are typically deﬁned as one-dimensional bounded functions, however in this work we generalize this deﬁnition to allow bounded functions in higher dimensions. The motivation for this is that many common procedures query a vector of values; for example, gradient descent queries a gradient of the loss at the current point. Formally, we deﬁne statistical queries as functions of the form qt : D [0, 1]dq. In this generalized setting, a single query qt is equivalent to a set of dq one-dimensional queries. It is only natural to assume that the dimension of answers matches that of the posed queries, and hence we take A Rdq.

Before deciding on qt, the analyst takes into account the previous interactions with the statistical mechanism, typically called the transcript. In classical work on adaptive data analysis, the transcript at time t consists of all query-answer pairs thus far, (q1, a1, . . . , qt 1, at 1). Recall that, in this work, the analyst only has access to the transcript through its hidden state, or history, ht H Rd, acting according to the recursion:

ht = ψt(ht 1, at 1), (1)

where we initialize h0 = 0. The variable ht serves as a possibly lossy encoding of the knowledge the analyst has gathered about data S up to time t. Based on this encoding, the analyst picks the next query qt Q:

qt = ft(ht), (2)

where ft : H Q is an arbitrary measurable function.

The goal of designing a statistical mechanism is to have the analyst learn about the distribution P, and not just the samples in S. Mathematically, we want the generalization error max 1 i t ai EX P[qi(X)]

to be small with high probability, for any given number of rounds t. The difﬁculty is this task lies in the fact that the statistical mechanism does not have access to P. It might seem intuitive to set at = qt(S) := 1

n Pn i=1 qi(Xi). However, in general, this standard choice quickly leads to overﬁtting (see the paper (Blum & Hardt, 2015) for an example attack).

A better solution stems from a connection with privacypreserving data analysis. In particular, it has been shown that good sample accuracy combined with differential privacy ensures small generalization error (Dwork et al., 2015b; Bassily et al., 2016; Dwork et al., 2015a).

We say that a possibly randomized function F : Dn Y Rd is (α, β)-differentially private for some α, β 0, if for all data sets S, S Dn, such that S and S differ in at most one entry, it holds that:

P(F(S) O) eαP(F(S ) O) + β,

for any event O. We will extensively rely on some of the well-known properties of differential privacy that we collect in the supplementary materials.

A possibly randomized function M : Dn Q Y, where Y Rd, is (ϵ, δ)-sample accurate if for every data set S = {X1, . . . , Xn} Dn and every query q Q, where q : D Y, it holds that:

i=1 q(Xi) M(S, q) ϵ) δ.

Applying these deﬁnitions to the problem of adaptive data analysis, we simultaneously want max1 i t ai qi(S) to be small, and ai to be constructed in a differentially private manner, thus obscuring the exact value of qi(S). We will show that, with an appropriate choice of a differentially private mechanism, these two conditions will result in favorable generalization properties in our setting.

Our analysis will primarily make use of Gaussian noise addition, as it achieves the hard-to-improve rate of O(t1/4/ n),

Natural Analysts in Adaptive Data Analysis

in the one-dimensional statistical query model. We will use ξt to denote a generic Gaussian noise vector; with this, the classical Gaussian mechanism is given by at = qt(S) + ξt, where ξt is zero-mean noise of dq independent Gaussians with appropriately chosen variance.

The main idea for preventing adversarial behavior of analysts will be some form of contraction characterizing the state-transition map sequence {ψt}. This approach requires a way to translate closeness in norm into a form of information-theoretic closeness. In general, however, if two different analysts have histories h1 t and h2 t, such that h1 t h2 t ϵ for some very small ϵ > 0, it is impossible to say whether their knowledge of S is indeed ϵ-close . For this reason, we introduce the assumption that H is a discrete grid in Rd with coordinate-wise resolution > 0, where is sufﬁciently small. Mathematically, if h = (h1, . . . , hd) H, then hi = ki , for some ki Z. This way, if two histories are close enough in norm, they have to be semantically identical. This condition is satisﬁed by a great majority of real-world data analysts. First, all transcripts generated by numerical algorithms are memorized in computers using ﬁnite-bit precision. Second, human analysts typically use only the ﬁrst few digits after the decimal of any performed numerical evaluation. It is also worth pointing out that all generalization results obtained for the set H also hold for all uniformly discrete sets which have a packing radius at least .

3. Progressive Analysts: Motivation and Generalization

The ﬁrst class of analysts is oblivious in that its knowledge of past events diminishes over time. We will aptly refer to such data analysts as progressive. Deﬁnition 1 (Progressive analyst). An adaptive analyst is λ-progressive if the maps {ψt} are λ-contractive in their ﬁrst argument; for every h, h H and a A, ψt satisﬁes:

ψt(h, a) ψt(h , a) λ h h ,

for some λ [0, 1]. Additionally, we require ψt(h, ) to be L-Lipschitz for any ﬁxed h H; that is, for all a, a A, ad some L 0:

ψt(h, a) ψt(h, a ) L a a .

Without loss of generality, we also assume that the maps {ψt} are normalized to satisfy ψt(0, 0) = 0. This does not limit their expressiveness.

We now motivate the deﬁnition of progressive analysts via three examples, before proving our main generalization bound for this class of analysts.

It is well-known that humans exhibit numerous cognitive biases while performing analytical tasks. One well-known

bias is the recency bias (Cheadle et al., 2014). This bias is deﬁned as a tendency to focus more on recent evidence than the history. We can think of recency bias as a motivating analogy for our deﬁnition of progressive. In our deﬁnition, the parameter λ determines how fast prior information are forgotten. The case λ = 0 corresponds to full recency bias and virtually no adaptivity in query formulation, while λ = 1 implies no recency bias and arbitrarily adaptive queries.

As another, contrasting example, iterative algorithms which interact with a ﬁxed data set can also be thought of as adaptive analysts. Suppose that S contains simulation samples of an agent interacting with a stochastic environment, which returns noisy rewards from an unknown distribution and has known random transitions between a possibly large number of states. This problem can be modeled as a classical Markov decision process (Bertsekas, 2005). Suppose that the analyst wishes to deﬁne a set of d states, possibly by grouping the existing elementary states, such that the value function, which is the expected reward-to-go under the optimal policy, satisﬁes some criterion: for example, one objective could be maximizing the value function in one of the states of the model. First, the analyst initializes the set of states to some arbitrary set of ﬁxed size d. Then, they recurse their hidden state, whose coordinates i d are updated as:

ht,i = sup a (rt(i, a) + γ

j=1 P(i, a, j)ht 1,j), (3)

where the supremum is taken over the possible actions, γ (0, 1) is a discount factor, P(i, a, j) is the probability of landing in state j after taking action a in state i, and rt(i, a) is the estimated average reward of taking action a in state i. Equation (3) is called the Bellman equation, and the algorithm given by repeated iterations of this equation is called value iteration (Bellman, 1957), as it is used to ﬁnd the value function. For example, if every sample Xk S is vector containing the initial state, action, reward, and subsequent state, (s1,k, ak, rk, s2,k), then the estimated reward is given by rt(i, a) = Pn k=1 rk1{s1,k = i, ak = a}/ Pn k=1 1{s1,k = i, ak = a}. The analyst s queries are therefore asking for the reward estimates across all states and all actions. After running the Bellman update for a certain number of rounds, the analyst can now adaptively change the set of states, using the previously learned value of ht for initialization. Since the Bellman equation contracts ht by factor γ in ℓ -norm, such an analyst would be γ-progressive. The Bellman equation is at the core of numerous dynamic programs, thus making many algorithmic solvers of such problems progressive analysts.

Stable recurrent neural networks are another algorithmic example of progressive analysts. Recurrent neural networks

Natural Analysts in Adaptive Data Analysis

are given by the update:

ht = ρ(Wht 1 + Uat 1),

where U Rdq d, W Rd d, and ρ is a point-wise non-linearity. The variable at is the empirical answer to an arbitrary query based on ht. In this case, the analyst is λ-progressive if W op 1/Lρ, where Lρ is the Lipschitz constant of the map ρ. For a detailed treatment of this case, see the paper (Miller & Hardt, 2019). The work also shows how other stateful models, such as LSTMs, can be made stable and how stable models perform well in practice.

Now we argue that the parametrization of progressive analysts allows interpolation between that of non-adaptive analysts and fully adaptive analysts. Then, we move on to proving the generalization error in regimes between these two extremes.

First, consider L = 0. In this case, ht has no sensitivity to the answers of the statistical mechanism, so queries are trivially non-adaptive.

On the other end, λ = 1 allows full adaptivity, for any L > 0. To see this, imagine that ht is an inﬁnite-dimensional vector2, where each coordinate is initially 0, and coordinatewise, ht can take values in LA. At time t, simply set the coordinates (t 1)dq +1 through tdq of ht to Lat 1. Since all queries are computed via a deterministic function of the current history, which is composed by stacking the answers, the vector of all answers encodes the whole transcript in a lossless fashion. Consequently, this analyst is fully adaptive. One can easily verify that the described transition maps satisfy the conditions of Deﬁnition 1 with λ = 1.

Since these two extreme cases reduce to generalization rates which are known from prior work, in the rest of this section we focus on the parameter set λ [0, 1), L > 0.

3.1. Truncated Analyst

Now we introduce a useful counterpart of a λ-progressive analyst, who only has access to the last k answers of the full analyst, for some constant k. This truncated analyst will be the main abstraction used in the proofs of this section.

Deﬁne the truncated analyst corresponding to a full progressive analyst as:

hk t = ψt(hk t 1, at 1), hk t j = 0, j k,

qk t = ft(hk t ),

for ﬁxed k N. The truncated analyst updates their history according to the same map sequence as the full analyst, and receives exactly k answers of the full analyst.

2This can be formalized in the framework of separable Hilbert spaces, however this example is only intended to be illustrative.

First we show that, as aligned with intuition, each query of the truncated analyst has a time-independent generalization error. Lemma 1. Let hk t be the history of a truncated analyst, and let the range of answers be of size Adq, where A is polynomial in n. Then, at time t, the query qk t asked by the truncated analysts satisﬁes the following:

P( qk t (S) EX P[qk t (X)] > ϵ) 2dq exp(kdq log A 2nϵ2).

Now we show that contractiveness implied by the progressivenes condition forces the full analyst to be close in norm to its corresponding truncated version.

Lemma 2. Let at [0, 1]dq for all t N. For any k N, the progressive analyst and the corresponding truncated analyst satisfy hk t ht λk LC1

1 λ , where C1 := (1, . . . , 1) is the norm of the dq-dimensional all-ones vector.

3.2. Generalization via Compression

For a large enough level of truncation k, which depends on the radius of the set of all possible histories H, the truncated analyst and the full analyst are identical. This level of truncation is time-independent, and hence, by Lemma 1, progressive analysts also have a time-independent scaling of the generalization error.

Theorem 1. Answering t queries chosen adaptively by a λ-progressive analyst by rounding the empirical answer to O(1/n) precision achieves overall generalization error at most O( p

K(λ)dq log(t)/n), where K(λ) =

log( LC1 (1 λ)λ ) log(1/λ) .

In other words, having n = O(K(λ)dq log(t)/ϵ2) samples sufﬁces to guarantee ϵ-generalization error with high probability.

Let λ = 1 1

t . Then, by the ﬁrst-order Taylor approximation, log(1/λ) 1

t , and hence the generalization error of Theorem 1 grows as O( p

tdq/n). The same scaling of generalization error is achieved by fully adaptive analysts in the case of dq-dimensional queries, when there is no use of privacy mechanisms. As argued earlier, λ = 1 corresponds to full adaptivity, so it comes as no surprise that the same rate is achieved.

Note also that the generalization error is completely independent of the dimension of the history d. This justiﬁes our inﬁnite-dimensional example earlier in this section.

3.3. Limitations

Differential privacy. It is natural to wonder why we never used differential privacy to prevent progressive analysts from overﬁtting. In the proof of Theorem 1, we allow the

Natural Analysts in Adaptive Data Analysis

statistical mechanism to return unobscured empirical answers (up to a small rounding error), although we initially argued that differentially private perturbations provide a quadratic improvement.

The main reason is that the truncated analyst does not in general have a time-independent composition of differential privacy, in spite of the fact that the number of observed answers is time-independent. This follows from the observation that it receives answers of the full analyst, whose uncertainty grows with time. On a high level, changing one element in the data set S allows minor changes in the history in each step, even if differential privacy is used. After t k steps, for some k N, these changes might pile up to lead to a completely different query than the one that resulted from the original data set S. The initial input from S of the truncated analyst is the answer to this query, which is highly unstable for large enough t. Therefore, claiming time-independent generalization, if possible, would require a novel framework for designing mechanisms for adaptive data analysis, one that does not rely on differential privacy.

Naive deﬁnition. Stepping away from the dynamical systems perspective for a moment, one might argue that a simple way to smoothly interpolate between no adaptivity and full adaptivity through recency bias is to truncate the analyst s view of the transcript. More formally, deﬁne q K t = g K t (q K t 1, a K t 1, . . . , q K t K, a K t K), for some ﬁxed K N and functions {g K t }. The input to g K t consists of the last K query-response pairs. This seems to be in contrast with the usual adaptive query construction qt = gt(qt 1, at 1, . . . , q1, a1), for some {gt}; here, the argument of gt is all query-response pairs so far. However, we claim that this intuitive construction does not necessarily rule out full adaptivity.

Claim 1. Suppose that an adaptive data analyst has a truncated view of the transcript with truncation depth K. In full generality, this analyst generalizes no better than an analyst with a full view of the transcript, regardless of the mechanism for constructing responses and value of K.

4. Conservative Analysts, Type A: Motivation and Generalization

The second main class of natural analysts operates in a manner opposite to progressive analysts; namely, these discount new evidence increasingly with time, making their knowledge saturate. We will call such analysts conservative.

We consider two possible causes for saturation. Either the maps {ψt} become less sensitive to new evidence, or the queries {qt} are chosen in such a way that the values {qt(S)} saturate. This distinction leads to two notions of conservative analysts. Type A conservatives and type B conservatives. We will see that each correspond to natural

algorithms.

Below we deﬁne type A conservative analysts, while we leave the deﬁnition of type B for the following section.

Deﬁnition 2 (Conservative analyst, type A). An adaptive analyst is type A ηt-conservative if the maps {ψt(h, )} are ηt-Lipschitz, where limt ηt = 0. Mathematically, this corresponds to:

ψt(h, a) ψt(h, a ) ηt a a ,

for every h H and a, a A.

The construction of conservative analysts is primarily motivated by gradient descent in various settings in which it experiences saturation. As in the case of progressive analysts, however, there is also a connection between human data analysts and our deﬁnition of conservative analysts.

A common cognitive bias that humans experience in analytical tasks is called the anchoring bias (Campbell & Sharpe, 2009; Cen et al., 2013). It is characterized by relying heavily on initial evidence, and becoming decreasingly sensitive to new evidence, as mathematically formulated in Deﬁnition 2. The sequence {ηt} in the deﬁnition of conservative analysts can be thought of as the strength of one s anchoring phenomenon. In particular, ηt = 0 for all t N implies complete anchoring and no adaptivity in formulating queries, while a slow decrease in ηt represents analysts with a mild anchoring effect.

From the algorithmic perspective, examples of conservative analysts include optimization algorithms with decaying step size. Consider the problem of empirical risk minimization using gradient descent. In particular, let the loss be:

i=1 ℓ(h; Xi),

where h Rd is a vector of weights for the given optimization model, and ℓ(h; Xi) is the loss incurred by this model on sample Xi S. The well-known gradient descent update is the following:

ht+1 = ψt+1(ht, h L(ht)) = ht ηt h L(ht),

where h L(ht) is the gradient of the loss on data S at point ht, and ηt is a time-dependent, decreasing step. Notice that this gradient decomposes as h L(ht) = 1 n Pn i=1 hℓ(ht; Xi). Therefore, gradient descent for empirical risk minimization is an ηt-conservative analyst, whose queries are equal to the gradient of the loss incurred at each point of S at the current weight iterate.

The rate of step size decay determines the rate of saturation of the analyst, allowing the class of conservative analysts to cover a wide spectrum of gradient-based optimization

Natural Analysts in Adaptive Data Analysis

algorithms. Notable examples of step size decays include ηt = O(1/tα), where α (0.5, 1] (Robbins & Monro, 1951), or the more recent schemes which cut the learning rate by a constant factor in every so-called epoch, which implies an essentially exponential decay (Hazan & Kale, 2014; Ge et al., 2018).

4.1. Truncated Analyst

As its name suggests, the adaptiveness of a conservative analyst essentially saturates after some number of rounds of interaction with the data set. Again, we prove this via truncation of the full analyst. Let the truncated analyst corresponding to the full conservative analyst be the following:

hk t = ψt(hk t 1, ak t 1),

where ak t = at, t k and ak t = 0, t > k,

qk t = ft(hk t ).

In words, the truncated analyst only sees the ﬁrst k true answers and deterministically sets the second input to 0 for the remaining t k rounds. Lemma 3. Assume that the answers of the statistical mechanism are bounded to [0, 1]dq, and let C1 := (1, . . . , 1) denote the norm of the dq-dimensional all-ones vector. Then, for K(ηt) := min{t : ηt < C1 }, the history of the full analyst matches the history of the truncated analyst with truncation depth K(ηt), ht = h K(ηt) t .

Since the approximating truncated analyst in this setting sees the ﬁrst k answers, instead of the last k, the privacy parameters of its history degrade gracefully with k, given that the statistical mechanism is differentially private. The Gaussian mechanism is, however, not bounded, as required by Lemma 3. For this reason, we introduce a slight modiﬁcation of this mechanism, where the answers are computed as at = [qt(S) + ξt][0,1]dq , where [ ][0,1]dq is truncation to the box [0, 1]dq:

([x][0,1]dq )i =

0, if xi 0, xi, if xi 0, 1, if xi 1,

where subscript i denotes the i-th coordinate. As before, ξt is dq-dimensional Gaussian noise. By post-processing of differential privacy, this truncated mechanism preserves the parameters of differential privacy of the Gaussian mechanism, determined by the variance of ξt.

The next lemma formalizes the gradual degradation of differential privacy of the truncated analyst s history. Lemma 4. Let hk t be the history of the truncated analyst at time t N, and let the statistical mechanism be (α, β)- differentially private. Then, hk t is ( p

2k log(1/β )α + 2kα2, kβ + β )-differentially private.

4.2. Generalization via Differential Privacy

Since we proved that type A conservative analysts have the same history as their corresponding truncated analyst, for a large enough level of truncation, and that truncated analysts have a time-independent composition of differential privacy, we can conclude a time-independent composition of privacy for the full analyst as well.

Proposition 1. Let ht be the hidden state of an oblivious analyst at time t. Let the statistical mechanism answering queries be (α, β)-differentially private. Then, for arbitrarily large t, ht is ( p

2K(ηt) log(1/β )α+2K(ηt)α2, K(ηt)β+ β )-differentially private, where K(ηt) := min{t : ηt C1 }.

To prove the generalization error of conservative analysts, we turn to the main transfer theorem of Bassily et al. (2016), which is the main technical tool used to establish the celebrated rate of O((tdq)1/4/ n). This transfer theorem will allow us to compute the generalization error by balancing out the sample accuracy and differential privacy parameters of the Gaussian mechanism.

Theorem 2. There is a computationally efﬁcient mechanism to answer t queries chosen adaptively by a type A ηt-conservative analyst so that the overall generalization error is at most O((K(ηt)dq log(t))1/4/ n), where K(ηt) := min{t : ηt C1 } and C1 = (1, . . . , 1) is the norm of the dq-dimensional all-ones vector.

Said in terms of sample complexity, it sufﬁces to have n = O( p

K(ηt)dq log(t)/ϵ2) samples for ϵ-generalization error. Notice that, just like for progressive analysts, there is no direct dependence on the dimension of the history.

As an example, taking ηt = ηt, where η = 1 1/t, results in error rate O((tdq)1/4/ n) by the ﬁrst-order Taylor approximation. As expected, this is the tight rate for fully adaptive queries under differential privacy.

5. Conservative Analysts, Type B: Motivation and Generalization

In this section we deﬁne and analyze linear analysts whose histories saturate despite non-decreasing sensitivity to revealed information about S.

Deﬁnition 3 (Conservative analyst, type B). An adaptive analyst is type B λ-conservative if, ﬁrst, it contracts when given empirical answers:

ψt(ht 1, qt 1(S)) ψt(h t 1, q t 1(S)) λ ht 1 h t 1 ,

for some λ [0, 1], where qt = ft(ht) and q t = ft(h t). Second, we require the analyst to be linear:

ht = ψt(ht 1, at 1) = Atht 1 + Btat 1,

Natural Analysts in Adaptive Data Analysis

for some sequences {At}, {Bt}, where Ai Rd d, with Ai op 1, and Bi Rd dq for all i N.

The motivation for type B conservative analysts comes from the observation that gradient-based methods sometimes saturate even if there is no step decay.

Consider again the problem of empirical risk minimization using gradient descent. In this setting, let the gradient descent update have a constant step size η > 0:

ht+1 = ψt+1(ht, h L(ht)) = ht η h L(ht),

where h L(ht) = 1

n Pn i=1 ℓ(ht; Xi) is again the gradient of the loss on data S. If the loss is β-smooth and µ-strongly convex, and the step size is η 2 β+µ, then the gradient

descent update is type B λ-conservative, where λ = 1 ηβµ

β+µ (Hardt et al., 2016). If the objective is not strongly convex, however is still smooth, gradient descent is non-expansive, meaning it has contraction parameter λ = 1. In that case, one can induce contractiveness in many ways; one is to add an ℓ2-regularizer to the objective, that is transform the loss into Lreg(h) = L(h) + µ

2 h 2, for some µ > β.

5.1. Truncated Analyst

As in the previous section, due to saturation of conservative analysts, we will deﬁne a truncated analyst that has access to k responses of the statistical mechanism. In this case, however, the interaction happens in the last k rounds.

Suppose that the statistical mechanism is the usual Gaussian mechanism. Worth mentioning is that this time we deploy no truncation. Denote by ξt the noise variable added to the empirical answer at time t. For ﬁxed k, deﬁne the truncated analyst corresponding to a type B conservative analyst as:

hk t = ψt(hk t 1, ak t 1), hk t j = 0, j k,

where ak t = qk t (S) + ξt,

qk t = ft(hk t ),

In this setting, we assume that H is a norm-ball with radius D, where D is large enough with respect to Pt i=1 Bi op C1, so that there is no need for projecting the norm of the current history iterates to D. Since this escaping event happens with negligible probability, in all subsequent arguments for simplicity we treat it as being of measure zero. First we establish closeness between the full analyst and the truncated version. Lemma 5. Suppose that the statistical mechanism is the Gaussian mechanism. For any k N, the truncated analyst with truncation depth k and the full analyst satisfy ht hk t λk D.

Additionally, we show that the truncated analyst has a composition of differential privacy which only depends on the

truncation depth. Lemma 6. Let hk t be the history of a truncated analyst corresponding to a type B conservative analyst, and let the statistical mechanism be (α, β)-differentially private. Then, for all t N and β > 0, hk t is ( p

2k log(1/β )α + 2kα2, kβ + β )-differentially private.

5.2. Generalization via Differential Privacy

Lemma 5 allows us to ﬁnd the effective memory of a conservative analyst, resulting in the following time-independent composition of differential privacy parameters. Proposition 2. Let ht be the history of a type B conservative analyst at time t. Let the statistical mechanism answering queries be the Gaussian mechanism, such that the answers are (α, β)-differentially private. Then, for arbitrarily large t, ht is ( p

2K(λ) log(1/β )α + 2K(λ)α2, K(λ)β + β )- differentially private, where K(λ) := log(D/ λ)

The main transfer theorem of Bassily et al. (2016) will now show that the generalization error of type B conservative analysts is essentially the same as for type A conservative analysts, justifying their uniﬁcation into one broader class. Theorem 3. There is a computationally efﬁcient mechanism to answer t queries chosen adaptively by a type B λ-conservative analyst so that the overall generalization error is at most O((K(λ)dq log(t))1/4/ n), where K(λ) := log(D/ λ)

In other words, n = O( p

K(λ)dq log(t)/ϵ2) samples sufﬁce for ϵ-generalization error. Moreover, under a few additional commonly satisﬁed assumptions, this sample complexity holds for much more general sets H, which need not be discrete. This essentially follows by linearity of maps ψt. We prove this claim, as well as negative results for progressive and type A conservative analysts under continuous sets H, in the supplementary material.

We introduce progressive and conservative analysts by modeling the evolution of their knowledge using different control-theoretic constraints. In addition to serving as mathematical analogies of human cognitive biases, these categories also capture various iterative algorithms, like value iteration or gradient descent. The natural analysts we deﬁne achieve generalization error essentially independent of the number of queries, in stark contrast with arbitrary adversarial analysts whose error scales polynomially. In doing so, we combine control-theoretic notions of stability with the algorithmic stability notions underpinning adaptive generalization bounds. The connection between control-theoretic and algorithmic stability for the sake of proving stronger generalization bounds is worth studying further.

Natural Analysts in Adaptive Data Analysis

Acknowledgements

The authors thank John Miller for his careful reading of a draft of this paper and constructive feedback provided.

Bassily, R., Nissim, K., Smith, A., Steinke, T., Stemmer, U., and Ullman, J. Algorithmic stability for adaptive data analysis. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pp. 1046 1059. ACM, 2016.

Bellman, R. E. Dynamic programming. Princeton University Press, 1957.

Belloni, A., Chernozhukov, V., and Hansen, C. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608 650, 2014.

Bertsekas, D. P. Dynamic programming and optimal control, volume 1. Athena scientiﬁc Belmont, MA, 2005.

Blum, A. and Hardt, M. The ladder: A reliable leaderboard for machine learning competitions. In International Conference on Machine Learning, pp. 1006 1014, 2015.

Campbell, S. D. and Sharpe, S. A. Anchoring bias in consensus forecasts and its effect on market prices. Journal of Financial and Quantitative Analysis, 44(2):369 390, 2009.

Cen, L., Hilary, G., and Wei, K. J. The role of anchoring bias in the equity market: Evidence from analysts earnings forecasts and stock returns. Journal of Financial and Quantitative Analysis, 48(1):47 76, 2013.

Cheadle, S., Wyart, V., Tsetsos, K., Myers, N., De Gardelle, V., Casta n on, S. H., and Summerﬁeld, C. Adaptive gain control during human perceptual choice. Neuron, 81(6): 1429 1441, 2014.

Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. Generalization in adaptive data analysis and holdout reuse. In Advances in Neural Information Processing Systems, pp. 2350 2358, 2015a.

Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. L. Preserving statistical validity in adaptive data analysis. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pp. 117 126. ACM, 2015b.

Fithian, W., Sun, D., and Taylor, J. Optimal inference after model selection. ar Xiv preprint ar Xiv:1410.2597, 2014.

Ge, R., Kakade, S. M., Kidambi, R., and Netrapalli, P. Rethinking learning rate schedules for stochastic optimization. 2018.

Hardt, M. and Ullman, J. Preventing false discovery in interactive data analysis is hard. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pp. 454 463. IEEE, 2014.

Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225 1234, 2016.

Hazan, E. and Kale, S. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489 2512, 2014.

Miller, J. and Hardt, M. Stable recurrent models. In International Conference on Learning Representations (to appear), 2019.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3): 400 407, 1951.

Ullman, J., Smith, A., Nissim, K., Stemmer, U., and Steinke, T. The limits of post-selection generalization. In Advances in Neural Information Processing Systems, pp. 6402 6411, 2018.