Confidence sequences for sampling without replacement

Ian Waudby-Smith¹ and Aaditya Ramdas¹²
Departments of Statistics¹ and Machine Learning²
Carnegie Mellon University
{ianws, aramdas}@cmu.edu

Abstract. Many practical tasks involve sampling sequentially without replacement (WoR) from a finite population of size $N$, in an attempt to estimate some parameter $\theta^\star$. Accurately quantifying uncertainty throughout this process is a nontrivial task, but it is necessary because it often determines when we stop collecting samples and confidently report a result. We present a suite of tools for designing confidence sequences (CSs) for $\theta^\star$. A CS is a sequence of confidence sets $(C_n)_{n=1}^N$ that shrink in size and all contain $\theta^\star$ simultaneously with high probability. We present a generic approach to constructing a frequentist CS using Bayesian tools, based on the fact that the ratio of a prior to the posterior at the ground truth is a martingale. We then present Hoeffding- and empirical-Bernstein-type time-uniform CSs and fixed-time confidence intervals for sampling WoR, which improve on previous bounds in the literature and explicitly quantify the benefit of WoR sampling.

1 Introduction

When data are collected sequentially rather than in a single batch with a fixed sample size, many classical statistical tools cannot naively be used to calculate uncertainty as more data become available; doing so can quickly lead to overconfident and incorrect results (informally, "peeking" or "p-hacking"). For these kinds of situations, the analyst would ideally have access to procedures that allow them to:

(a) efficiently calculate tight confidence intervals whenever new data become available;
(b) track the intervals, and use them to decide whether to continue sampling, or when to stop;
(c) have valid confidence intervals (or p-values) at arbitrary data-dependent stopping times.

The desire for methods satisfying (a), (b), and (c) led to the development of confidence sequences (CSs): sequences of confidence sets which are uniformly valid over a given time horizon $\mathcal{T}$. Formally, a sequence of sets $\{C_t\}_{t \in \mathcal{T}}$ is a $(1-\alpha)$-CS for some parameter $\theta^\star$ if

$\Pr(\forall t \in \mathcal{T},\ \theta^\star \in C_t) \ge 1 - \alpha$, or equivalently, $\Pr(\exists t \in \mathcal{T} : \theta^\star \notin C_t) \le \alpha$.  (1.1)

Critically, (1.1) holds iff $\Pr(\theta^\star \notin C_\tau) \le \alpha$ for arbitrary stopping times $\tau$ [1], yielding property (c). The foundations of CSs were laid by Robbins, Darling, Siegmund and Lai [2, 3, 4, 5]. The multi-armed bandit literature sometimes calls them "anytime confidence intervals" [6, 7]. CSs have recently been developed for a variety of nonparametric problems [1, 8, 9].

This paper derives closed-form CSs when samples are drawn without replacement (WoR) from a finite population. The technical underpinnings are novel (super)martingales for both categorical (Section 2) and continuous (Section 3) observations. In the latter setting, our results unify and improve on the recently derived time-uniform, with-replacement extensions [12, 1] of Hoeffding's inequality [10] and of the empirical Bernstein inequality of Maurer and Pontil [11], along with several related inequalities for sampling WoR by Serfling [13] and extensions by Bardenet and Maillard [14] and Greene and Wellner [15].

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Figure 1: 95% CS for the number of green and red balls in an urn by sampling WoR² (x-axis: balls sampled from the urn; y-axis: number of balls in the urn).
Notice that the true totals (650 green, 350 red) are captured by the CSs uniformly over time, from the initial sample until all 1000 balls are observed. After sampling 61 balls in this example, the CSs cease to overlap, and we can conclude with 95% confidence that there are more green than red balls in the urn.

Outline. In Section 2, we use Bayesian ideas to obtain frequentist CSs for categorical observations. In Section 3, we construct CSs for the mean of a finite set of bounded real numbers. We discuss implications for testing in Section 4. Some prototypical applications are described in Appendix A. The other appendices contain proofs, choices of tuning parameters, and computational considerations.

1.1 Notation, supermartingales and the model for sampling WoR

Everywhere in this paper, the $N$ objects in the finite population $\{x_1, \dots, x_N\}$ are fixed and nonrandom. In the discrete setting (Section 2) with $K \ge 2$ categories $\{c_k\}_{k=1}^K$, we have $x_i \in \{c_1, c_2, \dots, c_K\}$. In the continuous setting (Section 3), $x_i \in [\ell, u]$ for some known bounds $\ell < u$. Only the order of observation is random; the model for sampling uniformly at random WoR posits that

$X_t \mid \{X_1, \dots, X_{t-1}\} \sim \mathrm{Uniform}\big(\{x_1, \dots, x_N\} \setminus \{X_1, \dots, X_{t-1}\}\big)$.  (1.2)

All probabilities in this paper arise solely from observing fixed entities in a random order; no distributional assumptions are made on the finite population. It is worth remarking on the power of this randomization: as demonstrated in our experiments, one can estimate the average of a deterministic set of numbers to high accuracy without observing a large fraction of the set.

The results in this paper draw from the theory of supermartingales. While they can be defined in more generality, we provide a definition of supermartingales which will suffice for the theorems that follow. A filtration is an increasing sequence of sigma-fields. For the entirety of this paper, we consider the canonical filtration $(\mathcal{F}_t)_{t=0}^N$ defined by $\mathcal{F}_t := \sigma(X_1, \dots, X_t)$, with $\mathcal{F}_0$ the trivial sigma-field. For any fixed $N \in \mathbb{N}$, a stochastic process $(M_t)_{t=0}^N$ is said to be a supermartingale with respect to $(\mathcal{F}_t)_{t=0}^N$ if, for all $t \in \{0, 1, \dots, N-1\}$, $M_t$ is measurable with respect to $\mathcal{F}_t$ (informally, $M_t$ is a function of $X_1, \dots, X_t$) and $\mathbb{E}(M_{t+1} \mid \mathcal{F}_t) \le M_t$. If the above inequality is replaced by an equality for all $t$, then $(M_t)_{t=0}^N$ is said to be a martingale. For succinctness, we use the notation $a_1^t := \{a_1, \dots, a_t\}$ and $[a] := \{1, \dots, a\}$. Using this terminology, one can rewrite model (1.2) as positing that $X_t \mid \mathcal{F}_{t-1} \sim \mathrm{Uniform}(x_1^N \setminus X_1^{t-1})$.

2 Discrete categorical setting

When observations are of this discrete form, the variables can be rewritten in such a way that they follow a hypergeometric distribution. In such a setting, the following prior-posterior-ratio martingale can be used to obtain CSs for parameters of the hypergeometric distribution which shrink to a single point after all data have been observed.

²Code to reproduce plots is available at github.com/wannabesmith/confseq_wor.

2.1 The prior-posterior-ratio (PPR) martingale

While the PPR martingale will be particularly useful for obtaining CSs when sampling discrete categorical random variables WoR from a finite population, it may be employed whenever one is able to compute a posterior distribution, and is certainly not limited to this paper's setting; a small numerical illustration follows.
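For intuition, here is a minimal numerical sketch (ours, not from the paper) in the simplest i.i.d. setting: Bernoulli($\theta^\star$) observations with a conjugate Beta "working prior". The prior-posterior ratio evaluated at the truth stays controlled, and the set $\{\theta : R_t(\theta) < 1/\alpha\}$ traps $\theta^\star$ at every step. All parameter choices are illustrative; it assumes only NumPy and SciPy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta_star, a, b, T, alpha = 0.4, 1.0, 1.0, 2000, 0.05  # truth, Beta(a,b) "prior"

x = rng.binomial(1, theta_star, size=T)  # i.i.d. Bernoulli stream (simplest case)
s, n = np.cumsum(x), np.arange(1, T + 1)

# "posterior" after t observations is Beta(a + s_t, b + t - s_t); the PPR at the
# truth, R_t = pi_0(theta*) / pi_t(theta*), is a nonnegative martingale
ppr = stats.beta.pdf(theta_star, a, b) / stats.beta.pdf(theta_star, a + s, b + n - s)

# Ville's inequality gives Pr(exists t : R_t >= 1/alpha) <= alpha, so the set
# {theta : R_t(theta) < 1/alpha} is a (1 - alpha)-confidence sequence
grid = np.linspace(0.001, 0.999, 999)
post = stats.beta.pdf(grid[None, :], a + s[:, None], b + (n - s)[:, None])
in_cs = stats.beta.pdf(grid, a, b) / post < 1 / alpha

idx = np.argmin(np.abs(grid - theta_star))  # grid point nearest the truth
print("max_t R_t(theta*):", ppr.max().round(3))
print("theta* in C_t for all t:", bool(in_cs[:, idx].all()))
```

With a uniform prior this typically reports a maximum PPR far below $1/\alpha = 20$ and confirms coverage; rerunning with a badly misspecified prior (say $a = 0.5$, $b = 5$) still covers, just with wider sets, mirroring the robustness discussion of Section 2.3 below.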
Moreover, the posterior distribution need not be computed in closed form, and computational techniques such as Markov chain Monte Carlo may be employed when a conjugate prior is not available or desirable. To avoid confusion, we emphasize that while we make use of terminology from Bayesian inference, such as "posteriors" and "conjugate priors", all of the probability statements regarding CSs should be read in the frequentist sense; they are not to be interpreted as sequences of credible intervals.

Consider any family of distributions $\{F_\theta\}_{\theta \in \Theta}$ with densities $f_\theta$ with respect to some common underlying measure (such as the Lebesgue measure in continuous cases, or the counting measure in discrete cases). Let $\theta^\star \in \Theta$ be a fixed parameter and let $\mathcal{T} = [N]$, where $N \in \mathbb{N} \cup \{\infty\}$. Suppose that $X_1 \sim f_{\theta^\star}(x)$ and $X_{t+1} \sim f_{\theta^\star}(x \mid X_1^t)$ for all $t \in \mathcal{T}$. Let $\pi_0(\theta)$ be a "prior" distribution on $\Theta$, with "posterior" given by

$\pi_t(\theta) := \dfrac{\pi_0(\theta)\, f_\theta(X_1^t)}{\int_{\eta \in \Theta} \pi_0(\eta)\, f_\eta(X_1^t)\, d\eta}$.

To prepare for the result that follows, define the prior-posterior ratio (PPR) evaluated at $\theta \in \Theta$ as $R_t(\theta) := \pi_0(\theta) / \pi_t(\theta)$.

Proposition 2.1 (Prior-posterior-ratio martingale). For any prior $\pi_0$ on $\Theta$ that assigns nonzero mass everywhere, the sequence of prior-posterior ratios evaluated at the true $\theta^\star$, that is $(R_t(\theta^\star))_{t=0}^N$, is a nonnegative martingale with respect to $(\mathcal{F}_t)_{t=0}^N$. Further, the sequence of sets $C_t := \{\theta \in \Theta : R_t(\theta) < 1/\alpha\}$ forms a $(1-\alpha)$-CS for $\theta^\star$, meaning that $\Pr(\exists t \in \mathcal{T} : \theta^\star \notin C_t) \le \alpha$.

The proof is given in Appendix B.1. Going forward, we attach the label "working" to "prior" and "posterior" and encase them in quotes to emphasize that they constitute part of a Bayesian working model, in contrast to an assumed Bayesian model; the latter would be inappropriate given the discussion in Section 1.1. Next, we apply this result to the hypergeometric distribution. We will later examine the practical role of this "working prior".

2.2 CSs for binary settings using the hypergeometric distribution

Recall that a random variable $X$ has a hypergeometric distribution with parameters $(N, N^+, n)$ if it represents the number of successes in $n$ random samples WoR from a population of size $N$ containing $N^+$ successes, where each observation is either a success or a failure (1 or 0). The probability of observing $x \in \{0, 1, \dots, \min(N^+, n)\}$ successes is $\binom{N^+}{x}\binom{N - N^+}{n - x} \big/ \binom{N}{n}$.

For notational simplicity, we consider the case $n = 1$, that is, we make one observation at a time, but this is not a necessary restriction: one would obtain the same CS at time ten whether one makes one observation ten times or ten observations in one go. For a moment, let us view this problem from the Bayesian perspective, treating the fixed parameter $N^+$ as a random parameter, which we call $\widetilde N^+$ to avoid confusion. We choose a beta-binomial "working prior" on $\widetilde N^+$, as it is conjugate to the hypergeometric distribution up to a shift in $\widetilde N^+$ [16]. Concretely, suppose

$X_t \mid (\widetilde N^+, X_1, \dots, X_{t-1}) \sim \mathrm{HyperGeo}\Big(N - (t-1),\ \widetilde N^+ - \textstyle\sum_{i=1}^{t-1} X_i,\ 1\Big)$, with $\widetilde N^+ \sim \mathrm{BetaBin}(N, a, b)$,

for some $a, b > 0$. Then for any $t \in [N]$, the "working posterior" for $\widetilde N^+$ follows from Bayes' rule, $\pi_t(n^+) \propto \pi_0(n^+)\,\Pr(X_1^t \mid \widetilde N^+ = n^+)$; by the aforementioned conjugacy, it is again a (shifted) beta-binomial. Now that we have "prior" and "posterior" distributions for $\widetilde N^+$, an application of the prior-posterior-ratio martingale (Proposition 2.1) yields a CS for the true $N^+$, summarized in the following theorem.

Theorem 2.1 (CS for binary observations). Suppose $x_1^N \in \{0, 1\}^N$ is a nonrandom set with the number of successes $N^+ := \sum_{i=1}^N x_i$ fixed and unknown.
Under observation model (1.2), we have $X_t \mid X_1^{t-1} \sim \mathrm{HyperGeo}\big(N - (t-1),\ N^+ - \sum_{i=1}^{t-1} X_i,\ 1\big)$. For any beta-binomial "prior" $\pi_0$ for $N^+$ with parameters $a, b > 0$ and induced "posterior" $\pi_t$,

$C_t := \big\{ n^+ \in \{0, 1, \dots, N\} : \pi_0(n^+)/\pi_t(n^+) < 1/\alpha \big\}$

is a $(1-\alpha)$-CS for $N^+$. Further, the running intersection $(\bigcap_{s \le t} C_s)_{t \in [N]}$ is also a valid $(1-\alpha)$-CS.

Figure 2: Consider sampling balls WoR from an urn with three distinct colors (red, green, and purple). In this example, the urn contains 1000 balls: 300 red, 600 green, and 100 purple. We only require a two-dimensional confidence sequence (a region over the totals of red and green balls, shown after 300 and 1000 samples) to capture uncertainty about all three totals. After around 300 balls have been sampled, we are quite confident that the urn is made up mostly of green; after 1000 samples, we know the totals for each color with certainty.

The proof of Theorem 2.1 is a direct application of Proposition 2.1. Note that for any "prior", the "posterior" at time $t = N$ is $\pi_N(n^+) = \mathbf{1}(n^+ = N^+)$, so $C_N$ shrinks to a point containing only $N^+$. For $K > 2$ categories, Theorem 2.1 can be extended, using a multivariate hypergeometric likelihood with a Dirichlet-multinomial "prior", to yield higher-dimensional CSs, but we leave the (notationally heavy) derivation to Appendix C. See Figure 2 for a sense of what these CSs can look like when $K = 3$.

2.3 Role of the prior in the prior-posterior CS

The prior-posterior CSs discussed thus far have valid (frequentist) coverage for any "prior" on $N^+$, and in particular are valid for a beta-binomial "prior" with any data-independent choice of $a, b > 0$. Importantly, the corresponding CS always shrinks to zero width. How, then, should the user pick $(a, b)$? Figure 3 provides some visual intuition. These are our takeaway messages: (a) if the "prior" is very accurate (coincidentally peaked at the truth), the resulting CS is narrowest; (b) even if the "prior" is horribly inaccurate (placing almost no mass at the truth), the resulting CS is well-behaved and robust, albeit wider; (c) if we do not actually have any idea what the underlying truth might be, we suggest using a uniform "prior" to safely balance the two extremes. Even so, a riskier "prior" pays a relatively low statistical price.

Figure 3: Beta-binomial probability mass function as a "prior" on $N^+$ for different choices of $(a, b)$, namely $(1, 1)$, $(65, 35)$, and $(0.5, 5)$, and the resulting PPR CS for the parameter $N^+$ of a hypergeometric distribution when $(N^+, N - N^+) = (650, 350)$.
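To make Section 2 concrete, the following sketch (ours; it assumes a recent SciPy, whose `scipy.stats.betabinom` and `scipy.stats.hypergeom` provide the needed mass functions) traces the CS of Theorem 2.1 on the urn of Figure 1. Since the probability of an ordered binary WoR sequence depends on the data only through $(t, s_t)$, up to a factor constant in $n^+$, the "posterior" can be computed by reweighting the "prior" with a hypergeometric likelihood and normalizing.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, N_plus, alpha = 1000, 650, 0.05   # urn size and true number of green balls
a, b = 1.0, 1.0                      # uniform beta-binomial "working prior"

# observation model (1.2): a fixed binary population revealed in random order
x = rng.permutation(np.r_[np.ones(N_plus), np.zeros(N - N_plus)])

n_grid = np.arange(N + 1)
prior = stats.betabinom.pmf(n_grid, N, a, b)

lo, hi, s = np.empty(N, int), np.empty(N, int), 0
for t in range(1, N + 1):
    s += int(x[t - 1])
    # the ordered-sequence likelihood equals hypergeom.pmf(s; N, n+, t) up to a
    # binomial factor that is constant in n+ and cancels upon normalization
    post = prior * stats.hypergeom.pmf(s, N, n_grid, t)
    post /= post.sum()
    in_cs = prior < post / alpha     # equivalent to R_t(n+) < 1/alpha
    lo[t - 1], hi[t - 1] = n_grid[in_cs].min(), n_grid[in_cs].max()

print("CS for N+ after  61 draws:", (lo[60], hi[60]))
print("CS for N+ after all draws:", (lo[-1], hi[-1]))  # collapses to {650}
```

Note that this computes the plain $C_t$, not the running intersection; intersecting over $s \le t$, as licensed by Theorem 2.1, can only shrink the sets.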
3 Bounded real-valued setting

Suppose now that observations are real-valued and bounded, as in Examples C and D of Appendix A. Here we introduce Hoeffding- and empirical-Bernstein-type inequalities for sampling WoR.

3.1 Hoeffding-type bounds

Recalling Section 1.1, we deal with a fixed batch $x_1^N$ of bounded real numbers $x_i \in [\ell, u]$ with mean $\mu := \frac{1}{N}\sum_{i=1}^N x_i$. Our CS for $\mu$ utilizes a novel WoR mean estimator,

$\widehat\mu_t := \dfrac{\sum_{i=1}^t \big( X_i + \frac{1}{N-i+1}\sum_{j=1}^{i-1} X_j \big)}{\sum_{i=1}^t \big( 1 + \frac{i-1}{N-i+1} \big)}$.  (3.1)

More generally, if $\lambda_1, \dots, \lambda_N$ is a predictable sequence (meaning $\lambda_t$ is $\mathcal{F}_{t-1}$-measurable for $t \in \{1, \dots, N\}$), then we may define the weighted WoR mean estimator,

$\widehat\mu_t(\lambda_1^t) := \dfrac{\sum_{i=1}^t \lambda_i \big( X_i + \frac{1}{N-i+1}\sum_{j=1}^{i-1} X_j \big)}{\sum_{i=1}^t \lambda_i \big( 1 + \frac{i-1}{N-i+1} \big)}$,  (3.2)

where it should be noted that if $\lambda_1 = \cdots = \lambda_N$ then $\widehat\mu_t(\lambda_1^t)$ recovers $\widehat\mu_t$. Past WoR works [13, 14, 15] base their bounds on the sample average $\sum_{i \le t} X_i / t$. Both $\widehat\mu_t$ and the sample average are conditionally biased but unconditionally unbiased (see Appendix B.2 for more details).

As frequently encountered in Hoeffding-style inequalities for bounded random variables [10], define $\psi_H(\lambda) := \lambda^2 (u - \ell)^2 / 8$. Setting $M_0^H := 1$, we introduce a new exponential Hoeffding-type process for a predictable sequence $\lambda_1^N$:

$M_t^H := \exp\left\{ \sum_{i=1}^t \left[ \lambda_i \left( X_i - \frac{\mu - \frac{1}{N}\sum_{j=1}^{i-1} X_j}{1 - \frac{i-1}{N}} \right) - \psi_H(\lambda_i) \right] \right\}$.

Theorem 3.1 (A time-uniform Hoeffding-type CS for sampling WoR). Under the observation model and filtration $(\mathcal{F}_t)_{t=0}^N$ of Section 1.1, and for any predictable sequence $\lambda_1^N$, the process $(M_t^H)_{t=0}^N$ is a nonnegative supermartingale, and thus

$\Pr\left( \exists t \in [N] : \mu < \widehat\mu_t(\lambda_1^t) - \frac{\sum_{i=1}^t \psi_H(\lambda_i) + \log(1/\alpha)}{\sum_{i=1}^t \lambda_i \big(1 + \frac{i-1}{N-i+1}\big)} \right) \le \alpha$.

Consequently,

$C_t^H := \widehat\mu_t(\lambda_1^t) \pm \frac{\sum_{i=1}^t \psi_H(\lambda_i) + \log(2/\alpha)}{\sum_{i=1}^t \lambda_i \big(1 + \frac{i-1}{N-i+1}\big)}$

forms a $(1-\alpha)$-CS for $\mu$.

The proof in Appendix B.2 combines ideas from the with-replacement, time-uniform extension of Hoeffding's inequality by Howard et al. [1, 12] with the fixed-time, without-replacement extension of Hoeffding's inequality by Bardenet and Maillard [14], yielding a bound that improves on both. When $\lambda := \lambda_1 = \cdots = \lambda_N$ is constant, the term

$A_t := \sum_{i=1}^t \frac{i-1}{N-i+1}$  (3.5)

captures the advantage over the classical Hoeffding inequality; we discuss this term more soon.

In order to use the aforementioned CS, one needs to choose a predictable $\lambda$-sequence. First, consider the simpler case of a fixed real-valued $\lambda := \lambda_1 = \cdots = \lambda_N$, as this will aid our intuition in choosing a more complex $\lambda$-sequence. In this case, $\lambda$ corresponds to a time $t_0 \in [N]$ for which the CS is tightest. If the user wishes to optimize the width of the CS for time $t_0$, then the corresponding $\lambda$ is given by

$\lambda := \sqrt{\frac{8 \log(2/\alpha)}{t_0 (u-\ell)^2}}$.  (3.6)

Alternatively, if the user does not wish to commit to a single time $t_0$, they can choose a $\lambda$-sequence akin to (3.6) which spreads its width optimization over time. For example, one can use, for $t \in \{1, \dots, N\}$,

$\lambda_t := \sqrt{\frac{8 \log(2/\alpha)}{t \log(t+1) (u-\ell)^2}} \wedge \frac{1}{u - \ell}$,  (3.7)

where the minimum is taken to prevent the CS width from being dominated by early terms. Note, however, that any predictable $\lambda$-sequence yields a valid CS (see Appendix E for more examples).

Optimizing a real-valued $\lambda = \lambda_1 = \cdots = \lambda_N$ for a particular time is in fact the typical strategy used to obtain the tightest fixed-time (i.e., non-sequential) Chernoff-based confidence intervals (CIs), such as those based on Hoeffding's inequality [1, 10]. This same strategy can be used with our WoR CSs to obtain tight fixed-time CIs for sampling WoR. Specifically, plugging (3.6) with $t_0 = n$ into Theorem 3.1 for a fixed sample size $n \in [N]$, we obtain the following corollary.

Corollary 3.1 (Hoeffding-type CI for sampling WoR). For any $n \in [N]$,

$\widehat\mu_n \pm \frac{\sqrt{\frac{1}{2}(u-\ell)^2 \log(2/\alpha)}}{\sqrt{n} + A_n/\sqrt{n}}$ forms a $(1-\alpha)$-CI for $\mu$.  (3.8)

Notice that the classical Hoeffding confidence interval is recovered exactly, including constants, by dropping the $A_n$ term and using the usual sample mean instead of $\widehat\mu_n$. To get a sense of how large the advantage is, note that for small $n \ll N$, $A_n \approx n^2/(2N)$, while for $n = N$, $A_N \ge N \log N - (N - 1)$. Thus, the advantage is negligible when $n$ is a vanishing fraction of $N$ and substantial when $n$ is a constant fraction of $N$, but the CI of (3.8) is strictly tighter than Hoeffding's inequality for every $n$.
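Here is a compact sketch of the Hoeffding-type CS of Theorem 3.1 with the $\lambda$-sequence (3.7). It is ours and illustrative (population, seed, and variable names are assumptions, not the paper's), but it follows the displays above directly; since (3.7) depends only on $t$, everything vectorizes in NumPy.

```python
import numpy as np

def hoeffding_wor_cs(x_stream, N, lo=0.0, hi=1.0, alpha=0.05):
    """Hoeffding-type WoR CS (Theorem 3.1) with the lambda-sequence (3.7)."""
    t = np.arange(1, len(x_stream) + 1)
    c2 = (hi - lo) ** 2
    lam = np.minimum(np.sqrt(8 * np.log(2 / alpha) / (t * np.log(t + 1) * c2)),
                     1 / (hi - lo))                   # predictable: depends only on t
    S_prev = np.concatenate(([0.0], np.cumsum(x_stream)[:-1]))  # sum of X_1..X_{i-1}
    num = np.cumsum(lam * (x_stream + S_prev / (N - t + 1)))    # estimator (3.2)
    den = np.cumsum(lam * (1 + (t - 1) / (N - t + 1)))
    mu_hat = num / den
    width = (np.cumsum(lam ** 2 * c2 / 8) + np.log(2 / alpha)) / den  # psi_H terms
    return mu_hat - width, mu_hat + width

# demo: a fixed population in [0, 1], observed in a uniformly random order
rng = np.random.default_rng(2)
pop = rng.beta(10, 30, size=5000)
lo_t, hi_t = hoeffding_wor_cs(rng.permutation(pop), N=pop.size)
print("true mean:", round(pop.mean(), 4),
      "| CS at t = 500:", (round(lo_t[499], 4), round(hi_t[499], 4)))
```

As $t$ approaches $N$, the factor $(t-1)/(N-t+1)$ inflates the denominator and the width shrinks rapidly, which is exactly the WoR advantage tracked by (3.5).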
3.2 Empirical-Bernstein-type bounds

Hoeffding-type bounds like the one in Theorem 3.1 only make use of the fact that observations are bounded, and they can be loose if only some observations are near the boundary of $[\ell, u]$ while the rest are concentrated near the middle of the interval. More formally, the CS of Theorem 3.1 has the same width whether the underlying population $x_1^N$ has large or small variance $\frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2$; thus, it is tightest when each $x_i$ equals $\ell$ or $u$, and loosest when $x_i = (\ell + u)/2$ for all $i$.

As an alternative that adaptively takes a variance-like term into account [11, 17], we introduce a sequential, WoR, empirical-Bernstein CS. As is typical in empirical-Bernstein bounds [1], we use a different, "subexponential"-type function,

$\psi_E(\lambda) := \frac{-\log(1 - c\lambda) - c\lambda}{4}$ for any $\lambda \in [0, 1/c)$, where $c := u - \ell$.

$\psi_E$ seems quite different from $\psi_H$, but Taylor-expanding the logarithm yields $\psi_E(\lambda) \approx c^2 \lambda^2 / 8$; indeed,

$\lim_{\lambda \to 0} \psi_E(\lambda)/\psi_H(\lambda) = 1$.  (3.9)

Note that one typically picks a small $\lambda$; for example, setting $t_0 = N/2$, $\ell = -1$, $u = 1$ in (3.6) gives $\lambda \asymp 1/\sqrt{N}$.

In what follows, we derive a time-uniform empirical-Bernstein inequality for sampling WoR. Similar to Theorem 3.1, underlying the bound is an exponential supermartingale. Set $M_0^E := 1$ and recall that $c = u - \ell$ to define a novel exponential process for any $[0, 1/c)$-valued predictable sequence $\lambda_1, \dots, \lambda_N$:

$M_t^E := \exp\left\{ \sum_{i=1}^t \left[ \lambda_i \left( X_i - \frac{\mu - \frac{1}{N}\sum_{j=1}^{i-1} X_j}{1 - \frac{i-1}{N}} \right) - (c/2)^{-2} (X_i - \widehat\mu_{i-1})^2\, \psi_E(\lambda_i) \right] \right\}$.

Theorem 3.2 (A time-uniform empirical-Bernstein-type CS for sampling WoR). Under the observation model and filtration $(\mathcal{F}_t)_{t=0}^N$ of Section 1.1, and for any $[0, 1/c)$-valued predictable sequence $\lambda_1^N$, the process $(M_t^E)_{t=0}^N$ is a nonnegative supermartingale, and thus

$\Pr\left( \exists t \in [N] : \mu < \widehat\mu_t(\lambda_1^t) - \frac{\sum_{i=1}^t (c/2)^{-2} (X_i - \widehat\mu_{i-1})^2\, \psi_E(\lambda_i) + \log(1/\alpha)}{\sum_{i=1}^t \lambda_i \big( 1 + \frac{i-1}{N-i+1} \big)} \right) \le \alpha$.

Consequently,

$C_t^E := \widehat\mu_t(\lambda_1^t) \pm \frac{\sum_{i=1}^t (c/2)^{-2} (X_i - \widehat\mu_{i-1})^2\, \psi_E(\lambda_i) + \log(2/\alpha)}{\sum_{i=1}^t \lambda_i \big( 1 + \frac{i-1}{N-i+1} \big)}$

forms a $(1-\alpha)$-CS for $\mu$.

The proof in Appendix B.3 modifies the proof of Theorem 4 of Howard et al. [1] to use our WoR versions of $\widehat\mu_t$ and to allow predictable values of $\lambda_t$.

Figure 4: The left-most plots show histograms of the underlying set of numbers $x_1^N \in [0, 1]^N$, while the right-most plots compare empirical-Bernstein- and Hoeffding-type CSs for $\mu$ over 500 samples. Specifically, the Hoeffding and empirical-Bernstein CSs use the $\lambda$-sequences in (3.7) and (3.13), respectively. As expected, in a low-variance setting (top), $C_t^E$ is superior, but in a high-variance setting (bottom), $C_t^H$ has a slight edge.

As before, one must choose a $\lambda$-sequence to use $C_t^E$. We again consider the case of a real-valued $\lambda := \lambda_1 = \cdots = \lambda_N$ to help guide our intuition in choosing a more complex $\lambda$-sequence. Unlike earlier, we cannot optimize the width of $C_t^E$ in closed form since $\psi_E$ is less analytically tractable. Once more, fact (3.9) comes to our rescue: substituting $\psi_H$ for $\psi_E$ and optimizing the width yields an expression like (3.6),

$\lambda^\star = \sqrt{\frac{2 \log(2/\alpha)}{\widehat V_{t_0}}}$, where $\widehat V_t := \sum_{i=1}^t (X_i - \widehat\mu_{i-1})^2$ is a variance process.

However, we cannot use this choice of $\lambda$ since it depends on $X_1^{t_0}$. Instead, we construct a predictable $\lambda$-sequence which mimics $\lambda^\star$ and adapts to the underlying variance as samples are collected. To heuristically optimize the CS for a particular time $t_0$, take an estimate $\widehat\sigma^2_{t-1}$ of the variance which depends only on $X_1^{t-1}$, and set

$\lambda_t := \sqrt{\frac{2 \log(2/\alpha)}{\widehat\sigma^2_{t-1}\, t_0}} \wedge \frac{1}{2c}$.  (3.12)

Alternatively, to spread the CS width optimization over time as in (3.7), one can use the $\lambda$-sequence

$\lambda_t := \sqrt{\frac{2 \log(2/\alpha)}{\widehat\sigma^2_{t-1}\, t \log(t+1)}} \wedge \frac{1}{2c}$,  (3.13)

but again, any predictable sequence will suffice.
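A matching sketch for Theorem 3.2, again ours and illustrative: it runs sequentially because the centering $\widehat\mu_{i-1}$ and the variance plug-in $\widehat\sigma^2_{t-1}$ must be predictable, i.e., computed from the previous observations only. As assumptions, we center at the previous running weighted estimate and initialize $\widehat\sigma^2_0 = c^2/4$, echoing (3.14).

```python
import numpy as np

def emp_bernstein_wor_cs(x_stream, N, lo=0.0, hi=1.0, alpha=0.05):
    """Empirical-Bernstein-type WoR CS (Theorem 3.2) with a (3.13)-style lambda."""
    c = hi - lo
    lower, upper = np.empty(len(x_stream)), np.empty(len(x_stream))
    S = num = den = V = sq = 0.0
    mu_prev = (lo + hi) / 2          # predictable centering mu_hat_0 (a heuristic)
    var_prev = c ** 2 / 4            # predictable variance guess sigma_hat_0^2
    for i, x in enumerate(x_stream, start=1):
        lam = min(np.sqrt(2 * np.log(2 / alpha) / (var_prev * i * np.log(i + 1))),
                  1 / (2 * c))       # lambda-sequence (3.13), capped below 1/c
        psi_e = (-np.log(1 - c * lam) - c * lam) / 4
        V += (2 / c) ** 2 * (x - mu_prev) ** 2 * psi_e   # variance-process term
        num += lam * (x + S / (N - i + 1))               # estimator (3.2) pieces
        den += lam * (1 + (i - 1) / (N - i + 1))
        S += x
        mu_hat = num / den
        width = (V + np.log(2 / alpha)) / den
        lower[i - 1], upper[i - 1] = mu_hat - width, mu_hat + width
        sq += (x - mu_hat) ** 2                          # update plug-ins for i+1
        var_prev = (c ** 2 / 4 + sq) / (i + 1)
        mu_prev = mu_hat
    return lower, upper

# demo: a low-variance population, where EB beats Hoeffding (cf. Figure 4, top)
rng = np.random.default_rng(3)
pop = np.clip(rng.normal(0.25, 0.05, size=5000), 0.0, 1.0)
lo_t, hi_t = emp_bernstein_wor_cs(rng.permutation(pop), N=pop.size)
print("true mean:", round(pop.mean(), 4),
      "| EB CS at t = 500:", (round(lo_t[499], 4), round(hi_t[499], 4)))
```

Any other predictable choices of centering and of $\widehat\sigma^2_0$ would preserve validity; they only affect tightness.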
Similarly to the Hoeffding-type CS, we may instantiate the empirical-Bernstein-type CS at a particular time to obtain tight CIs for sampling WoR. However, ensuring that the resulting fixed-time CI is valid when using a data-dependent $\lambda$-sequence requires some additional care. Suppose now that $X_1^n$ is a simple random sample WoR from the finite population $x_1^N \in [\ell, u]^N$. If we randomly permute $X_1, \dots, X_n$ to obtain the sequence $\widetilde X_1, \dots, \widetilde X_n$, we have recovered the observation model of Section 1.1, and thus Theorem 3.2 applies. We choose a $\lambda$-sequence which sequentially estimates the variance but heuristically optimizes for the sample size $n$, as in (3.12). For $t \in [n]$, define

$\widetilde\lambda_t := \sqrt{\frac{2 \log(2/\alpha)}{\widetilde\sigma^2_{t-1}\, n}} \wedge \frac{1}{2c}$, where $\widetilde\sigma^2_t := \frac{c^2/4 + \sum_{i=1}^t (\widetilde X_i - \widetilde\mu_i)^2}{t + 1}$ and $\widetilde\mu_t := \frac{1}{t}\sum_{i=1}^t \widetilde X_i$.  (3.14)

Here, an extra $c^2/4$ was added to $\widetilde\sigma^2_t$ so that it is defined at time 0, but this is simply a heuristic and any other choice of $\widetilde\sigma^2_0$ will suffice. The resulting CI is summarized in the following corollary.

Corollary 3.2. Let $X_1^n$ be a simple random sample WoR from the finite population $x_1^N$, and let $\widetilde X_1^n$ be a random permutation of $X_1^n$. Let $\widetilde\lambda_t$ be a predictable sequence, such as the one in (3.14), for each $t \in [n]$. Then for any $n \in [N]$,

$\widehat\mu_n(\widetilde\lambda_1^n) \pm \frac{\sum_{i=1}^n (c/2)^{-2} (\widetilde X_i - \widetilde\mu_{i-1})^2\, \psi_E(\widetilde\lambda_i) + \log(2/\alpha)}{\sum_{i=1}^n \widetilde\lambda_i \big(1 + \frac{i-1}{N-i+1}\big)}$, with $\widehat\mu_n(\widetilde\lambda_1^n)$ computed on $\widetilde X_1^n$, forms a $(1-\alpha)$-CI for $\mu$.

The aforementioned CSs and CIs have a strong relationship with corresponding hypothesis tests. In the following section, we discuss how the techniques developed here can be used to sequentially test hypotheses about finite sets of nonrandom numbers.

4 Testing hypotheses about finite sets of nonrandom numbers

In classical hypothesis testing, one has access to i.i.d. data from some underlying distribution(s) about which one wishes to test some property; this includes sequential tests dating back to Wald [18]. It is less often appreciated that it is possible to test hypotheses about a finite list of numbers that have no distribution attached to them. Recalling the setup of Section 1.1, this is the nonstandard setting we find ourselves in. For instance, in the same example as Figure 1, we may wish to test $H_0 : N^+ \le 550$ (at most 550 of the balls are green). If we had access to each ball in advance, then we could accept or reject the null without any type-I or type-II error, but this is tedious, and so we sequentially take samples in a random order to test this hypothesis. The main question then is: how do we calculate a p-value $P_t$ that we can track over time, and stop sampling when $P_t \le 0.05$?

Figure 5: The duality between anytime p-values and CSs for three null hypotheses, $H_0 : N^+ \le D$ for $D \in \{550, 600, 650\}$. The first null is rejected at the 5% significance level after 260 samples, exactly when the 95% CS stops intersecting the null set $[0, 550]$; the second is rejected after 671 samples. However, $H_0 : N^+ \le 650$ is never rejected, since 650, the ground truth, is contained in the CS at all times from 0 to 1000.

Luckily, we do not need any new tools for this; our CSs provide a straightforward answer. Though we left it implicit, each confidence set $C_t$ is really a function of the confidence level $\alpha$. Consider the family $\{C_t(q)\}_{q \in (0,1)}$ indexed by $q$, which we have so far only instantiated at $q = \alpha$. Now, define

$P_t := \inf\{q \in (0, 1) : C_t(q) \cap H_0 = \emptyset\}$,  (4.1)

which is the smallest error level $q$ at which $C_t(q)$ just excludes the null set $H_0$. This duality is familiar in non-sequential settings, and in our case it yields an anytime-valid p-value [19, 1]: under $H_0$,

$\Pr(\exists t \in [N] : P_t \le \alpha) \le \alpha$ for any $\alpha \in [0, 1]$.

In words, if the null hypothesis is true, then $P_t$ will remain above $\alpha$ through the whole process with probability at least $1 - \alpha$. To bring out the duality with CSs more clearly, define the stopping time $\tau := \inf\{t \in [N] : P_t \le \alpha\}$, setting $\tau = N$ if the infimum is not achieved. Then, under the null, $\tau = N$ (we never stop early) with probability at least $1 - \alpha$. If we do stop early, then $\tau$ is exactly the first time at which $C_t(\alpha)$ excludes the null set $H_0$. The manner in which anytime-valid p-values and CSs are connected through stopping times is demonstrated in Figure 5.
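The inversion (4.1) is one line once the PPR is available. The sketch below (ours; it reuses the assumed SciPy helpers from the Section 2 sketch) computes $P_t$ for $H_0 : N^+ \le D$ by noting that $C_t(q) = \{n^+ : R_t(n^+) < 1/q\}$ excludes the entire null set exactly when $q \ge 1 / \min_{n^+ \le D} R_t(n^+)$.

```python
import numpy as np
from scipy import stats

def anytime_p_value(s_t, t, N, D, a=1.0, b=1.0):
    """P_t of (4.1) for H0: N+ <= D, via the PPR CS of Theorem 2.1."""
    n_grid = np.arange(N + 1)
    prior = stats.betabinom.pmf(n_grid, N, a, b)
    post = prior * stats.hypergeom.pmf(s_t, N, n_grid, t)
    post /= post.sum()
    # C_t(q) misses all of {0, ..., D} iff q >= 1 / min of R_t(n+) over the null
    r_min = (prior[: D + 1] / np.maximum(post[: D + 1], 1e-300)).min()
    return min(1.0, 1.0 / r_min)

# track P_t on the urn of Figure 1 (650 green balls out of N = 1000)
rng = np.random.default_rng(4)
x = rng.permutation(np.r_[np.ones(650), np.zeros(350)])
s = np.cumsum(x)
for t in (100, 260, 500):
    p = anytime_p_value(int(s[t - 1]), t, N=1000, D=550)
    print(f"P_{t} for H0: N+ <= 550 -> {p:.4f}")
```

This $P_t$ may be monitored continuously and compared to $\alpha$ at any stopping time; the times at which it crosses $\alpha$ depend on the random order, so the printed values are illustrative only.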
In summary, our CSs directly yield p-values (4.1) for composite null hypotheses. These p-values can be tracked and are valid simultaneously at all times, including at arbitrary stopping times. The aforementioned type-I error probabilities are due to the randomness in the ordering, not in the data. It is worth noting that our (super)martingales $(R_t)$, $(M_t^H)$ and $(M_t^E)$ also immediately yield e-values [20] and hence safe tests [21], meaning that under nulls of the form in Figure 5, they satisfy $\mathbb{E}[M_\tau] \le 1$ for any stopping time $\tau$.

WoR sampling and inference naturally arise in a variety of applications, such as finite-population studies and permutation-based statistical methods, as outlined in Appendix A. Furthermore, several machine learning tasks involve random samples from finite populations, such as sampling (a) points for a stochastic gradient method, (b) covariates in a random order for coordinate descent, (c) columns of a matrix, or (d) edges in a graph. In order to quantify uncertainty when sequentially sampling WoR from a finite set of objects, this paper developed three new confidence sequences: one in the discrete setting and two in the continuous setting (Hoeffding, empirical Bernstein). Their construction was enabled by the development of new technical tools (the prior-posterior-ratio martingale and two exponential supermartingales) which may be of independent interest. We clarified how these can be tuned (the role of the "prior" or the $\lambda$-sequence) and demonstrated their advantages over naive sampling with replacement. Our CSs can be inverted to yield anytime-valid p-values for sequentially testing arbitrary composite hypotheses. Importantly, these CSs can be efficiently updated, continuously monitored, and adaptively stopped without violating their uniform validity, thus merging theoretical rigor with practical flexibility.

Acknowledgements. IW-S thanks Serge Aleshin-Guendel for conversations regarding Bayesian methods. AR thanks Steve Howard for early conversations. AR acknowledges funding from an Adobe Faculty Research Award and NSF grant DMS 1916320.

Broader impact. The main broader impact of our work is the reduction of time and money through the ability to continuously monitor data and hence make critical decisions early. In Appendix A, we provide four prototypical examples of such situations. In Example A, every phone call requires time, and thus money, so if we can accurately quantify uncertainty then we can stop collecting data sooner. In Example B, randomization tests such as those involving permutations are a common way to quantify statistical significance, but they are computationally intensive; knowing when to stop, based on the test being clearly statistically significant (or clearly far from it), can save on energy costs. In Example D, when an educational intervention is rolled out one school at a time, there are again two possibilities: if it is clearly beneficial, we would like to recognize this quickly so that every student can avail of the benefits, while if it is for some reason harmful (e.g., causing stress without measurable benefit), then it would be equally important to end the program quickly. Once more, accurately quantifying uncertainty as the process unfolds underpins the ability to make these decisions early, disseminating benefits rapidly or mitigating harms quickly.
Our techniques are also applicable to auditing elections (checking whether the results are as announced, via a manual random recount). Risk-limiting audits [22] constitute an application area that we intend to pursue; there are many variants depending on how voters express their preferences (choose one, rank all, or score all) and on the aggregation mechanism used to decide on one or multiple winners. Audits are not currently required by law in many elections, due to high perceived effort among other reasons, so being able to stop these audits early, yet accurately and confidently, is critical to their broad adoption. A longer-term broader impact on trust in elections is thus anticipated.

References

[1] Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences. Annals of Statistics, 2020+.
[2] D. A. Darling and H. E. Robbins. Confidence sequences for mean, variance, and median. Proceedings of the National Academy of Sciences of the United States of America, 58(1):66, 1967.
[3] Tze Leung Lai. On confidence sequences. The Annals of Statistics, 4(2):265-280, 1976.
[4] Herbert Robbins and David Siegmund. Boundary crossing probabilities for the Wiener process and sample sums. The Annals of Mathematical Statistics, 41(5):1410-1429, 1970.
[5] Tze Leung Lai. Boundary crossing probabilities for sample sums and confidence sequences. The Annals of Probability, 4(2):299-312, 1976.
[6] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of The 27th Conference on Learning Theory, volume 35, pages 423-439, 2014.
[7] Emilie Kaufmann and Wouter Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. arXiv:1811.11419, 2018.
[8] Larry Wasserman, Aaditya Ramdas, and Sivaraman Balakrishnan. Universal inference. Proceedings of the National Academy of Sciences, 2020.
[9] Steven R. Howard and Aaditya Ramdas. Sequential estimation of quantiles with applications to A/B-testing and best-arm identification. arXiv:1906.09712, 2019.
[10] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.
[11] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. In Proceedings of the Conference on Learning Theory, 2009.
[12] Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform Chernoff bounds via nonnegative supermartingales. Probability Surveys, 17:257-317, 2020.
[13] Robert J. Serfling. Probability inequalities for the sum in sampling without replacement. The Annals of Statistics, pages 39-48, 1974.
[14] Rémi Bardenet and Odalric-Ambrym Maillard. Concentration inequalities for sampling without replacement. Bernoulli, 21(3):1361-1385, 2015.
[15] Evan Greene and Jon A. Wellner. Exponential bounds for the hypergeometric distribution. Bernoulli, 23(3):1911, 2017.
[16] Daniel Fink. A compendium of conjugate priors, 1997.
[17] Akshay Balsubramani and Aaditya Ramdas. Sequential nonparametric testing with the law of the iterated logarithm. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016.
[18] Abraham Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117-186, 1945.
[19] Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh.
Peeking at A/B tests: Why it matters, and what to do about it. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1517-1525, 2017.
[20] Glenn Shafer and Vladimir Vovk. Game-Theoretic Foundations for Probability and Finance, volume 455. John Wiley & Sons, 2019.
[21] Peter Grünwald, Rianne de Heide, and Wouter Koolen. Safe testing. arXiv:1906.07801, 2019.
[22] Mark Lindeman and Philip B. Stark. A gentle introduction to risk-limiting audits. IEEE Security & Privacy, 10(5):42-49, 2012.
[23] Ronald A. Fisher. Mathematics of a lady tasting tea. The World of Mathematics, 3(part 8):1514-1521, 1956.
[24] Jean Ville. Étude critique de la notion de collectif. Bull. Amer. Math. Soc., 45(11):824, 1939.
[25] Xiequan Fan, Ion Grama, and Quansheng Liu. Exponential inequalities for martingales with applications. Electronic Journal of Probability, 2015.
[26] Ian Waudby-Smith and Aaditya Ramdas. Variance-adaptive confidence sequences by betting. arXiv:2010.09686, 2020.