# Online Balanced Experimental Design

David Arbour*¹, Drew Dimmery*², Tung Mai¹, Anup Rao¹

*Equal contribution. ¹Adobe Research, San Jose, CA, USA. ²Data Science @ University of Vienna, Vienna, AT. Correspondence to: David Arbour.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We consider the experimental design problem in an online environment, an important practical task for reducing the variance of estimates in randomized experiments, which allows for greater precision and, in turn, improved decision making. We present algorithms that build on recent advances in online discrepancy minimization and accommodate both arbitrary treatment probabilities and multiple treatments. The proposed algorithms are computationally efficient, minimize covariate imbalance, and incorporate randomization, which enables robustness to misspecification. We provide worst-case bounds on the expected mean squared error of the causal estimate and show that the proposed estimator is no worse than an implicit ridge regression; these bounds are within a logarithmic factor of the best known results for offline experimental design. We conclude with a detailed simulation study showing favorable results relative to complete randomization, as well as to offline methods for experimental design whose time complexities exceed that of our algorithm, which is linear in the number of observations, by polynomial factors.

1. Introduction

Randomized experimentation is a fundamental tool for obtaining counterfactual estimates. The efficacy of randomization comes from a simple intuition: by randomly assigning treatment status, dependence between treatment and observed (and unobserved) pre-treatment covariates necessarily tends to zero as a function of the number of units. In the context of experimentation, this independence condition on observed covariates, commonly known as balance (Imai et al., 2008; Imai and Ratkovic, 2014), reduces the variance of estimates of the average treatment effect (Greevy et al., 2004; Higgins et al., 2016; Kallus, 2018; Li et al., 2018; Harshaw et al., 2020). Under appropriate conditions, such corrections can result in large increases in effective sample size, allowing for the detection of the small effects which are commonplace in many large-scale studies and industrial applications (Dimmery et al., 2019; Azevedo et al., 2020). These contexts rely heavily on experimentation for decision-making, so reduced variance directly translates into more reliable decisions (Kohavi et al., 2012).

Traditional experimental designs like blocking (Greevy et al., 2004; Higgins et al., 2016), and even the novel Gram-Schmidt Walk design (Harshaw et al., 2020), require more than one pass over the sample, and their sample complexity is greater than O(n). Even algorithms which admit sequential assignment, such as Moore and Moore (2017), are not linear time, and thus respondents late in the experiment may take substantially longer to receive a treatment assignment (Cavaille, 2018). Our work is motivated by this setting: we provide a linear-time, single-pass (i.e., sequential) algorithm for balanced experimental design. Our focus is on linear measures of balance, which are often of particular interest to applied researchers.
This provides a new avenue through which experimenters can ensure that their experiments optimize the information they gain from costly samples.

We start by presenting four desiderata for effective, practical online experimental design.

First, a method must be computationally efficient. In a review of existing methods for online assignment in the case of survey experiments, Cavaille (2018) finds that existing methods become slow to the point of unusability as more respondents are included in a study. Such slowness is fundamentally incompatible with the effective administration of an experiment. In short, high latency will cause disproportionately large dropoff in an experiment, which may completely nullify the gains from using a more sophisticated experimental design. Any algorithm with greater-than-linear time complexity exacerbates this problem: higher latency for later subjects than for earlier subjects will tend to cause non-random sample attrition, as later subjects (who may differ from earlier respondents) will be more likely to drop out.

Second, an experimental design must reduce covariate imbalance to be effective. This is the entire justification for using methods more sophisticated than Bernoulli randomization; if the design is not able to improve on balance, then there will be no subsequent reduction in variance and therefore no compelling reason to use it.

Third, the design must incorporate randomization. Harshaw et al. (2020) provide an extensive discussion of the inherent tradeoff between robustness and balance within experimental design. A design which solely optimizes for balance will tend to operate on a knife-edge of accidental bias (Efron, 1971): the potential bias from an adversarially chosen confounder. With higher accidental bias than Bernoulli randomization, if units do not arrive precisely i.i.d., then the entire design may be compromised. Given that experimental settings are prized specifically for their unbiasedness, this could completely undermine any gains from improved balance. Strong theoretical guarantees on robustness are extremely important in this setting in order to ensure that inferences rest squarely on the design. If assumptions about sampling procedures or data generating processes are necessary to ensure the reliability of estimation, then inferences are not based solely on properties of the experiment, but rather on factors outside the control of the experimenter (Aronow et al., 2021).

Fourth, units which do not show up in the sample should not be included in the balancing. An offline algorithm used in an online environment would fail this condition, because using it would require balancing the entire population of units expected to show up to an experiment. Units within that population who fail to show would nevertheless be assigned a treatment. Depending on the distribution of units who actually show up in the sample, there are no longer any guarantees that balance will obtain.

Our approach satisfies all four of these desiderata. Our contributions are the following. We propose an online method for covariate balancing requiring linear time and space which provably provides variance reduction. All analyses incorporate an adversarial, non-i.i.d. sampling mechanism in the analysis of worst-case behavior.
Building upon recent work on discrepancy minimization, namely the self-balancing walk (Alweiss et al., 2020) and kernel thinning (Dwivedi and Mackey, 2021), we provide an algorithm whose L2-imbalance matches the best known online discrepancy algorithm. Where δ is a failure probability, performing this optimization online incurs a log(n/δ) cost in convergence of the average treatment effect relative to the offline discrepancy-minimization algorithm of Harshaw et al. (2020). Using restarts, we show that the unconditional performance of this algorithm differs by only a constant factor. We extend this algorithm to multiple treatments and non-uniform treatment probabilities. These algorithms provide a tuning parameter, φ, which allows practitioners to directly manage the balance-robustness tradeoff.

The rest of the paper is organized as follows. Section 2 discusses related work to position our contribution in the literature. Section 3 defines notation and formally introduces the problem. Section 4 provides our proposed algorithms and methods. Section 5 applies these results to experimental design. Section 6 provides a detailed simulation study of the behavior of the proposed algorithms.

2. Related Work

There are two common approaches for achieving improved covariate balance in experiments. The first, and most common approach, especially within industrial settings, is to perform a post-hoc regression adjustment which includes pre-treatment covariates (Deng et al., 2013; Lin, 2013). The second approach is to consider covariate balance during the design phase of the experiment, i.e., explicitly optimizing treatment assignment in order to minimize imbalance between treatment groups (Greevy et al., 2004; Higgins et al., 2016; Kallus, 2018).

Post-hoc stratification may be seen as asymptotically equivalent, in terms of variance reduction, to its analogous pre-stratified design, as shown by Miratrix et al. (2013). Miratrix et al.'s (2013) analysis is limited by two factors: it assumes a fixed number of stratification cells (that do not grow with sample size) and it is conditioned on the post-stratification estimator being defined (e.g., treatment and control units within each stratification cell). These limitations may weaken its asymptotic equivalence argument. A key limitation of post-hoc adjustment approaches is that the desire for simplicity and scalability implies that practitioners typically adjust for only a linear function of the pre-treatment covariates. Indeed, in the common CUPED approach, adjustment is performed solely on a linear function of a single pre-treatment outcome measurement (Deng et al., 2013). Second, many common approaches for constructing stratification cells (i.e., clustering algorithms) may be computationally infeasible in practice for industrial applications when the number of simultaneous experiments and the number of outcome variables of interest are large. Third, without sample splitting (or when naively applied), advanced machine-learning-based methods for adjustment may slip in assumptions of correct specification of the outcome model, or have confidence intervals with poor coverage properties. While cross-fitting may ameliorate some of these problems in larger samples, sample splitting may prove too high a cost when sample sizes are low.

A key shortcoming of design-based covariate balance is the lack of computationally efficient algorithms which provide theoretical guarantees on worst-case behavior.
Blocking (Fisher, 1935) partitions units into non-overlapping sets based on their covariates and performs complete randomization within each partition ("blocks"). Higgins et al. (2016) introduced a computationally efficient approximation of blocking which runs in O(n log(n)) time. Kallus (2018) and Bertsimas et al. (2015) propose optimization-based approaches; Kallus (2018) additionally considers a partially random approach using semi-definite programming. Zhou et al. (2018) provide a method combining batch-based sequential experimentation with rerandomization to achieve balance, but it is not computationally feasible in moderate to large sample sizes. Perhaps the closest to the current work is Harshaw et al. (2020), which proposes a balancing design using the Gram-Schmidt walk, an offline method for (linear) discrepancy minimization.

The current state of the art for balanced treatment assignment requires polynomial running time and generally requires knowing all of the covariate vectors prior to determining assignment (Higgins et al., 2016; Harshaw et al., 2020; Arbour et al., 2021). As we discuss in Section 3, this is a non-starter for online treatment assignment. In this setting, subjects must be allocated as they arrive; it does no good to know how you should have assigned a user at the end of the experiment; you need to know when that subject arrives. It is crucial that when a subject in an experiment arrives they be swiftly allocated to a treatment. Especially in an online environment, high latency will lead to attrition, which may counteract any potential gains from greater efficiency. Moreover, the users who attrit may be the very subjects of interest (Munger et al., 2021). By inducing differential attrition based on patience, the sample in the experiment may differ greatly from the population of interest on unobserved characteristics, making it difficult to extrapolate to a population-level effect (Egami and Hartman, 2020).

There is also a variety of methods aimed at sequential, online assignment in experiments. The seminal work in this literature is Efron (1971), which introduced an online variant of complete randomization that aims to ensure a pre-specified marginal treatment probability is met without introducing too much accidental bias. Smith (1984) provides a generalization of the Efron (1971) approach which extends gracefully to multiple treatments. There is a variety of online balanced coin designs which seek to reduce covariate imbalance (e.g., Baldi Antognini and Zagoraiou, 2011; Moore and Moore, 2017). Moore and Moore (2017) is based around the Mahalanobis distance and, as such, has polynomial time complexity at each arrival. In addition to this inefficiency, the theoretical worst-case behavior of the algorithm has not been resolved, even in the stochastic setting. Theoretical guarantees of this sort are paramount in the design setting, as practitioners need to know the credibility of their inferences (and how they may differ from simple Bernoulli randomization).

3. Background and Problem Setting

We first fix notation. Random variables will be denoted in upper case, with sets in bold. The problem setting, which we refer to as experimental treatment allocation, is as follows. We assume that we observe $i = 1, \ldots, n$ i.i.d. observations of the covariates $X \in \mathbb{R}^{n \times d}$.¹ The experimenter is asked to assign a treatment, $A \in \{1, -1\}$ (we will later loosen this to multiple discrete treatment values). We will refer to the two assignments of A as treatment and control, respectively.
Each unit is imbued with potential outcomes for each treatment, the value of the outcome if that unit had been assigned to the given group: y(1) for treatment and y(−1) for control. After assignment we observe only the potential outcome corresponding to the realized treatment assignment, Y. We assume that the outcomes are not available until the conclusion of the experiment. At the end of the experiment we are interested in measuring the sample average treatment effect (SATE) between any two treatments, k and k′, with the difference-in-means estimator:

$$\hat\tau = \frac{1}{n}\sum_{i=1}^{n} \frac{A_i}{p(A_i)}\, Y_i, \qquad (1)$$

where $p(A_i)$ denotes the probability of assigning treatment $A_i$ to instance i. Note that this is simply the difference-in-means rather than the more general Horvitz-Thompson estimator (Horvitz and Thompson, 1952), as the treatment probability is marginal rather than conditional. More sophisticated estimators are usable in this setting (e.g., Tsiatis et al., 2008; Aronow and Middleton, 2013), but we focus on the simplest as we optimize design, as is commonplace when studying design (Kallus, 2017; Harshaw et al., 2020). If propensity scores are constant, the estimator of the SATE given by equation (1) will be unbiased and consistent for its oracle counterpart,

$$\tau = \frac{1}{n}\sum_{i=1}^{n} \big( y_i(k) - y_i(k') \big), \qquad (2)$$

the difference of potential outcomes of the k and k′ treatments. This SATE is our estimand of interest, as estimated by equation (1).

¹We assume linear feature maps throughout. We note that non-linearities can be handled with the same guarantees following Dwivedi and Mackey (2021), at the cost of additional computational complexity.

We will maintain the following assumption:

Assumption 1 (Consistency). $Y_i = y_i(k)$ if $A_i = k$, for all i and k.

The problem of experimental allocation is to observe covariate vectors and assign A to units so as to achieve desirable properties of the SATE (for instance, to minimize variance). In the most general setting, where no assumptions are placed on the relationship between the covariates and the outcome, complete randomization (randomly drawing assignments without respect to background covariates) is known to be minimax optimal (Kallus, 2017).

3.1. Robustness in sequential design

Experiments are prized for their ability to provide unbiased estimates of causal effects under relatively mild assumptions. These assumptions are on the design of the experiment rather than more difficult assumptions about the data used in the course of analysis (Sekhon, 2009; Aronow et al., 2021). In the study of vector balancing, there are three main sampling schemes of interest, listed in order of how adversarial they are:

Stochastic arrivals. Units are sampled i.i.d. from some fixed (possibly infinite) population. As such, a given covariate vector is just as likely to arrive early in the sequence as late.

Oblivious adversarial arrivals. The adversary is allowed to arbitrarily set the order in which units arrive prior to the first assignment, but cannot change the order in response to assignment of the units to treatment/control.

Fully adversarial arrivals. The adversary is allowed to arbitrarily set the order in which units arrive and may change the ordering in response to assignments.

The fully adversarial case has a discrepancy lower bound of $\Omega(\sqrt{T})$, with Bernoulli randomization as the optimal strategy (Alweiss et al., 2020).

Assumption 2 (Oblivious Adversary). The current imbalance is independent of the newly arrived covariate profile, conditional on the history of covariate profiles.
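Before turning to the tradeoff between balance and robustness, the following is a minimal sketch (ours, not the paper's code) of the estimator in equation (1) for the binary case; the function name and the simulated example are illustrative only, assuming assignments $z_i \in \{+1, -1\}$ drawn with marginal treatment probability q:

```python
import numpy as np

def diff_in_means(y, z, q):
    """Equation (1) for binary treatment: (1/n) * sum_i A_i * Y_i / p(A_i),
    where p(+1) = q and p(-1) = 1 - q are marginal treatment probabilities."""
    p = np.where(z == 1, q, 1.0 - q)
    return np.mean(z * y / p)

# Under Bernoulli(q) assignment this is unbiased for the SATE.
rng = np.random.default_rng(0)
z = np.where(rng.random(1000) < 0.3, 1, -1)
y = 1.0 * (z == 1) + rng.normal(size=1000)   # true treatment effect of 1
print(diff_in_means(y, z, q=0.3))            # approximately 1
```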
3.2. Balance vs. Robustness

The fully adversarial case represents an extreme form of robustness. In the face of such an adversary there is no strategy more effective than complete randomization, as in the No Free Lunch Theorem of Kallus (2018). As such, it is not possible to reduce imbalance by departing from complete randomization without incurring a large cost in terms of robustness. In the oblivious adversarial setting, however, the best available algorithms for discrepancy minimization incorporate randomization (Alweiss et al., 2020; Dwivedi and Mackey, 2021), but with substantially less entropy than in the fully adversarial environment: by modulating their assignment probabilities based on the current state of imbalance, they achieve $O(\log(nd/\delta))$ discrepancy with probability $1-\delta$. In this setting, it is not necessary to purely randomize to ensure sufficient robustness; rather, it is only necessary to incorporate enough randomization that an adversary cannot recover too much information about the state of the system at any point in time. It is this setting which incurs a substantive tradeoff between imbalance minimization and robustness (Harshaw et al., 2020): too much randomization will result in greater imbalance (and, therefore, variance), while too little will result in the potential for exploitation by the adversary. The algorithmic task is to assign units such that imbalance is reduced, while randomizing enough that it is impossible to infer the algorithm's state at any time. Our approach incorporates a parameter, φ, to manage the tradeoff between imbalance reduction and worst-case error. This allows the practitioner to choose how to manage this tradeoff based on their preferences between the two.

4. Weighted Online Discrepancy Minimization

Our approach to online experimental design uses weighted online discrepancy minimization. In what follows, we first describe our procedure, a variant of prior work (Alweiss et al., 2020; Dwivedi and Mackey, 2021) that accommodates arbitrary marginal treatment probabilities, before describing its application and implications for experimental design.

We consider a variant of the online Komlós problem (Spencer, 1977), where vectors $x_1, \ldots, x_n$ arrive one by one and each must be immediately assigned a weighted sign of either $-2q$ or $2(1-q)$, for $0 < q < 1$, such that the weighted discrepancy $\lVert \sum_{i=1}^{n} \eta_i x_i \rVert$, where $\eta_i$ is the weighted sign given to $x_i$, is minimized. Notice that when $q = 1/2$, the signs become $\pm 1$.

Algorithm 1 takes unit vectors $i = 1, \ldots, n$ sequentially and assigns each to treatment or control, represented by the value of $\eta_i$. The procedure, an extension of recent work in online discrepancy minimization (Alweiss et al., 2020; Dwivedi and Mackey, 2021), assigns treatment with probability proportional to the inner product between the new vector and a running sum of the signed prior observations. The algorithm and analysis differ from prior work on discrepancy in two aspects which are necessary for use in experimentation:

1. We provide a ridge regression guarantee, by characterizing the random vectors output by the algorithm in terms of the projection matrix $P_i = X_i^\top (X_i X_i^\top)^{-1} X_i$, where $X_i$ is the $i \times d$ submatrix of $X$ corresponding to covariates $\{x_1, \ldots, x_i\}$. This can be seen as an online analogue of what Harshaw et al. (2020) provide for offline discrepancy minimization.

2. We generalize the discrepancy minimization algorithm of Dwivedi and Mackey (2021) to allow for arbitrary marginal probabilities q.
This allows for the case of imbalanced treatment assignments. A straightforward adaptation of the analysis in Dwivedi and Mackey (2021) to this case results in a worse dependence on 1/q. We derive a sub-exponential concentration bound and obtain a better dependence on 1/q.

Algorithm 1 takes each input vector $x_i$ and assigns it a sign $\eta_i \in \{-2q,\, 2(1-q)\}$ online so as to maintain low weighted discrepancy with probability $1-\delta$:

Algorithm 1 (Balancing Walk).
  Input: $x_1, \ldots, x_n$, $q$, $\delta$
  $c \leftarrow \min(1/q,\, 9.3)\,\log(2n/\delta)$; $w_0 \leftarrow 0$
  for $i$ from 1 to $n$ do
    if $|w_{i-1}^\top x_i| > c$ then fail ($\eta_i$ is left undefined)
    $\eta_i \leftarrow 2(1-q)$ with probability $q\,(1 - w_{i-1}^\top x_i/c)$, and $\eta_i \leftarrow -2q$ otherwise
    $w_i \leftarrow w_{i-1} + \eta_i x_i$
  end for
  Output: $\eta$, $w$

Our main theorem relies on two non-standard definitions of sub-Gaussian and sub-exponential random vectors. We provide each definition below before introducing our main results.

Definition 1 (Sub-Gaussian). A mean-zero random vector $w$ is $(\sigma, P)$ sub-Gaussian if for all unit vectors $u$ and all $\lambda \in \mathbb{R}$,
$$\mathbb{E}\big[\exp(\lambda\, w^\top u)\big] \le \exp\!\left(\frac{\lambda^2 \sigma^2\, u^\top P u}{2}\right).$$

Definition 2 (Sub-exponential). A mean-zero random vector $w$ is $(\nu, \alpha, P)$ sub-exponential if for all unit vectors $u$ and all $|\lambda| \le \frac{1}{\alpha\sqrt{u^\top P u}}$,
$$\mathbb{E}\big[\exp(\lambda\, w^\top u)\big] \le \exp\!\left(\frac{\lambda^2 \nu^2\, u^\top P u}{2}\right).$$

Theorem 1 (Main). Let $w_1, \ldots, w_n$ be as in Algorithm 1, and let $A = 0.5803$, $B = 0.4310$, and $\alpha = 2/B$. Then:
1. $w_i$ is a mean-zero $(\sqrt{c/2q},\, P_i)$ sub-Gaussian random vector.
2. $w_i$ is a mean-zero $(\sqrt{8Ac},\, \alpha,\, P_i)$ sub-exponential random vector.
3. With probability $1-\delta$, for all $i$, $|w_{i-1}^\top x_i| \le c$.

Note that $\eta_i$ is defined only when $|w_{i-1}^\top x_i| \le c$. Therefore, $\eta$ is defined with probability at least $1-\delta$.

Remark 1. If $w_i$ is a mean-zero $(\sigma, P_i)$ sub-Gaussian random vector, then $\mathrm{Cov}(w_i) \preceq \sigma^2 P_i$. Similarly, if $w_i$ is a mean-zero $(\nu, \alpha, P_i)$ sub-exponential random vector, then $\mathrm{Cov}(w_i) \preceq \frac{3}{2}\nu^2 P_i$. A proof is given in Lemma 6.

We will use Theorem 1 to derive results on the average treatment effect using the framework developed by Harshaw et al. (2020).

5. Online Experimental Design

We now describe how the results developed in Section 4 can be applied to experimental design, and describe the implications in terms of the efficacy of the estimate. Let $z_i = \eta_i + 2q - 1 \in \{-1, 1\}$ denote treatment status, with 1 indicating treatment and $-1$ indicating control. We will show that the expected imbalance is zero, i.e., $\mathbb{E}[\eta_i] = 0$, and so we have marginal treatment probability $\mathbb{E}[z_i] = 2q - 1$ and $z - \mathbb{E}[z] = \eta$. Let
$$\mu = \frac{Y(1)}{4q} + \frac{Y(0)}{4(1-q)}.$$
Harshaw et al. (2020) give a linear-algebraic expression for the error of HT-estimators in terms of µ. In particular, they show:

Lemma 1 (Lemma A2 and Corollary A1 in Harshaw et al. (2020)). Let $\hat\tau = \frac{1}{n}\sum_{i=1}^n \frac{A_i}{p(A_i)} Y_i$ be the empirical estimate of the average treatment effect and $\tau$ its population counterpart. We have
$$\hat\tau - \tau = \frac{2}{n}\,(z - \mathbb{E}[z])^\top \mu, \qquad \mathrm{Var}(\hat\tau) = \mathbb{E}\big[(\hat\tau - \tau)^2\big] = \frac{4}{n^2}\, \mu^\top \mathrm{Cov}(z)\, \mu.$$

5.1. Controlling the Balance/Robustness Tradeoff

Theorem 1 immediately implies that assignments generated by Algorithm 1 are well balanced. But optimizing just for balance can lead to accidental bias (Efron, 1971). We have from Lemma 1 that $\mathrm{Var}(\hat\tau) = \frac{4}{n^2}\, \lambda_{\max}(\mathrm{Cov}(z))\, \lVert\mu\rVert^2$ in the worst case, when µ is along the top eigenvector of $\mathrm{Cov}(z)$. Therefore, to control accidental bias we need to ensure $\lambda_{\max}(\mathrm{Cov}(z))$ is not large. We achieve this by augmenting the original covariates $x_i$ with $\sqrt{\varphi}\, e_i$ to get $(\sqrt{\varphi}\, e_i,\ \sqrt{1-\varphi}\, x_i)$, where $\varphi \in (0, 1)$ is a parameter which controls the extent of the covariate balance, and $e_i$ is a basis vector in dimension n. By a simple calculation, running Algorithm 1 on $x_i$ augmented with $\sqrt{\varphi}\, e_i$ is equivalent to running it with $x_i$ alone and replacing $w_{i-1}^\top x_i$ by $(1-\varphi)\, w_{i-1}^\top x_i$ everywhere (this shortcut appears in the sketch below).
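To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1 (ours, not the authors' implementation). The implicit augmentation is applied by scaling $w_{i-1}^\top x_i$ by $(1-\varphi)$; the `on_fail` switch anticipates the restart variant of Section 5.3 and the BWDRandom fallback used in Section 6, and the function name and signature are our own:

```python
import numpy as np

def balancing_walk(X, q=0.5, delta=0.05, phi=0.0, on_fail="restart", seed=None):
    """Sketch of Algorithm 1 (Balancing Walk Design).

    Assumes rows of X are scaled to norm at most one. The phi-augmentation of
    Section 5.1 is applied implicitly by scaling w'x_i by (1 - phi). on_fail
    selects the Section 5.3 restart ("restart") or the BWDRandom fallback to
    simple randomization ("random"). Returns z in {-1, +1} with P(z_i = 1) = q.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    c = min(1.0 / q, 9.3) * np.log(2 * n / delta)
    w = np.zeros(d)
    eta = np.zeros(n)
    randomizing = False                     # set once the BWDRandom fallback triggers
    for i, x in enumerate(X):
        align = 0.0 if randomizing else (1 - phi) * (w @ x)
        if abs(align) > c:
            if on_fail == "restart":        # start a fresh instance (Section 5.3)
                w = np.zeros(d)
            else:                           # simple randomization from here on
                randomizing = True
            align = 0.0
        p_plus = q * (1 - align / c)        # P(eta_i = 2(1 - q)); equals q when align = 0
        eta[i] = 2 * (1 - q) if rng.random() < p_plus else -2 * q
        if not randomizing:
            w += eta[i] * x
    return eta + 2 * q - 1, eta             # z_i = eta_i + 2q - 1 in {-1, +1}
```

With q = 1/2 and φ = 0, the assignment probability reduces to $\frac{1}{2} - \frac{w_{i-1}^\top x_i}{2c}$, which is essentially the self-balancing walk of Alweiss et al. (2020).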
We therefore do not have to explicitly augment the covariates in the algorithm. We note that with augmented covariates, $w_n = (\sqrt{\varphi}\,\eta,\ \sqrt{1-\varphi}\,X^\top\eta)$ is a sub-Gaussian or sub-exponential random vector as in Theorem 1.

The parameter φ controls the worst-case behavior of the assignment process. In the fully general case, when no assumptions are placed on the potential outcomes, it is known that complete randomization is minimax optimal. However, as our results and others show, with additional assumptions substantial variance reduction can be achieved. The largest eigenvalue of the covariance matrix of treatments quantifies the extent to which bias can occur from this misspecification. The function of φ in our setting (and also in Harshaw et al. (2020)) is to balance between reducing this worst-case bias and improving the precision of effect estimates by minimizing imbalance. We will see in Proposition 6 that the worst-case bias becomes unbounded as the algorithm focuses solely on balance, while the improvement from balancing given in Proposition 7 has the opposite dependence on φ.

We now describe the properties of treatment effect estimates generated using the weighted online design. Our main guarantees are within the context of an implicit ridge regression, and so we define
$$Q = \big(\varphi I + (1-\varphi)\, X X^\top\big)^{-1}.$$

Proposition 1. When Algorithm 1 is run with augmented covariates as described above, $\eta = z - \mathbb{E}[z]$ is a mean-zero $(\sqrt{c/2q},\, Q)$ sub-Gaussian random vector, and η is also a mean-zero $(\sqrt{8Ac},\, \alpha,\, Q)$ sub-exponential random vector, where $A = 0.5803$ as in Theorem 1.

The consequence of Proposition 1 is that $\mathbb{E}[\eta_i] = 0$ for all i, as given below.

Corollary 1 (Unbiasedness). When Algorithm 1 is run with augmented covariates, we have, with probability at least $1-\delta$,² $\mathbb{E}\big[\sum_i x_i \eta_i\big] = 0$.

²Here and in other places, $1-\delta$ is a lower bound on the probability that all $\eta_i$ are defined in Algorithm 1. The expectations are conditional on this event.

We now show how these bounds can be applied to control the balance/robustness tradeoff by bounding the covariance $\mathrm{Cov}(z)$ in the Loewner order. For the remainder of the paper, we let $\sigma^2 = c/2q$ if $c = \log(2n/\delta)/q$ and $\sigma^2 = 12Ac$ if $c = 9.3\log(2n/\delta)$, where $A = 0.5803$ as in Theorem 1.

Corollary 2 (Eigenvalues of Treatment Covariance). With probability at least $1-\delta$, Algorithm 1 produces η satisfying $\mathrm{Cov}(z) = \mathrm{Cov}(\eta) \preceq \sigma^2 Q$.

Proposition 2 gives a bound on covariate discrepancy at the end of the experiment.

Proposition 2. Let $w = \sum_i \eta_i x_i$. With probability at least $1-\delta$,
$$\lVert w \rVert \le O\!\left(\min\!\left(\frac{1}{q},\, 9.3\right)\sqrt{\frac{d\,\log(4d/\delta)\,\log(4n/\delta)}{1-\varphi}}\right).$$

5.2. Error bounds

Our final set of results provides bounds on the concentration and worst-case mean squared error of the treatment effect estimate. As noted above, we do so by bounding our estimate by an implicit ridge regression.

Proposition 3 (Concentration of ATE). Algorithm 1, when run with augmented covariates, generates a random assignment z such that, with probability $1-\delta$,
$$|\hat\tau - \tau| \le \frac{2c}{n}\sqrt{\mu^\top Q \mu}.$$

Lemma A10 in Harshaw et al. (2020) shows that
$$\mu^\top Q \mu = \min_{\beta \in \mathbb{R}^d}\left[\frac{1}{\varphi}\,\big\lVert \mu - \sqrt{1-\varphi}\, X\beta \big\rVert_2^2 + \lVert \beta \rVert_2^2\right].$$

Proposition 4. The worst-case mean squared error of the online balancing walk design is upper bounded, with probability $1-\delta$, by
$$\mathbb{E}\big[(\hat\tau - \tau)^2\big] \le \frac{4\sigma^2}{\varphi n^2}\sum_{i=1}^{n} \mu_i^2, \qquad \varphi \in (0, 1).$$

Proof of Proposition 4. This follows from Lemma 1 and Corollary 2, noting that $Q \preceq \varphi^{-1} I$.

Proposition 5 (Ridge Connection). The worst-case mean squared error of the online balancing walk design is upper bounded by that of an implicit ridge regression estimator with regularization proportional to φ. That is, with probability $1-\delta$,
$$\mathbb{E}\big[(\hat\tau - \tau)^2\big] \le \frac{4\sigma^2}{n}\, L, \qquad \text{where } L = \min_{\beta \in \mathbb{R}^d}\left[\frac{1}{\varphi n}\,\big\lVert \mu - \sqrt{1-\varphi}\, X\beta \big\rVert_2^2 + \frac{1}{n}\lVert \beta \rVert_2^2\right].$$
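For intuition, the chain behind Propositions 4 and 5 is short; the following display (our rendering of the steps, combining Lemma 1, Corollary 2, Lemma A10, and $Q \preceq \varphi^{-1} I$) makes the role of φ explicit:

```latex
\mathbb{E}\big[(\hat\tau-\tau)^2\big]
  = \frac{4}{n^2}\,\mu^\top \mathrm{Cov}(z)\,\mu           % Lemma 1
  \le \frac{4\sigma^2}{n^2}\,\mu^\top Q\,\mu               % Corollary 2
  = \frac{4\sigma^2}{n}\,L                                 % Lemma A10
  \le \frac{4\sigma^2}{\varphi n^2}\,\lVert\mu\rVert_2^2.  % Q \preceq \varphi^{-1} I
```

Sending φ toward 1 recovers a guarantee of the same order as simple randomization, while sending φ toward 0 shrinks the implicit ridge penalty at the price of an unbounded worst-case factor of 1/φ.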
We note that Proposition 5 closely mirrors the results obtained by Harshaw et al. (2020) in the offline case, with an additional penalty (σ²) incurred by using an online procedure.

5.3. Algorithm with Restart

We saw earlier that η and z are defined only with probability $1-\delta$: with our choice of c, only with probability $1-\delta$ do we have $|w_{i-1}^\top x_i| \le c$ for all i. This means our treatment assignment fails with probability δ. There is a simple way to ensure that the algorithm never fails while retaining the same error bound guarantees, with a slightly worse constant.

To do so, we slightly modify Algorithm 1 so that whenever $|w_{i-1}^\top x_i| > c$ for a particular i, we start a new instance of the algorithm for covariates $x_{i+1}, \ldots, x_n$. This is equivalent to setting $w_{i-1} = 0$ and continuing with the algorithm. Since for any treatment assignment procedure $\mathbb{E}[(\hat\tau - \tau)^2]$ depends only on $\mathrm{Cov}(z)$, and Algorithm 1 fails with probability δ, we can show that
$$\mathrm{Cov}(z) \preceq (1-\delta)\,\sigma^2 Q + \delta\,\sigma^2 Q + \delta^2\,\sigma^2 Q + \cdots \preceq 2\sigma^2 Q$$
when $\delta \le 1/2$. Therefore, for the modified algorithm, we have error guarantees as in Propositions 4 and 5 (worse by at most a factor of 2), and they hold with probability one.

5.4. Extension to Multiple Treatments

In this section, we consider an online multi-treatment setting, where each vector is assigned to a group in $M = \{m_1, m_2, \ldots, m_k\}$ immediately on arrival. For each $1 \le i \le k$, group $m_i$ is associated with a weight $\alpha_i$. The goal is to minimize the multi-treatment discrepancy:
$$\max_{m_1, m_2 \in M}\; \frac{2\,\big\lVert s(m_1)/\alpha_1 - s(m_2)/\alpha_2 \big\rVert}{1/\alpha_1 + 1/\alpha_2},$$
where $s(m)$ is the sum of all vectors assigned to treatment m. Notice that by setting $\alpha_1 = \frac{1}{1-q}$ and $\alpha_2 = \frac{1}{q}$, we recover the definition given for the weighted discrepancy between two treatments $m_1$ and $m_2$.

Our algorithm can leverage any oracle (we call it Binary Balance in Algorithm 2) that minimizes the weighted discrepancy for two treatments; our results are obtained by using Algorithm 1. We first build a binary tree where each leaf of the tree corresponds to one of the k treatments in M. Let h be the smallest integer such that $2^h \ge k$. We start with a complete binary tree of height h, and then remove $2^h - k$ leaves from the tree such that no two siblings are removed. Note that this is possible by the definition of h. We further contract each internal node with only one child into its child. This process does not change the number of leaves in the tree. Let T be the resulting tree. By construction, all internal nodes of T have two children and T has exactly k leaves. We then associate each leaf of T with a treatment in M. For each vector assigned to treatment $m_i$, we also say that it is assigned to the leaf corresponding to $m_i$, $1 \le i \le k$.

For each node $v \in T$, denote by $s(v)$ the sum of all vectors assigned to leaves under v. In addition, let $\alpha(v)$ be the sum of all weights assigned to leaves under v. For each internal node v of T, the weighted discrepancy vector at v is defined as
$$w(v) = \frac{\alpha(v_r)}{\alpha(v_l) + \alpha(v_r)}\, s(v_l) - \frac{\alpha(v_l)}{\alpha(v_l) + \alpha(v_r)}\, s(v_r),$$
where $v_l$ and $v_r$ are the left and right children of v, respectively. For each internal node v in T, we maintain an independent run of a two-treatment algorithm that minimizes $\lVert w(v) \rVert$. At a high level, we minimize the weighted discrepancies at all internal nodes simultaneously. When a new vector x arrives, we first feed it to the algorithm at the root r. If the result is +, we continue with the left subtree of r; otherwise, we go to the right subtree. We continue in that manner until we reach a leaf l, and x is then assigned to l (and the treatment associated with l). A schematic sketch of this routing is given below; the formal procedure is Algorithm 2.
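The following Python sketch (ours, not the authors' code) illustrates the tree routing; `BinaryBalance` stands in for the two-treatment oracle, here a scalar-threshold variant of Algorithm 1, and all class and parameter names are our own:

```python
import numpy as np

class BinaryBalance:
    """Two-group oracle for one internal node: Algorithm 1 with weighted signs.
    q = alpha(left) / (alpha(left) + alpha(right)) is the marginal probability
    of routing left; c is the threshold from Algorithm 1 (restart on overflow)."""
    def __init__(self, q, c, d):
        self.q, self.c = q, c
        self.w = np.zeros(d)

    def route(self, x, rng):
        a = self.w @ x
        if abs(a) > self.c:              # Section 5.3 restart: reset this node's state
            self.w = np.zeros_like(self.w)
            a = 0.0
        go_left = rng.random() < self.q * (1 - a / self.c)
        eta = 2 * (1 - self.q) if go_left else -2 * self.q
        self.w += eta * x
        return go_left

class Node:
    """Internal tree node; `left`/`right` are Nodes or treatment labels (leaves)."""
    def __init__(self, left, right, alpha_left, alpha_right, c, d):
        self.left, self.right = left, right
        self.oracle = BinaryBalance(alpha_left / (alpha_left + alpha_right), c, d)

def k_group_balance(X, root, seed=0):
    """Route each arriving covariate vector from the root to a leaf."""
    rng = np.random.default_rng(seed)
    out = []
    for x in X:
        node = root
        while isinstance(node, Node):
            node = node.left if node.oracle.route(x, rng) else node.right
        out.append(node)                 # node is now a treatment label
    return out

# Example: four equally weighted treatments on unit-norm covariates.
n, d, k, delta = 1000, 5, 4, 0.05
X = np.random.default_rng(1).normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
c = 2 * np.log(2 * n * k / delta)        # q = 1/2 per node; delta/k per Theorem 2
root = Node(Node("m1", "m2", 1, 1, c, d), Node("m3", "m4", 1, 1, c, d), 2, 2, c, d)
print(k_group_balance(X, root)[:10])
```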
Theorem 2. Let Binary Balance be Algorithm 1. Then Algorithm 2 obtains
$$O\!\left(\frac{\log k}{q}\sqrt{\frac{d\,\log(dk/\delta)\,\log(nk/\delta)}{1-\varphi}}\right)$$
multi-treatment discrepancy with probability $1-\delta$.

Algorithm 2 (KGroup Balance) takes each input vector $x_i$ and assigns it to one of the k groups online so as to maintain low discrepancy with probability $1-\delta$:

Algorithm 2 (KGroup Balance).
  Input: $x$, $k$, $\alpha$
  $h \leftarrow$ smallest integer such that $2^h \ge k$
  $T \leftarrow$ complete binary tree of height h
  Remove $2^h - k$ leaves from T such that no two siblings are removed
  Associate each treatment with a leaf of T
  Contract each internal node of T with only one child into its child
  for each node v in T do
    $\alpha(v) \leftarrow$ sum of the weights of the groups associated with leaves under v
  end for
  for each internal node v of T do
    Instantiate a Binary Balance(v) oracle for the weighted discrepancy problem at v, with weighted signs $\frac{\alpha(v_r)}{\alpha(v_l)+\alpha(v_r)}$ and $-\frac{\alpha(v_l)}{\alpha(v_l)+\alpha(v_r)}$
  end for
  for $i$ from 1 to $n$ do
    $v \leftarrow$ root of T
    while v is an internal node of T do
      Feed $x_i$ to Binary Balance(v)
      $v \leftarrow$ the child of v given by the assignment of Binary Balance(v) on input $x_i$
    end while
    Assign $x_i$ to the group corresponding to v
  end for

6. Experiments

In this section, we provide simulation evidence on the efficacy of our proposed methods. In particular, we use a wide variety of data generating processes, many of which do not assume that units arrive i.i.d., as is often standard in simulation settings for this problem. Subjects are unlikely to arrive truly i.i.d. in the real world; earlier arrivals will typically be more active than late-arriving units, for example. All data generating processes used in the simulations are shown in Table A1. If not otherwise specified, the sample size is 1000 subjects, the number of groups is two, and the marginal probability of treatment is 1/2.

[Figure 1. Time to design. All timings performed on an ml.r5.2xlarge instance of Amazon SageMaker. The y-axis is square-root scaled for easier visualization.]

Methods compared in the simulations are simple randomization (Bernoulli coin flips), complete randomization (fixed-margins randomization), the biased coin designs of Efron (1971) and of Smith (1984), QuickBlock of Higgins et al. (2016), and Alweiss et al. (2020). These are compared to our proposed methods, which are generalized versions of the discrepancy minimization procedure of Dwivedi and Mackey (2021). We provide three versions of our proposed algorithms, called BWD for "Balancing Walk Design". The most basic version (BWDRandom) reverts to simple random assignment for all remaining periods once $|w_{i-1}^\top x_i| > c$. Our preferred approach restarts the algorithm in this case, as described in Section 5.3. We examine this with two levels of robustness φ: 0 (purely balancing) and 0.5 (a uniform mix between randomization and balancing), denoted BWD(0) and BWD(0.5) respectively. For further details, see Section A.4.

In these comparisons, BWD need not outperform all methods in all data-generating processes. QuickBlock, for instance, is a fully offline method, so comparable performance by an online method is noteworthy. In general, Alweiss and BWD will be most effective when the true relationship between covariates and outcome is linear, since they seek linear balance. All plots incorporate 95% confidence intervals.

6.1. Binary treatment

Timing. Our proposed method (BWD) is highly efficient, scaling substantially better than other balancing methods.
This analysis of runtime directly compares online methods to a widely used offline balancing method (QuickBlock). It is important to note that the QuickBlock algorithm cannot be used in the online setting, even if its runtime did not make that prohibitive. While QuickBlock is O(n log n), the proposed online balancing methods are all linear-time. Given QuickBlock's runtime, the following simulations include it only for comparisons up to sample sizes of 10⁴.

MSE. Next, we demonstrate in Figure 2 how imbalance minimization translates into improved estimation of causal effects, measured by the mean squared error of our estimate of the average treatment effect. We normalize this graph by n, the rate of convergence of the difference-in-means estimator under simple randomization.

[Figure 2. MSE. BWD provides effective variance reduction across a wide array of simulation environments. Panels by DGP; MSE vs. sample size for each method.]

The results depend strongly on the true nature of the data-generating process. In short, on non-linear data generating processes, offline blocking performs better than anything else, but in many settings BWD converges to similar error rates as QuickBlock. On linear or near-linear data-generating processes, our proposed algorithms perform very strongly, outperforming QuickBlock even in small sample sizes. When there is a break from the purely i.i.d. stochastic setting (such as the Linear Drift DGP and Linear Season DGP), BWD behaves well, as expected. When φ is chosen to be 0.5 (the line marked BWD(0.5)), representing a desire for more emphasis on robustness than on pure imbalance reduction, the reduction in MSE is more similar to full imbalance minimization than to complete randomization.

6.2. Multiple Treatments

MISE. BWD gracefully extends to the multiple-treatment setting, which we demonstrate in Figure 3. This chart measures the mean integrated squared error of our estimates of the ATEs (relative to a single control group).

[Figure 3. MISE. BWD effectively reduces variance no matter the number of treatments. Panels by DGP; MISE vs. number of treatments.]

Figure 3 shows the results. BWD consistently outperforms existing online assignment methods by substantial margins. An array of additional simulation results may be found in Appendix B.

7. Conclusion

Experiments are a crucial part of how humans learn about the world and make decisions. This paper is aimed at providing a way to more effectively run experiments in the online setting. Practitioners must commonly operate their experiments in this environment but, due to the lack of suitable options for design, fall back to simple randomization as the assignment mechanism. In this paper, we have shown how the Balancing Walk Design can be an effective tool in this setting. It is efficient, effective at reducing imbalance (and, therefore, the variance of resulting causal estimates), robust, and fully suited to the particularities of online treatment assignment. Simulations have shown it to work well across a range of diverse settings. The Balancing Walk Design can improve the practice of large-scale online experimentation.

References

R. Alweiss, Y. P. Liu, and M. Sawhney.
Discrepancy minimization via a self-balancing walk. arXiv preprint arXiv:2006.14009, 2020.

D. Arbour, D. Dimmery, and A. Rao. Efficient balanced treatment assignments for experimentation. In International Conference on Artificial Intelligence and Statistics, pages 3070–3078. PMLR, 2021.

P. Aronow, J. M. Robins, T. Saarinen, F. Sävje, and J. Sekhon. Nonparametric identification is not enough, but randomized controlled trials are. arXiv preprint arXiv:2108.11342, 2021.

P. M. Aronow and J. A. Middleton. A class of unbiased estimators of the average treatment effect in randomized experiments. Journal of Causal Inference, 1(1):135–154, 2013. doi:10.1515/jci-2012-0009.

E. M. Azevedo, A. Deng, J. L. Montiel Olea, J. Rao, and E. G. Weyl. A/B testing with fat tails. Journal of Political Economy, 128(12):4614–000, 2020.

A. Baldi Antognini and M. Zagoraiou. The covariate-adaptive biased coin design for balancing clinical trials in the presence of prognostic factors. Biometrika, 98(3):519–535, 2011.

D. Bertsimas, M. Johnson, and N. Kallus. The power of optimization over randomization in designing experiments involving small samples. Operations Research, 63(4):868–876, 2015.

C. Cavaille. Implementing blocked randomization in online survey experiments, 2018.

A. Deng, Y. Xu, R. Kohavi, and T. Walker. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 123–132, 2013.

D. Dimmery, E. Bakshy, and J. Sekhon. Shrinkage estimators in online experiments. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2914–2922, 2019.

R. Dwivedi and L. Mackey. Kernel thinning. arXiv preprint arXiv:2105.05842, 2021.

B. Efron. Forcing a sequential experiment to be balanced. Biometrika, 58(3):403–417, 1971.

N. Egami and E. Hartman. Elements of external validity: Framework, design, and analysis. SSRN, 2020.

R. A. Fisher. The Design of Experiments. Oliver and Boyd, Edinburgh, 1935.

R. Greevy, B. Lu, J. H. Silber, and P. Rosenbaum. Optimal multivariate matching before randomization. Biostatistics, 5(2):263–275, 2004.

C. Harshaw, F. Sävje, D. Spielman, and P. Zhang. Balancing covariates in randomized experiments using the Gram-Schmidt walk, 2020.

M. J. Higgins, F. Sävje, and J. S. Sekhon. Improving massive experiments with threshold blocking. Proceedings of the National Academy of Sciences, 113(27):7369–7376, 2016.

D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.

K. Imai and M. Ratkovic. Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):243–263, 2014.

K. Imai, G. King, and E. A. Stuart. Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society: Series A (Statistics in Society), 2008.

N. Kallus. Balanced policy evaluation and learning. arXiv preprint arXiv:1705.07384, 2017.

N. Kallus. Optimal a priori balance in the design of controlled experiments. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):85–112, 2018.

R. Kohavi, A. Deng, B. Frasca, R.
Longbotham, T. Walker, and Y. Xu. Trustworthy online controlled experiments: Five puzzling outcomes explained. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 786–794. ACM, 2012.

X. Li, P. Ding, and D. B. Rubin. Asymptotic theory of rerandomization in treatment-control experiments. Proceedings of the National Academy of Sciences, 115(37):9157–9162, 2018.

W. Lin. Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique. The Annals of Applied Statistics, 7(1):295–318, 2013.

L. W. Miratrix, J. S. Sekhon, and B. Yu. Adjusting treatment effect estimates by post-stratification in randomized experiments. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(2):369–396, 2013.

R. T. Moore and S. A. Moore. Blocking for sequential political experiments. Political Analysis, 21(4):507–523, 2017. doi:10.1093/pan/mpt007.

K. Munger, I. Gopal, J. Nagler, and J. A. Tucker. Accessibility and generalizability: Are social media effects moderated by age or digital literacy? Research & Politics, 8(2), 2021. doi:10.1177/20531680211016968.

J. S. Sekhon. Opiates for the matches: Matching methods for causal inference. Annual Review of Political Science, 12:487–508, 2009.

R. L. Smith. Sequential treatment allocation using biased coin designs. Journal of the Royal Statistical Society: Series B (Methodological), 46(3):519–543, 1984.

J. Spencer. Balancing games. Journal of Combinatorial Theory, Series B, 23(1):68–74, 1977.

A. A. Tsiatis, M. Davidian, M. Zhang, and X. Lu. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Statistics in Medicine, 27(23):4658–4677, 2008. doi:10.1002/sim.3113.

M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019. doi:10.1017/9781108627771.

Q. Zhou, P. A. Ernst, K. L. Morgan, D. B. Rubin, and A. Zhang. Sequential rerandomization. Biometrika, 105(3):745–752, 2018. doi:10.1093/biomet/asy031.

A.1. Proof of Theorem 1

Proposition 6 (Loewner Order). Let $M \succeq 0$, $C \ge 1$, $L \ge 0$. If $M \preceq CLP := CL\, B^\top (BB^\top)^{-1} B$, then for any vector $v \in \mathbb{R}^n$ with $\lVert v \rVert_2 \le 1$,
$$M' = \big(I - C^{-1}vv^\top\big)\, M\, \big(I - C^{-1}vv^\top\big) + Lvv^\top$$
satisfies
$$0 \preceq M' \preceq CLP' := CL\left(P + \frac{(I-P)\,vv^\top\,(I-P)}{\lVert (I-P)v \rVert^2}\right).$$

Proof. By the definition of $M'$ and the assumption $M \preceq CLP$, we have
$$M' \preceq CL\,\big(I - C^{-1}vv^\top\big)\, P\, \big(I - C^{-1}vv^\top\big) + Lvv^\top.$$
Therefore, it is sufficient to prove
$$\big(I - C^{-1}vv^\top\big)\, P\, \big(I - C^{-1}vv^\top\big) + C^{-1}vv^\top \preceq P + \frac{(I-P)\,vv^\top\,(I-P)}{\lVert (I-P)v \rVert^2}.$$
Also note that, since $\lVert v \rVert \le 1$ and $C \ge 1$, we can absorb C into v (by taking $v := v/\sqrt{C}$) and therefore, without loss of generality, assume that $C = 1$ and $\lVert v \rVert \le 1$. Define
$$P_x = B^\top (BB^\top)^{-1} B, \quad Q_x := I - P_x, \quad P_v := vv^\top, \quad Q_v := I - vv^\top,$$
$$a := P_x v, \quad b := Q_x v, \quad \alpha := \lVert a \rVert^2, \quad \beta := \lVert b \rVert^2.$$
We want to show
$$Q_v P_x Q_v + P_v \preceq P_x + \frac{Q_x P_v Q_x}{\beta}.$$
Begin by rewriting the left-hand side:
$$Q_v P_x Q_v + P_v = (I - P_v)\,P_x\,(I - P_v) + P_v = P_x - P_v P_x - P_x P_v + P_v P_x P_v + P_v = P_x - v a^\top - a v^\top + \alpha\, vv^\top + P_v.$$
Expanding $P_v$ as $P_v = (a+b)(a+b)^\top = aa^\top + bb^\top + ab^\top + ba^\top$ gives
$$P_x - (a+b)a^\top - a(a+b)^\top + (1+\alpha)\big(aa^\top + bb^\top + ab^\top + ba^\top\big) = P_x + (\alpha - 1)aa^\top + \alpha\big(ab^\top + ba^\top\big) + (1+\alpha)bb^\top.$$
Since $\lVert v \rVert^2 \le 1$, we have $\alpha + \beta \le 1$.
Now, considering the difference between the right- and left-hand sides after multiplying both sides by β, we arrive at
$$\beta(\mathrm{RHS} - \mathrm{LHS}) = bb^\top\underbrace{\big(1 - \beta(1+\alpha)\big)}_{=\,1-\beta-\beta\alpha\,\ge\,\alpha(1-\beta)\,\ge\,\alpha^2} +\; \beta(1-\alpha)\,aa^\top - \alpha\beta\big(ab^\top + ba^\top\big) \succeq \alpha^2 bb^\top + \beta^2 aa^\top - \alpha\beta\big(ab^\top + ba^\top\big) = (\alpha b - \beta a)(\alpha b - \beta a)^\top \succeq 0.$$

Lemma 2. When $|w_{i-1}^\top x_i| < c$, we have $\mathbb{E}[\eta_i] = -\frac{2q\, w_{i-1}^\top x_i}{c}$.

Proof.
$$\mathbb{E}[\eta_i] = 2(1-q)\cdot q\Big(1 - \frac{w_{i-1}^\top x_i}{c}\Big) + (-2q)\Big(1 - q\Big(1 - \frac{w_{i-1}^\top x_i}{c}\Big)\Big) = 0 + 2(1-q)\Big(-\frac{q\, w_{i-1}^\top x_i}{c}\Big) + (-2q)\,\frac{q\, w_{i-1}^\top x_i}{c} = -\frac{2q\, w_{i-1}^\top x_i}{c}.$$

For all i and $u \in \mathbb{R}^d$, we have
$$\langle w_i, u\rangle = \langle w_{i-1}, u\rangle - \frac{2q\, w_{i-1}^\top x_i}{c}\, x_i^\top u + \epsilon_i\, x_i^\top u = \langle w_{i-1},\, Q_{x_i,c}\,u\rangle + \epsilon_i\, x_i^\top u,$$
where $\epsilon_i = 0$ if $|w_{i-1}^\top x_i| > c$ and $\epsilon_i = \eta_i - \mathbb{E}[\eta_i]$ otherwise, and $Q_{x_i,c} = I - \frac{2q}{c}\, x_i x_i^\top$.

Consider the case when $|w_{i-1}^\top x_i| \le c$ and $\epsilon_i = \eta_i - \mathbb{E}[\eta_i]$. Let $\tilde q = q\,(1 - w_{i-1}^\top x_i/c)$. Note that $0 \le \tilde q \le 2q$. We have
$$\epsilon_i = \begin{cases} 2(1 - \tilde q) & \text{with probability } \tilde q, \\ -2\tilde q & \text{with probability } 1 - \tilde q. \end{cases}$$

Definition 3 (Sub-Gaussian). A mean-zero random variable X is sub-Gaussian with parameter σ if for all $\lambda \in \mathbb{R}$, $\mathbb{E}[\exp(\lambda X)] \le \exp(\lambda^2\sigma^2/2)$. A mean-zero random vector w is $(\sigma, P)$ sub-Gaussian if for all unit vectors u and $\lambda \in \mathbb{R}$, $\mathbb{E}[\exp(\lambda\, w^\top u)] \le \exp(\lambda^2\sigma^2\, u^\top P u / 2)$. In particular, $w^\top u$ is $\sigma'$ sub-Gaussian, where $\sigma'^2 = \sigma^2\, u^\top P u$.

Definition 4 (Sub-exponential). A mean-zero random variable X is $(\nu, \alpha)$ sub-exponential if for all $|\lambda| \le 1/\alpha$, $\mathbb{E}[\exp(\lambda X)] \le \exp(\lambda^2\nu^2/2)$. A mean-zero random vector w is $(\nu, \alpha, P)$ sub-exponential if for all unit vectors u and $|\lambda| \le \frac{1}{\alpha\sqrt{u^\top P u}}$, $\mathbb{E}[\exp(\lambda\, w^\top u)] \le \exp(\lambda^2\nu^2\, u^\top P u/2)$. In particular, $w^\top u$ is $(\nu', \alpha')$ sub-exponential, where $\nu'^2 = \nu^2\, u^\top P u$ and $\alpha' = \alpha\sqrt{u^\top P u}$.

The following concentration bounds for sub-Gaussian and sub-exponential vectors are obtained from standard bounds for scalar sub-Gaussian and sub-exponential random variables (Wainwright, 2019) by scaling σ, ν, and α by the appropriate factors.

Lemma 3 (Sub-Gaussian Concentration). If a random vector w is $(\sigma, P)$ sub-Gaussian, then for all unit vectors u,
$$P\big(|w^\top u| \ge t\big) \le 2\exp\!\left(-\frac{t^2}{2\sigma^2\, u^\top P u}\right).$$

Lemma 4 (Sub-exponential Concentration). If a random vector w is $(\nu, \alpha, P)$ sub-exponential, then for all unit vectors u,
$$P\big(|w^\top u| \ge t\big) \le \begin{cases} 2\exp\!\left(-\dfrac{t^2}{2\nu^2\, u^\top P u}\right) & \text{if } 0 \le t \le \dfrac{\nu^2}{\alpha}\sqrt{u^\top P u}, \\[2mm] 2\exp\!\left(-\dfrac{t}{2\alpha\sqrt{u^\top P u}}\right) & \text{otherwise.} \end{cases}$$

Lemma 5. Suppose X is a $(\nu, \alpha)$ sub-exponential random variable with $\frac{2}{\nu^2} \le \frac{1}{\alpha^2}$. Then $\mathrm{Var}(X) \le \frac{3}{2}\nu^2$.

Proof. We will use the inequality $x^2 \le C\,(e^x + e^{-x})$ for all x, with $C = 1.5/e$. This gives
$$\mathrm{Var}(\lambda X) \le C\,\mathbb{E}\big[e^{\lambda X} + e^{-\lambda X}\big] \le 2C\exp(0.5\lambda^2\nu^2), \quad \text{if } |\lambda| \le \frac{1}{\alpha}.$$
We therefore have $\mathrm{Var}(X) \le \frac{2C}{\lambda^2}\exp(0.5\lambda^2\nu^2) = eC\nu^2$ when $0.5\lambda^2\nu^2 = 1$ (permissible since $\frac{2}{\nu^2} \le \frac{1}{\alpha^2}$). We get the result with $C = 1.5/e$.

Lemma 6. If w is a mean-zero $(\sigma, P)$ sub-Gaussian random vector, then $\mathrm{Cov}(w) \preceq \sigma^2 P$. If w is a mean-zero $(\nu, \alpha, P)$ sub-exponential random vector with $\frac{2}{\nu^2} \le \frac{1}{\alpha^2}$, then $\mathrm{Cov}(w) \preceq \frac{3}{2}\nu^2 P$.

Proof. For any unit vector u, we have $u^\top \mathrm{Cov}(w)\, u = u^\top\mathbb{E}[ww^\top]\,u = \mathbb{E}[(w^\top u)^2] = \mathrm{Var}(w^\top u)$. By definition, if w is a mean-zero $(\sigma, P)$ sub-Gaussian random vector, $w^\top u$ is sub-Gaussian with parameter $\sigma^2 u^\top P u$. Therefore, $u^\top \mathrm{Cov}(w)\, u = \mathrm{Var}(w^\top u) \le \sigma^2 u^\top P u$, as desired. Similarly, if w is a mean-zero $(\nu, \alpha, P)$ sub-exponential random vector, $w^\top u$ is $(\nu\sqrt{u^\top P u},\ \alpha\sqrt{u^\top P u})$ sub-exponential. By Lemma 5, $u^\top \mathrm{Cov}(w)\, u = \mathrm{Var}(w^\top u) \le \frac{3}{2}\nu^2\, u^\top P u$.

Lemma 7. For all $i \in [n]$:
1. $\epsilon_i$ is sub-Gaussian with $\sigma = 1$, and
2. $\epsilon_i$ is $(4\sqrt{Aq},\, 2/B)$ sub-exponential for any $A, B > 0$ satisfying $e^x < 1 + x + Ax^2$ for $x < B$.

Proof. For the first claim, note that a random variable bounded in $[a, b]$ is sub-Gaussian with $\sigma^2 = \frac{(b-a)^2}{4}$. To prove the second claim, we have
$$\mathbb{E}[\exp(\lambda\epsilon_i)] = \tilde q\,\exp\big(2\lambda(1-\tilde q)\big) + (1-\tilde q)\exp(-2\lambda\tilde q) = \exp(-2\lambda\tilde q)\big(1 + \tilde q(\exp(2\lambda) - 1)\big) \le \exp(-2\lambda\tilde q)\exp\big(\tilde q(\exp(2\lambda) - 1)\big) \le \exp\big(\tilde q A (2\lambda)^2\big),$$
for $2\lambda < B$. The last step follows from $e^x \le 1 + x + Ax^2$ for $x < B$. Recalling that $\tilde q \le 2q$, we have $\mathbb{E}[\exp(\lambda\epsilon_i)] \le \exp(8Aq\lambda^2) = \exp\big(\lambda^2(4\sqrt{Aq})^2/2\big)$ for $\lambda < B/2$, as desired.
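As a quick numerical sanity check (ours, not part of the proof), one can verify on a grid that $A = 0.5803$ and $B = 0.4310$ satisfy $e^x \le 1 + x + Ax^2$ for $x < B$, with equality only at $x = 0$:

```python
import numpy as np

# Check e^x <= 1 + x + A x^2 on x in [-50, B); the single equality point x = 0
# is not hit exactly by this grid, so the strict inequality holds throughout.
A, B = 0.5803, 0.4310
x = np.linspace(-50, B, 200_001, endpoint=False)
gap = np.exp(x) - (1 + x + A * x**2)
print(gap.max() < 0)  # True: strict inequality on the sampled grid
```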
Let $P_i$, $i \in [n]$, be the orthogonal projection matrices onto the span of $\{x_1, \ldots, x_i\}$; that is,
$$P_i := P_{i-1} + \frac{(I - P_{i-1})\, x_i x_i^\top\, (I - P_{i-1})}{\lVert (I - P_{i-1})\, x_i \rVert^2}, \qquad P_0 = 0.$$

Lemma 8. Suppose $A, B > 0$ satisfy $e^x < 1 + x + Ax^2$ for all $x < B$. Let $\sigma^2 := c/2q$, $\nu^2 := 8Ac$, and $\alpha := 2/B$. Then:
1. If $w_{i-1}$ is $(\sigma, P_{i-1})$ sub-Gaussian, then $w_i$ is $(\sigma, P_i)$ sub-Gaussian.
2. If $w_{i-1}$ is $(\nu, \alpha, P_{i-1})$ sub-exponential, then $w_i$ is $(\nu, \alpha, P_i)$ sub-exponential.

Proof. We have
$$\mathbb{E}\big[\exp(\lambda\, w_i^\top u)\big] = \mathbb{E}\Big[\mathbb{E}\big[\exp(\lambda\, w_i^\top u)\,\big|\,w_{i-1}\big]\Big] = \mathbb{E}\Big[e^{\lambda\langle w_{i-1},\, Q_{x_i,c}\,u\rangle}\,\mathbb{E}\big[e^{\lambda\epsilon_i\, x_i^\top u}\,\big|\,w_{i-1}\big]\Big].$$

First, suppose that $w_{i-1}$ is $(\sigma, P_{i-1})$ sub-Gaussian; we prove that $w_i$ is $(\sigma, P_i)$ sub-Gaussian. By Lemma 7, $\epsilon_i$ is 1-sub-Gaussian. Therefore,
$$\mathbb{E}\Big[e^{\lambda\langle w_{i-1},\, Q_{x_i,c}\,u\rangle}\,\mathbb{E}\big[e^{\lambda\epsilon_i\, x_i^\top u}\,\big|\,w_{i-1}\big]\Big] \le \mathbb{E}\Big[e^{\lambda\langle w_{i-1},\, Q_{x_i,c}\,u\rangle}\Big]\, e^{\frac{1}{2}\lambda^2 (x_i^\top u)^2} \le e^{\frac{\lambda^2\sigma^2}{2}(Q_{x_i,c}u)^\top P_{i-1}(Q_{x_i,c}u)\, +\, \frac{\lambda^2}{2}(u^\top x_i x_i^\top u)}.$$
We now consider the exponent (divided by the common factor $\lambda^2/2$). It is sufficient to show
$$\sigma^2\,(Q_{x_i,c}u)^\top P_{i-1}(Q_{x_i,c}u) + u^\top x_i x_i^\top u \le \sigma^2\, u^\top P_i u, \quad \text{i.e.,} \quad \sigma^2\, Q_{x_i,c}^\top P_{i-1} Q_{x_i,c} + x_i x_i^\top \preceq \sigma^2 P_i.$$
Using Proposition 6 (with $L \leftarrow 1$ and $C \leftarrow c/2q$) and noting that $\sigma^2 = c/2q \ge 1$, we obtain exactly this bound. Therefore, $w_i$ is a $(\sqrt{c/2q},\, P_i)$ sub-Gaussian random vector.

Now suppose that $w_{i-1}$ is $(\nu, \alpha, P_{i-1})$ sub-exponential; we prove that $w_i$ is $(\nu, \alpha, P_i)$ sub-exponential. Again by Lemma 7, $\epsilon_i$ is $(4\sqrt{Aq},\, 2/B)$ sub-exponential. Therefore,
$$\mathbb{E}\Big[e^{\lambda\langle w_{i-1},\, Q_{x_i,c}\,u\rangle}\,\mathbb{E}\big[e^{\lambda\epsilon_i\, x_i^\top u}\,\big|\,w_{i-1}\big]\Big] \le \mathbb{E}\Big[e^{\lambda\langle w_{i-1},\, Q_{x_i,c}\,u\rangle}\Big]\, e^{8Aq\lambda^2 (x_i^\top u)^2}$$
for $|\lambda| < \frac{2}{B\,|x_i^\top u|} = \frac{1}{\alpha\sqrt{u^\top x_i x_i^\top u}}$. Since $w_{i-1}$ is $(\nu, \alpha, P_{i-1})$ sub-exponential,
$$\mathbb{E}\Big[e^{\lambda\langle w_{i-1},\, Q_{x_i,c}\,u\rangle}\Big] \le e^{\frac{\lambda^2\nu^2}{2}(Q_{x_i,c}u)^\top P_{i-1}(Q_{x_i,c}u)} \quad \text{for } |\lambda| < \frac{1}{\alpha\sqrt{u^\top P_{i-1} u}}.$$
Note that $u^\top P_i u$ is at least as large as both $u^\top P_{i-1} u$ and $u^\top x_i x_i^\top u$. We thus have
$$\mathbb{E}\Big[e^{\lambda\langle w_{i-1},\, Q_{x_i,c}\,u\rangle}\,\mathbb{E}\big[e^{\lambda\epsilon_i\, x_i^\top u}\,\big|\,w_{i-1}\big]\Big] \le e^{\frac{\lambda^2\nu^2}{2}(Q_{x_i,c}u)^\top P_{i-1}(Q_{x_i,c}u)\, +\, 8Aq\lambda^2(u^\top x_i x_i^\top u)}$$
for all $|\lambda| < \frac{1}{\alpha\sqrt{u^\top P_i u}}$. Hence, it is sufficient to show
$$\nu^2\, Q_{x_i,c}^\top P_{i-1} Q_{x_i,c} + 16Aq\, x_i x_i^\top \preceq \nu^2 P_i.$$
This follows from Proposition 6 by substituting $L \leftarrow 16Aq$ and $C \leftarrow c/2q$ and noting that $\nu^2 = 8Ac = LC$.

Lemma 9. If $c = \min(1/q,\, 9.3)\log(2n/\delta)$, then with probability at least $1-\delta$ we have $|w_{i-1}^\top x_i| < c$ for all $i \in [n]$.

Proof. By definition, c is equal to either $\log(2n/\delta)/q$ or $9.3\log(2n/\delta)$. We consider these two cases. First suppose $c = \log(2n/\delta)/q$. With $\sigma^2 = c/2q$, we have $c = \log(2n/\delta)/q = \sigma\sqrt{2\log(2n/\delta)}$. By Lemma 3,
$$P\big(|w_{i-1}^\top x_i| > c\big) \le 2\exp\!\left(-\frac{c^2}{2\sigma^2\, x_i^\top P_{i-1} x_i}\right) \le 2\exp\big(-\log(2n/\delta)\big) = \frac{\delta}{n}.$$
The result then follows by a union bound over $i \in [n]$.

Now suppose $c = 9.3\log(2n/\delta)$. Note that $A = 0.5803$ and $B = 0.4310$ satisfy $e^x < 1 + x + Ax^2$ for $x < B$. Let $\nu^2 = 8Ac$ and $\alpha = 2/B$. We have $\nu^2/\alpha = 8Ac \cdot B/2 = 4ABc > c$. Therefore, by Lemma 4,
$$P\big(|w_{i-1}^\top x_i| > c\big) \le 2\exp\!\left(-\frac{c^2}{2\nu^2\, x_i^\top P_{i-1} x_i}\right) \le 2\exp\!\left(-\frac{c}{16A}\right) \le \frac{\delta}{n}.$$
Again, the result follows by union bounding over $i \in [n]$.

Lemma 9 and Lemma 8 together prove Theorem 1.

A.2. Proof of Theorem 2

Proof of Theorem 2. Let $D(\delta)$ be the discrepancy obtained by Binary Balance as a function of the failure probability δ. We will show that $(2\log k)\,D(\delta/k)$ is the corresponding discrepancy obtained by Algorithm 2; Theorem 2 will then follow from Proposition 2. Notice that, with probability $1-\delta/k$, each run of Binary Balance at an internal node of T has discrepancy $D(\delta/k)$. By union bounding over the $O(k)$ internal nodes, we have that with probability $1-\delta$, all of the discrepancies are bounded by $D(\delta/k)$.

Assuming all the discrepancies at the internal nodes of T are bounded, we show how to bound the discrepancy between any two treatments. Let l and l′ be any two leaves in T. The goal is to show that
$$\left\lVert \frac{\alpha(l')}{\alpha(l) + \alpha(l')}\, s(l) - \frac{\alpha(l)}{\alpha(l) + \alpha(l')}\, s(l') \right\rVert$$
is small.
First, we relate $s(v)$ to $s(v_l)$ and $s(v_r)$, where $v_l, v_r$ are the left and right children of v. By definition,
$$w(v) = \frac{\alpha(v_r)}{\alpha(v_l) + \alpha(v_r)}\, s(v_l) - \frac{\alpha(v_l)}{\alpha(v_l) + \alpha(v_r)}\, s(v_r) \qquad \text{and} \qquad s(v) = s(v_l) + s(v_r).$$
Hence
$$w(v) = s(v_l) - \frac{\alpha(v_l)}{\alpha(v_l) + \alpha(v_r)}\, s(v), \qquad -w(v) = s(v_r) - \frac{\alpha(v_r)}{\alpha(v_l) + \alpha(v_r)}\, s(v),$$
so both $\big\lVert s(v_l) - \frac{\alpha(v_l)}{\alpha(v)}\, s(v) \big\rVert$ and $\big\lVert s(v_r) - \frac{\alpha(v_r)}{\alpha(v)}\, s(v) \big\rVert$ are bounded by $D(\delta/k)$.

Now consider $v_1, v_2, v_3$ in T such that $v_1$ is a child of $v_2$ and $v_2$ is a child of $v_3$. By the triangle inequality,
$$\left\lVert s(v_1) - \frac{\alpha(v_1)}{\alpha(v_3)}\, s(v_3) \right\rVert \le \left\lVert s(v_1) - \frac{\alpha(v_1)}{\alpha(v_2)}\, s(v_2) \right\rVert + \frac{\alpha(v_1)}{\alpha(v_2)}\left\lVert s(v_2) - \frac{\alpha(v_2)}{\alpha(v_3)}\, s(v_3) \right\rVert \le \left(1 + \frac{\alpha(v_1)}{\alpha(v_2)}\right) D(\delta/k).$$
Let l be a leaf in T and let $l, v_1, v_2, \ldots, r$ be the path from l to the root r. Repeatedly applying the above relation along the path gives
$$\left\lVert s(l) - \frac{\alpha(l)}{\alpha(r)}\, s(r) \right\rVert \le \left(1 + \frac{\alpha(l)}{\alpha(v_1)} + \frac{\alpha(l)}{\alpha(v_2)} + \cdots\right) D(\delta/k). \qquad (3)$$
Since there are at most log k nodes in the path from l to r,
$$\left\lVert s(l) - \frac{\alpha(l)}{\alpha(r)}\, s(r) \right\rVert \le (\log k)\, D(\delta/k).$$
Finally, for any two leaves l and l′,
$$\frac{\big\lVert s(l)/\alpha(l) - s(l')/\alpha(l') \big\rVert}{1/\alpha(l) + 1/\alpha(l')} \le \frac{\big\lVert s(l)/\alpha(l) - s(r)/\alpha(r) \big\rVert}{1/\alpha(l) + 1/\alpha(l')} + \frac{\big\lVert s(l')/\alpha(l') - s(r)/\alpha(r) \big\rVert}{1/\alpha(l) + 1/\alpha(l')} \le (2\log k)\, D(\delta/k).$$

Remark 2. If all weights are uniform, the summation in (3) becomes a geometric series and can be bounded by a constant. Therefore, we can remove the factor log k in Theorem 2.

A.3. Other Proofs

Proof of Proposition 1. Let $B = \begin{bmatrix} \sqrt{\varphi}\, I \\ \sqrt{1-\varphi}\, X^\top \end{bmatrix}$, so that $w_n = B\eta = (\sqrt{\varphi}\,\eta,\ \sqrt{1-\varphi}\,X^\top\eta)$. We have from Theorem 1 that $w_n$ is a mean-zero $(\sqrt{c/2q},\, P)$ sub-Gaussian random vector, where $P = B\,(B^\top B)^{-1} B^\top$. Therefore, by sub-Gaussianity of $w_n$, for any vector u we have
$$\mathbb{E}\big[\exp\big(\lambda\sqrt{\varphi}\,\eta^\top u\big)\big] \le \exp\!\left(\frac{\lambda^2 c}{4q}\, u^\top (\varphi Q)\, u\right),$$
since taking the direction $(u, 0)$ picks out the top-left block $\varphi Q$ of P (note that $B^\top B = \varphi I + (1-\varphi)XX^\top = Q^{-1}$). We therefore have the sub-Gaussian claim. The sub-exponential result follows similarly.

Proof of Corollary 2. Suppose $c = \log(2n/\delta)/q$. We have from Proposition 1 that η is a $(\sqrt{c/2q},\, Q)$ sub-Gaussian vector. This implies that $\mathrm{Cov}(z) \preceq \frac{c}{2q}\, Q$. When $c = 9.3\log(2n/\delta)$, we have from Proposition 1 that η is a $(\sqrt{8Ac},\, \alpha,\, Q)$ sub-exponential vector, and Lemma 6 gives $\mathrm{Cov}(z) \preceq \frac{3}{2}\cdot 8Ac\, Q = 12Ac\, Q$.

Proof of Proposition 2. When $c = \log(4n/\delta)/q$, we have with probability at least $1-\delta/2$ that $B\eta$ is a $(\sqrt{c/2q},\, P)$ sub-Gaussian vector. Since $B\eta = (\sqrt{\varphi}\,\eta,\ \sqrt{1-\varphi}\,X^\top\eta)$ and $P \preceq I$, the block $\sqrt{1-\varphi}\sum_i \eta_i x_i = \sqrt{1-\varphi}\, X^\top\eta$ is a $(\sqrt{c/2q},\, I)$ sub-Gaussian random vector. By sub-Gaussian concentration, with probability at least $1-\delta/2d$,
$$\big|\, X^\top\eta \cdot e_j \,\big| \le \sqrt{\frac{c\,\log(4d/\delta)}{q\,(1-\varphi)}}.$$
The result follows by a union bound over $e_1, \ldots, e_d$. When $c = 9.3\log(4n/\delta)$, then with probability $1-\delta/2$, $(\sqrt{\varphi}\,\eta,\ \sqrt{1-\varphi}\,X^\top\eta)$ is a $(\sqrt{8Ac},\, \alpha,\, P)$ sub-exponential vector; as before, this implies $\sqrt{1-\varphi}\,X^\top\eta$ is a $(\sqrt{8Ac},\, \alpha,\, I)$ sub-exponential random vector. By sub-exponential concentration, setting $t = \sqrt{2\log(4d/\delta)\,(8Ac)} \le \nu^2/\alpha$ (when $n \ge d$), we get with probability at least $1-\delta/2d$ that
$$\big|\sqrt{1-\varphi}\; X^\top\eta \cdot e_j\big| \le \sqrt{16Ac\,\log(4d/\delta)} \le O\!\left(\sqrt{\log(4d/\delta)\,\log(4n/\delta)}\right).$$
The rest follows by a union bound.

Proof of Proposition 3. First consider the case $c = \log(2n/\delta)/q = \sigma\sqrt{2\log(2n/\delta)}$. By Proposition 1, η is $(\sigma, Q)$ sub-Gaussian with $\sigma = \sqrt{c/2q}$. From Lemma 3, we have
$$P\Big(|\eta^\top\mu| > c\sqrt{\mu^\top Q\mu}\Big) \le 2\exp\!\left(-\frac{c^2}{2\sigma^2}\right) \le \delta.$$
Now consider $c = 9.3\log(2n/\delta)$. By Proposition 1, η is $(\nu, \alpha, Q)$ sub-exponential with $\nu = \sqrt{8Ac}$. Note that $\nu^2/\alpha = 8Ac/(2/B) = 4ABc > c$. From Lemma 4, we have
$$P\Big(|\eta^\top\mu| > c\sqrt{\mu^\top Q\mu}\Big) \le 2\exp\!\left(-\frac{c^2}{2\nu^2}\right) \le \delta.$$
The result then follows by a union bound.

Proof of Proposition 5. This follows from Corollary 2 and the proof of Theorem 3 in Harshaw et al. (2020).

A.4. Robustness

Corollary 2 immediately gives a bound on $\lambda_{\max}(\mathrm{Cov}(z))$ and hence bounds accidental bias.

Remark 3 (Accidental Bias). With probability $1-\delta$, the maximum eigenvalue of $\mathrm{Cov}(z)$ satisfies $\lambda_{\max}(\mathrm{Cov}(z)) \le \frac{\sigma^2}{\varphi}$.

B. Simulations

B.1. Description
| DGP Name | X | y(0) | y(a), a ≠ 0 |
|---|---|---|---|
| Quick Block DGP | $X_{i,k} \sim U(0, 10)$, $k \in \{1, 2\}$ | $\prod_{k=1}^{2} X_k + \epsilon$ | $1 + y(0)$ |
| Linear DGP | $X_{i,k} = \epsilon_k$, $k \in \{1, \ldots, 4\}$ | $X\beta + \frac{1}{10}\epsilon_{y(0)}$ | $1 + X\beta + \frac{1}{10}\epsilon_{y(1)}$ |
| Linear Drift DGP | $X_{i,k} = \frac{i}{N} + \epsilon_k$, $k \in \{1, \ldots, 4\}$ | $X\beta + \frac{1}{10}\epsilon_{y(0)}$ | $1 + X\beta + \frac{1}{10}\epsilon_{y(1)}$ |
| Linear Season DGP | $X_{i,k} = \sin(2\pi \frac{i}{N}) + \epsilon_k$, $k \in \{1, \ldots, 4\}$ | $X\beta + \frac{1}{10}\epsilon_{y(0)}$ | $1 + X\beta + \frac{1}{10}\epsilon_{y(1)}$ |
| Quadratic DGP | $X_{i,k} = 2\beta_k - 1$, $k \in \{1, 2\}$ | $\mu_0 + \frac{1}{10}\epsilon_{y(0)}$ | $1 + \mu_0 + \frac{1}{10}\epsilon_{y(1)}$ |
| Cubic DGP | $X_{i,k} = 2\beta_k - 1$, $k \in \{1, 2\}$ | $\mu_0 + \frac{1}{10}\epsilon_{y(0)}$ | $1 + \mu_0 + \frac{1}{10}\epsilon_{y(1)}$ |
| Sinusoidal DGP | $X_{i,k} = 2\beta_k - 1$, $k \in \{1, 2\}$ | $\mu_0 + \frac{1}{10}\epsilon_{y(0)}$ | $1 + \mu_0 + \frac{1}{10}\epsilon_{y(1)}$ |

Here $\mu_0 = X_1 - X_2 + X_1^2 + X_2^2 - 2X_1X_2$ for the Quadratic DGP; $\mu_0 = X_1 - X_2 + X_1^2 + X_2^2 - 2X_1X_2 + X_1^3 - X_2^3 - 3X_1^2X_2 + 3X_1X_2^2$ for the Cubic DGP; and $\mu_0 = \sin(\pi \ldots /4) + 6\sin(\pi X_1 \ldots)$ for the Sinusoidal DGP (partially unrecoverable in the source).

Table A1. Data generating processes used in simulations. All ϵ's indicate a standard normal variate and all β's indicate a standard uniform variate. i indicates a unit's index. Covariate vectors are row-normalized to unit norm, except for the Quick Block simulation, which normalizes relative to the maximum row norm.

[Figure A1. Bias. Panels by DGP; bias vs. sample size for each method.]

B.2. Binary Treatments

Bias. Figure A1 shows that none of the examined methods is biased (though that does not imply that they are robust (Efron, 1971; Harshaw et al., 2020)).

Imbalance. To measure linear imbalance, we calculate the L2 norm of the difference in covariate means. While this is far from the only measure of imbalance, it serves as an effective metric to demonstrate how well various methods eliminate linear imbalances, a common diagnostic used by experimenters. We further normalize across sample sizes by multiplying by the sample size. Since all methods are unbiased, this has the effect of showing parametric convergence rates as a flat line in the graph. Unsurprisingly, methods which directly optimize for linear imbalance perform very well in Figure A2. Our proposed algorithm has better finite-sample imbalance minimization than the algorithm of Alweiss et al. (2020), due to the finite-sample improvements of Dwivedi and Mackey (2021).

B.3. Non-uniform assignment

MSE. Figure A3 shows the resulting mean squared error attained by methods which support marginal probabilities not equal to one-half. All methods perform well, with DM performing effectively on nearly linear processes and QuickBlock performing slightly better when the true process is highly non-linear.

Marginal probability. The evaluation in Figure A4 examines how closely each method hews to the desired marginal probability of treatment. All methods do a good job of ensuring the appropriate marginal distribution.

B.4. Multiple Treatments

Entropy. To ensure that treatment is being assigned with the correct marginal probability, we measure the normalized entropy of the empirical marginal treatment probabilities. If the values are perfectly uniform, the normalized entropy will be exactly one; as it decreases, the marginal probabilities are more uneven, indicating a failure to match the desired marginal distribution. BWD performs very similarly to complete randomization (which almost exactly matches the desired marginal probabilities), slightly outperforming Smith (1984). Note that while Smith (1984) only seeks to optimize the marginal probability of treatment for each unit, BWD additionally balances covariates.
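To make the simulation setup concrete, here is a hedged sketch (our construction following Table A1, reusing `balancing_walk` and `diff_in_means` from the earlier sketches) of the Linear DGP and a small error comparison:

```python
import numpy as np

def linear_dgp(n, d=4, seed=0):
    """Linear DGP from Table A1: X_{i,k} = eps_k, y(0) = X beta + eps/10,
    y(1) = 1 + X beta + eps'/10, with covariate rows normalized to unit norm."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    beta = rng.uniform(size=d)
    y0 = X @ beta + rng.normal(size=n) / 10
    y1 = 1 + X @ beta + rng.normal(size=n) / 10
    return X, y0, y1

# Error of the SATE estimate under BWD(0) vs. simple randomization.
X, y0, y1 = linear_dgp(1000)
z_bwd, _ = balancing_walk(X, q=0.5, phi=0.0)
z_simple = np.where(np.random.default_rng(2).random(1000) < 0.5, 1, -1)
for z in (z_bwd, z_simple):
    y = np.where(z == 1, y1, y0)
    print(diff_in_means(y, z, q=0.5) - 1.0)   # error vs. the true SATE of 1
```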
[Figure A2. Imbalance. BWD is highly effective at eliminating linear imbalance between groups. Panels by DGP; imbalance (MISE × n) vs. sample size.]

[Figure A3. Marginal probability of treatment. Panels by DGP.]

[Figure A4. Marginal probability of treatment. Actual vs. target marginal probability of treatment, by DGP.]

[Figure A5. Entropy of the marginal distribution vs. number of treatments, by DGP.]