# Privacy Preserving Adaptive Experiment Design

Jiachun Li*1, Kaining Shi*2, David Simchi-Levi*1

*Equal contribution. 1 Laboratory for Information and Decision Systems, MIT, Cambridge, U.S. 2 Department of Statistics, The University of Chicago, Chicago, U.S. Correspondence to: Jiachun Li. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Adaptive experiments are widely adopted to estimate the conditional average treatment effect (CATE) in clinical trials and many other scenarios. While the primary goal of an experiment is to maximize estimation accuracy, the imperative of social welfare also makes it crucial to provide treatments with superior outcomes to patients, a goal measured by regret in the contextual bandit framework. Furthermore, privacy concerns arise in clinical scenarios involving sensitive data such as patients' health records, so it is essential for the treatment allocation mechanism to incorporate robust privacy protection measures. In this paper, we investigate the trade-off between the loss of social welfare and the statistical power of CATE estimation in contextual bandit experiments. We establish matching upper and lower bounds for the multi-objective optimization problem and adopt the concept of Pareto optimality to mathematically characterize the optimality condition. Furthermore, we propose differentially private algorithms that still match the lower bound, showing that privacy is almost "free". Additionally, we derive the asymptotic normality of the estimator, which is essential for statistical inference and hypothesis testing.

1. Introduction

1.1. Background

The contextual bandit framework, a prominent and effective approach for sequential decision-making, is distinguished by its adaptability in progressively refining decisions based on accumulating information. This stands in contrast to reliance on static, offline datasets and batch learning methodologies. While most of the literature focuses on developing algorithms such as UCB or Thompson sampling to minimize the cumulative loss of rewards (such as revenue or social welfare), it has been shown in (Lai & Robbins, 1985) that adaptive allocation strategies can surpass the efficiency of certain conventional random experimental approaches, such as randomized controlled trials (RCTs), and this has drawn much attention in recent experiment design works (Zhao 2023, Dai et al. 2023).

Consider the following motivating example of clinical trials. These trials necessitate evaluating the efficacy of new pharmaceutical interventions across diverse patient circumstances. Regret here is measured as the cumulative detriment to patient welfare, thus necessitating its minimization. This imperative becomes particularly acute in the case of rare or fatal diseases, where the goal is to allocate the most effective treatment possible to patients. The heterogeneity of patient profiles, characterized by diverse attributes such as age, gender, and genotype, significantly influences the efficacy of treatment. Therefore, it is crucial to evaluate the efficacy of drugs across varied patient profiles to identify treatments with superior therapeutic benefits while mitigating potential adverse effects for specific patient groups. This illustrates the necessity of estimating the conditional average treatment effect (CATE) (see Abrevaya et al. 2015, Fan et al.
2022, Wager & Athey 2018) in adaptive allocation assignment problems while keeping the loss of welfare, or regret, at a minimum. This dual focus on minimizing regret and accurately estimating the CATE is central to both experimental design and contextual bandits in the academic literature. While online regret minimization and statistical inference have been extensively studied separately, the simultaneous pursuit of these objectives introduces novel complexities. This duality of purpose can result in conflicting optimal allocation strategies, as illustrated in recent work (Simchi-Levi & Wang 2023). Specifically, better accuracy of statistical inference typically necessitates broader exploration of the various treatment options, while a focus on minimizing regret restricts the algorithm's engagement with suboptimal arms. Moreover, the presence of patient-specific features and heterogeneous treatment effects introduces further complexity. The estimation and inference for one subgroup of patients cannot be transferred to another, and the arrival of various types of patients may be highly non-stationary, which complicates the inference process for certain subgroups due to insufficient data. While the trade-off between regret and estimation accuracy has been clearly characterized in the homogeneous ATE setting, it remains unknown for the conditional average treatment effect case when covariates are present. We formulate this as the following question:

Question 1: Given a budget of welfare loss (or regret), what is the best possible accuracy of estimation for the CATE, and how can we achieve such an accuracy?

Privacy concerns arise in scenarios involving sensitive data types, such as healthcare records, financial information, or digital footprints (Carlini et al. 2019, Melis et al. 2019, Niu et al. 2022). Such privacy risks also exist in the estimation of the conditional average treatment effect (CATE), which has the potential to inadvertently disclose sensitive individual information, including covariates, treatment statuses, or outcomes. Differential privacy (DP) has been established as a rigorous mathematical framework for defining privacy, and it has gained widespread adoption in practice by major organizations such as the U.S. Census Bureau and companies like Apple and Google for data publication and analysis (Erlingsson et al. 2014, Abowd 2018). However, it is widely known that in the realms of both statistical estimation and regret minimization, "privacy comes at a cost". While this is well established for DP statistical estimation with offline, independently and identically distributed data, it is much more subtle to perform valid DP estimation for online, potentially correlated data. Similarly, it has been a long-standing problem to develop a differentially private algorithm for regret minimization in the contextual bandit framework. Considering our objective to design an allocation mechanism that achieves enhanced accuracy while simultaneously minimizing regret, a DP version of our mechanism necessitates a delicate balance of these dual tasks. This leads to a critical inquiry: to what extent must one incur a cost to ensure privacy while striving to optimize both accuracy and regret minimization?

Question 2: With the constraint that the experimenter needs to protect the privacy of participants, is it still possible to attain the same estimation accuracy as well as social welfare loss?
To the best of our knowledge, our work is the first to handle these two tasks simultaneously in a differentially private manner.

1.2. Problem Formulation

In adaptive experiment design with heterogeneous treatment effects, there is a binary set $\mathcal{A} = \{0, 1\}$ of arms (i.e., treatments or controls) and a finite feature set $\mathcal{X} = \{X_1, X_2, \dots, X_M\}$ with $|\mathcal{X}| = M$. Suppose $n$ is the time horizon (or the total number of experimental units). The features at each time period follow a sequence of discrete distributions $P_X = \{P_X^t\}_{t \ge 1}$, where $P_X^t = (p_1^t, \dots, p_M^t) \in (0, 1)^M$ with $\sum_{j=1}^M p_j^t = 1$ for all $t \ge 1$, and $p_j^t$ is the probability of observing an experimental unit with feature $X_j$ at time $t$. Denote $f_j(m) := \mathbb{E}\big[\sum_{1 \le t \le m} \mathbf{1}\{x_t = X_j\}\big] = \sum_{1 \le t \le m} p_j^t$, the expected number of occurrences of feature $X_j$ in the first $m$ periods, for any $1 \le j \le M$ and $1 \le m \le n$. We impose the following assumption on the distribution of features.

Assumption 1.1 (Seasonal Nonstationarity). (1) There exist constants $C_1, C_2 > 1$ such that, for all $1 \le j \le M$, $C_1 < f_j(n)/f_j(n/2) < C_2$. (2) $f_{\min}(n) := \min_{1 \le j \le M} f_j(n) = \Omega(\log n)$.

Intuitively, this assumption says that every feature is expected to appear comparably often in the first and second halves of the horizon, and at least $\Omega(\log n)$ times overall. Compared to the simplest assumption, in which the distribution of patients' features is stationary over time, our assumption greatly expands the applicability of our method. For example, different types of patients may have completely different patterns of occurrence on weekdays and weekends, or during different seasons. In this case, the simple assumption that patients' features have a stable distribution no longer holds, but the situation may still conform to our non-stationary seasonal assumption.

Remark 1.2. Although we refer to this as non-stationarity, our assumption is in fact very mild and covers the oblivious adversarial case. To see this, note that we allow the distributions $P_X^t$ to be arbitrary, so for any oblivious adversarial sequence $(H_t)_{t=1}^n$ satisfying Assumption 1.1, we can simply set $p_i^t = 1$ when $H_t = X_i$ and $p_j^t = 0$ for any $j \ne i$ to obtain the desired sequence.

After observing the feature $x_t$, the experimenter chooses a treatment allocation $a_t \in \{0, 1\}$ based on a policy $\pi$, and the reward of the chosen arm $r_t = r_t(a_t \mid x_t) \in [0, 1]$ is observed, where $r_t(i \mid X_j)$ is drawn from a distribution $P_i(X_j)$ of the rewards of arm $i$ under feature $X_j$, and $P_X$ is the distribution of features. We define the conditional average treatment effect (CATE) of a feature as $\Delta(X_j) := \mathbb{E}[r_t(1 \mid X_j)] - \mathbb{E}[r_t(0 \mid X_j)]$ for any $X_j \in \mathcal{X}$. We also denote by $\sigma_{ji}^2 = \mathbb{V}[r(i \mid X_j)]$, for $i = 0, 1$ and $j = 1, \dots, M$, the variance of the reward of playing arm $i$ under context $X_j$. In this paper we focus on the case $|\Delta(x)| = \Theta(1)$ for all $x \in \mathcal{X}$, which is arguably the most fundamental case in real applications. Denote by $\mathcal{E}_0$ the feasible set of all stochastic MAB instances satisfying the assumptions above.

A key index measuring the efficiency of an online learning policy $\pi$ is the cumulative regret $R(n, \pi)$, defined as the expected difference between the reward under the optimal policy and under $\pi$, i.e., $R(n, \pi) = \mathbb{E}_\pi\big[\sum_{i=1}^n \big(r_i(a^*(x_i) \mid x_i) - r_i(a_i \mid x_i)\big)\big]$, where $a^*(x_i)$ is the optimal arm for feature $x_i$. In addition, an admissible adaptive estimator $\hat{\Delta}(X_j)$ maps the history $H_n$ to an estimate of $\Delta(X_j)$. We use the mean squared error of $\hat{\Delta}(X_j)$, i.e., $e\big(n, \hat{\Delta}(X_j)\big) = \mathbb{E}\big[\big(\Delta(X_j) - \hat{\Delta}(X_j)\big)^2\big]$, to measure the quality of the estimation. We write $\hat{\Delta} := \{\hat{\Delta}(X_j)\}_{1 \le j \le M}$ for the collection of estimators of the gap between the two arms for each feature.
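To make the two competing objectives concrete, the following minimal Python sketch (ours, not from the paper; the reward means `mu`, the helper `run_rct`, and all constants are illustrative) simulates a two-armed contextual bandit under a pure RCT policy and reports both the cumulative pseudo-regret and the worst per-feature squared error of the difference-in-means CATE estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

M, n = 3, 10_000                      # number of features, horizon
p = np.full(M, 1.0 / M)               # stationary feature distribution (a special case of Assumption 1.1)
mu = np.array([[0.5, 0.8],            # mu[j, i] = mean reward of arm i under feature X_j
               [0.6, 0.4],
               [0.3, 0.7]])
true_delta = mu[:, 1] - mu[:, 0]      # Delta(X_j) = E[r(1|X_j)] - E[r(0|X_j)]

def run_rct(mu, p, n, rng):
    """Pure randomized experiment: pick each arm w.p. 1/2 regardless of context."""
    M = mu.shape[0]
    sums = np.zeros((M, 2))
    counts = np.zeros((M, 2))
    regret = 0.0
    for _ in range(n):
        j = rng.choice(M, p=p)            # observe feature x_t = X_j
        a = rng.integers(0, 2)            # uniform treatment allocation
        r = rng.binomial(1, mu[j, a])     # Bernoulli reward in [0, 1]
        sums[j, a] += r
        counts[j, a] += 1
        regret += mu[j].max() - mu[j, a]  # expected (pseudo-)regret increment
    delta_hat = (sums[:, 1] / np.maximum(counts[:, 1], 1)
                 - sums[:, 0] / np.maximum(counts[:, 0], 1))
    return regret, delta_hat

regret, delta_hat = run_rct(mu, p, n, rng)
sq_err = (delta_hat - true_delta) ** 2
print(f"regret = {regret:.1f}, worst-feature squared error = {sq_err.max():.5f}")
# An RCT attains the best achievable per-feature accuracy, roughly O(1/f_min(n)),
# but pays Theta(n) regret -- one extreme of the trade-off studied in Section 2.
```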
A design of a contextual bandit experiment can then be represented by an admissible pair $(\pi, \hat{\Delta})$. Different from traditional contextual bandit problems, the optimal design of contextual bandit experiments in this paper solves the following minimax multi-objective optimization problem:

$\min_{(\pi, \hat{\Delta})} \max_{\nu \in \mathcal{E}_0} \Big( R_\nu(n, \pi),\ \max_{1 \le j \le M} e_\nu\big(n, \hat{\Delta}(X_j)\big) \Big)$,  (1)

where the subscript $\nu$ denotes the contextual bandit instance. Eq. (1) mathematically describes the two goals: minimizing the regret and minimizing the largest estimation error among all features. The above is a rigorous mathematical description of the first question we presented. This sets the stage for our second question, which concerns the price of protecting privacy for both regret and CATE estimation, and how it affects the balance between minimizing regret and estimation error. In order to rigorously address this question, we first need the following definition of differential privacy.

Definition 1.3 ((ε, δ)-anticipating private contextual bandit algorithm). A bandit algorithm $\pi$ is $(\varepsilon, \delta)$-private if, for any two neighboring datasets $D = \{(x_i, a_i, r_i)\}_{i=1}^n$ and $D' = \{(\hat{x}_i, \hat{a}_i, \hat{r}_i)\}_{i=1}^n$ of feature, action and reward triples that differ in at most one time step $t$, and for all $S \subseteq \mathcal{A}^{n-t}$,

$\mathbb{P}\big(\pi(a_{t+1}, \dots, a_n) \in S \mid D\big) \le e^{\varepsilon}\, \mathbb{P}\big(\pi(a_{t+1}, \dots, a_n) \in S \mid D'\big) + \delta$,

where $\mathcal{A} = \{0, 1\}$ is the set of actions.

This definition is slightly different from classical differential privacy (DP). (Shariff & Sheffet, 2018) propose a notion of joint DP in the context of linear contextual bandits, which is later adopted by (Chen et al., 2022) as anticipating DP (ADP). The key difference of ADP is that it restricts the output set to the allocations made strictly after the patient of interest at time $t$. This restriction is motivated by two reasons. The first is that following classical DP would inevitably lead to linear regret. The second is that classical DP assumes the adversary has access to the provided action $a_t$ at time $t$, which is unreasonable in most adaptive experiments, as communication about $(x_t, a_t, r_t)$ at time $t$ is expected to be secured, and the data prior to time $t$ have no impact on the privacy of patient $t$ because the decision-making algorithm has no knowledge of $x_t$ before time $t$. Therefore, only the privacy of outputs after time $t$ needs to be protected. For a more detailed discussion of ADP, one can refer to (Chen et al., 2022).

Definition 1.4 ((ε, δ)-private CATE estimator). An admissible CATE estimator $\hat{\Delta}$, which takes a dataset $\{(x_i, a_i, r_i)\}_{i=1}^n$ as input and outputs $M$ estimates $\{\hat{\Delta}(X_i)\}_{i=1}^M$ of the treatment effect of each feature, is $(\varepsilon, \delta)$-private if, for any two neighboring datasets $D = \{(x_i, a_i, r_i)\}_{i=1}^n$ and $D' = \{(\hat{x}_i, \hat{a}_i, \hat{r}_i)\}_{i=1}^n$ of feature, action and reward triples that differ in at most one time step $t$, and for any measurable set $S \subseteq \mathbb{R}^M$,

$\mathbb{P}\big((\hat{\Delta}(X_1), \dots, \hat{\Delta}(X_M)) \in S \mid D\big) \le e^{\varepsilon}\, \mathbb{P}\big((\hat{\Delta}(X_1), \dots, \hat{\Delta}(X_M)) \in S \mid D'\big) + \delta$.

Since a design of contextual bandit experiments can be represented as an admissible pair $(\pi, \hat{\Delta})$, in this paper we say a contextual bandit experiment design is $(\varepsilon, \delta)$-private when both $\pi$ and $\hat{\Delta}$ are $(\varepsilon, \delta)$-private.

Technical Difficulties and Our Contribution

1. Setting the Length of RCTs under a Regret Budget for Different Features. As claimed in (Simchi-Levi & Wang, 2023), the key idea for balancing regret and estimation error is to properly set the length of the RCT. However, in our setting the features may vary enormously in their occurrence frequencies, which can also be highly non-stationary.
Since we are interested in the worst estimation error among all features, we should set the RCT length for every feature equal to that of the feature with the lowest occurrence frequency, i.e., $f_{\min}(n)$. Since we do not know $f_{\min}(n)$ at the beginning of the experiment, under the seasonal non-stationarity assumption we propose an algorithm named ConSE, which divides the experiment into two phases: in the first half of the horizon, it minimizes regret while learning the occurrence frequencies $f_j(n/2)$, and in the second half, ConSE runs an RCT for each feature whose length is determined by $\hat{f}_{\min}(n/2)$, estimated from the first phase.

Another contribution of this paper is an improved analysis compared to existing work (Simchi-Levi & Wang 2023). In particular, we develop an upper bound that is tight up to constants, which yields a sharper characterization of the Pareto optimal curve for regret and estimation error. Moreover, the proposed estimator is asymptotically normal, which is vital for constructing confidence intervals and testing hypotheses and has been a long-standing issue in the adaptive experiment design literature (Simchi-Levi & Wang 2023, Dai et al. 2023, Zhao 2023). See Section 3 for a more detailed discussion.

2. Privatizing Feature Information in a Non-stationary Environment. Differential privacy is known to be more challenging in the bandit setting due to its highly correlated data. For multi-armed bandits, algorithms based on the tree mechanism proposed in (Chan et al., 2011) have been proved optimal up to polylog factors (see Tossou & Dimitrakakis 2016, Azize & Basu 2022, Sajed & Sheffet 2019). However, when it comes to contextual bandits, things become more complicated, as the algorithm needs to privatize not only the reward of each arm but also the context of each patient. Most existing works focus on settings where the reward function lies in a specific function class such as (generalized) linear functions (Hanna et al. 2022, Shariff & Sheffet 2018, Zheng et al. 2020, Chen et al. 2022). However, in clinical trials it is risky to assume that the treatment effect for one type of patient can be generalized to other types in a particular way (e.g., via a linear function). Therefore, in this paper we do not assume any structure of the CATE across different types of patients, which forces us to propose mechanisms different from the existing literature. The second difficulty arises from the non-stationarity assumption, which has been considered in very few works. In particular, it rules out the possibility of merging different features together and applying a unified mechanism.

To overcome these difficulties, we propose a doubly private algorithm, which treats each feature separately and privatizes the patients' information twice: first, we use the idea of the tree mechanism and divide the whole experiment into batches, where the reward estimates are only updated at the end of each batch; second, we randomize the length of each batch to protect the context information, which is, to the best of our knowledge, novel in the DP contextual bandit setting. Finally, our doubly private algorithm allows experimenters to balance regret and the estimation accuracy of the CATE at any given level, and our theoretical guarantees ensure that no method can simultaneously outperform our algorithm in minimizing regret and accurately estimating CATE.
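The sketch below illustrates, for a single feature, the double privatization just described: batch-aggregated arm means released with Laplace noise (the Lap(2/(εR)) scale mirrors the mean perturbation in Appendix A.4), and batch lengths that are themselves randomized. It is our own simplified illustration, not the paper's DP-ConSE: the function name `private_elimination_one_feature`, the elimination radii, and the constants are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def private_elimination_one_feature(pull_arm, eps, n, rng):
    """Illustrative doubly-private batched elimination for a single feature.

    pull_arm(a) -> reward in [0, 1] for arm a in {0, 1}.
    Two layers of privatization, mirroring the description in the text:
      (1) only batch-aggregated arm means are compared, each released with
          Laplace noise of scale 2 / (eps * batch_length);
      (2) the nominal batch length itself is perturbed with Laplace noise,
          so batch boundaries do not leak the context stream.
    The constants below are illustrative, not the paper's exact choices.
    Returns the index of the surviving (empirically better) arm.
    """
    epoch, pulls = 1, 0
    mu_tilde = np.zeros(2)
    while pulls < n:
        delta_e = 2.0 ** (-epoch)
        R_e = int(np.ceil(max(32 * np.log(16 * n * epoch**2) / delta_e**2,
                              8 * np.log(8 * n * epoch**2) / (eps * delta_e))))
        R_noisy = max(1, R_e + int(round(rng.laplace(0.0, 1.0 / eps))))  # (2)
        sums = np.zeros(2)
        for _ in range(R_noisy):
            for a in (0, 1):                       # round-robin inside the batch
                sums[a] += pull_arm(a)
                pulls += 1
        mu_tilde = sums / R_noisy + rng.laplace(0.0, 2.0 / (eps * R_noisy), size=2)  # (1)
        h_e = np.sqrt(np.log(16 * n * epoch**2) / (2 * R_noisy))   # sampling radius
        c_e = 2 * np.log(8 * n * epoch**2) / (eps * R_noisy)       # privacy radius
        if abs(mu_tilde[1] - mu_tilde[0]) > 2 * h_e + 2 * c_e:
            return int(np.argmax(mu_tilde))        # suboptimal arm eliminated
        epoch += 1
    return int(np.argmax(mu_tilde))                # budget exhausted: best guess

winner = private_elimination_one_feature(
    lambda a: rng.binomial(1, [0.4, 0.7][a]), eps=1.0, n=20_000, rng=rng)
print("surviving arm:", winner)
```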
1.3. Literature Review

Adaptive Experiment Design. Experimental design is witnessing a surge in popularity across operations research, econometrics, and statistics (see, e.g., Johari et al. 2015, Bojinov et al. 2021, Bojinov et al. 2023, Xiong et al. 2023). Adaptive experimental design emerges as a particularly relevant area to our current focus (Hahn et al. 2011, Atan et al. 2019, Greenhill et al. 2020, Kato et al. 2020, Qin & Russo 2022). Several recent works aim to demonstrate the statistical superiority of adaptive experiments over classical non-adaptive experiment designs, where precision is measured by the (asymptotic) variance of the estimator. (Dai et al., 2023) propose a measure called Neyman regret and show that an adaptive design with asymptotically optimal variance is equivalent to sub-linear Neyman regret, thus transforming the design problem into a regret minimization problem. (Zhao, 2023) considers a similar setting but adopts a competitive analysis framework. Another emerging field is multitasking bandit problems, where minimizing regret is not the only objective (see, e.g., Yang et al. 2017, Yao et al. 2021, Zhong et al. 2021). (Erraqabi et al., 2017) also consider the trade-off between regret and estimation error and propose a new loss function combining the two objectives. The work most related to this paper is perhaps (Simchi-Levi & Wang, 2023), which considers the trade-off between regret and ATE estimation. We extend their framework to the contextual bandit setting, derive a similar Pareto optimality characterization, and consider the additional requirement of protecting patients' privacy.

Differentially Private (Contextual) Bandit Learning and Estimation. Differential privacy (Dwork et al. 2006) has emerged as the gold standard for privacy-preserving data analysis, as it ensures that the output of the data-analysis algorithm has minimal dependency on any individual datum. Differentially private variants of online learning algorithms have been successfully developed in various settings (Guha Thakurta & Smith 2013), including a private UCB algorithm for the MAB problem (Azize & Basu 2022, Tossou & Dimitrakakis 2016) as well as UCB variations in the linear bandit setting (Hanna et al. 2022, Shariff & Sheffet 2018). These algorithms are in general motivated by a technique known as the tree mechanism (Chan et al. 2011, Dwork et al. 2010), which continuously releases aggregated statistics over a stream of $T$ observations while introducing only $\mathrm{polylog}(T)/\varepsilon$ noise in each time period, and thus leads to an added pseudo-regret of order $\mathrm{polylog}(T)/\varepsilon$. It was shown in (Shariff & Sheffet, 2018) that any $\varepsilon$-DP stochastic MAB algorithm must incur an added pseudo-regret of $\Omega(K \log(T)/\varepsilon)$, and this lower bound is matched by (Sajed & Sheffet, 2019) using a batched elimination algorithm. However, when it comes to DP contextual bandits, there is so far no gold standard that works for general contextual bandit problems. Instead, most works focus on contextual linear bandits (Shariff & Sheffet 2018, Hanna et al. 2022, Charisopoulos et al. 2023) and adopt a relaxed definition named joint DP or anticipating DP. These works are in general variants of LinUCB (Abbasi-Yadkori et al., 2011), which is known to be optimal for contextual linear bandits. A lower bound for contextual linear bandits was proposed in (Shariff & Sheffet, 2018) and then matched, up to polylog factors, in (Shariff & Sheffet 2018, Hanna et al. 2022). (Chen et al., 2022) consider differential privacy in a dynamic pricing problem under a generalized linear model.
A follow-up work (Chen et al., 2021) considers dynamic pricing in a non-parametric model and derives an upper bound of $O\big(T^{(d+2)/(d+4)} + \varepsilon^{-1} T^{d/(d+4)}\big)$, which is not known to be optimal. There has been some initial work on differentially private causal inference methods. (Lee & Bell, 2013) proposed a privacy-preserving inverse propensity score estimator for estimating the average treatment effect (ATE). (Komarova & Nekipelov, 2020) study the ramifications of differential privacy on the identification of statistical models and demonstrate the obstacles encountered in regression discontinuity designs with privacy constraints. However, when it comes to adaptive experiments, to our knowledge there is no prior work that estimates the CATE privately.

2. A Warm-up: Upper and Lower Bounds Without Privacy Constraints

In this section, we aim to answer the first question posed in Subsection 1.1, i.e., what is the best possible accuracy of CATE estimation given a regret budget, by first showing a lower bound and then proposing an algorithm, ConSE, with a matching upper bound. We also use this section as a warm-up to describe the technical difficulties of this problem and how to solve them, which is helpful for understanding the more complicated algorithm with privacy constraints in Section 3. In the following theorem, we provide a minimax lower bound to explicitly show the best possible estimation accuracy under a constraint on the regret budget.

Theorem 2.1. For any admissible pair $(\pi, \hat{\Delta}_n)$, there always exists a hard instance $\nu \in \mathcal{E}_0$ such that $e_\nu(n, \hat{\Delta}_n) \ge \Omega\big(\frac{M}{R_\nu(n, \pi)}\big)$, or in other words, $\inf_{(\pi, \hat{\Delta}_n)} \max_{\nu \in \mathcal{E}_0} \big[ e_\nu(n, \hat{\Delta}_n)\, R_\nu(n, \pi) \big] \ge \Omega(M)$.

Theorem 2.1 mathematically highlights the trade-off: a small regret inevitably leads to a large error in the CATE estimate. Specifically, it states that for any admissible pair $(\pi, \hat{\Delta}_n)$, there exists a hard instance $\nu \in \mathcal{E}_0$ such that the expected error is lower bounded by $M$ times the inverse of the regret, i.e., $e_\nu(n, \hat{\Delta}_n) \ge \Omega\big(\frac{M}{R_\nu(n, \pi)}\big)$. In particular, since $R_\nu(n, \pi) = O(\log n)$ for UCB and TS algorithms, no estimator can achieve an error smaller than the order $\Omega\big(\frac{M}{\log n}\big)$ consistently over all possible instances, which explicitly shows the limitation of regret-optimal policies in terms of statistical power for estimating the CATE. On the other hand, if we ignore the regret and simply run a randomized controlled trial (RCT), it can easily be shown that $e_\nu(n, \hat{\Delta}_n) = \max_{1 \le j \le M} \mathbb{E}\big[\big(\hat{\Delta}_n(X_j) - \Delta(X_j)\big)^2\big] = O\big(\frac{1}{f_{\min}(n)}\big)$, which is the best possible accuracy one can obtain, but this results in $O(n)$ regret.

The above two cases can be regarded as two extremes (note that they do not match the lower bound), but in practice the experimenter may want to strike a balance between estimation accuracy and regret somewhere between them. In the following, we provide a family of algorithms named ConSE, parameterized by $\alpha \in [0, 1]$. A larger $\alpha$ leads to a smaller regret budget and a larger estimation error. In particular, when $\alpha = 1$, the algorithm ignores the estimation error and focuses on minimizing regret; conversely, when $\alpha = 0$, the algorithm focuses only on minimizing the estimation error. Moreover, for each given $\alpha$, ConSE achieves the lower bound provided in Theorem 2.1, which shows that it can attain every Pareto optimal point from one extreme to the other (see Figure 1). In the figure, the endpoints of the curve represent the two extreme cases with minimum regret and minimum estimation error, respectively.
The other points on the curve characterize the trade-off between these two objectives; this is the Pareto optimal curve for regret and estimation error. In Section 4, we give a more detailed discussion of variants of the Pareto optimal curve.

Remark 2.2 (Intuitive example to illustrate the trade-off). Since the two objectives presented in Theorem 2.1 may not seem to conflict at first glance, we provide an intuitive example to illustrate why there is indeed a trade-off between them. Assume there are only 2 arms with means $\mu_1$ and $\mu_2$, where $\Delta = \mu_1 - \mu_2 > 0$. By definition, the regret is $\mathrm{Reg} = \Delta \cdot T(\text{arm } 2)$, where $T(\text{arm } 2)$ is the number of times arm 2 is played. Consider the following two tasks: the first is to identify that $\mu_1 > \mu_2$, and the second is to estimate $\mu_1, \mu_2$ as accurately as possible. It is intuitive that task 1 is strictly easier than task 2. Indeed, concentration inequalities tell us that it only takes $O\big(\frac{\log T}{\Delta^2}\big)$ pulls of each arm to complete task 1, while a basic statistical lower bound tells us that to estimate $\mu_1, \mu_2$ with (mean squared) accuracy $\frac{1}{T^{\alpha}}$, it is necessary to play each arm at least $T^{\alpha}$ times. Therefore, to minimize regret, one should play each arm only $O\big(\frac{\log T}{\Delta^2}\big)$ times, identify that $\mu_1 > \mu_2$, and never play arm 2 again. This leads to a very poor estimate of $\mu_2$, with accuracy only $\frac{\Delta^2}{\log T}$, but it is already necessary and sufficient for regret minimization. Conversely, if the goal is to estimate $\mu_2$ much more precisely, with accuracy $\frac{1}{T^{\alpha}}$, then from the above analysis this inevitably leads to a regret of $O(T^{\alpha})$.

Figure 1. Pareto Optimal Curve.

In the ConSE algorithm, we need the following notation. For $e = \text{epoch} = 1, 2, 3, \dots$ and the total number of patients $n$, define three number sequences:

$\Delta_e = 2^{-e}$, $R_e = \max\Big\{ \frac{32 \log(16 n\, e^2)}{\Delta_e^2},\ \frac{8 \log(8 n\, e^2)}{\Delta_e} \Big\}$, $h_e = \sqrt{\frac{\log(16 n\, e^2)}{2 R_e}}$.

Algorithm 1 ConSE
1: Input: α
2: Initialize: S_j ← {0, 1}, epoch e_j ← 0, r_j ← 0, μ^j_i ← 0, n_j ← 0 (i = 0, 1; j = 1, 2, ..., M)
3: for t = 1 to [n/2] do
4:   Get feature x_t = X_{j_t} ∈ X
5:   Increment n_{j_t} ← n_{j_t} + 1
6:   if |S_{j_t}| = 2 then
7:     Select action a_t ∈ {0, 1} with equal probabilities (1/2) and update the mean μ^{j_t}_{a_t}
8:     Increment r_{j_t} ← r_{j_t} + 1
9:     if r_{j_t} ≥ R_{e_{j_t}} then
10:      if e_{j_t} ≥ 1 then
11:        Remove arm i from S_{j_t} if max{μ^{j_t}_0, μ^{j_t}_1} − μ^{j_t}_i > 2 h_{e_{j_t}} (i = 0, 1)
12:      end if
13:      Increment epoch e_{j_t} ← e_{j_t} + 1
14:      Set r_{j_t} ← 0
15:      Zero the means: μ^{j_t}_i ← 0 for i ∈ {0, 1}
16:    end if
17:  else
18:    Pull the arm in S_{j_t}
19:  end if
20:  if t = [n/2] then
21:    f̂_j = n_j (1 ≤ j ≤ M)
22:    T_min = max{log n, min{f̂_1^{1−α}, f̂_2^{1−α}, ..., f̂_M^{1−α}}}
23:  end if
24: end for
25: for j = 1 to M do
26:   n_j = 0
27: end for
28: for t = [n/2] + 1 to n do
29:  Get feature x_t = X_{j_t} ∈ X
30:  Increment n_{j_t} ← n_{j_t} + 1
31:  if n_{j_t} ≤ T_min then
32:    Select action a_t ∈ {0, 1} with equal probabilities (1/2) and update the mean μ^{j_t}_{a_t}
33:    if n_{j_t} = T_min then
34:      Output Δ̂(X_{j_t}) = μ^{j_t}_1 − μ^{j_t}_0
35:    end if
36:  else
37:    Pull the arm in S_{j_t} (if |S_{j_t}| = 2, pull any arm in S_{j_t})
38:  end if
39: end for
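As a complement to the pseudocode, here is a simplified, non-private Python simulation of the ConSE structure. It is our own sketch, not the authors' code: the environment `mu`, the confidence radius, and the per-pull elimination check are illustrative simplifications of Algorithm 1 (the paper batches eliminations into epochs of length R_e, whereas the sketch checks after every pull).

```python
import numpy as np

rng = np.random.default_rng(2)

def conse(mu, p, n, alpha, rng):
    """Simplified, non-private simulation of the ConSE structure.

    Phase 1 (first half): per-feature successive elimination + frequency counts.
    Phase 2 (second half, first T_min arrivals of each feature): RCT, then output
    the difference-in-means CATE estimate.
    Phase 3 (rest of second half): play the surviving arm for each feature.
    """
    M = mu.shape[0]
    S = [{0, 1} for _ in range(M)]                    # surviving arms per feature
    sums = np.zeros((M, 2)); cnts = np.zeros((M, 2))
    n_first = np.zeros(M, dtype=int)
    regret = 0.0

    for _ in range(n // 2):                           # Phase 1
        j = rng.choice(M, p=p); n_first[j] += 1
        a = rng.integers(0, 2) if len(S[j]) == 2 else next(iter(S[j]))
        r = rng.binomial(1, mu[j, a]); regret += mu[j].max() - mu[j, a]
        sums[j, a] += r; cnts[j, a] += 1
        if len(S[j]) == 2 and cnts[j].min() >= 1:
            means = sums[j] / cnts[j]
            h = np.sqrt(2 * np.log(n) / cnts[j].min())     # simplified radius
            if means.max() - means.min() > 2 * h:
                S[j] = {int(np.argmax(means))}             # eliminate the loser
    T_min = int(max(np.log(n), np.min(n_first ** (1.0 - alpha))))

    sums[:] = 0; cnts[:] = 0
    n_second = np.zeros(M, dtype=int)
    delta_hat = np.zeros(M)
    for _ in range(n - n // 2):                       # Phases 2 and 3
        j = rng.choice(M, p=p); n_second[j] += 1
        in_rct = n_second[j] <= T_min
        if in_rct:
            a = rng.integers(0, 2)
        else:
            a = next(iter(S[j])) if len(S[j]) == 1 else rng.integers(0, 2)
        r = rng.binomial(1, mu[j, a]); regret += mu[j].max() - mu[j, a]
        if in_rct:
            sums[j, a] += r; cnts[j, a] += 1
        if n_second[j] == T_min:
            delta_hat[j] = (sums[j, 1] / max(cnts[j, 1], 1)
                            - sums[j, 0] / max(cnts[j, 0], 1))
    return regret, delta_hat

mu = np.array([[0.2, 0.8], [0.7, 0.3], [0.3, 0.9]])   # mu[j, i]: mean of arm i, feature j
regret, delta_hat = conse(mu, np.full(3, 1 / 3), n=20_000, alpha=0.5, rng=rng)
print(f"regret = {regret:.1f}, CATE estimates = {np.round(delta_hat, 3)}")
```

Sweeping `alpha` from 0 to 1 in this sketch traces the regret/error trade-off that Theorem 2.3 below formalizes.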
Theorem 2.3. Let Algorithm 1 run with any given $\alpha \in [0, 1]$. For any instance $\nu$, the regret and estimation error satisfy $R_\nu(n, \pi) \le O\big(M \max\{f_{\min}(n)^{1-\alpha}, \log n\}\big)$ and $e_\nu(n, \hat{\Delta}_n) \le O\big(\frac{1}{\max\{f_{\min}(n)^{1-\alpha}, \log n\}}\big)$. Therefore, the product of regret and estimation error is always $O(M)$, i.e., $e_\nu(n, \hat{\Delta}_n)\, R_\nu(n, \pi) \le O(M)$.

Table 1. Comparison with (Simchi-Levi & Wang, 2023).

| Differences | Simchi-Levi & Wang 2023 | This paper |
|---|---|---|
| Context | No | Yes |
| Lower bound | Ω(1) | Ω(M) |
| Upper bound | O(log n) | O(M) |
| Differential privacy | No | Yes |
| Asymptotic normality | No | Yes |

Combining the two theorems above, we can now answer Question 1: given a budget of social welfare loss $R_\nu(n, \pi)$, the best possible accuracy of inference for the CATE is $O\big(\frac{M}{R_\nu(n, \pi)}\big)$, and it is attained by ConSE.

Remark 2.4. While we prove the upper bound of ConSE in the non-stationary setting, the result in Theorem 2.3 cannot be improved even when the distribution of features is stationary. This follows by noticing that the hard instance in the lower bound of Theorem 2.1 is constructed in the stationary case. That is to say, ConSE is optimal in both the stationary and non-stationary settings.

Remark 2.5. Compared to previous work on bandit experiments (Simchi-Levi & Wang 2023), while the high-level idea is similar, we consider an alternative estimator and improve the analysis in the proof. Specifically, in (Simchi-Levi & Wang, 2023) the upper bound is tight up to poly-log terms, while in this paper the upper bound is tight up to constants. First, since classical bandit algorithms such as UCB or TS attain a regret bound of $O(\log n)$, we believe that poly-log factors do matter. Besides, this improved upper bound helps us obtain a better characterization of the Pareto optimal curve, which we explain in Section 4. Finally, the estimator in our algorithm is asymptotically normal, which means we can construct (asymptotically) normal confidence intervals for inference and hypothesis testing; this has been a long-standing issue in the existing adaptive experiment design literature (Simchi-Levi & Wang 2023, Zhao 2023, Dai et al. 2023).

Proposition 2.6. The estimators $\hat{\Delta}(X_j)$ are unbiased for all features, i.e., $\mathbb{E}[\hat{\Delta}(X_j)] = \Delta(X_j)$ for all $1 \le j \le M$. Moreover, $\hat{\Delta}(X_j)$ is asymptotically normal for any $j$; formally, $\sqrt{\max\{f_{\min}(n)^{1-\alpha}, \log n\}}\,\big(\hat{\Delta}(X_j) - \Delta(X_j)\big) \Rightarrow N(0, \sigma_{j0}^2 + \sigma_{j1}^2)$.

Intuitively speaking, ConSE can be divided into three steps:

Step 1. (From line 3 to 24) In the first half of the horizon, we use the Successive Elimination algorithm separately for each feature to eliminate the suboptimal arm while maintaining $O(\log n)$ regret. At the same time, we use these data to estimate the appearance frequency $f_j(n/2)$ of each feature $X_j$ as defined in Assumption 1.1.

Step 2. (From line 25 to 35) At the beginning of the second half, we run an RCT of length $T_{\min}$ for every feature, where $T_{\min}$ is determined by the frequencies $\hat{f}_j$ estimated in Phase 1 and the parameter $\alpha$.

Step 3. (From line 36 to 38) Play the optimal arm for each feature in the remaining time of the experiment.

Although things become more complicated under privacy constraints, our main goal is still to carry out these three steps privately. A more detailed discussion is provided in the next section.

3. Privacy is Free: A Doubly Private Algorithm for Bandit Experiments

In this section, our focus is to answer Question 2, i.e., with the constraint that the experimenter needs to protect the privacy of participants, is it still possible to attain the same estimation accuracy as well as social welfare loss? Roughly speaking, our answer is yes (when $\varepsilon$ is a small constant, which is the most common case).
In other words, we will provide a DP version of ConSE that matches the lower bound provided in Theorem 2.1 for any given $\alpha \in [0, 1]$, where the meaning of $\alpha$ is exactly the same as in the ConSE algorithm described in the last section. The framework of DP-ConSE is quite similar to that of ConSE, with changes only in technical details. Due to space limitations, we omit the precise description of DP-ConSE here and provide it in Appendix A.4. Instead, in the following we give an intuitive explanation of the three steps of DP-ConSE, together with the key theoretical guarantees.

Step 1. In the first half of the horizon, we use an improved version of the DP Successive Elimination algorithm of (Sajed & Sheffet, 2019) for each feature. Our goal in this phase is twofold for each feature: to identify the optimal action and to estimate the frequency of occurrences (based on our non-stationary seasonal assumption), all with minimal regret. For each feature, we compare the privatized average rewards of the two actions in batches. If the difference is large, we eliminate the suboptimal arm and claim that we have found the optimal arm with high probability. There are two technical designs involved here. First, the length of the batches increases exponentially, which strikes a balance between differential privacy protection and regret loss; a similar idea can be found in the DP Successive Elimination algorithm (Sajed & Sheffet, 2019) and the widely used tree mechanism (Chan et al., 2011) in DP bandit algorithms. Second, we use a novel technique of adding noise to the batch lengths for each feature. The reason is that, due to the seasonal non-stationarity assumption, it is essential to run batched learning for each feature independently, and to protect the patients' features, the lengths of the batches must also be privatized. To the best of our knowledge, this technique has not appeared in the DP bandit literature, and it again highlights the difficulty of the DP contextual bandit problem compared to the non-contextual setting. After identifying the optimal action, we continue to execute this action until the first half of the experiment is completed. After the completion of the first half, based on the observed occurrence frequencies of the features, we can estimate $f_j(n)$ for each feature $X_j$. This helps us decide the lengths of the RCTs in the second half used to estimate the CATE.

To make our claim valid, we first need to show that the elimination process terminates in Step 1 (with high probability). This is confirmed by the following lemma.

Lemma 3.1. Let DP-ConSE run with any given $\alpha \in [0, 1]$ and $\varepsilon > 0$. Then, with probability $1 - \frac{1}{n}$, DP-ConSE pulls the suboptimal arm of any feature $X_j$ in the first half of the horizon at most $O\Big( \big(\log n_j + \log\log(1/\Delta(X_j))\big)\Big(\frac{1}{\Delta(X_j)^2} + \frac{1}{\varepsilon\, \Delta(X_j)}\Big) \Big)$ times, where $n_j$ is the number of occurrences of the feature $X_j$ ($1 \le j \le M$).

So when $\varepsilon$ is a small constant and $n$ is sufficiently large, we can find the optimal arm for each feature in the first half with high probability, and the number of suboptimal pulls is bounded. As a corollary, we can bound the regret in the first half of the horizon as claimed.

Corollary 3.2. For sufficiently large $n$, the expected pseudo-regret in the first half of the horizon under DP-ConSE is at most $O\Big( \sum_{1 \le j \le M} \frac{\log n}{\Delta(X_j)} + \frac{M \log n}{\varepsilon} \Big)$.

Step 2. In the second half of the horizon, our primary objective is to ensure the required accuracy of the CATE estimate. Using the $f_j(n)$ estimated in Step 1, we determine the length of the RCT for each feature needed to attain the desired accuracy. It is important to note that we still need to add noise to the lengths of the RCTs, for the same reason as in Step 1.
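As a concrete, deliberately simplified illustration of Step 2, the sketch below draws a noisy RCT length and releases a Laplace-noised difference-in-means estimate for one feature. The helper names (`private_rct_length`, `private_cate_estimate`) and the plain rounded-Laplace length perturbation are our assumptions; DP-ConSE in Appendix A.4 instead uses a discrete Lap+ generator and per-feature bookkeeping, although the Laplace(2/(εT)) scale of the released estimate mirrors its output step.

```python
import numpy as np

rng = np.random.default_rng(3)

def private_rct_length(t_min, eps, rng):
    """Noisy RCT length: t_min plus rounded Laplace noise, truncated to stay >= 1.

    A simplified stand-in for the Lap+ generator used by DP-ConSE (Appendix A.4);
    the exact discrete distribution there differs.
    """
    return max(1, t_min + int(round(rng.laplace(0.0, 1.0 / eps))))

def private_cate_estimate(pull_arm, t_min, eps, rng):
    """Run a private RCT for one feature and release a noisy CATE estimate.

    pull_arm(a) -> reward in [0, 1].  The released statistic mirrors the output
    step of DP-ConSE: difference of the two arm means plus Laplace(2/(eps*T)) noise.
    """
    T = private_rct_length(t_min, eps, rng)
    sums = np.zeros(2)
    counts = np.zeros(2)
    for _ in range(T):
        a = rng.integers(0, 2)                  # equal-probability allocation
        sums[a] += pull_arm(a)
        counts[a] += 1
    means = sums / np.maximum(counts, 1)
    return (means[1] - means[0]) + rng.laplace(0.0, 2.0 / (eps * T))

# Toy usage: one feature whose true CATE is 0.3.
est = private_cate_estimate(lambda a: rng.binomial(1, [0.4, 0.7][a]),
                            t_min=2_000, eps=1.0, rng=rng)
print(f"private CATE estimate: {est:.3f}")
# The Laplace noise has scale 2/(eps*T), which vanishes faster than the
# O(1/sqrt(T)) sampling error -- the intuition behind "privacy is (almost) free"
# for the asymptotic variance in Proposition 3.6.
```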
After Step 2, the main task of estimating the CATE is completed, and the estimation accuracy is given in the following theorem.

Theorem 3.3. If DP-ConSE runs with $\alpha \in [0, 1]$ and $\varepsilon > 0$, the estimation error is $e(n, \hat{\Delta}) = O\Big(\frac{1}{\max\{f_{\min}(n)^{1-\alpha}, \frac{\log n}{\varepsilon}\}}\Big)$.

Step 3. Finally, for each feature, after completing the RCT phase in Step 2, we simply play the optimal action identified in the first half of the horizon for the remaining patients, with the aim of achieving minimum regret. The cumulative regret in the second half of the horizon can be bounded as in the following lemma.

Lemma 3.4. The expected regret in the second half of the horizon under DP-ConSE is at most $O\Big( \max\{f_{\min}(n)^{1-\alpha}, \frac{\log n}{\varepsilon}\} \sum_{1 \le j \le M} \Delta(X_j) \Big)$.

We have elaborated on how our algorithm strikes a balance between estimation, regret minimization, and differential privacy. To wrap things up, we have the following theorem answering Question 2; a rigorous proof can be found in the appendix.

Theorem 3.5. DP-ConSE is $(\varepsilon, \frac{1}{n})$-private. Moreover, let DP-ConSE run with any given $\alpha \in [0, 1]$ and $\varepsilon > 0$. The regret is $O\big(M \max\{f_{\min}(n)^{1-\alpha}, \frac{\log n}{\varepsilon}\}\big)$. As a result, we have $e_\nu(n, \hat{\Delta}_n)\, R_\nu(n, \pi) \le O(M)$, which is the same as Theorem 2.3 and matches the lower bound in Theorem 2.1.

Similar to the estimator without the privacy constraint, we can also prove that the estimator in DP-ConSE is asymptotically normal and, more interestingly, that it has the same asymptotic variance as in ConSE. This again shows that privacy is free for statistical inference, since the Laplace noise converges to 0 faster (as $n \to \infty$) than the Gaussian fluctuation and thus has no impact on the asymptotic variance of the estimator.

Proposition 3.6. The estimators $\hat{\Delta}(X_j)$ are unbiased for all features, i.e., $\mathbb{E}[\hat{\Delta}(X_j)] = \Delta(X_j)$ for all $1 \le j \le M$. Moreover, $\hat{\Delta}(X_j)$ is asymptotically normal for any $j$; formally, $\sqrt{\max\{f_{\min}(n)^{1-\alpha}, \frac{\log n}{\varepsilon}\}}\,\big(\hat{\Delta}(X_j) - \Delta(X_j)\big) \Rightarrow N(0, \sigma_{j0}^2 + \sigma_{j1}^2)$.

4. Pareto Optimal Curve

In this section, we characterize the Pareto optimal curve of regret and estimation error, which is the standard notion in multi-objective optimization problems and is widely adopted in the multi-objective bandit literature (Simchi-Levi & Wang 2023, Zhong et al. 2021). A formal definition is provided below.

Definition 4.1. A pair of regret and estimation error $(x(n), y(n))$ is Pareto optimal with respect to $n$ if there exists no algorithm that can attain a regret and estimation error pair $(\alpha(n), \beta(n))$ such that $\alpha(n) \le O(x(n)), \beta(n) = o(y(n))$ or $\alpha(n) = o(x(n)), \beta(n) \le O(y(n))$ as $n \to \infty$.

From the upper and lower bounds derived above in Theorems 2.1, 2.3, and 3.5, we know exactly what the Pareto optimal curve is.

Theorem 4.2. The Pareto optimal curve for regret and estimation error (with or without the privacy constraint) is characterized by $e_\nu(n, \hat{\Delta}_n)\, R_\nu(n, \pi) = O(M)$.

Figure 2. Pareto Optimal Curve (General).

Figure 2 shows the Pareto optimal curve in the general case. We can see that the extent of the Pareto optimal curve depends on the privacy parameter $\varepsilon$ and the minimum feature occurrence $f_{\min}$. As for the privacy parameter $\varepsilon$, we claim that privacy is almost free in terms of the trade-off between regret and estimation error, as $\varepsilon$ does not affect the equation of the Pareto optimal curve. However, the higher the requirement for privacy protection, the shorter the optimal curve becomes. This is because the smallest achievable estimation error and the smallest achievable regret depend on $\varepsilon$: they are $\frac{\varepsilon}{\log n}$ and $\frac{M \log n}{\varepsilon}$, respectively.
The minimum feature occurrence $f_{\min}$ is entirely determined by the (non-stationary) distribution of patient features rather than by the experimenter. Therefore, in different scenarios we may have different optimal curves. Figure 3 shows the Pareto optimal curves in different scenarios, namely when $f_{\min}$ takes different values. We can see that as $f_{\min}$ decreases gradually, the Pareto optimal curve shortens and eventually collapses into a single point. Compared to (Simchi-Levi & Wang, 2023), whose analysis cannot deal with sub-polynomial terms, it is impossible for them to characterize the case $f_{\min} = \mathrm{polylog}(n)$ (see Figure 3). The light blue areas in the different scenarios show the regions of estimation error and regret pairs $\big(e_\nu(n, \hat{\Delta}_n), R_\nu(n, \pi)\big)$ that algorithms may achieve, while no algorithm can attain the region below the Pareto optimal curves.

Figure 3. Pareto Optimal Curves (In Different Scenarios).

5. Experiments

In this section, we use numerical results to illustrate the theoretical findings and the performance of the proposed algorithms. In particular, we focus on evaluating the algorithm ConSE from Section 2. First of all, we want to mention that ConSE is in fact a meta-algorithm consisting of two phases, one for regret minimization and one for statistical estimation. For each phase, one can adopt scenario-specific algorithms to potentially improve performance. In this paper, we adopt batched successive elimination for regret minimization and an RCT for CATE estimation, mainly because these two algorithms are easier to privatize. However, when privacy is not a primary concern, one may adopt the celebrated UCB or Thompson Sampling for regret minimization, and other estimation algorithms such as the adaptive Neyman allocation proposed in (Zhao, 2023), or double machine learning methods with further structural assumptions to improve estimation efficiency. In particular, when the outcome is assumed to have a linear structure, it was shown in (Kim et al., 2021) that a doubly robust method can be adopted to capture the information in missing data. Below, we report the estimation accuracy of both our RCT-based estimation and double machine learning estimation under different regret budgets. We denote the mean-difference estimator by MD and the double machine learning estimator by DML. The length of every experiment is 20000. For simplicity, we do not consider heterogeneity of the treatment effect across features and assume there are no features.

Exp 1: Normal Linear Bandit. In experiment 1, we set $a_0 = [1, 0]$, $a_1 = [0, 1]$, $\theta = [1, 1]$, $\mu_1 = \mu_0 = 1$. The results are averaged over 50 replications. From the results, we can conclude that:

1. DML for predicting missing data is not helpful when the experiment length $n$ is sufficiently large.

Table 2. Normal Linear Bandit.

| | MD Error | DML Error |
|---|---|---|
| Reg=200 | 0.31 | 0.53 |
| Reg=400 | 0.03 | 0.07 |
| Reg=800 | 0.008 | 0.02 |

Table 3. When the regularity assumption of the linear bandit fails.

| | MD Error | DML Error |
|---|---|---|
| Reg=400 | 0.03 | 4.1 |
| Reg=800 | 0.008 | 4.0 |

2. The regret-estimation trade-off always holds regardless of the estimator.

Exp 2: When the Regularity Assumption of the Linear Bandit Fails. In experiment 2, we set $a_0 = [1, 0.001]$, $a_1 = [1, 0]$, $\mu_0 = 3$, $\mu_1 = 1$. In this case, the linear structure is ill-conditioned. The results are averaged over 50 replications. In this experiment, the MD estimator significantly outperforms the DML estimator, which shows that:

3. The DML estimator is very sensitive in extreme cases, while the MD estimator is robust.
4. MD estimator is much more efficient. The experiment of MD estimator can be finished within 1 second, while it takes 5 minutes to finish experiment of DML estimator. So to conclude, our numerical results validate the regretestimation tradeoff and show that naive MD estimator is already efficient and powerful when experiment is long enough. While the above experiments compare two specific estimators, it remains interesting to compare different estimation methods under various problem structures and more complex environments. 6. Concluding Remarks In this paper, we statistically investigate the trade-off between efficiency in decision-making and estimation precision of CATE in contextual bandit experiments. We adopt the minimax multi-objective optimization framework and Pareto optimality to characterize the trade-off. We first provide a lower bound of the multi-objective optimization problem and then propose Con SE to match that lower bound. Going one step further, we consider the constraint of protecting patients privacy and propose a differentially private version of Con SE (DP-Con SE) which still matches the lower bound, demonstrating that privacy is almost free. Besides, we also develop the asymptotic normality for both Con SE and DP-Con SE, which is crucial for statistical inference and hypothesis testing. Privacy Preserving Adaptive Experiment Design Impact Statement This paper presents work whose goal is to advance the field of adaptive experiment design. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. Abowd, J. M. The us census bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 2867 2867, 2018. Abrevaya, J., Hsu, Y.-C., and Lieli, R. P. Estimating conditional average treatment effects. Journal of Business & Economic Statistics, 33(4):485 505, 2015. Atan, O., Zame, W. R., and Schaar, M. Sequential patient recruitment and allocation for adaptive clinical trials. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1891 1900. PMLR, 2019. Azize, A. and Basu, D. When privacy meets partial information: A refined analysis of differentially private bandits. Advances in Neural Information Processing Systems, 35: 32199 32210, 2022. Bojinov, I., Rambachan, A., and Shephard, N. Panel experiments and dynamic causal effects: A finite population perspective. Quantitative Economics, 12(4):1171 1196, 2021. Bojinov, I., Simchi-Levi, D., and Zhao, J. Design and analysis of switchback experiments. Management Science, 69 (7):3759 3777, 2023. Carlini, N., Liu, C., Erlingsson, U., Kos, J., and Song, D. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp. 267 284, 2019. Chan, T.-H. H., Shi, E., and Song, D. Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14(3):1 24, 2011. Charisopoulos, V., Esfandiari, H., and Mirrokni, V. Robust and private stochastic linear bandits. In International Conference on Machine Learning, pp. 4096 4115. PMLR, 2023. Chen, X., Miao, S., and Wang, Y. Differential privacy in personalized pricing with nonparametric demand models. ar Xiv preprint ar Xiv:2109.04615, 2021. 
Chen, X., Simchi-Levi, D., and Wang, Y. Privacy-preserving dynamic personalized pricing with demand learning. Management Science, 68(7):4878 4898, 2022. Dai, J., Gradu, P., and Harshaw, C. Clip-ogd: An experimental design for adaptive neyman allocation in sequential experiments. ar Xiv preprint ar Xiv:2305.17187, 2023. Dwork, C., Mc Sherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pp. 265 284. Springer, 2006. Dwork, C., Naor, M., Pitassi, T., and Rothblum, G. N. Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, pp. 715 724, 2010. Erlingsson, U., Pihur, V., and Korolova, A. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pp. 1054 1067, 2014. Erraqabi, A., Lazaric, A., Valko, M., Brunskill, E., and Liu, Y.-E. Trading off rewards and errors in multi-armed bandits. In Artificial Intelligence and Statistics, pp. 709 717. PMLR, 2017. Fan, Q., Hsu, Y.-C., Lieli, R. P., and Zhang, Y. Estimation of conditional average treatment effects with highdimensional data. Journal of Business & Economic Statistics, 40(1):313 327, 2022. Greenhill, S., Rana, S., Gupta, S., Vellanki, P., and Venkatesh, S. Bayesian optimization for adaptive experimental design: A review. IEEE access, 8:13937 13948, 2020. Guha Thakurta, A. and Smith, A. (nearly) optimal algorithms for private online learning in full-information and bandit settings. Advances in Neural Information Processing Systems, 26, 2013. Hahn, J., Hirano, K., and Karlan, D. Adaptive experimental design using the propensity score. Journal of Business & Economic Statistics, 29(1):96 108, 2011. Hanna, O. A., Girgis, A. M., Fragouli, C., and Diggavi, S. Differentially private stochastic linear bandits:(almost) for free. ar Xiv preprint ar Xiv:2207.03445, 2022. Johari, R., Pekelis, L., and Walsh, D. J. Always valid inference: Bringing sequential analysis to a/b testing. ar Xiv preprint ar Xiv:1512.04922, 2015. Privacy Preserving Adaptive Experiment Design Kato, M., Ishihara, T., Honda, J., and Narita, Y. Efficient adaptive experimental design for average treatment effect estimation. ar Xiv preprint ar Xiv:2002.05308, 2020. Kim, W., Kim, G.-S., and Paik, M. C. Doubly robust thompson sampling with linear payoffs. Advances in Neural Information Processing Systems, 34:15830 15840, 2021. Komarova, T. and Nekipelov, D. Identification and formal privacy guarantees. ar Xiv preprint ar Xiv:2006.14732, 2020. Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1): 4 22, 1985. Lee, J. Y. and Bell, D. R. Neighborhood social capital and social learning for experience attributes of products. Marketing Science, 32(6):960 976, 2013. Melis, L., Song, C., De Cristofaro, E., and Shmatikov, V. Exploiting unintended feature leakage in collaborative learning. In 2019 IEEE symposium on security and privacy (SP), pp. 691 706. IEEE, 2019. Niu, F., Nori, H., Quistorff, B., Caruana, R., Ngwe, D., and Kannan, A. Differentially private estimation of heterogeneous causal effects. In Conference on Causal Learning and Reasoning, pp. 618 633. PMLR, 2022. Qin, C. and Russo, D. Adaptivity and confounding in multi-armed bandit experiments. ar Xiv preprint ar Xiv:2202.09036, 2022. 
Sajed, T. and Sheffet, O. An optimal private stochasticmab algorithm based on optimal private stopping rule. In International Conference on Machine Learning, pp. 5579 5588. PMLR, 2019. Shariff, R. and Sheffet, O. Differentially private contextual linear bandits. Advances in Neural Information Processing Systems, 31, 2018. Simchi-Levi, D. and Wang, C. Multi-armed bandit experimental design: Online decision-making and adaptive inference. In International Conference on Artificial Intelligence and Statistics, pp. 3086 3097. PMLR, 2023. Tossou, A. and Dimitrakakis, C. Algorithms for differentially private multi-armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016. Wager, S. and Athey, S. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228 1242, 2018. Xiong, R., Athey, S., Bayati, M., and Imbens, G. Optimal experimental design for staggered rollouts. Management Science, 2023. Yang, F., Ramdas, A., Jamieson, K. G., and Wainwright, M. J. A framework for multi-a (rmed)/b (andit) testing with online fdr control. Advances in Neural Information Processing Systems, 30, 2017. Yao, J., Brunskill, E., Pan, W., Murphy, S., and Doshi-Velez, F. Power constrained bandits. In Machine Learning for Healthcare Conference, pp. 209 259. PMLR, 2021. Zhao, J. Adaptive neyman allocation. 2023. Zheng, K., Cai, T., Huang, W., Li, Z., and Wang, L. Locally differentially private (contextual) bandits learning. Advances in Neural Information Processing Systems, 33: 12300 12310, 2020. Zhong, Z., Cheung, W. C., and Tan, V. Y. Achieving the pareto frontier of regret minimization and best arm identification in multi-armed bandits. ar Xiv preprint ar Xiv:2110.08627, 2021. Privacy Preserving Adaptive Experiment Design A. Appendix A.1. Proof to Theorem 2.1 We first consider the case that M = 1. In this case, we are actually talking about an ATE problem, that means no feature. And first, we start from proving the Lemma 1.1 below. Lemma 1.1 For any given online decision-making policy π, the error of any ATE estimators can be lower bounded as follows, for any function ϕ : n 0, 1 4 and any u E0 inf ˆ n max ν E0 Pν ˆ n ν ϕ(n) 1 3 ϕ(n)2 Ru(n, π) We define a Rademacher-like distribution as X Rad(p) means X = 1 with probability p and X = 1 with probability 1 p. Consider the following two bandits instance ν1 = Rad 1 ξ 2 and ν2 = Rad 1 ξ 2 , Rad 1+2ϕ(t) 2 . Note that the treatment effects of ν1 and ν2 is 1 = ξ and 2 = ξ + 2ϕ(t). ξ can be any number in (0, 1]. By such constructions and the symmetry, ν1 and ν2 can represent all the possible instances without loss of generality. We define the minimum distance test ψ ˆ t that is associated to ˆ t by ψ ˆ t = arg mini=1,2 ˆ t i . If ψ ˆ t = 1, we know that ˆ t 1 ˆ t 2 . By the triangle inequality, we can have, if ψ ˆ t = 1, ˆ t 2 | 1 2| ˆ t 1 | 1 2| ˆ t 2 , which yields that ˆ t 2 1 2 | 1 2| = ϕ(t). Symmetrically, if ψ ˆ t = 2, we can have | ˆ t 1 1 2 |= ϕ(t). Therefore, we can use this to show inf ˆ t max ν E0 Pν ˆ t ν 2 ϕ(t) inf ˆ t max i {1,2} Pνi ˆ t i 2 ϕ(t) inf ˆ t max i {1,2} Pνi ψ ˆ t = i inf ψ max i {1,2} Pνi(ψ = i) where the last infimum is taken over all tests ψ based on Ht that take values in {1, 2}. 
inf ˆ t max ν E0 Pν ˆ t ν 2 ϕ(t) inf ψ max i {1,2} Pνi(ψ = i) 1 2 inf ψ (Pν1(ψ = 2) + Pν2(ψ = 1)) = 1 2 [1 TV (Pν1, Pν2)] 1 2KL (Pν1, Pν2) 3ξ Rν1(t, π) where the equality holds due to Neyman-Pearson lemma and the third inequality holds due to Pinsker s inequality, and the fourth inequality holds due to the following: Privacy Preserving Adaptive Experiment Design KL (Pν1, Pν2) = s=1 Eν1 [KL (P1,At, P2,At)] i=1 Eν1 [Ti(n)] KL (P1,i, P2,i) 1 4 (ϕ(t))2 (Eν1 [Ti(n)]) 3ξ Rν1(t, π) where we use KL Rad 1 2 , Rad 1+2ϕ(t) 1/4 ϕ(t)2 16ϕ(t)2 3 , and the last inequality holds because the history Ht is generated by π and ξEν1 [Ti(n)] is just the expected regret of ν1, which is just the definition of regret. Now we finish the proof of Lemma 1.1. Then, we can have, given policy π, and ˆ n, if ϕ(n) q 3| u| 32Ru(n,π) for some u E0, max ν E0 E ˆ n ν 2 ϕ(t)2 max ν E0 Pν ˆ t ν 2 ϕ(t) 3ξ Rν1(t, π) where the second inequality holds due to Lemma 1.1. We use νπ, ˆ n to denote arg maxν E0 E ˆ n ν 2 for any given policy π and ˆ n, h eν n, ˆ n Rν(n, π) i eνπ, ˆ n n, ˆ n Rνπ, ˆ n(n, π) 4 Rνπ, ˆ n(n, π) where the last equation holds because we plug in ϕ(n) and ν = Θ(1) for ν E0. Since the above inequalities hold for any policy π and ˆ n, we finish the proof of the no feature case. In general case, for any 1 j M, we have the following: Rj ν(n, π) := Eπ " n X i=1 I{xi=Xj} [ri(a (xi)|xi) ri (ai|xi)] = E [Rν(nj, π)] , where nj = Pn i=1 I{xi=Xj} is a random variable and E h PM j=1 Rν(nj, π) i = PM j=1 Rj ν(n, π) = Rν(n, π). For using the result in no feature case, consider the following two bandits instance νM 1 = {Xj : ν1|1 j M}, νM 2 = {Xj : ν2|1 j M}, where ν1 = Rad 1 ξ 2 and ν2 = Rad 1 ξ 2 , Rad 1+2ϕ(t) 2 are introduced in the case M = 1. Using the result in no feature case, we have the following: h eν n, ˆ n(Xj) Rν(nj, π) i Θ(1) Privacy Preserving Adaptive Experiment Design for any given nj and 1 j M. Add their squares up, for any given {nj}M j=1,we have max 1 j M eν n, ˆ n(Xj) M X j=1 Rν(nj, π) Therefore, we have the result max 1 j M eν n, ˆ n(Xj) Rν(n, π) Θ(M), A.2. Proof to Theorem 2.3 Firstly, we give the proof of Rν(n, π) O M max{fmin(n)1 α, log n} below. Lemma 2.1 Let Algorithm 1 runs with any given α [0, 1]. Then w.p. 1 1 n it holds that Algorithm 1 pulls the bad arm of any feature Xj in the first half periods for at most O (log nj + log log (1/ (Xj))) 1 (Xj)2 where nj is the number of occurrences of the feature Xj (1 j M). Proof of Lemma 2.1 Given an epoch e we denote by Ee the event where for all arms a S it holds that |µa µa| he and also denote E = T e 1 Ee. (we use T := nj represents the number of occurrences of the feature Xj and β = 1 n below) First, by definition, we can calculate that: R1 16 log T, so Re 2Re 1 2e+3 log T. Furthermore, the Hoeffding bound gives that Pr [Ee] 1 β 4e2 , thus Pr[E] 1 β e 1 e 2 1 1 T (T 3). The remainder of the proof continues under the assumption the E holds, and so, for any epoch e and any viable arm a in this epoch we have |µa µa| he. As a result for any epoch e and any two arms a1, a2 we have that |( µa1 µa2) (µa1 µa2)| 2he. Next, we argue that under E the optimal arm a is never eliminated. Indeed, for any epoch e, we denote the arm ae = argmaxa S µa and it is simple enough to see that µae µa 0 + 2he, so the algorithm doesn t eliminate a . Next, we argue that, under E, in any epoch e we eliminate all viable arms with suboptimality gap 2 e = e. Fix an epoch e and a viable arm a with suboptimality gap a e. 
Note that we have set parameter Re so that log (16 e2/β) v u u t log (16 e2/β) 2 32 log(16e2/β) Therefore, since arm a remains viable, we have that µmax µa µa µa a (2he) > e 1 2 2 > 2he, guaranteeing that arm a is removed from S. Lastly, fix a suboptimal arm a and let e(a) be the first epoch such that a e(a), implying e(a) a < e(a) 1 = 2 e. Using the immediate observation that for any epoch e we have Re Re+1/2, we have that the total number of pulls of arm a is X e e(a) Re X e e(a) 2e e(a)Re(a) Re(a) X 32 log 16 e(a)2/β 2e + 8 log 8 e(a)2/β The bounds e > a/2, |S| 2, e(a) < log2 (2/ a) allow us to conclude and infer that under E the total number of pulls of arm a is at most log (2 log (2/ a) /β) 1024 = O (log nj + log log (1/ (Xj))) 1 (Xj)2 Privacy Preserving Adaptive Experiment Design We finish the proof of Lemma 2.1. Therefore we have the following straightforward corollary. Corollary 2.2 For sufficiently large n, the expected pseudo regret in the first half periods of Algorithm 1 is at most O P 1 j M log n (Xj) . Actually, by using the result of lemma 2.1, for n > log 1/ (Xj), we have Rfirst ν (n, π) Σ1 j M (Xj)O (log nj + log log (1/ (Xj))) 1 (Xj)2 For the regret of the second half periods, noticed that with the probability 1 1 n, the optimal arm would be chosen correctly in the first half periods. Therefore, the expected regret of the second half periods of DP-Con SE is: Rsecond ν (n, π) Σ1 j M (Xj)E [Tmin] + 1 n O (n) = O max{fmin(n)1 α, log n} X Therefore, when (Xj) = O(1) ( 1 j M), we have Rν(n, π) = Rfirst ν (n, π) + Rsecond ν (n, π) O M max{fmin(n)1 α, log n} Secondly, we give the proof of eν(n, ˆ n) O 1 max{fmin(n)1 α,log n} below. Note that for any feature Xj(1 j M), we learn at least Tj = min{fj(n) fj( n 2 ), Tmin} periods with equal probabilities of two arms, so the MSE estimation error of feature Xj is bounded as O 1 Tj . Hence, our estimation error is bounded as O 1 min1 j M Tj . Now we focus on Tj. For any 1 j M and 1 t n, notice the characteristic function I{xt=Xj} {0, 1} and follows Bernoulli(pt j), we have E h ˆfj i = fj( n 2 pt j Therefore, by Chernoff bound (multiplicative form (relative error)), we have P ˆfj < fj( n e 0.06fj( n where C = 0.06 C2 > 0, the last inequality is correct due to our assumption (1). Combining (1) and (2) in our assumption 1.1, we know min1 j M Tj Ω max{fmin(n)1 α, log n} with at least the probability 1 Me Cfmin(n). Therefore, our estimation error is bounded as eν(n, ˆ n) O 1 max{fmin(n)1 α, log n} + Me Cfmin(n)O (1) = O 1 max{fmin(n)1 α, log n} Thus, we finish the proof of theorem 2.2. Privacy Preserving Adaptive Experiment Design A.3. Proof to Theorem 3.3 The following proof is under the event T 2Tmin), which probability is at least 1 M n2 . Note that for any feature Xj(1 j M), we learn at least Tj = min{fj(n) fj( n 2Tmin} periods with equal probabilities of two arms, so the MSE estimation error of feature Xj is bounded as O 1 Tj . Hence, our estimation error is bounded as O 1 min1 j M Tj . Now we focus on Tj. For any 1 j M and 1 t n, notice the characteristic function I{xt=Xj} {0, 1} and follows Bernoulli(pt j), we have E h ˆfj i = fj( n 2 pt j Therefore, by Chernoff bound (multiplicative form (relative error)), we have P ˆfj < fj( n e 0.06fj( n 2 ) e Cfmin(n) where C = 0.06 C2 > 0, the last inequality is correct due to our assumption (1). Combining (1) and (2) in our assumption 1.1, we know min1 j M Tj Ω max{fmin(n)1 α, log n ε } with at least the probability 1 Me Cfmin(n). 
A.3. Proof of Theorem 3.3

The following proof is conditioned on the event $\bigcap_{1\le j\le M}\{T_j\ge\tfrac12 T_{\min}\}$, whose probability is at least $1-\frac{M}{n^2}$. Note that for any feature $X_j$ ($1\le j\le M$), we learn for at least $T'_j=\min\{f_j(n)-f_j(\tfrac n2),\,\tfrac12T_{\min}\}$ periods with equal probabilities on the two arms, so the MSE of the estimator for feature $X_j$ is bounded by $O(\frac{1}{T'_j})$. Hence, our estimation error is bounded by $O\big(\frac{1}{\min_{1\le j\le M}T'_j}\big)$.

Now we focus on $T'_j$. For any $1\le j\le M$ and $1\le t\le n$, notice that the indicator $\mathbb{I}\{x_t=X_j\}\in\{0,1\}$ follows $\mathrm{Bernoulli}(p^t_j)$, so we have $\mathbb{E}[\hat f_j]=f_j(\frac n2)=\sum_{t\le n/2}p^t_j$. Therefore, by the Chernoff bound (multiplicative form), we have
$$
\mathbb{P}\!\left(\hat f_j<\tfrac12 f_j(\tfrac n2)\right)\;\le\;e^{-0.06 f_j(n/2)}\;\le\;e^{-Cf_{\min}(n)},
$$
where $C=0.06\,C_2>0$ and the last inequality holds due to part (1) of our assumption. Combining parts (1) and (2) of Assumption 1.1, we know that $\min_{1\le j\le M}T'_j\ge\Omega\big(\max\{f_{\min}(n)^{1-\alpha},\frac{\log n}{\varepsilon}\}\big)$ with probability at least $1-Me^{-Cf_{\min}(n)}$. Therefore, our estimation error is bounded as
$$
e_\nu(n,\hat\Delta_n)\;\le\;O\!\left(\frac{1}{\max\{f_{\min}(n)^{1-\alpha},\frac{\log n}{\varepsilon}\}}\right)+Me^{-Cf_{\min}(n)}\,O(1)+\frac{M}{n^2}\,O(1)\;=\;O\!\left(\frac{1}{\max\{f_{\min}(n)^{1-\alpha},\frac{\log n}{\varepsilon}\}}\right).
$$

A.4. DP-Con SE Algorithm

In this subsection, we provide the precise formulation of the algorithm DP-Con SE, which is a differentially private version of the Con-SE algorithm provided in Section 2. Before introducing DP-Con SE, we need two pieces of notation.

Define a random-variable generator: given $\varepsilon>0$ and $m>0$, let $\mathrm{Lap}^{+}(m)=\mathrm{Lap}^{+}_{\varepsilon}(m)$ be the random variable satisfying
$$
\mathbb{P}\big(\mathrm{Lap}^{+}(m)=[m]+k\big)=\frac{e^{-\frac{\varepsilon}{2}|k|}\big(e^{\frac{\varepsilon}{2}}-1\big)}{e^{\frac{\varepsilon}{2}}+1-e^{-\frac{\varepsilon}{2}[m]}},\qquad k\ge-[m],\ k\in\mathbb{Z}.
$$

Define four number sequences: for $e=\mathrm{epoch}=1,2,3,\dots$, $\varepsilon>0$ and the total number of patients $n$, define
$$
\Delta_e=2^{-\mathrm{epoch}},\qquad
R_e=\max\!\left\{\frac{32\log(16n\cdot\mathrm{epoch}^2)}{\Delta_e^2},\ \frac{8\log(8n\cdot\mathrm{epoch}^2)}{\varepsilon\Delta_e}\right\},\qquad
h_e=\sqrt{\frac{\log(16n\cdot\mathrm{epoch}^2)}{2R_e}},\qquad
c_e=\frac{\log(8n\cdot\mathrm{epoch}^2)}{\varepsilon R_e}.
$$

Algorithm 2 DP-Con SE
1: Input: $\alpha$, privacy-loss $\varepsilon$
2: Initialize: $S_j\leftarrow\{0,1\}$, epoch $e_j\leftarrow0$, $r_j\leftarrow0$, $\mu^j_i\leftarrow0$, $n_j\leftarrow0$ ($i=0,1$; $j=1,2,\dots,M$)
3: for $t=1$ to $[\frac n2]$ do
4:   Get feature $x_t=X_{j_t}\in\mathcal{X}$
5:   Increment $n_{j_t}\leftarrow n_{j_t}+1$
6:   if $|S_{j_t}|=2$ then
7:     Select action $a_t\in\{0,1\}$ with equal probabilities ($\frac12$) and update the mean $\mu^{j_t}_{a_t}$
8:     Increment $r_{j_t}\leftarrow r_{j_t}+1$
9:     if $r_{j_t}\ge R^{j_t}_{e_{j_t}}$ then
10:      if $e_{j_t}\ge1$ then
11:        Set $\tilde\mu^{j_t}_i\leftarrow\mu^{j_t}_i+\mathrm{Lap}\big(\frac{2}{\varepsilon R_{e_{j_t}}}\big)$
12:        Remove arm $i$ from $S_{j_t}$ if $\max\{\tilde\mu^{j_t}_0,\tilde\mu^{j_t}_1\}-\tilde\mu^{j_t}_i>2h_{e_{j_t}}+2c_{e_{j_t}}$ ($i=0,1$)
13:      end if
14:      Increment epoch $e_{j_t}\leftarrow e_{j_t}+1$
15:      Set $r_{j_t}\leftarrow0$
16:      Zero the means: $\mu^{j_t}_i\leftarrow0$, $i\in\{0,1\}$
17:    end if
18:  else
19:    Pull the arm in $S_{j_t}$
20:  end if
21:  if $t=[\frac n2]$ then
22:    $\hat f_j=n_j$ ($1\le j\le M$)
23:    $T_{\min}=\max\{\log n,\ \min\{\hat f_1^{\,1-\alpha},\hat f_2^{\,1-\alpha},\dots,\hat f_M^{\,1-\alpha}\}\}$
24:  end if
25: end for
26: for $j=1$ to $M$ do
27:   $T_j=\mathrm{Lap}^{+}_{\varepsilon}(T_{\min})$
28:   $n_j=0$
29: end for
30: for $t=[\frac n2]+1$ to $n$ do
31:   Get feature $x_t=X_{j_t}\in\mathcal{X}$
32:   Increment $n_{j_t}\leftarrow n_{j_t}+1$
33:   if $n_{j_t}\le T_{j_t}$ then
34:     Select action $a_t\in\{0,1\}$ with equal probabilities ($\frac12$) and update the mean $\mu^{j_t}_{a_t}$
35:     if $n_{j_t}=T_{j_t}$ then
36:       Output $\hat\Delta(X_{j_t})=\mu^{j_t}_1-\mu^{j_t}_0+\mathrm{Lap}\big(\frac{2}{\varepsilon T_{j_t}}\big)$
37:     end if
38:   else
39:     Pull the arm in $S_{j_t}$. (If $|S_{j_t}|=2$, pull any arm in $S_{j_t}$.)
40:   end if
41: end for
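Line 36 releases the per-feature CATE estimate through a standard Laplace output perturbation. The snippet below is a minimal sketch of that single step, assuming rewards bounded in $[0,1]$ and simply reusing the noise scale $2/(\varepsilon T_j)$ that the algorithm specifies; the function and variable names are illustrative, not part of the paper.

```python
import numpy as np

def private_cate_estimate(rewards_arm1, rewards_arm0, eps, rng):
    """Sketch of the output-perturbation step in Algorithm 2 (line 36).

    rewards_arm1 / rewards_arm0: rewards (assumed in [0, 1]) collected for one
    feature during its T_j forced-exploration periods; eps: privacy loss.
    """
    T_j = len(rewards_arm1) + len(rewards_arm0)            # exploration periods used
    delta_hat = np.mean(rewards_arm1) - np.mean(rewards_arm0)
    noise = rng.laplace(loc=0.0, scale=2.0 / (eps * T_j))  # Lap(2 / (eps * T_j))
    return delta_hat + noise

# usage: Bernoulli outcomes for one feature, roughly T_j/2 pulls per arm
rng = np.random.default_rng(0)
r1 = rng.binomial(1, 0.7, size=250)
r0 = rng.binomial(1, 0.5, size=250)
print(private_cate_estimate(r1, r0, eps=1.0, rng=rng))     # about 0.2 plus small noise
```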
A.5. Proof of Lemma 3.1, Corollary 3.2, Lemma 3.4 and Theorem 3.5

A.5.1. PROOF OF LEMMA 3.1

Given an epoch $e$ we denote by $E_e$ the event that for all arms $a\in S$ it holds that (we use $T:=n_j$ to represent the number of occurrences of the feature $X_j$ and $\beta=\frac1n$ below)
(i) $|\hat\mu_a-\mu_a|\le h_e$; (ii) $|\hat\mu_a-\tilde\mu_a|\le c_e$; (iii) $R_e\le R^{j}_e\le 3R_e$;
and we denote $E=\bigcap_{e\ge1}E_e$.

First, by definition we can calculate that $R_1\ge\frac{16\log T}{\varepsilon}$, so $R_e\ge 2R_{e-1}\ge\frac{2^{e+3}\log T}{\varepsilon}$. Hence,
$$
\mathbb{P}\big((iii)^c\big)\;\le\;2\exp\{-R_e\varepsilon\}\;\le\;\frac{2}{T\cdot2^{e+3}}.
$$
Furthermore, given (iii), the Hoeffding bound, the concentration of the Laplace distribution and a union bound over all arms in $S_0$ give that $\Pr[E_e]\ge1-\big(\frac{\beta}{4e^2}+\frac{\beta}{4e^2}+\frac{2}{T\cdot2^{e+3}}\big)$, thus
$$
\Pr[E]\;\ge\;1-\beta\sum_{e\ge1}\frac{1}{2e^2}-\sum_{e\ge1}\frac{2}{T\cdot2^{e+3}}\;\ge\;1-\frac1T\qquad(T\ge3).
$$
The remainder of the proof continues under the assumption that $E$ holds, and so, for any epoch $e$ and any viable arm $a$ in this epoch we have $|\tilde\mu_a-\mu_a|\le h_e+c_e$. As a result, for any epoch $e$ and any two arms $a_1,a_2$ we have $|(\tilde\mu_{a_1}-\tilde\mu_{a_2})-(\mu_{a_1}-\mu_{a_2})|\le2h_e+2c_e$.

Next, we argue that under $E$ the optimal arm $a^*$ is never eliminated. Indeed, for any epoch $e$, denote $\bar a_e=\arg\max_{a\in S}\tilde\mu_a$; it is simple to see that $\tilde\mu_{\bar a_e}-\tilde\mu_{a^*}\le0+2h_e+2c_e$, so the algorithm does not eliminate $a^*$.

Next, we argue that, under $E$, in any epoch $e$ we eliminate all viable arms with suboptimality gap $\Delta_a\ge2\Delta_{e+1}=\Delta_e$. Fix an epoch $e$ and a viable arm $a$ with suboptimality gap $\Delta_a\ge\Delta_e$. Note that we have set the parameter $R_e$ so that
$$
h_e=\sqrt{\frac{\log(16e^2/\beta)}{2R_e}}\le\sqrt{\frac{\log(16e^2/\beta)\,\Delta_e^2}{2\cdot32\log(16e^2/\beta)}}=\frac{\Delta_e}{8},
\qquad
c_e=\frac{\log(8e^2/\beta)}{R_e\varepsilon}\le\frac{\log(8e^2/\beta)\,\varepsilon\Delta_e}{\varepsilon\cdot8\log(8e^2/\beta)}=\frac{\Delta_e}{8}.
$$
Therefore, since arm $a$ remains viable, we have
$$
\tilde\mu_{\max}-\tilde\mu_a\;\ge\;\mu_{a^*}-\mu_a-(2h_e+2c_e)\;=\;\Delta_a-2h_e-2c_e\;\ge\;\Delta_e-\frac{\Delta_e}{4}-\frac{\Delta_e}{4}\;=\;\frac{\Delta_e}{2}\;\ge\;2h_e+2c_e,
$$
guaranteeing that arm $a$ is removed from $S$.

Lastly, fix a suboptimal arm $a$ and let $e(a)$ be the first epoch such that $\Delta_a\ge\Delta_{e(a)}$, implying $\Delta_{e(a)}\le\Delta_a<\Delta_{e(a)-1}=2\Delta_{e(a)}$. Using the immediate observation that for any epoch $e$ we have $R_e\le R_{e+1}/2$, the total number of pulls of arm $a$ is
$$
\sum_{e\le e(a)}R^{j}_e\;\le\;3\sum_{e\le e(a)}R_e\;\le\;3\sum_{e\le e(a)}2^{\,e-e(a)}R_{e(a)}\;\le\;6R_{e(a)}\;\le\;\frac{192\log\big(16\,e(a)^2/\beta\big)}{\Delta_{e(a)}^2}+\frac{48\log\big(8\,e(a)^2/\beta\big)}{\varepsilon\Delta_{e(a)}}.
$$
The bounds $\Delta_{e(a)}>\Delta_a/2$, $|S|\le2$ and $e(a)<\log_2(2/\Delta_a)$ allow us to conclude that under $E$ the total number of pulls of arm $a$ is at most $\frac{3\cdot1024\,\log\left(2\log(2/\Delta_a)/\beta\right)}{\Delta_a^{2}}$ plus an analogous $O\!\big(\frac{\log(2\log(2/\Delta_a)/\beta)}{\varepsilon\Delta_a}\big)$ privacy term, i.e.
$$
O\!\left(\big(\log n_j+\log\log(1/\Delta(X_j))\big)\Big(\frac{1}{\Delta(X_j)^{2}}+\frac{1}{\varepsilon\,\Delta(X_j)}\Big)\right).
$$

A.5.2. PROOF OF COROLLARY 3.2

By the result of Lemma 3.1, for $n>\log(1/\Delta(X_j))$ we have
$$
R^{\mathrm{first}}_\nu(n,\pi)\;\le\;\sum_{1\le j\le M}\Delta(X_j)\cdot O\!\left(\big(\log n_j+\log\log(1/\Delta(X_j))\big)\Big(\frac{1}{\Delta(X_j)^2}+\frac{1}{\varepsilon\Delta(X_j)}\Big)\right).
$$

A.5.3. PROOF OF LEMMA 3.4

Notice that with probability $1-\frac1n$, the optimal arm is identified correctly in the first half periods. Therefore, the expected regret of the second half periods of DP-Con SE is
$$
R^{\mathrm{second}}_\nu(n,\pi)\;\le\;\sum_{1\le j\le M}\Delta(X_j)\,\mathbb{E}[T_j]+\frac1n\,O(n)\;=\;O\!\left(\max\Big\{f_{\min}(n)^{1-\alpha},\frac{\log n}{\varepsilon}\Big\}\sum_{1\le j\le M}\Delta(X_j)\right).
$$

A.5.4. PROOF OF THEOREM 3.5

Adding the results of Corollary 3.2 and Lemma 3.4, under the condition that $\Delta(X_j)=O(1)$ ($\forall\,1\le j\le M$), we can easily prove that
$$
R_\nu(n,\pi)\;\le\;O\!\left(M\max\Big\{f_{\min}(n)^{1-\alpha},\frac{\log n}{\varepsilon}\Big\}\right).
$$
As a result, we have $e_\nu(n,\hat\Delta_n)\,R_\nu(n,\pi)\le O(M)$, which is the same as in Theorem 2.3 and matches the lower bound in Theorem 2.1.

Finally, we need to prove that DP-Con SE is $(\varepsilon,\frac1n)$-differentially private. The following argument is conditioned on the event $\bigcap_{1\le j\le M}\bigcap_{e\ge1}\{R^{j}_e\ge R_e\}$, whose probability is at least $1-\frac1n$ (for $n\ge3M$). For any two neighboring datasets $D$ and $D'$, suppose $D$ and $D'$ differ only at time $t$. We discuss the following cases for $t$:

(1) $t$ is in the second half, i.e. $[\frac n2]+1\le t\le n$. In this case, the probabilities of the actions do not depend on the features or rewards (they are always $\frac12$), and both the output and the running periods $T_j$ are released through the Laplace mechanism, each of which is $\frac{\varepsilon}{2}$-private. Therefore, in this case, $\mathbb{P}(D)\le e^{\varepsilon}\mathbb{P}(D')$.

(2) $t$ is in the first half, i.e. $1\le t\le[\frac n2]$. Notice that $|T_{\min}-T'_{\min}|\le1$, and $T_j$ and $T'_j$ are released through the Laplace mechanism, which is $\frac{\varepsilon}{2}$-private for the second half. Moreover, for the first half, the mean values $\mu$ and the running periods are all released through the Laplace mechanism, which is $\frac{\varepsilon}{2}$-private. Therefore, by the composition theorem, $\mathbb{P}(D)\le e^{\varepsilon}\mathbb{P}(D')$.

In conclusion, for any two neighboring datasets $D$ and $D'$, we have
$$
\mathbb{P}(D)\;\le\;e^{\varepsilon}\,\mathbb{P}(D')+\frac{M}{n^2}\;\le\;e^{\varepsilon}\,\mathbb{P}(D')+\frac1n.
$$

A.6. Proof of Proposition 2.6 and Proposition 3.6

A.6.1. PROOF OF PROPOSITION 2.6

By our definition and the Central Limit Theorem (CLT), we know that $T_{\min}=\max\{f_{\min}(n)^{1-\alpha},\log n\}$ and
$$
\sqrt{T_{\min}}\,\big(\hat\Delta(X_j)-\Delta(X_j)\big)\;\xrightarrow{\ \mathcal{L}\ }\;N\big(0,\sigma^2_{j0}+\sigma^2_{j1}\big).
$$
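Proposition 2.6 is a plain CLT statement for the forced-exploration estimator. The snippet below is a quick Monte Carlo check of that scaling under assumptions made only for this sketch: Bernoulli outcomes and $T_{\min}$ observations per arm, so that the limiting variance is $\sigma^2_{j0}+\sigma^2_{j1}$.

```python
import numpy as np

# Monte Carlo check of the scaling in Proposition 2.6:
# sqrt(T_min) * (Delta_hat - Delta) is approximately N(0, sigma_j0^2 + sigma_j1^2).
rng = np.random.default_rng(0)
p1, p0 = 0.7, 0.5                       # outcome means of the two arms for one feature
T_min, reps = 5_000, 50_000

mu1_hat = rng.binomial(T_min, p1, size=reps) / T_min   # T_min Bernoulli outcomes per arm
mu0_hat = rng.binomial(T_min, p0, size=reps) / T_min
z = np.sqrt(T_min) * ((mu1_hat - mu0_hat) - (p1 - p0))

print("empirical variance :", z.var())
print("sigma_j0^2 + sigma_j1^2:", p0 * (1 - p0) + p1 * (1 - p1))   # 0.25 + 0.21
```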
A.6.2. PROOF OF PROPOSITION 3.6

We know that $T_{\min}=\max\{f_{\min}(n)^{1-\alpha},\frac{\log n}{\varepsilon}\}$; therefore,
$$
\sqrt{\max\Big\{f_{\min}(n)^{1-\alpha},\frac{\log n}{\varepsilon}\Big\}}\,\big(\hat\Delta(X_j)-\Delta(X_j)\big)
=\sqrt{T_{\min}}\,\Big[\big(\hat\mu^j_1-\hat\mu^j_0\big)-\big(\mu^j_1-\mu^j_0\big)+\mathrm{Lap}\big(\tfrac{2}{\varepsilon T_j}\big)\Big].
$$
To prove that the above expression is asymptotically normal, we prove the following three claims:
(1) $\frac{T_j}{T_{\min}}\xrightarrow{\ P\ }1$; (2) $\sqrt{T_{\min}}\,\mathrm{Lap}\big(\tfrac{2}{\varepsilon T_j}\big)\xrightarrow{\ P\ }0$; (3) $\sqrt{T_{\min}}\,\big[(\hat\mu^j_1-\hat\mu^j_0)-(\mu^j_1-\mu^j_0)\big]\xrightarrow{\ \mathcal{L}\ }N(0,\sigma^2_{j0}+\sigma^2_{j1})$.

For claim (1), by the definition of $T_j$, we know that for any $\delta>0$,
$$
\mathbb{P}\!\left(\Big|\frac{T_j}{T_{\min}}-1\Big|\ge\delta\right)\;\le\;2\sum_{k\ge\delta T_{\min}}e^{-\frac{\varepsilon}{2}k}\;\longrightarrow\;0,
$$
as $T_{\min}\ge\log n\to+\infty$.

For claim (2), using claim (1), we know that for any $\delta\in(0,1)$,
$$
\mathbb{P}\big(|X|>\delta\big)\;\le\;\mathbb{P}\!\left(\Big|\frac{T_j}{T_{\min}}-1\Big|\ge\delta\right)+\mathbb{P}\!\left(|X|>\delta,\ \Big|\frac{T_j}{T_{\min}}-1\Big|<\delta\right)
\;\le\;\mathbb{P}\!\left(\Big|\frac{T_j}{T_{\min}}-1\Big|\ge\delta\right)+e^{-\frac{(1-\delta)\delta\sqrt{T_{\min}}\,\varepsilon}{2}}\;\longrightarrow\;0
$$
as $T_{\min}\ge\log n\to+\infty$, where the random variable $X$ follows the distribution of $\sqrt{T_{\min}}\,\mathrm{Lap}\big(\tfrac{2}{\varepsilon T_j}\big)$.

For claim (3), denote by $S_{T_j}=T_j\big[(\hat\mu^j_1-\hat\mu^j_0)-(\mu^j_1-\mu^j_0)\big]$ the sum of $T_j$ i.i.d. differences. Similarly, denote by $S_{T_{\min}}$ the sum of $T_{\min}$ i.i.d. differences following the same distribution, with expectation $0$ and variance $\sigma^2_{j0}+\sigma^2_{j1}$. Obviously, by the Central Limit Theorem (CLT), we have
$$
\frac{S_{T_{\min}}}{\sqrt{T_{\min}}}\;\xrightarrow{\ \mathcal{L}\ }\;N\big(0,\sigma^2_{j0}+\sigma^2_{j1}\big).
$$
Therefore, we only need to prove that $\frac{S_{T_{\min}}-S_{T_j}}{\sqrt{T_{\min}}}\xrightarrow{\ P\ }0$. Using claim (1), we know that for any $\delta,\delta'>0$,
$$
\mathbb{P}\!\left(\frac{|S_{T_{\min}}-S_{T_j}|}{\sqrt{T_{\min}}}>\delta'\right)
\;\le\;\mathbb{P}\!\left(\Big|\frac{T_j}{T_{\min}}-1\Big|\ge\delta\right)+\mathbb{P}\!\left(\frac{|S_{T_{\min}}-S_{T_j}|}{\sqrt{T_{\min}}}>\delta',\ \Big|\frac{T_j}{T_{\min}}-1\Big|<\delta\right)
\;\le\;\mathbb{P}\!\left(\Big|\frac{T_j}{T_{\min}}-1\Big|\ge\delta\right)+\frac{\delta}{\delta'^2}\big(\sigma^2_{j0}+\sigma^2_{j1}\big)\;\longrightarrow\;0
$$
as $T_{\min}\ge\log n\to+\infty$ and then letting $\delta\to0$, where the last inequality follows from Markov's inequality applied to the squared sum. Combining claims (2) and (3), we conclude that Proposition 3.6 holds.
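Claim (2) shows that the added privacy noise is negligible at the CLT scale: the rescaled Laplace term $\sqrt{T_{\min}}\,\mathrm{Lap}(2/(\varepsilon T_j))$ vanishes as $T_{\min}$ grows, so the private estimator inherits the limit in claim (3). A small numerical illustration, with $T_j$ set equal to $T_{\min}$ purely for simplicity (an assumption of this sketch):

```python
import numpy as np

# Illustration of claim (2): the privacy noise is negligible at the CLT scale.
rng = np.random.default_rng(0)
eps, reps = 1.0, 100_000

for T_min in [100, 1_000, 10_000, 100_000]:
    T_j = T_min                                        # simplifying assumption
    noise = rng.laplace(scale=2.0 / (eps * T_j), size=reps)
    rescaled = np.sqrt(T_min) * noise                  # sqrt(T_min) * Lap(2/(eps*T_j))
    print(f"T_min={T_min:>7}  std of rescaled noise = {rescaled.std():.4f}")
```

The printed standard deviation shrinks like $1/\sqrt{T_{\min}}$, which is exactly why the asymptotic distribution is unchanged by the Laplace perturbation.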