# Adaptive Experimentation When You Can't Experiment

Yao Zhao, University of Arizona, yaoz@arizona.edu; Kwang-Sung Jun, University of Arizona, kjun@cs.arizona.edu; Tanner Fiez, Amazon, fieztann@amazon.com; Lalit Jain, University of Washington, lalitj@uw.edu

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

This paper introduces the confounded pure exploration transductive linear bandit (CPET-LB) problem. As a motivating example, online services often cannot directly assign users to specific control or treatment experiences, whether for business or practical reasons. In these settings, naively comparing treatment and control groups that result from self-selection can lead to biased estimates of the underlying treatment effects. Instead, online services can employ a properly randomized encouragement that incentivizes users toward a specific treatment. Our methodology provides online services with an adaptive experimental design approach for learning the best-performing treatment for such encouragement designs. We consider a more general underlying model captured by a linear structural equation and formulate pure exploration linear bandits in this setting. Though pure exploration has been extensively studied in standard adaptive experimental design settings, we believe this is the first work to consider a setting where the noise is confounded. Elimination-style algorithms using experimental design methods, in combination with a novel finite-time confidence interval on an instrumental-variable-style estimator, are presented with sample complexity upper bounds nearly matching a minimax lower bound. Finally, experiments are conducted that demonstrate the efficacy of our approach.

1 Introduction

In this study, we present a methodology for adaptive experimentation in scenarios characterized by potential confounding. Online services routinely conduct thousands of A/B tests annually [23]. In most online A/B/N experimentation, meticulous user-level randomization is essential to ensure unbiased estimates of treatment effects at the population level, commonly known as average treatment effects (ATE). In this setting, firms are often interested in identifying the treatment with the highest average outcome if presented to all members of the population. However, in many settings firms may not be able to randomize, for example if a feature must be rolled out to all users for various business reasons [30]. In such instances, users may choose whether or not to engage with a feature based on potentially unobservable preferences. Thus the resulting measured outcome may be correlated with the decision to engage with a specific feature. That is, the underlying characteristics of the user confound the relationship between the decision to use the feature being evaluated and the effect of the feature. Naively comparing the average outcome for users who engage with a feature with those who do not therefore suffers from a (selection) bias. This setting is captured in Figure 1. A potential solution is for services to employ encouragement designs, where users are presented with incentives that encourage them to engage with a specific feature [4, 7, 12]. As a concrete example, many online services have introduced membership levels with different offerings and prices available to all users.
Given a set of membership level options, the service is interested in knowing the counterfactual of which level has the optimal outcome (e.g., total revenue) if every user chooses to join that membership level. In this setting, encouragements could be coupons or trials for the corresponding membership levels. In these settings, the firm can use intent-to-treat (ITT) estimates of the treatment effect, which naively compare the average outcomes between the groups given different encouragements. In practice, given an encouragement, a user may not engage with the corresponding feature, choosing either the control or a different feature. Hence, the resulting ITT estimate may be a diluted estimate of the ATE [3]. However, all is not lost: if the encouragement presented to a user is properly randomized, and the service guarantees that the encouragement only affects the outcome through the choice of user treatment, then the encouragement acts as an instrumental variable. Standard analysis from the econometrics and compliance literature shows that two-stage least squares (2SLS) estimators can then be used to provide consistent estimates of treatment effects.

Figure 1: Causal graph of the model, with encouragement Z, user choice X, outcome Y, and unobserved confounder U.

At the same time, firms are also increasingly utilizing adaptive experimentation techniques, often known as pure exploration multi-armed bandit (MAB) algorithms [26, 15], to accelerate traditional A/B/N testing. Pure exploration MAB techniques promise to deliver accurate inferences in a fraction of the time and cost of traditional methods. Like A/B testing, bandit methods assume users are properly randomized; naively applied, they can fail to learn the optimal treatment, and they may be sample inefficient if they do not take the confounded structure into account.

Contributions. In this work, we provide a methodology for experimenters seeking to use adaptive experimentation in settings with confounding where encouragements are available. We formulate this work in the more general and novel setting of confounded pure exploration transductive linear bandits (CPET-LB) (Section 1.1). We present algorithms using experimental design for the CPET-LB problem and analyze the resulting sample complexity. As we demonstrate, even in the simple multi-armed bandit setting described above, computing an effective sampling pattern requires using the machinery of linear bandits. Without knowledge of the underlying structural model, existing linear bandit approaches could lead to suboptimal sampling. The main technical challenge we face is simultaneously improving our estimate of the structural model while designing with inaccurate estimates (Section 3). This approach crucially relies on our development of novel finite-time confidence bounds for two-stage least squares (2SLS) style estimators that may be of independent interest (Section 2.2). Moreover, we provide worst-case sample complexity lower bounds that are nearly matched by our sample complexity upper bounds (Appendix D). Though the goal of this work is primarily theoretical, we empirically show the efficacy of our method over existing solutions (Appendix E).
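To make the encouragement-as-instrument idea above concrete before turning to the formal model, here is a minimal simulation sketch. The scenario, variable names, and parameter values are illustrative assumptions rather than the paper's experiments; the sketch contrasts a naive self-selected comparison, the ITT estimate, and the 2SLS (Wald) estimate built from a randomized encouragement.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 200_000, 0.5            # tau: true treatment effect (illustrative value)

u = rng.normal(size=n)           # unobserved user preference (confounder)
z = rng.integers(0, 2, size=n)   # randomized encouragement (e.g., coupon yes/no)
# self-selection: encouraged users and high-preference users engage more often
d = (0.8 * z + u + rng.normal(size=n) > 0.5).astype(float)
y = tau * d + u + rng.normal(size=n)   # outcome depends on treatment AND confounder

naive = y[d == 1].mean() - y[d == 0].mean()       # biased by self-selection
itt = y[z == 1].mean() - y[z == 0].mean()         # diluted intent-to-treat effect
first_stage = d[z == 1].mean() - d[z == 0].mean() # effect of encouragement on uptake
two_sls = itt / first_stage                       # 2SLS / Wald estimate

print(f"true {tau:.3f}  naive {naive:.3f}  ITT {itt:.3f}  2SLS {two_sls:.3f}")
```

With a single binary encouragement the 2SLS estimator reduces to the Wald ratio of the two ITT effects; the naive comparison absorbs the confounder, while the 2SLS estimate does not.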
1.1 General Problem Formulation

A confounded pure exploration transductive linear bandit (CPET-LB) instance consists of finite collections of measurement vectors Z ⊂ R^d and evaluation vectors W ⊂ R^d. At each time t ∈ N, the learner selects z_t ∈ Z and observes a pair of noisy responses x_t ∈ X ⊂ R^d and y_t ∈ R generated via the structural equation model

x_t = Γ^⊤ z_t + η_t,   y_t = x_t^⊤ θ + ε_t,   (1)

where Γ ∈ R^{d×d} and θ ∈ R^d are model parameters.¹ We define the history H_{t−1} = {(z_s, x_s, y_s)}_{s ≤ t−1} [...]

[...] 2 · 2^{−k} are then eliminated from the active set by round k + 1 of the procedure. To actually choose our samples, as is common in this literature [15], we use an efficient rounding procedure, ROUND, that requires a minimum number of samples r(ω).

Sample Complexity Guarantee. The sample complexity of Algorithm 1 depends on the following problem-dependent quantity ρ*(γ), which captures the underlying hardness of a problem instance in terms of (W, Z, Γ, θ); when γ = 0, we abbreviate ρ*(0) = ρ*:

ρ*(γ) = min_{λ ∈ Δ(Z)} max_{w ∈ W\{w*}} ‖w* − w‖²_{(∑_{z∈Z} λ_z Γ^⊤ z z^⊤ Γ)^{−1}} / max{⟨w* − w, θ⟩², γ²}.   (7)

Theorem 3.1. Algorithm 1 is δ-PAC and terminates in at most c(1 + ω) L_ν ρ* log(1/δ) + c r(ω) samples, where c hides logarithmic factors of Δ := min_{w ≠ w*} ⟨w* − w, θ⟩ and |W|, as well as constants.

The proof of this result is in Appendix H.1. In the unconfounded case, when Γ = I, η = 0, and E_{t−1}[ε_t | x_t] = 0, this matches the sample complexity of [15]. In particular, for the case where Z = X = W, the problem further reduces to a standard multi-armed bandit, and if ε is 1-sub-Gaussian noise, [33] shows that ρ* = O(∑_{i=2}^d (θ_1 − θ_i)^{−2}), which is the optimal sample complexity of best-arm identification for multi-armed bandits. The following lemma shows that the conditioning of Γ can have a strong impact on the resulting sample complexity.

Lemma 3.2. For the compliance setting, we have

min_{λ ∈ Δ_d} max_{j, j'} ‖e_j − e_{j'}‖²_{(∑_{i=1}^d λ_i Γ^⊤ e_i e_i^⊤ Γ)^{−1}} ≤ d max_{j, j'} ‖Γ^{−1}(e_j − e_{j'})‖²_2.

Furthermore, ρ* ≤ d σ²_min(Γ)^{−1} [...]

Algorithm 1 CPEG: Confounded pure exploration with Γ
1: Input: Z, W, Ψ = Γ, δ, L_ν ≥ σ²_ν, ω
2: Initialize: k = 1, W_1 = W, ζ_1 = 1
3: Set f(w, w', Γ, λ) := ‖w − w'‖²_{(∑_{z∈Z} λ_z Γ^⊤ z z^⊤ Γ)^{−1}}
4: while |W_k| > 1 do
5:   λ_k = arg min_{λ ∈ Δ(Z)} max_{w, w' ∈ W_k} f(w, w', Γ, λ)
6:   ρ(W_k) = min_{λ ∈ Δ(Z)} max_{w, w' ∈ W_k} f(w, w', Γ, λ)
7:   N_k := max{⌈2(1 + ω) 2^{2k} ρ(W_k) L_ν log(4k²|W|/δ)⌉, r(ω)}
8:   Pull arms in Z_{N_k} = ROUND(λ_k, N_k) and observe Y_{N_k}
9:   Compute θ̂_Γ^k = (Z_{N_k}^⊤ Z_{N_k} Γ)^{−1} Z_{N_k}^⊤ Y_{N_k}
10:  W_{k+1} = W_k \ {w ∈ W_k | ∃ w' ∈ W_k, ⟨w' − w, θ̂_Γ^k⟩ > 2^{−k}}, k ← k + 1
11: end while
12: Output: w ∈ W_k

To further illustrate the impact of Γ, imagine an extreme setting where Γ = ((1 − ε)/d) 11^⊤ + εI and ε → 0, i.e., Γ is a perturbation of (1/d) 11^⊤. It is straightforward to show that the upper bound in the first display of Lemma 3.2 is of the order O(dε^{−2}) (this is also a lower bound; see Appendix K.1). In particular, the upper bound on the sample complexity is of the form dε^{−2}/Δ²_min. This is in sharp contrast to the linear bandit case, when Γ = I and we are guaranteed a sample complexity of no more than d/Δ²_min samples. To gain some intuition, note that regardless of the choice of λ, ∑_{i ≤ d} λ_i Γ^⊤ e_i e_i^⊤ Γ ⪯ Γ^⊤ Γ. As a result, ρ* → ∞ as ε → 0. Intuitively, in the limit, regardless of which instrument i ≤ d is being pulled, the resulting distribution on the treatments is uniform (the instruments are weak). Thus, it is impossible to deconfound the measurement noise and recover an estimate of θ. This is a phenomenon which does not arise in the standard multi-armed bandit case without confounding.

Remark 3.3. We also consider a setting where, instead of being given Γ directly, we are given an estimate Γ̂ of Γ based on offline data. We discuss such an adaptation of Algorithm 1 to this setting in Appendix I and provide a sample complexity which reflects the error in Γ̂ (scaling with ρ*(γ) for γ > 0).
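The main computational step in Algorithm 1, and in the variant of Remark 3.3, is the minimax design over λ ∈ Δ(Z) on line 5. A Frank-Wolfe-style sketch of that optimization is below; the function name, iteration budget, and use of a pseudo-inverse are implementation assumptions, not the paper's code.

```python
import numpy as np

def xy_design(Z, diffs, Gamma, n_iter=2000):
    """Frank-Wolfe sketch of the design step (line 5 of Algorithm 1):
    minimize over lambda in the simplex  max_m  d_m^T A(lambda)^{-1} d_m,
    where A(lambda) = sum_z lambda_z Gamma^T z z^T Gamma and the rows of
    `diffs` are the differences w - w' over the active evaluation vectors."""
    Z, diffs = np.asarray(Z, float), np.asarray(diffs, float)
    ZG = Z @ Gamma                          # row z becomes (Gamma^T z)^T
    lam = np.full(len(Z), 1.0 / len(Z))     # start from the uniform design
    for t in range(n_iter):
        A_inv = np.linalg.pinv(ZG.T @ (lam[:, None] * ZG))   # pinv as a safeguard
        vals = np.einsum("md,de,me->m", diffs, A_inv, diffs)
        y = diffs[np.argmax(vals)]          # hardest (active) direction w - w'
        # d/d lambda_z of y^T A^{-1} y is -(z^T Gamma A^{-1} y)^2, so the
        # Frank-Wolfe step moves mass onto the coordinate with the largest score
        scores = (ZG @ (A_inv @ y)) ** 2
        step = 2.0 / (t + 2.0)
        lam = (1.0 - step) * lam
        lam[np.argmax(scores)] += step
    A_inv = np.linalg.pinv(ZG.T @ (lam[:, None] * ZG))
    rho = float(np.max(np.einsum("md,de,me->m", diffs, A_inv, diffs)))
    return lam, rho                          # design lambda_k and objective rho(W_k)
```

Here `diffs` would hold all pairwise differences of the current active set W_k, and ROUND(λ_k, N_k) (not sketched) then converts λ_k into an integer allocation of pulls. The same design computation is reused when Γ is replaced by an offline estimate Γ̂ as in Remark 3.3.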
We remark that this result is subsumed by the approach of Section 3.2 and so we omit it in the main text. Lower bound. Due to the noise model from confounding and the dependence of the noise θ η + ε, the instance-dependent lower bounds of [15] do not immediately apply. We develop a lower bound tailored for the confounding setting that nearly match the upper bounds of our algorithms. What s more, our lower bound illustrates the additional difficulty that arises from confounding by an additional factor of d2 compared to the standard transductive linear bandit problem in the most general setting where entries of η are sub-Gaussian, but not necessarily independent nor bounded. Due to space limit, we defer it to Appendix D. 3.2 Fully Unknown Structural Model We now consider the setting where Γ is fully unknown. The difficulty of this setting is that the data collection process needs to support both estimation of Γ and θ simultaneously. Our algorithm, built upon Algorithm 1, is summarized in Algorithm 3. At its core, each phase of the algorithm is divided into two sub-phases, for estimating Γ and θ respectively. Specifically, the second sub-phase is essentially same as Algorithm 1 with bΓk in place of Γ where bΓk is estimated from the first sub-phase. The main novelty of our algorithmic design lies in the first sub-phase, which resolves the challenge of performing the optimal design for estimating Γ. To explain this challenge, the confidence interval for P-2SLS estimators of Theorem 2.3 indicates that one should pull arms so that we control both D2 := maxw,w w w 2 A(ZT2,bΓk) (error from ˆθP 2SLS) and D1 := maxw,w w w 2 A(ZT1,bΓk) (error from bΓk) to be below the target error O(ζ2 k) at each phase (ignoring unimportant factors for discussion). Controlling D2 is trivial, which is done in the second sub-phase as we described above. However, for D1, a similar strategy cannot be done because the estimate bΓk is computed directly by sampling arms in ZT1. That is, the ideal design, based on which we will collect data points z1, . . . , zn, requires access to the random matrix bΓk that can only be computed after sampling z1, . . . , zn,. This Algorithm 3 CPEUG: Confounded pure exploration with with unknown Γ Input Z, W, δ, Lν σ2 ν, Lη σ2 η, ω, γmin λmin(Γ), λE, κ0 Initialize: k = 1, W1 = W, bΓ0 = , ζ1 = 1 Define f(w, w , Γ, λ) := w w 2 (P z Z Γ λzzz Γ) 1, M := 32Lη γ2 minσmin(A(λE,I)) 1, δℓ:= δ 4ℓ2 while |Wk| > 1 do bΓk = Γ estimator Wk, bΓk 1, ζk, δ/k2, ω, λE, M, Lη Step 1: update bΓ bθk P-2SLS = θ estimator Wk, δ/k2, ζk, bΓk, ω, Lν Step 2: update bθ Wk+1 = Wk\ w Wk | w Wk, s.t., D w w, bθk P-2SLS E > ζk Step 3: elimination k k + 1, ζk = 2 k end while Output: Wk creates a cycle that seems impossible to resolve. Such an issue, to our knowledge, has not been seen in existing work on pure exploration, and thus resolving it is our key technical contribution. Our solution is to compute the design based on bΓk from the previous phase. We then perform a doubling trick where we double the sample size (while following the computed design) until D1 becomes smaller than the target error O(ζ2 k). The intuition is that in later phases the estimate bΓk from the previous phase will be accurate enough to ensure that the design is efficient. Note that this novel algorithm induces extra randomness in how many samples we end up collecting in the first sub-phase, which remains random even after conditioning on the history, unlike the second sub-phase. 
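A simplified sketch of this first sub-phase may help. It covers only the initialization branch that samples from the E-optimal design λ_E and doubles the budget until the width criterion is met; the callables `sample` (standing in for ROUND plus pulling arms) and `width` (standing in for the Stop(·) criterion) are illustrative assumptions.

```python
import numpy as np

def estimate_gamma_by_doubling(sample, width, lam_E, target, n0=64, max_rounds=20):
    """Doubling-trick sketch for the Gamma sub-phase: draw n pulls from the
    E-optimal design, fit Gamma by OLS on (Z, X), and double n until the
    confidence width falls below the phase target (an analogue of zeta_k^2)."""
    n = n0
    for _ in range(max_rounds):
        Zs, Xs = sample(lam_E, n)            # (n, d) instruments and observed treatments
        # OLS for the first-stage model  X = Z Gamma + noise
        Gamma_hat = np.linalg.pinv(Zs.T @ Zs) @ Zs.T @ Xs
        if width(Zs, Gamma_hat) <= target:   # enough samples for this phase
            return Gamma_hat, n
        n *= 2                               # doubling trick: retry with twice the budget
    return Gamma_hat, n
```

In the paper's Algorithm 4, later phases additionally reuse the design computed from the previous phase's estimate Γ̂_{k−1}; the sketch omits that branch. Note that the number of pulls the loop ends up taking is itself random, which is the extra randomness referred to above.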
This makes the analysis challenging, which we describe after the main result. Algorithm 2 θ estimator Input W, δ, ζ, bΓ, ω, Lν bλ = arg minλ (Z) maxw,w W f(w, w , bΓ, λ) ρ(W) = minλ (Z) maxw,w W f(w, w , bΓ, λ) N2 = 2(1 + ω)ζ 2ρ(W)Lν log 4|W| get N2 samples per design bλ denoted as {Z2, X2, Y2} via ROUND update bθP-2SLS = (bΓ Z 2 Z2bΓ) 1bΓ Z 2 Y2 Output: bθP-2SLS Our algorithm additionally employs the so-called E-optimal design to ensure that the covariance matrix of the collected data used to estimate Γ is well-conditioned. This conditioning is required to ensure that bΓk concentrates fast enough to Γ as shown in the analysis. The E-optimal design is a well-known design objective in experimental design that aims to maximize the smallest singular value: λ E := arg minλ (Z) σmax(V 1(λ)), where V = P z Z λzzz . We denote κ 1 0 := σmax(V 1(λ E)) = σ 1 min(V (λ E)) as the smallest singular value achieved by the E-optimal design. We present our analysis result Theorem 3.4 where we show that, even without knowledge of Γ, the sample complexity scales with the key problem difficulty ρ almost matching the sample complexity of Algorithm 1 which relies on knowledge of Γ. Theorem 3.4. Algorithm 3 is δ-PAC and terminates in at most (1 + ω)((Lν log 1/δ + Lη θ 2 2(d + log 1/δ ))ρ + (d + log 1/δ )(Lη θ 2 2ρ0 + M)) pulls, ignoring both of the additive and multiplicative logarithms of , |W|, ρ , ρ0, M, where ρ0 = max w W\{w } w w 2 (P z Z λE,zΓ zz Γ) 1, and M = 32Lη γ2 minσmin A(λE, I) 1. Note that ρ0 does not get hurt by w w, θ , (ρ does). It comes from the fact that in the first phase, we initialize that algorithm with E-optimal design. The challenge of the analysis can be summarized in two-fold. First, since the concentration result in Theorem 2.3 is w.r.t. bΓk, we need to analyze how the random matrix bΓk concentrates around Γ and how this impacts the sample complexity. For this, we develop a novel concentration inequality Algorithm 4 Γ estimator Input W, bΓ, ζ, δ, ω, λE, M, Lη Define Stop(W, Z, Γ, δ) := maxw,w W w w A(Z,Γ) 1 θ 2 Lηlog(Z, δ) Initialize ℓ= 1, N0,0 = 0 doubling trick initialization if bΓ = then while ℓ= 1 or Stop W, Z0,ℓ, bΓ , δℓ > 1 do get 2ℓ 1 r(ω) 2 κ0 samples denoted as {Z0,ℓ, X0,ℓ, Y0,ℓ} per design λE via ROUND Update bΓ by OLS on {Z0,ℓ, X0,ℓ}, ℓ ℓ+ 1 end while else λ = arg minλ (Z) maxw,w W f(w, w , bΓ, λ) 4gd M ln 1 + 2M d + L2 z + 2M2gd M + 8M ln 2 6d while ℓ= 1 or Stop W, Z0,ℓ Z1,ℓ, bΓ, δℓ > ζ do N1,ℓ= 2ℓN doubling trick update get N1,ℓsamples per λ denoted as {Z1,ℓ, X1,ℓ, Y1,ℓ} via ROUND 2gd M ln M d + N1,ℓ+ L2 z + 4M ln 2 6d get (N0,ℓ N0,ℓ 1) samples per λE augmented to {Z0,ℓ 1, X0,ℓ 1} and get {Z0,ℓ, X0,ℓ} Update bΓ by OLS on {Z0,ℓ Z1,ℓ, X0,ℓ X1,ℓ}, ℓ ℓ+ 1 end while end if Output: bΓ that relates the confidence width involving ˆΓ from Theorem 2.3 with the same quantity involving Γ in place of ˆΓ. Second, our algorithm creates a long-range error propagation, which is highly nontrivial to analyze. To see this, the quality of ˆΓk is affected by the design objective function maxw,w f(w, w , ˆΓk 1, λ), which depends on the error of the estimate ˆΓk 1 from the previous phase. This error is, in turn, affected by the error of ˆΓk 2 by the same mechanism. This is repeated all the way back to the first phase. Thus, any abnormal behavior from the first iteration will have a cumulative impact to even the end. In our analysis, we successfully analyze how the error is propagated from the previous iterations, which forms a complicated recursion. 
Resolving this recursion is our key novelty in the analysis. Remark 3.5. Our algorithm requires knowledge of a lower bound γmin of λmin(Γ). The knowledge of γmin is for simplicity only as one can obtain such a lower bound that is at least half of the true value λmin(Γ) via an efficient sampling procedure that we describe in Appendix K. Experiments. We provide experiments for the instance of Section 1.2 in the Appendix E. The experiments show that our approach is more sample efficient than natural passive baselines (e.g. A/B testing), or naively applying existing Pure-Exploration linear bandit methods and performs similarly to the oracle complexity. 4 Conclusion This work introduces the CPET-LB problem in which the learning protocol is characterized by a linear structural equation model governed by parameters Γ and θ. We provide a general solution that simultaneously estimates the structural model while optimally designing to learn the best-arm. The key ideas behind our approach are based on linear experimental design techniques, an instrumental variable estimator whose variance can be controlled by the design, and novel finite-time confidence intervals on this estimator. This paper presents a number of directions for future work including considering situations where the dz = dx, analysis to improve the dependence on the underlying noise variance, and the pursuit of a tight information-theoretic instance-dependent lower-bound. We hope that this line of work motivates increased discussion of the real impact of confounding on applicability of adaptive experimentation. Acknowledgements Kwang-Sung Jun and Yao Zhao were supported in part by the National Science Foundation under grant CCF-2327013. [1] Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. [2] Allen-Zhu, Z., Li, Y., Singh, A., and Wang, Y. Near-optimal discrete optimization for experimental design: A regret minimization approach. Mathematical Programming, 186:439 478, 2021. [3] Angrist, J. D., Imbens, G. W., and Rubin, D. B. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444 455, 1996. [4] Bakshy, E., Eckles, D., and Bernstein, M. S. Designing and deploying online field experiments. In International Conference on World Wide Web, pages 283 292, 2014. [5] Bareinboim, E., Forney, A., and Pearl, J. Bandits with unobserved confounders: A causal approach. Advances in Neural Information Processing Systems, 28, 2015. [6] Bilodeau, B., Wang, L., and Roy, D. Adaptively exploiting d-separators with causal bandits. Advances in Neural Information Processing Systems, 35:20381 20392, 2022. [7] Bradlow, E. Encouragement designs: an approach to self-selected samples in an experimental design. Marketing Letters, 9:383 391, 1998. [8] Chandak, Y., Shankar, S., Syrgkanis, V., and Brunskill, E. Adaptive instrument design for indirect experiments. ar Xiv preprint ar Xiv:2312.02438, 2023. [9] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters, 2018. [10] Degenne, R., Ménard, P., Shang, X., and Valko, M. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pages 2432 2442. PMLR, 2020. [11] Della Vecchia, R. and Basu, D. Online instrumental variable regression: Regret analysis and bandit feedback. 
ar Xiv preprint ar Xiv:2302.09357, 2023. [12] Elbers, B. Encouragement designs and instrumental variables for a/b testing, August 2023. URL https://engineering.atspotify.com/2023/08/ encouragement-designs-and-instrumental-variables-for-a-b-testing/. Accessed: 2024-05-19. [13] Engineering, S. Encouragement designs and instrumental variables for a/b testing, 2023. Accessed: December 26, 2023. [14] Even-Dar, E., Mannor, S., Mansour, Y., and Mahadevan, S. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6), 2006. [15] Fiez, T., Jain, L., Jamieson, K. G., and Ratliff, L. Sequential experimental design for transductive linear bandits. Advances in Neural Information Processing Systems, 32, 2019. [16] Gales, S. B., Sethuraman, S., and Jun, K.-S. Norm-agnostic linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 73 91. PMLR, 2022. [17] Gong, X. and Zhang, J. Dual instrumental method for confounded kernelized bandits. ar Xiv preprint ar Xiv:2209.03224, 2022. [18] Greene, W. H. Econometric analysis. Pearson Education India, 2003. [19] Inoue, A. and Solon, G. Two-sample instrumental variables estimators. The Review of Economics and Statistics, 92(3):557 561, 2010. [20] Jun, K.-S., Jain, L., Mason, B., and Nassif, H. Improved confidence bounds for the linear logistic model and applications to bandits. In International Conference on Machine Learning, pages 5148 5157. PMLR, 2021. [21] Kallus, N. Instrument-armed bandits. In Algorithmic Learning Theory, pages 529 546. PMLR, 2018. [22] Katz-Samuels, J., Jain, L., Karnin, Z., and Jamieson, K. G. An empirical process approach to the union bound: Practical algorithms for combinatorial and linear bandits. In Advances in Neural Information Processing Systems, volume 33, pages 10371 10382, 2020. [23] Kohavi, R., Tang, D., and Xu, Y. Trustworthy online controlled experiments: A practical guide to a/b testing. Cambridge University Press, 2020. [24] Kveton, B., Liu, Y., Kruijssen, J. M., and Nie, Y. Non-compliant bandits. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 1138 1147, 2023. [25] Lattimore, F., Lattimore, T., and Reid, M. D. Causal bandits: Learning good interventions via causal inference. Advances in Neural Information Processing Systems, 29, 2016. [26] Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020. [27] Li, Z., Jamieson, K., and Jain, L. Optimal exploration is no harder than thompson sampling. ar Xiv preprint ar Xiv:2310.06069, 2023. [28] Lu, Y., Meisami, A., Tewari, A., and Yan, W. Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence, pages 141 150. PMLR, 2020. [29] Malek, A., Aglietti, V., and Chiappa, S. Additive causal bandits with unknown graph. In International Conference on Machine Learning, 2023. [30] Mummalaneni, S., Yoganarasimhan, H., and Pathak, V. V. Producer and consumer engagement on social media platforms. SSRN 4173537, 2022. [31] Pearl, J. Causality. Cambridge university press, 2009. [32] Sen, R., Shanmugam, K., Dimakis, A. G., and Shakkottai, S. Identifying best interventions through online importance sampling. In International Conference on Machine Learning, pages 3057 3066. PMLR, 2017. [33] Soare, M., Lazaric, A., and Munos, R. Best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 27, 2014. 
[34] Vershynin, R. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018. [35] Weltz, J., Fiez, T., Volfovsky, A., Laber, E., Mason, B., Nassif, H., and Jain, L. Experimental designs for heteroskedastic variance. arXiv preprint arXiv:2310.04390, 2023. [36] Zhang, J., Chen, Y., and Singh, A. Causal bandits: Online decision-making in endogenous settings. arXiv preprint arXiv:2211.08649, 2022.

Table of Contents
A Broader Impacts
B Illustrative Example
C Standard 2SLS estimator
D A non-interactive lower bound
E Experiments
  E.1 Comparison Algorithms
  E.2 Experiment 1: Jump-Around Instance
  E.3 Experiment 2: Interpolation Instance
F Proofs of the lower bound
  F.1 Proof of Theorem D.1
  F.2 Proof of Corollary D.2
G Proofs of the confidence interval
  G.1 Proof of Lemma 2.1
  G.2 Proof of Lemma 2.2
  G.3 Proof of Theorem 2.3
H Proofs of sample complexity when given Γ
  H.1 Proof of Theorem 3.1
I Proofs of sample complexity when given Γ̂
J Proofs of sample complexity with unknown Γ
K Estimating λ_min(Γ)
  K.1 Proof of Lemma 3.2
  K.2 Lemma for solving x < ln(x)

A Broader Impacts

This work is algorithmic and not tied to a particular application that would have immediate negative impact.

B Illustrative Example

We now present an illustrative experiment that highlights the challenges of endogenous noise and the insufficiency of standard experimentation approaches designed for settings without confounding.

Instance Definition. Connecting back to the membership example in Section 1, consider a service with d membership options given by the set A = {1, . . . , d}. Let the set Z = {e_1, . . . , e_d} represent encouragements (incentives or advertisements) for the corresponding membership options given by W = X = {e_1, . . . , e_d}. We consider a location model that assumes each user t ∈ N arriving online has an underlying unobserved one-dimensional preference u_t ∼ N(0, σ²_u). If an algorithm presents the user with encouragement z_{I_t} = e_{I_t} for I_t ∈ A, then the user selects into the membership level given by J_t = arg min_{j ∈ A} |I_t + u_t − j|, so that x_t = e_{J_t}. For a visual depiction, see Figure 3a. This process captures a user being more likely to opt in to membership levels that are closer to the encouragement that they were presented. The outcome is then given by y_t = x_t^⊤ θ + u_t. This problem instance is a specific compliance instance. For this experiment, we take d = 6, let θ = (1, 0.95, 0, 0.45, 0.95, 0.99)^⊤, and σ²_u = 0.35. Observe that the optimal evaluation vector is w* = e_1 = arg max_{w ∈ W} w^⊤ θ. We simulate a UCB strategy which maintains estimates of the average reward of each of the possible d incentives, namely μ̂_{i,t} = ∑_{s=1}^t 1{z_s = e_i} y_s / ∑_{s=1}^t 1{z_s = e_i}, and then pulls the one with the highest upper confidence bound.
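A minimal simulation sketch of this instance, using the parameter values stated above, is given below. It is an illustration rather than the paper's experiment code, and it contrasts the naive per-treatment means with an IV-style correction that groups outcomes by the randomized encouragement and inverts Γ (here Γ is estimated empirically from the same simulated draws).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 6, 500_000
theta = np.array([1.0, 0.95, 0.0, 0.45, 0.95, 0.99])   # values as stated above
sigma_u = np.sqrt(0.35)
levels = np.arange(1, d + 1)                 # membership levels 1..d

I = rng.integers(1, d + 1, size=n)           # uniformly randomized encouragement
u = rng.normal(0.0, sigma_u, size=n)         # unobserved preference (confounder)
J = np.clip(np.rint(I + u), 1, d).astype(int)  # chosen level: argmin_j |I + u - j|
y = theta[J - 1] + u                         # outcome y = x^T theta + u

# naive per-treatment means (condition on the *chosen* level -> selection bias)
naive = np.array([y[J == j].mean() for j in levels])

# IV-style correction: group by the randomized encouragement and invert Gamma,
# where Gamma[i, j] = P(choose level j | encouraged toward level i)
Gamma = np.array([[np.mean(J[I == i] == j) for j in levels] for i in levels])
mu_z = np.array([y[I == i].mean() for i in levels])   # E[y | z = e_i] = (Gamma theta)_i
iv = np.linalg.solve(Gamma, mu_z)

print("theta      ", np.round(theta, 3))
print("naive OLS  ", np.round(naive, 3))
print("IV-correct ", np.round(iv, 3))
```

Because the preference u_t drives both the chosen level J_t and the outcome, conditioning on the chosen level biases the naive means, whereas E[y | z = e_i] = e_i^⊤ Γ θ is unconfounded and can be inverted once Γ is known or estimated. The UCB strategy described above maintains exactly such per-encouragement averages μ̂_{i,t}.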
This models current practice of using a bandit algorithm to select which incentive to show a user. Our results averaged over 100 simulations are in Figure 3d. At each round we estimate the average reward of each level using an OLS estimator, i.e. bθOLS i,t = Pt s=1 1{xt = ei}yt/ Pt s=1 1{xt = ei}, and check whether it matches the true value (denoted as UCB-OLS). We also consider an instrumental variable-estimator (see the next Section) which incorporates knowledge of Γ similar to 2SLS to deconfound our estimate (UCB-IV). As the plot demonstrates, UCB-OLS completely fails to identify θ1 = arg maxi d θi (this line is hard to see it is at 0) due to a biased estimate, whereas UCB-IV does better. However, UCB-IV methods seem to have a constant probability of error. To see why, note that the expected reward from pulling z = ei is e i Γθ. These values are plotted in orange in Figure 3c. In particular, with some constant probability, UCB runs on the empirical rewards from pulling z s zeroes in on arm 6, and as a result fails to give enough samples to learn that arm 1 is indeed the best. In contrast, our proposed method CPEG, Algorithm 1 manages to find the best arm with significantly higher probability (the algorithm was run with δ = .1) in the given time horizon. C Standard 2SLS estimator Consider Ψ = bΓ2SLS = (Z T ZT ) 1Z T XT . In this setting, we recover the standard two-stage-leastsquares (2SLS) estimator, bθ2SLS = (X T ZT (Z T ZT ) 1Z T XT ) 1X T (Z T ZT ) 1Z T YT = (Z T XT ) 1Z T YT . Note that the 2SLS estimator is a biased, but consistent estimator of the parameter θ [3, 18]. Note that in particular, the asymptotic variance of 2SLS is known to be σ2 ε w (bΓ 2SLSZ T ZT bΓ2SLS) 1[18]. Recent work by [11] provides a confidence interval of the form |w (bθ2SLS θ)| O(dσ2 ε w (bΓ 2SLSZ T ZT bΓ2SLS) 1 q log T/δ ). However, it s unclear how to use the form of their con- fidence interval directly for experimental design due to the dependence of ˆΓ2SLS on the random quantity X. In addition, their work is not sufficiently general to handle the general forms of noise that we consider in Lemma 2.1. D A non-interactive lower bound Due to the noise model from confounding and the dependence of the noise θ η + ε, the instancedependent lower bounds of [15] do not immediately apply. In this section, we develop a lower bound tailored for the confounding setting. Toward characterizing the optimal sample complexity, we develop a lower bound for a specific non-adaptive algorithm A that has access to the matrix Γ governing the structural equation model. In particular, suppose that the non-adaptive algorithm A is allowed to select a sequence of T measurements {z I1, . . . z It . . . , z IT } to query prior to collecting any observations, where It represents the index of the vector z Z chosen at time t {1, . . . , T}. Then, given the observations {y1, . . . , . . . yt, . . . , y T } generated by the environment, a candidate optimal vector bw W is returned by the algorithm. We are interested in the necessary number of observations T that must be collected in order to ensure P( bw = w ) δ for some δ (0, 1). Thus, it is natural that the optimal non-adaptive algorithm A using the estimator bθoracle forms a recommendation rule such that bw = arg maxw W w bθoracle. We now state our lower bound result with respect to the non-adaptive oracle algorithm. 1 2 3 4 5 6 It Jt It + ut (a) Depiction of the problem instance. (b) Heat-map of Γ. (c) w θ and w Γθ for w {e1, . . . 
, ed} (d) P( bwt = w ) for the instance Figure 3: (a) A visual depiction of the problem instance from Section B. The user is presented with encouragement It A and the user choice is given by Jt where Jt = minj A |It + ut j| and ut N(0, σ2 u). b) A heat-map showing the structural parameter Γ for the problem instance from Section B. (c) A bar chart showing E[y|x = w] = w θ and E[y|z = w] = z Γθ for all w W. This chart shows that the optimal evaluation vector is w = e1 = arg maxw W E[y|x = w], while e6 = arg maxw W E[y|z = w] and consequently estimation based on this quantity is problematic. (d) The probability of identifying w = e1 for a collection of algorithms on the CPET-LB instance described in Section B. Standard optimistic sampling approaches in combination with an ordinary least squares estimator leads to faulty inferences. Given an instrumental variable estimator, these experimental designs eventually give high probability identification but do so inefficiently compared to our proposed approach (see Section 3). Theorem D.1 (Non-Adaptive Oracle Lower Bound). Consider a problem instance characterized by W Rd, Z Rd, Γ Rd d, and θ Rd. Assume Γ is known, θ is unknown, and the noise process is jointly Gaussian and defined by γ := η ε N(0, Σ) where Σ R(d+1) (d+1) is an arbitrary correlation matrix. For δ (0, 0.05], if the non-adaptive oracle algorithm acquires T σ2ρ log 1/δ /2 samples on the problem instance where σ2 := v Σv and v := θ 1 Rd+1, then P( bw = w ) δ. Corollary D.2. There exists a problem instance characterized by W Rd, Z Rd, Γ Rd d, and θ Rd with a noise process satisfying Assumption 1 such that if the non-adaptive oracle algorithm acquires T max{d θ 2 2, d θ 2}ρ log 1/δ /2 samples, then P( bw = w ) δ for δ (0, 0.05]. The proof of Theorem D.1 is in Appendix F.1. Notably, the result is reminiscent of lower bounds for the standard pure exploration transductive linear bandit problem without confounding [15, 22] when given the measurement set {Γ z}z W, evaluation set Z, and parameter θ. Notably, the upper bounds for our algorithms nearly match the lower bound of Theorem D.1. However, it is interesting to observe that the sample complexity incurs an additional factor of d2 relative to the standard transductive linear bandit problem in the most general setting where entries of η are sub-Gaussian, but not necessarily independent nor bounded. This illustrates the additional difficulty that arises from confounding. We point out that this is not likely to be a tight lower bound. In particular, it is a lower bound with respect to a non-adaptive algorithm that uses the particular choice of estimator. We leave improved lower bounds to future work. E Experiments We now present experiments a collection of experiments on CPET-LB problem instances. The experiments demonstrate that our approach produces efficient designs for inference and estimation. E.1 Comparison Algorithms The baselines that our approaches are compared with are discussed below. We run experiments both when Γ is known and when Γ is fully unknown. E.1.1 Known Γ. To standardize the experiments, the baselines considered run in rounds mirroring the structure of Algorithm 1. Specifically, in round k N a sampling algorithm selects a design λk (Z), collects Nk samples from the design, and forms a Ψ IV estimate of θ with Ψ = Γ that is combined with a confidence interval (Lemma 2.2 to either eliminate evaluation vectors or validate a stopping condition. 
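A compact sketch of one such round, with illustrative names (`pull` standing in for ROUND followed by observing the responses), might look like the following; it is a sketch of the round structure just described, not the exact implementation used in the experiments.

```python
import numpy as np

def elimination_round(W_active, lam, N, pull, Gamma, zeta):
    """One elimination round: sample N arms per the design, form the Psi-IV
    estimate with Psi = Gamma, and drop every evaluation vector beaten by
    another active vector by more than the threshold zeta."""
    Zs, Ys = pull(lam, N)                     # stand-in for ROUND + observations
    theta_hat = np.linalg.pinv(Zs.T @ Zs @ Gamma) @ Zs.T @ Ys   # (Z^T Z Gamma)^{-1} Z^T Y
    W = np.asarray(W_active, float)
    scores = W @ theta_hat
    keep = W[scores >= scores.max() - zeta]   # survive if within zeta of the leader
    return keep, theta_hat
```

The per-round budget N passed to such a routine is set as follows.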
The number of samples Nk taken in round k N by any of the algorithms is given by Nk = 2(1+ω)ζ 2 k ρ(Wk)Lν log 4k2|W|/δ r(ω) where ρ(Wk) = maxw,w Wk f(w, w , Γ, λ) for design λk (Z) and an active set of evaluation vectors Wk. The round sample count guarantees that given any experimental design, all vectors w W such that (w w) θ > 2 2 k can be determined to be suboptimal by the end of round k. The sampling methods we consider are now described. Static Oracle. This design selects λk = arg minλ (Z) maxw W\{w } f(w , w , Γ, λ). Static XY-Optimal. This design selects λk = arg minλ (Z) maxw,w W f(w, w , Γ, λ). Static Uniform. This design selects λk,z = 1/|Z| z Z. Adaptive Uniform (SE). This design selects λk,w = 1/|Wk| w Wk. Note that this algorithm is effectively an adaption of action-elimination [14] . The static designs are independent of the round and simply terminate when all evaluation vectors can be eliminated except for a recommended optimal vector bw. E.1.2 Unknown Γ. For this set of experiments, we compare Algorithm 3 against a collection of variations of the sampling procedures. Specifically, we compare against methods that either replace only the experimental design for estimating Γ, or only the experimental design for estimating θ, or both with uniform sampling. We label the approaches as N N, where N represents the sampling approaches (XY or uniform) for Γ and θ respectively. Moreover, to make our approach more practical, we modify the algorithm so that log(ZT , δ) = 4d + log 1/δ . The step of incrementally adding more E-optimal design samples is also removed, so we collect E-optimal design samples only once in the beginning of Algorithm 4. We find that even with these modifications to the algorithm, correctness is maintained empirically. E.2 Experiment 1: Jump-Around Instance We first return back to the location model of Section B. Recall that Z = W = X = {e1, , ed}. For this experiment, we take d = 6, let θ = 1 0.95 0.45 0.45 0.95 0.45 and σ2 u = 0.275. The results of the experiment are shown in Figure 4a for the case of known Γ. We see that Algorithm 1 performs much better than the baselines and nearly matches the oracle design. Delving into the approach, it is able to quickly eliminate all but w1 and wd 1 and then puts more mass on z1 and zd 1 to reduce the uncertainty on w1 and wd 1. For the case of unknown Γ, the results are shown in Figure 4d, where θ5 is reduced to 0.9 so that all approaches could finish. (a) Experiment 1: Known Γ (b) Experiment 2: Known Γ, identity (c) Experiment 2: Known Γ, permutation (d) Experiment 1: Unknown Γ (e) Experiment 2: Unknown Γ, identity (f) Experiment 2: Unknown Γ, permutation Figure 4: Sample complexity for algorithms on CPET-LB problems. Our approach is consistently competitive across the experiments.) E.3 Experiment 2: Interpolation Instance Let Z = W = X = {e1, , ed} define the measurement, evaluation, and observation sets. We first consider that Γ := (1 ε) d 1d1 d + εId for a parameter ε (0, 1) where 1d is a d-dimensional vector of 1 s and Id is the d-dimensional identity matrix. For this experiment, we take d = 4 and let θ = 0.5 0.583 0.67 0.75 . As in all compliance instances, ηt = xt Γ zt, and in this simulation ηt = 0.4η t vt, where vt = vt/ vt 2 and vt N(0, Id). The results of the experiment are shown in Figure 4b for ε {1, 0.9, 0.8, 0.7} with Γ known. Note that Static-XY and Uniform overlap, and SE and CPEG overlap. We see that Algorithm 1 and the adaptive uniform strategy perform similarly and near optimally. 
This is to be expected since the most efficient way to gather observations for treatments is to encourage that treatment, given that if the encouragement is not followed each of the alternatives is equally likely and provides no additional information of interest. Moreover, as discussed earlier, the problem gets more challenging as Γ 1d1 d /d. Note that the identity matrix could be replaced with a permutation matrix, in which case uniform sampling with elimination becomes highly suboptimal. The results for the case of Γ unknown are shown in Figure 4e with ε = 0.99. This shows the value that comes from the experimental design for estimating both Γ and θ. To demonstrate the superiority of our algorithm over SE, we also consider that Γ := (1 ε) d 1d1 d +εIp d, where Ip d is a permutation matrix as follows, 0 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 All other settings remain the same as in the previous interpolation instance. The results for the known Γ case, shown in Figure 4c, indicate that SE exhibits significant underperformance due to its sampling rule not accounting for the permutation effect in Γ. In contrast, CPEG consistently achieves near-optimal performance. Note that Static-XY and Uniform still overlap. Figure 4f presents the results for the unknown Γ case, where we can notice that, comparing with Figure 4e, estimating different permutation matrices (with the identity matrix as a special case) does not affect problem difficulty. F Proofs of the lower bound F.1 Proof of Theorem D.1 Theorem D.1. Consider a problem instance characterized by W Rd, Z Rd, Γ Rd d, and θ Rd. Assume Γ is known, θ is unknown, and the noise process is jointly Gaussian and defined by γ := η ε N(0, Σ) where Σ R(d+1) (d+1) is an arbitrary correlation matrix. For δ (0, 0.05], if the non-adaptive oracle algorithm acquires T σ2ρ log 1/δ /2 samples on the problem instance where σ2 := v Σv and v := θ 1 Rd+1, then P( bw = w ) δ. Proof. We begin by recalling the framework of the non-adaptive oracle algorithm and discussing the properties of its estimator for the noise structure described in the statement of the result. Non-Adaptive Oracle and Instance Definition. The non-adaptive oracle algorithm A selects T measurements to query prior to collecting any data. Let It represent the index of the vector z Z chosen at time t {1, . . . , T}. The noise process for the instance under consideration is assumed to be jointly Gaussian and defined by γt := ηt εt N(0, Σ) where Σ R(d+1) (d+1) is an arbitrary positive semidefinite matrix. Defining x It := Γ z It, v := θ 1 Rd+1 and νt := v γt, the feedback model can be described as follows: yt = xtθ + εt = (Γ z It) θ + η t θ + εt =: (Γ z It) θ + v γt =: x Itθ + νt. Observe that the noise is independent and identically distributed as νt N(0, σ2) where σ2 := v Σv since γt N(0, Σ). Moreover, the noise process is exogeneous with E[νt| x It] = 0 since x It is deterministic given the index choice It. Let {z It}T t=1, { x It}T t=1, and {yt}T t=1 denote the observations collected by the non-adaptive oracle algorithm A and define ZT RT d, XT RT d, and YT RT to contain the respective stacked observations. Algorithm A obtains an estimate bθoracle by minimizing the sum of squares as follows: bθoracle := arg minbθ Rd PT t=1(yt x It bθ)2 = ( X T XT ) 1 X T YT = (Z T ZT Γ) 1Z T YT . Given bθoracle, the non-adaptive oracle algorithms A returns a recommendation defined by bw = arg maxw W w bθoracle. 
Note that since bθoracle is obtained by least squares with exogeneous, independent and identically distributed mean-zero Gaussian noise, it is straightforward to verify the estimator is distributed as bθOracle θ N 0, σ2 A(ZT , Γ) 1 , (8) A(ZT , Γ) := T X t=1 Γ z Itz ItΓ = X T XT = Γ Z T ZT Γ. Proof by Contradiction. To begin, recall that ρ := min λ (Z) max w W\{w } w w 2 A(λ,Γ) 1 w w, θ 2 . Suppose for the sake of contradiction that the number of samples collected by the non-adaptive oracle algorithm A is T σ2ρ log 1/δ /2 and P( bw = w ) < δ for δ (0, 0.05]. To reach a contradiction, we analyze the distribution of (w w ) bθ for some w = w and show that with probability at least δ it is positive. We remark that this proof follows similar techniques to that of the proof of Theorem 3 of [22]. Let λ (Z) represent the empirical sampling distribution of the algorithm A, which is defined such that λz = 1 T PT t=1 1{z It = z} for each z Z. Moreover, define ρ (λ) := max w W\{w } w w 2 A(λ,Γ) 1 w w, θ 2 and ew arg max w W\{w } w w 2 A(λ,Γ) 1 w w, θ 2 . Note that ρ (λ) ρ and observe by definition, z Z λzΓ zz Γ) 1 t=1 Γ z Itz ItΓ 1 := A(ZT , Γ) 1. Thus, by Eq. (8), bθOracle θ N 0, σ2 A(λ, Γ) 1 and ( ew w ) (bθOracle θ) w ew, θ N 0, σ2 w ew 2 A(λ,Γ) 1 T w ew, θ 2 Furthermore, by the definition of ρ (λ), the assumption T σ2ρ log 1/δ /2, and the fact ρ (λ) ρ , we obtain V ( ew w ) (bθOracle θ) = σ2 w ew 2 A(λ,Γ) 1 T w ew, θ 2 := σ2 ρ (λ) T 2 log 1/δ . (9) Now, consider a random variable W N(0, 1). Proposition 2.1.2 of Vershynin [34] gives an anti-concentration result showing that for all ζ > 0, 2π e ζ2/2. (10) We apply this result to the quantity ( ew w ) (bθOracle θ)/ w ew, θ to conclude that ( ew w ) bθOracle > 0 with probability at least δ. Toward doing so, let c (1, 1.15] be a constant and define f W N 0, 2 log 1/δ and W := f W q 2/ log 1/δ and γ := c q 2/ log 1/δ . Observe that W N(0, 1). The following analysis holds for δ (0, 0.05] given that c (1, 1.15] as assumed: P ( ew w ) (bθOracle θ) w ew, θ c P(f W c) (By Eq. 9) 2/ log 1/δ c q 2π e γ2/2 (Proposition 2.1.2 Vershynin 34) log 1/δ 3 1 δ. The final inequality can be verified computationally. Thus, with probability at least δ for δ (0, 0.05], we obtain ( ew w ) bθOracle c(w ew) θ + ( ew w ) θ = c(w ew) θ (w ew) θ = (c 1)(w ew) θ > 0. Observe that the final inequality holds since c > 1 and (w ew) θ > 0 by definition. This result directly implies that with probability at least δ for δ (0, 0.05], the vector bw returned by algorithm A is not w . This is a contradiction, so we conclude that if the non-adaptive oracle algorithm A acquires T σ2ρ log 1/δ /2 samples, then P( bw = w ) δ for δ (0, 0.05]. F.2 Proof of Corollary D.2 Corollary D.2. There exists a problem instance characterized by W Rd, Z Rd, Γ Rd d, and θ Rd with a noise process satisfying Assumption 1 such that if the non-adaptive oracle algorithm acquires T max{d θ 2 2, d θ 2}ρ log 1/δ /2 samples, then P( bw = w ) δ for δ (0, 0.05]. Proof. To begin, consider the specifications of Theorem D.1 and its result. That is, a problem an arbitrary instance characterized by W Rd, Z Rd, Γ Rd d, and θ Rd where the noise process is jointly Gaussian and defined by γ := η ε N(0, Σ) where Σ R(d+1) (d+1) is an arbitrary correlation matrix. Observe that the noise process defined by γ satisfies Assumption 1. The result states that if the non-adaptive oracle algorithm acquires T σ2ρ log 1/δ /2 samples on the problem instance where σ2 := v Σv and v := θ 1 Rd+1, then P( bw = w ) δ for δ (0, 0.05]. 
From this point, we show that there exists a parameter θ and correlation matrix Σ such that σ2 := v Σv max{d θ 2 2, d θ 2} in order to reach the stated conclusion. Notation. Let ζηi,ε [ 1, 1] denote the correlation between ηi and ε for i {1, . . . , d}. Similarly, let ζηi,ηj = ζηj,ηi [ 1, 1] denote the correlation between ηi and ηj for i = j {1, . . . , d}. Note that the correlation of ηi with itself for i {1, . . . , d} is σ2 ηi = ζηi,ηi = 1 and similarly the correlation of ε with itself is σ2 ε = ζε,ε = 1. The correlation matrix Σ is then given by 1 ζη2,η1 ζηd,η1 ζε,η1 ζη1,η2 1 ζηd,η2 ζε,η2 ... ... ... ... ... ζη1,ηd ζη2,ηd 1 ζε,ηd ζη1,ε ζη2,ε ζηd,ε 1 := Σθ ζη,ε ζ η,ε 1 1 ζη2,η1 ζηd,η1 ζη1,η2 1 ζηd,η2 ... ... ... ... ζη1,ηd ζη2,ηd 1 [ 1, 1]d d and ζη,ε = ζη1,ε ζη2,ε ... ζηd,ε Moreover, we use the notation Σ θ,i Rd to denote the i th row of Σ θ , or equivalently the i th column of Σθ, for any i {1, . . . , d}. Lower Bounding Noise Variance. Given the above notation, we now work toward lower bounding σ2 := v Σv. Observe that by algebraic manipulations, v Σv := θ 1 Σθ ζη,ε ζ η,ε 1 = h θ Σ θ,1 + ζη1,ε θ Σ θ,2 + ζη2,ε θ Σ θ,d + ζηd,ε θ ζη,ε + 1 i θ 1 i=1 Σ θ,iθi + i=1 ζηi,εθi + θ ζη,ε + 1 = θ Σθθ + 2θ ζη,ε + 1. Since Σθ is a real symmetric matrix, an eigendecomposition exists such that Σθ = QΛQ where Λ = diag(λ1, . . . , λd) Rd d is a diagonal matrix containing the eigenvalues of Σθ and Q Rd d is an orthogonal matrix with columns corresponding to the eigenvectors of Σθ. Let qi := Q i denote column i of the matrix Q for i = {1, . . . , d}, which is equivalently eigenvector i of Σθ for i = {1, . . . , d}. Without loss of generality, assume that the eigenvectors are of unit length so that qi 2 = 1 for all i = {1, . . . , d}. Given this information, suppose that the parameter θ in the instance is equal to a scalar multiple of the eigenvector of Σθ corresponding to the maximum eigenvalue. Note this is equivalent to the statement that θ is equal to some scalar multiple of the column q Rd of the matrix Q where q is the eigenvector of Σθ corresponding to the maximum eigenvalue λ . Thus, we take θ = c q for some c R and observe that θ 2 = c. Toward quantifying the value of v Σv for the problem instance, we begin by characterizing θ Σθθ for the choice of θ. Consider the following analysis: θ Σθθ = θ QΛQ θ := (cq ) QΛQ (cq ) (θ := cq ) = θ 2 2q q1 q qd Λ q1 q qd q = θ 2 2 0 q 2 2 0 diag(λ1, . . . , λ , . . . , λ1) 0 q 2 2 0 (q i qj = 0 i = j) = θ 2λ . ( q 2 = 1) Thus, in general for this choice of θ, v Σθv = θ 2 2λ + 2θ ζη,ε + 1. To conclude, take Σθ := 1d1 d where 1d represents the d-dimensional vector of all ones. Since this is a rank-1 matrix, the maximum eigenvalue is λ = 1 d 1d = d and the remainder of the eigenvalues are zero. Observe that q = 1d/ d is an eigenvector corresponding to the maximum eigenvalue since 1d1 d q = dq . Thus, v Σθv = θ 2 2λ + 2θ ζη,ε + 1 := θ 2 2λ + 2 θ 21 d 1d/ d + 1 (θ := cq := θ 21d/ d = and ζη,ε := 1d) = d θ 2 2 + 2 max{d θ 2 2, This completes the proof since we have shown that there exists a parameter θ and correlation matrix Σ such that σ2 := v Σv max{d θ 2 2, d θ 2}, which by Theorem D.1 allows us to make the stated conclusion. G Proofs of the confidence interval G.1 Proof of Lemma 2.1 The first statement is an immediate consequence of Lemma G.2 and the second statement is proven in Lemma G.1. Lemma G.1. In the compliance model, the noise η θ + ε follows a 8 θ 2 2 + 2 -sub-Gaussian distribution. Proof. 
In compliance, we have z, x {e1, , ed}, and η = x P e1 | z , , P ed | z . Let us figure out the sub-Gaussian parameter of the random vector η. Fix any unit vector a. First, we have E[ η, a ] = 0. Second, we have η a η 2 (Cauchy Schwarz inequality) x P e1 | z , , P ed | z 2 x 2 + P e1 | z , , P ed | z 2 1 + P e1 | z , , P ed | z 1 (x {e1, , ed} and x 2 x 1, x) Thus, η a is bounded and zero-mean and thus 22-sub-Gaussian. This implies that β, max a: a 1 E[exp β η, a ] exp and thus η is a 22-sub-Gaussan random vector. Then, η θ is (2 θ )2-sub-Gaussian. Using Lemma G.2, we have that η θ + ε is 2(4 θ 2 + 1)-sub-Gaussian. Lemma G.2. Let A and B random variables that are each σ2 Aand σ2 B-sub-Gaussian but are correlated. Then, A + B is 2(σ2 A + σ2 B)-sub-Gaussian. Proof. By definition of sub-Gaussian, we have for any γ R, E h exp γ(A + B) i = E exp(γA) exp(γB) E exp(2γA) q E exp(2γB) (Cauchy-Schwarz) exp 2γ2σ2 A q exp 2γ2σ2 B exp 2γ2(σ2 A + σ2 B) . G.2 Proof of Lemma 2.2 Lemma 2.2. Suppose that T observations are collected non-adaptively from the structural equation model in Eqs. (1) and Γ Rd d is known. Then, with probability at least 1 δ for δ (0, 1) and w Rd, |w (bθoracle θ)| q 2σ2ν w A(ZT ,Γ) 1 log 2/δ . where σ2 ν is the sub-Gaussian parameter of the noise process ν := η θ + ε as characterized in Lemma 2.1. Proof. Given the knowledge of Γ, we have the oracle 2SLS estimator t=1 zs Γ zs t=1 zsz s Γ yt = x t θ + εt = Γ zt θ + η t θ + εt. Denote νt := η t θ + εt. For any w W, we have D bθoracle θ, w E = t=1 zsz s Γ t=1 zsyt θ, w t=1 zsz s Γ t=1 zs z s Γθ + νt θ, w t=1 zsz s Γ t=1 zsz s Γθ + t=1 zsz s Γ t=1 zsνt, w t=1 zsz s Γ By Lemma G.1, the noise νt is σ2 ν-sub-Gaussian, we have * t=1 zsz s Γ is PT t=1 zsz s Γ 1 zq, w 2 σ2 ν-sub-Gaussian. Thus D bθoracle θ, w E t=1 zsz s Γ t=1 zsz s Γ t=1 zsz s Γ t=1 zsz s Γ t=1 zsz s Γ D bθoracle θ, w E v u u u u t2w We can further write it as D bθoracle θ, w E 2 w 2 Γ ( PT t=1 zsz s )Γ 1σ2ν log 1 By taking a union bound over, we have the confidence interval for the absolute value as in the statement of the lemma. G.3 Proof of Theorem 2.3 Theorem 2.3. Suppose that bΓ is estimated through a design matrix ZT1 RT1 d and bθP-2SLS is estimated through a design matrix ZT2 RT2 d. Then, for any w W, with probability at least 1 δ, |w (bθP-2SLS θ)| w A(ZT2,bΓ) 1 + w A(ZT1,bΓ) 1 θ 2 σ2ηlog ZT1, δ/4 . where σ2 ν is the sub-Gaussian parameter of the noise ν := η θ + ε, σ2 η is the sub-Gaussian parameter of the noise η, and log(ZT , δ) := 8d ln 1 + 2TL2 z d(2 σmin(Z T1ZT1)) 4 2 σmin(Z T1ZT1) Proof. For the pseudo 2SLS estimator, we have bθP-2SLS θ = t=1 z Itz ItbΓ t=1 z Ityt θ t=1 z Itz ItbΓ t=1 z It z ItΓθ + νt θ t=1 z Itz ItbΓ t=1 z Itz ItΓθ + t=1 z Itz ItbΓ t=1 z Itνt + bΓ 1Γ I θ. For any w W, we have D bθP-2SLS θ, w E = t=1 z Itz ItbΓ t=1 z Itνt + bΓ 1Γ I θ, w t=1 z Itz ItbΓ t=1 z Itνt, w + bΓ 1Γ I θ, w t=1 z Itz ItbΓ νq + bΓ 1Γ I θ, w t=1 z Itz ItbΓ νq + bΓ 1 Γ 1 Γθ, w . (12) We upper bound the first term and the second term separately. For the first term, by Lemma 2.1, we know νt is σ2 ν-sub Gaussian, we have PT2 q=1 PT2 t=1 z Itz ItbΓ 1 z Iq, w νq is PT2 t=1 z Itz ItbΓ 1 z Iq, w 2 σ2 ν-sub Gaussian. 
Thus by the concentration inequality of sub- Gaussian random variables, we have t=1 z Itz ItbΓ t=1 z Itz ItbΓ By similar calculation as (11), we have with probability at least 1 δ t=1 z Itz ItbΓ v u u u u t2w t=1 z Itz It 2 w 2 A(ZT2,bΓ) 1σ2ν log 2 Thus with probability at least 1 δ D bθP-2SLS θ, w E w A(ZT2,bΓ) 1 + bΓ 1 Γ 1 Γθ, w By Theorem G.3, we have with probability at least 1 δ 2, bΓ 1 Γ 1 Γθ, w w A(ZT1,bΓ) 1 θ q σ2ηlog ZT1, δ/2 (14) Combining (13) and (14), we have with probability at least 1 δ, for any w W, D bθP-2SLS θ, w E w A(ZT2,bΓ) 1 + w A(ZT1,bΓ) 1 θ q σ2ηlog ZT1, δ/2 . By a union bound, we have the confidence interval for the absolute value of the inner product, |w (bθP-2SLS θ)| w A(ZT2,bΓ) 1 + w A(ZT1,bΓ) 1 θ q σ2ηlog ZT1, δ/4 . Theorem G.3. Suppose that the least square estimator bΓ is estimated through a design matrix ZT1 RT1 d, then it satisfies, with probability at least 1 δ, bΓ 1 Γ 1 Γθ, w w A(ZT1,bΓ) 1 θ q σ2ηlog(ZT1, δ) for any w W. Proof. Define V = Z T1ZT1, S RT1 d as the matrix with i-th row being η i , the stacked noise, i.e., the data collection process of the design matrix ZT1 is X = ZT1Γ + S. Then we have bΓ 1 Γ 1 Γθ, w =w bΓ 1 Γ 1 Γθ =w Γ + V 1Z T1S 1 V 1Z T1SΓ 1Γθ (Lemma J.12) =w bΓ 1V 1/2V 1/2Z T1Sθ w bΓ 1V 1/2 V 1/2Z T1Sθ w A(ZT1,bΓ) 1 V 1/2Z T1S op θ w A(ZT1,bΓ) 1 θ q σ2ηlog(ZT1, δ), where the last inequality is due to Lemma G.4. Lemma G.4. Suppose we have z1, . . . , z T Rd and η1, . . . , ηT Rd such that ηT | z1, η1, . . . , z T 1, ηT 1, z T is σ2 η-sub-Gaussian vector (defined in Assumption 1). Let Z, S RT d be matrices whose t-th row is z t and η t respectively. Suppose zt Lz, t. Let V = Z Z. Then, δ (0, 1), we have, with probability at least 1 δ, V 1/2Z T S op ση v u u t8d ln 1 + 2TL2z d(2 σmin(V )) 4 2 σmin(V ) We abbreviate log(ZT , δ) := 8d ln 1 + 2T L2 z d(2 σmin(V )) + 16 ln 2 6d δ log2 2 4 2 σmin(V ) . Proof. By the definition of operator norm, we have V 1/2Z T S op = sup {x| x 2=1} V 1/2Z T Sx 2 = sup {x| x 2=1} x S ZT V 1Z T Sx = sup {x| x 2=1} Z T Sx V 1. Considering a fixed w Rd, by Lemma G.5, we have with probability at least 1 δ, v u u td ln 1 + 2TL2z d(2 σmin(V )) 4 2 σmin(V ) By Lemma G.6 and a union bound, for the ε-covering Cε, we have the following event happens with probability no more than δ: x Cε, Z T Sx V 1 v u u td ln 1 + 2TL2z d(2 σmin(V )) 4 2 σmin(V ) We abbreviate c log(ZT , δ) := d ln 1 + 2T L2z d(2 σmin(V )) + 2 ln 2|Cε| δ log2 2 4 2 σmin(V ) When E does not happen, we have V 1/2Z S op = sup {x| x =1} = sup {x| x =1} min {y|y Cε} Z T S(x y + y) V 1 sup {x| x =1} min {y|y Cε} Z T S(x y) V 1 + Z T Sy V 1 sup {x| x =1} min {y|y Cε} V 1/2Z S op x y 2 + ση c log(ZT , δ) ε V 1/2Z T S op + ση c log(ZT , δ). Thus, V 1/2Z T S op ση 1 ε c log(ZT , δ). By choosing ε = 1 2 and Lemma G.6, we have V 1/2Z T S op ση v u u t8d ln 1 + 2TL2z d(2 σmin(V )) 4 2 σmin(V ) Lemma G.5. [Self-Normalized Bound for Vector-Valued Martingales] Suppose we have z1, . . . , zt Rd and η1, . . . , ηt Rd such that ηt | z1, η1, . . . , zt 1, ηt 1, zt is σ2 η-sub-Gaussian vector (defined in Assumption 1). Let Z, S Rt d be matrices whose s-th row is z s and η s respectively. Suppose zs Lz, s. Let Vt = Z Z. Then, b Rt, δ (0, 1), we have, with probability at least 1 δ, t 1, Z Sb V 1 t v u u td ln 1 + 2t L2z d(2 σmin(Vt)) 4 2 σmin(Vt) Proof. Since each row of S is a σ2 η-sub Gaussian vector, we have that (Sb)i/ b is σ2 η-sub Gaussian. Using Lemma G.7 with εs = (Sb)i/ b completes the proof. Lemma G.6. 
[26][Lemma 20.1] There exists a set Cε Rd with |Cε| 3 ε d such that for any x x | x Rd, x 2 = 1 , there exists a y Cε such that x y 2 ε. Lemma G.7. Let z1, z2, . . . {z Rd : z 2 Lz} and ε1, ε2, . . . R be random variables such that εk | z1, ε1, . . . , zt 1, εt 1, zt is σ2 ε-sub-Gaussian. Let Vt = Pt s=1 zsz s . Then, v u u td ln 1 + 2t L2z d(2 σmin(Vt)) 4 2 σmin(Vt) Proof. Let Zt Rt d be the design matrix and define εt := (ε1, . . . , εt) Rt. Let us omit the subscript t from Zt, εt and Vt. Note that Z ε V 1 = Z ε ( 1 2 V ) 1 Z ε ( 1 2 σmin(V )I) 1 2 Z ε (V +σmin(V )I) 1 It remains to bound Z ε (V +σmin(V )I) 1. We use union bound with the standard self-normalized inequality of Lattimore and Szepesvári [26][Theorem 20.4 and Note 20.2]. Specifically, let λk = 2 k+1 for k 1. Then, k N+, t 1, Z ε (V +λk I) 1 σε d ln 1 + t L2 + 2 ln (π2/6)k2 Under the event E, we have two cases: σmin(V ) 1: choose k = 1. Then, Z ε (V +σmin(V )I) 1 Z ε (V +I) 1 d ln 1 + t L2 + 2 ln (π2/6) σmin(V ) < 1: choose k = log2(2σmin(V ) 1) 2, which is the k that satisfies λk σmin(V ) < λk 1. Then, Z ε (V +σmin(V )I) 1 Z ε (V +λk I) 1 d ln 1 + t L2 + 2 ln (π2/6)k2 d ln 1 + t L2 dσmin(V )/2 + 2 ln (π2/6) log2(2σmin(V ) 1) 2 2σmin(V ); def n of k) Altogether, we have Z ε (V +σmin(V )I) 1 σε v u u td ln 1 + t L2(1 2σ 1 min(V )) d (π2/6)(1 log2(2σ 1 min(V )) 2) δ We conclude the proof by simplifying the RHS. H Proofs of sample complexity when given Γ H.1 Proof of Theorem 3.1 Algorithm 5 Confounded pure exploration with known Γ Input Z, W, Γ, δ, ε, Lν σ2 ν Initialize: k = 1, ˆ Wk = W Define f(w, w , Γ, λ) := w w 2 (P z Z Γ λzzz Γ) 1 while ˆ Wk > 1 do λk = arg minλ (Z) maxw,w Wk f(w, w , Γ, λ) ρ(Wk) = minλ (Z) maxw,w Wk f(w, w , Γ, λ) ζk = 2 k Nk = 2(1 + ω)ζ 2 k ρ(Wk)Lν log 4k2|W| ZNk = ROUND(λk, Nk) Pull arms in ZNk and observe YNk Compute bθk oracle = Z Nk ZNkΓ 1 Z Nk YNk Wk+1 = Wk\ w Wk | w Wk, s.t., D w w, bθk oracle E > ζk k = k + 1 end while Output: Wk Theorem 3.1. Algorithm 1 is δ-PAC for the CPET-LB problem and terminates in at most c(1 + ω)Lνρ log 1/δ + cr(ω) samples, where c hides logarithmic factors of and |W|. Proof. Part 1 δ-PAC By the confidence interval in Lemma 2.2, we have, with probability at least 1 δ 2k2|W|, D bθk oracle θ, w w E w w Γ PNk s=1 z Itz It 2σ2ν log 4k2|W| 1 + ω w w (Γ λzzz Γ) 1 Nk 2σ2ν log 4k2|W| 1 + ω w w (Γ λzzz Γ) 1 s 2(1 + ω)ζ 2 k ρ(Wk)Lν log 4k2|W| 2σ2ν log 4k2|W| Define a good event Ek,w for each k and Wk as ( D bθk oracle θ, w w E ζk We claim that with probability at least 1 δ, the event T k=1 T w Wk Ek,w holds. It can be proved by a union bound. w Wk P Ec k,w where the last step is by the fact that P k=1 1 k2 = π2 6 < 2. Under the event T k=1 T w Wk Ew,k, to show that the best arm is never eliminated, it suffices to show that for any sub-optimal arm w Wk, D w w , bθk oracle E = D w w , bθk oracle θ E + w w , θ D w w , bθk oracle θ E Thus the best arm w never satisfies the elimination condition. Next we show that at the end of stage k, any suboptimal arm w that satisfies is eliminated. To show this, we need to show that w satisfies the elimination condition, D bθk oracle, w w E D bθk oracle, w w E = D bθk oracle θ, w w E + θ, w w ζk + 2ζk =ζk. This implies that with probability at least 1 δ, w always survives. Part 2 sample complexity Define Sk = w W | w w, θ 4ζk . Thus with probability at least 1 δ, we have T k {Wk Sk}. 
Theorem 3.1. Algorithm 1 is δ-PAC for the CPET-LB problem and terminates in at most
$$c(1+\omega)L_\nu\rho^*\log(1/\delta) + c\,r(\omega)$$
samples, where $c$ hides logarithmic factors of $\Delta^{-1}$ and $|\mathcal W|$.

Proof. Part 1: δ-PAC. By the confidence interval in Lemma 2.2, we have, with probability at least $1-\frac{\delta}{2k^2|\mathcal W|}$,
$$\big\langle\widehat\theta^k_{\text{oracle}}-\theta,\, w-w'\big\rangle \le \|w-w'\|_{(\Gamma^\top\sum_{s=1}^{N_k} z_{I_s}z_{I_s}^\top\Gamma)^{-1}}\sqrt{2\sigma_\nu^2\log\tfrac{4k^2|\mathcal W|}{\delta}} \le \sqrt{\tfrac{1+\omega}{N_k}}\,\|w-w'\|_{(\sum_z\lambda_z\Gamma^\top zz^\top\Gamma)^{-1}}\sqrt{2\sigma_\nu^2\log\tfrac{4k^2|\mathcal W|}{\delta}} \le \sqrt{\tfrac{(1+\omega)\,\rho(\mathcal W_k)}{N_k}\,2\sigma_\nu^2\log\tfrac{4k^2|\mathcal W|}{\delta}} \le \zeta_k,$$
where the second inequality uses the rounding guarantee of ROUND and the last step plugs in $N_k$ together with $\sigma_\nu^2\le L_\nu$. Define a good event $E_{k,w}$ for each $k$ and $w\in\mathcal W_k$ as
$$E_{k,w} := \Big\{\big|\big\langle\widehat\theta^k_{\text{oracle}}-\theta,\, w^*-w\big\rangle\big| \le \zeta_k\Big\}.$$
We claim that with probability at least $1-\delta$ the event $\bigcap_{k=1}^\infty\bigcap_{w\in\mathcal W_k}E_{k,w}$ holds. This follows from a union bound:
$$\mathbb P\Big(\bigcup_k\bigcup_{w\in\mathcal W_k}E_{k,w}^c\Big) \le \sum_{k=1}^\infty\sum_{w\in\mathcal W_k}\mathbb P\big(E_{k,w}^c\big) \le \sum_{k=1}^\infty\frac{\delta}{2k^2} \le \delta, \tag{15}$$
where the last step uses $\sum_{k=1}^\infty 1/k^2 = \pi^2/6 < 2$.

Under the event $\bigcap_k\bigcap_{w\in\mathcal W_k}E_{k,w}$, the best arm is never eliminated: for any sub-optimal arm $w\in\mathcal W_k$,
$$\big\langle w-w^*,\,\widehat\theta^k_{\text{oracle}}\big\rangle = \big\langle w-w^*,\,\widehat\theta^k_{\text{oracle}}-\theta\big\rangle + \big\langle w-w^*,\,\theta\big\rangle \le \big\langle w-w^*,\,\widehat\theta^k_{\text{oracle}}-\theta\big\rangle \le \zeta_k,$$
so $w^*$ never satisfies the elimination condition; that is, with probability at least $1-\delta$, $w^*$ survives every stage. Next we show that at the end of stage $k$, any suboptimal arm $w$ with $\langle\theta, w^*-w\rangle \ge 2\zeta_k$ is eliminated. Such a $w$ satisfies the elimination condition with witness $w^*$:
$$\big\langle\widehat\theta^k_{\text{oracle}},\, w^*-w\big\rangle = \big\langle\widehat\theta^k_{\text{oracle}}-\theta,\, w^*-w\big\rangle + \big\langle\theta,\, w^*-w\big\rangle \ge -\zeta_k + 2\zeta_k = \zeta_k,$$
hence $w$ is eliminated.

Part 2: sample complexity. Define $S_k = \{w\in\mathcal W \mid \langle w^*-w,\theta\rangle \le 4\zeta_k\}$. Thus, with probability at least $1-\delta$, we have $\mathcal W_k\subseteq S_k$ for all $k$. This implies that, with probability at least $1-\delta$, for all $k$,
$$\rho(\mathcal W_k) = \min_{\lambda\in\triangle(\mathcal Z)}\max_{w,w'\in\mathcal W_k}\|w-w'\|^2_{(\sum_z\lambda_z\Gamma^\top zz^\top\Gamma)^{-1}} \le \min_{\lambda\in\triangle(\mathcal Z)}\max_{w,w'\in S_k}\|w-w'\|^2_{(\sum_z\lambda_z\Gamma^\top zz^\top\Gamma)^{-1}} = \rho(S_k).$$
Define $\Delta$ to be the minimum gap between $w^*$ and any other $w\in\mathcal W$, i.e., $\Delta := \min_{w\in\mathcal W\setminus\{w^*\}}\langle w^*-w,\theta\rangle$. Then for $k\ge\lceil\log(4\Delta^{-1})\rceil$ we have $S_k=\{w^*\}$ with probability at least $1-\delta$. The total sample complexity is the sum of the number of samples over the rounds,
$$\sum_{k=1}^{\lceil\log(4\Delta^{-1})\rceil} N_k \le \sum_{k=1}^{\lceil\log(4\Delta^{-1})\rceil} 4(1+\omega)\zeta_k^{-2}\rho(\mathcal W_k)L_\nu\log\tfrac{4k^2|\mathcal W|}{\delta} \le \sum_{k=1}^{\lceil\log(4\Delta^{-1})\rceil} 8(1+\omega)2^{2k}\rho(\mathcal W_k)L_\nu\log\tfrac{4k^2|\mathcal W|}{\delta} \le \sum_{k=1}^{\lceil\log(4\Delta^{-1})\rceil} 8(1+\omega)2^{2k}\rho(S_k)L_\nu\log\tfrac{4k^2|\mathcal W|}{\delta}. \tag{16}$$
On the other hand,
$$\rho^* = \min_{\lambda\in\triangle(\mathcal Z)}\max_{w\in\mathcal W\setminus\{w^*\}}\frac{\|w^*-w\|^2_{(\sum_z\lambda_z\Gamma^\top zz^\top\Gamma)^{-1}}}{\langle w^*-w,\theta\rangle^2} \ge \frac{1}{\lceil\log(4\Delta^{-1})\rceil}\min_{\lambda\in\triangle(\mathcal Z)}\sum_{k=1}^{\lceil\log(4\Delta^{-1})\rceil}\max_{w\in S_k\setminus\{w^*\}}\frac{\|w^*-w\|^2_{(\sum_z\lambda_z\Gamma^\top zz^\top\Gamma)^{-1}}}{\langle w^*-w,\theta\rangle^2}$$
$$\ge \frac{1}{16\,\lceil\log(4\Delta^{-1})\rceil}\sum_{k=1}^{\lceil\log(4\Delta^{-1})\rceil}2^{2k}\min_{\lambda\in\triangle(\mathcal Z)}\max_{w\in S_k}\|w^*-w\|^2_{(\sum_z\lambda_z\Gamma^\top zz^\top\Gamma)^{-1}} \ge \frac{1}{64\,\lceil\log(4\Delta^{-1})\rceil}\sum_{k=1}^{\lceil\log(4\Delta^{-1})\rceil}2^{2k}\rho(S_k), \tag{17}$$
where the second inequality uses $\langle w^*-w,\theta\rangle\le 4\zeta_k$ for $w\in S_k$, and the last inequality is by the triangle inequality, i.e.,
$$\max_{w,w'\in S_k}\|w-w'\|^2_{(\sum_z\lambda_z\Gamma^\top zz^\top\Gamma)^{-1}} \le \max_{w,w'\in S_k}\Big(\|w^*-w\|_{(\sum_z\lambda_z\Gamma^\top zz^\top\Gamma)^{-1}} + \|w^*-w'\|_{(\sum_z\lambda_z\Gamma^\top zz^\top\Gamma)^{-1}}\Big)^2 \le 4\max_{w\in S_k}\|w^*-w\|^2_{(\sum_z\lambda_z\Gamma^\top zz^\top\Gamma)^{-1}}.$$
Combining (16) and (17), we have
$$\sum_{k=1}^{\lceil\log(4\Delta^{-1})\rceil} N_k \le c(1+\omega)L_\nu\,\rho^*\,\log(4\Delta^{-1})\log\Big(\tfrac{4\lceil\log(4\Delta^{-1})\rceil^2|\mathcal W|}{\delta}\Big) + \lceil\log(4\Delta^{-1})\rceil\, r(\omega),$$
where the additive term accounts for the samples required by ROUND in each round. This proves the claimed sample complexity.

I Proofs of sample complexity when given $\widehat\Gamma$

Algorithm 6 Confounded pure exploration with known $\widehat\Gamma$
  Input: $\mathcal Z$, $\mathcal W$, $\widehat\Gamma$, $\delta$, $\varepsilon$, $L_\nu\ge\sigma_\nu^2$, $L_\eta\ge\sigma_\eta^2$
  Initialize: $k=1$, $\widehat{\mathcal W}_k=\mathcal W$, calculate $\gamma$ using (18)
  Define $f(w,w',\Gamma,\lambda) := \|w-w'\|^2_{(\sum_{z\in\mathcal Z}\lambda_z\Gamma^\top zz^\top\Gamma)^{-1}}$
  while $\exists w,w'\in\widehat{\mathcal W}_k$ s.t. $\langle w'-w,\widehat\theta^k_{\text{P-2SLS}}\rangle > 4\gamma$, or $\zeta_k > \gamma$ do
    $\lambda_k = \arg\min_{\lambda\in\triangle(\mathcal Z)}\max_{w,w'\in\widehat{\mathcal W}_k} f(w,w',\widehat\Gamma,\lambda)$
    $\rho(\widehat{\mathcal W}_k) = \min_{\lambda\in\triangle(\mathcal Z)}\max_{w,w'\in\widehat{\mathcal W}_k} f(w,w',\widehat\Gamma,\lambda)$
    $\zeta_k = 2^{-k}$
    $N_k = \lceil 2(1+\omega)\,(\zeta_k^{-2}\wedge\gamma^{-2})\,\rho(\widehat{\mathcal W}_k)\,L_\nu\log(4k^2|\mathcal W|/\delta)\rceil$
    $Z_{N_k} = \mathrm{ROUND}(\lambda_k, N_k)$
    Pull arms in $Z_{N_k}$ and observe $Y_{N_k}$
    Compute $\widehat\theta^k_{\text{P-2SLS}} = (Z_{N_k}^\top Z_{N_k}\widehat\Gamma)^{-1} Z_{N_k}^\top Y_{N_k}$
    $\widehat{\mathcal W}_{k+1} = \widehat{\mathcal W}_k\setminus\{w\in\widehat{\mathcal W}_k \mid \exists w'\in\widehat{\mathcal W}_k \text{ s.t. } \langle w'-w,\widehat\theta^k_{\text{P-2SLS}}\rangle > \zeta_k+\gamma\}$
    $k = k+1$
  end while
  Output: any $w\in\widehat{\mathcal W}_k$

Theorem I.1. Suppose that $\widehat\Gamma$ is an OLS estimate from an offline dataset $\{Z_T, X_T\}$ collected non-adaptively through a fixed design $\xi$ and the efficient rounding procedure ROUND, and that $T \ge \frac{4\sigma_\eta^2}{\sigma_{\min}(A(\xi,\Gamma))}\,\overline{\log}(Z_T,\delta/2)$. Then Algorithm 6 guarantees that, with probability at least $1-\delta$, a $6\gamma$-good arm is returned, where
$$\gamma := \max_{w,w'\in\mathcal W}\|w-w'\|_{A(Z_T,\widehat\Gamma)^{-1}}\,\|\theta\|_2\,\sqrt{L_\eta\,\overline{\log}(Z_T,\delta)}. \tag{18}$$
Moreover, the algorithm terminates in at most
$$c(1+\omega)L_\nu\log(1/\delta)\,\rho^*(\gamma) + c(1+\omega)^2 L_\nu\log(1/\delta)\,\frac{\sigma_\eta^2\,\overline{\log}(Z_T,\delta)}{\sigma_{\min}(A(\xi,\Gamma))}\,\rho(\xi,\gamma)$$
samples, where $c$ hides logarithmic factors of $\Delta^{-1}$, $|\mathcal W|$ and $\gamma$, and
$$\rho(\lambda,\gamma) := \max_{w\in\mathcal W\setminus\{w^*\}}\frac{\|w^*-w\|^2_{A(\lambda,\Gamma)^{-1}}}{\langle w^*-w,\theta\rangle^2\vee\gamma^2}, \qquad \rho^*(\gamma) := \min_{\lambda\in\triangle(\mathcal Z)}\max_{w\in\mathcal W\setminus\{w^*\}}\frac{\|w^*-w\|^2_{A(\lambda,\Gamma)^{-1}}}{\langle w^*-w,\theta\rangle^2\vee\gamma^2}.$$

Proof. The proof can be divided into five steps:
1. The best arm $w^*$ is never eliminated.
2. At the end of stage $k$, any suboptimal arm $w$ that satisfies $\langle\theta, w^*-w\rangle \ge 2\zeta_k+2\gamma$ is eliminated.
3. The stopping condition is met in finite time.
4. When the stopping condition is met, only $6\gamma$-good arms are left.
5. The upper bound on the sample complexity.

Step 1: The best arm $w^*$ is never eliminated. By the confidence interval in Theorem 2.3, we have, with probability at least $1-\frac{\delta}{2k^2|\mathcal W|}$, for any $k$ and $w,w'\in\mathcal W_k$,
$$\big\langle\widehat\theta^k_{\text{P-2SLS}}-\theta,\, w-w'\big\rangle \le \|w-w'\|_{A(Z_{N_k},\widehat\Gamma)^{-1}}\sqrt{2\sigma_\nu^2\log\tfrac{4k^2|\mathcal W|}{\delta}} + \gamma \le \sqrt{\tfrac{1+\omega}{N_k}}\,\|w-w'\|_{A(\lambda_k,\widehat\Gamma)^{-1}}\sqrt{2\sigma_\nu^2\log\tfrac{4k^2|\mathcal W|}{\delta}} + \gamma \le \sqrt{\tfrac{(1+\omega)\rho(\widehat{\mathcal W}_k)}{N_k}\,2\sigma_\nu^2\log\tfrac{4k^2|\mathcal W|}{\delta}} + \gamma \le \zeta_k + \gamma,$$
where the $\gamma$ term bounds the part of the confidence interval coming from the estimation error of $\widehat\Gamma$, using the definition (18) and the condition on $T$, and the last step plugs in $N_k$. Define a good event $E_{k,w}$ for each $k$ and $w\in\mathcal W_k$ as
$$E_{k,w} := \Big\{\big|\big\langle\widehat\theta^k_{\text{P-2SLS}}-\theta,\, w^*-w\big\rangle\big| \le \zeta_k+\gamma\Big\}.$$
By the same calculation as (15), we claim that with probability at least $1-\delta$ the event $\bigcap_{k=1}^\infty\bigcap_{w\in\mathcal W_k}E_{k,w}$ holds.
Under the event T k=1 T w Wk Ew,k, to show that the best arm is never eliminated, it suffices to show that for any sub-optimal arm w Wk, D w w , bθk P-2SLS E = D w w , bθk P-2SLS θ E + w w , θ D w w , bθk P-2SLS θ E Thus the best arm w never satisfies the elimination condition. Step 2: At the end of stage k, any suboptimal arm w that satisfies θ, w w 2ζk + 2γ is eliminated. To prove this, we show that such arm w must satisfy the elimination condition, D bθk P-2SLS, w w E D bθk P-2SLS, w w E = D bθk P-2SLS θ, w w E + θ, w w ζk γ + 2ζk + 2γ =ζk + γ. Thus the arm w is eliminated. Step 3: The stopping condition is met in finite time. Given the result in Step 2 and the fact that ζk is an exponentially decreasing sequence, we know that all of the arms w satisfying θ, w w > 2γ will be eliminated in finite time. This means that only the arms w satisfying θ, w w 2γ will remain. We need to show that w, w Wk, D w w, bθk P-2SLS E 4γ can be achieved in finite time. When ζk γ, (which will happen in finite time), D w w, bθk P-2SLS E = D w w + w w , bθk P-2SLS E = D w w, bθk P-2SLS E + D w w , bθk P-2SLS E = D w w, bθk P-2SLS θ + θ E + D w w , bθk P-2SLS θ + θ E = D w w, bθk P-2SLS θ E + w w, θ + D w w , bθk P-2SLS θ E + w w , θ ζk + γ + 2γ + ζk + γ =4γ. Step 4: When the stopping condition is met, there are only 6γ-good arms left. For any w Wk, we have w w, θ = D w w, θ bθk P-2SLS E + D w w, bθk P-2SLS E ζk + γ + 4γ =6γ. Step 5: The upper bound of the sample complexity. Define W(2γ) as the set of 2γ-good arms, i.e., W(2γ) := w W | θ, w w 2γ . Then the best arm in the set W\W(2γ) has a suboptimality gap minw W\W(2γ) θ, w w . We define min(2γ) := minw W\W(2γ) θ, w w 2γ. By the result in Step 2, we know that after k := log 4 min(2γ) 1 stages, all of the arms in W\W(2γ) are eliminated. De- fine Sk = w W | w w, θ 4ζk + 2γ . Thus with probability at least 1 δ, we have T The sample complexity of the algorithm is the total number of samples pulled, which is 2(1 + ω) ζ 2 k 1 ρ(Wk)Lν log 3(1 + ω) 22k 1 ρ(Wk)Lν log 3(1 + ω) 22k 1 ρ(Sk)Lν log Note that the factor of minλ (Z) maxw,w Sk w w 2 A(λ,Γ) 1 has the underlying true Γ in it, while the algorithm uses the plugged-in bΓ. We need to relate it to ρ(Sk) := minλ (Z) maxw,w Sk w w 2 A(λ,bΓ) 1. By Lemma J.9, and defining λ z(Sk) as the optimal design for Sk, i.e., λ k = arg min λ (Z) max w,w Sk w w 2 A(λ,Γ) 1. Define λ k = 1 2ξ, we have w w 2 A(λk,bΓ) 1 = min λ (Z) max w,w Wk w w 2 A(λ,bΓ) 1 min λ (Z) max w,w Sk w w 2 A(λ,bΓ) 1 min λ (Z) max w,w Sk 3 w w 2 A(λ,Γ) 1 + 2 max w,w Sk (Γ bΓ ) w w 2 max w,w Sk 3 w w 2 A(λ k,Γ) 1 + 2 max w,w Sk (Γ bΓ ) w w 2 max w,w Sk 6 w w 2 A(λ k,Γ) 1 + 16σ2 ηlog ZT , δ/2 σmin A(ξ, Γ) max w,w Sk w w 2 A(ξ,Γ) 1 1 T , (20) where the last inequality is due to w w 2 A(λ k,Γ) 1 w w 2 A( 1 =2 w w 2 A(λ k,Γ) 1, as well as, by Lemma J.11, when T 4σ2 η σmin(A(ξ,Γ))log ZT , δ/2 , we have, with probability at least 1 δ/2, we have (Γ bΓ ) w w 2 4σ2 ηlog ZT , δ/2 σmin A(λ k, Γ) max w,w Sk w w 2 A(ξ,Γ) 1 1 T 4σ2 ηlog ZT , δ/2 σmin A(αλ k + (1 α)ξ, Γ) max w,w Sk w w 2 A(ξ,Γ) 1 1 T 8σ2 ηlog ZT , δ/2 σmin A(ξ, Γ) max w,w Sk w w 2 A(ξ,Γ) 1 1 T . We can lower bound ρ (γ) by ρ (γ) = min λ (Z) max w W\{w } w w 2 A(λ,Γ) 1 w w, θ 2 γ2 = min λ (Z) max k max w Sk\{w } w w 2 A(λ,Γ) 1 w w, θ 2 γ2 k min λ (Z) k=1 max w Sk\{w } w w 2 A(λ,Γ) 1 w w, θ 2 γ2 min λ (Z) max w Sk\{w } w w 2 A(λ,Γ) 1 min λ (Z) max w,w Sk w w 2 A(λ,Γ) 1. 
(21) Given (19), (20) and (21), we have k=1 Nk c(1 + ω)Lν log 1/δ ρ (γ) + c(1 + ω)2Lν log 1/δ σ2 ηlog(ZT , δ) σmin A(ξ, Γ) ρ(ξ, γ) where c hides logarithmic factors of and |W| and γ. J Proofs of sample complexity with unknown Γ Algorithm 7 Optimal design with unknown Γ Input Z, W, δ, Lν σ2 ν, Lη σ2 η, ω, γmin λmin(Γ), λE, κ0 Initialize: k = 1, W1 = W, bΓ0 = , ζ1 = 1 Define f(w, w , Γ, λ) := w w 2 (P z Z Γ λzzz Γ) 1, M := 32Lη γ2 minσmin(A(λE,I)) 1, δk,ℓ:= δ 4k2ℓ2 while |Wk| > 1 do bΓk = Γ estimator Wk, bΓk 1, ζk, δ, k, ω, λE, M, Lη Step 1: update bΓ bθk P-2SLS = θ estimator Wk, δ, ζk, bΓk, ω, Lν, k Step 2: update bθ Wk+1 = Wk\ w Wk | w Wk, s.t., D w w, bθk P-2SLS E > ζk Step 3: elimination k k + 1, ζk = 2 k end while Output: Wk The algorithms of Γ estimator and θ estimator we present below are slightly different from the one in the main text. In the main text, we omit the phase index k in the algorithm for simplicity. Algorithm 8 Γ estimator Input Wk, bΓk 1, ζk, δ, k, ω, λE, M, Lη Define Stop(W, Z, Γ, δ) := maxw,w W w w A(Z,Γ) 1 θ 2 Lηlog(Z, δ) Initialize ℓ= 1, Nk,0,0 = 0 doubling trick initialization if k = 1 then while ℓ= 1 or Stop Wk, Zk,0,ℓ, bΓk, δk,ℓ > 1 do get 2ℓ 1 r(ω) 2 κ0 samples denoted as Zk,0,ℓ, Xk,0,ℓ, Yk,0,ℓ per design λE via ROUND Update bΓk by OLS on Zk,0,ℓ, Xk,0,ℓ , ℓ ℓ+ 1 end while else λk = arg minλ (Z) maxw,w Wk f(w, w , bΓk 1, λ) N = 4gd M ln 1 + 2M d + L2 z + 2M2gd M + 8M ln 2 6d δk,1 while ℓ= 1 or Stop Wk, Zk,0,ℓ Zk,1,ℓ, bΓk, δk,ℓ > ζk do Nk,1,ℓ= 2ℓN doubling trick update get Nk,1,ℓsamples per λk denoted as Zk,1,ℓ, Xk,1,ℓ, Yk,1,ℓ via ROUND Nk,0,ℓ= 2gd M ln M d + Nk,1,ℓ+ L2 z + 4M ln 2 6d get (Nk,0,ℓ Nk,0,ℓ 1) samples per λE augmented to Zk,0,ℓ 1, Xk,0,ℓ 1 and get Zk,0,ℓ, Xk,0,ℓ Update bΓk by OLS on Zk,0,ℓ Zk,1,ℓ, Xk,0,ℓ Xk,1,ℓ , ℓ ℓ+ 1 end while end if Output: bΓk Algorithm 9 θ estimator Input Wk, δ, ζk, bΓk, ω, Lν, k bλk = arg minλ (Z) maxw,w Wk f(w, w , bΓk, λ) ρ(Wk) = minλ (Z) maxw,w Wk f(w, w , bΓk, λ) Nk,2 = 2(1 + ω)ζ 2 k ρ(Wk)Lν log 4k2|W| get Nk,2 samples per design bλk denoted as Zk,2, Xk,2, Yk,2 via ROUND update bθk P-2SLS = (bΓ k Z k,2Zk,2bΓk) 1bΓ k Z k,2Yk,2 Output: bθk P-2SLS Lemma J.1. Algorithm 3 and Nk,1,ℓ, Nk,0,ℓguarantees three properties: Property 1: Nk,1,ℓ Nk,0,ℓ. Property 2: 1 2 Nk,0,ℓ+ Nk,1,ℓ Nk,0,ℓ 1 + Nk,1,ℓ 1. Property 3: Nk,1,1 8d ln 1+ Nk,1,1L2z +16 ln 2 6dk2 δ M ln(d M). Proof. For Property 1, recall that Nk,0,ℓand Nk,1,ℓare defined as 2gd M ln M d + Nk,1,ℓ+ L2 z + 4M ln 4gd M ln 1 + 2M d + L2 z + 2M2gd M + 8M ln We prove the result by induction. For the result for ℓ= 1, note that Nk,1,1 is of the form Nk,1,1 = 2a + 2b ln(1 + 2c + 2bd) and Nk,0,1 is of the form Nk,0,1 = a + b ln(c + dr), where a = 4M ln 2 6d δk,1 c = M d + L2 z r = Nk,1,ℓ. By the contraposition of Lemma K.5, we have r > 2a + 2b ln(1 + 2c + 2bd) = r > a + b ln(c + dr). Thus we have Nk,1,1 Nk,0,1. Now we assume the result holds for ℓ, i.e., Nk,1,ℓ Nk,0,ℓand prove that it holds for ℓ+ 1. We have Nk,1,ℓ+1 = 2Nk,1,ℓ 2Nk,0,ℓ. It suffices to prove that 2Nk,0,ℓ Nk,0,ℓ+1. 2gd M ln M d + Nk,1,ℓ+ L2 z + 4M ln 2gd M ln M 2 d + Nk,1,ℓ+ L2 z 2 + 8M ln 2gd M ln M 2 d + Nk,1,ℓ+ L2 z 2 + 8M 2gd M ln M 2 d + Nk,1,ℓ+ L2 z 2 + 4M + 2 ln(ℓ+ 1) (2 ln ℓ ln(ℓ+ 1) for ℓ 2) 2gd M ln M d + 2Nk,1,ℓ+ L2 z + 4M ln 2 6d(ℓ+ 1)2 For Property 2, it suffices to prove that 1 2Nk,0,ℓ Nk,0,ℓ 1. This is equivalent to prove that 1 2 a + b ln(c + 2dr) a + b ln(c + dr). 
We have 1 2 a + b ln(c + 2dr) a + b ln(c + dr) 2b ln(c + 2dr) 1 2a + b ln(c + dr) 2a + b ln(c + dr) c + 2dr c + dr c + 2dr c2 + 2cdr + d2r2 dr(dr + 2c 2) + c2 c 0. The last inequality holds because c = M > 1. For Property 3, we have, 4d M ln 1 + 2M d + L2 z + 2Md M + 8M ln 2 6dk2 8d ln 1 + Nk,1,1L2z d + 16 ln 2 6dk2 4d M ln 1 + 2M d + L2 z + 2Md M 8d ln 1 + Nk,1,1L2 z d + 16 ln 2 6dk2 δ + 8M ln 2 6dk2 8d ln 1 + Nk,1,1L2 z d + 16 ln 2 6dk2 M ln 1 + 2M d + L2 z + 2Md M 2 ln (1 + ML2z) + M 2 (loosely apply Nk,1,1 d M) Theorem J.2. Algorithm 3 guarantees that with probability at least 1 δ, the best arm is returned, and the algorithm terminates in at most Lν log 1/δ + Lη θ 2 2 d + log 1/δ ρ + d + log 1/δ Lη θ 2 2ρ0 + M ! pulls, ignoring both of the additive and multiplicative logarithms of , |W|, ρ , ρ0, M, where ρ = min λ (Z) max w W\{w } z Z λzΓ zz Γ) 1 ρ0 = max w W\{w } w w 2 (P z Z λE,zΓ zz Γ) 1, M = 32Lη γ2 minσmin A(λE, I) 1. Note that ρ0 does not get hurt by w w, θ . It comes from the fact that in the first phase, we initialize that algorithm with E-optimal design. Proof. Part 1: correctness of the algorithm The idea of the proof is similar to the proof of Theorem 3.1. Recall that the confidence interval of P-2SLS can be break down into two terms (12). D bθP-2SLS θ, w E = t=1 z Itz ItbΓ νq + bΓ 1 Γ 1 Γθ, w . Given bΓ, for a w W, with probability at least 1 δ 2 the first term satisfies t=1 z Itz ItbΓ νq w A(ZT2,bΓ) 1 For any w W, with probability at least 1 δ 2, the second term satisfies bΓ 1 Γ 1 Γθ, w w A(ZT1,bΓ) 1 θ 2 σ2ηlog ZT1, δ/4 . Note that by Lemma G.4, the above inequality holds for all w W, and the RHS is essentially a result of V 1/2Z T S op q σ2ηlog ZT , δ/4 . In the vanilla form of the confidence of P-2SLS, we can define good events as for the first term, for any w W, t=1 z Itz ItbΓ νq w A(ZT2,bΓ) 1 q 2σ2ν log 16k2|W|/δ . for the second term, V 1/2Z T S op q σ2ηlog ZT , δ/4 . For our algorithm design, since we use the doubling trick for the first sub-phase, we need to define the good event for the first sub-phase as the samples from each doubling trick iteration satisfies the self-normalized concentration inequality of Lemma G.4. We define the good event for ℓ-th doubling trick iteration in the first sub-phase of phase k as ( V 1/2 k,ℓ Z k,ℓSk,ℓ 2 op σ2 ηlog Zk,ℓ, δ 4k2(ℓ+ 1)2 where Zk,ℓ= Zk,0,ℓ Zk,1,ℓ, Vk,ℓ= Z k,ℓZk,ℓand Sk,ℓis stacked noise matrix during collecting samples Zk,ℓper the model X = ZΓ + S. By a union bound, we have ℓ=1 P E1 k,ℓ δ/2. For the second sub-phase in phase k, we define the good event for the second sub-phase in phase k and w W as t=1 z Itz ItbΓ νq w A(Zk,2,bΓ) 1 q 2σ2ν log 16k2|W|/δ By a union bound, we have w Wk E2 k,w w W P E2 k,w δ/2. Under the good event T k=1 T ℓ=1 E1 k,ℓ T T k=1 T w Wk E2 k,w , we have with probability at least 1 δ, for all k and w W, D bθk P-2SLS θ, w w E ζk. The rest of proof is same as the proof of Theorem 3.1. Part 2: sample complexity of algorithm Sample complexity for first sub-phase Recall that λE = arg maxλ (Z) σmin P z Z λzzz is the E-optimal design to maximize the minimum singular value of P z Z λzzz and κ0 = maxλ σmin(P z Z λzzz ) is the maximum minimum singular value of P z Z λzzz . At the beginning of first sub-phase in phase k, the algorithm first samples Nk,0,0 arms according to λE. Before we proceed to the main proof of the sample complexity, we first address a minor technique issue to avoid cumbersomeness. 
For the logarithmic term that appears in the algorithm and confidence interval, log(ZT , δ) := 8d ln 1 + 2TL2 z d(2 σmin(Z T ZT )) 4 2 σmin(Z T ZT ) when 2 σmin(V ) = 2, it is equivalent to f log(T, δ) := 8d ln 1 + TL2 z d Our algorithm design guarantees that 2 σmin(V ) = 2 is always true whenever we need to use the logarithmic term log(ZT , δ), given that the samples of our interest always includes the E-optimal design samples Zk,0,ℓand the number of samples from the E-optimal design Zk,0,ℓ is always larger than 2 κ0 . So for the remaining part of the proof, we will use f log(N, δ) instead of log(ZT , δ). Denote the samples of E-optimal design that mixed into the samples from ℓ-th doubling trick iteration in the first sub-phase of phase k as Zk,0,ℓand Zk,0,ℓ = Nk,0,ℓ. By Lemma J.3, our choice of Nk,0,ℓ Nk,0,ℓ 64gdσ2 η σmin A(λE, Γ) ln 32gσ2 η σmin A(λE, Γ) d + Nk,1,ℓL2 z + L2 z ! + 128gσ2 η σmin A(λE, Γ) ln is a sufficient condition to guarantee Nk,0,ℓ 4gσ2 η f log Nk,0,ℓ+ Nk,1,ℓ, δk,ℓ σmin A(λE, Γ) r(ω). (22) Multiply both sides of (22) by Nk,0,ℓ+Nk,1,ℓ Nk,0,ℓ and we have Nk,0,ℓ+ Nk,1,ℓ 4gσ2 η f log Nk,0,ℓ+ Nk,1,ℓ, δk,ℓ σmin A(λE, Γ) r(ω) Nk,0,ℓ+ Nk,1,ℓ By Property 1 of Lemma J.1, we have αk,ℓ:= Nk,0,ℓ Nk,0,ℓ+Nk,1,ℓ< 1/2, then Nk,0,ℓ+ Nk,1,ℓ 1 αk,ℓ 4gσ2 η f log Nk,0,ℓ+ Nk,1,ℓ, δk,ℓ σmin A(λE, Γ) r(ω) = 4gσ2 η f log Nk,0,ℓ+ Nk,1,ℓ, δk,ℓ σmin A(αk,ℓλE, Γ) r(ω) 4gσ2 η f log Nk,0,ℓ+ Nk,1,ℓ, δk,ℓ σmin A(αk,ℓλE + (1 αk,ℓ) λk, Γ) r(ω). (23) These condition on Nk,0,ℓand Nk,0,ℓ+ Nk,1,ℓare needed for the proof. Denote the total number of doubling trick iterations as Lk for phase k. In the case of Lk = 1, the samples from the first doubling trick iteration satisfies stopping condition of the first sub-phase already, and the algorithm will not enter the second doubling trick iteration. Thus the total number of samples for the first sub-phase is Nk,0,1 + Nk,1,1 2Nk,1,1 (Property 1 of Lemma J.1) 8gd M ln 1 + 2M d + L2 z + 2M2gd M + 16M ln In the case of Lk > 1, for ℓ {1, , Lk}, denote bΓℓas the estimate of Γ at the end of the ℓ-th doubling trick iteration. With these notations, we have At the end of Lk-th doubling trick iteration, the stopping condition is satisfied, i.e., w w 2 A(Zk,0,Lk Zk,1,Lk ,bΓLk ) 1 θ 2 2Lη f log Nk,0,Lk + Nk,1,Lk, δk,Lk ζ2 k. We short f log Nk,0,ℓ+ Nk,1,ℓ, δk,ℓ = f logk,ℓ. At the end of (Lk 1)-th doubling trick iteration, the stopping condition is not satisfied, i.e., w w 2 A(Zk,0,Lk 1 Zk,1,Lk 1,bΓLk 1) 1 θ 2 2Lη f logk,Lk 1 > ζ2 k. (25) Denote ξLk as the empirical distribution of Zk,0,Lk Zk,1,Lk. Then above two conditions imply that the number of samples for Lk-th and (Lk 1)-th doubling trick iterations respectively satisfy Nk,0,Lk + Nk,1,Lk θ 2 2Lη f logk,Lk ζ2 k max w,w Wk w w 2 A(ξLk ,bΓLk ) 1. (26) Nk,0,Lk 1 + Nk,1,Lk 1 < θ 2 2Lη f logk,Lk 1 ζ2 k max w,w Wk w w 2 A(ξLk 1,bΓLk 1) 1. (27) Note that by Property 2 of Lemma J.1, 1 2 Nk,0,Lk + Nk,1,Lk Nk,0,Lk 1 + Nk,1,Lk 1. Nk,0,Lk + Nk,1,Lk < 2 θ 2 2Lη f logk,Lk 1 ζ2 k max w,w Wk w w 2 A(ξLk 1,bΓLk 1) 1. (28) For any ℓ {1, . . . , Lk}, by Lemma J.7, the factor of maxw,w Wk w w 2 A(ξℓ,bΓℓ) 1 can be upper bounded by w w 2 A(ξℓ,bΓℓ) 1 3 max w,w Wk w w 2 A(ξℓ,bΓLk 1) 1 + 2 max w,w Wk ((bΓℓ) (bΓLk 1) ) w w 2 3 max w,w Wk w w 2 A(ξℓ,bΓLk 1) 1 + 2 max w,w Wk (Γ (bΓℓ) ) w w 2 + 2 max w,w Wk (Γ (bΓLk 1) ) w w 2 z ξℓzz ) 1. (29) We will upper bound the three terms in the RHS of (29) separately. 
For the first term, by Lemma J.6, w w 2 A(ξℓ,bΓLk 1) 1 4 max w,w Wk w w 2 A( λk,bΓLk 1) 1, (30) where λk is the optimal design for Wk in the first sub-phase of phase k 1 (based on the last doubling trick iteration), i.e., λk = arg min λ (Z) max w,w Wk w w 2 A(λ,bΓLk 1) 1. Then for any λ, w w 2 A( λk,bΓLk 1) 1 3 max w,w Wk w w 2 A(λ,Γ) 1 + 2 max w,w Wk (Γ (bΓLk 1) ) w w 2 A(λ,I) 1 (Lemma J.7) b1 3 max w,w Wk w w 2 A(λ k,Γ) 1 + 2 max w,w Wk (Γ (bΓLk 1) ) w w 2 z λ kzz ) 1 b2 6 max w,w Wk w w 2 A(λ k,Γ) 1 + 2 max w,w Wk (Γ (bΓLk 1) ) w w 2 z λ kzz ) 1. where for (b1), we plug in λ k := α kλE + (1 α k)λ k, with α k 1/2 will be defined later and λ k is the optimal design for Wk given Γ, i.e., λ k = arg min λ (Z) max w,w Wk w w 2 A(λ,Γ) 1. (b2) is due to the fact that α k 1/2. For the second term in the RHS of (31), given the condition (23), i.e., Nk 1,0,Lk 1 + Nk 1,1,Lk 1 4g Lη f logk 1,Lk 1 σmin A(ξLk 1, Γ) by Lemma J.11 we have, (Γ (bΓLk 1) ) w w 2 z λ kzz ) 1 2Lη f logk 1,Lk 1 σmin A(λ k, Γ) 1 Nk 1,0,Lk 1 + Nk 1,1,Lk 1 w w 2 A(ξLk 1,Γ) 1 1 Nk 1,0,Lk 1 2Lη f logk 1,Lk 1 σmin A(λ k, Γ) Nk 1,0,Lk 1 Nk 1,0,Lk 1+Nk 1,1,Lk 1 w w 2 A(ξLk 1,Γ) 1 1 Nk 1,0,Lk 1 2Lη f logk 1,Lk 1 σmin A(α kλE, Γ) Nk 1,0,Lk 1 Nk 1,0,Lk 1+Nk 1,1,Lk 1 w w 2 A(ξLk 1,Γ) 1 1 Nk 1,0,Lk 1 2Lη f logk 1,Lk 1 σmin A(λE, Γ) w w 2 A(ξLk 1,Γ) 1 (set α k = αk 1) w w 2 A(ξLk 1,Γ) 1 w w 2 A(ξLk 1,bΓLk 1) 1. (32) where for (b1), we use the condition (22) on Nk 1,0,Lk 1, and (b2) is due to Lemma J.5. Plug (32) into (31), we have w w 2 A( λk,bΓLk 1) 1 6 max w,w Wk w w 2 A(λ k,Γ) 1 + 3 g max w,w Wk w w 2 A(ξLk 1,bΓLk 1) 1. Plug (33) into (30), we have the first term in the RHS of (29) can be upper bounded by w w 2 A(ξℓ,bΓLk 1) 1 24 max w,w Wk w w 2 A(λ k,Γ) 1 + 12 g max w,w Wk w w 2 A(ξLk 1,bΓLk 1) 1 24 max w,w Wk w w 2 A(λ k,Γ) 1 + 12 g ζ2 k 1 θ 2 2Lη f logk 1,Lk 1 (Nk 1,0,Lk 1 + Nk 1,1,Lk 1), (34) where for the last inequality we use the fact that the stopping condition is satisfied at the end of (Lk 1)-th doubling trick iteration for phase k 1 per (26). For the second term in the RHS of (29), by Lemma J.10, when the condition (23) is satisfied, i.e., when Nk,0,ℓ+ Nk,1,ℓ 4g Lη f logk,ℓ σmin A(ξℓ, Γ) , (Γ (bΓℓ) ) w w 2 z ξℓzz ) 1 1 g max w,w Wk w w 2 A(ξℓ,Γ) 1 g max w,w Wk w w 2 A(ξℓ,bΓℓ) 1, (35) where the last inequality is due to Lemma J.5. For the third term in the RHS of (29), by Lemma J.11, when the condition (23) is satisfied, i.e., when Nk 1,0,Lk 1 + Nk 1,1,Lk 1 4g Lη f logk 1,Lk 1 σmin A(ξLk 1, Γ) (Γ (bΓLk 1) ) w w 2 2Lη f logk 1,Lk 1 σmin A(ξℓ, Γ) 1 Nk 1,0,Lk 1 + Nk 1,1,Lk 1 max w,w Wk w w 2 A(ξLk 1,Γ) 1 2Lη f logk 1,Lk 1 σmin A(αℓλE, Γ) 1 Nk 1,0,Lk 1 + Nk 1,1,Lk 1 max w,w Wk w w 2 A(ξLk 1,bΓLk 1) 1 b1 Nk,0,ℓ+ Nk,1,ℓ 1 Nk 1,0,Lk 1 + Nk 1,1,Lk 1 2f logk 1,Lk 1 σmin A(λE, Γ) (Nk 1,0,Lk 1 + Nk 1,1,Lk 1) ζ2 k 1 θ 2 2 f logk 1,Lk 1 Nk,0,ℓ+ Nk,1,ℓ 2 σmin A(λE, Γ) ζ2 k 1 θ 2 2 b2 Nk,0,ℓ+ Nk,1,ℓ 4g Lη f logk,ℓ ζ2 k 1 θ 2 2 (36) where (b1) is due to (26), (b2) is due to (22). Plug (34), (35) and (36) into (29), we have w w 2 A(ξℓ,bΓℓ) 1 72 max w,w Wk w w 2 A(λ k,Γ) 1 + 36 g ζ2 k 1 θ 2 2Lη f logk 1,Lk 1 (Nk 1,0,Lk 1 + Nk 1,1,Lk 1) g max w,w Wk w w 2 A(ξℓ,bΓℓ) 1 + 2 g Nk,0,ℓ+1 + Nk,1,ℓ+1 Lη f logk,ℓ ζ2 k 1 θ 2 2 . (37) With ℓ= Lk 1 and the fact that g > 24 whose exact value will be set later, we can rearrange (37) as w w 2 A(ξLk 1,bΓLk 1) 1 144 max w,w Wk w w 2 A(λ k,Γ) 1 + 72 g ζ2 k 1 θ 2 2Lη f logk 1,Lk 1 (Nk 1,0,Lk 1 + Nk 1,1,Lk 1) g Nk,0,Lk + Nk,1,Lk Lη f logk,Lk 1 ζ2 k 1 θ 2 2 . 
(38) Note that the LHS of above can be lower bounded by (28), ζ2 k 2 θ 2 2Lη f logk,Lk 1 (Nk,0,Lk + Nk,1,Lk) < max w,w Wk w w 2 A(ξLk 1,bΓLk 1) 1. (39) Rearrange the terms in (39) and (38) and setting g to be larger enough and using the fact that ζk = ζk 1/2, we have Nk,0,Lk + Nk,1,Lk 576 ζ2 k θ 2 2Lη f logk,Lk 1 max w,w Wk w w 2 A(λ k,Γ) 1 + 72 f logk,Lk 1 f logk 1,Lk 1 (Nk 1,0,Lk 1 + Nk 1,1,Lk 1). Note that by definition f logk,Lk 1 < f logk,Lk, thus Nk,0,Lk + Nk,1,Lk 576 ζ2 k θ 2 2Lη f logk,Lk max w,w Wk w w 2 A(λ k,Γ) 1 + 72 f logk,Lk f logk 1,Lk 1 (Nk 1,0,Lk 1 + Nk 1,1,Lk 1). Denote Dk := Nk,0,Lk + Nk,1,Lk and divide both sides of above by f logk,Lk, we have Dk f logk,Lk 576 ζ2 k θ 2 2Lη max w,w Wk w w 2 A(λ k,Γ) 1 + 72 g Dk 1 f logk 1,Lk 1 . In the case of Lk = 1, by Property 3 of Lemma J.1, we have Dk f logk,Lk = Nk,1,1 8d ln 1 + Nk,1,1L2z d + 16 ln 2 6dk2 δ M ln(d M). Thereby we have Dk f logk,Lk max M ln(d M), 576 ζ2 k θ 2 2Lη max w,w Wk w w 2 A(λ k,Γ) 1 + 72 g Dk 1 f logk 1,Lk 1 Taking a summation over k on both sides of (40), we have Dk f logk,Lk = D1 f log1,L1 + Dk f logk,Lk (41) D1 f log1,L1 + M ln(d M), 576 ζ2 k θ 2 2Lη max w,w Wk w w 2 A(λ k,Γ) 1 + 72 g Dk 1 f logk 1,Lk 1 D1 f log1,L1 + k=2 M ln(d M) + ζ2 k θ 2 2Lη max w,w Wk w w 2 A(λ k,Γ) 1 + 72 g Dk 1 f logk 1,Lk 1 D1 f log1,L1 + k=2 M ln(d M) + ζ2 k θ 2 2Lη max w,w Wk w w 2 A(λ k,Γ) 1 + g Dk 1 f logk 1,Lk 1 . Thus by setting g = 72 2 and rearranging the terms, we have Dk f logk,Lk 2D1 f log1,L1 + 2 k=2 M ln(d M) + 2 ζ2 k θ 2 2Lη max w,w Wk w w 2 A(λ k,Γ) 1. For D1, which corresponds to the first sub-phase where we use E-optimal design with doubling trick, we have the stopping condition, w w 2 A(Z1,0,L1,bΓL1) 1 θ 2 2Lη f log1,L1 ζ2 1. This implies that when the stopping condition is met N1,0,L1 θ 2 2Lη f log1,L1 ζ2 1 max w,w W1 w w 2 A(ξL1,bΓL1) 1. (46) N1,0,L1 1 < 2 θ 2 2Lη f log1,L1 1 ζ2 1 max w,w W1 w w 2 A(ξL1 1,bΓL1 1) 1. (47) Since we use E-optimal design for the first phase, the factor of w w 2 A(ξL1 1,bΓL1 1) 1 can be upper bounded by w w 2 A(ξL1 1,bΓL1 1) 1 3 w w 2 A(ξL1 1,Γ) 1 + 2 (Γ (bΓL1 1) ) w w 2 z ξL1 1zz ) 1 (Lemma J.7) 3 w w 2 A(λE,Γ) 1 + 2 w w 2 A(λE,Γ) 1 (Lemma J.10) 6 w w 2 A(λE,Γ) 1. (48) Plug (48) into (47), and use the fact that 2N1,0,L1 1 = N1,0,L1, ζ1 = 1, we have N1,0,L1 24 θ 2 2Lη f log1,L1 1 max w,w W1 w w 2 A(λE,Γ) 1 24 θ 2 2Lη f log1,L1 max w,w W1 w w 2 A(λE,Γ) 1. Thus we have, D1 f log1,L1 2N1,0,L1 f log1,L1 48 θ 2 2Lη max w,w W1 w w 2 A(λE,Γ) 1 =: ρ1. By the same calculation as (17) and (19), we have 1 ζ2 k max w,w Wk w w 2 A(λ k,Γ) 1 c θ 2 2LηK ρ =: ρ2, where c is an absolute constant. Next we lower bound the left hand side of (45). To do this, we first upper bound f logk,Lk as f logk,Lk =8d ln 1 + Dk L2 z d 2 6dk2L2 k δ 1 + Dk L2 z d + 32 ln(Lk) + 16 ln 1 + Dk L2 z d 1 + Dk L2 z d where the inequality above uses the fact that Lk is the index of the last doubling trick iteration for phase k, by the design of the doubling trick, we have Lk log2 Dk Dk f logk,Lk = 8d ln 1 + Dk L2z d + 16 ln 2 6dk2L2 k δ 32d ln 1 + Dk L2z d + 16 ln 2 6d K 2 δ (due to (50)) PK k=1 Dk L2 z d + 16 ln 2 6d K 2 PK k=1 Dk L2z + 16 ln 2 6d K 2 Denote ρ3 := K M ln(d M), looking back at (44), we have k=1 Dk L2 z d (ρ1 + ρ2 + ρ3). 
By Lemma K.5, we have k=1 Dk 32(ρ1 + ρ2 + ρ3) ln + 64d(ρ1 + ρ2 + ρ3) ln 3 + (ρ1 + ρ2 + ρ3)2L2 z d Sample complexity for second sub-phase The design for the second sub-phase of phase k is based on bΓLk, the estimate of Γ at the end of the Lk-th doubling trick iteration in the first sub-phase of phase k, ˆλk = arg min λ (Z) max w,w Wk w w 2 A(λ,bΓLk ) 1. Then for any λ, by Lemma J.7 we have w w 2 A(ˆλk,bΓLk ) 1 3 max w,w Wk w w 2 A(λ,Γ) 1 + 2 max w,w Wk (Γ (bΓLk) ) w w 2 We plug in λ k := α kλE + (1 α k)λ k, with α k 1/2 will be defined later and λ k is the optimal design for Wk, w w 2 A(ˆλk,bΓLk ) 1 3 max w,w Wk w w 2 A(λ k ,Γ) 1 + 2 max w,w Wk (Γ (bΓLk) ) w w 2 z λ k zz ) 1 6 max w,w Wk w w 2 A(λ k,Γ) 1 + 2 max w,w Wk (Γ (bΓLk) ) w w 2 z λ k zz ) 1, (52) where the last inequality is due to the fact that α k 1/2. For the second term in the RHS of above, by Lemma J.11 we have, (Γ (bΓLk) ) w w 2 z λ k zz ) 1 4Lη f logk,Lk σmin A(λ k, Γ) w w 2 A(Zk,0 ZLk ,Γ) 1 4Lη f logk σmin A(λ k, Γ) Nk,0,Lk Nk,0,Lk + Nk,1Lk w w 2 A(ξLk ,Γ) 1 4Lη f logk,Lk σmin A(α kλE, Γ) Nk,0,Lk Nk,0,Lk + Nk,1,Lk w w 2 A(ξLk ,Γ) 1 4Lη f logk,Lk σmin A(λE, Γ) w w 2 A(ξLk ,Γ) 1 (set α k = αk) w w 2 A(ξLk ,Γ) 1 w w 2 A(ξLk ,bΓLk ) 1, (53) where for (b1), we have Nk,0,Lk 4g Lη f logk,Lk σmin(A(λE,Γ)), and (b2) is due to Lemma J.5. Plug (53) into (52), w w 2 A(ˆλk,bΓLk ) 1 6 max w,w Wk w w 2 A(λ k,Γ) 1 + 2 g max w,w Wk w w 2 A(ξLk ,bΓLk ) 1 6 max w,w Wk w w 2 A(λ k,Γ) 1 + 2 g ζ2 k θ 2 2Lη f logk,Lk (Nk,0,Lk + Nk,1,Lk), where the last inequality is due to (26). According to the algorithm design, the number of samples in the second sub-phase of phase k is defined as 2(1 + ω)ζ 2 k ρ(Wk)Lν log with ρ(Wk) = minλ (Z) maxw,w Wk w w 2 A(λ,bΓLk ) 1. Then we have, by setting g 4, Nk,2 (1 + ω)ζ 2 k Lν max w,w Wk w w 2 A(λ k,Γ) 1 log + (1 + ω)(Nk,0,Lk + Nk,1,Lk), where hides logarithmic factors of |W| for the second term and constants for simplicity. Plug (54) into the above inequality. Also note that by Lemma 2.1, we can always set Lν = 2( θ 2 2Lη + 1). k=1 Nk,2 (1 + ω) k=1 ζ 2 k Lν max w,w Wk w w 2 A(λ k,Γ) 1 log k=1 (Nk,0,Lk + Nk,1,Lk) (1 + ω)K Lνρ log k=1 (Nk,0,Lk + Nk,1,Lk). This essentially means that the sample complexity for the second sub-phase P Nk,2 can be upper bounded by summation of the sample complexity we pay for Algorithm 1 and the sample complexity of the first sub-phase PK k=1(Nk,0,Lk + Nk,1,Lk). Combine (51) and (55), we conclude the result. Lemma J.3. Denote f log Nk,0, δk,0 = 8d ln 1 + Nk,0L2 z d A sufficient condition for Nk,0,ℓ gσ2 η f log Nk,0,ℓ+ Nk,1,ℓ, δk,ℓ σmin A(λE, Γ) r(ω). (55) Nk,0,ℓ 16gdσ2 η σmin A(λE, Γ) ln 8g σmin A(λE, Γ) d + Nk,1,ℓL2 z + L2 z ! + 32gσ2 η σmin A(λE, Γ) ln Proof. Given f log Nk,0,ℓ+ Nk,1,ℓ, δk,ℓ = 8d ln 1 + (Nk,0,ℓ+ Nk,1,ℓ)L2 z d By Lemma J.4, for the formula X A ln (D + BX) + C A = 8gdσ2 η σmin(A(λE,Γ)), B = L2 z d , C = 16gσ2 η σmin(A(λE,Γ)) ln 2 6d D = 1 + Nk,1,ℓL2 z d . Thus a sufficient condition for the inequality to hold is Nk,0,ℓ 16gdσ2 η σmin A(λE, Γ) ln 8gdσ2 η σmin A(λE, Γ) 1 + Nk,1,ℓL2 z d + L2 z d + 32gσ2 η σmin A(λE, Γ) ln = 16gdσ2 η σmin A(λE, Γ) ln 8gσ2 η σmin A(λE, Γ) d + Nk,1,ℓL2 z + L2 z ! + 32gσ2 η σmin A(λE, Γ) ln Lemma J.4. Let X 1, A, B 0, then a sufficient condition for X A ln (D + BX) + C is X 2A ln(AD + AB) + 2C. Proof. The proof is motivated by Gales et al. [16]. Let f (0, 1), then X A ln (D + BX) + C + C (since ln(x) x 1) X(1 f) A ln X 1 1 f A ln Set f = 1/2 and by the fact X 1, we have X 2A ln(AD + AB) + 2C. Lemma J.5. 
Suppose that we have a data set {ZT , XT }. Denote the empirical distribution of ZT as ξ. The number of samples satisfies T 8σ2 η σmin A(ξ, Γ) log(ZT , δ) r(ω). bΓ is the OLS estimate of Γ based on {ZT , XT }. Then w 2 A(ξ,Γ) 1 6 w 2 A(ξ,bΓ) 1. Proof. By Lemma J.7, we have w 2 A(ξ,Γ) 1 3 w 2 A(ξ,bΓ) 1 + 2 (Γ bΓ )w 2 3 w 2 A(ξ,bΓ) 1 + 1 2 w 2 A(ξ,Γ) 1, where the last inequality is due to Lemma J.10 with g = 2. Rearranging the terms, we have w 2 A(ξ,Γ) 1 6 w 2 A(ξ,bΓ) 1. Lemma J.6. Suppose that we use ROUND to sample N0 arms according to λE denoted as Z0, and N1 arms according to λ1 denoted as Z1, with N1 N0. Denote the empirical distribution of all the collected samples as ξ, then w 2 A(ξ,Γ) 4 w 2 A(λ1,Γ) 1. Proof. Denote the empirical distribution of Z0 as ξ0, and the empirical distribution of Z1 as ξ1, then we have w 2 A(Z0 Z1,Γ) 1 =w z Z0 Z1 Γ zz Γ z Z0 Γ zz Γ + X z Z1 Γ zz Γ z Z ξz,0Γ zz Γ + N1 X z Z ξz,1Γ zz Γ = 1 N0 + N1 w z Z ξz,0Γ zz Γ + N1 N0 + N1 z Z ξz,1Γ zz Γ 1 N0 + N1 w z Z ξz,1Γ zz Γ 2 N0 + N1 w z Z ξz,1Γ zz Γ 4 N0 + N1 w z Z λ1Γ zz Γ = 4 N0 + N1 w 2 A(λ1,Γ) 1. The result follows by noting that w 2 A(Z0 Z1,Γ) 1 = 1 N0 + N1 w 2 A(ξ,Γ). Lemma J.7. Suppose that we have an estimate bΓ that is invertible, for any x Rd and covariance matrix V , we have x 2 (Γ V Γ) 1 3 x 2 (bΓ V bΓ) 1 + 2 (Γ bΓ )x 2 Proof. Suppose we have an estimate bΓ. Then, x 2 (Γ V Γ) 1 = x 2 (bΓ V bΓ) 1 + x 2 (Γ V Γ) 1 x 2 (bΓ V bΓ) 1. (Γ V Γ) 1 (bΓ V bΓ) 1 = Γ 1V 1Γ bΓ 1V 1bΓ = Γ 1V 1Γ bΓ 1V 1bΓ + Γ 1V 1bΓ Γ 1V 1bΓ = Γ 1V 1(Γ bΓ ) + (Γ 1 bΓ 1)V 1bΓ . x 2 (Γ V Γ) 1 = x 2 (bΓ V bΓ) 1 + x Γ 1V 1(Γ bΓ )x + x (Γ 1 bΓ 1)V 1bΓ x x 2 (bΓ V bΓ) 1 + x (Γ V Γ) 1 (Γ bΓ )x V 1 + (Γ bΓ )x V 1 x (bΓ V bΓ) 1 x 2 (bΓ V bΓ) 1 + 1 2 x 2 (Γ V Γ) 1 + 1 2 x 2 (bΓ V bΓ) 1. This implies that x 2 (Γ V Γ) 1 3 x 2 (bΓ V bΓ) 1 + 2 (Γ bΓ )x 2 Lemma J.8. Suppose λf = arg min f(λ) and λg = arg min g(λ) and f(λ) g(λ) + h(λ), then f(λf) g(λg) + h(λg). f(λf) min λ g(λ) + h(λ) g(λg) + h(λg). Lemma J.9. Define λ z and λz λ z := arg min λ (Z) max w,w W w w 2 A(λz,Γ) 1, λz := arg min λ (Z) max w,w W w w 2 A(λz,bΓ) 1 as the optimal design regarding Γ and that regarding its estimate bΓ respectively. Then, we have max w,w W w w 2 A( λz,bΓ) 1 max w,w W 3 w w 2 A(λ z,Γ) 1 + max w,w W 2 (Γ bΓ ) w w 2 z λ zzz ) 1. Proof. By Lemma J.7, for any w, w W and λz Z, w w 2 A(λz,bΓ) 1 3 w w 2 A(λz,Γ) 1 + 2 (Γ bΓ ) w w 2 z λzzz ) 1. max w,w W w w 2 A(λz,bΓ) 1 max w,w W 3 w w 2 A(λz,Γ) 1 + max w,w W 2 (Γ bΓ k ) w w 2 z λzzz ) 1. By Lemma J.8, max w,w W w w 2 A( λz,bΓ) 1 max w,w W 3 w w 2 A(λ z,Γ) 1 + max w,w W 2 (Γ bΓ ) w w 2 z λ zzz ) 1. Lemma J.10. Suppose that we have bΓ that is an OLS estimate from an offline dataset {ZT1, XT1} collected non-adaptively through a fixed design λ and the efficient rounding procedure ROUND. Let V = Z T1ZT1. Then, for any x Rd and g 1, we have, with probability 1 δ, g x 2 A(ZT1 ,Γ) 1, T1 4gσ2 η σmin A(λ, Γ) log(ZT1, δ) 2p, (57) where p is the cardinality of support of λ. Proof. 
We first show that (Γ bΓ )x 2 = Γ S ZT1V 1(Γ + V 1Z T1S) x 2 V 1 (Lemma J.12) 2 Γ S ZT1V 1(Γ + V 1Z T1S) x 2 2 ( Ax 2 V = V 1 2 Ax 2 2 Γ S ZT1V 1 2 (Γ + V 1Z T1S) x 2 2 ( Ax A op x ) 2 (Γ + V 1Z T1S) x 2 2 ( AB op A op B op) = Γ 1V 1Γ op V 1/2Z T1S 2 V 1/2 Γ + V 1Z T1S x 2 ( A A op = A 2 op) = Γ 1V 1Γ op V 1/2Z T1S 2 V 1/2Γ x V 1/2Γ V 1Z T1S Γ + V 1Z T1S x 2 (Lemma J.13: (A + B) 1 = A 1 (A + B) 1BA 1) a2 Γ 1V 1Γ op V 1/2Z T1S 2 2 + V 1/2Γ V 1Z T1S Γ + V 1Z T1S x We can upper bound the the term V 1/2Γ V 1Z T1S Γ + V 1Z T1S x noticing that it appears in both of a1, a2 above. Thus we have the inequality V Γ 1V 1Γ op V 1/2Z T1S 2 By rearranging the terms, we have 1 Γ 1V 1Γ op V 1/2Z T1S 2 V 1/2Z T1S 2 By Lemma G.4, with probability 1 δ, we have V 1/2Z T1S 2 op σ2 ηlog(ZT1, δ). Thus, with probability 1 δ, we have V 1 1 Γ 1V 1Γ opσ2ηlog(ZT1, δ) Γ 1V 1Γ opσ2 ηlog(ZT1, δ) V 1/2Γ x 2 To further upper bound V, we first find a sufficient condition on T1 such that Γ 1V 1Γ opσ2 ηlog(ZT1, δ) 1 2g, g 1. By Lemma J.14, when T1 2p, Γ 1V 1Γ opσ2 ηlog(ZT1, δ) 2σ2 η z Z λzΓ zz Γ log(ZT1, δ). To upper bound the right hand side by 1 2g, we need z Z λzΓ zz Γ log(ZT1, δ). (58) With this condition (58), we have with probability 1 δ, V 1 1 Γ 1V 1Γ opσ2ηlog(ZT1, δ) Γ 1V 1Γ opσ2 ηlog(ZT1, δ) V 1/2Γ x 2 Lemma J.11. Suppose that we have bΓ that is an OLS estimate from an offline dataset {ZT1, XT1} collected non-adaptively through a fixed design λ and the efficient rounding procedure ROUND. Let V be any positive definite matrix. Then, for any x Rd, we have, with probability 1 δ, (Γ bΓ )x 2 V 1 2 Γ 1 V 1Γ opσ2 ηlog(ZT1, δ) x 2 A(ZT1 ,Γ) 1, T1 4σ2 η σmin A(λ, Γ) log(ZT1, δ) 2p, where p is the cardinality of support of λ. = Γ S ZT1V 1(Γ + V 1Z T1S) x 2 V 1 (Lemma J.12) 2 Γ S ZT1V 1(Γ + V 1Z T1S) x 2 2 ( Ax 2 V = V 1 2 Ax 2 2 Γ S ZT1V 1 2 (Γ + V 1Z T1S) x 2 2 ( Ax A op x ) 2 (Γ + V 1Z T1S) x 2 2 ( AB op A op B op) = Γ 1 V 1Γ op V 1/2Z T1S 2 V 1/2 Γ + V 1Z T1S x 2 ( A A op = A 2 op) = Γ 1 V 1Γ op V 1/2Z T1S 2 V 1/2Γ x V 1/2Γ V 1Z T1S Γ + V 1Z T1S x 2 (Lemma J.13: (A + B) 1 = A 1 (A + B) 1BA 1) a2 Γ 1 V 1Γ op V 1/2Z T1S 2 2 + V 1/2Γ V 1Z T1S Γ + V 1Z T1S x Given the condition (58) holds, we have with probability 1 δ, (Γ bΓ )x 2 a1 Γ 1 V 1Γ op V 1/2Z T1S 2 2 + V 1/2Γ x 2 =2 Γ 1 V 1Γ op V 1/2Z T1S 2 2 Γ 1 V 1Γ opσ2 ηlog(ZT1, δ) x 2 (Γ V Γ) 1, where (a1) is due to (59) and setting g = 1. Lemma J.12. For a least square estimate bΓ that is estimated through a design matrix Z and bΓ is invertible, we have bΓ 1 Γ 1 = Γ + V 1Z S 1 V 1Z SΓ 1. Proof. Since bΓ is a least square estimator, we have bΓ = Γ + V 1Z S. By Lemma J.13, we have bΓ 1 Γ 1 = Γ + V 1Z S 1 Γ 1 = Γ + V 1Z S 1 V 1Z SΓ 1. Lemma J.13. For two invertible matrixes A, B Rd d, we have (A + B) 1 = A 1 (A + B) 1BA 1. Proof. We have (A + B) 1 = A 1 + (A + B) 1 A 1 = A 1 + (A + B) 1A I A 1 = A 1 + (A + B) 1 A (A + B) A 1 = A 1 (A + B) 1BA 1. Lemma J.14. Suppose that we have a design matrix ZT that is sampled from a distribution λ Z, with the efficient rounding procedure ROUND. Let p represent the cardinality of support of λ. We have, if T 2p, Γ Z T ZT Γ 1 op 2 z Z λzΓ zz Γ . where σmin( ) is the smallest singular value of a matrix. Proof. Suppose that each arm z Z is sampled tz times, the empirical distribution of ZT is ξ := tz z Z. Thus, we have Γ Z T ZT Γ 1 op = 1 σmin Γ Z T ZT Γ z Z ξzΓ zz Γ z Z ξzΓ zz Γ . By Fiez et al. [15, Proposition 2], we have X z Z ξzΓ zz Γ α X z Z λzΓ zz Γ, where α [ T T +2p, 1] when T 2p. 
Given the fact that both of P z Z ξzΓ zz Γ z Z λzΓ zz Γ are positive definite, we have σmin P z Z ξzΓ zz Γ ασmin P z Z λzΓ zz Γ . Thus, we have Γ Z T ZT Γ 1 op 1 z Z λzΓ zz Γ . When T 2p, we have α 1/2, which implies the result. K Estimating λmin(Γ) In this section, we introduce a simple adaptive procedure that finds a high probability lower bound on γ min := λmin(Γ) that is sufficiently accurate (i.e., within a constant factor of γ min). For simplicity, we assume zt 1, t in this section. Our algorithm leverages confidence bounds to adaptively determine how many samples we like to take. Let ˆΓt := (Z Z) 1Z X be the least square estimate of Γ after sampling t times to obtain {(zs, xs)}t s=1 where Z Rt d and X Rt d are the design matrices. Let Vt := Pt s=1 zsz s . We define the lower and upper confidence bound for γ min as follows: LCB(t) := λmin(ˆΓt) t , UCB(t) := λmin(ˆΓt) + ψt = σ 1 min 1 8d ln 1 + 2t d(2 σmin(Vt)) 4 2 σmin(Vt) and LCB(0) := and UCB(0) := . The following lemma shows that LCB(t) and UCB(t) form a valid anytime confidence bound for γ min. Lemma K.1. (Correctness of the confidence bounds) 1 δ P( t 1, LCB(t) γ min UCB(t)) Equipped with the confidence bounds, we are now ready to describe our algorithm for learning γ min (see Algorithm 10). Since the tightness of the confidence bounds depends on the smallest eigenvalue of Vt, it is natural to use the E-optimal design as defined in Section 3.2. Recall that the solution of the E-optimal design is λ E and κ0 is the smallest singular value achieved by λ E. We take in a rounding procedure for the E-optimal design ROUNDE(λ, t) that takes in t samples and design λ and outputs integer sample count assignments {Nz}z Z so that if we sample according to these counts then we have t Vt) (1 + ω)κ 1 0 (60) After determining the base sample counts {mz}z Z by ROUND(λ E, r(ω) , ω), we start doubling the sample size until we satisfy the condition LCB(t) > 1 2UCB(t). Note that the sampling scheme in the while loop is designed such that the total number of samples collected up to (and including) j-th iteration is 2j 1 r(ω) . Once the loop stops, we return LCB(t) as the claimed lower approximation of the γ min. Let Nw be the total number of samples we used in Algorithm 10. Then, the next theorem shows that the estimate returned by our algorithm is both a valid lower bound to γ min and sufficiently accurate. Algorithm 10 Learning λmin(Γ) Input: Arm set Z, rounding procedure ROUNDE for the E-optimal design, rounding accuracy ω Initialize: j = 1, t = 0. Compute the E-optimal design λ E for Z. Compute {mz}z Z by ROUNDE(λ E, r(ω) ). while LCB(t) 1 2UCB(t) do t 2j 1 r(ω) . For each arm z Z, sample 2j 1mz 1{j = 1} 2j 2mz times. Estimate ˆΓt using all samples collected so far (total t data points) j j + 1. end while Output: LCB(t) Theorem K.2. (Correctness of Algorithm 10) The total number of samples denoted by Nw used in Algorithm 10 satisfies that, with probability at least 1 δ, 2 < LCB(Nw) γ min We next analyze the sample complexity of the algorithm, which essentially shows the scaling of 1 (γ min)2κ0 even if the algorithm does not need knowledge of γ min. Theorem K.3. (Sample complexity of Algorithm 10) Then, with ω = 1, we have, with probability at least 1 δ, Nw = O r(1) + (γ min) 2κ 1 0 d polylog((γ min) 2, κ 1 0 , d) + ln 2/δ We remark that Allen-Zhu et al. [2] provides a rounding procedure with r(ε) = O(d/ε2). Proof of Lemma K.1. Note that ˆΓ = Γ + V 1 t Z S where Z, S Rt d is the design matrices with s-th row being z s and η s respectively. 
Using Lemma G.4, we have, with probability at least 1 δ, t 1, ˆΓt Γ 2 op = V 1 t Z S 2 op op V 1/2 t Z S 2 = 1 tσmin( 1 t Vt) V 1/2 t Z S 2 8d ln 1 + 2t d(2 σmin(Vt)) 4 2 σmin(Vt) ( Lemma G.4) The well-known Weyl s theorem implies that maxk |λk(ˆΓt) λk(Γ)| ˆΓt Γ op where λk(A) is the k-th largest singular value of the matrix A. Choosing k = d and combining it with the display above conclude the proof. Proof of Theorem K.2. We assume t 1, LCB(t) γ min UCB(t), which happens with probability at least 1 δ. Then, it is trivial to see that LCB(Nw) γ min. For the other inequality, we use the fact that the stopping condition was satisfied with Nw: LCB(Nw) > 1 2UCB(Nw) γ min This concludes the proof. Proof of Theorem K.3. We assume t 1, LCB(t) λmin(Γ) UCB(t), which happens with probability at least 1 δ. In this proof, we let ˆγmin := λmin(ˆΓNw) and γ min := λmin(Γ) for brevity. If Nw = r(1) , then there is nothing to prove. Otherwise, the loop in the algorithm was iterated more than once. Then, since the stopping condition was satisfied with Nw, we have that in the previous iteration where t = Nw/2 the stopping condition was not satisfied. Thus, 2LCB(Nw/2) UCB(Nw/2) . Using the following two inequalities: 2LCB(Nw/2) 2 UCB(Nw/2) ˆγmin + Nw/2 γ min + 2 Nw/2 = Nw 72 (γ min)2 ψNw/2 . On the other hand, with the rounding procedure, we have σ 1 min N X t=1 ztz t = 1 N σ 1 min 1 t=1 ztz t = 1 N (1 + ω)κ 1 0 = 2 since ω = 1. Using this and the fact that Nw d, it is easy to see that there exists an absolute constant c1 such that ψNw/2 c1κ 1 0 d ln 1 + Nw + κ 1 0 + ln 2 Then, there exists an absolute constant c2 such that Nw (γ min) 2 c2κ 1 0 d ln 1 + Nw + κ 1 0 + ln 2 We have Nw on both sides. We invoke Lemma K.5 with r = 1 + Nw to obtain Nw 1 + 2c2(γ min) 2κ 1 0 d ln 1 + 2κ 1 0 (1 + c2d(γ min) 2) + ln 2 Lemma K.4. Let A, Γ Rd d where A is symmetric positive semi-definite. Then, σmin Γ AΓ σmin(Γ)2σmin(A) (σmin(Γ AΓ)) 1 = Γ 1A 1Γ T op Γ 1 op A 1 op Γ T op (submultiplicity of the operator norm) = σmin(Γ) 2σmin(A) 1 Taking the inverse on both sides concludes the proof. K.1 Proof of Lemma 3.2 Define Λ = Pd i=1 λieie i , i.e. the diagonal matrix with λ on the diagonal. min λ max i,j (e i ej) (ΓΛΓ ) 1(ei ej) = min λ max i,j (Γ 1 k,i Γ 1 k,j)2 So for an upper bound, we consider λ = 1/d1, min λ max i,j (e i ej) (ΓΛΓ ) 1(ei ej) d max i,j Γ 1(ei ej) 2 2 The result about ρ follows immediately. When Γ = (1 ε)/d11 + εI, a computation using Sherman-Morrison shows that Γ 1 = 1/ε[I (1 ε)/d11 ]. Thus min λ max i,j (e i ej) (ΓΛΓ ) 1(ei ej) = ε 2 min λ max i,j (ei ej) Λ 1(ei ej) = ε 2 min λ max i,j ei Λ 1ei + ejΛ 1ej = ε 2 min λ max i,j ei Λ 1ei + ejΛ 1ej ε 2 min λ max i ei Λ 1ei = ε 2d where the last line follows from the Kiefer-Wolfowitz Theorem [26]. K.2 lemma for solving x less than ln(x) Lemma K.5. Let a, b, c, d > 0. Then, for every r, r a + b ln(c + dr) = r 2a + 2b ln(1 + 2c + 2bd) Proof. If dr c, then r a + b ln(2c) a + b ln(1 + 2c) If dr > c, then, r a + b ln(2dr) = a + b ln 2d r 2b 1 + ln(4bd) (ln(x) x 1) = r 2a + 2b ln 4 2a + 2b ln(2bd) Either case, we have r 2a + 2b ln (1 + 2c) 2bd 2a + 2b ln(1 + 2c + 2bd) Lemma K.6. Let α, β > 0. Then, for any r, r α ln(1 + r) + β = r 2α ln(e + α) + 2β r 2α 1 + ln 2α(1 + β ( x, ln(x) x 1) = r 2α ln 2 If r 2α, then there is nothing to prove. If r > 2α, then r 2α ln(1 + α) + 2β. Either case, we have r 2α ln(e + α) + 2β. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? 
Answer: [Yes]
Justification: The abstract and introduction summarize our main theorems.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: For example, we explain the limitations of our lower bound.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: The assumptions are given at the beginning of the paper and referred to later when necessary.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Details on the experiment setup are given. It is mainly a theoretical paper.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: It is mainly a theoretical paper. We attach the code used for the simulation.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: It is mainly a theoretical paper with some simulations in the appendix. No deep learning or other complicated training is involved, but we attach our code for reference.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: Error bars are reported, e.g., in Figure 2b.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [No]
Justification: It is mainly a theoretical paper. Some simulations are given in the appendix; no heavy computational resources are needed.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: This paper conforms to the Code of Ethics.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: We discuss it in the appendix. It is mainly a theoretical and algorithmic paper, without any sensitive data, and is very unlikely to have negative impacts.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks; no such data or models are involved.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: We do not use any existing assets.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: We do not produce any new assets.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: We do not use anything like crowdsourcing.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: We do not use anything like crowdsourcing.