# Sequential Counterfactual Risk Minimization

Houssam Zenati (1,2), Eustache Diemert (1), Matthieu Martin (1), Julien Mairal (2), Pierre Gaillard (2)

(1) Criteo AI Lab. (2) Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. Correspondence to: Houssam Zenati.

## Abstract

Counterfactual Risk Minimization (CRM) is a framework for dealing with the logged bandit feedback problem, where the goal is to improve a logging policy using offline data. In this paper, we explore the case where it is possible to deploy learned policies multiple times and acquire new data. We extend the CRM principle and its theory to this scenario, which we call Sequential Counterfactual Risk Minimization (SCRM). We introduce a novel counterfactual estimator and identify conditions that can improve the performance of CRM in terms of excess risk and regret rates, by using an analysis similar to restart strategies in accelerated optimization methods. We also provide an empirical evaluation of our method in both discrete and continuous action settings, and demonstrate the benefits of multiple deployments of CRM.

## 1. Introduction

Counterfactual reasoning in the logged bandit problem has become a common task for practitioners in a wide range of applications such as recommender systems (Swaminathan & Joachims, 2015a), ad placements (Bottou et al., 2013) or precision medicine (Kallus & Zhou, 2018). Such a task typically consists in learning an optimal decision policy from logged contextual features and partial feedback induced by predictions from a logging policy. To do so, the logged data is originally obtained from a randomized data collection experiment. However, the success of counterfactual risk minimization is highly dependent on the quality of the logging policy and its ability to sample meaningful actions.

Counterfactual reasoning can be challenging due to the large variance of counterfactual estimators (Swaminathan & Joachims, 2015b). Additionally, as pointed out by Bottou et al. (2013), confidence intervals obtained from counterfactual estimates may not be sufficiently accurate to select a final policy from offline data (Dai et al., 2020). This can occur when the logging policy does not sufficiently explore the action space. To address this, one option is to simply collect additional data from the same logging system to increase the sample size. However, it may be more efficient to use the already collected data to design a better data collection experiment through a sequential design approach (Bottou et al., 2013, see Section 6.4). It is thus appealing to consider successive policy deployments when possible.

We tackle this sequential design problem and are interested in multiple deployments of the CRM setup of Swaminathan & Joachims (2015a), which we call sequential counterfactual risk minimization (SCRM). SCRM performs a sequence of data collection experiments, determining at each round a policy using data samples collected during previous experiments. The obtained policy is then deployed for the next round to collect additional samples. Such a sequential decision making system thus entails designing an adaptive learning strategy that minimizes the excess risk and expected regret of the learner.
In contrast to the conservative learning strategy in CRM, the exploration induced by sequential deployments of enhanced logging policies should allow for improved excess risk and regret guarantees. Yet, obtaining such guarantees is nontrivial, and we address it in this work. To accomplish this, we first propose a new counterfactual estimator that controls the variance and analyze its convergence guarantees. Specifically, we obtain an improved dependence on the variance of importance weights between the optimal and logging policy. Second, leveraging this estimator and a weak assumption on the concentration of this variance term, we show how the error bound sequentially concentrates through CRM rollouts. This allows us to improve the excess risk convergence rate as well as the regret rate. Our analysis employs methods similar to restart strategies in acceleration methods (Nesterov, 2012) and optimization for strongly convex functions (Boyd & Vandenberghe, 2004). We also conduct numerical experiments to demonstrate the effectiveness of our method in both discrete and continuous action settings, and how it improves upon CRM and other existing methods in the literature.

## 2. Related Work

Counterfactual learning from logged feedback (Bottou et al., 2013) uses only past interactions to learn a policy without interacting with the environment. Counterfactual risk minimization methods (Swaminathan & Joachims, 2015a;b) propose learning formulations using a variance penalization as in (Maurer & Pontil, 2009) to find policies with minimal variance. Even so, counterfactual methods remain prone to large variance issues (Dudík et al., 2014). These problems may arise when the logging policy under-explores the action space, making it difficult to use the importance sampling techniques (Owen, 2013) that are key to counterfactual reasoning. While one could collect additional data to counter this problem, our method focuses on sequential deployments (Bottou et al., 2013, see Section 6.4) to collect data from adaptive policies that explore the action space. Note also that our original motivation is related to but different from the support deficiency problem (Sachdeva et al., 2020), where the support of the logging policy does not cover the support of the optimal policy.

Another line of work related to our framework is batch bandit methods. Originally introduced by Perchet et al. (2015) and then extended by Gao et al. (2019) in the multi-armed setting, batch bandit agents take decisions and only observe feedback in batches. This therefore differs from the classic bandit setting (Auer et al., 2002; Audibert et al., 2007) where rewards are observed after each action taken by an agent. Extensions to the contextual case have been proposed by Han et al. and could easily be kernelized (Valko et al., 2013). The sequential counterfactual risk minimization problem is thus closely related to this setting. However, major differences can be noted. First, SCRM does not leverage any problem structure as in stochastic contextual bandits (Li et al., 2010) by assuming a linear reward function (Chu et al., 2011; Goldenshluger & Zeevi, 2013; Han et al.), nor does it use regression oracles as in (Foster & Rakhlin, 2020; Simchi-Levi & Xu, 2020). Second, deterministic decision rules taken by bandit agents (Lattimore & Szepesvari, 2019) do not allow for counterfactual reasoning or causal inference (Peters et al., 2017), unlike our framework which performs sequential randomized data collection.
Third, unlike the gradient-based methods used in counterfactual methods with parametric policies, batch bandit methods use zero-order methods to learn from data and require approximations to be scalable (Calandriello et al., 2020; Zenati et al., 2022).

The sequential designs that we use are adaptive data collection experiments, which have been studied by Bakshy et al. (2018); Kasy & Sautmann (2021). Closely related to our method is policy learning from adaptive data, which has been studied by Zhan et al. (2021) and Bibaut et al. (2021) in the online setting. In contrast, we consider a batch setting and our analysis achieves fast rates under more general conditions. Zhan et al. (2021) use a doubly robust estimator and provide regret guarantees but assume a deterministic lower bound on the propensity score to control the variance. Instead, our novel counterfactual estimator does not require such an assumption. Bibaut et al. (2021) propose a novel maximal inequality and derive from it fast-rate regret guarantees under an additional margin condition that can only hold for finite action sets. Our work instead uses a different assumption on the expected risk, which is similar to Hölderian error bounds in acceleration methods (d'Aspremont et al., 2021) that are known to be satisfied for a broad class of subanalytic functions (Bolte et al., 2007).

In the reinforcement learning literature (Sutton & Barto, 1998), off-policy methods (Harutyunyan et al., 2016; Munos et al., 2016) evaluate and learn a policy using actions sampled from a behavior (logging) policy, which is therefore closely related to our setting. Among the methods that have been shown to be empirically successful are the PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015) algorithms, which learn policies using a Kullback-Leibler distributional constraint to ensure robust learning; this can be compared to our learning strategy that improves the logging policy at each round. However, reinforcement learning models transitions in the states (contexts) induced by the agent's actions, while bandit problems like ours assume that actions do not influence the context distribution. This enables us to design algorithms that exploit the problem structure, have theoretical guarantees, and can achieve better performance in practice.

Finally, our method is related to acceleration methods (d'Aspremont et al., 2021) where current iterates are used as new initial points in the optimization of strongly convex functions (Boyd & Vandenberghe, 2004). While different schemes use fixed (Powell, 1977) or adaptive (Nocedal & Wright, 2006; Becker et al., 2011; Nesterov, 2012; Bolte et al., 2007; Gaillard & Wintenberger, 2018) strategies, our method differs in that it does not consider the same original setting, does not require the same assumptions, nor provides the same guarantees. Moreover, while current models are also used as new starting points, additional data is effectively collected in our setting, unlike in those previous works which do not assume partial feedback as in our case.

## 3. Sequential Counterfactual Risk Minimization

In this section, we introduce the CRM framework and motivate the use of sequential designs for SCRM.

**Notations.** For random variables $x \sim P_X$, $a \sim \pi_\theta(\cdot|x)$ and $y \sim P_Y(\cdot|x,a)$, we write the expectation $\mathbb{E}_{x,\theta,y}[\cdot] = \mathbb{E}_{x \sim P_X,\, a \sim \pi_\theta(\cdot|x),\, y \sim P_Y(\cdot|x,a)}[\cdot]$ and do the same for the variance $\mathrm{Var}_{x,\theta,y}$. Moreover, $\lesssim$ denotes approximate inequalities up to universal multiplicative constants.
### 3.1. Background

In the counterfactual risk minimization (CRM) problem, we are given $n$ logged observations $(x_i, a_i, y_i)_{i=1,\dots,n}$ where contexts $x_i \in \mathcal{X}$ are sampled from a stochastic environment distribution $x_i \sim P_X$, and actions $a_i \sim \pi_{\theta_0}(\cdot|x_i)$ are drawn from a logging policy $\pi_{\theta_0}$ with a model $\theta_0$ in a parameter space $\Theta$. The losses are drawn from a conditional distribution $y_i \sim P_Y(\cdot|x_i, a_i)$. We denote by $\pi_{0,i} = \pi_{\theta_0}(a_i|x_i)$ the associated propensities and assume them to be known. We assume that the policies $\pi_\theta$, $\theta \in \Theta$, admit densities, so that the propensities denote the density of the logging policy at the logged actions given the contexts. The expected risk of a model $\theta$ is defined as

$$L(\theta) = \mathbb{E}_{x,\theta,y}[y]. \tag{1}$$

Counterfactual reasoning uses the logged data sampled from the logging policy associated to $\theta_0$ to estimate the risk of any model $\theta \in \Theta$ with importance sampling:

$$L(\theta) = \mathbb{E}_{x,\theta_0,y}\!\left[y\, \frac{\pi_\theta(a|x)}{\pi_{\theta_0}(a|x)}\right], \tag{2}$$

under the common support assumption (the support of $\pi_\theta$ is included in the support of $\pi_{\theta_0}$). The goal in CRM is to find a model $\theta \in \Theta$ with minimal risk by minimizing

$$\hat{\theta} \in \arg\min_{\theta \in \Theta} L_0(\theta), \tag{3}$$

where $L_0(\theta) = \hat{L}_0(\theta) + \lambda \sqrt{\hat{V}_0(\theta)/n}$ uses the sample variance penalization principle (Maurer & Pontil, 2009) on samples from $\theta_0$, with a counterfactual estimate $\hat{L}_0$ of the expected risk, an empirical variance $\hat{V}_0$, and $\lambda > 0$. Specifically, in the CRM framework, multiple estimators are derived from the IPS method (Horvitz & Thompson, 1952), which uses the following clipped importance sampling estimator of the risk of a model $\theta$ (Bottou et al., 2013; Swaminathan & Joachims, 2015a):

$$\hat{L}^{\text{IS}}_0(\theta) = \frac{1}{n} \sum_{i=1}^{n} y_i \min\!\left(\frac{\pi_\theta(a_i|x_i)}{\pi_{0,i}}, \alpha\right), \tag{4}$$

where $\alpha$ is a clipping parameter. Writing $\chi_i(\theta) = y_i \min\!\left(\frac{\pi_\theta(a_i|x_i)}{\pi_{0,i}}, \alpha\right)$ and $\bar{\chi}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \chi_i(\theta)$, the empirical variance estimator becomes

$$\hat{V}^{\text{IS}}_0(\theta) = \frac{1}{n-1} \sum_{i=1}^{n} \left(\chi_i(\theta) - \bar{\chi}(\theta)\right)^2. \tag{5}$$

Other estimators aim at controlling the variance of the estimator with self-normalized estimators (Swaminathan & Joachims, 2015b) or with direct methods (Dudik et al., 2011; Dudík et al., 2014) in doubly robust estimators. Even so, the performance of counterfactual learning is harmed when the logging policy under-explores the action space (Owen, 2013). Likewise, counterfactual estimates obtained from a first round of randomized data collection may not suffice (Bottou et al., 2013) to select a model $\hat{\theta}$. In those cases, it is natural to consider collecting additional samples. While it is possible to use the same logging model $\theta_0$ to do so, we present a framework for designing an improved sequential data collection strategy, following the intuition of the sequential designs of Bottou et al. (2013).
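To make the estimators above concrete, here is a minimal NumPy sketch of the clipped IPS estimate (4), its empirical variance (5), and the variance-penalized CRM objective of Eq. (3); the function names, array layout and default clipping level are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def clipped_ips(y, prop_target, prop_logging, alpha=10.0):
    """Clipped IPS risk estimate (Eq. 4) and its empirical variance (Eq. 5).

    y            : logged losses in [-1, 0], shape (n,)
    prop_target  : pi_theta(a_i | x_i) under the candidate model, shape (n,)
    prop_logging : logged propensities pi_{0,i}, shape (n,)
    alpha        : clipping level on the importance weights (illustrative value)
    """
    w = np.minimum(prop_target / prop_logging, alpha)  # clipped importance weights
    chi = y * w                                        # chi_i(theta)
    return chi.mean(), chi.var(ddof=1)                 # ddof=1 gives the 1/(n-1) normalization

def crm_objective(y, prop_target, prop_logging, lam=0.1, alpha=10.0):
    """Sample-variance-penalized CRM objective L_0(theta) of Eq. (3)."""
    risk, var = clipped_ips(y, prop_target, prop_logging, alpha)
    return risk + lam * np.sqrt(var / len(y))
```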
### 3.2. Sequential Designs

In this section we present a design of data collections that sequentially learns a policy from logged data in order to deploy it and learn from the newly collected data. Specifically, we assume that at round $m \in \{1, \dots, M\}$, a model $\theta_m \in \Theta$ is deployed and a set $s_m$ of $n_m$ observations $s_m = (x_{m,i}, a_{m,i}, y_{m,i}, \pi_{m,i})_{i=1,\dots,n_m}$ is collected from it, with propensities $\pi_{m,i} = \pi_{\theta_m}(a_{m,i}|x_{m,i})$, in order to learn a new model $\theta_{m+1}$ and reiterate. In this work, we assume that the loss $y$ is bounded in $[-1, 0]$ as in (Swaminathan & Joachims, 2015a) (note however that this assumption could be relaxed to bounded losses) and follows a fixed distribution $P_Y$. Next, we introduce useful definitions.

**Definition 3.1** (Excess Risk and Expected Regret). Given an optimal model $\theta^* \in \arg\min_{\theta \in \Theta} L(\theta)$, we write for each rollout $m$ the excess risk

$$\Delta_m = L(\theta_m) - L(\theta^*), \tag{6}$$

and define the expected regret as

$$R_n = \sum_{m=0}^{M-1} \Delta_m \, n_{m+1}. \tag{7}$$

The objective is now to find a sequence of models $\{\theta_m\}_{m=1,\dots,M}$ whose excess risk and expected regret $R_n$ improve upon the CRM guarantees. To do so, we define a sequence of minimization problems for $m \in \{1, \dots, M\}$:

$$\hat{\theta}_{m+1} \in \arg\min_{\theta \in \Theta} L_m(\theta), \tag{8}$$

where $L_m$ is an objective function that we define in Section 4.2. Note that in the setting we consider, samples are i.i.d. inside a rollout $m$, but dependencies exist between different sets of observations. From a causal inference perspective (Peters et al., 2017), this does not incur an additional bias because of the successive conditioning on past observations. We provide detailed explanations in Appendix A.1 on this matter. Note also that the main intuition and motivation of our work is to shed light on how learning intermediate models $\theta_m$ to adaptively collect data can improve upon sampling from the same logging system with the same total sample size $n = \sum_{m} n_m$. To illustrate the learning benefits of SCRM, we now provide a simple example.

**Algorithm 1** (Sequential Counterfactual Risk Minimization).
- Input: logged observations $(x_{0,i}, a_{0,i}, y_{0,i}, \pi_{0,i})_{i=1,\dots,n_0}$, parameter $\lambda > 0$.
- For $m = 1$ to $M$:
  - Build $L_m$ from observations $s_m$ using Eq. (11).
  - Learn $\theta_{m+1}$ using Eq. (8).
  - Deploy the model $\theta_{m+1}$ and collect observations $s_{m+1} = (x_{m+1,i}, a_{m+1,i}, y_{m+1,i}, \pi_{m+1,i})_{i=1,\dots,n_{m+1}}$.

**Example 3.1** (Gaussian policies with quadratic loss). Let us consider Gaussian parametrized policies $\pi_\theta = \mathcal{N}(\theta, \sigma^2)$ and a loss $l_t(a) = (a - y_t)^2 - 1$ where $y_t \sim \mathcal{N}(\theta^*, \sigma^2)$. We illustrate in Figure 1 the evolution of the losses of the learned models $\theta_m$ through 15 rollouts with either i) batch CRM learning on the aggregation of data generated by the unique initial logging policy $\theta_0$, or ii) sequential CRM learning with models $\theta_0, \dots, \theta_{m-1}$ deployed adaptively, the data of batch $m$ being generated by the last learned model $\theta_{m-1}$. We see that the models learned with SCRM take larger optimization steps than the ones learned with CRM.

We summarize our SCRM framework in Algorithm 1 with the different blocks exposed previously. We provide an additional graphical illustration of SCRM compared to CRM in Appendix A.1. In the next section we define counterfactual estimators from the observations $s_m$ at each round and define a learning strategy $L_m$.

## 4. Variance-Dependent Convergence Guarantees

In this part we aim at providing convergence guarantees for counterfactual learning. We show how to obtain a dependency of the excess risk on the variance of importance weights between the logging model and the optimal model.

### 4.1. Implicit exploration and controlled variance

We first introduce a new counterfactual estimator. For this, we require a common support assumption as in importance sampling methods (Owen, 2013): we assume that the policies $\pi_\theta$ for $\theta \in \Theta$ all have the same support. We then consider the following estimator of the risk of a model $\theta$:

$$\hat{L}^{\text{IPS-IX}}_m(\theta) = \frac{1}{n_m} \sum_{i=1}^{n_m} \frac{\pi_{\theta,i}}{\pi_{m,i} + \alpha \pi_{\theta,i}} \, y_{m,i}, \tag{9}$$

where $\pi_{\theta,i} = \pi_\theta(a_{m,i}|x_{m,i})$ and $\alpha$ acts like a clipping parameter which ensures that the modified propensities $\pi_{m,i} + \alpha \pi_\theta(a_{m,i}|x_{m,i})$ are lower bounded. Writing $\zeta_i(\theta) = \left(\frac{\pi_{\theta,i}}{\pi_{m,i} + \alpha \pi_{\theta,i}} - 1\right) y_{m,i}$ and $\bar{\zeta}(\theta) = \frac{1}{n_m}\sum_{i=1}^{n_m} \zeta_i(\theta)$, we can write the empirical variance estimator as

$$\hat{V}^{\text{IPS-IX}}_m(\theta) = \frac{1}{n_m - 1} \sum_{i=1}^{n_m} \left(\zeta_i(\theta) - \bar{\zeta}(\theta)\right)^2. \tag{10}$$

Here, the empirical variance uses a control variate, since it relies on the expression of $\zeta_i(\theta)$ above instead of $y_{m,i}\frac{\pi_{\theta,i}}{\pi_{m,i} + \alpha \pi_{\theta,i}}$. This improves the dependency on the variance in the excess risk bound of Proposition 4.2. Note also that our estimator resembles the implicit exploration estimator of the EXP3-IX algorithm (Lattimore & Szepesvari, 2019), as our motivation is to improve the control of the variance.
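For comparison with the clipped IPS sketch above, the IPS-IX estimate (9) and its control-variate empirical variance (10) can be computed along the following lines; the function signature and the way $\alpha$ is passed in are our own illustrative choices.

```python
import numpy as np

def ips_ix(y, prop_target, prop_logging, alpha):
    """IPS-IX risk estimate (Eq. 9) and control-variate variance (Eq. 10).

    y            : losses y_{m,i} in [-1, 0], shape (n_m,)
    prop_target  : pi_theta(a_{m,i} | x_{m,i}), shape (n_m,)
    prop_logging : logged propensities pi_{m,i}, shape (n_m,)
    alpha        : implicit-exploration parameter, alpha > 0
    """
    ratio = prop_target / (prop_logging + alpha * prop_target)  # modified weights
    risk = np.mean(ratio * y)                                    # Eq. (9)
    zeta = (ratio - 1.0) * y                                     # control variate
    return risk, zeta.var(ddof=1)                                # Eq. (10)
```

Because the denominator is bounded below by $\alpha \pi_{\theta,i}$, the modified weights never exceed $1/\alpha$, which is the point of the implicit-exploration construction.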
### 4.2. Learning strategy

Next, we aim to provide a learning objective $L_m$, as referred to in Eq. (8). Our approach, like the CRM framework, uses the sample variance penalization principle (Maurer & Pontil, 2009) to learn models that have low expected risk with high probability. To do so, we first state an assumption used in our generalization error bound.

**Assumption 4.1** (Bounded importance weights). For any models $\theta, \theta' \in \Theta$ and any $(x, a) \in \mathcal{X} \times \mathcal{A}$, we assume $\pi_\theta(a|x)/\pi_{\theta'}(a|x) \leq W$ for some $W > 0$.

This assumption has been made in previous works (Kallus & Zhou, 2018; Zenati et al., 2020) and is reasonable when we consider a bounded parameter space $\Theta$. Next, we state an error bound for our estimator.

**Proposition 4.1** (Generalization Error Bound). Let $\hat{L}^{\text{IPS-IX}}_m$ and $\hat{V}^{\text{IPS-IX}}_m$ be the empirical estimators defined respectively in Eq. (9) and Eq. (10). Let $\theta \in \Theta$, $\delta \in (0, 1)$, and $n_m \geq 2$. Then, under Assumption 4.1, for $\lambda_m = \sqrt{18(C_m(\Theta) + \log(2/\delta))}$, with probability at least $1 - \delta$:

$$L(\theta) \leq \hat{L}^{\text{IPS-IX}}_m(\theta) + \lambda_m \sqrt{\frac{\hat{V}^{\text{IPS-IX}}_m(\theta)}{n_m}} + \frac{2\lambda_m^2 W}{n_m} + \delta_m,$$

where $C_m(\Theta)$ is a metric entropy complexity measure defined in App. B.1 and $\delta_m = \sqrt{\log(2/\delta)/(2 n_m)}$.

This proposition is proved in Appendix B.2 and essentially uses empirical bounds (Maurer & Pontil, 2009). By minimizing this high-probability upper bound, we can find models $\theta$ with guarantees on the expected risk. Therefore, at each round, we minimize the following loss:

$$L_m(\theta) = \hat{L}^{\text{IPS-IX}}_m(\theta) + \lambda_m \sqrt{\frac{\hat{V}^{\text{IPS-IX}}_m(\theta)}{n_m}}, \tag{11}$$

where $\lambda_m > 0$ is a positive parameter. Unlike the deterministic decision rules used, for example, in UCB-based algorithms (Lattimore & Szepesvari, 2019), exploration is naturally guaranteed by the stochasticity of the policies we use.

Figure 1: Comparison of CRM and SCRM on the simple setting described in Example 3.1. The models learned through CRM using re-deployments of $\theta_0$ (left) reach $\theta^*$ more slowly than SCRM (center), which uses intermediate deployments $\theta_1, \dots, \theta_M$ indicated with x markers and rollout numbers. The comparison of the evolution of averaged losses (right) over 10 random runs also shows that SCRM converges faster. Here $\theta^* = 1$, $\sigma = 0.3$, and we take $M = 15$ total rollouts with batches $m$ of size $n_m = 100 \cdot 2^m$. The parameter $\lambda$ is set to its theoretical value. (The panel legends show the true loss function, the estimated loss, the confidence set, and the intermediate models.)
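To make the learning step concrete, a minimal sketch of one round of Eq. (8) for the contextless Gaussian policy of Example 3.1 could look as follows, plugging the IPS-IX quantities into the penalized objective of Eq. (11); we use a generic SciPy optimizer and treat $\lambda_m$ and $\alpha$ as given constants rather than deriving them from the complexity measure, so this is an illustration under our own assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def penalized_objective(theta, data, lam, alpha, sigma=0.3):
    """L_m(theta): IPS-IX risk plus the variance penalty of Eq. (11)."""
    a, y, logging_prop = data                            # one logged rollout s_m
    target_prop = norm.pdf(a, loc=theta, scale=sigma)    # pi_theta(a) for Example 3.1
    ratio = target_prop / (logging_prop + alpha * target_prop)
    risk = np.mean(ratio * y)                            # Eq. (9)
    zeta = (ratio - 1.0) * y                             # control variate of Eq. (10)
    return risk + lam * np.sqrt(zeta.var(ddof=1) / len(y))

def scrm_step(theta_m, data, lam, alpha):
    """One round of Eq. (8): return an (approximate) minimizer of L_m."""
    res = minimize(penalized_objective, x0=np.array([theta_m]),
                   args=(data, lam, alpha), method="Nelder-Mead")
    return float(res.x[0])
```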
### 4.3. Excess risk upper bound

Finally, we establish an upper bound on the excess risk of counterfactual risk minimization with the IPS-IX estimator and the learning strategy just defined. For this, we require an assumption on the complexity measure.

**Assumption 4.2.** We assume that the set $\Theta$ is compact and that there exists $d > 0$ such that $C_m(\Theta) \leq d \log(n_m)$.

This assumption states that the complexity grows logarithmically with the sample size. It holds for parametric policies as long as the propensities are lower bounded, which is guaranteed by our estimator. We now state our variance-dependent excess risk bound.

**Proposition 4.2** (Excess Risk Bound). Let $n_m \geq 1$ and $\theta_m \in \Theta$. Let $s_m$ be a set of $n_m$ samples collected with policy $\pi_{\theta_m}$. Then, under Assumptions 4.1 and 4.2, a minimizer $\theta_{m+1}$ of Eq. (11) on the samples $s_m$ satisfies, with probability at least $1 - \delta$, the excess risk upper bound

$$\Delta_{m+1} = L(\theta_{m+1}) - L(\theta^*) \lesssim \nu_m \sqrt{\frac{d \log n_m + \log \frac{1}{\delta}}{n_m}} + \frac{W^2 + W\left(d \log n_m + \log \frac{1}{\delta}\right)}{n_m}, \quad \text{where } \nu_m^2 = \mathrm{Var}_{x,\theta_m}\!\left[\frac{\pi_{\theta^*}(a|x)}{\pi_{\theta_m}(a|x)}\right].$$

The proof is postponed to Appendix B.2. The modified propensities in IPS-IX, as well as the control variate used in the variance estimator, allow us to improve the dependency to $\nu^2_m$, compared to the $\nu^2_{m+1}$ obtained in previous work (Zenati et al., 2020). This turns out to be a crucial point for using these error bounds sequentially as in acceleration methods, since $\nu_m \to 0$ if $\theta_m \to \theta^*$, as explained in the next section.

## 5. SCRM Analysis

In this section we provide the main theoretical result of this work on the excess risk and regret analysis of SCRM. We start by stating an assumption that is common in acceleration methods (d'Aspremont et al., 2021) with restart strategies (Becker et al., 2011; Nesterov, 2012) and that we require to achieve the benefits of sequential designs.

**Assumption 5.1** (Hölderian Error Bound). We assume that there exist $\gamma > 0$ and $\beta \geq 0$ such that for any $\theta \in \Theta$, there exists $\theta^* \in \arg\min_{\theta' \in \Theta} L(\theta')$ such that

$$\gamma \, \mathrm{Var}_{x,\theta}\!\left[\frac{\pi_{\theta^*}(a|x)}{\pi_{\theta}(a|x)}\right] \leq \left(L(\theta) - L(\theta^*)\right)^{\beta}.$$

Typically, in acceleration methods, Hölderian error bounds (Bolte et al., 2007) are of the form $\gamma\, d(\theta, S^*_\Theta) \leq (L(\theta) - L(\theta^*))^\beta$ for some $\gamma, \beta > 0$, where $d(\theta, S^*_\Theta)$ is some distance to the optimal set $S^*_\Theta = \arg\min_{\theta \in \Theta} L(\theta)$. This bound is akin to a local version of strong convexity ($\beta = 1$) or a bounded parameter space ($\beta = 0$) if $d$ is the Euclidean distance. When $\beta \in [0, 1]$, this has also been referred to as the Łojasiewicz assumption introduced in (Łojasiewicz, 1963; 1993). Notably, it has been used in online learning (Gaillard & Wintenberger, 2018) to obtain fast rates with restart strategies. This assumption holds for instance for Example 3.1 with $\beta = 1$ (see App. C.1). We also discuss this assumption for distributions in the exponential family in Appendix C.2, notably for distributions that have been used in practice (Swaminathan & Joachims, 2015b; Kallus & Zhou, 2018; Zenati et al., 2020). Next we state our main result, namely the acceleration of the excess risk convergence rate and the regret upper bound of SCRM.

**Proposition 5.1.** Let $n_0, n \geq 2$ and $\theta^* \in \arg\min_\theta L(\theta)$. Let $n_m = n_0 2^m$ for $m = 0, \dots, M = \lfloor \log_2(1 + \frac{n}{n_0}) \rfloor$. Then, under Assumptions 4.1, 4.2 and 5.1 with $\beta > 0$, the SCRM procedure (Alg. 1) satisfies the excess risk upper bound

$$\Delta_M = L(\theta_M) - L(\theta^*) \leq O\!\left(n^{-\frac{1}{2-\beta}} \log n\right).$$

Moreover, the expected regret is bounded as follows:

$$R_n = \sum_{m=0}^{M-1} \Delta_m \, n_{m+1} \leq O\!\left(n^{\frac{1-\beta}{2-\beta}} \log(n)^2\right).$$

The proof of this result is detailed in Appendix B.3.

**Discussion.** This result shows that an excess risk of order $O\!\left(\frac{\log n}{n}\right)$ may be obtained when $\beta = 1$ (which is implied by a local version of the strong convexity assumption used in acceleration methods). When $\beta = 0$, which merely asserts that the variance of importance weights is bounded, we simply recover the original CRM rate of order $O(\log(n)/\sqrt{n})$. The SCRM procedure thus improves the excess risk rate whenever $\beta > 0$. It is worth emphasizing that knowledge of $\beta$ is not needed by Alg. 1. We also note that our assumption seems related to the Bernstein condition (Bartlett & Mendelson, 2006, see Def. 2.6; van Erven et al., 2015, see Def. 5.1), which bounds a variance term by an excess risk term raised to a power. In empirical risk minimization, this implies the same excess risk and regret rates (van Erven & Koolen, 2016), which match ours exactly (up to logarithmic factors).
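Putting the pieces together, the sketch below runs Algorithm 1 on the toy problem of Example 3.1 with the geometrically growing batch sizes $n_m = n_0 2^m$ used in Proposition 5.1, reusing the `scrm_step` helper from the previous sketch; the environment simulator, the fixed $\lambda$, the choice of $\alpha$, and the number of rollouts are illustrative assumptions, not the paper's experimental configuration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
THETA_STAR, SIGMA = 1.0, 0.3            # theta* and standard deviation from Example 3.1

def deploy(theta, n):
    """Deploy the Gaussian policy N(theta, SIGMA^2) and log one rollout."""
    a = rng.normal(theta, SIGMA, size=n)                        # sampled actions
    y = (a - rng.normal(THETA_STAR, SIGMA, size=n)) ** 2 - 1.0  # quadratic loss of Example 3.1
    prop = norm.pdf(a, loc=theta, scale=SIGMA)                  # logged propensities
    return a, y, prop

theta, n0, lam = 0.0, 100, 0.1          # initial logging model and illustrative constants
for m in range(8):                      # M rollouts
    n_m = n0 * 2 ** m                   # doubling batch sizes n_m = n_0 * 2^m
    data = deploy(theta, n_m)           # collect s_m with the current model theta_m
    theta = scrm_step(theta, data, lam, alpha=1.0 / n_m)        # learn theta_{m+1} via Eq. (8)
    print(f"rollout {m}: theta = {theta:.3f}")
```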
## 6. Empirical Evaluation

In this section we perform numerical experiments to validate our method in practical settings. We present the experimental setup as well as experiments comparing SCRM to related approaches and internal details of the method.

### 6.1. Experimental setup

As our method is able to handle both discrete and continuous actions, we experiment in both settings. We now provide a brief description of the setups, with extensive details available in Appendix D.2. All the code to reproduce the empirical results is available at https://github.com/criteo-research/sequential-conterfactual-risk-minimization.

**Continuous actions.** We perform evaluation on synthetic problems pertaining to personalized pricing from (Demirer et al., 2019) (Pricing) and advertising from (Zenati et al., 2020) (Advertising). We consider Gaussian policies $\pi_\theta(\cdot|x) = \mathcal{N}(\mu_\theta(x), \sigma^2)$ with linear contextual parametrization $\mu_\theta(x) = \theta^\top x$ and fixed variance $\sigma^2$, which corresponds to the exploration budget allowed in the original randomized experiment. The features have up to 10 dimensions and the actions are one-dimensional. We keep the original logging baselines from these settings and compare results to a skyline supervised model trained on the whole training data with full information.

**Discrete actions.** We adapt the setup of (Swaminathan & Joachims, 2015a) that transforms a multilabel classification task into a contextual bandit problem with a discrete, combinatorial action space. We keep the original modeling (akin to CRF) with categorical policies $\pi_\theta(a|x) \propto \exp(\theta^\top (x \otimes a))$. The baseline (resp. skyline) is a supervised, full-information model with the same parameter space as the CRM methods, trained on 5% (resp. 100%) of the training data. We consider the class of probabilistic policies that satisfy Assumption 5.1 by predicting actions in an epsilon-greedy fashion (Sutton & Barto, 1998): $\pi^\varepsilon_\theta(a|x) = (1-\varepsilon)\pi_\theta(a|x) + \varepsilon/|\mathcal{A}|$ where $\varepsilon = 0.1$. Real-world datasets include Scene, Yeast and TMC2007 with feature spaces of up to 30,438 dimensions and action spaces of size up to $2^{22}$. To account for this combinatorial action space, we allow a model $\theta_m$ to be learned using data from all past rollouts $\{s_l\}_{l \le m}$.

Table 1: Sample size needed to achieve a test loss $L(\theta) \leq p\, L(\theta^*)$ on the setting of Example 3.1, averaged over 10 random runs. The reported values are $100 \cdot 2^{22}$ for CRM and $100 \cdot 2^{8}$, $100 \cdot 2^{9}$, $100 \cdot 2^{11}$ for SCRM (ours) across the successive thresholds. SCRM needs far less data to converge to a near-optimal solution. $\lambda$ is set to its theoretical value.

|                 | Pricing      | Advertising | Yeast       | TMC2007     |
|-----------------|--------------|-------------|-------------|-------------|
| $\lambda^*$     | 5.353 ± .178 | .716 ± .020 | .294 ± .026 | .146 ± .012 |
| $\hat{\lambda}$ | 5.575 ± .036 | .726 ± .001 | .299 ± .039 | .164 ± .021 |

Table 2: Test loss after 10 rollouts when choosing $\lambda$ by a posteriori selection ($\lambda^*$) or with the proposed heuristic ($\hat{\lambda}$). Our heuristic is competitive with the a posteriori selection of a fixed $\lambda^*$.

attain near-optimal performance when $\theta^*$ is known, as in Example 3.1, where we also observe that SCRM reaches optimal performance faster than CRM. This corroborates the benefits of the improved excess risk rates of SCRM.

**Hyper-parameter selection for SCRM.** In our experiments, hyperparameter selection consists in choosing a value for $\lambda$. We describe a simple heuristic and evaluate its performance on different datasets. We propose to select $\hat{\lambda}_m$ by estimating the non-penalized CRM loss (Eq. 3) using offline cross-validation on past data $s_{t-1}$

weight functions $\omega_{t,m}(a) > 0$ which satisfy $\sum_{t=0}^{m} \omega_{t,m}(a) = 1$ for all $a$ and $m \in \{0, \dots, M\}$. The MIS estimator writes:

$$\hat{L}^{\text{MIS}}_m(\theta) = \sum_{t=0}^{m} \frac{1}{n_t} \sum_{i=1}^{n_t} \omega_{t,m}(a_{t,i})\, y_{t,i}\, w^\theta_{t,i}, \qquad w^\theta_{t,i} = \frac{\pi_\theta(a_{t,i}|x_{t,i})}{\pi_{t,i}}. \tag{12}$$

In multiple importance sampling we usually assume that the behavior distributions are independent. In our case, when we optimize θt based on the models θt−1, . . .
, θ0, we break this assumption. However, as we will see, we can still have the unbiasedness property and derive an estimator for the variance of the estimator. Proposition A.1 (Unbiasedness). The MIS estimator (12) is unbiased when the loss y is fixed (its distribution PY( |x, a) does not depend on time rollout m). Proof. Let m {1, . . . M}. We recall that at all rounds t < m, models θt Θ were deployed and sets st of nt observations st = (xt,i, at,i, lt,i, πt,i)i=1,...,nt were collected thereof, with propensities πt,i = πθt(at,i|xt,i) to learn the next model θt+1. To prove the unbiasedness we use the tower rule on the expectation and condition on previous observations s1, . . . st 1: E[ˆLMIS m (θ)] = i=1 Ex,θm,y ωt(a)ywθ t t=0 Ex,θm,y ωt(a)ywθ t t=0 Es1...st 1 Ex,θm,y ωt(a)ywθ t | s1 . . . st 1 t=0 Es1...st 1 [Ex,θ,y [ωt(a)y | s1 . . . st 1]] t=0 Ex,θ,y [ωt(a)y] = Ex,θ,y [y] where the second last line is true only when the distribution of y does not change over time roll-outs m. Among the proposals for functions ωt(a), the most naive and natural heuristic is to choose ωt(a) = nt Pm l=1 nl , (13) which gives the naive concatenation of all IPS estimators ˆLn-MIS m (θ) = 1 i=1 yt,i πθ(at,i|xt,i) πθt(at,i|xt,i), (14) where n = Pm t=0 nt. Sequential Counterfactual Risk Minimization With the previous definition of the empirical mean estimator, we can now derive an empirical variance estimator, starting with the naive multi importance sampling estimator. We write the random variable rm = (πθ/πθm)y. We note that for inside a batch m each realization of rm i = (πθ(am,i|xm,i)/πm,i)ym,i and rm j are independent. But the realizations of the random variables rm and rm are dependent. Writing n = Pm t=0 nt 1 p 0, the complexity of a class F by H (ε, F, n) = sup (xi,ai,yi) (X A Y)n H(ε, F {xi, ai, yi} , ) , (18) where F {xi, ai, yi} = f(x1, a1, y1), . . . , f(xn, an, yn) , f F Rn and the number H(ε, A, ) is the smallest cardinality |A0| of a set A0 A such that A is contained in the finite union of ε-balls centered at points in A0 in the metric induced by ). Then, Cm(Θ) is defined by Cm(Θ) = log H (1/nm, Fm,Θ, 2nm) . (19) Sequential Counterfactual Risk Minimization B.2. Variance-dependent excess risk bounds We will denote by Em[ ] = E[ |s0, . . . sm] the conditional expectation given the set of observation samples sm = (xm,i, am,i, ym,i, πm,i)i=1,...,nm up to the rollout m. Here, we recall that xm,i PX , am,i πθm( |xm,i), ym,i PY( |xm,i, am,i), and πm,i = πθm(am,i|xm,i). Furthermore, throughout the document, Ex,θm,y (resp. Varx,θm,y ]) denotes the expectation (resp. variance) in (x, a, y) where x PX , a πθm( |x), and y PY( |x, a). Proposition 4.1 (Generalization Error Bound). Let ˆLIPS-IX m and ˆV IPS-IX m be the empirical estimators defined respectively in Eq. (9) and Eq. (10). Let δ (0, 1), θ Θ, and nm 2 the number of samples associated to the logged dataset at round m. Then, with probability at least 1 δ, L(θ) ˆLIPS-IX m (θ) + λ ˆV IPS-IX m (θ) nm + 2λ2W where λ = p 18(Cm(Θ) + log(2/δ)). Proof. Let δ (0, 1) and θ Θ. Since all functions in Fm,Θ defined in Eq. (17) take values in [0, 1], we can apply the concentration bound of Maurer & Pontil (2009, Theorem 6) to the set Fm,Θ. This yields, with probability at least 1 δ/2, Ex,θm,y[fθ(x, a, y)] 1 i=1 fθ(xm,i, am,i, ym,i) 18 ˆVnm(fθ)(Cm(Θ) + log(2/δ)) nm + 15(Cm(Θ) + log(1/δ)) ˆVnm(fθ) = 1 nm 1 fθ(xm,i, am,i, ym,i) 1 j=1 fθ(xm,j, am,j, ym,j) 2 is an estimation of the sample variance. 
Let α > 0 and define the following biased estimator of the excess risk: Lα m(θ) = Ex,θm,y y πθ(a|x) πθm(a|x) + απθ(a|x) 1 θ Θ. (22) We recall that Ex,θm,y denotes the expectation in (x, a, y) where x PX , a πθm( |x), and y PY( |x, a). By construction of fθ (see Eq. (17)), Ex,θm,y[fθ(x, a, y)] = 1 i=1 fθ(xm,i, am,i, ym,i) = 1 W ˆLIPS-IX m (θ) 1 Wnm ˆVnm(fθ) = 1 W 2 ˆV IPS-IX m (θ) , where ˆLIPS-IX m and ˆV IPS-IX m are defined respectively in Eq. (9) and Eq. (10). Thus, multiplying (21) by W, substituting the above terms, and using λ = p 18(Cm(Θ) + log(2/δ)), yields Lα m(θ) ˆLIPS-IX m (θ) + 1 ˆV IPS-IX m (θ) nm + 15λ2W 18(nm 1), (23) with probability 1 δ/2. Now, let us decompose Lα m(θ) = Ex,θm,y y πθ(a|x) πθm(a|x) + απθ(a|x) 1 = Ex,θm,y y πθ(a|x) πθm(a|x) + απθ(a|x) But, since the losses y are bounded in [ 1, 0] almost surely, y πθ(a|x) πθm(a|x) + απθ(a|x) Sequential Counterfactual Risk Minimization which, substituted into the previous equation, entails, Lα m(θ) L(θ) L(θm). (24) Lower-bounding the left-hand side of (26), we thus get w.p 1 δ/2, L(θ) ˆLIPS-IX m (θ) λ ˆV IPS-IX m (θ) nm + 15λ2W 18(nm 1) + L(θm) 1 Using Em 1[ym,i] = L(θm) and applying Hoeffding s inequality, this further yields w.p. 1 δ L(θ) ˆLIPS-IX m (θ) + λ ˆV IPS-IX m (θ) nm + 15λ2W 18(nm 1) + Eventually, note that (nm 1) 1 (2/nm) since nm 2. Thus, L(θ) ˆLIPS-IX m (θ) + λ ˆV IPS-IX m (θ) nm + 2λ2W which concludes the proof. Proposition 4.2 (Conservative Excess Risk). Let m 0 and θm Θ. Let sm = (xm,i, am,i, ym,i, πm,i)1 i nm be a set of samples collected with am,i πθm( |xm,i). Then, under Assumptions 4.1 and 4.2, the solution θm+1 of Problem (8) with the IPS-IX estimator in Eq. (11) on the samples sm satisfies the excess risk upper-bound m+1 = L(θm+1) L(θ ) d log(nm) + log(1/δ) nm ν2m + W 2 + W(d log(nm) + log(1/δ)) where ν2 m = Varx,θm Proof. We consider the notations of the proof of Proposition 4.1. Fix θ Θ. Applying, Theorem 15 of (Maurer & Pontil, 2009)2 to the function set Fm,Θ defined in (17), we get with probability 1 δ Ex,θm,y[fθm+1(x, a, y)] Ex,θm,y[fθ (x, a, y)] 32Varx,θm,y fθ (x, a, y) Cm(Θ) + log 30 nm + 22 Cm(Θ) + log 30 This can be written as: m U m, (28) with the following definitions: m = Ex,θm,y[fθm+1(x, a, y)] Ex,θm,y[fθ (x, a, y)] 32Varx,θm,y fθ (x, a, y) Cm(Θ) + log 30 nm + 22 Cm(Θ) + log 30 nm 1 . (29) 2Note that in their notation, log Mn(π) equals Cm(Θ) + log(10), X is the dataset {(xi, ai, yi)}1 i n where (xi, ai, yi) i.i.d. PX πθm( |x) PY( |a, x), and P( , µ) is the expectation with respect to one test sample Ex,θm,y[ ]. Sequential Counterfactual Risk Minimization Step: Lower bounding m Using the definition of fθ(x, a, y) in (17) and that of Lα m in Eq. (22), we have Ex,θm,y[fθm+1(x, a, y)] = 1 W Lα m(θm+1). Thus, m can be re-written as W (Lα m(θm+1) Lα m(θ )) , which we now lower-bound. To do so, we begin by upper-bounding Lα m(θ ). It can be expressed as Lα m(θ ) = Ex,θm,y y πθ (a|x) πθm(a|x) + απθ (a|x) L(θm). (30) To shorten notation, from now on and throughout this proof, we write πθ instead of πθ(a|x), omitting the dependence on a and x. Using the inequality (1 + x) 1 1 x for x 0, we have y πθ πθm + απθ = L(θ ) αEx,θm,y L(θ ) + αW 2 , (32) where the last inequality is by Assumption 4.1 and because y [ 1, 0]. Together with (30), we get Lα m(θ ) L(θ ) + αW 2 L(θm). We recall that L(θm+1) L(θm) Lα m(θm+1) by Eq.(24). Therefore, 1 W (L(θm+1) L(θ ) αW 2) 1 W (Lα m(θm+1) Lα m(θ )) , which finally gives 1 W (L(θm+1) L(θ ) αW 2) m. 
(33) Step: Upper bound U m By definition of fθ(x, a, y) in (17), we have Varx,θm,y fθ (x, a, y) = 1 W 2 Varx,θm,y y πθ πθm + απθ 1 1 W 2 Ex,θm,y y2 πθ πθm + απθ 1 2# 1 W 2 Ex,θm " πθ πθm + απθ 1 2# Then, using the inequality (x + y)2 2x2 + 2y2, for x, y R, this may be upper-bounded as Varx,θm,y fθ (x, a, y) 2 W 2 Ex,θm " πθ πθm + απθ Ex,θm πθ πθm + απθ πθ πθm + απθ On the one hand, the first term of the right-hand side may be upper-bounded as " πθ πθm + απθ Ex,θm πθ πθm + απθ πθ πθm + απθ Sequential Counterfactual Risk Minimization where ν2 m = Varx,θm h πθ πθm i . On the other hand, for the second term, we use the same factorization as in Eq. (31) to get πθ πθm + απθ which yields the upper-bound πθ πθm + απθ 1 2 α2 Ex,θm Therefore, substituting the last two upper-bounds into (34) entails Varx,θm,y fθ (x, a, y) 2 W 2 ν2 m + α2W 2 . Then, replacing this upper-bound into the definition of U m in (29) and using Assumption 4.2 to upper bound the terms in Cm(Θ) d log(nm), we obtain the following upper-bound 64(ν2m + α2W) d log(nm) + log 30 nm + 22 d log(nm) + log 30 64(ν2m + α2W) d log(nm) + log 30 nm + 44 d log(nm) + log 30 where the last inequality is because nm 2. Step: excess risk upper bound Setting α = 1 nm and using the two previous bounds (33) and (35) respectively on m and on U m into (28), we get L(θm+1) L(θ ) 64 d log(nm) + log 30 n2m W 2 + W 44 d log(nm) + log 30 nm W 2. (36) b, we have that s 64 d log(nm) + log 30 64 d log(nm) + log 30 64 d log(nm) + log 30 Then, since nm 2 and δ < 1, we have d log(nm) + log(30/δ) log(2) + log(30) 4, which yields 64 d log(nm) + log 30 32 d log(nm) + log 30 8 d log(nm) + log 30 Substituting the last two inequalities into (36) finally entails L(θm+1) L(θ ) 8 d log(nm) + log 30 δ nm ν2m + 47W d log(nm) + log 30 which concludes the proof. B.3. Regret analysis Proposition 5.1 (Regret upper-bound). Let n0, n 2 and θ arg minθ L(θ). Let nm = n02m for m = 0, . . . , M = log2(1+ n n0 ) . Then, under Assumptions 4.1, 4.2 and 5.1, the SCRM procedure (Alg. 1) satisfies the excess risk upper-bound L(θM) L(θ ) O n 1 2 β log n . Sequential Counterfactual Risk Minimization Moreover, the expected regret is upper-bounded as follows: m=0 nm+1 L(θm) L(θ ) O n 1 β 2 β log(n)2 . Proof. First, note that for nm = n02m and M = log2(1 + n n0 ) , we have PM 1 m=0 nm = n0(2M 1) n. Hence, Alg. 1 has collected at most n samples to design the estimator θM. For m 0, we recall m = L(θm) L(θ ) and use Eq. (37) to write d log(nm) + log 30 δ nm ν2m + 47W d log(nm) + log 30 d log(n) + log 30 δ nm ν2m + 47W d log(n) + log 30 where C = 8 q d log(n) + log 30 δ and B = W 2 + 47W(d log(n) + log 30 δ ) are independent of m. Step: Obtaining a recurrence relation for m+1 By Assumption 5.1, there exist γ > 0 and β [0, 1] such that ν2 m = Varx,θm γ L(θm) L(θ ) β = β m γ . Replacing ν2 m in Eq. (38) thus entails 1 γ β m nm + B γ β m + B2 mn0 nm = n02m 2 β/2 m + B2 mn0 . (39) Step: Solving the recurrence relation for m We then insure by induction that m satisfies m 2 β , (40) for some c0 > 0 that will be specified by the analysis. Base step Since losses take values in [ 1, 0], 0 = L(θ0) L(θ ) 1. Equation (40) is thus satisfied for m = 0 as soon as c0 1. Induction step Let m 0. We assume that m c02 m 2 β and prove Equation (40) for m+1. Using Eq. (39), we have 2 β/2 m + B2 mn0 2 c0 β/22 mβ β(2 β) + B2 mn0 by induction max n 2C rn0 2 m 2 β , 2B2 mn0 o . 
(41) Sequential Counterfactual Risk Minimization Now, we show that both terms inside the maximum can be upper-bounded by c02 (m+1)/(2 β) as soon as c0 is large enough. On the one hand, if c0 4Bn0, we have 2B2 mn0 c02 (m+1) c02 m+1 On the other hand, if c0 (4C2n0/γ)1/(2 β), we also have 2 m 2 β 2C rn0 2 β c02 m+1 Combining the above two upper-bounds with (41) concludes the induction step under the condition c0 max 1, 4C2n0 1 2 β , 4Bn0 Step: conclusion Finally, setting the above value for c0 we proved that for all m 0, we have m max 1, 4C2n0 1 2 β , 4Bn0 1 2 β + 4Bn0 = 1 + 256(d log n + log 30 1 2 β + W 2n0 + 47Wn0 d log n + log 30 2 m 2 β , (42) where the last equality is by substituting the values of B and C from (38). For the final step M = log2( n n0 + 1) , this yields M 1 + 256(d log n + log 30 1 2 β + W 2n0 + 47Wn0 d log n + log 30 2 1 + 256(d log n + log 30 1 2 β + W 2n0 + 47Wn0 d log n + log 30 = O n 1 2 β log n . This concludes the first part of the proof. Regret upper-bound To upper bound the cumulative regret, using nm+1 = n02m+1, we write m=0 mnm+1 (42) D m=0 2 m 2 β nm+1 = 2Dn0 m=0 2 1 β 2 β m , D = 1 + 256(d log n + log 30 1 2 β + W 2n0 + 47Wn0 d log n + log 30 Then, computing the sum for M = log2( n n0 + 1) , we have m=0 2 1 β 2 β m 2Dn0(M + 1)2 1 β 2 β M 2Dn0 1 + log2 n n0 + 1 1 + n Using that D = O(log n), we finally obtain 1 β 2 β log(n)2 . Sequential Counterfactual Risk Minimization C. Additional discussions on the H olderian Bound Assumption 5.1 In this appendix, we discuss Assumption 5.1 on different particular examples. C.1. Verification of the assumption on a toy example with Gaussian families We consider the setting of Example 3.1. In the latter, the policies are Gaussian of the form πθ = N(θ, σ2) and the loss is defined by lt(a) = (a yt)2 1 where yt N(θ , σ2). There is no loss in generality in assuming σ2 = 1. Then, we can compute L(θ) L(θ ) = (θ θ )2 and Varθ = exp (θ θ)2 1 . We recall that we are interested in verifying the existence of γ > 0 and β > 0 for which Assumption 5.1 holds, that is in this case for any θ Θ: L(θ) L(θ ) β , (43) which may be re-written here as γ exp (θ θ)2 1 (θ θ )2β . The latter is satisfied for any β 1 as soon as Θ is a bounded interval. Note that the constant γ may decrease exponentially fast as the diameter of Θ increases. To illustrate, the existence of such couples (β, γ), we plot in Fig. 8 different values of the following ratio R(θ, β) = Varθ L(θ) L(θ ) β = exp (θ θ)2 1 θ θ 2 β . (44) The value of γ can be found for different values of β in Fig. 8 by taking 1 γ = maxθ R(θ, β). Higher values of β induce 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Figure 8: Ratio R defined in (44) with different values of β. faster rates and lower values of γ induce worst constant terms in the excess risk and regret bounds. Eventually, note that SCRM does not need those parameters to run and those two parameters γ, β are automatically calibrated by SCRM to find the best trade-off. C.2. Discussion of Assumption 5.1 for Exponential Families In this section, we consider a more realistic example in which policies belong to an exponential family. That is, we assume that the policies are parameterized by a parameter η Rq and can be written in the form: a A, πη(a) = eη t(a) A(η)h(a), for some known function h : A R+ and sufficient statistic t : A Rq. Here, A(η) is a normalization constant, so that e A(η) = R a eη t(a)h(a) da. 
We provide in Example C.1 a concrete example considered by (Swaminathan & Joachims, Sequential Counterfactual Risk Minimization 2015a; Faury et al., 2020). To ease the notation, we removed here the dependency on contexts, but the generalization to contextual policies can be made similarly. The importance weight ratio may be written as, πη(a) πηm(a) = e(η ηm)t(a) (A(η) A(ηm)). (45) To verify Assumption 5.1, we need to upper bound their variance, which we shall write as, = e2(A(ηm) A(η))Vara πηm h e(η ηm)t(a)i . Now, computing the moment generating function (MGF) of the statistic t(a) Rq Mt(s) = E h es t(a)i = Z a es t(a)eηm t(a) A(ηm)h(a) da = e A(ηm) Z a e(ηm+s) t(a)eηm t(a)h(a) da = e A(ηm+s) A(ηm), the variance term may be written as Vara πηm h e(η ηm)t(a)i = Mt(2(η ηm)) M 2 t (η ηm) = e A(2η ηm) A(ηm) e2(A(η) A(ηm)) . This eventually leads us to = e A(2η ηm)+A(ηm) 2A(η) 1. (46) We now discuss two cases that are used for discrete actions (Swaminathan & Joachims, 2015a) and continuous actions (Kallus & Zhou, 2018; Zenati et al., 2020). Bounded sufficient statistic Supposing that there exists an upper bound A such that t(a) A, Cauchy-Schwartz inequality states that |(η ηm) t(a)| η ηm A, which entails = e A(2η ηm)+A(ηm) 2A(η) 1 a e(2η ηm) t(a)h(a) da R a eηm t(a)h(a) da R a eη t(a)h(a) da 2 1 a e(η ηm) t(a)eη t(a)h(a) da R a e(ηm η) t(a)eη t(a)h(a) da R a eη t(a)h(a) da 2 1 e η ηm A 1 . Assuming that the parameter space is compact, i.e, maxη,η η η D, there exists a constant C that depends on A and D such that, this may be further upper-bounded as Therefore, Assumption 5.1 is implied by γC η ηm 2 (L(θ) L(θ ))2β . The latter is implied by a local version of strong convexity for β = 1/2 (d Aspremont et al., 2021), and holds with γ = C 1D 2 for β = 0. Example C.1. For discrete actions A = {a1, . . . , a K}, we consider, as in (Swaminathan & Joachims, 2015a) and (Faury et al., 2020), policies where given a context x, probabilities pi(x) of sampling an action ai are given by pi(x) = exp(θ ϕ(x, ai)) PK j=1 exp(θ ϕ(x, aj)) . (47) Sequential Counterfactual Risk Minimization The function ϕ is typically a feature map associated to a kernel in a RKHS. In this case, the natural parameter η and the sufficient statistic t(a) may be written as p K ) ... log( p K 1 1{a = a1} ... 1{a = a K} Lognormal and Normal distributions For normal N(µ, σ2) and lognormal Lognormal(µ, σ2) distributions with fixed variance σ2 as considered by (Kallus & Zhou, 2018; Zenati et al., 2020), the normalizing constant writes A(η) = η2 2 , and we then obtain that: A(2η ηm) + A(ηm) 2A(η) = (η ηm)2 , which gives: = e η ηm 2 1 . In that case, it is again possible for a bounded parameter space to linearize e η ηm 2 1 η ηm 2, consider losses that verify: for all η, there exists an optimal η such that γ ηm η 2 L(ηm) L(η ) β. (49) Again, this holds generally for β = 0 and for locally strongly convex losses for β = 1. D. Experiment details All the code to reproduce figures and tables is available in the following repository: https://github.com/ criteo-research/sequential-conterfactual-risk-minimization. D.2. Empirical settings details Pricing The pricing application in (Demirer et al., 2019) considers a personalized pricing setting where given contexts x, prices p (which are the actions) need to be predicted to maximize the revenue: r(x, p) = p(a(x) b(x)p + ε) where ε N(0, 1) and d = a(x) + b(x)p + ε is akin to an unknown context-specifidemand function. The data generating process uses contexts x [1, 2]k for k > 1 a positive integer. 
Only l < k dimensions however affect the demand, that is if we write x = 1 l (z1, . . . , zl). The price p is generated from a Gaussian logging policy p N( x, 1) centered in x. We consider in our example the quadratic functionnal a(x) = 2x2 and b(x) = 0.6x as in the original paper. Advertising The advertising simulation in (Zenati et al., 2020) consists in predicting the potential p ]0, + [ of a user that may be compared to their a priori responsiveness to a treatment. The potential is caused by an unobserved random group variable g in G (groups of high or low potential users in their responsiveness) that influences context x of users. The goal is then to find a policy π(a|x) that maximizes reward by adapting to an unobserved potential. The potentials are normally distributed conditionally on the group index, p|g N(µg, σ2 g) where σg = 0.5 and µg = 1 or 3 for two groups. The observed reward y is then a function of the action a and the context x through the associated potential px of the user x. The reward function mimics reward over the offline continuous bidding dataset in (Zenati et al., 2020) with the form: rl(px, a) = a px if a < px 1 2(px a) + 1 else r(px, a) = max(rl(px, a), 0.1) Sequential Counterfactual Risk Minimization The logging policy is a lognormal distribution as it is common in advertising applications (Bottou et al., 2013). In particular, as in (Zenati et al., 2020), πθ0 = Lognormal(µ, σ2) where the mean exp(µ + σ2/2) = 2 and the variance (exp(σ2) 1) exp(2µ + σ2) = 1. Yeast, Scene, TMC2007 We follow (Swaminathan & Joachims, 2015a). We now recall briefly the setup. The problem is a binary multilabel classification with |A| = 2K potential labels. All models are parametrized by πθ(a|x) exp(θ (x N a)). The baseline (resp. skyline) is a supervised, full information model with identical parameter space than CRM methods trained on 5% (resp. 100%) of the training data. Our main modification it to consider the class of probabilistic policies that satisfy Assumption 5.1 by predicting actions in an Epsilon Greedy fashion (Sutton & Barto, 1998)): πε θ(a, x) = (1 ε)πθ(a, x) + ε/|A| where ε = .1. The loss is the Hamming loss (number of incorrectly assigned labels - both false positives and false negatives in the action vector): L(θ) = 1 n K j=1 1[yj i =aj i ] (50) where yj i (resp. aj i) is the j-th component of the label vector (resp. action vector) of line i. A uniform policy will thus evaluate at a loss of .5. D.3. Implementation details Counterfactual methods In this paragraph we start by detailing the non adaptive counterfactual risk minimization that we compare to in this work. Algorithm 2 Counterfactual Risk Minimization Input: Logged observations (x0,i, a0,i, y0,i, π0,i)i=1,...,n0, parameter λ > 0 for m = 1 to M do Build Lm from observations sm using Eq. (11) Learn θ using Eq. (8) Re-deploy the logging model θ0 and collect observations sm+1 = (xm+1,i, am+1,i, lm+1,i, πm+1,i)i=1,...,nm+1 end We also provide the grid of hyperparameters for the λ evaluated in CRM and SCRM methods λ [1e 5, 1e 4, 1e 3, 1e 2, 1e 1]. Batch Bandits Let k : (X A) (X A) R be a bounded positive definite Kernel associated to a RKHS H, ϕ : X A H is the feature map such that k(s, s ) = ϕ(s), ϕ(s ) for any s, s X A. Context-actions pairs are written as sm,i := (xm,i, am,i) X A and Sm := {s1,0, . . . , snm,m} denoting the history of all context-actions pairs seen up until the end of batch m. Km is the kernel matrix of all context-actions seen until the end of the batch m 1. 
Eventually, KS(s ) is the kernel column vector [k(s1, s ), . . . , k(sl, s )] of size |S| = l. Ym = [ y0,1, y0,n0, ym,1, ym,nm] denotes the vector of concatenated rewards observed up until the end of the batch m. At a batch m, a context xm,i is sampled for i {1, nm}, and then to sample an action a, the following decision rule is applied: a arg max a A ˆqm,i,a. (51) In batch Kernel UCB, ˆqm,i,a is defined as ˆqm,i,a = ˆmm,i,a + βmˆσm,i,a, (52) ˆµm,i,a = KSt 1 (xm,i, a) K 1 m 1Ym 1 ˆσ2 m,i,a = 1 λk (xm,i, a), (xm,i, a) 1 λKSm 1 (xm,i, a) K 1 m 1KSm 1 (xm,i, a) , Sequential Counterfactual Risk Minimization and βm is a theoretical parameter that is set to βm = 1 m in practical heuristics (Lattimore & Szepesvari, 2019). In SBPE (Han et al.), ˆqm,i,a is defined directly as ˆqm,i,a = KSt 1 (xm,i, a) K 1 m 1Ym 1. (53) Algorithm 3 Batch bandit - SBPE (Han et al.) and Kernel UCB (Valko et al., 2013) Input: Logged observations (x0,i, a0,i, y0,i, π0,i)i=1,...,n0, λ regularization and exploration parameters, k the kernel function initialization Kλ = [k(s0,i, s0,j)]1 i,j n0 + λI, Y0 = [ y0,i]1 i n0 for m = 1 to M do for i = 1 to nm do Observe context xi,m Choose ai,m arg maxa A ˆqm,i,a using Eq. (53) or (52) end Observe losses yi,m for all i in past batch {1, . . . , nm} Update Ym [ y0,1, y0,n0, ym,1, ym,nm] Update the translated gram matrix Kλ [k(si,p, sj,p)]1 i,j np,1 p m + λI SBPE (Han et al.) uses a linear modelling, therefore we used a linear kernel. For the Kernel UCB (Valko et al., 2013) method, we used Gaussian and Polynomial kernels in our experiments. Note also that no regularization parameter λ is used in SBPE so we set λ = 0 in our experiments, and for K-UCB we chose λ in the grid [1e0, 1e1, 1e2]. Note in particular that we adapted the batch bandit baselines to the CRM setting by benefiting the initialization with the logged dataset to set the gram matrix Kλ as well as the reward vector Y0 with information from the logging data. This modification changes the original methods which take random actions at initializations. Eventually, the baselines were carefully optimized using the Jax library (https://github.com/google/jax) to allow for just in time compilations of algebraic blocks in both methods and to maximize their scaling capacity. RL baselines In order to compare our method to the two known off-policy online RL algorithm PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015), we do the following: 1. we use the stable_baselines3(Raffin et al., 2021) library for the implementation. When necessary we call multiple times the model PPO or TRPO, to have buffer size of geometrical increase. 2. we initialize the Actor Critic Policy with a simpler MLP model having only one layer with output dimension of 1, (with argument net_arch= [1], that is mathematically the same modelling as in CRM and SCRM baselines). 3. At the initial step only and to enable a fair comparison with counterfactual methods using a logging dataset, we pretrain the RL policies to imitate the actions sampled from the logging policy: we process by multiple step of the Adam optimizer, minimizing a loss being the sum of 2 terms: a MSE term between the sampled action of the Actor Critic Policy for the contexts in the n0 instances, and the actions sampled by the logging policy. the ENTROPY term guaranteeing to keep a minimum of exploration in order to initialize the RL algorithm ( P pi log(pi)) 4. we combine the 2 last terms with a linear combinaison with hyperparameters being tuned a posteriori, i.e. 
LOSS = MSE + λ ENTROPY with the hyperparam λ {.5, 1, 2, 5, 10} Sequential Counterfactual Risk Minimization E. Additional empirical results E.1. SCRM compared to CRM We provide here the additional plot in the Pricing setting. 2 4 6 8 10 Rollouts m 2 4 6 8 10 Rollouts m Advertising Figure 9: Test loss as a function of sample size on Pricing, Advertising (from left to right). E.2. Evaluation of IPS-IX We provide here the plots for the whole setting considered in policy evaluation with IPS-IX. 0 1 2 3 Shift 0 ips ips_clipped snips ips_ix 0 1 2 3 Shift 0 ips ips_clipped snips ips_ix 0 1 2 3 Shift 0 ips ips_clipped snips ips_ix Figure 10: Comparison of IPS estimators on a Cosine reward and series of shifted Gaussian policies. Setup (left), Bias (middle left), Variance (middle right), Average IPS weight (right). IPS-IX shows a low bias and compares favorably to IPS and SNIPS in terms of variance. E.3. Exploration/Exploitation tradeoff In this part we give the details used for the experiment described in Section 6.3. We consider again Example 3.1 with the Gaussian parametrized policies πθ = N(θ, σ2) and a loss lt(a) = (a yt)2 1 where yt N(θ , σ 2) with σ = 0.3. Recall that πθ0 = N(θ0, σ). We consider a grid of σ [0.1, 0.3, 1, 3] and consider θ = 1. Our experiment aims at illustrating the influence of sequential exploration that is an important detail of the SCRM and CRM principles.
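A minimal, self-contained way to reproduce the spirit of this exploration experiment is to sweep the policy standard deviation over the grid above and run the SCRM loop for each value; the optimizer, seeds, number of rollouts, and fixed $\lambda$ below are illustrative choices rather than the settings used for the paper's figures.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
THETA_STAR, SIGMA_LOSS = 1.0, 0.3       # theta* and the loss noise std sigma' = 0.3

def objective(theta, a, y, prop, sigma_policy, lam, alpha):
    """Penalized IPS-IX objective (Eq. 11) for the Gaussian policy N(theta, sigma_policy^2)."""
    target = norm.pdf(a, loc=theta, scale=sigma_policy)
    ratio = target / (prop + alpha * target)
    zeta = (ratio - 1.0) * y
    return np.mean(ratio * y) + lam * np.sqrt(zeta.var(ddof=1) / len(y))

def run_scrm(sigma_policy, n0=100, n_rollouts=8, lam=0.1, theta0=0.0):
    theta = theta0
    for m in range(n_rollouts):
        n_m = n0 * 2 ** m
        a = rng.normal(theta, sigma_policy, size=n_m)            # rollout s_m
        y = (a - rng.normal(THETA_STAR, SIGMA_LOSS, size=n_m)) ** 2 - 1.0
        prop = norm.pdf(a, loc=theta, scale=sigma_policy)
        res = minimize(objective, x0=np.array([theta]),
                       args=(a, y, prop, sigma_policy, lam, 1.0 / n_m),
                       method="Nelder-Mead")
        theta = float(res.x[0])
    return theta

for sigma in [0.1, 0.3, 1.0, 3.0]:                               # exploration grid of E.3
    theta_final = run_scrm(sigma)
    print(f"sigma={sigma}: theta_M={theta_final:.3f}, "
          f"squared distance to theta*={(theta_final - THETA_STAR) ** 2:.4f}")
```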