# Sequential Counterfactual Risk Minimization

Houssam Zenati (1,2), Eustache Diemert (1), Matthieu Martin (1), Julien Mairal (2), Pierre Gaillard (2)

(1) Criteo AI Lab. (2) Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. Correspondence to: Houssam Zenati.

## Abstract

Counterfactual Risk Minimization (CRM) is a framework for dealing with the logged bandit feedback problem, where the goal is to improve a logging policy using offline data. In this paper, we explore the case where it is possible to deploy learned policies multiple times and acquire new data. We extend the CRM principle and its theory to this scenario, which we call Sequential Counterfactual Risk Minimization (SCRM). We introduce a novel counterfactual estimator and identify conditions that can improve the performance of CRM in terms of excess risk and regret rates, by using an analysis similar to restart strategies in accelerated optimization methods. We also provide an empirical evaluation of our method in both discrete and continuous action settings, and demonstrate the benefits of multiple deployments of CRM.

## 1. Introduction

Counterfactual reasoning in the logged bandit problem has become a common task for practitioners in a wide range of applications such as recommender systems (Swaminathan & Joachims, 2015a), ad placements (Bottou et al., 2013) or precision medicine (Kallus & Zhou, 2018). Such a task typically consists in learning an optimal decision policy from logged contextual features and partial feedback induced by predictions from a logging policy. To do so, the logged data is originally obtained from a randomized data collection experiment. However, the success of counterfactual risk minimization is highly dependent on the quality of the logging policy and its ability to sample meaningful actions.

Counterfactual reasoning can be challenging due to the large variance of counterfactual estimators (Swaminathan & Joachims, 2015b). Additionally, as pointed out by Bottou et al. (2013), confidence intervals obtained from counterfactual estimates may not be sufficiently accurate to select a final policy from offline data (Dai et al., 2020). This can occur when the logging policy does not sufficiently explore the action space. To address this, one option is to simply collect additional data from the same logging system to increase the sample size. However, it may be more efficient to use the already collected data to design a better data collection experiment through a sequential design approach (Bottou et al., 2013, see Section 6.4). It is thus appealing to consider successive policy deployments when possible.

We tackle this sequential design problem and are interested in multiple deployments of the CRM setup of Swaminathan & Joachims (2015a), which we call sequential counterfactual risk minimization (SCRM). SCRM performs a sequence of data collection experiments, determining at each round a policy using data samples collected during previous experiments. The obtained policy is then deployed for the next round to collect additional samples. Such a sequential decision making system thus entails designing an adaptive learning strategy that minimizes the excess risk and expected regret of the learner.
In contrast to the conservative learning strategy in CRM, the exploration induced by sequential deployments of enhanced logging policies should allow for improved excess risk and regret guarantees. Yet, obtaining such guarantees is nontrivial, and we address it in this work. To accomplish this, we first propose a new counterfactual estimator that controls the variance and analyze its convergence guarantees. Specifically, we obtain an improved dependence on the variance of importance weights between the optimal and logging policy. Second, leveraging this estimator and a weak assumption on the concentration of this variance term, we show how the error bound sequentially concentrates through CRM rollouts. This allows us to improve the excess risk convergence rate as well as the regret rate. Our analysis employs methods similar to restart strategies in acceleration methods (Nesterov, 2012) and optimization for strongly convex functions (Boyd & Vandenberghe, 2004). We also conduct numerical experiments to demonstrate the effectiveness of our method in both discrete and continuous action settings, and how it improves upon CRM and other existing methods in the literature.

## 2. Related Work

Counterfactual learning from logged feedback (Bottou et al., 2013) uses only past interactions to learn a policy without interacting with the environment. Counterfactual risk minimization methods (Swaminathan & Joachims, 2015a;b) propose learning formulations using a variance penalization as in (Maurer & Pontil, 2009) to find policies with minimal variance. Even so, counterfactual methods remain prone to large variance issues (Dudík et al., 2014). These problems may arise when the logging policy under-explores the action space, making it difficult to use the importance sampling techniques (Owen, 2013) that are key to counterfactual reasoning. While one could collect additional data to counter this problem, our method focuses on sequential deployments (Bottou et al., 2013, see Section 6.4) to collect data from adaptive policies that explore the action space. Note also that our original motivation is related to but different from the support deficiency problem (Sachdeva et al., 2020), where the support of the logging policy does not cover the support of the optimal policy.

Another line of work related to our framework is batch bandit methods. Originally introduced by Perchet et al. (2015) and then extended by Gao et al. (2019) in the multi-armed setting, batch bandit agents take decisions and only observe feedback in batches. This therefore differs from the classic bandit setting (Auer et al., 2002; Audibert et al., 2007) where rewards are observed after each action taken by an agent. Extensions to the contextual case have been proposed by Han et al. and could easily be kernelized (Valko et al., 2013). The sequential counterfactual risk minimization problem is thus closely related to this setting. However, major differences can be noted. First, SCRM does not leverage any problem structure as in stochastic contextual bandits (Li et al., 2010) by assuming a linear reward function (Chu et al., 2011; Goldenshluger & Zeevi, 2013; Han et al.), nor does it use regression oracles as in (Foster & Rakhlin, 2020; Simchi-Levi & Xu, 2020). Second, deterministic decision rules taken by bandit agents (Lattimore & Szepesvari, 2019) do not allow for counterfactual reasoning or causal inference (Peters et al., 2017), unlike our framework which performs sequential randomized data collection.
Third, unlike the gradient-based methods used in counterfactual methods with parametric policies, batch bandit methods use zero-order methods to learn from data and require approximations to be scalable (Calandriello et al., 2020; Zenati et al., 2022).

The sequential designs that we use are adaptive data collection experiments, which have been studied by Bakshy et al. (2018); Kasy & Sautmann (2021). Closely related to our method is policy learning from adaptive data, which has been studied by Zhan et al. (2021) and Bibaut et al. (2021) in the online setting. In contrast, we consider a batch setting and our analysis achieves fast rates under more general conditions. Zhan et al. (2021) use a doubly robust estimator and provide regret guarantees but assume a deterministic lower bound on the propensity score to control the variance. Instead, our novel counterfactual estimator does not require such an assumption. Bibaut et al. (2021) propose a novel maximal inequality and derive from it fast-rate regret guarantees under an additional margin condition that can only hold for finite action sets. Our work instead uses a different assumption on the expected risk, which is similar to Hölderian error bounds in acceleration methods (d'Aspremont et al., 2021) that are known to be satisfied for a broad class of subanalytic functions (Bolte et al., 2007).

In the reinforcement learning literature (Sutton & Barto, 1998), off-policy methods (Harutyunyan et al., 2016; Munos et al., 2016) evaluate and learn a policy using actions sampled from a behavior (logging) policy, which is therefore closely related to our setting. Among the methods that have been shown to be empirically successful are the PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015) algorithms, which learn policies using a Kullback-Leibler distributional constraint to ensure robust learning; this can be compared to our learning strategy that improves the logging policy at each round. However, reinforcement learning models transitions in the states (contexts) induced by the agent's actions, while bandit problems like ours assume that actions do not influence the context distribution. This enables us to design algorithms that exploit the problem structure, have theoretical guarantees, and can achieve better performance in practice.

Finally, our method is related to acceleration methods (d'Aspremont et al., 2021) where current iterates are used as new initial points in the optimization of strongly convex functions (Boyd & Vandenberghe, 2004). While different schemes use fixed (Powell, 1977) or adaptive (Nocedal & Wright, 2006; Becker et al., 2011; Nesterov, 2012; Bolte et al., 2007; Gaillard & Wintenberger, 2018) strategies, our method differs in that it does not consider the same original setting, does not require the same assumptions, nor provides the same guarantees. Moreover, while current models are also used as new starting points, additional data is effectively collected in our setting, unlike in those previous works which do not assume partial feedback as in our case.

## 3. Sequential Counterfactual Risk Minimization

In this section, we introduce the CRM framework and motivate the use of sequential designs for SCRM.

**Notations.** For random variables $x \sim P_X$, $a \sim \pi_\theta(\cdot|x)$ and $y \sim P_Y(\cdot|x,a)$, we write the expectation $\mathbb{E}_{x,\theta,y}[\cdot] = \mathbb{E}_{x \sim P_X,\, a \sim \pi_\theta(\cdot|x),\, y \sim P_Y(\cdot|x,a)}[\cdot]$ and do the same for the variance $\mathrm{Var}_{x,\theta,y}$. Moreover, $\lesssim$ denotes approximate inequalities up to universal multiplicative constants.
### 3.1. Background

In the counterfactual risk minimization (CRM) problem, we are given $n$ logged observations $(x_i, a_i, y_i)_{i=1,\dots,n}$ where contexts $x_i \in \mathcal{X}$ are sampled from a stochastic environment distribution $x_i \sim P_X$, and actions $a_i \sim \pi_{\theta_0}(\cdot|x_i)$ are drawn from a logging policy $\pi_{\theta_0}$ with a model $\theta_0$ in a parameter space $\Theta$. The losses are drawn from a conditional distribution $y_i \sim P_Y(\cdot|x_i, a_i)$. We denote by $\pi_{0,i} = \pi_{\theta_0}(a_i|x_i)$ the associated propensities and assume them to be known. We assume that the policies $\pi_\theta$, $\theta \in \Theta$, admit densities, so that the propensities denote the density of the logging policy at the logged actions given the contexts. The expected risk of a model $\theta$ is defined as

$$L(\theta) = \mathbb{E}_{x,\theta,y}[y]. \tag{1}$$

Counterfactual reasoning uses the logged data sampled from the logging policy associated to $\theta_0$ to estimate the risk of any model $\theta \in \Theta$ with importance sampling:

$$L(\theta) = \mathbb{E}_{x,\theta_0,y}\!\left[y\, \frac{\pi_\theta(a|x)}{\pi_{\theta_0}(a|x)}\right], \tag{2}$$

under the common support assumption (the support of $\pi_\theta$ is included in the support of $\pi_{\theta_0}$). The goal in CRM is to find a model $\theta \in \Theta$ with minimal risk by minimizing

$$\hat{\theta} \in \arg\min_{\theta \in \Theta} L_0(\theta), \tag{3}$$

where $L_0(\theta) = \hat{L}_0(\theta) + \lambda \sqrt{\hat{V}_0(\theta)/n}$ uses the sample variance penalization principle (Maurer & Pontil, 2009) on samples from $\theta_0$, with a counterfactual estimate $\hat{L}_0$ of the expected risk, an empirical variance $\hat{V}_0$, and $\lambda > 0$. Specifically, in the CRM framework, multiple estimators are derived from the IPS method (Horvitz & Thompson, 1952), which uses the following clipped importance sampling estimator of the risk of a model $\theta$ (Bottou et al., 2013; Swaminathan & Joachims, 2015a):

$$\hat{L}^{\text{IS}}_0(\theta) = \frac{1}{n} \sum_{i=1}^{n} y_i \min\!\left(\frac{\pi_\theta(a_i|x_i)}{\pi_{0,i}}, \alpha\right), \tag{4}$$

where $\alpha$ is a clipping parameter. Writing $\chi_i(\theta) = y_i \min\!\left(\frac{\pi_\theta(a_i|x_i)}{\pi_{0,i}}, \alpha\right)$ and $\bar{\chi}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \chi_i(\theta)$, the empirical variance estimator becomes

$$\hat{V}^{\text{IS}}_0(\theta) = \frac{1}{n-1} \sum_{i=1}^{n} \left(\chi_i(\theta) - \bar{\chi}(\theta)\right)^2. \tag{5}$$

Other estimators aim at controlling the variance of the estimator with self-normalized estimators (Swaminathan & Joachims, 2015b) or with direct methods (Dudik et al., 2011; Dudík et al., 2014) in doubly robust estimators. Even so, the performance of counterfactual learning is harmed when the logging policy under-explores the action space (Owen, 2013). Likewise, counterfactual estimates obtained from a first round of randomized data collection may not suffice (Bottou et al., 2013) to select a model $\hat{\theta}$. In those cases, it is natural to consider collecting additional samples. While it is possible to use the same logging model $\theta_0$ to do so, we present a framework for designing an improved sequential data collection strategy, following the intuition of the sequential designs of Bottou et al. (2013).
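To make the estimators above concrete, here is a minimal NumPy sketch of the clipped IPS estimate (4), its empirical variance (5), and the variance-penalized CRM objective of Eq. (3); the function names, array layout and default clipping level are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def clipped_ips(y, prop_target, prop_logging, alpha=10.0):
    """Clipped IPS risk estimate (Eq. 4) and its empirical variance (Eq. 5).

    y            : logged losses in [-1, 0], shape (n,)
    prop_target  : pi_theta(a_i | x_i) under the candidate model, shape (n,)
    prop_logging : logged propensities pi_{0,i}, shape (n,)
    alpha        : clipping level on the importance weights (illustrative value)
    """
    w = np.minimum(prop_target / prop_logging, alpha)  # clipped importance weights
    chi = y * w                                        # chi_i(theta)
    return chi.mean(), chi.var(ddof=1)                 # ddof=1 gives the 1/(n-1) normalization

def crm_objective(y, prop_target, prop_logging, lam=0.1, alpha=10.0):
    """Sample-variance-penalized CRM objective L_0(theta) of Eq. (3)."""
    risk, var = clipped_ips(y, prop_target, prop_logging, alpha)
    return risk + lam * np.sqrt(var / len(y))
```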
### 3.2. Sequential Designs

In this section we present a design of data collections that sequentially learns a policy from logged data in order to deploy it and learn from the newly collected data. Specifically, we assume that at round $m \in \{1, \dots, M\}$, a model $\theta_m \in \Theta$ is deployed and a set $s_m$ of $n_m$ observations $s_m = (x_{m,i}, a_{m,i}, y_{m,i}, \pi_{m,i})_{i=1,\dots,n_m}$ is collected from it, with propensities $\pi_{m,i} = \pi_{\theta_m}(a_{m,i}|x_{m,i})$, in order to learn a new model $\theta_{m+1}$ and reiterate. In this work, we assume that the loss $y$ is bounded in $[-1, 0]$ as in (Swaminathan & Joachims, 2015a) (note however that this assumption could be relaxed to bounded losses) and follows a fixed distribution $P_Y$. Next, we introduce useful definitions.

**Definition 3.1** (Excess Risk and Expected Regret). Given an optimal model $\theta^* \in \arg\min_{\theta \in \Theta} L(\theta)$, we write for each rollout $m$ the excess risk

$$\Delta_m = L(\theta_m) - L(\theta^*), \tag{6}$$

and define the expected regret as

$$R_n = \sum_{m=0}^{M-1} \Delta_m \, n_{m+1}. \tag{7}$$

The objective is now to find a sequence of models $\{\theta_m\}_{m=1,\dots,M}$ whose excess risk and expected regret $R_n$ improve upon the CRM guarantees. To do so, we define a sequence of minimization problems for $m \in \{1, \dots, M\}$:

$$\hat{\theta}_{m+1} \in \arg\min_{\theta \in \Theta} L_m(\theta), \tag{8}$$

where $L_m$ is an objective function that we define in Section 4.2. Note that in the setting we consider, samples are i.i.d. inside a rollout $m$, but dependencies exist between different sets of observations. From a causal inference perspective (Peters et al., 2017), this does not incur an additional bias because of the successive conditioning on past observations. We provide detailed explanations in Appendix A.1 on this matter. Note also that the main intuition and motivation of our work is to shed light on how learning intermediate models $\theta_m$ to adaptively collect data can improve upon sampling from the same logging system with the same total sample size $n = \sum_{m} n_m$. To illustrate the learning benefits of SCRM, we now provide a simple example.

**Algorithm 1** (Sequential Counterfactual Risk Minimization).
- Input: logged observations $(x_{0,i}, a_{0,i}, y_{0,i}, \pi_{0,i})_{i=1,\dots,n_0}$, parameter $\lambda > 0$.
- For $m = 1$ to $M$:
  - Build $L_m$ from observations $s_m$ using Eq. (11).
  - Learn $\theta_{m+1}$ using Eq. (8).
  - Deploy the model $\theta_{m+1}$ and collect observations $s_{m+1} = (x_{m+1,i}, a_{m+1,i}, y_{m+1,i}, \pi_{m+1,i})_{i=1,\dots,n_{m+1}}$.

**Example 3.1** (Gaussian policies with quadratic loss). Let us consider Gaussian parametrized policies $\pi_\theta = \mathcal{N}(\theta, \sigma^2)$ and a loss $l_t(a) = (a - y_t)^2 - 1$ where $y_t \sim \mathcal{N}(\theta^*, \sigma^2)$. We illustrate in Figure 1 the evolution of the losses of the learned models $\theta_m$ through 15 rollouts with either i) batch CRM learning on the aggregation of data generated by the unique initial logging policy $\theta_0$, or ii) sequential CRM learning with models $\theta_0, \dots, \theta_{m-1}$ deployed adaptively, the data of batch $m$ being generated by the last learned model $\theta_{m-1}$. We see that the models learned with SCRM take larger optimization steps than the ones learned with CRM.

We summarize our SCRM framework in Algorithm 1 with the different blocks exposed previously. We provide an additional graphical illustration of SCRM compared to CRM in Appendix A.1. In the next section we define counterfactual estimators from the observations $s_m$ at each round and define a learning strategy $L_m$.

## 4. Variance-Dependent Convergence Guarantees

In this part we aim at providing convergence guarantees for counterfactual learning. We show how to obtain a dependency of the excess risk on the variance of importance weights between the logging model and the optimal model.

### 4.1. Implicit exploration and controlled variance

We first introduce a new counterfactual estimator. For this, we require a common support assumption as in importance sampling methods (Owen, 2013): we assume that the policies $\pi_\theta$ for $\theta \in \Theta$ all have the same support. We then consider the following estimator of the risk of a model $\theta$:

$$\hat{L}^{\text{IPS-IX}}_m(\theta) = \frac{1}{n_m} \sum_{i=1}^{n_m} \frac{\pi_{\theta,i}}{\pi_{m,i} + \alpha \pi_{\theta,i}} \, y_{m,i}, \tag{9}$$

where $\pi_{\theta,i} = \pi_\theta(a_{m,i}|x_{m,i})$ and $\alpha$ acts like a clipping parameter which ensures that the modified propensities $\pi_{m,i} + \alpha \pi_\theta(a_{m,i}|x_{m,i})$ are lower bounded. Writing $\zeta_i(\theta) = \left(\frac{\pi_{\theta,i}}{\pi_{m,i} + \alpha \pi_{\theta,i}} - 1\right) y_{m,i}$ and $\bar{\zeta}(\theta) = \frac{1}{n_m}\sum_{i=1}^{n_m} \zeta_i(\theta)$, we can write the empirical variance estimator as

$$\hat{V}^{\text{IPS-IX}}_m(\theta) = \frac{1}{n_m - 1} \sum_{i=1}^{n_m} \left(\zeta_i(\theta) - \bar{\zeta}(\theta)\right)^2. \tag{10}$$

Here, the empirical variance uses a control variate, since it relies on the expression of $\zeta_i(\theta)$ above instead of $y_{m,i}\frac{\pi_{\theta,i}}{\pi_{m,i} + \alpha \pi_{\theta,i}}$. This improves the dependency on the variance in the excess risk bound of Proposition 4.2. Note also that our estimator resembles the implicit exploration estimator of the EXP3-IX algorithm (Lattimore & Szepesvari, 2019), as our motivation is to improve the control of the variance.
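For comparison with the clipped IPS sketch above, the IPS-IX estimate (9) and its control-variate empirical variance (10) can be computed along the following lines; the function signature and the way $\alpha$ is passed in are our own illustrative choices.

```python
import numpy as np

def ips_ix(y, prop_target, prop_logging, alpha):
    """IPS-IX risk estimate (Eq. 9) and control-variate variance (Eq. 10).

    y            : losses y_{m,i} in [-1, 0], shape (n_m,)
    prop_target  : pi_theta(a_{m,i} | x_{m,i}), shape (n_m,)
    prop_logging : logged propensities pi_{m,i}, shape (n_m,)
    alpha        : implicit-exploration parameter, alpha > 0
    """
    ratio = prop_target / (prop_logging + alpha * prop_target)  # modified weights
    risk = np.mean(ratio * y)                                    # Eq. (9)
    zeta = (ratio - 1.0) * y                                     # control variate
    return risk, zeta.var(ddof=1)                                # Eq. (10)
```

Because the denominator is bounded below by $\alpha \pi_{\theta,i}$, the modified weights never exceed $1/\alpha$, which is the point of the implicit-exploration construction.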
### 4.2. Learning strategy

Next, we aim to provide a learning objective $L_m$, as referred to in Eq. (8). Our approach, like the CRM framework, uses the sample variance penalization principle (Maurer & Pontil, 2009) to learn models that have low expected risk with high probability. To do so, we first state an assumption used in our generalization error bound.

**Assumption 4.1** (Bounded importance weights). For any models $\theta, \theta' \in \Theta$ and any $(x, a) \in \mathcal{X} \times \mathcal{A}$, we assume $\pi_\theta(a|x)/\pi_{\theta'}(a|x) \leq W$ for some $W > 0$.

This assumption has been made in previous works (Kallus & Zhou, 2018; Zenati et al., 2020) and is reasonable when we consider a bounded parameter space $\Theta$. Next, we state an error bound for our estimator.

**Proposition 4.1** (Generalization Error Bound). Let $\hat{L}^{\text{IPS-IX}}_m$ and $\hat{V}^{\text{IPS-IX}}_m$ be the empirical estimators defined respectively in Eq. (9) and Eq. (10). Let $\theta \in \Theta$, $\delta \in (0, 1)$, and $n_m \geq 2$. Then, under Assumption 4.1, for $\lambda_m = \sqrt{18(C_m(\Theta) + \log(2/\delta))}$, with probability at least $1 - \delta$:

$$L(\theta) \leq \hat{L}^{\text{IPS-IX}}_m(\theta) + \lambda_m \sqrt{\frac{\hat{V}^{\text{IPS-IX}}_m(\theta)}{n_m}} + \frac{2\lambda_m^2 W}{n_m} + \delta_m,$$

where $C_m(\Theta)$ is a metric entropy complexity measure defined in App. B.1 and $\delta_m = \sqrt{\log(2/\delta)/(2 n_m)}$.

This proposition is proved in Appendix B.2 and essentially uses empirical bounds (Maurer & Pontil, 2009). By minimizing this high-probability upper bound, we can find models $\theta$ with guarantees on the expected risk. Therefore, at each round, we minimize the following loss:

$$L_m(\theta) = \hat{L}^{\text{IPS-IX}}_m(\theta) + \lambda_m \sqrt{\frac{\hat{V}^{\text{IPS-IX}}_m(\theta)}{n_m}}, \tag{11}$$

where $\lambda_m > 0$ is a positive parameter. Unlike the deterministic decision rules used, for example, in UCB-based algorithms (Lattimore & Szepesvari, 2019), exploration is naturally guaranteed by the stochasticity of the policies we use.

Figure 1: Comparison of CRM and SCRM on the simple setting described in Example 3.1. The models learned through CRM using re-deployments of $\theta_0$ (left) reach $\theta^*$ more slowly than SCRM (center), which uses intermediate deployments $\theta_1, \dots, \theta_M$ indicated with x markers and rollout numbers. The comparison of the evolution of averaged losses (right) over 10 random runs also shows that SCRM converges faster. Here $\theta^* = 1$, $\sigma = 0.3$, and we take $M = 15$ total rollouts with batches $m$ of size $n_m = 100 \cdot 2^m$. The parameter $\lambda$ is set to its theoretical value. (The panel legends show the true loss function, the estimated loss, the confidence set, and the intermediate models.)
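To make the learning step concrete, a minimal sketch of one round of Eq. (8) for the contextless Gaussian policy of Example 3.1 could look as follows, plugging the IPS-IX quantities into the penalized objective of Eq. (11); we use a generic SciPy optimizer and treat $\lambda_m$ and $\alpha$ as given constants rather than deriving them from the complexity measure, so this is an illustration under our own assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def penalized_objective(theta, data, lam, alpha, sigma=0.3):
    """L_m(theta): IPS-IX risk plus the variance penalty of Eq. (11)."""
    a, y, logging_prop = data                            # one logged rollout s_m
    target_prop = norm.pdf(a, loc=theta, scale=sigma)    # pi_theta(a) for Example 3.1
    ratio = target_prop / (logging_prop + alpha * target_prop)
    risk = np.mean(ratio * y)                            # Eq. (9)
    zeta = (ratio - 1.0) * y                             # control variate of Eq. (10)
    return risk + lam * np.sqrt(zeta.var(ddof=1) / len(y))

def scrm_step(theta_m, data, lam, alpha):
    """One round of Eq. (8): return an (approximate) minimizer of L_m."""
    res = minimize(penalized_objective, x0=np.array([theta_m]),
                   args=(data, lam, alpha), method="Nelder-Mead")
    return float(res.x[0])
```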
### 4.3. Excess risk upper bound

Finally, we establish an upper bound on the excess risk of counterfactual risk minimization with the IPS-IX estimator and the learning strategy just defined. For this, we require an assumption on the complexity measure.

**Assumption 4.2.** We assume that the set $\Theta$ is compact and that there exists $d > 0$ such that $C_m(\Theta) \leq d \log(n_m)$.

This assumption states that the complexity grows logarithmically with the sample size. It holds for parametric policies as long as the propensities are lower bounded, which is guaranteed by our estimator. We now state our variance-dependent excess risk bound.

**Proposition 4.2** (Excess Risk Bound). Let $n_m \geq 1$ and $\theta_m \in \Theta$. Let $s_m$ be a set of $n_m$ samples collected with policy $\pi_{\theta_m}$. Then, under Assumptions 4.1 and 4.2, a minimizer $\theta_{m+1}$ of Eq. (11) on the samples $s_m$ satisfies, with probability at least $1 - \delta$, the excess risk upper bound

$$\Delta_{m+1} = L(\theta_{m+1}) - L(\theta^*) \lesssim \nu_m \sqrt{\frac{d \log n_m + \log \frac{1}{\delta}}{n_m}} + \frac{W^2 + W\left(d \log n_m + \log \frac{1}{\delta}\right)}{n_m}, \quad \text{where } \nu_m^2 = \mathrm{Var}_{x,\theta_m}\!\left[\frac{\pi_{\theta^*}(a|x)}{\pi_{\theta_m}(a|x)}\right].$$

The proof is postponed to Appendix B.2. The modified propensities in IPS-IX, as well as the control variate used in the variance estimator, allow us to improve the dependency to $\nu^2_m$, compared to the $\nu^2_{m+1}$ obtained in previous work (Zenati et al., 2020). This turns out to be a crucial point for using these error bounds sequentially as in acceleration methods, since $\nu_m \to 0$ if $\theta_m \to \theta^*$, as explained in the next section.

## 5. SCRM Analysis

In this section we provide the main theoretical result of this work on the excess risk and regret analysis of SCRM. We start by stating an assumption that is common in acceleration methods (d'Aspremont et al., 2021) with restart strategies (Becker et al., 2011; Nesterov, 2012) and that we require to achieve the benefits of sequential designs.

**Assumption 5.1** (Hölderian Error Bound). We assume that there exist $\gamma > 0$ and $\beta \geq 0$ such that for any $\theta \in \Theta$, there exists $\theta^* \in \arg\min_{\theta' \in \Theta} L(\theta')$ such that

$$\gamma \, \mathrm{Var}_{x,\theta}\!\left[\frac{\pi_{\theta^*}(a|x)}{\pi_{\theta}(a|x)}\right] \leq \left(L(\theta) - L(\theta^*)\right)^{\beta}.$$

Typically, in acceleration methods, Hölderian error bounds (Bolte et al., 2007) are of the form $\gamma\, d(\theta, S^*_\Theta) \leq (L(\theta) - L(\theta^*))^\beta$ for some $\gamma, \beta > 0$, where $d(\theta, S^*_\Theta)$ is some distance to the optimal set $S^*_\Theta = \arg\min_{\theta \in \Theta} L(\theta)$. This bound is akin to a local version of strong convexity ($\beta = 1$) or a bounded parameter space ($\beta = 0$) if $d$ is the Euclidean distance. When $\beta \in [0, 1]$, this has also been referred to as the Łojasiewicz assumption introduced in (Łojasiewicz, 1963; 1993). Notably, it has been used in online learning (Gaillard & Wintenberger, 2018) to obtain fast rates with restart strategies. This assumption holds for instance for Example 3.1 with $\beta = 1$ (see App. C.1). We also discuss this assumption for distributions in the exponential family in Appendix C.2, notably for distributions that have been used in practice (Swaminathan & Joachims, 2015b; Kallus & Zhou, 2018; Zenati et al., 2020). Next we state our main result, namely the acceleration of the excess risk convergence rate and the regret upper bound of SCRM.

**Proposition 5.1.** Let $n_0, n \geq 2$ and $\theta^* \in \arg\min_\theta L(\theta)$. Let $n_m = n_0 2^m$ for $m = 0, \dots, M = \lfloor \log_2(1 + \frac{n}{n_0}) \rfloor$. Then, under Assumptions 4.1, 4.2 and 5.1 with $\beta > 0$, the SCRM procedure (Alg. 1) satisfies the excess risk upper bound

$$\Delta_M = L(\theta_M) - L(\theta^*) \leq O\!\left(n^{-\frac{1}{2-\beta}} \log n\right).$$

Moreover, the expected regret is bounded as follows:

$$R_n = \sum_{m=0}^{M-1} \Delta_m \, n_{m+1} \leq O\!\left(n^{\frac{1-\beta}{2-\beta}} \log(n)^2\right).$$

The proof of this result is detailed in Appendix B.3.

**Discussion.** This result shows that an excess risk of order $O\!\left(\frac{\log n}{n}\right)$ may be obtained when $\beta = 1$ (which is implied by a local version of the strong convexity assumption used in acceleration methods). When $\beta = 0$, which merely asserts that the variance of importance weights is bounded, we simply recover the original CRM rate of order $O(\log(n)/\sqrt{n})$. The SCRM procedure thus improves the excess risk rate whenever $\beta > 0$. It is worth emphasizing that knowledge of $\beta$ is not needed by Alg. 1. We also note that our assumption seems related to the Bernstein condition (Bartlett & Mendelson, 2006, see Def. 2.6; van Erven et al., 2015, see Def. 5.1), which bounds a variance term by an excess risk term raised to a power. In empirical risk minimization, this implies the same excess risk and regret rates (van Erven & Koolen, 2016), which match ours exactly (up to logarithmic factors).
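Putting the pieces together, the sketch below runs Algorithm 1 on the toy problem of Example 3.1 with the geometrically growing batch sizes $n_m = n_0 2^m$ used in Proposition 5.1, reusing the `scrm_step` helper from the previous sketch; the environment simulator, the fixed $\lambda$, the choice of $\alpha$, and the number of rollouts are illustrative assumptions, not the paper's experimental configuration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
THETA_STAR, SIGMA = 1.0, 0.3            # theta* and standard deviation from Example 3.1

def deploy(theta, n):
    """Deploy the Gaussian policy N(theta, SIGMA^2) and log one rollout."""
    a = rng.normal(theta, SIGMA, size=n)                        # sampled actions
    y = (a - rng.normal(THETA_STAR, SIGMA, size=n)) ** 2 - 1.0  # quadratic loss of Example 3.1
    prop = norm.pdf(a, loc=theta, scale=SIGMA)                  # logged propensities
    return a, y, prop

theta, n0, lam = 0.0, 100, 0.1          # initial logging model and illustrative constants
for m in range(8):                      # M rollouts
    n_m = n0 * 2 ** m                   # doubling batch sizes n_m = n_0 * 2^m
    data = deploy(theta, n_m)           # collect s_m with the current model theta_m
    theta = scrm_step(theta, data, lam, alpha=1.0 / n_m)        # learn theta_{m+1} via Eq. (8)
    print(f"rollout {m}: theta = {theta:.3f}")
```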
## 6. Empirical Evaluation

In this section we perform numerical experiments to validate our method in practical settings. We present the experimental setup as well as experiments comparing SCRM to related approaches and internal details of the method.

### 6.1. Experimental setup

As our method is able to handle both discrete and continuous actions, we experiment in both settings. We now provide a brief description of the setups, with extensive details available in Appendix D.2. All the code to reproduce the empirical results is available at https://github.com/criteo-research/sequential-conterfactual-risk-minimization.

**Continuous actions.** We perform evaluation on synthetic problems pertaining to personalized pricing from (Demirer et al., 2019) (Pricing) and advertising from (Zenati et al., 2020) (Advertising). We consider Gaussian policies $\pi_\theta(\cdot|x) = \mathcal{N}(\mu_\theta(x), \sigma^2)$ with linear contextual parametrization $\mu_\theta(x) = \theta^\top x$ and fixed variance $\sigma^2$, which corresponds to the exploration budget allowed in the original randomized experiment. The features have up to 10 dimensions and the actions are one-dimensional. We keep the original logging baselines from these settings and compare results to a skyline supervised model trained on the whole training data with full information.

**Discrete actions.** We adapt the setup of (Swaminathan & Joachims, 2015a) that transforms a multilabel classification task into a contextual bandit problem with a discrete, combinatorial action space. We keep the original modeling (akin to CRF) with categorical policies $\pi_\theta(a|x) \propto \exp(\theta^\top (x \otimes a))$. The baseline (resp. skyline) is a supervised, full-information model with the same parameter space as the CRM methods, trained on 5% (resp. 100%) of the training data. We consider the class of probabilistic policies that satisfy Assumption 5.1 by predicting actions in an epsilon-greedy fashion (Sutton & Barto, 1998): $\pi^\varepsilon_\theta(a|x) = (1-\varepsilon)\pi_\theta(a|x) + \varepsilon/|\mathcal{A}|$ where $\varepsilon = 0.1$. Real-world datasets include Scene, Yeast and TMC2007 with feature spaces of up to 30,438 dimensions and action spaces of size up to $2^{22}$. To account for this combinatorial action space, we allow a model $\theta_m$ to be learned using data from all past rollouts $\{s_l\}_{l \le m}$.

Table 1: Sample size needed to achieve a test loss $L(\theta) \leq p\, L(\theta^*)$ on the setting of Example 3.1, averaged over 10 random runs. The reported values are $100 \cdot 2^{22}$ for CRM and $100 \cdot 2^{8}$, $100 \cdot 2^{9}$, $100 \cdot 2^{11}$ for SCRM (ours) across the successive thresholds. SCRM needs far less data to converge to a near-optimal solution. $\lambda$ is set to its theoretical value.

|                 | Pricing      | Advertising | Yeast       | TMC2007     |
|-----------------|--------------|-------------|-------------|-------------|
| $\lambda^*$     | 5.353 ± .178 | .716 ± .020 | .294 ± .026 | .146 ± .012 |
| $\hat{\lambda}$ | 5.575 ± .036 | .726 ± .001 | .299 ± .039 | .164 ± .021 |

Table 2: Test loss after 10 rollouts when choosing $\lambda$ by a posteriori selection ($\lambda^*$) or with the proposed heuristic ($\hat{\lambda}$). Our heuristic is competitive with the a posteriori selection of a fixed $\lambda^*$.

attain near-optimal performance when $\theta^*$ is known, as in Example 3.1, where we also observe that SCRM reaches optimal performance faster than CRM. This corroborates the benefits of the improved excess risk rates of SCRM.

**Hyper-parameter selection for SCRM.** In our experiments, hyperparameter selection consists in choosing a value for $\lambda$. We describe a simple heuristic and evaluate its performance on different datasets. We propose to select $\hat{\lambda}_m$ by estimating the non-penalized CRM loss (Eq. 3) using offline cross-validation on past data $s_{t-1}$

weight functions $\omega_{t,m}(a) > 0$ which satisfy $\sum_{t=0}^{m} \omega_{t,m}(a) = 1$ for all $a$ and $m \in \{0, \dots, M\}$. The MIS estimator writes:

$$\hat{L}^{\text{MIS}}_m(\theta) = \sum_{t=0}^{m} \frac{1}{n_t} \sum_{i=1}^{n_t} \omega_{t,m}(a_{t,i})\, y_{t,i}\, w^\theta_{t,i}, \qquad w^\theta_{t,i} = \frac{\pi_\theta(a_{t,i}|x_{t,i})}{\pi_{t,i}}. \tag{12}$$

In multiple importance sampling we usually assume that the behavior distributions are independent. In our case, when we optimize θt based on the models θt−1, . . .
, θ0, we break this assumption. However, as we will see, we can still have the unbiasedness property and derive an estimator for the variance of the estimator. Proposition A.1 (Unbiasedness). The MIS estimator (12) is unbiased when the loss y is fixed (its distribution PY( |x, a) does not depend on time rollout m). Proof. Let m {1, . . . M}. We recall that at all rounds t < m, models θt Θ were deployed and sets st of nt observations st = (xt,i, at,i, lt,i, πt,i)i=1,...,nt were collected thereof, with propensities πt,i = πθt(at,i|xt,i) to learn the next model θt+1. To prove the unbiasedness we use the tower rule on the expectation and condition on previous observations s1, . . . st 1: E[ˆLMIS m (θ)] = i=1 Ex,θm,y ωt(a)ywθ t t=0 Ex,θm,y ωt(a)ywθ t t=0 Es1...st 1 Ex,θm,y ωt(a)ywθ t | s1 . . . st 1 t=0 Es1...st 1 [Ex,θ,y [ωt(a)y | s1 . . . st 1]] t=0 Ex,θ,y [ωt(a)y] = Ex,θ,y [y] where the second last line is true only when the distribution of y does not change over time roll-outs m. Among the proposals for functions ωt(a), the most naive and natural heuristic is to choose ωt(a) = nt Pm l=1 nl , (13) which gives the naive concatenation of all IPS estimators ˆLn-MIS m (θ) = 1 i=1 yt,i πθ(at,i|xt,i) πθt(at,i|xt,i), (14) where n = Pm t=0 nt. Sequential Counterfactual Risk Minimization With the previous definition of the empirical mean estimator, we can now derive an empirical variance estimator, starting with the naive multi importance sampling estimator. We write the random variable rm = (πθ/πθm)y. We note that for inside a batch m each realization of rm i = (πθ(am,i|xm,i)/πm,i)ym,i and rm j are independent. But the realizations of the random variables rm and rm are dependent. Writing n = Pm t=0 nt 1 p 0, the complexity of a class F by H (ε, F, n) = sup (xi,ai,yi) (X A Y)n H(ε, F {xi, ai, yi} , ) , (18) where F {xi, ai, yi} = f(x1, a1, y1), . . . , f(xn, an, yn) , f F Rn and the number H(ε, A, ) is the smallest cardinality |A0| of a set A0 A such that A is contained in the finite union of ε-balls centered at points in A0 in the metric induced by ). Then, Cm(Θ) is defined by Cm(Θ) = log H (1/nm, Fm,Θ, 2nm) . (19) Sequential Counterfactual Risk Minimization B.2. Variance-dependent excess risk bounds We will denote by Em[ ] = E[ |s0, . . . sm] the conditional expectation given the set of observation samples sm = (xm,i, am,i, ym,i, πm,i)i=1,...,nm up to the rollout m. Here, we recall that xm,i PX , am,i πθm( |xm,i), ym,i PY( |xm,i, am,i), and πm,i = πθm(am,i|xm,i). Furthermore, throughout the document, Ex,θm,y (resp. Varx,θm,y ]) denotes the expectation (resp. variance) in (x, a, y) where x PX , a πθm( |x), and y PY( |x, a). Proposition 4.1 (Generalization Error Bound). Let ˆLIPS-IX m and ˆV IPS-IX m be the empirical estimators defined respectively in Eq. (9) and Eq. (10). Let δ (0, 1), θ Θ, and nm 2 the number of samples associated to the logged dataset at round m. Then, with probability at least 1 δ, L(θ) ˆLIPS-IX m (θ) + λ ˆV IPS-IX m (θ) nm + 2λ2W where λ = p 18(Cm(Θ) + log(2/δ)). Proof. Let δ (0, 1) and θ Θ. Since all functions in Fm,Θ defined in Eq. (17) take values in [0, 1], we can apply the concentration bound of Maurer & Pontil (2009, Theorem 6) to the set Fm,Θ. This yields, with probability at least 1 δ/2, Ex,θm,y[fθ(x, a, y)] 1 i=1 fθ(xm,i, am,i, ym,i) 18 ˆVnm(fθ)(Cm(Θ) + log(2/δ)) nm + 15(Cm(Θ) + log(1/δ)) ˆVnm(fθ) = 1 nm 1 fθ(xm,i, am,i, ym,i) 1 j=1 fθ(xm,j, am,j, ym,j) 2 is an estimation of the sample variance. 
Let α > 0 and define the following biased estimator of the excess risk: Lα m(θ) = Ex,θm,y y πθ(a|x) πθm(a|x) + απθ(a|x) 1 θ Θ. (22) We recall that Ex,θm,y denotes the expectation in (x, a, y) where x PX , a πθm( |x), and y PY( |x, a). By construction of fθ (see Eq. (17)), Ex,θm,y[fθ(x, a, y)] = 1 i=1 fθ(xm,i, am,i, ym,i) = 1 W ˆLIPS-IX m (θ) 1 Wnm ˆVnm(fθ) = 1 W 2 ˆV IPS-IX m (θ) , where ˆLIPS-IX m and ˆV IPS-IX m are defined respectively in Eq. (9) and Eq. (10). Thus, multiplying (21) by W, substituting the above terms, and using λ = p 18(Cm(Θ) + log(2/δ)), yields Lα m(θ) ˆLIPS-IX m (θ) + 1 ˆV IPS-IX m (θ) nm + 15λ2W 18(nm 1), (23) with probability 1 δ/2. Now, let us decompose Lα m(θ) = Ex,θm,y y πθ(a|x) πθm(a|x) + απθ(a|x) 1 = Ex,θm,y y πθ(a|x) πθm(a|x) + απθ(a|x) But, since the losses y are bounded in [ 1, 0] almost surely, y πθ(a|x) πθm(a|x) + απθ(a|x) Sequential Counterfactual Risk Minimization which, substituted into the previous equation, entails, Lα m(θ) L(θ) L(θm). (24) Lower-bounding the left-hand side of (26), we thus get w.p 1 δ/2, L(θ) ˆLIPS-IX m (θ) λ ˆV IPS-IX m (θ) nm + 15λ2W 18(nm 1) + L(θm) 1 Using Em 1[ym,i] = L(θm) and applying Hoeffding s inequality, this further yields w.p. 1 δ L(θ) ˆLIPS-IX m (θ) + λ ˆV IPS-IX m (θ) nm + 15λ2W 18(nm 1) + Eventually, note that (nm 1) 1 (2/nm) since nm 2. Thus, L(θ) ˆLIPS-IX m (θ) + λ ˆV IPS-IX m (θ) nm + 2λ2W which concludes the proof. Proposition 4.2 (Conservative Excess Risk). Let m 0 and θm Θ. Let sm = (xm,i, am,i, ym,i, πm,i)1 i nm be a set of samples collected with am,i πθm( |xm,i). Then, under Assumptions 4.1 and 4.2, the solution θm+1 of Problem (8) with the IPS-IX estimator in Eq. (11) on the samples sm satisfies the excess risk upper-bound m+1 = L(θm+1) L(θ ) d log(nm) + log(1/δ) nm ν2m + W 2 + W(d log(nm) + log(1/δ)) where ν2 m = Varx,θm Proof. We consider the notations of the proof of Proposition 4.1. Fix θ Θ. Applying, Theorem 15 of (Maurer & Pontil, 2009)2 to the function set Fm,Θ defined in (17), we get with probability 1 δ Ex,θm,y[fθm+1(x, a, y)] Ex,θm,y[fθ (x, a, y)] 32Varx,θm,y fθ (x, a, y) Cm(Θ) + log 30 nm + 22 Cm(Θ) + log 30 This can be written as: m U m, (28) with the following definitions: m = Ex,θm,y[fθm+1(x, a, y)] Ex,θm,y[fθ (x, a, y)] 32Varx,θm,y fθ (x, a, y) Cm(Θ) + log 30 nm + 22 Cm(Θ) + log 30 nm 1 . (29) 2Note that in their notation, log Mn(π) equals Cm(Θ) + log(10), X is the dataset {(xi, ai, yi)}1 i n where (xi, ai, yi) i.i.d. PX πθm( |x) PY( |a, x), and P( , µ) is the expectation with respect to one test sample Ex,θm,y[ ]. Sequential Counterfactual Risk Minimization Step: Lower bounding m Using the definition of fθ(x, a, y) in (17) and that of Lα m in Eq. (22), we have Ex,θm,y[fθm+1(x, a, y)] = 1 W Lα m(θm+1). Thus, m can be re-written as W (Lα m(θm+1) Lα m(θ )) , which we now lower-bound. To do so, we begin by upper-bounding Lα m(θ ). It can be expressed as Lα m(θ ) = Ex,θm,y y πθ (a|x) πθm(a|x) + απθ (a|x) L(θm). (30) To shorten notation, from now on and throughout this proof, we write πθ instead of πθ(a|x), omitting the dependence on a and x. Using the inequality (1 + x) 1 1 x for x 0, we have y πθ πθm + απθ = L(θ ) αEx,θm,y L(θ ) + αW 2 , (32) where the last inequality is by Assumption 4.1 and because y [ 1, 0]. Together with (30), we get Lα m(θ ) L(θ ) + αW 2 L(θm). We recall that L(θm+1) L(θm) Lα m(θm+1) by Eq.(24). Therefore, 1 W (L(θm+1) L(θ ) αW 2) 1 W (Lα m(θm+1) Lα m(θ )) , which finally gives 1 W (L(θm+1) L(θ ) αW 2) m. 
(33) Step: Upper bound U m By definition of fθ(x, a, y) in (17), we have Varx,θm,y fθ (x, a, y) = 1 W 2 Varx,θm,y y πθ πθm + απθ 1 1 W 2 Ex,θm,y y2 πθ πθm + απθ 1 2# 1 W 2 Ex,θm " πθ πθm + απθ 1 2# Then, using the inequality (x + y)2 2x2 + 2y2, for x, y R, this may be upper-bounded as Varx,θm,y fθ (x, a, y) 2 W 2 Ex,θm " πθ πθm + απθ Ex,θm πθ πθm + απθ πθ πθm + απθ On the one hand, the first term of the right-hand side may be upper-bounded as " πθ πθm + απθ Ex,θm πθ πθm + απθ πθ πθm + απθ Sequential Counterfactual Risk Minimization where ν2 m = Varx,θm h πθ πθm i . On the other hand, for the second term, we use the same factorization as in Eq. (31) to get πθ πθm + απθ which yields the upper-bound πθ πθm + απθ 1 2 α2 Ex,θm Therefore, substituting the last two upper-bounds into (34) entails Varx,θm,y fθ (x, a, y) 2 W 2 ν2 m + α2W 2 . Then, replacing this upper-bound into the definition of U m in (29) and using Assumption 4.2 to upper bound the terms in Cm(Θ) d log(nm), we obtain the following upper-bound 64(ν2m + α2W) d log(nm) + log 30 nm + 22 d log(nm) + log 30 64(ν2m + α2W) d log(nm) + log 30 nm + 44 d log(nm) + log 30 where the last inequality is because nm 2. Step: excess risk upper bound Setting α = 1 nm and using the two previous bounds (33) and (35) respectively on m and on U m into (28), we get L(θm+1) L(θ ) 64 d log(nm) + log 30 n2m W 2 + W 44 d log(nm) + log 30 nm W 2. (36) b, we have that s 64 d log(nm) + log 30 64 d log(nm) + log 30 64 d log(nm) + log 30 Then, since nm 2 and δ < 1, we have d log(nm) + log(30/δ) log(2) + log(30) 4, which yields 64 d log(nm) + log 30 32 d log(nm) + log 30 8 d log(nm) + log 30 Substituting the last two inequalities into (36) finally entails L(θm+1) L(θ ) 8 d log(nm) + log 30 δ nm ν2m + 47W d log(nm) + log 30 which concludes the proof. B.3. Regret analysis Proposition 5.1 (Regret upper-bound). Let n0, n 2 and θ arg minθ L(θ). Let nm = n02m for m = 0, . . . , M = log2(1+ n n0 ) . Then, under Assumptions 4.1, 4.2 and 5.1, the SCRM procedure (Alg. 1) satisfies the excess risk upper-bound L(θM) L(θ ) O n 1 2 β log n . Sequential Counterfactual Risk Minimization Moreover, the expected regret is upper-bounded as follows: m=0 nm+1 L(θm) L(θ ) O n 1 β 2 β log(n)2 . Proof. First, note that for nm = n02m and M = log2(1 + n n0 ) , we have PM 1 m=0 nm = n0(2M 1) n. Hence, Alg. 1 has collected at most n samples to design the estimator θM. For m 0, we recall m = L(θm) L(θ ) and use Eq. (37) to write d log(nm) + log 30 δ nm ν2m + 47W d log(nm) + log 30 d log(n) + log 30 δ nm ν2m + 47W d log(n) + log 30 where C = 8 q d log(n) + log 30 δ and B = W 2 + 47W(d log(n) + log 30 δ ) are independent of m. Step: Obtaining a recurrence relation for m+1 By Assumption 5.1, there exist γ > 0 and β [0, 1] such that ν2 m = Varx,θm γ L(θm) L(θ ) β = β m γ . Replacing ν2 m in Eq. (38) thus entails 1 γ β m nm + B γ β m + B2 mn0 nm = n02m 2 β/2 m + B2 mn0 . (39) Step: Solving the recurrence relation for m We then insure by induction that m satisfies m 2 β , (40) for some c0 > 0 that will be specified by the analysis. Base step Since losses take values in [ 1, 0], 0 = L(θ0) L(θ ) 1. Equation (40) is thus satisfied for m = 0 as soon as c0 1. Induction step Let m 0. We assume that m c02 m 2 β and prove Equation (40) for m+1. Using Eq. (39), we have 2 β/2 m + B2 mn0 2 c0 β/22 mβ β(2 β) + B2 mn0 by induction max n 2C rn0 2 m 2 β , 2B2 mn0 o . 
(41) Sequential Counterfactual Risk Minimization Now, we show that both terms inside the maximum can be upper-bounded by c02 (m+1)/(2 β) as soon as c0 is large enough. On the one hand, if c0 4Bn0, we have 2B2 mn0 c02 (m+1) c02 m+1 On the other hand, if c0 (4C2n0/γ)1/(2 β), we also have 2 m 2 β 2C rn0 2 β c02 m+1 Combining the above two upper-bounds with (41) concludes the induction step under the condition c0 max 1, 4C2n0 1 2 β , 4Bn0 Step: conclusion Finally, setting the above value for c0 we proved that for all m 0, we have m max 1, 4C2n0 1 2 β , 4Bn0 1 2 β + 4Bn0 = 1 + 256(d log n + log 30 1 2 β + W 2n0 + 47Wn0 d log n + log 30 2 m 2 β , (42) where the last equality is by substituting the values of B and C from (38). For the final step M = log2( n n0 + 1) , this yields M 1 + 256(d log n + log 30 1 2 β + W 2n0 + 47Wn0 d log n + log 30 2 1 + 256(d log n + log 30 1 2 β + W 2n0 + 47Wn0 d log n + log 30 = O n 1 2 β log n . This concludes the first part of the proof. Regret upper-bound To upper bound the cumulative regret, using nm+1 = n02m+1, we write m=0 mnm+1 (42) D m=0 2 m 2 β nm+1 = 2Dn0 m=0 2 1 β 2 β m , D = 1 + 256(d log n + log 30 1 2 β + W 2n0 + 47Wn0 d log n + log 30 Then, computing the sum for M = log2( n n0 + 1) , we have m=0 2 1 β 2 β m 2Dn0(M + 1)2 1 β 2 β M 2Dn0 1 + log2 n n0 + 1 1 + n Using that D = O(log n), we finally obtain 1 β 2 β log(n)2 . Sequential Counterfactual Risk Minimization C. Additional discussions on the H olderian Bound Assumption 5.1 In this appendix, we discuss Assumption 5.1 on different particular examples. C.1. Verification of the assumption on a toy example with Gaussian families We consider the setting of Example 3.1. In the latter, the policies are Gaussian of the form πθ = N(θ, σ2) and the loss is defined by lt(a) = (a yt)2 1 where yt N(θ , σ2). There is no loss in generality in assuming σ2 = 1. Then, we can compute L(θ) L(θ ) = (θ θ )2 and Varθ = exp (θ θ)2 1 . We recall that we are interested in verifying the existence of γ > 0 and β > 0 for which Assumption 5.1 holds, that is in this case for any θ Θ: L(θ) L(θ ) β , (43) which may be re-written here as γ exp (θ θ)2 1 (θ θ )2β . The latter is satisfied for any β 1 as soon as Θ is a bounded interval. Note that the constant γ may decrease exponentially fast as the diameter of Θ increases. To illustrate, the existence of such couples (β, γ), we plot in Fig. 8 different values of the following ratio R(θ, β) = Varθ L(θ) L(θ ) β = exp (θ θ)2 1 θ θ 2 β . (44) The value of γ can be found for different values of β in Fig. 8 by taking 1 γ = maxθ R(θ, β). Higher values of β induce 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 Figure 8: Ratio R defined in (44) with different values of β. faster rates and lower values of γ induce worst constant terms in the excess risk and regret bounds. Eventually, note that SCRM does not need those parameters to run and those two parameters γ, β are automatically calibrated by SCRM to find the best trade-off. C.2. Discussion of Assumption 5.1 for Exponential Families In this section, we consider a more realistic example in which policies belong to an exponential family. That is, we assume that the policies are parameterized by a parameter η Rq and can be written in the form: a A, πη(a) = eη t(a) A(η)h(a), for some known function h : A R+ and sufficient statistic t : A Rq. Here, A(η) is a normalization constant, so that e A(η) = R a eη t(a)h(a) da. 
We provide in Example C.1 a concrete example considered by (Swaminathan & Joachims, Sequential Counterfactual Risk Minimization 2015a; Faury et al., 2020). To ease the notation, we removed here the dependency on contexts, but the generalization to contextual policies can be made similarly. The importance weight ratio may be written as, πη(a) πηm(a) = e(η ηm)t(a) (A(η) A(ηm)). (45) To verify Assumption 5.1, we need to upper bound their variance, which we shall write as, = e2(A(ηm) A(η))Vara πηm h e(η ηm)t(a)i . Now, computing the moment generating function (MGF) of the statistic t(a) Rq Mt(s) = E h es t(a)i = Z a es t(a)eηm t(a) A(ηm)h(a) da = e A(ηm) Z a e(ηm+s) t(a)eηm t(a)h(a) da = e A(ηm+s) A(ηm), the variance term may be written as Vara πηm h e(η ηm)t(a)i = Mt(2(η ηm)) M 2 t (η ηm) = e A(2η ηm) A(ηm) e2(A(η) A(ηm)) . This eventually leads us to = e A(2η ηm)+A(ηm) 2A(η) 1. (46) We now discuss two cases that are used for discrete actions (Swaminathan & Joachims, 2015a) and continuous actions (Kallus & Zhou, 2018; Zenati et al., 2020). Bounded sufficient statistic Supposing that there exists an upper bound A such that t(a) A, Cauchy-Schwartz inequality states that |(η ηm) t(a)| η ηm A, which entails = e A(2η ηm)+A(ηm) 2A(η) 1 a e(2η ηm) t(a)h(a) da R a eηm t(a)h(a) da R a eη t(a)h(a) da 2 1 a e(η ηm) t(a)eη t(a)h(a) da R a e(ηm η) t(a)eη t(a)h(a) da R a eη t(a)h(a) da 2 1 e η ηm A 1 . Assuming that the parameter space is compact, i.e, maxη,η η η D, there exists a constant C that depends on A and D such that, this may be further upper-bounded as Therefore, Assumption 5.1 is implied by γC η ηm 2 (L(θ) L(θ ))2β . The latter is implied by a local version of strong convexity for β = 1/2 (d Aspremont et al., 2021), and holds with γ = C 1D 2 for β = 0. Example C.1. For discrete actions A = {a1, . . . , a K}, we consider, as in (Swaminathan & Joachims, 2015a) and (Faury et al., 2020), policies where given a context x, probabilities pi(x) of sampling an action ai are given by pi(x) = exp(θ ϕ(x, ai)) PK j=1 exp(θ ϕ(x, aj)) . (47) Sequential Counterfactual Risk Minimization The function ϕ is typically a feature map associated to a kernel in a RKHS. In this case, the natural parameter η and the sufficient statistic t(a) may be written as p K ) ... log( p K 1 1{a = a1} ... 1{a = a K} Lognormal and Normal distributions For normal N(µ, σ2) and lognormal Lognormal(µ, σ2) distributions with fixed variance σ2 as considered by (Kallus & Zhou, 2018; Zenati et al., 2020), the normalizing constant writes A(η) = η2 2 , and we then obtain that: A(2η ηm) + A(ηm) 2A(η) = (η ηm)2 , which gives: = e η ηm 2 1 . In that case, it is again possible for a bounded parameter space to linearize e η ηm 2 1 η ηm 2, consider losses that verify: for all η, there exists an optimal η such that γ ηm η 2 L(ηm) L(η ) β. (49) Again, this holds generally for β = 0 and for locally strongly convex losses for β = 1. D. Experiment details All the code to reproduce figures and tables is available in the following repository: https://github.com/ criteo-research/sequential-conterfactual-risk-minimization. D.2. Empirical settings details Pricing The pricing application in (Demirer et al., 2019) considers a personalized pricing setting where given contexts x, prices p (which are the actions) need to be predicted to maximize the revenue: r(x, p) = p(a(x) b(x)p + ε) where ε N(0, 1) and d = a(x) + b(x)p + ε is akin to an unknown context-specifidemand function. The data generating process uses contexts x [1, 2]k for k > 1 a positive integer. 
Only l < k dimensions however affect the demand, that is if we write x = 1 l (z1, . . . , zl). The price p is generated from a Gaussian logging policy p N( x, 1) centered in x. We consider in our example the quadratic functionnal a(x) = 2x2 and b(x) = 0.6x as in the original paper. Advertising The advertising simulation in (Zenati et al., 2020) consists in predicting the potential p ]0, + [ of a user that may be compared to their a priori responsiveness to a treatment. The potential is caused by an unobserved random group variable g in G (groups of high or low potential users in their responsiveness) that influences context x of users. The goal is then to find a policy π(a|x) that maximizes reward by adapting to an unobserved potential. The potentials are normally distributed conditionally on the group index, p|g N(µg, σ2 g) where σg = 0.5 and µg = 1 or 3 for two groups. The observed reward y is then a function of the action a and the context x through the associated potential px of the user x. The reward function mimics reward over the offline continuous bidding dataset in (Zenati et al., 2020) with the form: rl(px, a) = a px if a < px 1 2(px a) + 1 else r(px, a) = max(rl(px, a), 0.1) Sequential Counterfactual Risk Minimization The logging policy is a lognormal distribution as it is common in advertising applications (Bottou et al., 2013). In particular, as in (Zenati et al., 2020), πθ0 = Lognormal(µ, σ2) where the mean exp(µ + σ2/2) = 2 and the variance (exp(σ2) 1) exp(2µ + σ2) = 1. Yeast, Scene, TMC2007 We follow (Swaminathan & Joachims, 2015a). We now recall briefly the setup. The problem is a binary multilabel classification with |A| = 2K potential labels. All models are parametrized by πθ(a|x) exp(θ (x N a)). The baseline (resp. skyline) is a supervised, full information model with identical parameter space than CRM methods trained on 5% (resp. 100%) of the training data. Our main modification it to consider the class of probabilistic policies that satisfy Assumption 5.1 by predicting actions in an Epsilon Greedy fashion (Sutton & Barto, 1998)): πε θ(a, x) = (1 ε)πθ(a, x) + ε/|A| where ε = .1. The loss is the Hamming loss (number of incorrectly assigned labels - both false positives and false negatives in the action vector): L(θ) = 1 n K j=1 1[yj i =aj i ] (50) where yj i (resp. aj i) is the j-th component of the label vector (resp. action vector) of line i. A uniform policy will thus evaluate at a loss of .5. D.3. Implementation details Counterfactual methods In this paragraph we start by detailing the non adaptive counterfactual risk minimization that we compare to in this work. Algorithm 2 Counterfactual Risk Minimization Input: Logged observations (x0,i, a0,i, y0,i, π0,i)i=1,...,n0, parameter λ > 0 for m = 1 to M do Build Lm from observations sm using Eq. (11) Learn θ using Eq. (8) Re-deploy the logging model θ0 and collect observations sm+1 = (xm+1,i, am+1,i, lm+1,i, πm+1,i)i=1,...,nm+1 end We also provide the grid of hyperparameters for the λ evaluated in CRM and SCRM methods λ [1e 5, 1e 4, 1e 3, 1e 2, 1e 1]. Batch Bandits Let k : (X A) (X A) R be a bounded positive definite Kernel associated to a RKHS H, ϕ : X A H is the feature map such that k(s, s ) = ϕ(s), ϕ(s ) for any s, s X A. Context-actions pairs are written as sm,i := (xm,i, am,i) X A and Sm := {s1,0, . . . , snm,m} denoting the history of all context-actions pairs seen up until the end of batch m. Km is the kernel matrix of all context-actions seen until the end of the batch m 1. 
Eventually, KS(s ) is the kernel column vector [k(s1, s ), . . . , k(sl, s )] of size |S| = l. Ym = [ y0,1, y0,n0, ym,1, ym,nm] denotes the vector of concatenated rewards observed up until the end of the batch m. At a batch m, a context xm,i is sampled for i {1, nm}, and then to sample an action a, the following decision rule is applied: a arg max a A ˆqm,i,a. (51) In batch Kernel UCB, ˆqm,i,a is defined as ˆqm,i,a = ˆmm,i,a + βmˆσm,i,a, (52) ˆµm,i,a = KSt 1 (xm,i, a) K 1 m 1Ym 1 ˆσ2 m,i,a = 1 λk (xm,i, a), (xm,i, a) 1 λKSm 1 (xm,i, a) K 1 m 1KSm 1 (xm,i, a) , Sequential Counterfactual Risk Minimization and βm is a theoretical parameter that is set to βm = 1 m in practical heuristics (Lattimore & Szepesvari, 2019). In SBPE (Han et al.), ˆqm,i,a is defined directly as ˆqm,i,a = KSt 1 (xm,i, a) K 1 m 1Ym 1. (53) Algorithm 3 Batch bandit - SBPE (Han et al.) and Kernel UCB (Valko et al., 2013) Input: Logged observations (x0,i, a0,i, y0,i, π0,i)i=1,...,n0, λ regularization and exploration parameters, k the kernel function initialization Kλ = [k(s0,i, s0,j)]1 i,j n0 + λI, Y0 = [ y0,i]1 i n0 for m = 1 to M do for i = 1 to nm do Observe context xi,m Choose ai,m arg maxa A ˆqm,i,a using Eq. (53) or (52) end Observe losses yi,m for all i in past batch {1, . . . , nm} Update Ym [ y0,1, y0,n0, ym,1, ym,nm] Update the translated gram matrix Kλ [k(si,p, sj,p)]1 i,j np,1 p m + λI SBPE (Han et al.) uses a linear modelling, therefore we used a linear kernel. For the Kernel UCB (Valko et al., 2013) method, we used Gaussian and Polynomial kernels in our experiments. Note also that no regularization parameter λ is used in SBPE so we set λ = 0 in our experiments, and for K-UCB we chose λ in the grid [1e0, 1e1, 1e2]. Note in particular that we adapted the batch bandit baselines to the CRM setting by benefiting the initialization with the logged dataset to set the gram matrix Kλ as well as the reward vector Y0 with information from the logging data. This modification changes the original methods which take random actions at initializations. Eventually, the baselines were carefully optimized using the Jax library (https://github.com/google/jax) to allow for just in time compilations of algebraic blocks in both methods and to maximize their scaling capacity. RL baselines In order to compare our method to the two known off-policy online RL algorithm PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015), we do the following: 1. we use the stable_baselines3(Raffin et al., 2021) library for the implementation. When necessary we call multiple times the model PPO or TRPO, to have buffer size of geometrical increase. 2. we initialize the Actor Critic Policy with a simpler MLP model having only one layer with output dimension of 1, (with argument net_arch= [1], that is mathematically the same modelling as in CRM and SCRM baselines). 3. At the initial step only and to enable a fair comparison with counterfactual methods using a logging dataset, we pretrain the RL policies to imitate the actions sampled from the logging policy: we process by multiple step of the Adam optimizer, minimizing a loss being the sum of 2 terms: a MSE term between the sampled action of the Actor Critic Policy for the contexts in the n0 instances, and the actions sampled by the logging policy. the ENTROPY term guaranteeing to keep a minimum of exploration in order to initialize the RL algorithm ( P pi log(pi)) 4. we combine the 2 last terms with a linear combinaison with hyperparameters being tuned a posteriori, i.e. 
LOSS = MSE + λ ENTROPY with the hyperparam λ {.5, 1, 2, 5, 10} Sequential Counterfactual Risk Minimization E. Additional empirical results E.1. SCRM compared to CRM We provide here the additional plot in the Pricing setting. 2 4 6 8 10 Rollouts m 2 4 6 8 10 Rollouts m Advertising Figure 9: Test loss as a function of sample size on Pricing, Advertising (from left to right). E.2. Evaluation of IPS-IX We provide here the plots for the whole setting considered in policy evaluation with IPS-IX. 0 1 2 3 Shift 0 ips ips_clipped snips ips_ix 0 1 2 3 Shift 0 ips ips_clipped snips ips_ix 0 1 2 3 Shift 0 ips ips_clipped snips ips_ix Figure 10: Comparison of IPS estimators on a Cosine reward and series of shifted Gaussian policies. Setup (left), Bias (middle left), Variance (middle right), Average IPS weight (right). IPS-IX shows a low bias and compares favorably to IPS and SNIPS in terms of variance. E.3. Exploration/Exploitation tradeoff In this part we give the details used for the experiment described in Section 6.3. We consider again Example 3.1 with the Gaussian parametrized policies πθ = N(θ, σ2) and a loss lt(a) = (a yt)2 1 where yt N(θ , σ 2) with σ = 0.3. Recall that πθ0 = N(θ0, σ). We consider a grid of σ [0.1, 0.3, 1, 3] and consider θ = 1. Our experiment aims at illustrating the influence of sequential exploration that is an important detail of the SCRM and CRM principles.
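A minimal, self-contained way to reproduce the spirit of this exploration experiment is to sweep the policy standard deviation over the grid above and run the SCRM loop for each value; the optimizer, seeds, number of rollouts, and fixed $\lambda$ below are illustrative choices rather than the settings used for the paper's figures.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
THETA_STAR, SIGMA_LOSS = 1.0, 0.3       # theta* and the loss noise std sigma' = 0.3

def objective(theta, a, y, prop, sigma_policy, lam, alpha):
    """Penalized IPS-IX objective (Eq. 11) for the Gaussian policy N(theta, sigma_policy^2)."""
    target = norm.pdf(a, loc=theta, scale=sigma_policy)
    ratio = target / (prop + alpha * target)
    zeta = (ratio - 1.0) * y
    return np.mean(ratio * y) + lam * np.sqrt(zeta.var(ddof=1) / len(y))

def run_scrm(sigma_policy, n0=100, n_rollouts=8, lam=0.1, theta0=0.0):
    theta = theta0
    for m in range(n_rollouts):
        n_m = n0 * 2 ** m
        a = rng.normal(theta, sigma_policy, size=n_m)            # rollout s_m
        y = (a - rng.normal(THETA_STAR, SIGMA_LOSS, size=n_m)) ** 2 - 1.0
        prop = norm.pdf(a, loc=theta, scale=sigma_policy)
        res = minimize(objective, x0=np.array([theta]),
                       args=(a, y, prop, sigma_policy, lam, 1.0 / n_m),
                       method="Nelder-Mead")
        theta = float(res.x[0])
    return theta

for sigma in [0.1, 0.3, 1.0, 3.0]:                               # exploration grid of E.3
    theta_final = run_scrm(sigma)
    print(f"sigma={sigma}: theta_M={theta_final:.3f}, "
          f"squared distance to theta*={(theta_final - THETA_STAR) ** 2:.4f}")
```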