# Control Variates for Slate Off-Policy Evaluation

Nikos Vlassis (Netflix), Ashok Chandrashekar (Warner Media), Fernando Amat Gil (Netflix), Nathan Kallus (Cornell University and Netflix)

(The author was with Netflix when this work was concluded.)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

## Abstract

We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions, often termed slates. The problem is common to recommender systems and user-interface optimization, and it is particularly challenging because of the combinatorially-sized action space. Swaminathan et al. (2017) have proposed the pseudoinverse (PI) estimator under the assumption that the conditional mean rewards are additive in actions. Using control variates, we consider a large class of unbiased estimators that includes as specific cases the PI estimator and (asymptotically) its self-normalized variant. By optimizing over this class, we obtain new estimators with risk improvement guarantees over both the PI and the self-normalized PI estimators. Experiments with real-world recommender data as well as synthetic data validate these improvements in practice.

## 1 Introduction

Online services (news, music and video streaming, social media, e-commerce, app stores, etc.) serve content that often takes the form of a combinatorial, high-dimensional slate, where each slot on the slate can take multiple values. For example, a personalized news service can have separate slots for local news, sports, politics, etc., with multiple news items from which to select in each slot (Li et al., 2010); or a video streaming service may organize the contents of the landing homepage of a user as a multi-slot slate, where each slot can contain shows from a given category (Gomez-Uribe and Hunt, 2015). Typically, the offered content is personalized via machine learning and A/B testing. However, the number of eligible items per slot may be in the hundreds or thousands, which makes the testing and optimization of personalized policies a challenging task.

An alternative to A/B testing and online learning is off-policy evaluation (OPE). In OPE we use historical data collected by a past policy in order to evaluate a new candidate policy. This is primarily an estimation problem, often grounded in causal assumptions about the data-generating process and its connections to a counterfactual one (Horvitz and Thompson, 1952; Robins and Rotnitzky, 1995; Dudík et al., 2011; Bottou et al., 2013; Athey and Wager, 2017; Bibaut et al., 2019; Kallus and Uehara, 2019). However, even with plentiful off-policy data, handling a combinatorial action space can be challenging: unbiased importance sampling (IS) (Horvitz and Thompson, 1952) suffers variance on the scale of the cardinality of the action space under a uniform logging policy and a deterministic target policy. To address this, Swaminathan et al. (2017) have proposed the pseudoinverse (PI) estimator, which significantly attenuates the variance and remains unbiased for decomposable reward structures.

In this paper, we discuss how to further reduce this variance using a control variates approach (Glynn and Szechtman, 2002). We demonstrate that the self-normalized PI (wPI) estimator, which Swaminathan et al. (2017) found to be superior to PI, is asymptotically equivalent to adding a control variate. However, that choice of control variate is not optimal. We instead show how to choose optimal control variates, and even how to do so without incurring any bias. We provide strong empirical evidence based on the MSLR-WEB30K data from the Microsoft Learning to Rank Challenge (Qin and Liu, 2013) that confirms the improvement of our approach over the PI and wPI estimators.
### 1.1 Related work

Much of the existing OPE work involves the use of doubly-robust estimators and variants, which add control variates to the standard IS estimator (Robins and Rotnitzky, 1995; Dudík et al., 2011; Thomas and Brunskill, 2016; Farajtabar et al., 2018; Bibaut et al., 2019; Kallus and Uehara, 2019; Su et al., 2020). Directly optimizing control variates for OPE has been suggested earlier (Farajtabar et al., 2018; Vlassis et al., 2019), but not for problems involving high-dimensional slates.

Combinatorial bandits and variants such as semi-bandits have been studied by Cesa-Bianchi and Lugosi (2012) and Kveton et al. (2015) and are also covered in the book of Lattimore and Szepesvári (2020). Earlier approaches to OPE for contextual combinatorial bandits have been restricted to small slate problems (Li et al., 2011) or relied on accurate parametric models of slate rewards (Chapelle and Zhang, 2009; Guo et al., 2009; Filippi et al., 2010). Li et al. (2015) and Wang et al. (2016) have suggested the use of partial matches between the logging and target actions in order to mitigate the variance explosion.

Swaminathan et al. (2017) have proposed the PI estimator, a version of which we study in this paper. Under factored policies, the PI estimator is similar to an estimator proposed by Li et al. (2018) for OPE of ranking policies with user click models. Swaminathan and Joachims (2015) and Joachims et al. (2018) have developed estimators for OPE under the framework of counterfactual risk minimization and discuss approaches for offline policy optimization. Agarwal et al. (2018) have developed an OPE method for rankers, when only the relative value of two given rankers is of interest. Chen et al. (2019) have developed an approach for offline policy gradients, when slates are unordered lists of items (of random size). McInerney et al. (2020) and Lopez et al. (2021) have studied the problem of slate OPE with semi-bandit feedback (one observed reward per slot), under different assumptions on the structure of the rewards and the contextual policies. In contrast to the latter two works, and similar to Swaminathan et al. (2017) and Su et al. (2020), we study the slate OPE problem when only a single, slate-level reward is observed. Su et al. (2020) achieve variance reduction by clipping propensities, whereas we achieve variance reduction by optimizing control variates.

Applications of slate OPE in real-world problems include Sar Shalom et al. (2016), who apply OPE to a large-scale real-world recommender system that handles purchases from tens of millions of users; Hill et al. (2017), who present a method to optimize a message that is composed of separate widgets (slots) such as title text, offer details, image, etc.; and McInerney et al. (2020), who apply OPE to a slate problem involving sequences of items such as music playlists.

## 2 Setup and Notation

We consider contextual slate bandits where each slate has $K$ slots. We will use $[K]$ to denote the set $\{1, 2, \ldots, K\}$. The available actions in slot $k$ are $[d_k] = \{1, \ldots, d_k\}$. Thus, the set of available slates has cardinality $\prod_{k=1}^K d_k$. The data consists of $n$ independent and identically distributed (iid) triplets $(X_i, A_i, R_i)$, $i = 1, \ldots, n$, representing the observed context, the slate taken, and the resulting reward, respectively. Here, $A_i = (A_{i1}, \ldots, A_{iK})$ with $A_{ik} \in [d_k]$.
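To make the notation concrete, here is a minimal sketch (in Python with NumPy, my own illustration rather than the paper's code) of how such a logged slate dataset can be laid out, using toy sizes and 0-indexed actions in place of the paper's $1, \ldots, d_k$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, [5, 2, 4]        # toy sample size and slot cardinalities d_1, ..., d_K
K = len(d)

X = np.arange(n)                                                  # context identifiers (placeholder)
A = np.column_stack([rng.integers(0, dk, size=n) for dk in d])    # A[i, k]: logged action in slot k
R = rng.uniform(size=n)                                           # one slate-level reward per row

print(A.shape)  # (n, K): row i is the logged slate A_i = (A_i1, ..., A_iK)
```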
We use $(X, A, R)$ to refer to a generic draw of one of the above triplets, where $X \sim P(X)$, $A \sim \mu(\cdot \mid X)$, $R \sim P(\cdot \mid A, X)$. The distribution $\mu(a \mid x) = P(A = a \mid X = x)$ is called the logging policy, as the data can be seen as the logs generated by executing this policy. Probabilities $P$, expectations $\mathbb{E}$, and (co)variances without subscripts indicating otherwise are understood to be with respect to this distribution. For any function $f(x, a, r)$ we will write $\mathbb{E}_n[f(X, A, R)] = \frac{1}{n}\sum_{i=1}^n f(X_i, A_i, R_i)$. We will also follow the convention that random, data-driven objects carry a hat and a subscript $n$ (such as $\hat{w}_n$ or $\hat{\theta}_n$).

In OPE, we are given a fixed policy $\pi(a \mid x)$ and are interested in estimating its average reward $\mathbb{E}_\pi[R]$ using data collected under a logging policy $\mu(a \mid x)$. Here $\mathbb{E}_\pi$ refers to expectation under the distribution $X \sim P(X)$, $A \sim \pi(\cdot \mid X)$, $R \sim P(\cdot \mid A, X)$. The quantity $\mathbb{E}_\pi[R]$ can be interpreted as a counterfactual given that $P(\cdot \mid A = a, X)$ is equal to the distribution of the potential reward of slate $a$ given context $X$, for every $a$; or, in other words, that $P(\cdot \mid A, X)$ represents a structural model. This assumption, known as unconfoundedness or ignorability, is only needed for giving a causal interpretation to $\mathbb{E}_\pi[R]$, but not for estimation.

The classical IS estimator for the problem of estimating $\mathbb{E}_\pi[R]$ is $\mathbb{E}_n\!\left[\frac{\pi(A \mid X)}{\mu(A \mid X)}\, R\right]$, which is unbiased but can suffer unacceptably large variance when $A$ is high-dimensional and $\mu$ provides coverage of all slates, as is necessary for the identification of arbitrary policies. Generally, we cannot do much better than IS in the worst case (Wang et al., 2017), unless we impose additional assumptions about the structure of actions or rewards.

In this paper, we will focus on the case in which the logging policy is factored, that is, $\mu$ satisfies $\mu(a \mid x) = \prod_{k=1}^K \mu_k(a_k \mid x)$. This generally holds if $\mu$ is given by a per-slot $\epsilon$-greedy policy or is the uniform distribution, both of which are common in practice. We define the following slot-level density ratios and their sum:

$$Y_k = \frac{\pi(A_k \mid X)}{\mu_k(A_k \mid X)}, \qquad G = 1 + \sum_{k=1}^K (Y_k - 1). \tag{1}$$

Then, we have the following reformulation result for $\mathbb{E}_\pi[R]$. (Succinct proofs of this result, as well as of all our novel results, can be found in Section 9 of this paper and in the Supplementary Text.)

**Lemma 1** (Reformulation under additive rewards; Swaminathan et al., 2017). Under a factored $\mu$, $\mathbb{E}_\pi[R] = \mathbb{E}[GR]$ whenever $\mathbb{E}[R \mid A, X] = \sum_{k=1}^K \varphi_k(A_k, X)$ for some (latent) functions $\varphi_k(a_k, x)$.

Motivated by this result, in this paper we focus on estimating the target parameter $\theta = \mathbb{E}[GR]$. Crucially, Lemma 1 implies that under additive rewards, the variance of an estimator of $\theta$ need only scale with the size of the slot-level density ratios (namely, $\sum_{k=1}^K d_k^2$ for uniform exploration and deterministic evaluation) rather than with the possibly astronomical size of the slate-level density ratio appearing in the IS estimator (namely, $\prod_{k=1}^K d_k^2$ under the same settings). Moreover, Lemma 1 shows that $\mu$ need not even cover every slate combination, only (marginally) every action in each slot. In the rest of the paper we will not assume any special structure for $\mathbb{E}[R \mid A, X]$; our estimation results on $\theta$ will hold irrespective of whether $\theta$ coincides with $\mathbb{E}_\pi[R]$. In the experiments we will show results for both additive and non-additive rewards.
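As a concrete illustration of eq. (1) and Lemma 1, the following minimal sketch (my own toy setup, not the paper's code: a uniform factored logging policy, a deterministic target policy, and a hypothetical additive reward, with 0-indexed actions) computes the slot-level ratios $Y_k$, the sum $G$, and the plug-in estimate $\mathbb{E}_n[GR]$, which should be close to $\mathbb{E}_\pi[R]$ for moderately large $n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, [4, 3, 5]            # hypothetical slot sizes d_1, ..., d_K
K = len(d)

# Uniform factored logging policy: A_ik ~ Uniform({0, ..., d_k - 1}).
A = np.column_stack([rng.integers(0, dk, size=n) for dk in d])

# Deterministic target policy that always picks action 0 in every slot
# (a stand-in for a context-dependent policy; contexts are omitted here).
pi_slate = np.zeros(K, dtype=int)

# Additive rewards: E[R | A] = sum_k phi_k(A_k), with hypothetical phi_k.
phi = [rng.uniform(0, 1 / K, size=dk) for dk in d]
mean_r = sum(phi[k][A[:, k]] for k in range(K))
R = mean_r + rng.normal(0, 0.1, size=n)

# Slot-level density ratios Y_k = pi(A_k | X) / mu_k(A_k | X), as in eq. (1).
Y = np.column_stack([(A[:, k] == pi_slate[k]) * d[k] for k in range(K)]).astype(float)
G = 1.0 + (Y - 1.0).sum(axis=1)

theta_pi = np.mean(G * R)                                   # plug-in estimate E_n[GR]
true_value = sum(phi[k][pi_slate[k]] for k in range(K))     # E_pi[R]
print(theta_pi, true_value)                                 # close for large n, per Lemma 1
```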
In the factored policy setting, the PI and wPI estimators of Swaminathan et al. (2017) are, respectively,

$$\hat{\theta}^{\mathrm{PI}}_n = \mathbb{E}_n[GR], \qquad \hat{\theta}^{\mathrm{wPI}}_n = \frac{\mathbb{E}_n[GR]}{\mathbb{E}_n[G]}.$$

The PI estimator is unbiased for estimating $\theta$ by construction. While wPI may be biased, it can further reduce PI's variance.

## 3 A Class of Estimators

We now construct a class of unbiased estimators that includes PI and (almost) wPI. We will then propose to choose optimal estimators in this class and show how they can be approximated. For any fixed tuple of weights $w = (w_1, \ldots, w_K) \in \mathbb{R}^K$, we define the estimator

$$\hat{\theta}^{(w)}_n = \mathbb{E}_n[\Gamma_w], \qquad \Gamma_w = GR - \sum_{k=1}^K w_k (Y_k - 1).$$

Here, each $Y_k - 1$ acts as a control variate (Glynn and Szechtman, 2002). We slightly overload notation and for the case $w_1 = \cdots = w_K = \beta \in \mathbb{R}$ we will write

$$\hat{\theta}^{(\beta)}_n = \mathbb{E}_n[\Gamma_\beta], \qquad \Gamma_\beta = GR - \beta(G - 1),$$

which corresponds to using a single control variate $G - 1$. Note that $\hat{\theta}^{\mathrm{PI}}_n$ is included in the estimator class: when $\beta = 0$, we have $\Gamma_0 = GR$ and $\hat{\theta}^{\mathrm{PI}}_n = \hat{\theta}^{(0)}_n$. We further define $V_w = \mathrm{Var}(\Gamma_w)$.

It is straightforward to show that, for any fixed $w$, the estimator $\hat{\theta}^{(w)}_n$ is unbiased (for any $n$) and asymptotically normal.

**Lemma 2** (Unbiasedness and asymptotic normality of $\hat{\theta}^{(w)}_n$). For any fixed $w$, we have
$$\mathbb{E}[\hat{\theta}^{(w)}_n] = \mathbb{E}[\Gamma_w] = \theta, \qquad \sqrt{n}\,(\hat{\theta}^{(w)}_n - \theta) \xrightarrow{d} N(0, V_w).$$

While PI is in this class and has variance $V_0$, wPI is not actually a member of this class, but it is asymptotically equivalent to a member. To see this, we first need to understand the asymptotics of $\hat{\theta}^{(w)}_n$ when we plug in a random, data-driven $w$, denoted $\hat{w}_n$. It turns out (by Slutsky's theorem) that, if $\hat{w}_n$ converges to some fixed vector $w$, the asymptotic behavior of $\hat{\theta}^{(\hat{w}_n)}_n$ is the same as plugging in the fixed $w$.

**Lemma 3** (Asymptotics of plug-in $w$). Suppose $\hat{w}_n \xrightarrow{p} w$ for some fixed $w$. Then $\sqrt{n}\,(\hat{\theta}^{(\hat{w}_n)}_n - \theta) \xrightarrow{d} N(0, V_w)$.

We can now establish the following corollary showing that wPI is asymptotically equivalent to $\hat{\theta}^{(\theta)}_n$, that is, a member of our class of estimators using $\beta = \theta$ (the unknown target estimand).

**Lemma 4** (Asymptotics of wPI). We have $\sqrt{n}\,(\hat{\theta}^{\mathrm{wPI}}_n - \theta) \xrightarrow{d} N(0, V_\theta)$.

## 4 Optimal Control Variates

In the previous section we showed that both PI and wPI are, or are asymptotically equivalent to, members of the class of estimators $\hat{\theta}^{(w)}_n$. But the class is more general than these two, so it behooves us to try to find an optimal member.

### 4.1 Optimal Single Control Variate

We first focus on the case $w_1 = \cdots = w_K = \beta$, that is, using a single control variate $G - 1$. Both PI and wPI fall in this category with $\beta = 0$ and $\beta = \theta$, respectively, as we showed above. Let

$$V^\star = \inf_{\beta \in \mathbb{R}} V_\beta \tag{2}$$

represent the minimal variance in the class of single-control-variate estimators. By construction, $V^\star \le V_0 = \mathrm{Var}(GR)$ and $V^\star \le V_\theta$, the right-hand sides being the (asymptotic) variances of PI and wPI, respectively. Our next result derives this optimum.

**Lemma 5** (Optimal single control variate). We have
$$V^\star = V_{\beta^\star} = V_0 - \frac{(\mathbb{E}[G^2 R] - \theta)^2}{\sum_k \mathrm{Var}(Y_k)}, \qquad \beta^\star = \frac{\mathbb{E}[G^2 R] - \theta}{\sum_k \mathrm{Var}(Y_k)}.$$

Notice that generally $\beta^\star \ne \theta$. That is, while wPI is asymptotically equivalent to a control-variate estimator, it is not generally optimal. In the degenerate case where $R$ is independent of $G$ (e.g., constant reward) we do have $\beta^\star = \theta$. Our results nonetheless immediately suggest how to attain this optimum asymptotically, using a feasible estimator.

**Lemma 6** (Achieving the optimal single control variate). Compute a data-driven $\beta$ as
$$\hat{\beta}^\star_n = \frac{\mathbb{E}_n[GR(G-1)]}{\sum_k \mathbb{E}_n[(Y_k - 1)^2]}.$$
Then $\sqrt{n}\,(\hat{\theta}^{(\hat{\beta}^\star_n)}_n - \theta) \xrightarrow{d} N(0, V^\star)$.

That is, asymptotically, $\hat{\theta}^{(\hat{\beta}^\star_n)}_n$ is at least as good as either PI or wPI.
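Lemma 6 translates directly into a few lines of code. The sketch below (illustrative only; the function names are mine, not from the paper's repository) computes the PI and wPI baselines and the single-control-variate estimate, given arrays `Y`, `G`, `R` as in the earlier snippets:

```python
import numpy as np

def picv_single(Y, G, R):
    """Single-control-variate estimate with the data-driven weight of Lemma 6.

    Y: (n, K) slot-level ratios Y_k, G: (n,) their sum via eq. (1), R: (n,) rewards.
    """
    beta_hat = np.mean(G * R * (G - 1.0)) / np.sum(np.mean((Y - 1.0) ** 2, axis=0))
    return np.mean(G * R) - beta_hat * np.mean(G - 1.0)

def pi_and_wpi(G, R):
    """Baselines: the PI estimate E_n[GR] and the self-normalized wPI estimate."""
    return np.mean(G * R), np.mean(G * R) / np.mean(G)
```

In this form, the single-control-variate estimator is a one-line correction to PI: the same average $\mathbb{E}_n[GR]$, shifted by $\hat{\beta}^\star_n\,\mathbb{E}_n[G-1]$.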
The improvement is positive in general.

**Lemma 7.** The improvement of $\hat{\theta}^{(\hat{\beta}^\star_n)}_n$ over PI is
$$V_0 - V^\star = \frac{\mathbb{E}[GR(G-1)]^2}{\mathbb{E}[(G-1)^2]} \ge 0,$$
and the improvement over wPI is
$$V_\theta - V^\star = \frac{\mathbb{E}[GR(G-1)]^2}{\mathbb{E}[(G-1)^2]} - 2\,\mathbb{E}[GR]\,\mathbb{E}[G^2 R] + \mathbb{E}[GR]^2\bigl(2 + \mathbb{E}[(G-1)^2]\bigr) \ge 0.$$

### 4.2 Optimal Multiple Control Variates

We now turn our attention to optimally choosing weights on all $K$ control variates. Let

$$V^{\star\star} = \inf_{w \in \mathbb{R}^K} V_w. \tag{3}$$

By construction, $V^{\star\star} \le V^\star \le \min(V_0, V_\theta)$; that is, it is smaller than or equal to the asymptotic variances of PI, wPI, and the optimal single-control-variate estimator. Empirically, we find that the first inequality can have a small gap while the second can have a large one; that is, optimizing control variates provides a substantive improvement over PI and wPI, but much of the improvement is often captured by optimizing a single control variate in certain highly symmetric settings. See Sections 6 and 7.

**Lemma 8** (Optimal multiple control variates). We have
$$V^{\star\star} = V_0 - \sum_{k=1}^K \frac{\mathbb{E}[GR(Y_k - 1)]^2}{\mathbb{E}[(Y_k - 1)^2]}, \qquad w^\star_k = \frac{\mathbb{E}[GR(Y_k - 1)]}{\mathbb{E}[(Y_k - 1)^2]}.$$

Again, if $R$ is independent of the $Y_k$, we have $w^\star_k = \theta$ for all $k$. Generally, the $w^\star_k$ vary over $k$ and are not equal to $\theta$. As in the single-control-variate case above, we can attain the above optimum asymptotically using a feasible estimator.

**Lemma 9** (Achieving the optimal multiple control variates). Compute data-driven weights
$$\hat{w}^\star_{n,k} = \frac{\mathbb{E}_n[GR(Y_k - 1)]}{\mathbb{E}_n[(Y_k - 1)^2]}.$$
Then $\sqrt{n}\,(\hat{\theta}^{(\hat{w}^\star_n)}_n - \theta) \xrightarrow{d} N(0, V^{\star\star})$.

**Lemma 10.** The improvement of $\hat{\theta}^{(\hat{w}^\star_n)}_n$ over PI is
$$V_0 - V^{\star\star} = \sum_{k=1}^K \frac{\mathbb{E}[GR(Y_k - 1)]^2}{\mathbb{E}[(Y_k - 1)^2]} \ge 0.$$
The improvement over the single control variate $\hat{\theta}^{(\hat{\beta}^\star_n)}_n$ is
$$V^\star - V^{\star\star} = \sum_{k=1}^K \frac{\mathbb{E}[GR(Y_k - 1)]^2}{\mathbb{E}[(Y_k - 1)^2]} - \frac{\mathbb{E}[GR(G-1)]^2}{\mathbb{E}[(G-1)^2]} \ge 0.$$

This improvement is of course no smaller than the improvement over wPI, which is no better than the optimal single control variate (asymptotically). The improvement again collapses to zero in the special case of constant rewards.

## 5 Achieving Optimality Without Suffering Bias

Although the above estimators have asymptotically optimal variance among control variates, they may incur finite-sample bias due to the estimation of the control-variate weights. We can avoid this using a three-way cross-fitting. The estimator we construct in this section is both finite-sample unbiased and attains the optimal variance asymptotically. The estimator we propose is as follows:

1. Split the data randomly into three even folds, $D_0, D_1, D_2$.
2. For $j = 0, 1, 2$, compute
$$\hat{w}^{(j)}_{n,k} = \frac{\mathbb{E}_{D_j}[GR(Y_k - 1)]}{\mathbb{E}_{D_j}[(Y_k - 1)^2]},$$
where $\mathbb{E}_{D_j}$ refers to the sample average over $D_j$ only.
3. Set
$$\hat{\theta}^\star_n = \mathbb{E}_n[GR] - \sum_{j=0}^{2}\sum_{k=1}^{K} \frac{|D_j|}{n}\, \hat{w}^{(j+1 \bmod 3)}_{n,k}\, \mathbb{E}_{D_j}[Y_k - 1].$$

**Lemma 11** (Unbiased estimator with optimal asymptotic variance). We have
$$\mathbb{E}[\hat{\theta}^\star_n] = \theta, \qquad \sqrt{n}\,(\hat{\theta}^\star_n - \theta) \xrightarrow{d} N(0, V^{\star\star}).$$

While using just two folds would suffice to eliminate bias, ensuring that each data point $i$ is independent of the data used to fit its weights, this would not suffice for attaining the optimal variance. For this, it is crucial that we use three folds, so as to eliminate covariance as well. Having three folds ensures that, given any two data points $i$ and $i'$, either $i$ is disjoint from the data used to fit the weights of $i'$, or vice versa. That is, for any two folds $j$ and $j'$, we always have either $j + 1 \not\equiv j' \pmod 3$ or $j' + 1 \not\equiv j \pmod 3$.

In summary, Lemma 11 establishes that we can completely avoid bias even in finite samples. We could in fact adapt our cross-fitting procedure to obtain alternative estimators that are both finite-sample unbiased and match the asymptotic variance of the single-control-variate version, or even of wPI; for that we would only need to change the weights computed in step 2. We note, however, that the variance characterization in all of our results is asymptotic, and we do not currently have corresponding finite-sample results.
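Below is a minimal sketch of the three-step cross-fitted estimator (my own illustrative implementation with a random fold assignment; the released code may differ in details):

```python
import numpy as np

def cross_fitted_estimate(Y, G, R, seed=0):
    """Three-fold, finite-sample-unbiased estimator of Section 5 (illustrative sketch)."""
    n, K = Y.shape
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % 3                 # step 1: three (roughly) even folds

    # Step 2: per-fold weights w_hat^{(j)}_k from Lemma 9, computed on fold j only.
    w = np.zeros((3, K))
    for j in range(3):
        idx = folds == j
        num = np.mean((G[idx] * R[idx])[:, None] * (Y[idx] - 1.0), axis=0)
        den = np.mean((Y[idx] - 1.0) ** 2, axis=0)
        w[j] = num / den

    # Step 3: points in fold j are corrected with weights fit on fold (j + 1) mod 3.
    theta = np.mean(G * R)
    for j in range(3):
        idx = folds == j
        theta -= idx.mean() * np.sum(w[(j + 1) % 3] * np.mean(Y[idx] - 1.0, axis=0))
    return theta
```

Because the weights applied to fold $D_j$ are computed on fold $D_{(j+1) \bmod 3}$ only, each correction term multiplies data that is independent of its fitted weights, which is what removes the finite-sample bias while retaining the Lemma 9 weights asymptotically.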
## 6 Experiments on real data

We have benchmarked the proposed estimators on the publicly available dataset MSLR-WEB30K from the Microsoft Learning to Rank Challenge (Qin and Liu, 2013). This is a labeled dataset that contains about 31k user queries, each providing up to 1251 labeled documents (the labels are relevance scores from 0 to 4). The queries form the contexts $x$ of our OPE problem. In order to provide a head-to-head comparison with the results reported by Swaminathan et al. (2017), we have closely followed the experimental protocol of the latter (with minor differences, explained next) in order to generate contextual slate bandit instances from the data. (Additional details about the MSLR-WEB30K data and the experimental protocol that we followed can be found in the Supplementary Text.)

Each query-document pair in the dataset is annotated with title and body features. As a preprocessing step, we first use all data to train a regression tree that predicts document scores from the title features of query-document pairs. We then use this regressor as a standard greedy ranker to extract, for each query, the top-$M$ predicted documents (with $M \in \{10, 50, 100\}$). Finally, we discard all queries that have fewer than $M$ judged documents. (This latter step was not present in the experimental pipeline of Swaminathan et al., 2017. We include it here because it facilitates sampling with replacement, which is required in our experiments. Even for $M = 100$, the resulting dataset contains a sizable number (18k+) of unique queries, and running the PI and wPI estimators against these data attained essentially the same performance as with the unfiltered dataset.)

The above procedure defines the sets $A(x)$ of allowed documents per context $x$. In each problem instance, we train a new predictor (using decision trees or lasso) constrained on the sets $A(x)$, and use it as a greedy ranker to extract, for each query, the top-$K$ predicted documents (with $K \in \{5, 10, 30\}$). This mapping defines a deterministic target policy $\pi(x)$ for our OPE problem: each $K$-sized ranking defines a $K$-slot slate, with each slot having cardinality $M$. Note that each document can appear only once in the slates mapped by $\pi(x)$. The logging policy $\mu(x)$ is uniform, and it samples $K$ documents with replacement from the set $A(x)$, for each context $x$. (In Swaminathan et al., 2017, documents were sampled without replacement by the logging policy.) To create each logged dataset, we sampled the contexts $x$ uniformly from the set of all non-filtered queries, up to the desired sample size (which, as in Swaminathan et al., 2017, could end up reusing some queries multiple times).

We report results for PI, wPI, and the optimal single-control-variate estimator $\hat{\theta}^{(\hat{\beta}^\star_n)}_n$ from Lemma 6, denoted PICVs. The multiple-control-variate estimator from Lemma 9 obtained near-identical MSE to the reported PICVs, so we omit it from the plots. (The similar performance of the single vs. multiple control variate estimators in this problem is owed to the symmetry between slots; for example, each slot having the same number of actions. In the next section we show an example where the two estimators exhibit different behavior.)
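For concreteness, the following rough sketch mirrors the data-generating protocol just described (uniform logging with replacement and a deterministic top-$K$ target ranking), with random scores standing in for the learned rankers and MSLR features; everything here is illustrative and is not part of the released experiment code:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, n = 100, 10, 1000          # candidate documents per query, slots, sample size
n_queries = 500

# Placeholder per-query scores standing in for the trained greedy ranker.
scores_target = rng.normal(size=(n_queries, M))

# Deterministic target policy: the top-K documents under the target ranker.
target_slates = np.argsort(-scores_target, axis=1)[:, :K]

# Logged data: contexts sampled uniformly (with reuse) and, for each, K documents
# sampled uniformly *with replacement* from the M candidates (the logging policy mu).
X = rng.integers(0, n_queries, size=n)
A = rng.integers(0, M, size=(n, K))

# Slot-level density ratios under a deterministic target and uniform mu:
# Y_ik = M if slot k of the logged slate matches the target slate for query X_i, else 0.
Y = (A == target_slates[X]) * float(M)
G = 1.0 + (Y - 1.0).sum(axis=1)
```

With `Y` and `G` in this form (and a slate-level reward such as NDCG or ERR), the estimators of the previous sections apply unchanged.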
We evaluate each estimator using (log) root mean squared error (RMSE) as a function of sample size. We estimate MSE and standard errors of log(RMSE) (the latter computed via the delta method applied to $\log(\mathrm{MSE})/(2 \log 10)$) using 300 independent runs for each setting. As in Swaminathan et al. (2017), we repeat the protocol for a number of different experimental conditions, varying the values of $M$, $K$, the regression model for the target policy (tree-based or lasso), and the choice of metric. We tested two metrics, NDCG (additive) and ERR (non-additive); see Swaminathan et al. (2017) for definitions. The version of NDCG that we used is tailored to rankings with replacement: it differs from standard NDCG only in that the denominator is the DCG of a slate formed by the globally most relevant document (for the given $x$) replicated over all $K$ slots of the slate (capturing the fact that documents are sampled with replacement under $\mu$). Note that this version of NDCG is additive over slots, which is a requirement for PI (and, asymptotically, for our PICVs estimator) to be unbiased when the estimand is the value of the target policy. The code for these experiments is publicly available at https://github.com/fernandoamat/slateOPE.

Figure 1: Benchmarking the proposed PICVs estimator against PI and wPI on the MSLR-WEB30K data, using a tree-based regression model for the target policy. Here we vary $M$, the number of available actions (documents) per slot, $K$, the number of slots (size of the ranked lists), and the choice of metric that defines the slate-level reward. Missing values for wPI (for sample size 1k) are due to excessive variance caused by the presence of outliers. See text for details.

In Fig. 1 we show the results for the NDCG and ERR metrics when using a tree-based predictor to define the target policy. (See the Supplementary Text for results when using lasso.) We observe that the proposed estimator PICVs dominates both PI and wPI, in both metrics, with notable improvements for small sample sizes relative to $M$. For relatively small sample sizes, wPI would often return abnormally large estimates (values up to four orders of magnitude larger than the median). This can happen when the denominator of wPI gets, by chance, close to zero. This is an artifact of wPI being defined as a ratio of estimates and can lead to very high variance, even higher than PI's. Our proposed PICVs is not prone to such outliers and the ensuing variance, and is robust uniformly over all the sample sizes tested.

While wPI seems to converge to PICVs for large sample sizes, this need not be the case in general. Lemma 4 shows that wPI has asymptotic variance $V_\theta$; that is, wPI is equivalent to using the weight $\beta = \theta$ on the single control variate. Since, by definition, $V_\theta \ge V^\star = V_{\beta^\star}$, it is not true in general that wPI should converge to PICVs, at least in the sense of having similar asymptotic distributions, unless certain special conditions hold so that these variances are equal. The estimators do appear similar for large sample sizes in Fig. 1 due to the nature of the particular dataset and the reported setup (choice of metric, target policy, and value of $K$); see the Supplementary Text for different problem settings, where we can observe a uniform gap between PICV and (w)PI.

Our theoretical results support the observation that the MSE curves of the various estimators in Fig. 1 will asymptotically appear as parallel curves that never catch up with one another. Note that the plots in Fig. 1 are on a log-log scale (log(MSE) vs. log(# samples)). Our theory predicts that, for large enough sample sizes $n$, the MSE of the different estimators will be $V/n$ for the different values of $V$. Since $\log(V/n) = \log(V) - \log(n)$, we expect all methods to eventually appear linear in the log-log scale with the very same slope (hence parallel), but at different heights given by the corresponding $V$ (hence never catching up). The fact that PICVs is always at the bottom is consistent with Lemma 6, which shows that PICVs attains the smallest asymptotic variance among all single-control-variate estimators.
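These asymptotic heights can themselves be estimated from a single logged sample by plugging empirical moments into the variance formulas of Lemmas 4 and 5. The sketch below is an illustration I added (not part of the paper's pipeline): it returns plug-in values of $V_0$, $V_\theta$, and $V^\star$, whose ratios predict the vertical gaps between the PI, wPI, and PICVs curves.

```python
import numpy as np

def plug_in_variances(Y, G, R):
    """Plug-in estimates of V_0, V_theta, V_star; predicted large-n MSE is ~ V / n."""
    theta_hat = np.mean(G * R)
    cov = np.mean(G * R * (G - 1.0))                  # E_n[GR(G-1)]
    var_g = np.mean((G - 1.0) ** 2)                   # E_n[(G-1)^2]
    v0 = np.var(G * R)                                # PI
    v_theta = v0 - 2.0 * theta_hat * cov + theta_hat ** 2 * var_g   # wPI (Lemma 4)
    v_star = v0 - cov ** 2 / var_g                                  # PICVs (Lemma 5)
    return v0, v_theta, v_star
```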
## 7 Experiments on synthetic data: The gap between PICV single and multi

The results on the MSLR-WEB30K data do not demonstrate any substantial differentiation between the single and multi variants of the proposed PICV estimators. This is mainly due to the symmetry between slots, which need not always be the case. In this section we show results from synthetic simulations on a non-contextual slate bandit problem, using a reward model that is skewed toward the first slot of the slate and toward the target policy (which picks action 0 from each slot). This setting reveals a differential behavior of the single vs. the multi PICV variants. Here we assume Bernoulli slate-level rewards, and the slate-level Bernoulli rates $p(a)$ are given by the weighted sum

$$p(a) = (0.5)^{a_1 - 1}\,\varphi_1(a_1) + 0.01 \sum_{k=2}^{K} \varphi_k(a_k), \tag{4}$$

where the slot-level actions are $a_k \in \{1, \ldots, d_k\}$, for $k = 1, \ldots, K$. Under this model, slot-1 actions that are closer to action 1 confer more impact on the slate-level reward, and hence we expect the latter to be more correlated with the target policy. This allows us to examine whether there is any differential behavior between the single (PICVs) and the multi (PICVm) variants.

In each simulation experiment, we generate $T = 20$ random reward tensors $p \in [0, 1]^{d_1 \times \cdots \times d_K}$ using eq. (4), where, for each $a_k$, the value of $\varphi_k(a_k)$ is drawn from a Gaussian distribution $\mathcal{N}(0.2/K, 0.01)$. The parameters of each simulated instance are the number $N$ of logged slates, the number of slots $K$, and the slot action cardinality tuple $D = [d_1, \ldots, d_K]$, where $d_i$ is the number of available actions in the $i$-th slot. For each sampled tensor, we generate 300 datasets by drawing slates $a \sim \mu(\cdot)$ using a uniform logging policy $\mu(a)$ and drawing Bernoulli rewards from the corresponding $p(a)$. We use each dataset to evaluate a deterministic policy $\pi(a) = \mathbb{I}(a = [0, \ldots, 0])$ (w.l.o.g.) using each of the candidate estimators. Since we know the ground truth, we can compute the MSE of each estimator. We report (log) average root mean squared error (RMSE), with standard errors (almost negligible), over the $T$ tensors. The code for these experiments is publicly available at https://github.com/fernandoamat/slateOPE.

Figure 2: Comparing the estimators in a small-scale simulation setting, demonstrating a differentiated behavior between the single and multi variants of PICV. See text for details. Missing values for wPI are due to excessive variance caused by the presence of outliers.

The results are shown in Fig. 2 and demonstrate a differentiation between PICVs and PICVm, with the multi variant obtaining lower RMSE by virtue of leveraging more control variates. In the Supplementary Text we show the analogous plot for larger sample sizes.
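A minimal sketch of one such simulated instance follows (my own toy implementation of eq. (4), with 0-indexed actions, the second Gaussian parameter interpreted as a standard deviation, and a harmless clip to keep rates valid; the released repository is the authoritative version):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N = 3, [8, 4, 4], 20000         # slots, slot cardinalities, logged slates

# One random reward tensor p(a) following eq. (4), with phi_k(a_k) ~ N(0.2/K, 0.01).
phi = [rng.normal(0.2 / K, 0.01, size=d) for d in D]
a_grid = np.stack(np.meshgrid(*[np.arange(d) for d in D], indexing="ij"), axis=-1)
p = (0.5 ** a_grid[..., 0]) * phi[0][a_grid[..., 0]] \
    + 0.01 * sum(phi[k][a_grid[..., k]] for k in range(1, K))
p = np.clip(p, 0.0, 1.0)              # keep Bernoulli rates in [0, 1]

# Uniform logging policy and Bernoulli slate-level rewards.
A = np.column_stack([rng.integers(0, d, size=N) for d in D])
R = rng.binomial(1, p[tuple(A.T)])

# Deterministic target policy pi(a) = I(a = [0, ..., 0]); ground truth is p[0, ..., 0].
Y = np.column_stack([(A[:, k] == 0) * float(D[k]) for k in range(K)])
G = 1.0 + (Y - 1.0).sum(axis=1)
print(np.mean(G * R), p[(0,) * K])    # PI estimate vs. ground truth (close for large N)
```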
## 8 Conclusions and Discussion

We studied the problem of off-policy evaluation for slate bandits. We considered a class of unbiased estimators that includes PI and (asymptotically) wPI, constructed using a control-variate approach. This strongly suggested trying to attain the minimal variance in this class. We showed how to do so asymptotically, and even how to do so without incurring any additional bias in finite samples. Our new estimators led to gains in both simulations and real-data experiments.

**Future Work.** An interesting avenue for future research is how to meld this approach with doubly-robust (DR) style centering. Su et al. (2020) consider the DR-PI estimator, given by adding to the PI estimator the (unweighted) control variate $\mathbb{E}_n\!\left[G f(X, A) - \sum_{a \in \prod_{k=1}^K [d_k]} \pi(a \mid X)\, f(X, a)\right]$. In particular, they consider $f(x, a)$ being an estimate of $\mathbb{E}[R \mid A = a, X = x]$. But unlike the standard non-additive OPE case, it is not clear that it would be efficient to consider only this control variate (Kallus and Uehara, 2019). In particular, restricting the statistical model to have additive rewards makes standard semiparametric efficiency analyses difficult, and the efficient influence function is unknown and likely has no analytic form. Again, we can consider a range of centered control variates,
$$\sum_{k=1}^K \Bigl( \mathbb{E}_n[Y_k f_k(X, A_k)] - \mathbb{E}_n\Bigl[\textstyle\sum_{a_k \in [d_k]} \pi(a_k \mid X)\, f_k(X, a_k)\Bigr] \Bigr).$$
Letting $f_k(x, a_k) = w_k$ recovers our estimator class $\hat{\theta}^{(w)}_n$. However, there may be benefit in using $(x, a)$-dependent control variates. Efficiency considerations suggest letting $f_k$ estimate $\mathbb{E}[R \mid A_k = a_k, X = x]$, but the (sub)optimality of such a choice in the additive-rewards model requires further investigation. A promising direction forward is to consider a parametric family, $f_k \in \mathcal{F}_k$, and optimize the choice of parameter to minimize an empirical variance estimate, in the spirit of more robust doubly robust estimation (Farajtabar et al., 2018). As long as $\mathcal{F}_k$ has controlled complexity, we can obtain the best-in-class variance asymptotically. Characterizing this precisely and investigating its value empirically remains future work.

**Societal Impacts.** In general, having accurate methods for performing OPE can allow decision makers to evaluate potentially unsafe decision-making policies, and decide whether they may lead to improved societal outcomes, without having to risk the potentially negative consequences of trialling such policies. Nonetheless, as discussed above, OPE with combinatorial actions is nearly hopeless without making additional assumptions. These, however, can fail to hold and lead to biased estimates, which needs to be taken into account when considering potential harms of a proposed policy. Another potential risk, which applies to OPE in general, is that the data used may not accurately reflect the diversity of the population that the policy will actually be deployed on. In this case, resulting policy value estimates may be much more accurate for individuals who are well represented in the data, compared with individuals who are not. This may bias downstream decision making towards policies that best serve those who are well represented in the data, possibly at the cost of those who are not well represented. Such biases must also be taken into account in any OPE analysis.

## 9 Proofs of the main results

Here we provide succinct proofs for all results, except for Lemma 11, whose proof is quite involved and long and is therefore deferred to the Supplementary Text.

Proof of Lemma 1. Using the definition of $G$ from eq. (1), we have
$$\mathbb{E}[GR \mid A, X] = \sum_{k=1}^K Y_k\,\varphi_k(A_k, X) + \sum_{k=1}^K \sum_{j \ne k} (Y_j - 1)\,\varphi_k(A_k, X).$$
Taking expectation over $A$, the last term cancels (when $\mu$ is factored) since $\mathbb{E}[Y_k \mid X] = \sum_{a_k \in [d_k]} \pi(a_k \mid X) = 1$, and the first term equals $\mathbb{E}_\pi[R \mid X]$. Further taking expectation over $X$ gives the result.

Proof of Lemma 2. Since $\mathbb{E}[Y_k \mid X] = 1$, we have $\mathbb{E}[\Gamma_w \mid X] = \mathbb{E}[GR \mid X]$.
Iterated expectations gives the first statement. The second statement is immediate from the central limit theorem (CLT).

Proof of Lemma 3. Set $\tilde{\theta}_n = \mathbb{E}_n[\Gamma_w]$. From the CLT, $\sqrt{n}\,(\tilde{\theta}_n - \theta) \xrightarrow{d} N(0, V_w)$. Next, note that
$$\sqrt{n}\,(\tilde{\theta}_n - \hat{\theta}^{(\hat{w}_n)}_n) = \sum_{k=1}^K (\hat{w}_{n,k} - w_k)\,\sqrt{n}\,\mathbb{E}_n[Y_k - 1].$$
Since $\sqrt{n}\,\mathbb{E}_n[Y_k - 1] \xrightarrow{d} N(0, \mathrm{Var}(Y_k))$ by the CLT and $\hat{w}_{n,k} - w_k \xrightarrow{p} 0$ by assumption, we have by Slutsky's theorem that each of the $K$ terms converges in distribution to the constant 0, and therefore also converges in probability to the constant 0. Hence, $\sqrt{n}\,(\tilde{\theta}_n - \hat{\theta}^{(\hat{w}_n)}_n) \xrightarrow{p} 0$. Applying Slutsky's theorem again establishes the claim of the lemma.

Proof of Lemma 4. Since $\mathbb{E}_n[GR] \xrightarrow{p} \theta$ and $\mathbb{E}_n[G] \xrightarrow{p} 1$, the continuous mapping theorem gives $\hat{\theta}^{\mathrm{wPI}}_n \xrightarrow{p} \theta$. Next, note that we can rewrite
$$\frac{\mathbb{E}_n[GR]}{\mathbb{E}_n[G]} = \mathbb{E}_n[GR] + \mathbb{E}_n[GR]\,\frac{1 - \mathbb{E}_n[G]}{\mathbb{E}_n[G]} = \mathbb{E}_n[GR] - \hat{\theta}^{\mathrm{wPI}}_n\,\mathbb{E}_n[G - 1] = \mathbb{E}_n[GR] - \sum_k \hat{\theta}^{\mathrm{wPI}}_n\,\mathbb{E}_n[Y_k - 1],$$
and invoking Lemma 3 (with $\hat{w}_{n,k} = \hat{\theta}^{\mathrm{wPI}}_n$ for all $k$) completes the proof.

Proof of Lemma 5. We have $V_\beta = \mathrm{Var}(GR) - 2\beta\,\mathbb{E}[GR(G-1)] + \beta^2\,\mathbb{E}[(G-1)^2]$, which is convex in $\beta$. Differentiating, setting to zero, and solving for $\beta$, we get $\beta^\star = \frac{\mathbb{E}[GR(G-1)]}{\mathbb{E}[(G-1)^2]}$. For the numerator, we have $\mathbb{E}[GR(G-1)] = \mathbb{E}[G^2 R] - \mathbb{E}[GR] = \mathbb{E}[G^2 R] - \theta$. For the denominator, we have $\mathbb{E}[(G-1)^2] = \mathrm{Var}(G) = \sum_k \mathrm{Var}(Y_k)$. Combining yields the claim of the lemma.

Proof of Lemma 6. This is direct from the weak law of large numbers and Lemmas 3 and 5.

Proof of Lemma 7. This follows by algebra and by noting that $V^\star \le \min(V_0, V_\theta)$, since both $\beta = 0$ and $\beta = \theta$ are feasible in eq. (2).

Proof of Lemma 8. Let $C = (Y_1 - 1, \ldots, Y_K - 1)^\top$. We have $V_w = \mathrm{Var}(GR) - 2 w^\top \mathbb{E}[GRC] + w^\top \mathbb{E}[C C^\top]\, w$, which is convex in $w$. Differentiating, setting to zero, and solving for $w$ gives $w^\star = \mathbb{E}[C C^\top]^{-1}\mathbb{E}[GRC]$. We have $\mathrm{Cov}(Y_k, Y_{k'}) = 0$ whenever $k \ne k'$, so $\mathbb{E}[C C^\top]$ is diagonal; hence we can simplify and obtain the result.

Proof of Lemma 9. This is direct from the weak law of large numbers and Lemmas 3 and 8.

Proof of Lemma 10. This follows by algebra and by noting that $V^{\star\star} \le \min(V_0, V^\star)$, since both $w_1 = \cdots = w_K = 0$ and $w_1 = \cdots = w_K = \beta^\star$ are feasible in eq. (3).

## Acknowledgments

We want to thank Adith Swaminathan for helping with the MSLR-WEB30K data and code, and for many useful discussions.

## References

Agarwal, A., Wang, X., Li, C., Bendersky, M., and Najork, M. (2018). Offline comparison of ranking functions using randomized data. arXiv preprint arXiv:1810.05252.

Athey, S. and Wager, S. (2017). Efficient policy learning. arXiv preprint arXiv:1702.02896.

Bibaut, A., Malenica, I., Vlassis, N., and van der Laan, M. (2019). More efficient off-policy evaluation through regularized targeted learning. In International Conference on Machine Learning, pages 654–663. PMLR.

Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. (2013). Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(65):3207–3260.

Cesa-Bianchi, N. and Lugosi, G. (2012). Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422.

Chapelle, O. and Zhang, Y. (2009). A dynamic Bayesian network click model for web search ranking. In Proceedings of the 18th International Conference on World Wide Web, pages 1–10.

Chen, M., Beutel, A., Covington, P., Jain, S., Belletti, F., and Chi, E. H. (2019). Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 456–464.

Dudík, M., Langford, J., and Li, L. (2011). Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104.
Farajtabar, M., Chow, Y., and Ghavamzadeh, M. (2018). More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447–1456. PMLR.

Filippi, S., Cappé, O., Garivier, A., and Szepesvári, C. (2010). Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems (NIPS), volume 23, pages 586–594.

Glynn, P. W. and Szechtman, R. (2002). Some new perspectives on the method of control variates. In Monte Carlo and Quasi-Monte Carlo Methods 2000, pages 27–49. Springer.

Gomez-Uribe, C. A. and Hunt, N. (2015). The Netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems (TMIS), 6(4):1–19.

Guo, F., Liu, C., Kannan, A., Minka, T., Taylor, M., Wang, Y.-M., and Faloutsos, C. (2009). Click chain model in web search. In Proceedings of the 18th International Conference on World Wide Web, pages 11–20.

Hill, D. N., Nassif, H., Liu, Y., Iyer, A., and Vishwanathan, S. (2017). An efficient bandit algorithm for realtime multivariate optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1813–1821.

Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685.

Joachims, T., Swaminathan, A., and de Rijke, M. (2018). Deep learning with logged bandit feedback. In International Conference on Learning Representations.

Kallus, N. and Uehara, M. (2019). Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Kveton, B., Wen, Z., Ashkan, A., and Szepesvári, C. (2015). Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pages 535–543. PMLR.

Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.

Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670.

Li, L., Chu, W., Langford, J., and Wang, X. (2011). Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306.

Li, L., Kim, J. Y., and Zitouni, I. (2015). Toward predicting the outcome of an A/B experiment for search relevance. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 37–46.

Li, S., Abbasi-Yadkori, Y., Kveton, B., Muthukrishnan, S., Vinay, V., and Wen, Z. (2018). Offline evaluation of ranking policies with click models. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1685–1694.

Lopez, R., Dhillon, I. S., and Jordan, M. I. (2021). Learning from eXtreme bandit feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8732–8740.

McInerney, J., Brost, B., Chandar, P., Mehrotra, R., and Carterette, B. (2020). Counterfactual evaluation of slate recommendations with sequential reward interactions. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1779–1788.
Qin, T. and Liu, T. (2013). Introducing LETOR 4.0 datasets. CoRR, abs/1306.2597.

Robins, J. M. and Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129.

Sar Shalom, O., Koenigstein, N., Paquet, U., and Vanchinathan, H. P. (2016). Beyond collaborative filtering: The list recommendation problem. In Proceedings of the 25th International Conference on World Wide Web, pages 63–72.

Su, Y., Dimakopoulou, M., Krishnamurthy, A., and Dudík, M. (2020). Doubly robust off-policy evaluation with shrinkage. In International Conference on Machine Learning, pages 9167–9176. PMLR.

Swaminathan, A. and Joachims, T. (2015). Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823. PMLR.

Swaminathan, A., Krishnamurthy, A., Agarwal, A., Dudík, M., Langford, J., Jose, D., and Zitouni, I. (2017). Off-policy evaluation for slate recommendation. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Thomas, P. and Brunskill, E. (2016). Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148. PMLR.

Vlassis, N., Bibaut, A., Dimakopoulou, M., and Jebara, T. (2019). On the design of estimators for bandit off-policy evaluation. In International Conference on Machine Learning, pages 6468–6476. PMLR.

Wang, X., Bendersky, M., Metzler, D., and Najork, M. (2016). Learning to rank with selection bias in personal search. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124.

Wang, Y.-X., Agarwal, A., and Dudík, M. (2017). Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, pages 3589–3597. PMLR.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 8.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] See Section 9 and the Supplementary Text.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We include the code and instructions for all experiments as a URL.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See the Supplementary Text.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Sections 6 and 7.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See the Supplementary Text.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] See Section 6.
   (b) Did you mention the license of the assets? [Yes] See the Supplementary Text.
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We include the code for all experiments as a URL.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] We are using publicly available data.
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] See the Supplementary Text.
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]