# Distributional Off-Policy Evaluation for Slate Recommendations

Shreyas Chaudhari¹, David Arbour², Georgios Theocharous², Nikos Vlassis²
¹University of Massachusetts Amherst, ²Adobe Research
schaudhari@cs.umass.edu, {arbour,theochar,vlassis}@adobe.com

## Abstract

Recommendation strategies are typically evaluated using previously logged data, employing off-policy evaluation methods to estimate their expected performance. However, for strategies that present users with slates of multiple items, the resulting combinatorial action space renders many of these methods impractical. Prior work has developed estimators that leverage the structure in slates to estimate the expected off-policy performance, but estimation of the entire performance distribution remains elusive. Estimating the complete distribution allows for a more comprehensive evaluation of recommendation strategies, particularly along the axes of risk and fairness, which rely on metrics computable from the distribution. In this paper, we propose an estimator for the complete off-policy performance distribution for slates and establish conditions under which the estimator is unbiased and consistent. This builds upon prior work on off-policy evaluation for slates and off-policy distribution estimation in reinforcement learning. We validate the efficacy of our method empirically on synthetic data as well as on a slate recommendation simulator constructed from real-world data (MovieLens-20M). Our results show a significant reduction in estimation variance and improved sample efficiency over prior work across a range of slate structures.

## Introduction

Recommendation services are ubiquitous throughout industry (Bobadilla et al. 2013; Lu et al. 2015). A common variant of recommendation consists of suggesting multiple items to a user simultaneously, often termed recommendation slates, where each position (a.k.a. a slot) can take multiple possible values (Sarwar et al. 2000). For example, webpage layouts of news or streaming services have separate slots for each category of content, where each slot can display any of the items from that category. The items are suggested to the user based on a recommendation strategy, called a policy, and the user response is encoded into a reward (Li et al. 2010). A crucial problem for the selection and improvement of recommendation strategies is to evaluate the efficacy of a slate policy by estimating the expected reward of that policy.

One of the simplest and most effective approaches to policy evaluation, often employed in industrial settings, is A/B testing (Gomez-Uribe and Hunt 2015; Kohavi and Longbotham 2017; Feitelson, Frachtenberg, and Beck 2013). This involves randomly assigning users to receive the items recommended by one of two candidate policies, so that the relative performance of each policy is directly measured. However, A/B testing involves deploying the new policy online, which may be infeasible in many settings due to practical or ethical considerations. As a result, it is often necessary to employ offline off-policy evaluation, in which interaction data collected (offline) from previously deployed policies (off-policy) is used to estimate statistics of the expected performance, risk, and other metrics of interest for a new target policy, without actually deploying it online.
This ensures that policies with undesirable outcomes are not deployed online.

A large body of literature addresses the problem of off-policy evaluation in the non-slate setting (Dudík et al. 2014; Wang, Agarwal, and Dudík 2017; Thomas, Theocharous, and Ghavamzadeh 2015), where the majority of methods rely on some version of importance sampling (Horvitz and Thompson 1952). Applied to the slate setting, these methods yield very large importance weights and hence high-variance estimates, owing to the combinatorially large action space on which slate policies operate. Recent work addresses this deficiency by introducing methods that leverage the structure in slate actions and rewards. Swaminathan et al. (2017) and Vlassis et al. (2021) leverage reward additivity across slot actions to propose estimators of the expected reward of target slate policies. McInerney et al. (2020) propose a Markov structure on the slate rewards for sequential interactions.

All the aforementioned methods provide solutions for estimating the expected reward of a target policy. However, in many scenarios where recommendation systems are used, such as those with large financial stakes and in healthcare applications, practitioners are concerned with evaluation metrics such as the behavior of the policy at extreme quantiles and the expected performance at a given risk tolerance (CVaR). These quantities require the full reward distribution, which renders prior work that estimates only the expected reward inapplicable. A notable exception is Chandak et al. (2021), who provide a method for off-policy estimation of the target policy's cumulative reward distribution using ideas similar to importance sampling, allowing for the computation of various metrics of interest; however, their estimator is intractable in the slate setting.

In this work, we propose slate universal off-policy evaluation (SUnO), a method that allows for off-policy estimation of the target reward distribution (Theorem 3) for slate recommendation policies. SUnO applies the core ideas of the universal off-policy evaluation (UnO) method (Chandak et al. 2021) to the slate setting, leveraging an additive decomposition of the conditional reward distribution. This makes it possible to perform off-policy estimation in structured high-dimensional action spaces without incurring prohibitively high estimation variance. We highlight how the estimator can readily be adapted to other generalized decompositions of the reward while remaining unbiased. Finally, we provide an empirical evaluation comparing against UnO, where the proposed estimator shows significant variance reduction, improved sample efficiency, and robust performance even when the conditions for unbiasedness of the estimator are not met.

The main contributions of our work are:
- We propose an unbiased estimator for the off-policy reward distribution for slate recommendations under an additively decomposable reward distribution, generalizing prior results for slate off-policy evaluation to the distributional setting.
- We theoretically demonstrate how the estimator readily generalizes to slate rewards that do not decompose additively over slots.
- We empirically demonstrate the efficacy of the proposed estimator on slate simulators using synthetic as well as real-world data, across a range of slate reward structures.
## Background and Notation

We first formulate the slate recommendation system as a contextual bandit with a combinatorial action space. Each slate action has K dimensions, where each dimension is a slot-level action. The user-bandit interaction results in a random tuple (X, A, R) at each step, where $X \sim d_X(\cdot)$ is the user context, $A$ is the slate action generated by the recommendation strategy, composed of K slot-level actions $A = [A^k]_{k=1}^{K}$, and $R \sim d_R(\cdot \mid A, X)$ is the scalar slate-level reward. Since rewards are observed only at the slate level and not at a per-slot level, we use reward and slate reward interchangeably. Each slot-level action can take up to N candidate values, leading to a combinatorially large action space of the order $\binom{N}{K}$.

A logging policy $\mu(A \mid X) = \Pr(A \mid X)$, which recommends slate actions conditioned on the user context X, is deployed online to collect a dataset for offline evaluation. The offline dataset consists of n i.i.d. samples $D_n^\mu = \{(X_i, A_i, R_i)\}_{i=1}^{n}$ generated by the user-bandit interaction. We focus on the case where $\mu$ is a factored policy, that is, $\mu(A \mid X) = \prod_{k=1}^{K} \mu_k(A^k \mid X)$, where K is the number of slots. Data collection with factored uniform logging policies is standard in practice (Swaminathan et al. 2017).

Off-policy evaluation is the task of utilizing data $D_n^\mu$, logged using a policy $\mu$, to evaluate a target policy $\pi$ by computing evaluation metrics of the reward under $\pi$. Standard methods focus on estimating the expected reward under the target policy. In this work, our focus is on estimating quantities that go beyond the expected target reward, by estimating the reward distribution. Throughout the paper, the sample estimate of any quantity y is denoted by $\hat{y}_n$, where the subscript indicates the number (n) of data points used for estimation. For instance, the cumulative reward distribution under a policy $\pi$ is denoted by $F^\pi(\nu)$, and its sample estimate is denoted by $\hat{F}^\pi_n(\nu)$.

## Related Work on Off-Policy Evaluation

Importance sampling (IS) (Horvitz and Thompson 1952; Sutton and Barto 2018), also known as inverse propensity scoring (IPS) (Dudík et al. 2014), provides a technique for unbiased estimation of the expected target reward. However, it suffers from high variance in large action spaces. There are numerous extensions to IS for variance reduction (Dudík, Langford, and Li 2011; Thomas 2015; Kallus and Uehara 2019). The IS estimator, and all methods derived from it, rely on a standard common-support assumption. We use a weaker form of this assumption that requires support at the slot level instead of over the entire slate.

Assumption 1 (Common Support). The set $D_n^\mu$ contains i.i.d. tuples generated using $\mu$, such that for some (unknown) $\varepsilon > 0$, $\mu_k(A^k \mid X) < \varepsilon \implies \pi(A^k \mid X) = 0$, for all $k, X, A$.

Applying these methods to slates, Swaminathan et al. (2017) assume additivity of slate rewards as a way to reduce the variance in estimating the expected reward of the target policy. Further variance reduction is obtained by using control variates (Vlassis et al. 2021). McInerney et al. (2020) assume a Markov structure on the slate rewards during sequential item interaction and propose an estimator for the expected target reward. We refer the reader to these papers for additional references on slate recommendations. These off-policy estimators, along with most others in the literature, provide estimates of the expected reward of the target policy (Li et al. 2018).
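To make the setup concrete, the following is a minimal synthetic sketch of logging slate-bandit data with a factored uniform policy. It is illustrative only: the context distribution, reward model, sizes, and all function names are our own choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, n = 5, 10, 1000  # slots per slate, candidates per slot, logged samples (illustrative sizes)

def sample_context():
    # User context X ~ d_X(.); here a 3-dimensional feature vector for illustration.
    return rng.normal(size=3)

def logging_policy(x):
    # Factored uniform logging policy: mu(A | X) = prod_k mu_k(A^k | X), each slot uniform over N items.
    slate = rng.integers(0, N, size=K)      # slate action A = [A^1, ..., A^K]
    slot_props = np.full(K, 1.0 / N)        # logged propensities mu_k(A^k | X)
    return slate, slot_props

def slate_reward(x, slate):
    # Scalar slate-level reward R ~ d_R(. | A, X); slot-level rewards are never observed.
    click_prob = 0.1 + 0.05 * np.sum(slate == 0) / K
    return float(rng.binomial(1, min(click_prob, 1.0)))

# Logged dataset D_n = {(X_i, A_i, R_i)}_{i=1}^n, with slot-level propensities recorded for evaluation.
logged = []
for _ in range(n):
    x = sample_context()
    a, mu_props = logging_policy(x)
    logged.append((x, a, mu_props, slate_reward(x, a)))
```

Recording the slot-level propensities alongside each tuple is what makes the slot-level common-support condition (Assumption 1) checkable and the slot-level importance ratios computable later on.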
However, the expected value is usually not sufficient for a comprehensive off-policy analysis, particularly for the risk assessment that is crucial for recommendation systems (Shani and Gunawardana 2011). Additional metrics of interest, often computable only from the whole reward distribution, are necessary in practice (Keramati et al. 2020; Altschuler, Brunel, and Malek 2019). For example, metrics like value at risk are used for risk analysis of a new recommendation strategy. To that end, work on universal off-policy estimation (UnO) (Chandak et al. 2021) uses ideas motivated by importance sampling to estimate the whole cumulative distribution of the reward under the target policy. However, in combinatorially large action spaces, as with the slate problems we consider here, the UnO estimator can incur prohibitive variance. Our proposed estimator utilizes possible structure in slate rewards to circumvent this issue.

## Structure in Slate Rewards

The combinatorial action space of slates is a key challenge for general off-policy evaluation methods such as importance sampling (IS) (Wang, Agarwal, and Dudík 2017). The generality of these approaches means they do not fully leverage the structure present in slate rewards (Sunehag et al. 2015), and thus general IS-based approaches frequently suffer from high variance. Prior work leverages user behavior patterns while interacting with slates, which are encoded into structured rewards (e.g., time spent, items clicked, etc.). Some examples of structure in slate rewards are a Markov structure for observed slot-level rewards (McInerney et al. 2020), the dependence of the slate reward only on the selected slot (Ie et al. 2019), and unobserved slot-level rewards with an additively decomposable slate reward (Swaminathan et al. 2017; Vlassis et al. 2021). The last one is of particular interest: additivity of the expected reward (Cesa-Bianchi and Lugosi 2012) posits that the conditional mean slate-level reward decomposes additively as the sum of (arbitrary) slot-level latent functions, i.e., $\mathbb{E}[R \mid A, X] = \sum_{k=1}^{K} \phi_k(A^k, X)$. This has been leveraged to obtain a significant reduction in estimation variance for off-policy evaluation (Swaminathan et al. 2017; Vlassis et al. 2021).

This decomposition captures the individual effects of each slot. It may readily be generalized to capture non-additive joint effects of more than one slot action; for example, to capture the effects of pairs of slot actions, one may consider the decomposition

$$\mathbb{E}[R \mid A, X] = \sum_{j \neq k} \phi_{jk}(A^k, A^j, X). \quad (1)$$

It may further be generalized to capture the combined effects of m slots. Note that in the most general case, m = K, the reward does not permit any decomposition over slots. Analogous to the above structural conditions, we posit a condition that allows us to perform consistent and unbiased estimation of the target off-policy distribution.

Assumption 2 (Additive CDF). There exists an additive decomposition of the conditional cumulative distribution function (CDF) of the slate reward as the sum of (arbitrary) slot-level latent functions:

$$F_R(\nu) = \sum_{k=1}^{K} \psi_k(A^k, X, \nu), \quad \forall \nu, \quad \text{where } F_R(\nu) := \Pr(R \le \nu \mid A, X).$$

The slot-level rewards, if any, are unobserved. The condition only assumes that an additive decomposition exists and does not require knowledge of the constituent slot-level functions.
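As one illustration of how Assumption 2 can hold exactly (our own construction, not from the paper): suppose the user engages with exactly one slot, chosen with fixed probabilities $w_k$, and the reward is then drawn from a slot-level distribution. The conditional CDF is the mixture $\sum_k w_k F_k(\nu \mid A^k, X)$, which is additive with $\psi_k(A^k, X, \nu) = w_k F_k(\nu \mid A^k, X)$. A minimal sketch with hypothetical slot-level CDFs:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
w = np.full(K, 1.0 / K)  # probability that the user engages with slot k (hypothetical)

def slot_reward_cdf(k, a_k, x, nu):
    # Hypothetical slot-level CDF F_k(nu | A^k, X): exponential reward with an (A^k, X)-dependent rate.
    rate = 1.0 + 0.1 * a_k + 0.05 * x
    return 1.0 - np.exp(-rate * np.maximum(nu, 0.0))

def slate_reward_cdf(slate, x, nu):
    # F_R(nu | A, X) = sum_k w_k F_k(nu | A^k, X): additive over slots (Assumption 2),
    # with psi_k(A^k, X, nu) = w_k F_k(nu | A^k, X).
    return sum(w[k] * slot_reward_cdf(k, slate[k], x, nu) for k in range(K))

def sample_slate_reward(slate, x):
    # Equivalent generative view: pick one slot with probabilities w, then draw its reward.
    k = rng.choice(K, p=w)
    rate = 1.0 + 0.1 * slate[k] + 0.05 * x
    return rng.exponential(1.0 / rate)
```

Note that this is different from summing independent slot-level rewards: a sum of slot rewards has an additive *mean* but a convolved CDF, whereas the mixture above satisfies the additive-CDF condition directly.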
We demonstrate empirically that this condition is often a close approximation for real-world data, and that our estimator performs better than more general methods even when the condition is only an inexact approximation, i.e., when a perfect decomposition does not exist. This decomposition may also be generalized to capture the combined effects of m slots. In line with prior work (Wen, Kveton, and Ashkan 2015; Kale, Reyzin, and Schapire 2010; Kveton et al. 2015; Swaminathan et al. 2017; Vlassis et al. 2021; Ie et al. 2019), we focus our analysis on the case m = 1, which proves effective in practice, as corroborated by our empirical analysis. We additionally provide derivations showing how the estimator readily generalizes to cases where m > 1, along with a theoretical analysis of its properties. It is worth noting that an additively decomposable reward CDF always implies an additive expected reward by definition, and the former is often a close approximation when the latter holds. This is helpful since a commonly used metric for evaluating the performance of slates, the normalized discounted cumulative gain (nDCG) (Burges et al. 2005), is an additively decomposable metric and has been used in the past for defining the slate reward (Järvelin and Kekäläinen 2017; Swaminathan et al. 2017).

## Slate Universal Off-Policy Evaluation

We now turn to off-policy evaluation in slates as an off-policy reward distribution estimation task. The core idea builds upon the framework of Chandak et al. (2021), who use importance weights $\rho$ in an estimator of the reward CDF of a target policy $\pi$ from logged data $D_n^\mu$. In the case of slates, the importance weight $\rho$ comprises probability ratios over all slot actions. The most direct approach for defining $\rho$ in the case of a factored logging policy $\mu$ is a formulation analogous to importance sampling, taking the product of the slot-level probability ratios (Equation (2)). This approach is plagued by high variance when the size of the slate K is large. To remedy this, our proposed algorithm SUnO utilizes the structure in slates provided by Assumption 2, wherein the CDF of the slate-level reward admits an additive decomposition. In place of $\rho$, we define an importance weight G (Equation (2)) that is a sum of slot-level density ratios:

$$\rho := \frac{\pi(A \mid X)}{\prod_{k=1}^{K} \mu_k(A^k \mid X)}, \qquad G := 1 - K + \sum_{k=1}^{K} \frac{\pi(A^k \mid X)}{\mu_k(A^k \mid X)}. \quad (2)$$

The estimator for the target distribution counts the number of samples for which the reward is at most a threshold $\nu$ and reweighs that count with importance weights so that it reflects the counts under the target policy $\pi$. In expectation, the counts reflect the probability of obtaining a reward of at most $\nu$, which is the value of the reward CDF at $\nu$, denoted by $F^\pi(\nu)$. Using the importance weight G in place of $\rho$ results in significantly lower estimation variance and an improved effective sample size, while keeping the estimator unbiased. It is easy to confirm that under a factored logging policy, $\mathbb{E}_\mu[G] = 1$. Below we prove that this importance weight allows for a change of distribution for the slate reward and can be used to estimate the target reward CDF. Our main result is the following:

Theorem 3. Let R be a real-valued random variable that denotes the slate reward and admits an additive decomposition of its conditional cumulative distribution $F_R(\nu)$ (Assumption 2). Under a factored $\mu$ and Assumption 2,

$$F^\pi(\nu) = \mathbb{E}_\mu\big[G \, \mathbf{1}\{R \le \nu\}\big], \quad \forall \nu.$$

Thus, a weighted expectation of the indicator function, with weights given by G, gives the target CDF. Based on this result, we propose the following sample estimator for $F^\pi(\nu)$ that uses data $D_n^\mu$:

$$\hat{F}^\pi_n(\nu) := \frac{1}{n} \sum_{i=1}^{n} G_i \, \mathbf{1}\{R_i \le \nu\}, \quad \forall \nu.$$

This estimator, called the slate universal off-policy estimator (SUnO), is outlined in Algorithm 1.

Algorithm 1: SUnO(ν)
Input: ν, µ, π, $D_n^\mu = \{(X_i, A_i, R_i)\}_{i=1}^{n}$
Output: $\hat{F}^\pi_n(\nu)$
1: $s \leftarrow 0$  {Initialize counter and iterate over the logged dataset}
2: for $i = 1, 2, \ldots, n$ do
3:   $G_i \leftarrow 1 - K + \sum_{k=1}^{K} \frac{\pi(A^k_i \mid X_i)}{\mu_k(A^k_i \mid X_i)}$  {Compute the importance weight (Equation (2))}
4:   $s \leftarrow s + \mathbf{1}\{R_i \le \nu\}\, G_i$  {Add to the counter if the reward is at most ν}
5: end for
6: return $\hat{F}^\pi_n(\nu) = s/n$  {Normalize and return the counter}
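For concreteness, here is a minimal NumPy sketch of Algorithm 1 under the notation above. It is an illustrative implementation, not the authors' code; the vectorized interface and the assumption that slot-level propensities are available as arrays alongside the logged data are ours.

```python
import numpy as np

def suno_cdf(nu, rewards, pi_props, mu_props):
    """SUnO estimate of the target reward CDF at threshold nu (Algorithm 1).

    rewards  : shape (n,)   logged slate-level rewards R_i
    pi_props : shape (n, K) target slot-level propensities pi(A_i^k | X_i)
    mu_props : shape (n, K) logging slot-level propensities mu_k(A_i^k | X_i)
    """
    K = pi_props.shape[1]
    # G_i = 1 - K + sum_k pi(A_i^k | X_i) / mu_k(A_i^k | X_i)   (Equation (2))
    G = 1.0 - K + (pi_props / mu_props).sum(axis=1)
    # Weighted count of samples with reward at most nu, normalized by n.
    return float(np.mean(G * (rewards <= nu)))

# Example usage on a grid of thresholds (arrays as described above are assumed):
# nus = np.linspace(rewards.min(), rewards.max(), 100)
# F_hat = [suno_cdf(nu, rewards, pi_props, mu_props) for nu in nus]
```

Evaluating the estimator on a grid of thresholds yields the full estimated CDF curve, from which downstream metrics can be computed.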
In the following result, we establish that SUnO leverages the additive structure in Assumption 2 to obtain an unbiased and pointwise consistent estimate of the CDF under the target policy. The proofs of both results may be found in the Appendix.

Theorem 4. Under Assumption 2, $\hat{F}^\pi_n(\nu)$ is an unbiased and pointwise consistent estimator of $F^\pi(\nu)$.

It is important to note that, analogous to Swaminathan et al. (2017), our estimator does not require knowledge of the specific functions (the $\psi_k$'s) in the decomposition of the conditional CDF in Assumption 2; it only assumes the existence of a set of such latent functions, and a corresponding additive decomposition of the conditional CDF, to attain unbiased estimation. Even in cases where the assumption is not satisfied, our method (Algorithm 1) performs robustly, as we demonstrate in our experiments.

The estimated target CDF can be used to compute metrics of interest as functions of the CDF (for example, mean, variance, VaR, CVaR, etc.). Some of these metrics are non-linear functions of the CDF (VaR, CVaR), and thus their sample estimates are biased; this is to be expected (Chandak et al. 2021). Metrics that are linear functions of the CDF, however, have unbiased sample estimators. Thus, an unbiased target CDF estimator serves as a one-shot solution for most metrics of interest, though unbiasedness holds only for certain metrics. We demonstrate the estimation of some of these metrics in our empirical analysis.

## Properties of SUnO

Variance: The estimator enjoys significantly lower variance for target estimation. The key factor is that it uses importance weights that are a sum of slot-level density ratios, as opposed to the product used by UnO (Chandak et al. 2021). In the slate setting in particular, the latter suffers from enormous variance and a reduced effective sample size, as we demonstrate empirically. Consider the worst-case variance of the two estimators. From Assumption 1 we have $0 \le \frac{\pi(A^k \mid X)}{\mu_k(A^k \mid X)} \le \frac{1}{\varepsilon}$, which implies

$$\mathrm{Var}(\text{UnO}) = O\!\left(\frac{1}{\varepsilon^{K}}\right); \qquad \mathrm{Var}(\text{SUnO}) = O\!\left(\frac{K}{\varepsilon}\right).$$

Thus the worst-case variance of SUnO grows linearly with the size of the slate (K), while that of UnO grows exponentially.

Generalization to m-slot reward decomposition: As noted earlier, the structural assumptions on slate rewards may be generalized to account for the joint effects of multiple slots. The proposed estimator readily applies to such generalizations. For instance, consider the case where the conditional reward CDF decomposes into terms composed of m slot actions:

$$F_R(\nu) = \sum_{1 \le k_1 < \cdots < k_m \le K} \psi_{k_1 \ldots k_m}\!\left(A^{k_1}, \ldots, A^{k_m}, X, \nu\right).$$
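To connect back to the metrics discussed earlier in this section, the sketch below shows how plug-in VaR and CVaR estimates can be read off an estimated reward CDF such as the one SUnO produces. The grid-inversion approach and function names are illustrative (not from the paper), and, as noted above, plug-in estimates of these nonlinear functionals are biased.

```python
import numpy as np

def var_cvar_from_cdf(nus, F_hat, alpha=0.05):
    """Plug-in VaR/CVaR at level alpha from a CDF estimated on a grid of thresholds.

    nus   : increasing grid of reward thresholds
    F_hat : estimated CDF values at those thresholds (e.g., from suno_cdf)
    """
    F = np.clip(np.maximum.accumulate(np.asarray(F_hat)), 0.0, 1.0)  # enforce a monotone CDF in [0, 1]
    var_idx = int(np.searchsorted(F, alpha))                         # smallest nu with F(nu) >= alpha
    var = float(nus[min(var_idx, len(nus) - 1)])
    # CVaR: expected reward conditional on falling in the lower alpha-tail,
    # approximated from the discretized distribution implied by the CDF grid.
    pmf = np.diff(np.concatenate(([0.0], F)))
    tail = nus <= var
    tail_mass = pmf[tail].sum()
    cvar = float((nus[tail] * pmf[tail]).sum() / tail_mass) if tail_mass > 0 else var
    return var, cvar
```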