# Distributional Off-Policy Evaluation for Slate Recommendations

Shreyas Chaudhari¹, David Arbour², Georgios Theocharous², Nikos Vlassis²
¹University of Massachusetts Amherst, ²Adobe Research
schaudhari@cs.umass.edu, {arbour,theochar,vlassis}@adobe.com

## Abstract

Recommendation strategies are typically evaluated using previously logged data, employing off-policy evaluation methods to estimate their expected performance. However, for strategies that present users with slates of multiple items, the resulting combinatorial action space renders many of these methods impractical. Prior work has developed estimators that leverage the structure in slates to estimate the expected off-policy performance, but estimation of the entire performance distribution remains elusive. Estimating the complete distribution allows for a more comprehensive evaluation of recommendation strategies, particularly along the axes of risk and fairness, which rely on metrics computable from the distribution. In this paper, we propose an estimator for the complete off-policy performance distribution for slates and establish conditions under which the estimator is unbiased and consistent. This builds upon prior work on off-policy evaluation for slates and off-policy distribution estimation in reinforcement learning. We validate the efficacy of our method empirically on synthetic data as well as on a slate recommendation simulator constructed from real-world data (MovieLens-20M). Our results show a significant reduction in estimation variance and improved sample efficiency over prior work across a range of slate structures.

## Introduction

Recommendation services are ubiquitous throughout industry (Bobadilla et al. 2013; Lu et al. 2015). A common variant of recommendation consists of suggesting multiple items to a user simultaneously, often termed recommendation slates, where each position (a.k.a. a slot) can take multiple possible values (Sarwar et al. 2000). For example, webpage layouts of news or streaming services have separate slots for each category of content, where each slot can display any of the items from that category. The items are suggested to the user based on a recommendation strategy, called a policy, and the user response is encoded into a reward (Li et al. 2010). A crucial problem for the selection and improvement of recommendation strategies is to evaluate the efficacy of a slate policy by estimating the expected reward of that policy.

One of the simplest and most effective approaches to policy evaluation, often employed in industrial settings, is A/B testing (Gomez-Uribe and Hunt 2015; Kohavi and Longbotham 2017; Feitelson, Frachtenberg, and Beck 2013). This involves randomly assigning users to receive the items recommended by one of two candidate policies, so that the relative performance of each policy is directly measured. However, A/B testing involves deploying the new policy online, which may be infeasible in many settings due to practical or ethical considerations. As a result, it is often necessary to employ offline off-policy evaluation, in which interaction data collected (offline) from previously deployed policies (off-policy) is used to estimate statistics of the expected performance, risk, and other metrics of interest for a new target policy, without actually deploying it online.
This ensures that policies with undesirable outcomes are not deployed online.

A large body of literature addresses the problem of off-policy evaluation in the non-slate setting (Dudík et al. 2014; Wang, Agarwal, and Dudík 2017; Thomas, Theocharous, and Ghavamzadeh 2015), where the majority of methods rely on some version of importance sampling (Horvitz and Thompson 1952). Applied to the slate setting, these methods yield very large importance weights and hence high-variance estimates, owing to the combinatorially large action space on which slate policies operate. Recent work addresses this deficiency by introducing methods that leverage the structure in slate actions and rewards. Swaminathan et al. (2017) and Vlassis et al. (2021) leverage reward additivity across slot actions to propose estimators of the expected reward of target slate policies. McInerney et al. (2020) propose a Markov structure on the slate rewards for sequential interactions.

All the aforementioned methods provide solutions for estimating the expected reward of a target policy. However, in many scenarios where recommendation systems are used, such as those with large financial stakes and in healthcare applications, practitioners are concerned with evaluation metrics such as the behavior of the policy at extreme quantiles and the expected performance at a given risk tolerance (CVaR). These quantities require the full reward distribution, which renders prior work that estimates only the expected reward inapplicable. A notable exception is Chandak et al. (2021), who provide a method for off-policy estimation of the target policy's cumulative reward distribution using ideas similar to importance sampling, allowing for the computation of various metrics of interest; however, their estimator is intractable in the slate setting.

In this work, we propose slate universal off-policy evaluation (SUnO), a method that allows for off-policy estimation of the target reward distribution (Theorem 3) for slate recommendation policies. SUnO applies the core ideas of the universal off-policy evaluation (UnO) method (Chandak et al. 2021) to the slate setting, leveraging an additive decomposition of the conditional reward distribution. This makes it possible to perform off-policy estimation in structured high-dimensional action spaces without incurring prohibitively high estimation variance. We highlight how the estimator can readily be adapted to other generalized decompositions of the reward while remaining unbiased. Finally, we provide an empirical evaluation comparing against UnO, where the proposed estimator shows significant variance reduction, improved sample efficiency, and robust performance even when the conditions for unbiasedness of the estimator are not met.

The main contributions of our work are:
- We propose an unbiased estimator for the off-policy reward distribution for slate recommendations under an additively decomposable reward distribution, generalizing prior results for slate off-policy evaluation to the distributional setting.
- We theoretically demonstrate how the estimator readily generalizes to slate rewards that do not decompose additively over slots.
- We empirically demonstrate the efficacy of the proposed estimator on slate simulators using synthetic as well as real-world data, across a range of slate reward structures.
## Background and Notation

We first formulate the slate recommendation system as a contextual bandit with a combinatorial action space. Each slate action has K dimensions, where each dimension is a slot-level action. The user-bandit interaction results in a random tuple (X, A, R) at each step, where $X \sim d_X(\cdot)$ is the user context, $A$ is the slate action generated by the recommendation strategy, composed of K slot-level actions $A = [A^k]_{k=1}^{K}$, and $R \sim d_R(\cdot \mid A, X)$ is the scalar slate-level reward. Since rewards are observed only at the slate level and not at a per-slot level, we use reward and slate reward interchangeably. Each slot-level action can take up to N candidate values, leading to a combinatorially large action space of the order $\binom{N}{K}$.

A logging policy $\mu(A \mid X) = \Pr(A \mid X)$, which recommends slate actions conditioned on the user context X, is deployed online to collect a dataset for offline evaluation. The offline dataset consists of n i.i.d. samples $D_n^\mu = \{(X_i, A_i, R_i)\}_{i=1}^{n}$ generated by the user-bandit interaction. We focus on the case where $\mu$ is a factored policy, that is, $\mu(A \mid X) = \prod_{k=1}^{K} \mu_k(A^k \mid X)$, where K is the number of slots. Data collection with factored uniform logging policies is standard in practice (Swaminathan et al. 2017).

Off-policy evaluation is the task of utilizing data $D_n^\mu$, logged using a policy $\mu$, to evaluate a target policy $\pi$ by computing evaluation metrics of the reward under $\pi$. Standard methods focus on estimating the expected reward under the target policy. In this work, our focus is on estimating quantities that go beyond the expected target reward, by estimating the reward distribution. Throughout the paper, the sample estimate of any quantity y is denoted by $\hat{y}_n$, where the subscript indicates the number (n) of data points used for estimation. For instance, the cumulative reward distribution under a policy $\pi$ is denoted by $F^\pi(\nu)$, and its sample estimate is denoted by $\hat{F}^\pi_n(\nu)$.

## Related Work on Off-Policy Evaluation

Importance sampling (IS) (Horvitz and Thompson 1952; Sutton and Barto 2018), also known as inverse propensity scoring (IPS) (Dudík et al. 2014), provides a technique for unbiased estimation of the expected target reward. However, it suffers from high variance in large action spaces. There are numerous extensions to IS for variance reduction (Dudík, Langford, and Li 2011; Thomas 2015; Kallus and Uehara 2019). The IS estimator, and all methods derived from it, rely on a standard common-support assumption. We use a weaker form of this assumption that requires support at the slot level instead of over the entire slate.

Assumption 1 (Common Support). The set $D_n^\mu$ contains i.i.d. tuples generated using $\mu$, such that for some (unknown) $\varepsilon > 0$, $\mu_k(A^k \mid X) < \varepsilon \implies \pi(A^k \mid X) = 0$, for all $k, X, A$.

Applying these methods to slates, Swaminathan et al. (2017) assume additivity of slate rewards as a way to reduce the variance in estimating the expected reward of the target policy. Further variance reduction is obtained by using control variates (Vlassis et al. 2021). McInerney et al. (2020) assume a Markov structure on the slate rewards during sequential item interaction and propose an estimator for the expected target reward. We refer the reader to these papers for additional references on slate recommendations. These off-policy estimators, along with most others in the literature, provide estimates of the expected reward of the target policy (Li et al. 2018).
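To make the setup concrete, the following is a minimal synthetic sketch of logging slate-bandit data with a factored uniform policy. It is illustrative only: the context distribution, reward model, sizes, and all function names are our own choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, n = 5, 10, 1000  # slots per slate, candidates per slot, logged samples (illustrative sizes)

def sample_context():
    # User context X ~ d_X(.); here a 3-dimensional feature vector for illustration.
    return rng.normal(size=3)

def logging_policy(x):
    # Factored uniform logging policy: mu(A | X) = prod_k mu_k(A^k | X), each slot uniform over N items.
    slate = rng.integers(0, N, size=K)      # slate action A = [A^1, ..., A^K]
    slot_props = np.full(K, 1.0 / N)        # logged propensities mu_k(A^k | X)
    return slate, slot_props

def slate_reward(x, slate):
    # Scalar slate-level reward R ~ d_R(. | A, X); slot-level rewards are never observed.
    click_prob = 0.1 + 0.05 * np.sum(slate == 0) / K
    return float(rng.binomial(1, min(click_prob, 1.0)))

# Logged dataset D_n = {(X_i, A_i, R_i)}_{i=1}^n, with slot-level propensities recorded for evaluation.
logged = []
for _ in range(n):
    x = sample_context()
    a, mu_props = logging_policy(x)
    logged.append((x, a, mu_props, slate_reward(x, a)))
```

Recording the slot-level propensities alongside each tuple is what makes the slot-level common-support condition (Assumption 1) checkable and the slot-level importance ratios computable later on.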
However, the expected value is usually not sufficient for a comprehensive off-policy analysis, particularly for the risk assessment that is crucial for recommendation systems (Shani and Gunawardana 2011). Additional metrics of interest, often computable only from the whole reward distribution, are necessary in practice (Keramati et al. 2020; Altschuler, Brunel, and Malek 2019). For example, metrics like value at risk are used for risk analysis of a new recommendation strategy. To that end, work on universal off-policy estimation (UnO) (Chandak et al. 2021) uses ideas motivated by importance sampling to estimate the whole cumulative distribution of the reward under the target policy. However, in combinatorially large action spaces, as with the slate problems we consider here, the UnO estimator can incur prohibitive variance. Our proposed estimator utilizes possible structure in slate rewards to circumvent this issue.

## Structure in Slate Rewards

The combinatorial action space of slates is a key challenge for general off-policy evaluation methods such as importance sampling (IS) (Wang, Agarwal, and Dudík 2017). The generality of these approaches means they do not fully leverage the structure present in slate rewards (Sunehag et al. 2015), and thus general IS-based approaches frequently suffer from high variance. Prior work leverages user behavior patterns while interacting with slates, which are encoded into structured rewards (e.g., time spent, items clicked, etc.). Some examples of structure in slate rewards are a Markov structure for observed slot-level rewards (McInerney et al. 2020), the dependence of the slate reward only on the selected slot (Ie et al. 2019), and unobserved slot-level rewards with an additively decomposable slate reward (Swaminathan et al. 2017; Vlassis et al. 2021). The last one is of particular interest: additivity of the expected reward (Cesa-Bianchi and Lugosi 2012) posits that the conditional mean slate-level reward decomposes additively as the sum of (arbitrary) slot-level latent functions, i.e., $\mathbb{E}[R \mid A, X] = \sum_{k=1}^{K} \phi_k(A^k, X)$. This has been leveraged to obtain a significant reduction in estimation variance for off-policy evaluation (Swaminathan et al. 2017; Vlassis et al. 2021).

This decomposition captures the individual effects of each slot. It may readily be generalized to capture non-additive joint effects of more than one slot action; for example, to capture the effects of pairs of slot actions, one may consider the decomposition

$$\mathbb{E}[R \mid A, X] = \sum_{j \neq k} \phi_{jk}(A^k, A^j, X). \quad (1)$$

It may further be generalized to capture the combined effects of m slots. Note that in the most general case, m = K, the reward does not permit any decomposition over slots. Analogous to the above structural conditions, we posit a condition that allows us to perform consistent and unbiased estimation of the target off-policy distribution.

Assumption 2 (Additive CDF). There exists an additive decomposition of the conditional cumulative distribution function (CDF) of the slate reward as the sum of (arbitrary) slot-level latent functions:

$$F_R(\nu) = \sum_{k=1}^{K} \psi_k(A^k, X, \nu), \quad \forall \nu, \quad \text{where } F_R(\nu) := \Pr(R \le \nu \mid A, X).$$

The slot-level rewards, if any, are unobserved. The condition only assumes that an additive decomposition exists and does not require knowledge of the constituent slot-level functions.
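As one illustration of how Assumption 2 can hold exactly (our own construction, not from the paper): suppose the user engages with exactly one slot, chosen with fixed probabilities $w_k$, and the reward is then drawn from a slot-level distribution. The conditional CDF is the mixture $\sum_k w_k F_k(\nu \mid A^k, X)$, which is additive with $\psi_k(A^k, X, \nu) = w_k F_k(\nu \mid A^k, X)$. A minimal sketch with hypothetical slot-level CDFs:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
w = np.full(K, 1.0 / K)  # probability that the user engages with slot k (hypothetical)

def slot_reward_cdf(k, a_k, x, nu):
    # Hypothetical slot-level CDF F_k(nu | A^k, X): exponential reward with an (A^k, X)-dependent rate.
    rate = 1.0 + 0.1 * a_k + 0.05 * x
    return 1.0 - np.exp(-rate * np.maximum(nu, 0.0))

def slate_reward_cdf(slate, x, nu):
    # F_R(nu | A, X) = sum_k w_k F_k(nu | A^k, X): additive over slots (Assumption 2),
    # with psi_k(A^k, X, nu) = w_k F_k(nu | A^k, X).
    return sum(w[k] * slot_reward_cdf(k, slate[k], x, nu) for k in range(K))

def sample_slate_reward(slate, x):
    # Equivalent generative view: pick one slot with probabilities w, then draw its reward.
    k = rng.choice(K, p=w)
    rate = 1.0 + 0.1 * slate[k] + 0.05 * x
    return rng.exponential(1.0 / rate)
```

Note that this is different from summing independent slot-level rewards: a sum of slot rewards has an additive *mean* but a convolved CDF, whereas the mixture above satisfies the additive-CDF condition directly.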
We demonstrate empirically that this condition is often a close approximation for real-world data, and that our estimator performs better than more general methods even when the condition is only an inexact approximation, i.e., when a perfect decomposition does not exist. This decomposition may also be generalized to capture the combined effects of m slots. In line with prior work (Wen, Kveton, and Ashkan 2015; Kale, Reyzin, and Schapire 2010; Kveton et al. 2015; Swaminathan et al. 2017; Vlassis et al. 2021; Ie et al. 2019), we focus our analysis on the case m = 1, which proves effective in practice, as corroborated by our empirical analysis. We additionally provide derivations showing how the estimator readily generalizes to cases where m > 1, along with a theoretical analysis of its properties. It is worth noting that an additively decomposable reward CDF always implies an additive expected reward by definition, and the former is often a close approximation when the latter holds. This is helpful since a commonly used metric for evaluating the performance of slates, the normalized discounted cumulative gain (nDCG) (Burges et al. 2005), is an additively decomposable metric and has been used in the past for defining the slate reward (Järvelin and Kekäläinen 2017; Swaminathan et al. 2017).

## Slate Universal Off-Policy Evaluation

We now turn to off-policy evaluation in slates as an off-policy reward distribution estimation task. The core idea builds upon the framework of Chandak et al. (2021), who use importance weights $\rho$ in an estimator of the reward CDF of a target policy $\pi$ from logged data $D_n^\mu$. In the case of slates, the importance weight $\rho$ comprises probability ratios over all slot actions. The most direct approach for defining $\rho$ in the case of a factored logging policy $\mu$ is a formulation analogous to importance sampling, taking the product of the slot-level probability ratios (Equation (2)). This approach is plagued by high variance when the size of the slate K is large. To remedy this, our proposed algorithm SUnO utilizes the structure in slates provided by Assumption 2, wherein the CDF of the slate-level reward admits an additive decomposition. In place of $\rho$, we define an importance weight G (Equation (2)) that is a sum of slot-level density ratios:

$$\rho := \frac{\pi(A \mid X)}{\prod_{k=1}^{K} \mu_k(A^k \mid X)}, \qquad G := 1 - K + \sum_{k=1}^{K} \frac{\pi(A^k \mid X)}{\mu_k(A^k \mid X)}. \quad (2)$$

The estimator for the target distribution counts the number of samples for which the reward is at most a threshold $\nu$ and reweighs that count with importance weights so that it reflects the counts under the target policy $\pi$. In expectation, the counts reflect the probability of obtaining a reward of at most $\nu$, which is the value of the reward CDF at $\nu$, denoted by $F^\pi(\nu)$. Using the importance weight G in place of $\rho$ results in significantly lower estimation variance and an improved effective sample size, while keeping the estimator unbiased. It is easy to confirm that under a factored logging policy, $\mathbb{E}_\mu[G] = 1$. Below we prove that this importance weight allows for a change of distribution for the slate reward and can be used to estimate the target reward CDF. Our main result is the following:

Theorem 3. Let R be a real-valued random variable that denotes the slate reward and admits an additive decomposition of its conditional cumulative distribution $F_R(\nu)$ (Assumption 2). Under a factored $\mu$ and Assumption 2,

$$F^\pi(\nu) = \mathbb{E}_\mu\big[G \, \mathbf{1}\{R \le \nu\}\big], \quad \forall \nu.$$

Thus, a weighted expectation of the indicator function, with weights given by G, gives the target CDF. Based on this result, we propose the following sample estimator for $F^\pi(\nu)$ that uses data $D_n^\mu$:

$$\hat{F}^\pi_n(\nu) := \frac{1}{n} \sum_{i=1}^{n} G_i \, \mathbf{1}\{R_i \le \nu\}, \quad \forall \nu.$$

This estimator, called the slate universal off-policy estimator (SUnO), is outlined in Algorithm 1.

Algorithm 1: SUnO(ν)
Input: ν, µ, π, $D_n^\mu = \{(X_i, A_i, R_i)\}_{i=1}^{n}$
Output: $\hat{F}^\pi_n(\nu)$
1: $s \leftarrow 0$  {Initialize counter and iterate over the logged dataset}
2: for $i = 1, 2, \ldots, n$ do
3:   $G_i \leftarrow 1 - K + \sum_{k=1}^{K} \frac{\pi(A^k_i \mid X_i)}{\mu_k(A^k_i \mid X_i)}$  {Compute the importance weight (Equation (2))}
4:   $s \leftarrow s + \mathbf{1}\{R_i \le \nu\}\, G_i$  {Add to the counter if the reward is at most ν}
5: end for
6: return $\hat{F}^\pi_n(\nu) = s/n$  {Normalize and return the counter}
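For concreteness, here is a minimal NumPy sketch of Algorithm 1 under the notation above. It is an illustrative implementation, not the authors' code; the vectorized interface and the assumption that slot-level propensities are available as arrays alongside the logged data are ours.

```python
import numpy as np

def suno_cdf(nu, rewards, pi_props, mu_props):
    """SUnO estimate of the target reward CDF at threshold nu (Algorithm 1).

    rewards  : shape (n,)   logged slate-level rewards R_i
    pi_props : shape (n, K) target slot-level propensities pi(A_i^k | X_i)
    mu_props : shape (n, K) logging slot-level propensities mu_k(A_i^k | X_i)
    """
    K = pi_props.shape[1]
    # G_i = 1 - K + sum_k pi(A_i^k | X_i) / mu_k(A_i^k | X_i)   (Equation (2))
    G = 1.0 - K + (pi_props / mu_props).sum(axis=1)
    # Weighted count of samples with reward at most nu, normalized by n.
    return float(np.mean(G * (rewards <= nu)))

# Example usage on a grid of thresholds (arrays as described above are assumed):
# nus = np.linspace(rewards.min(), rewards.max(), 100)
# F_hat = [suno_cdf(nu, rewards, pi_props, mu_props) for nu in nus]
```

Evaluating the estimator on a grid of thresholds yields the full estimated CDF curve, from which downstream metrics can be computed.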
In the following result, we establish that SUnO leverages the additive structure in Assumption 2 to obtain an unbiased and pointwise consistent estimate of the CDF under the target policy. The proofs of both results may be found in the Appendix.

Theorem 4. Under Assumption 2, $\hat{F}^\pi_n(\nu)$ is an unbiased and pointwise consistent estimator of $F^\pi(\nu)$.

It is important to note that, analogous to Swaminathan et al. (2017), our estimator does not require knowledge of the specific functions (the $\psi_k$'s) in the decomposition of the conditional CDF in Assumption 2; it only assumes the existence of a set of such latent functions, and a corresponding additive decomposition of the conditional CDF, to attain unbiased estimation. Even in cases where the assumption is not satisfied, our method (Algorithm 1) performs robustly, as we demonstrate in our experiments.

The estimated target CDF can be used to compute metrics of interest as functions of the CDF (for example, mean, variance, VaR, CVaR, etc.). Some of these metrics are non-linear functions of the CDF (VaR, CVaR), and thus their sample estimates are biased; this is to be expected (Chandak et al. 2021). Metrics that are linear functions of the CDF, however, have unbiased sample estimators. Thus, an unbiased target CDF estimator serves as a one-shot solution for most metrics of interest, though unbiasedness holds only for certain metrics. We demonstrate the estimation of some of these metrics in our empirical analysis.

## Properties of SUnO

Variance: The estimator enjoys significantly lower variance for target estimation. The key factor is that it uses importance weights that are a sum of slot-level density ratios, as opposed to the product used by UnO (Chandak et al. 2021). In the slate setting in particular, the latter suffers from enormous variance and a reduced effective sample size, as we demonstrate empirically. Consider the worst-case variance of the two estimators. From Assumption 1 we have $0 \le \frac{\pi(A^k \mid X)}{\mu_k(A^k \mid X)} \le \frac{1}{\varepsilon}$, which implies

$$\mathrm{Var}(\text{UnO}) = O\!\left(\frac{1}{\varepsilon^{K}}\right); \qquad \mathrm{Var}(\text{SUnO}) = O\!\left(\frac{K}{\varepsilon}\right).$$

Thus the worst-case variance of SUnO grows linearly with the size of the slate (K), while that of UnO grows exponentially.

Generalization to m-slot reward decomposition: As noted earlier, the structural assumptions on slate rewards may be generalized to account for the joint effects of multiple slots. The proposed estimator readily applies to such generalizations. For instance, consider the case where the conditional reward CDF decomposes into terms composed of m slot actions:

$$F_R(\nu) = \sum_{1 \le k_1 < \cdots < k_m \le K} \psi_{k_1 \ldots k_m}\!\left(A^{k_1}, \ldots, A^{k_m}, X, \nu\right).$$
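To connect back to the metrics discussed earlier in this section, the sketch below shows how plug-in VaR and CVaR estimates can be read off an estimated reward CDF such as the one SUnO produces. The grid-inversion approach and function names are illustrative (not from the paper), and, as noted above, plug-in estimates of these nonlinear functionals are biased.

```python
import numpy as np

def var_cvar_from_cdf(nus, F_hat, alpha=0.05):
    """Plug-in VaR/CVaR at level alpha from a CDF estimated on a grid of thresholds.

    nus   : increasing grid of reward thresholds
    F_hat : estimated CDF values at those thresholds (e.g., from suno_cdf)
    """
    F = np.clip(np.maximum.accumulate(np.asarray(F_hat)), 0.0, 1.0)  # enforce a monotone CDF in [0, 1]
    var_idx = int(np.searchsorted(F, alpha))                         # smallest nu with F(nu) >= alpha
    var = float(nus[min(var_idx, len(nus) - 1)])
    # CVaR: expected reward conditional on falling in the lower alpha-tail,
    # approximated from the discretized distribution implied by the CDF grid.
    pmf = np.diff(np.concatenate(([0.0], F)))
    tail = nus <= var
    tail_mass = pmf[tail].sum()
    cvar = float((nus[tail] * pmf[tail]).sum() / tail_mass) if tail_mass > 0 else var
    return var, cvar
```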