Uncertainty-Aware Instance Reweighting for Off-Policy Learning

Xiaoying Zhang1 Junpu Chen2 Hongning Wang3 Hong Xie4 Yang Liu1 John C.S. Lui5 Hang Li1
1ByteDance Research 2Chongqing University 3Tsinghua University 4Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences 5The Chinese University of Hong Kong
{zhangxiaoying.xy,yang.liu01,lihang.lh}@bytedance.com {jumpchan98,hongx87,wang.hongn}@gmail.com cslui@cse.cuhk.edu.hk

Abstract

Off-policy learning, referring to the procedure of policy optimization with access only to logged feedback data, has shown its importance in various real-world applications, such as search engines and recommender systems. While the ground-truth logging policy is usually unknown, previous work simply employs its estimated value for off-policy learning, ignoring both the high bias and the high variance that result from such an estimator. Such impact is often magnified on samples with small and inaccurately estimated logging probabilities. The contribution of this work is to explicitly model the uncertainty in the estimated logging policy, and propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning, with a theoretical convergence guarantee. Experiment results on synthetic and real-world recommendation datasets demonstrate that UIPS significantly improves the quality of the discovered policy, when compared against an extensive list of state-of-the-art baselines.

1 Introduction

In many real-world applications, including search engines [2], online advertisements [35], and recommender systems [8, 22], only logged data is available for subsequent policy learning. For example, in recommender systems, various complex recommendation policies are optimized over logged user interactions (e.g., clicks or stay time) with items recommended by previous recommendation policies (referred to as the logging policy) [51, 14]. However, such logged data is known to be biased, since the feedback on items that the logging policy did not select is unknown. This inevitably distorts the evaluation and optimization of a new policy when it differs from the logging policy. Off-policy learning [41, 27] thus emerges as a preferred way to learn an improved policy only from the logged data, by addressing the mismatch between the learning and logging policies.

One of the most commonly used off-policy learning methods is Inverse Propensity Scoring (IPS) [8, 25], which assigns a per-sample importance weight (i.e., propensity score) to the training objective on the logged data, so as to obtain an unbiased optimization objective in expectation. The importance weight in IPS is the probability ratio of taking an action between the learning and logging policies. Unfortunately, the ground-truth logging policy is oftentimes unavailable to the learner in practice, due to reasons like legacy issues, i.e., it was not recorded in the data. Additionally, in specific situations like the healthcare domain [28] or two-stage recommender systems [8], access to the ground-truth logging policy is not feasible. One common treatment in many previous studies [35, 22, 8, 24] is to first estimate the logging policy using a supervised learning method (e.g., logistic regression,
neural networks, etc.), and then employ the estimated logging policy for off-policy learning. In this work, we first show that such an approximation results in a biased estimator which is sensitive to data with small estimated logging probabilities. Worse still, small estimated logging probabilities usually suggest there are limited related samples in the logged data, whose estimates can have high uncertainty, i.e., they are wrong with a high probability. Figure 1 shows a piece of empirical evidence from the large-scale recommendation benchmark KuaiRec [12], where items with lower frequencies in the logged dataset have lower estimated logging probabilities (via a neural network estimator) and higher uncertainties at the same time.

[Figure 1: Estimated logging policy and its uncertainty under different item frequencies on KuaiRec. Panel (a): estimated logging probability vs. log(item freq); panel (b): uncertainty of estimation vs. log(item freq).]

The high bias and variance caused by these samples can greatly hinder the performance of subsequent off-policy learning. We defer detailed discussion of this result to Section 2.

In this work, we explicitly take the uncertainty of the estimated logging policy into consideration and design an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for off-policy learning. UIPS reweighs the propensity score of each logged sample to control its impact on policy optimization, and learns an improved policy by alternating between: (1) finding the optimal weight that makes the estimator as accurate as possible, based on the uncertainty of the estimated logging policy; and (2) improving the policy by optimizing the resulting objective function. The optimal weight for each sample is obtained by minimizing an upper bound of the mean squared error (MSE) to the ground-truth policy evaluation, with a closed-form solution. Furthermore, UIPS ensures that off-policy learning converges to a stationary point where the true policy gradient is zero, whereas convergence may not be guaranteed when directly using the estimated logging policy. Extensive experiments on a synthetic and three real-world recommendation datasets against a rich set of state-of-the-art baselines demonstrate the power of UIPS. All data and code can be found at https://github.com/Xiaoyinggit/UIPS.git.

2 Preliminary: off-policy learning

We focus on the standard contextual bandit setup to explain the key concepts in UIPS. Following the convention [16, 29, 36], let x ∈ X ⊆ R^d be a d-dimensional context vector drawn from an unknown distribution p(x). Each context is associated with a finite set of actions denoted by A, where |A| < ∞. Let π : A × X → [0, 1] denote a stochastic policy, such that π(a|x) is the probability of selecting action a under context x and ∑_{a∈A} π(a|x) = 1. Under a given context x, the reward r_{x,a} is only observed when action a is chosen, i.e., bandit feedback. Without loss of generality, we assume r_{x,a} ∈ [0, 1]. Let V(π) denote the expected reward of the policy π:

V(π) = E_{x∼p(x), a∼π(a|x)}[ r_{x,a} ].   (1)

We look for a policy π(a|x) that maximizes V(π). In the rest of the paper, we denote E_{x∼p(x), a∼π(a|x)}[·] as E_π[·]. In off-policy learning, one can only access a set of logged feedback data D := {(x_n, a_n, r_{x_n,a_n}) | n ∈ [N]}. Given x_n, the action a_n was generated by a stochastic logging policy β*, i.e., a_n ∼ β*(a|x_n), which is usually different from the learning policy π(a|x) [24, 40, 8]. The actions {a_1, ..., a_N} and their corresponding rewards {r_{x_1,a_1}, ..., r_{x_N,a_N}} are generated independently given β*.
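To make this setup concrete, here is a minimal NumPy sketch (not from the paper; every name and number is an illustrative stand-in) of logged bandit feedback: contexts are drawn from p(x), actions from a softmax logging policy β*, and only the chosen action's reward is recorded, while V(π) in Eq.(1) could only be estimated by fresh on-policy interaction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions, N = 8, 20, 10_000            # context dim, |A|, number of logged samples

# Hypothetical ground-truth reward model and logging-policy parameters (illustrative only).
true_reward_w = rng.normal(size=(n_actions, d))
logging_w = rng.normal(size=(n_actions, d))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_actions(probs):
    # one categorical draw per row of `probs`
    return np.array([rng.choice(len(p), p=p) for p in probs])

# Logged dataset D = {(x_n, a_n, r_{x_n,a_n})}: only the chosen action's reward is observed.
X = rng.normal(size=(N, d))
beta_probs = softmax(X @ logging_w.T)          # beta*(a|x) for every action
A = sample_actions(beta_probs)
R = rng.binomial(1, sigmoid(np.sum(X * true_reward_w[A], axis=1)))

# V(pi) in Eq.(1), estimated by on-policy Monte Carlo. This requires fresh interaction with the
# environment, which is exactly what off-policy learning tries to avoid.
def value_on_policy(policy_w, n_samples=50_000):
    Xs = rng.normal(size=(n_samples, d))
    As = sample_actions(softmax(Xs @ policy_w.T))
    return rng.binomial(1, sigmoid(np.sum(Xs * true_reward_w[As], axis=1))).mean()

print("on-policy estimate of V(beta*):", value_on_policy(logging_w))
```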
The main challenge is then to address the distributional discrepancy between β*(a|x) and π(a|x) when optimizing π(a|x) to maximize V(π) with access only to the logged dataset D.

One of the most widely used methods to address the distribution shift between π(a|x) and β*(a|x) is Inverse Propensity Scoring (IPS) [8, 25]. One can easily get that V(π) = E_{β*}[ π(a|x)/β*(a|x) · r_{x,a} ], yielding the following empirical estimator of V(π):

V̂_IPS(π) = (1/N) ∑_{n=1}^{N} π(a_n|x_n)/β*(a_n|x_n) · r_{x_n,a_n},   (2)

where π(a_n|x_n)/β*(a_n|x_n) is referred to as the propensity score. Various algorithms can be readily used for policy optimization under V̂_IPS(π), including value-based methods [33] and policy-based methods [19, 31, 42]. In this work, we adopt a well-known policy gradient algorithm, REINFORCE [42]. Assume the policy π(a|x) is parameterized by ϑ; via the log-trick, the gradient of V̂_IPS(π_ϑ) with respect to ϑ can be readily derived as

∇_ϑ V̂_IPS(π_ϑ) = (1/N) ∑_{n=1}^{N} π_ϑ(a_n|x_n)/β*(a_n|x_n) · r_{x_n,a_n} · ∇_ϑ log π_ϑ(a_n|x_n).

Approximation with an unknown logging policy. In many real-world applications, the ground-truth logging probabilities, i.e., β*(a|x) of each observation (x, a) in D, are unknown. As a typical workaround, previous work employs supervised learning methods such as logistic regression [30] and neural networks [8] to estimate the logging policy, and replaces β*(a|x) with its estimated value β̂(a|x) to get the following BIPS estimator for policy learning:

V̂_BIPS(π_ϑ) = (1/N) ∑_{n=1}^{N} π_ϑ(a_n|x_n)/β̂(a_n|x_n) · r_{x_n,a_n}.   (3)

However, as shown in the following proposition, an inaccurate β̂(a|x) leads to high bias and variance in BIPS. Worse still, a smaller and inaccurate β̂(a|x) further enlarges this bias and variance.

Proposition 2.1. The bias and variance of V̂_BIPS(π_ϑ) can be derived as follows:

Bias(V̂_BIPS(π_ϑ)) = E_D[ V̂_BIPS(π_ϑ) ] − V(π_ϑ) = E_{π_ϑ}[ ( β*(a|x)/β̂(a|x) − 1 ) r_{x,a} ],
Var_D(V̂_BIPS(π_ϑ)) = (1/N) ( Var_{π_ϑ}[ β*(a|x)/β̂(a|x) · r_{x,a} ] + E_{π_ϑ}[ ( π_ϑ(a|x)/β*(a|x) − 1 ) · β*(a|x)²/β̂(a|x)² · r²_{x,a} ] ).

However, a smaller β̂(a|x) usually implies fewer related training samples in the logged data, and thus β̂(a|x) can be inaccurate with a higher probability. To make this more explicit, let us revisit the empirical results shown in Figure 1. We followed the method introduced in [8] to estimate the logging policy on the KuaiRec dataset [12] and plotted the estimated β̂(a|x) and its corresponding uncertainties on items of different observation frequencies in the logged dataset. We adopted the method in [45] to measure the confidence interval of β̂(a|x) on each instance. A wider confidence interval, i.e., higher uncertainty in estimation, implies that with a high probability the true value may be further away from the empirical mean estimate. We can observe in Figure 1 that as item frequency decreases, the estimated logging probability also decreases, but the estimation uncertainty increases. This implies that a smaller β̂(a|x) is usually 1) more inaccurate and 2) associated with a higher uncertainty. As a result, with the high bias and variance caused by an inaccurate β̂(a|x), it is erroneous to learn π_ϑ(a|x) by simply optimizing V̂_BIPS(π_ϑ). Furthermore, this approach may also hinder the convergence of off-policy learning, as discussed later in Section 3.2.

3 Uncertainty-aware off-policy learning

Our idea is to account for the uncertainty of the estimated logging policy by incorporating a per-sample weight ϕ_{x,a}, and perform policy learning by optimizing the following empirical estimator:

V̂_UIPS(π_ϑ) = (1/N) ∑_{n=1}^{N} π_ϑ(a_n|x_n)/β̂(a_n|x_n) · ϕ_{x_n,a_n} · r_{x_n,a_n}.   (4)
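The following toy NumPy example (all probabilities and rewards are made up for illustration) computes the estimators in Eqs.(2)–(4) on a single context and shows how a mis-estimated propensity on a rare action biases BIPS — the failure mode that Proposition 2.1 formalizes.

```python
import numpy as np

def v_ips(pi_probs, beta_probs, rewards):
    """Eq.(2): inverse-propensity-scored value estimate from logged data."""
    return np.mean(pi_probs / beta_probs * rewards)

def v_weighted_ips(pi_probs, beta_hat_probs, rewards, phi=1.0):
    """Eq.(3) when phi == 1 (BIPS), and Eq.(4) for a generic per-sample weight phi (UIPS)."""
    return np.mean(pi_probs / beta_hat_probs * phi * rewards)

# A tiny single-context example with 3 actions (all numbers are made up for illustration).
rng = np.random.default_rng(1)
beta_true = np.array([0.70, 0.25, 0.05])   # ground-truth logging policy beta*(a|x)
beta_hat  = np.array([0.72, 0.26, 0.02])   # estimated logging policy; the rare action is off
pi        = np.array([0.10, 0.30, 0.60])   # learning policy pi(a|x)
r_mean    = np.array([0.20, 0.50, 0.80])   # expected reward of each action

N = 200_000
a = rng.choice(3, size=N, p=beta_true)     # logged actions drawn from beta*
r = rng.binomial(1, r_mean[a])             # observed bandit feedback

true_value = np.sum(pi * r_mean)           # V(pi) in closed form for this toy problem
print("V(pi)                 :", true_value)
print("IPS with beta*  (Eq.2):", v_ips(pi[a], beta_true[a], r))
print("BIPS with beta^ (Eq.3):", v_weighted_ips(pi[a], beta_hat[a], r))
# The rare action (beta*=0.05 but beta^=0.02) gets a 2.5x inflated propensity score,
# so BIPS is biased upward on exactly the sample type Proposition 2.1 warns about.
```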
Intuitively, one should assign lower weights to samples whose β̂(a|x) is small and far away from the ground-truth β*(a|x). We then divide off-policy optimization into two iterative steps:

Deriving the optimal instance weight: Find the optimal ϕ_{x,a} that makes V̂_UIPS(π_ϑ) approach its ground-truth V(π_ϑ) as closely as possible, so as to facilitate policy learning. The derived optimal weight is denoted as ϕ*_{x,a} (see Theorem 3.2).

Policy improvement: Update the policy π_ϑ(a|x) using the following gradient:

∇_ϑ V̂_UIPS(π_ϑ) = (1/N) ∑_{n=1}^{N} π_ϑ(a_n|x_n)/β̂(a_n|x_n) · ϕ*_{x_n,a_n} · r_{x_n,a_n} · ∇_ϑ log π_ϑ(a_n|x_n).   (5)

The whole algorithm framework and its computational cost, as well as important notations, are summarized in Appendix 7.1.

3.1 Deriving the optimal uncertainty-aware instance weight

We expect to find the optimal weight ϕ_{x,a} that makes the empirical estimator V̂_UIPS(π_ϑ) as accurate as possible, taking into account the uncertainty in the estimated logging probabilities. Intuitively, a high accuracy of the estimator is crucial for determining the correct direction of policy learning. We follow previous work [36, 29] and measure the mean squared error (MSE) of V̂_UIPS(π_ϑ) against the ground-truth policy value V(π_ϑ), which captures both the bias and variance of an estimator. A lower MSE indicates a more accurate estimator. In UIPS, instead of directly minimizing the MSE, which is intractable, we find ϕ_{x,a} to minimize an upper bound of the MSE. As we show later, the optimal ϕ_{x,a} has a closed-form solution which relates to both the value of π_ϑ(a|x)/β̂(a|x) and the estimation uncertainty of β̂(a|x).

Theorem 3.1. The mean squared error (MSE) between V̂_UIPS(π_ϑ) and the ground-truth value V(π_ϑ) is upper bounded as follows:

MSE(V̂_UIPS(π_ϑ)) = E_D[ ( V̂_UIPS(π_ϑ) − V(π_ϑ) )² ] = Bias(V̂_UIPS(π_ϑ))² + Var(V̂_UIPS(π_ϑ))
  ≤ E_{π_ϑ}[ r²_{x,a} π_ϑ(a|x)/β*(a|x) ] · E_{π_ϑ}[ ( β*(a|x)/β̂(a|x) · ϕ_{x,a} − 1 )² ] + E_{π_ϑ}[ π_ϑ(a|x)²/β̂(a|x)² · ϕ²_{x,a} ].

As the first expectation term E_{π_ϑ}[ r²_{x,a} π_ϑ(a|x)/β*(a|x) ] is a non-negative constant, we denote it as λ ∈ [0, ∞) when searching for ϕ_{x,a}. To minimize this upper bound of the MSE, the optimal ϕ_{x,a} for each sample (x, a) should minimize the following:

λ ( β*(a|x)/β̂(a|x) · ϕ_{x,a} − 1 )² + π_ϑ(a|x)²/β̂(a|x)² · ϕ²_{x,a}.   (6)

An interesting observation is that setting ϕ_{x,a} = β̂(a|x)/β*(a|x), i.e., turning π(a|x)/β̂(a|x) · ϕ_{x,a} into π(a|x)/β*(a|x), does not give the optimal solution of Eq.(6). This is because such a setting only reduces the bias (i.e., the first term of Eq.(6)), but fails to control the second term, which is related to the variance. Moreover, we cannot directly minimize Eq.(6) due to the unknown β*(a|x). But it is possible to obtain a confidence interval which contains β*(a|x) with a high probability, when β̂(a|x) is obtained via a specific estimator, e.g., a (generalized) linear model or kernel methods. Following previous work [23, 16, 22], we adopt the realizability assumption that β*(a|x) can be represented by a softmax function applied over a parametric function f_{θ*}(x, a). Moreover, the universal approximation theorem [18] states that a parametric function with sufficient capacity, when combined with a softmax function, can approximate any distribution. Then we have:

β*(a|x) ∝ exp(f_{θ*}(x, a)),   β̂(a|x) ∝ exp(f_θ(x, a)),   (7)

where f_θ(x, a) is an estimate of f_{θ*}(x, a). Following the conventional definition of a confidence interval [20], we define γ and U_{x,a} such that |f_{θ*}(x, a) − f_θ(x, a)| ≤ γU_{x,a} holds with probability at least 1−δ, where γ is a function of δ (typically, the smaller δ is, the larger γ is). Then γU_{x,a} measures the width of the confidence interval of f_θ(x, a) around its ground-truth f_{θ*}(x, a).
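A small numeric illustration of the observation above (arbitrary numbers; λ treated as a constant as in the text): plugging ϕ_{x,a} = β̂(a|x)/β*(a|x) into Eq.(6) removes the bias term but leaves a large variance term, whereas the pointwise minimizer of Eq.(6) — derived here under the hypothetical assumption that β* were known — achieves a much smaller objective.

```python
import numpy as np

def objective_eq6(phi, lam, pi, beta_star, beta_hat):
    """Per-sample objective of Eq.(6): lam*(beta*/beta^ * phi - 1)^2 + (pi/beta^)^2 * phi^2."""
    return lam * (beta_star / beta_hat * phi - 1.0) ** 2 + (pi / beta_hat) ** 2 * phi ** 2

# Arbitrary illustrative numbers for one (x, a) pair.
lam, pi, beta_star, beta_hat = 1.0, 0.3, 0.05, 0.02

phi_unbias = beta_hat / beta_star                 # removes the bias term only
# Closed-form minimizer of Eq.(6) if beta* were known (set the derivative to zero):
#   phi = lam*(beta*/beta^) / (lam*(beta*/beta^)^2 + (pi/beta^)^2)
b = beta_star / beta_hat
phi_opt = lam * b / (lam * b ** 2 + (pi / beta_hat) ** 2)

print("phi = beta^/beta* :", phi_unbias, "objective:", objective_eq6(phi_unbias, lam, pi, beta_star, beta_hat))
print("phi = minimizer   :", phi_opt,    "objective:", objective_eq6(phi_opt,    lam, pi, beta_star, beta_hat))
# The bias-only choice zeroes the first term but pays a large variance term (pi/beta^)^2 * phi^2;
# the minimizer trades a little bias for a much smaller variance contribution.
```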
As derived in Appendix 7.2, with probability at least 1−δ, we have β*(a|x) ∈ B_{x,a} with

B_{x,a} = [ (Ẑ/Z) exp(−γU_{x,a}) β̂(a|x),  (Ẑ/Z) exp(γU_{x,a}) β̂(a|x) ],

where Z = ∑_{a'} exp(f_{θ*}(x, a')) and Ẑ = ∑_{a'} exp(f_θ(x, a')). As β*(a|x) can be any value in B_{x,a} with high probability, we aim to find the optimal ϕ_{x,a} that minimizes the worst case of Eq.(6), thereby ensuring that V̂_UIPS(π_ϑ) approaches its ground-truth V(π_ϑ) in the sense of MSE even in the worst possible scenario. This ensures the subsequent policy improvement direction will not be much worse with high probability. Thus, we formulate the following optimization problem:

min_{ϕ_{x,a}} max_{β_{x,a} ∈ B_{x,a}}  λ ( β_{x,a}/β̂(a|x) · ϕ_{x,a} − 1 )² + π_ϑ(a|x)²/β̂(a|x)² · ϕ²_{x,a}.   (8)

The following theorem derives a closed-form formula for the optimal solution of Eq.(8).

Theorem 3.2. Let η ∈ [exp(−γU_x^max), exp(γU_x^max)], where U_x^max = max_a U_{x,a}. The optimization problem in Eq.(8) has the closed-form solution

ϕ*_{x,a} = min( η / ( exp(−γU_{x,a}) + (η²/λ) · π_ϑ(a|x)²/β̂(a|x)² · exp(γU_{x,a}) ),  2η / ( exp(γU_{x,a}) + exp(−γU_{x,a}) ) ).

The following corollary demonstrates the advantage of UIPS. The detailed proofs of Theorem 3.2 and Corollary 3.3 can be found in Appendix 7.8.

Corollary 3.3. With ϕ*_{x,a} derived in Theorem 3.2, V̂_UIPS(π_ϑ) in Eq.(4) achieves a smaller upper bound of MSE than V̂_BIPS(π_ϑ) in Eq.(3).

Insights about ϕ*_{x,a}. A detailed analysis of the effect of ϕ*_{x,a} can be found in Lemma 7.1 in Appendix 7.8. In summary, we have the following key findings. For samples whose largest possible propensity score is under control, i.e., π_ϑ(a|x)/min B_{x,a} is bounded, a higher uncertainty implies a smaller value of π_ϑ(a|x)/β̂(a|x). This suggests that samples of this type with positive rewards are underestimated, and the extent of underestimation increases with the estimation uncertainty. UIPS thus chooses to increase ϕ*_{x,a} with the uncertainty, to emphasize these long-tail positive samples. Conversely, for samples with large propensity scores, UIPS decreases ϕ*_{x,a} as the uncertainty increases, so as to prevent them from distorting policy learning.

Uncertainty estimation. We now describe how to calculate U_{x,a}, i.e., the uncertainty of the estimated β̂(a|x). In this work, we choose to estimate β*(a|x) using a neural network, because 1) its representation learning capacity has been proven in numerous studies, and 2) various methods [11, 45] can be leveraged to perform uncertainty estimation for a neural network. We adopt [45] due to its computational efficiency and theoretical soundness. Following the proof of Theorem 4.4 in [45], given the logged dataset D, with high probability there exists γ such that

|f_θ(x_n, a_n) − f_{θ*}(x_n, a_n)| ≤ γ √( g(x_n, a_n)^⊤ M_D^{−1} g(x_n, a_n) ),

where g(x_n, a_n) is the gradient of f_θ(x_n, a_n) with respect to the neural network's last-layer parameters θ_w ⊆ θ, i.e., g(x_n, a_n) = ∇_{θ_w} f_θ(x_n, a_n), and M_D = ∑_{n=1}^{N} g(x_n, a_n) g(x_n, a_n)^⊤. This implies U_{x_n,a_n} = √( g(x_n, a_n)^⊤ M_D^{−1} g(x_n, a_n) ).
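Below is a minimal NumPy sketch of how U_{x,a} and ϕ*_{x,a} could be computed, assuming the closed form of Theorem 3.2 as stated above and treating γ, λ, and the two η's (disentangled into η1/η2 as in Appendix 7.3.3) as hyper-parameters; all inputs are random placeholders rather than the paper's actual models.

```python
import numpy as np

def uncertainty(G, reg=1.0):
    """U_{x,a} = sqrt(g^T M_D^{-1} g), with M_D = reg*I + sum_n g_n g_n^T built from the
    last-layer gradients G (shape [N, d]) of the logging-policy model."""
    N, d = G.shape
    M = reg * np.eye(d) + G.T @ G          # sum of outer products plus the identity initialization
    M_inv = np.linalg.inv(M)
    return np.sqrt(np.einsum("nd,de,ne->n", G, M_inv, G))

def phi_star(pi, beta_hat, U, gamma=1.0, lam=1.0, eta1=1.0, eta2=100.0):
    """Optimal weight of Theorem 3.2, with the two eta's disentangled into eta1/eta2 as in
    Appendix 7.3.3; all constants here are treated as tunable hyper-parameters."""
    first = eta1 / (np.exp(-gamma * U) + (eta1 ** 2 / lam) * (pi / beta_hat) ** 2 * np.exp(gamma * U))
    second = 2.0 * eta2 / (np.exp(gamma * U) + np.exp(-gamma * U))
    return np.minimum(first, second)

# Toy inputs: 5 logged samples with made-up last-layer gradients and probabilities.
rng = np.random.default_rng(2)
G = rng.normal(size=(5, 16))                      # hypothetical gradients g(x_n, a_n)
U = uncertainty(G)
pi = np.array([0.40, 0.05, 0.30, 0.10, 0.60])
beta_hat = np.array([0.30, 0.20, 0.02, 0.10, 0.50])
print("U    :", np.round(U, 3))
print("phi* :", np.round(phi_star(pi, beta_hat, U), 3))
# Qualitatively: when pi/beta^ is small, the first denominator term dominates and phi* grows with U
# (long-tail positives are emphasized); when pi/beta^ is large, the second denominator term
# dominates and phi* shrinks with U (their distortion is suppressed).
```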
3.2 Convergence of policy learning under UIPS

The following theorem provides the convergence result for UIPS, which converges to a stationary point of the expected reward function. The proof is provided in Appendix 7.9.

Theorem 3.4. Denote G_max and Φ as upper bounds of ‖∂π_ϑ(a|x)/∂ϑ‖ and E_{β*}[ π²_ϑ(a|x)/β̂²(a|x) · (ϕ*_{x,a})² ] respectively, i.e., ‖∂π_ϑ(a|x)/∂ϑ‖ ≤ G_max and E_{β*}[ π²_ϑ(a|x)/β̂²(a|x) · (ϕ*_{x,a})² ] ≤ Φ. Denote V_max as the finite maximum expected reward that can be achieved, and φ_max = max_{x,a} | β*(a|x)/β̂(a|x) · ϕ*_{x,a} − 1 |. Assume that the expected reward of π_ϑ, i.e., V(π_ϑ), is a differentiable and L-smooth function w.r.t. ϑ. Denote the policy parameters obtained by Eq.(5) at iteration k ∈ [K] as ϑ_k. Then φ_max ∈ (0, 1) and

(1/K) ∑_{k=1}^{K} E[ ‖∇V(π_{ϑ_k})‖ ]² ≤ 2LV_max / ( K(1 − φ_max) ) + (L + 2V_max) G²_max Φ / (1 − φ_max),

where ∇V(π_ϑ) is the true policy gradient under the ground-truth logging probability, i.e., ∇V(π_ϑ) = E_{β*}[ π_ϑ(a|x)/β*(a|x) · r_{x,a} · ∇_ϑ log π_ϑ(a|x) ].

Theorem 3.4 shows that, as K → ∞ and with 1/(1 − φ_max) and Φ being controlled, UIPS leads the policy update to converge to a stationary point where the true policy gradient ∇V(π_{ϑ_k}) is zero. And fortunately, UIPS is effective in controlling both 1/(1 − φ_max) and Φ. Specifically, denote φ_{x,a} = | β*(a|x)/β̂(a|x) · ϕ*_{x,a} − 1 | and Φ_{x,a} = π²_ϑ(a|x)/β̂²(a|x) · (ϕ*_{x,a})². It is clear that λφ²_{x,a} + Φ_{x,a} corresponds to the objective in Eq.(6) for deriving ϕ*_{x,a} for each sample (x, a). In other words, UIPS selects {ϕ*_{x,a}} to minimize φ_max = max{φ_{x,a}} and Φ = E_{β*}[Φ_{x,a}], which directly accelerates the convergence of the policy to a stationary point where the true policy gradient is zero. In the case of BIPS in Eq.(3), we have ϕ_{x,a} ≡ 1. Although Φ may be large due to small logging probabilities, the more concerning issue is that the requirement φ_max ∈ (0, 1) is no longer satisfied when β*(a|x) ≥ 2β̂(a|x), which may happen with a non-negligible probability. Hence, the convergence of policy learning under V̂_BIPS is no better than that under UIPS.

4 Empirical evaluations

We evaluate UIPS on both synthetic data and three real-world datasets with unbiased data collection. We compare UIPS with the following baselines, which can be grouped into five categories:

Cross-Entropy (CE): A supervised learning method with the cross-entropy loss over its softmax output. No off-policy correction is performed in this method.

BIPS-Cap [8]: The off-policy learning solution under the BIPS estimator in Eq.(3). The estimated propensity scores are further capped to control variance, i.e., taking min(c, π_ϑ(a|x)/β̂(a|x)) as the propensity score. Setting c to a small value reduces variance, but introduces bias.

MinVar & StableVar [46], Shrinkage [36]: This line of work improves off-policy evaluation by reweighing each sample. For example, MinVar and StableVar reweigh each sample by h_{x,a}/∑ h, with h_{x,a} = β̂(a|x)/π_ϑ(a|x)² and h_{x,a} = β̂(a|x)/π_ϑ(a|x) respectively, since they find that π_ϑ(a|x)²/β̂(a|x) is directly related to the policy evaluation variance. Su et al. [36] propose to shrink the propensity score by λ/(λ + π_ϑ(a|x)²/β̂(a|x)²), which is a special case of our UIPS with U_{x,a} = 0 and η = 1. All these methods simply treat β̂(a|x) as β*(a|x), and none of them consider the uncertainty of β̂(a|x).

SNIPS [39], BanditNet [16], POEM [38], POXM [23], Adaptive [22]: This line of work aims for more stable and accurate policy learning. For example, SNIPS normalizes the estimator by the sum of propensity scores in each batch. BanditNet extends SNIPS and leverages an additional Lagrangian term to normalize the estimator by an approximated sum of propensity scores over all samples. POEM jointly optimizes the estimator and its variance. POXM controls estimation variance by pruning samples with small logging probabilities. Adaptive proposes a new formulation to utilize negative samples.

ApproxKNN [5] and IPS-C-TS: This line of work improves off-policy learning by applying calibration to the estimated logging probabilities. ApproxKNN utilizes the K-Nearest-Neighbor algorithm for calibration, which exhibits the lowest calibration error in [5]. IPS-C-TS, on the other hand, employs temperature scaling, a widely recognized and effective calibration method for probability distributions [13].
UIPS-P and UIPS-O: These are two variants of our UIPS with different ways of leveraging uncertainties. UIPS-P directly penalizes samples whose estimated logging probabilities have high uncertainty, i.e., taking ϕ_{x,a} = 1.0/exp(γU_{x,a}), which follows previous work on offline reinforcement learning [43, 4]. UIPS-O adversarially uses the worst propensity score for policy learning, i.e., ϕ_{x,a} = 1.0/exp(−γU_{x,a}).

Table 1: Experiment results on synthetic datasets. The best and second best results are highlighted with bold and underline respectively. The p-value under the t-test between UIPS and the best baseline on each dataset is also provided.

Algorithm | τ = 0.5: P@5, R@5, NDCG@5 | τ = 1: P@5, R@5, NDCG@5 | τ = 2: P@5, R@5, NDCG@5
IPS-GT | 0.5589±1e-3, 0.1582±6e-4, 0.6093±1e-3 | 0.5526±2e-3, 0.1565±6e-4, 0.6007±1e-3 | 0.5531±2e-3, 0.1557±7e-4, 0.6037±1e-3
CE | 0.5553±6e-4, 0.1573±2e-4, 0.6037±5e-4 | 0.5510±6e-4, 0.1561±2e-4, 0.5995±4e-4 | 0.5386±2e-3, 0.1524±7e-4, 0.5874±2e-3
BIPS-Cap | 0.5515±2e-3, 0.1553±8e-4, 0.6031±2e-3 | 0.5526±2e-3, 0.1561±6e-4, 0.6016±1e-3 | 0.5409±3e-3, 0.1529±9e-4, 0.5901±2e-3
MinVar | 0.5340±2e-3, 0.1509±6e-4, 0.5857±2e-3 | 0.5282±2e-3, 0.1491±7e-4, 0.5791±2e-3 | 0.5036±4e-3, 0.1415±1e-3, 0.5543±3e-3
StableVar | 0.4577±5e-3, 0.1310±1e-3, 0.5111±2e-3 | 0.5373±3e-3, 0.1523±9e-4, 0.5866±3e-3 | 0.5279±3e-3, 0.1492±8e-4, 0.5781±3e-3
Shrinkage | 0.5526±2e-3, 0.1562±7e-4, 0.6024±1e-3 | 0.5499±4e-3, 0.1545±1e-3, 0.6040±3e-3 | 0.5347±2e-3, 0.1513±6e-4, 0.5824±2e-3
SNIPS | 0.2616±6e-2, 0.0749±2e-2, 0.3150±7e-2 | 0.3538±5e-2, 0.0987±1e-2, 0.4144±6e-2 | 0.4379±3e-2, 0.1226±9e-3, 0.5177±3e-2
BanditNet | 0.4011±3e-2, 0.1131±8e-3, 0.4830±2e-2 | 0.3894±4e-2, 0.1095±1e-2, 0.4741±3e-2 | 0.4122±3e-2, 0.1153±8e-3, 0.4934±3e-2
POEM | 0.5480±2e-3, 0.1539±8e-4, 0.6008±2e-3 | 0.5502±2e-3, 0.1551±6e-4, 0.6000±2e-3 | 0.5399±2e-3, 0.1526±8e-4, 0.5893±2e-3
POXM | 0.4006±3e-2, 0.1130±8e-3, 0.4828±2e-2 | 0.3616±4e-2, 0.1019±1e-2, 0.4522±4e-2 | 0.3816±4e-2, 0.1069±1e-2, 0.4680±4e-2
Adaptive | 0.3831±2e-2, 0.1050±4e-3, 0.4382±2e-2 | 0.4734±4e-3, 0.1325±1e-3, 0.5326±3e-3 | 0.3936±1e-2, 0.1097±4e-3, 0.4368±2e-2
ApproxKNN | 0.5576±1e-3, 0.1580±4e-4, 0.6059±2e-3 | 0.5527±9e-4, 0.1567±1e-4, 0.6010±1e-3 | 0.5409±2e-3, 0.1532±6e-4, 0.5890±1e-3
IPS-C-TS | 0.5565±1e-3, 0.1577±3e-4, 0.6048±7e-4 | 0.5517±6e-4, 0.1563±2e-4, 0.6002±6e-4 | 0.5393±1e-3, 0.1526±5e-4, 0.5879±1e-3
UIPS-P | 0.4019±3e-2, 0.1131±1e-2, 0.4831±3e-2 | 0.3904±4e-2, 0.1096±1e-2, 0.4749±3e-2 | 0.4109±3e-2, 0.1149±1e-2, 0.4922±3e-2
UIPS-O | 0.4135±4e-2, 0.1167±1e-2, 0.4954±4e-2 | 0.3896±4e-2, 0.1096±1e-2, 0.4739±3e-2 | 0.4519±3e-2, 0.1268±8e-3, 0.5296±2e-2
UIPS | 0.5608±2e-3, 0.1589±8e-4, 0.6113±3e-3 | 0.5572±2e-3, 0.1571±8e-4, 0.6074±2e-3 | 0.5432±3e-3, 0.1534±8e-4, 0.5946±2e-3
p-value | 3e-3, 1e-2, 4e-5 | 2e-5, 2e-1, 4e-10 | 1e-1, 5e-1, 4e-2

Table 2: Performance under different uncertainties.

Algorithm | Actions on samples with high uncertainty: P@5 (RI), R@5 (RI), NDCG@5 (RI) | Actions on samples with low uncertainty: P@5 (RI), R@5 (RI), NDCG@5 (RI)
CE | 0.5190, 0.1231, 0.5526 | 0.5913, 0.1915, 0.6549
BIPS-Cap | 0.5117 (-1.41%), 0.1202 (-2.33%), 0.5488 (-0.68%) | 0.5913 (+0.00%), 0.1903 (-0.64%), 0.6574 (+0.39%)
Shrinkage | 0.5158 (-0.62%), 0.1217 (-1.11%), 0.5505 (-0.37%) | 0.5892 (-0.35%), 0.1905 (-0.55%), 0.6546 (-0.05%)
UIPS | 0.5222 (+0.61%), 0.1237 (+0.50%), 0.5568 (+0.77%) | 0.5994 (+1.38%), 0.1940 (+1.28%), 0.6658 (+1.66%)

4.1 Synthetic data

Data generation. Following previous work [24, 23], we generate a synthetic dataset by a supervise-to-bandit conversion of the Wiki10-31K dataset [7], which is an extreme multi-label classification dataset. The Wiki10-31K dataset contains approximately 20K samples.
Each sample is associated with a feature vector x of 101,938 dimensions and a label vector y*_x over 31K classes with more than one positive class. Let y*_{x,a} denote the label of class a under x; we take each class as an action. The huge action space creates great challenges for off-policy learning, e.g., sparse observations, and therefore better differentiates the evaluated methods. We split the dataset into train, validation and test sets of size 11K:3K:6K. The test set is from the official split. Since the original feature vector x is too sparse, for ease of learning we first embedded it into dimension d by x̃ = Wx, and synthesized the ground-truth logging policy β*(a|x) by:

β*(a|x) ∝ exp( x̃^⊤ θ*_a / τ ),   (9)

where W and {θ*_a} are parameters pre-trained by applying a logistic regression model on the train set, and τ is a positive hyper-parameter that controls the skewness of the logging distribution. A small value of τ leads to a near-deterministic logging policy, while a larger τ makes it flatter. More implementation details can be found in Appendix 7.3.

Evaluation metrics. To evaluate the learned policy π_ϑ(a|x), we calculate Precision@K (P@K), Recall@K (R@K) and NDCG@K following previous work [23, 24]. Higher P@K, R@K and NDCG@K imply a better policy.

Effectiveness of policy learning. Table 1 shows the average performance and standard deviations of all algorithms under 10 random seeds on three synthetic datasets generated under different τ. As the ground-truth logging policy is accessible on the synthetic datasets, we include a new baseline IPS-GT, which uses the IPS estimator with the ground-truth logging probabilities. We calculated the p-value under a t-test between UIPS and the best baseline to assess the significance of the improvement.

First, we can observe that UIPS achieved similar and even better performance than IPS-GT when τ = 0.5 and τ = 1, but performed worse than IPS-GT when τ = 2. Despite using ground-truth logging probabilities, IPS-GT still suffered from the high variance caused by samples with small logging probabilities, which is the main cause of its worse performance when τ = 0.5 and τ = 1. In contrast, UIPS effectively controlled the negative impact of these high-variance samples, resulting in a better bias-variance trade-off. With an increasing τ, suggesting a decrease in the probability of selecting positive actions, most algorithms experienced a drop in performance. However, UIPS consistently outperformed all other algorithms across all three datasets and metrics. Interestingly, as τ decreases, the performance improvement of UIPS becomes even more pronounced, despite SNIPS, BanditNet, and POXM being designed to handle small logging probabilities of positive actions.

ApproxKNN and IPS-C-TS generally achieved better performance than BIPS-Cap, implying the effectiveness of calibrating the estimated logging probabilities. However, UIPS still consistently outperformed both ApproxKNN and IPS-C-TS. The main reason is that calibration primarily focuses on adjusting the estimated probabilities so that, on average, the model's predictions are reliable and accurate. In contrast, UIPS specifically handles the impact of each individual sample in policy learning. UIPS also consistently outperformed Shrinkage (a special case of UIPS with uncertainties always being zero) on all three datasets, demonstrating the benefits of considering the estimation uncertainty.
Finally, blindly reweighing through uncertainties, regardless of their impact on the accuracy of the resulting estimator and the learned policy, ultimately resulted in poor performance, as demonstrated by UIPS-P and UIPS-O.

Performance under different uncertainty levels. As shown in Figure 1, low-frequency samples in the logged dataset suffer higher uncertainties in their propensity estimation. Thus, we divided the test set into two subsets according to the average frequency of the associated actions, where the uncertainty in the subset associated with low-frequency actions is on average 8% higher than that with high-frequency actions. Table 2 shows the results on these two subsets when τ = 0.5. In addition, we include the results of the top three baselines that directly utilize the estimated logging policy. Table 2 clearly demonstrates that only UIPS performed better than CE on the test set with low-frequency actions, implying the distortion caused by inaccurately estimated logging probabilities and the effectiveness of UIPS in handling them.

Off-policy evaluation. We further inspected whether V̂_UIPS in Eq.(4) leads to more accurate off-policy evaluation. Following previous work [29, 46, 36], we evaluated the following ε-greedy policy: π(a|x) = (1 − ε)/|M_x| · I{a ∈ M_x} + ε/|A|, where M_x contains all positive actions associated with instance x. For each x in the test set, we randomly sampled 100 actions following the logging policy in Eq.(9) to generate the logged dataset. Table 3 shows the MSE of the estimators against the ground-truth policy value under 20 different random seeds. From Table 3, one can observe that: 1) IPS-GT with a more skewed logging policy (i.e., smaller τ) leads to higher MSE, consistent with previous findings [29, 46, 36]; 2) inaccurate logging probabilities result in high bias and variance, leading to a much larger MSE of BIPS compared to IPS-GT, and this distortion is particularly pronounced when the ground-truth logging policy is skewed (τ = 0.5); and 3) although all of them use the estimated logging policy, V̂_UIPS yields the smallest MSE compared to the other baselines designed to improve over BIPS.

Hyper-parameter tuning. Discussions about hyper-parameter tuning and the performance of UIPS under different hyper-parameters can be found in Appendix 7.3.1.

4.2 Real-world data

To demonstrate the effectiveness of UIPS in real-world scenarios, we evaluate it on three recommendation datasets: (1) Yahoo! R3¹; (2) Coat²; (3) KuaiRec [12], for music, fashion and short-video recommendation respectively. All these datasets contain an unbiased test set collected from a randomized controlled trial where items are randomly selected. The statistics of the three datasets and implementation details, e.g., model architectures and dataset splits, can be found in Appendix 7.3.2. Following [10], we take K = 5 on the Yahoo! R3 and Coat datasets, and K = 50 on the KuaiRec dataset. The p-value under the t-test between UIPS and the best baseline on each dataset is also reported to assess the significance of the improvement.

¹ https://webscope.sandbox.yahoo.com/
² https://www.cs.cornell.edu/~schnabts/mnar/

Table 3: MSE of different off-policy estimators. A lower MSE indicates a more accurate estimator.
Setting | IPS-GT | BIPS | MinVar | StableVar | Shrinkage | UIPS
τ = 0.5 | 0.0875±4e-4 | 15.786±1.51 | 0.9021±7e-13 | 0.8612±5e-8 | 0.0718±5e-6 | 0.0210±2e-6
τ = 1.0 | 0.0209±8e-5 | 0.5510±0.388 | 0.9019±8e-12 | 0.8578±2e-7 | 0.1978±2e-5 | 0.0093±1e-6
τ = 2.0 | 0.0020±6e-6 | 0.5669±0.013 | 0.9015±5e-15 | 0.8342±5e-7 | 0.2952±3e-5 | 0.0043±4e-7

Table 4: Experimental results on real-world datasets. The best and second best results are highlighted with bold and underline respectively. The p-value under the t-test between UIPS and the best baseline on each dataset is also provided.

Algorithm | Yahoo: P@5, R@5, NDCG@5 | Coat: P@5, R@5, NDCG@5 | KuaiRec: P@50, R@50, NDCG@50
CE | 0.2819±2e-3, 0.7594±6e-3, 0.6073±7e-3 | 0.2799±5e-3, 0.4618±1e-2, 0.4529±7e-3 | 0.8802±2e-3, 0.0240±8e-5, 0.8810±6e-3
BIPS-Cap | 0.2808±2e-3, 0.7576±5e-3, 0.6099±8e-3 | 0.2758±6e-3, 0.4582±7e-3, 0.4399±9e-3 | 0.8750±3e-3, 0.0238±7e-5, 0.8788±5e-3
MinVar | 0.2843±4e-3, 0.7685±1e-2, 0.6168±1e-2 | 0.2813±3e-3, 0.4668±9e-3, 0.4414±8e-3 | 0.8827±1e-3, 0.0240±5e-5, 0.8886±2e-3
StableVar | 0.2787±2e-3, 0.7499±7e-3, 0.5919±7e-3 | 0.2840±3e-3, 0.4662±5e-3, 0.4393±7e-3 | 0.8524±7e-3, 0.0231±2e-4, 0.8570±4e-3
Shrinkage | 0.2843±3e-3, 0.7654±8e-3, 0.6204±7e-3 | 0.2790±5e-3, 0.4636±4e-3, 0.4464±1e-2 | 0.8744±3e-3, 0.0238±9e-5, 0.8771±6e-3
SNIPS | 0.2222±4e-3, 0.5828±1e-2, 0.4357±1e-2 | 0.2643±7e-3, 0.4287±1e-2, 0.4009±9e-3 | 0.8411±6e-3, 0.0228±2e-4, 0.8431±6e-3
BanditNet | 0.2413±8e-3, 0.6442±2e-2, 0.4988±2e-2 | 0.2781±8e-3, 0.4527±1e-2, 0.4251±1e-2 | 0.8758±5e-3, 0.0239±2e-4, 0.8810±4e-3
POEM | 0.2732±3e-3, 0.7357±1e-2, 0.5880±1e-2 | 0.2791±4e-3, 0.4566±6e-3, 0.4375±6e-3 | 0.7785±1e-2, 0.0210±2e-4, 0.7779±6e-3
POXM | 0.2250±5e-3, 0.5940±1e-2, 0.4542±2e-2 | 0.2663±6e-3, 0.4308±9e-3, 0.4006±1e-2 | 0.8962±1e-2, 0.0245±4e-4, 0.9041±1e-2
Adaptive | 0.2762±3e-3, 0.7451±9e-3, 0.5919±8e-3 | 0.2830±3e-3, 0.4634±5e-3, 0.4217±5e-3 | 0.8375±1e-2, 0.0227±4e-4, 0.8460±1e-2
ApproxKNN | 0.2697±2e-3, 0.7225±5e-3, 0.5760±6e-3 | 0.2755±2e-3, 0.4594±5e-3, 0.4490±4e-3 | 0.8839±2e-6, 0.0240±5e-5, 0.8895±2e-3
IPS-C-TS | 0.2816±2e-3, 0.7582±5e-3, 0.6114±5e-3 | 0.2799±3e-3, 0.4625±7e-3, 0.4462±6e-3 | 0.8781±3e-3, 0.0239±1e-4, 0.8749±3e-3
UIPS-P | 0.1829±8e-3, 0.4560±3e-2, 0.3300±1e-2 | 0.2685±7e-3, 0.4364±9e-3, 0.4087±7e-3 | 0.8638±8e-3, 0.0235±3e-4, 0.8685±7e-3
UIPS-O | 0.1947±3e-3, 0.4959±1e-2, 0.3600±8e-3 | 0.2657±5e-3, 0.4306±9e-3, 0.4146±9e-3 | 0.8651±8e-3, 0.0235±2e-4, 0.8697±7e-3
UIPS | 0.2868±2e-3, 0.7742±5e-3, 0.6274±5e-3 | 0.2877±3e-3, 0.4757±5e-3, 0.4576±8e-3 | 0.9120±1e-3, 0.0250±5e-5, 0.9174±7e-4
p-value | 4e-2, 1e-2, 3e-2 | 2e-2, 6e-4, 5e-5 | 6e-4, 6e-4, 1e-3

We can first observe from Table 4 that on all three datasets, the proposed UIPS achieved the highest precision, recall and NDCG. Apparently, accurate estimation of logging probabilities in real-world scenarios with large action spaces and sparse interactions is challenging, causing BIPS-Cap to underperform CE. Additionally, calibration poses difficulties in such scenarios, leading to poor performance of ApproxKNN and IPS-C-TS across all three real-world datasets. BanditNet, POEM and POXM performed better in problems with a larger action space, while MinVar, StableVar and Shrinkage, as well as Adaptive, were better suited to scenarios with a smaller action space. Again, UIPS outperformed Shrinkage, highlighting the importance of handling uncertainty in the estimated logging policy. But simply reweighing based on uncertainties, without considering their impact on the accuracy of the resulting estimator and the learned policy, led to poor performance, as shown by UIPS-P and UIPS-O.
4.3 Comparisons against other lines of off-policy learning

We discuss the differences between UIPS and recent work on direct propensity estimation [34, 21], the DICE line of work [26, 47, 9], as well as work on distributionally robust off-policy learning [32, 17, 44] in Appendix 7.6, Appendix 7.5 and Appendix 7.7 respectively. Our empirical evaluations of recent work [26, 44, 34] from all three lines suggest that these lines of work cannot properly handle the inaccuracies in the estimated logging probabilities that hinder policy improvement. Moreover, we also integrated the proposed UIPS with the doubly-robust (DR) estimator, and our results suggest that the accuracy of the imputation model greatly affects policy learning under DR, but UIPS still provides benefits. More details can be found in Appendix 7.4.

5 Related work

Our work is the first of its kind to account for the uncertainty of logging policy estimation in off-policy learning. The following two lines of work are most related to this work.

Off-policy learning. In many real-world applications, such as search engines and recommender systems, interactive online model update is expensive and risky [15]. Off-policy learning has therefore attracted great interest, since it can leverage already logged feedback data [2, 8, 22]. The main challenge in off-policy learning is to address the mismatch between the logging policy and the learning policy. One common and widely-applied approach is to leverage the Inverse Propensity Scoring (IPS) method to correct the discrepancy between the two policies, and various methods have been proposed to enhance IPS for more stable learning [39, 37, 38] and improved variance control [23, 22], as well as extensions to more complex problems [40, 24]. However, all these solutions directly use the estimated logging policy for off-policy correction, leading to sub-optimal performance as shown in our experiments. A recent study on causal recommendation [10] also argues that propensity scores may not be correct due to unobserved confounders. They assume the effect of an unobserved confounder on any sample can be bounded by a pre-defined hyper-parameter, and adversarially search for the worst-case propensity for learning. Mapped to off-policy learning, their solution is a special case of our UIPS-O variant with the uncertainty set to a pre-defined constant. Existing studies [34, 21, 26, 47, 9] also explore direct estimation of the propensity ratio to bypass estimating the logging policy. However, as discussed in Appendix 7.6 and Appendix 7.5, they demonstrate inferior performance compared to UIPS. This is primarily due to either the lack of consideration for the accuracy of the estimated propensity ratio, similar to the limitation of existing IPS-type algorithms in handling inaccurately estimated logging probabilities, or the degeneration to a specific IPS estimator that suffers high variance. Recent work on distributionally robust off-policy evaluation and learning [32, 17, 44] also addresses uncertainty in off-policy learning. However, their approach to handling uncertainty and the underlying motivation differ significantly from ours, resulting in distinct techniques. Further details can be found in Appendix 7.7. Additionally, experiments conducted in Appendix 7.7 demonstrate that directly adapting methods from distributionally robust off-policy learning to handle inaccurately estimated logging probabilities leads to poor performance.
Off-policy learning can also be built directly on off-policy evaluation. Several works [36, 46] also propose to control the variance of the estimator caused by small logging probabilities through instance reweighing. Again, they directly use the estimated logging policy for correction, and thus performed worse than UIPS in our experiments. A recent study [29] assumed additional structure in the action space and proposed the marginalized IPS. In contrast, our work considers the uncertainty in the estimated logging policy and thus does not add any new assumptions about the problem space.

Uncertainty-aware learning. Estimation uncertainty has been extensively studied [45, 50, 1]. In the context of on-policy reinforcement learning and bandits [1, 48, 49], the use of uncertainty aims to strike a balance between exploration and exploitation by adopting an optimistic approach (i.e., UCB in bandits). On the other hand, most research on offline reinforcement learning/bandits [43, 4, 6] tends to be more conservative, employing techniques such as Lower Confidence Bounds (LCB) or penalizing out-of-distribution states and actions based on uncertainty to address extrapolation errors. However, these principles differ fundamentally from UIPS, which directly minimizes the mean squared error of off-policy evaluation. The closed-form solution of the resulting per-instance weight in UIPS reflects how uncertainty contributes to the policy evaluation error. Moreover, our UIPS-O and UIPS-P baselines leverage uncertainties using the two aforementioned general principles respectively. However, empirical findings indicate that blindly penalizing or boosting samples based on uncertainty is problematic. Proper correction depends on both the uncertainty in logging policy estimation and the actual value of the estimated logging probabilities.

6 Conclusion

In this paper, we propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) that explicitly models the uncertainty of the estimated logging policy for improved off-policy learning. UIPS weighs each logged instance to reduce its policy evaluation error, where the optimal weights have a closed-form solution derived by minimizing the upper bound of the resulting estimator's mean squared error (MSE) against its ground-truth value. An improved policy is then obtained by optimizing the resulting estimator. Extensive experiments on synthetic and three real-world datasets, as well as the theoretical convergence guarantee, demonstrate the effectiveness of UIPS. As demonstrated in this work, explicitly modeling the uncertainty of the estimated logging policy is crucial for effective off-policy learning; but the best use of this uncertainty is not to simply down-weigh or drop instances with uncertain estimations, but to balance it against the actually estimated logging probabilities on a per-instance basis. As future work, it is promising to investigate how UIPS can be extended to value-based learning methods, e.g., actor-critics. It is also important to analyze how tight our upper bound of the policy evaluation error is and, if possible, find new ways to tighten it for further improvement.

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312-2320, 2011.

[2] Aman Agarwal, Ivan Zaitsev, Xuanhui Wang, Cheng Li, Marc Najork, and Thorsten Joachims. Estimating position bias without intrusive interventions.
In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 474 482, 2019. [3] Ahmad Ajalloeian and Sebastian U Stich. On the convergence of sgd with biased gradients. ar Xiv preprint ar Xiv:2008.00051, 2020. [4] Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble. Advances in neural information processing systems, 34:7436 7447, 2021. [5] Raghu Aniruddh, Gottesman Omer, Liu Yao, Komorowski Matthieu, Faisal Aldo, Doshi-Velez Finale, and Brunskill Emma. Behaviour policy estimation in off-policy policy evaluation: Calibration matters. ar Xiv preprint ar Xiv:1807.01066, 2018. [6] Chenjia Bai, Lingxiao Wang, Zhuoran Yang, Zhihong Deng, Animesh Garg, Peng Liu, and Zhaoran Wang. Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. ar Xiv preprint ar Xiv:2202.11566, 2022. [7] K. Bhatia, K. Dahiya, H. Jain, P. Kar, A. Mittal, Y. Prabhu, and M. Varma. The extreme classification repository: Multi-label datasets and code, 2016. [8] Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H Chi. Top-k off-policy correction for a reinforce recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 456 464, 2019. [9] Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvári, and Dale Schuurmans. Coindice: Off-policy confidence interval estimation. Advances in neural information processing systems, 33:9398 9411, 2020. [10] Sihao Ding, Peng Wu, Fuli Feng, Yitong Wang, Xiangnan He, Yong Liao, and Yongdong Zhang. Addressing unmeasured confounder for recommendation with sensitivity analysis. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 305 315, 2022. [11] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050 1059. PMLR, 2016. [12] Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. Kuairec: A fully-observed dataset and insights for evaluating recommender systems. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management, CIKM 22, 2022. [13] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321 1330. PMLR, 2017. [14] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: a factorization-machine based neural network for ctr prediction. ar Xiv preprint ar Xiv:1703.04247, 2017. [15] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652 661. PMLR, 2016. [16] Thorsten Joachims, Adith Swaminathan, and Maarten De Rijke. Deep learning with logged bandit feedback. In International Conference on Learning Representations, 2018. [17] Nathan Kallus, Xiaojie Mao, Kaiwen Wang, and Zhengyuan Zhou. Doubly robust distributionally robust off-policy evaluation and learning. In International Conference on Machine Learning, pages 10598 10632. PMLR, 2022. [18] Patrick Kidger and Terry Lyons. Universal approximation with deep narrow networks. In Conference on learning theory, pages 2306 2327. PMLR, 2020. [19] Sergey Levine and Vladlen Koltun. Guided policy search. 
In International conference on machine learning, pages 1 9. PMLR, 2013. [20] Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pages 2071 2080. PMLR, 2017. [21] Xu-Hui Liu, Zhenghai Xue, Jingcheng Pang, Shengyi Jiang, Feng Xu, and Yang Yu. Regret minimization experience replay in off-policy reinforcement learning. Advances in Neural Information Processing Systems, 34:17604 17615, 2021. [22] Yaxu Liu, Jui-Nan Yen, Bowen Yuan, Rundong Shi, Peng Yan, and Chih-Jen Lin. Practical counterfactual policy learning for top-k recommendations. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1141 1151, 2022. [23] Romain Lopez, Inderjit S Dhillon, and Michael I Jordan. Learning from extreme bandit feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8732 8740, 2021. [24] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Ji Yang, Minmin Chen, Jiaxi Tang, Lichan Hong, and Ed H Chi. Off-policy learning in two-stage recommender systems. In Proceedings of The Web Conference 2020, pages 463 473, 2020. [25] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. Advances in neural information processing systems, 29, 2016. [26] Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 32, 2019. [27] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000. [28] Aniruddh Raghu, Omer Gottesman, Yao Liu, Matthieu Komorowski, Aldo Faisal, Finale Doshi-Velez, and Emma Brunskill. Behaviour policy estimation in off-policy policy evaluation: Calibration matters. ar Xiv preprint ar Xiv:1807.01066, 2018. [29] Yuta Saito and Thorsten Joachims. Off-policy evaluation for large action spaces via embeddings. ar Xiv preprint ar Xiv:2202.06317, 2022. [30] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation. In international conference on machine learning, pages 1670 1679. PMLR, 2016. [31] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889 1897. PMLR, 2015. [32] Nian Si, Fan Zhang, Zhengyuan Zhou, and Jose Blanchet. Distributionally robust policy evaluation and learning in offline contextual bandits. In International Conference on Machine Learning, pages 8884 8894. PMLR, 2020. [33] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484 489, 2016. [34] Samarth Sinha, Jiaming Song, Animesh Garg, and Stefano Ermon. Experience replay with likelihood-free importance weights. In Learning for Dynamics and Control Conference, pages 110 123. PMLR, 2022. [35] Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. Learning from logged implicit exploration data. Advances in neural information processing systems, 23, 2010. [36] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly robust off-policy evaluation with shrinkage. 
In International Conference on Machine Learning, pages 9167-9176. PMLR, 2020.

[37] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. The Journal of Machine Learning Research, 16(1):1731-1755, 2015.

[38] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814-823. PMLR, 2015.

[39] Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. Advances in Neural Information Processing Systems, 28, 2015.

[40] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. Advances in Neural Information Processing Systems, 30, 2017.

[41] Sebastian Thrun and Michael L Littman. Reinforcement learning: an introduction. AI Magazine, 21(1):103-103, 2000.

[42] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256, 1992.

[43] Yue Wu, Shuangfei Zhai, Nitish Srivastava, Joshua Susskind, Jian Zhang, Ruslan Salakhutdinov, and Hanlin Goh. Uncertainty weighted actor-critic for offline reinforcement learning. arXiv preprint arXiv:2105.08140, 2021.

[44] Da Xu, Yuting Ye, Chuanwei Ruan, and Bo Yang. Towards robust off-policy learning for runtime uncertainty. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10101-10109, 2022.

[45] Pan Xu, Zheng Wen, Handong Zhao, and Quanquan Gu. Neural contextual bandits with deep representation and shallow exploration. In International Conference on Learning Representations, 2021.

[46] Ruohan Zhan, Vitor Hadad, David A Hirshberg, and Susan Athey. Off-policy evaluation via adaptive weighting with data from contextual bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2125-2135, 2021.

[47] Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072, 2020.

[48] Xiaoying Zhang, Hong Xie, Hang Li, and John CS Lui. Conversational contextual bandit: Algorithm and application. In Proceedings of The Web Conference 2020, pages 662-672, 2020.

[49] Xiaoying Zhang, Hong Xie, and John CS Lui. Heterogeneous information assisted bandit learning: Theory and application. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 2135-2140. IEEE, 2021.

[50] Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with UCB-based exploration. In International Conference on Machine Learning, pages 11492-11502. PMLR, 2020.

[51] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1059-1068, 2018.

Algorithm 1 UIPS
1: Input: The logged dataset D := {(x_n, a_n, r_{x_n,a_n}) | n ∈ [N]}, the estimated logging policy model β̂(a|x) = exp(f_θ(x,a)) / ∑_{a'} exp(f_θ(x,a')), dimension d.
2: Initialize M_D = I_{d×d}.
3: for n = 1 to N do
4:   M_D = M_D + ∇_θ f_θ(x_n, a_n) ∇_θ f_θ(x_n, a_n)^⊤   ▷ M_D for uncertainty calculation
5: end for
6: for n = 1 to N do
7:   U_{x_n,a_n} = √( ∇_θ f_θ(x_n, a_n)^⊤ M_D^{−1} ∇_θ f_θ(x_n, a_n) )
8: end for
9: while not converged do
10:   for n = 1 to N do
11:     Calculate ϕ*_{x_n,a_n} as in Theorem 3.2
12:     Calculate gradients as in Eq.(5) and update π_ϑ(a|x)
13:   end for
14: end while
15: Output: The learnt policy π_ϑ(a|x).
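For reference, the following is a compact NumPy sketch of the policy-improvement loop (lines 9–14) for a linear-softmax policy π_ϑ(a|x) ∝ exp(ϑ_a^⊤ x); the data, β̂ and ϕ* below are random placeholders only, whereas in Algorithm 1 they come from the estimated logging model and Theorem 3.2.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def uips_policy_gradient(theta, X, A, R, beta_hat, phi):
    """One pass of Eq.(5) for a linear-softmax policy pi_theta(a|x) = softmax(theta @ x):
    (1/N) * sum_n (pi(a_n|x_n)/beta^(a_n|x_n)) * phi_n * r_n * grad log pi(a_n|x_n)."""
    N, d = X.shape
    probs = softmax(X @ theta.T)                       # [N, |A|]
    w = probs[np.arange(N), A] / beta_hat * phi * R    # per-sample scalar weight
    # grad_theta log pi(a_n|x_n) = (onehot(a_n) - pi(.|x_n)) outer x_n for a linear-softmax policy
    coeff = -probs
    coeff[np.arange(N), A] += 1.0
    return (w[:, None] * coeff).T @ X / N              # [|A|, d]

# Illustrative stand-in data; in practice beta_hat and phi come from lines 1-8 and Theorem 3.2.
rng = np.random.default_rng(3)
N, d, n_actions = 1000, 8, 10
X = rng.normal(size=(N, d))
A = rng.integers(0, n_actions, size=N)
R = rng.binomial(1, 0.3, size=N).astype(float)
beta_hat = np.full(N, 1.0 / n_actions)
phi = np.ones(N)                                       # phi* = 1 recovers the BIPS gradient

theta = np.zeros((n_actions, d))
for epoch in range(50):                                # policy-improvement loop (lines 9-14)
    theta += 0.5 * uips_policy_gradient(theta, X, A, R, beta_hat, phi)
```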
7.1 Notations and Algorithm Framework

For ease of reading, we list the important notations in Table 5 and summarize the main framework of the proposed UIPS in Algorithm 1.

Table 5: Notations
Notation | Description
X | context space
A | action set
x ∈ R^d | context vector
a | action
r_{x,a} | reward
π(a|x) | target policy to evaluate
β*(a|x) | the unknown ground-truth logging policy
β̂(a|x) | the estimated logging policy
V(π) | value function
D := {(x_n, a_n, r_{x_n,a_n}) | n ∈ [N]} | logged dataset containing N samples
ϕ*_{x,a} | the optimal uncertainty-aware weight
f_{θ*}(x, a) | the unknown ground-truth function that generates β*(a|x) = exp(f_{θ*}(x,a)) / ∑_{a'} exp(f_{θ*}(x,a'))
f_θ(x, a) | the estimate of f_{θ*}(x, a) that generates β̂(a|x)
B_{x,a} | confidence interval of β̂(a|x)
U_{x,a} | uncertainty, defined via |f_{θ*}(x, a) − f_θ(x, a)| ≤ γU_{x,a}
g(x_n, a_n) | gradient of f_θ(x, a) with respect to the last layer

Computation cost. The additional computation cost of UIPS over IPS comes from two parts. Pre-calculating uncertainties (lines 2–8 in Algorithm 1): this part calculates the uncertainty of the logging probability for each (x, a) pair, and it only needs to be executed once. The computational cost of this step is O(Nd² + d³), where O(Nd²) is for calculating the uncertainty of each (x, a) pair and O(d³) is for the matrix inverse. Calculating ϕ*_{x,a} during training (line 11 in Algorithm 1): it only takes O(1) time, the same computational cost as calculating the IPS score. Note that calculating the logging probability of each sample, which is essential for both UIPS and IPS, takes O(Nd|A|) time. Since the dimension d is usually much smaller than the action size |A| and the sample size N, UIPS does not introduce significant computational overhead compared to the original IPS solution.

7.2 Derivation of the confidence interval of the logging probability

Given that |f_{θ*}(x, a) − f_θ(x, a)| ≤ γU_{x,a} holds with probability at least 1 − δ, and

β*(a|x) = exp(f_{θ*}(x, a)) / Z,   β̂(a|x) = exp(f_θ(x, a)) / Ẑ,

where Z = ∑_{a'} exp(f_{θ*}(x, a')) and Ẑ = ∑_{a'} exp(f_θ(x, a')), we can get that with probability at least 1 − δ:

|f_{θ*}(x, a) − f_θ(x, a)| ≤ γU_{x,a}
⇒ f_θ(x, a) − γU_{x,a} ≤ f_{θ*}(x, a) ≤ f_θ(x, a) + γU_{x,a}
⇒ exp(f_θ(x, a)) exp(−γU_{x,a}) ≤ exp(f_{θ*}(x, a)) ≤ exp(f_θ(x, a)) exp(γU_{x,a})
⇒(1) Ẑ β̂(a|x) exp(−γU_{x,a}) ≤ exp(f_{θ*}(x, a)) ≤ Ẑ β̂(a|x) exp(γU_{x,a})
⇒(2) (Ẑ/Z) exp(−γU_{x,a}) β̂(a|x) ≤ β*(a|x) ≤ (Ẑ/Z) exp(γU_{x,a}) β̂(a|x).

The step labeled (1) is due to the modeling of β̂(a|x), and the step labeled (2) holds because Z is a positive constant independent of β̂(a|x). Thus, with probability at least 1 − δ, we have β*(a|x) ∈ B_{x,a} with

B_{x,a} = [ (Ẑ/Z) exp(−γU_{x,a}) β̂(a|x),  (Ẑ/Z) exp(γU_{x,a}) β̂(a|x) ].

7.3 Experiment details

7.3.1 Synthetic Data

Data generation. Given the ground-truth logging policy β*(a|x), we generate the logged dataset as follows. For each sample in the train set, we first get the embedded context vector x̃ from its original feature vector x. We then sample an action a according to β*(a|x̃), and obtain the reward r_{x̃,a} = y*_{x,a}, resulting in the bandit feedback (x̃, a, r_{x̃,a}), where y*_{x,a} is the label of class a under the original feature vector x. We repeat the above process N times to collect the logged dataset. In our experiments, we take d = 64, N = 100. We model the logging policy as in Eq.(9), where {θ_a} are the parameters to be estimated.
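As a sketch of this supervise-to-bandit conversion (shapes and the random stand-ins below are illustrative only; the real Wiki10-31K features, labels and pre-trained W, {θ*_a} replace them):

```python
import numpy as np

def make_bandit_log(X_raw, Y, W, Theta, tau=0.5, seed=0):
    """Supervise-to-bandit conversion of Sec. 7.3.1 (sketch): embed x~ = W x, sample one action
    per sample from the softmax logging policy of Eq.(9), and reveal only that action's label."""
    rng = np.random.default_rng(seed)
    X_emb = X_raw @ W.T                              # x~ = W x, dimension d
    scores = X_emb @ Theta.T / tau                   # x~^T theta*_a / tau
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)        # beta*(a|x), skewness controlled by tau
    actions = np.array([rng.choice(Theta.shape[0], p=p) for p in probs])
    rewards = Y[np.arange(len(Y)), actions]          # r_{x,a} = y*_{x,a}
    return X_emb, actions, rewards, probs

# Illustrative shapes only (the real dataset and pre-trained parameters replace these stand-ins).
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 500))
Y = rng.binomial(1, 0.05, size=(100, 50))
W = rng.normal(size=(64, 500)) * 0.01
Theta = rng.normal(size=(50, 64)) * 0.1
X_emb, A, R, beta = make_bandit_log(X_raw, Y, W, Theta, tau=0.5)
```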
To train the logging policy, we take all samples in the logged dataset D as positive instances, and randomly sample non-selected actions as negative instances, as in [8].

7.3.2 Real-world Data

Statistics of datasets. The statistics of the three real-world recommendation datasets with unbiased data can be found in Table 6.

Table 6: The statistics of three real-world datasets.
Dataset | #User | #Item | #Biased Data | #Unbiased Data
Yahoo! R3 | 15,400 | 1,000 | 311,704 | 54,000
Coat | 290 | 300 | 6,960 | 4,640
KuaiRec | 7,176 | 10,729 | 12,530,806 | 4,676,570

All these datasets contain a set of biased data collected from users' interactions on the platform, and a set of unbiased data collected from a randomized controlled trial where items are randomly selected. As in [10], on each dataset the biased data is used for training and the unbiased data for testing, with a small part of the unbiased data split off for validation (5% on Yahoo! R3 and Coat, and 15% on KuaiRec). We take the reward as 1 if: (1) the rating is larger than 3 in the Yahoo! R3 and Coat datasets; or (2) the user watched more than 70% of the video in KuaiRec. Otherwise, the reward is labeled as 0.

We adopted a two-tower neural network architecture to implement both the logging and the learning policy, as shown in Figure 2. For the learning policy, the user representation and item representation are first modeled through two separate neural networks (i.e., the user tower and the item tower), and then their element-wise product vector is projected to predict the user's preference for the item. We then re-use the user state generated from the user tower of the learning policy, and model the logging policy with another, separate item tower, following [8]. We also block gradients to prevent the logging policy's learning from interfering with the user state of the learning policy. In each learning epoch, we first estimate the logging policy, and then take the estimated logging probabilities as well as their uncertainties to optimize the learning policy.

Figure 2: Model architecture of the logging and the learning policy on the real-world datasets.
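A minimal PyTorch sketch of this architecture is given below; layer sizes and the dot-product scoring head are illustrative simplifications of Figure 2 (the paper projects an element-wise product instead), but it shows the detached user state that blocks gradients from the logging head.

```python
import torch
import torch.nn as nn

class TwoTowerPolicies(nn.Module):
    """Sketch of the Figure 2 architecture: a user tower and an item tower for the learning
    policy, plus a separate item tower for the logging policy that consumes a detached copy of
    the user state so that training the logging head cannot alter the user tower."""

    def __init__(self, n_users, n_items, dim=64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.user_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.item_emb = nn.Embedding(n_items, dim)            # learning-policy item tower
        self.logging_item_emb = nn.Embedding(n_items, dim)    # logging-policy item tower

    def forward(self, user_ids):
        user_state = self.user_mlp(self.user_emb(user_ids))
        pi_logits = user_state @ self.item_emb.weight.T
        # Gradient blocking: the logging head re-uses the user state but cannot update it.
        beta_logits = user_state.detach() @ self.logging_item_emb.weight.T
        return pi_logits, beta_logits

model = TwoTowerPolicies(n_users=1000, n_items=500)
pi_logits, beta_logits = model(torch.tensor([0, 1, 2]))
pi_probs = torch.softmax(pi_logits, dim=-1)       # pi_theta(a|x)
beta_hat = torch.softmax(beta_logits, dim=-1)     # beta^(a|x)
```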
Ablation study: hyper-parameter tuning. Although UIPS has four hyper-parameters ($\lambda$, $\gamma$, $\eta_1$, and $\eta_2$), one only needs to carefully fine-tune two of them, i.e., $\gamma$ and $\eta_1^2/\lambda$, to obtain good performance of UIPS. This is because of the following two observations. First, $\eta_2$ acts as a capping threshold that ensures $\phi^*_{x,a} \le 2\eta_2$ holds even when the corresponding propensity scores are very low; hence, it only needs to be set to a reasonably large value (e.g., 100). Second, the key component (i.e., the first term) of $\phi^*_{x,a}$ can be rewritten as
$$\frac{\eta_1}{\exp(-\gamma U_{x,a}) + \frac{\eta_1^2}{\lambda}\frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)^2}\exp(\gamma U_{x,a})}.$$
While all $(x, a)$ pairs are multiplied by $\phi^*_{x,a}$, the $\eta_1$ in the numerator does not affect the final performance much; the key is to find a good value of $\eta_1^2/\lambda$ that balances the two terms in the denominator.

Thus, with $\eta_1$ and $\eta_2$ fixed, the effect of the hyper-parameters $\gamma$ and $\lambda$ on precision, recall, and NDCG can be found in Figure 3. We can observe that to make UIPS excel, $B_{x,a}$ needs to be of high confidence, e.g., $\gamma = 25$ performed the best on the dataset with $\tau = 0.5$. Moreover, the threshold $\lambda/\eta_1^2$ cannot be too small or too large.

Figure 3: Effect of $\lambda$ and $\gamma$ on the synthetic dataset with $\tau = 0.5$ ((a) Precision@5, (b) Recall@5).

7.4 Experiments on the doubly robust estimators.
The doubly robust (DR) estimator [15], which is a hybrid of the direct method (DM) estimator and the inverse propensity score (IPS) estimator, is also widely used for off-policy evaluation. More specifically, let $\hat\eta: \mathcal{X}\times\mathcal{A} \to \mathbb{R}$ be the imputation model in DM that estimates the reward of action $a$ under context vector $x$, and let $\hat\beta(a|x)$ be the estimated logging policy in the IPS estimator. The DR estimator evaluates policy $\pi$ based on the logged dataset $\mathcal{D} := \{(x_n, a_n, r_{x_n,a_n}) \mid n \in [N]\}$ by:
$$\hat V_{\mathrm{DR}}(\pi) = \hat V_{\mathrm{DM}}(\pi) + \frac{1}{N}\sum_{n=1}^{N}\frac{\pi(a_n|x_n)}{\hat\beta(a_n|x_n)}\big(r_{x_n,a_n} - \hat\eta(x_n,a_n)\big), \qquad (10)$$
where $\hat V_{\mathrm{DM}}(\pi)$ is the DM estimator:
$$\hat V_{\mathrm{DM}}(\pi) = \frac{1}{N}\sum_{n=1}^{N}\sum_{a\in\mathcal{A}}\pi(a|x_n)\,\hat\eta(x_n,a). \qquad (11)$$
Again assuming the policy $\pi(a|x)$ is parameterized by $\vartheta$, the REINFORCE gradient of $\hat V_{\mathrm{DR}}(\pi_\vartheta)$ with respect to $\vartheta$ can be readily derived as:
$$\nabla_\vartheta \hat V_{\mathrm{DR}}(\pi_\vartheta) = \frac{1}{N}\sum_{n=1}^{N}\Big[\sum_{a\in\mathcal{A}}\pi_\vartheta(a|x_n)\,\hat\eta(x_n,a)\,\nabla_\vartheta\log\pi_\vartheta(a|x_n) + \frac{\pi_\vartheta(a_n|x_n)}{\hat\beta(a_n|x_n)}\big(r_{x_n,a_n} - \hat\eta(x_n,a_n)\big)\,\nabla_\vartheta\log\pi_\vartheta(a_n|x_n)\Big]. \qquad (12)$$
The imputation model $\hat\eta(x,a)$ is pre-trained following previous work [22] with the same neural network architecture as the logging policy model. Besides the standard DR estimator, we also adapt UIPS and the best two baseline off-policy evaluation estimators (i.e., Min Var and Shrinkage, based on the results in Table 1 and Table 4) to the doubly robust setting using the same imputation model.
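As a concrete reference for Eq.(10)-(11) as reconstructed above, the following is a minimal NumPy sketch of the DR value estimate on a logged dataset; the array names are illustrative assumptions:

```python
import numpy as np

def dr_value(pi_probs, beta_hat, eta_hat, actions, rewards):
    """Doubly robust estimate of V(pi) from logged bandit data.

    pi_probs: (N, |A|) probabilities pi(a|x_n) for every action
    beta_hat: (N,) estimated logging probabilities of the logged actions
    eta_hat:  (N, |A|) imputed rewards eta_hat(x_n, a)
    actions:  (N,) logged action indices
    rewards:  (N,) observed rewards
    """
    n = len(actions)
    # Direct-method part: expected imputed reward under pi.
    v_dm = (pi_probs * eta_hat).sum(axis=1).mean()
    # IPS correction applied to the residual of the imputation model.
    idx = np.arange(n)
    correction = (pi_probs[idx, actions] / beta_hat) * (rewards - eta_hat[idx, actions])
    return v_dm + correction.mean()
```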
Table 7: Experiment results on synthetic datasets. The best and second best results are highlighted with bold and underline respectively. Two p-values are calculated: (1) p-value (UIPSDR): the p-value under the t-test between UIPSDR and the best DR baseline on each dataset; (2) p-value (UIPS): the p-value under the t-test between UIPS and the best DR baseline on each dataset.
Algorithm | τ = 0.5: P@5, R@5, NDCG@5 | τ = 1: P@5, R@5, NDCG@5 | τ = 2: P@5, R@5, NDCG@5
BIPS-Cap | 0.5515±2e-3, 0.1553±8e-4, 0.6031±2e-3 | 0.5526±2e-3, 0.1561±6e-4, 0.6016±1e-3 | 0.5409±3e-3, 0.1529±9e-4, 0.5901±2e-3
UIPS | 0.5589±3e-3, 0.1583±9e-4, 0.6095±3e-3 | 0.5572±2e-3, 0.1571±8e-4, 0.6074±2e-3 | 0.5432±3e-3, 0.1534±8e-4, 0.5946±2e-3
DR | 0.3846±3e-2, 0.1082±8e-3, 0.4684±3e-2 | 0.3631±3e-2, 0.1017±9e-3, 0.4494±3e-2 | 0.3560±3e-2, 0.0995±7e-3, 0.4470±2e-3
Min Var DR | 0.3212±3e-2, 0.0908±8e-3, 0.4062±3e-2 | 0.3240±5e-2, 0.0903±1e-2, 0.3905±5e-2 | 0.3234±5e-2, 0.0910±1e-2, 0.4059±4e-2
Shrinkage DR | 0.4139±2e-2, 0.1161±7e-3, 0.4969±3e-2 | 0.3944±3e-2, 0.1101±8e-3, 0.4797±2e-2 | 0.4080±3e-2, 0.1135±7e-3, 0.4901±2e-2
UIPSDR | 0.4278±2e-2, 0.1200±6e-3, 0.5069±2e-2 | 0.4008±2e-2, 0.1126±7e-3, 0.4847±2e-2 | 0.4144±2e-2, 0.1162±8e-3, 0.4972±2e-2
p-value (UIPSDR) | 2e-1, 2e-1, 3e-1 | 6e-1, 4e-1, 6e-1 | 6e-1, 4e-1, 5e-1
p-value (UIPS) | 6e-13, 4e-13, 4e-12 | 2e-12, 1e-12, 5e-12 | 8e-12, 8e-12, 2e-11

Table 8: Experiment results on real-world unbiased datasets. The best and second best results are highlighted with bold and underline respectively. Two p-values are calculated: (1) p-value (UIPSDR): the p-value under the t-test between UIPSDR and the best DR baseline on each dataset; (2) p-value (UIPS): the p-value under the t-test between UIPS and the best DR baseline on each dataset.
Algorithm | Yahoo: P@5, R@5, NDCG@5 | Coat: P@5, R@5, NDCG@5 | Kuai Rec: P@50, R@50, NDCG@50
BIPS-Cap | 0.2808±2e-3, 0.7576±5e-3, 0.6099±8e-3 | 0.2758±6e-3, 0.4582±7e-3, 0.4399±9e-3 | 0.8750±3e-3, 0.0238±7e-5, 0.8788±5e-3
UIPS | 0.2868±2e-3, 0.7742±5e-3, 0.6274±5e-3 | 0.2877±3e-3, 0.4757±5e-3, 0.4576±8e-3 | 0.9120±1e-3, 0.0250±5e-5, 0.9174±7e-4
DR | 0.2670±2e-3, 0.7174±6e-3, 0.5636±6e-3 | 0.2884±3e-3, 0.4760±5e-3, 0.4541±5e-3 | 0.8794±1e-2, 0.0240±5e-4, 0.8824±2e-2
Min Var DR | 0.2272±5e-3, 0.5989±1e-2, 0.4525±1e-2 | 0.2704±4e-3, 0.4434±9e-3, 0.4137±6e-3 | 0.8640±7e-3, 0.0235±2e-4, 0.8657±7e-3
Shrinkage DR | 0.2697±2e-3, 0.7226±6e-3, 0.5713±5e-3 | 0.2895±4e-3, 0.4749±6e-3, 0.4526±6e-3 | 0.8778±2e-2, 0.0239±5e-4, 0.8800±2e-2
UIPSDR | 0.2721±1e-3, 0.7294±6e-3, 0.5750±5e-3 | 0.2946±4e-3, 0.4854±8e-3, 0.4647±8e-3 | 0.8849±1e-2, 0.0242±4e-4, 0.8896±1e-2
p-value (UIPSDR) | 1e-2, 2e-2, 1e-1 | 7e-3, 5e-3, 2e-3 | 4e-1, 4e-1, 3e-1
p-value (UIPS) | 1e-12, 6e-14, 6e-15 | 3e-1, 8e-1, 1e-1 | 2e-6, 2e-6, 1e-3

Table 7 and Table 8 report the empirical performance of the learned policy on the synthetic datasets and the three real-world datasets, respectively. For ease of comparison, we also include the experiment results of BIPS-Cap and UIPS on each dataset in these two tables. Two p-values are also provided: (1) p-value (UIPSDR): the p-value under the t-test between UIPSDR and the best DR baseline on each dataset; (2) p-value (UIPS): the p-value under the t-test between UIPS and the best DR baseline on each dataset. From Table 7 and Table 8, we can first observe that DR cannot consistently outperform BIPS-Cap: it outperformed BIPS-Cap on the Coat and Kuai Rec datasets, while achieving much worse performance on the synthetic datasets and the Yahoo dataset. This is because the imputation model also plays a very important role in gradient calculation, as shown in Eq.(12); its accuracy greatly affects policy learning. When the imputation model is sufficiently accurate, for example, on the Coat dataset with only 300 actions, incorporating the DM estimator not only led to better performance of DR over IPS, but also improved the performance of UIPSDR over UIPS.
And in particular, in this situation UIPSDR performed better than DR, with the gain being statistically significant. When the imputation model is not accurate enough, for example, on the Kuai Rec dataset with a large action space but sparse reward feedback, DR is still worse than UIPS, and UIPSDR also performs worse than UIPS due to the distortion introduced by the imputation model.

7.5 Difference against DICE-type algorithms.
The DICE line of work [26, 47, 9] is proposed for off-policy correction in the multi-step RL setting, where the environment follows the Markov Decision Process (MDP) assumption. Although the DICE line of work does not require knowledge of the logging policy, it is fundamentally different from our work. Given a logged dataset $\mathcal{D} := \{(s_t, a_t, r_{s_t,a_t}, s_{t+1})\}$ collected from an unknown logging policy $\beta^*(a_t|s_t)$, DICE-type algorithms propose to directly estimate the discounted stationary distribution correction $w_{\pi/\mathcal{D}}(s_t,a_t) = \frac{d^\pi(s_t,a_t)}{d^{\mathcal{D}}(s_t,a_t)}$ to replace the product-based off-policy correction $\prod_t \frac{\pi(a_t|s_t)}{\beta^*(a_t|s_t)}$, which suffers from high variance due to the series of products. While the estimation of $w_{\pi/\mathcal{D}}(s_t,a_t)$ is agnostic to the logging policy, it highly depends on two assumptions: (1) the environment follows an MDP, i.e., $s_{t+1} \sim T(s_t, a_t)$ with $T(\cdot,\cdot)$ denoting the state transition function; (2) each logged sample should be a state-action-reward-next-state tuple $(s_t, a_t, r_{s_t,a_t}, s_{t+1})$ that contains state transition information.

Table 9: Empirical performance of DICE-S on synthetic datasets.
Algorithm | τ = 0.5: P@5, R@5, NDCG@5 | τ = 1: P@5, R@5, NDCG@5 | τ = 2: P@5, R@5, NDCG@5
BIPS-Cap | 0.5515±2e-3, 0.1553±8e-4, 0.6031±2e-3 | 0.5526±2e-3, 0.1561±6e-4, 0.6016±1e-3 | 0.5409±3e-3, 0.1529±9e-4, 0.5901±2e-3
DICE-S | 0.5416±4e-3, 0.1520±1e-3, 0.5968±4e-3 | 0.5508±2e-3, 0.1553±7e-4, 0.6010±2e-3 | 0.5403±2e-3, 0.1526±7e-4, 0.5903±1e-3
UIPS | 0.5589±3e-3, 0.1583±9e-4, 0.6095±3e-3 | 0.5572±2e-3, 0.1571±8e-4, 0.6074±2e-3 | 0.5432±3e-3, 0.1534±8e-4, 0.5946±2e-3

Table 10: Empirical performance of DICE-S on real-world datasets.
Algorithm | Yahoo: P@5, R@5, NDCG@5 | Coat: P@5, R@5, NDCG@5 | Kuai Rec: P@50, R@50, NDCG@50
BIPS-Cap | 0.2808±2e-3, 0.7576±5e-3, 0.6099±8e-3 | 0.2758±6e-3, 0.4582±7e-3, 0.4399±9e-3 | 0.8750±3e-3, 0.0238±7e-5, 0.8788±5e-3
DICE-S | 0.2618±4e-3, 0.7010±1e-2, 0.5627±9e-3 | 0.2686±5e-3, 0.4422±6e-3, 0.4265±5e-3 | 0.8842±2e-3, 0.0241±6e-5, 0.8908±2e-3
UIPS | 0.2868±2e-3, 0.7742±5e-3, 0.6274±5e-3 | 0.2877±3e-3, 0.4757±5e-3, 0.4576±8e-3 | 0.9120±1e-3, 0.0250±5e-5, 0.9174±7e-4

We further inspected how DICE-type algorithms would work in the contextual bandit setting, which can be taken as a one-state MDP. We take Dual DICE [1] as an example for illustration, and similar conclusions can be drawn for other DICE-type algorithms. We can easily derive that in the contextual bandit setting, Dual DICE degenerates to the IPS estimator that approximates the unknown ground-truth logging policy $\beta^*(a|s)$ with its empirical estimate from the given logged dataset $\mathcal{D} := \{(s_i, a_i, r_{s_i,a_i}) \mid i \in [N]\}$, which inherits all limitations of the IPS estimator, such as high variance on instances with low support in $\mathcal{D}$. More specifically, recall that Dual DICE estimates the discounted stationary distribution correction by optimizing the following objective function (Eq.(8) in [26]):
$$\min_{x:\,\mathcal{S}\times\mathcal{A}\to\mathbb{R}}\; \frac{1}{2}\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\big[x(s,a)^2\big] - \mathbb{E}_{(s,a)\sim d^{\pi}}\big[x(s,a)\big],$$
with the optimizer $x^*(s,a) = w_{\pi/\mathcal{D}}(s,a)$. In a multi-step RL setting, one cannot directly calculate the second expectation because $d^\pi(s,a)$ is inaccessible; thus Dual DICE applies the change-of-variable trick to address the above optimization problem.
However, in the contextual bandit setting, let $p(s)$ denote the state distribution; then $d^\pi(s,a) = p(s)\pi(a|s)$ and $d^{\mathcal{D}}(s,a) = p(s)\beta^*(a|s)$. The optimization problem can be rewritten as:
$$\min_{x:\,\mathcal{S}\times\mathcal{A}\to\mathbb{R}}\; \mathbb{E}_{s\sim p(s)}\Big[\frac{1}{2}\,\mathbb{E}_{a\sim\beta^*(a|s)}\big[x(s,a)^2\big] - \mathbb{E}_{a\sim\pi(a|s)}\big[x(s,a)\big]\Big].$$
With the logged dataset $\mathcal{D}$, the empirical estimate of the above optimization problem is:
$$\min_{x:\,\mathcal{S}\times\mathcal{A}\to\mathbb{R}}\; \sum_{s}\Big[\sum_{a}\frac{N_{s,a}}{2N_s}\,x(s,a)^2 - \sum_{a}\pi(a|s)\,x(s,a)\Big],$$
yielding $x^*(s,a) = \pi(a|s)\,N_s/N_{s,a}$, where $N_s$ and $N_{s,a}$ denote the number of logged samples with $s_i = s$ and with $(s_i = s, a_i = a)$, respectively. This is actually equivalent to the IPS estimator with $N_{s,a}/N_s$ as the estimate of $\beta^*(a|s)$. We refer to the above estimator as DICE-S, and Table 9 and Table 10 show the empirical performance of the policy learned through DICE-S on the synthetic and real-world datasets, respectively. Similar to BIPS-Cap described in Section 4, we also clip propensity scores to control variance. One can observe from Table 9 and Table 10 that DICE-S performs worse than BIPS-Cap on all datasets except the Kuai Rec dataset. Additionally, DICE-S underperforms UIPS on all datasets. This is because, although DICE-S is unbiased, it only becomes accurate with numerous logged samples and suffers high variance when logged samples are limited, resulting in its poor performance in our experiments.
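For concreteness, the DICE-S estimator discussed above amounts to the following count-based IPS estimate; the clipping threshold and function names are illustrative, and a hashable state representation is assumed:

```python
from collections import Counter

def dice_s_estimate(logged, pi, clip_max=30.0):
    """Count-based IPS value estimate that Dual DICE degenerates to in the bandit setting.

    logged:   list of (state, action, reward) tuples with hashable states
    pi:       callable pi(a, s) returning the learning policy's probability
    clip_max: propensity-score cap, mirroring the clipping used for BIPS-Cap
    """
    n_s = Counter(s for s, _, _ in logged)
    n_sa = Counter((s, a) for s, a, _ in logged)
    total = 0.0
    for s, a, r in logged:
        beta_hat = n_sa[(s, a)] / n_s[s]          # empirical logging probability N_{s,a}/N_s
        w = min(pi(a, s) / beta_hat, clip_max)    # clipped propensity score
        total += w * r
    return total / len(logged)
```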
7.6 Comparison against work on direct propensity estimation.
In addition to DICE, several other works [34, 21] propose methods to directly estimate the propensity ratio without requiring a behavior policy, and then use the propensity estimates to prioritize instances in the replay buffer for better TD learning. To compare against this line of work, we also included a new baseline called IPS-LFIW, which implements the approach proposed in [1] to directly estimate the propensity ratio for off-policy learning. The average performance and standard deviations of IPS-LFIW and UIPS on the three synthetic datasets with different $\tau$ are reported in Table 11; recall that a smaller $\tau$ indicates a more skewed ground-truth logging policy. The p-value under the t-test between UIPS and IPS-LFIW is also provided to investigate the significance of the improvements.

Table 11: Empirical performance of IPS-LFIW and UIPS on synthetic datasets.
Algorithm | τ = 0.5: P@5, R@5, NDCG@5 | τ = 1: P@5, R@5, NDCG@5 | τ = 2: P@5, R@5, NDCG@5
IPS-LFIW | 0.5542±1e-3, 0.1568±4e-4, 0.6033±1e-3 | 0.5472±1e-3, 0.1549±5e-4, 0.5975±1e-3 | 0.5255±6e-3, 0.1485±1e-3, 0.5769±5e-3
UIPS | 0.5608±2e-3, 0.1589±8e-4, 0.6113±3e-3 | 0.5572±2e-3, 0.1571±8e-4, 0.6074±2e-3 | 0.5432±3e-3, 0.1534±8e-4, 0.5946±2e-3
p-value | 2e-6, 5e-6, 6e-9 | 2e-8, 2e-6, 4e-10 | 4e-7, 8e-7, 1e-7

Notably, UIPS consistently outperformed IPS-LFIW with statistically significant improvements. One major reason for the worse performance of IPS-LFIW is that it does not consider the accuracy of the estimated propensity ratio, in direct analogy to the failure to handle uncertainty in the estimated logging probabilities in existing IPS-type algorithms.

7.7 Comparison against distributionally robust off-policy evaluation and learning.
Our work is fundamentally different from the line of work on distributionally robust off-policy evaluation and learning [32, 17, 44], and this results in different objectives that guide the use of min-max optimization. Specifically, the work in [32, 17, 44] assumes unknown changes exist between the training and deployment environments, such as user preference drift or unforeseen events during policy execution. Thus they choose to maximize the policy value (e.g., $\hat V_{\mathrm{IPS}}(\pi)$) in the worst environment within an uncertainty set $\mathcal{U}$ around the training environment, $\max_\pi \min_{\mathcal{U}} \hat V_{\mathrm{IPS}}(\pi)$. As a result, their uncertainty set is created by introducing a small perturbation to the training environment. For example, the work in [32, 17] searches for the worst environment $P_1$ among the $\sigma$-close perturbed environments around the training environment $P_0$ (Eq.(1) in [17]):
$$\mathcal{U}(\sigma) = \{P_1 : P_1 \ll P_0 \text{ and } D_{\mathrm{KL}}(P_1 \| P_0) \le \sigma\}.$$
And the work in [44] adversarially perturbs the known ground-truth logging policy $\pi_0(\cdot|\cdot)$ in searching for the worst case (Eq.(4) in [44]):
$$\mathcal{U}(\alpha) = \Big\{\pi_u : \max_{a,x}\max\Big\{\frac{\pi_u(a|x)}{\pi_0(a|x)}, \frac{\pi_0(a|x)}{\pi_u(a|x)}\Big\} \le e^{\alpha}\Big\}.$$
In contrast, UIPS assumes the training and deployment environments stay the same, but the ground-truth logging policy is unknown. To control the high bias and high variance caused by small and inaccurately estimated logging probabilities, UIPS explicitly models the uncertainty in the estimated logging policy by incorporating a per-sample weight $\phi_{x,a}$, as discussed in Eq.(4). In order to make $\hat V_{\mathrm{UIPS}}(\pi_\vartheta)$ as accurate as possible despite the unknown ground-truth logging policy $\beta^*(\cdot|\cdot)$, UIPS solves the min-max optimization problem in Eq.(8), which seeks the optimal $\phi_{x,a}$ that minimizes the upper bound of the mean squared error (MSE) of $\hat V_{\mathrm{UIPS}}$ to its ground-truth value, within an uncertainty set of the unknown ground-truth logging policy $\beta^*(\cdot|\cdot)$. The closed-form solution of this min-max optimization is derived in Theorem 3.2.

Furthermore, observing that the work in [44] also performs optimization over an uncertainty set of the logging policy, we further adapted their method to handle the inaccuracy of the estimated logging policy by taking $\pi_0(\cdot|\cdot)$ to be the estimated logging policy, i.e., $\pi_0(\cdot|\cdot) = \hat\beta(\cdot|\cdot)$. We name the adapted method IPS-UN. Table 12 reports the performance of the policy learned with IPS-UN on the three synthetic datasets. The results suggest that directly applying IPS-UN to handle inaccurately estimated logging probabilities is not a feasible solution to our problem. One important reason for the worse performance of IPS-UN is that it strives to optimize for the worst potential environment, which might not be the actual case in our experiment datasets. On the other hand, UIPS assumes the training and deployment environments stay the same and strives to identify the optimal policy under an unknown ground-truth logging policy.

Table 12: Empirical performance of IPS-UN on synthetic datasets.
Algorithm | τ = 0.5: P@5, R@5, NDCG@5 | τ = 1: P@5, R@5, NDCG@5 | τ = 2: P@5, R@5, NDCG@5
BIPS-Cap | 0.5515±2e-3, 0.1553±8e-4, 0.6031±2e-3 | 0.5526±2e-3, 0.1561±6e-4, 0.6016±1e-3 | 0.5409±3e-3, 0.1529±9e-4, 0.5901±2e-3
IPS-UN | 0.4089±3e-2, 0.1152±8e-3, 0.4916±2e-2 | 0.3911±3e-3, 0.1100±9e-3, 0.4754±3e-2 | 0.3599±4e-2, 0.1009±1e-2, 0.4353±4e-2
UIPS | 0.5589±3e-3, 0.1583±9e-4, 0.6095±3e-3 | 0.5572±2e-3, 0.1571±8e-4, 0.6074±2e-3 | 0.5432±3e-3, 0.1534±8e-4, 0.5946±2e-3

7.8 Theoretical Proofs.

Proof of Proposition 2.1:
Proof. Because of the linearity of expectation, we have
$$\mathbb{E}_{\mathcal{D}}\big[\hat V_{\mathrm{BIPS}}(\pi_\vartheta)\big] = \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\hat\beta(a|x)}\,r_{x,a}\Big],$$
and thus:
$$\mathrm{Bias}\big(\hat V_{\mathrm{BIPS}}(\pi_\vartheta)\big) = \mathbb{E}_{\mathcal{D}}\big[\hat V_{\mathrm{BIPS}}(\pi_\vartheta)\big] - V(\pi_\vartheta) = \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\hat\beta(a|x)}\,r_{x,a}\Big] - \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\beta^*(a|x)}\,r_{x,a}\Big] = \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\beta^*(a|x)}\,r_{x,a}\Big(\frac{\beta^*(a|x)}{\hat\beta(a|x)} - 1\Big)\Big].$$
Since the samples are independently drawn from the logging policy, the variance can be computed as:
$$\mathrm{Var}_{\mathcal{D}}\big(\hat V_{\mathrm{BIPS}}(\pi_\vartheta)\big) = \frac{1}{N}\,\mathrm{Var}_{\beta^*}\Big(\frac{\pi_\vartheta(a|x)}{\hat\beta(a|x)}\,r_{x,a}\Big).$$
By re-scaling, we get:
$$N\,\mathrm{Var}_{\mathcal{D}}\big(\hat V_{\mathrm{BIPS}}(\pi_\vartheta)\big) = \mathrm{Var}_{\beta^*}\Big(\frac{\pi_\vartheta(a|x)}{\hat\beta(a|x)}\,r_{x,a}\Big) = \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)^2}\,r^2_{x,a}\Big] - \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\hat\beta(a|x)}\,r_{x,a}\Big]^2 = \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\beta^*(a|x)}\cdot\frac{\beta^*(a|x)^2}{\hat\beta(a|x)^2}\,r^2_{x,a}\Big] - \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\hat\beta(a|x)}\,r_{x,a}\Big]^2.$$
This completes the proof.
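As a quick numerical sanity check of the bias expression derived above, the following toy simulation with made-up probabilities (a single context, three actions) compares the closed-form bias $\mathbb{E}_{\beta^*}\big[\frac{\pi_\vartheta}{\beta^*}\,r\,(\frac{\beta^*}{\hat\beta} - 1)\big]$ against the empirical bias of $\hat V_{\mathrm{BIPS}}$ over repeated simulated logged datasets:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy single-context bandit with three actions; all numbers are made up.
r = np.array([1.0, 0.5, 0.0])          # deterministic rewards r_{x,a}
beta_star = np.array([0.7, 0.2, 0.1])  # ground-truth logging policy
beta_hat = np.array([0.6, 0.3, 0.1])   # misspecified estimate of the logging policy
pi = np.array([0.1, 0.3, 0.6])         # learning policy pi(a|x)

v_true = float(pi @ r)
# Closed-form bias: sum_a pi(a|x) * r * (beta*/beta_hat - 1).
closed_form_bias = float(np.sum(pi * r * (beta_star / beta_hat - 1.0)))

N, trials = 2000, 500
estimates = []
for _ in range(trials):
    a = rng.choice(3, size=N, p=beta_star)                  # logged actions drawn from beta*
    estimates.append(np.mean(pi[a] / beta_hat[a] * r[a]))   # BIPS estimate using the estimated policy
print("closed-form bias:", closed_form_bias)
print("empirical bias:  ", np.mean(estimates) - v_true)
```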
Proof of Theorem 3.1:
Proof. We can get:
$$\mathrm{MSE}\big(\hat V_{\mathrm{UIPS}}(\pi_\vartheta)\big) = \mathbb{E}_{\mathcal{D}}\Big[\big(\hat V_{\mathrm{UIPS}}(\pi_\vartheta) - V(\pi_\vartheta)\big)^2\Big] = \Big(\mathbb{E}_{\mathcal{D}}\big[\hat V_{\mathrm{UIPS}}(\pi_\vartheta) - V(\pi_\vartheta)\big]\Big)^2 + \mathrm{Var}_{\mathcal{D}}\big(\hat V_{\mathrm{UIPS}}(\pi_\vartheta) - V(\pi_\vartheta)\big) = \Big(\mathbb{E}_{\mathcal{D}}\big[\hat V_{\mathrm{UIPS}}(\pi_\vartheta) - V(\pi_\vartheta)\big]\Big)^2 + \mathrm{Var}_{\mathcal{D}}\big(\hat V_{\mathrm{UIPS}}(\pi_\vartheta)\big) = \mathrm{Bias}\big(\hat V_{\mathrm{UIPS}}(\pi_\vartheta)\big)^2 + \mathrm{Var}\big(\hat V_{\mathrm{UIPS}}(\pi_\vartheta)\big).$$
We first bound the bias term:
$$\mathrm{Bias}\big(\hat V_{\mathrm{UIPS}}(\pi_\vartheta)\big) = \mathbb{E}_{\mathcal{D}}\big[\hat V_{\mathrm{UIPS}}(\pi_\vartheta) - V(\pi_\vartheta)\big] \overset{(1)}{=} \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\hat\beta(a|x)}\phi_{x,a}\,r_{x,a}\Big] - \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\beta^*(a|x)}\,r_{x,a}\Big] = \mathbb{E}_{\beta^*}\Big[r_{x,a}\,\frac{\pi_\vartheta(a|x)}{\beta^*(a|x)}\Big(\frac{\beta^*(a|x)}{\hat\beta(a|x)}\phi_{x,a} - 1\Big)\Big] \overset{(2)}{\le} \sqrt{\mathbb{E}_{\pi_\vartheta}\Big[r^2_{x,a}\,\frac{\pi_\vartheta(a|x)}{\beta^*(a|x)}\Big]}\;\sqrt{\mathbb{E}_{\beta^*}\Big[\Big(\frac{\beta^*(a|x)}{\hat\beta(a|x)}\phi_{x,a} - 1\Big)^2\Big]}.$$
Step (1) follows from the linearity of expectation, and step (2) is due to the Cauchy-Schwarz inequality. We then bound the variance term:
$$\mathrm{Var}\big(\hat V_{\mathrm{UIPS}}(\pi_\vartheta)\big) = \frac{1}{N}\,\mathrm{Var}_{\beta^*}\Big(\frac{\pi_\vartheta(a|x)}{\hat\beta(a|x)}\phi_{x,a}\,r_{x,a}\Big) = \frac{1}{N}\Big(\mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)^2}\phi^2_{x,a}\,r^2_{x,a}\Big] - \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)}{\hat\beta(a|x)}\phi_{x,a}\,r_{x,a}\Big]^2\Big) \le \frac{1}{N}\,\mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)^2}\phi^2_{x,a}\,r^2_{x,a}\Big] \le \frac{1}{N}\,\mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)^2}\phi^2_{x,a}\Big],$$
where the last step uses $r_{x,a}\in[0,1]$. Combining the bounds on the bias and the variance completes the proof.

Proof of Theorem 3.2:
Proof. We first define several notations:
$$T(\phi_{x,a}, \beta_{x,a}) = \lambda\,\mathbb{E}_{\beta^*}\Big[\Big(\frac{\beta_{x,a}}{\hat\beta(a|x)}\phi_{x,a} - 1\Big)^2\Big] + \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)^2}\phi^2_{x,a}\Big].$$
$T(\phi_{x,a}) = \max_{\beta_{x,a}\in B_{x,a}} T(\phi_{x,a}, \beta_{x,a})$ denotes the maximum value of the inner optimization problem, $T^* = \min_{\phi_{x,a}} T(\phi_{x,a}) = \min_{\phi_{x,a}}\max_{\beta_{x,a}\in B_{x,a}} T(\phi_{x,a},\beta_{x,a})$ denotes the optimal min-max value, and $\phi^*_{x,a} = \arg\min_{\phi_{x,a}} T(\phi_{x,a})$. We write $B^-_{x,a} := \frac{\hat Z \exp(-\gamma U_{x,a})}{Z^*}\hat\beta(a|x)$ and $B^+_{x,a} := \frac{\hat Z \exp(\gamma U_{x,a})}{Z^*}\hat\beta(a|x)$.

We first find the maximum value of the inner optimization problem, i.e., $T(\phi_{x,a})$, for any fixed $\phi_{x,a}$. Viewed as a function of $\beta_{x,a}$, $T(\phi_{x,a},\beta_{x,a})$ is a parabola whose minimum is at $\beta_{x,a} = \hat\beta(a|x)/\phi_{x,a}$, so its maximum over $[B^-_{x,a}, B^+_{x,a}]$ is attained at the endpoint farther from $\hat\beta(a|x)/\phi_{x,a}$. There are three cases, shown in Figure 4:
Case I: when $\hat\beta(a|x)/\phi_{x,a} \ge B^+_{x,a}$, $T(\phi_{x,a})$ achieves its maximum at $\beta_{x,a} = B^-_{x,a}$; in other words, $T(\phi_{x,a}) = T(\phi_{x,a}, B^-_{x,a})$ when $\phi_{x,a} \le \hat\beta(a|x)/B^+_{x,a}$.
Case II: when $B^-_{x,a} \le \hat\beta(a|x)/\phi_{x,a} \le B^+_{x,a}$, i.e., $\frac{Z^*\exp(-\gamma U_{x,a})}{\hat Z} \le \phi_{x,a} \le \frac{Z^*\exp(\gamma U_{x,a})}{\hat Z}$, $T(\phi_{x,a})$ is the larger of $T(\phi_{x,a}, B^-_{x,a})$ and $T(\phi_{x,a}, B^+_{x,a})$. More specifically, when $\hat\beta(a|x)/\phi_{x,a} \le \frac{B^+_{x,a}+B^-_{x,a}}{2}$, i.e., $\phi_{x,a} \ge \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}$, $T(\phi_{x,a}) = T(\phi_{x,a}, B^+_{x,a})$; otherwise, when $\phi_{x,a} < \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}$, $T(\phi_{x,a}) = T(\phi_{x,a}, B^-_{x,a})$.
Case III: when $\phi_{x,a} \ge \hat\beta(a|x)/B^-_{x,a}$, implying $\hat\beta(a|x)/\phi_{x,a} \le B^-_{x,a}$, $T(\phi_{x,a}) = T(\phi_{x,a}, B^+_{x,a})$.

Figure 4: Three cases for maximizing the inner optimization problem.

Overall, we get that:
$$T(\phi_{x,a}) = \begin{cases} T(\phi_{x,a}, B^-_{x,a}), & \phi_{x,a} \le \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}},\\[4pt] T(\phi_{x,a}, B^+_{x,a}), & \phi_{x,a} \ge \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}. \end{cases} \qquad (15)$$
Next we find the minimum value of $T(\phi_{x,a})$. We first observe that, without any constraint on $\phi_{x,a}$, $T(\phi_{x,a}, B^+_{x,a})$ achieves its global minimum at
$$\phi^+_{x,a} = \frac{\lambda}{\lambda\frac{B^+_{x,a}}{\hat\beta(a|x)} + \frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)B^+_{x,a}}}.$$
However, $\phi^+_{x,a} \le \frac{\hat\beta(a|x)}{B^+_{x,a}} \le \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}$, which implies that on $\big[\frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}, \infty\big)$ the minimum value of $T(\phi_{x,a}, B^+_{x,a})$ is achieved at $\frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}$. On the other hand, without any constraint on $\phi_{x,a}$, the global minimum of $T(\phi_{x,a}, B^-_{x,a})$ is achieved at:
$$\phi^-_{x,a} = \frac{\lambda}{\lambda\frac{B^-_{x,a}}{\hat\beta(a|x)} + \frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)B^-_{x,a}}}. \qquad (16)$$
Thus, if $\phi^-_{x,a} \le \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}$, then $\phi^*_{x,a} = \phi^-_{x,a}$, since $T(\phi^-_{x,a}, B^-_{x,a}) \le T\big(\frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}, B^-_{x,a}\big) = T\big(\frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}, B^+_{x,a}\big)$. Otherwise, when $\phi^-_{x,a} > \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}$, the minimum value of $T(\phi_{x,a}, B^-_{x,a})$ over $\big(-\infty, \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}\big]$ is also achieved at $\frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}$, implying $\phi^*_{x,a} = \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}$. Overall,
$$\phi^*_{x,a} = \min\Bigg(\frac{\lambda}{\lambda\frac{B^-_{x,a}}{\hat\beta(a|x)} + \frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)B^-_{x,a}}},\ \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}\Bigg). \qquad (17)$$
Denoting $\eta = Z^*/\hat Z$, we can get
$$\phi^*_{x,a} = \min\Bigg(\frac{\lambda}{\frac{\lambda}{\eta}\exp(-\gamma U_{x,a}) + \eta\frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)^2}\exp(\gamma U_{x,a})},\ \frac{2\eta}{\exp(\gamma U_{x,a}) + \exp(-\gamma U_{x,a})}\Bigg). \qquad (18)$$
Note that $\eta \in [\exp(-\gamma U^{\max}_x),\ \exp(\gamma U^{\max}_x)]$ with $U^{\max}_x = \max_{a'} U_{x,a'}$, since $\hat Z\exp(-\gamma U^{\max}_x) \le Z^* = \sum_{a'}\exp(f_{\theta^*}(x,a')) \le \hat Z\exp(\gamma U^{\max}_x)$. This completes the proof.

Proof of Corollary 3.3:
Proof. Following a procedure similar to that of Theorem 3.1, the mean squared error (MSE) between $\hat V_{\mathrm{BIPS}}(\pi_\vartheta)$ and the ground-truth value $V(\pi_\vartheta)$ is upper bounded as follows:
$$\mathrm{MSE}\big(\hat V_{\mathrm{BIPS}}(\pi_\vartheta)\big) = \mathbb{E}_{\mathcal{D}}\Big[\big(\hat V_{\mathrm{BIPS}}(\pi_\vartheta) - V(\pi_\vartheta)\big)^2\Big] \le \mathbb{E}_{\pi_\vartheta}\Big[r^2_{x,a}\,\frac{\pi_\vartheta(a|x)}{\beta^*(a|x)}\Big]\,\mathbb{E}_{\beta^*}\Big[\Big(\frac{\beta^*(a|x)}{\hat\beta(a|x)} - 1\Big)^2\Big] + \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)^2}\Big].$$
When $\beta^*(a|x) \in [B^-_{x,a}, B^+_{x,a}]$, with $B^-_{x,a} := \frac{\hat Z\exp(-\gamma U_{x,a})}{Z^*}\hat\beta(a|x)$ and $B^+_{x,a} := \frac{\hat Z\exp(\gamma U_{x,a})}{Z^*}\hat\beta(a|x)$, $\mathrm{MSE}\big(\hat V_{\mathrm{BIPS}}(\pi_\vartheta)\big)$ can be further upper bounded as follows. For ease of illustration, we set $\lambda = \mathbb{E}_{\pi_\vartheta}\big[r^2_{x,a}\frac{\pi_\vartheta(a|x)}{\beta^*(a|x)}\big]$ and
$$T(\phi_{x,a}, \beta_{x,a}) = \lambda\,\mathbb{E}_{\beta^*}\Big[\Big(\frac{\beta_{x,a}}{\hat\beta(a|x)}\phi_{x,a} - 1\Big)^2\Big] + \mathbb{E}_{\beta^*}\Big[\frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)^2}\phi^2_{x,a}\Big].$$
Let $\bar T_{\mathrm{BIPS}}$ denote the upper bound of $\mathrm{MSE}\big(\hat V_{\mathrm{BIPS}}(\pi_\vartheta)\big)$; then we can derive that
$$\bar T_{\mathrm{BIPS}} = \begin{cases} T(1, B^+_{x,a}), & \hat\beta(a|x) \le \frac{B^+_{x,a}+B^-_{x,a}}{2},\\[4pt] T(1, B^-_{x,a}), & \hat\beta(a|x) > \frac{B^+_{x,a}+B^-_{x,a}}{2}. \end{cases} \qquad (19)$$
Recall that in Theorem 3.2 we show that the upper bound of $\mathrm{MSE}\big(\hat V_{\mathrm{UIPS}}(\pi_\vartheta)\big)$, denoted $\bar T_{\mathrm{UIPS}}$, is:
$$\bar T_{\mathrm{UIPS}} = \begin{cases} T(\phi^-_{x,a}, B^-_{x,a}), & \phi^-_{x,a} < \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}},\\[4pt] T\big(\frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}, B^+_{x,a}\big), & \phi^-_{x,a} \ge \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}, \end{cases} \qquad (20)$$
where $\phi^-_{x,a}$ is defined in Eq.(16). Next we show that $\bar T_{\mathrm{UIPS}} \le \bar T_{\mathrm{BIPS}}$.
When $\hat\beta(a|x) \le \frac{B^+_{x,a}+B^-_{x,a}}{2}$, we get $\bar T_{\mathrm{BIPS}} = T(1, B^+_{x,a})$ and $\frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}} \le 1$. If $\phi^-_{x,a} < \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}$, then $\bar T_{\mathrm{UIPS}} = T(\phi^-_{x,a}, B^-_{x,a}) < T\big(\frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}, B^-_{x,a}\big) = T\big(\frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}, B^+_{x,a}\big) \le T(1, B^+_{x,a})$. And if $\phi^-_{x,a} \ge \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}$, then $\bar T_{\mathrm{UIPS}} = T\big(\frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}, B^+_{x,a}\big) \le T(1, B^+_{x,a})$.
When $\hat\beta(a|x) > \frac{B^+_{x,a}+B^-_{x,a}}{2}$, we get $\bar T_{\mathrm{BIPS}} = T(1, B^-_{x,a})$ and $\frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}} > 1$. If $\phi^-_{x,a} < \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}$, then $\bar T_{\mathrm{UIPS}} = T(\phi^-_{x,a}, B^-_{x,a}) \le T(1, B^-_{x,a})$, since $\phi^-_{x,a}$ is the global minimizer of $T(\cdot, B^-_{x,a})$. Otherwise, when $\phi^-_{x,a} \ge \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}} > 1$, we have $\bar T_{\mathrm{UIPS}} = T\big(\frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}, B^+_{x,a}\big) = T\big(\frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}, B^-_{x,a}\big) < T(1, B^-_{x,a})$.
In both cases we have $\bar T_{\mathrm{UIPS}} \le \bar T_{\mathrm{BIPS}}$, thus completing the proof.

Lemma 7.1. Under fixed $\pi_\vartheta(a|x)$ and $\hat\beta(a|x)$, and with $\alpha_{x,a} = \sqrt{\frac{\lambda}{2\eta^2}\big(1 - \exp(-2\gamma U_{x,a})\big)}$, we have the following observations:
(1) If $\frac{\pi_\vartheta(a|x)}{\hat\beta(a|x)} \le \alpha_{x,a}$, then $\phi^*_{x,a} = \frac{2\eta}{\exp(\gamma U_{x,a}) + \exp(-\gamma U_{x,a})}$; otherwise $\phi^*_{x,a} = \frac{\lambda}{\frac{\lambda}{\eta}\exp(-\gamma U_{x,a}) + \eta\frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)^2}\exp(\gamma U_{x,a})}$. In other words, $\phi^*_{x,a} \le 2\eta$ always holds.
(2) If $\alpha_{x,a} \le \frac{\pi_\vartheta(a|x)}{\hat\beta(a|x)} \le \frac{\sqrt{\lambda}}{\eta}\exp(-\gamma U_{x,a})$, a larger $U_{x,a}$ brings a larger $\phi^*_{x,a}$.
(3) Otherwise, $\phi^*_{x,a}$ decreases as $U_{x,a}$ increases.

Proof. The first observation is validated by the following equivalence: the first term of Eq.(18) is no smaller than the second term if and only if
$$\frac{\lambda}{\frac{\lambda}{\eta}\exp(-\gamma U_{x,a}) + \eta\frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)^2}\exp(\gamma U_{x,a})} \ge \frac{2\eta}{\exp(\gamma U_{x,a}) + \exp(-\gamma U_{x,a})} \;\Longleftrightarrow\; \frac{\pi_\vartheta(a|x)}{\hat\beta(a|x)} \le \alpha_{x,a}.$$
For the second and third observations, note that $\alpha_{x,a} \le \sqrt{\lambda}/\eta$. Let $L(u) = \frac{\lambda}{\eta}\exp(-\gamma u) + \eta\frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)^2}\exp(\gamma u)$, so that the first term of Eq.(18) equals $\lambda/L(U_{x,a})$. We have:
$$\frac{\partial L(u)}{\partial u} = -\gamma\frac{\lambda}{\eta}\exp(-\gamma u) + \gamma\eta\frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)^2}\exp(\gamma u).$$
To make $\frac{\partial L(u)}{\partial u} \ge 0$, we need $u \ge \frac{1}{\gamma}\log\frac{\sqrt{\lambda}\,\hat\beta(a|x)}{\eta\,\pi_\vartheta(a|x)}$. This implies that when $U_{x,a} \ge \frac{1}{\gamma}\log\frac{\sqrt{\lambda}\,\hat\beta(a|x)}{\eta\,\pi_\vartheta(a|x)}$, $\phi^*_{x,a}$ decreases as $U_{x,a}$ increases; otherwise, as $U_{x,a}$ increases, $\phi^*_{x,a}$ also increases. More specifically, when $\alpha_{x,a} \le \frac{\pi_\vartheta(a|x)}{\hat\beta(a|x)} \le \frac{\sqrt{\lambda}}{\eta}\exp(-\gamma U_{x,a})$, we have $U_{x,a} \le \frac{1}{\gamma}\log\frac{\sqrt{\lambda}\,\hat\beta(a|x)}{\eta\,\pi_\vartheta(a|x)}$. In other words, for these samples a higher uncertainty coincides with a smaller value of $\pi_\vartheta/\hat\beta$, and UIPS tends to boost such safe samples with a higher $\phi^*_{x,a}$. In all other cases, $\phi^*_{x,a}$ decreases as $U_{x,a}$ increases. This completes the proof.

7.9 Convergence Analysis
Next we provide the convergence analysis of policy improvement under UIPS.

Definition 7.2. A function $f:\mathbb{R}^d\to\mathbb{R}$ is $L$-smooth when $\|\nabla f(x) - \nabla f(y)\|_2 \le L\,\|x - y\|_2$ for all $x, y \in \mathbb{R}^d$.

We first state and prove a general result, which serves as the basis of our convergence analysis of policy improvement under UIPS. The proof is a special case of the convergence proof of stochastic gradient descent with biased gradients in [3]. Suppose we have a differentiable function $f:\mathbb{R}^d\to\mathbb{R}$ that is $L$-smooth and attains a finite minimum value $f^* := \min_{x\in\mathbb{R}^d} f(x)$.
Suppose we cannot directly access the true gradient $\nabla f(x)$. Instead, at a given $x$ we can only access a stochastic estimate $\zeta(x)\in\mathbb{R}^d$ of a surrogate gradient $\bar\nabla f(x)$ of the function $f(x)$. Let $b(x) = \bar\nabla f(x) - \nabla f(x)$ denote the bias of the surrogate gradient with respect to $\nabla f(x)$, and let $\delta(x) = \zeta(x) - \bar\nabla f(x)$ denote the noise in the gradient estimate. We assume that:
$$\|b(x)\| \le \phi\,\|\nabla f(x)\|, \qquad \mathbb{E}[\delta(x)] = 0, \qquad \mathbb{E}\big[\|\delta(x)\|^2 \mid x\big] \le M\,\|\bar\nabla f(x)\|^2 + \sigma^2, \qquad (21)$$
where the constants $\phi$ and $M$ satisfy $0 < \phi < 1$ and $M \ge 0$, respectively. When running stochastic gradient descent, i.e., $x_{k+1} = x_k - \eta_k\,\zeta(x_k)$, we have the following guarantee on the convergence of $x_k$ to an approximate stationary point of $f$.

Theorem 7.3. Suppose $f(\cdot)$ is differentiable and $L$-smooth, and the accessed approximate gradient meets the conditions in Eq.(21) with parameters $(\sigma_k, \phi_k)$ at iteration $k$. Denote $\sigma_{\max} = \max_k \sigma_k$ and $\phi_{\max} = \max_k \phi_k$. Setting the stepsizes $\{\eta_k\}$ to $\eta_k = \min\{\frac{1}{(M+1)L}, \frac{1}{\sigma_{\max}\sqrt{K}}\}$, after $K$ iterations stochastic gradient descent satisfies:
$$\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big[\|\nabla f(x_k)\|^2\big] \le \frac{2(M+1)L\big(f(x_1) - f^*\big)}{K(1-\phi_{\max})} + \frac{\big(L + 2(f(x_1) - f^*)\big)\,\sigma_{\max}}{\sqrt{K}\,(1-\phi_{\max})}.$$

Proof. With $\eta_k \le \frac{1}{(M+1)L}$ and by $L$-smoothness, we have:
$$f(x_{k+1}) \le f(x_k) + \big\langle\nabla f(x_k),\, x_{k+1} - x_k\big\rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 \qquad (22)$$
$$= f(x_k) - \eta_k\big\langle\nabla f(x_k),\, \zeta(x_k)\big\rangle + \frac{L\eta_k^2}{2}\|\zeta(x_k)\|^2 = f(x_k) - \eta_k\big\langle\nabla f(x_k),\, \delta(x_k) + b(x_k) + \nabla f(x_k)\big\rangle + \frac{L\eta_k^2}{2}\big\|\delta(x_k) + b(x_k) + \nabla f(x_k)\big\|^2.$$
Taking expectations on both sides, we have:
$$\mathbb{E}[f(x_{k+1})] \le \mathbb{E}[f(x_k)] - \eta_k\,\mathbb{E}\big[\langle\nabla f(x_k), \delta(x_k)\rangle\big] - \eta_k\,\mathbb{E}\big[\langle\nabla f(x_k), b(x_k) + \nabla f(x_k)\rangle\big] + \frac{L\eta_k^2}{2}\,\mathbb{E}\big[\|\delta(x_k) + b(x_k) + \nabla f(x_k)\|^2\big]$$
$$\overset{(1)}{\le} \mathbb{E}[f(x_k)] - \eta_k\,\mathbb{E}\big[\langle\nabla f(x_k), b(x_k) + \nabla f(x_k)\rangle\big] + \frac{L\eta_k^2}{2}\Big(\mathbb{E}\big[\|\delta(x_k)\|^2\big] + \mathbb{E}\big[\|b(x_k) + \nabla f(x_k)\|^2\big]\Big)$$
$$\le \mathbb{E}[f(x_k)] - \eta_k\,\mathbb{E}\big[\langle\nabla f(x_k), b(x_k) + \nabla f(x_k)\rangle\big] + \frac{L\eta_k^2}{2}\Big((M+1)\,\mathbb{E}\big[\|b(x_k) + \nabla f(x_k)\|^2\big] + \sigma_k^2\Big)$$
$$\overset{(2)}{\le} \mathbb{E}[f(x_k)] + \frac{\eta_k}{2}\,\mathbb{E}\Big[-2\big\langle\nabla f(x_k),\, b(x_k) + \nabla f(x_k)\big\rangle + \big\|b(x_k) + \nabla f(x_k)\big\|^2\Big] + \frac{L\eta_k^2}{2}\sigma_k^2$$
$$= \mathbb{E}[f(x_k)] + \frac{\eta_k}{2}\,\mathbb{E}\big[\|b(x_k)\|^2 - \|\nabla f(x_k)\|^2\big] + \frac{L\eta_k^2}{2}\sigma_k^2 \overset{(3)}{\le} \mathbb{E}[f(x_k)] + \frac{\eta_k}{2}(\phi_k - 1)\,\mathbb{E}\big[\|\nabla f(x_k)\|^2\big] + \frac{L\eta_k^2}{2}\sigma_k^2,$$
where inequality (1) is due to $\mathbb{E}[\delta(x_k)] = 0$, inequality (2) is due to $\eta_k \le \frac{1}{(M+1)L}$, and inequality (3) is due to $\|b(x_k)\| \le \phi_k\|\nabla f(x_k)\|$ with $\phi_k < 1$. By summing over the iterations $k = 1, 2, \ldots, K$ and re-arranging the terms, we obtain:
$$\sum_{k=1}^{K}(1-\phi_k)\,\frac{\eta_k}{2}\,\mathbb{E}\big[\|\nabla f(x_k)\|^2\big] \le f(x_1) - \mathbb{E}\big[f(x_{K+1})\big] + \frac{L}{2}\sum_{k=1}^{K}\eta_k^2\sigma_k^2 \le f(x_1) - f^* + \frac{L}{2}\sum_{k=1}^{K}\eta_k^2\sigma_k^2,$$
where the last inequality follows from $f(x_{K+1}) \ge f^*$. Since $\eta_k = \eta_1 = \min\{\frac{1}{(M+1)L}, \frac{1}{\sigma_{\max}\sqrt{K}}\}$ for $k = 1, \ldots, K$, we can obtain:
$$\sum_{k=1}^{K}\mathbb{E}\big[\|\nabla f(x_k)\|^2\big] \le \frac{2\big(f(x_1) - f^*\big) + LK\eta_1^2\sigma_{\max}^2}{\eta_1\,(1-\phi_{\max})}.$$
Dividing both sides of the above inequality by $K$, we obtain:
$$\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big[\|\nabla f(x_k)\|^2\big] \le \frac{2\big(f(x_1) - f^*\big) + LK\eta_1^2\sigma_{\max}^2}{K\eta_1\,(1-\phi_{\max})} \le \frac{2\big(f(x_1) - f^*\big)\max\{(M+1)L,\ \sigma_{\max}\sqrt{K}\}}{K(1-\phi_{\max})} + \frac{L\sigma_{\max}}{\sqrt{K}\,(1-\phi_{\max})} \le \frac{2(M+1)L\big(f(x_1) - f^*\big)}{K(1-\phi_{\max})} + \frac{\big(L + 2(f(x_1) - f^*)\big)\sigma_{\max}}{\sqrt{K}\,(1-\phi_{\max})}.$$

Given this general result, we can now prove Theorem 3.4 by showing that the gradient used by UIPS meets the requirements of Theorem 7.3.

Proof of Theorem 3.4:
Proof. UIPS aims to maximize the expected return $V(\pi_\vartheta)$. We can therefore utilize Theorem 7.3 by setting $f = -V(\pi_\vartheta)$. Since $f^* = \min_\vartheta\big(-V(\pi_\vartheta)\big) \ge -V_{\max}$ and the expected reward is always non-negative, it follows that $f(x_1) - f^* \le V_{\max}$. We first introduce some additional notation. Let $\rho^*_\vartheta(x,a) = \frac{\pi_\vartheta(a|x)}{\beta^*(a|x)}$ denote the propensity score under the ground-truth logging policy, and let $\hat\rho_\vartheta(x,a) = \frac{\pi_\vartheta(a|x)}{\hat\beta(a|x)}\phi^*_{x,a}$ represent the propensity score of UIPS; recall that $\phi^*_{x,a}$ is derived by solving the optimization problem in Eq.(8). Let $g_\vartheta(x,a) = \nabla_\vartheta\log\pi_\vartheta(a|x)$. The true off-policy policy gradient is:
$$\nabla V(\pi_\vartheta) = \mathbb{E}_{\beta^*}\big[\rho^*\,r\,g_\vartheta\big].$$
For UIPS, the approximate policy gradient in each batch of size $B$ is:
$$\nabla\hat V_{\mathrm{UIPS}}(\pi_\vartheta) = \frac{1}{B}\sum_{i=1}^{B}\hat\rho_i\,r_i\,g^i_\vartheta,$$
which is an unbiased estimate of
$$\nabla V_{\mathrm{UIPS}}(\pi_\vartheta) = \mathbb{E}_{\beta^*}\big[\hat\rho\,r\,g_\vartheta\big].$$
To utilize Theorem 7.3, we set $\nabla f = -\nabla V(\pi_\vartheta)$, $\bar\nabla f = -\nabla V_{\mathrm{UIPS}}(\pi_\vartheta)$, and $\zeta = -\nabla\hat V_{\mathrm{UIPS}}(\pi_\vartheta)$. We will now demonstrate that the assumptions in Eq.(21) can be satisfied.
Let $\phi_\vartheta = \max_{x,a}\Big\{\Big|\frac{\hat\rho_\vartheta(x,a)}{\rho^*_\vartheta(x,a)} - 1\Big|\Big\}$. We first have:
$$\|b(\vartheta)\| = \big\|\nabla V_{\mathrm{UIPS}}(\pi_\vartheta) - \nabla V(\pi_\vartheta)\big\| = \big\|\mathbb{E}_{\beta^*}\big[(\hat\rho - \rho^*)\,r\,g_\vartheta\big]\big\| = \Big\|\mathbb{E}_{\beta^*}\Big[\rho^*\Big(\frac{\hat\rho}{\rho^*} - 1\Big)\,r\,g_\vartheta\Big]\Big\| \le \phi_\vartheta\,\big\|\mathbb{E}_{\beta^*}\big[\rho^*\,r\,g_\vartheta\big]\big\| = \phi_\vartheta\,\big\|\nabla V(\pi_\vartheta)\big\|. \qquad (23)$$
Next we show that $0 < \phi_\vartheta < 1$:
$$\phi_\vartheta = \max_{x,a}\Big|\frac{\hat\rho_\vartheta(x,a)}{\rho^*_\vartheta(x,a)} - 1\Big| = \max_{x,a}\Big|\frac{\beta^*(a|x)}{\hat\beta(a|x)}\,\phi^*_{x,a} - 1\Big|.$$
Recall from Eq.(17) in the proof of Theorem 3.2 that
$$\phi^*_{x,a} = \min\Bigg(\frac{\lambda}{\lambda\frac{B^-_{x,a}}{\hat\beta(a|x)} + \frac{\pi_\vartheta(a|x)^2}{\hat\beta(a|x)B^-_{x,a}}},\ \frac{2\hat\beta(a|x)}{B^+_{x,a}+B^-_{x,a}}\Bigg),$$
where $B^-_{x,a} := \frac{\hat Z\exp(-\gamma U_{x,a})}{Z^*}\hat\beta(a|x)$ and $B^+_{x,a} := \frac{\hat Z\exp(\gamma U_{x,a})}{Z^*}\hat\beta(a|x)$. Thus we have:
$$\frac{\beta^*(a|x)}{\hat\beta(a|x)}\,\phi^*_{x,a} = \min\Bigg(\frac{\lambda}{\lambda\frac{B^-_{x,a}}{\beta^*(a|x)} + \frac{\pi_\vartheta(a|x)^2}{\beta^*(a|x)B^-_{x,a}}},\ \frac{2}{\frac{B^+_{x,a}}{\beta^*(a|x)} + \frac{B^-_{x,a}}{\beta^*(a|x)}}\Bigg).$$
Since $B^+_{x,a} \ge \beta^*(a|x)$, we have $0 \le \frac{\beta^*(a|x)}{\hat\beta(a|x)}\,\phi^*_{x,a} \le 2$, implying $0 < \phi_\vartheta < 1$. Also, since $\nabla\hat V_{\mathrm{UIPS}}(\pi_\vartheta)$ is an unbiased estimate of $\nabla V_{\mathrm{UIPS}}(\pi_\vartheta)$, we have:
$$\mathbb{E}[\delta(\vartheta)] = \mathbb{E}\big[\nabla\hat V_{\mathrm{UIPS}}(\pi_\vartheta) - \nabla V_{\mathrm{UIPS}}(\pi_\vartheta)\big] = 0. \qquad (26)$$
Finally, we have:
$$\mathbb{E}\big[\|\delta(\vartheta)\|^2\big] = \mathbb{E}\Big[\big\|\nabla\hat V_{\mathrm{UIPS}}(\pi_\vartheta) - \nabla V_{\mathrm{UIPS}}(\pi_\vartheta)\big\|^2\Big] = \mathbb{E}\Bigg[\Big\|\frac{1}{B}\sum_{i=1}^{B}\hat\rho_i\,r_i\,g^i_\vartheta - \mathbb{E}_{\beta^*}\big[\hat\rho\,r\,g_\vartheta\big]\Big\|^2\Bigg] \le \mathbb{E}_{\beta^*}\Big[\big\|\hat\rho\,r\,g_\vartheta - \mathbb{E}_{\beta^*}[\hat\rho\,r\,g_\vartheta]\big\|^2\Big] \overset{(1)}{=} \mathbb{E}_{\beta^*}\big[\|\hat\rho\,r\,g_\vartheta\|^2\big] - \big\|\mathbb{E}_{\beta^*}[\hat\rho\,r\,g_\vartheta]\big\|^2 \le \mathbb{E}_{\beta^*}\big[\|\hat\rho\,r\,g_\vartheta\|^2\big] \le G^2_{\max}\,\Phi, \qquad (27)$$
where the equality labeled (1) is due to $\mathbb{E}\big[\|Y - \mathbb{E}[Y]\|^2\big] = \mathbb{E}\big[\|Y\|^2\big] - \|\mathbb{E}[Y]\|^2$. Hence, by applying Theorem 7.3 with $M = 0$ and $\sigma_{\max} = G_{\max}\sqrt{\Phi}$, we have:
$$\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big[\|\nabla V(\pi_{\vartheta_k})\|^2\big] \le \frac{2LV_{\max}}{K(1-\phi_{\max})} + \frac{(L + 2V_{\max})\,G_{\max}\sqrt{\Phi}}{\sqrt{K}\,(1-\phi_{\max})},$$
which completes the proof.
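To connect the analysis back to training, here is a minimal PyTorch-style sketch of one policy update using the batch gradient $\nabla\hat V_{\mathrm{UIPS}}(\pi_\vartheta) = \frac{1}{B}\sum_i \hat\rho_i\,r_i\,g^i_\vartheta$ analyzed above, with $\phi^*_{x,a}$ treated as a pre-computed constant inside the step; the module interface and tensor names are assumptions for illustration, not the released implementation:

```python
import torch

def uips_policy_gradient_step(policy, optimizer, batch, beta_hat, phi_star):
    """One REINFORCE step with the UIPS-reweighted propensity score
    rho_hat = phi_star * pi(a|x) / beta_hat(a|x).

    policy:    module mapping contexts (B, d) to action logits (B, |A|)
    batch:     dict with 'x' (B, d), 'a' (B,) int64 actions, 'r' (B,) rewards
    beta_hat:  (B,) estimated logging probabilities of the logged actions
    phi_star:  (B,) pre-computed uncertainty-aware weights (held fixed during this step)
    """
    log_pi = policy(batch["x"]).log_softmax(dim=-1)
    log_pi_a = log_pi.gather(1, batch["a"].unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        rho_hat = phi_star * log_pi_a.exp() / beta_hat   # propensity score, no gradient through it
    loss = -(rho_hat * batch["r"] * log_pi_a).mean()     # gradient is -(1/B) sum rho_hat * r * grad log pi
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```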