# Exponential Smoothing for Off-Policy Learning

Imad Aouali¹² Victor-Emmanuel Brunel² David Rohde¹ Anna Korba²

¹Criteo AI Lab, Paris, France. ²CREST, ENSAE, IP Paris, France. Correspondence to: Imad Aouali. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

**Abstract.** Off-policy learning (OPL) aims at finding improved policies from logged bandit data, often by minimizing the inverse propensity scoring (IPS) estimator of the risk. In this work, we investigate a smooth regularization of IPS, for which we derive a two-sided PAC-Bayes generalization bound. The bound is tractable, scalable, interpretable, and provides learning certificates. In particular, it is also valid for standard IPS without assuming that the importance weights are bounded. We demonstrate the relevance of our approach and its favorable performance through a set of learning tasks. Since our bound holds for standard IPS, we are able to provide insight into when regularizing IPS is useful. Namely, we identify cases where regularization might not be needed. This goes against the common belief that, in practice, clipped IPS often performs better than standard IPS in OPL.

## 1. Introduction

An off-policy contextual bandit (Dudík et al., 2011) is a ubiquitous framework for optimizing decision-making using offline data. In practice, logged data reflecting the preferences of the agent in an online setting is available (Bottou et al., 2013). In each round, the agent observes a context, takes an action, and receives a reward that depends on the observed context and the taken action. Off-policy evaluation (OPE) (Dudík et al., 2011) aims at evaluating a policy offline by designing an estimator of its expected reward using logged data. The estimator is often based on the importance sampling trick and is generally referred to as inverse propensity scoring (IPS) (Horvitz & Thompson, 1952). Off-policy learning (OPL) leverages this estimator to learn an improved policy (Swaminathan & Joachims, 2015a).

The literature on OPL has so far focused on learning principles derived from generalization bounds. First, Swaminathan & Joachims (2015a) used sample variance penalization (SVP), which favors policies with high estimated reward and low empirical variance. Recently, London & Sandler (2019) derived a novel scalable learning principle that favors policies with high estimated reward and whose parameter is not far from that of the logging policy in terms of L2 distance. While derived from generalization bounds, these learning principles do not give any guarantees on the expected performance of the learned policy. Also, they require additional care to tune their hyper-parameters. Thus, motivated by the results in Sakhi et al. (2022), we derive tractable generalization bounds that we optimize directly.

The paper is organized as follows. In Section 2, we introduce the necessary background. In Section 3, we explain the shortcomings of the widely used hard clipping of IPS and present a smoother correction, called exponential smoothing. In Section 4, we focus on OPL and leverage PAC-Bayes theory to derive a two-sided generalization bound for our estimator.
In contrast with prior works (Swaminathan & Joachims, 2015a; London & Sandler, 2019; Sakhi et al., 2022), our bound is also valid for standard IPS without clipping, and this without assuming that the importance weights are bounded. We discuss our results in detail in Section 5. In particular, we give insights into the sample complexity of our learning procedure, an important question not addressed in prior OPL works. Finally, we show in Section 6 that our approach enjoys favorable performance. A detailed comparative review of the literature is provided in Appendix A. The proofs are deferred to Appendices B and C. Refer to Appendix D to reproduce our experiments.

## 2. Background

Consider an agent interacting with a contextual bandit environment over n rounds. In round $t \in [n]$, the agent observes a context $x_t \sim \nu$, where ν is a distribution whose support $\mathcal{X}$ is a compact subset of $\mathbb{R}^d$. Then the agent takes an action $a_t \in \mathcal{A} = [K]$. Finally, the agent receives a stochastic cost $c_t \in [-1, 0]$ that depends on both $x_t$ and $a_t$; that is, $c_t \sim p(\cdot|x_t, a_t)$, where $p(\cdot|x, a)$ is the cost distribution of action a in context x. We let $c(x,a) = \mathbb{E}_{c \sim p(\cdot|x,a)}[c]$ be the cost function that outputs the expected cost of action a in context x. Here we use a negative cost since it is seen as the negative of the reward: for any $(x,a) \in \mathcal{X}\times\mathcal{A}$, $c(x,a) = -r(x,a)$, where $r : \mathcal{X}\times\mathcal{A} \to [0,1]$ is the reward function that outputs the expected reward of a in context x. The agent is represented by a stochastic policy π. Given a context $x \in \mathcal{X}$, $\pi(\cdot|x)$ is a probability distribution over $\mathcal{A}$. Our goal is to find a policy $\pi \in \Pi$ among a set of policies Π that minimizes the risk, defined as
$$R(\pi) = \mathbb{E}_{(x,a,c)\sim\mu_\pi}[c] = \mathbb{E}_{x\sim\nu,\, a\sim\pi(\cdot|x)}[c(x,a)]\,, \qquad (1)$$
where $\mu_\pi$ is the joint distribution of $(x,a,c)$; $\mu_\pi(x,a,c) = \nu(x)\pi(a|x)p(c|x,a)$. We assume access to logged data $D_n = (x_i, a_i, c_i)_{i\in[n]}$, where $(x_i, a_i, c_i) \sim \mu_{\pi_0}$ are i.i.d. and $\pi_0$ is a known logging policy. Given a policy $\pi \in \Pi$, OPE consists in building an estimator of its risk $R(\pi)$ from $D_n$ such that $\hat{R}_n(\pi) \approx R(\pi)$. After that, OPL is used to find a policy $\hat{\pi}_n \in \Pi$ such that $R(\hat{\pi}_n) \approx \min_{\pi\in\Pi} R(\pi)$.

In this work, we focus on inverse propensity scoring (IPS) (Horvitz & Thompson, 1952; Dudík et al., 2012). Given a policy $\pi \in \Pi$, IPS estimates the risk $R(\pi)$ by re-weighting the samples using the ratio between π and $\pi_0$ as
$$\hat{R}^{\mathrm{IPS}}_n(\pi) = \frac{1}{n}\sum_{i=1}^n c_i\, w_\pi(a_i|x_i)\,, \qquad (2)$$
where for any $(x,a)\in\mathcal{X}\times\mathcal{A}$, $w_\pi(a|x) = \pi(a|x)/\pi_0(a|x)$ are the importance weights. The variance of $\hat{R}^{\mathrm{IPS}}_n(\pi)$ scales linearly with the importance weights (Swaminathan et al., 2017), which can be large. Thus other OPE methods that do not rely on the importance weights, or only partially use them, were proposed; they can be categorized into two families, the direct method (DM) (Jeunen & Goethals, 2021) and doubly robust (DR) estimation (Dudík et al., 2011). The reader may refer to Appendix A.1 for more details about these methods.

Let $\hat{R}_n$ be an estimator of the risk R; for instance, $\hat{R}_n$ can be $\hat{R}^{\mathrm{IPS}}_n$ in (2). The goal in OPL is to minimize the risk R. But since we cannot access it, we instead search for $\hat{\pi}_n = \operatorname{argmin}_{\pi\in\Pi} \hat{R}_n(\pi) + \mathrm{pen}(\pi)$, hoping that $R(\hat{\pi}_n) \approx \min_{\pi\in\Pi} R(\pi)$. Here $\mathrm{pen}(\cdot)$ is a penalization term obtained from generalization bounds of the following form. Let $\delta \in (0,1)$; then with probability at least $1-\delta$,
$$R(\pi) \le \hat{R}_n(\pi) + g(\delta, \Pi, \pi, \pi_0, n)\,, \quad \forall \pi \in \Pi\,, \qquad (3)$$
for some function g. Improving upon $\pi_0$, that is, achieving $R(\pi) - R(\pi_0) < 0$, is guaranteed with high probability when $\hat{R}_n(\pi) + g(\delta,\Pi,\pi,\pi_0,n) - R(\pi_0) < 0$.
Thus we minimize $\hat{R}_n(\pi) + g(\delta, \Pi, \pi, \pi_0, n) - R(\pi_0)$ in the hope that the minimum is smaller than 0. Since $R(\pi_0)$ is fixed, the final objective reads
$$\hat{\pi}_n = \operatorname*{argmin}_{\pi\in\Pi}\ \hat{R}_n(\pi) + g(\delta, \Pi, \pi, \pi_0, n)\,. \qquad (4)$$
This motivated the concept of counterfactual risk minimization (CRM) in Swaminathan & Joachims (2015a); London & Sandler (2019); Sakhi et al. (2022). However, all these works only derived one-sided inequalities similar to (3). In contrast, we derive two-sided inequalities of the form
$$|R(\pi) - \hat{R}_n(\pi)| \le g(\delta, \Pi, \pi, \pi_0, n)\,, \quad \forall \pi\in\Pi\,. \qquad (5)$$
This is because (5) can attest to the quality of the estimator $\hat{R}_n$; a one-sided inequality fails at this. To see why, consider the poor estimator $\hat{R}^{\mathrm{POOR}}_n(\pi) = 0$ for any $\pi\in\Pi$, and note that we have with probability 1 that $R(\pi) \le \hat{R}^{\mathrm{POOR}}_n(\pi)$ with $g(\delta, \Pi, \pi, \pi_0, n) = 0$. This holds since by definition $R(\pi) \in [-1, 0]$ while $\hat{R}^{\mathrm{POOR}}_n(\pi) = 0$ for any $\pi\in\Pi$. While this one-sided inequality holds for $\hat{R}^{\mathrm{POOR}}_n$, this estimator is not informative at all about R, so minimizing it is not relevant. This is why we need to control the quality of the upper bound on R, and this is achieved by two-sided inequalities similar to (5). Also, (5) leads to oracle inequalities of the form $R(\hat{\pi}_n) \le R(\pi_*) + 2g(\delta, \Pi, \pi_*, \pi_0, n)$, where $\hat{\pi}_n$ is the learned policy in (4) and $\pi_* = \operatorname{argmin}_{\pi\in\Pi} R(\pi)$ is the optimal policy. This allows us to quantify the number of samples n needed so that the risk of the learned policy $R(\hat{\pi}_n)$ is close to the optimal one $R(\pi_*)$.

Moreover, in many prior works, the objective in (4) is not optimized directly. Instead, the function g is used to motivate a heuristic-based learning principle. Here we review these principles briefly; the reader may refer to Appendix A.2 for more detail. First, Swaminathan & Joachims (2015a) minimized the estimated risk while penalizing its empirical variance. This was inspired by a function g that contains a variance term, discarding more complicated terms like the covering number of the space of policies Π. Similarly, London & Sandler (2019) parameterize policies by a mean parameter and propose to penalize the estimated risk by the L2 distance between the means of the logging and learning policies, discarding all the other terms from their bound. In contrast, we follow the theoretically grounded approach that consists in directly optimizing the objective in (4) as it is. It may also be relevant to note that some works (Metelli et al., 2021) derived evaluation bounds and used them in OPL. In evaluation, we fix a policy $\pi\in\Pi$ and show that $\mathbb{P}\left(|R(\pi) - \hat{R}_n(\pi)| \le f(\delta, \pi, \pi_0, n)\right) \ge 1-\delta$, for some function f that does not necessarily depend on the space of policies Π. In contrast, the generalization bound in (5) holds simultaneously for any policy $\pi\in\Pi$, and it is the one that should be used in OPL. That said, in this work, we derive a two-sided generalization bound that holds simultaneously for any policy $\pi\in\Pi$ as in (5).

## 3. Exponential Smoothing

The estimator $\hat{R}^{\mathrm{IPS}}_n(\pi)$ in (2) is unbiased when $\pi_0(a|x) = 0$ implies $\pi(a|x) = 0$ for any $(x,a)\in\mathcal{X}\times\mathcal{A}$. But its variance can be large, as it grows linearly with the importance weights $w_\pi(a|x)$. Thus the weights are often clipped (Swaminathan & Joachims, 2015a), leading to the following estimators:
$$\text{IPS-min:}\quad \bar{R}^{M}_n(\pi) = \frac{1}{n}\sum_{i=1}^n c_i \min\!\left(w_\pi(a_i|x_i),\, M\right), \qquad \text{IPS-max:}\quad \hat{R}^{\tau}_n(\pi) = \frac{1}{n}\sum_{i=1}^n c_i\, \frac{\pi(a_i|x_i)}{\max(\pi_0(a_i|x_i), \tau)}\,. \qquad (6)$$
Here IPS-min clips the weights, while IPS-max only clips $\pi_0$ in the denominator since π is always smaller than 1. For instance, $M \in \mathbb{R}_+$ in $\bar{R}^{M}_n(\pi)$ trades off the bias and variance of the estimator.
When M is large, the bias of $\bar{R}^{M}_n(\pi)$ is small but its variance may be large. On the other hand, the variance goes to 0 when $M \to 0$, since in that case $\bar{R}^{M}_n(\pi) \to 0$ for any $\pi\in\Pi$. Similarly, $\tau \in [0,1]$ trades off the bias and variance of $\hat{R}^{\tau}_n(\pi)$ and can be seen as $\tau \approx 1/M$.

This hard clipping has some limitations. First, $\min(\cdot, M)$ leads to non-differentiable objectives that may require additional care in optimization (Papini et al., 2019). Also, $\min(\cdot, M)$ is constant on $[M, \infty)$, leading to objectives with zero gradients for any policy π that satisfies $w_\pi(a_i|x_i) > M$ for all $i\in[n]$. More importantly, hard clipping is sensitive to the choice of the clipping threshold M. In practice, tuning M is challenging and may cause the learned policy to match the logging policy, leading to minimal improvements. To see this, consider the following illustrative example. For simplicity, suppose that the problem is non-contextual, in which case the reward function r only depends on the actions $a\in\mathcal{A}$. It follows that policies do not depend on $x\in\mathcal{X}$; they are now probability distributions $\pi(\cdot)$ over $\mathcal{A}$. Also, assume that $\mathcal{A} = [100]$ and that the reward received after taking action $a\in[100]$ is binary. That is, $r \sim \mathrm{Bern}(r(a))$, where $r(a) = 0.1 - 10^{-3}(a-1)$ is the expected reward of action a, and for any $p\in[0,1]$, $\mathrm{Bern}(p)$ is the Bernoulli distribution with parameter p. This means that the best action is 1 and the worst is 100. Finally, the logging policy $\pi_0(\cdot)$ is ϵ-greedy centered at action 50. That is, $\pi_0(50) = 1-\epsilon$, and for any $a\neq 50$, $\pi_0(a) = \epsilon/99$, with $\epsilon = 0.05$. Now consider 100 deterministic policies $\pi_a(\cdot)$ for $a\in[100]$, where $\pi_a(\cdot)$ is the Dirac distribution centered at a. In Figure 1, we plot the estimated reward of the policies $\pi_a$ using either IPS in (2) or IPS-min in (6). We generate $n = 50\mathrm{k}$ samples and set $M = 100 = O(\sqrt{n})$ as suggested by Ionides (2008). With this choice of M, IPS-min underestimates the reward of all policies $\pi_a$ with $a\neq 50$, since their weights $\pi_a/\pi_0$ are either 0 or $99/\epsilon > M$. The estimated reward of IPS-min is maximized only at $\pi_{50} \approx \pi_0$. Thus, if we optimize $\bar{R}^{M}_n(\cdot)$ over Dirac policies, we will converge to the logging policy despite its bad performance. Although the other variant of hard clipping, IPS-max in (6), is differentiable, it is still sensitive to τ and may induce high bias similar to Figure 1. This is due to a loss of information about the preferences of the logging policy. Indeed, for two actions a and a′ such that $\pi_0(a|x_i) \le \pi_0(a'|x_i) < \tau$ for an observed context $x_i$, the propensity scores $\pi_0(a|x_i)$ and $\pi_0(a'|x_i)$ will both be clipped to the same value τ. Thus the information that, for context $x_i$, action a′ is preferred by the logging policy over action a is lost.

*Figure 1. Effect of hard clipping on the estimation quality. The x-axis corresponds to actions $a\in[100]$. The y-axis is the estimated reward of each of the 100 policies $\pi_a$ using either IPS or IPS-min (M = 100). The cyan line is the true reward of each policy $\pi_a$.*

To mitigate this, we propose the following exponential smoothing correction for IPS. Our estimators are defined as
$$\text{IPS-}\alpha:\ \hat{R}^{\alpha}_n(\pi) = \frac{1}{n}\sum_{i=1}^n c_i\, \hat{w}^{\alpha}_\pi(a_i|x_i)\,,\ \alpha\in[0,1]\,, \qquad \text{IPS-}\beta:\ \bar{R}^{\beta}_n(\pi) = \frac{1}{n}\sum_{i=1}^n c_i\, \bar{w}^{\beta}_\pi(a_i|x_i)\,,\ \beta\in[0,1]\,, \qquad (7)$$
where $\hat{w}^{\alpha}_\pi(a|x) = \pi(a|x)/\pi_0(a|x)^{\alpha}$ and $\bar{w}^{\beta}_\pi(a|x) = \pi(a|x)^{\beta}/\pi_0(a|x)^{\beta}$. Here standard IPS is recovered for α = 1 and β = 1. These estimators are differentiable in π and do not suffer from stationary points in optimization, as they are not constant in π when β ≠ 0 and α ≠ 0.
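To make the comparison concrete, the following minimal sketch (our own illustration of the toy problem above, not the authors' code) simulates the non-contextual example with K = 100 actions and compares standard IPS, hard-clipped IPS-min with M = 100, and the smoothed IPS-α for the Dirac policies $\pi_a$; the value α = 0.95 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, eps, M, alpha = 100, 50_000, 0.05, 100, 0.95

# Logging policy: epsilon-greedy centered at action 50 (index 49).
pi0 = np.full(K, eps / (K - 1))
pi0[49] = 1.0 - eps
r_mean = 0.1 - 1e-3 * np.arange(K)                 # expected reward of each action

# Log n rounds under pi0 (non-contextual problem, binary rewards).
actions = rng.choice(K, size=n, p=pi0)
rewards = rng.binomial(1, r_mean[actions])

def estimate(a):
    """Estimated reward of the Dirac policy on action `a` under the three estimators."""
    w = (actions == a) / pi0[actions]               # importance weights pi_a / pi_0
    ips = np.mean(rewards * w)
    ips_min = np.mean(rewards * np.minimum(w, M))   # hard clipping of the weights
    ips_alpha = np.mean(rewards * (actions == a) / pi0[actions] ** alpha)
    return ips, ips_min, ips_alpha

for a in (0, 49, 99):                               # best action, logging mode, worst action
    print(f"action {a + 1}: true={r_mean[a]:.3f}", estimate(a))
```

Running the sketch shows how hard clipping flattens the estimates of the policies whose weights exceed M, while exponential smoothing only shrinks them, so the effect of both corrections can be inspected directly.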
Also, in contrast with IPS-max in (6), $\hat{R}^{\alpha}_n(\pi)$ preserves the preferences of the logging policy. Precisely, for two actions a and a′ such that $\pi_0(a|x_i) < \pi_0(a'|x_i)$ for an observed context $x_i$, we still have $\pi_0(a|x_i)^{\alpha} < \pi_0(a'|x_i)^{\alpha}$, and the information that action a′ is preferred by the logging policy over action a is preserved. While a similar correction to IPS-β was proposed in Korba & Portier (2022), its use in off-policy contextual bandits is novel. Also, Su et al. (2020); Metelli et al. (2021) regularized the importance weights w as $\frac{\lambda_1 w}{\lambda_1 + w^2}$ with $\lambda_1 > 0$, and $\frac{w}{1-\lambda_2+\lambda_2 w}$ with $\lambda_2\in[0,1]$, respectively. Thus, the expression of both corrections is very different from ours. More importantly, these corrections entail different properties than ours. Roughly speaking, our correction allows us to simultaneously (1) control a tuning parameter α that lies in the bounded domain [0, 1], (2) avoid constraining the resulting importance weights to be bounded, and (3) obtain PAC-Bayes generalization guarantees, since the correction $\pi/\pi_0^{\alpha}$ is linear in π, a technical requirement of our analysis. In contrast, Metelli et al. (2021); Su et al. (2020) do not provide generalization guarantees; they focus on OPE and only propose heuristics for OPL. Those heuristics are not based on theory, in contrast with ours, which is directly derived from our generalization bound. Also, our approach has favorable empirical performance (Appendix D.6).

Although Korba & Portier (2022, Lemma 1) show that smoothing the importance weights similarly to IPS-β in (7) reduces the variance, it might still be unclear how α and β trade off the bias and variance of our estimators in off-policy contextual bandits. To see this, let $\alpha\in[0,1]$; then we have
$$|\mathcal{B}(\hat{R}^{\alpha}_n(\pi))| \le \mathbb{E}_{x\sim\nu,\, a\sim\pi(\cdot|x)}\!\left[1 - \pi_0(a|x)^{1-\alpha}\right], \qquad \mathbb{V}[\hat{R}^{\alpha}_n(\pi)] \le \frac{1}{n}\,\mathbb{E}_{x\sim\nu,\, a\sim\pi(\cdot|x)}\!\left[\frac{\pi(a|x)}{\pi_0(a|x)^{2\alpha-1}}\right], \qquad (8)$$
where $\mathcal{B}(\hat{R}^{\alpha}_n(\pi)) = \mathbb{E}[\hat{R}^{\alpha}_n(\pi)] - R(\pi)$ and $\mathbb{V}[\hat{R}^{\alpha}_n(\pi)] = \mathbb{E}[(\hat{R}^{\alpha}_n(\pi) - \mathbb{E}[\hat{R}^{\alpha}_n(\pi)])^2]$ are respectively the bias and the variance of $\hat{R}^{\alpha}_n(\pi)$. The bound on the bias in (8) is minimized at α = 1 (standard IPS), in which case it is equal to 0 (standard IPS is unbiased). In contrast, the bound on the variance is minimized at α = 0. Thus if the variance is small or n is large enough such that $\mathbb{E}[\pi(a|x)/\pi_0(a|x)^{2\alpha-1}]/n \approx 0$, then we set $\alpha \approx 1$. Otherwise, we set $\alpha \approx 0$. This shows that α trades off the bias and variance of $\hat{R}^{\alpha}_n$. More details and a similar discussion for $\bar{R}^{\beta}_n(\pi)$ are deferred to Appendix B.

## 4. PAC-Bayes Analysis for Off-Policy Learning

We now derive generalization bounds for our estimator. We opt for the PAC-Bayes framework for the following reasons. First, it is known to provide some of the tightest generalization bounds in challenging scenarios (Farid & Majumdar, 2021), for aggregated and randomized predictors (Alquier, 2021). Second, the bounds have a Kullback-Leibler (KL) divergence (Van Erven & Harremos, 2014) term $D_{\mathrm{KL}}(Q\|P)$ that depends on a fixed prior P and a learned posterior Q (see Section 4.1 for a brief introduction). This quantity can be seen as a complexity measure, similar to the covering number (Maurer & Pontil, 2009). The difference is that such complexity measures are uniform over the space of policies, while the KL term in PAC-Bayes depends on the prior P and the posterior Q. This allows getting sharper bounds when the former is well chosen. Third, the PAC-Bayes perspective fits OPL very well: a policy π can be written as an aggregation of predictors under some distribution Q.
Thus the prior P can be associated with the logging policy $\pi_0$ that we want to improve upon, while the posterior Q is related to the learning policy π. Fourth, London & Sandler (2019) showed that PAC-Bayes can lead to tractable and scalable objectives, an important consideration in practice.

### 4.1. Elements of PAC-Bayes

Let $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$ be an instance space; e.g., $\mathcal{X}$ and $\mathcal{Y}$ are the input and output spaces in supervised learning. Let $\mathcal{H} = \{h : \mathcal{X}\to\mathcal{Y}\}$ denote a hypothesis space of mappings from $\mathcal{X}$ to $\mathcal{Y}$ (predictors). Also, let $L : \mathcal{H}\times\mathcal{Z}\to\mathbb{R}$ be a loss function and assume access to data $D_n = (z_i)_{i\in[n]}$ drawn from an unknown distribution $\mathcal{D}$. Let $R(h) = \mathbb{E}_{z\sim\mathcal{D}}[L(h,z)]$ be the risk of $h\in\mathcal{H}$, while $\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^n L(h, z_i)$ is its empirical counterpart. The main focus in PAC-Bayes is to study the generalization capabilities of a random hypothesis drawn from a distribution Q on $\mathcal{H}$ by controlling the gap between the expected risk under Q, $\mathbb{E}_{h\sim Q}[R(h)]$, and the expected empirical risk under Q, $\mathbb{E}_{h\sim Q}[\hat{R}_n(h)]$. For example, assume that $L(h,z)\in[0,1]$ for any $(h,z)\in\mathcal{H}\times\mathcal{Z}$, let P be a fixed prior distribution on $\mathcal{H}$, and let $\delta\in(0,1)$. Then with probability at least $1-\delta$ over $D_n\sim\mathcal{D}^n$, the following inequality holds simultaneously for any posterior distribution Q on $\mathcal{H}$:
$$\mathbb{E}_{h\sim Q}[R(h)] \le \mathbb{E}_{h\sim Q}[\hat{R}_n(h)] + \sqrt{\frac{D_{\mathrm{KL}}(Q\|P) + \log\frac{2\sqrt{n}}{\delta}}{2n}}\,.$$
This was originally proposed by McAllester (1998), and the reader may refer to Alquier (2021); Guedj (2019) for more elaborate introductions to PAC-Bayes theory.

### 4.2. PAC-Bayes for Off-Policy Contextual Bandits

Let $\mathcal{H} = \{h : \mathcal{X}\to\mathcal{A}\}$ be a hypothesis space of mappings from $\mathcal{X}$ (contexts) to $\mathcal{A}$ (actions). Given a policy π and a context $x\in\mathcal{X}$, the action distribution $\pi(\cdot|x)$ is induced by a distribution Q over $\mathcal{H}$ (London & Sandler, 2019) as
$$\pi(a|x) = \pi_Q(a|x) = \mathbb{E}_{h\sim Q}\!\left[\mathbb{I}\{h(x)=a\}\right]. \qquad (9)$$
This is not an assumption, since any policy π has this form when $\mathcal{H}$ is rich enough (Sakhi et al., 2022, Theorem 2). From (9), we observe that policies can be seen as an aggregation $\mathbb{E}_{h\sim Q}[\cdot]$ (under some distribution Q on the predefined hypothesis space $\mathcal{H}$) of deterministic decision rules $\mathbb{I}\{h(x)=a\}$. This allows formulating OPL as a PAC-Bayes problem. Before showing how this is achieved, we provide two practical families of policies of this form.

**Example 1 (softmax and mixed-logit policies).** We define the space $\mathcal{H} = \{h_{\theta,\gamma}\,;\,\theta\in\mathbb{R}^{d\times K},\,\gamma\in\mathbb{R}^K\}$ of mappings $h_{\theta,\gamma}(x) = \operatorname{argmax}_{a\in\mathcal{A}}\ \phi(x)^\top\theta_a + \gamma_a$. Here $\phi(x)$ outputs a d-dimensional representation of x, and $\gamma_a$ is a standard Gumbel perturbation, $\gamma_a\sim\mathcal{G}(0,1)$ for any $a\in\mathcal{A}$. Then
$$\pi^{\mathrm{SOF}}_{\theta}(a|x) = \frac{\exp(\phi(x)^\top\theta_a)}{\sum_{a'\in\mathcal{A}}\exp(\phi(x)^\top\theta_{a'})} \overset{(i)}{=} \mathbb{E}_{\gamma\sim\mathcal{G}(0,1)^K}\!\left[\mathbb{I}\{h_{\theta,\gamma}(x)=a\}\right], \qquad (10)$$
where (i) follows from the Gumbel-max trick (GMT) (Luce, 2012; Maddison et al., 2014). Thus a softmax policy $\pi^{\mathrm{SOF}}_{\theta}$ can be written as in (9). Now we also consider random parameters $\theta\sim\mathcal{N}(\mu,\sigma^2 I_{dK})$ with $\mu\in\mathbb{R}^{d\times K}$ and σ > 0. Then, letting $Q = \mathcal{N}(\mu,\sigma^2 I_{dK})\otimes\mathcal{G}(0,1)^K$, it follows that $\pi_Q = \pi^{\mathrm{MIXL}}_{\mu,\sigma}$ is a mixed-logit policy and it reads
$$\pi^{\mathrm{MIXL}}_{\mu,\sigma}(a|x) = \mathbb{E}_{\theta\sim\mathcal{N}(\mu,\sigma^2 I_{dK})}\!\left[\frac{\exp(\phi(x)^\top\theta_a)}{\sum_{a'\in\mathcal{A}}\exp(\phi(x)^\top\theta_{a'})}\right] = \mathbb{E}_{\theta\sim\mathcal{N}(\mu,\sigma^2 I_{dK}),\,\gamma\sim\mathcal{G}(0,1)^K}\!\left[\mathbb{I}\{h_{\theta,\gamma}(x)=a\}\right]. \qquad (11)$$

**Example 2 (Gaussian policies).** Sakhi et al. (2022) removed the Gumbel noise γ in (11) and consequently defined the hypothesis space as $\mathcal{H} = \{h_\theta\,;\,\theta\in\mathbb{R}^{d\times K}\}$ of mappings $h_\theta(x) = \operatorname{argmax}_{a\in\mathcal{A}}\phi(x)^\top\theta_a$ for any $x\in\mathcal{X}$. Then, letting $Q = \mathcal{N}(\mu,\sigma^2 I_{dK})$, it follows that $\pi_Q = \pi^{\mathrm{GAUS}}_{\mu,\sigma}$ reads
$$\pi^{\mathrm{GAUS}}_{\mu,\sigma}(a|x) = \mathbb{E}_{\theta\sim\mathcal{N}(\mu,\sigma^2 I_{dK})}\!\left[\mathbb{I}\{h_\theta(x)=a\}\right]. \qquad (12)$$
To see why removing the Gumbel noise can be beneficial, the reader may refer to Appendix D.2.
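The representation in (9) can be checked numerically. The sketch below (our own illustration with an assumed feature map and arbitrary parameters, not the paper's code) approximates the softmax policy in (10) via the Gumbel-max trick, and the mixed-logit and Gaussian policies in (11)-(12) by Monte Carlo aggregation of the argmax predictors of Examples 1 and 2.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, S = 5, 4, 200_000                  # feature dim, actions, Monte Carlo samples
phi_x = rng.normal(size=d)               # phi(x) for one context (assumed features)
mu = rng.normal(size=(d, K))             # mean parameter
sigma = 1.0

# Softmax policy (10): closed form vs. its Gumbel-max representation.
scores = phi_x @ mu
softmax = np.exp(scores) / np.exp(scores).sum()
gumbel = rng.gumbel(size=(S, K))
softmax_mc = np.bincount(np.argmax(scores + gumbel, axis=1), minlength=K) / S

# Mixed-logit (11) and Gaussian (12) policies: aggregate argmax predictors under Q.
theta = mu + sigma * rng.normal(size=(S, d, K))
scores_theta = np.einsum("d,sdk->sk", phi_x, theta)
mixed_logit = np.bincount(np.argmax(scores_theta + gumbel, axis=1), minlength=K) / S
gaussian = np.bincount(np.argmax(scores_theta, axis=1), minlength=K) / S

print("softmax     ", np.round(softmax, 3), "MC:", np.round(softmax_mc, 3))
print("mixed-logit ", np.round(mixed_logit, 3), "gaussian:", np.round(gaussian, 3))
```

The closed-form softmax probabilities and their Monte Carlo counterparts agree up to sampling error, which is exactly the aggregation-of-predictors view in (9).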
After motivating the definition of policies in (9), we are now in a position to relate our estimators to the general PAC-Bayes framework of Section 4.1. One technical requirement of our proof is that the estimator should be linear in π. Thus we focus on $\hat{R}^{\alpha}_n(\cdot)$, since $\bar{R}^{\beta}_n(\pi)$ is non-linear in π. For $h\in\mathcal{H}$, $x\in\mathcal{X}$, $a\in\mathcal{A}$ and $c\in[-1,0]$, we define the loss $L_\alpha$ as
$$L_\alpha(h, x, a, c) = \frac{\mathbb{I}\{h(x)=a\}}{\pi_0(a|x)^{\alpha}}\, c\,. \qquad (13)$$
Using the definition in (9) and the linearity of the expectation, $\hat{R}^{\alpha}_n(\cdot)$ in (7) can be written as
$$\hat{R}^{\alpha}_n(\pi_Q) = \mathbb{E}_{h\sim Q}\!\left[\frac{1}{n}\sum_{i=1}^n L_\alpha(h, x_i, a_i, c_i)\right].$$
Moreover, the expectation of $\hat{R}^{\alpha}_n(\pi_Q)$ reads
$$R^{\alpha}(\pi_Q) = \mathbb{E}_{h\sim Q}\,\mathbb{E}_{(x,a,c)\sim\mu_{\pi_0}}\!\left[L_\alpha(h, x, a, c)\right].$$
Finally, the main quantity of interest, the risk $R(\pi_Q)$, can be expressed in terms of the loss with α = 1, $L_1$, as
$$R(\pi_Q) = \mathbb{E}_{h\sim Q}\,\mathbb{E}_{(x,a,c)\sim\mu_{\pi_0}}\!\left[L_1(h, x, a, c)\right].$$
Since $\hat{R}^{\alpha}_n(\pi_Q)$ is an unbiased estimator of $R^{\alpha}(\pi_Q)$, PAC-Bayes can be used to bound $R^{\alpha}(\pi_Q) - \hat{R}^{\alpha}_n(\pi_Q)$. This in turn allows bounding our quantity of interest, $R(\pi_Q) - \hat{R}^{\alpha}_n(\pi_Q)$.

### 4.3. Main Result

To ease the exposition, we assume that the costs are deterministic; then, in logged data $D_n$, $c_i = c(x_i, a_i)$ for any $i\in[n]$. Note that the same result holds for stochastic costs. We discuss our result and sketch its proof in Section 5. The complete proof can be found in Appendix C.1.

**Theorem 4.1.** Let λ > 0, n ≥ 1, δ ∈ (0,1), α ∈ [0,1], and let P be a fixed prior on $\mathcal{H}$. Then with probability at least $1-\delta$ over draws $D_n\sim\mu^n_{\pi_0}$, the following holds simultaneously for any posterior Q on $\mathcal{H}$:
$$|R(\pi_Q) - \hat{R}^{\alpha}_n(\pi_Q)| \le \sqrt{\frac{\mathrm{KL}_1(\pi_Q)}{2n}} + B^{\alpha}_n(\pi_Q) + \frac{\mathrm{KL}_2(\pi_Q)}{\lambda n} + \frac{\lambda}{2}\, V^{\alpha}_n(\pi_Q)\,,$$
where
$$\mathrm{KL}_1(\pi_Q) = D_{\mathrm{KL}}(Q\|P) + \ln\frac{4\sqrt{n}}{\delta}\,, \qquad \mathrm{KL}_2(\pi_Q) = D_{\mathrm{KL}}(Q\|P) + \ln\frac{4}{\delta}\,,$$
$$B^{\alpha}_n(\pi_Q) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{a\sim\pi_Q(\cdot|x_i)}\!\left[1 - \pi_0(a|x_i)^{1-\alpha}\right], \qquad V^{\alpha}_n(\pi_Q) = \frac{1}{n}\sum_{i=1}^n \left(\mathbb{E}_{a\sim\pi_0(\cdot|x_i)}\!\left[\frac{\pi_Q(a|x_i)}{\pi_0(a|x_i)^{2\alpha}}\right] + \frac{\pi_Q(a_i|x_i)\, c_i^2}{\pi_0(a_i|x_i)^{2\alpha}}\right).$$

We start by clarifying that the prior P can be any fixed distribution on $\mathcal{H}$. If we have access to $P_0$ on $\mathcal{H}$ such that $\pi_0 = \pi_{P_0}$, then it is natural to set $P = P_0$. But this is just a choice, and one may use priors that do not depend on $\pi_0$. Now we explain the main terms in our bound. First, the terms $\mathrm{KL}_1(\pi_Q)$ and $\mathrm{KL}_2(\pi_Q)$ contain the divergence $D_{\mathrm{KL}}(Q\|P)$, which penalizes posteriors Q that differ a lot from the prior P. Moreover, $B^{\alpha}_n(\pi_Q)$ is the bias conditioned on the contexts $(x_i)_{i\in[n]}$; $B^{\alpha}_n(\pi_Q) = 0$ when α = 1 and $B^{\alpha}_n(\pi_Q) > 0$ otherwise. Also, the first term in $V^{\alpha}_n(\pi_Q)$ resembles the theoretical second moment of the regularized importance weights $\pi/\pi_0^{\alpha}$ (without the cost) when they are seen as random variables. Similarly, the second term in $V^{\alpha}_n(\pi_Q)$ resembles the empirical second moment of $(\pi/\pi_0^{\alpha})\,c$ (with the cost). Finally, if $V^{\alpha}_n(\pi_Q)$ is bounded, then we can set $\lambda = 1/\sqrt{n}$, in which case our bound scales as $O(1/\sqrt{n} + B^{\alpha}_n(\pi_Q))$. In practice, we set $\alpha\approx 1$, leading to $B^{\alpha}_n(\pi_Q)\approx 0$, and the bound scales as $O(1/\sqrt{n})$. This bound motivates the idea that we only need to control the second moment $V^{\alpha}_n(\pi_Q)$ to get generalization guarantees for $\hat{R}^{\alpha}_n(\cdot)$. In particular, one of the main strengths of our result is that it holds for standard IPS with α = 1 under the assumption that $V^{1}_n(\pi_Q)$ is bounded. This assumption is less restrictive than assuming that the importance weight, seen as a random variable $\pi_Q(a|x)/\pi_0(a|x)$, is bounded, which is required by traditional concentration bounds. In contrast, $V^{\alpha}_n(\pi_Q)$ only involves expectations of the random variables $\pi_Q(a|x_i)/\pi_0(a|x_i)^{2\alpha}$ and ratios of $\pi_0$ evaluated at the observed contexts and actions $(x_i, a_i)_{i\in[n]}$, which have non-zero probability under $\pi_0$ by definition. Our result holds for fixed λ > 0 and α ∈ [0,1]. In Appendix C.2, we extend it to any potentially data-dependent λ ∈ (0,1) and α ∈ (0,1].
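Every quantity in Theorem 4.1 is an average over the logged sample, so the bound can be evaluated directly. The sketch below is a minimal illustration under assumed inputs, not the authors' implementation: the policy-probability matrices `pi_q` and `pi_0` and the closed-form KL value (e.g., $\|\mu-\mu_0\|^2/2$ for unit-variance Gaussians) are supplied by the caller.

```python
import numpy as np

def bound_terms(pi_q, pi_0, actions, costs, kl, alpha, lam, delta):
    """Evaluate the terms of Theorem 4.1 for fixed lambda and alpha.

    pi_q, pi_0 : (n, K) arrays of pi_Q(a|x_i) and pi_0(a|x_i)
    actions    : (n,) logged actions; costs: (n,) logged costs in [-1, 0]
    kl         : D_KL(Q || P), assumed available in closed form
    """
    n = len(actions)
    idx = np.arange(n)
    kl1 = kl + np.log(4 * np.sqrt(n) / delta)
    kl2 = kl + np.log(4 / delta)
    # Bias term: (1/n) sum_i E_{a ~ pi_Q}[1 - pi_0(a|x_i)^(1 - alpha)].
    bias = np.mean(np.sum(pi_q * (1 - pi_0 ** (1 - alpha)), axis=1))
    # Second-moment term: conditional expectation under pi_0 plus the empirical part.
    v1 = np.sum(pi_0 * pi_q / pi_0 ** (2 * alpha), axis=1)
    v2 = pi_q[idx, actions] * costs ** 2 / pi_0[idx, actions] ** (2 * alpha)
    v = np.mean(v1 + v2)
    gap = np.sqrt(kl1 / (2 * n)) + bias + kl2 / (lam * n) + lam * v / 2
    return bias, v, gap
```

For Gaussian or mixed-logit policies, the probabilities $\pi_Q(a|x_i)$ themselves would have to be approximated by Monte Carlo, as in the previous sketch.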
The assumption that $c\in[-1,0]$ can be relaxed to $c\in[-B,0]$ up to additional factors $B^2$ and $B$ in $V^{\alpha}_n(\pi_Q)$ and $\mathrm{KL}_1(\pi_Q)$, respectively. Finally, our bound is suitable for stochastic first-order optimization (Robbins & Monro, 1951) since data-dependent quantities are not inside a square root. This is important for scalability.

### 4.4. Adaptive and Data-Driven Tuning of α

Theorem 4.1 assumes that α is fixed (although we extend it to data-dependent α in Appendix C.2). However, providing a procedure to tune α in an adaptive and data-dependent fashion is important in practice. Thus we propose to set
$$\alpha_* = \operatorname*{argmin}_{\alpha\in[0,1]}\ B^{\alpha}_n(\pi_Q) + \sqrt{\frac{2\,\mathrm{KL}_2(\pi_Q)\, V^{\alpha}_n(\pi_Q)}{n}}\,, \qquad (14)$$
where all the terms are defined in Theorem 4.1. Roughly speaking, $\alpha_*$ establishes a bias-variance trade-off; it minimizes the sum of the bias term $B^{\alpha}_n(\pi_Q)$ and the square root of the second-moment term $V^{\alpha}_n(\pi_Q)$, weighted by $\sqrt{2\,\mathrm{KL}_2(\pi_Q)/n}$. Here (14) is obtained by minimizing the bound in Theorem 4.1 with respect to both α and λ as follows. First, we minimize the bound in Theorem 4.1 with respect to λ; the minimizer is $\lambda_* = \sqrt{2\,\mathrm{KL}_2(\pi_Q)/(n\, V^{\alpha}_n(\pi_Q))}$. Then, the bound in Theorem 4.1 evaluated at $\lambda = \lambda_*$ becomes
$$\sqrt{\frac{\mathrm{KL}_1(\pi_Q)}{2n}} + B^{\alpha}_n(\pi_Q) + \sqrt{\frac{2\,\mathrm{KL}_2(\pi_Q)\, V^{\alpha}_n(\pi_Q)}{n}}\,. \qquad (15)$$
Finally, $\alpha_*$ is defined as the minimizer of (15) with respect to $\alpha\in[0,1]$; the term $\sqrt{\mathrm{KL}_1(\pi_Q)/(2n)}$ does not appear in (14) as it does not depend on α. Note that $\alpha_*$ depends on both the logged data $D_n$ and the learning policy $\pi_Q$. Thus it is adaptive; its value changes at each iteration during optimization.

## 5. Discussion

We start by interpreting our results and comparing them to related work. Then, we present the technical challenges in Section 5.2. After that, we sketch our proof in Section 5.3.

### 5.1. Interpretation and Comparison to Related Work

Theorem 4.1 gives insight into the number of samples needed so that the performance of $\hat{\pi}_n$ is close to that of the optimal policy $\pi_*$. To simplify the problem, we consider the Gaussian policies in (12) and assume that there exists $Q_* = \mathcal{N}(\mu_*, I_{dK})$ with $\mu_*\in\mathbb{R}^{d\times K}$ such that the optimal policy is $\pi_* = \pi_{Q_*}$. Also, we let the prior $P = \mathcal{N}(\mu_0, I_{dK})$ and assume that $\pi_0$ is uniform. This is possible since, as we said before, the prior P does not have to depend on the logging policy $\pi_0$. Then we have that $D_{\mathrm{KL}}(Q_*\|P) = \|\mu_*-\mu_0\|^2/2$, $B^{\alpha}_n(\pi_{Q_*}) = 1 - K^{\alpha-1}$ and $V^{\alpha}_n(\pi_{Q_*}) \le 2K^{2\alpha}$. The last inequality is not tight, but it yields an easy-to-interpret term that does not depend on n. Now let $\epsilon > 2(1 - K^{\alpha-1})$ for $\alpha\in[1 - \log 2/\log K,\ 1]$. This condition on α ensures that $\epsilon\in[0,1]$, and it is mild as α is often close to 1. Then, it holds with high probability that
$$n \;\gtrsim\; \frac{\|\mu_*-\mu_0\|^2 + K^{2\alpha}}{\left(\epsilon - 2(1 - K^{\alpha-1})\right)^2} \;\Longrightarrow\; R(\hat{\pi}_n) \le R(\pi_{Q_*}) + \epsilon\,,$$
where $\gtrsim$ hides constant and logarithmic terms. This gives an intuition of the sample complexity of our procedure. In particular, fewer samples are needed in four cases. The first is when ϵ is large, which means that we can afford to learn a policy whose performance is far from the optimal one. The second is when the prior P is close to $Q_*$, that is, when $\|\mu_*-\mu_0\|$ is small. This highlights that the choice of the prior P is important. The third is when the second-moment term $K^{2\alpha}$ is small. The fourth is when the bias $B^{\alpha}_n(\pi_{Q_*})$ is small. In particular, when α = 1, the bias is 0. In contrast, the second-moment term is minimized at α = 0. This is where the choice of α matters. The proofs of these claims and more details can be found in Appendix C.4.
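As a rough numerical illustration of this condition (with assumed values for $\|\mu_*-\mu_0\|^2$ and ϵ, and ignoring the constants and logarithmic factors hidden by $\gtrsim$), the following sketch evaluates the required n for a few values of α.

```python
import numpy as np

K = 10          # number of actions
dist2 = 25.0    # assumed value of ||mu_* - mu_0||^2
eps = 0.3       # target sub-optimality

for alpha in (0.90, 0.95, 1.0):
    bias = 2 * (1 - K ** (alpha - 1))          # the term 2(1 - K^(alpha - 1))
    if eps <= bias:
        print(f"alpha={alpha}: epsilon is too small for this bias level")
        continue
    n_min = (dist2 + K ** (2 * alpha)) / (eps - bias) ** 2
    print(f"alpha={alpha}: n >~ {n_min:,.0f}")
```

These numbers only illustrate how the bias and second-moment terms pull in opposite directions as α varies; they are not a calibrated sample-size recommendation.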
Prior works (Swaminathan & Joachims, 2015a; London & Sandler, 2019; Sakhi et al., 2022) do not provide such insight, for two reasons. They only derived one-sided inequalities, and thus they cannot relate the risk of the learned policy to that of the optimal one, as we discussed in the last three paragraphs of Section 2. Also, their bounds do not contain a bias term, and as a result they are minimized at τ = 1. In contrast, ours has a bias term, and this allows seeing the effect of α.

Our paper derives a tractable generalization bound for an estimator other than the clipped IPS in (6), which also holds for the standard IPS in (2). The bounds in Swaminathan & Joachims (2015a); London & Sandler (2019); Sakhi et al. (2022) have a multiplicative dependency on the clipping threshold (M or 1/τ in (6)). Standard IPS is recovered when $M\to\infty$ (or τ = 0), in which case their bounds explode. We successfully avoid any similar dependency on α. Moreover, Swaminathan & Joachims (2015a); London & Sandler (2019) only used their generalization bounds to inspire learning principles. Although we directly optimize our theoretical bound (Theorem 4.1) in our experiments, our analysis also inspires a learning principle where we simultaneously penalize the L2 distance, the variance and the bias. That is, we find $\mu\in\mathbb{R}^{d\times K}$ that minimizes
$$\hat{R}^{\alpha}_n(\pi_\mu) + \lambda_1\,\|\mu-\mu_0\|^2 + \lambda_2\, V^{\alpha}_n(\pi_\mu) + \lambda_3\, B^{\alpha}_n(\pi_\mu)\,. \qquad (16)$$
Here $\lambda_1$, $\lambda_2$ and $\lambda_3$ are tunable hyper-parameters, $\pi_\mu$ can be the Gaussian policy in (12), $\pi_\mu = \pi^{\mathrm{GAUS}}_{\mu,1}$ with a fixed σ = 1, and $\mu_0$ is the mean of the prior $P = \mathcal{N}(\mu_0, I_{dK})$. Existing works penalize either the L2 distance or the variance. For completeness, we also show that this learning principle should be preferred over existing ones in Appendix D.5.

### 5.2. Technical Challenges

London & Sandler (2019); Sakhi et al. (2022) derived PAC-Bayes generalization bounds for the estimator IPS-max in (6). Extending their analyses to our case is not straightforward. First, their estimator IPS-max is lower bounded by $-1/\tau$, and thus they relied on traditional techniques for [0,1]-bounded losses (Alquier, 2021). In contrast, our loss in (13) is not lower bounded, and controlling it without assuming that the importance weights are bounded is challenging. Moreover, their bounds have a multiplicative dependency on 1/τ, hence they explode as $\tau\to 0$. This makes them vacuous for small values of τ and inapplicable to the standard IPS estimator in (2), recovered for τ = 0. In contrast, our bound does not have a similar dependency on α and is also valid for standard IPS, recovered for α = 1. Moreover, we derive two-sided inequalities rather than one-sided ones, for the important reasons discussed earlier. This requires carefully controlling the absolute value of the bias in closed form. Prior works only used the fact that the bias is negative, which was enough to obtain one-sided inequalities.

Explaining the other challenges requires stating a result that inspired our analysis: Kuzborskij & Szepesvári (2019) derived PAC-Bayes generalization bounds for unbounded losses by only controlling their second moments. Recently, Haddouche & Guedj (2022) proposed a similar result using Ville's inequality (Bercu & Touati, 2008). The adaptation of their theorem to our problem is given in Proposition 5.1. We slightly adapt their proof to obtain a two-sided inequality for a negative loss. The proof is deferred to Appendix C.3.

**Proposition 5.1.**
Let λ > 0, n ≥ 1, δ ∈ (0,1), α ∈ [0,1], and let P be a fixed prior on $\mathcal{H}$. Then with probability at least $1-\delta$ over draws $D_n\sim\mu^n_{\pi_0}$, the following holds simultaneously for all posteriors Q on $\mathcal{H}$:
$$|R^{\alpha}(\pi_Q) - \hat{R}^{\alpha}_n(\pi_Q)| \le \frac{D_{\mathrm{KL}}(Q\|P) + \log\frac{2}{\delta}}{\lambda n} + \frac{\lambda}{2n}\sum_{i=1}^n \frac{\pi_Q(a_i|x_i)\, c_i^2}{\pi_0(a_i|x_i)^{2\alpha}} + \frac{\lambda}{2}\, \mathbb{E}_{(x,a,c)\sim\mu_{\pi_0}}\!\left[\frac{\pi_Q(a|x)\, c^2}{\pi_0(a|x)^{2\alpha}}\right]. \qquad (17)$$

There are two main issues with Proposition 5.1. First, the term $\mathbb{E}_{(x,a,c)\sim\mu_{\pi_0}}\!\left[\pi_Q(a|x)c^2/\pi_0(a|x)^{2\alpha}\right]$ in (17) is intractable. One could bound $c^2$ by 1, but the resulting term would still be intractable due to the expectation over the unknown distribution of contexts ν. Second, we need an upper bound on $|R(\pi_Q) - \hat{R}^{\alpha}_n(\pi_Q)|$, while Proposition 5.1 only provides one for $|R^{\alpha}(\pi_Q) - \hat{R}^{\alpha}_n(\pi_Q)|$. Thus it remains to quantify the approximation error $|R(\pi_Q) - R^{\alpha}(\pi_Q)|$. This would also require computing an expectation over $x\sim\nu$, which is intractable.

### 5.3. Sketch of Proof for Theorem 4.1

We conclude by showing how the technical challenges above were solved. First, we decompose $R(\pi_Q) - \hat{R}^{\alpha}_n(\pi_Q)$ as
$$R(\pi_Q) - \hat{R}^{\alpha}_n(\pi_Q) = I_1 + I_2 + I_3\,,$$
where
$$I_1 = R(\pi_Q) - \frac{1}{n}\sum_{i=1}^n R(\pi_Q|x_i)\,, \qquad I_2 = \frac{1}{n}\sum_{i=1}^n R(\pi_Q|x_i) - \frac{1}{n}\sum_{i=1}^n R^{\alpha}(\pi_Q|x_i)\,, \qquad I_3 = \frac{1}{n}\sum_{i=1}^n R^{\alpha}(\pi_Q|x_i) - \hat{R}^{\alpha}_n(\pi_Q)\,,$$
with
$$R(\pi_Q|x_i) = \mathbb{E}_{a\sim\pi_Q(\cdot|x_i)}\!\left[c(x_i,a)\right], \qquad R^{\alpha}(\pi_Q|x_i) = \mathbb{E}_{a\sim\pi_0(\cdot|x_i)}\!\left[\frac{\pi_Q(a|x_i)}{\pi_0(a|x_i)^{\alpha}}\, c(x_i,a)\right].$$
$I_1$ is the estimation error of the empirical mean of the risk using n i.i.d. contexts $(x_i)_{i\in[n]}$. This term is introduced to avoid the intractable expectation over $x\sim\nu$. Moreover, $I_2$ is the bias term conditioned on the contexts $(x_i)_{i\in[n]}$, and we bound it in closed form. Finally, $I_3$ is the estimation error of the risk conditioned on the contexts $(x_i)_{i\in[n]}$. Again, this conditioning allows us to avoid the intractable expectation over $x\sim\nu$ and consequently to bound $|I_3|$ by tractable terms.

First, Alquier (2021, Theorem 3.3) yields that with probability at least $1-\delta/2$, it holds for any Q on $\mathcal{H}$ that
$$|I_1| \le \sqrt{\frac{D_{\mathrm{KL}}(Q\|P) + \log\frac{4\sqrt{n}}{\delta}}{2n}}\,.$$
Also, $|I_2|$ is bounded similarly to (8) as
$$|I_2| \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{a\sim\pi_Q(\cdot|x_i)}\!\left[1 - \pi_0(a|x_i)^{1-\alpha}\right].$$
Bounding $|I_3|$ is achieved by expressing it using martingale difference sequences $(f_i(a_i,h))_{i\in[n]}$ that we construct as follows. Let $(\mathcal{F}_i)_{i\in\{0\}\cup[n]}$ be a filtration adapted to $(S_i)_{i\in[n]}$, where $S_i = (a_\ell)_{\ell\in[i]}$ for any $i\in[n]$. We define
$$f_i(a_i, h) = \mathbb{E}_{a\sim\pi_0(\cdot|x_i)}\!\left[\frac{\mathbb{I}\{h(x_i)=a\}\, c(x_i,a)}{\pi_0(a|x_i)^{\alpha}}\right] - \frac{\mathbb{I}\{h(x_i)=a_i\}\, c_i}{\pi_0(a_i|x_i)^{\alpha}}\,.$$
Then we show that for any $h\in\mathcal{H}$, $(f_i(a_i,h))_{i\in[n]}$ is a martingale difference sequence. After that, we apply Haddouche & Guedj (2022, Theorem 5) and obtain that with probability at least $1-\delta/2$, it holds for any Q on $\mathcal{H}$ that
$$|\mathbb{E}_{h\sim Q}[M_n(h)]| \le \frac{D_{\mathrm{KL}}(Q\|P) + \log\frac{4}{\delta}}{\lambda} + \frac{\lambda}{2}\,\mathbb{E}_{h\sim Q}[V_n(h)]\,,$$
where $M_n(h) = \sum_{i=1}^n f_i(a_i,h)$ and $V_n(h) = \sum_{i=1}^n \left(f_i(a_i,h)^2 + \mathbb{E}\!\left[f_i(a_i,h)^2\,\middle|\,\mathcal{F}_{i-1}\right]\right)$. Then notice that $\mathbb{E}_{h\sim Q}[M_n(h)]$ can be expressed in terms of $I_3$ as
$$\mathbb{E}_{h\sim Q}[M_n(h)] = \sum_{i=1}^n R^{\alpha}(\pi_Q|x_i) - n\,\hat{R}^{\alpha}_n(\pi_Q) = n\, I_3\,.$$
Moreover, $\mathbb{E}_{h\sim Q}[V_n(h)]$ is bounded by
$$\sum_{i=1}^n \left(\mathbb{E}_{a\sim\pi_0(\cdot|x_i)}\!\left[\frac{\pi_Q(a|x_i)}{\pi_0(a|x_i)^{2\alpha}}\right] + \frac{\pi_Q(a_i|x_i)\, c_i^2}{\pi_0(a_i|x_i)^{2\alpha}}\right).$$
Thus with probability at least $1-\delta/2$, it holds for any Q that
$$|I_3| \le \frac{D_{\mathrm{KL}}(Q\|P) + \log\frac{4}{\delta}}{\lambda n} + \frac{\lambda}{2n}\sum_{i=1}^n\left(\frac{\pi_Q(a_i|x_i)\, c_i^2}{\pi_0(a_i|x_i)^{2\alpha}} + \mathbb{E}_{a\sim\pi_0(\cdot|x_i)}\!\left[\frac{\pi_Q(a|x_i)}{\pi_0(a|x_i)^{2\alpha}}\right]\right).$$
Our result is obtained by bounding $|I_1| + |I_2| + |I_3|$.

One shortcoming of our analysis is that $V^{\alpha}_n(\pi_Q)$ is not exactly, and only resembles, the sum of the theoretical and empirical second moments of our estimator. Precisely, the terms $\pi_Q/\pi_0^{2\alpha}$ should be $\pi_Q^2/\pi_0^{2\alpha}$. This arises from our definition of the martingale difference sequences $(f_i(a_i,h))_{i\in[n]}$ via the loss in (13). Precisely, in our proof, we compute the square $f_i(a_i,h)^2$. However, the square of an indicator function is the indicator function itself. Thus applying the expectation afterwards, $\mathbb{E}_{h\sim Q}[f_i(a_i,h)^2]$, leads to $\pi_Q$ appearing instead of $\pi_Q^2$.
This issue is inherent in the PAC-Bayes formulation, and seminal works (London & Sandler, 2019; Sakhi et al., 2022) suffer from the same issue. Solving it would be beneficial, and we leave it to future work.

## 6. Experiments

We briefly present our experiments. More details and discussions can be found in Appendix D. We consider the standard supervised-to-bandit conversion (Agarwal et al., 2014), where we transform a supervised training set $S^{\mathrm{TR}}_n$ into logged bandit data $D_n$ as described in Algorithm 1 in Appendix D.1. Here the action space $\mathcal{A}$ is the label set and the context space $\mathcal{X}$ is the input space. Then, $D_n$ is used to train our policies. After that, we evaluate the reward of the learned policies on the supervised test set $S^{\mathrm{TS}}_{n_{\mathrm{TS}}}$ as described in Algorithm 2 in Appendix D.1. Roughly speaking, the resulting reward quantifies the ability of the learned policy to predict the true labels of the inputs in the test set. This is our performance metric; the higher the better. We use 4 image classification datasets: MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), EMNIST (Cohen et al., 2017) and CIFAR100 (Krizhevsky et al., 2009). The logging policy is defined as $\pi_0 = \pi^{\mathrm{SOF}}_{\eta_0\mu_0}$ in (10), where $\mu_0 = (\mu_{0,a})_{a\in\mathcal{A}}\in\mathbb{R}^{d\times K}$ and $\eta_0\in[0,1]$ is the inverse-temperature parameter. The higher $\eta_0$, the better the performance of $\pi_0$; when $\eta_0 = 0$, $\pi_0$ is uniform. The parameters $\mu_0$ are learned using 5% of the training set $S^{\mathrm{TR}}_n$. In our experiments, we consider both Gaussian and mixed-logit policies, in (12) and (11), for which we set the prior to $P = \mathcal{N}(\eta_0\mu_0, I_{dK})$ and $P = \mathcal{N}(\eta_0\mu_0, I_{dK})\otimes\mathcal{G}(0,1)^K$, respectively. Given that $\mu_0$ is learned on 5% of $S^{\mathrm{TR}}_n$, we train our policies on the remaining 95% of $S^{\mathrm{TR}}_n$ to match our theory, which requires the prior not to depend on the training data. The policies are trained using Adam (Kingma & Ba, 2014) with a learning rate of 0.1 for 20 epochs.

We compare our bound to those in London & Sandler (2019); Sakhi et al. (2022), discarding the intractable bound in Swaminathan & Joachims (2015a) as it requires computing a covering number. Here we do not include the learning principles in Swaminathan & Joachims (2015a); London & Sandler (2019) since we directly optimize our bounds. But we make such a comparison in Appendix D.5 for completeness, showing the favorable performance of our bound and of the newly proposed learning principle in (16). Also, we do not compare to Su et al. (2020); Metelli et al. (2021) since they do not provide generalization guarantees; they focus on OPE and only propose a heuristic for OPL. However, we still show the favorable performance of our approach in OPL compared to Su et al. (2020); Metelli et al. (2021) in Appendix D.6 for completeness. Prior methods are not named, thus we refer to them as (Author, Policy), where Author ∈ {Ours, London et al., Sakhi et al. 1, Sakhi et al. 2} and Policy ∈ {Gaussian, Mixed-Logit}. Here Ours, London et al., Sakhi et al. 1 and Sakhi et al. 2 correspond to Theorem 4.1, London & Sandler (2019, Theorem 1), Sakhi et al. (2022, Proposition 1), and Sakhi et al. (2022, Proposition 3), respectively. Since we have two classes of policies, each bound leads to two baselines. For example, London & Sandler (2019, Theorem 1) leads to (London et al., Gaussian) and (London et al., Mixed-Logit). More details are provided in Appendix D.3.
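For concreteness, here is one way the supervised-to-bandit conversion described above can be implemented (a hypothetical stand-in for Algorithm 1 in Appendix D.1, which is not reproduced here); the cost convention of −1 for a correct prediction and 0 otherwise is an assumption on our part.

```python
import numpy as np

def supervised_to_bandit(features, labels, logging_probs, rng):
    """Convert a supervised dataset into logged bandit data D_n.

    features      : (n, d) inputs, used as contexts x_i
    labels        : (n,) true class labels
    logging_probs : (n, K) logging-policy probabilities pi_0(.|x_i)
    Returns contexts, logged actions, costs in [-1, 0], and propensities.
    """
    n, K = logging_probs.shape
    # Sample one action per context from the logging policy (vectorized inverse CDF).
    cum = np.cumsum(logging_probs, axis=1)
    actions = (cum < rng.random((n, 1))).sum(axis=1)
    actions = np.minimum(actions, K - 1)          # guard against floating-point round-off
    # Assumed cost convention: -1 if the sampled action equals the true label, else 0.
    costs = -(actions == labels).astype(float)
    propensities = logging_probs[np.arange(n), actions]
    return features, actions, costs, propensities
```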
In Figure 2, we report the reward of the learned policies. Here we fix $\tau = 1/\sqrt[4]{n} \approx 0.06$ and $\alpha = 1 - 1/\sqrt[4]{n} \approx 0.94$, so that when n is large enough, both $\hat{R}^{\tau}_n(\pi)$ and $\hat{R}^{\alpha}_n(\pi)$ approach $\hat{R}^{\mathrm{IPS}}_n(\pi)$ (Ionides, 2008). This is because standard IPS should be preferred when $n\to\infty$. To have a fair comparison, we fixed α instead of tuning it in an adaptive fashion as described in Section 4.4. However, we also provide results with an adaptive α in Figure 3.

Let us start with interpreting Figure 2 (with fixed α and τ). Overall, our method outperforms all the baselines. We also observe that Gaussian policies behave better than mixed-logit policies. However, this is less significant for our method, where the performances of Gaussian and mixed-logit policies are comparable. Moreover, our method reaches the maximum reward even when the logging policy has an average performance. In contrast, the baselines only reach their best reward when the logging policy is well-performing ($\eta_0 \approx 1$), in which case minor to no improvements are made. Finally, the baselines induce a better reward when the logging policy is uniform ($\eta_0 = 0$), but our method has a better reward when $\eta_0 > 0$, which is more common in practice.

*Figure 2. The reward of the learned policy for each baseline, with varying quality of the logging policy $\eta_0\in[0,1]$, on MNIST (K=10, d=784), Fashion-MNIST (K=10, d=784), EMNIST (K=47, d=784) and CIFAR100 (K=100, d=2048). Methods: (Ours, Gaussian), (Ours, Mixed-Logit), (London et al., Gaussian), (London et al., Mixed-Logit), (Sakhi et al. 1, Gaussian), (Sakhi et al. 1, Mixed-Logit), (Sakhi et al. 2, Gaussian), (Sakhi et al. 2, Mixed-Logit).*

Our choice of τ and α does not affect the above conclusions. In Figure 3 (left-hand side), we compare our method with the best baseline, (Sakhi et al. 2) with Gaussian policies, for 20 evenly spaced values of $\tau\in(0,1)$ and $\alpha\in(0,1)$. We also include the results using the adaptive tuning procedure for α described in Section 4.4 (green curve). This procedure is reliable, since the performance with an adaptive α (green curve) is comparable with the best possible choice of α. Also, our method consistently outperforms the best baseline (Sakhi et al. 2) with the best value of τ when the logging policy is not uniform ($\eta_0 > 0$). Also, there is no very bad choice of α, in contrast with $\tau = 10^{-5}$ (dark blue plot), which led to minimal improvement upon all logging policies. This might be due to the 1/τ dependency in existing bounds.

To see the effect of α, we consider the following experiment. We split the logging policies into two groups. The first is called modest logging, and it corresponds to logging policies $\pi_0$ whose $\eta_0$ is between 0 and 0.5. This group includes the uniform policy and other average-performing policies. The second is called good logging, and it includes the logging policies whose $\eta_0$ is between 0.5 and 1. Then, for each α, we compute the average reward of the learned policy, with that value of α, across these two groups. This leads to the red and green curves in Figure 3 (right-hand side). Overall, we observe that $\alpha\approx 0.7$ leads to the best performance across the modest logging group. Thus when the performance of the logging policy is bad or average, which is common in practice, regularization can be critical.
In contrast, when the performance of the logging policy is already good and n is large enough, regularization might not be needed, and $\alpha\approx 1$ also leads to good performance. This is one of the main strengths of our bound; it holds for the standard IPS, recovered with α = 1. This result goes against the belief that clipped IPS should always be preferred to standard IPS. Here, our bound applied to standard IPS outperformed clipping by a large margin when the logging policy is relatively well-performing. Similar results for the other datasets are deferred to Appendix D.4.

*Figure 3. On the left-hand side is the reward of the learned policy on MNIST (K=10, d=784) with varying $\tau\in(0,1)$, $\alpha\in(0,1)$ and $\eta_0\in[0,1]$, and for an adaptive α using the procedure in Section 4.4 (green curve). The blue-to-cyan and red-to-yellow colors correspond to varying values of τ and α, respectively; the lighter the color, the higher the value of τ or α. The green curve corresponds to the reward of the learned policy with an adaptive and data-dependent α (Section 4.4). On the right-hand side is the average reward of the policies learned using our method across the modest and good logging groups, $\eta_0\in[0,0.5]$ (red) and $\eta_0\in[0.5,1]$ (green), respectively.*

## 7. Conclusion

In this paper, we investigated a smooth regularization of IPS in the context of OPL. First, we highlighted the pitfalls of hard clipping and advocated a soft regularization alternative, called exponential smoothing. Moreover, we addressed some fundamental theoretical limitations of existing OPL approaches. These limitations include the use of one-sided inequalities instead of two-sided ones, the use of learning principles, and the use of evaluation bounds in OPL. Building upon this, we successfully derived a tractable two-sided PAC-Bayes generalization bound for our estimator, which we optimize directly. We demonstrated, both theoretically, through our bias-variance trade-off analysis in (8) and our bound in Theorem 4.1, and empirically, that this smooth regularization may be critical in some situations. In contrast with all prior works, our bound also applies to standard IPS. This allowed us to show that, in some other cases, slight to no correction of IPS is needed in OPL.

## References

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646. PMLR, 2014.

Alquier, P. User-friendly introduction to PAC-Bayes bounds. arXiv preprint arXiv:2110.11216, 2021.

Aouali, I., Ivanov, S., Gartrell, M., Rohde, D., Vasile, F., Zaytsev, V., and Legrand, D. Combining reward and rank signals for slate recommendation. arXiv preprint arXiv:2107.12455, 2021.

Aouali, I., Hammou, A. A. S., Ivanov, S., Sakhi, O., Rohde, D., and Vasile, F. Probabilistic rank and reward: A scalable model for slate recommendation, 2022.

Aouali, I., Kveton, B., and Katariya, S. Mixed-effect Thompson sampling. In International Conference on Artificial Intelligence and Statistics, pp. 2087–2115. PMLR, 2023.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
Bang, H. and Robins, J. M. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.

Bercu, B. and Touati, A. Exponential inequalities for self-normalized martingales with applications. 2008.

Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(11), 2013.

Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 208–214, 2011.

Cohen, G., Afshar, S., Tapson, J., and Van Schaik, A. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE, 2017.

Dudík, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, pp. 1097–1104, 2011.

Dudík, M., Erhan, D., Langford, J., and Li, L. Sample-efficient nonstationary policy evaluation for contextual bandits. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI'12, pp. 247–254, Arlington, Virginia, USA, 2012. AUAI Press.

Dudík, M., Erhan, D., Langford, J., and Li, L. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.

Farajtabar, M., Chow, Y., and Ghavamzadeh, M. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pp. 1447–1456. PMLR, 2018.

Farid, A. and Majumdar, A. Generalization bounds for meta-learning via PAC-Bayes and uniform stability. Advances in Neural Information Processing Systems, 34:2173–2186, 2021.

Faury, L., Tanielian, U., Dohmatob, E., Smirnova, E., and Vasile, F. Distributionally robust counterfactual risk minimization. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3850–3857, 2020.

Gilotte, A., Calauzènes, C., Nedelec, T., Abraham, A., and Dollé, S. Offline A/B testing for recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 198–206, 2018.

Guedj, B. A primer on PAC-Bayesian learning. arXiv preprint arXiv:1901.05353, 2019.

Haddouche, M. and Guedj, B. PAC-Bayes with unbounded losses through supermartingales. arXiv preprint arXiv:2210.00928, 2022.

Hong, J., Kveton, B., Katariya, S., Zaheer, M., and Ghavamzadeh, M. Deep hierarchy in bandits. arXiv preprint arXiv:2202.01454, 2022.

Horvitz, D. G. and Thompson, D. J. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.

Ionides, E. L. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295–311, 2008.

Jeunen, O. and Goethals, B. Pessimistic reward models for off-policy learning in recommendation. In Fifteenth ACM Conference on Recommender Systems, pp. 63–74, 2021.

Kallus, N., Saito, Y., and Uehara, M. Optimal off-policy evaluation from multiple logging policies. In International Conference on Machine Learning, pp. 5247–5256. PMLR, 2021.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. Advances in Neural Information Processing Systems, 28, 2015.

Korba, A. and Portier, F. Adaptive importance sampling meets mirror descent: a bias-variance tradeoff. In International Conference on Artificial Intelligence and Statistics, pp. 11503–11527. PMLR, 2022.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Kuzborskij, I. and Szepesvári, C. Efron-Stein PAC-Bayesian inequalities. arXiv preprint arXiv:1909.01931, 2019.

Kuzborskij, I., Vernade, C., Gyorgy, A., and Szepesvári, C. Confident off-policy evaluation and selection through self-normalized importance weighting. In International Conference on Artificial Intelligence and Statistics, pp. 640–648. PMLR, 2021.

Lattimore, T. and Szepesvari, C. Bandit Algorithms. Cambridge University Press, 2019.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, L., Chu, W., Langford, J., and Schapire, R. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 2010.

London, B. and Sandler, T. Bayesian counterfactual risk minimization. In International Conference on Machine Learning, pp. 4125–4133. PMLR, 2019.

Luce, R. D. Individual choice behavior: A theoretical analysis. Courier Corporation, 2012.

Maddison, C. J., Tarlow, D., and Minka, T. A* sampling. Advances in Neural Information Processing Systems, 27, 2014.

Maurer, A. and Pontil, M. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.

McAllester, D. A. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 230–234, 1998.

Metelli, A. M., Russo, A., and Restelli, M. Subgaussian and differentiable importance sampling for off-policy evaluation and learning. Advances in Neural Information Processing Systems, 34:8119–8132, 2021.

Papini, M., Metelli, A. M., Lupo, L., and Restelli, M. Optimistic policy optimization via multiple importance sampling. In International Conference on Machine Learning, pp. 4989–4999. PMLR, 2019.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Robins, J. M. and Rotnitzky, A. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.

Russo, D., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.

Sachdeva, N., Su, Y., and Joachims, T. Off-policy bandits with deficient support. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 965–975, 2020.

Saito, Y. and Joachims, T. Off-policy evaluation for large action spaces via embeddings. arXiv preprint arXiv:2202.06317, 2022.

Sakhi, O., Bonner, S., Rohde, D., and Vasile, F. BLOB: A probabilistic model for recommendation that combines organic and bandit signals. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 783–793, 2020.

Sakhi, O., Chopin, N., and Alquier, P. PAC-Bayesian offline contextual bandits with guarantees. arXiv preprint arXiv:2210.13132, 2022.
Su, Y., Wang, L., Santacatterina, M., and Joachims, T. CAB: Continuous adaptive blending for policy evaluation and learning. In International Conference on Machine Learning, pp. 6005–6014. PMLR, 2019.

Su, Y., Dimakopoulou, M., Krishnamurthy, A., and Dudík, M. Doubly robust off-policy evaluation with shrinkage. In International Conference on Machine Learning, pp. 9167–9176. PMLR, 2020.

Swaminathan, A. and Joachims, T. Batch learning from logged bandit feedback through counterfactual risk minimization. The Journal of Machine Learning Research, 16(1):1731–1755, 2015a.

Swaminathan, A. and Joachims, T. The self-normalized estimator for counterfactual learning. Advances in Neural Information Processing Systems, 28, 2015b.

Swaminathan, A., Krishnamurthy, A., Agarwal, A., Dudík, M., Langford, J., Jose, D., and Zitouni, I. Off-policy evaluation for slate recommendation. Advances in Neural Information Processing Systems, 30, 2017.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

Van Erven, T. and Harremos, P. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.

Wang, Y.-X., Agarwal, A., and Dudík, M. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, pp. 3589–3597. PMLR, 2017.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Zenati, H., Bietti, A., Martin, M., Diemert, E., and Mairal, J. Counterfactual learning of continuous stochastic policies. 2020.

Zhu, Y., Foster, D. J., Langford, J., and Mineiro, P. Contextual bandits with large action spaces: Made practical. In International Conference on Machine Learning, pp. 27428–27453. PMLR, 2022.

Zong, S., Ni, H., Sung, K., Ke, N. R., Wen, Z., and Kveton, B. Cascading bandits for large-scale recommendation problems. arXiv preprint arXiv:1603.05359, 2016.

## Organization of the Supplementary Material

The supplementary material is organized as follows. In Appendix A, we give a detailed comparative review of the literature on OPE and OPL. In Appendix B, we provide results and proofs for the bias-variance trade-off of our estimators. In Appendix C, we prove Theorem 4.1; we also provide the proofs of all the claims made in Section 4. In Appendix D, we present our experimental setup in detail for reproducibility. This appendix also includes additional experiments.

## A. Related Work

A contextual bandit (Lattimore & Szepesvari, 2019) is a popular and practical framework for online learning to act under uncertainty (Li et al., 2010; Chu et al., 2011). In practice, the action space is large and short-term gains are important. Thus the agent should be risk-averse, which goes against the core principle of online algorithms that seek to explore the action space for the sake of long-term gains (Auer et al., 2002; Thompson, 1933; Russo et al., 2018). Although some practical algorithms have been proposed to efficiently explore the action space of a contextual bandit (Zong et al., 2016; Hong et al., 2022; Zhu et al., 2022; Aouali et al., 2023), a clear need remains for an offline procedure that allows optimizing decision-making using offline data. Fortunately, we have access to logged data about previous interactions.
The agent can leverage such data to learn an improved policy offline (Swaminathan & Joachims, 2015a; London & Sandler, 2019; Sakhi et al., 2022) and consequently enhance the performance of the current system. In this work, we are concerned with this offline, or off-policy, formulation of contextual bandits (Dud ık et al., 2011; 2012; Dudik et al., 2014; Wang et al., 2017; Farajtabar et al., 2018). Before learning an improved policy, an important intermediary step is to estimate the performance of policies using logged data, as if they were evaluated online. This task is referred to as off-policy evaluation (OPE) (Dud ık et al., 2011). After that, the resulting estimator is optimized to approximate the optimal policy, and this is referred to as off-policy learning (OPL) (Swaminathan & Joachims, 2015a). Next, we review both OPE and OPL approaches. A.1. Off-Policy Evaluation Off-policy evaluation in contextual bandits has seen a lot of interest these recent years (Dud ık et al., 2011; 2012; Dudik et al., 2014; Wang et al., 2017; Farajtabar et al., 2018; Su et al., 2019; 2020; Kallus et al., 2021; Metelli et al., 2021; Kuzborskij et al., 2021; Saito & Joachims, 2022; Sakhi et al., 2020; Jeunen & Goethals, 2021). We can distinguish between three main families of approaches in the literature. First, direct method (DM) (Jeunen & Goethals, 2021) learns a model that approximates the expected cost and then uses it to estimate the performance of evaluated policies. Unfortunately, DM can suffer from modeling bias and misspecification. Thus DM is often designed for specific use cases, in particular large-scale recommender systems (Sakhi et al., 2020; Jeunen & Goethals, 2021; Aouali et al., 2021; 2022). Second, inverse propensity scoring (IPS) (Horvitz & Thompson, 1952; Dud ık et al., 2012) estimates the cost of the evaluated policies by removing the preference bias of the logging policy in logged data. Under the assumption that the evaluation policy is absolutely continuous with respect to the logging policy, IPS is unbiased, but it can suffer high variance. Note that it can also be highly biased when such an assumption is violated (Sachdeva et al., 2020). The variance issue is acknowledged and some fixes were proposed. For instance, clipping the importance weights (Ionides, 2008; Swaminathan & Joachims, 2015a), self normalizing them (Swaminathan & Joachims, 2015b), etc. (see Gilotte et al. (2018) for a survey). Third, doubly robust (DR) (Robins & Rotnitzky, 1995; Bang & Robins, 2005; Dud ık et al., 2011; Dudik et al., 2014; Farajtabar et al., 2018) is a combination of DM and IPS. Here a model of the expected cost is used as a control variate for IPS to reduce the variance. Finally, the accuracy of an estimator ˆRn(π) in OPE is assessed using the mean squared error (MSE) defined as MSE( ˆRn(π)) = E ˆRn(π) R(π) 2 = B( ˆRn(π))2 + V ˆRn(π) , where B( ˆRn(π)) = EDn[ ˆRn(π)] R(π) and V[ ˆRn(π)] = EDn[( ˆRn(π) EDn[ ˆRn(π)])2] are respectively the bias and the variance of the estimator. It may be relevant to note that Metelli et al. (2021) argued that high-probability concentration rates should be preferred over the MSE to evaluate OPE estimators as they provide non-asymptotic guarantees. In this work, we highlighted the effect of α and β in OPE following the common methodology of using the MSE as a performance metric. However, we also derived two-sided high-probability generalization bounds that attest to the quality of our estimator. Exponential Smoothing for Off-Policy Learning A.2. 
Off-Policy Learning Previous works focused on deriving learning principles inspired by generalization bounds. First, Swaminathan & Joachims (2015a) derived a generalization bound for the IPS-min estimator in (6) of the form R(π) RM n (π) + O ˆVn(π)Cn(Π, δ) n + M Cn(Π, δ) where Cn(Π, δ) is the complexity measure of the class of learning policies Π while ˆVn(π) is the empirical variance of the estimator on the logged data Dn. The term Cn(Π, δ) is not necessarily tractable. Thus the generalization bound above was only used to inspire the following learning principle min µ RM n (πµ) + λ where λ is a tunable hyper-parameter. This learning principle favors policies that simultaneously enjoy low estimated cost and empirical variance. Faury et al. (2020) generalized their work using distributional robustness optimization while Zenati et al. (2020) generalized it to continuous action spaces. The latter also proposed a soft clipping scheme but they derived a generalization bound similar to the one in Swaminathan & Joachims (2015a). Hence they also used the learning principle in (19). Our paper improves upon these work in different ways. First, (18) has a multiplicative dependency on M. Therefore, it is not applicable to standard IPS recovered for M . In contrast, our bound in Theorem 4.1 does not have a similar dependency on α and thus it also provides generalization guarantees for standard IPS without assuming that the importance weights are bounded. Second, the complexity measure Cn(Π, δ) is often hard to compute while our bound is tractable and the KL terms can be computed or bounded in closed-form for Gaussian and mixed-logit policies. Third, our bound is differentiable and scalable while the learning principle in (19) requires additional care in optimization (Swaminathan & Joachims, 2015a). Fourth, it is challenging to tune λ in (19) using a procedure that is aligned with online metrics. Finally, we follow the theoretically grounded approach where we optimize our bound directly instead of using a learning principle. This direct optimization of the bound does not require any additional hyper-parameters tuning. Recently, London & Sandler (2019) elegantly made the connection between PAC-Bayes theory and OPL. As a result, they derived a generalization bound for IPS-max in (6) which roughly has the following form R(πµ) ˆRτ n(πµ) + O ( ˆRτn(πµ) + 1 τn + µ µ0 2 Again, this bound was used to inspire a novel learning principle in the form min µ ˆRτ n(πµ) + λ µ µ0 2 , (21) where λ is a tunable hyper-parameter and µ0 Rd K is the parameter of the logging policy. This principle favors policies with low estimated cost and whose parameter is not far from that of the logging policy in terms of L2 distance. While the bound of London & Sandler (2019) is tractable, it still has a multiplicative dependency on 1/τ. This makes it inapplicable to standard IPS recovered for τ = 0. It is also not suitable for stochastic first-order optimization (Robbins & Monro, 1951) since the data-dependent quantity ˆRτ n(πµ) is inside a square root. Moreover, optimizing directly their bound leads to minimal improvements over the logging policy in practice. In their work, they used the learning principle in (21) instead which suffers the same issues that we discussed before for Swaminathan & Joachims (2015a), except that it is scalable. Recently, Sakhi et al. (2022) derived novel generalization bounds for a doubly robust version of the IPS-max estimator in (6). Sakhi et al. 
(2022) optimized the theoretical bound directly instead of using some form of learning principle and they showed favorable performance over existing methods. Unfortunately, their bounds have the same multiplicative dependency on 1/τ which makes them vacuous for small values of τ and inapplicable to standard IPS. Moreover, we derive two-sided generalization bounds while all these works only derived one-sided generalization bounds. Unfortunately, the latter does not provide any guarantees on the expected performance of the learned policy. Also, we propose a different estimator to the clipped IPS traditionally used for OPL and we demonstrate empirically that it has better performance. Exponential Smoothing for Off-Policy Learning B. Bias and Variance Trade-Off In this section, we provide additional results on how β and α control the bias and variance of Rβ n( ) and ˆRα n( ), respectively. Precisely, in Propositions B.1 and B.2 we upper bound the absolute bias and variance of ˆRα n( ) and Rβ n( ), respectively. B.1. Bias and variance of IPS-α The following proposition states the bias-variance trade-off for ˆRα n( ). Proposition B.1 (Bias and variance of IPS-α). Let α [0, 1], the following holds for any evaluation policy π Π that is absolutely continuous with respect to π0 |B( ˆRα n(π))| Ex ν,a π( |x) 1 π0(a|x)1 α , V h ˆRα n(π) i 1 n Ex ν,a π( |x) π(a|x) π0(a|x)2α 1 . Proof. We first bound the bias as B( ˆRα n(π)) = E h ˆRα n(π) i R(π) , i=1 Exi ν,ai π0( |xi),ci p( |xi,ai) ci π(ai|xi) π0(ai|xi)α (i) = E(x,a,c) µπ0 a A c(x, a) π(a|x) π0(a|x)α 1 a A c(x, a)π(a|x) a A c(x, a)π(a|x)(π0(a|x)1 α 1) = Ex ν,a π( |x) c(x, a)(π0(a|x)1 α 1) , where (i) follows from the i.i.d. assumption. Since π0(a|x)1 α 1 for any x X and a A, we have that |B( ˆRα n(π))| Ex ν,a π( |x) |c(x, a)||π0(a|x)1 α 1| , Ex ν,a π( |x) 1 π0(a|x)1 α . The variance is bounded as V h Rβ n(π) i = 1 i=1 Vxi ν,ai π0( |xi),ci p( |xi,ai) h ci π(ai|xi) π0(ai|xi)α n V(x,a,c) µπ0 n E(x,a,c) µπ0 h c2 π(a|x)2 n Ex ν,a π0( |x) h π(a|x)2 π0(a|x)2α 1 n Ex ν,a π( |x) h π(a|x) π0(a|x)2α 1 Exponential Smoothing for Off-Policy Learning B.2. Bias and variance of IPS-β The following proposition states the bias-variance trade-off for Rβ n( ). Proposition B.2 (Bias and variance of IPS-β). Let β [0, 1], the following holds for any evaluation policy π Π that is absolutely continuous with respect to π0 |B( Rβ n(π))| Ex ν,a π( |x) π(a|x) π0(a|x) β 1 1 , V h Rβ n(π) i 1 n Ex ν,a π( |x) π(a|x) π0(a|x) 2β 1 . Proof. We first bound the bias as B( Rβ n(π)) = E h Rβ n(π) i R(π) i=1 Exi ν,ai π0( |xi),ci p( |xi,ai) ci π(ai|xi) (i) = E(x,a,c) µπ0 = Ex ν,a π0( |x) c(x, a) π(a|x) β Ex ν,a π( |x) [c(x, a)] , a A c(x, a) π(a|x)β a A c(x, a)π(a|x) a A c(x, a)π(a|x) π0(a|x) where (i) follows from the i.i.d. assumption. It follows that |B( Rβ n(π))| Ex ν a A |c(x, a)|π(a|x) π0(a|x) a A π(a|x) π0(a|x) = Ex ν,a π( |x) The variance is bounded as V h Rβ n(π) i = 1 i=1 Vxi ν,ai π0( |xi),ci p( |xi,ai) h ci π(ai|xi) n V(x,a,c) µπ0 n E(x,a,c) µπ0 h c2 π(a|x) n Ex ν,a π0( |x) h π(a|x) π0(a|x)2β 1 n Ex ν,a π( |x) h π(a|x)2β 1 π0(a|x)2β 1 Exponential Smoothing for Off-Policy Learning B.3. Discussion Here we show using Propositions B.1 and B.2 how α and β trade the bias and variance of ˆRα n(π) and Rβ n(π), respectively. Let us start with ˆRα n(π), from Proposition B.1, the bound on the bias is minimized in α = 1; in which case it is equal to 0. In contrast, the bound on the variance is minimized in α = 0; in which case the variance is bounded by 1/n. 
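Before turning to the minimizers α* and β* below, the following minimal sketch (ours, not part of the paper's experiments) illustrates the trade-off numerically. It simulates a toy single-context bandit with fixed costs, which is our own illustrative assumption, and reports the empirical bias and standard deviation of the α-smoothed estimate (1/n) Σ_i c_i π(a_i|x_i)/π0(a_i|x_i)^α for a few values of α.

```python
# A minimal numerical sketch (not from the paper) of how alpha trades the bias and
# variance of the alpha-smoothed IPS estimator
#   R_hat^alpha_n(pi) = (1/n) * sum_i c_i * pi(a_i|x_i) / pi0(a_i|x_i)^alpha .
# The toy bandit below (single context, K actions, fixed costs) is our own assumption.
import numpy as np

rng = np.random.default_rng(0)
K, n, n_rep = 10, 200, 2000
cost = -rng.uniform(size=K)                      # c(a) in [-1, 0]
pi0 = rng.dirichlet(np.ones(K))                  # logging policy
pi = np.zeros(K)
pi[np.argmin(cost)] = 0.9                        # target policy: mostly the best action
pi[np.argsort(cost)[1]] = 0.1                    # and a little mass on the second best
true_risk = np.dot(pi, cost)

for alpha in [0.0, 0.5, 0.9, 1.0]:
    estimates = []
    for _ in range(n_rep):
        a = rng.choice(K, size=n, p=pi0)         # logged actions drawn from pi0
        c = cost[a]                              # deterministic costs for simplicity
        w = pi[a] / pi0[a] ** alpha              # smoothed importance weights
        estimates.append(np.mean(c * w))
    est = np.array(estimates)
    print(f"alpha={alpha:.1f}  bias={est.mean() - true_risk:+.3f}  std={est.std():.3f}")
```

As α decreases, the weights shrink toward π(a_i|x_i), so the variance drops while the bias grows, which is the behavior described by Propositions B.1 and B.2.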
Let α be the minimizer of the corresponding bound of the MSE α = argminα [0,1]Ex ν,a π( |x) 1 π0(a|x)1 α 2 + Ex ν,a π( |x) π(a|x) π0(a|x)2α 1 /n . We observe that when the variance is small or n is large enough such that Ex ν,a π( |x) π(a|x) π0(a|x)2α 1 /n 0, then we have that α 1. Thus it is better to use the standard IPS in this case. Otherwise, we have α 0 and this is when regularization helps; basically when we have few samples or when the evaluation policy induces high variance. This demonstrates that the choice of α matters as it trades the bias and variance of ˆRα n. Similarly, from Proposition B.2, we define β as β = argminβ [0,1]Ex ν,a π( |x) π(a|x) π0(a|x) β 1 1 + 1 n Ex ν,a π( |x) π(a|x) π0(a|x) 2β 1 . Again, we observe that if Ex ν,a π( |x) π(a|x) π0(a|x) 2β 1 /n 0, then β 1; in which case it is better to use standard IPS. Otherwise, we have β 0 to regularize the importance weights. C. Proofs for Off-Policy Learning In this section, we provide the complete proofs for our OPL results in Section 4. We start with proving Theorem 4.1 in Appendix C.1. We then state the extension of Theorem 4.1 along with its proof in Appendix C.2. After that, in Appendix C.3, we provide the proof for Proposition 5.1. Finally, in Appendix C.4, we discuss in detail and prove our claims regarding the number of samples needed so that the performance of the learned policy is close to that of the optimal policy. C.1. Proof of Theorem 4.1 In this section, we prove Theorem 4.1. Proof. First, we decompose the difference R(πQ) ˆRα n(πQ) as R(πQ) ˆRα n(πQ) = R(πQ) 1 i=1 R(πQ|xi) i=1 R(πQ|xi) 1 i=1 Rα(πQ|xi) i=1 Rα(πQ|xi) ˆRα n(πQ) R(πQ) = Ex ν ,a πQ( |x) [c(x, a)] , R(πQ|xi) = Ea πQ( |xi) [c(xi, a)] , Rα(πQ|xi) = Ea π0( |xi) π0(a|xi)α c(xi, a) , ˆRα n(π) = 1 π(ai|xi) π0(ai|xi)α ci . Our goal is to bound |R(πQ) ˆRα n(πQ)| and thus we need to bound |I1| + |I2| + |I3|. We start with |I1|, Alquier (2021, Theorem 3.3) yields that following inequality holds with probability at least 1 δ/2 for any distribution Q on H DKL(Q P) + log 4 n δ 2n . (22) Exponential Smoothing for Off-Policy Learning Moreover, |I2| can be bounded by decomposing it as i=1 Ea πQ( |xi) [c(xi, a)] 1 i=1 Ea π0( |xi) πα 0 (a|xi)c(xi, a) a A πQ(a|xi)c(xi, a) π0(a|xi)πQ(a|xi) πα 0 (a|xi)c(xi, a) πQ(a|xi) πQ(a|xi) πα 1 0 (a|xi) 1 π1 α 0 (a|xi) πQ(a|xi)c(xi, a) 1 π1 α 0 (a|xi) πQ(a|xi) |c(xi, a)| . But 1 π1 α 0 (a|x) 0 and |c(x, a)| 1 for any a A and x X. Thus i=1 Ea πQ( |xi) 1 π1 α 0 (a|xi) . (23) Finally, we need to bound the main term |I3|. To achieve this, we borrow the following technical lemma from Haddouche & Guedj (2022). It is slightly different from the one in Haddouche & Guedj (2022); their result holds for any n 1 while we state a simpler version where n is fixed in advance. Lemma C.1. Let Z be an instance space and let Sn = (zi)i [n] be an n-sized dataset for some n 1. Let (Fi)i {0} [n] be a filtration adapted to Sn. Also, let H be a hypothesis space and (fi (Si, h))i [n] be a martingale difference sequence for any h H, that is for any i [n], and h H , we have that E [fi (Si, h) |Fi 1] = 0. Moreover, for any h H, let Mn(h) = Pn i=1 fi (Si, h). Then for any fixed prior, P, on H, any λ > 0, the following holds with probability 1 δ over the sample Sn, simultaneously for any Q, on H |Eh Q [Mn(h)]| DKL(Q P) + log(2/δ) 2 (Eh Q [ M n(h) + [M]n(h)]) , where M n(h) = Pn i=1 E h fi (Si, h)2 |Fi 1 i and [M]n(h) = Pn i=1 fi (Si, h)2. 
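For readability, the inequality of Lemma C.1 can be written in a single display; the placement of λ below is our reading, inferred from the later displays (24) and (30) and from the statement in Haddouche & Guedj (2022), so it should be taken as a reconstruction rather than a verbatim restatement.

```latex
% Lemma C.1, as we read it: for a fixed prior P on H and any fixed lambda > 0,
% with probability at least 1 - delta over the sample S_n, simultaneously for all Q on H,
\left|\mathbb{E}_{h\sim Q}\left[M_n(h)\right]\right|
\;\le\; \frac{D_{\mathrm{KL}}(Q\,\|\,P)+\log(2/\delta)}{\lambda}
\;+\;\frac{\lambda}{2}\,\mathbb{E}_{h\sim Q}\!\left[\langle M\rangle_n(h)+[M]_n(h)\right],
\quad\text{where}\quad
\langle M\rangle_n(h)=\sum_{i=1}^{n}\mathbb{E}\!\left[f_i(S_i,h)^2\,\middle|\,\mathcal{F}_{i-1}\right],
\qquad [M]_n(h)=\sum_{i=1}^{n}f_i(S_i,h)^2 .
```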
To apply Lemma C.1, we need to construct an adequate martingale difference sequence (fi(Si, h))i [n] for h H that allows us to retrieve |I3|. To achieve this, we define Sn = (ai)i [n] as the set of n taken actions. Also, we let (Fi)i {0} [n] be a filtration adapted to Sn. For h H, we define fi (Si, h) as fi (Si, h) = fi (ai, h) = Ea π0( |xi) π0(a|xi)α c(xi, a) I{h(xi)=ai} π0(ai|xi)α c(xi, ai) . We stress that fi(Si, h) only depends on the last action in Si, ai, and the predictor h. For this reason, we denote it by fi(ai, h). The function fi is indexed by i since it depends on the fixed i-th context, xi. The context xi is fixed and thus randomness only comes from ai π0( |xi). It follows that the expectations are under ai π0( |xi). First, we have that E [fi (ai, h) |Fi 1] = 0 for any i [n] , h H. This follows from E [fi (ai, h) |Fi 1] = Eai π0( |xi) h fi (ai, h) a1, . . . , ai 1 i , = Eai π0( |xi) Ea π0( |xi) π0(a|xi)α c(xi, a) I{h(xi)=ai} π0(ai|xi)α c(xi, ai) a1, . . . , ai 1 (i) = Ea π0( |xi) π0(a|xi)α c(xi, a) Eai π0( |xi) I{h(xi)=ai} π0(ai|xi)α c(xi, ai) a1, . . . , ai 1 Exponential Smoothing for Off-Policy Learning In (i) we use the fact that given xi, Ea π0( |xi) h I{h(xi)=a} π0(a|xi)α c(xi, a) i is deterministic. Now ai does not depend on a1, . . . , ai 1 since logged data is i.d.d. Hence Eai π0( |xi) I{h(xi)=ai} π0(ai|xi)α c(xi, ai) a1, . . . , ai 1 = Eai π0( |xi) I{h(xi)=ai} π0(ai|xi)α c(xi, ai) , = Ea π0( |xi) π0(a|xi)α c(xi, a) . It follows that E [fi (ai, h) |Fi 1] = Ea π0( |xi) π0(a|xi)α c(xi, a) Eai π0( |xi) I{h(xi)=ai} π0(ai|xi)α c(xi, ai) a1, . . . , ai 1 = Ea π0( |xi) π0(a|xi)α c(xi, a) Ea π0( |xi) π0(a|xi)α c(xi, a) , Therefore, for any h H, (fi(ai, h))i [n] is a martingale difference sequence. Hence we apply Lemma C.1 and obtain that the following inequality holds with probability at least 1 δ/2 for any Q on H |Eh Q [Mn(h)]| DKL(Q P) + log(4/δ) 2 (Eh Q [ M n(h) + [M]n(h)]) , (24) i=1 fi (ai, h) , i=1 E h fi (ai, h)2 |Fi 1 i , i=1 fi (ai, h)2 Now these terms can be decomposed as Eh Q [Mn(h)] = i=1 Eh Q [fi (ai, h)] , Ea π0( |xi) π0(a|xi)α c(xi, a) I{h(xi)=ai} π0(ai|xi)α c(xi, ai) , Ea π0( |xi) π0(a|xi)α c(xi, a) Eh Q I{h(xi)=ai} π0(ai|xi)α c(xi, ai) , i=1 Ea π0( |xi) " Eh Q I{h(xi)=a} π0(a|xi)α c(xi, a) Eh Q I{h(xi)=ai} π0(ai|xi)α c(xi, ai) , i=1 Ea π0( |xi) π0(a|xi)α c(xi, a) πQ(ai|xi) π0(ai|xi)α c(xi, ai) , where we use the linearity of the expectation in both (i) and (ii). In (iii), we use our definition of policies in (9). Therefore, we have that Eh Q [Mn(h)] = i=1 Ea π0( |xi) π0(a|xi)α c(xi, a) πQ(ai|xi) π0(ai|xi)α c(xi, ai) , i=1 Rα(πQ|xi) n ˆRα n(πQ) , = n I3 , (25) Exponential Smoothing for Off-Policy Learning where we used the fact that ci = c(ai, xi) for any i [n] in (i). Now we focus on the terms M n(h) and [M]n(h). First, we have that fi (ai, h)2 = Ea π0( |xi) π0(a|xi)α c(xi, a) I{h(xi)=ai} π0(ai|xi)α c(xi, ai) 2 , (26) = Ea π0( |xi) π0(a|xi)α c(xi, a) 2 + I{h(xi)=ai} π0(ai|xi)α c(xi, ai) 2 2Ea π0( |xi) π0(a|xi)α c(xi, a) I{h(xi)=ai} π0(ai|xi)α c(xi, ai) , = Ea π0( |xi) π0(a|xi)α c(xi, a) 2 + I{h(xi)=ai} π0(ai|xi)2α c(xi, ai)2 2Ea π0( |xi) π0(a|xi)α c(xi, a) I{h(xi)=ai} π0(ai|xi)α c(xi, ai) . Moreover, fi (ai, h)2 does not depend on a1, . . . , ai 1. Thus E h fi (ai, h)2 |Fi 1 i = Eai π0( |xi) h fi (ai, h)2 |Fi 1 i = Eai π0( |xi) h fi (ai, h)2i = Ea π0( |xi) h fi (a, h)2i . 
Computing Ea π0( |xi) h fi (a, h)2i using the decomposition in (26) yields E h fi (ai, h)2 |Fi 1 i = Ea π0( |xi) h fi (a, h)2i , = Ea π0( |xi) π0(a|xi)α c(xi, a) 2 + Ea π0( |xi) π0(a|xi)2α c(xi, a)2 (27) Combining (26) and (27) leads to E h fi (ai, h)2 |Fi 1 i + fi (ai, h)2 = Ea π0( |xi) π0(a|xi)2α c(xi, a)2 + I{h(xi)=ai} π0(ai|xi)2α c(xi, ai)2 2Ea π0( |xi) π0(a|xi)α c(xi, a) I{h(xi)=ai} π0(ai|xi)α c(xi, ai) , (i) Ea π0( |xi) π0(a|xi)2α c(xi, a)2 + I{h(xi)=ai} π0(ai|xi)2α c(xi, ai)2 . (28) The inequality in (i) holds because 2Ea π0( |xi) h I{h(xi)=a} π0(a|xi)α c(xi, a) i I{h(xi)=ai} π0(ai|xi)α c(xi, ai) 0. Therefore, we have that M n(h) + [M]n(h) i=1 Ea π0( |xi) π0(a|xi)2α c(xi, a)2 + I{h(xi)=ai} π0(ai|xi)2α c(xi, ai)2 . Finally, by using the linearity of the expectation and the definition of policies in (9), we get that Eh Q [ M n(h) + [M]n(h)] i=1 Ea π0( |xi) " Eh Q I{h(xi)=a} π0(a|xi)2α c(xi, a)2 # + Eh Q I{h(xi)=ai} π0(ai|xi)2α c(xi, ai)2 , i=1 Ea π0( |xi) π0(a|xi)2α c(xi, a)2 + πQ(ai|xi) π0(ai|xi)2α c(xi, ai)2 . (29) Combining (24) and (29) yields i=1 Rα(πQ|xi) n ˆRα n(πQ)| DKL(Q P) + log(4/δ) i=1 Ea π0( |xi) π0(a|xi)2α c(xi, a)2 + πQ(ai|xi) π0(ai|xi)2α c(xi, ai)2 . (30) Exponential Smoothing for Off-Policy Learning This means that the following inequality holds with probability at least 1 δ/2 for any distribution Q on H |I3| DKL(Q P) + log(4/δ) i=1 Ea π0( |xi) π0(a|xi)2α c(xi, a)2 + λ πQ(ai|xi) π0(ai|xi)2α c(xi, ai)2 . (31) However we know that c(x, a)2 1 for any x X and a A and that c(xi, ai) = ci for any i [n]. Thus the following inequality holds with probability at least 1 δ/2 for any distribution Q on H |I3| DKL(Q P) + log(4/δ) i=1 Ea π0( |xi) πQ(ai|xi) π0(ai|xi)2α c2 i . (32) The union bound of (22) and (32) combined with the deterministic result in (23) yields that the following inequality holds with probability at least 1 δ for any distribution Q on H |R(πQ) ˆRα n(πQ)| DKL(Q P) + log 4 n i=1 Ea πQ( |xi) 1 π1 α 0 (a|xi) + DKL(Q P) + log(4/δ) i=1 Ea π0( |xi) πQ(ai|xi) π0(ai|xi)2α c2 i . (33) C.2. Extensions of Theorem 4.1 Proposition C.2 (Extension of Theorem 4.1 to hold simultaneously for any λ (0, 1)). Let n 1, δ [0, 1], α [0, 1], and let P be a fixed prior on H, then with probability at least 1 δ over draws Dn µn π0, the following holds simultaneously for any posterior Q on H, and for any λ (0, 1) that |R(πQ) ˆRα n(πQ)| KL 1(πQ, λ) 2n + Bα n(πQ) + KL 2(πQ, λ) 2 V α n (πQ) . KL 1(πQ, λ) = DKL(Q P) + log 8 n KL 2(πQ, λ) = 2 DKL(Q P) + log 8 Bα n(πQ) = 1 1 i=1 Ea πQ( |xi) π1 α 0 (a|xi) , V α n (πQ) = 1 i=1 Ea π0( |xi) + πQ(ai|xi) π0(ai|xi)2α c2 i . Proof. Let δ (0, 1). For any i 1, we define λi = 2 i and let δi = δλi. Then Theorem 4.1 yields that for any i 1, the following inequality holds with probability at least 1 δi for any Q on H |R(πQ) ˆRα n(πQ)| DKL(Q P) + log 4 n δi 2n + Bα n(πQ) + DKL(Q P) + log 4 δi nλi + λi 2 V α n (πQ) . Now notice that P i=1 λi = 1, and hence P i=1 δi = δ. Therefore, the union bound of the above inequalities over i 1 yields that with probability at least 1 δ, the following inequality holds with probability at least 1 δ for any Q on H and for any i 1 |R(πQ) ˆRα n(πQ)| DKL(Q P) + log 4 n δi 2n + Bα n(πQ) + DKL(Q P) + log 4 δi nλi + λi 2 V α n (πQ) . (34) Exponential Smoothing for Off-Policy Learning Let denote the ceiling function, then we have that for any λ (0, 1), there exists j = log λ log 2 1 such that λ/2 λj λ. Since (34) holds for any i 1, it holds in particular for j. 
In addition to this, we have that 1 λj 2 λ, that λj λ and that 1 δj = 1 λjδ 2 δλ. This yields that the following inequality holds with probability at least 1 δ for any Q on H and for any λ (0, 1) |R(πQ) ˆRα n(πQ)| DKL(Q P) + log 8 n δλ 2n + Bα n(πQ) + 2DKL(Q P) + log 8 2 V α n (πQ) . (35) The additional 2 in 2 DKL(Q P)+log 8 δλ nλ appears since we used that 1 λj 2 λ. Similarly, the additional 2 λ in the logarithmic terms is due to the fact that 1 δj 2 δλ. Finally, setting KL 1(πQ, λ) = DKL(Q P) + log 8 n KL 2(πQ, λ) = 2 DKL(Q P) + log 8 concludes the proof. Next, we provide a similar proof to extend Theorem 4.1 to any α (0, 1]. While we only provide a one-sided inequality, the same covering technique can be used to obtain the other side of the inequality. Proposition C.3 (One-sided extension of Theorem 4.1 to hold simultaneously for any α (0, 1) {1} ). Let n 1, δ [0, 1], λ > 0, and let P be a fixed prior on H, then with probability at least 1 δ over draws Dn µn π0, the following holds simultaneously for any posterior Q on H, and for any α (0, 1] that R(πQ) ˆRα n(πQ) + KL 1(πQ, α) 2n + Bα n(πQ) + KL 2(πQ, α) 2 V α n (πQ) . KL 1(πQ, α) = DKL(Q P) + log 8 n KL 2(πQ, α) = DKL(Q P) + log 8 Bα n(πQ) = 1 1 i=1 Ea πQ( |xi) π1 α 0 (a|xi) , V α n (πQ) = 1 i=1 Ea π0( |xi) + πQ(ai|xi) π0(ai|xi)2α c2 i . Proof. Let δ (0, 1). For any i 0, we define αi = 2 i and let δi = δαi/2. Then Theorem 4.1 yields that for any i 0, the following inequality holds with probability at least 1 δi for any Q on H |R(πQ) ˆRαi n (πQ)| DKL(Q P) + log 4 n δi 2n + Bαi n (πQ) + DKL(Q P) + log 4 2 V αi n (πQ) . Now notice that P i=0 αi = 2, and hence by definition of δi, we have P i=0 δi = δ. Therefore, the union bound of the above inequalities over i 0 yields that with probability at least 1 δ, the following inequality holds with probability at least 1 δ for any Q on H and for any i 0 |R(πQ) ˆRαi n (πQ)| DKL(Q P) + log 4 n δi 2n + Bαi n (πQ) + DKL(Q P) + log 4 2 V αi n (πQ) . (36) Let denote the floor function, then we have that for any α (0, 1], there exists j = log α log 2 0 such that α αj 2α. Since (34) holds for any i 0, it holds in particular for j. In addition, we have that Bα n(πQ) and ˆRα n(πQ) are Exponential Smoothing for Off-Policy Learning decreasing in α while V α n (πQ) is increasing in α. Therefore, we have that ˆRαj n (πQ) ˆRα n(πQ) , Bαj n (πQ) Bα n(πQ) , and V αj n (πQ) V 2α n (πQ). Moreover, we have that 1 δj 2 δα. This yields that the following inequality holds with probability at least 1 δ for any Q on H and for any α (0, 1] R(πQ) ˆRα n(πQ) + DKL(Q P) + log 8 n δα 2n + Bα n(πQ) + DKL(Q P) + log 8 2 V 2α n (πQ) . (37) Finally, setting KL 1(πQ, α) = DKL(Q P) + log 8 n KL 2(πQ, α) = DKL(Q P) + log 8 concludes the proof. C.3. Proof of Proposition 5.1 Haddouche & Guedj (2022, Theorem 7) provides an application of Lemma C.1 to the general PAC-Bayes learning problems in Section 4.1. We cannot apply their theorem directly to get Proposition 5.1 for two reasons. They assume that the loss function is non-negative and they derive a one-sided generalization bound. In our case, the loss function is negative and we want to derive a two-sided generalization bound. Fortunately, we show with a slight modification of their proof that the result can be extended to two-sided inequalities with negative losses. In fact, the only requirement is that the sign of loss is fixed. We show next how this is achieved. Proof. 
First, note that Lemma C.1 does not make any assumption on the sign of the martingale difference sequence (fi(Si, h))i [n] nor on the sign of the terms that decompose it. Now similarly to the proof in Appendix C.1, we define Sn = (xi, ai)i [n] as the set of n observed contexts and taken actions. Also, we let (Fi)i {0} [n] be a filtration adapted to Sn. For h H, we define fi (Si, h) as fi (Si, h) = fi (xi, ai, h) = f (xi, ai, h) = Ex ν,a π0( |x) π0(a|x)α c(x, a) I{h(xi)=ai} π0(ai|xi)α c(xi, ai) . Here fi(Si, h) only depends on the last samples xi, ai and the predictor h. For this reason, we denote it by fi (xi, ai, h). Also, the function fi does not depend on i and this is why we simplify the notation as fi (xi, ai, h) = f (xi, ai, h). Moreover, the randomness in f (xi, ai, h) is only due xi ν and ai π0( |xi); all other terms are deterministic. Thus the expectations are under xi ν, ai π0( |xi). Now similarly to the proof in Appendix C.1, we have that E [f (xi, ai, h) |Fi 1] = 0 for any i [n] , h H. Therefore, (f(xi, ai, h))i [n] is a martingale difference sequence for any h H. Thus we apply Lemma C.1 and get that that with probability at least 1 δ, the following holds simultaneously for any distribution Q on H |Eh Q [Mn(h)]| DKL(Q P) + log(2/δ) 2 (Eh Q [ M n(h) + [M]n(h)]) , (38) i=1 f (xi, ai, h) , i=1 E h f (xi, ai, h)2 |Fi 1 i , i=1 f (xi, ai, h)2 . Exponential Smoothing for Off-Policy Learning Now we compute Eh Q [Mn(h)] as Eh Q [Mn(h)] = i=1 Ex ν,a π0( |x) π0(a|x)α c(x, a) πQ(ai|xi) π0(ai|xi)α c(xi, ai) , = n Ex ν,a π0( |x) π0(a|x)α c(x, a) πQ(ai|xi) π0(ai|xi)α c(xi, ai) , (39) where we used the linearity of the expectation Eh Q [ ] and the definition of policies in (9). Moreover, similarly to the proof in Appendix C.1, we have that M n(h) + [M]n(h) = i=1 E h f (xi, ai, h)2 |Fi 1 i + f (xi, ai, h)2 i=1 Ex ν,a π0( |x) π0(a|x)2α c(x, a)2 + I{h(xi)=ai} π0(ai|xi)2α c(xi, ai)2 2Ex ν,a π0( |x) π0(a|x)α c(x, a) I{h(xi)=ai} π0(ai|xi)α c(xi, ai) , (i) n Ex ν,a π0( |x) π0(a|x)2α c(x, a)2 + I{h(xi)=ai} π0(ai|xi)2α c(xi, ai)2 , (40) where (i) holds since 2Ex ν,a π0( |x) h I{h(x)=a} π0(a|x)α c(x, a) i I{h(xi)=ai} π0(ai|xi)α c(xi, ai) 0 for any i [n]. This is where the non-negative loss assumption is not needed. Our loss Lα(h, x, a, c) = I{h(x)=a} π0(a|x)α c is negative since c [ 1, 0]. However, we only need the product between the loss and its expectation to be non-negative. This holds in particular when the loss has a fixed sign. In that case, the expectation of the loss and the loss itself will have the same sign and thus their product will be non-negative. In our case, the loss has a fixed negative sign and this is all we needed. Now notice that n Ex ν,a π0( |x) π0(a|x)α c(x, a) = n Rα(πQ) , πQ(ai|xi) π0(ai|xi)α c(xi, ai) = n ˆRα n(πQ) , where we used that c(xi, ai) = ci for any i [n] in the second equality. Using these two equalities and plugging (39) and (40) in (38) yields that with probability at least 1 δ, the following holds simultaneously for any distribution Q on H n Rα(πQ) ˆRα n(πQ) DKL(Q P) + log(2/δ) n Ex ν,a π0( |x) π0(a|x)2α c(x, a)2 πQ(ai|xi) π0(ai|xi)2α c(xi, ai)2 . (41) Again we used the linearity of the expectation Eh Q [ ] and the definition of policies in (9). Finally, we have that c(xi, ai) = ci for any i [n]. Thus with probability at least 1 δ the following inequality holds for any distribution Q on H Rα(πQ) ˆRα n(πQ) DKL(Q P) + log(2/δ) 2 Ex ν,a π0( |x) π0(a|x)2α c(x, a)2 πQ(ai|xi) π0(ai|xi)2α c2 i . (42) This concludes the proof. 
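To build intuition for the martingale-difference construction used in the two proofs above, here is a small numerical check, entirely ours and only for illustration. It verifies on a random finite-context, finite-action instance that the quantity f(x, a, h) defined in the proof of Proposition 5.1 has (numerically) zero mean when x ∼ ν and a ∼ π0(·|x). The uniform context distribution and the random cost table are assumptions made for the sketch.

```python
# Minimal sketch (ours, for intuition only): Monte Carlo check that
#   f(x, a, h) = E_{x'~nu, a'~pi0}[ 1{h(x')=a'} c(x',a') / pi0(a'|x')^alpha ]
#                - 1{h(x)=a} c(x,a) / pi0(a|x)^alpha
# has approximately zero mean when (x, a) ~ nu x pi0, i.e. the martingale-difference
# property used above. The finite context/action space is an assumption of the sketch.
import numpy as np

rng = np.random.default_rng(1)
n_ctx, K, alpha = 5, 4, 0.8
pi0 = rng.dirichlet(np.ones(K), size=n_ctx)        # pi0(a|x) for each context
cost = -rng.uniform(size=(n_ctx, K))               # c(x, a) in [-1, 0]
h = rng.integers(K, size=n_ctx)                    # a deterministic predictor h(x)

# Exact first term: expectation over x ~ Uniform(contexts) and a ~ pi0(.|x).
ind = np.zeros((n_ctx, K))
ind[np.arange(n_ctx), h] = 1.0
first_term = np.mean(np.sum(pi0 * ind * cost / pi0 ** alpha, axis=1))

# Monte Carlo estimate of E[ 1{h(x)=a} c(x,a) / pi0(a|x)^alpha ] under the same law.
m = 50_000
x = rng.integers(n_ctx, size=m)
a = np.array([rng.choice(K, p=pi0[xi]) for xi in x])
second_term = np.mean((h[x] == a) * cost[x, a] / pi0[x, a] ** alpha)

print(first_term - second_term)   # close to 0: E[f(x, a, h)] is numerically zero
```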
Exponential Smoothing for Off-Policy Learning C.4. Sample Complexity Proposition C.4. Let M1(H) be the set of probability distributions on the hypothesis space H, and let λ > 0, n 1, δ [0, 1], α [0, 1], and let P be a fixed prior on H, then with probability at least 1 δ over draws Dn µn π0, we have R(πˆQn) R(πQ ) + 2 2n + 2Bα n(πQ ) + 2 KL2(πQ ) nλ + λ V α n (πQ ) . where πˆQn is the learned policy with ˆQn = argmin Q M1(H) ˆRα n(πQ) + q 2n + Bα n(πQ) + KL2(πQ) 2 V α n (πQ) , Q = argmin Q M1(H) R(πQ), and KL1(πQ) = DKL(Q P) + log 4 n δ , KL2(πQ) = DKL(Q P) + log 4 Bα n(πQ) = 1 1 i=1 Ea πQ( |xi) π1 α 0 (a|xi) , V α n (πQ) = 1 i=1 Ea π0( |xi) + πQ(ai|xi)c2 i π0(ai|xi)2α . Proof. First, Theorem 4.1 holds for any potentially data dependent distribution Q on H. In particular, we have that with probability at least 1 δ the following inequalities hold simultaneously for ˆQn and Q |R(πˆQn) ˆRα n(πˆQn)| 2n + Bα n(πˆQn) + KL2(πˆQn) 2 V α n (πˆQn) , |R(πQ ) ˆRα n(πQ )| 2n + Bα n(πQ ) + KL2(πQ ) 2 V α n (πQ ) . Taking only one side of these inequalities yields that with probability at least 1 δ the following inequalities hold simultaneously for ˆQn and Q R(πˆQn) ˆRα n(πˆQn) + 2n + Bα n(πˆQn) + KL2(πˆQn) 2 V α n (πˆQn) | {z } (I) ˆRα n(πQ ) R(πQ ) + 2n + Bα n(πQ ) + KL2(πQ ) 2 V α n (πQ ) . Now using the definition of πˆQn, we know that I ˆRα n(πQ ) + 2n + Bα n(πQ ) + KL2(πQ ) 2 V α n (πQ ) . This yields that with probability at least 1 δ the following inequalities hold simultaneously for ˆQn and Q R(πˆQn) ˆRα n(πQ ) + 2n + Bα n(πQ ) + KL2(πQ ) 2 V α n (πQ ) , ˆRα n(πQ ) R(πQ ) + 2n + Bα n(πQ ) + KL2(πQ ) 2 V α n (πQ ) . Computing the sum of these two inequalities concludes the proof. Corollary C.5 (Special case of Proposition C.4). Let H = hθ ; θ Rd K of mappings hθ(x) = argmaxa A ϕ(x) θa for any x X. Let n 1, δ [0, 1], α [0, 1], and let P = N(µ0, Id K) be a fixed prior on H, then with probability at least 1 δ over draws Dn µn π0, we have that R(πˆQn) R(πQ ) + µ µ0 2 + 2 log 4 n δ n + 2(1 Kα 1) + µ µ0 2 + 2 log 4 δ n + K2α 1 + K2α where πˆQn is the learned policy with ˆQn = argmin Q=N(µ,Id K) ˆRα n(πQ) + q 2n + Bα n(πQ) + KL2(πQ) 2 V α n (πQ) , Q = argmin Q=N(µ,Id K) R(πQ). Exponential Smoothing for Off-Policy Learning Proof. This result follows from the general Proposition C.4 by simply setting P = N(µ0, Id K) and Q = N(µ , Id K). First, since the covariance matrices of both distributions are Id K, their KL divergence is DKL(Q P) = µ µ0 2/2. Moreover, since the logging policy is uniform then Bα n(πQ) = (1 Kα 1) and V α n (πQ) K2α 1 + K2α. Using these quantities, setting λ = 1/ n and applying Proposition C.4 yields that with probability at least 1 δ over draws Dn µn π0, we have that R(πˆQn) R(πQ ) + µ µ0 2 + 2 log 4 n δ n + 2(1 Kα 1) + µ µ0 2 + 2 log 4 δ n + K2α 1 + K2α This concludes the proof. The above corollary allows us to give insights into the sample complexity of our procedure. That is, the number of samples needed so that the performance of the learned policy πˆQn is close to that of the optimal one. Let ϵ > 2(1 Kα 1) for α [1 log 2/ log K, 1]. This condition on α ensures that ϵ [0, 1] and it is mild as α is often close to 1. Let δ, then the following implication holds µ µ0 2 + 2 log 4 n δ n + 2(1 Kα 1) + µ µ0 2 + 2 log 4 δ n + K2α 1 + K2α n = P(R(πˆQn) R(πQ ) + ϵ) 1 δ . (43) First, we use that q µ µ0 2 + 2 log 4 n δ . Moreover we bound K2α 1 + K2α 2K2α. Then the implication in (43) becomes n µ µ0 + µ µ0 2 + 2 log 4 ϵ 2(1 Kα 1) = P(R(πˆQn) R(πQ ) + ϵ) 1 δ . 
(44)

We only provide intuition on the sample complexity and aim at having easy-to-interpret terms. Thus we omit the logarithmic terms in (44) and assume that ||µ* − µ0||² is of the same order as ||µ* − µ0||. This leads to the claim made in Section 5.1. Of course, a more precise sample complexity analysis can be made by solving the condition in (44) exactly rather than dropping these terms.

D. Experiments

We consider the standard supervised-to-bandit conversion (Agarwal et al., 2014). Precisely, let S^TR_n and S^TS_{n_TS} be the training and testing sets of a classification dataset, respectively. First, we transform the training set S^TR_n into logged bandit data Dn as described in Algorithm 1. The resulting logged data Dn is then used to train our policies. After that, the learned policies are tested on S^TS_{n_TS} as described in Algorithm 2. We consider that the resulting reward in Algorithm 2 is a good proxy for the unknown true reward of the learned policies. This will be our performance metric; the higher, the better. In our experiments, we use the following image classification datasets: MNIST (LeCun et al., 1998), Fashion MNIST (Xiao et al., 2017), EMNIST (Cohen et al., 2017) and CIFAR100 (Krizhevsky et al., 2009). We provide a summary of the statistics of these datasets in Table 1.

Table 1. Statistics of the datasets used in our experiments.

  Dataset          Train samples (n)    Test samples (n_TS)    Actions (K)    Dimension (d)
  MNIST            60000                10000                  10             784
  Fashion MNIST    60000                10000                  10             784
  EMNIST           112800               18800                  47             784
  CIFAR100         50000                10000                  100            2048

Algorithm 1 takes as input a logging policy π0, which we define as

  π0(a|x) = exp(η0 φ(x)ᵀ µ0,a) / Σ_{a'∈A} exp(η0 φ(x)ᵀ µ0,a') ,   (x, a) ∈ X × A .   (45)

Here φ(x) ∈ R^d is the feature transformation function that outputs a d-dimensional vector, µ0 = (µ0,a)_{a∈A} ∈ R^{d×K} are learnable parameters, and η0 is an inverse-temperature parameter for the softmax in (45). We explain next how these quantities are derived in detail.

The feature transformation function φ(x) ∈ R^d: for all the datasets except CIFAR100, the feature transformation function φ(·) is defined as φ(x) = x/||x|| for any x ∈ X. That is, we simply normalize the features x ∈ X by their L2 norm ||x||. In contrast, CIFAR100 is a more challenging problem. Thus we use transfer learning to extract features φ(x) that are expressive enough for a linear softmax model to enjoy reasonable performance. Precisely, we retrieve the last hidden layer of a ResNet-50 network, pre-trained on the ImageNet dataset, to output 2048-dimensional features. Finally, the obtained features are normalized as x/||x||, and this whole process (ResNet-50 + normalization) corresponds to φ(·) for CIFAR100.

The parameters µ0 = (µ0,a)_{a∈A} ∈ R^{d×K}: we learn the parameters µ0 using 5% of the training set S^TR_n. Precisely, we use the cross-entropy loss with an L2 regularization of 10^{-6} to prevent the logging policy π0 from being degenerate. This ensures that the learning policies are absolutely continuous with respect to the logging policy π0, a condition under which standard IPS is unbiased. In optimization, we use Adam (Kingma & Ba, 2014) with a learning rate of 0.1 for 10 epochs. In all the experiments, we set the prior to P = N(η0µ0, I_{dK}) for the Gaussian policies in (12) and to P = N(η0µ0, I_{dK}) ⊗ G(0, 1)^K for the mixed-logit policies in (11). Our theory requires that the prior does not depend on data.
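To make the construction of the logging policy in (45) concrete, here is a minimal sketch; it is ours and only illustrative. In the sketch, µ0 and the context are random stand-ins (in the paper, µ0 is fit by cross-entropy on a small fraction of the training set), while the L2-normalizing feature map φ and the inverse-temperature η0 follow the description above.

```python
# A minimal sketch (assumptions noted above) of the softmax logging policy in (45):
#   pi0(a|x) = exp(eta0 * phi(x)^T mu0_a) / sum_a' exp(eta0 * phi(x)^T mu0_a') .
# mu0 and x are random stand-ins here; in the paper mu0 is learned on 5% of the data.
import numpy as np

def phi(x):
    """L2-normalize raw features, as done for all datasets except CIFAR100."""
    return x / np.linalg.norm(x)

def logging_policy(x, mu0, eta0):
    """Return pi0(.|x) as a length-K probability vector."""
    logits = eta0 * phi(x) @ mu0                  # shape (K,)
    logits -= logits.max()                        # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
d, K = 784, 10
mu0 = rng.normal(size=(d, K))                     # stand-in for the learned parameters
x = rng.uniform(size=d)                           # stand-in for an MNIST image

for eta0 in [0.0, 0.5, 1.0]:                      # eta0 = 0 gives the uniform policy
    p = logging_policy(x, mu0, eta0)
    print(eta0, p.round(3), p.sum().round(6))
```

Setting η0 = 0 recovers the uniform logging policy, consistent with the role of η0 described below.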
Given that µ0 is learned on the 5% portion of data, we only train our learning policies on the remaining 95% portion of the data to match our theoretical requirements. The inverse-temperature parameter η0 R: this controls the performance of the logging policy. A high positive value of η0 leads to a well-performing logging policy, while a negative one leads to a low-performing logging policy. When η0 = 0, π0 is identical to the uniform policy. In our experiments η0 varies between 0 and 1. Algorithm 1 Supervised-to-bandit: creating logged data Input: training classification set S TR n = {(xi, yi)}n i=1, logging policy π0. Output: logged bandit data Dn = (xi, ai, ci)i [n]. Initialize Dn = {} for i = 1, . . . , n do ai π0( |xi) ci = I{ai=yi} Dn Dn {(xi, ai, ci)} . Algorithm 2 Supervised-to-bandit: testing policies Input: image classification dataset S TS n TS = {(xi, yi)}n TS i=1, learned policy ˆπn. Output: reward r. for i = 1, . . . , n TS do ai ˆπn( |xi) ri = I{ai=yi} r = 1 n TS Pn TS i=1 ri . Now it remains to explain the learning policies πQ and the corresponding closed-form bounds using either our results or those in existing works (London & Sandler, 2019; Sakhi et al., 2022). D.2. Policies Here we present the two families of policies that we use in our experiments, Gaussian and mixed-logit policies. Exponential Smoothing for Off-Policy Learning D.2.1. MIXED-LOGIT Let H = hθ,γ ; θ Rd K, γ RK be a hypothesis space of mappings hθ,γ(x) = argmaxa A ϕ(x) θa + γa for any x X. Here ϕ(x) outputs a d-dimensional representation of context x X. Now assume that for any a A, γa is a standard Gumbel perturbation, γa G(0, 1), then we have that πSOF θ (a|x) = exp(ϕ(x) θa) P a A exp(ϕ(x) θa ) , = Eγ G(0,1)K I{hθ,γ(x)=a} . (46) In addition, we randomize θ such as θ N(µ, σ2Id K) where µ Rd K and σ > 0. It follows that the posterior Q is a multivariate Gaussian N(µ, σ2Id K) over the parameters θ with standard Gumbel perturbations γ G(0, 1)K. We denote such policies by πMIXL µ,σ and they are defined as πMIXL µ,σ (a|x) = Eθ N(µ,σ2Id K) exp(ϕ(x) θa) P a A exp(ϕ(x) θa ) = Eθ N(µ,σ2Id K) [πSOF θ (a|x)] , = Eθ N(µ,σ2Id K) ,γ G(0,1)K I{hθ,γ(x)=a} . (47) To sample from the mixed-logit policies πMIXL µ,σ , we first sample θ N(µ, σ2Id K) and γ G(0, 1)K and then set the sampled action as a hθ,γ(x). Now we also need to compute the gradient of the expectation in (47). This needs additional care since the distribution under which we take the expectation depends on the parameters µ, σ. Fortunately, the reparameterization trick can be used in this case. Roughly speaking, it allows us to express a gradient of the expectation in (47) as an expectation of a gradient. In our case, we use the local reparameterizaton trick (Kingma et al., 2015) which is known for reducing the variance of stochastic gradients. Precisely, we rewrite (47) as πMIXL µ,σ (a|x) = Eϵ N(0, ϕ(x) 2IK) exp(ϕ(x) µa + σϵa) P a A exp(ϕ(x) µa + σϵa ) = Eϵ N(0,IK) exp(ϕ(x) µa + σϵa) P a A exp(ϕ(x) µa + σϵa ) where we used that ϕ(x) 2 = 1 since features are normalized. It follows that gradients read µ,σπMIXL µ,σ (a|x) = Eϵ N(0,IK) µ,σ exp(ϕ(x) µa + σϵa) P a A exp(ϕ(x) µa + σϵa ) Moreover, the propensities are approximated as πMIXL µ,σ (a|x) 1 exp(ϕ(x) µa + σϵi,a) P a A exp(ϕ(x) µa + σϵi,a ) , ϵi N(0, IK) , i [S] . (48) In all our experiments, we set S = 32. D.2.2. GAUSSIAN We define the hypothesis space H = hθ ; θ Rd K of mappings hθ(x) = argmaxa A ϕ(x) θa for any x X. 
It follows that the learning policies πQ = πGAUS µ,σ read πGAUS µ,σ (a|x) = Eθ N(µ,σ2Id K) I{hθ(x)=a} . (49) To see why this can be beneficial (Sakhi et al., 2022), let π be the optimal policy. Given x X, π ( |x) should be deterministic; it chooses the best action for context x with probability 1. That is, there exists µ Rd K such that π = I{hµ (x)=a}. When µ µ and σ 0, the Gaussian policy in (49) approaches π . In contrast, the mixed-logit policy in (47) approaches πSOF µ . However, πSOF µ is not deterministic due to the additional randomness in γ and is equal to π only if Exponential Smoothing for Off-Policy Learning ϕ(x) µ ,a (x) . This explains the choice of removing the Gumbel noise. First, Sakhi et al. (2022) showed that (49) can be written as πGAUS µ,σ (a|x) = Eϵ N(0,1) h Y a =a Φ ϵ + ϕ(x) (µa µa ) where Φ is the cumulative distribution function of a standard normal variable. But ϕ(x) = 1 in all our experiments. Thus πGAUS µ,σ (a|x) = Eϵ N(0,1) h Y a =a Φ ϵ + ϕ(x) (µa µa ) Then similarly to mixed-logit policies, the gradient reads µ,σπGAUS µ,σ (a|x) = Eϵ N(0,1) h µ,σ Y a =a Φ ϵ + ϕ(x) (µa µa ) Moreover, the propensities are approximated as πGAUS µ,σ (a|x) 1 a =a Φ ϵi + ϕ(x) (µa µa ) σ , ϵi N(0, 1) , i [S] . (50) In all our experiments, we set S = 32. D.3. Baselines Here we present all the methods that we use in our experiments. For each method, we state the result that holds for any learning policy π. After that, we derive the corresponding closed-form bounds for Gaussian and mixed-logit policies that we presented previously. All the baselines require computing the KL divergence between the prior P and the posterior Q. Thus before presenting them, we state the following lemma that allows bounding the KL divergence between the prior P and the posterior Q in the cases of mixed-logit or Gaussian policies. Lemma D.1 (KL divergence for Gaussian distributions with Gumbel noise). For distributions P = N µ0, σ2 0Id K G(0, 1)K and Q = N µ, σ2Id K G(0, 1)K, with µ0, µ Rd K and 0 < σ2 σ2 0 < , DKL(Q P) µ µ0 2 2σ2 0 + d K 2 log σ2 0 σ2 . Moreover, this result holds when the Gumbel noise is removed. That is when P = N µ0, σ2 0Id K and Q = N µ, σ2Id K . We borrow this lemma from London & Sandler (2019). In particular, Lemma D.1 shows that the KL terms for both policies can be bounded by the same quantity. As a result, the corresponding bounds will be the same; the only difference is the space of learning policies on which we optimize. For completeness, however, we write these bounds for both types of policies although they are similar. Since existing approaches are not named, we name them as (Author, Policy) where Author {Ours, London et al., Sakhi et al. 1, Sakhi et al. 2} and Policy {Gaussian, Mixed-Logit} . Here Ours, London et al., Sakhi et al. 1 and Sakhi et al. 2 correspond to Theorem 4.1, London & Sandler (2019, Theorem 1), Sakhi et al. (2022, Proposition 1), Sakhi et al. (2022, Proposition 3), respectively. For example, London & Sandler (2019, Theorem 1) leads to two baselines (London et al., Gaussian) and (London et al., Mixed-Logit). In all our experiments, the learning policies are trained using Adam (Kingma & Ba, 2014) with a learning rate of 0.1 for 20 epochs. D.3.1. OURS, THEOREM 4.1 (Ours, Gaussian) Here we use the Gaussian policies in (49). Thus we only replace the term, DKL(Q P), with its closed-form bound in Lemma D.1. This leads to the following objective. 
min µ Rd K,σ>0 ˆRα n πGAUS µ,σ + 2 log σ2 + log 4 n δ 2n + Bα n(πGAUS µ,σ ) + 2 log σ2 + log 4 2 V α n (πGAUS µ,σ ) , Exponential Smoothing for Off-Policy Learning where we used that σ0 = 1 since our prior is P = N(η0µ0, Id K) for Gaussian policies. Moreover, we set 2 log σ2+log 4 δ n V α n (πGAUS µ,σ ) . (Ours, Mixed-Logit) Here we use the mixed-logit policies in (47). Thus we only replace the terms, DKL(Q P), with their closed-form bound in Lemma D.1. This leads to the following objective. min µ Rd K,σ>0 ˆRα n πMIXL µ,σ + 2 log σ2 + log 4 n δ 2n + Bα n(πMIXL µ,σ ) + 2 log σ2 + log 4 2 V α n (πMIXL µ,σ ) , where we used that σ0 = 1 since our prior is P = N(η0µ0, Id K) G(0, 1)K for mixed-logit policies. Moreover, we set 2 log σ2+log 4 δ n V α n (πMIXL µ,σ ) . D.3.2. LONDON & SANDLER (2019, THEOREM 1) Proposition D.2. Let τ (0, 1), n 1, δ (0, 1) and let P be a fixed prior on H, then with probability at least 1 δ over draws Dn µn π0, the following holds simultaneously for all posteriors, Q, on H that R (πQ) ˆRτ n (πQ) + v u u t2 ˆRτn (πQ) + 1 τ DKL(Q P) + log n τ(n 1) + 2 DKL(Q P) + log n τ(n 1) . (51) Baseline 1: (London et al., Gaussian) Here we use the Gaussian policies in (49). Thus we only replace the terms, DKL(Q P), with their closed-form bound in Lemma D.1. This leads to the following objective. min µ Rd K,σ>0 ˆRτ n πGAUS µ,σ + v u u t2 ˆRτn πGAUS µ,σ + 1 2 log σ2 + log n 2 log σ2 + log n where we used that σ0 = 1 since our prior is P = N(η0µ0, Id K) for Gaussian policies. Baseline 2: (London et al., Mixed-Logit) Here we consider the mixed-logit policies in (47). Since the additional Gumbel noise does not affect the KL divergence (Lemma D.1), we have the same objective as in the Gaussian case. That is min µ Rd K,σ>0 ˆRτ n πMIXL µ,σ + v u u t2 ˆRτn πMIXL µ,σ + 1 2 log σ2 + log n 2 log σ2 + log n where we used that σ0 = 1 since our prior is P = N(η0µ0, Id K) G(0, 1)K for mixed-logit policies. D.3.3. SAKHI ET AL. (2022, PROPOSITION 1) Proposition D.3. Let τ (0, 1), n 1, δ (0, 1) and let P be a fixed prior on H, then with probability at least 1 δ over draws Dn µn π0, the following holds simultaneously for all posteriors, Q, on H that R (πQ) min λ>0 1 τ (eλ 1) 1 e τλ ˆ Rτ n(πQ)+ DKL(Q P)+log 2 n Exponential Smoothing for Off-Policy Learning Baseline 3: (Sakhi et al. 1, Gaussian) Here we use the Gaussian policies in (49). min µ Rd K,σ>0,λ>0 1 e τλ ˆ Rτ n(πGAUS µ,σ )+ 2 log σ2+log 2 n where we used that σ0 = 1 since our prior is P = N(η0µ0, Id K) for Gaussian policies. Baseline 4: (Sakhi et al. 1, Mixed-Logit) Here we consider the mixed-logit policies in (47). min µ Rd K,σ>0,λ>0 1 e τλ ˆ Rτ n(πMIXL µ,σ )+ 2 log σ2+log 2 n where we used that σ0 = 1 since our prior is P = N(η0µ0, Id K) G(0, 1)K for mixed-logit policies. D.3.4. SAKHI ET AL. (2022, PROPOSITION 3) Proposition D.4. Let τ (0, 1), n 1, δ (0, 1), let P be a fixed prior on H, and let Λ = {λi}i [nλ] a set of nλ positive scalars. Then with probability at least 1 δ over draws Dn µn π0, the following holds simultaneously for all posteriors, Q, on H and any λi Λ, R (πQ) ˆRτ n (πQ) + DKL(Q P) + log 4 n δ 2n + DKL(Q P) + log 2nλ Vτ n (πQ) , (55) where g : u exp(u) 1 u u2 and Vτ n(πQ) = 1 n Pn i=1 Ea πQ( |xi) h π0(a|xi) max(τ,π0(a|xi))2 i , . Baseline 5: (Sakhi et al. 2, Gaussian) Here we consider the Gaussian policies in (49). min µ Rd K,σ>0,λ Λ ˆRτ n πGAUS µ,σ + 2 log σ2 + log 4 n 2 log σ2 + log 2nλ Vτ n πGAUS µ,σ , (56) where we used that σ0 = 1 since our prior is P = N(η0µ0, Id K) for Gaussian policies. Baseline 6: (Sakhi et al. 
2, Mixed-Logit) Here we consider the mixed-logit policies in (47). min µ Rd K,σ>0,λ Λ ˆRτ n πMIXL µ,σ + 2 log σ2 + log 4 n 2 log σ2 + log 2nλ Vτ n πMIXL µ,σ , (57) where we used that σ0 = 1 since our prior is P = N(η0µ0, Id K) G(0, 1)K for mixed-logit policies. D.4. Additional Results and Discussion In Figure 4, we report the reward of the learned policy using one of the considered methods. We make the following observations: Choice of τ and α: in Figure 4, we set τ = 1/ 4 n 0.06 and α = 1 1/ 4 n 0.94 so that when n is large enough, both ˆRτ n(π) and ˆRα n(π) approach ˆRIPS n (π) (Ionides, 2008). This is because standard IPS should be preferred when n . For completeness, we also show in Figure 5 that the choice of α and τ does not affect the conclusions that we make here. We also include in Figure 5 the results with an adaptive and data-dependent α obtained using (14) in Section 4.4. The results in Figure 5 will be discussed in detail after we finish analyzing the results in Figure 4. Exponential Smoothing for Off-Policy Learning Overall performance: our method outperforms the baselines for any class of learning policies (Gaussian or mixed-logit) and any choice of logging policies. The only exception is when the logging policy is uniform. Effect of the class of learning policies: the class of policies, Gaussian or mixed-logit, affects the performance of all the baselines. In general, Gaussian policies behave better than mixed-logit policies. However, this is less significant for our method; the performance of both Gaussian and mixed-logit policies are comparable, and in both cases, our method outperforms the baselines with Gaussian policies. Therefore, in general, Gaussian policies should be preferred over mixed-logit policies. But in case engineering constraints impose the choice of mixed-logit or softmax policies, then the performance of our method is robust to this choice. Effect of the logging policy: our method reaches the maximum reward even when the logging policy is not performing well. In contrast, the baselines only reach their best reward when the logging policy is already well-performing (η0 1), in which case minor to no improvements are made. Note that the baselines have a better reward than ours when the logging policy is uniform. But our method has better reward when the logging policy is not uniform, that is when η0 > 0. This is more common in practice since the logging policy is deployed in production and thus it is expected to perform better than the uniform policy. In Figure 5, we compare our method to (Sakhi et al. 2) with Gaussian policies since this was the best-performing baseline in our experiments in Figure 4. Note that we did not include CIFAR100 in Figure 4 as it was computationally heavy to run these experiments with varying η0, α and τ for a very high-dimensional dataset such as CIFAR100. We consider 20 varying values of τ and α evenly spaced in (0, 1). We also include the results using the adaptive tuning procedure of α described in Section 4.4 (green curve). We make the following observations: Adaptive and data-dependent α: This procedure is reliable since the performance with an adaptive α (green curve) is comparable with the best possible choice of α. This is consistent for the three datasets. Effect of the choice α: as we observed before, the only case where the choice of α may lead to bad-performing policies is when the logging policy is uniform. 
When the logging policy is not uniform, our method outperforms the best baseline with the best τ for a wide range of values of α. Also, note that there is no very bad choice of α, in contrast with τ → 0, which led to a very poorly performing policy that only slightly improved upon the logging policy. This attests to the robustness of our method to the choice of α. Moreover, our bound regularizes α better; it contains a bias-variance trade-off term for α. Also, the bound of (Sakhi et al. 2) has a 1/τ term, making it vacuous for small values of τ.

Best choice of α: to see the effect of α across varying problems, we consider the following experiment. We split the logging policies into two groups. The first is modest logging, which corresponds to logging policies whose η0 is between 0 and 0.5. This includes uniform logging policies and other average-performing logging policies. The second is good logging, which corresponds to logging policies whose η0 is between 0.5 and 1. After that, for each α, we compute the average reward of the learned policy across either the group of modest or good logging policies. For each dataset, this leads to the two red and green curves in the second row of Figure 5. Overall, we observe that α ≈ 0.7 leads to the best performance for the modest logging group. Thus when the performance of the logging policy is average, regularizing the importance weights can be critical. In contrast, when the performance of the logging policy is already good, regularization is less needed and we can set α ≈ 1. Fortunately, one of the main strengths of this work is that our bound also holds for standard IPS, recovered for α = 1. The bounds in all prior works cannot provide good performance for standard IPS due to their dependency on 1/τ.

[Figure 4: four panels (MNIST, K=10, d=784; Fashion MNIST, K=10, d=784; EMNIST, K=47, d=784; CIFAR100, K=100, d=2048); x-axis: inverse-temperature parameter η0; y-axis: reward of the learned policy; curves: Ours, London et al., Sakhi et al. 1 and Sakhi et al. 2, each with Gaussian and Mixed-Logit policies.]

Figure 4. The reward of the learned policy for four datasets with varying quality of the logging policy η0 ∈ [0, 1].

[Figure 5: first row of three panels (MNIST, Fashion MNIST, EMNIST); x-axis: inverse-temperature parameter η0; y-axis: reward of the learned policy; curves: Logging, Ours, Sakhi et al. 2, Ours (Adaptive). Second row of three panels (same datasets); x-axis: smoothing parameter α; y-axis: reward of the learned policy; curves: Modest Logging, Good Logging.]

Figure 5. In the first row, we report the reward of the learned policy with 20 evenly spaced values of τ ∈ (0, 1) and α ∈ (0, 1), varying η0 ∈ [0, 1], and an adaptive, data-dependent α obtained using (14) in Section 4.4. The blue-to-cyan colors correspond to different values of τ: the lighter the color, the higher the value of τ. For instance, the cyan lines correspond to high values of τ while the blue ones correspond to very small values of τ.
Similarly, the red-to-yellow colors correspond to different values of α: the lighter the color, the higher the value of α. For instance, the yellow lines correspond to high values of α while the red ones correspond to very small values of α. Finally, the green curve corresponds to the reward of the learned policy using the adaptive, data-dependent α described in (14) (Section 4.4). In the second row, we report the average reward of the learned policies using our method across the modest logging group (η0 ∈ [0, 0.5], in red) and the good logging group (η0 ∈ [0.5, 1], in green).

D.5. Learning Principles

Here we compare our bound in Theorem 4.1 and our learning principle in (16) to the one in London & Sandler (2019). We do not include the learning principle of Swaminathan & Joachims (2015a) since the one in London & Sandler (2019) enjoys similar performance and is far more scalable. The learning principle of London & Sandler (2019) is defined as

  min_µ  R̂^τ_n(π_µ) + λ ||µ − µ0||² ,   (58)

where λ is a tunable hyper-parameter, π_µ is the softmax policy defined in (46) and µ ∈ R^{d×K} is its parameter vector. This learning principle is referred to as (London et al., LP). In contrast, our learning principle is defined as

  min_µ  R̂^α_n(π_µ) + λ1 ||µ − µ0||² + λ2 V^α_n(π_µ) + λ3 B^α_n(π_µ) ,   (59)

where λ1, λ2 and λ3 are tunable hyper-parameters and π_µ is the Gaussian policy in (12) with a fixed σ = 1. Our learning principle is referred to as (Ours, LP). Finally, our bound in Theorem 4.1 with Gaussian policies is referred to as (Ours, Bound). Similarly to the previous experiments, we set τ = 1/n^{1/4} ≈ 0.06 and α = 1 − 1/n^{1/4} ≈ 0.94 so that, when n is large enough, both R̂^τ_n(π) and R̂^α_n(π) approach R̂^IPS_n(π) (Ionides, 2008). For the learning principles, we tried multiple values of the hyper-parameters λ, λ1, λ2 and λ3, all between 10^{-5} and 10^{-1}. For instance, we found that the best hyper-parameter for London & Sandler (2019) is λ = 10^{-5}, which matches the value they found in their Fashion MNIST experiments. For our learning principle, the best hyper-parameters were λ1 = 10^{-5}, λ2 = 10^{-5} and λ3 = 10^{-5}. In contrast, our bound does not require hyper-parameter tuning. We report in Figure 6 the reward of the learned policy on Fashion MNIST for all these methods with varying values of the hyper-parameters. To reduce clutter, we only report the reward for good choices of the hyper-parameters λ, λ1, λ2 and λ3. We observe that, for a wide range of hyper-parameters, our learning principle outperforms the one in London & Sandler (2019). However, both learning principles are sensitive to the choice of hyper-parameters. In contrast, our bound does not require the tuning of any additional hyper-parameter and it achieves the best performance except for the uniform logging policy. In addition to being more theoretically grounded, this approach also enjoys favorable empirical performance without additional hyper-parameter tuning, an important practical consideration.

[Figure 6: one panel (Fashion MNIST, K=10, d=784); x-axis: inverse-temperature parameter η0; y-axis: reward of the learned policy; curves: Logging, Ours LP, London et al. LP, Ours Bound.]

Figure 6. The reward of the learned policy using either our bound in Theorem 4.1 (referred to as (Ours, Bound), in green), our learning principle in (16) (referred to as (Ours, LP), in red, for multiple values of the hyper-parameters), or the learning principle in London & Sandler (2019) (referred to as (London et al., LP), in blue, for multiple values of the hyper-parameters).
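As a complement to this comparison, the following sketch shows how the penalized objective (59) could be written down for a linear softmax policy. It is our own illustrative template, not the code used in the experiments: the expressions for B^α_n and V^α_n are our reading of the definitions in Appendix C (with the squared costs bounded by one in the first variance term), the helper names are ours, and the sketch assumes access to the full propensity matrix π0(·|x_i), which is available in the supervised-to-bandit setup.

```python
# A hedged sketch (not the paper's code) of the penalized objective in (59),
#   R_hat^alpha_n(pi_mu) + lam1 * ||mu - mu0||^2 + lam2 * V^alpha_n(pi_mu)
#                        + lam3 * B^alpha_n(pi_mu),
# for a linear softmax policy pi_mu. The forms of B^alpha_n and V^alpha_n are our reading
# of Appendix C; treat this purely as a template.
import numpy as np

def softmax_policy(mu, X):
    """pi_mu(.|x) for every row x of X; mu has shape (d, K)."""
    logits = X @ mu
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)       # shape (n, K)

def penalized_objective(mu, mu0, X, a, c, pi0_all, alpha, lam1, lam2, lam3):
    """X: contexts (n, d); a, c: logged actions and costs; pi0_all: pi0(.|x_i), shape (n, K)."""
    n = len(a)
    pi = softmax_policy(mu, X)                              # pi_mu(.|x_i)
    pi_a = pi[np.arange(n), a]                              # pi_mu(a_i|x_i)
    pi0_a = pi0_all[np.arange(n), a]                        # pi0(a_i|x_i)
    risk = np.mean(c * pi_a / pi0_a ** alpha)               # R_hat^alpha_n(pi_mu)
    bias = 1.0 - np.mean(np.sum(pi * pi0_all ** (1 - alpha), axis=1))          # B^alpha_n
    var = np.mean(np.sum(pi * pi0_all ** (1 - 2 * alpha), axis=1)
                  + pi_a * c ** 2 / pi0_a ** (2 * alpha))                      # V^alpha_n
    return risk + lam1 * np.sum((mu - mu0) ** 2) + lam2 * var + lam3 * bias
```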
D.6. Other Importance Weight Corrections

Su et al. (2020) and Metelli et al. (2021) also proposed corrections that differ from hard clipping (a detailed comparison is given in Section 3). However, they were not included in our main experiments since they do not provide generalization guarantees; they focus on OPE and only propose a heuristic for OPL, in their Appendix B.2 and Section 6.1.2, respectively. Those heuristics are not based on theory, in contrast with ours, which is directly derived from our generalization bound. However, for completeness, we also compare our regularization of the importance weights to theirs. To make such a comparison, we use the hyper-parameters and tuning procedures provided in Section 6 and Appendix B.2 for Metelli et al. (2021) and in Sections 5 and 6.1.2 for Su et al. (2020). Overall, we observe in Figure 7 that our method outperforms these baselines in OPL, and the gap is more significant when the logging policy is not performing well.

[Figure 7: three panels (MNIST, K=10, d=784; Fashion MNIST, K=10, d=784; EMNIST, K=47, d=784); x-axis: inverse-temperature parameter η0; y-axis: reward of the learned policy; curves: Ours, Metelli et al. (2021), Su et al. (2020), Logging.]

Figure 7. The reward of the learned policy with varying quality of the logging policy η0 ∈ [0, 1] using either our regularization (α-IPS) or the ones in Su et al. (2020) and Metelli et al. (2021).
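For intuition, the short sketch below (ours, purely illustrative) contrasts the weight corrections discussed in this paper on a few propensity values: standard IPS, one common variant of hard clipping, the α-smoothing studied here, and the β-smoothing of Appendix B. The corrections of Su et al. (2020) and Metelli et al. (2021) are not reproduced, since their exact formulas are not restated here.

```python
# Minimal sketch (ours, for intuition): several ways of taming the importance weight
# w = pi(a|x) / pi0(a|x). Clipping at M and the alpha/beta smoothing follow the forms
# used in this paper; M = 10 and the exponent 0.7 are arbitrary illustrative choices.
import numpy as np

pi_a, pi0_a = 0.6, np.array([0.5, 0.1, 0.01, 0.001])   # pi(a|x) and a range of pi0(a|x)

w_ips   = pi_a / pi0_a                                  # standard IPS weight
w_clip  = np.minimum(w_ips, 10.0)                       # hard clipping at M = 10
w_alpha = pi_a / pi0_a ** 0.7                           # IPS-alpha: smooth pi0 in the denominator
w_beta  = (pi_a / pi0_a) ** 0.7                         # IPS-beta: smooth the whole ratio

for name, w in [("IPS", w_ips), ("clipped", w_clip), ("alpha=0.7", w_alpha), ("beta=0.7", w_beta)]:
    print(f"{name:>9}: {np.round(w, 2)}")
```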