# Distributionally Robust Counterfactual Risk Minimization

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Louis Faury,1,2 Ugo Tanielian,1,3 Elvis Dohmatob,1 Elena Smirnova,1 Flavian Vasile1
1 Criteo AI Lab
2 LTCI, Télécom ParisTech, Université Paris-Saclay
3 LPSM, Université Paris 6
{l.faury, u.tanielian, e.dohmatob, e.smirnova, f.vasile}@criteo.com
Equal contribution.

Abstract

This manuscript introduces the idea of using Distributionally Robust Optimization (DRO) for the Counterfactual Risk Minimization (CRM) problem. Tapping into a rich existing literature, we show that DRO is a principled tool for counterfactual decision making. We also show that well-established solutions to the CRM problem, such as sample-variance penalization schemes, are special instances of a more general DRO problem. In this unifying framework, a variety of distributionally robust counterfactual risk estimators can be constructed using various probability distances and divergences as uncertainty measures. We propose the use of the Kullback-Leibler divergence as an alternative way to model uncertainty in CRM and derive a new robust counterfactual objective. In our experiments, we show that this approach outperforms the state of the art on four benchmark datasets, validating the relevance of using other uncertainty measures in practical applications.

1 Introduction

Learning how to act from historical data is a largely studied field in machine learning (Strehl et al. 2010; Dudík, Langford, and Li 2011; Li et al. 2011; 2015), spanning a wide range of applications where a system interacts with its environment (e.g., search engines, ad placement and recommender systems). Interactions are materialized by the actions taken by the system, themselves rewarded by a feedback measuring their relevance. Both quantities can be logged at little cost, and subsequently used to improve the performance of the learning system. The Batch Learning from Bandit Feedback (BLBF) framework (Swaminathan and Joachims 2015a; 2015b) describes such a situation, where a contextual decision making process must be improved based on the logged history of implicit feedback observed only on a subset of actions. Counterfactual estimators (Bottou et al. 2013) make it possible to forecast the performance of any system from the logs, as if it had been taking the actions itself. This enables the search for an optimal system, even with observations biased towards actions favored by the logger. A natural approach to carry out this search consists in favoring systems that select actions with high empirical counterfactual rewards. However, this initiative can be rather burdensome, as it suffers from a crucial caveat intimately linked with a phenomenon known as the optimizer's curse (Capen et al. 1971; Smith and Winkler 2006; Thaler 2012): sorting actions by their empirical reward average can be suboptimal, since the resulting expected post-decision surprise is non-zero. In real-life applications, where the space of possible actions is often extremely large and where decisions are taken based on very low sample sizes, the consequences of this phenomenon can be dire, which motivates the design of principled robust solutions.
As a solution to this Counterfactual Risk Minimization (CRM) problem, the authors of (Swaminathan and Joachims 2015a) proposed a modified action selection process which penalizes behaviors resulting in high-variance estimates. In this paper, we argue that another line of reasoning resides in using Distributionally Robust Optimization (DRO) for the CRM problem. DRO has indeed proven to be a powerful tool both in decision theory (Duchi, Glynn, and Namkoong 2016; Esfahani and Kuhn 2018; Blanchet and Murthy 2019) and in the training of robust classifiers (Madry et al. 2017; Xiao et al. 2018; Hu et al. 2018). Under the DRO formulation, one treats the empirical distribution with skepticism and hence seeks a solution that minimizes the worst-case expected cost over a family of distributions, described in terms of an uncertainty ball. Using distributionally robust optimization, one can therefore control the magnitude of the post-decision surprise, which is critical to counterfactual analysis. We motivate the use of DRO for the CRM problem with asymptotic guarantees and bring to light a formal link between the variance penalization solution of (Swaminathan and Joachims 2015a) and a larger DRO problem for which the uncertainty set is defined with the chi-square divergence. Building on this, we propose the use of other uncertainty sets and introduce a KL-based formulation of the CRM problem. We develop a new algorithm for this objective and benchmark its performance on a variety of real-world datasets. We analyze its behavior and show that it outperforms existing state-of-the-art methods.

The structure of the paper is the following: in Section 2 we formally introduce the BLBF framework and the CRM problem. In Section 3 we present the DRO framework, motivate it for CRM, re-derive the POEM (Swaminathan and Joachims 2015a) algorithm as one of its special cases and introduce a new CRM algorithm. In Section 4 we compare this new algorithm with state-of-the-art CRM algorithms on four public datasets, and we finally summarize our findings in Section 5.

2 Batch Learning from Logged Bandit Feedback

2.1 Notation and terminology

We denote x ∈ X arbitrary contexts drawn from an unknown distribution ν and presented to the decision maker. Such a quantity can describe covariate information about a patient for a clinical test, or a potential targeted user in a recommender system. The variable y ∈ Y denotes the actions available to the decision maker - the potential medications to give to the patient, or possible advertisements targeting the user, for instance. A policy is a mapping π : X → P(Y) from the space of contexts to probability distributions over the action space. For a given (context, action) pair (x, y), the quantity π(y|x) denotes the probability that the policy π takes the action y when presented with the context x. When picking the action y for a given context x, the decision maker receives a reward δ(x, y), drawn from an unknown distribution. In our examples, this reward could indicate a patient's remission, or the fact that the targeted user clicked on the displayed ad. This reward δ(x, y) can also be assumed to be deterministic, as in (Swaminathan and Joachims 2015a; 2015b); we make this assumption in the rest of this manuscript. Finally, for a given context x ∈ X and an action y ∈ Y, we define the cost function c(x, y) ≜ −δ(x, y). In this paper, we try to find policies producing low expected costs. To make this search tractable, it is usual to restrict the search to a family of parametric policies, henceforth tying each policy πθ to a parameter vector θ ∈ Θ.
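To make the notion of a parametric policy concrete, the short sketch below implements a softmax policy of the exponential form used later in the paper (see (6)), together with the logging step that records an action and its propensity. The feature map, variable names and sampling code are our own illustration, not the authors' implementation.

```python
import numpy as np

def joint_features(x, y, n_actions):
    """Hypothetical joint feature map phi(x, y): the context vector placed in the
    block corresponding to action y (i.e. phi(x, y) = x tensor e_y)."""
    phi = np.zeros(x.shape[0] * n_actions)
    phi[y * x.shape[0]:(y + 1) * x.shape[0]] = x
    return phi

def policy_probs(theta, x, n_actions):
    """Softmax policy pi_theta(.|x) proportional to exp(theta^T phi(x, y))."""
    scores = np.array([theta @ joint_features(x, y, n_actions) for y in range(n_actions)])
    scores -= scores.max()                      # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

# A logging policy samples an action and records its propensity alongside the cost.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
theta0 = rng.normal(size=5 * 3)
probs = policy_probs(theta0, x, n_actions=3)
y = rng.choice(3, p=probs)
propensity = probs[y]
```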
The risk

R(θ) ≜ E_{x∼ν, y∼πθ(·|x)}[c(x, y)]    (1)

corresponds to the expected cost incurred by the policy πθ through the actions y it takes, a quantity the decision maker will try to minimize.

2.2 Counterfactual Risk Minimization

In practical applications, it is common that one only has access to the interaction logs of a previous version of the decision making system, also called the logging policy (denoted π0). More formally, we are interested in the case where the only available data is a collection of quadruplets H ≜ (x_i, y_i, p_i, c_i)_{1≤i≤n}, where the costs c_i ≜ c(x_i, y_i) were obtained after taking an action y_i with probability p_i ≜ π0(y_i|x_i) when presented with a context x_i ∼ ν. In order to search for policies πθ with smaller risk than π0, one needs to build counterfactual estimators of R(θ) from the history H. One way to do so is to use inverse propensity scores (Rosenbaum and Rubin 1983):

R(θ) = E_{x∼ν, y∼πθ(·|x)}[c(x, y)] = E_{x∼ν, y∼π0(·|x)}[c(x, y) πθ(y|x)/π0(y|x)],    (2)

for any πθ absolutely continuous w.r.t. π0. Henceforth, R(θ) can be approximated with the samples (x_i, y_i, p_i, c_i) from the interaction logs H via the sample average approximation:

R̂_n(θ) ≜ (1/n) Σ_{i=1}^n c_i πθ(y_i|x_i)/p_i.    (3)

Bluntly minimizing the objective provided by the counterfactual risk estimator (3) is known to be sub-optimal, as it can have unbounded variance (Ionides 2008). It is therefore a classical technique (Bottou et al. 2013; Cortes, Mansour, and Mohri 2010; Strehl et al. 2010; Swaminathan and Joachims 2015a) to clip the propensity weights. This leads to the Clipped Inverse Propensity Scores (CIPS) estimator:

R̂_n^M(θ) ≜ (1/n) Σ_{i=1}^n c_i min(M, πθ(y_i|x_i)/p_i).    (4)

The variable M is a hyper-parameter balancing the variance reduction brought by weight clipping and the bias introduced in the empirical estimation of R(θ). The search for a minimizer θ of the estimator R̂_n(θ) is often referred to as Counterfactual Risk Minimization (CRM).

Remark 1 Other techniques have been proposed in the literature to reduce the variance of the risk estimators. For instance, the doubly robust risk (Dudík, Langford, and Li 2011) takes advantage of both counterfactual estimators and supervised learning methods, while the self-normalized risk (Swaminathan and Joachims 2015b) was designed to counter the effect of a phenomenon known as propensity overfitting. We do not explicitly cover them in our analysis, but the results we derive hereinafter also hold for such estimators.

2.3 Sample-Variance Penalization

The main drawback of the CIPS estimator is that two different policies can have risk estimates with highly different variances, something the sample average approximation cannot capture. The authors of (Swaminathan and Joachims 2015a) developed a variance-sensitive action selection process that penalizes policies with high-variance risk estimates. Their approach is based on a sample-variance penalized version of the CIPS estimator:

R̂_n^λ(θ) ≜ R̂_n(θ) + λ √(V_n(θ)/n),    (5)

where λ is a hyper-parameter set by the practitioner, and V_n(θ) denotes the empirical variance of the quantities c_i min(M, πθ(y_i|x_i)/p_i). The main motivation behind this approach lies in the confidence bounds derived in (Maurer and Pontil 2009), which upper-bound with high probability the true risk R(θ) by the empirical risk R̂_n(θ) augmented with an additive empirical variance term. In a nutshell, this allows one to build and optimize a pessimistic envelope for the true risk and to penalize policies with high-variance risk estimates.
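As an illustration, here is a minimal sketch of the clipped estimator (4) and of the sample-variance penalized objective (5). It assumes the logs are stored as (x_i, y_i, p_i, c_i) tuples and that a callable `policy_probs(x, y)` returns πθ(y|x); both names are placeholders introduced for the example.

```python
import numpy as np

def cips_risk(policy_probs, logs, M):
    """Clipped IPS estimate (4): average of c_i * min(M, pi_theta(y_i|x_i) / p_i).
    `logs` is an iterable of (x_i, y_i, p_i, c_i) tuples; `policy_probs(x, y)`
    returns pi_theta(y|x). Both are placeholders for the reader's own objects."""
    weights = np.array([c * min(M, policy_probs(x, y) / p) for x, y, p, c in logs])
    return weights.mean(), weights

def svp_objective(policy_probs, logs, M, lam):
    """Sample-variance penalized risk (5): R_hat_n + lam * sqrt(V_n / n)."""
    risk, weights = cips_risk(policy_probs, logs, M)
    n = len(weights)
    return risk + lam * np.sqrt(weights.var(ddof=1) / n)   # V_n: sample variance
```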
The authors of (Swaminathan and Joachims 2015a) proposed the Policy Optimization for Exponential Models (POEM) algorithm and showed state-of-the-art results on a collection of counterfactual tasks when applying this method to exponentially parametrized policies:

πθ(y|x) ∝ exp(θᵀφ(x, y)),    (6)

with φ(x, y) a d-dimensional joint feature map.

3 Distributionally Robust Counterfactual Risk Minimization

3.1 Motivating distributional robustness for CRM

For more concise notations, let us introduce the variable ξ ≜ (x, y), the distribution P ≜ ν ⊗ π0 and the loss ℓ(ξ; θ) ≜ c(x, y) min(M, πθ(y|x)/π0(y|x)). The minimization of the counterfactual risk (2) now writes θ* ∈ argmin_{θ∈Θ} E_{ξ∼P}[ℓ(ξ; θ)], and its empirical counterpart is θ̂_n ∈ argmin_{θ∈Θ} R̂_n(θ). The consistency of the estimator θ̂_n holds under general conditions, and the empirical risk converges to the optimal true risk (Vapnik 1992):

E_{ξ∼P}[ℓ(ξ; θ*)] − R̂_n(θ̂_n) →_{n→∞} 0.    (7)

However, one of the major drawbacks of this approach is that the empirical risk R̂_n(θ) cannot be used as a performance certificate for the true risk E_{ξ∼P}[ℓ(ξ; θ)]. Indeed, one will fail at controlling the true risk of any parameter θ since, by the Central Limit Theorem:

lim_{n→∞} P( E_{ξ∼P}[ℓ(ξ; θ)] ≥ R̂_n(θ) ) = 1/2.    (8)

One way to circumvent this limitation is to treat the empirical distribution P̂_n with skepticism and to replace it with an uncertainty set U_ϵ(P̂_n) of distributions around P̂_n, where ϵ > 0 is a parameter controlling the size of the uncertainty set U_ϵ(P̂_n). This gives rise to the distributionally robust counterfactual risk:

R̂_n^U(θ, ϵ) ≜ max_{Q ∈ U_ϵ(P̂_n)} E_{ξ∼Q}[ℓ(ξ; θ)].    (9)

Minimizing this quantity w.r.t. θ yields the general DRO program:

θ̃_n ∈ argmin_{θ∈Θ} R̂_n^U(θ, ϵ) = argmin_{θ∈Θ} max_{Q ∈ U_ϵ(P̂_n)} E_{ξ∼Q}[ℓ(ξ; θ)].    (10)

There is liberty in the way the uncertainty set U_ϵ(P̂_n) is constructed, including parametric (Madry et al. 2017; Xiao et al. 2018) and non-parametric designs (Parys, Esfahani, and Kuhn 2017; Sinha, Namkoong, and Duchi 2017; Blanchet and Murthy 2019). Moreover, for well-chosen uncertainty sets (Duchi, Glynn, and Namkoong 2016), one can prove performance guarantees asserting that asymptotically (in the limit n → ∞), for all θ ∈ Θ:

E_{ξ∼P}[ℓ(ξ; θ)] ≤ R̂_n^U(θ, ϵ_n)  w.h.p.   and   R̂_n^U(θ, ϵ_n) − E_{ξ∼P}[ℓ(ξ; θ)] → 0.

The robust risk therefore acts as a consistent certificate on the true risk. We believe that these properties alone are enough to motivate the use of DRO for the CRM problem, as they provide an elegant way to design consistent asymptotic upper bounds for the true risk and to ensure a small post-decision surprise. We detail such guarantees in the next subsection. Later, we draw links between DRO and POEM, showing that a wide variety of DRO problems account for the empirical variance of the samples, therefore mitigating the limitations of empirical averages discussed in Section 2.2.

3.2 Guarantees of robustified estimators with ϕ-divergences

We are interested in DRO instances that are amenable to direct optimization. To this end, we focus here only on uncertainty sets U_ϵ(P̂_n) based on information divergences (Csiszár 1967), since they strike a nice compromise between ease of implementation and theoretical guarantees that will prove useful for the CRM problem. The use of information divergences for DRO has already been largely studied in several works (for example (Duchi, Glynn, and Namkoong 2016; Gotoh, Kim, and Lim 2018)). For the sake of completeness, we now recall in Definition 1 the definition of information divergences.
Definition 1 (ϕ-divergences). Let ϕ be a real-valued, convex function such that ϕ(1) = 0. For a reference distribution P, the divergence of another distribution Q with respect to P is defined by

D_ϕ(Q‖P) ≜ ∫ ϕ(dQ/dP) dP  if Q ≪ P,   +∞ otherwise.    (11)

Subsequently, the uncertainty set U_ϵ is defined through ϕ-divergences as follows:

U_ϵ(P̂_n) ≜ { Q | D_ϕ(Q‖P̂_n) ≤ ϵ }.    (12)

We need to ensure that the ϕ-divergences used to define the resulting robust risk R̂_n^ϕ(θ, ϵ) satisfy some basic coherence properties. We therefore make further assumptions on the function ϕ to narrow the space of information divergences we consider:

Assumption 1 (Coherence). ϕ is a real-valued function satisfying:
- ϕ is convex and lower semi-continuous;
- ϕ(t) = +∞ for t < 0, and ϕ(t) ≥ ϕ(1) = 0 for all t ∈ R;
- ϕ is twice continuously differentiable at t = 1, with ϕ'(1) = 0 and ϕ''(1) > 0.

The axioms presented in Assumption 1 have been proposed and studied extensively in (Rockafellar 2018). Examples of coherent divergences include the chi-square and Kullback-Leibler divergences and the squared Hellinger distance. Before stating the announced asymptotic guarantees, we make further assumptions on the structure of both the context and parameter spaces.

Assumption 2 (Structure).
- Θ is a compact subset of some R^d;
- X is a compact subset of some R^D.

We now state Lemma 1, which provides asymptotic certificate guarantees for the robust risk, asymptotically controlling the true counterfactual risk E_{ξ∼P}[ℓ(ξ; θ)] with high probability. It is easy to show that under Assumptions 1 and 2, Lemma 1 can be obtained by a direct application of Proposition 1 of (Duchi, Glynn, and Namkoong 2016).

Lemma 1 (Asymptotic guarantee - Proposition 1 of (Duchi, Glynn, and Namkoong 2016)). Under Assumptions 1 and 2, for a fixed confidence level δ ∈ (0, 1], we have for all θ ∈ Θ:

lim_{n→∞} P( E_{ξ∼P}[ℓ(ξ; θ)] ≤ R̂_n^ϕ(θ, ϵ_n) ) ≥ 1 − δ,    (13)

where ϵ_n is defined through the (1 − δ) chi-squared quantile, ϵ_n(δ) = χ²_{1,1−δ}/n.

Proof. The proof mainly consists in showing that under Assumptions 1 and 2, the conditions for applying Proposition 1 of (Duchi, Glynn, and Namkoong 2016) are fulfilled.

For the CRM problem, this result is of utmost importance, as it allows us to control the post-decision surprise suffered for a given policy πθ and allows for a pointwise control of the true risk.

Remark 2 A stronger result than Lemma 1 would control with high probability the true risk of the robustified policy, E_{ξ∼P}[ℓ(ξ; θ̃_n)], with the optimal value R̂_n^U(θ̃_n, ϵ_n). It is easy to see that if P ∈ U_ϵ, then E_{ξ∼P}[ℓ(ξ; θ̃_n)] ≤ R̂_n^U(θ̃_n, ϵ_n), hence P( E_{ξ∼P}[ℓ(ξ; θ̃_n)] ≤ R̂_n^U(θ̃_n, ϵ_n) ) ≥ P(P ∈ U_ϵ). By exhibiting strong rates of convergence of the empirical distribution P̂_n towards the true distribution P, such a result could be reached. Under mild assumptions on P, this guarantee has been proved in (Esfahani et al. 2017) for Wasserstein-based uncertainty sets. In our case, where U_ϵ is defined by information divergences, this result holds only under the assumption that P is finitely supported (Van Parys, Esfahani, and Kuhn 2017), a plausible situation when the logging policy is defined on a finite number of (context, action) pairs.
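For concreteness, the snippet below evaluates two coherent ϕ-divergences between a candidate re-weighting q of the logged samples and the empirical distribution P̂_n (uniform weights 1/n), and tests membership in the uncertainty set (12). We use ϕ(t) = (t − 1)² for the modified χ² divergence and ϕ(t) = t log t − t + 1 for KL; the normalization choices and all names are ours.

```python
import numpy as np

def chi2_divergence(q, p):
    """Modified chi-square divergence, phi(t) = (t - 1)^2."""
    t = q / p
    return float(np.sum(p * (t - 1.0) ** 2))

def kl_divergence(q, p):
    """Kullback-Leibler divergence, phi(t) = t log t - t + 1."""
    t = q / p
    return float(np.sum(p * (t * np.log(t) - t + 1.0)))

# P_hat_n puts mass 1/n on every logged sample; a re-weighting q of the same
# samples belongs to U_eps(P_hat_n) whenever D_phi(q || P_hat_n) <= eps.
n, eps = 5, 0.1
p_hat = np.full(n, 1.0 / n)
q = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
print(chi2_divergence(q, p_hat) <= eps, kl_divergence(q, p_hat) <= eps)
```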
3.3 Equivalences between DRO and SVP

In this subsection, we focus on stressing the link between DRO and the sample-variance penalization scheme used in the POEM algorithm. In Lemma 2, we present an asymptotic equivalence between the robust risk (defined with coherent ϕ-divergences) and the SVP regularization used in POEM. This lemma is a specific case of existing results, already detailed in (Duchi, Glynn, and Namkoong 2016; Namkoong and Duchi 2017) and (Gotoh, Kim, and Lim 2017; 2018).

Lemma 2 (Asymptotic equivalence - Theorem 2 of (Duchi, Glynn, and Namkoong 2016)). Under Assumptions 1 and 2, for any ϵ ≥ 0, integer n > 0 and θ ∈ Θ, we have:

R̂_n^ϕ(θ, ϵ_n) = R̂_n(θ) + √(ϵ_n V_n(θ)) + α_n(θ),    (14)

with sup_θ √n |α_n(θ)| →_P 0 and ϵ_n = ϵ/n.

Proof. The proof is rather simple, as one only needs to show that the assumptions behind Theorem 2 of (Duchi, Glynn, and Namkoong 2016) are satisfied.

This expansion gives intuition on the practical effect of the DRO approach: namely, it states that the minimization of the robust risk R̂_n^ϕ(θ, ϵ_n) based on coherent information divergences is asymptotically equivalent to the POEM algorithm. The link between POEM and DRO goes further: the following lemma states that sample-variance penalization is an exact instance of the DRO problem when the uncertainty set is based on the chi-square divergence.

Lemma 3 (Non-asymptotic equivalence). Under Assumption 2, for χ²-based uncertainty sets, for any ϵ ≥ 0, integer n > 0 and θ ∈ Θ, we have:

R̂_n^{χ²}(θ, ϵ) = R̂_n(θ) + √(ϵ V_n(θ)).    (15)

Proof. The line of proof is similar to the proof of Theorem 3.2 of (Gotoh, Kim, and Lim 2018). To ease notations, we denote Z ≜ ℓ(ξ; θ). Let us consider a coherent information divergence ϕ and ϕ*(z) ≜ sup_{t>0} { zt − ϕ(t) } its convex conjugate. By strong duality:

sup_{D_ϕ(Q‖P̂_n) ≤ ϵ} E_Q[Z] = inf_{γ≥0} { γϵ + sup_Q ( E_Q[Z] − γ D_ϕ(Q‖P̂_n) ) }.

Theorem 4 of (Rockafellar 2018) states that:

sup_Q ( E_Q[Z] − γ D_ϕ(Q‖P̂_n) ) = inf_{c∈R} { c + γ E_{P̂_n}[ϕ*((Z − c)/γ)] }.

Hence we obtain that:

sup_{D_ϕ(Q‖P̂_n) ≤ ϵ} E_Q[Z] = inf_{γ≥0} { γϵ + inf_{c∈R} ( c + γ E_{P̂_n}[ϕ*((Z − c)/γ)] ) }.

In the specific case of the modified χ² divergence, ϕ(z) = (z − 1)² and its convex conjugate (on the relevant branch) is ϕ*(z) = z + z²/4. Solving inf_{c∈R} ( c + γ E_{P̂_n}[ϕ*((Z − c)/γ)] ) leads to:

sup_{D_ϕ(Q‖P̂_n) ≤ ϵ} E_Q[Z] = E_{P̂_n}[Z] + inf_{γ≥0} { γϵ + V_n(Z)/(4γ) } = E_{P̂_n}[Z] + √(ϵ V_n(Z)),

hence the announced result.

In the case of χ²-based uncertainty sets, the expansion of Lemma 2 therefore holds non-asymptotically, and the POEM algorithm can be interpreted as the minimization of the distributionally robust risk R̂_n^{χ²}(θ, ϵ). To the best of our knowledge, this is the first time that a finite-sample equivalence is established between counterfactual risk minimization with SVP (Swaminathan and Joachims 2015a) and DRO.
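Lemma 3 can be checked numerically on a toy sample: the sketch below solves the inner maximization over the χ²-ball with a generic constrained solver and compares it to the closed form R̂_n + √(ϵ V_n). The sample, the solver choice and the normalization ϕ(t) = (t − 1)² adopted above are ours, and the identity requires ϵ small enough for the worst-case weights to remain non-negative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
z = rng.exponential(size=10)          # toy importance-weighted costs l(xi_i; theta)
n, eps = len(z), 0.05
p_hat = np.full(n, 1.0 / n)

# Worst case over the chi-square ball: max_q q.z  s.t.  q on the simplex and
# sum_i p_i * (q_i / p_i - 1)^2 <= eps  (phi(t) = (t - 1)^2).
constraints = [
    {"type": "eq",   "fun": lambda q: q.sum() - 1.0},
    {"type": "ineq", "fun": lambda q: eps - np.sum(p_hat * (q / p_hat - 1.0) ** 2)},
]
res = minimize(lambda q: -(q @ z), p_hat, bounds=[(0.0, 1.0)] * n,
               constraints=constraints, method="SLSQP")

closed_form = z.mean() + np.sqrt(eps * z.var())   # R_hat_n + sqrt(eps * V_n)
print(-res.fun, closed_form)                      # should agree up to solver tolerance
```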
3.4 Kullback-Leibler based CRM

Among information divergences, we are interested in the ones that allow tractable optimization. Going in this direction, we investigate in this subsection the robust counterfactual risk generated by Kullback-Leibler (KL) uncertainty sets and stress its efficiency. The KL divergence is a coherent ϕ-divergence, with ϕ_KL(z) ≜ z log(z) − z + 1 for z > 0 and ϕ_KL(z) = +∞ elsewhere. The robust risk R̂_n^KL(θ, ϵ) therefore benefits from the guarantees of Lemma 1. Furthermore, it enjoys a simple analytic formula, stated in Lemma 4, that allows for direct optimization.

Lemma 4 (Kullback-Leibler robustified counterfactual risk). Under Assumption 2, the robust risk defined with a Kullback-Leibler uncertainty set can be rewritten as:

R̂_n^KL(θ, ϵ) = inf_{γ>0} { γϵ + γ log E_{ξ∼P̂_n}[exp(ℓ(ξ; θ)/γ)] }    (16)
            = E_{ξ∼P̂_n^{γ*}(θ)}[ℓ(ξ; θ)],    (17)

where γ* attains the infimum in (16) and P̂_n^γ denotes the Boltzmann distribution at temperature γ > 0, defined by

P̂_n^γ(ξ_i | θ) ≜ exp(ℓ(ξ_i; θ)/γ) / Σ_{j=1}^n exp(ℓ(ξ_j; θ)/γ).

Proof. To ease notations, we denote Z ≜ ℓ(ξ; θ). As in the proof of Lemma 3, one can obtain that:

max_{KL(Q‖P̂_n) ≤ ϵ} E_Q[ℓ(ξ; θ)] = inf_{γ≥0} { γϵ + inf_{c∈R} ( c + γ E_{P̂_n}[ϕ*_KL((Z − c)/γ)] ) }.

Since ϕ_KL(z) = z log z − z + 1, it is a classical convex analysis exercise to show that its convex conjugate is ϕ*_KL(z) = e^z − 1. Therefore:

max_{KL(Q‖P̂_n) ≤ ϵ} E_Q[ℓ(ξ; θ)] = inf_{γ≥0} { γϵ + inf_{c∈R} ( c + γ E_{P̂_n}[e^{(Z−c)/γ} − 1] ) }.

Solving inf_{c∈R} ( c + γ E_{P̂_n}[e^{(Z−c)/γ} − 1] ) is straightforward and leads to:

max_{KL(Q‖P̂_n) ≤ ϵ} E_Q[ℓ(ξ; θ)] = inf_{γ≥0} { γϵ + γ log E_{P̂_n}[e^{Z/γ}] }.    (18)

Differentiating the r.h.s. with respect to γ and setting the derivative to 0 shows that the optimal γ solves the fixed-point equation:

γ = E_{P̂_n^γ}[Z] / ( ϵ + log E_{P̂_n}[e^{Z/γ}] ),    (19)

where dP̂_n^γ(z) ≜ (e^{z/γ}/E_{P̂_n}[e^{z/γ}]) dP̂_n(z) is the density of the Gibbs distribution at temperature γ and state degeneracies P̂_n. Replacing this value of γ in the r.h.s. of (18) yields the announced result. This formula has also been obtained in (Hu and Hong 2013), using another line of proof.

A direct consequence of Lemma 4 is that the worst-case distribution in the uncertainty set (defined by the KL divergence) takes the form of a Boltzmann distribution. Henceforth, minimizing the associated robust risk is equivalent to the optimization of the following objective:

R̂_n^KL(θ) = Σ_{i=1}^n ℓ(ξ_i; θ) exp(ℓ(ξ_i; θ)/γ*) / Σ_{j=1}^n exp(ℓ(ξ_j; θ)/γ*).    (20)

From an optimization standpoint, this amounts to replacing the empirical distribution of the logged data with a Boltzmann adversary which re-weights samples in order to put more mass on hard examples (examples with high cost). In what follows, we call KL-CRM the algorithm minimizing the objective (20) while treating γ* as a hyper-parameter that controls the hardness of the re-weighting. A small value of γ* leads to a conservative behavior that puts more weight on actions with high propensity-weighted cost; in the limit γ* → 0, the robust risk only penalizes the action with the highest propensity re-weighted cost. At the other end, a very large γ* brings us back, in the limit, to the original CIPS estimator where all samples have equal weights. In a naive approach, this parameter can be determined through cross-validation and kept constant during the whole optimization procedure. Lemma 5 goes further in the treatment of the optimal temperature parameter γ* and provides an adaptive rule for updating it during the robust risk minimization procedure.

Lemma 5. The value of the optimal temperature parameter γ* can be approximated as follows:

γ* ≈ √( V_n(Z) / (2ϵ) ).    (21)

Proof. To ease notations, we denote Z ≜ ℓ(ξ; θ). The log moment generating function

Φ : α ↦ log E_{P̂_n}[e^{Zα}]    (22)

is well defined, as P̂_n has finite support and Z is bounded a.s. It satisfies the following equalities:

Φ(0) = 0,   Φ'(0) = E_{P̂_n}[Z],   Φ''(0) = V_n(Z),

and a second-order Taylor expansion around 0 yields:

Φ(α) = α E_{P̂_n}[Z] + (α²/2) V_n(Z) + o_0(α²).

With α = 1/γ, injecting this result in the r.h.s. of Equation (18) yields:

max_{KL(Q‖P̂_n) ≤ ϵ} E_Q[ℓ(ξ; θ)] = inf_{γ≥0} { γϵ + E_{P̂_n}[Z] + V_n(Z)/(2γ) + o(1/γ) }.

Solving (approximately) the r.h.s. of the above equation yields the announced result.

This result implies that γ* should be updated concurrently with the parameter θ during the minimization of the robustified risk (20). This leads to an algorithm we call adaptive KL-CRM, or aKL-CRM; its pseudo-code is provided in Algorithm 1. As for POEM and KL-CRM, its hyper-parameter ϵ can be determined through cross-validation.

Algorithm 1: aKL-CRM
inputs: H = {(x1, y1, p1, c1), . . . , (xn, yn, pn, cn)}, parametrized family of policies πθ
hyper-parameters: clipping constant M, uncertainty set size ϵ
1 repeat
2   compute the counterfactual costs z_i ← c_i min(M, πθ(y_i|x_i)/p_i) for i = 1, . . . , n
3   compute the optimal temperature γ ← √( Σ_{j=1}^n (z_j − z̄)² / (2ϵ) ), where z̄ = (1/n) Σ_{j=1}^n z_j
4   compute the normalized weights s_i ← e_i / Σ_{j=1}^n e_j for i = 1, . . . , n, where e_i = e^{z_i/γ}
5   compute the re-weighted loss L ← Σ_{i=1}^n z_i s_i
6   update θ by applying an L-BFGS step to the loss L
7 until convergence
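A compact sketch of Algorithm 1 for a linear softmax policy is given below. The data layout, the feature-map interface and the use of SciPy's L-BFGS-B (which re-evaluates the adaptive temperature inside every loss evaluation rather than once per step) are our own simplifications, not the authors' released code.

```python
import numpy as np
from scipy.optimize import minimize

def softmax_probs(theta, x, feats, n_actions):
    """pi_theta(.|x) proportional to exp(theta^T phi(x, y)), for a joint feature map `feats`."""
    scores = np.array([theta @ feats(x, a) for a in range(n_actions)])
    scores -= scores.max()
    w = np.exp(scores)
    return w / w.sum()

def akl_crm_loss(theta, logs, feats, n_actions, M, eps):
    """Re-weighted loss of Algorithm 1 (steps 2-5); `logs` holds (x, y, p, c) tuples."""
    # step 2: clipped importance-weighted costs z_i
    z = np.array([c * min(M, softmax_probs(theta, x, feats, n_actions)[y] / p)
                  for x, y, p, c in logs])
    # step 3: adaptive temperature gamma
    gamma = np.sqrt(np.sum((z - z.mean()) ** 2) / (2.0 * eps)) + 1e-12
    # steps 4-5: Boltzmann weights (max-shifted for stability) and re-weighted loss
    s = np.exp((z - z.max()) / gamma)
    s /= s.sum()
    return float(s @ z)

# step 6: hand the loss to L-BFGS (finite-difference gradients for brevity), e.g.
# res = minimize(akl_crm_loss, theta0,
#                args=(logs, feats, n_actions, M, eps), method="L-BFGS-B")
```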
4 Experimental results

It is well known that experiments in the field of counterfactual reasoning are highly sensitive to differences in datasets and implementations. Consequently, to evaluate the two algorithms we previously introduced and compare them to existing solutions, we rigorously follow the experimental procedure introduced in (Swaminathan and Joachims 2015a) and used in several other works since then, such as (Swaminathan and Joachims 2015b). It relies on a supervised-to-bandit dataset conversion (Agarwal et al. 2014) to build bandit feedback from multi-label classification datasets. As in (Swaminathan and Joachims 2015a), we train exponential models πθ(y|x) ∝ exp(θᵀφ(x, y)) for the CRM problem and use the same datasets, taken from the LibSVM repository. For reproducibility purposes, we used the code provided by its authors¹ for all our experiments.

¹ http://www.cs.cornell.edu/~adith/POEM/index.html

4.1 Methodology

For any multi-label classification task, let us denote x the input features and y* ∈ {0, 1}^q the labels. The full supervised dataset is denoted D ≜ {(x_1, y*_1), . . . , (x_N, y*_N)}, and is split into three parts: D_train, D_valid, D_test. For each of the four datasets we consider (Scene, Yeast, RCV1-Topics and TMC2009), the split of the training dataset is done as follows: 75% goes to D_train and 25% to D_valid. The test set D_test is provided by the original dataset. As in (Swaminathan and Joachims 2015a), we use joint feature maps φ(x, y) = x ⊗ y and train a Conditional Random Field (Lafferty, McCallum, and Pereira 2001) (CRF) on a fraction (5%, randomly constituted) of D_train. This CRF has access to the full supervised feedback and plays the role of the logging policy π0. That is, for every x_i ∈ D_train, a label prediction y_i is sampled from the CRF with probability p_i. The quality of this prediction is measured through the Hamming loss: c_i = Σ_{l=1}^q |y_i^l − y*_i^l|. The logged bandit dataset is consequently generated by running this policy through D_train Δ = 4 times (Δ is the replay count). After training, the performances of the different policies π are reported as their expected Hamming loss on the held-out set D_test. Every experiment is run 20 times with a different random seed (which controls the random training fraction for the logging policy and the creation of the bandit dataset). For each dataset, we compare our algorithms with the naive CIPS estimator and POEM. For all four algorithms (CIPS, POEM, KL-CRM, aKL-CRM), the numerical optimization routine is deferred to the L-BFGS algorithm.
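The supervised-to-bandit conversion described above can be sketched as follows; the generic `logging_probs` interface stands in for the CRF logger of the paper and, like the other names, is our own illustration.

```python
import numpy as np

def make_bandit_dataset(X, Y_true, logging_probs, replay_count=4, seed=0):
    """Supervised-to-bandit conversion: for every context, sample a label vector from
    the logging policy, record its propensity and its Hamming loss w.r.t. the ground
    truth. `logging_probs(x)` must return a list of (label_vector, probability) pairs
    and stands in for the CRF logger used in the paper."""
    rng = np.random.default_rng(seed)
    logs = []
    for _ in range(replay_count):                 # cycle Delta times through the data
        for x, y_star in zip(X, Y_true):
            candidates, probs = zip(*logging_probs(x))
            probs = np.asarray(probs, dtype=float)
            idx = rng.choice(len(candidates), p=probs)
            y = np.asarray(candidates[idx])
            cost = int(np.sum(np.abs(y - np.asarray(y_star))))   # Hamming loss c_i
            logs.append((x, y, float(probs[idx]), cost))
    return logs
```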
As in (Swaminathan and Joachims 2015a), the clipping constant M is always set to the ratio of the 90th percentile to the 10th percentile of the propensity scores observed in the logs H. Other hyper-parameters are selected by cross-validation on D_valid with the unbiased counterfactual estimator (3). In the experimental results, we also report the performance of the logging policy π0 on the test set as an indicative baseline measure, as well as the performance of a skyline CRF trained on the whole supervised dataset, despite its unfair advantage.

4.2 Results

Table 1 reports the expected Hamming loss of the policies obtained with the different algorithms on the Scene, Yeast, RCV1-Topics and TMC2009 datasets, averaged over 20 random seeds. The results reported for the baselines are coherent with (Swaminathan and Joachims 2015a).

Table 1: Expected Hamming loss on D_test for the different algorithms, averaged over 20 independent runs. Bold font indicates that one or several algorithms are statistically better than the rest, according to a one-tailed paired difference t-test at significance level 0.05.

              Scene   Yeast   RCV1-Topics   TMC2009
    π0        1.529   5.542   1.462         3.435
    CIPS      1.163   4.658   0.930         2.776
    POEM      1.157   4.535   0.918         2.191
    KL-CRM    1.146   4.604   0.922         2.136
    aKL-CRM   1.128   4.553   0.783         2.126
    CRF       0.646   2.817   0.341         1.187

On each dataset, aKL-CRM comes out as one of the best algorithms (according to a one-tailed paired difference t-test at significance level 0.05), outperforming the POEM baseline on three out of four datasets. The results for KL-CRM are more mixed: it outperforms POEM on two datasets but shows weaker performance on the other two, clearly stressing the benefit of an adaptive temperature parameter.

As in (Swaminathan and Joachims 2015a), we can further evaluate the quality of the learned policies by evaluating the Hamming loss of their greedy versions (selecting only the arm that is attributed the most probability mass by the policy). This comes at less expense than sampling from the policy, as it does not require computing the normalizing constant in (6). These results are reported in Table 2 and are consistent with the conclusions of Table 1. One can note that the improvement brought by aKL-CRM over POEM is even sharper under this evaluation.

Table 2: Hamming loss on D_test for the different greedy policies, averaged over 20 independent runs. Bold font indicates that one or several algorithms are statistically better than the rest, according to a one-tailed paired difference t-test at significance level 0.05.

              Scene   Yeast   RCV1-Topics   TMC2009
    CIPS      1.163   4.369   0.929         2.774
    POEM      1.157   4.261   0.918         2.190
    KL-CRM    1.146   4.316   0.922         2.134
    aKL-CRM   1.128   4.271   0.779         2.034

Another experiment carried out in (Swaminathan and Joachims 2015a) focuses on the size of the bandit dataset H. This quantity can easily be modulated by varying the replay count Δ, the number of times we cycle through D_train to create the logged feedback H. Figure 1 reports the expected Hamming loss of policies trained with the POEM, KL-CRM and aKL-CRM algorithms for values of Δ ranging from 1 to 256, on the Yeast and Scene datasets. Results are averaged over 10 independent runs. For large values of Δ (that is, large bandit datasets) all algorithms tend to coincide; this is to be expected, as Lemma 2 states that for any coherent ϕ-divergence, all robust risks are asymptotically equivalent. It also stands out that for small replay counts (i.e., the small-data regime, a more realistic case), the KL-based algorithms outperform POEM.

[Figure 1: Impact of the replay count Δ (from 2^0 to 2^8, x-axis) on the expected Hamming loss (y-axis), for (a) the Yeast dataset and (b) the Scene dataset. Results are averaged over 10 independent runs, that is 10 independent train/test splits and bandit dataset creations. KL-CRM and aKL-CRM outperform POEM in the small-data regime.]

5 Conclusion

We presented in this work a unified framework for counterfactual risk minimization based on the distributionally robust optimization of policies, and motivated it by the asymptotic guarantees available when the uncertainty measure is based on ϕ-divergences. We showed that this new framework generalizes existing solutions, such as sample-variance penalized counterfactual risk minimization algorithms (Swaminathan and Joachims 2015a).
Our work therefore opens a new avenue for reasoning about counterfactual optimization with logged bandit feedback, as we showed that a KL-divergence based formulation of the counterfactual DRO problem leads to tractable and efficient algorithms for the CRM problem, outperforming state-of-the-art results on a collection of datasets. The authors of (Swaminathan and Joachims 2015a) also proposed a modification of the POEM algorithm that can be optimized at scale using stochastic gradient descent. Future work should therefore aim at developing stochastic optimization schemes for both KL-CRM and aKL-CRM so that they can handle large datasets. From the perspective of experimental evaluation, measuring the impact of the DRO formulation on the doubly robust (Dudík, Langford, and Li 2011) and the self-normalized (Swaminathan and Joachims 2015b) estimators would further validate its relevance for real-world problems. Finally, a more theoretical line of work could focus on proving finite-sample performance certificate guarantees for distributionally robust estimators based on coherent ϕ-divergences, further motivating their use for counterfactual risk minimization.

References

Agarwal, A.; Hsu, D.; Kale, S.; Langford, J.; Li, L.; and Schapire, R. 2014. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. In International Conference on Machine Learning, 1638-1646.
Blanchet, J., and Murthy, K. 2019. Quantifying Distributional Model Risk via Optimal Transport. Mathematics of Operations Research.
Bottou, L.; Peters, J.; Candela, J. Q.; Charles, D. X.; Chickering, M.; Portugaly, E.; Ray, D.; Simard, P. Y.; and Snelson, E. 2013. Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Journal of Machine Learning Research 14(1):3207-3260.
Capen, E. C.; Clapp, R. V.; Campbell, W. M.; et al. 1971. Competitive Bidding in High-Risk Situations. Journal of Petroleum Technology 23(06):641-653.
Cortes, C.; Mansour, Y.; and Mohri, M. 2010. Learning Bounds for Importance Weighting. In Advances in Neural Information Processing Systems, 442-450.
Csiszár, I. 1967. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica 2:229-318.
Duchi, J.; Glynn, P.; and Namkoong, H. 2016. Statistics of Robust Optimization: A Generalized Empirical Likelihood Approach.
Dudík, M.; Langford, J.; and Li, L. 2011. Doubly Robust Policy Evaluation and Learning. In Proceedings of the 28th International Conference on Machine Learning, 1097-1104.
Esfahani, P. M., and Kuhn, D. 2018. Data-driven Distributionally Robust Optimization Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations. Mathematical Programming 171(1-2):115-166.
Esfahani, P. M., et al. 2017. Data-driven Distributionally Robust Optimization Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations. Mathematical Programming.
Gotoh, J.; Kim, M. J.; and Lim, A. E. B. 2017. Calibration of Distributionally Robust Empirical Optimization Models. CoRR abs/1711.06565.
Gotoh, J.; Kim, M. J.; and Lim, A. E. B. 2018. Robust Empirical Optimization is Almost the Same as Mean-Variance Optimization. Operations Research Letters 46(4):448-452.
Hu, Z., and Hong, L. J. 2013. Kullback-Leibler Divergence Constrained Distributionally Robust Optimization. Available at Optimization Online.
Hu, W.; Niu, G.; Sato, I.; and Sugiyama, M. 2018. Does Distributionally Robust Supervised Learning Give Robust Classifiers? In International Conference on Machine Learning, 2034-2042.
Ionides, E. L. 2008. Truncated Importance Sampling. Journal of Computational and Graphical Statistics 17(2):295-311.
Lafferty, J. D.; McCallum, A.; and Pereira, F. C. N. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, 282-289. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Li, L.; Chu, W.; Langford, J.; and Wang, X. 2011. Unbiased Offline Evaluation of Contextual-Bandit-Based News Article Recommendation Algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, 297-306. ACM.
Li, L.; Chen, S.; Kleban, J.; and Gupta, A. 2015. Counterfactual Estimation and Optimization of Click Metrics in Search Engines: A Case Study. In Proceedings of the 24th International Conference on World Wide Web, 929-934. ACM.
Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2017. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv preprint arXiv:1706.06083.
Maurer, A., and Pontil, M. 2009. Empirical Bernstein Bounds and Sample-Variance Penalization. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009.
Namkoong, H., and Duchi, J. C. 2017. Variance-Based Regularization with Convex Objectives. In Advances in Neural Information Processing Systems, 2971-2980.
Parys, B. P. G. V.; Esfahani, P. M.; and Kuhn, D. 2017. From Data to Decisions: Distributionally Robust Optimization is Optimal. CoRR abs/1704.04118.
Rockafellar, R. T. 2018. Risk and Utility in the Duality Framework of Convex Analysis. Springer Proceedings in Mathematics and Statistics.
Rosenbaum, P. R., and Rubin, D. B. 1983. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika 70(1):41-55.
Sinha, A.; Namkoong, H.; and Duchi, J. C. 2017. Certifiable Distributional Robustness with Principled Adversarial Training. CoRR abs/1710.10571.
Smith, J. E., and Winkler, R. L. 2006. The Optimizer's Curse: Skepticism and Postdecision Surprise in Decision Analysis. Management Science 52(3):311-322.
Strehl, A.; Langford, J.; Li, L.; and Kakade, S. M. 2010. Learning from Logged Implicit Exploration Data. In Advances in Neural Information Processing Systems, 2217-2225.
Swaminathan, A., and Joachims, T. 2015a. Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization. Journal of Machine Learning Research 16(1):1731-1755.
Swaminathan, A., and Joachims, T. 2015b. The Self-Normalized Estimator for Counterfactual Learning. In Advances in Neural Information Processing Systems, 3231-3239.
Thaler, R. 2012. The Winner's Curse: Paradoxes and Anomalies of Economic Life. Simon and Schuster.
Van Parys, B. P.; Esfahani, P. M.; and Kuhn, D. 2017. From Data to Decisions: Distributionally Robust Optimization is Optimal. arXiv preprint arXiv:1704.04118.
Vapnik, V. 1992. Principles of Risk Minimization for Learning Theory. In Advances in Neural Information Processing Systems, 831-838.
Xiao, C.; Li, B.; Zhu, J.-Y.; He, W.; Liu, M.; and Song, D. X. 2018. Generating Adversarial Examples with Adversarial Networks. In IJCAI.