# Contextual Linear Optimization with Bandit Feedback

Yichun Hu¹ Nathan Kallus¹ Xiaojie Mao² Yanchen Wu²
¹Cornell University ²Tsinghua University
{yh767, kallus}@cornell.edu maoxj@sem.tsinghua.edu.cn wu-yc23@mails.tsinghua.edu.cn

Authors are listed in alphabetical order. Correspondence to Xiaojie Mao: maoxj@sem.tsinghua.edu.cn.
38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Contextual linear optimization (CLO) uses predictive contextual features to reduce uncertainty in random cost coefficients and thereby improve average-cost performance. An example is the stochastic shortest path problem with random edge costs (e.g., traffic) and contextual features (e.g., lagged traffic, weather). Existing work on CLO assumes the data has fully observed cost coefficient vectors, but in many applications we can only see the realized cost of a historical decision, that is, just one projection of the random cost coefficient vector, to which we refer as bandit feedback. We study a class of offline learning algorithms for CLO with bandit feedback, which we term induced empirical risk minimization (IERM), where we fit a predictive model to directly optimize the downstream performance of the policy it induces. We show a fast-rate regret bound for IERM that allows for misspecified model classes and flexible choices of the optimization estimate, and we develop computationally tractable surrogate losses. A byproduct of our theory of independent interest is a fast-rate regret bound for IERM with full feedback and a misspecified policy class. We compare the performance of different modeling choices numerically using a stochastic shortest path example and provide practical insights from the empirical results.

1 Introduction

Contextual linear optimization (CLO) models the use of predictive features (context variables) to improve decision making in linear optimization with random coefficients. In CLO, we consider a decision $z \in \mathcal{Z}$ that incurs an uncertain cost $Y^\top z$ determined by a random cost vector $Y \in \mathbb{R}^d$ that is not observed at decision time. We do, however, observe predictive features $X \in \mathbb{R}^p$ prior to the decision, which help reduce uncertainty. CLO can be expressed either as a contextual stochastic optimization problem or as a linear optimization problem where the cost vector is a conditional expectation:

$$v^*(x) = \min_{z \in \mathcal{Z}} \mathbb{E}[Y^\top z \mid X = x] = \min_{z \in \mathcal{Z}} f_0(x)^\top z, \quad \text{where } f_0(x) = \mathbb{E}[Y \mid X = x]. \qquad (1)$$

We assume throughout that $\mathcal{Z}$ is a polytope ($\mathcal{Z} = \mathrm{Conv}(\mathcal{Z}^\circ)$ for some vertex set $\mathcal{Z}^\circ$ with $|\mathcal{Z}^\circ| < \infty$ and $\sup_{z \in \mathcal{Z}} \|z\| \le B$) and that $Y$ is bounded (without loss of generality, $Y \in \mathcal{Y} = \{y : \|y\| \le 1\}$).

CLO has been the focus of much recent work [e.g., 9, 11, 14, 27, 30], due to its relevance to data-driven operational decision making in many applications such as network optimization, portfolio optimization, product recommendation, etc. The key question in these is how to use data to learn a good policy, $\pi : \mathbb{R}^p \to \mathcal{Z}$, mapping feature observations to effective decisions. All of this literature studies an offline setting, where a batch of existing data $(X_i, Y_i)$ for $i = 1, \dots, n$ is observed. These data are viewed as random draws from the joint distribution of $(X, Y)$, and the goal is to learn an effective decision policy from them. Crucially, this assumes that we fully observe the random cost vectors $Y_i$, meaning that we fully know the corresponding cost $Y_i^\top z$ of any potential decision $z$. This may be unrealistic in many applications.
Consider, for example, the instantiation of Eq. (1) as a stochastic shortest path problem of transporting one unit from a start node to a desired end node, where $Y$ contains the travel time on each of $d$ edges, $z$ the binary decision of whether to traverse each edge, and $\mathcal{Z}$ constrains flow conservation. The data we would need in order to apply existing approaches here would consist of simultaneous observations of the travel times on every single edge, encapsulated in the elements of $Y_i$. However, such ideal observations may not be available. Indeed, modern travel time prediction models are often based on data from historical individual trips. For example, such is the case with Uber Movement data [26, 33], which is based on the total length of historical rides. Namely, instead of observing the entire $Y_i$, we only observe the total traveling time $C_i = Y_i^\top Z_i$ for the path $Z_i$ taken in a historical trip $i$. We term this observation model bandit feedback as it corresponds to observing the cost only of a given decision and not the counterfactual cost of any alternative decision, as in bandit problems [25].

In this paper, we study the contextual linear optimization problem with bandit feedback and make several contributions. First, we adapt the end-to-end induced empirical risk minimization approach to the bandit feedback setting. This approach allows us to optimize decision policies by directly targeting the decision objective. We provide three different methods to identify the expected cost of any given policy and show how to estimate the expected decision cost from bandit-feedback data. Second, we derive upper bounds on the regret (i.e., decision sub-optimality) of the policy that minimizes the estimated objective. Our regret analysis accounts for misspecification of the policy class and incorporates a margin condition that potentially enables a faster regret rate. This significantly extends existing theory for full-feedback CLO, and as a byproduct, we provide a novel fast-rate bound for full-feedback CLO with misspecification. Finally, we demonstrate that existing surrogate losses for full-feedback CLO (such as the well-known SPO+ loss in [11]) can be adapted to our setting, enabling efficient policy optimization. We empirically test this in simulated shortest path problems and provide practical insights.

1.1 Background: The Full Feedback Setting

We first review two major approaches to the CLO problem with full feedback: an estimate-then-optimize (ETO) approach and an end-to-end induced empirical risk minimization (IERM) approach. To this end, we need to define the generic plug-in policy $\pi_f$ for any given $f : \mathbb{R}^p \to \mathbb{R}^d$:

$$\pi_f(x) \in \arg\min_{z \in \mathcal{Z}} f(x)^\top z. \qquad (2)$$

Note that for any given covariate value $x$, this corresponds to a linear programming problem with coefficients given by $f(x)$. An optimal policy in Eq. (1) corresponds to $\pi_{f_0}$. Without loss of generality, we restrict the value of $\pi_f(x)$ to the set of vertices $\mathcal{Z}^\circ$ of the polytope $\mathcal{Z}$, with ties broken according to some fixed rule (e.g., a total ordering over $\mathcal{Z}^\circ$ such as lexicographic).

The ETO approach starts with estimating $f_0$ in Eq. (1) by any supervised learning method for predicting $Y$ given $X$. This can, e.g., be implemented by minimizing a prediction fitness criterion (e.g., sum of squared errors) over a hypothesis class of functions $\mathcal{F} \subseteq [\mathbb{R}^p \to \mathbb{R}^d]$ (e.g., linear functions, decision trees, neural networks). After training an estimator $\hat{f}$ of $f_0$, the ETO approach then makes decisions according to the induced policy $\pi_{\hat{f}}$.
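To make the plug-in policy in Eq. (2) concrete, here is a minimal sketch that solves the induced linear program with scipy; the equality-constraint representation of the polytope (matrices `A_eq`, `b_eq` with box bounds) and the fitted model `f_hat` are illustrative assumptions rather than part of the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog


def plug_in_policy(f_hat, x, A_eq, b_eq, d):
    """Plug-in policy pi_f(x) in argmin_{z in Z} f(x)^T z, where Z is represented
    (as an assumption for illustration) by {z : A_eq z = b_eq, 0 <= z <= 1},
    e.g., flow-conservation constraints for the shortest path problem."""
    c = np.asarray(f_hat(x), dtype=float)  # predicted cost vector f(x) in R^d
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0.0, 1.0)] * d, method="highs")
    return res.x  # an optimal solution of the LP (a vertex for generic costs)
```

The ETO approach simply plugs an off-the-shelf regression estimate $\hat{f}$ into such an oracle at decision time.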
This approach is a straightforward two-step approach, but it ignores the downstream optimization task when choosing $\hat{f} \in \mathcal{F}$. In contrast, some recent literature proposes to integrate the estimation and the optimization, directly searching for a decision policy that minimizes the decision cost. Following [14], we consider an IERM formulation that minimizes the sample average cost over the class of plug-in policies induced by $\mathcal{F}$:

$$\hat{\pi}_{\mathcal{F}} \in \arg\min_{\pi \in \Pi_{\mathcal{F}}} \frac{1}{n} \sum_{i=1}^{n} Y_i^\top \pi(X_i), \quad \text{where } \Pi_{\mathcal{F}} = \{\pi_f : f \in \mathcal{F}\}. \qquad (3)$$

This approach is end-to-end in that it directly targets the decision-making objective. Recent literature demonstrates that end-to-end approaches like IERM often outperform the ETO approach [12]. The benefits of end-to-end approaches are especially significant in the misspecified setting, that is, when the function class $\mathcal{F}$ fails to contain the true expected cost function $f_0$ and the policy class $\Pi_{\mathcal{F}}$ does not include the globally optimal decision policy $\pi_{f_0}$ [e.g., 11, 14]. While Eq. (3) involves a challenging bi-level optimization due to the definition of $\Pi_{\mathcal{F}}$, practical computational approximations have been proposed [11, 16]. Nonetheless, the IERM formulation in Eq. (3) requires observing the full feedback $Y_i$ in the data. In this paper, we study how to extend this approach to the setting with bandit feedback.

1.2 Our Problem: CLO with Bandit Feedback

We now formally define the data generating process in the bandit feedback setting. Assume we have an offline dataset consisting of $n$ data points $\mathcal{D} = \{(X_1, Z_1, C_1), \dots, (X_n, Z_n, C_n)\}$ that are independent draws from a distribution $\mathcal{P}$ on $(X, Z, C)$ generated in the following way. We first draw $(X, Z, Y)$, where the $(X, Y)$ distribution is as in the full feedback setting and $Z$ is generated according to some historical decision policies for data collection. Then we set $C = Z^\top Y$ and omit the full $Y$. Below we impose some basic assumptions on the generating process of the historical decision $Z$.

Assumption 1. The data generating process satisfies the following two properties:
1. (Ignorability) $\mathbb{E}[C \mid Z, X] = Z^\top f_0(X)$ (which could follow from $Z \perp Y \mid X$).
2. (Coverage) $\inf_{z \in \mathrm{span}(\mathcal{Z}):\, \|z\| = 1} \mathbb{E}[(Z^\top z)^2 \mid X] > 0$ almost surely.

The ignorability condition is a common assumption that plays a fundamental role in learning with partial feedback and causal inference (see the literature in Section 1.3). This condition requires that the assignments of the historical decisions $Z$ do not depend on any unobserved factor potentially dependent on $Y$. The coverage condition requires that, given the covariates $X$, historical decisions $Z$ explore all directions of the linear span of the constraint set $\mathcal{Z}$; otherwise it may be impossible to evaluate certain policies from the observed costs of historical decisions. This is an analogue of the overlap assumption in learning with partial feedback and causal inference [17].

The learning task is to use the data $\mathcal{D}$ to come up with a data-driven policy $\hat{\pi}(x)$ that has low regret

$$\mathrm{Reg}(\pi) = \mathbb{E}_X\Big[f_0(X)^\top \pi(X) - \min_{z \in \mathcal{Z}} f_0(X)^\top z\Big] = V(\pi) - V(\pi_{f_0}), \quad \text{where } V(\pi) = \mathbb{E}_X\big[f_0(X)^\top \pi(X)\big]. \qquad (4)$$
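Since the experiments in Section 5 evaluate policies by their regret against the globally optimal policy, the following sketch spells out a Monte Carlo version of Eq. (4); it assumes access to the true $f_0$ (available only in simulation) and to an LP oracle such as the `plug_in_policy` sketch above, both of which are assumed interfaces.

```python
import numpy as np


def monte_carlo_regret(policy, f0, X_test, lp_oracle):
    """Estimate Reg(pi) = E_X[f0(X)^T pi(X) - min_{z in Z} f0(X)^T z] from Eq. (4)
    on a test sample. `policy(x)` returns a feasible decision and `lp_oracle(c)`
    returns argmin_{z in Z} c^T z; both interfaces are assumptions for illustration."""
    gaps = []
    for x in X_test:
        c = np.asarray(f0(x), dtype=float)
        z_opt = lp_oracle(c)  # the oracle decision pi_{f0}(x)
        gaps.append(c @ policy(x) - c @ z_opt)
    return float(np.mean(gaps))
```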
1.3 Background: Off-Policy Evaluation and Optimization

An important special case is when $\mathcal{Z}^\circ = \{(1, 0, \dots, 0), \dots, (0, \dots, 0, 1)\}$ is the canonical basis and $\mathcal{Z}$ is the simplex. This corresponds to choosing one of $d$ actions (or randomizing between them). The full-feedback problem in this case is cost-sensitive classification [10]. In the bandit feedback case, with the restriction $Z \in \mathcal{Z}^\circ$, it is known as a logged or offline contextual bandit, and a long line of literature studies the problems of estimating and optimizing $V(\pi)$ [1, 8, 18-20, 23, 24, 32, 34-36, 42-45]. It is closely related to causal inference, viewing each component of $Y$ as the potential outcome for the corresponding action (or treatment), where we only see the realized outcome $C$ corresponding to the factual action indexed by the entry of value 1 in $Z \in \mathcal{Z}^\circ$ and see nothing about the counterfactual actions $\mathcal{Z}^\circ \setminus \{Z\}$. Two key differentiating characteristics of our problem with a general polytope constraint $\mathcal{Z}$ are the opportunity to leverage the linear cost structure to extrapolate from the costs of historical decisions to the costs of other decisions, and the presence of non-trivial constraints on the output of decision policies.

Going beyond just finite discrete arms, [21, 22, 29] consider the logged contextual bandit with a continuum of arms, which is not generally representable in terms of the problem in Section 1.2. These works make no restrictions on the relationship between the expected potential outcome and the corresponding action except generic smoothness, and leverage nonparametric approaches such as kernel weighting. Closest to our work is [6], which imposes a semiparametric assumption on this relationship. This includes our problem (Section 1.2) as a special case under the restriction that, conditional on $X$, expected outcomes are linear in actions (note that our $\mathbb{E}[C \mid Z, X]$ is linear in $Z$ according to ignorability in Section 1.2). Relative to this work, our unique contributions are the use of induced policy classes to naturally deal with the constraints in $\mathcal{Z}$, the adaptation of computational approximations such as SPO+, and fast rates for the regret under misspecification.

2 Induced Empirical Risk Minimization with Bandit Feedback

Our IERM approach in the bandit setting follows the same idea as Eq. (3), directly minimizing an estimate of expected costs. However, since we do not observe $Y_i$ in our data, the estimate is not as straightforward as the sample average $\frac{1}{n}\sum_{i=1}^{n} Y_i^\top \pi(X_i)$. Instead, we need alternative ways to identify $V(\pi)$.

Define $\Sigma_0(X) = \mathbb{E}[Z Z^\top \mid X]$ and recall $f_0(X) = \mathbb{E}[Y \mid X]$, which we will use to characterize the policy value $V(\pi)$. We refer to these functions as nuisance functions, following the existing literature on off-policy evaluation and learning in contextual bandits or reinforcement learning [39], because they are not directly used for decision-making but serve merely as intermediaries for evaluating the policy value. Let $\theta(x, z, c; f, \Sigma)$ be a score function such that

$$\mathbb{E}_{\mathcal{P}}\big[\theta(X, Z, C; f_0, \Sigma_0)^\top \pi(X)\big] = V(\pi) \quad \text{for any fixed policy } \pi, \qquad (5)$$

where $\mathbb{E}_{\mathcal{P}}$ denotes taking the expectation over $\mathcal{P}$. Proposition 1 summarizes a few possible choices.

Proposition 1. The following choices of $\theta(x, z, c; f, \Sigma)$ all satisfy Eq. (5):
1. (Direct Method) $\theta_{\mathrm{DM}}(x, z, c; f, \Sigma) = f(x)$;
2. (Inverse Spectral Weighting) $\theta_{\mathrm{ISW}}(x, z, c; f, \Sigma) = \Sigma^{\dagger}(x)\, z\, c$;
3. (Doubly Robust) $\theta_{\mathrm{DR}}(x, z, c; f, \Sigma) = f(x) + \Sigma^{\dagger}(x)\, z\, (c - z^\top f(x))$.
Here $\Sigma^{\dagger}$ denotes the Moore-Penrose pseudo-inverse of the matrix $\Sigma$.

The doubly robust identification in Proposition 1 generalizes the identification in [6] in that we allow for rank-deficient $\Sigma_0$ by using a pseudo-inverse. Otherwise, our work significantly differs from [6] in terms of policy class specification, computation, and theoretical analyses (see Section 1.3).
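The three scores in Proposition 1 translate directly into code. The sketch below assumes `f` and `Sigma` are callables returning $f(x) \in \mathbb{R}^d$ and $\Sigma(x) \in \mathbb{R}^{d \times d}$ (e.g., fitted nuisance estimators), an interface chosen here purely for illustration, and uses the Moore-Penrose pseudo-inverse as in the proposition.

```python
import numpy as np


def theta_dm(x, z, c, f, Sigma):
    """Direct method: theta_DM(x, z, c; f, Sigma) = f(x)."""
    return f(x)


def theta_isw(x, z, c, f, Sigma):
    """Inverse spectral weighting: theta_ISW = Sigma(x)^+ z c (c is the scalar realized cost)."""
    return np.linalg.pinv(Sigma(x)) @ z * c


def theta_dr(x, z, c, f, Sigma):
    """Doubly robust: theta_DR = f(x) + Sigma(x)^+ z (c - z^T f(x))."""
    fx = f(x)
    return fx + np.linalg.pinv(Sigma(x)) @ z * (c - z @ fx)
```

Any of these scores can be plugged into the cross-fitted estimate of $V(\pi)$ described next.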
Note that the identification in Proposition 1 exploits the linearity of the decision cost, which significantly improves upon algorithms that ignore the linear structure and naively extend existing offline bandit learning with discrete actions $\mathcal{Z}^\circ$; see Section 5 and Appendix C.2 for numerical evidence.

It remains to estimate Eq. (5), since we know neither $\mathcal{P}$ nor the nuisance functions $f_0, \Sigma_0$. Following the approach of [1, 45] for logged contextual bandits, we adapt the cross-fitting procedure advocated in [5]. For simplicity, we focus on the twofold version and assume $n$ is even; the extension to the $K$-fold version is straightforward. First, we split $\mathcal{D}$ into two equal-sized subsamples, $\mathcal{D}_1$ and $\mathcal{D}_2$, each with $n/2$ data points. We use $\mathcal{D}_1$ as an auxiliary sample to estimate the nuisance functions $f_0$ and $\Sigma_0$, denoting the estimates by $\hat{f}_1$ and $\hat{\Sigma}_1$. We then use $\mathcal{D}_2$ as a main sample to obtain an estimate of $V(\pi)$ using $\hat{f}_1$ and $\hat{\Sigma}_1$: $\frac{2}{n} \sum_{i \in \mathcal{D}_2} \theta(X_i, Z_i, C_i; \hat{f}_1, \hat{\Sigma}_1)^\top \pi(X_i)$. Of course, we can switch the roles of $\mathcal{D}_1$ and $\mathcal{D}_2$, i.e., use $\mathcal{D}_2$ to get nuisance estimates $\hat{f}_2$ and $\hat{\Sigma}_2$ and then use $\mathcal{D}_1$ to estimate $V(\pi)$. Finally, given an induced policy class $\Pi_{\mathcal{F}}$, the IERM policy $\hat{\pi}$ minimizes the average of the above two $V(\pi)$ estimates over $\Pi_{\mathcal{F}}$:

$$\hat{\pi} \in \arg\min_{\pi \in \Pi_{\mathcal{F}}} \frac{1}{n} \sum_{j=1}^{2} \sum_{i \in \mathcal{D}_j} \theta(X_i, Z_i, C_i; \hat{f}_{3-j}, \hat{\Sigma}_{3-j})^\top \pi(X_i). \qquad (6)$$

Remark 1 (Estimation of $f_0$). The estimator $\hat{f}(x)$ can be obtained by minimizing the least squares loss over some appropriate function class $\mathcal{F}_N$:

$$\hat{f} \in \arg\min_{f \in \mathcal{F}_N} \sum_{i \in \mathcal{D}} \big(C_i - Z_i^\top f(X_i)\big)^2. \qquad (7)$$

Note that two places in Eq. (6) require the specification of a function class for modeling $f_0$: the class $\mathcal{F}$ used to induce the policy class $\Pi_{\mathcal{F}}$, and the class $\mathcal{F}_N$ used to construct nuisance estimators to evaluate the expected cost of induced policies. In practice, we do not need to use the same class for $\mathcal{F}$ and $\mathcal{F}_N$. In fact, it might be desirable to use a more flexible function class for $\mathcal{F}_N$ to make sure it is well-specified for accurate policy evaluation, and a simpler class for $\mathcal{F}$ to make the policy optimization more tractable. We numerically test different choices of $\mathcal{F}$ and $\mathcal{F}_N$ in Section 5.

Remark 2 (Estimation of $\Sigma_0$). There are multiple ways to estimate $\Sigma_0$. For example, [6] suggest estimating $\Sigma_0$ by running a multi-task regression for all $(j, k)$ entries of the matrix over some appropriate hypothesis spaces $\mathcal{S}_{jk}$: $\hat{\Sigma} = \arg\min_{\Sigma_{11} \in \mathcal{S}_{11}, \dots, \Sigma_{dd} \in \mathcal{S}_{dd}} \sum_{i \in \mathcal{D}} \|Z_i Z_i^\top - \Sigma(X_i)\|_{\mathrm{Fro}}^2$. To ensure a positive semi-definite estimator, we may posit each hypothesis $\Sigma$ to be the outer product of some matrix-valued hypothesis of appropriate dimension. Alternatively, we may only need to consider finitely many feasible decisions $z_1, \dots, z_m$, such as the feasible paths in stochastic shortest path problems (see the experiments in Section 5) or, more generally, the vertices in $\mathcal{Z}^\circ$. Then we can first estimate the propensity scores $e_0(z \mid X)$ for $z = z_1, \dots, z_m$ using a suitable estimator $\hat{e}(z \mid X)$ and then estimate $\Sigma_0(X) = \sum_{j=1}^{m} z_j z_j^\top e_0(z_j \mid X)$ by $\hat{\Sigma}(X) = \sum_{j=1}^{m} z_j z_j^\top \hat{e}(z_j \mid X)$.
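Below is a minimal sketch of the two-fold cross-fitted value estimate that Eq. (6) minimizes; the interfaces `fit_nuisances` (returning callables $\hat{f}, \hat{\Sigma}$, e.g., via Eq. (7) and Remark 2) and `score` (one of the functions sketched above) are assumptions for illustration.

```python
def cross_fitted_value(D1, D2, fit_nuisances, score, policy):
    """Two-fold cross-fitted estimate of V(pi): nuisances are trained on one fold
    and the score is evaluated on the other fold, then both folds are averaged,
    mirroring the objective minimized over Pi_F in Eq. (6)."""
    total, n = 0.0, len(D1) + len(D2)
    for train_fold, eval_fold in [(D1, D2), (D2, D1)]:
        f_hat, Sigma_hat = fit_nuisances(train_fold)
        for x, z, c in eval_fold:
            total += score(x, z, c, f_hat, Sigma_hat) @ policy(x)
    return total / n
```

The IERM policy in Eq. (6) is the minimizer of this quantity over the induced class $\Pi_{\mathcal{F}}$; Section 4 discusses how to make that minimization tractable via the SPO+ surrogate.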
3 Theoretical Analysis

In this section, we provide a theoretical regret analysis for the IERM approach with bandit feedback, allowing for model misspecification in the induced policy class $\Pi_{\mathcal{F}}$. That is, we allow the globally optimal policy $\pi_{f_0}$ to not be included in the class $\Pi_{\mathcal{F}}$. We derive an upper bound on the regret of the IERM approach in terms of the complexity of the policy class $\Pi_{\mathcal{F}}$, its misspecification bias, and the estimation errors of the nuisance functions. Before stating the main theorem, we define a few important notations.

Let $\bar{\pi}^*$ be the best-in-class policy that minimizes the expected regret over $\Pi_{\mathcal{F}}$: $\bar{\pi}^* \in \arg\min_{\pi \in \Pi_{\mathcal{F}}} \mathbb{E}\big[f_0(X)^\top \pi(X)\big]$. We note that $\bar{\pi}^*$ can be different from the globally optimal policy $\pi_{f_0}$ if $\pi_{f_0} \notin \Pi_{\mathcal{F}}$, and $\mathrm{Reg}(\bar{\pi}^*)$ characterizes the extent of misspecification in the induced policy class $\Pi_{\mathcal{F}}$. For $j = 1, 2$, we define the function class

$$\mathcal{G}_j = \left\{ (x, z, c) \mapsto \frac{\rho\, \theta(x, z, c; \hat{f}_j, \hat{\Sigma}_j)^\top \big(\pi(x) - \bar{\pi}^*(x)\big)}{2 B \Theta} : \pi \in \Pi_{\mathcal{F}},\ \rho \in [0, 1] \right\}. \qquad (8)$$

For any function class $\mathcal{G} \subseteq \{g : \mathcal{X} \times \mathcal{Z} \times \mathcal{C} \to \mathbb{R}\}$, we define the local Rademacher complexity as $\mathcal{R}_n(\mathcal{G}, r) = \mathbb{E}\big[\sup_{g \in \mathcal{G},\, \|g\|_2 \le r} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i g(X_i, Z_i, C_i)\big]$, where $\epsilon_i$ are i.i.d. Rademacher variables and $\|g\|_2 = \sqrt{\mathbb{E}_{\mathcal{P}}[g^2(X, Z, C)]}$.

Throughout Section 3, we impose two assumptions. Assumption 2 concerns the algorithms we use to estimate the nuisance functions $f_0, \Sigma_0$. It requires that the nuisance estimates are close enough to their true values and that $\theta(x, z, c; \hat{f}, \hat{\Sigma})$ is bounded.

Assumption 2 (Nuisance Estimation). The nuisance estimators $\hat{f}, \hat{\Sigma}$ trained on a sample of size $n$ satisfy $\|\theta(x, z, c; \hat{f}, \hat{\Sigma})\| \le \Theta$ for all $x, z, c$, and for any $\delta \in (0, 1)$ and $\pi \in \Pi_{\mathcal{F}}$, with probability at least $1 - \delta$,

$$\Big| \mathbb{E}_{\mathcal{P}}\Big[\big(\theta(X, Z, C; f_0, \Sigma_0) - \theta(X, Z, C; \hat{f}, \hat{\Sigma})\big)^\top \big(\pi(X) - \bar{\pi}^*(X)\big)\Big] \Big| \le \mathrm{Rate}_N(n, \delta).$$

In Section 3.1, we will relate $\mathrm{Rate}_N(n, \delta)$ to the errors in estimating the individual nuisance functions $f_0$ and $\Sigma_0$ for different choices of the score function $\theta$.

Assumption 3, which we term the margin condition, controls the density of the sub-optimality gap near zero in the CLO problem instance and allows us to get faster regret rates. This type of condition was originally considered in the binary classification literature [2, 38]. It was recently extended to contextual linear optimization by [14], which we describe below.

Assumption 3 (Margin Condition). Let $\mathcal{Z}^*(x) = \arg\min_{z \in \mathcal{Z}^\circ} f_0(x)^\top z$, and let $\Delta(x) = \inf_{z \in \mathcal{Z}^\circ \setminus \mathcal{Z}^*(x)} f_0(x)^\top z - \inf_{z \in \mathcal{Z}^\circ} f_0(x)^\top z$ if $\mathcal{Z}^*(x) \neq \mathcal{Z}^\circ$ and $\Delta(x) = 0$ otherwise. Assume that for some $\alpha, \gamma \ge 0$,

$$\mathbb{P}_X\big(0 < \Delta(X) \le \delta\big) \le (\gamma \delta / B)^{\alpha} \quad \forall \delta > 0.$$

Lemmas 4 and 5 in [15] show that Assumption 3 generally holds with $\alpha = 1$ for sufficiently well-behaved $f_0$ and continuous $X$, and with $\alpha = \infty$ for discrete $X$. Moreover, any CLO instance trivially satisfies $\alpha = 0$. Overall, a larger $\alpha$ means that the sub-optimality gap between the best and second-best decisions tends to be large in more contexts, so it is easier to distinguish the optimal decisions from others. We will show that a larger $\alpha$ parameter can lead to faster regret rates.

3.1 Main Theorem

We now provide an upper bound on the regret of the IERM policy $\hat{\pi}$ in Eq. (6).

Theorem 1. Suppose Assumptions 2 and 3 hold, and $\mathcal{Z}^*(X)$ defined in Assumption 3 is a singleton almost surely. Suppose there exists a positive number $\bar{r}$ (that depends on $n$) that upper bounds the critical radii of the function classes $\mathcal{G}_1, \mathcal{G}_2$ almost surely (i.e., $\mathcal{R}_{n/2}(\mathcal{G}_1, \bar{r}) \le \bar{r}^2$ and $\mathcal{R}_{n/2}(\mathcal{G}_2, \bar{r}) \le \bar{r}^2$) and satisfies the inequalities $3 n \bar{r}^2 / 128 \ge \log\log_2(1/\bar{r})$ and $2 \exp(-3 n \bar{r}^2 / 128) \le \delta / 4$. Then there exists a positive constant $C(\alpha, \gamma)$ such that with probability at least $1 - \delta$, we have

$$\mathrm{Reg}(\hat{\pi}) \le B \left( \Big(12 \Theta \sqrt{C(\alpha, \gamma)}\, \bar{r}\Big)^{\frac{2\alpha+2}{\alpha+2}} + 24 \Theta \left( \sqrt{C(\alpha, \gamma)} \Big(\tfrac{\mathrm{Reg}(\bar{\pi}^*)}{B}\Big)^{\frac{\alpha}{2(1+\alpha)}} \bar{r} + \bar{r}^2 \right) \right) + 2\, \mathrm{Rate}_N(n/2, \delta/4) + 2\, \mathrm{Reg}(\bar{\pi}^*).$$

The upper bound in Theorem 1 involves several different types of error terms. The first type is the critical radius $\bar{r}$ that characterizes the complexity of the function classes $\mathcal{G}_1, \mathcal{G}_2$. This is a common complexity measure for function classes in statistics and machine learning [41]. For example, as we will discuss below, the critical radii scale as $O(\sqrt{\eta/n})$ if $\mathcal{G}_1, \mathcal{G}_2$ are VC-subgraph classes of dimension $\eta$. The second type is the term $\mathrm{Rate}_N(n/2, \delta/4)$ resulting from the errors in estimating the nuisance functions $f_0, \Sigma_0$.
Similar nuisance estimation errors also appear in the previous literature on offline contextual bandit learning [1, 5, 45] and on more general learning problems that involve nuisance functions [13]. Below we further discuss this term for different choices of the score function $\theta$. The third type is the misspecification error term $\mathrm{Reg}(\bar{\pi}^*)$. It is natural to expect a bigger decision regret when the misspecification error $\mathrm{Reg}(\bar{\pi}^*)$ is higher. In particular, when the policy class $\Pi_{\mathcal{F}}$ is well-specified, i.e., $\pi_{f_0} \in \Pi_{\mathcal{F}}$, we have $\mathrm{Reg}(\bar{\pi}^*) = 0$, and the regret upper bound scales with the function class complexity through a fast rate $O(\bar{r}^{\frac{2\alpha+2}{\alpha+2}})$. For VC-subgraph classes with $\bar{r} = O(\sqrt{\eta/n})$, this rate can range from $O(\sqrt{\eta/n})$ for $\alpha = 0$ to $O(\eta/n)$ for $\alpha = \infty$. However, if the policy class is misspecified so that $\mathrm{Reg}(\bar{\pi}^*) > 0$, then the dominating term related to $\bar{r}$ is a slow rate $O(\bar{r})$. This reveals an interesting phase transition between the correctly specified and misspecified settings, which was not discovered in previous theory that considers only well-specification [14].

Remark 3. The constant coefficients in the regret upper bound can be improved when $2(1+\alpha)/\alpha$ is an integer (which accommodates the case of $\alpha = 1$ justified in [15]):

$$\mathrm{Reg}(\hat{\pi}) \le B \left( \Big(12 \Theta \sqrt{C(\alpha, \gamma)}\, \bar{r}\Big)^{\frac{2\alpha+2}{\alpha+2}} + \frac{24(\alpha+1)}{\alpha+2} \Theta \left( \sqrt{C(\alpha, \gamma)} \Big(\tfrac{\mathrm{Reg}(\bar{\pi}^*)}{B}\Big)^{\frac{\alpha}{2(1+\alpha)}} \bar{r} + \bar{r}^2 \right) \right) + \frac{2\alpha+2}{\alpha+2}\, \mathrm{Rate}_N(n/2, \delta/4) + \mathrm{Reg}(\bar{\pi}^*).$$

Notably, the constant in front of the misspecification error $\mathrm{Reg}(\bar{\pi}^*)$ becomes 1 instead of 2, which we believe is tight. This upper bound follows from nearly the same proof as Theorem 1, except that it handles one inequality slightly differently. Specifically, the proof of Theorem 1 involves a transcendental inequality of the form $\mathrm{Reg}(\hat{\pi}) \le c_1 \mathrm{Reg}(\hat{\pi})^{\frac{\alpha}{2(1+\alpha)}} \bar{r} + c_2$ for certain positive terms $c_1, c_2$. This inequality is difficult to solve exactly, so we can only get an upper bound on its solution. It turns out that we can get a better upper bound when $2(1+\alpha)/\alpha$ is an integer.

The nuisance estimation rate. We now show that $\mathrm{Rate}_N(n, \delta)$ can be effectively controlled for the DM, ISW and DR score functions. In particular, $\mathrm{Rate}_N(n, \delta)$ can be bounded by the estimation errors of the nuisance functions $f_0$ and $\Sigma_0$.

Proposition 2. For any given $\delta \in (0, 1)$, let $\chi_{n,\delta}$ be a positive sequence converging to 0 as $n \to \infty$ such that the mean squared errors of the nuisance estimates satisfy the following with probability at least $1 - \delta$:

$$\max\Big\{ \mathbb{E}_X\big[\|\mathrm{Proj}_{\mathrm{span}(\mathcal{Z})}(\hat{f}(X) - f_0(X))\|^2\big],\ \mathbb{E}_X\big[\|\hat{\Sigma}^{\dagger}(X) - \Sigma_0^{\dagger}(X)\|_{\mathrm{Fro}}^2\big] \Big\} \le \chi_{n,\delta}^2,$$

where $\mathrm{Proj}_{\mathrm{span}(\mathcal{Z})}(\hat{f}(X) - f_0(X))$ is the projection of $\hat{f}(X) - f_0(X)$ onto $\mathrm{span}(\mathcal{Z})$. Then,
1. if we take $\theta = \theta_{\mathrm{DM}}$, we have $\mathrm{Rate}_N^{\mathrm{DM}}(n, \delta) = O(\chi_{n,\delta})$;
2. if we take $\theta = \theta_{\mathrm{ISW}}$, we have $\mathrm{Rate}_N^{\mathrm{ISW}}(n, \delta) = O(\chi_{n,\delta})$;
3. if we take $\theta = \theta_{\mathrm{DR}}$, we have $\mathrm{Rate}_N^{\mathrm{DR}}(n, \delta) = O(\chi_{n,\delta}^2)$.

Compared to the DM and ISW scores, the impact of the estimation errors of the nuisances in the DR score is only second order, i.e., $O(\chi_{n,\delta}^2)$ instead of $O(\chi_{n,\delta})$. This echoes the benefit of DR methods in causal inference and offline contextual bandit learning [1, 5, 6]. Notably, here we only need to bound the projected error of the nuisance estimator $\hat{f}$, which handles the setting where $\mathrm{span}(\mathcal{Z})$ does not cover the whole $\mathbb{R}^d$ space, as is the case with the shortest path problem in Section 5.

Computing the critical radius. The critical radius $\bar{r}$ characterizes the complexity of the function classes $\mathcal{G}_1$ and $\mathcal{G}_2$ defined in Eq. (8). The next proposition shows that $\bar{r}$ is of order $O(1/\sqrt{n})$ if the function classes have finite VC-subgraph dimensions. For simplicity, we focus on $\mathcal{G}_1$ only.

Proposition 3.
Suppose $\mathcal{G}_1$ has VC-subgraph dimension $\eta$ almost surely. Then for any $\delta \in (0, 1)$, there exists a universal positive constant $\bar{C}$ such that $\bar{r} = \bar{C} \sqrt{\frac{\eta \log(n+1) + \log(8/\delta)}{n}}$ satisfies the inequalities $\mathcal{R}_n(\mathcal{G}_1, \bar{r}) \le \bar{r}^2$, $3 n \bar{r}^2 / 64 \ge \log\log_2(1/\bar{r})$, and $2 \exp(-3 n \bar{r}^2 / 64) \le \delta / 4$.

3.2 Byproduct: Fast Rates in the Full Feedback Setting with Misspecification

Although our main theorem is stated for the bandit feedback setting, our regret analysis techniques can be easily adapted to the full feedback setting. The following theorem states a similar regret upper bound. To the best of our knowledge, this is the first result for CLO that shows a margin-dependent fast rate with potential policy misspecification in the full feedback setting.

Theorem 2. Suppose $\mathcal{Z}^*(X)$ defined in Assumption 3 is a singleton almost surely. Define

$$\mathcal{G}^F = \left\{ (x, y) \mapsto \frac{\rho\, y^\top \big(\pi(x) - \bar{\pi}^*(x)\big)}{2B} : \pi \in \Pi_{\mathcal{F}},\ \rho \in [0, 1] \right\}$$

and let $\bar{r}^F$ be any solution to the inequality $\mathcal{R}_n(\mathcal{G}^F, r) \le r^2$ satisfying $3 n (\bar{r}^F)^2 / 64 \ge \log\log_2(1/\bar{r}^F)$ and $2 \exp(-3 n (\bar{r}^F)^2 / 64) \le \delta$. If Assumption 3 further holds, then, with probability at least $1 - \delta$,

$$\mathrm{Reg}(\hat{\pi}_{\mathcal{F}}) \le B \Big(12 \sqrt{C(\alpha, \gamma)}\, \bar{r}^F\Big)^{\frac{2\alpha+2}{\alpha+2}} + 24 B \left( \sqrt{C(\alpha, \gamma)} \Big(\tfrac{\mathrm{Reg}(\bar{\pi}^*)}{B}\Big)^{\frac{\alpha}{2(1+\alpha)}} \bar{r}^F + (\bar{r}^F)^2 \right) + 2\, \mathrm{Reg}(\bar{\pi}^*).$$

The regret bound is similar to that in Theorem 1 for the bandit setting, except that it does not have the nuisance error term $2\,\mathrm{Rate}_N(n/2, \delta/4)$. This is because, in the full feedback setting, we observe the entire $Y$ vector, so we do not need to estimate any nuisance functions and can take the nuisance estimation error term $\mathrm{Rate}_N(n/2, \delta/4)$ to be zero. When $\Pi_{\mathcal{F}}$ is a well-specified VC-subgraph class with dimension $\eta$, we have $\mathrm{Reg}(\bar{\pi}^*) = 0$ and $\bar{r}^F = O(\sqrt{\eta/n})$, so the bound in Theorem 2 reduces to $O((\eta/n)^{(\alpha+1)/(\alpha+2)} + \eta/n)$. This bound interpolates between $O(n^{-1/2})$ and $O(n^{-1})$ according to the margin parameter $\alpha$, recovering the fast rate in the full-feedback setting without misspecification given in [14]. In contrast, our bound in Theorem 2 additionally quantifies the impact of policy misspecification, and its generalization, Theorem 1, further incorporates the impact of nuisance estimation errors in the bandit-feedback setting.

4 Computationally Tractable Surrogate Loss

The IERM objective is generally nonconvex in $f \in \mathcal{F}$, making it computationally intractable to optimize. In the full feedback setting, tractable surrogate losses have been proposed [11, 16]. In this section, we briefly explain the SPO+ loss in [11] and how it can be used in the bandit setting.

The full feedback IERM problem in Eq. (3) can be viewed as minimizing the following loss over $\mathcal{F}$: $\frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{IERM}}(f(X_i), Y_i)$, where $\ell_{\mathrm{IERM}}(f(x), y) = y^\top \pi_f(x) - \min_{z \in \mathcal{Z}} y^\top z$. This IERM loss is equivalent to the smart predict-then-optimize (SPO) loss in Definition 1 of [11]. Letting $z^*(y) \in \arg\min_{z \in \mathcal{Z}} y^\top z$ with the same tie-breaking rule as in $\pi_f$, [11] propose the SPO+ surrogate loss: $\frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{SPO+}}(f(X_i), Y_i)$, where

$$\ell_{\mathrm{SPO+}}(f(x), y) = \max_{z \in \mathcal{Z}}\, (y - 2 f(x))^\top z - (y - 2 f(x))^\top z^*(y).$$

The SPO+ loss has many desirable properties: given any fixed $y$, it is an upper bound on the IERM loss, it is convex in $f(x)$, and its subgradient at $f(x)$ has the closed form $2\big(z^*(y) - z^*(2 f(x) - y)\big)$.

Figure 1: Stochastic shortest path problem on a 5 × 5 grid with uncertain edge cost $Y_j$ and decision $z_j$ for $j = 1, \dots, 40$.
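To connect the formulas above to an implementation, here is a sketch of the SPO+ loss of [11] and its closed-form subgradient for a single observation; the LP oracle `lp_oracle(c)` returning $z^*(c) = \arg\min_{z \in \mathcal{Z}} c^\top z$ (e.g., the `plug_in_policy` routine sketched in Section 1.1) is an assumed interface. In the bandit setting of the paragraph below, the same function is simply called with the score $\theta(X_i, Z_i, C_i; \hat{f}, \hat{\Sigma})$ in place of $y$.

```python
import numpy as np


def spo_plus_loss_and_grad(f_x, y, lp_oracle):
    """SPO+ surrogate at a prediction f(x) and cost vector y, together with the
    closed-form subgradient 2(z*(y) - z*(2 f(x) - y)) with respect to f(x).
    `lp_oracle(c)` returns z*(c) = argmin_{z in Z} c^T z (an assumed interface)."""
    z_y = lp_oracle(y)                  # z*(y)
    z_shift = lp_oracle(2.0 * f_x - y)  # z*(2 f(x) - y)
    w = y - 2.0 * f_x
    # max_{z in Z} w^T z is attained at a vertex, namely at argmin_z (2 f(x) - y)^T z,
    # so the loss max_z w^T z - w^T z*(y) needs only two oracle calls.
    loss = w @ z_shift - w @ z_y
    grad = 2.0 * (z_y - z_shift)        # subgradient of the loss in f(x)
    return loss, grad
```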
In the bandit setting, although $Y_i$ is not observed, the score $\theta(X_i, Z_i, C_i; \hat{f}, \hat{\Sigma})$ plays the same role (see Eqs. (3) and (6)). So it is natural to adapt the SPO+ loss to the bandit setting by replacing the unobserved cost vector $Y_i$ with the corresponding score $\theta(X_i, Z_i, C_i; \hat{f}, \hat{\Sigma})$:

$$\hat{f}_{\mathrm{SPO+}} \in \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{j=1}^{2} \sum_{i \in \mathcal{D}_j} \ell_{\mathrm{SPO+}}\Big(f(X_i),\ \theta(X_i, Z_i, C_i; \hat{f}_{3-j}, \hat{\Sigma}_{3-j})\Big).$$

Then we use the plug-in policy $\pi_{\hat{f}_{\mathrm{SPO+}}}$ as the decision policy. This is implemented in the experiments in Section 5. We can similarly adapt any surrogate loss for the full-feedback IERM problem to the bandit feedback setting, simply replacing the cost vectors $Y_i$ with the corresponding scores.

| Methods | $n = 400$ | $n = 1000$ | $n = 1600$ |
| --- | --- | --- | --- |
| ETO | 3.34% | 0.74% | 0.35% |
| SPO+ DM | 2.30% | 0.36% | 0.16% |
| SPO+ DR PI | 2.47% | 0.59% | 0.32% |
| SPO+ DR Lambda | 2.23% | 0.40% | 0.18% |
| SPO+ DR Clip | 2.29% | 0.44% | 0.20% |
| Naive ETO | 15.03% | 12.12% | 3.53% |
| Naive SPO+ DM | 15.05% | 12.85% | 5.08% |
| Naive SPO+ DR | 14.99% | 13.00% | 5.56% |

Table 1: Average relative regret ratio of different methods over 50 replications when both the policy-inducing model and the nuisance model are correctly specified. The logging policy is a random policy.

5 Numerical Experiments

We now test the performance of our proposed methods in a simulated stochastic shortest path problem following [11, 14]. Specifically, we aim to go from the start node $s$ to the end node $t$ on a 5 × 5 grid consisting of $d = 40$ edges, where the costs of traveling on the edges are given by the random vector $Y \in \mathbb{R}^{40}$ (see Fig. 1). We consider covariates $X \in \mathbb{R}^3$ and a function $f_0(x) = \mathbb{E}[Y \mid X = x]$ whose components are cubic polynomials. The corresponding shortest path problem can be easily formulated as a CLO problem with the constraint set $\mathcal{Z}$ given by standard flow conservation constraints. The resulting optimal solution $z$ belongs to $\{0, 1\}^{40}$, indicating whether to pass each edge or not. We note that there are $m = 70$ feasible paths $z_1, \dots, z_m \in \mathcal{Z}^\circ$ from the source node to the target node, and the feasible paths are linearly dependent with a rank of 18.

We consider a bandit feedback setting, observing only the total traveling costs $C$ of the historical decisions $Z$ generated by a certain logging policy, but not the edge-wise costs $Y$. We consider different types of logging policies that generate the decisions in the observed data. In this section, we report results using a random logging policy that picks a path from all feasible ones at random regardless of the covariate value. In Appendix C, we further study the performance of two different covariate-dependent logging policies: one that picks paths according to the sign of the first covariate $X_1$, and one that depends on the signs of both $X_1$ and $X_2$. The empirical insights from the covariate-dependent logging policies are qualitatively the same as those for the random logging policy.

We numerically evaluate the performance of the ETO approach and the IERM approach.² For both approaches, we use the same class $\mathcal{F}$ to construct the decision policies. We consider three different classes for $\mathcal{F}$: a correctly specified polynomial class, a misspecified class that omits two high-order terms (termed degree-2 misspecification), and a misspecified class that omits four high-order terms (termed degree-4 misspecification). For the IERM approach, we consider DM and DR here, but defer the results of ISW to Appendix C.2, Table 3, given its significantly worse performance. In both DM and DR, the nuisance $f_0$ is estimated by the bandit-feedback regression given in Eq. (7)
with a ridge penalty, and we test the three aforementioned function classes for the nuisance class $\mathcal{F}_N$ as well. In DR, the nuisance $\Sigma_0(x)$ is estimated by the propensity score approach described in Remark 2, with the propensity scores estimated by either sample frequencies (for the random logging policy) or suitable decision tree models (for the covariate-dependent logging policies). We further consider three variants when plugging $\hat{\Sigma}$ into the doubly robust score: pseudo-inverse (DR PI), as in the original $\theta_{\mathrm{DR}}$ definition; lambda regularization (DR Lambda), where we replace $\hat{\Sigma}^{\dagger}$ with $(\hat{\Sigma} + \lambda I)^{-1}$ for a positive constant $\lambda$; and clipping (DR Clip), where eigenvalues of $\hat{\Sigma}$ below a certain threshold are clipped to the threshold before taking the pseudo-inverse. For all IERM variants, we optimize the SPO+ losses as discussed in Section 4. Further details on the experimental setup and implementation are summarized in Appendix C. Finally, we consider some naive extensions of offline contextual bandit learning with finite discrete actions (Naive ETO and Naive SPO+), where we view the feasible paths as separate discrete actions, without considering the linear structure of the decision-making problem. See Appendix A for details and Section 1.3 for background on offline bandits.

² The code scripts for all experiments can be found at https://github.com/CausalML/CLOBandit.

We test the performance of different methods on an independent testing sample of size 2000, and evaluate the ratio of their regrets relative to the expected cost of the globally optimal policy $\pi_{f_0}$. Table 1 shows the average relative regret ratios of different methods across 50 replications of the experiment for a random logging policy. Due to space limitations, we include results for the training data sizes $n = 400, 1000, 1600$, and defer results for other sizes to Appendix C.2. The relative regret of all methods properly decreases with the training data size $n$. In particular, the SPO+ approximations of the end-to-end IERM approach perform better than the ETO method. Among the SPO+ methods, the DM score achieves the best performance, while the DR score based on the pseudo-inverse performs the worst. Through a close inspection, we found that the bias adjustment term involving the pseudo-inverse in the DR score causes significant variance inflation. In fact, the ISW score also performs badly due to high variance (see Appendix C.2). The Lambda regularization and Clip techniques can effectively reduce the variance and result in improved decision-making. Moreover, the naive benchmarks that ignore the linear structure of the decision costs have much worse performance. This shows the importance of leveraging the linear problem structure.
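As an illustration of the regularized inverses used by the DR Lambda and DR Clip variants above, a short numpy sketch is given below; the particular values of $\lambda$ and the clipping threshold are illustrative choices, not the ones used in the experiments.

```python
import numpy as np


def lambda_inverse(Sigma_x, lam=1e-2):
    """DR Lambda: replace the pseudo-inverse of Sigma(x) with (Sigma(x) + lambda I)^{-1}."""
    return np.linalg.inv(Sigma_x + lam * np.eye(Sigma_x.shape[0]))


def clipped_pinv(Sigma_x, thresh=1e-2):
    """DR Clip: clip the eigenvalues of the (symmetric PSD) matrix Sigma(x) below
    `thresh` up to `thresh`, then invert via the eigendecomposition."""
    vals, vecs = np.linalg.eigh(Sigma_x)
    vals = np.clip(vals, thresh, None)
    return (vecs / vals) @ vecs.T
```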
| Methods | deg-2, $n=400$ | deg-2, $n=1000$ | deg-2, $n=1600$ | deg-4, $n=400$ | deg-4, $n=1000$ | deg-4, $n=1600$ |
| --- | --- | --- | --- | --- | --- | --- |
| **$\mathcal{F}$ misspecified (degree 2 / degree 4)** | | | | | | |
| ETO | 11.04% | 9.14% | 8.34% | 12.35% | 11.42% | 10.39% |
| **Well-specified nuisance model $\mathcal{F}_N$; $\mathcal{F}$ misspecified (degree 2 / degree 4)** | | | | | | |
| SPO+ DM | 2.81% | 0.80% | 0.54% | 4.06% | 2.21% | 2.06% |
| SPO+ DR PI | 3.27% | 1.36% | 1.05% | 4.83% | 2.95% | 2.71% |
| SPO+ DR Lambda | 2.83% | 0.97% | 0.73% | 4.33% | 2.45% | 2.25% |
| SPO+ DR Clip | 3.05% | 1.09% | 0.84% | 4.59% | 2.62% | 2.38% |
| **Well-specified policy-inducing model $\mathcal{F}$; $\mathcal{F}_N$ misspecified (degree 2 / degree 4)** | | | | | | |
| SPO+ DM | 10.01% | 8.37% | 7.47% | 12.51% | 11.22% | 9.68% |
| SPO+ DR PI | 9.11% | 7.02% | 6.44% | 11.69% | 10.19% | 9.02% |
| SPO+ DR Lambda | 9.05% | 7.52% | 6.68% | 12.31% | 10.38% | 8.96% |
| SPO+ DR Clip | 9.02% | 7.28% | 6.36% | 11.87% | 10.04% | 8.70% |
| **Both $\mathcal{F}$ and $\mathcal{F}_N$ misspecified (degree 2 / degree 4)** | | | | | | |
| SPO+ DM | 9.90% | 8.34% | 7.41% | 12.45% | 11.16% | 9.69% |
| SPO+ DR PI | 9.15% | 7.23% | 6.52% | 11.92% | 10.46% | 9.42% |
| SPO+ DR Lambda | 9.03% | 7.46% | 6.74% | 12.01% | 10.72% | 9.25% |
| SPO+ DR Clip | 8.97% | 7.22% | 6.46% | 11.75% | 10.31% | 8.95% |

Table 2: Mean relative regret ratio of different methods when the nuisance model $\mathcal{F}_N$ and the policy-inducing model $\mathcal{F}$ are misspecified to different degrees. The left three columns correspond to degree-2 misspecification and the right three to degree-4 misspecification. The logging policy is a random policy.

In Table 2, we show the performance of different methods when either the policy-inducing model $\mathcal{F}$, the nuisance model $\mathcal{F}_N$, or both are misspecified. We observe that when the policy-inducing model is misspecified, the end-to-end SPO+ methods perform much better than the ETO method, provided that the nuisance model for SPO+ is correctly specified. This is consistent with the findings in the previous full-feedback CLO literature [e.g., 11, 12, 14], showing the benefit of integrating estimation and optimization for misspecified policy models. However, the advantage of the end-to-end approaches is weakened dramatically once the nuisance model is misspecified. In this case, the evaluation of decision policies is biased, so the end-to-end approaches also target a wrong objective that may not accurately capture the decision quality. Moreover, we observe that when the nuisance model is misspecified, the DR score can somewhat outperform the DM score, because it can debias the misspecified nuisance to some extent. These results demonstrate new challenges in bandit-feedback CLO: the end-to-end approaches are sensitive to the misspecification of nuisance models, and the DM and DR scores face different bias-variance tradeoffs under nuisance model misspecification. Therefore, in practice we may prefer more flexible models for accurate nuisance modeling, while using simpler policy-inducing models for tractable end-to-end optimization.

6 Conclusions

This paper studies the bandit-feedback setting for contextual linear optimization for the first time. We adapt the induced empirical risk minimization approach to this setting, provide a novel theoretical analysis for the regret of the resulting policies, leverage surrogate losses for efficient optimization, and empirically demonstrate the performance of the proposed methods across different model specifications. Our paper has a few limitations that we aim to address in the future. First, we primarily consider parametric induced policy classes, and it would be interesting to accommodate more general nonparametric classes. Second, we focus mainly on the SPO+ surrogate loss, and investigating other surrogate losses in the bandit feedback setting would also be of great interest.
Acknowledgments and Disclosure of Funding Nathan Kallus acknowledges that this material is based upon work supported by the National Science Foundation under Grant No. 1846210. Xiaojie Mao is supported in part by National Natural Science Foundation of China (grant numbers 72322001, 72201150, and 72293561) and National Key R&D Program of China (grant number 2022ZD0116700). [1] Susan Athey and Stefan Wager. Policy learning with observational data. Econometrica, 89(1): 133 161, 2021. [2] Jean-Yves Audibert and Alexandre B Tsybakov. Fast learning rates for plug-in classifiers. The Annals of statistics, 35(2):608 633, 2007. [3] Stéphane Boucheron, Gábor Lugosi, Pascal Massart, et al. Concentration inequalities using the entropy method. The Annals of Probability, 31(3):1583 1614, 2003. [4] Olivier Bousquet. Concentration inequalities for sub-additive functions using the entropy method. In Stochastic inequalities and applications, pages 213 247. Springer, 2003. [5] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1 C68, 2018. [6] Victor Chernozhukov, Mert Demirer, Greg Lewis, and Vasilis Syrgkanis. Semi-parametric efficient policy learning with continuous actions. Advances in Neural Information Processing Systems, 32, 2019. [7] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. ar Xiv preprint ar Xiv:1103.4601, 2011. [8] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly robust policy evaluation and optimization. 2014. [9] Othman El Balghiti, Adam N Elmachtoub, Paul Grigas, and Ambuj Tewari. Generalization bounds in the predict-then-optimize framework. In Neur IPS, pages 14412 14421, 2019. [10] Charles Elkan. The foundations of cost-sensitive learning. In International joint conference on artificial intelligence, volume 17, pages 973 978. Lawrence Erlbaum Associates Ltd, 2001. [11] Adam N Elmachtoub and Paul Grigas. Smart predict, then optimize . Management Science, 68(1):9 26, 2022. [12] Adam N Elmachtoub, Henry Lam, Haofeng Zhang, and Yunfan Zhao. Estimate-then-optimize versus integrated-estimation-optimization: A stochastic dominance perspective. ar Xiv preprint ar Xiv:2304.06833, 2023. [13] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. The Annals of Statistics, 51(3):879 908, 2023. [14] Yichun Hu, Nathan Kallus, and Xiaojie Mao. Fast rates for contextual linear optimization. Management Science, 68(6):4236 4245, 2022. [15] Yichun Hu, Nathan Kallus, and Masatoshi Uehara. Fast rates for the regret of offline reinforcement learning. Mathematics of Operations Research, 2024. [16] Michael Huang and Vishal Gupta. Learning best-in-class policies for the predict-then-optimize framework. ar Xiv preprint ar Xiv:2402.03256, 2024. [17] Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge university press, 2015. [18] Nathan Kallus. Recursive partitioning for personalization using observational data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1789 1798. JMLR. org, 2017. [19] Nathan Kallus. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, pages 8895 8906, 2018. [20] Nathan Kallus. More efficient policy learning via optimal retargeting. Journal of the American Statistical Association, 116(534):646 658, 2021. 
[21] Nathan Kallus and Masatoshi Uehara. Doubly robust off-policy value and gradient estimation for deterministic policies. Advances in Neural Information Processing Systems, 33:10420 10430, 2020. [22] Nathan Kallus and Angela Zhou. Policy evaluation and optimization with continuous treatments. In International conference on artificial intelligence and statistics, pages 1243 1251. PMLR, 2018. [23] Nathan Kallus, Xiaojie Mao, Kaiwen Wang, and Zhengyuan Zhou. Doubly robust distributionally robust off-policy evaluation and learning. In International Conference on Machine Learning, pages 10598 10632. PMLR, 2022. [24] Toru Kitagawa and Aleksey Tetenov. Who should be treated? empirical welfare maximization methods for treatment choice. Econometrica, 86(2):591 616, 2018. [25] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020. [26] Kelsey Maass, Arun V Sathanur, Arif Khan, and Robert Rallo. Street-level travel-time estimation via aggregated uber data. In 2020 Proceedings of the SIAM Workshop on Combinatorial Scientific Computing, pages 76 84. SIAM, 2020. [27] Jayanta Mandi, James Kotary, Senne Berden, Maxime Mulamba, Victor Bucarey, Tias Guns, and Ferdinando Fioretto. Decision-focused learning: Foundations, state of the art, benchmark and future opportunities. ar Xiv preprint ar Xiv:2307.13565, 2023. [28] David Pollard. Empirical processes: theory and applications. In NSF-CBMS regional conference series in probability and statistics, pages i 86. JSTOR, 1990. [29] Noveen Sachdeva, Lequn Wang, Dawen Liang, Nathan Kallus, and Julian Mc Auley. Off-policy evaluation for large action spaces via policy convolution. In Proceedings of the ACM on Web Conference 2024, page 3576 3585, 2024. [30] Utsav Sadana, Abhilash Chenreddy, Erick Delage, Alexandre Forel, Emma Frejinger, and Thibaut Vidal. A survey of contextual optimization methods for decision-making under uncertainty. European Journal of Operational Research, 2024. [31] Bodhisattva Sen. A gentle introduction to empirical process theory and applications. Lecture Notes, Columbia University, 11:28 29, 2018. [32] Nian Si, Fan Zhang, Zhengyuan Zhou, and Jose Blanchet. Distributionally robust batch contextual bandits. Management Science, 69(10):5772 5793, 2023. [33] Yeran Sun, Yinming Ren, and Xuan Sun. Uber movement data: A proxy for average one-way commuting times by car. ISPRS International Journal of Geo-Information, 9(3):184, 2020. [34] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. The Journal of Machine Learning Research, 16(1): 1731 1755, 2015. [35] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814 823. PMLR, 2015. [36] Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. advances in neural information processing systems, 28, 2015. [37] Michel Talagrand. New concentration inequalities in product spaces. Inventiones mathematicae, 126(3):505 563, 1996. [38] Alexander B Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135 166, 2004. [39] Masatoshi Uehara, Chengchun Shi, and Nathan Kallus. A review of off-policy evaluation in reinforcement learning. ar Xiv preprint ar Xiv:2212.06355, 2022. [40] Aad W Van Der Vaart and Jon A Wellner. Weak convergence and empirical processes. Springer, 1996. [41] Martin J Wainwright. 
High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019. [42] Baqun Zhang, Anastasios A Tsiatis, Marie Davidian, Min Zhang, and Eric Laber. Estimating optimal treatment regimes from a classification perspective. Stat, 1(1):103 114, 2012. [43] Yingqi Zhao, Donglin Zeng, A John Rush, and Michael R Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106 1118, 2012. [44] Xin Zhou, Nicole Mayer-Hamblett, Umer Khan, and Michael R Kosorok. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517):169 187, 2017. [45] Zhengyuan Zhou, Susan Athey, and Stefan Wager. Offline multi-action policy learning: Generalization and optimization. Operations Research, 71(1):148 183, 2023. A Naive Extensions of Offline Contextual Bandit Learning In this section, we show some alternative approaches to solve the stochastic shortest path problem in Section 5. These approaches can be viewed as naive extensions of offline contextual bandit learning with discrete actions. Specifically, consider the feasible paths z1, . . . , zm for m = 70. We can view them as separate discrete actions, and a feasible decision policy π is a mapping from the covariates X to one of the m = 70 feasible paths. We now adopt one-hot-encoding for the decisions. Consider a simplex Zsimplex in Rm and zero-one vector z {0, 1}m that takes the value 1 on one and only one of its coordinate. Each feasible decision in z1, . . . , zm corresponds to one vector z, e.g., the decision zj corresponds to the z vector whose j-th entry is 1 and other entries are all 0. Our decision-making problem restricted to the feasible decisions z1, . . . , zm can be written as follows: min z Zsimplex j=1 zj E[C | Z = zj, X = x] min z Zsimplex z f0(x), where f0(x) = (E[C | Z = z1, X = x], . . . , E[C | Z = zm, X = x]) . For any given x, the resulting decision will be an one-hot vector that corresponds to an optimal decision at x (which can be equivalently given by πf0(x) for an plug-in policy in Eq. (2) at the true f0(x) = E[Y | X = x]). In this formulation, we view the decisions z1, . . . , zm as separate discrete actions, and we do not necessarily take into account the linear structure of the decision cost. We can easily adapt the bandit-feedback ETO to this new formulation. Specifically, we can first construct an estimator ˆ f for the f0 function. For offline bandit learning with discrte actions, this would usually be implemented by regressing the observed total cost C with respect to the covariates X, within each subsample for each of the feasible decisions respectively. Given the estimator ˆ f, we can then solve the optimization problem min z Zsimplex z ˆ f0(x). We finally inspect which coordinate of the resulting solution is equal to 1 and choose it as the decision. To adapt the IERM approach, we similarly define the plug-in policy for any given hypothesis f(x) : Rp Rm for the function f0(x) = (E[C | Z = z1, X = x], . . . , E[C | Z = zm, X = x]) : π f(x) arg min z Zsimplex f(x) z, where ties are again broken by some fixed rules. Given a function class F, we can then consider the induced policy class Π F = { π f : f F}. For any policy π Π F, its output is an one-hot vector, whose entry with value 1 corresponds to the chosen decision among z1, . . . , zm. For any observed decision Z {z1, . . . 
, zm}, we denote its one-hot transformation Z as the zero-one vector whose value-one entry corresponds to the value of Z. For a given observed total cost C, we denote C as the vector all of whose entries are equal to C. In the lemma below, we show that the value of each policy π can be also identified by some score funtions. Lemma 1. For any given policy π that maps any covariate value x to an m-dimensional one-hot vector, its policy value can be written as follows: V ( π) = E h θ(X, Z, C; f0, e0) π(X) i , where f0(x) = (E[C | Z = z1, X = x], . . . , E[C | Z = zm, X = x]) e0(x) = (e0(z1 | x, . . . , e0(zm | x))) , e0(z | x) = P(Z = zj | X = x), and the score function θ can take three different forms: 1. (Direct Method) θDM(x, z, c; f, e) = f(x); 2. (Inverse Propensity Weighting) θIPW(x, z, c; f, e) = z e(x) c; 3. (Doubly Robust) θDR(x, z, c; f, e) = f(x) + z e(x)( c f(x)). In the three score functions above, all vector operations are entry-wise operations. From Lemma 1, the new formulations above have very similar structure as our previous formulation in Section 2 Proposition 1. The major differences are that we restrict to the discrete actions z1, . . . , zm, redefine certain variables accordingly, and consider a simplex set as the constraint. We note that the identification in Lemma 1 mimics the DM, IPW and DR identification in offline contextual bandit learning [e.g., 1, 7, 43]. Since the identification formulae are analogous to those in Section 2, we can easily apply the same policy learning methods in Section 2 and the SPO+ relaxation in Section 4. B Omitted Proofs B.1 Supporting Lemmas Lemma 2 (Talagrand s inequality, [4, 31, 37]). Let Ui, i = 1, . . . , n be independent U-valued random variables. Let H be a countable family of measurable real-valued functions on U such that h v and E[h(U1)] = = E[h(Un)] = 0, for all h H. Define i=1 sup h H E h2(Ui) . Then, for all t 0, 2vn + 2tv/3 Lemma 3. Fix functions f, Σ independent of {(Xi, Zi, Ci)}n i=1 such that θ(x, z, c; f, Σ) Θ for all x, z, c. Define the function class G = (x, z, c) θ(x, z, c; f, Σ) (π(x) π (x))ρ 2BΘ : π ΠF, ρ [0, 1] . Let r be any solution to the inequality Rn(G, r) r2 satisfying 3n r2/64 log log2(1/ r). Then we have |(En EP )g| g 2 + r 6 r 2 exp 3 i=1 g(Xi, Zi, Ci). Proof of Lemma 3. When supg G|(En EP )g|/( g 2 + r) > 6 r, one of the following two events must hold true: E1 = |(En EP )g| 6 r2 for some g G such that g 2 r , E2 = {|(En EP )g| 6 g 2 r for some g G such that g 2 r}. Zn(r) = sup g G, g 2 r |(En EP )g|. Note that g 2 r implies E[(g EP (g))2] r2, and we also have g EP (g) 2. By Talagrand s inequality (Lemma 2) over the function class {g EP (g) : g G}, P(Zn(r) E[Zn(r)] + t) exp nt2 8E[Zn(r)] + 2r2 + 4t/3 We now bound the expectation E[Zn(r)]. Since G is star-shaped3, by [41, Lemma 13.6], r Rn(G, r)/r is non-increasing. Thus, for any r r, E[Zn(r)] 2Rn(G, r) 2r r Rn(G, r) 2r r, 3A function class G is star-shaped if for any g G and ρ [0, 1], we have ρg G. where the first inequality comes from a symmetrization argument, the second inequality uses the fact that r Rn(G, r)/r is non-increasing, and the third inequality is by definition of r. The Talagrand s then implies P(Zn(r) 2r r + t) exp nt2 16r r + 2r2 + 4t/3 We first bound P(E1). Taking r = r and t = 4 r2 in Eq. (10), we get P(E1) P Zn( r) 6 r2 exp 24 We now bound P(E2). Note that P(E2) P(Zn( g 2) 6 g 2 r for some g G, g 2 r). Gm = g G : 2m 1 r g 2 2m r . Since g 2 1, there exists M log2(1/ r) such that G {g : g 2 r} M m=1Gm. 
m=1 P(Zn( g 2) 6 g 2 r for some g Gm). We now bound each term in the summation above. If there exists g Gm such that Zn( g 2) 6 g 2 r, then we have Zn(2m r) Zn( g 2) 6 g 2 r 3 2m r2. P(Zn( g 2) 6 g 2 r for some g Gm) P Zn(2m r) 3 2m r2 . Now, taking r = 2m r and t = 2m r2 in Eq. (10), we get P Zn(2m r) 3 2m r2 exp 3n r2 13 22 m + 6 Therefore, if 3 64n r2 log log2(1/ r), then P(E2) M exp 3 32n r2 exp 3 Combining the bounds on P(E1) and P(E2) leads to the final conclusion. Lemma 4. Suppose Assumption 3 holds and P(|Z (X)| > 1) = 0. Then there exists a constant C(α, γ) such that for any π ΠF, P(π(X) = πf0(X)) C(α, γ) Reg(π) Proof of Lemma 4. This follows directly from [14, Lemma 1]. Lemma 5. Let c1, c2, r be positive constants. For any α > 0, if a positive number x satisfies α 2(1+α) r + c2, x (c1r) 2α+2 Proof of Lemma 5. First, note that α 2(1+α) r c2 = 1 c1rα 2(1 + α)y 2+α The derivative is strictly increasing in y and is eventually positive. Note the function y c1yα/(2+2α)r c2 takes a negative value at y = 0. Then as y increases, the value of y c1yα/(2+2α)r c2 first decreases and then increases. Therefore, if y > 0 satisfies the inequality y c1yα/(2+2α)r c2 0, then such y also provides an upper bound on x. Hence it is sufficient to show that y = (c1r) 2α+2 α+2 + 2c2 satisfies the inequality, or equivalently, α+2 + c2 c1 (c1r) 2α+2 α+2 + 2c2 α 2(1+α) r. (11) Suppose α/(2 + 2α) is a rational number. In this case, we can write α/(2 + 2α) = m1/m2, where m1 and m2 are positive integers such that m2 2m1 + 1. Eq. (11) is then equivalent to (c1r) m2 m2 m1 + c2 m2 cm2 1 (c1r) m2 m2 m1 + 2c2 m1 rm2. (12) Using the multinomial theorem, we have (c1r) m2 m2 m1 + c2 m2 cm2 1 (c1r) m2 m2 m1 + 2c2 m1 rm2 m2! (i + m2 m1)!(m1 i)! m1!2m1 i im2 m2 m1 +m2cm1 i 2 . Since m2 2m1 + 1, we have for any i = 0, . . . , m1, m2! (i + m2 m1)!(m1 i)! m1!2m1 i This can be proved by showing the ratio of LHS over RHS is larger than 1. Hence, Eq. (11) holds true when α/(2 + 2α) is a rational number. Finally, note that α+2 + c2 c1 (c1r) 2α+2 α+2 + 2c2 α 2(1+α) r is continuous in α. Since any real number α can be viewed as the limit of a sequence of rational numbers and Eq. (11) holds for all rational numbers, it also holds true for all α > 0. Lemma 6. Let c1, c2, r, y, z be positive constants, and let α be a positive constant such that 2(1 + α)/α is an integer. If a positive number x satisfies α 2(1+α) r + c1y α 2(1+α) r + c2r2 + z + y, α+2 1 r 2α+2 α+2 + 2α + 2 α + 2 c1y α 2α+2 r + 2α + 2 α + 2 c2r2 + 2α + 2 α + 2 z + y. Proof of Lemma 6. Let 2(α + 1)/α = m, where m is an integer by assumption. Using similar argu- ments as in the proof of Lemma 5, it is sufficient to show that for w = c α+2 1 r 2α+2 α+2 c1y α 2α+2 r+ α+2 c2r2 + 2α+2 α+2 z + y and c 2 = c1y α 2(1+α) r + c2r2 + z + y, we have α 2(1+α) r c 2 0. This is equivalent to showing α+2 1 r 2α+2 α+2 + 2α + 2 α + 2 c1y α 2α+2 r + 2α + 2 α + 2 c2r2 + 2α + 2 α + 2 z + y α+2 1 r 2α+2 α+2 + 2α + 2 α + 2 c1y α 2α+2 r + 2α + 2 α + 2 c2r2 + 2α + 2 α + 2 z + y α 2(1+α) r + c1y α 2(1+α) r + c2r2 + z + y. Since α = 2/(m 2), the above inequality is equivalent to c m m 1 1 r m m 1 + 1 m 1c1y 1 m r + 1 m 1c2r2 + 1 m 1z m m m 1 1 r m m 1 + m m 1c1y 1 m r + m m 1c2r2 + m m 1z + y . Using the multinomial theorem, it is easy to see that the expansion of LHS contains all terms on the RHS (plus additional positive terms). This finishes proving our conclusion. B.2 Proof of Main Theorem Proof of Theorem 1. 
To simplify notation, we define Enj h θ(X, Z, C; ˆf3 j, ˆΣ3 j) (ˆπ(X) π (X)) i i Dj θ(Xi, Zi, Ci; ˆf3 j, ˆΣ3 j) (ˆπ(Xi) π (Xi)). We can decompose the regret as Reg(ˆπ) =EP f0(X) (ˆπ(X) πf0(X)) =EP θ(X, Z, C; f0, Σ0) (ˆπ(X) π (X)) + E f0(X) ( π (X) πf0(X)) θ(X, Z, C; f0, Σ0) θ(X, Z, C; ˆfj, ˆΣj) (ˆπ(X) π (X)) j=1 (EP Enj) h θ(X, Z, C; ˆf3 j, ˆΣ3 j) (ˆπ(X) π (X)) i + Reg( π ), where the inequality follows from the definition of ˆπ. By Assumption 2, with probability at least 1 δ/2, θ(X, Z, C; f0, Σ0) θ(X, Z, C; ˆfj, ˆΣj) (ˆπ(X) π (X)) Rate N(n/2, δ/4). We now bound (EP En1) h θ(X, Z, C; ˆf2, ˆΣ2) (ˆπ(X) π (X)) i . By Lemma 3 and the assump- tion that 2 exp 3n r2/128 δ/4, we have that with probability at least 1 δ/4, |(En1 EP )g| g 2 + r 6 r. (13) Assuming Eq. (13) holds, (EP En1) h θ(X, Z, C; ˆf2, ˆΣ2) (ˆπ(X) π (X)) i θ(X, Z, C; ˆf2, ˆΣ2) (ˆπ(X) π (X)) θ(X, Z, C; ˆf2, ˆΣ2) (ˆπ(X) πf0(X)) θ(X, Z, C; ˆf2, ˆΣ2) (πf0(X) π (X)) For any π ΠF, θ(X, Z, C; ˆf2, ˆΣ2) (π(X) πf0(X)) θ(X, Z, C; ˆf2, ˆΣ2) (π(X) πf0(X)) I{π(X) = πf0(X)} P(π(X) = πf0(X)) C(α, γ) Reg(π) where the last inequality follows from Lemma 4. Applying the inequality above for both ˆπ and πf0 and plug the bounds into Eq. (14), we get (EP En1) h θ(X, Z, C; ˆf2, ˆΣ2) (ˆπ(X) π (X)) i C(α, γ) Reg(ˆπ) α 2(1+α) r + q C(α, γ) Reg( π ) α 2(1+α) r + r2 ! We can similarly bound (EP En2) h θ(X, Z, C; ˆf1, ˆΣ1) (ˆπ(X) π (X)) i . Combining all pieces together, we get that with probability at least 1 δ, C(α, γ) Reg(ˆπ) α 2(1+α) r + q C(α, γ) Reg( π ) α 2(1+α) r + r2 ! + Rate N(n/2, δ/4) B + Reg( π ) Solving the above inequality with respect to Reg(ˆπ)/B using Lemma 5, we have Reg(ˆπ) B 12Θ q C(α, γ) r 2α+2 C(α, γ) Reg( π ) α 2(1+α) r + r2 ! + 2Rate N(n/2, δ/4) + 2Reg( π ). B.3 Proofs of Propositions Proof of Proposition 1. For direct method, the conclusion is obvious. For ISW, we have for any Σ, E h Σ+(X)ZC π(X) i E f0(X) π(X) = E f0(X) (I Σ+(X)Σ0(X)) π(X) . Let M(x) be a matrix whose columns include all basis vectors of the span of Z. Then π(X) Range(M(x)). According to the coverage assumption, the column space of Σ0(x) is identical to the column space of M(x). By the property of pseudo-inverse, the column space of (I Σ 0Σ0) is orthogonal to the column space of M. Therefore, (I Σ+(X)Σ0(X)) π(X) = 0 for any π Z. For doubly robust score, we have that for any function f, Σ, E π(X) f(X) + Σ(X)+Z(C Z f(X)) E[π(X) f0(X)] =E π(X) (I Σ+(X)Σ0(X))(f(X) f0(X)) . Taking either f = f0 or Σ = Σ0 gives 0. Proof of Proposition 2. Because π(x) π span(Z), for the DM score we have θDM(X, Z, C; f0, Σ0) θDM(X, Z, C; ˆf, ˆΣ) (π(X) π (X)) Projspan(Z) θDM(X, Z, C; f0, Σ0) θDM(X, Z, C; ˆf, ˆΣ) (π(X) π (X)) 2B n EX[ Projspan(Z)( ˆf(X) f0(X)) 2] o1/2 = O(χn,δ). For the ISW score, we have θISW (X, Z, C; f0, Σ0) θISW (X, Z, C; ˆf, ˆΣ) (π(X) π (X)) 2B n EX[ (ˆΣ (X) Σ 0(X))Σ0(X) 2 Fro] o1/2 = O(χn,δ). For the doubly robust score, we can easily get θDR(X, Z, C; f0, Σ0) θDR(X, Z, C; ˆf, ˆΣ) (π(X) π (X)) =EP h (π(X) π (X)) I ˆΣ+(X)Σ0(X) ( ˆf(X) f0(X)) i =EP h (π(X) π (X)) Σ+ 0 (X) ˆΣ+(X) Σ0(X)( ˆf(X) f0(X)) i =EP h (π(X) π (X)) Σ+ 0 (X) ˆΣ+(X) Σ0(X) Proj Span(Z)( ˆf(X) f0(X)) i n EX[ (ˆΣ (X) Σ 0(X))Σ0(X) 2 Fro] o1/2n EX[ Projspan(Z)( ˆf(X) f0(X)) 2] o1/2 = O(χ2 n,δ). Here the second equation holds because π(x) π (x) belongs to the linear span of Z, but I Σ+ 0 (x)Σ0 is orthogonal to the linear span of Z, as we already argued in the proof of Proposition 1. The third equation holds because the column space of Σ0(X) is identical to span(Z) according to the coverage assumption. 
Proof of Proposition 3. Define Note that whenever EΨ(|W|/w) 1 for some random variable W, we have by Markov s inequality that P(|W| > t) 5 exp( t2/w2), 0 P(|W| > t)dt 5w. (15) Throughout the proof, we condition on the event that G1 has VC-subgraph dimension η. We finish the proof in three steps. Step I: Critical radius for empirical Rademacher complexity. Define the localized empirical Rademacher complexity ˆRn(G1, r) = Eϵ sup g G, g n r i=1 ϵig(Xi, Zi, Ci) where ϵ1, . . . , ϵn are i.i.d. Rademacher random variables, and g n = p Pn i=1 g2(Xi, Zi, Ci)/n. Let ˆr n be the smallest positive solution to ˆRn(G1, r) r2/32. In what follows, we show that there exists a universal constant C such that η log(n + 1) For any g G1, define set G = {(g(X1, Z1, C1), . . . , g(Xn, Zn, Cn)) : g G1, g n r}. Let D(t, G) be the t-packing number of G and N(t, G) be the t-covering number. Note that g nr for all g G. By [28, Theorem 3.5], 1 J sup g G1, g n r i=1 ϵig(Xi, Zi, Ci) log D(t, G)dt. So by Eq. (15), ˆRn(G1, r) 5 Consider the function class G 1 = {g : g G1, g n r}. Note that nr is the envelope of G 1 on (X1, Z1, C1), . . . , (Xn, Zn, Cn). Applying [40, Theorem 2.6.7] gives D( nrt, G) N( nrt/2, G) C(η + 1)(16e)η+1 4n for a universal constant C. Thus, J =9 nr Z 1 log D( nrt, G)dt log C + log(η + 1) + (η + 1) log(16e) + η log n + η log 4 2η log tdt 2 log C + 15 3 log tdt p η log(n + 1)nr, where R 1 0 2 log C + 15 3 log tdt < . We then obtain that for a (different) universal constant C, ˆRn(G1, r) C 32 η log(n + 1) Therefore, for any samples (Xi, Zi, Ci)n i=1, any ˆrn C p η log(n + 1)/n is a valid solution to ˆRn(G1, r) r2/32, which implies Eq. (16). Step II: Critical radius for Rademacher complexity. Let r n be the smallest positive solution to the inequality Rn(G1, r) r2/32. We now bound r n. For any t > 0, define the random variable sup g G1, g 2 t i=1 ϵig(Xi, Zi, Ci) so that Rn(G1, r) = EP [Wn(r)] by construction. Define the events E3(t) = |Wn(t) Rn(G1, t)| r nt 112 g 2 n g 2 2 g 2 2 + (r n)2 1 Following the proof of [14, Lemma EC.12], P r n 5 ˆr n 3r n P(2E3(r n) E3(7r n) E4). [14, Lemma EC.10] implies that P(Ec 4) 2e c1n(r n)2 for some universal constant c1 > 0. Moreover, for any ζ 1, we have Rn(G1, ζr n) Rn(r n) (r n)2/32. By [3, Theorem 16], P(Ec 3(ζr n)) 2e c2n(r n)2 for some universal constant c2 > 0. Combining all pieces we have P r n 5 ˆr n 3r n 1 6e ( c1 c2)n(r n)2. (17) By step I in the proof, P ˆr n C0 p η log(n + 1)/n = 1 for some constant C0. Let C > 5 C0 be a constant such that 2 C( c1 c2) < 1/6. If r n > C p η log(n + 1)/n, by Eq. (17) we have P ˆr n > C0 p η log(n + 1)/n > 0, which leads to contradiction. Thus, η log(n + 1)/n. Finally, any r C p η log(n + 1)/n solves the inequality Rn(G1, r) r2. Step III: Checking other conditions. The other two inequalities, 3n r2/64 log log2(1/ r) and 2 exp 3n r2/64 δ/2, are easily satisfied as long as we take C big enough. B.4 Proof for Full Feedback Setting Proof of Theorem 1. To simplify notation, we write En Y ˆπF(X) π (X) = 1 i=1 Y i ˆπF(Xi) π (Xi) . We can decompose the regret as Reg ˆπF =EP Y ˆπF(X) πf0(X) =EP Y ˆπF(X) π (X) + E Y ( π (X) πf0(X)) (EP En) Y ˆπF(X) π (X) + Reg( π ), where the inequality follows from the definition of ˆπF. We now bound (EP En) Y ˆπF(X) π (X) . Following a similar proof as in the proof of Lemma 3, we have that with probability at least 1 δ, |(En EP )g| g 2 + r F 6 r F. (18) Assuming Eq. (18) holds, (EP En) Y ˆπF(X) π (X) Y ˆπF(X) π (X) 2 r F + ( r F)2 ! Y ˆπF(X) πf0(X) 2 r F + Y (πf0(X) π (X)) 2 r F + ( r F)2 ! 
For any π ΠF, Y (π(X) πf0(X)) " Y (π(X) πf0(X)) 2 I{π(X) = πf0(X)} P(π(X) = πf0(X)) C(α, γ) Reg(π) where the last inequality follows from Lemma 4. Thus, (EP En) Y ˆπF(X) π (X) C(α, γ) Reg(ˆπF) α 2(1+α) r F + q C(α, γ) Reg( π ) α 2(1+α) r F + ( r F)2 ! Combining all pieces together, we get that with probability at least 1 δ, C(α, γ) Reg(ˆπF) α 2(1+α) r F + q C(α, γ) Reg( π ) α 2(1+α) r F + ( r F)2 ! Solving the above inequality with respect to Reg(ˆπF)/B using Lemma 5, we have Reg(ˆπF) B 12 q C(α, γ) r F 2α+2 C(α, γ) Reg( π ) α 2(1+α) r F + ( r F)2 ! + 2Reg( π ). C Additional Experimental Details In Section 5, we provide experimental results for different methods under various model specifications and different logging policies. In this section, we further explain the details of experiment setup, implementation, and provide additional experimental results. All experiments in the paper are implemented on a cloud computing platform with 128 CPUs of model Intel(R) Xeon(R) Platinum 8369B CPU @ 2.70GHz, 250GB RAM and 500GB storage. The experiment for each specification of F and FN and each logging policy takes around 2 days of running time based on parallel computing on 100 CPUs. C.1 Experimental Setup and Implementation Details Data generating process. We first generate i.i.d draws of the covariates X = (X1, X2, X3) R3 from independent standard normal distribution. Then we simulate the full feedback Y according to the equation Y = f0(X) + ϵ, where f0(X) = 3 + W 1 X1 + W 2 X2 + W 3 X3 + W 4 X1X2 + W 5 X2X3+W 6 X1X3+W 7 X1X2X3 for coefficient vectors W 1 , . . . , W 7 Rd and a random noise ϵ drawn from the Unif[ 0.5, 0.5]. To fix the coefficient vectors W 1 , . . . , W 7 , we draw their entries independently the from the Unif[0, 1] distribution once, and then use the resulting fixed coefficient vectors throughout the experiment. We sample observed decisions Z from the set of all feasible path decisions {z1, . . . , zm} for m = 70, according to different logging policies that will be described shortly. Then the total cost C = Y Z is recorded in the observed data. We consider three different logging policies. One is a random logging policy that uniformly samples each decision from the feasible decisions, regardless of the covariate value. The other two are covariate-dependent logging policies. For these two policies, we first remove 20 feasible decisions that correspond to the optimal decisions for some covariate observations in the testing data, and then randomly sample the observed decisions from the rest of 50 feasible decisions according to different rules. This means that many promising decisions are not explored by the logging policies at all. We hope to use these logging policies to demonstrate the value of leveraging the linear structure of the decision-making problem, since exploiting the linear structure allows to extrapolate the feedback from the logged decisions to decisions never explored by the policies. Specifically, we further divide the remaining 50 feasible decisions into two even groups. Then the two different logging policies sample decisions from the two groups according to different rules depending on the covariate value. One covariate-dependent logging policy samples the decisions according to the sign of the first covariate X1. When X1 > 0, the logging policy chooses the first group with probability 2/3 and the second group with probability 1/3. When X1 0, the logging policy chooses the first group with probability 1/3 and the second group with probability 2/3. 
Once the group is decided, the policy then uniformly samples one decision from the chosen group. For concreteness, we will refer to this policy as the X1-policy.

The other covariate-dependent policy samples decisions according to the signs of both X1 and X2. When X1 > 0 and X2 > 0, the logging policy chooses the two groups with probabilities 2/3 and 1/3, respectively. When X1 > 0 and X2 ≤ 0, it chooses the two groups with probabilities 1/3 and 2/3, respectively. When X1 ≤ 0 and X2 > 0, it chooses the two groups with probabilities 3/4 and 1/4, respectively. When X1 ≤ 0 and X2 ≤ 0, it chooses the two groups with probabilities 1/4 and 3/4, respectively. Once the group is decided, the policy again uniformly samples one decision from the chosen group. We will refer to this policy as the X1X2-policy.

Specification of the policy-inducing model and nuisance model. For the policy-inducing model and the nuisance model, we consider three different classes. The first is the correctly specified class {(x1, x2, x3) ↦ W0 + W1x1 + W2x2 + W3x3 + W4x1x2 + W5x2x3 + W6x1x3 + W7x1x2x3 : W0, . . . , W7 ∈ R}. The second class {(x1, x2, x3) ↦ W0 + W1x1 + W2x2 + W3x3 + W4x1x2 + W5x2x3 : W0, . . . , W5 ∈ R} omits two interaction terms and is thus misspecified (we refer to this as degree-2 misspecification). The third class {(x1, x2, x3) ↦ W0 + W1x1 + W2x2 + W3x3 : W0, . . . , W3 ∈ R} omits all four interaction terms (we refer to this as degree-4 misspecification).

Nuisance estimation for the SPO+ methods. The SPO+ methods involve two different nuisances. One is the function f0(x) = E[Y | X = x]. We estimate it by the least-squares regression in Eq. (7), with the class FN being one of the three classes described above (i.e., correct specification, degree-2 misspecification, or degree-4 misspecification), and we incorporate an additional ridge penalty with coefficient 1. We estimate the nuisance Σ0(x) using the propensity score approach described in Remark 2, so we need to estimate the propensity scores P(Z = zj | X = x) for j = 1, . . . , m. For the random logging policy, we simply estimate P(Z = zj | X = x) for any x by the empirical frequency of the decision zj in the observed data. For the X1-policy and the X1X2-policy, we estimate the propensity scores by classification decision trees of depth 3 trained to classify each instance into one of the observed classes among z1, . . . , zm. These nuisances are all estimated using the two-fold cross-fitting described in Section 1.2. Since the matrix Σ0(x) is rank-deficient, we cannot directly invert it. Besides taking the pseudo-inverse, we also implement the Lambda regularization and clipping techniques described in Section 5. For Lambda regularization, we set the regularization parameter λ to 1. For the clipping technique, we take the eigen-decomposition of the estimated matrix, raise all eigenvalues below 1 to 1, and then take the pseudo-inverse of the transformed matrix. These two choices correspond to the SPO+ DR Lambda and SPO+ DR Clip methods in Section 5 and this section.
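To make the construction above concrete, the following is a minimal sketch (not the authors' released implementation) of a propensity-based estimate of Σ0(x) and of the three ways of inverting it (plain pseudo-inverse, Lambda regularization, clipping). The helper names and the toy feasible set are hypothetical; the estimate is written as a propensity-weighted sum of z z⊤, which reflects the interpretation of Σ0(x) as the conditional second moment E[ZZ⊤ | X = x] used in the proof of Proposition 1 (Remark 2 is not restated here, so treat this as an assumption-laden illustration).

```python
# Illustrative sketch only. Assumes:
#   Z_feas: (m, d) array whose rows are the feasible path decisions z_1, ..., z_m,
#   prop:   (m,) array of estimated propensities P(Z = z_j | X = x) at a given x,
# and that Sigma_0(x) denotes the conditional second moment E[Z Z^T | X = x].
import numpy as np

def sigma_hat(Z_feas, prop):
    """Propensity-based estimate of Sigma_0(x) = sum_j P(Z = z_j | X = x) z_j z_j^T."""
    return (Z_feas.T * prop) @ Z_feas  # (d, d), typically rank-deficient

def inverse_pi(S):
    """Plain pseudo-inverse (SPO+ DR PI)."""
    return np.linalg.pinv(S)

def inverse_lambda(S, lam=1.0):
    """Lambda regularization (SPO+ DR Lambda): invert S + lam * I."""
    return np.linalg.inv(S + lam * np.eye(S.shape[0]))

def inverse_clip(S, floor=1.0):
    """Clipping (SPO+ DR Clip): raise eigenvalues below `floor` to `floor`,
    then take the pseudo-inverse of the transformed matrix."""
    eigval, eigvec = np.linalg.eigh(S)           # S is symmetric PSD
    S_clipped = (eigvec * np.maximum(eigval, floor)) @ eigvec.T
    return np.linalg.pinv(S_clipped)

# Toy usage with m = 3 hypothetical paths over d = 4 edges.
Z_feas = np.array([[1, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 0, 1, 1]], dtype=float)
prop = np.array([0.5, 0.3, 0.2])
S = sigma_hat(Z_feas, prop)
S_inv = inverse_clip(S)   # plugged into the ISW/DR scores, e.g. S_inv @ Z * C
```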
Naive Benchmarks. We also implemented the benchmarks in Appendix A. For the naive ETO, SPO+ DM, and SPO+ DR benchmarks, we estimate the benchmark analogue of f0(x) by regressing C on X separately using the data for each observed decision. The regression function class is chosen analogously to the correctly specified, degree-2 misspecified, and degree-4 misspecified classes mentioned above, with only a slight difference in dimension, since the output of this naive regression function is m-dimensional while the output of f0 is d-dimensional. For the naive SPO+ DR and SPO+ IPW benchmarks, we also need to estimate the propensity scores; these are again estimated by either empirical frequencies or decision trees. Note that some feasible decisions are never observed in the training data. For these decisions, the corresponding component of the naive regression estimate is heuristically imputed by a pooled regression of C against X using all observed data. For naive SPO+ IPW and SPO+ DR, although the propensity scores of the unseen decisions are zero, this does not affect policy evaluation, since these methods only require propensity scores for decisions observed in the data.

SPO+ Tuning and Optimization. We need to solve the SPO+ optimization problem in Section 4 for the DM and DR scores. Following [11], we incorporate an additional ridge penalty on the coefficients of the hypotheses. We select the penalty coefficient from a grid consisting of 0, 0.001, 0.01, and 10 points distributed uniformly on the logarithmic scale between 0.1 and 100, by minimizing the out-of-sample error on an independent validation dataset whose size equals that of the corresponding training data. The penalty coefficient is finally set to half of the value chosen by this validation procedure. To optimize the SPO+ loss, [11] recommend applying standard convex optimization solvers to a dual reformulation or running stochastic subgradient descent. We tried both approaches and found that stochastic subgradient descent with the default hyperparameters in [11, 14] (number of iterations, step size, batch size) runs much faster while performing similarly to the dual reformulation method.
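As a concrete illustration of this tuning protocol, here is a minimal sketch (not the released code); `fit_fn` and `eval_fn` are hypothetical user-supplied callables that fit an SPO+ model with a given ridge coefficient and return its out-of-sample error on the validation data, respectively.

```python
# Sketch of the ridge-penalty selection described above (illustrative only).
import numpy as np

# 0, 0.001, 0.01, plus 10 points uniform on the log scale between 0.1 and 100
RIDGE_GRID = np.concatenate(([0.0, 1e-3, 1e-2], np.logspace(-1, 2, 10)))

def select_ridge_penalty(train, valid, fit_fn, eval_fn, grid=RIDGE_GRID):
    """Pick the ridge coefficient minimizing validation error, then halve it."""
    errors = [eval_fn(fit_fn(train, ridge), valid) for ridge in grid]
    best = grid[int(np.argmin(errors))]
    return best / 2.0  # the final coefficient is half of the validated value
```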
C.2 Additional Experimental Results

In this section, we provide additional experimental results. Specifically, Tables 3 and 4 extend Tables 1 and 2 for the random logging policy. In particular, Table 3 shows additional results for the ISW and IPW methods, which are generally worse than the alternative methods. Table 4 additionally shows results for the naive benchmarks under model misspecification, further confirming their worse performance. Tables 5 to 8 provide results for the two covariate-dependent logging policies, the X1-policy and the X1X2-policy, respectively. The performance of the different methods tends to remain similar or to degrade slightly under these more complicated logging policies, but the comparisons between methods remain qualitatively the same.

Methods            Training data n
                   400       600       800       1000      1200      1400      1600
ETO                3.34%     1.86%     1.04%     0.74%     0.50%     0.41%     0.35%
SPO+ DM            2.30%     1.14%     0.64%     0.36%     0.25%     0.20%     0.16%
SPO+ DR PI         2.47%     1.44%     0.86%     0.59%     0.45%     0.39%     0.32%
SPO+ DR Lambda     2.23%     1.15%     0.64%     0.40%     0.27%     0.23%     0.18%
SPO+ DR Clip       2.29%     1.22%     0.69%     0.44%     0.31%     0.25%     0.20%
SPO+ ISW           15.00%    15.52%    15.11%    15.02%    14.98%    15.03%    15.11%
Naive ETO          15.03%    15.26%    14.74%    12.12%    8.45%     5.17%     3.53%
Naive SPO+ DM      15.05%    15.46%    14.92%    12.85%    9.81%     6.63%     5.08%
Naive SPO+ DR      14.99%    15.42%    14.97%    13.00%    10.10%    6.98%     5.56%
Naive SPO+ IPW     15.56%    15.77%    16.06%    16.06%    16.06%    15.96%    15.96%
Table 3: Mean relative regret ratio of different methods when the nuisance model FN and the policy-inducing model F are correctly specified. The logging policy is the random policy.

Methods            Training data n                     Training data n
                   400       1000      1600            400       1000      1600
                   F misspecified degree 2             F misspecified degree 4
ETO                11.04%    9.14%     8.34%           12.35%    11.42%    10.39%
Well-specified Nuisance Model FN
                   F misspecified degree 2             F misspecified degree 4
SPO+ DM            2.81%     0.80%     0.54%           4.06%     2.21%     2.06%
SPO+ DR PI         3.27%     1.36%     1.05%           4.83%     2.95%     2.71%
SPO+ DR Lambda     2.83%     0.97%     0.73%           4.33%     2.45%     2.25%
SPO+ DR Clip       3.05%     1.09%     0.84%           4.59%     2.62%     2.38%
Naive SPO+ DM      14.97%    12.78%    5.68%           15.27%    13.20%    6.42%
Naive SPO+ DR      15.00%    13.03%    6.31%           15.27%    13.48%    7.26%
Well-specified Policy-inducing Model F
                   FN misspecified degree 2            FN misspecified degree 4
SPO+ DM            10.01%    8.37%     7.47%           12.51%    11.22%    9.68%
SPO+ DR PI         9.11%     7.02%     6.44%           11.69%    10.19%    9.02%
SPO+ DR Lambda     9.05%     7.52%     6.68%           12.31%    10.38%    8.96%
SPO+ DR Clip       9.02%     7.28%     6.36%           11.87%    10.04%    8.70%
Naive SPO+ DM      15.56%    14.23%    12.96%          15.22%    14.51%    13.85%
Naive SPO+ DR      15.64%    14.36%    13.31%          15.17%    14.67%    14.12%
Misspecified
                   F, FN misspecified degree 2         F, FN misspecified degree 4
SPO+ DM            9.90%     8.34%     7.41%           12.45%    11.16%    9.69%
SPO+ DR PI         9.15%     7.23%     6.52%           11.92%    10.46%    9.42%
SPO+ DR Lambda     9.03%     7.46%     6.74%           12.01%    10.72%    9.25%
SPO+ DR Clip       8.97%     7.22%     6.46%           11.75%    10.31%    8.95%
Naive SPO+ DM      15.65%    14.23%    13.02%          15.16%    14.53%    13.84%
Naive SPO+ DR      15.63%    14.42%    13.33%          15.26%    14.74%    13.95%
Table 4: Mean relative regret ratio of different methods when the nuisance model FN and the policy-inducing model F are misspecified to different degrees. The logging policy is the random policy.

Methods            Training data n
                   400       600       800       1000      1200      1400      1600
ETO                3.20%     1.83%     1.05%     0.72%     0.56%     0.47%     0.38%
SPO+ DM            2.16%     1.12%     0.60%     0.40%     0.30%     0.25%     0.20%
SPO+ DR PI         2.51%     1.50%     0.97%     0.73%     0.59%     0.50%     0.44%
SPO+ DR Lambda     2.14%     1.10%     0.62%     0.41%     0.34%     0.28%     0.23%
SPO+ DR Clip       2.15%     1.18%     0.65%     0.44%     0.37%     0.31%     0.26%
SPO+ ISW           15.49%    15.39%    15.46%    15.43%    15.63%    15.75%    15.75%
Naive ETO          15.44%    14.92%    10.77%    6.56%     4.59%     3.39%     2.87%
Naive SPO+ DM      15.46%    15.21%    11.56%    7.49%     5.57%     4.32%     3.67%
Naive SPO+ DR      15.47%    15.19%    11.91%    8.09%     6.04%     4.81%     4.18%
Naive SPO+ IPW     15.57%    15.67%    15.71%    15.74%    15.66%    15.79%    15.80%
Table 5: Mean relative regret ratio of different methods when the nuisance model FN and the policy-inducing model F are correctly specified. The logging policy is the X1-policy.
Methods            Training data n                     Training data n
                   400       1000      1600            400       1000      1600
                   F misspecified degree 2             F misspecified degree 4
ETO                11.55%    9.56%     8.68%           11.41%    10.03%    9.60%
Well-specified Nuisance Model FN
                   F misspecified degree 2             F misspecified degree 4
SPO+ DM            2.86%     1.10%     0.86%           4.19%     2.53%     2.27%
SPO+ DR PI         3.57%     1.74%     1.39%           5.11%     3.36%     3.10%
SPO+ DR Lambda     2.93%     1.26%     1.03%           4.45%     2.79%     2.53%
SPO+ DR Clip       3.05%     1.34%     1.11%           4.57%     2.98%     2.66%
Naive SPO+ DM      15.59%    7.80%     4.37%           15.43%    8.08%     5.19%
Naive SPO+ DR      15.60%    8.45%     5.00%           15.38%    8.83%     5.90%
Well-specified Policy-inducing Model F
                   FN misspecified degree 2            FN misspecified degree 4
SPO+ DM            10.47%    8.31%     7.68%           10.77%    9.37%     8.82%
SPO+ DR PI         9.32%     7.48%     6.97%           10.69%    9.80%     9.43%
SPO+ DR Lambda     9.15%     7.48%     6.98%           10.36%    9.10%     8.72%
SPO+ DR Clip       9.23%     7.37%     6.78%           10.37%    8.94%     8.46%
Naive SPO+ DM      15.63%    14.58%    13.78%          14.93%    13.08%    12.71%
Naive SPO+ DR      15.56%    14.54%    13.95%          14.92%    13.57%    13.30%
Misspecified
                   F, FN misspecified degree 2         F, FN misspecified degree 4
SPO+ DM            10.36%    8.27%     7.58%           10.66%    9.35%     8.90%
SPO+ DR PI         9.19%     7.21%     6.46%           10.35%    8.96%     8.55%
SPO+ DR Lambda     9.24%     7.47%     6.98%           10.36%    9.18%     8.75%
SPO+ DR Clip       9.38%     7.39%     6.76%           10.28%    8.94%     8.55%
Naive SPO+ DM      15.61%    14.56%    13.88%          14.96%    13.25%    12.77%
Naive SPO+ DR      15.49%    14.62%    14.12%          14.96%    13.48%    13.23%
Table 6: Mean relative regret ratio of different methods when the nuisance model FN and the policy-inducing model F are misspecified to different degrees. The logging policy is the X1-policy.

Methods            Training data n
                   400       600       800       1000      1200      1400      1600
ETO                3.09%     1.92%     1.28%     0.86%     0.58%     0.41%     0.34%
SPO+ DM            2.19%     1.19%     0.66%     0.43%     0.32%     0.24%     0.19%
SPO+ DR PI         2.49%     1.42%     1.00%     0.72%     0.60%     0.46%     0.45%
SPO+ DR Lambda     2.08%     1.15%     0.67%     0.46%     0.35%     0.26%     0.21%
SPO+ DR Clip       2.17%     1.21%     0.73%     0.51%     0.36%     0.30%     0.25%
SPO+ ISW           14.57%    14.65%    14.49%    14.04%    14.36%    14.26%    14.30%
Naive ETO          14.30%    13.99%    11.01%    7.55%     5.21%     3.76%     3.11%
Naive SPO+ DM      14.31%    14.21%    11.89%    8.44%     6.16%     4.69%     3.97%
Naive SPO+ DR      14.29%    14.23%    12.12%    8.77%     6.52%     5.12%     4.45%
Naive SPO+ IPW     15.52%    15.64%    15.64%    15.66%    15.76%    15.74%    15.70%
Table 7: Mean relative regret ratio of different methods when the nuisance model FN and the policy-inducing model F are correctly specified. The logging policy is the X1X2-policy.
Methods            Training data n                     Training data n
                   400       1000      1600            400       1000      1600
                   F misspecified degree 2             F misspecified degree 4
ETO                10.08%    8.86%     8.18%           12.03%    11.15%    11.03%
Well-specified Nuisance Model FN
                   F misspecified degree 2             F misspecified degree 4
SPO+ DM            2.91%     1.10%     0.83%           4.32%     2.56%     2.24%
SPO+ DR PI         3.55%     1.73%     1.42%           5.27%     3.36%     3.04%
SPO+ DR Lambda     2.91%     1.28%     1.01%           4.50%     2.78%     2.49%
SPO+ DR Clip       3.07%     1.37%     1.12%           4.67%     2.94%     2.69%
Naive SPO+ DM      14.32%    8.92%     4.69%           14.52%    9.32%     5.46%
Naive SPO+ DR      14.34%    9.33%     5.37%           14.56%    9.78%     6.13%
Well-specified Policy-inducing Model F
                   FN misspecified degree 2            FN misspecified degree 4
SPO+ DM            8.78%     7.90%     7.11%           11.93%    11.08%    10.55%
SPO+ DR PI         8.23%     6.97%     6.51%           11.91%    11.36%    11.07%
SPO+ DR Lambda     8.28%     7.14%     6.60%           11.92%    11.01%    10.38%
SPO+ DR Clip       8.19%     7.04%     6.37%           11.72%    10.95%    10.36%
Naive SPO+ DM      14.86%    12.50%    12.07%          14.68%    13.84%    13.09%
Naive SPO+ DR      14.77%    12.78%    12.26%          14.65%    14.19%    13.70%
Misspecified
                   F, FN misspecified degree 2         F, FN misspecified degree 4
SPO+ DM            8.75%     7.83%     7.09%           11.95%    11.07%    10.53%
SPO+ DR PI         8.07%     6.66%     6.03%           11.79%    10.84%    10.40%
SPO+ DR Lambda     8.23%     7.14%     6.58%           11.90%    10.87%    10.43%
SPO+ DR Clip       8.09%     6.95%     6.41%           11.90%    10.74%    10.22%
Naive SPO+ DM      14.85%    12.66%    12.14%          14.70%    14.02%    13.11%
Naive SPO+ DR      14.75%    12.78%    12.26%          14.67%    14.17%    13.41%
Table 8: Mean relative regret ratio of different methods when the nuisance model FN and the policy-inducing model F are misspecified to different degrees. The logging policy is the X1X2-policy.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We provide the new methodology in Section 2, main theoretical results in Section 3, and numerical findings in Section 5.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We discuss the limitations of our paper in Section 6.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: Our main theoretical results are in Section 3, and we have a proposition in Section 2. All detailed proofs are provided in Appendix B.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We provide open access to data and code at https://github.com/CausalML/CLOBandit.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We provide open access to data and code at https://github.com/CausalML/CLOBandit.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We provide detailed explanations for the experiment setup in Section 5 and Appendix C.2.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [No]
Justification: We have statistical significance results in our experiments, and all our results are significant. We did not report them in the paper due to limited space, but can provide them upon request.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: Computer resources are reported in Appendix C.

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: The authors have reviewed the NeurIPS Code of Ethics, and the research conducted in the paper conforms, in every respect, with the NeurIPS Code of Ethics.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: This paper is foundational research and not tied to particular applications.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper does not use existing assets.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.