# Offline Contextual Bandits with Overparameterized Models

David Brandfonbrener¹, William F. Whitney¹, Rajesh Ranganath¹, Joan Bruna¹

¹Courant Institute of Mathematical Sciences, New York University, New York, New York, USA. Correspondence to: David Brandfonbrener.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Recent results in supervised learning suggest that while overparameterized models have the capacity to overfit, they in fact generalize quite well. We ask whether the same phenomenon occurs for offline contextual bandits. Our results are mixed. Value-based algorithms benefit from the same generalization behavior as overparameterized supervised learning, but policy-based algorithms do not. We show that this discrepancy is due to the action-stability of their objectives. An objective is action-stable if there exists a prediction (action-value vector or action distribution) which is optimal no matter which action is observed. While value-based objectives are action-stable, policy-based objectives are unstable. We formally prove upper bounds on the regret of overparameterized value-based learning and lower bounds on the regret of policy-based algorithms. In our experiments with large neural networks, this gap between action-stable value-based objectives and unstable policy-based objectives leads to significant performance differences.

1. Introduction

The offline contextual bandit problem can be used to model decision making from logged data in domains as diverse as recommender systems (Li et al., 2010; Bottou et al., 2013), healthcare (Prasad et al., 2017; Raghu et al., 2017), and robotics (Pinto & Gupta, 2016). Prior work on the problem has primarily focused on underparameterized model classes that are finite or have small VC dimension. This work has come from the bandit literature (Strehl et al., 2010; Swaminathan & Joachims, 2015a;b), the reinforcement learning literature (Munos & Szepesvári, 2008; Chen & Jiang, 2019), and the causal inference literature (Bottou et al., 2013; Athey & Wager, 2017; Kallus, 2018; Zhou et al., 2018).

In contrast, the best performance in modern supervised learning is often achieved by massively overparameterized models that are capable of fitting random labels (Zhang et al., 2016). Use of such large models renders vacuous the bounds that require a small model class. But the massive capacity of popular neural network models is now often viewed as a feature rather than a bug. Large models reduce approximation error and allow for easier optimization (Du et al., 2018) while still being able to generalize in regression and classification problems (Belkin et al., 2018; 2019).

In this paper, we investigate whether the strong performance of overparameterized models in supervised learning translates to the offline contextual bandit setting. The main prior work that considers this setup is (Joachims et al., 2018), which we discuss in detail in Section 7. To formalize the differences between the supervised learning and contextual bandit settings, we introduce a novel regret decomposition. This decomposition shares the approximation and estimation terms from classic work in supervised learning (Vapnik, 1982; Bottou & Bousquet, 2008), but adds a term for bandit error which captures the excess risk due to only receiving partial feedback.
We use this framework to address the question: can we use overparameterized models for offline contextual bandits? Or is the bandit error a fundamental problem when we use large models? We find mixed results. Value-based algorithms benefit from the same generalization behavior as overparameterized supervised learning, but policy-based algorithms do not. We show that this difference is explained by a property of their objectives called action-stability. An objective is action-stable if there exists a single prediction which is simultaneously optimal for any observed action (where a prediction is a vector of state-action values for a value-based objective or an action distribution for a policy-based objective). Action-stable objectives perform well when combined with overparameterized models since the random actions taken by the behavior policy do not change the optimal prediction. However, interpolating an unstable objective results in learning a different function for every sample of actions, even though the true optimal policy remains unchanged.

On the theory side, we prove that overparameterized value-based algorithms are action-stable and have small bandit error via reduction to overparameterized regression. Meanwhile, we prove that policy-based algorithms are not action-stable, which allows us to prove lower bounds on the in-sample regret as well as lower bounds on the regret for simple nonparametric models. Empirically, we demonstrate the gap in both action-stability and bandit error between policy-based and value-based algorithms when using large neural network models on synthetic and image-based datasets.

In summary, our main contributions are:

- We introduce the concept of bandit error, which separates contextual bandits from supervised learning.
- We introduce action-stability and show that a lack of action-stability causes bandit error.
- We show a gap between policy-based and value-based algorithms based on action-stability and bandit error, both in theory and experiments.

2.1. Offline contextual bandit problem

First we will define the contextual bandit problem (Langford & Zhang, 2008). Let the context space X be infinite and the action space A be finite with |A| = K < ∞. At each round, a context x ∈ X and a full-feedback reward vector r ∈ [r_min, r_max]^K are drawn from a joint distribution D. Note that r can depend on x since they are jointly distributed. A policy π : X → P(A) maps contexts to distributions over actions. An action a is sampled according to π(a|x) and the reward is r(a), the component of the vector r corresponding to a. We use bandit feedback to refer to only observing r(a). This contrasts with the full feedback problem where at each round the full vector of rewards r is revealed, independent of the action.

In the offline setting there is a finite dataset of N rounds collected by a fixed behavior policy β. We denote the dataset as S = {(x_i, r_i, a_i, p_i)}_{i=1}^N where p_i is the observed propensity p_i = β(a_i|x_i). The tuples in the dataset lie in X × [r_min, r_max]^K × A × [0, 1] and are drawn i.i.d. from the joint distribution induced by D and β. From S we define the datasets S_B for bandit feedback and S_F for full feedback:

$$
S_B = \{(x_i, r_i(a_i), a_i, p_i)\}_{i=1}^N, \qquad S_F = \{(x_i, r_i)\}_{i=1}^N.
$$

Note that we are assuming access to the behavior probabilities p_i = β(a_i|x_i), so the issues that we raise do not have to do with estimating propensities.
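To make the logging process concrete, here is a minimal sketch (our own illustration, not code from the paper) of how a bandit-feedback dataset S_B is generated from full-feedback draws, assuming a generic sampler for D and a behavior policy that exposes its propensities.

```python
import numpy as np

def log_bandit_dataset(sample_context_reward, behavior_policy, n, rng):
    """Log S_B = {(x_i, r_i(a_i), a_i, p_i)} from n i.i.d. rounds.

    sample_context_reward(rng) -> (x, r): one draw of a context and the full
        K-dimensional reward vector from the joint distribution D.
    behavior_policy(x) -> length-K probability vector beta(.|x).
    """
    dataset = []
    for _ in range(n):
        x, r = sample_context_reward(rng)       # full-feedback draw (x, r)
        beta = behavior_policy(x)               # propensities beta(.|x)
        a = rng.choice(len(r), p=beta)          # logged action a ~ beta(.|x)
        dataset.append((x, r[a], a, beta[a]))   # only r(a) and its propensity are kept
    return dataset
```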
We will further make the following assumption about the behavior policy.

Assumption 1 (Strict positivity). We have strict positivity with parameter τ if β(a|x) ≥ τ > 0 for all a, x. Thus, in any dataset we will have p_i = β(a_i|x_i) ≥ τ > 0.

There is important work that focuses on learning without strict positivity by making algorithmic modifications like clipping (Bottou et al., 2013; Strehl et al., 2010; Swaminathan & Joachims, 2015a) and behavior constraints (Fujimoto et al., 2018; Laroche et al., 2019). However, these issues are orthogonal to the main contribution of our paper, so we focus on the setting with strict positivity.

The goal of an offline contextual bandit algorithm is to take in a dataset and produce a policy π so as to maximize the value V(π) defined as

$$
V(\pi) := \mathbb{E}_{x,r\sim D}\,\mathbb{E}_{a\sim\pi(\cdot|x)}[r(a)].
$$

We will use π* to denote the deterministic policy that maximizes V. Finally, define the Q function at a particular context-action pair as Q(x, a) := E_{r|x}[r(a)].

2.2. Model classes

The novelty of our setting comes from the use of overparameterized model classes that are capable of interpolating the training objective. To define this more formally, all of the algorithms we consider take a model class of either policies Π or Q functions Q and optimize some objective over the data with respect to the model class. Following the empirical work of (Zhang et al., 2016) and theoretical work of (Belkin et al., 2018), we will call a model class overparameterized or interpolating if the model class contains a model that exactly optimizes the training objective. Formally, if we have data {x_i}_{i=1}^N and a pointwise loss function ℓ(x, y), then a model class Π can interpolate the data if it contains a policy π with

$$
\sum_{i=1}^N \ell(x_i, \pi(x_i)) = \sum_{i=1}^N \inf_{y} \ell(x_i, y).
$$

This contrasts with traditional statistical learning settings where we assume that the model class is finite or has low complexity as measured by something like VC dimension (Strehl et al., 2010; Swaminathan & Joachims, 2015a).

2.3. Algorithms

Now that we have defined the problem setting, we can define the algorithms that we will analyze. This is not meant to be a comprehensive account of all algorithms, but a broad picture of the vanilla versions of the main families of algorithms. Since we are focusing on statistical issues, we do not consider how the objectives are optimized.

Supervised learning with full feedback. In a full feedback problem, empirical value maximization (the analog of standard empirical risk minimization) is defined by maximizing the empirical value V̂_F:

$$
\hat{V}_F(\pi; S_F) := \frac{1}{N}\sum_{i=1}^N \langle r_i, \pi(\cdot|x_i)\rangle, \tag{1}
$$

$$
\pi_F := \arg\max_{\pi\in\Pi} \hat{V}_F(\pi; S_F). \tag{2}
$$

Policy-based learning. Importance-weighted (or inverse propensity weighted) policy optimization directly optimizes the policy to maximize an estimate of its value. Since we only observe the rewards of the behavior policy, we use importance weighting to get an unbiased value estimate to maximize. Explicitly:

$$
\hat{V}_B(\pi; S_B) := \frac{1}{N}\sum_{i=1}^N \frac{r_i(a_i)}{p_i}\,\pi(a_i|x_i), \tag{3}
$$

$$
\pi_B := \arg\max_{\pi\in\Pi} \hat{V}_B(\pi; S_B). \tag{4}
$$

Note that this is the vanilla version of the policy-based algorithm; modifications like regularizers, baselines/control variates, clipped importance weights, and self-normalized importance weights have been proposed (Bottou et al., 2013; Joachims et al., 2018; Strehl et al., 2010; Swaminathan & Joachims, 2015a;b). For our purposes, considering this vanilla version is sufficient since, as we show in Section 4, any objective that takes the form π(a_i|x_i)·f(x_i, a_i, r_i, p_i) at each datapoint will have the same sort of problem with action-stability.
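As a concrete reference point, here is a minimal PyTorch-style sketch (our own illustration with assumed tensor shapes and names, not the paper's code) of the vanilla importance-weighted objective in Eqs. (3)-(4), written as a loss to be minimized.

```python
import torch

def ips_policy_loss(policy_net, x, r_obs, a, p):
    """Negated importance-weighted value estimate of Eq. (3).

    x: (N, d) contexts; r_obs: (N,) observed rewards r_i(a_i);
    a: (N,) logged actions; p: (N,) logged propensities beta(a_i|x_i).
    """
    pi = torch.softmax(policy_net(x), dim=-1)        # pi(.|x_i), shape (N, K)
    pi_a = pi.gather(1, a.view(-1, 1)).squeeze(1)    # pi(a_i|x_i)
    v_hat = (r_obs / p * pi_a).mean()                # estimate of V_B(pi; S_B)
    return -v_hat                                    # minimize the negative value
```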
It is important to note that with underparameterized model classes, this algorithm is guaranteed to return nearly the best policy in the class. Explicitly, Strehl et al. (2010) prove that for a finite policy class Π, with high probability the regret of the learned policy π_B is bounded as $O(\sqrt{1/N})$. This is elaborated in Appendix F. However, these guarantees no longer hold in our overparameterized setting.

Value-based learning. Another simple algorithm is to first learn the Q function and then use a greedy policy with respect to this estimated Q function. Explicitly:

$$
\hat{Q}_{S_B} := \arg\min_{f\in\mathcal{Q}} \sum_{i=1}^N \big(f(x_i, a_i) - r_i(a_i)\big)^2, \tag{5}
$$

$$
\pi_{\hat{Q}_{S_B}}(a|x) := \mathbb{1}\Big[a = \arg\max_{a'} \hat{Q}_{S_B}(x, a')\Big]. \tag{6}
$$

This algorithm is sometimes called the direct method (Dudík et al., 2011). The RL literature also often defines a class of model-based algorithms, but in the contextual bandit problem there are no state transitions, so model-based algorithms are equivalent to value-based algorithms.

This algorithm also has a guarantee with small model classes. Explicitly, Chen & Jiang (2019) prove that for a finite and well-specified model class Q, with high probability the regret of the learned policy π_{Q̂_{S_B}} is bounded as $O(\sqrt{1/(\tau N)})$. This is elaborated in Appendix F. Again, these guarantees no longer hold in our overparameterized setting.

Doubly robust policy optimization. The class of doubly robust algorithms (Dudík et al., 2011) does not fall cleanly into the value-based or policy-based bins since it requires first learning a value function and then using that to optimize a policy. However, with overparameterized models, doubly robust learning becomes exactly equivalent to our vanilla value-based algorithm unless we use cross-fitting, since the estimated Q values will coincide with the rewards. We prove this formally in Appendix D, where we also show some issues that the doubly robust policy objective can have with overparameterized models and highly stochastic rewards. For our purposes, we will only consider the policy-based and value-based approaches since the doubly robust approach collapses to the value-based approach with overparameterized models.
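For symmetry with the policy-based sketch above, here is a minimal PyTorch-style sketch (again ours, not the paper's code) of the direct method in Eqs. (5)-(6): regress the observed rewards onto the logged actions, then act greedily.

```python
import torch

def q_regression_loss(q_net, x, r_obs, a):
    """Squared-error objective of Eq. (5) on the logged (x_i, a_i, r_i(a_i))."""
    q_all = q_net(x)                                  # (N, K) estimates of Q(x_i, .)
    q_a = q_all.gather(1, a.view(-1, 1)).squeeze(1)   # Q_hat(x_i, a_i)
    return ((q_a - r_obs) ** 2).mean()

def greedy_policy(q_net, x):
    """Greedy policy of Eq. (6): deterministically pick argmax_a Q_hat(x, a)."""
    with torch.no_grad():
        return q_net(x).argmax(dim=-1)
```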
3. Bandit error

In supervised learning, the standard decomposition of the excess risk separates the approximation and estimation error (Bottou & Bousquet, 2008). The approximation error is due to the limited function class and the estimation error is due to minimizing the empirical risk rather than the true risk. Since the full feedback policy learning problem is equivalent to supervised learning, the same decomposition applies. Formally, consider a full feedback algorithm A_F which takes the dataset S_F and produces a policy π_F. Then

$$
\underbrace{\mathbb{E}_S[V(\pi^*) - V(\pi_F)]}_{\text{regret}}
= \underbrace{V(\pi^*) - \sup_{\pi\in\Pi} V(\pi)}_{\text{approximation error}}
+ \underbrace{\mathbb{E}_S\Big[\sup_{\pi\in\Pi} V(\pi) - V(\pi_F)\Big]}_{\text{estimation error}}.
$$

We can instead consider a bandit feedback algorithm A_B which takes the dataset S_B and produces a policy π_B. To extend the above decomposition to the bandit problem we add a new term, the bandit error, that results from having access to S_B rather than S_F. Now we have:

$$
\underbrace{\mathbb{E}_S[V(\pi^*) - V(\pi_B)]}_{\text{regret}}
= \underbrace{V(\pi^*) - \sup_{\pi\in\Pi} V(\pi)}_{\text{approximation error}}
+ \underbrace{\mathbb{E}_S\Big[\sup_{\pi\in\Pi} V(\pi) - V(\pi_F)\Big]}_{\text{estimation error}}
+ \underbrace{\mathbb{E}_S[V(\pi_F) - V(\pi_B)]}_{\text{bandit error}}.
$$

Disentangling sources of error. The approximation error is the same quantity that we encounter in the supervised learning problem, measuring how well our function class can do. The estimation error measures the error due to overfitting on finite contexts and noisy rewards. The bandit error accounts for the error due to only observing the actions chosen by the behavior policy. This is not quite analogous to overfitting to noise in the rewards, since stochasticity in the actions is actually required to have the coverage of context-action pairs needed to learn a policy. While the standard approximation-estimation decomposition could be directly extended to the bandit problem, our approximation-estimation-bandit decomposition is more conceptually useful since it disentangles these two types of error.

Can bandit error be negative? Usually, we think about an error decomposition as a sum of positive terms. This is not necessarily the case with our decomposition, but we view this as a feature rather than a bug. Intuitively, the bandit error term captures the contribution of the actions selected by the behavior policy. If the behavior policy is nearly optimal and the rewards are highly stochastic, there may be more signal in the actions selected by the behavior policy than in the observed rewards. Thus overfitting the actions chosen by the behavior policy can sometimes be beneficial, causing the bandit error to be negative. The two terms disentangle the estimation error (due to reward noise) from the bandit error (due to behavior actions).

4. Action-stable objective functions

Consider a simple thought experiment. We collect a contextual bandit dataset S_B from a two-action environment using a uniformly random behavior policy. Then we construct a second dataset S̃_B by evaluating the outcome of taking the opposite action at each observed context. Since nothing about the environment has changed, we know that the optimal policy remains the same. Therefore we desire the following property from a bandit objective: there exists a single model which is optimal (with respect to that objective) on both S_B and S̃_B. We say that such an objective is action-stable because it has an optimal policy which is stable to re-sampling of the actions in the dataset.

More formally, we define action-stability pointwise at a datapoint z = (x, r, p) where r ∈ [r_min, r_max]^K and p ∈ Δ_K is the behavior probability vector in the K-dimensional simplex (recall that K is the number of actions). Let z(a) denote the datapoint when action a is sampled, so that z(a) = (x, r(a), p(a), a). The objectives for both policy-based and value-based algorithms decompose into sums over the data of some loss ℓ(z(a), π(a|x)) or ℓ(z(a), Q(x, a)). Note that the output space of a policy is the simplex, so that π(·|x) ∈ Δ_K, while the output of a Q function is Q(x, ·) ∈ R^K (when using neural networks, Q is usually implemented as a function of x with K outputs (Mnih et al., 2015)). To allow for this difference in our definition, we will define a generic K-dimensional output space Y_K and its corresponding restriction to one dimension as Y. So for a policy-based algorithm Y_K = Δ_K and Y = [0, 1], while for a value-based algorithm Y_K = R^K and Y = R. Now we can define action-stability.

Definition 1 (Action-stable objective). An objective function ℓ is action-stable at a datapoint z if there exists y* ∈ Y_K such that for all a ∈ A:

$$
\ell(z(a), y^*(a)) = \min_{y\in Y} \ell(z(a), y).
$$

If an objective is not action-stable, a function which minimizes that objective exactly at every datapoint (x, r(a), p(a), a) does not minimize it for a different choice of a. As a direct consequence, interpolating an unstable objective results in learning a different function for every sample of actions, even though the true optimal policy remains unchanged.
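To make Definition 1 concrete, here is a tiny numerical check (our own illustrative numbers) at a single two-action datapoint with uniform behavior: the per-point importance-weighted objective wants all mass on whichever action happened to be logged, while the squared-error objective is minimized by the single prediction y* = r regardless of the logged action.

```python
import numpy as np

r = np.array([1.0, 2.0])    # full reward vector at this context (action 1 is optimal)
p = np.array([0.5, 0.5])    # uniform behavior propensities

# Policy-based per-point objective as a loss: l(z(a), pi_a) = -(r[a] / p[a]) * pi_a.
grid = np.linspace(0.0, 1.0, 101)                   # candidate values of pi(a|x)
for a in (0, 1):
    best = grid[np.argmin(-(r[a] / p[a]) * grid)]
    print(f"logged action {a}: objective wants pi({a}|x) = {best}")
# The minimizer puts probability 1 on whichever action was logged, so no single
# action distribution is optimal for both possible draws: the objective is unstable.

# Value-based per-point objective: l(z(a), q) = (q[a] - r[a])**2.
q_star = r.copy()                                   # the single prediction y* = r
for a in (0, 1):
    print(f"logged action {a}: squared loss at y* = r is {(q_star[a] - r[a]) ** 2}")
# y* = r attains the minimum (zero) for either logged action: the objective is stable.
```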
We find that policy-based objectives are not action-stable, while value-based objectives are. In the next section we will use the instability of policy-based objectives to show that policy-based algorithms exhibit large bandit error when used with overparameterized models. Our stability results are stated in the following two lemmas, whose proofs can be found in Appendix A.

Lemma 1 (Value-based stability). Value-based objectives are action-stable, since we can let y* = r and this minimizes the square loss at every action.

Lemma 2 (Policy-based instability). All policy-based objectives which take the form ℓ(z(a), π(a|x)) = f(z(a))π(a|x) are not action-stable at z unless f(z(a)) > 0 for exactly one action a.

These lemmas tell us that the stochasticity of the behavior policy can cause instability for policy-based objectives. This is worrisome since one would hope that more stochastic behavior policies give us more information about all the actions and should thus yield better policies. Indeed, this is the case for value-based algorithms, as we will see in the next section. But for policy-based algorithms, stochastic behavior can itself be a cause of overfitting due to the instability of the objective function.

Stabilizing policy-based algorithms. To avoid this problem in a policy-based algorithm, the sign of the function f(z(a)) must indicate the optimal action. This essentially requires having access to a baseline function b(x) that separates the optimal action from all the others, so that r(a) > b(x) if and only if a is the optimal action. Then

$$
f(z(a)) = \frac{r(a) - b(x)}{\beta(a|x)}
$$

yields an action-stable algorithm. This is in general as difficult as learning the full value function Q. One notable special case is when the bandit problem is induced by an underlying classification problem, so that only one action has reward 1 and all others have reward 0. In this case, any constant baseline between 0 and 1 will lead to action-stability. This case has often been considered in the literature, e.g. by Joachims et al. (2018), as we discuss in Section 7.
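The classification special case can be checked mechanically. The helper below is our own illustration (not from the paper): with {0, 1} rewards, a baseline of 0.1 makes the shifted term f(z(a)) = (r(a) − λ)/β(a|x) positive only at the correct action, while a baseline of −0.1 makes it positive everywhere, which is exactly the unstable regime of Lemma 2. These two baseline values reappear in the CIFAR experiments of Section 6.2.

```python
import numpy as np

def stable_by_sign(r, p, baseline):
    """Return True if f(z(a)) = (r[a] - baseline) / p[a] is positive for exactly
    one action, the sign pattern that Lemma 2 requires for action-stability."""
    f = (r - baseline) / p
    return int(np.sum(f > 0)) == 1

r = np.array([0.0, 1.0, 0.0])   # classification-style rewards; action 1 is correct
p = np.full(3, 1.0 / 3.0)       # uniform behavior

print(stable_by_sign(r, p, baseline=0.1))    # True:  0 < baseline < 1
print(stable_by_sign(r, p, baseline=-0.1))   # False: every shifted reward is positive
```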
Now that we have built up an understanding of the problem, we can prove formal results that show how value-based algorithms more effectively leverage overparameterized models by being action-stable.

5. Regret bounds

Recall that, as explained in Section 2, both policy-based and value-based algorithms have regret guarantees when we use small model classes (Strehl et al., 2010; Chen & Jiang, 2019). But when we move to the overparameterized setting, this is no longer the case. In this section we prove regret upper bounds for value-based learning by using recent results in overparameterized regression. Then we prove lower bounds on the regret of policy-based algorithms due to their action-instability.

5.1. Value-based learning

In this section we show that value-based algorithms can provably compete with the optimal policy. The key insight is to reduce the problem to regression and then leverage the guarantees on overparameterized regression from the supervised learning literature. This is formalized by the following theorem.

Theorem 1 (Reduction to regression). By Assumption 1 we have β(a|x) ≥ τ for all x, a. Then with Q̂_{S_B} as defined in (5) we have

$$
V(\pi^*) - V(\pi_{\hat{Q}_{S_B}}) \le 2\sqrt{\frac{1}{\tau}\,\mathbb{E}_{x,a\sim\beta}\Big[\big(Q(x,a) - \hat{Q}_{S_B}(x,a)\big)^2\Big]}.
$$

A proof can be found in Appendix B. Similar results are presented as intermediate results in Chen & Jiang (2019); Munos & Szepesvári (2008). The implication of this result that we want to emphasize is that any generalization guarantee for overparameterized regression immediately becomes a guarantee for value-based learning in offline contextual bandits. Essentially, Theorem 1 gives us a regret bound in any problem where overparameterized regression works. The following results from the overparameterized regression literature demonstrate a few of these guarantees, which all require some sort of regularity assumption on the true Q function to bound the regression error:

- The results of (Bartlett et al., 2020) give finite sample rates for overparameterized linear regression by the minimum norm interpolator, depending on the covariance matrix of the data and assuming that the true function is realizable.
- The results of (Belkin et al., 2019) imply that under smoothness assumptions on Q, a particular singular kernel will interpolate the data and have optimal nonparametric rates. After applying our reduction, the rates are no longer optimal for the policy learning problem due to the square root.
- The results of (Bach, 2017) show how choosing the minimum norm infinite-width neural network in a particular function space can yield adaptive finite sample guarantees for many types of underlying structure in the Q function.
- The results of (Cover, 1968) imply the consistency of a one nearest neighbor regressor when the rewards are noiseless and Q is piecewise continuous. This will contrast nicely with Theorem 3 below.

Each of these guarantees implies a corresponding corollary to Theorem 1, resulting in a regret bound for that particular combination of model and assumptions on Q.
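As an illustration of how such corollaries look (our own instantiation under an assumed generic regression rate, not a result stated in the paper), suppose the chosen interpolating regressor satisfies a high-probability bound of the form E[(Q − Q̂)²] ≤ C N^{−α}. Plugging this into Theorem 1 gives a regret rate whose exponent is halved by the square root, which is the loss of optimality mentioned in the second bullet above:

$$
\mathbb{E}_{x,a\sim\beta}\big[(Q(x,a)-\hat{Q}_{S_B}(x,a))^2\big] \le C\,N^{-\alpha}
\quad\Longrightarrow\quad
V(\pi^*) - V(\pi_{\hat{Q}_{S_B}}) \le 2\sqrt{\tfrac{C}{\tau}}\;N^{-\alpha/2}.
$$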
5.2. Policy-based learning

Now we will show how policy-based learning algorithms can provably fail because they lack action-stability. We will do this in a few ways. First, we will show that on the contexts in the dataset, an action-unstable algorithm must suffer regret. This means that we cannot even learn the optimal policy at the contexts seen during training. Then, to deal with generalization beyond the dataset, we will prove a regret lower bound for a specific overparameterized model, namely one nearest neighbor. Finally, we discuss a conjecture that such lower bounds can be extended to richer model classes like neural networks.

Since we are proving lower bounds, making any more simplifying assumptions only makes the bound stronger. As such, all of our problem instances that create the lower bounds have only two actions (K = 2).

Regret on the observed contexts. Before considering how a policy generalizes off of the data, it is useful to consider what happens at the contexts in the dataset. This is especially true for overparameterized models which can exactly optimize the objective on the dataset. To do this, we will define the value of a policy π on the contexts in a dataset S (which we will call the in-sample value) by

$$
V(\pi; S) := \frac{1}{N}\sum_{i=1}^N \mathbb{E}_{r|x_i}\,\mathbb{E}_{a\sim\pi(\cdot|x_i)}[r(a)]. \tag{7}
$$

Then the following theorem shows that the policy π_B learned by the simple policy-based algorithm in Equation (4) must suffer substantial regret on S.

Theorem 2 (In-sample regret lower bound). Let K = 2 and the policy class be overparameterized. Define Δr(x) = |E_{r|x}[r(1) − r(2)]| as the absolute expected gap in rewards at x. Define p_u(x) to be the probability that the policy-based objective is action-unstable at x. Recall that β(a|x) ≥ τ by Assumption 1. Then

$$
\mathbb{E}_S[V(\pi^*; S) - V(\pi_B; S)] \ge \tau\,\mathbb{E}_x\big[p_u(x)\,\Delta r(x)\big].
$$

The full proof is in Appendix C. This theorem tells us that, as often as the objective is action-unstable, we can suffer substantial regret even on the contexts in the dataset. We now offer some brief intuition for the proof. When we have two actions and the algorithm is not action-stable at x_i, the action chosen by the learned policy π_B at x_i depends directly on the observed action a_i. Since the behavior β will choose each action with probability at least τ by Assumption 1, the learned policy π_B must choose the suboptimal action with probability at least τ at x_i. This causes unavoidable regret for unstable algorithms, as formally stated in Theorem 2.

Note that Theorem 2 essentially says that action-unstable algorithms convert noise in the behavior policy into regret. This is the essential problem with unstable algorithms. Rather than using stochasticity in the behavior policy to get estimates of counterfactual actions, an action-unstable algorithm is sensitive to this stochasticity as if it were label noise in supervised learning. In Appendix C.2 we present a result that makes this connection between behavior policy noise and action-instability more direct. Specifically, we show a reduction that takes any classification problem and turns it into a bandit problem such that optimizing the policy-based objective is equivalent to solving a noisy variant of the classification problem. On the contrary, optimizing the full-feedback objective is equivalent to the noiseless classification problem.

The limitation of this result is that it only applies in-sample and does not rule out that the model class could leverage its inductive bias to perform well away from the training data. Next we convert this in-sample regret lower bound into a standard regret lower bound for a particular simple interpolating model class, the nearest neighbor policy.

Regret with generalization: nearest neighbor models. The above result shows what happens at the contexts in the dataset S. It seems that only pathological combinations of model class and problem instance could perform poorly on S but recover strong performance off of the data. However, this cannot be ruled out a priori if the model class has strong inductive biases to generalize well. In this section we show that, at least for a very simple overparameterized model class, the generalization of the model does not improve performance. The following theorem shows that the simplest interpolating model class, a one nearest neighbor rule, fails to recover the optimal policy even in the limit of infinite data.

Theorem 3 (Regret lower bound for one nearest neighbor). Let Δr = r_max − r_min. Then there exist problem instances with noiseless rewards where

$$
\limsup_{N\to\infty} \mathbb{E}_S[V(\pi_F) - V(\pi_B)] = \Delta r, \qquad \limsup_{N\to\infty} \mathbb{E}_S[V(\pi^*) - V(\pi_F)] = 0.
$$

The proof is in Appendix C. This result shows that using a nearest neighbor scheme to generalize based on the signal provided by the policy-based objective is not sufficient to learn an optimal policy. Importantly, note that since the rewards are noiseless, a nearest neighbor policy does recover the optimal policy with full feedback, and Theorem 1 shows that value-based algorithms also recover the optimal policy in this setting. So the model class is capable of solving the problem; it is the action-unstable algorithm that is causing irreducible error.
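Theorem 3's construction is given in Appendix C. To convey the flavor, the simulation below (our own toy instance, not the paper's construction, and it does not reproduce the exact Δr constant) uses a noiseless two-action problem with strictly positive rewards: interpolating the importance-weighted objective forces the 1-NN policy to copy the logged actions, so its regret does not shrink with N, while the 1-NN value-based policy recovers the optimal action everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 500, 1000, 5
q = np.array([1.0, 2.0])              # noiseless rewards: Q(x, a) = q[a] for every x

x_train = rng.normal(size=(n_train, d))
a_train = rng.integers(0, 2, size=n_train)         # uniform behavior actions
r_train = q[a_train]                                # observed rewards, all positive

x_test = rng.normal(size=(n_test, d))

def nearest(train_x, query_x):
    """Index of the nearest training point for each query (plain 1-NN)."""
    d2 = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# Policy-based 1-NN: interpolating the IPS objective puts all mass on the logged
# action at each training context (all rewards are positive), so the learned
# policy simply copies the nearest neighbor's logged action.
pi_policy = a_train[nearest(x_train, x_test)]

# Value-based 1-NN: regress r(a) on the points where a was logged, then be greedy.
q_hat = np.stack(
    [r_train[a_train == a][nearest(x_train[a_train == a], x_test)] for a in (0, 1)],
    axis=1,
)
pi_value = q_hat.argmax(axis=1)

regret = lambda actions: float((q.max() - q[actions]).mean())
print("policy-based 1-NN regret:", regret(pi_policy))   # about 0.5, independent of n_train
print("value-based 1-NN regret: ", regret(pi_value))    # 0.0
```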
More complicated model classes. The above result for nearest neighbor models is illustrative, but does not apply to richer model classes like neural networks. While we were not able to construct such lower bounds, we conjecture that they do exist and hope that future work can prove them. We have two reasons to believe that such lower bounds exist. First, Theorem 2 is agnostic to the model class. For a policy to perform well despite poor in-sample performance would require strong inductive biases from the model class. Proving lower bounds requires ruling out such inductive biases, as we have shown for nearest neighbor rules. Second, our empirical results presented in the next section show that policy-based algorithms have action-instability and high bandit error with neural networks. The inductive biases are not enough to overcome the poor in-sample performance.

6. Experiments

In this section we experimentally verify that the phenomena described by the theory above extend to practical settings that go slightly beyond the assumptions of the theory (code can be found at https://github.com/davidbrandfonbrener/deep-offline-bandits). Specifically, we want to verify the following with overparameterized neural network models:

1. Policy-based algorithms are action-unstable while value-based algorithms are action-stable.
2. This causes high bandit error for policy-based algorithms, but not for value-based algorithms.

We break the section into two parts. First we consider a synthetic problem using simple feed-forward neural nets, and then we show similar phenomena when the contexts are high-dimensional images and the models are ResNets (He et al., 2016).

6.1. Synthetic data

First, we clearly demonstrate action-stability and bandit error in a synthetic problem with a linear reward function. Specifically, we sample a hidden reward matrix θ and then sample contexts and rewards from isotropic Gaussians:

$$
\theta \sim U([0,1]^{K\times d}), \qquad x \sim \mathcal{N}(0, I_d), \qquad r \sim \mathcal{N}(\theta x, \epsilon I).
$$

Actions are sampled according to a uniform behavior policy: a ∼ β(·|x) = U({1, . . . , K}). For these experiments we set K = 2, d = 10, ε = 0.1. We take N = 100 training points and sample an independent test set of 500 points. As our models we use MLPs with one hidden layer of width 512. In our experience, the findings are robust to all of these hyperparameters so long as the model is overparameterized. Full details about the training procedure, along with learning curves and further results, are in Appendix E.
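A minimal sketch of this data-generating process as we read it from the description above (variable names are ours; we treat ε as the noise variance):

```python
import numpy as np

def make_synthetic_dataset(n, k=2, d=10, eps=0.1, seed=0):
    """Sample a logged bandit dataset from the synthetic linear-reward problem."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 1.0, size=(k, d))                 # hidden reward matrix
    x = rng.normal(size=(n, d))                                # contexts x ~ N(0, I_d)
    r = x @ theta.T + np.sqrt(eps) * rng.normal(size=(n, k))   # rewards ~ N(theta x, eps I)
    a = rng.integers(0, k, size=n)                             # uniform behavior actions
    p = np.full(n, 1.0 / k)                                    # logged propensities 1/K
    return x, r, r[np.arange(n), a], a, p                      # full and bandit feedback

x, r_full, r_obs, a, p = make_synthetic_dataset(n=100)
```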
To confirm (1) and (2) listed above, we conduct two experiments. First, to test the action-stability of an algorithm with a neural network model, we train 20 different policies on the same dataset of contexts and rewards, but with resampled actions. Formally, we sample {x_i, r_i}_{i=1}^N from the Gaussian distributions described above and then sample a_i ∼ β(·|x_i) with 20 different seeds. We then train each algorithm to convergence and compare the resulting policies by total variation (TV) distance. Results are shown in Figure 1. We find that our results from Section 4 are confirmed: policy-based algorithms are unstable, leading to high TV distance between policies trained on different seeds, while value-based algorithms are stable.

[Figure 1: We test action-stability by resampling the actions 20 times for a single dataset of contexts. Each pixel corresponds to a pair of action seeds i, j and the color shows the TV distance between π_i(·|x) and π_j(·|x) on a held-out test set sampled from the data-generating distribution. The policy-based algorithms are highly sensitive to the randomly sampled actions.]

Second, we estimate the bandit error of each algorithm. To do this, we train policies to convergence for the policy-based, value-based, and full-feedback objectives on 50 independently sampled datasets (where now we also resample the contexts and rewards). For this estimate, we assume perfect optimization and no approximation error. Each estimate is calculated on a held-out test set. Explicitly, let π_B^j, π_Q^j, π_F^j be the policy-based, value-based, and full-feedback policies trained on dataset S^j with seed j, with corresponding test set T^j. Then we estimate the bandit error as

$$
\frac{1}{J}\sum_{j=1}^J \Big[ V(\pi_F^j; T^j) - V(\pi_B^j; T^j) \Big].
$$

Similarly, since we know θ we can compute π* and use this to estimate the estimation error. The results shown in Figure 2 demonstrate that the policy-based algorithm suffers from substantially more bandit error and thus more regret.

[Figure 2: Estimated bandit error, computed by averaging the values calculated on the held-out test sets for 50 independently sampled datasets. Error bars show one standard deviation. While policy-based learning has high bandit error, value-based learning has essentially zero bandit error.]

6.2. Classification data

Most prior work on offline contextual bandits conducts experiments on classification datasets that are transformed into bandit problems (Beygelzimer & Langford, 2009; Dudík et al., 2011; Swaminathan & Joachims, 2015a;b; Joachims et al., 2018; Chen et al., 2019). This methodology obscures issues of action-stability because the very particular reward function used (namely 1 for a correct label and 0 for an incorrect one) makes the policy-based objective action-stable. However, even minor changes to this reward function can dramatically change the performance of policy-based algorithms by rendering the objective action-unstable.

To make a clear comparison to prior work that uses deep neural networks for offline contextual bandits, like Joachims et al. (2018), we consider the same image-based bandit problem that they do: a bandit version of CIFAR-10 (Krizhevsky, 2009). To turn CIFAR into an offline bandit problem, we view each possible label as an action and assign a reward of 1 for a correct label/action and 0 for an incorrect label/action. We use two different behavior policies to generate training data: (1) a uniformly random behavior policy and (2) the hand-crafted policy used in (Joachims et al., 2018). We train ResNet-18 (He et al., 2016) models using PyTorch (Paszke et al., 2019). Again, full details about the training procedure are in Appendix E.
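A minimal sketch of the CIFAR-to-bandit conversion for the uniform behavior policy (our own code using torchvision; the hand-crafted behavior of Joachims et al. (2018) is not reproduced here):

```python
import torch
from torchvision import datasets, transforms

def cifar_bandit(root="./data", train=True, seed=0):
    """CIFAR-10 as logged bandit feedback under a uniform behavior policy:
    reward 1 if the logged action matches the true label, 0 otherwise."""
    data = datasets.CIFAR10(root=root, train=train, download=True,
                            transform=transforms.ToTensor())
    g = torch.Generator().manual_seed(seed)
    labels = torch.tensor(data.targets)
    actions = torch.randint(0, 10, (len(data),), generator=g)   # a_i ~ Uniform({0..9})
    rewards = (actions == labels).float()                        # r_i(a_i) in {0, 1}
    propensities = torch.full((len(data),), 0.1)                 # beta(a_i|x_i) = 1/10
    return data, actions, rewards, propensities
```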
As explained in Section 4, the policy-based objective is stable if and only if the sign of the reward minus the baseline is an indicator of the optimal action. To test this insight, we consider two variants of policy-based learning: an unstable variant, where we use a baseline of -0.1 so that the effective rewards are 1.1 for a correct label and 0.1 for an incorrect one, and a stable variant, where we use a baseline of 0.1 so that the effective rewards are 0.9 and -0.1, making the objective stable. (This baseline corresponds to the optimal value of λ in the experiments of Joachims et al. (2018); our stable model slightly outperforms theirs, likely due to a slightly better implementation.) Note that this stable variant of the algorithm only exists because we are considering a classification problem. In settings with richer structure in the rewards, defining such an algorithm is not possible and only versions of the unstable algorithm would exist.

We again estimate the regret decomposition as we did with the synthetic data. The difference is that this time we only use one seed, since we only have one CIFAR-10 dataset. The results in Figure 3 confirm the results from the synthetic data. The standard (unstable) policy-based algorithm suffers from large bandit error. The value-based algorithm has the best performance across both datasets, although the stable policy-based algorithm performs about as well for the hand-crafted behavior policy.

[Figure 3: Estimated regret decomposition on CIFAR with uniform behavior (left) and the hand-crafted behavior of Joachims et al. (2018) (right). Value-based learning has the lowest bandit error and unstable policy-based learning the most. On the hand-crafted dataset, the stable policy-based algorithm performs as well as value-based learning.]

7. Related work

Now that we have presented our results, we contrast them with some related work to clarify our contribution.

7.1. Relation to propensity overfitting

Swaminathan & Joachims (2015b) introduce what they call "propensity overfitting". By providing an example, they show that policy-based algorithms overfit to maximize the sum of propensities rather than the value when the rewards are strictly positive. They provide a motivating example, but no formal definition of propensity overfitting, and argue that it helps to describe the gap between supervised learning and bandit learning. In contrast, we introduce and formally define bandit error, which makes this gap between supervised learning and bandit learning precise and does not rely on the specific algorithm being used. Then we introduce and formally define action-instability, which explains an important cause of bandit error for policy-based algorithms when using large models. By mathematically formalizing these ideas we provide a more rigorous foundation for future work.

7.2. Relation to (Joachims et al., 2018)

The main related work that considers offline contextual bandits with large neural network models is Joachims et al. (2018). Specifically, that paper proposes a policy-based algorithm with an objective of the form

$$
\frac{r_i(a_i) - \lambda}{\beta(a_i|x_i)}\,\pi(a_i|x_i)
$$

for some constant baseline λ determined by a hyperparameter sweep, but motivated by a connection to self-normalized importance sampling. Our work contrasts with this prior work in two key ways.

First, we show that the algorithm proposed in Joachims et al. (2018) is action-unstable. Specifically, our Lemma 2 shows that any policy-based algorithm with an objective of the form $\sum_i f(z_i(a_i))\,\pi(a_i|x_i)$ cannot be action-stable unless the sign of f(z(a)) is the indicator of the optimal action. However, since that paper only tests the algorithm on classification problems where the rewards are in {0, 1}, any setting of λ between 0 and 1 causes the sign of f to indicate the optimal action. The action-stability analysis shows how this algorithm will struggle beyond the classification setting.
Second, we show that value-based methods provably and empirically work in the overparameterized setting. In contrast, Joachims et al. (2018) does not consider value-based methods. We show that value-based methods are not affected by action-stability issues (Lemma 1) and have vanishing bandit error (Theorem 1). We empirically test this conclusion on the same CIFAR-10 bandit problem as Joachims et al. (2018) and find that a value-based approach outperforms the policy-based approach proposed in that paper (Figure 3).

7.3. Variance of importance weighting

The importance-weighted value estimates used by policy-based algorithms suffer from high variance due to low-probability actions that have very large importance weights. Much prior work focuses on reducing this variance (Strehl et al., 2010; Bottou et al., 2013; Swaminathan & Joachims, 2015a). In contrast, the issue we consider, action-instability in the overparameterized setting, is distinct from this variance issue. When the policy class is flexible enough to optimize the objective at each datapoint, the optimal predictor in that class does not depend on the importance weights. Meanwhile, action-unstable objectives translate stochasticity in the behavior policy into noise in the objective, causing the overfitting issues that we see in policy-based algorithms. In fact, our Theorem 2 suggests that regret will be worse for more uniform behavior policies when using an action-unstable objective, even though these may be beneficial in terms of variance. This is borne out in our experiments, where the behavior is usually uniform and known: a favorable setup in terms of the variance of the value estimates, but an unfavorable setup for action-unstable policy learning algorithms.

8. Discussion

We have examined the offline contextual bandit problem with overparameterized models. We introduced a new regret decomposition to separate the effects of estimation error and bandit error. We showed that policy-based algorithms are not action-stable and thus suffer from high bandit error with stochastic behavior policies. This is borne out both in theory and in experiments.

It is important to emphasize that our results may not apply beyond the setting we consider in this paper. When strict positivity fails, there is unobserved confounding, actions are continuous, or the model classes are small and misspecified, policy-based learning may have lower regret and lower bandit error than value-based learning.

In future work we hope to extend these ideas from the bandit setting to the full RL problem, with longer horizons that require temporal credit assignment. We predict that action-stability and bandit error remain significant issues there. We also hope to investigate action-stable algorithms beyond the simple value-based algorithms we consider here.

Acknowledgements

We would like to thank Aahlad Puli for thoughtful conversations and Aaron Zweig, Min Jae Song, and Evgenii Nikishin for comments on earlier drafts. This work is partially supported by the Alfred P. Sloan Foundation, NSF RI-1816753, NSF CAREER CIF 1845360, NSF CHS-1901091, Samsung Electronics, and the Institute for Advanced Study. DB is supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program.

References

Athey, S. and Wager, S. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.

Bach, F. Breaking the curse of dimensionality with convex neural networks.
The Journal of Machine Learning Research, 18(1):629–681, 2017.

Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2020.

Belkin, M., Hsu, D. J., and Mitra, P. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems, pp. 2300–2311, 2018.

Belkin, M., Rakhlin, A., and Tsybakov, A. B. Does data interpolation contradict statistical optimality? In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1611–1619. PMLR, 2019.

Beygelzimer, A. and Langford, J. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 129–138, 2009.

Bottou, L. and Bousquet, O. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pp. 161–168, 2008.

Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.

Chen, J. and Jiang, N. Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019.

Chen, M., Gummadi, R., Harris, C., and Schuurmans, D. Surrogate objectives for batch policy optimization in one-step decision making. In Advances in Neural Information Processing Systems, pp. 8825–8835, 2019.

Cover, T. Estimation by the nearest neighbor rule. IEEE Transactions on Information Theory, 14(1):50–55, 1968.

Cover, T. and Hart, P. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

Dudík, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.

Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Joachims, T., Swaminathan, A., and de Rijke, M. Deep learning with logged bandit feedback. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SJaP_-xAb.

Kallus, N. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, pp. 8895–8906, 2018.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.

Langford, J. and Zhang, T. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pp. 817–824, 2008.

Laroche, R., Trichelair, P., and Des Combes, R. T. Safe policy improvement with baseline bootstrapping. In International Conference on Machine Learning, pp. 3652–3661. PMLR, 2019.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670, 2010.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8026–8037, 2019.

Pinto, L. and Gupta, A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 3406–3413. IEEE, 2016.

Prasad, N., Cheng, L.-F., Chivers, C., Draugelis, M., and Engelhardt, B. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. arXiv preprint arXiv:1704.06300, 2017.

Raghu, A., Komorowski, M., Ahmed, I., Celi, L. A., Szolovits, P., and Ghassemi, M. Deep reinforcement learning for sepsis treatment. arXiv preprint arXiv:1711.09602, 2017.

Strehl, A., Langford, J., Li, L., and Kakade, S. M. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems, pp. 2217–2225, 2010.

Swaminathan, A. and Joachims, T. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pp. 814–823, 2015a.

Swaminathan, A. and Joachims, T. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pp. 3231–3239, 2015b.

Vapnik, V. Estimation of Dependences Based on Empirical Data: Springer Series in Statistics. Springer-Verlag, Berlin, Heidelberg, 1982. ISBN 0387907335.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Zhou, Z., Athey, S., and Wager, S. Offline multi-action policy learning: Generalization and optimization. arXiv preprint arXiv:1810.04778, 2018.