# Robust Tests in Online Decision-Making

Gi-Soo Kim¹, Jane P. Kim², Hyun-Joon Yang²
¹Department of Industrial Engineering & Artificial Intelligence Graduate School, UNIST
²Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine
gisookim@unist.ac.kr, janepkim@stanford.edu, yanghyun@stanford.edu

## Abstract

Bandit algorithms are widely used in sequential decision problems to maximize the cumulative reward. One potential application is mobile health, where the goal is to promote the user's health through personalized interventions based on user-specific information acquired through wearable devices. Important considerations include the type of data collected and the frequency with which they are collected (e.g., GPS or continuous monitoring), as such factors can severely impact app performance and users' adherence. To balance the need to collect useful data against the constraint of degrading app performance, one needs to be able to assess the usefulness of variables. Bandit feedback data are sequentially correlated, so traditional testing procedures developed for independent data do not apply. Recently, a statistical testing procedure was developed for the actor-critic bandit algorithm. An actor-critic algorithm maintains two separate models: one for the actor, the action selection policy, and the other for the critic, the reward model. The performance of the algorithm as well as the validity of the test are guaranteed only when the critic model is correctly specified. However, misspecification is frequent in practice due to an incorrect functional form or missing covariates. In this work, we propose a modified actor-critic algorithm that is robust to critic misspecification and derive a novel testing procedure for the actor parameters in this case.

## Introduction

Bandit algorithms apply to sequential decision problems. We assume a set of candidate actions, or arms, is revealed sequentially to a learning agent along with side information called contexts. The agent can pull one arm at a time and receives a corresponding reward. The expected value of the reward is an unknown function of the context information of the chosen action. The goal of the agent is to adaptively learn an action allocation policy so as to achieve high cumulative reward. The main challenge is the exploration-exploitation trade-off, which represents the dilemma between pulling arms that the agent is uncertain about for the sake of learning (exploration) and pulling the best arm based on current, limited knowledge (exploitation).

Bandit algorithms can be particularly useful for personalizing health interventions in Mobile Health (Tewari and Murphy 2017). The goal of Mobile Health (mHealth) apps is to promote the user's health through personalized interventions tailored to user-specific information acquired through devices such as phones or wearables. One important issue related to mHealth apps is that the frequency of data queries (e.g., queries to the HealthKit API) impacts app performance. As queries become increasingly frequent, the processing time slows down the app from the user's perspective, which can result in low adherence to the app and hence low reward. Hence, the frequency of data queries and the reward feedback go hand in hand.
Beyond performance-related costs, there can also be costs of ethical valence (e.g., privacy) associated with querying data, and thus it is important to collect only data that are correlated with the reward. Currently, there is little work on assessing the utility of variables collected by wearables.

We consider testing the utility of context variables for an actor-critic bandit algorithm (Lei, Tewari, and Murphy 2017). An actor-critic bandit algorithm maintains two separate parameterized models: one for the actor, the action allocation policy, and the other for the critic, the reward model. The focus of this work is on testing the variables used in the actor model, which requires that the asymptotic distributions of the actor parameter estimates be known. Lei, Tewari, and Murphy (2017) proved that when the reward model is linear and is correctly specified by the critic, the actor parameter estimates converge in probability to the parameters of the optimal policy and asymptotically follow a normal distribution. Based on this asymptotic normality, we can apply a Z-test to assess the significance of the actor parameters.

The validity of model-based tests, such as those of Lei, Tewari, and Murphy (2017), relies on the assumption that the linear model is correctly specified; in other words, that the assumed statistical model represents the true reward function. Linear functions may, however, fail to accurately represent the true nature of the reward function, and often there is no a priori reason to hypothesize that the reward should be of a certain functional form. When the parameterized critic model is not correctly specified (i.e., the true reward model is of a different form than the working model), asymptotic normality may not hold. In this paper, we propose a new actor-critic algorithm and a testing procedure that is robust to misspecification of the critic model. The main contributions of our paper are as follows:

- We propose a new actor-critic algorithm whose actor parameter estimates converge to the parameters of the optimal policy even when the reward model is misspecified by the critic.
- We show that in the new algorithm, the critic and actor parameter estimates asymptotically follow a normal distribution.
- We conduct experiments on synthetic and real data and show that our testing procedure appropriately assesses the significance of the parameters.

## Related Works on Robust Bandits

Our work is distinct from the limited literature on robust bandits. First, the focus of existing works (Tang et al. 2021; Hao et al. 2019; Zhu et al. 2018) is on robustness to the noise of the rewards. For example, Tang et al. (2021) and Hao et al. (2019) developed Upper Confidence Bound (UCB) algorithms that do not require specifying the tail property of the reward, and Zhu et al. (2018) developed an actor-critic (AC) algorithm that is robust to outliers. These methods, however, are built upon the linearity of the reward model. Ghosh, Chowdhury, and Gopalan (2017) developed an algorithm that maintains sublinear regret under large deviations from linearity, but it is restricted to the case where the action feature set is fixed over time. Hence, even under large deviations, the problem can still be addressed by a multi-armed bandit algorithm with regret scaling with the number of arms instead of the feature dimension.
Second, we consider the AC algorithm, which, apart from ε-greedy, is unique in that the asymptotic distributions of the parameter estimates are known; there are currently no established statistical testing procedures for the UCB or Thompson sampling algorithms. We consider the impact of misspecification on the validity of inferential testing of the utility of contextual variables (i.e., the significance of the actor parameters in the AC algorithm). To the best of our knowledge, no other work addresses the robustness of inferential testing in the context of the actor-critic algorithm. In a related work on the ε-greedy bandit, Chen, Lu, and Song (2020) derived the asymptotic properties of a weighted least squares estimator (LSE) for misspecified reward models. The authors demonstrated that the weighted LSE has an asymptotic normal distribution with mean equal to the least false linear parameter. While both this and our approach offer robust tests, ours offers a directed approach to exploration, which may be efficient and desirable in the mobile health setting where the action space is large.

## Preliminaries

### Problem Formulation

We first formulate the bandit problem. At each time $t$, the learning agent can pull one arm among $N$ alternative arms. The $i$-th arm ($i = 1, \ldots, N$) returns a random reward $r_{t,i}$ with unknown mean when it is pulled. Prior to arm selection, a finite-dimensional context vector $b_{t,i} \in \mathbb{R}^d$ for each arm $i$ is revealed to the agent. The agent tailors his or her choice based on this contextual information. Let $a_t$ denote the arm index pulled at time $t$. The goal of the agent is to maximize the cumulative sum of the rewards $r_{t,a_t}$ over a finite time horizon $T$. We assume that the full set of contexts $b_t = \{b_{t,1}, \ldots, b_{t,N}\}$ and the full set of rewards $r_t = \{r_{t,1}, \ldots, r_{t,N}\}$ are independently and identically distributed (i.i.d.) over time. Also, without loss of generality, we assume the L2-norm of $b_{t,i}$ is bounded by 1, i.e., $\|b_{t,i}\|_2 \le 1$.

### Notations

We denote the L2-norm of a vector $x$ as $\|x\|_2$, the set of natural numbers from 1 to $N$ as $[N]$, the set of all natural numbers as $\mathbb{N}$, the $d$-dimensional identity matrix as $I_{d \times d}$, and the $d$-dimensional vector with all elements equal to 0 as $0_d$.

### Actor-Critic Bandit Algorithm for Linear Rewards

Under the linear reward assumption $\mathbb{E}[r_{t,i} \mid b_{t,i}] = b_{t,i}^T \mu^*$ for some $\mu^* \in \mathbb{R}^d$ with $\|\mu^*\|_2 \le 1$, Lei, Tewari, and Murphy (2017) proposed the actor-critic bandit algorithm (Algorithm 1), which learns two parametrized models, the critic and the actor. The critic for the $i$-th arm is a linear function of the $i$-th context variable with parameter $\mu \in \mathbb{R}^d$, namely $b_{t,i}^T \mu$. The actor is the action allocation probability and is parametrized by a softmax function with parameter $\theta \in \mathbb{R}^d$, i.e., the probability of pulling the $i$-th arm at time $t$ is
$$\pi_\theta(b_t, i) = \frac{\exp(b_{t,i}^T \theta)}{\sum_{j=1}^{N} \exp(b_{t,j}^T \theta)}.$$
Lei, Tewari, and Murphy (2017) define the optimal parameter $\theta^*$ as the value that maximizes the penalized expected reward,
$$\theta^* = \operatorname*{argmax}_{\theta}\; \mathbb{E}\Big[\sum_{i=1}^{N} b_{t,i}^T \mu^* \, \pi_\theta(b_t, i)\Big] - \lambda \theta^T \theta,$$
where $\lambda > 0$ and the expectation is taken over the distribution of $b_t$. The penalty term $\lambda \theta^T \theta$ is introduced to constrain the norm of $\theta$.¹ Due to the penalty, there exists $\gamma > 0$ such that $\gamma < \pi_{\theta^*}(b_t, i) < 1 - \gamma$ for every $i$ with high probability. This guarantees treatment variety, which prevents habituation and increases user engagement in many applications including mHealth. Also, when the expected rewards of the arms are the same, so that the $\mathbb{E}\big[\sum_{i=1}^{N} b_{t,i}^T \mu^* \, \pi_\theta(b_t, i)\big]$ term does not change with $\theta$, $\theta^*$ is unique at $\theta^* = 0_d$.
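For intuition, the softmax actor above is straightforward to implement. The following minimal Python sketch is illustrative only; the helper names (`softmax_policy`, `sample_arm`) and the toy values are ours, not from the paper, and the round's contexts are assumed to be stacked into an $N \times d$ matrix.

```python
import numpy as np

def softmax_policy(B, theta):
    """Allocation probabilities pi_theta(b_t, i) for one round.
    B is the N x d matrix whose i-th row is b_{t,i}; theta is the actor parameter."""
    scores = B @ theta
    scores = scores - scores.max()   # subtract max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def sample_arm(B, theta, rng):
    """Draw the arm index a_t from pi_theta(b_t, .)."""
    probs = softmax_policy(B, theta)
    return rng.choice(len(probs), p=probs)

# Toy usage with arbitrary values.
rng = np.random.default_rng(0)
B_t = rng.normal(size=(2, 4))        # N = 2 arms, d = 4 context features
theta_hat = np.zeros(4)              # theta = 0_d gives the uniform policy
a_t = sample_arm(B_t, theta_hat, rng)
```

With $\theta = 0_d$ every arm receives probability $1/N$, consistent with the observation that $\theta^* = 0_d$ when no arm is preferable.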
The estimator $\hat\mu_t$ of the critic parameter $\mu^*$ is the ridge estimator computed from the context and reward pairs of the chosen arms. The estimator $\hat\theta_t$ of the policy parameter $\theta^*$ is the maximizer of the estimate of the penalized expected reward, $\frac{1}{t} \sum_{\tau=1}^{t} \sum_{i=1}^{N} \hat r_{\tau,i} \, \pi_\theta(b_\tau, i) - \lambda \theta^T \theta$, where $\hat r_{\tau,i}$ is the truncated estimate of the reward defined in Algorithm 1, line 5. When the true critic parameter $\mu^*$ has $\|\mu^*\|_2 \le 1$ and $\hat\mu_t$ converges to $\mu^*$, the truncated reward estimate $\hat r_{t,i}$ approaches the untruncated estimate $b_{t,i}^T \hat\mu_t$.

¹The presented penalty form is a special case of the penalty proposed in Lei, Tewari, and Murphy (2017).

**Algorithm 1: Actor-Critic algorithm for linear reward (Lei, Tewari, and Murphy 2017)**
1. Set $B = \xi I_{d \times d}$, $y = 0_d$, $\lambda > 0$, $\xi > 0$.
2. **for** $t = 1, \ldots, T$ **do**
3. Pull arm $a_t$ according to probabilities $\{\pi_{\hat\theta_{t-1}}(b_t, i)\}_{i=1}^{N}$ and get reward $r_{t,a_t}$.
4. Critic update: $B \leftarrow B + b_{t,a_t} b_{t,a_t}^T$, $y \leftarrow y + b_{t,a_t} r_{t,a_t}$, $\hat\mu_t \leftarrow B^{-1} y$.
5. $\hat r_{\tau,i} \leftarrow \max(-2, \min(2, b_{\tau,i}^T \hat\mu_t))$ for $i \in [N]$, $\tau \in [t]$.
6. Actor update: $\hat\theta_t \leftarrow \operatorname*{argmax}_{\theta} \frac{1}{t} \sum_{\tau=1}^{t} \sum_{i=1}^{N} \hat r_{\tau,i} \, \pi_\theta(b_\tau, i) - \lambda \theta^T \theta$.

The boundedness of $\hat r_{\tau,i}$ and the penalty term ensure that $\hat\theta_t$ is bounded. This guarantees that there exists $\gamma > 0$ such that $\gamma < \pi_{\hat\theta_t}(b_t, i) < 1 - \gamma$ for every $i$, which prevents $\pi_{\hat\theta_t}(b_t, i)$ from concentrating on a single arm and induces a degree of exploration. Lei, Tewari, and Murphy (2017) showed that under some regularity assumptions on the distribution of the contexts and rewards, $\hat\mu_t$ and $\hat\theta_t$ converge in probability to $\mu^*$ and $\theta^*$ respectively and are asymptotically normally distributed, thereby enabling a testing procedure.

### Misspecification of Models

The validity of model-based testing is predicated on correctly specified models. However, misspecification is frequent in practice due to incorrect functional forms or missing covariates. In the statistics literature, robustness has been considered in the context of using models to test causal effects from data collected in experiments. Linear and GLM regression models (Rosenblum and Van Der Laan 2009) and proportional and multiplicative hazards models (Kim 2013) have been shown to be robust to misspecification when testing the coefficient of the treatment assignment in randomized trials. In bandit settings, however, asymptotic normality is not guaranteed to hold when the working model is incorrect. In this paper, we consider the case where the critic is misspecified.

### Inference from Bandit Feedback Data

Besides Lei, Tewari, and Murphy (2017), there is a recent and growing body of literature on deriving the distribution of parameter estimates from bandit feedback data. Bandit feedback data are not i.i.d. but are correlated due to adaptivity, which complicates deriving the distribution of the estimates. Zhang, Janson, and Murphy (2021) recently derived the asymptotic distribution of M-estimators from bandit data; this work considered correctly specified reward models only. Chen, Lu, and Song (2020) derived the asymptotic normality of the ordinary and weighted least-squares estimators when data are accumulated by an ε-greedy algorithm, in both the case where the reward model is linear and the case where it is not. When the reward model is not linear, they proved that the estimator with inverse-probability weighting converges to a normal distribution with mean equal to the least false parameter with respect to the population distribution of the contexts.
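As a rough illustration of the inverse-probability-weighting idea (a generic sketch, not the exact estimator or notation of Chen, Lu, and Song (2020); the function name and arguments are hypothetical):

```python
import numpy as np

def ipw_least_squares(X, y, propensities):
    """Inverse-probability-weighted least squares (generic sketch).

    X: n x p matrix of contexts of the chosen arms,
    y: n-vector of observed rewards,
    propensities: n-vector of probabilities with which each chosen arm
        was selected at the time it was chosen.
    The weighted fit targets the least false linear parameter under the
    population context distribution when the reward model is misspecified."""
    w = 1.0 / np.asarray(propensities)
    Xw = X * w[:, None]                      # row-wise weighting
    # Solve the weighted normal equations (X^T W X) beta = X^T W y.
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)
```

Weighting each logged observation by the reciprocal of the probability with which its arm was chosen corrects for the adaptive action selection, which is what makes a least false parameter the limit of the estimator.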
Since the action decision in ε-greedy algorithms is based on reward estimate values, a robust test of the utility of the variables could be conducted by testing the significance of the least false parameters. However, as mentioned above, ε-greedy performs uniform exploration over the context space, which may be undesirable when the action space is large.

### Compatibility Condition in the Actor-Critic Algorithm

The algorithm of Lei, Tewari, and Murphy (2017) and the theoretical derivations therein exploit the fact that the true reward model is linear. The true nature of the reward can, however, be far from linear. Sutton et al. (2000) proved the following Lemma 1, which implies that if the reward model and policy model are both differentiable with respect to their parameters and satisfy the compatibility condition, the algorithm converges to the optimal policy $\pi_{\theta^*}$ even though the critic model may be misspecified. Denoting the critic model parameterized by $\mu$ as $m_\mu(\cdot)$, the compatibility condition states:
$$\nabla \pi_\theta(b_t, i) / \pi_\theta(b_t, i) = \nabla m_\mu(b_{t,i}), \tag{1}$$
where $\nabla \pi_\theta(b_t, i) = \nabla_\theta \pi_\theta(b_t, i)$ and $\nabla m_\mu(b_{t,i}) = \nabla_\mu m_\mu(b_{t,i})$.

**Lemma 1.** (Theorem 2 of Sutton et al. (2000)) Let
$$J(\theta) = \mathbb{E}_{b,r}\Big[\sum_{i=1}^{N} r_{t,i}\,\pi_\theta(b_t, i)\Big] - \lambda\theta^T\theta, \tag{2}$$
where $\mathbb{E}_{b,r}(\cdot)$ denotes the expectation over both the context and the reward. Suppose the critic parameter $\mu$ minimizes
$$U(\mu, \theta) := \mathbb{E}_{b,r}\Big[\sum_{i=1}^{N} \{r_{t,i} - m_\mu(b_{t,i})\}^2\,\pi_\theta(b_t, i)\Big]$$
and the actor parameter $\theta$ maximizes
$$J(\mu, \theta) := \mathbb{E}_{b}\Big[\sum_{i=1}^{N} m_\mu(b_{t,i})\,\pi_\theta(b_t, i)\Big] - \lambda\theta^T\theta.$$
Then if $\pi_\theta(\cdot)$ and $m_\mu(\cdot)$ satisfy the compatibility condition (1), the actor parameter $\theta$ satisfies $\nabla_\theta J(\theta) = 0$.

*Sketch of Proof.* The parameters $\mu$ and $\theta$ jointly solve $U_\mu(\mu, \theta) = 0$ and $J_\theta(\mu, \theta) = 0$, where $U_\mu(\mu, \theta) = \frac{1}{2}\nabla_\mu U(\mu, \theta)$ and $J_\theta(\mu, \theta) = \nabla_\theta J(\mu, \theta)$. We have
$$U_\mu(\mu, \theta) = -\mathbb{E}_{b,r}\Big[\sum_{i=1}^{N} \{r_{t,i} - m_\mu(b_{t,i})\}\,\nabla m_\mu(b_{t,i})\,\pi_\theta(b_t, i)\Big], \tag{3}$$
$$J_\theta(\mu, \theta) = \mathbb{E}_{b}\Big[\sum_{i=1}^{N} m_\mu(b_{t,i})\,\nabla\pi_\theta(b_t, i)\Big] - 2\lambda\theta. \tag{4}$$
Due to (1) and the facts that (3) $= 0$ and (4) $= 0$, the parameter $\theta$ satisfies
$$\nabla_\theta J(\theta) = \mathbb{E}_{b,r}\Big[\sum_{i=1}^{N} r_{t,i}\,\nabla\pi_\theta(b_t, i)\Big] - 2\lambda\theta = 0.$$

Note that in Lemma 1, $J(\theta)$ is defined in terms of the true rewards $r_{t,i}$, while $J(\mu, \theta)$ replaces them with $m_\mu(b_{t,i})$. If the true reward model is linear, i.e., if $\mathbb{E}[r_{t,i}\mid b_{t,i}] = b_{t,i}^T\mu^*$, and if $m_\mu(b_{t,i}) = b_{t,i}^T\mu$, then we have $J(\theta) = J(\mu^*, \theta)$. However, when the true model is not linear, $J(\theta)$ and $J(\mu, \theta)$ are entirely different functions.

Sutton et al. (2000) show that (1) is satisfied when the actor model is a softmax function and the critic model is linear in the same context vectors as the policy, except that they are centered to have weighted mean 0:
$$\text{critic:}\quad m_{\mu,\theta}(b_{t,i}) = \Big(b_{t,i} - \sum_{j=1}^{N} \pi_\theta(b_t, j)\,b_{t,j}\Big)^T \mu, \tag{5}$$
$$\text{actor:}\quad \pi_\theta(b_t, i) = \frac{\exp(b_{t,i}^T\theta)}{\sum_{j=1}^{N} \exp(b_{t,j}^T\theta)}. \tag{6}$$
The model (5) can be viewed as an approximation of the advantage function (Baird III 1993). The advantage function allows us to discard variables that do not vary by arm (e.g., the age of the user); we would still need such variables if we modeled the reward instead of the advantage. From now on we denote the model (5) as $m_{\mu,\theta}(\cdot)$ instead of $m_\mu(\cdot)$ to show its dependency on $\theta$ as well. Meanwhile, since $\sum_{i=1}^{N} \nabla\pi_\theta(b_t, i) = 0$, equation (4) is equivalent to
$$J_\theta(\mu, \theta) = \mathbb{E}_{b}\Big[\sum_{i=1}^{N} b_{t,i}^T\mu\,\nabla\pi_\theta(b_t, i)\Big] - 2\lambda\theta.$$
So we redefine
$$J(\mu, \theta) = \mathbb{E}_{b}\Big[\sum_{i=1}^{N} b_{t,i}^T\mu\,\pi_\theta(b_t, i)\Big] - \lambda\theta^T\theta. \tag{7}$$
We can find the value of $\theta$ satisfying (4) $= 0$ as the maximizer of the redefined $J(\mu, \theta)$.

## Proposed Algorithm

We propose a new actor-critic algorithm which uses (5) and (6) to model the reward and the action selection policy. We consider the case where the true functional form of the reward model may not be linear.
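To make the compatibility condition and the models (5)-(6) used by the proposed algorithm concrete, the sketch below (illustrative only; helper names are ours) implements the centered critic and softmax actor and checks numerically that $\nabla_\theta \pi_\theta(b_t, i)/\pi_\theta(b_t, i)$ equals $\nabla_\mu m_{\mu,\theta}(b_{t,i})$, i.e., condition (1).

```python
import numpy as np

def softmax(B, theta):
    s = B @ theta
    s = s - s.max()
    p = np.exp(s)
    return p / p.sum()

def centered_contexts(B, theta):
    """Rows are b_{t,i} - sum_j pi_theta(b_t, j) b_{t,j}, i.e. the gradient of
    m_{mu,theta}(b_{t,i}) in (5) with respect to mu."""
    probs = softmax(B, theta)
    return B - probs @ B

def critic_value(B, theta, mu):
    """m_{mu,theta}(b_{t,i}) for every arm i, as in (5)."""
    return centered_contexts(B, theta) @ mu

# Numerical check of the compatibility condition (1) for one arm:
# grad_theta pi_theta(b_t, i) / pi_theta(b_t, i) should equal row i of the
# centered contexts, i.e. grad_mu m_{mu,theta}(b_{t,i}).
rng = np.random.default_rng(1)
B, theta = rng.normal(size=(3, 4)), rng.normal(size=4)
i, eps = 0, 1e-6
grad_pi_i = np.array([
    (softmax(B, theta + eps * e)[i] - softmax(B, theta - eps * e)[i]) / (2 * eps)
    for e in np.eye(4)
])
assert np.allclose(grad_pi_i / softmax(B, theta)[i],
                   centered_contexts(B, theta)[i], atol=1e-5)
```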
When the critic may thus be misspecified, we re-define the target parameters $\mu^*$ and $\theta^*$ as
$$\theta^* = \operatorname*{argmax}_{\theta} J(\theta), \qquad \mu^* = \operatorname*{argmin}_{\mu} U(\mu, \theta^*),$$
where $J(\theta)$ is defined in (2) and
$$U(\mu, \theta) = \mathbb{E}_{b,r}\Big[\sum_{i=1}^{N} \{r_{t,i} - m_{\mu,\theta}(b_{t,i})\}^2\,\pi_\theta(b_t, i)\Big]. \tag{8}$$
Under (5) and (6), which satisfy the compatibility condition, $\theta^* = \operatorname{argmax}_\theta J(\mu^*, \theta)$, where $J(\mu, \theta)$ is redefined in (7). We assume that the arguments achieving the maximum (argmax) and the minimum (argmin) both exist in the parameter space that we consider. While the definition of $\theta^*$ is the same as the original definition, note that the definition of $\mu^*$ now depends on the value of $\theta^*$.

### Estimating Functions for µ* and θ*

The target parameters $\mu^*$ and $\theta^*$ are the values that jointly minimize (8) with respect to $\mu$ and maximize (7) with respect to $\theta$. To consistently estimate the parameters, we use as estimating functions empirical versions of (8) and (7) that converge in probability to (8) and (7), respectively. Suppose we use the residual mean square (RMS), $\frac{1}{t}\sum_{\tau=1}^{t}\{r_{\tau,a_\tau} - m_{\mu,\theta}(b_{\tau,a_\tau})\}^2$, for (8), computed on the context and reward pairs of the chosen arms. Let $I_i(\tau) = I(a_\tau = i)$ be the binary indicator taking value 1 if $a_\tau = i$ and 0 otherwise. The expectation of the RMS is
$$\mathbb{E}[\mathrm{RMS}] = \mathbb{E}\Big[\frac{1}{t}\sum_{\tau=1}^{t}\sum_{i=1}^{N}\{r_{\tau,i} - m_{\mu,\theta}(b_{\tau,i})\}^2 I_i(\tau)\Big]
= \mathbb{E}\Big[\frac{1}{t}\sum_{\tau=1}^{t}\sum_{i=1}^{N}\{r_{\tau,i} - m_{\mu,\theta}(b_{\tau,i})\}^2\,\mathbb{E}[I_i(\tau)\mid\mathcal{F}_{\tau-1}]\Big]
= \mathbb{E}\Big[\frac{1}{t}\sum_{\tau=1}^{t}\sum_{i=1}^{N}\{r_{\tau,i} - m_{\mu,\theta}(b_{\tau,i})\}^2\,\pi_{\hat\theta_{\tau-1}}(b_\tau, i)\Big],$$
where $\mathcal{F}_{t-1}$ denotes the filtration at time $t$, the union of the history $\mathcal{H}_{t-1}$ of observations up to time $t-1$ and the context $b_t$ at time $t$, i.e., $\mathcal{F}_{t-1} = \mathcal{H}_{t-1}\cup\{b_t\}$ where $\mathcal{H}_{t-1} = \bigcup_{\tau=1}^{t-1}\{b_\tau, a_\tau, r_{\tau,a_\tau}\}$. Due to the Azuma-Hoeffding inequality, the RMS converges in probability to $\mathbb{E}[\mathrm{RMS}]$ for any $\mu$ and $\theta$. A gap with (8) is caused by the update of $\hat\theta_\tau$ at each time point. To resolve this, we propose to minimize the following importance-weighted RMS instead:
$$\hat U_t(\mu, \theta) = \frac{1}{t}\sum_{\tau=1}^{t}\{r_{\tau,a_\tau} - m_{\mu,\theta}(b_{\tau,a_\tau})\}^2\,\frac{\pi_\theta(b_\tau, a_\tau)}{\pi_{\hat\theta_{\tau-1}}(b_\tau, a_\tau)} \tag{9}$$
$$= \frac{1}{t}\sum_{\tau=1}^{t}\sum_{i=1}^{N}\{r_{\tau,i} - m_{\mu,\theta}(b_{\tau,i})\}^2\,\pi_\theta(b_\tau, i)\,\frac{I_i(\tau)}{\pi_{\hat\theta_{\tau-1}}(b_\tau, i)}. \tag{10}$$
Since $\mathbb{E}[I_i(\tau)\mid\mathcal{F}_{\tau-1}] = \pi_{\hat\theta_{\tau-1}}(b_\tau, i)$, the expectation of $\hat U_t(\mu, \theta)$ is exactly (8), and $\hat U_t(\mu, \theta)$ converges in probability to (8) for any $\mu$ and $\theta$. We note that if we had the guarantee that $\hat\theta_t$ converges in probability to $\theta^*$, then the RMS would converge to (8) as well. However, the convergence of $\hat\theta_t$ to $\theta^*$ is guaranteed only when the compatibility condition holds, which itself requires that the RMS converge to (8).

The empirical version of (7) is
$$\hat J_t(\mu, \theta) = \frac{1}{t}\sum_{\tau=1}^{t}\sum_{i=1}^{N} b_{\tau,i}^T\mu\,\pi_\theta(b_\tau, i) - \lambda\theta^T\theta, \tag{11}$$
and its expectation is exactly (7). In the next section, we prove that the values of $\mu$ and $\theta$ that minimize $\hat U_t(\mu, \theta)$ with respect to $\mu$ and maximize $\hat J_t(\mu, \theta)$ with respect to $\theta$ converge in probability to $\mu^*$ and $\theta^*$, respectively.

### Computation

The proposed algorithm with the new estimating functions is presented in Algorithm 2.

**Algorithm 2: Actor-Improper Critic algorithm**
1. Set $\lambda > 0$, $C > 1$, $\hat\theta_0 = 0_d$.
2. **for** $t = 1, \ldots, T$ **do**
3. Pull arm $a_t$ according to probabilities $\{\pi_{\hat\theta_{t-1}}(b_t, i)\}_{i=1}^{N}$ and get reward $r_{t,a_t}$.
4. Critic update: $\hat\mu_t \leftarrow \operatorname*{argmin}_{\mu:\|\mu\|_2 \le C} \hat U_t(\mu, \hat\theta_{t-1})$ (see (10)).
5. Actor update: $\hat\theta_t \leftarrow \operatorname*{argmax}_{\theta} \hat J_t(\hat\mu_t, \theta)$ (see (11)).

At each iteration, we find the value $\hat\mu_t$ that minimizes $\hat U_t(\mu, \theta)$ with $\theta$ replaced by $\hat\theta_{t-1}$ from the previous iteration. Then we find the value $\hat\theta_t$ that maximizes $\hat J_t(\mu, \theta)$ with $\mu$ replaced by $\hat\mu_t$. The inverse probability $1/\pi_{\hat\theta_{\tau-1}}(b_\tau, i)$ can take large values and cause instability of the estimate $\hat\mu_t$. To mitigate such instability, we solve $\hat\mu_t = \operatorname*{argmin}_{\mu:\|\mu\|_2 \le C} \hat U_t(\mu, \theta)$ for some positive constant $C$. We later show that if $C$ is set such that $\mu^* \in \{\mu : \|\mu\|_2 \le C\}$, then $\hat\mu_t$ and $\hat\theta_t$ converge in probability to $\mu^*$ and $\theta^*$, respectively.
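The following numerical sketch of one critic/actor update of Algorithm 2 is for illustration only: the data layout, helper names, and the use of `scipy.optimize.minimize` are our assumptions, and the actor step uses a generic local optimizer whereas the experiments reported later use a grid search followed by Nelder-Mead.

```python
import numpy as np
from scipy.optimize import minimize

def softmax(B, theta):
    s = B @ theta
    s = s - s.max()
    p = np.exp(s)
    return p / p.sum()

def centered(B, theta):
    probs = softmax(B, theta)
    return B - probs @ B

def U_hat(mu, theta, history):
    """Importance-weighted RMS (10). history holds tuples
    (B_tau, a_tau, r_tau, pi_logged), where pi_logged is
    pi_{hat theta_{tau-1}}(b_tau, a_tau) recorded at logging time."""
    total = 0.0
    for B, a, r, pi_logged in history:
        resid = r - centered(B, theta)[a] @ mu
        total += resid ** 2 * softmax(B, theta)[a] / pi_logged
    return total / len(history)

def J_hat(mu, theta, history, lam):
    """Empirical penalized objective (11)."""
    total = sum((B @ mu) @ softmax(B, theta) for B, _, _, _ in history)
    return total / len(history) - lam * theta @ theta

def algorithm2_step(history, theta_prev, lam=0.001, C=2.0):
    """One critic/actor update in the spirit of Algorithm 2."""
    d = history[0][0].shape[1]
    # Critic: minimize U_hat over {mu : ||mu||_2 <= C} with theta fixed at theta_prev.
    crit = minimize(lambda m: U_hat(m, theta_prev, history), np.zeros(d),
                    constraints=[{"type": "ineq",
                                  "fun": lambda m: C - np.linalg.norm(m)}])
    mu_t = crit.x
    # Actor: maximize J_hat in theta (local optimizer started at theta_prev;
    # the paper's experiments instead use a grid search plus Nelder-Mead).
    act = minimize(lambda th: -J_hat(mu_t, th, history, lam), theta_prev)
    return mu_t, act.x
```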
Without the constraint, $\hat\mu_t$ is simply a weighted least-squares estimator with importance weights $\pi_{\hat\theta_t}(b_\tau, a_\tau)/\pi_{\hat\theta_{\tau-1}}(b_\tau, a_\tau)$. We later show that $\hat\mu_t$ with the constraint converges to the weighted least-squares estimator as time accumulates.

### Regret Bound

The proposed algorithm (Algorithm 2) is robust to misspecification of the critic model and converges to the optimal action selection policy. We define the regret with respect to the optimal action selection policy as
$$\mathrm{regret}(T) = \sum_{t=1}^{T}\sum_{i=1}^{N}\mathbb{E}[r_{t,i}\mid b_{t,i}]\,\big\{\pi_{\theta^*}(b_t, i) - \pi_{\hat\theta_{t-1}}(b_t, i)\big\}.$$
We can show that the proposed algorithm achieves a regret that is upper-bounded by $O(\sqrt{T})$ with high probability. This upper bound is of the same order as the regret upper bound of Algorithm 1, which requires the restrictive assumption that the linear model correctly specifies the reward model. We provide the proofs in the Supplementary Material.

## Asymptotic Properties and Testing Procedure

Statistical tests on the significance of the true parameter values ($\mu^*$ and $\theta^*$) can be conducted if the distributions of the estimates are known. In this section, we derive the asymptotic distributions of $\hat\mu_t$ and $\hat\theta_t$. We first state some necessary assumptions.

**Assumption 1.** The distribution of the context variables is i.i.d. over time $t$, i.e., $b_t = \{b_{t,1}, \ldots, b_{t,N}\} \overset{\mathrm{i.i.d.}}{\sim} P_b$, where $P_b$ is some distribution over $\mathbb{R}^{N \times d}$. Also, the distribution of the rewards $r_t = \{r_{t,1}, \ldots, r_{t,N}\}$ is i.i.d. over time $t$.

**Assumption 2.** Contexts and rewards are bounded. Without loss of generality, $\|b_{t,i}\|_2 \le 1$ and $|r_{t,i}| \le 1$.

**Assumption 3.** The optimal policy parameter $\theta^*$ is unique and $\mu^*$ is unique, and the joint equation $\big[\{\nabla_\mu U(\mu, \theta)\}^T, \{\nabla_\theta J(\mu, \theta)\}^T\big] = 0_{2d}^T$ has a unique solution at $[\mu^T, \theta^T] = [\mu^{*T}, \theta^{*T}]$. Moreover, for a fixed value of $\mu$, $J(\mu, \theta)$ has a unique maximum at $\theta = \theta_\mu$. Also, for a fixed value of $\theta$, $U(\mu, \theta)$ has a unique minimum at $\mu = \mu_\theta$.

**Assumption 4.** Let $\bar b_\theta(t) = \sum_{i=1}^{N} \pi_\theta(b_t, i)\,b_{t,i}$. The matrix
$$\mathbb{E}_\theta\big[(b_{t,a_t} - \bar b_\theta(t))(b_{t,a_t} - \bar b_\theta(t))^T\big] = \mathbb{E}\Big[\sum_{i=1}^{N} \pi_\theta(b_t, i)(b_{t,i} - \bar b_\theta(t))(b_{t,i} - \bar b_\theta(t))^T\Big]$$
is positive definite for $\theta$ in a neighborhood of $\theta^*$.

Assumption 1 is standard in the literature (Langford and Zhang 2007; Goldenshluger and Zeevi 2013; Bastani and Bayati 2020) and is reasonable in many practical settings such as clinical trials, where arms have a stationary distribution and do not depend on the past. The uniqueness of $\mu^*$ follows under mild conditions, since it minimizes the convex function (8) and because we can discard all contextual features that do not differ by arm. The uniqueness of $\theta^*$ is a reasonable assumption as well, since the penalty $\lambda\theta^T\theta$ in (2) introduces a degree of concavity into the maximization objective. Also, due to this penalty, $\|\theta^*\|_2$ is bounded, so the optimal policy itself enforces a positive probability ($\gamma$) of uniform exploration. Therefore, we have
$$\mathbb{E}_{\theta^*}\big[(b_{t,a_t} - \bar b_{\theta^*}(t))(b_{t,a_t} - \bar b_{\theta^*}(t))^T\big] \succeq \gamma\,\mathbb{E}\Big[\sum_{i=1}^{N} (b_{t,i} - \bar b_{\theta^*}(t))(b_{t,i} - \bar b_{\theta^*}(t))^T\Big].$$
Positive definiteness of the right-hand side imposes variety in the arm features and is also a standard assumption in the literature (see Goldenshluger and Zeevi (2013) and Bastani and Bayati (2020)). Hence, Assumption 4 is reasonable as well.

We first show the following lemma, which is crucial in deriving the consistency of the estimates.

**Lemma 2.** Under Assumptions 2-4, the optimal parameters $\theta^*$ and $\mu^*$ lie in a compact set. Also, the estimated parameters $\hat\mu_t$ and $\hat\theta_t$ lie in a compact set for all $t \in [T]$.

*Sketch of Proof.* We first show the boundedness of $\mu^*$.
Since $\mu^*$ is the minimizer of $U(\mu, \theta^*)$, we have
$$\mu^* = \Big(\mathbb{E}\Big[\sum_{i=1}^{N}\pi_{\theta^*}(b_t, i)(b_{t,i} - \bar b_{\theta^*}(t))(b_{t,i} - \bar b_{\theta^*}(t))^T\Big]\Big)^{-1}\mathbb{E}\Big[\sum_{i=1}^{N}\pi_{\theta^*}(b_t, i)(b_{t,i} - \bar b_{\theta^*}(t))\,r_{t,i}\Big].$$
Due to Assumption 2, we have $\|\mu^*\|_2 \le 1/\phi^2$, where
$$\phi^2 = \lambda_{\min}\Big(\mathbb{E}\Big[\sum_{i=1}^{N}\pi_{\theta^*}(b_t, i)(b_{t,i} - \bar b_{\theta^*}(t))(b_{t,i} - \bar b_{\theta^*}(t))^T\Big]\Big)$$
and $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue. Due to Assumption 4, $\phi^2 > 0$, so $\mu^*$ lies in a compact set. Now, since $\theta^*$ maximizes $J(\mu^*, \theta)$, we have $J(\mu^*, 0_d) \le J(\mu^*, \theta^*)$. Due to $\|\mu^*\|_2 \le 1/\phi^2$, Assumption 2, and the Cauchy-Schwarz inequality, we have $J(\mu^*, \theta^*) \le \frac{1}{\phi^2} - \lambda\theta^{*T}\theta^*$. Also, $J(\mu^*, 0_d) \ge -\frac{1}{\phi^2}$. Therefore, $-\frac{1}{\phi^2} \le \frac{1}{\phi^2} - \lambda\theta^{*T}\theta^*$, which shows that $\|\theta^*\|_2 \le \sqrt{2/(\lambda\phi^2)}$. Due to line 4 of Algorithm 2, $\|\hat\mu_t\|_2 \le C$, so $\hat\mu_t$ clearly lies in a compact set, and analogously we can show that $\|\hat\theta_t\|_2 \le \sqrt{2C/(\lambda\phi^2)}$.

We now prove in Theorem 1 the consistency of the estimates $\hat\mu_t$ and $\hat\theta_t$.

**Theorem 1 (Consistency).** Let $\bar C = 1/\phi^2$. Under Assumptions 1-4, if $C \ge \bar C$, then $(\hat\mu_t^T, \hat\theta_t^T)$ converges to $(\mu^{*T}, \theta^{*T})$ in probability.

*Sketch of Proof.* Denote $\Omega = \{(\mu^T, \theta^T) : \|\mu\|_2 \le C,\ \|\theta\|_2 \le 2\sqrt{2C/(\lambda\phi^2)}\}$. Then $\Omega$ forms a compact set and includes $(\mu^{*T}, \theta^{*T})$. Since $\sum_{i=1}^{N} b_{\tau,i}^T\mu\,\pi_\theta(b_\tau, i)$ is i.i.d. over time $\tau$ for fixed $\mu$ and $\theta$, we can apply the Glivenko-Cantelli theorem to $\hat J_t(\mu, \theta)$ and prove uniform convergence:
$$\sup_{(\mu^T, \theta^T)\in\Omega}\big|\hat J_t(\mu, \theta) - J(\mu, \theta)\big| \overset{P}{\longrightarrow} 0. \tag{12}$$
Due to the term $I_i(\tau)/\pi_{\hat\theta_{\tau-1}}(b_\tau, i)$, $\hat U_t(\mu, \theta)$ is not a mean of i.i.d. variables, and additional steps are needed to prove uniform convergence. Define
$$\bar U_t(\mu, \theta) = \frac{1}{t}\sum_{\tau=1}^{t}\sum_{i=1}^{N}\{r_{\tau,i} - m_{\mu,\theta}(b_{\tau,i})\}^2\,\pi_\theta(b_\tau, i).$$
Using martingale inequalities along with a covering argument on the space $\Omega$, we first show that $|\hat U_t(\mu, \theta) - \bar U_t(\mu, \theta)|$ converges uniformly to 0 in probability. We then apply the Glivenko-Cantelli theorem to $\bar U_t(\mu, \theta)$ to finally prove
$$\sup_{(\mu^T, \theta^T)\in\Omega}\big|\hat U_t(\mu, \theta) - U(\mu, \theta)\big| \overset{P}{\longrightarrow} 0. \tag{13}$$
Since $(\hat\mu_t^T, \hat\theta_t^T)$ lies in $\Omega$ (Lemma 2), $U(\mu, \theta)$ and $J(\mu, \theta)$ are continuous on $\Omega$, and $(\mu^{*T}, \theta^{*T})$ is unique (Assumption 3), we can apply Theorem 9.4 of Keener (2010) to show $(\hat\mu_t^T, \hat\theta_t^T) \overset{P}{\longrightarrow} (\mu^{*T}, \theta^{*T})$. Detailed proofs are presented in the Supplementary Material.

**Lemma 3.** Suppose $C \ge \bar C$. As $t \to \infty$, $\hat\mu_t$ converges in probability to the solution of $\hat U_{t,\mu}(\mu, \hat\theta_{t-1}) = 0$, where $\hat U_{t,\mu}(\mu, \theta) = \nabla_\mu \hat U_t(\mu, \theta)$ and $\hat J_{t,\theta}(\mu, \theta) = \nabla_\theta \hat J_t(\mu, \theta)$.

*Sketch of Proof.* We only need to show that the solution $\tilde\mu_t$ of $\hat U_{t,\mu}(\mu, \hat\theta_{t-1}) = 0$ satisfies $P(\|\tilde\mu_t\|_2 \le C) \to 1$ as $t \to \infty$, i.e., $P(\hat\mu_t = \tilde\mu_t) \to 1$. Note that the solution of $\hat U_{t,\mu}(\mu, \hat\theta_{t-1}) = 0$ is a weighted least-squares estimator with weights $w_\tau = \pi_{\hat\theta_{t-1}}(b_\tau, a_\tau)/\pi_{\hat\theta_{\tau-1}}(b_\tau, a_\tau)$, covariates $b_{\tau,a_\tau} - \bar b_{\hat\theta_{t-1}}(\tau)$, and outcomes $r_{\tau,a_\tau}$. The lemma holds due to Assumptions 2 and 4 and the consistency of $\hat\theta_{t-1}$. A detailed proof can be found in the Supplementary Material.

**Theorem 2 (Asymptotic Normality).** Under Assumptions 1-4, if $C \ge \bar C$, then $\sqrt{t}\,\big[(\hat\mu_t - \mu^*)^T, (\hat\theta_t - \theta^*)^T\big]^T$ converges in distribution to a multivariate normal distribution with mean $0$ and variance $\Psi = \Lambda^{-1} V \Lambda^{-1}$, where
$$\Lambda = \begin{bmatrix} U^*_{\mu\mu} & U^*_{\mu\theta} \\ J^*_{\theta\mu} & J^*_{\theta\theta} \end{bmatrix}, \qquad
V = \lim_{t\to\infty}\frac{1}{t}\sum_{\tau=1}^{t}\mathbb{E}\begin{bmatrix} u^{\tau*}_\mu u^{\tau*T}_\mu & u^{\tau*}_\mu j^{\tau*T}_\theta \\ j^{\tau*}_\theta u^{\tau*T}_\mu & j^{\tau*}_\theta j^{\tau*T}_\theta \end{bmatrix},$$
$$u^{\tau}_\mu(\mu, \theta) = -\sum_{i=1}^{N} 2\{r_{\tau,i} - m_{\mu,\theta}(b_{\tau,i})\}\,\nabla m_{\mu,\theta}(b_{\tau,i})\,\frac{I_i(\tau)}{\pi_{\hat\theta_{\tau-1}}(b_\tau, i)}\,\pi_\theta(b_\tau, i),$$
$$j^{\tau}_\theta(\mu, \theta) = \sum_{i=1}^{N} b_{\tau,i}^T\mu\,\nabla\pi_\theta(b_\tau, i) - 2\lambda\theta,$$
$U_{\mu\mu}$ and $U_{\mu\theta}$ are the second-order partial derivatives of $U$ with respect to $\mu$ twice and with respect to $\mu$ and $\theta$, respectively, $J_{\theta\mu}$ and $J_{\theta\theta}$ are defined analogously, and $U^*_{\mu\mu}$, $U^*_{\mu\theta}$, $J^*_{\theta\mu}$, $J^*_{\theta\theta}$, $u^{\tau*}_\mu$, $j^{\tau*}_\theta$ are the values of $U_{\mu\mu}$, $U_{\mu\theta}$, $J_{\theta\mu}$, $J_{\theta\theta}$, $u^{\tau}_\mu$, $j^{\tau}_\theta$ evaluated at the true value $(\mu^{*T}, \theta^{*T})$. The asymptotic variance $\Psi$ can be estimated by replacing the expectation operator $\mathbb{E}(\cdot)$ with the empirical mean and plugging in the estimates $(\hat\mu_t^T, \hat\theta_t^T)$. Due to Assumption 1 and the consistency of $(\hat\mu_t^T, \hat\theta_t^T)$, such a plug-in estimator is consistent for the asymptotic variance.

*Sketch of Proof.*
Due to Lemma 3 and a linearization argument, for sufficiently large $t$,
$$\begin{bmatrix} 0_d \\ 0_d \end{bmatrix} = \begin{bmatrix} \hat U_{t,\mu}(\hat\mu_t, \hat\theta_t) \\ \hat J_{t,\theta}(\hat\mu_t, \hat\theta_t) \end{bmatrix} = \begin{bmatrix} \hat U_{t,\mu}(\mu^*, \theta^*) \\ \hat J_{t,\theta}(\mu^*, \theta^*) \end{bmatrix} + \begin{bmatrix} \hat U_{t,\mu\mu}(\tilde\mu, \tilde\theta) & \hat U_{t,\mu\theta}(\tilde\mu, \tilde\theta) \\ \hat J_{t,\theta\mu}(\check\mu, \check\theta) & \hat J_{t,\theta\theta}(\check\mu, \check\theta) \end{bmatrix}\begin{bmatrix} \hat\mu_t - \mu^* \\ \hat\theta_t - \theta^* \end{bmatrix},$$
where $\tilde\mu = \alpha\hat\mu_t + (1-\alpha)\mu^*$, $\tilde\theta = \alpha\hat\theta_t + (1-\alpha)\theta^*$, $\check\mu = \beta\hat\mu_t + (1-\beta)\mu^*$, and $\check\theta = \beta\hat\theta_t + (1-\beta)\theta^*$ for some $0 \le \alpha \le 1$ and $0 \le \beta \le 1$. Due to the consistency of $(\hat\mu_t^T, \hat\theta_t^T)$ and the Law of Large Numbers,
$$\sqrt{t}\begin{bmatrix} \hat\mu_t - \mu^* \\ \hat\theta_t - \theta^* \end{bmatrix} = -\left(\begin{bmatrix} U^*_{\mu\mu} & U^*_{\mu\theta} \\ J^*_{\theta\mu} & J^*_{\theta\theta} \end{bmatrix} + o_P(1)\right)^{-1}\sqrt{t}\begin{bmatrix} \hat U_{t,\mu}(\mu^*, \theta^*) \\ \hat J_{t,\theta}(\mu^*, \theta^*) \end{bmatrix}.$$
Since $\hat J_{t,\theta}(\mu^*, \theta^*)$ is the empirical mean of i.i.d. variables with mean 0, we can apply the Central Limit Theorem (CLT) to derive its asymptotic distribution. On the other hand, $\hat U_{t,\mu}(\mu^*, \theta^*)$ is the empirical mean of the $u^\tau_\mu(\mu^*, \theta^*)$, which are not i.i.d. due to the term $I_i(\tau)/\pi_{\hat\theta_{\tau-1}}(b_\tau, i)$; instead, the $u^\tau_\mu(\mu^*, \theta^*)$ form a martingale difference sequence. Hence, we can apply the martingale CLT to $\sqrt{t}\,\big[\hat U_{t,\mu}(\mu^*, \theta^*)^T, \hat J_{t,\theta}(\mu^*, \theta^*)^T\big]^T$ as a whole and show that it converges to a normal distribution with mean 0 and variance $V$.

Based on Theorem 2, a Z-test can be conducted for the $j$-th variable ($j = 1, \ldots, d$) using the test statistic $Z = \hat\theta_{t,j}/\sqrt{\Psi_{d+j,d+j}/t}$. We reject the null hypothesis $H_0 : \theta^*_j = 0$ at significance level $\alpha$ when $2(1 - \Phi(|Z|)) < \alpha$, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution.

## Experiments

We conduct experiments to evaluate the performance of the Proposed algorithm under a misspecified reward model. We set $N = 2$ and $d = 4$. We generate the context vectors $b_{t,i}$ from a multivariate normal distribution $N(0_d, I_{d \times d})$ and truncate them to have L2-norm at most 1. We generate the reward from a model nonlinear in $b_{t,i}$,
$$r_{t,i} = b_{t,i}^T\mu^* - \max(b_{t,1}^T\mu^*,\, b_{t,2}^T\mu^*) + \eta_{t,i},$$
where $\mu^* = (-0.577, 0.577, 0.577, 0)^T$ and $\eta_{t,i}$ is generated from $N(0, 0.01^2)$ independently over arms and time. To test the validity of the proposed testing procedure, we set $\mu^*_4$ to 0 so that the corresponding variable does not affect the reward and hence will not be useful in the policy.

We implement the Proposed algorithm (Actor-Improper Critic) along with the original Actor-Critic algorithm (Lei, Tewari, and Murphy 2017) and the ε-greedy algorithm with the weighted LSE (Chen, Lu, and Song 2020). When implementing the Proposed algorithm, we drop the restriction on $\|\hat\mu_t\|_2$ and compute the weighted least-squares estimator for simplicity. Since the objective function $J(\mu, \theta)$ is not concave with respect to $\theta$, we find the maximizer through a grid search followed by a pattern search based on the Nelder-Mead method (Nelder and Mead 1965), as suggested in Lei, Tewari, and Murphy (2017). When implementing the ε-greedy algorithm, we stack the context vectors into one vector $b_t^T = [b_{t,1}^T, \ldots, b_{t,N}^T]$ and set the working model for the reward of the $i$-th arm as $f_i(b_t) = b_t^T\mu_i$, where $\mu_i$ is a parameter of dimension $Nd$. We set the exploration parameter $\lambda$ in the AC and Proposed algorithms to 0.001.
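For completeness, the Z-test described above can be written as a short routine. This is a minimal sketch under assumed names (`z_test_actor`, with `Psi_hat` standing for the plug-in covariance estimate from Theorem 2); it is not code from the paper.

```python
import numpy as np
from scipy.stats import norm

def z_test_actor(theta_hat, Psi_hat, t, j, alpha=0.05):
    """Z-test of H0: theta_j = 0 based on Theorem 2.

    theta_hat: d-vector of actor estimates after t decision points,
    Psi_hat:   plug-in estimate of the 2d x 2d asymptotic covariance
               (mu block first, theta block second),
    j:         0-based index of the tested actor coordinate."""
    d = len(theta_hat)
    se = np.sqrt(Psi_hat[d + j, d + j] / t)       # standard error of theta_hat[j]
    z = theta_hat[j] / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value, p_value < alpha            # reject H0 when p_value < alpha
```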
In the ε-greedy algorithm, we use the value ε = 0.01, which corresponds to the exploration probability guaranteed by λ = 0.001 in the Proposed algorithm. We run the bandit algorithms until time horizon $T = 50$ with 100 repetitions. For each algorithm, we count the number of times the null hypotheses $H_0 : \theta_j = 0$ (for the AC and Proposed algorithms) or $H_0 : \mu_{i,(i-1)d+j} = 0$ (for ε-greedy) are rejected at time $T$ according to a Z-test with significance level $\alpha = 0.05$. We report the rejection rates of $H_0$ in Table 1.

| Param. | ε-greedy (i = 1) | ε-greedy (i = 2) | Param. | AC | Proposed |
|---|---|---|---|---|---|
| $\mu_{i,(i-1)d+1}$ | 0.91 | 0.93 | $\theta_1$ | 0.77 | 0.99 |
| $\mu_{i,(i-1)d+2}$ | 0.93 | 0.94 | $\theta_2$ | 0.64 | 0.99 |
| $\mu_{i,(i-1)d+3}$ | 0.96 | 0.90 | $\theta_3$ | 0.74 | 0.99 |
| $\mu_{i,(i-1)d+4}$ | 0.58 | 0.59 | $\theta_4$ | 0.07 | 0.09 |

Table 1: Rejection rates of $H_0$ for each parameter (Param.) by ε-greedy (Chen, Lu, and Song 2020), Actor-Critic (Lei, Tewari, and Murphy 2017), and the Proposed algorithm.

We observe that the AC algorithm (Lei, Tewari, and Murphy 2017) fails to reject the null hypotheses for the non-zero parameters $\theta_1$, $\theta_2$, and $\theta_3$ in at least 23% of the experiments. In contrast, the Proposed algorithm rejects these null hypotheses 99 times out of 100. As for the fourth parameter, whose true value is 0, both algorithms reject the null hypothesis with small probability. We note that the significance level $\alpha$ lies in the 95% confidence interval $[0.047, 0.133]$ computed from the formula $\big[\hat p - 1.96\sqrt{\alpha(1-\alpha)/n},\ \hat p + 1.96\sqrt{\alpha(1-\alpha)/n}\big]$, where $\hat p = 0.09$ is the rejection rate of $H_0$ by the Proposed algorithm and $n = 100$ is the number of experiments. On the other hand, the ε-greedy algorithm rejects the null hypothesis for the fourth parameter with more than 50% frequency. We also note that the power of the tests for the ε-greedy algorithm with the weighted LSE is lower than that of the Proposed algorithm. One possible reason for the low testing performance is the variance induced by inverse probability weighting. The Proposed algorithm also involves inverses of probabilities, but they appear only in the ratio $\pi_{\hat\theta_t}(b_\tau, i)/\pi_{\hat\theta_{\tau-1}}(b_\tau, i)$, which converges to 1 as $\hat\theta_{\tau-1}$ converges. As for the cumulative rewards, we observe that the Proposed and ε-greedy algorithms show comparable performance (Figure 1).

*Figure 1: Median (solid lines) and first and third quartiles (dashed lines) of the cumulative rewards over decision points (0-50) for the Actor-Critic, Proposed, and ε-greedy algorithms.*

### Data Application

The Recovery Record dataset contained patients' adherence behaviors to their therapy for eating disorders (daily meal monitoring) and their interactions with their linked clinicians on the app. Clinician communication is often viewed as a critical means to encourage adherence to monitoring, yet there is little guidance on when and how clinicians should communicate outside of office visits, and thus it is done on an ad-hoc and individual basis. The rewards (i.e., whether the patient adhered to daily monitoring) were observed for the actions chosen by the ad-hoc policies of clinicians (i.e., send a message or not). A contextual bandit algorithm can allow clinicians to tailor their communications (arms) to patients with different preferences (contexts) to maximize adherence to therapy (rewards). We applied the offline policy evaluation method of Li et al. (2011) to unbiasedly estimate the cumulative reward that we would obtain under the AC, Proposed, and ε-greedy algorithms. We repeated the evaluations on 30 bootstrap samples. Table 2 shows the means and standard deviations of the cumulative rewards at time $T = 1000$.

|       | AC     | Proposed | ε-greedy |
|-------|--------|----------|----------|
| Mean  | 5296.9 | 5557.7   | 5577.3   |
| St.d. | 139.3  | 152.7    | 274.9    |

Table 2: Means and standard deviations (St.d.) of cumulative rewards at $T = 1000$ by ε-greedy (Chen, Lu, and Song 2020), Actor-Critic (Lei, Tewari, and Murphy 2017), and the Proposed algorithm for the Recovery Record dataset.

We remark that although the Proposed and ε-greedy algorithms have comparable cumulative rewards, ε-greedy suffers higher variance. We also show in the Supplementary Material that the rejection rates of $H_0$ for the ε-greedy algorithm are closer to 0.5 than those of the Proposed algorithm, which implies that ε-greedy may have lower power due to high variance. More details on the implementation and testing results can be found in the Supplementary Material.
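For reference, the replay idea behind the offline evaluation of Li et al. (2011) can be sketched as below. This is a simplified illustration with hypothetical callback names, and it assumes the logged actions were chosen uniformly at random, the setting in which the replay estimate is unbiased.

```python
def replay_evaluate(logged_events, policy_choose, policy_update):
    """Replay-style offline evaluation in the spirit of Li et al. (2011).

    logged_events: iterable of (context, logged_arm, reward) tuples collected
        under a uniformly random logging policy (the setting in which the
        replay estimate is unbiased).
    policy_choose(context) -> arm and policy_update(context, arm, reward)
        are callbacks exposed by the bandit algorithm being evaluated.
    Returns the cumulative reward credited to the algorithm and the number
    of logged events it was able to use."""
    total_reward, used = 0.0, 0
    for context, logged_arm, reward in logged_events:
        if policy_choose(context) == logged_arm:
            # Keep only events where the evaluated policy agrees with the log;
            # the revealed reward then also feeds the algorithm's update.
            total_reward += reward
            used += 1
            policy_update(context, logged_arm, reward)
    return total_reward, used
```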
## Conclusion

The problem of testing the utility of variables collected by wearables or sensors represents a practical need in mHealth and is an inferential problem highlighted by Tewari and Murphy (2017). In this paper, we considered testing the utility of variables in the actor-critic bandit, focusing on inference when the models used in the algorithm are not correctly specified. This work demonstrates that a robust test can be constructed for the actor parameters. It also illustrates that inferential procedures associated with the actor-critic bandit inherit problems such as model misspecification by virtue of the assumptions made in model-based testing; however, existing tools in the literature do not apply due to the unique structure of the objective function. This paper adds to the literature by developing a new statistical procedure to test the actor parameters in the actor-critic bandit even when the reward model is misspecified by the critic.

Strengths of this study include its contribution to model-based estimation in the context of contextual bandit algorithms and the key property that the results are robust even when the assumptions underlying the critic fail to be correct. The ability to test variables may offer guidance in understanding what types of data (e.g., location or sensitive information) are useful. Beyond the computational and performance-related impact of this work, such knowledge could have societal impact in that it may discourage unnecessary data collection, thereby mitigating potential risks and threats to privacy. On the other hand, this method may also provide supporting evidence that certain types of data, perhaps sensitive or costly variables, are in fact useful. In such cases, the ethical tension between acquiring sensitive information and choosing not to acquire it at the cost of sub-optimal decision-making deserves disclosure and careful discussion with the stakeholders involved in and affected by the algorithm. This work provides a means to gauge the significance of, and the assumptions made about, the utility of data that would otherwise go untested, as it commonly does at present.

While the focus of this paper is on testing, future work on the implementation aspects of this procedure would be beneficial, such as addressing the question of when these tests should be performed in practice. The proposed method suggests one feasible approach, namely conducting the test after a large number of trials, but alternatives such as sequential or repeated hypothesis testing merit further attention.

## Acknowledgments

Gi-Soo Kim is supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-01336, Artificial Intelligence Graduate School Program (UNIST)) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT, No. 2021R1G1A100980111). Gi-Soo Kim is also partly funded by the 0000 Project Fund (Project Number 1.210079.01) of UNIST, South Korea. Jane Paik Kim and Hyun-Joon Yang are supported by the National Institutes of Health Grant R01TR003505.

## References

Baird III, L. C. 1993. Advantage updating. Technical Report ADA280862, Defense Tech. Inform. Center.
Bastani, H.; and Bayati, M. 2020. Online decision making with high-dimensional covariates. Operations Research, 68(1): 276-294.
Chen, H.; Lu, W.; and Song, R. 2020. Statistical Inference for Online Decision-Making: In a Contextual Bandit Setting. Journal of the American Statistical Association, 116(533): 240-255.
Ghosh, A.; Chowdhury, S. R.; and Gopalan, A. 2017. Misspecified linear bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
Goldenshluger, A.; and Zeevi, A. 2013. A linear response bandit problem. Stochastic Systems, 3(1): 230-261.
Hao, B.; Abbasi-Yadkori, Y.; Wen, Z.; and Cheng, G. 2019. Bootstrapping Upper Confidence Bound. In Advances in Neural Information Processing Systems, volume 32, 12123-12133.
Keener, R. 2010. Theoretical Statistics: Topics for a Core Course. Springer Science & Business Media.
Kim, J. P. 2013. A Note on Using Regression Models to Analyze Randomized Trials: Asymptotically Valid Hypothesis Tests Despite Incorrectly Specified Models. Biometrics, 69(1): 282-289.
Langford, J.; and Zhang, T. 2007. The Epoch-Greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, volume 20, 1-8.
Lei, H.; Tewari, A.; and Murphy, S. A. 2017. An actor-critic contextual bandit algorithm for personalized mobile health interventions. arXiv:1706.09090.
Li, L.; Chu, W.; Langford, J.; and Wang, X. 2011. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, 297-306.
Nelder, J. A.; and Mead, R. 1965. A simplex method for function minimization. The Computer Journal, 7(4): 308-313.
Rosenblum, M.; and Van Der Laan, M. J. 2009. Using regression models to analyze randomized trials: Asymptotically valid hypothesis tests despite incorrectly specified models. Biometrics, 65(3): 937-945.
Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057-1063.
Tang, Q.; Xie, H.; Xia, Y.; Lee, J.; and Zhu, Q. 2021. Robust Contextual Bandits via Bootstrapping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 12182-12189.
Tewari, A.; and Murphy, S. A. 2017. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, 495-517. Springer, Cham.
Zhang, K. W.; Janson, L.; and Murphy, S. A. 2021. Statistical Inference with M-Estimators on Bandit Data. arXiv:2104.14074.
Zhu, F.; Guo, J.; Li, R.; and Huang, J. 2018. Robust actor-critic contextual bandit for mobile health (mHealth) interventions. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 492-501.