# A Practical Semi-Parametric Contextual Bandit

Yi Peng1, Miao Xie1, Jiahao Liu1, Xuying Meng2, Nan Li1, Cheng Yang1, Tao Yao1 and Rong Jin1
1Alibaba Group, Hangzhou, China
2Institute of Computing Technology, Chinese Academy of Sciences
{mulong.py, xiemiao.xm, glacier.ljh, nanli.ln, charis.yangc, tao.yao, jinrong.jr}@alibaba-inc.com

Classic multi-armed bandit algorithms are inefficient for a large number of arms. Contextual bandit algorithms, on the other hand, are more efficient, but they suffer from large regret due to the bias of reward estimation with finite-dimensional features. Although recent studies proposed semi-parametric bandits to overcome these defects, they assume arm features are constant over time. This assumption rarely holds in practice, since real-world problems often involve underlying processes that evolve dynamically over time, especially during special promotions such as Singles' Day sales. In this paper, we formulate a novel Semi-Parametric Contextual Bandit Problem to relax this assumption. For this problem, a novel two-step Upper Confidence Bound framework, called Semi-Parametric UCB (SPUCB), is presented. It can be flexibly applied to problems with a linear parametric function and achieves a satisfactory gap-free bound on the n-step regret. Moreover, to make our method more practical in online systems, an optimization is proposed for dealing with high-dimensional features of a linear function. Extensive experiments on synthetic data as well as a real dataset from one of the largest e-commerce platforms demonstrate the superior performance of our algorithm.

1 Introduction

The multi-armed bandit (MAB) problem models an agent that simultaneously attempts to acquire new knowledge (called exploration) and to optimize its decisions based on existing knowledge (called exploitation). The agent balances these competing tasks in order to maximize its total reward over the period of time considered [Agrawal et al., 1988]. Such problems are ubiquitous in many practical applications, including online advertisement [Schwartz et al., 2017], personalized recommendation [Li et al., 2016], medical treatment [Press, 2009] and financial portfolio design [Hoffman et al., 2011]. In these applications there are often many newly emerging items with unknown rewards to be explored. Besides, MAB algorithms are particularly useful where continuous and permanent optimization is needed to discover the real interests of customers.

The MAB problem was first introduced as the sequential design of experiments and is now formulated as a system of several candidate arms whose rewards are independently sampled from fixed underlying distributions [Robbins, 1952]. There are three typical groups of MABs: classic MABs, contextual bandits and semi-parametric bandits. Classic MAB algorithms (also called non-parametric methods), such as Upper Confidence Bound (UCB) [Auer et al., 2002] and Thompson sampling [Thompson, 1933], estimate the expected reward of an arm by the empirical mean of historical data. The reward estimation becomes unbiased only with sufficient online data, which leads to an inefficient convergence process when the number of candidate arms is large. They are therefore effective only when one can afford to wait a long time before knowing with certainty which decision is best.
As for contextual bandit algorithms (called parametric methods) [Chu et al., 2011; Agrawal and Goyal, 2013], the learner's aim is to collect enough information about how the contextual features and rewards relate to each other, modelled by a parametric function. Although they are efficient for a large arm set, because the parameters can be shared among all independent candidate arms, it is generally difficult to specify a correct model and to perfectly characterize the context with numeric features, leading to large regrets in many applications [Ghosh et al., 2017; Li et al., 2016]. To make up for the shortcomings of the above two kinds of MABs, semi-parametric bandits, which contain both a parametric part and a non-parametric part, have been proposed [Ou et al., 2019]. Specifically, the parametric part, which models the expected reward as a parametric function of arm features, can efficiently eliminate poor arms from the candidate set, whereas the non-parametric part adopts a non-parametric model and revises the parametric estimation to avoid estimation bias. However, current semi-parametric bandits assume arm features are constant over time. In fact, the contextual features of candidate arms (especially features of user behaviour) in real-world problems usually evolve dramatically [Xie et al., 2015], especially during special promotions such as Alibaba's Singles' Day sales. In these scenarios, MAB algorithms are required to adapt their recommendation strategies to the evolution of the arms' contextual features, and dynamic features lead to large regret and divergence of previous semi-parametric bandits.

In this paper, we relax the assumption of constant features in semi-parametric bandits. To the best of our knowledge, this is the first work to formulate the Semi-Parametric Contextual Bandit Problem for cases where contextual features evolve dramatically. To solve it, a novel two-step Upper Confidence Bound framework, called Semi-Parametric UCB (SPUCB), is presented. It can be flexibly applied to Top-1 and Top-K selection problems. We give an implementation that uses a linear function as the parametric part and prove gap-free bounds on the n-step regret. Extensive experiments show that SPUCB selects significantly better arms because it can exploit dynamic contextual features in a semi-parametric environment.

2 Related Work

There are three typical groups of MABs: classic MABs [Auer et al., 2002; Gopalan et al., 2014; Thompson, 1933], contextual bandits [Agrawal and Goyal, 2013; Chu et al., 2011; Qin et al., 2014; Li et al., 2016] and semi-parametric bandits [Ou et al., 2019; Ghosh et al., 2017; Krishnamurthy et al., 2018; Greenewald et al., 2017]. Classic MAB algorithms can be called non-parametric methods, and contextual bandits are usually seen as parametric methods. In addition, to solve the top-K combinatorial optimization problem, combinatorial variants [Gai et al., 2012; Chen et al., 2013; Kveton et al., 2015a; Li et al., 2016] have been proposed.

Firstly, classic MAB algorithms like UCB and Thompson sampling are inefficient when the arm set is large because they have to explore each arm independently.

Secondly, to leverage contexts, Li et al. [2010] formulated personalized news recommendation as a contextual bandit problem. They proposed LinUCB to maximize total clicks based on contextual features about the users and articles. Qin et al.
[2014] proposed C2-UCB, a contextual combinatorial bandit algorithm which considers semi-bandit feedback and non-linear rewards. Li et al. [2016] combined contextual cascading bandits with position discounts; their C3-UCB algorithm is bounded by $O(d\sqrt{KT}\log T)$. Although these methods are efficient for a large arm set, because parameters can be shared among all independent candidate arms, it is generally difficult to specify a correct model and to perfectly characterize the context with numeric features. This makes the assumption impractical and leads to large regret. Their methods diverge on the Semi-Parametric Contextual Bandit Problem due to the latent arm bias. To make up for the shortcomings of the above two kinds of MABs, semi-parametric bandits, which contain both a parametric part and a non-parametric part, have been proposed [Ghosh et al., 2017; Ou et al., 2019; Krishnamurthy et al., 2018]. Krishnamurthy et al. [2018] assumed the reward is the sum of a linear function of context features and an arm-independent bias. Ghosh et al. [2017] assumed each arm has an arm-specific bias, which is more flexible in practice. Ou et al. [2019] proposed a novel framework, called Semi-Parametric Sampling, which can work in the semi-parametric environment. However, these recent methods for semi-parametric bandit problems assume arm features are constant over time, so they will suffer from large regret in our defined problem.

Thirdly, combinatorial bandits have been presented for the top-K combinatorial optimization problem. Generic stochastic combinatorial bandits were pioneered by Gai et al. [2012], who proposed a UCB-like algorithm named LLR. Kveton et al. [2015a] introduced the cascading model to combinatorial semi-bandits and achieved an upper bound of $O(\frac{L}{\Delta_{\min}}\log T)$. Kveton et al. [2015b] proposed a computationally and sample efficient framework, called COMBCASCADE, allowing each feasible action to be a subset of ground items under combinatorial constraints; its regret is bounded by $O(\frac{KL}{\Delta_{\min}}\log T)$. Li et al. [2016] proposed contextual combinatorial cascading bandits with position discounts. However, none of them can be used for our problem directly.

3 Semi-Parametric Contextual Bandit

Suppose there is a tuple $B = (E, K, S)$, where $E = [L]$ is a ground set of $L$ items (also referred to as candidate arms), $K \ge 1$ is an integer, and $S$ is the set of all $K$-tuples of distinct items from $E$. Since the contextual features of candidate arms usually evolve dramatically, reflecting the user's dynamic real interests, we define the novel Semi-Parametric Contextual Bandit Problem as follows.

Let $T$ be the length of the time horizon. At time $t$, the learning agent observes a feature vector $x_{t,a} \in \mathbb{R}^{d\times 1}$ with $\|x_{t,a}\|_2 \le 1$ for every candidate arm $a \in E$, and recommends to the user a list of $K$ items $A_t = (a^t_1, \ldots, a^t_K) \in S$. The user gives feedback (reward) on the recommended list. Let $H_t$ denote the history within $T$; it contains the rewards and contextual information before the agent chooses an action at time $t$. We assume that each base arm $a$ has an independent weight $w_t(a)$, representing the quality of $a$ at time $t$. Given history $H_t$, the expected reward can be formulated in a semi-parametric form:
\[
w^*_{t,a} = \mathbb{E}[w_t(a) \mid H_t] = g(\theta^*, x_{t,a}) + b^*_a, \tag{1}
\]
where $g(\theta^*, x_{t,a})$ is a parametric reward function of the arm features $x_{t,a}$ at time $t$, $b^*_a \in \mathbb{R}$ is the bias of the parametric function $g$, the true expected reward satisfies $w^*_{t,a} \in [0, 1]$, and $\theta^*$ is the optimal function parameter defined as
\[
\theta^* = \arg\min_{\theta} \sum_{a=1}^{L} \big| (w_t(a) - b^*_a) - g(\theta, x_{t,a}) \big|^2. \tag{2}
\]
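To make Eq. (1) concrete, the following is a minimal simulation sketch of the semi-parametric reward model with the linear choice of $g$ used later in Section 5. It is not part of the paper's algorithm; the dimensions, the bias range, the noise scale, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 100, 10                                   # number of arms and feature dimension (illustrative)

theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)         # enforce ||theta*||_2 <= 1
b_star = rng.uniform(0.0, 0.3, size=L)           # arm-specific biases b*_a (assumed scale)

def expected_reward(x_t):
    """w*_{t,a} = g(theta*, x_{t,a}) + b*_a with the linear g(theta, x) = theta^T x (Eq. 1)."""
    return np.clip(x_t @ theta_star + b_star, 0.0, 1.0)   # keep w*_{t,a} in [0, 1]

# Features evolve over time, so x_{t,a} is redrawn at every step t.
x_t = rng.normal(size=(L, d))
x_t /= np.maximum(np.linalg.norm(x_t, axis=1, keepdims=True), 1.0)   # ||x_{t,a}||_2 <= 1
w_star_t = expected_reward(x_t)                  # expected weights of all L arms at step t
w_t = w_star_t + rng.normal(scale=0.1, size=L)   # noisy observed weights (sub-Gaussian noise)
```

A bandit algorithm observes only $x_{t,a}$ and the noisy $w_t(a)$ of the arms it plays; $\theta^*$ and the per-arm biases $b^*_a$ remain unknown.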
For the expected reward $w^*_{t,a}$, we call $b^*_a$ the non-parametric part and $\theta^*$ the parametric part, because the $b^*_a$ are independent among arms with no parametric assumption. $\theta^*$ is an unknown $d$-dimensional vector, with the assumption that $\|\theta^*\|_2 \le 1$. We assume that each weight $w_t(a)$ has an $R_\alpha$-sub-Gaussian tail: letting $\eta_{t,a} = w_t(a) - w^*_{t,a}$, there exists $R_\alpha > 0$ such that for all $\mu \in \mathbb{R}$,
\[
\mathbb{E}\big[e^{\mu \eta_{t,a}} \mid H_t\big] \le e^{R_\alpha^2 \mu^2 / 2}.
\]
The sub-Gaussian condition implies that $\eta_{t,a}$ is zero-mean noise with variance at most $R_\alpha^2$, lying in an interval of length at most $2R_\alpha$.

The objective of the Semi-Parametric Contextual Bandit Problem is to find a policy that maximizes the cumulative expected reward. Since a direct analysis of the cumulative reward is not tractable, we analyze the cumulative regret. In the case of $K = 1$, where the user only chooses the best item, the cumulative regret is defined as
\[
R(T) = \mathbb{E}\Big[\sum_{t=1}^{T} \big(w^*_{t,a^*_t} - w^*_{t,a_t}\big)\Big], \qquad a^*_t = \arg\max_{a \in E} w^*_{t,a}.
\]
When $K > 1$, the problem becomes a combinatorial optimization problem which can be reduced to a set-cover problem in each time slot; it is therefore NP-hard. To obtain an approximate solution, we use the cascading model to reduce it to a ranking problem with the following oracle. In the cascading model, the user examines the items in the presented order and stops at the $O_t$-th item when he finds the first attractive one. The agent then receives a positive reward for the $O_t$-th arm and a negative reward for the other examined arms in $A_t$. It is required that the decision strategy, denoted $f(A, w)$, for recommended arms $A$ based on the given weights $w$, satisfies the monotonicity and $B$-Lipschitz continuity assumptions. We only assume that the agent has access to an $\alpha$-approximation oracle $O_S(w)$ that outputs a recommended action $A$: given input $w$, for some $\alpha \le 1$, the oracle returns an action $A = O_S(w) \in S$ satisfying $f(A, w) \ge \alpha f(A^*, w)$, where $A^* = \arg\max_{A \in S} f(A, w)$. The $\alpha$-regret of action $A$ at time $t$ is
\[
R_\alpha(t, A) = \alpha f^*_t - f(A, w^*_t),
\]
where $f^*_t = f(A^*_t, w^*_t)$, $A^*_t = \arg\max_{A \in S} f(A, w^*_t)$, and $w^*_t = \big(g(\theta^*, x_{t,a}) + b^*_a\big)_{a \in E}$. Thus, in the case of $K > 1$, the objective of an algorithm is to minimize the cumulative $\alpha$-regret $\mathbb{E}\big[\sum_{t=1}^{T} R_\alpha(t, A_t)\big]$.
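Before turning to the framework, the cascading feedback described above can be made concrete with a small simulator. This is only an illustrative sketch under the assumption of Bernoulli weights; the click probabilities and function names are ours, not the paper's.

```python
import numpy as np

def cascading_feedback(recommended, click_prob, rng):
    """Simulate cascading examination of an ordered list A_t = (a_1, ..., a_K).

    The user examines items in order and stops at the first attractive one.
    Returns O_t (1-indexed position of the last examined item) and the observed
    binary weights w_t(a) for the examined prefix of the list.
    """
    observed = {}
    for k, arm in enumerate(recommended, start=1):
        clicked = rng.random() < click_prob[arm]
        observed[arm] = int(clicked)
        if clicked:
            return k, observed               # positive feedback at position O_t = k
    return len(recommended), observed        # no click: all K items examined, O_t = K

# Example with hypothetical click probabilities for 5 arms and a list of K = 3.
rng = np.random.default_rng(1)
click_prob = {0: 0.10, 1: 0.40, 2: 0.05, 3: 0.20, 4: 0.30}
O_t, w_t = cascading_feedback([1, 4, 3], click_prob, rng)
```

Only the first $O_t$ positions of $A_t$ are observed, which is exactly the partial feedback that the Top-K analysis in Section 5.2 has to handle.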
4 Two-Step UCB Framework

We introduce a novel two-step Upper Confidence Bound framework, called Semi-Parametric UCB (SPUCB), to solve the Semi-Parametric Contextual Bandit Problem; our method is unbiased and efficient. From Eq. (1), $\theta^*$ and $b^*_a$ must be inferred from the rewards that the learning agent can observe within $T$ steps. The SPUCB framework combines the efficiency of parametric bandits with the unbiasedness of non-parametric bandits. The parametric part provides a prior on the expected reward using the latest contextual information, which rapidly shrinks the candidate arm set; the non-parametric part can then work in a relatively small candidate set to efficiently estimate the expected reward and correctly discover the optimal arm.

SPUCB is a UCB-like algorithm which operates in two stages. First, the learning agent computes upper confidence bounds (UCBs) $U_t$ on the expected weights of all base arms in $E$ and uses them to select an action $A_t = (a^t_1, \ldots, a^t_K)$. Second, after obtaining the rewards $w_t$, the agent uses the training samples to get a new estimate $\hat\theta_t$ of $\theta^*$ with a new confidence radius, and estimates $b^*_a$ based on $\hat\theta_t$. For convenience, we define $s(t, a)$ to be the number of times that candidate arm $a$ has been observed during the first $t - 1$ steps.

Lemma 4.1. Let $\alpha_t(a) = R_\alpha\sqrt{2\log(2/\delta)/s(t,a)}$. With probability at least $1 - \delta$, for each arm $a$ at time $t > 0$, we have
\[
\big|\hat w_s(a) - \bar g_s(\theta^*, x_a) - b^*_a\big| \le \alpha_t(a), \tag{3}
\]
where $\hat w_s(a)$ is the average of the $s$ observed weights of candidate arm $a$ and $\bar g_s(\theta, x_a) = \frac{1}{s}\sum_{\tau=1}^{s} g(\theta, x_{\tau,a})$ is the average of $g$ over the observed feature vectors of $a$.

Proof. Let $\bar w^*_{s,a} = \frac{1}{s}\sum_{r=1}^{s} w^*_{r,a}$. By Eq. (1), $b^*_a = \frac{1}{s}\sum_{r=1}^{s}\big(w^*_{r,a} - g(\theta^*, x_{r,a})\big) = \bar w^*_{s,a} - \bar g_s(\theta^*, x_a)$. Because $\eta$ is zero-mean noise lying in $[-R_\alpha, R_\alpha]$, by Hoeffding's inequality,
\[
P\big(|\hat w_s(a) - \bar g_s(\theta^*, x_a) - b^*_a| \le \alpha_t(a)\big) \ge 1 - 2\exp\!\left(-\frac{2 s^2 \alpha_t^2(a)}{\sum_{r=1}^{s}\big(R_\alpha - (-R_\alpha)\big)^2}\right) = 1 - \delta.
\]

Lemma 4.2. Suppose $\big|g(\hat\theta_t, x_{t,a}) - g(\theta^*, x_{t,a}) - \bar g_s(\hat\theta, x_a) + \bar g_s(\theta^*, x_a)\big| \le \gamma_t(a)$. Then, with probability at least $1 - \delta$, we have
\[
\big|\hat w_s(a) - \bar g_s(\hat\theta, x_a) + g(\hat\theta_t, x_{t,a}) - w^*_{t,a}\big| \le \alpha_t(a) + \gamma_t(a). \tag{5}
\]

Proof. Based on Lemma 4.1 and Eq. (1),
\[
\big|\hat w_s(a) - \bar g_s(\hat\theta, x_a) + g(\hat\theta_t, x_{t,a}) - w^*_{t,a}\big|
= \big|\hat w_s(a) - \bar g_s(\hat\theta, x_a) + g(\hat\theta_t, x_{t,a}) - g(\theta^*, x_{t,a}) - b^*_a\big|
\]
\[
= \big|\hat w_s(a) - \bar g_s(\theta^*, x_a) - b^*_a + g(\hat\theta_t, x_{t,a}) - g(\theta^*, x_{t,a}) - \bar g_s(\hat\theta, x_a) + \bar g_s(\theta^*, x_a)\big|
\le \alpha_t(a) + \gamma_t(a).
\]

Based on Lemma 4.1 and Lemma 4.2, the upper confidence bound on the expected weight of each candidate arm $a$ at time $t$ can be defined as
\[
U_t(a) = \hat w_s(a) - \bar g_s(\hat\theta, x_a) + g(\hat\theta_t, x_{t,a}) + \alpha_t(a) + \gamma_t(a). \tag{7}
\]
Thus, according to Eq. (7), SPUCB can be presented as Algorithm 1. As soon as we give a specific approach for inferring and updating $\hat\theta_t$ and $\gamma_t(a)$, SPUCB becomes a feasible solution for the Semi-Parametric Contextual Bandit Problem. In Section 5, we implement SPUCB with a linear parametric function and give a gap-free bound on the $T$-step regret.

Algorithm 1 SPUCB
  // Initialization
  Observe $x_0$, $w_0$; $s_0 \leftarrow 1$, $\hat\theta_0 \leftarrow 0$, $\alpha_0 \leftarrow R_\alpha\sqrt{2\log(2/\delta)}$, $\gamma_0 \leftarrow \alpha_0$, $\hat w_1 \leftarrow w_0$, $x_1 \leftarrow x_0$
  for $t = 1, \ldots, T$ do
    // Compute UCBs
    for $a = 1, \ldots, L$ do
      $U_t(a) \leftarrow \hat w_{s_{t-1}(a)}(a) - \bar g_{s_{t-1}(a)}(\hat\theta, x_a) + g(\hat\theta_{t-1}, x_{t,a}) + \alpha_t(a) + \gamma_t(a)$
    // Recommend and observe
    $A_t = (a^t_1, \ldots, a^t_K) \leftarrow O_S(U_t)$; play $A_t$ and observe $O_t$, $w_t(a)$
    // Update statistics
    Compute $\hat\theta_t$ using $x$ and $w$
    for $a = 1, \ldots, L$ do $s_t(a) \leftarrow s_{t-1}(a)$
    for $k = 1, \ldots, O_t$ do
      $a \leftarrow a^t_k$; $s_t(a) \leftarrow s_t(a) + 1$
      $\hat w_{s_t(a)}(a) \leftarrow \big(s_{t-1}(a)\,\hat w_{s_{t-1}(a)}(a) + w_t(a)\big)\,/\,s_t(a)$
      Compute $\bar g_{s_t(a)}$ using $\hat\theta$ and $x_a$
    for $a = 1, \ldots, L$ do
      $\alpha_t(a) \leftarrow R_\alpha\sqrt{2\log(2/\delta)/s_t(a)}$; compute $\gamma_t(a)$

5 Linear Parametric Implementation

We instantiate SPUCB to solve the Top-1 ($K = 1$) and Top-K ($K \ge 2$) semi-parametric contextual bandit problems with a linear parametric function and give gap-free bounds on the $T$-step regret. When the parametric part of Eq. (1) is a linear function, the expected reward at time $t$ is
\[
w^*_{t,a^t_k} = g(\theta^*, x_{t,a^t_k}) + b^*_{a^t_k} = \theta^{*\mathsf{T}} x_{t,a^t_k} + b^*_{a^t_k}.
\]
We estimate $\theta^*$ by a ridge regression on the samples $x$ and the obtained rewards $w$; the $\ell_2$-regularized least-squares estimate with regularization parameter $\lambda > 0$ is
\[
\hat\theta_t = \big(\tilde X_t^{\mathsf{T}} \tilde X_t + \lambda I\big)^{-1} \tilde X_t^{\mathsf{T}} \tilde Y_t,
\]
where $\tilde X_t \in \mathbb{R}^{(\sum_{\tau=1}^{t} O_\tau) \times d}$ is the matrix whose rows are the centered features $\tilde x^{\mathsf{T}}_{\tau,a} = x^{\mathsf{T}}_{\tau,a} - \frac{1}{s(\tau,a)}\sum_{r=1}^{s(\tau,a)} x^{\mathsf{T}}_{r,a}$, and $\tilde Y_t$ is the column vector whose elements are the centered rewards $\tilde w_\tau(a) = w_\tau(a) - \frac{1}{s(\tau,a)}\sum_{r=1}^{s(\tau,a)} w_r(a)$, for $a = a^\tau_k$, $k \in [O_\tau]$, $\tau \in [t]$. Let
\[
V_t = \lambda I + \sum_{\tau=1}^{t}\sum_{k=1}^{O_\tau} \tilde x_{\tau, a^\tau_k}\, \tilde x^{\mathsf{T}}_{\tau, a^\tau_k},
\]
which is a symmetric positive definite matrix. To analyze $\gamma_t(a)$, we construct an index set $\Psi_t \subseteq \{1, \ldots, t-1\}$ so that, for fixed $x_{\tau,a}$ with $\tau \in \Psi_t$, the $w_\tau(a)$ are independent random variables [Auer, 2002]. Then we have Lemma 5.1.

Lemma 5.1. With probability at least $1 - \delta/T$, we have
\[
\big|g(\hat\theta_t, x_{t,a}) - g(\theta^*, x_{t,a}) - \bar g_s(\hat\theta, x_a) + \bar g_s(\theta^*, x_a)\big| \le \gamma_t(a), \tag{8}
\]
where $\gamma_t(a) = (R_\gamma + 1)\,\|x_{t,a}\|_{V^{-1}_{t-1}}$ and $\|x\|_V = (x^{\mathsf{T}} V x)^{1/2}$ is the 2-norm of $x$ based on $V$.

Proof. The proof is similar to the proof of [Lemma 1, Chu et al.].
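As a reading aid, the sketch below shows one way the linear parametric part could be maintained: a ridge-regression estimate of $\theta^*$ built from the centered samples $(\tilde x, \tilde w)$ defined above, together with the confidence radius $\gamma_t(a)$ of Lemma 5.1. It is a minimal, unoptimized sketch (it does not use the $O(d^2)$ rank-one inverse update of Section 5.3), and the class and parameter names are assumptions of ours.

```python
import numpy as np

class LinearPart:
    """Ridge estimate of theta* and the radius gamma_t(a) = (R_gamma + 1) * ||x||_{V^{-1}}."""

    def __init__(self, d, lam=1.0, r_gamma=0.5):
        self.r_gamma = r_gamma
        self.V = lam * np.eye(d)          # V_t = lambda*I + sum of x~ x~^T over observed plays
        self.Xty = np.zeros(d)            # running sum of x~ * w~
        self.theta_hat = np.zeros(d)

    def update(self, x_tilde, w_tilde):
        """Add one centered sample (x~, w~) and refresh the l2-regularized least-squares estimate."""
        self.V += np.outer(x_tilde, x_tilde)
        self.Xty += w_tilde * x_tilde
        self.theta_hat = np.linalg.solve(self.V, self.Xty)   # (X^T X + lambda*I)^{-1} X^T Y

    def gamma(self, x):
        """Confidence radius gamma_t(a) for a feature vector x."""
        return (self.r_gamma + 1.0) * np.sqrt(x @ np.linalg.solve(self.V, x))
```

In use, each new observation of an arm would be centered against that arm's running feature and reward means before being passed to `update`, and `gamma` supplies the exploration term added to the UCB index of Eq. (7).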
Then we can define the upper confidence bound on the expected weight of each base arm $a$ as
\[
U_t(a) = \min\big\{\hat w_s(a) + \hat\theta^{\mathsf{T}}_{t-1} x_{t,a} + \alpha_t(a) + \gamma_t(a),\ 1\big\}. \tag{9}
\]

5.1 Regret Analysis in the Top-1 Scenario

We follow the same method as [Auer et al., 2002] for choosing $\Psi_t$ carefully to achieve the independence condition.

Lemma 5.2. Assuming $|\Psi_{T+1}| \ge 2$, we have
\[
\sum_{t \in \Psi_{T+1}} \|x_{t,a}\|_{V^{-1}_{t-1}} \le 5\sqrt{d\,|\Psi_{T+1}|\,\ln|\Psi_{T+1}|}. \tag{10}
\]
Proof. The proof is similar to the proof of [Lemma 3, Chu et al.].

Theorem 5.1. If SPUCB runs with $R_\gamma = O\big(\sqrt{\ln(LT/\delta)}\big)$, then with probability at least $1 - \delta$, the regret of the algorithm is
\[
O\Big(\sqrt{d\,T\,\ln^3\!\big(LT\ln(T)/\delta\big)} + \sqrt{L\,T\,\ln(1/\delta)}\Big). \tag{11}
\]
Proof. Because $\sum_{t \in \Psi_{T+1}} \alpha_t(a) \le O\big(\sqrt{L\,|\Psi_{T+1}|\,\ln(1/\delta)}\big)$, the proof is similar to the proof of [Theorem 1, Chu et al.].

5.2 Regret Analysis in the Top-K Scenario

We use a cascading model to reduce the problem to a ranking problem with an offline oracle [Li et al., 2016]. In cascading recommendation, when recommended an ordered list of items $A_t = (a^t_1, \ldots, a^t_K)$, the user examines the items in that order. The examination process stops when the user clicks one item, or after all items have been examined without any click. The weight of each base arm $a$ at time $t$, $w_t(a) \in \{0, 1\}$, indicates whether the user has clicked the item or not. Then the random variable $O_t$ satisfies
\[
O_t = \begin{cases} k, & \text{if } w_t(a^t_j) = 0 \text{ for all } j < k \text{ and } w_t(a^t_k) = 1, \\ K, & \text{if } w_t(a^t_j) = 0 \text{ for all } j \le K. \end{cases}
\]
If the user clicks the $k$-th item, the learning agent receives a reward of 1; otherwise, there is no reward. At the end of step $t$, the learning agent observes $w_t(a^t_k)$ for $k \in [O_t]$ and receives the reward $r_t = \max_{1 \le k \le K} w_t(a^t_k)$. Notice that the order of $A_t$ affects both the feedback and the reward.

Let us define a function $f : S \times [0,1]^{|E|} \to [0,1]$ on $A = (a_1, \ldots, a_K) \in S$ and $w = (w(1), \ldots, w(L))$ by
\[
f(A, w) = \sum_{k=1}^{K} \prod_{i=1}^{k-1} \big(1 - w(a_i)\big)\, w(a_k). \tag{12}
\]
It is easy to verify that $\mathbb{E}[r_t] = f(A_t, w^*_t)$, i.e., $f$ evaluated at the expected weights is the expected reward function. Moreover, $f$ satisfies the properties of monotonicity and $B$-Lipschitz continuity by [Lemma B, Li et al.] with $\gamma_k = 1$, $k \in [K]$. Let $A^* = \arg\max_{A \in S} f(A, w^*)$ be the optimal solution; then $\Delta_A = f(A^*, w^*) - f(A, w^*)$ is the suboptimality gap of solution $A$, and the probability that all items in $A$ are observed is $p_{t,A} = \prod_{k=1}^{K-1}\big(1 - w^*(a_k)\big)$. For the regret analysis, we define the shorthand $p^* = \min_{1 \le t \le T}\,\min_{A \in S} p_{t,A}$.

Lemma 5.3. For any time $t$ and $A_t = (a^t_1, \ldots, a^t_K)$, if $f$ satisfies the assumptions of monotonicity and $B$-Lipschitz continuity, we have
\[
R_\alpha(t, A_t) \le 2B \sum_{k=1}^{K} \big(\alpha_t(a^t_k) + \gamma_t(a^t_k)\big).
\]
Proof. Because $f(A_t, U_t) \ge \alpha f(\tilde A_t, U_t) \ge \alpha f(A^*_t, U_t) \ge \alpha f(A^*_t, w^*_t)$, where $\tilde A_t = \arg\max_{A \in S} f(A, U_t)$, we have
\[
R_\alpha(t, A_t) = \alpha f(A^*_t, w^*_t) - f(A_t, w^*_t) \le f(A_t, U_t) - f(A_t, w^*_t).
\]
By the Lipschitz continuity of $f$ and Lemma 4.2,
\[
R_\alpha(t, A_t) \le B \sum_{k=1}^{K} \big|U_t(a^t_k) - w^*_{t,a^t_k}\big| \le 2B \sum_{k=1}^{K} \big(\alpha_t(a^t_k) + \gamma_t(a^t_k)\big).
\]

Notice that the upper bound on $R_\alpha(t, A_t)$ involves all base arms of $A_t$. However, because $V^{-1}_{t-1}$ only contains information about observed base arms, it is hard to estimate an upper bound for $\sum_{t=1}^{T} R_\alpha(t, A_t)$ directly.

Lemma 5.4. Suppose Ineq. (3) holds for all $1 \le t \le T$. Then
\[
\mathbb{E}\Big[\sum_{t=1}^{T} R_\alpha(t, A_t)\Big]
= \mathbb{E}\Big[\sum_{t=1}^{T} R_\alpha(t, A_t)\, \mathbb{E}\Big[\tfrac{1}{p_{t,A_t}}\,\mathbb{1}\{O_t = K\} \,\Big|\, A_t\Big]\Big]
\le \frac{1}{p^*}\, \mathbb{E}\Big[\sum_{t=1}^{T} R_\alpha(t, A_t)\,\mathbb{1}\{O_t = K\}\Big]
\le \frac{2B}{p^*}\, \mathbb{E}\Big[\sum_{t=1}^{T} \mathbb{1}\{O_t = K\} \sum_{k=1}^{K} \big(\alpha_t(a^t_k) + \gamma_t(a^t_k)\big)\Big].
\]

Lemma 5.5. Assuming $|\Psi_{T+1}| \ge 2$, we have
\[
\sum_{t \in \Psi_{T+1}} \sum_{k=1}^{K} \alpha_t(a^t_k) \le O\big(\sqrt{KL\,|\Psi_{T+1}|\,\ln(1/\delta)}\big).
\]
Proof. Since $\alpha_t(a) = R_\alpha\sqrt{2\log(2/\delta)/s(t,a)}$,
\[
\sum_{t \in \Psi_{T+1}} \sum_{k=1}^{K} \alpha_t(a^t_k)
= R_\alpha\sqrt{2\log(2/\delta)} \sum_{t \in \Psi_{T+1}} \sum_{k=1}^{K} s(t, a^t_k)^{-1/2}
\le O\big(\sqrt{KL\,|\Psi_{T+1}|\,\ln(1/\delta)}\big).
\]

Theorem 5.2. If SPUCB runs with $R_\gamma = O\big(\sqrt{\ln(LT/\delta)}\big)$, then with probability at least $1 - \delta$, the regret of the algorithm is
\[
O\Big(\sqrt{d\,K\,T\,\ln^3\!\big(LT\ln(T)/\delta\big)} + \sqrt{K\,L\,T\,\ln(1/\delta)}\Big).
\]
Proof.
Based on Lemma 5.3 and Lemma 5.5, the proof is similar to the proof of [Theorem 1, Chu et al.].

Although the $O(\sqrt{T})$ upper bound of SPUCB for the Top-K problem is the same as that of C3-UCB, our method can accurately find the optimal item set and outperforms C3-UCB, because the non-parametric part of the Semi-Parametric Contextual Bandit Problem causes a large regret for C3-UCB; this is also shown in the experiments in Section 6.

5.3 Online Optimization

To make our method more practical in a real production system, an optimization technique for the linear parametric term is proposed to deal with high-dimensional features. SPUCB needs to calculate $\|x_{t,a}\|_{V^{-1}_{t-1}}$ at every step for all arms, which leads to an $O(d^3)$ time complexity, where $d$ is the dimension of the feature vector. To overcome the bottleneck of matrix inversion, we use an iterative formula that reduces the complexity to $O(d^2)$ without any loss of accuracy. Because $V_{t+1} = V_t + x_{t,a^t_k} x^{\mathsf{T}}_{t,a^t_k}$, we have
\[
V^{-1}_{t+1} = V^{-1}_t - \frac{V^{-1}_t\, x_{t,a^t_k}\, x^{\mathsf{T}}_{t,a^t_k}\, V^{-1}_t}{x^{\mathsf{T}}_{t,a^t_k}\, V^{-1}_t\, x_{t,a^t_k} + 1}. \tag{14}
\]

6 Experiments

SPUCB's performance and efficiency are evaluated by extensive experiments on both synthetic and real-world datasets, compared with the following state-of-the-art algorithms:

- UCB [Auer et al., 2002]: a representative classic MAB, which can be seen as a pure non-parametric model.
- LINUCB [Chu et al., 2011]: a representative contextual bandit, which can be seen as a pure parametric model.
- C3-UCB [Li et al., 2016]: a representative contextual bandit algorithm with a cascading model, which can be seen as a pure parametric model for the Top-K problem.
- LSPS [Ou et al., 2019]: a representative semi-parametric bandit algorithm using Thompson sampling.

All hyperparameters of the above algorithms are tuned by cross-validation for the best performance (specifically, UCB: $c = 0.2$; LINUCB: $\alpha = 3.5$; LSPS: $\sigma_1 = 0.3$, $\sigma_2 = 0.01$, $\sigma_3 = 0.3$; SPUCB: $R_\alpha = 0.2$, $\delta = 0.9$, $\lambda = 1.0$, $R_\gamma = 0.5$).

[Figure 1: Cumulative regret comparison on the synthetic dataset for K = 1, L = 100, d = 10; panels (a) a = 0.6 and (b) a = 0.4.]

[Figure 2: (a) Regret comparison for K = 5, a = 0.5 and (b) efficiency improvement versus feature dimension, on the synthetic dataset.]

6.1 Experiments on Synthetic Data

The synthetic dataset is randomly generated following our assumptions. Firstly, we sample the linear parameters from a standard Gaussian distribution and normalize them so that their norm equals a predefined constant $a \in [0, 1]$. Secondly, the bias of each arm is uniformly sampled from $[0, 1 - a]$, so $a$ controls the weight of the non-parametric part. Thirdly, at each time step, the contextual features of each arm are sampled from a standard Gaussian distribution and normalized to $[0, 1]$. In the Top-1 ($K = 1$) setting, the stochastic reward of an arm is sampled from a Gaussian distribution whose expectation is the expected reward of the arm computed by Eq. (1). In the Top-K ($K \ge 2$) scenario, we also need to simulate the cascading model: the stochastic reward of an arm is sampled from a binomial distribution whose success probability is its expected reward (by Eq. (1)). Specifically, we run the algorithms for at most 10,000 time steps and set $L = 100$, $d = 10$, $a = 0.4$.
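A minimal sketch of this synthetic Top-1 environment is given below; it follows the generation procedure just described, but the noise level and the exact way features are normalized to $[0, 1]$ are our assumptions.

```python
import numpy as np

def make_synthetic_env(L=100, d=10, a=0.4, noise_std=0.1, seed=0):
    """Synthetic Top-1 environment: ||theta*||_2 = a, biases b*_a ~ Uniform[0, 1 - a]."""
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=d)
    theta_star *= a / np.linalg.norm(theta_star)         # norm of the linear part equals a
    b_star = rng.uniform(0.0, 1.0 - a, size=L)           # non-parametric per-arm biases

    def step():
        x_t = rng.normal(size=(L, d))
        x_t -= x_t.min(axis=1, keepdims=True)             # rescale features into [0, 1] per arm
        x_t /= np.maximum(x_t.max(axis=1, keepdims=True), 1e-12)
        w_star = np.clip(x_t @ theta_star + b_star, 0.0, 1.0)     # expected rewards, Eq. (1)
        rewards = w_star + rng.normal(scale=noise_std, size=L)    # stochastic Gaussian rewards
        return x_t, w_star, rewards

    return step

step = make_synthetic_env()
x_t, w_star_t, r_t = step()    # one round: features, expected rewards, realized rewards
```

For the Top-K runs, the Gaussian reward draw would be replaced by a Bernoulli draw with success probability $w^*_{t,a}$ and fed through the cascading feedback sketched in Section 3.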
All experiments are repeated 50 times and average results are reported. The performance comparison on average cumulative regret for $K = 1$ is shown in Figure 1. Firstly, from the curves we find that all algorithms except SPUCB fail to converge within 10,000 time steps; the performance of SPUCB is significantly superior to the other algorithms. Secondly, although UCB's convergence speed becomes faster if we enlarge the weight of the non-parametric part, its cumulative regret is still worse than that of SPUCB, as shown in Figure 1(b); conversely, LINUCB does better if we set a small weight for the non-parametric part, as shown in Figure 1(a). Thirdly, even though LSPS is better than UCB and LINUCB, its assumption that arm features are constant over time leads to an increasing gap in cumulative regret compared with SPUCB. Finally, SPUCB also converges quickly and remains superior to the others in the case of $K = 5$, as shown in Figure 2(a). Although the regret upper bound of SPUCB is the same as that of C3-UCB, SPUCB outperforms C3-UCB because of the non-parametric part of the Semi-Parametric Contextual Bandit Problem. Moreover, the speed-up achieved by the online optimization of Section 5.3 is shown in Figure 2(b): the optimization improves efficiency by nearly 34 times when the feature dimension $d$ is 5000, and the total running time for online parameter updates is less than 20 ms on a single machine with a 4-core CPU. These results illustrate that the proposed optimization technique makes our algorithm more practical.

6.2 Experiments on Real-World Data

The real-world dataset is collected from one of the largest e-commerce platforms in China for the problem of product recommendation. It contains 100 items with 42-dimensional features from a special holiday; the arm features include several statistical features that evolve greatly over time. To make the experiments reproducible, we train a DNN model on historical user behaviour on the items to simulate the reward function, and use it as the environment to evaluate the online algorithms. At each time step, the stochastic reward of an arm is sampled from a binomial distribution whose expectation is the expected reward of the arm.

[Figure 3: Regret comparison on the real-world dataset.]

Figure 3 shows the comparison of cumulative regrets. Since a linear reward function is mis-specified here, LINUCB suffers large regrets. Besides, LSPS is also worse than SPUCB because it cannot deal properly with evolving features.

7 Conclusions

We study a novel Semi-Parametric Contextual Bandit Problem and propose a UCB-like framework, SPUCB, to deal with incomplete and evolving features. It can be flexibly implemented to solve various problems with parametric functions, and it has a gap-free regret bound. Moreover, an optimization technique is proposed to handle high-dimensional features with a linear parametric function. Our method has also been deployed as a service to support online businesses in Alibaba.

References

[Agrawal and Goyal, 2013] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.

[Agrawal et al., 1988] R. Agrawal, M. V. Hedge, and D. Teneketzis. Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control, 33(10):899–906, Oct 1988.
[Auer et al., 2002] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[Auer, 2002] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

[Chen et al., 2013] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework, results and applications. In Proceedings of the 30th International Conference on Machine Learning, ICML '13, pages I-151–I-159. JMLR.org, 2013.

[Chu et al., 2011] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.

[Gai et al., 2012] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5):1466–1478, October 2012.

[Ghosh et al., 2017] Avishek Ghosh, Sayak Ray Chowdhury, and Aditya Gopalan. Misspecified linear bandits. In AAAI, pages 3761–3767, 2017.

[Gopalan et al., 2014] Aditya Gopalan, Shie Mannor, and Yishay Mansour. Thompson sampling for complex online problems. In International Conference on Machine Learning, pages 100–108, 2014.

[Greenewald et al., 2017] Kristjan Greenewald, Ambuj Tewari, Susan Murphy, and Predrag Klasnja. Action centered contextual bandits. In Advances in Neural Information Processing Systems, pages 5977–5985, 2017.

[Hoffman et al., 2011] Matthew D. Hoffman, Eric Brochu, and Nando de Freitas. Portfolio allocation for Bayesian optimization. In UAI, pages 327–336. Citeseer, 2011.

[Krishnamurthy et al., 2018] Akshay Krishnamurthy, Zhiwei Steven Wu, and Vasilis Syrgkanis. Semiparametric contextual bandits. arXiv preprint arXiv:1803.04204, 2018.

[Kveton et al., 2015a] Branislav Kveton, Csaba Szepesvári, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, ICML '15, pages 767–776. JMLR.org, 2015.

[Kveton et al., 2015b] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Combinatorial cascading bandits. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS '15, pages 1450–1458, Cambridge, MA, USA, 2015. MIT Press.

[Li et al., 2010] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 661–670, New York, NY, USA, 2010. ACM.

[Li et al., 2016] Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. Contextual combinatorial cascading bandits. In Proceedings of the 33rd International Conference on Machine Learning, ICML '16, pages 1245–1253. JMLR.org, 2016.

[Ou et al., 2019] Mingdong Ou, Nan Li, Cheng Yang, Shenghuo Zhu, and Rong Jin. Semi-parametric sampling for stochastic bandits with many arms. In Proceedings of the AAAI Conference on Artificial Intelligence, Hawaii, USA, January 2019.

[Press, 2009] William H. Press. Bandit solutions provide unified ethical models for randomized clinical trials and comparative effectiveness research. Proceedings of the National Academy of Sciences, 2009.
[Qin et al., 2014] Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In Mohammed Javeed Zaki, Zoran Obradovic, Pang-Ning Tan, Arindam Banerjee, Chandrika Kamath, and Srinivasan Parthasarathy, editors, Proceedings of the 2014 SIAM International Conference on Data Mining, pages 461–469. SIAM, 2014.

[Robbins, 1952] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

[Schwartz et al., 2017] Eric M. Schwartz, Eric T. Bradlow, and Peter S. Fader. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500–522, 2017.

[Thompson, 1933] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[Xie et al., 2015] Miao Xie, Qiusong Yang, Qing Wang, Gao Cong, and Gerard de Melo. DynaDiffuse: A dynamic diffusion model for continuous time constrained influence maximization. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.