# Transferable Contextual Bandit for Cross-Domain Recommendation

Bo Liu, Ying Wei, Yu Zhang, Zhixian Yan, Qiang Yang
The Hong Kong University of Science and Technology, Hong Kong; Cheetah Mobile, USA
bliuab@cse.ust.hk, yweiad@cse.ust.hk, yuzhangcse@ust.hk, zhixian.yan@cmcm.com, qyang@cse.ust.hk

**Abstract.** Traditional recommendation systems (RecSys) suffer from two problems: the exploitation-exploration dilemma and the cold-start problem. One solution to the exploitation-exploration dilemma is the contextual bandit policy, which adaptively exploits and explores user interests and, as a result, achieves higher rewards in the long run. The contextual bandit policy, however, may cause the system to explore more than needed in cold-start situations, which leads to worse short-term rewards. Cross-domain RecSys methods adopt transfer learning to leverage prior knowledge in a source RecSys domain to jump-start the cold-start target RecSys. To solve the two problems together, in this paper we propose the first applicable transferable contextual bandit (TCB) policy for cross-domain recommendation. TCB not only benefits the exploitation but also accelerates the exploration in the target RecSys. TCB's exploration, in turn, helps to learn how to transfer between different domains. TCB is a general algorithm for both homogeneous and heterogeneous domains. We perform both theoretical regret analysis and empirical experiments. The empirical results show that TCB outperforms the state-of-the-art algorithms over time.

## Introduction

A personalized recommendation system (RecSys) is a vital component of online interactive services. According to user interests, an online service provider recommends items such as movies, apps, and articles in order to obtain more user feedback such as watches, installations, and clicks. Accurately inferring user interests and providing the corresponding recommendations not only improves user satisfaction but also increases commercial revenue significantly. The exploitation-exploration dilemma and the cold-start problem are two major obstacles to the successful deployment of a RecSys (Schein et al. 2002; Li et al. 2010).

The primary objective of a RecSys is to maximize the cumulative amount of feedback, which depends on user interests. The feedback, however, is highly uncertain and noisy. A successful RecSys is therefore supposed to balance between exploiting user interests predicted from historical feedback and exploring uncertain user interests by recommending diverse items. Compromising between these two conflicting objectives is well known as the exploitation-exploration dilemma in RecSys.

Recently, the contextual bandit has been introduced to formulate the exploitation-exploration dilemma and has achieved great success (Li et al. 2010). In a contextual bandit problem, the agent observes $K$ arms $a_1, \ldots, a_K$ (e.g., $K$ candidate articles) and their contexts $\mathbf{x}_{a_1}, \ldots, \mathbf{x}_{a_K}$ (e.g., user profile, article content, etc.). In step $\tau$, the agent decides to pull one arm $a_\tau$ among all arms (e.g., recommending one article) and observes the corresponding reward $r_{a_\tau}$ (e.g., click or not). The reward depends on the context but is noisy, i.e., $r_{a_\tau} = f(\mathbf{x}_{a_\tau}) + \epsilon$.
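For readers who want a concrete picture of this interaction protocol, the short sketch below (our illustration, not code from the paper) simulates a generic linear contextual bandit with a UCB-style arm selection; the environment, dimensions, and exploration constant are arbitrary assumptions.

```python
# Minimal sketch (not from the paper): a linear contextual bandit loop in the
# style described above. All names and dimensions are illustrative assumptions.
import numpy as np

d, K, n_steps = 10, 20, 1000
rng = np.random.default_rng(0)
theta_star = rng.normal(size=d) / np.sqrt(d)      # unknown reward parameter
A = np.eye(d)                                     # ridge-regularized Gram matrix
b = np.zeros(d)                                   # accumulated reward-weighted contexts
alpha = 0.2                                       # exploration strength

for tau in range(n_steps):
    X = rng.normal(size=(K, d)) / np.sqrt(d)      # contexts of the K candidate arms
    theta_hat = np.linalg.solve(A, b)             # ridge estimate of the reward parameter
    A_inv = np.linalg.inv(A)
    # UCB score: predicted reward + exploration bonus (width of the confidence set)
    ucb = X @ theta_hat + alpha * np.sqrt(np.einsum('ki,ij,kj->k', X, A_inv, X))
    a = int(np.argmax(ucb))                       # pull the arm with the largest UCB
    r = X[a] @ theta_star + 0.1 * rng.normal()    # noisy reward r = f(x) + eps
    A += np.outer(X[a], X[a])                     # update sufficient statistics
    b += r * X[a]
```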
According to the $N$ historical observations $\{(\mathbf{x}_{a_\tau}, r_{a_\tau})\}_{\tau=1,\ldots,N}$, a contextual bandit policy adaptively decides which arm to pull in each step and learns the uncertain reward function. The process is illustrated on the left of Fig. 1. To maximize the cumulative reward, the agent should pull the arms with larger predicted rewards. Because only the stochastic rewards of the pulled arms are observed, the agent is also expected to pull the arms that accelerate the exploration of the uncertain reward function (Zhou 2015). Therefore, the contextual bandit policy can balance the exploitation-exploration trade-off well and maximize the cumulative reward in the long run (Chu et al. 2011).

Figure 1: The process of cross-domain RecSys using TCB. In terms of contextual bandit, TCB sequentially and adaptively recommends articles based on noisy feedback. In terms of transfer learning, TCB adopts the translation to leverage both source and target observations. TCB suffers from noisy observations (in blue) and explores uncertain parameters (in green) accordingly.

Unfortunately, contextual bandit policies suffer from the cold-start problem, which refers to the situation where few observations exist for new users, new items, or newly launched domains. In cold-start conditions, the agent tends to focus on exploration and sacrifices short-term reward for long-term reward. The $\epsilon$-first policy (Tran-Thanh et al. 2010), for instance, purely explores in cold-start conditions via random recommendations, which goes overboard with the exploitation-exploration balancing and results in even worse performance.

Cross-domain recommendation has been widely acknowledged as a killer method for the cold-start problem (Pan et al. 2010). A cross-domain RecSys leverages the prior knowledge learned from an auxiliary domain with abundant observations to improve the performance in the target domain of interest. The prior knowledge to be transferred could be either user interests or item properties. As illustrated in Fig. 1, a company is about to launch a new article RecSys, i.e., the target domain. Meanwhile, the company has accumulated sufficient app RecSys data. Assuming that a user's interests in apps also apply to articles, a cross-domain RecSys transfers user interests from the app RecSys, as the source domain, to provide better article recommendations.

In this paper, we are motivated to address the exploitation-exploration dilemma and the cold-start problem together via a new Transferable Contextual Bandit (TCB) algorithm. TCB harnesses the collective and mutually reinforcing power of the contextual bandit and transfer learning. First, transfer learning improves the exploitation of a contextual bandit policy and accelerates its exploration in the target domain. Assuming that the user interests in apps and articles are closely related, TCB can infer user interests from both the source and target observations. Thus, TCB estimates user interests for exploitation far better than single-domain algorithms, and significantly reduces the uncertainty of the estimated reward function for exploration. Second, the contextual bandit speeds up the knowledge transfer. As shown in Fig. 1, TCB transfers knowledge via the translation $\mathbf{U}^*$.
TCB explores not only the reward function but also how to transfer, i.e., the translation (Pan and Yang 2010). Thus TCB can recommend those articles that help learn how to transfer the fastest. As TCB progressively and adaptively recommends articles in the target domain, it relies on the currently available source and target observations as well as the correspondence data between them. The correspondence data indicate the similarity between a source and a target observation; for example, the observations produced by the same user across domains enjoy a large similarity. Notably, TCB is designed to handle knowledge transfer between both heterogeneous domains, where the source and target contexts lie in different feature spaces, and homogeneous domains, where they lie in the same feature space.

The primary contributions of this paper are threefold: 1) to the best of our knowledge, TCB is the first applicable transferable contextual bandit policy for cross-domain RecSys, and it is empirically verified on real-world RecSys data; 2) the proposed transfer bandit policy is general for both homogeneous and heterogeneous domains; 3) a theoretical regret analysis is provided to guarantee the proposed algorithm.

## Related Work

In this section, we discuss the existing contextual bandit approaches and transfer learning algorithms related to TCB.

LinUCB (Li et al. 2010) first formulated the RecSys as a contextual bandit problem. LinUCB not only improves the reward in real-world online RecSys (Li et al. 2011) but also enjoys a sub-linear regret guarantee (Chu et al. 2011). LinUCB assumes that the expected reward is linear with respect to the context. Under the same assumption, Agrawal and Goyal proposed a Thompson-sampling-style method (Agrawal and Goyal 2013). Both linear policies, however, are sensitive to the dimension and quality of the context. For high-dimensional contexts, Yue, Hong, and Guestrin proposed CoFineUCB to accelerate the exploration in LinUCB by introducing two levels of feature spaces, i.e., the fine and coarse feature spaces (Yue, Hong, and Guestrin 2012). To tackle not fully observable contexts, hLinUCB (Wang, Wu, and Wang 2016), with a significantly better regret guarantee, explicitly explores the hidden context and the reward function simultaneously. Collaborative filtering, one of the most important RecSys techniques, has also been combined with contextual bandit models. COFIBA (Li, Karatzoglou, and Gentile 2016), for example, explores a particular user's interests more efficiently by leveraging the observations from all users within the same cluster. FactorUCB (Cesa-Bianchi, Gentile, and Zappella 2013; Wang, Wu, and Wang 2017) models the reward function as the dot product between user latent factors and item latent factors. Given the known user relationships, FactorUCB explores both latent factors in a collaborative filtering manner. All the aforementioned contextual bandit policies, unfortunately, focus on the RecSys within a single domain and thereby tend to fail in domains with insufficient observations.

Another line of related work is transfer learning (Pan and Yang 2010), which aims to improve the learning performance in a target domain by transferring knowledge from a source domain with sufficient observations. Heterogeneous transfer learning (HTL) further transfers between domains in different feature spaces. To bridge incommensurable domains, existing works fall into two categories:
TLRisk (Dai et al. 2008) learns a translator from co-occurrence data to translate features between domains, while HTLIC (Zhu et al. 2011), HeMap (Shi et al. 2013), and CoHTL (Wei et al. 2016) all learn a common latent space in which the two domains are comparable. Additionally, cross-domain algorithms tailor transfer learning to RecSys. CST (Pan et al. 2010) and TCF (Pan et al. 2011), for instance, learn user latent factors and item latent factors using matrix tri-factorization, and transfer knowledge via shared user interests, i.e., shared user latent factors. All the transfer learning algorithms mentioned above, however, work in the supervised learning setting. Without exploration, these HTL algorithms suffer from a linear regret when facing a contextual bandit problem. Although tUCB (Azar, Lazaric, and Brunskill 2013) and B-kl-UCB (Zhang and Bareinboim 2017) consider transfer learning for the bandit problem by transferring the estimated upper confidence bound, they differ from ours greatly. tUCB considers context-free multi-armed bandit problems, which are inapplicable to real-world RecSys. Moreover, both methods require the action sets to be exactly the same across domains. In comparison, TCB works even if the two domains are heterogeneous, which is common in cross-domain RecSys.

## Transferable Contextual Bandit

In this section, we first define the notations used throughout this paper. Then, from the view of transfer learning, we introduce how TCB exploits and explores, respectively. Finally, we theoretically analyze TCB's regret.

In this paper, a bold uppercase symbol ($\mathbf{A}$), a bold lowercase symbol ($\mathbf{a}$), and a regular symbol ($x$) denote a matrix, a column vector, and a scalar, respectively. An uppercase calligraphic symbol, e.g., $\mathcal{A}$, stands for a set. We use $\mathbf{I}_d$ to denote the identity matrix with $d$ rows and columns. Similarly, $\mathbf{0}_{d_1,d_2}$ denotes the zero matrix with $d_1$ rows and $d_2$ columns. $\|\mathbf{A}\|_F$ and $\|\mathbf{a}\|_2$ represent the Frobenius norm of a matrix and the $\ell_2$ norm of a vector, respectively. For a positive definite square matrix $\mathbf{A}$, $\|\mathbf{a}\|_{\mathbf{A}}$ represents the weighted $\ell_2$ norm, i.e., $\|\mathbf{a}\|_{\mathbf{A}} = \sqrt{\mathbf{a}^T \mathbf{A} \mathbf{a}}$. Furthermore, $\otimes$ denotes the Kronecker product, and $\mathrm{vec}(\cdot)$ is the vectorization operator.

### Exploitation in TCB

TCB assumes that $N^s$ source observations are available, i.e., $\mathcal{O}^s_{N^s} = \{\mathbf{X}^s_{N^s}, \mathbf{r}^s_{N^s}\} = \{(\mathbf{x}^s_{a_\tau}, r^s_{a_\tau})\}_{\tau=1,\ldots,N^s}$, where $\mathbf{X}^s_{N^s} \in \mathbb{R}^{N^s \times d^s}$. The target observations up to step $N^t$ are defined similarly as $\mathcal{O}^t_{N^t} = \{\mathbf{X}^t_{N^t}, \mathbf{r}^t_{N^t}\} = \{(\mathbf{x}^t_{a_\tau}, r^t_{a_\tau})\}_{\tau=1,\ldots,N^t}$, where $\mathbf{X}^t_{N^t} \in \mathbb{R}^{N^t \times d^t}$. The objective of TCB is to progressively and adaptively select the actions $\{a_1, \ldots, a_{N^t}\}$ in the target domain that maximize the cumulative reward $\sum_{\tau=1}^{N^t} r^t_{a_\tau}$. TCB learns the selection strategy from both the source observations $\mathcal{O}^s_{N^s}$ and the target observations $\mathcal{O}^t_{N^t}$. Without loss of generality, TCB focuses on knowledge transfer between heterogeneous domains, which means that the source and target domains have different actions or contexts lying in different feature spaces.

One of the essential steps in designing a transfer learning algorithm is deciding how to transfer (Pan and Yang 2010). For heterogeneous transfer learning, how to transfer mainly concerns aligning the incommensurable feature spaces so that transferable knowledge can be discovered. TCB adopts a translation matrix to align the different feature spaces. Compared with transfer learning methods based on a common latent space such as CoHTL, the translation-based TCB suffers from fewer sources of noise and explores fewer uncertain parameters.
The optimal translation matrix $\mathbf{U}^* \in \mathbb{R}^{d^t \times d^s}$ is expected to perfectly translate a target context $\mathbf{x}^t_a$ into the source feature space, i.e., $\mathbf{x}^s_a = (\mathbf{U}^*)^T \mathbf{x}^t_a$. Under the assumption that the reward function is linear with regard to the context, the rewards in both domains are determined by

$$
r^t_a = (\mathbf{x}^t_a)^T \mathbf{U}^* \theta^{s*} + \epsilon_\tau, \qquad r^s_a = (\mathbf{x}^s_a)^T \theta^{s*} + \epsilon_\tau, \tag{1}
$$

where $\epsilon_\tau$ is $1/\sqrt{2}$-sub-Gaussian noise and $\theta^{s*}$ is the reward parameter of the source domain. Supposing that we are given the optimal translation matrix $\mathbf{U}^*$, we estimate the reward parameter $\theta^s$ from all source observations and the target observations at step $\tau$ by solving the following optimization problem, similar to LinUCB:

$$
\hat{\theta}^s_\tau = \arg\min_{\theta^s} \|\mathbf{r}^s_{N^s} - \mathbf{X}^s_{N^s}\theta^s\|_2^2 + \|\mathbf{r}^t_\tau - \mathbf{X}^t_\tau \mathbf{U}^* \theta^s\|_2^2 + \|\theta^s\|_2^2. \tag{2}
$$

When $\tau$ is extremely small, all the source observations serve as a warm start and definitively benefit the exploitation and exploration of LinUCB. Unfortunately, the optimal translation matrix $\mathbf{U}^*$ is unknown in a real-world RecSys, so $\mathbf{U}$ also has to be learned from all the observations. Directly optimizing Eq. 2 w.r.t. $\mathbf{U}$ without any constraints, however, is equivalent to ridge regression purely on the target observations, which yields a trivial translation matrix that performs poorly on knowledge transfer.

Inspired by (Dai et al. 2008; Wei et al. 2016), we learn a more effective $\mathbf{U}$ by also leveraging the correspondence data between source and target observations. We denote the correspondence data by $\mathbf{S} \in \mathbb{R}^{N^t \times N^s}$, with $S_{i,j}$ indicating the relatedness between the $i$th target observation and the $j$th source observation, e.g., whether they are produced by the same user. We believe that such correspondence data are easily accessible nowadays. $S_{i,j} > 0$ if the $i$th target observation and the $j$th source observation are related, and $S_{i,j} = 0$ otherwise. The optimal $\mathbf{U}^*$ is also expected to align each pair of observations across domains with $S_{i,j} > 0$:

$$
\mathbf{x}^s_j = (\mathbf{U}^*)^T \mathbf{x}^t_i + \eta, \quad \text{if } S_{i,j} > 0, \; \forall i, j, \tag{3}
$$

where $\eta$ denotes the translation noise, with each element being $c_\eta/\sqrt{2}$-sub-Gaussian. Consequently, based on Eq. 1 and Eq. 3, the overall loss function to simultaneously estimate the reward parameter $\theta^s$ and the translation matrix $\mathbf{U}$ at each step $\tau$ is

$$
\hat{\mathbf{U}}_\tau, \hat{\theta}^s_\tau = \arg\min_{\mathbf{U},\theta^s} \beta\|\mathbf{r}^s_{N^s} - \mathbf{X}^s_{N^s}\theta^s\|_2^2 + \|\mathbf{r}^t_\tau - \mathbf{X}^t_\tau \mathbf{U} \theta^s\|_2^2 + \gamma \sum_{i=1}^{\tau}\sum_{j=1}^{N^s} S_{i,j}\|\mathbf{U}^T \mathbf{x}^t_i - \mathbf{x}^s_j\|_2^2 + \|\theta^s\|_2^2 + \|\mathbf{U}\|_F^2, \tag{4}
$$

where $\beta$ and $\gamma$ control the importance of the source domain and the translation, respectively. The first term incorporates the source observations to learn the reward parameter $\theta^s$, and the third term learns the translation $\mathbf{U}$ by encouraging the correspondence between observations across domains to be preserved. The second term emphasizes the target observations and learns both parameters. The last two terms are regularizers to avoid overfitting. For simplicity, the third term can also be written as

$$
\sum_{i=1}^{\tau}\sum_{j=1}^{N^s} S_{i,j}\|\mathbf{U}^T \mathbf{x}^t_i - \mathbf{x}^s_j\|_2^2 = \mathrm{Tr}\big(\mathbf{U}^T (\mathbf{X}^t_\tau)^T \mathbf{S}_1 \mathbf{X}^t_\tau \mathbf{U}\big) + \mathrm{Tr}\big((\mathbf{X}^s_{N^s})^T \mathbf{S}_2 \mathbf{X}^s_{N^s}\big) - 2\,\mathrm{Tr}\big(\mathbf{U}^T (\mathbf{X}^t_\tau)^T \mathbf{S} \mathbf{X}^s_{N^s}\big),
$$

where $\mathbf{S}_1$ and $\mathbf{S}_2$ are diagonal matrices with $\mathbf{S}_1(i,i) = \sum_{j=1}^{N^s} S_{i,j}$ and $\mathbf{S}_2(j,j) = \sum_{i=1}^{\tau} S_{i,j}$.

Obviously, the optimization problem in Eq. 4 is not jointly convex w.r.t. $\mathbf{U}$ and $\theta^s$. However, it is convex w.r.t. one parameter when the other is fixed. Thus, we optimize it in an alternating manner: iteratively fixing one parameter and solving the other in closed form. The closed-form solutions for the two parameters are

$$
\hat{\theta}^s_\tau = \mathbf{C}_\tau^{-1} \mathbf{d}_\tau, \quad \mathbf{C}_\tau = \beta(\mathbf{X}^s_{N^s})^T \mathbf{X}^s_{N^s} + \hat{\mathbf{U}}_\tau^T (\mathbf{X}^t_\tau)^T \mathbf{X}^t_\tau \hat{\mathbf{U}}_\tau + \mathbf{I}_{d^s}, \quad \mathbf{d}_\tau = \beta(\mathbf{X}^s_{N^s})^T \mathbf{r}^s_{N^s} + \hat{\mathbf{U}}_\tau^T (\mathbf{X}^t_\tau)^T \mathbf{r}^t_\tau, \tag{5}
$$

$$
\mathrm{vec}(\hat{\mathbf{U}}_\tau) = \mathbf{A}_\tau^{-1} \mathbf{b}_\tau, \quad \mathbf{A}_\tau = \hat{\theta}^s_\tau(\hat{\theta}^s_\tau)^T \otimes (\mathbf{X}^t_\tau)^T \mathbf{X}^t_\tau + \gamma \mathbf{I}_{d^s} \otimes (\mathbf{X}^t_\tau)^T \mathbf{S}_1 \mathbf{X}^t_\tau + \mathbf{I}_{d^t d^s}, \quad \mathbf{b}_\tau = \mathrm{vec}\big((\mathbf{X}^t_\tau)^T \mathbf{r}^t_\tau(\hat{\theta}^s_\tau)^T + \gamma(\mathbf{X}^t_\tau)^T \mathbf{S} \mathbf{X}^s_{N^s}\big). \tag{6}
$$

For computational efficiency, we instead solve for $\hat{\theta}^s_\tau$ and $\hat{\mathbf{U}}_\tau$ using the conjugate gradient descent method in our experiments. In each step $\tau$, a pure exploitation policy recommends the article with the largest predicted reward, i.e.,

$$
a_\tau = \arg\max_{a \in \mathcal{A}^t_\tau} (\mathbf{x}^t_a)^T \hat{\mathbf{U}}_\tau \hat{\theta}^s_\tau. \tag{7}
$$
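As an illustration of the alternating updates in Eq. 5 and Eq. 6, the following sketch (ours, assuming dimensions small enough that the $d^t d^s \times d^t d^s$ system can be formed explicitly) computes both closed-form solutions with a column-stacking vec convention; the paper instead solves these systems with conjugate gradient.

```python
# Minimal sketch (our illustration, not the authors' code) of the alternating
# closed-form updates in Eq. 5 and Eq. 6. Variable names and the fixed iteration
# count are assumptions.
import numpy as np

def tcb_alternating_fit(Xs, rs, Xt, rt, S, beta=0.1, gamma=0.1, n_iters=10):
    """Estimate theta_s (d_s,) and U (d_t, d_s) from source/target observations."""
    Ns, ds = Xs.shape
    Nt, dt = Xt.shape
    S1 = np.diag(S.sum(axis=1))                 # S1(i,i) = sum_j S_ij
    U = np.zeros((dt, ds))
    theta = np.zeros(ds)
    for _ in range(n_iters):
        # Eq. 5: theta update with U fixed
        XtU = Xt @ U
        C = beta * Xs.T @ Xs + XtU.T @ XtU + np.eye(ds)
        d = beta * Xs.T @ rs + XtU.T @ rt
        theta = np.linalg.solve(C, d)
        # Eq. 6: U update with theta fixed, solved in vectorized (column-stacked) form
        A = (np.kron(np.outer(theta, theta), Xt.T @ Xt)
             + gamma * np.kron(np.eye(ds), Xt.T @ S1 @ Xt)
             + np.eye(dt * ds))
        b = (Xt.T @ np.outer(rt, theta) + gamma * Xt.T @ S @ Xs).reshape(-1, order='F')
        U = np.linalg.solve(A, b).reshape(dt, ds, order='F')
    return U, theta
```

Forming $\mathbf{A}_\tau$ explicitly costs memory quadratic in $d^t d^s$, which is why a matrix-free solver such as conjugate gradient is the practical choice at the scale used in the experiments.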
### Exploration in TCB

In contextual bandit problems, only the rewards of the pulled arms are observed, and they are stochastic. Purely exploiting via the recommendation strategy in Eq. 7, therefore, easily gets stuck in suboptimal arms and suffers from linear regret growth. In this section, we introduce how TCB simultaneously exploits and explores.

To explore, TCB builds on the Upper Confidence Bound (UCB) principle (Auer 2002). UCB-style methods follow three steps: first, a high-probability confidence set is constructed for each uncertain parameter; second, the UCB of the expected reward is calculated for each arm; third, the arm with the largest UCB is pulled in each step. As shown in Fig. 1, there are three sources of uncertainty in our cross-domain RecSys: the correspondence data, the source rewards, and the target rewards. Accordingly, TCB explores the uncertainty of both the reward parameter $\hat{\theta}^s_\tau$ and the translation matrix $\hat{\mathbf{U}}_\tau$. We first define the high-probability confidence sets for both parameters by the following inequalities:

$$
\|\hat{\theta}^s_\tau - \theta^{s*}\|_{\mathbf{C}_\tau} \le \alpha_\theta, \qquad \|\mathrm{vec}(\hat{\mathbf{U}}_\tau) - \mathrm{vec}(\mathbf{U}^*)\|_{\mathbf{A}_\tau} \le \alpha_U, \tag{8}
$$

where $\alpha_\theta$ and $\alpha_U$ denote the upper bounds of the confidence sets and are discussed in Lemma 1. Based on the mechanisms in Eq. 1 and Eq. 3, for $\theta^s$ and $\mathbf{U}$ within the confidence sets, we then calculate the UCB of the expected reward for each target action with context $\mathbf{x}^t_a$:

$$
\mathrm{UCB}(\mathbf{x}^t_a) = (\mathbf{x}^t_a)^T \hat{\mathbf{U}}_\tau \hat{\theta}^s_\tau + \alpha_\theta \|(\mathbf{x}^t_a)^T \hat{\mathbf{U}}_\tau\|_{\mathbf{C}_\tau^{-1}} + \alpha_U \|\hat{\theta}^s_\tau \otimes \mathbf{x}^t_a\|_{\mathbf{A}_\tau^{-1}}. \tag{9}
$$

Finally, TCB's recommendation strategy is

$$
a_\tau = \arg\max_{a \in \mathcal{A}^t_\tau} \mathrm{UCB}(\mathbf{x}^t_a). \tag{10}
$$

In Eq. 9, the first term of the UCB calculates the expected reward, which encourages exploitation. The second term explores the reward parameter $\hat{\theta}^s_\tau$; TCB therefore tends to select the actions that help learn the reward function quickly. More informative source observations shrink the second term and thereby accelerate the exploration. The third term explores the translation matrix $\hat{\mathbf{U}}_\tau$, i.e., how to transfer, so TCB favors the actions that help learn the translation quickly. In short, the UCB in Eq. 9 lets transfer learning and the bandit policy benefit each other. Additionally, the second and third terms of the UCB shrink as TCB accumulates more target observations. As a result, TCB explores more in the beginning and exploits more in later stages.

We present the details of the TCB policy in Algorithm 1. It is worth noting that TCB can also handle knowledge transfer between homogeneous domains. When $\hat{\mathbf{U}}_\tau$ is fixed to $\mathbf{I}_{d^t}$, TCB degenerates to LinUCB with a warm start. When the two domains lie in the same feature space but have different distributions, TCB utilizes the translation to align the distributions.

Algorithm 1: Transferable Contextual Bandit
1: Input: source observations $\mathcal{O}^s_{N^s}$, $\beta$, $\gamma$, $\alpha_U$, and $\alpha_\theta$.
2: Initialize: $\hat{\mathbf{U}}_1 = \mathbf{0}_{d^t,d^s}$, $\hat{\theta}^s_1 = \mathbf{0}_{d^s,1}$.
3: for $\tau = 1, \ldots, N^t$ in the target domain do
4: &nbsp;&nbsp;Observe the action set $\mathcal{A}^t_\tau$ and the contexts $\mathbf{x}^t_a$ for all $a \in \mathcal{A}^t_\tau$;
5: &nbsp;&nbsp;Calculate $\mathrm{UCB}(\mathbf{x}^t_a)$ for each $a \in \mathcal{A}^t_\tau$ according to Eq. 9;
6: &nbsp;&nbsp;Pull arm $a_\tau = \arg\max_{a \in \mathcal{A}^t_\tau} \mathrm{UCB}(\mathbf{x}^t_a)$;
7: &nbsp;&nbsp;Observe reward $r^t_{a_\tau}$;
8: &nbsp;&nbsp;while not converged do
9: &nbsp;&nbsp;&nbsp;&nbsp;With the new observation $(\mathbf{x}^t_{a_\tau}, r^t_{a_\tau})$,
10: &nbsp;&nbsp;&nbsp;&nbsp;update $\hat{\theta}^s_{\tau+1}$ and $\mathbf{C}_{\tau+1}$ according to Eq. 5;
11: &nbsp;&nbsp;&nbsp;&nbsp;update $\hat{\mathbf{U}}_{\tau+1}$ and $\mathbf{A}_{\tau+1}$ according to Eq. 6;
12: &nbsp;&nbsp;end while
13: end for
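To make the decision step of Algorithm 1 concrete, here is a small sketch (our illustration, with assumed variable names) that scores the candidate arms by Eq. 9 and returns the arm of Eq. 10, given the current estimates and the matrices $\mathbf{C}_\tau$ and $\mathbf{A}_\tau$ from Eq. 5 and Eq. 6.

```python
# Minimal sketch (ours, not the authors' code) of TCB's decision step, Eq. 9-10.
import numpy as np

def tcb_select_arm(Xt_candidates, U_hat, theta_hat, C, A, alpha_theta=0.2, alpha_U=0.2):
    """Return the index of the candidate arm with the largest UCB.

    Xt_candidates: (K, d_t) candidate target contexts
    U_hat:         (d_t, d_s) current translation estimate
    theta_hat:     (d_s,) current reward-parameter estimate
    C:             (d_s, d_s) matrix of Eq. 5
    A:             (d_t*d_s, d_t*d_s) matrix of Eq. 6 (column-stacked vec convention)
    """
    C_inv = np.linalg.inv(C)
    A_inv = np.linalg.inv(A)
    scores = np.empty(len(Xt_candidates))
    for k, x in enumerate(Xt_candidates):
        z = U_hat.T @ x                          # translated context, (d_s,)
        exploit = z @ theta_hat                  # predicted reward (1st term of Eq. 9)
        bonus_theta = np.sqrt(z @ C_inv @ z)     # width w.r.t. theta (2nd term)
        v = np.kron(theta_hat, x)                # theta kron x, matches vec(U) ordering
        bonus_U = np.sqrt(v @ A_inv @ v)         # width w.r.t. U (3rd term)
        scores[k] = exploit + alpha_theta * bonus_theta + alpha_U * bonus_U
    return int(np.argmax(scores))
```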
### Regret Analysis

The overall loss function in Eq. 4 is not jointly convex when both parameters are optimized together. Since analyzing convergence to the true parameters in such non-convex problems is beyond the focus of this paper, we conduct the regret analysis on TCB by solving $\mathbf{U}$ and $\theta^s$ separately as in Eq. 11, where the true value of $\mathbf{U}$ is assumed known when solving for $\hat{\theta}^s_\tau$, and vice versa:

$$
\begin{aligned}
\hat{\theta}^s_\tau &= \arg\min_{\theta^s}\; \beta\|\mathbf{r}^s_{N^s} - \mathbf{X}^s_{N^s}\theta^s\|_2^2 + \|\mathbf{r}^t_\tau - \mathbf{X}^t_\tau \mathbf{U}^* \theta^s\|_2^2 + \|\theta^s\|_2^2,\\
\hat{\mathbf{U}}_\tau &= \arg\min_{\mathbf{U}}\; \|\mathbf{r}^t_\tau - \mathbf{X}^t_\tau \mathbf{U} \theta^{s*}\|_2^2 + \gamma\sum_{i=1}^{\tau}\sum_{j=1}^{N^s} S_{i,j}\|\mathbf{U}^T \mathbf{x}^t_i - \mathbf{x}^s_j\|_2^2 + \|\mathbf{U}\|_F^2.
\end{aligned} \tag{11}
$$

Lemma 1 shows that the confidence sets of $\hat{\theta}^s_\tau$ and $\hat{\mathbf{U}}_\tau$ grow sublinearly with respect to the number of steps $N^t$.

**Lemma 1.** Suppose $\|\mathbf{U}^*\|_F \le c_U$, $\|\theta^{s*}\|_2 \le c_\theta$, $\|\mathbf{x}^s\|_2 \le 1$, and $\|\mathbf{x}^t\|_2 \le 1$. If each target observation has one corresponding source observation, then with probability $1-\delta$,

$$
\begin{aligned}
\|\hat{\theta}^s_{N^t} - \theta^{s*}\|_{\mathbf{C}_{N^t}} &\le \sqrt{\log\frac{3\det(\mathbf{C}_{N^t})^{1/2}}{\det\big(\hat{\mathbf{U}}_{N^t}^T(\mathbf{X}^t_{N^t})^T\mathbf{X}^t_{N^t}\hat{\mathbf{U}}_{N^t} + \mathbf{I}_{d^s}\big)^{1/2}\delta}} + (1 + c_\theta c_\eta)\sqrt{\log\frac{3\det(\mathbf{C}_{N^t})^{1/2}}{\det\big(\beta(\mathbf{X}^s_{N^s})^T\mathbf{X}^s_{N^s} + \mathbf{I}_{d^s}\big)^{1/2}\delta}}\\
&\le c_\theta + (1 + c_\theta c_\eta + \beta)\sqrt{d^s\log(1 + N^t)},
\end{aligned} \tag{12}
$$

$$
\begin{aligned}
\|\mathrm{vec}(\hat{\mathbf{U}}_{N^t}) - \mathrm{vec}(\mathbf{U}^*)\|_{\mathbf{A}_{N^t}} &\le \sqrt{\log\frac{3\det(\mathbf{A}_{N^t})^{1/2}}{\det\big(\hat{\theta}^s_{N^t}(\hat{\theta}^s_{N^t})^T \otimes (\mathbf{X}^t_{N^t})^T\mathbf{X}^t_{N^t} + \mathbf{I}_{d^t d^s}\big)^{1/2}\delta}} + (1 + c_\eta c_\theta)\sqrt{\log\frac{3\det(\mathbf{A}_{N^t})^{1/2}}{\det\big(\gamma\mathbf{I}_{d^s} \otimes (\mathbf{X}^t_{N^t})^T\mathbf{S}_1\mathbf{X}^t_{N^t} + \mathbf{I}_{d^t d^s}\big)^{1/2}\delta}}\\
&\le c_U + (1 + c_\eta c_\theta + c_\eta\gamma)\sqrt{d^s d^t\log(1 + N^t)}.
\end{aligned} \tag{13}
$$

With Lemma 1, we present the upper bound on TCB's cumulative regret. As in Eq. 14, the regret quantifies a policy's effectiveness by the difference between the expected reward of the optimal arm $a^*_\tau$ and that of the pulled arm $a_\tau$ (Burtini, Loeppky, and Lawrence 2015):

$$
R(N^t) = \sum_{\tau=1}^{N^t}\big(\mathbb{E}[r^t_{a^*_\tau}] - \mathbb{E}[r^t_{a_\tau}]\big). \tag{14}
$$

**Theorem 1.** Under the same assumptions as Lemma 1, with probability $1-\delta$, TCB's cumulative regret satisfies

$$
\begin{aligned}
R(N^t) &\le 2\alpha_U\sqrt{N^t\log\frac{\det(\mathbf{A}_{N^t})}{\det\big(\gamma\mathbf{I}_{d^s} \otimes (\mathbf{X}^t_{N^t})^T\mathbf{S}_1\mathbf{X}^t_{N^t} + \mathbf{I}_{d^t d^s}\big)}} + 2\alpha_\theta\sqrt{N^t\log\frac{\det(\mathbf{C}_{N^t})}{\det\big(\beta(\mathbf{X}^s_{N^s})^T\mathbf{X}^s_{N^s} + \mathbf{I}_{d^s}\big)}}\\
&\le 2\alpha_U\sqrt{N^t d^t\log(1 + N^t)} + 2\alpha_\theta\sqrt{N^t d^s\log(1 + N^t)}.
\end{aligned} \tag{15}
$$

The detailed proofs are given in the supplementary material, available at http://www.cse.ust.hk/~bliuab. According to Lemma 1 and Theorem 1, TCB enjoys the same order of sublinear regret as the standard LinUCB, i.e., $O(\sqrt{N^t}\log(N^t))$ (Abbasi-Yadkori, Pál, and Szepesvári 2011; Li et al. 2010).

We further discuss the influence of knowledge transfer on TCB. According to Eq. 15, when the largest eigenvalue of $(\mathbf{X}^s)^T\mathbf{X}^s$ is nonzero, TCB's regret decreases by a constant. In other words, transfer learning improves the contextual bandit policy if informative source data are provided. Similarly, in Eq. 12, the confidence set of $\hat{\theta}^s_\tau$ shrinks with the eigenvalues of the source observations; transfer learning thus accelerates the exploration of the reward function.

Transfer learning always assumes that a source domain and a target domain are related but different. In our work, we describe the difference between domains by the random vector $\eta$. As mentioned before, each element of $\eta$ is $c_\eta/\sqrt{2}$-sub-Gaussian. Intuitively, a smaller $c_\eta$ is expected when the two domains are closer. A smaller $c_\eta$ induces faster shrinkage of the confidence sets of both $\theta^s$ and $\mathbf{U}$ according to Lemma 1, and yields a better regret according to Theorem 1. In summary, the exploration of TCB becomes much more efficient when a pair of source and target domains are more closely related.

## Experiments

In this section, we evaluate TCB's empirical performance using one synthetic and two real-world datasets.
We compare TCB with four representative baselines: HTL (Wei et al. 2016), LinUCB (Li et al. 2010), hLinUCB (Wang, Wu, and Wang 2016), and FactorUCB (Wang, Wu, and Wang 2017). HTL is the pure exploitation strategy with transfer learning shown in Eq. 7. LinUCB, hLinUCB, and FactorUCB are contextual bandit baselines without transfer. LinUCB is the most widely used contextual bandit baseline under the linear reward assumption. hLinUCB explicitly models the hidden context and claims to be robust to the cold-start problem. FactorUCB learns latent user factors and latent item factors as well as their UCBs in a collaborative filtering manner. FactorUCB requires knowledge of the user clustering, so we only compare with FactorUCB on the Delicious dataset. Implementations of hLinUCB and FactorUCB are available at github.com/huazhengwang/BanditLib.

For all experiments, we fix TCB's hyperparameters to $\beta = 0.1$ and $\gamma = 0.1$ so that all terms in Eq. 4 are of the same order of magnitude. To be consistent with the canonical LinUCB, we set the regularization hyperparameter of all methods to one and the exploration hyperparameter to 0.2. We mainly investigate TCB's performance in the cold-start situation. Thus, we plot the cumulative reward within 1,000 steps in the synthetic experiments and 5,000 steps in the real-world experiments. All reported results are averaged over ten random runs.

### Synthetic Experiment

In this part, we build a synthetic environment to verify TCB's effectiveness. With the oracle knowledge of the reward function, we measure the performance by the cumulative regret defined in Eq. 14. We first generate the source and target contexts by randomly drawing each dimension from a Gaussian distribution $\mathcal{N}(0, \sigma)$ with $\sigma \sim U(0, 1)$. Similarly, the optimal translation matrix $\mathbf{U}^*$ and the reward parameter $\theta^{s*}$ are randomly drawn from the uniform distribution $U(0, 1)$. To be consistent with Lemma 1, we normalize the data such that the norms of each context, of the translation, and of the reward parameter are upper bounded by one. We further use the Gaussian similarity to measure the correspondence between the $j$th source context and the $i$th target context, i.e., $S_{i,j} = e^{-\|(\mathbf{U}^*)^T\mathbf{x}^t_i - \mathbf{x}^s_j\|_2^2}$. We keep the top 1% largest similarities and set the remaining entries to zero. In each step of the simulation, all compared policies face 5,000 actions and suffer from the same reward noise $\epsilon \sim \mathcal{N}(0, 1)$. Additionally, we set $d^s = 60$, $d^t = 50$, and $N^s = 10{,}000$. On average, TCB takes 1.38 seconds to decide one action and update the UCB.
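The following sketch (ours; only the synthetic protocol above is from the paper, everything else is an assumption) shows one way to generate the synthetic contexts, the Gaussian similarity with top-1% sparsification, and the per-step regret bookkeeping of Eq. 14; the random choice at the end is merely a placeholder policy to illustrate how a regret term is computed.

```python
# Minimal sketch (ours) of the synthetic environment described above.
import numpy as np

rng = np.random.default_rng(0)
ds, dt, Ns, K = 60, 50, 10_000, 5_000

def normalize_rows(M):
    # cap each row's l2 norm at one, as required by Lemma 1
    return M / np.maximum(np.linalg.norm(M, axis=1, keepdims=True), 1.0)

sigma_s, sigma_t = rng.uniform(0, 1, ds), rng.uniform(0, 1, dt)
Xs = normalize_rows(rng.normal(0, sigma_s, size=(Ns, ds)))      # source contexts
U_star = rng.uniform(0, 1, size=(dt, ds))
U_star /= max(np.linalg.norm(U_star), 1.0)                      # ||U*||_F <= 1
theta_star = rng.uniform(0, 1, ds)
theta_star /= max(np.linalg.norm(theta_star), 1.0)              # ||theta*||_2 <= 1

def similarity(Xt, Xs, U_star, keep=0.01):
    """Gaussian similarity S_ij = exp(-||U*^T x_i^t - x_j^s||^2); keep the top 1%."""
    D = (Xt @ U_star)[:, None, :] - Xs[None, :, :]               # (Nt, Ns, ds)
    S = np.exp(-np.sum(D ** 2, axis=2))
    thresh = np.quantile(S, 1 - keep)
    return np.where(S >= thresh, S, 0.0)

Xt_cand = normalize_rows(rng.normal(0, sigma_t, size=(K, dt)))   # candidate arms
S = similarity(Xt_cand[:50], Xs[:1000], U_star)                  # small slice for memory

# per-step regret bookkeeping (one term of Eq. 14), with a placeholder policy
expected = Xt_cand @ U_star @ theta_star
chosen = rng.integers(K)
regret_step = expected.max() - expected[chosen]
```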
Figure 2: Cumulative regret w.r.t. the number of steps in the synthetic experiments.

In Fig. 2, we plot the cumulative regret w.r.t. the number of steps. HTL, as a pure exploitation strategy, suffers from linearly growing regret, which is the worst. TCB and HTL both transfer knowledge from the same source observations; in comparison, TCB additionally explores the uncertainty of the translation and the reward function. Owing to this exploration, TCB learns the translation and the reward function more effectively and efficiently, thereby achieving better transfer performance. Compared with LinUCB and hLinUCB, which do not transfer, TCB's regret converges significantly faster. We therefore attribute TCB's superiority to the knowledge transfer: TCB leverages the source observations to obtain more accurate parameters and a tighter upper confidence bound. Additionally, across the ten random runs, TCB achieves more stable and robust regret, as indicated by the smaller variance. In conclusion, Fig. 2 supports our claim that TCB's exploration and knowledge transfer mutually benefit each other.

The hyperparameters $\alpha_\theta$ and $\alpha_U$ balance exploitation and exploration and directly determine TCB's performance. In Fig. 3a, we investigate how $\alpha_U = \alpha_\theta = \alpha$ influences the cumulative regret. For all $\alpha > 0$, TCB consistently outperforms LinUCB and hLinUCB, which further confirms TCB's superiority. When $\alpha = 0$, all bandit approaches degenerate to pure exploitation and suffer from inferior regret and larger variance. When $\alpha > 0.2$, all methods explore more; the extra exploration, however, does not compensate for the sacrificed short-term rewards and leads to worse cumulative regret. Moreover, TCB(Theory) sets $\alpha_U$ and $\alpha_\theta$ according to Lemma 1, and TCB with manually tuned $\alpha$ outperforms TCB(Theory). As a result, $\alpha = 0.2$ is a fair choice for all methods.

The source observations and the correspondence data are essential components of TCB. Theorem 1 requires that each target observation has one corresponding source observation, which does not easily hold in a real-world RecSys. Thus, we investigate the effects of the source observations and the correspondence data on TCB's performance. In Fig. 3b, we examine how TCB's regret changes w.r.t. the number of source observations. TCB's regret deteriorates as the number of source observations decreases. When only 1,000 source observations are available, TCB performs even worse than LinUCB and hLinUCB without transfer. A potential explanation is that TCB explores both the translation and the reward function, whose combination is more uncertain; insufficient source observations cannot accelerate TCB's exploration enough, leading to worse performance. On the other hand, Fig. 3b shows that TCB is capable of utilizing more and more source data.

The correspondence data $\mathbf{S}$ guide the exploration of the translation matrix $\mathbf{U}$. In Fig. 3c, we present TCB's cumulative regret when {0.1%, 0.5%, 1%, 5%, 10%} of the entries of $\mathbf{S}$ are non-zero. Given very sparse correspondence data, the exploration and learning of $\mathbf{U}$ become slow due to insufficient constraints. When TCB observes denser correspondence data, its regret first decreases significantly and then increases. At sparsity 0.1, TCB forces each target context to be similar to 10% of the source observations, so the target contexts become indistinguishable in the source feature space, leading to worse regret. Finally, to further analyze TCB's performance, we investigate how TCB's regret changes with respect to the dimension of the context in Fig. 6 of the supplementary material.

Figure 3: (a) Regret of three policies with respect to varying α. (b) Regret of TCB with different numbers of source observations. (c) Regret of TCB using similarity matrices of different sparsity.

### Real-world Experiments

In this section, we empirically compare TCB with the state-of-the-art baselines using the Cheetah data and the public Delicious data (available at grouplens.org/datasets/hetrec-2011). Without knowledge of the true reward function, we compare all methods using the cumulative reward. Due to privacy issues, all reported cumulative rewards are normalized by a random policy.

The Cheetah dataset comes from the real-world RecSys supplied by Cheetah Mobile Inc. The app RecSys and the article RecSys are the source and the target domains, respectively. The user profiles and app attributes together serve as the context for the source app RecSys, and we combine the user profiles and article titles as the target context.
We further adopt principal component analysis (PCA) to reduce the source and the target contexts to 50 and 30 dimensions, respectively. Therefore, on the Cheetah dataset, TCB faces not only different actions but also heterogeneous feature spaces across the two domains. In each step $\tau$ of the experiment, we randomly select ten articles as the candidate set and guarantee that the serving user reads exactly one of the candidates. All compared methods are required to choose one recommendation from the candidate set. If the serving user reads the recommended article, the agent receives a reward of one, and otherwise zero. We serve all users in chronological order. In the same way, we accumulate the source app observations using a random policy. Finally, we construct the correspondence data $\mathbf{S}$ for TCB and HTL. The $i$th target observation corresponds to the $j$th source observation if they satisfy the following two criteria: first, the $i$th target and the $j$th source observations serve the same user; second, the user feedback is the same for the $i$th target and the $j$th source observations. For example, if a user installs the $j$th app and reads the $i$th article, then $S_{i,j} = 1$. We summarize the Cheetah data in Table 1.

Table 1: Statistics of the Cheetah data.

| Statistic | Value |
| --- | --- |
| # of unique users | 4,000 |
| # of correspondences per user | 1.077 |
| # of users with correspondence | 1,395 |
| # of unique articles | 3,706 |
| # of source observations | 11,770 |
| Time span | 21 days |
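As an illustration of the correspondence criteria above (same user, same feedback), a minimal sketch of how $\mathbf{S}$ could be assembled from logged interactions is shown below; the log format and field names are our assumptions, not part of the dataset description.

```python
# Minimal sketch (ours) of building the correspondence matrix S from logs:
# S_ij = 1 when the i-th target and j-th source observations share the same
# user and the same feedback.
import numpy as np

def build_correspondence(target_logs, source_logs):
    """target_logs / source_logs: lists of dicts with keys 'user' and 'feedback'."""
    S = np.zeros((len(target_logs), len(source_logs)))
    # index source observations by (user, feedback) for a linear-time pass
    index = {}
    for j, src in enumerate(source_logs):
        index.setdefault((src['user'], src['feedback']), []).append(j)
    for i, tgt in enumerate(target_logs):
        for j in index.get((tgt['user'], tgt['feedback']), []):
            S[i, j] = 1.0
    return S

# toy usage: user u1 installed app j=0 and read article i=0, so S[0, 0] = 1
src = [{'user': 'u1', 'feedback': 1}, {'user': 'u2', 'feedback': 0}]
tgt = [{'user': 'u1', 'feedback': 1}]
print(build_correspondence(tgt, src))   # [[1., 0.]]
```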
Figure 4: Normalized reward on the Cheetah data w.r.t. steps.

In Fig. 4, we empirically verify the necessity of knowledge transfer for improving contextual bandit methods. When the target observations number fewer than 500, the estimated cumulative reward is high-variance and noisy, so all methods are indistinguishable in this period. After step 500, owing to transfer learning, TCB and HTL consistently outperform LinUCB and hLinUCB. At this later stage, all algorithms still suffer from the cold-start problem, but TCB and HTL learn a reasonable translation from the correspondence data. In the translated feature space, TCB and HTL estimate the reward function more accurately using both source and target observations. In comparison, LinUCB and hLinUCB learn only from the insufficient target observations and achieve inferior rewards.

By comparing TCB with HTL in Fig. 4, we can emphasize the importance of exploration to transfer learning. TCB is outperformed by HTL in the early period; the likely reason is that TCB sacrifices short-term reward to explore both the translation and the reward function. In the long run, thanks to this exploration, TCB estimates the reward function far more efficiently than HTL, which does not explore. As a result, once more observations are accumulated, TCB quickly surpasses HTL and shows consistent advantages.

Finally, we verify that TCB can also be applied to the homogeneous setting using the public Delicious dataset. We perform the experiments in the same setting as (Cesa-Bianchi, Gentile, and Zappella 2013). After preprocessing, we obtain 1,867 users, 57,784 URLs, and the corresponding 25-dimensional textual context. In each step $\tau$, the agent recommends one URL among 25 candidates and receives a reward of one if the user bookmarks the URL.

For TCB, we design the source and the target RecSys to have mutually exclusive URLs. Specifically, a URL belongs to the source domain if it owns a tag that occurs more than 80 times, and to the target domain otherwise. Therefore, the source and target URLs are described in the same textual feature space but follow different distributions of text. The source observations and the correspondence data are constructed in the same way as for the Cheetah data. hLinUCB and FactorUCB exploit the user clustering for collaborative filtering; for a fair comparison, we run independent TCB, HTL, and LinUCB models on each user cluster.

Figure 5: Normalized reward on the Delicious data w.r.t. steps.

According to Fig. 5, TCB's two-way advantage is again demonstrated in the homogeneous setting. For one thing, comparing TCB with hLinUCB and FactorUCB, we conclude that, for the cold-start problem, knowledge transfer from an auxiliary domain is more effective than collaborative filtering within a single domain. For another, TCB's knowledge transfer is more efficient than HTL's.

## Conclusions

In this paper, we propose a transferable contextual bandit policy named TCB for cross-domain recommendation. TCB adopts transfer learning to optimize the cumulative reward in the target RecSys. On behalf of transfer learning, TCB benefits the exploitation and accelerates the exploration in the target RecSys. On behalf of the contextual bandit policy, TCB efficiently explores how to transfer and how to recommend. The theoretical regret analysis and the empirical experiments verify TCB's superiority. In the future, we plan to speed up TCB for high-dimensional contexts and deploy it in a real online recommendation system.

## Acknowledgements

We are supported by the National Grant Fundamental Research (973 Program) of China under Project 2014CB340304 and Hong Kong CERG projects 16211214, 16209715, and 16244616. Yu Zhang is supported by NSFC (61473087, 61673202) and the Natural Science Foundation of Jiangsu Province (BK20141340).

## References

Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2312-2320.

Agrawal, S., and Goyal, N. 2013. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, 127-135.

Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3:397-422.

Azar, M. G.; Lazaric, A.; and Brunskill, E. 2013. Sequential transfer in multi-armed bandit with finite set of models. In Advances in Neural Information Processing Systems 26, 2220-2228.

Burtini, G.; Loeppky, J.; and Lawrence, R. 2015. A survey of online experiment design with the stochastic multi-armed bandit. CoRR abs/1510.00757.

Cesa-Bianchi, N.; Gentile, C.; and Zappella, G. 2013. A gang of bandits. In Advances in Neural Information Processing Systems 26, 737-745.

Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. E. 2011. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 208-214.

Dai, W.; Chen, Y.; Xue, G.; Yang, Q.; and Yu, Y. 2008. Translated learning: Transfer learning across different feature spaces. In Advances in Neural Information Processing Systems 21, 353-360.

Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 661-670. ACM.
Li, L.; Chu, W.; Langford, J.; and Wang, X. 2011. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth International Conference on Web Search and Web Data Mining, 297-306.

Li, S.; Karatzoglou, A.; and Gentile, C. 2016. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 539-548.

Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10):1345-1359.

Pan, W.; Xiang, E. W.; Liu, N. N.; and Yang, Q. 2010. Transfer learning in collaborative filtering for sparsity reduction. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence.

Pan, W.; Liu, N. N.; Xiang, E. W.; and Yang, Q. 2011. Transfer learning to predict missing ratings via heterogeneous user feedbacks. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 2318-2323.

Schein, A. I.; Popescul, A.; Ungar, L. H.; and Pennock, D. M. 2002. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 253-260.

Shi, X.; Liu, Q.; Fan, W.; and Yu, P. S. 2013. Transfer across completely different feature spaces via spectral embedding. IEEE Transactions on Knowledge and Data Engineering 25(4):906-918.

Tran-Thanh, L.; Chapman, A. C.; de Cote, E. M.; Rogers, A.; and Jennings, N. R. 2010. Epsilon-first policies for budget-limited multi-armed bandits. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence.

Wang, H.; Wu, Q.; and Wang, H. 2016. Learning hidden features for contextual bandits. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 1633-1642.

Wang, H.; Wu, Q.; and Wang, H. 2017. Factorization bandits for interactive recommendation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2695-2702.

Wei, Y.; Zhu, Y.; Leung, C. W.; Song, Y.; and Yang, Q. 2016. Instilling social to physical: Co-regularized heterogeneous transfer learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 1338-1344.

Yue, Y.; Hong, S. A.; and Guestrin, C. 2012. Hierarchical exploration for accelerating contextual bandits. In Proceedings of the 29th International Conference on Machine Learning.

Zhang, J., and Bareinboim, E. 2017. Transfer learning in multi-armed bandits: A causal approach. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 1340-1346.

Zhou, L. 2015. A survey on contextual multi-armed bandits. CoRR abs/1508.03326.

Zhu, Y.; Chen, Y.; Lu, Z.; Pan, S. J.; Xue, G.; Yu, Y.; and Yang, Q. 2011. Heterogeneous transfer learning for image classification. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence.