# profilebased_bandit_with_unknown_profiles__6206d402.pdf Journal of Machine Learning Research 19 (2018) 1-40 Submitted 11/17; Revised 6/18; Published 9/18 Profile-Based Bandit with Unknown Profiles Sylvain Lamprier sylvain.lamprier@lip6.fr Sorbonne Universit es, UPMC Paris 06, LIP6, CNRS UMR 7606 Thibault Gisselbrecht thibault.gisselbrecht@snips.ai SNIPS, 18 rue Saint Marc, 75002 Paris Patrick Gallinari patrick.gallinari@lip6.fr Sorbonne Universit es, UPMC Paris 06, LIP6, CNRS UMR 7606 Editor: Peter Auer Stochastic bandits have been widely studied since decades. A very large panel of settings have been introduced, some of them for the inclusion of some structure between actions. If actions are associated with feature vectors that underlie their usefulness, the discovery of a mapping parameter between such profiles and rewards can help the exploration process of the bandit strategies. This is the setting studied in this paper, but in our case the action profiles (constant feature vectors) are unknown beforehand. Instead, the agent is only given sample vectors, with mean centered on the true profiles, for a subset of actions at each step of the process. In this new bandit instance, policies have thus to deal with a doubled uncertainty, both on the profile estimators and the reward mapping parameters learned so far. We propose a new algorithm, called Samp Lin UCB, specifically designed for this case. Theoretical convergence guarantees are given for this strategy, according to various profile samples delivery scenarios. Finally, experiments are conducted on both artificial data and a task of focused data capture from online social networks. Obtained results demonstrate the relevance of the approach in various settings. Keywords: Stochastic Linear Bandits, Profile-based Exploration, Upper Confidence Bounds 1. Introduction Multi-armed bandits (MAB) correspond to online decision problems where, at each step of a sequential process, an agent has to choose an action - or arm - among a set of K actions, with the aim to maximize some cumulative reward function. In the so-called stochastic MAB setting, rewards collected for a given arm through time are assumed to be independently and identically distributed, following some hidden stationary distribution on every individual arm. The problem is therefore to deal with a tradeoffbetween exploitation - selecting actions according to some estimations about their usefulness - and exploration - selecting actions in order to increase the knowledge of their reward distribution. However, with classical stochastic bandit policies, the convergence towards the optimal arms can be slow when the number of actions becomes large. On another hand, contextual bandits correspond to MAB settings where some side information can be leveraged to improve estimations of reward distributions. In these settings, a decision context is observed before selecting actions. This context can either c 2018 Sylvain Lamprier, Thibault Gisselbrecht and Patrick Gallinari. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v19/17-693.html. Lamprier, Gisselbrecht and Gallinari correspond to a global decision feature vector or to specific feature vectors observed for each single action. Depending on the setting, these features can vary over time, which can help to predict reward fluctuations, or they can correspond to constant features on actions - that we call action profiles in the following - whose structure can be used to improve exploration policies over non-contextual approaches. In this paper we address the latter case, where we assume stationary distributions of rewards, but where distributions depend on constant profiles associated each arm. Reward distributions from the different arms are connected by a common unknown parameter to be learned (Filippi et al., 2010). However, we introduce a new scenario where, for various possible reasons (technical, political, etc...), profile vectors are not available a priori. Instead, the agent gets sample vectors, centered on the true profiles, for a subset of actions at each decision step. This can happen in various situations where some restrictions limit our knowledge of the whole decision environment. For example in a focused data capture or technology intelligence scenario on social media, where an agent is asked to collect relevant information w.r.t. to a given need. Because of the extremely large number of accounts on media such as Twitter, the agent needs to focus on a subset of relevant users to follow at each time step (Gisselbrecht et al., 2015). However, given the strict restrictions set by the media, no knowledge about users is available beforehand. Profiles have therefore to be constructed from users activities, which are only observed for a small fraction of users at each step. As we will see later, the process which delivers profile samples can either be independent - e.g., an external process delivers activity samples for randomly chosen users at each step in the case of data capture from Twitter - or be included in the decision process - e.g., activity samples are only collected for followed users at the current time step. To the best of our knowledge, this instance of contextual bandit has not been studied in the literature. Existing bandit approaches do not fit with this new setting. First, even if traditional algorithms such as UCB (Auer et al., 2002) could be applied, the information provided by the sample profile vectors would be entirely ignored. Our claim is that important benefits can arise from taking this available side-information into account. On the other hand, existing contextual bandit policies do not take into account uncertainty on context vectors, while we face here a bandit problem where uncertainty not only arises from regression parameters, as classically considered by contextual approaches, but also from the estimated profiles which serve as inputs for reward predictions at each step. The aim is to propose an approach able to leverage structural distributions of arms, based on noisy observations of their profiles, to improve over existing exploitation/exploration policies in bandit settings with partial side-information. The contribution of this paper is threefold: We propose a new instance of the contextual bandit problem, based on constant contexts for each action, where action profiles are not known beforehand, but built from samples obtained at each iteration (3 cases are investigated regarding the sampling process); We design the Samp Lin UCB algorithm to solve this problem, for which we demonstrate some theoretical convergence guarantees; We experiment our proposal for both an artificial setting and a real-world task of focused data capture from Twitter, to empirically demonstrate the benefits of such a Profile-Based Bandit with Unknown Profiles profile-based approach with ony partial knowledge for exploitation/exploration problems. The paper is organized as follows. In section 2, we present some background and related works. In section 3, we formalize the problem, propose our algorithm and derive the regret bound. Finally, section 4 reports our experiments. 2. Background: the Linear Stochastic Bandit The multi-armed bandit problem, originally introduced in (Lai and Robbins, 1985) in its stationary form, has been widely studied in the literature. This learning problem aims at tackling the trade offbetween exploration and exploitation in decision processes where, at each step, an agent must choose an action - or arm - among a finite set of size K. After each decision, it receives a reward which quantifies the quality of the chosen action. The aim for the agent is to maximize the cumulative reward trough time, or equivalently to minimize the cumulative regret RT at step T defined as: RT = max i {1,2,...,K} t=1 rit,t (1) where it stands for the action selected at step t and ri,t the reward obtained by playing the action i at step t. This represents the amount of rewards that has been lost by selecting it at each step t, compared to what could be obtained by playing the optimal arm from the beginning to the end of the process. Note that, at each step t, only the reward for the chosen action rit,t is observed in practice, other ones remain unknown. In the so-called stochastic case, one assume that rewards of an arm i are identically and independently sampled from a distribution with mean νi. Therefore, one usually rather consider the pseudo-regret of a policy, which introduces expectations of regret in the previous definition: t=1 νit (2) where i stands as the arm with the best reward expectation. One of the simplest and most straightforward algorithms to deal with the stochastic bandit problem is the well-known ϵ-greedy algorithm (Auer et al., 2002). This algorithm selects the arm with the best reward mean empirical estimation with probability 1 ϵ and uniformly selects an arm among the whole set regardless their current estimations with probability ϵ. This guarantees to regularly reconsider estimations of all arms and therefore prevents from getting stuck on sub-optimal arms. However, the reward loss resulting from these blind selections prevents from ensuring a sub-linear upper bound of the pseudo-regret, unless setting an appropriate decay on ϵ. But this requires to know a lower bound on the difference of reward expectations between the best and the second best action (Auer et al., 2002). Upper Confidence Bound algorithms (UCB) is another family of bandit approaches which define confident intervals for the reward expectations of each arm. Based on some concentration inequalities (Hoeffding, Bernstein, etc.), they propose optimistic policies which Lamprier, Gisselbrecht and Gallinari consider possible deviations of the estimated mean of each arm. By using upper bounds of confidence intervals as selection scores, they ensure a cleaver balance between exploitation and exploration. Many extensions of the famous UCB algorithm proposed in (Auer et al., 2002) are known to guarantee a sub-linear bound of the pseudo-regret (see UCBV in (Audibert et al., 2009), MOSS in (Audibert and Bubeck, 2009) or KL-UCB in (Garivier, 2011)). At last, Thompson sampling algorithms, originally proposed in (Thompson, 1933), develop a Bayesian approach to deal with uncertainty. By sampling from posterior distributions for the reward parameters, their exploration/exploitation mechanism is also proved to ensure a sub-linear regret (see (Kaufmann et al., 2012b) and (Agrawal and Goyal, 2012)). The contextual bandit setting is an instance of the bandit problem where context vectors are observed before each decision step. Typically, contextual bandits assume a linear relation between context features and reward expectations. Formally, if we observe a context vector xi,t Rd for each action i K at each time-step t, we consider the following assumption: β Rd such that ri,t = x i,tβ + ηi,t (3) where β is a mapping parameter between contexts and rewards, ηi,t is a zero-mean conditionally R sub-Gaussian random noise, with constant R > 0 i.e: λ R : E[eληi,t|Ht 1] eλ2R2/2, with Ht 1 = {(is, xis,s, ris,s)}s=1..t 1. In this context, given a set K of K actions, any contextual bandit algorithm proceeds at each step t {1, 2, 3, . . . , T} as follows: 1. Observation of the context vector xi,t Rd for each i {1, ..., K}; 2. According to the current estimate of β, selection of an action it and reception of the associated reward rit,t; 3. Improvement of the selection policy by considering the new input (it, xit,t, rit,t) for the estimation of β. Various contextual algorithms have been proposed in the literature. The first contextual bandit algorithm was introduced in (Auer, 2003). More recently the well-known Lin UCB algorithm has been proposed for a task of personalized recommendation in (Li et al., 2010) and analyzed in (Chu et al., 2011). Both of these algorithms are UCB-like policies, each of them selecting the action whose upper bound of its reward confidence bound is the highest. Many other UCB approaches have been developed since then. In particular, algorithms such as OFUL or Confidence Ball proposed in (Abbasi-Yadkori et al., 2011) and (Dani et al., 2008) have the advantage to enjoy a tighter regret upper bound (see also (Kaufmann et al., 2012a) and (Rusmevichientong and Tsitsiklis, 2010)). As in the stochastic bandit setting, Thompson sampling algorithms have also been designed for the contextual case, which also proved to be powerful, first empirically in (Chapelle and Li, 2011) and then theoretically in (Agrawal and Goyal, 2013) and (May et al., 2012). In this paper, we consider a variant of the contextual bandit problem where contexts of actions are constant, which we call action profiles in the following. Hence, in our setting we assume that each action i K is associated with a profile vector µi Rd. The linear assumption of equation 3 becomes: β Rd such that ri,t = µ i β + ηi,t (4) Profile-Based Bandit with Unknown Profiles Thus, in this setting contexts cannot be used to anticipate some variations in the rewards expectations as it is traditionally the case in the literature about contextual bandit, but they can be leveraged to improve the exploration process, the use of a shared mapping parameter β allowing one to define areas of interest in the representation space of the actions. To illustrate this, the figure 1 represents the selection scores of a contextual algorithm (such as OFUL that we rely on in the following) at a given step for a simple case where K = 4 and d = 2. In this figure, green areas correspond to high scores, whereas red ones correspond to low scores areas. Color variations render the latent structure inferred by the model. In this setting, the algorithm would select the action 1, since its profile is located in the most promising area of the space. On the other hand, the action 3 is located in an area that is greatly less promising. The fact of using a common mapping parameter β allows one to perform a mutual learning, where observations on some actions inform on the usefulness of similar ones. This allows one to improve the exploration process by focusing more quickly on the useful areas: Imagine that a great number of actions are located in the red area of the figure. In that case, a classical bandit algorithm such as UCB would need to consider each of these actions several times to reveal their low reward expectation. On the other hand, a contextual bandit algorithm such as OFUL is able to avoid these actions really more quickly because of the proximity with other bad actions. Figure 1: Illustration of OFUL scores for a profiles representation space. This structured bandit setting has already been investigated in (Filippi et al., 2010). This work showed that great improvements could indeed be obtained by exploiting the structure of actions, since observations on some actions inform on the usefulness of similar ones. This comes down to a classical stochastic bandit where the pseudo-regret can be defined as follows: t=1 µ i β µ itβ (5) Lamprier, Gisselbrecht and Gallinari where µi Rd stands for the profile of the action i K and µi = arg max µi,i=1..K µ i β corresponds to the profile of the optimal action i . This is the setting which is studied in this paper. However, in our case the profile vectors are unknown to the agent beforehand, they have to be discovered iteratively during the process. Our problem differs from existing instances by the following two main aspects: 1. Action profiles are not directly available, one only get samples centered on them during the process; 2. At each step, one only get samples for a subset of actions. In the following, we derive an UCB-based policy for this new setting. 3. Profile-Based Bandit with Unknown Profiles In this section, we explore our new setting in which the set of profiles {µ1, .., µK} is not directly observed. Instead, at each iteration t, the agent is given a subset of actions Ot such that for every i Ot, a sample xi,t of a random variable centered on µi is revealed. By assuming the same linear hypothesis as described in the previous section (formula 4), the relation of rewards with profiles can be re-written as follows for any time t, in order to introduce profile samples: s t : ri,s = µ i β + ηi,s = ˆx i,tβ + (µi ˆxi,t) β + ηi,s = ˆx i,tβ + ϵ i,tβ + ηi,s (6) where ϵi,t = µi ˆxi,t, ˆxi,t = 1 ni,t s T obs i,t xi,s, with T obs i,t = {s t, i Os} and ni,t = |T obs i,t |. In words, ni,t corresponds to the number of times a sample has been obtained for the action i until step t and ˆxi,t corresponds to the empirical mean of observed samples for i at time t. ϵi,t corresponds to the deviation of the estimator ˆxi,t from the true profile µi. Compared to traditional contextual bandits, the uncertainty is double : as classically it arises from the β parameter estimator, but also from the profile estimators, since the algorithm must both estimate β and the profile vectors {µ1, .., µK} from observations. Figure 21 illustrates this new setting. Contrary to figure 1 where profiles are known, here we only get confidence areas for them, represented by circles centered on their corresponding empirical mean (represented by a blue cross). From the law of large numbers, the more observations for a given action we get, the lower the deviation between its true profile and its empirical estimator is. Therefore, the more knowledge we get about a given action, the smaller its confidence circle is. The best action is still the action 1, whose true profile (represented by a black cross) is in the greenest area. However, this information is unknown from the agent. A naive solution would be to directly use the empirical mean for each action in order to determine the selection scores. From the figure, this would lead to select the sub-optimal 1. Note that this is only an illustration of the general principle, in practice the surface of selection scores should also differ from figure 1, since β is estimated from biased inputs. Profile-Based Bandit with Unknown Profiles action 2, whose empirical mean is located in a greener area than the one of other actions. We propose to include the additional uncertainty in the selection scores by using the best location inside the confidence ellipsoid it is possible to reach for each action. This allows one to define an optimistic policy which would select the optimal action 1 in the example figure, whose confidence area contains the most promising profiles (i.e., includes the most green locations in the figure). Figure 2: Illustration of the additional uncertainty arising from profile estimators. In the following we propose to define an algorithm fitted for this setting. After deriving the algorithm in a generic context of samples delivery, we consider three different cases: Case 1: Every action delivers a profile sample at each step t (i.e., t, Ot = {1, ..., K}); Case 2: Each action i owns a probability pi of sample delivery: at each step t, an action is included in Ot with probability pi; Case 3: At each step t, only the action selected at the previous step delivers a sample (i.e., t, Ot = it 1). The first case corresponds to the simplest case, where Ot is constant over time. The second case includes an additional difficulty since before any step, every action has not been observed the same number of times, which leads to different levels of uncertainty. The last case, probably the most interesting one for real-world applications, is the most difficult since decisions at each step not only affect knowledge about reward distributions but also profile estimations. For that case, it appears mandatory to take the uncertainty about profiles into account in the selection policy to guarantee the process to converge towards optimal actions. After deriving confidence intervals for this new bandit instance, this section describes the proposed algorithm and analyzes its regret for the three settings listed above. Then, we consider the case where multiple actions can be performed at each step. Lamprier, Gisselbrecht and Gallinari 3.1. Confidence Intervals In the following, we derive a series of propositions which will allow us to define a selection policy for our setting of profile-based bandit with unknown profiles. First, it is needed to define an estimator for the mapping parameter β, and the associated confidence ellipsoid. For that purpose, we rely on results from the theory of the self-normalized process (de la Pe na et al., 2009). The next step is to define a way to mix it with the uncertainty on profiles to define an optimistic policy. Proposition 1 Let us consider that for any action i, all profile samples xi,t Rd are iid from a distribution with mean µi Rd. Let us also assume that there exists a real number L > 0 such that ||xi,t|| L and a real number S > 0 such that ||β|| S. Then, for any i K and any step s t, the random variable ηi,t,s = ϵ i,tβ + ηi,s is conditionally sub-Gaussian with constant Ri,t = Proof Available in appendix A.1. In this proposition, R is the constant of the sub-Gaussian random noise of the rewards (ηs from equation 6) and the notation ||x|| stands as the norm of a vector x (i.e., ||x|| = x T x). Since the noise ηi,t,s is sub-Gaussian, it will be possible to apply the theory of self-normalized process for defining a confidence ellipsoid for β. At step t, we can use the following set of observations to find an estimator for β: {(ˆxis,t, ris,s)}s=1..t 1 (i.e., at any decision step t, the reward ris,s observed at any previous step s < t is associated with the profile of the selected action at step s, estimated knowing samples observed from step 1 to step t). The following notations are used in the remaining of the paper: η t 1 = (ηis,s + ϵ is,tβ) s=1..t 1 the vector of noises of size t 1. Xt 1 = (ˆx is,t)s=1..t 1 the (t 1) d matrix containing the empirical means of the selected actions, where the s-th row corresponds to the estimator at step t of the action selected at step s. Yt 1 = (ris,s) s=1..t 1 the rewards vector of size t 1. At 1 = diag(1/Ris,t)s=1..t 1 the diagonal (t 1) (t 1) matrix, where the s-th diagonal element equals 1/Ris,t. Note that, for a specific action, the value of its corresponding coefficient increases with the number of observed samples for this action. With these notations, the linear application from profiles to rewards can be written as: Yt 1 = Xt 1β + η t 1 (7) Proposition 2 We note ˆβt 1 the least square estimator of the parameter β at step t, according to the following l2-regularized regression problem, where each element is weighted Profile-Based Bandit with Unknown Profiles by the corresponding coefficient 1/Ris,t: ˆβt 1 = arg min β 1 Ris,t (β ˆxis,t ris,s)2 + λ||β||2 (8) where λ > 0 is the l2-regularization constant. ˆβt 1 = (X t 1At 1Xt 1 + λI) 1X t 1At 1Yt 1 (9) Proof Let us rewrite the minimization problem such as: ˆβt 1 = arg min β L with L = (Yt 1 Xt 1β) At 1(Yt 1 Xt 1β) + λβ β. The gradient is given by: βL = 2X t 1At 1(Yt 1 Xt 1β) + 2λβ = 2(X t 1At 1Xt 1 + λI)β 2X t 1At 1Yt 1 By canceling this gradient, we get the announced result. This estimator of β uses empirical means of observed samples as inputs. Weighting each element according to the corresponding value Ris,t allows one to consider the uncertainty associated with this approximation. It renders the confidence we have in the weighted input. Note that this coefficient tends towards a constant when the number of observed samples increases for the corresponding action. It allows one, according to the following proposition, to define a confidence ellipsoid for the estimator of β. Proposition 3 Let us define Vt 1 = λI + X t 1At 1Xt 1 = λI + t 1 P ˆxis,tˆx is,t Ris,t . With the same assumptions as in proposition 1, for any 0 < δ < 1, with a probability at least equal to 1 δ, the estimator ˆβt 1 verifies for all t 0: ||ˆβt 1 β||Vt 1 v u u t2 log det(Vt 1)1/2det(λI) 1/2 λS = αt 1 (10) where ||x||V = x V x is the V-norm of the vector x. Proof Available in appendix A.2. This bound is very similar to the one defined in the OFUL algorithm (Abbasi-Yadkori et al., 2011) to build its confidence ellipsoid. However, a notable difference lies in the definition of the matrix Vt, in which weights in At are applied to cope with confidence differences between profile estimators. Without this weighting, no confidence ellipsoid could be found for β since no common bound could be defined for the various noises η s (see the proof of proposition 3 in appendix). The following proposition can easily be deduced from the previous one to bound the expectation of reward with known profiles. Lamprier, Gisselbrecht and Gallinari Proposition 4 For every i K, with probability greater than 1 δ, we have for all t 0: β µi ˆβ t 1µi + αt 1||µi||V 1 t 1 (11) Proof Available in appendix A.3 This upper-bound for the expected reward contains two distinct terms: while the former corresponds to a classical exploitation term which estimates the expected reward with the current parameters, the latter corresponds to an exploration term since it takes into account the uncertainty on the reward parameter. If profiles were known, this could directly be used as a selection score for an UCB-like policy. However, in our setting, profiles are unknown. We have to consider confidence ellipsoids for the profiles of actions too. The following proposition defines confidence bounds for the profile estimators. Proposition 5 For every i K and any t > 0, with probability greater than 1 δ/t2, we have: ||ˆxi,t µi|| min(L 2d ni,t log 2dt2 , 2L) = ρi,t,δ (12) Proof This inequality comes from the application of the Hoeffding s inequality to each dimension separately. The min operator comes from the base hypothesis ||xi,t|| L, which can be more restrictive than the Hoeffding assumption. The proof is available in appendix A.4. Contrary to the bound of the deviation of the mapping parameter β which holds simultaneously for all steps of the process, the one for the profile estimators is only valid for each step separately. To obtain a bound holding for every step simultaneously, which is important for the regret analysis (see section 3.3), we use the uniform bound principle. For a given action i, we have: P( t, ||ˆxi,t µi|| ρi,t,δ) = 1 P( t, ||ˆxi,t µi|| ρi,t,δ) 1 P t P(||ˆxi,t µi|| ρi,t,δ) 1 P t δ/t2. This justifies the introduction of the t2 term in the bound, which allows one to define a uniform probability over all steps since we have thereby: P( t, ||ˆxi,t µi|| ρi,t,δ) 1 δ P t=2 δ/t2 = 1 δ δ(π2/6 1) 1 2δ. Now that we have defined probabilistic deviation bounds for the different estimators, we can use them conjointly to define the confidence interval of the reward expectation for the setting of unknown profiles, and thus to upper bound the expected reward for each action i. Proposition 6 For every i K and any t > 0, with probability greater than 1 δ/t2 δ, we have: β µi ˆβ t 1(ˆxi,t + ϵi,t) + αt 1||ˆxi,t + ϵi,t||V 1 t 1 (13) with: ϵi,t = ρi,t,δ ˆβt 1 ||ˆβt 1|| ϵi,t = ρi,t,δˆxi,t λ||ˆxi,t||V 1 t 1 Profile-Based Bandit with Unknown Profiles Proof The proof is available in appendix A.5. Compared to the bound given in proposition 4, we find the same two terms of exploitation and exploration. However in this case, profile vectors that are unknown are replaced by the estimator plus an additional term: ϵi,t in the former part and ϵi,t in the latter one. These terms aim at coping with profile uncertainty, considering confidence ellipsoids for these profiles as defined in proposition 5. ϵi,t is collinear with ˆβt 1. It is used to translate the estimator ˆxi,t so that β µi is upper-bounded. ϵi,t is collinear with ˆxi,t. It is used to translate the estimator ˆxi,t so that the V 1 t 1-norm ||µi||V 1 t 1 is upper-bounded. This bound enables us to derive an optimistic policy in the next section. 3.2. Samp Lin UCB In this section, we detail our policy for the setting of unknown profiles, called Samp Lin UCB, which is directly derived from the bound proposed in proposition 6. Its process is detailed in algorithm 1. In words, it proceeds as follows: 1. Initialization of the shared variables V and b used to estimate the mapping parameter β in lines 1 and 2. The d d matrix V is initialized with an identity matrix times the regularization parameter λ (the greater λ is, the more the parameter β will be constrained to have components close to zero). The vector b is initialized as a null vector of size d. 2. Initialization of the individual variables ni, ˆxi, Ri, Ni and Si for every action in K (lines 3 to 6). The two latter are additional scalar variables which enable efficient updates for the shared variables after context observations. Ni counts the number of times an action has been selected from the beginning, Si sums the rewards collected by selecting i from the beginning. 3. At each iteration t, for each action i Ot, observation of the sample xi,t (line 11) and update of individual variables ni, ˆxi and Ri for action i (line 12) and shared parameters V and b according to these new individual values for i (line 10 and 13). Since shared parameters are simple sums of elements, they can be simply updated by first removing old values (line 10) and then adding the new ones when updated (line 13). This is efficiently done without requiring an important memory load thanks to scalar variables Ni and Si. 4. Computation of the selection score si,t (line 21) for each action i according to equation 14 detailed below, and selection of the action associated with the highest selection score (line 23) (except in the early stages K where all actions are selected in turn to initialize their counts in line 16). 5. Collection of the associated reward (line 25) and update of variables Ni, Si, V and b according to this new outcome (lines 26 to 28). Lamprier, Gisselbrecht and Gallinari Algorithm 1: Samp Lin UCB V = λId d (Identity matrix of size d); 1 b = 0d (Null vector of size d); 2 for i K do 3 Ni = 0; Si = 0; 4 ni = 0; ˆxi = 0d; Ri = + ; 5 end 6 for t = 1..T do 7 Reception of Ot; 8 for i Ot do 9 V = V Ni ˆxiˆx i Ri ; b = b Si ˆxi Ri ; 10 Observation of xi,t; 11 ni = ni + 1; ˆxi = (ni 1)ˆxi + xi,t V = V + Ni ˆxiˆx i Ri ; b = b + Si ˆxi Ri ; 13 end 14 if t K then 15 Selection of it = t; 16 end 17 else 18 ˆβ = V 1b; 19 for i K do 20 Computation of si,t according to formula 14 ; 21 end 22 Selection of it = arg max i K si,t ; 23 end 24 Reception of rit,t; 25 Nit = Nit + 1; 26 Sit = Sit + rit,t; 27 V = V + ˆxit ˆx it Rit ; b = b + rit,t ˆxit Rit ; 28 end 29 The selection score si,t used in our policy for each action i at any step t is directly derived from proposition 6: si,t = (ˆxi,t + ϵi,t) ˆβt 1 + αt 1||ˆxi,t + ϵi,t||V 1 t 1 (14) For cases 2 and 3 of the profile delivery mechanism (see at the beginning of the section), there are actions i with ni,t = 0 in the early steps of the process. No sample has ever been observed for these actions, which is problematic for the computation of ρi,t,δ, and therefore Profile-Based Bandit with Unknown Profiles the computation of ϵi,t and ϵi,t. For the case 2, where we are not active on the process for observing contexts, this can be solved by simply ignoring actions until at least one sample of profile has been observed for them. For the case 3 however, samples are only obtained by selection. Thus, we need to force the observation of a sample for every action in the first steps. In that way, for actions with ni,t = 0 at any step t, we arbitrarily set si,t = + in order to make the policy favor actions without any knowledge to initialize the process. Thus, in that case, algorithm 1 selects the K actions in turn in the K first steps of the process. The selection score defined in formula 14 corresponds to the upper-bound of the expected reward for each action, as it is done in all UCB-based policies. Intuitively, it leads the algorithm to select actions whose profile estimator is either in an area with high potential, or is sufficiently uncertain to consider still likely that the action can be potentially useful. The goal is to quickly rule out bad actions, whose confidence ellipsoid does not include any potentially useful locations w.r.t. the current estimation of β. To better analyze the algorithm, we propose below a new formulation of the selection score. Proposition 7 The score si,t from equation 14 can be re-written in the following way: si,t = ˆx i,t ˆβt 1 + αt 1||ˆxi,t||V 1 t 1 + ρi,t,δ ||ˆβt 1|| + αt 1 si,t = (ˆxi,t + ϵi,t) ˆβt 1 + αt 1||ˆxi,t + ϵi,t||V 1 t 1 = ˆx i,t ˆβt 1 + ρi,t,δ ˆβ t 1 ||ˆβt 1|| ˆβt 1 + αt 1||ˆxi,t + ρi,t,δˆxi,t λ||ˆxi,t||V 1 t 1 ||V 1 t 1 = ˆx i,t ˆβt 1 + ρi,t,δ||ˆβt 1|| + αt 1 λ||ˆxi,t||V 1 t 1 ||ˆxi,t||V 1 t 1 = ˆx i,t ˆβt 1 + αt 1||ˆxi,t||V 1 t 1 + ρi,t,δ ||ˆβt 1|| + αt 1 This new formulation of the selection score allows one to take a different look at the algorithm behavior. The first part of the score ˆx i,t ˆβt 1 + αt 1||ˆxi,t||V 1 t 1 is similar to a score that would use the classical OFUL algorithm (although with a different construction of Vt), with an exploitation term and a classical exploration term considering the uncertainty on the estimator of β. But it exhibits an additional part ρi,t,δ ||ˆβt 1|| + αt 1 directly proportional to the coefficient ρi,t,δ and thus enables some exploration w.r.t. the uncertainty of the profile estimators. This highlights the premium granted to less observed actions. Note that for the case 1, this additional part is the same for every action. It therefore could be removed from the score since it does not permit to discriminate some action w.r.t any other one. However, this new exploration term is particularly useful for the Lamprier, Gisselbrecht and Gallinari case 3, where observations of samples are directly connected to the selection policy, since it prevents from moving aside some optimal actions that have unluckily provided only not promising samples in the early steps. To demonstrate that considering uncertainty on profiles is crucial in that case, let us consider a scenario where the optimal action i gets a null vector as the first profile sample. Then, in a setting where all profile samples are in [0, L]d and all rewards are in [0, + ], it suffices that a sub-optimal action i gets a non-null vector as the first profile sample and a positive value as the first reward to lead to a linear regret from a given step. Indeed, since we only get samples with all components greater or equal than 0, i will never get a null vector as a profile estimator. On the other hand, while i is not selected, its profile estimator cannot change from the null vector. Thus, with a naive algorithm that would not include translations w.r.t. uncertainty of profiles, we would have si ,t = ˆx i ,t ˆβt 1 +α||ˆxi ,t||V 1 t 1 = 0 for all t until it = i . Now, the least square estimator of β approximates observed reward values from estimated profiles. Since we have at least one non-null reward associated with a non-null profile estimator, β will always output a positive expected reward at least for one action. Thus, there is always an action i with si ,t > 0, which prevents from selecting the optimal action until the end of the process. This shows that a naive algorithm is not well fitted here, since it is likely to stay stuck on sub-optimal actions because of wrong knowledge about profiles. The point is now to show that the proposed additional term enables to solve this problem and ensures a sub-linear pseudo regret for our profile-based bandit algorithm with unknown profiles. 3.3. Regret The following proposition establishes an upper bound for the cumulative pseudo-regret of the Samplin UCB algorithm proposed above. This is a generic bound for which no assumption is done on the process generating Ot at each step t. Proposition 8 (Generic bound) By choosing λ max(1, L2/ R2), with a probability greater than 1 3δ, the cumulative pseudo-regret of the algorithm Samp Lin UCB is upperbounded by: d λ log 1 + TL2/λ 2d log 2d T 2 d log 1 + TL2/λ R2 + L2S2 log 1 + TL2 Proof Available in appendix A.6. Profile-Based Bandit with Unknown Profiles The study of dominant factors of the bound given above enables to obtain the following proposition for the three considered settings of the context delivery process, where we removed dependencies on L, λ, R and S to simplify the notations. Proposition 9 (Bounds for the three profile delivery settings) For each of the three considered settings for the profile delivery process, the upper bound of the cumulative pseudoregret is2: For the case 1, with a probability greater than 1 3δ: ˆRT = O d log T T log(T) (17) For the case 2, with a probability greater than (1 3δ)(1 δ), and for T 2 log(1/δ)/p2: where p is the probability of profile delivery for any action at each step. For the case 3, with a probability greater than 1 3δ: Proof The proofs for these three bounds are respectively given in appendix A.7.1, A.7.2 and A.7.3. Thus, in every setting our Samplin UCB algorithm ensures a sub-linear upper bound for its cumulative pseudo-regret. The bound given for case 2 owns an additional dependency in p, the probability of context delivery for each action at each step. Obviously, the higher this probability is, the faster the uncertainty about profiles decreases. Note that this bound for case 2 is only valid from a given number of iterations inversely proportional to p2, since it requires a minimal number of observations to hold. The bound for case 3 owns a dependency in the number of available actions K. This comes from the fact that only the selected action reveals its profile at each step, which re-introduces the need of considering each action a minimal number of times, as it is the case with traditional stationary approaches such as the classical UCB algorithm. However, as we show in our experiments below, the use of the structure of the actions, which enables some common learning of reward distributions, leads to greatly better results than existing stationary algorithms in various cases. 2. O renders the relation dominated by , which means that f = O(g) implies that there exists a strictly positive constant C such that asymptotically we have: |f| C|g|. Lamprier, Gisselbrecht and Gallinari 3.4. Extension to the multiple-plays setting This short section extends our algorithm for the multiple-plays setting, where multiple actions are chosen at each step. Rather than only selecting a single action it at any step t, the algorithm has now to select a set Kt K of k 1 actions for which it gets rewards. Algorithm 1 is therefore adapted to this setting, by simply selecting the k best actions at each step (those that get the best selection scores w.r.t. formula 14) rather than only the best one (line 23 of the algorithm). The aim is still to maximize the cumulative reward through time, where all rewards at any step t are simply summed to form the collected reward at step t (although other Lipschitz functions could have been considered for the collective reward construction from the k individual ones, such as proposed in Chen et al. (2013)). Definition 1 The cumulative pseudo-regret of our setting of bandit with multiple plays is defined as: i K µ i β P i Kt µ i β (20) with K the set of k optimal actions, i.e. the k actions with the highest values µ i β. Proposition 10 (Generic bound for the multiple-plays setting) By choosing λ max(1, L2/ R2), with a probability greater than 1 3δ, the cumulative pseudo-regret for our Samp Lin UCB algorithm with multiple selections is upper bounded by: d λ log 1 + Tk L2/λ 2d log 2d T 2 d log 1 + Tk L2/λ R2 + L2S2 log 1 + Tk L2 Proof The proof is available in appendix A.8. Equivalent bounds for the three cases of context delivery can be directly derived from this new generic bound by applying the same methods as in the previous section. This allows us to apply our algorithm for tasks where multiple actions can be triggered at each step, such as in the data capture task considered in our experiments in section 4.2. 4. Experiments This section is divided in two parts. First, we propose a series of experiments on artificial data in order to observe the behavior of our approach in well-controlled environments. Then, we give results obtained on real-world data, for a task of data capture from social media. Profile-Based Bandit with Unknown Profiles 4.1. Artifical Data 4.1.1. Protocol Data Generation: In order to assess the performances of the Samplin UCB algorithm, we propose to first experiment it in a context of simple selection (k = 1) on artificially generated data. For that purpose, we set the horizon T to 30000 iterations, the number of available actions K to 100 and the size of the profile space to d = 5 dimensions. Then, we sampled a mapping vector β randomly in h S/ d id , in order to fulfill the ||β|| S = 1 condition. For each arm i, we then sampled a random vector µi uniformly in h L/ d id with L = 1. Finally, for each iteration t {1, ..., T}, we proceeded as follows to generate simulated data: 1. For each action i {1, ..., K}, we sampled a vector xi,t from the multivariate Gaussian N(µi, σ2I). Note that, in order to assess the influence of profile samples variations on the performances of Samp Lin UCB, we tested different values for σ {0.5, 1.0, 2.0}. Moreover, in order to guarantee that ||xi,t|| L = 1, while still getting sampled centered on µi, the Gaussian is truncated symmetrically around µi. This is illustrated by figure 3 for d = 1, where hatched areas correspond to excluded values. On the left is given the case with µi > 0 and on the right the case with µi < 0; 2. For each action i {1, ..., K}, we sampled a reward ri,t from a Gaussian with mean µ i β and variance R2 = 1; 0 𝜇 𝐿 - 𝐿 2 𝜇 - 𝐿 (a) Case µ > 0 0 𝜇 𝐿 - 𝐿 2 𝜇+ 𝐿 (b) Case µ < 0 Figure 3: Profile Samples Generation Process: Truncated Gaussians To emhasize the need of exploration on profiles, note that we set to null vectors the profile samples of the 100 first steps of each dataset. 100 datasets have been generated in such a way. The results given below correspond to averages on these artificial datasets. Lamprier, Gisselbrecht and Gallinari Experimented Policies: We propose to compare Samp Lin UCB to the following bandit policies: UCB: the well-known UCB approach (Auer et al., 2002), which selects at each step the action with the best upper-confidence bound, estimated w.r.t. past rewards of actions, without any assumption about some latent structure of the actions; UCBV: the UCBV algorithm (Audibert et al., 2009) is an extension of UCB, where the variance of rewards for each action is included in the selection scores to lead the policy to ground their estimations in an higher number of observations for noisier actions. UCB δ: the UCB δ algorithm (Abbasi-Yadkori et al., 2011) is a variant of UCB where optimism is ensured via a concentration inequality based on auto-normalised process (de la Pe na et al., 2009), with a confidence level of 1 δ. In our experiments, we set δ = 0.05; Thompson: the Thompson Sampling algorithm (Thompson, 1933) introduces randomness in the exploration process by sampling reward expectations from their posterior at each time-step, following a Gaussian assumption of the rewards; MOSS: a variant of UCB, which usually obtains better results than the classical UCB but requires the knowledge about the horizon T (Audibert and Bubeck, 2009); None of these approaches use any side information. Therefore, the only noise they have to deal with comes from the variance R2 of the Gaussian distributions of rewards. For Samp Lin UCB an additional difficulty comes from the variations of the observed samples of profiles. The point is therefore to know whether these samples can be leveraged to exhibit some structure of the actions, that can benefit to stationary bandit tasks, despite such variations. Additionaly, the following two contextual baselines are considered in our experiments to analyze the performances of our approach: Lin UCB: the very famous contextual approach that assumes a linear mapping between observed contexts and rewards (Li et al., 2010). In our case, observed profile samples correspond to the contexts that Lin UCB takes into account in its selection policy. We consider this baseline in the interesting setting where contexts are only delivered for the selected arms (case 3 described above). In this setting, non selected arms deliver null context vectors for the next step; Mean Lin UCB: this baseline corresponds to our approach but without the exploration term w.r.t. the profiles. Empirical means are considered as true profiles at each step of the process (this comes down to set ρi,t,δ to 0 for every arm and every step). As discussed above (see the last paragraph of section 3.2), such a baseline cannot guarantee a sub-linear regret since it can infinitely stay stuck on sub-optimal arms, but an empirical evaluation of its performances is useful to understand the benefits of the proposed approach. To analyze the performances of Samp Lin UCB, we implement the three scenarios studied in previous sections. In the following, our approach is denoted Samp Lin UCB p=
, where p Profile-Based Bandit with Unknown Profiles corresponds to the probability for any action to get a sample of its profile at every iteration. Different values for p are considered: p {0, 0.005, 0.01, 1}. Note that p = 1 corresponds to the case 1, while p = 0 refers to the case 3. In this latter instance, as considered in the previous sections, the samples delivery process is replaced by the ability to observe samples for the selected actions at each iteration. Note also that, for clarity and analysis purposes, in instances with p > 0 we do not observe samples for the selected actions (which exactly follows the cases studied in the previous section)3. In every instance, we set δ = 0.05 for these experiments. Also, to avoid a too large exploration on profiles in the early steps, we multiplied each ρi,t,δ by a 0.01 coefficient, which still guarantees a sub-linear regret in the limit. 4.1.2. Results Figures 4(a), 4(b) and 4(c) report the evolution of the cumulative pseudo-regret through time for the tested policies, for σ values (variance of profile samples) respectively set to σ = 2.0, σ = 1.0 and σ = 0.5. Note that the curves of UCB, UCB-δ, UCBV, Thompson and MOSS are identical in every plot since their performances do not depend on the profile samples. We first notice from these plots that UCB-δ and UCBV do not provide good results on these data. It appears that these two policies over-explore during the whole bandit process. Thompson and MOSS obtain better results in average, but still far from the best contextual approach Samp Lin UCBp=1. This confirms that using profiles of arms can be greatly advantegeous when there exist a linear correlation between these profiles and the associated rewards. In this setting (which corresponds to the case 1 studied above), the profiles are discovered step by step, but since we get a sample for every arm at each iteration, the estimators quickly converge towards the true profiles. This explains the very good results for this easy setting, and why there is nearly no differences in the results of Samp Lin UCBp=1 for the three considered sample variances. Let us now focus on the results provided by our Samp Lin UCB algorithm when only a subset of arms gets profile samples at each step of the process. As expected, the more the algorithm observes samples, the better it performs. However, we remark that Samp Lin UCBp=0 obtains better results than Samp Lin UCBp=0.005 for σ = 2 and σ = 1, and even better than Samp Lin UCBp=0.01 when σ = 2 (while observing the same rate of samples as in this latter setting). This denotes a stronger robustness to the profile sample variance. By dynamically selecting the arms to observe, it is able to focus on the improvement of useful estimators rather than getting as many samples but for randomly selected arms (and potentially for arms that could be quickly discarded). In this interesting setting, Samp Lin UCB always outperforms non-contextual approaches for the studied sample variances, while we note a significant improvement of the results when the variance is low. At last, we can note the very weak - near random - results obtained by Lin UCB, which directly bases its strategy on the observed samples. More interesting are the weak results obtained by Mean Lin UCB, which exhibits a linear regret. This emphasizes the crucial role of the profile exploration term of Samp Lin UCB: While Samp Lin UCBp=0 is able to reconsider 3. Note that we could easily imagine tasks, which correspond to some mix of cases 2 and 3, where we both get samples from an external process and for the selected actions. For such cases, we can reasonably assume better results than those reported below for cases 2 and 3, since the process would benefit from both sample sources. Lamprier, Gisselbrecht and Gallinari bad profiles observed in the early steps of the process, Mean Lin UCB usually stays stuck on the first actions that provided a non-nul sample associated with a positive reward. If they are lucky, Lin UCB and Mean Lin UCB can exhibit good performances on some instances, but they are clearly not well fitted for the bandit setting considered in this paper. 0 5000 10000 15000 20000 25000 30000 Time step Cumulative Regret Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=0.005 Samp Lin UCBp=0.01 Samp Lin UCBp=1 0 5000 10000 15000 20000 25000 30000 Time step Cumulative Regret Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=0.005 Samp Lin UCBp=0.01 Samp Lin UCBp=1 0 5000 10000 15000 20000 25000 30000 Time step Cumulative Regret Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=0.005 Samp Lin UCBp=0.01 Samp Lin UCBp=1 (c) σ = 0.5 Figure 4: Cumulative pseudo-regret through time on artificial data with different settings for the profile samples delivery process (from very noisy samples on the left, to samples with low variance on the right). To conclude, it appears from these first experiments that our Samp Lin UCB algorithm, while dealing with a doubled uncertainty (both on the mapping parameters and the profile estimators), is able to leverage the latent structure that it discovers through time from noisy samples of the profiles. 4.2. Real World Experiments: Social Data Capture In this section we propose to apply our Samp Lin UCB algorithm to the task of dynamic data capture from Twitter introduced in (Gisselbrecht et al., 2015). According to a given information need, the aim is to collect relevant data from the streaming API proposed by Twitter. This API provides messages published by users on the social network in realtime. In this setting, each user account is associated with a stream of messages that can be monitored. However, for various reasons (notably w.r.t. constraints set by the social media), it is not possible to collect the whole activity of the social media. Only streams of messages published by a subset of k users can be monitored simultaneously (k << K). The aim is therefore to focus on users that are the most likely to publish messages that fit with the data need. The difficulty is that we do not know anything about the users beforehand, everything must be discovered during the capture process. We have thus to deal with an exploitation/exploration problem that suits well with the bandit setting studied in this paper. Given a time period divided in T steps, the agent has to select, at each iteration t {1, ..., T} of the process, a subset Kt of k user accounts to follow, among the whole set of possible users K (Kt K). Given a relevance score ri,t assigned to the content posted by user i Kt during iteration t of the process (the set of tweets he posted during iteration t), the aim is to select at each iteration t the set of user accounts that maximize the sum of Profile-Based Bandit with Unknown Profiles collected scores: max (Kt)t=1..T i Kt ri,t (22) 4.2.1. Rewards In our experiments, we attempt to focus on users that have a great impact on some specified thematic. The Follow Streaming API of Twitter provides in real-time not only tweets posted by the followed users, but also all the re-tweets and replies to these users other users post on the network. Our reward function takes all of these messages into account to provide a reward score ri,t for each user i Kt after each capture period t: ri,t = tanh ω Ωi,t gγ(ω) where Ωi,t contains the original messages from i, the re-tweets of messages from i and the replies to i during the period t, and gγ is a function returning 1 if the content of the message as argument is judged as belonging to the desired thematic γ, 0 otherwise. To build this function gγ, we trained a SVM topic classifier on the 20 Newsgroups dataset (with TF bag of words representations of the texts, after stemming via the Porter Stemmer algorithm). We finally focus on 4 different topics γ: Politics, Religion, Science and Sport. Four different reward functions are therefore considered in the following (one for each topic). 4.2.2. Profiles Following the setting of our profile based bandit, we assume that each user i of the social network is associated to an unknown vector µi corresponding to its profile. In these experiments, we assume that the profile of a user i corresponds to the mean of its content distribution: µi = lim T 1 T PT t=1 xi,t, where xi,t is a given representation of the content posted by i during step t 1: xi,t+1 = f( X ω Ψi,t ω) (24) where Ψi,t Ωi,t contains all messages posted by i during step t and ω Rm (with m the size of the vocabulary) is a TF bag of words representation of a message (after stemming via the Porter Stemmer algorithm). The function f aims at reducing the dimension of the representations, since the dimension d of the profile samples is the main factor of complexity in our algorithm, due to the required d d matrix inversions. In order to reduce the dimension of profiles, we used a Latent Dirichlet Allocation method specifically designed for short texts (Hong and Davison, 2010), which aims at modeling texts as a mixture of topics. We set the number of topics to d = 30 and learned the LDA model on a preliminary 3-days random capture from Twitter. Lamprier, Gisselbrecht and Gallinari 4.2.3. Datasets In order to be able to test different policies and simulate a real time decision process several times, we propose to conduct our experiments on offline datasets: USElections: dataset containing 3 587 961 messages produced by 5000 users during the ten days preceding the US presidential elections in 2012. The 5000 chosen accounts are the first ones who used either Obama , Romney or #USElections from a preliminary random capture on Twitter. Olympic Games: dataset containing 15 010 322 messages produced by 5000 users in August 2016 during a period of three weeks covering the Olympic Games of Rio. The 5000 chosen accounts are the ones that were observed to use the most many hashtags #Rio2016 , #Olympics , #Olympics2016 or #Olympicgames within a period of preliminary random capture of three days before the Olympic Games. Brexit: dataset containing 2 118 235 messages produced by 5000 users during the first week of October 2016. The 5000 chosen accounts are the first ones who used #Brexit from a preliminary random capture from Twitter. 4.2.4. Results As done in (Gisselbrecht et al., 2015), we set k, the number of listened users at each time step, to 100, and the size of an iteration to 100 seconds. In these experiments, we assume L = S = R = 1 and we set δ = 0.05 as done with artificial data. Figures 5, 6 and 7 give the evolution of the cumulative reward through time for the datasets USElections, Olympic Games and Brexit respectively. In every case, we consider the four reward functions corresponding to the four topics Politic, Religion, Science and Sport. In order to lighten the plots, we only give in these figures the results of Samplin UCB for p = 0 and p = 1. In every plot, our algorithm is compared to the same baselines as described in section 4.1, where the policies are extended for the multiple-plays setting (as done in (Gisselbrecht et al., 2015)). A first important observation from these plots is that in every setting, our algorithm Samp Lin UCB obtains better results than every other policy, even CUCBV, the extension of UCBV for the multiple-plays setting. Although CUCBV has demonstrated good performances for the task of social data capture (Gisselbrecht et al., 2015), where a high variance can be observed in the contents posted by users, the use of profiles associated to users of the networks enables an even more efficient exploration process. Globally, same manner as with artificial data, the performances of our approach increase with p, with a maximum reached when p = 1. Note however that the setting p = 0 (the case 3 studied above) is the most realistic one, since it does not use anything but the content collected by followed users at each step, which is the case in practice when collecting data from a social media such as Twitter. Interestingly, even for this setting the results obtained are always better than those of every compared approach. The improvement w.r.t. CUCBV is less significant for the Sport reward function for which greatly more rewards exist in the datasets (greatly more messages are categorized as sport), which allows non-contextual approaches to quickly Profile-Based Bandit with Unknown Profiles 0 1000 2000 3000 4000 5000 6000 7000 8000 Time step Cumulative Reward Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1 (a) Politic 0 1000 2000 3000 4000 5000 6000 7000 8000 Time step Cumulative Reward Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1 (b) Religion 0 1000 2000 3000 4000 5000 6000 7000 8000 Time step Cumulative Reward Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1 (c) Science 0 1000 2000 3000 4000 5000 6000 7000 8000 Time step Cumulative Reward Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1 Figure 5: Evolution of the cumulative reward through time on the USElections dataset, according to the three considered reward functions Politic, Religion, Science and Sport. collect knowledge about reward expectations of users. But for every other reward function Samp Lin UCBp=0 always obtained results comprised between 1.5 and 3 times the ones obtained with the best non-contextual approach. Note also the crucial role of the exploration term for profile discovery ρ, since Mean Lin UCB, which considers current empirical sample means as true profiles, always obtains greatly lower results than Samp Lin UCBp=0 (except for the Science reward function on the Olympic Games and the Brexit datasets, where it benefits from good rewards and profile samples observed for some useful users in the initialization steps of the process). At last, as expected, Lin UCB, which directly bases its selection policy on profile samples observed at the current step, obtains very bad results (near random). Since obtaining nul context vectors for every user not selected at the previous step, its selection mechanism very early focuses on a given set of users without ever reconsidering the others (except in the rare cases of context samples leading to negative reward expectations according to β). All these results highlight the interest of the proposed approach, based on confidence balls of the arm profiles, for tasks where contexts are only observed when the arms are selected . Lamprier, Gisselbrecht and Gallinari 0 2500 5000 7500 10000 12500 15000 17500 Time step Cumulative Reward Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1 (a) Politic 0 2500 5000 7500 10000 12500 15000 17500 Time step Cumulative Reward Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1 (b) Religion 0 2500 5000 7500 10000 12500 15000 17500 Time step Cumulative Reward Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1 (c) Science 0 2500 5000 7500 10000 12500 15000 17500 Time step Cumulative Reward Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1 Figure 6: Evolution of the cumulative reward through time on the Olympic Games dataset, according to the three considered reward functions Politic, Religion, Science and Sport. Figures 8, 9 and 10 give the relative final cumulative rewards for different settings of the sample delivery process on the datasets USElections, Olympic Games et Brexit respectively (each score is normalized according to the score obtained when p = 1). Here we still observe that performances tend to decrease with p, for settings where p > 0. However it must be noticed that the setting p = 0 obtains results very close to other settings: it always obtains at least 80% of the final cumulative reward obtained when every user delivers a sample at each step of the process (p = 1). Better, in many cases Samp Lin UCBp=0 succeeds in obtaining an higher final cumulative reward than p = 0.01 and p = 0.02. This is particularly true for the Brexit dataset where the dynamic selection of samples to be delivered appears very effective. On that dataset, Samp Lin UCBp=0 even usually reaches the performances of Samp Lin UCBp=0.05, while observing greatly less profile samples at each step (only 100 over 5000 at each iteration, which corresponds to the observation rate of the setting p = 0.02). While settings with p > 0 are greatly favored by the fact that they do not need to play an arm to get a sample of its profile, Samp Lin UCBp=0 is not only active for the discovery of the mapping parameters, but also for the estimation of profiles. Its knowledge about profiles Profile-Based Bandit with Unknown Profiles 0 1000 2000 3000 4000 Time step Cumulative Reward Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1 (a) Politic 0 1000 2000 3000 4000 Time step Cumulative Reward Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1 (b) Religion 0 1000 2000 3000 4000 Time step Cumulative Reward Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1 (c) Science 0 1000 2000 3000 4000 Time step Cumulative Reward Random CUCB UCB δ CUCBV MOSS Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1 Figure 7: Evolution of the cumulative reward through time on the Brexit dataset, according to the three considered reward functions Politic, Religion, Science and Sport. is directly connected to its selection strategy, with a selection score that favors promising actions with high uncertainty about their profile. This leads to an algorithm that efficiently deals with a trade-offbetween exploitation of good actions and exploration on both the mapping parameter and the profiles of actions. 5. Conclusion In this paper, we focused on structured stochastic bandits, where rewards depend on some constant profile associated with actions. More specifically, we introduced the case where the associated profiles are unknown beforehand, and must be discovered from samples delivered during the process. This setting implies a doubled uncertainty, both on profile estimators and on reward predictors, for which we designed a dedicated algorithm, named Samplin UCB, that seeks at leveraging the structure of the unknown profiles in its exploration process. Various settings for the profile samples delivery process have been considered, for which we gave theoretical convergence guarantees. Finally, experiments on both artificial data Lamprier, Gisselbrecht and Gallinari p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0 Final Cumulative Reward (a) Politic p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0 Final Cumulative Reward (b) Religion p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0 Final Cumulative Reward (c) Science p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0 Final Cumulative Reward Figure 8: Final normalized cumulative rewards for Samp Lin UCB on USElections with different p settings. p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0 Final Cumulative Reward (a) Politic p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0 Final Cumulative Reward (b) Religion p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0 Final Cumulative Reward (c) Science p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0 Final Cumulative Reward Figure 9: Final normalized cumulative rewards for Samp Lin UCB on Olympic Games with different p settings. and a task of data capture from social networks demonstrate the very good behavior of the proposed approach. Ongoing works concern the inclusion of a non-stationary part in the selection strategy, where profiles may vary over time according to some evolving latent state of the actions. Acknowledgments This research work has been carried out in the framework of the Technological Research Institute System X, and therefore granted with public funds within the scope of the French Program Investissements d Avenir . Appendix A. Appendix A.1. Proof of proposition 1 The two following lemmas directly come from the definition of the sub-gaussian variables. Lemma 1 Let X be a random variable centered on 0. Then, X is said sub-gaussian with constant R if one of the two equivalent following conditions holds: Laplace Condition: R > 0, λ R, E[eλX] e R2λ2/2 Profile-Based Bandit with Unknown Profiles p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0 Final Cumulative Reward (a) Politic p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0 Final Cumulative Reward (b) Religion p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0 Final Cumulative Reward (c) Science p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0 Final Cumulative Reward Figure 10: Final normalized cumulative rewards for Samp Lin UCB on Brexit with different p settings. Sub-Gaussian Tail: R > 0, γ > 0, P(|X| γ) 2e γ2/(2R2) Lemma 2 Let X1 and X2 be two sub-gaussian variables with respective constant R1 and R2. Let α1 and α2 be two real scalars. Then the variable α1X1 + α2X2 is sub-gaussian too, with constant p α2 1R2 1 + α2 2R2 2. The lemma 3 can be deduced from the application of the lemma 1 in the context of our specific problem of profile-bandit with unknown profiles, where the deviation of the profile estimators is a sub-gaussian variable. Lemma 3 Let us assume that for any i all samples xi,t Rd observed for i at every step t > 0 are iid with mean µi Rd. Let us also assume that ||xi,t|| L for every i and t, and that ||β|| S. Then, for every i, and at each step t, β ϵi,t is sub-gaussian with constant LS ni,t (with ϵi,t = µi ˆxi,t). Proof By using the Cauchy-Schwarz inequality, for all i and at each step t we have: |x i,tβ| ||β||||ˆxi,t|| LS. Then, given that for all i, all samples xi,t are iid and E[xi,t] = µi, we can apply the Hoeffding inequality to the random variable β ˆxi,t with mean µ i β: γ > 0, P |β ˆxi,t β µi| > γ = P |β ϵi,t| > γ 2e ni,tγ2 which allows us to say that ϵ i,tβ is sub-gaussian with constant LS ni,t . We finally use the lemma 2 with the sum of β ϵi,t and ηi,s to prove the proposition 1, which establishes the random variable β ϵi,t + ηi,s is conditionally sub-gaussian with constant Ri,t = Lamprier, Gisselbrecht and Gallinari A.2. Proof of the proposition 3 To lighten notations, we removed the dependence on t in A and X. We have: ˆβt 1 = arg min β 1 Ris,t (β ˆxis,t ris,s)2 + λ||β||2 = (X AX + λI) 1X AY = (X AX + λI) 1X A(Xβ + η ) = (X AX + λI) 1X Aη + (X AX + λI) 1(X AX + λI)β (X AX + λI) 1λIβ = (X AX + λI) 1X Aη + β λ(X AX + λI) 1β Then, the following main arguments of this proof come from the theory of autonormalized process (de la Pe na et al., 2009). By using a similar method to the one used in (Abbasi-Yadkori et al., 2011), we get: ||ˆβt 1 β||Vt 1 ||X Aη ||V 1 t 1 + λ||β||V 1 t 1 with Vt 1 = λI + X AX, which is semi-definite positive since λ > 0. Since ||β|| S and ||β||2 V 1 t 1 ||β||2/λmin(Vt 1) ||β||2/λ, we get: ||ˆβt 1 β||Vt 1 ||X Aη ||V 1 t 1 + By using the proposition 1 of (Abbasi-Yadkori et al., 2011), and since we know from proposition 1 that η s Ris,t is sub-gaussian with constant 1, for any δ > 0, with a probability of at least 1 δ, for every t 0 we have: ||X Aη ||V 1 t 1 = || η s Ris,t ˆxis,t||V 1 t 1 v u u t2 log det(Vt 1)1/2det(λI) 1/2 d log 1 + t L2/λ A.3. Proof of the proposition 4 Proof Let us assume that the inequality of the proposition 3 is valid. Therefore, we have for all t > 0 and every i K: Profile-Based Bandit with Unknown Profiles ˆβ t 1µi + αt 1||µi||V 1 t 1 β µi =(ˆβt 1 β) µi + αt 1||µi||V 1 t 1 ||ˆβt 1 β||Vt 1||µi||V 1 t 1 + αt 1||µi||V 1 t 1 αt 1||µi||V 1 t 1 + αt 1||µi||V 1 t 1 =0 A.4. Proof of the proposition 5 With ||xi,t|| L, we know that, for any j [1..d], |xj i,t| L. Thus, we can apply the Hoeffding inequality to each dimension of the profile estimators: γ > 0 : P |ˆxj i,t µj i| > γ/ d 2e ni,tγ2 Then, by using the fact that ||ˆxi,t µi|| 1 i=1 |ˆxi i,t µi i| and the uniform bound property, we get: P (||ˆxi,t µi|| γ) 1 2de ni,tγ2 Thus, for every i {1, . . . , K} and every step t > 0, with a probability of at least 1 δ/t2: ||ˆxi,t µi|| L 2d ni,t log 2dt2 This bound for the deviation of the profile estimator can be less restrictive than the base assumption which states that for any i K and t 0, ||xi,t|| L. From this assumption we indeed know that, ||ˆxi,t|| L, ||µi|| L and thus ||ˆxi,t µi|| 2L . We therefore consider the following bound that holds for any t 0 with a probability greater than 1 δ/t2: ||ˆxi,t µi|| min(L 2d ni,t log 2dt2 , 2L) = ρi,t,δ A.5. Proof of the proposition 6 Proof Let us assume that the inequality of the proposition 5 is valid. Therefore, we have: ||µi||V 1 t 1 ||ˆxi,t||V 1 t 1 ||µi ˆxi,t||V 1 t 1 ||µi ˆxi,t||/ λ. Thus: ||µi||V 1 t 1 ||ˆxi,t||V 1 t 1 + ρi,t,δ/ λ = ||ˆxi,t + ϵi,t||V 1 t 1, with ϵi,t = ρi,t,δˆxi,t/( λ||ˆxi,t||V 1 t 1). Lamprier, Gisselbrecht and Gallinari |ˆβ t 1(ˆxi,t µi)| ||ˆβt 1||||(ˆxi,t µi)|| ˆβ t 1ϵi,t, with ϵi,t = ρi,t,δ ˆβt 1/||ˆβt 1||. By using these two results and the uniform bound property, we can proof the proposition: ˆβ t 1(ˆxi,t + ϵi,t) + αt 1||ˆxi,t + ϵi,t||V 1 t 1 β µi =(ˆβt 1 β) µi + αt 1||ˆxi,t + ϵi,t||V 1 t 1 ˆβ t 1(µi ˆxi,t) + ˆβ t 1ϵi,t ||ˆβt 1 β||Vt 1||µi||V 1 t 1 + αt 1||ˆxi,t + ϵi,t||V 1 t 1 + ˆβ t 1(ˆxi,t µi + ϵi,t) αt 1||µi||V 1 t 1 + αt 1||ˆxi,t + ϵi,t||V 1 t 1 + ˆβ t 1(ˆxi,t µi + ϵi,t) αt 1||ˆxi,t + ϵi,t||V 1 t 1 + αt 1||ˆxi,t + ϵi,t||V 1 t 1 + ˆβ t 1(ˆxi,t µi + ϵi,t) A.6. Proof of the proposition 8 Lemma 4 For every i K and t > 0, with a probability of at least 1 δ/t2 δ, we have: ˆβ t 1(ˆxi,t + ϵi,t) + αt 1||ˆxi,t + ϵi,t||V 1 t 1 β µi 2αt||ˆxi,t||V 1 t 1 + 4 λ + S)ρi,t,δ Proof As for proposition 6, we assume that the inequality of proposition 5 holds. Then, noting that ||ϵi,t||V 1 t 1 ||ϵi,t||/ λ = ρi,t,δ/ λ and || ϵi,t||V 1 t 1 = ρi,t,δ/ λ, we have: ˆβ t 1(ˆxi,t + ϵi,t) + αt 1||ˆxi,t + ϵi,t||V 1 t 1 β µi =(ˆβt 1 β) µi + αt 1||ˆxi,t + ϵi,t||V 1 t 1 ˆβ t 1(µi ˆxi,t) + ˆβ t 1ϵi,t ||ˆβt 1 β||Vt 1||µi||V 1 t 1 + αt 1||ˆxi,t + ϵi,t||V 1 t 1 + ˆβ t 1(ˆxi,t µi + ϵi,t) 2αt 1||ˆxi,t + ϵi,t||V 1 t 1 + 2||ˆβt 1||Vt 1||ϵi,t||V 1 t 1 2αt 1||ˆxi,t||V 1 t 1 + 2αt 1|| ϵi,t||V 1 t 1 + 2(αt 1 + S λ)||ϵi,t||V 1 t 1 2αt 1||ˆxi,t||V 1 t 1 + 4(αt 1/ λ + S)ρi,t,δ Profile-Based Bandit with Unknown Profiles Lemma 5 For every t, with a probability of at least 1 δ/t2 δ, the instantaneous pseudoregret of the algorithm Samp Lin UCB, noted regt = β µi β µit, is upper-bounded as: regt 2αt 1||ˆxit,t||V 1 t 1 | {z } λ + S)ρit,t,δ | {z } Proof The previous lemma allows us to say that for all t: sit,t β µit + 2αt 1||ˆxi,t||V 1 t 1 + 4(αt 1/ λ + S)ρi,t,δ Now, given the selection policy of Samp Lin UCB and the proposition 6, we get for all t: sit,t si ,t β µi . Thus: regt sit,t β µit 2αt 1||ˆxi,t||V 1 t 1 + 4(αt 1/ λ + S)ρi,t,δ Next, we use the uniform bound property, the fact that P δ t2 = δ(π2/6 1) δ and the fact that in the proposition 3 the bound is uniform (i.e., it holds for all step t simultaneously) to say that, with a probability of at least 1 2δ: t=1 reg(2) t C + t=2 4(αt 1/ λ + S)ρit,t,δ t=2 4(αt 1/ 2d ni,t log 2dt2 C + 4L(αT / 2d log 2d T 2 On another hand, we have: t=1 reg(1) t t=1 2αt 1||ˆxit,t||2 V 1 t 1 t=1 4α2 t 1||ˆxit,t||2 V 1 t 1 t=1 ||ˆxit,t||2 V 1 t 1 Now, it remains to upper-bound the term TP t=1 ||ˆxit,t||2 V 1 t 1. For that purpose, we introduce the following notation: νi,t,δ = L p 2d/(ni,t) log (2d T/δ). By using again the Hoeffding inequality, with a probability of at least 1 δ/T, we get for all s t 1: Lamprier, Gisselbrecht and Gallinari ||ˆxis,t|| ||µis|| + νis,t,δ With ˇϵi,t = min(νi,t,δ, ||µi||)µi/||µi||, we get, for all s t 1: Ris,s||µis ˇϵis,s|| 1/ p Ris,t||ˆxis,t|| Then, we arrive to: Vt 1 = λI + 1 Ris,t ˆxis,tˆx is,t λI + 1 Ris,s (µis ˇϵis,s)(µis ˇϵis,s) = Wt 1 Which means that for every vector x: ||x||V 1 t 1 ||x||W 1 t 1. Let us now define ˆϵi,t = νi,t,δµi/( λ||µi||W 1 t 1), such that for all s t 1: ||ˆxis,t||W 1 t 1 ||µis + ˆϵis,s||W 1 t 1 ||ˆϵis,s||W 1 t 1 = νis,s,δ/ Finally, by using the uniform bound property and the fact that TP δ T = δ, with a probability of at least 1 δ: t=1 ||ˆxit,t||2 V 1 t 1 t=1 ||ˆxit,t||2 W 1 t 1 t=1 ||µit + ˆϵit,t||2 W 1 t 1 t=1 ||µit + ˆϵit,t ˇϵit,t + ˇϵit,t||2 W 1 t 1 t=1 ||µit ˇϵit,t||2 W 1 t 1 + t=1 ||ˆϵit,t||2 W 1 t 1 + t=1 ||ˇϵit,t||2 W 1 t 1 t=1 ||µit ˇϵit,t||2 W 1 t 1 + 2 t=1 ν2 it,t,δ t=1 ||µit ˇϵit,t||2 W 1 t 1 + 4L2d Profile-Based Bandit with Unknown Profiles On another hand, we have: det(WT ) = det(WT 1 + 1 Ri T ,T (µi T ˇϵi T ,T )(µa T ˇϵi T ,T ) ) = det(WT 1)det(I + 1 Ri T ,T W 1/2 T 1 (µi T ˇϵi T ,T )(W 1/2 T 1 (µi T ˇϵi T ,T )) ) = det(WT 1)(1 + 1 Ri T ,T ||µi T ˇϵi T ,T ||2 W 1 T 1) t=1 (1 + 1 Rit,t ||µit ˇϵit,t||2 W 1 t 1) Where we used the fact that all eigenvalues of I +xx equal 1 except one that is associated to the eigenvector x and thus equals 1 + ||x||2. Since by assumption λ > max(1, L2/ R2), we have: 1 Rit,t ||µit ˇϵit,t||2 W 1 t 1 ||ˆxit,t||2/ Thus, by using the fact that x 2 log(1 + x) when 0 x 1, we get: 2 log det(WT ) 1 Rit,t ||µit ˇϵit,t||2 W 1 t 1 t=1 ||µit ˇϵit,t||2 W 1 t 1 R2 + L2S2 T X t=1 ||µit ˇϵit,t||2 W 1 t 1 As in the lemma 11 of (Abbasi-Yadkori et al., 2011), we also have: log det(WT ) d log 1 + TL2 Which leads us to: t=1 ||µit ˇϵit,t||2 W 1 t 1 p R2 + L2S2d log 1 + TL2 Finally, same manner as in the lemma 10 of Abbasi-Yadkori et al. (2011), the tracedeterminant inequality gives: d log 1 + TL2/λ Gathering all these results together allows us to prove the announced result. Lamprier, Gisselbrecht and Gallinari A.7. Proof of the proposition 9 Lemma 6 When removing dependencies on L, λ, R and S, the bound from proposition 16 for the cumulative regret ˆRT can written as follows (when T > d): ˆRT C + C1d log T 1 nit,t + C2d log T with C,C1 and C2 three constants. Proof From proposition 16, we have: ˆRT C + ˆRT,1 + ˆRT,2 d λ log 1 + TL2/λ 2d log 2d T 2 Constant d log T 1 nit,t (when T > d > 0) d log 1 + TL2/λ R2 + L2S2 log 1 + TL2 log(T) + log d T v u u t Td log d T Constant d log T 1 nit,t (when T > d > 0) Thanks to this lemma, we are ready to derive specific bounds for the three considered profile delivery settings. Profile-Based Bandit with Unknown Profiles A.7.1. Case 1: On one hand, we have: t 1 + log(T) On the other hand, we have: Finally, we get from lemma 6 that, when T > d > 0: ˆRT,1 C + C1d log T T + C2d log T with C,C1 and C2 three constants. Since the last term is clearly the greatest, we get the announced result. A.7.2. Case 2: Lemma 7 i, t 2 log(1/δ)/p2 , with a probability of at least 1 δ: 2 Proof By the Hoeffding inequality, for all ϵ > 0: P(ni,t tp ϵ) 1 e 2ϵ2/t By taking ϵ = tp/2, we get: P(ni,t tp/2) 1 e tp2/2 If t 2 log(1/δ)/p2, then 1 e tp2/2 1 δ, which proves the lemma. Let us note u = ceil(2 log(1/δ)/p2). Thus, following lemma 7, with a probability of at least 1 δ, we have: u + 2 log(T) Lamprier, Gisselbrecht and Gallinari From another hand, still thanks to the lemma 7, with a probability of at least 1 δ: Finally, we get from lemma 6 that, when T > d > 0: ˆRT,1 C + C1d log T p + C2d log T with C,C1 and C2 three constants. Since the last term is clearly the greatest, we get the announced result. A.7.3. Case 3: First note that the sum TP 1 nit,t is maximized when each action has delivered exactly T/K samples in the T/K K first iterations (i.e., every action has been played as many times). Thus: K(1 + log( T/K )) With the same argument, we also get: Finally, by noting that K log( T/K ) K log(T/K), that K p KT and by using the generic bound from the proposition 9, we get the announced result. Finally, since Profile-Based Bandit with Unknown Profiles K log( T/K ) K log(T/K) and K p KT, we get from lemma 6 that, when T > d > 0: ˆRT,1 C + C1d log T KT + C2d log T TK log(T/K) with C,C1 and C2 three constants. Since the last term is clearly the greatest, we get the announced result. A.8. Proof of the proposition 10 We follow a similar method to the one presented in Qin et al. (2014) for the specific case of sums of individual rewards: Since we consider that the reward obtained by playing a set of actions at a given step t is the sum of rewards observed for every single played action at t, we have: i K µ i β X with K the set of k optimal actions, i.e., those with greatest expectations µ i β. Then, we use the fact that for every step t: X i Kt si,t X This leads to: i Kt 2αt 1||ˆxi,t||V 1 t 1 + 4 λ + S)ρi,t,δ Where the matrix Vt 1 is defined by considering the k played actions at each step: Vt 1 = λI + t 1 P 1 Ri,t ˆxi,tˆx i,t. The fact that we add k terms to the matrix V at each iteration implies two distinct things: First on the confidence ellipsoid of β. For k actions played at each step, we have: d log 1 + k TL2/λ On another hand on the upper-bounding of TP it Kt ||µit ˇϵit,t||2 W 1 t 1. We have: it Kt ||µit ˇϵit,t||2 W 1 t 1 R2 + L2S2d log 1 + Tk L2 Finally, we can use the same methods as in the previous proofs for deriving specific bounds from the generic one, where the selection of k actions at each step appears explicitly in the terms TP 1 ni,t and TP Lamprier, Gisselbrecht and Gallinari A.9. Table of the main notations T number of steps of the process K set of available actions K number of available arms k number of simultaneaous plays at each step d dimension of the arms profiles it arm selected at step t i arm with the highest final cumulative reward ri reward obtained by arm i at step t xi,t profile sample vector observed for arm i at step t ˆxi,t average of profile sample vectors observed for arm i until step t L upper-bound for the profiles norm Ot set of arms delivering a profile context at step t ni,t number of samples observed for arm i until step t µi profile vector of arm i β mapping parameter between profiles and rewards ˆβt estimator of β at step t S upper-bound for the β parameter norm λ l2-regularization constant of the β estimator ηi,t sub-gaussian noise of the reward of arm i at step t R sub-gaussian constant of the rewards distribution ϵi,t deviation between the true profile of arm i and its estimator at step t (ϵi,t = µi ˆxi,t) η t 1 vector of reward deviations of the first t 1 selected arms from their expectation at step t: η t 1 = (ηis,s + ϵ is,tβ) s=1..t 1 Ri,t sub-gaussian constant of the noise of the reward of i at step t w.r.t. ˆx i,tβ Xt 1 (t 1) d matrix containing the empirical means of the selected actions, where the s-th row corresponds to the estimator at step t of the action selected at step s: Xt 1 = (ˆx is,t)s=1..t 1 Yt 1 rewards vector of size t 1: Yt 1 = (ris,s) s=1..t 1 At 1 diagonal (t 1) (t 1) matrix, where the s-th diagonal element equals 1/Ris,t: At 1 = diag(1/Ris,t)s=1..t 1 δ parameter controling the confidence level of the regret bound V 1 t variance-covariance matrix of the posterior distribution of β at step t ρi,t,δ quantity used to bound the deviation of the estimators of profiles: ρi,t,δ = 2d ni,t log 2dt2 si,t selection score for arm i at step t αt exploration coefficient at step t w.r.t. the confidence of the beta estimator ϵt quantity used to bound µ i ˆβt ϵt quantity used to bound µ i (β ˆβt) Profile-Based Bandit with Unknown Profiles Yasin Abbasi-Yadkori, D avid P al, and Csaba Szepesv ari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain., pages 2312 2320, 2011. Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pages 39.1 39.26, 2012. Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 127 135, 2013. Jean-Yves Audibert and S ebastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009, 2009. Jean-Yves Audibert, R emi Munos, and Csaba Szepesv ari. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theor. Comput. Sci., 410(19):1876 1902, April 2009. ISSN 0304-3975. Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 2003. Peter Auer, Nicol o Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235 256, 2002. Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Proceedings of the 25th Conference on Neural Information Processing Systems 2011, Granada, Spain, pages 2249 2257, 2011. Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 151 159, 2013. Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payofffunctions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages 208 214, 2011. Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008, pages 355 366, 2008. V. H. de la Pe na, T. L. Lai, and Q. M. Shao. Self-Normalized Processes: Limit Theory and Statistical Applications. Springer Series in Probability and its Applications. Springer, 2009. Lamprier, Gisselbrecht and Gallinari Sarah Filippi, Olivier Capp e, Aur elien Garivier, and Csaba Szepesv ari. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada., pages 586 594, 2010. Aurlien Garivier. The kl-ucb algorithm for bounded stochastic bandits and beyond. In COLT, 2011. Thibault Gisselbrecht, Ludovic Denoyer, Patrick Gallinari, and Sylvain Lamprier. Whichstreams: A dynamic approach for focused data capture from large social media. In Proceedings of the Ninth International Conference on Web and Social Media, ICWSM 2015, University of Oxford, Oxford, UK, May 26-29, 2015, pages 130 139, 2015. Liangjie Hong and Brian D. Davison. Empirical study of topic modeling in twitter. In ECIR, 2010. Emilie Kaufmann, Olivier Capp e, and Aur elien Garivier. On bayesian upper confidence bounds for bandit problems. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012, La Palma, Canary Islands, April 21-23, 2012, pages 592 600, 2012a. Emilie Kaufmann, Nathaniel Korda, and R emi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory - 23rd International Conference, ALT 2012, Lyon, France, October 29-31, 2012. Proceedings, pages 199 213, 2012b. T.L Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4 22, 1985. ISSN 0196-8858. Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW 10, pages 661 670, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-799-8. Benedict C. May, Nathan Korda, Anthony Lee, and David S. Leslie. Optimistic bayesian sampling in contextual-bandit problems. J. Mach. Learn. Res., 13:2069 2106, 2012. ISSN 1532-4435. Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, Pennsylvania, USA, April 24-26, 2014, pages 461 469, 2014. Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Math. Oper. Res., 35(2):395 411, 2010. W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Bulletin of the American Mathematics Society, 25:285 294, 1933.