# profilebased_bandit_with_unknown_profiles__6206d402.pdf

Journal of Machine Learning Research 19 (2018) 1-40 Submitted 11/17; Revised 6/18; Published 9/18

Proﬁle-Based Bandit with Unknown Proﬁles

Sylvain Lamprier sylvain.lamprier@lip6.fr Sorbonne Universit es, UPMC Paris 06, LIP6, CNRS UMR 7606

Thibault Gisselbrecht thibault.gisselbrecht@snips.ai SNIPS, 18 rue Saint Marc, 75002 Paris

Patrick Gallinari patrick.gallinari@lip6.fr Sorbonne Universit es, UPMC Paris 06, LIP6, CNRS UMR 7606

Editor: Peter Auer

Stochastic bandits have been widely studied since decades. A very large panel of settings have been introduced, some of them for the inclusion of some structure between actions. If actions are associated with feature vectors that underlie their usefulness, the discovery of a mapping parameter between such proﬁles and rewards can help the exploration process of the bandit strategies. This is the setting studied in this paper, but in our case the action proﬁles (constant feature vectors) are unknown beforehand. Instead, the agent is only given sample vectors, with mean centered on the true proﬁles, for a subset of actions at each step of the process. In this new bandit instance, policies have thus to deal with a doubled uncertainty, both on the proﬁle estimators and the reward mapping parameters learned so far. We propose a new algorithm, called Samp Lin UCB, speciﬁcally designed for this case. Theoretical convergence guarantees are given for this strategy, according to various proﬁle samples delivery scenarios. Finally, experiments are conducted on both artiﬁcial data and a task of focused data capture from online social networks. Obtained results demonstrate the relevance of the approach in various settings. Keywords: Stochastic Linear Bandits, Proﬁle-based Exploration, Upper Conﬁdence Bounds

1. Introduction

Multi-armed bandits (MAB) correspond to online decision problems where, at each step of a sequential process, an agent has to choose an action - or arm - among a set of K actions, with the aim to maximize some cumulative reward function. In the so-called stochastic MAB setting, rewards collected for a given arm through time are assumed to be independently and identically distributed, following some hidden stationary distribution on every individual arm. The problem is therefore to deal with a tradeoﬀbetween exploitation - selecting actions according to some estimations about their usefulness - and exploration - selecting actions in order to increase the knowledge of their reward distribution. However, with classical stochastic bandit policies, the convergence towards the optimal arms can be slow when the number of actions becomes large. On another hand, contextual bandits correspond to MAB settings where some side information can be leveraged to improve estimations of reward distributions. In these settings, a decision context is observed before selecting actions. This context can either

c 2018 Sylvain Lamprier, Thibault Gisselbrecht and Patrick Gallinari.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v19/17-693.html.

Lamprier, Gisselbrecht and Gallinari

correspond to a global decision feature vector or to speciﬁc feature vectors observed for each single action. Depending on the setting, these features can vary over time, which can help to predict reward ﬂuctuations, or they can correspond to constant features on actions - that we call action proﬁles in the following - whose structure can be used to improve exploration policies over non-contextual approaches. In this paper we address the latter case, where we assume stationary distributions of rewards, but where distributions depend on constant proﬁles associated each arm. Reward distributions from the diﬀerent arms are connected by a common unknown parameter to be learned (Filippi et al., 2010). However, we introduce a new scenario where, for various possible reasons (technical, political, etc...), proﬁle vectors are not available a priori. Instead, the agent gets sample vectors, centered on the true proﬁles, for a subset of actions at each decision step. This can happen in various situations where some restrictions limit our knowledge of the whole decision environment. For example in a focused data capture or technology intelligence scenario on social media, where an agent is asked to collect relevant information w.r.t. to a given need. Because of the extremely large number of accounts on media such as Twitter, the agent needs to focus on a subset of relevant users to follow at each time step (Gisselbrecht et al., 2015). However, given the strict restrictions set by the media, no knowledge about users is available beforehand. Proﬁles have therefore to be constructed from users activities, which are only observed for a small fraction of users at each step. As we will see later, the process which delivers proﬁle samples can either be independent - e.g., an external process delivers activity samples for randomly chosen users at each step in the case of data capture from Twitter - or be included in the decision process - e.g., activity samples are only collected for followed users at the current time step. To the best of our knowledge, this instance of contextual bandit has not been studied in the literature. Existing bandit approaches do not ﬁt with this new setting. First, even if traditional algorithms such as UCB (Auer et al., 2002) could be applied, the information provided by the sample proﬁle vectors would be entirely ignored. Our claim is that important beneﬁts can arise from taking this available side-information into account. On the other hand, existing contextual bandit policies do not take into account uncertainty on context vectors, while we face here a bandit problem where uncertainty not only arises from regression parameters, as classically considered by contextual approaches, but also from the estimated proﬁles which serve as inputs for reward predictions at each step. The aim is to propose an approach able to leverage structural distributions of arms, based on noisy observations of their proﬁles, to improve over existing exploitation/exploration policies in bandit settings with partial side-information. The contribution of this paper is threefold:

We propose a new instance of the contextual bandit problem, based on constant contexts for each action, where action proﬁles are not known beforehand, but built from samples obtained at each iteration (3 cases are investigated regarding the sampling process);

We design the Samp Lin UCB algorithm to solve this problem, for which we demonstrate some theoretical convergence guarantees;

We experiment our proposal for both an artiﬁcial setting and a real-world task of focused data capture from Twitter, to empirically demonstrate the beneﬁts of such a

Profile-Based Bandit with Unknown Profiles

proﬁle-based approach with ony partial knowledge for exploitation/exploration problems.

The paper is organized as follows. In section 2, we present some background and related works. In section 3, we formalize the problem, propose our algorithm and derive the regret bound. Finally, section 4 reports our experiments.

2. Background: the Linear Stochastic Bandit

The multi-armed bandit problem, originally introduced in (Lai and Robbins, 1985) in its stationary form, has been widely studied in the literature. This learning problem aims at tackling the trade oﬀbetween exploration and exploitation in decision processes where, at each step, an agent must choose an action - or arm - among a ﬁnite set of size K. After each decision, it receives a reward which quantiﬁes the quality of the chosen action. The aim for the agent is to maximize the cumulative reward trough time, or equivalently to minimize the cumulative regret RT at step T deﬁned as:

RT = max i {1,2,...,K}

t=1 rit,t (1)

where it stands for the action selected at step t and ri,t the reward obtained by playing the action i at step t. This represents the amount of rewards that has been lost by selecting it at each step t, compared to what could be obtained by playing the optimal arm from the beginning to the end of the process. Note that, at each step t, only the reward for the chosen action rit,t is observed in practice, other ones remain unknown. In the so-called stochastic case, one assume that rewards of an arm i are identically and independently sampled from a distribution with mean νi. Therefore, one usually rather consider the pseudo-regret of a policy, which introduces expectations of regret in the previous deﬁnition:

t=1 νit (2)

where i stands as the arm with the best reward expectation. One of the simplest and most straightforward algorithms to deal with the stochastic bandit problem is the well-known ϵ-greedy algorithm (Auer et al., 2002). This algorithm selects the arm with the best reward mean empirical estimation with probability 1 ϵ and uniformly selects an arm among the whole set regardless their current estimations with probability ϵ. This guarantees to regularly reconsider estimations of all arms and therefore prevents from getting stuck on sub-optimal arms. However, the reward loss resulting from these blind selections prevents from ensuring a sub-linear upper bound of the pseudo-regret, unless setting an appropriate decay on ϵ. But this requires to know a lower bound on the diﬀerence of reward expectations between the best and the second best action (Auer et al., 2002). Upper Conﬁdence Bound algorithms (UCB) is another family of bandit approaches which deﬁne conﬁdent intervals for the reward expectations of each arm. Based on some concentration inequalities (Hoeﬀding, Bernstein, etc.), they propose optimistic policies which

Lamprier, Gisselbrecht and Gallinari

consider possible deviations of the estimated mean of each arm. By using upper bounds of conﬁdence intervals as selection scores, they ensure a cleaver balance between exploitation and exploration. Many extensions of the famous UCB algorithm proposed in (Auer et al., 2002) are known to guarantee a sub-linear bound of the pseudo-regret (see UCBV in (Audibert et al., 2009), MOSS in (Audibert and Bubeck, 2009) or KL-UCB in (Garivier, 2011)). At last, Thompson sampling algorithms, originally proposed in (Thompson, 1933), develop a Bayesian approach to deal with uncertainty. By sampling from posterior distributions for the reward parameters, their exploration/exploitation mechanism is also proved to ensure a sub-linear regret (see (Kaufmann et al., 2012b) and (Agrawal and Goyal, 2012)). The contextual bandit setting is an instance of the bandit problem where context vectors are observed before each decision step. Typically, contextual bandits assume a linear relation between context features and reward expectations. Formally, if we observe a context vector xi,t Rd for each action i K at each time-step t, we consider the following assumption:

β Rd such that ri,t = x i,tβ + ηi,t (3)

where β is a mapping parameter between contexts and rewards, ηi,t is a zero-mean conditionally R sub-Gaussian random noise, with constant R > 0 i.e: λ R : E[eληi,t|Ht 1] eλ2R2/2, with Ht 1 = {(is, xis,s, ris,s)}s=1..t 1. In this context, given a set K of K actions, any contextual bandit algorithm proceeds at each step t {1, 2, 3, . . . , T} as follows:

1. Observation of the context vector xi,t Rd for each i {1, ..., K};

2. According to the current estimate of β, selection of an action it and reception of the associated reward rit,t;

3. Improvement of the selection policy by considering the new input (it, xit,t, rit,t) for the estimation of β.

Various contextual algorithms have been proposed in the literature. The ﬁrst contextual bandit algorithm was introduced in (Auer, 2003). More recently the well-known Lin UCB algorithm has been proposed for a task of personalized recommendation in (Li et al., 2010) and analyzed in (Chu et al., 2011). Both of these algorithms are UCB-like policies, each of them selecting the action whose upper bound of its reward conﬁdence bound is the highest. Many other UCB approaches have been developed since then. In particular, algorithms such as OFUL or Confidence Ball proposed in (Abbasi-Yadkori et al., 2011) and (Dani et al., 2008) have the advantage to enjoy a tighter regret upper bound (see also (Kaufmann et al., 2012a) and (Rusmevichientong and Tsitsiklis, 2010)). As in the stochastic bandit setting, Thompson sampling algorithms have also been designed for the contextual case, which also proved to be powerful, ﬁrst empirically in (Chapelle and Li, 2011) and then theoretically in (Agrawal and Goyal, 2013) and (May et al., 2012). In this paper, we consider a variant of the contextual bandit problem where contexts of actions are constant, which we call action proﬁles in the following. Hence, in our setting we assume that each action i K is associated with a proﬁle vector µi Rd. The linear assumption of equation 3 becomes:

β Rd such that ri,t = µ i β + ηi,t (4)

Profile-Based Bandit with Unknown Profiles

Thus, in this setting contexts cannot be used to anticipate some variations in the rewards expectations as it is traditionally the case in the literature about contextual bandit, but they can be leveraged to improve the exploration process, the use of a shared mapping parameter β allowing one to deﬁne areas of interest in the representation space of the actions. To illustrate this, the ﬁgure 1 represents the selection scores of a contextual algorithm (such as OFUL that we rely on in the following) at a given step for a simple case where K = 4 and d = 2. In this ﬁgure, green areas correspond to high scores, whereas red ones correspond to low scores areas. Color variations render the latent structure inferred by the model. In this setting, the algorithm would select the action 1, since its proﬁle is located in the most promising area of the space. On the other hand, the action 3 is located in an area that is greatly less promising. The fact of using a common mapping parameter β allows one to perform a mutual learning, where observations on some actions inform on the usefulness of similar ones. This allows one to improve the exploration process by focusing more quickly on the useful areas: Imagine that a great number of actions are located in the red area of the ﬁgure. In that case, a classical bandit algorithm such as UCB would need to consider each of these actions several times to reveal their low reward expectation. On the other hand, a contextual bandit algorithm such as OFUL is able to avoid these actions really more quickly because of the proximity with other bad actions.

Figure 1: Illustration of OFUL scores for a proﬁles representation space.

This structured bandit setting has already been investigated in (Filippi et al., 2010). This work showed that great improvements could indeed be obtained by exploiting the structure of actions, since observations on some actions inform on the usefulness of similar ones. This comes down to a classical stochastic bandit where the pseudo-regret can be deﬁned as follows:

t=1 µ i β µ itβ (5)

Lamprier, Gisselbrecht and Gallinari

where µi Rd stands for the proﬁle of the action i K and µi = arg max µi,i=1..K µ i β corresponds

to the proﬁle of the optimal action i . This is the setting which is studied in this paper. However, in our case the proﬁle vectors are unknown to the agent beforehand, they have to be discovered iteratively during the process. Our problem diﬀers from existing instances by the following two main aspects:

1. Action proﬁles are not directly available, one only get samples centered on them during the process;

2. At each step, one only get samples for a subset of actions.

In the following, we derive an UCB-based policy for this new setting.

3. Proﬁle-Based Bandit with Unknown Proﬁles

In this section, we explore our new setting in which the set of proﬁles {µ1, .., µK} is not directly observed. Instead, at each iteration t, the agent is given a subset of actions Ot such that for every i Ot, a sample xi,t of a random variable centered on µi is revealed. By assuming the same linear hypothesis as described in the previous section (formula 4), the relation of rewards with proﬁles can be re-written as follows for any time t, in order to introduce proﬁle samples:

s t : ri,s = µ i β + ηi,s

= ˆx i,tβ + (µi ˆxi,t) β + ηi,s

= ˆx i,tβ + ϵ i,tβ + ηi,s (6)

where ϵi,t = µi ˆxi,t, ˆxi,t = 1 ni,t

s T obs i,t xi,s, with T obs i,t = {s t, i Os} and ni,t = |T obs i,t |. In

words, ni,t corresponds to the number of times a sample has been obtained for the action i until step t and ˆxi,t corresponds to the empirical mean of observed samples for i at time t. ϵi,t corresponds to the deviation of the estimator ˆxi,t from the true proﬁle µi. Compared to traditional contextual bandits, the uncertainty is double : as classically it arises from the β parameter estimator, but also from the proﬁle estimators, since the algorithm must both estimate β and the proﬁle vectors {µ1, .., µK} from observations. Figure 21

illustrates this new setting. Contrary to ﬁgure 1 where proﬁles are known, here we only get conﬁdence areas for them, represented by circles centered on their corresponding empirical mean (represented by a blue cross). From the law of large numbers, the more observations for a given action we get, the lower the deviation between its true proﬁle and its empirical estimator is. Therefore, the more knowledge we get about a given action, the smaller its conﬁdence circle is. The best action is still the action 1, whose true proﬁle (represented by a black cross) is in the greenest area. However, this information is unknown from the agent. A naive solution would be to directly use the empirical mean for each action in order to determine the selection scores. From the ﬁgure, this would lead to select the sub-optimal

1. Note that this is only an illustration of the general principle, in practice the surface of selection scores should also diﬀer from ﬁgure 1, since β is estimated from biased inputs.

Profile-Based Bandit with Unknown Profiles

action 2, whose empirical mean is located in a greener area than the one of other actions. We propose to include the additional uncertainty in the selection scores by using the best location inside the conﬁdence ellipsoid it is possible to reach for each action. This allows one to deﬁne an optimistic policy which would select the optimal action 1 in the example ﬁgure, whose conﬁdence area contains the most promising proﬁles (i.e., includes the most green locations in the ﬁgure).

Figure 2: Illustration of the additional uncertainty arising from proﬁle estimators.

In the following we propose to deﬁne an algorithm ﬁtted for this setting. After deriving the algorithm in a generic context of samples delivery, we consider three diﬀerent cases:

Case 1: Every action delivers a proﬁle sample at each step t (i.e., t, Ot = {1, ..., K});

Case 2: Each action i owns a probability pi of sample delivery: at each step t, an action is included in Ot with probability pi;

Case 3: At each step t, only the action selected at the previous step delivers a sample (i.e., t, Ot = it 1).

The ﬁrst case corresponds to the simplest case, where Ot is constant over time. The second case includes an additional diﬃculty since before any step, every action has not been observed the same number of times, which leads to diﬀerent levels of uncertainty. The last case, probably the most interesting one for real-world applications, is the most diﬃcult since decisions at each step not only aﬀect knowledge about reward distributions but also proﬁle estimations. For that case, it appears mandatory to take the uncertainty about proﬁles into account in the selection policy to guarantee the process to converge towards optimal actions. After deriving conﬁdence intervals for this new bandit instance, this section describes the proposed algorithm and analyzes its regret for the three settings listed above. Then, we consider the case where multiple actions can be performed at each step.

Lamprier, Gisselbrecht and Gallinari

3.1. Conﬁdence Intervals

In the following, we derive a series of propositions which will allow us to deﬁne a selection policy for our setting of proﬁle-based bandit with unknown proﬁles. First, it is needed to deﬁne an estimator for the mapping parameter β, and the associated conﬁdence ellipsoid. For that purpose, we rely on results from the theory of the self-normalized process (de la Pe na et al., 2009). The next step is to deﬁne a way to mix it with the uncertainty on proﬁles to deﬁne an optimistic policy.

Proposition 1 Let us consider that for any action i, all proﬁle samples xi,t Rd are iid from a distribution with mean µi Rd. Let us also assume that there exists a real number L > 0 such that ||xi,t|| L and a real number S > 0 such that ||β|| S. Then, for any i K and any step s t, the random variable ηi,t,s = ϵ i,tβ + ηi,s is conditionally sub-Gaussian

with constant Ri,t =

Proof Available in appendix A.1.

In this proposition, R is the constant of the sub-Gaussian random noise of the rewards (ηs from equation 6) and the notation ||x|| stands as the norm of a vector x (i.e., ||x|| =

x T x). Since the noise ηi,t,s is sub-Gaussian, it will be possible to apply the theory of self-normalized process for deﬁning a conﬁdence ellipsoid for β. At step t, we can use the following set of observations to ﬁnd an estimator for β: {(ˆxis,t, ris,s)}s=1..t 1 (i.e., at any decision step t, the reward ris,s observed at any previous step s < t is associated with the proﬁle of the selected action at step s, estimated knowing samples observed from step 1 to step t). The following notations are used in the remaining of the paper:

η t 1 = (ηis,s + ϵ is,tβ) s=1..t 1 the vector of noises of size t 1.

Xt 1 = (ˆx is,t)s=1..t 1 the (t 1) d matrix containing the empirical means of the selected actions, where the s-th row corresponds to the estimator at step t of the action selected at step s.

Yt 1 = (ris,s) s=1..t 1 the rewards vector of size t 1.

At 1 = diag(1/Ris,t)s=1..t 1 the diagonal (t 1) (t 1) matrix, where the s-th diagonal element equals 1/Ris,t. Note that, for a speciﬁc action, the value of its corresponding coeﬃcient increases with the number of observed samples for this action.

With these notations, the linear application from proﬁles to rewards can be written as:

Yt 1 = Xt 1β + η t 1 (7)

Proposition 2 We note ˆβt 1 the least square estimator of the parameter β at step t, according to the following l2-regularized regression problem, where each element is weighted

Profile-Based Bandit with Unknown Profiles

by the corresponding coeﬃcient 1/Ris,t:

ˆβt 1 = arg min β

1 Ris,t (β ˆxis,t ris,s)2 + λ||β||2 (8)

where λ > 0 is the l2-regularization constant.

ˆβt 1 = (X t 1At 1Xt 1 + λI) 1X t 1At 1Yt 1 (9)

Proof Let us rewrite the minimization problem such as: ˆβt 1 = arg min β L with L = (Yt 1 Xt 1β) At 1(Yt 1 Xt 1β) + λβ β.

The gradient is given by: βL = 2X t 1At 1(Yt 1 Xt 1β) + 2λβ = 2(X t 1At 1Xt 1 + λI)β 2X t 1At 1Yt 1 By canceling this gradient, we get the announced result.

This estimator of β uses empirical means of observed samples as inputs. Weighting each element according to the corresponding value Ris,t allows one to consider the uncertainty associated with this approximation. It renders the conﬁdence we have in the weighted input. Note that this coeﬃcient tends towards a constant when the number of observed samples increases for the corresponding action. It allows one, according to the following proposition, to deﬁne a conﬁdence ellipsoid for the estimator of β.

Proposition 3 Let us deﬁne Vt 1 = λI + X t 1At 1Xt 1 = λI + t 1 P

ˆxis,tˆx is,t Ris,t . With the

same assumptions as in proposition 1, for any 0 < δ < 1, with a probability at least equal to 1 δ, the estimator ˆβt 1 veriﬁes for all t 0:

||ˆβt 1 β||Vt 1

v u u t2 log

det(Vt 1)1/2det(λI) 1/2

λS = αt 1 (10)

where ||x||V =

x V x is the V-norm of the vector x.

Proof Available in appendix A.2.

This bound is very similar to the one deﬁned in the OFUL algorithm (Abbasi-Yadkori et al., 2011) to build its conﬁdence ellipsoid. However, a notable diﬀerence lies in the deﬁnition of the matrix Vt, in which weights in At are applied to cope with conﬁdence diﬀerences between proﬁle estimators. Without this weighting, no conﬁdence ellipsoid could be found for β since no common bound could be deﬁned for the various noises η s (see the proof of proposition 3 in appendix). The following proposition can easily be deduced from the previous one to bound the expectation of reward with known proﬁles.

Lamprier, Gisselbrecht and Gallinari

Proposition 4 For every i K, with probability greater than 1 δ, we have for all t 0:

β µi ˆβ t 1µi + αt 1||µi||V 1 t 1 (11)

Proof Available in appendix A.3

This upper-bound for the expected reward contains two distinct terms: while the former corresponds to a classical exploitation term which estimates the expected reward with the current parameters, the latter corresponds to an exploration term since it takes into account the uncertainty on the reward parameter. If proﬁles were known, this could directly be used as a selection score for an UCB-like policy. However, in our setting, proﬁles are unknown. We have to consider conﬁdence ellipsoids for the proﬁles of actions too. The following proposition deﬁnes conﬁdence bounds for the proﬁle estimators.

Proposition 5 For every i K and any t > 0, with probability greater than 1 δ/t2, we have:

||ˆxi,t µi|| min(L

2d ni,t log 2dt2

, 2L) = ρi,t,δ (12)

Proof This inequality comes from the application of the Hoeﬀding s inequality to each dimension separately. The min operator comes from the base hypothesis ||xi,t|| L, which can be more restrictive than the Hoeﬀding assumption. The proof is available in appendix A.4.

Contrary to the bound of the deviation of the mapping parameter β which holds simultaneously for all steps of the process, the one for the proﬁle estimators is only valid for each step separately. To obtain a bound holding for every step simultaneously, which is important for the regret analysis (see section 3.3), we use the uniform bound principle. For a given action i, we have: P( t, ||ˆxi,t µi|| ρi,t,δ) = 1 P( t, ||ˆxi,t µi|| ρi,t,δ) 1 P

t P(||ˆxi,t µi|| ρi,t,δ) 1 P

t δ/t2. This justiﬁes the introduction of the t2 term

in the bound, which allows one to deﬁne a uniform probability over all steps since we have

thereby: P( t, ||ˆxi,t µi|| ρi,t,δ) 1 δ P

t=2 δ/t2 = 1 δ δ(π2/6 1) 1 2δ.

Now that we have deﬁned probabilistic deviation bounds for the diﬀerent estimators, we can use them conjointly to deﬁne the conﬁdence interval of the reward expectation for the setting of unknown proﬁles, and thus to upper bound the expected reward for each action i.

Proposition 6 For every i K and any t > 0, with probability greater than 1 δ/t2 δ, we have:

β µi ˆβ t 1(ˆxi,t + ϵi,t) + αt 1||ˆxi,t + ϵi,t||V 1 t 1 (13)

with: ϵi,t = ρi,t,δ ˆβt 1

||ˆβt 1|| ϵi,t = ρi,t,δˆxi,t

λ||ˆxi,t||V 1 t 1

Profile-Based Bandit with Unknown Profiles

Proof The proof is available in appendix A.5.

Compared to the bound given in proposition 4, we ﬁnd the same two terms of exploitation and exploration. However in this case, proﬁle vectors that are unknown are replaced by the estimator plus an additional term: ϵi,t in the former part and ϵi,t in the latter one. These terms aim at coping with proﬁle uncertainty, considering conﬁdence ellipsoids for these proﬁles as deﬁned in proposition 5. ϵi,t is collinear with ˆβt 1. It is used to translate the estimator ˆxi,t so that β µi is upper-bounded. ϵi,t is collinear with ˆxi,t. It is used to translate the estimator ˆxi,t so that the V 1 t 1-norm ||µi||V 1 t 1 is upper-bounded. This bound enables us to derive an optimistic policy in the next section.

3.2. Samp Lin UCB

In this section, we detail our policy for the setting of unknown proﬁles, called Samp Lin UCB, which is directly derived from the bound proposed in proposition 6. Its process is detailed in algorithm 1. In words, it proceeds as follows:

1. Initialization of the shared variables V and b used to estimate the mapping parameter β in lines 1 and 2. The d d matrix V is initialized with an identity matrix times the regularization parameter λ (the greater λ is, the more the parameter β will be constrained to have components close to zero). The vector b is initialized as a null vector of size d.

2. Initialization of the individual variables ni, ˆxi, Ri, Ni and Si for every action in K (lines 3 to 6). The two latter are additional scalar variables which enable eﬃcient updates for the shared variables after context observations. Ni counts the number of times an action has been selected from the beginning, Si sums the rewards collected by selecting i from the beginning.

3. At each iteration t, for each action i Ot, observation of the sample xi,t (line 11) and update of individual variables ni, ˆxi and Ri for action i (line 12) and shared parameters V and b according to these new individual values for i (line 10 and 13). Since shared parameters are simple sums of elements, they can be simply updated by ﬁrst removing old values (line 10) and then adding the new ones when updated (line 13). This is eﬃciently done without requiring an important memory load thanks to scalar variables Ni and Si.

4. Computation of the selection score si,t (line 21) for each action i according to equation 14 detailed below, and selection of the action associated with the highest selection score (line 23) (except in the early stages K where all actions are selected in turn to initialize their counts in line 16).

5. Collection of the associated reward (line 25) and update of variables Ni, Si, V and b according to this new outcome (lines 26 to 28).

Lamprier, Gisselbrecht and Gallinari

Algorithm 1: Samp Lin UCB

V = λId d (Identity matrix of size d); 1 b = 0d (Null vector of size d); 2 for i K do 3 Ni = 0; Si = 0; 4 ni = 0; ˆxi = 0d; Ri = + ; 5 end 6 for t = 1..T do 7 Reception of Ot; 8 for i Ot do 9

V = V Ni ˆxiˆx i Ri ; b = b Si ˆxi Ri ; 10 Observation of xi,t; 11

ni = ni + 1; ˆxi = (ni 1)ˆxi + xi,t

V = V + Ni ˆxiˆx i Ri ; b = b + Si ˆxi Ri ; 13 end 14 if t K then 15 Selection of it = t; 16 end 17 else 18 ˆβ = V 1b; 19 for i K do 20 Computation of si,t according to formula 14 ; 21 end 22 Selection of it = arg max i K si,t ; 23 end 24 Reception of rit,t; 25 Nit = Nit + 1; 26 Sit = Sit + rit,t; 27

V = V + ˆxit ˆx it Rit ; b = b + rit,t ˆxit Rit ; 28 end 29

The selection score si,t used in our policy for each action i at any step t is directly derived from proposition 6:

si,t = (ˆxi,t + ϵi,t) ˆβt 1 + αt 1||ˆxi,t + ϵi,t||V 1 t 1 (14)

For cases 2 and 3 of the proﬁle delivery mechanism (see at the beginning of the section), there are actions i with ni,t = 0 in the early steps of the process. No sample has ever been observed for these actions, which is problematic for the computation of ρi,t,δ, and therefore

Profile-Based Bandit with Unknown Profiles

the computation of ϵi,t and ϵi,t. For the case 2, where we are not active on the process for observing contexts, this can be solved by simply ignoring actions until at least one sample of proﬁle has been observed for them. For the case 3 however, samples are only obtained by selection. Thus, we need to force the observation of a sample for every action in the ﬁrst steps. In that way, for actions with ni,t = 0 at any step t, we arbitrarily set si,t = + in order to make the policy favor actions without any knowledge to initialize the process. Thus, in that case, algorithm 1 selects the K actions in turn in the K ﬁrst steps of the process. The selection score deﬁned in formula 14 corresponds to the upper-bound of the expected reward for each action, as it is done in all UCB-based policies. Intuitively, it leads the algorithm to select actions whose proﬁle estimator is either in an area with high potential, or is suﬃciently uncertain to consider still likely that the action can be potentially useful. The goal is to quickly rule out bad actions, whose conﬁdence ellipsoid does not include any potentially useful locations w.r.t. the current estimation of β. To better analyze the algorithm, we propose below a new formulation of the selection score.

Proposition 7 The score si,t from equation 14 can be re-written in the following way:

si,t = ˆx i,t ˆβt 1 + αt 1||ˆxi,t||V 1 t 1 + ρi,t,δ

||ˆβt 1|| + αt 1

si,t = (ˆxi,t + ϵi,t) ˆβt 1 + αt 1||ˆxi,t + ϵi,t||V 1 t 1

= ˆx i,t ˆβt 1 + ρi,t,δ ˆβ t 1 ||ˆβt 1|| ˆβt 1 + αt 1||ˆxi,t + ρi,t,δˆxi,t

λ||ˆxi,t||V 1 t 1 ||V 1 t 1

= ˆx i,t ˆβt 1 + ρi,t,δ||ˆβt 1|| + αt 1

λ||ˆxi,t||V 1 t 1

||ˆxi,t||V 1 t 1

= ˆx i,t ˆβt 1 + αt 1||ˆxi,t||V 1 t 1 + ρi,t,δ

||ˆβt 1|| + αt 1

This new formulation of the selection score allows one to take a diﬀerent look at the algorithm behavior. The ﬁrst part of the score ˆx i,t ˆβt 1 + αt 1||ˆxi,t||V 1 t 1 is similar to a score that would use the classical OFUL algorithm (although with a diﬀerent construction of Vt), with an exploitation term and a classical exploration term considering the uncertainty

on the estimator of β. But it exhibits an additional part ρi,t,δ

||ˆβt 1|| + αt 1

directly proportional to the coeﬃcient ρi,t,δ and thus enables some exploration w.r.t. the uncertainty of the proﬁle estimators. This highlights the premium granted to less observed actions. Note that for the case 1, this additional part is the same for every action. It therefore could be removed from the score since it does not permit to discriminate some action w.r.t any other one. However, this new exploration term is particularly useful for the

Lamprier, Gisselbrecht and Gallinari

case 3, where observations of samples are directly connected to the selection policy, since it prevents from moving aside some optimal actions that have unluckily provided only not promising samples in the early steps. To demonstrate that considering uncertainty on proﬁles is crucial in that case, let us consider a scenario where the optimal action i gets a null vector as the ﬁrst proﬁle sample. Then, in a setting where all proﬁle samples are in [0, L]d and all rewards are in [0, + ], it suﬃces that a sub-optimal action i gets a non-null vector as the ﬁrst proﬁle sample and a positive value as the ﬁrst reward to lead to a linear regret from a given step. Indeed, since we only get samples with all components greater or equal than 0, i will never get a null vector as a proﬁle estimator. On the other hand, while i is not selected, its proﬁle estimator cannot change from the null vector. Thus, with a naive algorithm that would not include translations w.r.t. uncertainty of proﬁles, we would have si ,t = ˆx i ,t ˆβt 1 +α||ˆxi ,t||V 1 t 1 = 0 for all t until it = i . Now, the least square estimator of β approximates observed reward values from estimated proﬁles. Since we have at least one non-null reward associated with a non-null proﬁle estimator, β will always output a positive expected reward at least for one action. Thus, there is always an action i with si ,t > 0, which prevents from selecting the optimal action until the end of the process. This shows that a naive algorithm is not well ﬁtted here, since it is likely to stay stuck on sub-optimal actions because of wrong knowledge about proﬁles. The point is now to show that the proposed additional term enables to solve this problem and ensures a sub-linear pseudo regret for our proﬁle-based bandit algorithm with unknown proﬁles.

3.3. Regret

The following proposition establishes an upper bound for the cumulative pseudo-regret of the Samplin UCB algorithm proposed above. This is a generic bound for which no assumption is done on the process generating Ot at each step t.

Proposition 8 (Generic bound) By choosing λ max(1, L2/

R2), with a probability greater than 1 3δ, the cumulative pseudo-regret of the algorithm Samp Lin UCB is upperbounded by:

d λ log 1 + TL2/λ

2d log 2d T 2

d log 1 + TL2/λ

R2 + L2S2 log 1 + TL2

Proof Available in appendix A.6.

Profile-Based Bandit with Unknown Profiles

The study of dominant factors of the bound given above enables to obtain the following proposition for the three considered settings of the context delivery process, where we removed dependencies on L, λ, R and S to simplify the notations.

Proposition 9 (Bounds for the three proﬁle delivery settings) For each of the three considered settings for the proﬁle delivery process, the upper bound of the cumulative pseudoregret is2:

For the case 1, with a probability greater than 1 3δ:

ˆRT = O d log T

T log(T) (17)

For the case 2, with a probability greater than (1 3δ)(1 δ), and for T 2 log(1/δ)/p2:

where p is the probability of proﬁle delivery for any action at each step.

For the case 3, with a probability greater than 1 3δ:

Proof The proofs for these three bounds are respectively given in appendix A.7.1, A.7.2 and A.7.3.

Thus, in every setting our Samplin UCB algorithm ensures a sub-linear upper bound for its cumulative pseudo-regret. The bound given for case 2 owns an additional dependency in p, the probability of context delivery for each action at each step. Obviously, the higher this probability is, the faster the uncertainty about proﬁles decreases. Note that this bound for case 2 is only valid from a given number of iterations inversely proportional to p2, since it requires a minimal number of observations to hold. The bound for case 3 owns a dependency in the number of available actions K. This comes from the fact that only the selected action reveals its proﬁle at each step, which re-introduces the need of considering each action a minimal number of times, as it is the case with traditional stationary approaches such as the classical UCB algorithm. However, as we show in our experiments below, the use of the structure of the actions, which enables some common learning of reward distributions, leads to greatly better results than existing stationary algorithms in various cases.

2. O renders the relation dominated by , which means that f = O(g) implies that there exists a strictly positive constant C such that asymptotically we have: |f| C|g|.

Lamprier, Gisselbrecht and Gallinari

3.4. Extension to the multiple-plays setting

This short section extends our algorithm for the multiple-plays setting, where multiple actions are chosen at each step. Rather than only selecting a single action it at any step t, the algorithm has now to select a set Kt K of k 1 actions for which it gets rewards. Algorithm 1 is therefore adapted to this setting, by simply selecting the k best actions at each step (those that get the best selection scores w.r.t. formula 14) rather than only the best one (line 23 of the algorithm). The aim is still to maximize the cumulative reward through time, where all rewards at any step t are simply summed to form the collected reward at step t (although other Lipschitz functions could have been considered for the collective reward construction from the k individual ones, such as proposed in Chen et al. (2013)).

Deﬁnition 1 The cumulative pseudo-regret of our setting of bandit with multiple plays is deﬁned as:

i K µ i β P

i Kt µ i β (20)

with K the set of k optimal actions, i.e. the k actions with the highest values µ i β.

Proposition 10 (Generic bound for the multiple-plays setting) By choosing λ max(1, L2/

R2), with a probability greater than 1 3δ, the cumulative pseudo-regret for our Samp Lin UCB algorithm with multiple selections is upper bounded by:

d λ log 1 + Tk L2/λ

2d log 2d T 2

d log 1 + Tk L2/λ

R2 + L2S2 log 1 + Tk L2

Proof The proof is available in appendix A.8.

Equivalent bounds for the three cases of context delivery can be directly derived from this new generic bound by applying the same methods as in the previous section. This allows us to apply our algorithm for tasks where multiple actions can be triggered at each step, such as in the data capture task considered in our experiments in section 4.2.

4. Experiments

This section is divided in two parts. First, we propose a series of experiments on artiﬁcial data in order to observe the behavior of our approach in well-controlled environments. Then, we give results obtained on real-world data, for a task of data capture from social media.

Profile-Based Bandit with Unknown Profiles

4.1. Artiﬁcal Data

4.1.1. Protocol

Data Generation: In order to assess the performances of the Samplin UCB algorithm, we propose to ﬁrst experiment it in a context of simple selection (k = 1) on artiﬁcially generated data. For that purpose, we set the horizon T to 30000 iterations, the number of available actions K to 100 and the size of the proﬁle space to d = 5 dimensions. Then, we

sampled a mapping vector β randomly in h S/

d id , in order to fulﬁll the ||β|| S = 1 condition. For each arm i, we then sampled a random vector µi uniformly in h L/

d id with L = 1. Finally, for each iteration t {1, ..., T}, we proceeded as follows to generate simulated data:

1. For each action i {1, ..., K}, we sampled a vector xi,t from the multivariate Gaussian N(µi, σ2I). Note that, in order to assess the inﬂuence of proﬁle samples variations on the performances of Samp Lin UCB, we tested diﬀerent values for σ {0.5, 1.0, 2.0}. Moreover, in order to guarantee that ||xi,t|| L = 1, while still getting sampled centered on µi, the Gaussian is truncated symmetrically around µi. This is illustrated by ﬁgure 3 for d = 1, where hatched areas correspond to excluded values. On the left is given the case with µi > 0 and on the right the case with µi < 0;

2. For each action i {1, ..., K}, we sampled a reward ri,t from a Gaussian with mean µ i β and variance R2 = 1;

0 𝜇 𝐿 - 𝐿 2 𝜇 - 𝐿 (a) Case µ > 0

0 𝜇 𝐿 - 𝐿 2 𝜇+ 𝐿

(b) Case µ < 0

Figure 3: Proﬁle Samples Generation Process: Truncated Gaussians

To emhasize the need of exploration on proﬁles, note that we set to null vectors the proﬁle samples of the 100 ﬁrst steps of each dataset. 100 datasets have been generated in such a way. The results given below correspond to averages on these artiﬁcial datasets.

Lamprier, Gisselbrecht and Gallinari

Experimented Policies: We propose to compare Samp Lin UCB to the following bandit policies:

UCB: the well-known UCB approach (Auer et al., 2002), which selects at each step the action with the best upper-conﬁdence bound, estimated w.r.t. past rewards of actions, without any assumption about some latent structure of the actions;

UCBV: the UCBV algorithm (Audibert et al., 2009) is an extension of UCB, where the variance of rewards for each action is included in the selection scores to lead the policy to ground their estimations in an higher number of observations for noisier actions.

UCB δ: the UCB δ algorithm (Abbasi-Yadkori et al., 2011) is a variant of UCB where optimism is ensured via a concentration inequality based on auto-normalised process (de la Pe na et al., 2009), with a conﬁdence level of 1 δ. In our experiments, we set δ = 0.05;

Thompson: the Thompson Sampling algorithm (Thompson, 1933) introduces randomness in the exploration process by sampling reward expectations from their posterior at each time-step, following a Gaussian assumption of the rewards;

MOSS: a variant of UCB, which usually obtains better results than the classical UCB but requires the knowledge about the horizon T (Audibert and Bubeck, 2009);

None of these approaches use any side information. Therefore, the only noise they have to deal with comes from the variance R2 of the Gaussian distributions of rewards. For Samp Lin UCB an additional diﬃculty comes from the variations of the observed samples of proﬁles. The point is therefore to know whether these samples can be leveraged to exhibit some structure of the actions, that can beneﬁt to stationary bandit tasks, despite such variations. Additionaly, the following two contextual baselines are considered in our experiments to analyze the performances of our approach:

Lin UCB: the very famous contextual approach that assumes a linear mapping between observed contexts and rewards (Li et al., 2010). In our case, observed proﬁle samples correspond to the contexts that Lin UCB takes into account in its selection policy. We consider this baseline in the interesting setting where contexts are only delivered for the selected arms (case 3 described above). In this setting, non selected arms deliver null context vectors for the next step;

Mean Lin UCB: this baseline corresponds to our approach but without the exploration term w.r.t. the proﬁles. Empirical means are considered as true proﬁles at each step of the process (this comes down to set ρi,t,δ to 0 for every arm and every step). As discussed above (see the last paragraph of section 3.2), such a baseline cannot guarantee a sub-linear regret since it can inﬁnitely stay stuck on sub-optimal arms, but an empirical evaluation of its performances is useful to understand the beneﬁts of the proposed approach.

To analyze the performances of Samp Lin UCB, we implement the three scenarios studied in previous sections. In the following, our approach is denoted Samp Lin UCB p=<p>, where p

Profile-Based Bandit with Unknown Profiles

corresponds to the probability for any action to get a sample of its proﬁle at every iteration. Diﬀerent values for p are considered: p {0, 0.005, 0.01, 1}. Note that p = 1 corresponds to the case 1, while p = 0 refers to the case 3. In this latter instance, as considered in the previous sections, the samples delivery process is replaced by the ability to observe samples for the selected actions at each iteration. Note also that, for clarity and analysis purposes, in instances with p > 0 we do not observe samples for the selected actions (which exactly follows the cases studied in the previous section)3. In every instance, we set δ = 0.05 for these experiments. Also, to avoid a too large exploration on proﬁles in the early steps, we multiplied each ρi,t,δ by a 0.01 coeﬃcient, which still guarantees a sub-linear regret in the limit.

4.1.2. Results

Figures 4(a), 4(b) and 4(c) report the evolution of the cumulative pseudo-regret through time for the tested policies, for σ values (variance of proﬁle samples) respectively set to σ = 2.0, σ = 1.0 and σ = 0.5. Note that the curves of UCB, UCB-δ, UCBV, Thompson and MOSS are identical in every plot since their performances do not depend on the proﬁle samples. We ﬁrst notice from these plots that UCB-δ and UCBV do not provide good results on these data. It appears that these two policies over-explore during the whole bandit process. Thompson and MOSS obtain better results in average, but still far from the best contextual approach Samp Lin UCBp=1. This conﬁrms that using proﬁles of arms can be greatly advantegeous when there exist a linear correlation between these proﬁles and the associated rewards. In this setting (which corresponds to the case 1 studied above), the proﬁles are discovered step by step, but since we get a sample for every arm at each iteration, the estimators quickly converge towards the true proﬁles. This explains the very good results for this easy setting, and why there is nearly no diﬀerences in the results of Samp Lin UCBp=1 for the three considered sample variances. Let us now focus on the results provided by our Samp Lin UCB algorithm when only a subset of arms gets proﬁle samples at each step of the process. As expected, the more the algorithm observes samples, the better it performs. However, we remark that Samp Lin UCBp=0 obtains better results than Samp Lin UCBp=0.005 for σ = 2 and σ = 1, and even better than Samp Lin UCBp=0.01 when σ = 2 (while observing the same rate of samples as in this latter setting). This denotes a stronger robustness to the proﬁle sample variance. By dynamically selecting the arms to observe, it is able to focus on the improvement of useful estimators rather than getting as many samples but for randomly selected arms (and potentially for arms that could be quickly discarded). In this interesting setting, Samp Lin UCB always outperforms non-contextual approaches for the studied sample variances, while we note a signiﬁcant improvement of the results when the variance is low. At last, we can note the very weak - near random - results obtained by Lin UCB, which directly bases its strategy on the observed samples. More interesting are the weak results obtained by Mean Lin UCB, which exhibits a linear regret. This emphasizes the crucial role of the proﬁle exploration term of Samp Lin UCB: While Samp Lin UCBp=0 is able to reconsider

3. Note that we could easily imagine tasks, which correspond to some mix of cases 2 and 3, where we both get samples from an external process and for the selected actions. For such cases, we can reasonably assume better results than those reported below for cases 2 and 3, since the process would beneﬁt from both sample sources.

Lamprier, Gisselbrecht and Gallinari

bad proﬁles observed in the early steps of the process, Mean Lin UCB usually stays stuck on the ﬁrst actions that provided a non-nul sample associated with a positive reward. If they are lucky, Lin UCB and Mean Lin UCB can exhibit good performances on some instances, but they are clearly not well ﬁtted for the bandit setting considered in this paper.

0 5000 10000 15000 20000 25000 30000 Time step

Cumulative Regret

Random CUCB UCB δ CUCBV MOSS Thompson

Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=0.005 Samp Lin UCBp=0.01 Samp Lin UCBp=1

0 5000 10000 15000 20000 25000 30000 Time step

Cumulative Regret

Random CUCB UCB δ CUCBV MOSS Thompson

Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=0.005 Samp Lin UCBp=0.01 Samp Lin UCBp=1

0 5000 10000 15000 20000 25000 30000 Time step

Cumulative Regret

Random CUCB UCB δ CUCBV MOSS Thompson

Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=0.005 Samp Lin UCBp=0.01 Samp Lin UCBp=1

(c) σ = 0.5

Figure 4: Cumulative pseudo-regret through time on artiﬁcial data with diﬀerent settings for the proﬁle samples delivery process (from very noisy samples on the left, to samples with low variance on the right).

To conclude, it appears from these ﬁrst experiments that our Samp Lin UCB algorithm, while dealing with a doubled uncertainty (both on the mapping parameters and the proﬁle estimators), is able to leverage the latent structure that it discovers through time from noisy samples of the proﬁles.

4.2. Real World Experiments: Social Data Capture

In this section we propose to apply our Samp Lin UCB algorithm to the task of dynamic data capture from Twitter introduced in (Gisselbrecht et al., 2015). According to a given information need, the aim is to collect relevant data from the streaming API proposed by Twitter. This API provides messages published by users on the social network in realtime. In this setting, each user account is associated with a stream of messages that can be monitored. However, for various reasons (notably w.r.t. constraints set by the social media), it is not possible to collect the whole activity of the social media. Only streams of messages published by a subset of k users can be monitored simultaneously (k << K). The aim is therefore to focus on users that are the most likely to publish messages that ﬁt with the data need. The diﬃculty is that we do not know anything about the users beforehand, everything must be discovered during the capture process. We have thus to deal with an exploitation/exploration problem that suits well with the bandit setting studied in this paper.

Given a time period divided in T steps, the agent has to select, at each iteration t {1, ..., T} of the process, a subset Kt of k user accounts to follow, among the whole set of possible users K (Kt K). Given a relevance score ri,t assigned to the content posted by user i Kt during iteration t of the process (the set of tweets he posted during iteration t), the aim is to select at each iteration t the set of user accounts that maximize the sum of

Profile-Based Bandit with Unknown Profiles

collected scores:

max (Kt)t=1..T

i Kt ri,t (22)

4.2.1. Rewards

In our experiments, we attempt to focus on users that have a great impact on some speciﬁed thematic. The Follow Streaming API of Twitter provides in real-time not only tweets posted by the followed users, but also all the re-tweets and replies to these users other users post on the network. Our reward function takes all of these messages into account to provide a reward score ri,t for each user i Kt after each capture period t:

ri,t = tanh

ω Ωi,t gγ(ω)

where Ωi,t contains the original messages from i, the re-tweets of messages from i and the replies to i during the period t, and gγ is a function returning 1 if the content of the message as argument is judged as belonging to the desired thematic γ, 0 otherwise. To build this function gγ, we trained a SVM topic classiﬁer on the 20 Newsgroups dataset (with TF bag of words representations of the texts, after stemming via the Porter Stemmer algorithm). We ﬁnally focus on 4 diﬀerent topics γ: Politics, Religion, Science and Sport. Four diﬀerent reward functions are therefore considered in the following (one for each topic).

4.2.2. Profiles

Following the setting of our proﬁle based bandit, we assume that each user i of the social network is associated to an unknown vector µi corresponding to its proﬁle. In these experiments, we assume that the proﬁle of a user i corresponds to the mean of its content

distribution: µi = lim T 1 T PT t=1 xi,t, where xi,t is a given representation of the content

posted by i during step t 1:

xi,t+1 = f( X

ω Ψi,t ω) (24)

where Ψi,t Ωi,t contains all messages posted by i during step t and ω Rm (with m the size of the vocabulary) is a TF bag of words representation of a message (after stemming via the Porter Stemmer algorithm). The function f aims at reducing the dimension of the representations, since the dimension d of the proﬁle samples is the main factor of complexity in our algorithm, due to the required d d matrix inversions. In order to reduce the dimension of proﬁles, we used a Latent Dirichlet Allocation method speciﬁcally designed for short texts (Hong and Davison, 2010), which aims at modeling texts as a mixture of topics. We set the number of topics to d = 30 and learned the LDA model on a preliminary 3-days random capture from Twitter.

Lamprier, Gisselbrecht and Gallinari

4.2.3. Datasets

In order to be able to test diﬀerent policies and simulate a real time decision process several times, we propose to conduct our experiments on oﬄine datasets:

USElections: dataset containing 3 587 961 messages produced by 5000 users during the ten days preceding the US presidential elections in 2012. The 5000 chosen accounts are the ﬁrst ones who used either Obama , Romney or #USElections from a preliminary random capture on Twitter.

Olympic Games: dataset containing 15 010 322 messages produced by 5000 users in August 2016 during a period of three weeks covering the Olympic Games of Rio. The 5000 chosen accounts are the ones that were observed to use the most many hashtags #Rio2016 , #Olympics , #Olympics2016 or #Olympicgames within a period of preliminary random capture of three days before the Olympic Games.

Brexit: dataset containing 2 118 235 messages produced by 5000 users during the ﬁrst week of October 2016. The 5000 chosen accounts are the ﬁrst ones who used #Brexit from a preliminary random capture from Twitter.

4.2.4. Results

As done in (Gisselbrecht et al., 2015), we set k, the number of listened users at each time step, to 100, and the size of an iteration to 100 seconds. In these experiments, we assume L = S = R = 1 and we set δ = 0.05 as done with artiﬁcial data. Figures 5, 6 and 7 give the evolution of the cumulative reward through time for the datasets USElections, Olympic Games and Brexit respectively. In every case, we consider the four reward functions corresponding to the four topics Politic, Religion, Science and Sport. In order to lighten the plots, we only give in these ﬁgures the results of Samplin UCB for p = 0 and p = 1. In every plot, our algorithm is compared to the same baselines as described in section 4.1, where the policies are extended for the multiple-plays setting (as done in (Gisselbrecht et al., 2015)). A ﬁrst important observation from these plots is that in every setting, our algorithm Samp Lin UCB obtains better results than every other policy, even CUCBV, the extension of UCBV for the multiple-plays setting. Although CUCBV has demonstrated good performances for the task of social data capture (Gisselbrecht et al., 2015), where a high variance can be observed in the contents posted by users, the use of proﬁles associated to users of the networks enables an even more eﬃcient exploration process. Globally, same manner as with artiﬁcial data, the performances of our approach increase with p, with a maximum reached when p = 1. Note however that the setting p = 0 (the case 3 studied above) is the most realistic one, since it does not use anything but the content collected by followed users at each step, which is the case in practice when collecting data from a social media such as Twitter. Interestingly, even for this setting the results obtained are always better than those of every compared approach. The improvement w.r.t. CUCBV is less signiﬁcant for the Sport reward function for which greatly more rewards exist in the datasets (greatly more messages are categorized as sport), which allows non-contextual approaches to quickly

Profile-Based Bandit with Unknown Profiles

0 1000 2000 3000 4000 5000 6000 7000 8000 Time step

Cumulative Reward

Random CUCB UCB δ CUCBV MOSS

Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1

(a) Politic

0 1000 2000 3000 4000 5000 6000 7000 8000 Time step

Cumulative Reward

Random CUCB UCB δ CUCBV MOSS

Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1

(b) Religion

0 1000 2000 3000 4000 5000 6000 7000 8000 Time step

Cumulative Reward

Random CUCB UCB δ CUCBV MOSS

Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1

(c) Science

0 1000 2000 3000 4000 5000 6000 7000 8000 Time step

Cumulative Reward

Random CUCB UCB δ CUCBV MOSS

Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1

Figure 5: Evolution of the cumulative reward through time on the USElections dataset, according to the three considered reward functions Politic, Religion, Science and Sport.

collect knowledge about reward expectations of users. But for every other reward function Samp Lin UCBp=0 always obtained results comprised between 1.5 and 3 times the ones obtained with the best non-contextual approach. Note also the crucial role of the exploration term for proﬁle discovery ρ, since Mean Lin UCB, which considers current empirical sample means as true proﬁles, always obtains greatly lower results than Samp Lin UCBp=0 (except for the Science reward function on the Olympic Games and the Brexit datasets, where it beneﬁts from good rewards and proﬁle samples observed for some useful users in the initialization steps of the process). At last, as expected, Lin UCB, which directly bases its selection policy on proﬁle samples observed at the current step, obtains very bad results (near random). Since obtaining nul context vectors for every user not selected at the previous step, its selection mechanism very early focuses on a given set of users without ever reconsidering the others (except in the rare cases of context samples leading to negative reward expectations according to β). All these results highlight the interest of the proposed approach, based on conﬁdence balls of the arm proﬁles, for tasks where contexts are only observed when the arms are selected .

Lamprier, Gisselbrecht and Gallinari

0 2500 5000 7500 10000 12500 15000 17500 Time step

Cumulative Reward

Random CUCB UCB δ CUCBV MOSS

Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1

(a) Politic

0 2500 5000 7500 10000 12500 15000 17500 Time step

Cumulative Reward

Random CUCB UCB δ CUCBV MOSS

Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1

(b) Religion

0 2500 5000 7500 10000 12500 15000 17500 Time step

Cumulative Reward

Random CUCB UCB δ CUCBV MOSS

Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1

(c) Science

0 2500 5000 7500 10000 12500 15000 17500 Time step

Cumulative Reward

Random CUCB UCB δ CUCBV MOSS

Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1

Figure 6: Evolution of the cumulative reward through time on the Olympic Games dataset, according to the three considered reward functions Politic, Religion, Science and Sport.

Figures 8, 9 and 10 give the relative ﬁnal cumulative rewards for diﬀerent settings of the sample delivery process on the datasets USElections, Olympic Games et Brexit respectively (each score is normalized according to the score obtained when p = 1). Here we still observe that performances tend to decrease with p, for settings where p > 0. However it must be noticed that the setting p = 0 obtains results very close to other settings: it always obtains at least 80% of the ﬁnal cumulative reward obtained when every user delivers a sample at each step of the process (p = 1). Better, in many cases Samp Lin UCBp=0 succeeds in obtaining an higher ﬁnal cumulative reward than p = 0.01 and p = 0.02. This is particularly true for the Brexit dataset where the dynamic selection of samples to be delivered appears very eﬀective. On that dataset, Samp Lin UCBp=0 even usually reaches the performances of Samp Lin UCBp=0.05, while observing greatly less proﬁle samples at each step (only 100 over 5000 at each iteration, which corresponds to the observation rate of the setting p = 0.02). While settings with p > 0 are greatly favored by the fact that they do not need to play an arm to get a sample of its proﬁle, Samp Lin UCBp=0 is not only active for the discovery of the mapping parameters, but also for the estimation of proﬁles. Its knowledge about proﬁles

Profile-Based Bandit with Unknown Profiles

0 1000 2000 3000 4000 Time step

Cumulative Reward

Random CUCB UCB δ CUCBV MOSS

Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1

(a) Politic

0 1000 2000 3000 4000 Time step

Cumulative Reward

Random CUCB UCB δ CUCBV MOSS

Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1

(b) Religion

0 1000 2000 3000 4000 Time step

Cumulative Reward

Random CUCB UCB δ CUCBV MOSS

Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1

(c) Science

0 1000 2000 3000 4000 Time step

Cumulative Reward

Random CUCB UCB δ CUCBV MOSS

Thompson Lin UCB Mean Lin UCB Samp Lin UCBp=0 Samp Lin UCBp=1

Figure 7: Evolution of the cumulative reward through time on the Brexit dataset, according to the three considered reward functions Politic, Religion, Science and Sport.

is directly connected to its selection strategy, with a selection score that favors promising actions with high uncertainty about their proﬁle. This leads to an algorithm that eﬃciently deals with a trade-oﬀbetween exploitation of good actions and exploration on both the mapping parameter and the proﬁles of actions.

5. Conclusion

In this paper, we focused on structured stochastic bandits, where rewards depend on some constant proﬁle associated with actions. More speciﬁcally, we introduced the case where the associated proﬁles are unknown beforehand, and must be discovered from samples delivered during the process. This setting implies a doubled uncertainty, both on proﬁle estimators and on reward predictors, for which we designed a dedicated algorithm, named Samplin UCB, that seeks at leveraging the structure of the unknown proﬁles in its exploration process. Various settings for the proﬁle samples delivery process have been considered, for which we gave theoretical convergence guarantees. Finally, experiments on both artiﬁcial data

Lamprier, Gisselbrecht and Gallinari

p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0

Final Cumulative Reward

(a) Politic

p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0

Final Cumulative Reward

(b) Religion

p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0

Final Cumulative Reward

(c) Science

p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0

Final Cumulative Reward

Figure 8: Final normalized cumulative rewards for Samp Lin UCB on USElections with diﬀerent p settings.

p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0

Final Cumulative Reward

(a) Politic

p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0

Final Cumulative Reward

(b) Religion

p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0

Final Cumulative Reward

(c) Science

p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0

Final Cumulative Reward

Figure 9: Final normalized cumulative rewards for Samp Lin UCB on Olympic Games with diﬀerent p settings.

and a task of data capture from social networks demonstrate the very good behavior of the proposed approach. Ongoing works concern the inclusion of a non-stationary part in the selection strategy, where proﬁles may vary over time according to some evolving latent state of the actions.

Acknowledgments

This research work has been carried out in the framework of the Technological Research Institute System X, and therefore granted with public funds within the scope of the French Program Investissements d Avenir .

Appendix A. Appendix

A.1. Proof of proposition 1

The two following lemmas directly come from the deﬁnition of the sub-gaussian variables.

Lemma 1 Let X be a random variable centered on 0. Then, X is said sub-gaussian with constant R if one of the two equivalent following conditions holds:

Laplace Condition: R > 0, λ R, E[eλX] e R2λ2/2

Profile-Based Bandit with Unknown Profiles

p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0

Final Cumulative Reward

(a) Politic

p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0

Final Cumulative Reward

(b) Religion

p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0

Final Cumulative Reward

(c) Science

p=1 p=0.5 p=0.1 p=0.05 p=0.02 p=0.01 p=0 0.0

Final Cumulative Reward

Figure 10: Final normalized cumulative rewards for Samp Lin UCB on Brexit with diﬀerent p settings.

Sub-Gaussian Tail: R > 0, γ > 0, P(|X| γ) 2e γ2/(2R2)

Lemma 2 Let X1 and X2 be two sub-gaussian variables with respective constant R1 and R2. Let α1 and α2 be two real scalars. Then the variable α1X1 + α2X2 is sub-gaussian too, with constant p

α2 1R2 1 + α2 2R2 2.

The lemma 3 can be deduced from the application of the lemma 1 in the context of our speciﬁc problem of proﬁle-bandit with unknown proﬁles, where the deviation of the proﬁle estimators is a sub-gaussian variable.

Lemma 3 Let us assume that for any i all samples xi,t Rd observed for i at every step t > 0 are iid with mean µi Rd. Let us also assume that ||xi,t|| L for every i and t, and that ||β|| S. Then, for every i, and at each step t, β ϵi,t is sub-gaussian with constant LS ni,t (with ϵi,t = µi ˆxi,t).

Proof By using the Cauchy-Schwarz inequality, for all i and at each step t we have: |x i,tβ| ||β||||ˆxi,t|| LS. Then, given that for all i, all samples xi,t are iid and E[xi,t] = µi, we can apply the Hoeﬀding inequality to the random variable β ˆxi,t with mean µ i β:

γ > 0, P |β ˆxi,t β µi| > γ = P |β ϵi,t| > γ 2e ni,tγ2

which allows us to say that ϵ i,tβ is sub-gaussian with constant LS ni,t .

We ﬁnally use the lemma 2 with the sum of β ϵi,t and ηi,s to prove the proposition 1, which establishes the random variable β ϵi,t + ηi,s is conditionally sub-gaussian with

constant Ri,t =

Lamprier, Gisselbrecht and Gallinari

A.2. Proof of the proposition 3

To lighten notations, we removed the dependence on t in A and X. We have:

ˆβt 1 = arg min β

1 Ris,t (β ˆxis,t ris,s)2 + λ||β||2

= (X AX + λI) 1X AY

= (X AX + λI) 1X A(Xβ + η )

= (X AX + λI) 1X Aη + (X AX + λI) 1(X AX + λI)β

(X AX + λI) 1λIβ

= (X AX + λI) 1X Aη + β λ(X AX + λI) 1β

Then, the following main arguments of this proof come from the theory of autonormalized process (de la Pe na et al., 2009). By using a similar method to the one used in (Abbasi-Yadkori et al., 2011), we get:

||ˆβt 1 β||Vt 1 ||X Aη ||V 1 t 1 + λ||β||V 1 t 1

with Vt 1 = λI + X AX, which is semi-deﬁnite positive since λ > 0. Since ||β|| S and ||β||2 V 1 t 1 ||β||2/λmin(Vt 1) ||β||2/λ, we get:

||ˆβt 1 β||Vt 1 ||X Aη ||V 1 t 1 +

By using the proposition 1 of (Abbasi-Yadkori et al., 2011), and since we know from

proposition 1 that η s Ris,t is sub-gaussian with constant 1, for any δ > 0, with a probability

of at least 1 δ, for every t 0 we have:

||X Aη ||V 1 t 1 = ||

η s Ris,t ˆxis,t||V 1 t 1

v u u t2 log

det(Vt 1)1/2det(λI) 1/2

d log 1 + t L2/λ

A.3. Proof of the proposition 4

Proof Let us assume that the inequality of the proposition 3 is valid. Therefore, we have for all t > 0 and every i K:

Profile-Based Bandit with Unknown Profiles

ˆβ t 1µi + αt 1||µi||V 1 t 1 β µi =(ˆβt 1 β) µi + αt 1||µi||V 1 t 1 ||ˆβt 1 β||Vt 1||µi||V 1 t 1 + αt 1||µi||V 1 t 1 αt 1||µi||V 1 t 1 + αt 1||µi||V 1 t 1 =0

A.4. Proof of the proposition 5

With ||xi,t|| L, we know that, for any j [1..d], |xj i,t| L. Thus, we can apply the Hoeﬀding inequality to each dimension of the proﬁle estimators:

γ > 0 : P |ˆxj i,t µj i| > γ/

d 2e ni,tγ2

Then, by using the fact that ||ˆxi,t µi|| 1

i=1 |ˆxi i,t µi i| and the uniform bound

property, we get:

P (||ˆxi,t µi|| γ) 1 2de ni,tγ2

Thus, for every i {1, . . . , K} and every step t > 0, with a probability of at least 1 δ/t2:

||ˆxi,t µi|| L

2d ni,t log 2dt2

This bound for the deviation of the proﬁle estimator can be less restrictive than the base assumption which states that for any i K and t 0, ||xi,t|| L. From this assumption we indeed know that, ||ˆxi,t|| L, ||µi|| L and thus ||ˆxi,t µi|| 2L . We therefore consider the following bound that holds for any t 0 with a probability greater than 1 δ/t2:

||ˆxi,t µi|| min(L

2d ni,t log 2dt2

, 2L) = ρi,t,δ

A.5. Proof of the proposition 6

Proof Let us assume that the inequality of the proposition 5 is valid. Therefore, we have:

||µi||V 1 t 1 ||ˆxi,t||V 1 t 1 ||µi ˆxi,t||V 1 t 1 ||µi ˆxi,t||/

λ. Thus: ||µi||V 1 t 1

||ˆxi,t||V 1 t 1 + ρi,t,δ/

λ = ||ˆxi,t + ϵi,t||V 1 t 1, with ϵi,t = ρi,t,δˆxi,t/(

λ||ˆxi,t||V 1 t 1).

Lamprier, Gisselbrecht and Gallinari

|ˆβ t 1(ˆxi,t µi)| ||ˆβt 1||||(ˆxi,t µi)|| ˆβ t 1ϵi,t, with ϵi,t = ρi,t,δ ˆβt 1/||ˆβt 1||.

By using these two results and the uniform bound property, we can proof the proposition:

ˆβ t 1(ˆxi,t + ϵi,t) + αt 1||ˆxi,t + ϵi,t||V 1 t 1 β µi

=(ˆβt 1 β) µi + αt 1||ˆxi,t + ϵi,t||V 1 t 1 ˆβ t 1(µi ˆxi,t) + ˆβ t 1ϵi,t

||ˆβt 1 β||Vt 1||µi||V 1 t 1 + αt 1||ˆxi,t + ϵi,t||V 1 t 1 + ˆβ t 1(ˆxi,t µi + ϵi,t)

αt 1||µi||V 1 t 1 + αt 1||ˆxi,t + ϵi,t||V 1 t 1 + ˆβ t 1(ˆxi,t µi + ϵi,t)

αt 1||ˆxi,t + ϵi,t||V 1 t 1 + αt 1||ˆxi,t + ϵi,t||V 1 t 1 + ˆβ t 1(ˆxi,t µi + ϵi,t)

A.6. Proof of the proposition 8

Lemma 4 For every i K and t > 0, with a probability of at least 1 δ/t2 δ, we have:

ˆβ t 1(ˆxi,t + ϵi,t) + αt 1||ˆxi,t + ϵi,t||V 1 t 1 β µi

2αt||ˆxi,t||V 1 t 1 + 4

λ + S)ρi,t,δ

Proof As for proposition 6, we assume that the inequality of proposition 5 holds. Then, noting that ||ϵi,t||V 1 t 1 ||ϵi,t||/

λ = ρi,t,δ/

λ and || ϵi,t||V 1 t 1 = ρi,t,δ/

λ, we have:

ˆβ t 1(ˆxi,t + ϵi,t) + αt 1||ˆxi,t + ϵi,t||V 1 t 1 β µi

=(ˆβt 1 β) µi + αt 1||ˆxi,t + ϵi,t||V 1 t 1 ˆβ t 1(µi ˆxi,t) + ˆβ t 1ϵi,t

||ˆβt 1 β||Vt 1||µi||V 1 t 1 + αt 1||ˆxi,t + ϵi,t||V 1 t 1 + ˆβ t 1(ˆxi,t µi + ϵi,t)

2αt 1||ˆxi,t + ϵi,t||V 1 t 1 + 2||ˆβt 1||Vt 1||ϵi,t||V 1 t 1 2αt 1||ˆxi,t||V 1 t 1 + 2αt 1|| ϵi,t||V 1 t 1 + 2(αt 1 + S

λ)||ϵi,t||V 1 t 1 2αt 1||ˆxi,t||V 1 t 1 + 4(αt 1/

λ + S)ρi,t,δ

Profile-Based Bandit with Unknown Profiles

Lemma 5 For every t, with a probability of at least 1 δ/t2 δ, the instantaneous pseudoregret of the algorithm Samp Lin UCB, noted regt = β µi β µit, is upper-bounded as:

regt 2αt 1||ˆxit,t||V 1 t 1 | {z }

λ + S)ρit,t,δ | {z }

Proof The previous lemma allows us to say that for all t: sit,t β µit + 2αt 1||ˆxi,t||V 1 t 1 + 4(αt 1/

λ + S)ρi,t,δ Now, given the selection policy of Samp Lin UCB and the proposition 6, we get for all t: sit,t si ,t β µi . Thus: regt sit,t β µit 2αt 1||ˆxi,t||V 1 t 1 + 4(αt 1/

λ + S)ρi,t,δ

Next, we use the uniform bound property, the fact that P

δ t2 = δ(π2/6 1) δ and the

fact that in the proposition 3 the bound is uniform (i.e., it holds for all step t simultaneously) to say that, with a probability of at least 1 2δ:

t=1 reg(2) t C +

t=2 4(αt 1/

λ + S)ρit,t,δ

t=2 4(αt 1/

2d ni,t log 2dt2

C + 4L(αT /

2d log 2d T 2

On another hand, we have:

t=1 reg(1) t

t=1 2αt 1||ˆxit,t||2 V 1 t 1

t=1 4α2 t 1||ˆxit,t||2 V 1 t 1

t=1 ||ˆxit,t||2 V 1 t 1

Now, it remains to upper-bound the term TP

t=1 ||ˆxit,t||2 V 1 t 1.

For that purpose, we introduce the following notation: νi,t,δ = L p

2d/(ni,t) log (2d T/δ). By using again the Hoeﬀding inequality, with a probability of at least 1 δ/T, we get for all s t 1:

Lamprier, Gisselbrecht and Gallinari

||ˆxis,t|| ||µis|| + νis,t,δ

With ˇϵi,t = min(νi,t,δ, ||µi||)µi/||µi||, we get, for all s t 1:

Ris,s||µis ˇϵis,s|| 1/ p

Ris,t||ˆxis,t||

Then, we arrive to:

Vt 1 = λI +

1 Ris,t ˆxis,tˆx is,t λI +

1 Ris,s (µis ˇϵis,s)(µis ˇϵis,s) = Wt 1

Which means that for every vector x: ||x||V 1 t 1 ||x||W 1 t 1.

Let us now deﬁne ˆϵi,t = νi,t,δµi/(

λ||µi||W 1 t 1), such that for all s t 1:

||ˆxis,t||W 1 t 1 ||µis + ˆϵis,s||W 1 t 1

||ˆϵis,s||W 1 t 1 = νis,s,δ/

Finally, by using the uniform bound property and the fact that TP

δ T = δ, with a

probability of at least 1 δ:

t=1 ||ˆxit,t||2 V 1 t 1

t=1 ||ˆxit,t||2 W 1 t 1

t=1 ||µit + ˆϵit,t||2 W 1 t 1

t=1 ||µit + ˆϵit,t ˇϵit,t + ˇϵit,t||2 W 1 t 1

t=1 ||µit ˇϵit,t||2 W 1 t 1 +

t=1 ||ˆϵit,t||2 W 1 t 1 +

t=1 ||ˇϵit,t||2 W 1 t 1

t=1 ||µit ˇϵit,t||2 W 1 t 1 + 2

t=1 ν2 it,t,δ

t=1 ||µit ˇϵit,t||2 W 1 t 1 + 4L2d

Profile-Based Bandit with Unknown Profiles

On another hand, we have:

det(WT ) = det(WT 1 + 1 Ri T ,T (µi T ˇϵi T ,T )(µa T ˇϵi T ,T ) )

= det(WT 1)det(I + 1 Ri T ,T W 1/2 T 1 (µi T ˇϵi T ,T )(W 1/2 T 1 (µi T ˇϵi T ,T )) )

= det(WT 1)(1 + 1 Ri T ,T ||µi T ˇϵi T ,T ||2 W 1 T 1)

t=1 (1 + 1 Rit,t ||µit ˇϵit,t||2 W 1 t 1)

Where we used the fact that all eigenvalues of I +xx equal 1 except one that is associated to the eigenvector x and thus equals 1 + ||x||2. Since by assumption λ > max(1, L2/

R2), we have:

1 Rit,t ||µit ˇϵit,t||2 W 1 t 1 ||ˆxit,t||2/

Thus, by using the fact that x 2 log(1 + x) when 0 x 1, we get:

2 log det(WT )

1 Rit,t ||µit ˇϵit,t||2 W 1 t 1

t=1 ||µit ˇϵit,t||2 W 1 t 1

R2 + L2S2 T X

t=1 ||µit ˇϵit,t||2 W 1 t 1

As in the lemma 11 of (Abbasi-Yadkori et al., 2011), we also have:

log det(WT )

d log 1 + TL2

Which leads us to:

t=1 ||µit ˇϵit,t||2 W 1 t 1 p

R2 + L2S2d log 1 + TL2

Finally, same manner as in the lemma 10 of Abbasi-Yadkori et al. (2011), the tracedeterminant inequality gives:

d log 1 + TL2/λ

Gathering all these results together allows us to prove the announced result.

Lamprier, Gisselbrecht and Gallinari

A.7. Proof of the proposition 9

Lemma 6 When removing dependencies on L, λ, R and S, the bound from proposition 16 for the cumulative regret ˆRT can written as follows (when T > d):

ˆRT C + C1d log T

1 nit,t + C2d log T

with C,C1 and C2 three constants.

Proof From proposition 16, we have:

ˆRT C + ˆRT,1 + ˆRT,2

d λ log 1 + TL2/λ

2d log 2d T 2

Constant d log T

1 nit,t (when T > d > 0)

d log 1 + TL2/λ

R2 + L2S2 log 1 + TL2

log(T) + log d T

v u u t Td log d T

Constant d log T

1 nit,t (when T > d > 0)

Thanks to this lemma, we are ready to derive speciﬁc bounds for the three considered proﬁle delivery settings.

Profile-Based Bandit with Unknown Profiles

A.7.1. Case 1:

On one hand, we have:

t 1 + log(T)

On the other hand, we have:

Finally, we get from lemma 6 that, when T > d > 0:

ˆRT,1 C + C1d log T

T + C2d log T

with C,C1 and C2 three constants. Since the last term is clearly the greatest, we get the announced result.

A.7.2. Case 2:

Lemma 7 i, t 2 log(1/δ)/p2 , with a probability of at least 1 δ:

2 Proof By the Hoeﬀding inequality, for all ϵ > 0:

P(ni,t tp ϵ) 1 e 2ϵ2/t

By taking ϵ = tp/2, we get:

P(ni,t tp/2) 1 e tp2/2

If t 2 log(1/δ)/p2, then 1 e tp2/2 1 δ, which proves the lemma.

Let us note u = ceil(2 log(1/δ)/p2). Thus, following lemma 7, with a probability of at least 1 δ, we have:

u + 2 log(T)

Lamprier, Gisselbrecht and Gallinari

From another hand, still thanks to the lemma 7, with a probability of at least 1 δ:

Finally, we get from lemma 6 that, when T > d > 0:

ˆRT,1 C + C1d log T

p + C2d log T

with C,C1 and C2 three constants. Since the last term is clearly the greatest, we get the announced result.

A.7.3. Case 3:

First note that the sum TP

1 nit,t is maximized when each action has delivered exactly T/K

samples in the T/K K ﬁrst iterations (i.e., every action has been played as many times). Thus:

K(1 + log( T/K ))

With the same argument, we also get:

Finally, by noting that K log( T/K ) K log(T/K), that K p

KT and by using the generic bound from the proposition 9, we get the announced result. Finally, since

Profile-Based Bandit with Unknown Profiles

K log( T/K ) K log(T/K) and K p

KT, we get from lemma 6 that, when T > d > 0:

ˆRT,1 C + C1d log T

KT + C2d log T

TK log(T/K)

with C,C1 and C2 three constants. Since the last term is clearly the greatest, we get the announced result.

A.8. Proof of the proposition 10

We follow a similar method to the one presented in Qin et al. (2014) for the speciﬁc case of sums of individual rewards: Since we consider that the reward obtained by playing a set of actions at a given step t is the sum of rewards observed for every single played action at t, we have:

i K µ i β X

with K the set of k optimal actions, i.e., those with greatest expectations µ i β. Then, we use the fact that for every step t: X

i Kt si,t X

This leads to:

i Kt 2αt 1||ˆxi,t||V 1 t 1 + 4

λ + S)ρi,t,δ

Where the matrix Vt 1 is deﬁned by considering the k played actions at each step:

Vt 1 = λI + t 1 P

1 Ri,t ˆxi,tˆx i,t. The fact that we add k terms to the matrix V at each

iteration implies two distinct things:

First on the conﬁdence ellipsoid of β. For k actions played at each step, we have:

d log 1 + k TL2/λ

On another hand on the upper-bounding of TP

it Kt ||µit ˇϵit,t||2 W 1 t 1. We have:

it Kt ||µit ˇϵit,t||2 W 1 t 1

R2 + L2S2d log 1 + Tk L2

Finally, we can use the same methods as in the previous proofs for deriving speciﬁc bounds from the generic one, where the selection of k actions at each step appears explicitly

in the terms TP

1 ni,t and TP

Lamprier, Gisselbrecht and Gallinari

A.9. Table of the main notations

T number of steps of the process K set of available actions K number of available arms k number of simultaneaous plays at each step d dimension of the arms proﬁles it arm selected at step t i arm with the highest ﬁnal cumulative reward ri reward obtained by arm i at step t xi,t proﬁle sample vector observed for arm i at step t ˆxi,t average of proﬁle sample vectors observed for arm i until step t L upper-bound for the proﬁles norm Ot set of arms delivering a proﬁle context at step t ni,t number of samples observed for arm i until step t µi proﬁle vector of arm i β mapping parameter between proﬁles and rewards ˆβt estimator of β at step t S upper-bound for the β parameter norm λ l2-regularization constant of the β estimator ηi,t sub-gaussian noise of the reward of arm i at step t R sub-gaussian constant of the rewards distribution ϵi,t deviation between the true proﬁle of arm i and its estimator at step t (ϵi,t = µi ˆxi,t) η t 1 vector of reward deviations of the ﬁrst t 1 selected arms from their expectation at step t: η t 1 = (ηis,s + ϵ is,tβ) s=1..t 1 Ri,t sub-gaussian constant of the noise of the reward of i at step t w.r.t. ˆx i,tβ Xt 1 (t 1) d matrix containing the empirical means of the selected actions, where the s-th row corresponds to the estimator at step t of the action selected at step s: Xt 1 = (ˆx is,t)s=1..t 1 Yt 1 rewards vector of size t 1: Yt 1 = (ris,s) s=1..t 1 At 1 diagonal (t 1) (t 1) matrix, where the s-th diagonal element equals 1/Ris,t: At 1 = diag(1/Ris,t)s=1..t 1 δ parameter controling the conﬁdence level of the regret bound V 1 t variance-covariance matrix of the posterior distribution of β at step t ρi,t,δ quantity used to bound the deviation of the estimators of proﬁles: ρi,t,δ =

2d ni,t log 2dt2

si,t selection score for arm i at step t αt exploration coeﬃcient at step t w.r.t. the conﬁdence of the beta estimator ϵt quantity used to bound µ i ˆβt ϵt quantity used to bound µ i (β ˆβt)

Profile-Based Bandit with Unknown Profiles

Yasin Abbasi-Yadkori, D avid P al, and Csaba Szepesv ari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain., pages 2312 2320, 2011.

Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pages 39.1 39.26, 2012.

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoﬀs. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 127 135, 2013.

Jean-Yves Audibert and S ebastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009, 2009.

Jean-Yves Audibert, R emi Munos, and Csaba Szepesv ari. Exploration-exploitation tradeoﬀ using variance estimates in multi-armed bandits. Theor. Comput. Sci., 410(19):1876 1902, April 2009. ISSN 0304-3975.

Peter Auer. Using conﬁdence bounds for exploitation-exploration trade-oﬀs. Journal of Machine Learning Research, 2003.

Peter Auer, Nicol o Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235 256, 2002.

Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Proceedings of the 25th Conference on Neural Information Processing Systems 2011, Granada, Spain, pages 2249 2257, 2011.

Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 151 159, 2013.

Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoﬀfunctions. In Proceedings of the Fourteenth International Conference on Artiﬁcial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages 208 214, 2011.

Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008, pages 355 366, 2008.

V. H. de la Pe na, T. L. Lai, and Q. M. Shao. Self-Normalized Processes: Limit Theory and Statistical Applications. Springer Series in Probability and its Applications. Springer, 2009.

Lamprier, Gisselbrecht and Gallinari

Sarah Filippi, Olivier Capp e, Aur elien Garivier, and Csaba Szepesv ari. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada., pages 586 594, 2010.

Aurlien Garivier. The kl-ucb algorithm for bounded stochastic bandits and beyond. In COLT, 2011.

Thibault Gisselbrecht, Ludovic Denoyer, Patrick Gallinari, and Sylvain Lamprier. Whichstreams: A dynamic approach for focused data capture from large social media. In Proceedings of the Ninth International Conference on Web and Social Media, ICWSM 2015, University of Oxford, Oxford, UK, May 26-29, 2015, pages 130 139, 2015.

Liangjie Hong and Brian D. Davison. Empirical study of topic modeling in twitter. In ECIR, 2010.

Emilie Kaufmann, Olivier Capp e, and Aur elien Garivier. On bayesian upper conﬁdence bounds for bandit problems. In Proceedings of the Fifteenth International Conference on Artiﬁcial Intelligence and Statistics, AISTATS 2012, La Palma, Canary Islands, April 21-23, 2012, pages 592 600, 2012a.

Emilie Kaufmann, Nathaniel Korda, and R emi Munos. Thompson sampling: An asymptotically optimal ﬁnite-time analysis. In Algorithmic Learning Theory - 23rd International Conference, ALT 2012, Lyon, France, October 29-31, 2012. Proceedings, pages 199 213, 2012b.

T.L Lai and Herbert Robbins. Asymptotically eﬃcient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4 22, 1985. ISSN 0196-8858.

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW 10, pages 661 670, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-799-8.

Benedict C. May, Nathan Korda, Anthony Lee, and David S. Leslie. Optimistic bayesian sampling in contextual-bandit problems. J. Mach. Learn. Res., 13:2069 2106, 2012. ISSN 1532-4435.

Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. Contextual combinatorial bandit and its application on diversiﬁed online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, Pennsylvania, USA, April 24-26, 2014, pages 461 469, 2014.

Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Math. Oper. Res., 35(2):395 411, 2010.

W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Bulletin of the American Mathematics Society, 25:285 294, 1933.