# Discovering, Learning and Exploiting Relevance

Cem Tekin, Electrical Engineering Department, University of California Los Angeles, cmtkn@ucla.edu
Mihaela van der Schaar, Electrical Engineering Department, University of California Los Angeles, mihaela@ee.ucla.edu

In this paper we consider the problem of learning online which information to take into account when making sequential decisions. We formalize this as a contextual multi-armed bandit problem in which a high-dimensional ($D$-dimensional) context vector arrives to a learner, which must select an action to maximize its expected reward at each time step. Each dimension of the context vector is called a type. We assume that there exists an unknown relation between actions and types, called the relevance relation, such that the reward of an action depends only on the contexts of its relevant types. When this relation is a function, i.e., the reward of an action depends on the context of a single type, and the expected reward of an action is Lipschitz continuous in the context of its relevant type, we propose an algorithm that achieves $O(T^{\gamma})$ regret with high probability, where $\gamma = 2/(1+\sqrt{2})$. Our algorithm achieves this by learning the unknown relevance relation, whereas prior contextual bandit algorithms that do not exploit the existence of a relevance relation have $O(T^{(D+1)/(D+2)})$ regret. Our algorithm alternates between exploring and exploiting; it does not require reward observations in exploitations, and it guarantees with high probability that actions with suboptimality greater than $\epsilon$ are never selected in exploitations. Our proposed method can be applied to a variety of learning applications including medical diagnosis, recommender systems, popularity prediction from social networks, and network security, where at each instance of time vast amounts of different types of information are available to the decision maker, but the effect of an action depends only on a single type.

## 1 Introduction

In numerous learning problems the decision maker is provided with vast amounts of different types of information which it can utilize to learn how to select actions that lead to high rewards. The value of each type of information can be regarded as the context on which the learner acts, hence all the information can be encoded in a context vector. We focus on problems where this context vector is high dimensional but the reward of an action depends only on a small subset of types. This dependence is given in terms of a relation between actions and types, which is called the relevance relation. For an action set $\mathcal{A}$ and a type set $\mathcal{D}$, the relevance relation is given by $\mathcal{R} = \{\mathcal{R}(a)\}_{a \in \mathcal{A}}$, where $\mathcal{R}(a) \subseteq \mathcal{D}$. The expected reward of an action $a$ depends only on the values of the contexts of its relevant types. Hence, for a context vector $x$, action $a$'s expected reward is equal to $\mu(a, x_{\mathcal{R}(a)})$, where $x_{\mathcal{R}(a)}$ is the context vector restricted to the types in $\mathcal{R}(a)$. Several examples of relevance relations and their effect on expected action rewards are given in Fig. 1. The problem of finding the relevance relation is important especially when $\max_{a \in \mathcal{A}} |\mathcal{R}(a)| \ll |\mathcal{D}|$ (Footnote 1). In this paper we consider the case when the relevance relation is a function, i.e., $|\mathcal{R}(a)| = 1$ for all $a \in \mathcal{A}$, which is an important special case. We discuss the extension of our framework to the more general case in Section 3.3.

Footnote 1: For a set $A$, $|A|$ denotes its cardinality.
Figure 1: Examples of relevance relations: (i) general relevance relation, (ii) linear relevance relation, (iii) relevance function. In this paper we only consider (iii), while our methods can be generalized to (i) and (ii).

Relevance relations exist naturally in many practical applications. For example, when sequentially treating patients with a particular disease, many types of information (contexts) are usually available: the patient's age, weight, blood tests, scans, medical history, etc. If a drug's effect on a patient is caused by only one of the types, then learning the relevant type for the drug will result in significantly faster learning of the drug's effectiveness for the patients (Footnote 2). Another example is recommender systems, where recommendations are made based on the high-dimensional information obtained from the browsing and purchase histories of the users. A user's response to one product recommendation may depend on the user's gender, occupation, and history of past purchases, while his/her response to another product recommendation may depend on completely different information about the user, such as age and home address. Traditional contextual bandit solutions disregard the existence of such relations, hence have regret bounds that scale exponentially with the dimension of the context vector [1, 2]. In order to solve this curse of dimensionality, a new approach that learns the relevance relation in an online way is required. The algorithm we propose simultaneously learns the relevance relation (when it is a function) and the action rewards by comparing sample mean rewards of each action for context pairs of different types, calculated based on the context and reward observations so far. The only assumption we make about actions and contexts is the Lipschitz continuity of the expected reward of an action in the context of its relevant type. Our main contributions can be summarized as follows:

- We propose the Online Relevance Learning with Controlled Feedback (ORL-CF) algorithm that alternates between exploration and exploitation phases and achieves a regret bound of $\tilde{O}(T^{\gamma})$ (Footnote 3), with $\gamma = 2/(1+\sqrt{2})$, when the relevance relation is a function. We derive separate bounds on the regret incurred in exploration and exploitation phases.
- ORL-CF only needs to observe the reward in exploration phases, hence the reward feedback is controlled. ORL-CF achieves the same time order of regret even when observing the reward has a non-zero cost.
- Given any $\delta > 0$, which is an input to ORL-CF, suboptimal actions will never be selected in exploitation steps with probability at least $1 - \delta$. This is very important, perhaps vital, in numerous applications where the performance needs to be guaranteed, such as healthcare.

Due to the limited space, numerical results on the performance of our proposed algorithm are included in the supplementary material.

Footnote 2: Even when there are multiple relevant types for each action, if there is one dominant type whose effect on the reward of the action is significantly larger than the effects of the other types, assuming that the relevance relation is a function will be a good approximation.

Footnote 3: $O(\cdot)$ is the Big O notation; $\tilde{O}(\cdot)$ is the same as $O(\cdot)$ except that it hides terms with polylogarithmic growth.

## 2 Problem Formulation

$\mathcal{A}$ is the set of actions, $D$ is the dimension of the context vector, $\mathcal{D} := \{1, 2, \ldots, D\}$ is the set of types, and $\mathcal{R} = \{\mathcal{R}(a)\}_{a \in \mathcal{A}} : \mathcal{A} \to \mathcal{D}$ is the relevance function, which maps every $a \in \mathcal{A}$ to a unique $d \in \mathcal{D}$.
At each time step $t = 1, 2, \ldots$, a context vector $x_t$ arrives to the learner. After observing $x_t$ the learner selects an action $a \in \mathcal{A}$, which results in a random reward $r_t(a, x_t)$. The learner may choose to observe this reward by paying a cost $c_O \geq 0$. The goal of the learner is to maximize the sum of the generated rewards minus the costs of observations for any time horizon $T$.

Each $x_t$ consists of $D$ types of contexts and can be written as $x_t = (x_{1,t}, x_{2,t}, \ldots, x_{D,t})$, where $x_{i,t}$ is called the type $i$ context. $\mathcal{X}_i$ denotes the space of type $i$ contexts and $\mathcal{X} := \mathcal{X}_1 \times \mathcal{X}_2 \times \ldots \times \mathcal{X}_D$ denotes the space of context vectors. At any $t$, we have $x_{i,t} \in \mathcal{X}_i$ for all $i \in \mathcal{D}$. For the sake of notational simplicity we take $\mathcal{X}_i = [0, 1]$ for all $i \in \mathcal{D}$, but all our results can be generalized to the case when $\mathcal{X}_i$ is a bounded subset of the real line. For $x = (x_1, x_2, \ldots, x_D) \in \mathcal{X}$, $r_t(a, x)$ is generated according to an i.i.d. process with distribution $F(a, x_{\mathcal{R}(a)})$ with support in $[0, 1]$ and expected value $\mu(a, x_{\mathcal{R}(a)})$.

The following assumption gives a similarity structure between the expected reward of an action and the contexts of the type that is relevant to that action.

**Assumption 1.** For all $a \in \mathcal{A}$ and $x, x' \in \mathcal{X}$, we have $|\mu(a, x_{\mathcal{R}(a)}) - \mu(a, x'_{\mathcal{R}(a)})| \leq L\, |x_{\mathcal{R}(a)} - x'_{\mathcal{R}(a)}|$, where $L > 0$ is the Lipschitz constant.

We assume that the learner knows the $L$ given in Assumption 1. This is a natural assumption in contextual bandit problems [1, 2]. Given a context vector $x = (x_1, x_2, \ldots, x_D)$, the optimal action is $a^*(x) := \arg\max_{a \in \mathcal{A}} \mu(a, x_{\mathcal{R}(a)})$, but the learner does not know it since it does not know $\mathcal{R}$, $F(a, x_{\mathcal{R}(a)})$ and $\mu(a, x_{\mathcal{R}(a)})$ for $a \in \mathcal{A}$, $x \in \mathcal{X}$ a priori. In order to assess the learner's loss due to these unknowns, we compare its performance with that of an oracle benchmark which knows $a^*(x)$ for all $x \in \mathcal{X}$. Let $\mu_t(a) := \mu(a, x_{\mathcal{R}(a),t})$. The action chosen by the learner at time $t$ is denoted by $\alpha_t$. The learner also decides whether to observe the reward or not; this decision at time $t$ is denoted by $\beta_t \in \{0, 1\}$, where $\beta_t = 1$ means the learner observes the reward and $\beta_t = 0$ means it does not. The learner's performance loss with respect to the oracle benchmark is defined as the regret, whose value at time $T$ is

$$R(T) := \sum_{t=1}^{T} \mu_t(a^*(x_t)) - \sum_{t=1}^{T} \left( \mu_t(\alpha_t) - c_O \beta_t \right). \qquad (1)$$

A regret that grows sublinearly in $T$, i.e., $O(T^{\gamma})$ with $\gamma < 1$, guarantees convergence in terms of the average reward, i.e., $R(T)/T \to 0$. We are interested in achieving sublinear growth with a rate independent of $D$.
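To make this setup concrete, the following is a minimal simulation sketch of the environment and the regret accounting in (1). It is our illustration, not part of the paper: the reward means, the relevance map `relevant`, and the Bernoulli reward distribution are arbitrary choices that satisfy Assumption 1.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 10                            # number of types (dimension of the context vector)
actions = [0, 1, 2]               # action set A
relevant = {0: 3, 1: 7, 2: 0}     # hypothetical relevance function R: action -> single relevant type
c_O = 0.1                         # cost of observing a reward
L = 1.0                           # Lipschitz constant of Assumption 1 for the mu below

def mu(a, x_rel):
    """Hypothetical expected reward: depends only on the relevant type's context,
    takes values in [0.25, 0.75], and is Lipschitz with constant <= L = 1."""
    return 0.5 + 0.25 * np.cos(np.pi * (x_rel - a / 3.0))

def sample_reward(a, x):
    """Random reward with support {0, 1} (a subset of [0, 1]) and mean mu(a, x_{R(a)})."""
    return float(rng.random() < mu(a, x[relevant[a]]))

# Regret accounting as in (1); the oracle benchmark knows R and mu.
T, regret = 1000, 0.0
for t in range(1, T + 1):
    x = rng.random(D)                      # context vector x_t (the arrival process is arbitrary)
    alpha_t = int(rng.choice(actions))     # placeholder policy; ORL-CF replaces this choice
    beta_t = 1                             # placeholder: always pay c_O to observe the reward
    r = sample_reward(alpha_t, x)          # observed (here unused) reward
    best = max(mu(a, x[relevant[a]]) for a in actions)
    regret += best - (mu(alpha_t, x[relevant[alpha_t]]) - c_O * beta_t)

print(f"regret of the placeholder policy after T={T} steps: {regret:.1f}")
```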
## 3 Online Relevance Learning with Controlled Feedback

### 3.1 Description of the algorithm

In this section we propose the algorithm Online Relevance Learning with Controlled Feedback (ORL-CF), which learns the best action for each context vector by simultaneously learning the relevance relation and estimating the expected reward of each action. The feedback, i.e., reward observations, is controlled based on past context vector arrivals, in a way that reward observations are only made for actions for which the uncertainty of the reward estimates is high for the current context vector. The controlled feedback feature allows ORL-CF to operate as an active learning algorithm. The operation of ORL-CF can be summarized as follows:

- Adaptively discretize (partition) the context space of each type, so that action rewards of similar contexts are learned together.
- For each action, form reward estimates for pairs of intervals corresponding to pairs of types.
- Based on the accuracy of these estimates, either choose to explore and observe the reward, or choose to exploit the best estimated action for the current context vector.
- In order to choose the best action, compare the reward estimates for pairs of intervals of which one interval belongs to type $i$, for each type $i$ and action $a$. Conclude that type $i$ is relevant to $a$ if the variation of these reward estimates does not greatly exceed the natural variation of the expected reward of action $a$ over the interval of type $i$ (calculated using Assumption 1).

```
Online Relevance Learning with Controlled Feedback (ORL-CF):
 1: Input: L, ρ, δ.
 2: Initialization: P_{i,1} = {[0,1]}, ∀i ∈ D. Run Initialize(i, P_{i,1}, 1), ∀i ∈ D.
 3: while t ≥ 1 do
 4:   Observe x_t, find the p_t that x_t belongs to.
 5:   Set U_t := ∪_{i∈D} U_{i,t}, where U_{i,t} (given in (3)) is the set of under-explored actions for type i.
 6:   if U_t ≠ ∅ then
 7:     (Explore) β_t = 1, select α_t randomly from U_t, observe r_t(α_t, x_t).
 8:     Update pairwise sample means: for all q ∈ Q_t (given in (2)),
          r̄^{ind(q)}(q, α_t) = (S^{ind(q)}(q, α_t) r̄^{ind(q)}(q, α_t) + r_t(α_t, x_t)) / (S^{ind(q)}(q, α_t) + 1).
 9:     Update counters: for all q ∈ Q_t, S^{ind(q)}(q, α_t) ← S^{ind(q)}(q, α_t) + 1.
10:   else
11:     (Exploit) β_t = 0; for each a ∈ A calculate the set of candidate relevant types Rel_t(a) given in (4).
12:     for a ∈ A do
13:       if Rel_t(a) = ∅ then
14:         Randomly select ĉ_t(a) from D.
15:       else
16:         For each i ∈ Rel_t(a), calculate Var_t(i, a) given in (5).
17:         Set ĉ_t(a) = arg min_{i ∈ Rel_t(a)} Var_t(i, a).
18:       end if
19:       Calculate r̄^{ĉ_t(a)}_t(a) as given in (6).
20:     end for
21:     Select α_t = arg max_{a ∈ A} r̄^{ĉ_t(a)}_t(a).
22:   end if
23:   for i ∈ D do
24:     N^i(p_{i,t}) ← N^i(p_{i,t}) + 1.
25:     if N^i(p_{i,t}) ≥ 2^{ρ l(p_{i,t})} then
26:       Create two new level l(p_{i,t}) + 1 intervals p, p' whose union gives p_{i,t}.
27:       P_{i,t+1} = P_{i,t} ∪ {p, p'} \ {p_{i,t}}.
28:       Run Initialize(i, {p, p'}, t).
29:     else
30:       P_{i,t+1} = P_{i,t}.
31:     end if
32:   end for
33:   t = t + 1
34: end while

Initialize(i, B, t):
 1: for p ∈ B do
 2:   Set N^i(p) = 0, r̄^{i,j}(p, p_j, a) = r̄^{j,i}(p_j, p, a) = 0, S^{i,j}(p, p_j, a) = S^{j,i}(p_j, p, a) = 0
      for all a ∈ A, j ∈ D_{-i} and p_j ∈ P_{j,t}.
 3: end for
```

Figure 2: Pseudocode for ORL-CF.

Since the number of contexts is infinite, learning the reward of an action separately for each context is not feasible. In order to learn fast, ORL-CF exploits the similarity between the contexts of the relevant type given in Assumption 1 to estimate the rewards of actions. The key to the success of our algorithm is that this estimation is good enough. ORL-CF adaptively forms a partition of the context space of each type in $\mathcal{D}$, where the partition for the context space of type $i$ at time $t$ is denoted by $P_{i,t}$. All the elements of $P_{i,t}$ are disjoint intervals of $\mathcal{X}_i = [0, 1]$ whose lengths are elements of the set $\{1, 2^{-1}, 2^{-2}, \ldots\}$ (Footnote 4). An interval with length $2^{-l}$, $l \geq 0$, is called a level $l$ interval; for an interval $p$, $l(p)$ denotes its level and $s(p)$ denotes its length. By convention, intervals are of the form $(a, b]$, with the only exception being the interval containing $0$, which is of the form $[0, b]$ (Footnote 5). Let $p_{i,t} \in P_{i,t}$ be the interval that $x_{i,t}$ belongs to, $p_t := (p_{1,t}, \ldots, p_{D,t})$ and $P_t := (P_{1,t}, \ldots, P_{D,t})$.

Footnote 4: Setting interval lengths to powers of 2 is for presentational simplicity. In general, interval lengths can be set to powers of any real number greater than 1.

Footnote 5: The endpoints of intervals do not matter in our analysis, so our results hold even when intervals have common endpoints.

The pseudocode of ORL-CF is given in Fig. 2. ORL-CF starts with $P_{i,1} = \{\mathcal{X}_i\} = \{[0, 1]\}$ for each $i \in \mathcal{D}$.
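The adaptive partitions and their per-interval counters can be kept in a small per-type data structure. The sketch below is our illustration (the class name `TypePartition` and its method names are not from the paper); it locates the active interval $p_{i,t}$ of a context, maintains the arrival counter $N^i(p)$, and applies the splitting rule of lines 23-32 of Fig. 2 with threshold $2^{\rho\, l(p)}$.

```python
from collections import defaultdict

class TypePartition:
    """Adaptive partition of X_i = [0, 1] for one type i (illustrative sketch).

    Intervals are (a, b] except the leftmost [0, b]; a level-l interval has
    length 2^{-l}. An interval is deactivated and split into its two level-(l+1)
    children once it has seen 2^{rho * l} context arrivals.
    """
    def __init__(self, rho):
        self.rho = rho
        self.intervals = {(0, 0)}         # active intervals as (level l, index k): covers (k 2^-l, (k+1) 2^-l]
        self.arrivals = defaultdict(int)  # N^i(p): context arrivals to each active interval p

    def locate(self, x):
        """Return the active interval p_{i,t} that contains the context x in [0, 1]."""
        for (level, k) in self.intervals:
            lo, hi = k * 2.0 ** -level, (k + 1) * 2.0 ** -level
            if (lo < x <= hi) or (k == 0 and x == 0.0):
                return (level, k)
        raise ValueError("active intervals should cover [0, 1]")

    def register_arrival(self, p):
        """Increment N^i(p); split p into two children when N^i(p) >= 2^{rho * l(p)}."""
        self.arrivals[p] += 1
        level, k = p
        if self.arrivals[p] >= 2 ** (self.rho * level):
            self.intervals.discard(p)
            self.intervals.update({(level + 1, 2 * k), (level + 1, 2 * k + 1)})
            # The new intervals start with fresh (zero) counters and sample means,
            # as in the Initialize routine of Fig. 2.
            return True
        return False
```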
As time goes on and more contexts arrive for each type $i$, ORL-CF divides $\mathcal{X}_i$ into smaller and smaller intervals. The idea is to combine the past observations made in an interval to form sample mean reward estimates for that interval, and to use them to approximate the expected rewards of actions for contexts lying in the interval. The intervals are created so as to balance the variation of the sample mean rewards due to the number of past observations used to compute them against the variation of the expected rewards within each interval. We also call $P_{i,t}$ the set of active intervals for type $i$ at time $t$. Since the partition of each type is adaptive, as time goes on new intervals become active while old intervals are deactivated, based on how contexts arrive. For a type $i$ interval $p$, let $N^i_t(p)$ be the number of times $x_{i,t'} \in p \in P_{i,t'}$ for $t' \leq t$. The duration of time that an interval remains active, i.e., its lifetime, is determined by an input parameter $\rho > 0$, called the duration parameter. Whenever the number of arrivals to an interval $p$ exceeds $2^{\rho l(p)}$, ORL-CF deactivates $p$ and creates two level $l(p)+1$ intervals whose union gives $p$. For example, when $p_{i,t} = (k 2^{-l}, (k+1) 2^{-l}]$ for some $0 \leq k \leq 2^l - 1$, if $N^i_t(p_{i,t}) \geq 2^{\rho l}$, ORL-CF sets $P_{i,t+1} = P_{i,t} \cup \{(k 2^{-l}, (k + 1/2) 2^{-l}], ((k + 1/2) 2^{-l}, (k+1) 2^{-l}]\} \setminus \{p_{i,t}\}$. Otherwise $P_{i,t+1}$ remains the same as $P_{i,t}$. It is easy to see that the lifetime of an interval increases exponentially in the duration parameter.

We next describe the counters, control numbers and sample mean rewards that the learner keeps for each pair of intervals corresponding to a pair of types, used to decide whether to explore or exploit and how to exploit. Let $\mathcal{D}_{-i} := \mathcal{D} \setminus \{i\}$. For type $i$, let $Q_{i,t} := \{(p_{i,t}, p_{j,t}) : j \in \mathcal{D}_{-i}\}$ be the set of interval pairs related to type $i$ at time $t$, and let

$$Q_t := \bigcup_{i \in \mathcal{D}} Q_{i,t}. \qquad (2)$$

To denote an element of $Q_{i,t}$ or $Q_t$ we use the index $q$. For any $q \in Q_t$, the corresponding pair of types is denoted by $\mathrm{ind}(q)$; for example, $\mathrm{ind}((p_{i,t}, p_{j,t})) = (i, j)$. The decision to explore or exploit at time $t$ is based solely on $p_t$. For events $A_1, \ldots, A_K$, let $I(A_1, \ldots, A_K)$ denote the indicator function of the event $\bigcap_{k=1}^{K} A_k$. For $p \in P_{i,t}$ and $p' \in P_{j,t}$, let

$$S^{i,j}_t(p, p', a) := \sum_{t' < t} I\left(\alpha_{t'} = a,\ \beta_{t'} = 1,\ p_{i,t'} = p,\ p_{j,t'} = p'\right)$$

be the number of times $a$ is selected and its reward is observed while the type $i$ context is in $p$ and the type $j$ context is in $p'$, summed over the times when both intervals are active. For the same $p$ and $p'$, let

$$\bar{r}^{i,j}_t(p, p', a) := \frac{1}{S^{i,j}_t(p, p', a)} \sum_{t' < t} r_{t'}(a, x_{t'})\, I\left(\alpha_{t'} = a,\ \beta_{t'} = 1,\ p_{i,t'} = p,\ p_{j,t'} = p'\right)$$

be the pairwise sample mean reward of action $a$ for the pair of intervals $(p, p')$. At time $t$, ORL-CF assigns a control number to each type $i \in \mathcal{D}$,

$$D_{i,t} := \frac{2 \log(t D |\mathcal{A}| / \delta)}{(L\, s(p_{i,t}))^2},$$

which depends on the cardinality of $\mathcal{A}$, the length of the active interval that the type $i$ context lies in at time $t$, and a confidence parameter $\delta > 0$ which controls the accuracy of the sample mean reward estimates. Then, it computes the set of under-explored actions for type $i$ as

$$U_{i,t} := \left\{a \in \mathcal{A} : S^{\mathrm{ind}(q)}_t(q, a) < D_{i,t} \text{ for some } q \in Q_{i,t}\right\}, \qquad (3)$$

and the set of under-explored actions as $U_t := \bigcup_{i \in \mathcal{D}} U_{i,t}$. The decision to explore or exploit is based on whether or not $U_t$ is empty.

(i) If $U_t \neq \emptyset$, ORL-CF randomly selects an action $\alpha_t \in U_t$ to explore and observes its reward $r_t(\alpha_t, x_t)$. Then, it updates the pairwise sample mean rewards and pairwise counters for all $q \in Q_t$:

$$\bar{r}^{\mathrm{ind}(q)}_{t+1}(q, \alpha_t) = \frac{S^{\mathrm{ind}(q)}_t(q, \alpha_t)\, \bar{r}^{\mathrm{ind}(q)}_t(q, \alpha_t) + r_t(\alpha_t, x_t)}{S^{\mathrm{ind}(q)}_t(q, \alpha_t) + 1}, \qquad S^{\mathrm{ind}(q)}_{t+1}(q, \alpha_t) = S^{\mathrm{ind}(q)}_t(q, \alpha_t) + 1.$$
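For concreteness, the control number $D_{i,t}$, the under-explored set in (3), and the exploration update can be written directly from these definitions. The sketch below is ours (all variable and function names are illustrative); the dictionaries `S` and `rbar` hold the pairwise counters and sample means, keyed by (type pair, interval pair, action).

```python
import math
from collections import defaultdict
from itertools import permutations

# Pairwise statistics, keyed by (i, j, p_i, p_j, a):
#   S[key]    = number of observed rewards of action a with type-i context in p_i and type-j context in p_j
#   rbar[key] = the corresponding pairwise sample mean reward
S = defaultdict(int)
rbar = defaultdict(float)

def control_number(t, s_pi, L, n_actions, D, delta):
    """D_{i,t} = 2 log(t * D * |A| / delta) / (L * s(p_{i,t}))^2."""
    return 2.0 * math.log(t * D * n_actions / delta) / (L * s_pi) ** 2

def under_explored(t, p, lengths, actions, L, D, delta):
    """U_t as in (3): actions with too few observations for some pair q in Q_{i,t}, for some type i.
    p[i] is the active interval of type i, lengths[i] its length s(p_{i,t})."""
    U = set()
    for i in range(D):
        D_it = control_number(t, lengths[i], L, len(actions), D, delta)
        for j in range(D):
            if j == i:
                continue
            for a in actions:
                if S[(i, j, p[i], p[j], a)] < D_it:
                    U.add(a)
    return U

def explore_update(p, D, a, reward):
    """After observing the reward of action a, update all pairwise means/counters for q in Q_t."""
    for i, j in permutations(range(D), 2):
        key = (i, j, p[i], p[j], a)
        rbar[key] = (S[key] * rbar[key] + reward) / (S[key] + 1)
        S[key] += 1
```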
(ii) If $U_t = \emptyset$, ORL-CF exploits by estimating the relevant type $\hat{c}_t(a)$ for each $a \in \mathcal{A}$ and forming a sample mean reward estimate for each action $a$ based on $\hat{c}_t(a)$. It first computes the set of candidate relevant types for each $a \in \mathcal{A}$:

$$\mathrm{Rel}_t(a) := \left\{ i \in \mathcal{D} : \left| \bar{r}^{i,j}_t(p_{i,t}, p_{j,t}, a) - \bar{r}^{i,k}_t(p_{i,t}, p_{k,t}, a) \right| \leq 3 L\, s(p_{i,t}),\ \forall j, k \in \mathcal{D}_{-i} \right\}. \qquad (4)$$

The intuition is that if $i$ is the type relevant to $a$, then, independent of the values of the contexts of the other types, the variation of the pairwise sample mean reward of $a$ over $p_{i,t}$ must be very close to the variation of the expected reward of $a$ in that interval. If $\mathrm{Rel}_t(a)$ is empty, ORL-CF has failed to identify the relevant type, hence $\hat{c}_t(a)$ is randomly selected from $\mathcal{D}$. If $\mathrm{Rel}_t(a)$ is nonempty, ORL-CF computes the maximum variation

$$\mathrm{Var}_t(i, a) := \max_{j, k \in \mathcal{D}_{-i}} \left| \bar{r}^{i,j}_t(p_{i,t}, p_{j,t}, a) - \bar{r}^{i,k}_t(p_{i,t}, p_{k,t}, a) \right| \qquad (5)$$

for each $i \in \mathrm{Rel}_t(a)$. Then it sets $\hat{c}_t(a) = \arg\min_{i \in \mathrm{Rel}_t(a)} \mathrm{Var}_t(i, a)$. This way, whenever the type relevant to action $a$ is in $\mathrm{Rel}_t(a)$, even if it is not selected as the estimated relevant type, the sample mean reward of $a$ calculated based on the estimated relevant type will be very close to the sample mean of its reward calculated according to the true relevant type. After finding the estimated relevant types, the sample mean reward of each action is computed based on its estimated relevant type as

$$\bar{r}^{\hat{c}_t(a)}_t(a) := \frac{\sum_{j \in \mathcal{D}_{-\hat{c}_t(a)}} \bar{r}^{\hat{c}_t(a), j}_t(p_{\hat{c}_t(a),t}, p_{j,t}, a)\, S^{\hat{c}_t(a), j}_t(p_{\hat{c}_t(a),t}, p_{j,t}, a)}{\sum_{j \in \mathcal{D}_{-\hat{c}_t(a)}} S^{\hat{c}_t(a), j}_t(p_{\hat{c}_t(a),t}, p_{j,t}, a)}. \qquad (6)$$

Then, ORL-CF selects $\alpha_t = \arg\max_{a \in \mathcal{A}} \bar{r}^{\hat{c}_t(a)}_t(a)$. Since the reward is not observed in exploitations, the pairwise sample mean rewards and counters are not updated.
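For completeness, here is a matching sketch of the exploitation step (again our illustrative code, assuming $D \geq 2$ and reusing the `S`/`rbar` layout of the previous sketch): it computes $\mathrm{Rel}_t(a)$ as in (4), $\mathrm{Var}_t(i, a)$ as in (5), selects $\hat{c}_t(a)$, and forms the weighted estimate (6).

```python
import random
from collections import defaultdict

S = defaultdict(int)       # pairwise counters S^{i,j}_t(p_i, p_j, a), as in the previous sketch
rbar = defaultdict(float)  # pairwise sample means rbar^{i,j}_t(p_i, p_j, a)

def exploit(p, lengths, actions, L, D):
    """One exploitation step: estimate each action's relevant type and return the
    action with the highest estimated reward (no reward is observed here)."""
    best_action, best_estimate = None, -float("inf")
    for a in actions:
        # Rel_t(a) as in (4): types whose pairwise means vary by at most 3 L s(p_{i,t}).
        rel, variation = [], {}
        for i in range(D):
            others = [j for j in range(D) if j != i]
            means = [rbar[(i, j, p[i], p[j], a)] for j in others]
            variation[i] = max(means) - min(means)          # Var_t(i, a) as in (5)
            if variation[i] <= 3 * L * lengths[i]:
                rel.append(i)
        # Estimated relevant type: minimum variation over Rel_t(a), or random if Rel_t(a) is empty.
        c_hat = min(rel, key=lambda i: variation[i]) if rel else random.randrange(D)
        # Weighted sample mean estimate (6); the weights are the pairwise counters.
        num = sum(rbar[(c_hat, j, p[c_hat], p[j], a)] * S[(c_hat, j, p[c_hat], p[j], a)]
                  for j in range(D) if j != c_hat)
        den = sum(S[(c_hat, j, p[c_hat], p[j], a)] for j in range(D) if j != c_hat)
        estimate = num / den if den > 0 else 0.0   # den > 0 in a true exploitation step
        if estimate > best_estimate:
            best_action, best_estimate = a, estimate
    return best_action
```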
### 3.2 Regret analysis of ORL-CF

Let $\tau(T) \subset \{1, 2, \ldots, T\}$ be the set of time steps in which ORL-CF exploits by time $T$. $\tau(T)$ is a random set which depends on the context arrivals and on the randomness of the action selection of ORL-CF. The regret $R(T)$ defined in (1) can be written as the sum of the regret incurred during explorations (denoted by $R_O(T)$) and the regret incurred during exploitations (denoted by $R_I(T)$). The following theorem bounds the regret of ORL-CF in exploitation steps.

**Theorem 1.** Let ORL-CF run with duration parameter $\rho > 0$, confidence parameter $\delta > 0$ and control numbers $D_{i,t} := \frac{2 \log(t |\mathcal{A}| D / \delta)}{(L\, s(p_{i,t}))^2}$ for $i \in \mathcal{D}$. Let $R_{\mathrm{inst}}(t)$ be the instantaneous regret at time $t$, which is the loss in expected reward at time $t$ due to not selecting $a^*(x_t)$. Then, with probability at least $1 - \delta$, we have

$$R_{\mathrm{inst}}(t) \leq 8L\left( s(p_{\mathcal{R}(\alpha_t),t}) + s(p_{\mathcal{R}(a^*(x_t)),t}) \right)$$

for all $t \in \tau(T)$, and the total regret in exploitation steps is bounded above by

$$R_I(T) \leq 8L \sum_{t \in \tau(T)} \left( s(p_{\mathcal{R}(\alpha_t),t}) + s(p_{\mathcal{R}(a^*(x_t)),t}) \right) \leq 16 L\, 2^{2\rho}\, T^{\rho/(1+\rho)},$$

for arbitrary context vectors $x_1, x_2, \ldots, x_T$.

Theorem 1 provides both context-arrival-dependent and worst-case bounds on the exploitation regret of ORL-CF. By choosing $\rho$ arbitrarily close to zero, $R_I(T)$ can be made $O(T^{\gamma})$ for any $\gamma > 0$. While this is true, the reduction in regret for smaller $\rho$ comes not only from increased accuracy, but also from the reduction in the number of time steps in which ORL-CF exploits, i.e., $|\tau(T)|$. By definition, time $t$ is an exploitation step if

$$S^{i,j}_t(p_{i,t}, p_{j,t}, a) \geq \frac{2 \log(t |\mathcal{A}| D / \delta)}{L^2 \min\{s(p_{i,t})^2, s(p_{j,t})^2\}} = \frac{2^{2 \max\{l(p_{i,t}),\, l(p_{j,t})\} + 1} \log(t |\mathcal{A}| D / \delta)}{L^2}$$

for all $q = (p_{i,t}, p_{j,t}) \in Q_t$, $i, j \in \mathcal{D}$, and all $a \in \mathcal{A}$. This implies that for any $q \in Q_{i,t}$ whose interval of maximum level has level $l$, $O(2^{2l})$ explorations are required before any exploitation can take place. Since a level $l$ interval can stay active for $2^{\rho l}$ time steps, it is required that $\rho \geq 2$ so that $\tau(T)$ is nonempty. The next theorem gives a bound on the regret of ORL-CF in exploration steps.

**Theorem 2.** Let ORL-CF run with the $\rho$, $\delta$ and $D_{i,t}$, $i \in \mathcal{D}$, values stated in Theorem 1. Then,

$$R_O(T) \leq \frac{960\, D^2 (c_O + 1) \log(T |\mathcal{A}| D / \delta)}{7 L^2}\, T^{4/\rho} + 64\, D^2 (c_O + 1),$$

with probability 1, for arbitrary context vectors $x_1, x_2, \ldots, x_T$.

Based on the choice of the duration parameter $\rho$, which determines how long an interval stays active, it is possible to get different regret bounds for explorations and exploitations. Any $\rho > 4$ gives a sublinear regret bound for both explorations and exploitations. The regret in exploitations increases in $\rho$, while the regret in explorations decreases in $\rho$.

**Theorem 3.** Let ORL-CF run with the $\delta$ and $D_{i,t}$, $i \in \mathcal{D}$, values stated in Theorem 1 and $\rho = 2 + 2\sqrt{2}$. Then the time orders of the exploration and exploitation regrets are balanced up to logarithmic factors: with probability at least $1 - \delta$ we have both $R_I(T) = \tilde{O}(T^{2/(1+\sqrt{2})})$ and $R_O(T) = \tilde{O}(T^{2/(1+\sqrt{2})})$.

**Remark 1.** Prior work on contextual bandits focused on balancing the regret due to exploration and exploitation. For example, in [1, 2], for a $D$-dimensional context vector, algorithms are shown to achieve $\tilde{O}(T^{(D+1)/(D+2)})$ regret (Footnote 6). Also, in [1] an $\Omega(T^{(D+1)/(D+2)})$ lower bound on the regret is proved. An interesting question is to find the tightest lower bound for contextual bandits with a relevance function. One trivial lower bound is $\Omega(T^{2/3})$, which corresponds to $D = 1$. However, since finding the action with the highest expected reward for a context vector requires comparing the estimated rewards of actions with different relevant types, which requires accurate sample mean reward estimates over the 2 dimensions of the context space corresponding to those types, we conjecture that a tighter lower bound is $\Omega(T^{3/4})$. Proving this is left as future work.

Footnote 6: The results are shown in terms of the covering dimension, which reduces to the Euclidean dimension for our problem.

Another interesting case is when actions with suboptimality greater than $\epsilon > 0$ must never be chosen in any exploitation step by time $T$. When such a condition is imposed, ORL-CF can start with partitions $P_{i,1}$ whose intervals have high levels, so that it explores more at the beginning and has more accurate reward estimates before any exploitation. The following theorem gives the regret bound of ORL-CF for this case.

**Theorem 4.** Let ORL-CF run with duration parameter $\rho > 0$, confidence parameter $\delta > 0$, control numbers $D_{i,t} := \frac{2 \log(t |\mathcal{A}| D / \delta)}{(L\, s(p_{i,t}))^2}$, and initial partitions $P_{i,1}$, $i \in \mathcal{D}$, consisting of intervals of length $2^{-l_{\min}}$, where $l_{\min} = \lceil \log_2(3L/(2\epsilon)) \rceil$. Then, with probability at least $1 - \delta$, $R_{\mathrm{inst}}(t) \leq \epsilon$ for all $t \in \tau(T)$,

$$R_I(T) \leq 16 L\, 2^{2\rho}\, T^{\rho/(1+\rho)} \quad \text{and} \quad R_O(T) \leq \frac{960\, D^2 (c_O + 1) \log(T |\mathcal{A}| D / \delta)}{7 L^2}\, T^{4/\rho} + 64\, D^2 (c_O + 1),$$

for arbitrary context vectors $x_1, x_2, \ldots, x_T$. The bounds on $R_I(T)$ and $R_O(T)$ are balanced for $\rho = 2 + 2\sqrt{2}$.
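As a quick check of the choice $\rho = 2 + 2\sqrt{2}$ (our own verification, using the time exponents $T^{\rho/(1+\rho)}$ from Theorem 1 and $T^{4/\rho}$ from Theorem 2), equating the two exponents gives:

```latex
\frac{\rho}{1+\rho} = \frac{4}{\rho}
\;\Longrightarrow\; \rho^{2} - 4\rho - 4 = 0
\;\Longrightarrow\; \rho = 2 + 2\sqrt{2} \ \ (\text{positive root}),
\qquad
\frac{4}{\rho} = \frac{\rho}{1+\rho} = \frac{2}{1+\sqrt{2}} = 2(\sqrt{2}-1) \approx 0.83 .
```

Both regrets are therefore $\tilde{O}(T^{2/(1+\sqrt{2})})$, which is the exponent $\gamma$ quoted in the abstract.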
### 3.3 Future Work

In this paper we only considered relevance relations that are functions. Similar learning methods can be developed for more general relevance relations such as the ones given in Fig. 1 (i) and (ii). For example, for the general case in Fig. 1 (i), if $|\mathcal{R}(a)| \leq D_{\mathrm{rel}} \ll D$ for all $a \in \mathcal{A}$, and $D_{\mathrm{rel}}$ is known by the learner, the following variant of ORL-CF can be used to achieve a regret whose time order depends only on $D_{\mathrm{rel}}$ but not on $D$:

- Instead of keeping pairwise sample mean reward estimates, keep sample mean reward estimates of actions for $(D_{\mathrm{rel}}+1)$-tuples of intervals of $(D_{\mathrm{rel}}+1)$ types.
- For a $D_{\mathrm{rel}}$-tuple of types $i$, let $Q_{i,t}$ be the set of $(D_{\mathrm{rel}}+1)$-tuples of intervals that are related to $i$ at time $t$, and let $Q_t$ be the union of $Q_{i,t}$ over all $D_{\mathrm{rel}}$-tuples of types.
- Similar to ORL-CF, compute the set of under-explored actions $U_{i,t}$ and the set of candidate relevant $D_{\mathrm{rel}}$-tuples of types $\mathrm{Rel}_t(a)$, using the newly defined sample mean reward estimates.
- In exploitation, set $\hat{c}_t(a)$ to be the $D_{\mathrm{rel}}$-tuple of types with the minimax variation, where the variation of action $a$ for a tuple $i$ is defined similarly to (5), as the maximum distance between the sample mean rewards of action $a$ for the $(D_{\mathrm{rel}}+1)$-tuples in $Q_{i,t}$.

Another interesting case is when the relevance relation is linear, as given in Fig. 1 (ii). For example, if for action $a$ there is a type $i$ that is much more relevant than the other types $j \in \mathcal{D}_{-i}$, i.e., $w_{a,i} \gg w_{a,j}$, where the weights $w_{a,i}$ are given in Fig. 1, then ORL-CF is expected to have good performance (but not sublinear regret with respect to the benchmark that knows $\mathcal{R}$).

## 4 Related Work

Contextual bandit problems have been studied by many others in the past [3, 4, 1, 2, 5, 6]. The problem we consider in this paper is a special case of the Lipschitz contextual bandit problem [1, 2], where the only assumption is the existence of a known similarity metric between the expected rewards of actions for different contexts. It is known that the lower bound on the regret for this problem is $\Omega(T^{(D+1)/(D+2)})$ [1], and there exist algorithms that achieve $\tilde{O}(T^{(D+1)/(D+2)})$ regret [1, 2]. Compared to the prior work above, ORL-CF only needs to observe rewards in explorations and has a regret whose time order is independent of $D$. Hence it can still learn the optimal actions fast enough in settings where observations are costly and the context vector is high dimensional. Examples of related work that considers limited observations are KWIK learning [7, 8] and label efficient learning [9, 10, 11]. For example, [8] considers a bandit model where the reward function comes from a parameterized family of functions and gives a bound on the average regret. An online prediction problem is considered in [9, 10, 11], where the predictor (action) lies in a class of linear predictors and the benchmark is the best linear predictor. This restriction plays a crucial role in deriving regret bounds whose time order does not depend on $D$. Similar to these works, ORL-CF can guarantee with high probability that actions with large suboptimality are never selected in exploitation steps. However, we make no assumptions on the form of the expected reward function other than Lipschitz continuity and the fact that it depends on a single type for each action. In [12], graphical bandits are proposed, where the learner takes an action vector $a$ which includes actions from several types that constitute a type set $\mathcal{T}$. The expected reward of $a$ for context vector $x$ can be decomposed into a sum of reward functions, each of which depends only on a subset of $\mathcal{D} \cup \mathcal{T}$. However, it is assumed that the form of the decomposition is known while the functions themselves are not. Another work, [13], proposes a fast learning algorithm for an i.i.d. contextual bandit problem in which the context and the rewards of the actions are sampled from a joint probability distribution.
In that work the authors consider learning the best policy from a finite set of policies with oracle access, and prove a regret bound of $O(\sqrt{T})$ that is also logarithmic in the size of the policy space. In contrast, in our problem (i) contexts arrive according to an arbitrary exogenous process, and the action rewards are sampled from an i.i.d. distribution given the context value, and (ii) the set of policies that the learner can adopt is not restricted. Large-dimensional action spaces, where the rewards depend on a subset of the types of actions, are considered in [14] and [15]. [14] considers the problem where the reward is Hölder continuous in an unknown low-dimensional tuple of types, and uses a special discretization of the action space to achieve dimension-independent bounds on the regret. This discretization can be used effectively because the learner can select the actions, as opposed to our case, where the learner has no control over the contexts. [15] considers the problem of optimizing high-dimensional functions that have an unknown low-dimensional structure from noisy observations.

## 5 Conclusion

In this paper we formalized the problem of learning the best action through learning the relevance relation between types of contexts and actions. For the case when the relevance relation is a function, we proposed an algorithm that (i) has sublinear regret with time order independent of $D$, (ii) only requires reward observations in explorations, and (iii) for any $\epsilon > 0$, does not select any $\epsilon$-suboptimal actions in exploitations with high probability. In the future we will extend our results to the linear and general relevance relations illustrated in Fig. 1.

## References

[1] T. Lu, D. Pál, and M. Pál, "Contextual multi-armed bandits," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 485-492.
[2] A. Slivkins, "Contextual bandits with similarity information," in Conference on Learning Theory (COLT), 2011.
[3] E. Hazan and N. Megiddo, "Online learning with prior knowledge," in Learning Theory. Springer, 2007, pp. 499-513.
[4] J. Langford and T. Zhang, "The epoch-greedy algorithm for contextual multi-armed bandits," in Advances in Neural Information Processing Systems (NIPS), vol. 20, pp. 1096-1103, 2007.
[5] W. Chu, L. Li, L. Reyzin, and R. E. Schapire, "Contextual bandits with linear payoff functions," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 208-214.
[6] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang, "Efficient optimal learning for contextual bandits," arXiv preprint arXiv:1106.2369, 2011.
[7] L. Li, M. L. Littman, T. J. Walsh, and A. L. Strehl, "Knows what it knows: a framework for self-aware learning," Machine Learning, vol. 82, no. 3, pp. 399-443, 2011.
[8] K. Amin, M. Kearns, M. Draief, and J. D. Abernethy, "Large-scale bandit problems and KWIK learning," in International Conference on Machine Learning (ICML), 2013, pp. 588-596.
[9] N. Cesa-Bianchi, C. Gentile, and F. Orabona, "Robust bounds for classification via selective sampling," in International Conference on Machine Learning (ICML), 2009, pp. 121-128.
[10] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, "Efficient bandit algorithms for online multiclass prediction," in International Conference on Machine Learning (ICML), 2008, pp. 440-447.
[11] E. Hazan and S. Kale, "Newtron: an efficient bandit algorithm for online multiclass prediction," in Advances in Neural Information Processing Systems (NIPS), 2011, pp. 891-899.
[12] K. Amin, M. Kearns, and U. Syed, "Graphical models for bandit problems," in Conference on Uncertainty in Artificial Intelligence (UAI), 2011.
[13] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. E. Schapire, "Taming the monster: A fast and simple algorithm for contextual bandits," arXiv preprint arXiv:1402.0555, 2014.
[14] H. Tyagi and B. Gärtner, "Continuum armed bandit problem of few variables in high dimensions," in Workshop on Approximation and Online Algorithms (WAOA), 2014, pp. 108-119.
[15] J. Djolonga, A. Krause, and V. Cevher, "High-dimensional Gaussian process bandits," in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 1025-1033.