# Linear Multi-Resource Allocation with Semi-Bandit Feedback

Tor Lattimore, Department of Computing Science, University of Alberta, Canada (tor.lattimore@gmail.com)
Koby Crammer, Department of Electrical Engineering, The Technion, Israel (koby@ee.technion.ac.il)
Csaba Szepesvári, Department of Computing Science, University of Alberta, Canada (szepesva@ualberta.ca)

Abstract: We study an idealised sequential resource allocation problem. In each time step the learner chooses an allocation of several resource types between a number of tasks. Assigning more resources to a task increases the probability that it is completed. The problem is challenging because the alignment of the tasks to the resource types is unknown and the feedback is noisy. Our main contribution is the new setting and an algorithm with nearly-optimal regret analysis. Along the way we draw connections to the problem of minimising regret for stochastic linear bandits with heteroscedastic noise. We also present some new results for stochastic linear bandits on the hypercube that significantly improve on existing work, especially in the sparse case.

1 Introduction

Economist Thomas Sowell remarked that "The first lesson of economics is scarcity: There is never enough of anything to fully satisfy all those who want it."¹ The optimal allocation of resources is an enduring problem in economics, operations research and daily life. The problem is challenging not only because you are compelled to make difficult trade-offs, but also because the (expected) outcome of a particular allocation may be unknown and the feedback noisy.

We focus on an idealised resource allocation problem in which the economist plays a repeated resource allocation game with multiple resource types and multiple tasks to which these resources can be assigned. Specifically, we consider a (nearly) linear model with D resource types and K tasks. In each time step t the economist chooses an allocation of resources $M_t \in \mathbb{R}^{D \times K}$, where the kth column $M_{tk} \in \mathbb{R}^D$ represents the amount of each resource type assigned to the kth task. We assume that the kth task is completed successfully with probability $\min\{1, \langle M_{tk}, \nu_k \rangle\}$, where $\nu_k \in \mathbb{R}^D$ is an unknown non-negative vector that determines how the success rate of a given task depends on the quantity and type of resources assigned to it. Naturally we limit the availability of resources by demanding that $M_t$ satisfies $\sum_{k=1}^K M_{tdk} \le 1$ for every resource type d. At the end of each time step the economist observes which tasks were successful. The objective is to maximise the number of successful tasks up to some time horizon n that is known in advance. This model is a natural generalisation of the one used by Lattimore et al. [2014], where a single resource type only was assumed.

¹He went on to add that "The first lesson of politics is to disregard the first lesson of economics." Sowell [1993]

An example application is the problem of allocating computing resources on a server between a number of Virtual Private Servers (VPS). In each time step (some fixed interval) the controller chooses how much memory/cpu/bandwidth to allocate to each VPS. A VPS is said to fail in a given round if it fails to respond to requests in a timely fashion. The requirements of each VPS are unknown in advance, but do not change greatly with time. The controller should learn which VPS benefit the most from which resource types and allocate accordingly.
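To make the interaction protocol concrete, here is a minimal simulation sketch of a single round of the game described above. The function and variable names (simulate_round, nu, M) and the example numbers are purely illustrative and are not taken from the paper.

```python
# A minimal sketch of one round of the allocation game described above.
# Names and example values are illustrative only.
import numpy as np

def simulate_round(M, nu, rng):
    """M: (D, K) allocation with, for each resource d, sum_k M[d, k] <= 1.
    nu: (D, K) unknown non-negative task parameters (column k is nu_k).
    Returns the vector of Bernoulli task outcomes in {0, 1}^K."""
    assert np.all(M.sum(axis=1) <= 1 + 1e-12), "resource budget violated"
    # Success probability of task k is min{1, <M_k, nu_k>}.
    success_prob = np.minimum(1.0, np.einsum("dk,dk->k", M, nu))
    return (rng.random(success_prob.shape) < success_prob).astype(int)

rng = np.random.default_rng(0)
nu = np.array([[0.5, 0.3], [0.2, 0.9]])   # hypothetical parameters (D = K = 2)
M = np.array([[1.0, 0.0], [0.5, 0.5]])    # a feasible allocation
print(simulate_round(M, nu, rng))         # observed task outcomes for this round
```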
The main contribution of this paper, besides the new setting, is an algorithm designed for this problem along with theoretical guarantees on its performance in terms of the regret. Along the way we present some additional results for the related problem of minimising regret for stochastic linear bandits on the hypercube. We also prove new concentration results for weighted least-squares estimation, which may be independently interesting. The generalisation of the work of Lattimore et al. [2014] to multiple resources turns out to be fairly non-trivial.

Those with knowledge of the theory of stochastic linear bandits will recognise some similarity. In particular, once the nonlinearity of the objective is removed, the problem is equivalent to playing K linear bandits in parallel, but where the limited resources constrain the actions of the learner and correspondingly the returns for each task. Stochastic linear bandits have recently generated a significant body of research (e.g., Auer [2003], Dani et al. [2008], Rusmevichientong and Tsitsiklis [2010], Abbasi-Yadkori et al. [2011, 2012], Agrawal and Goyal [2012] and many others). A related problem is that of online combinatorial optimisation. This has an extensive literature, but most results are only applicable to discrete action sets, are in the adversarial setting, and cannot exploit the additional structure of our problem. Nevertheless, we refer the interested reader to (say) the recent work by Kveton et al. [2014] and references therein. Also worth mentioning is that the resource allocation problem at hand is quite different from the linear semi-bandit proposed and analysed by Krishnamurthy et al. [2015], where the action set is also finite (the setting is different in many other ways besides).

Given its similarity, it is tempting to apply the techniques of linear bandits to our problem. When doing so, two main difficulties arise. The first is that our payoffs are non-linear: the expected reward is a linear function only up to a point, after which it is clipped. In the resource allocation problem this has a natural interpretation, which is that over-allocating resources beyond a certain point is fruitless. Fortunately, one can avoid this difficulty rather easily by ensuring that with high probability resources are never over-allocated. The second problem concerns achieving good regret regardless of the task specifics. In particular, when the number of tasks K is large and resources are at a premium, the allocation problem behaves more like a K-armed bandit where the economist must choose the few tasks that can be completed successfully. For this kind of problem the regret should scale in the worst case with K only [Auer et al., 2002, Bubeck and Cesa-Bianchi, 2012]. The standard linear bandits approach, on the other hand, would lead to a bound on the regret that depends linearly on K. To remedy this situation, we will exploit the fact that if K is large and resources are scarce, then many tasks will necessarily be under-resourced and will fail with high probability. Since the noise model is Bernoulli, the variance of the noise for these tasks is extremely low. By using weighted least-squares estimators we are able to exploit this and thereby obtain an improved regret. An added benefit is that when resources are plentiful, all tasks succeed with high probability under the optimal allocation, and in this case the variance is also low.
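The variance argument in the previous paragraph rests on an elementary property of Bernoulli feedback, spelled out here for completeness: a Bernoulli random variable with success probability p has variance
\[
p(1-p) \;\le\; \min\{p,\ 1-p\}, \qquad p \in [0,1],
\]
so both badly under-resourced tasks (p close to 0) and fully resourced tasks (p close to 1) generate low-variance observations, which is exactly what a weighted least-squares estimator can exploit.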
This leads to poly-logarithmic regret in the resource-laden case where the optimal allocation fully allocates every task.

2 Preliminaries

If F is some event, then $\bar F$ is its complement (i.e., the event that F does not occur). If A is positive definite and x is a vector, then $\|x\|_A^2 = x^\top A x$ denotes the weighted 2-norm. We write $|x|$ for the vector of element-wise absolute values of x. We let $\nu \in \mathbb{R}^{D \times K}$ be a matrix with columns $\nu_1, \ldots, \nu_K$. All entries in $\nu$ are non-negative, but otherwise we make no global assumptions on $\nu$. At each time step t the learner chooses an allocation matrix $M_t \in \mathcal{M}$, where
\[
\mathcal{M} = \Big\{ M \in [0,1]^{D \times K} : \sum_{k=1}^K M_{dk} \le 1 \text{ for all } d \Big\}.
\]
The assumption that each resource type has a bound of 1 is non-restrictive, since the units of any resource can be changed to accommodate it. We write $M_{tk} \in [0,1]^D$ for the kth column of $M_t$. The reward at time step t is $\|Y_t\|_1$, where $Y_{tk} \in \{0,1\}$ is sampled from a Bernoulli distribution with parameter $\psi(\langle M_{tk}, \nu_k \rangle) = \min\{1, \langle M_{tk}, \nu_k \rangle\}$. The economist observes every $Y_{tk}$, not just the sum. The optimal allocation is denoted by $M^*$ and defined by
\[
M^* = \operatorname*{arg\,max}_{M \in \mathcal{M}} \sum_{k=1}^K \psi(\langle M_k, \nu_k \rangle).
\]
We are primarily concerned with designing an allocation algorithm that minimises the expected (pseudo-)regret, defined by
\[
R_n = \sum_{t=1}^n \sum_{k=1}^K \psi(\langle M^*_k, \nu_k \rangle) - \mathbb{E}\Big[ \sum_{t=1}^n \sum_{k=1}^K \psi(\langle M_{tk}, \nu_k \rangle) \Big],
\]
where the expectation is taken over both the actions of the algorithm and the observed rewards.

Optimal Allocations

If $\nu$ is known, then the optimal allocation can be computed by constructing an appropriate linear program. Somewhat surprisingly, it may also be computed exactly in $O(K \log K + D \log D)$ time using Algorithm 1 below. The optimal allocation is not as straightforward as, e.g., simply allocating resources to the incomplete task for which the corresponding $\nu$ is largest in some dimension. For example, for K = 2 tasks and D = 2 resource types:
\[
\nu = \begin{pmatrix} 0 & 1/2 \\ 1/2 & 1 \end{pmatrix}
\quad \Longrightarrow \quad
M^* = \begin{pmatrix} M^*_1 & M^*_2 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1/2 & 1/2 \end{pmatrix}.
\]

Algorithm 1
  Input: $\nu$
  $M = 0 \in \mathbb{R}^{D \times K}$ and $B = 1 \in \mathbb{R}^D$
  while $\exists\, k, d$ s.t. $\langle M_k, \nu_k \rangle < 1$ and $B_d > 0$ do
    $A = \{k : \langle M_k, \nu_k \rangle < 1\}$ and $\mathcal{B} = \{d : B_d > 0\}$
    $(k, d) = \operatorname{arg\,max}_{(k,d) \in A \times \mathcal{B}} \min_{i \in A \setminus \{k\}} \cdots$
    $M_{dk} = \min\{B_d,\ (1 - \langle M_k, \nu_k \rangle)/\nu_{dk}\}$
    $B_d = B_d - M_{dk}$
  end while
  return M

We see that even though $\nu_{22}$ is the largest parameter, the optimal allocation assigns only half of the second resource (d = 2) to the second task. The right approach is to allocate resources to incomplete tasks using the ratios prescribed by Algorithm 1. The intuition for allocating in this way is that resources should be allocated as efficiently as possible, where efficiency is determined by the ratio of the expected success due to the allocation of a resource and the amount of resources allocated.

Theorem 1. Algorithm 1 returns $M^*$.

The proof of Theorem 1 and an implementation of Algorithm 1 may be found in the supplementary material. We are interested primarily in the case where $\nu$ is unknown, so Algorithm 1 will not be directly applicable. Nevertheless, the algorithm is useful as a module in the implementation of the subsequent algorithm, which estimates $\nu$ from data.

3 Optimistic Allocation Algorithm

We follow the optimism in the face of uncertainty principle. In each time step t, the algorithm constructs an estimator $\hat\nu_{tk}$ for each $\nu_k$ and a corresponding confidence set $C_{tk}$ for which $\nu_k \in C_{tk}$ holds with high probability. The algorithm then takes the optimistic action subject to the assumption that $\nu_k$ does indeed lie in $C_{tk}$ for all k. The main difficulty is the construction of the confidence sets.
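To complement Algorithm 1, the linear-programming route mentioned at the start of the subsection is easy to make concrete. The sketch below is our own illustrative formulation (it is not the paper's Algorithm 1, and scipy is assumed to be available): it introduces auxiliary variables $r_k$ with $r_k \le \langle M_k, \nu_k \rangle$ and $r_k \le 1$, and maximises $\sum_k r_k$ subject to the per-resource budget.

```python
# A minimal sketch (not the paper's Algorithm 1): computing an optimal
# allocation for known nu by linear programming.  Variable layout and the
# function name are our own illustration.
import numpy as np
from scipy.optimize import linprog

def optimal_allocation_lp(nu):
    """nu: (D, K) non-negative matrix.  Returns (M, objective value)."""
    D, K = nu.shape
    n_vars = D * K + K                      # M flattened column-by-column, then r
    c = np.zeros(n_vars)
    c[D * K:] = -1.0                        # maximise sum_k r_k

    A_ub, b_ub = [], []
    # r_k <= <M_k, nu_k> for every task k
    for k in range(K):
        row = np.zeros(n_vars)
        row[k * D:(k + 1) * D] = -nu[:, k]
        row[D * K + k] = 1.0
        A_ub.append(row); b_ub.append(0.0)
    # sum_k M_dk <= 1 for every resource d
    for d in range(D):
        row = np.zeros(n_vars)
        row[d::D][:K] = 1.0                 # entry d of each column k
        A_ub.append(row); b_ub.append(1.0)

    bounds = [(0, None)] * (D * K) + [(0, 1)] * K   # M >= 0 and 0 <= r_k <= 1
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    M = res.x[:D * K].reshape((K, D)).T     # column k is the allocation to task k
    return M, -res.fun

# Reproduces the worked example above: M* assigns half of resource 2 to task 2.
M_star, value = optimal_allocation_lp(np.array([[0.0, 0.5], [0.5, 1.0]]))
print(np.round(M_star, 3), value)
```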
Like other authors [Dani et al., 2008, Rusmevichientong and Tsitsiklis, 2010, Abbasi-Yadkori et al., 2011] we define our confidence sets to be ellipsoids, but the use of a weighted least-squares estimator means that our ellipsoids may be significantly smaller than the sets that would be obtained by applying these previous works in a straightforward way.

The algorithm accepts as input the number of tasks and resource types, the horizon n, and constants α > 0 and β, where β is defined in terms of
\[
\delta = \frac{1}{nK}, \qquad N = \big(4 n^4 D^2\big)^D, \qquad B \ge \max_k \|\nu_k\|_2^2 .
\]
Note that B must be a known bound on $\max_k \|\nu_k\|_2^2$, which might seem like a serious restriction until one realises that it is easy to add an initialisation phase in which estimates are quickly made while incurring minimal additional regret, as was also done by Lattimore et al. [2014]. The value of α determines the level of regularisation in the least-squares estimation and will be tuned later to optimise the regret.

Algorithm 2 Optimistic Allocation Algorithm
  1: Input: K, D, n, α, β
  2: for t = 1, ..., n do
  3:   // Compute confidence sets for all tasks k:
  4:   $G_{tk} = \alpha I + \sum_{\tau < t} \gamma_{\tau k} M_{\tau k} M_{\tau k}^\top$
       ...

Let F denote the failure event that $\|\hat\nu_{tk} - \nu_k\|_{G_{tk}} > \sqrt{\beta}$ for some t ≤ n and 1 ≤ k ≤ K, and let $C_{tk} = \{\tilde\nu : \|\tilde\nu - \hat\nu_{tk}\|_{G_{tk}} \le \sqrt{\beta}\}$ be the corresponding confidence ellipsoid, so that $\nu_k \in C_{tk}$ for all t and k whenever F does not hold. Theorem 3, which relies on the new concentration results for weighted least-squares estimation, shows that F occurs with probability at most δ; Theorem 2, the main regret guarantee for Algorithm 2, is proved at the end of this section using this fact.

Lemma 4 (Abbasi-Yadkori et al. [2012]). Let $x_1, \ldots, x_n$ be an arbitrary sequence of vectors with $\|x_t\|_2^2 \le c$ and let $G_t = I + \sum_{s=1}^{t-1} x_s x_s^\top$. Then
\[
\sum_{t=1}^n \min\big\{1, \|x_t\|^2_{G_t^{-1}}\big\} \le 2D \log(1 + cn).
\]

Corollary 5. If F does not hold, then $\sum_{t=1}^n \gamma_{tk} \min\{1, \|M_{tk}\|^2_{G_{tk}^{-1}}\} \le 8D\log(1 + 4n^2)$.

The proof is omitted, but follows rather easily by showing that $\gamma_{tk}$ can be moved inside the minimum at the price of increasing the loss by at most a factor of four, and then applying Lemma 4. See the supplementary material for the formal proof.

Lemma 6. Suppose F does not hold. Then $\sum_{k=1}^K \gamma_{tk}^{-1} \le D\big(\max_k \|\nu_k\|_\infty + 4\sqrt{\beta/\alpha}\big)$.

Proof. We exploit the fact that $\gamma_{tk}^{-1}$ is an estimate of the variance, which is small whenever $\|M_{tk}\|_1$ is small:
\begin{align*}
\gamma_{tk}^{-1}
&= \max_{\tilde\nu \in C_{tk}} \langle M_{tk}, \tilde\nu\rangle\big(1 - \langle M_{tk}, \tilde\nu\rangle\big)
\le \max_{\tilde\nu \in C_{tk}} \langle M_{tk}, \tilde\nu\rangle
= \langle M_{tk}, \nu_k\rangle + \max_{\tilde\nu \in C_{tk}} \langle M_{tk}, \tilde\nu - \nu_k\rangle \\
&\overset{(a)}{\le} \|M_{tk}\|_1 \|\nu_k\|_\infty + 4\sqrt{\beta}\,\|M_{tk}\|_{G_{tk}^{-1}}
\overset{(b)}{\le} \|M_{tk}\|_1 \|\nu_k\|_\infty + 4\sqrt{\beta}\,\|M_{tk}\|_{I/\alpha}
\overset{(c)}{\le} \|M_{tk}\|_1 \big(\|\nu_k\|_\infty + 4\sqrt{\beta/\alpha}\big),
\end{align*}
where (a) follows from Cauchy-Schwarz and the fact that $\nu_k \in C_{tk}$, (b) since $G_{tk}^{-1} \preceq I/\alpha$ and basic linear algebra, and (c) since $\|M_{tk}\|_{I/\alpha} = \sqrt{1/\alpha}\,\|M_{tk}\|_2 \le \sqrt{1/\alpha}\,\|M_{tk}\|_1$. The result is completed since the resource constraints imply that $\sum_{k=1}^K \|M_{tk}\|_1 \le D$.

Proof of Theorem 2. By Theorem 3, F holds with probability at most δ = 1/(nK). If F does not hold, then by the definition of the confidence sets we have $\nu_k \in C_{tk}$ for all t and k. Therefore
\[
R_n = \mathbb{E}\sum_{t=1}^n \sum_{k=1}^K \big(\psi(\langle M^*_k, \nu_k\rangle) - \psi(\langle M_{tk}, \nu_k\rangle)\big)
\le 1 + \mathbb{E}\Big[\mathbb{1}\{\bar F\} \sum_{t=1}^n \sum_{k=1}^K \langle M^*_k - M_{tk}, \nu_k\rangle\Big],
\]
where the leading 1 accounts for the event F, whose probability is at most δ and whose total contribution to the regret is therefore at most nKδ = 1. Note that we were able to replace $\psi(\langle M_{tk}, \nu_k\rangle)$ by $\langle M_{tk}, \nu_k\rangle$ because, if F does not hold, then $M_{tk}$ is never chosen in such a way that resources are over-allocated. We now assume that F does not hold and bound the argument of the expectation. By the optimism principle we have
\begin{align*}
\sum_{t=1}^n \sum_{k=1}^K \langle M^*_k - M_{tk}, \nu_k\rangle
&\overset{(a)}{\le} \sum_{t=1}^n \sum_{k=1}^K \min\big\{1, \langle M_{tk}, \tilde\nu_{tk} - \nu_k\rangle\big\}
\overset{(b)}{\le} \sum_{t=1}^n \sum_{k=1}^K \min\big\{1, \|M_{tk}\|_{G_{tk}^{-1}} \|\tilde\nu_{tk} - \nu_k\|_{G_{tk}}\big\} \\
&\overset{(c)}{\le} \sum_{t=1}^n \sum_{k=1}^K \gamma_{tk}^{-1/2}\gamma_{tk}^{1/2} \min\big\{1, 2\sqrt{\beta}\,\|M_{tk}\|_{G_{tk}^{-1}}\big\}
\overset{(d),(e)}{\le} \sqrt{4\beta \Big(\sum_{t,k} \gamma_{tk}^{-1}\Big)\Big(\sum_{t,k} \gamma_{tk}\min\big\{1, \|M_{tk}\|^2_{G_{tk}^{-1}}\big\}\Big)} \\
&\overset{(f),(g)}{\le} \sqrt{4\beta\, n D\big(\max_k \|\nu_k\|_\infty + 4\sqrt{\beta/\alpha}\big)\cdot 8KD\log(1 + 4n^2)},
\end{align*}
where (a) follows from the assumption that $\nu_k \in C_{tk}$ for all t and k and since $M_t$ is chosen optimistically, (b) by the Cauchy-Schwarz inequality, (c) by the definition of $\tilde\nu_{tk}$, the optimistic parameter, which lies inside $C_{tk}$, (d) by Jensen's inequality, (e) by Cauchy-Schwarz again, (f) follows from Lemma 6, and finally (g) follows from Corollary 5.
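For readers who want to experiment with the weighted estimator, the following is a minimal sketch of regularised weighted least-squares updates of the form used here, i.e. maintaining $G_{tk} = \alpha I + \sum_{\tau<t} \gamma_{\tau k} M_{\tau k} M_{\tau k}^\top$ together with the corresponding weighted estimate. The class and method names are our own, and the choice of the weights $\gamma$ and of the radius $\sqrt{\beta}$ is left to the caller, as in Algorithm 2.

```python
# A minimal sketch of the weighted regularised least-squares estimator that
# underlies the confidence sets.  Names are illustrative only; the weights
# gamma and the confidence radius are supplied externally.
import numpy as np

class WeightedLSE:
    def __init__(self, dim, alpha):
        self.G = alpha * np.eye(dim)   # regularised, weighted Gram matrix G_tk
        self.b = np.zeros(dim)         # weighted sum of y_s * x_s

    def update(self, x, y, gamma):
        """x: allocation vector M_tk, y: 0/1 outcome, gamma: inverse-variance weight."""
        self.G += gamma * np.outer(x, x)
        self.b += gamma * y * x

    def estimate(self):
        """Weighted ridge estimate nu_hat_tk = G^{-1} b."""
        return np.linalg.solve(self.G, self.b)

    def ellipsoid_norm(self, v):
        """Weighted norm ||v||_G, used to test membership of v in the confidence set."""
        return float(np.sqrt(v @ self.G @ v))
```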
5 Regret in the Resource-Laden Case

We now show that if there are enough resources that the optimal strategy can complete every task with certainty, then the regret of Algorithm 2 is poly-logarithmic (in contrast to $O(\sqrt{n})$ otherwise). As before we exploit the low variance, but now the variance is small because $\langle M_{tk}, \nu_k\rangle$ is close to 1, whereas in the previous section we argued that this could not happen too often (there is no contradiction, since the quantity $\max_k \|\nu_k\|_\infty$ appeared in the previous bound).

Theorem 7. If $\sum_{k=1}^K \langle M^*_k, \nu_k\rangle = K$, then $R_n \le 1 + 8\beta K D \log(1 + 4n^2)$.

Proof. Suppose F does not hold (its contribution to the regret is at most 1, exactly as in the proof of Theorem 2). We start by showing that the weights are large:
\begin{align*}
\gamma_{tk}^{-1}
&= \max_{\tilde\nu \in C_{tk}} \langle M_{tk}, \tilde\nu\rangle\big(1 - \langle M_{tk}, \tilde\nu\rangle\big)
\le \max_{\tilde\nu \in C_{tk}} \big(1 - \langle M_{tk}, \tilde\nu\rangle\big)
\le \max_{\tilde\nu, \tilde\nu' \in C_{tk}} \langle M_{tk}, \tilde\nu' - \tilde\nu\rangle \\
&\le \|M_{tk}\|_{G_{tk}^{-1}} \max_{\tilde\nu, \tilde\nu' \in C_{tk}} \|\tilde\nu' - \tilde\nu\|_{G_{tk}}
\le 4\sqrt{\beta}\,\|M_{tk}\|_{G_{tk}^{-1}},
\end{align*}
where the second inequality uses the fact that, under the resource-laden assumption, the optimistic allocation satisfies $\max_{\tilde\nu' \in C_{tk}} \langle M_{tk}, \tilde\nu'\rangle \ge 1$. Applying the optimism principle and using the bound above combined with Corollary 5 gives the result:
\begin{align*}
R_n &\le 1 + \sum_{t=1}^n\sum_{k=1}^K \min\big\{1, \langle M_{tk}, \tilde\nu_{tk} - \nu_k\rangle\big\}
\le 1 + \sum_{t=1}^n\sum_{k=1}^K \min\big\{1, \sqrt{\beta}\,\|M_{tk}\|_{G_{tk}^{-1}}\big\} \\
&\le 1 + \sum_{t=1}^n\sum_{k=1}^K \min\big\{1, \sqrt{\gamma_{tk}^{-1}}\,\sqrt{\gamma_{tk}}\,\sqrt{\beta}\,\|M_{tk}\|_{G_{tk}^{-1}}\big\}
\le 1 + \beta\sum_{k=1}^K\sum_{t=1}^n \gamma_{tk}\min\big\{1, \|M_{tk}\|^2_{G_{tk}^{-1}}\big\}
\le 1 + 8\beta K D \log(1 + 4n^2),
\end{align*}
where the penultimate inequality uses the bound on $\gamma_{tk}^{-1}$ from the display above, and the last follows from Corollary 5.

6 Experiments

We present two experiments to demonstrate the behaviour of Algorithm 2. All code and data are available in the supplementary material. Error bars indicate 95% confidence intervals, but are sometimes too small to see (the algorithm is quite conservative, so the variance is very low). We used B = 10 for all experiments.

The first experiment demonstrates the improvement obtained by using a weighted estimator over an unweighted one, and also gives some idea of the rate of learning. For this experiment we used D = K = 2, n = 10^6 and
\[
\nu = \begin{pmatrix} 8/10 & 2/10 \\ 4/10 & 2 \end{pmatrix}
\quad \Longrightarrow \quad
M^* = \begin{pmatrix} 1 & 0 \\ 1/2 & 1/2 \end{pmatrix},
\qquad \sum_{k=1}^K \langle M^*_k, \nu_k\rangle = 2,
\]
where the kth column is the parameter/allocation for the kth task. We ran two versions of the algorithm: the first exactly as given in Algorithm 2, and the second identical except that the weights were fixed to $\gamma_{tk} = 4$ for all t and k (this value is chosen because it corresponds to the minimum inverse variance of a Bernoulli variable). The data were produced by averaging the regret over 8 runs. The results are given in Fig. 1. In Fig. 2 we plot $\gamma_{tk}$. The results show that $\gamma_{tk}$ increases linearly with t. This is congruent with what we might expect, because in this regime the estimation error should drop as O(1/t) and the estimated variance is proportional to the estimation error. Note that the estimation error for the algorithm with $\gamma_{tk} = 4$ will be $O(1/\sqrt{t})$.

For the second experiment we show the algorithm adapting to the environment. We fix n = 5 × 10^5 and D = K = 2. For α ∈ (0, 1) we define
\[
\nu_\alpha = \begin{pmatrix} 1/2 & \alpha/2 \\ 1/2 & \alpha/2 \end{pmatrix}
\quad \Longrightarrow \quad
M^* = \begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix},
\qquad \sum_{k=1}^K \langle M^*_k, \nu_k\rangle = 1.
\]
The unusual profile of the regret as α varies (Fig. 3) can be attributed to two factors. First, if α is small then the algorithm quickly identifies that resources should be allocated first to the first task. However, in the early stages of learning the algorithm is conservative in allocating to the first task in order to avoid over-allocation. Since the remaining resources are given to the second task, the regret is larger for small α because the gain from allocating to the second task is small. On the other hand, if α is close to 1, then the algorithm suffers the opposite problem: it cannot identify which task the resources should be assigned to. Of course, if α = 1, then the algorithm must simply learn that all resources can be allocated safely, and so the regret is smallest here. An important point is that the algorithm never allocates all of its resources at the start of the process, because this risks over-allocation, so even in easy problems the regret will not vanish.
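As a small sanity check of the two configurations above (and a worked use of the objective $\sum_k \min\{1, \langle M_k, \nu_k\rangle\}$), the following snippet evaluates the stated optimal allocations. It is an illustration only, assuming the parameter and allocation matrices as written above.

```python
# Evaluate sum_k min{1, <M_k, nu_k>} and the per-resource budget for the two
# experimental configurations described above (assumed as written).
import numpy as np

def objective(M, nu):
    return float(np.minimum(1.0, np.einsum("dk,dk->k", M, nu)).sum())

# Experiment 1: both tasks can be completed, objective value 2.
nu1 = np.array([[0.8, 0.2], [0.4, 2.0]])
M1 = np.array([[1.0, 0.0], [0.5, 0.5]])
print(objective(M1, nu1), M1.sum(axis=1))   # 2.0, budgets [1. 1.]

# Experiment 2 with alpha = 0.6 (any alpha in (0, 1)): objective value 1.
a = 0.6
nu2 = np.array([[0.5, a / 2], [0.5, a / 2]])
M2 = np.array([[1.0, 0.0], [1.0, 0.0]])
print(objective(M2, nu2), M2.sum(axis=1))   # 1.0, budgets [1. 1.]
```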
[Figure 1: Weighted vs unweighted estimation. Figure 2: Weights. Figure 3: Gap dependence.]

7 Conclusions and Summary

We introduced the stochastic multi-resource allocation problem and developed a new algorithm that enjoys near-optimal worst-case regret. The main drawback of the new algorithm is that its computation time is exponential in the dimension parameters, which makes practical implementations challenging unless both K and D are relatively small. Despite this challenge we were able to implement the algorithm using a relatively brute-force approach to solving the optimisation problem, and this was sufficient to present experimental results on synthetic data showing that the algorithm behaves as the theory predicts, and that the use of weighted least-squares estimation leads to a real improvement. Despite the computational issues, we think this is a reasonable first step towards a more practical algorithm as well as a solid theoretical understanding of the structure of the problem. As a consolation (and on their own merits) we include some other results:

- An efficient (both in terms of regret and computation) algorithm for the case where over-allocation is impossible.
- An algorithm for linear bandits on the hypercube that enjoys optimal regret bounds and adapts to sparsity.
- Theoretical analysis of weighted least-squares estimators, which may have other applications (e.g., linear bandits with heteroscedastic noise).

There are many directions for future research. The most natural is to improve the practicality of the algorithm. We envisage that such an algorithm might be obtained by following the program below:

- Generalise the Thompson sampling analysis for linear bandits by Agrawal and Goyal [2012]. This is a highly non-trivial step, since it is no longer straightforward to show that such an algorithm is optimistic with high probability. Instead it will be necessary to make do with some kind of local optimism for each task.
- The method of estimation depends heavily on the algorithm over-allocating its resources only with extremely low probability, but this significantly slows learning in the initial phases when the confidence sets are large and the algorithm is acting conservatively. Ideally we would use a method of estimation that depends on the real structure of the problem, but existing techniques that might lead to theoretical guarantees (e.g., empirical process theory) do not seem promising if small constants are expected.

It is not hard to think up extensions or modifications of the setting. For example, it would be interesting to study an adversarial setting (even defining it is not so easy), or to move towards a non-parametric model for the likelihood of success given an allocation.

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312-2320, 2011.
Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In AISTATS, volume 22, pages 1-9, 2012.
Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. arXiv preprint arXiv:1209.3352, 2012.
Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. The Journal of Machine Learning Research, 3:397-422, 2003.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.
Kristin P. Bennett and Olvi L. Mangasarian. Bilinear separation of two sets in n-space. Computational Optimization and Applications, 2(3):207-227, 1993.
Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated, 2012. ISBN 9781601986269.
Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355-366, 2008.
Akshay Krishnamurthy, Alekh Agarwal, and Miroslav Dudik. Efficient contextual semi-bandit learning. arXiv preprint arXiv:1502.05890, 2015.
Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. arXiv preprint arXiv:1410.0949, 2014.
Tor Lattimore, Koby Crammer, and Csaba Szepesvári. Optimal resource allocation with semi-bandit feedback. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI), 2014.
Marek Petrik and Shlomo Zilberstein. Robust approximate bilinear programming for value function approximation. The Journal of Machine Learning Research, 12:3027-3063, 2011.
Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395-411, 2010.
Thomas Sowell. Is Reality Optional?: And Other Essays. Hoover Institution Press, 1993.