# Online Decision-Making in General Combinatorial Spaces

Arun Rajkumar, Shivani Agarwal
Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India
{arun_r,shivani}@csa.iisc.ernet.in

**Abstract.** We study online combinatorial decision problems, where one must make sequential decisions in some combinatorial space without knowing in advance the cost of decisions on each trial; the goal is to minimize the total regret over some sequence of trials relative to the best fixed decision in hindsight. Such problems have been studied mostly in settings where decisions are represented by Boolean vectors and costs are linear in this representation. Here we study a general setting where costs may be linear in any suitable low-dimensional vector representation of elements of the decision space. We give a general algorithm for such problems that we call low-dimensional online mirror descent (LDOMD); the algorithm generalizes both the Component Hedge algorithm of Koolen et al. (2010), and a recent algorithm of Suehiro et al. (2012). Our study offers a unification and generalization of previous work, and emphasizes the role of the convex polytope arising from the vector representation of the decision space; while Boolean representations lead to 0-1 polytopes, more general vector representations lead to more general polytopes. We study several examples of both types of polytopes. Finally, we demonstrate the benefit of having a general framework for such problems via an application to an online transportation problem; the associated transportation polytopes generalize the Birkhoff polytope of doubly stochastic matrices, and the resulting algorithm generalizes the PermELearn algorithm of Helmbold and Warmuth (2009).

## 1 Introduction

In an online combinatorial decision problem, the decision space is a set of combinatorial structures, such as subsets, trees, paths, permutations, etc. On each trial, one selects a combinatorial structure from the decision space and incurs a loss; the goal is to minimize the regret over some sequence of trials relative to the best fixed structure in hindsight. Such problems have been studied extensively in the last several years, primarily in the setting where the combinatorial structures are represented by Boolean vectors and costs are linear in this representation; this includes online learning of paths, permutations, and various other specific combinatorial structures [16, 17, 12], as well as the Component Hedge algorithm of Koolen et al. [14], which generalizes many of these previous studies. More recently, Suehiro et al. [15] considered a setting where the combinatorial structures of interest are represented by the vertices of the base polytope of a submodular function, and costs are linear in this representation; this includes as special cases several of the Boolean examples considered earlier, as well as new settings such as learning permutations with certain position-based losses (see also [2]).

In this work, we consider a general form of the online combinatorial decision problem, where costs can be linear in any suitable low-dimensional vector representation of the combinatorial structures of interest. This encompasses representations as Boolean vectors and vertices of submodular base polytopes as special cases, but also includes many other settings.
We give a general algorithm for such problems that we call low-dimensional online mirror descent (LDOMD); the algorithm generalizes both the Component Hedge algorithm of Koolen et al. for Boolean representations [14], and the algorithm of Suehiro et al. for submodular polytope vertex representations [15].¹ As we show, in many settings of interest, the regret bounds for LDOMD are better than what can be obtained with other algorithms for online decision problems, such as the Hedge algorithm of Freund and Schapire [10] and the Follow the Perturbed Leader algorithm of Kalai and Vempala [13].

We start with some preliminaries and background in Section 2, and describe the LDOMD algorithm and its analysis in Section 3. Our study emphasizes the role of the convex polytope arising from the vector representation of the decision space; we study several examples of such polytopes, including matroid polytopes, polytopes associated with submodular functions, and permutation polytopes, in Sections 4–6, respectively. Section 7 applies our framework to an online transportation problem.

¹ We note that the recent online stochastic mirror descent (OSMD) algorithm of Audibert et al. [3] also generalizes the Component Hedge algorithm, but in a different direction: OSMD (as described in [3]) applies only to Boolean representations, but allows also for partial information (bandit) settings; here we consider only full information settings, but allow for more general vector representations.

## 2 Preliminaries and Background

**Notation.** For $n \in \mathbb{Z}_+$, we will denote $[n] = \{1, \ldots, n\}$. For a vector $\mathbf{z} \in \mathbb{R}^d$, we will denote by $\|\mathbf{z}\|_1$, $\|\mathbf{z}\|_2$, and $\|\mathbf{z}\|_\infty$ the standard $L_1$, $L_2$, and $L_\infty$ norms of $\mathbf{z}$, respectively. For a set $Z \subseteq \mathbb{R}^d$, we will denote by $\mathrm{conv}(Z)$ the convex hull of $Z$, and by $\mathrm{int}(Z)$ the interior of $Z$. For a closed convex set $K \subseteq \mathbb{R}^d$ and Legendre function $F : K \to \mathbb{R}$,² we will denote by $B_F : K \times \mathrm{int}(K) \to \mathbb{R}_+$ the Bregman divergence associated with $F$, defined as $B_F(\mathbf{x}, \mathbf{x}') = F(\mathbf{x}) - F(\mathbf{x}') - \nabla F(\mathbf{x}')^\top (\mathbf{x} - \mathbf{x}')$, and by $F^* : \nabla F(\mathrm{int}(K)) \to \mathbb{R}$ the Fenchel conjugate of $F$, defined as $F^*(\mathbf{u}) = \sup_{\mathbf{x} \in K} (\mathbf{x}^\top \mathbf{u} - F(\mathbf{x}))$.

² Recall that for a closed convex set $K \subseteq \mathbb{R}^d$, a function $F : K \to \mathbb{R}$ is Legendre if it is strictly convex, differentiable on $\mathrm{int}(K)$, and (for any norm $\|\cdot\|$ on $\mathbb{R}^d$) $\|\nabla F(\mathbf{x}_n)\| \to +\infty$ whenever $\{\mathbf{x}_n\}$ converges to a point in the boundary of $K$.

**Figure 1: Online decision-making in a general combinatorial space.**
Inputs: finite set of combinatorial structures $C$; mapping $\phi : C \to \mathbb{R}^d$.
For $t = 1, \ldots, T$:
- Predict $c^t \in C$
- Receive loss vector $\boldsymbol{\ell}^t \in [0,1]^d$
- Incur loss $\phi(c^t)^\top \boldsymbol{\ell}^t$

**Problem Setup.** Let $C$ be a (finite but large) set of combinatorial structures. Let $\phi : C \to \mathbb{R}^d$ be some injective mapping that maps each $c \in C$ to a unique vector $\phi(c) \in \mathbb{R}^d$ (so that $|\phi(C)| = |C|$). We will generally assume $d \ll |C|$ (e.g. $d = \mathrm{poly}\log(|C|)$). The online combinatorial decision-making problem we consider can be described as follows: on each trial $t$, one makes a decision in $C$ by selecting a structure $c^t \in C$, and receives a loss vector $\boldsymbol{\ell}^t \in [0,1]^d$; the loss incurred is given by $\phi(c^t)^\top \boldsymbol{\ell}^t$ (see Figure 1). The goal is to minimize the regret relative to the single best structure in $C$ in hindsight; specifically, the regret of an algorithm $\mathcal{A}$ that selects $c^t \in C$ on trial $t$ over $T$ trials is defined as
$$R_T[\mathcal{A}] = \sum_{t=1}^{T} \phi(c^t)^\top \boldsymbol{\ell}^t \;-\; \min_{c \in C} \sum_{t=1}^{T} \phi(c)^\top \boldsymbol{\ell}^t.$$
In particular, we would like to design algorithms whose worst-case regret (over all possible loss sequences) is sublinear in $T$ (and also has as good a dependence as possible on other relevant problem parameters). From standard results, it follows that for any deterministic algorithm, there is always a loss sequence that forces the regret to be linear in $T$; as is common in the online learning literature, we will therefore consider randomized algorithms that maintain a probability distribution $\mathbf{p}^t$ over $C$ from which $c^t$ is randomly drawn, and consider bounding the expected regret of such algorithms.
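To make the protocol of Figure 1 and the regret quantity concrete, here is a minimal simulation sketch in Python. The names (`run_protocol`, `policy`, the toy 2-subset example) are illustrative only and are not part of the paper; the sketch simply plays an arbitrary learner against a fixed loss sequence and reports the realized regret.

```python
import numpy as np

def run_protocol(C, phi, policy, losses):
    """Simulate the protocol of Figure 1 and return the realized regret R_T.

    C      : list of combinatorial structures
    phi    : maps a structure c to a vector in R^d (numpy array)
    policy : callable(history) -> a structure in C (the learner)
    losses : list of loss vectors ell^t in [0,1]^d
    """
    history, total_loss = [], 0.0
    for ell in losses:
        c_t = policy(history)                  # predict c^t in C
        total_loss += float(phi(c_t) @ ell)    # incur phi(c^t)^T ell^t
        history.append((c_t, ell))
    # best fixed structure in hindsight
    cum_ell = np.sum(losses, axis=0)
    best_fixed = min(float(phi(c) @ cum_ell) for c in C)
    return total_loss - best_fixed

# toy usage: 2-subsets of [4], with a policy that always plays {0, 1}
if __name__ == "__main__":
    d, T = 4, 10
    C = [(i, j) for i in range(d) for j in range(i + 1, d)]
    phi = lambda S: np.array([1.0 if i in S else 0.0 for i in range(d)])
    rng = np.random.default_rng(0)
    losses = [rng.uniform(0, 1, size=d) for _ in range(T)]
    print(run_protocol(C, phi, lambda h: (0, 1), losses))
```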
**Online Mirror Descent (OMD).** Recall that online mirror descent (OMD) is a general algorithmic framework for online convex optimization problems, where on each trial $t$, one selects a point $\mathbf{x}^t$ in some convex set $\Omega \subseteq \mathbb{R}^n$, receives a convex cost function $f_t : \Omega \to \mathbb{R}$, and incurs a loss $f_t(\mathbf{x}^t)$; the goal is to minimize the regret relative to the best single point in $\Omega$ in hindsight. The OMD algorithm makes use of a Legendre function $F : K \to \mathbb{R}$ defined on a closed convex set $K \supseteq \Omega$, and effectively performs a form of projected gradient descent in the dual space $\nabla F(\mathrm{int}(K))$, the projections being in terms of the Bregman divergence $B_F$ associated with $F$. See Appendix A.1 for an outline of OMD and its regret bound for the special case of online linear optimization, where costs $f_t$ are linear (so that $f_t(\mathbf{x}) = \boldsymbol{\ell}^{t\top}\mathbf{x}$ for some $\boldsymbol{\ell}^t \in \mathbb{R}^n$), which will be relevant to our study.

**Hedge/Naïve OMD.** The Hedge algorithm proposed by Freund and Schapire [10] is widely used for online decision problems in general. The algorithm maintains a probability distribution over the decision space, and can be viewed as an instantiation of the OMD framework, with $\Omega$ (and $K$) the probability simplex over the decision space, linear costs $f_t$ (since one works with expected losses), and $F$ the negative entropy. When applied to online combinatorial decision problems in a naïve manner, the Hedge algorithm requires maintaining a probability distribution over the combinatorial decision space $C$, which in many cases can be computationally prohibitive (see Appendix A.2 for an outline of the algorithm, which we also refer to as Naïve OMD). The following bound on the expected regret of the Hedge/Naïve OMD algorithm is well known:

**Theorem 1 (Regret bound for Hedge/Naïve OMD).** Let $\phi(c)^\top \boldsymbol{\ell}^t \in [a, b]$ for all $c \in C$, $t \in [T]$. Then setting $\eta^* = \frac{1}{b-a}\sqrt{\frac{2\ln|C|}{T}}$ gives
$$\mathbb{E}\big[R_T[\mathrm{Hedge}(\eta^*)]\big] \le (b - a)\sqrt{2\,T\ln|C|}.$$

**Follow the Perturbed Leader (FPL).** Another widely used algorithm for online decision problems is the Follow the Perturbed Leader (FPL) algorithm proposed by Kalai and Vempala [13] (see Appendix A.3 for an outline of the algorithm). Note that in the combinatorial setting, FPL requires the solution to a combinatorial optimization problem on each trial, which may or may not be efficiently solvable depending on the form of the mapping $\phi$. The following bound on the expected regret of the FPL algorithm is well known:

**Theorem 2 (Regret bound for FPL).** Let $\|\phi(c) - \phi(c')\|_1 \le D_1$, $\|\boldsymbol{\ell}^t\|_1 \le G_1$, and $|\phi(c)^\top \boldsymbol{\ell}^t| \le B$ for all $c, c' \in C$, $t \in [T]$. Then setting $\eta^* = \sqrt{\frac{D_1}{B G_1 T}}$ gives
$$\mathbb{E}\big[R_T[\mathrm{FPL}(\eta^*)]\big] \le 2\sqrt{D_1 B G_1 T}.$$
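For concreteness, the following minimal sketch implements the Hedge/Naïve OMD update over an explicitly enumerated decision space; it is meant only to make explicit that the state maintained is one weight per structure in $C$, which is exactly what becomes prohibitive when $|C|$ is exponential and what the LDOMD algorithm of Section 3 avoids. The enumeration of $C$ and the mapping `phi` are assumed to be given; the function name is illustrative.

```python
import numpy as np

def naive_hedge(C, phi, losses, eta, seed=0):
    """Hedge / Naive OMD: exponential-weights update over an enumerated decision space C.

    Maintains one weight per structure in C (prohibitive when |C| is exponential).
    Returns the sequence of sampled structures.
    """
    Phi = np.array([phi(c) for c in C])        # |C| x d matrix of representations
    w = np.ones(len(C)) / len(C)               # uniform initial distribution over C
    rng = np.random.default_rng(seed)
    plays = []
    for ell in losses:                         # ell in [0,1]^d
        i = rng.choice(len(C), p=w)            # draw c^t ~ p^t
        plays.append(C[i])
        struct_losses = Phi @ ell              # phi(c)^T ell for every c in C
        w = w * np.exp(-eta * struct_losses)   # multiplicative (negative-entropy OMD) update
        w = w / w.sum()                        # renormalize onto the simplex
    return plays
```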
**Polytopes.** Recall that a set $S \subseteq \mathbb{R}^d$ is a polytope if there exist a finite number of points $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ such that $S = \mathrm{conv}(\{\mathbf{x}_1, \ldots, \mathbf{x}_n\})$. Any polytope $S \subseteq \mathbb{R}^d$ has a unique minimal set of points $\mathbf{x}^*_1, \ldots, \mathbf{x}^*_m \in \mathbb{R}^d$ such that $S = \mathrm{conv}(\{\mathbf{x}^*_1, \ldots, \mathbf{x}^*_m\})$; these points are called the vertices of $S$. A polytope $S \subseteq \mathbb{R}^d$ is said to be a 0-1 polytope if all its vertices lie in the Boolean hypercube $\{0,1\}^d$.

As we shall see, in our study of online combinatorial decision problems as above, the polytope $\mathrm{conv}(\phi(C)) \subseteq \mathbb{R}^d$ will play a central role. Clearly, if $\phi(C) \subseteq \{0,1\}^d$, then $\mathrm{conv}(\phi(C))$ is a 0-1 polytope; in general, however, $\mathrm{conv}(\phi(C))$ can be any polytope in $\mathbb{R}^d$.

## 3 Low-Dimensional Online Mirror Descent (LDOMD)

We describe the Low-Dimensional OMD (LDOMD) algorithm in Figure 2. The algorithm maintains a point $\mathbf{x}^t$ in the polytope $\mathrm{conv}(\phi(C))$. It makes use of a Legendre function $F : K \to \mathbb{R}$ defined on a closed convex set $K \supseteq \mathrm{conv}(\phi(C))$, and effectively performs OMD in a $d$-dimensional space rather than in a $|C|$-dimensional space as in the case of Hedge/Naïve OMD. Note that an efficient implementation of LDOMD requires two operations to be performed efficiently: (a) given a point $\mathbf{x}^t \in \mathrm{conv}(\phi(C))$, one needs to be able to efficiently find a decomposition of $\mathbf{x}^t$ into a convex combination of a small number of points in $\phi(C)$ (this yields a distribution $\mathbf{p}^t$ over $C$ that satisfies $\mathbb{E}_{c \sim \mathbf{p}^t}[\phi(c)] = \mathbf{x}^t$ and also has small support, allowing efficient sampling); and (b) given a point $\widetilde{\mathbf{x}}^{t+1} \in K$, one needs to be able to efficiently find a projection of $\widetilde{\mathbf{x}}^{t+1}$ onto $\mathrm{conv}(\phi(C))$ in terms of the Bregman divergence $B_F$. The following regret bound for LDOMD follows directly from the standard OMD regret bound (see Theorem 4 in Appendix A.1):

**Theorem 3 (Regret bound for LDOMD).** Let $B_F(\phi(c), \mathbf{x}^1) \le D^2$ for all $c \in C$. Let $\|\cdot\|$ be any norm on $\mathbb{R}^d$ such that $\|\boldsymbol{\ell}^t\| \le G$ for all $t \in [T]$, and such that the restriction of $F$ to $\mathrm{conv}(\phi(C))$ is $\alpha$-strongly convex w.r.t. $\|\cdot\|_*$, the dual norm of $\|\cdot\|$. Then setting $\eta^* = \frac{D}{G}\sqrt{\frac{2\alpha}{T}}$ gives
$$\mathbb{E}\big[R_T[\mathrm{LDOMD}(\eta^*)]\big] \le DG\sqrt{\frac{2T}{\alpha}}.$$

As we shall see below, the LDOMD algorithm generalizes both the Component Hedge algorithm of Koolen et al. [14], which applies to settings where $\phi(C) \subseteq \{0,1\}^d$ (Section 3.1), and the recent algorithm of Suehiro et al. [15], which applies to settings where $\mathrm{conv}(\phi(C))$ is the base polytope associated with a submodular function (Section 5).

**Figure 2: The LDOMD algorithm.**
Algorithm: Low-Dimensional OMD (LDOMD) for Online Combinatorial Decision-Making.
Inputs: finite set of combinatorial structures $C$; mapping $\phi : C \to \mathbb{R}^d$.
Parameters: $\eta > 0$; closed convex set $K \supseteq \mathrm{conv}(\phi(C))$; Legendre function $F : K \to \mathbb{R}$.
Initialize: $\mathbf{x}^1 = \arg\min_{\mathbf{x} \in \mathrm{conv}(\phi(C))} F(\mathbf{x})$ (or $\mathbf{x}^1 =$ any other point in $\mathrm{conv}(\phi(C))$).
For $t = 1, \ldots, T$:
- Let $\mathbf{p}^t$ be any distribution over $C$ such that $\mathbb{E}_{c \sim \mathbf{p}^t}[\phi(c)] = \mathbf{x}^t$ [decomposition step]
- Randomly draw $c^t \sim \mathbf{p}^t$
- Receive loss vector $\boldsymbol{\ell}^t \in [0,1]^d$
- Incur loss $\phi(c^t)^\top \boldsymbol{\ell}^t$
- Update: $\widetilde{\mathbf{x}}^{t+1} = \nabla F^*\big(\nabla F(\mathbf{x}^t) - \eta\boldsymbol{\ell}^t\big)$; $\;\mathbf{x}^{t+1} = \arg\min_{\mathbf{x} \in \mathrm{conv}(\phi(C))} B_F(\mathbf{x}, \widetilde{\mathbf{x}}^{t+1})$ [Bregman projection step]

### 3.1 LDOMD with 0-1 Polytopes

Consider first a setting where each $c \in C$ is represented as a Boolean vector, so that $\phi(C) \subseteq \{0,1\}^d$. In this case $\mathrm{conv}(\phi(C))$ is a 0-1 polytope. This is the setting commonly studied under the term online combinatorial learning [14, 8, 3]. In analyzing this setting, one generally introduces an additional problem parameter, namely an upper bound $m$ on the size of each Boolean vector $\phi(c)$. Specifically, let us assume $\|\phi(c)\|_1 \le m$ for all $c \in C$, for some $m \in [d]$. Under the above assumption, it is easy to verify that applying Theorems 1 and 2 gives
$$\mathbb{E}\big[R_T[\mathrm{Hedge}(\eta^*)]\big] = O\Big(m\sqrt{mT\ln\tfrac{d}{m}}\Big); \qquad \mathbb{E}\big[R_T[\mathrm{FPL}(\eta^*)]\big] = O\big(m\sqrt{dT}\big).$$
For the LDOMD algorithm, since $\mathrm{conv}(\phi(C)) \subseteq [0,1]^d \subseteq \mathbb{R}^d_+$, it is common to take $K = \mathbb{R}^d_+$ and to let $F : K \to \mathbb{R}$ be the unnormalized negative entropy, defined as $F(\mathbf{x}) = \sum_{i=1}^d x_i \ln x_i - \sum_{i=1}^d x_i$, which leads to a multiplicative update algorithm; the resulting algorithm was termed Component Hedge in [14]. (A minimal code sketch of this multiplicative-update form is given below.)
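The sketch below shows one round of LDOMD with the unnormalized negative entropy (the multiplicative-update form just described). The decomposition and Bregman-projection oracles are passed in as functions, since their efficient implementations are problem-specific (see the examples in the following sections); the function names are illustrative and not from the paper.

```python
import numpy as np

def ldomd_round(x_t, ell_t, eta, decompose, project, rng):
    """One round of LDOMD with the unnormalized negative entropy
    F(x) = sum_i x_i ln x_i - sum_i x_i on K = R^d_+ (multiplicative update).

    x_t       : current point in conv(phi(C))  (numpy array in R^d_+)
    ell_t     : loss vector in [0,1]^d
    decompose : callable(x) -> list of (weight, structure_vector) pairs with
                sum_k weight_k * vector_k == x        (decomposition step)
    project   : callable(x_tilde) -> unnormalized relative entropy projection
                of x_tilde onto conv(phi(C))          (Bregman projection step)
    Returns (sampled structure vector, incurred loss, next iterate x_{t+1}).
    """
    # decomposition step: a distribution p^t over C with E[phi(c)] = x^t
    weights, vectors = zip(*decompose(x_t))
    k = rng.choice(len(weights), p=np.asarray(weights))
    c_vec = vectors[k]                          # phi(c^t)
    loss = float(np.dot(c_vec, ell_t))          # phi(c^t)^T ell^t

    # mirror step: grad F(x) = ln x and grad F*(theta) = exp(theta), so the
    # unconstrained update is a componentwise multiplicative update
    x_tilde = x_t * np.exp(-eta * ell_t)

    # Bregman (unnormalized relative entropy) projection back onto conv(phi(C))
    x_next = project(x_tilde)
    return c_vec, loss, x_next
```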
For the above choice of $F$, it is easy to see that $B_F(\phi(c), \mathbf{x}^1) \le m\ln\frac{d}{m}$ for all $c \in C$; moreover, $\|\boldsymbol{\ell}^t\|_\infty \le 1$ for all $t$, and the restriction of $F$ to $\mathrm{conv}(\phi(C))$ is $\frac{1}{m}$-strongly convex w.r.t. $\|\cdot\|_1$. Therefore, applying Theorem 3 with appropriate $\eta^*$, one gets
$$\mathbb{E}\big[R_T[\mathrm{LDOMD}(\eta^*)]\big] = O\Big(m\sqrt{T\ln\tfrac{d}{m}}\Big).$$
Thus, when $\phi(C) \subseteq \{0,1\}^d$, the LDOMD algorithm with the above choice of $F$ gives a better regret bound than both Hedge/Naïve OMD and FPL; in fact the performance of LDOMD in this setting is essentially optimal, as one can easily show a matching lower bound [3]. Below we will see how several online combinatorial decision problems studied in the literature can be recovered under the above framework (e.g. see [16, 17, 12, 14, 8]); in many of these cases, both decomposition and unnormalized relative entropy projection steps in LDOMD can be performed efficiently (in $\mathrm{poly}(d)$ time) (e.g. see [14]). As a warm-up, consider the following simple example:

**Example 1 (m-sets with element-based losses).** Here $C$ contains all size-$m$ subsets of a ground set of $d$ elements: $C = \{S \subseteq [d] \mid |S| = m\}$. On each trial $t$, one selects a subset $S^t \in C$ and receives a loss vector $\boldsymbol{\ell}^t \in [0,1]^d$, with $\ell^t_i$ specifying the loss for including element $i \in [d]$; the loss for the subset $S^t$ is given by $\sum_{i \in S^t} \ell^t_i$. Here it is natural to define a mapping $\phi : C \to \{0,1\}^d$ that maps each $S \in C$ to its characteristic vector, defined as $\phi_i(S) = \mathbf{1}(i \in S)$ for $i \in [d]$; the loss incurred on predicting $S^t \in C$ is then simply $\phi(S^t)^\top \boldsymbol{\ell}^t$. Thus $\phi(C) = \{\mathbf{x} \in \{0,1\}^d \mid \|\mathbf{x}\|_1 = m\}$, and $\mathrm{conv}(\phi(C)) = \{\mathbf{x} \in [0,1]^d \mid \|\mathbf{x}\|_1 = m\}$. LDOMD with unnormalized negative entropy as above has a regret bound of $O\big(m\sqrt{T\ln\frac{d}{m}}\big)$. It can be shown that both decomposition and unnormalized relative entropy projection steps take $O(d^2)$ time [17, 14]. (A sketch of a decomposition step for this example is given at the end of this section.)

### 3.2 LDOMD with General Polytopes

Now consider a general setting where $\phi : C \to \mathbb{R}^d$, and $\mathrm{conv}(\phi(C)) \subseteq \mathbb{R}^d$ is an arbitrary polytope. Let us assume again $\|\phi(c)\|_1 \le m$ for all $c \in C$, for some $m > 0$. Again, it is easy to verify that applying Theorems 1 and 2 gives
$$\mathbb{E}\big[R_T[\mathrm{Hedge}(\eta^*)]\big] = O\big(m\sqrt{T\ln|C|}\big); \qquad \mathbb{E}\big[R_T[\mathrm{FPL}(\eta^*)]\big] = O\big(m\sqrt{dT}\big).$$
For the LDOMD algorithm, we consider two cases:

**Case 1: $\phi(C) \subseteq \mathbb{R}^d_+$.** Here one can again take $K = \mathbb{R}^d_+$ and let $F : K \to \mathbb{R}$ be the unnormalized negative entropy. In this case, one gets $B_F(\phi(c), \mathbf{x}^1) \le m\ln(d) + m$ for all $c \in C$ if $m < d$, and $B_F(\phi(c), \mathbf{x}^1) \le m\ln(m) + d$ for all $c \in C$ if $m \ge d$. As before, $\|\boldsymbol{\ell}^t\|_\infty \le 1$ for all $t$, and the restriction of $F$ to $\mathrm{conv}(\phi(C))$ is $\frac{1}{m}$-strongly convex w.r.t. $\|\cdot\|_1$, so applying Theorem 3 for appropriate $\eta^*$ gives
$$\mathbb{E}\big[R_T[\mathrm{LDOMD}(\eta^*)]\big] = \begin{cases} O\big(m\sqrt{T\ln(d)}\big) & \text{if } m < d, \\ O\big(m\sqrt{T\ln(m)}\big) & \text{if } m \ge d. \end{cases}$$
Thus, when $\phi(C) \subseteq \mathbb{R}^d_+$, if $\ln|C| = \omega(\max(\ln(m), \ln(d)))$ and $d = \omega(\ln(m))$, then the LDOMD algorithm with unnormalized negative entropy again gives a better regret bound than both Hedge/Naïve OMD and FPL.

**Case 2: $\phi(C) \not\subseteq \mathbb{R}^d_+$.** Here one can no longer use the unnormalized negative entropy in LDOMD. One possibility is to take $K = \mathbb{R}^d$ and let $F : K \to \mathbb{R}$ be defined as $F(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|_2^2$, which leads to an additive update algorithm. In this case, one gets $B_F(\phi(c), \mathbf{x}^1) = \frac{1}{2}\|\phi(c) - \mathbf{x}^1\|_2^2 \le 2m^2$ for all $c \in C$; moreover, $\|\boldsymbol{\ell}^t\|_2 \le \sqrt{d}$ for all $t$, and $F$ is 1-strongly convex w.r.t. $\|\cdot\|_2$. Applying Theorem 3 for appropriate $\eta^*$ then gives
$$\mathbb{E}\big[R_T[\mathrm{LDOMD}(\eta^*)]\big] = O\big(m\sqrt{dT}\big).$$
Thus in general, when $\phi(C) \not\subseteq \mathbb{R}^d_+$, LDOMD with the squared $L_2$-norm has a similar regret bound to that of Hedge/Naïve OMD and FPL. Note however that in some cases, Hedge/Naïve OMD and FPL may be infeasible to implement efficiently, while LDOMD with the squared $L_2$-norm may be efficiently implementable; moreover, in certain cases it may be possible to implement LDOMD with other choices of $K$ and $F$ that lead to better regret bounds.
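As a concrete instance of the decomposition step, the following sketch decomposes a point of the $m$-set polytope of Example 1 (i.e. $\mathbf{x} \in [0,1]^d$ with $\|\mathbf{x}\|_1 = m$) into a convex combination of at most $d+1$ characteristic vectors of $m$-sets, by repeatedly peeling off the $m$ largest coordinates. This greedy scheme is one simple way to realize the step and is not necessarily the exact procedure of [17, 14].

```python
import numpy as np

def decompose_m_set_point(x, m, tol=1e-10):
    """Decompose x in conv(phi(C)) for Example 1 (0 <= x_i <= 1, sum_i x_i = m)
    into a convex combination of indicator vectors of m-sets.

    Maintains the invariant 0 <= x_i <= w and sum_i x_i = m*w, where w is the
    weight still to be assigned; each round peels off one m-set.
    Returns a list of (weight, frozenset_of_indices) with weights summing to ~1.
    """
    x = np.array(x, dtype=float)
    d, w, pieces = len(x), 1.0, []
    while w > tol:
        S = np.argsort(-x)[:m]              # the m largest coordinates (includes all = w)
        in_S = np.zeros(d, dtype=bool)
        in_S[S] = True
        # largest step alpha keeping 0 <= x_i <= (w - alpha) for all i
        alpha = min(x[S].min(), w - (x[~in_S].max() if m < d else 0.0))
        alpha = min(alpha, w)
        assert alpha > 0, "x must lie in the m-set polytope"
        pieces.append((alpha, frozenset(int(i) for i in S)))
        x[S] -= alpha
        w -= alpha
    return pieces

# toy usage: a point of the 2-set polytope in dimension 4
if __name__ == "__main__":
    x = np.array([0.9, 0.6, 0.3, 0.2])      # coordinates sum to 2.0
    for weight, S in decompose_m_set_point(x, m=2):
        print(round(weight, 3), sorted(S))
```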
In the following sections we will consider several examples of applications of LDOMD to online combinatorial decision problems involving both 0-1 polytopes and general polytopes in $\mathbb{R}^d$.

## 4 Matroid Polytopes

Consider an online decision problem in which the decision space $C$ contains (not necessarily all) independent sets in a matroid $M = (E, \mathcal{I})$. Specifically, on each trial $t$, one selects an independent set $I^t \in C$, and receives a loss vector $\boldsymbol{\ell}^t \in [0,1]^{|E|}$, with $\ell^t_e$ specifying the loss for including element $e \in E$; the loss for the independent set $I^t$ is given by $\sum_{e \in I^t} \ell^t_e$. Here it is natural to define a mapping $\phi : C \to \{0,1\}^{|E|}$ that maps each independent set $I \in C$ to its characteristic vector, defined as $\phi_e(I) = \mathbf{1}(e \in I)$; the loss on selecting $I^t \in C$ is then $\phi(I^t)^\top \boldsymbol{\ell}^t$. Thus here $d = |E|$, and $\phi(C) \subseteq \{0,1\}^{|E|}$.

A particularly interesting case is obtained by taking $C$ to contain all the maximal independent sets (bases) in $\mathcal{I}$; in this case, the polytope $\mathrm{conv}(\phi(C))$ is known as the matroid base polytope of $M$. This polytope, often denoted $B(M)$, is also given by
$$B(M) = \Big\{ \mathbf{x} \in \mathbb{R}^{|E|} \;\Big|\; \textstyle\sum_{e \in S} x_e \le \mathrm{rank}_M(S) \;\;\forall S \subseteq E, \text{ and } \sum_{e \in E} x_e = \mathrm{rank}_M(E) \Big\},$$
where $\mathrm{rank}_M : 2^E \to \mathbb{R}$ is the matroid rank function of $M$, defined as $\mathrm{rank}_M(S) = \max\{|I| \mid I \in \mathcal{I},\, I \subseteq S\}$ for $S \subseteq E$. We will see below (Section 5) that both decomposition and unnormalized relative entropy projection steps in this case can be performed efficiently assuming an appropriate oracle.

We note that Example 1 ($m$-subsets of a ground set of $d$ elements) can be viewed as a special case of the above setting for the matroid $M_{\mathrm{sub}} = (E, \mathcal{I})$ defined by $E = [d]$ and $\mathcal{I} = \{S \subseteq E \mid |S| \le m\}$; the set $C$ of $m$-subsets of $[d]$ is then simply the set of bases in $\mathcal{I}$, and $\mathrm{conv}(\phi(C)) = B(M_{\mathrm{sub}})$. The following is another well-studied example:

**Example 2 (Spanning trees with edge-based losses).** Here one is given a connected, undirected graph $G = ([n], E)$, and the decision space $C$ is the set of all spanning trees in $G$. On each trial $t$, one selects a spanning tree $T^t \in C$ and receives a loss vector $\boldsymbol{\ell}^t \in [0,1]^{|E|}$, with $\ell^t_e$ specifying the loss for using edge $e$; the loss for the tree $T^t$ is given by $\sum_{e \in T^t} \ell^t_e$. It is well known that the set of all spanning trees in $G$ is the set of bases in the graphic matroid $M_G = (E, \mathcal{I})$, where $\mathcal{I}$ contains the edge sets of all acyclic subgraphs of $G$. Therefore here $d = |E|$, $\phi(C)$ is the set of incidence vectors of all spanning trees in $G$, and $\mathrm{conv}(\phi(C)) = B(M_G)$, also known as the spanning tree polytope. Here LDOMD with unnormalized negative entropy has a regret bound of $O\big(n\sqrt{T\ln\frac{|E|}{n}}\big)$.

## 5 Polytopes Associated with Submodular Functions

Next we consider settings where the decision space $C$ is in one-to-one correspondence with the set of vertices of the base polytope associated with a submodular function, and losses are linear in the corresponding vertex representations of elements in $C$. This setting was considered recently in [15], and as we shall see, encompasses both of the examples we saw earlier, as well as many others.

Let $f : 2^{[n]} \to \mathbb{R}$ be a submodular function with $f(\emptyset) = 0$. The base polytope of $f$ is defined as
$$B(f) = \Big\{ \mathbf{x} \in \mathbb{R}^n \;\Big|\; \textstyle\sum_{i \in S} x_i \le f(S) \;\;\forall S \subseteq [n], \text{ and } \sum_{i=1}^n x_i = f([n]) \Big\}.$$
Let $\phi : C \to \mathbb{R}^n$ be a bijective mapping from $C$ to the vertices of $B(f)$; thus $\mathrm{conv}(\phi(C)) = B(f)$.

### 5.1 Monotone Submodular Functions

It is known that when $f$ is a monotone submodular function (which means $U \subseteq V \implies f(U) \le f(V)$), then $B(f) \subseteq \mathbb{R}^n_+$ [4]. Therefore in this case one can take $K = \mathbb{R}^n_+$ and $F : K \to \mathbb{R}$ to be the unnormalized negative entropy. (For intuition about what the vertices of $B(f)$, and hence the representations $\phi(c)$, look like in this setting, see the sketch below.)
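The vertices of the base polytope of a submodular function admit a simple explicit description via the classical greedy construction (Edmonds): for each ordering $\pi$ of $[n]$, the vector with coordinates $x_{\pi(i)} = f(\{\pi(1),\ldots,\pi(i)\}) - f(\{\pi(1),\ldots,\pi(i-1)\})$ is a vertex of $B(f)$, and every vertex arises this way. The following small sketch illustrates this for a cardinality-based monotone submodular function (here $f(S) = \sum_{i=1}^{|S|}(n-i+1)$, which will reappear as $f_{\mathrm{perm}}$ in Example 3 below); it is an illustration of what $\phi(C)$ looks like, not part of the LDOMD implementation, and the function names are illustrative.

```python
from itertools import permutations

def greedy_vertex(f, order):
    """Vertex of the base polytope B(f) associated with an ordering of [n]
    (Edmonds' greedy construction): the coordinate of each element is the
    marginal gain of adding it to the elements that precede it in the order."""
    n = len(order)
    x = [0.0] * n
    prev, chosen = f(set()), set()
    for e in order:
        chosen.add(e)
        cur = f(chosen)
        x[e] = cur - prev
        prev = cur
    return tuple(x)

if __name__ == "__main__":
    n = 3
    # cardinality-based monotone submodular function f(S) = sum_{i=1}^{|S|} (n - i + 1)
    f = lambda S: sum(n - i + 1 for i in range(1, len(S) + 1))
    vertices = {greedy_vertex(f, list(p)) for p in permutations(range(n))}
    print(sorted(vertices))   # the 3! = 6 permutation vectors with entries {1, 2, 3}
```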
Both the decomposition and unnormalized relative entropy projection steps can be performed in time $O(n^6 + n^5 Q)$, where $Q$ is the time taken by an oracle that, given $S$, returns $f(S)$; for cardinality-based submodular functions, for which $f(S) = g(|S|)$ for some $g : [n] \to \mathbb{R}$, these steps can be performed in just $O(n^2)$ time [15].

**Remark on matroid base polytopes and spanning trees.** We note that the matroid rank function of any matroid $M$ is a monotone submodular function, and that the matroid base polytope $B(M)$ is the same as $B(\mathrm{rank}_M)$. Therefore Examples 1 and 2 can also be viewed as special cases of the above setting. For the spanning trees of Example 2, the decomposition step of [14] makes use of a linear programming formulation whose exact time complexity is unclear. Instead, one could use the decomposition step associated with the submodular function $\mathrm{rank}_{M_G}$, which takes $O(|E|^6)$ time.

Matroid polytopes are 0-1 polytopes; the example below illustrates a more general polytope:

**Example 3 (Permutations with a certain position-based loss).** Let $C = S_n$, the set of all permutations of $n$ objects: $C = \{\sigma : [n] \to [n] \mid \sigma \text{ is bijective}\}$. On each trial $t$, one selects a permutation $\sigma^t \in C$ and receives a loss vector $\boldsymbol{\ell}^t \in [0,1]^n$; the loss of the permutation is given by $\sum_{i=1}^n \ell^t_i\,(n - \sigma^t(i) + 1)$. This type of loss arises in scheduling applications, where $\ell^t_i$ denotes the time taken to complete the $i$-th job, and the loss of a job schedule (permutation of jobs) is the total waiting time of all jobs (the waiting time of a job is its own processing time plus the sum of the processing times of all jobs scheduled before it) [15]. Here it is natural to define a mapping $\phi : C \to \mathbb{R}^n_+$ that maps $\sigma \in C$ to $\phi(\sigma) = (n - \sigma(1) + 1, \ldots, n - \sigma(n) + 1)$; the loss on selecting $\sigma^t \in C$ is then $\phi(\sigma^t)^\top \boldsymbol{\ell}^t$. Thus here we have $d = n$, and $\phi(C) = \{(\sigma(1), \ldots, \sigma(n)) \mid \sigma \in S_n\}$. It is known that the $n!$ vectors in $\phi(C)$ are exactly the vertices of the base polytope corresponding to the monotone (cardinality-based) submodular function $f_{\mathrm{perm}} : 2^{[n]} \to \mathbb{R}$ defined as $f_{\mathrm{perm}}(S) = \sum_{i=1}^{|S|}(n - i + 1)$. Thus $\mathrm{conv}(\phi(C)) = B(f_{\mathrm{perm}})$; this is a well-known polytope called the permutahedron [21], and has recently been studied in the context of online learning applications in [18, 15, 1]. Here $\|\phi(\sigma)\|_1 = \frac{n(n+1)}{2}$ for all $\sigma \in C$, and therefore LDOMD with unnormalized negative entropy has a regret bound of $O\big(n^2\sqrt{T\ln(n)}\big)$. As noted above, the decomposition and unnormalized relative entropy projection steps take $O(n^2)$ time.

### 5.2 General Submodular Functions

In general, when $f$ is non-monotone, $B(f) \subseteq \mathbb{R}^n$ can contain vectors with negative entries. Here one can use LDOMD with the squared $L_2$-norm. The Euclidean projection step can again be performed in time $O(n^6 + n^5 Q)$ in general, where $Q$ is the time taken by an oracle that, given $S$, returns $f(S)$, and in $O(n^2)$ time for cardinality-based submodular functions [15].

## 6 Permutation Polytopes

There has been increasing interest in recent years in online decision problems involving rankings or permutations, largely due to their role in applications such as information retrieval, recommender systems, rank aggregation, etc. [12, 18, 19, 15, 1, 2]. Here the decision space is $C = S_n$, the set of all permutations of $n$ objects: $C = \{\sigma : [n] \to [n] \mid \sigma \text{ is bijective}\}$. On each trial $t$, one predicts a permutation $\sigma^t \in C$ and receives some type of loss. We saw one special type of loss in Example 3; we now consider any loss that can be represented as a linear function of some vector representation of the permutations in $C$.
Specifically, let $d \in \mathbb{Z}_+$, and let $\phi : C \to \mathbb{R}^d$ be any injective mapping such that on predicting $\sigma^t$, one receives a loss vector $\boldsymbol{\ell}^t \in [0,1]^d$ and incurs loss $\phi(\sigma^t)^\top \boldsymbol{\ell}^t$. For any such mapping $\phi$, the polytope $\mathrm{conv}(\phi(C))$ is called a permutation polytope [5].³ The permutahedron we saw in Example 3 is one example of a permutation polytope; here we consider various other examples. For any such polytope, if one can perform the decomposition and suitable Bregman projection steps efficiently, then one can use the LDOMD algorithm to obtain good regret guarantees with respect to the associated loss.

³ The term permutation polytope is sometimes used to refer to various polytopes obtained through specific mappings $\phi : S_n \to \mathbb{R}^d$; here we use the term in a broad sense for any such polytope, following the terminology of Bowman [5]. (Note that the description Bowman [5] gives of a particular 0-1 permutation polytope in $\mathbb{R}^{n(n-1)}$, known as the binary choice polytope or the linear ordering polytope [20], is actually incorrect; e.g. see [11].)

**Example 4 (Permutations with assignment-based losses).** Here on each trial $t$, one selects a permutation $\sigma^t \in C$ and receives a loss matrix $\boldsymbol{\ell}^t \in [0,1]^{n \times n}$, with $\ell^t_{ij}$ specifying the loss for assigning element $i$ to position $j$; the loss for the permutation $\sigma^t$ is given by $\sum_{i=1}^n \ell^t_{i,\sigma^t(i)}$. Here it is natural to define a mapping $\phi : C \to \{0,1\}^{n \times n}$ that maps each $\sigma \in C$ to its associated permutation matrix $P^\sigma \in \{0,1\}^{n \times n}$, defined as $P^\sigma_{ij} = \mathbf{1}(\sigma(i) = j)$ for $i, j \in [n]$; the loss incurred on predicting $\sigma^t \in C$ is then $\sum_{i=1}^n \sum_{j=1}^n \phi_{ij}(\sigma^t)\,\ell^t_{ij}$. Thus we have here that $d = n^2$, $\phi(C) = \{P^\sigma \in \{0,1\}^{n \times n} \mid \sigma \in S_n\}$, and $\mathrm{conv}(\phi(C))$ is the well-known Birkhoff polytope containing all doubly stochastic matrices in $[0,1]^{n \times n}$ (also known as the assignment polytope, or the perfect matching polytope of the complete bipartite graph $K_{n,n}$). Here LDOMD with unnormalized negative entropy has a regret bound of $O\big(n\sqrt{T\ln(n)}\big)$. This recovers exactly the PermELearn algorithm used in [12]; see [12] for efficient implementations of the decomposition and unnormalized relative entropy projection steps.

**Example 5 (Permutations with general position-based losses).** Here on each trial $t$, one selects a permutation $\sigma^t \in C$ and receives a loss vector $\boldsymbol{\ell}^t \in [0,1]^n$. There is a weight function $\gamma : [n] \to \mathbb{R}_+$ that weights the loss incurred at each position, such that the loss contributed by element $i$ is $\ell^t_i\,\gamma(\sigma^t(i))$; the total loss of the permutation $\sigma^t$ is given by $\sum_{i=1}^n \ell^t_i\,\gamma(\sigma^t(i))$. Note that the particular loss considered in Example 3 (and in [15]) is a special case of such a position-based loss, with weight function $\gamma(i) = n - i + 1$. Several other position-dependent losses are used in practice; for example, the discounted cumulative gain (DCG) based loss, which is widely used in information retrieval applications, effectively uses $\gamma(i) = 1 - \frac{1}{\log_2(i)+1}$ [9]. For a general position-based loss with weight function $\gamma$, one can define $\phi : C \to \mathbb{R}^n_+$ as $\phi(\sigma) = (\gamma(\sigma(1)), \ldots, \gamma(\sigma(n)))$. This yields a permutation polytope $\mathrm{conv}(\phi(C)) = \mathrm{conv}\big(\{(\gamma(\sigma(1)), \ldots, \gamma(\sigma(n))) \mid \sigma \in S_n\}\big) \subseteq \mathbb{R}^n_+$. Provided one can implement the decomposition and suitable Bregman projection steps efficiently, one can use the LDOMD algorithm to obtain sublinear regret.

## 7 Application to an Online Transportation Problem

Consider now the following transportation problem: there are $m$ supply locations for a particular commodity and $n$ demand locations, with a supply vector $\mathbf{a} \in \mathbb{Z}^m_+$ and demand vector $\mathbf{b} \in \mathbb{Z}^n_+$ specifying the (integer) quantities of the commodity supplied/demanded by the various locations. Assume $\sum_{i=1}^m a_i = \sum_{j=1}^n b_j = q$.
In the offline setting, there is a cost matrix $\boldsymbol{\ell} \in [0,1]^{m \times n}$, with $\ell_{ij}$ specifying the cost of transporting one unit of the commodity from supply location $i$ to demand location $j$, and the goal is to decide on a transportation matrix $Q \in \mathbb{Z}^{m \times n}_+$ that specifies suitable (integer) quantities of the commodity to be transported between the various supply and demand locations so as to minimize the total transportation cost, $\sum_{i=1}^m \sum_{j=1}^n Q_{ij}\ell_{ij}$.

Here we consider an online variant of this problem where the supply vector $\mathbf{a}$ and demand vector $\mathbf{b}$ are viewed as remaining constant over some period of time, while the costs of transporting the commodity between the various supply and demand locations change over time. Specifically, the decision space here is the set of all valid (integer) transportation matrices satisfying the constraints given by $\mathbf{a}, \mathbf{b}$:
$$C = \Big\{ Q \in \mathbb{Z}^{m \times n}_+ \;\Big|\; \textstyle\sum_{j=1}^n Q_{ij} = a_i \;\forall i \in [m], \;\; \sum_{i=1}^m Q_{ij} = b_j \;\forall j \in [n] \Big\}.$$
On each trial $t$, one selects a transportation matrix $Q^t \in C$, and receives a cost matrix $\boldsymbol{\ell}^t \in [0,1]^{m \times n}$; the loss incurred is $\sum_{i=1}^m \sum_{j=1}^n Q^t_{ij}\ell^t_{ij}$. A natural mapping here is simply the identity: $\phi : C \to \mathbb{Z}^{m \times n}_+$ with $\phi(Q) = Q$ for all $Q \in C$. Thus we have here $d = mn$, $\phi(C) = C$, and $\mathrm{conv}(\phi(C))$ is the well-known transportation polytope $T(\mathbf{a}, \mathbf{b})$ (e.g. see [6]):
$$\mathrm{conv}(\phi(C)) = T(\mathbf{a}, \mathbf{b}) = \Big\{ X \in \mathbb{R}^{m \times n}_+ \;\Big|\; \textstyle\sum_{j=1}^n X_{ij} = a_i \;\forall i \in [m], \;\; \sum_{i=1}^m X_{ij} = b_j \;\forall j \in [n] \Big\}.$$
Transportation polytopes generalize the Birkhoff polytope of doubly stochastic matrices, which can be seen to arise as a special case when $m = n$ and $a_i = b_i = 1$ for all $i \in [n]$ (see Example 4). While the Birkhoff polytope is a 0-1 polytope, a general transportation polytope clearly includes non-Boolean vertices. Nevertheless, we do have $T(\mathbf{a}, \mathbf{b}) \subseteq \mathbb{R}^{m \times n}_+$, which suggests we can use the LDOMD algorithm with unnormalized negative entropy.

For the decomposition step in LDOMD, one can use an algorithm broadly similar to that used for the Birkhoff polytope in [12]. Specifically, given a matrix $X \in \mathrm{conv}(\phi(C)) = T(\mathbf{a}, \mathbf{b})$, one successively subtracts off multiples of extreme points $Q^k \in C$ from $X$ until one is left with a zero matrix (see Figure 3).

**Figure 3: Decomposition step in applying LDOMD to transportation polytopes.**
Algorithm: Decomposition Step for Transportation Polytopes.
Input: $X \in T(\mathbf{a}, \mathbf{b})$ (where $\mathbf{a} \in \mathbb{Z}^m_+$, $\mathbf{b} \in \mathbb{Z}^n_+$).
Initialize: $A^1 \leftarrow X$; $k \leftarrow 0$.
Repeat:
- $k \leftarrow k + 1$
- Find an extreme point $Q^k$ of $T(\mathbf{a}, \mathbf{b})$ such that $A^k_{ij} = 0 \implies Q^k_{ij} = 0$ (see Appendix B)
- $\alpha_k \leftarrow \min_{(i,j) : Q^k_{ij} > 0} \frac{A^k_{ij}}{Q^k_{ij}}$
- $A^{k+1} \leftarrow A^k - \alpha_k Q^k$
Until all entries of $A^{k+1}$ are zero.
Output: decomposition of $X$ as a convex combination of extreme points $Q^1, \ldots, Q^k$: $X = \sum_{r=1}^k \alpha_r Q^r$ (it can be verified that $\alpha_r \in (0,1]$ for all $r$ and $\sum_{r=1}^k \alpha_r = 1$).

However, a key step of this algorithm is to find a suitable extreme point to subtract off on each iteration. In the case of the Birkhoff polytope, this involved finding a suitable permutation matrix, and was achieved by finding a perfect matching in a suitable bipartite graph. For general transportation polytopes, we make use of a characterization of extreme points in terms of spanning forests in a suitable bipartite graph (see Appendix B for details).
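The following minimal Python sketch implements the decomposition loop of Figure 3. For the extreme-point step it uses cycle cancelling on the support of the (rescaled) residual matrix, i.e. it repeatedly zeroes out one entry of a cycle in the bipartite support graph until the support is a forest; this is one concrete way to obtain a vertex $Q^k$ with the required support property consistent with the spanning-forest characterization mentioned above, and may differ from the construction in the paper's Appendix B. All function names are illustrative, and the numerical tolerances make this a sketch rather than a robust implementation.

```python
import numpy as np

def find_cycle_cells(support):
    """Return the cells of a cycle in the bipartite support graph (one row node per row,
    one column node per column, an edge per True cell), as a list of cells of even length,
    or None if the support is a forest. Prunes leaf edges, then walks until a node repeats."""
    m, n = support.shape
    S = support.copy()
    changed = True
    while changed:                                   # repeatedly drop edges at degree-1 nodes
        changed = False
        for i in range(m):
            if S[i].sum() == 1:
                S[i, :] = False
                changed = True
        for j in range(n):
            if S[:, j].sum() == 1:
                S[:, j] = False
                changed = True
    if not S.any():
        return None
    node, prev = ('r', int(np.argmax(S.any(axis=1)))), None
    path_nodes, path_cells = [node], []
    while True:                                      # every remaining node has degree >= 2
        kind, idx = node
        nbrs = ([('c', j) for j in range(n) if S[idx, j]] if kind == 'r'
                else [('r', i) for i in range(m) if S[i, idx]])
        nxt = next(v for v in nbrs if v != prev)
        path_cells.append((idx, nxt[1]) if kind == 'r' else (nxt[1], idx))
        if nxt in path_nodes:
            return path_cells[path_nodes.index(nxt):]
        path_nodes.append(nxt)
        prev, node = node, nxt

def extreme_point_with_support(B, tol=1e-9):
    """Turn B (a point of T(a,b), with a,b its row/column sums) into a vertex of T(a,b)
    whose support is contained in supp(B), by cancelling cycles until the support is a forest."""
    Q = B.astype(float).copy()
    while True:
        cycle = find_cycle_cells(Q > tol)
        if cycle is None:
            return np.round(Q)       # a forest-supported feasible point is an integral vertex
        delta = min(Q[c] for c in cycle[1::2])       # cells we subtract from
        for k, c in enumerate(cycle):                # alternating +/- keeps all row/col sums
            Q[c] += delta if k % 2 == 0 else -delta

def decompose_transportation(X, tol=1e-9):
    """Decomposition step of Figure 3: write X in T(a,b) (a, b = row/column sums of X)
    as a convex combination of extreme points of T(a,b). Returns a list of (alpha, Q)."""
    q = X.sum()
    A, w, pieces = X.astype(float).copy(), 1.0, []
    while w > tol:
        Q = extreme_point_with_support(A / w, tol)   # rescaled residual lies in T(a,b) again
        alpha = float((A[Q > 0] / Q[Q > 0]).min())
        pieces.append((alpha, Q))
        A = np.maximum(A - alpha * Q, 0.0)
        w = A.sum() / q
    return pieces

if __name__ == "__main__":
    X = np.array([[1.5, 0.5], [0.5, 1.5]])           # a = (2, 2), b = (2, 2)
    for alpha, Q in decompose_transportation(X):
        print(round(alpha, 3), Q.astype(int).tolist())
```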
The overall decomposition results in a convex combination of at most $mn$ extreme points in $C$, and takes $O(m^3 n^3)$ time. The unnormalized relative entropy projection step can be performed efficiently by using a procedure similar to the Sinkhorn balancing used for the Birkhoff polytope in [12]. Specifically, given a nonnegative matrix $\widetilde{X} \in \mathbb{R}^{m \times n}_+$, one alternately scales the rows and columns to match the desired row and column sums until some convergence criterion is met. As with Sinkhorn balancing, this results in an approximate projection step, but does not hurt the overall regret analysis (other than a constant additive term), yielding a regret bound of $O\big(q\sqrt{T\ln(\max(mn, q))}\big)$.

## 8 Conclusion

We have considered a general form of online combinatorial decision problems, where costs can be linear in any suitable low-dimensional vector representation of elements of the decision space, and have given a general algorithm, termed low-dimensional online mirror descent (LDOMD), for such problems. Our study emphasizes the role of the convex polytope arising from the vector representation of the decision space; this both yields a unification and generalization of previous algorithms, and gives a general framework that can be used to design new algorithms for specific applications.

**Acknowledgments.** Thanks to the anonymous reviewers for helpful comments and to Chandrashekar Lakshminarayanan for helpful discussions. AR is supported by a Microsoft Research India PhD Fellowship. SA thanks DST and the Indo-US Science & Technology Forum for their support.

## References

[1] Nir Ailon. Bandit online optimization over the permutahedron. CoRR, abs/1312.1530, 2013.

[2] Nir Ailon. Online ranking: Discrete choice, Spearman correlation and other feedback. CoRR, abs/1308.6797, 2013.

[3] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.

[4] Francis Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.

[5] V. J. Bowman. Permutation polyhedra. SIAM Journal on Applied Mathematics, 22(4):580–589, 1972.

[6] Richard A. Brualdi. Combinatorial Matrix Classes. Cambridge University Press, 2006.

[7] Sébastien Bubeck. Introduction to online optimization. Lecture Notes, Princeton University, 2011.

[8] Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

[9] David Cossock and Tong Zhang. Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54(11):5140–5154, 2008.

[10] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[11] M. Grötschel, M. Jünger, and G. Reinelt. Facets of the linear ordering polytope. Mathematical Programming, 33:43–60, 1985.

[12] David P. Helmbold and Manfred K. Warmuth. Learning permutations with exponential weights. Journal of Machine Learning Research, 10:1705–1736, 2009.

[13] Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

[14] Wouter M. Koolen, Manfred K. Warmuth, and Jyrki Kivinen. Hedging structured concepts. In COLT, 2010.

[15] Daiki Suehiro, Kohei Hatano, Shuji Kijima, Eiji Takimoto, and Kiyohito Nagano. Online prediction under submodular constraints. In ALT, 2012.
[16] Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4:773–818, 2003.

[17] Manfred K. Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9:2287–2320, 2008.

[18] Shota Yasutake, Kohei Hatano, Shuji Kijima, Eiji Takimoto, and Masayuki Takeda. Online linear optimization over permutations. In ISAAC, pages 534–543, 2011.

[19] Shota Yasutake, Kohei Hatano, Eiji Takimoto, and Masayuki Takeda. Online rank aggregation. In ACML, 2012.

[20] Jun Zhang. Binary choice, subset choice, random utility, and ranking: A unified perspective using the permutahedron. Journal of Mathematical Psychology, 48:107–134, 2004.

[21] Günter M. Ziegler. Lectures on Polytopes. Springer, 1995.