# Learning and Collusion in Multi-unit Auctions

Simina Brânzei (Purdue University), Mahsa Derakhshan (Northeastern University), Negin Golrezaei (MIT), Yanjun Han (New York University)

37th Conference on Neural Information Processing Systems (NeurIPS 2023). Authors are listed in alphabetical order. This work was done in part while the authors were visiting the Simons Institute for the Theory of Computing. Simina was supported in part by NSF CAREER grant CCF-2238372. Negin was supported in part by the Young Investigator Program (YIP) Award from the Office of Naval Research (ONR) N00014-21-1-277.

Abstract. In a carbon auction, licenses for CO2 emissions are allocated among multiple interested players. Inspired by this setting, we consider repeated multi-unit auctions with uniform pricing, which are widely used in practice. Our contribution is to analyze these auctions in both the offline and online settings, by designing efficient bidding algorithms with low regret and giving regret lower bounds. We also analyze the quality of the equilibria in two main variants of the auction, finding that one variant is susceptible to collusion among the bidders while the other is not.

1 Introduction

In a multi-unit auction, a seller brings multiple identical units of a good to a group of players (buyers) interested in purchasing the items. Due to the complexity and depth of the model, the multi-unit auction has been explored in a significant volume of research [MRH89, EWK98, DN07, DLN12, FFLS12, ACP+14, DN15, BCI+05, BGN03, GML13, DHP17, BMT19] and recommended as a lens to the entire field of algorithmic mechanism design [Nis14]. The prior literature has often focused on understanding the equilibria of the auction under various solution concepts and on designing revenue- or welfare-maximizing one-shot mechanisms.

We consider the setting where each player i has a value $v_{i,j}$ for receiving a j-th unit in their bundle, which is reported to the auctioneer in the form of a bid $b_{i,j}$. After receiving the bids, the auctioneer computes a price p per unit, then sorts the bids in descending order and allocates the j-th unit to the player that submitted the j-th highest bid, charging them p for the unit. The price p is set so that all units are sold and the resulting allocation has the property that each player gets their favorite bundle given their preferences.

This allocation rule represents the well-known Walrasian mechanism in the multi-unit setting: compute the market (aka competitive or Walrasian) equilibrium with respect to the reported valuations. The template for the mechanism is: elicit the valuations from the buyers and then assign a set of prices to the goods such that when each buyer takes their favorite bundle of goods at the given prices, the market clears, i.e. all goods are sold and there is no excess demand for any good (see, e.g., [BLNL14]). Due to the strong fairness and efficiency properties of the market equilibrium, the resulting allocation is envy-free with respect to the reported valuations. A strand of research concentrated on achieving fair pricing in auctions, which often involves setting uniform prices for identical units [GHK+05].

Repeated auctions take place in many real-world settings, such as when allocating licenses for carbon (CO2 emissions) [CK02], ads on online platforms [BBW15, GJM21], U.S. Treasury notes to investors, or trade exchanges over the internet [MT15]. Repeated mechanisms often give rise to complex patterns of behavior. For example, the bidders may use the past transaction prices as a way of discovering the valuations of the competing bidders, or engage in collusion using bid rotation schemes, since the history of bids can serve as a communication device [Aoy03].
In fact, [CCDP21] show that in markets such as Cournot oligopolies with stochastic demand, collusion is observed even when the bidders are not explicitly trying to conspire. Instead, collusion emerges naturally when the agents use AI algorithms based on Q-learning to update their strategies.

Since repeated multi-unit auctions are used in high-stakes environments such as allocating licenses for carbon [GIL20], it is paramount to understand: (i) what features should the auction have in the first place; and (ii) given an auction format, how should the players bid to maximize their utility? A good bidding strategy generally guarantees that the player using it will not regret its strategy in hindsight, regardless of the strategies of others. There has been extensive work on understanding dynamics in auctions and games under various behavioral models of the players (see, e.g., [NCKP22, RST17, DS16, MT15, HKMN11, BR11, LB10, GJM21, GJLM21, GJL23, BF19]), but due to the complexity of understanding dynamical systems many questions remain open.

In this paper, we design efficient learning algorithms that players can use for bidding, in both the offline and online settings. We also give regret lower bounds and then analyze the quality of the equilibria in two main variants of the auction, finding that one variant is susceptible to collusion among the bidders while the other is not.

1.1 Model

Consider a seller with K identical units of a good and a set of players $[n] = \{1, \ldots, n\}$ that have quasi-linear valuations with decreasing marginal returns. Formally, let $v_i = (v_{i,1}, \ldots, v_{i,K})$ be the valuation of player i, where $v_{i,j} \geq 0$ is the marginal value for getting a j-th unit in their bundle. The value of player i for getting $\ell$ units is $V_i(\ell) = \sum_{j=1}^{\ell} v_{i,j}$. Diminishing marginal returns require that $v_{i,1} \geq v_{i,2} \geq \ldots \geq v_{i,K}$ for all $i \in [n]$. A player i is said to be hungry if $v_{i,j} > 0$ for all $j \in [K]$.

Auction format: uniform pricing. The auctioneer announces K units of a good for sale. Each player i submits bids $b_i = (b_{i,1}, \ldots, b_{i,K})$, where $b_{i,j}$ is player i's bid for a j-th unit. The auctioneer sorts the bids in decreasing order and computes a price p. Then for each $j = 1, \ldots, K$, the auctioneer allocates the j-th unit to the player that submitted the j-th highest bid, charging them p for the unit. If multiple bids are equal to each other, ties are broken in lexicographic order of the player names. (In other words, after the players submit their bids, the auctioneer arranges these bids in descending order $(c_1, \ldots, c_j, \ldots, c_{nK})$ and then allocates unit 1 to the bidder with bid $c_1$, unit 2 to the bidder with bid $c_2$, and so on.)

We consider two variants of the auction: (i) the K-th price auction, where the price per unit is set to the K-th highest bid; (ii) the (K+1)-st price auction, where the price is set to the (K+1)-st highest bid. Both of these variants implement the market equilibrium with respect to the reported valuations of the players, which are represented by their bid vectors. The K-th price format selects the maximum possible market equilibrium price, namely the highest possible price per unit at which all the goods are sold and each player purchases their favorite bundle given their utility. The (K+1)-st price format selects the minimum possible market equilibrium price.

Allocation, Price, Utility. An allocation z is an assignment of units to the players such that $z_i \geq 0$ is the number of units received by player i and $\sum_{j=1}^{n} z_j = K$. Given a bid profile b, let p(b) be the price per unit and $x(b) = (x_1(b), \ldots, x_n(b))$ the allocation obtained when the auction is run with bids b, where $x_i(b) \in \{0, \ldots, K\}$ is the number of units received by player i. The utility of player i at b is:
$$u_i(b) = V_i(x_i(b)) - p(b) \cdot x_i(b).$$

Revenue and Welfare. The revenue obtained by the auctioneer at a bid profile b equals $K \cdot p(b)$. The social welfare of the bidders at b is $\mathrm{SW}(b) = \sum_{i=1}^{n} V_i(x_i(b))$.

Notation. W.l.o.g., each bid profile b is such that $b_{i,1} \geq \ldots \geq b_{i,K}$ for all $i \in [n]$. Given a bid profile $b = (b_1, \ldots, b_n)$, let $b_{-i}$ denote the bid profile of everyone except player i. The bid profile where player i uses a bid vector $\beta$ and the other players bid $b_{-i}$ is denoted $(\beta, b_{-i})$. An illustration of how the auction works is given in Example 1.

Example 1. Let K = 3 and n = 2. Suppose the valuations are $v_1 = (5, 2)$ and $v_2 = (4, 1)$. If the players submit bids $b_1 = (2, 1)$ and $b_2 = (3, 2)$, the bids are sorted in the order $(b_{2,1}, b_{1,1}, b_{2,2}, b_{1,2})$. Then the allocation is $x_1 = 1$ and $x_2 = 2$. Under the K-th price auction, the price is set to $p = b_{2,2} = 2$. The utilities of the players are $u_1(b) = V_1(1) - p = 5 - 2 = 3$ and $u_2(b) = V_2(2) - 2p = (4 + 1) - 2 \cdot 2 = 1$. Under the (K+1)-st price auction, the price is set to $p = b_{1,2} = 1$. The utilities of the players are $u_1(b) = V_1(1) - p = 5 - 1 = 4$ and $u_2(b) = V_2(2) - 2p = (4 + 1) - 2 \cdot 1 = 3$.
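The mechanics of Example 1 are easy to check programmatically. The sketch below is ours, not the paper's; the function and variable names (run_auction, utility) are assumptions for illustration. It runs one round of the uniform-price auction under both pricing rules, with bids missing from a bid vector simply treated as absent.

```python
from typing import List, Tuple

def run_auction(bids: List[List[float]], K: int, rule: str) -> Tuple[List[int], float]:
    """Run one round of the uniform-price auction.

    bids[i][j] is player i's bid for a j-th unit; ties favour lexicographically
    smaller player indices, as in the paper. rule is "K" for the K-th price
    auction and "K+1" for the (K+1)-st price auction.
    """
    flat = [(-bids[i][j], i, j) for i in range(len(bids)) for j in range(len(bids[i]))]
    flat.sort()  # descending in bid value, ties broken by smaller player index
    alloc = [0] * len(bids)
    for _, i, _ in flat[:K]:
        alloc[i] += 1                # the j-th unit goes to the j-th highest bid
    sorted_bids = [-b for b, _, _ in flat]
    if rule == "K":
        price = sorted_bids[K - 1]   # K-th highest bid
    else:
        price = sorted_bids[K] if K < len(sorted_bids) else 0.0  # (K+1)-st highest bid
    return alloc, price

def utility(values: List[float], units: int, price: float) -> float:
    """Quasi-linear utility V_i(units) - price * units."""
    return sum(values[:units]) - price * units

# Example 1: K = 3 units, valuations v1 = (5, 2), v2 = (4, 1), bids b1 = (2, 1), b2 = (3, 2).
bids = [[2, 1], [3, 2]]
for rule in ("K", "K+1"):
    alloc, p = run_auction(bids, K=3, rule=rule)
    print(rule, alloc, p, utility([5, 2], alloc[0], p), utility([4, 1], alloc[1], p))
```

Running it reproduces Example 1: the allocation is (1, 2) with price 2 and utilities (3, 1) under the K-th price rule, and price 1 with utilities (4, 3) under the (K+1)-st price rule.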
Repeated setting. We consider a repeated setting, where the auction is run multiple times among the same set of players, who can adjust their bids based on the outcomes from previous rounds. Formally, in each round $t = 1, 2, \ldots, T$, the next steps take place. The auctioneer announces K units for sale. Each player $i \in [n]$ submits a bid vector $b_i^t = (b_{i,1}^t, \ldots, b_{i,K}^t)$ privately to the auctioneer, where $b_{i,j}^t$ is player i's bid for a j-th unit at time t. Then the auctioneer runs the auction with bids $b^t = (b_1^t, \ldots, b_n^t)$ and reveals information (feedback) about the outcome to the players.

We consider two feedback models for the information revealed at the end of round t. Full information: all the bids $b^t$ are public knowledge. Bandit feedback: the price $p(b^t)$ is the only public knowledge; additionally, each player i privately learns their own allocation $x_i(b^t)$. Thus the allocation at time t solely depends on the bids submitted by the players in round t, excluding any prior rounds. (For example, consider K = 2 units and n = 2 players. Suppose that in round t = 5, player 1 submits the bid vector $b_1^5 = (4, 2)$ and player 2 submits $b_2^5 = (5, 3)$. The auctioneer sorts these bids in decreasing order, obtaining the vector (5, 4, 3, 2). Then the allocation in round 5 gives the first unit to player 2, who submitted the highest bid, and the second unit to player 1; the allocation disregards any events from preceding rounds.)

1.2 Our Results

The next sections summarize our results for both the K-th and (K+1)-st price variants.

1.2.1 Offline Setting

In the offline setting, we are given a player i with valuation $v_i$ and a history $H_{-i} = (b_{-i}^1, \ldots, b_{-i}^T)$ containing the bids submitted by the other players in past auctions. Here we restrict the bidding space to a discrete domain $D = \{k \cdot \varepsilon \mid k \in \mathbb{N}\}$, for some $\varepsilon > 0$. The goal is to find a fixed bid vector $b_i^* = (b_{i,1}^*, \ldots, b_{i,K}^*)$ that maximizes player i's cumulative utility on the data set given by the history $H_{-i}$, that is, $b_i^* = \arg\max_{\beta \in D^K} \sum_{t=1}^{T} u_i(\beta, b_{-i}^t)$.

The offline problem is challenging because the decision (bid) space is exponentially large and hence naively experimenting with every possible bid vector leads to impractical algorithms. We overcome this challenge by relying on the structural properties of the offline problem.
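As a point of reference for what naive experimentation means here, the following sketch (assumed names; it reuses run_auction from the example in Section 1.1 and, for simplicity, places the learner first in the tie-breaking order) enumerates every non-increasing bid vector on a finite grid and evaluates it against the history. Its running time grows combinatorially with K and the grid size, which is exactly the blow-up that the construction described next avoids.

```python
from itertools import combinations_with_replacement

def cumulative_utility(beta, history, values, K, rule="K+1"):
    """Cumulative utility of bidding the fixed vector beta in every round of the history."""
    total = 0.0
    for other_bids in history:  # other_bids = list of the other players' bid vectors in round t
        alloc, price = run_auction([list(beta)] + [list(b) for b in other_bids], K, rule)
        total += sum(values[:alloc[0]]) - price * alloc[0]
    return total

def brute_force_offline(values, history, K, grid, rule="K+1"):
    """Exhaustive search over non-increasing bid vectors with entries on the grid."""
    best_beta, best_u = None, float("-inf")
    for beta in combinations_with_replacement(sorted(grid, reverse=True), K):
        u = cumulative_utility(beta, history, values, K, rule)
        if u > best_u:
            best_beta, best_u = beta, u
    return best_beta, best_u
```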
At a high level, we carefully design a weighted directed acyclic graph (DAG) and identify a bijective map between paths from source to sink in the graph and bid profiles of player i. Since a maximum-weight path in a DAG can be found in polynomial time, this yields a polynomial-time algorithm for computing an optimum bid vector.

Theorem 1 (informal). Computing an optimum bid vector for one player in the offline setting is equivalent to finding a maximum-weight path in a DAG and can be solved in polynomial time.

Theorem 1 applies to both versions of the auction, with K-th and (K+1)-st highest price.

1.2.2 Online Setting

In the online setting, the players run learning algorithms to update their bids as they participate in the auction over time. The first scenario we consider is that of full information feedback, where at the end of each round t the auctioneer makes public all the bids $b^t$ submitted in that round. Building on the offline analysis, we design an efficient online learning algorithm for bidding, which guarantees low regret for each player using it. The main idea is to run multiplicative weight updates, where the experts are paths in the DAG from the offline setting. Each such path corresponds to a bid profile for the learner. A challenge is that the number of paths in the graph (and so experts) is exponential. Nevertheless, we can achieve a polynomial runtime by representing the experts implicitly, using a type of weight-pushing algorithm based on the method of path kernels [TW03].

Theorem 2 (Full information feedback, upper bound). For each player i and time horizon T, there is an algorithm for bidding in the repeated auction with full information feedback that runs in time $O(T^2)$ and guarantees the player's regret is bounded by $O(v_{i,1} \sqrt{TK^3 \log T})$.

To the best of our knowledge, our paper is the first to connect the path kernel setting with auctions and obtain efficient algorithms for learning in auctions via this connection.

We also consider bandit feedback, which limits the amount of information the players learn about each other: at the end of each round the auctioneer announces publicly only the resulting price, and then privately tells each player their allocation. Bandit feedback could be relevant for reducing the amount of collusion among the players in repeated auction environments [HP89].

Theorem 3 (Bandit feedback, upper bound). For each player i and time horizon T, there is an algorithm for bidding in the repeated auction with bandit feedback that runs in time $O(KT + K^{5/4} T^{7/4})$ and guarantees the player's regret is bounded by $O(v_{i,1} \min\{\sqrt[4]{T^3 K^7 \log T}, KT\})$.

Although our bidding policy is similar to the EXP3-type algorithm in [GLLO07], and our computational efficiency relies on the path kernel methods in [TW03], a major difference is that our edge weights are heterogeneous, i.e. the weights of different edges could be of different scales. To circumvent this issue, we need to use a different estimator for the weight under bandit feedback, which in turn also changes the technical analysis.
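As a reference point for the expert framework behind Theorems 2 and 3, the sketch below shows the vanilla Hedge (multiplicative weights) update over an explicitly enumerated expert set; the names are assumptions of ours and this is not the paper's implementation. In our auction an expert is a full bid vector, i.e. a source-to-sink path of the DAG, so the expert set is exponentially large, and Section 4 replaces the explicit enumeration with weight pushing over the edges.

```python
import math
import random

def hedge(num_experts, T, reward_fn, eta):
    """Vanilla multiplicative-weights (Hedge) over an explicit expert set.

    reward_fn(t, e) is the reward of expert e in round t, revealed to the
    learner after the round (full-information feedback).
    """
    weights = [1.0] * num_experts
    learner_reward = 0.0
    for t in range(T):
        total = sum(weights)
        probs = [w / total for w in weights]
        choice = random.choices(range(num_experts), weights=probs)[0]
        rewards = [reward_fn(t, e) for e in range(num_experts)]
        learner_reward += rewards[choice]
        # Exponential update; in the auction an expert is a full bid vector
        # (a source-to-sink path) and its reward is that round's utility.
        weights = [w * math.exp(eta * r) for w, r in zip(weights, rewards)]
    return learner_reward
```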
We complement the upper bounds with the following regret lower bound, where the construction of the hard instance relies heavily on the specific structure of our auction format.

Theorem 4 (Lower bound, full information and bandit feedback). Let $K \geq 2$. Suppose the auction is run for T rounds. Then for each strategy of player i, under both full information and bandit feedback, there exists a bid sequence $\{b_{-i}^t\}_{t=1}^{T}$ by the other players such that the expected regret of player i is at least $c \cdot v_{i,1} K \sqrt{T}$, where $c > 0$ is an absolute constant.

Theorems 2, 3, and 4 apply to both variants of the auction (with K-th and (K+1)-st highest price).

1.2.3 Equilibrium Analysis

We also analyze the quality of the Nash equilibria reached in the worst case in the static version of the auction, as they naturally apply to the empirical distribution of joint actions when the players use sub-linear regret learning algorithms. The (K+1)-st price auction has been observed to have low or even zero revenue equilibria [Wil79, KN04, ACP+14, BW20]. With n > K hungry players, the next phenomena occur: in the (K+1)-st price auction, every allocation z that allocates all the units can be implemented in a pure Nash equilibrium b with price $\varepsilon$, for every small enough $\varepsilon \geq 0$; in the K-th price auction, no pure Nash equilibrium with arbitrarily low price exists.

Our contribution is to show that the zero-price equilibria of the (K+1)-st highest price auction are very stable. We do this by considering the core solution concept, which models how groups of players can coordinate. In our setting, a strategy profile is core-stable if no group S of players can deviate by simultaneously changing their strategies such that each player in S weakly improves their utility and the improvement is strict for at least one player in S. The players outside S are assumed to have neutral reactions to the deviation, that is, they keep their strategy profiles unchanged. This is consistent with the Nash equilibrium solution concept, where only the deviating player changes their strategy. A body of literature studied the core in auctions when the auctioneer can collude together with the bidders (see, e.g., [DM08]). However, as is standard in mechanism design, it is also meaningful to consider the scenario where the auctioneer sets the auction format and then the bidders can strategize and potentially collude among themselves, without the auctioneer; this setting is our focus.

First, we consider the core without transfers, where a strategy profile is simply a bid profile b.

Theorem 5 (Core without transfers). Consider K units and n > K hungry players. The core without transfers of the (K+1)-st price auction can be characterized as follows: (i) every bid profile b that is core-stable has price zero (i.e. p(b) = 0); (ii) for every allocation z, there is a core-stable bid profile b with price zero that implements z (i.e. x(b) = z and p(b) = 0).

If a bid profile b is core-stable, then it is also a Nash equilibrium, for the simple reason that if no group of players can deviate simultaneously from b, then the group consisting of one player cannot find an improving deviation either. Thus Theorem 5 says that the equilibria with price zero are the only ones that remain stable when also allowing deviations by groups of players. Moreover, the theorem also says there are many inefficient core-stable equilibria, all with price zero.
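The zero-price phenomenon is easy to see on a small instance. The sketch below (assumed names; it reuses run_auction from Section 1.1 and checks unilateral deviations only on a finite bid grid, so it is an illustration rather than a proof) verifies that with K = 2 units and n = 3 hungry players, a profile where two players bid high for a single unit and zero elsewhere clears at price 0 under the (K+1)-st price rule and admits no profitable deviation on the grid, while the same profile yields price 10 and many profitable deviations under the K-th price rule.

```python
from itertools import combinations_with_replacement

K = 2
values = [[5.0, 2.0], [4.0, 1.0], [3.0, 1.0]]    # three hungry players: all marginal values > 0
profile = [[10.0, 0.0], [10.0, 0.0], [0.0, 0.0]]  # candidate zero-price equilibrium

def player_utility(bids, i, rule):
    alloc, price = run_auction(bids, K, rule)
    return sum(values[i][:alloc[i]]) - price * alloc[i]

grid = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 11.0]
for rule in ("K+1", "K"):
    base = [player_utility(profile, i, rule) for i in range(3)]
    improving = []
    for i in range(3):
        # unilateral deviations: non-increasing bid vectors with entries on the grid
        for dev in combinations_with_replacement(sorted(grid, reverse=True), K):
            trial = [list(b) for b in profile]
            trial[i] = list(dev)
            if player_utility(trial, i, rule) > base[i] + 1e-9:
                improving.append((i, dev))
    print(rule, "price:", run_auction(profile, K, rule)[1], "improving deviations:", len(improving))
```

This matches the contrast drawn above: the zero-revenue outcome is self-enforcing under the (K+1)-st price rule but collapses under the K-th price rule.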
Second, we also consider the core with transfers, where the players can make monetary transfers to each other. A strategy profile in this setting is given by a pair (b, t), where b is a bid profile and t is a profile of monetary transfers, such that $t_{i,j} \geq 0$ is the monetary transfer (payment) of player i to player j (e.g., player i could pay player j so that j keeps its bids low).

Theorem 6 (Core with transfers). Consider K units and n > K hungry players. The core with transfers of the (K+1)-st price auction can be characterized as follows. Let (b, t) be an arbitrary tuple of bids and transfers that is core-stable. Then the allocation x(b) maximizes social welfare, the price is zero (i.e. p(b) = 0), and there are no transfers between the players (i.e. t = 0).

To see why Theorem 6 holds, note that if the price is positive at some strategy profile, the players with the highest values can pay the other players to lower their bids. When the price reaches zero, they can start withdrawing the payments. Thus the only strategy profiles in the core with transfers are those where the players with the highest values win and pay zero. The difference from Theorem 5 is that transfers allow the players with the highest values to force getting their desired items at price zero, which they cannot enforce without transfers. Nevertheless, even without transfers, the price remains zero.

Remarks on collusion. Given that the zero-price equilibria of the (K+1)-st price auction are very stable and easy to find, one can expect they can be determined by the bidders, for example in settings with few bidders who know each other. These equilibria can be easily implemented in repeated settings once the bidders agree on their strategies. Strikingly, these zero-price equilibria have in fact been observed empirically in repeated auctions for fishing quotas in the Faroe Islands, where even inexperienced bidders were able to collude and enforce them; see details in [LMT18]. In the carbon setting, the Northeast US auction is based on the (K+1)-st price format [GIL20] and has very low prices as well. In contrast, the K-th price auction format does not have any Nash equilibria with price zero, let alone core-stable ones. Thus the K-th price auction format may be preferable to the (K+1)-st price auction in settings with high stakes, such as selling rights for carbon emissions. Another remedy, also discussed in [LMT18], is to make the number of units a random variable.

2 Related Work

Our work is related to the rich literature on combinatorial auctions (e.g., [DVV03, BN07, CSS07]) and more specifically auctions for identical goods, which are widely used in practice in various settings: Treasury auctions [GI05, BS00], procurement auctions [AC05], wholesale electricity markets [TSM08, FR03, FvdFH06], and carbon auctions [GS13, SS17, GIL20]. The two most prevalent multi-unit auctions are uniform price auctions and pay-as-bid auctions, which are widely studied in the literature either empirically (e.g., [BZ93, MA98, BB01, HG13]) or theoretically under the classical Bayesian setting (e.g., [AHP13, GIL20, BILL23]). In particular, there have been several studies that aim to compare uniform price auctions versus pay-as-bid auctions (e.g., [KCPT01, FR03, SBLS04, LMMM11, ACP+14]) in terms of revenue and welfare under Nash equilibria, and the possibility of collusion. Our work contributes to this literature by studying uniform price auctions from the angle of designing bidding algorithms for the players.

Designing learning algorithms in auctions has attracted significant attention in recent years.
Several papers study the problem of designing learning/bandit algorithms to optimize reserve prices in auctions over time (e.g., [KN14, MR16, MM16, CD17, DHL+17, GJL23, GJM21, GJLM21]). Other papers have focused on designing bandit algorithms to optimize bidding strategies in single-unit second price and first price auctions (e.g., [WPR16, BMSW18, FPS18, HZF+20, HZW20, BGM+22]). Prior to our work, this problem had not been studied for multi-unit uniform-price auctions.

Leveraging structural information in designing bandit algorithms has been the topic of a rich literature, which includes linear reward structures [DHK08, RT10, MRT09, LS17], Lipschitz reward structures [MCP14, MLS18], convex reward distribution structures [VPG20], and combinatorial structures [AR08, UNK10, CBL12, ABL14, CWY13, CTPL15, ZLW19, KV05, NGW+21]. Our work is closest to the works that aim to exploit combinatorial structures. Within this literature, several papers leverage DP-based or MDP-based structures (e.g., [TW03, GLLO07, ZN13, RM21]).

The multi-unit auction setting was studied in a large body of literature on auctions [DLN12, BCI+05, DN07, DL14, DN15], where the focus has been on designing truthful auctions with good approximations to some desired objective, such as the social welfare or the revenue. Uniform pricing represents a type of fair pricing, since when everyone gets charged the same price per unit, no bidder envies another bidder's allocation. Fairness is an important property that is at odds with maximizing revenue, since for the purpose of improving revenue the auctioneer is better off charging higher prices to the buyers that show more interest in the goods. However, there is evidence that buyers are unhappy with discriminatory prices (see, e.g., [AS08]), which led to a body of literature focused on designing fair auction mechanisms [GHK+05, FGL14, FFLS12, CAEFF16, Sek17, AVGN17, BFMZ17, FMTV20, DGJ+22]. Collusion in auctions was also studied both experimentally and theoretically in various models (e.g., [GN92, vdFH93, GNR15]).

3 Offline Setting

In the offline setting, we are given a player i with valuation $v_i$ and a history $H_{-i} = (b_{-i}^1, \ldots, b_{-i}^T)$ with the bids of the other players in rounds 1 through T. We assume the bids are restricted to a discrete domain $D = \{k \cdot \varepsilon \mid k \in \mathbb{N}\}$, for some $\varepsilon > 0$. The goal is to find an optimum fixed bid vector in hindsight: $b_i^* = \arg\max_{\beta \in D^K} \sum_{t=1}^{T} u_i(\beta, b_{-i}^t)$, where we recall that $u_i(\beta, b_{-i}^t)$ is the utility of player i at the bid profile $(\beta, b_{-i}^t)$. We will see the maximum in the definition of $b_i^*$ is well defined since there is an optimum bid vector $\beta$ with entries at most $\varepsilon$ above the maximum historical bid. For any $\varepsilon > 0$, the optimal solution on the discrete domain D also yields an $(\varepsilon K)$-optimal strategy on the continuous domain.

The offline problem we study here has some resemblance to the problems studied in [RW16, DGPL19, DPS21]. There, the goal is to optimize reserve prices in VCG auctions while having access to a dataset of historical bids. ([RW16] also presents an online version of their offline algorithms under full information feedback; see also [NGW+21] for an online version of [RW16]'s algorithm for both the full information and bandit settings using Blackwell approachability [Bla56].)

To design a polynomial time algorithm for solving the offline problem, we define a set $S_i$ of candidate bids for player i, which has polynomial size in the history, player i's valuation, and $1/\varepsilon$:
$$S_i = \{0\} \cup \left\{ b_{j,k}^t \mid j \in [n] \setminus \{i\},\ k \in [K],\ t \in [T] \right\} \cup \left\{ b_{j,k}^t + \varepsilon \mid j \in [n] \setminus \{i\},\ k \in [K],\ t \in [T] \right\}. \quad (1)$$

Observation 1. Player i has an optimum bid vector $\beta = (\beta_1, \ldots, \beta_K) \in D^K$ with $\beta_j \in S_i$ for all $j \in [K]$, where $S_i$ is defined in equation (1).

The proof of Observation 1 is simple and deferred to Appendix A.
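For concreteness, the candidate set in equation (1) can be computed directly from the history; a small sketch (assumed names, with history[t][j][k] denoting the k-th bid of the j-th other player in round t) is given below.

```python
def candidate_bids(history, eps):
    """Candidate bid set S_i from equation (1): zero, every observed bid of the
    other players, and every observed bid shifted up by the grid step eps."""
    S = {0.0}
    for round_bids in history:       # round_bids = bids of the other players in one round
        for bid_vector in round_bids:
            for b in bid_vector:
                S.add(b)
                S.add(b + eps)
    return sorted(S)
```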
Next we construct a directed acyclic graph (DAG) $G_i$ that will be used to compute an optimum bid vector for player i.

Definition 1 (The graph $G_i$). Given the valuation $v_i$ of a player i, a history $H_{-i} = (b_{-i}^1, \ldots, b_{-i}^T)$, $\varepsilon > 0$, and the set $S_i$ from (1), define a graph $G_i$ with the following features.

Vertices. Create a vertex $z_{s,j}$ for each $s \in S_i$ and $j \in [K]$. We say vertex $z_{s,j}$ is in layer j. Add a source $z_-$ and a sink $z_+$.

Edges. For each index $j \in [K-1]$ and pair of bids $r, s \in S_i$ with $r \geq s$, create a directed edge from vertex $z_{r,j}$ to vertex $z_{s,j+1}$. Moreover, add edges from the source $z_-$ to each node in layer 1 and from each node in layer K to the sink $z_+$.

Edge weights. For each edge $e = (z_{r,j}, z_{s,j+1})$ or $e = (z_{r,K}, z_+)$, let $\beta = (\beta_1, \ldots, \beta_K) \in S_i^K$ be a bid vector with $\beta_j = r$ and $\beta_{j+1} = s$ (we define $s = 0$ if $j = K$). For each $t \in [T]$, let $h^t = (\beta, b_{-i}^t)$. Define the weight of edge e as
$$w_e = \sum_{t=1}^{T} \left( \mathbb{1}\{x_i(h^t) \geq j\} \, (v_{i,j} - r) + j \left[ \mathbb{1}\{x_i(h^t) > j\} \, (r - s) + \mathbb{1}\{x_i(h^t) = j\} \, (r - p(h^t)) \right] \right), \quad (2)$$
recalling that $p(h^t)$ and $x(h^t)$ are the price and allocation at $h^t$, respectively (that is, $x_k(h^t)$ is the number of units received by player k at the allocation $x(h^t)$). The edges going out of the source $z_-$ have weight zero.

A pictorial illustration of the graph $G_i$ is displayed in Figure 1.

Figure 1: Example of a DAG as in Definition 1. Here, we have K = 4 units and the set of candidate bids for player i is $S_i = \{0, 1, 2\}$. The graph has a source $z_-$, a sink $z_+$, and nodes $z_{r,j}$ for all $r \in S_i$ and $j \in [4]$. Edges are of the form $(z_{r,j}, z_{s,j+1})$ for all $j < 4$ and $r, s \in S_i$ with $r \geq s$. There are also edges from the source $z_-$ to the nodes $z_{j,1}$ and from the nodes $z_{j,4}$ to the sink $z_+$, for all $j \in S_i$.

At first glance, it may seem as though the weight $w_e$ of an edge e depends on the whole bid vector $\beta$, since we use the value of $x_i(h^t)$ in its definition. However, this is not the case: the formula gives the same value for any choice of $\beta$ that satisfies $\beta_j = r$ and $\beta_{j+1} = s$. This is due to the fact that the indicators $\mathbb{1}\{x_i(h^t) \geq j\}$, $\mathbb{1}\{x_i(h^t) > j\}$, and $\mathbb{1}\{x_i(h^t) = j\}$ (and, when $x_i(h^t) = j$, the price $p(h^t)$) can be determined knowing only the values of $\beta_j$ and $\beta_{j+1}$, without knowing the rest of the vector $\beta$.

We will show that computing an optimum strategy for the player is equivalent to finding a maximum-weight path in the DAG $G_i$ from Definition 1.

Theorem 1 (formal version). Suppose we are given a number n of players, a number K of units, the valuation $v_i$ of player i, a discretization level $\varepsilon > 0$, and a bid history $H_{-i} = (b_{-i}^1, \ldots, b_{-i}^T)$ by players other than i. Then an optimum bid vector for player i can be computed in polynomial time in the input parameters.

The proof of Theorem 1 is in Appendix A, together with the other omitted proofs of this section. At a high level, for any bid vector $\beta$, we first write the cumulative utility of player i when playing $\beta$ while the others play according to $b_{-i}^t$. Then we observe that the cumulative utility can be rewritten as a sum of K terms, so that each term only depends on j, $\beta_j$, and $\beta_{j+1}$. We create a layered DAG as in Definition 1, where the weight of the edge from $(\beta_j, j)$ to $(\beta_{j+1}, j+1)$ is the j-th term of the sum above, with special handling for edges involving the source or the sink. The entire graph can be computed in polynomial time. Then the proof shows there is a bijection between the set of bid vectors of the player and the set of paths from the source to the sink in the graph. Thus the solution to the offline problem is obtained by computing a maximum-weight path in the graph and returning the corresponding bid vector.
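A compact sketch of this pipeline is given below (assumed names; it reuses run_auction and candidate_bids from the earlier sketches and again places player i first in the tie-breaking order). Because, as discussed above, each per-round edge term depends only on $\beta_j = r$ and $\beta_{j+1} = s$, the sketch evaluates it on the padded bid vector $(r, \ldots, r, s, \ldots, s)$; this recomputes the auction per edge and per round, which is simpler but less efficient than the dedicated weight computation in Lemma 2. The maximum-weight path is then found by dynamic programming over the layers.

```python
def offline_optimum(values, history, K, eps, rule="K+1"):
    """Offline optimum bid vector via a longest-path DP over the layered DAG."""
    S = candidate_bids(history, eps)

    def edge_weight(j, r, s):
        # Weight of the edge from (r, layer j) to (s, layer j+1); s = 0 encodes the sink edge.
        total = 0.0
        beta = [r] * j + [s] * (K - j)   # any beta with beta_j = r, beta_{j+1} = s gives the same value
        for other in history:
            alloc, price = run_auction([beta] + [list(b) for b in other], K, rule)
            x = alloc[0]
            total += (values[j - 1] - r) if x >= j else 0.0
            total += j * ((r - s) if x > j else ((r - price) if x == j else 0.0))
        return total

    best = {(1, r): 0.0 for r in S}      # edges out of the source have weight zero
    pred = {}
    for j in range(1, K):
        for s in S:
            val, arg = max((best[(j, r)] + edge_weight(j, r, s), r) for r in S if r >= s)
            best[(j + 1, s)], pred[(j + 1, s)] = val, arg
    opt_value, last = max((best[(K, r)] + edge_weight(K, r, 0.0), r) for r in S)
    beta = [last]                        # walk the predecessors back to recover the bid vector
    for j in range(K, 1, -1):
        beta.append(pred[(j, beta[-1])])
    return list(reversed(beta)), opt_value
```

On small instances the output can be sanity-checked against brute_force_offline from Section 1.2.1.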
The algorithm for the offline problem is summarized below.

Algorithm 1: Optimum Strategy for the Offline Problem.
Input: player i with valuation $v_i = (v_{i,1}, \ldots, v_{i,K})$, a history $H_{-i} = (b_{-i}^1, \ldots, b_{-i}^T)$ of bids submitted by players other than i, and a grid $D = \{k \cdot \varepsilon \mid k \in \mathbb{N}\}$ of allowed bid values.
Output: $b_i^* = \arg\max_{\beta \in D^K} \sum_{t=1}^{T} u_i(\beta, b_{-i}^t)$.
1. Compute the set $S_i$ defined in equation (1). By Observation 1, player i has an optimum bid vector $b_i^*$ with $b_{i,j}^* \in S_i$ for each $j \in [K]$.
2. Construct the directed acyclic graph $G_i$ as in Definition 1. By Theorem 1, there is a bijective map between paths from the source to the sink in $G_i$ and bid vectors $\beta \in S_i^K$ of player i.
3. Compute a maximum-weight path in the graph $G_i$ and output the corresponding bid vector.

4 Online Setting

In the online setting, we consider the repeated auction where in each round t the auctioneer announces K units for sale. Then the players privately submit the bids $b^t$ and the auctioneer computes the outcome. At the end of round t, the auctioneer gives full information or bandit feedback. The information available to player i at the beginning of each round t is the history
$$H_i^{t-1} = (b_i^1, \ldots, b_i^{t-1}) \cup H_{-i}^{t-1}, \quad \text{where } H_{-i}^{t-1} = \begin{cases} (b_{-i}^1, \ldots, b_{-i}^{t-1}) & \text{in the full information setting,} \\ (p(b^1), \ldots, p(b^{t-1})) & \text{in the bandit setting.} \end{cases}$$
(We omit the allocation information in the history, for it is not informative when the other players are adversarial: from the price information, player i knows her own allocation; treating all other players as a giant adversarial player, that player gets the rest of the allocation.)

A pure strategy for player i at time t is a function $\pi_i^t : H_i^{t-1} \to \mathbb{R}^K$, where $H_i^{t-1}$ is the history available to player i at the beginning of round t. Thus the pure strategy tells player i what bid vector to submit next given the information observed by the player about the previous rounds. A mixed strategy is a probability distribution over the set of pure strategies. Consequently, the bid vector $b_i^t$ submitted by player i at time t is the realization of the mixed strategy $\pi_i^t(H_i^{t-1})$. We denote by $\pi_i = (\pi_i^1, \ldots, \pi_i^T)$ the overall strategy of player i over the entire time horizon.

Regret. From now on we fix an arbitrary player i. Given a bidding strategy $\pi_i$ of player i, the regret of the player is defined with respect to a history $H_{-i}^T$ of bids by the other players:
$$\mathrm{Reg}_i(\pi_i, H_{-i}^T) = \max_{\beta \in S_i^K} \sum_{t=1}^{T} \sum_{j=1}^{K} \left( v_{i,j} - p(\beta, b_{-i}^t) \right) \mathbb{1}\{x_i(\beta, b_{-i}^t) \geq j\} \;-\; \sum_{t=1}^{T} \mathbb{E}_{b_i^t \sim \pi_i^t(H_i^{t-1})}\!\left[ \sum_{j=1}^{K} \left( v_{i,j} - p(b^t) \right) \mathbb{1}\{x_i(b^t) \geq j\} \right]. \quad (3)$$
For the purpose of equipping player i with a bidding algorithm, we will think of the other players as adversarial, and aim to achieve small regret regardless of $H_{-i}^T$. A way to interpret the regret is that player i is competing against an oracle that (a) has perfect knowledge of the history of bids, but (b) is restricted to submit a time-invariant bid $\beta$. The regret quantifies how much worse player i does compared to the oracle.

Before studying the feedback models in greater detail, we replace the set $S_i$ of candidate bids for player i from equation (1) of the offline section by a coarser set $S_\varepsilon = \{\varepsilon, 2\varepsilon, \ldots, \lfloor v_{i,1}/\varepsilon \rfloor \cdot \varepsilon\}$.

4.1 Full information

In the full information scenario, at the end of each round t the auctioneer announces the bids $b^t$ submitted by the players in round t. We define a graph $G^t$ that forms the basis of the learning algorithm for the player.
The graph Gt is the same as in the offline setting (Definition 1), except for the next differences: The vertex zs,j is indexed by s Sε, instead of Si; The edge weights are based only on the current round t, rather than the sum over all rounds in the offline setting. More specifically, for edge e = (zr,j, zs,j+1) or e = (zr,K, z+), set wt(e) = 1{xi(ht) j} (vi,j r) + j h 1{xi(ht)>j} (r s) + 1{xi(ht)=j} r p(ht) i . (4) The formal definition is deferred to Definition 2 in Appendix B. This DAG structure enables us to formulate the learning problem as an online maximum weight path problem, and therefore treat each path as an expert and apply online learning algorithms to compete with the best expert. Theorem 2 shows such an algorithm (Algorithm 2) works for both K-th and (K + 1)-st price auctions. Input: time horizon T, valuation vi = (vi,1, , vi,K) of the player i Output: bid vectors (b1 i , , b T i ). Set the resolution parameter ε = vi,1 p K/T, and learning rate η = log T/(vi,1 KT). For every t = 1, , T: Construct a graph Gt as in Definition 2, initially without weights on the edges. If t = 1: for each edge e, initialize an edge probability as 1/|Sε| if e = (z , zr,1); 1/|{r Sε | r r}| if e = (zr,j, zs,j+1), r s; 1 if e = (zr,K, z+). starting from Γt 1(z+) = 1, recursively compute in the bottom-up order Γt 1(u) = X v:e=(u,v) E ϕt 1(e) exp(η wt 1(e)) Γt 1(v), u V (Gt); for each edge e, update the edge probability ϕt(e) = ϕt 1(e) exp η wt 1(e) Γt 1(v) Γt 1(u), for all e = (u, v) E(Gt) . (5) For j = 1, 2, , K: sample bt i,j ϕt(zbt i,j 1,j 1, ), where zbt i,0,0 := z . The player observes bt i, and sets the edge weights wt(e) in Gt according to (4). algorithm 2: Selecting bids using the weight pushing algorithm in [TW03]. In Algorithm 2, inspired by the path kernel algorithm in [TW03], we define ϕt(u, ) as a probability distribution over the outneighbors of vertex u. To determine the bid vector in each round, we use a random walk starting from the source z , where the learner selects a random neighbor z1 of z using ϕt(z , ), then selects a random neighbor z2 of z1 using ϕt(z1, ), and so on, until reaching the sink z+. Remarkably, as we prove in Theorem 2, the distribution of the resulting path pt is the same as that in the Multiplicative Weight Update (Hedge) algorithm [LW94]. Thus while the number of paths in the DAG (corresponding to experts in the Hedge algorithm) is exponentially large, we can run the Hedge algorithm with polynomial space and time complexity by leveraging the DAG structure. We do not need to keep track of the weights of every path, but only need to store and update the weights of every edge in the DAG using the variables ϕt(e) and the recursive formula in equation (5). To accomplish this, we use the variables Γt(u), for every vertex u in the DAG, which can be viewed as the weights of paths starting at vertex u. We obtain the next theorem , the proof of which can be found in Appendix B. Theorem 2. For each player i and time horizon T, under full-information feedback, Algorithm 2 runs in time O(T 2) and guarantees the player s regret is at most O(vi,1 p TK3 log T). 4.2 Bandit feedback The next theorem presents an upper bound on the regret under the bandit feedback. Similar to the full-information case, the algorithm we design works for both the K-th and (K+1)-st price auctions. Theorem 3. 
We obtain the next theorem, the proof of which can be found in Appendix B.

Theorem 2. For each player i and time horizon T, under full-information feedback, Algorithm 2 runs in time $O(T^2)$ and guarantees the player's regret is at most $O(v_{i,1} \sqrt{TK^3 \log T})$.

4.2 Bandit feedback

The next theorem presents an upper bound on the regret under bandit feedback. Similar to the full-information case, the algorithm we design works for both the K-th and (K+1)-st price auctions.

Theorem 3. For each player i and time horizon T, there is an algorithm for bidding in the repeated auction with bandit feedback that runs in time $O(KT + K^{5/4} T^{7/4})$ and guarantees the player's regret is bounded by $O(v_{i,1} \min\{\sqrt[4]{T^3 K^7 \log T}, KT\})$.

We describe the algorithm below, while the proof of the theorem can be found in Appendix B. The bidding strategy and its efficient implementation are the same as those in the proof of Theorem 2, with the only difference as follows. Under bandit feedback, we are not able to compute $w^t(e)$ for every edge e at the end of time t; instead, we are able to compute $w^t(e)$ for every edge on the chosen path $p^t$ at time t. This observation motivates us to use the next unbiased estimator $\hat{w}^t(e)$:
$$\hat{w}^t(e) = \bar{w}(e) - \frac{\bar{w}(e) - w^t(e)}{p^t(e)} \, \mathbb{1}\{e \in p^t\}, \quad \text{with } p^t(e) = \sum_{p \,:\, e \in p} P^t(p), \quad (6)$$
where $P^t$ is the player's action distribution over paths, and for each edge e:
$$\bar{w}(e) = \begin{cases} v_{i,1} - r + j(r - s) & \text{if } e = (z_{r,j}, z_{s,j+1}); \\ v_{i,1} - r + Kr & \text{if } e = (z_{r,K}, z_+); \\ 0 & \text{if } e = (z_-, z_{r,1}). \end{cases}$$
The resulting algorithm is exactly the same as Algorithm 2, with the only difference being that all $w^t$ are replaced by $\hat{w}^t$, with an additional step (6) for the computation of $\hat{w}^t$. The regret analysis will appear similar to that in [GLLO07], but a key difference is that we no longer have $w^t(e) \leq v_{i,1}$ for every edge e in our setting. Instead, we apply an edge-dependent upper bound $w^t(e) \leq \bar{w}(e)$, and use the property that $\sum_{e \in p} \bar{w}(e) \leq K v_{i,1}$ for every path p from source to sink.

4.3 Regret lower bounds

The following theorem establishes an $\Omega(K\sqrt{T})$ lower bound on the minimax regret of a single bidder in the full information scenario. Clearly the same lower bound also holds for bandit feedback.

Theorem 4 (restated, formal). Let $K \geq 2$. For any policy $\pi_i$ used by player i, there exists a bid sequence $\{b_{-i}^t\}_{t=1}^{T}$ for the other players such that the expected regret in (3) satisfies $\mathbb{E}[\mathrm{Reg}_i(\pi_i, \{b_{-i}^t\}_{t=1}^{T})] \geq c \, v_{i,1} K \sqrt{T}$, where $c > 0$ is an absolute constant.

The proof of Theorem 4 is in Appendix B. The idea is to employ the celebrated Le Cam's two-point method, but the construction of the hard instance is rather delicate, in order to reflect the regret dependence on K. We also remark that this regret lower bound still exhibits a gap compared with the current upper bound in Theorem 2, while it seems hard to extend the current two-point construction to Fano-type lower bound arguments. It is an outstanding question to close this gap.

5 Concluding Remarks

Several open questions arise from this work. For example, is the regret $\Theta(K\sqrt{T})$ in both the full information and bandit settings? In the full information setting, the challenge in obtaining an algorithm with regret $O(K\sqrt{T})$ is that the expert distribution is not a product distribution over the layers, but rather only close to a product. Furthermore, how does offline and online learning take place in repeated auctions when the units are not identical, the valuations do not necessarily exhibit diminishing returns, or the pricing rule is different (e.g. in the first price setting)?

References

[ABL14] J-Y. Audibert, S. Bubeck, and G. Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31 45, 2014. [AC05] L. Ausubel and P. Cramton. Dynamic auctions in procurement. Handbook of Procurement, 2005. [ACP+14] L. Ausubel, P. Cramton, M. Pycia, M. Rostek, and M. Weretka. Demand reduction and inefficiency in multi-unit auctions. The Review of Economic Studies, 81(4):1366 1400, 2014. [AHP13] E. Anderson, P. Holmberg, and A. Philpott. Mixed strategies in discriminatory divisible-good auctions.
The RAND Journal of Economics, 44(1):1 32, 2013. [Aoy03] M. Aoyagi. Bid rotation and collusion in repeated auctions. Journal of Economic Theory, 112(1):79 105, 2003. [AR08] E. Abernethy, J.and Hazan and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, pages 263 273, 2008. [AS08] E. Anderson and D. Simester. Price stickiness and customer antagonism. Available at SSRN 1273647, 2008. [AVGN17] E. Areyan Viqueira, A. Greenwald, and V. Naroditskiy. On approximate welfareand revenue-maximizing equilibria for size-interchangeable bidders. In AAMAS, page 1466 1468, 2017. [BB01] J. Bower and D. Bunn. Experimental analysis of the efficiency of uniform-price versus discriminatory auctions in the england and wales electricity market. Journal of economic dynamics and control, 25(3-4):561 592, 2001. [BBW15] S. R. Balseiro, O. Besbes, and G. Y. Weintraub. Repeated auctions with budgets in ad exchanges: Approximations and design. Management Science, 61(4):864 884, 2015. [BCI+05] C. Borgs, J. Chayes, N. Immorlica, M. Mahdian, and A. Saberi. Multi-unit auctions with budget-constrained bidders. In ACM EC, pages 44 51, 2005. [BF19] S. Brânzei and A. Filos-Ratsikas. Walrasian dynamics in multi-unit markets. In AAAI, pages 1812 1819. AAAI Press, 2019. [BFMZ17] S. Brânzei, A. Filos-Ratsikas, P. B. Miltersen, and Y. Zeng. Walrasian pricing in multiunit auctions. In MFCS, volume 83 of LIPIcs, pages 80:1 80:14, 2017. [BGM+22] S. Balseiro, N. Golrezaei, M. Mahdian, V. Mirrokni, and J. Schneider. Contextual bandits with cross-learning. Mathematics of Operations Research, 2022. [BGN03] Y. Bartal, R. Gonen, and N. Nisan. Incentive compatible multi unit combinatorial auctions. In TARK, pages 72 87, 2003. [BILL23] M. Babaioff, N. Immorlica, Y. Li, and B. Lucier. Making auctions robust to aftermarkets. In ITCS, 2023. [Bla56] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1 8, 1956. [BLNL14] M. Babaioff, B. Lucier, N. Nisan, and R. P. Leme. On the efficiency of the walrasian mechanism. In ACM EC, pages 783 800. ACM, 2014. [BMSW18] M. Braverman, J. Mao, J. Schneider, and M. Weinberg. Selling to a no-regret buyer. In ACM EC, page 523 538. Association for Computing Machinery, 2018. [BMT19] G. Birmpas, E. Markakis, and O Telelis. Tight welfare guarantees for pure nash equilibria of the uniform price auction. Theory Comput Syst, 63:1451 1469, 2019. [BN07] L. Blumrosen and N. Nisan. Combinatorial auctions. Algorithmic game theory, 267:300, 2007. [Bon63] O. N. Bondareva. Some applications of linear programming methods to the theory of cooperative games. Prob. Kibern., 10:119 133, 1963. Published in Russian. [BR11] K. Bhawalkar and T. Roughgarden. Welfare guarantees for combinatorial auctions with item bidding. In SODA, 2011. [BS00] K. Binmore and J. Swierzbinski. Treasury auctions: Uniform or discriminatory? Review of Economic Design, 5:387 410, 2000. [BW20] J. Burkett and K. Woodward. Uniform price auctions with a last accepted bid pricing rule. Journal of Economic Theory, 185(10):49 54, 2020. [BZ93] K. Back and J. Zender. Auctions of divisible goods: On the rationale for the treasury experiment. The Review of Financial Studies, 6(4):733 764, 1993. [CAEFF16] V. Cohen-Addad, A. Eden, M. Feldman, and A. Fiat. The invisible hand of dynamic market pricing. In ACM EC, pages 383 400, 2016. [CBL06] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge university press, 2006. [CBL12] N. Cesa-Bianchi and G. 
Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404 1422, 2012. [CCDP21] E. Calvano, G. Calzolari, V. Denicoló, and S. Pastorello. Algorithmic collusion with imperfect monitoring. International Journal of Industrial Organization, 79:102712, 2021. [CD17] Y. Cai and C. Daskalakis. Learning multi-item auctions with (or without) samples. In FOCS, 2017. [CK02] P. Cramton and S. Kerr. Tradeable carbon permit auctions. Energy Policy, 30:333 345, 2002. [CSS07] Peter Cramton, Yoav Shoham, and Richard Steinberg. An overview of combinatorial auctions. ACM SIGecom Exchanges, 7(1):3 14, 2007. [CTPL15] R. Combes, M. Sadegh Talebi, A. Proutiere, and M. Lelarge. Combinatorial bandits revisited. In Neur IPS, pages 2116 2124, 2015. [CWY13] W. Chen, Y. Wang, and Y. Yuan. Combinatorial multi-armed bandit: General framework and applications. In ICML, pages 151 159, 2013. [DGJ+22] Y. Deng, N. Golrezaei, P. Jaillet, J. Liang, and V. Mirrokni. Fairness in the autobidding world with machine-learned advice. ar Xiv:2209.04748, 2022. [DGPL19] M. Derakhshan, N. Golrezaei, and R. Paes Leme. Lp-based approximation for personalized reserve prices. In ACM EC, pages 589 589, 2019. [DHK08] V. Dani, T.P. Hayes, and S.M. Kakade. Stochastic linear optimization under bandit feedback. In ALT, pages 355 366, 2008. [DHL+17] M. Dudík, N. Haghtalab, H. Luo, R. E. Schapire, V. Syrgkanis, and J. W. Vaughan. Oracle-efficient learning and auction design. In FOCS, 2017. [DHP17] N. R. Devanur, N. Haghpanah, and C.-A. Psomas. Optimal multi-unit mechanisms with private demands. In EC, pages 41 42, 2017. [DL14] S. Dobzinski and R. P. Leme. Efficiency guarantees in auctions with budgets. In ICALP, pages 392 404. Springer, 2014. [DLN12] S. Dobzinski, R. Lavi, and N. Nisan. Multi-unit auctions with budget limits. Games and Economic Behavior, 74(2):486 503, 2012. [DM08] R. Day and P. Milgrom. Core-selecting package auctions. Int. J. Game Theory, 36(3 4):393 407, mar 2008. [DN07] S. Dobzinski and N. Nisan. Mechanisms for multi-unit auctions. In ACM EC, pages 346 351, 2007. [DN15] S. Dobzinski and N. Nisan. Multi-unit auctions: beyond roberts. JET, 156:14 44, 2015. [DPS21] M. Derakhshan, D. Pennock, and A. Slivkins. Beating greedy for approximating reserve prices in multi-unit vcg auctions. In SODA, pages 1099 1118. SIAM, 2021. [DS16] C. Daskalakis and V. Syrgkanis. Learning in auctions: Regret is hard, envy is easy. In FOCS, pages 219 228, 2016. [DVV03] S. De Vries and R. Vohra. Combinatorial auctions: A survey. INFORMS Journal on computing, 15(3):284 309, 2003. [Edg81] F. Y. Edgeworth. Mathematical psychics: An essay on the mathematics to the moral sciences. London: Kegan Paul, 1881. [EWK98] R. Engelbrecht-Wiggans and C. M. Kahn. Multi-unit auctions with uniform prices. Economic Theory, 12:227 258, 1998. [FFLS12] M. Feldman, A. Fiat, S. Leonardi, and P. Sankowski. Revenue maximizing envy-free multi-unit auctions with budgets. In ACM EC, pages 532 549, 2012. [FGL14] M. Feldman, N. Gravin, and B. Lucier. Combinatorial auctions via posted prices. In SODA, pages 123 135. SIAM, 2014. [FMTV20] M. Flammini, M. Mauro, M. Tonelli, and C. Vinci. Inequity aversion pricing in multiunit markets. In ECAI, volume 325, pages 91 98. IOS Press, 2020. [FPS18] Z. Feng, C. Podimata, and V. Syrgkanis. Learning to bid without knowing your value. In ACM EC, pages 505 522. ACM, 2018. [FR03] G. Federico and D. Rahman. Bidding in an electricity pay-as-bid auction. Journal of Regulatory Economics, 24(2):175 211, 2003. [Fvd FH06] N. 
Fabra, N-H von der Fehr, and D. Harbord. Designing electricity auctions. The RAND Journal of Economics, 37(1):23 46, 2006. [GHK+05] V. Guruswami, J. Hartline, A. Karlin, D. Kempe, C. Kenyon, and F. Mc Sherry. On profit-maximizing envy-free pricing. In SODA, pages 1164 1173, 2005. [GI05] K. Garbade and J. Ingber. The treasury auction process: Objectives, structure, and recent adaptations. Structure, and Recent Adaptations, 2005. [Gil59] D. B. Gillies. Solutions to general non-zero-sum games. Contributions to the Theory of Games, 40, 1959. In Tucker, A. W.; Luce, R. D. (eds.). Contributions to the Theory of Games IV. Annals of Mathematics Studies. [GIL20] K. Goldner, N. Immorlica, and B. Lucier. Reducing inefficiency in carbon auctions with imperfect competition. In ITCS, 2020. [GJL23] N. Golrezaei, P. Jaillet, and J. Liang. Incentive-aware contextual pricing with nonparametric market noise. In AISTATS, pages 9331 9361. PMLR, 2023. [GJLM21] N. Golrezaei, P. Jaillet, J. Liang, and V. Mirrokni. Bidding and pricing in budget and roi constrained markets. ar Xiv:2107.07725, 2021. [GJM21] N. Golrezaei, A. Javanmard, and V. Mirrokni. Dynamic incentive-aware learning: Robust pricing in contextual auctions. Operations Research, 69(1):297 314, 2021. [GLLO07] András György, Tamás Linder, Gábor Lugosi, and György Ottucsák. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8(10), 2007. [GML13] G. Goel, V. S. Mirrokni, and R. P. Leme. Clinching auctions with online supply. In SODA, pages 605 619, 2013. [GN92] R. J. Green and D. M. Newbery. Competition in the british electricity spot market. The Journal of political economy, 100(5):929 953, 1992. [GNR15] G. Goswami, T. H. Noe, and M. J. Rebello. Collusion in Uniform-Price Auctions: Experimental Evidence and Implications for Treasury Auctions. The Review of Financial Studies, 9(3):757 785, 06 2015. [GS13] L. H. Goulder and A. R. Schein. Carbon taxes versus cap and trade: A critical review. Climate Change Economics, 4, 2013. [HG13] S. Heim and G. Götz. Do pay-as-bid auctions favor collusion? evidence from germany s market for reserve power. Evidence from Germany s Market for Reserve Power, 2013. [HKMN11] A. Hassidim, H. Kaplan, Y. Mansour, and N. Nisan. Non-price equilibria in markets of discrete goods. In EC, pages 295 296, 2011. [HP89] K. Hendricks and R. Porter. Collusion in auctions. Annals of Economics and Statistics, Jul-Dec(15-16):217 230, 1989. [HZF+20] Y. Han, Z. Zhou, A. Flores, E. Ordentlich, and T. Weissman. Learning to bid optimally and efficiently in adversarial first-price auctions, 2020. [HZW20] Y. Han, Z. Zhou, and T. Weissman. Optimal no-regret learning in repeated first-price auctions. Co RR, abs/2003.09795, 2020. [KCPT01] A. Kahn, P. Cramton, R. Porter, and R. Tabors. Uniform pricing or pay-as-bid pricing: a dilemma for california and beyond. The Electricity Journal, 14(6):70 79, 2001. [KN04] I. Kremer and K. G. Nyborg. Underpricing and market power in uniform price auctions. The Review of Financial Studies, 17(3):849 877, 2004. [KN14] Y. Kanoria and H. Nazerzadeh. Dynamic reserve prices for repeated auctions: Learning from bids. In WINE, volume 8877, page 232. Springer, 2014. [Koc07] L. Koczy. A recursive core for partition function form games. Theor. Decis., 63:41 51, 2007. [Koc09] L. Koczy. Sequential coalition formation and the core in the presence of externalities. Game. Econ. Behav., 66(1):559 565, 2009. [KV05] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. 
Journal of Computer and System Sciences, 71(3):291 307, 2005. [LB10] B. Lucier and A. Borodin. Price of anarchy for greedy auctions. In SODA, pages 537 553, 2010. [LMMM11] G. Lopomo, L. Marx, D. Mc Adams, and B. Murray. Carbon allowance auction design: an assessment of options for the united states. Review of Environmental Economics and Policy, 2011. [LMT18] Sanna Laksà , Daniel Marszalec, and Alexander Teytelboym. Epic fail: How belowbid pricing backfires in multiunit auctions. CIRJE F-Series CIRJE-F-1096, CIRJE, Faculty of Economics, University of Tokyo, 2018. [LS17] T. Lattimore and C. Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In AISTATS, volume 54, pages 728 737. PMLR, 4 2017. [LS20] T. Lattimore and C. Szepesvári. Bandit algorithms. Cambridge University Press, 2020. [LW94] Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and computation, 108(2):212 261, 1994. [MA98] Paul F Malvey and Christine M Archibald. Uniform-price auctions: Update of the treasury experience. US Treasury, 1998. [MCP14] S. Magureanu, R. Combes, and A. Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In COLT, volume 35, pages 975 999. PMLR, 2014. [MLS18] J. Mao, R. Leme, and J. Schneider. Contextual pricing for Lipschitz buyers. In Neur IPS, pages 5643 5651, 2018. [MM16] M. Mohri and A. M. Medina. Learning algorithms for second-price auctions with reserve. The Journal of Machine Learning Research, 17(1):2632 2656, 2016. [MR16] J. Morgenstern and T. Roughgarden. Learning simple auctions. In COLT, volume 49, pages 1298 1318. PMLR, 23 26 Jun 2016. [MRH89] E. Maskin, J. Riley, and F. Hahn. Optimal Multi-Unit Auctions, pages 312 335. Oxford University Press, 1989. Reprinted in P. Klemperer, The Economic Theory of Auctions, London: Edward Elgar, 2000. [MRT09] A.J. Mersereau, P. Rusmevichientong, and J.N. Tsitsiklis. A structured multiarmed bandit problem and the greedy policy. IEEE Transactions on Automatic Control, 54(12):2787 2802, 2009. [MT15] E. Markakis and O. Telelis. Uniform price auctions: Equilibria and efficiency. Theory Comput. Syst., 57(3):549 575, 2015. [NCKP22] T. Nedelec, C. Calauzenes, N. El Karoui, and V. Perchet. Learning in repeated auctions. Foundations and Trends in Machine Learning, 15(3):176 334, 2022. [NGW+21] R. Niazadeh, N. Golrezaei, J. Wang, F. Susan, and A. Badanidiyuru. Online learning via offline greedy algorithms: Applications in market design and optimization. In ACM EC, pages 737 738, 2021. [Nis14] N. Nisan. Algorithmic mechanism design through the lens of multi-unit auctions. Handbook of Game Theory 4, 2014. [Pes00] M. Pesendorfer. A study of collusion in first-price auctions. The Review of Economic Studies, 67(3):381 411, 2000. [RM21] Aviv Rosenberg and Yishay Mansour. Stochastic shortest path with adversarially changing costs. ar Xiv:2006.11561, 2021. [RST17] T. Roughgarden, V. Syrgkanis, and E. Tardos. The price of anarchy in auctions. J. Artif. Intell. Res., 59:59 101, 2017. [RT10] P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395 411, 2010. [RW16] T. Roughgarden and J. Wang. Minimizing regret with multiple reserves. In ACM EC, pages 601 616, 2016. [SBLS04] Y. S. Son, R. Baldick, K-H. Lee, and S. Siddiqi. Short-term electricity market auction game analysis: uniform and pay-as-bid pricing. IEEE Transactions on Power Systems, 19(4):1990 1998, 2004. [Sek17] S. Sekar. Posted pricing sans discrimination. 
In IJCAI, pages 388 394, 2017. [Sha67] L. S. Shapley. On balanced sets and cores. Nav. Res. Logistics Quart., 14:453 460, 1967. [SMR+13] S.Brânzei, T. P. Michalak, T. Rahwan, K. Larson, and N. R. Jennings. Matchings with externalities and attitudes. In AAMAS, pages 295 302. IFAAMAS, 2013. [SS69] L. Shapley and M. Shubik. On market games. Journal of Economic Theory, 1(1):9 25, 1969. [SS17] R. Schmalensee and R. N. Stavins. Lessons learned from three decades of experience with cap and trade. Review of Environmental Economics and Policy, 11, 2017. [TSM08] S. Tierney, T. Schatzki, and R. Mukerji. Uniform-pricing versus pay-as-bid in wholesale electricity markets: Does it make a difference? New York ISO, 2008. [Tsy09] A. Tsybakov. Introduction to Nonparametric Estimation. Springer-Verlag, 2009. [TW03] E. Takimoto and M. Warmuth. Path kernels and multiplicative updates. The Journal of Machine Learning Research, 4:773 818, 2003. [UNK10] T. Uchiya, A. Nakamura, and M. Kudo. Algorithms for adversarial bandit problems with multiple plays. In International Conference on Algorithmic Learning Theory, pages 375 389. Springer, 2010. [vd FH93] N-H. von der Fehr and D. Harbord. Spot market competition in the uk electricity industry. The Economic journal (London), 103(418):531 546, 1993. [VPG20] B. Van Parys and N. Golrezaei. Optimal learning for structured bandits. ar Xiv:2007.07302, 2020. [Wil79] R. Wilson. Auctions of shares. The Quarterly Journal of Economics, 93(4):675 689, 1979. [WJ08] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1 2):1 305, 2008. [WPR16] J. Weed, V. Perchet, and P. Rigollet. Online learning in repeated auctions. In Conference on Learning Theory, pages 1562 1583, 2016. [ZLW19] J. Zimmert, H. Luo, and C-Y. Wei. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In ICML, pages 7683 7692. PMLR, 2019. [ZN13] A. Zimin and G. Neu. Online learning in episodic markovian decision processes by relative entropy policy search. Neur IPS, 26, 2013. Roadmap to the appendix Appendix A includes the proofs omitted from Section 3 (Offline Setting). Appendix B includes the proofs omitted from Section 4 (Online Setting). Appendix C includes the equilibrium analysis. A few results from prior work that we invoke are in Appendix D. A Appendix: Offline Setting In this section we include the proofs omitted from the main text for the offline setting. Observation 1 (restated). Player i has an optimum bid vector β = (β1, . . . , βK) DK with βj Si for all j [K]. Proof. Let c = (c1, . . . , c K) be an arbitrary optimum strategy for player i. Suppose c SK i . We use c to construct an optimum bid vector β SK i as follows. For each j [K]: If cj Si, then set βj = cj. Else, set βj = max y Si | y cj . Then in each round t, player i gets the same allocation when playing β as it does when playing c and the others play bt i since the ordering of the owners of the bids is the same under (c, bt i) as it is under (β, bt i); moreover, the price weakly decreases in each round t. Thus player i s utility weakly improves. Since c was an optimal strategy for player i, it follows that β is also an optimal strategy for player i and, moreover, β SK i as required. Next we show how to find an optimal bid vector in polynomial time. The proof uses several lemmas, which are proved after the theorem. Theorem 1 (restated, formal). 
Suppose we are given a number n of players, number K of units, valuation vi of player i, discretization level ε > 0, and bid history H i = (b1 i, . . . , b T i) by players other than i. Then an optimum bid vector for player i can be computed in polynomial time in the input parameters. Proof of Theorem 1. Compute the set Si given by equation (1) and the graph Gi from Definition 1. The proof has several steps as follows. Gi is a DAG. All the edges in Gi flow from the source z to the nodes from layer 1, then from the nodes in layer j to those in layer j +1 (i.e. of the form (zr,j, zs,j+1) for all j [K 1] and r, s Si with r s), and finally from all the nodes in layer K to the sink z+. A cycle would require at least one back edge, but such edges do not exist. Bijective map between bid vectors and paths from source to sink in Gi. To each bid vector β = (β1, . . . , βK) SK i , associate the following path in Gi: P(β) = z , zβ1,1, . . . , zβK,K, z+ . We show next the map P is a bijection from the set Si of candidate bid vectors for player i to the set of paths from source to sink in Gi. Consider arbitrary bid vector β = (β1, . . . , βK) SK i . By definition of a bid vector, we have β1 . . . βK . Then P(β) = (z , zβ1,1, . . . , zβK,K, z+) is a valid path in Gi. Conversely, since Gi has an edge (zr,j, zs,j+1) if and only if r s, each path from source to sink in Gi has the form Q = (z , zβ1,1, . . . , zβK,K, z+) for some numbers β1, . . . , βK Si, and so it can be mapped to bid vector β = (β1, . . . , βK). Thus Q = P(β), and so P is a bijective map. Utility of player i and weight of a path of length K in Gi. Let β = (β1, . . . , βK) SK i . Additionally, let βK+1 = 0. The total utility of player i when bidding β in each round t while the others bid bt i is Ui(β) = PT t=1 ui(ht), where ht = (β, bt i) t [T] . Let P(β) = (z , zβ1,1, . . . , zβK,K, z+) be the path in Gi corresponding to β. The edges of the path P(β) are (z , zβ1,1), (zβ1,1, zβ2,2), . . . , (zβK 1,K 1, zβK,K), (zβK,K, z+), while the weight of the path is equal to the sum of its edges. Summing equation (2), which gives the weight of an edge, across all edges of P(β) implies that the weight of path P(β) is equal to h 1{xi(ht) j}(vi,j βj) + j 1{xi(ht)>j}(βj βj+1) + 1{xi(ht)=j}(βj p(ht)) i . We claim that Ui(β) = w(P(β)). The high level idea is to rewrite the utility to spread it across the edges of the path corresponding to bid profile β. Towards this end, recall the utility of player i at strategy profile ht = (β, bt i) is j=1 1{xi(ht) j}(vi,j p(ht)) . (8) By Lemma 1, equation (8) is equivalent to h 1{xi(ht) j}(vi,j βj) + j 1{xi(ht)>j}(βj βj+1) + 1{xi(ht)=j}(βj p(ht)) i . Summing ui(ht) over all rounds t gives h 1{xi(ht) j}(vi,j βj) + j 1{xi(ht)>j}(βj βj+1) + 1{xi(ht)=j}(βj p(ht)) i (By equation (9)) h 1{xi(ht) j}(vi,j βj) + j 1{xi(ht)>j}(βj βj+1) + 1{xi(ht)=j}(βj p(ht)) i (Swapping the order of summation) = w(P(β)) . (By equation (7)) Thus the weight of the path P(β) is equal to the utility of player i from bidding β. The implication is that to find an optimum bid vector, it suffices to compute a maximum weight path in Gi. Computing a maximum weight path in Gi. The graph Gi is a DAG with a number of vertices of Gi that is polynomial in the input parameters. By Lemma 2, the edge weights of Gi can be computed in polynomial time. A maximum weight path in a DAG can be computed in polynomial time by changing every weight to its negation to obtain a graph G. Since G has no cycles, the graph G has no negative cycles. 
Thus running a shortest path algorithm on G will yield a longest (i.e. maximum weight) path on G in polynomial time, as required. The unique bid vector corresponding to the maximum weight path found can then be recovered in time O(K) and it represents an optimum bid vector for player i. Lemma 1. In the setting of Theorem 1, for each bid profile β SK i and round t [T], define ht = (β, bt i). Then we have h 1{xi(ht) j}(vi,j βj) + j 1{xi(ht)>j}(βj βj+1) + 1{xi(ht)=j}(βj p(ht)) i . Proof. Given bid profile β = (β1, . . . , βK), we also define βK+1 = 0. If xi(ht) = 0 then both sides of equation (10) are zero, so the statement holds. Thus it remains to prove equation (10) when xi(ht) > 0. Let ui,j(ht) = 1{xi(ht) j}(vi,j p(ht)) for each j [K]. The term ui,j(ht) represents the amount of utility obtained from the j-th unit acquired by player i at price p(ht), so ui(ht) = PK j=1 ui,j(ht). We first show that for each j [K]: ui,j(ht) = 1{xi(ht) j}(vi,j βj) + h 1{xi(ht)>k}(βk βk+1) + 1{xi(ht)=k}(βk p(ht)) i . To prove (11), we will rewrite ui,j(ht) by considering three cases and writing a unified expression for all of them. 1. xi(ht) = j. Then ui,j(ht) = vi,j p(ht) = (vi,j βj) + (βj p(ht)) = 1{xi(ht) j}(vi,j βj) + 1{xi(ht)=j}(βj p(ht) = 1{xi(ht) j}(vi,j βj) + h 1{xi(ht)>k}(βk βk+1) + 1{xi(ht)=k}(βk p(ht)) i . (Since 1{xi(ht)>k} = 0 k j and 1{xi(ht)=k} = 1 if and only if k = j.) 2. xi(ht) > j. Then xi(ht) = j + ℓ, for some ℓ {1, . . . , K j}. We have ui,j(ht) = vi,j p(ht) = (vi,j βj) + (βj βj+ℓ) + (βj+ℓ p(ht)) = (vi,j βj) + k=j βk βk+1 + (βj+ℓ p(ht)) . (12) We are in the case where 1{xi(ht) j} = 1, 1{xi(ht)>k} = 1 if and only if k {j, . . . , j + ℓ 1}, and 1{xi(ht)=k} = 1 if and only if k = j + ℓ. Adding indicators to the terms in (12) gives ui,j(ht) = 1{xi(ht) j}(vi,j βj) + h 1{xi(ht)>k}(βk βk+1) + 1{xi(ht)=k}(βk p(ht)) i . 3. xi(ht) < j. Then 1{xi(ht) j} = 0, 1{xi(ht)>k} = 0 for all k {j, . . . , K}, and 1{xi(ht)=k} = 0 for all k {j, . . . , K}. Thus we trivially have the identity required by the lemma statement since ui,j(ht) = 0 and the right hand side of (11) is zero as well. Thus equation (11) holds in all three cases as required. For each i [n] and j [K], define h 1{xi(ht)>k}(βk βk+1) + 1{xi(ht)=k}(βk p(ht)) i . (13) Then equation (11) is equivalent to ui,j(ht) = 1{xi(ht) j}(vi,j βj) + Ai,j, (14) Summing equation (13) over all j [K] gives h 1{xi(ht)>k}(βk βk+1) + 1{xi(ht)=k}(βk p(ht)) i (15) h 1{xi(ht)>k}(βk βk+1) + 1{xi(ht)=k}(βk p(ht)) i (16) k=1 (βk βk+1) j=1 1{xi(ht)>k} + k=1 (βk p(ht)) j=1 1{xi(ht)=k} (17) k=1 (βk βk+1) k 1{xi(ht)>k} + k=1 (βk p(ht)) k 1{xi(ht)=k} (18) where equation (a) holds because we change the order of the double summations and in the double summations, we consider any (integer) pair of (j, k) such that 1 k j K. Then we can rewrite the utility ui(ht) as j=1 ui,j(ht) 1{xi(ht) j}(vi,j βj) + Ai,j (By equation (11)) j=1 1{xi(ht) j}(vi,j βj) + j=1 (βj βj+1) j 1{xi(ht)>j} + k=1 (βj p(ht)) j 1{xi(ht)=j} (By equation (18)) h 1{xi(ht) j}(vi,j βj) + j 1{xi(ht)>j}(βj βj+1) + 1{xi(ht)=j}(βj p(ht)) i , as required by the lemma statement. Lemma 2. In the setting of Theorem 1, the edge weights of the graph Gi can be computed in polynomial time. Proof. Recall the graph Gi is constructed given as parameters the number n of players, the number of units K, a player i with valuation vi, discretization level ε > 0, and a bid history H i = (b1 i, . . . , b T i) by players other than i. 
Thus the goal is to show the edge weights of Gi can be computed in polynomial time in the bit length of these parameters. Towards this end, we will show there exist efficiently computable values It >j, It j, It j {0, 1} and qt Si such that the weight we of edge e = (zr,j, zs,j+1) is equal to t=1 It j (vi,j r) + j It >j (r s) + It j r qt) . (19) Roughly, qt will correspond to the price and It >j, It j, It j will tell whether player i gets more than j units, at least j units, or exactly j units, respectively, at some profile β = (β1, . . . , βK) Si with βj = r and βj+1 = s. The choice of β in will not matter, as long as βj = r and βj+1 = s. We prove equation (19) in several steps: Step (i). For each t [T], define Γt : R R, where Γt(x) = n (ℓ, j) | bt ℓ,j > x and ℓ [n] \ {i}, j [K] or bt ℓ,j = x and ℓ [n], ℓ< i, j [K] o . Thus Γt(x) counts the bids in profile bt i that would have priority to a bid of value x submitted by player i (i.e. it counts bids strictly higher than x and submitted by players other than i, as well as bids equal to x but submitted by players lexicographically before i). Step (ii). Recall the edge is denoted e = (zr,j, zs,j+1). Let β Si be an arbitrary bid profile with βj = r and βj = s. Also define βK+1 = 0. If Γt(s) < K j, then at ht = (β, bt i) player i receives one unit for each of the bids β1, . . . , βj+1, since there are at most K j 1 bids of other players that have higher priority than player i s highest j + 1 bids. Else, player i does not get more than j units. Thus It >j = 1{xi(ht)>j}. Step (iii). If Γt(r) K j, then player i receives at least j units at (β, bt i). The corresponding indicator is It j = 1{Γt(s) K j} = 1{xi(ht) j} . Step (iv). If Γt(r) K j and Γt(s) > K j, then player i receives exactly j units at (β, bt i). The indicator is It j = 1{Γt(r) K j} 1{Γt(s)>K j} = 1{xi(ht)=j} . Now suppose It j = 1. Then we show the price qt = p(ht) can be computed precisely without knowing the whole bid β. Consider the multiset of bids B = bt i {β1, . . . , βj+1}, recalling βK+1 was defined as zero. Sort B in descending order. We know the top j 1 bids of player i are winning, so the price is not determined by any of them. Thus the price is determined by βj = r, βj+1 = s, or by one of the bids in bt i. Remove elements β1, . . . , βj 1 from B and set the price as follows: Case (iv.a). For (K + 1)-st price auction: set qt to the (K + 2 j)th highest value in B. Case (iv.b). For K-th price auction: set qt to the (K + 1 j)th highest value in B. Step (v). Combining steps (i-iv), the weight of edge e = (zr,j, zs,j+1) can be rewritten as t=1 1{xi(ht) j} (vi,j r) + j h 1{xi(ht)>j} (r s) + 1{xi(ht)=j} r p(ht) i (By Definition 1) t=1 It j (vi,j r) + j h It >j (r s) + It j r qt) i . (By steps (i-iv)) Thus the weight of each edge can be computed in polynomial time as required. B Appendix: Online Setting In this section we include the material omitted from the main text for the online setting. Before studying the full information and bandit feedback models in greater detail, we replace the set Si of candidate bids for player i from equation (1) of the offline section by a coarser set Sε = {ε, 2ε, , vi,1/ε ε} . The regret, which was defined in equation (3), depends on whether the model is the K-th or (K +1)- st price auction; however the next lemma holds for both variants of the auction. Lemma 3. For all ε > 0, let Regi(πi, HT i, ε) be the counterpart of (3) where Si is replaced by the set Sε. Then Regi(πi, HT i) Regi(πi, HT i, ε) + TKε. Proof. 
First observe that it is not beneficial to bid above $v_{i,1}$, so we can assume without loss of generality that the maximizer $\beta$ in (3) satisfies $\beta_j \le v_{i,1}$ for all $j \in [K]$. Now we convert $\beta$ into another bid vector $\beta^\varepsilon \in S_\varepsilon^K$ as follows: for each $j \in [K]$, let $\beta^\varepsilon_j = \min\{y \in S_\varepsilon \mid y \ge \beta_j\}$. Clearly $\beta_j \le \beta^\varepsilon_j \le \beta_j + \varepsilon$. Then we claim that the next inequalities hold:
$p(\beta^\varepsilon, b^t_{-i}) \le p(\beta, b^t_{-i}) + \varepsilon$; (20)
$\mathbf{1}\{x_i(\beta^\varepsilon, b^t_{-i}) \ge j\} \ge \mathbf{1}\{x_i(\beta, b^t_{-i}) \ge j\}$. (21)
Inequality (20) follows since $p(b)$ is either the $K$-th or the $(K+1)$-st largest element of $b$, and increasing each entry of $b$ by at most $\varepsilon$ can only increase $p(b)$ by no more than $\varepsilon$. Inequality (21) is due to the fact that bidding a higher price can only help to win the unit. Consequently, we have
\[
\sum_{t=1}^{T}\sum_{j=1}^{K} \big(v_{i,j} - p(\beta, b^t_{-i})\big)\,\mathbf{1}\{x_i(\beta, b^t_{-i}) \ge j\}
\le \sum_{t=1}^{T}\sum_{j=1}^{K} \big(v_{i,j} - p(\beta, b^t_{-i})\big)\,\mathbf{1}\{x_i(\beta^\varepsilon, b^t_{-i}) \ge j\}
\le \sum_{t=1}^{T}\sum_{j=1}^{K} \big(v_{i,j} - p(\beta^\varepsilon, b^t_{-i}) + \varepsilon\big)\,\mathbf{1}\{x_i(\beta^\varepsilon, b^t_{-i}) \ge j\}
\le \sum_{t=1}^{T}\sum_{j=1}^{K} \big(v_{i,j} - p(\beta^\varepsilon, b^t_{-i})\big)\,\mathbf{1}\{x_i(\beta^\varepsilon, b^t_{-i}) \ge j\} + TK\varepsilon .
\]
This gives the desired statement of the lemma.

B.1 Full Information Feedback

In this section we include the omitted details for the full information setting. We begin with the formal definition of the graph $G_t = (V, E, w_t)$ used by the online learning algorithm with full information feedback.

Definition 2 (The graph $G_t$). Given the valuation $v_i$ of player $i$, the bid profile $b^t_{-i}$ of the other players at round $t$, and $\varepsilon > 0$, construct a graph $G_t = (V, E, w_t)$ as follows.
Vertices. Create a vertex $z_{s,j}$ for each $s \in S_\varepsilon$ and index $j \in [K]$. We say vertex $z_{s,j}$ is in layer $j$. Add a source $z_-$ and a sink $z_+$.
Edges. For each index $j \in [K-1]$ and pair of bids $r, s \in S_\varepsilon$ with $r \ge s$, create a directed edge from vertex $z_{r,j}$ to vertex $z_{s,j+1}$. Moreover, add edges from the source $z_-$ to each node in layer 1 and from each node in layer $K$ to the sink $z_+$.
Edge weights. For each edge $e = (z_{r,j}, z_{s,j+1})$ or $e = (z_{r,K}, z_+)$, let $\beta = (\beta_1, \ldots, \beta_K) \in S_\varepsilon^K$ be a bid vector with $\beta_j = r$ and $\beta_{j+1} = s$ (we define $s = 0$ if $j = K$). Define the weight of edge $e$ as
\[
w_t(e) = \mathbf{1}\{x_i(\beta, b^t_{-i}) \ge j\}\,(v_{i,j} - r) + j\Big[\mathbf{1}\{x_i(\beta, b^t_{-i}) > j\}\,(r - s) + \mathbf{1}\{x_i(\beta, b^t_{-i}) = j\}\,\big(r - p(\beta, b^t_{-i})\big)\Big].
\]
The edges leaving the source $z_-$ have weight zero.

The next observation follows immediately from the definition.

Observation 2. The next properties hold: the weight $w_t(e)$ of each edge $e$ of $G_t$ can be computed efficiently (see Lemma 2); there is a bijective map between bid vectors $\beta \in S_\varepsilon^K$ of player $i$ and paths from the source to the sink of $G_t$ (see the proof of Theorem 1).

Next we include the main theorem with its proof for the online learning algorithm under full information feedback.

Theorem 2 (restated). For each player $i$ and time horizon $T$, under full-information feedback, Algorithm 2 runs in time $O(T^2)$ and guarantees the player's regret is at most $O\big(v_{i,1}\sqrt{TK^3 \log T}\big)$.

Proof. The detailed algorithm description is presented as Algorithm 2. At a high level, the proof has three parts: (i) showing that Algorithm 2 is a correct implementation of the Hedge algorithm where each expert is a path from source to sink in the DAG $G_t$ for time $t$, (ii) bounding its regret for an appropriate choice of the learning rate, and (iii) bounding the runtime.

Step I: Algorithm 2 is a correct implementation of Hedge. We show that Algorithm 2 is the same as the Hedge algorithm in which every path in the DAG plays the role of an expert. To do so, for each vertex $u$ in the DAG, let $\phi^t(u, \cdot)$ be a probability distribution over the out-neighbors of $u$.
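To illustrate the sampling step just described, here is a minimal sketch (illustrative names, not the paper's Algorithm 2 verbatim) of drawing a bid vector by walking from the source to the sink and choosing each out-neighbor according to $\phi^t(u, \cdot)$; by construction a path is drawn with probability equal to the product of $\phi^t(e)$ over its edges, matching the expression in (22).

```python
import random

def sample_path(phi, source, sink):
    """Sample a source->sink path in the layered DAG.

    phi: dict mapping each vertex u to a list of (neighbor, probability) pairs
    describing the distribution phi_t(u, .) over the out-neighbors of u.
    The probability of returning a particular path equals the product of the
    chosen edge probabilities along it.
    """
    path, u = [source], source
    while u != sink:
        neighbors, probs = zip(*phi[u])
        u = random.choices(neighbors, weights=probs, k=1)[0]
        path.append(u)
    return path

def path_to_bid_vector(path):
    """Read the sampled bid vector off the internal vertices of the path.

    Here each internal vertex is assumed to be encoded as a pair (bid, layer),
    so the bid vector is simply the first coordinates in layer order.
    """
    return [bid for (bid, layer) in path[1:-1]]
```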
Then, given the recursive sampling of the bids based on the probability distribution ϕt s, the probability that a path p is chosen in Algorithm 2 is equal to e p ϕt(e), (22) where ϕt s are updated in equation (5), which is restated below: ϕt(e) = ϕt 1(e) exp η wt 1(e) Γt 1(v) Γt 1(u), for all e = (u, v) E(Gt) . (5 revisited .) On the other hand, in the Hedge algorithm, the path probabilities are updated as follows: Given a learning rate η, define P 1 h(p) = Q e p ϕ1(e), and for t 2, P t h(p) = P t 1 h (p) exp(η P e p wt 1(e)) P q P t 1 h (q) exp(η P e q wt 1(e)) . (23) The subscript h in P t h(p) stands for Hedge. To show that the update rule in Algorithm 2 is equivalent to the one in Hedge, we first prove the following statement: ( ) For any vertex u in the graph, let P(u) be the set of all paths from u to z+. Then Γt 1(u) = X ϕt 1(e) exp(ηwt 1(e)) . (24) We prove equation (24) by induction on u, from the bottom layer to the top layer. If u = z+, by definition Γt 1(z+) = 1, and (24) holds. Now suppose that (24) holds for all u in the (k + 1)-st layer, for some 0 k K. Then if u is in the k-th layer, the recursion of Γt 1 gives that Γt 1(u) = X v:(u,v) E ϕt 1((u, v)) exp(ηwt 1((u, v)) Γt 1(v) v:(u,v) E ϕt 1((u, v)) exp(ηwt 1((u, v))) X ϕt 1(e) exp(ηwt 1(e)) ϕt 1(e) exp(ηwt 1(e)) , and therefore (24) holds. Having shown equation (24), we prove by induction on t our claim that Algorithm 2 is a correct implementation of Hedge, i.e. that P t(p) = P t h(p) for all paths p and rounds t. For t = 1, by definition of our initialization we have P 1 h(p) = P 1(p). Suppose that at time t 1, for every path p we have P t 1 h (p) = P t 1(p) = Q e p ϕt 1(e). Then at time t, it holds that e p ϕt(e) (a) = Y ϕt 1(e) exp η wt 1(e) Γt 1(v) (b) = P t 1 h (p) exp e p wt 1(e) where step (a) is due to equation (5), and step (b) follows using telescoping and the induction hypothesis P t 1 h (p) = Q e p ϕt 1(e). Given equation (23), by applying equation (24) to u {z , z+} and the induction hypothesis P t 1 h (p) = Q e p ϕt 1(e), the induction is complete. Thus Algorithm 2 is a correct implementation of Hedge. Step II: Regret upper bound. Let Applying Lemma 3 yields Regi(πi, HT i) Regi(πi, HT i, ε) + TKε = Regi(πi, HT i, ε) + vi,1 where Regi(πi, HT i, ε) is the counterpart of (3) where Si is replaced by the set Sε = {ε, 2ε, , vi,1/ε ε} . We also claim that P e p wt(e) Kvi,1 for each path p from source to sink in Gt. To see this, let β be the bid vector corresponding to path p. As shown in Lemma 1 and using the fact that edges outgoing from z have weight zero, we have ui(β, bt i) = P e p wt(e). Since player i s utility satisfies ui(β, bt i) Vi(xi(β, bt i)) Kvi,1, we obtain P e p wt(e) Kvi,1. By Step I, Algorithm 2 is a correct implementation of Hedge. To bound the regret of the algorithm, we will invoke Corollary 1 which is a slight variant of [CBL06, Theorem 2.2] with the following parameters: N experts, where each expert is a path from source to sink in Gt; learning rate η, time horizon T, and maximum reward L = Kvi,1; initial distribution σ on the experts, where σp = P 1(p) for each expert (path from source to sink) p. By Corollary 1, we obtain Regi(πi, HT i, ε) 1 η max p [N] log 1 Recall Algorithm 2 initially selects a path by starting at the source z and then performing an unbiased random walk in the layered DAG until reaching the sink z+. Since the number of vertices in each layer is vi,1/ε , the initial probability of selecting a particular expert (i.e. 
path from source to sink) p is σp = P 1(p) 1 vi,1/ε K , p [N] . (28) Substituting (28) in (27) yields Regi(πi, HT i, ε) 1 η max p [N] log 1 8 K log ( vi,1/ε ) η + T (Kvi,1)2 η 8 (Since ϵ = vi,1 q T and L = Kvi,1.) η + T (Kvi,1)2 η For η = log T/(vi,1 KT), inequality (29) gives Regi(πi, HT i, ε) K log T log T/(vi,1 KT) + T log T 8(vi,1 KT) (Kvi,1)2 TK3 log T . (30) Combining inequalities (26) and (30), we get that the regret of player i when running Algorithm 2 is Regi(πi, HT i) Regi(πi, HT i, ε) + vi,1 TK3 log T + vi,1 TK3 O vi,1 p TK3 log T . (31) This completes the proof of the regret upper bound. Polynomial time implementation. Finally we analyze the running time of the above algorithm. At every step, the computation of path kernels traverses all edges (note that each edge only appears in the sum once) and could be done in O(|E|) = O(Kv2 i,1/ε2) time. Similarly, the update of edge probabilities ϕt for all edges also takes O(|E|) = O(Kv2 i,1/ε2) time. Therefore, the overall computational complexity is O(TKv2 i,1/ε2) = O(T 2), where we used the choice of ε = vi,1 p K/T. This completes the proof of the theorem. B.2 Bandit Feedback In this section we include the main theorem and proof for the bandit setting. Recall that our algorithm for the bandit setting is the same as Algorithm 2, only with wt(e) replaced by bwt(e): bwt(e) = w(e) w(e) wt(e) pt(e) 1{e pt}, with pt(e) = X p:e p P t(p), with P t given in (22), and vi,1 r + j(r s) if e = (zr,j, zs,j+1), vi,1 r + Kr if e = (zr,K, z+), 0 if e = (z , zr,1). The resolution parameter and learning rate are chosen to be ε = vi,1 min{(K3 log T/T)1/4, 1}, η = min n ε q log(vi,1/ε)/(TK3v4 i,1), 1/(Kvi,1) o . (32) Theorem 3 (restated). For each player i and time horizon T, under the bandit feedback, there is an algorithm for bidding that runs in time O(TK + K 5/4T 7/4) and guarantees the player s regret is at most O(min{vi,1(T 3K7 log T)1/4, vi,1KT}). Proof. By Lemma 3 and the choice of ε in (32), it suffices to show that the above algorithm πi runs in time O(TKv3 i,1/ε3) and achieves Regi(πi, HT i, ε) = O v2 i,1 q TK5 log(vi,1/ε)/ε + vi,1K2 log(vi,1/ε) . The claimed result then follows from the fact that the regret is always upper bounded by O(vi,1KT). Before we proceed to the proof, we first comment on the choice of estimator bwt(e). First of all, this is an unbiased estimator of wt(e), i.e. Ept P t[ bwt(e)] = wt(e) for every edge e in Gt. Second, instead of using the natural importance-weighted estimator bwt(e) = wt(e)1{e pt}/pt(e), the current form in (6) is the loss-based importance-weighted estimator used for technical reasons, similar to [LS20, Eqn. (11.6)]. Third, by exploiting our DAG structure and the definition of wt(e) in (4), we construct an edge-specific quantity w(e) which always upper bounds wt(e). We now analyze the regret of the algorithm with bwt(e) given by (6). The standard EXP3 analysis (see, e.g. [LS20, Chapter 11]) gives that Regi(πi, HT i, ε) 1 η max p log 1 P 1(p) + p P t(p)eη b wt(p) ! p P t(p) bwt(p) (33) where bwt(p) = P e p bwt(e) is the estimated total weight of path p, and the expectation is with respect to the randomness in the estimator bwt(e). For every path p = (z , zr1,1, , zr K,K, z+) from the source to the sink, we have (by convention r K+1 = 0): e p wt(e) X j=1 (vi,1 (j 1)rj + jrj+1) = Kvi,1. 
Since ex 1 + x + x2 whenever x 1 and log(1 + y) y whenever y > 1, if η 1/(Kvi,1), inequality (33) gives Regi(πi, HT i, ε) 1 η max p log 1 P 1(p) + η p P t(p) E[ bwt(p)2] η log lvi,1 p P t(p) (K + 1) X e p E[ bwt(e)2], where the last equality uses (Pn i=1 xi)2 n Pn i=1 x2 i , and that each path p from the source to the sink has length K + 1. To proceed, note that E[ bwt(e)2] = w(e)2 (1 pt(e)) + w(e) w(e) wt(e) = wt(e)2 + (w(e) wt(e))2 1 pt(e) 1 , and therefore X e p E[ bwt(e)2] = X wt(e)2 + (w(e) wt(e))2 1 pt(e) 1 wt(e)2 + (w(e) wt(e))2 1 pt(e) 1 X p:e p P t(p) wt(e)2pt(e) + (w(e) wt(e))2(1 pt(e)) X where in the middle we have used the definition P p:e p P t(p) = pt(e) in (6). To further upper bound the above quantity, note that 1 s r vi,1/ε (vi,1 + (j 1)rε)2 r=1 (V + (j 1)rε)2 r=1 (v2 i,1 + (j 1)2r2ε2) = O K3v4 i,1 ε2 A combination of the above inequalities shows that as long as η 1/(Kvi,1), Regi(πi, HT i, ε) = O ε + ηTK4v4 i,1 ε2 Next, by the choice of η in (32), we have Regi(πi, HT i, ε) = O TK5 log(vi,1/ε) ε + K2vi,1 log vi,1 As for the computational complexity, the only difference from the Algorithm 2 is the additional computation of the estimated weights bwt(e) in (6), or equivalently, the marginal probability pt(e). As P t has a product structure in (22), the celebrated message passing algorithm in graphical models [WJ08, Section 2.5.1] takes O(|E|vi,1/ε) = O(Kv3 i,1/ε3) time to compute all edge marginals {pt(e)}e E. Therefore the overall computational complexity is O(TKv3 i,1/ε3). B.3 Regret Lower Bound In this section we include the theorem and proof for the regret lower bound. The construction uses the (K + 1)-st highest price, but is similar for the K-th highest price. Theorem 4 (restated, formal). Let K 2. For any policy πi used by player i, there exists a bid sequence {bt i}T t=1 for the other players such that the expected regret in (3) satisfies E[Regi(πi, {bt i}T t=1)] cvi,1K T, where c > 0 is an absolute constant. Proof. Without loss of generality assume that K = 2k is an even integer, and by scaling we may assume that vi,1 = 1. Consider the following two scenarios: the utility of the bidder i is vi,j 1, for all j [K]; at scenario 1, for every t [T], the other bidders bids are 3, 0, , 0 with probability 0.5 + δ, 2 3 with probability 0.5 δ, where the number of non-zero entries is k in the first line and 2k in the second line, and δ (0, 1/4) is a parameter to be determined later. The randomness used at different times is independent. at scenario 2, for every t [T], the other bidders bids are 3, 0, , 0 with probability 0.5 δ, 2 3 with probability 0.5 + δ, where the number of non-zero entries is k in the first line and 2k in the second line, and δ (0, 1/4) is a parameter to be determined later. The randomness used at different times is independent. We denote by P and Q the distributions of {bt i}T t=1 under scenarios 1 and 2, respectively, then DKL(P Q) = T DKL(Bern(0.5 + δ) Bern(0.5 δ)) T χ2(Bern(0.5 + δ) Bern(0.5 δ)) (0.5 + δ)(0.5 δ) 64 Consequently, by [Tsy09, Lemma 2.6], 1 TV(P, Q) 1 2 exp ( DKL(P Q)) 1 2 exp 64Tδ2 Next we investigate the separation between these two scenarios. It is clear that max bi EP ui(bi; bt i) EP ui((1, 1, , 1, 0, , 0); bt i) 2 + δ k + 1 max bi EQ ui(bi; bt i) EQ ui((1, 1, , 1, 1, , 1); bt i) Moreover, under (P + Q)/2 (i.e. bt i follows a Bern(1/2) distribution), suppose that the vector bi has k components smaller than 2/3. 
Distinguish into two scenarios: if k < k, then E(P +Q)/2 ui(bi; bt i) 1 if k k, then E(P +Q)/2 ui(bi; bt i) 1 2 (2k k ) + 1 Therefore, it always holds that max bi E(P +Q)/2 ui(bi; bt i) 2k Consequently, for each bi, max b i EP ui(b i ; bt i) ui(bi; bt i) + max b i EQ ui(b i ; bt i) ui(bi; bt i) max b i EP ui(b i ; bt i) + max b i EQ ui(b i ; bt i) 2 max bi E(P +Q)/2 ui(bi; bt i) In other words, any bid vector bi either incurs a total regret (δk T)/3 under P, or incurs a total regret (δk T)/3 under Q. Now the classical two-point method (see, e.g. [Tsy09, Theorem 2.2]) gives E(P +Q)/2[Regi(πi, {bt i}T t=1)] δk T 3 (1 TV(P, Q)) δk T 6 exp 64Tδ2 for every δ (0, 1/4). Choosing δ = 1/(8 T) gives that E(P +Q)/2[Regi(πi, {bt i}T t=1)] K i.e. the claimed lower bound holds with c = 1/(96e1/3). C Appendix: Equilibrium Analysis In this section we show that the pure Nash equilibria with price zero of the (K + 1)-st auction are the only ones that are robust to deviations by groups of players, captured through the notion of the core in the game among the bidders. The core of a game was formulated by Edgeworth [Edg81] and brought into game theory by Gillies [Gil59]. There is an extensive body of literature on the core of various games, including for auctions; see, e.g., analysis of collusion (cartel behavior) in first price auctions in [Pes00]. Our focus here is the core of the game among the bidders, where the auctioneer first sets the auction format and then the bidders can strategize and collude among themselves, without the auctioneer. Roughly speaking, a strategy profile is core-stable if no group C of players can deviate simultaneously (i.e. each player i C deviates to some alternative strategy profile), such that each player in C weakly improves and the improvement is strict for at least one player. For a deviation to take place, the players in C coordinate and switch their strategies simultaneously, while the players outside C keep their previous strategies. Every core-stable strategy profile is also a Nash equilibrium, since a strategy profile that is stable against deviations by groups of players is also stable against deviations by individuals. We consider two variants of the core, with and without monetary transfers [SS69, Bon63, Sha67]. In the case with transfers, the players can make monetary payments to each other, and so a strategy profile consists of a tuple of bids and transfers. In the case without transfers, the players cannot make such transfers and their strategy is the bid vector; thus the only agreement they can make in this case is to coordinate their bids. C.1 Core with transfers A strategy profile in this setting is described by a tuple (b, t), where b is a bid profile and t is a profile of payments (aka monetary transfers), such that ti,j 0 is the monetary payment of player i to player j. At this strategy profile, the auctioneer runs the auction with bids b and returns the outcome (price and allocation), while the players make the monetary transfers t to each other. For each profile of monetary transfers t, let mi(t) be the net amount of money that player i gets after all the transfers are made: The utility of player i at profile (b, t) is ui(b, t) = mi(t) + Deviations. Since in the case of auctions the actions (e.g. bids) of a group of players can affect the utility of the players outside of the group, it is necessary to model how the players outside S react to the deviation. 
Such reactions have been studied in the literature on the core with externalities (see, e.g., [Koc07, Koc09]). We consider neutral reactions, where non-deviators (i.e. players outside S) have a mild reaction to the deviation: they maintain the same bids as before the deviation and the monetary transfer of each player i [n]\S to each player j S is non-negative. The core where the deviators assume the nondeviators will have neutral reactions to the deviation is known as the neutral core [SMR+13]. Several other variants of the core exist, such as pessimistic core, where each deviator assumes that they will be punished in the worst possible way by the non-deviators. Such variants are also interesting to study, but next we focus on the basic case of neutral reactions. A group of players will alternatively be called a coalition. The set of all players, [n], will also be called sometimes the grand coalition. A group of players that agree on a deviation are known as a blocking coalition, formally defined next. Definition 3 (Blocking coalition, with transfers). Let (b, t) be a tuple of bids and monetary transfers. A group S [n] of players is a blocking coalition if there exists a profile (eb,et), at which each player i S weakly improves their utility, the improvement is strict for at least one player in S, and ebi,j = bi,j if i [n] \ S, j [K]. eti,j = 0 if i [n] \ S and j S. In other words, the blocking coalition S needs to agree on their bids and transfers to each other, such that they improve their utility when the players outside S maintain their existing bids but stop payments to players in S. In fact our characterization holds even if the players outside S make any non-negative transfers to the players in S; the case where the transfers are zero is the extreme case. If a coalition S deviates with zero transfers from players outside S, it also deviates for any transfers that are non-negative. Definition 4 (The core with transfers). The core with transfers consists of profiles (b, t) at which there are no blocking coalitions. Such profiles are core-stable. Theorem 6 (Core with transfers; restated). Consider K units and n > K hungry players. The core with transfers of the (K + 1)-st auction can be characterized as follows: Let (b, t) be an arbitrary tuple of bids and transfers that is core stable. Then the allocation x(b) maximizes social welfare, the price is zero (i.e. p(b) = 0), and there are no transfers between the players (i.e. t = 0). Proof. The proof has three steps as follows. Step I: core transfers are zero. Assume towards a contradiction that (b, t) is core stable and t = 0. Consider the directed weighted graph G = ([n], E, t), where E consists of all the directed edges (i, j) and the weight of each edge is ti,j. The net amount of money that each player i gets from transfers is mi(t) = Pn j=1 tj,i Pn j=1 ti,j. If there is a cycle C = (i1, . . . , ik) such that the payments along the cycle are strictly positive: ti1,i2 > 0, . . . , tik 1,ik > 0, and tik,i1 > 0, then some cancellations take place. That is, by subtracting min{ti1,i2, . . . , tik 1,ik, tik,i1} from the weight of each edge (i, j) C, we obtain a weighted directed graph without cycle C and where each player has the same net amount of money as in the original graph. Iterating the operation of removing cycles, we obtain a directed acyclic graph where the players have the same net amount of money as in the original graph. Thus we can in fact assume the transfers t are such that G is acyclic. 
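As an illustration of the cancellation step above, the following is a minimal sketch (hypothetical data layout: transfers stored as a dictionary keyed by ordered pairs of players) that repeatedly finds a cycle of strictly positive transfers and subtracts the minimum transfer along it. Each iteration zeroes out at least one edge, so the procedure terminates with an acyclic positive-transfer graph, and every player's net amount $m_i(t)$ is unchanged throughout.

```python
def cancel_cycles(t):
    """Remove cycles of strictly positive transfers.

    t: dict mapping (i, j) to the transfer t_{i,j} >= 0 from player i to player j.
    Returns an equivalent transfer profile (same net amount for every player)
    whose positive-transfer graph is acyclic.
    """
    t = {e: w for e, w in t.items() if w > 0}

    def find_cycle():
        graph = {}
        for (i, j) in t:
            graph.setdefault(i, []).append(j)
        color, parent = {}, {}

        def dfs(u):
            color[u] = "gray"
            for v in graph.get(u, []):
                if color.get(v) == "gray":          # back edge: cycle v -> ... -> u -> v
                    cycle, w = [u], u
                    while w != v:
                        w = parent[w]
                        cycle.append(w)
                    return list(reversed(cycle))
                if v not in color:
                    parent[v] = u
                    found = dfs(v)
                    if found:
                        return found
            color[u] = "black"
            return None

        for start in list(graph):
            if start not in color:
                found = dfs(start)
                if found:
                    return found
        return None

    while True:
        cycle = find_cycle()
        if cycle is None:
            return t
        edges = [(cycle[k], cycle[(k + 1) % len(cycle)]) for k in range(len(cycle))]
        delta = min(t[e] for e in edges)
        for e in edges:
            t[e] -= delta                            # net in/out of every node drops equally
            if t[e] == 0:
                del t[e]
```

Running this on the transfer profile yields an equivalent acyclic profile, after which the topological-ordering argument below applies.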
Consider a topological ordering (i1, . . . , in) of the vertices of G. Let j be the minimum index for which vertex ij has no incoming edges and at least one outgoing edge. Then mij(t) < 0. The utility of player ij can be upper bounded as follows: uij(b, t) = mij(t) + p(b) xij(b) < p(b) xij(b) . We claim that player ij has an improving deviation by keeping its bid vector bij and stopping all payments to other players. Since the other players have neutral reactions to the deviations, we obtain an outcome (eb,et) such that eb = b, etij,k = 0 for all k [n], and etk,ij tk,ij for all k = ij. The gain of player ij from the deviation can be bounded by: uij(eb,et) uij(b, t) = uij(b,et) uij(b, t) (Since eb = b.) p(b) xij(b) p(b) xij(b) = mij(et) mij(t) = mij(t) (Since mij(et) = 0.) > 0 . (Since mij(t) < 0 by choice of ij.) Thus player ij has a strictly improving deviation, which contradicts the choice of (b, t) as corestable with t = 0. Thus the assumption must have been false. It follows that the only core stable profiles (if any) have t = 0. Step II: the social welfare is maximized. Next we show that if (b, t) is a core-stable outcome, then the allocation induced by b is welfare maximizing. Assume towards a contradiction this is not the case. By Step I, we have t = 0. Let w1 . . . wn K be the bids sorted in decreasing order (breaking ties lexicographically) at the truth-telling bid profile v. For each j [n K], let πj be the player that submitted bid wj in this ordering. Let ew1 . . . ewn K be the bids sorted in decreasing order (breaking ties lexicographically) at the bid profile b. Let eπj be the player that submitted bid ewj in this ordering. Consider an undirected bipartite graph G = (L, R, E), where L is the left part, R the right part, and E the set of edges. Define L = {(i, j) | i [n], j [K], and xi(v) j}. For example, if xi(v) = 2, then L has nodes (i, 1) and (i, 2). If on the other hand xi(v) = 0, then L has no nodes (i, j), for any j. R = {(i, j) | i [n], j [K], and xi(b) j}. E = (s1, s2), for all s1 L and s2 R. Since both allocations x(v) and x(b) allocate exactly K units, we have |L| = |R| = K. Consider now a graph G1 = (L1, R1, E1) obtained from G as follows. Set G = G1. Then for each node (i, j) L: if the node also appears in R, then delete both copies of the node, together with any edges containing them. Thus in G1, the left side L1 consists of nodes (i, j) with xi(v) j but xi(b) < j. For each such node (i, j), let k be the rank of the valuation vi,j in the ordering π. By definition of G1, we have that R1 does not have any node of the form (i, s) for any s. To see this, observe that nodes (i, s) with s < j were deleted when constructing G1 from G, and nodes (i, s) with s j do not exist even in G (if they did, then (i, j) would exist in both L and R and so would have been deleted when constructing G1). Let d(i, j) be the player that displaces player i s bid for the j-th unit at the bid profile b. Formally, d(i, j) is the owner of bid ewk when considering the bids b in descending order. Let ω(i, j) N be such that player d(i, j) obtains an ω(i, j)-th unit in their bundle at b. Since xi(b) < j, we have i = d(i, j). Since at the truth-telling profile player i gets a j-th unit but player d(i, j) does not get an ω(i, j)-th unit, we have vi,j vd(i,j),ω(i,j). Moreover, since the allocation x(v) maximizes welfare but x(b) does not, the inequality is strict for some (i, j) L1. We claim that [n] is a blocking coalition. 
To show this, we will argue there is a bid profile b and vector of transfers t such that at (b , t ) the utility of each player i [n] is weakly improved and the improvement is strict for at least one player. For each i [n], k [K], let b i,j = vi,j if xi(v) j 0 otherwise. Then x(b ) = x(v) and p(b ) = 0. Also define monetary transfers t as follows: Initialize t = 0. Let ε = min(i,j) L1 vi,j vd(i,j),ω(i,j) /2. Then vi,j vd(i,j),ω(i,j) + ε for each (i, j) L1. For each (i, j) L1, let t i,d(i,j) := t i,d(i,j) + vd(i,j),ω(i,j) + ε. Thus each player i that got a j-th unit in their bundle at the truth-telling profile did so because their bid for the j-th unit had rank k K. Since i does not get the j-th unit at bid profile b, there is a player d(i, j) whose bid for the j-th unit had rank k and who received this way a ω(i, j)-th unit in their bundle. We argue that all the players weakly improve their utility at (b , t ), and the improvement is strict for at least one of them. For each pair (i, j) L1, under the bid profile b player i receives the j-th unit at a cost of zero and transfers an amount of vd(i,j),ω(i,j) + ε to player d(i, j). The component of the utility that player i gets from unit j, counting the value, price, and transfer related to unit j, is vi,j 0 (vd(i,j),ω(i,j) + ε) 0, where the inequality holds by choice of ε. This is a weak improvement compared to the utility that player i gets from unit j at profile (b, 0), which is zero. Moreover, the improvement is strict for at least one pair (i, j) L1, since the bid profile b does not induce a welfare maximizing allocation. For each pair (i, j) L \ L1, at the profile (b, 0) player i gets utility vi,j p(b) from unit j, since it makes no transfers. At profile b , player i gets unit j at a price of zero and makes no transfers (towards other players) related to unit j. Thus the component of the utility related to unit j is vi,j 0 0 = vi,j. Since p(b) 0, we have vi,j p(b) vi,j, a weak improvement for player i with respect to unit j. For each pair (i, j) L: if (i, j) R, then player i does not get a j-th unit under either b or b , so its utility from unit j is zero at both profiles. If on the other hand (i, j) R, since (i, j) L, it must be that (i, j) R1. At profile (b, t) the utility of player i from unit j is vi,j p(b) since it gets the unit and receives no transfers. At profile (b , t ) player i does not receive the unit but receives a transfer of vi,j + ε from the player that gets the unit instead. This is again a weak improvement. Thus all players weakly improve their utility at (b , t ) and the improvement is strict for at least one player, so the profile (b, 0) is not stable. This is a contradiction, thus the assumption that x(b) is not welfare maximizing must have been false. Step III: the price is zero. Next we show that if profile (b, 0) is core-stable, then p(b) = 0. By Step II, the bid profile b induces a welfare maximizing allocation. Suppose towards a contradiction that p(b) > 0. Then we show there exists a blocking coalition. Let w1 . . . wn K be the bids sorted in decreasing order (breaking ties lexicographically) at bid profile b. For each i [n K], let πi be the owner of bid wi. We match the players as follows. Create a bipartite graph with left part L = (π1, . . . , πK) (allowing repetitions) and right part R = (πK+1, . . . , π2K) (allowing repetitions). For each i [K], create edge (πi, πK+i). Consider the profile (b , t ), where b i,j = vi,j if xi j 0 otherwise. Define the transfers as follows: Initialize t = 0. Set ε = p/(2K). 
For each $i \in [K]$, let player $\pi_i$ pay an additional amount of $\varepsilon$ to player $\pi_{K+i}$: that is, set $t'_{\pi_i, \pi_{K+i}} := t'_{\pi_i, \pi_{K+i}} + \varepsilon$. Then each player $i$ with $x_i(b) > 0$ gets the same allocation at $b'$ as at $b$, but pays a price of zero per unit and transfers a total of at most $p(b)/2$ to other players, resulting in improved utility compared to the utility at profile $(b, 0)$. Each player $i$ with $x_i(b) = 0$ gets the same allocation at $b'$ as at $b$ (i.e. no units), but receives a non-negative amount of money from other players, and the net amount of money received is strictly positive for all the players $\pi_{K+1}, \ldots, \pi_{2K}$. Thus there is an improving deviation, which contradicts the choice of $(b, t)$ as core-stable. The assumption must have been false, and so $p(b) = 0$.

C.2 Core without transfers

A strategy profile in this setting is described by a bid profile $b$, where $b_i = (b_{i,1}, \ldots, b_{i,K})$ is the bid vector of player $i$. In the setting without transfers, given a profile $b$ of bids, a blocking coalition $S$ needs to agree on simultaneously changing their bids such that, when the players outside $S$ still bid according to $b$, the players in $S$ weakly improve their utility and the improvement is strict for at least one player in $S$. The core-stable profiles are those that admit no such blocking coalitions. Formally, we have:

Definition 5 (Blocking coalition, without transfers). Let $b$ be a bid profile. A group $S \subseteq [n]$ of players is a blocking coalition if there exists a bid profile $\tilde b$ at which each player $i \in S$ weakly improves their utility, the improvement is strict for at least one player in $S$, and $\tilde b_{i,j} = b_{i,j}$ for all $i \in [n] \setminus S$, $j \in [K]$.

Definition 6 (The core without transfers). The core without transfers consists of bid profiles $b$ at which there are no blocking coalitions. Such profiles are core-stable.

The non-transferable utility core can be characterized as follows.

Theorem 5 (Core without transfers; restated). Consider $K$ units and $n > K$ hungry players. The core without transfers of the $(K+1)$-st price auction can be characterized as follows: every bid profile $b$ that is core-stable has price zero (i.e. $p(b) = 0$); each allocation $z$ where all the units are allocated can be supported in a core-stable bid profile $b$ with price zero (i.e. $x(b) = z$ and $p(b) = 0$).

Proof. The proof is in two parts.

Part I: the price in the core is zero. Consider a bid profile $b$ that is core-stable. Suppose towards a contradiction that $p(b) > 0$. Let $M = \sum_{\ell_1=1}^{n} \sum_{\ell_2=1}^{K} v_{\ell_1, \ell_2}$. Define a bid profile $\tilde b$ such that for all $i \in [n]$, $j \in [K]$:
$\tilde b_{i,j} = M$ if $j \le x_i(b)$, and $\tilde b_{i,j} = 0$ otherwise. (34)
At $\tilde b$, the players only submit bids equal to $M > 0$ for the units they are supposed to get at allocation $x(b)$, and moreover, there are exactly $K$ strictly positive bids. Thus $x(\tilde b) = x(b)$. Moreover, $p(\tilde b) = 0$ since the $(K+1)$-st highest bid is $0$. Then the grand coalition $C = [n]$ is blocking with the profile $\tilde b$, in contradiction with $b$ being core-stable. Thus the assumption must have been false, so $p(b) = 0$.

Part II: every allocation can be implemented at a core-stable bid profile. Let $z$ be an arbitrary allocation at which all the units are allocated. Define $b$ such that for all $i \in [n]$, $j \in [K]$, we have $b_{i,j} = M$ if $j \le z_i$ and $b_{i,j} = 0$ otherwise. Then $x(b) = z$ and $p(b) = 0$. Let $W = \{i \in [n] \mid z_i > 0\}$ be the set of winners at $b$. Assume towards a contradiction that $b$ is not core-stable. Then there is a blocking coalition $C = \{i_1, \ldots, i_k\} \subseteq [n]$ with an alternative bid profile $d = (d_{i_1}, \ldots, d_{i_k})$. Denote by $\tilde b = (d, b_{-C})$ the profile where each player $i \in C$ bids $\tilde b_i = d_i$ and each player $i \notin C$ bids $\tilde b_i = b_i$.
We must have $u_i(\tilde b) \ge u_i(b)$ for all $i \in C$, with strict inequality for some player $i \in C$. If $C \cap W = \emptyset$, then the only way for at least one of the players in $C$ to change their allocation is to make some of their bids at least $M$. But this increases the price from zero to at least $M$, which yields negative utility for every player that receives a unit; in particular, the deviation cannot weakly improve every member of $C$. Thus $C \cap W \neq \emptyset$. We consider two cases:

1. Case $p(\tilde b) > 0$. Each player $i \in C \cap W$ requires strictly more units at $\tilde b$ than at $b$, to compensate for the higher price at $\tilde b$. Thus $x_i(\tilde b) > z_i$ for all $i \in C \cap W$ $(\star)$. If $x_i(\tilde b) < z_i$ for some player $i \in W \setminus C$, there would have to exist at least $K+1$ bids with value at least $M$, and so $p(\tilde b) \ge M$, which would give negative utility to all the players that receive units, including the deviators. Thus $x_i(\tilde b) \ge z_i$ for all $i \in W \setminus C$ $(\star\star)$. Combining $(\star)$ and $(\star\star)$ gives a contradiction:
\[
\sum_{i \in C \cup (W \setminus C)} x_i(\tilde b) > \sum_{i \in C \cup (W \setminus C)} z_i = K .
\]
Thus $p(\tilde b) > 0$ cannot hold.

2. Case $p(\tilde b) = 0$. Then each player $i \in C \cap W$ requires $x_i(\tilde b) \ge z_i$, since at price zero their utility at $\tilde b$ is $V_i(x_i(\tilde b))$, which must be at least $V_i(z_i)$. Since $p(\tilde b) = 0$, the top $K$ bids are strictly positive and the remaining bids are zero. Then each player $i \in C \cap W$ submits exactly $z_i$ strictly positive bids. It follows that $x_i(\tilde b) = x_i(b)$ for each $i \in [n]$ and $p(\tilde b) = p(b)$, which means no player in $C$ strictly improves. Thus $C$ cannot be blocking.

In both cases 1 and 2 we obtained a contradiction, so $b$ is core-stable, $p(b) = 0$, and $x(b) = z$ as required.

D Theorems from prior work

In this section we include the theorem from [CBL06] that we use. In the problem of prediction with expert advice, there are $N$ experts in total, and at each round $t \in [T]$: the learner chooses a probability distribution $p_t$ over $[N]$; then nature reveals the losses $\{\ell_{t,i}\}_{i \in [N]}$ of all experts at time $t$, where $\ell_{t,i} \in [0, L]$. For a given sequence of probability distributions $(p_1, \ldots, p_T)$, the learner's regret is defined to be
\[
\mathrm{Reg}(T, (p_1, \ldots, p_T)) = \sum_{t=1}^{T} \sum_{i=1}^{N} p_t(i)\, \ell_{t,i} - \min_{i \in [N]} \sum_{t=1}^{T} \ell_{t,i} .
\]
For this problem, the exponentially weighted average forecaster, also known as the Hedge algorithm, is defined as follows. The algorithm initializes $p_1 = \sigma$, an arbitrary prior distribution over the experts $[N]$ such that each expert $i$ is selected with probability $\sigma_i > 0$. For $t \ge 2$, the forecaster updates
\[
p_t(i) = \frac{p_{t-1}(i)\, \exp(-\eta\, \ell_{t-1,i})}{\sum_{j=1}^{N} p_{t-1}(j)\, \exp(-\eta\, \ell_{t-1,j})}, \qquad i \in [N],
\]
where $\eta > 0$ is a learning rate. Next we include the statement of Theorem 2.2 of [CBL06] in our notation.

Theorem 7 (Theorem 2.2 of [CBL06]). Consider the exponentially weighted average forecaster with $N$ experts, learning rate $\eta > 0$, time horizon $T$, and losses in $[0, 1]$. Suppose the initial distribution $\sigma$ on the experts is uniform, that is, $\sigma = (1/N, \ldots, 1/N)$. The regret of the forecaster is
\[
\mathrm{Reg}(T, (p_1, \ldots, p_T)) \le \frac{\log N}{\eta} + \frac{\eta T}{8} .
\]

The following corollary is a well-known variant of the above theorem that allows losses in an interval $[0, L]$ and an arbitrary initial distribution $\sigma$ over the experts.

Corollary 1. Consider the exponentially weighted average forecaster with $N$ experts, learning rate $\eta > 0$, time horizon $T$, and losses in $[0, L]$. Suppose the initial distribution on the experts is $\sigma$. The regret of the forecaster is
\[
\mathrm{Reg}(T, (p_1, \ldots, p_T)) \le \frac{1}{\eta} \max_{i \in [N]} \log \frac{1}{\sigma_i} + \frac{\eta T L^2}{8} .
\]

The proof of Corollary 1 is identical to that of Theorem 7, except for the following two differences. First, instead of $W_t = \sum_{i=1}^{N} \exp(-\eta \sum_{s \le t} \ell_{s,i})$, we define $W_t = \sum_{i=1}^{N} \sigma_i \exp(-\eta \sum_{s \le t} \ell_{s,i})$. In this way $W_0 = 1$, and
\[
\log W_T \ge -\eta \min_{i \in [N]} \sum_{t=1}^{T} \ell_{t,i} - \max_{i \in [N]} \log \frac{1}{\sigma_i} .
\]
Second, when applying [CBL06, Lemma 2.2], we use $[0, L]$ instead of $[0, 1]$ as the range of the losses.
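For completeness, here is a minimal sketch (generic names, not tied to the paper's algorithms) of the forecaster analyzed in Corollary 1: it maintains one weight per expert, initialized from the prior $\sigma$, and multiplies the weight of expert $i$ by $\exp(-\eta\,\ell_{t,i})$ after each round, with losses assumed to lie in $[0, L]$.

```python
import math

def hedge(losses, sigma, eta):
    """Exponentially weighted average forecaster with prior sigma.

    losses: T x N list of lists, losses[t][i] in [0, L] is the loss of expert i
    at round t (revealed after that round's distribution is chosen).
    sigma: prior distribution over the N experts, with sigma[i] > 0.
    eta: learning rate.
    Returns the sequence of distributions p_1, ..., p_T that were played.
    """
    n = len(sigma)
    log_w = [math.log(s) for s in sigma]     # log-weights, so W_0 = sum(sigma) = 1
    played = []
    for round_losses in losses:
        m = max(log_w)
        weights = [math.exp(lw - m) for lw in log_w]   # shift for numerical stability
        total = sum(weights)
        played.append([w / total for w in weights])
        for i in range(n):
            log_w[i] -= eta * round_losses[i]          # multiply weight by exp(-eta * loss)
    return played

if __name__ == "__main__":
    # Toy run: 2 experts, uniform prior, constant losses in [0, 1].
    dists = hedge(losses=[[0.3, 0.7]] * 5, sigma=[0.5, 0.5], eta=0.1)
    print(dists[-1])   # mass shifts toward the lower-loss expert
```

Note that Algorithm 2 never enumerates the exponentially many path experts explicitly; the sketch above is the generic per-expert version to which Corollary 1 applies, while the algorithm realizes the same updates implicitly through the edge-level quantities on the DAG.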