# machine_learning_in_an_auction_environment__82fb4c07.pdf Journal of Machine Learning Research 17 (2016) 1-37 Submitted 3/15; Revised 1/16; Published 11/16 Machine Learning in an Auction Environment Patrick Hummel phummel@google.com 1600 Amphitheatre Parkway Google Inc. Mountain View, CA 94043, USA R. Preston Mc Afee preston@mcafee.cc One Microsoft Way Microsoft Corp. Redmond, WA 98052, USA Editor: Shie Mannor We consider a model of repeated online auctions in which an ad with an uncertain clickthrough rate faces a random distribution of competing bids in each auction and there is discounting of payoffs. We formulate the optimal solution to this explore/exploit problem as a dynamic programming problem and show that efficiency is maximized by making a bid for each advertiser equal to the advertiser s expected value for the advertising opportunity plus a term proportional to the variance in this value divided by the number of impressions the advertiser has received thus far. We then use this result to illustrate that the value of incorporating active exploration in an auction environment is exceedingly small. Keywords: Auctions, Explore/exploit, Machine learning, Online advertising 1. Introduction In standard Internet auctions in which bidders bid by specifying how much they are willing to pay per click, it is standard to rank the advertisers by a product of their bid and their click-through rate, or their expected cost-per-1000-impressions (e CPM) bids. While this is a sensible way to determine the best ad to show for a particular query, it is potentially a suboptimal approach if one cares about showing the best possible ads in the long run. In online auctions, new ads are constantly entering the system, and for these ads one will typically have uncertainty in the true e CPM of the ad due to the fact that one will not know the click-through rate of a brand new ad with certainty. In this case, it can be desirable to show an ad where one has a high amount of uncertainty about the true e CPM of the ad so one can learn more about the ad s true e CPM by observing whether the ad received a click. Thus even if one believes that a high uncertainty ad is not the best ad for this particular query, it may be valuable to show this ad so one can learn more about the e CPM of the ad and make better decisions about whether to show this ad in the future. While there is an extensive literature that analyzes strategic experimentation in these types of multi-armed bandit problems, the online advertising setting differs substantially from these existing models. In online auctions there is a tremendous amount of random variation in the quality of competition that an ad with unknown e CPM faces in the auction c 2016 Patrick Hummel and R. Preston Mc Afee. Hummel and Mc Afee due to the fact that the ad is constantly competing in a wide variety of different auctions. In these settings, there will always be a certain amount of free exploration that takes place due to the fact that there will be some auctions in which there are no ads with e CPMs that are known to be high, and one can use these opportunities to explore ads with uncertain e CPMs. Almost all existing models of multi-armed bandits that can be applied to online auctions fail to take this possibility into account. This paper presents a model of repeated auctions in which an ad with an uncertain click-through rate faces a random distribution of competing bids in each auction and there is discounting of payoffs in the sense that an auctioneer values a dollar received in the distant future less highly than a dollar received today. We formulate this problem as a dynamic programming problem and show that the optimal solution to this problem takes a remarkably simple form. In each period, the auctioneer should rank the advertisers on the basis of the sum of an advertiser s expected e CPM plus a term that represents the value of learning about the e CPM of a particular ad. One then runs the auction by ranking the ads by these social values rather than their expected e CPMs. While there have been previous papers on multi-armed bandits that have proposed ranking arms by a term equal to the expected value of showing an ad plus an additional term representing the value of learning about the true value of that arm,1 the value of learning in the problem that we consider is dramatically different from the value of learning in standard multi-armed bandit problems. In standard multi-armed bandit problems (Auer et al., 2002) where there is no discounting of payoffs and no random variation in the competition that an arm faces, typical solutions involve ranking the ads according to a sum of the expected value of the arm plus a term proportional to the standard deviation in the arm s value. By contrast, we find that the value of learning in our setting is proportional to the variance in an ad s expected e CPM divided by the number of impressions that an ad has received. Thus the incremental increase in the probability that a particular ad is shown varies with 1 k2 , where k denotes the number of impressions this ad has received so far. This is an order of magnitude smaller than the corresponding incremental increase in standard machine learning algorithms. In fact, we show that if we attempted to rank the ads on the basis of the sum of an advertiser s expected e CPM plus a term equal to a constant times the standard deviation in the advertiser s e CPM, the optimal constant would be zero. Our baseline model considers a simple situation in which there is a single advertiser with unknown e CPM that competes in each period against an advertiser with known e CPM whose e CPM bid is a random draw from some distribution. But our conclusions about the value of learning are not restricted to this simple model. We show that our conclusions about the optimal bidding strategies extend to a variety of more complicated models including models in which there are multiple advertisers with unknown e CPMs as well as models in which there is correlation between the unknown e CPMs of multiple different advertisers and information from showing one advertiser can help one refine one s estimate of the e CPM for some other advertiser. We also illustrate an asymptotic equivalence between 1. In addition, Iyer et al. (2014) illustrate that bidders may have an incentive to make a bid equal to their expected value plus a term proportional to their value of learning about their value if bidders have uncertainty about their own value. This paper differs from ours in that it considers an environment in which bidders are attempting to learn their own values rather than an auctioneer attempting to learn the e CPMs. Machine Learning in an Auction Environment the theoretically optimal strategies and the strategies that would be selected by a simple one-step look ahead policy often referred to as knowledge gradients (Frazier et al., 2009; Ryzhov et al., 2010, 2012). A consequence of these small incremental changes in the probability that an ad is shown is that the total value from adding active exploration in the online auction setting is exceedingly small. Not only does the incremental increase in the probability that a particular ad is shown vary with 1 k2 , but on top of that, the expected payoffincrease that one obtains conditional on showing a different ad than would be shown without active learning also varies with 1 k2 . This implies that the total value of adding active exploration in the setting we consider will vary with 1 k4 for large numbers of impressions k, an exceedingly small amount. We further obtain finite sample results illustrating that for realistic amounts of uncertainty in the e CPMs of ads, the maximum total efficiency gain that could ever be achieved by adding active learning in this auction environment is exceedingly small, typically only a few hundredths of a percentage point. Finally, we empirically verify these findings through simulations and illustrate that adding active learning in the auction environment we consider only changes overall efficiency by a few hundredths of a percentage point. Perhaps the most closely related paper to our work is a paper by Li et al. (2010). This paper is the only other paper we are aware of that considers questions related to the value of learning about the e CPMs of ads with uncertain e CPMs in a setting where there is discounting in payoffs as well as random variation in the quality of the competition that an ad faces from competing ads in the auction. Li et al. (2010) demonstrate that the value of showing an ad with an uncertain e CPM will generally exceed the immediate value of showing that ad because one will learn information about the e CPM of the ad that will enable one to make better ranking decisions in the future. However, Li et al. (2010) do not attempt to characterize the optimal solution in this setting, as we do in the present paper. There is also an extensive literature in statistics and machine learning that addresses questions related to multi-armed bandits (Audibert and Bubeck, 2010; Auer et al., 2002, 2003; Gittins, 1979; Hazan and Kale, 2011; Lai and Robbins, 1985; Mannor and Tsitsiklis, 2004; May et al., 2012; Slivkins, 2014) as well as some papers that focus specifically on the auction context (Agarwal et al., 2009; Babaioffet al., 2009; Devanur and Kakade, 2009; Wortman et al., 2007). However, none of these papers considers appropriate methods for exploring ads in a context where there is random variation in the quality of the competition that an ad faces in an auction. The optimal methods for exploring ads in such a scenario turn out to be completely different from the methods considered in any of these previous papers, and as such, our work is completely different from existing machine learning literature. Finally, there is an extensive literature in economics related to questions on strategic experimentation. Within economics, this literature has considered a variety of questions including consumers trying to learn about the quality of various products (Bergemann and V alim aki, 1996, 1997, 2000), firms and sellers trying to learn about demand (Aghion et al., 1993; Fishman and Rob, 1998; Ghate, 2015; Keller and Rady, 1999; Mirman et al., 1993; Rusitchini and Wolinsky, 1995), learning to play repeated games (Anthonisen, 2002; Gale and Rosenthal, 1999), learning about untried policies in political economy (Callander, 2011; Callander and Hummel, 2014; Strulovici, 2010), learning from the actions of others (Banerjee and Fudenberg, 2004; Gale, 1996; Vives, 1997), as well as general results on experimentation Hummel and Mc Afee (Aghion et al., 1991; Banks and Sundaram, 1992; Bergemann and V alim aki, 2001; Bolton and Harris, 1999; Brezzi and Lai, 2002; Keller and Rady, 2010; Keller et al., 2005; Moscarini and Smith, 2001; Rothschild, 1974; Schlag, 1998; Weitzman, 1979). However, the economics literature has not considered strategic experimentation in auctions, as we do in the present paper. 2. The Model There is a new ad with an uncertain e CPM that will bid into a second-price auction for a single advertising opportunity with competing advertisers.2 Throughout we let x denote the actual unknown but fixed value (or e CPM) for showing the new ad, z denote the e CPM bid the auctioneer places on behalf of this advertiser,3 and let k denote the number of impressions the ad has received so far. We also suppose that the highest e CPM bid that this advertiser competes against may vary from auction to auction, and that in each auction, this highest competing e CPM bid is a random draw from some cumulative distribution function F( ) with corresponding continuous and twice differentiable density f( ). At any given point in time, the auctioneer does not necessarily know the exact value of x. Instead the auctioneer only knows that x is drawn from some distribution. We let x denote a generic distribution corresponding to the auctioneer s estimate of the distribution of possible values of x. This distribution will evolve over time as an ad has received more impressions and we have a better sense of the underlying e CPM of the ad. Throughout we also let x denote an unbiased estimate of the true value of x given the auctioneer s estimate of the distribution of possible values of x. We also let σ2 k denote the variance in our estimate of the e CPM for the new ad when the ad has been shown k times. In the limit when k is large, σ2 k will be well approximated by s2(x) k for some constant s2(x) that depends only on x, and we assume that σ2 k = s2(x) k2 ) for some continuously differentiable functions s2(x) and h(x).4 In addition, we let δ (0, 1) denote the per-period discount rate so the auctioneer only values advertising opportunities that take place at time T by a factor of δT as much as opportunities that take place at the present time period. Throughout we assume that the auctioneer wishes to maximize total efficiency; that is, if vt denotes the total value of the ad displayed in period t (the true e CPM of this ad), then the auctioneer s payoffis P t=0 δtvt. Since online ad auctions are typically designed to select the efficiency-maximizing allocation, this is a logical objective to optimize. 2. These second-price auctions for a single advertising slot are ubiquitous throughout the display advertising industry. In such auctions, advertisers have an incentive to make a bid equal to their true value for a click. 3. Typically advertisers bid by indicating how much they are willing to pay per click, and the auctioneer then uses this cost-per-click bid as well as an estimate of the probability the ad will be clicked to calculate an e CPM bid for the advertiser that the auctioneer then places on behalf of the advertiser in the auction. 4. This assumption will hold for most common priors about the distribution from which the uncertain e CPM of the ad is drawn, such as a beta prior. It is also worth noting that the weaker assumption that σ2 k = s2(x) k2 ) for some constant s2(x) is sufficient to prove our main result about the value of learning being O( 1 k2 ). The additional assumption that σ2 k = s2(x) k2 ) is only used to further prove that the value of learning is of the form v(x) k(k+1) + o( 1 k2 ) for some function v(x). Machine Learning in an Auction Environment 3. Preliminaries Before proceeding to analyze the precise model given above, we first address a closely related question about the extent to which a particular advertising opportunity increases total welfare if the e CPM of this advertising opportunity is known. In particular, we consider a concept that we refer to as the long-term value of a particular advertisement. The longterm value of a particular advertisement gives the total increase in the auctioneer s payoff that arises as a result of this ad being in the system from the various auctions that take place over time. Understanding the long-term value of a particular advertisement when the e CPM of that ad is known will serve as a useful benchmark for understanding how one should behave when there is uncertainty about the e CPM of the ad. Theorem 1 If the e CPM of an ad is known, then the total long-term value of this ad is a convex function of the e CPM of the ad and a strictly convex function for regions where the e CPM of the ad is within the support of the distribution of the highest competing e CPM. All proofs are in the appendix. The fact that the value for any particular advertisement is a convex function of the e CPM of the ad if the e CPM of the ad is known indicates that if there is uncertainty about the e CPM of the ad, then the expected long-term value of this ad will be greater than the long-term value of the expected e CPM of the ad. From this it follows that if there is uncertainty about the e CPM of the ad, then it will be optimal to behave as if this particular ad had a known e CPM that is greater than the expected e CPM of the ad. The precise additional amount that this advertiser s bid should be increased will be pinned down by the solution to the dynamic programming problem governed by the game described in the model. 4. Dynamic Programming Problem In this section, we formulate the value of a particular ad as a dynamic programming problem and use this formulation to derive the optimal bidding strategy. First we derive the auctioneer s payoffthat arises in a particular period when the auctioneer makes a particular bid on behalf of the advertiser with uncertain e CPM. Note that if the auctioneer places a bid of z on behalf of the advertiser with uncertain e CPM in the auction and the actual value of showing this particular ad is x, then the auctioneer s payofffrom running the auction once is u(z, x) = Z z yf(y) dy + Z z 0 xf(y) dy = y(1 F(y))| z + Z z (1 F(y)) dy + x F(z) = z(1 F(z)) + Z z (1 F(y)) dy + x F(z). In general placing a bid of z rather than x in a one-shot auction will result in some inefficiencies in the one-shot auction since it would be optimal for efficiency to place a bid exactly equal to x on behalf of this advertiser in a one-shot auction. The payoffloss that arises in a one-shot auction as a result of placing a bid of z instead of x is L = u(x, x) u(z, x) Hummel and Mc Afee = x(1 F(x)) + Z x (1 F(y)) dy + x F(x) z(1 F(z)) Z z (1 F(y)) dy x F(z) = (x z)(1 F(z)) + Z z x (1 F(y)) dy = Z z x F(z) F(y) dy. If we define the per-period reward to be the negative of this per-period loss, then the auctioneer seeks to maximize the discounted sum of these per period rewards. Let Vk(x) denote the value of this discounted sum when the auctioneer follows the optimal bidding strategy. Also note that, from the perspective of the auctioneer, x is a random variable that can be expressed as x = x + σkϵ, where σk denotes the standard deviation in our estimate of the ad with uncertain e CPM when the ad has been shown k times, and ϵ is a random variable with mean zero and variance one. We use this notation to prove the following: Lemma 2 Vk(x) can be expressed as the value of a dynamic programming problem by Vk(x) = 1 1 δ x+σkϵ F(z) F(y) dy + δF(z)(Ex [Vk+1(x )] Vk(x)) , where x denotes the uncertain realization of x after an ad receives an additional impression. By using the expression for the value of the dynamic programming problem in the previous lemma, we can derive the bid that the auctioneer should place on behalf of the advertiser to maximize the auctioneer s payoff. This is done in the theorem below: Theorem 3 The optimal bidding strategy in the dynamic programming problem when an ad has been shown k times entails setting z = x + δ(Ex [Vk+1(x )] Vk(x)). Thus the optimal bidding strategy in this dynamic programming problem can be written in a form where the bid the auctioneer makes on behalf of the bidder with uncertain e CPM is equal to the bidder s expected e CPM plus a term that represents the value of learning about the true e CPM of that bidder, δ(Ex [Vk+1(x )] Vk(x)). In order to calculate this value of learning, we need to get a sense of the size of the Vk(x) terms. 5. Value of Dynamic Program for Large Numbers of Impressions In the previous section, we have given exact expressions for the value of the dynamic program and the optimal bidding strategy that should be followed under this dynamic programming problem. In this section, we seek to derive accurate estimates of the value of this dynamic program in the limit when an ad has already been shown a large number of times. The main purpose of this section is to illustrate that the value of learning term given in the previous section will vary with 1 k2 for large k. We prove this by first showing that the expected efficiency loss arising due to the uncertainty in the e CPM of the ad varies with 1 k for large k, and then use this to show that the value of learning term varies with 1 k 1 k+1, which varies with 1 k2 for large k. When an ad has already been shown a large number of times, the value of σk that is estimated for the ad is likely to be very small. For small values of σk, we can use a Taylor expansion to approximate the value of the above dynamic programming problem. In particular, we obtain the following result: Machine Learning in an Auction Environment Lemma 4 Eϵ[ R z x+σkϵ F(z) F(y) dy] = R z x F(z) F(y) dy + 1 2σ2 kf(x) + a(x)σ4 k + o(σ4 k) for some constant a(x) for large k. Using the results from the previous lemma, one can immediately illustrate that Vk must be on the order of 1 k for large values of k. Theorem 5 Vk(x) = Θ( 1 k) for large k. To understand the intuition behind this result, note that the average error in the estimate of the e CPM of the ad is proportional to the standard error of this estimate, σk, which varies with 1 k, so the probability that the auctioneer will display the wrong ad as a result of misestimating the e CPM of the ad varies with 1 k. At the same time, conditional on displaying the wrong ad as a result of misestimating the e CPM of the ad, the average efficiency loss that one suffers varies with 1 k. Thus the expected efficiency loss that the auctioneer incurs varies with 1 k, which in turn implies the result in Theorem 5. Theorem 5 suggests that we may be able to write Vk(x) = v(x) k) for large k, where v is a function that depends only on x. To prove that Vk(x) can be expressed this way, it is necessary to show that k Vk(x) indeed converges to a function of x in the limit as k . This is done in the following theorem: Theorem 6 k Vk(x) converges to a function of x in the limit as k . Furthermore, it must be the case that k Vk(x) = 1 2(1 δ)s2(x)f(x) + O( 1 k) for large k. From Theorem 6, it follows that we can express Vk(x) by Vk(x) = v(x) k2 ) for large k, where v is a function that satisfies v(x) = 1 2(1 δ)s2(x)f(x). In order to complete our approximation of the solution the dynamic programming problem for large k, it is also necessary to bound the expression Ex [Vk+1(x )] Vk(x) that appears in the dynamic programming problem. This is done in the following theorem: Theorem 7 Ex [Vk+1(x )] Vk(x) = v(x) k(k+1) + o 1 k2 for large k. The intuition behind this result is that since the efficiency loss that the auctioneer incurs due to uncertainty in the e CPM of an ad varies with 1 k, the value of learning will be proportional to the reduction in the future efficiency loss that the auctioneer suffers as a result of learning more about the e CPM of the ad, meaning the value of learning will vary with 1 k 1 k+1, which varies with 1 k2 . The fact that Ex [Vk+1(x )] Vk(x) varies with 1 k2 indicates that the incremental increase in an advertiser s bid also varies with 1 k2 in the limit when k is large. This in turn implies that the incremental increase in an advertiser s probability of winning the auction will also vary with 1 k2 for large k. The result in Theorem 7 suggests that the optimal method for adding active exploration will only rarely have an effect on which ad wins the auction, as the probability that this active exploration changes which ad is shown varies with 1 k2 for large k. This result about the value of learning varying with 1 k2 for large k stands in marked contrast to algorithms that have been proposed for active exploration in standard multi-armed bandit problems with no discounting of payoffs and no random variation in the competition that an arm faces Hummel and Mc Afee in a given period (Auer et al., 2002). In these types of algorithms, the value of learning tends to vary with 1 k, which means the value of learning is an order of magnitude smaller in our setting than in standard multi-armed bandit problems. Ultimately we seek to use these insights to derive results about the change in payoffthat would result from incorporating active learning in this setting. Before doing this, we first illustrate how the conclusions of this section about the value of the dynamic programming problem and the optimal bidding strategy extend to a variety of more complicated scenarios including settings where there are multiple different ads with uncertain e CPMs whose true e CPMs may be correlated and we also illustrate a natural correspondence between the optimal solution to the full dynamic programming problem and a simple one-step lookahead strategy. First we tackle the problem of computing the value of the dynamic program when an ad with an uncertain e CPM has only received a small number of impressions. 6. Value of Dynamic Program for Small Numbers of Impressions To calculate the value of Vk(x) for small values of k, we apply backwards induction. At some large value of k, it will necessarily be the case that the incremental value of additional exploration is so small that the advertiser simply bids z = x because the smallest possible increment the advertiser would be allowed to adjust its bid exceeds the tiny incremental value of additional exploration. Thus if K denotes the earliest stage at which an advertiser always sets z = x, then for all k K, it is necessarily the case that the value of learning is zero, and Vk(x) = 1 1 δ Eϵ h R x x+σkϵ F(x) F(y) dy i 0. For values of k < K, we have (1 δ)Vk(x) = Eϵ x+σkϵ F(zk) F(y) dy + δF(zk)(Ex [Vk+1(x )] Vk(x)) Vk(x) = Eϵ h R zk x+σkϵ F(zk) F(y) dy i + δF(zk)Ex [Vk+1(x )] 1 δ + δF(zk) . Thus by empirically measuring the values of σk and F( ), we can apply backward induction to approximate Vk(x) for small values of k. We now address the question of what these values of Vk(x) will be approximately equal to for an important class of advertisers. Many ads that have only received a small number of impressions are ads that typically fail to win auctions because the machine learning system is pessimistic about the ad s true e CPM. The estimated e CPMs for these ads may be several orders of magnitude smaller than the typical e CPMs of the ads that have been shown many times. In these cases, even if the percentage uncertainty in the e CPMs of these ads is quite high, the absolute amount of uncertainty in the e CPMs of these ads will be small compared to the typical e CPMs of the ads that have been shown many times. Thus in these cases, x will be close to zero, and F(x) and σ2 k will be close to zero as well. Under these circumstances, we have the following result: Theorem 8 If x (and σ2 k) are close to zero for small values of k, then Vk(x) = 1 2(1 δ)f(x)σ2 k+ o(f(x)σ2 k) for small values of k. Machine Learning in an Auction Environment Theorem 8 indicates that even in small sample environments, it is still frequently reasonable to approximate Vk(x) by writing Vk(x) 1 2(1 δ)f(x)σ2 k, where σ2 k denotes the variance in our estimate of the ad s e CPM for a particular value of k. This theorem in turn implies that if a machine learning system is quite pessimistic about the true e CPM of a new ad, then there will be little value to actively exploring the ad because the value of learning term, Ex [Vk+1(x )] Vk(x), will be quite small. 7. Ads with Correlated Values So far we have restricted attention settings in which we only seek to learn the e CPM of one advertiser s ad. However, in many situations we may seek to learn the e CPMs of multiple advertisers ads and the e CPMs of the various advertisers may be correlated. In these situations, information about one ad s e CPM may help one learn about the e CPMs of other related advertisers. On top of this, even if there is only ad for which we are uncertain about the advertiser s e CPM, this ad may bid in several different contexts where the ad has substantially different e CPMs and the ad faces substantially different competing landscapes of bids.5 In these cases, information about an ad s e CPM in one context may also help one learn about the ad s e CPM in other contexts. To address how this affects the results, we extend the model to allow for the possibility that there are multiple different ads that bid in multiple different contexts where we seek to learn the e CPMs of the ads and these e CPMs may be correlated. In particular, we suppose that there are m different ad-context pairs where we seek to learn the e CPM of the ad in that particular context. For the ad-context pair a, we let xa denote the actual, unknown value of the e CPM of that ad in that context, and we let x = (x1, . . . , xm) denote the actual unknown e CPMs of the ads in all m contexts. We also let k denote the total number of impressions that these advertisers have received in the various contexts and let βa denote the fraction of these impressions that were received in context a. Thus we have Pm a=1 βa = 1. We again assume the auctioneer does not know the exact value of x, and instead the auctioneer only knows that x is drawn from some distribution. We again let x denote a generic distribution corresponding to the auctioneer s estimate of the distribution of possible values of x. This distribution allows for the possibility that the auctioneer may believe there is correlation in the unknown e CPMs of the advertisers in the different contexts, and the distribution will again evolve over time as an ad has received more impressions and we have a better sense of the underlying e CPM of the ad. Throughout we also let x denote an unbiased estimate of the true value of x given the auctioneer s estimate of the distribution of possible values of x and we let xa denote an unbiased estimate of the true value of xa given this distribution. We also let σ2 a,ka denote the variance in our estimate of the e CPM of the ad-context pair a when there have been a total of ka impressions in this ad-context pair. In the limit when ka is large, σ2 a,ka will be well approximated by s2 a(xa) ka = s2 a(xa) βak for some constant s2 a(xa) that depends only on xa, and 5. Contextual bandit problems in which an arm s payoffmay vary from context to context have appeared in the literature before in different settings. See, for example, work by May et al. (2012) and Slivkins (2014). Hummel and Mc Afee we again let σ2 a,ka = s2 a(xa) βak + ha(xa) β2ak2 + o( 1 k2 ) for some continuously differentiable functions s2 a(xa) and ha(xa). In addition, we let δ (0, 1) denote the per-period discount rate so that the mechanism designer only values advertising opportunities that take place at time T by a factor of δT as much as opportunities that take place at the present time period. In each period t, there is an auction for a single advertising opportunity. The auction can involve any one of the m possible ad-context pairs for which we do not know the e CPM of the ad in that context. We let πa denote the probability that there will be an auction involving ad-context pair a in any given time period. Thus we have Pm a=1 πa = 1. We further suppose that if there is an auction involving ad-context pair a, then the distribution of the values of the competing advertisers is such that the highest e CPM for a competing ad is a random draw from some cumulative distribution function Fa( ) with corresponding continuous and twice differentiable density fa( ). In this setting, the total long-term value of a particular ad-context pair is again a convex function of the e CPM of the ad for the same reasons as in Theorem 1 and we can again formulate this problem as a dynamic program. To do this, let k (k1, . . . , km) denote a vector that gives the number of impressions that have been received by the various adcontext pairs 1, . . . , m. Also let Va, k(x) denote the value of the dynamic program when the next auction involves the advertiser-context pair a, the e CPMs of the ads are x, and there have been k impressions in each of the various ad-context pairs, and let V k(x) Ea[Va, k(x)] denote the value of the same dynamic program unconditional on which ad-context pair is involved in the next auction. By using similar reasoning to that in Lemma 2, we know that V k(x) equals xa+σa,kaϵ Fa(za) Fa(y) dy + δFa(za)(Ex (a)[V k (a, k)(x (a))] V k(x)) where k (a, k) (k 1, . . . , k m) is a vector that satisfies k b = kb for all b = a and k a = ka + 1, and x (a) denotes the uncertain realization of x if the advertiser-context pair a receives an additional impression. Furthermore, the optimal bid za if there is an auction involving the advertiser-context pair a satisfies za = xa + δFa(za)(Ex (a)[V k (a, k)(x (a))] V k(x)) by similar logic to that given in Theorem 3, and the result in Lemma 4 is just a general mathematical result that holds regardless of the model we are considering. Thus natural analogs of Theorems 1 and 3 and Lemmas 2 and 4 continue to hold in this revised model. By using these insights, one can further show that V k(x) must be on the order of 1 k for large k. This is done in the following theorem: Theorem 9 When there are multiple ads with correlated values, V k(x) = Θ( 1 While this result indicates that V k(x) varies with 1 k for large values of k, this alone does not guarantee the convergence of this function for large values of k. We verify that this function does indeed converge for large values of k in the following theorem: Theorem 10 When there are multiple ads with correlated values, k V k(x) converges to a function of x in the limit as k . Furthermore, it must be the case that k V k(x) = 1 2(1 δ) Pm a=1 πa 1 βa s2 a(xa)fa(xa) + O( 1 k) for large k. Machine Learning in an Auction Environment Theorem 10 indicates that the results about the limiting value of k V k(x) derived in Theorem 6 extend naturally to the case where there are multiple ads with possibly correlated values. When there are multiple ad-context pairs that we must learn about, the value function corresponding to that in Theorem 6 differs only in that we take a weighted sum over the various possible advertiser-context pairs, where the weights are a function of the relative probabilities with which each advertiser-context pair arises. Thus there is a clear analog between the limiting properties of the value function when there are multiple advertisercontext pairs and the value function in the main model. By using the results in the previous theorem, one can further derive properties of the limiting value of Ex (a)[V k (a, k)(x (a))] V k(x) that is proportional to the additional amount that one should bid in the auction beyond the expected value that one has for the advertising opportunity. This is stated below in the following theorem: Theorem 11 When there are multiple ads with correlated values, Ex (a)[V k (a, k)(x (a))] V k(x) = πa 2(1 δ)β2ak2 s2 a(xa)fa(xa) + o( 1 k2 ) for large k. The proof of this result is substantively identical to the proof of Theorem 7 and is thus omitted. Theorem 11 illustrates that the substantive conclusions of Theorem 7 extend to this alternative environment in which there are multiple ads with possibly correlated values. When there are multiple ads, it remains optimal to increase one s bid by an amount proportional to the variances in our estimates of the e CPMs of the ads or 1 k2 for large k. 8. Knowledge Gradients Throughout the paper so far, we have considered a standard dynamic programming approach in which the optimal decision at any given point in time is affected in part by how this decision will affect future decisions when looking at the infinite horizon ahead. While this is a standard approach to take in these types of problems, recently there has been work considering an alternative approach often referred to as knowledge gradients in which the decision one takes in a given period is the decision that one would take if one faced an infinite-horizon game but this period was the last period in which the information one learned could be used to inform future actions. The main advantage of these knowledge gradients over the standard dynamic programming approach is that they have the virtue of being much easier to calculate than the optimal bidding strategy under the standard dynamic programming problem. This simplicity does potentially come at a performance cost. However, various papers have illustrated that using this simple one-step look-ahead approach can nonetheless achieve a performance that is competitive with that of other standard methods in contexts unrelated to advertising (Frazier et al., 2009; Ryzhov et al., 2010, 2012). In this section, we investigate whether this alternative knowledge gradient approach can indeed achieve a performance comparable to that of the theoretically optimal dynamic programming approach. To address this question, for simplicity we consider the baseline model in which there is one advertisement for which we are seeking to learn the e CPM of the ad, though similar results can easily be derived under the more general model we have considered with multiple ads and correlated values. Let Uk(x) denote the value that one would obtain for the rest of Hummel and Mc Afee the game when an ad has received k impressions so far, one s estimate of the e CPM of the ad is x, and one will not be able to use information that one learns in the future to inform future bidding decisions. Note that in this case, the optimal bidding strategy will be to submit a bid of z = x in every remaining period, and the auctioneer s expected per-period payoffwill be Eϵ h R z x+σkϵ F(z) F(y) dy i in every future period, where ϵ denotes some random variable with mean zero and variance one, and z x. The total value the auctioneer will obtain for the rest of the game is then Uk(x) = 1 1 δEϵ h R x x+σkϵ F(x) F(y) dy i . Also let Uk+1(x) denote the value that one would obtain for the rest of the game when an ad has received k + 1 impressions so far, one s estimate of the e CPM of the ad is x, and one will not be able to use information that one learns in the future to inform future bidding decisions. The total value the auctioneer will obtain for the rest of the game is then Uk+1(x) = 1 1 δEϵ h R x x+σk+1ϵ F(x) F(y) dy i . Now consider the bidding strategy that one would employ if one faced an infinite-horizon game but this period was the last period in which the information one learned could be used to inform future actions. The auctioneer s payofffrom bidding z that arises in the current period equals R z x+σkϵ F(z) F(y) dy. And the expected value that the auctioneer obtains from future periods by bidding z in the current period is F(z)Ex [Uk+1(x )]+(1 F(z))Uk(x) by the same reasoning used in the proof of Lemma 2. From this it follows that the expected payofffrom placing a bid of z in a given period is x+σkϵ F(z) F(y) dy + δ(F(z)Ex [Uk+1(x )] + (1 F(z))Uk(x)) . There is a clear similarity between this expression and the expression for the expected payofffrom placing a bid of z in the standard dynamic programming approach. The main difference is that the terms Uk+1(x ) and Uk(x) have replaced the terms Vk+1(x ) and Vk(x) in the standard dynamic programming approach. It is worth noting, however, that the payoffs that result from the one-step look ahead strategies in the model in this paper take a different form than those given in other knowledge gradient papers (Frazier et al., 2009; Ryzhov et al., 2010, 2012). The reason for this difference is that in the model in our paper, there is a competing ad whose e CPM is known in each period but is also a random draw from some distribution in each period. No such random changes in the values of the arms from period to period are present in existing knowledge gradient papers, so the payoffs and strategies in our paper are formulated differently than those given in existing knowledge gradient papers. From the equation we ve derived for the auctioneer s payofffrom bidding z, we can calculate the optimal bidding strategy under the knowledge gradient formulation. This bidding strategy is given in the following theorem: Theorem 12 The optimal bidding strategy in the knowledge gradient framework when an ad has been shown k times entails setting z = x + δ(Ex [Uk+1(x )] Uk(x)). The proof of this result is substantively identical to that in Theorem 3 and is thus omitted. This result indicates that in the knowledge gradient framework, the incremental amount that one increases one s bid beyond the immediate expected reward is again of the Machine Learning in an Auction Environment form δ(Ex [Uk+1(x )] Uk(x)), the only difference being that Uk(x) corresponds to the value of the dynamic program under the knowledge gradient framework. To better understand the incremental amount that one would increase one s bid, we present two results that illustrate how the incremental amount that one would increase one s bid under the knowledge gradient framework compares to the incremental amount that one would increase one s bid under the full dynamic programming problem. First we present a finite sample result about how these incremental bid increases compare in the two frameworks. In our first result, we consider what we refer to as the expected value of all future learning. To reflect the fact that Vk(x) gives the auctioneer s payofffrom the full dynamic programming problem when the auctioneer is able to make use of additional information in future periods, whereas Uk(x) gives the auctioneer s payofffrom the corresponding game in which the auctioneer is not able to make use of information that he learns, we define this expected value of all future learning term to be the difference between Vk(x) and Uk(x). With this definition in mind, we obtain the following result: Theorem 13 Suppose the expected value of all future learning is lower after the ad has been shown k + 1 times than it is after the ad has been shown k times. Then the incremental amount by which one would increase one s bid under the knowledge gradient framework is greater than it is under the full dynamic programming problem. Theorem 13 indicates that the solution to the one-step look ahead problem will generally involve increasing one s bid beyond the immediate expected value of the advertising opportunity by a greater amount than one would do so under the full dynamic programming problem. This makes sense intuitively. If the current period were the last period in which one could ever use information that one learns to inform future actions, then one would place quite a high premium on being able to learn this information while one still can. By contrast, in the full dynamic programming problem, there will always be plenty of opportunities to learn this information later, so there is relatively less incentive to substantially increase one s bid beyond the immediate expected reward. This explains the result in Theorem 13. Theorem 13 requires a technical condition that the expected value of all future learning is lower if an ad has been shown k +1 times than if the ad has been shown k times, but this is just a mild technical constraint that we would expect to hold in virtually any situation. When an ad has been shown k + 1 times, one has more precise information about the true e CPM of the ad than when the ad has only been shown k times, so there is less value to learning more about the true e CPM of the ad. While Theorem 13 suggests that one might increase one s bid by too much under the knowledge gradient framework compared to the strategy that one should follow under the full dynamic programming problem, these differences in bidding strategies turn out to be relatively small. We illustrate this by characterizing the value of Ex [Uk+1(x )] Uk(x): Theorem 14 In the knowledge gradient framework, Ex [Uk+1(x )] Uk(x) = v(x) k(k+1) +o 1 for large k, where v(x) 1 2(1 δ)s2(x)f(x). This result also immediately implies the following corollary: Hummel and Mc Afee Corollary 15 The ratio of the incremental amount by which one wants to increase one s bid in the knowledge gradient framework and the incremental amount by which one wants to increase one s bid in the standard dynamic programming approach becomes arbitrarily close to 1 in the limit as the amount of uncertainty in an ad s e CPM becomes arbitrarily small. Thus using the knowledge gradient to formulate one s bidding strategy will result in payoffs that are asymptotically equivalent to those that would result from using the theoretically optimal bidding strategy. These results suggest that the knowledge gradient framework is indeed an appealing framework for computing bidding strategies in an environment where one wishes to learn about the unknown e CPMs of advertisers, as using this approach will result in little loss from the theoretically optimal approach. 9. Learning About Multiple Advertisers in the Same Auction In the analysis so far, we have assumed that in any given auction, there is only one advertiser whose e CPM is unknown. But in many real-life auctions there may be multiple advertisers with unknown e CPMs. In these cases, an auctioneer must decide both which advertiser with unknown e CPM will have the highest bid as well as what bid to submit for this advertiser. In this setting, it is not clear whether the decision maker s optimal strategy can simply be represented by submitting a bid for each advertiser that is equal to the sum of the best estimate of the advertiser s e CPM as well as a value of learning term. It may be the case that the optimal bid for advertiser i if advertiser i submits the highest bid of the advertisers with unknown e CPMs is higher than the optimal bid for some other advertiser j if advertiser j submits the highest bid of the advertisers with unknown e CPMs, even though the decision maker would prefer to submit a higher bid for advertiser j than for advertiser i. We address whether this possibility can arise in this section. To address this question, suppose that in each auction, there are n ads with unknown e CPMs. The actual e CPMs of these ads are x1, . . . , xn, and we let z1, . . . , zn denote the bids placed by these advertisers in the auction. Also let i denote the advertiser who submits the highest e CPM bid amongst these n bidders and let j denote the advertiser who actually has the highest e CPM amongst these advertisers. In each auction, these advertisers with unknown e CPMs compete against other advertisers and the highest such competing e CPM is drawn from a cumulative distribution function F( ) with corresponding density f( ). Note that in this case, the utility that the decision maker obtains in a given period from having advertiser i submit a bid of zi that is the highest e CPM bid amongst these n bidders is u = zi(1 F(zi)) + R zi (1 F(y)) dy + xi F(zi). At the same time, this decision maker would obtain a utility of u = xj(1 F(xj)) + R xj (1 F(y)) dy + xj F(xj) in a given period from making the optimal decision in a given period. Thus the loss that this decision maker obtains in a given period as a result of having advertiser i submit a bid of zi that is the highest e CPM bid amongst these n bidders is the difference between these two utilities or L = xj zi + (zi xi)F(zi) + R zi xj (1 F(y)) dy = R zi xi F(zi) dy R zi xj F(y) dy. Now let ki denote the number of impressions that advertiser i has received so far, let k (k1, . . . , kn) denote a vector that gives the number of times each of these ads has been shown, let xi denote our best estimate of the expected value of the e CPM of advertiser i, Machine Learning in an Auction Environment and let x (x1, . . . , xn) denote a vector of these best estimates. Also let V k(x) denote the value of the dynamic program as a function of these quantities. By similar reasoning to that in the proof of Lemma 2, it follows that if i denotes the advertiser who submits the highest e CPM bid amongst the n bidders with unknown e CPMs, k (i) (k 1, . . . , k n) is the vector where k j = kj for all j = i and k i = ki+1, and x (i) denotes the uncertain realization of x if advertiser i receives an additional impression, then V k(x) equals max zi Exi,xj xj F(y) dy Z zi xi F(zi) dy + δF(zi)(Ex (i)[V k (i)(x (i))] V k(x)) where the difference in the values of the dynamic programs is due to the fact that the loss in a given period is now R zi xi F(zi) dy R zi xj F(y) dy. Similarly, the optimal bid for advertiser i if advertiser i submits the highest e CPM bid amongst the n bidders with unknown e CPMs still satisfies zi = xi + δ(Ex (i)[V k (i)(x (i))] V k(x)). We use these insights to prove the following: Theorem 16 Suppose the optimal bid for advertiser i if advertiser i submits the highest bid of the advertisers with unknown e CPMs is higher than the optimal bid for all other advertisers with unknown e CPMs if one of these other advertisers submits the highest bid of the advertisers with unknown e CPMs. Then it is also optimal for advertiser i to have the highest bid of all the advertisers with unknown e CPMs. Theorem 16 guarantees that if there are multiple ads with unknown e CPMs, then one can simply compute the optimal bids for each of these ads in the case where the ad in question was guaranteed to have a higher bid than the other ads with unknown e CPMs. The ad that has the highest such optimal bid will then be guaranteed to be the ad for which the mechanism designer would want to submit the highest such bid. Thus even when there are multiple ads in the same auction with unknown e CPMs, one can continue to make optimal decisions by computing bids for the advertisers equal to their estimated e CPMs plus a value of learning term for the ad and then rank the advertisers on this basis. 10. Performance Guarantees We now return to the baseline setting in Section 2. The results in the previous sections suggest a possible algorithm that will approximate the optimal bidding strategies for an auctioneer who seeks to maximize long-run efficiency. This algorithm would compute the expected e CPM for an advertiser with unknown e CPM, x, the density for the distribution of competing e CPM bids at this value of x, f(x), the variance s2(x) in the e CPM for an ad with estimated e CPM x that has only received one impression, and the number of impressions k that the ad has received. One then decides which ad to show by computing a score equal to x + δ 2(1 δ)k(k+1)s2(x)f(x) for each ad, where δ is the auctioneer s discount factor, and showing the ad with the highest such score. We refer to this strategy as the approximately optimal bidding strategy, and in this section we address questions related to the size of the performance guarantees that can be obtained by using this algorithm and related algorithms. Hummel and Mc Afee First we address questions related to how the algorithms we have considered in this paper will compare to other plausible algorithms in the machine learning literature. One other algorithm that is standard for multi-armed bandit problems involves ranking the arms by a term equal to the expected value of the arm plus a term proportional to the standard deviation in the arm (Auer et al., 2002). More generally, one can rank advertisers by a term equal to the e CPM of the advertiser plus a term proportional to 1 kα for any α 1 2, where k denotes the number of impressions that the ad has received so far. However, these algorithms are not well-suited towards the auction environment, as the following theorem illustrates: Theorem 17 Suppose the auctioneer uses a bid for the advertiser with unknown e CPM of the form z = x + c(x) kα , where α 1 2 and c(x) is a bounded non-negative constant that depends only on x and the distribution of competing bids. Then the optimal constant c(x) for any such algorithm is c(x) = 0 for sufficiently large k. This result immediately implies that standard existing algorithms for exploration which involve adding a term proportional to the standard deviation to the e CPM of the ad, such as the UCB algorithm, are actually dominated by the simple greedy approach of always making a bid equal to the e CPM of the ad. These existing algorithms do too much exploration, and as a result, lead to lower payoffs than not doing any active exploration at all.6 Next we turn to the question of what guarantees can be made about the size of the performance improvement that could be obtained by using the approximately optimal bidding strategy rather than the simple greedy algorithm. Our next result illustrates that one will indeed obtain a performance improvement by using the approximately optimal bidding strategy, but the size of the performance improvement is likely to be very small. Theorem 18 Suppose the auctioneer follows the approximately optimal bidding strategy. Then the expected payoffthat the auctioneer will obtain by using this algorithm will exceed the expected payoffthat the auctioneer would obtain by using the purely greedy approach by an amount δ2 8(1 δ)3k4 s4(x)f3(x) + o( 1 Theorem 18 indicates that the performance improvement that can be obtained as a result of using the approximately optimal bidding strategy is only on the order of 1 k4 , where k denotes the number of impressions that an ad has received. This follows from the fact that the incremental increase in the probability that a particular ad is shown varies with 1 k2 , and on top of that, the expected payoffincrease that one obtains conditional on showing a different ad than would be shown without active learning also varies with 1 k2 . Since this represents a fourth-order improvement in performance relative to the purely greedy approach, this result indicates that the performance improvement that can be obtained by following our algorithm rather than simply ranking the ads by their e CPMs is small. 6. Similarly, an algorithm such as epsilon-greedy, in which the ad with the highest e CPM is chosen with probability 1 ϵ, and an ad is chosen uniformly at random with probability ϵ, will also lead to lower payoffs than not doing any active exploration at all for large k. We prove this in Observation 23 in the appendix. 7. The expected payoffincrease that we refer to in this theorem is for the subgame beginning from the point when the ad with uncertain e CPM has already received k impressions. Machine Learning in an Auction Environment It is worth noting, however, that the result in Theorem 18 is not due to our algorithm being a suboptimal implementation of incorporating active exploration. Our next result illustrates that while the size of the performance improvement that can be obtained by using our algorithm is small, this algorithm will, in fact, obtain nearly the maximum possible performance improvement over the purely greedy approach of ranking ads by their e CPMs. Theorem 19 Suppose the auctioneer uses the approximately optimal bidding strategy. Then the difference between the auctioneer s payoffunder this strategy and the maximum possible payoffthe auctioneer could obtain under the theoretically optimal strategy becomes vanishingly small compared to the difference between the auctioneer s payoffunder this strategy and the auctioneer s payoffunder the greedy strategy for large k. The results in the previous theorems suggest that the maximum possible payoffincrease that can be achieved by incorporating active exploration is quite small for auctions involving ads that have already received a large number of impressions. However, in many auctions, there are frequently advertisers that have only received a small number of impressions, so it is desirable to know whether these conclusions for ads that have received large numbers of impressions will also hold for ads that have only received a small number of impressions. Under the mild technical condition discussed in Theorem 13, where the expected value of future learning is lower after the advertiser with unknown e CPM has been shown once rather than never having been shown at all, we obtain the following result: Theorem 20 Suppose the bidder with unknown e CPM has a cost-per-click bid of 1 and a click-through rate drawn from a beta distribution. Also suppose that this bidder s expected e CPM is ω and the standard deviation in this bidder s true e CPM is γω. Then the difference between the maximum possible payoffthe auctioneer could obtain under the theoretically optimal strategy and the auctioneer s payofffrom the greedy strategy is no greater than δ2γ8ω6f 3 8(1 δ)3(1 ω)2 , where f denotes the supremum of f( ). Theorem 20 presents bounds on the maximum performance improvement that can be achieved over the purely greedy strategy by using active learning, but it is not immediately clear from this result whether these bounds imply there are significant limitations on the performance improvement that can be achieved by using active learning. We thus seek to shed some light on this under empirically realistic values of the parameters. If the typical e CPM bids for the winning advertisers are roughly ξω, then the auctioneer s total payofffor the game will be roughly ξω 1 δ, and the result in Theorem 20 indicates that the maximum fractional increase in expected payoffthat one can achieve from using the theoretically optimal strategy rather than the greedy strategy is roughly δ2γ8ω5f 3 8ξ(1 δ)2(1 ω)2 . Furthermore, if the typical e CPM bids for the highest competing advertisers in an auction are roughly ξω, then f is likely to also be on the order of 1 ξω. This holds, for example, if the highest competing e CPM bids are drawn from a lognormal distribution, as the largest value of the density of a lognormal distribution with parameters µ and σ2 is equal ξω , where ξω is the expected value of the lognormal distribution and c(σ2) eσ2 2πσ2 is a constant that depends only on σ2. Furthermore c(σ2) is likely to be close to 1 for realistic values of σ2 since c(σ2) [0.93, 1.09] for values of σ2 [0.2, 1]. The lognormal distribution Hummel and Mc Afee is a realistic representation of the distribution of highest competing bids in online auctions since both Lahaie and Mc Afee (2011) and Ostrovsky and Schwarz (2009) have noted that the distribution of highest bids can be well-represented by a lognormal distribution using data from sponsored search auctions at Yahoo!. By using the facts that the value of f is likely to be on the order of 1 ξω, and the maximum fractional increase in expected payoffthat one can achieve from using the theoretically optimal strategy rather than the greedy strategy is roughly δ2γ8ω5f 3 8ξ(1 δ)2(1 ω)2 , it then follows that the maximum fractional increase in expected payoffthat one can achieve from using the theoretically optimal strategy rather than the greedy strategy is roughly δ2γ8ω2 8ξ4(1 δ)2(1 ω)2 . There is empirical evidence that indicates that the typical click-through rates for ads in online auctions tend to be on the order of 1 100 or 1 1000 for search ads and display ads respectively (Bax et al., 2011), so (1 ω)2 will be very close to 1 and ω2 is likely to be less than 10 4 (for search ads) or 10 6 (for display ads). Furthermore, even for a brand new ad, the typical errors in a machine learning system s predictions are unlikely to exceed 30% of the true click-through rate of the ad, so γ 0.3 is likely to hold in most practical applications. Finally, ξ is a measure of by how much the highest bid in an auction exceeds the typical e CPM bid of an average ad in the auction. Since there are normally hundreds of ads competing in online auctions, it seems that one can conservatively estimate that ξ 3 is likely to hold in most real-world online auctions. By combining the estimates in the previous paragraph, it follows that γ8ω2 8ξ4(1 ω)2 will almost certainly be less than 10 11 in search auctions and 10 13 in display auctions. Now if δ 0.9999, δ2 (1 δ)2 will be no greater than 108, and if δ 0.99999, δ2 (1 δ)2 will be no greater than 1010. Thus even for values of δ that are exceedingly close to 1 (δ = 0.9999 for search ads and δ = 0.99999 for display ads), γ8ω2 8ξ4(1 ω)2 δ2 (1 δ)2 will be no greater than 0.001. Thus as long as δ 0.9999 (or δ 0.99999 for display auctions), the bound given in Theorem 20 guarantees that under empirically realistic scenarios, the maximum possible performance improvement that can be achieved by incorporating active learning into a machine learning system is at most a few hundredths of a percentage point. This is a finite sample result that does not require a diverging number of impressions in order to hold. 11. Simulations The results of the previous section suggest that the overall benefit that can be obtained by incorporating active exploration in an auction environment is exceedingly small. We now seek to empirically verify that the benefit that can be obtained from active exploration is indeed quite small by conducting simulations under some empirically realistic scenarios. To do this, we consider a scenario in which there is a repeated auction in which a costper-click (CPC) bidder competes against CPM bidders in each auction. The CPC bidder has a CPC bid of 1 and a fixed unknown click-through rate. The CPM bidders CPM bids vary from period to period, and in each period, we assume that the highest CPM bid is a random draw from a distribution with probability density function f( ). Throughout we assume that payoffs are discounted at a rate of δ = 0.9995 and that there are T = 10000 time periods. Machine Learning in an Auction Environment While we are not aware of any empirical evidence regarding the form of the uncertainty of an advertiser s click-through rate, for simplicity we assume that the CPC bidder s clickthrough rate is initially drawn from a beta distribution with parameters α and β. The auctioneer may refine this estimate over time. In particular, just before the auction in period t, the auctioneer believes that the CPC bidder s true click-through rate is a random draw from the beta distribution with parameters αt and βt where αt is equal to α plus the number of clicks the CPC bidder has received so far and βt is equal to β plus the number of times the CPC bidder s ad was shown but did not receive a click. We compare total welfare under two possible scenarios. The first scenario we consider is a standard ranking algorithm in which the ads are ranked purely on the basis of their expected e CPM bids. The second scenario we consider is one in which the CPC bidder makes a bid of the form xt + δ(1 δT t) 2(1 δ) αtβt (αt+βt)2(αt+βt+1)2 f(xt) in each period t, where xt denotes the CPC bidder s expected click-through rate just before the auction in period t. This second scenario corresponds to adding a term equal to the value of learning to the CPC bidder s expected e CPM bid in the game with finite time horizons. Throughout we focus on scenarios that are motivated by empirical evidence on the likely expected click-through rates for ads in online auctions. In particular, since empirical evidence indicates that the typical click-through rates for ads in online auctions tend to be on the order of 1 100 or 1 1000 (Bax et al., 2011), we focus on situations in which the expected click-through rate of the CPC bidder is on the order of 1 100. Similarly, since it is unlikely that there will be substantial errors in the estimate of a new ad s predicted click-through rate, we focus on situations in which there is only moderate uncertainty in the click-through rate of a new ad. In particular, we consider distributions of the CPC bidder s bid such that the standard deviation in the advertiser s click-through rate is no greater than 20 or 30% of the expected value. We thus consider values of α and β satisfying (α, β) = (10, 1000) and (20, 2000) (for 30% and 20% standard errors respectively). Finally, since there is evidence that the distribution of highest bids is well modeled by a lognormal distribution (Lahaie and Mc Afee, 2011; Ostrovsky and Schwarz, 2009), we assume throughout that the CPM bidder s bid is drawn from a lognormal distribution with parameters µ and σ2. We use a value of σ2 = log(2) to match the variance in the lognormal distribution estimated by Ostrovsky and Schwarz (2009). And Varian (2009) has noted that the total value enjoyed by advertisers is typically about 2 2.3 times their total expenditure. If the auction consisted of only two advertisers, this would suggest that the appropriate value of µ would be such that the highest competing bidder had a CPM bid that is roughly double that of the CPC bidder in expectation. However, since there are more than two bidders in most real auctions, the appropriate value of µ will be larger than this. We thus consider a range of values of µ from 4.25 (for the case in which the highest competing CPM bid is roughly double that of the CPC bidder in expectation) to 3.5 (for the case in which the highest competing CPM bid is roughly four times that of the CPC bidder in expectation). Table 1 reports the results our simulations. The conclusions from these simulations are striking. While we have conducted enough simulations to estimate the efficiency gain that can be obtained from adding active exploration to within a few hundredths of a percentage point, none of the resulting estimated efficiency gains in Table 1 are statistically significant. Indeed one can conclude from these simulations that the maximum possible efficiency gain Hummel and Mc Afee Conditions Percentage increase in efficiency α = 10, β = 1000, µ = 4.25, σ2 = log(2) 0.021% (0.017%) α = 10, β = 1000, µ = 4, σ2 = log(2) -0.016% (0.011%) α = 10, β = 1000, µ = 3.75, σ2 = log(2) -0.008% (0.007%) α = 10, β = 1000, µ = 3.5, σ2 = log(2) 0.003% (0.004%) α = 20, β = 2000, µ = 4.25, σ2 = log(2) 0.003% (0.009%) α = 20, β = 2000, µ = 4, σ2 = log(2) 0.001% (0.006%) α = 20, β = 2000, µ = 3.75, σ2 = log(2) 0.001% (0.004%) α = 20, β = 2000, µ = 3.5, σ2 = log(2) -0.002% (0.002%) Table 1: Average percentage increase in efficiency from incorporating active learning (with standard errors in parentheses) after 2500 simulations. None of these results are statistically significant at the p < .05 level. that could be achieved in these settings is at most a few hundredths of a percentage point. These empirical results provide further support for our theoretical conclusions that the value of adding active exploration in an auction setting is exceedingly small. The reason for the results observed in Table 1 is that an optimal exploration algorithm will only do a tiny additional amount of exploration compared to the greedy strategy of always submitting a bid for the CPC bidder equal to the CPC bidder s estimated e CPM. For instance, for the first simulation considered in Table 1, the incremental increase in an advertiser s bid in the first period of the game as a result of active exploration is only 4.2%, implying only a 1.8% increase in the probability that the CPC bidder will be shown as well as only a 2.1% increase in expected payoffconditional on the auctioneer showing a different ad under active exploration than under the purely greedy strategy. Thus the incremental expected payoffincrease that can be achieved by incorporating active exploration in this auction setting is at most a few hundredths of a percentage point. The results in Table 1 make use of distributions that we regard as empirically realistic in the sense that there is a realistic amount of uncertainty in the click-through rate of the CPC bidder as well as a realistic amount of variation in the distribution of competing CPM bids. It is worth noting that if one relaxes the requirement that there be a realistic amount of uncertainty about these variances, then it is possible for the algorithm we have proposed to substantially outperform the purely greedy strategy of making a bid for the CPC bidder that always equals the CPC bidder s expected e CPM. In particular, if we instead assume that there is substantially more uncertainty about the CPC bidder s click-through rate than Machine Learning in an Auction Environment we have assumed in the simulations in Table 1 and we also assume that there is substantially less variance in the distribution of competing CPM bids than we have allowed for in Table 1, then there will be considerably greater benefits to adding active exploration because there is both more to learn about the CPC bidder s true e CPM bid as well as less exploration that will take place for free solely due to random variation in the competing bids. In this case, there may well be significant benefits to adding active exploration. Conditions Percentage increase in efficiency α = 2, β = 200, µ = 4, σ2 = log(2)/4 0.15% (0.05%) α = 2, β = 200, µ = 3.75, σ2 = log(2)/4 0.17% (0.05%) Table 2: Average percentage increase in efficiency from incorporating active learning (with standard errors in parentheses) after 10000 simulations. These results are both statistically significant at the p < .005 level. Table 2 reports the results of simulations that were conducted using distributions in which there is substantially more uncertainty about the CPC bidder s click-through rate and substantially less variance in the CPM bidder s competing CPM bid than in the distributions considered in Table 1. These simulations indeed reveal statistically significant efficiency gains as a result of active exploration. Nonetheless it is worth noting that the efficiency gains reported in Table 2 are still fairly small. Even when we make assumptions that bias the case in favor of active exploration being important, none of the efficiency gains reported in Table 2 are greater than a few tenths of a percentage point. Finally, while the gains achieved through active exploration in Table 2 are small, one would not achieve greater gains by using a standard algorithm such as UCB. To test this, we considered the same setting in the first row of this table, but instead of making a bid for the CPC bidder of the form xt+ δ(1 δT t) 2(1 δ) αtβt (αt+βt)2(αt+βt+1)2 f(xt) in each period t, we made a bid of the form xt +c(xt) 1 αt+βt , where the constant c(xt) was chosen so that this bid would equal xt + δ(1 δT t) 2(1 δ) αtβt (αt+βt)2(αt+βt+1)2 f(xt) in time period t = 1. Thus our implementation of UCB performed the same amount of exploration as the main algorithm we considered in the very first period of the game, while performing more exploration in later periods due to the fact that the rate of exploration declines with 1 (αt+βt)2 under our proposed algorithm, while only declining with 1 αt+βt under UCB. In this setting, we found that using the UCB algorithm rather than the purely greedy strategy resulted in an average efficiency loss of 1.04% (with a standard error of 0.07%). Thus while we were able to achieve an improvement by using the new algorithm we have proposed, using the UCB algorithm instead resulted in significant efficiency losses. The fact that UCB performed worse than the purely greedy strategy is not surprising since we know from Theorem 17 that UCB performs worse than the purely greedy strategy once an ad has received enough impressions. Hummel and Mc Afee 12. Conclusion In online auctions, there may be value to exploring ads with uncertain e CPMs to learn about the true e CPM of the ad and be able to make better ranking decisions in the future. But the online auction setting is very different from standard multi-armed bandit problems because there may be considerable variation in the quality of competition that an advertiser with unknown e CPM faces in an auction, and as a result there will typically be plenty of free opportunities to explore an ad with uncertain e CPM in auctions where there simply are no ads with e CPM bids that are known to be high. We have presented a model of the explore/exploit problem in online auctions that explicitly considers this random variation in competing bids that is present in real auctions. We find that the optimal solution for ranking the ads is dramatically different than the optimal solution in standard multi-armed bandit problems, and in particular, that the optimal amount of active exploration is considerably smaller than in standard multi-armed bandit problems. This in turn implies that the improvement in the auctioneer s payoffthat can be achieved by adding active learning in online auctions is also exceedingly small. Thus while it is theoretically possible to improve efficiency by incorporating active learning, in a practical exchange environment, a purely greedy strategy of simply ranking the ads by their expected e CPMs is likely to perform nearly as well as any other strategy. We conclude by discussing one other point. Throughout our analysis we have focused on the problem of an auctioneer who wants to maximize efficiency. Although this is a sensible objective, one might also envision scenarios in which the mechanism designer wishes to maximize a weighted average of efficiency and revenue. While incorporating active exploration in online auctions can only have a small effect on efficiency, this active exploration may significantly improve revenue. The reason for this is that if we rank the ads by the sum of their expected e CPMs and a value of learning term, the value of learning term may be larger for ads that typically lose the auctions, and incorporating this value of learning term may increase pricing pressure for the winning ads and thereby increase revenue.8 In fact, in several of the simulations considered in the previous section in which incorporating active exploration failed to show significant efficiency gains, the algorithm that we considered still showed significant revenue gains over the purely greedy strategy of ranking the ads by their expected e CPMs. But while it is still possible to achieve significant revenue gains by incorporating active exploration in the type of environment considered in this paper, the maximum possible efficiency gains are likely to be exceedingly small. Acknowledgments We especially thank Martin Zinkevich for numerous helpful discussions. We are also grateful to Joshua Dillon, Pierre Grinspan, Chris Harris, Tim Lipus, Mohammad Mahdian, Hal Varian, and the anonymous referees for helpful comments and discussions. This work is based on an earlier work: Machine Learning in an Auction Environment , in Proceedings of the 23rd International Conference on the World Wide Web (WWW) (2014) c ACM, 2014. http://doi.acm.org/10.1145/2566486.2567974. 8. A similar point has been previously noted by Li et al. (2010) and Mc Afee (2011). Machine Learning in an Auction Environment Appendix A. Proofs of Theorems Proof of Theorem 1: Suppose it is known that the e CPM of the ad is x. If the highest e CPM for a competing ad is p, then the presence of this ad with e CPM x increases total welfare by x p if x > p and 0 otherwise. Thus the expected increase in total welfare from this ad with e CPM of x competing in the auction is R x 0 (x p)f(p) dp. The total long-term value from having this advertisement is then the discounted sum of this expected total increase in welfare or 1 1 δ R x 0 (x p)f(p) dp. Now if V (x) 1 1 δ R x 0 (x p)f(p) dp, then V (x) = 1 1 δ R x 0 f(p) dp and V (x) = 1 1 δf(x). From this it follows that V (x) 0 for all x and V (x) > 0 if x is contained in the support of F. Thus the long-term value of the advertisement is a convex function of the e CPM of the ad and a strictly convex function if the e CPM of the ad is contained within the support of the distribution of the highest competing e CPM. Proof of Lemma 2: Suppose an ad has been shown k times. The value of the dynamic program that arises from placing the optimal bid z in the current period, Vk(x), equals the immediate reward from bidding z (or the negative of the loss function) in the current period plus δ times the expected value of the dynamic program that arises in the next period. Now if the new advertiser places a bid of z, then the probability the advertiser wins the auction is F(z), in which case the expected value of the dynamic program that arises next period is Ex [Vk+1(x )], where the expectation is taken over the randomness in the changes in the estimates of the e CPM of the ad x that arise as a result of showing this ad. The probability the advertiser does not win the auction is 1 F(z), in which case the value of the dynamic program remains at Vk(x). Thus the expected value of the dynamic program that arises in the next period is F(z)Ex [Vk+1(x )] + (1 F(z))Vk(x). At the same time, we have already seen that the reward from bidding z that arises in the current period equals R z x+σkϵ F(z) F(y) dy. By combining this with the insights in the previous paragraphs, it follows that Vk(x) = max z Eϵ x+σkϵ F(z) F(y) dy + δ(F(z)Ex [Vk+1(x )] + (1 F(z))Vk(x)) . By subtracting δVk(x) from both sides and dividing by 1 δ, it follows that Vk(x) = 1 1 δ x+σkϵ F(z) F(y) dy + δF(z)(Ex [Vk+1(x )] Vk(x)) . Proof of Theorem 3: By differentiating the expression in Lemma 2 with respect to z, we see that the first order condition for z to be an optimal bid is x+σkϵ f(z) dy + δf(z)(Ex [Vk+1(x )] Vk(x)) = Eϵ f(z)(z x σkϵ) + δf(z)(Ex [Vk+1(x )] Vk(x)) = f(z)(x z + δ(Ex [Vk+1(x )] Vk(x))) From this it follows that z = x + δ(Ex [Vk+1(x )] Vk(x)) satisfies the first order conditions. Moreover, at this value of z, the second order conditions are also satisfied. Thus optimal bidding entails setting z = x + δ(Ex [Vk+1(x )] Vk(x)). Hummel and Mc Afee Proof of Lemma 4: Let Φ( |σk) denote the distribution from which ϵ is drawn for any given value of σk. For any given σk, we know that Φ( |σk) has mean zero and variance one. We also know from the Bayesian central limit theorem that as σk 0 (and k ) that Φ( |σk) converges to the standard normal distribution. For any given σk, we can write Eϵ[ R z x+σkϵ F(z) F(y) dy] as J(σk) = Eϵ[ R z x+σkϵ F(z) F(y) dy|ϵ Φ( |σk)]. We seek to show that J(σk) is of the form given in the statement of the lemma. First note that J (σk) = Eϵ[ϵ(F(z) F(x+σkϵ))|ϵ Φ(σk)]+ d x+σkϵ F(z) F(y) dy|ϵ Φ( |σk) where d dΦEϵ[Z(ϵ, σk)|ϵ Φ( |σk)] denotes the derivative of the expectation of Z(ϵ, σk) arising through the changes in Φ( |σk) induced by changes in σk (that is, if φ(ϵ; σk) denotes the density corresponding to Φ( |σk), then d dΦEϵ[Z(ϵ, σk)|ϵ Φ( |σk)] R Z(ϵ, σk) φ σk (ϵ; σk)dϵ). Similarly, letting dm dΦm Eϵ[Z(ϵ, σk)|ϵ Φ( |σk)] R Z(ϵ, σk) mφ σm k (ϵ; σk)dϵ for all m, we have J (σk) = Eϵ[ϵ2f(x + σkϵ)|ϵ Φ( |σk)] 2 d dΦEϵ[ϵ(F(z) F(x + σkϵ))|ϵ Φ( |σk)] x+σkϵ F(z) F(y) dy|ϵ Φ( |σk) , J (σk) = Eϵ[ϵ3f (x + σkϵ)|ϵ Φ( |σk)] + 3 d dΦEϵ[ϵ2f(x + σkϵ)|ϵ Φ( |σk)] dΦ2 Eϵ[ϵ(F(z) F(x + σkϵ))|ϵ Φ( |σk)] x+σkϵ F(z) F(y) dy|ϵ Φ( |σk) , J (σk) = Eϵ[ϵ4f (x + σkϵ)|ϵ Φ( |σk)] + 4 d dΦEϵ[ϵ3f (x + σkϵ)|ϵ Φ( |σk)] dΦ2 Eϵ[ϵ2f(x + σkϵ)|ϵ Φ( |σk)] dΦ3 Eϵ[ϵ(F(z) F(x + σkϵ))|ϵ Φ( |σk)] x+σkϵ F(z) F(y) dy|ϵ Φ( |σk) , Note that when σk = 0, we have Eϵ[ R z x+σkϵ F(z) F(y) dy|ϵ Φ( |σk)] = R z x F(z) F(y) dy for any distribution Φ( |σk), Eϵ[ϵ(F(z) F(x + σkϵ))|ϵ Φ( |σk)] = Eϵ[ϵ(F(z) F(x))|ϵ Φ( |σk)] = 0 for any distribution Φ( |σk) with mean zero, and Eϵ[ϵ2f(x+σkϵ)|ϵ Φ( |σk)] = Eϵ[ϵ2f(x)|ϵ Φ( |σk)] = f(x) for any distribution Φ( |σk) with mean zero and variance one. Thus dm dΦm Eϵ[ R z x+σkϵ F(z) F(y) dy|ϵ Φ( |σk)] = 0, dm dΦm Eϵ[ϵ(F(z) F(x + σkϵ))|ϵ Φ( |σk)] = 0, and dm dΦm Eϵ[ϵ2f(x + σkϵ)|ϵ Φ( |σk)] = 0 for all m when evaluated at σk = 0. Machine Learning in an Auction Environment By using these facts, the fact that Φ( |0) is standard normal, and the above expressions for J(σk) and its derivatives, it follows that J(0) = R z x F(z) F(y) dy, J (0) = 0, J (0) = f(x), J (0) = 0, and J (0) = Eϵ[ϵ4f (x)|ϵ Φ( |0)] + 4 d dΦEϵ[ϵ3f (x)|ϵ Φ( |σk)]|σk=0. This in turn implies that the fourth-order Taylor approximation to Eϵ[ R z x+σkϵ F(z) F(y) dy] is x+σkϵ F(z) F(y) dy = Z z x F(z) F(y) dy + 1 2σ2 kf(x) + a(x)σ4 k + o(σ4 k), where a(x) 1 24[Eϵ[ϵ4f (x)|ϵ Φ( |0)] + 4 d dΦEϵ[ϵ3f (x)|ϵ Φ( |σk)]|σk=0]. Proof of Theorems 5 and 6: Since these results are special cases of Theorems 9 and 10 respectively, the proofs of these results are omitted. Before proving Theorem 7, we first introduce some notation for the finite-horizon version of this game. If the game has a finite time horizon and will last an additional T periods, we let Vk,T (x) denote the value of the dynamic program that arises when the auctioneer follows the optimal strategy. By analogy to Lemma 2, we know that Vk,T (x) = 1 1 δ x+σkϵ F(z) F(y) dy + δF(z)(Ex [Vk+1,T 1(x )] Vk,T 1(x)) when T > 0 and Vk,T (x) = Eϵ[ R x x+σkϵ F(x) F(y) dy] when T = 0. Also note that lim T Vk,T (x) = Vk(x), where Vk(x) is the value of the dynamic program for the original infinite-horizon game. Finally note that Lemma 21 Vk,T (x) is twice differentiable in x for all k and T. Furthermore, limk V k,T (x) = 0 and limk V k,T (x) = 0 for all T. Proof We prove this result by induction on T. The base case, T = 0, holds because the fact that f( ) is continuously differentiable implies F( ) is twice differentiable and Vk,T (x) = Eϵ[ R x x+σkϵ F(x) F(y) dy] is also twice differentiable in x. Furthermore, V k,T (x) = Eϵ[F(x) F(x + σkϵ) R x x+σkϵ f(x) dy] = Eϵ[F(x) F(x + σkϵ)] and V k,T (x) = Eϵ[f(x) f(x + σkϵ)], which both tend to zero as k . Thus the result holds for T = 0. Now suppose the result is true for T 1 and use this to prove the result must also hold for T. Since Vk,T (x) = 1 1 δ x+σkϵ F(z) F(y) dy + δF(z)(Ex [Vk+1,T 1(x ))] Vk,T 1(x)) , we know from analogy to Theorem 3 that the optimal bid z satisfies z = x+δ(Ex [Vk+1,T 1(x ))] Vk,T 1(x)). From the induction hypothesis, we thus know that the optimal bid zk(x) is twice differentiable in x and that limk z k(x) = 1 and limk z k(x) = 0. This in turn implies that Eϵ[ R zk(x) x+σkϵ F(zk(x)) F(y) dy] is twice differentiable in x and dx Eϵ[ R zk(x) x+σkϵ F(zk(x)) F(y) dy] = Eϵ[F(zk(x)) F(x+σkϵ) R zk(x) x+σkϵ z k(x)f(zk(x)) dy], which tends to zero as k since limk zk(x) = x. This also further implies that d2 d2x Eϵ[ R zk(x) x+σkϵ F(zk(x)) F(y) dy] = Eϵ[z k(x)f(zk(x)) f(x+σkϵ) (z k(x) 1)z k(x)f(zk(x)) Hummel and Mc Afee x+σkϵ z k(x)f(zk(x))+(z k(x))2f (zk(x)) dy], which tends to zero as k since limk zk(x) = x and limk z k(x) = 1. From the induction hypothesis, we also know that F(zk(x))(Ex [Vk+1,T 1(x )] Vk,T 1(x)) is twice differentiable in x and that the first and second derivatives of this expression with respect to x tend to zero as k . By combining this with the results in the previous two paragraphs, it follows that Vk,T (x) is twice differentiable in x and limk V k,T (x) = 0 and limk V k,T (x) = 0 for all T. The result follows by induction. We use these observations about the finite-horizon game to first prove that E[Vk+1(x ) Vk(x)] = O( 1 k2 ) in Lemma 22. Then we use this preliminary result to prove Theorem 7. Lemma 22 E[Vk+1(x ) Vk(x)] = O( 1 k2 ) for large k. Proof Note that if an ad is displayed, then one of two possible things will happen to the ad either the ad will receive a click or the ad will not receive a click. Let p denote the probability that the ad will receive a click, let xc denote the estimated e CPM of the ad if the ad receives a click, and let xn denote the estimated e CPM of the ad if the ad does not receive a click. Note that pxc + (1 p)xn = x. From Lemma 21 we know that Vk,T (x) is twice differentiable in x for all k and T. Thus the second-order Taylor approximations for Vk+1,T (xc) and Vk+1,T (xn) are Vk+1,T (xc) = Vk+1,T (x) + V k+1,T (x)(xc x) + 1 2V k+1,T (x)(xc x)2 + o(xc x)2 Vk+1,T (xn) = Vk+1,T (x) + V k+1,T (x)(xn x) + 1 2V k+1,T (x)(xn x)2 + o(xn x)2. Thus if x denotes the actual realization of the estimated e CPM after the ad has been shown k + 1 times (x will equal xc with probability p and xn with probability 1 p), then by using the fact that pxc + (1 p)xn = x and by taking a weighted average of the two previous equations, we find that E[Vk+1,T (x )] = p Vk+1,T (xc) + (1 p)Vk+1,T (xn) = Vk+1,T (x) + 1 2V k+1,T (x)E[(x x)2] + o(E[(x x)2]). From this it follows that E[Vk+1,T (x ) Vk,T (x)] = Vk+1,T (x) Vk,T (x)+ 1 2V k+1,T (x)E[(x x)2]+o(E[(x x)2]). (1) If c denotes the number of clicks that an ad has received so far, then the predicted click-through rate for an ad that has received a large number of impressions, k, will be approximately c k. Thus if b denotes the bid per click that the ad places, then the e CPM for an ad that has received c clicks and has been shown k times will be x bc k . From this it follows that xc b(c+1) k+1 , xn bc k+1, xc x b(k c) k(k+1), and xn x bc k(k+1). Thus Machine Learning in an Auction Environment k) for all possible realizations of x , and (x x)2 = O( 1 k2 ). Furthermore, from Lemma 21 we know that limk V k+1,T (x) = 0. Thus we can rewrite equation (1) as E[Vk+1,T (x ) Vk,T (x)] = Vk+1,T (x) Vk,T (x) + o 1 By using the fact that lim T Vk,T (x) = Vk(x), where Vk(x) denotes the value of the dynamic program in the original infinite horizon game, we then know that E[Vk+1(x ) Vk(x)] = Vk+1(x) Vk(x) + o 1 k + 1 + O 1 Proof of Theorem 7: We have seen in the proof of Lemma 22 that E[Vk+1(x ) Vk(x)] = Vk+1(x) Vk(x) + o 1 k2 . When combined with the fact that Vk(x) = v(x) k2 ), this immediately implied that E[Vk+1(x ) Vk(x)] = O 1 k2 . If we are able to further prove that we can write Vk(x) = v(x) k2 ) for some function w(x), it will then follow that Vk+1(x) Vk(x) = v(x) k(k+1) + o( 1 k2 ). Thus we first seek to show that we can write Vk(x) as Vk(x) = v(x) k2 ). Since E[Vk+1(x ) Vk(x)] = O( 1 k2 ) for large k and the optimal bidding strategy entails setting z = x + δ(E[Vk+1(x ) Vk(x)]), it must be the case that z = x + O( 1 k2 ) for large k. From this it follows that R z x F(z) F(y) dy = o( 1 k2 ) under the optimal bidding strategy z for large k. Now we have seen in Lemma 4 that Eϵ[ R z x+σkϵ F(z) F(y) dy] = R z x F(z) F(y) dy + 1 2σ2 kf(x)+a(x)σ4 k+o(σ4 k) for some constant a(x) for large k. Since R z x F(z) F(y) dy = o( 1 under the optimal bidding strategy z and σ2 k = s2(x) k2 ) for large k, it then follows that Eϵ[ R z x+σkϵ F(z) F(y) dy] = 1 2ks2(x)f(x) + 1 k2 [h(x)f(x) 2 + a(x)s4(x)] + o( 1 k2 ) for large k, which we can rewrite as Eϵ[ R z x+σkϵ F(z) F(y) dy] = 1 2ks2(x)f(x) + 1 k2 u(x) + o( 1 k2 ), where u(x) h(x)f(x) 2 + a(x)s4(x). But Eϵ[ R z x+σkϵ F(z) F(y) dy] represents the auctioneer s per-period payoffin the next auction. Thus if x denotes the estimated e CPM of the ad after an additional j periods have passed and k denotes the number of impressions the ad has received after an additional j periods have passed, then the auctioneer s per-period payoffin the period after an additional j periods have passed is 1 2k s2(x )f(x ) 1 2k 2 u(x ) + o( 1 k 2 ). The difference between this and 1 2ks2(x)f(x) is 2k s2(x )f(x ) 1 2k 2 u(x ) + 1 2ks2(x)f(x) + o 1 Hummel and Mc Afee = k s2(x)f(x) ks2(x )f(x ) 2kk 1 2k2 u(x) + 1 2k2 u(x) 1 2k 2 u(x ) + o 1 = k s2(x)f(x) ks2(x )f(x ) 2kk 1 2k2 u(x) + k 2u(x) k2u(x ) 2k2k 2 + o 1 = k s2(x)f(x) k[s2(x)f(x) + (x x)d(x) + o(x x)] 2kk 1 2k2 u(x) + k 2u(x) k2[u(x) + O(x x)] 2k2k 2 + o 1 where d(x) denotes the derivative of the function s2(x)f(x) with respect to x. By the same reasoning as in the proof of Lemma 22, we know that x x = O( 1 k) for all possible realizations of x . Thus we can rewrite the expression in equation (2) as (k k)s2(x)f(x) k(x x)d(x) 2kk 1 2k2 u(x) + (k 2 k2)u(x) 2k2k 2 + o 1 = (k k)s2(x)f(x) k(x x)d(x) 2kk 1 2k2 u(x) + o 1 Now note that E[x ] = x, where the expectation is taken over the uncertain realization of x in another j periods. Thus the expectation of the expression in equation (3) is E (k k)s2(x)f(x) 1 2k2 u(x) + o 1 where the expectation is taken over the uncertain realization of k . This expression can in turn be written as mj(x)s2(x)f(x) u(x) where mj(x) denotes the expected number of additional impressions that the ad with uncertain e CPM receives after an additional j periods have passed (which will equal 0 when j = 0 and vary approximately linearly with j for large k). The expression in equation (5) gives the difference between the auctioneer s actual expected payoffin the period after an additional j periods have passed and 1 2ks2(x)f(x). From this it follows that the difference between the auctioneer s actual payoffVk(x) and the payoffthe auctioneer would receive if the auctioneer obtained a payoffof 1 2ks2(x)f(x) in every future period is P j=0 δj mj(x)s2(x)f(x) u(x) k2 , which can be written as w(x) for some function w(x). Thus we can write Vk(x) as Vk(x) = v(x) k2 ) for some function w(x). But we have seen in the proof of Lemma 22 that E[Vk+1(x ) Vk(x)] = Vk+1(x) Vk(x) + o 1 k2 . Since Vk(x) = v(x) k2 ), it then follows that Vk+1(x) Vk(x) = v(x) k(k+1) + o( 1 Proof of Theorem 8: Recall that Vk(x) = 1 1 δ x+σkϵ F(z) F(y) dy + δF(z)(Ex [Vk+1(x )] Vk(x)) Machine Learning in an Auction Environment and a second-order Taylor approximation for Eϵ[ R z x+σkϵ F(z) F(y) dy] is x+σkϵ F(z) F(y) dy = Z z x F(z) F(y) dy + 1 2σ2 kf(x) + o(σ2 k) Now z = x + δ(Ex [Vk+1(x )] Vk(x)), so z x δVk(x), a term which is O(f(x)σ2 k). From this it follows that R z x F(z) F(y) dy = O(F(x)f(x)σ2 k) = o(f(x)σ2 k). And we also know that δF(z)(Ex [Vk+1(x )] Vk(x)) δF(z)Vk(x) = O(F(x)f(x)σ2 k) = o(f(x)σ2 k). Combining these results gives Eϵ[ R z x+σkϵ F(z) F(y) dy] = 1 2σ2 kf(x) + o(f(x)σ2 k) and F(z)(Ex [Vk+1(x )] Vk(x)) = o(f(x)σ2 k). Substituting this in to our expression for Vk(x) then gives Vk(x) = 1 2(1 δ)f(x)σ2 k + o(f(x)σ2 k). Proof of Theorem 9: First note that it must be the case that V k(x) = Ω( 1 k) for large k. We know that σ2 a,ka = Θ( 1 ka ) = Θ( 1 βak) = Θ( 1 k) for large k, so the immediate reward in any given period is at least on the same order as 1 k regardless of which ad-context pair a arises in the auction. Thus we know that V k(x) = Ω( 1 k) for large k. We also know that V k(x) = O( 1 k) for large k. To see this, note that the auctioneer can ensure that his loss in an auction involving the ad-context pair a in any given period is O( 1 ka ) = O( 1 k) by bidding za = xa, so the auctioneer can thus ensure that his expected loss in any given period is O( 1 k) unconditional on the precise ad-context pair that arises. And if the auctioneer s loss in any given period is O( 1 k), then the player s total loss from the game will also be no greater than O( 1 k) because the present value of the sum of losses that are Θ( 1 k), P j=k δj k 1 j , is also Θ( 1 k) since 1 < P j=k δj k k j < P j=k δj k = 1 1 δ implies 1 k < P j=k δj k 1 j < 1 (1 δ)k. Thus V k(x) = Θ( 1 k) for large k. Proof of Theorem 10: Since V k(x) = Θ( 1 k) for large k, we have Ex (a)[V k (a, k)(x (a))] V k(x) = O( 1 k) for large k. Thus since the optimal bidding strategy entails setting za = xa + δ(Ex (a)[V k (a, k)(x (a))] V k(x)), it must be the case that za = xa + O( 1 k) for large k. From this it follows that R za xa Fa(za) Fa(y) dy = O( 1 k2 ) under the optimal bidding strategy za for large k. Now we have seen in Lemma 4 that Eϵ[ R za xa+σa,kaϵ Fa(za) Fa(y) dy] = R za Fa(y) dy + 1 2σ2 a,kafa(xa) + O(σ4 a,ka) for large k. Since R za xa Fa(za) Fa(y) dy = O( 1 the optimal bidding strategy za and σ2 a,ka = s2 a(xa) k2 ) for large k, it then follows that Eϵ[ R za xa+σa,kaϵ Fa(za) Fa(y) dy] = 1 2ka s2 a(xa)fa(xa) + O( 1 k2 ) for large k. But Eϵ[ R za xa+σa,kaϵ Fa(za) Fa(y) dy] represents the auctioneer s per-period payoffif the next auction is an auction for the advertiser-context pair a. Thus the auctioneer s expected per-period payoffunconditional on what ad-context pair appears in the next auction is Pm a=1 πa 1 2ka s2 a(xa)fa(xa) + O( 1 k2 ) = Pm a=1 πa 1 2βaks2 a(xa)fa(xa) + O( 1 k2 ) for large k. From this it follows that if g(x) Pm a=1 πa 1 2βa s2 a(xa)fa(xa), then the expected per-period utility that one obtains at each point in the game unconditional on what ad-context pair appears in the next auction is 1 kg(x) + O( 1 k2 ). Since V k(x) can alternatively be expressed as the discounted sum of the per-period utility that one can obtain at each point in the game, it then follows that |k V k(x)| P j=k δj k[g(x)]+O( 1 k), meaning |k V k(x)| 1 1 δg(x)+O( 1 k) and |k V k(x)| P j=k δj k[k Hummel and Mc Afee k) = 1 1 δg(x) + O( 1 k) in the limit as k . From this it follows that |k V k(x)| = 1 1 δg(x) + O( 1 k) and k V k(x) = 1 1 δg(x) + O( 1 k) = 1 2(1 δ) Pm a=1 πa 1 βa s2 a(xa)fa(xa) + O( 1 Proof of Theorem 13: Under the knowledge gradient framework, the incremental amount that one increases one s bid by beyond the expected value of the advertising opportunity is δ(Ex [Uk+1(x )] Uk(x)). And under the full dynamic programming problem, the incremental amount that one increases one s bid is δ(Ex [Vk+1(x )] Vk(x)). Thus if Vk+1 denotes the difference between the values of Ex [Vk+1(x )] and Ex [Uk+1(x )] and Vk denotes the difference between the values of Vk(x) and Uk(x), then the difference between the incremental amount that one increases one s bid under the full dynamic programming problem and under the knowledge gradient framework is δ( Vk+1 Vk). But a condition of the theorem is that Vk+1 < Vk. Thus δ( Vk+1 Vk) < 0, and the difference between the incremental amount that one increases one s bid under the full dynamic programming problem and the incremental amount that one increases one s bid under the knowledge gradient framework is negative. From this it follows that the incremental amount by which one would increase one s bid under the knowledge gradient framework is indeed greater than it is under the full dynamic programming problem. Proof of Theorem 14: In the knowledge gradient framework, Uk(x) is just the discounted sum of the value of simply bidding z = x in each period when an ad has received k impressions so far and one s best estimate for the e CPM of the ad is x. Now we know from applying Lemma 4 to the special case in which z = x that the per-period payofffrom bidding z = x in each period when an ad has received k impressions so far and one s best estimate for the e CPM of the ad is x is 1 2σ2 kf(x) a(x)σ4 k + o(σ4 k) = 1 2ks2(x)f(x) + O( 1 k2 ). Thus Uk(x) = 1 2(1 δ)ks2(x)f(x) + O( 1 But we have seen in Theorem 7 that when Vk(x) = 1 2(1 δ)ks2(x)f(x) + O( 1 k2 ) and the per-period payofffrom making the optimal bid is 1 2σ2 kf(x) a(x)σ4 k + o(σ4 k), then it must be the case that Ex [Vk+1(x ))] Vk(x) = v(x) k(k+1) + o 1 k2 for large k, where v(x) 1 2(1 δ)s2(x)f(x). An identical argument illustrates that when Uk(x) = 1 2(1 δ)ks2(x)f(x) + O( 1 k2 ) and the per-period payofffrom making the optimal bid is 1 2σ2 kf(x) a(x)σ4 k +o(σ4 k), then it must be the case that Ex [Uk+1(x ))] Uk(x) = v(x) k(k+1) + o 1 k2 for large k, where v(x) 1 2(1 δ)s2(x)f(x). The result then follows. Proof of Theorem 16: Since the optimal bid for advertiser i if advertiser i submits the highest e CPM bid amongst the bidders with unknown e CPMs satisfies zi = xi + δ(Ex (i)[V k (i)(x (i))] V k(x)), it follows that Exi[ R zi xi F(zi) dy+δF(zi)(Ex (i)[V k (i)(x (i))] V k(x))] = 0 when advertiser i submits the optimal bid zi. By substituting this into the equation for the value of the dynamic programming problem V k(x), it follows that the value of this dynamic programming problem is always equal to 1 1 δ R zi xj F(y) dy, which is an increasing function of zi. From this it follows that if the optimal bid for advertiser i if advertiser i submits the highest bid of the advertisers with unknown e CPMs is higher than the optimal bid for all other advertisers with unknown e CPMs if one of these other advertisers submits the highest bid of the advertisers with unknown e CPMs, then the decision maker s Machine Learning in an Auction Environment payofffrom the game is maximized by having advertiser i submit the highest bid of all the advertisers with unknown e CPMs. Proof of Theorem 17: Recall from Lemma 4 that the auctioneer s per-period payoff if the auctioneer uses a bid for the advertiser with unknown e CPM that is equal to z is Eϵ[ R z x+σkϵ F(z) F(y) dy] = R z x F(z) F(y) dy 1 2σ2 kf(x) + o(σ2 k) for large k. Now if z = x + c(x) kα for some constant c(x) = 0, then R z x F(z) F(y) dy = R x+ c(x) kα x f(x)(x + c(x) kα y) dy + o( 1 k2α ) = f(x)c2(x) 2k2α + o( 1 k2α ). Thus the auctioneer s per-period payoffif the auctioneer uses a bid for the ad with unknown e CPM of the form z = x + c(x) 2k2α f(x) 1 2σ2 kf(x) + o( 1 k2α ) if c(x) = 0 and 1 2σ2 kf(x) + o(σ2 k) if c(x) = 0. Thus if c(x) = 0, the auctioneer s per-period payoffis 1 2ks2(x)f(x) + o( 1 k). We then know from similar reasoning to that in the proof of Theorem 10 that if this is the auctioneer s per-period payoff, then the auctioneer s total payofffrom the game is 1 2(1 δ)ks2(x)f(x) + o( 1 k) regardless of the learning rate. Similarly, if c(x) = 0 and α = 1 2, then the auctioneer s per-period payoffis 1 2kf(x)(s2(x) + c2(x)) + o( 1 k), and we know from identical reasoning that the auctioneer s total payoffis 1 2(1 δ)kf(x)(s2(x)+c2(x))+o( 1 k), which is strictly less than the auctioneer s total payofffrom the game when c(x) = 0 for sufficiently large k. Finally, if c(x) = 0 and α < 1 2, the auctioneer s per-period payoffis c2(x) 2k2α f(x)+o( 1 k2α ). Since the auctioneer s total payoffis the discounted sum of the auctioneer s per-period payoffs, it follows that if Vk(x) denotes the auctioneer s total payofffrom using this strategy, then k2αVk(x) P j=k δj k[ 1 j )2αc2(x)f(x)]+o(1) = 1 2(1 δ)c2(x)f(x)+o(1) in the limit as k . Thus if c(x) = 0 and α < 1 2, the auctioneer s total payoffis no greater than 1 2(1 δ)k2α c2(x)f(x)+o( 1 k2α ), which is less than 1 2ks2(x)f(x)+o( 1 k), the auctioneer s payoff from using the constant c(x) = 0 for sufficiently large k. From this and the result in the previous paragraph it follows that if the auctioneer uses the strategy in the statement of this theorem, the auctioneer s payoffwill be maximized when c(x) = 0 for sufficiently large k. Observation 23 Suppose the auctioneer displays the ad with the highest e CPM bid with probability 1 ϵ and displays an ad uniformly at random with probability ϵ > 0. Then the optimal constant ϵ for such an algorithm is ϵ = 0 for sufficiently large k. Proof Recall from Lemma 4 that the auctioneer s per-period payoffif the auctioneer uses a bid for the advertiser with unknown e CPM that is equal to z is Eϵ[ R z x+σkϵ F(z) F(y) dy] = R z x F(z) F(y) dy 1 2σ2 kf(x)+o(σ2 k) for large k. Note that displaying an ad uniformly at random is equivalent to making a bid of 0 for the ad with unknown e CPM with probability 1 2 and making a bid of for the ad with unknown e CPM with probability 1 2. Since R z x F(z) F(y) dy > 0 for either z = 0 or z = , it follows that the auctioneer s expected per-period payoffif the auctioneer follows the strategy in the statement of the observation is no greater than cϵ for some constant c > 0 for large k. However, if the auctioneer always uses a bid of z = x (as would be the case when ϵ = 0), then we know from the proof of Theorem 17 that the auctioneer s per-period payoff is 1 2ks2(x)f(x) + o( 1 k) for large k. Thus for sufficiently large k, the auctioneer always Hummel and Mc Afee achieves a larger per-period payoffby setting ϵ = 0 than by using any positive value of ϵ, so the optimal constant for this algorithm is ϵ = 0. Proof of Theorem 18: We know from Theorem 7 that Ex [Vk+1(x )] Vk(x) = v(x) k(k+1) + o 1 k2 for large k, where v(x) = 1 2(1 δ)s2(x)f(x), and we also know from the proof of Theorem 3 that the derivative of the seller s expected payofffrom making a bid of z with respect to z is f(z)(x z + δ(Ex [Vk+1(x )] Vk(x)). Thus if V Ex [Vk+1(x )] Vk(x), then the difference between the auctioneer s expected payofffrom making a bid of x and a bid of x + δ 2(1 δ)k(k+1)s2(x)f(x) is Z x+δ V +o( V ) x f(z)(x z + δ( V ) + o( V )) dz = f(x)δ2( V )2 2(1 δ) + o(( V )2). And since V = Ex [Vk+1(x )] Vk(x) = v(x) k(k+1) + o 1 k2 = s2(x)f(x) 2(1 δ)k(k+1) + o 1 k2 , it follows that the difference between the auctioneer s payofffrom making a bid of x and a bid of x + δ 2(1 δ)k(k+1)s2(x)f(x) is δ2 8(1 δ)3k4 s4(x)f3(x) + o( 1 Proof of Theorem 19: The theoretically optimal strategy for the auctioneer would entail submitting a bid of z = x + δ(Ex [Vk+1(x )] Vk(x)) in each time period. By the same reasoning as in the proof of Theorem 18, we know the difference between the auctioneer s expected payofffrom making a bid of x and making a bid of z = x + δ(E θk+1[Vk+1(x)] Vk(x)) is f(x)δ2( V )2 2(1 δ) +o(( V )2). Since the auctioneer s payofffrom using the approximately optimal bidding strategy is also f(x)δ2( V )2 2(1 δ) +o(( V )2), it follows that the difference between the auctioneer s payoffunder the approximately optimal bidding strategy and the maximum possible payoffthe auctioneer could obtain theoretically is o(( V )2) = o( 1 k4 ). But we know from Theorem 18 that the difference between the auctioneer s payoffunder the approximately optimal bidding strategy and the greedy strategy is δ2 8(1 δ)3k4 s4(x)f2(x)+ k4 ). Thus the difference between the auctioneer s payoffunder this strategy and the maximum possible payoffthe auctioneer could obtain under the theoretically optimal strategy becomes vanishingly small compared to the difference between the auctioneer s payoffunder this strategy and the auctioneer s payoffunder the greedy strategy for large k. Proof of Theorem 20: A consequence of Theorem 13 is that the difference between the auctioneer s payoffunder the theoretically optimal strategy and the auctioneer s payofffrom the greedy strategy is no greater than the difference between the auctioneer s payofffrom the theoretically optimal strategy and the auctioneer s payoffunder the greedy strategy when no learning is possible in future periods. Thus we seek to bound the difference between the auctioneer s payofffrom the theoretically optimal strategy and the auctioneer s payoff under the greedy strategy when no learning is possible in future periods. Let α and β denote the parameters of the beta distribution. If no learning ever took place and the auctioneer followed the greedy strategy, then in all periods the auctioneer would show the highest competing bidder if this bidder had an e CPM bid p satisfying p > α α+β and show the bidder with unknown e CPM otherwise. If the auctioneer showed the ad with unknown e CPM in the first period, this ad received a click, and no learning Machine Learning in an Auction Environment took place in future periods, then in future periods the auctioneer would show the highest competing bidder if and only if this bidder had an e CPM bid p satisfying p > α+1 α+β+1. And if the auctioneer showed the ad with unknown e CPM, this ad did not receive a click, and no learning took place in future periods, then in future periods the auctioneer would show the highest competing bidder if and only if this bidder had an e CPM bid p satisfying p > α α+β+1. From this it follows that if no learning takes place in future periods, then the auctioneer s payoffin any given period in the future is guaranteed to be the same regardless of whether the ad with unknown e CPM was shown in the first period if either p > α+1 α+β+1 or p < α α+β+1 in that particular period. The only circumstances under which the auctioneer s expected payoffin a future period t will differ as a result of showing the ad with unknown e CPM in the first period is if this ad receives a click in the first period and p ( α α+β, α+1 α+β+1) in period t or if the ad does not receive a click in the first period and p ( α α+β+1, α α+β) in period t. In the first case, the auctioneer s payoffin period t as a result of showing the ad with unknown e CPM in the first period exceeds the auctioneer s payoffunder normal circumstances by α+1 α+β+1 p, and in the second case the auctioneer s payoffin period t as a result of showing the ad with unknown e CPM in the first period exceeds the auctioneer s payoffunder normal circumstances by an amount p α α+β+1. Now the probability the ad with unknown e CPM receives a click in the first period if this ad is shown is α α+β and the probability this ad does not receive a click in the first period if this ad is shown is β α+β. By combining this with the result in the previous paragraph, it follows that the maximum possible expected payoffdifference that the auctioneer can obtain from future periods as a result of showing the ad with unknown e CPM in the first period is δ 1 δ[ α α+β R α+1 α+β+1 α α+β ( α+1 α+β+1 p)f dp + β α+β R α α+β α α+β+1 (p α α+β+1)f dp], where f supp f(p). This payoffdifference equals δf 2(1 δ)[ α α+β( α+1 α+β+1 α α+β)2 + β α+β( α α+β α α+β+1)2] = δf 2(1 δ)[ α α+β( β (α+β)(α+β+1))2 + β α+β( α (α+β)(α+β+1))2] = δf 2(1 δ) αβ(α+β) (α+β)3(α+β+1)2 = δf 2(1 δ) (α+β)2α2β2 αβ(α+β)4(α+β+1)2 . Now the expected value for a beta distribution is α α+β and the variance in a beta dis- tribution is αβ (α+β)2(α+β+1). Thus since the bidder s expected e CPM is ω and the standard deviation in the bidder s expected e CPM is γω, it follows that α+β β = 1 1 ω, and α2β2 (α+β)4(α+β+1)2 = γ4ω4. From this it follows that δf 2(1 δ) (α+β)2α2β2 αβ(α+β)4(α+β+1)2 = δf 2(1 δ) γ4ω3 1 ω . Thus the maximum additional payoffincrease that one can obtain from future periods as a result of showing the ad with uncertain e CPM in the first period is no greater than δf 2(1 δ) γ4ω3 Now if V denotes the change in payoffthat one obtains from future periods as a result of showing the ad with uncertain e CPM in the first period, then the value of showing the ad with uncertain e CPM in the first period is ω + V . Thus the theoretically optimal strategy will specify a bid of ω+ V for the bidder with uncertain e CPM, whereas the greedy strategy will specify a bid of ω, so the theoretically optimal strategy will only show a different ad when the highest competing e CPM bid, p, satisfies p [ω, ω + V ]. Furthermore, in the cases where the theoretically optimal strategy specifies a different bid, the theoretically optimal strategy achieves a payoffthat exceeds that of the greedy strategy by an amount ω+ V p, where p denotes the highest competing e CPM bid. From this it follows that the Hummel and Mc Afee difference in expected payoffthat one obtains as a result of using the theoretically optimal strategy rather than the greedy strategy is no greater than 1 1 δ R ω+ V ω (ω + V p)f(p) dp. Thus if f supp f(p), this payoffdifference is no greater than f 1 δ R ω+ V ω (ω + V p) dp = f 1 δ ( V )2 2 . Thus the difference between the auctioneer s payoffunder the theoretically optimal strategy and the auctioneer s payofffrom the greedy strategy is no greater than f 1 δ ( V )2 2 . But we have seen earlier that the maximum additional payoffincrease that one can obtain from future periods as a result of showing the ad with uncertain e CPM in the first period is no greater than δf 2(1 δ) γ4ω3 1 ω . Thus we know that V δf 2(1 δ) γ4ω3 1 ω . By combining this with the result in the previous paragraph, we see that the difference between the maximum possible payoffthe auctioneer could obtain under the theoretically optimal strategy and the auctioneer s payofffrom the greedy strategy is no greater than δ2γ8ω6f 3 8(1 δ)3(1 ω)2 . D. Agarwal, B.C. Chen, and P. Elango. Explore/exploit schemes for web content optimization. In Proceedings of the 9th Industrial Conference on Data Mining (ICDM), pages 1-10. IEEE, 2009. P. Aghion, P. Bolton, C. Harris, and B. Jullien. Optimal learning by experimentation. Review of Economic Studies, 58(4):621-654, 1991. P. Aghion, M.P. Espinosa, and B. Jullien. Dynamic duopoly with learning through market experimentation. Economic Theory, 3(3):517-539, 1993. N. Anthonisen. On learning to cooperate. Journal of Economic Theory, 107(2):253-287, 1993. N. Anthonisen. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2785-2836, 2010. P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235-256, 2002. P. Auer, N. Cesa-Bianchi, and P. Fischer. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2003. M. Babaioff, Y. Sharma, and A. Slivkins. Characterizing truthful multi-armed bandit mechanisms. In Proceedings of the 10th ACM Conference on Electronic Commerce (EC), pages 79-88. ACM, 2009. A. Banerjee and D. Fudenberg. Word-of-mouth learning. Games and Economic Behavior, 46(1):1-22, 2004. J.S. Banks and R.K. Sundaram. Denumerable-armed bandits. Econometrica, 60(5):10711096, 2004. Machine Learning in an Auction Environment E. Bax, A. Kuratti, P. Mc Afee, and J. Romero. Comparing predicted prices in auctions for online advertising. International Journal of Industrial Organization, 30(1):80-88, 2011. D. Bergemann and J. V alim aki. Experimentation in markets. Review of Economic Studies 67(2):213-234, 2000. D. Bergemann and J. V alim aki. Learning and strategic pricing. Econometrica, 64(5):11251149, 1996. D. Bergemann and J. V alim aki. Market diffusion with two-sided learning. RAND Journal of Economics 28(4):773-795, 1997. D. Bergemann and J. V alim aki. Stationary multi-choice bandit problems. Journal of Economic Dynamics and Control 25(1):1585-1594, 2001. P. Bolton and C. Harris. Strategic experimentation. Econometrica 67(2):349-374, 1999. M. Brezzi and T.L. Lai. Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control 27(1):87-108, 2002. S. Callander. Searching for good policies. American Political Science Review 105(4):643-662, 2011. S. Callander and P. Hummel. Preemptive policy experimentation. Econometrica 82(4):15091528, 2014. S.E. Chick and N. Gans. Economic analysis of simulation selection problems. Management Science 55(3):421-437, 2009. N.R. Devanur and S.M. Kakade. The price of truthfulness for pay-per-click auctions. In Proceedings of the 10th ACM Conference on Electronic Commerce (EC), pages 99-106. ACM, 2009. A. Fishman and R. Rob. Experimentation and competition. Journal of Economic Theory 78(2):299-320, 1998. P. Frazier, W. Powell, and S. Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing 21(4):599-613, 2009. D. Gale. What have we learned from social learning? European Economic Review 40(35):617-628, 2011. D. Gale and R.W. Rosenthal. Experimentation, imitation, and stochastic stability. Journal of Economic Theory 84(1):1-40, 2011. A. Ghate. Optimal minimum bids and inventory scrapping in sequential, single-unit, Vickrey auctions with demand learning. European Journal of Operations Research 245(2):555-570, 2015. J.C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B 41(2):148-177, 1979. Hummel and Mc Afee E. Hazan and S. Kale. Better algorithms for benign bandits. Journal of Machine Learning Research 12:1287-1311, 2011. K. Iyer, R. Johari, and M. Sundararajan. Mean field equilibria of dynamic auctions with learning. Management Science 60(12):2949-2970, 2014. S.M. Kakade, I. Lobel, and H. Nazerzadeh. Optimal dynamic mechanism design and the virtual pivot mechanism. Operations Research 61(4):837-854, 2013. G. Keller and S. Rady. Optimal experimentation in a changing environment. Review of Economic Studies 66(3):475-503, 1999. G. Keller and S. Rady. Strategic experimentation with Poisson bandits. Theoretical Economics 5(2):275-311, 2010. G. Keller, S. Rady, and M. Cripps. Strategic experimentation with exponential bandits. Econometrica 73(1):39-68, 2010. S. Lahaie and R.P. Mc Afee. Efficient ranking in sponsored search. In Proceedings of the 7th International Workshop on Internet and Network Economics (WINE), pages 254-265. Springer, 2011. T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6:4-22, 1985. S.M. Li, M. Mahdian, and R.P. Mc Afee. Value of learning in sponsored search auctions. In Proceedings of the 6th International Workshop on Internet and Network Economics (WINE), pages 294-305. Springer, 2010. S. Mannor and J.N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research 5:623-648, 2004. B.C. May, N. Korda, A. Lee, and D.S. Leslie. Optimistic Bayesian sampling in contextualbandit problems. Journal of Machine Learning Research 13(1): 2069-2106, 2012. R.P. Mc Afee. The design of advertising exchanges. Review of Industrial Organization 39(3):169-185, 2011. L.J. Mirman, L. Samuelson, and A. Urbano. Monopoly experimentation. International Economic Review 34(3):549-563, 1993. G. Moscarini and L. Smith. The optimal level of experimentation. Econometrica 69(6):16291644, 2001. M. Ostrovsky and M. Schwarz. Reserve prices in Internet advertising auctions: a field experiment. Stanford University Typescript, 2009. A. Pavan, I. Segal, and J. Toikka. Dynamic mechanism design: a Myersonian approach. Econometrica 82(2):601-653, 2014. Machine Learning in an Auction Environment M. Rothschild. A two-armed bandit theory of market pricing. Journal of Economic Theory 9(2):185-202, 1974. A. Rusitchini and A. Wolinsky. Learning about variable demand in the long run. Journal of Economic Dynamics and Control 19(5-7):1283-1292, 1995. I.O. Ryzhov, P.I. Frazier, and W.B. Powell. On the robustness of a one-period look-ahead policy in multi-armed bandit problems. Procedia Computer Science 1:1629-1639, 2010. I.O. Ryzhov, W.B. Powell, and P.I. Frazier. The knowledge gradient algorithm for a general class of online learning problems. Operations Research 60(1):180-195, 2012. K.H. Schlag. Why imitate, and if so, how? A boundedly rational approach to multi-armed bandits. Journal of Economic Theory 78(1):130-156, 1998. A. Slivkins. Contextual bandits with similarity information. Journal of Machine Learning Research 15(1):2533-2568, 2014. B. Strulovici. Learning while voting: determinant of collective experimentation. Econometrica 78(3):933-971, 2010. H.R. Varian. Online ad auctions. American Economic Review: Papers & Proceedings 99(2):430-434, 2009. X. Vives. Learning from others: a welfare analysis. Games and Economic Behavior 20(2):177-200, 1997. M.L. Weitzman. Optimal search for the best alternative. Econometrica 47(3):641-654, 1979. J. Wortman, Y. Vorobeychik, L. Li, and J. Langford. Maintaining equilibria during exploration in sponsored search auctions. In Proceedings of the 3rd International Workshop on Internet and Network Economics (WINE), pages 119-130. Springer, 2007.