# online_pricing_with_strategic_and_patient_buyers__1407ab42.pdf

Online Pricing with Strategic and Patient Buyers

Michal Feldman Tel-Aviv University and MSR Herzliya

michal.feldman@cs.tau.ac.il

Tomer Koren

Google Brain tkoren@google.com

Roi Livni Princeton University rlivni@cs.princeton.edu

Yishay Mansour Tel-Aviv University mansour@tau.ac.il

Aviv Zohar Hebrew University of Jerusalem

avivz@cs.huji.ac.il

We consider a seller with an unlimited supply of a single good, who is faced with a stream of T buyers. Each buyer has a window of time in which she would like to purchase, and would buy at the lowest price in that window, provided that this price is lower than her private value (and otherwise, would not buy at all). In this setting, we give an algorithm that attains O(T2/3) regret over any sequence of T buyers with respect to the best ﬁxed price in hindsight, and prove that no algorithm can perform better in the worst case.

1 Introduction

Perhaps the most common way to sell items is using a posted price mechanism in which the seller publishes the price of an item in advance, and buyers that wish to obtain the item decide whether to acquire it at the given price or to forgo the purchase. Such mechanisms are extremely appealing. The decision made by the buyer in a single-shot interaction is simple: if it values the item by more than the offering price, it should buy, and if its valuation is lower, it should decline. The seller on the other hand needs to determine the price at which she wishes to sell goods. In order to set prices, additive regret can be minimized using, for example, a multi-armed bandit (MAB) algorithm in which arms correspond to a different prices, and rewards correspond to the revenue obtained by the seller.

Things become much more complicated when the buyers who are facing the mechanism are patient and can choose to wait for the price to drop. The simplicity of posted price mechanisms is then tainted by strategic considerations, as buyers attempt to guess whether or not the seller will lower the price in the future. The direct application of MABs is no longer adequate, as prices set by such algorithms may ﬂuctuate at every time period. Strategic buyers can make use of this fact to gain the item at a lower price, which lowers the revenue of the seller and, more crucially, changes the seller s feedback for a given price. With patient buyers, the revenue from sales is no longer a result of the price at the current period alone, but rather the combined outcome of prices that were set in surrounding time periods, and of the expectation of buyers regarding future prices.

In this paper, we focus on strategic buyers that may delay their purchase in hopes of obtaining a better deal. We assume that each buyer has a valuation for the item, and a patience level which represents the length of the time-window during which it is willing to wait in order to purchase the item. Buyers wish to minimize the price during this period. Note that such buyers may interfere with naïve attempts to minimize regret, as consecutive days at which different prices are set are no longer independent.

Parts of this work were done while the author was at Microsoft Research, Herzliya.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

To regain the simplicity of posted prices for the buyers, we consider a setting in which the seller commits to the price in subsequent time periods in advance, publishing prices for the entire window of the buyers. Strategic buyers that arrive at the market are then able to immediately choose the lowest price within their window. Thus, given the valuation and patience of the buyers (the number of days they are willing to wait) their actions are clearly determined: buy it at a day that is within the buyer s patience window and price is cheapest, provided that it is lower than the valuation.

An important aspect of our proposed model is to consider for each buyer a window of time (rather than, for example, discounting). For example, when considering discounting, the buyers, in order to best respond, would have argue how would other buyers would behave and how would the seller adjust the prices in response to them. By ﬁxing a window of time, and forcing the seller to publish prices for the entire window, the buyers become price takers and their behavior becomes tractable to analyze.

As in previous works, we focus on minimizing the additive regret of the seller, assuming that the appearance of buyers is adversarial; that is, we do not make any statistical assumptions on the buyers valuation and window size (except for a simple upper bound). Speciﬁcally we assume that the values are in the range [0, 1] and that the window size is in the range {1, . . ., ˆ + 1}. The regret is measured with respect to the best single price in hindsight. Note that the benchmark of a ﬁxed price p implies that any buyer with value above p buys and any buyer with value below p does not buy. The window size has no effect when we have a ﬁxed price. On the other hand, for the online algorithm, having to deal with various window sizes create a new challenge.

The special case of this model where ˆ = 0 (and hence all buyers have window of size exactly one) was previously studied by Kleinberg and Leighton [11], who discussed a few different models for the buyer valuations and derived tight regret bounds for them. When the set of feasible prices is of constant size their result implies a (

T) regret bound with respect to the best ﬁxed price, which is also proven to be the best possible in that case. In contrast, in the current paper we focus on the case ˆ 1, where the buyers window sizes may be larger than one, and exhibit the following contributions:

(i) We present an algorithm that achieves O( ˆ 1/3T2/3) additive regret in an adversarial setting,

compared to the best ﬁxed posted price in hindsight. The upper bound relies on creating epochs, when the price within each epoch is ﬁxed and the number of epochs limit the number of times the seller switches prices. The actual algorithm that is used to select prices within an epoch is EXP3 (or can be any other multi-arm bandit algorithm with similar performance).

(ii) We exhibit a matching lower bound of ( ˆ 1/3T2/3) regret. The proof of the lower bound reveals

that the difﬁculty in achieving lower regret stems from the lost revenue that the seller suffers every time she tries to lower costs. Buyers from preceding time slots wait and do not purchase the items at the higher prices that prevailed when they arrive. We are thus able to prove a lower bound by reducing to a multi-armed bandit problem with switching costs. Our lower bound uses only two prices.

In other words, we see that as soon as the buyers patience increases from zero to one, the optimal regret rate immediately jumps from (

T) to (T2/3).

The rest of the paper is organized as follows. In the remainder of this section we brieﬂy overview related work. We then proceed in Section 2 to provide a formal deﬁnition of the model and the statement of our main results. We continue in Section 3 with a presentation of our algorithm and its analysis, present our lower bound in Section 4, and conclude with a brief discussion.

1.1 Related work

As mentioned above, the work most closely related to ours is the paper of Kleinberg and Leighton [11] that studies the case ˆ = 0, i.e., in which the buyers windows are limited to be all of size one. For a ﬁxed set of feasible prices of constant size, their result implies a (

T) regret bound, whereas for a continuum of prices they achieve a (T2/3) regret bound. The (T2/3) lower bound found in [11] is similar to our own in asymptotic magnitude, but stems from the continuous nature of the prices. In our case the lower bound is achieved for buyers with only 2 prices, a case in which Kleinberg and Leighton [11] have a bound of (

T). Hence, we show that such a bound can occur due to the strategic nature of the interaction itself.

A line of work appearing in [1, 12, 13] considers a model of a single buyer and a single seller, where the buyer is strategic and has a constant discount factor. The main issue is that the buyer continuously interacts with the seller and thus has an incentive to lower future prices at the cost of current valuations. They deﬁne strategic regret and derive near optimal strategic regret bounds for various valuation models. We differ from this line of work in a few important ways. First, they consider other either ﬁxed unknown valuation or stochastic i.i.d. valuations, while we consider adversarial valuations. Second, they consider a single buyer while we consider a stream of buyers. More importantly, in our model the buyers do not inﬂuence the prices they are offered, so the strategic incentives are very different. Third, their model uses discounting to model the decay of buyer valuation over time, while we use a window of time.

There is a vast literature in Algorithmic Game Theory on revenue maximization with posted prices, in settings where agents valuations are drawn from unknown distributions. For the case of a single good of unlimited supply, the goal is to approximate the best price, as a function of the number of samples observed and with a multiplicative approximation ratio. The work of Balcan et al. [4] gives a generic reduction which can be used to show that one can achieve an -optimal pricing with a sample of size O((H/ 2) log(H/ )), where H is a bound on the maximum valuation. The works of Cole and Roughgarden [8] and Huang et al. [10] show that for regular and Monotone Hazard Rate distributions sample bounds of ( 3) and ( 3/2), respectively, guarantee a multiplicative approximation of 1 .

Finally, our setting is somewhat similar to a unit-demand auction in which agents desire a single item out of several offerings. In our case, we can consider items sold at different times as different items and agents desire a single one that is within their window. When agents have unit-demand preferences, posted-price mechanisms can extract a constant fraction of the optimal revenue [5, 6, 7]. Note that a constant ratio approximation algorithm implies a linear regret in our model. On the other hand, these works consider a more involved problem from a buyer s valuation perspective.

2 Setup and Main Results

We consider a setting with a single seller and a sequence of T buyers b1, . . ., b T. Every buyer bt is associated with value vt 2 [0, 1] and patience t. A buyer s patience indicates the time duration in which the buyer stays in the system and may purchase an item.

The seller posts prices in advance over some time window. Let ˆ be the maximum patience, and assume that t ˆ for every t. Let pt denote the price at time t, and assume that all prices are chosen from a discrete (and normalized) predeﬁned set of n prices P = {0, 1

n, . . . 1}. At time t = 1, the seller posts prices p1, . . ., pˆ +1, and learns the revenue obtained at time t = 1 (the revenue depends on the buyers behavior, which is explained below). Then, at each time step t, the seller publishes a new price pt+ˆ 2 P, and learns the revenue obtained at time t, which she can use to set the next prices. Note that at every time step, prices are known for the next ˆ time steps.

The revenue in every time step is determined by the strategic behavior of buyers, which is explained next. Every buyer bt observes prices pt, . . ., pt+ t , and purchases the item at the lowest price among these prices (breaking ties toward earlier times), if she does not exceed her value. The revenue obtained from buyer bt is given by:

β(pt, . . ., pt+ˆ ; bt) =

min{pt, . . ., pt+ t } if min{pt, . . ., pt+ t } vt, 0 otherwise.

As bt has patience t, we will sometime omit the irrelevant prices and write β(pt, . . ., pt+ t ; bt) =

β(pt, . . ., pt+ˆ ; bt).

As we described, a buyer need not buy the item on her day of appearance and may choose to wait. If the buyer chooses to wait, we will observe the feedback from her decision only on the day of purchase. We therefore need to distinguish between the revenue from buyer t and the revenue at time t. Given a sequence of prices p1, . . ., pt+ˆ and a sequence of buyers b1, . . ., bt we deﬁne the revenue at time t to be the sum of all revenues from buyers that preferred to buy at time t. Formally, let It denote the set of all buyers that buy at time t, i.e.,

It = {bi : t = arg min{i t i + i : pt = β(pi . . ., pi+ˆ ; bi)}}.

Then the revenue obtained at time t is given by:

Rt(pt ˆ , . . ., pt+ˆ ) = R(p1, . . ., pt+ˆ ; b1:t) :=

i2It β(pi, . . . pi+ˆ ; bi)),

where we use the notation b1:T as a shorthand for the sequence b1, . . ., b T. The regret of the (possibly randomized) seller A is the difference between the revenue obtained by the best ﬁxed price in hindsight and the expected revenue obtained by the seller A, given a sequence of buyers:

Regret T(A; b1:T) = max

R(p , . . ., p ; b1:t) E

R(p1, . . . pt+ˆ ; b1:t)

We further denote by Regret T(A) the expected regret a seller A incurs for the worst case sequence, i.e., Regret T(A) = maxb1:T Regret T(A; b1:T).

2.1 Main Results

Our main result are optimal regret rates in the strategic buyers setting.

Theorem 1. The T-round expected regret of Algorithm 1 for any sequence of buyers b1, . . ., b T with patience at most ˆ 1 is upper bounded as Regret T 10( ˆ n log n)1/3T2/3.

Theorem 2. For any ˆ 1, n 2 and for any pricing algorithm, there exists a sequence of buyers b1, . . ., b T with patience at most ˆ such that Regret T = ( ˆ 1/3T2/3).

3 Algorithm

In this section we describe and analyze our online pricing algorithm. It is worth to start by highlighting why simply running an off the shelf multi-arm bandit algorithm such as EXP3 would fail. Consider a ﬁxed distribution over the actions and assume the buyer has a window size of two. Unlike the standard multi-arm bandit, where we get the expected revenue from the price we select, now the buyer would select the lower of the two prices, which would clearly hurt our revenue (there is a slight gain, by the increased probability of sell, but it does sufﬁce to offset the loss). For this reason, the seller would intuitively like to minimize the number of time it changes prices (more precisely, lower the prices).

Our online pricing algorithm, which is given in Algorithm 1, is based on the EXP3 algorithm of Auer et al. [3] which we use as a black-box. The algorithm divides the time horizon to roughly T2/3 epochs, and within each epoch the seller repeatedly announces the same price, that was chosen by the EXP3 black-box in the beginning of the epoch. In the end of the epoch, EXP3 is updated with the overall average performance of the chosen price during the epoch (ignoring the time steps which might be inﬂuenced by different prices). Hence, our algorithm changes the posted price only O(T2/3) times, thereby keeping under control the costs associated with price ﬂuctuations due to the patience of the buyers.

Algorithm 1: Online posted pricing algorithm Parameters: horizon T, number of prices n, and maximal patience ˆ ; Let B = b ˆ 2/3(n log n) 1/3T1/3c and T 0 = b T/Bc; Initialize A EXP3(T 0, n); for j = 0, . . .,T 0 1 do

Sample i A and let p0

j = i/n; for t = Bj + 1, . . ., B(j + 1) do

Announce price pt+ˆ = p0

j; %On j = 0, t = 1 announce p1, . . . pt+ = p0

0.; Receive and observe total revenue Rt(pt ˆ , . . ., pt+ˆ );

Update A with feedback 1

t=Bj+2ˆ +1 Rt(pt ˆ , . . ., pt+ˆ );

for t = BT 0 + 1, . . .,T do

Announce price pt+ˆ = p0

We now analyze Algorithm 1 and prove Theorem 1. The proof follows standard arguments in adversarial online learning (e.g., Arora et al. [2]); we note, however, that for obtaining the optimal dependence on the maximal patience ˆ one cannot apply existing results directly and has to analyse the effect of accumulating revenues over epochs more carefully, as we do in the proof below. This is mainly because in our model the revenue at time t is not bounded by 1 but by , hence readily amenable results would add a factor to the regret.

Proof of Theorem 1. For all 0 j T 0 and for all prices p 2 P, deﬁne

Rt(p, . . ., p).

(Here, the argument p is repeated 2 ˆ + 1 times.) Observe that 0 R0

j(p) 1 for all j and p, as the maximal total revenue between rounds Bj + 2 ˆ + 1 and B(j + 1) is at most B; indeed, there are at most B buyers who might make a purchase during that time, and each purchase yields revenue of at most 1. By a similar reasoning, we also have

Rt(p, . . ., p) 4 ˆ (1)

for all j and p.

Now, notice that pt = p0

j for all Bj + ˆ + 1 t B(j + 1) + ˆ , hence the feedback fed back to A after epoch j is

Rt(pt ˆ , . . ., pt+ˆ ) = 1

j, . . ., p0

That is, Algorithm 1 is essentially running EXP3 on the reward functions R0

j. By the regret bound of EXP3, we know that

for any ﬁxed p 2 P, which implies

Rt(p , . . ., p ) E

Rt(pt ˆ , . . ., pt+ˆ )

BTn log n. (2)

In addition, due to Eq. (1) and the non-negativity of the revenues, we also have

Rt(p , . . ., p ) E

Rt(pt ˆ , . . ., pt+ˆ )

4 ˆ T 0 4 ˆ T

Summing Eqs. (2) and (3), and taking into account rounds BT 0 + 1, . . .,T during which the total revenue is at most B + 2 ˆ , we obtain the regret bound

Rt(p , . . ., p ) E

Rt(pt ˆ , . . ., pt+ˆ )

BTn log n + 4 ˆ T

B + B + 2 ˆ .

Finally, for B = b ˆ 2/3(n log n) 1/3T1/3c, the theorem follows (assuming that ˆ < T).

4 Lower Bound

We next brieﬂy overview the lower bound and the proof s main technique. A full proof is given in the supplementary material; for simplicity of exposition, here we assume ˆ = 1 and n = 2.

Our proof relies on two steps. The ﬁrst step is a reduction from pricing with patience ˆ = 0 but with switching cost. The second step is to lower bound the regret of pricing with switching cost. This we do again by reduction from the Multi Armed Bandit (MAB) problem with switching cost. We begin by brieﬂy over-viewing these terms and deﬁnitions.

We recall the standard setting of MAB with two actions and switching cost c. A sequence of losses is produced 1, . . ., T where each loss is deﬁned as a function t : {1, 2} ! {0, 1}. At each round a player chooses an action it 2 {1, 2} and receives as feedback t(it). The switching cost regret of player A is given by

Sc-Regret T(A; 1:T) = E

+ c E [|{it : it , it 1}|] .

We will deﬁne analogously the switching cost regret for non-strategic buyers. Namely, given a sequence of buyers b1, . . ., b T, all with patience ˆ = 0, the switching cost regret for a seller is given by:

Sc-Regret T(A; b1:T) = E

+ c E [|{pt : pt , pt 1}|] .

4.1 Reduction from Switching Cost Regret

As we stated above, our ﬁrst step is to show a reduction from switching cost regret for non-strategic buyers. This we do in Theorem 3: Theorem 3. For every (possibly randomized) seller A for strategic buyers with patience at most

ˆ = 1, there exists a randomized seller A0 for non-strategic buyers with patience ˆ = 0 such that:

12 -Regret T(A0) Regret T(A)

The proof idea is to construct from every sequence of non-strategic buyers b1, . . ., b T a sequence of strategic buyers b1, . . ., b T such that the regret incurred to A by b1:T is at least the switching cost regret incurred to A0 by b1:T. The idea behind the construction is as follows: At each iteration t we choose with probability half to present to the seller bt and with probability half we present to the seller a buyer zt that has the following statistics:

2, = 0) w.p. 1

2 (v = 1, = 1) w.p. 1

That is, zt is with probability 1

2 a buyer with value v = 1

2 and patience = 0, and with probability 1

2, zt is a buyer with value v = 1 and patience = 1.

Observe that if zt would always have patience = 0 (i.e., even if her value is v = 1), for any sequence of prices the expected rewards from the zt buyer is always half, independent of the prices. In other words, the sequence of noise does not change the performance of the sequence of prices and cannot be exploited to improve. On the other hand, note since the value 1 corresponds to patience 1, the seller might lose half whenever she reduces the price from 1 to 1

2. A crucial point is that the seller must post her price in advance, therefore she cannot in any way predict if the buyer is willing to wait or not and manipulate prices accordingly. A proof for the following Lemma is provided in the supplementary material. Lemma 4. Consider the pricing problem with ˆ = 1 and n = 2. Let b1, . . ., b T be a sequence of buyers with patience 0. Let z1, . . ., z T be a sequence of stochastic buyers as in Eq. (4). Deﬁne bt to be a stochastic buyer that is with probability half bt and with probability half zt. Then, for any seller

A, the expected regret A incurs from the sequence b1:T is at least

Regret T(A; b1:T)

β(p ; bt) β(pt; bt)

|{pt : pt > pt+1}|

where the expectations are taken with respect to the internal randomization of the seller A and the random bits used to generate the sequence b1:T.

4.1.1 Proof for Theorem 3

To construct algortihm A0 from A, we develop a meta algorithm A, depicted in Algorithm 2 that receives an algorithm, or seller, as input. A0 is then the seller obtained by ﬁxing A as the input for A. In our reduction we assume that at each iteration algorithm A can ask from A one posted price,pt, and in turn she can return a feedback rt to algorithm A, then a new iteration begins.

The idea of construction is as follows: As an initialization step Algorithm A0 produces a stochastic sequence of buyers of type z1, . . ., zt, the algorithm then chooses apriori if at step t a buyer bt is going to be the buyer bt that she observes or zt (with probability half each). The sequence bt is distributed as depicted in Lemma 4. Note that we do not assume that the learner knows the value of bt.

At each iteration t, algorithm A0 receives price pt from algorithm A and posts price pt. She then receives as feedback β(pt; bt): Given the revenues β(p1; b1), . . ., β(pt; bt) and her own internal random variables, the algorithm can calculate the revenue for algorithm A w.r.t to the sequence of buyers b1, . . ., bt, namely rt = R(pt 1, . . ., pt+1, b1:t).

In turn, at time t algorithm A0 returns to algorithm A her revenue, or feedback, w.r.t b1, . . ., b T at time t which is rt.

Since Algorithm A receives as feedback at time t R(pt 1, pt, pt+1; b1:t), we obtain that for the sequence of posted prices p1, . . ., p T:

Regret T(A; b1:T) =

β(p , p ; bt)

β(pt, pt+1; bt).

Taking expectation, using Lemma 4, and noting that the number of time pt+1 > pt is at least 1/3 of the times pt , pt+1 (since there are only 2 prices), we have that

12 -Regret T(A0; b1:T) E b1:T

Regret T(A; b1:T)

Regret T(A)

Since this is true for any sequence b1:T we obtain the desired result.

Algorithm 2: Reduction from from pricing with switching cost to strategic buyers Input:T, A % A is an algorithm with bounded regret for strategic buyers; Output:p1, . . ., p T; Set r1 = . . . = r T = 0; Draw IID z1, . . ., z T % see Eq. 4; Draw IID e1, . . ., e T 2 {0, 1} Distributed according to Bernoulli distribution; for t=1,...,T do

Receive from A a posted price pt+1; %At ﬁrst round receive two prices p1, p2.; post price pt and receive as feedback β(pt; bt); if et = 0 then

Set rt = rt + β(pt; bt); % bt = bt else

if (pt pt+1) OR (zt has patience 0) then

Set rt = rt + β(pt; zt) else

Set rt+1 = rt+1 + β(pt, pt+1; zt)

Return rt as feedback to A.

4.2 From MAB with switching cost to Pricing with switching cost

The above section concluded that switching cost for pricing may be reduced to pricing with strategic buyers. Therefore, our next step would be to show that we can produce a sequence of non-strategic buyers with high switching cost regret. Our proof relies on a further reduction for MAB with Switching cost. Theorem 5 (Dekel et al. [9]). Consider the MAB setting with 2 actions. For any randomized player, there exists a sequence of loss functions 1, . . ., T where t : {1, 2} ! {0, 1} such that Sc-Regret T(A; 1:T) 2 (T2/3), for every c > 0.

Here we prove an analogous statement for pricing setting:

Theorem 6. Consider the pricing problem for buyers with patience ˆ = 0 and n = 2. For any randomized seller, there exists a sequence of buyers b1, . . ., b T such that Sc-Regret T(A; b1:T) 2 (T2/3), for every c > 0.

The transition from MAB with switching cost to pricing with switching cost is a non-trivial task. To do so, we have to relate actions to prices and values to loss vectors in a manner that would relate the revenue regret to the loss regret. The main challenge, perhaps, is that the structure of the feedback is inherently different in the two problems. In two-armed bandit problems all loss conﬁguration are feasible. In contrast, in the pricing case certain feedbacks collapse to full information: for example, if we sell at price 1 we know the feedback from price 1

2, and if we fail to sell at price 1

2 we obtain full feedback for price 1.

Our reduction proceeds roughly along the following lines. We begin by constructing stochastic mappings that turn loss vectors into values t : {0, 1}2 ! {0, 1

2, 1}. This in turn deﬁnes a mapping from a sequences of losses t to stochastic sequences of buyers bt. In our reduction we assume we are given an algorithm A that solves the pricing problem; that is, at each iteration we may ask for a price and then in turn we return a feedback β(pt; bt). Note that we cannot assume that we have access or know bt that is deﬁned by t( t). The buyer bt depends on the full loss vector t: assuming that we can see the full t would not lead to a meaningful reduction for MAB.

However, our construction of t is such that each posted price is associated with a single action. This means that for each posted price there is a single action we need to observe in order to calculate the correct feedback or revenue. This also means that we switch actions only when algorithm A switches prices. Finally, our sequence of transformation has the following property: if i is the action needed in order to discover the revenue for price p, then E( t(i)) = 1

4E(β(p; bt)). Thus, the regret for our actions compares to the regret of the seller.

5 Discussion

In this work we introduced a new model of strategic buyers, where buyers have a window of time in which they would like to purchase the item. Our modeling circumvents complicated dynamics between the buyers, since it forces the seller to post prices for the entire window of time in advance.

We consider an adversarial setting, where both buyer valuation and window size are selected adversarially. We compare our online algorithm to a static ﬁxed price, which is by deﬁnition oblivious to the window sizes. We show that the regret is sub-linear, and more precisely (T2/3). The upper bound shows that in this model the average regret per buyer is still vanishing. The lower bound shows that having a window size greater than 1 impacts the regret bounds dramatically. Even for window sizes 1 or 2 and prices 1

2 or 1 we get a regret of (T2/3), compared to a regret of O(T1/2) when all the windows are of size 1.

Given the sharp (T2/3) bound, it might be worth revisiting our feedback model. Our model assumes that the feedback for the seller is the revenue obtained at the end of each day. It is worthwhile to consider stronger feedback models, where the seller can gain more information about the buyers. Namely, their day of arrival and their window size. In terms of the upper bound, our result applies to any feedback model that is stronger, i.e., as long as the seller gets to observe the revenue per day, the O(T2/3) bound holds. As far as the lower bound is concerned, one can observe that our proofs and construction are valid even for very strong feedback models. Namely, even if the seller gets as feedback the revenue from buyer t at time t (instead of the time of purchase), and in fact even if she gets to observe the patience of the buyers (i.e. full information w.r.t. patience), the (T2/3) bound holds, as long as the seller posts prices in advance.

We did not consider continuous pricing explicitly, but one can verify that applying our algorithm to a setting of continuous pricing gives a regret bound of O(T3/4), by discretizing the continuous prices to T1/4 prices. On the positive side, it shows that we still obtain a vanishing average regret in the continuous case. On the other hand, we were not able to improve our lower bound to match this upper bound. This gap is one of the interesting open problems in our work.

[1] K. Amin, A. Rostamizadeh, and U. Syed. Learning prices for repeated auctions with strategic

buyers. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1169 1177. 2013.

[2] R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: from

regret to policy regret. ar Xiv preprint ar Xiv:1206.6400, 2012.

[3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit

problem. SIAM Journal on Computing, 32(1):48 77, 2002.

[4] M.-F. Balcan, A. Blum, J. D. Hartline, and Y. Mansour. Reducing mechanism design to

algorithm design via machine learning. J. Comput. Syst. Sci., 74(8):1245 1270, 2008.

[5] S. Chawla, J. D. Hartline, and R. D. Kleinberg. Algorithmic pricing via virtual valuations. In

ACM Conference on Electronic Commerce, pages 243 251, 2007.

[6] S. Chawla, J. D. Hartline, D. L. Malec, and B. Sivan. Multi-parameter mechanism design and

sequential posted pricing. In STOC, pages 311 320, 2010.

[7] S. Chawla, D. L. Malec, and B. Sivan. The Power of Randomness in Bayesian Optimal

Mechanism Design. In the 11th ACM Conference on Electronic Commerce (EC), 2010.

[8] R. Cole and T. Roughgarden. The sample complexity of revenue maximization. In Proceedings

of the 46th Annual ACM Symposium on Theory of Computing, pages 243 252. ACM, 2014.

[9] O. Dekel, J. Ding, T. Koren, and Y. Peres. Bandits with switching costs: T 2/3 regret. In

Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 459 467. ACM, 2014.

[10] Z. Huang, Y. Mansour, and T. Roughgarden. Making the most of your samples. In Proceedings

of the Sixteenth ACM Conference on Economics and Computation, EC, pages 45 60, 2015.

[11] R. D. Kleinberg and F. T. Leighton. The value of knowing a demand curve: Bounds on regret for

online posted-price auctions. In 44th Symposium on Foundations of Computer Science FOCS, pages 594 605, 2003.

[12] M. Mohri and A. Munoz. Optimal regret minimization in posted-price auctions with strategic

buyers. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1871 1879. 2014.

[13] M. Mohri and A. Munoz. Revenue optimization against strategic buyers. In C. Cortes, N. D.

Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2530 2538. 2015.