# dynamic_pricing_in_highdimensions__461c8519.pdf

Journal of Machine Learning Research 20 (2019) 1-49 Submitted 6/17; Revised 8/18; Published 2/19

Dynamic Pricing in High-dimensions

Adel Javanmard ajavanma@usc.edu Department of Data Sciences and Operations Marshall School of Business University of Southern California Los Angeles, CA 90089 , USA

Hamid Nazerzadeh nazerzad@usc.edu Department of Data Sciences and Operations Marshall School of Business University of Southern California Los Angeles, CA 90089 , USA

Editor: Peter Auer

We study the pricing problem faced by a ﬁrm that sells a large number of products, described via a wide range of features, to customers that arrive over time. Customers independently make purchasing decisions according to a general choice model that includes products features and customers characteristics, encoded as d-dimensional numerical vectors, as well as the price oﬀered. The parameters of the choice model are a priori unknown to the ﬁrm, but can be learned as the (binary-valued) sales data accrues over time. The ﬁrm s objective is to maximize its revenue. We benchmark the performance using the classic regret minimization framework where the regret is deﬁned as the expected revenue loss against a clairvoyant policy that knows the parameters of the choice model in advance, and always oﬀers the revenue-maximizing price. This setting is motivated in part by the prevalence of online marketplaces that allow for real-time pricing. We assume a structured choice model, parameters of which depend on s0 out of the d product features. Assuming that the market noise distribution is known, we propose a dynamic policy, called Regularized Maximum Likelihood Pricing (RMLP) that leverages the (sparsity) structure of the high-dimensional model and obtains a logarithmic regret in T. More speciﬁcally, the regret of our algorithm is of O(s0 log d log T). Furthermore, we show that no policy can obtain regret better than O(s0(log d + log T)). In addition, we propose a generalization of our policy to a setting that the market noise distribution is unknown but belongs to a parametrized family of distributions. This policy obtains regret of O( p

(log d)T). We further show that no policy can obtain regret better than Ω(

T) in such environments. Keywords: Revenue Management, Dynamic Pricing, High-dimensional Regression, Maximum Likelihood, Regret, Sparsity

1. Introduction

A central challenge in revenue management is determining the optimal pricing policy when there is uncertainty about customers willingness to pay. Due to its importance, this problem has been studied extensively (Kleinberg and Leighton, 2003; Besbes and Zeevi, 2009;

c 2019 Adel Javanmard and Hamid Nazerzadeh.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/17-357.html.

Javanmard and Nazerzadeh

Badanidiyuru et al., 2013; Wang et al., 2014; Broder and Rusmevichientong, 2012; Keskin and Zeevi, 2014; den Boer and Zwart, 2014; Cohen et al., 2016). Most of these models are built around the following classic setting: customers arrive over time; the seller posts a price for each customer; if the customer s valuation is above the posted price, a sale occurs and the seller collects a revenue in the amount of the posted price; otherwise, no sale occurs and no revenue is generated. Based on this and the previous feedbacks, the seller updates the posted price. Therefore, the seller is involved in the realm of exploration-exploitation as he needs to choose between learning about the valuations and exploiting what has been learned so far to collect revenue. In this work, we consider a setting with a large number of products which are deﬁned via a wide range of features. The valuations are given by v(θ, x) with x being the (observable) feature vectors of products and θ0 representing the customer s characteristics true parameters of the choice model, which is initially unknown to the seller, cf. (Amin et al., 2014; Cohen et al., 2016). An important special case of this setting is the linear model in which

v(θ, x) = θ0 x + z , (1)

where z captures the idiosyncratic noise in valuations. Our setting is motivated in part by applications in online marketplaces. For instance, a company such as Airbnb recommends prices to hosts based on many features including the space (number of rooms, beds, bathrooms, etc.), amenities (AC, Wi Fi, washer, parking, etc.), the location (accessibility to public transportation, walk score of the neighborhood, etc.), house rules (pet-friendly, non-smoking, etc.), as well as the prediction of the demand which itself depends on many factors including the date, events in the area, availability and prices of near-by hotels, etc. (Airbnb Documentation, 2015). Therefore, the vector describing each property can have hundreds of features. Another important application comes from online advertising. Online publishers set the (reserve) price of ads based on many features including user s demographic, browsing history, the context of the webpage, the size and location of the ad on the page, etc. In this work, we propose Regularized Maximum Likelihood Pricing (RMLP) policy for dynamic pricing in high-dimensional environments. As suggested by its name, the policy uses maximum likelihood method to estimate the true parameters of the choice model. In addition, using an (ℓ1-norm) regularizer, our policy exploits the structure of the optimal solution; namely, the performance of the RMLP policy signiﬁcantly improves if the valuations are essentially determined by a small subset of features. More formally, the diﬀerence between the revenue obtained by our policy and the benchmark policy that knows in advance the true parameters of the choice model, θ0, is bounded by O s0 log d log T , where T, d, and s0 respectively denote the length of the horizon, number of the features, and sparsity (i.e., number of non-zero elements of θ0). We show that our results are nearly optimal and no policy can obtain regret better than O s0(log d + log T) . In the RMLP policy it is assumed that the noise distribution (distribution of term z in valuation model (1)) is known to the seller. We also consider the setting that the noise distribution is unknown but belongs to a parametrized family of distributions. We propose RMLP-2 policy for such setting and prove that it achieves a regret of O( p

(log d)T). We further how that no policy can get regret better than Ω(

T) in this setting.

Dynamic Pricing in High-dimensions

We point out that our results can be applied to applications where the features dimension is larger than the time horizon of interest. A powerful pricing policy for these applications should obtain regret that scales gracefully with the dimension. Note that in general, little can be learned about the model parameters θ0 if T < d, because the number of degrees of freedom d exceeds the number of observations T, and therefore, any estimator can be arbitrary erroneous. However, when there is prior knowledge about the structure of unknown parameter θ0, (e.g., sparsity), then accurate estimations are attainable even when T < d.

1.1. Organization and Contributions

In the remaining part of the introduction, we discuss how our work is positioned with respect to the literature on dynamic pricing and high-dimensional statistics. The rest of the paper presents the following contributions:

Section 2. We formally present our model and discuss the technical assumptions and the benchmark policy.

Section 3. We present the RMLP policy that uses a regularized maximum likelihood function to estimate the parameters of the choice model and subsequently sets the best prices based on the obtained estimates. The RMLP policy assumes that the market noise distribution is known and forms the log-likelihood function based on this knowledge.

Section 4. We analyze the regret of the RMLP policy with respect to a clairvoyant who knows the parameters of the choice model in advance. We show that it achieves a regret of O(s0 log d log T).

Section 5. We provide a lower bound on the regret of any policy that does not know the parameters of the choice model in advance. Namely, we show that no policy can obtain regret better than O(s0(log d + log T)).

Section 6. We generalize the RMLP policy to non-linear valuations functions and show that, similar to the case of linear valuations, it achieves a regret of O(s0 log d log T).

Section 7. We consider a setting where the exact underlying noise distribution is not known but it is assumed to belong to a known class of log-concave distributions. We propose the RMLP-2 policy, so-named, for this setting whose regret is O( p

(log d)T). We also argue that no policy can get better than O(

T) regret in this setting.

Section 8. We provide the proofs of our main theorems and results.

We conclude our work along with some future directions in Section 9. Proof of technical lemmas are deferred to the appendices.

1.2. Related Work

Our work contributes to the literature on dynamic pricing as well as high dimensional statistics. In the following, we brieﬂy overview the work closest to ours in these contexts.

Javanmard and Nazerzadeh

Dynamic pricing and learning. The literature on dynamic pricing and learning has been growing over the past few years, motivated in part by the advances in big data technology that allow ﬁrms to easily collect and utilize information. We brieﬂy discuss some of the recent lines of research in this literature. We refer to den Boer (2015) for an excellent survey on this topic.

Parametric Approach. A natural approach to capture uncertainty about the customers valuations is to model the uncertainty using a small number of parameters, and then estimate those parameters using classical statistical methods such as maximum likelihood (Broder and Rusmevichientong, 2012; den Boer and Zwart, 2013, 2014; Chen et al., 2015) or least square estimation (Keskin, 2014; Bastani and Bayati, 2016). Our work is similar to this line of work, in that we assume a parametric model for customer s valuations and apply the maximum likelihood method using the randomness of the idiosyncratic noise in valuations. However, the parameter vector θ is high-dimensional, whose dimension d (that can even exceed the time horizon of interest T). We use regularized maximum-likelihood in order to promote sparsity structure in the estimated parameter. Further, our pricing policy has an episodic theme which makes the posted prices pt in each episode independent of the idiosyncratic noise in valuations, zt, in that episode. This is in contrast to other policies based on maximum-likelihood, such as MLE-GREEDY (Broder and Rusmevichientong, 2012), or greedy iterative least square (GILS) (Keskin, 2014; den Boer and Zwart, 2014; Qiang and Bayati, 2016) that use the entire history of observations to update the estimate for the model parameters at each step.

Bayesian Approach. One of the earliest work on Bayesian parametric approach in this context is by (Rothschild, 1974) who consider a Bayesian framework where the ﬁrm can choose from two prices with unknown demand and show that (myopic) Bayesian policies may lead to incomplete learning. However, carefully designed variations of the myopic policies can (optimally) learn the optimal price (Harrison et al., 2012); see also Keller and Rady (1999); Araman and Caldentey (2009).

Non-Parametric models. An early work in non-parametric setting is by Kleinberg and Leighton (2003). They model the dynamic pricing problem as a multi-armed bandit (MAB) where each arm corresponds to a (discretized) posted price. They propose an O(

T)- algorithm where T is the length of the horizon. Similar results have been obtained in more general settings (Badanidiyuru et al., 2013; Agrawal and Devanur, 2014) including setting with inventory constraints (Besbes and Zeevi, 2009; Babaioﬀet al., 2012; Wang et al., 2014).

Feature-based Models. Recent papers on dynamic pricing consider models with features/covariates. Amin et al. (2014), in a model similar to ours, present an algorithm that obtains regret O(T 2/3); they also study dynamic incentive compatibility in repeated auctions. Another closely related work to ours is by Cohen et al. (2016). Their model diﬀers from ours in two main aspects: i) their model is deterministic (no idiosyncratic noise) ii) the arrivals (of features vectors) is modeled as adversarial. They propose a clever binary-search approach using the Ellipsoid method which obtains regret of O(d2 log(T/d)). Recently, Golrezaei et al. (2018) studied the problem of dynamic pricing when a group of strategic buyers competes with each other in repeated contextual auctions. To the extent of our knowledge, ours is the ﬁrst work that highlights the role of structure/sparsity in dynamic pricing.

Dynamic Pricing in High-dimensions

Qiang and Bayati (2016) study a model where the seller can observe the demand itself, not a binary signal as in our setting. They show that a myopic policy based on least-square estimations can obtain a logarithmic regret. Bastani and Bayati (2016) study a multi-armed bandit setting, with discrete arms, and high-dimensional covariates, generalizing results of Goldenshluger and Zeevi (2013). More precisely, the decision-maker has access to K arms (decisions), where each arm i has an unknown sparse parameter vector βi Rd. Upon pulling arm i at time t, the decision-maker gets reward xt βi + zt. The authors present an algorithm, using a LASSO estimator, that obtains regret O K(log T + log d)2 where K denotes the number of arms. By contrast, in our setting decisions (prices) can take continuous values and also the resulting rewards do not follow a similar relationship with the covariates and the decisions.

High-dimensional statistics. There has been a great deal of work on regularized estimator under the high-dimensional scaling; see e.g., Van de Geer (2008). Closer to the spirit of our work is the problem of 1-bit compressed sensing (Plan and Vershynin, 2013; Bhaskar and Javanmard, 2015). In this problem, linear measurements are observed for an unknown parameter of interest but only the sign of these measurements are observed. Note that in our problem, the seller is involved in both the learning task and also the policy design. Speciﬁcally, he should decide on the prices, which directly aﬀect collected revenue and also indirectly inﬂuence the diﬃculty of the learning task. The market values are then compared with the posted prices, in contrast to 1-bit compressed sensing where the measurements are compared with zero (sign information). In addition, the pricing problem has an online nature while the 1-bit compressed sensing is mostly studied for oﬄine setting. Finally, note that prices are set based on customer s purchase behavior, and hence introduce dependency among the collected information about the model parameters.

1.3. Notations

For a vector v, supp(v) represents the positions of nonzero entries of v. Further, for a vector v and a subset J, v J is the restriction of v to indices in J. We write v p for the standard ℓp norm of a vector v, i.e., v p = (P i |vi|p)1/p and v 0 for the umber of nonzero entries of v. If the subscript p is omitted, it should be deemed as ℓ2 norm. For two vectors a, b Rd, the notation a b = Pd i=1 aibi represents the standard inner product. For two functions f(n) and g(n), the notation f(n) = O(g(n)) means that f is bounded above by g asymptotically, namely, f(n) Cg(n) for some ﬁxed positive constant C > 0. Throughout, φ(x) = e x2/2/

2π is the Gaussian density and Φ(x) R x φ(u)du is the Gaussian distribution.

2. Choice Model

We consider a seller, who has a product for sale in each period t = 1, 2, , T, where T denotes the length of the horizon and may be unknown the to the seller. Each product is represented by an observable vector of features (covariates) xt X Rd. Products may vary across periods and we assume that feature vectors xt are sampled independently from a ﬁxed, but a priori unknown, distribution PX, supported on a bounded set X.

Javanmard and Nazerzadeh

The product at time t has a market value vt = v(xt), which is not observed by the seller and function v is (a priori) unknown. At each period t, the seller posts a price pt. If pt vt, a sale occurs, and the seller collects revenue pt. If the price is set higher than the market value, pt > vt, no sale occurs and no revenue is obtained. The goal of the seller is to design a pricing policy that maximizes the collected revenue. We ﬁrst assume that the market value of a product is a linear function of its covariates, namely

v(xt) = θ0 xt + zt , (2)

where a b denotes the inner product of vectors a and b. Here, {zt}t 1 are idiosyncratic shocks, referred to as noise, which are drawn independently and identically from a distribution with mean zero and cumulative function F, with density f(x) = F (x), cf. (Keskin and Zeevi, 2014). The noise can account for the features that are not measured. We generalize our model to non-linear valuation functions in Section 6. Parameter θ0 is a priori unknown to the seller. Therefore, the seller is involved in the realm of exploration-exploitation as he needs to choose between learning θ0 and exploiting what has been learned so far to collect revenue. Let yt be the response variable that indicates whether a sale has occurred at period t:

( +1 if vt pt , 1 if vt < pt . (3)

Note that the above model can be represented as the following probabilistic model:

( +1 with probability 1 F (pt θ0 xt) , 1 with probability F (pt θ0 xt) . (4)

Our proposed algorithm exploits the structure (sparsity) of the feature space to improve its performance. To this aim, let s0 denote the number of nonzero coordinates of θ0, i.e., s0 = θ0 0 = Pd j=1 I(θ0j = 0). We remark that s0 is a priori unknown to the seller.

2.1. Technical Assumptions

To simplify the presentation, we assume that xt 1, for all xt X, and θ0 1 W for a known constant W 1, where for a vector u = (u1, . . . , ud), u = maxi [d] |ui| denotes the maximum absolute value of its entries and u 1 = Pd i=1 |ui|. We denote by Ωthe set of feasible parameters, i.e.,

Ω= n θ Rd+1 : θ 0 s0 , θ 1 W o . (5)

We also make the following assumption on the distribution of noise F.

Assumption 1 The function F(v) is strictly increasing. Further, F(v) and 1 F(v) are log-concave in v.

Dynamic Pricing in High-dimensions

Log-concavity is a widely-used assumption in the economics literature (Bagnoli and Bergstrom, 2005). Note that if the density f is symmetric and the distribution F is logconcave, then 1 F is also log-concave. Assumption 1 is satisﬁed by several common probability distributions including normal, uniform, Laplace, exponential, and logistic. Note that the cumulative distribution function of all log-concave densities is also log-concave (Boyd and Vandenberghe, 2004). Our second assumption is on the product feature vectors.

Assumption 2 Product feature vectors are generated independently from a ﬁxed unknown distribution PX with a bounded support X Rd. We denote by Σ = E(xtx T t ) the second moment matrix of distribution PX, and assume that Σ is a positive deﬁnite matrix. Namely, all of its eigenvalues are bounded from below by a constant Cmin > 0. We also denote the maximum eigenvalue of Σ by Cmax.

The above assumption holds for many common probability distributions, such as uniform, truncated normal, and in general truncated version of many more distributions. Generally, if PX is bounded below from zero on an open set around the origin, then it has a positive deﬁnite second moment matrix. Let us stress that the seller knows neither the distribution PX, nor its second moment Σ.

2.2. Clairvoyant Policy and Performance Metric

We evaluate the performance of our algorithm using the common notion of regret: the expected revenue loss compared with the optimal pricing policy that knows θ0 in advance (but not the realizations of {zt}t 1). Let us ﬁrst characterize this benchmark policy. Using Eq. (2), the expected revenue from a posted price p is equal to

p P(vt p) = p(1 F(p θ0 xt)) .

Therefore, using ﬁrst order conditions, for the optimal posted price, denoted by p , we have

p (xt) = 1 F(p θ0 xt)

f(p θ0 xt) . (6)

To simplify the presentation, let p t = p (xt) denote the optimal price at time t. We now deﬁne ϕ(v) v (1 F(v))/f(v) corresponding to the virtual valuation function commonly used in mechanism design (Myerson, 1981). By Assumption 1, ϕ is injective and hence we can deﬁne function g as follows

g(v) v + ϕ 1( v) . (7)

It is easy to verify that g is non-negative. Note that by Eq. (6), for the optimal price we have θ0 xt + ϕ(p θ0 xt) = 0 .

Therefore, by rearranging the terms for the optimal price at time t we have

p t = g(θ0 xt) . (8)

Javanmard and Nazerzadeh

We can now formally deﬁne the regret of a policy. Let π be the seller s policy that sets price pt at period t, and pt can depend on the history of events up to time t. The worst-case regret is deﬁned as:

Regretπ(T) max θ0 Ω PX Q(X) E

p t I(vt p t ) pt I(vt pt) #

where the expectation is with respect to the distributions of idiosyncratic noise, zt, and PX, the distribution of feature vectors. Also, Q(X) represents the set of probability distributions supported on a bounded set X. Our algorithm uses the sparsity structure of θ0 and learns the model with order of magnitude less data compared to a structure-ignorant algorithm. In Section 4, we show that our pricing scheme achieves a regret bound of O s0 log d log T . Note that the setting with known market noise distribution is equivalent to positing a parametric choice model on the users behavior. Several parametric choice models have been widely used by practitioners and have been analyzed by researchers. Perhaps the most well-known example is the multinomial logit (MNL) model, which has quite a long history in economic and transportation research (Mc Fadden, 2001), as well as marketing (Hauser and Koppelman, 1979; Gensch and Recker, 1979; Louviere and Woodworth, 1983; Guadagni and Little, 2008). Note that the MNL model corresponds to the case where customers are oﬀered multiple products at each round and the market noise is coming from the Gumbel distribution (Talluri and Van Ryzin, 2006). In the case of single product oﬀered at each round (similar to our setting), this reduces to noise coming from the logistic distribution. Other prominent MNL models include Nested MNL, latent-class MNL, random-parameter mixed MNL, and hierarchical Bayesian MNL (Elshiewy et al., 2017). In Section 7 we relax the assumption of knowing the market noise distribution.

3. A Regularized Maximum Likelihood Pricing (RMLP) Policy

In this section, we present our dynamic pricing policy. Our policy runs in an episodic fashion. Episodes are indexed by k and time periods are indexed by t. The length of episode k is denoted by τk. Throughout episode k, we set the prices equal to pt = g(xt bθk) where bθk denotes the estimate of θ0 which is obtained from the observations {(xt, yt, pt)} in the previous episode. Note that by Eq. (7), pt is the optimal posted price if bθk was the true underlying parameter of the model. We estimate θ0 using a regularized maximum-likelihood estimator; see Eq. (10) where the (normalized) negative log-likelihood function for θ is given by Eq. (11). We note that as a consequence of the log concavity assumption on F and 1 F, the optimization problem (10) is a convex problem. There is a large toolkit of various optimization methods (e.g., alternating direction method of multipliers (ADMM), fast iterative shrinkage-thresholding algorithm (FISTA), accelerated projected gradient descent, among many others) that can be used to solve this optimization problem. There are also recent developments on distributed solvers for ℓ1 regularized cost function (Boyd et al., 2011). Observe that by design, prices posted in the k-th episode are independent from the market value noises in this period, i.e., {zt}τk+1 1 t=τk . This allows us to estimate θ0 for each

Dynamic Pricing in High-dimensions

Input: (at time 0) function g, regularizations λk, W (bound on θ0 1), Input: (arrives over time) covariate vectors {xt}t N Output: prices {pt}t N

1: τ1 1, p1 0, bθ1 0

2: for each episode k = 2, 3, . . . do

3: Set the length of k-th episode: τk 2k 1.

4: Update the model parameter estimate bθk using the regularized ML estimator obtained from observations in the previous episode:

bθk = arg min θ 1 W {L(θ) + λk θ 1} , (10)

L(θ) = 1 τk 1

I(yt = 1) log(1 F(pt θ xt)) + I(yt = 1) log(F(pt θ xt)) .

5: For each period t during the k-th episode, set

pt g(bθk xt) . (12)

Algorithm 1: RMLP policy for dynamic pricing

episode separately; see Proposition 10 in Section 8.1. Comparing to policies that use the entire data sale history in making decisions, some remarks are in order:

Perishability of data: In practical applications, the unknown demand parameters will change over time, raising the concern of perishability of data. Namely, collected data becomes obsolete after a while and cannot be relied on for estimating the model parameters (Keskin and Zeevi, 2016; Javanmard, 2017). Common policies to mitigate this problem (discussed in (Keskin and Zeevi, 2016)) include moving windows and decaying weights which use only recent data to learn the model parameters. By contrast, methods that use the entire historical data, with equal weight, suﬀers from this problem.

Simplicity and eﬃciency: In RMLP policy, estimates of the model parameters are updated only at the ﬁrst period of each episode (log T updates). Further, at each update, the policy uses only the historical data from the previous episode. These two ideas together, not only allow for a neat analysis of the statistical dependency among samples but also decrease the computational cost. Scalability of the pricing policy is indispensable in practical applications as the sales data is collected at an unprecedented rate.

Javanmard and Nazerzadeh

The impact on regret: By using half of the historical data at each update, our policy loses at most a factor 2 in the total regret. (This becomes clear shortly when we discuss the estimation error rate in terms of number of samples.)

The lengths of episodes in our algorithm increase geometrically (τk = 2k 1), allowing for more accurate estimate of θ0 as the episode index grows. The algorithm terminates at the end of the horizon (period T), but note that it does not need to know the length of the horizon in advance. The regularization parameter λk constrains the ℓ1 norm of the estimator bθk. Selecting the value of λk is of crucial importance as it aﬀects the estimator error. We set it as λk = O p

(log d)/τk 1 . More precisely, deﬁne

u F sup |x| 3W

n max n log F(x), log (1 F(x)) oo ,

where the derivatives are w.r.t. x. By the log-concavity property of F and 1 F, we have

u F = max n log F( 2W), log (1 F(2W)) o . (13)

Hence, u F captures the steepness of log F. In order to minimize the regret, we run the RMLP policy with

log d τk 1 . (14)

Note that exploration and exploitation tasks are mixed in our algorithm. In the beginning of each episode, we use what is learned from previous episode to improve the estimation of θ0 and then we exploit this estimate throughout the current episode to incur little regret. Meanwhile, the observations gathered in the current episode are used to update our estimate of θ0 for the next episode. The knowledge of the market noise distribution is used by the RMLP policy, in estimating the model parameters (constructing the log-likelihood function and choosing the regularization λk) as well as in setting the prices. (The pricing function g is deﬁned based on the noise distribution F.) In the next section we analyze the regret of the RMLP policy.

4. Regret Analysis

The following theorem bounds the regret of our dynamics pricing policy.

Theorem 3 (Regret Upper Bound) Suppose Assumptions 1 and 2 hold and the . Then, the regret of the RMLP policy is of O W C2 min s0 log d log T .

Let us stress again that the RMLP policy does not need to know the sparsity s0, although its regret scales linearly with s0. Further, the RMLP is assumed to know the market noise distribution F. 1 Below we provide an outline for the proof of Theorem 3 and defer its complete proof to Section 8.1.

1. It is useful to compare Theorem 3 with the result of Chen et al. (2015), [Theorem 5.4], which also uses a maximum likelihood estimator without exploiting sparsity structure and achieves regret O d Cmin p

Dynamic Pricing in High-dimensions

1. In RMLP, the updates in the model parameter estimation only occurs at the beginning of each episode, with using only the samples collected in the previous episode. Therefore, the prices posted in each episode are independent from the market value noises in that episode. This observation also veriﬁes that L(θ) given by (11), is indeed the negative log-likelihood of the samples collected in k-th episode. Note that this independence is not a mere serendipity, rather it holds because of the speciﬁc design of RMLP policy. Using this property, we use tools from high-dimensional statistics to bound the estimation error. To bound the error term θk θ0 2, we compare the function values L(θk) and L(θ0). The main challenge here is that L(θ) is not strictly convex in θ.2 Hence, there can be, in principle, parameter vectors θ1 and θ2 that are close to each other and nevertheless the values of function L at these points are far from each other.

To cope with this challenge, we show that a so-called restricted eigenvalue condition holds for the feature products. This notion implies that L(θ) is strictly convex on the set of sparse vectors.3 Using the restricted eigenvalue condition, we show the following ℓ2 error for the regularized log-likelihood estimate in the k-th episode, bθk, holds true

bθk θ0 2 = O( s0λk) = O

As expected, the estimate gets more accurate as the episode s length increases; see Section 8.1 for more details.

2. For any p 0, denote by revt(p) = p(1 F(p xt θ0)), the expected revenue under price p. We bound regt in terms of revt(p t ) revt(pt). Since p t arg max{revt(p)}, we have rev t(p t ) = 0, and by Taylor expansion of revt around p t , we obtain revt(p t ) revt(pt) = O((p t pt)2).

3. For t in the k-th episode, namely τk 1 t τk 1, we have

p t pt = g(θ0 xt) g(bθk xt) |(θ0 bθk) xt| ,

which follows by showing that g is 1-Lipschitz. Further, by Assumption 2 (without loss of generality assume Cmax > 1), we have

E[((θ0 bθk) xt)2] = E[(θ0 bθk)TΣ(θ0 bθk)] Cmax E[ bθk θ0 2 2] ,

where the equality holds because xt is independent of bθk, and the last step follows since the maximum eigenvalue of the second moment matrix Σ = E(xtx T t ) is at most Cmax > 0.

2. Note that 2 θL = ( 1/τk 1) Pτk 1 t=τk 1( 2/ 2 ut L)xtx T t , where ut = pt θ xt α0. Therefore, 2 θL is a d d matrix of rank at most τk τk 1. Hence, L(θ) is strictly convex in θ only if τk τk 1 d. However, since we are not updating our estimates in the middle of an episode, episodes of length d yield the regret to scale linearly in d, which is not desired. 3. It is strictly convex over the set of s0 sparse vectors in d-dimension if the number of samples is above cs0 log d for a suitable constant c > 0.

Javanmard and Nazerzadeh

Let regt be the regret occurred at step t. Combining the above bounds (step 2 and 3), we arrive at E[regt] = O(s0(log d)/τk 1). Therefore, the cumulative expected regret in episode k works out at O(s0 log d). Since the length of episodes increase geometrically, there are O(log T) episodes by time T. This implies that the total expected regret by time T is O(s0 log d log T).

4.1. Comparison with the Common Regret of Bound Ω(

There is an often-seen regret bound Ω(

T) in the literature of online decision making, which can be improved to a logarithmic regret bound if some type of separability assumption holds true (Dani et al., 2008; Abbasi-Yadkori et al., 2012). Separability assumption posits that there is a positive constant gap between the rewards of the best and the second best actions. In our framework, the parameter θ belongs to a continuous set in Rd+1 and therefore the separability assumption cannot be enforced because by choosing θ arbitrary close to θ0, one can obtain suboptimal (but arbitrary close to optimal) reward. However, the regret of our policy scales only logarithmically with T. Here, we contrast our logarithmic regret bound with the folklore lower bound Ω(

T), to build further insight on our results.

Uninformative prices and Ω(

T) lower-bound. We focus on (Broder and Rusmevichientong, 2012) which has a close framework to ours in that it considers a dynamic pricing policy from purchasing decisions and presents a pricing policy based on maximum likelihood estimation with regret O(

T). Adopting their notation, it is assumed that market values vt are independent and identically distributed random variables coming from a distribution function that belongs to some family parametrized by z. Denote by d(p; z) the demand curve. This curve determines the probability of a purchase at a given price, i.e., d(p; z) = Pz(vt p). (Broder and Rusmevichientong, 2012) show that the worst-case regret of any pricing policy must be at least Ω(

T) (see Theorem 3.1 therein). The bound is proved by considering a speciﬁc family of demand curves d(p; z), such that all demand curves in this family intersect at a common price. Further, the common price is the optimal price for a speciﬁc choice of parameter z0, i.e, p (z0).4 Therefore, the price p (z0) is uninformative since no policy can gain information about the demand parameter z, while pricing p (z0). The idea behind the derived lower bound for the worst-case regret is that for a policy to learn the underlying demand curve fast enough, it must necessarily choose prices that are away from (the uninformative) price p (z0) and this leads to a large regret when the true demand curve is indeed z0.

Intuition behind our results. In contrast to the previous case, for our framework there is no such uninformative price. First, note that the for a choice model with parameters θ0 = (θ0, α0), the demand curve at time t is given by

dt(p; θ0) = 1 F(p θ0 xt) ,

For n 1, we deﬁne the aggregate demand function up to time n as dn 1 = (d1, d2, . . . , dn). In the following, we argue that under our setting, there is no uninformative price. For any

4. Speciﬁcally, they consider d(p; z) = 0.5 + z zp. Hence d(1; z) = 1, for all z and it is shown that p (z0) = 1 for z0 = 0.5.

Dynamic Pricing in High-dimensions

price p and any θ1, θ2, we have

1 n dn 1(p, θ1) dn 1(p, θ2) 2 2 = c2

ℓ=1 ((θ1 θ2) xt)2

n X(θ1 θ2) 2 2 ,

where X is the matrix with rows xℓ, for 1 ℓ t. We also used the fact that f(z) c > 0 for some constant c because F is strictly increasing by Assumption 1. As we show in Appendix A, for n c0s0 log d (with c0 a proper constant), X satisfy a so-called restricted eigenvalue , by which we have

1 n X(θ1 θ2) 2 2 Cmin

2 θ1 θ2 2 2 . (15)

Therefore, for any ﬁxed price p, if we vary the demand parameters θ1 to some other value θ2, then the aggregate demand at price p also changes by an amount proportional to θ1 θ2 2. Hence, any price in this setting is informative about the model parameters. To build further insight, let us consider a more general choice model, where the utility of the customer from buying a product with feature vectors xt at price p is given by

u(xt) = θ0 xt β0p + zt , (16)

where θ0, β0 are unknown model parameters and zt is the noise term. The customer buys the product iﬀu(xt) 0. Note that the model we studied in this paper (see Equation (3)) is a special case when the price sensitivity β0 is known and hence can be normalized to 1. We next argue that in case of unknown β0, the uninformative prices do exist and hence the Ω(

T) is still in place. To see this, let θ0,i = 0 for 2 i d. Fix arbitrary a, and set β0 = g(a) a+θ0,1. Also, let xt,1 = 1, for all t. Then, the demand curves will be unaltered over time and are given by dt(p, θ) = 1 F(β0p θ0,1) = 1 F ((g(a) a + θ0,1)p θ0,1) .

We next verify that p = 1 is the optimal price for the speciﬁc choice of θ0,1 = a. In this case, the expected revenue form a posted price p is given by

p dt(p, θ) = p (1 F (g(a)p a)) . (17)

By deﬁnition of the pricing function g, given by (7), it is easy to see that p = 1 sets the gradient of the expected revenue to zero. Next note that all the demand curves, regardless of value θ0,1, intersect at p = 1 (they all have the value 1 F(g(a) a) at this price). Therefore, p is an uninformative price and no policy can gain information about θ0,1 by pricing at p . However, when θ0,1 = a, choosing prices that are away from this informative price leads to a large regret. Prices that are close to p does not have any information gain, and contrasting these two points, it can be shown that the worst case regret is of order Ω(

T). A formal proof follows the same lines of the proof of (Broder and Rusmevichientong, 2012, Theorem 3.1) and is omitted. Finally, it is worth noting that the rate of learning demand parameter θ0 is chieﬂy derived by three factors:

Javanmard and Nazerzadeh

Non-smoothness of distribution function F, as it controls the amount of information obtained about xt θ0 at each t.

The rate by which the feature vectors xt span the parameter space. This is controlled through the minimum eigenvalue of Σ, i.e., Cmin. If Cmin is small, the randomly generated features are relatively aligned and one requires larger sample size to estimate θ0 within speciﬁed accuracy.

Complexity of θ0. This is captured through the sparsity measure s0.

Later in Proposition 10, we see how these factors contribute to the learning rate of θ0.

4.2. Role of Cmin

In establishing our results, we relied on Assumption 2 which requires the population covariance of features to be positive deﬁnite. The lower bound on its eigenvalues, denoted by Cmin, appears in our regret bound as a factor 1/C2 min. As evident from the proof of Proposition 10, Assumption 1 can be replaced by the weaker restricted eigenvalue condition (B uhlmann and van de Geer, 2011; Candes and Tao, 2007), which is a common assumption in high-dimensional statistical learning. While assumption Cmin > 0 allows for a fast learning rate of model parameters and a regret bound O(log T), RMLP policy can still provably achieve regret O(

T), even when Cmin = 0.

Theorem 4 Suppose that product feature vectors are generated independently from a probability distribution PX with a bounded support X Rd. Also, assume that the choice model parameters θ0 Ω, where the set Ωis given by (5). Under Assumption 1, the regret of RMLP policy is of O(W 3p

Proof of Theorem 4 is given in Section 8.2. We remark that the regret bound in Theorem 4 is independent of the sparsity s0 and scales logarithmically with the feature dimension d. From the technical point of view, this is due to the fact that in our analysis instead of bounding the estimation errors θ0 bθk 2, we directly bound the prediction errors X(k 1)(θ0 bθk) 2, where X(k 1) is the matrix obtained by putting the feature vectors in episode (k 1) as its rows. We then connect the total regret in each episode to the corresponding prediction error. By this analysis, we are able to remove the requirement Cmin > 0 and obtain a regret that is independent of the sparsity s0, at the cost of getting O(

T) regret bound.

5. Lower Bound on Regret

As discussed in Section 2.2, if the true parameter θ0 is known, the optimal policy (in terms of expected revenue) is the one that sets prices as pt = g(xt θ0). Let Ht = {x1, x2, . . . , xt, z1, z2, . . . , zt} denote the history set up to time t, and recall that Ωdenotes the set of feasible parameters, i.e., Ω= {θ Rd : θ 0 s0 , θ 1 W}. We consider the following set of policies, Π:

Π = n π : π(pt) = g(xt θt), for some θt Ω, such that θt is Ht 1-measurable o . (18)

Dynamic Pricing in High-dimensions

Here π(pt) denotes the price posted by policy π at time t. We provide a lower bound on the achievable regret by any policy in set Π. Indeed this lower bound applies to an oracle who fully observes the market values after the price is either accepted or rejected. Compared to our setting, where the seller observes only the binary feedbacks (purchase/no purchase), this oracle appears exceedingly powerful at ﬁrst sight but surprisingly, the derived lower bound matches the regret of our dynamic policy, up to a logarithmic factor.

Theorem 5 Consider linear model (2) where the market values v(xt), 1 t T, are fully observed. We further assume that market value noises are generated as zt N(0, σ2). Let Π be the set of policies given by (18). Then, there exists constant C > 0 (depending on W and σ), such that the following holds true for all T N.

min π Π Regretπ(T) C s0 log T

s0 , s0 log d

In the following we give an outline for the proof of Theorem 5, summarizing its main steps and defer the complete proof to Section 8.3.

1. We derive a lower bound for regret in terms of the minimax estimation error. Specifically, for t N, let

regt p t I(vt p t ) pt I(vt pt) (20)

be the regret at period t. We show that

max θ0 ΩE(regt) c max θ0 ΩE{min( θt θ0 2 2, C)} , (21)

for some constants c, C > 0.

2. Let θT 1 = (θt)T t=1 and deﬁne d(θT 1 , θ) PT t=1 min( θt θ 2 2, C). We use a standard argument (Le Cam s method) that relates the minimax ℓ2 risk, minθT 1 maxθ0 ΩE[d(θT 1 , θ0)], in terms of the error in multi-way hypothesis problem (Tsybakov, 2008). We ﬁrst construct a maximal set of points in Ω, such that minimum pairwise distances among them is at least δ. (Such set is usually referred to as a δ-packing in the literature). Here δ is a free parameter to be determined in the proof. We then use a standard reduction to show that any estimator with small minimax risk should necessarily solve a hypothesis testing problem over the packing set, with small error probability. More speciﬁcally, suppose that nature chooses one point from the packing set uniformly at random and conditional on nature s choice of the parameter vector, say θ0, the market value are generated according to xt, θ0 + zt with zt N(0, σ2). The problem is reduced to lower bounding the error probability in distinguishing θ0 among the candidates in the packing set using the observed market values.

3. We apply Fano s inequality from information theory to lower bound the probability of error (Tsybakov, 2008). The Fano s bound involves the logarithm of the cardinality of the δ-packing set as well as the mutual information between the observations (market

Javanmard and Nazerzadeh

values) and the random parameter vector θ0 chosen uniformly at random from the packing set. Le Cam s method is used to derive minimal risk lower bound for an estimator bθ, while here we have a sequence of estimators and need to adjust the Le Cam s method to get the lower bound for d(θT 1 , θ0).

6. Nonlinear Valuation Function

In previous sections, we focused exclusively on linear valuation function given by Eq (2). Here, we extend our results and assume that the market valuations are modeled by a nonlinear function that depends on products features and an independent noise term. Speciﬁcally, the market value of a product with feature vector xt is given by

v(xt) = ψ(θ0 φ(xt) + zt) , (22)

where the original features xt are transformed by a feature mapping φ : Rd 7 Rd, and function ψ : R 7 R is a general function that is log-concave and strictly increasing. Important examples of this model include log-log model (ψ(x) = ex, φ(x) = ln(x)), semi-log model (ψ(x) = ex, φ(x) = x), and logistic model (ψ(x) = ex/(1 + ex), ψ(x) = x). Model (22) allows us to capture correlations and non-linear dependencies on the features. We next state our assumption on the feature mapping φ and then discuss our dynamic pricing policy and its regret bound for the general setting (22).

Assumption 6 Let p X be an (unknown) distribution from which the original features xt are sampled independently. Suppose that the feature mapping φ has continuous derivative and denote by Σφ E(φ(x) φ(x)T) the second moment matrix of φ(x) under PX. We assume that there exist constants Cmin and Cmax such that for every eigenvalue σ of Σφ, we have 0 < Cmin σ < Cmax < .

Invoking Assumption 1, PX has a bounded support X and since φ has continuous derivative, it is Lipschitz on X and hence the image of X under φ remains bounded. Therefore, the new features φ(xt) are also sampled independently from a bounded set. The condition on Σφ is analogous to that on Σ, as required by Assumption 2 for the linear setting. Based on feature mapping φ, validity of Assumption 6 may depend on all moments of distribution PX. We provide an alternative to this assumption, which only depends on feature mapping φ and the second moment of PX. In stating the assumption, we use the notation Dφ to denote the derivative matrix of a feature mapping φ. Precisely, for φ = (φ1, . . . , φd), with φi real-valued function deﬁned on Rd, we write Dφ = ( φi/ xj)1 i j d.

Assumption 7 (alternative to Assumption 6) Suppose that feature mapping φ has continuous derivative and its derivative Dφ(x) is full-rank for almost all x. In addition, there exist constants Cmin and Cmax such that for every eigenvalue σ of covariance Σ, we have 0 < Cmin σ < Cmax < .

Recall that the noise terms {zt}t 1 are drawn independently and identically from a distribution with cumulative function F and density f(x). Let λ(v) = f(v)/(1 F(v)) be the hazard rate function for distribution F. For a log-concave function ψ, we deﬁne

g 1 ψ (v) v λ 1 ψ (v)

Dynamic Pricing in High-dimensions

Input: (at time 0) function g, regularizations λk, W (bound on θ0 1) Input: (arrives over time) covariate vectors { xt = φ(xt)}t N Output: prices {pt}t N

1: τ1 1, p1 0, bθ1 0

2: for each episode k = 2, 3, . . . do

3: Set the length of k-th episode: τk 2k 1.

4: Update the model parameter estimate bθk using the regularized ML estimator obtained from observations in the previous episode:

bθk = arg min θ 1 W {L(θ) + λk θ 1} , (24)

where L(θ) is given by:

L(θ) = 1 τk 1

I(yt = 1) log(1 F(ψ 1(pt) θ xt))

+I(yt = 1) log(F(ψ 1(pt) θ xt)) . (25)

5: For each period t during the k-th episode, set

pt ψ(gψ(bθk xt)) . (26)

Algorithm 2: RMLP Policy for dynamic pricing under the nonlinear setting

Note that ψ (v)/ψ(v) = log ψ(v) and since ψ is log-concave, this term is decreasing. Further, since 1 F is log-concave then its hazard rate λ is increasing (See proof of Lemma C.1.) Combining these observations, we have that λ 1(ψ (v)/ψ(v)) is increasing. Consequently,

Right-hand side of (23) is strictly increasing and hence, g 1 ψ is well-deﬁned.

We have (g 1 ψ ) (v) 1, for all v. This implies that 0 < g ψ(v) 1, for all v.

It is worth noting that for ψ(v) = v (linear model), we have gψ = g, where g is deﬁned by (7). Our pricing policy for the nonlinear model is conceptually similar to the linear setting: The policy runs in an episodic manner. During episode k, the prices are set as pt = ψ(gψ(bθk xt)), where bθk denotes the estimate of the true parameter vector θ0, using a regularized maximum-likelihood estimator applied to observations in the previous episode, and xt = φ(xt). We describe our (modiﬁed) RMLP policy in Algorithm 2. There a few diﬀerences between Algorithm 2 and Algorithm 1: Firstly, the features xt are replaced by xt = φ(xt). Secondly, in the regularized estimator, prices pt are replaced by ψ 1(pt). Thirdly, in the last step of algorithm prices are set as ψ(gψ(bθk xt)), with gψ deﬁned by Equation (23). Our next theorem bounds the regret of our pricing policy (Algorithm 2).

Javanmard and Nazerzadeh

Theorem 8 Let ψ be log-concave and strictly increasing. Suppose that Assumptions 1 and 6 (or its alternative, Assumption 7) hold. Then, regret of the RMLP policy described as Algorithm 2 is of O(s0 log d log T).

Proof of Theorem 8 is given in Appendix 8.4. Here, we summarize its key ingredients.

1. By increasing property of ψ, a sale occurs at period t when zt ψ 1(pt) θ0 xt. Hence, the log-likelihood estimator for this setting reads as (25). By virtue of Assumption 6 (or its alternative, Assumption 7) we get a similar estimation error for the regularized estimator to the one in Proposition 10.

2. Similar to our derivation for linear setting, we show that the optimal pricing policy that knows θ0 in advance is given by p t = ψ(gψ(θ0 xt)), where gψ is deﬁned based on Equation (23).

3. The diﬀerence between the posted price and the optimal price can be bounded as pt p t = ψ(gψ(bθk xt)) ψ(gψ(θ0 xt)) L| xt (bθk θ0)|, for a constant L > 0. This bound is similar to the corresponding bound for the linear setting, and following the same lines of our regret analysis for that case, we get R(T) = O(s0 log d log T).

7. Knowledge of Market Noise Distribution

The proposed RMLP policy has assumed that the market noise distribution F is known to the seller. Knowledge of F has been used both in estimating the model parameters θ0 and in setting the prices pt. On the other hand, the benchmark policy is also assumed to have access to model parameters and the distribution F. Therefore, the regret bound established in Theorem 3 essentially measures how much the seller loses in revenue due to lack of knowledge of the underlying model parameters. In practice, however, the underlying distribution of valuations is not given and this raises the question of distribution-independent pricing policy. It is worth mentioning that in some applications, although the underlying distribution of valuations is unknown, it belongs to a known class of distributions. For example, lognormal distributions have proved to be a good ﬁt for the distribution of valuations of advertisers in online advertising markets (Edelman et al., 2007; Lahaie and Pennock, 2007; Xiao et al., 2009; Balseiro et al., 2014). In Section 7.1, we consider a setting where the underlying distribution belongs to a known class of log-concave distributions and propose a policy whose regret is O(

T). We also argue that no policy can get a better regret bound.

7.1. Unknown Distribution from a Known Class

Suppose that the market noises are generated from a log-concave distribution Fm,σ (e.g., Log-normal), with unknown mean m and unknown variance σ2. Without loss of generality, we can assume that m = 0; otherwise, in the valuation model (2), m can be absorbed in the features as the intercept term. We next explain how the RMLP policy can be adapted to this case.

Dynamic Pricing in High-dimensions

Input: Pricing function g (corresponding to F0,1), regularizations λk, W Input: (arrives over time) covariate vectors {xt}t N Output: prices {pt}t N

1: τ1 1, p1 0, bθ1 0

2: for each episode k = 2, 3, . . . do

3: Set the length of k-th episode: τk 2k 1.

4: Update the model parameter estimate bµk using the regularized ML estimator:

(bµk, bβk) = arg min (µ/β,β) 1 W {L(µ, β) + λk µ 1} , (28)

where L(µ, β) is given by:

L(µ, β) = 1 τk 1

I(yt = 1) log (1 F(βpt µ xt))

+ I(yt = 1) log (F(βpt µ xt)) . (29)

5: For each period t during the k-th episode, set

bβk g(bµk xt) . (30)

Algorithm 3: RMLP-2 policy for dynamic pricing

Deﬁne β0 = 1/σ and consider the transformation vt = β0vt, µ0 = β0θ0, zt = β0zt. Then, the valuation model (2) can be written as

vt = xt µ0 + zt , (27)

where zt are drawn from F0,1. To lighten the notation, we use the shorthand F F0,1. The response variables yt are then given by yt = I( vt β0pt). We propose a variant of RMLP policy, called RMLP-2 for this case. The RMLP2 policy is very similar to RMLP except that the maximum log-likelihood is formed to estimate the parameter vector (µ0, β0) jointly. Likewise, the regularizations are set as λk = 4u F p

(log d)/τk 1, where u F is deﬁned based on the distribution F F0,1 as in Equation (13). The policy then oﬀers the optimal prices based on the current parameter estimates at the time. Speciﬁcally, for episode k, we set pt = (1/bβk)g(bµk xt), where the pricing function g, given by (7), is deﬁned based on the distribution F F0,1. A formal description of RMLP-2 is given in Algorithm 3. Our next result bounds the regret of RMLP-2.

Theorem 9 Consider the valuation model (2), where noises zt are generated from a distribution Fm,σ, with unknown mean m and variance σ2. Assuming that distribution Fm,σ satisﬁes Assumption 1, the regret of RMLP-2 policy is of O( p

(log d)T). Further, regret of any pricing policy in this case is Ω(

Javanmard and Nazerzadeh

We refer to Section 8.5 for the proof of Theorem 9. As discussed in the proof, the lower bound Ω(

T) applies to this case due to the existence of non-informative prices; See also Section 4.1.

8. Proof of Theorems

8.1. Proof of Theorem 3

Following step 1 of the proof outline mentioned in Section 4, we consider the problem of estimating θ0 based on observations from previous episode. Before we proceed, let us emphasize once again that the way RMLP is designed, posted prices at each episode are statistically independent from the market noises in that episode. This can be easily observed because pt = g(xt bθk) for t belonging in the k-th episode, and bθk is estimated based on the samples in the (k 1)-th episode. We ﬁx k 1 and to lighten the notation, we use the indices 1, 2, . . . , n to correspond to periods in the k-the episode, i.e., t = τk, τk + 1, . . . , τk+1 1. Using probabilistic model (4), θ0 is estimated by solving a regularized maximum likelihood (ML) optimization problem. The (normalized) negative log-likelihood function for θ reads as

I(yt = 1) log(1 F(pt θ xt)) + I(yt = 1) log(F(pt θ xt)) . (31)

Parameter θ is estimated as the solution of the following program:

bθ = arg min θ 1 W L(θ) + λ θ 1 . (32)

Deﬁne ℓF as follows which corresponds to ﬂatness of function log F:

ℓF inf |x| 3W

n min n log F(x), log (1 F(x)) oo . (33)

By Assumption 1, the log-concavity property of F and 1 F, we have ℓF > 0. The next theorem upper bounds the estimation error of the proposed regularized estimator.

Proposition 10 (Estimation Error) Consider linear model (2) with θ0 Ω, under Assumptions 1 and 2. Let bθ be the solution of optimization problem (32) with λ 4u F p

(log d)/n. Then, there exist positive constants c0 and C such that, for n c0s0 log(d), the following inequality holds with probability at least 1 1/d 2e n/(c0s0):

bθ θ0 2 2 16s0λ2

ℓ2 F C2 min . (34)

We refer to Appendix A for the proof of Proposition 10. As we see the ℓ2 estimation error scales linearly with the sparsity level s0. As s0 increases, the number of parameters to be estimated becomes larger and this makes the estimation problem harder, leading to worse ℓ2 bound for a ﬁxed number of samples, n. Further,

Dynamic Pricing in High-dimensions

choosing λ p

(log d)/n (where indicates equality up to a constant factor), our ℓ2 bound scales logarithmically in the dimension of the demand space, d. This allows to deal with high-dimensional applications and obtain a regret that scales logarithmically in d. Further, the estimation error shrinks as 1/n; getting more samples with ﬁxed value of s0 and d leads to better estimation accuracy. Finally, note that for small values of ℓF , the log-likelihood function is very ﬂat and there can be, in principle, vectors θ of log-likelihood value very close to the optimum and nevertheless far from the optimum. In other words, estimation task becomes harder as ℓF gets smaller and this is clearly reﬂected in the derived estimation bound. We next use Proposition 10 to bound the expected estimation error.

Corollary 11 Under assumptions of Proposition 10, the following holds true:

E( bθ θ0 2 2) 16s0λ2

ℓ2 F C2 min + 4W 2 1

d + 2e n/(c0s0) . (35)

Proof of Corollary 11 is straightforward and is omitted. In the next proposition, we improve bound (35) for n c1d, for a constant c1 > 0. As we will see, the following result is useful to develop sharper upper bound for regret of RMLP policy.

Proposition 12 Under assumptions of Proposition 10, there exist constants c, c1 > 0, such that for n c1d, the following holds true:

E( bθ θ0 2 2) 16(s0 + 1)λ2

ℓ2 F C2 min + 4W 2e cn2 . (36)

Proposition 12 is proved in Appendix B. We next establish some useful properties of the virtual valuation function ϕ and the price function g.

Lemma 13 If 1 F is log-concave, then the virtual valuation function ϕ is strictly monotone increasing.

Lemma 14 If 1 F is log-concave, then the price function g satisﬁes 0 < g (v) < 1, for all values of v R.

Proofs of Lemma 13 and 14 are given in Appendix C.1 and C.2, respectively. Given that bθk 1 W and |xt bθk| W for all t, k,

pt = g(xt bθk) 2|xt bθk| 2W , (37)

where in the ﬁrst inequality we used the fact that ϕ(v) is increasing as per Lemma 13 and hence g(v) = v + ϕ 1( v) v + |v| 2|v|. Similarly, we have p t 2W for all t. We are now ready to bound the regret of our policy. For t 1, let

regt p t I(vt p t ) pt I(vt pt) (38)

Javanmard and Nazerzadeh

be the regret at period t. Further, let Ht = {x1, x2, . . . , xt, z1, z2, . . . , zt} be the history set, up to time t (more precisely, Ht is the ﬁltration generated by {x1, x2, . . . , xt, z1, z2, . . . , zt}). We also deﬁne Ht = Ht {xt+1} as the ﬁltration obtained after augmenting by the new feature xt+1. We write

E(regt| Ht 1) = E(p t I(vt p t )| Ht 1) E(pt I(vt pt)| Ht 1)

= p t (1 F(p t xt θ0)) pt(1 F(pt xt θ0)) . (39)

Deﬁne revt(p) p(1 F(p xt θ0)) as the expected revenue under price p. Note that p t arg max revt(p) and thus r t(p t ) = 0. By Taylor expansion,

revt(pt) = revt(p t ) + 1

2r t (p)(pt p t )2 , (40)

for some p between pt and p t . We next show that |r t (p)| 2(B + WB ), with B = maxv f(v), and B = maxv f (v). To see this, we write

|r t (p)| = |2f(p xt θ0) + pf (p xt θ0)| 2B + 2WB , (41)

where we use the fact that pt, p t 2W and consequently p 2W. We let C = 2 max(B, B ). Then, combining Equations (39), (40), (41), along with 1-Lipschitz property of g gives

E(regt| Ht 1) (B + WB )(p t pt)2

CW(p t pt)2

= CW(g(θ0 xt) g(bθk xt))2

CW |xt (θ0 bθk)|2 . (42)

Here, in the second inequality we used the fact that W 1. Given that xt is independent of Ht 1, we have

E(regt|Ht 1) CW bθk θ0, Σ(bθk θ0) , (43)

where Σ = E(xtx T t ). Since the maximum eigenvalue of Σ is bounded by Cmax, we obtain

E(regt) = E (E(regt|Ht 1)) CCmax WE( bθk θ0 2 2) . (44)

Now, since the length of episodes grows exponentially, the number of episodes by period T is logarithmic in T. Speciﬁcally, T belongs to episode K = log T + 1. Hence,

Regret(T) =

k=1 Regret(kth Episode) . (45)

We bound the total regret over each episode by considering three separate cases:

Dynamic Pricing in High-dimensions

2k 2 c0s0 log d: Here, c0 is the constant in the statement of Proposition 10. In this case, episodes are not large enough to estimate θ0 accurately enough, and thus we use a naive bound on regret. Clearly, by (37), we have E(regt) p t 2W. Since the length of k th episode is 2k 1 2c0s0 log d, the total regret incurred during episode k is at most 4c0Ws0 log d.

c0s0 log d 2k 2 c1d: Here, c1 is the constant in the statement of Proposition 12. Continuing from Equation (44) and applying Corollary 11 to episode k, we obtain

Regret(kth Episode) =

t=τk E(regt)

t=τk E( bθk θ0 2 2)

CCmax W 16s0λ2 k ℓ2 F C2 min τk + 4W 2 τk

d + 2τke τk 1/(c0s0)

2 2s0 log d + 8W 2 2c1 + 2τk 1e τk 1

where in the last step we used τk = 2τk 1 and τk = 2k 1 2c1d. Therefore, in this case

Regret(kth Episode) C W

C2 min s0 log d , (47)

where C hides various constants in the right-hand side of (46).

c1d < 2k 2: Continuing from Equation (44) and applying Proposition 12 to episode k, we obtain

Regret(kth Episode) =

t=τk E(regt)

t=τk E( bθk θ0 2 2)

CCmax W 16(s0 + 1)λ2 k ℓ2 F C2 min τk + 4τk W 2e cτ 2 k 1

2 2(s0 + 1) log d + 8W 2τk 1e cτ 2 k 1

Therefore, in this case

Regret(kth Episode) C W

C2 min s0 log d , (49)

Javanmard and Nazerzadeh

where C hides various constants in the right-hand side of (48).

Combining the above three cases into Equation (45), we get

Regret(T) K C W

C2 min s0 log d = O W

C2 min s0 log d log T , (50)

which concludes the proof.

8.2. Proof of Theorem 4

By using Equation (43), we have

2 E D bθk θ0, Σ(bθk θ0) E , (51)

with Σ = E(xtx T t ) and C = 2(B + WB ), where B = maxv f(v) and B = maxv f (v). Therefore, letting K = log T + 1, we get

Regret(T) =

k=1 Regret(kth Episode) C

k=1 E D bθk θ0, Σ(bθk θ0) E τk . (52)

We next bound the right-hand side of the above bound. Let X(k) Rτk d be the matrix obtained by stacking feature vectors in episode k as rows. By applying bound (91) to samples in episode (k 1), we get that with probability at least 1 1/d,

X(k 1)(θ0 bθk) 2 + 2λk bθk 1 λk bθk θ0 1 + 2λk θ0 1 . (53)

X(k 1)(θ0 bθk) 2 λk bθk θ0 1 + 2λk θ0 1 2λk bθk 1 3λk bθk θ0 1 . (54)

For k 1, let S(k) Rd d be the empirical second moment of X(k), and deﬁne E(k) = Σ S(k). Then, D bθk θ0, Σ(bθk θ0) E = D bθk θ0, S(k 1)(bθk θ0) E + D bθk θ0, E(k 1)(bθk θ0) E . (55)

The ﬁrst term is bounded using Equation (54) as follows:

D bθk θ0, S(k 1)(bθk θ0) E = 1 τk 1

X(k 1)(θ0 bθk) 2 3λk

2ℓF bθk θ0 1 3W

ℓF λk , (56)

with probability at least 1 1/d. The second term can be bounded by virtue of the following lemma, whose proof is deferred to Appendix C.8

Dynamic Pricing in High-dimensions

Lemma 15 For any k 1 and any vector v Rd, we have

with probability at least 1 8/d2.

By Lemma 15, we have

D bθk θ0, E(k 1)(bθk θ0) E 12

log d τk 1 W 2 , (57)

with probability at least 1 8/d2. Combining Equations (56) and (57), with probability at least 1 9/d we have

D bθk θ0, Σ(bθk θ0) E 3W

log d τk 1 W 2 = 12 Wu F

log d τk 1 . (58)

Following a similar argument as in Section 8.1 (see Equation (45) and onwards) we have that the following holds for a suitable constant C > 0:

Regret(T) CW 3 K X

log d τk 1 τk

(log d)τk 1 = O(W 3p

(log d)T) ,

which completes the proof.

8.3. Proof of Theorem 5

The regret benchmark (9) is deﬁned as the maximum gap between a policy and the oracle policy over diﬀerent θ0 Ωand p X Q(X). Without loss of generality, we assume X = [ 1, 1]d. In order to obtain a lower bound on the regret, it suﬃces to consider a speciﬁc distribution in Q(X). We consider a distribution p X that selects coordinates xi, 1 i d, uniformly at random from { 1, 1} and independent of each other. We further assume θ0 Ω, with Ωgiven by (5). Fix an arbitrary policy π in family Π. By deﬁnition of policy set Π, we have π(pt) = g(xt θt), for some θt Ω, which is Ht 1-measurable . Recalling our notation in the proof of Theorem 3, regt denotes the regret occurred at step t and by Equations (39), (40), we have

E(regt| Ht 1) = revt(p t ) revt(pt) = 1

2rev t (p)(pt p t )2 , (59)

for some p between pt and p t . Our ﬁrst lemma will be used in lower bounding E(regt| Ht 1).

Javanmard and Nazerzadeh

Lemma 16 There exists a constant c1 > 0 (depending on W and σ) such that, with probability one5, r t (p t ) c1, for all t 1. Further, there exists constant δ > 0 (depending on W and σ) such that r t (p) c1/4 for p [p t δ, p t + δ], with probability one.

Proof of Lemma 16 is given in Appendix C.3. Continuing from Equation (59), we consider two separate cases:

|pt p t | δ: We have p [p t δ, p t + δ] and therefore by applying Lemma 16 we obtain

E(regt| Ht 1) = revt(p t ) revt(pt) c

8(pt p t )2 . (60)

|pt p t | > δ: Since function revt has only one local maximum, namely p t , the function is increasing before p t and decreasing afterward. Therefore, if pt p t δ then

revt(pt) revt(p t δ) = revt(p t ) + 1

2rev t (p)δ2 revt(p t ) c1

8 δ2 , (61)

where p is some point in [p t δ, p t ] and we applied Lemma 16 in the last step.

Similarly, for pt p t + δ we obtain

revt(pt) revt(p t + δ) = revt(p t ) + 1

2rev t (p)δ2 revt(p t ) c1

8 δ2 , (62)

where p [p t δ, p t ] this time. Combining these two inequalities, we get that revt(p t ) revt(pt) c1δ2/8, if |p t pt| δ.

Writing the bounds in the two cases together, we get

E(regt| Ht 1) revt(p t ) revt(pt)

8 (pt p t )2 , if |pt p t | δ ,

8 δ2 , if |pt p t | > δ .

We proceed by relating the lower bound to the error in estimation θ0.

E(regt| Ht 1) c1

8 min (pt p t )2, δ2 = c1

8 min (g(xt θt) g(xt θ0))2, δ2

8 min c2 2 |xt (θt θ0)|2, δ2 , (64)

where we used the fact that by Lemma 14, g (v) > c2 over the bounded interval [ W, W], for some constant c2 > 0. We recall the deﬁnition of history set Ht Ht\{xt+1} = {x1, x2, . . . , xt, z1, z2, . . . , zt}. Since Ht Ht, by iterated law of expectation, we get

E(regt|Ht 1) = E(E(regt| Ht 1)|Ht 1) c1

8 E min c2 2|xt (θt θ0)|2 2, δ2 Ht 1 . (65)

Note that xt is independent of Ht 1 and θt θ0 is Ht 1-measurable. We use the following lemma to lower bound the right-hand side of (65).

5. The randomness comes from randomness in prices which in turn comes from randomness in features xt.

Dynamic Pricing in High-dimensions

Lemma 17 Let x Rd be a random vector such that its coordinates are chosen independently and uniformly at random from { 1, 1}. Further, suppose that v Rd and δ > 0 are deterministic. Then,

E min (x v)2, δ2 0.1 min( v 2 2, δ2) . (66)

Proof of Lemma 17 is given in Appendix C.4. Applying Lemma 17 to bound (65), we obtain

E(regt|Ht 1) c1c2 2 80 E min θt θ0 2 2, δ2/c2 2 Ht 1 . (67)

Now, taking expectation from both sides with respect to Ht 1, we arrive at

E(regt) c1c2 2 80 E min θt θ0 2 2, δ2/c2 2 . (68)

Equation (68) lower bounds the expected regret at each step to the ℓ2 estimation error. We continue by establishing a minimax lower bound on ℓ2-risk of estimation.

Lemma 18 Consider linear model (2), with α0 = 0, and assume that the market values v(xt), 1 t T, are fully observed and the feature vectors are generated according to p X, described above. We further assume that the noise in market value is generated as zt N(0, σ2). For a sequence of estimators θt, we let θt 1 = (θ1, θ2, . . . , θt). Then, for any ﬁxed value C > 0, there exists a nonnegative constant e C, depending on C, σ, W, such that

min θT 1 max θ0 Ω

t=1 E min θt θ0 2 2, C x1, . . . , x T e C s0 log T

s0 , s0 log d

Proof of Lemma 18 is given in Appendix C.5. We are now ready to lower bound the regret of any policy in Π.

Regret(T) max θ0 Ω

t=1 E(regt) c1c2 2 80

t=1 E min θt θ0 2 2, δ2/c2 2

e C c1c2 2 80

s0 , s0 log d

where the last step follows from Lemma 18.

8.4. Proof of Theorem 8

Let xt = φ(xt) denote the transformed features under the feature-map. Also, let pt = ψ 1(pt). We ﬁrst show that Assumption 7 implies Assumption 6, and therefore it suﬃces to prove the theorem under Assumption 6.

Lemma 19 Suppose that Assumption 1 hold true. Then, Assumption 7 implies Assumption 6.

Javanmard and Nazerzadeh

Proof of Lemma 19 is given in Appendix C.6. By Assumption 1, the support of PX is a bounded set X. Given that φ has a continuous derivative, it is Lipschitz on the bounded set X and ergo the image of X remains bounded under the feature-map φ. Putting diﬀerently, features xt are sampled from a bounded set in Rd. Without loss of generality, we assume xt 1. Further, as per Assumption 6, the covariance of the underlying distribution Σφ is positive deﬁnite with bounded eigenvalues. On a diﬀerent note, since ψ is strictly increasing, a sale occurs at period t when θ0 xt + zt ψ 1(pt) = pt. Therefore the (negative) log-likelihood function for θ reads as

L(θ) = 1 τk 1

I(yt = 1) log(1 F( pt θ xt)) + I(yt = 1) log(F( pt θ xt)) .

The estimation bound (34) also holds for this setting and the proof goes along the same lines of the proof of Proposition 10, with slight modiﬁcations: (i) the features xt and prices pt should be replaced by xt and pt. (ii) Quantity u F and ℓF in the statement of Proposition 10 should be set as M = (1/3)gψ(0) + (2/3)W. This follows from the bounds below

pt = gψ( xt bθk) gψ(0) + | xt bθk| gψ(0) + W , (71)

where we used the facts that gψ is 1-Lipschitz and increasing as explained below Equation (23). We next characterize the optimal policy when the true parameter θ0 is known. The expected revenue from a posted price p works out at p(1 F(ψ 1(p) θ0 xt)). Writing this in terms of p = ψ 1(p), the ﬁrst order condition for the optimal price reads as

λ( p θ0 xt) f( p θ0 xt) 1 F( p θ0 xt) = ψ ( p )

ψ( p ) , (72)

where λ denotes the hazard rate function. Equivalently

θ0 xt = p λ 1 ψ ( p )

By deﬁnition of function gψ as per Equation (23), we get p = gψ(θ0 xt) and thus p = ψ(gψ(θ0 xt)). We are now ready to bound the regret of the algorithm. Similar to Equation (42), we have

E(regt| Ht 1) C

2 (p t pt)2 = C

h ψ(gψ(θ0 xt)) ψ(gψ(bθk xt)) i2

2 L(gψ(θ0 xt)) gψ(bθk xt))2 LC

2 | xt (θ0 bθk)|2 , (74)

where L max|v| ψ(M) |ψ (v)| (since ψ is continuously diﬀerentiable, it attains a maximum over a bounded set.) In addition, we used the fact that g ψ(v) 1 as explained below Equation (23). The inequalities above then follow from the mean-value theorem. Given that xt is independent of Ht 1, we have

E(regt|Ht 1) LC

2 bθk θ0, Σφ(bθk θ0) , (75)

Dynamic Pricing in High-dimensions

where Σφ = E(xtx T t ). Using Assumption 6,

E(regt) = E (E(regt|Ht 1)) 1

2LCCmax E( bθk θ0 2 2) . (76)

Rest of the proof is similar to proof of Theorem 8.1 (see after Equation (45)).

8.5. Proof of Theorem 9

We consider representation (27) of the valuations. Letting xt = ( xt; pt) and η = (µ; β) vectors in R(d+1) 1, we can write the log-likelihood loss as:

L(η) = 1 τk 1

I(yt = 1) log (1 F ( xt η)) + I(yt = 1) log (F ( xt η)) . (77)

We let p(k) = (pτk 1, . . . , pτk 1)T be the vector of prices posted in episode k. We also construct feature matrix X(k) by putting features xt that arrive in episode k, as its rows. Matrix e X(k) is constructed likewise by stacking augmented features xt. Hence, e X(k) = [ X(k), p(k)]. The next lemma is a technical lemma that will be used later in bounding the regret. We refer the reader to Appendix C.9 for its proof.

Lemma 20 Assume that the product features xt Rd are generated independently from an unknown distribution with bounded support in Rd. Then, with the choice of regularization parameters λk = 4u F p

(log d)/τk 1 in the RMLP-2 policy, the following inequalities hold true:

1 τk 1 X(k 1)(µ0 bµk) 2 4W

ℓF λk , (78)

(β0 bβk)2 c W

ℓF λk , (79)

where c W > 0 is a constant that depends on W.

We next ﬁgure out the clairvoyant policy.

Lemma 21 Let g be the pricing function corresponding to distribution F = F0,1, given by g(v) = v + ϕ 1( v), where ϕ(v) = v (1 F(v))/f(v) is the virtual valuation function. Then, under model (27), the clairvoyant optimal prices are given by

β0 g(xt µ0) . (80)

Proof of Lemma 21 is given in Appendix C.7.

Javanmard and Nazerzadeh

We next bound the regret at other periods of the episode. Let Ht = { x1, . . . , xt, xt+1, z1, . . . , zt} be the history set up to time t. Similar to (42), we write

E(regt| Ht 1) CW(p t pt)2

β0 g(xt µ0) 1

bβk g(xt bµk) 2

g(xt µ0) g(xt bµk) 2 + 2CW 1

2 g(xt bµk)2

β2 0 |xt (µ0 bµk)|2 + 8CW 3

β2 0 (bβk β0)2 . (81)

In the last step, the ﬁrst term is bounded by using 1-Lipschitz property of g and the second term is bounded by using the following chain of inequalities:

1 bβk g(xt bµk) 2

bβk |xt bµk| 2

bβk xt bµk 1 2W .

Note that in the ﬁrst inequality we used the fact that ϕ(v) is increasing for log-concave distributions and hence g(v) = v + ϕ 1( v) v + |v| 2|v|. The last inequality holds because bµk/bβk W by the optimization constraint in (28). Recalling our notation Ht = Ht\{ xt+1} and applying the law of iterated expectations, we have

E(regt|Ht 1) 2CW

D bµk µ0, Σ(bµk µ0) E + 8CW 3

β2 0 (bβk β0)2

C1W 3 D bµk µ0, Σ(bµk µ0) E + (bβk β0)2 , (82)

with C1 = 8CW/β2 0. Following a similar derivation to (58) we get

D bµk µ0, Σ(bµk µ0) E 16 Wu F

log d τk 1 . (83)

Using Equations (83) and (79) in Equation (82), we get

E(regt) C1W 3 16 Wu F

ℓF + W 2 + 4u F c W

log d τk 1 . (84)

Following a similar argument as in Section 8.1 (see Equation (45) and onwards) we have that the following holds for a suitable constant C > 0:

Regret(T) C2

log d τk 1 τk

(log d)τk 1 = O( p

(log d)T) ,

Dynamic Pricing in High-dimensions

where we used τk = 2τk 1. For the lower bound Ω(

T), note that under model (27) we can deﬁne the (scaled) customer s utility as

u(xt) = µ0 xt β0p + zt . (85)

Then, a purchase occurs if u(xt) > 0. Following our discussion in Section 4.1 (see after Equation (16)), since β0 is unknown, the uninformative prices do exist and therefore Ω(

T) applies to this case.

9. Conclusion

In this work, we leverage tools from statistical learning to design a dynamic pricing policy for a setting wherein the products are described via high-dimensional features. Our policy is computationally eﬃcient and by exploiting the structure of demand parameters, it obtains a regret that scales gracefully with the features dimension and the time horizon. Namely, the regret of our algorithm scales linearly with the sparsity of the optimal solution and logarithmically with the dimension. We also show an O(log T) dependence of the regret on the length of the horizon. On the ﬂip side, we provide a lower-bound of O(log T) on the regret of any algorithm that does not know the true parameters of the model in advance. A natural next step is providing a tight bound on the regret, closing the gap between the derived upper and lower bounds. Another step would be assuming that θ is not exactly sparse, but it can be well approximated by a sparse vector, i.e, θ0 θs0 1 δ for some s0-sparse vector θs0. An interesting question is to ﬁgure out how the regret scales with δ. The choice model that proposed in this work assumes one product arrived at each period, and describes the customer s purchase behavior based on the product features and the posted price. A more general choice model would be the one that assumes multiple products at each period. More speciﬁcally, each customer has a consideration set which includes products left after the customer has narrowed down her choices based on her own personal screening criteria, and then chooses the product from this set which brings maximum utility. We model the no purchase option as an extra product. This generalization is the focus of a future work. We also believe the ideas and techniques developed in this work can be be applied to other settings such as personalized pricing where information about the buyers can be used for price diﬀerentiation or optimizing reserve prices in online ad auctions. Another application would be assortment optimization and learning consumer choice models both in terms of the role of the structure (Farias et al., 2013; Kallus and Udell, 2016) as well as inventory-constrained environments (Golrezaei et al., 2014; Ferreira et al., 2018; Cheung et al., 2018).

Acknowledgments

Authors are thankful to Arnoud den Boer and Paat Rusmevichientong for their suggestions that improved this work. A. J. would also like to acknowledge the ﬁnancial support of the Oﬃce of the Provost at the University of Southern California through the Zumberge Fund

Javanmard and Nazerzadeh

Individual Grant Program, and an Outlier Research in Business (i ORB) grant from the USC Marshall School of Business (2018). Authors are supported in part by a Google Faculty Research Award (2016).

Dynamic Pricing in High-dimensions

Appendix A. Proof of Proposition 10

We start by reviewing the notion of restricted eigenvalue (RE) which is commonplace in high-dimensional statistical estimation.

Deﬁnition 22 For a given matrix A Rd d and some integer s such that 1 s0 d and a positive number c, we say that Restricted Eigenvalue (RE) condition is met if

κ2(A, s0, c) min J [p] |J| s0

min v =0 v Jc 1 c v J 1

v TAv v J 2 2 > 0 .

It is shown in (B uhlmann and van de Geer, 2011) and (Rudelson and Zhou, 2013) that when two matrices A0, A1 are close to each other (in the maximum element-wise norm) compared to sparsity s0, the RE condition for A0 implies the RE condition for A1. This is particularly useful when A0 is a population second moment matrix and A1 is a corresponding empirical second moment matrix. To apply this result to our case, let X Rn d be the feature matrix with rows xt, corresponding to n products. Let Σ = E(xtx T t ) be the empirical second moment. By Assumption 2, we have Σ Cmin I and therefore, Σ satisﬁes RE condition with κ2(eΣ, s0, 3) Cmin. By using the following result, we conclude that bΣ = (XTX)/n also satisﬁes RE condition with κ2(bΣ, s0, 3) Cmin/2.

Proposition 23 Let bΣ = (XTX)/n and let S = supp(θ0) be the support of θ0. Under Assumption 2, bΣ satisﬁes the restricted eigenvalue condition with constant κ(bΣ, s0, 3) p

Cmin/2, with probability 1 e 2n/(c0s0) and c0 = 768/C2 min, provided that n c0s0 log d,

Proposition 23 follows from the results established in (B uhlmann and van de Geer, 2011) and (Rudelson and Zhou, 2013). We outline the main steps of its proof in Appendix A.1 for the reader s convenience. By the second-order Taylor s theorem, expanding around θ0 we have

L(θ0) L(bθ) = L(θ0), bθ θ0 1

2 bθ θ0, 2L( θ)(bθ θ0) , (86)

for some θ on the line segment between θ0 and bθ. Invoking (31), we have

t=1 ξt(θ)xt , 2L(θ) = 1

t=1 ηt(θ)xtx T t , (87)

where and 2 represents the gradient and the hessian w.r.t θ. Further,

ξt(θ) = f(ut(θ))

F(ut(θ))I(yt = 1) + f(ut(θ)) 1 F(ut(θ))I(yt = +1)

= log F(ut(θ))I(yt = 1) log (1 F(ut(θ)))I(yt = +1) .

ηt(θ) = f(ut(θ))2

F(ut(θ))2 f (ut(θ))

I(yt = 1) + f(ut(θ))2

(1 F(ut(θ)))2 + f (ut(θ)) 1 F(ut(θ))

= log F(ut(θ))I(yt = 1) log (1 F(ut(θ)))I(yt = +1) ,

Javanmard and Nazerzadeh

where ut(θ) = pt xt, θ , and log F(x) and log F(x) represent ﬁrst and second derivative w.r.t x, respectively. By Equation (37), we have

|ut(θ0)| |pt| + xt θ0 1 3W .

Further, recall that the sequences {pt}n t=1 and {xt}n t=1 are independent of {zt}n t=1. Therefore, {ut(θ0)}T t=1 and {zt(θ0)}T t=1 are independent and by (4), we have E[ξt(θ0)] = E[E[ξt(θ0)|ut(θ0)]] = 0. Further, by deﬁnition of u F , cf. Equation (13), we have |ξt(θ0)| u F . We next introduce the set

F L(θ0) 2u F

By applying Azuma-Hoeﬀding inequality followed by union bounding over d coordinates of feature vectors, we obtain P(F) 1 1/d. On the other note, θ0 1, bθ 1 W and hence θ 1 W. This implies that |ut( θ)| 3W. Therefore, by deﬁnition of ℓF , cf. Equation (33), we have ηt( θ) ℓF . Recalling Equation (87), we get 2L( θ) ℓF ( e XT e X/n). By optimality of bθ, we write

L(bθ) + λ bθ 1 L(θ0) + λ θ0 1 , (89)

and by rearranging the terms and using (86), we arrive at

n X(θ0 bθ) 2 + λ bθ 1 L(θ0) bθ θ0 1 + λ θ0 1 . (90)

Form now on, the analysis is exactly similar to the oracle inequality for Lasso estimator. We bring the analysis here for the reader s convenience. Choosing λ 4u F p

(log d)/n, we have on F

n X(θ0 bθ) 2 + 2λ bθ 1 λ bθ θ0 1 + 2λ θ0 1 . (91)

Let S = supp(θ0). On the left-hand side using triangle inequality, we have

bθ 1 = bθS 1 + bθSc 1 bθS 1 bθS θ0,S 1 + bθSc 1 .

On the right-hand side, we have

bθ θ0 1 = bθS θ0,S 1 + bθSc 1 .

Using these two inequalities in (91), we get

n X(θ0 bθ) 2 + λ bθSc 1 3λ bθS θ0,S 1 . (92)

Dynamic Pricing in High-dimensions

We next write

n X(θ0 bθ) 2 + λ bθ θ0 1 = 2ℓF

n X(θ0 bθ) 2 + λ bθS θ0,S 1 + λ bθSc 1

(a) 4λ bθS θ0,S 1 (b) 4λ s0 bθS θ0,S 2

(c) 4λ 2s0 n Cmin X(bθ θ0) 2

n X(bθ θ0) 2 2 + 8λ2s0

where (a) follows from Equation (92); (b) holds by Cauchy-Shwarz inequality; (c) follows form the RE condition, which holds for bΣ = (XTX)/n as stated by Proposition 23, with κ(bΣ, s0, 3) p

Cmin/2, and recalling the inequality bθSc θ0,Sc 1 = bθSc 1 3 bθS θ0,S as per Equation (92); Finally (d) follows from the inequality 2

ab a2 + b2. Rearranging the terms, we obtain

n X(θ0 bθ) 2 + λ bθ θ0 1 8λ2s0

ℓF Cmin . (93)

Applying the RE condition again to the l.h.s of (93), we get

2 θ0 bθ 2 2 ℓF

n X(θ0 bθ) 2 8λ2s0

ℓF Cmin , (94)

and therefore,

θ0 bθ 2 2 16s0λ2

ℓ2 F C2 min . (95)

The result follows.

A.1. Proof of Proposition 23

The proof follows by combining two lemmas from (B uhlmann and van de Geer, 2011). We show the desired result holds for a more general case, namely for X with subgaussian entries. Before stating the proof, we recall a few deﬁnitions and notations.

Deﬁnition 24 A random variable ν is subgaussian if there exist constants L, σ0 such that

L2 ) σ2 0 L2 + 1 .

Note that bounded random variables are subgaussian. Speciﬁcally, if |ν| νmax, then ν is subgaussian with L = νmax and σ0 = νmax e 1. For a matrix A, we let A denote its (element wise) maximum norm, i.e., A = maxi,j |Aij|. The next lemma shows that if two matrices are close enough in maximum norm and if the compatibility condition holds for one of them then it would also hold for the other one.

Javanmard and Nazerzadeh

Lemma 25 Suppose that the restricted eigenvalue (RE) condition holds for Σ0 with constant κ(Σ0, s0, 3) > 0. If

Σ0 Σ1 κ2(Σ0, s0, 3)

then the RE condition holds for Σ1 with constant κ(Σ1, s0, 3) κ(Σ0, s0, 3)/

Proof Proof of Lemma 25 We refer to Problem 6.10 of (B uhlmann and van de Geer, 2011).

Lemma 26 Consider X Rn p with i.i.d. rows generated from a distribution with covariance Σ Rd d. Let bΣ = XTX/n be the corresponding empirical second moment. Further, suppose that the entries of X are uniformly subgaussian with parameters L, σ0. If n c0Ls0 log d with c0 = 768L/κ2(Σ, s0, 3), then

P bΣ Σ κ2(Σ, s0, 3)

Proof Proof of Lemma 26 The result follows readily from Problem 14.3 on page 535 of (B uhlmann and van de Geer, 2011).

Next we note that Σ satisﬁes the restricted eigenvalue condition with constant κ2(Σ, s0, 3) Cmin because of Assumption 2. Further, since xt 1, we can apply the result of Lemma 26 with L = 1, σ0 = e 1. Proposition 23 then follows from Lemma 25.

Appendix B. Proof of Proposition 12

Deﬁne the event Bn as follows:

Bn n X Rn d : σmin(XTX/n) > Cmin/2 o . (97)

Using concentration bounds on the spectrum of random matrices with subgaussian rows (see (Vershynin, 2010, Equation (5.26))), there exist constants c, c1 > 0 such that for n > c1d, we have P(Bn) 1 e cn2. For γ > 0, we deﬁne the event Fγ = { L(θ0) γ}. Using characterization (87), and by applying Azuma-Hoeﬀding inequality (similar to our argument after Equation (88)), we obtain

P(Fγ) 1 d exp nγ2

We also let E1,n Bn Fλ/2, E2,n Bn Fc λ/2. To lighten the notation, we use the shorthand

D bθ θ0 2 2. We then have

E(D) = E(D I(Bc n)) + E(D I(E1,n)) + E(D I(E2,n)) . (99)

We treat each of the terms on the right-hand side separately.

Dynamic Pricing in High-dimensions

Term 1: We have

E(D I(Bc n)) 4W 2P(Bc n) 4W 2e cn2 . (100)

Term 2: Similar to proof of Proposition 10, on E1,n, we have D 16s0λ2/(ℓ2 F C2 min). Hence,

E(D I(E1,n)) 16s0λ2

ℓ2 F C2 min . (101)

Term 3: To bound term 3, we ﬁrst prove the following lemma.

Lemma 27 On event En(γ) Bn Fγ, with γ > λ/2, we have

D 36 C2 minℓ2 F

Lemma 27 is proved in Section B.1.

We next bound term 3 as follows. Let L = 9λ2d/(C2 minℓ2 F ).

E(D I(E2,n)) = Z

0 P D I(E2,n) > α dα

0 P D I(E2,n) > Lc dc

0 P D I(E2,n) > Lc dc + L Z

1 P D I(E2,n) > Lc dc . (102)

For the ﬁrst term on the right-hand side we write

0 P D I(E2,n) > Lc dc L Z 1

0 P(E2,n) LP(Fc λ/2) L

where the last step holds from Equation (98) with γ = λ/2. We next upper bound the second term. For arbitrary ﬁxed c > 1, let γ = p

Lc/(dκ), with κ = 36/(ℓ2 F C2 min). It is easy to verify that γ = cλ/2 > λ/2. Further, by virtue of Lemma 27, on E(γ) we have D κγ2d = Lc. Hence,

P D I(E2,n) > Lc P E(γ)c E2,n P(Fc γ Bn) P(Fc γ) . (104)

Further, applying Equation (98) and plugging for γ, we obtain

P(Fc γ) d exp n Lc 2u2 F κd

= d exp nλ2

d1 2c . (105)

Here, the second step follows from deﬁnition of L and the last step holds because λ 4u F p

Javanmard and Nazerzadeh

Combining Equations (104) and (105), we have

1 P D I(E2,n) > Lc dc L Z

1 d1 2c dc L 2d log d . (106)

Using bounds (103) and (106) in Equation (102), we obtain

E(D I(E2,n)) L

1 + 1 2 log d

ℓ2 F C2 min . (107)

The result follows by putting the upper bounds on the three terms together.

B.1. Proof of Lemma 27

We start by rewriting Equation (90), which follows from optimality of bθ and log-concave property of the loss function:

n X(θ0 bθ) 2 + λ bθ 1 L(θ0) bθ θ0 1 + λ θ0 1 . (108)

On event En(γ), Equation (108) implies that

1 2CminℓF θ0 bθ 2 2 + λ bθ 1 γ bθ θ0 1 + λ θ0 1 . (109)

Using the assumption γ > λ/2 and our shorthand D bθ θ0 2 2, we get

1 2CminℓF bθ θ0 2 2 γ bθ θ0 1 + λ θ0 1 λ bθ 1

(γ + λ) bθ θ0 1

3γ bθ θ0 1 3γ

d bθ θ0 2 .

Writing the above bound in terms of our shorthand D bθ θ0 2 2, we obtain the desired result.

Appendix C. Proof of Technical Lemmas

C.1. Proof of Lemma 13

We write the virtual valuation function as ϕ(v) = v 1/λ(v) where λ(v) = f(v) 1 F(v) = log (1 F(v)) is the hazard rate function. Since 1 F is log-concave, the hazard function λ(v) is increasing which implies that ϕ is strictly increasing. Indeed, by this argument ϕ (v) > 1.

C.2. Proof of Lemma 14

Recalling the deﬁnition g(v) = v + ϕ 1( v), we have g (v) = 1 1/ϕ (ϕ 1( v)). Since ϕ is strictly increasing by Lemma 13, we have g (v) < 1. The claim g (v) > 0 follows if we show ϕ (ϕ 1( v)) > 1. For this we refer to the proof of Lemma 13, where we showed that ϕ (v) > 1 for all v.

Dynamic Pricing in High-dimensions

C.3. Proof of Lemma 16

Let φ(v) and Φ(v) respectively denote the density and the distribution function of standard normal variable. Function ht and its derivatives read as

revt(p) = p 1 Φ p xt θ0

rev t(p) = 1 Φ p xt θ0

rev t (p) = 1

2 i φ p xt θ0

Deﬁne ξ p t xt θ0 = g(xt θ0) xt θ0. Writing rev t (p t ) in term of ξ, we obtain

r t (p t ) = 1

2 i φ(ξ/σ) . (113)

By tail bound inequality for Gaussian distribution 1 Φ(ξ/σ) (σ/ξ)φ(ξ/σ) for ξ 0. Therefore,

2 1 , (114)

and the same bound obviously holds for ξ < 0. By deﬁnition of function g, |ξ| 3W with ϕ being the virtual valuation fusion corresponding to the Gaussian distribution. Hence, φ(ξ/σ) φ(3W/σ). Putting this together with (114), we get rev t (p t ) c1 with c1 = (1/σ)φ(3W/σ).

For the second part of the Lemma statement, set δ = min n 3W, σ2/(18W), σ2φ(3W/σ) o . For p [p t δ, p t + δ], we have

σ2 |p p t | |p + p t xt θ0| 1

σ2 δ(5W + δ) 1

Using Equation (114) we get p(p xt θ0)/σ2 1/2. Further,

φ p t xt θ0

|p p t | 2σ2 δ 2σ2 1

2φ(3W/σ) . (116)

2φ(3W/σ) . (117)

Combining (115), (117) we obtain

rev t (p) = 1

2 i φ p xt θ0

4σφ(3W/σ) = c1

The result follows.

Javanmard and Nazerzadeh

C.4. Proof of Lemma 17

Let Z = x v and Z = Z/ v 2. Note that Var( Z) = 1. Write the expectation in terms of the tail probability

E(min(Z2, δ2)) = Z δ2

0 P(Z2 t)dt = Z δ2

We consider two cases:

δ v 2: The right-hand side in (119) can be lower bounded as

dt = 2δ2 Z 1

0 t P(| Z| t)dt . (120)

In the sequel, we provide two separate lower bounds for the right-hand side.

Let ξ P(| Z| 1). We have Z 1

0 t P | Z| t dt Z 1

We proceed to obtain another bound which utilizes the fact Var( Z) = 1. Z 1

0 t P(| Z| t)dt = Z

0 t P(| Z| t)dt Z

1 t P(| Z| t)dt

1 t P(| Z| t)dt . (122)

For t 1, we have P(| Z| t) ξ. Further, by applying Chernoﬀbound, we get

P(| Z| t) = 2P( Z t) = 2P(eλ Z eλt) e λt E(eλ Z)

= 2e λt d Y

eλ vi v 2 + e λ vi v 2 2

i=1 e λ2 v2 i 2 v 2 2 = 2e λ2

Setting λ = t leads to P(| Z| t) 2e t2

2 . Combining these bounds into (121), we obtain Z 1

0 t P(| Z| t)dt 1

1 t min(2e t2

2 , ξ)dt = 1 ξ

We summarize bounds (121) and (123) as in Z 1

0 t P | Z| t dt min ξ [0,1] max ξ, 1 ξ

> 0.05 . (124)

Turning back to Equation (120), in this case we have

dt 0.1δ2 . (125)

Dynamic Pricing in High-dimensions

δ v 2: Similar to the previous case, the right-hand side in (119) can be lower bounded as Z δ2

dt . 0.1 v 2 2 (126)

The above two cases can be summarized as E(min(Z2, δ2)) 0.1 min( v 2 2, δ2).

C.5. Proof of Lemma 18

We use a standard argument that relates minimax ℓ2-risk in terms of the error in multi-way hypothesis testing problem; See e.g. (Yang and Barron, 1999; Yu, 1997). Let { θ1, . . . , θm} be a δ-packing of set Ω, meaning that their pairwise distances are all at least δ. Parameter δ is free for now and its value will be determined later in the proof. We further let Pj denote the induced probability on market values (v(x1), . . . , v(x T )), conditional on (x1, . . . , x T ) and for θ0 = θj. In other words, in deﬁning distributions Pj we treat feature vectors ﬁxed. Let ν be random variable uniformly distributed on the hypothesis set {1, 2, . . . , m} which indicates the index of the true parameter, i.e, ν = j means θ0 = θj. Deﬁne d(θT 1 , θ) PT t=1 min( θt θ 2 2, C) and let µ be the value of j for which d(θT 1 , θj) is a minimum. Suppose that δ is chosen such that δ2 C. If d(θT 1 , θj) < δ2T/4 then µ = j, because assuming otherwise, we have µ = j = j, and by triangle inequality

min( θj θj 2 2, C) min(2 θt θj 2 2 + 2 θt θj 2 2, C)

min(2 θt θj 2 2, C) + min(2 θt θj 2 2, C) , (127)

for all t, where we used the inequality min(a + b, c) min(a, c) + min(b, c) for a, b, c 0. Summing over t = 1, 2, . . . , T, we get

T min( θj θj 2 2, C) 2d(θT 1 , θj) + 2d(θT 1 , θj ) 4d(θT 1 , θj) < δ2T ,

where we used the assumption µ = j . But this is a contradiction because θj θj 2 δ (they form a δ-packing of Ω) and δ2 C. Using Markov inequality, we can write

max j EPjd(θT 1 , θj) δ2T

4 max j P d(θT 1 , θj) δ2T

j=1 P(µ = j|ν = j) = δ2T

4 P(µ = ν) . (128)

We use Fano s inequality to lower bound the error probability on the right-hand side. We ﬁrst construct a δ-packing of Ωsimilar to the one proposed in (Raskutti et al., 2011, proof of Theorem 1). Let s = s0/2 d/2 and deﬁne

A = {q { 1, 0, 1}d : q 0 = s} .

As proved in (Raskutti et al., 2011, Lemma 5), there exists a subset A A of cardinality | A| exp( s

2 log d s/2

s ) such that the Hamming distance between any two elements in A

Javanmard and Nazerzadeh

is at least s/2. Next, consider the set q

2 sδ A for some δ W/

2s whose exact value to

be determined later. Then, for q in this set, q 1 =

2sδ W and hence q

Further, for q, q q

2 sδ A, we have the following bounds:

q q 2 2 δ2 , (129)

q q 2 2 8δ2 . (130)

By (129), the set q

2 sδ A forms a δ-packing for Ωwith size | A|. We now turn back to bound (128). Left-hand side can be lower bounded using Fano s inequality. We omit the details here as it is a standard argument and instead we refer to (Raskutti et al., 2011, proof of Theorem 1) for details. Using Fano s inequality and bound (130), we get

P(µ = ν) = 1

8T 2σ2 δ2 + log(2)

s 2 log(d s/2

s ) . (131)

Choosing δ2 δ2 1 σ2s

32T log(d s/2

s ), we obtain P(µ = ν) 1/4. Therefore, setting δ2 = min(W 2

2s , δ2 1, C) and combining with bound (128), we conclude that

min θT 1 max θ0 ΩE(d(θT 1 , θ0)) δ2T

16 min W 2T

32 log d s/2

, CT . (132)

Now since s = s0/2 d/2, we have log((d s/2)/s) c log(d/s) with some constant c > 0. Therefore, by using Equation (132) and substituting for s = s0/2, we obtain

min θT 1 max θ0 ΩE(d(θT 1 , θ0)) L1 , (133)

16 min W 2T

, CT . (134)

We next derive another separate lower bound for minimax risk, by assuming that an oracle gives us the true support of θ0. In this case, the least square estimator, applied to the observed features restricted to the true support S, achieves the optimal minimax ℓ2 rate. This implies that θt θ0 2 2 cσ2s0/t, for t s0 and a constant c > 0. Therefore,

min θT 1 max θ0 ΩE(d(θT 1 , θ0))

t=1 min cσ2 s0

t , C , (135)

from which we obtain

min θT 1 max θ0 ΩE(d(θT 1 , θ0)) L2 c s0 log(T/s0) , (136)

Dynamic Pricing in High-dimensions

for some constant c > 0, depending on σ and C. Combining bounds in (134) and (136), we have

min θT 1 max θ0 ΩE(d(θT 1 , θ0)) 1

e C s0 log T

s0 , s0 log d

for a constant e C that depends on C, σ, W. The proof is complete.

C.6. Proof of Lemma 19

We recall the notion of 0-property established by (Ponomarev, 1987).

Deﬁnition 28 A continuous function has the 0-property, if the pre-image of any set of probability zero is a set of probability zero.

As proved in (Ponomarev, 1987, Theorem 1), if a function φ : X Rd 7 Rd is continuously diﬀerentiable, then it satisﬁes 0-property if and only if its derivative Dφ is full rank for almost all x X. Therefore, we need to show that under Assumption 1, if φ has 0-property, then Assumption 6 holds true. Supposing otherwise, there exists a nonzero v Rd such that v TΣφv = 0. Therefore, E((z φ(x))2) = 0 which implies that z φ(x) = 0, almost surely. Deﬁne S {z Rd : z φ(x) = 0}. Space S is (d 1)-dimensional and all the points in φ(X) belong to S almost surely, i.e., P(φ(X) Sc) = 0. However, since Σ is positive deﬁnite (with all of its eigenvalues target than Cmin, by Assumption 1), PX(S) = 0. Combining these observations, P(φ(X)) P(S)+P(φ(X) Sc) = 0. Since φ has the 0-property, this implies that PX(X) = 0, which is a contradiction because X is the support of PX and thus PX(X) = 1. The result follows.

C.7. Proof of Lemma 21

Under model (27), a purchase occurs at time t with the posted price price p if vt β0p. This is equivalent to zt β0p xt µ0. Therefore, the expected revenue from a posted price p is given by

p P( vt β0p) = p(1 F(β0p µ0 xt)) . (138)

By setting the ﬁrst order conditions, the optimal price p t is given by the solution of the following equation:

β0p t = 1 F(β0p t µ0 xt) f(β0p t µ0 xt) . (139)

It is straightforward to verify that the solution p t of the above equation is given by p t = (1/β0)g(µ0 xt).

Javanmard and Nazerzadeh

C.8. Proof of Lemma 15

Recall the notation E(k) = maxi,j |E(k) ij |. Note that

i,j=1 |E(k) ij ||vi||vj| E(k)

i,j=1 |vi| |vj|

= E(k) v 2 1 . (140)

Therefore, we only need to bound E(k) . Fix 1 i, j d + 1. We then have

E(k) ij = Σ(k) ij S(k) ij = 1

n E(Xℓi Xℓj) Xℓi Xℓj o . (141)

Let u(ij) ℓ = Xℓi Xℓj. Then, |u(ij) ℓ | 1 because xℓ 1. By applying Hoeﬀding s inequality,

|E(k) ij | 3

Therefore, by union bonding over all indices 1 i, j d + 1, we obtain that E(k) 3 p

(log d)/τk, with probability at least 1 8/d2. The claim follows from this result along with (140).

C.9. Proof of Lemma 20

Deﬁne parameter vectors η0,0 = (µ0; β0), η0, = (µ0; bβk), η ,0 = (bµk; β0), and η , = (bµk; bβk) all in Rd+1. By optimality of η , , we have µL(bµk) + λkνk = 0 and βL(bβk) + λkρk = 0, where νk is a subgradient of ℓ1 norm at bµk and ρk is a subgradient of the absolute value function at bβk. Therefore,

L(η0, ) = ( µL(µ0); βL(bβk)) = ( µL(µ0); λkρk) ,

L(η ,0) = ( µL(bµk); βL(β0)) = ( λkνk; µL(β0)) .

By a similar argument following (88), we have L(η0,0) λk/2, with probability at least 1 1/d. Hence, invoking relations (143), we get

L(η0, ) λk , L(η ,0) λk . (144)

By optimality of η , , we have

L(η , ) + λk η , 1 L(η0, ) + λk η0, 1 .

Following a similar argument to (91), we have

2ℓF τk 1 e X(k 1)(η0, η , ) 2 + 2λk η , 1 2λk η , η0, 1 + 2λk η0, 1.

Dynamic Pricing in High-dimensions

Applying triangle inequality and rearranging the terms, this yields to the following inequality:

2ℓF τk 1 e X(k 1)(η0, η , ) 2 4λk η , η0, 1 8Wλk . (145)

By substituting for η0, and η , , we get Equation (79). Likewise, we can write (145) with η ,0 instead of η0, :

2ℓF τk 1 e X(k 1)(η ,0 η , ) 2 8Wλk . (146)

By substituting for η ,0 and η0,0 we obtain

2ℓF τk 1 (β0 bβk)2 p(k 1) 2 = 2ℓF

τk 1 e X(k 1)(η ,0 η , ) 2 8Wλk , (147)

where p(k) = (pτk 1, . . . , pτk 1)T is the vector of prices oﬀered in episode (k 1). Hence, in order to prove (79), it suﬃces to show that pt c W for all t. To see this, note that since g(v) > 0 it attains its minimum over any bounded interval. Therefore, there exists a constant c W > 0, such that g(v) > c W for v [ W, W]. Further, by the constraint in the optimization (28) we have |bβ(k 1)| W. Combining these two facts, we get

pt = 1 bβ(k 1) g(xt bµ(k 1)) c W W c W . (148)

The result follows from (147) and setting c W = 4W/ c2 W .

Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-conﬁdence-set conversions and application to sparse stochastic bandits. In AISTATS, pages 1 9, 2012.

Shipra Agrawal and Nikhil R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, EC 14, pages 989 1006, 2014.

Airbnb Documentation. Smart pricing: Set prices based on demand. https://www.airbnb. com/help/article/1168/smart-pricing--set-prices-based-on-demand, 2015.

Kareem Amin, Afshin Rostamizadeh, and Umar Syed. Repeated contextual auctions with strategic buyers. In Advances in Neural Information Processing Systems, pages 622 630, 2014.

Victor F Araman and Ren e Caldentey. Dynamic pricing for nonperishable products with demand learning. Operations research, 57(5):1169 1188, 2009.

Moshe Babaioﬀ, Shaddin Dughmi, Robert Kleinberg, and Aleksandrs Slivkins. Dynamic pricing with limited supply. In Proceedings of the 13th ACM Conference on Electronic Commerce, EC 12, pages 74 91, 2012. ISBN 978-1-4503-1415-2.

Javanmard and Nazerzadeh

Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 207 216. IEEE, 2013.

Mark Bagnoli and Ted Bergstrom. Log-concave probability and its applications. Economic theory, 26(2):445 469, 2005.

Santiago R Balseiro, Jon Feldman, Vahab Mirrokni, and S Muthukrishnan. Yield optimization of display advertising with ad exchange. Management Science, 60(12):2886 2907, 2014.

Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates. Working Paper, 2016.

Omar Besbes and Assaf Zeevi. Dynamic pricing without knowing the demand function: risk bounds and near-optimal algorithms. Operations Research, 57:1407 1420, 2009.

Sonia A Bhaskar and Adel Javanmard. 1-bit matrix completion under exact low-rank constraint. In Information Sciences and Systems (CISS), 2015 49th Annual Conference on, pages 1 6. IEEE, 2015.

Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends R in Machine Learning, 3(1):1 122, 2011.

Josef Broder and Paat Rusmevichientong. Dynamic pricing under a general parametric choice model. Operations Research, 60(4):965 980, 2012.

Peter B uhlmann and Sara van de Geer. Statistics for high-dimensional data. Springer Verlag, 2011.

Emmanuel Candes and Terence Tao. The dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313 2351, 2007.

Xi Chen, Zachary Owen, Clark Pixton, and David Simchi-Levi. A statistical learning approach to personalization in revenue management. Working Paper, 2015.

Wang Chi Cheung, Will Ma, David Simchi-Levi, and Xinshang Wang. Inventory balancing with online learning. Working Paper, 2018.

Maxime C Cohen, Ilan Lobel, and Renato Paes Leme. Feature-based dynamic pricing. ACM Conference on Economics and Computation, 2016.

Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355 366, 2008.

A. V. den Boer and A. P. Zwart. Mean square convergence rates for maximum(quasi) likelihood estimation. Stochastic systems, 4:1 29, 2014. ISSN 1946-5238.

Dynamic Pricing in High-dimensions

Arnoud V den Boer. Dynamic pricing and learning: historical origins, current research, and new directions. Surveys in operations research and management science, 20(1):1 18, 2015.

Arnoud V den Boer and Bert Zwart. Simultaneously learning and optimizing using controlled variance pricing. Management Science, 60(3):770 783, 2013.

Benjamin Edelman, Michael Ostrovsky, and Michael Schwarz. Internet advertising and the generalized second-price auction: Selling billions of dollars worth of keywords. The American economic review, 97(1):242 259, 2007.

Ossama Elshiewy, Daniel Guhl, and Yasemin Boztug. Multinomial logit models in marketing-from fundamentals to state-of-the-art. Marketing ZFP, 39(3):32 49, 2017.

Vivek F Farias, Srikanth Jagabathula, and Devavrat Shah. A nonparametric approach to modeling choice with limited data. Management Science, 59(2):305 322, 2013.

Kris Johnson Ferreira, David Simchi-Levi, and He Wang. Online network revenue management using thompson sampling. Operations research, 2018.

Dennis H Gensch and Wilfred W Recker. The multinomial, multiattribute logit choice model. Journal of Marketing Research, pages 124 132, 1979.

Alexander Goldenshluger and Assaf Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230 261, 2013.

Negin Golrezaei, Hamid Nazerzadeh, and Paat Rusmevichientong. Real-time optimization of personalized assortments. Management Science, 60(6):1532 1551, 2014.

Negin Golrezaei, Adel Javanmard, and Vahab Mirrokni. Dynamic incentive-aware learning: Robust pricing in contextual auctions. http://dx.doi.org/10.2139/ssrn.3144034, 2018.

Peter M Guadagni and John DC Little. A logit model of brand choice calibrated on scanner data. Marketing Science, 27(1):29 48, 2008.

J Michael Harrison, Bora Keskin, and Assaf Zeevi. Bayesian dynamic pricing policies: Learning and earning under a binary prior distribution. Management Science, 58(3): 570 586, 2012.

John R Hauser and Frank S Koppelman. Alternative perceptual mapping techniques: Relative accuracy and usefulness. Journal of marketing Research, pages 495 506, 1979.

Adel Javanmard. Perishability of data: Dynamic pricing under varying-coeﬃcient models. Journal of Machine Learning Research, 18(53):1 31, 2017. URL http://jmlr.org/ papers/v18/17-061.html.

Nathan Kallus and Madeleine Udell. Dynamic assortment personalization in high dimensions. Working Paper, 2016.

Godfrey Keller and Sven Rady. Optimal experimentation in a changing environment. The review of economic studies, 66(3):475 507, 1999.

Javanmard and Nazerzadeh

Bora Keskin. Optimal dynamic pricing with demand model uncertainty: A squaredcoeﬃcient-of-variation rule for learning and earning. Working Paper, 2014.

Bora Keskin and Assaf Zeevi. Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Operations Research, 62(5):1142 1167, 2014.

N Bora Keskin and Assaf Zeevi. Chasing demand: Learning and earning in a changing environment. Mathematics of Operations Research, 2016.

Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of 44th Annual IEEE Symposium on Foundations of Computer Science, pages 594 605. IEEE, 2003.

S ebastien Lahaie and David M Pennock. Revenue analysis of a family of ranking rules for keyword auctions. In Proceedings of the 8th ACM conference on Electronic commerce, pages 50 56. ACM, 2007.

Jordan J Louviere and George Woodworth. Design and analysis of simulated consumer choice or allocation experiments: an approach based on aggregate data. Journal of marketing research, pages 350 367, 1983.

Daniel Mc Fadden. Economic choices. American economic review, 91(3):351 378, 2001.

Roger B. Myerson. Optimal auction design. Mathematics of Operations Research, 6(1): 58 73, 1981.

Yaniv Plan and Roman Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275 1297, 2013.

Stanislav P Ponomarev. Submersions and preimages of sets of measure zero. Siberian Mathematical Journal, 28(1):153 163, 1987.

Sheng Qiang and Mohsen Bayati. Dynamic pricing with demand covariates. Working Paper, 2016.

Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for highdimensional linear regression over-balls. IEEE Transactions on Information Theory, 57 (10):6976 6994, 2011.

Michael Rothschild. A two-armed bandit theory of market pricing. Journal of Economic Theory, 9(2):185 202, 1974.

Mark Rudelson and Shuheng Zhou. Reconstruction from anisotropic random measurements. IEEE Trans. on Inform. Theory, 59(6):3434 3447, 2013.

Kalyan T Talluri and Garrett J Van Ryzin. The theory and practice of revenue management, volume 68. Springer Science & Business Media, 2006.

A.B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer New York, 2008. ISBN 9780387790527. URL https://books.google.com/ books?id=mw B8r UBsbqo C.

Dynamic Pricing in High-dimensions

Sara A Van de Geer. High-dimensional generalized linear models and the lasso. The Annals of Statistics, pages 614 645, 2008.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. ar Xiv preprint ar Xiv:1011.3027, 2010.

Zizhuo Wang, Shiming Deng, and Yinyu Ye. Close the gaps: A learning-while-doing algorithm for single-product revenue management problems. Operations Research, 62(2): 318 331, 2014.

Baichun Xiao, Wei Yang, and Jun Li. Optimal reserve price for the generalized second-price auction in sponsored search advertising. Journal of Electronic Commerce Research, 10 (3):114, 2009.

Yuhong Yang and Andrew Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, pages 1564 1599, 1999.

Bin Yu. Assouad, fano and le, cam. In Research Papers in Probability and Statistics: Festschrift in Honor of Lucien Le Cam, pages 423 435. Springer-Verlag, 1997.