# Robust Quadratic Programming for Price Optimization

Akihiro Yabe, Shinji Ito, Ryohei Fujimaki
NEC Corporation
a-yabe@cq.jp.nec.com, s-ito@me.jp.nec.com, rfujimaki@nec-labs.com

**Abstract.** The goal of price optimization is to maximize total revenue by adjusting the prices of products, on the basis of predicted sales numbers that are functions of pricing strategies. Recent advances in demand modeling using machine learning raise a new challenge in price optimization: how to manage statistical errors in estimation. In this paper, we show that uncertainty in recently proposed prescriptive price optimization frameworks can be represented by a matrix normal distribution. For this particular form of uncertainty, we propose novel robust quadratic programming algorithms for conservative lower-bound maximization, and we offer an asymptotic probabilistic guarantee of the conservativeness of our formulation. Our experiments on both artificial and actual price data show that our robust price optimization allows users to determine the best risk-return trade-offs and to explore safe, profitable price strategies.

## 1 Introduction

Product price is the most significant factor in determining sales. Better pricing strategies (i.e., pricing for multiple products) lead to higher total profits, and determination of the best price strategy is one of the most important strategic business decisions. Price optimization has been actively studied in marketing science and revenue management [Kunz and Crone, 2014] and has already achieved great success in such industries as online retail [Ferreira et al., 2015], fast fashion [Caro and Gallien, 2012], hotels [Koushik et al., 2012; Lee, 2011], and airlines [Côté et al., 2003].

Most price optimization frameworks consist of two stages. The first is demand modeling, which reveals such complicated relationships between prices and sales quantities as price elasticity of demand [Marshall, 2009], cross price elasticity (a.k.a. cannibalization) [van Ryzin and Mahajan, 1999], and the law of diminishing marginal utility [Marshall, 2009]. The second derives the best price strategy by maximizing a utility function, typically total revenue or profit. Most existing studies of multi-product price optimization employ mixed-integer programming [Caro and Gallien, 2012; Koushik et al., 2012; Lee, 2011] because of the discrete nature of individual prices.

Recently, [Ito and Fujimaki, 2017] proposed prescriptive price optimization. They employ advanced regression techniques (e.g., non-linear, sparse) to flexibly model complex relationships between demand and price with respect to multiple products, and they have shown that the profit maximization problem can be naturally expressed as a binary quadratic programming (BQP) problem, for which they proposed a relaxation method using semi-definite programming (SDP). Although the SDP method is more efficient than existing mixed-integer-programming-based methods, its scalability is still limited, since it theoretically requires $O(M^6)$ computational time, where $M$ is the number of products. [Ito and Fujimaki, 2016] utilized the connection between submodularity and the substitute-goods property [Koushik et al., 2012] known in economics and proposed an extremely fast proximal gradient algorithm with network-flow optimization that, under certain conditions, guarantees the global optimality of the original BQP problem.
A key drawback of these previous works is their optimism with respect to optimization. There exist two types of uncertainty in prescriptive price optimization: system noise and estimation error. System noise represents stochastic uncertainty in demand. Estimation error occurs because of the stochastic nature of machine learning, i.e., regression coefficients vary under stochastic changes in sales records. Although previous works have not taken these uncertainties into account, it is known in the area of stochastic optimization that such uncertainties can significantly degrade the quality of "optimal" solutions because of optimistic bias in objective values [Delage and Ye, 2010; Duchi et al., 2016].

This paper proposes a novel robust quadratic programming method for price optimization. Our key contributions are twofold. First, we prove that uncertainty in prescriptive price optimization can be represented by a matrix normal distribution when least square estimation is employed. This provides a natural robust formulation of price optimization as conservative lower-bound maximization. Although it is common for robust optimization to take system noise into account [Ben-Tal et al., 2009], our study is unique in the sense that we explicitly model the uncertainty that arises in estimation by machine learning (i.e., estimation error). Second, we propose algorithms for robust quadratic optimization consisting of sequential relaxations to a non-robust counterpart, which employ a non-robust algorithm (such as [Ito and Fujimaki, 2016; 2017]) as a sub-routine. The algorithm is guaranteed to obtain a local optimum with $O(\log(N/\delta))$ calls of the non-robust optimization oracle, which is far fewer than existing algorithms require [Tütüncü and Koenig, 2004]. We have experimentally evaluated our robust pricing strategies for several values of the conservativeness parameter, using an artificial price simulator, and show their efficiency in terms of computational time and actual revenue improvement.

## 2 Problem Settings

### 2.1 Quadratic Price Optimization

Suppose we have $M$ products whose prices and sales quantities are denoted by $x = (x_1, \ldots, x_M)^\top \in \mathcal{X} \subseteq \mathbb{R}^M$ and $y = (y_1, \ldots, y_M)^\top \in \mathbb{R}^M$, respectively, where $\mathcal{X}$ is a closed bounded set. Further, let us assume that the vector $y$ of sales quantities follows a multi-dimensional regression with non-linear basis functions:

$$y = A^* v(x) + \varepsilon, \qquad (1)$$

where $v : \mathcal{X} \to \mathbb{R}^N$ is an $N$-dimensional non-linear transformation of $x$, $A^* \in \mathbb{R}^{M \times N}$ is the true regression coefficient matrix representing cross price elasticity [Marshall, 2009] (this paper assumes $N \geq M$), and $\varepsilon$ is system noise over $\mathbb{R}^M$ that follows a multi-dimensional normal distribution $\mathcal{N}(0, \Sigma^*)$ with zero mean and covariance matrix $\Sigma^* \in \mathbb{R}^{M \times M}$. Without loss of generality, we assume that the first $M$ elements of $v(x)$ are linear features, i.e., $v(x) = (x^\top, N - M \text{ non-linear features})^\top$. (If we do not use the linear feature $x$, we can simply fix the first $M$ columns of $A^*$ to be zero.) The transformation $v$ allows us to incorporate such non-linear effects as the law of diminishing marginal utility. We omit non-price features here, but it is straightforward to incorporate such features as well, as pointed out in [Ito and Fujimaki, 2016].

The goal of our price optimization is to maximize gross profit, defined by $(x - c)^\top y$, where $c = (c_1, \ldots, c_M)^\top \in \mathbb{R}^M$ is a cost vector. Hereinafter, for notational simplicity, this paper assumes $c = 0$ and thus maximizes gross revenue $x^\top y$.
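To make the setting concrete, the following is a minimal numpy sketch of the demand model in Eq. (1), using linear features plus an intercept so that $v(x) = (x^\top, 1)^\top$ and $N = M + 1$ (the same choice as in the simulator of Section 6.1). The toy values of $A^*$ and $\Sigma^*$ and all function names are ours, not the paper's.

```python
import numpy as np

# Minimal sketch of the demand model y = A* v(x) + eps in Eq. (1).
# Assumption: v(x) = (x, 1), i.e. linear price features plus an intercept.
rng = np.random.default_rng(0)
M = 5                                   # number of products
N = M + 1                               # feature dimension

def v(x):
    """Feature map v : X -> R^N (prices followed by a constant term)."""
    return np.concatenate([x, [1.0]])

A_star = rng.normal(size=(M, N))        # true coefficient matrix (toy values)
Sigma_star = np.eye(M)                  # system-noise covariance (toy value)

def sample_sales(x):
    """Draw one noisy sales vector y = A* v(x) + eps, eps ~ N(0, Sigma*)."""
    eps = rng.multivariate_normal(np.zeros(M), Sigma_star)
    return A_star @ v(x) + eps

x = np.full(M, 0.9)                     # one candidate price strategy
revenue = x @ sample_sales(x)           # gross revenue x^T y (cost c = 0)
```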
Since the first $M$ elements of $v(x)$ are the prices $x$, the expected gross revenue satisfies $x^\top A^* v(x) = -v(x)^\top Q^* v(x)$, and the problem can then be expressed as the following quadratic programming:

$$\min_{x \in \mathcal{X}} \; v(x)^\top Q^* v(x), \quad \text{where } Q^* := -\begin{pmatrix} A^* \\ 0 \end{pmatrix} \in \mathbb{R}^{N \times N}. \qquad (2)$$

### 2.2 Link to Prescriptive Price Optimization

This subsection shows that the prescriptive price optimization problem [Ito and Fujimaki, 2016; 2017] reduces to (2). It considers the following general sparse additive models for demand modeling: $y_i = \sum_{j=1}^{M} f_{i,j}(x_j) + b_i$, where $f_{i,j} : P_j \to \mathbb{R}$ is any uni-variate feature transformation function and $P_i := \{p_{i,1}, p_{i,2}, \ldots, p_{i,L}\} \subseteq \mathbb{R}$ is a set of price candidates. The price optimization problem can then be formulated as

$$\max_{x \in \mathcal{X}} \; \sum_{i=1}^{M} (x_i - c_i) \Bigl( \sum_{j=1}^{M} f_{i,j}(x_j) + b_i \Bigr), \qquad (3)$$

where $\mathcal{X} := \{x \mid x_i \in P_i,\; i = 1, \ldots, M\}$. Again, for simplicity, we here assume $c = 0$; this does not change the discussion that follows. Without loss of generality, we can assume that there exist basis functions $v_k : \mathbb{R} \to \mathbb{R}$ and coefficients $a^*_{i,j,k}$ for $k = 1, 2, \ldots, K$ such that

$$f_{i,j}(x_j) = \sum_{k=1}^{K} a^*_{i,j,k} \, v_k(x_j) \qquad (4)$$

for all $x_j \in P_j$. The most typical and simplest setting is $K = 1$ and $v_1(x_j) = x_j$ (linear basis), but in general we can represent (4) by taking every $f_{i,j}$ itself as a basis for $i, j = 1, \ldots, M$. Then we have $y = F v(x) + b$, where $b = (b_1, \ldots, b_M)^\top$ and $F \in \mathbb{R}^{M \times KM}$ is defined by $F_{i,(k-1)M+j} = a^*_{i,j,k}$ for $k = 1, 2, \ldots, K$ and $j = 1, 2, \ldots, M$. This shows the equivalence of (3) and (2).

## 3 Robust Price Optimization

### 3.1 Robust QP under Matrix Normal Uncertainty

Since the true model $A^*$ cannot be obtained in practice, a natural approach is to replace it with an estimator $\hat{A}$. Suppose we have a set of training samples denoted by $\{x_d, y_d\}_{d=1}^{D}$. We consider the least square estimator $\hat{A}$ of $A^*$:

$$\hat{A} := \arg\min_{A} \sum_{d=1}^{D} \bigl\| y_d - A v(x_d) \bigr\|^2. \qquad (5)$$

For $n \geq 1$, let $I_n$ denote the identity matrix of size $n$. We define matrices $V$ and $W$ by

$$V := (v(x_1), v(x_2), \ldots, v(x_D)) \in \mathbb{R}^{N \times D}, \qquad (6)$$
$$W := V V^\top \in \mathbb{R}^{N \times N}. \qquad (7)$$

Then, by the polar decomposition, there exists a matrix $\Gamma \in \mathbb{R}^{N \times D}$ satisfying $\Gamma \Gamma^\top = I_N$ and $V = W^{1/2} \Gamma$. Let $U_{m,n}$ denote a random matrix over $\mathbb{R}^{m \times n}$ each of whose entries is independently generated from $\mathcal{N}(0, 1)$, where $\mathcal{N}$ denotes a normal distribution. The next proposition provides the distribution of $\hat{A}$.

**Proposition 1.** Given $\{x_d\}_{d=1}^{D}$, $\hat{A}$ is generated by

$$\hat{A} = A^* + \Sigma^{*1/2} \, U_{M,N} \, W^{-1/2}. \qquad (8)$$

Proposition 1 indicates that $\hat{A}$ follows the matrix normal distribution [Gupta and Nagar, 1999] with row covariance $\Sigma^*$ and column covariance $W^{-1}$, denoted $\mathcal{MN}(A^*, \Sigma^*, W^{-1})$:

$$P(\hat{A} \mid A^*, \Sigma^*, W) = \frac{\exp\bigl( -\tfrac{1}{2} \, \mathrm{tr}\bigl[ W (\hat{A} - A^*)^\top \Sigma^{*-1} (\hat{A} - A^*) \bigr] \bigr)}{(2\pi)^{MN/2} \, |\Sigma^*|^{N/2} \, |W^{-1}|^{M/2}}. \qquad (9)$$

In order to derive a formulation for robust price optimization, we first consider the confidence region of $\hat{A}$. Let us define the estimator $\hat{\Sigma}$ of $\Sigma^*$ and the confidence region $C_\lambda$ as follows:

$$\hat{\Sigma} := \frac{1}{D} (Y - \hat{A} V)(Y - \hat{A} V)^\top, \qquad (10)$$
$$C_\lambda := \{ A \mid A = \hat{A} + \hat{\Sigma}^{1/2} U W^{-1/2}, \; \| U \|_F \leq \lambda \}, \qquad (11)$$

where $Y := (y_1, \ldots, y_D) \in \mathbb{R}^{M \times D}$, $U \in \mathbb{R}^{M \times N}$, and $\| U \|_F$ is the Frobenius norm of $U$. We then derive the robust optimization problem with matrix normal uncertainty:

$$\max_{x \in \mathcal{X}} \min_{A \in C_\lambda} x^\top A v(x) \;=\; -\min_{x \in \mathcal{X}} \max_{Q \in C'_\lambda} v(x)^\top Q v(x), \qquad (12)$$
$$C'_\lambda := \{ Q = \hat{Q} + L_1^\top U L_2, \; \| U \|_F \leq \lambda \}, \qquad (13)$$

where $\hat{Q} := -\begin{pmatrix} \hat{A} \\ 0 \end{pmatrix}$, $L_1 := \begin{pmatrix} \hat{\Sigma}^{1/2} & 0 \\ 0 & 0 \end{pmatrix}$, and $L_2 := W^{-1/2}$. Here $\lambda \geq 0$ is a scalar parameter for controlling conservativeness (a smaller $\lambda$ corresponds to a more aggressive strategy).
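The quantities in Eqs. (5)-(13) are all directly computable from training data. Below is a small sketch, under the linear-feature convention above and assuming $D \geq N$ so that $W$ is invertible; the helper names (`fit_uncertainty`, `sqrtm_psd`) are ours.

```python
import numpy as np

def fit_uncertainty(X_hist, Y, v, M):
    """Sketch of Eqs. (5)-(13): least squares A_hat, residual covariance
    Sigma_hat, W = V V^T, and the matrices Q_hat, L1, L2 defining C'_lambda.
    X_hist: D x M historical prices; Y: M x D sales; v: feature map."""
    V = np.stack([v(x) for x in X_hist], axis=1)   # N x D design matrix
    N, D = V.shape
    W = V @ V.T                                     # Eq. (7); assumes D >= N
    A_hat = Y @ V.T @ np.linalg.inv(W)              # least squares, Eq. (5)
    R = Y - A_hat @ V                               # residuals
    Sigma_hat = R @ R.T / D                         # Eq. (10)

    def sqrtm_psd(S, inv=False):
        """(Inverse) square root of a PSD matrix via eigendecomposition."""
        w, P = np.linalg.eigh(S)
        d = np.clip(w, 1e-12, None) ** (-0.5 if inv else 0.5)
        return (P * d) @ P.T

    L1 = np.zeros((N, N))                           # block matrix in Eq. (13)
    L1[:M, :M] = sqrtm_psd(Sigma_hat)
    L2 = sqrtm_psd(W, inv=True)                     # W^{-1/2}
    Q_hat = np.zeros((N, N))
    Q_hat[:M, :] = -A_hat                           # Q_hat = -(A_hat; 0)
    return Q_hat, L1, L2
```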
### 3.2 Probabilistic Guarantees of Robustness

This section presents an asymptotic probabilistic guarantee for our robust formulation, which can be regarded as a specialization of [Duchi et al., 2016, Theorem 6]. It ensures that we can prevent optimistic bias in objective values: our robust optimum value does not asymptotically over-estimate the true optimum revenue with the designed probability. Let $\chi^2_k$ be a random variable generated by the chi-squared distribution with $k$ degrees of freedom.

**Proposition 2.** Assume that the optimum solution of (2) is unique and that the historical strategies $x_d$ are generated by some distribution. Then we have

$$\lim_{D \to \infty} P\Bigl( \max_{Q \in C'_\lambda} v(\hat{x})^\top Q v(\hat{x}) \leq v(\hat{x})^\top Q^* v(\hat{x}) \Bigr) \leq 2 P(\chi^2_1 \geq \lambda^2), \qquad (14)$$

where $\hat{x} := \arg\min_{x \in \mathcal{X}} \max_{Q \in C'_\lambda} v(x)^\top Q v(x)$. In other words, the probability that the robust objective value fails to upper-bound the true objective, i.e., that the robust value over-estimates the true revenue, is asymptotically at most $2 P(\chi^2_1 \geq \lambda^2)$.

## 4 Algorithms

### 4.1 Tight, Tractable Upper Bounds

Let us define the function $f : \mathcal{X} \to \mathbb{R}$ by

$$f(x) = \max_{U} \; v(x)^\top Q v(x) \quad \text{s.t. } Q = \hat{Q} + L_1^\top U L_2, \; \| U \|_F \leq \lambda. \qquad (15, 16)$$

Our goal, then, is to obtain a strategy $x \in \mathcal{X}$ that minimizes $f$. We can remove the maximization in $f$ and obtain another representation as follows.

**Lemma 3.** It holds that

$$f(x) = v(x)^\top \hat{Q} v(x) + \lambda \, \| L_1 v(x) \| \, \| L_2 v(x) \|, \qquad (17)$$

where $\| \cdot \|$ is the $\ell_2$ norm.

*Proof.* We have

$$f(x) = v(x)^\top \hat{Q} v(x) + \max_{U : \| U \|_F \leq \lambda} v(x)^\top L_1^\top U L_2 v(x) = v(x)^\top \hat{Q} v(x) + \lambda \, \| L_1 v(x) \| \, \| L_2 v(x) \|. \qquad (18)$$

The last equality is attained when

$$U = \lambda \, \frac{(L_1 v(x)) (L_2 v(x))^\top}{\| (L_1 v(x)) (L_2 v(x))^\top \|_F}. \qquad (19)$$

The term $\| L_1 v(x) \| \, \| L_2 v(x) \|$ is a product of square roots of quartic functions w.r.t. $v(x)$ and is not tractable for optimization. The next proposition provides a tractable upper bound of $f$.

**Proposition 4.** For any $x \in \mathcal{X}$ and $\gamma > 0$, it holds that

$$f(x) \leq g(x, \gamma), \qquad (20)$$

where

$$g(x, \gamma) := v(x)^\top \bigl( \hat{Q} + \lambda (\gamma M_1 + M_2 / \gamma) / 2 \bigr) v(x) \qquad (21)$$

and

$$M_1 := L_1^\top L_1, \quad M_2 := L_2^\top L_2. \qquad (22)$$

The equality is attained if and only if $\gamma \| L_1 v(x) \| = \| L_2 v(x) \|$.

*Proof.* For any $v = v(x) \in \mathbb{R}^N$, we have

$$\| L_1 v \| \, \| L_2 v \| = \sqrt{ \gamma (v^\top M_1 v) \cdot (v^\top M_2 v) / \gamma } \;\leq\; \bigl( \gamma (v^\top M_1 v) + (v^\top M_2 v) / \gamma \bigr) / 2 \qquad (23, 24)$$

by the inequality of arithmetic and geometric means. The equality condition is $\gamma (v^\top M_1 v) = (v^\top M_2 v) / \gamma$, which is equivalent to $\gamma \| L_1 v(x) \| = \| L_2 v(x) \|$.

The following corollary shows the theoretical tightness of this upper bound, i.e., the optimal solution of the original problem is achievable with an appropriately chosen $\gamma$.

**Corollary 5.** It holds that

$$\min_{x \in \mathcal{X}} f(x) = \min_{x \in \mathcal{X}} \inf_{\gamma \in (0, \infty)} g(x, \gamma). \qquad (25)$$

On the basis of Proposition 4 and Corollary 5, the optimization problem can be rewritten as

$$\min_{x} \inf_{\gamma} \; g(x, \gamma) \quad \text{s.t. } x \in \mathcal{X}, \; \gamma > 0. \qquad (26)$$
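Lemma 3 and Proposition 4 are easy to sanity-check numerically: the closed form (17) should coincide with $g(x, \gamma)$ exactly at $\gamma^* = \| L_2 v(x) \| / \| L_1 v(x) \|$ and lie below it for any other $\gamma$. A small sketch with toy matrices (all names are ours):

```python
import numpy as np

def f_and_g(vx, Q_hat, L1, L2, lam, gamma):
    """Closed-form f(x) from Eq. (17) and the upper bound g(x, gamma), Eq. (21)."""
    a, b = np.linalg.norm(L1 @ vx), np.linalg.norm(L2 @ vx)
    f = vx @ Q_hat @ vx + lam * a * b
    M1, M2 = L1.T @ L1, L2.T @ L2                    # Eq. (22)
    Qg = Q_hat + lam * (gamma * M1 + M2 / gamma) / 2
    return f, vx @ Qg @ vx

rng = np.random.default_rng(1)
n = 6
Q_hat, L1, L2 = (rng.normal(size=(n, n)) for _ in range(3))
vx = rng.normal(size=n)
gamma_star = np.linalg.norm(L2 @ vx) / np.linalg.norm(L1 @ vx)

f, g = f_and_g(vx, Q_hat, L1, L2, lam=1.0, gamma=gamma_star)
assert np.isclose(f, g)                              # the bound is tight at gamma*
f, g = f_and_g(vx, Q_hat, L1, L2, lam=1.0, gamma=2 * gamma_star)
assert f <= g + 1e-9                                 # and an upper bound elsewhere
```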
### 4.2 Upper Bound Minimization Algorithms

Since robust optimization is essentially more difficult than its non-robust counterpart, we make the following assumption to ensure that the non-robust counterpart is solvable.

**Assumption 6.** For any $\bar{Q}$ satisfying $\bar{Q} + \bar{Q}^\top \succeq \hat{Q} + \hat{Q}^\top$, we have a non-robust optimization oracle that solves $\min_{x \in \mathcal{X}} v(x)^\top \bar{Q} v(x)$.

This assumption immediately holds for convex non-robust counterparts, and it also holds for certain types of non-convex ones, such as with branch-and-bound methods [Burer and Vandenbussche, 2008], minimum-cut methods for unconstrained binary quadratic programming under submodularity conditions [Kolmogorov and Zabin, 2004], and prescriptive price optimization under substitute-goods property conditions [Ito and Fujimaki, 2017; 2016]. Note that this assumption is required only for our theoretical analysis; in practice we can employ an approximation algorithm. For Eq. (21) and a given $\bar{\gamma}$, $\bar{Q}_{\bar{\gamma}} + \bar{Q}_{\bar{\gamma}}^\top \succeq \hat{Q} + \hat{Q}^\top$ holds, where $\bar{Q}_{\bar{\gamma}} := \hat{Q} + \lambda (\bar{\gamma} M_1 + M_2 / \bar{\gamma}) / 2$, and hence the oracle in Assumption 6 is applicable. For later convenience, we formally summarize this in the following proposition. (Even if such $\bar{x}$ is not unique, the discussion that follows remains valid for an arbitrary choice of $\bar{x}$, and therefore we do not consider uniqueness here.)

**Proposition 7.** Under Assumption 6, for a given $\bar{\gamma} \in (0, \infty)$, we can obtain $\bar{x} := \arg\min_{x \in \mathcal{X}} g(x, \bar{\gamma})$ by means of a single call of the non-robust optimization oracle.

Further, on the basis of Proposition 4, for a given $\bar{x}$, the infimum of $g(\bar{x}, \gamma)$ with respect to $\gamma$ is given by

$$\bar{\gamma} = \arg\inf_{\gamma} g(\bar{x}, \gamma) = \| L_2 v(\bar{x}) \| \,/\, \| L_1 v(\bar{x}) \|. \qquad (27)$$

The above consideration indicates that the minimization of $g$ w.r.t. one of $x$ or $\gamma$ is tractable when the other is fixed. This paper proposes two simple algorithms to solve Eq. (26). Let us define a function $h : (0, \infty) \to \mathbb{R}$ by

$$h(\gamma) := \min_{x \in \mathcal{X}} g(x, \gamma). \qquad (28)$$

For robust prescriptive price optimization, we can employ the algorithms proposed in [Ito and Fujimaki, 2017; 2016] to evaluate $h(\gamma)$.

**Algorithm 1: Golden Section Search.** Require $\hat{Q}, L_1, L_2, \lambda, \alpha, \beta, \delta$.
1. Initialize $a \leftarrow \alpha$, $b \leftarrow \beta$, $r \leftarrow (\sqrt{5} - 1)/2$.
2. While $|a - b| \geq \delta$: set $c \leftarrow b - r(b - a)$ and $d \leftarrow a + r(b - a)$; set $b \leftarrow d$ if $h(c) < h(d)$, and $a \leftarrow c$ otherwise.
3. Output $x := \arg\min_{x \in \mathcal{X}} g(x, \gamma)$, where $\gamma = (a + b)/2$.

**Algorithm 2: Coordinate Descent.** Require $\hat{Q}, L_1, L_2, \lambda, \gamma_0, \delta$.
1. Check $\gamma_0 > 0$, and initialize $r \leftarrow \infty$, $\gamma \leftarrow \gamma_0$, $\bar{x} \leftarrow \arg\min_x g(x, \gamma)$.
2. While $r - g(\bar{x}, \gamma) > \delta$ and $\gamma \notin \{0, \infty\}$: set $r \leftarrow g(\bar{x}, \gamma)$; $\bar{x} \leftarrow \arg\min_x g(x, \gamma)$; $\gamma \leftarrow \arg\min_\gamma g(\bar{x}, \gamma)$ by (27).
3. Output $\bar{x}$.

Algorithm 1 is a golden section search [Kiefer, 1953] in which each evaluation of $h(\gamma)$ requires a single call of the oracle. An advantage of Algorithm 1 is its convergence guarantee within a fixed number of evaluations of $h(\gamma)$, which provides theoretical validity for our upper-bound minimization (see Section 4.3 for details). Algorithm 2 is a coordinate descent algorithm that alternates optimization of $g(x, \gamma)$ w.r.t. $x$ and $\gamma$. Though this algorithm offers no guarantee on the number of iterations needed for convergence, our experiments in Section 6 show that Algorithm 2 empirically converges within only a few iterations and is, in practice, much faster than Algorithm 1. Let us emphasize that, though more advanced optimization techniques might be applicable, our empirical results in Section 6 show that these simple algorithms converge quickly enough in practice.

### 4.3 Convergence Analysis

For a range $0 < \alpha \leq \beta$, let $h_{[\alpha, \beta]} : [\alpha, \beta] \to \mathbb{R}$ denote the restriction of $h$ to $[\alpha, \beta]$. Further, let us define an upper bound $l$ by

$$l := (\| L_1 \|_F + \| L_2 \|_F) \max_{x \in \mathcal{X}} \| v(x) \| \;\geq\; \max_{x \in \mathcal{X}} \max\{ \| L_1 v(x) \|, \| L_2 v(x) \| \}.$$

The following theorem guarantees the convergence of Algorithm 1 within a fixed number of iterations.

**Theorem 8.** Suppose that $v$ is continuous and that there does not exist $x \in \mathcal{X}$ such that $\| L_1 v(x) \| = \| L_2 v(x) \| = 0$. Under Assumption 6, for any $\delta' > 0$, setting $\alpha = 1/\beta$, $\beta = 2\lambda l^2 / \delta'$, and $\delta = \delta' / (\lambda \beta^2 l^2)$, Algorithm 1 outputs $x$ with the following property using $O(\log(\lambda l / \delta'))$ calls of the non-robust optimization oracle: there exists a local optimum $x^* \in \mathcal{X}$ of $f$ satisfying $f(x) \leq f(x^*) + \delta'$.

For comparison with an existing method, as discussed in the next section, we prove in the following proposition that, under a fairly natural assumption, the function $f$ becomes convex. In this case, $f$ has a unique minimum, and our algorithm obtains the global minimum. Here $\mathcal{S}_+$ is the set of positive semidefinite matrices, and $\bar{\mathcal{Q}}$ is defined by

$$\bar{\mathcal{Q}} := \{ (Q + Q^\top)/2 \mid Q = \hat{Q} + L_1^\top U L_2, \; \| U \|_F \leq \lambda \}. \qquad (29)$$

**Proposition 9.** If $\mathcal{X}$ is a convex set, $v$ is affine, and $\bar{\mathcal{Q}} \subseteq \mathcal{S}_+$, then $f$ is a convex function.
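For concreteness, here is a compact sketch of Algorithms 1 and 2. It assumes a non-robust oracle `solve(Q)` that returns $\arg\min_{x \in \mathcal{X}} v(x)^\top Q v(x)$, as in Assumption 6 (in practice a prescriptive price optimizer, or brute force over a small discrete $\mathcal{X}$); all helper names are ours.

```python
import numpy as np

def g_matrix(Q_hat, L1, L2, lam, gamma):
    """Matrix of the quadratic form g(., gamma) in Eq. (21)."""
    return Q_hat + lam * (gamma * (L1.T @ L1) + (L2.T @ L2) / gamma) / 2

def golden_section(solve, v, Q_hat, L1, L2, lam, a, b, delta):
    """Algorithm 1: golden section search over h(gamma) = min_x g(x, gamma)."""
    r = (np.sqrt(5) - 1) / 2

    def h(gamma):
        Q = g_matrix(Q_hat, L1, L2, lam, gamma)
        x = solve(Q)                       # one oracle call per evaluation
        return v(x) @ Q @ v(x)

    while abs(b - a) >= delta:
        c, d = b - r * (b - a), a + r * (b - a)
        if h(c) < h(d):
            b = d
        else:
            a = c
    return solve(g_matrix(Q_hat, L1, L2, lam, (a + b) / 2))

def coordinate_descent(solve, v, Q_hat, L1, L2, lam, gamma, delta, max_iter=100):
    """Algorithm 2: alternate the oracle for x with the closed form (27) for gamma."""
    x = solve(g_matrix(Q_hat, L1, L2, lam, gamma))
    prev = np.inf
    for _ in range(max_iter):
        cur = v(x) @ g_matrix(Q_hat, L1, L2, lam, gamma) @ v(x)
        if prev - cur <= delta:            # objective no longer improving
            break
        prev = cur
        gamma = np.linalg.norm(L2 @ v(x)) / np.linalg.norm(L1 @ v(x))  # Eq. (27)
        x = solve(g_matrix(Q_hat, L1, L2, lam, gamma))
    return x
```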
## 5 Related Work

Studies of robust optimization with ellipsoidal uncertainty were initiated independently by [Ben-Tal and Nemirovski, 1998] and [Ghaoui and Lebret, 1997]. For robust quadratic programming $\min_{x \in \mathcal{X}} x^\top Q x$, [Ben-Tal et al., 2002] considered the case in which $A$ in $Q = A^\top A$ has ellipsoidal uncertainty. For the case in which $Q$ itself has ellipsoidal uncertainty, which is also the case this paper considers, [Halldórsson and Tütüncü, 2003] applied the interior-point method of [Nemirovski, 1999] for general self-concordant convex-concave functions to this robust optimization. Further detailed information on robust optimization can be found in [Ben-Tal et al., 2009; Bertsimas et al., 2011].

In contrast to [Halldórsson and Tütüncü, 2003], which is applicable only when the convexity condition of Proposition 9 holds, our algorithm is applicable to non-convex robust quadratic programming given the existence of a (possibly approximate) algorithm for the non-robust counterpart. If the condition of Proposition 9 holds, both Algorithm 1 and [Halldórsson and Tütüncü, 2003] are applicable. In such a case, the non-robust counterpart is convex quadratic programming, which can be solved with $O(\sqrt{N} \log(1/\delta))$ Newton steps, and thus Algorithm 1 requires $O(\sqrt{N} \log(1/\delta) \log(N/\delta))$ Newton steps if $\| v(x) \|$, $\| L_1 \|$, and $\| L_2 \|$ are bounded (in which case $O(\log l) = O(\log N)$ holds). If $N$ is large, this is better than the $O(N \log(1/\delta))$ Newton steps of the algorithm in [Halldórsson and Tütüncü, 2003].

## 6 Experiments

We employed the state-of-the-art prescriptive price optimization of [Ito and Fujimaki, 2016] to calculate $h(\gamma)$ defined by Eq. (28) and compared our robust algorithms with their non-robust counterpart. Note that the results with $\lambda = 0$ reduce to those of the algorithm in [Ito and Fujimaki, 2016].

### 6.1 Evaluation on Artificial Price Simulator

#### Simulation Data

We conducted experiments with an artificial price simulator to support our robust formulation and theoretical analysis, applying a simulation data generation process similar to that of the original paper [Ito and Fujimaki, 2016]. The true demand (regression) model followed (1), where $v(x) = (x_1, x_2, \ldots, x_M, 1)^\top$ consisted of linear features with $N = M + 1$. The true coefficient matrix $A^*$ was generated by $a^*_{i,i} \sim U([-2M, -M])$, $a^*_{i,j} \sim U([0, 2])$ for $i \neq j$, and $a^*_{i,N} \sim U([M/2, 3M/2])$. We defined the price candidates as $\mathcal{X} := \{0.6, 0.7, 0.8, 0.9, 1.0\}^M$, where 1.0 is the list price and 0.9 means a 10% discount. The distribution of the non-diagonal entries $a^*_{i,j}$, $j = 1, 2, \ldots, M$, was chosen to balance the effect of the $i$th price and of the other prices on the sales quantity of the $i$th product. The linear term was arranged so that a moderate optimum strategy of 0.8 is obtained when the non-diagonal entries are zero. The training data $(\{x_d, y_d\})_{d=1}^{D}$ were generated by $x_{d,i} = 1.0, 0.9, 0.8, 0.7, 0.6$ with probability $0.5, 0.2, 0.1, 0.1, 0.1$, respectively, for all $i = 1, 2, \ldots, M$, where the system noise in $y_d$ followed $\mathcal{N}(0, 25 I_M)$. This skew toward the list price matters because it is natural to assume that a product is sold more often at the list price than at discounted prices. In each experiment, we randomly generated 10 true models, and 100 training datasets were generated for each true model. The results are averages over all runs.
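A sketch of this data-generation process, with the distributions as described above (the variable names and random seed are ours; the paper's exact implementation is not published with the text):

```python
import numpy as np

# Sketch of the artificial simulator: v(x) = (x, 1), a*_{i,i} ~ U([-2M, -M]),
# a*_{i,j} ~ U([0, 2]) for i != j, a*_{i,N} ~ U([M/2, 3M/2]), and historical
# prices skewed toward the list price.
rng = np.random.default_rng(0)
M, D = 10, 100                          # products and training-set size
N = M + 1

A_star = rng.uniform(0, 2, size=(M, N))                 # off-diagonal entries
np.fill_diagonal(A_star[:, :M], rng.uniform(-2 * M, -M, size=M))
A_star[:, M] = rng.uniform(M / 2, 3 * M / 2, size=M)    # intercept column

prices = np.array([1.0, 0.9, 0.8, 0.7, 0.6])
probs = np.array([0.5, 0.2, 0.1, 0.1, 0.1])
X_hist = rng.choice(prices, p=probs, size=(D, M))       # historical strategies
V = np.hstack([X_hist, np.ones((D, 1))]).T              # N x D design matrix
noise = rng.multivariate_normal(np.zeros(M), 25 * np.eye(M), size=D).T
Y = A_star @ V + noise                                  # M x D sales records
```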
#### Effect of Robust Formulation

Fig. 1 (left) demonstrates how the robust solutions changed with $\lambda$. We observed the following:

- The revenues were maximized at $\lambda = 2, 3, 4$ in all cases. By appropriately choosing the robustness parameter $\lambda$, we can mitigate uncertainty in price optimization and achieve both better revenue and higher robustness.
- With small training data ($D = 5M$), the non-robust solutions ($\lambda = 0$) resulted in much lower revenue than the robust solutions. This is because the statistical uncertainty in estimation was large with small training data, so the non-robust solutions easily became poor. As the size of the training data increased, the difference became smaller.
- As $\lambda$ increased, the standard deviation decreased (the standard deviation is plotted only for the $D = 5M$ case, for visualization purposes). This indicates that the solutions became more conservative and thus more stable.
- Even with the best choice of $\lambda$, the obtained revenue was 5-10% worse than the true maximal revenue obtained with the true demand model. This difference comes from estimation error, which cannot be avoided regardless of the choice of optimization algorithm. The result indicates that the true optimal value is not achievable in expectation, and we must always accept a certain reduction in expected revenue because of estimation error.

#### Probabilistic Guarantee with Finite Samples

We next demonstrate that the asymptotic probabilistic guarantee on the upper bound of the objective value holds with a practical number of training samples and a reasonably large $\lambda$, and that our robust optimization mitigates the over-estimation problem. Fig. 1 (middle) shows the results for $D = 100$ and $M = 10, 30, 50$ as functions of $\lambda$. The vertical axis is the probability of over-estimation, $P(\max_{Q \in C'_\lambda} v(x_\lambda)^\top Q v(x_\lambda) \leq v(x_\lambda)^\top Q^* v(x_\lambda))$, where $x_\lambda$ is the robust optimum strategy for $\lambda$. We observed the following:

- The non-robust optimization over-estimated revenue in 70-80% of the cases, which empirically confirms the optimism of non-robust optimization.
- Although the empirical probabilities (blue, green, red) were larger than the theoretical asymptotic value (pale blue) in this setting ($D = 100$), the difference rapidly decreased with increasing $\lambda$.
- With a reasonably large $\lambda$ (e.g., $\lambda \geq 3$), both the absolute values of the empirical probabilities and the difference between the empirical and theoretical probabilities became sufficiently small in practice.

#### Comparison of Algorithms

We compared Algorithm 1 and Algorithm 2 with a sampling algorithm for min-max optimization, which serves as a baseline robust optimization algorithm. The $i$th step of the sampling algorithm solves (12) with $U$ restricted to the finite sample set $\{U_j\}_{j=1,\ldots,i-1}$, after which a new $U_i$ that is a maximizer for the current optimum solution is added. Observe that the sampling algorithm approximates the original robust optimization problem by adding a series of quadratic constraints, so the efficient price optimization algorithm of [Ito and Fujimaki, 2016] is not applicable in this framework. To solve this sample approximation, we used the MIQCP solver of the GUROBI optimizer, version 6.0, after convex approximation of each sampled quadratic constraint. Fig. 1 (right) shows the convergence of the objective value over algorithm iterations. The majority of the computational time was devoted to evaluating $h(\gamma)$ with the non-robust optimization oracle in our two algorithms, and to the MIQCP solver in the sampling algorithm; we therefore summarize this experiment in terms of algorithm iterations. All algorithms achieved the same objective values at convergence. We observed that Algorithm 1 and the sampling algorithm needed almost the same number of iterations for convergence.
However, while the two applications of the non-robust optimization oracle in Algorithm 2 took 0.08 seconds per iteration, the computational time needed for solving the MIQCP grew with the number of quadratic constraints; for $\lambda = 20$, it took 274 seconds for the first 10 iterations. We also observed that Algorithm 2 took only a few iterations to converge in all cases, which confirms that Algorithm 2 was faster in practice and required only slightly more computational time than its non-robust counterpart.

### 6.2 Evaluation on Real Point-of-Sales Data

We applied the proposed method to a real sales history of beers [Ito and Fujimaki, 2017; Wang et al., 2015]. (The data was provided by KSP-SP Co., LTD, http://www.ksp-sp.com.) The data consisted of the prices and sales quantities of 50 products over 642 days. We used the price features for prediction and obtained 50 regression models by means of the least square estimator.

Figure 1: (left) Expected return and its deviation for robust strategies, with $M = 10$ products. The horizontal and vertical axes are, respectively, the robustness parameter $\lambda$ and the expected return. The blue, green, and red lines show the results for data sizes $D = 5M$, $10M$, and $20M$, respectively. The pale blue line shows the revenue of the optimum strategy when the true model is known. (middle) The probability of over-estimation of revenue for $D = 300$. The horizontal axis is the robustness parameter $\lambda$, and the vertical axis is the probability of over-estimation of revenue. The blue, green, and red lines show the results for $M = 10, 30, 50$ products, respectively. The pale blue line shows the theoretical guarantee in Proposition 2. (right) Convergence of Algorithms 1 and 2 and the sampling algorithm for $M = 20$, $D = 100$. The horizontal and vertical axes show the number of iterations and the objective value, respectively. The blue, green, and red lines show the values for $\lambda = 1, 10, 20$, respectively.

Table 1: Real-data experiments with sales histories of beers. On the basis of the least square estimates, we calculated robust optimum strategies for several values of the risk-return parameter $\lambda$. P$i$ is the product ID, sorted by the ratio (optimum price at $\lambda = 0$)/(average price). Red and blue products indicate, respectively, risky and stable pricing strategies at small $\lambda$ values. Green products indicate large cross price elasticity.
| ID | λ=0 | λ=30 | λ=60 | λ=90 | average |
|-----|------|------|------|------|---------|
| P1 | 326 | 502 | 519 | 541 | 542 |
| P2 | 159 | 218 | 243 | 263 | 264 |
| P3 | 326 | 457 | 493 | 537 | 541 |
| P4 | 584 | 880 | 926 | 967 | 969 |
| P5 | 148 | 219 | 233 | 244 | 245 |
| P6 | 903 | 1058 | 1278 | 1478 | 1491 |
| P7 | 743 | 743 | 1005 | 1210 | 1226 |
| P8 | 903 | 1006 | 1234 | 1473 | 1490 |
| P9 | 153 | 200 | 235 | 251 | 252 |
| P10 | 153 | 229 | 240 | 252 | 252 |
| P11 | 600 | 735 | 881 | 980 | 987 |
| P12 | 107 | 142 | 158 | 174 | 176 |
| P13 | 104 | 140 | 155 | 169 | 171 |
| P14 | 903 | 903 | 1156 | 1463 | 1483 |
| P15 | 153 | 222 | 237 | 251 | 251 |
| P16 | 114 | 184 | 179 | 186 | 187 |
| P17 | 675 | 675 | 855 | 1091 | 1107 |
| P18 | 1019 | 1019 | 1041 | 1626 | 1670 |
| P19 | 55 | 77 | 84 | 90 | 90 |
| P20 | 55 | 69 | 81 | 89 | 90 |
| P21 | 675 | 675 | 675 | 1071 | 1099 |
| P22 | 138 | 138 | 189 | 220 | 222 |
| P23 | 138 | 138 | 194 | 219 | 220 |
| P24 | 230 | 285 | 285 | 294 | 294 |
| P25 | 167 | 178 | 181 | 187 | 187 |
| P26 | 313 | 313 | 307 | 312 | 312 |
| P27 | 267 | 254 | 256 | 265 | 266 |
| P28 | 265 | 252 | 253 | 264 | 264 |
| P29 | 256 | 235 | 243 | 255 | 255 |
| P30 | 1600 | 1364 | 1467 | 1584 | 1592 |
| P31 | 189 | 189 | 184 | 188 | 188 |
| P32 | 188 | 188 | 182 | 187 | 187 |
| P33 | 162 | 143 | 148 | 160 | 161 |
| P34 | 162 | 151 | 152 | 161 | 161 |
| P35 | 255 | 229 | 239 | 251 | 253 |
| P36 | 218 | 218 | 209 | 215 | 216 |
| P37 | 208 | 202 | 200 | 206 | 206 |
| P38 | 285 | 283 | 271 | 281 | 282 |
| P39 | 189 | 189 | 183 | 187 | 187 |
| P40 | 189 | 189 | 182 | 186 | 187 |
| P41 | 189 | 189 | 180 | 186 | 187 |
| P42 | 189 | 189 | 189 | 188 | 187 |
| P43 | 91 | 79 | 82 | 89 | 90 |
| P44 | 180 | 171 | 168 | 177 | 178 |
| P45 | 255 | 232 | 239 | 251 | 252 |
| P46 | 1124 | 694 | 894 | 1096 | 1110 |
| P47 | 275 | 245 | 253 | 269 | 271 |
| P48 | 1124 | 697 | 839 | 1084 | 1101 |
| P49 | 1505 | 920 | 985 | 1438 | 1470 |
| P50 | 1288 | 1062 | 912 | 1232 | 1257 |

For each product, we set the lower and upper bounds on its price to 60% and 100% of its historical maximum price. We then conducted robust optimization for $\lambda = 0, 30, 60, 90$. Table 1 shows the optimized robust pricing strategies for each $\lambda$. We observed the following:

- For many products (red), the non-robust prices ($\lambda = 0$) were drastically discounted from the average prices, and the optimized prices monotonically increased toward the average prices as $\lambda$ grew; with the most conservative setting, $\lambda = 90$, the optimized prices were very close to the average prices. These products are considered to have large price elasticity of demand. This is natural because the average prices and the largely-discounted prices can be seen as, respectively, the most "tested" strategies and the most risky ones.
- For many other products (green), the prices changed little over $\lambda$. These products are considered to have small price elasticity of demand, so the list prices (average prices) are the safest and the most profitable.
- For some products (blue), the non-robust prices were close to the average prices; as $\lambda$ grew, they first dropped to discounted prices and then returned to the average prices. These products are mostly high-priced and are therefore considered to have large cross price elasticity. This complex behavior of the optimized prices was caused by the price changes of the other products.

Although we were unable to evaluate these strategies in real retail stores, these results show that our algorithm provides a way to simulate scenarios at different risk levels, so that users can explore the best pricing strategies on the basis of their domain expertise.

## 7 Conclusion

This paper has proposed a novel robust quadratic optimization framework for prescriptive price optimization. Our statistical analysis has revealed that the uncertainty arising from estimation in machine learning follows a matrix normal distribution, which has led us to formulate robust quadratic programming as conservative upper-bound minimization. Our sequential algorithms for robust quadratic programming converge quickly, both in practice and in theory, and can be implemented on the basis of existing non-robust price optimization algorithms.
Experimental results on both artificial and actual price data showed that our method enables users to obtain both profitable and safe price strategies in prescriptive price optimization.

## References

[Ben-Tal and Nemirovski, 1998] Aharon Ben-Tal and Arkadi Nemirovski. Robust convex optimization. Mathematics of Operations Research, 23(4):769–805, 1998.

[Ben-Tal et al., 2002] Aharon Ben-Tal, Arkadi Nemirovski, and Cees Roos. Robust solutions of uncertain quadratic and conic-quadratic problems. SIAM Journal on Optimization, 13(2):535–560, 2002.

[Ben-Tal et al., 2009] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimization. Princeton University Press, 2009.

[Bertsimas et al., 2011] Dimitris Bertsimas, David B. Brown, and Constantine Caramanis. Theory and applications of robust optimization. SIAM Review, 53(3):464–501, 2011.

[Burer and Vandenbussche, 2008] Samuel Burer and Dieter Vandenbussche. A finite branch-and-bound algorithm for nonconvex quadratic programming via semidefinite relaxations. Mathematical Programming, 113(2):259–282, 2008.

[Caro and Gallien, 2012] Felipe Caro and Jérémie Gallien. Clearance pricing optimization for a fast-fashion retailer. Operations Research, 60(6):1404–1422, 2012.

[Côté et al., 2003] Jean-Philippe Côté, Patrice Marcotte, and Gilles Savard. A bilevel modelling approach to pricing and fare optimisation in the airline industry. Journal of Revenue and Pricing Management, 2(1):23–36, 2003.

[Delage and Ye, 2010] Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.

[Duchi et al., 2016] John C. Duchi, Peter W. Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Stanford University, 2016.

[Ferreira et al., 2015] Kris J. Ferreira, Bin H. A. Lee, and David Simchi-Levi. Analytics for an online retailer: Demand forecasting and price optimization. Manufacturing & Service Operations Management, 2015.

[Ghaoui and Lebret, 1997] Laurent El Ghaoui and Hervé Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18(4):1035–1064, 1997.

[Gupta and Nagar, 1999] Arjun K. Gupta and Daya K. Nagar. Matrix variate distributions, volume 104. CRC Press, 1999.

[Halldórsson and Tütüncü, 2003] Bjarni V. Halldórsson and Reha H. Tütüncü. An interior-point method for a class of saddle-point problems. Journal of Optimization Theory and Applications, 116(3):559–590, 2003.

[Ito and Fujimaki, 2016] Shinji Ito and Ryohei Fujimaki. Large scale price optimization via network flow. In Advances in Neural Information Processing Systems, 2016.

[Ito and Fujimaki, 2017] Shinji Ito and Ryohei Fujimaki. Optimization beyond prediction: Prescriptive price optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.

[Kiefer, 1953] Jack Kiefer. Sequential minimax search for a maximum. Proceedings of the American Mathematical Society, 4(3):502–506, 1953.

[Kolmogorov and Zabin, 2004] Vladimir Kolmogorov and Ramin Zabin. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004.

[Koushik et al., 2012] Dev Koushik, Jon A. Higbie, and Craig Eister. Retail price optimization at InterContinental Hotels Group. Interfaces, 42(1):45–57, 2012.
[Kunz and Crone, 2014] Timo P. Kunz and Sven F. Crone. Demand models for the static retail price optimization problem: a revenue management perspective. In OASIcs Open Access Series in Informatics, volume 37. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2014.

[Lee, 2011] Seonah Lee. Study of demand models and price optimization performance. PhD thesis, Georgia Institute of Technology, 2011.

[Marshall, 2009] Alfred Marshall. Principles of economics: unabridged eighth edition. Cosimo, Inc., 2009.

[Nemirovski, 1999] Arkadi Nemirovski. On self-concordant convex-concave functions. Optimization Methods and Software, 11(1-4):303–384, 1999.

[Tütüncü and Koenig, 2004] Reha H. Tütüncü and M. Koenig. Robust asset allocation. Annals of Operations Research, 132(1-4):157–187, 2004.

[van Ryzin and Mahajan, 1999] Garrett van Ryzin and Siddharth Mahajan. On the relationship between inventory costs and variety benefits in retail assortments. Management Science, 45(11):1496–1509, 1999.

[Wang et al., 2015] Jialei Wang, Ryohei Fujimaki, and Yosuke Motohashi. Trading interpretability for accuracy: Oblique treed sparse additive models. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1245–1254. ACM, 2015.