# Unbiased Objective Estimation in Predictive Optimization

Shinji Ito¹, Akihiro Yabe¹, Ryohei Fujimaki¹

¹NEC Corporation. Correspondence to: Shinji Ito.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

## Abstract

For data-driven decision-making, one promising approach, called predictive optimization, is to solve maximization problems in which the objective function to be maximized is estimated from data. Predictive optimization, however, suffers from the problem that a calculated optimal solution is evaluated too optimistically, i.e., the value of the objective function is overestimated. This paper investigates this optimistic bias and presents two methods for correcting it. The first, which is analogous to cross-validation, successfully corrects the optimistic bias but results in underestimation of the true value. Our second method employs resampling techniques to avoid both overestimation and underestimation. We show that this second method, referred to as the parameter perturbation method, achieves asymptotically unbiased estimation. Empirical results for both artificial and real-world datasets demonstrate that our proposed approach successfully corrects the optimistic bias.

## 1. Introduction

Data-driven decision-making has become the subject of increased interest and has been used in a number of practical applications. One of the most promising approaches is mathematical programming based on predictive models generated by machine learning. Recent advances in machine learning have made it easier to create accurate predictive models, and the resulting predictions have been used to build mathematical programming problems; we refer to such approaches as predictive optimization. Predictive optimization is employed in applications for which frequent trial-and-error processes are not practical, such as water distribution optimization (Draper et al., 2003), energy generation planning (Baños et al., 2011), retail price optimization (Johnson et al., 2016; Ito & Fujimaki, 2016), supply chain management (Thomas et al., 1996; Jung et al., 2004; Bertsimas & Thiele, 2004), and portfolio optimization (Markowitz, 1952; Chan et al., 1999; Konno & Yamazaki, 1991). Another important approach to data-driven decision-making is reinforcement learning (Kaelbling et al., 1996; Sutton & Barto, 2013), which is employed mainly in situations in which frequent trial-and-error operations are possible, with the exception of batch reinforcement learning (Lange et al., 2012). The focus of this paper is on the first approach, i.e., predictive optimization.

In many practical applications of predictive optimization, it is essential to estimate the quality of the computed strategy, because executing a strategy is often costly and risky. For example, predictive price optimization estimates revenue functions through regressions of demand as functions of product prices and then optimizes pricing strategies by maximizing the estimated revenue functions (Johnson et al., 2016; Ito & Fujimaki, 2016; 2017; Yabe et al., 2017). In practice, users need to assess the return of the computed optimal strategy before changing prices, in order to prevent unforeseen heavy losses.
In a situation in which costs for trial-and-error processes are unrealistically high, a key challenge in predictive optimization is how to assess the quality (or expected return) of the optimal solution by means of an estimated objective function. Predictive optimization consists of two steps: estimation and optimization. In the estimation step, we construct an estimated objective function f(z, θ̂) for the true objective function f(z, θ*), where θ is a parameter of f and z is a decision variable corresponding to the strategy to be optimized. In the optimization step, we compute the estimated optimal strategy ẑ = argmax_{z ∈ Z} f(z, θ̂), where Z is the domain of z. Because it would be expensive to observe f(ẑ, θ*) (i.e., to perform ẑ in a real environment), we usually estimate it by f(ẑ, θ̂), which we call simple evaluation, in order to assess the quality of ẑ.

It has been empirically observed, however, that this simple evaluation tends to be too optimistic. For example, in the contexts of algorithmic investment and portfolio optimization, it has been reported (Michaud, 1989; Chapados, 2011; Harvey & Liu, 2015) that f(ẑ, θ̂) is much better than the actual return. Michaud (1989) argued that this bias appears because mean-variance optimizers act as "error maximizers", i.e., optimizers tend to choose solutions containing large errors. According to (Harvey & Liu, 2015), a common practice in evaluating trading strategies is a simple heuristic that discounts the estimated objective by 50%, i.e., considers 0.5 f(ẑ, θ̂) to be an estimator of f(ẑ, θ*). Heuristics referred to as portfolio resampling techniques (Michaud, 1998; Scherer, 2002) have been studied for nearly 20 years but have yet to be theoretically justified. A few recent studies (Bailey & Marcos, 2016; Bailey et al., 2014; Harvey & Liu, 2015) have statistically analyzed the bias issue and proposed algorithms to mitigate it, but their algorithms are restricted to particular applications (e.g., algorithmic investment) and, as far as we know, there exists no principled algorithm for an unbiased estimator of f(z, θ*) in general predictive optimization problems.

The goal of this study is to address this optimistic bias issue and to propose methods for unbiased estimation of true objective values. Our key contributions are summarized as follows. First, we prove that the estimated optimal value f(ẑ, θ̂) is biased even if the estimated objective function f(z, θ̂) is an unbiased estimator of the true objective function f(z, θ*). Further, we relate the bias issue to overfitting in machine learning, which yields valuable insight into bias correction methods. Second, we propose two algorithms for estimating the value of true objective functions under mild assumptions. The first algorithm is based on a procedure similar to cross-validation and was inspired by the analogy between our problem and overfitting in supervised learning. This algorithm corrects the optimistic bias but suffers from pessimistic bias, i.e., the estimated value is biased in a direction suggesting a poorer result, similar to what occurs in cross-validation. The magnitude of this pessimistic bias tends to be larger than that of cross-validation, and hence it is not negligible in many cases. To mitigate this issue, we propose another algorithm, which we refer to as the parameter perturbation method.
This algorithm employs a resampling technique and is theoretically proven here to achieve asymptotically unbiased estimation. Our experimental results show that the proposed algorithms estimate the value of a true objective function more accurately than a state-of-the-art hold-out validation technique commonly used in algorithmic investment (Bailey & Marcos, 2016; Bailey et al., 2014). In a simulation experiment with real-world retail datasets for price optimization, we observed that our evaluation algorithms estimate a 17% increase in gross profit, which seems more realistic and convincing than the value estimated without bias correction.

The remainder of this paper is structured as follows. In Section 2, we introduce the framework combining machine learning and mathematical optimization, with examples of its usage. Section 3 shows that this framework suffers from bias w.r.t. optimal values. Section 4 gives solutions to this problem and theoretical guarantees for them. In Section 5, the empirical performance of our algorithms is demonstrated.

## 2. Predictive Optimization

Suppose we have a sequence of training data x = (x_1, ..., x_N) ∈ X^N, where N is the number of data instances. Each x_n is generated from a probabilistic model {p(x|θ) : θ ∈ Θ} parameterized by θ ∈ Θ. We further suppose having a set of objective functions {f(z, θ) : θ ∈ Θ}, where z ∈ Z is a decision variable that corresponds to strategies to be optimized. The goal of predictive optimization is to find z* ∈ argmax_{z ∈ Z} f(z, θ*), where θ* is the true parameter. However, such a true parameter is unknown in practice, and therefore we estimate θ* by θ̂ from x and compute the estimated optimal solution ẑ ∈ argmax_{z ∈ Z} f(z, θ̂) rather than z*. This section discusses three examples of predictive optimization problems in order to provide a better picture of the process.

**Example 1 (Coin-Tossing).** Suppose that we have a coin coming up heads with probability θ* and tails with probability 1 − θ*, where θ* ∈ Θ := [0, 1]. Consider predicting heads or tails for this coin. If we predict the subsequent face correctly, we win $1, and, otherwise, nothing. Predicting heads, then, will result in earning $1 with probability θ* and $0 with probability 1 − θ*, and hence the expected earnings for predicting heads are f(head, θ*) = 1 · θ* + 0 · (1 − θ*) = θ*. Similarly, the expected earnings for predicting tails are f(tail, θ*) = 1 − θ*. If we knew the true parameter θ*, we could maximize the expected earnings by choosing z* ∈ argmax_{z ∈ Z} f(z, θ*), where Z = {head, tail} stands for the set of feasible strategies. Since we do not know the true parameter θ*, however, we instead use past data x ∈ X^N := {head, tail}^N of N tossings to estimate θ*.

Table 1 illustrates how the optimistic bias occurs in predictive optimization. Suppose θ* = 1/2 (row (a)) and that there are four cases of observed patterns for three tossings (row (b)). The estimators of θ* are then obtained as in row (c) using maximum likelihood estimation. On the basis of θ̂, the best strategies are estimated as in row (d), and the estimated and true optimal values are summarized in rows (e) and (f). It is worth noting that the expectation of (e) over the four cases, which is 3/4, is larger than the true expectation of (f), which is 1/2, even though θ̂ is unbiased, i.e., the expectation of θ̂ matches θ*.

Table 1. Example of optimistic bias in coin-tossing.

|              | Case 1 | Case 2 | Case 3 | Case 4 |
|--------------|--------|--------|--------|--------|
| (a) θ*       | 1/2    | 1/2    | 1/2    | 1/2    |
| (b) x        | {HHH}  | {HHT}  | {HTT}  | {TTT}  |
| (c) θ̂        | 1      | 2/3    | 1/3    | 0      |
| (d) ẑ        | H      | H      | T      | T      |
| (e) f(ẑ, θ̂)  | 1      | 2/3    | 2/3    | 1      |
| (f) f(ẑ, θ*) | 1/2    | 1/2    | 1/2    | 1/2    |

Expectations over the four cases: E[θ̂] = 1/2 = θ*, E[f(ẑ, θ̂)] = 3/4, E[f(ẑ, θ*)] = 1/2.
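To make the mechanism in Table 1 concrete, here is a minimal Python sketch (ours, not from the paper; all names are illustrative) that simulates the coin-tossing example with θ* = 1/2 and N = 3 and reproduces the gap between E[f(ẑ, θ̂)] ≈ 3/4 and E[f(ẑ, θ*)] = 1/2:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, N, trials = 0.5, 3, 100_000

def f(z, theta):
    # f(head, theta) = theta, f(tail, theta) = 1 - theta
    return theta if z == "head" else 1.0 - theta

simple_eval, true_val = [], []
for _ in range(trials):
    x = rng.binomial(1, theta_star, size=N)          # 1 = heads, 0 = tails
    theta_hat = x.mean()                             # maximum likelihood estimate
    z_hat = "head" if theta_hat >= 0.5 else "tail"   # z_hat = argmax_z f(z, theta_hat)
    simple_eval.append(f(z_hat, theta_hat))          # optimistic: f(z_hat, theta_hat)
    true_val.append(f(z_hat, theta_star))            # actual:     f(z_hat, theta_star)

print(np.mean(simple_eval), np.mean(true_val))       # approx. 0.75 vs. exactly 0.5
```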
**Example 2 (Portfolio optimization (Markowitz, 1952)).** Suppose that there are d assets, and let R_j stand for the return on the j-th component asset for j ∈ {1, ..., d}. Let μ* = (μ*_1, ..., μ*_d) ∈ R^d be the expected returns, i.e., μ*_j = E[R_j]. Then the portfolio expressed as R_z = Σ_{j=1}^d z_j R_j, where z_j ≥ 0 is the weighting of the j-th component asset and z = (z_1, ..., z_d) ∈ R^d_{≥0}, has expected return E[R_z] = Σ_{j=1}^d z_j μ*_j = μ*ᵀz. The variance of the portfolio return can be expressed as var[R_z] = zᵀΣ*z, where Σ* is the covariance matrix of (R_1, ..., R_d). Denote θ* = (μ*, Σ*). Then, with a given risk tolerance λ ≥ 0, the optimal portfolio is obtained as the solution of the following problem:

$$ \text{Maximize } f(z, \theta^*) := \mu^{*\top} z - \lambda z^\top \Sigma^* z \quad \text{subject to } \sum_{j=1}^{d} z_j = 1,\ z_j \ge 0\ (j = 1, \dots, d). \tag{1} $$

In practice, however, since θ* is never available, we estimate it from historical data x = (x_1, ..., x_N), where x_n ∈ R^d is an observation of past returns of the individual component assets (Qiu et al., 2015; Agarwal et al., 2006; Li & Hoi, 2012). Under the assumption that the x_n follow the same distribution (this condition can easily be relaxed), the estimators of μ* and Σ* are given by μ̂ = (1/N) Σ_{n=1}^N x_n and Σ̂ = (1/(N−1)) Σ_{n=1}^N (x_n − μ̂)(x_n − μ̂)ᵀ. We obtain the optimal solution by solving (1) with μ* and Σ* replaced by μ̂ and Σ̂, respectively.
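The estimation and optimization steps of Example 2 can be sketched as follows. This is an illustrative implementation under our own naming, using scipy's SLSQP solver for the quadratic program in place of the GUROBI solver used in the paper's experiments:

```python
import numpy as np
from scipy.optimize import minimize

def estimate_theta(x):
    """Unbiased estimators of (mu, Sigma) from past returns x of shape (N, d)."""
    mu_hat = x.mean(axis=0)
    sigma_hat = np.cov(x, rowvar=False)   # divides by N - 1, as in the text
    return mu_hat, sigma_hat

def solve_portfolio(theta, lam=1.0):
    """argmax_z mu'z - lam * z'Sigma z  s.t.  sum(z) = 1, z >= 0."""
    mu, sigma = theta
    d = len(mu)
    neg_f = lambda z: -(mu @ z - lam * z @ sigma @ z)
    res = minimize(neg_f, np.full(d, 1.0 / d), method="SLSQP",
                   bounds=[(0.0, None)] * d,
                   constraints=[{"type": "eq", "fun": lambda z: z.sum() - 1.0}])
    return res.x

def f_portfolio(z, theta, lam=1.0):
    """Objective f(z, theta) of problem (1)."""
    mu, sigma = theta
    return mu @ z - lam * z @ sigma @ z
```

These three callables (estimator, solver, objective) are reused by the algorithm sketches later in the paper.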
**Example 3 (Predictive price optimization (Ito & Fujimaki, 2017; 2016)).** Suppose we have d products whose prices are denoted by z = (z_1, ..., z_d). Let us denote their sales quantities by q*(z) = (q*_j(z))_{j=1}^d ∈ R^d, which are functions of the prices z. The gross revenue function is then defined by f(z, θ*) = q*(z)ᵀz, and the true optimal solution is obtained by solving the following problem:

$$ \text{Maximize } q^*(z)^\top z \quad \text{subject to } z \in Z, \tag{2} $$

where Z ⊆ R^d is a pre-defined domain of prices (e.g., list price, 3%-off, 5%-off, and so on). However, we can never know the true demand-price relationship q*(z), and predictive price optimization approximates q*(z) by the following regression functions:

$$ q(z, \theta) = \sum_{k=1}^{K} \theta_k \psi_k(z) + \epsilon, \quad \epsilon \sim N(0, \Sigma), \tag{3} $$

where {ψ_k : R^d → R^d}_{k=1}^K are fixed basis functions and {θ_k}_{k=1}^K ⊆ R are regression coefficients. We estimate θ* = (θ_1, ..., θ_K) as a standard regression problem and then solve (2) after replacing q*(z) by q(z, θ̂), where θ̂ is the estimator of θ*.
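As an illustration of Example 3, the following sketch (ours; names are illustrative) uses the linear demand model q_i(z) = α_i + Σ_j β_ij z_j, the special case of (3) employed in the paper's simulation in Section 5.2, fits it by least squares, and maximizes the estimated revenue by brute-force enumeration. Enumeration is feasible only for small d; the paper's experiments instead use the network-flow algorithm of (Ito & Fujimaki, 2016):

```python
import itertools
import numpy as np

def fit_linear_demand(prices, sales):
    """Least-squares fit of q_i(z) = alpha_i + sum_j beta_ij * z_j.
    prices, sales: arrays of shape (N, d)."""
    N = prices.shape[0]
    design = np.hstack([np.ones((N, 1)), prices])           # columns: [1, z]
    coef, *_ = np.linalg.lstsq(design, sales, rcond=None)   # shape (d + 1, d)
    alpha, beta = coef[0], coef[1:].T                       # beta[i, j]: effect of z_j on q_i
    return alpha, beta

def optimize_prices(alpha, beta, grid=(0.6, 0.7, 0.8, 0.9, 1.0)):
    """Brute-force maximization of estimated revenue q(z, theta_hat)'z over Z = grid^d."""
    d = len(alpha)
    best_z, best_rev = None, -np.inf
    for z in itertools.product(grid, repeat=d):   # exponential in d: small d only
        z = np.asarray(z)
        rev = (alpha + beta @ z) @ z              # f(z, theta_hat)
        if rev > best_rev:
            best_z, best_rev = z, rev
    return best_z, best_rev
```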
## 3. Optimistic Bias in the Optimal Value

### 3.1. Existence of Optimistic Bias

This section formally proves the existence of optimistic bias in estimated optimal values. In the above examples, the objective functions f(z, θ) were affine in θ, and θ̂ was an unbiased estimator of θ*. Hence the constructed objective function f(z, θ̂) was an unbiased estimator of the true objective function f(z, θ*), i.e., it holds that

$$ \mathbf{E}_x[f(z, \hat\theta)] = \mathbf{E}_x[f(z, \theta^*)], \quad \forall z \in Z. \tag{4} $$

From this equation, one might expect that E_x[f(ẑ, θ̂)] and f(ẑ, θ̂) would be reasonable estimators of E_x[f(ẑ, θ*)] and f(ẑ, θ*), respectively. However, the following proposition contradicts this intuition.

**Proposition 1 (Optimistic Bias).** Suppose (4) holds. For ẑ ∈ argmax_{z ∈ Z} f(z, θ̂) and z* ∈ argmax_{z ∈ Z} f(z, θ*), it holds that

$$ \mathbf{E}_x[f(\hat z, \hat\theta)] \;\ge\; f(z^*, \theta^*) \;\ge\; \mathbf{E}_x[f(\hat z, \theta^*)]. \tag{5} $$

The right inequality is strict if ẑ is suboptimal w.r.t. the true objective function f(z, θ*) with non-zero probability.

*Proof.* By taking the expectation of both sides of f(ẑ, θ̂) ≥ f(z*, θ̂), we obtain the left inequality of (5) as follows: E_x[f(ẑ, θ̂)] ≥ E_x[f(z*, θ̂)] = f(z*, θ*), where the equality comes from (4). Similarly, the right inequality of (5) comes from f(z*, θ*) ≥ f(ẑ, θ*). Further, if ẑ ∉ argmax_{z ∈ Z} f(z, θ*) holds with non-zero probability, then f(z*, θ*) > f(ẑ, θ*) holds with non-zero probability while f(z*, θ*) ≥ f(ẑ, θ*) always holds, which implies f(z*, θ*) > E[f(ẑ, θ*)]. ∎

This proposition implies that the estimated optimal value f(ẑ, θ̂) is not an unbiased estimator of f(ẑ, θ*) even if the estimated objective function f(z, θ̂) is an unbiased estimator of the true objective function f(z, θ*). This optimistic bias has been empirically observed in the context of portfolio optimization (Michaud, 1989). Recently, (Harvey & Liu, 2015; Harvey et al., 2016) proposed bias correction methods based on statistical tests, though their methods are applicable only to cases in which the objective function is the Sharpe ratio. Other recent studies (Bailey & Marcos, 2016; Bailey et al., 2014) have also focused on the Sharpe ratio and proposed a hold-out validation method; although their method applies to general predictive optimization problems, it has not been proven to yield unbiased estimators. Note that a similar inequality has been discovered in the context of stochastic programs, corresponding to the left inequality of (5): there, the objective is a random function, and it has been shown, e.g., in (Mak et al., 1999), that the expectation of the minimum of the objective is a lower bound on the minimum of the expectation of the objective. For the special case in which Z is a finite set, the same inequality as (5) has been shown in the context of decision analysis (Smith & Winkler, 2006).

### 3.2. Connection to Empirical Risk Minimization

This subsection discusses the connection between the optimistic bias issue and overfitting in machine learning, a connection that led to the ideas underlying our proposed algorithms. In supervised machine learning, we choose a prediction rule ĥ from a hypothesis space H by minimizing the empirical error, i.e., we let ĥ ∈ argmin_{h ∈ H} (1/N) Σ_{n=1}^N ℓ(h, x_n), where x_n is observed data generated from a distribution D and ℓ is a loss function. The empirical error (1/N) Σ_{n=1}^N ℓ(h, x_n) is an unbiased estimator of the generalization error ℓ_D(h) := E_{x∼D}[ℓ(h, x)] for an arbitrary fixed prediction rule h, i.e., E_{x_n∼D}[(1/N) Σ_{n=1}^N ℓ(h, x_n)] = ℓ_D(h) for any fixed h. Despite this equation, the empirical error (1/N) Σ_{n=1}^N ℓ(ĥ, x_n) of the computed rule ĥ is smaller than the generalization error ℓ_D(ĥ) in most cases, because ĥ overfits the observed samples, as is well known (Vapnik, 2013).

The analogy between the optimistic bias in our setting and the overfitting issue in machine learning suggests that the bias stems from reusing the same dataset both for estimating the objective function and for evaluating the objective value. A comparison between empirical risk minimization (ERM) and our prediction-based optimization is summarized in Table 2. As the table shows, the bias problem in predictive optimization has a structure similar to that of the overfitting problem in empirical risk minimization. Typical methods for estimating generalization error in machine learning are cross-validation and asymptotic bias corrections such as AIC (Akaike, 1973). This paper follows the concept of cross-validation in the context of predictive optimization and, in the following section, proposes a more accurate algorithm.

Table 2. Correspondence of empirical risk minimization and predictive optimization.

|                     | ERM                       | Optimization |
|---------------------|---------------------------|--------------|
| Decision variable   | Predictor h               | Strategy z   |
| True objective      | E_{x∼D}[ℓ(h, x)]          | f(z, θ*)     |
| Estimated objective | (1/N) Σ_{n=1}^N ℓ(h, x_n) | f(z, θ̂)      |

## 4. Bias Correction Algorithms

Our goal is to construct unbiased estimators of the value f(ẑ, θ*) of the true objective function, i.e., to construct ρ : X^N → R such that E_x[ρ(x)] = E_x[f(ẑ, θ*)], where ẑ ∈ argmax_{z ∈ Z} f(z, θ̂) is the computed strategy. We assume the following conditions.

**Assumption 2.**
(i) f(z, θ) is affine in θ, i.e., there exist a : Z → R^p and b : Z → R (where Θ ⊆ R^p) such that f(z, θ) = θᵀa(z) + b(z).
(ii) The optimal solution z(θ) ∈ argmax_{z ∈ Z} f(z, θ) is uniquely determined for almost all θ.
(iii) One of the following holds: (iii.a) Z is a finite set, or (iii.b) Z is a compact subset of R^d and z ↦ (a(z), b(z)) is a continuous injective function.
(iv) θ̂ is an unbiased estimator of θ*, i.e., E_x[θ̂] = θ*.

Assumptions (i)-(iii) are conditions on the mathematical programming problem, and typical problems such as (mixed-integer) linear/quadratic/semidefinite programming satisfy them. Assumption (iv) is a condition on the machine learning algorithm used to estimate the objective function, and we can employ any standard unbiased estimation algorithm. Note that the examples in Section 2 satisfy all these assumptions. We assume (i) and (iv) in Section 4.1 and (i)-(iv) in Section 4.2.

### 4.1. Cross-Validation Method

As noted in Section 3.2, our problem is closely related to the problem of estimating generalization error. Inspired by cross-validation, one of the most popular methods for estimating generalization error in machine learning, we propose a cross-validation method for estimating the value of the true objective function in predictive optimization. In the context of algorithmic investment, a similar method, referred to as the hold-out method, is mentioned in (Bailey et al., 2014). The method discussed below is essentially an extension of the hold-out method to general predictive optimization problems.

One of the reasons that the value f(ẑ, θ̂) contains bias is that ẑ and θ̂ are dependent random variables. Indeed, if ẑ and θ̂ were independent, E_x[f(ẑ, θ̂)] = E_x[f(ẑ, θ*)] would hold straightforwardly from assumptions (i) and (iv). The main idea of the cross-validation method (as with standard cross-validation in machine learning) is to divide the data x ∈ X^N into two parts x_1 ∈ X^{N_1} and x_2 ∈ X^{N_2}, where N_1 + N_2 = N. Note that each element of x_1 and x_2 follows p(x|θ*) independently, and hence x_1 and x_2 are independent random variables. Let us denote the estimators based on x_1 and x_2 by θ̂_1 and θ̂_2, respectively, and the optimal strategy for each estimator by ẑ_i ∈ argmax_{z ∈ Z} f(z, θ̂_i) for i = 1, 2. Then ẑ_1 and θ̂_2 are independent (and vice versa), and we have E_x[f(ẑ_1, θ̂_2)] = E_{x_1}[f(ẑ_1, E_{x_2}[θ̂_2])] = E_{x_1}[f(ẑ_1, θ*)]. Further, if N_1 is sufficiently close to N, E_{x_1}[f(ẑ_1, θ*)] is close to E_x[f(ẑ, θ*)].

This idea can be extended to K-fold cross-validation, in which we divide the data x ∈ X^N into K parts x_1, ..., x_K ∈ X^{N′}, where KN′ = N. We compute z̄_k from x_{−k} := {x_1, ..., x_K} \ {x_k} and compute θ̂_k from x_k. Then the value ρ_CV(x) := (1/K) Σ_{k=1}^K f(z̄_k, θ̂_k) satisfies

$$ \mathbf{E}_x[\rho_{CV}(x)] = \mathbf{E}_x[f(\bar z, \theta^*)], \tag{6} $$

where z̄ stands for the strategy computed from (K − 1)N′ samples, under assumptions (i) and (iv).

**Algorithm 1** (K-fold cross-validation).
Input: data x ∈ X^N, number of partitions K ≥ 2.
1. Divide the data x into K parts x_1, ..., x_K.
2. For k = 1 to K: compute θ̂_k from x_k and θ̄_k from x_{−k}, where x_{−k} consists of all samples in x except x_k, and compute z̄_k ∈ argmax_{z ∈ Z} f(z, θ̄_k).
3. Output ρ_CV(x) := (1/K) Σ_{k=1}^K f(z̄_k, θ̂_k).

A major drawback of Algorithm 1 is that it can only estimate the objective value attained with N − N′ samples, as shown in (6), even though the value attained with all N samples is desired. In machine learning, to mitigate this gap, a leave-one-out method (i.e., setting N′ = 1) can be used. In predictive optimization, however, the number N′ of held-out samples needs to be large enough to compute the estimator θ̂_k, which limits the accuracy of the estimation of f(ẑ, θ*). The accuracy of Algorithm 1 is evaluated empirically in Section 5.
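A direct transcription of Algorithm 1 into Python might look as follows (a sketch of ours, with hypothetical callables estimate_theta, solve, and f supplied by the application, e.g., those of the portfolio sketch in Section 2):

```python
import numpy as np

def cv_estimate(x, estimate_theta, solve, f, K=5, seed=0):
    """Algorithm 1: K-fold cross-validation estimate of E[f(z_hat, theta*)].

    x: data array of shape (N, ...); estimate_theta: data -> theta;
    solve: theta -> argmax_z f(z, theta); f: (z, theta) -> objective value.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), K)
    vals = []
    for k in range(K):
        rest = np.concatenate([folds[j] for j in range(K) if j != k])
        theta_hat_k = estimate_theta(x[folds[k]])   # from the held-out part x_k
        theta_bar_k = estimate_theta(x[rest])       # from x_{-k}
        z_bar_k = solve(theta_bar_k)                # optimize on x_{-k}'s estimate
        vals.append(f(z_bar_k, theta_hat_k))        # evaluate on x_k's estimate
    return float(np.mean(vals))
```

For instance, `cv_estimate(x, estimate_theta, solve_portfolio, f_portfolio, K=5)` applies it to the portfolio example.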
### 4.2. Parameter Perturbation Method

This subsection proposes another algorithm, which addresses the drawbacks of Algorithm 1. Denote the error in the estimated parameter by δ := θ̂ − θ*. The error δ depends on the training data x and can be regarded as a random variable when x is considered a random variable. For γ ≥ 0, let us first define η(γ) by

$$ \eta(\gamma) = \mathbf{E}_\delta[f(z(\theta^* + \gamma\delta), \theta^*)], $$

where z(θ) := argmax_{z ∈ Z} f(z, θ). Since ẑ = z(θ̂) = z(θ* + δ), we have η(1) = E[f(ẑ, θ*)]. Hence our goal, unbiased estimation of f(ẑ, θ*), is equivalent to unbiased estimation of η(1). Let us next define φ(γ) by

$$ \varphi(\gamma) = \mathbf{E}_\delta[f(z(\theta^* + \gamma\delta), \theta^* + \gamma\delta)]. \tag{7} $$

Note that we have φ(1) = E[f(ẑ, θ̂)]. Further, φ(γ) and η(γ) satisfy φ(0) = η(0) = f(z*, θ*) and φ(γ) ≥ f(z*, θ*) ≥ η(γ) for all γ ≥ 0, which can be proved in a way similar to the proof of Proposition 1. The following proposition plays a key role in our second algorithm.

**Proposition 3.** Suppose that assumptions (i)-(iv) hold. For all γ > 0, φ(γ) is differentiable, and its derivative φ′(γ) satisfies

$$ \eta(\gamma) = \varphi(\gamma) - \gamma\varphi'(\gamma). \tag{8} $$

The proof of this proposition is given in the supplementary material. Let us explain this proposition using Figure 1, which is based on the simulation experiment for portfolio optimization used in Section 5 and shows how the values of φ and η behave for γ ≥ 0. The tangent to φ(γ) at γ = γ_0 (the blue broken line) has a y-intercept (the red broken line) equal to the value of η(γ_0), for all γ_0 > 0. From this relationship, the derivative φ′(1) of φ(γ) at γ = 1 satisfies φ′(1) = φ(1) − η(1) = E[f(ẑ, θ̂)] − E[f(ẑ, θ*)], i.e., the value of φ′(1) is equal to the bias in our predictive optimization problem. Our problem is now to obtain an unbiased estimator ζ of φ′(1), which will give us an unbiased estimator of f(ẑ, θ*), namely ρ = f(ẑ, θ̂) − ζ.

[Figure 1: Comparison among φ(γ), η(γ), and f(z*, θ*). The blue broken line is the tangent to φ(γ) at γ = 1, and the red broken line represents its y-intercept.]

From the definition of the derivative, the value of φ′(1) can be approximated by (φ(1 + h) − φ(1))/h for small h. Further, from the definition of φ, the estimated optimal value f(ẑ, θ̂) is an unbiased estimator of φ(1). The value φ(1 + h) = E[max_{z ∈ Z} f(z, θ* + (1 + h)δ)] is the expectation of the optimal value for an objective function whose parameter has an enhanced error. If we can obtain samples θ̂_h following the distribution of θ* + (1 + h)δ, we can construct an estimator of φ(1 + h) and, accordingly, estimate η(1) = E[f(ẑ, θ*)]. Suppose that θ̂_h^(1), ..., θ̂_h^(s) follow the distribution of θ* + (1 + h)δ, and define

$$ \rho_h := \frac{1+h}{h} \max_{z \in Z} f(z, \hat\theta) - \frac{1}{hs} \sum_{j=1}^{s} \max_{z \in Z} f(z, \hat\theta_h^{(j)}). \tag{9} $$

The value ρ_h then has the following property.

**Proposition 4.** Under assumptions (i)-(iv), the value ρ_h defined by (9) is an asymptotically unbiased estimator of f(ẑ, θ*), i.e., it holds that lim_{h→0} E[ρ_h] = E[f(ẑ, θ*)].

*Proof.* From the definitions of ρ_h and φ(γ), we have E[ρ_h] = φ(1) − (φ(1 + h) − φ(1))/h. Hence lim_{h→0} E[ρ_h] = φ(1) − φ′(1). From Proposition 3, this value is equal to η(1) = E[f(ẑ, θ*)]. ∎

The remaining problem is how to obtain samples θ̂_h, with enhanced errors, from the distribution of θ* + (1 + h)δ. If θ̂ is an asymptotically normal estimator of θ*, its distribution can be approximated by the normal distribution N(θ*, (1/N)Σ*), where Σ* is a constant matrix not dependent on N. Accordingly, when we compute an estimator θ̂_h from N/(1 + h)² data points, the distribution of θ̂_h can be approximated by N(θ*, ((1 + h)²/N)Σ*), which is an approximation of the distribution of θ* + (1 + h)δ. This procedure for generating θ̂_h is used in variant (i) of Algorithm 2 below.

If θ̂ is an M-estimator, a class of asymptotically normal estimators commonly used in machine learning, we can eliminate the repetitive computation of variant (i). For M-estimators, an estimate Σ̂ is given in closed form, as described in (van der Vaart, 1998), such that N(0, (1/N)Σ̂) approximates the error distribution of the estimator. Once we have computed Σ̂, we generate samples from an approximated distribution of θ* + (1 + h)δ by adding to θ̂ a perturbation δ̂ drawn as

$$ \hat\delta \sim N\!\left(0,\ \left((1+h)^2 - 1\right)\tfrac{1}{N}\hat\Sigma\right). \tag{10} $$

We can confirm that the distribution of θ̂ + δ̂ approximates that of θ* + (1 + h)δ by applying the normal approximation to θ̂ − θ* = δ: from δ ∼ N(0, (1/N)Σ̂), we obtain θ* + (1 + h)δ ∼ N(θ*, ((1 + h)²/N)Σ̂) and θ̂ + δ̂ ∼ N(θ* + 0, (1/N)Σ̂ + ((1 + h)² − 1)(1/N)Σ̂) = N(θ*, ((1 + h)²/N)Σ̂). This procedure corresponds to variant (ii) of Algorithm 2.

**Algorithm 2** (Parameter perturbation method).
Input: data x ∈ X^N, parameters h > 0, s ∈ {1, 2, ...}.
1. Compute θ̂ from x and set v̂_0 = max_{z ∈ Z} f(z, θ̂).
2. Generate {θ̂_h^(j)}_{j=1}^s by (i) for asymptotically normal estimators or (ii) for M-estimators:
   (i) set θ̂_h^(j) to be the estimator computed from N/(1 + h)² samples randomly chosen from x without replacement;
   (ii) generate δ̂_j by (10) and set θ̂_h^(j) = θ̂ + δ̂_j.
3. For j = 1 to s: set v̂_j = max_{z ∈ Z} f(z, θ̂_h^(j)).
4. Output ρ_h := ((1 + h)/h) v̂_0 − (1/(hs)) Σ_{j=1}^s v̂_j.
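The following sketch (ours; names are illustrative) implements variant (ii) of Algorithm 2 for the case in which θ̂ can be flattened into a vector and a plug-in covariance Σ̂ is available, an assumption of this illustration:

```python
import numpy as np

def perturbation_estimate(x, estimate_theta, estimate_cov, solve, f,
                          h=0.1, s=100, seed=0):
    """Algorithm 2, variant (ii): parameter perturbation estimate of E[f(z_hat, theta*)].

    estimate_cov must return Sigma_hat such that N(0, Sigma_hat / N) approximates
    the error distribution of the vector-valued M-estimator theta_hat.
    """
    rng = np.random.default_rng(seed)
    N = len(x)
    theta_hat = estimate_theta(x)                     # vector-valued parameter
    sigma_hat = estimate_cov(x)
    v0 = f(solve(theta_hat), theta_hat)               # v_0 = max_z f(z, theta_hat)
    cov = ((1.0 + h) ** 2 - 1.0) * sigma_hat / N      # covariance of delta_hat, eq. (10)
    vs = []
    for _ in range(s):
        delta = rng.multivariate_normal(np.zeros(len(theta_hat)), cov)
        theta_j = theta_hat + delta                   # sample with enhanced error
        vs.append(f(solve(theta_j), theta_j))         # v_j = max_z f(z, theta_h^(j))
    return (1.0 + h) / h * v0 - np.mean(vs) / h       # rho_h, eq. (9)
```

Here h trades off bias against variance, as studied empirically in Section 5.1.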
## 5. Experiments

We compared Algorithm 1 and Algorithm 2 with the hold-out method (Bailey & Marcos, 2016; Bailey et al., 2014) and the portfolio resampling method (Scherer, 2002) by means of simulation models of the examples in Section 2. We used GUROBI Optimizer 6.0.4 (http://www.gurobi.com/) for portfolio optimization and the algorithm in (Ito & Fujimaki, 2016) for price optimization.

### 5.1. Predictive Portfolio Optimization

The portfolio optimization problem described in Example 2 of Section 2 was constructed with θ* = (μ*, Σ*) defined by μ* = 1 + ε and Σ* = XᵀX, where 1 denotes the all-ones vector, ε ∈ R^D was generated from N(0, I), and each entry of X ∈ R^{D×D} was drawn from N(0, D⁻¹). We generated datasets {x_n}_{n=1}^N following N(μ*, Σ*), from which we computed θ̂ as in Example 2, and solved the optimization problem (1) with θ* replaced by θ̂ to obtain ẑ. We chose D = 50, N = 20, and λ = 1.0 for our simulation experiments. When using the portfolio resampling method, we computed z̄ by means of 10 bootstrap resamplings and output f(z̄, θ̂) in place of f(ẑ, θ̂); for details regarding portfolio resampling, see, e.g., (Scherer, 2002). For hold-out validation, we first divided the N data points into N′ and N − N′, computed ẑ_1 from the former N′ data points, estimated θ̂_2 from the latter N − N′ data points, and then calculated f(ẑ_1, θ̂_2).
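For reference, the two baselines, as we read them from (Scherer, 2002) and (Bailey et al., 2014), can be sketched as follows (illustrative only; callables as in the earlier sketches):

```python
import numpy as np

def portfolio_resampling(x, estimate_theta, solve, f, B=10, seed=0):
    """Portfolio-resampling baseline: average B bootstrap solutions into z_bar
    and evaluate z_bar on the full-data estimate theta_hat."""
    rng = np.random.default_rng(seed)
    theta_hat = estimate_theta(x)
    zs = [solve(estimate_theta(x[rng.integers(0, len(x), size=len(x))]))
          for _ in range(B)]                         # solutions on bootstrap resamples
    z_bar = np.mean(zs, axis=0)
    return f(z_bar, theta_hat)

def hold_out(x, n1, estimate_theta, solve, f):
    """Hold-out baseline: optimize on the first n1 samples, evaluate on the rest."""
    z1 = solve(estimate_theta(x[:n1]))               # z_hat_1 from N' samples
    theta2 = estimate_theta(x[n1:])                  # theta_hat_2 from N - N' samples
    return f(z1, theta2)                             # f(z_hat_1, theta_hat_2)
```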
**Accuracy comparisons.** Figure 2 shows the means and standard deviations of the computed values of f(z*, θ*), f(ẑ, θ̂), and f(ẑ, θ*) for 400 randomly initialized datasets, together with the estimates of f(ẑ, θ*) produced by Algorithms 1 and 2 and the baselines.

[Figure 2: Values of the objective function and estimated values of f(ẑ, θ*) with Algorithm 1 (CV, K = 5 and K = 10), Algorithm 2, portfolio resampling (PR), and hold-out validation (HO, N′ = 10 and N′ = 18). Blue bars and error bars represent means and standard deviations, respectively.]

We observed the following:

- f(ẑ, θ̂) was much larger than f(ẑ, θ*), which is consistent with Proposition 1.
- The hold-out method performed much worse than our CV and perturbation methods, though its performance improved with increasing N′; the variance of the proposed methods was also much smaller. Note that we could not set N′ larger than 18, since the estimation of θ̂_1 and θ̂_2 would otherwise fail.
- The portfolio resampling method computed a slightly less optimistic value than f(ẑ, θ̂), but a large amount of optimistic bias remained.
- The perturbation method corrected the bias better than the CV method w.r.t. both bias and variance; indeed, it corrected the optimistic bias almost perfectly in expectation. Note that K = 10 was the largest possible value for the CV method, because at least two samples are necessary for estimating the covariance matrix, so CV with K = 10 achieved the smallest bias attainable by the CV method.
- The CV method and the hold-out method produced conservative estimates. The pessimistic bias of the CV method comes from the difference between ẑ ∈ argmax_{z ∈ Z} f(z, θ̂) and z̄ in (6).

Note that E[f(ẑ, θ*)] was poorer than E[f(z*, θ*)], where the former is the best objective value achievable with the available finite training samples. This negative difference is unavoidable with our bias correction, which raises an interesting open challenge: combining our bias correction with robust optimization (Bertsimas et al., 2011), where the former mitigates the optimistic bias and the latter mitigates uncertainty in objective functions.

**Sensitivity of the perturbation method.** We investigated the sensitivity of the perturbation method w.r.t. h > 0, the parameter controlling the trade-off between bias and variance. We applied the method to 100 different randomly initialized datasets, with h = 0.05, 0.10, ..., 0.50. Because the method is not sensitive to s, we fixed s = 10. Figure 3 shows the changes in bias and variance (top) and in the RMSE against f(ẑ, θ*) (bottom) over h.

[Figure 3: Bias and variance (top) and RMSE (bottom) over h obtained with the perturbation method. The error bars in the top panel represent standard deviations.]
As the value of h increased, the bias increased while the variance decreased (top panel), as implied by Proposition 4, and this resulted in significantly larger RMSE values for smaller values of h. This observation indicates that an appropriate balance between bias and variance must be found, and that a variance-sensitive measure such as RMSE can be used as a guide for determining the trade-off.

### 5.2. Predictive Price Optimization

We applied our algorithms to the predictive price optimization discussed in Example 3 of Section 2. As reported in (Ito & Fujimaki, 2017), the optimal value in this problem contains optimistic bias, which is consistent with Proposition 1. Unlike in portfolio optimization, the parameter θ̂ is estimated by regression techniques, and the set Z of feasible strategies is discrete.

**Simulation experiment.** In this experiment, we investigated the effect of the optimistic bias and of our bias correction as a function of parameter dimensionality, i.e., the number of products d. We generated the same simulation data as in (Ito & Fujimaki, 2017). The sales quantity q_i of the i-th product was generated from the regression model q_i = α_i + Σ_{j=1}^d β_ij p_j, where α_i and β_ij were generated from uniform distributions with α_i ∈ [d, 3d], β_ij ∈ [0, 2] for i ≠ j, and β_ii ∈ [−2d, −d]. The feasible region was Z = {0.6, 0.7, ..., 1.0}^d. We chose N = 500 for our experiments.

Figure 4 shows the objective values normalized by the ideal objective value f(z*, θ*) as the number of products d increases. For Algorithm 1 (the CV method), we chose K = 2 so that the held-out samples would suffice to estimate the parameters {α_i} and {β_ij}.

[Figure 4: Bias and variance over parameter dimensionality (number of products, 10 to 60), for Algorithms 1 and 2. The vertical axis represents objective values normalized by the ideal objective value f(z*, θ*).]

We observed the following:

- f(ẑ, θ*) degraded relative to f(z*, θ*) with increasing d, because the estimation error in machine learning increased.
- The optimistic bias, f(ẑ, θ̂) − f(ẑ, θ*), rapidly increased, because f(ẑ, θ̂) − f(z*, θ*) also increased in addition to the increase in f(z*, θ*) − f(ẑ, θ*).
- The CV method suffered from pessimistic bias, which increased as d increased.
- The perturbation method corrected the bias accurately even as the parameter dimensionality d increased.

These results confirm the robustness of our proposed method w.r.t. parameter dimensionality and also its general applicability to a wide range of problems: the portfolio optimization in Section 5.1 is continuous and convex, while the price optimization in this section is discrete and non-convex.

**Real-world retail dataset.** The real-world retail dataset used in (Ito & Fujimaki, 2017; 2016) contains sales information for a middle-size supermarket located in Tokyo (the data were provided by KSP-SP Co., LTD, http://www.ksp-sp.com). From this information, we selected 50 regularly-sold beer products. The data range was approximately the three years from 2012/01 to 2014/11. We used the first 35 months (1063 samples) for training regression models and simulated the best price strategy for the next day, 2014/12/1. We estimated the parameters of the regression models using the least squares method. The other settings were the same as in (Ito & Fujimaki, 2016).
The actual (non-optimized) gross profit in the past data was 106,348 JPY, while the estimated optimal value f(ẑ, θ̂) was 490,502 JPY, which represents an approximately 361% increase in gross profit. This value is obviously unrealistically large and unreliable: price changes alone could not increase profit by 4.6 times! The bias-corrected optimal gross profit obtained with the perturbation method at h = 0.1 and s = 100 was 124,477 JPY, which represents an approximately 17% increase in gross profit. Although we were unable to confirm the validity of this value, since this experiment was conducted on past historical data, a 17% increase in gross profit intuitively seems much more realistic than one of 361%, and considering the observations from the simulation studies, our result would surely seem more convincing to domain users.

One important remaining issue in real applications is the estimation of a confidence region. As noted above, we can never learn the value of f(ẑ, θ*) without performing ẑ, but the user has to decide whether or not to perform ẑ without knowing that value. In such a case, it would be helpful to provide a confidence region w.r.t. the bias-corrected optimal value, which is available with neither the CV method nor the perturbation method.

## 6. Conclusion

In this paper, we have focused on a framework combining mathematical optimization and machine learning, in which we solve an optimization problem whose objective is formulated with the aid of predictive models or estimators. We have demonstrated that this framework suffers from a kind of bias w.r.t. optimal values because the solution overfits the constructed objective function. We have addressed this bias problem by developing methods that are guaranteed to compute an asymptotically unbiased estimator of the value of the true objective function. Empirical results have demonstrated that the proposed approach successfully estimates the value of the true objective function.

A major open question remaining in this work is how to evaluate and reduce the variance of the estimators of objective values. The variance of an estimator, i.e., the uncertainty in estimation, is essential information for decision makers in many situations, and reducing it would help them make better decisions.

## References

Agarwal, A., Hazan, E., Kale, S., and Schapire, R. E. Algorithms for portfolio management based on the Newton method. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06), pp. 9-16, 2006.

Akaike, H. Information theory and an extension of the maximum likelihood principle. In International Symposium on Information Theory, pp. 267-281, 1973.

Bailey, D. H. and Marcos, L. Stock portfolio design and backtest overfitting. SSRN Working Paper, pp. 1-14, 2016.

Bailey, D. H., Borwein, J. M., López de Prado, M., and Zhu, Q. J. Pseudo-mathematics and financial charlatanism: The effects of backtest overfitting on out-of-sample performance. Notices of the AMS, 61(5):458-471, 2014.

Baños, R., Manzano-Agugliaro, F., Montoya, F., Gil, C., Alcayde, A., and Gómez, J. Optimization methods applied to renewable and sustainable energy: A review. Renewable and Sustainable Energy Reviews, 15(4):1753-1766, 2011.

Bertsimas, D. and Thiele, A. A robust optimization approach to supply chain management. In Integer Programming and Combinatorial Optimization, pp. 145-156, 2004.
Bertsimas, D., Brown, D. B., and Caramanis, C. Theory and applications of robust optimization. SIAM Review, 53(3):464-501, 2011.

Chan, L. K. C., Karceski, J., and Lakonishok, J. On portfolio optimization: Forecasting covariances and choosing the risk model. Review of Financial Studies, 12(5):937-974, 1999.

Chapados, N. Portfolio Choice Problems: An Introductory Survey of Single and Multiperiod Models. Springer Science & Business Media, 2011.

Draper, A. J., Jenkins, M. W., Kirby, K. W., Lund, J. R., and Howitt, R. E. Economic-engineering optimization for California water management. Journal of Water Resources Planning and Management, 129:155-164, 2003.

Harvey, C. R. and Liu, Y. Backtesting. The Journal of Portfolio Management, 42(1):13-28, 2015.

Harvey, C. R., Liu, Y., and Zhu, H. ...and the cross-section of expected returns. Review of Financial Studies, 29(1):5-68, 2016.

Ito, S. and Fujimaki, R. Large-scale price optimization via network flow. In Advances in Neural Information Processing Systems, pp. 3855-3863, 2016.

Ito, S. and Fujimaki, R. Optimization beyond prediction: Prescriptive price optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1833-1841, 2017.

Johnson, K., Hong, B., Lee, A., and Simchi-Levi, D. Analytics for an online retailer: Demand forecasting and price optimization. Manufacturing & Service Operations Management, 18(1):69-88, 2016.

Jung, J. Y., Blau, G., Pekny, J. F., Reklaitis, G. V., and Eversdyk, D. A simulation based optimization approach to supply chain management under demand uncertainty. Computers and Chemical Engineering, 28(10):2087-2106, 2004.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

Konno, H. and Yamazaki, H. Mean-absolute deviation portfolio optimization model and its applications to Tokyo stock market. Management Science, 37(5):519-531, 1991.

Lange, S., Gabel, T., and Riedmiller, M. Batch reinforcement learning. In Reinforcement Learning, pp. 45-73. Springer, 2012.

Li, B. and Hoi, S. C. H. On-line portfolio selection with moving average reversion. In Proceedings of the 29th International Conference on Machine Learning (ICML '12), pp. 273-280, 2012.

Mak, W. K., Morton, D. P., and Wood, R. K. Monte Carlo bounding techniques for determining solution quality in stochastic programs. Operations Research Letters, 24(1):47-56, 1999.

Markowitz, H. Portfolio selection. The Journal of Finance, 7(1):77-91, 1952.

Michaud, R. Efficient Asset Management: A Practical Guide to Stock Portfolio Management and Asset Allocation. Financial Management Association, Survey and Synthesis Series. HBS Press, Boston, MA, 1998.

Michaud, R. O. The Markowitz optimization enigma: Is optimized optimal? ICFA Continuing Education Series, 1989(4):43-54, 1989.

Qiu, H., Han, F., Liu, H., and Caffo, B. Robust portfolio optimization. In Advances in Neural Information Processing Systems, pp. 46-54, 2015.

Scherer, B. Portfolio resampling: Review and critique. Financial Analysts Journal, pp. 98-109, 2002.

Smith, J. E. and Winkler, R. L. The optimizer's curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52(3):311-322, 2006.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction (draft of the 2nd edition), 2013.

Thomas, D. J. and Griffin, P. M. Coordinated supply chain management. European Journal of Operational Research, 94(1):1-15, 1996.
van der Vaart, A. W. Asymptotic Statistics. Cambridge University Press, 1998.

Vapnik, V. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.

Yabe, A., Ito, S., and Fujimaki, R. Robust quadratic programming for price optimization. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2017.